# Demo 06

In [None]:
import numpy as np
import pandas as pd
import nltk  # used below for tokenizing, stop words, stemming, and the review corpus

import matplotlib.pyplot as plt

## Word Frequencies

### Words in the wild

In [None]:
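# keep_default_na=False keeps words such as "null" or "nan" as strings instead of parsing them as missing values
word_counts_df = pd.read_csv("data/norvig_count1w.txt", sep="\t", header=None, keep_default_na=False)
word_counts_df = word_counts_df.rename(columns={0: "type", 1: "count"})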
word_counts_df

In [None]:
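ax = word_counts_df.head(100).plot(kind="bar", x='type', rot=45)
ax.set_title("Zipf’s law from Google Web Trillion Word Corpus")

# Label every 5th bar, passing the labels explicitly so each tick shows the word at that rank
ticks = np.arange(0, 100, step=5)
plt.xticks(ticks, word_counts_df["type"].iloc[ticks].tolist(), rotation=45)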

**Question:** What do we notice about these words?

In [None]:
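ax = word_counts_df.head(500).plot(kind="bar", x='type', rot=45)
ax.set_title("Zipf’s law from Google Web Trillion Word Corpus")

# Label every 20th bar, passing the labels explicitly so each tick shows the word at that rank
ticks = np.arange(0, 500, step=20)
plt.xticks(ticks, word_counts_df["type"].iloc[ticks].tolist(), rotation=45)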
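
As a rough sketch of what the plot titles refer to: under an idealised Zipfian distribution, the word at rank r occurs about 1/r times as often as the top word, so rank times count stays constant. The counts below are synthetic (the name `top_count` and its value are made up for illustration), not taken from the corpus.

```python
# Toy illustration of Zipf's law with idealised, synthetic counts (not corpus data).
top_count = 1_000_000  # hypothetical count of the most frequent word

# Under an ideal Zipfian distribution, the word at rank r occurs top_count / r times.
ideal_counts = [top_count / rank for rank in range(1, 11)]

# So rank * count is (up to floating-point error) the same for every rank.
rank_times_count = [rank * count for rank, count in enumerate(ideal_counts, start=1)]

for rank, product in enumerate(rank_times_count, start=1):
    print(rank, round(product))
```

Real counts only follow this approximately. Plotting rank against count on log-log axes should give a roughly straight line if the corpus is close to Zipfian.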

(back to slides)

## Bag of Words

Let's get the BoW for this sentence

In [None]:
nltk.FreqDist(nltk.word_tokenize("Penny bought bright blue fishes on a bright blue sunny day"))

Let's now get BoWs for each of these sentences

In [None]:
texts = [
    "Penny bought bright blue fishes.",
    "Penny bought bright blue and orange fish.",
    "The cat ate a fish at the store.",
    "Penny went to the store. Penny ate a bug. Penny saw a fish.",
    "It meowed once at the bug, it is still meowing at the bug and the fish",
    "The cat is at the fish store. The cat is orange. The cat is meowing at the fish.",
    "Penny is a fish"
]

In [None]:
[nltk.FreqDist(nltk.word_tokenize(sent)) for sent in texts]

What if we want to compare them?

(back to slides)

## Document Vector

In [None]:
doc_vector = pd.Series(nltk.word_tokenize("Penny bought bright blue fishes on a bright blue sunny day."))
doc_vector.value_counts()

In [None]:
[pd.Series(nltk.word_tokenize(sent)).value_counts() for sent in texts]

## Document-Term Matrix

In [None]:
pd.DataFrame([pd.Series(nltk.word_tokenize(sent)).value_counts() for sent in texts]).T

In [None]:
pd.DataFrame([pd.Series(nltk.word_tokenize(sent)).value_counts() for sent in texts]).fillna(0)

(back to slides)

Sometimes we will transpose this matrix

In [None]:
pd.DataFrame([pd.Series(nltk.word_tokenize(sent)).value_counts() for sent in texts]).fillna(0).T



### Sklearn

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
count_vectorizer = CountVectorizer()

In [None]:
matrix = count_vectorizer.fit_transform(texts)
matrix

In [None]:
matrix.toarray()

In [None]:
pd.DataFrame(matrix.toarray())

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
count_vectorizer.get_feature_names_out()

In [None]:
pd.DataFrame(matrix.toarray(), columns=count_vectorizer.get_feature_names_out())

I sometimes like to transpose this matrix

In [None]:
pd.DataFrame(matrix.toarray().T, index=count_vectorizer.get_feature_names_out())

Let's put this into a function

In [None]:
def make_matrix(corpus):
    """Return a term-document count matrix: one row per term, one column per document."""
    count_vectorizer = CountVectorizer()
    matrix = count_vectorizer.fit_transform(corpus)
    return pd.DataFrame(matrix.toarray().T, index=count_vectorizer.get_feature_names_out())

make_matrix(texts)

### Product Reviews

In [None]:
from nltk.corpus import product_reviews_1

In [None]:
product_reviews_1.fileids()

In [None]:
canon_g3_reviews = [review.sents() for review in product_reviews_1.reviews('Canon_G3.txt')]
canon_g3_reviews[0][0:2]

In [None]:
make_matrix(canon_g3_reviews).shape

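The call above fails because each review is a list of tokenized sentences rather than a single string. So we need to extract the text from the list of reviews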

In [None]:
# Join the tokens of each sentence, then the sentences of each review, into one string per review
review_corpus = [" ".join([" ".join(sent) for sent in review]) for review in canon_g3_reviews]
f"There are {len(review_corpus)} reviews."

#### Exploring corpus

It is usually a good idea to get a sense of a corpus before modeling it, so let's explore these reviews first

**Question:** How big is the vocabulary of these reviews?

In [None]:
nltk.FreqDist(" ".join(review_corpus).split())

In [None]:
len(nltk.FreqDist(" ".join(review_corpus).split()))

**Question:** How many tokens are in the corpus?

In [None]:
sum(nltk.FreqDist(" ".join(review_corpus).split()).values())

**Question:** What is the distribution of word types in the reviews?

In [None]:
nltk.FreqDist(" ".join(review_corpus).split()).plot()

**Question:** How long are these reviews?

In [None]:
pd.DataFrame(review_corpus).rename(columns={0:"text"})['text'].map(lambda x: len(x.split())).describe()

#### Document Matrix of reviews

In [None]:
make_matrix(review_corpus)

**Question:** What does each row represent?

**Question:** What do we notice about these rows?

**Question:** What do we think about the last few rows?

#### Pre-processing document matrix

##### Remove stop words

In [None]:
# Build the stop word set once; the list is all lowercase, so lowercase each token before the lookup
stop_words = set(nltk.corpus.stopwords.words('english'))

cleaned_corpus = []
for review in review_corpus:
    clean_review = []
    review_tokens = nltk.word_tokenize(review)
    for word in review_tokens:
        if word.lower() not in stop_words:
            clean_review.append(word)
    cleaned_corpus.append(" ".join(clean_review))
make_matrix(cleaned_corpus)

##### Stem tokens


In [None]:
stemmer = nltk.stem.SnowballStemmer(language='english')
stemmer.stem('zooming')

In [None]:
# Build the stop word set once; the list is all lowercase, so lowercase each token before the lookup
stop_words = set(nltk.corpus.stopwords.words('english'))

cleaned_corpus = []
for review in review_corpus:
    clean_review = []
    review_tokens = nltk.word_tokenize(review)
    for word in review_tokens:
        if word.lower() not in stop_words:
            clean_review.append(stemmer.stem(word))
    cleaned_corpus.append(" ".join(clean_review))
make_matrix(cleaned_corpus)

##### Remove punctuation and numbers

In [None]:

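# Build the stop word set once; the list is all lowercase, so lowercase each token before the lookup
stop_words = set(nltk.corpus.stopwords.words('english'))

cleaned_corpus = []
for review in review_corpus:
    clean_review = []
    review_tokens = nltk.word_tokenize(review)
    for word in review_tokens:
        # isalpha() drops punctuation tokens and anything containing digits
        if word.lower() not in stop_words and word.isalpha():
            clean_review.append(stemmer.stem(word))
    cleaned_corpus.append(" ".join(clean_review))

curr_matrix_df = make_matrix(cleaned_corpus)
curr_matrix_df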

**Question:** What do we think the most common value is?

Consequently, these vectors can be called **sparse vectors**
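
To put a number on "sparse", one option is to measure the fraction of cells that are zero. The sketch below uses a small made-up term-document matrix (`toy_counts` is hypothetical, not the review matrix).

```python
# Toy term-document counts (rows are terms, columns are documents); the values are made up.
toy_counts = [
    [2, 0, 0],
    [0, 1, 0],
    [0, 0, 3],
    [1, 0, 0],
]

total_cells = sum(len(row) for row in toy_counts)
zero_cells = sum(value == 0 for row in toy_counts for value in row)

# Sparsity is the share of cells that hold a zero.
sparsity = zero_cells / total_cells
print(f"{zero_cells} of {total_cells} cells are zero ({sparsity:.0%})")
```

For the review matrix, something like `(curr_matrix_df == 0).to_numpy().mean()` should give the same kind of fraction.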

Let's look at the first document

**Question:** How can we get the first document from the matrix?

In [None]:
first_doc = curr_matrix_df[0]
first_doc[first_doc != 0]

It looks like a lot of words appear just once

In [None]:
ax = first_doc[first_doc != 0].value_counts().plot(kind='bar', rot=0)
ax.set_title("Number of words by how many times they appear in the first review")

So let's look at just words that appear more than once

In [None]:
first_doc[first_doc > 1]

**VALIDATE VALIDATE VALIDATE**

**Question:** How well does this vector capture the review?

(Run the next cell to see the original review and then compare)

In [None]:
review_corpus[0]

#### Most common word in each review

In [None]:
curr_matrix_df.apply(lambda x: (x.idxmax(), x.max()), axis=0).rename({0:"word", 1:"count"})

We might think that document 0 focuses more on "pictures" than document 6

**Question:** Is `pictur` actually much more prominent in review 0 than review 6? 

In [None]:
review_length_df = pd.DataFrame([(idx, len(review.split())) for idx, review in enumerate(cleaned_corpus)])
review_length_df = review_length_df.rename(columns={0: 'review_id', 1: 'review_length'})
review_length_df.head()

In [None]:
review_length_df[(review_length_df['review_id'] == 0)
                |
                (review_length_df['review_id'] == 6)]

### Convert counts to frequencies

In [None]:
curr_matrix_df[0] / sum(curr_matrix_df[0])

In [None]:
freq_df = curr_matrix_df.apply(lambda x: x/x.sum(), axis=0)
freq_df

The sum of each column should be one

In [None]:
freq_df.apply(lambda x: sum(x), axis=0)

**Question:** What is the most frequent word in each review?

In [None]:
freq_df.apply(lambda x: (x.idxmax(), x.max()), axis=0).rename({0:"word", 1:"freq"})

**Question:** Do we now think document 0 and document 6 equally discuss "pictures"?

**Question:** Are these words actually interesting or unique to specific documents?

(back to slides)

## Inverse Document Frequency

Let's compute it manually

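The idf of a word w in a collection of documents D is log(N / df(w)), where N is the number of documents in D and df(w) is the number of documents that contain w. A word that appears in every document gets an idf of log(1) = 0.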
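
A minimal sketch of that manual computation, assuming the natural log and a tiny made-up corpus that is already tokenized (`toy_docs`, `doc_freq`, and `idf` are illustrative names, not part of the demo data):

```python
import math

# Hypothetical toy corpus, already tokenized; not the review data.
toy_docs = [
    ["penny", "bought", "fish"],
    ["cat", "ate", "fish"],
    ["penny", "saw", "bug"],
]

n_docs = len(toy_docs)

# Document frequency: the number of documents that contain each word at least once.
doc_freq = {}
for doc in toy_docs:
    for word in set(doc):
        doc_freq[word] = doc_freq.get(word, 0) + 1

# idf(w) = log(N / df(w)), using the natural log.
idf = {word: math.log(n_docs / df) for word, df in doc_freq.items()}

for word in sorted(idf, key=idf.get):
    print(f"{word:>6}  df={doc_freq[word]}  idf={idf[word]:.3f}")
```

Words that occur in more documents, such as "fish" and "penny" here, get a lower idf than words that occur in only one.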

## TF-IDF

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
# Use the TfidfVectorizer here so the matrix holds TF-IDF weights rather than raw counts
matrix = vectorizer.fit_transform(cleaned_corpus)
pd.DataFrame(matrix.toarray().T, index=vectorizer.get_feature_names_out())

### Let's apply this to all the product reviews

In [None]:
product_reviews_1.fileids()

In [None]:
product_reviews_1.reviews('Nokia_6610.txt')