# Tuning Count Vectorization - One Hot Encoding and other Features

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
plots_df = pd.read_csv("movie_plots.csv")

# filter only for American movies
plots_df = plots_df[plots_df["Origin/Ethnicity"] == "American"]

# traditional CountVectorizer
vectorizer = CountVectorizer()

# use English stopwords, and use one-hot encoding
vectorizer = CountVectorizer(stop_words="english", binary=True)

# use English stopwords, and use one-hot encoding, and the word must appear in at least 5% of the movie plots
# (a float min_df is a proportion of documents; an integer min_df is an absolute document count)
vectorizer = CountVectorizer(stop_words="english", binary=True, min_df=0.05)

# use English stopwords, and use one-hot encoding, and the word must appear in at least two of the movie plots
# and keep only the top 200
vectorizer = CountVectorizer(stop_words="english", binary=True, min_df=2, max_features=200) 

X = vectorizer.fit_transform(plots_df["Plot"])

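# get_feature_names() was removed in recent scikit-learn releases; get_feature_names_out() replaces it
vectorized_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
print(f"Shape of dataframe is {vectorized_df.shape}")
print(f"Total number of occurrences: {vectorized_df.sum().sum()}")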
#print(f"Word counts: {vectorized_df.sum()}")
vectorized_df.head()

FileNotFoundError: File b'movie_plots.csv' does not exist
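
Because the cell above depends on a CSV file that may not be available, the effect of `binary` and `min_df` can also be illustrated on a tiny inline corpus. The following is a simplified, standard-library sketch of the idea, not scikit-learn's actual implementation; the function name `count_vectorize` and the three example sentences are hypothetical. It assumes the default behaviour of lowercasing and keeping only tokens of two or more word characters.

```python
import re
from collections import Counter

def count_vectorize(docs, binary=False, min_df=1):
    """Simplified stand-in for CountVectorizer (illustration only)."""
    # lowercase and keep tokens made of 2+ word characters
    tokenized = [re.findall(r"\b\w\w+\b", d.lower()) for d in docs]
    # document frequency: in how many documents does each token appear?
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # an integer min_df is an absolute count, a float is a proportion of documents
    threshold = min_df if isinstance(min_df, int) else min_df * len(docs)
    vocab = sorted(t for t, n in df.items() if n >= threshold)
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        # binary=True records presence (0/1) instead of the raw count
        rows.append([min(counts[t], 1) if binary else counts[t] for t in vocab])
    return vocab, rows

# hypothetical mini corpus
docs = [
    "A hero saves the city",
    "The hero leaves the city",
    "A villain returns",
]

vocab, count_rows = count_vectorize(docs, min_df=2)
_, binary_rows = count_vectorize(docs, binary=True, min_df=2)
print(vocab)
print(count_rows)
print(binary_rows)
```

With `min_df=2`, words that occur in only one sentence are dropped from the vocabulary, and with `binary=True` the repeated "the" in the second sentence is recorded as 1 rather than 2.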

# Cosine Similarity Example

### Intro to Algorithmic Marketing:
![alt text](images/cos-sim-textbook1.png "Logo Title Text 1")


## Finding Magnitude of a Vector

In [None]:
import math
import numpy as np
def magnitude(x): 
    return math.sqrt(sum(i**2 for i in x))

vectorA = [0,3,1,2]

print(f"First approach: {magnitude(vectorA)}")
print(f"Second approach: {np.linalg.norm(vectorA)}")
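
The magnitude is the denominator of cosine similarity, which divides the dot product of two vectors by the product of their magnitudes. A minimal sketch follows; `vectorB` and the function name `cosine_similarity` are hypothetical and chosen only for illustration.

```python
import math

def magnitude(x):
    return math.sqrt(sum(i**2 for i in x))

def cosine_similarity(a, b):
    # dot product divided by the product of the two magnitudes
    dot = sum(ai * bi for ai, bi in zip(a, b))
    return dot / (magnitude(a) * magnitude(b))

vectorA = [0, 3, 1, 2]
vectorB = [1, 0, 2, 2]  # hypothetical second vector, not from the original notebook

similarity = cosine_similarity(vectorA, vectorB)
print(f"Cosine similarity: {similarity:.4f}")
```

A vector compared with itself gives a similarity of 1, and two vectors with no shared non-zero dimensions give 0.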

# Pointwise Mutual Information

It's important to identify a **context window** when analyzing co-occurrence. In the image below, the context window size is 4 (2 tokens on either side of the target word):

![alt text](images/context_window.png "Logo Title Text 1")

For the purposes of the next section, we'll define the **entire document as the context window.**
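
For comparison, a fixed-size window like the one pictured could be counted as in the sketch below. The function name `window_cooccurrences`, the `half_window` parameter, and the example sentence are assumptions made for illustration.

```python
from collections import Counter

def window_cooccurrences(tokens, half_window=2):
    """Count (target, context) pairs within half_window tokens on either side."""
    counts = Counter()
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - half_window):i]
        right = tokens[i + 1:i + 1 + half_window]
        for context in left + right:
            counts[(target, context)] += 1
    return counts

tokens = "the quick brown fox jumps".split()  # hypothetical sentence
cooc = window_cooccurrences(tokens)

# "brown" sees two tokens on each side; "the" only sees the two tokens to its right
print(cooc[("brown", "the")], cooc[("the", "fox")])
```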

Pointwise mutual information (PMI) compares the **joint probability of two events happening together** with the probability that they would co-occur if they were independent. It is defined as the log of the ratio between the two:

$$
PMI_{A,B} = \log_2\frac{p(A,B)}{p(A)p(B)}
$$

Remember that when two events are independent, $p(A,B) = p(A)p(B)$, so the ratio is 1 and the PMI is 0. PMI is often preferable to a raw co-occurrence count because very common words skew raw counts: "the" and "of" will co-occur frequently in the same document simply because each is frequent on its own.

```python
import math
def pmi(tokenA, tokenB, documents, word_counts, bigram_freq):
    # word_counts[tokenA] => number of documents in which tokenA appears
    #                        (the context window is the whole document)
    # bigram_freq["tokenA tokenB"] => number of documents in which tokenA and tokenB appear together
    # len(documents) => total number of documents
    n_docs = float(len(documents))
    prob_A = word_counts[tokenA] / n_docs
    prob_B = word_counts[tokenB] / n_docs
    prob_A_B = bigram_freq[" ".join([tokenA, tokenB])] / n_docs
    # log base 2 of the joint probability over the product of the marginals
    return math.log(prob_A_B / (prob_A * prob_B), 2)
```
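
To see the formula on concrete numbers, here is a self-contained sketch that derives the document-level counts itself instead of taking them as arguments. The helper name `document_pmi` and the four toy documents are hypothetical; returning negative infinity for pairs that never co-occur is one possible convention, not the only one.

```python
import math

def document_pmi(token_a, token_b, documents):
    """PMI with the whole document as the context window."""
    n_docs = len(documents)
    n_a = sum(1 for doc in documents if token_a in doc)
    n_b = sum(1 for doc in documents if token_b in doc)
    n_ab = sum(1 for doc in documents if token_a in doc and token_b in doc)
    if n_ab == 0:
        return float("-inf")  # the pair never appears in the same document
    p_a, p_b, p_ab = n_a / n_docs, n_b / n_docs, n_ab / n_docs
    return math.log2(p_ab / (p_a * p_b))

# hypothetical tokenized documents
toy_docs = [
    ["new", "york", "city"],
    ["new", "york", "times"],
    ["the", "city", "sleeps"],
    ["the", "times", "change"],
]

print(document_pmi("new", "york", toy_docs))
print(document_pmi("the", "city", toy_docs))
```

Here "new" and "york" always appear together, so their PMI is positive, while "the" and "city" co-occur exactly as often as independence would predict, so their PMI is 0.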

# Collocation

In previous homework assignments, we often had to find phrases that belong together by hand, for example `New York City`.

From [nltk.org](http://www.nltk.org/howto/collocations.html), **collocation** can be defined as

> expressions of multiple words which commonly co-occur together. 

In [2]:
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize
from nltk.corpus import stopwords

lemmatizer = WordNetLemmatizer()
# English stopwords plus the punctuation tokens produced by word_tokenize
stopwords = set(stopwords.words('english') + [".", ",", ":", "''", "'s", "'", "``", "(", ")", "-"])

In [3]:
documents = []
articles = [f"bbcsport/football/00{i}.txt" for i in range(1,10)]

for path in articles:
    with open(path) as article:  # open each sports article (closed automatically afterwards)
        for line in article:
            line = line.replace("\n", "")  # remove the newline character
            if len(line) > 0:  # if the line is not empty, process it
                line = [lemmatizer.lemmatize(token) for token in word_tokenize(line)]
                documents.append(line)

In [6]:
documents

[['Man', 'Utd', 'stroll', 'to', 'Cup', 'win'],
 ['Wayne',
  'Rooney',
  'made',
  'a',
  'winning',
  'return',
  'to',
  'Everton',
  'a',
  'Manchester',
  'United',
  'cruised',
  'into',
  'the',
  'FA',
  'Cup',
  'quarter-finals',
  '.'],
 ['Rooney',
  'received',
  'a',
  'hostile',
  'reception',
  ',',
  'but',
  'goal',
  'in',
  'each',
  'half',
  'from',
  'Quinton',
  'Fortune',
  'and',
  'Cristiano',
  'Ronaldo',
  'silenced',
  'the',
  'jeer',
  'at',
  'Goodison',
  'Park',
  '.',
  'Fortune',
  'headed',
  'home',
  'after',
  '23',
  'minute',
  'before',
  'Ronaldo',
  'scored',
  'when',
  'Nigel',
  'Martyn',
  'parried',
  'Paul',
  'Scholes',
  "'",
  'free-kick',
  '.',
  'Marcus',
  'Bent',
  'missed',
  'Everton',
  "'s",
  'best',
  'chance',
  'when',
  'Roy',
  'Carroll',
  ',',
  'who',
  'wa',
  'later',
  'struck',
  'by',
  'a',
  'missile',
  ',',
  'saved',
  'at',
  'his',
  'foot',
  '.'],
 ['Rooney',
  "'s",
  'return',
  'wa',
  'always',
  'go

In [7]:
new_documents

[['Man', 'Utd', 'stroll', 'Cup', 'win'],
 ['Wayne',
  'Rooney',
  'made',
  'winning',
  'return',
  'Everton',
  'Manchester',
  'United',
  'cruised',
  'FA',
  'Cup',
  'quarter-finals'],
 ['Rooney',
  'received',
  'hostile',
  'reception',
  'goal',
  'half',
  'Quinton',
  'Fortune',
  'Cristiano',
  'Ronaldo',
  'silenced',
  'jeer',
  'Goodison',
  'Park',
  'Fortune',
  'headed',
  'home',
  '23',
  'minute',
  'Ronaldo',
  'scored',
  'Nigel',
  'Martyn',
  'parried',
  'Paul',
  'Scholes',
  'free-kick',
  'Marcus',
  'Bent',
  'missed',
  'Everton',
  'best',
  'chance',
  'Roy',
  'Carroll',
  'wa',
  'later',
  'struck',
  'missile',
  'saved',
  'foot'],
 ['Rooney',
  'return',
  'wa',
  'always',
  'going',
  'potential',
  'flashpoint',
  'wa',
  'involved',
  'angry',
  'exchange',
  'spectator',
  'even',
  'kick-off',
  'Rooney',
  'every',
  'touch',
  'wa',
  'met',
  'deafening',
  'chorus',
  'jeer',
  'crowd',
  'idolised',
  '19-year-old',
  'Everton',
  'star

In [4]:
new_documents = []
for doc in documents:
    new_document = []
    for word in doc:
        if word.strip().lower() not in stopwords:
            new_document.append(word)
    new_documents.append(new_document)

In [5]:
collocation_finder = BigramCollocationFinder.from_documents(new_documents)
measures = BigramAssocMeasures()

collocation_finder.nbest(measures.raw_freq, 15)

[('Champions', 'League'),
 ('Manchester', 'United'),
 ('Cristiano', 'Ronaldo'),
 ('Van', 'Nistelrooy'),
 ('Wayne', 'Rooney'),
 ('Alex', 'Ferguson'),
 ('FA', 'Cup'),
 ('Ferguson', 'wa'),
 ('Gary', 'Neville'),
 ('Man', 'Utd'),
 ('Manchester', 'City'),
 ('Sir', 'Alex'),
 ('national', 'team'),
 ('wa', "n't"),
 ('23', 'minute')]
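
Ranking by raw frequency amounts to counting adjacent word pairs within each document and sorting by count. The sketch below reproduces that idea with the standard library only; it is a simplification rather than NLTK's implementation, and the function name `bigram_counts` and the three short documents (built from tokens seen in the output above) are hypothetical.

```python
from collections import Counter

def bigram_counts(documents):
    """Count adjacent token pairs, never pairing tokens across documents."""
    counts = Counter()
    for doc in documents:
        counts.update(zip(doc, doc[1:]))
    return counts

# hypothetical mini corpus of already-filtered token lists
sample_docs = [
    ["Manchester", "United", "cruised", "FA", "Cup"],
    ["Manchester", "United", "FA", "Cup", "win"],
    ["Wayne", "Rooney", "Manchester", "United"],
]

top_bigrams = bigram_counts(sample_docs).most_common(2)
print(top_bigrams)
```

Raw frequency favors pairs whose words are individually common (such as `('Ferguson', 'wa')` in the output above), which is one reason to consider an association measure such as PMI instead.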

# Term Frequency / Inverse Document Frequency


## Term Frequency
![alt text](images/tf-idf1.png "Term Frequency")

## Inverse Document Frequency
![alt text](images/tf-idf2.png "Inverse Document Frequency")

### Example Calculation

![alt text](images/tf-idf4.png "Example")

Essentially, if a word appears often in one document but rarely in the others, it is important to that document: it has a high TF and a low document frequency (df), and therefore a high IDF.

## Using Scikit-Learn to Generate TF-IDF

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords

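# the reviews are in English, so use only the English stopword list
# (stopwords.words() with no argument returns the stopwords of every language)
vectorizer = TfidfVectorizer(ngram_range=(3,4),
                             token_pattern=r'\b[a-zA-Z]{3,}\b',
                             max_df=0.4, stop_words=stopwords.words("english"))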

In [None]:
df = pd.read_csv("mcdonalds-yelp-negative-reviews.csv", encoding="latin1")
corpus = list(df["review"].values)

X = vectorizer.fit_transform(corpus)
terms = vectorizer.get_feature_names_out()  # get_feature_names() was removed in recent scikit-learn releases
tf_idf = pd.DataFrame(X.toarray().transpose(), index=terms)

In [None]:
tf_idf = tf_idf.sum(axis=1)
score = pd.DataFrame(tf_idf, columns=["score"])
score.sort_values(by="score", ascending=False, inplace=True)

In [None]:
score.to_csv("scores.csv")

## Exercises

For the following exercises, use the definitions below:

**Term frequency**:
$$
tf = n(t,d)
$$
**Inverse document frequency**:
$$
idf = 1 + \frac{N}{df(t) + 1}
$$
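
As a sanity check for hand calculations, the two definitions can be turned into a few lines of Python. This is only a sketch: it assumes the TF-IDF score is the product of `tf` and `idf`, that tokens are obtained by lowercasing and splitting on whitespace, and it runs on a separate hypothetical toy corpus rather than on the exercise documents.

```python
def tf(term, doc_tokens):
    # n(t, d): number of times the term occurs in the document
    return doc_tokens.count(term)

def idf(term, tokenized_docs):
    # 1 + N / (df(t) + 1), following the definition given for the exercises
    n_docs = len(tokenized_docs)
    df = sum(1 for doc in tokenized_docs if term in doc)
    return 1 + n_docs / (df + 1)

def tf_idf(term, doc_tokens, tokenized_docs):
    return tf(term, doc_tokens) * idf(term, tokenized_docs)

# hypothetical toy corpus (not the exercise documents)
toy_corpus = ["the cat sat", "the dog sat down", "the cat ran"]
tokenized = [doc.lower().split() for doc in toy_corpus]

cat_scores = [tf_idf("cat", doc, tokenized) for doc in tokenized]
print(cat_scores)
```

For "cat", $N = 3$ and $df = 2$, so $idf = 1 + 3/3 = 2$; the score is 2 in the two documents that contain the word once and 0 in the one that does not.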

In [None]:
documents = [
    "He ate the food",
    "He liked the meal",
    "She likes the food from McDonalds, but she avoids the food from Burger King",
    "They like to eat 3 meals a day"
]

### Calculate the TF-IDF score for `like` in each of the documents

### Calculate the TF-IDF score for `the food` bigram in each of the documents