# Cosine Similarity Example

### Intro to Algorithmic Marketing:
![alt text](images/cos-sim-textbook1.png "Logo Title Text 1")


## Finding Magnitude of a Vector

In [4]:
import math
import numpy as np
def magnitude(x): 
    return math.sqrt(sum(i**2 for i in x))

vectorA = [0,3,1,2]

print(f"First approach: {magnitude(vectorA)}")
print(f"Second approach: {np.linalg.norm(vectorA)}")

First approach: 3.7416573867739413
Second approach: 3.7416573867739413
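
With the magnitude in hand, cosine similarity is the dot product of two vectors divided by the product of their magnitudes. The sketch below is a minimal pure-Python version; `vectorB` and `vectorC` are hypothetical vectors chosen for illustration and do not come from the textbook example.

```python
import math

def magnitude(x):
    # Euclidean (L2) norm of a vector
    return math.sqrt(sum(i ** 2 for i in x))

def cosine_similarity(a, b):
    # dot product divided by the product of the two magnitudes
    dot = sum(i * j for i, j in zip(a, b))
    return dot / (magnitude(a) * magnitude(b))

vectorA = [0, 3, 1, 2]
vectorB = [0, 6, 2, 4]  # hypothetical: vectorA scaled by 2, so it points the same way
vectorC = [5, 0, 0, 0]  # hypothetical: shares no non-zero dimension with vectorA

print(f"A vs B: {cosine_similarity(vectorA, vectorB)}")  # approximately 1
print(f"A vs C: {cosine_similarity(vectorA, vectorC)}")  # 0.0
```

Because the measure depends only on direction, scaling a vector leaves its similarity to another vector unchanged.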


# Tuning Count Vectorization - One Hot Encoding and other Features

In [6]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
plots_df = pd.read_csv("movie_plots.csv")

# filter only for American movies
plots_df = plots_df[plots_df["Origin/Ethnicity"] == "American"]

# traditional CountVectorizer
vectorizer = CountVectorizer()

# use English stopwords, and use one-hot encoding
vectorizer = CountVectorizer(stop_words="english", binary=True)

# use English stopwords, and use one-hot encoding, and the word must appear in at least 5% of the movie plots
# (a float min_df is a proportion of documents)
vectorizer = CountVectorizer(stop_words="english", binary=True, min_df=0.05)

# use English stopwords, and use one-hot encoding, and the word must appear in at least two of the movie plots
# (an integer min_df is an absolute document count), and keep only the top 200 words by corpus frequency
vectorizer = CountVectorizer(stop_words="english", binary=True, min_df=2, max_features=200)

X = vectorizer.fit_transform(plots_df["Plot"])

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
vectorized_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
print(f"Shape of dataframe is {vectorized_df.shape}")
print(f"Total number of occurrences: {vectorized_df.sum().sum()}")
#print(f"Word counts: {vectorized_df.sum()}")
vectorized_df

Shape of dataframe is (655, 200)
Total number of occurrences: 29652


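| | able | accidentally | agrees | appears | arrive | arrives | asks | attack | attacks | attempt | ... | way | wife | woman | work | working | world | year | years | york | young |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 4 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | ... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 6 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |
| 7 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 9 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |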
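
To make the `binary` and `min_df` options concrete, here is a minimal pure-Python sketch of the idea on a hypothetical three-plot corpus. It is not scikit-learn's implementation (it tokenizes with a plain `split()` and ignores stop words); it only illustrates that each cell records presence or absence, and that an integer `min_df` is an absolute document count while a float is a proportion of the documents.

```python
from collections import Counter

def binary_vectorize(docs, min_df=1):
    # one set of tokens per document: presence/absence, not counts
    tokenized = [set(doc.lower().split()) for doc in docs]
    # document frequency: in how many documents does each token appear?
    doc_freq = Counter(tok for toks in tokenized for tok in toks)
    # an int is an absolute count; a float is a fraction of the corpus
    threshold = min_df if isinstance(min_df, int) else min_df * len(docs)
    vocab = sorted(tok for tok, n in doc_freq.items() if n >= threshold)
    rows = [[1 if tok in toks else 0 for tok in vocab] for toks in tokenized]
    return vocab, rows

# hypothetical plots, for illustration only
toy_plots = [
    "a detective hunts a killer",
    "a killer stalks a small town",
    "a small town hosts a wedding",
]

vocab, rows = binary_vectorize(toy_plots, min_df=2)
print(vocab)  # ['a', 'killer', 'small', 'town']
for row in rows:
    print(row)
```

Even though "a" appears twice in every plot, its column holds 1 rather than 2, which is the effect of `binary=True`.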


# Collocation

In [98]:
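from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures
from nltk.corpus import stopwords as nltk_stopwords
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize

lemmatizer = WordNetLemmatizer()

# English stopwords plus the punctuation tokens that word_tokenize emits;
# the import is aliased so this set does not shadow the NLTK corpus reader
stopwords = set(nltk_stopwords.words("english") + [".", ",", ":", "''", "'s", "'", "``", "(", ")", "-"])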

In [99]:
documents = []
articles = [f"bbcsport/football/00{i}.txt" for i in range(1,10)]

for path in articles:
    with open(path) as article:  # open each sports article; the file is closed automatically
        for line in article:
            line = line.replace("\n", "")  # remove the newline character
            if len(line) > 0:  # if the line is not empty, process it
                tokens = [lemmatizer.lemmatize(token) for token in word_tokenize(line)]
                documents.append(tokens)

In [100]:
new_documents = []
for doc in documents:
    new_document = []
    for word in doc:
        if word.strip().lower() not in stopwords:
            new_document.append(word)
    new_documents.append(new_document)

In [101]:
collocation_finder = BigramCollocationFinder.from_documents(new_documents)
measures = BigramAssocMeasures()

collocation_finder.nbest(measures.raw_freq, 30)

[('Champions', 'League'),
 ('Manchester', 'United'),
 ('Cristiano', 'Ronaldo'),
 ('Van', 'Nistelrooy'),
 ('Wayne', 'Rooney'),
 ('Alex', 'Ferguson'),
 ('FA', 'Cup'),
 ('Ferguson', 'wa'),
 ('Gary', 'Neville'),
 ('Man', 'Utd'),
 ('Manchester', 'City'),
 ('Sir', 'Alex'),
 ('national', 'team'),
 ('wa', "n't"),
 ('23', 'minute'),
 ('BBC', 'Radio'),
 ('Blues', 'bos'),
 ('Carling', 'Cup'),
 ('Carroll', 'Gary'),
 ('City', 'Sunday'),
 ('City', 'best'),
 ('Cup', 'final'),
 ('Five', 'Live'),
 ('Gallas', 'would'),
 ('Gerrard', 'ha'),
 ('Goodison', 'Park'),
 ('Home', 'International'),
 ('International', 'series'),
 ('Jose', 'Mourinho'),
 ('Man', 'City')]
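
As a rough illustration of what ranking by raw frequency does, the sketch below counts adjacent token pairs using only the standard library. The sample documents are hypothetical, and this is not NLTK's implementation; it only shows that bigrams are formed within each document and then ordered by how often they occur.

```python
from collections import Counter

def count_bigrams(documents):
    counts = Counter()
    for tokens in documents:
        # pair each token with its successor; pairs never span two documents
        counts.update(zip(tokens, tokens[1:]))
    return counts

# hypothetical tokenized documents, for illustration only
sample_documents = [
    ["Manchester", "United", "won", "FA", "Cup"],
    ["Manchester", "United", "lost", "Carling", "Cup"],
    ["Manchester", "City", "drew"],
]

bigram_counts = count_bigrams(sample_documents)
print(bigram_counts.most_common(1))  # [(('Manchester', 'United'), 2)]
```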

# Pointwise Mutual Information

It's important to identify a **context window** when analyzing co-occurrence. In the image below, the context window size is 4 (2 tokens to either side of the target word):

![alt text](images/context_window.png "Logo Title Text 1")

For the purposes of the next section, we'll define the **entire document as the context window.**
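
For comparison, a fixed-size window like the one pictured above can be sketched in a few lines. The helper name `context_windows` and the sample sentence are hypothetical; the function simply collects up to `size` tokens on each side of every target word.

```python
def context_windows(tokens, size=2):
    """Yield (target, context) pairs; context holds up to `size` tokens per side."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - size):i]
        right = tokens[i + 1:i + 1 + size]
        yield target, left + right

sentence = "the quick brown fox jumps".split()
windows = list(context_windows(sentence))

for target, context in windows:
    print(target, context)
```

Words near the start or end of the sentence get a shorter context because there are fewer neighbours on one side.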

Pointwise mutual information (PMI) compares the **joint probability of two events happening together** with the probability we would expect if the two events were independent, which is the product of their individual probabilities. It is defined by the following equation:

$$
PMI_{A,B} = \log\frac{p(A,B)}{p(A)\,p(B)}
$$

Remember that when two events are independent, $p(A,B) = p(A)p(B)$, so the ratio is 1 and the PMI is 0. Using PMI instead of a raw co-occurrence count is often preferable because very common words skew raw counts: "the" and "of" will co-occur frequently in the same document regardless of any real association between them.
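
As a quick worked example with hypothetical probabilities (not taken from any dataset in this notebook), suppose $p(A) = 0.1$, $p(B) = 0.2$ and $p(A,B) = 0.08$. Taking the logarithm in base 2, as the code in this section does:

$$
PMI_{A,B} = \log_2\frac{0.08}{0.1 \times 0.2} = \log_2 4 = 2
$$

The pair co-occurs four times more often than independence would predict. Had the joint probability been $0.02$, the ratio would be 1 and the PMI would be 0.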

```python
import math

def pmi(tokenA, tokenB, documents, word_counts, bigram_freq):
    # word_counts[tokenA] => number of times tokenA appears in the documents
    # bigram_freq["tokenA tokenB"] => number of times the two tokens co-occur
    #   (assumed here to be a dict keyed by the space-joined pair)
    # n_documents => number of documents
    n_documents = float(len(documents))

    prob_A = word_counts[tokenA] / n_documents
    prob_B = word_counts[tokenB] / n_documents
    prob_A_B = bigram_freq[" ".join([tokenA, tokenB])] / n_documents
    return math.log(prob_A_B / (prob_A * prob_B), 2)
```
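
A minimal, self-contained sketch of document-level PMI follows, assuming the whole document is the context window so that every probability is the share of documents containing the token (or both tokens). The function name `document_pmi` and the toy documents are hypothetical, and returning negative infinity for pairs that never co-occur is one possible convention rather than the only one.

```python
import math

def document_pmi(token_a, token_b, documents):
    n_docs = len(documents)
    doc_sets = [set(doc) for doc in documents]
    # number of documents containing each token, and containing both
    count_a = sum(1 for doc in doc_sets if token_a in doc)
    count_b = sum(1 for doc in doc_sets if token_b in doc)
    count_ab = sum(1 for doc in doc_sets if token_a in doc and token_b in doc)
    if count_ab == 0:
        return float("-inf")  # the pair never co-occurs, so log(0) is undefined
    p_a, p_b, p_ab = count_a / n_docs, count_b / n_docs, count_ab / n_docs
    return math.log2(p_ab / (p_a * p_b))

# hypothetical tokenized documents, for illustration only
toy_documents = [
    ["manchester", "united", "win"],
    ["manchester", "united", "draw"],
    ["chelsea", "win"],
    ["chelsea", "draw"],
]

print(document_pmi("manchester", "united", toy_documents))   # 1.0
print(document_pmi("manchester", "win", toy_documents))      # 0.0
print(document_pmi("manchester", "chelsea", toy_documents))  # -inf
```

"manchester" and "united" always appear together, so their PMI is positive, while "manchester" and "win" co-occur exactly as often as chance predicts.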

# Term Frequency / Inverse Document Frequency


## Term Frequency
![alt text](images/tf-idf1.png "Term Frequency")

## Inverse Document Frequency
![alt text](images/tf-idf2.png "Inverse Document Frequency")

### Example Calculation

![alt text](images/tf-idf4.png "Example")
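
The calculation can also be sketched by hand. The version below uses one common textbook formulation, with term frequency as the share of a document's tokens and inverse document frequency as the natural log of the number of documents over the number containing the term. It may differ in detail from the figures above, and scikit-learn's `TfidfVectorizer` uses a smoothed IDF with L2 normalization by default, so its numbers will not match. The function names and toy reviews are hypothetical.

```python
import math

def term_frequency(term, doc):
    # share of the document's tokens that are `term`
    return doc.count(term) / len(doc)

def inverse_document_frequency(term, docs):
    # log of (number of documents / number of documents containing `term`)
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing) if containing else 0.0

def tf_idf_score(term, doc, docs):
    return term_frequency(term, doc) * inverse_document_frequency(term, docs)

# hypothetical tokenized reviews, for illustration only
toy_reviews = [
    ["the", "fries", "were", "cold"],
    ["the", "service", "was", "slow"],
    ["the", "fries", "the", "burger"],
]

print(tf_idf_score("the", toy_reviews[2], toy_reviews))     # 0.0
print(tf_idf_score("burger", toy_reviews[2], toy_reviews))  # 0.25 * ln(3)
```

"the" is the most frequent token in the third review, but because it appears in every review its IDF is zero and it contributes nothing.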

## Using Scikit-Learn

In [13]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords

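# ngram_range=(3,4): features are 3- and 4-word phrases
# token_pattern: keep only alphabetic tokens of three or more letters
# max_df=0.4: drop phrases that appear in more than 40% of the reviews
# stopwords.words() with no argument returns NLTK's stopword lists for every language;
# pass "english" to restrict it to English
vectorizer = TfidfVectorizer(ngram_range=(3,4),
                             token_pattern=r'\b[a-zA-Z]{3,}\b',
                             max_df=0.4, stop_words=stopwords.words())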

In [18]:
df = pd.read_csv("mcdonalds-yelp-negative-reviews.csv", encoding="latin1")
corpus = list(df["review"].values)

X = vectorizer.fit_transform(corpus)
terms = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
tf_idf = pd.DataFrame(X.toarray().transpose(), index=terms)

tf_idf.transpose()

# X can now be passed to a downstream model (RNN, NB, SVM)
# tf_idf = tf_idf.sum(axis=1)
# score = pd.DataFrame(tf_idf, columns=["score"])
# score["term"] = terms
# score.sort_values(by="score", ascending=False, inplace=True)

| | aaaaaaaahhhhhhhhhhh still feel | aaaaaaaahhhhhhhhhhh still feel situation | abbreviated menu worthy | abbreviated menu worthy mcdonald | abc kitchen numerous | abc kitchen numerous times | ability answer questions | ability answer questions menu | ability innovate launching | ability innovate launching products | ... | zombies anyway course | zombies anyway course waiting | zombies appeared nowhere | zombies appeared nowhere including | zombies bikes stopped | zombies bikes stopped stare | zombies little less | zombies little less predictable | zoom line person | zoom line person waving |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 8 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 9 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [16]:
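# total TF-IDF weight of each term across all reviews, highest first
score = tf_idf.sum(axis=1).rename("score").rename_axis("term").sort_values(ascending=False)
score.to_csv("scores.csv")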