# Homework 4 (Due Friday, Nov. 19th, 11:59pm PST)

1. Identify **three pairs of documents** in the McDonalds review dataset that have over 0.85 cosine similarity using average token word2vec embeddings from spacy.

2. Using the `SMS_test` and `SMS_train` datasets, build a classification model (you can simply use the `sklearn.linear_model.LogisticRegression` model used. Please attempt at least two of the vectorization techniques below:
    * `CountVectorization`
    * `TfIdfVectorization`
    * `word2vec` spacy document-level vectors
        * if you pick Count or TfIdf Vectorization -> you MUST reduce dimensionality.

Make sure you perform the following:
* use train/test split
* use proper model evaluation metrics
* text preprocessing (regex, stemming/lemmatization, stopword removal, grouping entities, etc.)

A discussion of the following:
* **What techniques** you tried to improve the performance of your model.
* What you would try to do, given more time, that would improve the performance of your model.
* Provide an example of two **error cases** - a false positive and a false negative - that your model got wrong, and why the model did not predict the correct answer.

In [1]:
import spacy
from spacy import displacy
import pandas as pd 
import numpy as np
import re
import nltk
import sys

nlp = spacy.load("en_core_web_sm")


# Question 1

In [2]:
mcd_rev = pd.read_csv('mcdonalds-yelp-negative-reviews.csv', encoding = "ISO-8859-1")
reviews = mcd_rev['review']
reviews

0       I'm not a huge mcds lover, but I've been to be...
1       Terrible customer service. I came in at 9:30pm...
2       First they "lost" my order, actually they gave...
3       I see I'm not the only one giving 1 star. Only...
4       Well, it's McDonald's, so you know what the fo...
                              ...                        
1520    I enjoyed the part where I repeatedly asked if...
1521    Worst McDonalds I've been in in a long time! D...
1522    When I am really craving for McDonald's, this ...
1523    Two points right out of the gate: 1. Thuggery ...
1524    I wanted to grab breakfast one morning before ...
Name: review, Length: 1525, dtype: object

In [3]:
helper = reviews.copy()

x = []
for i in helper:
    y = nlp(i)
    x.append(y)


In [4]:
similarity_df = pd.DataFrame(index = range(1525), columns = range(1525))
for i in range(1525):
    for j in range(1525):
        value = x[i].similarity(x[j])
        if value < 0.849:
            continue
        if value == 1:
            value = 0
        else:
            similarity_df.loc[i,j] = value
            
similarity_df.head(10)

  value = x[i].similarity(x[j])


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1515,1516,1517,1518,1519,1520,1521,1522,1523,1524
0,,,,,,,,,,,...,,,,,,,,,,
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,0.860516,,0.860886,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,
6,,,,,0.860516,,,,,,...,,,,,,,,,,
7,,,,,,,,,,,...,,,,,,,,,,
8,,,,,0.860886,,,,,,...,,,,,,,,,,
9,,,,,,,,,,,...,,,,,,,,,,


In [5]:
# now we must return the first 3 documents that have cosine similarity >0.85

#we randomize a little bit since it was taking too long to get the highest value from the pairs
np.random.seed(17)
random_starts = np.random.randint(1,1500, size = 3)

lister = [np.nan, np.nan, np.nan]
ret_list = [0, 0, 0]

for i in range(random_starts[0],1525):
    for j in range(random_starts[1],1525):
        for k in range(random_starts[2], 1525):
            pair12 = similarity_df.loc[i,j]
            pair13 = similarity_df.loc[i,k]
            pair23 = similarity_df.loc[j,k]
            lister = [pair12, pair13, pair23]
            
            if np.isnan(lister).sum() == 0:
                ret_list = [i, j, k]
                break
        if np.isnan(lister).sum() == 0:
            break
    if np.isnan(lister).sum() == 0:
        break

print(f'Three documents with cosine similarities above 0.85 are {ret_list}\n') 

print(f'Document {ret_list[0]} has a cosine similarity of {lister[0]} to Document {ret_list[1]} and \
they look like \n   Document {ret_list[0]}: {helper[ret_list[0]]}\n\
   Document {ret_list[1]}: {helper[ret_list[1]]}\n')

print(f'Document {ret_list[0]} has a cosine similarity of {lister[2]} to Document {ret_list[2]} and \
they look like \n   Document {ret_list[0]}: {helper[ret_list[0]]}\n\
   Document {ret_list[2]}: {helper[ret_list[2]]}\n')

print(f'Document {ret_list[1]} has a cosine similarity of {lister[1]} to Document {ret_list[2]} and \
they look like \n   Document {ret_list[1]}: {helper[ret_list[1]]}\n\
   Document {ret_list[2]}: {helper[ret_list[2]]}')



Three documents with cosine similarities above 0.85 are [630, 1271, 221]

Document 630 has a cosine similarity of 0.8590865144015573 to Document 1271 and they look like 
   Document 630: I did it for you...Okay, we've all been there. This morning I decided to try out this new fangled chicken biscuit for breakfast. Not sure what all played into that idea. Maybe it was my craving for McFood after last nights Yelp event at the Dockside. Maybe it was the commercial right before I pulled into the drive through, whatever. At any rate, I did it for you. I tried it so you don't have to. Turns out, the chicken is different. Its moist & the batter is appealing to the eye. But as soon as you get one whiff you know that's some funky chicken. I think maybe they reached a little too far with this one. They tried to be too southern or too something and they screwed it up with some herb or flavor that just shouldn't be there. I think maybe it was dill? Either way, its all wrong. Take my word for it, s

# Question 2

In [12]:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics import roc_auc_score

train = pd.read_csv("SMS_train.csv", encoding = "ISO-8859-1")
train = train.iloc[:,1:]
train_X = train['Message_body']
train_y = train['Label']== "Spam"

test = pd.read_csv("SMS_test.csv", encoding = "ISO-8859-1")
test = test.iloc[:,1:]
test_X = test['Message_body']
test_y = test['Label'] == "Spam"

train.head()

Unnamed: 0,Message_body,Label
0,Rofl. Its true to its name,Non-Spam
1,The guy did some bitching but I acted like i'd...,Non-Spam
2,"Pity, * was in mood for that. So...any other s...",Non-Spam
3,Will ü b going to esplanade fr home?,Non-Spam
4,This is the 2nd time we have tried 2 contact u...,Spam


In [13]:
#regex cleaning, stopword setting

from nltk.corpus import stopwords
stopword_list = set(stopwords.words('english') + [".", "!", "?", ",",":", "''", "'s", "'", "``", "(", ")", "-"])

for i in range(len(train_X)):
    train_X[i] = re.sub("&lt;#&gt", "", train_X[i])
    train_X[i] = re.sub("&gt;", "", train_X[i])
    train_X[i] = re.sub("&lt;", "", train_X[i])

for i in range(len(test_X)):
    test_X[i] = re.sub("&lt;#&gt", "", test_X[i])
    test_X[i] = re.sub("&gt;", "", test_X[i])
    test_X[i] = re.sub("&lt;", "", test_X[i])

In [14]:
#count vectorize
vectorizer = CountVectorizer(ngram_range = (1,1),
                            stop_words = stopword_list, #remove stopwords
                            token_pattern='(?u)\\b[a-zA-Z][a-zA-Z]+\\b',
                            max_features = 2000,
                            min_df = 0.01,
                            binary = True)

#train model
X_train = vectorizer.fit_transform(train_X)
lr1 = LogisticRegression()
lr1.fit(X_train, train_y)
y_train_pred = lr1.predict(X_train)
print(f'Without dimensionality reduction, on the training set, our model is {round(np.mean(y_train_pred == train_y),4)} accurate.')

# reduce dimensions
svd = TruncatedSVD(n_components = 5)
Z = svd.fit_transform(X_train)
lr1.fit(Z, train_y)
y_red_train_pred = lr1.predict(Z)
print(f'With dimensionality reduction, on the training set, our model is {round(np.mean(y_red_train_pred == train_y),4)} accurate.')

# now we do on test set
X_test = vectorizer.fit_transform(test_X)
Z = svd.fit_transform(X_test)
y_red_test_pred = lr1.predict(Z)
print(f'\nWith dimensionality reduction, on the testing set, our model is {round(np.mean(y_red_test_pred == test_y),4)} accurate.\n')

print(f'Our AUROC is {round(roc_auc_score(y_red_test_pred, test_y),4)}\n')

print(f'And our confusion matrix looks like:\n {confusion_matrix(y_red_test_pred, test_y)}.')

Without dimensionality reduction, on the training set, our model is 0.9687 accurate.
With dimensionality reduction, on the training set, our model is 0.954 accurate.

With dimensionality reduction, on the testing set, our model is 0.6 accurate.

Our AUROC is 0.7475

And our confusion matrix looks like:
 [[49 50]
 [ 0 26]].


In [15]:
# spacy 

train_helper = train_X.copy()

x = []
t = []
for i in train_helper:
    y = nlp(i)
    x.append(y) # list of text
    t.append(y.vector) # list of associated vectors
    
lr = LogisticRegression()
lr.fit(t,train_y)
train_prediction = lr.predict(t)
print(f'Using Spacy, our model is {round(np.mean(train_prediction == train_y),4)} accurate on our training data')

###################################### testing set ##############
test_helper = test_X.copy()
u = []
z = []
for i in test_helper:
    p = nlp(i)
    u.append(p) # list of text
    z.append(p.vector) # list of associated vectors
test_prediction = lr.predict(z)

print(f'Using Spacy, our model is {round(np.mean(test_prediction == test_y),4)} accurate on our testing data')

print(f'Our AUROC is {round(roc_auc_score(test_prediction, test_y),4)}\n')

print(f'And our confusion matrix looks like:\n {confusion_matrix(test_prediction, test_y)}.')


Using Spacy, our model is 0.976 accurate on our training data
Using Spacy, our model is 0.84 accurate on our testing data
Our AUROC is 0.8551

And our confusion matrix looks like:
 [[49 20]
 [ 0 56]].


# Analysis

### CountVectorizer
- On our testing set, we achieved 60% (75/125) accuracy with an AUC of 0.7475
- Our confusion matrix looked like: 

|             | Non-Spam    | Spam          |
| :---        |    :----:   |          ---: |
| Non-Spam    |      49     |      50       |
| Spam        |      0      |     26        |

- We misclassify about 2/3 of our spam messages as non-spam , i.e. false negative errors

### Spacy
- On our testing set, we achieved 84% (105/125) accuracy with an AUC of 0.855
- Our confusion matrix looked like:

|             | Non-Spam    | Spam          |
| :---        |    :----:   |          ---: |
| Non-Spam    |      49     |      20       |
| Spam        |      0      |     56        |

- We still make most of our misclassification errors with false negatives, although this model performs better than our CountVectorizer


### Other comments
- **Techniques used to try and improve our models were:**
    - Different variations of ngram ranges for CountVectorizer (bugs with higher n values)
    - Adding in binary = True improved model performance
    - Adding certain stopwords and regex cleaning
    - Performing train/test split before and after vectorizing (same results)
- **Given more time, we would:**
    - Do more regex cleaning
    - Qualitatively assess the reviews to gain insights into additional stopwords to remove
    - Perform CV analysis to optimize our regression model
- **Two error cases:**
    - Our better model (spacy) misclassifies certain spam texts as non-spam (we have no messages tagged as spam when they are not supposed to be). Two examples of this are:
        - *Someone has contacted our dating service and entered your phone because they fancy you! To find out who it is call from a landline 09111032124 . PoBox12n146tf150p*
        - *Thanks for your Ringtone Order, Reference T91. You will be charged GBP 4 per week. You can unsubscribe at anytime by calling customer services on 09057039994*
    - These two are coming up as non-spam instead of spam since they *resemble* regular text messages more than spam texts based on our model. They also *seem* less like spam than a lot of the other messages. The key here is that we must refine our technique and model when training it