### Notebook #9

## Intro

As a final foray into the land of embeddings, we will examine how well the `Word2Vec Google News` embeddings predict sarcasm. The rationale is that they were trained on an enormous amount of data and contain 3 million tokens, which may help them better capture relationships between the words used in sarcastic comments.

In [1]:
# importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

import regex as re
import string

from gensim.models import Word2Vec
import gensim.downloader as api

from sklearn.model_selection import train_test_split


import warnings 
warnings.filterwarnings(action = 'ignore')

In [2]:
# reading in csv and checking head
reddit = pd.read_csv('reddit_comments.csv', index_col=0)

reddit.head()

| | label | comment |
|---|---|---|
| 0 | 0 | NC and NH. |
| 1 | 0 | You do know west teams play against west teams... |
| 2 | 0 | They were underdogs earlier today, but since G... |
| 3 | 0 | This meme isn't funny none of the "new york ni... |
| 4 | 0 | I could use one of those tools. |


In [4]:
embeddings = api.load('word2vec-google-news-300')

In [17]:
shape = embeddings.vectors.shape

print(f'Word2Vec Google News consists of {shape[0]} tokens and {shape[1]} dimensional vectors')

Word2Vec Google News consists of 3000000 tokens and 300 dimensional vectors


At 3 million tokens, these embeddings outnumber the self-trained embedding tokens by a factor of 60.

We will again perform a quick 'EDA' of what these embeddings hold and how their similarities differ from those of the self-trained embeddings.

In [17]:
embeddings.most_similar('sarcasm')

[('snark', 0.6535126566886902),
 ('humor', 0.6242319345474243),
 ('sarcastic', 0.6220389604568481),
 ('self_deprecating_humor', 0.6171834468841553),
 ('self_deprecation', 0.607752799987793),
 ('condescension', 0.6044954061508179),
 ('snarkiness', 0.5990619659423828),
 ('irony', 0.5971057415008545),
 ('facetiousness', 0.5904334783554077),
 ('wit', 0.5846673250198364)]
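
The scores returned by `most_similar` are cosine similarities between the query vector and every other vector in the vocabulary. The sketch below reproduces that ranking in plain Python. The three-dimensional `toy_vectors` and the helper names are hypothetical stand-ins for the 300-dimensional model, chosen only so the example runs without downloading the embeddings.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(query, vectors, topn=3):
    """Rank every other word by cosine similarity to the query word."""
    scores = [
        (word, cosine_similarity(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical toy vectors; the real model stores 300 dimensions per token.
toy_vectors = {
    "sarcasm": [0.9, 0.1, 0.3],
    "snark": [0.8, 0.2, 0.4],
    "humor": [0.5, 0.5, 0.5],
    "truth": [-0.2, 0.9, 0.1],
}

ranked = most_similar("sarcasm", toy_vectors)
for word, score in ranked:
    print(f"{word}: {score:.3f}")
```

Cosine similarity ignores vector length and compares direction only, which is why scores fall between -1 and 1.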

In [18]:
embeddings.most_similar('genuine')

[('genuinely', 0.5996334552764893),
 ('real', 0.5876179933547974),
 ('legitimate', 0.5874236822128296),
 ('sincere', 0.5649887323379517),
 ('bona_fide', 0.5497070550918579),
 ('geniune', 0.5368186235427856),
 ('thief_demagogue_liar', 0.5333326458930969),
 ('genuineness', 0.5200586318969727),
 ('legitimate_aand', 0.5138376951217651),
 ('Genuine', 0.5109951496124268)]

In [19]:
embeddings.most_similar('honest')

[('frank', 0.6541845202445984),
 ('truthful', 0.6527417898178101),
 ('brutally_honest', 0.6292597055435181),
 ('dignity_Aujali', 0.6224915981292725),
 ('gazillionaire_Carella', 0.6192165017127991),
 ('honestly', 0.6042546629905701),
 ('forthright', 0.603355348110199),
 ('honesty', 0.576174259185791),
 ('candid', 0.5652011632919312),
 ('scrupulously_honest', 0.5450152158737183)]

In [20]:
embeddings.most_similar('misleading')

[('mislead', 0.6863378286361694),
 ('inaccurate', 0.6856165528297424),
 ('grossly_misleading', 0.673831582069397),
 ('deliberately_misleading', 0.6662497520446777),
 ('misled', 0.6598524451255798),
 ('Misleading', 0.647882342338562),
 ('deceptive', 0.6444677710533142),
 ('misrepresented', 0.6297670602798462),
 ('misrepresenting', 0.628930926322937),
 ('false', 0.6191345453262329)]

In [5]:
# finding most similar words to 'serious' and 'joking'
serious = embeddings.most_similar('serious')
joking = embeddings.most_similar('joking')

# setting to dataframe to visualize
pd.DataFrame(
    data={"serious": [word for word, sim in serious],
          "joking": [word for word, sim in joking]})

| | serious | joking |
|---|---|---|
| 0 | serous | joked |
| 1 | severe | cracking_jokes |
| 2 | Serious | kidding |
| 3 | seriously | laughing |
| 4 | minor | joke |
| 5 | seri_ous | laughed |
| 6 | seriousness | chuckling |
| 7 | gravest | jokingly |
| 8 | grievous | laugh |
| 9 | nonserious | chuckled |


In [6]:
# finding most similar words to 'truth' and 'lie'
lie = embeddings.most_similar('lie')
truth = embeddings.most_similar('truth')

# setting to dataframe to visualize
pd.DataFrame(
        data={"lie": [word for word, sim in lie], 
            "truth": [word for word, sim in truth]})

| | lie | truth |
|---|---|---|
| 0 | lies | truths |
| 1 | lying | falsehood |
| 2 | Lying | veritas_Latin |
| 3 | Terravista_complex | Fatma_Trad_veiled |
| 4 | lay | truthful |
| 5 | Lie | facts |
| 6 | perjure_yourself | Truth |
| 7 | sit | untruths |
| 8 | BE_TRUTHFUL_Do | falsity |
| 9 | lurk | unvarnished_truth |


### Observations
- even more repetition than `Glove-Twitter`, as exemplified in the 'lie' table ('lies', 'lying', 'Lying', 'Lie')
- interestingly, it incorporates completely different meanings of the word 'lie', like 'lay' (down), which the other embeddings did not do
- there is definitely more messiness in these embeddings, with entries like 'Fatma_Trad_veiled' that are difficult to interpret
- most importantly, these embeddings are constructed differently: they include multi-word phrases joined by underscores, both `bigrams` such as 'brutally_honest' and longer phrases such as 'thief_demagogue_liar'
- upper and lower case tokens are treated as different from one another (e.g. 'lie' and 'Lie')
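
The last two observations matter for the tokenizer used later in this notebook, which lowercases each comment and strips punctuation. Tokens produced that way can only match lowercase, single-word vocabulary entries. The sketch below illustrates the effect on a hypothetical five-entry vocabulary; `toy_vocab` and `coverage` are illustrative names, not part of gensim.

```python
# Hypothetical miniature vocabulary mimicking the case-sensitive,
# underscore-joined entries seen in the tables above.
toy_vocab = {"lie", "Lie", "Lying", "brutally_honest", "honest"}

def coverage(tokens, vocab):
    """Fraction of tokens that have an entry in the vocabulary."""
    if not tokens:
        return 0.0
    return sum(token in vocab for token in tokens) / len(tokens)

raw = "Lying is not brutally honest"
cased_tokens = raw.split()            # keeps 'Lying'
lowered_tokens = raw.lower().split()  # 'lying' has no entry in toy_vocab

print(coverage(cased_tokens, toy_vocab))    # matches 'Lying' and 'honest'
print(coverage(lowered_tokens, toy_vocab))  # matches only 'honest'
```

In neither case is the phrase entry 'brutally_honest' matched, because splitting on spaces never produces an underscore-joined token.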

----------------------------------------------------------------------------------------------------------------------------------------------------------

**Below we will run our comments through the tokenizer and vectorizer functions and perform a train/test split.**

In [7]:
# custom tokenizer function
def tokenizer(document):
    # removing punctuation
    for punc in string.punctuation:
        document = document.replace(punc, '')
    # removing numbers and setting the document to lowercase
    # (raw string so that \d is passed to the regex engine unchanged)
    document = re.sub(r"\d+", "", document).lower()
    # splitting the document on spaces into a list of tokens
    tokens = document.split(' ')

    return tokens

In [8]:
def vectorizer(document):

    tokens = tokenizer(document) # calling tokenizer function

    size = embeddings.vector_size # dimensionality of the word vectors (300)
    vec_doc = np.zeros(size) # zero vector that accumulates the word vectors
    # starts at 1 so the division below can never divide by zero;
    # note that the sum is therefore divided by (matched words + 1)
    count = 1

    # looping through the tokens of the document
    for word in tokens:
        # checking if the word exists in the embedding vocabulary
        if word in embeddings:
            count += 1
            vec_doc += embeddings[word] # adding word embedding to doc embedding

    vec_doc = vec_doc / count # scaling the summed embeddings by the count

    return vec_doc
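
The vectorizer above represents each comment by averaging the vectors of the tokens found in the embedding vocabulary. Because its counter starts at 1, the sum is divided by the number of matches plus one, which shrinks every document vector slightly toward zero. The sketch below contrasts that with a plain mean. It uses a hypothetical two-dimensional `toy_embeddings` dictionary and plain Python lists so it runs without the model; the function names are illustrative only.

```python
from statistics import fmean

# Hypothetical two-dimensional embeddings standing in for the real model.
toy_embeddings = {"good": [1.0, 3.0], "bad": [3.0, 5.0]}

def mean_vector(tokens, embeddings, size):
    """Average the vectors of in-vocabulary tokens; zeros if none match."""
    found = [embeddings[t] for t in tokens if t in embeddings]
    if not found:
        return [0.0] * size
    return [fmean(component) for component in zip(*found)]

def shrunk_mean_vector(tokens, embeddings, size):
    """Same sum, but divided by (matches + 1), as when the counter starts at 1."""
    found = [embeddings[t] for t in tokens if t in embeddings]
    totals = [sum(c) for c in zip(*found)] if found else [0.0] * size
    return [total / (len(found) + 1) for total in totals]

tokens = ["good", "bad", "unknownword"]
plain = mean_vector(tokens, toy_embeddings, 2)
shrunk = shrunk_mean_vector(tokens, toy_embeddings, 2)
print(plain)
print(shrunk)
```

The shrinkage is largest for short comments, where one extra unit in the denominator is a large share of the count. Whether this affects the classifier is not something this notebook measures.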

In [9]:
# applying vectorizer function
reddit['vectors'] = reddit['comment'].apply(vectorizer)

In [10]:
# instantiating x and y
X = reddit['vectors']
y = reddit['label']

In [11]:
X = list(X)

In [12]:
# train/test split
X_rem, X_test, y_rem, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_rem, y_rem, test_size=0.25, random_state=42)
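
The two successive calls above do not produce a 50/25/25 split: the second call takes 25% of the remaining 75%, not 25% of the full data. The arithmetic below works out the resulting proportions; the variable names are illustrative.

```python
test_fraction = 0.25                      # first call: held out for testing
remaining_fraction = 1 - test_fraction    # what is left for train + validation
val_fraction = remaining_fraction * 0.25  # second call: 25% of the remainder
train_fraction = remaining_fraction * 0.75

print(f"train: {train_fraction:.4f}")  # train: 0.5625
print(f"val:   {val_fraction:.4f}")    # val:   0.1875
print(f"test:  {test_fraction:.4f}")   # test:  0.2500
```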

Now we can fit our MLP classifier to the training data.

In [13]:
np.random.seed(42)

from sklearn.neural_network import MLPClassifier

MLP = MLPClassifier(verbose=True, max_iter=50, early_stopping=True)

MLP.fit(X_train, y_train)

Iteration 1, loss = 0.63216251
Validation score: 0.651645
Iteration 2, loss = 0.61039095
Validation score: 0.665770
Iteration 3, loss = 0.60059742
Validation score: 0.672594
Iteration 4, loss = 0.59378932
Validation score: 0.672401
Iteration 5, loss = 0.58912228
Validation score: 0.673861
Iteration 6, loss = 0.58514505
Validation score: 0.675690
Iteration 7, loss = 0.58167183
Validation score: 0.679331
Iteration 8, loss = 0.57861030
Validation score: 0.682128
Iteration 9, loss = 0.57588682
Validation score: 0.682761
Iteration 10, loss = 0.57342540
Validation score: 0.674265
Iteration 11, loss = 0.57119940
Validation score: 0.684027
Iteration 12, loss = 0.56926821
Validation score: 0.673403
Iteration 13, loss = 0.56743470
Validation score: 0.683939
Iteration 14, loss = 0.56574554
Validation score: 0.683869
Iteration 15, loss = 0.56395627
Validation score: 0.676780
Iteration 16, loss = 0.56276350
Validation score: 0.679718
Iteration 17, loss = 0.56143808
Validation score: 0.680597
Iterat

In [14]:
print(f'Train Score: {MLP.score(X_train, y_train)} \n Val Score: {MLP.score(X_val, y_val)}')

Train Score: 0.7145706616759832 
 Val Score: 0.683476774190144


In [15]:
predictions = MLP.predict(X_val)

In [16]:
from sklearn.metrics import classification_report

report = classification_report(y_val, predictions, target_names=['Non-Sarcastic', 'Sarcastic'])
print(report)

               precision    recall  f1-score   support

Non-Sarcastic       0.68      0.70      0.69     94698
    Sarcastic       0.69      0.67      0.68     94811

     accuracy                           0.68    189509
    macro avg       0.68      0.68      0.68    189509
 weighted avg       0.68      0.68      0.68    189509



**Again, we see similar performance to the other two embeddings, though the result is more in line with the `Glove-Twitter` embeddings than with the self-trained ones.**
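
As a quick check on the classification report, the f1-score is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded precision and recall values printed above, so agreement is only expected to two decimal places. The helper name `f1_score_from` is illustrative.

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded values taken from the classification report above.
non_sarcastic_f1 = f1_score_from(0.68, 0.70)
sarcastic_f1 = f1_score_from(0.69, 0.67)

print(round(non_sarcastic_f1, 2))  # 0.69
print(round(sarcastic_f1, 2))      # 0.68
```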

## Conclusion

----------------------------------------------------------------------------------------------------------------------------------------------------------

We ran `Logistic Regression`, `KNN` and `MLP` on our text data, and `MLP` had the highest performance. We would be remiss not to mention, though, that the former two models used different tokenization and vectorization methods, namely `CountVectorizer` and `TFIDF`. This process demonstrated how the quality of the data and the transformations made during feature engineering are paramount to a strong result. Had we used any of the word embeddings explored in these notebooks with `Logistic Regression` or `KNN`, we would expect them to perform better, although we did not test this.

Interestingly, our self-trained embeddings outperformed the pre-trained ones despite having much less training data and a much smaller vocabulary. An accuracy of 70% is far from perfect, but considering the complexity of the problem, it is a very welcome result.