### Notebook #8

## Table of Contents
1) <a href='#eda' id='top'>Glove Twitter</a>
2) <a href='#mod'>Model</a>
3) <a href='#fin'>Conclusion</a>

## Intro

This notebook investigates whether pre-trained word embeddings outperform embeddings trained on the dataset itself. The embeddings we have chosen are `Glove-Twitter-200`. The rationale for choosing these particular pre-trained embeddings is that the language and tone of Reddit are closer to those of Twitter than to those of Wikipedia. This similarity, along with the more robust training and vocabulary of these embeddings, may lead to different, if not better, results.

In [22]:
# importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

import regex as re
import string

import gensim
from gensim.models import Word2Vec
import gensim.downloader as api
from gensim.models import KeyedVectors
from gensim.test.utils import datapath

from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report



# suppressing warnings to keep the cell output readable
import warnings
warnings.filterwarnings(action='ignore')

In [23]:
# reading in csv and verifying shape
reddit = pd.read_csv('reddit_comments.csv', index_col=0)

reddit.shape

(1010714, 2)

## <a href='#top' id='eda'>Glove Twitter</a>

In [24]:
# loading in twitter embeddings
embeddings = api.load('glove-twitter-200')

In [42]:
# verifying and printing shape
shape = embeddings.vectors.shape

print(f'Glove Twitter consists of {shape[0]} tokens and {shape[1]} dimensional vectors')

Glove Twitter consists of 1193514 tokens and 200 dimensional vectors


**Below we look at the same group of words we examined with our self-trained embeddings, to see whether we can spot any key differences.**

In [25]:
embeddings.most_similar('sarcastic')

[('witty', 0.6640589237213135),
 ('bitchy', 0.6119109392166138),
 ('sarcasm', 0.6098357439041138),
 ('rude', 0.5943841934204102),
 ('insensitive', 0.5871057510375977),
 ('condescending', 0.5716186165809631),
 ('snarky', 0.5695698261260986),
 ('clever', 0.5692526698112488),
 ('sassy', 0.567287802696228),
 ('smartass', 0.5572746992111206)]

In [26]:
embeddings.most_similar('intention')

[('intentions', 0.6637458801269531),
 ('actions', 0.5589097142219543),
 ('intent', 0.5574334263801575),
 ('purpose', 0.5178261399269104),
 ('without', 0.4837496280670166),
 ('importance', 0.4837445616722107),
 ('intend', 0.4802425801753998),
 ('ambition', 0.4783967435359955),
 ('capable', 0.47621142864227295),
 ('consequence', 0.4741327464580536)]

In [27]:
embeddings.most_similar('mislead')

[('misled', 0.5900747776031494),
 ('deceive', 0.552757978439331),
 ('manipulate', 0.551977276802063),
 ('manipulated', 0.519615650177002),
 ('deceived', 0.5073312520980835),
 ('tricked', 0.49346911907196045),
 ('naive', 0.4644486606121063),
 ('portray', 0.45692503452301025),
 ('misinformed', 0.45582106709480286),
 ('brainwashed', 0.45131373405456543)]

In [29]:
embeddings.most_similar('genuine')

[('authentic', 0.5544952154159546),
 ('sincere', 0.5459144115447998),
 ('unique', 0.5418863296508789),
 ('silver', 0.5259337425231934),
 ('leather', 0.5243769884109497),
 ('honest', 0.518162727355957),
 ('quality', 0.5110219717025757),
 ('solid', 0.5067755579948425),
 ('genuinely', 0.50606369972229),
 ('truly', 0.5023698806762695)]

In [30]:
embeddings.most_similar('honest')

[('truthful', 0.691604733467102),
 ('rather', 0.672570526599884),
 ('loyal', 0.6538525819778442),
 ('honestly', 0.6515498161315918),
 ('honesty', 0.6430395841598511),
 ('lie', 0.6378211975097656),
 ('admit', 0.6351155638694763),
 ('being', 0.6288948059082031),
 ('person', 0.6275302171707153),
 ('true', 0.6265100836753845)]

In [39]:
# finding most similar words to 'truth' and 'lie'
lie = embeddings.most_similar('lie')
truth = embeddings.most_similar('truth')

# setting to dataframe to visualize
pd.DataFrame(
        data={"lie": [word for word, sim in lie], 
            "truth": [word for word, sim in truth]})

|   | lie | truth |
|---|-----|-------|
| 0 | lies | lie |
| 1 | tell | true |
| 2 | lying | lies |
| 3 | truth | nothing |
| 4 | n't | tell |
| 5 | never | know |
| 6 | dont | that |
| 7 | know | but |
| 8 | fool | words |
| 9 | say | about |


In [40]:
# similarity scores for 'serious' and 'joking'
serious = embeddings.most_similar('serious')
joking = embeddings.most_similar('joking')

# setting to dataframe
pd.DataFrame(
        data={"serious": [word for word, sim in serious], 
            "joking": [word for word, sim in joking]})

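|   | serious | joking |
|---|---------|--------|
| 0 | seriously | kidding |
| 1 | really | obviously |
| 2 | but | laughing |
| 3 | still | joke |
| 4 | funny | lololol |
| 5 | talk | obv |
| 6 | actually | lolol |
| 7 | damn | kiddin |
| 8 | something | jokin |
| 9 | thats | seriously |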


### Summary of differences
- 'sarcastic' does not appear in the top 10 for either 'serious' or 'joking'
- more abbreviations and informal spellings, e.g. 'obv', 'lolol', 'kiddin'
- more repetition of word forms, e.g. lie -> lies -> lying
- interestingly, the word most related to 'truth' is 'lie'
- results are for the most part different from those of the self-trained embeddings, but a similar pattern persists. The nearest neighbours are:
  i) related words or synonyms
  ii) grammatically equivalent words
  iii) antonyms
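
It may help to recall how these rankings are produced: `most_similar` orders words by the cosine similarity of their vectors, and vectors end up close when words occur in similar contexts. That is one plausible reason why an antonym such as 'lie' can rank first for 'truth'. The sketch below computes cosine similarity by hand; the three-dimensional vectors are invented for illustration and are not taken from the GloVe model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors (NOT real GloVe values): 'truth' and 'lie'
# point in almost the same direction, 'leather' points elsewhere.
toy = {
    "truth":   [0.9, 0.8, 0.1],
    "lie":     [0.8, 0.9, 0.2],
    "leather": [0.1, 0.2, 0.9],
}

sim_lie = cosine_similarity(toy["truth"], toy["lie"])
sim_leather = cosine_similarity(toy["truth"], toy["leather"])
print(f"truth vs lie:     {sim_lie:.3f}")
print(f"truth vs leather: {sim_leather:.3f}")
```

A high score therefore says that two words are used in similar contexts, not that they mean the same thing.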

---------------------------------------------------------------------------------------------------------------------------------------------------------

Below we will define the tokenizer and vectorizer functions, set our X and y, and perform the train/test split.

In [31]:
# custom tokenizer function

def tokenizer(document):

    # removing punctuation
    for punc in string.punctuation:
        document = document.replace(punc, '')
    # removing numbers and setting the document to lowercase
    document = re.sub(r"\d+", "", document).lower()
    # splitting the document on single spaces into a list of tokens
    tokens = document.split(' ')

    return tokens
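
One detail of the tokenizer above: splitting on a single space leaves empty strings wherever two spaces meet and keeps tabs or newlines attached to tokens. These leftovers are unlikely to match a vocabulary entry, so the effect is probably small. As a sketch only, a whitespace-robust variant could look like the following; `tokenize_ws` is a hypothetical name, not part of the notebook.

```python
import re
import string

# Translation table that deletes every ASCII punctuation character
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def tokenize_ws(document):
    """Strip punctuation and digits, lowercase, split on any whitespace."""
    document = document.translate(_PUNCT_TABLE)
    document = re.sub(r"\d+", "", document).lower()
    # str.split() with no argument collapses runs of whitespace
    # and never returns empty strings
    return document.split()

sample = "Yeah, RIGHT...  that's 100% true!\n"
tokens = tokenize_ws(sample)
print(tokens)  # ['yeah', 'right', 'thats', 'true']
```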

In [34]:
def vectorizer(document):

    tokens = tokenizer(document) # calling tokenizer function

    size = embeddings.vector_size # dimensionality of the word vectors
    vec_doc = np.zeros(size) # document vector, initialised to zeros
    count = 1 # divisor for the average; starting at 1 avoids division by zero

    # looping through the tokens of the document
    for word in tokens:
        # checking if word exists in the embedding vocabulary
        if word in embeddings:
            count += 1
            vec_doc += embeddings[word] # adding word embedding to doc embedding

    # averaging; since count starts at 1, this is the sum divided by (matches + 1)
    vec_doc = vec_doc / count

    return vec_doc
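
Because the counter in `vectorizer` starts at 1, each document vector is the sum of its word vectors divided by one more than the number of matched words, which shrinks short comments towards zero. For comparison, here is a minimal sketch of a strict mean. It uses plain Python lists and a toy dictionary in place of NumPy arrays and the GloVe model, so `mean_vector` and `toy_vectors` are illustrative names only.

```python
def mean_vector(tokens, vectors, size):
    """Average the vectors of in-vocabulary tokens; all zeros if none match."""
    matched = [vectors[t] for t in tokens if t in vectors]
    if not matched:
        return [0.0] * size
    # zip(*matched) walks the vectors component by component
    return [sum(component) / len(matched) for component in zip(*matched)]

# Hypothetical two-dimensional "embeddings" for illustration
toy_vectors = {"good": [1.0, 3.0], "bad": [3.0, 1.0]}
doc = ["good", "bad", "unknownword"]

strict = mean_vector(doc, toy_vectors, size=2)
print(strict)  # [2.0, 2.0]

# The notebook's divisor (matches + 1) gives a smaller vector for the same doc
shrunk = [(1.0 + 3.0) / 3, (3.0 + 1.0) / 3]
print(shrunk)
```

Whether the strict mean would change the classifier's scores is not something this notebook tests.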

In [35]:
# applying vectorizer function to comments
reddit['vectors'] = reddit['comment'].apply(vectorizer)

In [36]:
# instantiating X and y
X = reddit['vectors']
y = reddit['label']

X = list(X)

In [37]:
# train/test split
X_rem, X_test, y_rem, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_rem, y_rem, test_size=0.25, random_state=42)
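
Applying `test_size=0.25` twice does not give a 50/25/25 split: the second call takes a quarter of the remaining 75%. The arithmetic below works out the resulting sizes from the row count reported by `reddit.shape`. It assumes that scikit-learn rounds the test portion up, which agrees with the support total of 189509 in the classification report further down.

```python
import math

n_rows = 1010714                 # rows in reddit, from reddit.shape

n_test = math.ceil(0.25 * n_rows)   # first split: held-out test set
n_rem = n_rows - n_test             # remainder passed to the second split
n_val = math.ceil(0.25 * n_rem)     # second split: validation set
n_train = n_rem - n_val             # what is left for training

for name, n in [("train", n_train), ("val", n_val), ("test", n_test)]:
    print(f"{name:5s} {n:7d}  {n / n_rows:.2%}")
```

The model is therefore trained on roughly 56% of the data, validated on roughly 19%, and the final 25% stays untouched.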

## <a href='#top' id='mod'>Model</a>

In [38]:
# random seed
np.random.seed(42)
# mlp classifier instantiation
MLP = MLPClassifier(verbose=True, max_iter=50, early_stopping=True)
# fitting to train
MLP.fit(X_train, y_train)

Iteration 1, loss = 0.62623999
Validation score: 0.662674
Iteration 2, loss = 0.60506743
Validation score: 0.670308
Iteration 3, loss = 0.59702490
Validation score: 0.673720
Iteration 4, loss = 0.59140491
Validation score: 0.676745
Iteration 5, loss = 0.58761974
Validation score: 0.678381
Iteration 6, loss = 0.58442868
Validation score: 0.678944
Iteration 7, loss = 0.58177573
Validation score: 0.677449
Iteration 8, loss = 0.57942868
Validation score: 0.679823
Iteration 9, loss = 0.57764762
Validation score: 0.680949
Iteration 10, loss = 0.57602169
Validation score: 0.680421
Iteration 11, loss = 0.57410248
Validation score: 0.677695
Iteration 12, loss = 0.57274648
Validation score: 0.680967
Iteration 13, loss = 0.57145793
Validation score: 0.679419
Iteration 14, loss = 0.57031113
Validation score: 0.683341
Iteration 15, loss = 0.56916527
Validation score: 0.681705
Iteration 16, loss = 0.56786190
Validation score: 0.679648
Iteration 17, loss = 0.56702685
Validation score: 0.680369
Iterat
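
A note on the log above: with `early_stopping=True`, `MLPClassifier` sets aside part of the training data (10% by default, via `validation_fraction`) and the `Validation score` lines refer to that internal hold-out, not to `X_val`. Training stops once the score has failed to improve by at least `tol` for more than `n_iter_no_change` consecutive epochs (defaults 1e-4 and 10). The function below is a simplified sketch of that patience rule, not scikit-learn's actual code; `epochs_until_stop` and its parameter names are made up for illustration.

```python
def epochs_until_stop(scores, tol=1e-4, patience=10):
    """Return the epoch at which patience runs out, or None if it never does."""
    best = float("-inf")
    stale = 0  # consecutive epochs without a meaningful improvement
    for epoch, score in enumerate(scores, start=1):
        if score < best + tol:
            stale += 1
        else:
            stale = 0
        if score > best:
            best = score
        if stale > patience:
            return epoch
    return None

# Validation scores copied from the 17 complete iterations logged above
logged = [0.662674, 0.670308, 0.673720, 0.676745, 0.678381, 0.678944,
          0.677449, 0.679823, 0.680949, 0.680421, 0.677695, 0.680967,
          0.679419, 0.683341, 0.681705, 0.679648, 0.680369]

print(epochs_until_stop(logged))              # None
print(epochs_until_stop(logged, patience=2))  # 12
```

Under this simplified rule the 17 logged epochs do not exhaust a patience of 10, since the best score so far arrives at iteration 14. The log is truncated, so what happened afterwards cannot be read from it.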

In [44]:
print(f'Train Score: {MLP.score(X_train, y_train)}\nVal Score: {MLP.score(X_val, y_val)}')

Train Score: 0.7029212384306083
Val Score: 0.6818145840039259


**The results are similar to, but slightly lower than, those of the self-trained model, which outperformed this one by about 2 percentage points on both train and validation.**

Let us take a quick look at the precision and recall scores.

In [26]:
predictions = MLP.predict(X_val)
report = classification_report(y_val, predictions, target_names=['Non-Sarcastic', 'Sarcastic'])

print(report)

               precision    recall  f1-score   support

Non-Sarcastic       0.68      0.70      0.69     94698
    Sarcastic       0.69      0.66      0.68     94811

     accuracy                           0.68    189509
    macro avg       0.68      0.68      0.68    189509
 weighted avg       0.68      0.68      0.68    189509



Rather predictably, the precision and recall scores stay within about 2 percentage points of the accuracy score (0.66 to 0.70 against 0.68) and do not deviate much from one another in either the sarcastic or the non-sarcastic class.

## <a href='#top' id='fin'>Conclusion</a>

All in all, the `Glove-Twitter` embeddings performed quite similarly to the self-trained embeddings, though slightly worse. The difference is slim but potentially significant, as every fraction of a percentage point counts towards an optimized solution. We could see some differences in the word similarities between the two sets of embeddings, but it is hard to say what these mean for their predictive value. In the final notebook we will use another set of pre-trained embeddings, `Word2Vec Google News`.