## Prerequisites

gensim==3.6.0

In [6]:
from string import punctuation

import numpy as np
import pandas as pd 
from gensim.models import Word2Vec, KeyedVectors
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer 
  
lemmatizer = WordNetLemmatizer() 
stop_words = set(stopwords.words('english'))

In [4]:
df = pd.read_csv("../jigsaw-toxic-comment-classification-challenge/train.csv")

In [7]:
def preprocess_text(tokenizer, lemmatizer, stop_words, punctuation, text): 
    tokens = tokenizer(text.lower())
    lemmas = [lemmatizer.lemmatize(token) for token in tokens]
    return [token for token in lemmas if token not in stop_words and token not in punctuation]

df['cleaned'] = df.comment_text.apply(lambda x: preprocess_text(word_tokenize, lemmatizer, stop_words, punctuation, x))

In [41]:
df_sample = df.sample(100000)

In [42]:
df_sample.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate,cleaned
134936,d1ab7227677c8368,and revert all my damned edits that I worked h...,0,0,0,0,0,0,"[revert, damned, edits, worked, hard]"
105276,333a6df1eb61a375,"Hello \n\nYou seem busy! LoL! Anyway, new wiki...",0,0,0,0,0,0,"[hello, seem, busy, lol, anyway, new, wikipedi..."
10669,1c34929d44952720,Copyright problem \nThis article has been revi...,0,0,0,0,0,0,"[copyright, problem, article, ha, revised, par..."
43048,72d8591b73fa55a0,I have accepted your apology. Maybe our paths ...,0,0,0,0,0,0,"[accepted, apology, maybe, path, cross, may, f..."
145606,1f70994bd5c269fd,First ever Reading Wiki Meetup\nYou are invite...,0,0,0,0,0,0,"[first, ever, reading, wiki, meetup, invited, ..."


### Train the model from scratch

Train our first model based on the vocabulary from df_sample: 

In [50]:
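# Passing `sentences` to the constructor builds the vocabulary and trains for the default 5 epochs

model = Word2Vec(sentences=df_sample.cleaned.tolist(),
                 size=100,     # embedding vector size
                 min_count=5,  # ignore words that occur fewer than 5 times
                 window=5)     # maximum distance between the current and predicted word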

In [51]:
# Continue training the model 

model.train(sentences=df_sample.cleaned.tolist(), 
            total_examples=model.corpus_count,
            epochs=30
           )

(99831505, 118135380)

In [53]:
# model.wv.vocab # to look at vocabulary 

In [54]:
model.wv.most_similar('people')  # query via model.wv; model.most_similar is deprecated

[('others', 0.6248910427093506),
 ('thing', 0.5807790756225586),
 ('person', 0.5554934740066528),
 ('editor', 0.5365866422653198),
 ('everyone', 0.5280076265335083),
 ('really', 0.5098308324813843),
 ('way', 0.5067179203033447),
 ('admins', 0.5010358095169067),
 ('someone', 0.49282339215278625),
 ('anyone', 0.48838675022125244)]

### Next approach: use an already pretrained model, which can be downloaded from here:

https://github.com/RaRe-Technologies/gensim-data

model:   
GoogleNews-vectors-negative300.bin

In [57]:
model = KeyedVectors.load_word2vec_format("../GoogleNews-vectors-negative300.bin", binary=True)

In [58]:
# You can try to use GloVe model too and experiment with it: 
# import gensim.downloader as api
# model = api.load('glove-wiki-gigaword-100')

## Word distance

### 1 - Cosine similarity

To measure how similar two words are, we need a way to measure the similarity between their embedding vectors. Given two vectors $u$ and $v$, cosine similarity is defined as follows:

$$\text{CosineSimilarity}(u, v) = \frac{u \cdot v}{||u||_2 \, ||v||_2} = \cos(\theta) \tag{1}$$

where $u \cdot v$ is the dot product (or inner product) of the two vectors, $||u||_2$ is the norm (or length) of the vector $u$, and $\theta$ is the angle between $u$ and $v$. The similarity depends only on the angle between $u$ and $v$, not on their lengths. If $u$ and $v$ point in nearly the same direction, their cosine similarity is close to 1; if they are orthogonal it is 0, and if they point in opposite directions it is -1.

<img src="cosine_sim.png" style="width:800px;height:250px;">
<caption><center> **Figure 1**: The cosine of the angle between two vectors is a measure of how similar they are</center></caption>

**Exercise**: Implement the function `cosine_similarity()` to evaluate similarity between word vectors.

**Reminder**: The norm of $u$ is defined as $ ||u||_2 = \sqrt{\sum_{i=1}^{n} u_i^2}$
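
As a quick sanity check of equation (1), the sketch below works through the formula on two small made-up vectors. It is written in plain Python on purpose, so it illustrates the arithmetic without replacing the NumPy exercise. The vectors and the helper names `dot` and `l2_norm` are illustrative assumptions, not part of the notebook.

```python
import math

def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def l2_norm(u):
    """Euclidean (L2) norm: square root of the sum of squares."""
    return math.sqrt(sum(a * a for a in u))

# Toy vectors chosen so the arithmetic is easy to follow (not real embeddings)
u = [3.0, 4.0]
v = [4.0, 3.0]

# u . v = 3*4 + 4*3 = 24 and ||u|| = ||v|| = 5, so the similarity is 24 / 25
similarity = dot(u, v) / (l2_norm(u) * l2_norm(v))
print(similarity)
```

Scaling either vector by a positive constant leaves the result unchanged, which is why cosine similarity is often preferred over Euclidean distance for comparing embeddings.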

In [62]:
def cosine_similarity(w1, w2):
    """
    Cosine similarity between w1 and w2
    
    Arguments:
        w1 : word vector        
        w2 : word vector 
    Returns:
        cosine_similarity 
    """
    if (not np.any(w1) or not np.any(w2)): # check input is not zero-vector
        return 0
    
    # Dot product between w1 and w2
    dot = None ### YOUR CODE HERE 
    # L2 norm of w1
    norm_u = None ### YOUR CODE HERE 
    # L2 norm of w2 
    norm_v = None ### YOUR CODE HERE 
    # Cosine similarity 
    cosine_similarity = None ### YOUR CODE HERE 
    
    return cosine_similarity

In [63]:
father = model.get_vector("father")
mother = model.get_vector("mother")
ball = model.get_vector("ball")
crocodile = model.get_vector("crocodile")
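france = model.get_vector("france")
italy = model.get_vector("italy")
paris = model.get_vector("paris")  # paris and rome are used by the last check below
rome = model.get_vector("rome")
kiev = model.get_vector("kiev")
ukraine = model.get_vector("ukraine")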

# print("cosine_similarity(father, mother) = ", cosine_similarity(father, mother))
# print("cosine_similarity(ball, crocodile) = ",cosine_similarity(ball, crocodile))
# print("cosine_similarity(france - paris, rome - italy) = ",cosine_similarity(france - paris, rome - italy))

**Approximate expected output**:

<table>
    <tr>
        <td>
            **cosine_similarity(father, mother)** =
        </td>
        <td>
         0.79014826
        </td>
    </tr>
        <tr>
        <td>
            **cosine_similarity(ball, crocodile)** =
        </td>
        <td>
         0.10283585
        </td>
    </tr>
        <tr>
        <td>
            **cosine_similarity(france - paris, rome - italy)** =
        </td>
        <td>
         -0.421037
        </td>
    </tr>
</table>

#### The next part of the task is to:  

1. Train your own W2V model using the method shown above. Use all of the tokens produced by your preprocessing pipeline from the previous tasks (removing stop words and punctuation, lowercasing, etc.; experiment as you like).  
2. Turn the word vectors into text vectors with the following pipeline: 
   1. For each word in a preprocessed text, get its word vector from the W2V model. 
   2. Combine the word vectors into a single vector for the text (sum them, or take their mean). 
3. Use the text vectors as the text representation in a text classification task.  
   Suggested setup: binary classification (for example, keep only the 'obscene' texts and the clean texts and try to distinguish one from the other).
4. Calculate the metrics: TP, FP, FN, TN, precision, recall, F1 score, F2 score, accuracy. 
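
Step 2 of the list above can be sketched as follows. This is a minimal illustration under assumptions, not a reference solution: `toy_vectors` is a hypothetical two-dimensional stand-in for `model.wv`, and `mean_text_vector` is an illustrative helper name. Skipping unknown tokens and returning a zero vector for texts with no known tokens is one possible design choice.

```python
def mean_text_vector(tokens, word_vectors, dim):
    """Average the vectors of the tokens found in `word_vectors`.

    Tokens missing from the vocabulary are skipped; a text with no known
    tokens gets an all-zero vector so every text has the same length.
    """
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim
    # Component-wise mean over the known word vectors
    return [sum(component) / len(known) for component in zip(*known)]

# Hypothetical 2-dimensional "embeddings" standing in for model.wv
toy_vectors = {
    "good": [1.0, 0.0],
    "article": [0.0, 1.0],
    "bad": [-1.0, 0.0],
}

print(mean_text_vector(["good", "article", "unknownword"], toy_vectors, dim=2))
print(mean_text_vector(["unknownword"], toy_vectors, dim=2))
```

With real embeddings the same idea is usually written with NumPy, for example by stacking the word vectors and calling `np.mean(..., axis=0)`.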


In [66]:
### Your code here 

#### The second part of the task is: 

1. When vectorizing texts (step 2 above), multiply each word vector by the word's tf-idf weight, so that the text vector is a weighted average. 
2. Perform the same text classification task as above. 
3. Calculate the metrics and compare them with the unweighted vectorization approach. 
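
The weighting described above can be sketched like this. The sketch assumes the unsmoothed variant idf(t) = log(N / df(t)) and a term frequency normalized by text length; other tf-idf variants are equally valid. The helper names, the toy documents and the two-dimensional `toy_vectors` are all hypothetical.

```python
import math
from collections import Counter

def idf_weights(tokenized_docs):
    """Inverse document frequency, idf(t) = log(N / df(t)) (unsmoothed variant)."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(token for doc in tokenized_docs for token in set(doc))
    return {token: math.log(n_docs / df) for token, df in doc_freq.items()}

def weighted_text_vector(tokens, word_vectors, idf, dim):
    """TF-IDF weighted average of the word vectors of one text."""
    term_freq = Counter(t for t in tokens if t in word_vectors and t in idf)
    total = [0.0] * dim
    weight_sum = 0.0
    for token, count in term_freq.items():
        weight = (count / len(tokens)) * idf[token]  # tf * idf
        weight_sum += weight
        for i, value in enumerate(word_vectors[token]):
            total[i] += weight * value
    if weight_sum == 0.0:
        return [0.0] * dim  # no known token carries any weight
    return [value / weight_sum for value in total]

# Toy corpus and hypothetical 2-dimensional embeddings
docs = [["good", "article"], ["bad", "article"], ["good", "edit", "article"]]
toy_vectors = {"good": [1.0, 0.0], "bad": [-1.0, 0.0],
               "article": [0.0, 1.0], "edit": [0.0, -1.0]}

idf = idf_weights(docs)
print(idf["article"])  # "article" occurs in every text, so log(3 / 3) = 0
print(weighted_text_vector(docs[1], toy_vectors, idf, dim=2))
```

With this variant, a word that appears in every text gets weight zero and drops out of the average. scikit-learn's `TfidfVectorizer` uses a smoothed idf by default, so its weights would differ from the ones computed here.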

In [67]:
### Your code here 

#### The third part of the task is: 

1. Use a pre-trained W2V model to obtain a word vector for each token in your dataset, and create text vectors WITHOUT tf-idf weighting. 
2. Train a text classification model.
3. Calculate the metrics.
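
All of the metrics requested in these tasks follow from the four confusion-matrix counts. The sketch below computes them in plain Python for made-up labels, assuming 1 marks the positive class (for example 'obscene'). The function name `binary_metrics` and the toy labels are illustrative assumptions.

```python
def binary_metrics(y_true, y_pred):
    """Confusion counts and derived scores for 0/1 labels (1 = positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0

    def f_beta(beta):
        # F-beta weighs recall beta times as much as precision
        denom = beta * beta * precision + recall
        return (1 + beta * beta) * precision * recall / denom if denom else 0.0

    return {
        "TP": tp, "FP": fp, "FN": fn, "TN": tn,
        "precision": precision,
        "recall": recall,
        "F1": f_beta(1.0),
        "F2": f_beta(2.0),
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
    }

# Made-up labels and predictions, only to exercise the function
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1, 0, 0, 1]

metrics = binary_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(name, value)
```

Because F2 favors recall, it falls below F1 whenever recall is lower than precision, as in this toy example.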

In [68]:
### Your code here 

#### The fourth part of the task is: 

1. Use a pre-trained W2V model to obtain a word vector for each token in your dataset, and create text vectors WITH tf-idf weighting. 
2. Train a text classification model. 
3. Calculate the metrics. 

In [73]:
### Your code here

### Visualization part 

Use a dimensionality reduction method such as t-SNE or PCA to project your word vectors (300-dimensional for the pretrained model) to 2D for plotting. 

Select the top 10-20 words for each category BY TF-IDF SCORE, not by counts! 

Plot all of these words on ONE plot, using a different color for the top words of each category (obscene, clean, toxic, etc.). 

Check whether words from one category are closer to each other than to words from other categories, or whether you instead observe roughly two clusters: all of the toxic words and the clean words.  
Explain what you see and why. 
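
Selecting the top words by tf-idf score could look roughly like the sketch below. It ranks words within each category by their summed tf-idf score over that category's texts, which gives the same order as the mean within a category. This is only one reasonable reading of "top words by tf-idf". The smoothed idf formula, the function name and the toy texts are assumptions. The selected words would then be looked up in the W2V model and projected to 2D, which is not shown here.

```python
import math
from collections import Counter, defaultdict

def top_words_by_tfidf(tokenized_docs, labels, k=2):
    """Top-k words per label, ranked by summed TF-IDF over that label's texts."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(token for doc in tokenized_docs for token in set(doc))
    # Smoothed idf, so words present in every text keep a non-zero weight
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

    score_sums = defaultdict(Counter)  # label -> word -> summed tf-idf
    for doc, label in zip(tokenized_docs, labels):
        for token, count in Counter(doc).items():
            score_sums[label][token] += (count / len(doc)) * idf[token]

    top = {}
    for label, scores in score_sums.items():
        # Highest score first; ties are broken alphabetically for determinism
        ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
        top[label] = [word for word, _ in ranked[:k]]
    return top

# Made-up tokenized texts with one label each
docs = [
    ["thanks", "edit", "article"],
    ["article", "source", "thanks"],
    ["stupid", "idiot", "article"],
    ["idiot", "stupid", "stupid"],
]
labels = ["clean", "clean", "toxic", "toxic"]

print(top_words_by_tfidf(docs, labels, k=2))
```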


In [69]:
### Your code here 

### Additional part: 

1. Find pre-trained FastText vectors and understand how they differ from W2V vectors. 
2. Vectorize all of your texts using the FT model, perform text classification, calculate the metrics, and compare with the W2V approach. 

And/or you can:

1. Train your own FT model and do the same. 
2. Compare it with the previous approaches.
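
One key difference from W2V is that FastText builds a word's vector from the vectors of its character n-grams, so it can produce a vector even for a word it never saw during training. The sketch below only shows how such n-grams are extracted, assuming the usual `<` and `>` word-boundary markers and n-gram lengths from 3 to 6. The function name `char_ngrams` is illustrative and not a FastText API.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams

# Only 3-grams here, to keep the output short
print(char_ngrams("where", min_n=3, max_n=3))
```

A misspelled or rare word shares most of its n-grams with the correct or common form, which is relevant for noisy user comments like the ones in this dataset.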

In [70]:
### Your code here 

### Conclusions: 

Please provide a clear table or dataframe with all of the metrics for all of the models you trained or used.   

Compare them to each other.   

Conclude which of your models worked best for this particular task.   
BE CAREFUL: Better performance on this particular task does not mean that the model is better than the others in GENERAL. Draw your own conclusions about this particular model applied to this particular task. Please think about and understand WHY.   
Write your thoughts down below: 



In [72]:
### Your conclusions here.

In [71]:
### Your thoughts about the last question here. 