# How to get started with Word2Vec — and then how to make it work

This notebook is closely adapted from Kavita Ganesan's excellent blog article: https://medium.freecodecamp.org/how-to-get-started-with-word2vec-and-then-how-to-make-it-work-d0a2fca9dad3.
As the title says, we'll learn how to use the Gensim implementation of Word2Vec and actually get it to work. [...] As Kavita points out, obtaining usable results depends on a well set-up combination of two things: (1) your input data and (2) your parameter settings.

First, we start with our imports and get logging established:

In [1]:
# imports needed and logging
import gzip
import gensim 
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)



Here, we will work on a dataset from Quora, provided in the scope of the "Insincere Questions Classification" on Kaggle. It contains sincere and insincere questions; an insincere question is defined as a question intended to make a statement rather than look for helpful answers. The training data includes the question that was asked, and whether it was identified as insincere (target = 1). The ground-truth labels contain some amount of noise: they are not guaranteed to be perfect. (from: https://www.kaggle.com/c/quora-insincere-questions-classification/data)

For working on the data, let's assume we have downloaded and unzipped the dataset (from the source given above), which contains one train.csv and one test.csv table with the textual data and labels.

As the dataset is very big, we can start with a shortened file that contains only the first few thousand entries (we can still use the entire files for the final training run later on). To do this, we open the folder containing the train.csv file in a terminal window (Linux) or in a comparable command window that provides Linux-style commands on Windows (e.g. Git Bash or Cygwin). We can then extract the first few thousand lines of train.csv into a shortened file with this Linux command:

head -n NUMBEROFLINES file.csv > mynewfile.csv
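
If no Unix-style shell is at hand, the same shortening can be done from Python. This is a minimal sketch and not part of the original notebook; the function name `write_head` and the file names in the example call are placeholders. Like `head`, it counts physical lines, so a quoted CSV field that contains a line break could be cut in half at the boundary.

```python
from itertools import islice

def write_head(src_path, dst_path, number_of_lines):
    """Copy the first `number_of_lines` lines of src_path to dst_path."""
    with open(src_path, encoding='utf-8', newline='') as src, \
         open(dst_path, 'w', encoding='utf-8', newline='') as dst:
        # islice stops after the requested number of lines,
        # so the rest of the (large) file is never read
        dst.writelines(islice(src, number_of_lines))

# Example call (file names are placeholders):
# write_head('train.csv', 'train_short.csv', 100000)
```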

Let’s take a closer look at our data below by printing the first line(s).

In [18]:
import csv

def csv_to_list_raw(filename):
    sents = []
    with open(filename, newline='', encoding='utf-8') as csvfile:
        csvreader = csv.reader(csvfile, delimiter=',')
        for row in csvreader:
            sents.append(row)
    return sents

train_file = 'train_short.csv'
train_raw = csv_to_list_raw(train_file)
print ("First rows of train_short.csv: ", train_raw[0:10])

First rows of train_short.csv:  [['qid', 'question_text', 'target'], ['00002165364db923c7e6', 'How did Quebec nationalists see their province as a nation in the 1960s?', '0'], ['000032939017120e6e44', 'Do you have an adopted dog, how would you encourage people to adopt and not shop?', '0'], ['0000412ca6e4628ce2cf', 'Why does velocity affect time? Does velocity affect space geometry?', '0'], ['000042bf85aa498cd78e', 'How did Otto von Guericke used the Magdeburg hemispheres?', '0'], ['0000455dfa3e01eae3af', 'Can I convert montra helicon D to a mountain bike by just changing the tyres?', '0'], ['00004f9a462a357c33be', 'Is Gaza slowly becoming Auschwitz, Dachau or Treblinka for Palestinians?', '0'], ['00005059a06ee19e11ad', 'Why does Quora automatically ban conservative opinions when reported, but does not do the same for liberal views?', '0'], ['0000559f875832745e2e', 'Is it crazy if I wash or wipe my groceries off? Germs are everywhere.', '0'], ['00005bd3426b2d0c8305', 'Is there such a t
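
To get a feeling for how the labels are distributed, the rows loaded above can be tallied by their `target` column. A minimal sketch: the helper name `count_targets` is made up for illustration, and the toy rows merely stand in for `train_raw`.

```python
from collections import Counter

def count_targets(rows):
    """Count how often each label occurs, skipping the header row."""
    # rows[0] is the header ['qid', 'question_text', 'target'];
    # the label is the third field of every following row
    return Counter(row[2] for row in rows[1:])

# Toy rows with made-up ids and texts, in the same layout as train_raw
toy_rows = [
    ['qid', 'question_text', 'target'],
    ['a1', 'first toy question?', '0'],
    ['a2', 'second toy question?', '0'],
    ['a3', 'third toy question?', '1'],
]
print(count_targets(toy_rows))
```

With the real data, the call would be `count_targets(train_raw)`.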

We'll adapt this function a bit now and apply a mild pre-processing to the text with gensim.utils.simple_preprocess(row[1]). We only use the question text itself and leave aside the labels that say whether a question was "insincere" or not, as they are not the primary scope of this exercise.
The simple_preprocess function does some basic pre-processing, such as tokenization and lowercasing, and returns a list of tokens (words).
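
To make the effect of this pre-processing more tangible, here is a rough pure-Python approximation. It is not Gensim's implementation: `rough_preprocess` is a made-up name, the length limits mirror what we understand to be the defaults of simple_preprocess (`min_len=2`, `max_len=15`), and options such as de-accenting are ignored.

```python
import re

def rough_preprocess(text, min_len=2, max_len=15):
    """Rough stand-in: lowercase, keep alphabetic runs, filter tokens by length."""
    # (?!\d)\w matches a word character that is not a digit,
    # so "1960s" only yields the leftover "s"
    tokens = re.findall(r'(?:(?!\d)\w)+', text.lower())
    # Very short tokens such as "a", "i" or the leftover "s" are dropped
    return [t for t in tokens if min_len <= len(t) <= max_len]

print(rough_preprocess(
    "How did Quebec nationalists see their province as a nation in the 1960s?"))
```

This explains why short words like "a" and "I" and numbers like "1960s" do not show up in the token lists.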

In [3]:
import csv

def csv_to_list(filename):
    sents = []
    with open(filename, newline='', encoding='utf-8') as csvfile:
        csvreader = csv.reader(csvfile, delimiter=',')
        for row in csvreader:
            row_text = gensim.utils.simple_preprocess(row[1])
            sents.append(row_text)
    return sents

train_file = 'train_short.csv'
sentences = csv_to_list(train_file)
print ("First rows of train_short.csv: ", sentences[0:10])

First rows of train_short.csv:  [['question_text'], ['how', 'did', 'quebec', 'nationalists', 'see', 'their', 'province', 'as', 'nation', 'in', 'the'], ['do', 'you', 'have', 'an', 'adopted', 'dog', 'how', 'would', 'you', 'encourage', 'people', 'to', 'adopt', 'and', 'not', 'shop'], ['why', 'does', 'velocity', 'affect', 'time', 'does', 'velocity', 'affect', 'space', 'geometry'], ['how', 'did', 'otto', 'von', 'guericke', 'used', 'the', 'magdeburg', 'hemispheres'], ['can', 'convert', 'montra', 'helicon', 'to', 'mountain', 'bike', 'by', 'just', 'changing', 'the', 'tyres'], ['is', 'gaza', 'slowly', 'becoming', 'auschwitz', 'dachau', 'or', 'treblinka', 'for', 'palestinians'], ['why', 'does', 'quora', 'automatically', 'ban', 'conservative', 'opinions', 'when', 'reported', 'but', 'does', 'not', 'do', 'the', 'same', 'for', 'liberal', 'views'], ['is', 'it', 'crazy', 'if', 'wash', 'or', 'wipe', 'my', 'groceries', 'off', 'germs', 'are', 'everywhere'], ['is', 'there', 'such', 'thing', 'as', 'dressing

## Training the Word2Vec model

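Radim Řehůřek, the creator of the gensim package, explains the parameters we're going to set for creating the model as follows (https://radimrehurek.com/gensim/models/word2vec.html). Note that the names follow the Gensim 3.x API used in this notebook; from Gensim 4.0 on, size is called vector_size.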

- sentences (iterable of iterables, optional) – The sentences iterable can be simply a list of lists of tokens, but for larger corpora, consider an iterable that streams the sentences directly from disk/network. See BrownCorpus, Text8Corpus or LineSentence in word2vec module for such examples. See also the tutorial on data streaming in Python. If you don’t supply sentences, the model is left uninitialized – use if you plan to initialize it in some other way.
- corpus_file (str, optional) – Path to a corpus file in LineSentence format. You may use this argument instead of sentences to get performance boost. Only one of sentences or corpus_file arguments need to be passed (or none of them, in that case, the model is left uninitialized).
- size (int, optional) – Dimensionality of the word vectors.
- window (int, optional) – Maximum distance between the current and predicted word within a sentence.
- min_count (int, optional) – Ignores all words with total frequency lower than this.
- workers (int, optional) – Use these many worker threads to train the model (=faster training with multicore machines).
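
The "iterable that streams the sentences directly from disk" mentioned for the sentences parameter could look like the sketch below. The class name `QuestionCorpus` and the whitespace tokenizer are assumptions made for illustration; to match this notebook, pass `gensim.utils.simple_preprocess` as the tokenizer. Word2Vec makes several passes over the corpus (one to build the vocabulary, then one per epoch), which is why a restartable iterable is used here rather than a plain generator.

```python
import csv

class QuestionCorpus:
    """Restartable iterable that yields one token list per CSV row."""

    def __init__(self, filename, tokenize=None):
        self.filename = filename
        # Stand-in tokenizer (assumption): lowercase and split on whitespace
        self.tokenize = tokenize or (lambda text: text.lower().split())

    def __iter__(self):
        # Opening the file anew on every pass allows repeated iteration
        with open(self.filename, newline='', encoding='utf-8') as csvfile:
            for row in csv.reader(csvfile, delimiter=','):
                yield self.tokenize(row[1])

# Example (file name as used in this notebook):
# sentences = QuestionCorpus('train_short.csv', gensim.utils.simple_preprocess)
```

The later cells of this notebook index into `sentences` and call `len(sentences)`, so they still need the in-memory list; the streaming variant is only an option for the model training itself.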

The documentation also describes an optional trim_rule callable, which we do not use here. Its input parameters are of the following types:
- word (str) – the word we are examining
- count (int) – the word’s frequency count in the corpus
- min_count (int) – the minimum count threshold.
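
What a frequency threshold like min_count does to the vocabulary can be illustrated without Gensim. This is a toy sketch; `keep_frequent` is a made-up helper and not how Gensim implements its pruning.

```python
from collections import Counter

def keep_frequent(sentences, min_count):
    """Return the words whose total corpus frequency is at least min_count."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, count in counts.items() if count >= min_count}

toy_sentences = [['why', 'does', 'velocity', 'affect', 'time'],
                 ['does', 'velocity', 'affect', 'space', 'geometry']]

print(sorted(keep_frequent(toy_sentences, min_count=1)))  # every word survives
print(sorted(keep_frequent(toy_sentences, min_count=2)))  # only the repeated words survive
```

With min_count=1, as used in this notebook, every token is kept, including one-off typos. That is probably why misspellings such as 'perspctive' can appear among the nearest neighbours further down.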

In [41]:
# build vocabulary and train model
model = gensim.models.Word2Vec(
    sentences,
    size=150,
    window=10,
    min_count=1,
    workers=10)

2019-03-28 00:22:06,111 : INFO : collecting all words and their counts
2019-03-28 00:22:06,114 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-03-28 00:22:06,252 : INFO : PROGRESS: at sentence #10000, processed 120880 words, keeping 14878 word types
2019-03-28 00:22:06,368 : INFO : PROGRESS: at sentence #20000, processed 242134 words, keeping 21567 word types
2019-03-28 00:22:06,483 : INFO : PROGRESS: at sentence #30000, processed 363093 words, keeping 26675 word types
2019-03-28 00:22:06,593 : INFO : PROGRESS: at sentence #40000, processed 484276 words, keeping 30912 word types
2019-03-28 00:22:06,713 : INFO : PROGRESS: at sentence #50000, processed 605073 words, keeping 34597 word types
2019-03-28 00:22:06,834 : INFO : PROGRESS: at sentence #60000, processed 726181 words, keeping 38005 word types
2019-03-28 00:22:06,947 : INFO : PROGRESS: at sentence #70000, processed 848209 words, keeping 41047 word types
2019-03-28 00:22:07,059 : INFO : PROGRESS: at 

After building the vocabulary, we just need to call train(...) to train the Word2Vec model. Note that because we passed sentences to the constructor above, Gensim has already built the vocabulary and run a first round of training there; the explicit train(...) call below continues training on the same data. Behind the scenes we are training a simple neural network with a single hidden layer. We are not going to use the neural network itself after training, though. Instead, the goal is to learn the weights of the hidden layer. These weights are essentially the word vectors that we’re trying to learn.
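
Why the hidden-layer weights can be read as word vectors becomes clearer with a toy example: multiplying a one-hot input vector by the weight matrix simply picks out one row. The vocabulary and all numbers below are made up, and the real model has 150 dimensions rather than 2.

```python
# Toy hidden-layer weight matrix: one row per vocabulary word,
# one column per vector dimension (all values are made up)
vocabulary = ['crime', 'treason', 'learning']
hidden_weights = [
    [0.9, 0.1],   # crime
    [0.8, 0.2],   # treason
    [-0.3, 0.7],  # learning
]

def one_hot(word):
    """Input-layer encoding: 1.0 at the word's index, 0.0 elsewhere."""
    return [1.0 if w == word else 0.0 for w in vocabulary]

def hidden_activation(word):
    """Multiply the one-hot input vector by the weight matrix."""
    x = one_hot(word)
    return [sum(x[i] * hidden_weights[i][d] for i in range(len(vocabulary)))
            for d in range(len(hidden_weights[0]))]

# The result equals the 'treason' row of the weight matrix
print(hidden_activation('treason'))
```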

In short, again from Radim:
With the "train" method, we'll "update the model’s neural weights from a sequence of sentences".
The parameter "epochs" is defined as follows:
- epochs (int) – Number of iterations (epochs) over the corpus.

In the original article, training on the OpinRank dataset takes about 10–15 minutes. On our shortened Quora file it is much quicker: the log below reports roughly 2 seconds per epoch.

In [42]:
model.train(sentences, total_examples=len(sentences), epochs=10)

2019-03-28 00:25:42,624 : INFO : training model with 10 workers on 49058 vocabulary and 150 features, using sg=0 hs=0 sample=0.001 negative=5 window=10
2019-03-28 00:25:43,769 : INFO : EPOCH 1 - PROGRESS: at 47.10% examples, 393751 words/s, in_qsize 18, out_qsize 1
2019-03-28 00:25:44,618 : INFO : worker thread finished; awaiting finish of 9 more threads
2019-03-28 00:25:44,636 : INFO : worker thread finished; awaiting finish of 8 more threads
2019-03-28 00:25:44,649 : INFO : worker thread finished; awaiting finish of 7 more threads
2019-03-28 00:25:44,652 : INFO : worker thread finished; awaiting finish of 6 more threads
2019-03-28 00:25:44,655 : INFO : worker thread finished; awaiting finish of 5 more threads
2019-03-28 00:25:44,671 : INFO : worker thread finished; awaiting finish of 4 more threads
2019-03-28 00:25:44,678 : INFO : worker thread finished; awaiting finish of 3 more threads
2019-03-28 00:25:44,684 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-0

2019-03-28 00:25:58,621 : INFO : worker thread finished; awaiting finish of 5 more threads
2019-03-28 00:25:58,624 : INFO : worker thread finished; awaiting finish of 4 more threads
2019-03-28 00:25:58,634 : INFO : worker thread finished; awaiting finish of 3 more threads
2019-03-28 00:25:58,637 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-03-28 00:25:58,649 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-03-28 00:25:58,653 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-03-28 00:25:58,656 : INFO : EPOCH - 7 : training on 1213051 raw words (907199 effective words) took 2.0s, 451267 effective words/s
2019-03-28 00:25:59,726 : INFO : EPOCH 8 - PROGRESS: at 43.77% examples, 389093 words/s, in_qsize 19, out_qsize 2
2019-03-28 00:26:00,624 : INFO : worker thread finished; awaiting finish of 9 more threads
2019-03-28 00:26:00,692 : INFO : worker thread finished; awaiting finish of 8 more threads
2019-03-28 00:26:00,69

(9075187, 12130510)
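
The tuple printed at the end can be checked against the log. As far as we understand Gensim's train method, it returns the number of effective words (what remains after dropping out-of-vocabulary words and downsampling very frequent ones) and the number of raw words, both summed over all epochs. The figures below are copied from the output above.

```python
raw_words_per_epoch = 1213051  # "training on 1213051 raw words" in the log
epochs = 10
effective_total, raw_total = 9075187, 12130510  # tuple returned by model.train(...)

# The raw total is the per-epoch raw count times the number of epochs
print(raw_words_per_epoch * epochs == raw_total)

# Share of raw words that actually contributed to training
print(round(effective_total / raw_total, 3))
```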

Let's get to the fun stuff already! Since we trained on Quora questions rather than on the user reviews of the original article, let's look at the similarities for a few words that are likely to occur in them. This first example shows a simple lookup of words similar to the word 'young'. All we need to do is call the most_similar function and provide the word 'young' as the positive example. This returns the top 10 most similar words.

In [43]:
w1 = "young"
model.wv.most_similar(positive=w1)

2019-03-28 00:26:14,048 : INFO : precomputing L2-norms of word weight vectors


[('gay', 0.6983577013015747),
 ('teen', 0.6771925687789917),
 ('female', 0.6601540446281433),
 ('adults', 0.6504086852073669),
 ('sexually', 0.632117748260498),
 ('transgender', 0.6239749193191528),
 ('male', 0.6151970624923706),
 ('children', 0.6108031272888184),
 ('kids', 0.6039509177207947),
 ('teenage', 0.60386061668396)]

In [44]:
w2 = "opinion"
model.wv.most_similar(positive=w2)

[('opinions', 0.7887747287750244),
 ('stance', 0.7273621559143066),
 ('assyrian', 0.7138534784317017),
 ('perspctive', 0.6914175748825073),
 ('proudest', 0.6599768996238708),
 ('favourite', 0.6536139845848083),
 ('midfoot', 0.6442692279815674),
 ('zahra', 0.6429859399795532),
 ('arabidopsis', 0.6400624513626099),
 ('whatdid', 0.6365878582000732)]

In [45]:
w2 = "crime"
model.wv.most_similar(positive=w2)

[('crimes', 0.7659684419631958),
 ('committed', 0.7128411531448364),
 ('accusations', 0.7062324285507202),
 ('murder', 0.6925055384635925),
 ('israel', 0.6859065294265747),
 ('treason', 0.684887707233429),
 ('syria', 0.6803385019302368),
 ('corruption', 0.6714112758636475),
 ('party', 0.6637689471244812),
 ('congress', 0.6575760841369629)]

We can also use the model to compute the similarity between two words in the vocabulary by calling the "similarity(...)" function and passing in the two words.

In [46]:
# Similarity between two different words
model.wv.similarity(w1="spain", w2="country")

0.5480313677944857

In [47]:
# Similarity between two different words
model.wv.similarity(w1="germany", w2="countries")

0.6540631693632291

In [48]:
# Similarity between two different words
model.wv.similarity(w1="trump", w2="president")

0.7821647513259993

In [63]:
# Similarity between two different words
model.wv.similarity(w1="learning", w2="treason")

-0.1781629686372547
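
The values returned by similarity(...) are cosine similarities between the two word vectors, which is why they lie between -1 and 1 and can be negative, as in the last example. A minimal sketch with made-up 2-d vectors (the function below is our own illustration, not Gensim code):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors, in the range [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))   # orthogonal
print(cosine_similarity([1.0, 0.0], [-1.0, 0.5]))  # pointing away: negative
```

Only the direction of the vectors matters, not their length: the first pair has different lengths and is still maximally similar.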

Let's use the word vectors for creating features of our text dataset.
We may draw on the assumption that sentences that convey "insincere" content are composed of words that are, contextually, more similar to terms like "treason", "crime", "sabotage", "terrorist".
For each sentence, we can then calculate the average similarity to these words.
To verify whether this approach provides any informational value, we compare the average similarity of "insincere" sentences with the mentioned words to that of "sincere" sentences.
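
Stripped of the DataFrame bookkeeping, the per-sentence feature is just a mean over word similarities. The sketch below shows that core step in isolation; `average_similarity`, the lookup table, and its values are made up for illustration, and with the trained model one would pass `model.wv.similarity` as the similarity function.

```python
def average_similarity(tokens, topic, similarity):
    """Mean similarity between `topic` and every token; 0.0 for an empty token list."""
    if not tokens:
        return 0.0
    return sum(similarity(topic, token) for token in tokens) / len(tokens)

# Toy lookup table standing in for model.wv.similarity (values are made up)
toy_scores = {('crime', 'murder'): 0.5, ('crime', 'is'): 0.25, ('crime', 'wrong'): 0.0}

def toy_similarity(w1, w2):
    return toy_scores[(w1, w2)]

print(average_similarity(['murder', 'is', 'wrong'], 'crime', toy_similarity))
```

Because every word of the question counts equally, frequent function words pull the average towards their own similarity to the topic word.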

In [50]:
#import pandas and numpy packages
import pandas as pd
import numpy as np

# Define the words related to "insincere" context that we will use for feature creation
insincere_topics = ["treason", "crime", "sabotage", "terrorist"]

# Create DataFrame from train_raw
df_train_raw = pd.DataFrame(train_raw)
# Add one NaN-filled column per topic word to store the similarity values later
for topic in insincere_topics:
    df_train_raw['similarity_' + topic] = np.nan
print(df_train_raw[0:10])

                      0                                                  1  \
0                   qid                                      question_text   
1  00002165364db923c7e6  How did Quebec nationalists see their province...   
2  000032939017120e6e44  Do you have an adopted dog, how would you enco...   
3  0000412ca6e4628ce2cf  Why does velocity affect time? Does velocity a...   
4  000042bf85aa498cd78e  How did Otto von Guericke used the Magdeburg h...   
5  0000455dfa3e01eae3af  Can I convert montra helicon D to a mountain b...   
6  00004f9a462a357c33be  Is Gaza slowly becoming Auschwitz, Dachau or T...   
7  00005059a06ee19e11ad  Why does Quora automatically ban conservative ...   
8  0000559f875832745e2e  Is it crazy if I wash or wipe my groceries off...   
9  00005bd3426b2d0c8305  Is there such a thing as dressing moderately, ...   

        2  similarity_treason  similarity_crime  similarity_sabotage  \
0  target                 NaN               NaN                  NaN 

In [78]:
# Go through all rows and store, for each question, its average similarity to each insincere topic word.
for i in range(len(train_raw)):
    if i % 5000 == 0:
        print("i =", i, "of len(train_raw) =", len(train_raw))
    # Only process labelled rows; this skips the header row, whose target field is the string "target".
    # (The earlier check `== "1" or "0"` was always true, because the non-empty string "0" is truthy.)
    if train_raw[i][2] in ("0", "1"):
        # Number of tokens in the question; use 1 for empty questions to avoid dividing by zero.
        number_of_words_in_sentence = max(len(sentences[i]), 1)
        for topic in insincere_topics:
            # Add up the similarity of every token to the current topic word, then build the average.
            tmp_similarity_sum = 0
            for word in sentences[i]:
                tmp_similarity_sum += model.wv.similarity(topic, word)
            tmp_similarity_average = tmp_similarity_sum / number_of_words_in_sentence
            # Determine which DataFrame column records the similarities for this topic word
            tmp_similarity_column_number = df_train_raw.columns.get_loc('similarity_' + topic)
            # Save the average similarity value there
            df_train_raw.iloc[i, tmp_similarity_column_number] = tmp_similarity_average

print (df_train_raw[i-15:i])

i = 0 of len(train_raw) = 100000
i = 5000 of len(train_raw) = 100000
i = 10000 of len(train_raw) = 100000
i = 15000 of len(train_raw) = 100000
i = 20000 of len(train_raw) = 100000
i = 25000 of len(train_raw) = 100000
i = 30000 of len(train_raw) = 100000
i = 35000 of len(train_raw) = 100000
i = 40000 of len(train_raw) = 100000
i = 45000 of len(train_raw) = 100000
i = 50000 of len(train_raw) = 100000
i = 55000 of len(train_raw) = 100000
i = 60000 of len(train_raw) = 100000
i = 65000 of len(train_raw) = 100000
i = 70000 of len(train_raw) = 100000
i = 75000 of len(train_raw) = 100000
i = 80000 of len(train_raw) = 100000
i = 85000 of len(train_raw) = 100000
i = 90000 of len(train_raw) = 100000
i = 95000 of len(train_raw) = 100000
                          0  \
99984  1394ce4e03490702a6b1   
99985  1394d4ce23b18714e726   
99986  1394d75405f6320d07e5   
99987  1394e8fb0e3382faf27e   
99988  1394ef966dca965819b3   
99989  13951b7e70572214e93d   
99990  13951e98d97449e53c48   
99991  139528d12e

Let's create DataFrames that contain
0) only those sentences that were NOT labelled as insincere
1) only those sentences that were labelled as insincere

Then, for each of the two groups, we calculate the average similarity of its sentences to the "insincere_topics".
If this average is considerably higher for the sentences labelled as "insincere", it might be useful as a feature for our model (the model that is to determine whether a sentence is insincere or not).

In [85]:
df_train_raw_target_1 = df_train_raw[df_train_raw[2] == '1']
df_train_raw_target_0 = df_train_raw[df_train_raw[2] == '0']

for i in range (0, len(insincere_topics)):
    print ("insincere_topic: ", insincere_topics[i])
    print ("df_train_raw_target_1['similarity_", insincere_topics[i], "] = ", df_train_raw_target_1['similarity_'+insincere_topics[i]].mean())
    print ("df_train_raw_target_0['similarity_", insincere_topics[i], "] = ", df_train_raw_target_0['similarity_'+insincere_topics[i]].mean())
    print ("-------------------------")

insincere_topic:  treason
df_train_raw_target_1['similarity_ treason ] =  0.1965593864255121
df_train_raw_target_0['similarity_ treason ] =  0.10510808685340248
-------------------------
insincere_topic:  crime
df_train_raw_target_1['similarity_ crime ] =  0.1858277078470125
df_train_raw_target_0['similarity_ crime ] =  0.08734700818744957
-------------------------
insincere_topic:  sabotage
df_train_raw_target_1['similarity_ sabotage ] =  0.04299316762722095
df_train_raw_target_0['similarity_ sabotage ] =  0.009735614315394332
-------------------------
insincere_topic:  terrorist
df_train_raw_target_1['similarity_ terrorist ] =  0.1895742749255517
df_train_raw_target_0['similarity_ terrorist ] =  0.07829692343517229
-------------------------


These results support the assumption: for all four topic words, the average similarity is considerably higher for the sentences labelled as "insincere" than for the others, so it might be useful to use these similarities as features for our model.
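
How large the gap is can be read off more easily from differences and ratios. The sketch below only re-uses the means printed above. It says nothing about how much the two distributions overlap, so it is a first sanity check rather than proof that the feature separates the classes.

```python
# Mean similarities copied from the output above: (insincere, sincere)
mean_similarity = {
    'treason':   (0.1965593864255121, 0.10510808685340248),
    'crime':     (0.1858277078470125, 0.08734700818744957),
    'sabotage':  (0.04299316762722095, 0.009735614315394332),
    'terrorist': (0.1895742749255517, 0.07829692343517229),
}

for topic, (insincere, sincere) in mean_similarity.items():
    # A ratio above 1 means insincere questions sit closer to the topic word on average
    print(topic,
          'difference:', round(insincere - sincere, 3),
          'ratio:', round(insincere / sincere, 2))
```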