# How to get started with Word2Vec — and then how to make it work

This notebook is closely adapted from this awesome blog article by Kavita Ganesan: https://medium.freecodecamp.org/how-to-get-started-with-word2vec-and-then-how-to-make-it-work-d0a2fca9dad3.
As the title says, we'll learn how to use the Gensim implementation of Word2Vec and actually get it to work. [...] As Kavita points out, getting usable results depends on a well-chosen combination of two things: (1) your input data and (2) your parameter settings.

First, we start with our imports and get logging established:

In [4]:
# imports needed and logging
import gzip
import gensim 
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

Here, we will work on a dataset from Quora, provided as part of the "Insincere Questions Classification" competition on Kaggle. It contains sincere and insincere questions; an insincere question is defined as a question intended to make a statement rather than look for helpful answers. The training data includes the question that was asked, and whether it was identified as insincere (target = 1). The ground-truth labels contain some amount of noise: they are not guaranteed to be perfect. (from: https://www.kaggle.com/c/quora-insincere-questions-classification/data)

For working on the data, let's assume we have downloaded and unzipped the dataset (from the source given above), which contains one train.csv and one test.csv table with the textual data and labels.

As the dataset is very big, we can start with a shortened file containing only the first few thousand entries (we can still use the entire file for the final version of our training later on). To do this, open the folder containing train.csv in a terminal window (Linux) or in a comparable shell that emulates a Linux terminal on Windows (e.g. Git Bash or Cygwin). We can then extract the first lines of train.csv into a shortened file (called train_short.csv in the cells below) with this Linux command:

head -n NUMBEROFLINES file.csv > mynewfile.csv
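
If no Linux-style shell is at hand, the same shortening step can be done from Python. The following is only a minimal sketch: the helper name `head_lines` is made up for this notebook, and it copies physical lines exactly like `head -n` does (so a question containing a line break could be cut in the middle).

```python
from itertools import islice


def head_lines(src_path, dst_path, n_lines):
    """Copy the first n_lines lines of src_path to dst_path (like `head -n`)."""
    # newline="" keeps the original line endings untouched on both sides
    with open(src_path, encoding="utf-8", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.writelines(islice(src, n_lines))
```

Calling `head_lines("train.csv", "train_short.csv", NUMBEROFLINES)` with a line count of your choice would then produce the shortened file used below.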

Let’s take a closer look at our data below by printing the first line(s).

In [9]:
import csv

def csv_to_list(filename):
    sents = []
    with open(filename, newline='', encoding='utf-8') as csvfile:
        csvreader = csv.reader(csvfile, delimiter=',')
        for row in csvreader:
            sents.append(row)
    return sents

train_file = 'train_short.csv'
train = csv_to_list(train_file)
print("First rows of train_short.csv: ", train[0:10])

First rows of train_short.csv:  [['qid', 'question_text', 'target'], ['00002165364db923c7e6', 'How did Quebec nationalists see their province as a nation in the 1960s?', '0'], ['000032939017120e6e44', 'Do you have an adopted dog, how would you encourage people to adopt and not shop?', '0'], ['0000412ca6e4628ce2cf', 'Why does velocity affect time? Does velocity affect space geometry?', '0'], ['000042bf85aa498cd78e', 'How did Otto von Guericke used the Magdeburg hemispheres?', '0'], ['0000455dfa3e01eae3af', 'Can I convert montra helicon D to a mountain bike by just changing the tyres?', '0'], ['00004f9a462a357c33be', 'Is Gaza slowly becoming Auschwitz, Dachau or Treblinka for Palestinians?', '0'], ['00005059a06ee19e11ad', 'Why does Quora automatically ban conservative opinions when reported, but does not do the same for liberal views?', '0'], ['0000559f875832745e2e', 'Is it crazy if I wash or wipe my groceries off? Germs are everywhere.', '0'], ['00005bd3426b2d0c8305', 'Is there such a t

We'll now adapt this function a bit and apply some mild pre-processing to the text with gensim.utils.simple_preprocess(row[1]). We only use the question text itself and set aside the labels that say whether a question is "insincere", since they are not the focus of this exercise.
The simple_preprocess function does some basic pre-processing such as tokenization and lowercasing, and returns a list of tokens (words).
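
To get a feeling for what this pre-processing does without running Gensim, here is a rough stdlib approximation. It is not Gensim's actual code: the function name `rough_preprocess` is invented for this sketch, and it only assumes the documented defaults of simple_preprocess (lowercasing, and keeping tokens between 2 and 15 characters long).

```python
import re

# Runs of letters only; digits, underscores and punctuation act as separators.
_TOKEN = re.compile(r"[^\W\d_]+")


def rough_preprocess(text, min_len=2, max_len=15):
    """Lowercase, split into alphabetic tokens, drop very short/long tokens."""
    tokens = _TOKEN.findall(text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]


question = "How did Quebec nationalists see their province as a nation in the 1960s?"
print(rough_preprocess(question))
```

Note how one-letter words such as "a" and the number "1960s" disappear, which is also what we see in the real simple_preprocess output below.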

In [11]:
import csv

def csv_to_list(filename):
    sents = []
    with open(filename, newline='', encoding='utf-8') as csvfile:
        csvreader = csv.reader(csvfile, delimiter=',')
        for row in csvreader:
            # row[1] is the question text; note that the header row is tokenized too
            row_text = gensim.utils.simple_preprocess(row[1])
            sents.append(row_text)
    return sents

train_file = 'train_short.csv'
sentences = csv_to_list(train_file)
print("First rows of train_short.csv: ", sentences[0:10])

First rows of train_short.csv:  [['question_text'], ['how', 'did', 'quebec', 'nationalists', 'see', 'their', 'province', 'as', 'nation', 'in', 'the'], ['do', 'you', 'have', 'an', 'adopted', 'dog', 'how', 'would', 'you', 'encourage', 'people', 'to', 'adopt', 'and', 'not', 'shop'], ['why', 'does', 'velocity', 'affect', 'time', 'does', 'velocity', 'affect', 'space', 'geometry'], ['how', 'did', 'otto', 'von', 'guericke', 'used', 'the', 'magdeburg', 'hemispheres'], ['can', 'convert', 'montra', 'helicon', 'to', 'mountain', 'bike', 'by', 'just', 'changing', 'the', 'tyres'], ['is', 'gaza', 'slowly', 'becoming', 'auschwitz', 'dachau', 'or', 'treblinka', 'for', 'palestinians'], ['why', 'does', 'quora', 'automatically', 'ban', 'conservative', 'opinions', 'when', 'reported', 'but', 'does', 'not', 'do', 'the', 'same', 'for', 'liberal', 'views'], ['is', 'it', 'crazy', 'if', 'wash', 'or', 'wipe', 'my', 'groceries', 'off', 'germs', 'are', 'everywhere'], ['is', 'there', 'such', 'thing', 'as', 'dressing

## Training the Word2Vec model

In [13]:
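# build vocabulary and train model
# (passing `sentences` here already builds the vocabulary and runs a first training pass)
model = gensim.models.Word2Vec(
    sentences,
    size=150,      # dimensionality of the word vectors (renamed to vector_size in Gensim 4.0)
    window=10,     # maximum distance between the current and the predicted word
    min_count=2,   # ignore words that occur fewer than 2 times
    workers=10)    # number of worker threads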

2019-03-26 18:36:17,701 : INFO : collecting all words and their counts
2019-03-26 18:36:17,704 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-03-26 18:36:17,774 : INFO : PROGRESS: at sentence #10000, processed 120880 words, keeping 14878 word types
2019-03-26 18:36:17,831 : INFO : PROGRESS: at sentence #20000, processed 242134 words, keeping 21567 word types
2019-03-26 18:36:17,890 : INFO : PROGRESS: at sentence #30000, processed 363093 words, keeping 26675 word types
2019-03-26 18:36:17,948 : INFO : PROGRESS: at sentence #40000, processed 484276 words, keeping 30912 word types
2019-03-26 18:36:18,007 : INFO : PROGRESS: at sentence #50000, processed 605073 words, keeping 34597 word types
2019-03-26 18:36:18,064 : INFO : PROGRESS: at sentence #60000, processed 726181 words, keeping 38005 word types
2019-03-26 18:36:18,120 : INFO : PROGRESS: at sentence #70000, processed 848209 words, keeping 41047 word types
2019-03-26 18:36:18,179 : INFO : PROGRESS: at 

2019-03-26 18:36:29,820 : INFO : training on a 6065255 raw words (4402457 effective words) took 10.1s, 435584 effective words/s
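
The min_count=2 setting used above means that words occurring only once are dropped before training. As a small illustration of that pruning step (a sketch with collections.Counter, not Gensim's internal implementation; `prune_vocab` and the toy sentences are made up here):

```python
from collections import Counter


def prune_vocab(sentences, min_count=2):
    """Count all tokens and keep only those seen at least min_count times."""
    counts = Counter(token for sent in sentences for token in sent)
    return {word: n for word, n in counts.items() if n >= min_count}


toy_sentences = [
    ["why", "does", "velocity", "affect", "time"],
    ["does", "velocity", "affect", "space", "geometry"],
]
vocab = prune_vocab(toy_sentences, min_count=2)
print(sorted(vocab))
```

Only the words that appear in both toy sentences survive; rare words like "geometry" would have no vector in the trained model.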


Because we passed our sentences directly to the constructor, Gensim has already built the vocabulary and run a first training pass (the log above ends with a "training on ..." line). Calling train(...) continues training on the same data for more epochs. Behind the scenes, we are training a simple neural network with a single hidden layer. But we are not going to use the neural network itself after training. Instead, the goal is to learn the weights of the hidden layer. These weights are essentially the word vectors that we’re trying to learn.
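
Why are the hidden-layer weights "essentially the word vectors"? A toy sketch may help: if a word enters the network as a one-hot vector, multiplying it by the hidden-layer weight matrix simply picks out one row of that matrix. The vocabulary and the weight values below are invented for illustration and are far smaller than in the real model.

```python
# Toy vocabulary and a made-up 3 x 4 hidden-layer weight matrix
# (one row per word, one column per vector dimension).
toy_vocab = ["spain", "italy", "young"]
W = [
    [0.1, 0.2, 0.3, 0.4],   # weights for "spain"
    [0.5, 0.6, 0.7, 0.8],   # weights for "italy"
    [0.9, 1.0, 1.1, 1.2],   # weights for "young"
]


def one_hot(word):
    """Encode a word as a vector with a single 1 at the word's index."""
    return [1.0 if w == word else 0.0 for w in toy_vocab]


def hidden_layer(x, weights):
    """Plain vector-matrix product x @ weights, written out with loops."""
    n_dims = len(weights[0])
    return [sum(x[i] * weights[i][j] for i in range(len(x))) for j in range(n_dims)]


vector = hidden_layer(one_hot("italy"), W)
print(vector == W[toy_vocab.index("italy")])
```

So "learning the weights" and "learning one vector per word" are the same thing here, which is why we can throw away the rest of the network after training.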

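In the original article, training on the OpinRank dataset takes about 10–15 minutes. On our shortened Quora file it is much faster (about 2 seconds per epoch in the log below), but expect noticeably longer run times on the full train.csv, so please be patient while running your code on that dataset.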

In [14]:
model.train(sentences, total_examples=len(sentences), epochs=10)

2019-03-26 18:38:56,128 : INFO : training model with 10 workers on 24820 vocabulary and 150 features, using sg=0 hs=0 sample=0.001 negative=5 window=10
2019-03-26 18:38:57,162 : INFO : EPOCH 1 - PROGRESS: at 48.73% examples, 425511 words/s, in_qsize 20, out_qsize 1
2019-03-26 18:38:58,040 : INFO : worker thread finished; awaiting finish of 9 more threads
2019-03-26 18:38:58,064 : INFO : worker thread finished; awaiting finish of 8 more threads
2019-03-26 18:38:58,080 : INFO : worker thread finished; awaiting finish of 7 more threads
2019-03-26 18:38:58,095 : INFO : worker thread finished; awaiting finish of 6 more threads
2019-03-26 18:38:58,101 : INFO : worker thread finished; awaiting finish of 5 more threads
2019-03-26 18:38:58,108 : INFO : worker thread finished; awaiting finish of 4 more threads
2019-03-26 18:38:58,114 : INFO : worker thread finished; awaiting finish of 3 more threads
2019-03-26 18:38:58,121 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-0

2019-03-26 18:39:10,612 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-03-26 18:39:10,623 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-03-26 18:39:10,626 : INFO : EPOCH - 7 : training on 1213051 raw words (880383 effective words) took 2.0s, 434011 effective words/s
2019-03-26 18:39:11,657 : INFO : EPOCH 8 - PROGRESS: at 49.55% examples, 434827 words/s, in_qsize 19, out_qsize 0
2019-03-26 18:39:12,492 : INFO : worker thread finished; awaiting finish of 9 more threads
2019-03-26 18:39:12,528 : INFO : worker thread finished; awaiting finish of 8 more threads
2019-03-26 18:39:12,538 : INFO : worker thread finished; awaiting finish of 7 more threads
2019-03-26 18:39:12,543 : INFO : worker thread finished; awaiting finish of 6 more threads
2019-03-26 18:39:12,548 : INFO : worker thread finished; awaiting finish of 5 more threads
2019-03-26 18:39:12,557 : INFO : worker thread finished; awaiting finish of 4 more threads
2019-03-26 18:39:12,56

(8805853, 12130510)
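
The tuple printed at the end is the return value of train(...). Judging from the log above, the second number is the raw word count summed over all epochs, and the first is the count of "effective" words, i.e. those that were actually used after rare words were dropped and very frequent words were downsampled. A quick arithmetic check with the numbers from the log:

```python
epochs = 10
raw_words_per_epoch = 1213051                 # "training on 1213051 raw words" in the log
effective_total, raw_total = 8805853, 12130510  # tuple returned by model.train(...)

# The raw total is exactly the per-epoch raw count times the number of epochs.
print(raw_words_per_epoch * epochs == raw_total)

# Share of words that actually contributed to training.
kept_share = effective_total / raw_total
print(f"{kept_share:.1%}")
```

So roughly a quarter of all word occurrences were skipped during training.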

Let’s get to the fun stuff already! Since we trained on Quora questions, it would be nice to see the similarities for some words that are typical of them. This first example shows a simple lookup of words similar to the word 'young'. All we need to do here is call the most_similar function and provide the word 'young' as the positive example. This returns the top 10 most similar words.

In [15]:
w1 = "young"
model.wv.most_similar(positive=w1)

2019-03-26 18:41:21,153 : INFO : precomputing L2-norms of word weight vectors


[('teen', 0.7018687725067139),
 ('teenage', 0.6970053315162659),
 ('adults', 0.6967407464981079),
 ('gay', 0.6640512943267822),
 ('male', 0.6555132269859314),
 ('sexually', 0.6529505252838135),
 ('headscarf', 0.6497985124588013),
 ('female', 0.6405802965164185),
 ('aged', 0.6368280649185181),
 ('transgender', 0.6365754008293152)]

In [19]:
w2 = "opinion"
model.wv.most_similar(positive=w2)

[('opinions', 0.7783907055854797),
 ('stance', 0.7686251401901245),
 ('views', 0.6847450733184814),
 ('favourite', 0.6429904699325562),
 ('thoughts', 0.622490406036377),
 ('classmates', 0.6156018972396851),
 ('invasions', 0.6017940044403076),
 ('appearance', 0.5995303392410278),
 ('observations', 0.5772193670272827),
 ('enemy', 0.5739162564277649)]

In [25]:
w2 = "spain"
model.wv.most_similar(positive=w2)

[('italy', 0.8383039236068726),
 ('germany', 0.8264582753181458),
 ('france', 0.8190193176269531),
 ('indonesia', 0.8083047866821289),
 ('portugal', 0.7982422113418579),
 ('malaysia', 0.7938610315322876),
 ('ireland', 0.7936679124832153),
 ('europe', 0.7907922267913818),
 ('england', 0.7895593047142029),
 ('japan', 0.7890664935112)]

You can also use Word2Vec to compute the similarity between two words in the vocabulary by calling the similarity(...) function and passing in the two words.

In [28]:
# Similarity between two different words
model.wv.similarity(w1="spain", w2="country")

0.5253688863715762

In [30]:
# Similarity between two different words
model.wv.similarity(w1="germany", w2="countries")

0.631107837954425

In [31]:
# Similarity between two different words
model.wv.similarity(w1="trump", w2="president")

0.7719104869073291

In [32]:
# Similarity between two different words
model.wv.similarity(w1="obama", w2="president")

0.776062665288362
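
Both similarity(...) and most_similar(...) used above are based on the cosine similarity between word vectors. The following is a minimal pure-Python sketch of that idea, not Gensim's implementation; the two-dimensional toy vectors are invented so that the result is easy to check by hand.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=10):
    """Rank all other words by cosine similarity to `word` (brute force)."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical toy vectors, not taken from the trained model
toy_vectors = {
    "spain": [1.0, 0.0],
    "italy": [0.8, 0.6],
    "young": [0.0, 1.0],
}
print(most_similar("spain", toy_vectors, topn=2))
```

In this toy space "italy" ranks above "young" for the query "spain", just as country names cluster together in the real model's output.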