# Getting started with Word2Vec in Gensim

(This tutorial is based on [Kavita Ganesan's Gensim Word2Vec tutorial](http://kavita-ganesan.com/gensim-word2vec-tutorial-starter-code/#.W84fpi-ZNTY).)

The idea behind Word2Vec is pretty simple: we assume that you can tell the meaning of a word by the company it keeps. This is analogous to the saying *show me your friends, and I'll tell you who you are*. So if two words have very similar neighbors (i.e. the usage context is about the same), then these words are probably quite similar in meaning or are at least highly related. For example, the words `shocked`, `appalled` and `astonished` are typically used in a similar context.

In this tutorial, you will learn how to use the Gensim implementation of Word2Vec. The performance of word embeddings depends on two things: (1) your input data and (2) your parameter settings. Note that the training algorithms in this package were ported from the [original Word2Vec implementation by Google](https://arxiv.org/pdf/1301.3781.pdf) and extended with additional functionality.


To get started, you will need to install Gensim. The first cell below does that; ignore it if you already have gensim installed.



In [1]:
import sys
!{sys.executable} -m pip install gensim

Collecting gensim
  Downloading gensim-3.8.3-cp37-cp37m-manylinux1_x86_64.whl (24.2 MB)
Collecting smart-open>=1.8.1
  Downloading smart_open-3.0.0.tar.gz (113 kB)
Building wheels for collected packages: smart-open
  Building wheel for smart-open (setup.py) ... done
  Created wheel for smart-open: filename=smart_open-3.0.0-py3-none-any.whl size=107097 sha256=9b66b9b8b93449020b138cc44bfd698e2fad7ca8137f67aa8fdd797cb691e5e2
  Stored in directory: /home/leander/.cache/pip/wheels/83/a6/12/bf3c1a667bde4251be5b7a3368b2d604c9af2105b5c1cb1870
Successfully built smart-open
Installing collected packages: smart-open, gensim
Successfully installed gensim-3.8.3 smart-open-3.0.0


### Imports and logging

First, we start with our imports and get logging established:

In [2]:
# imports needed and set up logging
import gzip
import gensim 
import logging
import warnings
warnings.filterwarnings("ignore")
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)


### Dataset 
Next is our dataset. The secret to getting Word2Vec really working for you is to have lots and lots of text data. In this case I am going to use the [IMDB Movie Review Dataset](http://ai.stanford.edu/~amaas/data/sentiment/) (feel free to use other datasets). This dataset contains full user reviews of movies (50,000 positive and negative reviews for training). I concatenated all the positive and negative reviews into one text file, which is available for download on Blackboard.

To avoid confusion: although Gensim's Word2Vec tutorial says that you need to pass a sequence of sentences as input, you can always pass a whole review as a sentence (i.e. a much larger piece of text), and it should not make much of a difference.

Now, let's take a closer look at this data below by printing the first line. You can see that this is a pretty hefty review.

In [5]:
data_file="imdb.txt.gz"

# print only the first review (a raw bytes line) from the compressed file
with gzip.open(data_file, 'rb') as f:
    for line in f:
        print(line)
        break


b"once again mr .  costner has dragged out a movie for far longer than necessary .  aside from the terrific sea rescue sequences ,  of which there are very few i just did not care about any of the characters .  most of us have ghosts in the closet ,  and costner's character are realized early on ,  and then forgotten until much later ,  by which time i did not care .  the character we should really care about is a very cocky ,  overconfident ashton kutcher .  the problem is he comes off as kid who thinks he's better than anyone else around him and shows no signs of a cluttered closet .  his only obstacle appears to be winning over costner .  finally when we are well past the half way point of this stinker ,  costner tells us all about kutcher's ghosts .  we are told why kutcher is driven to be the best with no prior inkling or foreshadowing .  no magic here ,  it was all i could do to keep from turning it off an hour in . \n"


### Read files into a list
Now that we've had a sneak peek at our dataset, we can read it into a list so that we can pass it on to the Word2Vec model. Notice in the code below that I am reading the compressed file directly. I'm also doing some mild pre-processing of the reviews using `gensim.utils.simple_preprocess(line)`. This does some basic pre-processing such as tokenization and lowercasing, and returns a list of tokens (words). Documentation of this pre-processing method can be found on the official [Gensim documentation site](https://radimrehurek.com/gensim/utils.html).



In [6]:

def read_input(input_file):
    """Read the gzip-compressed input file and yield one token list per review."""

    logging.info("reading file {0}...this may take a while".format(input_file))

    with gzip.open(input_file, 'rb') as f:
        for i, line in enumerate(f):

            if i % 10000 == 0:
                logging.info("read {0} reviews".format(i))
            # do some pre-processing and yield a list of words for each review text
            yield gensim.utils.simple_preprocess(line)

# read the tokenized reviews into a list
# each review item becomes a series of words
# so this becomes a list of lists
documents = list(read_input(data_file))
logging.info("Done reading data file")

2020-11-15 13:51:36,135 : INFO : reading file imdb.txt.gz...this may take a while
2020-11-15 13:51:36,137 : INFO : read 0 reviews
2020-11-15 13:51:38,687 : INFO : read 10000 reviews
2020-11-15 13:51:41,324 : INFO : read 20000 reviews
2020-11-15 13:51:43,864 : INFO : read 30000 reviews
2020-11-15 13:51:46,529 : INFO : read 40000 reviews
2020-11-15 13:51:49,200 : INFO : Done reading data file
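
To get a feel for what this pre-processing step produces, here is a rough, standard-library-only sketch. The function `rough_preprocess` is a hypothetical stand-in that I made up for illustration; it is not Gensim's implementation. It assumes that tokens are lowercased and that tokens shorter than 2 or longer than 15 characters are dropped (which, as far as I know, matches the defaults of `simple_preprocess`), and it only handles ASCII letters, whereas the real function also keeps accented words.

```python
import re

def rough_preprocess(text, min_len=2, max_len=15):
    """Hypothetical approximation of simple_preprocess (ASCII letters only)."""
    # the gzip file is opened in binary mode, so each line arrives as bytes
    if isinstance(text, bytes):
        text = text.decode("utf-8", errors="ignore")
    # lowercase, then keep runs of letters as tokens (punctuation is discarded)
    tokens = re.findall(r"[a-z]+", text.lower())
    # drop very short and very long tokens
    return [t for t in tokens if min_len <= len(t) <= max_len]

sample = b"once again mr .  costner has dragged out a movie for far longer than necessary . \n"
tokens = rough_preprocess(sample)
print(tokens)
```

Note that the one-letter word `a` disappears, which is the same effect that turns `i've` into the lone token `ve` in the real output.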


## Training the Word2Vec model

Training the model is fairly straightforward. You just instantiate Word2Vec and pass the reviews that we read in the previous step (the `documents`). So we are essentially passing a list of lists, where each inner list contains the tokens from one user review. Word2Vec uses all these tokens to internally create a vocabulary, and by vocabulary I mean the set of unique words.

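Note that when `documents` is passed to the constructor, Gensim builds the vocabulary and already runs an initial round of training; the explicit `train(...)` call below then continues training for 10 more epochs. Training on the IMDB dataset takes about 5 minutes or less.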


In [18]:
documents

ole', 'is', 'seen', 'coming', 'out', 'from', 'behind', 'some', 'boxes', 'in',
  'the', 'closet', 'she', 'would', 'have', 'been', 'easily', 'spotted', 'if',
  'the', 'cop', 'had', 'spent', 'all', 'of', 'seconds', 'looking', 'apparently',
  'too', 'stupid', 'to', 'have', 'said', 'or', 'done', 'anything', 'when', 'the',
  'policeman', 'was', 'there', 'wow', 'this', 'movie', 'is', 'apparently', 'the',
  'first', 'in', 'new', 'line', 'of', 'quality', 'direct', 'to', 'dvd', 'movies',
  'marketed', 'as', 'being', 'too', 'extreme', 'for', 'theaters', 'in', 'reality',
  'it', 'just', 'more', 'cliché', 'movie', 'garbage'],
 ['this', 'movie', 'has', 'got', 'to', 'be', 'the', 'biggest', 'disappointment',
  've', 'ever', 'experienced', 'with', 'film', 'the', 'acting', 'is', 'horrific',
  'the', 'suspense', 'build', 'up', 'minimal',
 

In [7]:
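# build the vocabulary from the reviews (the constructor also runs an initial training pass)
model = gensim.models.Word2Vec(documents, size=150, window=10, min_count=2, workers=10)
# continue training on the same reviews for 10 more epochs
model.train(documents, total_examples=len(documents), epochs=10)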

d; awaiting finish of 2 more threads
2020-11-15 13:52:34,949 : INFO : worker thread finished; awaiting finish of 1 more threads
2020-11-15 13:52:34,952 : INFO : worker thread finished; awaiting finish of 0 more threads
2020-11-15 13:52:34,952 : INFO : EPOCH - 4 : training on 10974519 raw words (8411376 effective words) took 6.3s, 1335056 effective words/s
2020-11-15 13:52:35,957 : INFO : EPOCH 5 - PROGRESS: at 15.50% examples, 1270102 words/s, in_qsize 19, out_qsize 0
2020-11-15 13:52:36,959 : INFO : EPOCH 5 - PROGRESS: at 31.30% examples, 1296634 words/s, in_qsize 17, out_qsize 2
2020-11-15 13:52:37,964 : INFO : EPOCH 5 - PROGRESS: at 47.53% examples, 1313817 words/s, in_qsize 19, out_qsize 0
2020-11-15 13:52:38,964 : INFO : EPOCH 5 - PROGRESS: at 63.68% examples, 1321650 words/s, in_qsize 17, out_qsize 2
2020-11-15 13:52:39,974 : INFO : EPOCH 5 - PROGRESS: at 79.52% examples, 1324460 words/s, in_qsize 18, out_qsize 1
2020-11-15 13:52:40,985 : INFO : EPOCH 5 - PROGRESS: at 95.14% exam

(84126347, 109745190)

## Now, let's look at some output 
This first example shows a simple case of looking up words similar to the word `boring`. All we need to do here is to call the `most_similar` function and provide the word `boring` as the positive example. This returns the top 10 similar words (nearest neighbors). 

In [8]:
w1 = "boring"
model.wv.most_similar(positive=[w1])


2020-11-15 13:53:44,365 : INFO : precomputing L2-norms of word weight vectors


[('dull', 0.7930447459220886),
 ('tedious', 0.7073862552642822),
 ('pointless', 0.6814718246459961),
 ('uninteresting', 0.6390188336372375),
 ('predictable', 0.6207038760185242),
 ('uneventful', 0.6168208718299866),
 ('confusing', 0.6135261654853821),
 ('meaningless', 0.6114116907119751),
 ('repetitive', 0.6027559638023376),
 ('unoriginal', 0.5979882478713989)]

That looks pretty good, right? Let's look at a few more. Let's look at similarity for `polite`, `france` and `shocked`. 

In [9]:
# look up top 6 words similar to 'polite'
w1 = ["polite"]
df=model.wv.most_similar(positive=w1)
print(df[0:6])

[('stubborn', 0.5525908470153809), ('rude', 0.5403105616569519), ('naive', 0.5330991744995117), ('withdrawn', 0.5108522772789001), ('sensitive', 0.5086380243301392), ('conceited', 0.5052778124809265)]


In [10]:
# look up top 6 words similar to 'france'
w1 = ["france"]
df=model.wv.most_similar(positive=w1)
print(df[0:6])

[('spain', 0.7504943609237671), ('england', 0.7110885381698608), ('italy', 0.6984615921974182), ('germany', 0.6935436129570007), ('russia', 0.6747918725013733), ('europe', 0.6714396476745605)]


In [11]:
# look up top 6 words similar to 'shocked'
w1 = ["shocked"]
df=model.wv.most_similar(positive=w1)
print(df[0:6])


[('amazed', 0.7389718294143677), ('surprised', 0.7292163372039795), ('appalled', 0.6932759284973145), ('stunned', 0.6799626350402832), ('astonished', 0.6732420921325684), ('disgusted', 0.6556116342544556)]


That's nice. You can even specify several positive examples to get things that are related in the provided context, and provide negative examples to say what should not be considered as related. In the example below we are asking for items that relate to `food`, `cake` and `fruit` but not to `spoon`:

In [12]:
# get everything related to food, cake and fruit, but not to spoon
w1 = ["food",'cake','fruit']
w2 = ['spoon']
model.wv.most_similar(positive=w1, negative=w2)


[('meal', 0.5071399211883545),
 ('bath', 0.4899105429649353),
 ('lemonade', 0.48824217915534973),
 ('flowers', 0.46308645606040955),
 ('chocolate', 0.46211183071136475),
 ('joints', 0.455669105052948),
 ('snack', 0.44953450560569763),
 ('acid', 0.44833898544311523),
 ('puppies', 0.4460829198360443),
 ('vegetables', 0.4450179636478424)]
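
To make the role of positive and negative examples more concrete, here is a simplified sketch of the idea: add up the unit vectors of the positive words, subtract those of the negative words, and rank the remaining words by cosine similarity to the result. The function `toy_most_similar` and the three-dimensional `toy_vectors` are invented for illustration; this is my understanding of the general approach, not Gensim's actual code, and real embeddings have many more dimensions.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def toy_most_similar(vectors, positive, negative=(), topn=3):
    """Rank words by cosine similarity to (sum of positives - sum of negatives)."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)
    # the query words themselves are left out of the ranking
    exclude = set(positive) | set(negative)
    scores = {
        word: sum(q * x for q, x in zip(query, unit(vec)))
        for word, vec in vectors.items()
        if word not in exclude
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:topn]

# hypothetical toy vectors, roughly (edible, utensil, furniture)
toy_vectors = {
    "food": [1.0, 0.2, 0.0],
    "cake": [0.9, 0.1, 0.0],
    "fruit": [0.8, 0.0, 0.1],
    "spoon": [0.3, 1.0, 0.0],
    "meal": [0.9, 0.3, 0.0],
    "fork": [0.2, 0.9, 0.1],
    "chair": [0.0, 0.1, 1.0],
}

result = toy_most_similar(toy_vectors, positive=["food", "cake", "fruit"], negative=["spoon"])
print(result)
```

With these made-up vectors, `meal` comes out on top, while `fork` is pushed down because it points in the same direction as the negative example `spoon`.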

### Similarity between two words in the vocabulary

You can even use the Word2Vec model to return the similarity between two words that are present in the vocabulary. 

In [13]:
# similarity between synonyms
w1= "good"
w2= "great"

model.wv.similarity(w1, w2)

0.71336734

In [14]:
# similarity between two opposite words
w1= "great"
w2= "aweful"

model.wv.similarity(w1,w2)

0.078864224

In [15]:
# similarity between two unrelated words
w1= "food"
w2= "car"
model.wv.similarity(w1,w2)

0.21182805

Under the hood, the above three snippets compute the cosine similarity between the two specified words using the word vectors of each. From the scores, it makes sense that `good` is highly similar to `great`, while `great` is dissimilar to `aweful`. (Note that `aweful` is a misspelling of `awful` that happens to be in the model's vocabulary; the score for the correctly spelled word may differ.) Antonyms are a special case, and we managed to model them well since we used a movie reviews dataset, which includes separate positive and negative reviews. Using a general dataset won't necessarily model these antonyms correctly, since antonyms generally have very similar distributions in text. You can read more about cosine similarity scoring [here](https://en.wikipedia.org/wiki/Cosine_similarity).
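
If you want to see what the cosine similarity computation looks like, here is a minimal pure-Python sketch. The helper `cosine_similarity` and the two-dimensional example vectors are my own illustration; in practice you would simply call `model.wv.similarity`.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# vectors pointing in the same direction score close to 1
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
# vectors at a right angle score 0
right_angle = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, right_angle)
```

Because only the angle between the vectors matters, the length of a word vector does not influence the score.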

### Find the odd one out
You can even use Word2Vec to find the odd item in a list of items.

In [16]:
# Which one is the odd one out in this list?
list_of_words = ["cat","dog","france"]

pairs = []
values = []
for i in range(len(list_of_words)):
    for j in range(len(list_of_words)):
        if i<j:
            values.append(model.wv.similarity(list_of_words[i],list_of_words[j]))
            pairs.append([list_of_words[i], list_of_words[j]])

index = values.index(max(values))

for word in list_of_words:
    if word not in pairs[index]:
        print(word)
      

france


In [17]:
# Which one is the odd one out in this list?
list_of_words = ["dog","cat","horse","shower"]

for w1 in list_of_words:
    for w2 in list_of_words:
        if w1 != w2:
            print(model.wv.similarity(w1,w2), w1,w2)

print("shower is the odd one out")

0.6440322 dog cat
0.42674297 dog horse
0.11138967 dog shower
0.6440322 cat dog
0.27482846 cat horse
0.057583783 cat shower
0.42674297 horse dog
0.27482846 horse cat
0.12003072 horse shower
0.11138967 shower dog
0.057583783 shower cat
0.12003072 shower horse
shower is the odd one out
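
Instead of reading the odd one out off the printed scores by eye, one option is to pick the word with the lowest average similarity to all the others. The sketch below does this using the pairwise scores printed above; `odd_one_out` and `lookup` are hypothetical helpers I wrote for illustration, and with a trained model you could pass `model.wv.similarity` instead of `lookup`. Gensim also offers `model.wv.doesnt_match(list_of_words)` for this kind of question, which may use a different criterion.

```python
# pairwise similarities copied from the output above
pair_scores = {
    ("dog", "cat"): 0.6440322,
    ("dog", "horse"): 0.42674297,
    ("dog", "shower"): 0.11138967,
    ("cat", "horse"): 0.27482846,
    ("cat", "shower"): 0.057583783,
    ("horse", "shower"): 0.12003072,
}

def lookup(w1, w2):
    """Return the stored similarity for a pair, regardless of word order."""
    return pair_scores.get((w1, w2), pair_scores.get((w2, w1)))

def odd_one_out(words, similarity):
    """Return the word whose mean similarity to the other words is lowest."""
    def mean_similarity(word):
        others = [w for w in words if w != word]
        return sum(similarity(word, other) for other in others) / len(others)
    return min(words, key=mean_similarity)

odd = odd_one_out(["dog", "cat", "horse", "shower"], lookup)
print(odd)
```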


## Understanding some of the parameters
To train the model earlier, we had to set some parameters. Now, let's try to understand what some of them mean. For reference, this is the command that we used to train the model.

```
model = gensim.models.Word2Vec(documents, size=150, window=10, min_count=2, workers=10)
```

### `size`
The size of the dense vector used to represent each token or word. If you have very limited data, then size should be a much smaller value. If you have lots of data, it's good to experiment with various sizes. A value of 100-150 has worked well for me. (In Gensim 4.0 and later, this parameter is named `vector_size`.)

### `window`
The maximum distance between the target word and a neighboring word. If a neighbor's position lies farther from the target than the maximum window width, to the left or to the right, then that neighbor is not considered as being related to the target word. In theory, a smaller window should give you terms that are more related. If you have lots of data, then the window size should not matter too much, as long as it's a decent-sized window.
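
To illustrate what the window does, the sketch below lists every (target, neighbor) pair that falls within a given window. The function `context_pairs` is a hypothetical helper for illustration only; it deliberately ignores refinements of the real training procedure, such as randomly shrinking the window or subsampling frequent words.

```python
def context_pairs(tokens, window):
    """Pair each target word with the neighbors at most `window` positions away."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own neighbor
                pairs.append((target, tokens[j]))
    return pairs

tokens = ["the", "acting", "is", "horrific"]
pairs = context_pairs(tokens, window=1)
print(pairs)
```

With `window=1`, `the` and `horrific` never end up in the same pair; with `window=3` they would.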

### `min_count`
Minimum frequency count of words. The model ignores words that do not satisfy the `min_count`. Extremely infrequent words are usually unimportant, so it's best to get rid of those. Unless your dataset is really tiny, this does not really affect the model.
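
As a rough sketch of what this filtering amounts to, the snippet below counts tokens across a toy corpus and keeps only those that occur at least `min_count` times. `build_vocabulary` and `toy_documents` are hypothetical names for illustration; Gensim's own vocabulary building does more than this (for example, it also prepares data structures for training).

```python
from collections import Counter

def build_vocabulary(documents, min_count):
    """Count every token and keep those seen at least `min_count` times."""
    counts = Counter(token for doc in documents for token in doc)
    return {word: n for word, n in counts.items() if n >= min_count}

toy_documents = [
    ["this", "movie", "is", "boring"],
    ["this", "movie", "is", "great"],
]
vocab = build_vocabulary(toy_documents, min_count=2)
print(sorted(vocab))
```

Here `boring` and `great` each appear only once, so they are dropped from the vocabulary.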

### `workers`
The number of worker threads to use for training behind the scenes.
