# Getting started with Word2Vec in Gensim and making it work!

The idea behind Word2Vec is pretty simple. We are making an assumption that you can tell the meaning of a word by the company it keeps. This is analogous to the saying *show me your friends, and I'll tell you who you are*. So if you have two words that have very similar neighbors (i.e. the usage context is about the same), then these words are probably quite similar in meaning or are at least highly related. For example, the words `shocked`, `appalled` and `astonished` are typically used in a similar context.
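To make the "company it keeps" idea concrete, here is a minimal sketch that counts the neighbors of a word within a small window. The toy sentences and the `context_counts` helper are made up for illustration; this is not how Gensim trains, just a way to see that related words share contexts.

```python
from collections import Counter

# Toy corpus, invented for illustration (not taken from the OpinRank data)
sentences = [
    "i was shocked by the dirty room".split(),
    "i was appalled by the dirty room".split(),
    "we loved the pool and gym".split(),
]

def context_counts(sentences, target, window=2):
    """Count the words that appear within `window` positions of `target`."""
    counts = Counter()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            # every token in the window except the target position itself
            counts.update(t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i)
    return counts

shocked = context_counts(sentences, "shocked")
appalled = context_counts(sentences, "appalled")
pool = context_counts(sentences, "pool")

# words that appear near both targets
print(sorted(set(shocked) & set(appalled)))
print(sorted(set(shocked) & set(pool)))
```

In this toy data, `shocked` and `appalled` share all of their neighbors, while `shocked` and `pool` share only a common function word.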

In this tutorial, you will learn how to use the Gensim implementation of Word2Vec and actually get it to work! I have heard a lot of complaints about poor performance etc., but it's really a combination of two things: (1) your input data and (2) your parameter settings. Note that the training algorithms in this package were ported from the [original Word2Vec implementation by Google](https://arxiv.org/pdf/1301.3781.pdf) and extended with additional functionality.

### Imports and logging

First, we start with our imports and get logging established:

In [1]:
# imports needed and set up logging
import gzip
import gensim 
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

### Dataset 
Next is our dataset. The secret to getting Word2Vec really working for you is to have lots and lots of text data. In this case I am going to use data from the [OpinRank](http://kavita-ganesan.com/entity-ranking-data/) dataset. This dataset has full user reviews of cars and hotels. I have specifically concatenated all of the hotel reviews into one big file which is about 97MB compressed and 229MB uncompressed. We will use the compressed file for this tutorial. Each line in this file represents a hotel review. You can download the OpinRank Word2Vec dataset with the `wget` command below.

To avoid confusion, while gensim’s word2vec tutorial says that you need to pass it a sequence of sentences as its input, you can always pass it a whole review as a sentence (i.e. a much larger size of text), and it should not make much of a difference. 

Now, let's take a closer look at this data below by printing the first line. You can see that this is a pretty hefty review.

In [2]:
!wget -O reviews_data.txt https://github.com/kavgan/nlp-in-practice/blob/master/word2vec/reviews_data.txt.gz?raw=true

--2020-12-15 14:46:09--  https://github.com/kavgan/nlp-in-practice/blob/master/word2vec/reviews_data.txt.gz?raw=true
Resolving github.com (github.com)... 192.30.255.113
Connecting to github.com (github.com)|192.30.255.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github.com/kavgan/nlp-in-practice/raw/master/word2vec/reviews_data.txt.gz [following]
--2020-12-15 14:46:10--  https://github.com/kavgan/nlp-in-practice/raw/master/word2vec/reviews_data.txt.gz
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/kavgan/nlp-in-practice/master/word2vec/reviews_data.txt.gz [following]
--2020-12-15 14:46:10--  https://raw.githubusercontent.com/kavgan/nlp-in-practice/master/word2vec/reviews_data.txt.gz
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.gith

In [3]:
# the download keeps a .txt name, but the content is still gzip-compressed
data_file = 'reviews_data.txt'
with gzip.open(data_file, 'rb') as f:
    for i, line in enumerate(f):
        print(line)
        break

b"Oct 12 2009 \tNice trendy hotel location not too bad.\tI stayed in this hotel for one night. As this is a fairly new place some of the taxi drivers did not know where it was and/or did not want to drive there. Once I have eventually arrived at the hotel, I was very pleasantly surprised with the decor of the lobby/ground floor area. It was very stylish and modern. I found the reception's staff geeting me with 'Aloha' a bit out of place, but I guess they are briefed to say that to keep up the coroporate image.As I have a Starwood Preferred Guest member, I was given a small gift upon-check in. It was only a couple of fridge magnets in a gift box, but nevertheless a nice gesture.My room was nice and roomy, there are tea and coffee facilities in each room and you get two complimentary bottles of water plus some toiletries by 'bliss'.The location is not great. It is at the last metro stop and you then need to take a taxi, but if you are not planning on going to see the historic sites in Be

### Read files into a list
Now that we've had a sneak peek of our dataset, we can read it into a list so that we can pass this on to the Word2Vec model. Notice in the code below that I am directly reading the compressed file. I'm also doing some mild pre-processing of the reviews using `gensim.utils.simple_preprocess(line)`. This does some basic pre-processing such as tokenization, lowercasing, etc. and returns a list of tokens (words). Documentation of this pre-processing method can be found on the official [Gensim documentation site](https://radimrehurek.com/gensim/utils.html).



In [4]:
def read_input(input_file):
    """Read the gzip-compressed input file and yield one token list per review."""

    logging.info("reading file {0}...this may take a while".format(input_file))

    with gzip.open(input_file, 'rb') as f:
        for i, line in enumerate(f):

            if i % 10000 == 0:
                logging.info("read {0} reviews".format(i))
            # do some pre-processing and yield a list of words for each review text
            yield gensim.utils.simple_preprocess(line)

# read the tokenized reviews into a list
# each review item becomes a series of words
# so this becomes a list of lists
documents = list(read_input(data_file))
logging.info("Done reading data file")

2020-12-15 14:49:22,326 : INFO : reading file reviews_data.txt...this may take a while
2020-12-15 14:49:22,329 : INFO : read 0 reviews
2020-12-15 14:49:24,284 : INFO : read 10000 reviews
2020-12-15 14:49:26,224 : INFO : read 20000 reviews
2020-12-15 14:49:28,469 : INFO : read 30000 reviews
2020-12-15 14:49:30,562 : INFO : read 40000 reviews
2020-12-15 14:49:32,915 : INFO : read 50000 reviews
2020-12-15 14:49:35,137 : INFO : read 60000 reviews
2020-12-15 14:49:37,030 : INFO : read 70000 reviews
2020-12-15 14:49:38,764 : INFO : read 80000 reviews
2020-12-15 14:49:40,622 : INFO : read 90000 reviews
2020-12-15 14:49:42,802 : INFO : read 100000 reviews
2020-12-15 14:49:44,594 : INFO : read 110000 reviews
2020-12-15 14:49:46,390 : INFO : read 120000 reviews
2020-12-15 14:49:48,182 : INFO : read 130000 reviews
2020-12-15 14:49:50,102 : INFO : read 140000 reviews
2020-12-15 14:49:51,881 : INFO : read 150000 reviews
2020-12-15 14:49:53,745 : INFO : read 160000 reviews
2020-12-15 14:49:55,556 : 
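
To get a feel for what `simple_preprocess` does to each review, here is a rough approximation that uses only the standard library. The `rough_preprocess` helper and its regular expression are my own simplification, not Gensim's actual code. The length bounds of 2 and 15 mirror the default `min_len` and `max_len` of `simple_preprocess`.

```python
import re

# Approximation only: runs of letters (no digits), similar in spirit to Gensim's tokenizer
TOKEN_RE = re.compile(r"[^\W\d]+")

def rough_preprocess(text, min_len=2, max_len=15):
    """Lowercase, split into alphabetic tokens, and drop very short or very long tokens."""
    if isinstance(text, bytes):
        # the reviews are read in binary mode, so decode them first
        text = text.decode("utf8", errors="ignore")
    tokens = TOKEN_RE.findall(text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]

# the start of the first review shown earlier
sample = b"Oct 12 2009 \tNice trendy hotel location not too bad."
print(rough_preprocess(sample))
# ['oct', 'nice', 'trendy', 'hotel', 'location', 'not', 'too', 'bad']
```

Notice that the numbers and punctuation disappear and everything is lowercased, which is exactly the kind of token list we want to hand to Word2Vec.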

## Training the Word2Vec model

Training the model is fairly straightforward. You just instantiate Word2Vec and pass the reviews that we read in the previous step (the `documents`). So, we are essentially passing on a list of lists, where each list within the main list contains a set of tokens from a user review. Word2Vec uses all these tokens to internally create a vocabulary. And by vocabulary, I mean a set of unique words.
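
As a rough picture of the vocabulary step, the sketch below counts tokens across a list of lists and keeps only those seen at least `min_count` times, which is what the `min_count=2` argument in the training call asks for. The `build_vocab` helper and the toy documents are hypothetical; Gensim's internal vocabulary building does more than this (for example, it also prepares frequency-based sampling).

```python
from collections import Counter

def build_vocab(documents, min_count=2):
    """Return {word: count} for words occurring at least `min_count` times overall."""
    counts = Counter(tok for doc in documents for tok in doc)
    return {tok: n for tok, n in counts.items() if n >= min_count}

# Toy list of lists, invented for illustration
toy_documents = [
    ["nice", "hotel", "clean", "room"],
    ["dirty", "room", "rude", "staff"],
    ["nice", "staff", "clean", "pool"],
]

toy_vocab = build_vocab(toy_documents, min_count=2)
print(sorted(toy_vocab))
# ['clean', 'nice', 'room', 'staff']
```

Words that appear only once (`hotel`, `dirty`, `rude`, `pool`) are dropped, so they never get a vector.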

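After building the vocabulary, we just need to call `train(...)` to start training the Word2Vec model. Note that when you pass `documents` to the constructor, Gensim already builds the vocabulary and runs its default training passes, so the explicit `train(...)` call below adds one more epoch on top of that. Training on the [OpinRank](http://kavita-ganesan.com/entity-ranking-data/) dataset takes about 10 minutes, so please be patient while running your code on this dataset.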

Behind the scenes we are actually training a simple neural network with a single hidden layer. But, we are actually not going to use the neural network after training. Instead, the goal is to learn the weights of the hidden layer. These weights are essentially the word vectors that we’re trying to learn. 
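
Here is a small sketch of why the hidden-layer weights *are* the word vectors. If the input is a one-hot vector for a word, multiplying it by the weight matrix simply picks out that word's row. The vocabulary, the vector size and the random weights below are all made up for illustration; in a real model the weights are learned during training.

```python
import random

random.seed(0)
vocab = ["clean", "dirty", "room", "staff"]
vector_size = 3

# Hidden-layer weight matrix: one row per vocabulary word (random here, learned in practice)
W = [[random.uniform(-0.5, 0.5) for _ in range(vector_size)] for _ in vocab]

def one_hot(word):
    """One-hot encode `word` over the toy vocabulary."""
    return [1.0 if w == word else 0.0 for w in vocab]

def hidden_activation(word):
    """Multiply the one-hot input by W (a plain vector-matrix product)."""
    x = one_hot(word)
    return [sum(x[i] * W[i][j] for i in range(len(vocab))) for j in range(vector_size)]

# The hidden activation for "room" is just the row of W that belongs to "room"
print(hidden_activation("room") == W[vocab.index("room")])
```

This is why, after training, we can throw away the rest of the network and keep only this matrix.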

In [5]:
# `size` is the Gensim 3.x parameter name; Gensim 4.x renamed it to `vector_size`
model = gensim.models.Word2Vec(documents, size=150, window=10, min_count=2, workers=10)
model.train(documents, total_examples=len(documents), epochs=1)

2020-12-15 14:54:44,334 : INFO : collecting all words and their counts
2020-12-15 14:54:44,335 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-12-15 14:54:44,668 : INFO : PROGRESS: at sentence #10000, processed 1655714 words, keeping 25777 word types
2020-12-15 14:54:45,026 : INFO : PROGRESS: at sentence #20000, processed 3317863 words, keeping 35016 word types
2020-12-15 14:54:45,450 : INFO : PROGRESS: at sentence #30000, processed 5264072 words, keeping 47518 word types
2020-12-15 14:54:45,823 : INFO : PROGRESS: at sentence #40000, processed 7081746 words, keeping 56675 word types
2020-12-15 14:54:46,235 : INFO : PROGRESS: at sentence #50000, processed 9089491 words, keeping 63744 word types
2020-12-15 14:54:46,642 : INFO : PROGRESS: at sentence #60000, processed 11013723 words, keeping 76781 word types
2020-12-15 14:54:46,993 : INFO : PROGRESS: at sentence #70000, processed 12637525 words, keeping 83194 word types
2020-12-15 14:54:47,312 : INFO : PROG

(30350047, 41519355)

In [None]:
# store weight vectors of the hidden layers of the neural network 
# weight vectors are also w2v representation of the vocabulary words
wordvectors = model.wv 

In [None]:
# Print the learned vocabulary of tokens (words)
# (`.vocab` is the Gensim 3.x attribute; Gensim 4.x uses `wordvectors.index_to_key`)
vocabulary = list(wordvectors.vocab)
print(vocabulary)

In [None]:
# Extract all vocabulary word vectors
# (index the KeyedVectors object; indexing the model directly is deprecated)
X = wordvectors[vocabulary]
print(X)

In [None]:
# Create a 3-dimensional PCA model of the word vectors
from sklearn.decomposition import PCA
pca3 = PCA(n_components=3)
projection_tn03 = pca3.fit_transform(X)

# Plot the resulting projection (one dot per vocabulary word)
import matplotlib.pyplot as plt
# Visualizations will be shown in the notebook
%matplotlib inline
# importing Axes3D registers the '3d' projection on older matplotlib versions
from mpl_toolkits.mplot3d import Axes3D


fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')

# create a scatter plot of the projection
ax.scatter3D(projection_tn03[:, 0], projection_tn03[:, 1], projection_tn03[:, 2], c=projection_tn03[:, 0], cmap='Greens')
# add title and axis names
ax.set_title('3D Word2Vec Projection')
ax.set_xlabel('First Projection Axis')
ax.set_ylabel('Second Projection Axis')
ax.set_zlabel('Third Projection Axis')

plt.show()

In [None]:
small_vocab = vocabulary[0:200]
small_X = wordvectors[small_vocab]

# Create a 2-dimensional PCA model of the word vectors
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
projection = pca.fit_transform(small_X)

# Plot the resulting projection
# The dots are annotated with the words
import matplotlib.pyplot as plt
# Visualizations will be shown in the notebook
%matplotlib inline

plt.figure(figsize=(20, 20))

# create a scatter plot of the projection
plt.scatter(projection[:, 0], projection[:, 1], s=50, c='green')
plt.rc('font', family='Lohit Devanagari')
for i, word in enumerate(small_vocab):
    plt.annotate(word, xy=(projection[i, 0], projection[i, 1]), size=10)

plt.show()

### HOMEWORK
## Now, let's look at some output 
These are a few examples that show simple cases of looking up word similarity using the Gensim package.

In [None]:
# look up top 6 words similar to "polite"
w1 = "polite"
.....

# get everything related to stuff on the bed by checking the top 10 similar words
w2 = ["bed","sheet","pillow"]
.....

# similarity between two different words: "dirty" and "smelly"
w3 = "dirty"
w4 = "smelly"
.....

# Which one is the odd one out in this list? "cat", "dog" and "france"
w5 = "cat"
w6 = "dog"
w7 = "france"
.....
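
As background for these exercises, the similarity scores Gensim reports between word vectors are cosine similarities. The sketch below computes one by hand and ranks neighbors with it. The 3-dimensional vectors are invented for illustration (real vectors from this tutorial have 150 dimensions), and `cosine_similarity` and `rank_similar` are hypothetical helpers, not Gensim functions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up toy vectors, NOT learned from the reviews
toy_vectors = {
    "dirty": [0.9, 0.1, 0.0],
    "smelly": [0.8, 0.2, 0.1],
    "france": [0.0, 0.1, 0.9],
}

def rank_similar(word, vectors):
    """Rank all other words by cosine similarity to `word`, most similar first."""
    scores = [(other, cosine_similarity(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

for other, score in rank_similar("dirty", toy_vectors):
    print(other, round(score, 3))
```

A score close to 1 means the two vectors point in nearly the same direction, and a score near 0 means the words are used in unrelated contexts.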