Author: Jose Camacho-Collados, Yipeng Qin

This Jupyter notebook shows how to train and use word embeddings with the [Gensim](https://radimrehurek.com/gensim/) library. Some snippets have been adapted from the [Gensim Word2Vec tutorial](https://rare-technologies.com/word2vec-tutorial/).

Word embeddings are vector representations of words. They are generally **low-dimensional** (often fewer than 1000 dimensions) and encode the **semantics** of words.


## TRAINING WORD EMBEDDINGS (Word2Vec)

---

As usual, we first import the libraries that we are going to use, now including Gensim.

**Note:** All these libraries need to be installed beforehand if you are not using Google Colab. Check their official websites for details on how to install them.

In [2]:
import numpy as np
import nltk
from nltk.tokenize import word_tokenize
import gensim
import requests
import random
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\c21099797\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

To learn word embeddings, in this case Word2Vec ([Mikolov et al. 2013](https://arxiv.org/pdf/1301.3781.pdf)), we first need a sufficiently large text corpus. To this end, we are going to use the IMDb review corpus (the same one used in Coursework 1), which includes all sentences of the training set.






In [3]:
url_train="http://josecamachocollados.com/imdb_train.txt" # Containing all sentences of the imdb training set

#Load training set
response_train = requests.get(url_train)
dataset_file_train = response_train.text.split("\n") # "\n" is a character representing "new line"
random.shuffle(dataset_file_train) # We shuffle all sentences of our corpus

Let's check what the IMDb dataset looks like:

In [48]:
for line in dataset_file_train[:5]:
  print (line)

unlike others , i refuse to call this pitiful excuse for a movie a triumph of style over substance ( i do n't want to give style a bad name ) . still , it 's the most apt description that comes to mind . <br /> <br /> a pointless , unpleasant and ultimately meaningless assault on the eyes and ears , " wonderland " leaves one wondering only why the film was made in the first place and who in their right mind gave the greenlight to this dreary and tangled mess . a biography of porn star john holmes ? a study of who the man was , why he went into the business and how it affected him ? great . bound to be compelling , bound to be entertaining . bound to be enlightening and fascinating on about a million levels ( and i have zero interest in porn ) . <br /> <br /> but a confusing , violent , rashomon-style study of a series of murders holmes was connected with after his career ended ? who in hell cares ? what insights do we gain ? this film completely ignores the most interesting aspect of j

Once the dataset is loaded, we are going to tokenize and store our corpus into sentences.

In [5]:
gensim_imdb_corpus=[]
for line in dataset_file_train:
  gensim_imdb_corpus.append(word_tokenize(line))
# This can take a lot of memory. If that is the case, you can reduce the number of lines
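
If memory becomes a problem, one option is to tokenize lazily instead of storing every tokenized line in a list. The sketch below is only an illustration: `TokenizedCorpus`, `tokenizer` and `max_lines` are hypothetical names, and `str.split` stands in for `word_tokenize` so that the example runs without NLTK. It relies on the assumption that the training code only needs an iterable of token lists that can be iterated more than once.

```python
class TokenizedCorpus:
    """Restartable iterable that tokenizes lines on demand.

    `tokenizer` and `max_lines` are hypothetical parameters for this sketch.
    """

    def __init__(self, lines, tokenizer=str.split, max_lines=None):
        self.lines = lines
        self.tokenizer = tokenizer
        self.max_lines = max_lines

    def __iter__(self):
        # A fresh pass starts every time the object is iterated,
        # which matters when training makes several passes over the corpus.
        for index, line in enumerate(self.lines):
            if self.max_lines is not None and index >= self.max_lines:
                break
            yield self.tokenizer(line)


toy_lines = ["a pointless mess .", "great movie .", "who cares ?"]
corpus = TokenizedCorpus(toy_lines, max_lines=2)
first_pass = list(corpus)
second_pass = list(corpus)
print(first_pass)
```

A plain generator would be exhausted after the first pass, which is why the sketch uses a class with `__iter__`.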

In case we do not know or we forget what "tokenize" means, we can have a quick check to see what has happened.

In [49]:
print ("Before tokenization:", dataset_file_train[0])
print ("After tokenization:", gensim_imdb_corpus[0])

Before tokenization: unlike others , i refuse to call this pitiful excuse for a movie a triumph of style over substance ( i do n't want to give style a bad name ) . still , it 's the most apt description that comes to mind . <br /> <br /> a pointless , unpleasant and ultimately meaningless assault on the eyes and ears , " wonderland " leaves one wondering only why the film was made in the first place and who in their right mind gave the greenlight to this dreary and tangled mess . a biography of porn star john holmes ? a study of who the man was , why he went into the business and how it affected him ? great . bound to be compelling , bound to be entertaining . bound to be enlightening and fascinating on about a million levels ( and i have zero interest in porn ) . <br /> <br /> but a confusing , violent , rashomon-style study of a series of murders holmes was connected with after his career ended ? who in hell cares ? what insights do we gain ? this film completely ignores the most in
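
To get an intuition for what a tokenizer does, here is a minimal sketch that compares plain whitespace splitting with a simple regular-expression tokenizer. `simple_tokenize` is a hypothetical helper written for this illustration. It is not how `word_tokenize` works internally, and NLTK handles many more cases (for example contractions such as *do n't*, visible in the review printed earlier).

```python
import re


def simple_tokenize(text):
    # One token per run of word characters, one token per punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)


raw = "Great movie, really!"
whitespace_tokens = raw.split()       # punctuation stays attached to words
regex_tokens = simple_tokenize(raw)   # punctuation becomes separate tokens
print(whitespace_tokens)
print(regex_tokens)
```

Since this version of the IMDb corpus already has spaces around punctuation, both approaches would give very similar results on it.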

**Note:** The corpus may be further preprocessed if necessary (e.g. lowercased) or further cleaned. In this case the version of the IMDb corpus that we use was already lowercased.
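
As an example of further cleaning, the review printed earlier still contains `<br />` HTML tags. The following sketch shows one possible way to remove them and lowercase the text before tokenization. `clean_line` and `BR_TAG` are hypothetical names, and whether this cleaning actually improves the embeddings is something you would need to check.

```python
import re

# Matches <br>, <br/> and <br /> in any letter case.
BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)


def clean_line(line):
    # Replace line-break tags with a space, lowercase, and collapse whitespace.
    without_tags = BR_TAG.sub(" ", line)
    return " ".join(without_tags.lower().split())


sample = "Still , it 's apt . <br /> <br /> A pointless mess ."
cleaned = clean_line(sample)
print(cleaned)
```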

Finally, we can train our Word2Vec word embedding model! For more information on training Word2Vec with gensim, you can check [here](https://radimrehurek.com/gensim/models/word2vec.html).

In [7]:
from gensim.models import Word2Vec

In [10]:
model = Word2Vec(gensim_imdb_corpus, vector_size=100, window=5, min_count=3)
# vector_size is the number of dimensions of the embeddings we are going to learn
# window is the maximum distance between a target word and the context words considered
# min_count is the minimum number of times that a word needs to occur to be learnt
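
To make the `window` and `min_count` parameters more concrete, the following sketch reproduces their effect on a toy corpus in plain Python. `build_vocab` and `context_pairs` are hypothetical helpers, not Gensim functions, and the sketch is simplified: the actual training procedure includes further details (such as subsampling of frequent words) that are left out here.

```python
from collections import Counter


def build_vocab(sentences, min_count):
    # Keep only the words that occur at least `min_count` times in the corpus.
    counts = Counter(token for sentence in sentences for token in sentence)
    return {token for token, count in counts.items() if count >= min_count}


def context_pairs(sentence, window):
    # Pair each target word with the words at most `window` positions away.
    pairs = []
    for i, target in enumerate(sentence):
        start = max(0, i - window)
        end = min(len(sentence), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, sentence[j]))
    return pairs


toy_corpus = [["the", "movie", "was", "great"], ["the", "film", "was", "bad"]]
vocab = build_vocab(toy_corpus, min_count=2)
pairs = context_pairs(["a", "b", "c", "d"], window=1)
print(sorted(vocab))
print(pairs)
```

With `min_count=2`, only *the* and *was* survive in this toy corpus, which illustrates why rare words have no vector in the trained model.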

**Note:** You can save a model with `model.save` and load it back with `Word2Vec.load`. This enables you to export your models and use them anytime, or to use models trained by someone else (i.e. pre-trained models).

**Exercise (Optional):** Train the same model using [FastText](https://radimrehurek.com/gensim/models/fasttext.html) instead of Word2Vec. FastText is a model similar to Word2Vec that also takes character information into account, which can be useful for noisy text such as that found in social media.

In [29]:
# Continue training the existing Word2Vec model for 10 more epochs
model.train(gensim_imdb_corpus, total_examples=len(gensim_imdb_corpus), epochs=10)

# Save the model
model.save("gensim_imdb.model")



In [30]:
from gensim.models import FastText

model_fasttext = FastText(gensim_imdb_corpus, vector_size=100, window=5, min_count=3)
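
The character information that FastText uses comes from character n-grams of each word. The sketch below extracts them in plain Python to show the idea. `char_ngrams` and `shared_ngrams` are hypothetical helpers; the `<` and `>` boundary markers and the 3 to 6 range follow our understanding of the FastText defaults (`min_n=3`, `max_n=6` in Gensim), and the real model additionally hashes n-grams into buckets, which is not shown.

```python
def char_ngrams(word, min_n=3, max_n=6):
    # Wrap the word in boundary markers so prefixes and suffixes are distinct.
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


def shared_ngrams(word_a, word_b, n=3):
    # Character n-grams of length `n` that both words have in common.
    return set(char_ngrams(word_a, n, n)) & set(char_ngrams(word_b, n, n))


where_trigrams = char_ngrams("where", 3, 3)
print(where_trigrams)
print(sorted(shared_ngrams("movie", "a-movie")))
print(sorted(shared_ngrams("movie", "film")))
```

Words that share many n-grams, such as *movie* and *a-movie*, share part of their representation, whereas *movie* and *film* have no trigram in common and can only be related through their contexts.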

## PLAYING WITH WORD2VEC

---

Now that our model has been trained, we can check:

i) What has the model learned in general?

In [50]:
#print(model.wv.vocab)

# Store just the words + their trained embeddings.
word_vectors = model.wv

print(model.wv.key_to_index)






In [51]:
print(model_fasttext.wv.key_to_index)



ii) How many words are there in the word embeddings?

In [52]:
#print(len(model.wv.vocab))

print(len(model.wv.key_to_index))

40166


In [32]:
print(len(model_fasttext.wv.key_to_index))

40166


iii) What is the vector for each word? Each embedding vector should have 100 dimensions.

In [18]:
#vector_movie=model['movie']
vector_movie=model.wv['movie']
print ("Number of dimensions: "+str(len(vector_movie)))
print (vector_movie)

Number of dimensions: 100
[ 4.7187195   0.6391044   3.2110643  -1.6583861  -3.008352   -0.10702334
 -3.599195   -0.8350251   0.9289836   2.246731    0.24415356 -0.23169012
  0.54113257 -2.197405   -1.7457457   4.111549    1.7398655  -4.299542
  2.320371   -0.15107937 -0.53977436  3.880567   -0.5771787  -1.777853
 -1.0839218   0.50750846 -0.87219536 -0.24411763  1.2296457   0.870925
  1.3740119  -0.47693112 -1.475267    1.8540196  -1.446466   -1.1510545
 -0.17366807 -0.48329112 -1.8186642  -0.72412914 -3.0490034   0.952467
  2.8459554  -0.48527834 -0.45080146 -4.191804   -0.09094614 -3.2207654
 -1.4678178   1.2428933   0.38197935  0.2661844   1.0341221   0.6437449
  0.10011708 -1.3196064  -3.810715   -1.4390714   2.9951015  -2.0260856
 -2.7983303  -1.0379045   2.3958163   0.21883397 -1.9545356  -0.7884925
 -0.9001525   0.21452707 -0.08642765 -0.6695173   2.3629045   1.0489724
 -0.92969394  0.33157843  0.69601643 -1.0386292   2.3327994  -0.82911557
  0.39917257 -1.5917305   1.2185732   0

In [33]:
vector_movie_fasttext=model_fasttext.wv['movie']
print ("Number of dimensions: "+str(len(vector_movie_fasttext)))
print (vector_movie_fasttext)

Number of dimensions: 100
[-1.8847263   2.9914157   3.6427753   2.95253    -2.9798703   2.17487
  1.7798916   1.444022   -1.0086968   0.13797578 -0.19609445  0.7219611
 -1.6022784   0.66163754  1.2486875   0.50136536 -0.27980998 -1.7431422
  0.14650533 -2.580268   -0.6733206   2.4943695   0.61561674  0.26509908
 -0.8795058   4.7205505   0.94685054 -1.4943553  -1.0044161   2.5462122
 -0.7904467  -0.8588277  -2.3303368   0.51841575 -0.6424538  -4.088907
  1.4003118   0.6907385   0.39785296  1.939632   -0.774195    0.86340433
  0.33619282 -2.3723967   0.96883714  0.43044004  3.6778953  -1.1703097
  0.52817243 -2.0437567   1.4481575  -1.3879666  -2.9637227  -1.4382756
 -4.6750665   4.5933294  -0.17462079  1.0467371  -0.2919     -2.2890854
  1.0100875  -1.9243292   0.7238644   1.0158615  -0.5447782   0.1260048
  0.15355998 -0.36525223  2.3095224   0.6608069  -2.5905771  -0.86844903
  1.0944118   0.980903   -0.5229587   1.5734702  -3.0814142   4.359117
  0.6371427   1.7751839   0.37180665  1

In addition, we can check the vocabulary learned with:

In [53]:
#print(model.wv.vocab.keys())

print(model.wv.key_to_index.keys())



In [54]:
print(model_fasttext.wv.key_to_index.keys())



Or

In [21]:
#for word in list(model.wv.vocab.keys())[:5]:
  #print(word)

for word in list(model.wv.key_to_index.keys())[:5]:
  print(word)

the
.
,
and
a


We can also check the similarity (measured by cosine similarity) between some words. Let's start by finding the most similar words to *movie* or *casablanca* in our vector space. We can find the most similar words to any input word by using the `.most_similar` method.

In [23]:
#model.most_similar('movie')

model.wv.most_similar('movie')

[('film', 0.9432273507118225),
 ('flick', 0.7656893730163574),
 ('show', 0.6861786246299744),
 ('documentary', 0.6572632193565369),
 ('picture', 0.6507958769798279),
 ('it', 0.6313016414642334),
 ('sequel', 0.6287752985954285),
 ('episode', 0.6088770031929016),
 ('movies', 0.595389723777771),
 ('mess', 0.5896077156066895)]

In [35]:
model_fasttext.wv.most_similar('movie')

[('a-movie', 0.9727110266685486),
 ('d-movie', 0.9726241827011108),
 ('c-movie', 0.9724273681640625),
 ('movie.i', 0.9628679156303406),
 ('movie.in', 0.9618176221847534),
 ('movie.it', 0.9592311382293701),
 ('tv-movie', 0.9555339217185974),
 ('film', 0.950377881526947),
 ('b-movie', 0.9382290244102478),
 ('moovie', 0.9381521940231323)]

In [24]:
#model.most_similar('casablanca')

model.wv.most_similar('casablanca')

[('smokey', 0.7568488717079163),
 ('dallas', 0.7456918358802795),
 ('anchorman', 0.7387089133262634),
 ('re-animator', 0.7242797017097473),
 ('rouge', 0.7228572368621826),
 ('magnolias', 0.7136237025260925),
 ('goodfellas', 0.7131175398826599),
 ('bueller', 0.7123828530311584),
 ('strada', 0.712144136428833),
 ('michaels', 0.71092689037323)]

In [36]:
model_fasttext.wv.most_similar('casablanca')

[('casa', 0.889998197555542),
 ('cadillac', 0.8795784711837769),
 ('cairo', 0.8738831281661987),
 ('casanova', 0.8594125509262085),
 ('casinos', 0.8516393899917603),
 ('cassi', 0.8495320677757263),
 ('blanca', 0.8494088649749756),
 ('cazalé', 0.8493580222129822),
 ('caesar', 0.8486674427986145),
 ('cagey', 0.84819495677948)]
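
To see what these scores mean, the following sketch computes cosine similarity and a nearest-neighbour ranking by hand. The three-dimensional `toy_vectors` are made up for illustration (real embeddings here have 100 dimensions), and `cosine_similarity` and `most_similar` are hypothetical helpers, not the Gensim implementation.

```python
import math


def cosine_similarity(u, v):
    # Dot product divided by the product of the two vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=3):
    # Rank every other word by its cosine similarity to the query word.
    query = vectors[word]
    scores = [
        (other, cosine_similarity(query, vector))
        for other, vector in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up vectors, only for illustration.
toy_vectors = {
    "movie": [1.0, 0.9, 0.1],
    "film": [0.9, 1.0, 0.0],
    "table": [-0.1, 0.0, 1.0],
}
ranking = most_similar("movie", toy_vectors)
print(ranking)
```

Cosine similarity ranges from -1 to 1: vectors pointing in the same direction score 1 and orthogonal vectors score 0.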

We can also check the similarity between two given words.

In [25]:
#print(model.similarity('movie', 'film'))
print(model.wv.similarity('movie', 'film'))
#print(model.similarity('movie', 'popcorn'))
print(model.wv.similarity('movie', 'popcorn'))
#print(model.similarity('movie', 'table'))
print(model.wv.similarity('movie', 'table'))

0.9432273
0.22088306
-0.014616387


In [37]:
print(model.wv.similarity('movie', 'a-movie'))


0.048673112


Here we can see that words like *movie* and *film* are very close (in fact they are synonyms). Other words like *movie* and *popcorn* are somewhat related, while *movie* and *table* do not seem to be similar at all in this corpus.

**Note:** In this notebook we have learned our own word embeddings on IMDb. However, please note that in many cases we are going to directly use an available pre-trained word embedding model. These are generally trained on large corpora and therefore tend to cover more words and be more accurate. For example, there are pre-trained models for [Word2Vec](https://code.google.com/archive/p/word2vec/), [GloVe](https://nlp.stanford.edu/projects/glove/) or even [FastText trained on Twitter](https://github.com/pedrada88/crossembeddings-twitter).

**Exercise (optional):** Choose a pre-trained model from Word2Vec, GloVe or FastText (there are many available online) and load it using gensim. Check a few similarities and compare it with the word embeddings trained on IMDb.
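
Many pre-trained models are distributed as plain-text files with one word per line followed by its vector values. The sketch below parses that layout with the standard library only, to show what such a file contains. `read_text_vectors` is a hypothetical helper and the file content is made up. In practice you would more likely use Gensim's own loader (`KeyedVectors.load_word2vec_format`); check its documentation for the options that match your file.

```python
import io


def read_text_vectors(handle):
    # Returns a dict mapping each word to its vector (a list of floats).
    vectors = {}
    for line_number, line in enumerate(handle):
        parts = line.rstrip().split(" ")
        # word2vec text files start with a "<vocab_size> <dimensions>" header;
        # GloVe text files have no such header.
        if line_number == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
            continue
        if len(parts) < 2:
            continue  # skip empty or malformed lines
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


# Made-up file content in the word2vec text layout.
toy_file = io.StringIO("2 3\nmovie 0.1 0.2 0.3\nfilm 0.1 0.25 0.3\n")
loaded_vectors = read_text_vectors(toy_file)
print(loaded_vectors["film"])
```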

**Exercise 1:** Train a Word2Vec word embedding model on the IMDb corpus with 75 dimensions and a window size of 8. Then, check the most similar words to *movie* in the vector space and the similarity between *movie* and *table*. Compare the results with the previously trained model.

In [26]:
# Train with 75 dimensions and a window of 8
model2 = Word2Vec(gensim_imdb_corpus, vector_size=75, window=8, min_count=3)

In [27]:
# Check the most similar words to movie
model2.wv.most_similar('movie')

[('film', 0.9196907877922058),
 ('flick', 0.7433812022209167),
 ('picture', 0.6529372930526733),
 ('documentary', 0.6407851576805115),
 ('show', 0.6344099044799805),
 ('it', 0.617644727230072),
 ('movies', 0.6010158061981201),
 ('thing', 0.5990554094314575),
 ('mess', 0.5964698791503906),
 ('turkey', 0.5867173075675964)]

In [28]:
# Check similarity between movie and table
print(model2.wv.similarity('movie', 'table'))

-0.054146286


In [38]:
print(model2.wv.similarity('movie', 'a-movie'))


0.100233175
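
One simple way to compare two models is to measure how much their neighbour lists overlap. The sketch below does this for the ten nearest neighbours of *movie*, using the words copied from the `model` and `model2` outputs shown in this notebook. `neighbour_overlap` is a hypothetical helper, and Jaccard overlap is just one possible measure.

```python
def neighbour_overlap(neighbours_a, neighbours_b):
    # Returns the shared words and the Jaccard overlap of the two lists.
    set_a, set_b = set(neighbours_a), set(neighbours_b)
    shared = set_a & set_b
    return shared, len(shared) / len(set_a | set_b)


# Neighbours of "movie" copied from the outputs shown in this notebook.
model_neighbours = ["film", "flick", "show", "documentary", "picture",
                    "it", "sequel", "episode", "movies", "mess"]
model2_neighbours = ["film", "flick", "picture", "documentary", "show",
                     "it", "movies", "thing", "mess", "turkey"]

shared, jaccard = neighbour_overlap(model_neighbours, model2_neighbours)
print(sorted(shared))
print(round(jaccard, 2))
```

With a trained model, the list of neighbours can be obtained with `[word for word, _ in model.wv.most_similar('movie')]`.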


**Exercise (optional):** Take a corpus of your choice (e.g. from one of the NLP projects) and train a word embedding model using gensim. Check a few similarities of words and compare with the models trained on IMDb.

In [39]:
# Download a corpus of movie reviews
nltk.download('movie_reviews')

from nltk.corpus import movie_reviews

# Get all the words in the corpus
all_words = movie_reviews.words()


[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\c21099797\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\movie_reviews.zip.


In [40]:
# Visualize the first 100 words
print(all_words[:100])

['plot', ':', 'two', 'teen', 'couples', 'go', 'to', ...]


In [41]:
print(len(all_words))

1583820


In [42]:
# Get the frequency distribution of all words
all_words_freq = nltk.FreqDist(all_words)

In [43]:
print(all_words_freq.most_common(10))

[(',', 77717), ('the', 76529), ('.', 65876), ('a', 38106), ('and', 35576), ('of', 34123), ('to', 31937), ("'", 30585), ('is', 25195), ('in', 21822)]
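
The frequency distribution is essentially a word counter. As a rough illustration of what is being computed, the same kind of count can be obtained with `collections.Counter` from the standard library; the token list below is made up.

```python
from collections import Counter

# Made-up tokens, only for illustration.
tokens = ["the", "movie", ",", "the", "film", ",", "the", "end"]
token_counts = Counter(tokens)
top_two = token_counts.most_common(2)
print(top_two)
```

As in the output for the real corpus, punctuation and function words dominate the top of the list.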


In [44]:
# Train the model with the corpus of movie reviews
model3 = Word2Vec(movie_reviews.sents(), vector_size=100, window=5, min_count=3)

In [45]:
# Compare the most similar words to movie
model3.wv.most_similar('movie')

[('film', 0.9465670585632324),
 ('picture', 0.8432642221450806),
 ('sequel', 0.7543413639068604),
 ('case', 0.6854908466339111),
 ('ending', 0.6776500344276428),
 ('premise', 0.677554190158844),
 ('story', 0.6584279537200928),
 ('comedy', 0.6540756225585938),
 ('thing', 0.653117299079895),
 ('plot', 0.6506665945053101)]

In [46]:
# Compare the most similar words to film
model3.wv.most_similar('film')

[('movie', 0.9465671181678772),
 ('picture', 0.8319717049598694),
 ('premise', 0.7189927697181702),
 ('sequel', 0.7023196816444397),
 ('case', 0.6926998496055603),
 ('ending', 0.6790832281112671),
 ('story', 0.6710167527198792),
 ('plot', 0.668138861656189),
 ('script', 0.6632994413375854),
 ('installment', 0.6486179828643799)]

In [47]:
# Compare the similarity of movie and film in the models trained on the IMDb corpus and on the movie reviews corpus
print(model.wv.similarity('movie', 'film'))
print(model3.wv.similarity('movie', 'film'))

# Compare the similarity of movie and popcorn in the same two models
print(model.wv.similarity('movie', 'popcorn'))
print(model3.wv.similarity('movie', 'popcorn'))

0.94246566
0.946567
0.13621056
0.27083156
