# word2vec implementation with Python (& Gensim)

Effective word representation is essential for natural language processing (NLP). Word2vec is a model developed at Google that learns dense vector representations of words, known as word embeddings.

Note: This code is written in Python 3.6.1 (+Gensim 2.3.0).

Gensim is a free Python library for topic modelling, document indexing, and similarity retrieval with large corpora. It is useful for NLP and information retrieval.

In [None]:
import re
import numpy as np
import nltk

from nltk.corpus import gutenberg
from gensim.models import Word2Vec
from multiprocessing import Pool
from scipy import spatial

### Import training dataset
- Import Shakespeare's Hamlet corpus from the NLTK library.

In [None]:
nltk.download('gutenberg')
nltk.download('punkt')
sentences = list(gutenberg.sents('shakespeare-hamlet.txt'))   # import the corpus and convert into a list

In [None]:
print('Type of corpus: ', type(sentences))
print('Length of corpus: ', len(sentences))

In [None]:
print(sentences[0])    # title, author, and year
print(sentences[1])
print(sentences[10])

### Preprocess data
- Use the `re` module to preprocess the data
- Convert all letters to lowercase
- Remove punctuation, numbers, etc.

In [None]:
# keep only purely alphabetic tokens and convert them to lowercase
for i in range(len(sentences)):
    sentences[i] = [word.lower() for word in sentences[i] if re.fullmatch('[a-zA-Z]+', word)]

In [None]:
print(sentences[0])    # title, author, and year
print(sentences[1])
print(sentences[10])
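To make the effect of the filter concrete, the sketch below applies two regex variants to a made-up token list (the tokens are illustrative and not taken from the corpus). `re.match` only anchors at the start of a token, so a token such as `act1` passes, while `re.fullmatch` requires the whole token to be alphabetic.

```python
import re

# Hypothetical tokens for illustration; not taken from the Hamlet corpus.
tokens = ['The', 'Tragedie', 'of', 'Hamlet', ',', '1599', 'act1', ']']

# re.match anchors only at the start: tokens that merely begin with a letter pass.
starts_with_letter = [t.lower() for t in tokens if re.match('^[a-zA-Z]+', t)]

# re.fullmatch requires every character of the token to be a letter.
letters_only = [t.lower() for t in tokens if re.fullmatch('[a-zA-Z]+', t)]

print(starts_with_letter)
print(letters_only)
```

Which variant is appropriate depends on whether mixed tokens such as `act1` should count as words.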

### Create and train model
- Create a word2vec model and train it on the Hamlet corpus
- Key parameters (see https://radimrehurek.com/gensim/models/word2vec.html):
    - **sentences**: training data (a list of tokenized sentences, where each sentence is a list of words)
    - **size**: dimensionality of the embedding vectors
    - **sg**: training algorithm (CBOW if 0, skip-gram if 1)
    - **window**: maximum number of words considered on each side of the current word (with a window size of 3, up to 3 words to the left and 3 words to the right are used as context)
    - **min_count**: minimum number of occurrences a word needs to be included in the vocabulary
    - **iter**: number of training iterations (epochs) over the corpus
    - **workers**: number of worker threads used for training
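The role of **window** in skip-gram training can be illustrated with a small sketch that lists the (center, context) pairs a sentence yields. This is a simplified illustration, not Gensim's implementation: the helper `skipgram_pairs` is a hypothetical name, and the sketch ignores details of the real trainer such as negative sampling and subsampling of frequent words.

```python
def skipgram_pairs(sentence, window):
    """Return (center, context) pairs using up to `window` words on each side."""
    pairs = []
    for i, center in enumerate(sentence):
        lo = max(0, i - window)                   # left edge, clipped at sentence start
        hi = min(len(sentence), i + window + 1)   # right edge, clipped at sentence end
        for j in range(lo, hi):
            if j != i:                            # a word is not its own context
                pairs.append((center, sentence[j]))
    return pairs

sentence = ['to', 'be', 'or', 'not', 'to', 'be']
pairs = skipgram_pairs(sentence, window=1)
print(pairs)
print(len(skipgram_pairs(sentence, window=3)))
```

A larger window produces more pairs per sentence and tends to capture broader topical context, while a smaller window focuses on immediate neighbors.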

In [None]:
from multiprocessing import cpu_count   # public API for the number of CPU cores

model = Word2Vec(sentences=sentences, size=100, sg=1, window=3, min_count=1, iter=10, workers=cpu_count())
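With `min_count` set to 1, every word that occurs at least once receives a vector, including words seen only a single time. The sketch below shows how a frequency threshold shrinks the vocabulary on a toy corpus. The helper `build_vocab` is a hypothetical name used for illustration and is not part of Gensim.

```python
from collections import Counter

def build_vocab(sentences, min_count):
    """Count words across all sentences and keep those seen at least `min_count` times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}

# Toy corpus for illustration only.
toy = [['to', 'be', 'or', 'not', 'to', 'be'],
       ['that', 'is', 'the', 'question']]

vocab_all = build_vocab(toy, min_count=1)        # keeps every word
vocab_frequent = build_vocab(toy, min_count=2)   # keeps only repeated words

print(len(vocab_all), sorted(vocab_frequent))
```

Rare words have few training examples, so raising the threshold is a common way to trade vocabulary coverage for more reliable vectors.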

In [None]:
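# L2-normalize the word vectors; replace=True discards the raw vectors to save memory,
# so the model cannot be trained further after this call
model.init_sims(replace=True)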
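In this version of Gensim, `init_sims(replace=True)` rescales each word vector to unit length. A minimal sketch with plain Python lists shows why that is convenient: once vectors have length 1, their dot product equals their cosine similarity. The vectors and helper names below are illustrative only.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    """Scale a vector so that its Euclidean length is 1."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

u = [3.0, 4.0]
v = [4.0, 3.0]
u_hat = l2_normalize(u)
v_hat = l2_normalize(v)

print(dot(u_hat, u_hat))   # squared length of the normalized vector
print(dot(u_hat, v_hat))   # dot product of the unit vectors
print(cosine(u, v))        # cosine similarity of the original vectors
```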

### Save and load model
- A trained word2vec model can be saved to disk and loaded again later
- Loading a saved model avoids the time needed to retrain it from scratch

In [None]:
model.save('word2vec_model')

In [None]:
model = Word2Vec.load('word2vec_model')

### Similarity calculation
- Similarity between embedded words (i.e., vectors) can be computed using metrics such as cosine similarity
- For other metrics and comparisons between them, refer to: https://github.com/taki0112/Vector_Similarity
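As a rough illustration of how metrics can disagree, the sketch below compares cosine similarity with Euclidean distance on hand-picked 2-D vectors. Cosine similarity depends only on direction, whereas Euclidean distance is also sensitive to vector length. The vectors and function names are made up for this example.

```python
import math

def cosine_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def euclidean_dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

a = [1.0, 2.0]
b = [2.0, 4.0]    # same direction as a, twice the length
c = [2.0, -1.0]   # perpendicular to a

print(cosine_sim(a, b), euclidean_dist(a, b))
print(cosine_sim(a, c), euclidean_dist(a, c))
```

Here `a` and `b` point the same way, so their cosine similarity is maximal even though they are not at the same position.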

In [None]:
model.most_similar('king')

In [None]:
# define a function that computes the cosine similarity between two word vectors
# (scipy returns the cosine distance, so similarity = 1 - distance)
def cosine_similarity(v1, v2):
    return 1 - spatial.distance.cosine(v1, v2)

In [None]:
v1 = model['king']
v2 = model['queen']
cosine_similarity(v1, v2)

In [None]:
# word analogy: if the embedding captures it, king - man + woman lies close to queen
a = model['king'] - model['man'] + model['woman']
b = model['queen']
print(a)
print(b)
cosine_similarity(a, b)
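The analogy arithmetic above can be turned into a nearest-neighbor search. The sketch below does this on hand-made 2-D vectors (axis 0 loosely meaning "royalty", axis 1 "gender"); both the vectors and the helper `analogy` are hypothetical and chosen so that the result is easy to check. Gensim offers a comparable query through `most_similar(positive=[...], negative=[...])`, although its internal weighting differs from this simplified version, and on a corpus as small as Hamlet the results may be noisy.

```python
import math

def cosine_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical embeddings picked by hand for illustration.
toy_vectors = {
    'king':   [1.0, 1.0],
    'queen':  [1.0, -1.0],
    'man':    [0.0, 1.0],
    'woman':  [0.0, -1.0],
    'castle': [1.0, 0.0],
}

def analogy(vectors, positive, negative):
    """Add the positive vectors, subtract the negative ones, return the nearest other word."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for w in positive:
        target = [t + x for t, x in zip(target, vectors[w])]
    for w in negative:
        target = [t - x for t, x in zip(target, vectors[w])]
    exclude = set(positive) | set(negative)   # query words are not valid answers
    candidates = [w for w in vectors if w not in exclude]
    return max(candidates, key=lambda w: cosine_sim(target, vectors[w]))

best = analogy(toy_vectors, positive=['king', 'woman'], negative=['man'])
print(best)
```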