# word2vec implementation with Python (& Gensim)

Effective word representation is essential for natural language processing (NLP). Word2vec is a model developed at Google that learns dense vector representations of words, known as word embeddings.

Note: Run this code with Python 3.6 or above and Gensim 3.8.3.

Gensim is a free Python library for topic modelling, document indexing, and similarity retrieval with large corpora. It is useful for NLP and information retrieval.

In [None]:
import re
import nltk

from nltk.corpus import gutenberg
from gensim.models import Word2Vec

### Import training dataset
- Load Shakespeare's Hamlet from the Gutenberg corpus in the NLTK library.

In [None]:
nltk.download('gutenberg')   # default download path on Windows: C:\users\<user>\AppData\Roaming\nltk_data
nltk.download('punkt')       # sentence tokenizer: splits a text into a list of sentences
sentences = list(gutenberg.sents('shakespeare-hamlet.txt'))   # load the corpus as a list of tokenized sentences

In [None]:
print('Type of corpus: ', type(sentences))
print('Length of corpus: ', len(sentences))

In [None]:
print(sentences[0])    # title, author, and year
print(sentences[1])
print(sentences[10])

### Preprocess data
- Use the `re` module to preprocess the data
- Convert all letters to lowercase
- Remove punctuation, numbers, etc.

In [None]:
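# Keep only purely alphabetic tokens and lowercase them.
# re.fullmatch checks the whole token; re.match would only check its beginning.
sentences = [[word.lower() for word in sentence if re.fullmatch('[a-zA-Z]+', word)]
             for sentence in sentences]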
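
The difference between matching only the start of a token and matching the whole token can be seen in a small sketch. The token list below is hypothetical and chosen for illustration; it is not output from the corpus. `re.match` anchors only at the beginning of the string, so a token such as `act1` passes the filter, while `re.fullmatch` requires every character to be a letter.

```python
import re

# Hypothetical tokens for illustration (not taken from the corpus output)
tokens = ['The', 'Tragedie', '1599', ']', 'act1', 'Hamlet', '.']

# re.match anchors only at the start, so 'act1' is kept
kept_match = [t.lower() for t in tokens if re.match('[a-zA-Z]+', t)]

# re.fullmatch requires the entire token to consist of letters
kept_fullmatch = [t.lower() for t in tokens if re.fullmatch('[a-zA-Z]+', t)]

print(kept_match)
print(kept_fullmatch)
```

Which filter is preferable depends on whether mixed tokens should count as words; for the goal of removing numbers, the stricter check is the safer choice.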

In [None]:
print(sentences[0])    # title and author (the year and brackets have been removed)
print(sentences[1])
print(sentences[10])

### Create and train model
- Create a word2vec model and train it on the Hamlet corpus
- Key parameters (https://radimrehurek.com/gensim/models/word2vec.html)
    - **sentences**: training data (an iterable of tokenized sentences, i.e. a list of lists of words)
    - **size**: dimension of the embedding space (renamed `vector_size` in Gensim 4)
    - **window**: maximum distance between the current word and a context word (with a window size of 3, up to 3 words to the left and 3 words to the right are considered)
    - **iter**: number of training iterations (epochs) over the corpus (renamed `epochs` in Gensim 4)
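
To make the window parameter concrete, here is a minimal sketch that lists the context words for one position in a sentence. The helper `context_words` is hypothetical and not part of Gensim. It treats the window as a fixed span, which is a simplification of what the library does internally during training.

```python
def context_words(tokens, center, window):
    """Return the tokens within `window` positions of index `center`."""
    left = tokens[max(0, center - window):center]      # up to `window` words on the left
    right = tokens[center + 1:center + 1 + window]     # up to `window` words on the right
    return left + right

# Hypothetical example sentence, already tokenized and lowercased
tokens = ['to', 'be', 'or', 'not', 'to', 'be', 'that', 'is', 'the', 'question']

middle_context = context_words(tokens, center=4, window=3)
edge_context = context_words(tokens, center=0, window=3)

print(middle_context)
print(edge_context)
```

At the start or end of a sentence the context is simply shorter, because there are fewer neighbors on one side.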

In [None]:
print("Training ...")
model = Word2Vec(sentences=sentences, size=20, window=3, iter=100)
print("Training is done!")
# Words that occur fewer than min_count times (default: 5) are not in the vocabulary
print("Vocabulary size: ", len(model.wv.vocab))

### Test

In [None]:
model.wv.most_similar('king')

In [None]:
model.wv.most_similar('queen')

In [None]:
model.wv['king']    # embedding vector for 'king'
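
`most_similar` ranks words by the cosine similarity between their embedding vectors. The sketch below computes that measure in plain Python. The three-dimensional vectors are hypothetical values chosen for illustration, not embeddings from the trained model, and `cosine_similarity` is a helper written for this example.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors (not taken from the trained model)
toy_vectors = {
    'king':  [1.0, 2.0, 0.0],
    'queen': [2.0, 4.0, 0.0],   # same direction as 'king'
    'ghost': [0.0, 0.0, 3.0],   # orthogonal to 'king'
}

# Rank the other words by similarity to 'king', highest first
ranking = sorted(
    ((word, cosine_similarity(toy_vectors['king'], vec))
     for word, vec in toy_vectors.items() if word != 'king'),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranking)
```

Cosine similarity depends only on the direction of the vectors, not their length, which is why `queen` scores close to 1 here even though its vector is twice as long as the one for `king`.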

### Save and load model
- A trained word2vec model can be saved to disk and loaded again later
- Loading a saved model avoids the time needed to train it again

In [None]:
model.save('word2vec_model')

In [None]:
model = Word2Vec.load('word2vec_model')