# Word2Vec with gensim


Main source: https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/word2vec.py

Reddit QA: https://www.kaggle.com/jiriroz/qa-jokes

You need a Kaggle account to download the data. Either register there or e-mail the author of this notebook.

Installing Cython is recommended as well: gensim automatically uses its Cython-compiled routines when they are available, which greatly reduces training time.

#### Import modules

In [None]:
import os
import pandas as pd
import nltk
import gensim
import logging
from gensim import corpora, models, similarities

# If the nltk tokenizer fails, its data files are probably missing. In that case, run the line below.
# nltk.download()

# This notebook uses gensim, cython and plotly, which are usually not pre-installed. For Anaconda:
# conda install gensim cython plotly

# log progress information during training
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)


#### Include data

In [None]:
df = pd.read_csv('data/question-answer-jokes/jokes.csv')  # set path

df

#### Generate your corpus

Remark: the answers are appended after the questions, so every question and every answer is a separate entry in the corpus!

In [None]:
x = df['Question'].values.tolist()
y = df['Answer'].values.tolist()

corpus = x+y


In [None]:
print(corpus)

#### Preprocessing (if needed)

In [None]:
# Tokenize each text with nltk and lowercase every token. How you preprocess the corpus is up to you.
# Alternative without lowercasing:
# tok_corp = [nltk.word_tokenize(sent) for sent in corpus]
tok_corp = [[w.lower() for w in nltk.word_tokenize(text)] for text in corpus]

# hint: word_tokenize
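
If the NLTK tokenizer data cannot be downloaded, a regular expression can stand in for `word_tokenize`. The following is a minimal sketch and not what NLTK does internally; the helper name `simple_tokenize` and the pattern are assumptions made for illustration.

```python
import re

# Hypothetical fallback pattern: runs of letters, digits and apostrophes,
# or any single character that is neither whitespace nor part of a word.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9']+|[^\sA-Za-z0-9']")


def simple_tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text.lower())


sample = ["Why did the chicken cross the road?", "To get to the other side."]
tok_sample = [simple_tokenize(s) for s in sample]
print(tok_sample)
```

The result has the same shape as `tok_corp` (a list of token lists), so it could be passed to the model in the same way.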

#### Generate your Word2Vec!

You will not see any math on the surface: all operations are computed in the background. Nevertheless, there are around 20 parameters that can be adjusted to the task.



CBOW (continuous bag of words): _predicting the word given its context_

SG (skip-gram): _predicting the context given a word_
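
The difference between the two training schemes can be sketched by listing the (input, target) examples each one derives from a sentence. This is a simplified illustration with hypothetical helper names; gensim additionally varies the effective window size and may downsample frequent words, which the sketch ignores.

```python
def context_windows(tokens, window=2):
    """Yield (center, context_words) for every position in the sentence."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield center, left + right


def cbow_examples(tokens, window=2):
    # CBOW: the whole context is the input, the center word is the target.
    return [(context, center) for center, context in context_windows(tokens, window)]


def skipgram_examples(tokens, window=2):
    # Skip-gram: the center word is the input, each context word is a separate target.
    return [(center, ctx)
            for center, context in context_windows(tokens, window)
            for ctx in context]


sentence = ["what", "does", "the", "fox", "say"]
print(cbow_examples(sentence, window=1)[2])       # (['does', 'fox'], 'the')
print(skipgram_examples(sentence, window=1)[:3])  # [('what', 'does'), ('does', 'what'), ('does', 'the')]
```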

In [None]:
model = gensim.models.Word2Vec(tok_corp, min_count=1, size=50)
# Note: in gensim >= 4.0 the parameters `size` and `iter` were renamed to `vector_size` and `epochs`.

# 'gensim' is the library we use
# 'models' is the subpackage that contains the different models
# 'Word2Vec()' is the class we need. Its constructor takes, among others, the following parameters:

# param: sentences: the tokenized corpus (first positional argument)
# param: size: dimensionality of the word vectors (default is 100)
# param: alpha: initial learning rate (default is 0.025; drops linearly to `min_alpha` as training progresses)
# param: window: maximum distance between the current and predicted word within a sentence
# param: min_count: ignore all words with total frequency lower than this
# param: workers: number of threads to run in parallel
# param: sg: training algorithm; sg=1 selects skip-gram, sg=0 (default) selects CBOW
# param: iter: number of iterations over the corpus (default is 5)
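
The linear decay of the learning rate described for `alpha` can be sketched as follows. The function name `linear_alpha` is hypothetical, and gensim computes its schedule internally from the training progress, so this only illustrates the idea. The default values 0.025 and 0.0001 correspond to gensim's defaults for `alpha` and `min_alpha`.

```python
def linear_alpha(progress, alpha=0.025, min_alpha=0.0001):
    """Learning rate after a fraction `progress` (0.0 to 1.0) of training is done."""
    progress = min(max(progress, 0.0), 1.0)  # clamp to the valid range
    return alpha - (alpha - min_alpha) * progress


# Print the learning rate at the start, the midpoint and the end of training.
for p in (0.0, 0.5, 1.0):
    print(p, round(linear_alpha(p), 6))
```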

In [None]:
print(model)

In [None]:
#i = 0
#while i < 15:
#    print(tok_corp[i])
#    i += 1

#### Save the model!

It is recommended to persist the model to disk so that it stays available. The saved model can be reloaded for further training or other modifications.

Question: what does the file written by the save() function look like? What is the problem here?

In [None]:
# Persist model to disk
model.save('testmodel')

In [None]:
model.wv.save_word2vec_format('testmodel.txt', binary=False)

# hint: the attribute 'wv' holds the word vectors
# hint: the method save_word2vec_format() takes useful parameters...
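
With `binary=False`, the word2vec text format is human-readable: a header line with the vocabulary size and the vector size, followed by one line per word with its vector components. The sketch below writes and parses this layout for a hand-made toy vocabulary; the helper names are hypothetical and the code does not use gensim.

```python
import io


def write_word2vec_text(vectors, stream):
    """Write a {word: [floats]} mapping in the plain-text word2vec layout."""
    dim = len(next(iter(vectors.values())))
    stream.write(f"{len(vectors)} {dim}\n")  # header: vocabulary size, vector size
    for word, vec in vectors.items():
        stream.write(word + " " + " ".join(repr(float(v)) for v in vec) + "\n")


def read_word2vec_text(stream):
    """Parse the layout written above back into a {word: [floats]} mapping."""
    vocab_size, dim = (int(n) for n in stream.readline().split())
    vectors = {}
    for _ in range(vocab_size):
        word, *values = stream.readline().rstrip("\n").split(" ")
        if len(values) != dim:
            raise ValueError(f"expected {dim} values for {word!r}, got {len(values)}")
        vectors[word] = [float(v) for v in values]
    return vectors


toy = {"fox": [0.1, -0.2, 0.3], "dog": [0.0, 0.5, -0.5]}  # made-up vectors
buffer = io.StringIO()
write_word2vec_text(toy, buffer)
print(buffer.getvalue())

buffer.seek(0)
restored = read_word2vec_text(buffer)
```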

#### Load pre-trained models

For the purpose of re-using some existing model, use the load() function.

In [None]:
model = gensim.models.Word2Vec.load('testmodel')

#### Built-in functions: similarity and vector representation

In [None]:
# The most similar words can be printed with the method most_similar(), which accepts a word or a vector.
# It is a method of the word vectors stored in model.wv.
# Try out 'hi', 'mom', 'dad'...
# What kind of similarity measure do you think is used here?

model.wv.most_similar('oops')  # hi, mom, dad ...

In [None]:
# The similarity of two words can be calculated with the method similarity().

model.wv.similarity('crime', 'officer')
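
The value returned by similarity() is the cosine similarity of the two word vectors. A minimal pure-Python sketch of that measure, using made-up 2-D vectors instead of a trained model:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])
print(same_direction, orthogonal, opposite)
```

Because only the angle matters, two vectors that point in the same direction count as maximally similar regardless of their lengths.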

In [None]:
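# The log probability of a sentence under the model.
# Note: gensim implements score() only for models trained with hierarchical softmax (hs=1, negative=0);
# with the default settings used above, this call raises an error.
# The tokens are lowercased to match the preprocessing of the corpus.

#model.score(["the fox jumped over a lazy dog".split()])
model.score(["what does the fox say".split()])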

In [None]:
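# predict_output_word() returns the most probable center words for the given context words, with their probabilities.
# It only works for models trained with negative sampling (the default).
# The context words are lowercased to match the preprocessing of the corpus.

model.predict_output_word(['what', 'does', 'the', 'fox'])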

#### Calculate: (king - man) + woman = ?

In [None]:
# The most_similar() method takes arguments such as positive=[] and negative=[] to add and subtract vectors.
# What is the output for the iconic query with our language model?

model.wv.most_similar(positive=['woman', 'king'], negative=['man'])
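
The idea behind the positive/negative query can be sketched with hand-made toy vectors: add the positive vectors, subtract the negative ones, and rank the remaining words by cosine similarity to the result. The vectors and the helper `analogy` are invented for illustration; gensim's most_similar() works on length-normalized vectors, so its scores would differ from this simplified version.

```python
import math

# Made-up 3-D vectors; the axes loosely stand for (royalty, maleness, fruitiness).
toy_vectors = {
    "king":  [0.9, 0.9, 0.0],
    "queen": [0.9, 0.1, 0.0],
    "man":   [0.2, 0.9, 0.0],
    "woman": [0.2, 0.1, 0.0],
    "apple": [0.0, 0.0, 1.0],
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def analogy(positive, negative, vectors, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    exclude = set(positive) | set(negative)  # the query words themselves are skipped
    scored = [(w, cosine(target, vec)) for w, vec in vectors.items() if w not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


result = analogy(positive=["woman", "king"], negative=["man"], vectors=toy_vectors)
print(result)
```

With these toy vectors the top-ranked word is "queen"; with a model trained on a small joke corpus, the result is likely to be far less clean.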

##### (DON'T) try this at home: GoogleNews vectors

... loading them might exhaust your memory

Download the pre-trained vectors: https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing

In [None]:
# For the google data only

from gensim.models import KeyedVectors

# load the google word2vec model
filename = 'GoogleNews-vectors-negative300.bin'
model_google = KeyedVectors.load_word2vec_format(filename, binary=True)

# calculate: (king - man) + woman = ?
result = model_google.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
print(result)


#### Visualization is the key! .....?

We usually visualize our results whenever possible. The same goes for Word2Vec: the vectors can be plotted and visualized. However, this is not unproblematic. Why?

In [None]:
from sklearn.decomposition import IncrementalPCA    # initial reduction (imported but not used below)
from sklearn.manifold import TSNE                   # final reduction
import numpy as np                                  # array handling

from plotly.offline import init_notebook_mode, iplot, plot
import plotly.graph_objs as go

def reduce_dimensions(model, plot_in_notebook = True):

    num_dimensions = 2  # final num dimensions (2D, 3D, etc)

    vectors = []        # positions in vector space
    labels = []         # keep track of words to label our data again later
    for word in model.wv.vocab:  # gensim >= 4.0: iterate over model.wv.index_to_key instead
        vectors.append(model.wv[word])
        labels.append(word)

    # convert both lists into numpy arrays for reduction
    vectors = np.asarray(vectors)
    labels = np.asarray(labels)

    # reduce using t-SNE
    logging.info('Starting t-SNE dimensionality reduction. This may take some time.')
    tsne = TSNE(n_components=num_dimensions, random_state=0)
    vectors = tsne.fit_transform(vectors)

    x_vals = [v[0] for v in vectors]
    y_vals = [v[1] for v in vectors]
        
    # Create a trace
    trace = go.Scatter(
        x=x_vals,
        y=y_vals,
        mode='text',
        text=labels
        )
    
    data = [trace]
    
    logging.info('All done. Plotting.')
    
    if plot_in_notebook:
        init_notebook_mode(connected=True)
        iplot(data, filename='word-embedding-plot')
    else:
        plot(data, filename='word-embedding-plot.html')

In [None]:
reduce_dimensions(model)

##### Further reading: "Under the hood" of gensim and Word2Vec:
1. http://adventuresinmachinelearning.com/word2vec-tutorial-tensorflow/
2. http://adventuresinmachinelearning.com/gensim-word2vec-tutorial/
3. http://www.claudiobellei.com/2018/01/07/backprop-word2vec-python/