## Word2vec

This notebook introduces the word2vec algorithm.

## Readings

Here are the resources I used to build this notebook. I suggest reading these either beforehand or while you're working on this material.

* A really good [conceptual overview](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/) of word2vec from Chris McCormick 
* [First word2vec paper](https://arxiv.org/pdf/1301.3781.pdf) from Mikolov et al.
* [NIPS paper](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf) with improvements for word2vec also from Mikolov et al.
* An [implementation of word2vec](http://www.thushv.com/natural_language_processing/word2vec-part-1-nlp-with-deep-learning-with-tensorflow-skip-gram/) from Thushan Ganegedara
* TensorFlow [word2vec tutorial](https://www.tensorflow.org/tutorials/word2vec)
* Gensim [word2vec tutorial](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/online_w2v_tutorial.ipynb)

## Word2Vec

The word2vec algorithm learns a dense vector for each word, which is a much more compact representation than a sparse encoding such as one-hot vectors. These vectors also capture semantic information about the words: words that show up in similar contexts, such as "black", "white", and "red", will have vectors near each other. There are two architectures for implementing word2vec: CBOW (Continuous Bag-of-Words), which predicts a word from its surrounding context, and Skip-gram, which predicts the surrounding context from a word.

<img src="assets/word2vec_architectures.png" width="500">

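To make "vectors near each other" concrete, here is a minimal sketch that compares toy vectors with cosine similarity. The three-dimensional vectors and the `toy_vectors` name are made up for illustration; real word2vec vectors are learned from data and have many more dimensions.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 3-dimensional embeddings, invented for this example only.
toy_vectors = {
    "black": np.array([0.9, 0.1, 0.0]),
    "white": np.array([0.8, 0.2, 0.1]),
    "loan": np.array([0.0, 0.1, 0.9]),
}

sim_colors = cosine_similarity(toy_vectors["black"], toy_vectors["white"])
sim_mixed = cosine_similarity(toy_vectors["black"], toy_vectors["loan"])

# The two color words point in nearly the same direction; "loan" does not.
print(sim_colors > sim_mixed)
```

With these toy values the two color words score close to 1, while the unrelated pair scores close to 0.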

## Training Data

First, choose the window size: the number of surrounding words to look at on each side of the current word. Then go through each word in the corpus and create pairs of input and output words as training data.

<img src="assets/rolling_window.png" width="700">

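The pairing step can be sketched in a few lines of plain Python. The `skipgram_pairs` helper below is a hypothetical name, and it uses a fixed window for simplicity, so treat it as an illustration of the idea rather than what a library does internally.

```python
def skipgram_pairs(tokens, window):
    """Return (input_word, output_word) pairs for one tokenized sentence."""
    pairs = []
    for i, center in enumerate(tokens):
        # Clip the window at the sentence boundaries.
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not paired with itself
                pairs.append((center, tokens[j]))
    return pairs

sentence = ["the", "quick", "brown", "fox"]
pairs = skipgram_pairs(sentence, window=1)
print(pairs)
```

With `window=1`, each word is paired only with its immediate neighbors, so "quick" yields the pairs `("quick", "the")` and `("quick", "brown")`.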

## Building the graph

From [Chris McCormick's blog](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/), we can see the general structure of our network.
![embedding_network](./assets/skip_gram_net_arch.png)

The input words are passed in as integers. These go into a hidden layer of linear units, then into a softmax layer. We'll use the softmax layer to make a prediction like normal.

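As a rough sketch of that forward pass, the NumPy snippet below uses randomly initialized weights and toy sizes. The names `W_embed`, `W_out`, and `forward` are assumptions for this example, and no training happens here.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, n_hidden = 5, 3  # toy sizes, chosen for illustration

W_embed = rng.normal(size=(vocab_size, n_hidden))  # hidden-layer weights
W_out = rng.normal(size=(n_hidden, vocab_size))    # softmax-layer weights

def forward(word_index):
    """Map an integer word index to a probability for every vocabulary word."""
    hidden = W_embed[word_index]         # linear hidden layer: no activation
    logits = hidden @ W_out
    exp = np.exp(logits - logits.max())  # subtract the max for stability
    return exp / exp.sum()

probs = forward(2)
print(probs.shape, probs.sum())
```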
The idea here is to train the hidden layer weight matrix to find efficient representations for our words. We can discard the softmax layer because we don't really care about making predictions with this network. We just want the embedding matrix so we can use it in other networks we build from the dataset.

![lookup](assets/lookup_matrix.png)

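Assuming the figure above shows the usual one-hot trick, the short check below illustrates why the embedding matrix can be used as a lookup table: multiplying a one-hot vector by the matrix selects a single row. The matrix values are arbitrary placeholders.

```python
import numpy as np

vocab_size, n_dim = 4, 3
# Placeholder embedding matrix: row i is the vector for word index i.
embedding = np.arange(vocab_size * n_dim, dtype=float).reshape(vocab_size, n_dim)

word_index = 2
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0

via_matmul = one_hot @ embedding    # full matrix multiplication
via_lookup = embedding[word_index]  # direct row lookup, same result

print(np.array_equal(via_matmul, via_lookup))
```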
In [4]:
import gensim
from gensim.models.word2vec import Word2Vec
import pickle
import os

In [None]:
## Input is a list of tokenized sentences,
# e.g. input = [['good', 'morning', '!'], ['how', 'are', 'you', '?']]
# Here I just read a preprocessed pickle file with all my training sentences;
# you will have to use your own data source.
with open("sentances.p", "rb") as f:
    total_results = pickle.load(f)

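If you start from raw text instead of a pickle file, a minimal tokenizer along these lines produces the expected list-of-lists format. The `tokenize` helper and its regular expression are assumptions for illustration; a real corpus may need more careful preprocessing.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

raw_sentences = ["Good morning!", "How are you?"]
tokenized = [tokenize(s) for s in raw_sentences]
print(tokenized)
```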
In [5]:
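### initialize model and build vocabulary
n_dim = 300           # dimensionality of the word vectors
window = 7            # max distance between the current word and a context word
downsampling = 0.001  # threshold for downsampling very frequent words
seed = 1              # random seed
num_workers = os.cpu_count()  # one worker thread per CPU core; may not be optimal
min_count = 30        # ignore words that occur fewer than 30 times
imf_w2v = Word2Vec(
    sg=1,             # 1 = skip-gram, 0 = CBOW
    seed=seed,
    workers=num_workers,
    size=n_dim,       # renamed to vector_size in gensim 4.0+
    min_count=min_count,
    window=window,
    sample=downsampling,
)
## build the vocabulary
imf_w2v.build_vocab(total_results)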

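To illustrate what `min_count` does when the vocabulary is built, here is a plain-Python sketch that counts tokens and drops the rare ones. The `filter_vocab` helper and the toy corpus are hypothetical; this is not gensim's internal implementation.

```python
from collections import Counter

def filter_vocab(sentences, min_count):
    """Keep only the words that occur at least min_count times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}

# Toy corpus, invented for this example.
toy = [["npl", "ratio", "rose"], ["npl", "ratio", "fell"], ["npl", "stable"]]
vocab = filter_vocab(toy, min_count=2)
print(vocab)
```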
In [6]:
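## train w2v model
corpus_count = imf_w2v.corpus_count
iteration = 10
if gensim.__version__.split('.')[0] == '1':
    # older 1.x API: train() needs no extra arguments
    imf_w2v.train(total_results)
else:
    # newer versions require total_examples and epochs to be passed explicitly
    imf_w2v.train(total_results, total_examples=corpus_count, epochs=iteration)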

In [34]:
## save the trained word2vec model, or load it if a saved copy already exists
model_path = os.path.join('trained', 'imf.w2v')
if not os.path.exists(model_path):
    os.makedirs('trained', exist_ok=True)
    imf_w2v.save(model_path)
else:
    imf_w2v = Word2Vec.load(model_path)

In [7]:
model = imf_w2v.wv           # the trained word vectors
vocabs = model.vocab.keys()  # gensim < 4.0; gensim 4.0+ uses model.key_to_index

In [28]:
model.most_similar('spillover', topn=20)

[('spillovers', 0.7708446383476257),
 ('linkages', 0.5720889568328857),
 ('spill-over', 0.5676941871643066),
 ('contagion', 0.5501283407211304),
 ('spillbacks', 0.5454457998275757),
 ('outwards', 0.5422704219818115),
 ('marketvolatility', 0.5338668823242188),
 ('report—analytical', 0.5317055583000183),
 ('inwards', 0.5228022336959839),
 ('outward', 0.5054793953895569),
 ('reportthe', 0.5010243654251099),
 ('boomerang', 0.4919307827949524),
 ('outwardspillovers', 0.4914931058883667),
 ('inward', 0.4884394407272339),
 ('effects', 0.4873153269290924),
 ('spilloverswe', 0.48556917905807495),
 ('interconnectedness', 0.47779250144958496),
 ('spillback', 0.4770641624927521),
 ('interconnections', 0.47506511211395264),
 ('surveillance.the', 0.47185012698173523)]

In [33]:
model.most_similar('npl', topn=20)

[('npls', 0.8256365656852722),
 ('nonperforming', 0.7252997756004333),
 ('non-performing', 0.7176951766014099),
 ('provisioning', 0.5830786228179932),
 ('write-offs', 0.5457184314727783),
 ('impairments', 0.517227292060852),
 ('loan-loss', 0.5097397565841675),
 ('impaired', 0.5086087584495544),
 ('capitaladequacy', 0.5065985918045044),
 ('loan-to-deposit', 0.497059166431427),
 ('nonperformingloans', 0.4962959885597229),
 ('npes', 0.4939619302749634),
 ('provisions-to-npls', 0.4868103265762329),
 ('corporatedebt', 0.4835938811302185),
 ('innon-performing', 0.48146674036979675),
 ('cost-to-income', 0.47889918088912964),
 ('loan', 0.47880086302757263),
 ('loans', 0.4780082106590271),
 ('cesees', 0.47704681754112244),
 ('capital-to-assets', 0.4730897843837738)]