In [99]:
!conda install seaborn

/bin/sh: 1: conda: not found


# Briefing about Word2Vec:

<img src="http://mccormickml.com/assets/word2vec/skip_gram_net_arch.png" alt="drawing" width="550"/>

[[1]](#References:)


## Purpose of the tutorial:
As I said before, this tutorial focuses on the right use of the Word2Vec package from the Gensim libray; therefore, I am not going to explain the concepts and ideas behind Word2Vec here. I am simply going to give a very brief explanation, and provide you with links to good, in depth tutorials.

## Brief explanation:

Word2Vec was introduced in two [papers](#Material-for-more-in-depths-understanding:) between September and October 2013, by a team of researchers at Google. Along with the papers, the researchers published their implementation in C. The Python implementation was done soon after the 1st paper, by [Gensim](https://radimrehurek.com/gensim/index.html). 

The underlying assumption of Word2Vec is that two words sharing similar contexts also share a similar meaning and consequently a similar vector representation from the model. For instance: "dog", "puppy" and "pup" are often used in similar situations, with similar surrounding words like "good", "fluffy" or "cute", and according to Word2Vec they will therefore share a similar vector representation.<br>

From this assumption, Word2Vec can be used to find out the relations between words in a dataset, compute the similarity between them, or use the vector representation of those words as input for other applications such as text classification or clustering.

# Getting Started

## Setting up the environment:

`python==3.6.3`

Libraries used:
 * `xlrd==1.1.0`: https://pypi.org/project/xlrd/
 * `spaCy==2.0.12`: https://spacy.io/usage/
 * `gensim==3.4.0`: https://radimrehurek.com/gensim/install.html
 * `scikit-learn==0.19.1`: http://scikit-learn.org/stable/install.html
 * `seaborn==0.8`: https://seaborn.pydata.org/installing.html

In [1]:
import re  # For preprocessing
import pandas as pd  # For data handling
from time import time  # To time our operations
from collections import defaultdict  # For word frequency

import spacy  # For preprocessing
from spacy.lang.de import German

from nltk import tokenize
import nltk
import logging  # Setting up the loggings to monitor gensim
logging.basicConfig(format="%(levelname)s - %(asctime)s: %(message)s", datefmt= '%H:%M:%S', level=logging.INFO)


In [2]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/lateknight/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [3]:
file = open('../data/alpen/alpine_full_corpus_text.txt',encoding="utf-8")
text = file.read()

In [4]:
sentence_data = tokenize.sent_tokenize(text)

In [5]:
df = pd.DataFrame(sentence_data, columns = ['Sentences']) 

In [6]:
df.head()

Unnamed: 0,Sentences
0,Der Deutsche Alpenverein hat es sich zur Aufga...
1,Ohne hier auf die Geschichte seiner Gründung e...
2,"Grundgedanke war, der Deutsche Alpenverein sol..."
3,Für sie alle soll der Deutsche Alpenverein das...
4,Er erhebt keine anderen Ansprüche an seine Mit...


In [7]:

df.shape

(16411, 1)

In [8]:
df.head(20)

Unnamed: 0,Sentences
0,Der Deutsche Alpenverein hat es sich zur Aufga...
1,Ohne hier auf die Geschichte seiner Gründung e...
2,"Grundgedanke war, der Deutsche Alpenverein sol..."
3,Für sie alle soll der Deutsche Alpenverein das...
4,Er erhebt keine anderen Ansprüche an seine Mit...
5,Der Deutsche Alpenverein kennt keine politisch...
6,Ueberall soll die Liebe zu den Alpen geweckt u...
7,"Eine derselben, alljährlich durch Wahl der Gen..."
8,"—\nDies der Grundgedanke, welcher, lange geheg..."
9,"Der Erfolg hat bewiesen, dass die Idee, einen ..."


## Bigrams:
We are using Gensim Phrases package to automatically detect common phrases (bigrams) from a list of sentences.
https://radimrehurek.com/gensim/models/phrases.html


In [9]:
from gensim.models.phrases import Phrases, Phraser

As `Phrases()` takes a list of list of words as input:

In [10]:
sent = [row.split() for row in df['Sentences']]

Creates the relevant phrases from the list of sentences:

In [11]:
phrases = Phrases(sent, min_count=30, progress_per=10000)

INFO - 18:48:39: collecting all words and their counts
INFO - 18:48:39: PROGRESS: at sentence #0, processed 0 words and 0 word types
INFO - 18:48:39: PROGRESS: at sentence #10000, processed 244723 words and 197832 word types
INFO - 18:48:40: collected 289401 word types from a corpus of 396824 words (unigram + bigrams) and 16411 sentences
INFO - 18:48:40: using 289401 counts as vocab in Phrases<0 vocab, min_count=30, threshold=10.0, max_vocab_size=40000000>


The goal of Phraser() is to cut down memory consumption of Phrases(), by discarding model state not strictly needed for the bigram detection task:

In [12]:
bigram = Phraser(phrases)

INFO - 18:48:42: source_vocab length 289401
INFO - 18:48:44: Phraser built with 108 phrasegrams


Transform the corpus based on the bigrams detected:

In [13]:
sentences = bigram[sent]

## Most Frequent Words:
Mainly a sanity check of the effectiveness of the lemmatization, removal of stopwords, and addition of bigrams.

In [14]:
word_freq = defaultdict(int)
for sent in sentences:
    for i in sent:
        word_freq[i] += 1
len(word_freq)

57469

In [15]:
sorted(word_freq, key=word_freq.get, reverse=True)[:20]

['der',
 'die',
 'und',
 'den',
 'in',
 'von',
 'zu',
 'des',
 'das',
 'auf',
 'sich',
 'dem',
 'wir',
 'mit',
 'eine',
 'über',
 'ein',
 'im',
 'nicht',
 'als']

# Training the model
## Gensim Word2Vec Implementation:
We use Gensim implementation of word2vec: https://radimrehurek.com/gensim/models/word2vec.html

In [16]:
import multiprocessing

from gensim.models import Word2Vec

## Why I seperate the training of the model in 3 steps:
I prefer to separate the training in 3 distinctive steps for clarity and monitoring.
1. `Word2Vec()`: 
>In this first step, I set up the parameters of the model one-by-one. <br>I do not supply the parameter `sentences`, and therefore leave the model uninitialized, purposefully.
2. `.build_vocab()`: 
>Here it builds the vocabulary from a sequence of sentences and thus initialized the model. <br>With the loggings, I can follow the progress and even more important, the effect of `min_count` and `sample` on the word corpus. I noticed that these two parameters, and in particular `sample`, have a great influence over the performance of a model. Displaying both allows for a more accurate and an easier management of their influence.
3. `.train()`:
>Finally, trains the model.<br>
The loggings here are mainly useful for monitoring, making sure that no threads are executed instantaneously.

In [17]:
cores = multiprocessing.cpu_count() # Count the number of cores in a computer

## The parameters:

* `min_count` <font color='purple'>=</font> <font color='green'>int</font> - Ignores all words with total absolute frequency lower than this - (2, 100)


* `window` <font color='purple'>=</font> <font color='green'>int</font> - The maximum distance between the current and predicted word within a sentence. E.g. `window` words on the left and `window` words on the left of our target - (2, 10)


* `size` <font color='purple'>=</font> <font color='green'>int</font> - Dimensionality of the feature vectors. - (50, 300)


* `sample` <font color='purple'>=</font> <font color='green'>float</font> - The threshold for configuring which higher-frequency words are randomly downsampled. Highly influencial.  - (0, 1e-5)


* `alpha` <font color='purple'>=</font> <font color='green'>float</font> - The initial learning rate - (0.01, 0.05)


* `min_alpha` <font color='purple'>=</font> <font color='green'>float</font> - Learning rate will linearly drop to `min_alpha` as training progresses. To set it: alpha - (min_alpha * epochs) ~ 0.00


* `negative` <font color='purple'>=</font> <font color='green'>int</font> - If > 0, negative sampling will be used, the int for negative specifies how many "noise words" should be drown. If set to 0, no negative sampling is used. - (5, 20)


* `workers` <font color='purple'>=</font> <font color='green'>int</font> - Use these many worker threads to train the model (=faster training with multicore machines)

In [18]:
w2v_model = Word2Vec(min_count=20,
                     window=2,
                     size=300,
                     sample=6e-5, 
                     alpha=0.03, 
                     min_alpha=0.0007, 
                     negative=20,
                     workers=cores-1)

## Building the Vocabulary Table:
Word2Vec requires us to build the vocabulary table (simply digesting all the words and filtering out the unique words, and doing some basic counts on them):

In [19]:
t = time()

w2v_model.build_vocab(sentences, progress_per=10000)

print('Time to build vocab: {} mins'.format(round((time() - t) / 60, 2)))

INFO - 18:49:02: collecting all words and their counts
INFO - 18:49:02: PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
INFO - 18:49:03: PROGRESS: at sentence #10000, processed 240613 words, keeping 42655 word types
INFO - 18:49:03: collected 57469 word types from a corpus of 389786 raw words and 16411 sentences
INFO - 18:49:03: Loading a fresh vocabulary
INFO - 18:49:03: effective_min_count=20 retains 2135 unique words (3% of original 57469, drops 55334)
INFO - 18:49:03: effective_min_count=20 leaves 267249 word corpus (68% of original 389786, drops 122537)
INFO - 18:49:03: deleting the raw counts dictionary of 57469 items
INFO - 18:49:03: sample=6e-05 downsamples 1061 most-common words
INFO - 18:49:03: downsampling leaves estimated 96006 word corpus (35.9% of prior 267249)
INFO - 18:49:03: estimated required memory for 2135 words and 300 dimensions: 6191500 bytes
INFO - 18:49:03: resetting layer weights


Time to build vocab: 0.02 mins


## Training of the model:
_Parameters of the training:_
* `total_examples` <font color='purple'>=</font> <font color='green'>int</font> - Count of sentences;
* `epochs` <font color='purple'>=</font> <font color='green'>int</font> - Number of iterations (epochs) over the corpus - [10, 20, 30]

In [20]:
t = time()

w2v_model.train(sentences, total_examples=w2v_model.corpus_count, epochs=30, report_delay=1)

print('Time to train the model: {} mins'.format(round((time() - t) / 60, 2)))

INFO - 18:49:06: training model with 3 workers on 2135 vocabulary and 300 features, using sg=0 hs=0 sample=6e-05 negative=20 window=2
INFO - 18:49:07: EPOCH 1 - PROGRESS: at 84.94% examples, 80585 words/s, in_qsize 0, out_qsize 0
INFO - 18:49:07: worker thread finished; awaiting finish of 2 more threads
INFO - 18:49:07: worker thread finished; awaiting finish of 1 more threads
INFO - 18:49:07: worker thread finished; awaiting finish of 0 more threads
INFO - 18:49:07: EPOCH - 1 : training on 389786 raw words (96063 effective words) took 1.2s, 80122 effective words/s
INFO - 18:49:08: EPOCH 2 - PROGRESS: at 92.69% examples, 87872 words/s, in_qsize 1, out_qsize 0
INFO - 18:49:08: worker thread finished; awaiting finish of 2 more threads
INFO - 18:49:08: worker thread finished; awaiting finish of 1 more threads
INFO - 18:49:08: worker thread finished; awaiting finish of 0 more threads
INFO - 18:49:08: EPOCH - 2 : training on 389786 raw words (95922 effective words) took 1.1s, 87149 effectiv

INFO - 18:49:28: worker thread finished; awaiting finish of 1 more threads
INFO - 18:49:28: worker thread finished; awaiting finish of 0 more threads
INFO - 18:49:28: EPOCH - 19 : training on 389786 raw words (95906 effective words) took 1.3s, 74800 effective words/s
INFO - 18:49:29: EPOCH 20 - PROGRESS: at 68.81% examples, 64726 words/s, in_qsize 0, out_qsize 0
INFO - 18:49:30: worker thread finished; awaiting finish of 2 more threads
INFO - 18:49:30: worker thread finished; awaiting finish of 1 more threads
INFO - 18:49:30: worker thread finished; awaiting finish of 0 more threads
INFO - 18:49:30: EPOCH - 20 : training on 389786 raw words (95802 effective words) took 1.5s, 62492 effective words/s
INFO - 18:49:31: EPOCH 21 - PROGRESS: at 53.34% examples, 50628 words/s, in_qsize 0, out_qsize 0
INFO - 18:49:31: worker thread finished; awaiting finish of 2 more threads
INFO - 18:49:31: worker thread finished; awaiting finish of 1 more threads
INFO - 18:49:31: worker thread finished; awai

Time to train the model: 0.63 mins


As we do not plan to train the model any further, we are calling init_sims(), which will make the model much more memory-efficient:

In [21]:
w2v_model.init_sims(replace=True)

INFO - 18:49:56: precomputing L2-norms of word weight vectors


# Exploring the model
## Most similar to:



<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/b/be/Saslong_udu_da_Sacun_ora.jpg/1200px-Saslong_udu_da_Sacun_ora.jpg" alt="drawing" width="130"/>

Let's see what we get for langkofel:

In [22]:
len(w2v_model.wv.vocab)
print(w2v_model.wv.vocab)

{'Der': <gensim.models.keyedvectors.Vocab object at 0x7fe95e1fd908>, 'Deutsche': <gensim.models.keyedvectors.Vocab object at 0x7fe95de4c780>, 'Alpenverein': <gensim.models.keyedvectors.Vocab object at 0x7fe95de4cb70>, 'hat': <gensim.models.keyedvectors.Vocab object at 0x7fe95de4cf98>, 'es': <gensim.models.keyedvectors.Vocab object at 0x7fe95de4c898>, 'sich': <gensim.models.keyedvectors.Vocab object at 0x7fe95de4cc18>, 'zur': <gensim.models.keyedvectors.Vocab object at 0x7fe95de4c198>, 'Aufgabe': <gensim.models.keyedvectors.Vocab object at 0x7fe95de4cb38>, 'gemacht,': <gensim.models.keyedvectors.Vocab object at 0x7fe95de4cb00>, 'von': <gensim.models.keyedvectors.Vocab object at 0x7fe95de4ccc0>, 'den': <gensim.models.keyedvectors.Vocab object at 0x7fe95de4c908>, 'Deutschen_Alpen': <gensim.models.keyedvectors.Vocab object at 0x7fe95de3edd8>, 'zu': <gensim.models.keyedvectors.Vocab object at 0x7fe95de3eba8>, 'und': <gensim.models.keyedvectors.Vocab object at 0x7fe95de3ed30>, 'ihre': <gensi

In [33]:
w2v_model.wv.most_similar(positive=["Alpenverein"])

[('Alpenvereins', 0.9488771557807922),
 ('Vereins', 0.8816549181938171),
 ('Sectionen', 0.8727765083312988),
 ('deutschen', 0.8522363901138306),
 ('Centralausschuss', 0.8506952524185181),
 ('Mitgliedern', 0.8389559984207153),
 ('Centralausschusses', 0.8316552639007568),
 ('Deutschen', 0.8224343061447144),
 ('Oesterreichischen', 0.8204638957977295),
 ('soll', 0.8202454447746277)]

### t-SNE visualizations:
t-SNE is a non-linear dimensionality reduction algorithm that attempts to represent high-dimensional data and the underlying relationships between vectors in a lower-dimensional space.<br>
Here is a good tutorial on it: https://medium.com/@luckylwk/visualising-high-dimensional-datasets-using-pca-and-t-sne-in-python-8ef87e7915b

In [30]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline


from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

Our goal in this section is to plot our 300 dimensions vectors into 2 dimensional graphs, and see if we can spot interesting patterns.<br>
For that we are going to use t-SNE implementation from scikit-learn.

To make the visualizations more relevant, we will look at the relationships between a query word (in <font color='red'>**red**</font>), its most similar words in the model (in <font color="blue">**blue**</font>), and other words from the vocabulary (in <font color='green'>**green**</font>).

In [31]:
def tsnescatterplot(model, word, list_names):
    """ Plot in seaborn the results from the t-SNE dimensionality reduction algorithm of the vectors of a query word,
    its list of most similar words, and a list of words.
    """
    arrays = np.empty((0, 300), dtype='f')
    word_labels = [word]
    color_list  = ['red']

    # adds the vector of the query word
    arrays = np.append(arrays, model.wv.__getitem__([word]), axis=0)
    
    # gets list of most similar words
    close_words = model.wv.most_similar([word])
    
    # adds the vector for each of the closest words to the array
    for wrd_score in close_words:
        wrd_vector = model.wv.__getitem__([wrd_score[0]])
        word_labels.append(wrd_score[0])
        color_list.append('blue')
        arrays = np.append(arrays, wrd_vector, axis=0)
    
    # adds the vector for each of the words from list_names to the array
    for wrd in list_names:
        wrd_vector = model.wv.__getitem__([wrd])
        word_labels.append(wrd)
        color_list.append('green')
        arrays = np.append(arrays, wrd_vector, axis=0)
        
    # Reduces the dimensionality from 300 to 50 dimensions with PCA
    reduc = PCA(n_components=30).fit_transform(arrays)
    
    # Finds t-SNE coordinates for 2 dimensions
    np.set_printoptions(suppress=True)
    
    Y = TSNE(n_components=2, random_state=0, perplexity=15).fit_transform(reduc)
    
    # Sets everything up to plot
    df = pd.DataFrame({'x': [x for x in Y[:, 0]],
                       'y': [y for y in Y[:, 1]],
                       'words': word_labels,
                       'color': color_list})
    
    fig, _ = plt.subplots()
    fig.set_size_inches(9, 9)
    
    # Basic plot
    p1 = sns.regplot(data=df,
                     x="x",
                     y="y",
                     fit_reg=False,
                     marker="o",
                     scatter_kws={'s': 40,
                                  'facecolors': df['color']
                                 }
                    )
    
    # Adds annotations one by one with a loop
    for line in range(0, df.shape[0]):
         p1.text(df["x"][line],
                 df['y'][line],
                 '  ' + df["words"][line].title(),
                 horizontalalignment='left',
                 verticalalignment='bottom', size='medium',
                 color=df['color'][line],
                 weight='normal'
                ).set_size(15)

    
    plt.xlim(Y[:, 0].min()-50, Y[:, 0].max()+50)
    plt.ylim(Y[:, 1].min()-50, Y[:, 1].max()+50)
            
    plt.title('t-SNE visualization for {}'.format(word.title()))
    

In [34]:
tsnescatterplot(w2v_model, 'Alpenverein', [i[0] for i in w2v_model.wv.most_similar(negative=["Alpenverein"])])

ValueError: n_components=50 must be between 0 and min(n_samples, n_features)=21 with svd_solver='full'

In [35]:
tsnescatterplot(w2v_model, 'Alpenverein', [i[0] for i in w2v_model.wv.most_similar(negative=["Alpenverein"])])

ValueError: n_components=50 must be between 0 and min(n_samples, n_features)=21 with svd_solver='full'