# Word Embedding Advanced Tutorial

## What's in-scope
- improving our model: incorporating phraser
- hyperparameters for training word embeddings
    - the architecture: CBOW vs Skip-gram
    - embedding window
- different algorithms to train word embedding
    - Word2Vec
    - FastText
    - GloVe
    - BERT


## What's not in-scope
- [ELMo](https://allennlp.org/elmo), another approach to training word embeddings
- comparing and [evaluating word embedding models](https://arxiv.org/pdf/1901.09785.pdf) objectively

## Intro
There are a few different ways to train word embeddings. In the previous tutorial, we implemented FastText using gensim, but we didn't go into detail about how it works or why it works well.

In this tutorial, we'll dig into a few algorithms used to train word embeddings and weigh their strengths and weaknesses.

First, let's incorporate one more step to improve the process: gensim's [phraser](https://radimrehurek.com/gensim/models/phrases.html). 

## Phraser
Some phrases have a meaning that is _not_ a simple composition of the meanings of their individual words. For example, `red head` and `book worm` don't mean exactly what the combination of their parts might imply.

If we tried to get the meaning of `red head` by combining the vectors for `red` and `head`, would we get something that makes sense? `Red` is probably near other colors, `yellow`, `blue`, `green`. `Head` would be near body parts: `leg`, `arm`, `torso`. We'd want `red head` to be near `blond`, `blonde`, and `brunette`. 

![red-head](figures/red-head-vectors.png)

To place this concept in the right part of the vector space, we need to change our preprocessing step so that `red head` is treated as one unit, one word. That way, the vector for `red head` is trained separately from the vectors for `red` and `head`. It learns from the contexts in which `red head` appears, which differ from the contexts of `red apple` or `red firetruck`.

Since we're using FastText, we're also getting some information from the characters of the words, so the information from each piece of the phrase is still somewhat included. 

In [1]:
# we're going to modify our Sentences object from yesterday to incorporate phraser
# but first, let's get familiar with gensim's phraser
# (all snippets taken from gensim documentation)

from gensim.models.phrases import Phraser, Phrases
import os
from gensim.test.utils import datapath
from gensim.models.word2vec import Text8Corpus

# gensim even gives you some toy data to use
sentences = Text8Corpus(datapath('testcorpus.txt'))

# The training corpus must be a sequence (stream, generator) of sentences,
# with each sentence a list of tokens

# print out a sentence so you know what it looks like 


In [2]:
# Train a toy bigram model.
phrases = Phrases(sentences, min_count=1, threshold=1)
# Apply the trained phrases model to a new, unseen sentence.
phrases[['trees', 'graph', 'minors']]

In [4]:
# The toy model considered "trees graph" a single phrase => joined the two
# tokens into a single token, `trees_graph`.
# Export the trained model = use less RAM, faster processing. Model updates no longer possible.
bigram = Phraser(phrases)
bigram[['trees', 'graph', 'minors']]  # apply the exported model to a sentence

### How did it know which words were phrases and which weren't?
What makes a combination of words a phrase?

A phrase consists of words that "appear frequently together, and infrequently in other contexts" ([Mikolov et al.](https://arxiv.org/pdf/1310.4546.pdf)).

So `New York Times` will be phrased, but not something with very frequently occurring terms like `this is`.

Gensim's default scorer implements this idea with the following formula:
![phraser score equation](figures/phraser-score-equation.png)

So to get a high score, words need to occur very frequently together as a bigram, and far less frequently with other words.

There's another scoring method that relies on normalized pointwise mutual information (NPMI):
![npmi score](figures/phraser-score-npmi.png)

Simple, right? 

Regardless of how we score the word combination, the score must be greater than the value for `threshold` to be considered a phrase.
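To make the two scores concrete, here is a minimal pure-Python sketch of both formulas as gensim's documentation describes them. The function names, the toy counts, and the threshold of 10 are assumptions made for illustration; gensim's own scorers handle more details than this.

```python
import math

def default_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Mikolov-style score: high when the pair is frequent relative to its parts."""
    return (count_ab - min_count) * vocab_size / (count_a * count_b)

def npmi_score(count_a, count_b, count_ab, corpus_word_count):
    """Normalized pointwise mutual information, which falls between -1 and 1."""
    if count_ab == 0:
        return -1.0  # the two words never occur together
    p_a = count_a / corpus_word_count
    p_b = count_b / corpus_word_count
    p_ab = count_ab / corpus_word_count
    return math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)

# Hypothetical counts: "new york" is a tight pair, "this is" is built from very common words.
new_york = default_score(count_a=120, count_b=100, count_ab=90, vocab_size=10_000, min_count=5)
this_is = default_score(count_a=5_000, count_b=8_000, count_ab=900, vocab_size=10_000, min_count=5)

threshold = 10  # illustrative value
print("new york is a phrase:", new_york > threshold)
print("this is is a phrase:", this_is > threshold)
```

Note that the two scores live on different scales, so a `threshold` that works for one scorer will not carry over to the other.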

### Some other options when running the Phraser

#### Phrase Order
You can also train the phrases model iteratively, passing in phrased data to another phrases model. After one run of phraser, you'll get bigrams. What's the max length of a phrase after using two phrases models?

#### Common Terms
You can also take stopwords into consideration when training a phrases model.

Some phrases contain stopwords, words that hold very little meaning on their own, for example `hold the phone` or `cat's got your tongue`. Gensim lets you supply a list of stopwords that it treats differently from other vocabulary items.

In [6]:
# let's take a list of stopwords, taken from NLTK
stopword_file = os.path.join('resources', 'stopwords.txt')

'''
------------------------------------------------
Objective: get a set of stopwords from the file
------------------------------------------------
gensim's Phrases has a `common_terms` param that expects a set of strings
(note: gensim 4.0+ renamed this param to `connector_words`)

This lets the phraser create longer phrases by handling the stopwords in
this set differently than other vocab items

the file in resources/stopwords.txt has one stopword per line

he
is
me
the
...

Read in this file, and get a set of strings (words) to use in the phraser
'''
# be sure to show a few words in the set to make sure it works as expected

In [7]:
# Train a new toy bigram model with stopwords
phrases = Phrases(sentences, 
                  min_count=1, 
                  threshold=1,
                  common_terms=stopwords_set)
# Apply the trained phrases model to a new, unseen sentence.
phrases[['computer', 'is', 'off']]

In [8]:
# what?? This still doesn't phrase? 

# let's look at sentences...


What's happening here?


In [9]:
'''
-------------------------------------------------------------------------
Objective: modify the Sentences object from yesterday to phrase the data
-------------------------------------------------------------------------

Using your code from yesterday, modify the sentences object so that it yields 
phrased data.

Add a function to your class called `create_phrasers` that trains the gensim
phrases models and uses them to yield phrased data.
''' 
from gensim.models.phrases import Phraser, Phrases
import os
import string

class Sentences:
    def __init__(self, filename, delim, encoding, limit=float('inf'), phrasers=None):
        pass
    
    def create_phrasers(self, phrase_order=2, min_count=5, threshold=5):
        pass
        
    def __iter__(self):
        pass
                
got_dialogue_file = os.path.join('data', 'got_scripts_breakdown.csv')
sents = Sentences(filename=got_dialogue_file, 
                  delim=';', 
                  encoding='utf-8-sig',
                 )
# check that your function for creating phrasers works
sents.create_phrasers(phrase_order=2)

In [10]:
# I only want to show a few sentences, 
# but I want to use the phrasers I've already trained
# initialize a new sentences object with limit=5 and phrasers=sents.phrasers
five_sentences = None
# iterate through it and print the output to see what phrases occur


In [11]:
# How do phrasers change the FastText model?
# retrain a fastText model using your newly phrased sentences object
# to be consistent, use the same hyperparameters as yesterday


In [12]:
# challenge: find phrases using most_similar


In [13]:
# let's save this model for later use
save_path = os.path.join('models', 'phrased_got_ft.model')


## Important hyperparameters in training algorithms

Regardless of the algorithm chosen, there are important hyperparameters to set.

The word _architecture_ is used here to mean the internal structure of a neural network. 

Remember that the word embeddings are the weight matrix of the neural model used. 

![CBOW](figures/word2vec-cbow.png)
![weight_matrix](figures/weight-matrix.png)
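A small sketch of why the weight matrix *is* the embedding table: multiplying a one-hot input vector by the matrix simply selects one row. The toy vocabulary, the weight values, and the helper names below are made up for illustration.

```python
def one_hot(index, size):
    """Return a vector of zeros with a single 1.0 at `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

def matvec(vec, matrix):
    """Multiply a row vector of length V by a V x D matrix."""
    n_cols = len(matrix[0])
    return [sum(vec[i] * matrix[i][j] for i in range(len(vec))) for j in range(n_cols)]

vocab = ["a", "cat", "chases", "mouse"]
weights = [            # one row per vocabulary word, D = 3
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]

cat_index = vocab.index("cat")
hidden = matvec(one_hot(cat_index, len(vocab)), weights)
print(hidden)              # the hidden layer for "cat"
print(weights[cat_index])  # the same numbers: row `cat_index` of the matrix
```

This is why libraries can skip the multiplication entirely and look the row up by index.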

### CBOW vs Skip-gram
This [article](https://www.quora.com/What-are-the-continuous-bag-of-words-and-skip-gram-architectures) has the best distinction I've found.
>These two architectures describe how the neural network "learns" the underlying word representations for each word. Since learning word representations is essentially unsupervised, you need some way to "create" labels to train the model. Skip-gram and CBOW are two ways of creating the "task" for the neural network -- you can think of this as the output layer of the neural network, where we create "labels" for the given input (which depends on the architecture).
>
>CBOW: The input to the model could be $w_{i-2}, w_{i-1}, w_{i+1}, w_{i+2}$, the preceding and following words of the current word we are at. The output of the neural network will be $w_i$. Hence you can think of the task as "predicting the word given its context".
Note that the number of words we use depends on your setting for the window size.

![CBOW simple](figures/CBOW-simple.png)

[image source](https://towardsdatascience.com/word-embeddings-intuition-and-some-maths-to-understand-end-to-end-skip-gram-model-cab57760c745)

The sentence encoded in the above image is "A cat ________ a mouse."

>Skip-gram: The input to the model is $w_i$, and the output could be $w_{i-2}, w_{i-1}, w_{i+1}, w_{i+2}$. So the task here is "predicting the context given a word". In addition, more distant words are given less weight by randomly sampling them. 

![Skip-gram detailed](figures/Skip-gram-detailed.png)

To summarize,

CBOW predicts "I went to the store to buy _________"

Skipgram predicts "_______ _______ butter ________ _______ _______"
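The two tasks can be sketched as two ways of slicing the same sentence into training examples. This is only an illustration: the function names are hypothetical, the window is fixed at 2, and real implementations add refinements such as randomly shrinking the window.

```python
def cbow_examples(tokens, window=2):
    """Build (context_words, target_word) pairs: predict the word from its context."""
    examples = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        examples.append((left + right, target))
    return examples

def skipgram_examples(tokens, window=2):
    """Build (input_word, context_word) pairs: predict each context word from the word."""
    examples = []
    for context, target in cbow_examples(tokens, window):
        examples.extend((target, word) for word in context)
    return examples

sentence = "a cat chases a mouse".split()

print(cbow_examples(sentence)[2])
print([pair for pair in skipgram_examples(sentence) if pair[0] == "chases"])
```

CBOW produces one example per word, while Skip-gram produces one example per (word, context word) pair, which is part of why Skip-gram takes longer to train.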

_How do I know when it's best to use CBOW or Skip-gram?_

According to [Mikolov](https://en.wikipedia.org/wiki/Tomas_Mikolov),
>Skip-gram: works well with small amount of the training data, represents well even rare words or phrases.
>
>CBOW: several times faster to train than the skip-gram, slightly better accuracy for the frequent words

In [15]:
# retrain your FastText model with the other architecture, using the `sg` parameter
# sg=1 for Skip-gram, sg=0 for CBOW
# keep the same hyperparameters for consistency


In [16]:
# explore the model to see any differences


Did anything change? Speed of training?

In [17]:
# if you prefer this model, save it over the CBOW version
# if not, keep the CBOW one
save_path = os.path.join('models', 'phrased_got_ft.model')



### Embedding Window

Another hyperparameter shared by the window-based algorithms below (Word2Vec, FastText, and GloVe) is the embedding window.

The embedding window determines how many words to the left and right of a word count as its context.

Quiz: What is the embedding window for this image?

![CBOW visualization](figures/CBOW-visualized.png)

## Different Algorithms for training word embeddings

There are a little more than a handful of algorithms used to train word embeddings.

For all of these algorithms, the two main questions are
1. How is the data encoded?
2. What is the architecture used in the model?

### Word2Vec
To train word embeddings using Word2Vec, as seen explicitly above in the images for CBOW and Skip-gram, we do the following for a fixed number of epochs.

1. encode each word as a one-hot vector
2. for each instance (or mini-batch) in the data set:
    1. predict the missing words (according to CBOW or Skip-gram)
    2. calculate the loss 
    3. use backpropagation and update the weights accordingly
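The steps above can be sketched as a tiny Skip-gram training loop in pure Python. This is a teaching sketch under loud assumptions: it uses a full softmax over a 4-word vocabulary and updates after every pair, whereas real Word2Vec implementations rely on tricks such as negative sampling to scale. All names and hyperparameter values here are hypothetical.

```python
import math
import random

def softmax(scores):
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def train_skipgram(pairs, vocab_size, dim=8, epochs=200, lr=0.1, seed=0):
    rng = random.Random(seed)
    # w_in holds the word embeddings; w_out is the output layer
    w_in = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(vocab_size)]
    w_out = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(vocab_size)]
    losses = []
    for _ in range(epochs):
        epoch_loss = 0.0
        for center, context in pairs:
            hidden = w_in[center]  # one-hot input times w_in is a row lookup
            scores = [sum(h * w for h, w in zip(hidden, row)) for row in w_out]
            probs = softmax(scores)                 # step 1: predict
            epoch_loss -= math.log(probs[context])  # step 2: cross-entropy loss
            errors = list(probs)                    # step 3: backpropagate
            errors[context] -= 1.0                  # gradient of the loss w.r.t. scores
            grad_hidden = [sum(errors[j] * w_out[j][k] for j in range(vocab_size))
                           for k in range(dim)]
            for j in range(vocab_size):
                for k in range(dim):
                    w_out[j][k] -= lr * errors[j] * hidden[k]
            for k in range(dim):
                hidden[k] -= lr * grad_hidden[k]    # updates w_in[center] in place
        losses.append(epoch_loss / len(pairs))
    return w_in, losses

tokens = "the cat chased the mouse".split()
vocab = sorted(set(tokens))
index = {word: i for i, word in enumerate(vocab)}
# Skip-gram pairs with a window of 1: (center word id, neighbor word id)
pairs = [(index[tokens[i]], index[tokens[j]])
         for i in range(len(tokens))
         for j in (i - 1, i + 1) if 0 <= j < len(tokens)]

embeddings, losses = train_skipgram(pairs, vocab_size=len(vocab))
print(f"first epoch loss: {losses[0]:.3f}, last epoch loss: {losses[-1]:.3f}")
```

After training, each row of `embeddings` is the vector for one vocabulary word.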

Here is the architecture of the neural network:
![word2vec](figures/word2vec.png)

Let's train a Word2Vec model using [gensim's documentation](https://radimrehurek.com/gensim/models/word2vec.html). 

In [18]:
# let's use gensim to train a word2vec model
from gensim.models import Word2Vec


# save the model here
model_save_path = os.path.join('models', 'got_w2v.model')


In [19]:
# check it out with a most_similar call


In [20]:
# how about a word out of the model's vocabulary?
# try "bran_will_never_be_my_king"


Word2Vec doesn't have a way to handle words that it hasn't seen in the training data set. When you ask for the vector of an out-of-vocabulary word, it raises an error.

### FastText

One reason FastText performs so well on word representation and sentence classification tasks is its use of character-level information.

From [FastText: Under the Hood](https://towardsdatascience.com/fasttext-under-the-hood-11efc57b2b3)
>Each word is represented as a bag of character n-grams in addition to the word itself, so for example, for the word `matter`, with $n = 3$, the FastText representations for the character n-grams is `<ma`, `mat`, `att`, `tte`, `ter`, `er>`. `<` and `>` are added as boundary symbols to distinguish the ngram of a word from a word itself. So for example, if the word `mat` is part of the vocabulary, it is represented as `<mat>`. 
>
>This helps preserve the meaning of shorter words that may show up as ngrams of other words. Inherently, this also allows you to capture meaning for suffixes/prefixes.
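Here is a minimal sketch of the n-gram extraction described in the quote. The helper name `char_ngrams` is hypothetical, and it uses a single value of `n`; the real FastText uses a range of n-gram lengths and hashes them into buckets.

```python
def char_ngrams(word, n=3):
    """Return the character n-grams of `word`, with `<` and `>` as boundary symbols."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]

print(char_ngrams("matter", n=3))

# A longer word shares n-grams with a shorter word it contains,
# so some information carries over even if the longer word is rare or unseen.
shared = set(char_ngrams("kingslayer")) & set(char_ngrams("king"))
print(sorted(shared))
```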
    
The architecture is similar to that of Word2Vec, with some changes to the underlying math. See [the original paper](https://arxiv.org/pdf/1607.04606.pdf) for more detail.

#### Why is FastText good on our data? 

FastText is also useful on data sets with speech misrecognition errors. Speech recognition often gets consonants right but misses the correct vowel sounds. Think `bowl` and `ball`. We get the `b` and the `l` in both, but miss the vowel characters in the middle. Because FastText accounts for these characters, it's easier for aliases to be treated as similar words.

In [21]:
# we've trained enough FastText models today :)
# let's load in the one from before
save_path = os.path.join('models', 'phrased_got_ft.model')


In [22]:
# FastText is able to generate a vector for words it hasn't seen before
# because of the character embeddings it employs
# retry 'bran_will_never_be_my_king'


### GloVe

From [the GloVe website](https://nlp.stanford.edu/projects/glove/)
>The GloVe model is trained on the non-zero entries of a global word-word co-occurrence matrix, which tabulates how frequently words co-occur with one another in a given corpus. Populating this matrix requires a single pass through the entire corpus to collect the statistics. For large corpora, this pass can be computationally expensive, but it is a one-time up-front cost. Subsequent training iterations are much faster because the number of non-zero matrix entries is typically much smaller than the total number of words in the corpus.

Here's a co-occurrence matrix example:
![co-occurrence matrix](figures/co-occurrence-matrix.png)
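A co-occurrence matrix like this one can be built in a single pass over the corpus. The sketch below stores only the non-zero entries in a `Counter`; the function name, the toy corpus, and the window of 1 are assumptions, and it counts every pair equally, whereas the reference GloVe implementation down-weights more distant pairs.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=1):
    """Count how often each (word, neighbor) pair appears within `window` tokens."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            stop = min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

corpus = [
    "i like deep learning".split(),
    "i like nlp".split(),
    "i enjoy flying".split(),
]
counts = cooccurrence_counts(corpus, window=1)

print(counts[("i", "like")])   # "i" and "like" are neighbors in two sentences
print(counts[("i", "nlp")])    # never neighbors, so this entry is zero
```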

>GloVe is essentially a log-bilinear model with a weighted least-squares objective. The main intuition underlying the model is the simple observation that ratios of word-word co-occurrence probabilities have the potential for encoding some form of meaning. For example, consider the co-occurrence probabilities for target words `ice` and `steam` with various probe words from the vocabulary. Here are some actual probabilities from a 6 billion word corpus:

![probabilities](figures/GloVe.png)

> As one might expect, `ice` co-occurs more frequently with `solid` than it does with `gas`, whereas `steam` co-occurs more frequently with `gas` than it does with `solid`. Both words co-occur with their shared property `water` frequently, and both co-occur with the unrelated word `fashion` infrequently. Only in the ratio of probabilities does noise from non-discriminative words like `water` and `fashion` cancel out, so that large values (much greater than 1) correlate well with properties specific to `ice`, and small values (much less than 1) correlate well with properties specific of `steam`. In this way, the ratio of probabilities encodes some crude form of meaning associated with the abstract concept of thermodynamic phase.

>The training objective of GloVe is to learn word vectors such that their dot product equals the logarithm of the words' probability of co-occurrence. Owing to the fact that the logarithm of a ratio equals the difference of logarithms, this objective associates (the logarithm of) ratios of co-occurrence probabilities with vector differences in the word vector space. Because these ratios can encode some form of meaning, this information gets encoded as vector differences as well. For this reason, the resulting word vectors perform very well on word analogy tasks, such as those examined in the word2vec package.
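For reference, the weighted least-squares objective from the GloVe paper can be written as follows. The symbols come from the paper, not from this tutorial: $X_{ij}$ is the co-occurrence count for words $i$ and $j$, $w_i$ and $\tilde{w}_j$ are the word and context vectors, $b_i$ and $\tilde{b}_j$ are bias terms, and $f$ is a weighting function that limits the influence of very frequent pairs.

$$
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
$$

Ignoring the bias terms, subtracting the fitted dot products for two target words $i$ and $j$ against the same probe word $k$ shows how ratios turn into vector differences:

$$
(w_i - w_j)^\top \tilde{w}_k \approx \log P_{ik} - \log P_{jk} = \log \frac{P_{ik}}{P_{jk}}
$$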

In [38]:
# could install another package to train GloVe
# https://stackoverflow.com/questions/48962171/how-to-train-glove-algorithm-on-my-own-corpus

# but let's leave this exercise for later

### BERT

BERT stands for [Bidirectional Encoder Representations from Transformers](https://arxiv.org/pdf/1810.04805.pdf). When Google AI released the paper, the model set a new state of the art on many NLP tasks, such as question answering.

From this [blog](https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270) (which I'll quote throughout this section):
> BERT’s key technical innovation is applying the bidirectional training of Transformer, a popular attention model, to language modelling. This is in contrast to previous efforts which looked at a text sequence either from left to right or combined left-to-right and right-to-left training. The paper’s results show that a language model which is bidirectionally trained can have a deeper sense of language context and flow than single-direction language models. In the paper, the researchers detail a novel technique named Masked LM (MLM) which allows bidirectional training in models in which it was previously impossible.

_Attention? Transformer? What does all this mean?_

Foundationally, BERT relies on another paper by Google, [Attention is All You Need](https://arxiv.org/pdf/1706.03762.pdf), which introduces the transformer architecture. Without getting into the math, the key idea in that paper is attention. This is easy to visualize for machine translation tasks.

![https://machinetalk.org/2019/03/29/neural-machine-translation-with-attention-mechanism/](figures/attention-machine-translation.gif)

Here, the thickness of the lines connecting the words shows how much weight the model is putting on each English word when making the translation to the French word. Note that the model considers _all_ English words when making the translation, some with more importance than others. The model _attends_ to certain words more than others at each step.
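For the curious, here is a rough sketch of how such weights can be computed with the scaled dot-product attention from the Transformer paper. The vectors are made-up toy values and the function names are hypothetical; a real model learns the query, key, and value vectors and uses many attention heads.

```python
import math

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)  # how much the query attends to each input word
    dim = len(values[0])
    output = [sum(w * value[d] for w, value in zip(weights, values)) for d in range(dim)]
    return weights, output

# One key and one value per input word (toy numbers)
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query = [1.0, 1.0]

weights, output = attention(query, keys, values)
print([round(w, 3) for w in weights])
```

Every input word gets a non-zero weight, and the weights sum to 1; the third word gets the most because its key lines up best with the query.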

> As opposed to directional models, which read the text input sequentially (left-to-right or right-to-left), the Transformer encoder reads the entire sequence of words at once. Therefore it is considered bidirectional, though it would be more accurate to say that it’s non-directional. This characteristic allows the model to learn the context of a word based on all of its surroundings (left and right of the word).

_Okay, neat, but weren't we making word embeddings? Not doing machine translation?_

BERT makes use of the transformer architecture, which learns contextual relations between words in a text. Let's get into the architecture.

#### Model architecture
Input: sequence of tokens, embedded into vectors

Output: sequence of vectors (size $H$), each corresponding to the input token at the same index

#### Training
In Word2Vec, we had the option of training with CBOW or Skip-gram, which define the prediction goal over a small, fixed window. Many language models instead predict the next word in a sequence, a directional approach that limits how much context they can learn from. BERT overcomes this using two training strategies.

##### Masked LM (MLM)

> Before feeding word sequences into BERT, 15% of the words in each sequence are replaced with a [MASK] token. The model then attempts to predict the original value of the masked words, based on the context provided by the other, non-masked, words in the sequence. 
![MLM prediction from https://blog.insightdatascience.com/using-bert-for-state-of-the-art-pre-training-for-natural-language-processing-1d87142c29e7](figures/BERT-MLM-prediction.png)
>In technical terms, the prediction of the output words requires:
>
> 1. Adding a classification layer on top of the encoder output
>
> 2. Multiplying the output vectors by the embedding matrix, transforming them into the vocabulary dimension
>
> 3. Calculating the probability of each word in the vocabulary with softmax.

![MLM](figures/BERT-MLM.png)

> The BERT loss function takes into consideration only the prediction of the masked values and ignores the prediction of the non-masked words. As a consequence, the model converges slower than directional models, a characteristic which is offset by its increased context awareness. 

(See the article for more detail, as this is a bit of a simplification)
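A simplified sketch of the masking step is shown below. It is an assumption-laden illustration: tokens are split on whitespace, the function name is hypothetical, and every chosen position becomes `[MASK]`, whereas the BERT paper replaces only most of the chosen tokens with `[MASK]` and leaves or randomizes the rest.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace roughly `mask_prob` of the tokens with [MASK] and keep the originals as labels."""
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), n_to_mask))
    masked = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    # Labels are None for non-masked positions: the loss ignores those predictions
    labels = [tok if i in positions else None for i, tok in enumerate(tokens)]
    return masked, labels

tokens = "the man went to the store to buy a gallon of milk today".split()
masked, labels = mask_tokens(tokens)
print(masked)
print(labels)
```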

##### Next Sentence Prediction (NSP)

> In the BERT training process, the model receives pairs of sentences as input and learns to predict if the second sentence in the pair is the subsequent sentence in the original document. During training, 50% of the inputs are a pair in which the second sentence is the subsequent sentence in the original document, while in the other 50% a random sentence from the corpus is chosen as the second sentence. The assumption is that the random sentence will be disconnected from the first sentence.

![image from https://blog.insightdatascience.com/using-bert-for-state-of-the-art-pre-training-for-natural-language-processing-1d87142c29e7](figures/BERT-NSP-prediction.png)

>
> To help the model distinguish between the two sentences in training, the input is processed in the following way before entering the model:
>
> 1. A [CLS] token is inserted at the beginning of the first sentence and a [SEP] token is inserted at the end of each sentence.
> 
> 2. A sentence embedding indicating Sentence A or Sentence B is added to each token. Sentence embeddings are similar in concept to token embeddings with a vocabulary of 2.
> 
> 3. A positional embedding is added to each token to indicate its position in the sequence. The concept and implementation of positional embedding are presented in the Transformer paper. 

![encoding NSP](figures/BERT-NSP.png)
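The input formatting can be sketched in a few lines. The function name is hypothetical and the sentences are split on whitespace for simplicity; the real BERT uses WordPiece tokenization and then looks up learned embeddings for each of the three id sequences.

```python
def build_nsp_input(sentence_a, sentence_b):
    """Return tokens, segment ids, and position ids for a sentence pair."""
    tokens = ["[CLS]"] + sentence_a + ["[SEP]"] + sentence_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A, and its [SEP]; segment 1 covers sentence B and its [SEP]
    segment_ids = [0] * (len(sentence_a) + 2) + [1] * (len(sentence_b) + 1)
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids

tokens, segment_ids, position_ids = build_nsp_input(
    "the man went to the store".split(),
    "he bought a gallon of milk".split(),
)
for token, segment, position in zip(tokens, segment_ids, position_ids):
    print(position, segment, token)
```

The final input vector for each token is the sum of its token, segment, and position embeddings.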

> To predict if the second sentence is indeed connected to the first, the following steps are performed:
> 
> 1. The entire input sequence goes through the Transformer model.
> 
> 2. The output of the [CLS] token is transformed into a $2 \times 1$ shaped vector, using a simple classification layer (learned matrices of weights and biases).
> 
> 3. Calculating the probability of $IsNextSequence$ with softmax.

Combining these two strategies captures context both from within the sentence and from surrounding sentences. The goal of training is to minimize the combined loss of the two strategies.

You can check out their implementation [here](https://github.com/google-research/bert). 

In [65]:
# could train a BERT model following this tutorial
# https://blog.insightdatascience.com/using-bert-for-state-of-the-art-pre-training-for-natural-language-processing-1d87142c29e7
# but let's leave this exercise for another day...

## Summary
- phraser improves the word embedding model
- hyperparameters for training word embeddings
    - the architecture: CBOW vs Skip-gram
    - embedding window
- different algorithms to train word embedding
    - Word2Vec
    - FastText
    - GloVe
    - BERT


# Survey !!!

Please complete the [course survey](https://forms.office.com/Pages/ResponsePage.aspx?id=gwv7BWBlfUGFbTjusOst_QYpnoW2nrtJmgVZLQ3gu25UMURGMDdaUTA0QUhJQTM3NlMxNE9GVVkyRC4u)