<a href="https://colab.research.google.com/github/CBravoR/AdvancedAnalyticsLabs/blob/master/notebooks/python/Lab_9_Embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Embeddings

In this lab we will train a simple model over the IMDB dataset, using an embedding instead of a one-hot representation. For this, we will use the fastText embeddings that have been provided.

First, let's start by importing the data and the packages that we will use. Remember to set the runtime environment to GPU!

## Data Import

In [0]:
# General imports
import string
import numpy as np
import pandas as pd
import sklearn.feature_extraction as skprep
from sklearn.metrics import roc_curve, auc
from itertools import compress
import matplotlib.pyplot as plt
import seaborn as sns
import random
random.seed(20190124)
%matplotlib inline

# Keras imports
import tensorflow as tf
import tensorflow.keras as keras
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Input, Embedding, Reshape, MaxPooling1D
from tensorflow.keras.layers import Activation
from tensorflow.keras.layers import Flatten, Dense, Dropout, Lambda
from tensorflow.keras.layers import BatchNormalization
from tensorflow.keras.optimizers import SGD, RMSprop, Adam
from tensorflow.keras.metrics import categorical_crossentropy, categorical_accuracy
from tensorflow.keras.layers import *
from tensorflow.keras.preprocessing import image, sequence
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [0]:
# Download a zip file with the data
!wget --no-check-certificate --output-document=IMDB.zip 'https://drive.google.com/uc?export=download&id=1owCcH4eU_XvUzrjnMVec-obX3endPzy_'

# Extract the files.
!unzip IMDB.zip

With this we are ready to start working with embeddings!

# fasttext

We will use [fasttext embeddings](https://github.com/facebookresearch/fastText) in this lab. For this, we need to download the fasttext model and apply it to our data, i.e., associate each word with the corresponding embedding vector.

fasttext is a heavy program, so it makes sense to use the C++ library directly. This does complicate our life a bit, but nothing a few lines of code can't solve. First, we will download the library and unzip it as before.



In [0]:
!wget https://github.com/facebookresearch/fastText/archive/v0.9.1.zip
  
!unzip v0.9.1.zip

Now we need to [compile](https://en.wikipedia.org/wiki/Compiler) the library. Compiling turns the code we just downloaded into something the computer can execute. Configuring this is complicated, so programmers add a [makefile](http://www.cs.colby.edu/maxwell/courses/tutorials/maketutor/) with instructions for the compiler. This means we just need to call the command ```make``` in the base folder with the code. This code does it:

In [0]:
%cd fastText-0.9.1

!make

There are some warnings, but we can ignore them as they are for future versions.

Now, we need to download the embedding vectors. These are **really heavy downloads** of about 8GB.  fasttext is a language-dependent model, so be sure to download the one for your chosen application. The list is [here](https://fasttext.cc/docs/en/english-vectors.html). There are four levels of embeddings:

- Vectors trained over Wikipedia (1 million words, 16 billion tokens).
- Vectors trained over Wikipedia with subword information.
- Vectors trained over a webcrawl of many websites ([Common Crawl](http://commoncrawl.org/)). This one has 2M words and 200B tokens.
- Vectors trained over a webcrawl of many websites with subword information. 

The cell below downloads the binary Common Crawl model (cc.en.300.bin), but I encourage you to try the other ones! The download will take a little while.

In [0]:
!wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.bin.gz
  
!gunzip -v -f cc.en.300.bin.gz

Now we are ready to go!

## Data Preprocessing

With this ready we can now start working on the data, which comes from the [Internet Movie Database](https://www.imdb.com/) (IMDB). The following code reads the data in the format it comes: the folder 'IMDB_Lecture_Sample/train' contains two subfolders, pos and neg, with positive and negative reviews.


In [0]:
# Come back to the original work folder
%cd /content

In [0]:
# Import relevant packages
import os
import codecs
import pandas as pd

# List all files in the "pos" directory. Replace with your own!
# (We avoid the name "dir", which would shadow a Python built-in.)
data_dir = 'IMDB_Lecture_Sample/train/pos/'
fileList = os.listdir(data_dir)

# Create vector with texts
outtexts = []

# Read the files in the directory and append them to the list
for eachFile in fileList:
    # The with block closes the file automatically
    with codecs.open(data_dir + eachFile, encoding='utf-8') as _fp:
        fileData = _fp.read()
        outtexts.append(fileData)

# Create dataframe from outputs, labelled as class 1 (positive)
texts = pd.DataFrame({'texts': outtexts, 'class': 1})

# Repeat for negative values
# List all files in the "neg" directory
data_dir = 'IMDB_Lecture_Sample/train/neg/'
fileList = os.listdir(data_dir)

# Create vector with texts
outtexts = []

# Read the files in the directory and append them to the list
for eachFile in fileList:
    with codecs.open(data_dir + eachFile, encoding='utf-8') as _fp:
        fileData = _fp.read()
        outtexts.append(fileData)

# Append to the dataframe, labelled as class 0 (negative)
texts = pd.concat((texts, pd.DataFrame({'texts': outtexts, 'class': 0})), ignore_index = True)

texts.describe()

In [0]:
texts

Now we'll clean the text, getting rid of special characters and keeping the text in a clear form. 

In [0]:
# Text cleaning
import string

# Collect punctuation signs.
table = str.maketrans(' ', ' ', string.punctuation)

# Remove them from the text
texts.iloc[:,0] = [j.translate(table) for j in texts.iloc[:,0]]
texts.iloc[:,0] = [j.replace('\x96',' ') for j in texts.iloc[:,0]]

# Eliminate double spaces
texts.iloc[:,0] = [" ".join(j.split()) for j in texts.iloc[:,0]]

# Show first 5
texts.head()

## Estimating the embedding

Once we have the embedding model, using it consists of:

1. Extract the words that appear in the texts and save them to disk.
2. Use the fastText program to obtain the word embeddings.
3. Import the embeddings into a Keras input layer.
4. Train the model!

First, we will start by selecting the individual words. The Keras internal model "[Tokenizer](https://keras.io/preprocessing/text/#tokenizer)" will allow us to quickly do this, with the added benefit of giving us a [dictionary](https://docs.python.org/2/tutorial/datastructures.html#dictionaries) of the words, which will be stored in the "tokenizer" model.

A dictionary is a very powerful object built into Python, which will efficiently index anything by any key. In our case it will map each word to an arbitrary number that gives its position. Read the linked article to know more about dictionaries, but it is important that you understand its usefulness: It allows fast (indexed) access to objects linked by a key (in this case, the words).
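
As a rough illustration of what the word dictionary holds, here is a minimal pure-Python sketch that builds a similar one by hand. The helper `build_word_index` and the toy texts are hypothetical (they are not part of Keras); the sketch only mimics the idea that more frequent words get smaller indexes, starting at 1.

```python
from collections import Counter

def build_word_index(texts):
    """Hypothetical helper: map each word to an integer, most frequent first."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Index 0 is left unused, so it can be reserved for padding later on.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

toy_texts = ["the movie was great", "the plot was thin", "the end"]
toy_index = build_word_index(toy_texts)

print(toy_index["the"])         # index of the most frequent word
print("great" in toy_index)     # fast membership test by key
print(toy_index.get("sequel"))  # .get() returns None for missing keys
```

The lookups by key do not scan the whole vocabulary, which is what makes the dictionary useful once we have tens of thousands of words.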

In [0]:
tokenizer = Tokenizer() # Creates tokenizer model.
tokenizer.fit_on_texts(texts.iloc[:,0]) # Trains it over the tokens that we have.

# Get words
Vals = list(tokenizer.word_index.keys())

# Write a file with the output, one word per line.
# The with block closes the file automatically.
with codecs.open('IMDBWords.csv', "w", "utf-8") as file:
    for item in Vals:
        file.write("%s\r\n" % item)

In [0]:
!ls

In [0]:
!head IMDBWords.csv

We now have a csv file with all the words being used to review the movies in a standard format. Let's get the embeddings!

We need to call the fasttext software from the command line.

In [0]:
!ls

In [0]:
!./fastText-0.9.1/fasttext print-sentence-vectors fastText-0.9.1/cc.en.300.bin < IMDBWords.csv > EmbeddingIMDB.tsv

As always, ignore the warnings. They are basically saying "I need a very large amount of RAM to do what you are asking!".

This process actually takes a relatively long time. Let's take a look at the command part by part:

- ```!./fastText-0.9.1/fasttext``` invokes fasttext. The notation "./" means "look for this program starting from the current folder".

- We give it two parameters: ```print-sentence-vectors```, which instructs fastText to print one embedding vector per input line (here, one word per line), and ```fastText-0.9.1/cc.en.300.bin```, which is the language model we are using.

- Then comes the processing of the inputs and outputs. The "```< IMDBWords.csv```" is telling Linux "give IMDBWords.csv as an input to what's to the left" and the "```> EmbeddingIMDB.tsv```" is telling Linux "write whatever is outputted from the left into EmbeddingIMDB.tsv".

The output is a space-separated file with the embedding vectors in the same order we gave them to the software.
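
The input and output redirection used in the command can also be reproduced from Python, which may help to see what "<" and ">" do. The sketch below is only an illustration: instead of fasttext it runs a tiny stand-in program (an assumption made so the example runs anywhere) that reads lines from its input and writes them back in upper case. The file names are made up.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

workdir = Path(tempfile.mkdtemp())
words_file = workdir / "words.txt"
out_file = workdir / "out.txt"
words_file.write_text("good\nbad\n")

# Stand-in for the fasttext binary (hypothetical, for illustration only):
# reads lines from standard input and writes them upper-cased to standard output.
stand_in = [sys.executable, "-c",
            "import sys\nfor line in sys.stdin: sys.stdout.write(line.upper())"]

# Equivalent of:  program < words.txt > out.txt
with open(words_file) as fin, open(out_file, "w") as fout:
    subprocess.run(stand_in, stdin=fin, stdout=fout, check=True)

result = out_file.read_text()
print(result)
```

The program never learns the file names; it only sees a stream of lines coming in and a stream going out, which is why the order of the output matches the order of the input.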

In [0]:
!head EmbeddingIMDB.tsv

Note that this is only for training. For testing, we have two options:

- If we kept the embedding as is, we simply calculate the embeddings for the new words and add them to our matrix.

- If we retrained the embeddings, we use the vector that we already have when the word was in our original vocabulary, and a vector of zeros when it was not.
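
The second option can be sketched in a few lines. This is a toy example under stated assumptions: the vectors have 4 dimensions instead of 300, the dictionary `known_vectors` stands in for the one built from the training vocabulary, and the helper `lookup` is hypothetical.

```python
import numpy as np

embedding_dim = 4  # toy size; the lab uses 300

# Toy stand-in for the word -> vector dictionary of the training vocabulary.
known_vectors = {
    "good": np.array([0.1, 0.2, 0.3, 0.4]),
    "bad": np.array([-0.1, -0.2, -0.3, -0.4]),
}

def lookup(word, vectors, dim):
    """Hypothetical helper: return the stored vector, or zeros for unseen words."""
    return vectors.get(word, np.zeros(dim))

test_words = ["good", "mediocre"]  # "mediocre" was never seen in training
test_matrix = np.vstack([lookup(w, known_vectors, embedding_dim) for w in test_words])
print(test_matrix)
```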

fastText outputs space-separated values. We replace the spaces with commas.

In [0]:
import fileinput

with fileinput.FileInput('EmbeddingIMDB.tsv', inplace=True, backup='.bak') as file:
    for line in file:
        print(line.replace(' ', ','), end='')

We add a first line with the variable names, to be able to import it back.

In [0]:
import numpy as np
import os

# Create the first line
firstLine = ','.join(['D'+str(i) for i in np.arange(1, 301)]) + '\n'

# Open as read only. Read the file
with open('EmbeddingIMDB.tsv', 'r') as original: 
  data = original.read()

# Open to write and write the first line and the rest
with open('EmbeddingsIMDB.csv', 'w') as modified: 
  modified.write(firstLine + data)

In [0]:
!head EmbeddingsIMDB.csv

Just what we wanted! Now we have a matrix with every word in the document with its corresponding Embedding. We can now import this file into Python, and use it to train our model.

## Using the Embedding Layer

The next step is to actually train a neural network with an Embedding Layer. For this, Keras has the aptly named "Embedding" layer, which will take care of our structures. The following code creates a very simple network that does the following:

1. Read the embeddings.
2. Calculate the inputs as word indexes (instead of explicit One-Hot vectors), which record which words appear in which text.
3. Create a layer that associates the indexes with the embeddings.
4. Create the rest of the architecture.
5. Train the model.

In [0]:
# Read word embeddings
Embeddings = pd.read_csv('EmbeddingsIMDB.csv', sep=',', decimal = '.', 
                         low_memory = True, index_col = False)
Embeddings.describe()

We will now create a dictionary for the embeddings. The zip function allows us to create the (key, element) structure that we need. Read more about the zip function [here](https://docs.python.org/3.7/library/functions.html#zip).
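
A toy version of the same pattern, with made-up words and two-dimensional vectors, shows what zip and dict do together:

```python
import numpy as np

words = ["good", "bad", "plot"]
vectors = np.array([[0.1, 0.2],
                    [0.3, 0.4],
                    [0.5, 0.6]])

# zip pairs the i-th word with the i-th row; dict turns the pairs into key -> value.
toy_dict = dict(zip(words, vectors))

print(toy_dict["bad"])
```

Keep in mind that zip stops at the shorter of its inputs without raising an error, so it is worth checking that the word list and the embedding table have the same number of rows before pairing them.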

In [0]:
# Create embedding dictionary

EmbeddingsDict = dict(zip(Vals, Embeddings.values))

Now, let's study our texts to create the optimal embedding layer. One of the decisions we need to make is the maximum size of our documents. Too large, and we will need to add a lot of padding, which makes the model inefficient; too small, and we will lose a lot of information. There is no clear rule here. I usually try to cover 90% of all elements, but you can argue for anything that makes sense to you.
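
One way to turn the "cover 90% of all elements" rule of thumb into a number is to take the 90th percentile of the document lengths. The sketch below uses made-up word counts, not the IMDB data:

```python
import numpy as np

# Toy word counts per document (made-up numbers for illustration only).
toy_lengths = np.array([50, 80, 120, 150, 200, 230, 300, 420, 610, 1500])

# Length that covers roughly 90% of the documents.
cutoff = int(np.percentile(toy_lengths, 90))
covered = np.mean(toy_lengths <= cutoff)

print("cutoff:", cutoff)
print("share of documents that fit without trimming:", covered)
```

With the real data, the same two lines can be applied to the list of word counts per review.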

In [0]:
import seaborn as sns
import numpy as np
%matplotlib inline

# Count the number of words per file.
wordDist = [len(w.split()) for w in texts.iloc[:,0]]
print('Avg. no of words: ' + str(np.round(np.mean(wordDist), 2)))
print('Std. deviation: ' + str(np.round(np.std(wordDist), 2)))
print('Max words: ' + str(np.max(wordDist)))

# Generate the plot
distIMDB = sns.distplot(wordDist)

# I'm saving the image to a PDF, as it makes it easier later to download.
distIMDB.figure.savefig("wordDist.pdf", format = "pdf")

Arbitrarily, we will use 600 words maximum. Try different values!

Now we create the input layer. The first layer will have the index of each word per text, which we will then use to efficiently associate with the embedding. For this, we use Keras' "pad_sequences". This will either add padding to texts that are shorter than 600 words, or trim the ones that are longer.
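
To see what padding and trimming do to a sequence, here is a small NumPy sketch. The function `pad_post` is a hypothetical re-implementation written for illustration, not the Keras function itself. It pads at the end and, for sequences that are too long, keeps the last tokens, which is my understanding of how pad_sequences behaves with padding='post' and its default truncation setting.

```python
import numpy as np

def pad_post(seqs, maxlen, value=0):
    """Hypothetical sketch: pad at the end, and keep the last `maxlen`
    tokens of sequences that are too long."""
    out = np.full((len(seqs), maxlen), value, dtype="int32")
    for row, seq in enumerate(seqs):
        kept = seq[-maxlen:]          # trim from the front if too long
        out[row, :len(kept)] = kept   # remaining positions stay as padding
    return out

toy_sequences = [[5, 2, 9], [7, 1, 4, 8, 3, 6]]
padded = pad_post(toy_sequences, maxlen=4)
print(padded)
```

The value 0 is used for padding, which is why the word indexes start at 1.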

In [0]:
# Create word index from input
sequences = tokenizer.texts_to_sequences(texts.iloc[:,0]) # Create the sequences.

# Creates the indexes. Word index is a dictionary with words in it.
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

# Creates the training dataset, adding padding when necessary.
data = pad_sequences(sequences, maxlen=600, 
                     padding = 'post') # add padding at the end. No difference in practice.

# Creates the target variable (the class labels)
labels = texts.iloc[:,1]
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)

In [0]:
# Let's save the outputs, so we don't run all of the above 20 times.
# Be efficient! Always save intermediate outputs.

# Create saving directory (-p avoids an error if it already exists)
!mkdir -p IMDB_Preprocessed

# Save outputs. fmt='%d' stores the integer indexes and labels as integers.
np.savetxt("IMDB_Preprocessed/IMDB_Padded.txt", data, fmt='%d')
np.savetxt("IMDB_Preprocessed/IMDB_Labels.txt", labels, fmt='%d')

In [0]:
data[0]

As we can see above, each row of our data now holds, for every word in the text, the index of the row of the embedding matrix where its vector is stored. This is an extremely efficient way of storing the texts, at the cost of a lookup (a bit more CPU) when training. That's ok though!

Now we are almost ready! We need to construct the embedding matrix. This matrix will have the weights (the 300-dimensional vector) associated with each index. Keras will then look up the 600 indexes of each text, producing a 600 x 300 input per text (see below).
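
The relation between word indexes, the embedding matrix and a one-hot representation can be checked with a small NumPy sketch. All sizes and values below are toy assumptions (5 words, 3 dimensions, random vectors); the point is only that indexing rows of the matrix gives the same result as multiplying one-hot vectors by it, without ever building the one-hot vectors.

```python
import numpy as np

vocab_size, dim = 5, 3
rng = np.random.default_rng(0)
toy_matrix = rng.normal(size=(vocab_size + 1, dim))
toy_matrix[0] = 0.0  # row 0 is the padding index

toy_data = np.array([[2, 4, 0],
                     [1, 1, 3]])  # two "texts" of length 3

# Index lookup: one row of the matrix per word index.
looked_up = toy_matrix[toy_data]            # shape (2, 3, dim)

# Same result via explicit one-hot vectors, at a much higher memory cost.
one_hot = np.eye(vocab_size + 1)[toy_data]  # shape (2, 3, vocab_size + 1)
via_one_hot = one_hot @ toy_matrix

print(looked_up.shape, np.allclose(looked_up, via_one_hot))
```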

In [0]:
# Create first matrix full with 0's
embedding_matrix = np.zeros((len(word_index) + 1, 300))

# Generate embeddings matrix
for word, i in word_index.items():
    embedding_vector = EmbeddingsDict.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector

# Print what came out
embedding_matrix

In [0]:
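# Again, we save the intermediate result. If done right you only need this matrix!
# No need to run everything all over again.
# Saved inside IMDB_Preprocessed so it is included in the zip file created below.
np.savetxt("IMDB_Preprocessed/IMDB_EmbeddingMatrix.txt", embedding_matrix)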

In [0]:
# We will also save the word dictionary
# A pickle file is a Python native file
# Saved inside IMDB_Preprocessed so it is included in the zip file created below.
import pickle
with open("IMDB_Preprocessed/WordDictionary.pkl", "wb") as f:
    pickle.dump(word_index, f)

In [0]:
# Zip all files for download.
!zip -r IMDB_Preprocessed.zip IMDB_Preprocessed 

In [0]:
# Download files
from google.colab import files
files.download("IMDB_Preprocessed.zip")

## Modelling using an embedding layer

Now that we have this ready, we need to create our model and add an [Embedding Layer](https://keras.io/layers/embeddings/). We'll create a very simple model using Convolutional Layers as hidden layers. In the next lecture we'll check in detail what this means.

In [0]:
# Final model.
model = Sequential()
embedding_layer = Embedding(len(word_index) + 1,           # Words in the embedding.
                            300,                           # Embedding dimension
                            weights=[embedding_matrix],    # The weights we just calculated
                            input_length=600,              # The maximum number of words.
                            trainable=False)               # To NOT recalculate weights!

model.add(embedding_layer)

Very important: If you want the embedding to adapt to your own model, you need to set "trainable=True". If not, leave it as False.

Done! We have a model that uses an embedding layer as input. Let's try it in a (very bad) model. Our network will take the embedding as input and will estimate the probability of being of class 1 or 0 (positive or negative). A potential architecture is as follows:

- A [1D-Convolutional Layer](https://keras.io/layers/convolutional/): See the next lecture for details :). I will add 64 filters and a kernel size of 3, which means "look for 64 different combinations of 3 words that are useful". We use ReLU activation for it.

- A [Flatten](https://keras.io/layers/core/#flatten) layer: The embedding comes as a matrix (one row per word), and so does the output of the convolutional layer. We need to change this to a 1D tensor before the dense layers. The Flatten layer takes matrices (or N-Dimensional tensors) and turns them into 1D tensors.

- A Dense layer with 64 neurons and ReLU activation.

- A [Dropout](https://keras.io/layers/core/#dropout) layer: Big models can have many millions of parameters. These models are prone to overfitting. [Srivastava et al. (2014)](http://jmlr.org/papers/volume15/srivastava14a.old/srivastava14a.pdf) showed that a simple way to reduce overfitting is to randomly set a fraction of a layer's outputs to 0 at each training step. This is called "Dropout". We will randomly drop 40% of the outputs of the dense layer during training. This is a tunable parameter, so you should experiment with values that make sense to you.

- A sigmoid output layer, with 1 neuron. As this is a binary problem, that's the most appropriate one.
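
To make the Dropout step in the list above more concrete, here is a minimal NumPy sketch of what happens at training time. The function `dropout` is written for illustration only, with toy inputs; it zeroes a random subset of a layer's outputs and rescales the survivors so that the expected value does not change, which is the usual "inverted dropout" formulation.

```python
import numpy as np

def dropout(activations, rate, rng):
    """Sketch of inverted dropout at training time: zero a random subset of
    activations and rescale the rest so the expected value is unchanged."""
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(42)
acts = np.ones((1000, 64))  # toy outputs of a 64-neuron layer
dropped = dropout(acts, rate=0.4, rng=rng)

print("fraction zeroed:", np.mean(dropped == 0))
print("mean activation:", dropped.mean())
```

At prediction time nothing is dropped, so the network uses all of its neurons.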

In [0]:
# Check for 64 sequences of length 3.
model.add(Conv1D(64, 3, activation = 'relu'))

# Turn output matrices into 1D tensor for shallow network.
model.add(Flatten())

# Add 64 neurons with ReLU activation.
model.add(Dense(64))

# Add dropout.
model.add(Dropout(0.4))
model.add(Activation('relu'))

# Add an output layer with a sigmoid.
model.add(Dense(1))
model.add(Activation('sigmoid'))

As this is a binary classification problem, we need a binary cross-entropy error function. I will use the optimizer [Adam](https://arxiv.org/abs/1412.6980) by Kingma and Ba (2014), which works well for this problem and, more importantly, requires little tuning.
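
As a reference for what the binary cross-entropy measures, here is a small NumPy sketch with made-up labels and predicted probabilities. It is a plain re-statement of the formula, not the Keras implementation; the small constant eps is an assumption added to avoid taking the logarithm of 0.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; eps avoids log(0)."""
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

y_true = np.array([1, 0, 1, 0])
confident_right = np.array([0.9, 0.1, 0.8, 0.2])
confident_wrong = np.array([0.1, 0.9, 0.2, 0.8])

loss_right = binary_crossentropy(y_true, confident_right)
loss_wrong = binary_crossentropy(y_true, confident_wrong)
print(loss_right, loss_wrong)
```

Confident predictions on the wrong class are penalised much more heavily than confident predictions on the right one, which is the behaviour we want from the error function.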

In [0]:
# Use Adam as optimizer, with a binary_crossentropy error.
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['acc'])

Done! Let's see the architecture of our network.

In [0]:
model.summary()

The model has around 2.5 million trainable parameters, the rest come from the embedding, which we left unchanged. Quite an increase compared to our last model!
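
The figure can be verified by hand from the architecture. The arithmetic below assumes the settings used in this lab (600 words, 300 dimensions, 64 filters of size 3, and the default convolution settings of no padding and stride 1):

```python
seq_len, emb_dim = 600, 300
filters, kernel = 64, 3
dense_units = 64

conv_params = kernel * emb_dim * filters + filters    # weights + biases
conv_out_len = seq_len - kernel + 1                   # no padding, stride 1
flat_size = conv_out_len * filters                    # size after Flatten
dense_params = flat_size * dense_units + dense_units  # weights + biases
output_params = dense_units * 1 + 1                   # sigmoid output neuron

trainable = conv_params + dense_params + output_params
print(conv_params, dense_params, output_params, trainable)
```

Almost all of the trainable parameters sit in the dense layer that follows the Flatten layer.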

Now we train. We will hold out 33% of the data as a validation set, and train for 10 epochs.

In [0]:
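# Fit the model
# Shuffle first: validation_split takes the LAST 33% of the rows without
# shuffling, and our rows are ordered (positive reviews first, then negative).
shuffled = np.random.RandomState(20190124).permutation(len(data))
history = model.fit(data[shuffled], labels.values[shuffled],
                    validation_split=0.33, epochs=10, batch_size=20)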
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(loss) + 1)
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

We have clear signs that the model overfitted, but this is to be expected with data this simple. That's it! Now we can harness the power of embeddings for modelling. We just need to learn to create models that can leverage this power. Remember, fastText vectors are available for 157 languages, so they can be used in many contexts.

## Self-study

Change the embedding layer and train your own embeddings. Do you get an improvement?

Try different architectures to get a better value on the validation set. Why do you get such low scores? Next lecture we will study architectures that deal correctly with this many inputs.

Compare the convergence speed of Adam against SGD with a reasonable learning rate (try very small values). Which one converges faster?

### Other embeddings
In the coursework you are asked to use more than one embedding. Go through the tutorials for:

- GloVe: Keras has its own tutorial [here](https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html).

- BERT: This one, as it is so new, is far more advanced. You need to install a few new packages to make it work. Read [this](https://github.com/hanxiao/bert-as-service) or [this](https://pypi.org/project/keras-bert/).

### Categorical Embeddings

Additionally, embeddings can also be used to efficiently encode categorical variables. [Read this](https://towardsdatascience.com/deep-embeddings-for-categorical-variables-cat2vec-b05c8ab63ac0).