# Language Models
## Homework 2: Embeddings

**Instructor**: Dr. Pavlos Protopapas<br />
**Maximum Score**: 89


## INSTRUCTIONS

- This homework is a notebook. Download and work on it on your local machine or work on it in Colab.

- This homework should be submitted in a team.

- Ensure that your team submits the homework only once. Multiple submissions of the same work will be penalised and will cost you 2 points.

- Running cells out of order is a common pitfall in Jupyter Notebooks. To make sure your code works, restart the kernel and run the entire notebook again before you submit.

- To submit the homework, one team member should upload the working notebook to edStem and click the submit button in the bottom right corner.

- Submit the homework well before the given deadline. Submissions after the deadline will not be graded.

- We have tried to include all the libraries you may need for the assignment in the import statements at the top of this notebook. We strongly suggest that you use those and not others, as we may not be familiar with other libraries.

- Comment your code well. This would help the graders in case there is any issue with the notebook while running. It is important to remember that the graders will not troubleshoot your code.

- Please use .head() when viewing data. Do not submit a notebook that is **excessively long**.

- In questions that require code to answer, such as "calculate the $R^2$", do not just output the value from a cell. Write a `print()` function that includes a reference to the calculated value, **not hardcoded**. For example:
```
print(f'The R^2 is {R:.4f}')
```
- Your plots should include clear labels for the $x$ and $y$ axes as well as a descriptive title ("MSE plot" is not a descriptive title; "95 % confidence interval of coefficients of polynomial degree 5" is).

- **Ensure you make appropriate plots for every question where they are applicable, whether or not they are explicitly asked for.**

<hr style="height:2pt">

## **Names of the people who worked on this homework**
#### / Wang Hesong/ Chen Taiyi/ Li Yuepeng/ Mao Yuchen/ Yu Lufei/ Zhong Yixiao 

## **Setup Notebook**

**Imports**

In [1]:
import requests
import urllib
import re
import os
import zipfile
import collections
import numpy as np
import pandas as pd
import urllib.request
import matplotlib.pyplot as plt
from collections import defaultdict
%matplotlib inline
from IPython.display import HTML

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import backend as K
from tensorflow.keras.models import Model

from sklearn.model_selection import train_test_split

**Verify Setup**

In [2]:
print("tensorflow version", tf.__version__)
print("keras version", tf.keras.__version__)
print("Eager Execution Enabled:", tf.executing_eagerly())

# Get the number of replicas
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

devices = tf.config.get_visible_devices()
print("Devices:", devices)
print(tf.config.list_logical_devices('GPU'))

print("GPU Available: ", tf.config.list_physical_devices('GPU'))
print("All Physical Devices", tf.config.list_physical_devices())

# Better performance with the tf.data API
# Reference: https://www.tensorflow.org/guide/data_performance
AUTOTUNE = tf.data.AUTOTUNE

tensorflow version 2.10.0
keras version 2.10.0
Eager Execution Enabled: True
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
Number of replicas: 1
Devices: [PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
[LogicalDevice(name='/device:GPU:0', device_type='GPU')]
GPU Available:  [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
All Physical Devices [PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]


## **PART 1 [25 points]: Word2Vec from scratch**

<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">

The Word2Vec architecture lets us learn vector representations of word tokens from the *contexts* in which they appear.

There are several methods to build a word embedding. We will focus on the skip-gram with negative sampling (SGNS) architecture.

![](https://drive.google.com/uc?export=view&id=1eyozbhsrzRaKc86SM7LblgzVMZKAW8Pe)

In this problem, you are asked to build and analyze a Word2Vec architecture trained on Wikipedia articles.

</div>

### **PART 1: Questions**

<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">

    
**1.1 [5 points] Model Processing**
<br />
<details>
<summary>
<font size="3" color="green">
<b>Click for instructions</b>
</font>
</summary>
    
**1.1.1** - Get the data    

- Get the data from the `text8.zip` file.
    `text8.zip` is a small, *cleaned* subset of a large corpus of data scraped from Wikipedia pages. More details can be found [here](https://paperswithcode.com/sota/language-modelling-on-text8).
    It is usually used to quickly train or test language models.

- Split the data by whitespace and print the first 10 words to check that it has been loaded correctly.

    **NOTE:** For this part of the homework, all words will be in lowercase for simplicity of analysis.
<br />    

**1.1.2** - Build the dataset

- Write a function that takes the `vocabulary_size` and `corpus` as input, and outputs:
    - Tokenized data
    - Count of each token
    - A dictionary that maps words to tokens
    - A dictionary that maps tokens to words
    You can use the same function used in **Lab 3**, or else you can use [`tf.keras.preprocessing.text.Tokenizer`](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer) to write a similar function.
- Print the first 10 tokens and reverse them to words to confirm a match to the initial print above.
     
  
E.g. `corpus[:10] = ['this','is','an','example',...]`

`data[:10] = [44,26,24,16,...]`
    
`reversed_data = ['this','is','an','example',...]`

**NOTE**: Choose a sufficiently large vocabulary size, i.e. `vocab_size >= 1000`.
<br />
    
**1.1.3** - Build skipgrams with negative samples
- Use the `tf.keras.preprocessing.sequence.skipgrams` function to build positive and negative samples for word2vec training. Follow the documentation on how to make the pairs.
- You are free to choose your own `window_size`, but we recommend a value of 3.
- Print 10 pairs of *center* and *context* words with their associated labels.    
    
**Skip-gram sampling table**

A large dataset means a larger vocabulary and many occurrences of very frequent words such as stopwords. Training examples obtained from sampling commonly occurring words (such as the, is, on) don't add much useful information for the model to learn from. [Mikolov et al.](https://papers.nips.cc/paper/2013/file/9aa42b31882ec039965f3c4923ce901b-Paper.pdf) suggest subsampling of frequent words as a helpful practice to improve embedding quality.

The `tf.keras.preprocessing.sequence.skipgrams` function accepts a sampling table argument that encodes the probability of sampling each token. You can use `tf.keras.preprocessing.sequence.make_sampling_table` to generate a word-frequency rank-based probabilistic sampling table and pass it to the `skipgrams` function.    
<br />
    
**1.1.4** - What is the difference between using a sampling table and not using a sampling table while building the dataset for skipgrams?
<br /><br />
    
</details>
    
**1.2 [8 points] Building a Word2Vec model**
<br />
<details>
<summary>
<font size="3" color="green">
<b>Click for instructions</b>
</font>
</summary>
    
Build a word2vec model architecture based on the schematic below.


![](https://drive.google.com/uc?export=view&id=1fBTpoBoG5RZIPTtdogZt37Bw3oUb7tuT)
  
    
- To do so, you will need:
    - `tf.keras.layers.Embedding` layer
    - `tf.keras.layers.Dot()`
    - `tf.keras.Model()` which is the functional API
- You can choose an appropriate embedding dimension
- Compile the model using the `binary_crossentropy` loss and an appropriate optimizer.
- Sufficiently train the model.    
- Save the model weights using `model.save_weights()` for the analysis in **1.3**. More information on saving your weights is available [here](https://www.tensorflow.org/tutorials/keras/save_and_load).    
<br />

</details>
    
    
**1.3 [7 points] Post-training analysis**

<details>
<summary>
<font size="3" color="green">
<b>Click for instructions</b>
</font>
</summary>
    
This segment involves some simple analysis of your trained embeddings.
<br /><br />
    
    
**1.3.1** - Vector Algebra on Embeddings

Assuming you have chosen a sufficiently large `vocab_size`, find the embeddings for:
    
1. King
2. Male
3. Female
4. Queen
    
Find the vector `v = King - Male + Female` and find its `cosine_similarity()` with the embedding for 'Queen'.
You can use the `cosine_similarity()` function defined in the session 3 exercise.

**NOTE**: The `cosine_similarity()` value must be greater than `0.9`; if it is not, your word2vec embeddings are not well trained.

Write a function `most_similar()`, which finds the top-n words most similar to the given word. Use this function to find the words most similar to `king`.
    
**Conceptual Question** Why can't we use `cosine_similarity()` as a `loss_function`?
    
<br />
    
**1.3.2** - Visualizing Embeddings

Find the embeddings for the words:
1. 'January'
2. 'February'
3. 'March'
4. 'April'
    
Find the `cosine_similarity()` of 'january' with each of 'february', 'march', and 'april' (these should be high values).
    
Save your trained weights. Recreate the network you have created above and initialize it with random weights. Compute the `cosine_similarity()` values. The values should be small (because the embeddings are random).
    
Use a demonstrative plot to show the `before & after` of the 4 embeddings. Here are some suggestions:
    1. PCA/TSNE for dimensionality reduction
    2. Radar plot to show all embedding dimensions
    
Bonus points for using creative means to demonstrate how the embeddings change after training.

Here is a [video](https://youtu.be/VDl_iA8m8u0) of a sample demonstration. We used a custom callback to get embeddings during training.  
        

<br />
    
**1.3.3** - Embedding and Context Matrix
    
    
**1.3.3.1** Investigate the relation between the Embedding & Context matrix. Again use the `cosine_similarity()` function to find the average value across all the words in the embedding and context matrix, i.e:
  - For the word 'dog', find the embedding value, and context value. <br>
  - Calculate the `cosine_similarity()` between the two <br>
  - Repeat the same for every word in the vocabulary and calculate the average value of the `cosine_similarity()`
<br />

**1.3.3.2** Answer the following question and explain:

**Question:** Should the embedding and context matrices be identical?
<br /><br />
    
 </details>

**1.4 [5 points] Learning phrases**
    
<details>
<summary>
<font size="3" color="green">
<b>Click for instructions</b>
</font>
</summary>

As per the original paper by [Mikolov et al](https://arxiv.org/abs/1301.3781), many phrases have a meaning that is not a simple composition of the meanings of their individual words.
For example, `new york` is one entity; however, as per our analysis above, we have two separate entities `new` & `york`, which can have different meanings independently.    
To learn vector representations for phrases, we first find words that
appear frequently together, and infrequently in other contexts.
    
As per the analysis in the paper, we can use a formula to score word pairs, rank them, and take the 100 highest-scoring pairs.
$$\operatorname{score}\left(w_{i}, w_{j}\right)=\frac{\operatorname{count}\left(w_{i} w_{j}\right)-\delta}{\operatorname{count}\left(w_{i}\right) \times \operatorname{count}\left(w_{j}\right)}$$

**NOTE:** For simplicity of analysis, we take the discounting factor $\delta$ as 0, and take bi-gram combinations. You can experiment with tri-grams for word pairs such as `New_York_Times`.     
<br /><br />

    
**1.4.1** - Find 100 most common bi-grams

From the tokenized data above, find the count for each bigram pair.
    
For each such pair, find the score associated with each token pair using the formula above.
    
Pick the top 100 pairs based on the score (the higher, the better). To understand the `score()` function, we suggest you read the paper mentioned above.
    
In the original `text8` corpus, replace each of these pairs with a single entity. E.g., if `prime, minister` is a commonly occurring pair, replace `... prime minister ...` in the original corpus with a single entity `prime_minister`. Do this for all 100 pairs.
<br /><br />
    
**1.4.2** - Retrain word2vec
With the new corpus generated as above, build the dataset, use skipgrams and retrain your word2vec with a sufficiently large vocabulary.

Write a function `most_dissimilar()`, analogous to `most_similar()`, that finds the top-n words which are **most dissimilar** to the given word.
Use this function to find the entities most dissimilar to `united_kingdom`.
    
Compare the above with the separate tokens for `united` & `kingdom` and with the sum of their vectors (to get this, you may need a sufficiently large vocabulary (>2000)).
<br /> <br />
</div>

### **PART 1: Solutions**

<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">
    
#### **1.1 [5 points] Model processing**

**1.1.1** - Get the data  

- Get the data from the `text8.zip` file.
    `text8.zip` is a small, *cleaned* subset of a large corpus of data scraped from Wikipedia pages. More details can be found [here](https://paperswithcode.com/sota/language-modelling-on-text8).
    It is usually used to quickly train or test language models.
    
- Split the data by whitespace and print the first 10 words to check that it has been loaded correctly.

    **NOTE:** For this part of the homework, all words will be in lowercase for simplicity of analysis.
    </div>

In [None]:
# Helper code to read the data

# Download
urllib.request.urlretrieve("https://github.com/dlops-io/datasets/releases/download/v1.0/text8.zip", "text8.zip")

# Unzip and read data
filename = 'text8.zip'
with zipfile.ZipFile(filename) as f:
    vocabulary = tf.compat.as_str(f.read(f.namelist()[0])).split()

<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">

**1.1.2** - Build the dataset  

- Write a function that takes the `vocabulary_size` and `corpus` as input, and outputs:
    - Tokenized data
    - Count of each token
    - A dictionary that maps words to tokens
    - A dictionary that maps tokens to words
    You can use the same function used in **Lab 3**, or else you can use [`tf.keras.preprocessing.text.Tokenizer`](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer) to write a similar function.
- Print the first 10 tokens and reverse them to words to confirm a match to the initial print above.
     
  
E.g. `corpus[:10] = ['this','is','an','example',...]`

`data[:10] = [44,26,24,16,...]`
    
`reversed_data = ['this','is','an','example',...]`

**NOTE**: Choose a sufficiently large vocabulary size, i.e. `vocab_size >= 1000`.
    
</div>
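
A minimal sketch of such a function on a toy corpus, to show the four expected outputs. It assumes that words outside the vocabulary share a single `UNK` token with id 0; that convention, the name `build_dataset_sketch`, and the toy sentence are illustrative assumptions, and the Lab 3 version may differ.

```python
import collections


def build_dataset_sketch(words, n_words):
    """Tokenize `words`, keeping the n_words - 1 most frequent words plus 'UNK'."""
    # count[i] = [word, frequency]; the position i is the token id of that word.
    count = [["UNK", 0]]
    count.extend([w, c] for w, c in collections.Counter(words).most_common(n_words - 1))
    dictionary = {word: token for token, (word, _) in enumerate(count)}
    # Words outside the vocabulary fall back to token 0 ('UNK').
    data = [dictionary.get(word, 0) for word in words]
    count[0][1] = data.count(0)
    reverse_dictionary = {token: word for word, token in dictionary.items()}
    return data, count, dictionary, reverse_dictionary


# Toy corpus (assumption, for illustration only).
corpus = "the cat sat on the cat the end".split()
data, count, dictionary, reverse_dictionary = build_dataset_sketch(corpus, n_words=3)
print(data)
print([reverse_dictionary[t] for t in data])
```

Mapping the tokens back through `reverse_dictionary` reproduces the original words, except that rare words come back as `UNK`.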

In [None]:
def build_dataset(words, n_words):
    """Process raw inputs into a dataset."""

    # Your code here

In [None]:
vocab_size = 1000
data, count, dictionary, reverse_dictionary = build_dataset(vocabulary,
                                                            vocab_size)
del vocabulary  # Hint to reduce memory.

<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">

**1.1.3** - Build skipgrams with negative samples
- Use the `tf.keras.preprocessing.sequence.skipgrams` function to build positive and negative samples for word2vec training. Follow the documentation on how to make the pairs.
- You are free to choose your own `window_size`, but we recommend a value of 3.
- Print 10 pairs of *center* and *context* words with their associated labels.    
    
**Skip-gram sampling table**

A large dataset means a larger vocabulary and many occurrences of very frequent words such as stopwords. Training examples obtained from sampling commonly occurring words (such as the, is, on) don't add much useful information for the model to learn from. [Mikolov et al.](https://papers.nips.cc/paper/2013/file/9aa42b31882ec039965f3c4923ce901b-Paper.pdf) suggest subsampling of frequent words as a helpful practice to improve embedding quality.

The `tf.keras.preprocessing.sequence.skipgrams` function accepts a sampling table argument that encodes the probability of sampling each token. You can use `tf.keras.preprocessing.sequence.make_sampling_table` to generate a word-frequency rank-based probabilistic sampling table and pass it to the `skipgrams` function.    

</div>
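
To make the shape of the training data concrete, here is a pure-Python illustration of what positive and negative skip-gram pairs look like. It is not the Keras implementation, and the assignment still asks you to use `tf.keras.preprocessing.sequence.skipgrams`. The function name `skipgram_pairs_sketch`, the one-negative-per-positive ratio, and the assumption that token id 0 is reserved are choices made for this sketch.

```python
import random


def skipgram_pairs_sketch(sequence, vocabulary_size, window_size=3, seed=0):
    """Return (pairs, labels): label 1 for real neighbours, label 0 for random pairs."""
    rng = random.Random(seed)
    pairs, labels = [], []
    for i, center in enumerate(sequence):
        lo = max(0, i - window_size)
        hi = min(len(sequence), i + window_size + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            # Positive sample: the context word really occurs inside the window.
            pairs.append((center, sequence[j]))
            labels.append(1)
            # Negative sample: a random token id (id 0 is assumed to be reserved).
            # A random draw can occasionally coincide with a real neighbour;
            # this sketch does not filter those cases out.
            pairs.append((center, rng.randint(1, vocabulary_size - 1)))
            labels.append(0)
    return pairs, labels


pairs, labels = skipgram_pairs_sketch([5, 2, 9, 4], vocabulary_size=10, window_size=1)
for pair, label in list(zip(pairs, labels))[:4]:
    print(pair, label)
```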

In [None]:
# Your code here

<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">

**1.1.4** - What is the difference between using a sampling table and not using a sampling table while building the dataset for skipgrams?
    
</div>
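
As a starting point for this question, the subsampling rule from Mikolov et al. can be sketched as follows: a word with relative frequency $f$ is discarded with probability $1 - \sqrt{t/f}$, so it is kept with probability $\min(1, \sqrt{t/f})$. The threshold `t = 1e-5` follows the paper's suggestion; the three example frequencies are made-up values for illustration, not measurements from `text8`.

```python
import math


def keep_probability(word_frequency, threshold=1e-5):
    """Probability of keeping a word with the given relative frequency."""
    return min(1.0, math.sqrt(threshold / word_frequency))


# Hypothetical relative frequencies, chosen only to show the trend.
for word, freq in [("the", 0.06), ("king", 1e-4), ("rare_word", 1e-6)]:
    print(f"{word:>10}: keep with probability {keep_probability(freq):.3f}")
```

Very frequent words are dropped most of the time, while rare words are always kept. The rank-based table described above plays the same role without needing the actual counts.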

<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">

#### **1.2 [8 points]** **Building a word2vec model**

Build a word2vec model architecture based on the schematic below.


![](https://drive.google.com/uc?export=view&id=1fBTpoBoG5RZIPTtdogZt37Bw3oUb7tuT)
  
    
- To do so, you will need:
    - `tf.keras.layers.Embedding` layer
    - `tf.keras.layers.Dot()`
    - `tf.keras.Model()` which is the functional API
- You can choose an appropriate embedding dimension
- Compile the model using the `binary_crossentropy` loss and an appropriate optimizer.
- Sufficiently train the model.    
- Save the model weights using `model.save_weights()` for the analysis in **1.3**. More information on saving your weights is available [here](https://www.tensorflow.org/tutorials/keras/save_and_load).    
    
</div>
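
The schematic is not reproduced in text here, so the following NumPy sketch rests on an assumption about it: two lookup tables (one for center words, one for context words), a dot product between the looked-up vectors, and a sigmoid that turns the result into the probability that the pair is a real one. It only illustrates the computation that the `Embedding` and `Dot` layers would perform; the Keras model itself is left for you to build.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 10, 4  # toy sizes, for illustration only

# Two lookup tables: one for center words, one for context words.
embedding = rng.normal(scale=0.1, size=(vocab_size, embedding_dim))
context = rng.normal(scale=0.1, size=(vocab_size, embedding_dim))


def sgns_forward(center_ids, context_ids):
    """Sigmoid of the dot product between center and context vectors."""
    logits = np.sum(embedding[center_ids] * context[context_ids], axis=-1)
    return 1.0 / (1.0 + np.exp(-logits))


def binary_crossentropy(labels, probs, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    probs = np.clip(probs, eps, 1.0 - eps)
    return float(-np.mean(labels * np.log(probs) + (1 - labels) * np.log(1 - probs)))


center_ids = np.array([1, 1, 2])
context_ids = np.array([3, 7, 5])
labels = np.array([1.0, 0.0, 1.0])  # 1 = real pair, 0 = negative sample

probs = sgns_forward(center_ids, context_ids)
loss = binary_crossentropy(labels, probs)
print(probs.shape, loss)
```

With small random weights every probability is close to 0.5, so the loss starts near $\ln 2$; training should push it down.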

In [None]:
# Your code here

<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">

#### **1.3 [7 points] Post-training analysis**
    
This segment involves some simple analysis of your trained embeddings.

    
**1.3.1** - Vector Algebra on Embeddings

Assuming you have chosen a sufficiently large `vocab_size`, find the embeddings for:
    
1. King
2. Male
3. Female
4. Queen
    
Find the vector `v = King - Male + Female` and find its `cosine_similarity()` with the embedding for 'Queen'.
You can use the `cosine_similarity()` function defined in the session 3 exercise.

**NOTE**: The `cosine_similarity()` value must be greater than `0.9`; if it is not, your word2vec embeddings are not well trained.

Write a function `most_similar()`, which finds the top-n words most similar to the given word. Use this function to find the words most similar to `king`.
    <br />
    <br />
    
**Conceptual Question** Why can't we use `cosine_similarity()` as a `loss_function`?
    
<br />
    
</div>
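
A small sketch of the two helpers on hand-picked toy vectors. The embeddings below are invented for illustration and are not trained; the signature of `most_similar` (taking the embedding matrix and both dictionaries as arguments) is also an assumption, and the session 3 version of `cosine_similarity()` may be written differently.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def most_similar(word, embeddings, dictionary, reverse_dictionary, top_n=3):
    """Return the top_n (word, similarity) pairs closest to `word`, excluding itself."""
    # Normalising the rows once turns every cosine similarity into a dot product.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    query = dictionary[word]
    sims = unit @ unit[query]
    order = np.argsort(-sims)
    ranked = [(reverse_dictionary[int(i)], float(sims[i])) for i in order if i != query]
    return ranked[:top_n]


# Toy 2-D embeddings, hand-picked for illustration only.
dictionary = {"king": 0, "queen": 1, "apple": 2}
reverse_dictionary = {i: w for w, i in dictionary.items()}
embeddings = np.array([[1.0, 0.1], [0.9, 0.2], [-0.2, 1.0]])

neighbours = most_similar("king", embeddings, dictionary, reverse_dictionary, top_n=2)
print(neighbours)
```

Sorting in the opposite direction gives the least similar words instead.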

In [None]:
# Your code here

<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">

    
**1.3.2** - Visualizing Embeddings

Find the embeddings for the words:
1. 'January'
2. 'February'
3. 'March'
4. 'April'
    
Find the `cosine_similarity()` of 'january' with each of 'february', 'march', and 'april' (these should be high values).
    
Save your trained weights. Recreate the network you have created above and initialize it with random weights. Compute the `cosine_similarity()` values. The values should be small (because the embeddings are random).
    
Use a demonstrative plot to show the `before & after` of the 4 embeddings. Here are some suggestions:
    1. PCA/TSNE for dimensionality reduction
    2. Radar plot to show all embedding dimensions
    
Bonus points for using creative means to demonstrate how the embeddings change after training.

Here is a [video](https://youtu.be/VDl_iA8m8u0) of a sample demonstration. We used a custom callback to get embeddings during training.  
            
</div>
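
For the dimensionality-reduction suggestion, one possible sketch is a PCA projection built directly from the SVD. The vectors below are random placeholders standing in for the four month embeddings; the helper name `pca_2d` and the embedding size of 16 are assumptions made for this example.

```python
import numpy as np


def pca_2d(vectors):
    """Project row vectors onto their first two principal components."""
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


rng = np.random.default_rng(0)
months = ["january", "february", "march", "april"]
# Placeholder vectors (random, not trained), for illustration only.
vectors = rng.normal(size=(len(months), 16))

points = pca_2d(vectors)
for month, (x, y) in zip(months, points):
    print(f"{month:>9}: ({x:+.2f}, {y:+.2f})")
```

If you compare untrained and trained embeddings this way, fitting the projection on both sets stacked together keeps the two scatter plots on the same axes.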

In [None]:
# Your code here

<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">

#### **1.4 [5 points] Learning phrases**
    
As per the original paper by [Mikolov et al](https://arxiv.org/abs/1301.3781), many phrases have a meaning that is not a simple composition of the meanings of their individual words.
For example, `new york` is one entity; however, as per our analysis above, we have two separate entities `new` & `york`, which can have different meanings independently.    
To learn vector representations for phrases, we first find words that
appear frequently together, and infrequently in other contexts.
    
As per the analysis in the paper, we can use a formula to score word pairs, rank them, and take the 100 highest-scoring pairs.
$$\operatorname{score}\left(w_{i}, w_{j}\right)=\frac{\operatorname{count}\left(w_{i} w_{j}\right)-\delta}{\operatorname{count}\left(w_{i}\right) \times \operatorname{count}\left(w_{j}\right)}$$

**NOTE:** For simplicity of analysis, we take the discounting factor $\delta$ as 0, and take bi-gram combinations. You can experiment with tri-grams for word pairs such as `New_York_Times`.     
    
</div>
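
The score can be computed with two counters, as in this sketch on a toy sentence. The `min_count` argument is a hypothetical addition that is not part of the assignment: with $\delta = 0$, a pair of rare words that co-occur once can score as high as a genuinely frequent phrase, so some filter on the pair count may be useful.

```python
import collections


def bigram_scores(tokens, delta=0.0, min_count=1):
    """score(wi, wj) = (count(wi wj) - delta) / (count(wi) * count(wj))."""
    unigram = collections.Counter(tokens)
    bigram = collections.Counter(zip(tokens, tokens[1:]))
    return {
        pair: (n - delta) / (unigram[pair[0]] * unigram[pair[1]])
        for pair, n in bigram.items()
        if n >= min_count
    }


# Toy sentence (assumption, for illustration only).
tokens = "the prime minister met the prime minister of the city".split()

scores = bigram_scores(tokens, min_count=2)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))

# Without the filter, the one-off pair ('minister', 'met') ties with the real phrase.
unfiltered = bigram_scores(tokens)
print(unfiltered[("minister", "met")], unfiltered[("prime", "minister")])
```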

**1.4.1** - Find 100 most common bi-grams

From the tokenized data above, find the count for each bigram pair.
    
For each such pair, find the score associated with each token pair using the formula above.
    
Pick the top 100 pairs based on the score (the higher, the better). To understand the `score()` function, we suggest you read the paper mentioned above.
    
In the original `text8` corpus, replace each of these pairs with a single entity. E.g., if `prime, minister` is a commonly occurring pair, replace `... prime minister ...` in the original corpus with a single entity `prime_minister`. Do this for all 100 pairs.
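
The replacement step can be sketched as a single left-to-right pass over the token list. The function name `merge_bigrams` and the greedy handling of overlapping pairs (the first match wins) are choices made for this sketch.

```python
def merge_bigrams(tokens, pairs):
    """Replace each adjacent (w1, w2) found in `pairs` with the single token 'w1_w2'."""
    pairs = set(pairs)
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in pairs:
            merged.append(f"{tokens[i]}_{tokens[i + 1]}")
            i += 2  # skip both words of the merged pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged


tokens = "the prime minister visited new york".split()
merged = merge_bigrams(tokens, [("prime", "minister"), ("new", "york")])
print(merged)
```

Working on the token list rather than on the raw string avoids accidental matches across word boundaries.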

In [None]:
# Get the training data again
filename = 'text8.zip'
with zipfile.ZipFile(filename) as f:
    # Read the data into a single string.
    super_text = tf.compat.as_str(f.read(f.namelist()[0]))

In [None]:
# Fill in to complete this function
def build_dataset(words, n_words):
    """Process raw inputs into a dataset."""

    # Your code here

In [None]:
# Make sure to use lower case and split as before
corpus = super_text.lower().split()
data, count, dictionary, reversed_dictionary = build_dataset(corpus,6000)

In [None]:
# Your code here

<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">

**1.4.2** - Retrain word2vec   
With the new corpus generated as above, build the dataset, use skipgrams and retrain your word2vec with a sufficiently large vocabulary.

Write a function `most_dissimilar()`, analogous to `most_similar()`, that finds the top-n words which are **most dissimilar** to the given word.
Use this function to find the entities most dissimilar to `united_kingdom`.
    
Compare the above with the separate tokens for `united` & `kingdom` and with the sum of their vectors (to get this, you may need a sufficiently large vocabulary (>2000)).
<br />

</div>

In [None]:
# Your code here

## **PART 2 [64 points]: IMDB Sentiment Analysis using ELMo**

<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">
   
Sentiment analysis, also known as opinion mining or emotion AI, is the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information.
    
For this part of the homework, we will be using the IMDB dataset, which is publicly available [here](http://ai.stanford.edu/~amaas/data/sentiment/).    

This represents a "many-to-one" problem, with the output classified as a `positive` or `negative` sentiment, depending on the words used in the review.
    
    
In the first part of this section, you are expected to build a language model to train a basic ELMo.
    
Although the original ELMo implementation uses *Character Embeddings*, for the sake of this homework, we will use word embeddings instead.
    
Read more about the ELMo paper [here](https://arxiv.org/pdf/1802.05365.pdf).

In the second part of this subsection, you will use the generated ELMo embeddings in a deep-learning model to perform sentiment analysis using the IMDB dataset.
    
You will compare its performance with a baseline model without any trained embeddings, and with another model that directly uses the `word2vec` embeddings.

<br />
    </div>

### **PART 2: Questions**    

<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">

 **2.1 [15 points] Preprocess the dataset**
<br />

<details>
<summary>
<font size="3" color="green">
<b>Click for instructions</b>
</font>
</summary>
    
**2.1.1** - Load the dataset

For simplicity, we will use the training split of the IMDB dataset.
- Limit to the most frequent 5000 words.
- Do not skip any frequently occurring words.
- Limit reviews to a maximum length of 200 words.
    
    
**NOTE**: You can use `imdb.get_word_index()` to get the mapping between tokens and words. This will load a dictionary with the mappings, which have to be corrected. Helper code is provided below to fix the dictionary.
    
Read more about `tf.keras.datasets.imdb` [here](https://www.tensorflow.org/api_docs/python/tf/keras/datasets/imdb/load_data?version=nightly).    
    
To each review, you must add an end-of-sentence token \<\/s\>.

E.g. Review: "\<s\> This movie is so bad, I had to leave early"
    
Modified review: "\<s\> This movie is so bad, I had to leave early \<\/s\>"
<br />
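
On tokenized reviews, adding the end token is a one-line list operation, as in this sketch. The value `START_ID = 1` matches the default `start_char` of the Keras IMDB loader; `END_ID = 3` is only a placeholder, so use whichever id your corrected dictionary assigns to `</s>`.

```python
START_ID = 1  # default `start_char` of the Keras IMDB loader
END_ID = 3    # hypothetical id for </s>; take the real one from your corrected dictionary


def add_end_token(reviews, end_id=END_ID):
    """Return a copy of every tokenized review with the end-of-sentence id appended."""
    return [list(review) + [end_id] for review in reviews]


# Two made-up tokenized reviews, for illustration only.
reviews = [[START_ID, 14, 22, 16], [START_ID, 9, 7]]
with_end = add_end_token(reviews)
print(with_end)
```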

**2.1.2** - Load the `word2vec` embeddings
    
You will use the pre-trained `word2vec` embeddings for this section of the homework. This file can be downloaded from [here](https://github.com/dlops-io/datasets/releases/download/v1.0/GoogleNews-vectors-negative300.bin.gz) and can be accessed like a dictionary.
    
**NOTE**: To access the pre-trained embeddings, the [gensim](https://pypi.org/project/gensim/) library can be used.

Check if the embeddings for the start token (\<s\>) and end token (\</s\>) are present in the loaded `word2vec` embeddings.

Add the \<s\> and \</s\> tokens to the `word2vec` embeddings as random vectors if they are not present.

Create an `embedding_matrix` that will consist of the words present in word2vec. It will be a matrix of dimension `num_words x embedding_dim` (`5000 x 300`), including the added start and end tokens.
<br /><br />
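
One way to assemble such a matrix is sketched below with a plain dictionary standing in for the loaded `word2vec` vectors. The names `build_embedding_matrix` and `index_to_word`, the toy vocabulary, the 3-dimensional vectors, and the choice to leave rows of missing words at zero are all assumptions for this example.

```python
import numpy as np


def build_embedding_matrix(index_to_word, word_vectors, embedding_dim, seed=0):
    """Row i holds the vector for index_to_word[i]; rows of missing words stay zero."""
    rng = np.random.default_rng(seed)
    # Assumes the token ids are exactly 0 .. len(index_to_word) - 1.
    matrix = np.zeros((len(index_to_word), embedding_dim), dtype=np.float32)
    for index, word in index_to_word.items():
        if word in word_vectors:
            matrix[index] = word_vectors[word]
        elif word in ("<s>", "</s>"):
            # Start and end tokens get random vectors when they have no embedding.
            matrix[index] = rng.normal(scale=0.1, size=embedding_dim)
    return matrix


# Toy stand-ins for the vocabulary and the pre-trained vectors.
index_to_word = {0: "<pad>", 1: "<s>", 2: "</s>", 3: "movie", 4: "zzz"}
word_vectors = {"movie": np.ones(3)}

embedding_matrix = build_embedding_matrix(index_to_word, word_vectors, embedding_dim=3)
print(embedding_matrix.shape)
```
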
**2.1.3** Prepare data for model training:

- Not all the words in the reviews are present in the embeddings file. Hence, if it is not present, you must OMIT that word from the sentence.
   
E.g. If `and` token is not present in the embeddings:
   
```
OLD SENTENCE: <s> The movie was good and I really liked it </s>
    
NEW SENTENCE: <s> The movie was good I really liked it </s>
    
```

- Split the data (`x_train` (tokens list), `y_train` (class list)) into 80% training and 20% validation. We will use the `y_train` (which is the sentiment associated with each movie review) only in Part 2.3 for sentiment analysis.
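
The omission step amounts to filtering each tokenized review against the set of ids that have an embedding. In this sketch, `known_ids` and the token values are hypothetical.

```python
def drop_missing_tokens(reviews, known_ids):
    """Keep only the tokens that have an embedding; the order within a review is preserved."""
    known = set(known_ids)
    return [[token for token in review if token in known] for review in reviews]


# Made-up tokenized reviews; ids 1 and 2 play the role of <s> and </s>.
reviews = [[1, 5, 8, 6, 2], [1, 8, 9, 2]]
known_ids = {1, 2, 5, 6, 9}  # hypothetical: ids that have a word2vec vector

filtered = drop_missing_tokens(reviews, known_ids)
print(filtered)
```

Token 8 has no embedding in this toy setup, so it disappears from both reviews while the start and end tokens stay in place.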

</details>
<br />
    
**2.2 [34 points] Define and train the model**
    
<details>
<summary>
<font size="3" color="green">
<b>Click for instructions</b>
</font>
</summary>
    
We define an *ELMo-like* language model using bi-directional LSTMs and residual connections  **without** the character CNN to simplify the analysis.
    
We will use the `word2vec` embeddings instead of the character representations of the CNN.
    
For simplicity, we train our *ELMo-like* language model on the IMDB dataset itself. But generally, language models are trained on much larger corpora.  

**2.2.1** Build a `tf.data.Dataset` for training, and another one for validation

- Set an appropriate batch size (32, 64, 128, 256, ...). This value will be determined by the GPU you have.

- Set `train_shuffle_buffer_size` and `validation_shuffle_buffer_size` to the length of training data and validation data respectively

- `tf.data.Dataset` is an efficient way to build data pipelines. Instead of preprocessing the entire dataset, we can preprocess a batch. It is faster and consumes fewer resources, which is optimal for training.

- Hint: When creating the `tf.data.Dataset`, use `tf.ragged.constant` to convert your ragged token lists to ragged tensors.

- `dataset.map` enables us to apply a function to each element of the dataset. The parameter `num_parallel_calls` allows us to control how many threads we will use to feed the network. It can be set to `num_parallel_calls=AUTOTUNE`. You will use the `transform_pad` function to perform model-specific data processing.

- When building the tf.data pipeline use the following order:
  - shuffle
  - batch
  - map
  - prefetch
<br>
<br>
- After you build your train and validation datasets, use `dataset.take(1)` to view the first batch of data from the training dataset. It is important to verify the data input and output dimensions before modelling.
<br /><br />
    
**2.2.2** Building the language model
    
*Transform Input within the model:*

*   In the forward LSTM, we use the n-th token to predict the (n+1)-th token. Hence we want to discard the last one, the end-of-sentence token `</s>`, from all the sentences. Remember that all the sentences are padded, so `</s>` will not be the last element of the sequence.

   * One way of achieving that is to use a Boolean mask with the help of [tf.sequence_mask](https://www.tensorflow.org/api_docs/python/tf/sequence_mask), or to use [tf.gather_nd](https://www.tensorflow.org/api_docs/python/tf/gather_nd), which can be used to select specific elements from a tensor based on their indices. Remember that you can combine multiple Boolean masks via logical operations. We also encourage you to come up with your own solutions.
   * Note that after applying a Boolean mask your outputs will be flattened, and you have to reshape them back with the appropriate batch size. Remember that you removed the end-of-sentence token, hence the length of the sequence will be `length - 1`.

    
![](https://drive.google.com/uc?export=view&id=1f5bPplDGlRUdfii5X1bD2kd20SKRyOSo)
  

* For the backward LSTM layers, the prediction task is the reverse of the forward one: we aim to predict the n-th token from the (n+1)-th token. To achieve this, we remove the start-of-sentence token `<s>` from all the sentences.
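
The masking and reshaping described in the bullets can be illustrated in NumPy as follows. The token ids are hypothetical, and the sketch assumes that every row contains exactly one `<s>` and one `</s>`. Inside the model you would need TensorFlow operations instead, for example `tf.boolean_mask` followed by `tf.reshape`.

```python
import numpy as np

PAD, START, END = 0, 1, 2  # hypothetical token ids, for illustration only

# Two padded reviews; each row holds exactly one START and one END token.
batch = np.array([
    [START, 7, 8, 9, END],
    [START, 5, 6, END, PAD],
])


def drop_token(tokens, token_id):
    """Remove `token_id` from every row with a Boolean mask, then restore the batch shape."""
    batch_size, length = tokens.shape
    kept = tokens[tokens != token_id]            # Boolean masking flattens the result
    return kept.reshape(batch_size, length - 1)  # every row lost exactly one token


without_end = drop_token(batch, END)      # <s> w1 ... wn (plus padding)
without_start = drop_token(batch, START)  # w1 ... wn </s> (plus padding)

# Forward direction: token n predicts token n + 1.
forward_input, forward_target = without_end, without_start
# Backward direction: token n + 1 predicts token n.
backward_input, backward_target = without_start, without_end

print(forward_input)
print(forward_target)
```

The same two tensors serve both directions with their roles swapped, which is why only one token has to be removed per direction.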


The model's inputs should be followed by a `tf.keras.layers.Embedding()` layer. The `Embedding` layer will act as a lookup table to convert token inputs to their corresponding word2vec values. When initialising the `Embedding` layer, set the layer weights with the `embedding_matrix` you had built in the previous question.

Set `trainable=False` and `mask_zero=True` in the `Embedding` layer. The input dimension should be the number of rows in `embedding_matrix` and the output dimension should be your embedding dimension of `300`.
    
Refer to this image from the lecture slides on ELMo.
    
    
![](https://drive.google.com/uc?export=view&id=1fNPnrBR7Wfh_Jci70QaHde1L3nTcs9-D)
    
*go_backwards in backward LSTM layers*

* To predict the words backwards, we will use the go_backwards parameter present in the TensorFlow LSTM layer implementation. Remember to reverse the output of each backward LSTM layer before using it. Refer to the documentation of [tf.keras.layers.LSTM](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM). Remember to invert the sequence dimension and not the embedding dimension.

Remember to use the **same** *softmax* layer on both the forward and the backward LSTMs.
    
This will give you an `output_f` and `output_b` which you will evaluate with your two targets.  

$$L = -\sum y \cdot \log(\widehat{y}_{right}) -\sum y \cdot \log(\widehat{y}_{left})$$

Use an appropriate loss function and optimizer, and train the network sufficiently.

Finally, plot the training history of the train and validation loss.
<br />
    
**NOTE**: Use native tensorflow functions like `tf.shape` instead of numpy functions like `np.shape`.
<br />  

**2.2.3** Extracting ELMo embeddings
    
Use the Functional API to build another model called `Toy_ELMo` to obtain the embeddings.

*To concatenate the forward and backward LSTM hidden states we have to align them, which will be achieved by removing the start and end of sentence token embeddings from the forward and the backwards LSTM hidden states respectively. You can use the same logic that you used to process the input. It is necessary to reverse the output of the backward LSTM layers before using it in the Toy ELMo model.*

The obtained embeddings should be:
    
1. The `word2vec` embeddings
        
    This is just the output after the masked layer in the language model defined above.    
    
2. The first bidirectional-LSTM layer embeddings

    This will be the concatenation of the first LSTM layers of the language model (`lstm1 forward + lstm1 backwards`).
    
3. The second bidirectional-LSTM layer embeddings

    This will be the concatenation of the second LSTM layer of the language model (`lstm2 forward + lstm2 backwards`).    
    
- Make a test prediction of your `Toy_ELMo` model  
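
The alignment described above can be illustrated with a small NumPy sketch. It is not the `Toy_ELMo` model itself: the hidden states are placeholder arrays, the sentence is unpadded, and the names `h_forward` and `h_backward` are hypothetical. It only shows which rows are dropped before the two directions are concatenated.

```python
import numpy as np

tokens = ["<s>", "the", "movie", "</s>"]
hidden = 2  # toy hidden size per direction

# Row i of each state matrix corresponds to input token i of that direction
forward_in = tokens[:-1]    # ['<s>', 'the', 'movie']   (forward LSTM input)
backward_in = tokens[1:]    # ['the', 'movie', '</s>']  (backward LSTM input)

# Placeholder hidden states, shape (time, hidden); the backward states are
# assumed to be already reversed back into reading order
h_forward = np.arange(len(forward_in) * hidden, dtype=float).reshape(len(forward_in), hidden)
h_backward = -np.arange(len(backward_in) * hidden, dtype=float).reshape(len(backward_in), hidden)

# Drop the <s> row from the forward states and the </s> row from the backward states
h_forward_aligned = h_forward[1:]      # rows for 'the', 'movie'
h_backward_aligned = h_backward[:-1]   # rows for 'the', 'movie'

# Concatenate along the feature axis: each word now has 2 * hidden features
elmo_layer = np.concatenate([h_forward_aligned, h_backward_aligned], axis=-1)
print(elmo_layer.shape)  # (2, 4)
```

With padded batches the `</s>` row is not the last column, so the same masking logic used for the model inputs is needed instead of plain slicing.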

</details>
    
<br />
    
 **2.3 [15 points] Transfer Learning**
    
<details>
<summary>
<font size="3" color="green">
<b>Click for instructions</b>
</font>
</summary>
    
Once you've sufficiently trained your ELMo embeddings, we can use them for a downstream task such as sentiment analysis.
    
**2.3.1** - Baseline model:
    
For the baseline model, you will use:

- `tf.keras.layers.Embedding()` layer
-  2 layers of `GRU` with `hidden_size=300`
-  Dense output layer
    
You will build a `tf.data.Dataset` similar to the one created in Section 2.2.1, but instead of having a series of tokens as the target, the target should only be a class (positive or negative sentiment). Unlike in 2.2.1, we only need a single sequence of tokens for this model.

Train it for sufficient epochs using an appropriate optimizer and learning rate.

**2.3.2** - Directly using pre-trained `word2vec`:
    
For this section, use the pre-trained `word2vec` embeddings directly in your model.

You can use the same `tf.data.Dataset` from 2.3.1 for this model.
    
Train, and compare its performance with the baseline model defined above **using the same architecture** as above.
    <br /><br />
**2.3.3** - You have already done sentiment analysis using `tf.keras.layers.Embedding()`. You will now aim to beat that baseline using your ELMo embeddings.  

Using ELMo embeddings:

You will build another `tf.data.Dataset` similar to the one created in Section 2.2.1, but instead of having a series of tokens as the target, the target should only be a class (positive or negative sentiment). This model also requires two inputs: one for the forward and one for the backward LSTM.

For this model, you will use:
- `Toy_ELMo` model after the input layer
-  Sauce layer
-  2 layers of `GRU` with `hidden_size=300`
-  Dense output layer
    
**NOTE**: Set `Toy_ELMo.trainable` to `False` to avoid retraining the model.
        
Create the **sauce** layer to combine the three embeddings from your `Toy_ELMo`. You should have **three** trainable parameters in this layer.

$$ELMo_{t} = \gamma \sum_{j=0}^{L} s_{j} h_{t}^{j}$$
    
Since we are not using any other embeddings, we will set the value of $\gamma$ to 1. <br>
Train the modified model sufficiently, and compare it to the previously trained models.
</details>
</div>

### **PART 2: Solutions**    

<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">



#### **2.1 [15 points] Preprocess the dataset**
</div>


**2.1.1** - Load the dataset

For simplicity, we will use the training split of the IMDB dataset.
- Limit to the most frequent 5000 words.
- Do not skip any frequently occurring words.
- Limit the largest review to a maximum of 200 words only.
    
    
**NOTE**: You can use `imdb.get_word_index()` to get the mapping between tokens and words. This loads a dictionary with the mappings, which has to be corrected. Helper code is provided below to fix the dictionary.
    
Read more about `tf.keras.datasets.imdb` [here](https://www.tensorflow.org/api_docs/python/tf/keras/datasets/imdb/load_data?version=nightly).    
    
To each review, you must add an end-of-sentence token \<\/s>.

E.g. Review: "\<s\> This movie is so bad, I had to leave early"
    
Modified review: "\<s\> This movie is so bad, I had to leave early \<\/s\>"
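
As a minimal sketch of this step, assume each review is a list of integer token ids and that `</s>` is mapped to id 3, as in the helper code in this section. The function name `add_end_token` and the toy reviews are hypothetical.

```python
END_TOKEN_ID = 3  # assumption: '</s>' is mapped to index 3, as in the helper code


def add_end_token(reviews, end_id=END_TOKEN_ID):
    """Return a copy of each review with the end-of-sentence id appended."""
    return [list(review) + [end_id] for review in reviews]


# Toy reviews that already start with the <s> id (1)
toy_reviews = [[1, 14, 20], [1, 7]]
print(add_end_token(toy_reviews))  # [[1, 14, 20, 3], [1, 7, 3]]
```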

In [None]:
# Your code here

In [None]:
# Your code here
### Helper code to fix the mapping of the imdb word index

index = tf.keras.datasets.imdb.get_word_index()

# we need to add 3 to the indices because 0 is 'padding', 1 is 'start of sequence' and 2 is 'unknown'

inv_index = {j+3:i for i,j in index.items()}

# Tags for start and end of sentence

inv_index[1] = '<s>'
inv_index[2] = 'UNK'
inv_index[3] = '</s>'

index = {j:i for i,j in inv_index.items()}
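
To see what the corrected mapping does, the sketch below repeats the same shift on a tiny hand-made dictionary and decodes a padded review. `toy_word_index` is a hypothetical stand-in for the dictionary returned by `get_word_index()`, and `decode` is an illustrative helper rather than part of the assignment.

```python
# Toy stand-in for the raw word index: word -> integer index
toy_word_index = {"the": 1, "movie": 17, "bad": 75}

# Shift every index by 3 to make room for the reserved ids
toy_inv_index = {idx + 3: word for word, idx in toy_word_index.items()}
toy_inv_index[1] = "<s>"
toy_inv_index[2] = "UNK"
toy_inv_index[3] = "</s>"


def decode(token_ids, inv_index):
    """Map token ids back to words; id 0 (padding) is skipped."""
    return " ".join(inv_index.get(t, "UNK") for t in token_ids if t != 0)


print(decode([1, 4, 20, 78, 3, 0, 0], toy_inv_index))
# <s> the movie bad </s>
```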

<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">
    
**2.1.2** - Load the `word2vec` embeddings
    
You will use the pre-trained `word2vec` embeddings for this section of the homework. This file can be downloaded from [here](https://github.com/dlops-io/datasets/releases/download/v1.0/GoogleNews-vectors-negative300.bin.gz) and can be accessed like a dictionary.
    
**NOTE**: To access the pre-trained embeddings, [gensim](https://pypi.org/project/gensim/) library can be used.

Check if the embeddings for the start token (\<s\>) and end token (\</s\>) are present in the loaded `word2vec` embeddings.

Add the \<s\> and \</s\> tokens to the `word2vec` embeddings as random vectors if they are not present.

Create an `embedding_matrix` that will consist of the words present in word2vec. It will be a matrix of dimension `num_words X embedding_dim` `(5000 X 300)` including the addition of start and end tokens.
<br />

</div>
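
One possible way to assemble such a matrix is sketched below with NumPy on toy data. The dictionaries `toy_vectors` and `toy_inv_index` are hypothetical stand-ins for the loaded `word2vec` vectors and the id-to-word mapping, the embedding size is reduced to 4, and leaving all-zero rows for ids without a vector is an assumption of this sketch rather than a requirement.

```python
import numpy as np

embedding_dim = 4  # toy size; the homework uses 300
rng = np.random.default_rng(0)

# Hypothetical stand-in for the loaded word2vec vectors (word -> vector)
toy_vectors = {
    "movie": rng.normal(size=embedding_dim),
    "good": rng.normal(size=embedding_dim),
}

# Add the sentence-boundary tokens as random vectors if they are missing
for token in ("<s>", "</s>"):
    if token not in toy_vectors:
        toy_vectors[token] = rng.normal(size=embedding_dim)

# Toy vocabulary: token id -> word (id 0 is reserved for padding)
toy_inv_index = {1: "<s>", 3: "</s>", 4: "movie", 5: "good", 6: "and"}

# One row per token id; rows without a vector stay at zero
num_words = max(toy_inv_index) + 1
embedding_matrix = np.zeros((num_words, embedding_dim))
for token_id, word in toy_inv_index.items():
    if word in toy_vectors:
        embedding_matrix[token_id] = toy_vectors[word]

print(embedding_matrix.shape)  # (7, 4)
```

Indexing the rows by token id means the matrix can be used directly as a lookup table for the tokenised reviews.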

In [None]:
# Obtaining Word2Vec embeddings using gensim library
import gensim.downloader as api

model = api.load("word2vec-google-news-300")

<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">
        
    
**2.1.3** - Prepare data for model training:

- Not all the words in the reviews are present in the embeddings file. Hence, if it is not present, you must OMIT that word from the sentence.
   
E.g. If `and` token is not present in the embeddings:
   
```
OLD SENTENCE: <s> The movie was good and I really liked it </s>
    
NEW SENTENCE: <s> The movie was good I really liked it </s>
    
```

- Split the data (`x_train` (tokens list), `y_train` (class list)) into 80% training and 20% validation. We will use the `y_train` (which is the sentiment associated with each movie review) only in Part 2.3 for sentiment analysis.
</div>
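
The two steps above can be sketched in plain Python on toy data. The helper names `drop_missing` and `train_val_split`, the set `known_ids`, and the toy reviews are all hypothetical. A library splitter would work equally well, and the fixed seed is only there to make the split repeatable.

```python
import random


def drop_missing(tokens, known_ids):
    """Keep only the token ids that have an embedding."""
    return [t for t in tokens if t in known_ids]


def train_val_split(items, labels, val_fraction=0.2, seed=0):
    """Shuffle the indices once, then cut them into train and validation parts."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_val = int(len(indices) * val_fraction)
    val_idx, train_idx = indices[:n_val], indices[n_val:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return pick(items, train_idx), pick(items, val_idx), pick(labels, train_idx), pick(labels, val_idx)


known_ids = {1, 3, 4, 5, 7}                     # hypothetical ids that have a vector
reviews = [[1, 4, 6, 5, 3], [1, 7, 9, 3]] * 5   # 10 toy reviews
labels = [1, 0] * 5

cleaned = [drop_missing(r, known_ids) for r in reviews]
x_tr, x_val, y_tr, y_val = train_val_split(cleaned, labels)
print(cleaned[0], len(x_tr), len(x_val))  # [1, 4, 5, 3] 8 2
```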

In [None]:
# Your code here

<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">
    
#### **2.2 [34 points] Define and train the model**
    
We define an *ELMo-like* language model using bi-directional LSTMs and residual connections  **without** the character CNN to simplify the analysis.
    
We will use the `word2vec` embeddings instead of the character representations of the CNN.
    
For simplicity, we train our *ELMo-like* language model on the IMDB dataset itself. But generally, language models are trained on much larger corpora.  
</div>


**2.2.1** - Build a `tf.data.Dataset` for training, and another one for validation

- Set an appropriate batch size (32, 64, 128, 256, ...). This value will be determined by the GPU you have.

- Set `train_shuffle_buffer_size` and `validation_shuffle_buffer_size` to the length of training data and validation data respectively

- `tf.data.Dataset` is an efficient way to build data pipelines. Instead of preprocessing the entire dataset, we can preprocess a batch. It is faster and consumes fewer resources, which is optimal for training.

- Hint: When creating the `tf.data.Dataset`, use `tf.ragged.constant` to convert your ragged token lists to ragged tensors.

- `dataset.map` enables us to apply a function to each element of the dataset; because we batch before mapping, each element is a whole batch. The parameter `num_parallel_calls` controls how many elements are processed in parallel and can be set to `num_parallel_calls=AUTOTUNE`. You will use the `transform_pad` function to perform model-specific data processing.

- When building the tf.data pipeline use the following order:
  - shuffle
  - batch
  - map
  - prefetch
<br><br>
- After you build your train and validation datasets, use `dataset.take(1)` to view the first batch of data from the training dataset. It is important to verify the data input and output dimensions before modelling.

In [None]:
# Your code here

In [None]:
# Helper Code
batch_size = 64
train_shuffle_buffer_size = len(tokens_list_train)
validation_shuffle_buffer_size = len(tokens_list_val)

# Pads a ragged batch and builds the forward and backward targets
def transform_pad(inputs, output, n):

    # Pad the ragged input batch with zeros at run time
    inputs = inputs.to_tensor(default_value=0, shape=[None, None])

    # Forward target: every token except the first (<s>)
    # Backward target: every token except the last (</s>)
    output_f = output[:,1:]
    output_b = output[:,:-1]


    # Pad the outputs
    output_f = output_f.to_tensor(default_value=0, shape=[None, None])
    output_b = output_b.to_tensor(default_value=0, shape=[None, None])

    return (inputs,n),(output_f, output_b)

# Calculate and store length of each sentence
N_train = [len(n) for n in tokens_list_train]
N_val = [len(n) for n in tokens_list_val]

N_train = tf.constant(N_train, tf.int32)
N_val = tf.constant(N_val, tf.int32)

# Use tensorflow ragged constants to get the ragged version of data
train_processed_x = tf.ragged.constant(tokens_list_train)
validate_processed_x = tf.ragged.constant(tokens_list_val)
train_processed_y = tf.ragged.constant(tokens_list_train)
validate_processed_y = tf.ragged.constant(tokens_list_val)

# Create TF Dataset
train_data = tf.data.Dataset.from_tensor_slices((train_processed_x, train_processed_y,N_train))
validation_data = tf.data.Dataset.from_tensor_slices((validate_processed_x, validate_processed_y,N_val))

#############
# Train data
#############
# Apply all data processing logic
train_data = train_data.shuffle(buffer_size=train_shuffle_buffer_size)
train_data = train_data.batch(batch_size)
train_data = train_data.map(transform_pad, num_parallel_calls=AUTOTUNE)
train_data = train_data.prefetch(AUTOTUNE)

##################
# Validation data
##################
# Apply all data processing logic
#validation_data = validation_data.shuffle(buffer_size=validation_shuffle_buffer_size)
validation_data = validation_data.batch(batch_size)
validation_data = validation_data.map(transform_pad, num_parallel_calls=AUTOTUNE)
validation_data = validation_data.prefetch(AUTOTUNE)

print("train_data",train_data)
print("validation_data",validation_data)

In [None]:
# View some data from tf dataset
for (input_f,n) ,(output_f, output_b) in train_data.take(1):
  print(input_f.shape)
  print(input_f[0])
  print(n.shape)
  print(n[0])
  print("************************")
  print(output_f.shape,output_b.shape)
  print(output_f[0])
  print(output_b[0])

<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">
    
**2.2.2** - Building the language model
    
*Transform Input within the model:*

*   In the forward LSTM, we use the n-th token to predict the (n+1)-th token. Hence we want to discard the last token, the end-of-sentence token `</s>`, from all the sentences. Remember that all the sentences are padded, so `</s>` will not necessarily be the last element of the padded sequence.

   * One way to achieve this is to use a Boolean mask built with [tf.sequence_mask](https://www.tensorflow.org/api_docs/python/tf/sequence_mask), or to use [tf.gather_nd](https://www.tensorflow.org/api_docs/python/tf/gather_nd), which selects specific elements from a tensor based on their indices. Remember that you can combine multiple Boolean masks with logical operations. We also encourage you to come up with your own solutions.
   * Note that after applying a Boolean mask your outputs will be flattened, so you have to reshape them back using the appropriate batch size. Remember that you removed the end-of-sentence token, hence the sequence length will be length-1.

    
![](https://drive.google.com/uc?export=view&id=1f5bPplDGlRUdfii5X1bD2kd20SKRyOSo)
  

* For the backward LSTM layers, the next word prediction task is doing the reverse of what it was doing for the forward LSTM layers. We aim to predict the n-th token with the (n+1)-th token. To achieve this we remove the start of the sentence token `<s>` from all the sentences.
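
The input transformation for both directions can be illustrated in NumPy as below. This sketch takes a slightly different route from the flatten-and-reshape approach: it overwrites `</s>` with the padding id 0 and then drops one column. The comparison `positions < lengths - 1` is the NumPy analogue of `tf.sequence_mask(lengths - 1, max_len)`. The toy batch, the token ids and the variable names are assumptions made for the illustration.

```python
import numpy as np

# Toy padded batch: 1 = <s>, 3 = </s>, 0 = padding
batch = np.array([[1, 4, 5, 3, 0, 0],
                  [1, 6, 7, 8, 9, 3]])
lengths = np.array([4, 6])   # true lengths, including <s> and </s>
max_len = batch.shape[1]
positions = np.arange(max_len)[None, :]

# True for every position before the </s> of each sentence
keep = positions < (lengths[:, None] - 1)

# Forward input: replace </s> by padding, then drop the now-unused last column
forward_in = np.where(keep, batch, 0)[:, :-1]

# Backward input: dropping the first column removes <s> from every row
backward_in = batch[:, 1:]

print(forward_in.tolist())   # [[1, 4, 5, 0, 0], [1, 6, 7, 8, 9]]
print(backward_in.tolist())  # [[4, 5, 3, 0, 0], [6, 7, 8, 9, 3]]
```

Both results have sequence length `max_len - 1`, which matches the shapes of the padded targets.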


The model's inputs should be followed by a `tf.keras.layers.Embedding()` layer. The `Embedding` layer will act as a lookup table to convert token inputs to their corresponding word2vec values. When initialising the `Embedding` layer, set the layer weights with the `embedding_matrix` you had built in the previous question.

Set `trainable=False` and `mask_zero=True` in the `Embedding` layer. The input dimension should be the number of rows in `embedding_matrix` and the output dimension should be your embedding dimension of `300`.
    
Refer to this image from the lecture slides on ELMo.
    
    
![](https://drive.google.com/uc?export=view&id=1fNPnrBR7Wfh_Jci70QaHde1L3nTcs9-D)
    
*go_backwards in backward LSTM layers*

* To predict the words backwards, we will use the `go_backwards` parameter of the TensorFlow LSTM layer. Remember to reverse the output of each backward LSTM layer before using it, inverting the sequence (time) dimension and not the embedding dimension. Refer to the documentation of [tf.keras.layers.LSTM](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM).
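
The difference between inverting the sequence dimension and inverting the feature dimension is shown below on a toy array. NumPy slicing is used here for illustration; in the model, `tf.reverse(x, axis=[1])` would be the corresponding TensorFlow call. The array values are made up.

```python
import numpy as np

# Toy LSTM output with shape (batch=1, time=3, units=2), emitted in reverse time order
backward_out = np.array([[[30, 31], [20, 21], [10, 11]]])

# Correct: flip the time axis (axis 1) so that step 0 lines up with the first token
aligned = backward_out[:, ::-1, :]

# Incorrect: flipping the last axis scrambles the features inside each time step
scrambled = backward_out[:, :, ::-1]

print(aligned[0].tolist())    # [[10, 11], [20, 21], [30, 31]]
print(scrambled[0].tolist())  # [[31, 30], [21, 20], [11, 10]]
```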

Remember to use the **same** *softmax* layer on both the forward and the backward LSTMs.
    
This will give you an `output_f` and `output_b` which you will evaluate with your two targets.     

$$L = -\sum y \cdot \log(\widehat{y}_{right}) -\sum y \cdot \log(\widehat{y}_{left})$$
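
As a worked example of this loss, the NumPy sketch below evaluates both sums for made-up softmax outputs over a three-word vocabulary. The helper `sequence_cross_entropy` and all the numbers are hypothetical, and padding is ignored for simplicity. With integer targets, each sum is a sparse categorical cross-entropy.

```python
import numpy as np


def sequence_cross_entropy(target_ids, probs):
    """Sum of -log p(target) over time steps; probs has shape (time, vocab)."""
    picked = probs[np.arange(len(target_ids)), target_ids]
    return -np.sum(np.log(picked))


# Toy softmax outputs for 2 time steps and a 3-word vocabulary
probs_forward = np.array([[0.5, 0.25, 0.25], [0.25, 0.5, 0.25]])
probs_backward = np.array([[0.25, 0.25, 0.5], [0.5, 0.25, 0.25]])
targets_forward = np.array([0, 1])
targets_backward = np.array([2, 0])

loss_forward = sequence_cross_entropy(targets_forward, probs_forward)
loss_backward = sequence_cross_entropy(targets_backward, probs_backward)
loss = loss_forward + loss_backward
print(round(float(loss), 4))  # 2.7726
```

Every target here has probability 0.5, so the total is $4 \log 2 \approx 2.7726$.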

Use an appropriate loss function and optimizer, and train the network sufficiently.

Finally, plot the training history of the train and validation loss.
<br />
    
**Note**: Use native tensorflow functions like `tf.shape` instead of numpy functions like `np.shape`.
<br />  

</div>    

In [None]:
# Your code here

<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">

#### **2.3 [15 points] Transfer Learning**
    
Once you've sufficiently trained your ELMo embeddings, we can use them for a downstream task such as sentiment analysis.

</div>    

**2.3.1** - Baseline model
    
For the baseline model, you will use:
    
- `tf.keras.layers.Embedding()` layer
-  2 layers of `GRU` with `hidden_size=300`
-  Dense output layer
    
You will build a `tf.data.Dataset` similar to the one created in Section 2.2.1, but instead of having a series of tokens as the target, the target should only be a class (positive or negative sentiment). Unlike in 2.2.1, we only need a single sequence of tokens for this model.

Train it for sufficient epochs using an appropriate optimizer and learning rate.

In [None]:
# Your code here

<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">

**2.3.2** - Directly using pre-trained `word2vec`
    
For this section, use the pre-trained `word2vec` embeddings directly in your model.

You can use the same `tf.data.Dataset` from 2.3.1 for this model.
    
Train, and compare its performance with the baseline model defined above **using the same architecture** as above.
</div>

In [None]:
# Your code here

<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">
    
**2.3.3** You have already done sentiment analysis using `tf.keras.layers.Embedding()`. You will now aim to beat that baseline using your ELMo embeddings.

Using ELMo embeddings:

You will build another `tf.data.Dataset` similar to the one created in Section 2.2.1, but instead of having a series of tokens as the target, the target should only be a class (positive or negative sentiment). This model also requires two inputs: one for the forward and one for the backward LSTM.

For this model, you will use:
- `Toy_ELMo` model after the input layer
-  Sauce layer
-  2 layers of `GRU` with `hidden_size=300`
-  Dense output layer

**NOTE**: Set `Toy_ELMo.trainable` to `False` to avoid retraining the model.
        
Create the **sauce** layer to combine the three embeddings from your `Toy_ELMo`. You should have **three** trainable parameters in this layer.

$$ELMo_{t} = \gamma \sum_{j=0}^{L} s_{j} h_{t}^{j}$$

Since we are not using any other embeddings, we will set the value of $\gamma$ to 1. <br>
Train the modified model sufficiently, and compare it to the previously trained models.

</div>  

In [None]:
# Helper code
# Scale layer for sauce
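class ScaleLayer(tf.keras.layers.Layer):
    """Learn softmax-normalised weights and return the weighted sum of stacked embeddings."""

    def __init__(self, shape, **kwargs):
        super().__init__(**kwargs)
        self.supports_masking = True
        self.shape = shape  # number of embeddings to combine (3 for Toy_ELMo)

    def build(self, input_shape):
        # One trainable scalar per embedding
        sauce_initializer = tf.keras.initializers.Constant(value = [0.4, 0.3, 0.3])
        self.scale = self.add_weight(shape = (self.shape,), initializer = sauce_initializer, trainable = True)

    def call(self, inputs):
        # inputs must have the embeddings stacked along axis 0
        scale_norm = tf.nn.softmax(self.scale)

        return tf.tensordot(scale_norm, inputs, axes=1)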
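
The computation inside `call` can be checked outside TensorFlow with the NumPy sketch below. It assumes the three embeddings share the same shape and are stacked along axis 0 before being combined. The toy sizes, the random inputs and the local `softmax` helper are assumptions made for the illustration.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


batch, time, dim = 2, 5, 4   # toy sizes
rng = np.random.default_rng(0)

# Three hypothetical embeddings of identical shape (batch, time, dim)
h0, h1, h2 = (rng.normal(size=(batch, time, dim)) for _ in range(3))

stacked = np.stack([h0, h1, h2], axis=0)   # shape (3, batch, time, dim)
s = softmax(np.array([0.4, 0.3, 0.3]))     # normalised weights that sum to 1

# Contract the weight axis with the stacking axis: sum_j s_j * h_j
elmo = np.tensordot(s, stacked, axes=1)    # shape (batch, time, dim)
print(elmo.shape)  # (2, 5, 4)
```

Because the softmax is applied to the raw weights, the effective mixing coefficients always sum to 1, which corresponds to $\gamma = 1$ in the formula.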