<a href="https://colab.research.google.com/github/Manisha2297/DataAndBases/blob/master/Homework1_Scaffold.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![](https://github.com/PadmajaVB/UnivAI_AI-3/blob/main/fig/univ.png?raw=1)

# AI-3: Language Models
## Homework 1: Embeddings and Language Models

**AI3 Cohort 1**<br/>
**Univ.AI**<br/>
**Instructor**: Pavlos Protopapas<br />
**Maximum Score**: 100

<hr style="height:2.4pt">

In [1]:
#RUN THIS CELL 
import requests
from IPython.core.display import HTML

In [2]:
# Import necessary libraries
import re
import os
import math
import zipfile
from collections import Counter
import numpy as np
import pandas as pd
import urllib.request
import tensorflow as tf
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from tensorflow.keras.layers import dot
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.layers import Input, Dense, Reshape
from tensorflow.keras.preprocessing.sequence import skipgrams
%matplotlib inline

### INSTRUCTIONS


- This homework is a jupyter notebook. Download and work on it on your local machine.

- This homework should be submitted in pairs.

- Ensure you and your partner together have submitted the homework only once. Multiple submissions of the same work will be penalised and will cost you 2 points.

- Please restart the kernel and run the entire notebook again before you submit.

- Running cells out of order is a common pitfall in Jupyter Notebooks. To make sure your code works restart the kernel and run the whole notebook again before you submit. 

- To submit the homework, either one of you upload the working notebook on edStem and click the submit button on the bottom right corner.

- Submit the homework well before the given deadline. Submissions after the deadline will not be graded.

- We have tried to include all the libraries you may need to do the assignment in the imports statement at the top of this notebook. We strongly suggest that you use those and not others as we may not be familiar with them.

- Comment your code well. This would help the graders in case there is any issue with the notebook while running. It is important to remember that the graders will not troubleshoot your code. 

- Please use .head() when viewing data. Do not submit a notebook that is **excessively long**. 

- In questions that require code to answer, such as "calculate the $R^2$", do not just output the value from a cell. Write a `print()` function that includes a reference to the calculated value, **not hardcoded**. For example: 
```
print(f'The R^2 is {R:.4f}')
```
- Your plots should include clear labels for the $x$ and $y$ axes as well as a descriptive title ("MSE plot" is not a descriptive title; "95 % confidence interval of coefficients of polynomial degree 5" is).

- **Ensure you make appropraite plots for all the questions it is applicable to, regardless of it being explicitly asked for.**

<hr style="height:2pt">

### Names of the people who worked on this homework together
#### Manisha R and Padmaja V Bhagwat

### **DATASET ACCESS**

**Please note that all the datasets used in this homework are available to you on edStem. You will find it in the resources tab (on the top right) next to your lessons tab. Additionally, some datasets have been provided in a form that will allow you to access it directly on google colab by uncommenting and running some cells.**

### **HOMEWORK QUIZ**

**For each part of the homework, there is an associated quiz on edStem. You are required to attempt that after completing each section of this homework. Please note that the quiz is one attempt only.**


![](https://github.com/PadmajaVB/UnivAI_AI-3/blob/main/fig/one_attempt.png?raw=1)

<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">
    
## **PART 1 [35 points]: Language Modelling using ngrams**
<br />    

In the first part of the homework, you are expected to build a language model based on bigrams. You will develop your own sub-word tokenization to analyze dissaster messages from multiple natural disasters dataset. All the sentences are translated into english.


You have been tasked to develop a language model to complete messages that for some reason arrive incomplete to a radio station. Given the delicate situation, you will have to be extra careful. Each word in the sentence convey a lot of information, and improper handling of the data can mean harm to someone. 
    
    
</div>
    

<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">

## **PART 1: Questions**
<br />

### **1.1 [5 points] PREPROCESS THE DATASET**
<br />

**1.1.1** - Read in the dataset `disaster_response_messages_training.csv` and select only the column "message".
<br /><br />

**1.1.2** - Define a function `clean_data` that takes the data frame as input, converts the characters to lower case and removes any special characters that you might consider irrelevant,  adds the start token `<s>` and the end token `</s>` to every sentence (row) in the data frame and returns the processed data frame. 
<br /><br />


**1.1.3** - Split the dataset into train and test sets. The proportion should be 0.95 and 0.05, respectively. You will create the language model based on the train set and validate your results on the test set.
<br /><br />    



    
### **1.2 [8 points] TOKENIZE AND COUNT**
<br />
In this section, you will create three different tokenizers that you will build LM based on. The tokenization functions must divide the text into tokens, count their frequency and return a dictionary with a mapping of token to number.
    
**1.2.1** - Create your own tokenization function ('tokenizer_1') based on whitespace. Set the vocabulary size to 1000, including the `<UNK>` token for out of the vocabulary (OOV) words. 
<br /><br />

**1.2.2** - Create a second tokenization function ('tokenizer_2') based on whitespace, but do not limit the vocabulary size.
<br /><br />

**1.2.3** - Create a third tokenization function ('tokenizer_3') based on sub-words. You have to define a set of common sub-words in the English language, for example, the subtokens _ing_ and _n't_.
    
In this example, the sentence "_It is raining outside_" would be tokenized as [_It_, _is_, _rain_, _ing_, _outside_ ].
<br /><br />
    
    
    
### **1.3 [6 points] CONSTRUCTING BIGRAMS**
<br />

**1.3.1** - Using each of the tokenizer functions you created, split each sentence into tokens in their numerical representation. 
<br /><br />

**1.3.2** - Count the bigrams in the dataset for each tokenizer and divide them by the total number of bigrams. This will give you the probability of each bigram.
<br /><br />
    
    
    
### **1.4 [8 points] PREDICTING THE NEXT WORDS**
<br />

**1.4.1** - Simulate the incomplete messages dividing each sentence of the **test** set into two. For this, split each sentence in a 3:1 ratio. The first $75\%$ of a sentence will represent the correct message, and the last $25\%$ will convey the missing information. You will not give this 25% to your model, it is kept hidden. This 25% will only be used to evaluate the predictions of your language model.
    
For example in the sentence: *"I will go out on a vacation, now that my semester ended."*

The first 75% will be *"I will go out on a vacation, now"*

The last 25% will be *"that my semeter ended"*

Your aim is to predict the last part by giving your model the first "part" of the sentence.


Note that in an n-gram language model, only the last $n-1$ words are used to make a prediction. For example, for the above sentence, if you are using bigrams, the input to your model would only be "now" and you are expected to predict "that".
    
<br /><br />    
    
**1.4.2** - Given 5 sentences from the previous question (test set), predict the next word. 
Append this predicted word to the input sequence and predict the next one. Repeat this process until you reach the 10th token or the end of a sentence. Compare your results qualitatively with the original sentences. Do the results make sense wrt the context and semantics?

Repeat this for all the models built using different tokenization techniques.
<br /><br />

**1.4.3** - Repeat the same exercise, for all 3 models, but this time, the next token will be sampled from a distribution given by the bigram frequency. Compare and comment on the results?


*Hint:* In a model of two bigrams with frequencies 0.7 and 0.3, a deterministic prediction will only predict the first bigram. Sampling from a distribution, will enable the model to predict the second bigram with a probability of 0.3. In this way we can still predict infrequent tokens. 
<br /><br />
    

### **1.5 [5 points] EVALUATE THE LANGUAGE MODELS**
<br />

    
**1.5.1** - For each of the models built using different tokenization techniques, compute the average perplexity in the test set (part 1.1.3). Perform smoothing on the bigram models. Based on the perplexity, which model is better?
<br /><br />

**1.5.2** - Given the perplexities, which model do you think is better? Why do you think so? Does this reflect the quality of the prediction as seen in part 1.4? 
 What is the effect of UNK words?

<br /><br />

### **1.6 [3 points] HOMEWORK QUIZ**
<br />
After attempting this part of the homework, answer the questions on edStem. All the questions depend on this part of the homework and you will not be able to answer them without attempting this part.

<br /><br />

</div>

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">
    
## **PART 1: Solutions**
    
### **1.1 [5 points] PREPROCESS THE DATASET**
<br />

**1.1.1** - Read in the dataset `disaster_response_messages_training.csv` and select only the column "message".
<br /><br />

    
</div>

In [4]:
# Your code here
raw_data = pd.read_csv('/content/drive/MyDrive/UnivAI/Univ AI 3/Homework 1 - Part 1 Dataset.csv')
raw_data = raw_data[['message']]

  interactivity=interactivity, compiler=compiler, result=result)


In [5]:
raw_data.head()

Unnamed: 0,message
0,Weather update - a cold front from Cuba that c...
1,Is the Hurricane over or is it not over
2,"says: west side of Haiti, rest of the country ..."
3,Information about the National Palace-
4,Storm at sacred heart of jesus


<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">
    
**1.1.2** - Define a function `clean_data` that takes the data frame as input, converts the characters to lower case and removes any special characters that you might consider irrelevant,  adds the start token `<s>` and the end token `</s>` to every sentence (row) in the data frame and returns the processed data frame. 
    
</div>

In [6]:
def preprocess_text(text):
  clean_text = text.lower()
  # remove special characters - basically anything that is not a letter or a space
  clean_text = re.sub(r'[^a-z0-9\s]+','', clean_text)
  clean_text = '<s> '+clean_text+' </s>'
  return clean_text

In [7]:
# Your code here
def clean_data(dataframe):
  dataframe['message'] = dataframe['message'].apply(preprocess_text) 
  return dataframe

In [8]:
# clean the data
df = clean_data(raw_data)

In [9]:
df.head()

Unnamed: 0,message
0,<s> weather update a cold front from cuba tha...
1,<s> is the hurricane over or is it not over </s>
2,<s> says west side of haiti rest of the countr...
3,<s> information about the national palace </s>
4,<s> storm at sacred heart of jesus </s>


<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">

**1.1.3** - Split the dataset into train and test sets. The proportion should be 0.95 and 0.05, respectively. You will create the language model based on the train set and validate your results on the test set.

</div>

In [10]:
# Your code here
index = np.unique(df.index)
train_index, val_index = train_test_split(index, train_size=0.95, random_state=66)

df_train = df.loc[train_index]
df_val = df.loc[val_index]

<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">

### **1.2 [8 points] TOKENIZE AND COUNT**
<br />
In this section, you will create three different tokenizers and build an LM based on each one of them. The tokenization functions must divide the text into tokens, count their frequency and return a dictionary with a mapping of token to number.

    
</div>

In [11]:
# Your code here

<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">

**1.2.1** - Create your own tokenization function ('tokenizer_1') based on whitespace. Set the vocabulary size to 1000, including the `<UNK>` token for out of the vocabulary (OOV) words. 
<br /><br />
    
</div> 

# New Section

In [12]:
# Fill in to complete this function 
def tokenizer_1(text_corpus, vocabulary_size):
    """Process raw inputs into a dataset."""
    count = [['UNK', -1]]
    count.extend(Counter(text_corpus.split()).most_common(vocabulary_size-1))
    
    dictionary={}
    # For all words in count, assign a token (you can use a for loop) 
    for i, tup in enumerate(count):
        dictionary[tup[0]] = i
        
    # Make a new list of tokens associated with words    
    data = []
    # Initialize a counter for 'UNK' values 
    unk_count = 0
    
    # For all words in corpus, find the associated token, and append to 
    # the 'data' variable defined above
    for word in text_corpus:
        if word in dictionary:
            token = dictionary[word]
        # If word is not in dictionary, it is 'out of vocabulary'
        # So we need to assign it the zero token and
        # update the count of the 'UNK' token
        else:
            token = 0  
            unk_count += 1
            
        # Append token to data 
        data.append(token)
        
    # We can now set the count of 'UNK' tokens in the corpus
    count[0][1] = unk_count
    
    # A reverse dictionary takes you from tokens to words
    # Eg. if dictionary['Ignacio'] == 44
    # reverse_dictionary[44] == 'Ignacio'
    reversed_dictionary = {v:k for k,v in dictionary.items()}
    
    return data, count, dictionary, reversed_dictionary

<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">

**1.2.2** - Create a second tokenization function ('tokenizer_2') based on whitespace, but do not limit the vocabulary size.
<br /><br />
    
</div>

In [13]:
# Fill in to complete this function 
def tokenizer_2(text_corpus):
    """Process raw inputs into a dataset."""
    count = []
    count.extend(Counter(text_corpus.split()).most_common())
    
    dictionary={}
    for i, tup in enumerate(count):
        dictionary[tup[0]] = i
        
    # Make a new list of tokens associated with words    
    data = []
    for word in text_corpus:
        if word in dictionary:
            token = dictionary[word]
        else:
            token = 0  
            
        # Append token to data 
        data.append(token)

    dictionary['UNK']=len(text_corpus.split())
    reversed_dictionary = {v:k for k,v in dictionary.items()}
    
    return data, count, dictionary, reversed_dictionary

<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">

**1.2.3** - Create a third tokenization function ('tokenizer_3') based on sub-words. You have to define a set of common sub-words in the English language, for example, the subtokens _ing_ and _n't_.
    
In this example, the sentence "_It is raining outside_" would be tokenized as [_It_, _is_, _rain_, _ing_, _outside_ ].
<br /><br />
    
</div>

In [14]:
def sub_word_tokenizer(text_corpus):
  common_prefix = [ 'pre', 'un', 'under', 'over', 'post', 'ir', 'in', 'sub']
  common_suffix = ['ing','ly','ed', 'n\'t','er','est', 'es', 'ful']

  word_seq = text_corpus.split()
  results = []
  for word in word_seq:
    filtered_prefix = list(filter(word.startswith, common_prefix))
    filtered_suffix = list(filter(word.endswith, common_suffix))
    if len(filtered_prefix)>0:
      results.extend(re.split(rf"^({filtered_prefix[0]})",word))
    if len(filtered_suffix)>0:
      results.extend(re.split(rf"({filtered_suffix[0]})$",word))
    if len(filtered_prefix)==0 and len(filtered_suffix)==0:
      results.append(word)
  
  subword_tokens = list(filter(None, results))
  return subword_tokens

In [15]:
# Your code here
def tokenizer_3(text_corpus):
  subword_tokens = sub_word_tokenizer(text_corpus)
  count = []
  count.extend(Counter(subword_tokens).most_common())

  dictionary={}
  for i, tup in enumerate(count):
      dictionary[tup[0]] = i
      
  # Make a new list of tokens associated with words    
  data = []
  for word in text_corpus:
      if word in dictionary:
          token = dictionary[word]
      else:
          token = 0  
          
      # Append token to data 
      data.append(token)
      
  dictionary['UNK']=len(text_corpus.split())
  reversed_dictionary = {v:k for k,v in dictionary.items()}

  return data, count, dictionary, reversed_dictionary



<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">

### **1.3 [6 points] CONSTRUCTING BIGRAMS**
<br />

**1.3.1** - Using each of the tokenizer functions you created, split each sentence into tokens in their numerical representation. 
<br /><br />
    
</div>

In [16]:
def get_tok_sequence(text, vocab_dict):
  seq=[]
  for word in text.split():
    if word in vocab_dict.keys():
      seq.append(vocab_dict[word])
    else:
      seq.append(vocab_dict['UNK'])
  return seq

In [17]:
text_corpus = ""
for text in  df['message'].values:
  text_corpus = text_corpus + " "+ text

In [19]:
data, count, dictionary, reversed_dictionary  = tokenizer_1(text_corpus,1000)

In [20]:
df_train_tok1 = pd.DataFrame(df_train['message'].apply(get_tok_sequence, vocab_dict=dictionary), columns=['message'])

In [21]:
df_train_tok1.head()

Unnamed: 0,message
20231,"[2, 0, 0, 686, 0, 346, 164, 988, 24, 0, 0, 308..."
8658,"[2, 65, 120, 71, 6, 77, 9, 0, 0, 0, 0, 28, 11,..."
8449,"[2, 747, 39, 0, 0, 248, 1, 0, 84, 7, 1, 0, 147..."
8551,"[2, 25, 17, 442, 0, 0, 55, 24, 0, 3]"
18220,"[2, 0, 589, 30, 33, 0, 6, 7, 0, 0, 273, 134, 1..."


In [30]:
count

[['UNK', 2190154],
 ('the', 26098),
 ('<s>', 21046),
 ('</s>', 21046),
 ('and', 14737),
 ('to', 14106),
 ('of', 13494),
 ('in', 12879),
 ('a', 7873),
 ('i', 5679),
 ('for', 5289),
 ('is', 4489),
 ('are', 3893),
 ('have', 3859),
 ('we', 3808),
 ('on', 3311),
 ('that', 3265),
 ('with', 2821),
 ('you', 2431),
 ('by', 2398),
 ('people', 2351),
 ('as', 2350),
 ('water', 2266),
 ('from', 2238),
 ('food', 2168),
 ('help', 2083),
 ('at', 2022),
 ('has', 1948),
 ('this', 1930),
 ('it', 1764),
 ('can', 1763),
 ('need', 1690),
 ('will', 1681),
 ('be', 1668),
 ('please', 1573),
 ('me', 1540),
 ('not', 1534),
 ('an', 1439),
 ('my', 1417),
 ('earthquake', 1404),
 ('been', 1375),
 ('were', 1316),
 ('was', 1302),
 ('they', 1280),
 ('us', 1221),
 ('like', 1220),
 ('would', 1214),
 ('do', 1115),
 ('some', 1090),
 ('which', 1087),
 ('what', 1078),
 ('there', 1067),
 ('said', 1059),
 ('all', 1047),
 ('more', 1046),
 ('or', 1021),
 ('their', 1020),
 ('but', 969),
 ('who', 969),
 ('its', 947),
 ('about', 92

In [22]:
data_2, count_2, dictionary_2, reversed_dictionary_2  = tokenizer_2(text_corpus)
df_train_tok2 = pd.DataFrame(df_train['message'].apply(get_tok_sequence, vocab_dict=dictionary_2), columns=['message'])

In [23]:
df_train_tok2.head()

Unnamed: 0,message
20231,"[1, 15252, 5112, 685, 4982, 345, 163, 987, 23,..."
8658,"[1, 64, 119, 70, 5, 76, 8, 2231, 4631, 20250, ..."
8449,"[1, 746, 38, 6019, 7779, 247, 0, 2623, 83, 6, ..."
8551,"[1, 24, 16, 441, 1708, 1140, 54, 23, 1580, 2]"
18220,"[1, 32229, 588, 29, 32, 15022, 5, 6, 10548, 12..."


In [24]:
def get_subword_tok_sequence(text, vocab_dict):
  subword_tokens = sub_word_tokenizer(text)
  seq=[]
  for word in subword_tokens:
    if word in vocab_dict.keys():
      seq.append(vocab_dict[word])
    else:
      seq.append(vocab_dict['UNK'])
  return seq

In [25]:
data_3, count_3, dictionary_3, reversed_dictionary_3 = tokenizer_3(text_corpus)
df_train_tok3 = pd.DataFrame(df_train['message'].apply(get_subword_tok_sequence, vocab_dict=dictionary_3), columns=['message'])

In [26]:
df_train_tok3.head()

Unnamed: 0,message
20231,"[1, 13799, 4757, 8, 525, 5, 3597, 5, 384, 186,..."
8658,"[1, 73, 138, 10, 80, 9, 88, 12, 1748, 14, 837,..."
8449,"[1, 813, 46, 4051, 4, 7145, 278, 0, 368, 54, 9..."
8551,"[1, 28, 21, 218, 5, 43, 1733, 873, 5, 1183, 62..."
18220,"[1, 13586, 4, 643, 37, 32, 7910, 4, 9, 3, 9562..."


<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">

**1.3.2** - Count the bigrams in the dataset for each tokenizer and divide them by the total number of bigrams. This will give you the probability of each bigram.
<br /><br />
    
</div>

In [59]:
# Your code here
def get_bigrams(df):
  bigrams = []
  for seq in df['message'].values:
    bigrams.extend([(seq[i],seq[i+1]) for i in range(1,len(seq[1:-1]))])
  bigrams_count_dict = Counter(bigrams)
  return bigrams_count_dict


In [60]:
# def get_bigram_prob(bigrams_count_dict):
#   bigram_prob_dict = {}
#   for k, v in bigrams_count_dict.items():
#     total = 0
#     for key, value in bigrams_count_dict.items():
#       if key[0] == k[0]:
#         total+=1
#     bigram_prob_dict[k] = v/total
#   return bigram_prob_dict 


In [61]:
def get_bigram_prob(bigrams_count_dict):
  first_gram_count = {}
  for k, v in bigrams_count_dict.items():
    if k[0] in first_gram_count.keys():
      first_gram_count[k[0]] += v
    else:
      first_gram_count[k[0]] = v

  bigram_prob_dict = {k: v / first_gram_count[k[0]] for k, v in bigrams_count_dict.items()}
  return bigram_prob_dict

In [62]:
# Bigram probability for tokenizer1  
bigrams_count_dict_tok1 = get_bigrams(df_train_tok1)
bigram_prob_dict_tok1 = get_bigram_prob(bigrams_count_dict_tok1)

In [70]:
# Bigram probability for tokenizer2  
bigrams_count_dict_tok2 = get_bigrams(df_train_tok2)
bigram_prob_dict_tok2 = get_bigram_prob(bigrams_count_dict_tok2)

In [71]:
# Bigram probability for tokenizer3  
bigrams_count_dict_tok3 = get_bigrams(df_train_tok3)
bigram_prob_dict_tok3 = get_bigram_prob(bigrams_count_dict_tok3)

<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">

### **1.4 [8 points] PREDICTING THE NEXT WORDS**
<br />

**1.4.1** - Simulate the incomplete messages dividing each sentence of the **test** set into two. For this, split each sentence in a 3:1 ratio. The first $75\%$ of a sentence will represent the correct message, and the last $25\%$ will convey the missing information. You will not give this 25% to your model, it is kept hidden. This 25% will only be used to evaluate the predictions of your language model.
    
For example in the sentence: *"I will go out on a vacation, now that my semester ended."*

The first 75% will be *"I will go out on a vacation, now"*

The last 25% will be *"that my semeter ended"*

Your aim is to predict the last part by giving your model the first "part" of the sentence.


Note that in an n-gram language model, only the last $n-1$ words are used to make a prediction. For example, for the above sentence, if you are using bigrams, the input to your model would only be "now" and you are expected to predict "that".
    
<br /><br />   
    
</div>

In [72]:
# Your code here
def split_sent(word_seq):
  msg_len = round(0.75*len(word_seq))
  msg = word_seq[:msg_len]
  missing_info = word_seq[msg_len:]
  return msg, missing_info

In [73]:
# get word sequence based on tokenizer_1 for test set 
df_val_tok1 = pd.DataFrame(df_val['message'].apply(get_tok_sequence, vocab_dict=dictionary), columns=['message'])

# get word sequence based on tokenizer_2 for test set 
df_val_tok2 = pd.DataFrame(df_val['message'].apply(get_tok_sequence, vocab_dict=dictionary_2), columns=['message'])

# get word sequence based on tokenizer_3 for test set 
df_val_tok3 = pd.DataFrame(df_val['message'].apply(get_subword_tok_sequence, vocab_dict=dictionary_3), columns=['message'])

In [74]:
# split the msg in 3:1 ratio
def split_sent_df(df):
  message, missing_info = [],[]
  for text_seq in df['message'].values:
    x, y = split_sent(text_seq)
    message.append(x)
    missing_info.append(y)

  df_split = pd.DataFrame({'message':message, 'missing_info':missing_info})
  return df_split
  

In [75]:
df_val_tok1_split = split_sent_df(df_val_tok1)
df_val_tok2_split = split_sent_df(df_val_tok2)
df_val_tok3_split = split_sent_df(df_val_tok3)

In [76]:
df_val_tok3_split.head()

Unnamed: 0,message,missing_info
0,"[1, 172, 0, 1748, 993, 1679, 29, 23905, 51, 35...","[0, 714, 8, 7, 621, 10, 3997, 63, 1841, 2]"
1,"[1, 69, 749, 35, 22, 17039, 195, 339]","[41, 20, 2]"
2,"[1, 69, 99, 550, 5, 13, 11, 202, 56, 1199]","[37, 12, 155, 2]"
3,"[1, 230, 5, 7, 423, 616, 8, 73, 61, 97, 1145, ...","[6, 29337, 29338, 29339, 6, 29340, 2]"
4,"[1, 283, 508, 740, 33, 796, 4, 12068, 3178, 9,...","[892, 12069, 518, 5166, 20, 3, 500, 23972, 478..."


<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">


**1.4.2** - Given 5 sentences from the previous question (test set), predict the next word. 
Append this predicted word to the input sequence and predict the next one. Repeat this process until you reach the 10th token or the end of a sentence. Compare your results qualitatively with the original sentences. Do the results make sense wrt the context and semantics?

Repeat this for all the models built using different tokenization techniques.
<br /><br />
    
</div>

In [77]:
def predict_missing_word_unigram(msg, missing_info, dictionary, reversed_dictionary):
  missing_text_actual = ""
  for seq in missing_info[:min(10, len(missing_info))]:
    missing_text_actual += reversed_dictionary[seq] + " "

  missing_text_predicted = ""
  for j in range(min(10, len(missing_info))):
    next_word_seq = dictionary[reversed_dictionary[0]]
    missing_text_predicted += reversed_dictionary[next_word_seq] + " "

  return missing_text_actual, missing_text_predicted

In [78]:
def print_predicted_seq(df, dictionary, reversed_dictionary):
  for i in range(5):
    actual, pred = predict_missing_word_unigram(df['message'][i], df['missing_info'][i], dictionary, reversed_dictionary)
    print("==== Sentence {} ====".format(i+1))
    print("Actual: ",actual)
    print("Predicted: ",pred)
    print()

In [79]:
print_predicted_seq(df_val_tok1_split, dictionary, reversed_dictionary)

==== Sentence 1 ====
Actual:  the lines to better UNK their UNK </s> 
Predicted:  UNK UNK UNK UNK UNK UNK UNK UNK 

==== Sentence 2 ====
Actual:  that </s> 
Predicted:  UNK UNK 

==== Sentence 3 ====
Actual:  i call </s> 
Predicted:  UNK UNK UNK 

==== Sentence 4 ====
Actual:  UNK UNK UNK and UNK </s> 
Predicted:  UNK UNK UNK UNK UNK UNK 

==== Sentence 5 ====
Actual:  UNK transport UNK that include UNK UNK UNK and UNK 
Predicted:  UNK UNK UNK UNK UNK UNK UNK UNK UNK UNK 



In [80]:
print_predicted_seq(df_val_tok2_split, dictionary_2, reversed_dictionary_2)

==== Sentence 1 ====
Actual:  the lines to better reflect their status </s> 
Predicted:  the the the the the the the the 

==== Sentence 2 ====
Actual:  that </s> 
Predicted:  the the 

==== Sentence 3 ====
Actual:  i call </s> 
Predicted:  the the the 

==== Sentence 4 ====
Actual:  ruhengeri nkamira mkwero and mudende </s> 
Predicted:  the the the the the the 

==== Sentence 5 ====
Actual:  paf transport aircrafts that include y12 c130 casa and mi17 
Predicted:  the the the the the the the the the the 



In [81]:
print_predicted_seq(df_val_tok3_split, dictionary_3, reversed_dictionary_3)

==== Sentence 1 ====
Actual:  the lin es to bett er reflect their status </s> 
Predicted:  the the the the the the the the the the 

==== Sentence 2 ====
Actual:  me that </s> 
Predicted:  the the the 

==== Sentence 3 ====
Actual:  can i call </s> 
Predicted:  the the the the 

==== Sentence 4 ====
Actual:  and ruhengeri nkamira mkwero and mudende </s> 
Predicted:  the the the the the the the 

==== Sentence 5 ====
Actual:  various paf transport aircrafts that in clude y12 c130 casa 
Predicted:  the the the the the the the the the the 



**We can clearly see that the perdicted word doesn't make sense wrt the context and semantics, since its always predicting the most frequent word.**

<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">

**1.4.3** - Repeat the same exercise, for all 3 models, but this time, the next token will be sampled from a distribution given by the bigram frequency. Compare and comment on the results?


*Hint:* In a model of two bigrams with frequencies 0.7 and 0.3, a deterministic prediction will only predict the first bigram. Sampling from a distribution, will enable the model to predict the second bigram with a probability of 0.3. In this way we can still predict infrequent tokens. 
<br /><br />
    
</div>

In [82]:
# Your code here
def predict_next_word(msg, missing_info, dictionary, reversed_dictionary, probability_dict):
  missing_text_actual = ""
  for seq in missing_info[:min(10, len(missing_info))]:
    missing_text_actual += reversed_dictionary[seq] + " "

  missing_text_predicted=''
  message = msg
  for i in range(min(10, len(missing_info))):
    last_word = message[-1]
    probability_list = [v for k,v in probability_dict if k[0] == last_word]
    next_word_list = [k[1] for k,v in probability_dict if k[0] == last_word]

    next_word = np.random.choice(next_word_list, p=probability_list)
    message.append(next_word)
    missing_text_predicted += reversed_dictionary[next_word] + " "

  return missing_text_actual, missing_text_predicted




In [83]:
def print_predicted_seq(df, dictionary, reversed_dictionary, probability_dict):
  for i in range(5):
    actual, pred = predict_next_word(df['message'][i], df['missing_info'][i], dictionary, reversed_dictionary, probability_dict)
    print("==== Sentence {} ====".format(i+1))
    print("Actual: ",actual)
    print("Predicted: ",pred)
    print()

In [84]:
print_predicted_seq(df_val_tok3_split, dictionary_3, reversed_dictionary_3, bigram_prob_dict_tok3)

TypeError: ignored

<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">

### **1.5 [5 points] EVALUATE THE LANGUAGE MODELS**
<br />

    
**1.5.1** - For each of the models built using different tokenization techniques, compute the average perplexity in the test set (part 1.1.3). Perform smoothing on the bigram models. Based on the perplexity, which model is better?
<br /><br />
    
</div>

In [None]:
# Your code here

<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">

**1.5.2** - Given the perplexities, which model do you think is better? Why do you think so? Does this reflect the quality of the prediction as seen in part 1.4? 

What is the effect of UNK words?
    
</div>

#### Type your answer here

<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">

### **1.6 [3 points] HOMEWORK QUIZ**
<br />
After attempting this part of the homework, answer the questions on edStem. All the questions depend on this part of the homework and you will not be able to answer it without attempting it.

</div>

#### Answer the questions on EdStem

___
___



<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">

<h1>PART 2 [50 pts]:Word2Vec from scratch</h1>
<br /><br />
[Return to contents](#contents)
<br /><br />

<a id="part2intro"></a>
<h2> Problem Statement </h2>
<br /><br />
[Return to contents](#contents)
<br /><br />

Word2Vec architecture allows us to get *contextual* representations of word tokens.     
<br /><br />
There are several methods to build a word embedding. We will focus on the SGNS architecture. 
![](https://i.ibb.co/FW8Sr54/Screen-Shot-2021-04-27-at-3-27-16-PM.png)    
<br /><br />
In this problem, you are asked to build and analyze a Word2Vec architecture trained on wikipedia articles.

</div>

<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">

## **PART 2 [50 pts]: Word2Vec from scratch** 
<br />

Word2Vec architecture allows us to get *contextual* representations of word tokens.     
<br /><br />
There are several methods to build a word embedding. We will focus on the SGNS architecture. 
<br/>

![](https://i.ibb.co/FW8Sr54/Screen-Shot-2021-04-27-at-3-27-16-PM.png)    
<br />
In this problem, you are asked to build and analyze a Word2Vec architecture trained on wikipedia articles.

<br />


</div>

<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">

## **PART 2 Questions**
<br />
    
### **2.1 [5 points] MODEL PROCESSING**
<br />
    
**2.1.1 Get the data:**    

- Get the data from the `text8.zip` file.
    `text8.zip` is a small, *cleaned* subset of a large corpus of data scraped from wikipedia pages. More details can be found [here](https://paperswithcode.com/sota/language-modelling-on-text8).
    It is usually used to quickly train, or test language models.
    [Read here](http://mattmahoney.net/dc/textdata#:~:text=The%20purpose%20of%20the%20smaller,on%20the%20larger%20data%20set.&text=The%20two%20files%20have%20the,108%20bytes%20of%20fil9.) for more information.
- Split the data by whitespace and print the first 10 words to check if has been correctly loaded.

    **NOTE:** For this part of the homework, all words will be in their lowercase for simplicity of analysis.
<br />    

**2.1.2 Build the dataset**  

- Write a function that takes the `vocabulary_size` and `corpus` as input, and outputs:
    - Tokenized data
    - Count of each token
    - A dictionary that maps words to tokens
    - A dictionary that maps tokens to words.
    You can use the same function used in **Lab 3**, or else you can use [`tf.keras.Tokenizer`](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer) to write a similar function.
- Print the first 10 tokens and reverse them to words to confirm a match to the initial print above.
     
  
Example:  

 `corpus[:10] = ['this','is,'an','example',...]`

`data[:10] = [44,26,24,16,...]`
    
`reversed_data =['this','is,'an','example',...]`

**NOTE**: Choose a sufficiently large vocabulary size. i.e `vocab_size>= 1000`    
<br />
    
**2.1.3 Build skipgrams with negative samples:**  
- Use the `tf.keras.preprocessing.sequence.skipgrams` function to build positive and negative samples for word2vec training. Follow the documentation on how to make the pairs, or see Lab 3 for an example.
- You are free to choose your own `window_size`, but we recommend a value of 3.
- Print 10 pairs of *center* and *context* words with their associated labels.    
    
### Skip-gram Sampling table
A large dataset means larger vocabulary with higher number of more frequent words such as stopwords. Training examples obtained from sampling commonly occurring words (such as *the*, *is*, *on*) don't add much useful information for the model to learn from. [Mikolov et al.](https://papers.nips.cc/paper/2013/file/9aa42b31882ec039965f3c4923ce901b-Paper.pdf) suggest subsampling of frequent words as a helpful practice to improve embedding quality.

The `tf.keras.preprocessing.sequence.skipgrams` function accepts a sampling table argument to encode probabilities of sampling any token. You can use the `tf.keras.preprocessing.sequence.make_sampling_table` to generate a word-frequency rank based probabilistic sampling table and pass it to skipgrams function.    
<br />
    
**2.1.4 Conceptual question** 
    
What is the difference between using a sampling table and not using a sampling table while building the dataset for skipgrams?
(Answer in less than 200 words)
<br /><br />
    
### **2.2 [8 points] BUILDING A WORD2VEC MODEL** 
<br />
    
Build a word2vec model architecture based on the schematic below.
<center>
    <img src="https://i.ibb.co/1LyGb0J/Screen-Shot-2021-04-23-at-10-50-52-AM.png" alt="centered image" width="200"/>
</center>    
    
- To do so, you will need:
    - `tf.keras.layers.Embedding` layer
    - `tf.keras.layers.Dot()`
    - `tf.keras.Model()` which is the functional API
- You can choose an appropriate embedding dimension
- Compile the model using `binary_crossentropy()` function and an appropriate optimizer. 
- Sufficiently train the model.    
It is generally a good practise to save the model weights. To save the model weights using the `model.save_weights()` for analysis of **2.3**. (More information on how to save your weights can be found [here](https://www.tensorflow.org/tutorials/keras/save_and_load))
<br />


### **2.3 [7 points] POST-TRAINING ANALYSIS**

    
This segment involves some simple analysis of your trained embeddings.
<br /><br />
    
**2.3.1 Vector Algebra on Embeddings**

- Assuming you have chosen a sufficiently large `vocab_size`, find the embeddings for:
    
1. King
2. Male
3. Female
4. Queen
    
- Find the vector `v = King - Male + Female` and find it's `cosine_similarity()` with the embedding for 'Queen'.
You can use the `cosine_similarity()` function defind in the exercise from lecture.

**NOTE**: The `cosine_similarity()` value, must be greater than `0.9`; If it is not, this implies that your word2vec embeddings are not well-trained.

- Write a function `most_similar()`, which finds the top-n words most similar to the given word.
- Use this function to find the words most similar to `king`.
    
- **Conceptual Question** Why can't we use `cosine_similarity()` as a `loss_function`?
(Answer in less than 200 words) 
    
<br />
    
**2.3.2 Visualizing Embeddings**

- Find the embeddings for the words:
1. 'Six'
2. 'Seven'
3. 'Eight'
4. 'Nine'
    
- Find the `cosine_similarity()` of 'six' with each of 'seven`,'eight','nine'.
    
- Reset your network (make sure your trained weights are saved), and again compute the `cosine_similarity()` values. The values should be small (because the embeddings are random).
    
- Use a demonstrative plot to show the `before & after training` the 4 embeddings. Here are some suggestions: 
    1. PCA/TSNE for dimensionality reduction
    2. Radar plot to show all embedding dimensions
    
Bonus points for using creative means to demonstrate how the embeddings change after training.

Here is a [video](https://youtu.be/VDl_iA8m8u0) of a sample demonstration. We used a custom callback to get embeddings during training.  
        

<br />
    
**2.3.3 Embedding and Context Matrix**
    
<br />
    
Investigate the relation between the Embedding & Context matrix. Again use the `cosine_similarity()` function to find the average value across all the words in the embedding and context matrix, i.e:

- For a word 'dog', find the embedding value, and context value.
- Calculate the `cosine_similarity()` between the two
- Repeat the same for every word in the vocabulary and calculate the average value of the `cosine_similarity()
 
<br /><br />
    

### **2.4 [5 points] LEARNING PHRASES**
    
As per the original paper by [Mikolov et al]() many phrases have a meaning that is not a simple composition of the meanings of its individual words. 
For eg. `new york` is one entity, however, as per our analysis above, we have two separate entities `new` & `york` which can have different meanings independently.    
To learn vector representation for phrases, we first find words that
appear frequently together, and infrequently in other contexts.
    
As per the analysis in the paper, we can use a formula to rank commonly used word pairs, and take the first 100 commonly occuring pairs.
$$\operatorname{score}\left(w_{i}, w_{j}\right)=\frac{\operatorname{count}\left(w_{i} w_{j}\right)-\delta}{\operatorname{count}\left(w_{i}\right) \times \operatorname{count}\left(w_{j}\right)}$$

**NOTE:** For simplicity of analysis, we take the discounting factor $\delta$ as 0, and take bi-gram combinations. You can experiment with tri-grams for word pairs such as `New_York_Times`.     
<br /><br />

    
**2.4.1 Find 100 most common bi-grams**

- From the tokenized data above, find the count for each bigram pair.
    
- For each such pair, find the score associated with each token pair using the formula above.
    
- Pick the top 100 pairs based on the score (higher is better). To understand the `score()` function we suggest you read the paper mentioned above.
    
- Replace the original `text8` file with the pairs as one entity. For e.g., if `prime,minister` is a commonly occuring pair, replace `... prime minister ...' in the original corpus to a single entity `prime_minister`. Do this for all 100 pairs.
<br /><br />
    
**2.4.2 Retrain word2vec**    
- With the new corpus generated as above, build the dataset, use skipgrams and retrain your word2vec with a sufficiently large vocabulary.
    
- Use the `most_similar()` function defiend above to find the entities most similar to `united_kingdom`.
    
- Compare the above with separate tokens for `united` & `kingdom` and the sum of the vectors (to get this, you may need a sufficiently large vocabulary (>2000).
<br /> <br />

    
### **2.5 [5 points] HOMEWORK QUIZ**
<br />
After attempting this part of the homework, answer the questions on edStem. All the questions depend on this part of the homework and you will not be able to answer them without attempting this part.
    
</div>

<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">

## **PART 2: Solutions**
<br />
    
### **2.1 [5 points] MODEL PROCESSING**
<br />
    
**2.1.1 Get the data:**    

- Get the data from the `text8.zip` file.
    `text8.zip` is a small, *cleaned* subset of a large corpus of data scraped from wikipedia pages.
    It is usually used to quickly train, or test language models.
    [Read here](http://mattmahoney.net/dc/textdata#:~:text=The%20purpose%20of%20the%20smaller,on%20the%20larger%20data%20set.&text=The%20two%20files%20have%20the,108%20bytes%20of%20fil9.) for more information
- Split the data by whitespace print the first 10 words to check if has been correctly loaded.
    
**NOTE** : For this part of the homework, all words will be in their lowercase for simplicity of analysis
<br />    <br />    

    
</div>

#### Helper function to read data

In [None]:
# Helper code to read the data

filename = 'text8.zip'
with zipfile.ZipFile(filename) as f:
# Read the data into a list of strings.
    vocabulary = tf.compat.as_str(f.read(f.namelist()[0])).split()

<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">

**2.1.2 Build the dataset**  

- Write a function that takes the `vocabulary_size` and `corpus` as input, and outputs:
    - Tokenized data
    - count of each token
    - A dictionary that maps words to tokens
    - A dictionary that maps tokens to words
    You can use the same function used in **Lab 3**, or else you can use [`tf.keras.Tokenizer`](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer) to write a similar function.
- Print the first 10 tokens and reverse them to words to confirm a match to the initial print above.
     
  
Eg. `corpus[:10] = ['this','is,'an','example',...]`

`data[:10] = [44,26,24,16,...]`
    
`reversed_data =['this','is,'an','example',...]`

**NOTE**: Choose a sufficiently large vocabulary size. i.e `vocab_size>= 1000`    
<br />
    
</div>

In [None]:
def build_dataset(words, n_words):
    """Process raw inputs into a dataset.""" 
    # Fill in to complete this function 
    count = [['UNK', -1]]
    count.extend(Counter(text_corpus.split()).most_common(vocabulary_size-1))
    
    dictionary={}
    for i, tup in enumerate(count):
        dictionary[tup[0]] = i
           
    data = []
    unk_count = 0
    
    for word in text_corpus:
        if word in dictionary:
            token = dictionary[word]
        else:
            token = 0  
            unk_count += 1
            
        data.append(token)
        
    count[0][1] = unk_count
    
    reversed_dictionary = {v:k for k,v in dictionary.items()}
    
    return data, count, dictionary, reversed_dictionary

In [None]:
def get_data_sequence(text, vocab_dict):
  seq=[]
  for word in text.split():
    if word in vocab_dict.keys():
      seq.append(vocab_dict[word])
    else:
      seq.append(vocab_dict['UNK'])
  return seq

In [None]:
vocab_size = 1000
data, count, dictionary, reverse_dictionary = build_dataset(vocabulary,
                                                                vocab_size)

data_sequence = get_data_sequence(vocabulary[0], dictionary)
print(f'Original sentence is: "{" ".join(vocabulary[0])}"')
print(f'Tokenized sentence is: "{data_sequence}"')
print(f'Tokenized sentence reversed is: {" ".join([reverse_dictionary[i] for i in data_sequence])}')

In [None]:
del vocabulary  # Hint to reduce memory.

<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">

**2.1.3 Build skipgrams with negative samples:**  
- Use the `tf.keras.preprocessing.sequence.skipgrams` function to build positive and negative samples \
    for word2vec training. Follow the documentation on how to make the pairs, or see Lab 3 for an example.
- You are free to choose your own `window_size`, but we recommend a value of 3.
- Print 10 pairs of *center* and *context* words with their associated labels.    
    
#### Skip-gram Sampling table
A large dataset means larger vocabulary with higher number of more frequent words such as stopwords. Training examples obtained from sampling commonly occurring words (such as the, is, on) don't add much useful information for the model to learn from. [Mikolov et al.](https://papers.nips.cc/paper/2013/file/9aa42b31882ec039965f3c4923ce901b-Paper.pdf) suggest subsampling of frequent words as a helpful practice to improve embedding quality.

The `tf.keras.preprocessing.sequence.skipgrams` function accepts a sampling table argument to encode probabilities of sampling any token. You can use the `tf.keras.preprocessing.sequence.make_sampling_table` to generate a word-frequency rank based probabilistic sampling table and pass it to skipgrams function.    

</div>

In [None]:
# Your code here
window_size = 3
couples, labels = skipgrams(data,vocab_size, window_size=window_size)

# Separate the target,context pairs as word_target, word_context 
word_center, word_context = zip(*couples)
print(couples[:10], labels[:10])

<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">

**2.1.4 Conceptual question** 
    
What is the difference between using a sampling table and not using a sampling table while building the dataset for skipgrams?
(Answer in less than 200 words)
    
</div>

#### Type your answer here

<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">

**2.2 [15 points]** **Building a word2vec model:** 

Build a word2vec model architecture based on the schematic below.
<center>
    <img src="https://i.ibb.co/1LyGb0J/Screen-Shot-2021-04-23-at-10-50-52-AM.png" alt="centered image" width="200"/>
</center>    
    
- To do so, you will need:
    - `tf.keras.layers.Embedding` layer
    - `tf.keras.layers.Dot()`
    - `tf.keras.Model()` which is the functional API
- You can choose an appropriate embedding dimension
- Compile the model using `binary_crossentropy()` function and an appropriate optimizer. 
- Sufficiently train the model.
    - Your model will be sufficiently trained when? (Check with Ignacio)    
    
It is generally a good practise to save the model weights. To save the model weights using the `model.save_weights()` for analysis of **2.3**. (More information on how to save your weights can be found [here](https://www.tensorflow.org/tutorials/keras/save_and_load))


    
</div>

In [None]:
# Your code here
embedding_dim = 300

word_model = Sequential()
word_model.add(Input(shape=(1,)))
word_model.add(Embedding(vocab_size, embedding_dim, input_length=1, name="Embedding"))

context_model = Sequential()
context_model.add(Input(shape=(1,)))
context_model.add(Embedding(vocab_size, embedding_dim, input_length=1, name="Embedding"))

dot_product = dot([word_model.output, context_model.output],axes=1,
                  normalize=False,name='dotproduct')

sigmoid_dot_product = Dense(1, activation="signmoid", name="Dense Layer")(dot_product)

model = Model(inputs=[word_model.input, context_model.input], outputs=sigmoid_dot_product, name="Model")

model.summary()

In [None]:
model.compile(loss='bce',optimizer='adam', metrics=['accuracy'])

In [None]:
model.fit(couples, labels, epochs=5)
model.save_weights('/content/drive/MyDrive/UnivAI/Univ AI 3/my_checkpoint')

In [None]:
# Create a new model instance
model = create_model()

# Restore the weights
model.load_weights('/content/drive/MyDrive/UnivAI/Univ AI 3/my_checkpoint')

<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">

### **2.3 [7 points] POST-TRAINING ANALYSIS**
<br />
    
This segment involves some simple analysis of your trained embeddings.
<br />
    
**2.3.1Vector Algebra on Embeddings**

Assuming you have chosen a sufficiently large `vocab_size`, find the embeddings for:
    
1. King
2. Male
3. Female
4. Queen
    
Find the vector `v = King - Male + Female` and find it's `cosine_similarity()` with the embedding for 'Queen'.
You can use the `cosine_similarity()` function defind in session 3 exercise.

**NOTE**:The `cosine_similarity()` value, must be greater than `0.9`; If it is not, this implies that your word2vec embeddings are not well-trained.

Write a function `most_similar()`, which finds the top-n words most similar to the given word.
    - Use this function to find the words most similar to `king`.
    
**Conceptual Question** Why can't we use `cosine_similarity()` as a `loss_function`?
(Answer in less than 200 words) 
    
<br />
    
</div>

In [None]:
# Your code here

<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">


**2.3.2 Visualizing Embeddings**

Find the embeddings for the words:
1. 'Six'
2. 'Seven'
3. 'Eight'
4. 'Nine'
    
Find the `cosine_similarity()` of 'six' with each of 'seven`,'eight','nine' (which should be high values).
    
Reset your network (make sure your trained weights are saved), and again compute the `cosine_similarity()` values. The values should be small (because the embeddings are random).
    
Use a demonstrative plot to show the `before & after training` of the 4 embeddings. Here are some suggestions: 
    1. PCA/TSNE for dimensionality reduction
    2. Radar plot to show all embedding dimensions
    
Bonus points for using creative means to demonstrate how the embeddings change after training.

        

<br />
    
</div>

#### Here is a [video](https://youtu.be/VDl_iA8m8u0) of a sample demonstration. We used a custom callback to get embeddings during training.  

<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">

**2.3.3 Embedding and Context Matrix**
    
<br />
    
Investigate the relation between the Embedding & Context matrix. Again use the `cosine_similarity()` function to find the average value across all the words in the embedding and context matrix, i.e:
    - For a word 'dog', find the embedding value, and context value.
    - Calculate the `cosine_similarity()` between the two
    - Repeat the same for every word in the vocabulary and calculate the average value of the `cosine_similarity()`

<br />
    
</div>

In [None]:
# Your code here

<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">

### **2.4 [5 points] LEARNING PHRASES**
    
As per the original paper by [Mikolov et al]() many phrases have a meaning that is not a simple composition of the meanings of its individual words. 
For eg. `new york` is one entity, however, as per our analysis above, we have two separate entities `new` & `york` which can have different meanings independently.    
To learn vector representation for phrases, we first find words that
appear frequently together, and infrequently in other contexts.
    
As per the analysis in the paper, we can use a formula to rank commonly used word pairs, and take the first 100 commonly occuring pairs.
$$\operatorname{score}\left(w_{i}, w_{j}\right)=\frac{\operatorname{count}\left(w_{i} w_{j}\right)-\delta}{\operatorname{count}\left(w_{i}\right) \times \operatorname{count}\left(w_{j}\right)}$$

**NOTE:** For simplicity of analysis, we take the discounting factor $\delta$ as 0, and take bi-gram combinations. You can experiment with tri-grams for word pairs such as `New_York_Times`.     
<br /><br />

    
**2.4.1 Find 100 most common bi-grams**

From the tokenized data above, find the count for each bigram pair.
    
For each such pair, find the score associated with each token pair using the formula above.
    
 Pick the top 100 pairs based on the score. (Higher the better). To understand the `score()` function we suggest you read the paper mentioned above.
    
Replace the original `text8` file with the pairs as one entity. For e.g., if `prime,minister` is a commonly occuring pair, replace `... prime minister ...' in the original corpus to a single entity `prime_minister`. Do this for all 100 pairs.
<br /><br />
    
</div>

In [None]:
# Uncomment this cell to download the dataset directly onto colab
# !gdown https://drive.google.com/uc?id=1yW-rQc8It9Ro0DAu4jp3XhVH3JQOdNKk

In [None]:
# Your code here

<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">

**2.4.2 Retrain word2vec**    
With the new corpus generated as above, build the dataset, use skipgrams and retrain your word2vec with a sufficiently large vocabulary.
    
Use the `most_similar()` function defiend above to find the entities most similar to `united_kingdom`
    
Compare the above with separate tokens for `united` & `kingdom` and the sum of the vectors (to get this, you may need a sufficiently large vocabulary (>2000).
<br /> <br />
    
</div>

In [None]:
# Your code here

<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">

### **2.5 [5 points] HOMEWORK QUIZ**
<br />
After attempting this part of the homework, answer the questions on edStem. All the questions depend on this part of the homework and you will not be able to answer them without attempting this part.

</div>

#### Answer the questions on edStem

___
___


<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">
    
## **PART 3 [35 points] : Language Modelling using RNNs**
<br />    

In the last part of the homework, you are expected to build and train a language model. For this, we will be using Pavlos famous texts which end with `...` for prediction. Here, you will preprocess a data corpus and train your simple RNN network with it. With this network you will try to predict what he meant when he typed `...`. This can be to some extent a form of transfer learning.
<br /><br />
    
</div>

# New Section

<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">

## **PART 3: Questions**
<br />

### **3.1 [2 points] PREPROCESS THE DATASET**
<br />

**3.1.1** - Read in the dataset `imdb.csv`. Create a new dataframe by splitting each review into individual sentences. The sentences can be delimited by different characters such as period and question mark (eroteme). Call this column as `text` in the new dataframe.
<br /><br />

**3.1.2** - Define a function `clean_data` that takes the new dataframe as input and removes all html tags and non-alphabetic characters from the dataframe. Additionally, convert all characters to lower case. Remove all the sentences where the number of words is less than 10 and higher than 30. Finally, add the start token `<s>` and the end token `</s>` to every sentence (row) in the dataframe. Return the processed the dataframe. 
<br /><br />

### **3.2 [2 points] TOKENIZE THE DATASET**
<br />

**3.2.1** - Tokenize the dataset using `tensorflow.keras.preprocessing.text` with a vocabulary size of 5000. Do **not** add an additional token for unknown words (out of vocabulary words) `<UNK>`.
<br /><br />

**3.2.2** - Fit the tokenizer on the dataset and get the sequence representation of each sentence.
<br /><br />

### **3.3 [10 points] MODELLING THE DATA**
<br />

**3.3.1** - The first step is to define the input and output of the model. The input to the model is all the words of the sentences except the last one. 
The output of the model is all words of the sentences except the first one. Using `tf.keras.preprocessing.sequence.pad_sequences` Post-pad all the sentences of the input and output to a length of 30.
<br /><br />

**3.3.2** - Define a simple RNN model that has an embedding layer with an embedding dimension of 300. You can define any number of RNN layers. The output of the RNN model will be a dense layer with size of the vocabulary and softmax activation. This model is that one you will be using to train the network. Use functional API for this to reuse it later on.
<br /><br />

**3.3.3** - Train the model with the input and output data formed above and a validation split of 0.2. The number epochs and batch size is left as a choice you have to make.
<br /><br />

**3.3.4** - Plot the train and validation loss. 
<br /><br />

### **3.4 [9 points] PREDICTING THE NEXT WORD**
<br />

**3.4.1** - Read the dataset `pp_text.csv`. Add the start and end tokens to each line and tokenize it. Convert each sentence to a sequence vector and post-pad to a length of 30. This will be the input for the prediction phase.
<br /><br />

**3.4.2** - For predicting the next word, use the trained RNN model from above. 

NOTE - Based on your implementation, the output of the RNN model might have to be different from that of your trained network. You can make use of Keras function API for this.
<br /><br />

**3.4.3** - Choose any sentence from the list of Pavlos' texts to predict the next word. Input this to the RNN model built for prediction and print the predicted word. Try this out with multiple sentences.
<br /><br />

**3.4.4** - Do you notice any pattern in the predicted words? Do they seem approriate to the context of the texts as you understand it? What do you attribute this discrepency to? How can you resolve it?

Answer in less than 150 words.
<br /><br />

### **3.5 [6 points] TRAINING AND PREDICTING WITH A DIFFERENT DATASET**
<br />

**3.5.1** - Read the dataset `cleaned_sarcasm.csv`. This dataset has been preprocessed for you, all you need to do is tokenize, convert to sequence and pad it, similar to 3.2.1, 3.2.2 and 3.3.1.
<br /><br />

**3.5.2** - Train your RNN model with this data and plot the train and validation trace plot. This part is similar to 3.3.2, 3.3.3 and 3.3.4.
<br /><br />

**3.5.3** - Repeat 3.4.1, 3.4.2 and 3.4.3 with the RNN model trained using the new dataset.
<br /><br />

**3.5.4** - How do the results with the new dataset compare to the previous ones? Why do you think so? 

Answer in less than 100 words.
<br /><br />
    
### **3.6 [3 points] COMPLETING THE SENTENCE**
<br />

**3.6.1** Until now we have predicted a single word for a given sentence. However, what if he meant more than one word when he typed in `...`

We will now predict multiple words for each input sentence. To do this we will first predict one word, append this word to the input text and then predict one more with the updated input. Continue doing this for 5 words or until the end token `</s>` (whichever comes first). 
<br /><br />

### **3.7 [3 points] HOMEWORK QUIZ**
<br />
After attempting this part of the homework, answer the questions on edStem. All the questions depend on this part of the homework and you will not be able to answer them without attempting this part.

<br />

</div>

<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">

## **PART 3: Solutions**

<br />

### **3.1 [2 points] PREPROCESS THE DATASET**
<br />    

**3.1.1** - Read in the dataset `imdb.csv`. Create a new dataframe by splitting each review into individual sentences. The sentences can be delimited by different characters such as period and question mark (eroteme). Call this column as `text` in the new dataframe.
    
</div>


*If you are using colab, the code in the next cell will help download the dataset directly onto your workspace.*

In [None]:
# Uncomment if you are using colab
# !gdown --id "1YmxaDY-VhGItJ5uRZKtHoEqpqAR1L2IY"


In [None]:
# Your code here

<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">

**3.1.2** - Define a function `clean_data` that takes the new dataframe as input and removes all html tags and non-alphabetic characters from the dataframe. Additionally, convert all characters to lower case. Remove all the sentences where the number of words is less than 10 and higher than 30. Finally, add the start token `<s>` and the end token `</s>` to every sentence (row) in the dataframe. Return the processed the dataframe. 
<br /><br />
    
</div>

In [None]:
# Your code here

<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">
    
### **3.2 [2 points] TOKENIZE THE DATASET**
    
<br />

**3.2.1** - Tokenize the dataset using `tensorflow.keras.preprocessing.text` with a vocabulary size of 5000. Do **not** add an additional token for unknown words (out of vocabulary words) `<UNK>`.
</div>

In [None]:
# Your code here

<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">

**3.2.2** - Fit the tokenizer on the dataset and get the sequence representation of each sentence.
<br /><br />

    
</div>

In [None]:
# Your code here

<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">

### **3.3 [10 points] MODELLING THE DATA**

**3.3.1** - The first step is to define the input and output of the model. The input to the model is all the words of the sentences except the last one. 
The output of the model is all words of the sentences except the first one. Using `tf.keras.preprocessing.sequence.pad_sequences` Post-pad all the sentences of the input and output to a length of 30.
<br /><br />
    
</div>

In [None]:
# Your code here

<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">

    
**3.3.2** - Define a simple RNN model that has an embedding layer with an embedding dimension of 300. You can define any number of RNN layers. The output of the RNN model will be a dense layer with size of the vocabulary and softmax activation. This model is that one you will be using to train the network. Use functional API for this to reuse it later on.
<br /><br />
    
</div>

In [None]:
# Your code here

<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">

**3.3.3** - Train the model with the input and output data formed above and a validation split of 0.2. The number epochs and batch size is left as a choice you have to make.
<br /><br />

</div>

In [None]:
# Your code here

<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">

**3.3.4** - Plot the train and validation loss. 

<br /><br />

</div>

In [None]:
# Your code here

<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">

### **3.4 [9 points] PREDICTING THE NEXT WORD**
    
<br />
    
**3.4.1** - Read the dataset `pp_text.csv`. Add the start and end tokens to each line and tokenize it. Convert each sentence to a sequence vector and post-pad to a length of 30. This will be the input for the prediction phase.
<br /><br />
    
</div>

*If you are using colab, the code in the next cell will help download the dataset directly onto your workspace.*

In [None]:
# Uncomment if you are using colab
# !gdown --id "1xeQ4w0iYJimzth0e3t3dQJbeuCsFlxVa"

In [None]:
# Your code here

<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">


**3.4.2** - For predicting the next word, use the trained RNN model from above. 

NOTE - Based on your implementation, the output of the RNN model might have to be different from that of your trained network. You can make use of Keras function API for this.
<br /><br />


    
</div>

In [None]:
# Your code here

<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">

**3.4.3** - Choose any sentence from the list of Pavlos' texts to predict the next word. Input this to the RNN model built for prediction and print the predicted word. Try this out with multiple sentences.
<br /><br />
    
</div>

In [None]:
# Your code here

<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">

**3.4.4** - Do you notice any pattern in the predicted words? Do they seem approriate to the context of the texts as you understand it? What do you attribute this discrepency to? How can you resolve it?

Answer in less than 150 words.
<br /><br />
    
</div>

#### Type your answer here

<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">


### **3.5 [6 points] TRAINING AND PREDICTING WITH A DIFFERENT DATASET**
<br />
    
**3.5.1** - Read the dataset `cleaned_sarcasm.csv`. This dataset has been preprocessed for you, all you need to do is tokenize, convert to sequence and pad it, similar to 3.2.1, 3.2.2 and 3.3.1.
<br /><br />
    
</div>

*If you are using colab, the code in the next cell will help download the dataset directly onto your workspace.*

In [None]:
# Uncomment if you are using colab
# !gdown --id "1pMUuhoKsZVnktosQRuqAutfd3J4sFUsa"


In [None]:
# Your code here

<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">

**3.5.2** - Train your RNN model with this data and plot the train and validation trace plot. This part is similar to 3.3.2, 3.3.3 and 3.3.4.
<br /><br />
    
</div>

In [None]:
# Your code here

<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">

**3.5.3** - Repeat 3.4.1, 3.4.2 and 3.4.3 with the RNN model trained using the new dataset.
<br /><br />

</div>

In [None]:
# Your code here

<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">

**3.5.4** - How do the results with the new dataset compare to the previous ones? Why do you think so? 

Answer in less than 100 words.
<br /><br />
    
</div>

#### Type your answer here

<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">


### **3.6 [3 points] COMPLETING THE SENTENCE**
<br />

**3.6.1** Until now we have predicted a single word for a given sentence. However, what if he meant more than one word when he typed in `...`

We will now predict multiple words for each input sentence. To do this we will first predict one word, append this word to the input text and then predict one more with the updated input. Continue doing this for 5 words or until the end token `</s>` (whichever comes first). 
<br /><br />

</div>

In [None]:
# Your code here

<div class="alert alert-block alert-danger" style="color:black;background-color:#E7F4FA">

### **3.7 [3 points] HOMEWORK QUIZ**
<br />
After attempting this part of the homework, answer the questions on edStem. All the questions depend on this part of the homework and you will not be able to answer them without attempting this part.
<br /><br />

</div>

#### Answer the questions on edStem