# Summer 2023 Applied NLP Homework 3

## Instructors: Dr. Mahdi Roozbahani and Wafa Louhichi

## Deadline: Jul 7th, 11:59PM AoE

## Honor Code and Assignment Deadline
<!-- No changes needed on the below section -->
* No unapproved extension of the deadline is allowed. Late submission will lead to 0 credit. 

* Discussion is encouraged on Ed as part of the Q/A. However, all assignments should be done individually.
<font color='darkred'>
* Plagiarism is a **serious offense**. You are responsible for completing your own work. You are not allowed to copy and paste, or paraphrase, or submit materials created or published by others, as if you created the materials. All materials submitted must be your own.</font>
<font color='darkred'>
* All incidents of suspected dishonesty, plagiarism, or violations of the Georgia Tech Honor Code will be subject to the institute’s Academic Integrity procedures. If we observe any (even small) similarities/plagiarisms detected by Gradescope or our TAs, **WE WILL DIRECTLY REPORT ALL CASES TO OSI**, which may, unfortunately, lead to a very harsh outcome. **Consequences can be severe, e.g., academic probation or dismissal, grade penalties, a 0 grade for assignments concerned, and prohibition from withdrawing from the class.**
</font>


## Instructions for the assignment 

<!-- No changes needed on the below section -->
- This entire assignment will be autograded through Gradescope.

- We provided you different .py files and we added libraries in those files please DO NOT remove those lines and add your code after those lines. Note that these are the only allowed libraries that you can use for the homework.

- You will submit your implemented .py files to the corresponding homework section on Gradescope. 

- You are allowed to make as many submissions until the deadline as you like. Additionally, note that the autograder tests each function separately, therefore it can serve as a useful tool to help you debug your code if you are not sure of what part of your implementation might have an issue.


# Google Colab Setup (Optional for running on Colab)
You may need to right click on the Applied NLP folder and `Add shortcut to Drive`

In [None]:
# Mount google drive
from google.colab import drive
drive.mount('/content/drive/')

## Change path to directory of where notebook is located
%cd '/content/drive/MyDrive/Applied_NLP/HW3/hw3_code/'

## If no GPU selected it will ask for GPU to be selected
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Select the Runtime > "Change runtime type" menu to enable a GPU accelerator, ')
  print('and then re-execute this cell.')
else:
  print(gpu_info)


## This wraps output text according to the window size
from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

# Assignment Overview

In this homework we will explore non-linear text classification algorithms : Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). We will also look into another embedding techique : Word2Vec.

We will reuse the datasets from HW2 for this exploration:
* The first dataset is a subset of a [Clickbait Dataset](https://github.com/bhargaviparanjape/clickbait/tree/master/dataset) that has article headlines and a binary label on whether the headline is considered clickbait. 
* The second dataset is a subset of [Web of Science Dataset](https://data.mendeley.com/datasets/9rw3vkcfy4/6) that has articles and a corresponding label on the domain of the articles. 

We will first explore Continuous Bag-of-Words (CBOW) and Skip-gram based Word2Vec models using a very small dataset. We will then use pre-trained Word2Vec embeddings and feed them to the classification algorithms.

**You will be using pytorch for coding the models in this homework.** [Tutorial](https://pytorch.org/tutorials/beginner/basics/quickstart_tutorial.html) and [Building model](https://pytorch.org/tutorials/beginner/basics/buildmodel_tutorial.html) are good references for those who are new to pytorch.

## Deliverables and Points Distribution

### Q1: Word2Vec [47pts]
- **1.1 Implementing CBOW from Scratch** [27pts] Deliverables: <font color = 'green'>word2vec.py</font>

    - [2pts] tokenize (Word2Vec class)

    - [5pts] create_vocabulary (Word2Vec class)

    - [10pts] cbow_embeddings (Word2Vec class)

    - [5pts] \__init__ (CBOW_Model class)
    
    - [5pts] forward (CBOW_Model class)

- **1.2 Implementing Skip-Gram from Scratch** [20pts] Deliverables: <font color = 'green'>word2vec.py</font>

    - [10pts] skipgram_embeddings (Word2Vec class)

    - [5pts] \__init__ (SkipGram_Model class)

    - [5pts] forward (SkipGram_Model class)

### Q2: Classification with CNN [15pts]
- **2.1 Classification with CNN** [15pts] Deliverables: <font color = 'green'>cnn.py</font>

    - [5pts] \__init__

    - [10pts] forward

### Q3: Classification with RNN [15pts]
- **3.1 Classification with RNN** [15pts] Deliverables: <font color = 'green'>rnn.py</font>

    - [5pts] \__init__ 

    - [10pts] forward 



# Setup
This notebook is tested under [python 3. * . *](https://www.python.org/downloads/release/python-368/), and the corresponding packages can be downloaded from [miniconda](https://docs.conda.io/en/latest/miniconda.html). You may also want to get yourself familiar with several packages:

- [jupyter notebook](https://jupyter-notebook.readthedocs.io/en/stable/)
- [numpy](https://docs.scipy.org/doc/numpy-1.15.1/user/quickstart.html)
- [sklearn](https://matplotlib.org/users/pyplot_tutorial.html)
- [pytorch](https://pytorch.org/)

In the .py files please implement the functions that have `raise NotImplementedError`, and after you finish the coding, please delete or comment out `raise NotImplementedError`.

## Library imports

In [1]:
#Import the necessary libraries
import pandas as pd
import numpy as np
import scipy as sp
import sys
import re
from copy import deepcopy
import pickle
import random
import seaborn as sns
import gensim
import torchtext
from sklearn.metrics import accuracy_score
import torch
import torch.nn as nn
from torch import optim
torch.manual_seed(10)
from torch.autograd import Variable
import torch.nn.functional as F
from torch.utils.data import DataLoader
import warnings
warnings.filterwarnings("ignore")

# import gzip

%load_ext autoreload
%autoreload 2
%reload_ext autoreload

print('Version information')

print('python: {}'.format(sys.version))
print('numpy: {}'.format(np.__version__))

Version information
python: 3.11.3 (main, Apr 19 2023, 18:51:09) [Clang 14.0.6 ]
numpy: 1.25.0


# Load Dataset


We start by loading both data sets already split into an 80/20 train and test set.

In [2]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

df_train = pd.read_csv('./data/train.csv')
df_test = pd.read_csv('./data/test.csv')

# Separate dataframes into train and test lists
x_train, y_train = list(df_train['headline']), list(df_train['label'])
x_test, y_test = list(df_test['headline']), list(df_test['label'])

Below is the number of headlines in the train and test set as well as a sample of the article headlines and its binary label, where 0 is considered not clickbait and 1 is clickbait.

In [3]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

print(f'Number of Train Headlines: {len(x_train)}')
print(f'Number of Test Headlines: {len(x_test)}')

print('\n\nSample Label and Headlines:')
x = 105
for label, line in zip(y_train[x:x+5], x_train[x:x+5]):
    print(f'{label}: {line}')
    
print('\nOutput of Sample Headlines without Print Statement:')
x_train[x:x+5]

Number of Train Headlines: 19200
Number of Test Headlines: 4800


Sample Label and Headlines:
1: 27 Breathtaking Alternatives To A Traditional Wedding Bouquet <br>

1: 22 Pictures People Who Aren't Grad Students Will <strong>Never</strong> Understand

0: PepsiCo Profit Falls 43 Percent

0: Website of Bill O'Reilly, FOX News commentator, hacked in retribution

1: The Green Toy Soldiers From Your Childhood Now Come In Baller Yoga Poses A


Output of Sample Headlines without Print Statement:


['27 Breathtaking Alternatives To A Traditional Wedding Bouquet <br>\n',
 "22 Pictures People Who Aren't Grad Students Will <strong>Never</strong> Understand\n",
 'PepsiCo Profit Falls 43 Percent\n',
 "Website of Bill O'Reilly, FOX News commentator, hacked in retribution\n",
 'The Green Toy Soldiers From Your Childhood Now Come In Baller Yoga Poses A\n']

In [4]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

df_train_wos = pd.read_csv('./data/train_wos.csv')
df_test_wos = pd.read_csv('./data/test_wos.csv')

# Separate dataframes into train and test lists
x_train_wos, y_train_wos = list(df_train_wos['article']), list(df_train_wos['label'])
x_test_wos, y_test_wos = list(df_test_wos['article']), list(df_test_wos['label'])

# Numerical label to domain mapping
wos_label = {0:'CS', 1:'ECE', 2:'Civil', 3:'Medical'}
# Numerical label to Numerical mapping
label_mapping = {0:0, 1:1, 4:2, 5:3}

for i, label in enumerate(y_train_wos):
    y_train_wos[i] = label_mapping[label]
for i, label in enumerate(y_test_wos):
    y_test_wos[i] = label_mapping[label]

In [5]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

print(f'Number of Train Articles: {len(x_train_wos)}')
print(f'Number of Test Articles: {len(x_test_wos)}')

print('\nLabel Key:', wos_label)

print('\nSample Label and Articles:\n')
x = 107
for label, line in zip(y_train_wos[x:x+3], x_train_wos[x:x+3]):
    print(f'{label} - {wos_label[label]}: {line}')

Number of Train Articles: 1600
Number of Test Articles: 400

Label Key: {0: 'CS', 1: 'ECE', 2: 'Civil', 3: 'Medical'}

Sample Label and Articles:

0 - CS: An efficient procedure for calculating the electromagnetic fields in multilayered cylindrical structures is reported in this paper. Using symbolic computation, spectral Green's functions, suitable for numerical implementations are determined in compact and closed forms. Applications are presented for structures with two dielectric layers.

1 - ECE: A multifunctional platform based on the microhotplate was developed for applications including a Pirani vacuum gauge, temperature, and gas sensor. It consisted of a tungsten microhotplate and an on-chip operational amplifier. The platform was fabricated in a standard complementary metal oxide semiconductor (CMOS) process. A tungsten plug in standard CMOS process was specially designed as the serpentine resistor for the microhotplate, acting as both heater and thermister. With the sacrifici

## Q1: Word2Vec [47pts]

Word2vec is a method to efficiently create word embeddings. More details on word2vec and the intuition behind it can be found here :  
* [The Illustrated Word2vec by Jay Alammar](https://jalammar.github.io/illustrated-word2vec/)

Word2vec is based on the idea that a word’s meaning is defined by its context. Context is represented as surrounding words. For the word2vec model, context is represented as N words before and N words after the current word. N is a hyperparameter. With larger N we can create better embeddings, but at the same time, such a model requires more computational resources. 

There are two word2vec architectures proposed in the paper:

* CBOW (Continuous Bag-of-Words) — a model that predicts a current word based on its context words.
* Skip-Gram — a model that predicts context words based on the current word.

We will be running our Word2Vec models on a very small dataset as described below.

In [6]:
corpus = [
    'he is a king',
    'she is a queen',
    'he is a man',
    'she is a woman',
    'warsaw is poland capital',
    'berlin is germany capital',
    'paris is france capital',   
]

## 1.1: Implementing Continuous Bag-of-words From Scratch [27pts]
In the **word2vec.py** file complete the following functions:
  * <strong>tokenize</strong> (Word2Vec class)
  * <strong>create_vocabulary</strong> (Word2Vec class)
  * <strong>cbow_embeddings</strong> (Word2Vec class)
  * <strong>\__init__</strong> (CBOW_Model class)
  * <strong>forward</strong> (CBOW_Model class)

A high level overview of the CBOW model can be described as :    
<p align="center"><img src="https://miro.medium.com/max/1400/1*ETcgajy5s0KNIfMgE5xOqg.png" width="75%" align="center"></p>

CBOW model takes several words, each goes through the same Embedding layer, and then word embedding vectors are averaged before going into the Linear layer.

We will be implementing this model using the architecture described below :    

<p align="center"><img src="https://miro.medium.com/max/1400/1*mLDM3PH12CjhaFoUm5QTow.png" width="75%" align="center"></p>

Here are the steps that needs to be followed for implementing CBOW model :    
* Step-1: Create vocabulary
  * Split each words into tokens.
  * Assign a unique ID to each unique token.

* Step-2: Create CBOW Embeddings
  * Create CBOW embeddings by taking context as N past words and N future words.

* Step-3: Implement CBOW Model
  * Implement CBOW model as described in the architecture above.
  
<b>Hint:</b> Since we are using the cross entropy loss, there is no need to apply the softmax after the linear layer.


## 1.1.1: Fetching CBOW embeddings [No Points]
Run the below cell to fetch CBOW embeddings using functions that you have already implemented in **1.1**.

In [71]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

from word2vec import Word2Vec

w2v = Word2Vec()
tokens = w2v.tokenize(corpus)
w2v.create_vocabulary(tokens)
source, target = w2v.cbow_embeddings(tokens)

In [72]:
print(tokens)
print('-------------------------------')
print(source)
print('-------------------------------')
print(target)

[['he', 'is', 'a', 'king'], ['she', 'is', 'a', 'queen'], ['he', 'is', 'a', 'man'], ['she', 'is', 'a', 'woman'], ['warsaw', 'is', 'poland', 'capital'], ['berlin', 'is', 'germany', 'capital'], ['paris', 'is', 'france', 'capital']]
-------------------------------
[[6, 0], [5, 0, 7], [5, 6, 7], [6, 0], [6, 0], [12, 0, 11], [12, 6, 11], [6, 0], [6, 0], [5, 0, 8], [5, 6, 8], [6, 0], [6, 0], [12, 0, 14], [12, 6, 14], [6, 0], [6, 10], [13, 10, 2], [13, 6, 2], [6, 10], [6, 4], [1, 4, 2], [1, 6, 2], [6, 4], [6, 3], [9, 3, 2], [9, 6, 2], [6, 3]]
-------------------------------
[[5], [6], [0], [7], [12], [6], [0], [11], [5], [6], [0], [8], [12], [6], [0], [14], [13], [6], [10], [2], [1], [6], [4], [2], [9], [6], [3], [2]]


## 1.1.2: Training CBOW model [No Points]
Run the below cell to train CBOW model using functions that you have already implemented in **1.1**.

In [152]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

from word2vec import Word2Vec

w2v = Word2Vec()
tokens = w2v.tokenize(corpus)
w2v.create_vocabulary(tokens)
source, target = w2v.cbow_embeddings(tokens)

from word2vec import CBOW_Model

N_EPOCHS = 300
model = CBOW_Model(w2v.vocabulary_size)
optimizer = optim.Adam(model.parameters(), lr=0.001)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
criterion = nn.CrossEntropyLoss()

for epoch in range(N_EPOCHS):
    total_loss = 0.0
    shuffled_i = list(range(0,len(target)))
    random.shuffle(shuffled_i)
    for i in shuffled_i:
        x = torch.from_numpy(np.asarray(source[i])).long().to(device)
        y = torch.from_numpy(np.asarray(target[i])).float().to(device)

        optimizer.zero_grad()
        outputs = model(x)
        #print(outputs, y)
        loss = criterion(outputs, y)
        total_loss += loss
        loss.backward()
        optimizer.step()
        
    if epoch % 20 == 0:    
      print("loss on epoch %i: %f" % (epoch, total_loss))

torch.Size([2, 300])
torch.Size([1, 15])


RuntimeError: expected scalar type Long but found Float

## 1.1.3: Visualizing CBOW embeddings [No Points]
Run the below cells to visualize CBOW embeddings.

In [None]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

# embedding from first model layer
embeddings = list(model.parameters())[0]
embeddings = embeddings.cpu().detach().numpy()

# normalization
norms = (embeddings ** 2).sum(axis=1) ** (1 / 2)
norms = np.reshape(norms, (len(norms), 1))
embeddings_norm = embeddings / norms

Here, we use truncated SVD to project the learned word2vec embedding to 2D space for visualization. Feel free to tune the learning rate in the training part for different results.

In [None]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

import seaborn as sns
from sklearn import decomposition

w2v.word2idx[''] = 0
svd = decomposition.TruncatedSVD(n_components=2)
W2_dec = svd.fit_transform(embeddings)

x = W2_dec[:,0]
y = W2_dec[:,1]
plot = sns.scatterplot(x=x, y=y)

for i in range(0,W2_dec.shape[0]):
     plot.text(x[i], y[i]+2e-2, list(w2v.word2idx)[i], horizontalalignment='center', size='small', color='black', weight='semibold');

Here, we will look into the learned property of the word2vec embedding. Warsaw is the capital of Poland and we now calculate the difference between embeddings of "warsaw" and "poland". After that, we add the difference to the embedding of "paris". We then rank the dot product of the computed embedding vs all the embeddings. Notice that the larger this value, the more similar two embeddings are. Feel free to play with different word pairs.

In [None]:
emb1 = embeddings[w2v.word2idx["poland"]]
emb2 = embeddings[w2v.word2idx["warsaw"]]
emb3 = embeddings[w2v.word2idx["paris"]]

emb4 = emb1 - emb2 + emb3
emb4_norm = (emb4 ** 2).sum() ** (1 / 2)
emb4 = emb4 / emb4_norm

emb4 = np.reshape(emb4, (len(emb4), 1))
dists = np.matmul(embeddings_norm, emb4).flatten()

top5 = np.argsort(-dists)[:5]

for word_id in top5:
    print("{}: {:.3f}".format(w2v.idx2word[word_id], dists[word_id]))

## 1.2: Implementing Skip-Gram From Scratch [20pts]
In the **word2vec.py** file complete the following functions:
  * <strong>skipgram_embeddings</strong> (Word2Vec class)
  * <strong>\__init__</strong> (SkipGram_Model class)
  * <strong>forward</strong> (SkipGram_Model class)

A high level overview of the SkipGram model can be described as :    
<p align="center"><img src="https://miro.medium.com/max/720/1*SVs6xTpD7AYviP24UTOYUA.png" width="75%" align="center"></p>

The Skip-Gram model takes a single word as compared to CBOW model.

We will be implementing this model using the architecture described below :    

<p align="center"><img src="https://miro.medium.com/max/720/1*eHh1_t8Wms_hqDNBLuAnFg.png" width="75%" align="center"></p>

Here are the steps that needs to be followed for implementing SkipGram model :    
* Step-1: Create vocabulary
  * Split each words into tokens.
  * Assign a unique ID to each unique token.

* Step-2: Create SkipGram Embeddings
  * Create SkipGram embeddings by taking context as middle word.

* Step-3: Implement SkipGram Model
  * Implement SkipGram model as described in the architecture above. Output SkipGram embeddings for N past words and N future words.

**Hint:** Since we are using the cross entropy loss, there is no need to apply the softmax after the linear layer.

## 1.2.1: Fetching SkipGram embeddings [No Points]
Run the below cell to fetch **SkipGram** embeddings using functions that you have already implemented in 1.2.

In [145]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

from word2vec import Word2Vec

w2v = Word2Vec()
tokens = w2v.tokenize(corpus)
w2v.create_vocabulary(tokens)
source, target = w2v.skipgram_embeddings(tokens)

In [146]:
print(tokens)
print(source)
print(target)

[['he', 'is', 'a', 'king'], ['she', 'is', 'a', 'queen'], ['he', 'is', 'a', 'man'], ['she', 'is', 'a', 'woman'], ['warsaw', 'is', 'poland', 'capital'], ['berlin', 'is', 'germany', 'capital'], ['paris', 'is', 'france', 'capital']]
[[5], [5], [6], [6], [6], [0], [0], [0], [7], [7], [12], [12], [6], [6], [6], [0], [0], [0], [11], [11], [5], [5], [6], [6], [6], [0], [0], [0], [8], [8], [12], [12], [6], [6], [6], [0], [0], [0], [14], [14], [13], [13], [6], [6], [6], [10], [10], [10], [2], [2], [1], [1], [6], [6], [6], [4], [4], [4], [2], [2], [9], [9], [6], [6], [6], [3], [3], [3], [2], [2]]
[[6], [0], [5], [0], [7], [5], [6], [7], [6], [0], [6], [0], [12], [0], [11], [12], [6], [11], [6], [0], [6], [0], [5], [0], [8], [5], [6], [8], [6], [0], [6], [0], [12], [0], [14], [12], [6], [14], [6], [0], [6], [10], [13], [10], [2], [13], [6], [2], [6], [10], [6], [4], [1], [4], [2], [1], [6], [2], [6], [4], [6], [3], [9], [3], [2], [9], [6], [2], [6], [3]]


## 1.2.2: Training SkipGram model [No Points]
Run the below cell to train SkipGram model using functions that you have already implemented in **1.2**.

In [149]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

from word2vec import SkipGram_Model

N_EPOCHS = 300
model = SkipGram_Model(w2v.vocabulary_size)
optimizer = optim.Adam(model.parameters(), lr=0.001)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
criterion = nn.CrossEntropyLoss()

for epoch in range(N_EPOCHS):
    total_loss = 0.0
    shuffled_i = list(range(0,len(target)))
    random.shuffle(shuffled_i)
    for i in shuffled_i:
        x = torch.from_numpy(np.asarray(source[i])).long().to(device)
        y = torch.from_numpy(np.asarray(target[i])).float().to(device)
        #x = torch.from_numpy(np.asarray(source[i])).to(device)
        #y = torch.from_numpy(np.asarray(target[i])).to(device)

        optimizer.zero_grad()
        outputs = model(x)
        #print(outputs.shape, y.shape, outputs.dtype, y.dtype)
        #print(outputs)
        #print(y)
        loss = criterion(outputs, y)
        #print(loss)
        total_loss += loss
        loss.backward()
        optimizer.step()
        
    if epoch % 20 == 0:    
      print("loss on epoch %i: %f" % (epoch, total_loss))

torch.Size([1, 300])
torch.Size([15])


RuntimeError: size mismatch (got input: [15], target: [1])

In [None]:
test1 = torch.tensor([[ 0.0240],
        [ 0.0240],
        [-0.0501],
        [-0.0501],
        [-0.0501],
        [-0.0310],
        [-0.0310],
        [-0.0310],
        [ 0.0065],
        [ 0.0065]])

test2 = torch.tensor([[6.],
        [4.],
        [1.],
        [4.],
        [2.],
        [1.],
        [6.],
        [2.],
        [6.],
        [4.]])
criterion = nn.CrossEntropyLoss()
print(test1.shape, test2.shape)
loss = criterion(test1, test2)
print(loss)

## 1.2.3: Visualizing SkipGram embeddings [No Points]
Run the below cells to visualize SkipGram embeddings.

In [None]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

# embedding from first model layer
embeddings = list(model.parameters())[0]
embeddings = embeddings.cpu().detach().numpy()

# normalization
norms = (embeddings ** 2).sum(axis=1) ** (1 / 2)
norms = np.reshape(norms, (len(norms), 1))
embeddings_norm = embeddings / norms

Here, we use truncated SVD to project the learned word2vec embedding to 2D space for visualization. Feel free to tune the learning rate in the training part for different results.

In [None]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

from sklearn import decomposition

w2v.word2idx[''] = 0
svd = decomposition.TruncatedSVD(n_components=2)
W2_dec = svd.fit_transform(embeddings)

x = W2_dec[:,0]
y = W2_dec[:,1]
plot = sns.scatterplot(x=x, y=y)

for i in range(0,W2_dec.shape[0]):
     plot.text(x[i], y[i]+2e-2, list(w2v.word2idx)[i], horizontalalignment='center', size='small', color='black', weight='semibold');

Here, we will look into the learned property of the word2vec embedding. Warsaw is the capital of Poland and we now calculate the difference between embeddings of "warsaw" and "poland". After that, we add the difference to the embedding of "paris". We then rank the dot product of the computed embedding vs all the embeddings. Notice that the larger this value, the more similar two embeddings are. Feel free to play with different word pairs.

In [None]:
emb1 = embeddings[w2v.word2idx["poland"]]
emb2 = embeddings[w2v.word2idx["warsaw"]]
emb3 = embeddings[w2v.word2idx["paris"]]

emb4 = emb1 - emb2 + emb3
emb4_norm = (emb4 ** 2).sum() ** (1 / 2)
emb4 = emb4 / emb4_norm

emb4 = np.reshape(emb4, (len(emb4), 1))
dists = np.matmul(embeddings_norm, emb4).flatten()

top5 = np.argsort(-dists)[:5]

for word_id in top5:
    print("{}: {:.3f}".format(w2v.idx2word[word_id], dists[word_id]))

Ideally with large amount of data, we should have got france as the most nearest word. But depending on the implementation and amount of data, it can vary.

## Q2: Classification with CNN [15pts]

Convolutional layers are used to find patterns by sliding small kernel window over input. Instead of multiplying the filters on the small regions of the images, it slides through embedding vectors of few words as mentioned by window size. For looking at sequences of word embeddings, the window has to look at multiple word embeddings in a sequence. They will be rectangular with size window_size * embedding_size. For example, if window size is 3 then kernel will be 3*500. This essentially represents n-grams in the model. The kernel weights (filter) are multiplied to word embeddings in pairs and summed up to get output values. As the network is being learned, these kernel weights are also being learned.

We will be using convolutional network with pre-trained word2vec models for classification. We implement a convolutional neural network for text classification similar to the CNN-rand baseline described by [Kim (2014)](https://aclanthology.org/D14-1181.pdf). We use pre-trained word2vec models for feasibility of finding appropriate embeddings. The architecture of our model looks like :

<p align="center"><img src="https://cezannec.github.io/assets/cnn_text/complete_text_classification_CNN.png" width="75%" align="center"></p>

We will be using an Embedding layer loaded with a word2vec model, followed by a convolution layer, and a linear layer.

We will would be using the Clickbait and Web of science dataset for this task.

## 2.1: Implementing CNN classifier
In the **cnn.py** file complete the following functions:
  * <strong>\__init__</strong>
  * <strong>forward</strong>

### 2.1.1 : Pre-Processing Data [No Points]

Run the below cell to preprocess the data.

In [15]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

from gensim.utils import simple_preprocess

def preprocess(data):
  preprocessed_data = []
  for text in data:
    tokens = simple_preprocess(text, deacc=True)
    preprocessed_data.append(tokens)
  return preprocessed_data

preprocessed_x_train = preprocess(x_train)
preprocessed_x_train_wos = preprocess(x_train_wos)

preprocessed_x_test = preprocess(x_test)
preprocessed_x_test_wos = preprocess(x_test_wos)

In [16]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

# Use cuda if present
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Device available for running: ")
print(device)

Device available for running: 
cpu


### 2.1.2 : Utility functions for training Word2Vec Model [No Points]

Run the below cells for making word2vec model, vectors and target.

In [17]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

from gensim.parsing.porter import PorterStemmer
from gensim.models import Word2Vec

porter_stemmer = PorterStemmer()

size = 500
window = 3
min_count = 1
workers = 3
sg = 1

# Function to train word2vec model
def make_word2vec_model(data, padding=True, sg=1, min_count=1, vector_size=500, workers=3, window=3):
    data.append(['pad'])
    w2v_model = Word2Vec(data, min_count = min_count, vector_size = size, workers = workers, window = window, sg = sg)
    return w2v_model

In [32]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

def make_word2vec_vector(sentence):
    padded_X = [padding_idx for i in range(max_sen_len)]
    i = 0
    for word in sentence:
        #if word not in w2vmodel.wv.vocab:
        if word not in w2vmodel.wv.index_to_key:
            padded_X[i] = 0
        else:
            padded_X[i] = w2vmodel.wv.key_to_index[word]
        i += 1
    return torch.tensor(padded_X, dtype=torch.long, device=device).view(1, -1)

In [19]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

def make_target(label):
  return torch.tensor([label], dtype=torch.long, device=device)

## 2.2 : Classifying Clickbait Dataset using CNN [No Points]

Run the below cell to classify the Clickbait train and test dataset using the CNN functions that you have already implemented in 2.

An accuracy of more than 80% is acceptable.


In [20]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

# Train Word2vec model
w2vmodel = make_word2vec_model(preprocessed_x_train, padding=True, sg=sg, min_count=min_count, vector_size=size, workers=workers, window=window)

Because CNN requires the input data to be of the same length. We use the embedding of the "pad" word as the padding vector. Notice that this choice is just a convention and other tokens could also work for this purpose. In more complex language model, there will be a dedicated '\<pad\>' token for padding.

In [25]:
import sys
print(sys.version)

3.11.3 (main, Apr 19 2023, 18:51:09) [Clang 14.0.6 ]


In [26]:
#%%capture
#!pip install python==3.7.0
!pip install gensim==3.8.0



In [21]:
!pip show gensim

Name: gensim
Version: 3.8.0
Summary: Python framework for fast Vector Space Modelling
Home-page: http://radimrehurek.com/gensim
Author: Radim Rehurek
Author-email: me@radimrehurek.com
License: LGPLv2.1
Location: /Users/nickdinapoli/anaconda3/lib/python3.10/site-packages
Requires: numpy, scipy, six, smart-open
Required-by: 


In [33]:
#w2vmodel.wv.index_to_key

In [22]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

max_sen_len = max(map(len, preprocessed_x_train))
padding_idx = w2vmodel.wv.key_to_index['pad']

In [52]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

from cnn import CNN

NUM_CLASSES = 2

model = CNN(w2vmodel, num_classes=NUM_CLASSES)
model.to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
N_EPOCHS = 5

model.train()
for epoch in range(N_EPOCHS):
    total_loss = 0.0
    shuffled_i = list(range(0,len(y_train)))
    random.shuffle(shuffled_i)

    for index in range(len(shuffled_i)):
        model.zero_grad()
        bow_vec = make_word2vec_vector(preprocessed_x_train[index]).float()
        outputs = model(bow_vec)
        y = make_target(y_train[index])

        loss = criterion(outputs, y)
        total_loss += loss.item()

        loss.backward()
        optimizer.step()


    print("loss on epoch %i: %f" % (epoch, total_loss))

loss on epoch 0: 7284.202690
loss on epoch 1: 7150.303315
loss on epoch 2: 7104.886764
loss on epoch 3: 7075.922083
loss on epoch 4: 7064.057964


In [53]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

from sklearn.metrics import accuracy_score
cnn_predictions = []
original_lables_cnn = []
model.eval()

with torch.no_grad():
    for index in range(len(y_test)):
        bow_vec = make_word2vec_vector(preprocessed_x_test[index])
        probs = model(bow_vec)
        _, predicted = torch.max(probs.data, 1)
        cnn_predictions.append(predicted.cpu().numpy()[0])
        t = make_target(y_test[index]).cpu().numpy()[0]
        original_lables_cnn.append(make_target(y_test[index]).cpu().numpy()[0])

print("Test Accuracy on Clickbait Dataset using CNN : {:.3f}".format(accuracy_score(original_lables_cnn, cnn_predictions)))

Test Accuracy on Clickbait Dataset using CNN : 0.939


Run the below cell to save the predictions. You will be required to upload the predictions on gradescope for evaluation.

In [54]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

preds = np.asarray(cnn_predictions)

with open('cnn_clickbait.pkl', 'wb') as fp:
    pickle.dump(preds, fp)

## 2.3 : Classifying Web of Science Dataset using CNN [No Points]

Run the below cell to classify the WoS train and test dataset using the CNN functions that you have already implemented in 2.

An accuracy of more than 55% is acceptable.

In [56]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

# Train Word2vec model
w2vmodel = make_word2vec_model(preprocessed_x_train_wos, padding=True, sg=sg, min_count=min_count, vector_size=size, workers=workers, window=window)

In [57]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

max_sen_len = max(map(len, preprocessed_x_train_wos))
padding_idx = w2vmodel.wv.key_to_index['pad']

In [159]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

from cnn import CNN

NUM_CLASSES = 4

model = CNN(w2vmodel, num_classes=NUM_CLASSES)
model.to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
N_EPOCHS = 20

model.train()
for epoch in range(N_EPOCHS):
    total_loss = 0.0
    shuffled_i = list(range(0,len(y_train_wos)))
    random.shuffle(shuffled_i)

    for index in range(len(shuffled_i)):
        model.zero_grad()
        bow_vec = make_word2vec_vector(preprocessed_x_train_wos[index]).float()
        outputs = model(bow_vec)
        y = make_target(y_train_wos[index])

        loss = criterion(outputs, y)
        total_loss += loss.item()

        loss.backward()
        optimizer.step()


    if epoch % 5 == 0:    
      print("loss on epoch %i: %f" % (epoch, total_loss))

loss on epoch 0: 1969.677522
loss on epoch 5: 1548.603587
loss on epoch 10: 1461.306056
loss on epoch 15: 1408.127949


In [160]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

from sklearn.metrics import accuracy_score
cnn_predictions = []
original_lables_cnn = []
model.eval()

with torch.no_grad():
    for index in range(len(y_test_wos)):
        bow_vec = make_word2vec_vector(preprocessed_x_test_wos[index])
        probs = model(bow_vec)
        _, predicted = torch.max(probs.data, 1)
        cnn_predictions.append(predicted.cpu().numpy()[0])
        t = make_target(y_test_wos[index]).cpu().numpy()[0]
        original_lables_cnn.append(make_target(y_test_wos[index]).cpu().numpy()[0])

print("Test Accuracy on WoS Dataset using CNN : {:.3f}".format(accuracy_score(original_lables_cnn, cnn_predictions)))

Test Accuracy on WoS Dataset using CNN : 0.640


Run the below cell to save the predictions. You will be required to upload the predictions on gradescope for evaluation.

In [161]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

preds = np.asarray(cnn_predictions)

with open('cnn_wos.pkl', 'wb') as fp:
    pickle.dump(preds, fp)

## Q3: Classification with RNN [15pts]


We will be using recurrent neural networks for classification. The architecture of our model looks like :

<p align="center"><img src="https://www.tensorflow.org/static/text/tutorials/images/bidirectional.png" width="75%" align="center"></p>

We will be using an Embedding layer loaded, followed by a RNN layer, and a linear layer.

We will would be using the Clickbait and Web of science dataset for this task.

## 3.1: Implementing RNN classifier
In the **rnn.py** file complete the following functions:
  * <strong>\__init__</strong>
  * <strong>forward</strong>

### 3.1.1 : Pre-Processing Data [No Points]

Run the below cells to load functions for building vocabulary and tokenizing the sentences.

In [162]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

from torchtext.data import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

tokenizer = get_tokenizer("basic_english")

def build_vocabulary(datasets):
  for dataset in datasets:
    for text in dataset:
      yield tokenizer(text)

vocab = build_vocab_from_iterator(build_vocabulary([x_train]), min_freq=1, specials=["<UNK>"])
vocab.set_default_index(vocab["<UNK>"])

vocab_wos = build_vocab_from_iterator(build_vocabulary([x_train_wos]), min_freq=1, specials=["<UNK>"])
vocab_wos.set_default_index(vocab["<UNK>"])

## 3.2 : Classifying Clickbait Dataset using RNN [No Points]

Run the below cell to classify the Clickbait train and test dataset using the RNN functions that you have already implemented in 3.

An accuracy of more than 85% is acceptable.


In [163]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

from torch.utils.data import DataLoader

max_words = 0
for t in x_train:
    max_words = max(max_words, len(vocab_wos(tokenizer(t))))

def vectorize_batch(batch):
    Y, X = list(zip(*batch))
    X = [vocab(tokenizer(text)) for text in X] ## Tokenize and map tokens to indexes
    X_len = [len(text) for text in X]
    X = [tokens+([0]* (max_words-len(tokens))) if len(tokens)<max_words else tokens[:max_words] for tokens in X] ## Bringing all samples to max_words length.
    return torch.tensor(X, dtype=torch.int32), torch.tensor(X_len), torch.tensor(Y)

In [164]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

train_dataset = list(map(lambda y, x: (y, x), y_train, x_train))
test_dataset = list(map(lambda y, x: (y, x), y_test, x_test))

train_loader = DataLoader(train_dataset, batch_size=1024, collate_fn=vectorize_batch, shuffle=True)
test_loader  = DataLoader(test_dataset, batch_size=1024, collate_fn=vectorize_batch)

In [166]:
print(len(vocab))

18559


In [167]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

from rnn import RNN
from tqdm import tqdm

NUM_CLASSES = 2

model = RNN(vocab, num_classes=NUM_CLASSES)
model.to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)
N_EPOCHS = 5

model.train()
for epoch in range(N_EPOCHS):
    total_loss = 0.0
    for X, X_len, Y in tqdm(train_loader):
      X = X.to(device)
      Y = Y.to(device)
      outputs = model(X, X_len)
      loss = criterion(outputs, Y)

      optimizer.zero_grad()
      loss.backward()
      optimizer.step()

      total_loss += loss.item()

    print("loss on epoch %i: %f" % (epoch, total_loss))

100%|███████████████████████████████████████████| 19/19 [00:06<00:00,  3.06it/s]


loss on epoch 0: 7.789461


100%|███████████████████████████████████████████| 19/19 [00:06<00:00,  3.10it/s]


loss on epoch 1: 2.756296


100%|███████████████████████████████████████████| 19/19 [00:06<00:00,  3.12it/s]


loss on epoch 2: 1.257725


100%|███████████████████████████████████████████| 19/19 [00:06<00:00,  2.86it/s]


loss on epoch 3: 0.584716


100%|███████████████████████████████████████████| 19/19 [00:06<00:00,  2.96it/s]

loss on epoch 4: 0.307369





In [168]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

from sklearn.metrics import accuracy_score

with torch.no_grad():
  Y_truth, Y_preds = [],[]
  for X, X_len, Y in test_loader:
    X = X.to(device)
    outputs = model(X, X_len)

    Y_truth.append(Y)
    Y_preds.append(outputs)

  Y_truth = torch.cat(Y_truth)
  Y_preds = torch.cat(Y_preds)

print("Test Accuracy on Clickbait Dataset using RNN  : {:.3f}".format(accuracy_score(Y_truth.cpu().detach().numpy(), F.softmax(Y_preds, dim=-1).argmax(dim=-1).cpu().detach().numpy())))

torch.Size([1024, 21, 150])
torch.Size([1024, 22, 150])
torch.Size([1024, 20, 150])
torch.Size([1024, 22, 150])
torch.Size([704, 20, 150])
Test Accuracy on Clickbait Dataset using RNN  : 0.956


Run the below cell to save the predictions. You will be required to upload the predictions on gradescope for evaluation.

In [169]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

preds = F.softmax(Y_preds, dim=-1).argmax(dim=-1).cpu().detach().numpy()

with open('rnn_clickbait.pkl', 'wb') as fp:
    pickle.dump(preds, fp)

## 3.3 : Classifying Web of Science Dataset using RNN [No Points]

Run the below cell to classify the WoS train and test dataset using the rnn functions that you have already implemented in 3.

An accuracy of more than 35% is acceptable.

In [170]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

max_words = 0
for t in x_train_wos:
    max_words = max(max_words, len(vocab_wos(tokenizer(t))))

def vectorize_batch(batch):
    Y, X = list(zip(*batch))
    X = [vocab_wos(tokenizer(text)) for text in X] ## Tokenize and map tokens to indexes
    X_len = [len(text) for text in X]
    X = [tokens+([0]* (max_words-len(tokens))) if len(tokens)<max_words else tokens[:max_words] for tokens in X] 
    return torch.tensor(X, dtype=torch.int32), torch.tensor(X_len), torch.tensor(Y)

In [171]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

train_dataset = list(map(lambda y, x: (y, x), y_train_wos, x_train_wos))
test_dataset = list(map(lambda y, x: (y, x), y_test_wos, x_test_wos))

train_loader = DataLoader(train_dataset, batch_size=128, collate_fn=vectorize_batch, shuffle=True)
test_loader  = DataLoader(test_dataset, batch_size=128, collate_fn=vectorize_batch)

In [173]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

from rnn import RNN
from tqdm import tqdm

NUM_CLASSES = 4

model = RNN(vocab_wos, num_classes=NUM_CLASSES)
model.to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
N_EPOCHS = 20

model.train()
for epoch in range(N_EPOCHS):
    total_loss = 0.0
    for X, X_len, Y in tqdm(train_loader):
      X = X.to(device)
      Y = Y.to(device)
      outputs = model(X, X_len)
      loss = criterion(outputs, Y)

      optimizer.zero_grad()
      loss.backward()
      optimizer.step()

      total_loss += loss.item()

    print("loss on epoch %i: %f" % (epoch, total_loss))

100%|███████████████████████████████████████████| 13/13 [00:50<00:00,  3.87s/it]


loss on epoch 0: 17.797240


100%|███████████████████████████████████████████| 13/13 [00:59<00:00,  4.55s/it]


loss on epoch 1: 16.947120


100%|███████████████████████████████████████████| 13/13 [01:02<00:00,  4.79s/it]


loss on epoch 2: 16.422907


100%|███████████████████████████████████████████| 13/13 [01:06<00:00,  5.10s/it]


loss on epoch 3: 15.834189


100%|███████████████████████████████████████████| 13/13 [00:56<00:00,  4.37s/it]


loss on epoch 4: 15.368607


100%|███████████████████████████████████████████| 13/13 [00:49<00:00,  3.78s/it]


loss on epoch 5: 14.741068


100%|███████████████████████████████████████████| 13/13 [00:48<00:00,  3.75s/it]


loss on epoch 6: 14.151179


100%|███████████████████████████████████████████| 13/13 [00:51<00:00,  3.99s/it]


loss on epoch 7: 13.562327


100%|███████████████████████████████████████████| 13/13 [00:51<00:00,  3.93s/it]


loss on epoch 8: 13.051048


100%|███████████████████████████████████████████| 13/13 [00:45<00:00,  3.51s/it]


loss on epoch 9: 12.492842


100%|███████████████████████████████████████████| 13/13 [00:48<00:00,  3.69s/it]


loss on epoch 10: 11.870239


100%|███████████████████████████████████████████| 13/13 [00:46<00:00,  3.61s/it]


loss on epoch 11: 11.135075


100%|███████████████████████████████████████████| 13/13 [00:48<00:00,  3.73s/it]


loss on epoch 12: 10.194510


100%|███████████████████████████████████████████| 13/13 [00:44<00:00,  3.43s/it]


loss on epoch 13: 9.688906


100%|███████████████████████████████████████████| 13/13 [00:45<00:00,  3.51s/it]


loss on epoch 14: 9.079505


100%|███████████████████████████████████████████| 13/13 [00:45<00:00,  3.53s/it]


loss on epoch 15: 8.269086


100%|███████████████████████████████████████████| 13/13 [00:45<00:00,  3.53s/it]


loss on epoch 16: 7.496048


100%|███████████████████████████████████████████| 13/13 [00:42<00:00,  3.29s/it]


loss on epoch 17: 7.187210


100%|███████████████████████████████████████████| 13/13 [00:45<00:00,  3.50s/it]


loss on epoch 18: 6.576881


100%|███████████████████████████████████████████| 13/13 [00:47<00:00,  3.64s/it]

loss on epoch 19: 5.942134





In [174]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

from sklearn.metrics import accuracy_score

with torch.no_grad():
  Y_truth, Y_preds = [],[]
  for X, X_len, Y in test_loader:
    X = X.to(device)
    outputs = model(X, X_len)

    Y_truth.append(Y)
    Y_preds.append(outputs)

  Y_truth = torch.cat(Y_truth)
  Y_preds = torch.cat(Y_preds)

print("Test Accuracy on WoS Dataset using RNN  : {:.3f}".format(accuracy_score(Y_truth.cpu().detach().numpy(), F.softmax(Y_preds, dim=-1).argmax(dim=-1).cpu().detach().numpy())))

Test Accuracy on WoS Dataset using RNN  : 0.405


Run the below cell to save the predictions. You will be required to upload the predictions on gradescope for evaluation.


In [175]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

preds = F.softmax(Y_preds, dim=-1).argmax(dim=-1).cpu().detach().numpy()

with open('rnn_wos.pkl', 'wb') as fp:
    pickle.dump(preds, fp)

**NOTE** : RNN alone is not able to perform good on the WoS dataset and that can be attributed to the very limited data with large vocabulary and lack of embedding structure.