# Summer 2023 Applied NLP Homework 4

## Instructors: Dr. Mahdi Roozbahani and Wafa Louhichi

## Deadline: July 25th, 11:59PM AoE

## Honor Code and Assignment Deadline
<!-- No changes needed on the below section -->
* No unapproved extension of the deadline is allowed. Late submission will lead to 0 credit. 

* Discussion is encouraged on Ed as part of the Q/A. However, all assignments should be done individually.
<font color='darkred'>
* Plagiarism is a **serious offense**. You are responsible for completing your own work. You are not allowed to copy and paste, or paraphrase, or submit materials created or published by others, as if you created the materials. All materials submitted must be your own.</font>
<font color='darkred'>
* All incidents of suspected dishonesty, plagiarism, or violations of the Georgia Tech Honor Code will be subject to the institute’s Academic Integrity procedures. If we observe any (even small) similarities/plagiarisms detected by Gradescope or our TAs, **WE WILL DIRECTLY REPORT ALL CASES TO OSI**, which may, unfortunately, lead to a very harsh outcome. **Consequences can be severe, e.g., academic probation or dismissal, grade penalties, a 0 grade for assignments concerned, and prohibition from withdrawing from the class.**
</font>


## Instructions for the assignment 

<!-- No changes needed on the below section -->
- This entire assignment will be autograded through Gradescope.

- We provide several .py files that already import the required libraries. Please DO NOT remove those import lines; add your code after them. Note that these are the only libraries you are allowed to use for the homework.

- You will submit your implemented .py files to the corresponding homework section on Gradescope. 

- You may submit as many times as you like before the deadline. Additionally, note that the autograder tests each function separately, so it can serve as a useful tool to help you debug your code if you are not sure which part of your implementation has an issue.


## Using the local tests <a id='using_local_tests'></a>
- For some of the programming questions we have included a local test using a small toy dataset to aid in debugging. The local test sample data and outputs are stored in .py files in the **local_tests** folder.
- There are no points associated with passing or failing the local tests; you must still pass the autograder to get points.
- **It is possible to fail the local test and pass the autograder**, since the autograder has a certain allowed error tolerance while the local test's allowed error may be smaller. Likewise, passing the local tests does not guarantee passing the autograder.
- **You do not need to pass both local and autograder tests to get points; passing the Gradescope autograder is sufficient for credit.**
- It might be helpful to comment out the tests for functions that have not been completed yet.
- It is recommended to test each function as it is completed instead of completing the whole class and then testing. This may help in isolating errors. Do not rely solely on the local tests; continue to test on the autograder regularly as well.

# Google Colab Setup (Optional for running on Colab)
You may need to right-click on the Applied NLP folder and select `Add shortcut to Drive`.

In [None]:
# Mount google drive
from google.colab import drive
drive.mount('/content/drive/')

## Change path to directory of where notebook is located
%cd '/content/drive/MyDrive/Applied_NLP/HW4/hw4_code/'

## If no GPU selected it will ask for GPU to be selected
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Select the Runtime > "Change runtime type" menu to enable a GPU accelerator, ')
  print('and then re-execute this cell.')
else:
  print(gpu_info)


## This wraps output text according to the window size
from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

# Assignment Overview

In this homework we will explore non-linear text classification algorithms using deep neural networks.

We will reuse the datasets from HW3 for this exploration:
* The first dataset is a subset of a [Clickbait Dataset](https://github.com/bhargaviparanjape/clickbait/tree/master/dataset) that has article headlines and a binary label on whether the headline is considered clickbait. 
* The second dataset is a subset of the [Web of Science Dataset](https://data.mendeley.com/datasets/9rw3vkcfy4/6) that has articles and a corresponding label on the domain of the articles.

We will first explore LSTM and GRU. Then we will look into attention-based architectures (Transformers) and apply them to sequence labeling tasks: Part-of-Speech (POS) tagging and Named Entity Recognition (NER). We will also look into topic modeling.

## Deliverables and Points Distribution

### Q1: LSTM [15pts]
- **Classification with LSTM** [15pts] Deliverables: <font color = 'green'>lstm.py, lstm_clickbait.pkl, lstm_wos.pkl</font>

    - [3pts] \__init__

    - [4pts] forward

    - [4pts] lstm_clickbait.pkl

    - [4pts] lstm_wos.pkl

### Q2: LSTM with Attention [15pts]
- **Classification with LSTM + Attention** [15pts] Deliverables: <font color = 'green'>attention.py, attention_clickbait.pkl, attention_wos.pkl</font>

    - [3pts] \__init__

    - [4pts] forward

    - [4pts] attention_clickbait.pkl

    - [4pts] attention_wos.pkl

### Q3: Transformers [15pts]
- **Classification with BERT** [15pts] Deliverables: <font color = 'green'>bert.py, bert_clickbait.pkl, bert_wos.pkl</font>

    - [3pts] \__init__ 

    - [6pts] bert_clickbait.pkl

    - [6pts] bert_wos.pkl 


### Q4: Sequence Labeling [15pts]
- **Parts of Speech (POS) Tagging and Named Entity Recognition (NER)** [15pts] Deliverables: <font color = 'green'>sequenceLabeling.py, bert_pos.pkl, bert_ner.pkl</font>

    - [3pts] \__init__

    - [6pts] bert_pos.pkl

    - [6pts] bert_ner.pkl

### Q5: Topic Modeling [20pts]
- **Topic modeling with Latent Dirichlet Allocation (LDA)** [20pts] Deliverables: <font color = 'green'>lda.py</font>

    - [5pts] tokenize_words

    - [5pts] remove_stopwords

    - [5pts] create_dictionary
    
    - [5pts] build_LDAModel



# Setup
This notebook is tested under [python 3. * . *](https://www.python.org/downloads/release/python-368/), and the corresponding packages can be downloaded from [miniconda](https://docs.conda.io/en/latest/miniconda.html). You may also want to familiarize yourself with several packages:

- [jupyter notebook](https://jupyter-notebook.readthedocs.io/en/stable/)
- [numpy](https://docs.scipy.org/doc/numpy-1.15.1/user/quickstart.html)
- [sklearn](https://scikit-learn.org/stable/)
- [pytorch](https://pytorch.org/)

In the .py files please implement the functions that have `raise NotImplementedError`, and after you finish the coding, please delete or comment out `raise NotImplementedError`.

## Library imports

In [1]:
!pip install transformers
!pip install datasets
#!pip install gensim==3.8.3 
!pip install pyLDAvis





In [2]:
#Import the necessary libraries
import pandas as pd
import pickle
import numpy as np
import scipy as sp
import sys
import re
from copy import deepcopy
import random
from sklearn.metrics import accuracy_score
import torch
import torch.nn as nn
from torch import optim
torch.manual_seed(10)
from torch.autograd import Variable
import torch.nn.functional as F
from torch.utils.data import DataLoader
import transformers
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

#import pyLDAvis
#import pyLDAvis.gensim_models
import pickle 

import warnings
warnings.filterwarnings("ignore")

%load_ext autoreload
%autoreload 2
%reload_ext autoreload

print('Version information')

print('python: {}'.format(sys.version))
print('numpy: {}'.format(np.__version__))

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/nickdinapoli/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Version information
python: 3.9.17 (main, Jul  5 2023, 16:17:03) 
[Clang 14.0.6 ]
numpy: 1.25.0


In [None]:
import pyLDAvis
import pyLDAvis.gensim_models

# Load Dataset


We start by loading both data sets already split into an 80/20 train and test set.

In [3]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

df_train = pd.read_csv('./data/train.csv')
df_test = pd.read_csv('./data/test.csv')

# Separate dataframes into train and test lists
x_train, y_train = list(df_train['headline']), list(df_train['label'])
x_test, y_test = list(df_test['headline']), list(df_test['label'])

Below is the number of headlines in the train and test sets, as well as a sample of the article headlines and their binary labels, where 0 is not clickbait and 1 is clickbait.

In [4]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

print(f'Number of Train Headlines: {len(x_train)}')
print(f'Number of Test Headlines: {len(x_test)}')

print('\n\nSample Label and Headlines:')
x = 105
for label, line in zip(y_train[x:x+5], x_train[x:x+5]):
    print(f'{label}: {line}')
    
print('\nOutput of Sample Headlines without Print Statement:')
x_train[x:x+5]

Number of Train Headlines: 19200
Number of Test Headlines: 4800


Sample Label and Headlines:
1: 27 Breathtaking Alternatives To A Traditional Wedding Bouquet <br>

1: 22 Pictures People Who Aren't Grad Students Will <strong>Never</strong> Understand

0: PepsiCo Profit Falls 43 Percent

0: Website of Bill O'Reilly, FOX News commentator, hacked in retribution

1: The Green Toy Soldiers From Your Childhood Now Come In Baller Yoga Poses A


Output of Sample Headlines without Print Statement:


['27 Breathtaking Alternatives To A Traditional Wedding Bouquet <br>\n',
 "22 Pictures People Who Aren't Grad Students Will <strong>Never</strong> Understand\n",
 'PepsiCo Profit Falls 43 Percent\n',
 "Website of Bill O'Reilly, FOX News commentator, hacked in retribution\n",
 'The Green Toy Soldiers From Your Childhood Now Come In Baller Yoga Poses A\n']

In [5]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

# Load the WoS train and test sets from csv
df_train_wos = pd.read_csv('./data/train_wos.csv')
df_test_wos = pd.read_csv('./data/test_wos.csv')

# Separate dataframes into train and test lists
x_train_wos, y_train_wos = list(df_train_wos['article']), list(df_train_wos['label'])
x_test_wos, y_test_wos = list(df_test_wos['article']), list(df_test_wos['label'])

# Numerical label to domain mapping
wos_label = {0:'CS', 1:'ECE', 2:'Civil', 3:'Medical'}
# Numerical label to Numerical mapping
label_mapping = {0:0, 1:1, 4:2, 5:3}

for i, label in enumerate(y_train_wos):
    y_train_wos[i] = label_mapping[label]
for i, label in enumerate(y_test_wos):
    y_test_wos[i] = label_mapping[label]
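
The remapping in the cell above matters because `nn.CrossEntropyLoss` expects class indices in the range `0` to `num_classes - 1`, while this WoS subset uses the raw labels 0, 1, 4, and 5. Here is a minimal sketch of the same idea on a made-up label list (the `raw_labels` values are illustrative and not taken from the dataset):

```python
# Map sparse raw labels onto contiguous class indices 0..K-1.
label_mapping = {0: 0, 1: 1, 4: 2, 5: 3}

raw_labels = [5, 0, 4, 1, 4]  # illustrative values only
remapped = [label_mapping[label] for label in raw_labels]

# The number of classes is the number of distinct target indices.
num_classes = len(set(label_mapping.values()))

print(remapped)     # [3, 0, 2, 1, 2]
print(num_classes)  # 4
```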

In [6]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

print(f'Number of Train Articles: {len(x_train_wos)}')
print(f'Number of Test Articles: {len(x_test_wos)}')

print('\nLabel Key:', wos_label)

print('\nSample Label and Articles:\n')
x = 107
for label, line in zip(y_train_wos[x:x+3], x_train_wos[x:x+3]):
    print(f'{label} - {wos_label[label]}: {line}')

Number of Train Articles: 1600
Number of Test Articles: 400

Label Key: {0: 'CS', 1: 'ECE', 2: 'Civil', 3: 'Medical'}

Sample Label and Articles:

0 - CS: An efficient procedure for calculating the electromagnetic fields in multilayered cylindrical structures is reported in this paper. Using symbolic computation, spectral Green's functions, suitable for numerical implementations are determined in compact and closed forms. Applications are presented for structures with two dielectric layers.

1 - ECE: A multifunctional platform based on the microhotplate was developed for applications including a Pirani vacuum gauge, temperature, and gas sensor. It consisted of a tungsten microhotplate and an on-chip operational amplifier. The platform was fabricated in a standard complementary metal oxide semiconductor (CMOS) process. A tungsten plug in standard CMOS process was specially designed as the serpentine resistor for the microhotplate, acting as both heater and thermister. With the sacrifici

## Q1: Classification with LSTM [15pts]

In the **lstm.py** file complete the following functions:

* **\__init__**
* **forward**

We will be using an LSTM (Long Short-Term Memory) model for classification. The architecture of our model looks like this:

<p align="center"><img src="https://www.tensorflow.org/static/text/tutorials/images/bidirectional.png" width="75%" align="center"></p>

For more details on LSTM, please refer to class lectures.

Use an Embedding layer, followed by an LSTM layer and a linear layer.

We will then classify the Clickbait and Web of Science datasets for this task.
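
To build intuition for what the LSTM layer computes at each position, here is a dependency-free sketch of a single LSTM time step with scalar input, hidden state, and cell state. It is an illustration only, not the `lstm.py` solution: the function name `lstm_step`, the `params` layout, and all weight values are assumptions made for this example. In PyTorch the same recurrence, in vector form, is provided by `torch.nn.LSTM`.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step for a scalar input, hidden state, and cell state.

    `params` maps each gate name to a (w_x, w_h, bias) triple. This layout is
    a simplification chosen for the sketch.
    """
    def pre(gate):
        w_x, w_h, b = params[gate]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("input"))        # how much new information to write
    f = sigmoid(pre("forget"))       # how much of the old cell state to keep
    o = sigmoid(pre("output"))       # how much of the cell state to expose
    g = math.tanh(pre("candidate"))  # candidate value to write

    c = f * c_prev + i * g           # updated cell state
    h = o * math.tanh(c)             # updated hidden state
    return h, c

# Hypothetical weights, identical for every gate to keep the example short.
gates = ("input", "forget", "output", "candidate")
params = {name: (0.5, 0.5, 0.0) for name in gates}

h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:  # a toy "sequence" of scalar embeddings
    h, c = lstm_step(x, h, c, params)

print(h, c)
```

A classifier would typically pass the final hidden state (here `h`) through a linear layer to obtain one score per class.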

### 1.1 : Pre-Processing Data [No Points]

Run the cells below to load the functions for building the vocabulary and tokenizing the sentences.

In [None]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

from torchtext.data import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
import torchtext
print(torchtext.__version__)

tokenizer = get_tokenizer("basic_english")

def build_vocabulary(datasets):
  for dataset in datasets:
    for text in dataset:
      yield tokenizer(text)

vocab = build_vocab_from_iterator(build_vocabulary([x_train]), min_freq=1, specials=["<UNK>"])
vocab.set_default_index(vocab["<UNK>"])

vocab_wos = build_vocab_from_iterator(build_vocabulary([x_train_wos]), min_freq=1, specials=["<UNK>"])
vocab_wos.set_default_index(vocab_wos["<UNK>"])

In [None]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

from torch.utils.data import DataLoader

max_words = max(map(len, x_train))

def vectorize_batch(batch):
    Y, X = list(zip(*batch))
    X = [vocab(tokenizer(text)) for text in X] ## Tokenize and map tokens to indexes
    X_len = [len(text) for text in X]
    X = [tokens+([0]* (max_words-len(tokens))) if len(tokens)<max_words else tokens[:max_words] for tokens in X] ## Bringing all samples to max_words length.
    return torch.tensor(X, dtype=torch.int32), torch.tensor(X_len), torch.tensor(Y)
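
The padding and truncation step inside `vectorize_batch` can be sketched in isolation as follows. The helper name `pad_or_truncate` and the token ids are made up for illustration. Two details of the notebook cell are worth noting: `max_words` is computed with `max(map(len, x_train))`, which is the length of the longest text in characters and therefore acts as a generous upper bound rather than an exact token count, and the padding index `0` is also the index of `<UNK>`, since that is the only special token in the vocabulary.

```python
def pad_or_truncate(token_ids, max_len, pad_index=0):
    """Return a list of exactly `max_len` ids."""
    if len(token_ids) < max_len:
        return token_ids + [pad_index] * (max_len - len(token_ids))
    return token_ids[:max_len]

batch = [[7, 3, 9], [4], [5, 8, 2, 6, 1, 10]]  # made-up token ids
lengths = [len(ids) for ids in batch]          # lengths before padding
padded = [pad_or_truncate(ids, max_len=4) for ids in batch]

print(lengths)  # [3, 1, 6]
print(padded)   # [[7, 3, 9, 0], [4, 0, 0, 0], [5, 8, 2, 6]]
```

Keeping the original lengths alongside the padded ids is what lets a model ignore the padded positions later.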


In [None]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

# Use cuda if present
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Device available for running: ")
print(device)

### 1.2 : Classifying Clickbait Dataset using LSTM [No Points]

Run the cells below to classify the Clickbait train and test datasets using the LSTM functions that you implemented in Q1.

An accuracy of more than 85% is acceptable.


In [None]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

train_dataset = list(map(lambda y, x: (y, x), y_train, x_train))
test_dataset = list(map(lambda y, x: (y, x), y_test, x_test))

train_loader = DataLoader(train_dataset, batch_size=1024, collate_fn=vectorize_batch, shuffle=True)
test_loader  = DataLoader(test_dataset, batch_size=1024, collate_fn=vectorize_batch)

In [None]:
vocab.__len__()

In [None]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

from lstm import LSTM
from tqdm import tqdm

NUM_CLASSES = 2

model = LSTM(vocab, num_classes=NUM_CLASSES)
model.to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)
N_EPOCHS = 5

model.train()
for epoch in range(N_EPOCHS):
    total_loss = 0.0
    for X, X_len, Y in tqdm(train_loader):
      X = X.to(device)
      Y = Y.to(device)
      outputs = model(X, X_len)
      loss = criterion(outputs, Y)

      optimizer.zero_grad()
      loss.backward()
      optimizer.step()

      total_loss += loss.item()

    print("loss on epoch %i: %f" % (epoch, total_loss))

In [None]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

from sklearn.metrics import accuracy_score

with torch.no_grad():
  Y_truth, Y_preds = [],[]
  for X, X_len, Y in test_loader:
    X = X.to(device)
    outputs = model(X, X_len)

    Y_truth.append(Y)
    Y_preds.append(outputs)

  Y_truth = torch.cat(Y_truth)
  Y_preds = torch.cat(Y_preds)

print("Test Accuracy on Clickbait Dataset using LSTM  : {:.3f}".format(accuracy_score(Y_truth.cpu().detach().numpy(), F.softmax(Y_preds, dim=-1).argmax(dim=-1).cpu().detach().numpy())))
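
The evaluation cell above applies `F.softmax` before `argmax`. Because softmax preserves the ordering of the scores, the predicted class is the same whether or not softmax is applied. The following dependency-free sketch illustrates this on made-up logits; the helpers `softmax`, `argmax`, and `accuracy` are hypothetical stand-ins written for this example, not the PyTorch or scikit-learn functions.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

def accuracy(y_true, y_pred):
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

logits_batch = [[2.0, -1.0], [0.1, 0.3], [-0.5, -0.2]]  # made-up scores
y_true = [0, 1, 0]

preds_from_probs = [argmax(softmax(row)) for row in logits_batch]
preds_from_logits = [argmax(row) for row in logits_batch]

print(preds_from_probs == preds_from_logits)  # True
print(accuracy(y_true, preds_from_probs))
```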

Run the cell below to save the predictions. You will be required to upload the predictions to Gradescope for evaluation.


In [None]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

preds = F.softmax(Y_preds, dim=-1).argmax(dim=-1).cpu().detach().numpy()

with open('lstm_clickbait.pkl', 'wb') as fp:
    pickle.dump(preds, fp)
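
Before uploading, it can be worth reloading the pickle file and checking that it contains one prediction per test example (4800 for the Clickbait test set). Here is a minimal sketch of such a round-trip check, using a short Python list as a stand-in for the prediction array and a hypothetical file name in a temporary directory:

```python
import os
import pickle
import tempfile

preds = [1, 0, 0, 1]  # stand-in for the prediction array

# Hypothetical file name; the notebook writes 'lstm_clickbait.pkl' instead.
path = os.path.join(tempfile.mkdtemp(), "example_preds.pkl")
with open(path, "wb") as fp:
    pickle.dump(preds, fp)

with open(path, "rb") as fp:
    restored = pickle.load(fp)

print(len(restored), restored == preds)  # 4 True
```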

### 1.3 : Classifying Web of Science Dataset using LSTM [No Points]

Run the cells below to classify the Web of Science train and test datasets using the LSTM functions that you implemented in Q1.

An accuracy of more than 45% is acceptable.

In [None]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

max_words = max(map(len, x_train_wos))

def vectorize_batch(batch):
    Y, X = list(zip(*batch))
    X = [vocab_wos(tokenizer(text)) for text in X] ## Tokenize and map tokens to indexes
    X_len = [len(text) for text in X]
    X = [tokens+([0]* (max_words-len(tokens))) if len(tokens)<max_words else tokens[:max_words] for tokens in X] 
    return torch.tensor(X, dtype=torch.int32), torch.tensor(X_len), torch.tensor(Y)

In [None]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

train_dataset = list(map(lambda y, x: (y, x), y_train_wos, x_train_wos))
test_dataset = list(map(lambda y, x: (y, x), y_test_wos, x_test_wos))

train_loader = DataLoader(train_dataset, batch_size=128, collate_fn=vectorize_batch, shuffle=True)
test_loader  = DataLoader(test_dataset, batch_size=128, collate_fn=vectorize_batch)

In [None]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

from lstm import LSTM
from tqdm import tqdm

NUM_CLASSES = 4

model = LSTM(vocab_wos, num_classes=NUM_CLASSES)
model.to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
N_EPOCHS = 20

model.train()
for epoch in range(N_EPOCHS):
    total_loss = 0.0
    for X, X_len, Y in tqdm(train_loader):
      X = X.to(device)
      Y = Y.to(device)
      outputs = model(X, X_len)
      loss = criterion(outputs, Y)

      optimizer.zero_grad()
      loss.backward()
      optimizer.step()

      total_loss += loss.item()

    print("loss on epoch %i: %f" % (epoch, total_loss))

In [None]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

from sklearn.metrics import accuracy_score

with torch.no_grad():
  Y_truth, Y_preds = [],[]
  for X, X_len, Y in test_loader:
    X = X.to(device)
    outputs = model(X, X_len)

    Y_truth.append(Y)
    Y_preds.append(outputs)

  Y_truth = torch.cat(Y_truth)
  Y_preds = torch.cat(Y_preds)

print("Test Accuracy on WoS Dataset using LSTM  : {:.3f}".format(accuracy_score(Y_truth.cpu().detach().numpy(), F.softmax(Y_preds, dim=-1).argmax(dim=-1).cpu().detach().numpy())))

Run the cell below to save the predictions. You will be required to upload the predictions to Gradescope for evaluation.



In [None]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

preds = F.softmax(Y_preds, dim=-1).argmax(dim=-1).cpu().detach().numpy()

with open('lstm_wos.pkl', 'wb') as fp:
    pickle.dump(preds, fp)

**NOTE**: An LSTM alone does not perform well on the WoS dataset, which can be attributed to the very limited data, the large vocabulary, and the lack of embedding structure.

## Q2: Classification with LSTM + Attention [15pts]

In the **attention.py** file complete the following functions:

* **\__init__**
* **forward**

A potential issue with the vanilla LSTM approach is that the neural network needs to compress all the necessary information of a source sentence into a fixed-length vector. This may make it difficult for the network to cope with long sentences, especially those that are longer than the sentences in the training corpus. Unlike the vanilla approach, an attention mechanism lets the model look at all the hidden states of the sequence when making predictions.
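
As a rough illustration of looking at all hidden states, the sketch below scores every hidden state against a query vector, turns the scores into weights with a softmax, and returns the weighted average as a context vector. This is one common formulation (dot-product attention), not necessarily the one expected in `attention.py`; using the final hidden state as the query, the function name `attention_pool`, and the toy values are all assumptions made for this example.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attention_pool(hidden_states, query):
    """Weighted average of `hidden_states`, weighted by similarity to `query`."""
    weights = softmax([dot(h, query) for h in hidden_states])
    dim = len(query)
    context = [sum(w * h[k] for w, h in zip(weights, hidden_states))
               for k in range(dim)]
    return context, weights

# Toy hidden states for a 3-token sequence (hidden size 2), made up for illustration.
hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = hidden_states[-1]  # assumption: use the final hidden state as the query

context, weights = attention_pool(hidden_states, query)
print(weights)  # the last position receives the largest weight
print(context)
```

The context vector has the same size as a single hidden state, so it can be fed to a linear layer in the same way as the final hidden state of a plain LSTM.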

In this task, we will be implementing LSTM with Attention.

<p align="center"><img src="https://miro.medium.com/max/1400/1*YM4T-QSJIIPQUlMOO_gnzw.png" width="75%" align="center"></p>

Please refer to lecture for more details.

You will extend the LSTM model and incorporate attention on top of it.

We will then classify the Clickbait and Web of Science datasets for this task.






### 2.1 : Classifying Clickbait Dataset using LSTM with Attention [No Points]

Run the cells below to classify the Clickbait train and test datasets using the attention functions that you implemented in Q2.

An accuracy of more than 88% is acceptable.

In [None]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

max_words = max(map(len, x_train))

def vectorize_batch(batch):
    Y, X = list(zip(*batch))
    X = [vocab(tokenizer(text)) for text in X] ## Tokenize and map tokens to indexes
    X_len = [len(text) for text in X]
    X = [tokens+([0]* (max_words-len(tokens))) if len(tokens)<max_words else tokens[:max_words] for tokens in X] ## Bringing all samples to max_words length.
    return torch.tensor(X, dtype=torch.int32), torch.tensor(X_len), torch.tensor(Y)

In [None]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

train_dataset = list(map(lambda y, x: (y, x), y_train, x_train))
test_dataset = list(map(lambda y, x: (y, x), y_test, x_test))

train_loader = DataLoader(train_dataset, batch_size=1024, collate_fn=vectorize_batch, shuffle=True)
test_loader  = DataLoader(test_dataset, batch_size=1024, collate_fn=vectorize_batch)

In [None]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

from attention import Attention
from tqdm import tqdm

NUM_CLASSES = 2

model = Attention(vocab, num_classes=NUM_CLASSES)
model.to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)
N_EPOCHS = 5

model.train()
for epoch in range(N_EPOCHS):
    total_loss = 0.0
    for X, X_len, Y in tqdm(train_loader):
      X = X.to(device)
      Y = Y.to(device)
      outputs = model(X, X_len)
      loss = criterion(outputs, Y)

      optimizer.zero_grad()
      loss.backward()
      optimizer.step()

      total_loss += loss.item()

    print("loss on epoch %i: %f" % (epoch, total_loss))

In [None]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

from sklearn.metrics import accuracy_score

with torch.no_grad():
  Y_truth, Y_preds = [],[]
  for X, X_len, Y in test_loader:
    X = X.to(device)
    outputs = model(X, X_len)

    Y_truth.append(Y)
    Y_preds.append(outputs)

  Y_truth = torch.cat(Y_truth)
  Y_preds = torch.cat(Y_preds)

print("Test Accuracy on Clickbait Dataset using LSTM with Attention  : {:.3f}".format(accuracy_score(Y_truth.cpu().detach().numpy(), F.softmax(Y_preds, dim=-1).argmax(dim=-1).cpu().detach().numpy())))

Run the cell below to save the predictions. You will be required to upload the predictions to Gradescope for evaluation.


In [None]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

preds = F.softmax(Y_preds, dim=-1).argmax(dim=-1).cpu().detach().numpy()

with open('attention_clickbait.pkl', 'wb') as fp:
    pickle.dump(preds, fp)

### 2.2 : Classifying Web of Science Dataset using LSTM with Attention [No Points]

Run the cells below to classify the Web of Science train and test datasets using the attention functions that you implemented in Q2.

An accuracy of more than 50% is acceptable.


In [None]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

max_words = max(map(len, x_train_wos))

def vectorize_batch(batch):
    Y, X = list(zip(*batch))
    X = [vocab_wos(tokenizer(text)) for text in X] ## Tokenize and map tokens to indexes
    X_len = [len(text) for text in X]
    X = [tokens+([0]* (max_words-len(tokens))) if len(tokens)<max_words else tokens[:max_words] for tokens in X] ## Bringing all samples to max_words length.
    return torch.tensor(X, dtype=torch.int32), torch.tensor(X_len), torch.tensor(Y)

In [None]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

train_dataset = list(map(lambda y, x: (y, x), y_train_wos, x_train_wos))
test_dataset = list(map(lambda y, x: (y, x), y_test_wos, x_test_wos))

train_loader = DataLoader(train_dataset, batch_size=128, collate_fn=vectorize_batch, shuffle=True)
test_loader  = DataLoader(test_dataset, batch_size=128, collate_fn=vectorize_batch)

In [None]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

from attention import Attention
from tqdm import tqdm

NUM_CLASSES = 4

model = Attention(vocab_wos, num_classes=NUM_CLASSES)
model.to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
N_EPOCHS = 25

model.train()
for epoch in range(N_EPOCHS):
    total_loss = 0.0
    for X, X_len, Y in tqdm(train_loader):
      X = X.to(device)
      Y = Y.to(device)
      outputs = model(X, X_len)
      loss = criterion(outputs, Y)

      optimizer.zero_grad()
      loss.backward()
      optimizer.step()

      total_loss += loss.item()

    print("loss on epoch %i: %f" % (epoch, total_loss))

In [None]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

from sklearn.metrics import accuracy_score

with torch.no_grad():
  Y_truth, Y_preds = [],[]
  for X, X_len, Y in test_loader:
    X = X.to(device)
    outputs = model(X, X_len)

    Y_truth.append(Y)
    Y_preds.append(outputs)

  Y_truth = torch.cat(Y_truth)
  Y_preds = torch.cat(Y_preds)

print("Test Accuracy on WoS Dataset using LSTM with Attention : {:.3f}".format(accuracy_score(Y_truth.cpu().detach().numpy(), F.softmax(Y_preds, dim=-1).argmax(dim=-1).cpu().detach().numpy())))

Run the cell below to save the predictions. You will be required to upload the predictions to Gradescope for evaluation.



In [None]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

preds = F.softmax(Y_preds, dim=-1).argmax(dim=-1).cpu().detach().numpy()

with open('attention_wos.pkl', 'wb') as fp:
    pickle.dump(preds, fp)

## Q3: Classification with BERT [15pts]

In the **bert.py** file complete the following functions:

* **\__init__**
* **forward**

The transformer neural network is a novel architecture that aims to solve sequence-to-sequence tasks while handling long-range dependencies with ease. In a transformer, we can pass in all the words of a sentence at once and compute their embeddings simultaneously.

<p align="center"><img src="https://d2l.ai/_images/bert-one-seq.svg" width="75%" align="center"></p>

We will be using BERT (Bidirectional Encoder Representations from Transformers) pre-trained models for embeddings. BERT architecture consists of several Transformer encoders stacked together. Each Transformer encoder encapsulates two sub-layers: a self-attention layer and a feed-forward layer. 
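
To make the self-attention sub-layer more concrete, here is a dependency-free sketch of scaled dot-product self-attention. It is deliberately simplified: queries, keys, and values are all set to the input vectors themselves, whereas a real Transformer encoder applies learned projections and uses multiple heads. The function name `self_attention` and the toy inputs are assumptions made for this example.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Scaled dot-product self-attention with Q = K = V = X (no learned projections)."""
    d = len(X[0])
    outputs = []
    for q in X:
        # Score this token against every token in the sequence, itself included.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        weights = softmax(scores)
        # Each output is a weighted mix of all token vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)])
    return outputs

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # made-up 2-dimensional token vectors
for row in self_attention(tokens):
    print(row)
```

Every output row depends on all input rows, which is how each token's representation can draw on context from both directions.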

The details of BERT can be found in the paper: [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/pdf/1810.04805.pdf).

We will be using the BERT embeddings and a fully connected linear layer to perform classification.

We will then classify the Clickbait and Web of Science datasets for this task.


### 3.1 : Initialize Tokenizer [No Points]

Run the cell below to initialize the tokenizer for the pretrained BERT model.

In [None]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

### 3.2 : Classifying Clickbait Dataset using BERT [No Points]

Run the cells below to classify the Clickbait train and test datasets using the BERT functions that you implemented in Q3.

An accuracy of more than 90% is acceptable.

In [None]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

def vectorize_batch(batch):
    Y, X = list(zip(*batch))
    X = tokenizer(X, padding='max_length', max_length = 128, truncation=True, return_tensors="pt")
    input_ids, attention_mask = X['input_ids'], X['attention_mask']
    return input_ids, attention_mask, torch.tensor(Y)
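
The `input_ids` and `attention_mask` returned by the tokenizer can be pictured with the simplified sketch below. It mimics the padding and truncation for a single sequence by hand: special tokens are added, the sequence is cut to fit `max_length`, and padded positions get a mask value of 0. The helper `encode` and the word-piece ids are hypothetical, and the real tokenizer does considerably more (word-piece splitting, batching, tensor conversion).

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token ids in bert-base-uncased

def encode(wordpiece_ids, max_length):
    """Add [CLS]/[SEP], truncate, pad to `max_length`, and build the mask."""
    body = wordpiece_ids[: max_length - 2]  # leave room for [CLS] and [SEP]
    ids = [CLS_ID] + body + [SEP_ID]
    mask = [1] * len(ids)                   # 1 marks a real token
    padding = max_length - len(ids)
    return ids + [PAD_ID] * padding, mask + [0] * padding  # 0 marks padding

input_ids, attention_mask = encode([7592, 2088], max_length=6)  # illustrative ids
print(input_ids)       # [101, 7592, 2088, 102, 0, 0]
print(attention_mask)  # [1, 1, 1, 1, 0, 0]
```

The mask is what allows the model to ignore padded positions when every sequence in the batch is padded to the same length (128 in the notebook).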

In [None]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

train_dataset = list(map(lambda y, x: (y, x), y_train, x_train))
test_dataset = list(map(lambda y, x: (y, x), y_test, x_test))

train_loader = DataLoader(train_dataset, batch_size=32, collate_fn=vectorize_batch, shuffle=True)
test_loader  = DataLoader(test_dataset, batch_size=32, collate_fn=vectorize_batch)

In [None]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

from bert import BERTClassifier
from tqdm import tqdm

NUM_CLASSES = 2

model = BERTClassifier(num_classes=NUM_CLASSES)
model.to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.0001)
N_EPOCHS = 3

model.train()
for epoch in range(N_EPOCHS):
    total_loss = 0.0
    for X, X_mask, Y in tqdm(train_loader):
        X = X.to(device)
        X_mask = X_mask.to(device)
        Y = Y.to(device)
        outputs = model(X, X_mask)
        loss = criterion(outputs, Y)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    print("loss on epoch %i: %f" % (epoch, total_loss))

In [None]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

from sklearn.metrics import accuracy_score

with torch.no_grad():
  Y_truth, Y_preds = [],[]
  for X, X_mask, Y in test_loader:
    X = X.to(device)
    X_mask = X_mask.to(device)
    outputs = model(X, X_mask)

    Y_truth.append(Y)
    Y_preds.append(outputs)

  Y_truth = torch.cat(Y_truth)
  Y_preds = torch.cat(Y_preds)

print("Test Accuracy on Clickbait Dataset using BERT : {:.3f}".format(accuracy_score(Y_truth.cpu().detach().numpy(), F.softmax(Y_preds, dim=-1).argmax(dim=-1).cpu().detach().numpy())))

Run the cell below to save the predictions. You will be required to upload the predictions to Gradescope for evaluation.

In [None]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

preds = F.softmax(Y_preds, dim=-1).argmax(dim=-1).cpu().detach().numpy()

with open('bert_clickbait.pkl', 'wb') as fp:
    pickle.dump(preds, fp)

### 3.3 : Classifying Web of Science Dataset using BERT [No Points]

Run the cells below to classify the Web of Science train and test datasets using the BERT functions that you implemented in Q3.

An accuracy of more than 70% is acceptable.


In [None]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

train_dataset = list(map(lambda y, x: (y, x), y_train_wos, x_train_wos))
test_dataset = list(map(lambda y, x: (y, x), y_test_wos, x_test_wos))

train_loader = DataLoader(train_dataset, batch_size=32, collate_fn=vectorize_batch, shuffle=True)
test_loader  = DataLoader(test_dataset, batch_size=32, collate_fn=vectorize_batch)

In [None]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

from bert import BERTClassifier
from tqdm import tqdm

NUM_CLASSES = 4

model = BERTClassifier(num_classes=NUM_CLASSES)
model.to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.0001)
N_EPOCHS = 3

model.train()
for epoch in range(N_EPOCHS):
    total_loss = 0.0
    for X, X_mask, Y in tqdm(train_loader):
        X = X.to(device)
        X_mask = X_mask.to(device)
        Y = Y.to(device)
        outputs = model(X, X_mask)
      
        loss = criterion(outputs, Y)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    print("loss on epoch %i: %f" % (epoch, total_loss))

In [None]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

from sklearn.metrics import accuracy_score

with torch.no_grad():
  Y_truth, Y_preds = [],[]
  for X, X_mask, Y in test_loader:
    X = X.to(device)
    X_mask = X_mask.to(device)
    outputs = model(X, X_mask)

    Y_truth.append(Y)
    Y_preds.append(outputs)

  Y_truth = torch.cat(Y_truth)
  Y_preds = torch.cat(Y_preds)

print("Test Accuracy on Web of Science Dataset using BERT : {:.3f}".format(accuracy_score(Y_truth.cpu().detach().numpy(), F.softmax(Y_preds, dim=-1).argmax(dim=-1).cpu().detach().numpy())))

Run the cell below to save the predictions. You will be required to upload the predictions to Gradescope for evaluation.



In [None]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

preds = F.softmax(Y_preds, dim=-1).argmax(dim=-1).cpu().detach().numpy()

with open('bert_wos.pkl', 'wb') as fp:
    pickle.dump(preds, fp)

## Q4: Sequence Labeling [15pts]

In the **sequenceLabeling.py** file complete the following functions:

* **\__init__**
* **forward**

Part-of-speech (POS) tagging is a common Natural Language Processing task that assigns each word in a text (corpus) a part-of-speech category, based on the definition of the word and its context.

Named entity recognition (NER) seeks to locate and classify named entities in text into pre-defined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

We will be using BERT (Bidirectional Encoder Representations from Transformers) for sequence labeling. The architecture of the model is shown above in the diagram.

We will feed the BERT embeddings into a fully connected linear layer to perform classification.

We will train and evaluate the model on the conll2003 dataset.
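
To make the idea of "one label per token" concrete, here is a minimal shape-level sketch in plain Python. It is not the expected `sequenceLabeling.py` solution: the random vectors stand in for BERT's per-token embeddings, and `linear`, `label_tokens`, and the toy sizes are hypothetical names and values chosen only for illustration.

```python
import random

def linear(vec, weights, bias):
    """Apply one fully connected layer; weights has shape [num_classes][hidden]."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weights, bias)]

def label_tokens(hidden_states, weights, bias):
    """Return one predicted class index per token (argmax over the logits)."""
    preds = []
    for vec in hidden_states:
        logits = linear(vec, weights, bias)
        preds.append(max(range(len(logits)), key=logits.__getitem__))
    return preds

random.seed(0)
HIDDEN, NUM_CLASSES, SEQ_LEN = 4, 3, 5  # toy sizes, far smaller than BERT's

# Stand-in for the encoder output: one vector per token in the sequence.
hidden_states = [[random.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(SEQ_LEN)]
weights = [[random.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(NUM_CLASSES)]
bias = [0.0] * NUM_CLASSES

predictions = label_tokens(hidden_states, weights, bias)
print(predictions)  # a list with SEQ_LEN entries, one class index per token
```

The same linear layer is applied independently at every position, so the output has one row of class scores per token rather than one per sentence.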


### 4.1 : Loading Dataset [No Points]

Run the cell below to download the conll2003 dataset. Each word in the dataset has been put on a separate line and there is an empty line after each sentence. The first item on each line is a word, the second a part-of-speech (POS) tag, the third a syntactic chunk tag, and the fourth the named entity tag. The chunk tags and the named entity tags have the format I-TYPE, which means that the word is inside a phrase of type TYPE.

For more details on the structure and the supported labels, please refer to https://huggingface.co/datasets/conll2003.
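
As an illustration of the raw file format described above, the sketch below parses a small hand-written sample with four columns per line and a blank line between sentences. The sample text and the `parse_conll` helper are assumptions made for this example. The Hugging Face loader used in this notebook already exposes the parsed fields (`tokens`, `pos_tags`, `ner_tags`), so you do not need to parse anything yourself.

```python
# Hand-written sample in the four-column layout: word, POS tag, chunk tag, NER tag.
SAMPLE = """\
U.N. NNP I-NP I-ORG
official NN I-NP O
Ekeus NNP I-NP I-PER
heads VBZ I-VP O
for IN I-PP O
Baghdad NNP I-NP I-LOC
. . O O

Peter NNP I-NP I-PER
Blackburn NNP I-NP I-PER
"""

def parse_conll(text):
    """Split the text into sentences; each sentence is a list of per-word dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # An empty line marks the end of a sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        word, pos, chunk, ner = line.split()
        current.append({"word": word, "pos": pos, "chunk": chunk, "ner": ner})
    if current:
        sentences.append(current)
    return sentences

sentences = parse_conll(SAMPLE)
for token in sentences[0]:
    print(token["word"], token["pos"], token["ner"])
```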

In [8]:
!pip install datasets



In [8]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

In [15]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

from datasets import load_dataset

dataset = load_dataset("conll2003")

train = dataset['train']
val = dataset['validation']
test = dataset['test']



### 4.2 Pre-process Dataset [No Points]

Run the cells below to vectorize the dataset with the tokenizer and to initialize the dataloaders.

In [16]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

### This code vectorizes the data for each batch, as defined in the data loader.
### For each example in the batch, the input tokens are encoded, and
### the special tokens [CLS] (101) and [SEP] (102) are added at the beginning and at the end, respectively.
### These tokens are added because the BERT model was pretrained with them, so we need them at inference time to get consistent results.
### The input is padded with 0 if it is shorter than MAX_LEN.

MAX_LEN = 128

def vectorize_batch(batch):
    batch_input_ids = []
    batch_mask = []
    batch_token_type_ids = []
    batch_pos = []
    batch_ner = []

    for data in batch:
        target_pos = []
        target_ner = []
        inputs = []
        tokens = data['tokens']
        pos_tags = data['pos_tags']
        ner_tags = data['ner_tags']
        for i in range(len(tokens)):
            input = tokenizer.encode(tokens[i], add_special_tokens=False)
            input_len = len(input)
            target_pos.extend([pos_tags[i]] * input_len)
            target_ner.extend([ner_tags[i]] * input_len)
            inputs.extend(input)
            
        inputs = inputs[:MAX_LEN - 2]
        target_pos = target_pos[:MAX_LEN - 2]
        target_ner = target_ner[:MAX_LEN - 2]

        inputs = [101] + inputs + [102]
        target_pos = [0] + target_pos + [0]
        target_ner = [0] + target_ner + [0]

        mask = [1] * len(inputs)
        token_type_ids = [0] * len(inputs)

        padding_len = MAX_LEN - len(inputs)
        inputs = inputs + ([0] * padding_len)
        mask = mask + ([0] * padding_len)

        token_type_ids = token_type_ids + ([0] * padding_len)
        target_pos = target_pos + ([0] * padding_len)
        target_ner = target_ner + ([0] * padding_len)

        batch_input_ids.append(inputs)
        batch_mask.append(mask)
        batch_token_type_ids.append(token_type_ids)
        batch_pos.append(target_pos)
        batch_ner.append(target_ner)

    return torch.tensor(batch_input_ids, dtype=torch.long), torch.tensor(batch_mask, dtype=torch.long), torch.tensor(batch_token_type_ids, dtype=torch.long), torch.tensor(batch_pos, dtype=torch.long), torch.tensor(batch_ner, dtype=torch.long)
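
The cell above repeats each word's tag across all of its word pieces, truncates, adds the special tokens, and pads. The sketch below walks through the same steps on one toy sentence. The `TOY_PIECES` table and its IDs are made up and stand in for `tokenizer.encode`, and `align` is a hypothetical helper, not part of the assignment code.

```python
MAX_LEN = 12                         # toy value; the notebook uses 128
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0

# Made-up word-piece IDs standing in for tokenizer.encode(word, add_special_tokens=False).
TOY_PIECES = {"playing": [2377, 2075], "chess": [7433], "daily": [3679]}

def align(tokens, tags, max_len=MAX_LEN):
    """Return (input_ids, mask, labels), each padded to max_len."""
    ids, labels = [], []
    for tok, tag in zip(tokens, tags):
        pieces = TOY_PIECES[tok]
        ids.extend(pieces)
        labels.extend([tag] * len(pieces))  # every piece inherits the word's tag
    # Leave room for [CLS] and [SEP]; both receive label 0.
    ids = [CLS_ID] + ids[:max_len - 2] + [SEP_ID]
    labels = [0] + labels[:max_len - 2] + [0]
    mask = [1] * len(ids)                   # 1 = real token, 0 = padding
    pad = max_len - len(ids)
    return ids + [PAD_ID] * pad, mask + [0] * pad, labels + [0] * pad

ids, mask, labels = align(["playing", "chess", "daily"], [5, 3, 7])
print(ids)     # [101, 2377, 2075, 7433, 3679, 102, 0, 0, 0, 0, 0, 0]
print(mask)    # [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
print(labels)  # [0, 5, 5, 3, 7, 0, 0, 0, 0, 0, 0, 0]
```

Note that "playing" is split into two pieces, so its tag (5) appears twice in `labels`.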

In [17]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

train_loader = DataLoader(train, batch_size=8, collate_fn=vectorize_batch, shuffle=True)
test_loader  = DataLoader(test, batch_size=1, collate_fn=vectorize_batch)

### 4.3 POS Tagging and NER using conll2003 dataset [No Points]

Run the cell below to train POS tagging and NER models on the conll2003 dataset using the sequenceLabeling functions that you have already implemented in Q4.

An F1-score of more than 90% on both tasks is acceptable.
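
The training cell below excludes padded positions from the loss by replacing their labels with `criterion.ignore_index` (which defaults to -100 for `nn.CrossEntropyLoss`). The following plain-Python sketch mimics that behaviour on toy numbers so you can see its effect. The helper names and the logits are assumptions for illustration, and this is not a substitute for the PyTorch loss.

```python
import math

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def cross_entropy(logits, label):
    """Negative log-softmax probability of the true label at one position."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]

def masked_mean_loss(all_logits, labels, mask):
    """Average the loss over positions whose mask is 1; padded positions are skipped."""
    # Equivalent of the torch.where step: padded positions get IGNORE_INDEX.
    active = [lab if m == 1 else IGNORE_INDEX for lab, m in zip(labels, mask)]
    losses = [cross_entropy(lg, lab)
              for lg, lab in zip(all_logits, active) if lab != IGNORE_INDEX]
    return sum(losses) / len(losses)

logits = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
labels = [0, 1, 0, 0]   # the last two positions are padding with dummy label 0
mask = [1, 1, 0, 0]

loss = masked_mean_loss(logits, labels, mask)
unmasked = masked_mean_loss(logits, labels, [1, 1, 1, 1])
print(loss, unmasked)
```

Without the mask, the dummy labels on padded positions would contribute to the average and distort the training signal.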

In [19]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

from sequenceLabeling import SequenceLabeling
from tqdm import tqdm

NUM_CLASSES_POS = 47
NUM_CLASSES_NER = 9

device = 'cpu'
model_pos = SequenceLabeling(NUM_CLASSES_POS)
model_pos.to(device)

model_ner = SequenceLabeling(NUM_CLASSES_NER)
model_ner.to(device)

criterion = nn.CrossEntropyLoss()
optimizer_pos = optim.Adam(model_pos.parameters(), lr=0.0001)
optimizer_ner = optim.Adam(model_ner.parameters(), lr=0.0001)
N_EPOCHS = 3

model_pos.train()
model_ner.train()

for epoch in range(N_EPOCHS):
    total_loss_pos = 0.0
    total_loss_ner = 0.0
    for input_ids, mask, token_type_ids, target_pos, target_ner in tqdm(train_loader):
        input_ids = input_ids.to(device)
        mask = mask.to(device)
        token_type_ids = token_type_ids.to(device)
        target_pos = target_pos.to(device)
        target_ner = target_ner.to(device)

        outputs_pos = model_pos(input_ids, mask, token_type_ids)
        outputs_ner = model_ner(input_ids, mask, token_type_ids)

        active_loss_pos = mask.view(-1) == 1
        active_logits_pos = outputs_pos.view(-1, NUM_CLASSES_POS)

        active_labels_pos = torch.where(
        active_loss_pos,
        target_pos.view(-1),
        torch.tensor(criterion.ignore_index).type_as(target_pos)
        )

        loss_pos = criterion(active_logits_pos, active_labels_pos)      

        optimizer_pos.zero_grad()
        loss_pos.backward()
        optimizer_pos.step()

        active_loss_ner = mask.view(-1) == 1
        active_logits_ner = outputs_ner.view(-1, NUM_CLASSES_NER)

        active_labels_ner = torch.where(
        active_loss_ner,
        target_ner.view(-1),
        torch.tensor(criterion.ignore_index).type_as(target_ner)
        )

        loss_ner = criterion(active_logits_ner, active_labels_ner)      

        optimizer_ner.zero_grad()
        loss_ner.backward()
        optimizer_ner.step()

        total_loss_pos += loss_pos.item()
        total_loss_ner += loss_ner.item()

    print("POS Tagging loss on epoch %i: %f" % (epoch, total_loss_pos))
    print("NER loss on epoch %i: %f" % (epoch, total_loss_ner))

In [None]:
with torch.no_grad():
    y_true_pos = []
    y_pred_pos = []
    y_true_ner = []
    y_pred_ner = []
    for input_ids, mask, token_type_ids, target_pos, target_ner in tqdm(test_loader):
      input_ids = input_ids.to(device)
      mask = mask.to(device)
      token_type_ids = token_type_ids.to(device)
      target_pos = target_pos.to(device).view(-1)
      target_ner = target_ner.to(device).view(-1)

      outputs_pos = model_pos(input_ids, mask, token_type_ids)
      predicted_pos = torch.argmax(outputs_pos, dim=-1)

      outputs_ner = model_ner(input_ids, mask, token_type_ids)
      predicted_ner = torch.argmax(outputs_ner, dim=-1)

      active_loss = mask == 1
      active_loss = active_loss.view(-1)
      predicted_pos = predicted_pos.view(-1)
      predicted_ner = predicted_ner.view(-1)

      for i in range(len(active_loss)):
        if not active_loss[i]:
          break
        y_true_pos.append(target_pos[i].cpu().detach().numpy())
        y_pred_pos.append(predicted_pos[i].cpu().detach().numpy())
        y_true_ner.append(target_ner[i].cpu().detach().numpy())
        y_pred_ner.append(predicted_ner[i].cpu().detach().numpy())

from sklearn.metrics import f1_score

print("Test F1-score on Conll2003 Dataset for POS Tagging : {:.3f}".format(f1_score(y_true_pos, y_pred_pos, average='micro')))
print("Test F1-score on Conll2003 Dataset for NER : {:.3f}".format(f1_score(y_true_ner, y_pred_ner, average='micro')))
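
The evaluation above uses `average='micro'`. For single-label multiclass predictions where every label is counted, micro-averaged F1 pools true positives, false positives, and false negatives over all classes, and the result coincides with plain accuracy. The sketch below checks this on toy labels; `micro_f1` and `accuracy` are hand-written illustrations, not the scikit-learn implementations.

```python
def micro_f1(y_true, y_pred):
    """Pool TP/FP/FN over all classes, then compute a single F1 score."""
    tp = fp = fn = 0
    for c in set(y_true) | set(y_pred):
        for t, p in zip(y_true, y_pred):
            if t == c and p == c:
                tp += 1
            elif p == c:   # predicted c, but the true label differs
                fp += 1
            elif t == c:   # true label is c, but something else was predicted
                fn += 1
    return 2 * tp / (2 * tp + fp + fn)

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [0, 1, 2, 2, 1, 0, 2]
y_pred = [0, 2, 2, 2, 1, 1, 2]

print(micro_f1(y_true, y_pred), accuracy(y_true, y_pred))  # both equal 5/7
```

Each wrong prediction counts once as a false positive (for the predicted class) and once as a false negative (for the true class), which is why the two numbers agree.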

Run the cell below to save the predictions. You will be required to upload the predictions to Gradescope for evaluation.


In [None]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

with open('bert_pos.pkl', 'wb') as fp:
    pickle.dump(np.array(y_pred_pos), fp)

with open('bert_ner.pkl', 'wb') as fp:
    pickle.dump(np.array(y_pred_ner), fp)

## Q5: Topic Modeling [20pts]

Topic models are a type of statistical language model used to uncover hidden structure in a collection of texts. Several algorithms exist for topic modeling. In this HW, we will explore the LDA (Latent Dirichlet Allocation) method. For more details, please refer to the class lectures and slides.

### 5.1: Latent Dirichlet Allocation

In the **lda.py** file complete the following functions:

* **tokenize_words**
* **remove_stopwords**
* **create_dictionary**
* **build_LDAModel**

Latent Dirichlet Allocation (LDA) is a generative statistical model that explains a set of observations through unobserved groups, where each group explains why some parts of the data are similar. LDA is an example of a topic model: observations (e.g., words) are collected into documents, and each word's presence is attributable to one of the document's topics. Each document is assumed to contain a small number of topics.
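
The generative story can be sketched in a few lines of standard-library Python. This is only an illustration of how a single document would be produced under the model, not the gensim-based code expected in `lda.py`. The vocabulary, the two fixed topic-word distributions, and all helper names are assumptions; in full LDA the topic-word distributions are themselves drawn from a Dirichlet prior.

```python
import random

VOCAB = ["goal", "match", "team", "stock", "market", "price"]
# Two hand-picked topics: each row is a probability distribution over VOCAB.
TOPIC_WORD = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # a "sports"-like topic
    [0.0, 0.0, 0.0, 0.3, 0.4, 0.3],  # a "finance"-like topic
]

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_categorical(probs, rng):
    """Draw one index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_document(topic_word, alpha, length, rng):
    theta = sample_dirichlet(alpha, rng)           # the document's topic mixture
    words = []
    for _ in range(length):
        z = sample_categorical(theta, rng)         # pick a topic for this word
        w = sample_categorical(topic_word[z], rng) # pick a word from that topic
        words.append(VOCAB[w])
    return theta, words

rng = random.Random(0)
theta, doc = generate_document(TOPIC_WORD, alpha=[0.5, 0.5], length=8, rng=rng)
print(theta)
print(doc)
```

Fitting LDA runs this story in reverse: given only the words, it infers topic mixtures and topic-word distributions that could plausibly have generated them.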

### 5.2: Topic Modeling on Clickbait dataset using LDA [No Points]

Run the cells below to build the LDA model and visualize its topics using the gensim library.

In [1]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

from lda import LDA

stop_words = stopwords.words('english')
lda = LDA()

data = list(lda.tokenize_words(x_train))
data = lda.remove_stopwords(data, stop_words)
id2word, corpus = lda.create_dictionary(data)
lda_model = lda.build_LDAModel(id2word, corpus)

In [None]:
###############################
### DO NOT CHANGE THIS CELL ###
###############################

# Visualize the topics
pyLDAvis.enable_notebook()

LDAvis_prepared = pyLDAvis.gensim_models.prepare(lda_model, corpus, id2word)
LDAvis_prepared