### Part 0. Google Colab Setup

Hopefully you're looking at this notebook in Colab! 
1. First, make a copy of this notebook to your local drive, so you can edit it. 
2. Go ahead and upload the OnionOrNot.csv file from the [assignment zip](https://www.cc.gatech.edu/classes/AY2022/cs4650_fall/programming/h2_torch.zip) in the files panel on the left.
3. Right-click in the files panel and select 'Create New Folder'. Name this folder src.
4. Upload all the files in the src/ folder from the [assignment zip](https://www.cc.gatech.edu/classes/AY2022/cs4650_fall/programming/h2_torch.zip) to the src/ folder on Colab.

***NOTE: REMEMBER TO REGULARLY REDOWNLOAD ALL THE FILES IN SRC FROM COLAB.*** 

***IF YOU EDIT THE FILES IN COLAB, AND YOU DO NOT REDOWNLOAD THEM, YOU WILL LOSE YOUR WORK!***

If you want to use a GPU, you can change your runtime type to GPU directly in Colab.

### Part 1. Loading and Preprocessing Data [10 points]
The following cells load the OnionOrNot dataset and tokenize each data item.

In [1]:
# DO NOT MODIFY #
import torch
import random
import numpy as np
RANDOM_SEED = 42
torch.manual_seed(RANDOM_SEED)
random.seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)
import sklearn
# This is how we select a GPU if one is available on your computer.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [2]:
import pandas as pd
from src.preprocess import clean_text 
import nltk
from tqdm import tqdm

nltk.download('punkt')
df = pd.read_csv('train.csv', quotechar='"')
df["tokenized"] = df["text"].apply(lambda x: nltk.word_tokenize(clean_text(x.lower())))

# to convert authors into numbers
author_to_number = {
    'EAP': 0,
    'HPL': 1,
    'MWS': 2
    
}

# Convert author labels to ints. (Lowercasing, cleaning, and tokenizing already happened above.)
# A single column-wide map avoids chained assignment like df['author'][i] = ...
df['author'] = df['author'].map(author_to_number)

[nltk_data] Downloading package punkt to /home/andre/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Here's what the dataset looks like. You can index into specific rows with pandas, and try to guess some of these yourself :)

In [3]:
df.head()

| | id | text | author | tokenized |
|---|---|---|---|---|
| 0 | id26305 | This process, however, afforded me no means of... | 0 | [this, process, ,, however, ,, afforded, me, n... |
| 1 | id17569 | It never once occurred to me that the fumbling... | 1 | [it, never, once, occurred, to, me, that, the,... |
| 2 | id11008 | In his left hand was a gold snuff box, from wh... | 0 | [in, his, left, hand, was, a, gold, snuff, box... |
| 3 | id27763 | How lovely is spring As we looked from Windsor... | 2 | [how, lovely, is, spring, as, we, looked, from... |
| 4 | id12958 | Finding nothing else, not even gold, the Super... | 1 | [finding, nothing, else, ,, not, even, gold, ,... |


In [4]:
df.iloc[42]

id                                                     id27080
text         It was all mud an' water, an' the sky was dark...
author                                                       1
tokenized    [it, was, all, mud, an, ', water, ,, an, ', th...
Name: 42, dtype: object

Now that we've loaded this dataset, we need to split the data into train, validation, and test sets. We also need to create a vocab map for the words in our Onion dataset, which maps each token to a number. This will be useful later, since torch models only accept tensors of numbers as inputs. **Go to src/dataset.py, and fill out split_train_val_test and generate_vocab_map.**

In [5]:
## TODO: complete these methods in src/dataset.py
from src.dataset import split_train_val_test, generate_vocab_map
df = df.sample(frac=1)

train_df, val_df, test_df = split_train_val_test(df, props=[.8, .1, .1])
train_vocab, reverse_vocab = generate_vocab_map(train_df)
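To make the idea of a vocab map concrete, here is a minimal sketch that builds token-to-index and index-to-token dictionaries from tokenized sentences. The function name `build_vocab_maps`, the `<pad>` and `<unk>` tokens at indices 0 and 1, and the `min_count` threshold are all assumptions made for this example; the required behavior of generate_vocab_map is whatever src/dataset.py specifies.

```python
from collections import Counter

PAD, UNK = "<pad>", "<unk>"  # assumed special tokens, not taken from the assignment


def build_vocab_maps(tokenized_sentences, min_count=2):
    """Map each sufficiently frequent token to an integer index, and back."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    vocab = {PAD: 0, UNK: 1}
    for tok, n in counts.items():
        if n >= min_count:  # rare tokens fall back to UNK at lookup time
            vocab[tok] = len(vocab)
    reverse = {idx: tok for tok, idx in vocab.items()}
    return vocab, reverse


sentences = [["the", "cat", "sat"], ["the", "dog", "sat"]]
vocab, reverse = build_vocab_maps(sentences)

# Unseen or rare tokens ("bird" here) are encoded as the UNK index.
encoded = [vocab.get(tok, vocab[UNK]) for tok in ["the", "bird", "sat"]]
print(vocab)
print(encoded)
```

Building the map from the training split only, as the cell above does with train_df, keeps validation and test words from leaking into the vocabulary.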

In [6]:
# this line of code will help test your implementation
(len(train_df) / len(df)), (len(val_df) / len(df)), (len(test_df) / len(df))

(0.7999897849736963, 0.09995403238163338, 0.09995403238163338)
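The ratios printed above sum to just under 1, which is consistent with each split length being truncated separately. As a rough sketch of one alternative, the helper below computes index bounds and lets the last split absorb any leftover rows. The name `split_bounds` is hypothetical, and a plain list stands in for the shuffled DataFrame; with pandas, the same bounds could be applied as `df.iloc[start:end]`.

```python
def split_bounds(n_rows, props):
    """Return (start, end) index pairs for consecutive splits of n_rows."""
    if abs(sum(props) - 1.0) > 1e-9:
        raise ValueError("proportions should sum to 1")
    bounds, start = [], 0
    for i, p in enumerate(props):
        # The last split takes every remaining row so none are dropped.
        end = n_rows if i == len(props) - 1 else start + int(n_rows * p)
        bounds.append((start, end))
        start = end
    return bounds


rows = list(range(103))  # stand-in for already-shuffled rows
train, val, test = (rows[s:e] for s, e in split_bounds(len(rows), [0.8, 0.1, 0.1]))
print(len(train), len(val), len(test))
```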

PyTorch has custom Dataset classes that have very useful extensions. **Go to src/dataset.py, and fill out the HeadlineDataset class.** Refer to PyTorch documentation on Dataset classes for help.

In [7]:
from src.dataset import HeadlineDataset
from torch.utils.data import RandomSampler
#print(train_df)

train_dataset = HeadlineDataset(train_vocab, train_df)
val_dataset = HeadlineDataset(train_vocab, val_df)
test_dataset = HeadlineDataset(train_vocab, test_df)

# Now that we're wrapping our dataframes in PyTorch datasets, we can make use of PyTorch Random Samplers.
train_sampler = RandomSampler(train_dataset)
val_sampler = RandomSampler(val_dataset)
test_sampler = RandomSampler(test_dataset)
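For reference, a map-style PyTorch Dataset only needs `__len__` and `__getitem__`. The sketch below is a hypothetical stand-in, not the HeadlineDataset from src/dataset.py: the class name `TokenDataset`, the use of plain lists instead of a DataFrame, and index 1 for unknown tokens are all assumptions.

```python
import torch
from torch.utils.data import Dataset


class TokenDataset(Dataset):
    """Hypothetical stand-in for HeadlineDataset, built from plain lists."""

    def __init__(self, vocab, tokenized, labels, unk_index=1):
        self.vocab = vocab
        self.tokenized = tokenized
        self.labels = labels
        self.unk_index = unk_index  # assumed index for out-of-vocabulary tokens

    def __len__(self):
        return len(self.tokenized)

    def __getitem__(self, idx):
        # Look up each token's index, falling back to the unknown-token index.
        ids = [self.vocab.get(tok, self.unk_index) for tok in self.tokenized[idx]]
        return torch.tensor(ids, dtype=torch.long), self.labels[idx]


vocab = {"<pad>": 0, "<unk>": 1, "the": 2, "sky": 3}
dataset = TokenDataset(vocab, [["the", "sky"], ["the", "dark", "sky"]], [0, 1])
ids, label = dataset[1]
print(len(dataset), ids.tolist(), label)
```

Because each item can have a different length, items from a dataset like this cannot be stacked into a batch directly, which is the job of the collate function.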

We can now use PyTorch DataLoaders to batch our data for us. **Go to src/dataset.py, and fill out collate_fn.** Refer to PyTorch documentation on DataLoaders for help.

In [8]:
from torch.utils.data import DataLoader
from src.dataset import collate_fn
BATCH_SIZE = 16
train_iterator = DataLoader(train_dataset, batch_size=BATCH_SIZE, sampler=train_sampler, collate_fn=collate_fn)
val_iterator = DataLoader(val_dataset, batch_size=BATCH_SIZE, sampler=val_sampler, collate_fn=collate_fn)
test_iterator = DataLoader(test_dataset, batch_size=BATCH_SIZE, sampler=test_sampler, collate_fn=collate_fn)
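A DataLoader hands its collate function a list of dataset items and expects one batch back. As a minimal sketch, assuming each item is an (id tensor, integer label) pair and that 0 is the padding index, padding to the longest sequence in the batch could look like this. The name `pad_collate` is hypothetical and is not the assignment's collate_fn.

```python
import torch
from torch.nn.utils.rnn import pad_sequence


def pad_collate(batch, padding_value=0):
    """Pad variable-length id tensors to the longest sequence in the batch."""
    sequences, labels = zip(*batch)
    padded = pad_sequence(list(sequences), batch_first=True,
                          padding_value=padding_value)
    return padded, torch.tensor(labels, dtype=torch.long)


batch = [
    (torch.tensor([5, 6, 7]), 0),
    (torch.tensor([8]), 2),
]
x, y = pad_collate(batch)
print(x.shape, x.tolist(), y.tolist())
```

Padding per batch rather than to a global maximum keeps short batches small, at the cost of batch shapes that vary from step to step.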


In [9]:
# Use this to test your collate_fn implementation.

# You can look at the shapes of x and y, or put print
# statements in collate_fn, while running this snippet.

for x, y in test_iterator:
    print(x,y)
    break
test_iterator = DataLoader(test_dataset, batch_size=BATCH_SIZE, sampler=test_sampler, collate_fn=collate_fn)


tensor([[  286,    20,    15,    71,  5519,  1764,     8,   463,     2,  6330,
          6331,    30,  1280,    26,     2,   375,    26,    22,     1,     5,
          3021,  2354,    18,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0],
        [    1,    57,   187,    10,   275,  2115,     2,   490,  4531,   451,
           431,    10,     5,     2,  2692,   196,  5806,  2116,   119,   864,
             2,   103,   986,   121,  1350,     1,   109,    47,   150,  2224,
           100,  3445,     5,   304,     1,     1,    18,     0,     0,     0,
             0,     0,     0,     0],
        [ 1820,    20,    30,   250,  1502,    80,     2,    27,    37,     1,
          6376,    10,   228,   995,   182,    26,     1,     5,  2066,   119,
            30,    29,   121,  2698,   112,  1525,    28,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     

### Part 2: Modeling [10 points]
Now that we have dataset iterators that batch our data for us, let's move to modeling. **Go to src/models.py, and follow the instructions in the file to create a basic neural network. Then, create your model using the class, and define hyperparameters.**

In [10]:
from src.models import ClassificationModel
model = None
### YOUR CODE GOES HERE (1 line of code) ###
model = ClassificationModel(len(train_vocab),embedding_dim=128,hidden_dim = 128,num_layers = 2,bidirectional = True)

# model.to(device)
# # 
### YOUR CODE ENDS HERE ###
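The constructor arguments above (embedding_dim, hidden_dim, num_layers, bidirectional) suggest an embedding layer feeding a recurrent encoder. The sketch below shows one way such a classifier could be wired up. It is an illustration under assumptions, not the ClassificationModel in src: the class name `SketchClassifier`, the choice of an LSTM, the `num_classes=3` default, and returning raw logits instead of sigmoid outputs are all choices made for this example.

```python
import torch
from torch import nn


class SketchClassifier(nn.Module):
    """Hypothetical embedding + LSTM + linear classifier."""

    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_layers,
                 bidirectional, num_classes=3):
        super().__init__()
        # padding_idx=0 keeps the padding embedding fixed at zero (assumed pad index).
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=num_layers,
                            bidirectional=bidirectional, batch_first=True)
        directions = 2 if bidirectional else 1
        self.out = nn.Linear(hidden_dim * directions, num_classes)

    def forward(self, x):
        embedded = self.embedding(x)          # (batch, seq_len, embedding_dim)
        _, (hidden, _) = self.lstm(embedded)  # (layers * directions, batch, hidden_dim)
        if self.lstm.bidirectional:
            # Final states of the last layer's forward and backward passes.
            final = torch.cat([hidden[-2], hidden[-1]], dim=1)
        else:
            final = hidden[-1]
        return self.out(final)                # raw logits, (batch, num_classes)


sketch_model = SketchClassifier(vocab_size=50, embedding_dim=8, hidden_dim=16,
                                num_layers=2, bidirectional=True)
logits = sketch_model(torch.randint(0, 50, (4, 10)))
print(logits.shape)
```

Returning a (batch, num_classes) tensor directly means no squeeze is needed before computing the loss.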

In the following cell, **instantiate the model with some hyperparameters, and select an appropriate loss function and optimizer.** 

Hint: we already use sigmoid in our model. What loss functions are available for binary classification? Feel free to look at PyTorch docs for help!

In [11]:
from torch.optim import AdamW

criterion, optimizer = None, None
### YOUR CODE GOES HERE ###
criterion = torch.nn.CrossEntropyLoss()  # multi-class loss for the three author labels
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
# scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1, gamma=0.9)

### YOUR CODE ENDS HERE ###
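The hint talks about sigmoid outputs and binary classification, while the cell above uses CrossEntropyLoss on integer author labels. The two setups expect different inputs, which the following sketch tries to make explicit. All tensors here are made-up toy values.

```python
import torch
from torch import nn

# Binary case: the model already applied sigmoid, so outputs are probabilities.
probs = torch.tensor([0.9, 0.2, 0.6])
binary_targets = torch.tensor([1.0, 0.0, 1.0])  # BCELoss expects float targets
bce = nn.BCELoss()(probs, binary_targets)

# Multi-class case: the model returns raw logits, one column per class.
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])
class_targets = torch.tensor([0, 2])            # CrossEntropyLoss expects long class indices
ce = nn.CrossEntropyLoss()(logits, class_targets)

# CrossEntropyLoss is log-softmax followed by negative log-likelihood.
manual = nn.NLLLoss()(torch.log_softmax(logits, dim=1), class_targets)
print(bce.item(), ce.item(), manual.item())
```

This is also why train_loop casts y with `.long()`: class indices for CrossEntropyLoss must be integer tensors.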

### Part 3: Training and Evaluation [10 points]
The final part of this HW involves training the model, and evaluating it at each epoch. **Fill out the train and test loops below.**

In [12]:
# returns the total loss calculated from criterion
def train_loop(model, criterion, iterator):
    model.train()
    total_loss = 0
    
    for x, y in tqdm(iterator):
        optimizer.zero_grad()
        # x = x.to(device)
        # y = y.to(device)
        y = y.long()
        ### YOUR CODE STARTS HERE (~6 lines of code) ###
        prediction = model(x)
        prediction = torch.squeeze(prediction)
        loss = criterion(prediction, y)
        total_loss += loss.item()
        loss.backward()
        optimizer.step()
        ### YOUR CODE ENDS HERE ###
    # scheduler.step()
    return total_loss

# returns:
# - true: a Python boolean array of all the ground truth values 
#         taken from the dataset iterator
# - pred: a Python boolean array of all model predictions. 
def val_loop(model, criterion, iterator):
    true, pred = [], []
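    ### YOUR CODE STARTS HERE (~8 lines of code) ###
    model.eval()  # switch off dropout and other training-only behavior
    with torch.no_grad():  # gradients are not needed during evaluation
        for x, y in tqdm(iterator):
            # x = x.to(device)
            # y = y.to(device)
            preds = model(x)
            preds = torch.squeeze(preds)
            for i_batch in range(len(y)):
                true.append(y[i_batch])
                # the predicted class is the index of the highest score
                pred.append(torch.argmax(preds[i_batch]))
    ### YOUR CODE ENDS HERE ###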
    return true, pred


We also need evaluation metrics that tell us how well our model is doing on the validation set at each epoch. **Complete the functions in src/eval_utils.py.**

In [13]:
from sklearn.metrics import f1_score, accuracy_score
# To test your eval implementation, let's see how well the untrained model does on our dev dataset.
# It should do pretty poorly.
from src.eval_utils import binary_macro_f1, accuracy
true, pred = val_loop(model, criterion, val_iterator)
true = [x.item() for x in true]
pred = [x.item() for x in pred]

print(f1_score(true, pred, average='weighted'))
print(accuracy_score(true, pred))


100%|██████████| 123/123 [00:06<00:00, 18.54it/s]

0.30270534873618415
0.32038834951456313
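To show what these metrics compute, here is a small pure-Python sketch of accuracy and macro-averaged F-1. The names `simple_accuracy` and `macro_f1` are hypothetical, and the functions in src may be specified differently. Note also that the cell above reports sklearn's weighted average, which weights each class by its support, whereas the sketch weights all classes equally.

```python
def simple_accuracy(true, pred):
    """Fraction of predictions that match the ground truth."""
    return sum(t == p for t, p in zip(true, pred)) / len(true)


def macro_f1(true, pred):
    """Unweighted mean of the per-class F-1 scores."""
    scores = []
    for label in sorted(set(true) | set(pred)):
        tp = sum(t == label and p == label for t, p in zip(true, pred))
        fp = sum(t != label and p == label for t, p in zip(true, pred))
        fn = sum(t == label and p != label for t, p in zip(true, pred))
        denom = 2 * tp + fp + fn
        # F-1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


true_labels = [0, 0, 1, 1, 2, 2]
pred_labels = [0, 1, 1, 1, 2, 0]
print(simple_accuracy(true_labels, pred_labels), macro_f1(true_labels, pred_labels))
```

With three classes, an untrained model guessing at random should land near an accuracy of one third, which matches the roughly 0.32 printed above.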





### Part 4: Actually training the model [1 point]
Watch your model train :D You should be able to achieve a validation F-1 score of at least .8 if everything went correctly. **Feel free to adjust the number of epochs to prevent overfitting or underfitting.**

In [14]:
TOTAL_EPOCHS = 7
for epoch in range(TOTAL_EPOCHS):
    train_loss = train_loop(model, criterion, train_iterator)
    true, pred = val_loop(model, criterion, val_iterator)
    print(f"EPOCH: {epoch}")
    print(f"TRAIN LOSS: {train_loss}")
    print(f"VAL F-1: {f1_score(true, pred, average='weighted')}")
    print(f"VAL ACC: {accuracy_score(true, pred)}")


100%|██████████| 979/979 [04:22<00:00,  3.73it/s]
100%|██████████| 123/123 [00:08<00:00, 13.83it/s]


EPOCH: 0
TRAIN LOSS: 796.6538567692041
VAL F-1: 0.7393125080222425
VAL ACC: 0.7404190086867655


100%|██████████| 979/979 [04:25<00:00,  3.68it/s]
100%|██████████| 123/123 [00:08<00:00, 15.01it/s]


EPOCH: 1
TRAIN LOSS: 483.7765866070986
VAL F-1: 0.7574967100677433
VAL ACC: 0.7577925396014308


100%|██████████| 979/979 [04:08<00:00,  3.94it/s]
100%|██████████| 123/123 [00:06<00:00, 17.80it/s]


EPOCH: 2
TRAIN LOSS: 390.7471934258938
VAL F-1: 0.7612522167700645
VAL ACC: 0.7634133878385284


100%|██████████| 979/979 [04:14<00:00,  3.85it/s]
100%|██████████| 123/123 [00:07<00:00, 15.86it/s]


EPOCH: 3
TRAIN LOSS: 327.9839417822659
VAL F-1: 0.7667573455520934
VAL ACC: 0.7664793050587634


100%|██████████| 979/979 [04:49<00:00,  3.38it/s]
100%|██████████| 123/123 [00:08<00:00, 14.44it/s]


EPOCH: 4
TRAIN LOSS: 328.2313201073557
VAL F-1: 0.7706913714309876
VAL ACC: 0.7705671946857435


100%|██████████| 979/979 [04:39<00:00,  3.50it/s]
100%|██████████| 123/123 [00:10<00:00, 11.30it/s]


EPOCH: 5
TRAIN LOSS: 319.33898543333635
VAL F-1: 0.7576656332052576
VAL ACC: 0.7577925396014308


100%|██████████| 979/979 [06:02<00:00,  2.70it/s]
100%|██████████| 123/123 [00:08<00:00, 15.01it/s]


EPOCH: 6
TRAIN LOSS: 292.9661737512797
VAL F-1: 0.7584246277150211
VAL ACC: 0.7588145120081757
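One simple way to choose the number of epochs is to track validation F-1 and stop once it has not improved for a few epochs. The sketch below applies such a patience rule to the validation scores from the log above, rounded to four decimals. The helper name `best_epoch` and the patience of 2 are assumptions for illustration.

```python
def best_epoch(val_scores, patience=2):
    """Return (best_index, stop_index) under a simple patience rule."""
    best_idx, best = 0, float("-inf")
    for i, score in enumerate(val_scores):
        if score > best:
            best_idx, best = i, score
        elif i - best_idx >= patience:
            return best_idx, i  # no improvement for `patience` epochs: stop here
    return best_idx, len(val_scores) - 1


# Validation F-1 per epoch, rounded from the log above.
val_f1 = [0.7393, 0.7575, 0.7613, 0.7668, 0.7707, 0.7577, 0.7584]
print(best_epoch(val_f1))
```

On this run the rule picks epoch 4 as the best checkpoint and stops at epoch 6, since validation F-1 dips after epoch 4 while the training loss keeps falling.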


We can also look at the model's performance on the held-out test set, using the same val_loop we wrote earlier.

In [15]:
true, pred = val_loop(model, criterion, test_iterator)
print(f"TEST F-1: {f1_score(true, pred, average='weighted')}")
print(f"TEST ACC: {accuracy_score(true, pred)}")

100%|██████████| 123/123 [00:09<00:00, 13.08it/s]


TEST F-1: 0.7600362476236181
TEST ACC: 0.7603474706182933
