### Part 0. Google Colab Setup

Hopefully you're looking at this notebook in Colab! 
1. First, make a copy of this notebook to your local drive, so you can edit it. 
2. Go ahead and upload the OnionOrNot.csv file from the [assignment zip](https://www.cc.gatech.edu/classes/AY2022/cs4650_fall/programming/h2_torch.zip) in the files panel on the left.
3. Right-click in the files panel and select 'Create New Folder'; call this folder src.
4. Upload all the files in the src/ folder from the [assignment zip](https://www.cc.gatech.edu/classes/AY2022/cs4650_fall/programming/h2_torch.zip) to the src/ folder on colab.

***NOTE: REMEMBER TO REGULARLY REDOWNLOAD ALL THE FILES IN SRC FROM COLAB.*** 

***IF YOU EDIT THE FILES IN COLAB, AND YOU DO NOT REDOWNLOAD THEM, YOU WILL LOSE YOUR WORK!***

If you want GPUs, you can always change your instance type to GPU directly in Colab.

### Part 1. Loading and Preprocessing Data [10 points]
The following cell loads the OnionOrNot dataset and tokenizes each data item.

In [1]:
# DO NOT MODIFY #
import torch
import random
import numpy as np
RANDOM_SEED = 42
torch.manual_seed(RANDOM_SEED)
random.seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)
# this is how we select a GPU if it's available on your computer.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [2]:
import pandas as pd
from src.preprocess import clean_text 
import nltk
from tqdm import tqdm

nltk.download('punkt')
df = pd.read_csv("OnionOrNot.csv")
df["tokenized"] = df["text"].apply(lambda x: nltk.word_tokenize(clean_text(x.lower())))

[nltk_data] Downloading package punkt to /home/andre/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Here's what the dataset looks like. You can index into specific rows with pandas, and try to guess some of these yourself :)

In [3]:
df.head()

Unnamed: 0,text,label,tokenized
0,Entire Facebook Staff Laughs As Man Tightens P...,1,"[entire, facebook, staff, laughs, as, man, tig..."
1,Muslim Woman Denied Soda Can for Fear She Coul...,0,"[muslim, woman, denied, soda, can, for, fear, ..."
2,Bold Move: Hulu Has Announced That They’re Gon...,1,"[bold, move, :, hulu, has, announced, that, th..."
3,Despondent Jeff Bezos Realizes He’ll Have To W...,1,"[despondent, jeff, bezos, realizes, he, ’, ll,..."
4,"For men looking for great single women, online...",1,"[for, men, looking, for, great, single, women,..."


In [4]:
df.iloc[42]

text         Customers continued to wait at drive-thru even...
label                                                        0
tokenized    [customers, continued, to, wait, at, drive-thr...
Name: 42, dtype: object

Now that we've loaded this dataset, we need to split the data into train, validation, and test sets. We also need to create a vocab map for words in our Onion dataset, which will map tokens to numbers. This will be useful later, since torch models can only take tensors of numbers as inputs. **Go to src/dataset.py, and fill out split_train_val_test and generate_vocab_map.**
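To make the two steps concrete, here is a minimal sketch on plain Python lists rather than a DataFrame. The names `sketch_split` and `sketch_vocab_map` are hypothetical, and so are the conventions that id 0 is padding, id 1 is the unknown token, and a token is kept only if it occurs more than `cutoff` times. Your src/dataset.py may use different conventions.

```python
from collections import Counter


def sketch_split(rows, props=(0.8, 0.1, 0.1)):
    """Split an already-shuffled list into train/val/test by proportion."""
    n = len(rows)
    train_end = int(props[0] * n)
    val_end = train_end + int(props[1] * n)
    # Whatever is left after train and val goes to test.
    return rows[:train_end], rows[train_end:val_end], rows[val_end:]


def sketch_vocab_map(token_lists, cutoff=-1):
    """Map tokens to integer ids, and build the inverse map as well."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    # Assumption: ids 0 and 1 are reserved for padding and unknown tokens.
    vocab = {"": 0, "UNK": 1}
    for tok, freq in counts.items():
        if freq > cutoff:  # assumption: keep tokens seen more than `cutoff` times
            vocab[tok] = len(vocab)
    reverse = {idx: tok for tok, idx in vocab.items()}
    return vocab, reverse
```

Building the vocab from the training split only, as the cell below does, keeps validation and test words from leaking into the model's vocabulary.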

In [5]:
## TODO: complete these methods in src/dataset.py
from src.dataset import split_train_val_test, generate_vocab_map
df = df.sample(frac=1)

train_df, val_df, test_df = split_train_val_test(df, props=[.8, .1, .1])
print(train_df)
train_vocab, reverse_vocab = generate_vocab_map(train_df)

                                                    text  label  \
3111   Scientists Say Smelling Farts Might Prevent Ca...      0   
18679  Cops Charge Man With “Destruction of Police Pr...      0   
17472  Gandalf would kick Dumbledore’s ass in a fight...      0   
21451  Woman Found She Was Farting Through Her Vagina...      0   
20800  Japanese 'naked restaurant' to ban overweight ...      0   
...                                                  ...    ...   
18074  John Hickenlooper Announces Support For Nuking...      1   
13304  Missing sisters survive 2 weeks in woods on Gi...      0   
17888  Blog: Please Do Not Let Funyuns Become The Off...      1   
1806    Exploding rhubarb chutney wrecks retirement flat      0   
3871   Man sues Florida hospital after his leg found ...      0   

                                               tokenized  
3111   [scientists, say, smelling, farts, might, prev...  
18679  [cops, charge, man, with, “, destruction, of, ...  
17472  [gandalf, w

In [6]:
# this line of code will help test your implementation
(len(train_df) / len(df)), (len(val_df) / len(df)), (len(test_df) / len(df))

(0.8, 0.1, 0.1)

PyTorch has custom Dataset classes that have very useful extensions. **Go to src/dataset.py, and fill out the HeadlineDataset class.** Refer to PyTorch documentation on Dataset classes for help.
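A map-style dataset only has to answer two questions: how many items there are (`__len__`) and what item `i` is (`__getitem__`). The sketch below shows that shape without any PyTorch dependency. `SketchHeadlineDataset`, the `(tokens, label)` row layout, and `unk_id=1` are all assumptions for illustration. The real class would subclass `torch.utils.data.Dataset`, read from the DataFrame, and probably return tensors.

```python
class SketchHeadlineDataset:
    """Toy map-style dataset: rows are (tokens, label) pairs."""

    def __init__(self, vocab, rows, unk_id=1):
        self.vocab = vocab
        self.rows = rows
        self.unk_id = unk_id  # assumption: id used for out-of-vocab tokens

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, index):
        tokens, label = self.rows[index]
        # Tokens missing from the vocab fall back to the unknown id.
        ids = [self.vocab.get(tok, self.unk_id) for tok in tokens]
        return ids, label
```

Note that the validation and test datasets in the next cell reuse `train_vocab`, so any word not seen in training ends up as the unknown token.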

In [7]:
from src.dataset import HeadlineDataset
from torch.utils.data import RandomSampler
#print(train_df)

train_dataset = HeadlineDataset(train_vocab, train_df)
val_dataset = HeadlineDataset(train_vocab, val_df)
test_dataset = HeadlineDataset(train_vocab, test_df)

# Now that we're wrapping our dataframes in PyTorch datasets, we can make use of PyTorch Random Samplers.
train_sampler = RandomSampler(train_dataset)
val_sampler = RandomSampler(val_dataset)
test_sampler = RandomSampler(test_dataset)

We can now use PyTorch DataLoaders to batch our data for us. **Go to src/dataset.py, and fill out collate_fn.** Refer to PyTorch documentation on Dataloaders for help.
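Headlines have different lengths, so the main job of a collate function is to pad every sequence in a batch to the same length. Here is a rough sketch of that idea on plain lists. `sketch_collate` and `pad_id=0` are assumptions, and the real collate_fn would return tensors (`torch.nn.utils.rnn.pad_sequence` can do the padding for you).

```python
def sketch_collate(batch, pad_id=0):
    """Pad each id sequence in the batch to the length of the longest one."""
    max_len = max(len(ids) for ids, _ in batch)
    padded = [ids + [pad_id] * (max_len - len(ids)) for ids, _ in batch]
    labels = [label for _, label in batch]
    return padded, labels
```

Padding per batch, rather than to one global maximum, keeps short batches small.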

In [8]:
from torch.utils.data import DataLoader
from src.dataset import collate_fn
BATCH_SIZE = 16
train_iterator = DataLoader(train_dataset, batch_size=BATCH_SIZE, sampler=train_sampler, collate_fn=collate_fn)
val_iterator = DataLoader(val_dataset, batch_size=BATCH_SIZE, sampler=val_sampler, collate_fn=collate_fn)
test_iterator = DataLoader(test_dataset, batch_size=BATCH_SIZE, sampler=test_sampler, collate_fn=collate_fn)


In [9]:
# # Use this to test your collate_fn implementation.

# # You can look at the shapes of x and y or put print 
# # statements in collate_fn while running this snippet

for x, y in test_iterator:
    print(x,y)
    break
test_iterator = DataLoader(test_dataset, batch_size=BATCH_SIZE, sampler=test_sampler, collate_fn=collate_fn)


tensor([[  191,  3553, 21379,     1,   345,   632,    38,  2620,     1,  2157,
          3486,    12,   241,     1,    35, 11588,     0,     0,     0,     0,
             0],
        [   11,  1260,   949,  9382,   451,     1,    38,    39,    54,   174,
          4099,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0],
        [    1,  3279, 14542,    35,   345,    35,   296,    55,     1,    43,
           696,  6668,    12,  1176,     0,     0,     0,     0,     0,     0,
             0],
        [   25,  1022,   311,   621,   323,    65,   198,   667,   311,  3375,
          3376,    38,  6091,  7461,    38,   125,     1,     1,  1272,    55,
             3],
        [ 4338,  5717,  1548,     1,  2931,    55,    61, 16290,     1,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0],
        [ 5478,   132,    15,   471,     1,   235,  7554,    24,     1,   685,
            55,     1,  1070,     0,     0,   

### Part 2: Modeling [10 pts]
Let's move to modeling, now that we have dataset iterators that batch our data for us. **Go to src/model.py, and follow the instructions in the file to create a basic neural network. Then, create your model using the class, and define hyperparameters.** 

In [10]:
from src.models import ClassificationModel
model = None
### YOUR CODE GOES HERE (1 line of code) ###
model = ClassificationModel(len(train_vocab),int(len(train_vocab)/4),128)

# model.to(device)
# # 
### YOUR CODE ENDS HERE ###

In the following cell, **instantiate the model with some hyperparameters, and select an appropriate loss function and optimizer.** 

Hint: we already use sigmoid in our model. What loss functions are available for binary classification? Feel free to look at PyTorch docs for help!
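Since the model ends in a sigmoid, each output is a probability p that the label is 1. One loss that pairs naturally with that is binary cross-entropy, -[y log p + (1 - y) log(1 - p)], averaged over the batch. The small sketch below computes it by hand so you can see what the number means. `sketch_bce` is a hypothetical helper, and in the notebook you would use a PyTorch loss (`torch.nn.BCELoss` takes probabilities and averages by default).

```python
import math


def sketch_bce(probs, targets):
    """Mean binary cross-entropy for predicted probabilities in (0, 1)."""
    losses = [
        -(y * math.log(p) + (1 - y) * math.log(1 - p))
        for p, y in zip(probs, targets)
    ]
    return sum(losses) / len(losses)
```

A model that always outputs 0.5 scores log(2), about 0.693, per example, which is a handy baseline when reading your training loss.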

In [11]:
from torch.optim import AdamW

criterion, optimizer = None, None
### YOUR CODE GOES HERE ###
# The model already applies a sigmoid, so binary cross-entropy on probabilities (BCELoss) fits;
# CrossEntropyLoss expects unnormalized multi-class logits instead.
criterion, optimizer = torch.nn.BCELoss(), torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1, gamma=0.9)

### YOUR CODE ENDS HERE ###

### Part 3: Training and Evaluation [10 Points]
The final part of this HW involves training the model, and evaluating it at each epoch. **Fill out the train and test loops below.**

In [12]:
# returns the total loss calculated from criterion
def train_loop(model, criterion, iterator):
    model.train()
    total_loss = 0
    
    for x, y in tqdm(iterator):
        optimizer.zero_grad()
        # x.to(device)
        # y.to(device)
        ### YOUR CODE STARTS HERE (~6 lines of code) ###
        prediction = model(x)
        # BCELoss needs float targets with the same shape as the predictions, (batch, 1)
        y = y.unsqueeze(1).float()

        loss = criterion(prediction, y)
        total_loss += loss.item()
        loss.backward()
        optimizer.step()
    # decay the learning rate once per epoch
    scheduler.step()
        ### YOUR CODE ENDS HERE ###
    return total_loss

# returns:
# - true: a Python boolean array of all the ground truth values 
#         taken from the dataset iterator
# - pred: a Python boolean array of all model predictions. 
def val_loop(model, criterion, iterator):
    true, pred = [], []
    ### YOUR CODE STARTS HERE (~8 lines of code) ###
    model.eval()  # switch off train-only behavior such as dropout
    with torch.no_grad():  # gradients are not needed for evaluation
        for x, y in tqdm(iterator):
            # x.to(device)
            # y.to(device)
            preds = model(x)

            for i_batch in range(len(y)):
                true.append(y[i_batch])
                # round the sigmoid output to get a hard 0/1 prediction
                pred.append(torch.round(preds[i_batch][0]))
    ### YOUR CODE ENDS HERE ###
    return true, pred


We also need evaluation metrics that tell us how well our model is doing on the validation set at each epoch. **Complete the functions in src/eval.py.**
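As a reference point for what these metrics compute, here is a plain-Python sketch. The names `sketch_accuracy` and `sketch_macro_f1` are hypothetical, and the sketch assumes "binary macro F1" means the unweighted average of the F1 scores for class 1 and class 0. Check the docstrings in your eval file for the exact definition expected.

```python
def sketch_accuracy(true, pred):
    """Fraction of predictions that match the ground truth."""
    correct = sum(1 for t, p in zip(true, pred) if t == p)
    return correct / len(true)


def _f1_for_class(true, pred, cls):
    """F1 when `cls` is treated as the positive class."""
    tp = sum(1 for t, p in zip(true, pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(true, pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(true, pred) if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    # Define F1 as 0 when the class never appears in either list.
    return 2 * tp / denom if denom else 0.0


def sketch_macro_f1(true, pred):
    """Average the per-class F1 over the two classes."""
    return (_f1_for_class(true, pred, 1) + _f1_for_class(true, pred, 0)) / 2
```

Macro F1 is useful here because a model that predicts only the majority class can still get decent accuracy while scoring an F1 of 0 on the other class.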

In [14]:
# To test your eval implementation, let's see how well the untrained model does on our dev dataset.
# It should do pretty poorly.
from src.eval_utils import binary_macro_f1, accuracy
true, pred = val_loop(model, criterion, val_iterator)
print(binary_macro_f1(true, pred))
print(accuracy(true, pred))

  0%|          | 0/150 [00:00<?, ?it/s]


IndexError: index out of range in self

### Part 4: Actually training the model [1 point]
Watch your model train :D You should be able to achieve a validation F-1 score of at least .8 if everything went correctly. **Feel free to adjust the number of epochs to prevent overfitting or underfitting.**

In [15]:
TOTAL_EPOCHS = 3
for epoch in range(TOTAL_EPOCHS):
    train_loss = train_loop(model, criterion, train_iterator)
    true, pred = val_loop(model, criterion, val_iterator)
    print(f"EPOCH: {epoch}")
    print(f"TRAIN LOSS: {train_loss}")
    print(f"VAL F-1: {binary_macro_f1(true, pred)}")
    print(f"VAL ACC: {accuracy(true, pred)}")

  0%|          | 0/1200 [00:00<?, ?it/s]


IndexError: index out of range in self

We can also look at the model's performance on the held-out test set, using the same val_loop we wrote earlier.

In [None]:
true, pred = val_loop(model, criterion, test_iterator)
print(f"TEST F-1: {binary_macro_f1(true, pred)}")
print(f"TEST ACC: {accuracy(true, pred)}")

### Part 5: Analysis [5 points]
Answer the following questions:
#### 1. What happens to the vocab size as you change the cutoff in the cell below? Can you explain this in the context of [Zipf's Law](https://en.wikipedia.org/wiki/Zipf%27s_law)?

In [None]:
tmp_vocab, _ = generate_vocab_map(train_df, cutoff = 1)
len(tmp_vocab)
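If you want a feel for the effect before running this on the real data, the toy sketch below counts how many distinct tokens survive different cutoffs. It assumes a token is kept when it occurs more than `cutoff` times and ignores any special tokens. `vocab_size_at_cutoff` is a hypothetical helper, and your generate_vocab_map may define the cutoff slightly differently.

```python
from collections import Counter


def vocab_size_at_cutoff(token_lists, cutoff):
    """Count distinct tokens that occur more than `cutoff` times."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    return sum(1 for freq in counts.values() if freq > cutoff)


toy = [
    ["the", "man", "bites", "the", "dog"],
    ["the", "dog", "sues", "the", "man"],
    ["the", "senate", "adjourns"],
]
sizes = {c: vocab_size_at_cutoff(toy, c) for c in (0, 1, 2)}
# sizes == {0: 7, 1: 3, 2: 1}
```

Even in this tiny corpus, most word types occur only once, so the first step of the cutoff removes more than half of the vocabulary.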

#### 2. Can you describe what cases the model is getting wrong in the withheld test set? 

To do this, you'll need to write a new loop, ``val_train_loop_incorrect``, that returns the incorrectly classified sequences, **and** you'll need to decode these sequences back into words. 
Thankfully, you've already created a map that can convert encoded sequences back to regular English: you will find the ``reverse_vocab`` variable useful.

```
# e.g. reversing the map {"hi": 2, "hello": 3, "UNK": 0} gives {2: "hi", 3: "hello", 0: "UNK"},
# which turns [2, 3, 0] into this => ["hi", "hello", "UNK"]
```
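One possible shape for the two pieces, sketched on plain lists: keep only the examples where the prediction disagrees with the label, then decode the ids. `sketch_decode`, `sketch_incorrect`, and the optional `pad_id` argument are assumptions for illustration, and your loop would collect the sequences from the iterator batch by batch.

```python
def sketch_decode(ids, reverse_vocab, pad_id=None):
    """Turn a list of ids back into tokens, optionally dropping padding."""
    return [
        reverse_vocab.get(i, "UNK")  # ids missing from the map read as "UNK"
        for i in ids
        if pad_id is None or i != pad_id
    ]


def sketch_incorrect(sequences, true, pred):
    """Keep (sequence, true label, predicted label) where the model was wrong."""
    return [(s, t, p) for s, t, p in zip(sequences, true, pred) if t != p]


reverse = {2: "hi", 3: "hello", 0: "UNK"}
words = sketch_decode([2, 3, 0], reverse)
# words == ["hi", "hello", "UNK"]
```

Reading the decoded headlines next to their true and predicted labels makes it much easier to spot patterns in the mistakes.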

In [None]:
# Implement this however you like! It should look very similar to val_loop.
# Pass the test_iterator through this function to look at errors in the test set.
def val_train_loop_incorrect(model, iterator):
    return