# Homework 1 Helper

Hi everyone! Diving head-first into PyTorch is challenging, and there are a lot of different parts at play. Hopefully this notebook can help you a bit with the major challenges.

### Data

The first major hurdle is getting your data processed and ready to be consumed by the model. Your task is multi-label classification: each example can have 0, 1, or more correct labels, and your (text, labels) pairs have to reflect that.
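
To make the multi-label setup concrete, here is a minimal sketch of how label lists of different sizes can be turned into fixed-length 0/1 vectors. The three label names and the helper `multi_hot` are made up for illustration (the real label set comes from your training csv), so treat this as one possible shape for the data rather than the required implementation.

```python
import numpy as np

# Hypothetical toy label set; the real labels come from the training csv.
all_labels = ["movie.gross_revenue", "movie.release_date", "movie.starring.actor"]
label2id = {label: i for i, label in enumerate(all_labels)}


def multi_hot(example_labels):
    """Return a 0/1 vector with a 1 at the position of every label in the example."""
    vector = np.zeros(len(all_labels), dtype=np.float32)
    for label in example_labels:
        vector[label2id[label]] = 1.0
    return vector


print(multi_hot([]))                                               # no correct label
print(multi_hot(["movie.release_date"]))                           # one correct label
print(multi_hot(["movie.starring.actor", "movie.gross_revenue"]))  # two correct labels
```

Note that an example with no labels is simply the all-zero vector, which is why the 'none' label does not need a position of its own.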

This class should take you through the general structure of a `Dataset` object. I've marked a whole bunch of `TODO`s in the comments, along with some explanatory notes as a refresher.

**Be sure to understand what the code you're writing is doing and what it's for!** This is absolutely critical. This structure is nearly the same for almost every neural network you'll write in PyTorch (including Homework 2-4), with some variations depending on the task/dataset and the author of the code. The earlier you understand it and the more practice you get, the better!

In [1]:
from typing import List

import torch
from torch.utils.data import Dataset, DataLoader

import numpy as np
import pandas as pd

from tqdm import tqdm

In [12]:
class MovieDataset(Dataset):
    def __init__(
        self,
        path: str,  # Path to the training data csv
        vocab_size: int = 1_000,  # How many tokens to include in the vocabulary. Feel free to adjust this!
    ):
        # Read the csv using pandas read_csv function.
        self.data = pd.read_csv(path, index_col=0)

        # the given column names are long, so I often rename them for simplicity.
        self.data.columns = ["text", "labels"]

        # There's a problem with this data...some of the rows have a label called 'none', and others
        # are just empty. These are both referring to the same condition, so let's replace the 'none'
        # labels with empty strings to make it easier. Otherwise, you might predict both 'none' and
        # another label, which doesn't make any sense.
        self.data["labels"] = self.data["labels"].str.replace("none", "")

        # For one-hot encoding, we need a list of all unique labels in the dataset and a map between
        # labels and a unique ID. 
        
        # TODO create self.labels: a list of every possible label in the dataset
        # e.g., ['movie.starring.actor', 'movie.gross_revenue', ...]
        # ======================================================================
        self.labels = []
        self.n_labels = len(self.labels)
        
        # TODO create self.label2id: a dictionary which maps the labels from self.labels (above) to a unique integer
        # ======================================================================
        self.label2id = None
        
        # Similarly, we often need to make a token vocabulary for encoding the input text. Note that
        # this isn't necessary for ALL representations, but for some, you will find it useful. 
        # However, we are only creating the vocabulary from the training data. What happens if 
        # the test data has a token we haven't seen before? 
        # To combat this, we default to a particular token, usually something like <unk> ('unknown').
        
        # In the future, you will see datasets with hundreds of thousands of unique tokens. Normally,
        # we only take the N most common tokens and replace everything else with <unk>. 
        # Otherwise, our models would be huge! For this dataset, it's not a problem, but you should know
        # how to do it. 
        
        # TODO create self.vocab: a dictionary which contains the `vocab_size` most common tokens in the text.
        # Here's a hint - check out the `Counter` class from python's `collections` library.
        # ======================================================================
        self.vocab = {}
        
        # also, don't forget to include <unk> (unknown)
        # TODO assign <unk> a unique ID. 
        # ======================================================================
        self.vocab['<unk>'] = None
        
        self.vocab_size = vocab_size + 1 # plus 1 because <unk>
    
    def one_hot_encode_labels(self, labels: List[str]):
        # For multi-label classification, we're going to one-hot encode our labels.
        # This means that instead of having our data be pairs like:
        #   {'input': ..., 'output': 2}
        # We instead might have multiple correct classes, so we do something like
        #   {'input': ..., 'output': [0, 0, 1, 0, ...]}
        # where the output is a list with one element per possible label. Then, a 1 in position N means
        # the label N is a correct answer.
        
        # We need to create such a list from the input to this function, `labels`, which is a list
        # of labels that appear in a particular example. It might be, for example, 
        #   ['movie.starring.actor', 'movie.release_date']
        # Good thing we have self.label2id! That should help us figure out which 
        # index corresponds to which label, so we can write our own function. 
        # Although...this is a very common thing to do in NLP,
        # I'm sure it's available in a library somewhere (hint: sklearn). 
        
        # TODO create encoded: a vector (np.array) which is a one-hot encoded 
        # representation of the input,`labels`. 
        # ======================================================================

        encoded = labels  # do something to the labels!
        return encoded
    
    def tokenize(self, text: str):
        # Luckily, this dataset is already tokenized; that is, each token is separated by a single
        # space. Normally, text has punctuation, hyphenated words, paragraph breaks, etc., which
        # makes tokenization a more complicated problem. For now, just .split() is good enough.
        return text.split()
    
    def encode_tokens(self, tokens: List[str]):
        # Think about how you want to encode your tokens. One-hot encoding? Something else 
        # you've learned in 220 or 243?
        # Whatever you decide, it's convenient if you are able to feed the output of this
        # function directly into your model, like this:
        #   >>> model(encode_tokens(['this', 'is', 'a', 'sentence']))
        
        # Note: that's only a suggestion, you don't have to feed this directly into the model.
        # Feel free to set up your data/model pipeline as you see fit. 
        
        # TODO create encoded: an encoded representation of `tokens`.
        # ======================================================================

        encoded = tokens  # do something to the tokens! 
        return encoded

    def __len__(self):
        # PyTorch expects every Dataset class to implement the __len__ function. 
        # Most of the time, it's very simple like this.
        return len(self.data)

    def __getitem__(self, n: int):
        # TODO get the nth item of your data, and process it so it's ready to be used in your model
        # and for training.
        # ======================================================================
        
        # Make sure the output of this function is either an np.array, a
        # torch.Tensor, or a tuple of several of these. That way, PyTorch
        # can combine them into batches properly using its default collate_fn in the DataLoader.
        # If you're using nn.Embedding, you will have to deal with padding,
        # but I'll leave that for you :)

        input_to_model = None
        labels = None
        return self.encode_tokens(input_to_model), self.one_hot_encode_labels(labels)
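
As a rough illustration of the vocabulary comments in `__init__`, the sketch below keeps the most common tokens with `Counter` and sends everything else to `<unk>`. The toy sentences and the helper names `build_vocab` and `tokens_to_ids` are hypothetical, and assigning ids by frequency rank is just one reasonable choice.

```python
from collections import Counter

# Hypothetical toy corpus; each text is already tokenized by single spaces.
texts = [
    "who directed the movie",
    "who starred in the movie",
    "when was the movie released",
]


def build_vocab(texts, vocab_size):
    """Map the `vocab_size` most common tokens to integer ids, plus one id for <unk>."""
    counts = Counter(token for text in texts for token in text.split())
    vocab = {token: i for i, (token, _) in enumerate(counts.most_common(vocab_size))}
    vocab["<unk>"] = len(vocab)  # one extra id shared by every rare or unseen token
    return vocab


def tokens_to_ids(tokens, vocab):
    """Replace each token with its id, falling back to the <unk> id."""
    unk_id = vocab["<unk>"]
    return [vocab.get(token, unk_id) for token in tokens]


vocab = build_vocab(texts, vocab_size=3)
ids = tokens_to_ids("who directed the sequel".split(), vocab)
print(vocab)
print(ids)
```

Here "directed" is in the training text but too rare to make the cut, and "sequel" was never seen at all; both end up with the same `<unk>` id.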



Now you can test out your code to make sure it's outputting what you expect.

In [6]:
# Instantiate the data; you'll need to upload this file, or download the notebook and run it locally if you want this to work. 
dataset = MovieDataset('./data/hw1_train.csv')

# A small batch size of 2 makes the printed output easier to read while debugging.
data_loader = DataLoader(dataset, batch_size=2)

In [11]:
# Zipping the dataloader with range(N) lets us only print the first N batches
for _, batch in zip(range(5), data_loader):
    # Do something here; maybe print the batch to see if it looks right to you?
    print(batch[0].shape, batch[1].shape)

torch.Size([2, 64]) torch.Size([2, 20])
torch.Size([2, 64]) torch.Size([2, 20])
torch.Size([2, 64]) torch.Size([2, 20])
torch.Size([2, 64]) torch.Size([2, 20])
torch.Size([2, 64]) torch.Size([2, 20])
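
Within a batch, the default `collate_fn` can only stack examples that all have the same shape. If you encode text as a sequence of token ids, one way to get there is to pad or truncate every example to a fixed length. This is only a sketch: `PAD_ID`, `MAX_LEN`, and `pad_or_truncate` are assumed names, and 64 is picked only to match the shapes printed above.

```python
from typing import List

PAD_ID = 0    # hypothetical id reserved for padding
MAX_LEN = 64  # assumed fixed length, chosen to match the shapes printed above


def pad_or_truncate(ids: List[int], max_len: int = MAX_LEN, pad_id: int = PAD_ID) -> List[int]:
    """Cut sequences longer than max_len and right-pad shorter ones with pad_id."""
    clipped = ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))


short = pad_or_truncate([5, 6, 7])
long = pad_or_truncate(list(range(100)))
print(len(short), len(long))
```

If you go this route, make sure the padding id does not collide with a real token id in your vocabulary.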


### Training

You can check out the old Colab notebooks on Canvas, which have some training loops you may find useful. *Dive into Deep Learning* should have some as well.

I won't go too in-depth here, but remember, you are doing multi-label classification, which means you can't use regular cross-entropy loss.

Instead, you'll need *binary* cross entropy. In PyTorch, you'll find `BCEWithLogitsLoss`. You can use it much like the regular cross-entropy loss, but make sure you pay attention to the inputs it expects in the documentation.
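
Here is a small sketch of how `nn.BCEWithLogitsLoss` is called. The tensors are made-up values for a batch of 2 examples and 3 labels. Two things to note: the loss takes raw logits (it applies the sigmoid itself), and the targets must be floats with the same shape as the logits.

```python
import torch
from torch import nn

# Hypothetical batch: 2 examples, 3 possible labels.
logits = torch.tensor([[2.0, -1.0, 0.5], [-0.5, 0.0, 3.0]])  # raw model outputs, no sigmoid applied
targets = torch.tensor([[1.0, 0.0, 1.0], [0.0, 0.0, 1.0]])   # multi-hot labels as floats

loss_fn = nn.BCEWithLogitsLoss()  # applies the sigmoid internally
loss = loss_fn(logits, targets)   # a single scalar, averaged over all 6 cells

# Two-step version for comparison: sigmoid first, then plain binary cross entropy.
manual = nn.functional.binary_cross_entropy(torch.sigmoid(logits), targets)

print(loss.item(), manual.item())
```

The two values should agree up to floating-point error; the fused version is generally preferred because it is more numerically stable.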

In [13]:
n_epochs = 5
learning_rate = 1e-3
model = None # YourModelClass(...)
optimizer = None # torch.optim.SGD(model.parameters(), lr=learning_rate)

In [None]:
for epoch in range(n_epochs):  # n_epochs is an int, so iterate over range(n_epochs)
    pbar = tqdm(data_loader) # tqdm is a progress bar
    pbar.set_description(f'epoch: {epoch}')
    for batch in pbar:  
        # Three main things to do here:
        # 1. run your model on the input
        # 2. calculate loss of your output vs expected output
        # 3. run backpropagation 
        # (optional) 4. Do some logging, e.g., print out loss, average loss over the epoch, etc. 
        # You could also calculate f1 on your training data, just for comparison.
        pass
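
The steps in the comments above could be wired together roughly like this. Everything here is a stand-in so the snippet runs on its own: random tensors instead of the movie data, a single `nn.Linear` layer instead of your model, and names like `toy_loader` and `toy_model` that are not part of the homework code.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Hypothetical stand-in data: 8 examples, 4 features, 3 labels.
features = torch.randn(8, 4)
targets = (torch.rand(8, 3) > 0.5).float()
toy_loader = DataLoader(TensorDataset(features, targets), batch_size=4)

toy_model = nn.Linear(4, 3)  # placeholder for your own model class
loss_fn = nn.BCEWithLogitsLoss()
toy_optimizer = torch.optim.SGD(toy_model.parameters(), lr=0.1)

with torch.no_grad():
    initial_loss = loss_fn(toy_model(features), targets).item()

for epoch in range(50):
    epoch_loss = 0.0
    for inputs, labels in toy_loader:
        toy_optimizer.zero_grad()       # clear gradients left over from the previous step
        logits = toy_model(inputs)      # 1. run the model on the input
        loss = loss_fn(logits, labels)  # 2. loss of the output vs the expected output
        loss.backward()                 # 3. backpropagation
        toy_optimizer.step()            # apply the gradient update
        epoch_loss += loss.item()       # 4. (optional) accumulate loss for logging

with torch.no_grad():
    final_loss = loss_fn(toy_model(features), targets).item()

print(f"loss before: {initial_loss:.4f}, after: {final_loss:.4f}")
```

Forgetting `zero_grad()` is a classic bug: gradients accumulate across steps by default, so the updates quietly become wrong.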
        
        

### Evaluation

Your evaluation loop should look similar to your training loop, as you still have to loop through the items and apply your model to the input. The only difference is that instead of calculating loss using the logits (the output of the model), you'll be converting the model output into your predictions.

Remember, with regular multi-*class* classification, you do an `argmax` to find the index with the highest probability. 

However, with multi-*label* classification, this doesn't work: `argmax` only returns a single index, but there might be multiple correct labels (or none).

Think about how you can decide which values in the model output correspond to a correct label and to an incorrect label. Here's a hint: first, use `torch.sigmoid` to normalize the model outputs to `[0, 1]`.
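
One common way to turn the sigmoid outputs into predictions is to compare each probability against a cutoff. The logits below are made up, and the 0.5 threshold is an assumption rather than a rule; you may find a different value works better on held-out data.

```python
import torch

# Hypothetical logits for 2 examples and 4 labels.
logits = torch.tensor([[3.0, -2.0, 0.1, -0.3], [-4.0, -1.0, -0.2, -3.0]])

probabilities = torch.sigmoid(logits)  # each value now lies in [0, 1]
threshold = 0.5                        # assumed cutoff, worth tuning
predictions = (probabilities > threshold).int()

print(predictions)
```

Unlike `argmax`, this can mark several labels for one example (the first row) or none at all (the second row).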

#### F1 score
Once you have your predictions, you have to calculate the F1 score. You can do this manually...although this is a common thing to do in NLP, so I'm sure there's a library that can do it for you (hint: sklearn, probably a million others).
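
If you want to see what the metric is doing under the hood, here is a hand-rolled sketch of micro-averaged F1 on 0/1 arrays. Micro averaging is an assumption here, so check which averaging the assignment asks for. The function name `micro_f1` and the toy arrays are made up; `sklearn.metrics.f1_score` with `average="micro"` should agree with it.

```python
import numpy as np


def micro_f1(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Micro-averaged F1 over all (example, label) cells of two 0/1 arrays."""
    true_positives = np.sum((y_true == 1) & (y_pred == 1))
    false_positives = np.sum((y_true == 0) & (y_pred == 1))
    false_negatives = np.sum((y_true == 1) & (y_pred == 0))
    denominator = 2 * true_positives + false_positives + false_negatives
    # With no positives anywhere, F1 is undefined; return 0.0 by convention here.
    return float(2 * true_positives / denominator) if denominator > 0 else 0.0


# Hypothetical gold labels and predictions: 2 examples, 3 labels.
y_true = np.array([[1, 0, 1], [0, 0, 1]])
y_pred = np.array([[1, 0, 0], [0, 1, 1]])
score = micro_f1(y_true, y_pred)
print(score)
```

In this toy case there are 2 true positives, 1 false positive, and 1 false negative, so the score is 2·2 / (2·2 + 1 + 1) = 2/3.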

### Conclusion

I hope this helps!