# Colab NLP2 Tutorial

This notebook is a short tutorial that covers basic Google Colab usage and doubles as a PyTorch NLP refresher.

You will train a sentiment analysis model on the Amazon Fine Food Review dataset and evaluate your models in TensorBoard.

**Step 1**

To load code and data into your notebook, you need access to your Google Drive. Mount your drive, then change the directory to the Drive folder that contains the Colab tutorial files.

In [0]:
from google.colab import drive
drive.mount('/content/drive/')

# You can change your current path with notebook magic commands,
# and execute shell commands by adding an exclamation mark to the start.
# For example:

# %cd "/content/drive/My Drive/NLP2/colab tutorial"
# !ls

# Here you should see the training files and tokenizers.py
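
Before moving on, it can help to confirm that the files you expect are actually in the working directory. The sketch below assumes the file names used later in this notebook (`affr_train.csv`, `affr_valid.csv`, `tokenizers.py`); `missing_files` is a hypothetical helper, not part of the course material.

```python
from pathlib import Path

# Assumption: these are the file names used in later steps of this notebook.
EXPECTED_FILES = ["affr_train.csv", "affr_valid.csv", "tokenizers.py"]

def missing_files(directory=".", expected=EXPECTED_FILES):
    """Return the expected file names that are not present in `directory`."""
    directory = Path(directory)
    return [name for name in expected if not (directory / name).is_file()]

missing = missing_files(".")
if missing:
    print("Missing files:", missing)
else:
    print("All expected files found.")
```

If anything is reported missing, check that the `%cd` command points at the folder where you uploaded the files.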

**Step 2**

In this tutorial we use a modified version of the Amazon Fine Food Review (AFFR) dataset. Download the dataset from Canvas, put it on your Google Drive and load it here.

The dataset has two columns:


*   a `Review` column, that contains the review of a product
*   a `Score` column, which is `0` for a bad review, `1` for a neutral review, and `2` for a positive review.


In [0]:
import pandas as pd

train_path = 'affr_train.csv'
valid_path = 'affr_valid.csv'

# Load train data
data_train = pd.read_csv(train_path)
reviews_train = data_train.Review.tolist()
scores_train = data_train.Score.tolist()

# Load valid data
data_valid = pd.read_csv(valid_path)
reviews_valid = data_valid.Review.tolist()
scores_valid = data_valid.Score.tolist()

# Visualize the first few datapoints
data_train.head()
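
A quick look at the label distribution shows whether the three classes are balanced, which matters when you interpret accuracy later. This is a minimal sketch using only the standard library; `toy_scores` is made-up data standing in for `scores_train`, and `label_distribution` is a hypothetical helper.

```python
from collections import Counter

# Label meanings as described for the Score column above.
LABEL_NAMES = {0: "bad", 1: "neutral", 2: "positive"}

def label_distribution(scores):
    """Return {label: fraction of examples} for a list of integer labels."""
    counts = Counter(scores)
    total = len(scores)
    return {label: counts[label] / total for label in sorted(counts)}

# Made-up labels for illustration; in the notebook you would pass scores_train.
toy_scores = [2, 2, 0, 1, 2, 0, 2, 2]
for label, fraction in label_distribution(toy_scores).items():
    print(f"{LABEL_NAMES[label]:>8}: {fraction:.2%}")
```

If one class dominates, a model that always predicts that class already reaches a high accuracy, so compare your results against that baseline.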

**Step 3**

On Canvas you will also find ```tokenizers.py```, which contains a simple tokenizer. We train this tokenizer on the reviews in the training set and limit the vocabulary to at most 10,000 entries to keep the model reasonably fast.

In [0]:
from tokenizers import WordTokenizer

# Train your tokenizer.
tokenizer = WordTokenizer(reviews_train, max_vocab_size=10000)

# Check if everything works.
for sentence in reviews_train[:5]:
    tokenized = tokenizer.encode(sentence, add_special_tokens=False)
    sentence_decoded = tokenizer.decode(tokenized)
    print('original:', sentence)
    print('tokenized:', tokenized)
    print('decoded:', sentence_decoded)
    print()
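
The implementation of `WordTokenizer` lives in `tokenizers.py` and is not shown here. To illustrate what capping the vocabulary means, the sketch below builds a toy whitespace tokenizer that keeps only the most frequent words and maps everything else to an unknown token. All names (`ToyTokenizer`, `<pad>`, `<unk>`) are hypothetical, and the real tokenizer may split text differently and use other special tokens.

```python
from collections import Counter

class ToyTokenizer:
    """Whitespace tokenizer with a frequency-capped vocabulary (illustration only)."""

    def __init__(self, sentences, max_vocab_size):
        counts = Counter(word for s in sentences for word in s.lower().split())
        # Assumption: id 0 is reserved for padding and id 1 for unknown words.
        self.id_to_word = ["<pad>", "<unk>"]
        n_words = max(max_vocab_size - len(self.id_to_word), 0)
        self.id_to_word += [w for w, _ in counts.most_common(n_words)]
        self.word_to_id = {w: i for i, w in enumerate(self.id_to_word)}

    def encode(self, sentence):
        """Map each word to its id; words outside the vocabulary become <unk>."""
        unk = self.word_to_id["<unk>"]
        return [self.word_to_id.get(w, unk) for w in sentence.lower().split()]

    def decode(self, ids):
        """Map ids back to words, joined by single spaces."""
        return " ".join(self.id_to_word[i] for i in ids)

# With max_vocab_size=4 only the two most frequent words fit next to the specials.
toy = ToyTokenizer(["good tea", "good coffee", "bad tea"], max_vocab_size=4)
ids = toy.encode("good tea tastes good")
print(ids)
print(toy.decode(ids))
```

A smaller vocabulary means a smaller embedding table, at the cost of more words collapsing into the unknown token.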

**Step 4a**

PyTorch has built-in classes that handle data loading. We will write a ```torch.utils.data.Dataset``` that handles preprocessing and tokenization, and pass it to a ```torch.utils.data.DataLoader```, which handles shuffling, padding and batching.

To write a `Dataset` class you need two things:


*   A ```__len__``` method, that returns the length of your dataset (the number of items).
*   A ```__getitem__``` method, that returns an instance from your dataset at the given index.

In [0]:
from torch.utils.data import Dataset

class AFFRDataset(Dataset):
    def __init__(self, sentences, labels, tokenizer):
        self.sentences = sentences
        self.labels = labels
        self.tokenizer = tokenizer

    def __len__(self):
        """Returns the number of items in the dataset"""
        return len(self.sentences)

    def __getitem__(self, idx):
        """
        Returns the datapoint at index idx as a tuple (sentence, label),
        where the sentence is tokenized.
        """
        encoded = self.tokenizer.encode(
            self.sentences[idx], add_special_tokens=False)
        return encoded, self.labels[idx]

# Now we can load our train and validation data as PyTorch Datasets.
train_data = AFFRDataset(reviews_train, scores_train, tokenizer)
valid_data = AFFRDataset(reviews_valid, scores_valid, tokenizer)

num_labels = len(set(train_data.labels))
print('num train/valid: {}/{}'.format(len(train_data), len(valid_data)))
print('num labels:', num_labels)

**Step 4b**

Next we need a DataLoader that takes care of batching and shuffling the data. This functionality is already present in the base DataLoader class. However, because sentences are not all the same length, we need to write a custom `collate_fn` that handles padding.

In [0]:
import torch  # needed for torch.LongTensor in padded_collate
from torch.utils.data import DataLoader

def padded_collate(batch):
    """Pad sentences, return sentences and labels as LongTensors."""
    sentences, labels = zip(*batch)
    lengths = [len(s) for s in sentences]
    max_length = max(lengths)
    # Pad each sentence with zeros to max_length
    padded = [s + [0] * (max_length - len(s)) for s in sentences]
    return torch.LongTensor(padded), torch.LongTensor(labels)

train_loader = DataLoader(train_data, batch_size=64, shuffle=True, collate_fn=padded_collate)
valid_loader = DataLoader(valid_data, batch_size=256, shuffle=False, collate_fn=padded_collate)
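
PyTorch also ships a padding helper, `torch.nn.utils.rnn.pad_sequence`, that can replace the manual list padding. The sketch below is one possible variant of the collate function, not the version used in this notebook. It assumes, as above, that index 0 is the padding token, and it additionally returns the unpadded lengths, which recurrent models often need. The name `padded_collate_with_lengths` is hypothetical.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def padded_collate_with_lengths(batch, pad_id=0):
    """Collate (token_ids, label) pairs into padded tensors plus lengths."""
    sentences, labels = zip(*batch)
    tensors = [torch.tensor(s, dtype=torch.long) for s in sentences]
    lengths = torch.tensor([len(s) for s in sentences], dtype=torch.long)
    # batch_first=True gives shape (batch, max_length) instead of (max_length, batch).
    padded = pad_sequence(tensors, batch_first=True, padding_value=pad_id)
    return padded, torch.tensor(labels, dtype=torch.long), lengths

# Toy batch of (token_ids, label) pairs with made-up ids.
toy_batch = [([5, 8, 2], 2), ([7], 0), ([4, 9], 1)]
x, y, lengths = padded_collate_with_lengths(toy_batch)
print(x)
print(y, lengths)
```

Because this variant returns three values per batch, a training loop that unpacks `bx, by` would need to be adapted before you can use it as a `collate_fn`.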

**Step 5**

I've given you a very simple bag-of-words classifier to check that everything works.
To train your model at a decent speed, you want to use Colab's GPU capabilities. You can switch to a GPU session in `Edit -> Notebook settings -> Hardware Accelerator`.

In [0]:
import torch
from torch import nn
import torch.nn.functional as F
from tqdm.notebook import tqdm

class BOWClassifier(nn.Module):
    """
    A basic Bag-of-words Classifier
    """
    def __init__(self, vocab_size, output_size):
        super().__init__()
        self.emb = nn.EmbeddingBag(vocab_size, 64)
        self.fc_out = nn.Linear(64, output_size)
        
    def forward(self, x, y):
        h = self.emb(x)
        out = self.fc_out(h)
        loss = F.cross_entropy(out, y)
        return loss, out


def accuracy(logits, y):
    """Calculate accuracy of a batch."""
    # Softmax is monotonic, so the argmax of the logits is the predicted class.
    pred = logits.argmax(-1)
    return (pred == y).float().mean()


def train_epoch(model, optimizer, data_loader, device):
    """Train model for one epoch"""
    # tqdm notebook turns any iterable into a progress bar. 
    for bx, by in tqdm(data_loader, leave=False):
        model.train()
        optimizer.zero_grad()
        loss, out = model(bx.to(device), by.to(device))
        loss.backward()
        optimizer.step()


def validate(model, data_loader, device):
    """Validate model"""
    model.eval()
    total_accuracy = 0
    total_loss = 0
    num_valid = 0
    for bx, by in data_loader:
        # Don't calculate gradients during validation.
        with torch.no_grad():
            loss, out = model(bx.to(device), by.to(device))
            # .item() converts the 0-dim tensor to a Python float.
            b_accuracy = accuracy(out, by.to(device)).item()
            # Accumulate accuracy and loss as sums rather than batch means,
            # so we can take the mean over the whole validation set later.
            total_accuracy += b_accuracy * out.size(0)
            total_loss += loss.item() * out.size(0)
            num_valid += out.size(0)
    valid_loss = total_loss / num_valid
    valid_acc = total_accuracy / num_valid
    return valid_loss, valid_acc

In [0]:
# Train your model for a few epochs to see if everything works.
print("Using CUDA:", torch.cuda.is_available())
num_epochs = 5
# Fall back to the CPU if no GPU session is active.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = BOWClassifier(tokenizer.vocab_size, 3).to(device)
opt = torch.optim.Adam(model.parameters())

for epoch in range(num_epochs):
    train_epoch(model, opt, train_loader, device)
    valid_loss, valid_acc = validate(model, valid_loader, device)
    print("epoch {} loss {:.3f} accuracy {:.3f}".format(epoch + 1, valid_loss, valid_acc))
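
The `validate` function accumulates sums and divides by the number of samples at the end, rather than averaging the per-batch means. The difference shows up whenever the last batch is smaller than the others. A small worked example with made-up numbers (`weighted_mean` is a hypothetical helper):

```python
def weighted_mean(batch_means, batch_sizes):
    """Mean over all samples, given per-batch means and batch sizes."""
    total = sum(m * n for m, n in zip(batch_means, batch_sizes))
    return total / sum(batch_sizes)

# Made-up example: two full batches of 256 and a final batch of 8.
batch_accs = [0.80, 0.80, 0.50]
batch_sizes = [256, 256, 8]

naive = sum(batch_accs) / len(batch_accs)       # treats every batch equally
exact = weighted_mean(batch_accs, batch_sizes)  # treats every sample equally
print(f"mean of batch means:  {naive:.4f}")
print(f"sample-weighted mean: {exact:.4f}")
```

The naive average lets the 8 samples in the last batch count as much as 256 samples in a full batch, which is why the sample-weighted version is the one to report.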

**Step 6a**

Recent versions of PyTorch have basic TensorBoard support built in. We will use this to visualise the training progress. First we need to install and load the TensorBoard notebook extension.

*Note*: TensorBoard might already be installed on your Colab instance.

In [0]:
!pip install tensorboard
%load_ext tensorboard

**Step 6b**

Modify the training code so that it writes all desired values to a TensorBoard SummaryWriter.
For some inspiration, see the PyTorch TensorBoard documentation: https://pytorch.org/docs/stable/tensorboard.html.

At a minimum, report the training loss, validation loss, training accuracy, and validation accuracy.

In [0]:
def train_epoch(*args):
    """Train model for one epoch and save to Tensorboard"""
    pass

def validate(*args):
    """Validate model and save to Tensorboard"""
    pass
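
As a starting point, the sketch below shows the basic `SummaryWriter` pattern on made-up numbers. It is not a solution to the exercise. The tag names (`loss/train`, `loss/valid`), the `./logs/demo` directory, and the helper `log_dummy_run` are arbitrary choices for illustration.

```python
from torch.utils.tensorboard import SummaryWriter

def log_dummy_run(log_dir, losses):
    """Write (train_loss, valid_loss) pairs as scalars, one pair per step."""
    writer = SummaryWriter(log_dir=log_dir)
    for step, (train_loss, valid_loss) in enumerate(losses):
        # add_scalar(tag, value, global_step): tags that share a prefix
        # before the slash are grouped together in the TensorBoard UI.
        writer.add_scalar("loss/train", train_loss, step)
        writer.add_scalar("loss/valid", valid_loss, step)
    writer.close()  # flush pending events to disk

# Made-up loss values, for illustration only.
log_dummy_run("./logs/demo", [(1.10, 1.05), (0.90, 0.95), (0.80, 0.92)])
```

In your own code you would typically create one writer per run, pass it to `train_epoch` and `validate`, and use the number of training steps or the epoch as the step value.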

**Step 6c**

To view the training progress in TensorBoard, use the `%tensorboard` magic command. To show the TensorBoard app and train at the same time, first launch TensorBoard here, and then start training.

*Note:* Running TensorBoard from a notebook might be impractical in some cases. You can also run TensorBoard locally on your machine if you sync the TensorBoard logs through Google Drive.

In [0]:
%tensorboard --logdir ./logs

**Step 7**

On Colab, you cannot use GPU instances indefinitely. There is a maximum runtime of 12 hours, after which you will need to connect to a different VM. Most of the projects in this course will not need more than 12 hours of training time, but it is always smart to save your model between epochs in case anything goes wrong during training.

In addition to saving the model weights, it is advisable to save the optimizer state and other statistics (such as the number of training steps). [This blog post](https://medium.com/udacity-pytorch-challengers/saving-loading-your-model-in-pytorch-741b80daf3c) gives a good overview of all the options. The two functions below are a good starting point for simple models.

Modify your training script in the previous cells to save the best model, and load the best model here to validate.

In [0]:
def save_model(path, model, optimizer, step):
    """
    Save model, optimizer and number of training steps to path.
    """
    checkpoint = {'state_dict': model.state_dict(),
                  'optimizer': optimizer.state_dict(),
                  'step': step}
    torch.save(checkpoint, path)

def load_model(path, model, optimizer, device):
    """
    Load a model and optimizer state to device from path.
    """
    checkpoint = torch.load(path, map_location=device)
    model.load_state_dict(checkpoint['state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer'])
    return model, optimizer, checkpoint['step']
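
One way to keep only the best model is to track the best validation accuracy seen so far and overwrite the checkpoint whenever it improves. The sketch below demonstrates the idea with a tiny stand-in model and made-up accuracies. The file name `best_model.pt` and the `valid_accs` list are hypothetical; in your notebook the accuracy would come from `validate`.

```python
import os
import tempfile

import torch
from torch import nn

model = nn.Linear(4, 3)  # stand-in for your classifier
optimizer = torch.optim.Adam(model.parameters())
path = os.path.join(tempfile.mkdtemp(), "best_model.pt")

valid_accs = [0.61, 0.70, 0.68, 0.74, 0.73]  # made-up values, one per epoch
best_acc, best_epoch = float("-inf"), None
for epoch, valid_acc in enumerate(valid_accs, start=1):
    # (training for one epoch would happen here)
    if valid_acc > best_acc:
        # New best: remember it and overwrite the checkpoint on disk.
        best_acc, best_epoch = valid_acc, epoch
        torch.save({"state_dict": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                    "step": epoch}, path)

# Restore the best checkpoint before the final validation run.
checkpoint = torch.load(path, map_location="cpu")
model.load_state_dict(checkpoint["state_dict"])
optimizer.load_state_dict(checkpoint["optimizer"])
print("best epoch:", checkpoint["step"], "accuracy:", best_acc)
```

On Colab, pointing the path at your mounted Drive folder keeps the checkpoint available after the VM is recycled.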

**Step 8** (Optional)

Of course, the performance of a bag-of-words classifier is far from state of the art. As a final step, design your own classifier and evaluate it in TensorBoard.

Get creative! PyTorch has implementations for RNNs, CNNs, and even Transformer architectures. A good classifier should be able to achieve at least 85% validation accuracy.
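
As one possible starting point, here is a sketch of an LSTM classifier that keeps the same `forward(x, y) -> (loss, logits)` interface as `BOWClassifier`, so it can be dropped into the existing training loop. It is not a reference solution, and it makes no claim about the accuracy it reaches. The class name and hyperparameters are arbitrary, and it assumes that id 0 is used only for padding at the end of each sentence.

```python
import torch
from torch import nn
import torch.nn.functional as F
from torch.nn.utils.rnn import pack_padded_sequence

class LSTMClassifier(nn.Module):
    """LSTM classifier with the same (loss, logits) interface as BOWClassifier."""

    def __init__(self, vocab_size, output_size, emb_size=64, hidden_size=128, pad_id=0):
        super().__init__()
        self.pad_id = pad_id
        self.emb = nn.Embedding(vocab_size, emb_size, padding_idx=pad_id)
        self.lstm = nn.LSTM(emb_size, hidden_size, batch_first=True)
        self.fc_out = nn.Linear(hidden_size, output_size)

    def forward(self, x, y):
        # Recover unpadded lengths (assumes pad_id never occurs as a real token).
        lengths = (x != self.pad_id).sum(dim=1).clamp(min=1).cpu()
        # Packing makes the LSTM stop at each sentence's true end.
        packed = pack_padded_sequence(
            self.emb(x), lengths, batch_first=True, enforce_sorted=False)
        _, (h_n, _) = self.lstm(packed)
        out = self.fc_out(h_n[-1])  # final hidden state of the last layer
        loss = F.cross_entropy(out, y)
        return loss, out

# Smoke test on a made-up padded batch.
toy_model = LSTMClassifier(vocab_size=100, output_size=3)
x = torch.tensor([[5, 8, 2], [7, 0, 0]])
y = torch.tensor([2, 0])
loss, logits = toy_model(x, y)
print(loss.item(), logits.shape)
```

Unlike the bag-of-words model, this classifier is sensitive to word order, which is one reason recurrent models are a common next step for sentiment analysis.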