# PyTorch Lightning ⚡️
In the last notebook, we learned how to fine-tune BERT for sequence classification. In this module, you will be introduced to PyTorch Lightning, which speeds up the development of machine learning models by taking care of a lot of boilerplate code (moving the model to different devices, running on multiple devices, logging, etc.). We will use it to simplify our previously written notebooks.

__What you will learn:__

In this notebook, you will learn how to design custom neural networks. You will also learn how to use PyTorch Lightning for rapid prototyping of models with minimal code. By the end of this notebook, you should have the skills to quickly develop a custom model on top of BERT for a sequence classification (or related) task.

Topics covered:
- PyTorch Lightning
- Defining your own neural network with BERT


# Defining Your Own Model
In the previous notebook, we used the default `Linear` layer that is part of the `BertForSequenceClassification` class for training our sentence classifier. 

As a machine learning practitioner, you will often come across situations where you want to use a custom neural network for your downstream task. In such cases, you would use BERT to represent the input text and add other neural network layers on top of BERT.

In this module, we will not use `BertForSequenceClassification`; instead, we will define our own custom neural network. More specifically, we will add a multi-layer perceptron (MLP) on top of the BERT model.

But, first, let's go ahead and set up our environment.

In [None]:
!pip install --quiet transformers
!pip install tensorboard==1.15.0
!pip install --quiet pytorch-lightning

#### BERT + MLP
- We first define a custom neural network called `BertSentClassification` by inheriting from `nn.Module`.
- The input text is first run through the BERT model. We want the vector representation corresponding to the `[CLS]` token. `BertModel` returns this pooled `[CLS]` representation at index 1 of its output, as shown in this [link](https://github.com/huggingface/transformers/blob/c239dcda83c65dd5b1453174c4609c5f6ce1698d/src/transformers/models/bert/modeling_bert.py#L1369). More curious learners may also go through the `BertModel` documentation [here](https://huggingface.co/transformers/model_doc/bert.html#bertmodel).
- We then apply dropout to the vectors.
- Finally, we run the vectors through a multi-layer perceptron.

It is not important to follow the exact neural network design that I have defined here; it is enough to know how to use BERT in combination with other neural networks.
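
To make the shapes in the steps above concrete, here is a minimal sketch of the dropout and MLP head on its own. It assumes a random tensor as a stand-in for BERT's pooled `[CLS]` output (so nothing has to be downloaded); the batch size of 8 is an arbitrary choice for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for BERT's pooled [CLS] output: 8 sentences, 768 dimensions each
# (assumption: random values instead of real BERT vectors)
pooled_output = torch.randn(8, 768)

dropout = nn.Dropout(0.2)
mlp = nn.Sequential(
    nn.Linear(768, 150),  # 768 -> hidden size
    nn.ReLU(),
    nn.Linear(150, 4),    # hidden size -> one score per class
)

logits = mlp(dropout(pooled_output))
print(logits.shape)  # torch.Size([8, 4])
```

Each sentence ends up with four scores, one per sentiment class.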

In [1]:
import torch.nn as nn
from transformers import AutoModel

class BertSentClassification(nn.Module):
  def __init__(self, hidden_sz=150, output_sz=4, dropout_prob=0.2):
    super().__init__()
    
    self.bert_model = AutoModel.from_pretrained(
          pretrained_model_name_or_path="bert-base-uncased"
    )
    self.dropout = nn.Dropout(dropout_prob)
    # 768 is the dimension of the [CLS] vector obtained from BERT 
    self.mlp = nn.Sequential(
        nn.Linear(768, hidden_sz),
        nn.ReLU(),
        nn.Linear(hidden_sz, output_sz)
    )

  def forward(self, input_ids, attention_mask, labels=None, token_type_ids=None):
    outputs = self.bert_model(
        input_ids,
        attention_mask=attention_mask,
        token_type_ids=token_type_ids
    )

    # BERT vectors corresponding to the [CLS] token
    pooled_output = outputs[1]
    pooled_output = self.dropout(pooled_output)
    logits = self.mlp(pooled_output)

    return logits

# Introducing PyTorch Lightning
As we discussed in the last notebook, model development using traditional PyTorch has many issues:
- we have to explicitly transfer models and their inputs to the appropriate (and same) device (CPU/GPU/TPU).
- we have to set the right train/eval mode so that operations like `BatchNorm` and `Dropout` behave correctly.
- we have to repeat boilerplate code such as `backward()`, `optimizer.step()`, etc. every single time a new model is developed.

PyTorch Lightning abstracts away a lot of these details, so that you as an ML practitioner can focus on model development rather than on engineering-related bug fixes. It also adds some very interesting features that help in faster model development.
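
As a reminder of what that boilerplate looks like, here is a rough sketch of a hand-written training loop in plain PyTorch. The tiny `nn.Linear` model and the random data are stand-ins chosen for illustration, not the BERT model from this notebook.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(10, 4).to(device)  # toy stand-in for BERT + MLP
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
inputs = torch.randn(32, 10)
labels = torch.randint(0, 4, (32,))

losses = []
for epoch in range(3):                           # explicit epoch loop
    model.train()                                # explicit train mode
    x, y = inputs.to(device), labels.to(device)  # explicit device transfer
    optimizer.zero_grad()                        # reset old gradients
    loss = F.cross_entropy(model(x), y)
    loss.backward()                              # compute gradients
    optimizer.step()                             # update parameters
    losses.append(loss.item())                   # manual loss tracking
```

Every commented line is something Lightning can take over once the model is written as a `LightningModule`.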

To make the best use of PyTorch Lightning, we must implement a number of methods, most importantly:
- `configure_optimizers()`
- `training_step()`
- `validation_step()`
- `train_dataloader()`
- `val_dataloader()`

The purpose and behaviour of each of these methods will be explained in the subsequent cells.

__Note:__ Since PyTorch Lightning requires the implementation of several methods, the effect of each class method cannot be shown individually. Please note that the code cells from step 0 to step 3 are non-executable markdowns. __Step 4: Putting it all together__ has the executable cells.

## Step 0: Setup PyTorch Lightning 

In [4]:
# you might need the below
# !pip install --quiet pytorch-lightning
# or
# !conda install -c conda-forge pytorch-lightning

# !pip install tensorboard==1.14.0
# or
# !conda install tensorboard==1.14.0

In [None]:
import pytorch_lightning as pl

## Step 1: Inherit `pl.LightningModule`

To use PyTorch Lightning, a custom neural network needs to inherit from the `pl.LightningModule` class. So, in our previous implementation of the neural network, we inherit from `pl.LightningModule` instead of `nn.Module`. The rest of the code is exactly the same.

```python
class BertSentClassification(pl.LightningModule):
  def __init__(self, hidden_sz=150, output_sz=4, dropout_prob=0.2):
    super().__init__()
    
    self.bert_model = AutoModel.from_pretrained(
          pretrained_model_name_or_path="bert-base-uncased"
    )
    self.dropout = nn.Dropout(dropout_prob)
    self.mlp = nn.Sequential(
        nn.Linear(768, hidden_sz),
        nn.ReLU(),
        nn.Linear(hidden_sz, output_sz)
    )

  def forward(self, input_ids, attention_mask, labels=None, token_type_ids=None):
    outputs = self.bert_model(
        input_ids,
        attention_mask=attention_mask,
        token_type_ids=token_type_ids
    )

    # BERT vectors corresponding to the [CLS] token
    pooled_output = outputs[1]
    pooled_output = self.dropout(pooled_output)
    logits = self.mlp(pooled_output)

    return logits
```

## Step 2: Define optimizer

PyTorch Lightning keeps logically separate blocks such as the optimizer, the dataloaders, and the training step apart by using reserved method names such as `configure_optimizers()`, `training_step()`, etc.

We will use `configure_optimizers()` to define our optimizer:

```python
class BertSentClassification(pl.LightningModule):

  def __init__(self, hidden_sz=150, output_sz=4, dropout_prob=0.2):
    ...

  def configure_optimizers(self):
    optimizer = torch.optim.SGD(self.parameters(), lr=1e-2)

    return optimizer
```
`configure_optimizers` is a reserved method name. PyTorch Lightning will look for the optimizer in this method. We pass `self.parameters()` as an argument to `SGD` so that the optimizer updates all the parameters of the model (both BERT and the MLP).
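
To see what the optimizer does with the parameters it receives, here is a small sketch in plain PyTorch. It assumes a toy `nn.Linear` model in place of the Lightning module and checks a single SGD step by hand.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(3, 2)  # toy stand-in for the full model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

# The optimizer now holds references to the weight and the bias
num_param_tensors = len(optimizer.param_groups[0]["params"])

before = model.weight.detach().clone()
model(torch.ones(1, 3)).sum().backward()  # populate .grad on each parameter
grad = model.weight.grad.clone()
optimizer.step()

# Plain SGD update: new_weight = old_weight - lr * grad
expected = before - 1e-2 * grad
print(torch.allclose(model.weight.detach(), expected))
```

In Lightning, you only return the optimizer; the `backward()` and `step()` calls shown here are made for you.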

## Step 3: Training and validation loop

### Training loop
The code for training goes in a method called `training_step(self, batch, batch_idx)`. As you might have guessed, this block describes the training done for a single batch. Tracking the loss per epoch is handled automatically by PyTorch Lightning.

Some changes from our previous implementation that you can notice right away:
- There's no epoch loop as we saw in the previous version of this code. This loop is handled by the `max_epochs` parameter of the PyTorch Lightning `Trainer` object, as you will soon see.
- The `forward()` computation of the module is automatically done when the `self()` call is made (this is a native PyTorch feature, and not a special feature of Lightning).
- We have also gotten rid of the `input_ids.to(device)` code, which, quite frankly, was too much to ask from an ML practitioner.
- The logging of the loss is taken care of by the `log()` method. Lightning also lets you control whether to log per step, per epoch, etc., as you will see in our final implementation.


```python
    ...

    def training_step(self, batch, batch_idx):
        input_ids, attention_mask, labels = batch

        logits = self(
            input_ids, 
            attention_mask=attention_mask, 
            labels=labels
        )

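        # Gradients only exist after backward(), which Lightning runs once this
        # method returns the loss, so clipping here has no useful effect.
        # Clip via pl.Trainer(gradient_clip_val=1.0) instead.
        loss = F.cross_entropy(logits, labels)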
        self.log('train_loss', loss)

        return loss
```
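
A detail worth knowing about the loss used above: `F.cross_entropy` expects raw, unnormalized scores and integer class indices, because it applies log-softmax internally. The sketch below uses made-up logits and labels to check this against the two-step computation.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(5, 4)              # raw scores: 5 examples, 4 classes
labels = torch.tensor([0, 3, 1, 2, 2])  # class indices, not one-hot vectors

loss = F.cross_entropy(logits, labels)

# Same result in two explicit steps: log-softmax, then negative log-likelihood
manual = F.nll_loss(F.log_softmax(logits, dim=1), labels)
print(torch.allclose(loss, manual))
```

This is why the model's `forward()` should return the output of the last `Linear` layer directly when the loss is `F.cross_entropy`.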

### Validation loop
All computations related to the validation loop go in the `validation_step(self, batch, batch_idx)` method. Similar to the training loop, this block corresponds to the operations for a single batch, and the aggregation of the results is handled by PyTorch Lightning.

One of my favourite features of PyTorch Lightning is the ease with which we can calculate evaluation metrics. Notice that in `__init__()` we define accuracy and F1 metrics using PyTorch Lightning's `metrics` API. In the validation step, calculating a metric is then as simple as calling the metric object with the logits and the true labels. This is much simpler than our earlier implementation of copying the logits to the CPU, converting them to NumPy arrays, performing argmax, and finally calculating the evaluation metrics.

```python
    def __init__(self, dataset, hidden_sz=200, output_sz=4, dropout_prob=0.5):
        ...

        # define metrics
        self.valid_acc = pl.metrics.Accuracy()
        self.valid_f1 = pl.metrics.FBeta(num_classes=output_sz, beta=1)
        
    def validation_step(self, batch, batch_idx):
        input_ids, attention_mask, labels = batch

        logits = self(
            input_ids, 
            attention_mask=attention_mask, 
            labels=labels
        )
        loss = F.cross_entropy(logits, labels)
        self.log('validation_accuracy', self.valid_acc(logits, labels))
        self.log("validation_f1", self.valid_f1(logits, labels))
        
        return loss
```
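
To make clear what these two metrics measure, here is a hand-rolled sketch of accuracy and macro-averaged F1 in plain PyTorch. This is not how Lightning implements its metrics; the logits and labels are invented for illustration.

```python
import torch

# Made-up scores for 6 examples and 4 classes
logits = torch.tensor([
    [2.0, 0.1, 0.1, 0.1],  # predicts class 0
    [0.1, 2.0, 0.1, 0.1],  # predicts class 1
    [0.1, 0.1, 2.0, 0.1],  # predicts class 2
    [0.1, 0.1, 2.0, 0.1],  # predicts class 2
    [0.1, 0.1, 0.1, 2.0],  # predicts class 3
    [2.0, 0.1, 0.1, 0.1],  # predicts class 0
])
labels = torch.tensor([0, 1, 1, 2, 3, 3])

preds = logits.argmax(dim=1)
accuracy = (preds == labels).float().mean().item()

f1_per_class = []
for c in range(4):
    tp = ((preds == c) & (labels == c)).sum().item()
    fp = ((preds == c) & (labels != c)).sum().item()
    fn = ((preds != c) & (labels == c)).sum().item()
    denom = 2 * tp + fp + fn
    f1_per_class.append(2 * tp / denom if denom else 0.0)

# Macro averaging: every class counts equally, regardless of its size
macro_f1 = sum(f1_per_class) / len(f1_per_class)
print(round(accuracy, 3), round(macro_f1, 3))  # 0.667 0.667
```

Four of the six predictions are correct, and every class happens to have an F1 of 2/3, so both numbers come out the same here.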

### Dataloaders
Another feature of PyTorch Lightning is that it encapsulates the dataloader implementation in the model class, so that you don't have to go looking for the data sources. This is done with the `train_dataloader` and `val_dataloader` methods:

```python
  def __init__(..., dataset, ...):
    self.dataset = dataset
  
  ...

  def train_dataloader(self):
    train_sampler = RandomSampler(self.dataset["train"])
    
    return DataLoader(
        dataset=self.dataset["train"],
        sampler=train_sampler,
        batch_size=16
    )


  def val_dataloader(self):
    val_sampler = SequentialSampler(self.dataset["val"])
    
    return DataLoader(
        dataset=self.dataset["val"],
        sampler=val_sampler,
        batch_size=16
    )
```
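
The difference between the two samplers is easiest to see on a toy dataset. The sketch below assumes ten fake examples in place of the tweets and a batch size of 4.

```python
import torch
from torch.utils.data import (
    TensorDataset, DataLoader, RandomSampler, SequentialSampler
)

# Toy stand-in for the tokenized tweets: 10 examples, 6 token ids each
input_ids = torch.arange(60).reshape(10, 6)
attention_mask = torch.ones(10, 6, dtype=torch.long)
labels = torch.arange(10) % 4
toy_data = TensorDataset(input_ids, attention_mask, labels)

# Training data is shuffled; validation data keeps its original order
train_loader = DataLoader(toy_data, sampler=RandomSampler(toy_data), batch_size=4)
val_loader = DataLoader(toy_data, sampler=SequentialSampler(toy_data), batch_size=4)

batch_shapes = [tuple(ids.shape) for ids, mask, y in val_loader]
first_val_labels = next(iter(val_loader))[2].tolist()
seen_in_training = sorted(
    row[0].item() for ids, mask, y in train_loader for row in ids
)

print(batch_shapes)      # [(4, 6), (4, 6), (2, 6)]
print(first_val_labels)  # [0, 1, 2, 3]
```

Each batch is an `(input_ids, attention_mask, labels)` triple, which is exactly what `training_step` and `validation_step` unpack.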

# Step 4: Putting it all together

## Load dataset

In [None]:
# from google.colab import files
# climate_change_dataset = files.upload()
#use this when in colab (which is more ideal due to having a GPU option)

Saving twitter_sentiment_data.csv to twitter_sentiment_data.csv


The preprocessing of the data is the same implementation that we saw in the previous notebooks:

In [3]:
import pandas as pd
from keras.preprocessing.sequence import pad_sequences
import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from transformers import AutoModelForSequenceClassification, AutoTokenizer

bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
MAX_LEN = 64
label2id = {id:id+1 for id in range(-1, 3, 1)}
id2label = {v:k for k, v in label2id.items()}

def convert_examples_to_features(tweets, labels):
  input_ids = [
      bert_tokenizer.encode(tweet, add_special_tokens=True) for tweet in tweets
  ]

  input_ids = pad_sequences(
      input_ids,
      maxlen=MAX_LEN,
      dtype="long", 
      value=bert_tokenizer.pad_token_id,
      padding="post",
      truncating="post"
  )

  input_ids = torch.tensor(input_ids)
  attention_masks = torch.tensor([[int(tok > 0) for tok in tweet] for tweet in input_ids])
  labels = torch.tensor([label2id[label] for label in labels])

  return TensorDataset(input_ids, attention_masks, labels)
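
The padding and masking logic in the cell above can be sketched without Keras as follows. The helper name `pad_and_mask` and the shortened `MAX_LEN` are assumptions made for this illustration; the token ids are example values, with 101 and 102 being the `[CLS]` and `[SEP]` ids of `bert-base-uncased`.

```python
import torch

PAD_ID = 0   # bert-base-uncased uses 0 as its [PAD] token id
MAX_LEN = 8  # shortened from 64 for readability


def pad_and_mask(token_ids, max_len=MAX_LEN, pad_id=PAD_ID):
    """Truncate or right-pad one sequence and build its attention mask."""
    ids = token_ids[:max_len]                    # truncating="post"
    ids = ids + [pad_id] * (max_len - len(ids))  # padding="post"
    mask = [int(tok != pad_id) for tok in ids]   # 1 = real token, 0 = padding
    return torch.tensor(ids), torch.tensor(mask)


ids, mask = pad_and_mask([101, 7592, 2088, 102])
print(ids.tolist())   # [101, 7592, 2088, 102, 0, 0, 0, 0]
print(mask.tolist())  # [1, 1, 1, 1, 0, 0, 0, 0]
```

The attention mask tells BERT to ignore the padded positions, so the padding does not influence the sentence representation.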

Split the dataset into train and validation data:

In [4]:
from sklearn.model_selection import train_test_split

df = pd.read_csv("twitter_sentiment_data.csv")
dataset = convert_examples_to_features(df.message, list(df.sentiment))
train_data, val_data, train_labels, val_labels = train_test_split(
    dataset,
    list(df.sentiment), 
    random_state=1234,
    test_size=0.2
)
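
As an aside, PyTorch has its own utility for splitting a dataset. The sketch below is an alternative to the scikit-learn call above, not the approach this notebook uses; it assumes a toy `TensorDataset` of 50 examples and the same 80/20 ratio.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Toy dataset: 50 examples with 2 features each and a label in {0, 1, 2, 3}
toy = TensorDataset(torch.arange(100).reshape(50, 2), torch.arange(50) % 4)

n_val = int(0.2 * len(toy))
train_part, val_part = random_split(
    toy,
    [len(toy) - n_val, n_val],
    generator=torch.Generator().manual_seed(1234),  # reproducible split
)

print(len(train_part), len(val_part))  # 40 10
```

Both parts remain `Dataset` objects, so they can be passed straight to a `DataLoader`.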

## Complete PyTorch Lightning Model

After combining all the individual logical parts, the model implementation looks as follows:

In [8]:
import torch.nn as nn
import torch.nn.functional as F
from sklearn.metrics import f1_score
import numpy as np
from torch.utils.data import DataLoader, TensorDataset, RandomSampler, SequentialSampler


class BertSentClassification(pl.LightningModule):
    def __init__(self, dataset, hidden_sz=200, output_sz=4, dropout_prob=0.2):
        super().__init__()
        self.dataset = dataset

        # Load pre-trained model
        self.bert_model = AutoModel.from_pretrained(
              pretrained_model_name_or_path="bert-base-uncased"
        )
        # Add dropout layer
        self.dropout = nn.Dropout(dropout_prob)
        # Add MLP layer
        self.mlp = nn.Sequential(
            nn.Linear(768, hidden_sz), #768 is the size of BERT output
            nn.ReLU(),
            nn.Linear(hidden_sz, output_sz),
            # no Softmax here: F.cross_entropy expects raw logits
        )

        # define metrics
        self.valid_acc = pl.metrics.Accuracy()
        self.valid_f1 = pl.metrics.FBeta(
            num_classes=output_sz,
            beta=1,
            average="macro"
          )

        
    def forward(self, input_ids, attention_mask, labels=None, token_type_ids=None):
        outputs = self.bert_model(
          input_ids,
          attention_mask=attention_mask,
          token_type_ids=token_type_ids
        )

        # BERT vectors corresponding to the [CLS] token
        pooled_output = outputs[1]
        pooled_output = self.dropout(pooled_output)
        logits = self.mlp(pooled_output)

        return logits

    def configure_optimizers(self):
        # Define the optimizer here
        optimizer = torch.optim.SGD(self.parameters(), lr=5e-3)
        return optimizer

    def training_step(self, batch, batch_idx):
        # training_step holds the processing for each training step;
        # the epoch loop and batch training loop are abstracted away by
        # PyTorch Lightning
        input_ids, attention_mask, labels = batch

        logits = self(
            input_ids, 
            attention_mask=attention_mask, 
            labels=labels
        )

        # Gradients only exist after backward(), which Lightning runs after
        # this method returns; clip via pl.Trainer(gradient_clip_val=1.0)
        loss = F.cross_entropy(logits, labels)
        self.log(
            "train_loss",
            loss,
            on_epoch=True,
            on_step=True,
            prog_bar=True,
            logger=True
        )

        return loss

    def validation_step(self, batch, batch_idx):
        # implementation corresponding to processing of validation data
        input_ids, attention_mask, labels = batch

        logits = self(
            input_ids, 
            attention_mask=attention_mask, 
            labels=labels
        )
        self.log(
            "validation_accuracy",
            self.valid_acc(logits, labels),
            on_epoch=True,
            prog_bar=True,
            logger=True
        )
        self.log(
            "validation_f1", 
            self.valid_f1(logits, labels),
            on_epoch=True,
            prog_bar=True,
            logger=True
        )
        
    def train_dataloader(self):
        # dataloader corresponding to training data
        train_sampler = RandomSampler(self.dataset["train"])

        return DataLoader(
            dataset=self.dataset["train"],
            sampler=train_sampler,
            batch_size=64
        )
        


    def val_dataloader(self):
        # dataloader corresponding to validation data
        val_sampler = SequentialSampler(self.dataset["val"])

        return DataLoader(
            dataset=self.dataset["val"],
            sampler=val_sampler,
            batch_size=64
        )

## Running PyTorch Lightning Model
```python
model = BertSentClassification(dataset=dataset)
#Run for 5 epochs
trainer = pl.Trainer(max_epochs=5)
trainer.fit(model)
```
Quite often, you might run through an entire training dataset only to realize that your neural network throws a nasty error on the validation set. What's PyTorch Lightning's solution for this? Introducing `fast_dev_run`! 🏎

With `fast_dev_run=True`, the model goes through one training batch and one validation batch, which makes bug catching easier!

In [6]:
dataset = {"train": train_data, "val": val_data}

In [9]:
model = BertSentClassification(dataset=dataset)
# run in fast_dev mode---go through one train batch and validation batch
trainer = pl.Trainer(fast_dev_run=True)
trainer.fit(model)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
Running in fast_dev_run mode: will run a full train, val and test loop using 1 b

Epoch 0:   0%|          | 0/2 [00:00<?, ?it/s] 

  input = module(input)


Epoch 0:  50%|█████     | 1/2 [00:26<00:26, 26.04s/it, loss=1.39, v_num=]
Validating: 0it [00:00, ?it/s][A
Validating:   0%|          | 0/1 [00:00<?, ?it/s][A
Epoch 0: 100%|██████████| 2/2 [00:34<00:00, 17.30s/it, loss=1.39, v_num=, train_loss_step=1.390, train_loss_epoch=1.390, validation_accuracy=0.219, validation_f1=0.169]


Does everything look good so far in `fast_dev_run` mode? Alright, let's do a full training run for 7 epochs! PyTorch Lightning uses the `max_epochs` parameter to specify the number of epochs, and abstracts away the epoch loop. Also, specify the `gpus` parameter so that the model can run on GPUs.

It might take a while to finish training the model. May I suggest a break and a hot beverage! ☕️

In [None]:
model = BertSentClassification(dataset=dataset)
# if you are running on a GPU (which is ideal, a CPU could take forever),
# use the commented-out trainer below instead
trainer = pl.Trainer(max_epochs=7)
# trainer = pl.Trainer(max_epochs=7, gpus="0")  # use GPU at index 0
trainer.fit(model)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
GPU available: False, used: False
TPU available: False, using: 0 TPU cores

  | Name       | Type       | Params
------------------------------------------

Epoch 0:   3%|▎         | 22/688 [09:45<4:55:21, 26.61s/it, loss=1.37, v_num=0, validation_accuracy=0.117, validation_f1=0.0528, train_loss_step=1.360]

In [None]:
!ls lightning_logs/version_5/checkpoints/epoch=6.ckpt

Lightning automatically saves model checkpoints, together with the corresponding hyperparameters, in the `lightning_logs` directory. Keeping the best model according to validation performance relies on a validation metric being monitored by the checkpoint callback; in recent Lightning versions, when nothing is monitored, the checkpoint from the last epoch is the one that is kept.



## Load the best model and perform inference



To load the model from checkpoint, use the `load_from_checkpoint()` method.

In [None]:
#update the following path to reflect your best model
model = BertSentClassification.load_from_checkpoint(
    "lightning_logs/version_5/checkpoints/epoch=6.ckpt", 
    dataset=dataset
)
#Set model in eval mode
model.eval()

Take an input text and convert it into features. Recall that our labels are from `{-1, 0, 1, 2}`, where `-1` indicates a negative outlook towards climate change and `2` represents a positive outlook.

In [None]:
input_text = "global warming is a hoax hahahaha"
labels = [-1]
# Convert examples to features
test_dataset = convert_examples_to_features([input_text], labels=[-1])


Obtain the `input_ids` and the attention mask, and pass them to the model for prediction. Since we are looking at only one sentence here, we unsqueeze the tensors to add an extra (batch) axis, which makes the input compatible with the expected format.

In [None]:
input_ids, attention_mask, _ = next(iter(test_dataset))
# add a new axis for both attention_mask and input_ids
input_ids = input_ids.unsqueeze(0)
attention_mask = attention_mask.unsqueeze(0)

Make a prediction

In [None]:
with torch.no_grad():  # no gradients are needed for inference
    prediction = model(input_ids, attention_mask)

Find the class with the highest score:

In [None]:
prediction = torch.argmax(prediction).item()
id2label[prediction]
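
If you also want to see how confident the model is, you can turn the scores into probabilities before taking the argmax. The sketch below uses hypothetical scores for a single sentence rather than a real model output, together with the same label mapping as in the preprocessing cell.

```python
import torch
import torch.nn.functional as F

# Same mapping as in preprocessing: labels {-1, 0, 1, 2} -> ids {0, 1, 2, 3}
label2id = {label: label + 1 for label in range(-1, 3)}
id2label = {v: k for k, v in label2id.items()}

# Hypothetical raw scores for one sentence, shape (1, 4)
logits = torch.tensor([[2.5, 0.3, -1.0, 0.1]])

probs = F.softmax(logits, dim=1)  # each row sums to 1
class_id = torch.argmax(probs, dim=1).item()
predicted_label = id2label[class_id]

print(predicted_label)  # -1
```

Softmax does not change which class has the highest score, so the argmax over raw scores and over probabilities gives the same label.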

# Homework time
Our code at the moment hard-codes hyperparameters such as the batch size and the learning rate in the implementation. It is not good practice to hard-code hyperparameters this way. PyTorch Lightning provides a better way to handle hyperparameters. Using [this](https://pytorch-lightning.readthedocs.io/en/latest/hyperparameters.html#lightningmodule-hyperparameters) documentation, can you rewrite the model implementation?