<a href="https://colab.research.google.com/github/SamuelMiller413/Finetuning-the-T5-Transformer/blob/main/TA_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Technical Assignment

Attached below is an assignment where you will have to fine-tune T5, a large Seq2Seq language model originally developed by Google. Figuring out how to fine-tune a model is as simple as following a tutorial on Medium, so I thought it would be more interesting to see *what* you want to put through the model rather than *how*. 

Find a good dataset that would be appropriate for a text-to-text task (i.e., one where you input text and the model outputs more text based on the input; machine translation is a good example). Once the model fine-tunes and evaluates, take a look at the evaluation data (it should be stored in a .csv in the 'outputs' folder the ipynb generates). What does the model do well? What errors does it make, and are they systematic? What do you think you could do to improve the fine-tuning? I am not expecting perfect results, but feel free to play around with the training hyperparameters (e.g., number of epochs, learning rate, etc.). Just keep in mind that you should keep the batch size pretty small, since Colab has some significant performance constraints. And remember to keep the tab open while it is fine-tuning, otherwise it will stop! 

Some notes:
- This code will automatically do a train-test split (80/20, I think).
- The results from the test set will be stored in the 'outputs' folder. You should perform your qualitative evaluation on this file.

If you have any questions, or if anything is unclear, please do not hesitate to reach out.

Best,
Raz

## Initial Imports

In [None]:
# Dependencies
!pip install sentencepiece
!pip install transformers
!pip install rich[jupyter]
!pip3 install torch torchvision
!pip install datasets

# required for 'samsum' dataset
!pip install py7zr

# Pytorch Optimizers
!pip install torch_optimizer

Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)

In [None]:
# Importing libraries
import os
import numpy as np
import pandas as pd
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler
import torch_optimizer as optim
import json

# Data
from datasets import load_dataset

# Importing the T5 modules from huggingface/transformers
from transformers import T5Tokenizer, T5ForConditionalGeneration

# logging
from rich.table import Column, Table
from rich import box
from rich.console import Console

# mount drive
from google.colab import drive
drive.mount('/content/drive')


# Setting up the device for GPU usage
from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'


# # Define GPU device based on availability
# device = 0 if torch.cuda.is_available() else -1
# print(device)

Mounted at /content/drive


## Data
*Narration:*

I chose to fine-tune T5 for text summarization. The dataset is from Hugging Face: https://huggingface.co/datasets/samsum. 

Judging from the dataset and its creators' methodology, it offers a nice corpus of accessible, everyday speech. Since this model is meant for use in user-facing APIs, I went with a dataset written in plain, conversational language. 

Here I'll: 

*   load the train/val data and the test data
*   construct an I/O dataframe for each, using the dialogue/summary features to fit the script
*   view/explore the data
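
As part of exploring the data, it can help to check how long the dialogues are before settling on a maximum source length. The sketch below is only a rough, hypothetical check: `truncation_report` is not part of this notebook, the sample dialogues are made up, and whitespace-separated words are just a proxy for T5 tokens (the tokenizer typically produces more tokens than words).

```python
from statistics import mean

def truncation_report(texts, max_words):
    """Summarise text lengths in whitespace-separated words (a rough proxy for tokens)."""
    lengths = [len(text.split()) for text in texts]
    too_long = sum(1 for n in lengths if n > max_words)
    return {
        "mean_length": mean(lengths),
        "longest": max(lengths),
        "share_over_limit": too_long / len(lengths),
    }

# Made-up dialogues in the SAMSum style; in the notebook this could be df['input'].
sample_dialogues = [
    "Amanda: I baked cookies.\r\nJerry: Sure!",
    "Tim: Hi, what's up?\r\nKim: Bad mood tbh.",
    "Sam: hey",
]

report = truncation_report(sample_dialogues, max_words=6)
print(report)
```

A high `share_over_limit` would suggest that many inputs are being cut off, which is worth knowing when reading the predictions later.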

In [None]:
# load train/val data
dataset = load_dataset("samsum")
dataset = dataset['train']

# load test data
dataset_test = load_dataset("samsum")
dataset_test = dataset_test["test"]

Downloading builder script:   0%|          | 0.00/1.40k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/770 [00:00<?, ?B/s]

Downloading and preparing dataset samsum/samsum (download: 2.81 MiB, generated: 10.04 MiB, post-processed: Unknown size, total: 12.85 MiB) to /root/.cache/huggingface/datasets/samsum/samsum/0.0.0/f1d7c6b7353e6de335d444e424dc002ef70d1277109031327bc9cc6af5d3d46e...


Downloading data:   0%|          | 0.00/2.94M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/14732 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/819 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/818 [00:00<?, ? examples/s]

Dataset samsum downloaded and prepared to /root/.cache/huggingface/datasets/samsum/samsum/0.0.0/f1d7c6b7353e6de335d444e424dc002ef70d1277109031327bc9cc6af5d3d46e. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

Reusing dataset samsum (/root/.cache/huggingface/datasets/samsum/samsum/0.0.0/f1d7c6b7353e6de335d444e424dc002ef70d1277109031327bc9cc6af5d3d46e)


  0%|          | 0/3 [00:00<?, ?it/s]

In [None]:
# INSERT DATAFRAME HERE ############################################################
# should be named 'df' and have columns ['input', 'output'] ########################

# construct data frames with the dialogue as 'input' and the summary as 'output'
df = pd.DataFrame({'input': dataset['dialogue'], 'output': dataset['summary']})
df_test = pd.DataFrame({'input': dataset_test['dialogue'], 'output': dataset_test['summary']})

####################################################################################

In [None]:
df.columns

Index(['input', 'output'], dtype='object')

In [None]:
df.head()

| | input | output |
|---|---|---|
| 0 | Amanda: I baked cookies. Do you want some?\r\... | Amanda baked cookies and will bring Jerry some... |
| 1 | Olivia: Who are you voting for in this electio... | Olivia and Olivier are voting for liberals in ... |
| 2 | Tim: Hi, what's up?\r\nKim: Bad mood tbh, I wa... | Kim may try the pomodoro technique recommended... |
| 3 | Edward: Rachel, I think I'm in ove with Bella.... | Edward thinks he is in love with Bella. Rachel... |
| 4 | Sam: hey overheard rick say something\r\nSam:... | Sam is confused, because he overheard Rick com... |


## Get logging up and running

In [None]:
# define a rich console logger
console=Console(record=True)

def display_df(df):
  """display dataframe in ASCII format"""

  console=Console()
  table = Table(Column("source_text", justify="center" ), Column("target_text", justify="center"), title="Sample Data",pad_edge=False, box=box.ASCII)

  for i, row in enumerate(df.values.tolist()):
    table.add_row(row[0], row[1])

  console.print(table)

training_logger = Table(Column("Epoch", justify="center" ), 
                        Column("Steps", justify="center"),
                        Column("Loss", justify="center"), 
                        title="Training Status",pad_edge=False, box=box.ASCII)


## Define data loader and training functions

In [None]:
class DSClass(Dataset):
  """
  Creating a custom dataset for reading the dataset and 
  loading it into the dataloader to pass it to the neural network for finetuning the model

  """

  def __init__(self, dataframe, tokenizer, source_len, target_len, source_text, target_text):
    self.tokenizer = tokenizer
    self.data = dataframe
    self.source_len = source_len
    self.summ_len = target_len
    self.target_text = self.data[target_text]
    self.source_text = self.data[source_text]

  def __len__(self):
    return len(self.target_text)

  def __getitem__(self, index):
    source_text = str(self.source_text[index])
    target_text = str(self.target_text[index])

    # collapse runs of whitespace (including \r\n line breaks) into single spaces
    source_text = ' '.join(source_text.split())
    target_text = ' '.join(target_text.split())

    # tokenize, padding/truncating to the configured maximum lengths
    # (padding="max_length" replaces the deprecated pad_to_max_length=True)
    source = self.tokenizer.batch_encode_plus([source_text], max_length=self.source_len, truncation=True, padding="max_length", return_tensors='pt')
    target = self.tokenizer.batch_encode_plus([target_text], max_length=self.summ_len, truncation=True, padding="max_length", return_tensors='pt')

    source_ids = source['input_ids'].squeeze()
    source_mask = source['attention_mask'].squeeze()
    target_ids = target['input_ids'].squeeze()
    target_mask = target['attention_mask'].squeeze()

    return {
        'source_ids': source_ids.to(dtype=torch.long), 
        'source_mask': source_mask.to(dtype=torch.long), 
        'target_ids': target_ids.to(dtype=torch.long),
        'target_ids_y': target_ids.to(dtype=torch.long)
    }
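
The `' '.join(text.split())` step in `__getitem__` collapses every run of whitespace into a single space, including the `\r\n` line breaks that separate speaker turns in the dialogues. A minimal illustration follows; the helper name `normalize_whitespace` and the sample string are made up for this sketch.

```python
def normalize_whitespace(text):
    """Collapse all whitespace runs (spaces, tabs, line breaks) into single spaces."""
    return " ".join(str(text).split())

# Made-up dialogue fragment with a \r\n turn separator, a tab and doubled spaces.
raw = "Amanda: I baked  cookies.\r\nJerry:\tSure!  "
clean = normalize_whitespace(raw)
print(repr(clean))
```

One consequence is that turn boundaries are no longer marked by line breaks once the text reaches the tokenizer, only by the speaker names.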

In [None]:
def train(epoch, tokenizer, model, device, loader, optimizer):

  """
  Function to be called for training with the parameters passed from main function

  """

  model.train()
  for _,data in enumerate(loader, 0):
    # get input and output data into tip top shape
    y = data['target_ids'].to(device, dtype = torch.long)
    y_ids = y[:, :-1].contiguous()
    lm_labels = y[:, 1:].clone().detach()
    lm_labels[y[:, 1:] == tokenizer.pad_token_id] = -100
    ids = data['source_ids'].to(device, dtype = torch.long)
    mask = data['source_mask'].to(device, dtype = torch.long)

    # generate outputs
    outputs = model(input_ids = ids, attention_mask = mask, decoder_input_ids=y_ids, labels=lm_labels)
    loss = outputs[0]

    # log every 10th step
    if _%10==0:
      training_logger.add_row(str(epoch), str(_), f"{loss.item():.4f}")
      console.print(training_logger)

    # clear gradients left over from the previous step
    optimizer.zero_grad()

    # backpropagate: compute gradients of the loss
    loss.backward()

    # update the weights using those gradients
    optimizer.step()
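
The slicing at the top of `train` can be hard to picture, so here is a plain-Python sketch of what it does to a single padded target sequence. The helper `shift_and_mask` and the token ids 71 and 1712 are made up for illustration; 0 and 1 are T5's pad and end-of-sequence ids, and -100 is the label value that the loss ignores.

```python
IGNORE_INDEX = -100  # label positions with this value do not contribute to the loss

def shift_and_mask(target_ids, pad_token_id):
    """Mirror the y[:, :-1] / y[:, 1:] slicing in train() for a single sequence."""
    decoder_input_ids = target_ids[:-1]  # everything except the last position
    labels = [
        IGNORE_INDEX if token == pad_token_id else token
        for token in target_ids[1:]      # everything except the first position
    ]
    return decoder_input_ids, labels

# Made-up target: two word pieces, the end-of-sequence id (1), then padding (0).
target_ids = [71, 1712, 1, 0, 0]
decoder_input_ids, labels = shift_and_mask(target_ids, pad_token_id=0)
print(decoder_input_ids)  # [71, 1712, 1, 0]
print(labels)             # [1712, 1, -100, -100]
```

Each decoder position is trained to predict the token one step ahead, and padded positions are excluded from the loss.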

In [None]:
def validate(epoch, tokenizer, model, device, loader):

  """
  Function to evaluate model for predictions

  """

  # throw model in eval mode
  model.eval()

  # predict!
  predictions = []
  actuals = []
  with torch.no_grad():
      for _, data in enumerate(loader, 0):
          y = data['target_ids'].to(device, dtype = torch.long)
          ids = data['source_ids'].to(device, dtype = torch.long)
          mask = data['source_mask'].to(device, dtype = torch.long)

          generated_ids = model.generate(
              input_ids = ids,
              attention_mask = mask, 
              max_length=150, 
              num_beams=2,
              repetition_penalty=2.5, 
              length_penalty=1.0, 
              early_stopping=True
              )
          preds = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=True) for g in generated_ids]
          target = [tokenizer.decode(t, skip_special_tokens=True, clean_up_tokenization_spaces=True) for t in y]
          if _%10==0:
              console.print(f'Completed {_}')

          predictions.extend(preds)
          actuals.extend(target)
  return predictions, actuals

In [None]:
def T5Trainer(dataframe, source_text, target_text, model_params, output_dir="./outputs/" ):
  
  """
  T5 trainer

  """

  # Set random seeds and deterministic pytorch for reproducibility
  torch.manual_seed(model_params["SEED"]) # pytorch random seed
  np.random.seed(model_params["SEED"]) # numpy random seed
  torch.backends.cudnn.deterministic = True

  # logging
  console.log(f"""[Model]: Loading {model_params["MODEL"]}...\n""")

  # tokenizer for encoding the text
  tokenizer = T5Tokenizer.from_pretrained(model_params["MODEL"])

  # Defining the model. We load the checkpoint named in model_params["MODEL"]; T5ForConditionalGeneration is T5 with a language-model head on top for generating the summary.
  # The model is then sent to the device (GPU if available, otherwise CPU).
  model = T5ForConditionalGeneration.from_pretrained(model_params["MODEL"])
  model = model.to(device)
  
  # logging
  console.log(f"[Data]: Reading data...\n")

  # Importing the raw dataset
  dataframe = dataframe[[source_text,target_text]]
  display_df(dataframe.head(2))

  
  # Creation of Dataset and Dataloader
  # Defining the train size. So 80% of the data will be used for training and the rest for validation. 
  train_size = 0.8
  train_dataset=dataframe.sample(frac=train_size,random_state = model_params["SEED"])
  val_dataset=dataframe.drop(train_dataset.index).reset_index(drop=True)
  train_dataset = train_dataset.reset_index(drop=True)

  console.print(f"FULL Dataset: {dataframe.shape}")
  console.print(f"TRAIN Dataset: {train_dataset.shape}")
  console.print(f"TEST Dataset: {val_dataset.shape}\n")


  # Creating the Training and Validation dataset for further creation of Dataloader
  training_set = DSClass(train_dataset, tokenizer, model_params["MAX_SOURCE_TEXT_LENGTH"], model_params["MAX_TARGET_TEXT_LENGTH"], source_text, target_text)
  val_set = DSClass(val_dataset, tokenizer, model_params["MAX_SOURCE_TEXT_LENGTH"], model_params["MAX_TARGET_TEXT_LENGTH"], source_text, target_text)


  # Defining the parameters for creation of dataloaders
  train_params = {
      'batch_size': model_params["TRAIN_BATCH_SIZE"],
      'shuffle': True,
      'num_workers': 0
      }


  val_params = {
      'batch_size': model_params["VALID_BATCH_SIZE"],
      'shuffle': False,
      'num_workers': 0
      }


  # Creation of Dataloaders for training and validation. These are used below in the training and validation stages.
  training_loader = DataLoader(training_set, **train_params)
  val_loader = DataLoader(val_set, **val_params)

  # >>>>>>>>>>>>>>>>>>>>>>
  # Defining the optimizer that will be used to tune the weights of the network in the training session.
  # Here I changed the optimizer from Adam to Adafactor.
  optimizer = optim.Adafactor(params = model.parameters(), lr=model_params["LEARNING_RATE"])
  # >>>>>>>>>>>>>>>>>>>>>>


  # Training loop
  console.log(f'[Initiating Fine Tuning]...\n')

  for epoch in range(model_params["TRAIN_EPOCHS"]):
      train(epoch, tokenizer, model, device, training_loader, optimizer)
      
  console.log(f"[Saving Model]...\n")
  #Saving the model after training
  path = os.path.join(output_dir, "model_files")
  model.save_pretrained(path)
  tokenizer.save_pretrained(path)


  # evaluating test dataset
  console.log(f"[Initiating Validation]...\n")
  for epoch in range(model_params["VAL_EPOCHS"]):
    predictions, actuals = validate(epoch, tokenizer, model, device, val_loader)
    final_df = pd.DataFrame({'Generated Text':predictions,'Actual Text':actuals})
    final_df.to_csv(os.path.join(output_dir,'predictions.csv'))
  
  console.save_text(os.path.join(output_dir,'logs.txt'))
  
  console.log(f"[Validation Completed.]\n")
  console.print(f"""[Model] Model saved @ {os.path.join(output_dir, "model_files")}\n""")
  console.print(f"""[Validation] Generation on Validation data saved @ {os.path.join(output_dir,'predictions.csv')}\n""")
  console.print(f"""[Logs] Logs saved @ {os.path.join(output_dir,'logs.txt')}\n""")

  return model, tokenizer, predictions, actuals

## Train Model!
*Narration:*


With Colab Pro processing available, I increased the batch size on both the training and validation sets. 

Plan:


*   Run with Adafactor
*   Sample Size 5000 to begin with
*   Batch Size 32

Starting with Adafactor per request :), and a sample size that should run quickly and give a nice benchmark for tuning. 


> I've experimented with various hyperparameters, and have implemented this model:

*   Run with Adafactor
*   Entire Train Dataset as Source
*   Batch Size 32
*   Max Target Text 100
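
To get a feel for what this configuration costs, the arithmetic below estimates the number of batches per epoch. It is a back-of-the-envelope sketch: the helper `batches_needed` is made up, the 14732 figure is the train split size from the download log above, the 0.8 fraction is the split used in `T5Trainer`, the 2 epochs are the `TRAIN_EPOCHS` value in `model_params`, and rounding the split size to the nearest integer is an assumption about how pandas' `sample(frac=...)` behaves.

```python
import math

def batches_needed(num_examples, batch_size):
    """Number of batches a DataLoader yields when the last partial batch is kept."""
    return math.ceil(num_examples / batch_size)

total_examples = 14732  # SAMSum train split, per the download log
train_frac = 0.8        # split fraction used inside T5Trainer
batch_size = 32
epochs = 2              # TRAIN_EPOCHS in model_params

# Assumption: the 80% sample is rounded to the nearest whole example.
n_train = round(total_examples * train_frac)
n_val = total_examples - n_train

train_steps_per_epoch = batches_needed(n_train, batch_size)
total_train_steps = train_steps_per_epoch * epochs
val_batches = batches_needed(n_val, batch_size)

print(n_train, n_val)                            # 11786 2946
print(train_steps_per_epoch, total_train_steps)  # 369 738
print(val_batches)                               # 93
```

With only a few hundred optimizer steps in total, the run is short, which is worth keeping in mind when judging the quality of the generated summaries.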






In [None]:
model_params={
    "MODEL":"t5-small",            # model_type
    "TRAIN_BATCH_SIZE":32,         # training batch size
    "VALID_BATCH_SIZE":32,         # validation batch size
    "TRAIN_EPOCHS":2,              # number of training epochs
    "VAL_EPOCHS":1,                # number of validation epochs
    "LEARNING_RATE":1e-4,          # learning rate
    "MAX_SOURCE_TEXT_LENGTH":512,  # max length of source text
    "MAX_TARGET_TEXT_LENGTH":100,   # max length of target text
    "SEED": 42                     # set seed for reproducibility 

}

In [None]:

model, tokenizer, predictions, actuals = T5Trainer(dataframe=df, source_text="input", target_text="output", model_params=model_params, output_dir="outputs")

Downloading:   0%|          | 0.00/773k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.17k [00:00<?, ?B/s]

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-small automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


Downloading:   0%|          | 0.00/231M [00:00<?, ?B/s]

In [None]:
# running_loss += loss.item() * inputs.size(0)
# running_corrects += torch.sum(preds == labels.data)
# # Add these lines to obtain f1_score  
# from sklearn.metrics import f1_score   
# f1_score = f1_score(labels.data, preds)
# #or: f1_score = f1_score(labels.cpu().data, preds.cpu())

In [None]:
# if you want to save the model to the Hugging Face Hub and you have an account, you can uncomment the following code.
# if you don't have a Hugging Face account, you should probably make one, because there are some sick models on it.
# their website is huggingface.co

from huggingface_hub import notebook_login
notebook_login() # it's going to ask you for a personal access key, which you can find in your account settings.

In [None]:
model.push_to_hub('sum_it')
tokenizer.push_to_hub('sum_it')

## Evaluation
If you are doing a binary classification task ('yes' or 'no'), you could evaluate the model using precision, recall, or F1 score. If it's not a classification task, you could use accuracy. If it's a generative task, maybe a BLEU or ROUGE score. Looking forward to what you come up with!

*Narration*:

To evaluate the text summarizer, I decided to go with BLEU and ROUGE scores. For ROUGE, I chose to start with ROUGE-L ('rougeLsum'), since it seemed the most appropriate metric for summarization.
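
To make the metric concrete, here is a minimal pure-Python sketch of the idea behind ROUGE-L: precision, recall and F-measure computed from the longest common subsequence (LCS) of tokens. The functions `lcs_length` and `rouge_l` and the example sentences are made up for illustration. Library implementations add stemming, their own tokenization and, for 'rougeLsum', sentence splitting, so their numbers will generally differ from this sketch.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(prediction, reference):
    """ROUGE-L precision, recall and F-measure on lowercased whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return {"precision": 0.0, "recall": 0.0, "fmeasure": 0.0}
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    fmeasure = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "fmeasure": fmeasure}

# Made-up example: a short prediction that is a prefix of a longer reference.
scores = rouge_l("Amanda baked cookies", "Amanda baked cookies for Jerry")
print(scores)  # precision 1.0, recall 0.6, F-measure about 0.75
```

The example shows why very short summaries can still score well on precision while losing on recall: every predicted word is in the reference, but two reference words are missing.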

In [None]:
# evaluation here!
results = pd.read_csv('/content/outputs/predictions.csv') # hopefully this should work...


In [None]:
# BLEU comes from nltk and ROUGE from torchmetrics
!pip install nltk
!pip install torchmetrics

from nltk.translate.bleu_score import corpus_bleu
from nltk.translate.bleu_score import SmoothingFunction
from torchmetrics.text.rouge import ROUGEScore

In [None]:
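# ROUGE: keep the default rouge_keys (rouge1, rouge2, rougeL, rougeLsum),
# since the cells below slice the results for all four
rouge = ROUGEScore(use_stemmer=True, accumulate='avg')
rouge_scores = rouge(predictions, actuals)

# BLEU: corpus_bleu expects tokenized text, with a list of references per hypothesis
smooth = SmoothingFunction()
references = [[actual.split()] for actual in actuals]
hypotheses = [prediction.split() for prediction in predictions]
bleu_score = corpus_bleu(references, hypotheses, smoothing_function=smooth.method1)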

In [None]:
# scores dataframe (convert the 0-dim tensors returned by ROUGEScore to floats)
columns = ['Score']
rouge_df = pd.DataFrame.from_dict({k: v.item() for k, v in rouge_scores.items()}, orient='index', columns=columns)
rouge_df.index.name = 'Metric'

# ROUGE scores: three rows (precision, recall, F-measure) per ROUGE variant
rouge_1_score = rouge_df[:3]
rouge_2_score = rouge_df[3:6]
rouge_L_score = rouge_df[6:9]
rouge_Lsum_score = rouge_df[9:]

In [None]:
print("--")
print(f"BLEU : {bleu_score}")
print("--")
print(rouge_1_score)
print("--")
print(rouge_2_score)
print("--")
print(rouge_L_score)
print("--")
print(rouge_Lsum_score)
print("--")

In [None]:
results.sample(5,random_state=42)

## My Thoughts

After fine-tuning the model and scoring it, I'm pleased overall with my process (if not the results). This model has been a blast to explore. 

In terms of the metrics, it could have done much better. I'll still be tweaking some things, but here is what I've noticed initially:

*   There are grammar issues. Some post-processing code will be needed to smooth this out. 
*   It shows a propensity for generating very short summaries where it would often benefit from a longer output. This will be my first area for improvement as I go forward with this model. 
*   The generated text could work well for extracting keywords, as it already produces shorter, keyword-like summaries. Currently they're absurd and funny. Functional would be better. 
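
One way to check whether the short-summary tendency is systematic is to compare the length of each generated summary with its reference. The sketch below is hypothetical: `length_ratio` is not part of this notebook and the two rows are made up. Only the column names 'Generated Text' and 'Actual Text' come from the predictions.csv that `T5Trainer` writes.

```python
def length_ratio(generated, actual):
    """Words in the generated summary divided by words in the reference summary."""
    generated_words = len(generated.split())
    actual_words = len(actual.split())
    return generated_words / actual_words if actual_words else float("nan")

# Made-up rows using the column names from predictions.csv.
rows = [
    {"Generated Text": "Amanda baked cookies.",
     "Actual Text": "Amanda baked cookies and will bring some to Jerry."},
    {"Generated Text": "Kim is in a bad mood.",
     "Actual Text": "Kim is in a bad mood today."},
]

ratios = [length_ratio(row["Generated Text"], row["Actual Text"]) for row in rows]
mean_ratio = sum(ratios) / len(ratios)
share_under_half = sum(1 for ratio in ratios if ratio < 0.5) / len(ratios)

print(f"mean length ratio: {mean_ratio:.2f}")
print(f"share of summaries under half the reference length: {share_under_half:.2f}")
```

If most ratios sit well below 1 on the real predictions, that would point toward the generation settings or the target length limit rather than isolated bad examples.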

*Going Forward:*

I built an app on Hugging Face Spaces that implements the model in all its oddness: 
sum_it: https://huggingface.co/spaces/SamuelMiller/sum_it



## References
This notebook is adapted from the following tutorial:
https://shivanandroy.com/fine-tune-t5-transformer-with-pytorch/