
# Fine-Tuning a T5 Model for Summarization Using Hugging Face's Transformers

By now we have a decent understanding of Hugging Face's Transformers library. Previously, we looked at the tasks Transformers supports and how pipelines make it easy to apply them to our own data. Hugging Face provides many pre-trained models, but business needs are often unique and may require additional training, known as fine-tuning.

Fine-tuning means taking a pre-trained model and training it further on data for a specific task. Today we will fine-tune the T5 model for summarization using the CNN/Daily Mail dataset.

Before we get into how this is done, let's cover some background information.

# T5 Model

The T5 model is a Text-To-Text Transfer Transformer developed by Google. The model was first pre-trained on unlabeled data with a self-supervised task, after which it can be fine-tuned for other tasks.
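
To make the self-supervised task more concrete: the T5 paper describes a span-corruption objective, in which spans of the input are replaced by sentinel tokens and the model learns to reproduce the missing spans. The sketch below is a simplified illustration under stated assumptions. The helper `corrupt_spans` is hypothetical, the spans are chosen by hand rather than sampled at random, and whitespace-separated words stand in for the subword tokens the real model uses. The sentinel names follow the `<extra_id_N>` convention of the Hugging Face T5 tokenizer.

```python
def corrupt_spans(tokens, spans):
    """Replace each (start, end) span with a sentinel token.

    Returns (model_input, target) as lists of tokens. Spans are half-open
    index ranges and are assumed to be sorted and non-overlapping.
    """
    model_input, target = [], []
    cursor = 0
    for sentinel_id, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{sentinel_id}>"
        # Copy the untouched tokens, then put a sentinel where the span was.
        model_input.extend(tokens[cursor:start])
        model_input.append(sentinel)
        # The target lists each sentinel followed by the tokens it hides.
        target.append(sentinel)
        target.extend(tokens[start:end])
        cursor = end
    model_input.extend(tokens[cursor:])
    # A final sentinel marks the end of the target sequence.
    target.append(f"<extra_id_{len(spans)}>")
    return model_input, target


tokens = "Thank you for inviting me to your party last week .".split()
span_input, span_target = corrupt_spans(tokens, [(2, 4), (8, 9)])
print(" ".join(span_input))
print(" ".join(span_target))
```

Because both the corrupted input and the target are plain text, no human labels are needed, which is what makes the task self-supervised.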

T5 uses a text-to-text framework, meaning that the input and output are always text. Below is a visual example of how the text-to-text framework works.

![text-to-text framework](https://1.bp.blogspot.com/-o4oiOExxq1s/Xk26XPC3haI/AAAAAAAAFU8/NBlvOWB84L0PTYy9TzZBaLf6fwPGJTR0QCLcBGAsYHQ/s640/image3.gif)

Input text is fed into the model, which is trained to produce a specific text output. This makes it possible to use the same model, hyperparameters, and loss function for different tasks.
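
One way to picture this is that the task itself is written into the input string. The sketch below assumes the task prefixes used in the T5 paper's examples (`summarize: `, `translate English to German: `, `cola sentence: `). The dictionary keys and the helper `to_text_to_text` are hypothetical names chosen for illustration. Note that the dataset class later in this notebook feeds the article text to the model without adding a prefix.

```python
# Task prefixes as used in the T5 paper's examples; the keys are illustrative.
TASK_PREFIXES = {
    "summarization": "summarize: ",
    "en_to_de": "translate English to German: ",
    "acceptability": "cola sentence: ",
}


def to_text_to_text(task, text):
    """Prepend the task prefix so that one model can tell the tasks apart."""
    return TASK_PREFIXES[task] + text.strip()


example = to_text_to_text("summarization", "  The quick brown fox jumped over the lazy dog. ")
print(example)
```

Every task then reduces to mapping one string to another, so nothing about the model architecture has to change between tasks.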

# Why We Fine-Tune

Previously, we saw how to use pre-trained models for specific tasks. Now we'll go one step further and fine-tune the model. When using Transformers pipelines for summarization, the pipeline defaults to a BART model. The other options are variations of the T5 model. While pipelines are helpful and easy to use, your business needs may require more flexibility than what is currently available. With the help of Hugging Face libraries, we will walk through the process of fine-tuning a T5 model.

# How to Fine-Tune

## Installs

In [None]:
!pip install transformers

In [None]:
!pip install datasets

In [None]:
!pip install torch

In [None]:
!pip install accelerate

## Imports

In [20]:
# Hugging Face Imports
from accelerate import Accelerator
import transformers
from datasets import load_dataset
from transformers import T5ForConditionalGeneration, T5TokenizerFast
# transformers' own AdamW is deprecated in favor of the PyTorch implementation
from torch.optim import AdamW

In [21]:
import numpy as np
import torch
from torch.utils.data import DataLoader
from tqdm import tqdm
import os
import torch.nn.functional as F
import textwrap

## Load Dataset

Hugging Face offers over 900 datasets through its `datasets` library. Here, we will use the [CNN/Daily Mail dataset](https://huggingface.co/datasets/cnn_dailymail).

In [23]:
dataset = load_dataset("ccdv/cnn_dailymail", '3.0.0')

In [24]:
dataset

DatasetDict({
    train: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 287113
    })
    validation: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 13368
    })
    test: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 11490
    })
})

In [25]:
train_dataset = dataset['train']
val_dataset = dataset['validation']
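
The split sizes shown above translate directly into training cost. The training loop later in this notebook builds its `DataLoader` without a `batch_size`, and PyTorch's default is 1, so one epoch means one optimizer step per training article. The short calculation below uses the `num_rows` values printed above; the helper name `steps_per_epoch` and the alternative batch sizes are only illustrative.

```python
import math

# Row counts copied from the DatasetDict output above.
SPLIT_SIZES = {"train": 287113, "validation": 13368, "test": 11490}


def steps_per_epoch(num_examples, batch_size=1):
    """Number of batches a DataLoader yields per epoch (the last one may be partial)."""
    return math.ceil(num_examples / batch_size)


for batch_size in (1, 8, 32):
    print(batch_size, steps_per_epoch(SPLIT_SIZES["train"], batch_size))
```

If a full epoch is too slow for experimentation, training on a subset of the split is a common shortcut.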

## Load Model and Tokenizer

In [None]:
tokenizer = T5TokenizerFast.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')

## Create Dataset Class

In [None]:
class CNN_Dataset(torch.utils.data.Dataset):
  """Wraps a Hugging Face dataset split and tokenizes one article/summary pair per item."""
  def __init__(self, dataset, tokenizer, article_len, summary_len, article_text, summary_text):
    self.data = dataset
    self.tokenizer = tokenizer
    self.article_len = article_len  # maximum number of article tokens
    self.summary_len = summary_len  # maximum number of summary tokens
    self.summary_text = self.data[summary_text]
    self.article_text = self.data[article_text]

  def __len__(self):
    return len(self.data)

  def __getitem__(self, index):
    article_text = self.article_text[index]
    summary_text = self.summary_text[index]

    # truncate or pad every example to a fixed length so examples can be stacked into batches
    source = self.tokenizer(article_text, max_length=self.article_len, truncation=True, padding="max_length", return_tensors="pt")
    target = self.tokenizer(summary_text, max_length=self.summary_len, truncation=True, padding="max_length", return_tensors="pt")

    # drop the leading batch dimension of size 1
    source_ids = source["input_ids"].squeeze(0)
    target_ids = target["input_ids"].squeeze(0)
    src_mask = source["attention_mask"].squeeze(0)
    target_mask = target["attention_mask"].squeeze(0)

    return {"source_ids": source_ids,
            "source_mask": src_mask,
            "target_ids": target_ids,
            "target_mask": target_mask}

In [None]:
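# the training and validation cells below expect these two names
train_dataset_pt = CNN_Dataset(train_dataset, tokenizer, 600, 128, 'article', 'highlights')
validate_dataset_pt = CNN_Dataset(val_dataset, tokenizer, 600, 128, 'article', 'highlights')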
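
The dataset class relies on the tokenizer options `max_length`, `truncation=True`, and `padding="max_length"` to give every example the same length (600 article tokens and 128 summary tokens here). The sketch below mimics that behavior on plain lists of token ids. It is a simplification: `pad_or_truncate` is a hypothetical helper, `pad_id=0` assumes T5's padding id, and the real tokenizer also takes care of the end-of-sequence token when truncating.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                    # what truncation=True does
    padding = [pad_id] * (max_length - len(kept))    # what padding="max_length" does
    input_ids = kept + padding
    # 1 marks a real token, 0 marks padding the model should not attend to.
    attention_mask = [1] * len(kept) + [0] * len(padding)
    return input_ids, attention_mask


short_ids, short_mask = pad_or_truncate([5, 6, 7], max_length=5)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4, 5, 6], max_length=4)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask is what later lets the model ignore padded positions in the article.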

## Train Model

In [None]:
# set up accelerator; it selects the device (GPU if available, otherwise CPU)
accelerator = Accelerator()
device = accelerator.device
model.to(device)

# initialize AdamW optimizer (Adam with weight decay, which reduces the chance of overfitting)
optim = AdamW(model.parameters(), lr=5e-5)
# batch_size is left at the DataLoader default of 1
train_loader = DataLoader(train_dataset_pt, shuffle=True)

# let accelerate prepare the model, optimizer, and data loader for the current device
model, optim, train_loader = accelerator.prepare(model, optim, train_loader)

for epoch in range(1):
  # put model in train mode
  model.train()
  loop = tqdm(train_loader, leave=True)
  for i, data in enumerate(loop, 0):
    # summary input ids: the decoder sees every target token except the last
    summ = data['target_ids'].to(device, dtype = torch.long)
    summ_ids = summ[:, :-1].contiguous()

    # labels: the next token at each position, with padding ignored by the loss
    lm_labels = summ[:, 1:].clone().detach()
    lm_labels[summ[:, 1:] == tokenizer.pad_token_id] = -100

    # input ids
    ids = data['source_ids'].to(device, dtype = torch.long)

    # attention mask
    mask = data['source_mask'].to(device, dtype = torch.long)

    # forward pass on the batch; the loss is computed from the labels
    outputs = model(input_ids = ids, attention_mask = mask, decoder_input_ids=summ_ids, labels=lm_labels)
    loss = outputs.loss

    # zero the parameter gradients
    optim.zero_grad()
    # backward pass through accelerate (replaces loss.backward()), then update the weights
    accelerator.backward(loss)
    optim.step()

    # print relevant info to progress bar
    loop.set_description(f'Epoch {epoch}')
    loop.set_postfix(loss=loss.item())
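
The two slices in the loop, `summ[:, :-1]` and `summ[:, 1:]`, set up next-token prediction: at each position the decoder receives one target token and is asked to predict the one that follows it. The sketch below reproduces that alignment on a plain Python list. The token ids 71, 1712, and 19 are made up, `PAD_ID = 0` and the end-of-sequence id 1 are assumptions about the T5 vocabulary, and -100 is the label value that PyTorch's cross-entropy loss ignores by default.

```python
PAD_ID = 0           # assumption: T5's padding token id
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss


def make_decoder_pair(target_ids, pad_id=PAD_ID):
    """Mirror the loop's slicing: inputs drop the last token, labels drop the first."""
    decoder_input_ids = target_ids[:-1]
    labels = [IGNORE_INDEX if tok == pad_id else tok for tok in target_ids[1:]]
    return decoder_input_ids, labels


# three made-up summary tokens, an end-of-sequence token (1), then padding
target_ids = [71, 1712, 19, 1, 0, 0]
decoder_input_ids, labels = make_decoder_pair(target_ids)
print(decoder_input_ids)
print(labels)
```

With this slicing, the first summary token is never used as a label. When only `labels` is passed to `T5ForConditionalGeneration`, the library builds the shifted decoder inputs itself, which is an alternative worth comparing.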

### Save Model

In [None]:
model_path = '/content/drive/MyDrive/NLP_POC/t5_model_2'
model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)

## Validate Model

In [None]:
def validate(epoch, tokenizer, model, device, loader):
  # `loader` is a DataLoader over the validation set, e.g. DataLoader(validate_dataset_pt)
  # `epoch` is not used inside the function
  model.eval()
  predictions = []
  actuals = []

  with torch.no_grad():
    for i, data in enumerate(loader, 0):
      target_ids = data['target_ids'].to(device, dtype = torch.long)
      source_ids = data['source_ids'].to(device, dtype = torch.long)
      mask = data['source_mask'].to(device, dtype = torch.long)

      # generate summaries with beam search
      generated_ids = model.generate(
          input_ids = source_ids,
          attention_mask = mask,
          max_length=150,
          num_beams=2,
          repetition_penalty=2.5,
          length_penalty=1.0,
          early_stopping=True
          )

      # decode token ids back into text, dropping special tokens such as padding
      prediction = [tokenizer.decode(ids, skip_special_tokens=True, clean_up_tokenization_spaces=True) for ids in generated_ids]
      target = [tokenizer.decode(ids, skip_special_tokens=True, clean_up_tokenization_spaces=True) for ids in target_ids]

      predictions.extend(prediction)
      actuals.extend(target)
  return predictions, actuals
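
`validate` returns two parallel lists, `predictions` and `actuals`, but the notebook does not score them. As a rough sketch of how they could be compared, the code below computes a unigram-overlap F1 score, which is similar in spirit to ROUGE-1. It is an assumption-laden stand-in rather than an official ROUGE implementation: it lowercases, splits on whitespace, and does no stemming. The function names `unigram_f1` and `mean_unigram_f1` are hypothetical.

```python
from collections import Counter


def unigram_f1(prediction, reference):
    """F1 over word overlap between a generated summary and its reference."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


def mean_unigram_f1(predictions, actuals):
    """Average the per-example score over two parallel lists."""
    scores = [unigram_f1(p, a) for p, a in zip(predictions, actuals)]
    return sum(scores) / len(scores) if scores else 0.0


print(mean_unigram_f1(["the cat sat", "a dog"], ["the cat sat", "a bird flew"]))
```

For reported results, an established ROUGE implementation would be the safer choice, since tokenization and stemming details affect the numbers.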

### Loading the Fine-Tuned Model

In [None]:
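model_path = '/content/drive/MyDrive/NLP_POC/t5_model_2'
tokenizer = T5TokenizerFast.from_pretrained(model_path)
model = T5ForConditionalGeneration.from_pretrained(model_path)
# move the reloaded model to the same device as the inputs and switch to inference mode
model.to(device)
model.eval()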

## Test Model on Form 10-K 

In [None]:
def summarize(input_text):
  wrapper = textwrap.TextWrapper(width=75)
  with torch.no_grad():
      tokenized_text = tokenizer(input_text, truncation=True, padding=True, return_tensors='pt')

      source_ids = tokenized_text['input_ids'].to(device, dtype = torch.long)
      source_mask = tokenized_text['attention_mask'].to(device, dtype = torch.long)

      generated_ids = model.generate(
          input_ids = source_ids,
          attention_mask = source_mask, 
          max_length=200,
          num_beams=8,
          length_penalty=1, 
          early_stopping=True,
          no_repeat_ngram_size=2)
      
      pred = tokenizer.decode(generated_ids[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)
  return "Output:\n" + wrapper.fill(text=pred)
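
The `no_repeat_ngram_size=2` argument asks `generate` to avoid producing the same 2-gram twice, which helps against summaries that loop on a phrase. The sketch below shows what counts as a repeated n-gram. It is only an illustration: `repeated_ngrams` is a hypothetical helper that works on whitespace-separated words, whereas the real constraint is applied to token ids during decoding.

```python
def repeated_ngrams(tokens, n=2):
    """Return the set of n-grams that occur more than once in tokens."""
    seen, repeated = set(), set()
    for i in range(len(tokens) - n + 1):
        ngram = tuple(tokens[i:i + n])
        if ngram in seen:
            repeated.add(ngram)
        seen.add(ngram)
    return repeated


looping = "our strategy may limit our strategy".split()
clean = "our strategy may limit future growth".split()
print(repeated_ngrams(looping))
print(repeated_ngrams(clean))
```

A check like this can also be run on decoded summaries to spot repetition when experimenting with other generation settings.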

## Strategy Risk Summary

In [None]:
strategy_risk = """
We may face challenges in executing our omnichannel strategy and expanding our operations to ecommerce.
During fiscal 2020, we began executing on elements of our new growth strategy, which we comprehensively communicated during our October 2020
Investor Day. Our ability to implement our strategic direction is based on a number of key assumptions regarding the future economic environment and our
ability to meet certain ambitions, goals and targets, among other things. If any of these assumptions (including but not limited to our ability to meet certain
ambitions, goals and targets) prove inaccurate in whole or in part, our ability to achieve some or all of the expected benefits of this strategy could be
limited, including our ability to meet our stated financial objectives and retain key employees. Factors beyond our control, including but not limited to
market and economic conditions, execution risk related to the implementation of our strategy and other challenges and risk factors discussed in this annual
report, could limit our ability to achieve some or all of the expected benefits of this strategy. If we are unable to implement this strategy successfully in
whole or in part or should the components of the strategy that are implemented fail to produce the expected benefits, our business, results of operations,
financial condition and financial performance may be materially and adversely affected.
Additionally, an important part of our strategy involves providing customers with a seamless omnichannel shopping experience. Customer expectations
about the methods by which they purchase and receive products or services are evolving, including as a result of the COVID-19 pandemic, and they are
increasingly using technology to compare and purchase products. Once products are purchased, customers are seeking alternate options for delivery of
those products. The coordinated operation of our network of physical stores and online platforms is fundamental to the success of our omnichannel strategy,
and our ability to compete and meet customer expectations may suffer if we are unable to provide relevant customer-facing technology and omnichannel
experiences. Consequently, our business, results of operations, financial condition and financial performance could be materially adversely affected. For
more information on our strategy, see "Item 1 - Business - Strategy."
Successful execution of our omnichannel strategy is dependent, in part, on our ability to establish and profitably maintain the appropriate mix of
digital and physical presence in the markets we serve.
Successful execution of our omnichannel strategy depends, in part, on our ability to develop our digital capabilities in conjunction with optimizing our
physical store operations and market coverage, while maintaining profitability. Our ability to develop these capabilities will depend on a number of factors,
including our assessment and implementation of emerging technologies and our ability to manage the rapid increase in online orders as a result of the
COVID-19 pandemic as well as our ability to drive store traffic upon the expected return of in-person shopping. Our ability to optimize our store operations
and market coverage requires active management of our real estate portfolio in a manner that permits store sizes, layouts, locations and offerings to evolve
over time, which to the extent it involves the relocation of existing stores or the opening of additional stores will depend on a number of factors, including
our identification and availability of suitable locations; our success in negotiating leases on acceptable terms; and our timely development of new stores,
including the availability of construction materials and labor and the absence of significant construction and other delays based on weather or other events.
These factors could potentially increase the cost of doing business and the risk that our business practices could result in liabilities that may adversely affect
our performance, despite the exercise of reasonable care.
There are risks associated with our store network optimization strategies, pursuant to which we plan to close approximately 200 mostly BBB
stores by the end of fiscal 2021.
"""

In [None]:
print(summarize(strategy_risk))

Output:
During fiscal 2020, we began executing on elements of our new growth
strategy. If any of these assumptions prove inaccurate in whole or in part,
our ability to achieve some or all of the expected benefits of this
strategy could be limited, he says. Factors beyond our control, including
but not limited to market and economic conditions, execution risk related
to our strategy may be materially and adversely affected.


## Reputational Risk Summary

In [None]:
reputational_risk = """
Our reputation is based, in part, on perceptions of subjective qualities, so incidents involving us, our products or the retail industry in general that erode
customer trust or confidence could adversely affect our reputation and our business. As we increase the number of items available to be shipped directly
from a vendor to a customer for home delivery or in-home assembly, any deficiencies in the performance of these third party merchandise vendors and
service providers could also have a material adverse effect on our reputation, despite our monitoring controls and procedures. In addition, challenges to our
compliance with a variety of social, product, labor and environmental standards could also jeopardize our reputation and lead to adverse publicity,
especially in social media. The use of social media by us and consumers has also increased the risk that our reputation could be negatively impacted. The
availability of information and opinion on social media is immediate, as is its impact. The opportunity for dissemination of information, including
inaccurate and inflammatory information and opinion, is virtually limitless. Information about or affecting us is easily accessible and rapidly disseminated.
Damage to our brand and reputation could potentially impact our operating and financial results, diminish customer trust and generate negative sentiment,
as well as require additional resources to rebuild our reputation.
"""

In [None]:
print(summarize(reputational_risk))

Output:
Deficiency in performance of third party merchandise vendors and service
providers could adversely affect reputation. The use of social media by us
and consumers has also increased the risk that our reputation could be
negatively impacted, he says.


## Pandemic Impact Summary

In [None]:
pandemic_impact = """
In March 2020, the World Health Organization declared the COVID-19 outbreak a global pandemic. The pandemic has materially disrupted our operations
to date. In compliance with relevant government directives, we temporarily closed all of our retail banner stores across the U.S. and Canada as of March 23,
2020, except for most stand-alone BABY and Harmon stores, which were categorized as essential given the nature of their products. In May 2020, we
announced a phased approach to reopen our stores in compliance with relevant government directives, and as of the end of July 2020, nearly all of our
stores reopened. We cannot predict, however, whether reopened stores will remain open, particularly as the regions in which we operate are experiencing a
resurgence of reported new cases of COVID-19 and hospitalizations. In response to the health risks caused by the COVID-19 pandemic, we expanded our
recently rolled out BOPIS, contactless Curbside Pickup and Same Day Delivery services to cover the vast majority of our stores.
In conjunction with the temporary store closures, we implemented additional cost reductions, including a furlough of the majority of store associates and a
portion of corporate associates. We provided impacted store associates with applicable pay and benefits through April 3, 2020, and impacted corporate
associates with pay and benefits through April 18, 2020. In addition, we had continued to pay 100% of the cost of healthcare premiums for all associates
who participated in our health plan. Nearly all of the associates who were subject to furlough returned to work as of the third quarter of fiscal 2020. We also
implemented a temporary reduction in salaries of our executive team by 30% through May 16, 2020, and a temporary reduction in the quarterly cash
compensation of the independent members of the Board of Directors by 30% for the first quarter of fiscal 2020. We also modified our fiscal 2020 capital
investment plan, focusing on our core business and key projects that support our digital and omni fulfillment capabilities, including the introduction of
BOPIS and contactless Curbside Pickup services, omni inventory management, and digital marketing and personalization.
We have and will continue to seek opportunities to mitigate the impact of the COVID-19 pandemic, including, among others, renegotiating payment terms
for goods, services and rent, managing inventory levels, and reducing discretionary spending such as business travel and advertising and expense associated
with the maintenance of stores that were temporarily closed. The COVID-19 pandemic materially adversely impacted our results of operations and cash
flows in fiscal 2020, and could continue to materially impact results of operations and cash flows as well as our financial condition. Given the uncertainties
regarding the spread of the virus, the timing of the economic recovery and the resurgence of the virus, the related financial impact cannot be reasonably
predicted or estimated at this time.
"""

In [None]:
print(summarize(pandemic_impact))

Output:
in March 2020, the World Health Organization declared the COVID-19 outbreak
a global pandemic. We temporarily closed all of our retail banner stores
across the U.S. and Canada as of March 23, 2020.


## Resources
* [PyTorch: Training a Classifier](https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html#training-on-gpu)
* [PyTorch DataLoader and Dataset Documentation](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html)
* [Summarization Tutorial](https://shivanandroy.com/fine-tune-t5-transformer-with-pytorch/#dataset-class)
* [Summarization Tutorial](https://towardsdatascience.com/fine-tuning-a-t5-transformer-for-any-summarization-task-82334c64c81)
* [Train T5 for Text Summarization](https://medium.com/askdata/train-t5-for-text-summarization-a1926f52d281)
* [Hugging Face CNN/Daily Mail Dataset](https://huggingface.co/datasets/cnn_dailymail)
* [Hugging Face Accelerate Documentation](https://huggingface.co/docs/accelerate/)
* [Google's T5](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html)