<a href="https://www.kaggle.com/code/aisuko/summarization-nlp?scriptVersionId=164648169" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Overview

Summarization creates a shorter version of a document or an article that captures all the important information. Like translation, it can be formulated as a sequence-to-sequence task. Summarization can be:

**Extractive:** extract the most relevant information from a document

**Abstractive:** generate new text that captures the most relevant information
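
To make the distinction concrete, here is a minimal sketch of the extractive approach: it scores each sentence by how frequent its words are in the document and returns the top-scoring sentence verbatim. The function name `extractive_summary` and the scoring rule are illustrative assumptions, not part of this notebook's method; an abstractive model would instead generate new wording.

```python
import re
from collections import Counter

def extractive_summary(document, num_sentences=1):
    """Pick the sentences whose words are most frequent in the document."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    # Word frequencies over the whole document (lowercased).
    freqs = Counter(re.findall(r"[a-z']+", document.lower()))

    def score(sentence):
        # Average document frequency of the words in this sentence.
        words = re.findall(r"[a-z']+", sentence.lower())
        return sum(freqs[w] for w in words) / max(len(words), 1)

    # Keep the best-scoring sentences, restoring their original order.
    top = sorted(sentences, key=score, reverse=True)[:num_sentences]
    return " ".join(s for s in sentences if s in top)

doc = (
    "The bill amends the death certificate. "
    "The certificate must record veteran status. "
    "Weather was mild."
)
summary = extractive_summary(doc)
print(summary)
```

Every sentence in the output is copied from the input, which is the defining property of extractive summarization.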

We are going to fine-tune a pretrained T5 model (`t5-small`) on the California state bill subset of the BillSum dataset for abstractive summarization.

In [1]:
%%capture
!pip install transformers==4.35.2
!pip install datasets==2.15.0
!pip install evaluate==0.4.1
!pip install rouge-score==0.1.2

In [2]:
import os
from huggingface_hub import login
from kaggle_secrets import UserSecretsClient

user_secrets = UserSecretsClient()

login(token=user_secrets.get_secret("HUGGINGFACE_TOKEN"))

os.environ["WANDB_API_KEY"]=user_secrets.get_secret("WANDB_API_KEY")
os.environ["WANDB_PROJECT"] = "Fine-tune-models"
os.environ["WANDB_NOTES"] = "Fine tune model t5-small on BillSum"
os.environ["WANDB_NAME"] = "ft-t5-with-dill-sum"

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


# Load BillSum dataset

Start by loading the smaller California state bill subset of the BillSum dataset.

In [3]:
from datasets import load_dataset

billsum=load_dataset("billsum", split="ca_test")
print(billsum)

Generating train split:   0%|          | 0/18949 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/3269 [00:00<?, ? examples/s]

Generating ca_test split:   0%|          | 0/1237 [00:00<?, ? examples/s]

Dataset({
    features: ['text', 'summary', 'title'],
    num_rows: 1237
})


In [4]:
billsum=billsum.train_test_split(test_size=0.2)
print(billsum)

DatasetDict({
    train: Dataset({
        features: ['text', 'summary', 'title'],
        num_rows: 989
    })
    test: Dataset({
        features: ['text', 'summary', 'title'],
        num_rows: 248
    })
})
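
The split sizes follow from the 1,237 rows and `test_size=0.2`. The sketch below reproduces them, assuming the test share is rounded up and the remainder is used for training; treat that rounding rule as an assumption about the library's internals rather than documented behaviour.

```python
import math

num_rows = 1237      # size of the ca_test split printed above
test_size = 0.2      # fraction passed to train_test_split

# Assumed rounding: the test share is rounded up, the rest goes to training.
n_test = math.ceil(num_rows * test_size)
n_train = num_rows - n_test

print(n_train, n_test)
```

Because the split shuffles the rows and no seed is passed here, the examples that land in each split differ from run to run; passing a fixed `seed` argument to `train_test_split` would make the split reproducible.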


In [5]:
billsum["train"][0]

{'text': 'The people of the State of California do enact as follows:\n\n\nSECTION 1.\nThe Legislature finds and declares the following:\n(a) Every day 22 veterans take their own lives.\n(b) Thirty percent of veterans have considered suicide.\n(b)\nThe number of veterans who take their own lives is likely much higher as certificates of death do not require veteran status to be listed and may be under reporting the number of suicides.\nSEC. 2.\nSection 102875 of the Health and Safety Code is amended to read:\n102875.\nThe certificate of death shall be divided into two sections.\n(a) The first section shall contain those items necessary to establish the fact of the death, including all of the following and those other items as the State Registrar may designate:\n(1) (A) Personal data concerning decedent including full name, sex, color or race, marital status, name of spouse, date of birth and age at death, birthplace, usual residence,\nand\noccupation and industry or\nbusiness.\nbusiness,

There are two fields that you will want to use:

* `text`: the text of the bill, which will be the input to the model
* `summary`: a condensed version of `text`, which will be the model target

# Preprocess

The next step is to load a T5 tokenizer to process `text` and `summary`:

In [6]:
from transformers import AutoTokenizer

model_name="t5-small"
tokenizer=AutoTokenizer.from_pretrained(model_name)

The preprocessing function we want to create needs to:

* Prefix the input with a prompt so T5 knows this is a summarization task. Some models capable of multiple NLP tasks require a prompt for specific tasks.

* Use the `text_target` keyword argument when tokenizing labels.

* Truncate sequences to be no longer than the maximum length set by the `max_length` parameter.

In [7]:
prefix="summarize: "  # trailing space keeps the prompt separate from the first word of the document

def preprocess_function(examples):
    inputs=[prefix+doc for doc in examples["text"]]
    model_inputs=tokenizer(inputs, max_length=1024, truncation=True)
    
    labels=tokenizer(text_target=examples["summary"], max_length=128, truncation=True)
    
    model_inputs["labels"]=labels["input_ids"]
    return model_inputs

In [8]:
tokenized_billsum=billsum.map(preprocess_function, batched=True)

In [9]:
from transformers import DataCollatorForSeq2Seq

data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model_name)
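
The collator pads each batch only to the length of its longest example, instead of padding the whole dataset to one fixed length. The sketch below imitates that behaviour in plain Python for illustration; `pad_batch` is a hypothetical helper, not the library's implementation, and the token ids are made up. It assumes T5's pad token id of 0 and the collator's default label padding value of -100, which the loss function ignores.

```python
PAD_TOKEN_ID = 0        # T5's padding token id (assumed here)
LABEL_PAD_ID = -100     # default label padding value; ignored by the loss

def pad_batch(features):
    """Pad input_ids and labels to the longest sequence in this batch."""
    max_input = max(len(f["input_ids"]) for f in features)
    max_label = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_input - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [PAD_TOKEN_ID] * n_pad)
        # 1 marks real tokens, 0 marks padding.
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        l_pad = max_label - len(f["labels"])
        batch["labels"].append(f["labels"] + [LABEL_PAD_ID] * l_pad)
    return batch

# Two toy examples of different lengths (ids are illustrative only).
features = [
    {"input_ids": [21603, 10, 37, 1], "labels": [100, 1]},
    {"input_ids": [21603, 10, 1], "labels": [100, 200, 300, 1]},
]
batch = pad_batch(features)
print(batch["input_ids"])
print(batch["labels"])
```

The -100 padding in the labels is the reason `compute_metrics` has to replace -100 with the pad token id before decoding.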

# Evaluate

Including a metric during training is often helpful for evaluating our model's performance. Here we are going to load the ROUGE metric.


## ROUGE

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics and a software package used for evaluating automatic summarization and machine translation software in natural language processing. The metrics compare an automatically produced summary or translation against a reference or a set of references (human-produced summaries or translations).
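
As an illustration of the idea, ROUGE-1 counts the unigrams shared by a candidate and a reference. The sketch below is a simplified version under stated assumptions (whitespace tokenization, lowercasing, no stemming); the `rouge1` helper is hypothetical and will not match the `evaluate` implementation exactly.

```python
from collections import Counter

def rouge1(prediction, reference):
    """Unigram-overlap precision, recall and F1 (simplified ROUGE-1)."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both.
    overlap = sum((pred_counts & ref_counts).values())
    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

scores = rouge1(
    prediction="the bill lowers drug costs",
    reference="the bill lowers prescription drug costs for seniors",
)
print(scores)
```

Here every predicted word appears in the reference (precision 1.0), but only five of the eight reference words are covered (recall 0.625), so a short summary can score high on precision while still missing content.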

In [10]:
import evaluate

rouge=evaluate.load("rouge")

Then create a function that passes your predictions and labels to `compute` to calculate the ROUGE metric:

In [11]:
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels=eval_pred
    decoded_preds=tokenizer.batch_decode(predictions, skip_special_tokens=True)
    labels=np.where(labels!=-100, labels, tokenizer.pad_token_id)
    decoded_labels=tokenizer.batch_decode(labels, skip_special_tokens=True)
    
    result=rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    
    prediction_lens=[np.count_nonzero(pred!=tokenizer.pad_token_id) for pred in predictions]
    
    result["gen_len"]=np.mean(prediction_lens)
    
    return {k: round(v,4) for k,v in result.items()}

# Training

In [12]:
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

model=AutoModelForSeq2SeqLM.from_pretrained(model_name)
print(model)

T5ForConditionalGeneration(
  (shared): Embedding(32128, 512)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 512)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=512, out_features=512, bias=False)
              (k): Linear(in_features=512, out_features=512, bias=False)
              (v): Linear(in_features=512, out_features=512, bias=False)
              (o): Linear(in_features=512, out_features=512, bias=False)
              (relative_attention_bias): Embedding(32, 8)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseActDense(
              (wi): Linear(in_features=512, out_features=2048, bias=False)
              (wo): Linear(in_features=2048, out_features=512, bias=False)
              (dropout): Drop

In [13]:
training_args=Seq2SeqTrainingArguments(
    output_dir=os.getenv("WANDB_NAME"),
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16, # reduce to 8 on GPUs with limited memory
    per_device_eval_batch_size=16,
    gradient_checkpointing=True,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=5,
    predict_with_generate=True,
    fp16=True,
    report_to="wandb",
    run_name=os.getenv("WANDB_NAME"),
    push_to_hub=False,
)


trainer=Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_billsum["train"],
    eval_dataset=tokenized_billsum["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

[34m[1mwandb[0m: Currently logged in as: [33murakiny[0m ([33mcausal_language_trainer[0m). Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: wandb version 0.16.3 is available!  To upgrade, please run:
[34m[1mwandb[0m:  $ pip install wandb --upgrade
[34m[1mwandb[0m: Tracking run with wandb version 0.16.1
[34m[1mwandb[0m: Run data is saved locally in [35m[1m/kaggle/working/wandb/run-20240228_064045-3f9wiueb[0m
[34m[1mwandb[0m: Run [1m`wandb offline`[0m to turn off syncing.
[34m[1mwandb[0m: Syncing run [33mft-t5-with-dill-sum[0m
[34m[1mwandb[0m: ⭐️ View project at [34m[4mhttps://wandb.ai/causal_language_trainer/Fine-tune-models[0m
[34m[1mwandb[0m: 🚀 View run at [34m[4mhttps://wandb.ai/causal_language_trainer/Fine-tune-models/runs/3f9wiueb[0m
You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pa

| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Gen Len |
|---|---|---|---|---|---|---|---|
| 1 | No log | 3.152024 | 0.1442 | 0.0487 | 0.1194 | 0.1197 | 19.0 |
| 2 | No log | 2.820798 | 0.1365 | 0.0446 | 0.1129 | 0.113 | 19.0 |
| 3 | No log | 2.696676 | 0.1402 | 0.0485 | 0.1172 | 0.1172 | 19.0 |
| 4 | No log | 2.64685 | 0.1433 | 0.0529 | 0.1191 | 0.1191 | 19.0 |
| 5 | No log | 2.632482 | 0.1443 | 0.0546 | 0.1198 | 0.1198 | 19.0 |




TrainOutput(global_step=155, training_loss=3.1760413385206654, metrics={'train_runtime': 587.547, 'train_samples_per_second': 8.416, 'train_steps_per_second': 0.264, 'total_flos': 1338530416558080.0, 'train_loss': 3.1760413385206654, 'epoch': 5.0})
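
The numbers in `TrainOutput` can be sanity-checked with a little arithmetic. With 989 training examples and a per-device batch size of 16, 155 steps over 5 epochs means 31 steps per epoch, which is consistent with an effective batch of 32, that is, two GPUs. The GPU count is an inference from these numbers, not something the log states. The same arithmetic suggests why the Training Loss column shows "No log": assuming the Trainer's default `logging_steps` of 500, the run finishes before the first logging step.

```python
import math

train_examples = 989
per_device_batch = 16
epochs = 5
num_devices = 2          # assumption inferred from global_step, not stated in the log
logging_steps = 500      # assumed Trainer default

# Each optimizer step consumes one batch per device.
steps_per_epoch = math.ceil(train_examples / (per_device_batch * num_devices))
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)
print("training loss logged:", total_steps >= logging_steps)
```

Under these assumptions, setting a smaller `logging_steps` in `Seq2SeqTrainingArguments` would make the training loss appear in the table.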

# Evaluate

In [14]:
import math

eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

Perplexity: 13.91
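
Perplexity here is the exponential of the mean cross-entropy loss on the evaluation set. As a quick check, exponentiating the final-epoch validation loss from the training log reproduces the printed value.

```python
import math

final_eval_loss = 2.632482   # validation loss at epoch 5 in the training log
perplexity = math.exp(final_eval_loss)
print(f"Perplexity: {perplexity:.2f}")
```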


In [15]:
tokenizer.push_to_hub(os.getenv("WANDB_NAME"))
trainer.push_to_hub(os.getenv("WANDB_NAME"))

'https://huggingface.co/aisuko/ft-t5-with-dill-sum/tree/main/'

# Inference

In [16]:
from transformers import pipeline

text = "summarize: The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country. It'll lower the deficit and ask the ultra-wealthy and corporations to pay their fair share. And no one making under $400,000 per year will pay a penny more in taxes."

summarizer=pipeline("summarization", model=os.getenv("WANDB_NAME"))
summarizer(text)

Your max_length is set to 200, but your input_length is only 103. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=51)


[{'summary_text': "the Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs . it's the most aggressive action on tackling the climate crisis in American history . no one making under $400,000 per year will pay a penny more in taxes."}]

## Using PyTorch

In [17]:
from transformers import AutoTokenizer

tokenizer=AutoTokenizer.from_pretrained(os.getenv("WANDB_NAME"))
inputs=tokenizer(text, return_tensors="pt").input_ids

In [18]:
from transformers import AutoModelForSeq2SeqLM

model=AutoModelForSeq2SeqLM.from_pretrained(os.getenv("WANDB_NAME"))
outputs=model.generate(inputs, max_new_tokens=100, do_sample=False)

In [19]:
tokenizer.decode(outputs[0], skip_special_tokens=True)

"the Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. it's the most aggressive action on tackling the climate crisis in American history. it'll ask the ultra-wealthy and corporations to pay their fair share."