<a href="https://colab.research.google.com/github/Sankalpa1321/summarization_model/blob/main/Summarization_Model_using_flan_t5_base.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Install required libraries

In [None]:
%pip install transformers datasets torch sentencepiece evaluate PyPDF2 rouge_score

Import Tools

In [None]:
from transformers import T5ForConditionalGeneration, T5Tokenizer, Trainer, TrainingArguments, DataCollatorForSeq2Seq
import torch
import pandas as pd
from evaluate import load



Load Dataset and split if needed

In [None]:
from datasets import load_dataset
dataset = load_dataset("sankalpathapachhetri/Abstractive_Summary")

split_dataset = dataset["train"].train_test_split(test_size=0.1)
train_dataset = split_dataset['train']
test_dataset = split_dataset['test']

Generating train split:   0%|          | 0/161 [00:00<?, ? examples/s]

Tokenization

In [None]:
tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-base")


In [None]:
def tokenize_function(batch):
    inputs = ["summarize: " + doc for doc in batch["text"]]
    model_inputs = tokenizer(inputs, max_length=512, truncation=True)

    labels = tokenizer(batch["summary"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

In [None]:
def ensure_string_types(example):
    # Ensure 'text' is a string, handle None by converting to an empty string
    example["text"] = str(example["text"]) if example["text"] is not None else ""
    # Ensure 'summary' is a string, handle None by converting to an empty string
    example["summary"] = str(example["summary"]) if example["summary"] is not None else ""
    return example

# Apply the string conversion to both datasets
train_dataset_cleaned = train_dataset.map(ensure_string_types)
test_dataset_cleaned = test_dataset.map(ensure_string_types)

# Now, apply the tokenization function to the cleaned datasets
tokenized_train = train_dataset_cleaned.map(tokenize_function, batched=True)
tokenized_eval = test_dataset_cleaned.map(tokenize_function, batched=True)

Map:   0%|          | 0/144 [00:00<?, ? examples/s]

Map:   0%|          | 0/17 [00:00<?, ? examples/s]

Map:   0%|          | 0/144 [00:00<?, ? examples/s]

Map:   0%|          | 0/17 [00:00<?, ? examples/s]

Fine-Tune Model

In [None]:
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-base")

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
print(device)

cpu


In [None]:
training_args = TrainingArguments(
    output_dir="./flan-t5_pdf_summary",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    warmup_steps=100,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    eval_strategy="steps",
    eval_steps=10,
    save_steps=50,
    save_total_limit=2,
    learning_rate=5e-5,
)
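
The step counts implied by these settings can be checked by hand. The sketch below assumes 144 training examples, a single device, and no gradient accumulation (the default); the variable names are illustrative only. The resulting total agrees with the `global_step=216` reported by `trainer.train()`.

```python
import math

# Values taken from the notebook; a single device and no
# gradient accumulation are assumed.
num_train_examples = 144
per_device_train_batch_size = 2
num_train_epochs = 3
warmup_steps = 100
eval_steps = 10

steps_per_epoch = math.ceil(num_train_examples / per_device_train_batch_size)
total_steps = steps_per_epoch * num_train_epochs
num_evaluations = total_steps // eval_steps
warmup_fraction = warmup_steps / total_steps

print(steps_per_epoch)            # 72
print(total_steps)                # 216
print(num_evaluations)            # 21
print(round(warmup_fraction, 2))  # 0.46
```

With warmup covering roughly 46% of the run, the learning rate reaches 5e-5 only near the midpoint, so a smaller `warmup_steps` may be worth trying on a dataset this small.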

In [None]:
import numpy as np
rouge = load("rouge")

def compute_metrics(eval_pred):
    predictions = eval_pred.predictions
    labels = eval_pred.label_ids

    # If predictions is a tuple (e.g., (logits_tensor,)), extract the tensor
    if isinstance(predictions, tuple):
        predictions = predictions[0]  # assuming logits are the first element

    # predictions now holds logits; take the argmax over the vocabulary to get token IDs.
    # With the plain Trainer these logits come from a teacher-forced forward pass,
    # so the scores are not equivalent to scoring text produced by model.generate().
    predictions = predictions.argmax(axis=-1)

    # Replace -100 (the value Hugging Face uses to mask the loss) with pad_token_id
    # so that the labels can be decoded.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)

    # Decode token IDs back into text
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # rougeLsum treats each line as one sentence. Note that split() breaks on every
    # whitespace character, so this places one token, not one sentence, per line.
    decoded_preds = ["\n".join(pred.strip().split()) for pred in decoded_preds]
    decoded_labels = ["\n".join(label.strip().split()) for label in decoded_labels]

    result = rouge.compute(predictions=decoded_preds, references=decoded_labels)
    return {key: value * 100 for key, value in result.items()}
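
`rougeLsum` is defined over newline-separated sentences, whereas `str.split()` breaks on every whitespace character and therefore puts each token on its own line. A minimal sketch of a sentence-level alternative follows. The regex splitter and the helper name `to_rouge_lsum_format` are assumptions for illustration; a dedicated sentence tokenizer such as `nltk.sent_tokenize` would likely be more robust.

```python
import re

def to_rouge_lsum_format(text: str) -> str:
    """Put each sentence on its own line (naive split after '.', '!' or '?')."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return "\n".join(s for s in sentences if s)

example = "Lucas goes to school every day. His first class is English."
print(to_rouge_lsum_format(example))
# Lucas goes to school every day.
# His first class is English.

# For comparison, the whitespace split yields one token per line:
print(len("\n".join(example.split()).split("\n")))  # 11
```

Switching to a sentence-level split would change the `rougeLsum` values, so scores computed this way are not directly comparable with ones produced by the whitespace split.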

In [None]:
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
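
`DataCollatorForSeq2Seq` pads each batch to its longest sequence at collation time, which is why `tokenize_function` does not pad. The pure-Python sketch below mimics that behaviour on toy token IDs; it is an illustration rather than the library's implementation, and `pad_batch` is a hypothetical helper. It assumes right padding, a pad token ID of 0 (T5's default), and -100 as the label padding value that the loss ignores.

```python
from typing import Dict, List

PAD_TOKEN_ID = 0     # assumed: T5's pad token ID
LABEL_PAD_ID = -100  # label positions with this value are ignored by the loss

def pad_batch(features: List[Dict[str, List[int]]]) -> Dict[str, List[List[int]]]:
    """Right-pad input_ids, attention_mask and labels to the batch maximum."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in, n_lab = len(f["input_ids"]), len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [PAD_TOKEN_ID] * (max_in - n_in))
        batch["attention_mask"].append([1] * n_in + [0] * (max_in - n_in))
        batch["labels"].append(f["labels"] + [LABEL_PAD_ID] * (max_lab - n_lab))
    return batch

# Toy examples with made-up token IDs (1 stands in for the end-of-sequence token)
toy = [
    {"input_ids": [21, 22, 23, 1], "labels": [31, 1]},
    {"input_ids": [24, 1], "labels": [32, 33, 34, 1]},
]
batch = pad_batch(toy)
print(batch["input_ids"])  # [[21, 22, 23, 1], [24, 1, 0, 0]]
print(batch["labels"])     # [[31, 1, -100, -100], [32, 33, 34, 1]]
```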

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
    processing_class=tokenizer,  # newer transformers releases use this in place of tokenizer=
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

In [None]:
trainer.train()

Step  Training Loss  Validation Loss  Rouge1     Rouge2     RougeL     RougeLsum
10    2.6647         2.402951         48.235418  20.217091  43.815182  48.348235
20    2.7395         2.338883         48.43646   20.471471  44.25773   48.572184
30    2.8028         2.246322         48.076107  19.703329  43.566973  48.143263
40    2.7584         2.159834         48.587096  20.530465  44.124057  48.824432
50    2.4124         2.076291         49.876923  22.643271  45.471266  50.080045
60    2.41           2.000057         49.399884  22.943324  45.691908  49.613517
70    2.5606         1.947912         49.107484  22.553496  45.731873  49.175207
80    2.2417         1.896558         49.146187  22.708533  46.024466  49.268597
90    2.2204         1.852749         50.85773   23.368853  47.363432  50.890992
100   2.2453         1.809542         51.162341  23.312416  47.737687  51.303629




TrainOutput(global_step=216, training_loss=2.2033735513687134, metrics={'train_runtime': 2649.7431, 'train_samples_per_second': 0.163, 'train_steps_per_second': 0.082, 'total_flos': 98409795913728.0, 'train_loss': 2.2033735513687134, 'epoch': 3.0})

In [None]:
trainer.evaluate(eval_dataset=tokenized_eval)



{'eval_loss': 1.717995285987854,
 'eval_rouge1': 51.446033107552495,
 'eval_rouge2': 25.20089610555357,
 'eval_rougeL': 48.547703903435114,
 'eval_rougeLsum': 51.56698962634748,
 'eval_runtime': 13.6373,
 'eval_samples_per_second': 1.247,
 'eval_steps_per_second': 0.66,
 'epoch': 3.0}
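
Because `eval_loss` is, approximately, the mean cross-entropy per target token (in nats), it can be converted into a perplexity by exponentiation. A short sketch using the value reported above:

```python
import math

eval_loss = 1.717995285987854  # 'eval_loss' from trainer.evaluate()
perplexity = math.exp(eval_loss)
print(round(perplexity, 2))    # about 5.57
```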

Save and Load Model


In [None]:
model.save_pretrained("./new_summarizer_model")
tokenizer.save_pretrained("./new_summarizer_model")

('./new_summarizer_model/tokenizer_config.json',
 './new_summarizer_model/special_tokens_map.json',
 './new_summarizer_model/spiece.model',
 './new_summarizer_model/added_tokens.json')

In [None]:
model = T5ForConditionalGeneration.from_pretrained("./new_summarizer_model")
tokenizer = T5Tokenizer.from_pretrained("./new_summarizer_model")

Summarization System

In [None]:
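def summarize_text(text):
    """Summarize a single document with the fine-tuned model."""
    input_text = "summarize: " + text
    inputs = tokenizer(input_text, return_tensors="pt", max_length=512, truncation=True)
    # Keep the inputs on the same device as the model
    inputs = {name: tensor.to(model.device) for name, tensor in inputs.items()}
    # Passing **inputs also forwards the attention mask to generate()
    summary_ids = model.generate(
        **inputs, max_length=128, min_length=40, num_beams=4, early_stopping=True
    )
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)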
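
Because `summarize_text` truncates its input to 512 tokens, anything beyond that point is silently ignored. One common workaround, sketched here as an assumption rather than as part of the original notebook, is to split a long document into chunks, summarize each chunk, and join the results. The helpers `split_into_chunks` and `summarize_long_text` and the word-based chunk size are hypothetical; word counts only approximate token counts.

```python
from typing import Callable, List

def split_into_chunks(text: str, max_words: int = 300) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def summarize_long_text(text: str, summarize: Callable[[str], str], max_words: int = 300) -> str:
    """Summarize each chunk with the given function and join the partial summaries."""
    return " ".join(summarize(chunk) for chunk in split_into_chunks(text, max_words))

# Stand-in summarizer for demonstration; in the notebook, pass summarize_text instead.
def first_three_words(chunk: str) -> str:
    return " ".join(chunk.split()[:3])

print(summarize_long_text("one two three four five six seven", first_three_words, max_words=4))
# one two three five six seven
```

In the notebook this would be called as `summarize_long_text(long_text, summarize_text)`, where `long_text` is whatever document needs summarizing.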

In [None]:
sample_text = """
Lucas goes to school every day of the week. He has many subjects to go to each school day: English, art, science, mathematics, gym, and history. His mother packs a big backpack full of books and lunch for Lucas.

His first class is English, and he likes that teacher very much. His English teacher says that he is a good pupil, which Lucas knows means that she thinks he is a good student.

His next class is art. He draws on paper with crayons and pencils and sometimes uses a ruler. Lucas likes art. It is his favorite class.

His third class is science. This class is very hard for Lucas to figure out, but he gets to work with his classmates a lot, which he likes to do. His friend, Kyle, works with Lucas in science class, and they have fun.

Then Lucas gets his break for lunch. He sits with Kyle while he eats. The principal, or the headmaster as some call him, likes to walk around and talk to students during lunch to check that they are all behaving.

The next class is mathematics, which most of the students just call math. Kyle has trouble getting a good grade in mathematics, but the teacher is very nice and helpful.

His fourth class is gym. It is just exercising.

History is his last class of the day. Lucas has a hard time staying awake. Many lessons are boring, and he is very tired after doing gym.
"""

In [None]:
summary = summarize_text(sample_text)
print(summary)

Lucas goes to school every day of the week. His first class is English. His second class is science. His friend Kyle works with Lucas in science. His fourth class is gym. His last class is history.


Download Model to Local Machine
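
A minimal sketch for packaging the saved model directory into a zip archive. The helper `zip_model_dir` and the archive name are assumptions for illustration. The download call only applies inside Colab, where the `google.colab` module is available; elsewhere the archive is simply left on disk.

```python
import os
import shutil

def zip_model_dir(model_dir: str, archive_name: str) -> str:
    """Zip the contents of model_dir and return the path of the created archive."""
    return shutil.make_archive(archive_name, "zip", root_dir=model_dir)

if os.path.isdir("./new_summarizer_model"):
    archive_path = zip_model_dir("./new_summarizer_model", "new_summarizer_model")
    try:
        from google.colab import files  # only available inside Colab
        files.download(archive_path)    # triggers a browser download
    except ImportError:
        print(f"Archive written to {archive_path}")
```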