# Fine-Tuning a T5-Small Model for Text Summarization on Scientific Papers Dataset

## Introduction

In this notebook, we focus on fine-tuning a **T5-Small** model for text summarization using the **scientific_papers** dataset. The training process includes data preprocessing, tokenization, model training, and performance evaluation over **two epochs** to assess how well the model generates concise summaries for scientific articles.

The **scientific_papers** dataset consists of scholarly articles spanning multiple disciplines, making it an ideal resource for exploring **natural language processing (NLP)** tasks such as text summarization. By leveraging the **Hugging Face Transformers** library, we aim to optimize the model's ability to generate high-quality abstracts from research papers.

For the full tutorial, check out the original Medium article:
👉 [Read the full article on Medium](https://medium.com/@abdullahk.sulaiman/can-i-creat-my-own-text-summarization-654252f0b138)

## Step 1: Install Required Libraries

Run the following commands to install the necessary dependencies:

In [22]:
!pip install wandb
!pip install evaluate
!pip install rouge_score
!pip install huggingface_hub



## Step 2: Import Required Modules

In [23]:
from transformers import (
    AutoModelForSeq2SeqLM, 
    Seq2SeqTrainingArguments, 
    Seq2SeqTrainer,
    AutoTokenizer,
    DataCollatorForSeq2Seq
)


from datasets import load_dataset, Dataset
from huggingface_hub import login
import pandas as pd
import numpy as np
import evaluate
import shutil
import torch
import wandb

print("Done!")

Done!


## Step 3: Log in to Weights & Biases and Hugging Face

To track experiments with Weights & Biases and push our trained model to Hugging Face, we log in using API keys:

In [24]:
# Log in to Weights & Biases (wandb.ai)
wandb.login(key = 'YOUR_WANDB_API_KEY')

[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

In [25]:
# Log in to Hugging Face (huggingface.co)
login("YOUR_HUGGING_FACE_API_KEY")

#### How to Get API Keys

`Weights & Biases`: Go to User Settings → API keys → Generate a new key.

`Hugging Face`: Go to Settings → Access Tokens → Create a new write token.
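
Hard-coding keys in a notebook makes them easy to leak when the notebook is shared. Below is a minimal sketch of one alternative, assuming the keys are stored in environment variables. `read_api_key` is a hypothetical helper, and `WANDB_API_KEY` and `HF_TOKEN` are the variable names that `wandb` and `huggingface_hub` conventionally read.

```python
import os

def read_api_key(var_name: str) -> str:
    """Return the API key stored in an environment variable (hypothetical helper)."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Environment variable {var_name} is not set.")
    return key

# Usage with the login calls from this step (assumes both variables are set):
# wandb.login(key=read_api_key("WANDB_API_KEY"))
# login(read_api_key("HF_TOKEN"))
```

Failing early with a clear error is a design choice here: a missing key is reported before any training time is spent.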

## Step 4: Set Training Parameters

Define the training parameters in one place, so that any later change only has to be made here:

In [26]:
MODEL = 't5-small'
BATCH_SIZE = 16
NUM_PROCS = 5
EPOCHS = 2
OUT_DIR = 'results_t5small'
MAX_LENGTH = 512 

## Step 5: Download The Dataset

In [27]:
# Download the Dataset from the Hub
dataset = load_dataset("scientific_papers", "arxiv")

### Inspecting the Dataset

In [28]:
print(f"Dataset type: {type(dataset)}")
print(f"Dataset length: {len(dataset)}")
print(f"Dataset keys: {dataset.keys()}")

Dataset type: <class 'datasets.dataset_dict.DatasetDict'>
Dataset length: 3
Dataset keys: dict_keys(['train', 'validation', 'test'])


**Based on the structure of the dataset, we can now split it into three parts:**

In [29]:
train_dataset = dataset['train']
eval_dataset = dataset['validation']
test_dataset = dataset['test']

## Step 6: Prepare the Dataset

Let’s take a closer look at the training section to understand its **type**, **length**, and **structure**.

In [30]:
print(f"Train section type: {type(train_dataset)}")
print(f"Train section length: {len(train_dataset)}")

Train section type: <class 'datasets.arrow_dataset.Dataset'>
Train section length: 203037


**Inspecting the Dictionary Keys**

In [31]:
print(train_dataset[0].keys())  

dict_keys(['article', 'abstract', 'section_names'])


In [32]:
print(train_dataset[0]['article'])  # Print the first article in the training split

additive models @xcite provide an important family of models for semiparametric regression or classification . some reasons for the success of additive models are their increased flexibility when compared to linear or generalized linear models and their increased interpretability when compared to fully nonparametric models .
it is well - known that good estimators in additive models are in general less prone to the curse of high dimensionality than good estimators in fully nonparametric models .
many examples of such estimators belong to the large class of regularized kernel based methods over a reproducing kernel hilbert space @xmath0 , see e.g. @xcite . in the last years
many interesting results on learning rates of regularized kernel based models for additive models have been published when the focus is on sparsity and when the classical least squares loss function is used , see e.g. @xcite , @xcite , @xcite , @xcite , @xcite , @xcite and the references therein . of course , the lea

In [33]:
print(train_dataset[0]['abstract'])  # Print the abstract of the first training example

 additive models play an important role in semiparametric statistics . 
 this paper gives learning rates for regularized kernel based methods for additive models . 
 these learning rates compare favourably in particular in high dimensions to recent results on optimal learning rates for purely nonparametric regularized kernel based quantile regression using the gaussian radial basis function kernel , provided the assumption of an additive model is valid . 
 additionally , a concrete example is presented to show that a gaussian function depending only on one variable lies in a reproducing kernel hilbert space generated by an additive gaussian kernel , but does not belong to the reproducing kernel hilbert space generated by the multivariate gaussian kernel of the same variance .    * 
 key words and phrases . * additive model , kernel , quantile regression , semiparametric , rate of convergence , support vector machine . 


In [34]:
print(train_dataset[0]['section_names'])  # Print the section names of the first training example

introduction
main results on learning rates
comparison of learning rates


## Step 7: Tokenize the Dataset

We will use a **tokenizer** to transform our text-based dataset (containing articles and abstracts) into numerical format. The tokenizer comes from the Hugging Face `transformers` library: `AutoTokenizer` loads the tokenizer that matches the chosen checkpoint (here `t5-small`), so the text is split using the same vocabulary the model was pre-trained with.

In [36]:
tokenizer = AutoTokenizer.from_pretrained(MODEL) 

In [37]:
# Tokenization and preprocessing
def preprocess_function(examples):
    inputs = ["summarize: " + text for text in examples['article']]
    
    model_inputs = tokenizer(
        inputs,
        max_length=MAX_LENGTH,
        truncation=True,
        padding=True
    )

    labels = tokenizer(
                text_target = examples['abstract'],
                max_length=128,
                truncation=True,
                padding=True
            )

    
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs 
    
tokenized_train_dataset = train_dataset.map(preprocess_function, batched=True)
tokenized_eval_dataset = eval_dataset.map(preprocess_function, batched=True) 

Map:   0%|          | 0/203037 [00:00<?, ? examples/s]

Map:   0%|          | 0/6436 [00:00<?, ? examples/s]
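
The `compute_metrics` function later in this notebook converts `-100` back to the pad token, because `-100` is the label value that the loss ignores. Since `padding=True` above fills the labels with the pad token id instead, one common adjustment, which was not part of the original run, is to replace padded label positions with `-100`. Below is a minimal sketch on toy token ids, assuming a pad id of 0 (the pad token id of T5). `mask_label_padding` is a hypothetical helper.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss in Transformers models

def mask_label_padding(label_ids, pad_token_id=0):
    """Replace pad token ids in each label sequence with IGNORE_INDEX."""
    return [
        [tok if tok != pad_token_id else IGNORE_INDEX for tok in seq]
        for seq in label_ids
    ]

# Toy label ids: the first sequence ends with two padding positions.
toy_labels = [[71, 1708, 1, 0, 0], [363, 19, 48, 1782, 1]]
print(mask_label_padding(toy_labels))
# [[71, 1708, 1, -100, -100], [363, 19, 48, 1782, 1]]
```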

**Checking the Tokenized Dataset**

In [38]:
print(f"Validation Dataset Dictionary Keys --> {tokenized_eval_dataset[0].keys()}")
print(f"Dataset length --> {len(tokenized_eval_dataset)}\n")

print(f"Training Dataset Dictionary Keys --> {tokenized_train_dataset[0].keys()}")
print(f"Dataset length --> {len(tokenized_train_dataset)}")

Validation Dataset Dictionary Keys --> dict_keys(['article', 'abstract', 'section_names', 'input_ids', 'attention_mask', 'labels'])
Dataset length --> 6436

Training Dataset Dictionary Keys --> dict_keys(['article', 'abstract', 'section_names', 'input_ids', 'attention_mask', 'labels'])
Dataset length --> 203037


## Step 8: Initialize Model and Set Device

In [39]:
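# The collator's `model` argument expects a model instance, not a checkpoint name.
# It is omitted here because the model is only loaded in the next cell.
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer)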

In [40]:
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL)

config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/242M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

In [41]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

total_params = sum(p.numel() for p in model.parameters())
print(f"{total_params:,} total parameters.")
total_trainable_params = sum(
    p.numel() for p in model.parameters() if p.requires_grad)
print(f"{total_trainable_params:,} training parameters.")

60,506,624 total parameters.
60,506,624 training parameters.
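
As a quick sanity check, the parameter count is consistent with the size of the downloaded `model.safetensors` file (242M), assuming the weights are stored as 32-bit floats (4 bytes each). The sketch below only does this arithmetic. `BYTES_PER_PARAM` is an assumption, not something the notebook states.

```python
TOTAL_PARAMS = 60_506_624   # printed by the cell above
BYTES_PER_PARAM = 4         # assumption: float32 weights

size_mb = TOTAL_PARAMS * BYTES_PER_PARAM / 1e6
print(f"Approximate weight size: {size_mb:.0f} MB")
# Approximate weight size: 242 MB
```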


## Step 9: Set Up the Metric Computation

In [42]:
rouge = evaluate.load("rouge")

Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

In [43]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # Turn the generated token ids back into text.
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # -100 marks label positions ignored by the loss; swap it for the pad token so decoding works.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # ROUGE between the generated summaries and the reference abstracts.
    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    # Average number of non-padding tokens per generated summary.
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}
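
To make the metric more concrete, here is a simplified sketch of what a ROUGE-1 F1 score measures: the overlap of single words between a generated summary and its reference. It is not the `rouge_score` implementation used by `evaluate` (it has no stemming and uses plain whitespace tokenization), and `rouge1_f1` is a hypothetical helper.

```python
from collections import Counter

def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between two whitespace-tokenized strings (simplified)."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum((pred_counts & ref_counts).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

print(round(rouge1_f1("the cat sat", "the cat sat on the mat"), 4))
# 0.6667
```

In this toy case every predicted word appears in the reference (precision 1.0), but only half of the reference words are covered (recall 0.5), which is why short summaries tend to be limited by recall.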

## Step 10: Set Up Training Arguments

In [None]:
training_args = Seq2SeqTrainingArguments(
    output_dir=OUT_DIR,
    num_train_epochs=EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    warmup_steps=500, 
    weight_decay=0.01, 
    logging_dir=OUT_DIR,
    logging_steps=10,  
    eval_strategy='steps',
    eval_steps=2500,
    save_strategy='epoch',
    save_steps=2500,  # has no effect while save_strategy='epoch'
    save_total_limit=3,
    learning_rate=5e-4,    
    dataloader_num_workers=4,
    report_to='wandb', 
    predict_with_generate=True, 
    push_to_hub=True,
)
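
With `warmup_steps=500` and the Trainer's default linear scheduler, the learning rate rises linearly to `learning_rate` over the first 500 steps and then decays linearly to zero. Below is a minimal sketch of that schedule, assuming 12,690 total optimizer steps (the `global_step` reported at the end of training). `linear_schedule_lr` is a hypothetical helper, not a Transformers function.

```python
def linear_schedule_lr(step, base_lr=5e-4, warmup_steps=500, total_steps=12_690):
    """Learning rate at a given optimizer step: linear warmup, then linear decay."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)

# Print the learning rate at the start, mid-warmup, peak, mid-decay, and end.
for step in (0, 250, 500, 6_595, 12_690):
    print(step, f"{linear_schedule_lr(step):.6f}")
```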

## Step 11: Start the Training

In [44]:
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_eval_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator, 
    compute_metrics=compute_metrics
) 
 
trainer.train()




Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


| Step | Training Loss | Validation Loss | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-Lsum | Gen Len |
|---|---|---|---|---|---|---|---|
| 2500 | 2.6116 | 2.578147 | 0.1704 | 0.0553 | 0.1343 | 0.1342 | 20.0 |
| 5000 | 2.5436 | 2.478952 | 0.1707 | 0.0569 | 0.1348 | 0.1347 | 20.0 |
| 7500 | 2.5037 | 2.431616 | 0.178 | 0.0602 | 0.1399 | 0.1398 | 20.0 |
| 10000 | 2.4498 | 2.403573 | 0.1803 | 0.0616 | 0.1415 | 0.1415 | 20.0 |
| 12500 | 2.4604 | 2.388103 | 0.1784 | 0.0606 | 0.1404 | 0.1403 | 20.0 |
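
The evaluation log above shows that validation loss and ROUGE do not peak at the same step. The short sketch below, using values copied from the log, finds the best step for each. The variable names are illustrative only.

```python
# (step, validation loss, ROUGE-1) copied from the evaluation log above
eval_log = [
    (2500, 2.578147, 0.1704),
    (5000, 2.478952, 0.1707),
    (7500, 2.431616, 0.1780),
    (10000, 2.403573, 0.1803),
    (12500, 2.388103, 0.1784),
]

best_rouge_step = max(eval_log, key=lambda row: row[2])[0]
best_loss_step = min(eval_log, key=lambda row: row[1])[0]
print(f"Best ROUGE-1 at step {best_rouge_step}, lowest validation loss at step {best_loss_step}")
# Best ROUGE-1 at step 10000, lowest validation loss at step 12500
```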




TrainOutput(global_step=12690, training_loss=2.57481667553937, metrics={'train_runtime': 12192.0534, 'train_samples_per_second': 33.306, 'train_steps_per_second': 1.041, 'total_flos': 5.495878669094093e+16, 'train_loss': 2.57481667553937, 'epoch': 2.0})
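
The reported `global_step=12690` can be cross-checked against the dataset size. The sketch below computes the expected number of optimizer steps for two epochs at different effective batch sizes. The reported value matches an effective batch size of 32 rather than 16, which would be consistent with training on two devices. That is an inference from the numbers, since the notebook does not state the hardware, and `total_steps` is a hypothetical helper.

```python
import math

TRAIN_EXAMPLES = 203_037      # length of the training split
EPOCHS = 2
REPORTED_GLOBAL_STEP = 12_690

def total_steps(effective_batch_size: int) -> int:
    """Optimizer steps for EPOCHS passes over the training split."""
    return math.ceil(TRAIN_EXAMPLES / effective_batch_size) * EPOCHS

print(total_steps(16))  # 25380
print(total_steps(32))  # 12690
```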

In [47]:
print("Done!")

Done!


### Pushing the Model to the Hub

In [46]:
trainer.push_to_hub()

CommitInfo(commit_url='https://huggingface.co/AbdullahKnn/results_t5small/commit/cb1e682666d4ac3d46102ce74c16b12106e5ee7c', commit_message='End of training', commit_description='', oid='cb1e682666d4ac3d46102ce74c16b12106e5ee7c', pr_url=None, repo_url=RepoUrl('https://huggingface.co/AbdullahKnn/results_t5small', endpoint='https://huggingface.co', repo_type='model', repo_id='AbdullahKnn/results_t5small'), pr_revision=None, pr_num=None)