# This notebook demonstrates how to preprocess a dataset, tokenize the text, and train a summarization model using the Hugging Face Transformers library.


In [4]:
import pandas as pd
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, Seq2SeqTrainer, Seq2SeqTrainingArguments
from datasets import Dataset





## Load Dataset

We will load the dataset from a Parquet file and display the first few rows.


In [None]:
file_path = "hf://datasets/LeaBresson/pubmed-summarization-sample2/data/train-00000-of-00001-05afa88fda9ec5a7.parquet"
df = pd.read_parquet(file_path)


In [None]:
# Display the first few rows of the DataFrame
print(df.head())


In [None]:
# Save the DataFrame to CSV, then reload it from disk
df.to_csv('new.csv', index=False)
df = pd.read_csv('new.csv')
df

## Data Preprocessing

Before tokenizing, we count the missing values in each column and drop any rows that contain them.


In [None]:
missing_values = df.isnull().sum()
print("Missing values per column:")
print(missing_values)

In [None]:
# Drop rows that contain missing values
df.dropna(inplace=True)

## Tokenization

Initialize the tokenizer and define a function to tokenize the inputs and outputs.


In [None]:
# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained("t5-small")


In [None]:
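# Tokenization function
def tokenize_function(examples):
    # T5 expects a task prefix on every input document
    inputs = ["summarize: " + doc for doc in examples['article']]
    model_inputs = tokenizer(inputs, truncation=True, padding='max_length', max_length=512)
    # text_target replaces the deprecated as_target_tokenizer() context manager
    labels = tokenizer(text_target=examples['abstract'], truncation=True, padding='max_length', max_length=128)
    # Replace pad token IDs with -100 so the loss ignores padded positions
    model_inputs["labels"] = [
        [tok if tok != tokenizer.pad_token_id else -100 for tok in seq]
        for seq in labels["input_ids"]
    ]
    return model_inputs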
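
To make the effect of `truncation=True`, `padding='max_length'`, and label masking concrete, here is a minimal sketch in plain Python. It uses a toy whitespace tokenizer with a made-up vocabulary, not the T5 tokenizer, and the names `pad_or_truncate`, `mask_padding`, and `toy_encode` are hypothetical helpers introduced only for illustration. The value -100 is the default `ignore_index` of PyTorch's cross-entropy loss, which is why it is commonly used to mask padded label positions.

```python
PAD_ID = 0           # toy padding ID (T5's pad token ID is also 0)
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def pad_or_truncate(token_ids, max_length, pad_id=PAD_ID):
    """Cut the sequence to max_length, then right-pad it to exactly max_length."""
    clipped = token_ids[:max_length]
    return clipped + [pad_id] * (max_length - len(clipped))


def mask_padding(label_ids, pad_id=PAD_ID):
    """Replace padding positions with IGNORE_INDEX so they do not contribute to the loss."""
    return [IGNORE_INDEX if tok == pad_id else tok for tok in label_ids]


# Toy vocabulary: each new word gets the next free ID, starting at 1 (0 is padding).
vocab = {}


def toy_encode(text):
    return [vocab.setdefault(word, len(vocab) + 1) for word in text.split()]


article_ids = pad_or_truncate(
    toy_encode("summarize: the cell divides and the cell grows"), max_length=6
)
label_ids = mask_padding(pad_or_truncate(toy_encode("the cell divides"), max_length=5))

print(article_ids)  # 8 tokens truncated to 6: [1, 2, 3, 4, 5, 2]
print(label_ids)    # 3 tokens padded to 5, padding masked: [2, 3, 4, -100, -100]
```

Padding every example to a fixed length keeps batches rectangular, at the cost of spending compute on padded positions.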

## Convert and Tokenize Dataset

Convert the DataFrame to a Dataset object and apply the tokenization function.


In [None]:
# Convert DataFrame to Dataset and tokenize
dataset = Dataset.from_pandas(df)
tokenized_dataset = dataset.map(tokenize_function, batched=True)
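
With `batched=True`, the mapped function receives a dictionary of column lists rather than a single row, which is why `tokenize_function` iterates over `examples['article']`. The sketch below imitates that behavior on a plain list of dictionaries. `batched_map` and `count_words` are hypothetical stand-ins written for this illustration, not part of the `datasets` library, and the rows are invented.

```python
def batched_map(rows, fn, batch_size=1000):
    """Toy stand-in for a batched map over a list of row dicts (hypothetical helper)."""
    mapped = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Turn the list of rows into one dict of column lists, the shape fn receives.
        batch = {column: [row[column] for row in chunk] for column in chunk[0]}
        new_columns = fn(batch)
        # Merge the returned columns back into the rows, one value per row.
        for i, row in enumerate(chunk):
            merged = dict(row)
            for column, values in new_columns.items():
                merged[column] = values[i]
            mapped.append(merged)
    return mapped


def count_words(batch):
    # Like tokenize_function, this processes a whole column at once.
    return {"n_words": [len(doc.split()) for doc in batch["article"]]}


rows = [
    {"article": "cells divide and grow", "abstract": "cell growth"},
    {"article": "tumors resist drugs", "abstract": "drug resistance"},
    {"article": "immune response", "abstract": "immunity"},
]
result = batched_map(rows, count_words, batch_size=2)
print([row["n_words"] for row in result])  # [4, 3, 2]
```

Processing whole batches lets the tokenizer encode many texts in a single call, which is usually much faster than mapping row by row.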

## Initialize Model and Training Arguments

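Split the tokenized dataset into training and validation subsets, then set up the training arguments. The model itself is loaded in the Trainer section.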


In [None]:
# Split the dataset into train and validation sets
train_test_split = tokenized_dataset.train_test_split(test_size=0.1)
# Cap the subset sizes so select() does not fail when a split has fewer rows than requested
train_dataset = train_test_split['train'].select(range(min(1000, len(train_test_split['train']))))  # At most 1000 training examples
eval_dataset = train_test_split['test'].select(range(min(200, len(train_test_split['test']))))      # At most 200 evaluation examples

In [None]:
# Training arguments
import torch

training_args = Seq2SeqTrainingArguments(
    output_dir='./results',
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    predict_with_generate=True,
    evaluation_strategy="epoch",  # Renamed to eval_strategy in newer transformers releases
    logging_dir='./logs',
    logging_steps=100,
    save_steps=500,
    num_train_epochs=3,
    overwrite_output_dir=True,
    save_total_limit=3,
    fp16=torch.cuda.is_available(),  # Mixed precision requires a CUDA GPU
)
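
As a rough sanity check on these settings, the arithmetic below estimates how many optimizer steps the run takes and when logging and checkpointing would occur. It is a sketch that assumes a single device, no gradient accumulation, and the full 1000-example training subset; the Trainer's actual schedule may differ with other hardware or library versions.

```python
import math

train_examples = 1000  # size of the training subset
batch_size = 4         # per_device_train_batch_size
num_epochs = 3         # num_train_epochs
logging_steps = 100
save_steps = 500

# Assumption: one device and no gradient accumulation,
# so each optimizer step consumes exactly one batch.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

# Steps at which a log entry or a checkpoint would be due.
log_points = list(range(logging_steps, total_steps + 1, logging_steps))
checkpoint_points = list(range(save_steps, total_steps + 1, save_steps))

print(steps_per_epoch, total_steps)  # 250 750
print(log_points)                    # [100, 200, 300, 400, 500, 600, 700]
print(checkpoint_points)             # [500]
```

Under these assumptions only one intermediate checkpoint falls inside the run, so `save_total_limit=3` is never reached.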

## Trainer

Set up the trainer with the model, tokenizer, and training arguments.


In [None]:
# Initialize the model
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# Initialize the Trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
)

## Train and Save the Model

Train the model, evaluate it on the validation set, and save the result to disk.

In [None]:
# Start training
trainer.train()
# Evaluate the model
results = trainer.evaluate()
print(results)
# Save the trained model
trainer.save_model('./fine-tuned-model')

## Testing the Model

Test the fine-tuned model on several input passages and inspect the generated summaries.

In [5]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the fine-tuned model and tokenizer
model_dir = './fine-tuned-model'
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSeq2SeqLM.from_pretrained(model_dir)

# Function to summarize text
def summarize(text):
    # Prepare the input
    input_text = "summarize: " + text
    inputs = tokenizer.encode(input_text, return_tensors="pt", truncation=True, max_length=512)  # Match the 512-token input limit used during training
    
    # Generate summary
    summary_ids = model.generate(
        inputs, 
        max_length=150,  # Upper bound on summary length, in tokens
        min_length=50,   # Lower bound on summary length, in tokens
        length_penalty=2.0,  # Values above 0 favor longer beams
        num_beams=10,    # Beam width; a wider search is slower but explores more candidates
        no_repeat_ngram_size=3,  # Never generate the same 3-gram twice
        early_stopping=True
    )
    
    # Decode and return the summary
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

# Example passage
example_passage = """
Cancer is a complex disease involving multiple genetic and environmental factors. The most common types include breast, lung, prostate, and colorectal cancers. Recent advancements in genomics have led to a better understanding of the molecular mechanisms underlying cancer development and progression. Targeted therapies, which are drugs designed to specifically attack cancer cells without harming normal cells, have shown promising results in clinical trials. Immunotherapy, which harnesses the body's immune system to fight cancer, has also emerged as a revolutionary treatment option. However, challenges such as drug resistance and tumor heterogeneity remain significant hurdles. Ongoing research aims to address these issues by developing more effective and personalized treatment strategies. The integration of artificial intelligence and machine learning in cancer research is expected to further accelerate the discovery of novel therapeutic targets and improve patient outcomes.
"""

summary = summarize(example_passage)
print("Original Passage:")
print(example_passage)
print("\nSummarized Passage:")
print(summary)
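
The `no_repeat_ngram_size=3` setting prevents any 3-token sequence from appearing twice in the output. A minimal sketch of the idea, using words in place of token IDs: before each step, look at the last two generated tokens and ban every token that previously followed that same pair. `banned_next_tokens` is a hypothetical helper written for this illustration, not the library's implementation.

```python
def banned_next_tokens(generated, n):
    """Return the tokens that would complete an n-gram already present in `generated`."""
    if len(generated) < n:
        return set()  # too short to contain any complete n-gram
    # The last n-1 tokens form the prefix of the n-gram about to be completed.
    prefix = tuple(generated[len(generated) - (n - 1):])
    banned = set()
    for i in range(len(generated) - n + 1):
        # If the same prefix occurred earlier, ban the token that followed it.
        if tuple(generated[i:i + n - 1]) == prefix:
            banned.add(generated[i + n - 1])
    return banned


generated = "the tumor cells respond to the tumor".split()
print(banned_next_tokens(generated, n=3))  # {'cells'}
```

Here "the tumor" was already followed by "cells", so generating "cells" again would repeat the 3-gram "the tumor cells".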



In [3]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the fine-tuned model and tokenizer
model_dir = './fine-tuned-model'
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSeq2SeqLM.from_pretrained(model_dir)

# Function to summarize text
def summarize(text):
    # Tokenize input text
    inputs = tokenizer(
        "summarize: " + text,
        truncation=True,
        padding='longest',
        max_length=512,
        return_tensors="pt"
    )
    
    # Generate summary
    summary_ids = model.generate(
        inputs['input_ids'],
        max_length=150,
        min_length=40,
        length_penalty=1.0,
        num_beams=4,
        early_stopping=True
    )
    
    # Decode and return the summary
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

# Example passage
example_passage = """
In the heart of the bustling city, where skyscrapers kissed the clouds and neon lights painted the night sky, there existed a small, unassuming bookstore. Its weathered sign, barely hanging on, proclaimed "Whispering Pages" in faded gold lettering. Inside, the air was thick with the scent of aging paper and ink, a comforting fragrance for the soul-weary. Tall bookshelves lined every wall, bending under the weight of their literary treasures. Dust motes danced lazily in the shafts of sunlight that streamed through the small, round windows. The proprietor, Mr. Everly, a man of gentle demeanor with wisps of gray hair framing his kind face, greeted each visitor with a warm smile. His passion for books was infectious, evident in the way he spoke about each volume as if it held a secret waiting to be discovered. Regular patrons, a diverse crowd ranging from young students seeking knowledge to elderly bibliophiles seeking solace, found solace in the quiet corners of the store. Some lost themselves in ancient tomes, while others found new adventures in freshly printed novels. Outside, the world hurried by, oblivious to the sanctuary of stories nestled within Whispering Pages. For within those hallowed walls, time slowed and the imagination soared, transcending the boundaries of reality. Each page turned whispered tales of love and loss, of heroes and villains, binding together the souls of those who dared to tread its worn wooden floors.
"""

# Generate and print the summary
summary = summarize(example_passage)
print("Original Passage:")
print(example_passage)
print("\nSummarized Passage:")
print(summary)
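
The two `summarize` variants in this notebook use different `length_penalty` values (2.0 and 1.0). The sketch below shows why that matters, assuming the common beam-search normalization in which a hypothesis is scored by its summed log-probability divided by `length ** length_penalty`. The two candidates and their numbers are invented for illustration, and `beam_score` and `best_candidate` are hypothetical helpers.

```python
def beam_score(sum_logprobs, length, length_penalty):
    """Length-normalized score: total log-probability divided by length ** length_penalty."""
    return sum_logprobs / (length ** length_penalty)


# Two made-up finished hypotheses: a short one and a longer, less probable one.
candidates = {
    "short": {"sum_logprobs": -4.0, "length": 10},
    "long": {"sum_logprobs": -10.0, "length": 20},
}


def best_candidate(length_penalty):
    # Pick the hypothesis with the highest normalized score.
    return max(
        candidates,
        key=lambda name: beam_score(length_penalty=length_penalty, **candidates[name]),
    )


print(best_candidate(1.0))  # short  (-0.4 beats -0.5)
print(best_candidate(2.0))  # long   (-0.025 beats -0.04)
```

Because log-probabilities are negative, dividing by a larger power of the length moves long hypotheses closer to zero, so a higher penalty value tends to produce longer summaries.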

