<a href="https://colab.research.google.com/github/AnamAtr/Text-summarization-project-/blob/main/Text_Summarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


## **Installation & Environment Setup**

In [2]:
!nvidia-smi

Fri May 23 16:18:25 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   40C    P8              9W /   70W |       2MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [None]:
!pip install transformers[sentencepiece] datasets sacrebleu rouge_score py7zr

In [None]:
!pip install --upgrade accelerate
!pip install transformers accelerate

In [None]:
# Install the evaluate library; metrics are loaded from it rather than
# through the old load_metric helper in datasets
!pip install evaluate

from datasets import load_dataset
import pandas as pd

# Metrics such as ROUGE are loaded with evaluate.load()
import evaluate

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

import nltk
from nltk.tokenize import sent_tokenize

from tqdm import tqdm
import torch

nltk.download("punkt")

# Example of loading a specific metric:
# rouge = evaluate.load("rouge")

## **Import Required Libraries**

In [None]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
device

In [None]:
model_ckpt="google/pegasus-cnn_dailymail"
tokenizer=AutoTokenizer.from_pretrained(model_ckpt)
model_pegasus=AutoModelForSeq2SeqLM.from_pretrained(model_ckpt).to(device)

**Load Pre-trained Summarization model**

In [None]:
# Install and upgrade necessary libraries
!pip install --upgrade datasets fsspec huggingface_hub

# After upgrading, try loading the dataset again
from datasets import load_dataset

try:
    ds = load_dataset("knkarthick/samsum")
    print("Dataset loaded successfully!")
except ValueError as e:
    print(f"Failed to load dataset after upgrade: {e}")
    print("The issue might be with the dataset configuration itself or a persistent compatibility problem.")

In [None]:
# Use the /raw/ URL: the /blob/ URL returns the GitHub HTML page, not the zip archive
!wget https://github.com/entbappy/Branching-tutorial/raw/master/summarizer-data.zip
!unzip summarizer-data.zip

In [14]:
from google.colab import files
uploaded = files.upload()


Saving summarizer-data (1).zip to summarizer-data (1).zip


# **Input Text For Summarization**

In [15]:
import zipfile
import os

zip_filename = list(uploaded.keys())[0]  # Automatically get the uploaded filename

with zipfile.ZipFile(zip_filename, 'r') as zip_ref:
    zip_ref.extractall("unzipped_files")  # Unzips into 'unzipped_files' folder

print("✅ File unzipped successfully!")


✅ File unzipped successfully!


In [16]:
os.listdir("unzipped_files")


['samsum-test.csv',
 'samsum-train.csv',
 'samsum_dataset',
 'samsum-validation.csv']

In [None]:
from datasets import load_from_disk
import os

# The dataset sits inside the 'unzipped_files' directory.
# Build an absolute path and add the 'file://' protocol prefix explicitly.
dataset_path = "file://" + os.path.abspath('./unzipped_files/samsum_dataset')


# Load the dataset from the absolute path
try:
    dataset_samsum = load_from_disk(dataset_path)
    print("Dataset loaded successfully!")
except (ValueError, FileNotFoundError) as e:
    # If loading fails, dataset_samsum stays undefined and later cells raise NameError
    print(f"Failed to load dataset: {e}")
    print("Please ensure the path is correct and the directory contains a valid dataset.")

In [18]:
# Calculate the length of each split and store it in a list
split_lengths=[len(dataset_samsum[split]) for split in dataset_samsum]
print(f"split lengths:{split_lengths}")
print(f"features:{dataset_samsum['train'].column_names}")
print("\nDialogue:")
print(dataset_samsum["test"][1]["dialogue"])
print("\nSummary:")
print(dataset_samsum["test"][1]["summary"])

split lengths:[14732, 819, 818]
features:['id', 'dialogue', 'summary']

Dialogue:
Eric: MACHINE!
Rob: That's so gr8!
Eric: I know! And shows how Americans see Russian ;)
Rob: And it's really funny!
Eric: I know! I especially like the train part!
Rob: Hahaha! No one talks to the machine like that!
Eric: Is this his only stand-up?
Rob: Idk. I'll check.
Eric: Sure.
Rob: Turns out no! There are some of his stand-ups on youtube.
Eric: Gr8! I'll watch them now!
Rob: Me too!
Eric: MACHINE!
Rob: MACHINE!
Eric: TTYL?
Rob: Sure :)

Summary:
Eric and Rob are going to watch a stand-up on youtube.


In [19]:
def convert_examples_to_features(example_batch):
  # Tokenize the dialogues (model inputs), truncated to 1024 tokens
  input_encodings=tokenizer(example_batch['dialogue'],max_length=1024,truncation=True)
  # Tokenize the summaries as targets; text_target replaces the deprecated
  # tokenizer.as_target_tokenizer() context manager
  target_encodings=tokenizer(text_target=example_batch['summary'],max_length=128,truncation=True)
  return{
      'input_ids':input_encodings['input_ids'],
      'attention_mask':input_encodings['attention_mask'],
      'labels':target_encodings['input_ids']
  }

In [None]:
# Initialize the tokenizer and map the dataset to model features
from transformers import AutoTokenizer

model_ckpt = "google/pegasus-cnn_dailymail"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

def convert_examples_to_features(example_batch, tokenizer):
  # The tokenizer is passed in explicitly instead of being read from a global
  input_encodings=tokenizer(example_batch['dialogue'],max_length=1024,truncation=True)
  # text_target tokenizes the summaries as labels; it replaces the deprecated
  # tokenizer.as_target_tokenizer() context manager
  target_encodings=tokenizer(text_target=example_batch['summary'],max_length=128,truncation=True)
  return{
      'input_ids':input_encodings['input_ids'],
      'attention_mask':input_encodings['attention_mask'],
      'labels':target_encodings['input_ids']
  }

# Pass the tokenizer to map() through fn_kwargs.
# dataset_samsum must already be loaded.
dataset_samsum_pt = dataset_samsum.map(
    convert_examples_to_features,
    batched=True,
    fn_kwargs={"tokenizer": tokenizer}
)

In [None]:
dataset_samsum_pt["train"]

# **Summary Analysis (Word Count & Readability)**

In [22]:
from transformers import DataCollatorForSeq2Seq

seq2seq_data_collator=DataCollatorForSeq2Seq(tokenizer,model=model_pegasus)
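
To make the collator's role concrete, here is a minimal pure-Python sketch of the dynamic padding that `DataCollatorForSeq2Seq` applies to a batch: inputs are padded with the pad token, while labels are padded with -100 (the collator's default `label_pad_token_id`) so that padded positions are ignored by the loss. The `pad_batch` helper and the toy token IDs are hypothetical; the real collator also returns tensors and can prepare decoder inputs when a model is passed.

```python
def pad_batch(features, pad_token_id=0, label_pad_token_id=-100):
    """Pad every example to the longest sequence in the batch (illustrative only)."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        in_pad = max_in - len(f["input_ids"])
        lab_pad = max_lab - len(f["labels"])
        # Inputs get the pad token and a 0 in the attention mask
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * in_pad)
        batch["attention_mask"].append(f["attention_mask"] + [0] * in_pad)
        # Labels get -100 so the loss skips the padded positions
        batch["labels"].append(f["labels"] + [label_pad_token_id] * lab_pad)
    return batch

# Toy token IDs, not real Pegasus vocabulary entries
toy_features = [
    {"input_ids": [5, 6, 7, 1], "attention_mask": [1, 1, 1, 1], "labels": [8, 9, 1]},
    {"input_ids": [5, 1], "attention_mask": [1, 1], "labels": [8, 1]},
]
padded = pad_batch(toy_features)
print(padded["input_ids"])
print(padded["labels"])
```

Padding per batch rather than to a fixed length keeps short dialogues from being stretched to the 1024-token limit.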

In [23]:
from transformers import TrainingArguments,Trainer

trainer_args = TrainingArguments(
    output_dir='pegasus-samsum', num_train_epochs=1, warmup_steps=500,
    per_device_train_batch_size=1, per_device_eval_batch_size=1,
    weight_decay=0.01, logging_steps=10,
    eval_strategy='steps', # called evaluation_strategy in older transformers releases
    eval_steps=500, save_steps=1e6,
    gradient_accumulation_steps=16
)
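
A quick back-of-the-envelope check shows how these arguments interact. The sketch below assumes a single GPU and takes the split sizes from the output printed earlier (14732 examples in the training split, 819 in the test split). The `schedule_summary` helper is hypothetical, and the exact rounding of the final partial accumulation step varies between transformers versions, so the step counts are approximate.

```python
import math

def schedule_summary(num_examples, per_device_batch_size, grad_accum_steps,
                     num_epochs, warmup_steps, eval_steps):
    """Estimate optimizer steps for one device (hypothetical helper)."""
    effective_batch = per_device_batch_size * grad_accum_steps
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * num_epochs
    return {
        "effective_batch": effective_batch,
        "total_steps": total_steps,
        # Warmup only finishes if training runs at least warmup_steps steps
        "warmup_completes": warmup_steps <= total_steps,
        # Step-based evaluation fires once every eval_steps optimizer steps
        "evaluations_during_training": total_steps // eval_steps,
    }

train_plan = schedule_summary(14732, 1, 16, 1, 500, 500)
test_plan = schedule_summary(819, 1, 16, 1, 500, 500)
print("train split:", train_plan)
print("test split:", test_plan)
```

Under these assumptions, the 819-example split yields only about 52 optimizer steps, so a 500-step warmup never finishes and the 500-step evaluation interval is never reached. The full training split gives roughly 921 steps.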

In [None]:
import wandb
from transformers import TrainingArguments,Trainer

# Initialize Weights & Biases
wandb.init(project="pegasus-samsum-training") # You can change the project name

trainer_args = TrainingArguments(
    output_dir='pegasus-samsum', num_train_epochs=1, warmup_steps=500,
    per_device_train_batch_size=1, per_device_eval_batch_size=1,
    weight_decay=0.01, logging_steps=10,
    eval_strategy='steps', # called evaluation_strategy in older transformers releases
    eval_steps=500, save_steps=int(1e6),
    gradient_accumulation_steps=16
)

# NOTE: this trains on the small test split, presumably to keep the run short.
# The ROUGE evaluation later in the notebook also reads from the test split,
# so use dataset_samsum_pt["train"] for a real fine-tune.
trainer = Trainer(model=model_pegasus, args=trainer_args,
                  tokenizer=tokenizer, data_collator=seq2seq_data_collator,
                  train_dataset=dataset_samsum_pt["test"],
                  eval_dataset=dataset_samsum_pt["validation"])

trainer.train()

# Finish the Weights & Biases run
wandb.finish()

In [None]:
import wandb
from transformers import TrainingArguments,Trainer, AutoModelForSeq2SeqLM, AutoTokenizer
import torch

# Initialize Weights & Biases
wandb.init(project="pegasus-samsum-training") # You can change the project name

# Re-define device and model_ckpt so this cell runs on its own
device = "cuda" if torch.cuda.is_available() else "cpu"
model_ckpt = "google/pegasus-cnn_dailymail"

# Reload the tokenizer and the pre-trained model
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model_pegasus = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt).to(device)

trainer_args = TrainingArguments(
    output_dir='pegasus-samsum', num_train_epochs=1, warmup_steps=500,
    per_device_train_batch_size=1, per_device_eval_batch_size=1,
    weight_decay=0.01, logging_steps=10,
    eval_strategy='steps', # called evaluation_strategy in older transformers releases
    eval_steps=500, save_steps=int(1e6),
    gradient_accumulation_steps=16
)

# seq2seq_data_collator and dataset_samsum_pt come from the earlier cells
# (data collator and dataset mapping); run those cells before this one.
# NOTE: this trains on the small test split, as in the original code. The ROUGE
# evaluation later in the notebook also reads from the test split, so use
# dataset_samsum_pt["train"] for a real fine-tune.
trainer = Trainer(model=model_pegasus, args=trainer_args,
                  tokenizer=tokenizer, data_collator=seq2seq_data_collator,
                  train_dataset=dataset_samsum_pt["test"],
                  eval_dataset=dataset_samsum_pt["validation"])

trainer.train()

# Finish the Weights & Biases run
wandb.finish()

In [26]:
def generate_batch_sized_chunks(list_of_elements,batch_size):
  """Yield successive batch-sized chunks from list_of_elements."""
  for i in range(0,len(list_of_elements),batch_size):
    yield list_of_elements[i:i+batch_size]

def Calculates_metric_on_test_ds(dataset,metric,model,tokenizer,batch_size=16,device=device,
                                 column_text="article",
                                 column_summary="highlights"):
  article_batches=list(generate_batch_sized_chunks(dataset[column_text],batch_size))
  target_batches=list(generate_batch_sized_chunks(dataset[column_summary],batch_size))
  for article_batch,target_batch in tqdm(
      zip(article_batches,target_batches),total=len(article_batches)):
      inputs=tokenizer(article_batch,max_length=1024,truncation=True,
                        padding="max_length",return_tensors="pt")
      # The length_penalty parameter keeps the model from generating summaries that are too long
      summaries=model.generate(input_ids=inputs["input_ids"].to(device),
                               attention_mask=inputs["attention_mask"].to(device),
                               length_penalty=0.8,num_beams=8,max_length=128)
      # Decode the generated token IDs back to text
      decoded_summaries=[tokenizer.decode(s,skip_special_tokens=True,
                                          clean_up_tokenization_spaces=True)
                         for s in summaries]
      # Pegasus marks line breaks with "<n>"; replace them with spaces
      decoded_summaries=[d.replace("<n>"," ")for d in decoded_summaries]
      metric.add_batch(predictions=decoded_summaries,references=target_batch)
  # Compute the metric over all accumulated batches
  score=metric.compute()
  return score
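
The chunking helper is easy to check in isolation. The sketch below re-implements the same slicing logic under a hypothetical name, `chunked`, and shows that the last chunk is simply shorter when the list length is not a multiple of the batch size, which is why the evaluation loop needs no special handling for the final batch.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Five placeholder dialogues split into batches of two
dialogues = [f"dialogue {i}" for i in range(5)]
batches = list(chunked(dialogues, 2))
print(batches)
```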

# **Final Output & Insight**


In [None]:
!pip install rouge_score


In [None]:
# evaluate was imported earlier; repeated here so the cell runs on its own
import evaluate

rouge_name=["rouge1","rouge2","rougeL","rougeLsum"]
# Load the ROUGE metric with evaluate.load
rouge_metric = evaluate.load('rouge')

In [None]:
# Score the first 10 test dialogues with the fine-tuned model
score = Calculates_metric_on_test_ds(
    dataset_samsum['test'][0:10], rouge_metric, trainer.model, tokenizer, batch_size = 2, column_text = 'dialogue', column_summary= 'summary'
)

# The score holds a plain float per ROUGE type, so it is read directly
# (no ['mid'].fmeasure lookup); missing keys are skipped to be safe
rouge_dict = {}
for rn in rouge_name:
    if rn in score:
        rouge_dict[rn] = score[rn]

pd.DataFrame(rouge_dict,index = ['pegasus'])
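
For intuition about what these scores measure, the following is a simplified sketch of ROUGE-1 F1 as clipped unigram overlap between a prediction and a reference. It is an illustration only: the `rouge1_f1` helper is hypothetical and uses lowercased whitespace tokens, whereas the `rouge_score` package behind `evaluate.load('rouge')` applies its own tokenization and optional stemming, so its numbers will differ. The reference is the test summary shown earlier; the prediction is made up for the example.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection clips each word's count to the smaller of the two
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

reference = "Eric and Rob are going to watch a stand-up on youtube."
prediction = "Eric and Rob will watch a stand-up on youtube."  # invented prediction
print(round(rouge1_f1(prediction, reference), 3))
```

ROUGE-2 follows the same idea with bigrams, and ROUGE-L replaces n-gram overlap with the longest common subsequence.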

In [31]:
model_pegasus.save_pretrained("pegasus-samsum-model")

In [None]:
tokenizer.save_pretrained("tokenizer")

In [33]:
tokenizer=AutoTokenizer.from_pretrained("/content/tokenizer")

In [41]:
from transformers import pipeline, AutoTokenizer

# Generation settings: beam search with a mild length penalty
gen_kwargs={"length_penalty":0.8,"num_beams":8,"max_length":128}


sample_text=dataset_samsum["test"][0]["dialogue"]
reference=dataset_samsum["test"][0]["summary"]

# Build a summarization pipeline from the saved model and tokenizer
pipe=pipeline("summarization",model="pegasus-samsum-model",tokenizer=tokenizer)

print("Dialogue:")
print(sample_text)
print("\nReference Summary:")
print(reference)
print("\nModel Summary:")
print(pipe(sample_text,**gen_kwargs)[0]["summary_text"])

Device set to use cuda:0
Your max_length is set to 128, but your input_length is only 122. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=61)


Dialogue:
Hannah: Hey, do you have Betty's number?
Amanda: Lemme check
Hannah: <file_gif>
Amanda: Sorry, can't find it.
Amanda: Ask Larry
Amanda: He called her last time we were at the park together
Hannah: I don't know him well
Hannah: <file_gif>
Amanda: Don't be shy, he's very nice
Hannah: If you say so..
Hannah: I'd rather you texted him
Amanda: Just text him 🙂
Hannah: Urgh.. Alright
Hannah: Bye
Amanda: Bye bye

Reference Summary:
Hannah needs Betty's number but Amanda doesn't have it. She needs to contact Larry.

Model Summary:
Amanda: Ask Larry Amanda: He called her last time we were at the park together .<n>Hannah: I'd rather you texted him .<n>Amanda: Just text him .
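
The raw pipeline output above still contains the `<n>` markers that Pegasus uses for line breaks, plus a space before each final period. A minimal post-processing sketch, assuming only these two artifacts need cleaning (the `clean_summary` helper is hypothetical and not part of transformers):

```python
import re

def clean_summary(text):
    """Replace Pegasus '<n>' markers and tidy spacing around punctuation."""
    text = text.replace("<n>", " ")
    # Remove whitespace that precedes sentence punctuation
    text = re.sub(r"\s+([.,!?])", r"\1", text)
    # Collapse any remaining runs of whitespace
    return re.sub(r"\s{2,}", " ", text).strip()

raw = ("Amanda: Ask Larry Amanda: He called her last time we were at the park together ."
       "<n>Hannah: I'd rather you texted him .<n>Amanda: Just text him .")
print(clean_summary(raw))
```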


**Example**

In [None]:
!pip install textstat

from transformers import pipeline
from textstat import flesch_reading_ease

# Load summarization model (choose 'facebook/bart-large-cnn' or another)
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

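
The example in this section reports a word count and a readability score for the summary, and both are simple to approximate by hand. The sketch below counts words by whitespace splitting and applies the standard Flesch Reading Ease formula, 206.835 - 1.015 * (words per sentence) - 84.6 * (syllables per word). The `count_syllables` heuristic and the `flesch_sketch` helper are hypothetical stand-ins: textstat uses its own, more careful syllable and sentence counting, so its score will generally differ from this estimate.

```python
import re

def count_syllables(word):
    """Rough heuristic: count groups of consecutive vowels, minimum one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_sketch(text):
    """Approximate Flesch Reading Ease with naive sentence and syllable counts."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word

summary = "The dead girl had been a junior in high school."
print("Word count:", len(summary.split()))
print("Approximate reading ease:", round(flesch_sketch(summary), 1))
```

Higher scores indicate easier text: short sentences and short words both push the value up.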



In [42]:
# User input section
print(" Paste the text you want to summarize below:")
input_text = input("Paste Text Here:\n")


if input_text.strip():

    summary = summarizer(input_text, max_length=130, min_length=30, do_sample=False)[0]['summary_text']
    readability = flesch_reading_ease(summary)
    word_count = len(summary.split())

    print("\n✅ Summary:\n", summary)
    print("\n🔹 Summary Word Count:", word_count)
    print("🔹 Readability Score:", readability)
else:
    print("⚠️ No input detected. Please paste a valid text.")

 Paste the text you want to summarize below:
Paste Text Here:
Outside the funeral home, I heard a boy say that she had fallen off the back of her boyfriend’s motorcycle. Broken her neck. She never knew what hit her, he said. I was 13. The dead girl had been a junior in high school.  The line to see her snaked around the building. Boys with long hair, wearing ties they’d borrowed from their fathers, and girls with thick blue eyeshadow smoked cigarettes in the parking lot. Someone passed a bottle of Jack. There were no adults there, just very old kids.  She almost looked like she was sleeping, except that she was too still. There was a puffiness to her face that didn’t seem quite right. They had dressed her for the prom; the crinoline sleeves of her gown like poofs of pink cotton candy. Some kids prayed, but I couldn’t. I just stared at the roses in her corsage.

✅ Summary:
 The dead girl had been a junior in high school. Outside the funeral home, I heard a boy say that she had fallen of