<a href="https://colab.research.google.com/github/MALIK-ZAKRIA-MEHMOOD/Text_Summarization_using_Hugging_Face/blob/main/Text_Summarization_using_Hugging_Face.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
!nvidia-smi # tell me about the GPU or Environment I am Using

Mon Oct 21 19:22:11 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   72C    P8              12W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [3]:
# installing packages with transformers
!pip install transformers[sentencepiece] datasets sacrebleu rouge_score py7zr -q

In [4]:
# to install the latest version of the transformers
# accelerate is for the google colab environment to access the GPU
!pip install --upgrade accelerate
!pip uninstall -y transformers accelerate # uninstall older version of transformers
!pip install transformers accelerate

Found existing installation: transformers 4.45.2
Uninstalling transformers-4.45.2:
  Successfully uninstalled transformers-4.45.2
Found existing installation: accelerate 1.0.1
Uninstalling accelerate-1.0.1:
  Successfully uninstalled accelerate-1.0.1
Collecting transformers
  Using cached transformers-4.45.2-py3-none-any.whl.metadata (44 kB)
Collecting accelerate
  Using cached accelerate-1.0.1-py3-none-any.whl.metadata (19 kB)
Using cached transformers-4.45.2-py3-none-any.whl (9.9 MB)
Using cached accelerate-1.0.1-py3-none-any.whl (330 kB)
Installing collected packages: accelerate, transformers
Successfully installed accelerate-1.0.1 transformers-4.45.2


In [5]:
from transformers import pipeline, set_seed
from datasets import load_dataset, load_from_disk
import matplotlib.pyplot as plt
import pandas as pd
import nltk
from nltk.tokenize import sent_tokenize
from tqdm import tqdm
import torch
nltk.download('punkt')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [6]:
# Now check if the device is cude, if not run it on CPU
# As it is connected with T4 GPU, So this will use Cuda
device = 'cuda' if torch.cuda.is_available() else 'cpu' # using Torch Libraries
device


'cuda'

In [7]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer # import Tokenizer that will help converting our text.

In [8]:
model_ckpt = 'google/pegasus-cnn_dailymail' # model from the Hugging face
tokenizer = AutoTokenizer.from_pretrained(model_ckpt) # this will download the tokenizer from the hugging face itself

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [9]:
model_pegasus = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt).to(device)

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-cnn_dailymail and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [10]:
dataset_samsum  = load_dataset('samsum')

In [11]:
dataset_samsum

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 14732
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 819
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 818
    })
})

In [12]:
dataset_samsum['train']['dialogue'][1]

'Olivia: Who are you voting for in this election? \r\nOliver: Liberals as always.\r\nOlivia: Me too!!\r\nOliver: Great'

In [13]:
dataset_samsum['train']['dialogue'][130]

"Mandy: Did you know that Amy smuggled cocaine in Latin America?\nSarah: OMG!! 🙀\nSvetlana: She's crazy. \nSvetlana: Why would she do that?\nMandy: She told me on Friday \nMandy: She said she didn't know.\nMandy: A guy she was with put it in her luggage \nSarah: What a bastard!!!\nSarah: I hope she didn't get in trouble.\nMandy: Luckily nobody realised. \nSarah: I would kill the guy \nSvetlana: That's horrible\nSvetlana: How can you do it to anyone? "

In [14]:
dataset_samsum['train'][1]['summary']

'Olivia and Olivier are voting for liberals in this election. '

In [15]:
dataset_samsum['train'][120]['summary']

"At a party, he made a hole in Luke's wall and vomited inside. Someone cooked Luke's expensive sea fish. "

In [16]:
split_lengths = [len(dataset_samsum[split]) for split in dataset_samsum]
print(f"Split Length : {split_lengths}")
print(f"Features : {dataset_samsum['train'].column_names}")
print("\nDialogue")
print(dataset_samsum['test'][1]['dialogue'])
print("\nSummary")
print(dataset_samsum['test'][1]['summary'])

Split Length : [14732, 819, 818]
Features : ['id', 'dialogue', 'summary']

Dialogue
Eric: MACHINE!
Rob: That's so gr8!
Eric: I know! And shows how Americans see Russian ;)
Rob: And it's really funny!
Eric: I know! I especially like the train part!
Rob: Hahaha! No one talks to the machine like that!
Eric: Is this his only stand-up?
Rob: Idk. I'll check.
Eric: Sure.
Rob: Turns out no! There are some of his stand-ups on youtube.
Eric: Gr8! I'll watch them now!
Rob: Me too!
Eric: MACHINE!
Rob: MACHINE!
Eric: TTYL?
Rob: Sure :)

Summary
Eric and Rob are going to watch a stand-up on youtube.


In [17]:
# pre-processing of the data and convert it into vector representation
def convert_examples_to_features(example_batch):
  input_encodings = tokenizer(example_batch['dialogue'], max_length=1024, truncation=True)
  with tokenizer.as_target_tokenizer():
    target_encodings = tokenizer(example_batch['summary'], max_length=128, truncation=True)
  return {
      'input_ids' : input_encodings['input_ids'],
      'attention_mask' : input_encodings['attention_mask'],
      'labels' : target_encodings['input_ids'],
  }

In [18]:
dataset_samsum_pt = dataset_samsum.map(convert_examples_to_features, batched=True) # map function on entire dataset

Map:   0%|          | 0/819 [00:00<?, ? examples/s]



In [19]:
dataset_samsum_pt["train"]

Dataset({
    features: ['id', 'dialogue', 'summary', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 14732
})

In [20]:
dataset_samsum_pt["train"]["input_ids"][1] # This is the vector representation of the first dialogue

[18038,
 151,
 2632,
 127,
 119,
 6228,
 118,
 115,
 136,
 2974,
 152,
 10463,
 151,
 35884,
 130,
 329,
 107,
 18038,
 151,
 2587,
 314,
 1242,
 10463,
 151,
 1509,
 1]

In [21]:
# now to print the attention mask we use
dataset_samsum_pt["train"]["attention_mask"][1]

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

In [22]:
# to print the summary of the labels we use
dataset_samsum_pt["train"]["labels"][1]

[18038, 111, 34296, 127, 6228, 118, 33195, 115, 136, 2974, 107, 1]

In [23]:
# training of the data
from transformers import DataCollatorForSeq2Seq

seq2seq_data_collator = DataCollatorForSeq2Seq(tokenizer, model=model_pegasus)

In [24]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir = 'pegasus-samsum', num_train_epochs=1, warmup_steps=500,
    per_device_train_batch_size=1, per_device_eval_batch_size=1,
    weight_decay=0.01, logging_steps=10,
    evaluation_strategy='steps', eval_steps=500, save_steps=1e6,
    gradient_accumulation_steps=16
)



In [27]:
trainer =  Trainer(model=model_pegasus, args=training_args,
                   tokenizer=tokenizer, data_collator=seq2seq_data_collator,
                   train_dataset = dataset_samsum_pt["test"],
                   eval_dataset= dataset_samsum_pt["validation"]
                   )

In [28]:
trainer.train() # api key is 2e8693a462893a5a9dd5db33737fc8e1dbe4e837

[34m[1mwandb[0m: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.


<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

 ··········


[34m[1mwandb[0m: [32m[41mERROR[0m API key must be 40 characters long, yours was 37


<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

 ··········


[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


Step,Training Loss,Validation Loss




TrainOutput(global_step=51, training_loss=3.0044142264945832, metrics={'train_runtime': 865.1998, 'train_samples_per_second': 0.947, 'train_steps_per_second': 0.059, 'total_flos': 313450454089728.0, 'train_loss': 3.0044142264945832, 'epoch': 0.9963369963369964})

In [29]:
# Evaluation
def generate_batch_sized_chunks(list_of_elements, batch_size):
  """split the dataset into smaller batches that we can process simultaneously
  Yield successive batch-sized chunks from list_of_elements."""
  for i in range(0, len(list_of_elements), batch_size):
    yield list_of_elements[i : i + batch_size]

def calculate_metric_on_test_ds(dataset, metric, model, tokenizer,
                               batch_size=16, device=device,
                               column_text="article",
                               column_summary="highlights"):
  article_batches = list(generate_batch_sized_chunks(dataset[column_text], batch_size))
  target_batches = list(generate_batch_sized_chunks(dataset[column_summary], batch_size))

  for article_batch, target_batch in tqdm(
      zip(article_batches, target_batches), total=len(article_batches)):
      inputs = tokenizer(article_batch, max_length=1024,  truncation=True,
                        padding="max_length", return_tensors="pt")
      summaries = model.generate(input_ids=inputs["input_ids"].to(device),
                              attention_mask=inputs["attention_mask"].to(device),
                              length_penalty=0.8, num_beams=8, max_length=128)
      '''parameter for length penalty ensures that the model does not generate sequences that are too long.'''

      # Finally we decode the generated texts,
      # replace the tokens and add the decoded text with the reference to the metric
      decoded_summaries = [tokenizer.decode(s, skip_special_tokens=True,
                                clean_up_tokenization_spaces=True)
               for s in summaries]
      decoded_summaries = [d.replace("", " ") for d in decoded_summaries]

      metric.add_batch(predictions=decoded_summaries, references=target_batch)

  #  Finally compute and return the ROUGE scores.
  score = metric.compute()
  return score

In [35]:
!pip install evaluate


Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.0/84.0 kB[0m [31m2.9 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: evaluate
Successfully installed evaluate-0.4.3


In [37]:
from evaluate import load
rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]
rouge_metric = load('rouge')

Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

In [46]:
score = calculate_metric_on_test_ds(
    dataset_samsum['test'][0:10], rouge_metric, trainer.model, tokenizer, batch_size=2,
    column_text='dialogue', column_summary='summary'
)

# List of ROUGE names you want to extract
rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]

# Create a dictionary from the score
rouge_dict = {rn: score[rn] for rn in rouge_names}

# Create a DataFrame to display the results
rouge_df = pd.DataFrame(rouge_dict, index=[f'pegasus'])
rouge_df

100%|██████████| 5/5 [00:18<00:00,  3.69s/it]


Unnamed: 0,rouge1,rouge2,rougeL,rougeLsum
pegasus,0.022906,0.0,0.023074,0.022913


In [47]:
# Save Model
model_pegasus.save_pretrained("pegasus-samsum-model")

In [48]:
# Save Tokenizer
tokenizer.save_pretrained("tokenizer")

('tokenizer/tokenizer_config.json',
 'tokenizer/special_tokens_map.json',
 'tokenizer/spiece.model',
 'tokenizer/added_tokens.json',
 'tokenizer/tokenizer.json')

In [49]:
# Load
tokenizer = AutoTokenizer.from_pretrained("/content/tokenizer")

In [54]:
sample_text = dataset_samsum["test"][0]["dialogue"]
sample_text

"Hannah: Hey, do you have Betty's number?\nAmanda: Lemme check\nHannah: <file_gif>\nAmanda: Sorry, can't find it.\nAmanda: Ask Larry\nAmanda: He called her last time we were at the park together\nHannah: I don't know him well\nHannah: <file_gif>\nAmanda: Don't be shy, he's very nice\nHannah: If you say so..\nHannah: I'd rather you texted him\nAmanda: Just text him 🙂\nHannah: Urgh.. Alright\nHannah: Bye\nAmanda: Bye bye"

In [55]:
reference = dataset_samsum["test"][0]["summary"]
reference

"Hannah needs Betty's number but Amanda doesn't have it. She needs to contact Larry."

In [52]:
# prediction
gen_kwargs = {"length_penalty": 0.8, "num_beams":8, "max_length": 128}
# if length penalty closes to zero, it will generate short output
# if length penalty closes to one, it will generate long output
sample_text = dataset_samsum["test"][0]["dialogue"]

reference = dataset_samsum["test"][0]["summary"]

pipe = pipeline("summarization", model="pegasus-samsum-model",tokenizer=tokenizer)

##
print("Dialogue:")
print(sample_text)


print("\nReference Summary:")
print(reference)


print("\nModel Summary:")
print(pipe(sample_text, **gen_kwargs)[0]["summary_text"])

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
Your max_length is set to 128, but your input_length is only 122. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=61)


Dialogue:
Hannah: Hey, do you have Betty's number?
Amanda: Lemme check
Hannah: <file_gif>
Amanda: Sorry, can't find it.
Amanda: Ask Larry
Amanda: He called her last time we were at the park together
Hannah: I don't know him well
Hannah: <file_gif>
Amanda: Don't be shy, he's very nice
Hannah: If you say so..
Hannah: I'd rather you texted him
Amanda: Just text him 🙂
Hannah: Urgh.. Alright
Hannah: Bye
Amanda: Bye bye

Reference Summary:
Hannah needs Betty's number but Amanda doesn't have it. She needs to contact Larry.

Model Summary:
Amanda: Ask Larry Amanda: He called her last time we were at the park together .<n>Hannah: I'd rather you texted him .<n>Amanda: Just text him .
