In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
!nvidia-smi

Thu Dec 19 08:12:15 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   72C    P8              12W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [None]:
!pip install transformers[sentencepiece] datasets sacrebleu rouge_score py7zr -q

In [None]:
!pip install --upgrade accelerate
!pip uninstall -y transformers accelerate
!pip install transformers accelerate

Found existing installation: transformers 4.47.1
Uninstalling transformers-4.47.1:
  Successfully uninstalled transformers-4.47.1
Found existing installation: accelerate 1.2.1
Uninstalling accelerate-1.2.1:
  Successfully uninstalled accelerate-1.2.1
Collecting transformers
  Using cached transformers-4.47.1-py3-none-any.whl.metadata (44 kB)
Collecting accelerate
  Using cached accelerate-1.2.1-py3-none-any.whl.metadata (19 kB)
Using cached transformers-4.47.1-py3-none-any.whl (10.1 MB)
Using cached accelerate-1.2.1-py3-none-any.whl (336 kB)
Installing collected packages: accelerate, transformers
Successfully installed accelerate-1.2.1 transformers-4.47.1


## Purpose of accelerate:
- Simplified multi-device training: runs the same training code across multiple GPUs or TPUs with minimal code changes.
- Mixed-precision training: speeds up training and reduces memory usage by running parts of the computation in fp16 or bf16.
- Sharded training: integrates with DeepSpeed's Zero Redundancy Optimizer (ZeRO) and PyTorch FSDP to split parameters, gradients, and optimizer state across devices.
- Offloading to CPU or disk: supports large models by moving parts of the model or optimizer state off the GPU when GPU memory is insufficient.

In this notebook, accelerate is installed mainly because the `Trainer` class in recent transformers releases depends on it.

In [None]:
!pip install evaluate



In [None]:
from transformers import (
    AutoModelForSeq2SeqLM, AutoTokenizer, DataCollatorForSeq2Seq,
    Trainer, TrainingArguments, pipeline, set_seed,
)
from datasets import load_dataset, load_from_disk
import matplotlib.pyplot as plt
import pandas as pd

import nltk
from nltk.tokenize import sent_tokenize

from tqdm import tqdm
import torch
import evaluate
nltk.download("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

# Hugging Face model example:

In [None]:
tokenizer = AutoTokenizer.from_pretrained("google/pegasus-cnn_dailymail")
model = AutoModelForSeq2SeqLM.from_pretrained("google/pegasus-cnn_dailymail")

ARTICLE = """
Born in 1987, Kilian has been training for Everest his whole life. And that really does mean his whole life, as he grew up 2,000 metres above sea level in the Pyrenees in the ski resort of Lles de Cerdanya in Catalonia, north-eastern Spain. While other children his age were learning to walk, Kilian was on skis. At one and a half years old he did a five-hour hike with his mother, entirely under his own steam. He left his peers even further behind when he climbed his first mountain and competed in his first cross-country ski race at age three. By age seven, he had scaled a 4,000er and, at ten, he did a 42-day crossing of the Pyrenees.
He was 13 when he says he started to take it 'seriously' and trained with the Ski Mountaineering Technical Centre (CTEMC) in Catalonia, entering competitions and working with a coach. At 18, he took over his own ski-mountaineering and trail-running training, with a schedule that only allows a couple of weeks of rest a year. He does as many as 1,140 hours of endurance training a year, plus strength training and technical workouts as well as specific training in the week before a race. For his record-breaking ascent and descent of the Matterhorn, he prepared by climbing the mountain ten times until he knew every detail of it, even including where the sun would be shining at every part of the day.
Sleeping only seven hours a night, Kilian Jornet seems almost superhuman. His resting heartbeat is extremely low at 33 beats per minute, compared with the average man's 60 per minute or an athlete's 40 per minute. He breathes more efficiently than average people too, taking in more oxygen per breath, and he has a much faster recovery time after exercise as his body quickly breaks down lactic acid - the acid in muscles that causes pain after exercise.
All this is thanks to his childhood in the mountains and to genetics, but it is his mental strength that sets him apart. He often sets himself challenges to see how long he can endure difficult conditions in order to truly understand what his body and mind can cope with. For example, he almost gave himself kidney failure after only drinking 3.5 litres of water on a 100km run in temperatures of around 40°C.
It would take a book to list all the races and awards he's won and the mountains he's climbed. And even here, Kilian's achievements exceed the average person as, somehow, he finds time to record his career on his blog and has written three books, Run or Die, The Invisible Border and Summits of My Life.
"""

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-cnn_dailymail and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# Pegasus accepts at most 1024 input tokens, so truncate to that limit
inputs = tokenizer(ARTICLE, max_length=1024, truncation=True, return_tensors="pt")

In [None]:
inputs

{'input_ids': tensor([[10319,   115, 28707, 14895,  3262,   148,   174,   569,   118, 25715,
           169,   664,   271,   107,   325,   120,   288,   358,  1021,   169,
           664,   271,   108,   130,   178,  3139,   164,  9717,  7641,   607,
          1917,   476,   115,   109, 65783,   115,   109,  5288,  3831,   113,
          1054,  6366,   718, 25756,   252, 28717,   115, 38159,   108,  2523,
           121, 41600,  4027,   107,  1041,   176,   404,   169,   779,   195,
           761,   112,  1102,   108, 14895,  3262,   140,   124, 24760,   107,
           654,   156,   111,   114,   751,   231,   459,   178,   368,   114,
           668,   121,  4196,  6280,   122,   169,  1499,   108,  3143,   365,
           169,   282,  5147,   107,   285,   518,   169,  6949,   254,   701,
           893,   173,   178, 14026,   169,   211,  2924,   111, 15463,   115,
           169,   211,  1891,   121, 13340,  5288,  1580,   134,   779,   339,
           107,  1060,   779,  1761,  

In [None]:
# Pass the attention mask together with the input IDs
summary_ids = model.generate(**inputs)

In [None]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
device

'cuda'

#### Fine Tuning

In [None]:
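# Use a separate name for the checkpoint string so it does not shadow the model object
model_ckpt = "google/pegasus-cnn_dailymail"

tokenizer = AutoTokenizer.from_pretrained(model_ckpt)  # load the tokenizer

model_pegasus = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt).to(device)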

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-cnn_dailymail and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
dataset_samsum = load_from_disk('/content/drive/MyDrive/samsum-dataset')
dataset_samsum

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 14732
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 819
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 818
    })
})

In [None]:
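split_lengths = [len(dataset_samsum[split]) for split in dataset_samsum]

print(f"Split lengths: {split_lengths}")
print(f"Features: {dataset_samsum['train'].column_names}")
print("\nDialogue:")

print(dataset_samsum["test"][7]["dialogue"])

print("\nSummary:")

# NOTE: index 21 differs from the dialogue index (7), so the summary printed
# below belongs to a different conversation; use the same index for a matching pair.
print(dataset_samsum["test"][21]["summary"])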

Split lengths: [14732, 819, 818]
Features: ['id', 'dialogue', 'summary']

Dialogue:
Rita: I'm so bloody tired. Falling asleep at work. :-(
Tina: I know what you mean.
Tina: I keep on nodding off at my keyboard hoping that the boss doesn't notice..
Rita: The time just keeps on dragging on and on and on.... 
Rita: I keep on looking at the clock and there's still 4 hours of this drudgery to go.
Tina: Times like these I really hate my work.
Rita: I'm really not cut out for this level of boredom.
Tina: Neither am I.

Summary:
Gloria has an exam soon. It lasts 4 hours. Emma sent her a link to a website with some texts from previous years so that she can prepare for the exam better.


### Preparing Data For Training For Sequence To Sequence Model

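Each raw example pairs a dialogue with its reference summary (illustrative values):

{
    'dialogue': "Hi! How are you?",
    'summary': "The speaker is asking how the other person is."
}

After tokenization, each example holds the model inputs and the target token IDs:

{
    'input_ids': [123, 456, 789, ...],  # Token IDs for the dialogue
    'attention_mask': [1, 1, 1, ...],  # Attention mask for the input
    'labels': [321, 654, 987, ...]  # Token IDs for the summary (target)
}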
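
The examples in a batch have different lengths, so they must be padded before they can be stacked into tensors. The sketch below illustrates the idea in plain Python: inputs are padded with a pad token ID and masked out, while labels are padded with `-100` so that the loss ignores those positions (the default label padding value of `DataCollatorForSeq2Seq`). The token IDs, `PAD_ID`, and the helper `pad_batch` are hypothetical and only serve the illustration.

```python
# Hypothetical token IDs; real ones come from the tokenizer.
batch = [
    {"input_ids": [12, 45, 78, 1], "labels": [33, 1]},
    {"input_ids": [12, 1], "labels": [33, 54, 76, 1]},
]

PAD_ID = 0            # assumed padding token ID for the inputs
IGNORE_INDEX = -100   # label value that PyTorch's cross-entropy loss skips by default


def pad_batch(examples, pad_id=PAD_ID, ignore_index=IGNORE_INDEX):
    """Pad inputs and labels to the longest sequence in the batch."""
    max_in = max(len(e["input_ids"]) for e in examples)
    max_lab = max(len(e["labels"]) for e in examples)
    padded = {"input_ids": [], "attention_mask": [], "labels": []}
    for e in examples:
        n_in, n_lab = len(e["input_ids"]), len(e["labels"])
        padded["input_ids"].append(e["input_ids"] + [pad_id] * (max_in - n_in))
        padded["attention_mask"].append([1] * n_in + [0] * (max_in - n_in))
        padded["labels"].append(e["labels"] + [ignore_index] * (max_lab - n_lab))
    return padded


padded = pad_batch(batch)
for key, rows in padded.items():
    print(key, rows)
```

Padding to the longest sequence in each batch, rather than to a fixed maximum, keeps the tensors as small as the batch allows.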

In [None]:
def features_maker(example_batch):
    # Tokenize the dialogues (model inputs), truncated to 1024 tokens
    input_encodings = tokenizer(example_batch['dialogue'], max_length=1024, truncation=True)

    # Tokenize the summaries as targets; text_target replaces the deprecated
    # tokenizer.as_target_tokenizer() context manager
    target_encodings = tokenizer(text_target=example_batch['summary'], max_length=128, truncation=True)

    return {
        'input_ids': input_encodings['input_ids'],
        'attention_mask': input_encodings['attention_mask'],
        'labels': target_encodings['input_ids']
    }

In [None]:
dataset_samsum_pt = dataset_samsum.map(features_maker, batched = True)

Map:   0%|          | 0/819 [00:00<?, ? examples/s]



In [None]:
dataset_samsum_pt['test']

Dataset({
    features: ['id', 'dialogue', 'summary', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 819
})

In [None]:
seq2seq_data_collator = DataCollatorForSeq2Seq(tokenizer, model=model_pegasus)

In [None]:
trainer_args = TrainingArguments(
    output_dir='pegasus-samsum', num_train_epochs=1, warmup_steps=500,
    per_device_train_batch_size=1, per_device_eval_batch_size=1,
    weight_decay=0.01, logging_steps=10,
    # eval_strategy is the current name of the deprecated evaluation_strategy argument
    eval_strategy='steps', eval_steps=500, save_steps=1_000_000,
    gradient_accumulation_steps=16  # effective batch size: 1 x 16 = 16
)



In [None]:
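# NOTE: this demo fine-tunes on the small "test" split (819 examples) to keep the run short.
# For a real run, train on dataset_samsum_pt["train"] and keep "test" for the final evaluation only.
trainer = Trainer(model=model_pegasus, args=trainer_args,
                  processing_class=tokenizer, data_collator=seq2seq_data_collator,
                  train_dataset=dataset_samsum_pt["test"],
                  eval_dataset=dataset_samsum_pt["validation"])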

In [None]:
trainer.train()

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

 ··········


wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


Step,Training Loss,Validation Loss




TrainOutput(global_step=51, training_loss=48.07101163677141, metrics={'train_runtime': 368.0288, 'train_samples_per_second': 2.225, 'train_steps_per_second': 0.139, 'total_flos': 313450454089728.0, 'train_loss': 48.07101163677141, 'epoch': 0.9963369963369964})
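
The `global_step=51` and `epoch` of about 0.9963 in the output above follow from the batch settings. The sketch below redoes that arithmetic under a few assumptions: a single GPU, 819 training examples, the default `TrainingArguments` learning rate of 5e-5, and linear warmup. It also suggests that with `warmup_steps=500` the run ends while the learning rate is still warming up.

```python
num_examples = 819        # rows in the split passed as train_dataset
per_device_batch = 1      # per_device_train_batch_size
grad_accum = 16           # gradient_accumulation_steps
warmup_steps = 500        # warmup_steps
base_lr = 5e-5            # assumed: the TrainingArguments default learning rate

# One optimizer update happens after grad_accum forward/backward passes.
effective_batch = per_device_batch * grad_accum
batches_per_epoch = -(-num_examples // per_device_batch)  # ceiling division
optimizer_steps = batches_per_epoch // grad_accum

# Fraction of the data seen by the time the last full update completes.
epoch_fraction = optimizer_steps * effective_batch / num_examples

# With linear warmup, the learning rate grows in proportion to the step count.
final_lr = base_lr * min(1.0, optimizer_steps / warmup_steps)

print("effective batch size:", effective_batch)
print("optimizer steps:", optimizer_steps)
print("epoch fraction:", round(epoch_fraction, 4))
print("learning rate at the last step:", final_lr)
```

Under these assumptions the learning rate only reaches about a tenth of its configured peak, so a shorter warmup or a longer run may be worth trying.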

## ROUGE Score

The **ROUGE** (Recall-Oriented Understudy for Gisting Evaluation) score is a set of metrics used to evaluate the quality of summaries and translations generated by natural language processing (NLP) models. It measures the overlap between the words, phrases, and sequences in the generated text (candidate) and those in the human-written reference text.

### Formula

The ROUGE score can be calculated using different methods, such as ROUGE-N, ROUGE-L, and ROUGE-S. Here are the formulas for some of the common variants:

- **ROUGE-N**: Measures the overlap of n-grams between the candidate and reference texts. The recall form divides the number of matching n-grams by the number of n-grams in the reference:
  $$\text{ROUGE-N} = \frac{\sum_{s \in \text{Reference}} \sum_{\text{gram}_n \in s} \text{Count}_{\text{match}}(\text{gram}_n)}{\sum_{s \in \text{Reference}} \sum_{\text{gram}_n \in s} \text{Count}(\text{gram}_n)}$$

- **ROUGE-L**: Based on the longest common subsequence (LCS) between the candidate and reference texts. The recall form is:
  $$\text{ROUGE-L} = \frac{\text{LCS}(\text{candidate}, \text{reference})}{\text{length of reference}}$$
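
The two recall formulas above can be sketched in plain Python. This is a minimal illustration that assumes lowercased whitespace tokenization and a single reference; the `rouge` metric from `evaluate` applies its own tokenization and reports F-measures, so its numbers will differ. The function names are chosen for this example only.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """Clipped n-gram matches divided by the number of reference n-grams."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    if not ref:
        return 0.0
    matches = sum(min(count, cand[gram]) for gram, count in ref.items())
    return matches / sum(ref.values())


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_recall(candidate, reference):
    """LCS length divided by the number of reference tokens."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    return lcs_length(cand, ref) / len(ref) if ref else 0.0


reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print("ROUGE-1 recall:", rouge_n_recall(candidate, reference, n=1))  # 5 of 6 unigrams
print("ROUGE-2 recall:", rouge_n_recall(candidate, reference, n=2))  # 3 of 5 bigrams
print("ROUGE-L recall:", rouge_l_recall(candidate, reference))       # LCS of 5 over 6 tokens
```

Changing a single word costs one unigram but two bigrams, which is why ROUGE-2 drops faster than ROUGE-1.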

### Cited Paper
Lin, C.-Y. (2004, July). ROUGE: A package for automatic evaluation of summaries. In *Text Summarization Branches Out* (pp. 74-81). [ROUGE](https://aclanthology.org/W04-1013.pdf)



In [None]:
def batch_chunking(list_of_elements, batch_size):
    for i in range(0, len(list_of_elements), batch_size):
        yield list_of_elements[i : i + batch_size]



def evaluate_test(dataset, metric, model, tokenizer,
                               batch_size=16, device=device,
                               column_text="article",
                               column_summary="highlights"):
    article_batches = list(batch_chunking(dataset[column_text], batch_size))
    target_batches = list(batch_chunking(dataset[column_summary], batch_size))

    for article_batch, target_batch in tqdm(
        zip(article_batches, target_batches), total=len(article_batches)):

        inputs = tokenizer(article_batch, max_length=1024,  truncation=True,
                        padding="max_length", return_tensors="pt")

        # length_penalty below the default of 1.0 nudges beam search toward shorter summaries
        summaries = model.generate(input_ids=inputs["input_ids"].to(device),
                         attention_mask=inputs["attention_mask"].to(device),
                         length_penalty=0.8, num_beams=8, max_length=128)

        decoded_summaries = [tokenizer.decode(s, skip_special_tokens=True, clean_up_tokenization_spaces=True) for s in summaries]

        # Pegasus marks line breaks with the literal token "<n>"; turn them into spaces.
        # (Replacing the empty string "" would insert a space between every character.)
        decoded_summaries = [d.replace("<n>", " ") for d in decoded_summaries]


        metric.add_batch(predictions=decoded_summaries, references=target_batch)
    score = metric.compute()
    return score
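
To make the batching step concrete, here is a small standalone run of `batch_chunking` on placeholder strings (the texts are made up for the illustration). It shows that dialogues and summaries are sliced in the same order, so each batch of inputs stays aligned with its references, and that the last batch may be smaller than `batch_size`.

```python
def batch_chunking(list_of_elements, batch_size):
    """Yield successive slices of at most batch_size items."""
    for i in range(0, len(list_of_elements), batch_size):
        yield list_of_elements[i : i + batch_size]


dialogues = [f"dialogue {i}" for i in range(5)]   # placeholder texts
summaries = [f"summary {i}" for i in range(5)]

# Chunk both lists with the same batch size and pair the chunks up.
pairs = list(zip(batch_chunking(dialogues, 2), batch_chunking(summaries, 2)))
for dialogue_batch, summary_batch in pairs:
    print(dialogue_batch, summary_batch)
```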


In [None]:
rouge_metric = evaluate.load('rouge')
rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]

Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

In [None]:
rouge_metric

EvaluationModule(name: "rouge", module_type: "metric", features: [{'predictions': Value(dtype='string', id='sequence'), 'references': Sequence(feature=Value(dtype='string', id='sequence'), length=-1, id=None)}, {'predictions': Value(dtype='string', id='sequence'), 'references': Value(dtype='string', id='sequence')}], usage: """
Calculates average rouge scores for a list of hypotheses and references
Args:
    predictions: list of predictions to score. Each prediction
        should be a string with tokens separated by spaces.
    references: list of reference for each prediction. Each
        reference should be a string with tokens separated by spaces.
    rouge_types: A list of rouge types to calculate.
        Valid names:
        `"rouge{n}"` (e.g. `"rouge1"`, `"rouge2"`) where: {n} is the n-gram based scoring,
        `"rougeL"`: Longest common subsequence based scoring.
        `"rougeLsum"`: rougeLsum splits text using `"
"`.
        See details in https://github.com/huggingface/

In [None]:
score = evaluate_test(
    dataset_samsum['test'][0:10], rouge_metric, trainer.model, tokenizer, batch_size = 2, column_text = 'dialogue', column_summary= 'summary'
)

rouge_dict = {rn: score[rn] for rn in rouge_names}
pd.DataFrame(rouge_dict, index=['pegasus'])

100%|██████████| 5/5 [00:27<00:00,  5.47s/it]


           rouge1  rouge2   rougeL  rougeLsum
pegasus  0.018794     0.0  0.01859   0.018481


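### Interpreting Good vs. Bad ROUGE Scores:
ROUGE values range from 0 to 1, and the `rouge` metric from `evaluate` reports the F1-score (F-measure) by default. The ranges below are rough rules of thumb, not standard cutoffs; scores are only comparable on the same dataset with the same references.
1. Scores close to 1: Near-verbatim overlap between the generated and reference summaries. This is rare for abstractive summarization, because a good summary can be worded differently from the reference.
2. ROUGE-1 scores of roughly 0.4 to 0.5 (lower for ROUGE-2): The range that strong summarization models typically report on benchmarks such as CNN/DailyMail and SAMSum. The summary usually captures the key points, even if its wording differs from the reference.
3. Scores near 0, such as the ROUGE-1 of about 0.019 above: Almost no overlap with the reference. Values this low usually point to a problem in the generation or scoring pipeline rather than to summary quality alone.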
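
The F1-score mentioned above is the harmonic mean of precision (matches relative to the candidate length) and recall (matches relative to the reference length). A minimal sketch with made-up counts, assuming unigram matching; the helper name `f_measure` is chosen for this example:

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 5 matching unigrams, a 10-token candidate, a 6-token reference.
matches, candidate_len, reference_len = 5, 10, 6

precision = matches / candidate_len   # share of the candidate found in the reference
recall = matches / reference_len      # share of the reference covered by the candidate
f1 = f_measure(precision, recall)

print("precision:", precision)
print("recall:", recall)
print("F1:", f1)
```

A long candidate can reach high recall simply by containing many words, so the F1-score penalizes it through the lower precision.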

In [None]:
## Save model
model_pegasus.save_pretrained("pegasus-samsum-model")

In [None]:
## Save tokenizer
tokenizer.save_pretrained("tokenizer")

('tokenizer/tokenizer_config.json',
 'tokenizer/special_tokens_map.json',
 'tokenizer/spiece.model',
 'tokenizer/added_tokens.json',
 'tokenizer/tokenizer.json')

In [None]:
tokenizer = AutoTokenizer.from_pretrained("/content/tokenizer")

In [None]:
gen_kwargs = {"length_penalty": 0.8, "num_beams":8, "max_length": 128}



sample_text = dataset_samsum["test"][0]["dialogue"]

reference = dataset_samsum["test"][0]["summary"]

pipe = pipeline("summarization", model="pegasus-samsum-model",tokenizer=tokenizer)

##
print("Dialogue:")
print(sample_text)


print("\nReference Summary:")
print(reference)


print("\nModel Summary:")
print(pipe(sample_text, **gen_kwargs)[0]["summary_text"])

Device set to use cuda:0
Your max_length is set to 128, but your input_length is only 122. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=61)


Dialogue:
Hannah: Hey, do you have Betty's number?
Amanda: Lemme check
Hannah: <file_gif>
Amanda: Sorry, can't find it.
Amanda: Ask Larry
Amanda: He called her last time we were at the park together
Hannah: I don't know him well
Hannah: <file_gif>
Amanda: Don't be shy, he's very nice
Hannah: If you say so..
Hannah: I'd rather you texted him
Amanda: Just text him 🙂
Hannah: Urgh.. Alright
Hannah: Bye
Amanda: Bye bye

Reference Summary:
Hannah needs Betty's number but Amanda doesn't have it. She needs to contact Larry.

Model Summary:
Amanda: Ask Larry Amanda: He called her last time we were at the park together .<n>Hannah: I'd rather you texted him .<n>Amanda: Just text him .
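
The model summary above still contains Pegasus's literal `<n>` line-break markers and a space before each full stop. One possible post-processing step is sketched below; the helper name `clean_pegasus_summary` is made up for this example, and the string is copied from the output above.

```python
raw_summary = (
    "Amanda: Ask Larry Amanda: He called her last time we were at the park "
    "together .<n>Hannah: I'd rather you texted him .<n>Amanda: Just text him ."
)


def clean_pegasus_summary(text):
    """Split on the "<n>" marker, tidy each sentence, and rejoin with spaces."""
    sentences = [s.strip() for s in text.split("<n>") if s.strip()]
    return " ".join(s.replace(" .", ".") for s in sentences)


cleaned = clean_pegasus_summary(raw_summary)
print(cleaned)
```

Joining with `"\n"` instead of a space would keep one sentence per line, which is the layout the `rougeLsum` variant expects.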


## Trial 2 testing:

The ROUGE scores from trial 1 were very low, so this trial switches to a different dataset and, if possible, a different model:

In [None]:
from datasets import load_dataset

dataset = load_dataset("cnn_dailymail", "3.0.0")

README.md:   0%|          | 0.00/15.6k [00:00<?, ?B/s]

train-00000-of-00003.parquet:   0%|          | 0.00/257M [00:00<?, ?B/s]

train-00001-of-00003.parquet:   0%|          | 0.00/257M [00:00<?, ?B/s]

train-00002-of-00003.parquet:   0%|          | 0.00/259M [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/34.7M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/30.0M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/287113 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/13368 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/11490 [00:00<?, ? examples/s]

In [None]:
train_data = dataset["train"]
test_data = dataset["test"]
print("Training example:", len(train_data))
print("Testing example:", len(test_data))

print("Example article:", train_data[0]["article"])
print("Example summary:", train_data[0]["highlights"])

Training example: 287113
Testing example: 11490
Example article: LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won't cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. "I don't plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar," he told an Australian interviewer earlier this month. "I don't think I'll be particularly extravagant. "The things I like buying are things that cost about 10 pounds -- books and CDs and DVDs." At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film "Hostel: Part II," currently six places belo

In [None]:
!zip -r trial1_folders.zip /content/wandb /content/tokenizer /content/pegasus-samsum-model

  adding: content/wandb/ (stored 0%)
  adding: content/wandb/latest-run/ (stored 0%)
  adding: content/wandb/latest-run/tmp/ (stored 0%)
  adding: content/wandb/latest-run/tmp/code/ (stored 0%)
  adding: content/wandb/latest-run/run-pu6mpzee.wandb (deflated 72%)
  adding: content/wandb/latest-run/files/ (stored 0%)
  adding: content/wandb/latest-run/files/requirements.txt (deflated 55%)
  adding: content/wandb/latest-run/files/wandb-metadata.json (deflated 45%)
  adding: content/wandb/latest-run/files/output.log (deflated 51%)
  adding: content/wandb/latest-run/logs/ (stored 0%)
  adding: content/wandb/latest-run/logs/debug-core.log (deflated 58%)
  adding: content/wandb/latest-run/logs/debug.log (deflated 74%)
  adding: content/wandb/latest-run/logs/debug-internal.log (deflated 81%)
  adding: content/wandb/run-20241219_081829-pu6mpzee/ (stored 0%)
  adding: content/wandb/run-20241219_081829-pu6mpzee/tmp/ (stored 0%)
  adding: content/wandb/run-20241219_081829-pu6mpzee/tmp/code/ (store

In [None]:
from google.colab import files
files.download("trial1_folders.zip")
