In [1]:
!pip install --upgrade accelerate
!pip uninstall -y transformers accelerate
!pip install transformers accelerate


Found existing installation: transformers 4.49.0
Uninstalling transformers-4.49.0:
  Successfully uninstalled transformers-4.49.0
Found existing installation: accelerate 1.4.0
Uninstalling accelerate-1.4.0:
  Successfully uninstalled accelerate-1.4.0
Collecting transformers
  Using cached transformers-4.49.0-py3-none-any.whl.metadata (44 kB)
Collecting accelerate
  Using cached accelerate-1.4.0-py3-none-any.whl.metadata (19 kB)
Using cached transformers-4.49.0-py3-none-any.whl (10.0 MB)
Using cached accelerate-1.4.0-py3-none-any.whl (342 kB)
Installing collected packages: transformers, accelerate
Successfully installed accelerate-1.4.0 transformers-4.49.0


In [2]:
! pip install datasets
! pip install nltk
! pip install rouge_score



In [3]:
!pip install accelerate



In [4]:
!pip install transformers[torch]
!pip install accelerate>=0.26.0 --upgrade



In [39]:
!pip install evaluate

Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.0/84.0 kB[0m [31m3.7 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: evaluate
Successfully installed evaluate-0.4.3


In [5]:
from transformers import pipeline,set_seed
from datasets import load_dataset,load_from_disk
import matplotlib.pyplot as plt
import pandas as pd
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import nltk
from nltk.tokenize import sent_tokenize

from tqdm import tqdm
import torch

nltk.download('punkt')

import warnings
warnings.filterwarnings("ignore")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!



This code checks if a GPU is available for use with PyTorch. If a GPU is available, it sets the device to `"cuda"`, otherwise, it uses the CPU (`"cpu"`). It then prints whether CUDA (GPU support) is available as a Boolean value.

### Code:

```python
# Check if CUDA (GPU support) is available and select the appropriate device
device = "cuda" if torch.cuda.is_available() else "cpu"

# Print if CUDA is available (True or False)
print(torch.cuda.is_available())


In [6]:
device = "cuda" if torch.cuda.is_available() else "cpu"
print(torch.cuda.is_available())

True



This code demonstrates how to load a pre-trained sequence-to-sequence model and tokenizer from the Hugging Face `transformers` library. It first checks if a GPU (CUDA) is available and selects the device accordingly. Then, it loads the tokenizer and model (`google/pegasus-cnn_dailymail`) onto the selected device.



In [7]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

model = "google/pegasus-cnn_dailymail"

tokenizer = AutoTokenizer.from_pretrained(model)

model_pegas = AutoModelForSeq2SeqLM.from_pretrained(model).to(device)

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-cnn_dailymail and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



This code downloads a zip file (`summarizer-data.zip`) from a GitHub repository using `wget` and then extracts the contents of the zip file using the `unzip` command.



In [8]:
!wget https://github.com/entbappy/Branching-tutorial/raw/master/summarizer-data.zip
!unzip summarizer-data.zip

--2025-03-02 17:00:10--  https://github.com/entbappy/Branching-tutorial/raw/master/summarizer-data.zip
Resolving github.com (github.com)... 140.82.113.3
Connecting to github.com (github.com)|140.82.113.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/entbappy/Branching-tutorial/master/summarizer-data.zip [following]
--2025-03-02 17:00:11--  https://raw.githubusercontent.com/entbappy/Branching-tutorial/master/summarizer-data.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7903594 (7.5M) [application/zip]
Saving to: ‘summarizer-data.zip.1’


2025-03-02 17:00:11 (73.3 MB/s) - ‘summarizer-data.zip.1’ saved [7903594/7903594]

Archive:  summarizer-data.zip
replace samsum-test.csv? [y]es, [n]o, [

### This code unzips the file and extract the train,test and validation data.

* load the data from the disk (dictionary)

In [18]:
dataset_samsum = load_from_disk('samsum_dataset')
dataset_samsum

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 14732
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 819
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 818
    })
})


This code calculates and prints the lengths of different splits in a dataset, followed by displaying the feature names in the training set. It also prints a sample dialogue from the test set and the corresponding summary.

In [19]:
split_lengths = [len(dataset_samsum[split])for split in dataset_samsum]
print(f"Split lengths: {split_lengths}")
print(f"Features: {dataset_samsum['train'].column_names}")
print("\nDialogue:")
print(dataset_samsum["test"][1]["dialogue"])


print("\nSummary:")

print(dataset_samsum["test"][1]["summary"])


Split lengths: [14732, 819, 818]
Features: ['id', 'dialogue', 'summary']

Dialogue:
Eric: MACHINE!
Rob: That's so gr8!
Eric: I know! And shows how Americans see Russian ;)
Rob: And it's really funny!
Eric: I know! I especially like the train part!
Rob: Hahaha! No one talks to the machine like that!
Eric: Is this his only stand-up?
Rob: Idk. I'll check.
Eric: Sure.
Rob: Turns out no! There are some of his stand-ups on youtube.
Eric: Gr8! I'll watch them now!
Rob: Me too!
Eric: MACHINE!
Rob: MACHINE!
Eric: TTYL?
Rob: Sure :)

Summary:
Eric and Rob are going to watch a stand-up on youtube.



This function, `convert_examples_to_features`, takes a batch of examples and converts the dialogue and summary into tokenized input and target sequences. It uses the Hugging Face tokenizer to encode the input (dialogue) and target (summary), truncating them to a maximum length. The function then returns a dictionary containing the tokenized input IDs, attention mask, and target labels.





In [20]:
def convert_examples_to_features(example_batch):
    input_encodings = tokenizer(example_batch['dialogue'] , max_length = 1024, truncation = True )

    with tokenizer.as_target_tokenizer():
        target_encodings = tokenizer(example_batch['summary'], max_length = 128, truncation = True )

    return {
        'input_ids' : input_encodings['input_ids'],
        'attention_mask': input_encodings['attention_mask'],
        'labels': target_encodings['input_ids']
    }

### We apply our convert_example_to_feature to our dataset.

In [21]:
dataset_samsum_pt = dataset_samsum.map(convert_examples_to_features, batched = True)

Map:   0%|          | 0/14732 [00:00<?, ? examples/s]

In [22]:
dataset_samsum_pt["train"]

Dataset({
    features: ['id', 'dialogue', 'summary', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 14732
})


This code snippet imports `DataCollatorForSeq2Seq` from the Hugging Face `transformers` library and creates a `seq2seq_data_collator` object. This data collator is used to dynamically pad the input sequences to the correct length during training, ensuring that the input and target sequences are aligned properly for the model. It uses the previously defined tokenizer and model.


In [24]:
# Training
from transformers import DataCollatorForSeq2Seq

seq2seq_data_collator = DataCollatorForSeq2Seq(tokenizer, model = model_pegas)


This code sets up the training parameters for the model using `TrainingArguments`. Below is an explanation of each parameter:

### Parameters:

- **`output_dir`**:  
  Directory where the model checkpoints, logs, and evaluation results will be saved. In this case, it's set to `'pegas-samsum'`.

- **`num_train_epochs`**:  
  Specifies the number of times the entire dataset will be passed through the model during training. Here it is set to 1, meaning the model will train for 1 epoch.

- **`warmup_steps`**:  
  The number of steps to perform learning rate warmup. During this phase, the learning rate will gradually increase. It is set to 500 steps in this case.

- **`per_device_train_batch_size`**:  
  Batch size for training on each device (GPU or CPU). In this case, it is set to 1, which means the model will train on one sample at a time per device.

- **`per_device_eval_batch_size`**:  
  Batch size for evaluation during validation. Here, it's set to 1, meaning evaluation will happen with one sample per device.

- **`weight_decay`**:  
  The strength of weight decay regularization used to prevent overfitting by penalizing large weights. It is set to 0.01, meaning the model will have a small penalty on large weights.

- **`logging_steps`**:  
  Defines how often (in steps) the logs are recorded. In this case, it is set to 10, meaning logging will happen every 10 training steps.

- **`evaluation_strategy`**:  
  Specifies when to run evaluation during training. In this case, it is set to `'steps'`, meaning evaluation will happen after a specified number of steps (`eval_steps`).

- **`eval_steps`**:  
  Defines the number of steps between evaluations during training. Here it is set to 500, so evaluation will occur every 500 steps.

- **`save_steps`**:  
  Specifies how often the model is saved during training. A value of `1e6` is used here, which essentially means the model will only be saved after a very large number of steps (effectively disabling frequent saves).

- **`gradient_accumulation_steps`**:  
  The number of steps over which gradients are accumulated before the model weights are updated. It is set to 16, meaning gradients are accumulated over 16 steps before performing a weight update. This can help with training efficiency on smaller batch sizes


In [26]:
from transformers import TrainingArguments, Trainer

trainer_args = TrainingArguments(
    output_dir = 'pegas-samsum',
    num_train_epochs = 1,
    warmup_steps = 500,
    per_device_train_batch_size = 1,
    per_device_eval_batch_size = 1,
    weight_decay = 0.01,
    logging_steps = 10,
    evaluation_strategy = 'steps',
    eval_steps = 500,
    save_steps = 1e6,
    gradient_accumulation_steps = 16
)


This code initializes a `Trainer` object from the Hugging Face `transformers` library. The `Trainer` handles the training loop and evaluation process for the model. Below is an explanation of each parameter used in this setup.

### Parameters:

- **`model`**:  
  The model to be trained. In this case, it is the `model_pegas`, which is the pre-trained `pegasus` model loaded earlier.

- **`args`**:  
  The training arguments defined using the `TrainingArguments` class. These arguments include settings like batch size, number of epochs, evaluation strategy, and more. Here, `trainer_args` contains all these settings.

- **`tokenizer`**:  
  The tokenizer used to process input and target sequences during training. It is set to the `tokenizer` object, which is responsible for converting text to token IDs and vice versa.

- **`data_collator`**:  
  A data collator that handles dynamic padding and batching during training. It is set to `seq2seq_data_collator`, which ensures that the input and output sequences are padded correctly for the sequence-to-sequence task.

- **`train_dataset`**:  
  The dataset used for training the model. In this case, it's set to the `"test"` split of `dataset_samsum_pt`. This might be a typo, as typically the `"train"` split would be used here instead of `"test"`. Make sure to use the correct split for training.

- **`eval_dataset`**:  
  The dataset used for evaluating the model during training. It is set to the `"validation"` split of `dataset_samsum_pt`, which will be used to monitor the model's performance after each evaluation.


In [27]:
trainer = Trainer(
    model = model_pegas,
    args = trainer_args,
    tokenizer = tokenizer,
    data_collator = seq2seq_data_collator,
    train_dataset = dataset_samsum_pt["test"],
    eval_dataset = dataset_samsum_pt["validation"]
)

#### Train our data(it takes little time)

In [28]:
trainer.train()

<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter:

 ··········


[34m[1mwandb[0m: No netrc file found, creating one.
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33maakuskar-980[0m ([33maakuskar-980-ai-researcher[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


Step,Training Loss,Validation Loss


TrainOutput(global_step=51, training_loss=3.0044142264945832, metrics={'train_runtime': 483.8722, 'train_samples_per_second': 1.693, 'train_steps_per_second': 0.105, 'total_flos': 313450454089728.0, 'train_loss': 3.0044142264945832, 'epoch': 0.9963369963369964})


The function `generate_batch_sized_chunks` is used to split a large list into smaller, more manageable batches. This is particularly useful for processing large datasets in smaller chunks, such as when performing batch processing in machine learning tasks
- **`list_of_elements`**:  
  This is the list (or iterable) that you want to split into batches. It could represent any collection of data such as a list of input samples or data points.

- **`batch_size`**:  
  This is the size of each batch. The function will divide the `list_of_elements` into smaller groups, each containing up to `batch_size` elements.


In [29]:
# Evaluation
def generate_batch_sized_chunks(list_of_elements, batch_size):
    """split the dataset into smaller batches that we can process simultaneously
    Yield successive batch-sized chunks from list_of_elements."""
    for i in range(0, len(list_of_elements), batch_size):
        yield list_of_elements[i : i + batch_size]


The function `calculate_metric_on_test_ds` evaluates a model's performance on a test dataset by generating summaries for each article in the dataset, comparing them with the actual summaries, and then calculating a specific metric (such as ROUGE scores). It splits the dataset into smaller batches, processes each batch, generates summaries, and compares them to the reference summaries.

- **Batching**: The dataset is split into batches to process the articles and summaries in smaller chunks. This helps with memory efficiency, especially when dealing with large datasets.
- **Summary Generation**: The model generates summaries for the articles using beam search and length penalties to ensure the summaries are of appropriate length and quality.
- **Metric Calculation**: The function adds the generated summaries and reference summaries to the metric, which computes the desired evaluation score (like ROUGE).


In [30]:
def calculate_metric_on_test_ds(dataset,
                                metric,
                                model,
                                tokenizer,
                                batch_size=16,
                                device=device,
                                column_text="article",
                                column_summary="highlights"):

    article_batches = list(generate_batch_sized_chunks(dataset[column_text], batch_size))
    target_batches = list(generate_batch_sized_chunks(dataset[column_summary], batch_size))

    for article_batch, target_batch in tqdm(
        zip(article_batches, target_batches), total=len(article_batches)):

        inputs = tokenizer(article_batch, max_length=1024,  truncation=True,
                        padding="max_length", return_tensors="pt")

        summaries = model.generate(input_ids=inputs["input_ids"].to(device),
                         attention_mask=inputs["attention_mask"].to(device),
                         length_penalty=0.8, num_beams=8, max_length=128)
        ''' parameter for length penalty ensures that the model does not generate sequences that are too long. '''

        # Finally, we decode the generated texts,
        # replace the  token, and add the decoded texts with the references to the metric.
        decoded_summaries = [tokenizer.decode(s, skip_special_tokens=True,
                                clean_up_tokenization_spaces=True)
               for s in summaries]

        decoded_summaries = [d.replace("", " ") for d in decoded_summaries]


        metric.add_batch(predictions=decoded_summaries, references=target_batch)

    #  Finally compute and return the ROUGE scores.
    score = metric.compute()
    return score


In [40]:
import evaluate

rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]
rouge_metric = evaluate.load('rouge')

Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

In [42]:

score = calculate_metric_on_test_ds(
    dataset_samsum['test'][0:10], rouge_metric, trainer.model, tokenizer, batch_size = 2, column_text = 'dialogue', column_summary= 'summary'
)

rouge_dict = dict((rn, score[rn]) for rn in rouge_names)
pd.DataFrame(rouge_dict, index = [f'pegasus'] )

100%|██████████| 5/5 [00:25<00:00,  5.08s/it]


Unnamed: 0,rouge1,rouge2,rougeL,rougeLsum
pegasus,0.010608,0.0,0.011029,0.010608


### Save our pretrained model and tokenizer

In [44]:
model_pegas.save_pretrained("pegasus-samsum-model")


In [45]:
tokenizer.save_pretrained("tokenizer")


('tokenizer/tokenizer_config.json',
 'tokenizer/special_tokens_map.json',
 'tokenizer/spiece.model',
 'tokenizer/added_tokens.json',
 'tokenizer/tokenizer.json')

In [46]:
tokenizer = AutoTokenizer.from_pretrained("/content/tokenizer")


#### Predictions using our saved model

In [47]:
#Prediction

gen_kwargs = {"length_penalty": 0.8, "num_beams":8, "max_length": 128}



sample_text = dataset_samsum["test"][0]["dialogue"]

reference = dataset_samsum["test"][0]["summary"]

pipe = pipeline("summarization", model="pegasus-samsum-model",tokenizer=tokenizer)

##
print("Dialogue:")
print(sample_text)


print("\nReference Summary:")
print(reference)


print("\nModel Summary:")
print(pipe(sample_text, **gen_kwargs)[0]["summary_text"])

Device set to use cuda:0
Your max_length is set to 128, but your input_length is only 122. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=61)


Dialogue:
Hannah: Hey, do you have Betty's number?
Amanda: Lemme check
Hannah: <file_gif>
Amanda: Sorry, can't find it.
Amanda: Ask Larry
Amanda: He called her last time we were at the park together
Hannah: I don't know him well
Hannah: <file_gif>
Amanda: Don't be shy, he's very nice
Hannah: If you say so..
Hannah: I'd rather you texted him
Amanda: Just text him 🙂
Hannah: Urgh.. Alright
Hannah: Bye
Amanda: Bye bye

Reference Summary:
Hannah needs Betty's number but Amanda doesn't have it. She needs to contact Larry.

Model Summary:
Amanda: Ask Larry Amanda: He called her last time we were at the park together .<n>Hannah: I'd rather you texted him .<n>Amanda: Just text him .
