## Check the nvidia

In [3]:
!nvidia-smi

Thu Apr 18 19:38:08 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   38C    P8               9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

### In the context of the Hugging Face Transformers library:

1. **sacrebleu**: SacreBLEU is a popular metric used for evaluating the quality of machine-translated text. It is an implementation of the BLEU (Bilingual Evaluation Understudy) metric, which measures the similarity between the machine-translated text and a set of reference translations. SacreBLEU provides a standardized and easy-to-use interface for computing BLEU scores, including support for tokenization, multiple reference translations, and various other options.

2. **rouge_score**: ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics commonly used for evaluating the quality of text summarization and machine translation outputs. It measures the overlap between the generated summary and reference summaries in terms of n-gram overlap, longest common subsequences, and skip-bigram overlap. The `rouge_score` module in Transformers provides functionality for computing ROUGE scores, including support for different variants of the ROUGE metric (e.g., ROUGE-N, ROUGE-L, ROUGE-W), multiple reference summaries, and customizable options for preprocessing.

3. **py7zr**: Py7zr is a Python library for working with 7z archives, which are compressed archives similar to ZIP files but using the 7z format. It provides functionality for creating, extracting, and manipulating 7z archives, as well as features such as encryption and multi-threading support. While not directly related to natural language processing or Transformers, it could potentially be used for tasks involving compressed data, such as downloading and unpacking large pre-trained model files.

In [4]:
!pip install transformers[sentencepiece] datasets sacrebleu rouge_score py7zr -q

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m510.5/510.5 kB[0m [31m3.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m106.7/106.7 kB[0m [31m5.7 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m67.6/67.6 kB[0m [31m7.4 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m11.5 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m194.1/194.1 kB[0m [31m21.3 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m134.8/134.8 kB[0m [31m16.1 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.1/2.1 MB[0m [31m20.7 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m411.2/411.2 kB[0m [31m1

In [5]:
!pip install --upgrade accelerate
!pip uninstall -y transformers accelerate
!pip install transformers accelerate

Collecting accelerate
  Downloading accelerate-0.29.3-py3-none-any.whl (297 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m297.6/297.6 kB[0m [31m1.9 MB/s[0m eta [36m0:00:00[0m
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch>=1.10.0->accelerate)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch>=1.10.0->accelerate)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch>=1.10.0->accelerate)
  Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch>=1.10.0->accelerate)
  Using cached nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl (731.7 MB)
Collecting nvidia-cublas-cu12==12.1.3.1 (from torch>=1.10.0->accelerate)
  Using cached nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.w

In [6]:
from transformers import pipeline, set_seed
from datasets import load_dataset, load_from_disk
import matplotlib.pyplot as plt
from datasets import load_dataset
import pandas as pd
from datasets import load_dataset, load_metric

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

import nltk
from nltk.tokenize import sent_tokenize

from tqdm import tqdm
import torch

nltk.download("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

Let's go through each of the imported modules and functions:

1. **`pipeline`**: This is a function from the Hugging Face Transformers library that allows you to easily create pipelines for various NLP tasks such as text generation, text classification, question answering, etc. It abstracts away the complexities of loading models, tokenizers, and post-processing, allowing you to perform NLP tasks with just a few lines of code.

2. **`set_seed`**: This function is used to set the random seed for reproducibility. In machine learning tasks, setting a random seed ensures that the random number generator produces the same sequence of random numbers each time you run the code. This is useful for debugging and ensuring consistent results across different runs.

3. **`load_dataset`**: This function is from the `datasets` library, which is also developed by Hugging Face. It allows you to load various datasets for natural language processing tasks. These datasets can be used for tasks such as text classification, named entity recognition, machine translation, etc. The `load_dataset` function loads a dataset by name and returns it as a `Dataset` object, which can be easily manipulated and used for training or evaluation.

4. **`load_from_disk`**: This function is used to load a dataset or model that has been previously saved to disk using the `save_to_disk` method. It allows you to load datasets or models from disk without having to re-download or re-create them.

5. **`matplotlib.pyplot`**: This is a module from the Matplotlib library, which is a plotting library for Python. It provides functions for creating various types of plots and visualizations, such as line plots, scatter plots, histograms, etc. By importing `matplotlib.pyplot as plt`, you can use the `plt` alias to access Matplotlib's plotting functions.

6. **`pandas`**: This is a popular data manipulation library for Python. It provides data structures and functions for working with structured data, such as tabular data and time series data. It is commonly used for data cleaning, manipulation, and analysis tasks.

7. **`load_metric`**: This function is from the `datasets` library and is used to load evaluation metrics for natural language processing tasks. Just like `load_dataset`, it allows you to load metrics by name and returns them as `Metric` objects, which can be used to evaluate the performance of models on specific tasks.

8. **`AutoModelForSeq2SeqLM`**: This class from the Hugging Face Transformers library is used for loading pre-trained sequence-to-sequence language models. It automatically selects the appropriate model architecture based on the provided model identifier and downloads the corresponding weights from the Hugging Face model hub.

9. **`AutoTokenizer`**: This class from the Hugging Face Transformers library is used for loading pre-trained tokenizers. It automatically selects the appropriate tokenizer based on the provided model identifier and downloads the corresponding tokenizer from the Hugging Face model hub.

10. **`nltk`**: This is the Natural Language Toolkit library for Python. It provides various tools and utilities for natural language processing tasks, such as tokenization, stemming, part-of-speech tagging, and syntactic parsing.

11. **`nltk.tokenize.sent_tokenize`**: This function from the NLTK library is used for sentence tokenization, which means splitting text into individual sentences. It takes a string of text as input and returns a list of sentences.

12. **`tqdm`**: This is a library for creating progress bars in Python. It is particularly useful when working with loops or long-running tasks, as it provides a visual indication of progress.

13. **`torch`**: This is the PyTorch library, which is a popular deep learning framework for Python. It provides data structures and functions for building and training neural networks. It is widely used in natural language processing and other machine learning tasks.

14. **`nltk.download("punkt")`**: This line downloads the Punkt tokenizer models for NLTK. The Punkt tokenizer is a pre-trained sentence tokenizer that is used by NLTK for sentence tokenization. It is necessary to download the models before using NLTK's sentence tokenizer.

These imports and function calls set up the environment and import necessary libraries for performing natural language processing tasks, including loading datasets, models, and evaluation metrics, as well as setting up tokenization and progress tracking.

In [7]:
import torch

In [8]:
device = "cuda" if torch.cuda.is_available() else "cpu"
device

'cuda'

## Calling the google/pegasus-cnn_dailymail Model from Huggingface

In [9]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

In [10]:
model_ckpt = "google/pegasus-cnn_dailymail"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/88.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.12k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/1.91M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/65.0 [00:00<?, ?B/s]

In [11]:
model_pegasus = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt).to(device)

pytorch_model.bin:   0%|          | 0.00/2.28G [00:00<?, ?B/s]

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-cnn_dailymail and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


generation_config.json:   0%|          | 0.00/280 [00:00<?, ?B/s]

In [12]:
dataset_samsum = load_dataset("samsum")

Downloading data:   0%|          | 0.00/6.06M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/347k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/335k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/14732 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/819 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/818 [00:00<?, ? examples/s]

In [13]:
dataset_samsum

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 14732
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 819
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 818
    })
})

In [14]:
dataset_samsum["train"]["dialogue"][1]


'Olivia: Who are you voting for in this election? \r\nOliver: Liberals as always.\r\nOlivia: Me too!!\r\nOliver: Great'

In [15]:
dataset_samsum["train"][1]["summary"]

'Olivia and Olivier are voting for liberals in this election. '

In [16]:
split_lengths = [len(dataset_samsum[split])for split in dataset_samsum]

print(f"Split lengths: {split_lengths}")
print(f"Features: {dataset_samsum['train'].column_names}")
print("\nDialogue:")

print(dataset_samsum["test"][1]["dialogue"])

print("\nSummary:")

print(dataset_samsum["test"][1]["summary"])

Split lengths: [14732, 819, 818]
Features: ['id', 'dialogue', 'summary']

Dialogue:
Eric: MACHINE!
Rob: That's so gr8!
Eric: I know! And shows how Americans see Russian ;)
Rob: And it's really funny!
Eric: I know! I especially like the train part!
Rob: Hahaha! No one talks to the machine like that!
Eric: Is this his only stand-up?
Rob: Idk. I'll check.
Eric: Sure.
Rob: Turns out no! There are some of his stand-ups on youtube.
Eric: Gr8! I'll watch them now!
Rob: Me too!
Eric: MACHINE!
Rob: MACHINE!
Eric: TTYL?
Rob: Sure :)

Summary:
Eric and Rob are going to watch a stand-up on youtube.


In [17]:
def convert_examples_to_features(example_batch):
    input_encodings = tokenizer(example_batch['dialogue'] , max_length = 1024, truncation = True )

    with tokenizer.as_target_tokenizer():
        target_encodings = tokenizer(example_batch['summary'], max_length = 128, truncation = True )

    return {
        'input_ids' : input_encodings['input_ids'],
        'attention_mask': input_encodings['attention_mask'],
        'labels': target_encodings['input_ids']
    }


In [18]:
dataset_samsum_pt = dataset_samsum.map(convert_examples_to_features, batched = True)

Map:   0%|          | 0/14732 [00:00<?, ? examples/s]



Map:   0%|          | 0/819 [00:00<?, ? examples/s]

Map:   0%|          | 0/818 [00:00<?, ? examples/s]

In [19]:
dataset_samsum_pt["train"]

Dataset({
    features: ['id', 'dialogue', 'summary', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 14732
})

In [20]:
dataset_samsum_pt["train"]["input_ids"][1]

[18038,
 151,
 2632,
 127,
 119,
 6228,
 118,
 115,
 136,
 2974,
 152,
 10463,
 151,
 35884,
 130,
 329,
 107,
 18038,
 151,
 2587,
 314,
 1242,
 10463,
 151,
 1509,
 1]

In [21]:
dataset_samsum_pt["train"]["attention_mask"][1]

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

## Training Set Up

In [22]:
# Training

from transformers import DataCollatorForSeq2Seq

seq2seq_data_collator = DataCollatorForSeq2Seq(tokenizer, model=model_pegasus)

In [23]:
from transformers import TrainingArguments, Trainer

trainer_args = TrainingArguments(
    output_dir='pegasus-samsum', num_train_epochs=1, warmup_steps=500,
    per_device_train_batch_size=1, per_device_eval_batch_size=1,
    weight_decay=0.01, logging_steps=10,
    evaluation_strategy='steps', eval_steps=500, save_steps=1e6,
    gradient_accumulation_steps=16
)

In [24]:

trainer = Trainer(model=model_pegasus, args=trainer_args,
                  tokenizer=tokenizer, data_collator=seq2seq_data_collator,
                  train_dataset=dataset_samsum_pt["train"],
                  eval_dataset=dataset_samsum_pt["validation"])

Sure, let's break down the hyperparameters used when initializing the `Trainer` object:

1. **`model`**: This is the pre-trained model that you want to train or fine-tune. In this case, `model_pegasus` refers to the Pegasus model that you have previously instantiated or loaded.

2. **`args`**: These are training arguments, typically provided as an instance of the `TrainingArguments` class from the Transformers library. These arguments include settings such as the number of epochs, learning rate, batch size, optimizer, scheduler, etc. These settings control how the model is trained.

3. **`tokenizer`**: This is the tokenizer associated with the pre-trained model. The tokenizer is responsible for converting raw text inputs into tokenized inputs suitable for the model. It handles tasks such as tokenization, padding, truncation, and special token insertion.

4. **`data_collator`**: This is the data collator object used for batching and preprocessing training data. It defines how individual samples from the dataset are batched together and preprocessed before being passed to the model during training. For sequence-to-sequence tasks, the data collator typically handles tasks such as padding, truncation, and masking.

5. **`train_dataset`**: This is the training dataset that will be used to train the model. It should be provided as a dataset object, such as one loaded using the `load_dataset` function from the Hugging Face datasets library. The training dataset contains examples used for updating the model's parameters during training.

6. **`eval_dataset`**: This is the evaluation dataset that will be used to evaluate the model's performance during training. It should be provided as a dataset object similar to the training dataset. The evaluation dataset contains examples used to monitor the model's performance on unseen data and to determine when to stop training or adjust hyperparameters.

By providing these hyperparameters to the `Trainer` object, you configure the training process for your specific task and model architecture. The `Trainer` object then handles the training loop, optimization, evaluation, and logging, making it easier to train and evaluate models using the Transformers library.

In [25]:
trainer.train()

Step,Training Loss,Validation Loss
500,1.6599,1.483296


TrainOutput(global_step=920, training_loss=1.8251974468645842, metrics={'train_runtime': 2780.6026, 'train_samples_per_second': 5.298, 'train_steps_per_second': 0.331, 'total_flos': 5528248038285312.0, 'train_loss': 1.8251974468645842, 'epoch': 0.9991854466467553})

## Evaluation

In [26]:
# Evaluation

def generate_batch_sized_chunks(list_of_elements, batch_size):
    """split the dataset into smaller batches that we can process simultaneously
    Yield successive batch-sized chunks from list_of_elements."""
    for i in range(0, len(list_of_elements), batch_size):
        yield list_of_elements[i : i + batch_size]



def calculate_metric_on_test_ds(dataset, metric, model, tokenizer,
                               batch_size=16, device=device,
                               column_text="article",
                               column_summary="highlights"):
    article_batches = list(generate_batch_sized_chunks(dataset[column_text], batch_size))
    target_batches = list(generate_batch_sized_chunks(dataset[column_summary], batch_size))

    for article_batch, target_batch in tqdm(
        zip(article_batches, target_batches), total=len(article_batches)):

        inputs = tokenizer(article_batch, max_length=1024,  truncation=True,
                        padding="max_length", return_tensors="pt")

        summaries = model.generate(input_ids=inputs["input_ids"].to(device),
                         attention_mask=inputs["attention_mask"].to(device),
                         length_penalty=0.8, num_beams=8, max_length=128)
        ''' parameter for length penalty ensures that the model does not generate sequences that are too long. '''

        # Finally, we decode the generated texts,
        # replace the  token, and add the decoded texts with the references to the metric.
        decoded_summaries = [tokenizer.decode(s, skip_special_tokens=True,
                                clean_up_tokenization_spaces=True)
               for s in summaries]

        decoded_summaries = [d.replace("", " ") for d in decoded_summaries]


        metric.add_batch(predictions=decoded_summaries, references=target_batch)

    #  Finally compute and return the ROUGE scores.
    score = metric.compute()
    return score


In [27]:
rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]
rouge_metric = load_metric('rouge')

  rouge_metric = load_metric('rouge')
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.


Downloading builder script:   0%|          | 0.00/2.17k [00:00<?, ?B/s]

In [28]:
score = calculate_metric_on_test_ds(
    dataset_samsum['test'][0:10], rouge_metric, trainer.model, tokenizer, batch_size = 2, column_text = 'dialogue', column_summary= 'summary'
)

rouge_dict = dict((rn, score[rn].mid.fmeasure ) for rn in rouge_names )

pd.DataFrame(rouge_dict, index = [f'pegasus'] )

100%|██████████| 5/5 [00:09<00:00,  1.98s/it]


Unnamed: 0,rouge1,rouge2,rougeL,rougeLsum
pegasus,0.023525,0.0,0.023303,0.02333


## Save model

In [29]:

model_pegasus.save_pretrained("pegasus-samsum-model")

Non-default generation parameters: {'max_length': 128, 'min_length': 32, 'num_beams': 8, 'length_penalty': 0.8, 'forced_eos_token_id': 1}


In [30]:
## Save tokenizer
tokenizer.save_pretrained("tokenizer")

('tokenizer/tokenizer_config.json',
 'tokenizer/special_tokens_map.json',
 'tokenizer/spiece.model',
 'tokenizer/added_tokens.json',
 'tokenizer/tokenizer.json')

## Prediction Pipeline


## Using : pipe = pipeline("summarization", model,tokenizer=tokenizer)

In [31]:
#Load

tokenizer = AutoTokenizer.from_pretrained("/content/tokenizer")

In [32]:
#Prediction

gen_kwargs = {"length_penalty": 0.8, "num_beams":8, "max_length": 128}



sample_text = dataset_samsum["test"][0]["dialogue"]

reference = dataset_samsum["test"][0]["summary"]

pipe = pipeline("summarization", model="pegasus-samsum-model",tokenizer=tokenizer)

##
print("Dialogue:")
print(sample_text)


print("\nReference Summary:")
print(reference)


print("\nModel Summary:")
print(pipe(sample_text, **gen_kwargs)[0]["summary_text"])

Your max_length is set to 128, but your input_length is only 122. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=61)


Dialogue:
Hannah: Hey, do you have Betty's number?
Amanda: Lemme check
Hannah: <file_gif>
Amanda: Sorry, can't find it.
Amanda: Ask Larry
Amanda: He called her last time we were at the park together
Hannah: I don't know him well
Hannah: <file_gif>
Amanda: Don't be shy, he's very nice
Hannah: If you say so..
Hannah: I'd rather you texted him
Amanda: Just text him 🙂
Hannah: Urgh.. Alright
Hannah: Bye
Amanda: Bye bye

Reference Summary:
Hannah needs Betty's number but Amanda doesn't have it. She needs to contact Larry.

Model Summary:
Amanda can't find Betty's number. Larry called Betty last time they were at the park together. Hannah wants Amanda to text Larry. Amanda will text Larry.
