# Check the status of the NVIDIA GPU(s) on the system

The `nvidia-smi` command below reports the current status of each NVIDIA GPU. It is a useful tool for debugging and optimizing machine learning projects that run on NVIDIA GPUs:

- If the GPU is not being utilized, configure the machine learning framework to use it.
- If the code is running slowly, optimize it to make better use of the GPU.

In [None]:
# Check if the GPU is being utilized by the code
!nvidia-smi
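
The same check can be made from Python, which is handy when the notebook runs unattended. The sketch below is a minimal example that assumes `nvidia-smi` is on the `PATH` when a GPU driver is installed; the helper name `query_gpus` is hypothetical and not part of the original notebook.

```python
import shutil
import subprocess

def query_gpus():
    """Return one CSV line per visible NVIDIA GPU, or [] if none can be queried."""
    if shutil.which("nvidia-smi") is None:
        return []  # driver tools are not installed
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=name,utilization.gpu,memory.used,memory.total",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0:
        return []  # nvidia-smi exists but could not talk to the driver
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]

gpus = query_gpus()
print(gpus if gpus else "No NVIDIA GPU detected; the notebook will fall back to the CPU.")
```

An empty list means the later `device` selection will pick `"cpu"`, which makes training a model of this size impractically slow.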

In [None]:
# Install necessary packages for the project
!pip install transformers[sentencepiece] datasets sacrebleu rouge_score py7zr -q

# The above command installs the following packages:
# - transformers[sentencepiece]: the Hugging Face library of pre-trained NLP models, installed with the SentencePiece tokenizer extra
# - datasets: a library for loading and processing datasets, including those on the Hugging Face Hub
# - sacrebleu: a library for computing BLEU scores, a metric for evaluating the quality of machine-translated text
# - rouge_score: a library for computing ROUGE scores, a metric for evaluating the quality of generated summaries
# - py7zr: a library for working with 7z archives, a type of compressed file format

In [None]:
# Install and upgrade necessary packages for the project
!pip install --upgrade accelerate
!pip uninstall -y transformers accelerate
!pip install transformers accelerate

# The above commands install and upgrade the following packages:
# - accelerate: a library for running PyTorch training code on CPUs, GPUs, and multi-device setups
# - transformers: the Hugging Face library of pre-trained NLP models, used here for summarization

# The second command uninstalls the previously installed versions of transformers and accelerate,
# so that the third command installs a fresh, mutually compatible pair of the latest versions.

In [None]:
# Import necessary packages for the project
from transformers import pipeline, set_seed
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
# Note: load_metric is deprecated in recent releases of datasets in favor of the separate evaluate library
from datasets import load_dataset, load_from_disk, load_metric
import matplotlib.pyplot as plt
import pandas as pd
import nltk
from nltk.tokenize import sent_tokenize
from tqdm import tqdm
import torch

# The above packages are used throughout the notebook to load data, train a summarization model, and evaluate it.
# - transformers: pre-trained models, tokenizers, and the pipeline API
# - datasets: loading and processing of datasets, plus metric loading
# - matplotlib: a library for creating visualizations in Python
# - pandas: a library for data manipulation and analysis
# - nltk: a library for natural language processing tasks such as tokenization and stemming
# - tqdm: a library for adding progress bars to Python loops
# - torch: a library for machine learning tasks such as neural network training and inference

In [None]:
# Download the "punkt" models that nltk's sent_tokenize uses to split text into sentences
nltk.download("punkt")

In [None]:
# Load a pre-trained BigBird-Pegasus model for sequence-to-sequence tasks such as summarization
# Set the device to use for running the model (either "cuda" or "cpu")
device = "cuda" if torch.cuda.is_available() else "cpu"
# AutoModelForSeq2SeqLM needs an encoder-decoder checkpoint. "google/bigbird-roberta-base" is
# encoder-only and cannot be loaded by this class, so a BigBird-Pegasus checkpoint is assumed here.
model_ckpt = "google/bigbird-pegasus-large-arxiv"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model_bigbird = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt).to(device)

# The above code loads the pre-trained model and its tokenizer and moves the model to the selected device.
# - AutoTokenizer: a class that selects the appropriate tokenizer for the given checkpoint
# - AutoModelForSeq2SeqLM: a class that selects the appropriate encoder-decoder model for the given checkpoint
# - model_ckpt: the checkpoint name of the pre-trained model
# - device: the device to use for running the model (either "cuda" or "cpu")

In [None]:
# Download and extract the summarizer data
!wget https://github.com/InsiderCloud/Cogniezer-Backend/raw/master/summarizer-data.zip
!unzip summarizer-data.zip

# The above commands download and extract the summarizer data from a GitHub repository.
# - wget: a command-line utility for downloading files from the web
# - unzip: a command-line utility for extracting files from a zip archive

## Load the samsum dataset

The samsum dataset is loaded from disk using the `load_from_disk` function from the `datasets` package. It pairs dialogues with reference summaries and provides the data used to train and test the summarization model.

The code to load the samsum dataset is shown below:

In [7]:
# Load the samsum dataset from disk
dataset_samsum = load_from_disk("samsum_dataset")
dataset_samsum

## Print information about the samsum dataset

The following code prints information about the samsum dataset, including the length of each split, the column names, and an example dialogue and summary.

In [None]:
# Print information about the samsum dataset
split_lengths = [len(dataset_samsum[split]) for split in dataset_samsum]
print(f"Split lengths: {split_lengths}")
print(f"Features: {dataset_samsum['train'].column_names}")
print("\nDialogue:")
print(dataset_samsum["test"][1]["dialogue"])
print("\nSummary:")
print(dataset_samsum["test"][1]["summary"])

## Convert examples to features for training the summarization model

The following code defines a function for converting a batch of examples to features for training the summarization model. The function tokenizes the input dialogue and target summary using the tokenizer and returns the input IDs, attention mask, and target labels as a dictionary.

In [21]:
# Convert a batch of examples to features for training the summarization model
def convert_examples_to_features(example_batch):
  # Tokenize the source dialogues, truncating anything beyond 4096 tokens
  input_encodings = tokenizer(example_batch['dialogue'], max_length=4096, truncation=True)
  # Tokenize the target summaries; text_target replaces the deprecated as_target_tokenizer() context
  target_encodings = tokenizer(text_target=example_batch['summary'], max_length=512, truncation=True)
  return {
      'input_ids': input_encodings['input_ids'],
      'attention_mask': input_encodings['attention_mask'],
      'labels': target_encodings['input_ids']
  }

In [None]:
# Apply the conversion to every split, one batch of examples at a time
dataset_samsum_pt = dataset_samsum.map(convert_examples_to_features, batched=True)
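
With `batched=True`, `map` hands the function a dictionary of column lists rather than a single row, and expects lists of the same length back. The sketch below illustrates that shape in plain Python. `toy_encode` is a hypothetical whitespace tokenizer standing in for the real one, and the two dialogues are made-up examples, not rows from samsum.

```python
vocab = {}  # token -> integer id, filled on the fly (toy vocabulary)

def toy_encode(text, max_length):
    """Map whitespace tokens to integer ids and truncate to max_length."""
    ids = [vocab.setdefault(tok, len(vocab)) for tok in text.lower().split()]
    return ids[:max_length]

def toy_convert(batch):
    """Mimic the structure returned by convert_examples_to_features."""
    input_ids = [toy_encode(d, max_length=8) for d in batch["dialogue"]]
    labels = [toy_encode(s, max_length=4) for s in batch["summary"]]
    return {
        "input_ids": input_ids,
        "attention_mask": [[1] * len(ids) for ids in input_ids],
        "labels": labels,
    }

# A batch is a dict of columns; each column is a list with one entry per example
batch = {
    "dialogue": ["Amy: lunch at noon? Ben: sure, see you then",
                 "Cat: done. Dan: thanks"],
    "summary": ["Amy and Ben will have lunch at noon", "Cat finished"],
}
features = toy_convert(batch)
for name, column in features.items():
    print(name, [len(row) for row in column])
```

Each returned column has one variable-length list per example; padding to a common length is left to the data collator at training time.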

In [None]:
dataset_samsum_pt['train']

## Training

In [25]:
from transformers import DataCollatorForSeq2Seq

seq2seq_data_collator = DataCollatorForSeq2Seq(tokenizer,model=model_bigbird)
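
The collator pads each batch to its longest example at training time. Inputs are padded with the tokenizer's pad token, while labels are padded with `-100` so that the loss ignores those positions. The sketch below shows that idea in plain Python under simplifying assumptions: right-side padding, a pad token id of `0`, and no `decoder_input_ids`, which the real collator can also prepare when given the model. The helper `pad_batch` is hypothetical.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Right-pad a list of feature dicts to the longest example in the batch."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in = max_in - len(f["input_ids"])
        n_lab = max_lab - len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_in)
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_in)
        # -100 marks label positions that the loss should skip
        batch["labels"].append(f["labels"] + [label_pad_id] * n_lab)
    return batch

examples = [
    {"input_ids": [5, 6, 7], "attention_mask": [1, 1, 1], "labels": [9, 10]},
    {"input_ids": [8], "attention_mask": [1], "labels": [11, 12, 13]},
]
padded = pad_batch(examples)
print(padded["input_ids"])  # [[5, 6, 7], [8, 0, 0]]
print(padded["labels"])     # [[9, 10, -100], [11, 12, 13]]
```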

In [27]:
from transformers import TrainingArguments, Trainer

trainer_args = TrainingArguments(
    output_dir='bigbird-samsum', num_train_epochs=1, warmup_steps=5000,
    per_device_train_batch_size=1, per_device_eval_batch_size=1,
    weight_decay=0.01, logging_steps=10,
    evaluation_strategy='steps', eval_steps=500, save_steps=1_000_000,
    gradient_accumulation_steps=16
)
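
These settings interact: a per-device batch size of 1 with 16 accumulation steps gives an effective batch of 16 examples per optimizer step, and `warmup_steps` is counted in optimizer steps. The arithmetic below is a rough sketch, assuming a single GPU and a hypothetical training-set size of 16,000 examples; substitute `len(dataset_samsum_pt['train'])` for the real count.

```python
per_device_train_batch_size = 1
gradient_accumulation_steps = 16
num_train_epochs = 1
warmup_steps = 5000

num_devices = 1              # assumption: one GPU
num_train_examples = 16_000  # hypothetical size, for illustration only

# Examples consumed per optimizer update
effective_batch_size = (per_device_train_batch_size
                        * gradient_accumulation_steps
                        * num_devices)

# Approximate number of optimizer updates over the whole run
optimizer_steps = (num_train_examples // effective_batch_size) * num_train_epochs

warmup_exceeds_training = warmup_steps > optimizer_steps

print(f"Effective batch size: {effective_batch_size}")
print(f"Optimizer steps:      {optimizer_steps}")
print(f"Warmup longer than training: {warmup_exceeds_training}")
```

If the warmup is longer than the run, the learning rate is still ramping up when training ends and never reaches its configured peak, so it may be worth lowering `warmup_steps` or training for more epochs.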

In [29]:
trainer = Trainer(model=model_bigbird,args = trainer_args,
                  tokenizer=tokenizer,data_collator=seq2seq_data_collator,
                  train_dataset=dataset_samsum_pt['train'],
                  eval_dataset=dataset_samsum_pt["validation"])

In [None]:
trainer.train()

## Evaluation

In [None]:
def generate_batch_sized_chunks(list_of_elements, batch_size):
  """Yield successive batch_size-sized chunks from list_of_elements."""
  for i in range(0, len(list_of_elements), batch_size):
    yield list_of_elements[i:i + batch_size]

def calculate_metric_on_test_ds(dataset, metric, model, tokenizer,
                                batch_size=16, device=device,
                                column_text="dialogue",
                                column_summary="summary"):
  dialogue_batches = list(generate_batch_sized_chunks(dataset[column_text], batch_size))
  target_batches = list(generate_batch_sized_chunks(dataset[column_summary], batch_size))

  for dialogue_batch, target_batch in tqdm(zip(dialogue_batches, target_batches), total=len(dialogue_batches)):
    # Tokenize the dialogues and pad them to a fixed length so they stack into one tensor
    inputs = tokenizer(dialogue_batch, max_length=4096, truncation=True, padding="max_length", return_tensors="pt")

    # Generate summaries with beam search
    summaries = model.generate(input_ids=inputs["input_ids"].to(device),
                               attention_mask=inputs["attention_mask"].to(device),
                               length_penalty=0.8, num_beams=8, max_length=512)

    decoded_summaries = [tokenizer.decode(s, skip_special_tokens=True, clean_up_tokenization_spaces=True) for s in summaries]

    # Pegasus-style tokenizers emit "<n>" as a newline marker; replace it with a space
    decoded_summaries = [d.replace("<n>", " ") for d in decoded_summaries]

    metric.add_batch(predictions=decoded_summaries, references=target_batch)

  # Aggregate the scores over all batches
  score = metric.compute()
  return score

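
To make the batching helper concrete, here is a small usage example. It repeats the generator so that it runs on its own; the item names are placeholders, not rows from the dataset.

```python
def generate_batch_sized_chunks(list_of_elements, batch_size):
    """Yield successive batch_size-sized chunks from list_of_elements."""
    for i in range(0, len(list_of_elements), batch_size):
        yield list_of_elements[i:i + batch_size]

dialogues = ["d0", "d1", "d2", "d3", "d4"]  # placeholder items
chunks = list(generate_batch_sized_chunks(dialogues, batch_size=2))
print(chunks)  # [['d0', 'd1'], ['d2', 'd3'], ['d4']]
```

The last chunk is simply shorter when the list length is not a multiple of `batch_size`, so no example is dropped from the evaluation.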


In [None]:
rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]
rouge_metric = load_metric('rouge')

In [None]:
score = calculate_metric_on_test_ds(
    dataset_samsum['test'], rouge_metric, trainer.model, tokenizer,
    batch_size=2, column_text='dialogue', column_summary='summary'
)

# Keep the mid-point F-measure of each ROUGE variant
rouge_dict = dict((rn, score[rn].mid.fmeasure) for rn in rouge_names)

pd.DataFrame(rouge_dict, index=['bigbird'])
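
For intuition about what the `rouge1` column measures, the sketch below computes a simplified ROUGE-1 F-measure by hand: clipped unigram overlap between a prediction and a reference, turned into precision, recall, and their harmonic mean. It assumes lowercasing and whitespace tokenization only, so its numbers will not match the `rouge_score` package exactly. The function `rouge1_f` and the two sentences are illustrative, not taken from the notebook.

```python
from collections import Counter

def rouge1_f(prediction, reference):
    """Simplified ROUGE-1 F-measure based on clipped unigram overlap."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

prediction = "Amy and Ben meet at noon"
reference = "Amy and Ben will have lunch at noon"
f1 = rouge1_f(prediction, reference)
print(round(f1, 3))  # 0.714
```

Here 5 of the 6 predicted tokens appear in the 8-token reference, giving a precision of 5/6, a recall of 5/8, and an F-measure of 5/7.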

In [None]:
# Save the fine-tuned model
model_bigbird.save_pretrained("bigbird-samsum-model")

In [None]:
# Save the tokenizer alongside the model
tokenizer.save_pretrained("tokenizer")

In [None]:
tokenizer = AutoTokenizer.from_pretrained("/content/tokenizer")

## Prediction

In [None]:
# Generation settings: beam search with a length penalty and a cap on summary length
gen_kwargs = {'length_penalty': 0.8, 'num_beams': 8, 'max_length': 512}

sample_text = dataset_samsum['test'][0]["dialogue"]

reference = dataset_samsum['test'][0]["summary"]

# Build a summarization pipeline from the saved model and tokenizer
pipe = pipeline("summarization", model="bigbird-samsum-model", tokenizer=tokenizer)

print("Dialogue:")
print(sample_text)

print("\nReference Summary:")
print(reference)

print("\nModel Summary:")
print(pipe(sample_text, **gen_kwargs)[0]["summary_text"])
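
The length penalty passed to generation affects how beam search ranks finished candidates. In the Hugging Face implementation the summed log-probability of a sequence is divided by its length raised to the length penalty, so values above 0 favor longer outputs. The sketch below illustrates that normalization with two made-up candidates; `beam_score` is a hypothetical helper, not a library function.

```python
def beam_score(sum_logprobs, length, length_penalty):
    """Length-normalized beam score: log-likelihood / length ** length_penalty."""
    return sum_logprobs / (length ** length_penalty)

# Two hypothetical finished beams: (summed log-probability, length in tokens)
short_beam = (-4.0, 5)
long_beam = (-6.0, 12)

for penalty in (0.0, 0.8):
    s = beam_score(*short_beam, penalty)
    l = beam_score(*long_beam, penalty)
    winner = "short" if s > l else "long"
    print(f"length penalty {penalty}: short={s:.3f} long={l:.3f} -> {winner} beam wins")
```

With no penalty the shorter candidate wins on raw log-likelihood, while at 0.8 the longer candidate ranks higher.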