In [1]:
# !nvidia-smi

# purpose of accelerate: 
1. Ease of Multi-Device Training: Whether you're using multiple GPUs or TPUs, accelerate makes it easier to scale your model accross devices with minimal code
2. Mixed percision: it allows models to be trained using mixed precision, which can speed up training and reduce memory useage
3. Zero Redundancy Optimizer (ZeRO): Helps manage large models be efficiently splitting the model across multiple devices. Offiad to CPU/SSD: Useful for large models that may not fit entirely into GPU memory, by allowing parts of the model or optimizer to be offloaded tp CPU or even SSD

In [2]:
! pip install --upgrade accelerate
! pip install -y transformer accelerate
! pip install transformers accelerate




Usage:   
  pip install [options] <requirement specifier> [package-index-options] ...
  pip install [options] -r <requirements file> [package-index-options] ...
  pip install [options] [-e] <vcs project url> ...
  pip install [options] [-e] <local project path> ...
  pip install [options] <archive url/path> ...

no such option: -y




In [3]:
from transformers import pipeline, set_seed
from datasets import load_dataset, load_from_disk
import matplotlib.pyplot as plt 
from datasets import load_dataset
import pandas as pd 
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer 
# AutoTokenizer - convert text to tokens
# utoModelForSeq2SeqLM - load the specific model from hugging face

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from tqdm import tqdm 
import torch
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\26amr\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\26amr\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [4]:
# test to check tokenizer 
sentense='Amruth the new AI king!'
tokens =word_tokenize(sentense)
tokens

['Amruth', 'the', 'new', 'AI', 'king', '!']

# Basic Functionality of Hugging Face model

In [5]:
from transformers import AutoTokenizer, PegasusForConditionalGeneration

model = PegasusForConditionalGeneration.from_pretrained("google/pegasus-xsum")
tokenizer = AutoTokenizer.from_pretrained("google/pegasus-xsum")

ARTICLE_TO_SUMMARIZE = (
    "PG&E stated it scheduled the blackouts in response to forecasts for high winds "
    "amid dry conditions. The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were "
    "scheduled to be affected by the shutoffs which were expected to last through at least midday tomorrow."
)
inputs = tokenizer(ARTICLE_TO_SUMMARIZE, max_length=1024, return_tensors="pt")

# Generate Summary
summary_ids = model.generate(inputs["input_ids"])
# decode the summary back to text 
tokenizer.batch_decode(summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-xsum and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


"California's largest electricity provider has turned off power to hundreds of thousands of customers."

In [6]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
device='cuda' if torch.cuda.is_available() else 'cpu'
device

'cuda'

# fine Tuning

In [7]:
model='google/pegasus-cnn_dailymail'
# tokenizer = convert text to tokens
tokenizer=AutoTokenizer.from_pretrained(model)

# loading the model and tokenizer 
model_pegasus=AutoModelForSeq2SeqLM.from_pretrained(model).to(device)

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-cnn_dailymail and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [8]:
# download & unzip data
import os 
import zipfile
os.chdir('../')
def extract_unzip_file():
    unzip_path='dataset'
    os.makedirs(unzip_path,exist_ok=True)
    with zipfile.ZipFile('dataset/summarizer-data.zip','r') as zip_ref:
        zip_ref.extractall(unzip_path)

extract_unzip_file()


In [9]:
# load the arrow files and convert it into data dictionary 
dataset_samsun=load_from_disk('dataset/samsum_dataset')
dataset_samsun

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 14732
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 819
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 818
    })
})

In [10]:
split_lengths=[len(dataset_samsun[split]) for split in dataset_samsun]
print(f'Split Lengths(Train/test/validation): {split_lengths}')

print(f'Dialogue:')
print(dataset_samsun['test'][2]['dialogue'])
print(f'Summary:')
print(dataset_samsun['test'][2]['summary'])

Split Lengths(Train/test/validation): [14732, 819, 818]
Dialogue:
Lenny: Babe, can you help me with something?
Bob: Sure, what's up?
Lenny: Which one should I pick?
Bob: Send me photos
Lenny:  <file_photo>
Lenny:  <file_photo>
Lenny:  <file_photo>
Bob: I like the first ones best
Lenny: But I already have purple trousers. Does it make sense to have two pairs?
Bob: I have four black pairs :D :D
Lenny: yeah, but shouldn't I pick a different color?
Bob: what matters is what you'll give you the most outfit options
Lenny: So I guess I'll buy the first or the third pair then
Bob: Pick the best quality then
Lenny: ur right, thx
Bob: no prob :)
Summary:
Lenny can't decide which trousers to buy. Bob advised Lenny on that topic. Lenny goes with Bob's advice to pick the trousers that are of best quality.


# prepare the data for training sequence to sequence model
- the dialogue and summary must br converted into 3 main fields i.e. input_ids, attention_mask, labels 
example 
{'dialogue: 'Hi! How are you? ,
'summary: 'The speaker is asking how the other person is.'
}

{   
    'input_ids: [123,453,234,....], # Token IDs for the dialogue 

    'attention_mask:[1,1,1,....],# Attention mask for special characters to tokens 

    'labels: [321,654, 987, ...], # Token IDs for the summary (target)

}

In [11]:
def convert_Examples_to_features(example_batch):
    input_encodings=tokenizer(example_batch['dialogue'],max_length=1024, truncation=True)

    with tokenizer.as_target_tokenizer():
        target_encodings=tokenizer(example_batch['summary'],max_length=128,truncation=True)

    return {
        'input_ids': input_encodings['input_ids'],
        'attention_mask': input_encodings['attention_mask'],
        'labels':target_encodings['input_ids']
    }

In [12]:
dataset_samsum_pt=dataset_samsun.map(convert_Examples_to_features,batched=True)

Map:   0%|          | 0/14732 [00:00<?, ? examples/s]



Map:   0%|          | 0/819 [00:00<?, ? examples/s]

Map:   0%|          | 0/818 [00:00<?, ? examples/s]

In [13]:
dataset_samsum_pt['train']

Dataset({
    features: ['id', 'dialogue', 'summary', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 14732
})

In [None]:
# triaining 
from transformers import DataCollatorForSeq2Seq # it can be provided to model that helps in preparing batch to batches for training   

seq2seq_data_collator=DataCollatorForSeq2Seq(tokenizer, model=model_pegasus)

In [20]:
from transformers import TrainingArguments, Trainer

trainer_args=TrainingArguments(
    output_dir='pegasus-samsum', num_train_epochs=1, warmup_steps=500,
    per_device_eval_batch_size=1, per_device_train_batch_size=1,
    weight_decay=0.01, logging_steps=10,
    eval_steps=500, save_steps=1e6,
    eval_strategy='steps',
    gradient_accumulation_steps=16
)

In [21]:
trainer=Trainer(model=model_pegasus, args=trainer_args,
                tokenizer=tokenizer, data_collator=seq2seq_data_collator,
                train_dataset=dataset_samsum_pt['test'],
                eval_dataset=dataset_samsum_pt['validation']
                )

  trainer=Trainer(model=model_pegasus, args=trainer_args,


In [22]:
trainer.train()

Step,Training Loss,Validation Loss




TrainOutput(global_step=51, training_loss=3.1008997945224537, metrics={'train_runtime': 2866.5425, 'train_samples_per_second': 0.286, 'train_steps_per_second': 0.018, 'total_flos': 313450454089728.0, 'train_loss': 3.1008997945224537, 'epoch': 0.9963369963369964})

# Evaluation

In [25]:
def generate_batch_seized_chunks(list_of_elements, batch_size):
    for i in range(0,len(list_of_elements),batch_size):
        yield list_of_elements[i:i+batch_size]

def calculate_metric_on_test_ds(dataset, metric, model, tokenizer, batch_size=16, device=device, column_text='article',column_summary='highlights'):
    article_batches=list(generate_batch_seized_chunks(dataset[column_text],batch_size))
    target_batches=list(generate_batch_seized_chunks(dataset[column_summary],batch_size))

    for article_batch, target_batch in tqdm(
        zip(article_batches,target_batches), total=len(article_batches)):
        inputs=tokenizer(article_batch,max_length=1024, truncation=True, padding='max_length', return_tensors='pt')
        summaries=model.generate(input_ids=inputs['input_ids'].to(device), 
                                 attention_mask=inputs['attention_mask'].to(device),
                                 length_penalty=0.8,num_beams=8,max_length=128
                                 )
        # decode the generated texts 
        # replace the token, and add the decoded texts with the reference to the metrics
        decode_summaries=[
            tokenizer.decode(s,skip_special_tokens=True, clean_up_tokenization_spaces=True) for s in summaries
        ]
        decode_summaries=[d.replace(""," ") for d in decode_summaries]

        metric.add_batch(predictions=decode_summaries, references=target_batch)

    # finally compute and return the ROUGE sources 
    score=metric.compute()
    return score 


In [26]:
import evaluate
rouge_metric=evaluate.load('rouge')
rouge_names=['rouge1','rouge2','rougeL','rougeLsum']

Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

In [27]:
rouge_metric

EvaluationModule(name: "rouge", module_type: "metric", features: [{'predictions': Value(dtype='string', id='sequence'), 'references': Sequence(feature=Value(dtype='string', id='sequence'), length=-1, id=None)}, {'predictions': Value(dtype='string', id='sequence'), 'references': Value(dtype='string', id='sequence')}], usage: """
Calculates average rouge scores for a list of hypotheses and references
Args:
    predictions: list of predictions to score. Each prediction
        should be a string with tokens separated by spaces.
    references: list of reference for each prediction. Each
        reference should be a string with tokens separated by spaces.
    rouge_types: A list of rouge types to calculate.
        Valid names:
        `"rouge{n}"` (e.g. `"rouge1"`, `"rouge2"`) where: {n} is the n-gram based scoring,
        `"rougeL"`: Longest common subsequence based scoring.
        `"rougeLsum"`: rougeLsum splits text using `"
"`.
        See details in https://github.com/huggingface/

In [28]:
score=calculate_metric_on_test_ds(dataset_samsun['test'][0:10], 
                                  rouge_metric, 
                                  trainer.model, 
                                  tokenizer, batch_size=2,
                                  column_text='dialogue',
                                  column_summary='summary')

# Directly use the scores without accessing fmeasure or mid
rouge_dict={rn:score[rn] for rn in rouge_names}

# Convet the dictionary to a DataFrame for easy visualization 
import pandas as pd 
pd.DataFrame(rouge_dict, index=[f'pegasus'])

100%|██████████| 5/5 [06:39<00:00, 79.99s/it] 


Unnamed: 0,rouge1,rouge2,rougeL,rougeLsum
pegasus,0.018557,0.0,0.01852,0.018556


# interpreteting good and bad rouge scores 
1. Scores close to 1: This indicates a strong overlap between the generated summary and the reference summary, which is desirable summarization tasks. For examples F-1 score of 0.7 or higher across metrics is generally considered good.
2. Scores between 0.5 and 0.7: Indicates moderate overlap. The summary might be capturing key points but is likely missing some structure or important information
3. Scores below 0.5: Suggests a poor match between the generated and referance summaries. The model might be generating irrelevant or incomplete summaries that dont capture the key ideal well

In [30]:
## save model 
model_pegasus.save_pretrained('pegasus-samsum-model')

In [31]:
## save tokenizer 
tokenizer.save_pretrained('tokenizer')

('tokenizer\\tokenizer_config.json',
 'tokenizer\\special_tokens_map.json',
 'tokenizer\\spiece.model',
 'tokenizer\\added_tokens.json',
 'tokenizer\\tokenizer.json')

In [33]:
gen_kwargs={'length_penalty':0.8, 'num_beams':8,'max_length':128}

sample_text=dataset_samsun['test'][0]['dialogue']
reference=dataset_samsun['test'][0]['summary']
pipe=pipeline('summarization',model='pegasus-samsum-model', tokenizer=tokenizer)

## 
print('Dialogue:')
print(sample_text)

print('\n reference summary')
print(reference)

print('\nModel Summary')
print(pipe(sample_text,**gen_kwargs)[0]['summary_text'])

Device set to use cuda:0
Your max_length is set to 128, but your input_length is only 122. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=61)


Dialogue:
Hannah: Hey, do you have Betty's number?
Amanda: Lemme check
Hannah: <file_gif>
Amanda: Sorry, can't find it.
Amanda: Ask Larry
Amanda: He called her last time we were at the park together
Hannah: I don't know him well
Hannah: <file_gif>
Amanda: Don't be shy, he's very nice
Hannah: If you say so..
Hannah: I'd rather you texted him
Amanda: Just text him 🙂
Hannah: Urgh.. Alright
Hannah: Bye
Amanda: Bye bye

 reference summary
Hannah needs Betty's number but Amanda doesn't have it. She needs to contact Larry.

Model Summary
Amanda: Ask Larry Amanda: He called her last time we were at the park together .<n>Hannah: I'd rather you texted him .<n>Amanda: Just text him .
