# Homework Tasks

Read about the differences between GPT-3.5 and GPT-4.

Read about metrics for generative NLP.

**Advanced**: Generative models are usually very large. Read about model quantization, which can reduce the memory and compute needed for inference with large models such as GPT.
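
As a starting point for the quantization reading, here is a minimal sketch of symmetric per-tensor int8 quantization in plain Python. The function names and the sample weights are made up for illustration; real libraries work on tensors and usually apply finer-grained schemes (per-channel or per-block scales), so treat this as an intuition aid rather than how any particular toolkit does it.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor; guard against an all-zero tensor
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the int8 codes."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only for illustration
weights = [0.42, -1.27, 0.003, 0.88]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The rounding error per weight is bounded by half a quantization step
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)
```

Each weight is stored in 8 bits instead of 32, at the cost of a rounding error of at most half the scale. Small weights such as `0.003` collapse to zero, which is one reason practical schemes use finer-grained scales.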

**Theory** (5 points): Google form questions.

**Practical task** (10 points):
1. Choose one:
    * Finetune a transformer model for summarization on https://huggingface.co/datasets/samsum.
    * Finetune a transformer model for translation on a dataset of your choice.
2. Experiment with different prompts.
3. Based on the task you chose, pick a few metrics used in generative NLP (BLEU, ROUGE, etc.), evaluate your finetuned models with them, and describe their pros and cons relative to the generations your model produces.
4. If you want, you can try using LoRA or prefix tuning to finetune the model.
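
To make the metrics item more concrete, the sketch below computes a simplified ROUGE-1 F1 score by hand. It assumes whitespace tokenization and lowercasing only, with no stemming, so its numbers will not match the `rouge_score` package exactly. The function name `rouge1_f1` and the paraphrased prediction are invented for this example.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between a prediction and a single reference."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: each reference token can be matched at most once
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


reference = "Mark just shipped the goods and he will send George the tracking number tomorrow."
paraphrase = "Mark sent the package and George gets the tracking code tomorrow."

copy_score = rouge1_f1(reference, reference)
paraphrase_score = rouge1_f1(paraphrase, reference)
empty_score = rouge1_f1("", reference)
print(copy_score, paraphrase_score, empty_score)
```

The paraphrase conveys roughly the same information but is penalized for using different words. That is the main weakness of n-gram overlap metrics to discuss in the pros and cons.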

# Imports

In [None]:
!pip install transformers datasets evaluate peft py7zr rouge_score

import nltk
nltk.download('punkt')
import torch
import pandas as pd
import numpy as np
import evaluate
from torch.optim import AdamW  # the AdamW class shipped with transformers is deprecated
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer, DataCollatorForSeq2Seq
from datasets import load_dataset




[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
class CFG:
    model_path = 'facebook/bart-base'
    dataset_path = 'samsum'
    max_length = 1024
    # Training parameters
    fp16 = True
    learning_rate = 1e-5
    # weight_decay = 0.01
    num_epochs = 1
    per_device_batch_size = 4



# Finetuning the [BART-base](https://huggingface.co/facebook/bart-base) model on the [SAMSum Corpus](https://huggingface.co/datasets/samsum)

# EDA

In [None]:
dataset = load_dataset(CFG.dataset_path)
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 14732
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 819
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 818
    })
})


The SAMSum dataset already has training, validation, and test splits.

Take a look at the text structure in this dataset.

In [None]:
for split in ['train', 'validation', 'test']:
    print(f'--------------------   {split.upper()} Split  --------------------')
    for i in range(1, 31, 10):
        print(dataset[split][i]['dialogue'])
        print(f"\n{dataset[split][i]['summary']}\n\n")

--------------------   TRAIN Split  --------------------
Olivia: Who are you voting for in this election? 
Oliver: Liberals as always.
Olivia: Me too!!
Oliver: Great

Olivia and Olivier are voting for liberals in this election. 


Mark: I just shipped the goods
Mark: Tomorrow I’ll send you the tracking number
George: Thanks!

Mark just shipped the goods and he will send George the tracking number tomorrow.


Aria: You won't believe who I've just met!
Aria: Charlie Evans!
Maverick: Oh God, I haven't seen him from ages!
Maverick: How is he doing?
Aria: He's doing great. :)
Aria: He got married, he runs a small family business, which he is very passionate about and generally he seems to be a happy and fulfilled man. :)
Aria: Oh, and he has two absolutely adorable daughters. :)
Aria: It was so nice to meet him, he's such a sweet soul.
Maverick: I’m glad to hear that. :)
Maverick: Time flies so fast, doesn't it?
Aria: It does. :) Recently I’ve met Cooper Roy, I'm sure you rem

Each example consists of a dialogue and its summary, which together serve as the training data.
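
One simple way to quantify this structure is to compare the length of each dialogue with the length of its summary. The sketch below assumes whitespace word counts as a rough proxy for length. The helper names `word_count` and `compression_ratio` are hypothetical, and the sample pair is copied from the train split output above.

```python
def word_count(text):
    """Count whitespace-separated tokens as a rough length measure."""
    return len(text.split())


def compression_ratio(dialogue, summary):
    """How many dialogue words there are per summary word."""
    return word_count(dialogue) / max(word_count(summary), 1)


dialogue = (
    "Mark: I just shipped the goods\n"
    "Mark: Tomorrow I'll send you the tracking number\n"
    "George: Thanks!"
)
summary = "Mark just shipped the goods and he will send George the tracking number tomorrow."

print(word_count(dialogue), word_count(summary))
print(round(compression_ratio(dialogue, summary), 2))
```

Applying the same helpers over `dataset['train']` would give a length distribution, which can inform the choice of `max_length` for inputs and targets.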

## Data processing

In [None]:
# Create tokenizer instance
tokenizer = AutoTokenizer.from_pretrained(CFG.model_path)

In [None]:
def tokenize_dataset(examples):
    """
    Tokenizes a batch of examples and returns model inputs ready for finetuning.
    """
    # Dialogues are the encoder inputs; anything beyond max_length tokens is truncated
    model_inputs = tokenizer(examples['dialogue'], max_length=CFG.max_length, truncation=True)
    # Summaries are the targets; their token ids become the labels
    labels = tokenizer(examples['summary'], max_length=CFG.max_length, truncation=True)

    model_inputs['labels'] = labels['input_ids']
    return model_inputs

In [None]:
tokenized_dataset = dataset.map(tokenize_dataset, batched=True, remove_columns=['id', 'dialogue', 'summary'])

In [None]:
print(tokenized_dataset)

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 14732
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 819
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 818
    })
})


In [None]:
tokenizer.batch_decode(tokenized_dataset['train'][0:1]['input_ids'], skip_special_tokens=True)

["Amanda: I baked  cookies. Do you want some?\r\nJerry: Sure!\r\nAmanda: I'll bring you tomorrow :-)"]

Create the ROUGE metric and a custom `compute_metrics` function, adapted from the [HuggingFace summarization guide](https://huggingface.co/docs/transformers/tasks/summarization#evaluate).

In [None]:
rouge = evaluate.load("rouge")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # Some models return a tuple; the generated token ids come first
    if isinstance(predictions, tuple):
        predictions = predictions[0]
    # -100 marks padding and cannot be decoded, so map it back to the pad token
    predictions = np.where(predictions != -100, predictions, tokenizer.pad_token_id)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    # Average number of non-padding tokens per generated summary
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)
    return {k: round(v, 4) for k, v in result.items()}


# def postprocess_text(preds, labels):
#     preds = [pred.strip() for pred in preds]
#     labels = [label.strip() for label in labels]

#     # rougeLSum expects newline after each sentence
#     preds = ["\n".join(nltk.sent_tokenize(pred)) for pred in preds]
#     labels = ["\n".join(nltk.sent_tokenize(label)) for label in labels]

#     return preds, labels


# def compute_metrics(eval_preds):
#     preds, labels = eval_preds
#     if isinstance(preds, tuple):
#         preds = preds[0]
#     # Replace -100s used for padding as we can't decode them
#     preds = np.where(preds != -100, preds, tokenizer.pad_token_id)
#     decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
#     labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
#     decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

#     # Some simple post-processing
#     decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

#     result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
#     result = {k: round(v * 100, 4) for k, v in result.items()}
#     prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
#     result["gen_len"] = np.mean(prediction_lens)
#     return result
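
The `-100` replacement in `compute_metrics` can look arbitrary, so here is a small plain-Python sketch of where those values come from. `DataCollatorForSeq2Seq` pads label sequences with `-100` by default so that the loss ignores the padded positions, and the metric code has to undo that before decoding. The helper names below are hypothetical and the token ids are invented. `PAD_TOKEN_ID = 1` is an assumption that matches BART's pad token; in the notebook you would use `tokenizer.pad_token_id`.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips
PAD_TOKEN_ID = 1     # assumed BART pad id; use tokenizer.pad_token_id in practice


def pad_labels(batch, multiple_of=8):
    """Pad every label row with IGNORE_INDEX up to a shared length."""
    longest = max(len(row) for row in batch)
    # Round the target length up to the next multiple (ceiling division)
    target = -(-longest // multiple_of) * multiple_of
    return [row + [IGNORE_INDEX] * (target - len(row)) for row in batch]


def restore_pad_ids(batch):
    """Swap IGNORE_INDEX for the pad id so the rows can be decoded."""
    return [[PAD_TOKEN_ID if tok == IGNORE_INDEX else tok for tok in row] for row in batch]


# Invented token ids for two summaries of different length
labels = [[0, 10, 11, 2], [0, 20, 21, 22, 23, 2]]
padded = pad_labels(labels)
decodable = restore_pad_ids(padded)
print(padded[0])
print(decodable[0])
```

`restore_pad_ids` mirrors what `np.where(labels != -100, labels, tokenizer.pad_token_id)` does on the NumPy arrays the trainer passes to the metric function.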

# Modeling

In [None]:
model = AutoModelForSeq2SeqLM.from_pretrained(CFG.model_path)

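# Use the configured learning rate; AdamW's default (1e-3) is far too high for finetuning
optim = AdamW(model.parameters(), lr=CFG.learning_rate)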

data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    pad_to_multiple_of=8 if CFG.fp16 else None,
)



In [None]:
model

BartForConditionalGeneration(
  (model): BartModel(
    (shared): Embedding(50265, 768, padding_idx=1)
    (encoder): BartEncoder(
      (embed_tokens): Embedding(50265, 768, padding_idx=1)
      (embed_positions): BartLearnedPositionalEmbedding(1026, 768)
      (layers): ModuleList(
        (0-5): 6 x BartEncoderLayer(
          (self_attn): BartAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=True)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
          (final_layer_norm): LayerNorm((768,), eps=

In [None]:
train_args = Seq2SeqTrainingArguments(
    output_dir = './bart-base-summarization',
    per_device_train_batch_size = CFG.per_device_batch_size,
    per_device_eval_batch_size = CFG.per_device_batch_size,
    fp16=CFG.fp16,
    do_train = True,
    do_eval = True,
    evaluation_strategy = 'steps',
    eval_steps = 500,
    save_strategy = 'steps',
    logging_strategy = 'steps',
    logging_steps = 100,
    dataloader_drop_last = True,
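    # skips page-locked (pinned) host buffers; this reduces host RAM pressure, not VRAM
    dataloader_pin_memory = False,
    # CFG.num_epochs was defined but never passed; the Trainer default is 3 epochs
    num_train_epochs = CFG.num_epochs,
    predict_with_generate = True,
)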

trainer = Seq2SeqTrainer(
    model = model,
    args = train_args,
    train_dataset = tokenized_dataset['train'],
    eval_dataset = tokenized_dataset['validation'],
    optimizers = (optim, None),
    tokenizer = tokenizer,
    data_collator = data_collator,
    compute_metrics=compute_metrics
)

In [None]:
trainer.train()

You're using a BartTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


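| Step | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Gen Len |
|------|---------------|-----------------|--------|--------|--------|-----------|---------|
| 500  | 7.574         | 7.080812        | 0.0    | 0.0    | 0.0    | 0.0       | 3.0     |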




['', '', '', ...  (output truncated; every decoded prediction printed here is an empty string)

KeyboardInterrupt: ignored

# Prompt experiments
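
This section is still empty in the notebook. One possible way to set up a prompt experiment is to prepend an instruction to every dialogue before tokenization. The function `add_prompt` and the prompt text are hypothetical, and the sample batch reuses a dialogue from the EDA output. Since `facebook/bart-base` is not instruction-tuned, a prefix like this probably acts more as a task marker learned during finetuning than as an instruction the model understands out of the box.

```python
def add_prompt(examples, prompt="Summarize the following dialogue:\n"):
    """Prepend a fixed prompt to every dialogue in a batch."""
    return {"dialogue": [prompt + dialogue for dialogue in examples["dialogue"]]}


# A batch in the column-oriented layout that datasets.map(batched=True) passes in
batch = {
    "dialogue": ["Mark: I just shipped the goods\nGeorge: Thanks!"],
    "summary": ["Mark just shipped the goods."],
}
prompted = add_prompt(batch)
print(prompted["dialogue"][0])
```

With the real data this could be applied as `dataset.map(add_prompt, batched=True)` before `tokenize_dataset`. Different prompts can be passed via `fn_kwargs={"prompt": ...}` and compared by their ROUGE scores on the validation split.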

# Metrics results

# LoRA finetuning
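
This section is also still empty. As a first step, the arithmetic below sketches why LoRA is attractive for this model. It counts the parameters of one 768x768 attention projection (the size shown in the model printout above) and of a rank-8 LoRA adapter for the same layer. The rank of 8 is an assumed example value, and the helper names are hypothetical.

```python
def linear_params(in_features, out_features, bias=True):
    """Parameter count of a dense layer."""
    return in_features * out_features + (out_features if bias else 0)


def lora_params(in_features, out_features, rank):
    """Parameter count of the two low-rank matrices A (rank x in) and B (out x rank)."""
    return rank * in_features + out_features * rank


full = linear_params(768, 768)            # one q_proj or v_proj layer
adapter = lora_params(768, 768, rank=8)   # rank chosen for illustration
print(full, adapter, round(adapter / full, 4))
```

With the `peft` package installed in the imports cell, an actual run would typically wrap the model with `get_peft_model` and a `LoraConfig` (for example with `r=8` and `target_modules=["q_proj", "v_proj"]`), so that only the adapter weights are trained. Those settings are an assumption here, not something the notebook has tested.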