# Homework

Read about the differences between GPT-3.5 and GPT-4.

Read about metrics for generative NLP.

**Advanced**: Generative models are usually very large. Read about model quantization, which may help with inference for large models such as GPT.

**Theory** (5 points): Google form questions.

**Practical task** (10 points):
1. Choose one:
    * Fine-tune a transformer model for summarization on https://huggingface.co/datasets/samsum.
    * Fine-tune a transformer model for translation on a dataset of your choice.
2. Experiment with different prompts.
3. Based on the task you chose, pick a few metrics used in generative NLP (BLEU, ROUGE, etc.), evaluate your fine-tuned models with them, and describe their pros and cons relative to the generations your model produces.
4. Optionally, try LoRA or prefix tuning for fine-tuning the model.

## Imports

In [14]:
%pip install py7zr evaluate rouge_score

In [15]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorForSeq2Seq, AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

import evaluate
import numpy as np

## Data preparation

In [16]:
samsum = load_dataset("samsum")

In [17]:
checkpoint = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [55]:
def preprocess_function(examples, prefix):
    # Prepend the task prefix to every dialogue and tokenize the model inputs.
    inputs = [prefix + doc for doc in examples["dialogue"]]
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)

    # Tokenize the reference summaries as targets.
    labels = tokenizer(text_target=examples["summary"], max_length=128, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

The T5 model was pre-trained on several tasks, including summarization. During pre-training the prefix `"summarize: "` was used for the summarization task, so it is the natural choice for summarization inference. We investigate what happens if we replace this prompt with the more specific `"summarize the following dialogue: "` during fine-tuning on the dialogue dataset.

In [52]:
tokenized_samsum = samsum.map(preprocess_function, batched=True, fn_kwargs={"prefix": "summarize: "})
tokenized_samsum_new_prompt = samsum.map(preprocess_function, batched=True, fn_kwargs={"prefix": "summarize the following dialogue: "})

In [9]:
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint)

## Training

In [10]:
rouge = evaluate.load("rouge")

In [11]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # The collator pads labels with -100 (ignored by the loss); restore the pad token id before decoding.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # ROUGE-1/2/L/Lsum between generated and reference summaries.
    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    # Average number of non-padding tokens generated per summary.
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}

In [12]:
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

In [None]:
training_args = Seq2SeqTrainingArguments(
    output_dir="/kaggle/working/t5-samsum",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_strategy='epoch',
    save_total_limit=3,
    num_train_epochs=6,
    predict_with_generate=True,
    fp16=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_samsum["train"],
    eval_dataset=tokenized_samsum["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

In [None]:
import wandb
from kaggle_secrets import UserSecretsClient

user_secrets = UserSecretsClient()

my_secret = user_secrets.get_secret("wandb_api") 

wandb.login(key=my_secret)

In [None]:
trainer.train()
wandb.finish()

| Epoch | Training Loss | Validation Loss | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-Lsum | Gen Len |
|---|---|---|---|---|---|---|---|
| 1 | 2.2572 | 1.873949 | 0.3969 | 0.1706 | 0.3324 | 0.332 | 16.3655 |
| 2 | 2.0346 | 1.82632 | 0.4045 | 0.1778 | 0.3392 | 0.3389 | 16.5098 |
| 3 | 1.9858 | 1.80339 | 0.4092 | 0.1822 | 0.3442 | 0.3443 | 16.467 |
| 4 | 1.9585 | 1.786948 | 0.4145 | 0.1834 | 0.3478 | 0.3479 | 16.577 |
| 5 | 1.9456 | 1.782817 | 0.4143 | 0.185 | 0.3482 | 0.3481 | 16.6137 |
| 6 | 1.931 | 1.780898 | 0.4158 | 0.1858 | 0.3492 | 0.3491 | 16.6369 |

In [None]:
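# Note: as written, `model` is the same object that was just fine-tuned with the "summarize: " prefix,
# so this run continues from those weights rather than from the original t5-small checkpoint.
# For an independent comparison, reload it first with AutoModelForSeq2SeqLM.from_pretrained(checkpoint).
training_args.output_dir = "/kaggle/working/t5-samsum-newprompt"
trainer_new_prompt = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_samsum_new_prompt["train"],
    eval_dataset=tokenized_samsum_new_prompt["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)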

In [None]:
trainer_new_prompt.train()
wandb.finish()

| Epoch | Training Loss | Validation Loss | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-Lsum | Gen Len |
|---|---|---|---|---|---|---|---|
| 1 | 1.9212 | 1.761207 | 0.4219 | 0.1927 | 0.3553 | 0.3553 | 16.6479 |
| 2 | 1.8856 | 1.746259 | 0.4228 | 0.1948 | 0.3578 | 0.3577 | 16.698 |
| 3 | 1.8707 | 1.73912 | 0.4243 | 0.1979 | 0.3616 | 0.3616 | 16.5269 |
| 4 | 1.8589 | 1.73059 | 0.4264 | 0.1997 | 0.3615 | 0.3617 | 16.6271 |
| 5 | 1.8495 | 1.728655 | 0.4281 | 0.2003 | 0.3632 | 0.3633 | 16.6259 |
| 6 | 1.8413 | 1.728808 | 0.4289 | 0.2007 | 0.3634 | 0.3634 | 16.7017 |

## Evaluation

Since there are no significant signs of overfitting (the validation loss is essentially flat over the last epochs of both runs), we use the last checkpoint of each run for evaluation.
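
The choice of the last checkpoint can be double-checked against the validation losses logged during training. The snippet below is a small sketch that copies those values from the two training logs above and reports which epoch had the lowest validation loss; the helper name `best_epoch` is introduced here for illustration only.

```python
# Validation loss per epoch, copied from the two training logs above.
val_loss = {
    "summarize: ": [1.873949, 1.82632, 1.80339, 1.786948, 1.782817, 1.780898],
    "summarize the following dialogue: ": [1.761207, 1.746259, 1.73912, 1.73059, 1.728655, 1.728808],
}


def best_epoch(losses):
    """Return the 1-based epoch with the lowest validation loss."""
    return min(range(len(losses)), key=losses.__getitem__) + 1


for prefix, losses in val_loss.items():
    best = best_epoch(losses)
    # Relative gap between the last epoch and the best one (0.0 if the last epoch is the best).
    gap = (losses[-1] - losses[best - 1]) / losses[best - 1]
    print(f"{prefix!r}: best epoch {best}, last-epoch gap {gap:.4%}")
```

For the second run the minimum falls on epoch 5 rather than epoch 6, but the gap is below 0.01%, which is consistent with treating the last checkpoint as equivalent to the best one.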

ROUGE score (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics commonly used for text summarization tasks. ROUGE was designed to evaluate the quality of machine-generated summaries by comparing them to reference summaries provided by humans. Here we will compare several different variants of the ROUGE scores.

**ROUGE-N**

Measures overlap of n-grams (contiguous sequences of n words) between the reference and generated summaries.
- ROUGE-1: variant with unigrams
- ROUGE-2: variant with bigrams

ROUGE-N is simple and cheap to compute, and it rewards exact phrase matches. Its main drawbacks: ROUGE-1 ignores word order entirely, and higher-order variants capture it only within an n-gram window, which limits how well the metric reflects the coherence and fluency of summaries. It also penalizes synonyms and paraphrases that convey the same meaning with different words.
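
To make these properties concrete, here is a minimal pure-Python sketch of ROUGE-N. It is a simplified illustration, not the `rouge_score` implementation used in this notebook: it tokenizes on whitespace and applies no stemming, and the function names and toy sentences are made up for the example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference, prediction, n=1):
    """Toy ROUGE-N: clipped n-gram overlap as precision, recall and F1."""
    ref_counts = ngrams(reference.lower().split(), n)
    pred_counts = ngrams(prediction.lower().split(), n)
    # Counter intersection keeps the minimum count per n-gram (clipped matches).
    overlap = sum((ref_counts & pred_counts).values())
    recall = overlap / max(sum(ref_counts.values()), 1)
    precision = overlap / max(sum(pred_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


reference = "hilary has the keys"
same_words_shuffled = "keys the has hilary"
paraphrase = "hilary holds the keys"

print(rouge_n(reference, same_words_shuffled, n=1))  # unigrams ignore word order
print(rouge_n(reference, same_words_shuffled, n=2))  # no bigram survives the shuffle
print(rouge_n(reference, paraphrase, n=1))           # the synonym counts as a miss
```

The shuffled sentence gets a perfect ROUGE-1 score and a zero ROUGE-2 score, while the paraphrase loses a quarter of its ROUGE-1 score for a single synonym.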

**ROUGE-L**

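Based on the longest common subsequence (LCS) between the reference and generated summaries: matching words must appear in the same order but need not be contiguous.
- ROUGE-L: computes the LCS over the whole summary, treated as a single sequence (newlines are ignored)
- ROUGE-Lsum: splits the text into sentences on '\n', computes the LCS per sentence and aggregates the result into a summary-level score

As computed by `evaluate` in this notebook, both are reported as F-measures. The two variants differ only for multi-line texts, which is likely why their values are almost identical in the results here.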

Because the longest common subsequence respects word order, ROUGE-L can reflect the structure of a summary better than ROUGE-N. It does not require matches to be contiguous, so it tolerates inserted or skipped words and is more forgiving in such cases. On the downside, a long common subsequence does not guarantee semantic similarity, and, like ROUGE-N, the metric gives no credit for synonyms or paraphrases.
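
The LCS idea can be sketched in a few lines of Python. As with any toy version, this is a simplified illustration rather than the `rouge_score` implementation: it uses whitespace tokenization, no stemming, and treats each text as a single sequence. The function names and example sentences are made up.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, prediction):
    """Toy ROUGE-L: LCS-based precision, recall and F1 over whitespace tokens."""
    ref, pred = reference.lower().split(), prediction.lower().split()
    lcs = lcs_length(ref, pred)
    if lcs == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    recall = lcs / len(ref)
    precision = lcs / len(pred)
    return {"precision": precision, "recall": recall,
            "f1": 2 * precision * recall / (precision + recall)}


reference = "benjamin will take the keys at lunchtime"
prediction = "benjamin will probably take the keys"
shuffled = "keys the take will benjamin"

print(rouge_l(reference, prediction))  # the inserted word does not break the match
print(rouge_l(reference, shuffled))    # reversing the order leaves an LCS of one word
```

The inserted word "probably" leaves a five-word common subsequence intact, whereas reversing the word order reduces the LCS to a single word even though all the words still appear in the reference.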

In [None]:
!pip install py7zr evaluate rouge_score

In [None]:
import evaluate

rouge = evaluate.load("rouge")

In [None]:
from transformers import pipeline, AutoModelForSeq2SeqLM, AutoTokenizer
from datasets import load_dataset

In [None]:
samsum_test = load_dataset('samsum', split='test')

In [None]:
def add_prefix(examples):
    examples['dialogue_prompt'] = ["summarize: " + doc for doc in examples['dialogue']]
    examples['dialogue_prompt_new'] = ["summarize the following dialogue: " + doc for doc in examples['dialogue']]
    return examples

samsum_test = samsum_test.map(add_prefix, batched=True)

In [None]:
checkpoint_base = 't5-small'
checkpoint_1 = '/kaggle/input/t5-samsum-tuning/t5-samsum/checkpoint-5526'
checkpoint_2 = '/kaggle/input/t5-samsum-tuning/t5-samsum-newprompt/checkpoint-5526'

In [None]:
tokenizer = AutoTokenizer.from_pretrained(checkpoint_base)
model_base = AutoModelForSeq2SeqLM.from_pretrained(checkpoint_base)
model_1 = AutoModelForSeq2SeqLM.from_pretrained(checkpoint_1)
model_2 = AutoModelForSeq2SeqLM.from_pretrained(checkpoint_2)

In [None]:
pipeline_base = pipeline("summarization", model_base, tokenizer=tokenizer, max_length=300, device=0)
pipeline_1 = pipeline("summarization", model_1, tokenizer=tokenizer, max_length=300, device=0)
pipeline_2 = pipeline("summarization", model_2, tokenizer=tokenizer, max_length=300, device=0)

In [None]:
print(samsum_test['dialogue'][5])

Benjamin: Hey guys, what are we doing with the keys today?
Hilary: I've got them. Whoever wants them can meet me at lunchtime or after
Elliot: I'm ok. We're meeting for the drinks in the evening anyway and I guess we'll be going back to the apartment together?
Hilary: Yeah, I guess so
Daniel: I'm with Hilary atm and won't let go of her for the rest of the day, so any option you guys choose is good for me
Benjamin: Hmm I might actually pass by at lunchtime, take the keys and go take a nap. I'm sooo tired after yesterday
Hilary: Sounds good. We'll be having lunch with some French people (the ones who work on the history of food in colonial Mexico - I already see you yawning your head off)
Benjamin: YAAAAWN 🙊 Where and where are you meeting?
Hilary: So I'm meeting them at the entrance to the conference hall at 2 pm and then we'll head to this place called La Cantina. Italian cuisine, which is quite funny, but that's what they've chosen
Benjamin: Interesting 😱 To be honest, Hilary, I almos

In [None]:
print(pipeline_base(samsum_test['dialogue_prompt'][5])[0]["summary_text"])

we're meeting for the drinks in the evening anyway and we'll be going back to the apartment together? Hilary: I'm sooo tired after yesterday .


In [None]:
print(pipeline_1(samsum_test['dialogue_prompt'][5])[0]["summary_text"])

Benjamin and Elliot will meet at lunchtime and take the keys and take a nap. They'll meet at the entrance to the conference hall at 2 pm and then go to La Cantina.


In [None]:
print(pipeline_2(samsum_test['dialogue_prompt_new'][5])[0]["summary_text"])

Hilary and Elliot will meet at lunchtime, take the keys and take a nap. They'll meet at La Cantina at 2 pm.


In [None]:
from tqdm.auto import tqdm
from transformers.pipelines.pt_utils import KeyDataset
from transformers import logging
logging.set_verbosity_error()

In [None]:
summaries_base = []
for out in tqdm(pipeline_base(KeyDataset(samsum_test, "dialogue_prompt"))):
    summaries_base.append(out[0]['summary_text'])

In [None]:
summaries_1 = []
for out in tqdm(pipeline_1(KeyDataset(samsum_test, "dialogue_prompt"))):
    summaries_1.append(out[0]['summary_text'])

In [None]:
summaries_2 = []
for out in tqdm(pipeline_2(KeyDataset(samsum_test, "dialogue_prompt_new"))):
    summaries_2.append(out[0]['summary_text'])

In [None]:
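# The three lists built above are already separate, so no slicing is needed:
# re-slicing `summaries_base` here would leave `summaries_1` and `summaries_2` empty.
n_test = len(samsum_test)  # 819 test examples
assert len(summaries_base) == len(summaries_1) == len(summaries_2) == n_test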

In [None]:
import pandas as pd

In [None]:
res = []
models = ['base', 'fine-tuned', 'fine-tuned new prompt']
preds = [summaries_base, summaries_1, summaries_2]
for model, predictions in zip(models, preds):
    row = {"model": model}
    metrics = rouge.compute(predictions=predictions, references=samsum_test['summary'], use_stemmer=True)
    row.update(metrics)
    res.append(row)
res = pd.DataFrame.from_records(res)
res

| model | rouge1 | rouge2 | rougeL | rougeLsum |
|---|---|---|---|---|
| base | 0.29153 | 0.084075 | 0.217757 | 0.217957 |
| fine-tuned | 0.423714 | 0.184818 | 0.327715 | 0.327795 |
| fine-tuned new prompt | 0.432573 | 0.193276 | 0.336552 | 0.336778 |


### Conclusion

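Although using a custom prompt during fine-tuning breaks consistency with the pre-training stage, the model fine-tuned with the custom prompt scored higher on all considered ROUGE metrics on the test set (for example, ROUGE-1 of 0.433 versus 0.424). The comparison is not fully controlled, however: as the training code is written, the second run starts from the weights already fine-tuned with the `"summarize: "` prefix, so the gain may come from the additional training epochs rather than from the prompt itself.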
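
The gap between the two fine-tuned models is under one ROUGE point, so it may be worth checking whether it exceeds sampling noise. One common option, which was not part of this experiment, is a paired bootstrap over test examples. The sketch below assumes per-example scores are available (for instance from `rouge.compute(..., use_aggregator=False)`); the function name and the toy scores are made up for illustration and are not results from this notebook.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_resamples=1000, seed=0):
    """Fraction of bootstrap resamples in which system B's total score beats system A's."""
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        # Resample test examples with replacement, keeping the A/B pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        diff = sum(scores_b[i] - scores_a[i] for i in idx)
        wins += diff > 0
    return wins / n_resamples


# Made-up per-example scores, for illustration only.
toy_a = [0.40, 0.35, 0.50, 0.42, 0.38]
toy_b = [0.45, 0.36, 0.55, 0.41, 0.44]
print(paired_bootstrap(toy_a, toy_b))
```

A value close to 1.0 would suggest that system B is better on most resamples of the test set; a value near 0.5 would suggest the difference is within noise.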

## Save predictions

In [None]:
samsum_test = samsum_test.add_column('summaries_base', summaries_base)
samsum_test = samsum_test.add_column('summaries_tuned', summaries_1)
samsum_test = samsum_test.add_column('summaries_tuned_new_prompt', summaries_2)

In [None]:
samsum_test.to_csv("/kaggle/working/predictions.csv")