# Homework

Read about the differences between GPT-3.5 and GPT-4.

Read about metrics for generative NLP.

**Advanced**: Generative models are usually very large. Read about model quantization, which can reduce the memory and compute cost of inference for large models such as GPT.
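
To make the idea of quantization concrete, here is a minimal sketch of symmetric per-tensor int8 quantization written with plain Python lists. It is an illustration only: the function names `quantize_int8` and `dequantize` and the example weights are made up for this note, and real libraries typically use more elaborate schemes (per-channel scales, calibration, 4-bit formats).

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of a list of floats to int8 codes."""
    max_abs = max(abs(w) for w in weights)
    # Map the largest magnitude to 127; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer code and clamp to the symmetric int8 range.
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from the integer codes."""
    return [c * scale for c in codes]


# Hypothetical weights, chosen only for illustration.
weights = [0.5, -1.27, 0.003, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

# The rounding error per weight is at most half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

Each weight is stored in one byte instead of four (for float32), at the cost of a rounding error bounded by `scale / 2`. Note how the small weight `0.003` collapses to code `0`.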

**Theory** (5 points): Google form questions.

**Practical task** (10 points):
1. Choose one:
    * Fine-tune a transformer model for summarization on https://huggingface.co/datasets/samsum.
    * Fine-tune a transformer model for translation on a dataset of your choice.
2. Experiment with different prompts.
3. Based on the task you chose, pick a few metrics used in generative NLP (BLEU, ROUGE, etc.), evaluate your fine-tuned models with them, and describe their pros and cons relative to the generations your model produces.
4. If you want, you can try using LoRA or prefix tuning to fine-tune the model.
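
As a rough illustration of the LoRA option in the task list, the sketch below adds a trainable low-rank update to a frozen weight matrix and compares parameter counts. It uses plain Python lists rather than a deep learning framework. The dimensions, the helper names `matmul` and `lora_forward`, and the scaling `alpha / r` follow the usual description of LoRA but are assumptions of this example, not the API of any particular library.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_forward(x, W, A, B, alpha, r):
    """Compute x @ W + (alpha / r) * (x @ A @ B) for row-vector inputs x."""
    base = matmul(x, W)
    update = matmul(matmul(x, A), B)
    s = alpha / r
    return [[b + s * u for b, u in zip(brow, urow)] for brow, urow in zip(base, update)]


# Hypothetical sizes, chosen only for illustration.
d_in, d_out, r, alpha = 8, 8, 2, 4
rng = random.Random(0)

W = [[rng.gauss(0, 1) for _ in range(d_out)] for _ in range(d_in)]  # frozen pretrained weight
A = [[rng.gauss(0, 0.01) for _ in range(r)] for _ in range(d_in)]   # trainable down-projection
B = [[0.0] * d_out for _ in range(r)]                                # trainable up-projection, zero init

x = [[1.0] * d_in]
y = lora_forward(x, W, A, B, alpha, r)

full_params = d_in * d_out        # parameters updated by full fine-tuning
lora_params = r * (d_in + d_out)  # parameters updated by LoRA
print(full_params, lora_params)
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and only `A` and `B` would receive gradients during fine-tuning. The saving grows with layer size: for a 512 x 512 matrix and `r = 8`, the update has 8,192 trainable parameters instead of 262,144.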

## Imports

In [14]:
!pip install py7zr evaluate rouge_score

In [15]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorForSeq2Seq, AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

import evaluate
import numpy as np

In [16]:
samsum = load_dataset("samsum")

In [17]:
checkpoint = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [20]:
def preprocess_function(examples, prefix):
    inputs = [prefix + doc for doc in examples["dialogue"]]
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)

    labels = tokenizer(text_target=examples["summary"], max_length=128, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

The T5 model was pre-trained on a mixture of tasks, including summarization. During pre-training, the prefix `"summarize: "` marked the summarization task, so the same prefix should be used for summarization at inference time. We investigate what happens if we replace it with the more specific `"summarize the following dialogue: "` when fine-tuning on the dialogue dataset.
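
The effect of the prefix on the model input can be seen without loading any model. The snippet below mirrors the string concatenation done in `preprocess_function`. The helper name `add_prefix` and the sample dialogue are made up for illustration and are not taken from the dataset.

```python
def add_prefix(dialogues, prefix):
    """Prepend the task prefix to every dialogue, as preprocess_function does."""
    return [prefix + doc for doc in dialogues]


# Hypothetical dialogue, written only for this example.
dialogues = ["Anna: Are we still meeting at 5?\nTom: Yes, see you at the cafe."]

with_default = add_prefix(dialogues, "summarize: ")
with_new = add_prefix(dialogues, "summarize the following dialogue: ")

print(with_default[0])
print(with_new[0])
```

The two variants differ only in the leading text, so any difference in the fine-tuned models' scores can be attributed to the prompt, provided both runs start from the same pretrained weights.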

In [21]:
tokenized_samsum = samsum.map(preprocess_function, batched=True, fn_kwargs={"prefix": "summarize: "})
tokenized_samsum_new_prompt = samsum.map(preprocess_function, batched=True, fn_kwargs={"prefix": "summarize the following dialogue: "})

In [9]:
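# The `model` argument expects a model instance, not a checkpoint name string.
# It is optional, so it is omitted here; the model builds decoder inputs from the labels itself.
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer)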

In [10]:
rouge = evaluate.load("rouge")
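
To see what ROUGE measures, and why it has both strengths and weaknesses, here is a simplified ROUGE-1 computation based on whitespace tokens. It is a sketch for intuition only: the function name `rouge1` and the example sentences are made up, and the `rouge_score` package used by `evaluate` applies its own tokenization and optional stemming, so its numbers can differ.

```python
from collections import Counter


def rouge1(prediction, reference):
    """Unigram-overlap precision, recall and F1 on lowercased whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Clipped overlap: each reference token can be matched at most once.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    precision = overlap / len(pred_tokens) if pred_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical summaries, written only for this example.
reference = "amanda baked cookies for jerry"
exact_words = rouge1("amanda baked cookies", reference)
paraphrase = rouge1("she made biscuits", reference)

print(exact_words)
print(paraphrase)
```

The first prediction reuses the reference's words and scores well, while the second conveys a similar meaning with different words and scores zero. This is the main drawback of n-gram overlap metrics: they are cheap and reproducible but blind to paraphrase.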

In [11]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}
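
Two details of `compute_metrics` are easy to miss: label positions filled with `-100` (the index ignored by the loss) must be turned back into padding before decoding, and the generated length counts only non-padding tokens. The sketch below reproduces both steps with plain lists. The token ids are invented, and the pad id of `0` is an assumption that matches T5 tokenizers but should be read from `tokenizer.pad_token_id` in real code.

```python
IGNORE_INDEX = -100  # label value that the loss function skips
PAD_TOKEN_ID = 0     # assumption: T5 tokenizers use 0 as the padding id


def restore_padding(label_rows, pad_token_id=PAD_TOKEN_ID):
    """Replace ignored label positions with the pad id so they can be decoded."""
    return [[pad_token_id if t == IGNORE_INDEX else t for t in row] for row in label_rows]


def generated_length(pred_row, pad_token_id=PAD_TOKEN_ID):
    """Count the non-padding tokens in one generated sequence."""
    return sum(1 for t in pred_row if t != pad_token_id)


# Hypothetical token ids, chosen only for illustration.
labels = [[71, 3, 9, 1, -100, -100], [8, 1, -100, -100, -100, -100]]
predictions = [[0, 71, 3, 1, 0, 0], [0, 8, 9, 4, 1, 0]]

clean_labels = restore_padding(labels)
lengths = [generated_length(row) for row in predictions]
mean_len = sum(lengths) / len(lengths)
print(clean_labels, lengths, mean_len)
```

Generated T5 sequences begin with the decoder start token, which shares the pad id, so it is excluded from the count along with trailing padding.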

In [12]:
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

In [13]:
import torch

training_args = Seq2SeqTrainingArguments(
    output_dir="t5-samsum",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=1,
    num_train_epochs=6,
    predict_with_generate=True,
    # fp16=True raises "ValueError: FP16 Mixed precision training ... can only be
    # used on CUDA or NPU devices or certain XPU devices" on other hardware,
    # so enable mixed precision only when a CUDA GPU is available.
    fp16=torch.cuda.is_available(),
)

# Trainer for the default "summarize: " prompt. The trainer for the new prompt
# is created in a later cell, after this run has finished, so that changing
# output_dir does not redirect this run's checkpoints.
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_samsum["train"],
    eval_dataset=tokenized_samsum["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

In [None]:
import wandb  # needed for wandb.finish(); assumes the wandb package is installed

trainer.train()
wandb.finish()

In [None]:
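# Reload the pretrained checkpoint so this run starts from the same weights as
# the first one; reusing `model` would continue from its fine-tuned state and
# confound the prompt comparison.
model_new_prompt = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

training_args.output_dir = "t5-samsum-newprompt"
trainer_new_prompt = Seq2SeqTrainer(
    model=model_new_prompt,
    args=training_args,
    train_dataset=tokenized_samsum_new_prompt["train"],
    eval_dataset=tokenized_samsum_new_prompt["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)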

In [None]:
import wandb  # needed for wandb.finish(); assumes the wandb package is installed

trainer_new_prompt.train()
wandb.finish()