## Fine-tuning a pretrained model and inference

#### Download the pretrained model and dataset
Dataset: `under-tree/labeled-multiple-choice`, which I generated for this task.

Model: `distilgpt2`

#### Fine-tuning the model

I tried several ways to speed up training:

1. PyTorch with devices
2. PyTorch with accelerator
3. Default Trainer with accelerator
4. Default Trainer
5. Trainer with a changed `batch_size`

**The last method (5) was the fastest.**

Perplexity after fine-tuning - **3.07**

#### Inference

1. I saved the model to the HF Hub
2. I created an inference pipeline (**Please take a look at the inference part**)

I ran inference on a GPU.
I tried different parameters for text generation:
* max_length
* num_beams
* temperature
* repetition_penalty
* do_sample
* top_k
* top_p

The final inference chains several generation passes, one per field: after each pass the text is truncated to the lines generated so far, and the prompt for the next field is appended. I think it works great!

#### Generated question example

Not that fluent, but it's a good start!
```
topic: biology
question: what can be used to determine the age of an organism
variants: (a) cell division (b) survival (c) rapid expansion (d) the rapid growth of a species (e) it needs them (f) genetic
answer: f
context: genetic information is used for determining the ages of organisms
```
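
Since the generated text puts one field on each line, it can be turned into a structured record with plain string handling. The snippet below is a minimal sketch and not part of the notebook: `parse_generated` and `FIELDS` are hypothetical names, and the regular expression assumes that variants are labelled `(a)` to `(h)` as in the dataset examples.

```python
import re

# Field names used in the prompt format (taken from the example above).
FIELDS = ("topic", "question", "variants", "answer", "context")

def parse_generated(text):
    """Split 'name: value' lines into a dict, keeping only the known fields."""
    parsed = {}
    for line in text.strip().splitlines():
        name, sep, value = line.partition(": ")
        if sep and name in FIELDS:
            parsed[name] = value.strip()
    return parsed

sample = (
    "topic: biology\n"
    "question: what can be used to determine the age of an organism\n"
    "variants: (a) cell division (b) survival (c) rapid expansion "
    "(d) the rapid growth of a species (e) it needs them (f) genetic\n"
    "answer: f\n"
    "context: genetic information is used for determining the ages of organisms\n"
)

record = parse_generated(sample)
# Map each variant letter to its text, e.g. 'a' -> 'cell division'.
options = {
    letter: option.strip()
    for letter, option in re.findall(r"\(([a-h])\)\s*([^()]*)", record["variants"])
}
print(options[record["answer"]])
```

A record like this makes it easy to check that the generated answer letter actually refers to one of the generated variants.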


In [None]:
%%bash
# create virtual environment on Colab
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

/usr/local/bin/python


bash: line 2: venv/bin/activate: No such file or directory


In [None]:
import transformers
import torch

In [4]:
from pprint import pprint
# Set this to False when running locally, where I can't use more than 1 process :(
onServer = True

if onServer:
  params = {'num_proc': 5, 'device': 0}
else:
  params = {'num_proc': 1, 'device': -1}
pprint(params)

{'device': 0, 'num_proc': 5}


In [None]:
from datasets import load_dataset
data = load_dataset('under-tree/labeled-multiple-choice', split='train')

Downloading readme:   0%|          | 0.00/493 [00:00<?, ?B/s]

Downloading and preparing dataset None/None to /root/.cache/huggingface/datasets/under-tree___parquet/under-tree--labeled-multiple-choice-8214d50786758969/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/1.29M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split:   0%|          | 0/36503 [00:00<?, ? examples/s]

Dataset parquet downloaded and prepared to /root/.cache/huggingface/datasets/under-tree___parquet/under-tree--labeled-multiple-choice-8214d50786758969/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec. Subsequent calls will reuse this data.


**Play with this cell and look at different dataset entries!**

In [None]:
import numpy as np
n = np.random.randint(0, len(data))
def gen_prompt(elem):
    # Split the question text from its answer variants and
    # return a dict with a single 'text' field holding the full prompt.
    question, variants = elem['formatted_question'].split('(a)', 1)
    return {'text': f'topic: {elem["topic"]}\nquestion: {question}\nvariants: (a){variants}\nanswer: {elem["answerKey"]}\ncontext: {elem["combinedfact"]}\n'}

print(gen_prompt(data[n])['text'])

topic: biology
question: where do colonies of coral form? 
variants: (a) pink water (b) allow growth (c) stale water (d) the environment (e) flavored water (f) complex (g) warm water (h) more abundant
answer: g
context: corals form large colonies in warm water



In [None]:
data_with_prompt = data.map(gen_prompt, batched=False, remove_columns=data.column_names, num_proc=params['num_proc'])

Map (num_proc=5):   0%|          | 0/36503 [00:00<?, ? examples/s]

In [None]:

from transformers import AutoTokenizer, AutoModelForCausalLM

checkpoint = 'distilgpt2'
# make download not verbose
tokenizer = AutoTokenizer.from_pretrained(checkpoint, pad_token='<|pad|>', use_fast=True, verbose=False)
special_tokens = {'additional_special_tokens': ['topic: ', 'question: ', 'variants: ', 'answer: ', 'context: ']}
tokenizer.add_special_tokens(special_tokens)

model = AutoModelForCausalLM.from_pretrained(checkpoint)
model.resize_token_embeddings(len(tokenizer))
None

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Embedding(50263, 768)

In [None]:
def encode(elem):
    return tokenizer(elem['text'], truncation=True)

data_encoded = data_with_prompt.map(encode, batched=True, remove_columns=data_with_prompt.column_names, num_proc=params['num_proc'])

Map (num_proc=5):   0%|          | 0/36503 [00:00<?, ? examples/s]

In [None]:
block_size = 128

def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder. We could add padding instead if the model supported it;
    # you can customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
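
To see what the grouping step does, here is a toy run with a block size of 4 instead of 128. This is a sketch only: `group_texts_demo` is a hypothetical copy of the function above that takes `block_size` as an argument, and the token ids are made up.

```python
def group_texts_demo(examples, block_size):
    # Same logic as group_texts, with block_size passed in explicitly.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated["input_ids"])
    # Drop the remainder that does not fill a whole block.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For causal language modelling the labels are the inputs themselves.
    result["labels"] = result["input_ids"].copy()
    return result

# Three "tokenized texts" of different lengths (10 tokens in total).
toy = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1], [1, 1, 1]],
}
blocks = group_texts_demo(toy, block_size=4)
print(blocks["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

The texts are joined end to end and cut into equal blocks, so a block can span two examples, and the last two tokens (9 and 10) are dropped because they do not fill a block.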

In [None]:
data_lm = data_encoded.map(group_texts, batched=True, num_proc=params['num_proc'])

Map (num_proc=5):   0%|          | 0/36503 [00:00<?, ? examples/s]

In [None]:
data_dict = data_lm.train_test_split(test_size=0.2)

In [None]:
# del training_args
# del trainer

In [None]:
from transformers import Trainer, TrainingArguments

batch_size_device = 40
modelname = 'choice-question-generator'
training_args = TrainingArguments(
    modelname,   
    evaluation_strategy='epoch',
    num_train_epochs=3,
    per_device_train_batch_size=batch_size_device,
    per_device_eval_batch_size=batch_size_device,
    push_to_hub=True
)

In [None]:
# default args are pretty good: https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=data_dict['train'],
    eval_dataset=data_dict['test'],
    tokenizer=tokenizer
)

Cloning https://huggingface.co/under-tree/choice-question-generator into local empty directory.



In [None]:
trainer.train()
# trainer.save_model('result/')

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss
1,No log,1.216112
2,1.668600,1.14117
3,1.205000,1.121806


TrainOutput(global_step=1272, training_loss=1.3776111722742237, metrics={'train_runtime': 1221.407, 'train_samples_per_second': 41.63, 'train_steps_per_second': 1.041, 'total_flos': 1660769484668928.0, 'train_loss': 1.3776111722742237, 'epoch': 3.0})
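
As a sanity check, the step count can be roughly reconstructed from the other numbers in this output. The sketch below assumes a single GPU and no gradient accumulation, and it infers the number of training blocks from the reported throughput, so that value is an estimate rather than something the notebook prints.

```python
import math

# Values copied from the TrainOutput above and from TrainingArguments.
train_runtime = 1221.407        # seconds
samples_per_second = 41.63
num_epochs = 3
batch_size = 40                 # per_device_train_batch_size

# Estimated number of 128-token blocks in the training split (inferred).
samples_seen = train_runtime * samples_per_second
train_blocks = round(samples_seen / num_epochs)

steps_per_epoch = math.ceil(train_blocks / batch_size)
total_steps = steps_per_epoch * num_epochs
print(train_blocks, steps_per_epoch, total_steps)
```

Under these assumptions the result matches `global_step=1272`. It would also explain the `No log` entry for epoch 1, if the `Trainer` was logging the training loss every 500 steps (which I believe is the default), since an epoch here is shorter than that.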

In [None]:
eval_results = trainer.evaluate() # eval_loss is the mean cross-entropy per token
print(f"Perplexity: {np.exp(eval_results['eval_loss']):.2f}")

Perplexity: 3.07
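
Perplexity is the exponential of the cross-entropy loss, so the value can be checked by hand. The sketch below assumes that `eval_loss` equals the validation loss reported for epoch 3 in the training table, which should hold because the same model and test split are used.

```python
import math

eval_loss = 1.121806  # validation loss after epoch 3, from the training log
perplexity = math.exp(eval_loss)
print(f"Perplexity: {perplexity:.2f}")  # Perplexity: 3.07
```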


In [None]:
trainer.push_to_hub()

### The most interesting part

In [None]:
from transformers import pipeline
generator = pipeline("text-generation", model="under-tree/choice-question-generator", device=params['device'])

In [9]:
import warnings

question_params = dict(
    do_sample=True, 
    max_length=15, 
    top_k=50, 
    top_p=0.9, 
    num_beams=4,
    temperature=0.8,
    no_repeat_ngram_size=2,
    early_stopping=True,
    return_full_text=False
)
variants_params = dict(
    do_sample=True, 
    max_length=50, 
    top_k=50, 
    top_p=0.7,
    temperature=0.8, 
    no_repeat_ngram_size=2,
    return_full_text=False
)
answer_params = dict(
    temperature=0.5, 
    max_length=5,
    return_full_text=False
)
context_params = dict(
    do_sample=True, 
    top_k=50, 
    top_p=0.8, 
    num_beams=2,
    no_repeat_ngram_size=2,
    early_stopping=True,
    max_length=80,
    return_full_text=False
)

def gen_questions(topic):
  # Generate one field per pass. After each pass keep only the lines produced
  # so far (the model may run on into the next field), then add the next prompt.
  txt = f"topic: {topic}\nquestion: "
  txt += generator(txt, **question_params)[0]['generated_text']
  txt = '\n'.join(txt.split('\n')[:2])  # keep topic + question

  txt += "\nvariants: "
  txt += generator(txt, **variants_params)[0]['generated_text']
  txt = '\n'.join(txt.split('\n')[:3])  # keep topic + question + variants

  txt += "\nanswer: "
  txt += generator(txt, **answer_params)[0]['generated_text']
  txt = '\n'.join(txt.split('\n')[:4])  # ... + answer

  txt += "\ncontext: "
  txt += generator(txt, **context_params)[0]['generated_text']
  txt = '\n'.join(txt.split('\n')[:5])  # ... + context

  print(txt)

**Play with it and try different topics!**

In [13]:
topic = 'biology' 
gen_questions(topic)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Input length of input_ids is 52, but `max_length` is set to 5. This can lead to unexpected behavior. You should consider increasing `max_new_tokens`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


topic: biology
question: what can be used to determine the age of an organism
variants: (a) cell division (b) survival (c) rapid expansion (d) the rapid growth of a species (e) it needs them (f) genetic
answer: f
context: genetic information is used for determining the ages of organisms
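
The input-length warning in the output above appears because `max_length` counts the prompt tokens as well as the generated ones, and by the answer step the prompt is already 52 tokens long. One possible fix, which I have not run against this model, is to limit only the new tokens with `max_new_tokens`. The name `answer_params_new_tokens` and the budget of 2 tokens are assumptions for illustration.

```python
# Hypothetical variant of answer_params: max_new_tokens limits only the
# generated continuation, so it does not depend on the prompt length.
answer_params_new_tokens = dict(
    temperature=0.5,
    max_new_tokens=2,  # assumed budget: the answer letter plus a line break
    return_full_text=False,
)
print(answer_params_new_tokens)
```

The same change could be applied to the other three parameter sets, whose `max_length` values also include the growing prompt.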
