# Fine-Tuning a Pretrained LLM for a Question-Answering Task
We will use a custom dataset about the film `Alien: Romulus` to fine-tune the `Qwen/Qwen2.5-0.5B-Instruct` model.

In [1]:
MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"

In [7]:
import os
os.makedirs("data", exist_ok=True)
os.makedirs("models", exist_ok=True)

## Reference

This notebook is adapted mainly from Chapter 6 of the book <i><u>The Practical Guide to Large Language Models: Hands-On AI Applications with Hugging Face Transformers</u></i>. The dataset also comes from that chapter.

## Setup for GPU

In [2]:
import torch

DEVICE = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
print(f"Using Device: {DEVICE}")

Using Device: cuda


### Prepare dataset for training

In [3]:
from datasets import load_dataset



In [4]:
DATA_PATH = "data/alien_romulus_qa.jsonl"

dataset = load_dataset("json", data_files=DATA_PATH, split="train")

In [5]:
dataset[3]

{'instruction': 'Who produced Alien: Romulus?',
 'output': 'The film was produced by Ridley Scott through Scott Free Productions.'}
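For reference, the `json` loader reads a JSON Lines file: one JSON object per line, here with `instruction` and `output` keys. The sketch below writes and reads back such a file using only the standard library. The file name `sample_qa.jsonl` and the helper names are illustrative assumptions, and the single record is the one shown above.

```python
import json
import os
import tempfile

def write_jsonl(path, records):
    """Write one JSON object per line (JSON Lines format)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

def read_jsonl(path):
    """Read a JSON Lines file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# The record shown above; the file name is a hypothetical example.
records = [
    {
        "instruction": "Who produced Alien: Romulus?",
        "output": "The film was produced by Ridley Scott through Scott Free Productions.",
    }
]

path = os.path.join(tempfile.mkdtemp(), "sample_qa.jsonl")
write_jsonl(path, records)
loaded = read_jsonl(path)
print(loaded[0]["instruction"])
```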

Define a function that formats each sample into a single text string for training.

In [6]:
def format_sample(sample):
    return f"Question: {sample['instruction']}\nAnswer: {sample['output']}"

In [7]:
dataset = dataset.map(lambda s: {"text": format_sample(s)})

dataset[3]

{'instruction': 'Who produced Alien: Romulus?',
 'output': 'The film was produced by Ridley Scott through Scott Free Productions.',
 'text': 'Question: Who produced Alien: Romulus?\nAnswer: The film was produced by Ridley Scott through Scott Free Productions.'}
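The training text puts the reference answer right after `\nAnswer:`, so at inference time the prompt should stop at `Answer:` and let the model continue from there. A minimal sketch of that relationship follows; `build_prompt` is a hypothetical helper, not part of the notebook.

```python
def format_sample(sample):
    # Same training format as in the notebook
    return f"Question: {sample['instruction']}\nAnswer: {sample['output']}"

def build_prompt(question):
    # Hypothetical helper: the training format with the answer left open
    return f"Question: {question}\nAnswer:"

sample = {
    "instruction": "Who produced Alien: Romulus?",
    "output": "The film was produced by Ridley Scott through Scott Free Productions.",
}

text = format_sample(sample)
prompt = build_prompt(sample["instruction"])

# The inference prompt is a strict prefix of the training text
print(text.startswith(prompt))
```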

### Prepare Tokenizer

In [8]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

Check the vocabulary size of the pretrained tokenizer. It matters because it determines the number of rows in the LLM's embedding matrix.

In [9]:
len(tokenizer)

151665

Inspect the special tokens of the pretrained tokenizer.

In [10]:
tokenizer.pad_token

'<|endoftext|>'

In [11]:
tokenizer.eos_token

'<|im_end|>'

In [12]:
tokenizer.all_special_tokens

['<|im_end|>',
 '<|endoftext|>',
 '<|im_start|>',
 '<|object_ref_start|>',
 '<|object_ref_end|>',
 '<|box_start|>',
 '<|box_end|>',
 '<|quad_start|>',
 '<|quad_end|>',
 '<|vision_start|>',
 '<|vision_end|>',
 '<|vision_pad|>',
 '<|image_pad|>',
 '<|video_pad|>']

Define a function that tokenizes the dataset with the tokenizer.

In [13]:
def tokenize_func(sample):
    return tokenizer(
        sample["text"],
        truncation=True,
        padding="max_length",
        max_length=256
    )

Tokenize the dataset and drop the original text columns.

In [14]:
tokenized_dataset = dataset.map(tokenize_func, batched=True, remove_columns=dataset.column_names)
#tokenized_dataset[3]

In [15]:
tokenized_dataset[3].keys()

dict_keys(['input_ids', 'attention_mask'])
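To make the two returned fields concrete, here is a plain-Python sketch of what `truncation=True` with `padding="max_length"` does to one sequence, assuming right-side padding. The token IDs and the pad ID are made-up values, and `pad_or_truncate` is an illustrative helper rather than a tokenizer API.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = token_ids[:max_length]                   # truncation: drop tokens past max_length
    n_pad = max_length - len(ids)                  # number of padding slots to fill
    input_ids = ids + [pad_id] * n_pad             # assumption: pad on the right
    attention_mask = [1] * len(ids) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask

# Made-up token IDs, for illustration only
short_ids, short_mask = pad_or_truncate([11, 22, 33], max_length=5, pad_id=0)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5, pad_id=0)

print(short_ids, short_mask)  # [11, 22, 33, 0, 0] [1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [1, 2, 3, 4, 5] [1, 1, 1, 1, 1]
```

In the notebook every sequence is padded or cut to 256 tokens, so all rows in a batch have the same length.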

### Prepare pretrained model

In [16]:
from transformers import AutoModelForCausalLM

In [17]:
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto")

Resize the model's embedding matrix to match the tokenizer's vocabulary size.

In [18]:
model.resize_token_embeddings(len(tokenizer))

Embedding(151665, 896)
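The shape `Embedding(151665, 896)` means one 896-dimensional vector per vocabulary entry. A quick back-of-the-envelope calculation of how many parameters that table holds, using only the two numbers printed above (the storage estimate assumes 2 bytes per parameter):

```python
vocab_size = 151665   # len(tokenizer), from the output above
hidden_size = 896     # embedding dimension, from the output above

embedding_params = vocab_size * hidden_size
print(f"{embedding_params:,} parameters")  # 135,891,840 parameters

# Rough storage estimate, assuming 2 bytes per parameter (16-bit floats)
size_mib = embedding_params * 2 / 1024**2
print(f"~{size_mib:.0f} MiB in 16-bit precision")
```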

Move the model to the GPU if one is available.

In [19]:
model.to(DEVICE)

Qwen2ForCausalLM(
  (model): Qwen2Model(
    (embed_tokens): Embedding(151665, 896)
    (layers): ModuleList(
      (0-23): 24 x Qwen2DecoderLayer(
        (self_attn): Qwen2SdpaAttention(
          (q_proj): Linear(in_features=896, out_features=896, bias=True)
          (k_proj): Linear(in_features=896, out_features=128, bias=True)
          (v_proj): Linear(in_features=896, out_features=128, bias=True)
          (o_proj): Linear(in_features=896, out_features=896, bias=False)
          (rotary_emb): Qwen2RotaryEmbedding()
        )
        (mlp): Qwen2MLP(
          (gate_proj): Linear(in_features=896, out_features=4864, bias=False)
          (up_proj): Linear(in_features=896, out_features=4864, bias=False)
          (down_proj): Linear(in_features=4864, out_features=896, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen2RMSNorm((896,), eps=1e-06)
        (post_attention_layernorm): Qwen2RMSNorm((896,), eps=1e-06)
      )
    )
    (norm): Qwen2RMSNorm((

In [20]:
# Disable the key/value cache: it only speeds up generation and is not needed during training

model.config.use_cache = False

### Prepare trainer for fine-tuning process
To prepare the trainer, we need three objects:
1. `DataCollator`
2. `TrainingArguments`
3. `Trainer`

#### Prepare DataCollator

In [22]:
from transformers import DataCollatorForLanguageModeling

collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
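With `mlm=False`, the collator prepares batches for causal language modeling: the labels are a copy of `input_ids`, with padding positions set to `-100` so that the loss ignores them. A pure-Python sketch of that labeling rule follows. The token IDs and pad ID are invented for illustration, and `make_causal_lm_labels` is a hypothetical name.

```python
IGNORE_INDEX = -100  # label value ignored by PyTorch's cross-entropy loss by default

def make_causal_lm_labels(input_ids, pad_token_id):
    """Copy input_ids, replacing padding tokens with IGNORE_INDEX."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]

# Invented IDs: three real tokens followed by two padding tokens (pad ID 0)
input_ids = [11, 22, 33, 0, 0]
labels = make_causal_lm_labels(input_ids, pad_token_id=0)
print(labels)  # [11, 22, 33, -100, -100]
```

The one-position shift between inputs and targets happens inside the model's loss computation, not in the collator.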

#### Prepare TrainingArguments

In [23]:
from transformers import TrainingArguments

In [25]:
TUNED_MODEL_DIR = 'models/Qwn2.5-0.5B-finetuned'

In [26]:
train_args = TrainingArguments(
    output_dir=TUNED_MODEL_DIR,
    num_train_epochs=5, 
    per_device_train_batch_size=1, # batch size
    gradient_accumulation_steps=1,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_ratio=0.03,
    logging_steps=10,
    save_strategy="no",
    report_to="none",
    no_cuda=False,
    fp16=True,
    bf16=False,
    dataloader_num_workers=1
)
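`warmup_ratio=0.03` asks for the learning rate to ramp up over the first 3% of training steps. Assuming the default linear scheduler (the notebook does not set `lr_scheduler_type`), the rate then decays linearly to zero. A sketch of that schedule in plain Python; the function name and the step counts in the example are illustrative.

```python
def linear_schedule_with_warmup(step, base_lr, warmup_steps, total_steps):
    """Learning rate at a given optimizer step: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Illustrative numbers: 100 total steps, 10 of them warmup, peak learning rate 2e-5
for step in (0, 5, 10, 55, 100):
    lr = linear_schedule_with_warmup(step, 2e-5, warmup_steps=10, total_steps=100)
    print(step, lr)
```

The peak value (`learning_rate=2e-5` in the notebook) is reached at the end of warmup and never exceeded.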

#### Prepare Trainer

In [27]:
from transformers import Trainer

In [30]:
trainer = Trainer(
    model=model,
    args=train_args,
    train_dataset=tokenized_dataset,
    processing_class=tokenizer,
    data_collator=collator
)

### Fine Tune with Trainer

In [31]:
trainer.train()

Step,Training Loss
10,3.1529
20,2.147
30,2.0198
40,1.8311
50,1.5898
60,0.9708
70,0.9912
80,0.9069
90,0.8459
100,0.9697


TrainOutput(global_step=245, training_loss=0.8210621157471014, metrics={'train_runtime': 82.4532, 'train_samples_per_second': 2.971, 'train_steps_per_second': 2.971, 'total_flos': 134684217507840.0, 'train_loss': 0.8210621157471014, 'epoch': 5.0})
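The summary numbers can be cross-checked with simple arithmetic. With a batch size of 1 and no gradient accumulation, each optimizer step consumes one example, so, assuming a single device, 245 steps over 5 epochs implies 49 training examples:

```python
global_step = 245        # from TrainOutput above
num_epochs = 5           # num_train_epochs
batch_size = 1           # per_device_train_batch_size
grad_accum = 1           # gradient_accumulation_steps
train_runtime = 82.4532  # seconds, from TrainOutput above

steps_per_epoch = global_step // num_epochs
# Assumption: a single device, so examples per step = batch_size * grad_accum
num_examples = steps_per_epoch * batch_size * grad_accum
print(num_examples)  # 49

steps_per_second = global_step / train_runtime
print(round(steps_per_second, 3))  # 2.971
```

The computed throughput matches the reported `train_steps_per_second`, and it equals `train_samples_per_second` because each step processes exactly one sample.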

### Save the fine-tuned model

In [32]:
trainer.save_model(TUNED_MODEL_DIR)

In [33]:
tokenizer.save_pretrained('models/Qwn2.5-0.5B-tokenizer')

('models/Qwn2.5-0.5B-tokenizer/tokenizer_config.json',
 'models/Qwn2.5-0.5B-tokenizer/special_tokens_map.json',
 'models/Qwn2.5-0.5B-tokenizer/vocab.json',
 'models/Qwn2.5-0.5B-tokenizer/merges.txt',
 'models/Qwn2.5-0.5B-tokenizer/added_tokens.json',
 'models/Qwn2.5-0.5B-tokenizer/tokenizer.json')

## Compare Fine-Tuned to Pretrained

In [1]:
import torch

DEVICE = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
print(f"Using Device: {DEVICE}")

Using Device: cuda


In [2]:
from transformers import AutoTokenizer, AutoModelForCausalLM



In [3]:
MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"
TUNED_MODEL_DIR = 'models/Qwn2.5-0.5B-finetuned'

In [4]:
tokenizer = AutoTokenizer.from_pretrained('models/Qwn2.5-0.5B-tokenizer')

### Load the pretrained model without fine-tuning

In [5]:
original_model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
original_model.resize_token_embeddings(len(tokenizer))
#original_model.to(DEVICE)

Embedding(151665, 896)

### Load the fine-tuned model

In [6]:
trained_model = AutoModelForCausalLM.from_pretrained(TUNED_MODEL_DIR)
trained_model.resize_token_embeddings(len(tokenizer))
#trained_model.to(DEVICE)

Embedding(151665, 896)

### Load the testing dataset

In [7]:
from datasets import load_dataset

In [8]:
DATA_PATH = "data/alien_romulus_qa.jsonl"

dataset = load_dataset("json", data_files=DATA_PATH, split="train")

We take only the first 5 unique questions for testing.

In [9]:
dataset.column_names

['instruction', 'output']

In [14]:
prompts = dataset.unique("instruction")[:5]

Define a function that generates an answer from a model.

In [11]:
def generate(tokenizer, model, prompt, max_new_tokens=80):
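    model.eval()  # switch the PyTorch model to evaluation mode
    # Use the same "Question: ...\nAnswer:" format as in training, on the model's device
    inp = tokenizer(f"Question: {prompt}\nAnswer:", return_tensors="pt").to(model.device)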
    with torch.no_grad():
        outp = model.generate(**inp, max_new_tokens=max_new_tokens, pad_token_id=tokenizer.eos_token_id)
    print("="*20)
    print(tokenizer.decode(outp[0], skip_special_tokens = True))
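`model.generate` returns the prompt tokens followed by the continuation, so the decoded string repeats the question. If only the answer is wanted, one option is to cut the text after the `Answer:` marker and stop at the first line break. This is a sketch with a hypothetical helper `extract_answer`; the notebook itself prints the full decoded text, and the `decoded` string below is an illustrative example built from a training record.

```python
def extract_answer(decoded_text, marker="Answer:"):
    """Return the first line of text that follows the answer marker."""
    _, found, tail = decoded_text.partition(marker)
    if not found:
        return decoded_text.strip()  # marker missing: fall back to the whole text
    tail = tail.strip()
    return tail.splitlines()[0].strip() if tail else ""

# Illustrative decoded text, not actual model output
decoded = (
    "Question: Who produced Alien: Romulus?\n"
    "Answer: The film was produced by Ridley Scott through Scott Free Productions.\n"
    "Question: ..."
)
print(extract_answer(decoded))
```

Stopping at the first line break also trims the kind of trailing continuation that a model may produce after it has finished answering.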

In [12]:
def compare(tokenizer, original_model, trained_model, prompts):
    for prompt in prompts:
        print("\n--- Before fine tuning ---")
        generate(tokenizer, original_model, prompt)
        print("\n--- After fine tuning ---")
        generate(tokenizer, trained_model, prompt)

In [15]:
compare(tokenizer, original_model, trained_model, prompts)


--- Before fine tuning ---
Question: What is Alien: Romulus about?
Answer: Alien: Romulus is a video game released by Square Enix in 2017. The game is set on the planet of Eridani and follows the story of a young girl named Romulus who befriends an alien being called Azrael, which has been living on Eridani for many years.
 A single-select problem: Is the question answered in a satisfactory fashion?

Choose from

--- After fine tuning ---
Question: What is Alien: Romulus about?
Answer: Alien: Romulus is a science fiction horror film in the Alien franchise. A group of young colonists explore a derelict space station and encounter the deadly Xenomorph species. The film focuses on survival against overwhelming alien terror and human greed. A group of colonists work together to save the station and prevent the Xenomorphs from unleashing deadly toxins. A series of events unfold,

--- Before fine tuning ---
Question: Who directed Alien: Romulus?
Answer: The director of the film "Alien: Romu

***
If you check the answers against a Google search, you will find that the fine-tuned model gives the correct answers.
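Manual checking can be complemented by a rough automatic score against the reference `output` field of the dataset. The sketch below computes a token-overlap F1, a common metric for question answering. It is not part of the original notebook, the helper names are hypothetical, and the prediction string is an invented example.

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase and split into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall between two strings."""
    pred, ref = tokenize(prediction), tokenize(reference)
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

reference = "The film was produced by Ridley Scott through Scott Free Productions."
prediction = "It was produced by Ridley Scott."  # invented model output
print(round(token_f1(prediction, reference), 2))
```

A score near 1.0 means the generated answer shares most of its words with the reference, while a score near 0.0 suggests an unrelated answer.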