## QuestGen-LLM: Fine-Tuning

This notebook covers fine-tuning several pre-trained _large language models_ (LLMs) on the prepared ["quest"](../data/quests_train.json) dataset. Each model is trained and validated on the dataset with its base parameters frozen, and the evaluation results are compared. The table below lists the LLMs used, along with their parameter counts.

| S. No. | Large Language Model                | Parameters | Developed By | Notes                                                 |
| :----: | :---------------------------------- | :--------: | :----------: | :---------------------------------------------------- |
|   1.   | GPT-2[^1]                           |    124M    |    OpenAI    | Base model from the GPT-2 family                      |
|   2.   | GPT-2 Medium[^2]                    |    355M    |    OpenAI    | Larger variant with improved language modeling        |
|   3.   | GPT-2 Large[^3]                     |    774M    |    OpenAI    | Capable of generating more coherent longer text       |
|   4.   | TinyLlama-1.1B-Chat-v1.0[^4] \*†    |    1.1B    |  TinyLlama   | Lightweight chat-tuned model for constrained hardware |
|   5.   | DeepSeek-R1-Distill-Qwen-1.5B[^5] † |    1.5B    | DeepSeek AI  | Distilled model based on the Qwen architecture        |

> Fine-tuning uses _supervised fine-tuning_\* (SFT) and _reinforcement learning from human feedback_† (RLHF).

<!-- References -->

[^1]: https://huggingface.co/openai-community/gpt2
[^2]: https://huggingface.co/openai-community/gpt2-medium
[^3]: https://huggingface.co/openai-community/gpt2-large
[^4]: https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0
[^5]: https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B


In [None]:
from __future__ import annotations

import json
import os
import shutil
import time
from dataclasses import dataclass, field
from os import PathLike
from pathlib import Path
from typing import Any, Final

import torch
from datasets import Dataset, DatasetDict, load_dataset
from huggingface_hub import login
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    DataCollatorForLanguageModeling,
    EarlyStoppingCallback,
    PreTrainedModel,
    PreTrainedTokenizerFast,
    PreTrainedTokenizer,
    Trainer,
    TrainerCallback,
    TrainingArguments,
    set_seed,
)

from utils.dirpath import get_cache_dirpath, get_target_dirpath

In [None]:
# Map for the model identifiers: (model_key -> model_id)
model_ids: dict[str, str] = {
    "gpt2": "openai-community/gpt2",
    "gpt2-medium": "openai-community/gpt2-medium",
    "gpt2-large": "openai-community/gpt2-large",
    "tinyllama-1.1b-chat": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "deepseek-r1-distill-qwen-1.5b": "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
}

# Get the HF access token from the environment (None if the variable is unset)
HF_ACCESS_TOKEN: Final[str | None] = os.getenv("HUGGINGFACE_HUB_TOKEN")

# Log in to the Hugging Face Hub; the token is stored locally for later calls
login(token=HF_ACCESS_TOKEN)
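
`os.getenv` returns `None` when the variable is unset, in which case `login` typically falls back to an interactive prompt. Below is a minimal sketch of a guard that fails early with a clear message instead. The helper name `read_hf_token` is hypothetical and not part of the notebook.

```python
import os


def read_hf_token(var_name: str = "HUGGINGFACE_HUB_TOKEN") -> str:
    """Return the token stored in `var_name`, or raise if it is missing or empty."""
    token = os.getenv(var_name)
    if not token:
        raise RuntimeError(
            f"Environment variable {var_name!r} is not set; "
            "export it before running the notebook."
        )
    return token
```

The returned value could then be passed to `login(token=...)` in place of the raw `os.getenv` result.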

In [3]:
data_dir: Path = get_target_dirpath("data")

# Load the quest dataset
quest_set: DatasetDict = load_dataset(
    "text",
    data_files={
        "train": str(data_dir / "quests_train.txt"),
        "val": str(data_dir / "quests_val.txt"),
    },
    cache_dir=str(data_dir / ".cache"),
)
quest_set

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 19954
    })
    val: Dataset({
        features: ['text'],
        num_rows: 2486
    })
})

In [4]:
quest_set["train"][:21]

{'text': ['### Instruction:',
  'Generate a video game quest description based on the following structured information.',
  '',
  '### Input:',
  'Quest Name: Perilous Passage',
  'Objective: save the Mana Queen',
  'First Tasks: go through the gate to the Forsaken Vaults',
  'First Task Locations: Forsaken Vaults - a perilous dungeon',
  'Quest Giver: NONE - NONE (location: NONE)',
  'Reward: NONE -  (amount: 1)',
  'Characters: Mana Queen - a good female spirit (location: Forsaken Vaults)',
  'Tools: NONE',
  'Locations: NONE',
  'Items: NONE',
  'Enemies: NONE',
  'Groups: NONE',
  'Title: Torchlight II',
  'Motivation: NONE',
  '',
  '### Response:',
  "The Mana Queen has come and gone Through this gate, she journeyed on. Follow her and pay the cost. Hasten forth, or she'll be lost."]}

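Because the `"text"` loader produces one row per line, a single quest prompt spans 21 consecutive rows in the output above. Below is a minimal sketch of how such rows could be regrouped into one string per quest, assuming every record begins with the `### Instruction:` header. The function name `group_records` and the shortened sample rows are hypothetical.

```python
def group_records(lines: list[str], header: str = "### Instruction:") -> list[str]:
    """Join consecutive lines into one prompt string per record."""
    records: list[list[str]] = []
    for line in lines:
        # Start a new record at every header (or at the very first line)
        if line == header or not records:
            records.append([])
        records[-1].append(line)
    # Drop blank separator lines at the edges of each record
    return ["\n".join(r).strip() for r in records if any(r)]


# Two shortened, made-up records separated by a blank line
rows = [
    "### Instruction:", "Generate ...", "", "### Response:", "Text A", "",
    "### Instruction:", "Generate ...", "", "### Response:", "Text B",
]
records = group_records(rows)
print(len(records))
```

With the real data, this kind of grouping would be applied to `quest_set["train"]["text"]` before tokenization, so that each training example holds a full instruction, input, and response rather than a single line.
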
In [5]:
quest_set["val"][23:44]

{'text': ['Generate a video game quest description based on the following structured information.',
  '',
  '### Input:',
  'Quest Name: A Child in the Lighthouse',
  "Objective: save Ardrouine's little son from worgs",
  'First Tasks: go to the abandoned lighthouse',
  'First Task Locations:  - abandoned lighthouse to the northwest',
  'Quest Giver: NONE - NONE (location: NONE)',
  'Reward:  - coins (amount: 60)',
  'Characters: NONE',
  'Tools: NONE',
  'Locations: NONE',
  'Items: NONE',
  'Enemies: NONE',
  'Groups: NONE',
  "Title: Baldur's Gate",
  'Motivation: NONE',
  '',
  '### Response:',
  "Please help me, I am just poor Ardrouine! I don't know where else to turn. My little boy was playing in that abandoned lighthouse to the northwest when a pack of worgs surrounded it. Please just turn them back, and I can coax him down. There's not much time! I can pay you 60 coins: this money is all my husband brought back from market this past week. My son's life is worth this and so muc

In [6]:
@dataclass
class QuestGenLLM:
    tokenizer: PreTrainedTokenizer | PreTrainedTokenizerFast
    model: PreTrainedModel
    model_key: str  # Alias for the model, e.g., "gpt2"
    model_id: str  # Hugging Face model name, e.g., "openai-community/gpt2"
    fp16_available: bool  # Mixed precision
    device: str = field(init=False)
    dtype: str = field(init=False)

    def __post_init__(self):
        # Automatically determine the device used by the model
        self.device = str(getattr(self.model, "device", "N/A"))

        # Automatically determine the dtype used by the model
        self.dtype = str(getattr(self.model, "dtype", "N/A")).replace("torch.", "")

    @classmethod
    def from_pretrained(
        cls,
        model_key: str,
        model_id: str,
        cache_dir: PathLike = get_cache_dirpath("models"),
        seed: int = 42,
        use_cpu: bool = False,
    ) -> QuestGenLLM:
        def apply_lora_adapter(
            model: PreTrainedModel,
            r: int = 8,
            alpha: int = 16,
            dropout: float = 0.1,
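            # NOTE: "q_proj"/"v_proj" match LLaMA/Qwen-style attention layers;
            # GPT-2 names its fused attention projection "c_attn" instead
            target_modules: list[str] = ["q_proj", "v_proj"],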
            task_type: str = "CAUSAL_LM",
        ) -> PreTrainedModel:
            # Prepare model for k-bit training
            model = prepare_model_for_kbit_training(model)

            # Define the LoRA config
            lora_config: LoraConfig = LoraConfig(
                r=r,
                lora_alpha=alpha,
                lora_dropout=dropout,
                target_modules=target_modules,
                bias="none",
                task_type=task_type,
            )

            # Apply LoRA adapters to the model
            try:
                model = get_peft_model(model, lora_config)
            except Exception as e:
                print(f"[LoRAINFO] Adapter failed to apply: {e}")
                raise

            # Display information about the model parameters
            trainable_params: int = sum(
                p.numel() for p in model.parameters() if p.requires_grad
            )
            all_params: int = sum(p.numel() for p in model.parameters())
            trainable_percent: float = 100 * trainable_params / all_params
            print(
                "[LoRAINFO] trainable params: {:,} || all params: {:,} || trainable%: {:.4f}".format(
                    trainable_params, all_params, trainable_percent
                )
            )

            return model

        print(f"[DOWNLOAD] {model_key} ({model_id})")
        start_time: float = time.time()

        # Clear PyTorch's CUDA memory cache
        torch.cuda.empty_cache()

        # Set the random seed for reproducibility
        set_seed(seed)

        # Determine if mixed precision is available (CUDA compute capability >= 7.0)
        fp16_available: bool = (
            torch.cuda.is_available()
            and torch.cuda.get_device_capability(0) >= (7, 0)
        )

        # Download the tokenizer using the model id
        tokenizer: PreTrainedTokenizerFast = AutoTokenizer.from_pretrained(
            model_id,
            cache_dir=(cache_dir / model_key),
            use_fast=True,
            token=HF_ACCESS_TOKEN,
            trust_remote_code=True,
        )

        model: PreTrainedModel
        if fp16_available and not use_cpu:
            # Set the bitsandbytes configuration for quantization
            bnb_config: BitsAndBytesConfig = BitsAndBytesConfig(
                load_in_4bit=True,
                bnb_4bit_quant_type="nf4",
                bnb_4bit_use_double_quant=True,
                bnb_4bit_compute_dtype=torch.float16,
                llm_int8_enable_fp32_cpu_offload=True,
            )

            # Download the model using the model id (for GPU)
            model = AutoModelForCausalLM.from_pretrained(
                model_id,
                torch_dtype=torch.float16,
                quantization_config=bnb_config,
                cache_dir=(cache_dir / model_key),
                token=HF_ACCESS_TOKEN,
                device_map="auto",
                trust_remote_code=True,
                low_cpu_mem_usage=True,
            )
        else:
            # Download the model using the model id (for CPU)
            model = AutoModelForCausalLM.from_pretrained(
                model_id,
                torch_dtype=torch.float32,
                cache_dir=(cache_dir / model_key),
                token=HF_ACCESS_TOKEN,
                trust_remote_code=True,
                low_cpu_mem_usage=True,
            )
            model.to("cpu")

        # Apply the LoRA adapters to the model
        model = apply_lora_adapter(model)

        end_time: float = time.time()
        elapsed: float = end_time - start_time
        print(f'[COMPLETE] "{model_key}" ready in {elapsed:.2f}s.\n')

        return cls(tokenizer, model, model_key, model_id, fp16_available)

    def tokenize_and_train(
        self,
        dataset: DatasetDict,
        max_length: int = 256,
        learning_rate: float = 5e-6,
        batch_size: int = 1,
        epochs: int = 1,
        seed: int = 42,
        logging_steps: int = 10,
        output_dir: PathLike = get_target_dirpath("out"),
        logging_dir: PathLike = get_target_dirpath("logs"),
        gradient_checkpointing: bool = False,
        load_best_model_at_end: bool = False,
        callbacks: list[TrainerCallback] = [
            EarlyStoppingCallback(early_stopping_patience=2)
        ],
        activate_fp16: bool = False,
        activate_eval: bool = False,
        activate_save: bool = False,
        activate_tensorboard: bool = False,
        activate_callbacks: bool = False,
    ) -> Trainer:
        # Ensure the training and validation sets
        if not all(split in dataset for split in ["train", "val"]):
            raise ValueError("DatasetDict must contain both 'train' and 'val' splits.")

        # Ensure the output and logging directories
        os.makedirs(output_dir, exist_ok=True)
        os.makedirs(logging_dir, exist_ok=True)

        start_time: float
        end_time: float
        elapsed: float

        # Set the random seed for reproducibility
        set_seed(seed)

        # Set some configurations for the tokenizer
        if self.tokenizer.pad_token is None:  # Padding token
            self.tokenizer.pad_token = self.tokenizer.eos_token
        self.tokenizer.model_max_length = 2048  # Max length
        self.tokenizer.padding_side = "left"  # Padding side

        # Tokenize the dataset (truncated to `max_length`, padded to the longest item per batch)
        print(f"[TOKENIZE] {self.model_key} ({self.model_id})")
        start_time = time.time()
        tokenized_data: DatasetDict = dataset.map(
            QuestGenLLM.tokenize_dataset,
            batched=True,
            remove_columns=["text"],
            fn_kwargs={"tokenizer": self.tokenizer, "max_length": max_length},
        )
        end_time = time.time()
        elapsed = end_time - start_time
        print(f"[COMPLETE] Elapsed: {elapsed:.2f}s\n")

        # Set the model padding token (from the tokenizer)
        self.model.config.pad_token_id = self.tokenizer.pad_token_id

        # Turn off `use_cache` if `gradient_checkpointing` is on
        self.model.config.use_cache = not gradient_checkpointing

        # Set up the training configurations
        training_args: TrainingArguments = TrainingArguments(
            output_dir=(output_dir / self.model_key),
            learning_rate=learning_rate,
            per_device_train_batch_size=batch_size,
            per_device_eval_batch_size=batch_size,
            num_train_epochs=epochs,
            log_level="info",
            logging_steps=logging_steps,
            eval_strategy=("epoch" if activate_eval else "no"),
            save_strategy=("epoch" if activate_save else "no"),
            logging_dir=(logging_dir / self.model_key),
            save_total_limit=2,
            gradient_accumulation_steps=2,
            gradient_checkpointing=gradient_checkpointing,
            fp16=(self.fp16_available and activate_fp16),
            load_best_model_at_end=load_best_model_at_end,
            metric_for_best_model="eval_loss",
            seed=seed,
            report_to=("tensorboard" if activate_tensorboard else "none"),
            label_names=["labels"],
        )

        # Set up the data collator for the model
        data_collator: DataCollatorForLanguageModeling = (
            DataCollatorForLanguageModeling(tokenizer=self.tokenizer, mlm=False)
        )

        # Prepare and run the trainer
        trainer: Trainer = Trainer(
            model=self.model,
            args=training_args,
            data_collator=data_collator,
            train_dataset=tokenized_data["train"],
            eval_dataset=(tokenized_data["val"] if activate_eval else None),
            callbacks=(callbacks if activate_callbacks else []),
        )

        print(f"[FINETUNE] {self.model_key} ({self.model_id})")
        start_time = time.time()
        trainer.train()
        end_time = time.time()
        elapsed = end_time - start_time
        print(f"[COMPLETE] Elapsed: {elapsed:.2f}s\n")

        # Save the model and tokenizer for later use
        if activate_save:
            trainer.save_model()
            self.tokenizer.save_pretrained(save_directory=training_args.output_dir)

        return trainer

    @staticmethod
    def tokenize_dataset(
        examples: dict[str, list[str]],
        tokenizer: PreTrainedTokenizer | PreTrainedTokenizerFast,
        max_length: int = 256,
    ) -> dict[str, list[list[int]]]:
        # `tokenizer(...)` returns a dict-like `BatchEncoding`, not a `Dataset`
        encodings = tokenizer(
            examples["text"],
            padding="longest",
            truncation=True,
            max_length=max_length,
        )
        encodings["labels"] = encodings["input_ids"].copy()
        return encodings

    def to_dict(self) -> dict[str, Any]:
        return {
            "model_key": self.model_key,
            "model_id": self.model_id,
            "device": self.device,
            "dtype": self.dtype,
            "vocab_size": getattr(self.tokenizer, "vocab_size", "unknown"),
            "max_length": getattr(self.tokenizer, "model_max_length", "unknown"),
            "model_type": getattr(
                getattr(self.model, "config", None), "model_type", "unknown"
            ),
            "num_parameters": self.model.num_parameters()
            if hasattr(self.model, "num_parameters")
            else "N/A",
            "fp16_available": self.fp16_available,
        }

    def clear_cache(self, cache_dir: PathLike = get_cache_dirpath("models")) -> None:
        def remove_dir(dir_path: PathLike) -> None:
            if os.path.exists(dir_path):
                shutil.rmtree(dir_path)
                print(f"Cache directory '{dir_path}' removed.")
            else:
                print(f"No cache directory found at '{dir_path}'.")

        remove_dir(cache_dir / self.model_key)

    def print_model_information(self) -> None:
        print(json.dumps(self.to_dict(), indent=2))

    def __str__(self) -> str:
        return f"{self.model_key} ({self.model_id})"
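
`tokenize_dataset` copies `input_ids` into `labels`, and `DataCollatorForLanguageModeling(mlm=False)` then rebuilds the labels per batch, replacing every pad-token position with `-100` so the loss ignores it. Since the pad token is set to the EOS token here, EOS positions are masked as well. The following is a pure-Python sketch of that masking step, not the library's implementation. The name `mask_pad_labels` and the token ids are made up for illustration.

```python
IGNORE_INDEX = -100  # label value ignored by PyTorch's cross-entropy loss


def mask_pad_labels(input_ids: list[int], pad_token_id: int) -> list[int]:
    """Copy `input_ids` to labels, hiding pad positions from the loss."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]


# Left-padded example with hypothetical ids; 0 plays the role of pad/EOS
labels = mask_pad_labels([0, 0, 11, 12, 13, 0], pad_token_id=0)
print(labels)  # [-100, -100, 11, 12, 13, -100]
```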

In [7]:
# Get a subset of the quest dataset
quest_subset: DatasetDict = DatasetDict(
    {
        "train": quest_set["train"].select(range(100)),
        "val": quest_set["val"].select(range(20)),
    }
)
quest_subset

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 100
    })
    val: Dataset({
        features: ['text'],
        num_rows: 20
    })
})

In [None]:
# Download the Llama 2 model
llama2_model: QuestGenLLM = QuestGenLLM.from_pretrained(
    model_key="llama-2-7b-chat", model_id="meta-llama/Llama-2-7b-chat-hf", use_cpu=True
)
llama2_model
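
The `[LoRAINFO]` line printed during loading reports the trainable parameter count. For a rank-`r` adapter on a linear layer of shape `d_in × d_out`, LoRA adds `r * (d_in + d_out)` weights. Below is a rough sketch of that arithmetic, assuming Llama 2 7B's published dimensions (hidden size 4096, 32 decoder layers) and the default `r=8` on `q_proj` and `v_proj`. The function name `lora_param_count` is hypothetical.

```python
def lora_param_count(
    r: int, d_in: int, d_out: int, modules_per_layer: int, layers: int
) -> int:
    """Number of weights added by LoRA: A is (r x d_in), B is (d_out x r)."""
    per_module = r * (d_in + d_out)
    return per_module * modules_per_layer * layers


# q_proj and v_proj are both 4096 x 4096 in Llama 2 7B (assumed dimensions)
trainable = lora_param_count(r=8, d_in=4096, d_out=4096, modules_per_layer=2, layers=32)
print(f"{trainable:,}")  # 4,194,304
```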

In [None]:
# Tokenize the quest subset and fine-tune the Llama 2 model on it
llama2_trainer: Trainer = llama2_model.tokenize_and_train(dataset=quest_subset)
llama2_trainer
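
With the defaults of `tokenize_and_train` (`batch_size=1`, `gradient_accumulation_steps=2`, `epochs=1`) and the 100-row training subset, the number of optimizer updates can be estimated as shown here. This is a back-of-the-envelope sketch for a single device, not the exact bookkeeping `Trainer` performs. The helper `optimizer_steps` is hypothetical.

```python
import math


def optimizer_steps(num_rows: int, batch_size: int, grad_accum: int, epochs: int) -> int:
    """Estimate optimizer updates: batches per epoch grouped by accumulation steps."""
    batches_per_epoch = math.ceil(num_rows / batch_size)
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return updates_per_epoch * epochs


print(optimizer_steps(num_rows=100, batch_size=1, grad_accum=2, epochs=1))  # 50
```

At `logging_steps=10`, a run of this size would emit only a handful of loss entries.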