## QuestGen-LLM: Fine-Tuning & Evaluation

This notebook covers the fine-tuning of various pre-trained _large language models_ (LLMs) on the prepared ["quest"](../data/quests_train.json) dataset. Each language model applied is trained and validated on the dataset (with frozen parameters) and the results of these evaluations are compared. The LLMs employed for this application are listed in the following table with their respective parameter count.

| S. No. | Large Language Model             | Parameters | Developed By | Notes                                                 |
| :----: | :------------------------------- | :--------: | :----------: | :---------------------------------------------------- |
|   1.   | GPT-2[^1]                        |    124M    |    OpenAI    | Base model from the GPT-2 family                      |
|   2.   | GPT-2 Medium[^2]                 |    355M    |    OpenAI    | Larger variant with improved language modeling        |
|   3.   | GPT-2 Large[^3]                  |    774M    |    OpenAI    | Capable of generating more coherent longer text       |
|   4.   | Llama-3.2-1B-Instruct[^4] †      |     1B     |     Meta     | Instruction-tuned model for question-answering        |
|   5.   | TinyLlama-1.1B-Chat-v1.0[^5] \*† |    1.1B    |  TinyLlama   | Lightweight chat-tuned model for constrained hardware |

> Fine-tuning uses _supervised fine-tuning_\* (SHF) and _reinforcement learning with human feedback_† (RLHF).

The notebook also covers the performance evaluation of these pre-trained LLMs after training on the "quest" dataset. The generated quest descriptions (from the test set) are compared to their reference responses. These responses are then evaluated based on the following evaluation metrics:

| S. No. | Metric         | Description                                                   | Preference                               |
| :----: | -------------- | ------------------------------------------------------------- | ---------------------------------------- |
|   1.   | Perplexity[^6] | Measures how "confused" the model is about its predictions.   | Lower values indicate less uncertainty.  |
|   2.   | BLEU[^7]       | Compares n-gram overlap between generated and reference text. | Higher values indicate more overlap.     |
|   3.   | ROUGE[^8]      | Measures how much reference content is captured (recall).     | Higher values indicate better recall.    |
|   4.   | METEOR[^9]     | Evaluates similarity using synonyms, stems, and word order.   | Higher values indicate better alignment. |

> Additionally, a _human evaluation method_ can further assess qualities like creativity, fluency, and coherence.

Note that:

- **BLEU:** Bilingual Evaluation Understudy
- **ROUGE:** Recall-Oriented Understudy for Gisting Evaluation
- **METEOR:** Metric for Evaluation of Translation with Explicit ORdering

<!-- References -->

[^1]: https://huggingface.co/openai-community/gpt2
[^2]: https://huggingface.co/openai-community/gpt2-medium
[^3]: https://huggingface.co/openai-community/gpt2-large
[^4]: https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct
[^5]: https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0
[^6]: https://huggingface.co/spaces/evaluate-metric/perplexity
[^7]: https://huggingface.co/spaces/evaluate-metric/bleu
[^8]: https://huggingface.co/spaces/evaluate-metric/rouge
[^9]: https://huggingface.co/spaces/evaluate-metric/meteor


In [1]:
from __future__ import annotations

import json
import os
import shutil
import sys
import time
from dataclasses import dataclass, field
from os import PathLike
from pathlib import Path
from typing import Any, Final, Optional

In [None]:
import torch
from datasets import Dataset, DatasetDict, load_dataset
from huggingface_hub import login
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    DataCollatorForLanguageModeling,
    EarlyStoppingCallback,
    PreTrainedModel,
    PreTrainedTokenizerFast,
    PreTrainedTokenizer,
    Trainer,
    TrainerCallback,
    TrainerControl,
    TrainerState,
    TrainingArguments,
    set_seed,
)
from transformers.integrations import TensorBoardCallback
from transformers.tokenization_utils_base import BatchEncoding

In [3]:
root: str = str(Path.cwd().parent.resolve())
if root not in sys.path:
    sys.path.insert(0, root)

In [4]:
from utils.dirpath import get_cache_dirpath, get_target_dirpath

In [5]:
# Get the HF access token from the environment
HF_ACCESS_TOKEN: Final[str] = os.getenv("HUGGINGFACE_HUB_TOKEN")

# Save the HF token to ~/.huggingface/token
login(token=HF_ACCESS_TOKEN)

In [6]:
# Map for the model identifiers: (model_key -> model_id)
MODEL_IDENTIFIERS: Final[dict[str, str]] = {
    "gpt2": "openai-community/gpt2",
    "gpt2-medium": "openai-community/gpt2-medium",
    "gpt2-large": "openai-community/gpt2-large",
    "llama-3.2-1b-instruct": "meta-llama/Llama-3.2-1B-Instruct",
    "tinyllama-1.1b-chat": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
}

In [7]:
# Map for the target modules: (model_key -> target_modules)
TARGET_MODULES: Final[dict[str, list[str]]] = {
    "gpt2": ["c_attn", "c_proj", "c_fc"],
    "gpt2-medium": ["c_attn", "c_proj", "c_fc"],
    "gpt2-large": ["c_attn", "c_proj", "c_fc"],
    "llama-3.2-1b-instruct": [
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    "tinyllama-1.1b-chat": [
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
}

In [8]:
# Set the constants for model tuning here
BATCH_SIZE: Final[int] = 4  # Per-device training batch size
SEED: Final[int] = 42  # Random seed for reproducibility
N_EPOCHS: Final[int] = 5  # Number of training epochs
LR_RATE: Final[float] = 5e-7  # Learning rate

MAX_LENGTH: Final[int] = 512  # Max token length for input sequences
MAX_GRAD_NORM: Final[float] = 1.0  # Gradient clipping threshold
LOGGING_STEPS: Final[int] = 10  # Steps between logging metrics
EVAL_STEPS: Final[int] = 10  # Steps between evaluations
WARMUP_STEPS: Final[int] = 50  # Learning rate warmup steps

SAVE_TOTAL_LIMIT: Final[int] = 1  # Max number of saved checkpoints
EVAL_ACCUMULATION_STEPS: Final[int] = 2  # Eval batch accumulation steps
GRADIENT_ACCUMULATION_STEPS: Final[int] = 2  # Grad batch accumulation steps

GRADIENT_CHECKPOINTING: Final[bool] = True  # Reduce memory usage (slower)
LOAD_BEST_MODEL_AT_END: Final[bool] = True  # Load best checkpoint by eval loss

ACTIVATE_FP16: Final[bool] = False  # Enable 16-bit mixed precision training
ACTIVATE_EVAL: Final[bool] = True  # Enable evaluation
ACTIVATE_SAVE: Final[bool] = True  # Enable checkpoint saving
ACTIVATE_LOGS: Final[bool] = False  # Enable logging to stdout
ACTIVATE_TENSORBOARD: Final[bool] = True  # Enable TensorBoard logging
ACTIVATE_CALLBACKS: Final[bool] = True  # Enable trainer callbacks
ACTIVATE_FULL: Final[bool] = False  # Use full dataset or 10% subset

FRACTION: Final[float] = 0.1  # % of dataset to use when not full (e.g., 10%)

MAX_NEW_TOKEN: Final[int] = 100  # Max new tokens to generate during inference
NUM_RETURN_SEQUENCES: Final[int] = 1  # Number of completions per prompt

TEMPERATURE: Final[float] = 0.8  # Controls randomness; lower is more deterministic
TOP_P: Final[float] = 0.9  # Sample from smallest set with cumulative prob ≥ top_p
TOP_K: Final[int] = 50  # Sample from the top-k most likely tokens (fixed size)
DO_SAMPLE: Final[bool] = True  # Enables sampling (if False, uses greedy decoding)
REPETITION_PENALTY: Final[float] = 1.1

In [9]:
def load_prompt_response_dataset(
    data_dir: PathLike = get_target_dirpath("data"),
    cache_dir: PathLike = get_cache_dirpath("data"),
) -> tuple[DatasetDict, list[str]]:
    def load_split_dataset(split_type: str) -> tuple[Dataset, Optional[str]]:
        nonlocal cache_dir

        split_dir: Path = Path(data_dir) / split_type
        split: Dataset = load_dataset(
            "json",
            data_files={split_type: str(split_dir / "quests.json")},
            split=split_type,
            cache_dir=str(Path(cache_dir) / split_type),
        )

        # If test set, remove response field and extract it
        if split_type == "test":
            test_responses: list[str] = split["response"]
            split = split.map(lambda entry: {**entry, "response": ""})
            return split, test_responses

        return split, None

    train_set: Dataset
    val_set: Dataset
    test_set: Dataset
    test_references: list[str]

    train_set, _ = load_split_dataset("train")
    val_set, _ = load_split_dataset("val")
    test_set, test_references = load_split_dataset("test")

    dataset: DatasetDict = DatasetDict(
        {
            "train": train_set,
            "val": val_set,
            "test": test_set,
        }
    )
    return dataset, test_references

In [10]:
@dataclass
class QuestDataset:
    records: DatasetDict
    references: list[str]

    @classmethod
    def load(
        cls,
        data_dir: PathLike = get_target_dirpath("data"),
        cache_dir: PathLike = get_cache_dirpath("data"),
    ) -> QuestDataset:
        return cls(*load_prompt_response_dataset(data_dir, cache_dir))

    def get_subset(self, fraction: float = FRACTION) -> QuestDataset:
        def get_split_set(split_type: str) -> Dataset:
            num_rows: int = int(fraction * self.records[split_type].num_rows)
            return self.records[split_type].select(range(num_rows))

        return QuestDataset(
            DatasetDict(
                {
                    "train": get_split_set("train"),
                    "val": get_split_set("val"),
                    "test": get_split_set("test"),
                }
            ),
            self.references[: int(fraction * len(self.references))],
        )

    def __repr__(self) -> str:
        return self.records.__repr__()

In [11]:
# Load the quest dataset
quest_set: QuestDataset = QuestDataset.load()
quest_set

Generating train split: 0 examples [00:00, ? examples/s]

Generating val split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

Map:   0%|          | 0/114 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['prompt', 'response'],
        num_rows: 907
    })
    val: Dataset({
        features: ['prompt', 'response'],
        num_rows: 113
    })
    test: Dataset({
        features: ['prompt', 'response'],
        num_rows: 114
    })
})

In [12]:
# Prepare a subset of the quest dataset
quest_subset: QuestDataset = quest_set.get_subset()
quest_subset

DatasetDict({
    train: Dataset({
        features: ['prompt', 'response'],
        num_rows: 90
    })
    val: Dataset({
        features: ['prompt', 'response'],
        num_rows: 11
    })
    test: Dataset({
        features: ['prompt', 'response'],
        num_rows: 11
    })
})

In [13]:
@dataclass
class TrainingMetrics:
    train_losses: list[float] = field(default_factory=list)
    eval_losses: list[float] = field(default_factory=list)
    learning_rates: list[float] = field(default_factory=list)
    grad_norms: list[float] = field(default_factory=list)
    global_steps: list[int] = field(default_factory=list)
    epochs: list[float] = field(default_factory=list)
    eval_results: list[dict[str, float]] = field(default_factory=list)

    def __repr__(self) -> str:
        return (
            f"TrainingMetrics(\n"
            f"  train_losses={self.train_losses},\n"
            f"  eval_losses={self.eval_losses},\n"
            f"  learning_rates={self.learning_rates},\n"
            f"  grad_norms={self.grad_norms},\n"
            f"  global_steps={self.global_steps},\n"
            f"  epochs={self.epochs},\n"
            f")"
        )

    def to_dict(self) -> dict[str, list[int | float]]:
        return {
            "train_losses": self.train_losses,
            "eval_losses": self.eval_losses,
            "learning_rates": self.learning_rates,
            "grad_norms": self.grad_norms,
            "global_steps": self.global_steps,
            "epochs": self.epochs,
        }


# Map for storing training metrics: (model_key -> metrics)
TRAINING_METRICS: dict[str, Optional[TrainingMetrics]] = {
    k: None for k in MODEL_IDENTIFIERS.keys()
}

In [14]:
class LossLoggerCallback(TrainerCallback):
    def __init__(self, metrics: TrainingMetrics):
        self.metrics: TrainingMetrics = metrics
        self.prev_epoch: Optional[float] = None

    def on_log(
        self,
        args: TrainingArguments,
        state: TrainerState,
        control: TrainerControl,
        logs: Optional[dict[str, float]] = None,
        **kwargs: Any,
    ) -> None:
        if logs is None:
            return

        # Capture training losses during logging
        if "loss" in logs:
            self.metrics.train_losses.append(logs["loss"])

        # Capture evaluation losses during logging
        if "eval_loss" in logs:
            self.metrics.eval_losses.append(logs["eval_loss"])

        # Capture learning rates during logging
        if "learning_rate" in logs:
            self.metrics.learning_rates.append(logs["learning_rate"])

        # Capture gradient norms during logging
        if "grad_norm" in logs:
            self.metrics.grad_norms.append(logs["grad_norm"])

        # Capture global steps consistently
        self.metrics.global_steps.append(state.global_step)

        # Only log the epoch once per epoch change
        if state.epoch is not None and state.epoch != self.prev_epoch:
            self.metrics.epochs.append(state.epoch)
            self.prev_epoch = state.epoch

        return super().on_log(args, state, control, **kwargs)

    def on_evaluate(
        self,
        args: TrainingArguments,
        state: TrainerState,
        control: TrainerControl,
        metrics: Optional[dict[str, float]] = None,
        **kwargs: Any,
    ) -> None:
        if metrics is None:
            return

        # Capture evaluation results on evaluation
        self.metrics.eval_results.append(metrics)

        return super().on_evaluate(args, state, control, **kwargs)

In [None]:
class QuestGenLLM:
    def __init__(
        self,
        tokenizer: PreTrainedTokenizer | PreTrainedTokenizerFast,
        model: PreTrainedModel,
        model_key: str,  # Alias for the model, e.g, "gpt2"
        model_id: str,  # Hugging Face model name, e.g., "openai-community/gpt2"
        fp16_available: bool,  # Mixed precision
        device: Optional[str] = None,
        dtype: Optional[str] = None,
        metrics: Optional[TrainingMetrics] = None,
    ):
        self.tokenizer: PreTrainedTokenizer | PreTrainedTokenizerFast = tokenizer
        self.model: PreTrainedModel = model
        self.model_key: str = model_key
        self.model_id: str = model_id
        self.fp16_available: bool = fp16_available

        # Automatically determine the device used by the model
        self.device: str = (
            device
            if isinstance(device, str)
            else str(getattr(self.model, "device", "N/A"))
        )

        # Automatically determine the dtype used by the model
        self.dtype: str = (
            dtype
            if isinstance(dtype, str)
            else str(getattr(self.model, "dtype", "N/A")).replace("torch.", "")
        )

        # Initialize dataclass for storing training metrics
        self.metrics: TrainingMetrics = (
            metrics if isinstance(metrics, TrainingMetrics) else TrainingMetrics()
        )

    @classmethod
    def from_pretrained(
        cls,
        model_key: str,
        model_id: Optional[str] = None,
        cache_dir: PathLike = get_cache_dirpath("models"),
        seed: int = SEED,
        use_cpu: bool = False,
    ) -> QuestGenLLM:
        def apply_lora_adapter(
            model: PreTrainedModel,
            r: int = 8,
            alpha: int = 16,
            dropout: float = 0.1,
            task_type: str = "CAUSAL_LM",
        ) -> PreTrainedModel:
            # Prepare model for k-bit training
            model = prepare_model_for_kbit_training(model)

            # Set correct `fan_in_fan_out` based on model type
            #
            # False [default] - for Linear layers
            # True - for Conv1D layers (like GPT)
            fan_in_fan_out: bool = False
            if "gpt" in getattr(model.config, "model_type", "").lower():
                fan_in_fan_out = True

            # Define the LoRA config
            lora_config: LoraConfig = LoraConfig(
                r=r,
                lora_alpha=alpha,
                lora_dropout=dropout,
                target_modules=TARGET_MODULES[model_key],
                bias="none",
                task_type=task_type,
                fan_in_fan_out=fan_in_fan_out,
            )

            try:
                # Apply LoRA adapters to the model
                model = get_peft_model(model, lora_config)
            except Exception as e:
                print(f"[LoRAINFO] Adapter failed to apply: {e}")
                raise

            # Display information about the model parameters
            trainable_params: int = sum(
                p.numel() for p in model.parameters() if p.requires_grad
            )
            all_params: int = sum(p.numel() for p in model.parameters())
            trainable_percent: float = 100 * trainable_params / all_params
            print(
                "[LoRAINFO] trainable params: {:,} || all params: {:,} || trainable%: {:.4f}".format(
                    trainable_params, all_params, trainable_percent
                )
            )

            return model

        if not model_id:
            model_id = MODEL_IDENTIFIERS[model_key]

        print(f"[DOWNLOAD] {model_key} ({model_id})")
        start_time: float = time.time()

        # Clear PyTorch's CUDA memory cache
        torch.cuda.empty_cache()

        # Set the random seed for reproducibility
        set_seed(seed)

        # Determine if mixed precision is available
        fp16_available: bool = (
            torch.cuda.is_available()
            and torch.cuda.get_device_capability(0)[0] >= 7
            and torch.cuda.get_device_capability(0)[1] >= 0
        )

        # Download the tokenizer using the model id
        tokenizer: PreTrainedTokenizerFast = AutoTokenizer.from_pretrained(
            model_id,
            cache_dir=(Path(cache_dir) / model_key),
            use_fast=True,
            token=HF_ACCESS_TOKEN,
            trust_remote_code=True,
        )

        model: PreTrainedModel
        cache_dir: str = str(Path(cache_dir) / model_key)

        if fp16_available and not use_cpu:
            # Set the bitsandbytes configuration for quantization
            bnb_config: BitsAndBytesConfig = BitsAndBytesConfig(
                load_in_4bit=True,
                bnb_4bit_quant_type="nf4",
                bnb_4bit_use_double_quant=True,
                bnb_4bit_compute_dtype=torch.float16,
                # llm_int8_enable_fp32_cpu_offload=True,
            )

            # Download the model using the model id (for GPU)
            model = AutoModelForCausalLM.from_pretrained(
                model_id,
                torch_dtype=torch.float16,
                quantization_config=bnb_config,
                cache_dir=cache_dir,
                token=HF_ACCESS_TOKEN,
                trust_remote_code=True,
                low_cpu_mem_usage=True,
            )
            model.to("cuda")
        else:
            # Download the model using the model id (for CPU)
            model = AutoModelForCausalLM.from_pretrained(
                model_id,
                torch_dtype=torch.float32,
                cache_dir=cache_dir,
                token=HF_ACCESS_TOKEN,
                trust_remote_code=True,
                low_cpu_mem_usage=True,
            )
            model.to("cpu")

        # Apply the LoRA adapters to the model
        model = apply_lora_adapter(model)

        end_time: float = time.time()
        elapsed: float = end_time - start_time
        print(f'[COMPLETE] "{model_key}" ready in {elapsed:.2f}s.\n')

        return cls(tokenizer, model, model_key, model_id, fp16_available)

    def train_and_evaluate(
        self,
        dataset: DatasetDict = (
            quest_set.records if ACTIVATE_FULL else quest_subset.records
        ),
        max_length: int = MAX_LENGTH,
        learning_rate: int = LR_RATE,
        batch_size: int = BATCH_SIZE,
        epochs: int = N_EPOCHS,
        seed: int = SEED,
        max_grad_norm: float = MAX_GRAD_NORM,
        logging_steps: int = LOGGING_STEPS,
        eval_steps: int = EVAL_STEPS,
        warmup_steps: int = WARMUP_STEPS,
        gradient_checkpointing: bool = GRADIENT_CHECKPOINTING,
        load_best_model_at_end: bool = LOAD_BEST_MODEL_AT_END,
        save_total_limit: int = SAVE_TOTAL_LIMIT,
        eval_accumulation_steps: int = EVAL_ACCUMULATION_STEPS,
        gradient_accumulation_steps: int = GRADIENT_ACCUMULATION_STEPS,
        callbacks: list[TrainerCallback] = [
            EarlyStoppingCallback(early_stopping_patience=2) if ACTIVATE_EVAL else None,
            TensorBoardCallback() if ACTIVATE_TENSORBOARD else None,
        ],
        activate_fp16: bool = ACTIVATE_FP16,
        activate_eval: bool = ACTIVATE_EVAL,
        activate_save: bool = ACTIVATE_SAVE,
        activate_logs: bool = ACTIVATE_LOGS,
        activate_tensorboard: bool = ACTIVATE_TENSORBOARD,
        activate_callbacks: bool = ACTIVATE_CALLBACKS,
        output_dir: PathLike = get_target_dirpath("out"),
        logging_dir: PathLike = get_target_dirpath("logs"),
    ) -> TrainingMetrics:
        # Ensure the training and validation sets
        if not all(split in dataset for split in ["train", "val"]):
            raise ValueError("DatasetDict must contain both 'train' and 'val' splits.")

        # Ensure the output and logging directories
        os.makedirs(output_dir, exist_ok=True)
        os.makedirs(logging_dir, exist_ok=True)

        start_time: float
        end_time: float
        elapsed: float

        # Set the random seed for reproducibility
        set_seed(seed)

        # Set the padding token for the tokenizer
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token
        self.tokenizer.padding_side = "right"

        # Tokenize the dataset with `max_length` padding
        print(f"[TOKENIZE] {self.model_key} ({self.model_id})")
        start_time = time.time()
        tokenized_data: Dataset = dataset.map(
            QuestGenLLM.tokenize_dataset,
            batched=True,
            remove_columns=["prompt", "response"],
            fn_kwargs={"tokenizer": self.tokenizer, "max_length": max_length},
        )
        end_time = time.time()
        elapsed = end_time - start_time
        print(f"[COMPLETE] Elapsed: {elapsed:.2f}s\n")

        # Set the model padding token (from the tokenizer)
        self.model.config.pad_token_id = self.tokenizer.pad_token_id

        # Turn off `use_cache` if `gradient_checkpointing` is on
        self.model.config.use_cache = not gradient_checkpointing

        # Set up the training configurations
        training_args: TrainingArguments = TrainingArguments(
            output_dir=(Path(output_dir) / self.model_key),
            learning_rate=learning_rate,
            per_device_train_batch_size=batch_size,
            per_device_eval_batch_size=batch_size,
            num_train_epochs=epochs,
            log_level=("info" if activate_logs else "error"),
            logging_steps=logging_steps,
            eval_steps=eval_steps,
            eval_strategy=("epoch" if activate_eval else "no"),
            save_strategy=("epoch" if activate_save else "no"),
            logging_dir=(Path(logging_dir) / self.model_key),
            save_total_limit=save_total_limit,
            eval_accumulation_steps=eval_accumulation_steps,
            gradient_accumulation_steps=gradient_accumulation_steps,
            gradient_checkpointing=gradient_checkpointing,
            fp16=(self.fp16_available and activate_fp16),
            load_best_model_at_end=load_best_model_at_end,
            metric_for_best_model="eval_loss",
            seed=seed,
            report_to=("tensorboard" if activate_tensorboard else "none"),
            label_names=["labels"],
            max_grad_norm=max_grad_norm,
            warmup_steps=warmup_steps,
            logging_nan_inf_filter=True,
            skip_memory_metrics=True,
            lr_scheduler_type="cosine",
            push_to_hub=False,
            disable_tqdm=False,
        )

        # Set up the data collator for the model
        data_collator: DataCollatorForLanguageModeling = (
            DataCollatorForLanguageModeling(
                self.tokenizer, mlm=False, return_tensors="pt"
            )
        )

        # Set up the callbacks for the trainer
        trainer_callbacks: list[TrainerCallback] = list(
            filter(lambda callback: callback is not None, callbacks)
        )
        if activate_callbacks:
            trainer_callbacks.append(LossLoggerCallback(self.metrics))

        # Prepare and run the trainer
        trainer: Trainer = Trainer(
            model=self.model,
            args=training_args,
            data_collator=data_collator,
            train_dataset=tokenized_data["train"],
            eval_dataset=(tokenized_data["val"] if activate_eval else None),
            callbacks=trainer_callbacks,
        )

        print(f"[FINETUNE] {self.model_key} ({self.model_id})")
        start_time = time.time()
        trainer.train()
        end_time = time.time()
        elapsed = end_time - start_time
        print(f"[COMPLETE] Elapsed: {elapsed:.2f}s\n")

        # Save the model and tokenizer for later use
        if activate_save:
            trainer.save_model()
            self.tokenizer.save_pretrained(save_directory=training_args.output_dir)

        # Add to the training metrics map
        TRAINING_METRICS[self.model_key] = self.metrics

        return self.metrics

    @staticmethod
    def tokenize_dataset(
        examples: dict[str, list[str]],
        tokenizer: PreTrainedTokenizer | PreTrainedTokenizerFast,
        max_length: int = MAX_LENGTH,
    ) -> dict[str, list[list[int]]]:
        def perform_tokenization(input_texts: list[str]) -> BatchEncoding:
            return tokenizer(
                input_texts,
                padding="max_length",
                truncation=True,
                max_length=max_length,
                return_tensors="pt",
            )

        inputs: BatchEncoding = perform_tokenization(examples["prompt"])
        labels: BatchEncoding = perform_tokenization(examples["response"])

        input_ids: list[list[int]] = inputs["input_ids"]
        attention_mask: list[list[int]] = inputs["attention_mask"]
        label_ids: list[list[int]] = labels["input_ids"]
        label_ids[label_ids == tokenizer.pad_token_id] = -100

        return {
            "input_ids": input_ids.tolist(),
            "attention_mask": attention_mask.tolist(),
            "labels": label_ids.tolist(),
        }

    def generate(
        self,
        dataset: DatasetDict,
        max_new_tokens: int = MAX_NEW_TOKEN,
        num_return_sequences: int = NUM_RETURN_SEQUENCES,
        temperature: float = TEMPERATURE,
        top_p: float = TOP_P,
        top_k: int = TOP_K,
        do_sample: bool = True,
        repetition_penalty: float = REPETITION_PENALTY,
    ) -> list[str]:
        input_model: PreTrainedModel = self.model.base_model.model
        input_model.eval()  # Set the model to evaluation mode

        ...  # TODO: Add logic here...

    def to_dict(self) -> dict[str, Any]:
        return {
            "model_key": self.model_key,
            "model_id": self.model_id,
            "device": self.device,
            "dtype": self.dtype,
            "vocab_size": getattr(self.tokenizer, "vocab_size", "unknown"),
            "max_length": getattr(self.tokenizer, "model_max_length", "unknown"),
            "model_type": getattr(
                getattr(self.model, "config", None), "model_type", "unknown"
            ),
            "num_parameters": self.model.num_parameters()
            if hasattr(self.model, "num_parameters")
            else "N/A",
            "fp16_available": self.fp16_available,
        }

    def clear_cache(self, cache_dir: PathLike = get_cache_dirpath("models")) -> None:
        def remove_dir(dir_path: PathLike) -> None:
            if os.path.exists(dir_path):
                shutil.rmtree(dir_path)
                print(f"Cache directory '{dir_path}' removed.")
            else:
                print(f"No cache directory found at '{dir_path}'.")

        remove_dir(Path(cache_dir) / self.model_key)

    def print_model_information(self) -> None:
        print(json.dumps(self.to_dict(), indent=2))

    def inspect(self) -> None:
        print(f"{self.tokenizer}\n\n{self.model}\n\n{self.model.config}")
        print(
            "Token Type                        | Value",
            "----------------------------------+-------------------",
            f"Padding Token [PAD]               | {self.tokenizer.pad_token}",
            f"Beginning of Sentence Token [BOS] | {self.tokenizer.bos_token}",
            f"End of Sentence Token [EOS]       | {self.tokenizer.eos_token}",
            f"Unknown Token [UNK]               | {self.tokenizer.unk_token}",
            sep="\n",
            end="\n\n",
        )

    def __str__(self) -> str:
        return f"{self.model_key} ({self.model_id})"

In [16]:
# Build and train the GPT-2 Base model with the quest data
gpt2_base: QuestGenLLM = QuestGenLLM.from_pretrained("gpt2")
gpt2_base.inspect()
gpt2_base.train_and_evaluate()

[DOWNLOAD] gpt2 (openai-community/gpt2)
[LoRAINFO] trainable params: 1,179,648 || all params: 83,152,128 || trainable%: 1.4187
[COMPLETE] "gpt2" ready in 11.64s.

GPT2TokenizerFast(name_or_path='openai-community/gpt2', vocab_size=50257, model_max_length=1024, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
	50256: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
}
)

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): GPT2LMHeadModel(
      (transformer): GPT2Model(
        (wte): Embedding(50257, 768)
        (wpe): Embedding(1024, 768)
        (drop): Dropout(p=0.1, inplace=False)
        (h): ModuleList(
          (0-11): 12 x GPT2Block(
            (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (attn): GPT2Attent

Map:   0%|          | 0/90 [00:00<?, ? examples/s]

Map:   0%|          | 0/11 [00:00<?, ? examples/s]

Map:   0%|          | 0/11 [00:00<?, ? examples/s]

[COMPLETE] Elapsed: 0.59s

[FINETUNE] gpt2 (openai-community/gpt2)


Epoch,Training Loss,Validation Loss
1,3.9766,3.61339
2,3.9957,3.613118
3,3.9254,3.612682
4,4.0279,3.61169


[COMPLETE] Elapsed: 91.16s



TrainingMetrics(
  train_losses=[3.9766, 3.9957, 3.9254, 4.0764, 4.0279],
  eval_losses=[3.613389730453491, 3.6131179332733154, 3.6126816272735596, 3.6120150089263916, 3.6116902828216553],
  learning_rates=[1e-07, 2e-07, 3e-07, 4e-07, 5e-07],
  grad_norms=[0.7438549995422363, 0.7376565337181091, 0.6230230331420898, 0.9048134088516235, 0.683418333530426],
  global_steps=[10, 12, 20, 24, 30, 36, 40, 48, 50, 55, 55],
  epochs=[0.8695652173913043, 1.0, 1.6956521739130435, 2.0, 2.5217391304347827, 3.0, 3.3478260869565215, 4.0, 4.173913043478261, 4.608695652173913],
)

In [17]:
prompt: str = quest_subset.records["test"][0]["prompt"]
print(prompt)

### Instruction:
Generate a video game quest description based on the following structured information.

### Input:
Quest Name: Guard the Brahmin
Objective: Protect a herd of Brahmin from nighttime rustlers.
First Tasks: Meet with the rancher to begin the guard duty., Stay alert during the night for any rustlers., Defeat any attackers attempting to steal the Brahmin.
First Task Locations: Klamath - A small settlement known for Brahmin herding.
Quest Giver: NONE - NONE (location: NONE)
Reward: Caps - A small payment for successfully guarding the Brahmin. (amount: 100), Reputation Boost - Improved reputation in Klamath for assisting the town. (amount: NONE)
Characters: NONE
Tools: Weapons to defend against rustlers., Stealth or perception skills to detect incoming threats.
Locations: NONE
Items: NONE
Enemies: NONE
Groups: NONE
Title: Fallout 2
Motivation: NONE

### Response:


In [18]:
responses: list[str] = gpt2_base.generate(prompt)
for idx, response in enumerate(responses):
    print(f"{response}\n\n### Reference:\n{quest_subset.references[idx]}")

### Instruction:
Generate a video game quest description based on the following structured information.

### Input:
Quest Name: Guard the Brahmin
Objective: Protect a herd of Brahmin from nighttime rustlers.
First Tasks: Meet with the rancher to begin the guard duty., Stay alert during the night for any rustlers., Defeat any attackers attempting to steal the Brahmin.
First Task Locations: Klamath - A small settlement known for Brahmin herding.
Quest Giver: NONE - NONE (location: NONE)
Reward: Caps - A small payment for successfully guarding the Brahmin. (amount: 100), Reputation Boost - Improved reputation in Klamath for assisting the town. (amount: NONE)
Characters: NONE
Tools: Weapons to defend against rustlers., Stealth or perception skills to detect incoming threats.
Locations: NONE
Items: NONE
Enemies: NONE
Groups: NONE
Title: Fallout 2
Motivation: NONE

### Response: [1] This is just an idea, but you can do it!

-A_B__ "It's not like I am going to have that experience at my age a

In [17]:
# Build and train the GPT-2 Medium model with the quest data
gpt2_medium: QuestGenLLM = QuestGenLLM.from_pretrained("gpt2-medium")
gpt2_medium.inspect()
gpt2_medium.train_and_evaluate()

[DOWNLOAD] gpt2-medium (openai-community/gpt2-medium)
[LoRAINFO] trainable params: 3,145,728 || all params: 206,973,952 || trainable%: 1.5199
[COMPLETE] "gpt2-medium" ready in 33.70s.

GPT2TokenizerFast(name_or_path='openai-community/gpt2-medium', vocab_size=50257, model_max_length=1024, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
	50256: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
}
)

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): GPT2LMHeadModel(
      (transformer): GPT2Model(
        (wte): Embedding(50257, 1024)
        (wpe): Embedding(1024, 1024)
        (drop): Dropout(p=0.1, inplace=False)
        (h): ModuleList(
          (0-23): 24 x GPT2Block(
            (ln_1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True

Epoch,Training Loss,Validation Loss
1,3.3696,3.173976
2,3.3692,3.173296


[COMPLETE] Elapsed: 158.11s



TrainingMetrics(
  train_losses=[3.3696, 3.3752, 3.3692],
  eval_losses=[3.173976182937622, 3.1736626625061035, 3.173295736312866],
  learning_rates=[1e-07, 2e-07, 3e-07],
  grad_norms=[0.3051488697528839, 0.42861658334732056, 0.36301547288894653],
  global_steps=[10, 12, 20, 24, 30, 33, 33],
  epochs=[0.8695652173913043, 1.0, 1.6956521739130435, 2.0, 2.5217391304347827, 2.782608695652174],
)

In [18]:
# Build and train the GPT-2 Large model with the quest data
gpt2_large: QuestGenLLM = QuestGenLLM.from_pretrained("gpt2-large")
gpt2_large.inspect()
gpt2_large.train_and_evaluate()

[DOWNLOAD] gpt2-large (openai-community/gpt2-large)
[LoRAINFO] trainable params: 5,898,240 || all params: 426,033,920 || trainable%: 1.3845
[COMPLETE] "gpt2-large" ready in 72.03s.

GPT2TokenizerFast(name_or_path='openai-community/gpt2-large', vocab_size=50257, model_max_length=1024, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
	50256: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
}
)

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): GPT2LMHeadModel(
      (transformer): GPT2Model(
        (wte): Embedding(50257, 1280)
        (wpe): Embedding(1024, 1280)
        (drop): Dropout(p=0.1, inplace=False)
        (h): ModuleList(
          (0-35): 36 x GPT2Block(
            (ln_1): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
  

Map:   0%|          | 0/11 [00:00<?, ? examples/s]

[COMPLETE] Elapsed: 0.33s

[FINETUNE] gpt2-large (openai-community/gpt2-large)


Epoch,Training Loss,Validation Loss
1,3.1362,3.067715
2,3.1103,3.066693


[COMPLETE] Elapsed: 329.98s



TrainingMetrics(
  train_losses=[3.1362, 3.1325, 3.1103],
  eval_losses=[3.0677154064178467, 3.067249059677124, 3.066693067550659],
  learning_rates=[1e-07, 2e-07, 3e-07],
  grad_norms=[0.3436407148838043, 0.3017879128456116, 0.33828988671302795],
  global_steps=[10, 12, 20, 24, 30, 33, 33],
  epochs=[0.8695652173913043, 1.0, 1.6956521739130435, 2.0, 2.5217391304347827, 2.782608695652174],
)

In [19]:
# Build and train the Llama 3.2 model with the quest data
llama32: QuestGenLLM = QuestGenLLM.from_pretrained("llama-3.2-1b-instruct")
llama32.inspect()
llama32.train_and_evaluate()

[DOWNLOAD] llama-3.2-1b-instruct (meta-llama/Llama-3.2-1B-Instruct)
[LoRAINFO] trainable params: 5,636,096 || all params: 754,911,232 || trainable%: 0.7466
[COMPLETE] "llama-3.2-1b-instruct" ready in 57.36s.

PreTrainedTokenizerFast(name_or_path='meta-llama/Llama-3.2-1B-Instruct', vocab_size=128000, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|begin_of_text|>', 'eos_token': '<|eot_id|>'}, clean_up_tokenization_spaces=True, added_tokens_decoder={
	128000: AddedToken("<|begin_of_text|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128001: AddedToken("<|end_of_text|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128002: AddedToken("<|reserved_special_token_0|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128003: AddedToken("<|reserved_special_token_1|>", rstrip=False, lstrip=False, single_word=False, normal

Map:   0%|          | 0/11 [00:00<?, ? examples/s]

[COMPLETE] Elapsed: 0.88s

[FINETUNE] llama-3.2-1b-instruct (meta-llama/Llama-3.2-1B-Instruct)


Epoch,Training Loss,Validation Loss
1,3.1957,3.158548
2,3.1614,3.154496


[COMPLETE] Elapsed: 2633.11s



TrainingMetrics(
  train_losses=[3.1957, 3.1895, 3.1614],
  eval_losses=[3.158547878265381, 3.1567068099975586, 3.154496192932129],
  learning_rates=[1e-07, 2e-07, 3e-07],
  grad_norms=[1.0860451459884644, 1.2568928003311157, 1.148056983947754],
  global_steps=[10, 12, 20, 24, 30, 33, 33],
  epochs=[0.8695652173913043, 1.0, 1.6956521739130435, 2.0, 2.5217391304347827, 2.782608695652174],
)

In [20]:
# Build and train the TinyLlama model with the quest data
tinyllama: QuestGenLLM = QuestGenLLM.from_pretrained("tinyllama-1.1b-chat")
tinyllama.inspect()
tinyllama.train_and_evaluate()

[DOWNLOAD] tinyllama-1.1b-chat (TinyLlama/TinyLlama-1.1B-Chat-v1.0)
[LoRAINFO] trainable params: 6,307,840 || all params: 621,914,112 || trainable%: 1.0143
[COMPLETE] "tinyllama-1.1b-chat" ready in 74.81s.

LlamaTokenizerFast(name_or_path='TinyLlama/TinyLlama-1.1B-Chat-v1.0', vocab_size=32000, model_max_length=2048, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '</s>'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
	0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
)

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(32000,

Map:   0%|          | 0/11 [00:00<?, ? examples/s]

[COMPLETE] Elapsed: 0.36s

[FINETUNE] tinyllama-1.1b-chat (TinyLlama/TinyLlama-1.1B-Chat-v1.0)


Epoch,Training Loss,Validation Loss
1,2.5607,2.468454
2,2.5484,2.466187


[COMPLETE] Elapsed: 11361.77s



TrainingMetrics(
  train_losses=[2.5607, 2.5473, 2.5484],
  eval_losses=[2.468454122543335, 2.4674222469329834, 2.466186761856079],
  learning_rates=[1e-07, 2e-07, 3e-07],
  grad_norms=[0.7706165909767151, 0.831169605255127, 0.7596113085746765],
  global_steps=[10, 12, 20, 24, 30, 33, 33],
  epochs=[0.8695652173913043, 1.0, 1.6956521739130435, 2.0, 2.5217391304347827, 2.782608695652174],
)

In [21]:
def save_training_metrics(output_dir: PathLike = get_target_dirpath("out")) -> None:
    metrics_dict: dict[str, dict[str, list[int | float]]] = {
        k: v.to_dict() if isinstance(v, TrainingMetrics) else None
        for k, v in TRAINING_METRICS.items()
    }

    json_file_path: Path = Path(output_dir) / "training_metrics.json"
    with open(json_file_path, "w") as json_writer:
        json.dump(metrics_dict, json_writer, indent=2)

    print(f"Saved to {json_file_path}")


save_training_metrics()  # Save metrics for future evaluations

Saved to /app/out/training_metrics.json
