## QuestGen-LLM: Fine-Tuning & Evaluation

This notebook covers the fine-tuning of several pre-trained _large language models_ (LLMs) on the prepared ["quest"](../data/quests_train.json) dataset. Each model is trained and validated on the dataset with its base parameters frozen (only the LoRA adapter weights are updated), and the results of these evaluations are compared. The LLMs used are listed in the following table with their respective parameter counts.

| S. No. | Large Language Model             | Parameters | Developed By | Notes                                                 |
| :----: | :------------------------------- | :--------: | :----------: | :---------------------------------------------------- |
|   1.   | GPT-2[^1]                        |    124M    |    OpenAI    | Base model from the GPT-2 family                      |
|   2.   | GPT-2 Medium[^2]                 |    355M    |    OpenAI    | Larger variant with improved language modeling        |
|   3.   | GPT-2 Large[^3]                  |    774M    |    OpenAI    | Capable of generating more coherent longer text       |
|   4.   | Llama-3.2-1B-Instruct[^4] †      |     1B     |     Meta     | Instruction-tuned model for question-answering        |
|   5.   | TinyLlama-1.1B-Chat-v1.0[^5] \*† |    1.1B    |  TinyLlama   | Lightweight chat-tuned model for constrained hardware |

> Fine-tuning uses _supervised fine-tuning_\* (SFT) and _reinforcement learning from human feedback_† (RLHF).

The notebook also covers the performance evaluation of these LLMs after fine-tuning on the "quest" dataset. Quest descriptions generated for the test set are compared with their reference responses and scored with the following metrics:

| S. No. | Metric         | Description                                                   | Preference                               |
| :----: | -------------- | ------------------------------------------------------------- | ---------------------------------------- |
|   1.   | Perplexity[^6] | Measures how "confused" the model is about its predictions.   | Lower values indicate less uncertainty.  |
|   2.   | BLEU[^7]       | Compares n-gram overlap between generated and reference text. | Higher values indicate more overlap.     |
|   3.   | ROUGE[^8]      | Measures how much reference content is captured (recall).     | Higher values indicate better recall.    |
|   4.   | METEOR[^9]     | Evaluates similarity using synonyms, stems, and word order.   | Higher values indicate better alignment. |

> Additionally, a _human evaluation method_ can further assess qualities like creativity, fluency, and coherence.

Note that:

- **BLEU:** Bilingual Evaluation Understudy
- **ROUGE:** Recall-Oriented Understudy for Gisting Evaluation
- **METEOR:** Metric for Evaluation of Translation with Explicit ORdering
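
The overlap idea behind BLEU and ROUGE can be sketched with the standard library alone. The snippet below is a simplified illustration, not the implementation behind the `evaluate` metrics: it computes clipped n-gram precision (BLEU-style) and n-gram recall (ROUGE-N-style) on whitespace tokens, and omits BLEU's brevity penalty and METEOR's stem and synonym matching. The function names and the two quest sentences are hypothetical.

```python
from collections import Counter


def ngrams(tokens: list[str], n: int) -> Counter:
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1))


def overlap_scores(candidate: str, reference: str, n: int = 1) -> tuple[float, float]:
    """Return (clipped n-gram precision, n-gram recall) on whitespace tokens."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count per n-gram ("clipping")
    matched = sum((cand & ref).values())
    precision = matched / max(sum(cand.values()), 1)  # BLEU-style
    recall = matched / max(sum(ref.values()), 1)  # ROUGE-N-style
    return precision, recall


# Hypothetical quest descriptions, for illustration only
reference = "retrieve the lost sword from the ruined keep"
candidate = "retrieve the sword from the keep"

precision, recall = overlap_scores(candidate, reference, n=1)
print(f"unigram precision={precision:.2f}, recall={recall:.2f}")
```

Here every candidate word appears in the reference, so precision is high, while the reference words missing from the candidate lower the recall. This is why the two families of metrics are reported together.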

<!-- References -->

[^1]: https://huggingface.co/openai-community/gpt2
[^2]: https://huggingface.co/openai-community/gpt2-medium
[^3]: https://huggingface.co/openai-community/gpt2-large
[^4]: https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct
[^5]: https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0
[^6]: https://huggingface.co/spaces/evaluate-metric/perplexity
[^7]: https://huggingface.co/spaces/evaluate-metric/bleu
[^8]: https://huggingface.co/spaces/evaluate-metric/rouge
[^9]: https://huggingface.co/spaces/evaluate-metric/meteor


In [None]:
from __future__ import annotations

import json
import os
import shutil
import sys
import time
from dataclasses import dataclass, field
from os import PathLike
from pathlib import Path
from typing import Any, Final, Optional

In [None]:
import torch
from datasets import Dataset, DatasetDict, load_dataset
from huggingface_hub import login
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    DataCollatorForLanguageModeling,
    EarlyStoppingCallback,
    PreTrainedModel,
    PreTrainedTokenizerFast,
    PreTrainedTokenizer,
    TextGenerationPipeline,
    Trainer,
    TrainerCallback,
    TrainerControl,
    TrainerState,
    TrainingArguments,
    set_seed,
)
from transformers.integrations import TensorBoardCallback
from transformers.tokenization_utils_base import BatchEncoding

In [None]:
root: str = str(Path.cwd().parent.resolve())
if root not in sys.path:
    sys.path.insert(0, root)

In [None]:
from utils.dirpath import get_cache_dirpath, get_target_dirpath

In [None]:
# Get the HF access token from the environment (None if the variable is unset)
HF_ACCESS_TOKEN: Final[Optional[str]] = os.getenv("HUGGINGFACE_HUB_TOKEN")

# Log in to the Hugging Face Hub; the token is cached locally for later calls
login(token=HF_ACCESS_TOKEN)

In [None]:
# Map for the model identifiers: (model_key -> model_id)
MODEL_IDENTIFIERS: Final[dict[str, str]] = {
    "gpt2": "openai-community/gpt2",
    "gpt2-medium": "openai-community/gpt2-medium",
    "gpt2-large": "openai-community/gpt2-large",
    "llama-3.2-1b-instruct": "meta-llama/Llama-3.2-1B-Instruct",
    "tinyllama-1.1b-chat": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
}

In [None]:
data_dir: Path = get_target_dirpath("data")

# Load the quest dataset
quest_set: DatasetDict = load_dataset(
    "text",
    data_files={
        "train": str(data_dir / "quests_train.txt"),
        "val": str(data_dir / "quests_val.txt"),
        "test": str(data_dir / "quests_test.txt"),
    },
    cache_dir=str(data_dir / ".cache"),
)
quest_set

In [None]:
quest_set["train"][:21]

In [None]:
quest_set["val"][22:43]

In [None]:
# Map for the target modules: (model_key -> target_modules)
TARGET_MODULES: Final[dict[str, list[str]]] = {
    "gpt2": ["c_attn", "c_proj", "c_fc"],
    "gpt2-medium": ["c_attn", "c_proj", "c_fc"],
    "gpt2-large": ["c_attn", "c_proj", "c_fc"],
    "llama-3.2-1b-instruct": [
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    "tinyllama-1.1b-chat": [
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
}
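
To get a feel for what these target modules cost, the number of trainable LoRA parameters can be estimated by hand. The sketch below assumes the base GPT-2 layout (hidden size 768, 12 transformer blocks), the rank `r = 8` that the notebook uses as its default later on, and that the name `c_proj` matches both the attention and the MLP output projections. The helper names are hypothetical, and the figure is an estimate rather than output captured from this notebook.

```python
# Assumed GPT-2 (124M) projection shapes as (d_in, d_out) per transformer block
GPT2_TARGET_SHAPES: dict[str, tuple[int, int]] = {
    "attn.c_attn": (768, 2304),  # fused query/key/value projection
    "attn.c_proj": (768, 768),  # attention output projection
    "mlp.c_fc": (768, 3072),  # MLP up projection
    "mlp.c_proj": (3072, 768),  # MLP down projection
}
N_BLOCKS = 12  # assumed number of transformer blocks in base GPT-2


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters of one LoRA pair: A has shape (r, d_in), B has shape (d_out, r)."""
    return r * (d_in + d_out)


def total_lora_params(r: int = 8) -> int:
    """Sum the LoRA parameters over all targeted projections and blocks."""
    per_block = sum(
        lora_params(d_in, d_out, r) for d_in, d_out in GPT2_TARGET_SHAPES.values()
    )
    return N_BLOCKS * per_block


print(f"Estimated LoRA parameters for GPT-2 at r=8: {total_lora_params(8):,}")
```

Because the count grows linearly with `r`, doubling the rank doubles the adapter size while the frozen base weights stay untouched.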

In [None]:
# Set the constants for model tuning here
BATCH_SIZE: Final[int] = 4
SEED: Final[int] = 42
N_EPOCHS: Final[int] = 1  # Change to 3 for full dataset
LR_RATE: Final[float] = 5e-7

MAX_LENGTH: Final[int] = 128
MAX_GRAD_NORM: Final[float] = 1.0
LOGGING_STEPS: Final[int] = 20
EVAL_STEPS: Final[int] = 50
WARMUP_STEPS: Final[int] = 50

SAVE_TOTAL_LIMIT: Final[int] = 1
EVAL_ACCUMULATION_STEPS: Final[int] = 2
GRADIENT_ACCUMULATION_STEPS: Final[int] = 2

GRADIENT_CHECKPOINTING: Final[bool] = False  # Turn off for CPU training
LOAD_BEST_MODEL_AT_END: Final[bool] = True

ACTIVATE_FP16: Final[bool] = False
ACTIVATE_EVAL: Final[bool] = True
ACTIVATE_SAVE: Final[bool] = True
ACTIVATE_LOGS: Final[bool] = False
ACTIVATE_TENSORBOARD: Final[bool] = True
ACTIVATE_CALLBACKS: Final[bool] = True
ACTIVATE_FULL: Final[bool] = False  # Full dataset or subset

FRACTION: Final[float] = 0.01  # 1% of the Quest dataset
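
These constants interact: the effective batch size is `BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS` (8 here), and on a small subset the total number of optimizer updates can be lower than `WARMUP_STEPS`. The sketch below approximates the update count for a single device and evaluates a linear-warmup, cosine-decay schedule by hand. The helper names and the 200-row subset size are hypothetical, and the step count is an approximation of what a trainer would compute.

```python
import math


def optimizer_steps(n_rows: int, batch_size: int, grad_accum: int, epochs: int) -> int:
    """Approximate number of optimizer updates on a single device."""
    batches_per_epoch = math.ceil(n_rows / batch_size)
    updates_per_epoch = max(math.ceil(batches_per_epoch / grad_accum), 1)
    return updates_per_epoch * epochs


def lr_at(step: int, peak_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Linear warmup to `peak_lr`, then half-cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Hypothetical subset size (200 training rows), not taken from the dataset
total = optimizer_steps(n_rows=200, batch_size=4, grad_accum=2, epochs=1)
final_lr = lr_at(total, peak_lr=5e-7, warmup_steps=50, total_steps=total)
print(f"updates={total}, learning rate at the last update={final_lr:.2e}")
```

With these assumed numbers the run stops after 25 updates, still inside the 50-step warmup, so the learning rate only reaches half of its peak. Lowering the warmup or training on more data avoids this.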

In [None]:
# Prepare a subset of the quest dataset
quest_subset: DatasetDict = DatasetDict(
    {
        "train": quest_set["train"].select(
            range(int(FRACTION * quest_set["train"].num_rows))
        ),
        "val": quest_set["val"].select(
            range(int(FRACTION * quest_set["val"].num_rows))
        ),
        "test": quest_set["test"].select(
            range(int(FRACTION * quest_set["test"].num_rows))
        ),
    }
)
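
The subset size is `int(FRACTION * num_rows)`, which truncates toward zero. The sketch below shows the arithmetic with hypothetical split sizes; the helper name and the row counts are illustrative only and are not read from the quest dataset.

```python
def subset_size(num_rows: int, fraction: float) -> int:
    """Rows kept by `select(range(int(fraction * num_rows)))`."""
    return int(fraction * num_rows)


# Hypothetical split sizes, for illustration only
SPLIT_ROWS: dict[str, int] = {"train": 25_000, "val": 3_150, "test": 80}

for name, rows in SPLIT_ROWS.items():
    kept = subset_size(rows, 0.01)
    note = "  <- empty split" if kept == 0 else ""
    print(f"{name}: {rows} rows -> {kept} kept{note}")
```

Two consequences are worth keeping in mind: a split with fewer than 100 rows becomes empty at a 1% fraction, and `select(range(...))` keeps the leading rows rather than a random sample. Shuffling the split with a fixed seed before selecting is one way to get a more representative subset.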

In [None]:
@dataclass
class TrainingMetrics:
    train_losses: list[float] = field(default_factory=list)
    eval_losses: list[float] = field(default_factory=list)
    learning_rates: list[float] = field(default_factory=list)
    grad_norms: list[float] = field(default_factory=list)
    global_steps: list[int] = field(default_factory=list)
    epochs: list[float] = field(default_factory=list)
    eval_results: list[dict[str, float]] = field(default_factory=list)

    def __repr__(self) -> str:
        return (
            f"TrainingMetrics(\n"
            f"  train_losses={self.train_losses},\n"
            f"  eval_losses={self.eval_losses},\n"
            f"  learning_rates={self.learning_rates},\n"
            f"  grad_norms={self.grad_norms},\n"
            f"  global_steps={self.global_steps},\n"
            f"  epochs={self.epochs},\n"
            f")"
        )

    def to_dict(self) -> dict[str, list[int | float]]:
        return {
            "train_losses": self.train_losses,
            "eval_losses": self.eval_losses,
            "learning_rates": self.learning_rates,
            "grad_norms": self.grad_norms,
            "global_steps": self.global_steps,
            "epochs": self.epochs,
        }


# Map for storing training metrics: (model_key -> metrics)
TRAINING_METRICS: dict[str, Optional[TrainingMetrics]] = {
    k: None for k in MODEL_IDENTIFIERS.keys()
}

In [None]:
class LossLoggerCallback(TrainerCallback):
    def __init__(self, metrics: TrainingMetrics):
        self.metrics: TrainingMetrics = metrics
        self.prev_epoch: Optional[float] = None

    def on_log(
        self,
        args: TrainingArguments,
        state: TrainerState,
        control: TrainerControl,
        logs: Optional[dict[str, float]] = None,
        **kwargs: Any,
    ) -> None:
        if logs is None:
            return

        # Capture training losses during logging
        if "loss" in logs:
            self.metrics.train_losses.append(logs["loss"])

        # Capture evaluation losses during logging
        if "eval_loss" in logs:
            self.metrics.eval_losses.append(logs["eval_loss"])

        # Capture learning rates during logging
        if "learning_rate" in logs:
            self.metrics.learning_rates.append(logs["learning_rate"])

        # Capture gradient norms during logging
        if "grad_norm" in logs:
            self.metrics.grad_norms.append(logs["grad_norm"])

        # Capture global steps consistently
        self.metrics.global_steps.append(state.global_step)

        # Only log the epoch once per epoch change
        if state.epoch is not None and state.epoch != self.prev_epoch:
            self.metrics.epochs.append(state.epoch)
            self.prev_epoch = state.epoch

        return super().on_log(args, state, control, **kwargs)

    def on_evaluate(
        self,
        args: TrainingArguments,
        state: TrainerState,
        control: TrainerControl,
        metrics: Optional[dict[str, float]] = None,
        **kwargs: Any,
    ) -> None:
        if metrics is None:
            return

        # Capture evaluation results on evaluation
        self.metrics.eval_results.append(metrics)

        return super().on_evaluate(args, state, control, **kwargs)
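
In `on_log`, `global_steps` grows on every call, while each loss list only grows when its key is present, so the lists can end up with different lengths. One possible alternative, sketched below without any Trainer dependency, stores each value together with the step it was logged at. The log payloads are hypothetical and only imitate the shape of the dictionaries handled above.

```python
from collections import defaultdict

# Hypothetical (global_step, logs) events: a training log, an evaluation log,
# and an end-of-training summary that carries none of the tracked keys
LOG_EVENTS: list[tuple[int, dict[str, float]]] = [
    (20, {"loss": 3.91, "grad_norm": 1.4, "learning_rate": 2.0e-7}),
    (25, {"eval_loss": 3.80}),
    (25, {"train_loss": 3.91, "train_runtime": 12.5}),
]

TRACKED_KEYS: tuple[str, ...] = ("loss", "eval_loss", "learning_rate", "grad_norm")


def collect(
    events: list[tuple[int, dict[str, float]]],
) -> dict[str, list[tuple[int, float]]]:
    """Group tracked values by key, keeping the step each one was logged at."""
    series: dict[str, list[tuple[int, float]]] = defaultdict(list)
    for step, logs in events:
        for key in TRACKED_KEYS:
            if key in logs:
                series[key].append((step, logs[key]))
    return dict(series)


series = collect(LOG_EVENTS)
for key, points in series.items():
    print(key, points)
```

Keeping the step next to each value makes it straightforward to plot training and evaluation loss on a shared x-axis later.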

In [None]:
class QuestGenLLM:
    def __init__(
        self,
        tokenizer: PreTrainedTokenizer | PreTrainedTokenizerFast,
        model: PreTrainedModel,
        model_key: str,  # Alias for the model, e.g., "gpt2"
        model_id: str,  # Hugging Face model name, e.g., "openai-community/gpt2"
        fp16_available: bool,  # Mixed precision
        device: Optional[str] = None,
        dtype: Optional[str] = None,
        metrics: Optional[TrainingMetrics] = None,
    ):
        self.tokenizer: PreTrainedTokenizer | PreTrainedTokenizerFast = tokenizer
        self.model: PreTrainedModel = model
        self.model_key: str = model_key
        self.model_id: str = model_id
        self.fp16_available: bool = fp16_available

        # Automatically determine the device used by the model
        self.device: str = (
            device
            if isinstance(device, str)
            else str(getattr(self.model, "device", "N/A"))
        )

        # Automatically determine the dtype used by the model
        self.dtype: str = (
            dtype
            if isinstance(dtype, str)
            else str(getattr(self.model, "dtype", "N/A")).replace("torch.", "")
        )

        # Initialize dataclass for storing training metrics
        self.metrics: TrainingMetrics = (
            metrics if isinstance(metrics, TrainingMetrics) else TrainingMetrics()
        )

    @classmethod
    def from_pretrained(
        cls,
        model_key: str,
        model_id: str,
        cache_dir: PathLike = get_cache_dirpath("models"),
        seed: int = SEED,
        use_cpu: bool = False,
    ) -> QuestGenLLM:
        def apply_lora_adapter(
            model: PreTrainedModel,
            r: int = 8,
            alpha: int = 16,
            dropout: float = 0.1,
            task_type: str = "CAUSAL_LM",
        ) -> PreTrainedModel:
            # Prepare model for k-bit training
            model = prepare_model_for_kbit_training(model)

            # Set correct `fan_in_fan_out` based on model type
            #
            # False [default] - for Linear layers
            # True - for Conv1D layers (like GPT)
            fan_in_fan_out: bool = False
            if "gpt" in getattr(model.config, "model_type", "").lower():
                fan_in_fan_out = True

            # Define the LoRA config
            lora_config: LoraConfig = LoraConfig(
                r=r,
                lora_alpha=alpha,
                lora_dropout=dropout,
                target_modules=TARGET_MODULES[model_key],
                bias="none",
                task_type=task_type,
                fan_in_fan_out=fan_in_fan_out,
            )

            try:
                # Apply LoRA adapters to the model
                model = get_peft_model(model, lora_config)
            except Exception as e:
                print(f"[LoRAINFO] Adapter failed to apply: {e}")
                raise

            # Display information about the model parameters
            trainable_params: int = sum(
                p.numel() for p in model.parameters() if p.requires_grad
            )
            all_params: int = sum(p.numel() for p in model.parameters())
            trainable_percent: float = 100 * trainable_params / all_params
            print(
                "[LoRAINFO] trainable params: {:,} || all params: {:,} || trainable%: {:.4f}".format(
                    trainable_params, all_params, trainable_percent
                )
            )

            return model

        print(f"[DOWNLOAD] {model_key} ({model_id})")
        start_time: float = time.time()

        # Clear PyTorch's CUDA memory cache
        torch.cuda.empty_cache()

        # Set the random seed for reproducibility
        set_seed(seed)

        # Determine if mixed precision is available (CUDA compute capability >= 7.0)
        fp16_available: bool = (
            torch.cuda.is_available()
            and torch.cuda.get_device_capability(0)[0] >= 7
        )

        # Download the tokenizer using the model id
        tokenizer: PreTrainedTokenizerFast = AutoTokenizer.from_pretrained(
            model_id,
            cache_dir=(Path(cache_dir) / model_key),
            use_fast=True,
            token=HF_ACCESS_TOKEN,
            trust_remote_code=True,
        )

        model: PreTrainedModel
        if fp16_available and not use_cpu:
            # Set the bitsandbytes configuration for quantization
            bnb_config: BitsAndBytesConfig = BitsAndBytesConfig(
                load_in_4bit=True,
                bnb_4bit_quant_type="nf4",
                bnb_4bit_use_double_quant=True,
                bnb_4bit_compute_dtype=torch.float16,
                # llm_int8_enable_fp32_cpu_offload=True,
            )

            # Download the model using the model id (for GPU)
            model = AutoModelForCausalLM.from_pretrained(
                model_id,
                torch_dtype=torch.float16,
                quantization_config=bnb_config,
                cache_dir=(Path(cache_dir) / model_key),
                token=HF_ACCESS_TOKEN,
                trust_remote_code=True,
                low_cpu_mem_usage=True,
            )
            model.to("cuda")
        else:
            # Download the model using the model id (for CPU)
            model = AutoModelForCausalLM.from_pretrained(
                model_id,
                torch_dtype=torch.float32,
                cache_dir=(Path(cache_dir) / model_key),
                token=HF_ACCESS_TOKEN,
                trust_remote_code=True,
                low_cpu_mem_usage=True,
            )
            model.to("cpu")

        # Apply the LoRA adapters to the model
        model = apply_lora_adapter(model)

        end_time: float = time.time()
        elapsed: float = end_time - start_time
        print(f'[COMPLETE] "{model_key}" ready in {elapsed:.2f}s.\n')

        return cls(tokenizer, model, model_key, model_id, fp16_available)

    def train_and_evaluate(
        self,
        dataset: DatasetDict = quest_set if ACTIVATE_FULL else quest_subset,
        max_length: int = MAX_LENGTH,
        learning_rate: float = LR_RATE,
        batch_size: int = BATCH_SIZE,
        epochs: int = N_EPOCHS,
        seed: int = SEED,
        max_grad_norm: float = MAX_GRAD_NORM,
        logging_steps: int = LOGGING_STEPS,
        eval_steps: int = EVAL_STEPS,
        warmup_steps: int = WARMUP_STEPS,
        gradient_checkpointing: bool = GRADIENT_CHECKPOINTING,
        load_best_model_at_end: bool = LOAD_BEST_MODEL_AT_END,
        save_total_limit: int = SAVE_TOTAL_LIMIT,
        eval_accumulation_steps: int = EVAL_ACCUMULATION_STEPS,
        gradient_accumulation_steps: int = GRADIENT_ACCUMULATION_STEPS,
        callbacks: list[Optional[TrainerCallback]] = [
            EarlyStoppingCallback(early_stopping_patience=2) if ACTIVATE_EVAL else None,
            TensorBoardCallback() if ACTIVATE_TENSORBOARD else None,
        ],
        activate_fp16: bool = ACTIVATE_FP16,
        activate_eval: bool = ACTIVATE_EVAL,
        activate_save: bool = ACTIVATE_SAVE,
        activate_logs: bool = ACTIVATE_LOGS,
        activate_tensorboard: bool = ACTIVATE_TENSORBOARD,
        activate_callbacks: bool = ACTIVATE_CALLBACKS,
        output_dir: PathLike = get_target_dirpath("out"),
        logging_dir: PathLike = get_target_dirpath("logs"),
    ) -> TrainingMetrics:
        # Ensure the training and validation sets
        if not all(split in dataset for split in ["train", "val"]):
            raise ValueError("DatasetDict must contain both 'train' and 'val' splits.")

        # Ensure the output and logging directories
        os.makedirs(output_dir, exist_ok=True)
        os.makedirs(logging_dir, exist_ok=True)

        start_time: float
        end_time: float
        elapsed: float

        # Set the random seed for reproducibility
        set_seed(seed)

        # Set the padding token for the tokenizer
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token
        self.tokenizer.padding_side = "right"

        # Tokenize the dataset (truncate to `max_length`, pad to the longest sequence per batch)
        print(f"[TOKENIZE] {self.model_key} ({self.model_id})")
        start_time = time.time()
        tokenized_data: DatasetDict = dataset.map(
            QuestGenLLM.tokenize_dataset,
            batched=True,
            remove_columns=["text"],
            fn_kwargs={"tokenizer": self.tokenizer, "max_length": max_length},
        )
        end_time = time.time()
        elapsed = end_time - start_time
        print(f"[COMPLETE] Elapsed: {elapsed:.2f}s\n")

        # Set the model padding token (from the tokenizer)
        self.model.config.pad_token_id = self.tokenizer.pad_token_id

        # Turn off `use_cache` if `gradient_checkpointing` is on
        self.model.config.use_cache = not gradient_checkpointing

        # Set up the training configurations
        training_args: TrainingArguments = TrainingArguments(
            output_dir=(Path(output_dir) / self.model_key),
            learning_rate=learning_rate,
            per_device_train_batch_size=batch_size,
            per_device_eval_batch_size=batch_size,
            num_train_epochs=epochs,
            log_level=("info" if activate_logs else "error"),
            logging_steps=logging_steps,
            eval_steps=eval_steps,
            eval_strategy=("epoch" if activate_eval else "no"),
            save_strategy=("epoch" if activate_save else "no"),
            logging_dir=(Path(logging_dir) / self.model_key),
            save_total_limit=save_total_limit,
            eval_accumulation_steps=eval_accumulation_steps,
            gradient_accumulation_steps=gradient_accumulation_steps,
            gradient_checkpointing=gradient_checkpointing,
            fp16=(self.fp16_available and activate_fp16),
            load_best_model_at_end=load_best_model_at_end,
            metric_for_best_model="eval_loss",
            seed=seed,
            report_to=("tensorboard" if activate_tensorboard else "none"),
            label_names=["labels"],
            max_grad_norm=max_grad_norm,
            warmup_steps=warmup_steps,
            logging_nan_inf_filter=True,
            skip_memory_metrics=True,
            lr_scheduler_type="cosine",
            push_to_hub=False,
        )

        # Set up the data collator for the model
        data_collator: DataCollatorForLanguageModeling = (
            DataCollatorForLanguageModeling(tokenizer=self.tokenizer, mlm=False)
        )

        # Set up the callbacks for the trainer
        trainer_callbacks: list[TrainerCallback] = list(
            filter(lambda callback: callback is not None, callbacks)
        )
        if activate_callbacks:
            trainer_callbacks.append(LossLoggerCallback(self.metrics))

        # Prepare and run the trainer
        trainer: Trainer = Trainer(
            model=self.model,
            args=training_args,
            data_collator=data_collator,
            train_dataset=tokenized_data["train"],
            eval_dataset=(tokenized_data["val"] if activate_eval else None),
            callbacks=trainer_callbacks,
        )

        print(f"[FINETUNE] {self.model_key} ({self.model_id})")
        start_time = time.time()
        trainer.train()
        end_time = time.time()
        elapsed = end_time - start_time
        print(f"[COMPLETE] Elapsed: {elapsed:.2f}s\n")

        # Save the model and tokenizer for later use
        if activate_save:
            trainer.save_model()
            self.tokenizer.save_pretrained(save_directory=training_args.output_dir)

        # Add to the training metrics map
        TRAINING_METRICS[self.model_key] = self.metrics

        return self.metrics

    @staticmethod
    def tokenize_dataset(
        examples: dict[str, list[str]],
        tokenizer: PreTrainedTokenizer | PreTrainedTokenizerFast,
        max_length: int = MAX_LENGTH,
    ) -> dict[str, list[list[int]]]:
        encodings: BatchEncoding = tokenizer(
            examples["text"],
            padding="longest",
            truncation=True,
            max_length=max_length,
            return_tensors="pt",
        )

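        # With `return_tensors="pt"`, the encodings hold tensors, not lists
        input_ids: torch.Tensor = encodings["input_ids"]
        attention_mask: torch.Tensor = encodings["attention_mask"]

        # Copy the inputs as labels and mask padding so the loss ignores it
        labels: torch.Tensor = input_ids.clone()
        labels[input_ids == tokenizer.pad_token_id] = -100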

        return {
            "input_ids": input_ids.tolist(),
            "attention_mask": attention_mask.tolist(),
            "labels": labels.tolist(),
        }

    def generate(
        self,
        prompt: str,
        max_new_tokens: int = 100,
        max_length: int = 128,
        num_return_sequences: int = 1,
        temperature: float = 0.7,
        top_p: float = 0.9,
        top_k: int = 50,
        do_sample: bool = True,
    ) -> list[str]:
        self.model.base_model.model.eval()  # Set the model to evaluation mode

        # Create a pipeline for quest description generation
        generator: TextGenerationPipeline = TextGenerationPipeline(
            model=self.model.base_model.model,
            tokenizer=self.tokenizer,
            device=(0 if self.device.startswith("cuda") else -1),
        )

        # Generate the quest description from the pipeline
        outputs: list[dict[str, str]] = generator(
            prompt,
            max_new_tokens=max_new_tokens,
            max_length=max_length,
            num_return_sequences=num_return_sequences,
            temperature=temperature,
            top_p=top_p,
            top_k=top_k,
            do_sample=do_sample,
            eos_token_id=self.tokenizer.eos_token_id,
            pad_token_id=self.tokenizer.pad_token_id,
        )

        return [output.get("generated_text", "N/A") for output in outputs]

    def to_dict(self) -> dict[str, Any]:
        return {
            "model_key": self.model_key,
            "model_id": self.model_id,
            "device": self.device,
            "dtype": self.dtype,
            "vocab_size": getattr(self.tokenizer, "vocab_size", "unknown"),
            "max_length": getattr(self.tokenizer, "model_max_length", "unknown"),
            "model_type": getattr(
                getattr(self.model, "config", None), "model_type", "unknown"
            ),
            "num_parameters": self.model.num_parameters()
            if hasattr(self.model, "num_parameters")
            else "N/A",
            "fp16_available": self.fp16_available,
        }

    def clear_cache(self, cache_dir: PathLike = get_cache_dirpath("models")) -> None:
        def remove_dir(dir_path: PathLike) -> None:
            if os.path.exists(dir_path):
                shutil.rmtree(dir_path)
                print(f"Cache directory '{dir_path}' removed.")
            else:
                print(f"No cache directory found at '{dir_path}'.")

        remove_dir(Path(cache_dir) / self.model_key)

    def print_model_information(self) -> None:
        print(json.dumps(self.to_dict(), indent=2))

    def inspect(self) -> None:
        print(f"Tokenizer ({self.model_id}):\n{self.tokenizer}\n")
        print(f"Model ({self.model_id}):\n{self.model}\n")
        print(f"Configuration ({self.model_id}):\n{self.model.config}")
        print(f"Padding Token [PAD]               : {self.tokenizer.pad_token}")
        print(f"Beginning of Sentence Token [BOS] : {self.tokenizer.bos_token}")
        print(f"End of Sentence Token [EOS]       : {self.tokenizer.eos_token}")
        print(f"Unknown Token [UNK]               : {self.tokenizer.unk_token}")
        print(f"Padding Side                      : {self.tokenizer.padding_side}")
        print(f"Padding Token ID                  : {self.tokenizer.pad_token_id}\n")

    def __str__(self) -> str:
        return f"{self.model_key} ({self.model_id})"
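
The label masking in `tokenize_dataset` compares token ids with `pad_token_id`. When the pad token is set to the EOS token, as `train_and_evaluate` does for tokenizers without one, any genuine EOS in the text is masked as well. The sketch below illustrates this with plain lists and compares it with masking by attention mask, which is one possible alternative. The token ids are hypothetical apart from 50256, GPT-2's end-of-text id, and the example assumes the training text ends with an explicit EOS.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default
EOS = 50256  # GPT-2's end-of-text id, reused as the pad token


def mask_by_token_id(input_ids: list[int], pad_token_id: int) -> list[int]:
    """Mask every position whose id equals the pad id (the notebook's approach)."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]


def mask_by_attention(input_ids: list[int], attention_mask: list[int]) -> list[int]:
    """Mask only positions the attention mask marks as padding."""
    return [
        tok if keep == 1 else IGNORE_INDEX
        for tok, keep in zip(input_ids, attention_mask)
    ]


# Hypothetical sequence: two text tokens, one real EOS, then two pad positions
input_ids = [464, 1235, EOS, EOS, EOS]
attention_mask = [1, 1, 1, 0, 0]

print(mask_by_token_id(input_ids, EOS))
print(mask_by_attention(input_ids, attention_mask))
```

With the first variant the model never receives a training signal for emitting EOS, which can matter if generated quests are expected to stop on their own.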

In [None]:
# Build and train the GPT-2 Base model with the quest data
gpt2_base: QuestGenLLM = QuestGenLLM.from_pretrained(
    model_key="gpt2", model_id=MODEL_IDENTIFIERS["gpt2"]
)
gpt2_base.inspect()
gpt2_base.train_and_evaluate()

In [None]:
prompt: str = "\n".join(quest_subset["test"][0:20]["text"])
print(prompt)

In [None]:
print(gpt2_base.generate(prompt)[0])
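
By default, the text-generation pipeline returns the prompt followed by the continuation in `generated_text`, so the printed output above repeats the prompt. When only the newly generated part is needed, for example before comparing it with a reference, a small helper such as the hypothetical `strip_prompt` below can remove the prefix. This is a sketch that assumes the output starts with the prompt verbatim.

```python
def strip_prompt(generated_text: str, prompt: str) -> str:
    """Drop the leading prompt from a generated string, if it is present."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt) :].lstrip()
    return generated_text


# Hypothetical prompt and model output, for illustration only
example_prompt = "Quest: Find the missing caravan."
example_output = "Quest: Find the missing caravan. Search the northern pass."

print(strip_prompt(example_output, example_prompt))
```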

In [None]:
# Build and train the GPT-2 Medium model with the quest data
gpt2_medium: QuestGenLLM = QuestGenLLM.from_pretrained(
    model_key="gpt2-medium", model_id=MODEL_IDENTIFIERS["gpt2-medium"]
)
gpt2_medium.inspect()
gpt2_medium.train_and_evaluate()

In [None]:
# Build and train the GPT-2 Large model with the quest data
gpt2_large: QuestGenLLM = QuestGenLLM.from_pretrained(
    model_key="gpt2-large", model_id=MODEL_IDENTIFIERS["gpt2-large"]
)
gpt2_large.inspect()
gpt2_large.train_and_evaluate()

In [None]:
# Build and train the Llama 3.2 model with the quest data
llama32: QuestGenLLM = QuestGenLLM.from_pretrained(
    model_key="llama-3.2-1b-instruct",
    model_id=MODEL_IDENTIFIERS["llama-3.2-1b-instruct"],
)
llama32.inspect()
llama32.train_and_evaluate()

In [None]:
# Build and train the TinyLlama model with the quest data
tinyllama: QuestGenLLM = QuestGenLLM.from_pretrained(
    model_key="tinyllama-1.1b-chat", model_id=MODEL_IDENTIFIERS["tinyllama-1.1b-chat"]
)
tinyllama.inspect()
tinyllama.train_and_evaluate()

In [None]:
def save_training_metrics(output_dir: PathLike = get_target_dirpath("out")) -> None:
    metrics_dict: dict[str, Optional[dict[str, list[int | float]]]] = {
        k: v.to_dict() if isinstance(v, TrainingMetrics) else None
        for k, v in TRAINING_METRICS.items()
    }

    json_file_path: Path = Path(output_dir) / "training_metrics.json"
    with open(json_file_path, "w") as json_writer:
        json.dump(metrics_dict, json_writer, indent=2)

    print(f"Saved to {json_file_path}")


save_training_metrics()  # Save metrics for future evaluations
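
The saved evaluation losses can be turned into perplexity, since perplexity is the exponential of the mean per-token cross-entropy (in nats). The sketch below assumes a payload shaped like the JSON written above, with `null` for models that were not trained. The loss values are hypothetical, and the helper names are not part of the notebook.

```python
import json
import math
from typing import Optional


def perplexity(mean_cross_entropy: float) -> float:
    """Perplexity from a mean per-token cross-entropy loss in nats."""
    return math.exp(mean_cross_entropy)


def final_perplexities(payload: dict[str, Optional[dict]]) -> dict[str, float]:
    """Perplexity from the last recorded evaluation loss of each trained model."""
    results: dict[str, float] = {}
    for model_key, metrics in payload.items():
        if not metrics or not metrics.get("eval_losses"):
            continue  # Model was not trained or never evaluated
        results[model_key] = perplexity(metrics["eval_losses"][-1])
    return results


# Hypothetical metrics in the shape of training_metrics.json
payload = json.loads('{"gpt2": {"eval_losses": [3.2, 3.0]}, "gpt2-medium": null}')

for model_key, ppl in final_perplexities(payload).items():
    print(f"{model_key}: perplexity={ppl:.2f}")
```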