## QuestGen-LLM: Fine-Tuning & Evaluation

This notebook covers the fine-tuning of various pre-trained _large language models_ (LLMs) on the prepared ["quest"](../data/quests_train.json) dataset. Each language model applied is trained and validated on the dataset (with frozen parameters) and the results of these evaluations are compared. The LLMs employed for this application are listed in the following table with their respective parameter count.

| S. No. | Large Language Model             | Parameters | Developed By | Notes                                                 |
| :----: | :------------------------------- | :--------: | :----------: | :---------------------------------------------------- |
|   1.   | GPT-2[^1]                        |    124M    |    OpenAI    | Base model from the GPT-2 family                      |
|   2.   | GPT-2 Medium[^2]                 |    355M    |    OpenAI    | Larger variant with improved language modeling        |
|   3.   | GPT-2 Large[^3]                  |    774M    |    OpenAI    | Capable of generating more coherent longer text       |
|   4.   | Llama-3.2-1B-Instruct[^4] †      |     1B     |     Meta     | Instruction-tuned model for question-answering        |
|   5.   | TinyLlama-1.1B-Chat-v1.0[^5] \*† |    1.1B    |  TinyLlama   | Lightweight chat-tuned model for constrained hardware |

> Fine-tuning uses _supervised fine-tuning_\* (SHF) and _reinforcement learning with human feedback_† (RLHF).

The notebook also covers the performance evaluation of these pre-trained LLMs after training on the "quest" dataset. The generated quest descriptions (from the test set) are compared to their reference responses. These responses are then evaluated based on the following evaluation metrics:

| S. No. | Metric         | Description                                                                                                          | Preference                                |
| :----: | -------------- | -------------------------------------------------------------------------------------------------------------------- | ----------------------------------------- |
|   1.   | Perplexity[^6] | Measures how "confused" the model is about its predictions.                                                          | Lower values indicate less uncertainty.   |
|   2.   | BLEU[^7]       | Compares n-gram overlap between generated and reference text.                                                        | Higher values indicate more overlap.      |
|   3.   | METEOR[^8]     | Evaluates similarity using synonyms, stems, and word order.                                                          | Higher values indicate better alignment.  |
|   4.   | BERTScore[^9]  | Uses ["BERT"](https://huggingface.co/docs/transformers/en/model_doc/bert) embeddings to measure semantic similarity. | Higher values indicate better similarity. |

> Additionally, a _human evaluation method_ can further assess qualities like creativity, fluency, and coherence.

Note that:

- **BLEU:** Bilingual Evaluation Understudy
- **METEOR:** Metric for Evaluation of Translation with Explicit ORdering
- **BERT:** Bidirectional Encoder Representations from Transformers

<!-- References -->

[^1]: https://huggingface.co/openai-community/gpt2
[^2]: https://huggingface.co/openai-community/gpt2-medium
[^3]: https://huggingface.co/openai-community/gpt2-large
[^4]: https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct
[^5]: https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0
[^6]: https://huggingface.co/spaces/evaluate-metric/perplexity
[^7]: https://huggingface.co/spaces/evaluate-metric/bleu
[^8]: https://huggingface.co/spaces/evaluate-metric/meteor
[^9]: https://huggingface.co/spaces/evaluate-metric/bertscore


In [1]:
from __future__ import annotations

import json
import math
import os
import shutil
import sys
import time
from dataclasses import dataclass, field
from os import PathLike
from pathlib import Path
from typing import Any, Final, Optional

In [2]:
import datasets
import evaluate
import numpy as np
import torch
from datasets import Dataset, DatasetDict, load_dataset
from evaluate import EvaluationModule
from huggingface_hub import login
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from torch import LongTensor, Tensor
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    DataCollatorForLanguageModeling,
    EarlyStoppingCallback,
    PreTrainedModel,
    PreTrainedTokenizerFast,
    PreTrainedTokenizer,
    Trainer,
    TrainerCallback,
    TrainerControl,
    TrainerState,
    TrainingArguments,
    set_seed,
)
from transformers.generation.utils import GenerateOutput
from transformers.integrations import TensorBoardCallback
from transformers.tokenization_utils_base import BatchEncoding
from transformers.utils import logging

In [3]:
root: str = str(Path.cwd().parent.resolve())
if root not in sys.path:
    sys.path.insert(0, root)

In [4]:
from utils.dirpath import get_cache_dirpath, get_target_dirpath

In [5]:
# Get the HF access token from the environment
HF_ACCESS_TOKEN: Final[str] = os.getenv("HUGGINGFACE_HUB_TOKEN")

# Save the HF token to ~/.huggingface/token
login(token=HF_ACCESS_TOKEN)

In [6]:
# Turn off progress bars for datasets
datasets.disable_progress_bars()

# Turn off progress bars for transformers
logging.disable_progress_bar()

In [7]:
# Map for the model identifiers: (model_key -> model_id)
MODEL_IDENTIFIERS: Final[dict[str, str]] = {
    "gpt2": "openai-community/gpt2",
    "gpt2-medium": "openai-community/gpt2-medium",
    "gpt2-large": "openai-community/gpt2-large",
    "llama-3.2-1b-instruct": "meta-llama/Llama-3.2-1B-Instruct",
    "tinyllama-1.1b-chat": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
}

In [8]:
# Map for the target modules: (model_key -> target_modules)
TARGET_MODULES: Final[dict[str, list[str]]] = {
    "gpt2": ["c_attn", "c_proj"],
    "gpt2-medium": ["c_attn", "c_proj"],
    "gpt2-large": ["c_attn", "c_proj"],
    "llama-3.2-1b-instruct": ["q_proj", "v_proj", "o_proj"],
    "tinyllama-1.1b-chat": ["q_proj", "v_proj", "o_proj"],
}

In [9]:
# Set the constants for model tuning here
BATCH_SIZE: Final[int] = 4  # Per-device training batch size
SEED: Final[int] = 42  # Random seed for reproducibility
N_EPOCHS: Final[int] = 1  # Number of training epochs
LR_RATE: Final[float] = 3e-5  # Learning rate

LORA_R: Final[int] = 4  # LoRA rank; controls parameter reduction for adaptation layers
LORA_ALPHA: Final[int] = 8  # LoRA scaling factor; controls contribution of LoRA updates
LORA_DROPOUT: Final[float] = 0.0  # Dropout rate applied to LoRA layers

MAX_LENGTH: Final[int] = 512  # Max token length for input sequences
MAX_GRAD_NORM: Final[float] = 1.0  # Gradient clipping threshold
LOGGING_STEPS: Final[int] = 10  # Steps between logging metrics
EVAL_STEPS: Final[int] = 25  # Steps between evaluations
WARMUP_STEPS: Final[int] = 100  # Learning rate warmup steps

SAVE_TOTAL_LIMIT: Final[int] = 1  # Max number of saved checkpoints
EVAL_ACCUMULATION_STEPS: Final[int] = 2  # Eval batch accumulation steps
GRADIENT_ACCUMULATION_STEPS: Final[int] = 2  # Grad batch accumulation steps

GRADIENT_CHECKPOINTING: Final[bool] = True  # Reduce memory usage (slower)
LOAD_BEST_MODEL_AT_END: Final[bool] = True  # Load best checkpoint by eval loss

ACTIVATE_FP16: Final[bool] = False  # Enable 16-bit mixed precision training
ACTIVATE_EVAL: Final[bool] = True  # Enable evaluation
ACTIVATE_SAVE: Final[bool] = True  # Enable checkpoint saving
ACTIVATE_LOGS: Final[bool] = False  # Enable logging to stdout
ACTIVATE_TENSORBOARD: Final[bool] = True  # Enable TensorBoard logging
ACTIVATE_CALLBACKS: Final[bool] = True  # Enable trainer callbacks
ACTIVATE_FULL: Final[bool] = False  # Use full dataset or 10% subset

FRACTION: Final[float] = 0.9  # % of dataset to use when not full (e.g., 10%)

MAX_NEW_TOKEN: Final[int] = 200  # Max new tokens to generate during inference
NUM_RETURN_SEQUENCES: Final[int] = 1  # Number of completions per prompt

TEMPERATURE: Final[float] = 0.7  # Controls randomness; lower is more deterministic
TOP_P: Final[float] = 0.95  # Sample from smallest set with cumulative prob ≥ top_p
TOP_K: Final[int] = 40  # Sample from the top-k most likely tokens (fixed size)
DO_SAMPLE: Final[bool] = True  # Enables sampling (if False, uses greedy decoding)
REPETITION_PENALTY: Final[float] = 1.2  # Penalizes repeated tokens; >1 discourages

TRAIN_AND_EVALUATE: Final[bool] = True  # Enable model training and evaluation
GENERATE_AND_EVALUATE: Final[bool] = True  # Enable model generate functionalities
CLEAR_CACHE: Final[bool] = False  # Enable cache clear after model training

In [10]:
def load_prompt_response_dataset(
    data_dir: PathLike = get_target_dirpath("data"),
    cache_dir: PathLike = get_cache_dirpath("data"),
) -> tuple[DatasetDict, list[str]]:
    def load_split_dataset(split_type: str) -> tuple[Dataset, Optional[str]]:
        nonlocal cache_dir

        split_dir: Path = Path(data_dir) / split_type
        split: Dataset = load_dataset(
            "json",
            data_files={split_type: str(split_dir / "quests.json")},
            split=split_type,
            cache_dir=str(Path(cache_dir) / split_type),
        )

        # If test set, remove response field and extract it
        if split_type == "test":
            test_responses: list[str] = split["response"]
            split = split.remove_columns(["response"])
            return split, test_responses

        return split, None

    train_set: Dataset
    val_set: Dataset
    test_set: Dataset
    test_references: list[str]

    train_set, _ = load_split_dataset("train")
    val_set, _ = load_split_dataset("val")
    test_set, test_references = load_split_dataset("test")

    dataset: DatasetDict = DatasetDict(
        {
            "train": train_set,
            "val": val_set,
            "test": test_set,
        }
    )
    return dataset, test_references

In [11]:
@dataclass
class QuestDataset:
    records: DatasetDict
    references: list[str]

    @classmethod
    def load(
        cls,
        data_dir: PathLike = get_target_dirpath("data"),
        cache_dir: PathLike = get_cache_dirpath("data"),
    ) -> QuestDataset:
        return cls(*load_prompt_response_dataset(data_dir, cache_dir))

    def get_subset(self, fraction: float = FRACTION) -> QuestDataset:
        def get_split_set(split_type: str) -> Dataset:
            num_rows: int = int(fraction * self.records[split_type].num_rows)
            return self.records[split_type].select(range(num_rows))

        return QuestDataset(
            DatasetDict(
                {
                    "train": get_split_set("train"),
                    "val": get_split_set("val"),
                    "test": get_split_set("test"),
                }
            ),
            self.references[: int(fraction * len(self.references))],
        )

    def select_splits(self, splits: list[str]) -> DatasetDict:
        valid_splits: set[str] = {"train", "val", "test"}
        if not any(split in valid_splits for split in splits):
            raise ValueError(
                "`splits` must contain at least one of: 'train', 'val', 'test'."
            )
        return DatasetDict(
            {split: self.records[split] for split in splits if split in self.records}
        )

    def __repr__(self) -> str:
        return self.records.__repr__()

In [12]:
# Load the quest dataset
quest_set: QuestDataset = QuestDataset.load()
quest_set

DatasetDict({
    train: Dataset({
        features: ['prompt', 'response'],
        num_rows: 907
    })
    val: Dataset({
        features: ['prompt', 'response'],
        num_rows: 113
    })
    test: Dataset({
        features: ['prompt'],
        num_rows: 114
    })
})

In [13]:
# Prepare a subset of the quest dataset
quest_subset: QuestDataset = quest_set.get_subset()
quest_subset

DatasetDict({
    train: Dataset({
        features: ['prompt', 'response'],
        num_rows: 816
    })
    val: Dataset({
        features: ['prompt', 'response'],
        num_rows: 101
    })
    test: Dataset({
        features: ['prompt'],
        num_rows: 102
    })
})

In [14]:
@dataclass
class TrainingMetrics:
    train_losses: list[float] = field(default_factory=list)
    eval_losses: list[float] = field(default_factory=list)
    learning_rates: list[float] = field(default_factory=list)
    grad_norms: list[float] = field(default_factory=list)
    global_steps: list[int] = field(default_factory=list)
    epochs: list[float] = field(default_factory=list)
    eval_results: list[dict[str, float]] = field(default_factory=list)

    def __repr__(self) -> str:
        return (
            f"TrainingMetrics(\n"
            f"  train_losses={self.train_losses},\n"
            f"  eval_losses={self.eval_losses},\n"
            f"  learning_rates={self.learning_rates},\n"
            f"  grad_norms={self.grad_norms},\n"
            f"  global_steps={self.global_steps},\n"
            f"  epochs={self.epochs},\n"
            f")"
        )

    def to_dict(self) -> dict[str, list[int | float]]:
        return {
            "train_losses": self.train_losses,
            "eval_losses": self.eval_losses,
            "learning_rates": self.learning_rates,
            "grad_norms": self.grad_norms,
            "global_steps": self.global_steps,
            "epochs": self.epochs,
        }

In [15]:
@dataclass
class GenerationMetrics:
    perplexity: float = 0.0
    bleu: float = 0.0
    meteor: float = 0.0
    bertscore: dict[str, float] = field(default_factory=dict)

    def __repr__(self) -> str:
        return (
            f"GenerationMetrics(\n"
            f"  perplexity={self.perplexity},\n"
            f"  bleu={self.bleu},\n"
            f"  meteor={self.meteor},\n"
            f"  bertscore={self.bertscore},\n"
            f")"
        )

    def to_dict(self) -> dict[str, float | dict[str, float]]:
        return {
            "perplexity": self.perplexity,
            "bleu": self.bleu,
            "meteor": self.meteor,
            "bertscore": self.bertscore,
        }

In [16]:
# Map for storing training metrics: (model_key -> metrics)
TRAINING_METRICS: dict[str, Optional[TrainingMetrics]] = {
    k: None for k in MODEL_IDENTIFIERS.keys()
}

# Map for storing generation metrics: (model_key -> gen_metrics)
GENERATION_METRICS: dict[str, Optional[GenerationMetrics]] = {
    k: None for k in MODEL_IDENTIFIERS.keys()
}

In [17]:
class LossLoggerCallback(TrainerCallback):
    def __init__(self, metrics: TrainingMetrics):
        self.metrics: TrainingMetrics = metrics
        self.prev_epoch: Optional[float] = None

    def on_log(
        self,
        args: TrainingArguments,
        state: TrainerState,
        control: TrainerControl,
        logs: Optional[dict[str, float]] = None,
        **kwargs: Any,
    ) -> None:
        if logs is None:
            return

        # Capture training losses during logging
        if "loss" in logs:
            self.metrics.train_losses.append(logs["loss"])

        # Capture evaluation losses during logging
        if "eval_loss" in logs:
            self.metrics.eval_losses.append(logs["eval_loss"])

        # Capture learning rates during logging
        if "learning_rate" in logs:
            self.metrics.learning_rates.append(logs["learning_rate"])

        # Capture gradient norms during logging
        if "grad_norm" in logs:
            self.metrics.grad_norms.append(logs["grad_norm"])

        # Capture global steps consistently
        self.metrics.global_steps.append(state.global_step)

        # Only log the epoch once per epoch change
        if state.epoch is not None and state.epoch != self.prev_epoch:
            self.metrics.epochs.append(state.epoch)
            self.prev_epoch = state.epoch

        return super().on_log(args, state, control, **kwargs)

    def on_evaluate(
        self,
        args: TrainingArguments,
        state: TrainerState,
        control: TrainerControl,
        metrics: Optional[dict[str, float]] = None,
        **kwargs: Any,
    ) -> None:
        if metrics is None:
            return

        # Capture evaluation results on evaluation
        self.metrics.eval_results.append(metrics)

        return super().on_evaluate(args, state, control, **kwargs)

In [18]:
class QuestGenLLM:
    def __init__(
        self,
        tokenizer: PreTrainedTokenizer | PreTrainedTokenizerFast,
        model: PreTrainedModel,
        model_key: str,  # Alias for the model, e.g, "gpt2"
        model_id: str,  # Hugging Face model name, e.g., "openai-community/gpt2"
        fp16_available: bool,  # Mixed precision
        device: Optional[str] = None,
        dtype: Optional[str] = None,
        training_metrics: Optional[TrainingMetrics] = None,
        generation_metrics: Optional[GenerationMetrics] = None,
    ):
        self.tokenizer: PreTrainedTokenizer | PreTrainedTokenizerFast = tokenizer
        self.model: PreTrainedModel = model
        self.model_key: str = model_key
        self.model_id: str = model_id
        self.fp16_available: bool = fp16_available

        # Automatically determine the device used by the model
        self.device: str = (
            device
            if isinstance(device, str)
            else str(getattr(self.model, "device", "N/A"))
        )

        # Automatically determine the dtype used by the model
        self.dtype: str = (
            dtype
            if isinstance(dtype, str)
            else str(getattr(self.model, "dtype", "N/A")).replace("torch.", "")
        )

        # Initialize dataclass for storing training metrics
        self.training_metrics: TrainingMetrics = (
            training_metrics
            if isinstance(training_metrics, TrainingMetrics)
            else TrainingMetrics()
        )

        # Initialize dataclass for storing generation metrics
        self.generation_metrics: GenerationMetrics = (
            generation_metrics
            if isinstance(generation_metrics, GenerationMetrics)
            else GenerationMetrics()
        )

    @classmethod
    def from_pretrained(
        cls,
        model_key: str,
        model_id: Optional[str] = None,
        cache_dir: PathLike = get_cache_dirpath("models"),
        seed: int = SEED,
        use_cpu: bool = False,
    ) -> QuestGenLLM:
        def apply_lora_adapter(
            model: PreTrainedModel,
            r: int = LORA_R,
            lora_alpha: int = LORA_ALPHA,
            lora_dropout: float = LORA_DROPOUT,
            bias: str = "none",
            task_type: str = "CAUSAL_LM",
        ) -> PreTrainedModel:
            # Prepare model for k-bit training
            model = prepare_model_for_kbit_training(model)

            # Set correct `fan_in_fan_out` based on model type
            #
            # False [default] - for Linear layers
            # True - for Conv1D layers (like GPT)
            fan_in_fan_out: bool = False
            if "gpt" in getattr(model.config, "model_type", "").lower():
                fan_in_fan_out = True

            # Define the LoRA config
            lora_config: LoraConfig = LoraConfig(
                r=r,
                lora_alpha=lora_alpha,
                lora_dropout=lora_dropout,
                target_modules=TARGET_MODULES[model_key],
                bias=bias,
                task_type=task_type,
                fan_in_fan_out=fan_in_fan_out,
            )

            try:
                # Apply LoRA adapters to the model
                model = get_peft_model(model, lora_config)
            except Exception as e:
                print(f"[LoRAINFO] Adapter failed to apply: {e}")
                raise

            # Display information about the model parameters
            trainable_params: int = sum(
                p.numel() for p in model.parameters() if p.requires_grad
            )
            all_params: int = sum(p.numel() for p in model.parameters())
            trainable_percent: float = 100 * trainable_params / all_params
            print(
                "[LoRAINFO] trainable params: {:,} || all params: {:,} || trainable%: {:.4f}".format(
                    trainable_params, all_params, trainable_percent
                )
            )

            return model

        if not model_id:
            model_id = MODEL_IDENTIFIERS[model_key]

        print(f"[DOWNLOAD] {model_key} ({model_id})")
        start_time: float = time.time()

        # Clear PyTorch's CUDA memory cache
        torch.cuda.empty_cache()

        # Set the random seed for reproducibility
        set_seed(seed)

        # Determine if mixed precision is available
        fp16_available: bool = (
            torch.cuda.is_available()
            and torch.cuda.get_device_capability(0)[0] >= 7
            and torch.cuda.get_device_capability(0)[1] >= 0
        )

        # Download the tokenizer using the model id
        tokenizer: PreTrainedTokenizerFast = AutoTokenizer.from_pretrained(
            model_id,
            cache_dir=(Path(cache_dir) / model_key),
            use_fast=True,
            token=HF_ACCESS_TOKEN,
            trust_remote_code=True,
        )

        model: PreTrainedModel
        cache_dir: str = str(Path(cache_dir) / model_key)

        if fp16_available and not use_cpu:
            # Set the bitsandbytes configuration for quantization
            bnb_config: BitsAndBytesConfig = BitsAndBytesConfig(
                load_in_4bit=True,
                bnb_4bit_quant_type="nf4",
                bnb_4bit_use_double_quant=True,
                bnb_4bit_compute_dtype=torch.float16,
                # llm_int8_enable_fp32_cpu_offload=True,
            )

            # Download the model using the model id (for GPU)
            model = AutoModelForCausalLM.from_pretrained(
                model_id,
                torch_dtype=torch.float16,
                quantization_config=bnb_config,
                cache_dir=cache_dir,
                token=HF_ACCESS_TOKEN,
                trust_remote_code=True,
                low_cpu_mem_usage=True,
            )
            model.to("cuda")
        else:
            # Download the model using the model id (for CPU)
            model = AutoModelForCausalLM.from_pretrained(
                model_id,
                torch_dtype=torch.float32,
                cache_dir=cache_dir,
                token=HF_ACCESS_TOKEN,
                trust_remote_code=True,
                low_cpu_mem_usage=True,
            )
            model.to("cpu")

        # Apply the LoRA adapters to the model
        model = apply_lora_adapter(model)

        end_time: float = time.time()
        elapsed: float = end_time - start_time
        print(f'[COMPLETE] "{model_key}" ready in {elapsed:.2f}s.\n')

        return cls(tokenizer, model, model_key, model_id, fp16_available)

    @staticmethod
    def tokenize_dataset(
        examples: dict[str, list[str]],
        tokenizer: PreTrainedTokenizer | PreTrainedTokenizerFast,
        max_length: int = MAX_LENGTH,
    ) -> dict[str, Tensor]:
        prompt_encodings: BatchEncoding = tokenizer(
            examples["prompt"],
            truncation=True,
            max_length=max_length,
            padding="max_length",
            return_tensors="pt",
            add_special_tokens=False,
        )

        response_encodings: BatchEncoding = tokenizer(
            examples["response"],
            truncation=True,
            max_length=max_length,
            padding="max_length",
            return_tensors="pt",
            add_special_tokens=False,
        )

        # Combine the input tokens and attention masks
        input_ids: Tensor = torch.cat(
            [prompt_encodings["input_ids"], response_encodings["input_ids"]], dim=1
        )
        attention_mask: Tensor = torch.cat(
            [prompt_encodings["attention_mask"], response_encodings["attention_mask"]],
            dim=1,
        )

        # Truncate the encodings to `max_length`
        input_ids = input_ids[:, :max_length]
        attention_mask = attention_mask[:, :max_length]

        # Prepare labels and mask the prompt with -100
        labels: Tensor = input_ids.clone()
        for idx, prompt_len in enumerate(
            (prompt_encodings["attention_mask"]).sum(dim=1)
        ):
            labels[idx, :prompt_len] = -100

        return {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
            "labels": labels,
        }

    def train_and_evaluate(
        self,
        dataset: DatasetDict = (
            quest_set.select_splits(["train", "val"])
            if ACTIVATE_FULL
            else quest_subset.select_splits(["train", "val"])
        ),
        max_length: int = MAX_LENGTH,
        learning_rate: int = LR_RATE,
        batch_size: int = BATCH_SIZE,
        epochs: int = N_EPOCHS,
        seed: int = SEED,
        max_grad_norm: float = MAX_GRAD_NORM,
        logging_steps: int = LOGGING_STEPS,
        eval_steps: int = EVAL_STEPS,
        warmup_steps: int = WARMUP_STEPS,
        gradient_checkpointing: bool = GRADIENT_CHECKPOINTING,
        load_best_model_at_end: bool = LOAD_BEST_MODEL_AT_END,
        save_total_limit: int = SAVE_TOTAL_LIMIT,
        eval_accumulation_steps: int = EVAL_ACCUMULATION_STEPS,
        gradient_accumulation_steps: int = GRADIENT_ACCUMULATION_STEPS,
        callbacks: list[TrainerCallback] = [
            EarlyStoppingCallback(early_stopping_patience=2) if ACTIVATE_EVAL else None,
            TensorBoardCallback() if ACTIVATE_TENSORBOARD else None,
        ],
        activate_fp16: bool = ACTIVATE_FP16,
        activate_eval: bool = ACTIVATE_EVAL,
        activate_save: bool = ACTIVATE_SAVE,
        activate_logs: bool = ACTIVATE_LOGS,
        activate_tensorboard: bool = ACTIVATE_TENSORBOARD,
        activate_callbacks: bool = ACTIVATE_CALLBACKS,
        output_dir: PathLike = get_target_dirpath("out"),
        logging_dir: PathLike = get_target_dirpath("logs"),
    ) -> TrainingMetrics:
        # Ensure the training and validation sets
        if not all(split in dataset for split in ["train", "val"]):
            raise ValueError("DatasetDict must contain both 'train' and 'val' splits.")

        # Ensure the output and logging directories
        os.makedirs(output_dir, exist_ok=True)
        os.makedirs(logging_dir, exist_ok=True)

        start_time: float
        end_time: float
        elapsed: float

        # Set the random seed for reproducibility
        set_seed(seed)

        # Set the padding token for the tokenizer
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token
        self.tokenizer.padding_side = "right"

        # Tokenize the dataset with `max_length` padding
        print(f"[TOKENIZE] {self.model_key} ({self.model_id})")
        start_time = time.time()
        tokenized_data: Dataset = dataset.map(
            QuestGenLLM.tokenize_dataset,
            batched=True,
            remove_columns=["prompt", "response"],
            fn_kwargs={"tokenizer": self.tokenizer, "max_length": max_length},
        )
        end_time = time.time()
        elapsed = end_time - start_time
        print(f"[COMPLETE] Elapsed: {elapsed:.2f}s\n")

        # Set the model padding token (from the tokenizer)
        self.model.config.pad_token_id = self.tokenizer.pad_token_id

        # Turn off `use_cache` if `gradient_checkpointing` is on
        self.model.config.use_cache = not gradient_checkpointing

        # Set up the training configurations
        training_args: TrainingArguments = TrainingArguments(
            output_dir=(Path(output_dir) / self.model_key),
            learning_rate=learning_rate,
            per_device_train_batch_size=batch_size,
            per_device_eval_batch_size=batch_size,
            num_train_epochs=epochs,
            log_level=("info" if activate_logs else "error"),
            logging_steps=logging_steps,
            eval_steps=eval_steps,
            eval_strategy=("epoch" if activate_eval else "no"),
            save_strategy=("epoch" if activate_save else "no"),
            logging_dir=(Path(logging_dir) / self.model_key),
            save_total_limit=save_total_limit,
            eval_accumulation_steps=eval_accumulation_steps,
            gradient_accumulation_steps=gradient_accumulation_steps,
            gradient_checkpointing=gradient_checkpointing,
            fp16=(self.fp16_available and activate_fp16),
            load_best_model_at_end=load_best_model_at_end,
            metric_for_best_model="eval_loss",
            seed=seed,
            report_to=("tensorboard" if activate_tensorboard else "none"),
            label_names=["labels"],
            max_grad_norm=max_grad_norm,
            warmup_steps=warmup_steps,
            logging_nan_inf_filter=True,
            skip_memory_metrics=True,
            lr_scheduler_type="cosine",
            push_to_hub=False,
            disable_tqdm=False,
        )

        # Set up the data collator for the model
        data_collator: DataCollatorForLanguageModeling = (
            DataCollatorForLanguageModeling(
                self.tokenizer, mlm=False, return_tensors="pt"
            )
        )

        # Set up the callbacks for the trainer
        trainer_callbacks: list[TrainerCallback] = list(
            filter(lambda callback: callback is not None, callbacks)
        )
        if activate_callbacks:
            trainer_callbacks.append(LossLoggerCallback(self.training_metrics))

        # Prepare and run the trainer
        trainer: Trainer = Trainer(
            model=self.model,
            args=training_args,
            data_collator=data_collator,
            train_dataset=tokenized_data["train"],
            eval_dataset=(tokenized_data["val"] if activate_eval else None),
            callbacks=trainer_callbacks,
        )

        print(f"[FINETUNE] {self.model_key} ({self.model_id})")
        start_time = time.time()
        trainer.train()
        end_time = time.time()
        elapsed = end_time - start_time
        print(f"[COMPLETE] Elapsed: {elapsed:.2f}s\n")

        # Save the model and tokenizer for later use
        if activate_save:
            trainer.save_model()
            self.tokenizer.save_pretrained(save_directory=training_args.output_dir)

        # Add to the training metrics map
        TRAINING_METRICS[self.model_key] = self.training_metrics

        return self.training_metrics

    def generate_and_evaluate(
        self,
        dataset: DatasetDict = (
            quest_set.select_splits(["test"])
            if ACTIVATE_FULL
            else quest_subset.select_splits(["test"])
        ),
        references: list[str] = (
            quest_set.references if ACTIVATE_FULL else quest_subset.references
        ),
        batch_size: int = BATCH_SIZE,
        seed: int = SEED,
        max_length: int = MAX_LENGTH,
        max_new_tokens: int = MAX_NEW_TOKEN,
        num_return_sequences: int = NUM_RETURN_SEQUENCES,
        temperature: float = TEMPERATURE,
        top_p: float = TOP_P,
        top_k: int = TOP_K,
        do_sample: bool = True,
        repetition_penalty: float = REPETITION_PENALTY,
    ) -> GenerationMetrics:
        # Ensure the testing set in the dataset
        if "test" not in dataset:
            raise ValueError("DatasetDict must contain a 'test' split.")

        input_model: PreTrainedModel = self.model.base_model.model
        input_model.eval()  # Set the model to evaluation mode
        self.tokenizer.padding_side = "left"  # Change the padding side to left

        # Set the random seed for reproducibility
        set_seed(seed)

        predictions: list[str] = []
        test_split: Dataset = dataset["test"]

        print(f"[GENERATE] {self.model_key} ({self.model_id})")
        start_time: float = time.time()
        for i in range(0, len(test_split), batch_size):
            batch_prompts: list[str] = test_split[i : i + batch_size]["prompt"]

            # Tokenize the batched input prompts
            tokenized_prompts: BatchEncoding = self.tokenizer(
                batch_prompts,
                truncation=True,
                max_length=max_length,
                padding="max_length",
                return_tensors="pt",
                add_special_tokens=False,
            ).to(input_model.device)

            # Generate output tokens from the tokenized inputs
            with torch.no_grad():
                gen_tokens: GenerateOutput | LongTensor = input_model.generate(
                    **tokenized_prompts,
                    max_new_tokens=max_new_tokens,
                    num_return_sequences=num_return_sequences,
                    temperature=temperature,
                    top_p=top_p,
                    top_k=top_k,
                    do_sample=do_sample,
                    repetition_penalty=repetition_penalty,
                    eos_token_id=self.tokenizer.eos_token_id,
                    pad_token_id=self.tokenizer.pad_token_id,
                )

            # Decode the generated tokens into response outputs
            decoded: list[str] = self.tokenizer.batch_decode(
                gen_tokens, skip_special_tokens=True
            )
            predictions.extend(decoded)
        end_time: float = time.time()
        elapsed: float = end_time - start_time
        print(f"[COMPLETE] Elapsed: {elapsed:.2f}s\n")

        # Repeat the references if more than one output sequences are generated
        if num_return_sequences > 1:
            references = [
                ref for ref in references for _ in range(num_return_sequences)
            ]

        # Perform metric evaluation and add to the generation metrics map
        self.compute_generation_metrics(
            predictions=predictions,
            references=references,
            batch_size=batch_size,
            seed=seed,
        )
        GENERATION_METRICS[self.model_key] = self.generation_metrics

        return self.generation_metrics

    def compute_generation_metrics(
        self,
        *,
        predictions: list[str],
        references: list[str],
        batch_size: int = BATCH_SIZE,
        seed: int = SEED,
    ) -> None:
        # Set the random seed for reproducibility
        set_seed(seed)

        # Set tokenizer padding back to right
        self.tokenizer.padding_side = "right"

        print(f"[EVALUATE] {self.model_key} ({self.model_id})")
        start_time: float = time.time()

        # Load metrics from evaluate
        bleu: EvaluationModule = evaluate.load("bleu")
        meteor: EvaluationModule = evaluate.load("meteor")
        bertscore: EvaluationModule = evaluate.load("bertscore")

        # Compute metric scores (BLEU, ROUGE, METEOR, BERTScore)
        bleu_results: Optional[dict] = bleu.compute(
            predictions=predictions, references=references
        )
        meteor_results: Optional[dict] = meteor.compute(
            predictions=predictions, references=references
        )
        bert_results: Optional[dict] = bertscore.compute(
            predictions=predictions,
            references=references,
            lang="en",
            batch_size=batch_size,
            use_fast_tokenizer=True,
        )

        # Compute perplexity score (from eval losses)
        perplexity_score: float = 0.0
        if self.training_metrics and self.training_metrics.eval_losses:
            avg_eval_loss: float = np.mean(self.training_metrics.eval_losses)
            perplexity_score = math.exp(avg_eval_loss)

        end_time: float = time.time()
        elapsed: float = end_time - start_time
        print(f"[COMPLETE] Elapsed: {elapsed:.2f}s\n")

        # Compile the results into the generation metrics
        self.generation_metrics.perplexity = perplexity_score
        self.generation_metrics.bleu = bleu_results.get("bleu", 0.0)
        self.generation_metrics.meteor = meteor_results.get("meteor", 0.0)
        self.generation_metrics.bertscore = {
            "precision": float(np.mean(bert_results.get("precision", [0.0]))),
            "recall": float(np.mean(bert_results.get("recall", [0.0]))),
            "f1": float(np.mean(bert_results.get("f1", [0.0]))),
        }

    def to_dict(self) -> dict[str, Any]:
        return {
            "model_key": self.model_key,
            "model_id": self.model_id,
            "device": self.device,
            "dtype": self.dtype,
            "vocab_size": getattr(self.tokenizer, "vocab_size", "unknown"),
            "max_length": getattr(self.tokenizer, "model_max_length", "unknown"),
            "model_type": getattr(
                getattr(self.model, "config", None), "model_type", "unknown"
            ),
            "num_parameters": self.model.num_parameters()
            if hasattr(self.model, "num_parameters")
            else "N/A",
            "fp16_available": self.fp16_available,
        }

    def clear_cache(self, cache_dir: PathLike = get_cache_dirpath("models")) -> None:
        def remove_dir(dir_path: PathLike) -> None:
            if os.path.exists(dir_path):
                shutil.rmtree(dir_path)
                print(f"Cache directory '{dir_path}' removed.")
            else:
                print(f"No cache directory found at '{dir_path}'.")

        remove_dir(Path(cache_dir) / self.model_key)

    def run(
        self,
        dataset: QuestDataset = quest_set if ACTIVATE_FULL else quest_subset,
        max_length: int = MAX_LENGTH,
        learning_rate: int = LR_RATE,
        batch_size: int = BATCH_SIZE,
        epochs: int = N_EPOCHS,
        seed: int = SEED,
        max_grad_norm: float = MAX_GRAD_NORM,
        logging_steps: int = LOGGING_STEPS,
        eval_steps: int = EVAL_STEPS,
        warmup_steps: int = WARMUP_STEPS,
        gradient_checkpointing: bool = GRADIENT_CHECKPOINTING,
        load_best_model_at_end: bool = LOAD_BEST_MODEL_AT_END,
        save_total_limit: int = SAVE_TOTAL_LIMIT,
        eval_accumulation_steps: int = EVAL_ACCUMULATION_STEPS,
        gradient_accumulation_steps: int = GRADIENT_ACCUMULATION_STEPS,
        max_new_tokens: int = MAX_NEW_TOKEN,
        num_return_sequences: int = NUM_RETURN_SEQUENCES,
        temperature: float = TEMPERATURE,
        top_p: float = TOP_P,
        top_k: int = TOP_K,
        do_sample: bool = True,
        repetition_penalty: float = REPETITION_PENALTY,
        callbacks: list[TrainerCallback] = [
            EarlyStoppingCallback(early_stopping_patience=2) if ACTIVATE_EVAL else None,
            TensorBoardCallback() if ACTIVATE_TENSORBOARD else None,
        ],
        activate_fp16: bool = ACTIVATE_FP16,
        activate_eval: bool = ACTIVATE_EVAL,
        activate_save: bool = ACTIVATE_SAVE,
        activate_logs: bool = ACTIVATE_LOGS,
        activate_tensorboard: bool = ACTIVATE_TENSORBOARD,
        activate_callbacks: bool = ACTIVATE_CALLBACKS,
        output_dir: PathLike = get_target_dirpath("out"),
        logging_dir: PathLike = get_target_dirpath("logs"),
        cache_dir: PathLike = get_cache_dirpath("models"),
        enable_train_and_evaluate: bool = TRAIN_AND_EVALUATE,
        enable_generate_and_evaluate: bool = GENERATE_AND_EVALUATE,
        enable_clear_cache: bool = CLEAR_CACHE,
    ) -> None:
        self.inspect()

        if enable_train_and_evaluate:
            self.train_and_evaluate(
                dataset=dataset.select_splits(["train", "val"]),
                max_length=max_length,
                learning_rate=learning_rate,
                batch_size=batch_size,
                epochs=epochs,
                seed=seed,
                max_grad_norm=max_grad_norm,
                logging_steps=logging_steps,
                eval_steps=eval_steps,
                warmup_steps=warmup_steps,
                gradient_checkpointing=gradient_checkpointing,
                load_best_model_at_end=load_best_model_at_end,
                save_total_limit=save_total_limit,
                eval_accumulation_steps=eval_accumulation_steps,
                gradient_accumulation_steps=gradient_accumulation_steps,
                callbacks=callbacks,
                activate_fp16=activate_fp16,
                activate_eval=activate_eval,
                activate_save=activate_save,
                activate_logs=activate_logs,
                activate_tensorboard=activate_tensorboard,
                activate_callbacks=activate_callbacks,
                output_dir=output_dir,
                logging_dir=logging_dir,
            )
            display(self.training_metrics)

        if enable_generate_and_evaluate:
            self.generate_and_evaluate(
                dataset=dataset.select_splits(["test"]),
                references=dataset.references,
                max_new_tokens=max_new_tokens,
                num_return_sequences=num_return_sequences,
                temperature=temperature,
                top_p=top_p,
                top_k=top_k,
                do_sample=do_sample,
                repetition_penalty=repetition_penalty,
            )
            display(self.generation_metrics)

        if enable_clear_cache:
            self.clear_cache(cache_dir=cache_dir)

    def print_model_information(self) -> None:
        print(json.dumps(self.to_dict(), indent=2))

    def inspect(self) -> None:
        print(
            f"{self.tokenizer}\n\n{self.model}\n\n{self.model.config}\n"
            f"Token Type                        | Value\n"
            f"----------------------------------+-------------------\n"
            f"Padding Token [PAD]               | {self.tokenizer.pad_token}\n"
            f"Beginning of Sentence Token [BOS] | {self.tokenizer.bos_token}\n"
            f"End of Sentence Token [EOS]       | {self.tokenizer.eos_token}\n"
            f"Unknown Token [UNK]               | {self.tokenizer.unk_token}\n"
        )

    def __str__(self) -> str:
        return f"{self.model_key} ({self.model_id})"

In [19]:
# Build and train the GPT-2 Base model with the quest data
gpt2_base: QuestGenLLM = QuestGenLLM.from_pretrained("gpt2")
gpt2_base.run()

[DOWNLOAD] gpt2 (openai-community/gpt2)
[LoRAINFO] trainable params: 405,504 || all params: 82,377,984 || trainable%: 0.4922
[COMPLETE] "gpt2" ready in 15.76s.

GPT2TokenizerFast(name_or_path='openai-community/gpt2', vocab_size=50257, model_max_length=1024, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
	50256: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
}
)

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): GPT2LMHeadModel(
      (transformer): GPT2Model(
        (wte): Embedding(50257, 768)
        (wpe): Embedding(1024, 768)
        (drop): Dropout(p=0.1, inplace=False)
        (h): ModuleList(
          (0-11): 12 x GPT2Block(
            (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (attn): GPT2Attentio

Epoch,Training Loss,Validation Loss
1,2.4868,2.15732


[COMPLETE] Elapsed: 164.84s



TrainingMetrics(
  train_losses=[2.6818, 2.6673, 2.6932, 2.6332, 2.6388, 2.6266, 2.622, 2.5711, 2.5789, 2.4868],
  eval_losses=[2.1573195457458496],
  learning_rates=[3e-06, 6e-06, 9e-06, 1.2e-05, 1.5e-05, 1.8e-05, 2.1e-05, 2.4e-05, 2.7000000000000002e-05, 3e-05],
  grad_norms=[0.3146290183067322, 0.45661231875419617, 0.27802565693855286, 0.36798006296157837, 0.4167405962944031, 0.4057275354862213, 0.3772679269313812, 0.46999871730804443, 0.6387847065925598, 0.6707265377044678],
  global_steps=[10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 102, 102],
  epochs=[0.09803921568627451, 0.19607843137254902, 0.29411764705882354, 0.39215686274509803, 0.49019607843137253, 0.5882352941176471, 0.6862745098039216, 0.7843137254901961, 0.8823529411764706, 0.9803921568627451, 1.0],
)

[GENERATE] gpt2 (openai-community/gpt2)
[COMPLETE] Elapsed: 197.16s

[EVALUATE] gpt2 (openai-community/gpt2)


[nltk_data] Downloading package wordnet to
[nltk_data]     /home/mambauser/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/mambauser/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /home/mambauser/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


[COMPLETE] Elapsed: 27.39s



GenerationMetrics(
  perplexity=8.64792619317149,
  bleu=0.02206994139702736,
  meteor=0.2413806705792957,
  bertscore={'precision': 0.7400087246707842, 'recall': 0.8437292476495107, 'f1': 0.7882864089573131},
)

In [20]:
# Build and train the GPT-2 Medium model with the quest data
gpt2_medium: QuestGenLLM = QuestGenLLM.from_pretrained("gpt2-medium")
gpt2_medium.run()

[DOWNLOAD] gpt2-medium (openai-community/gpt2-medium)
[LoRAINFO] trainable params: 1,081,344 || all params: 204,909,568 || trainable%: 0.5277
[COMPLETE] "gpt2-medium" ready in 33.74s.

GPT2TokenizerFast(name_or_path='openai-community/gpt2-medium', vocab_size=50257, model_max_length=1024, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
	50256: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
}
)

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): GPT2LMHeadModel(
      (transformer): GPT2Model(
        (wte): Embedding(50257, 1024)
        (wpe): Embedding(1024, 1024)
        (drop): Dropout(p=0.1, inplace=False)
        (h): ModuleList(
          (0-23): 24 x GPT2Block(
            (ln_1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True

Epoch,Training Loss,Validation Loss
1,2.1188,1.882551


[COMPLETE] Elapsed: 491.37s



TrainingMetrics(
  train_losses=[2.3256, 2.3187, 2.3665, 2.3241, 2.2984, 2.2742, 2.2795, 2.237, 2.2015, 2.1188],
  eval_losses=[1.8825511932373047],
  learning_rates=[3e-06, 6e-06, 9e-06, 1.2e-05, 1.5e-05, 1.8e-05, 2.1e-05, 2.4e-05, 2.7000000000000002e-05, 3e-05],
  grad_norms=[0.3069506287574768, 0.2813541293144226, 0.2306906282901764, 0.2805655002593994, 0.4793981909751892, 0.3226276636123657, 0.3588409125804901, 0.394062876701355, 0.436593234539032, 0.48396816849708557],
  global_steps=[10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 102, 102],
  epochs=[0.09803921568627451, 0.19607843137254902, 0.29411764705882354, 0.39215686274509803, 0.49019607843137253, 0.5882352941176471, 0.6862745098039216, 0.7843137254901961, 0.8823529411764706, 0.9803921568627451, 1.0],
)

[GENERATE] gpt2-medium (openai-community/gpt2-medium)
[COMPLETE] Elapsed: 361.11s

[EVALUATE] gpt2-medium (openai-community/gpt2-medium)


[nltk_data] Downloading package wordnet to
[nltk_data]     /home/mambauser/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/mambauser/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /home/mambauser/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


[COMPLETE] Elapsed: 23.28s



GenerationMetrics(
  perplexity=6.570245464652303,
  bleu=0.024207667637293432,
  meteor=0.25438551477839205,
  bertscore={'precision': 0.7463197515291327, 'recall': 0.8441664342786751, 'f1': 0.7920739446200576},
)

In [21]:
# Build and train the GPT-2 Large model with the quest data
gpt2_large: QuestGenLLM = QuestGenLLM.from_pretrained("gpt2-large")
gpt2_large.run()

[DOWNLOAD] gpt2-large (openai-community/gpt2-large)
[LoRAINFO] trainable params: 2,027,520 || all params: 422,163,200 || trainable%: 0.4803
[COMPLETE] "gpt2-large" ready in 90.56s.

GPT2TokenizerFast(name_or_path='openai-community/gpt2-large', vocab_size=50257, model_max_length=1024, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
	50256: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
}
)

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): GPT2LMHeadModel(
      (transformer): GPT2Model(
        (wte): Embedding(50257, 1280)
        (wpe): Embedding(1024, 1280)
        (drop): Dropout(p=0.1, inplace=False)
        (h): ModuleList(
          (0-35): 36 x GPT2Block(
            (ln_1): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
  

Epoch,Training Loss,Validation Loss
1,1.6781,1.523196


[COMPLETE] Elapsed: 2249.30s



TrainingMetrics(
  train_losses=[2.0181, 2.0203, 2.0433, 1.9901, 1.9935, 1.959, 1.9214, 1.8726, 1.8076, 1.6781],
  eval_losses=[1.5231956243515015],
  learning_rates=[3e-06, 6e-06, 9e-06, 1.2e-05, 1.5e-05, 1.8e-05, 2.1e-05, 2.4e-05, 2.7000000000000002e-05, 3e-05],
  grad_norms=[0.2274000644683838, 0.23313337564468384, 0.23457348346710205, 0.27480176091194153, 0.30306562781333923, 0.3335680365562439, 0.3292894959449768, 0.3862982392311096, 0.43068403005599976, 0.4557386338710785],
  global_steps=[10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 102, 102],
  epochs=[0.09803921568627451, 0.19607843137254902, 0.29411764705882354, 0.39215686274509803, 0.49019607843137253, 0.5882352941176471, 0.6862745098039216, 0.7843137254901961, 0.8823529411764706, 0.9803921568627451, 1.0],
)

[GENERATE] gpt2-large (openai-community/gpt2-large)
[COMPLETE] Elapsed: 7926.82s

[EVALUATE] gpt2-large (openai-community/gpt2-large)


[nltk_data] Downloading package wordnet to
[nltk_data]     /home/mambauser/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/mambauser/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /home/mambauser/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


[COMPLETE] Elapsed: 1066.14s



GenerationMetrics(
  perplexity=4.586859680020465,
  bleu=0.026564246598304673,
  meteor=0.22971020896051123,
  bertscore={'precision': 0.7378851601890489, 'recall': 0.8417737542414198, 'f1': 0.7862072873349283},
)

In [22]:
# Build and train the Llama 3.2 model with the quest data
llama32: QuestGenLLM = QuestGenLLM.from_pretrained("llama-3.2-1b-instruct")
llama32.run()

[DOWNLOAD] llama-3.2-1b-instruct (meta-llama/Llama-3.2-1B-Instruct)
[LoRAINFO] trainable params: 688,128 || all params: 749,963,264 || trainable%: 0.0918
[COMPLETE] "llama-3.2-1b-instruct" ready in 77.52s.

PreTrainedTokenizerFast(name_or_path='meta-llama/Llama-3.2-1B-Instruct', vocab_size=128000, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|begin_of_text|>', 'eos_token': '<|eot_id|>'}, clean_up_tokenization_spaces=True, added_tokens_decoder={
	128000: AddedToken("<|begin_of_text|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128001: AddedToken("<|end_of_text|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128002: AddedToken("<|reserved_special_token_0|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128003: AddedToken("<|reserved_special_token_1|>", rstrip=False, lstrip=False, single_word=False, normaliz

Epoch,Training Loss,Validation Loss
1,1.8126,1.675879


[COMPLETE] Elapsed: 35829.15s



TrainingMetrics(
  train_losses=[2.4511, 2.4452, 2.4754, 2.3911, 2.3868, 2.3196, 2.2332, 2.1189, 1.9977, 1.8126],
  eval_losses=[1.6758792400360107],
  learning_rates=[3e-06, 6e-06, 9e-06, 1.2e-05, 1.5e-05, 1.8e-05, 2.1e-05, 2.4e-05, 2.7000000000000002e-05, 3e-05],
  grad_norms=[1.0357692241668701, 1.1105165481567383, 1.0999761819839478, 1.293423056602478, 1.5135807991027832, 1.837256908416748, 1.4708731174468994, 1.4291259050369263, 1.7081077098846436, 1.9814156293869019],
  global_steps=[10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 102, 102],
  epochs=[0.09803921568627451, 0.19607843137254902, 0.29411764705882354, 0.39215686274509803, 0.49019607843137253, 0.5882352941176471, 0.6862745098039216, 0.7843137254901961, 0.8823529411764706, 0.9803921568627451, 1.0],
)

[GENERATE] llama-3.2-1b-instruct (meta-llama/Llama-3.2-1B-Instruct)
[COMPLETE] Elapsed: 4506.90s

[EVALUATE] llama-3.2-1b-instruct (meta-llama/Llama-3.2-1B-Instruct)


[nltk_data] Downloading package wordnet to
[nltk_data]     /home/mambauser/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/mambauser/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /home/mambauser/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


[COMPLETE] Elapsed: 581.95s



GenerationMetrics(
  perplexity=5.343491295729844,
  bleu=0.025580081385794855,
  meteor=0.23766147915442493,
  bertscore={'precision': 0.7390624810667599, 'recall': 0.8389458305695477, 'f1': 0.785688674917408},
)

In [23]:
# Build and train the TinyLlama model with the quest data
tinyllama: QuestGenLLM = QuestGenLLM.from_pretrained("tinyllama-1.1b-chat")
tinyllama.run()

[DOWNLOAD] tinyllama-1.1b-chat (TinyLlama/TinyLlama-1.1B-Chat-v1.0)
[LoRAINFO] trainable params: 923,648 || all params: 616,529,920 || trainable%: 0.1498
[COMPLETE] "tinyllama-1.1b-chat" ready in 69.49s.

LlamaTokenizerFast(name_or_path='TinyLlama/TinyLlama-1.1B-Chat-v1.0', vocab_size=32000, model_max_length=2048, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '</s>'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
	0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
)

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(32000, 2

Epoch,Training Loss,Validation Loss
1,1.3555,1.268432


[COMPLETE] Elapsed: 41814.71s



TrainingMetrics(
  train_losses=[2.0596, 2.0551, 2.0746, 1.9927, 1.9256, 1.8283, 1.7111, 1.5746, 1.465, 1.3555],
  eval_losses=[1.2684321403503418],
  learning_rates=[3e-06, 6e-06, 9e-06, 1.2e-05, 1.5e-05, 1.8e-05, 2.1e-05, 2.4e-05, 2.7000000000000002e-05, 3e-05],
  grad_norms=[2.545828342437744, 2.529114246368408, 2.4147751331329346, 2.2650671005249023, 2.5697553157806396, 2.09818172454834, 1.3390394449234009, 1.146890640258789, 1.0917110443115234, 1.0018268823623657],
  global_steps=[10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 102, 102],
  epochs=[0.09803921568627451, 0.19607843137254902, 0.29411764705882354, 0.39215686274509803, 0.49019607843137253, 0.5882352941176471, 0.6862745098039216, 0.7843137254901961, 0.8823529411764706, 0.9803921568627451, 1.0],
)

[GENERATE] tinyllama-1.1b-chat (TinyLlama/TinyLlama-1.1B-Chat-v1.0)
[COMPLETE] Elapsed: 13831.71s

[EVALUATE] tinyllama-1.1b-chat (TinyLlama/TinyLlama-1.1B-Chat-v1.0)


[nltk_data] Downloading package wordnet to
[nltk_data]     /home/mambauser/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/mambauser/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /home/mambauser/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


[COMPLETE] Elapsed: 712.19s



GenerationMetrics(
  perplexity=3.555274019633397,
  bleu=0.03065837242300427,
  meteor=0.21842834506982772,
  bertscore={'precision': 0.7421208041555741, 'recall': 0.8400070889323366, 'f1': 0.7878747623340756},
)

In [27]:
def save_metrics(
    name: str,
    metrics: dict[str, GenerationMetrics | TrainingMetrics],
    output_dir: PathLike = get_target_dirpath("out"),
) -> None:
    metrics_dict: dict[str, dict[str, list[int | float]]] = {
        k: v.to_dict()
        if isinstance(v, GenerationMetrics) or isinstance(v, TrainingMetrics)
        else None
        for k, v in metrics.items()
    }

    json_file_path: Path = Path(output_dir) / f"{name}.json"
    with open(json_file_path, "w") as json_writer:
        json.dump(metrics_dict, json_writer, indent=2)

    print(f"Saved to {json_file_path}")

In [28]:
# Save metrics for future evaluations
save_metrics("training_metrics", TRAINING_METRICS)
save_metrics("generation_metrics", GENERATION_METRICS)

Saved to /app/out/training_metrics.json
Saved to /app/out/generation_metrics.json
