## QuestGen-LLM: Fine-Tuning

This notebook covers fine-tuning several pre-trained _large language models_ (LLMs) on the prepared ["quest"](../data/quests_train.json) dataset. Each model is trained and validated on the dataset (with frozen parameters), and the evaluation results are compared. The table below lists the LLMs used, along with their parameter counts.

| S. No. | Large Language Model                | Parameters | Developed By | Notes                                              |
| :----: | :---------------------------------- | :--------: | :----------: | :------------------------------------------------- |
|   1.   | GPT-2[^1]                           |    124M    |    OpenAI    | Base model from the GPT-2 family                   |
|   2.   | GPT-2 Medium[^2]                    |    355M    |    OpenAI    | Larger variant with improved language modeling     |
|   3.   | GPT-2 Large[^3]                     |    774M    |    OpenAI    | Capable of generating more coherent longer text    |
|   4.   | Llama-2-7B-Chat[^4] \*†             |     7B     |     Meta     | Chat-optimized version of LLaMA-2                  |
|   5.   | Llama-3.1-8B-Instruct[^5] \*†       |     8B     |     Meta     | Instruction-tuned variant of LLaMA-3.1             |
|   6.   | Mistral-7B-Instruct-v0.2[^6]        |     7B     |  Mistral AI  | Instruction-tuned version of Mistral-7B-v0.2       |
|   7.   | DeepSeek-R1-Distill-Qwen-1.5B[^7] † |    1.5B    | DeepSeek AI  | Distilled model based on the Qwen architecture     |
|   8.   | DeepSeek-R1-Distill-Llama-8B[^8] †  |     8B     | DeepSeek AI  | Distilled model based on the LLaMA architecture    |

> Fine-tuning uses _supervised fine-tuning_\* (SFT) and _reinforcement learning from human feedback_† (RLHF).
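
The parameter counts in the table give a rough idea of how much memory the weights alone occupy, which is why the loader later in this notebook picks `float16` when a GPU is available. The sketch below is back-of-the-envelope arithmetic only: it assumes 4 bytes per parameter for `float32` and 2 for `float16`, uses the rounded counts from the table, and ignores gradients, optimizer state, and activations, all of which add substantially to the footprint during training. The helper name `weight_memory_gb` is hypothetical.

```python
# Bytes needed to store one parameter in each dtype.
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_memory_gb(num_parameters: float, dtype: str) -> float:
    """Return the approximate size of the weights alone, in GB (10^9 bytes)."""
    return num_parameters * BYTES_PER_PARAM[dtype] / 1e9


# Rounded parameter counts taken from the table above.
param_counts = {
    "gpt2": 124e6,
    "gpt2-large": 774e6,
    "llama-2-7b-chat": 7e9,
}

for key, count in param_counts.items():
    fp32 = weight_memory_gb(count, "float32")
    fp16 = weight_memory_gb(count, "float16")
    print(f"{key:<16} float32: {fp32:6.2f} GB   float16: {fp16:6.2f} GB")
```

Under these assumptions, halving the precision halves the weight footprint, but the 7B and 8B models remain far more demanding than the GPT-2 variants.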

<!-- References -->

[^1]: https://huggingface.co/openai-community/gpt2
[^2]: https://huggingface.co/openai-community/gpt2-medium
[^3]: https://huggingface.co/openai-community/gpt2-large
[^4]: https://huggingface.co/meta-llama/Llama-2-7b-chat-hf
[^5]: https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
[^6]: https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2
[^7]: https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
[^8]: https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B


In [1]:
from __future__ import annotations


# Map for the model identifiers: (model_key -> model_id)
model_ids: dict[str, str] = {
    "gpt2": "openai-community/gpt2",
    "gpt2-medium": "openai-community/gpt2-medium",
    "gpt2-large": "openai-community/gpt2-large",
    "llama-2-7b-chat": "meta-llama/Llama-2-7b-chat-hf",
    "llama-3.1-8b-instruct": "meta-llama/Llama-3.1-8B-Instruct",
    "mistral-7b-instruct-v0.2": "mistralai/Mistral-7B-Instruct-v0.2",
    "deepseek-r1-distill-qwen-1.5b": "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    "deepseek-r1-distill-llama-8b": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
}

In [None]:
from pathlib import Path
from datasets import load_dataset, DatasetDict


data_dir: Path = Path("../data/")

# Load the quest dataset
quest_set: DatasetDict = load_dataset(
    "text",
    data_files={
        "train": str(data_dir / "quests_train.txt"),
        "val": str(data_dir / "quests_val.txt"),
        "test": str(data_dir / "quests_test.txt"),
    },
    cache_dir=str(data_dir / ".cache"),
)
quest_set

In [None]:
quest_set["train"][:30]

In [4]:
import datetime
import os
import shutil
import time
import torch
from dataclasses import dataclass, field
from datasets import Dataset, DatasetDict
from os import PathLike
from pathlib import Path
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    DataCollatorForLanguageModeling,
    EarlyStoppingCallback,
    PreTrainedModel,
    PreTrainedTokenizerFast,
    PreTrainedTokenizer,
    Trainer,
    TrainingArguments,
    set_seed,
)
from typing import Any


@dataclass
class QuestGenLLM:
    tokenizer: PreTrainedTokenizer | PreTrainedTokenizerFast
    model: PreTrainedModel
    model_key: str  # Alias for the model, e.g., "gpt2"
    model_id: str  # Hugging Face model name, e.g., "openai-community/gpt2"
    device: str = field(init=False)
    dtype: str = field(init=False)

    def __post_init__(self):
        # Automatically determine the device used by the model
        self.device = str(getattr(self.model, "device", "N/A"))

        # Automatically determine the dtype used by the model
        self.dtype = str(getattr(self.model, "dtype", "N/A")).replace("torch.", "")

    @classmethod
    def from_pretrained(
        cls,
        model_key: str,
        model_id: str,
        cache_dir: PathLike = Path("../models/.cache/"),
    ) -> QuestGenLLM:
        # Accept plain strings as well as Path objects for the cache location
        cache_dir = Path(cache_dir)

        print(f"[DOWNLOAD] {model_key} ({model_id})")
        start_time: float = time.time()

        # Download the tokenizer using the model id
        tokenizer: PreTrainedTokenizerFast = AutoTokenizer.from_pretrained(
            model_id,
            cache_dir=(cache_dir / model_key),
            use_fast=True,
        )

        # Download the model using the model id
        model: PreTrainedModel = AutoModelForCausalLM.from_pretrained(
            model_id,
            torch_dtype=(torch.float16 if torch.cuda.is_available() else torch.float32),
            device_map="auto",
            cache_dir=(cache_dir / model_key),
        )

        end_time: float = time.time()
        elapsed: float = end_time - start_time
        print(f'[COMPLETE] "{model_key}" ready in {elapsed:.2f}s.')

        return cls(tokenizer, model, model_key, model_id)

    def tokenize_and_train(
        self,
        dataset: DatasetDict,
        max_length: int = 512,
        learning_rate: float = 5e-5,
        batch_size: int = 4,
        epochs: int = 3,
        seed: int = 42,
        output_dir: PathLike = Path("../out"),
        logging_dir: PathLike = Path("../logs"),
    ) -> Trainer:
        # Ensure the training and validation splits are present
        if not all(split in dataset for split in ["train", "val"]):
            raise ValueError("DatasetDict must contain 'train' and 'val' splits.")

        # Ensure the output and logging directories exist
        output_dir, logging_dir = Path(output_dir), Path(logging_dir)
        os.makedirs(output_dir, exist_ok=True)
        os.makedirs(logging_dir, exist_ok=True)

        start_time: float
        end_time: float
        elapsed: float

        # Set the random seed for reproducibility
        set_seed(seed)

        # Fall back to the EOS token if the tokenizer defines no padding token
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token

        # Tokenize the dataset with `max_length` padding
        print(f"[TOKENIZE] {self.model_key} ({self.model_id})")
        start_time = time.time()
        tokenized_data: DatasetDict = dataset.map(
            lambda example: self.tokenizer(
                example["text"],
                padding="max_length",
                truncation=True,
                max_length=max_length,
            ),
            batched=True,
            remove_columns=["text"],
        )
        end_time = time.time()
        elapsed = end_time - start_time
        print(f"[COMPLETE] Elapsed: {str(datetime.timedelta(seconds=elapsed))}")

        # Set the model padding token (from the tokenizer)
        self.model.config.pad_token_id = self.tokenizer.pad_token_id

        # Set up the training configurations
        training_args: TrainingArguments = TrainingArguments(
            output_dir=str(output_dir / self.model_key),
            learning_rate=learning_rate,
            per_device_train_batch_size=batch_size,
            per_device_eval_batch_size=batch_size,
            num_train_epochs=epochs,
            eval_strategy="epoch",
            save_strategy="epoch",
            logging_dir=str(logging_dir / self.model_key),
            save_total_limit=2,
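            # NOTE: fp16 mixed precision generally expects float32 weights; with
            # weights already in float16, gradient unscaling may raise an error.
            fp16=(self.dtype == "float16"),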
            load_best_model_at_end=True,
            metric_for_best_model="eval_loss",
            seed=seed,
        )

        # Set up the data collator for the model
        data_collator: DataCollatorForLanguageModeling = (
            DataCollatorForLanguageModeling(tokenizer=self.tokenizer, mlm=False)
        )

        # Prepare and run the trainer
        trainer: Trainer = Trainer(
            model=self.model,
            args=training_args,
            data_collator=data_collator,
            train_dataset=tokenized_data["train"],
            eval_dataset=tokenized_data["val"],
            callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
        )

        print(f"[TRAINING] {self.model_key} ({self.model_id})")
        start_time = time.time()
        trainer.train()
        end_time = time.time()
        elapsed = end_time - start_time
        print(f"[COMPLETE] Elapsed: {datetime.timedelta(seconds=elapsed)}")

        # Save the model and tokenizer for later use
        trainer.save_model()
        self.tokenizer.save_pretrained(save_directory=training_args.output_dir)

        return trainer

    def to_dict(self) -> dict[str, Any]:
        return {
            "model_key": self.model_key,
            "model_id": self.model_id,
            "device": self.device,
            "dtype": self.dtype,
            "vocab_size": getattr(self.tokenizer, "vocab_size", "unknown"),
            "max_length": getattr(self.tokenizer, "model_max_length", "unknown"),
            "model_type": getattr(
                getattr(self.model, "config", None), "model_type", "unknown"
            ),
            "num_parameters": self.model.num_parameters()
            if hasattr(self.model, "num_parameters")
            else "N/A",
        }

    def clear_cache(self, cache_dir: PathLike = Path("../models/.cache/")) -> None:
        def remove_dir(dir_path: PathLike) -> None:
            if os.path.exists(dir_path):
                shutil.rmtree(dir_path)
                print(f"Cache directory '{dir_path}' removed.")
            else:
                print(f"No cache directory found at '{dir_path}'.")

        remove_dir(Path(cache_dir) / self.model_key)

    def __str__(self) -> str:
        return f"{self.model_key} ({self.model_id})"
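
`tokenize_and_train` pads every example to `max_length` with the EOS token and then hands the batch to `DataCollatorForLanguageModeling(mlm=False)`. The pure-Python sketch below imitates those two steps for a single sequence, as an illustration rather than the library's actual code. It assumes right-side padding, and the token ids in `example` are made up; `pad_or_truncate` and `causal_lm_labels` are hypothetical helper names. The value `-100` is the index that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def pad_or_truncate(token_ids, max_length, pad_token_id):
    """Imitate padding="max_length" with truncation=True for one right-padded sequence."""
    kept = token_ids[:max_length]
    num_pad = max_length - len(kept)
    input_ids = kept + [pad_token_id] * num_pad
    attention_mask = [1] * len(kept) + [0] * num_pad
    return input_ids, attention_mask


def causal_lm_labels(input_ids, pad_token_id):
    """Copy the inputs as labels and mask every position holding the pad token id."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]


EOS_ID = 50256  # GPT-2's end-of-text id, reused as the pad token in this notebook
example = [464, 2677, 7893, EOS_ID]  # hypothetical token ids with an explicit EOS

input_ids, attention_mask = pad_or_truncate(example, max_length=6, pad_token_id=EOS_ID)
labels = causal_lm_labels(input_ids, pad_token_id=EOS_ID)

print(input_ids)       # [464, 2677, 7893, 50256, 50256, 50256]
print(attention_mask)  # [1, 1, 1, 1, 0, 0]
print(labels)          # [464, 2677, 7893, -100, -100, -100]
```

One consequence worth noting: because the pad token and the EOS token share an id, a genuine EOS inside the text is masked out of the labels as well, so the model receives no loss signal for predicting it.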

In [None]:
# Download the pre-trained GPT-2 model
gpt2_llm: QuestGenLLM = QuestGenLLM.from_pretrained(
    model_key="gpt2", model_id=model_ids["gpt2"]
)
gpt2_llm.to_dict()
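
Before training, it may help to see how the patience setting in `tokenize_and_train` (`EarlyStoppingCallback(early_stopping_patience=2)`, tracking `eval_loss`) behaves. The sketch below is a simplified stand-in for that logic, assuming a lower loss is better and an improvement threshold of zero; the function name `epoch_of_early_stop` and the loss values are hypothetical.

```python
def epoch_of_early_stop(eval_losses, patience=2):
    """Return the 1-based epoch at which training would stop early, or None."""
    best = float("inf")
    evals_without_improvement = 0
    for epoch, loss in enumerate(eval_losses, start=1):
        if loss < best:
            # A new best loss resets the patience counter.
            best = loss
            evals_without_improvement = 0
        else:
            evals_without_improvement += 1
            if evals_without_improvement >= patience:
                return epoch
    return None


# Hypothetical per-epoch validation losses, for illustration only.
print(epoch_of_early_stop([2.10, 1.95, 1.97, 1.96, 1.90]))  # 4
print(epoch_of_early_stop([2.10, 1.95, 1.97, 1.93]))        # None
```

With the default `epochs=3`, the first evaluation always sets the best loss, so two consecutive non-improving evaluations can only occur at epoch 3, where training ends anyway. Under this simplified model, the callback starts to matter only when `epochs` is raised.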

In [None]:
# Tokenize and train the GPT-2 model on the quest dataset
gpt2_trainer: Trainer = gpt2_llm.tokenize_and_train(quest_set)
gpt2_trainer
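
Since the best checkpoint is selected by `eval_loss`, one common way to make that number easier to read is to convert it to perplexity, the exponential of the mean per-token cross-entropy. The sketch below shows the conversion only; the loss values are hypothetical placeholders, not results from this notebook. In practice the loss would come from something like `gpt2_trainer.evaluate()["eval_loss"]`.

```python
import math


def perplexity(eval_loss: float) -> float:
    """Convert a mean per-token cross-entropy loss (in nats) to perplexity."""
    return math.exp(eval_loss)


# Hypothetical eval losses, for illustration only (not results from this notebook).
hypothetical_losses = {"model-a": 2.0, "model-b": 1.5}

for name, loss in hypothetical_losses.items():
    print(f"{name}: eval_loss={loss:.2f} -> perplexity={perplexity(loss):.2f}")
```

Perplexity is computed per token, so values from models with different tokenizers (for example, GPT-2 versus the Llama family) are not directly comparable without further normalization.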