## QuestGen-LLM: Fine-Tuning

This notebook covers the fine-tuning of several pre-trained _large language models_ (LLMs) on the prepared ["quest"](../data/quests_train.json) dataset. Each model is trained and validated on the dataset (with frozen parameters), and the evaluation results are then compared. The table below lists the LLMs used, together with their parameter counts.

| S. No. | Large Language Model                | Parameters | Developed By | Notes                                              |
| :----: | :---------------------------------- | :--------: | :----------: | :------------------------------------------------- |
|   1.   | GPT-2[^1]                           |    124M    |    OpenAI    | Base model from the GPT-2 family                   |
|   2.   | GPT-2 Medium[^2]                    |    355M    |    OpenAI    | Larger variant with improved language modeling     |
|   3.   | GPT-2 Large[^3]                     |    774M    |    OpenAI    | Capable of generating more coherent longer text    |
|   4.   | Llama-2-7B-Chat[^4] \*†             |     7B     |     Meta     | Chat-optimized version of LLaMA-2                  |
|   5.   | Llama-3.1-8B-Instruct[^5] \*†       |     8B     |     Meta     | Instruction-tuned variant of LLaMA-3.1             |
|   6.   | Mistral-7B-Instruct-v0.2[^6]        |     7B     |  Mistral AI  | Instruct fine-tuned version of the Mistral-7B-v0.2 |
|   7.   | DeepSeek-R1-Distill-Qwen-1.5B[^7] † |    1.5B    | DeepSeek AI  | Distilled model based on the Qwen architecture     |
|   8.   | DeepSeek-R1-Distill-Llama-8B[^8] †  |     8B     | DeepSeek AI  | Distilled model based on the LLaMA architecture    |

> Fine-tuning uses _supervised fine-tuning_\* (SFT) and _reinforcement learning from human feedback_† (RLHF).

<!-- References -->

[^1]: https://huggingface.co/openai-community/gpt2
[^2]: https://huggingface.co/openai-community/gpt2-medium
[^3]: https://huggingface.co/openai-community/gpt2-large
[^4]: https://huggingface.co/meta-llama/Llama-2-7b-chat-hf
[^5]: https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
[^6]: https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2
[^7]: https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
[^8]: https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B


In [1]:
import torch
from pathlib import Path
from typing import Literal, TypeAlias


# Map for the model identifiers: (model_key -> model_id)
model_ids: dict[str, str] = {
    "gpt2": "openai-community/gpt2",
    "gpt2-medium": "openai-community/gpt2-medium",
    "gpt2-large": "openai-community/gpt2-large",
    "llama-2-7b-chat": "meta-llama/Llama-2-7b-chat-hf",
    "llama-3.1-8b-instruct": "meta-llama/Llama-3.1-8B-Instruct",
    "mistral-7b-instruct-v0.2": "mistralai/Mistral-7B-Instruct-v0.2",
    "deepseek-r1-distill-qwen-1.5b": "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    "deepseek-r1-distill-llama-8b": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
}

# Set the default device
DEVICE_TYPE: TypeAlias = Literal["cuda", "cpu"]
device: DEVICE_TYPE = "cuda" if torch.cuda.is_available() else "cpu"

# Set the default cache directory
cache_dir: Path = Path("../models/.cache/")
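
The outputs recorded later in this notebook report `mps:0` as the device, which the `cuda`/`cpu` check above does not cover. A minimal sketch of a three-way selection follows. `DeviceName` and `pick_device` are hypothetical names, not part of the notebook, and the helper takes the availability flags as plain booleans so that it can be run without a GPU.

```python
from typing import Literal

# Hypothetical alias that extends the notebook's DEVICE_TYPE with Apple's "mps" backend
DeviceName = Literal["cuda", "mps", "cpu"]


def pick_device(cuda_available: bool, mps_available: bool) -> DeviceName:
    """Prefer CUDA, then Apple Metal (MPS), and fall back to the CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


# In the notebook the flags would come from PyTorch, for example:
#   pick_device(torch.cuda.is_available(), torch.backends.mps.is_available())
print(pick_device(cuda_available=False, mps_available=True))  # prints "mps"
```

Because `download_model` passes `device_map="auto"`, this value would mainly influence the dtype choice rather than where the weights are placed.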

In [2]:
import time
from os import PathLike
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    PreTrainedModel,
    PreTrainedTokenizerFast,
)
from typing import Any, Optional


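def download_model(
    model_key: str,
    device: DEVICE_TYPE = "cuda" if torch.cuda.is_available() else "cpu",
    cache_dir: str | PathLike = Path("../models/.cache/"),
) -> dict[str, Any]:
    """Download the tokenizer and model for `model_key` into its own cache folder.

    Returns the keyword arguments expected by `QuestGenLLM`. For an unknown key,
    the tokenizer and model are `None` and the model id is "N/A".
    """
    # Coerce to Path so that the "/" joins below also work for str arguments
    cache_dir = Path(cache_dir)
    start_time: float = time.time()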
    model_id: Optional[str] = model_ids.get(model_key, None)
    print(f"Downloading {model_key} ({model_id if model_id else 'N/A'})...")

    if model_id is None:
        return {
            "model_key": model_key,
            "model_id": "N/A",
            "tokenizer": None,
            "model": None,
        }

    # Download the tokenizer using the model id
    tokenizer: PreTrainedTokenizerFast = AutoTokenizer.from_pretrained(
        model_id,
        cache_dir=(cache_dir / model_key),
        use_fast=True,
    )

    # Download the model using the model id
    model: PreTrainedModel = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16 if device == "cuda" else torch.float32,
        device_map="auto",
        cache_dir=(cache_dir / model_key),
    )

    end_time: float = time.time()
    print(f"{model_key} ready in {(end_time - start_time):.2f}s.")

    return {
        "model_key": model_key,
        "model_id": model_id,
        "tokenizer": tokenizer,
        "model": model,
    }
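
`download_model` loads the weights in `float16` on CUDA and in `float32` otherwise, so the same model needs roughly twice the memory on a CPU-only machine. The sketch below estimates that footprint from a parameter count. `weight_memory_gib` is a hypothetical helper, and the estimate covers the weights only (no activations, buffers, or optimizer state).

```python
# Bytes per parameter for the two dtypes used by download_model
BYTES_PER_PARAM: dict[str, int] = {"float16": 2, "float32": 4}


def weight_memory_gib(num_parameters: int, dtype: str) -> float:
    """Approximate memory taken by the weights alone, in GiB."""
    return num_parameters * BYTES_PER_PARAM[dtype] / 2**30


# Nominal parameter counts from the table above (rounded, so results are approximate)
nominal_counts = {"GPT-2": 124_000_000, "Llama-3.1-8B-Instruct": 8_000_000_000}

for name, count in nominal_counts.items():
    fp16 = weight_memory_gib(count, "float16")
    fp32 = weight_memory_gib(count, "float32")
    print(f"{name}: {fp16:.2f} GiB (float16), {fp32:.2f} GiB (float32)")
```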

In [3]:
import os
import shutil
from dataclasses import dataclass, field


@dataclass
class QuestGenLLM:
    tokenizer: Optional[PreTrainedTokenizerFast]
    model: Optional[PreTrainedModel]
    model_key: str  # Alias for the model, e.g., "gpt2"
    model_id: str  # Hugging Face model name, e.g., "openai-community/gpt2"
    device: str = field(init=False)
    dtype: str = field(init=False)

    def __post_init__(self):
        # Automatically determine the device used by the model
        self.device = str(getattr(self.model, "device", "N/A"))

        # Automatically determine the dtype used by the model
        self.dtype = str(getattr(self.model, "dtype", "N/A")).replace("torch.", "")

    def to_dict(self) -> dict[str, Any]:
        return {
            "model_key": self.model_key,
            "model_id": self.model_id,
            "device": self.device,
            "dtype": self.dtype,
            "vocab_size": getattr(self.tokenizer, "vocab_size", "unknown"),
            "max_length": getattr(self.tokenizer, "model_max_length", "unknown"),
            "model_type": getattr(
                getattr(self.model, "config", ""), "model_type", "unknown"
            ),
            "num_parameters": self.model.num_parameters()
            if hasattr(self.model, "num_parameters")
            else "N/A",
        }

    def clear_cache(self, cache_dir: str | PathLike = Path("../models/.cache/")) -> None:
        # Coerce to Path so that the "/" join also works for str arguments
        model_cache: Path = Path(cache_dir) / self.model_key
        if model_cache.exists():
            shutil.rmtree(model_cache)
            print(f"Cache directory '{model_cache}' removed.")
        else:
            print(f"No cache directory found at '{model_cache}'.")
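
The `getattr(..., default)` lookups in `__post_init__` are what let `QuestGenLLM` be built even when `download_model` returns `None` for an unknown key. The sketch below isolates that pattern. `describe` and `fake_model` are hypothetical stand-ins, with a `SimpleNamespace` playing the role of a loaded model.

```python
from types import SimpleNamespace
from typing import Any


def describe(model: Any) -> dict[str, str]:
    """Mirror the getattr-with-default lookups used in __post_init__."""
    return {
        "device": str(getattr(model, "device", "N/A")),
        "dtype": str(getattr(model, "dtype", "N/A")).replace("torch.", ""),
    }


# Stand-in for a loaded model; the attribute values are illustrative strings
fake_model = SimpleNamespace(device="mps:0", dtype="torch.float32")

print(describe(fake_model))  # {'device': 'mps:0', 'dtype': 'float32'}
print(describe(None))        # {'device': 'N/A', 'dtype': 'N/A'}
```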

In [4]:
# Download the pre-trained GPT-2 model
gpt2_llm: QuestGenLLM = QuestGenLLM(**download_model("gpt2"))
gpt2_llm.to_dict()

Downloading gpt2 (openai-community/gpt2)...



gpt2 ready in 32.19s.


{'model_key': 'gpt2',
 'model_id': 'openai-community/gpt2',
 'device': 'mps:0',
 'dtype': 'float32',
 'vocab_size': 50257,
 'max_length': 1024,
 'model_type': 'gpt2',
 'num_parameters': 124439808}

In [5]:
gpt2_llm.clear_cache()

Cache directory '../models/.cache/gpt2' removed.


In [6]:
# Download the pre-trained GPT-2 Medium model
gpt2_medium_llm: QuestGenLLM = QuestGenLLM(**download_model("gpt2-medium"))
gpt2_medium_llm.to_dict()

Downloading gpt2-medium (openai-community/gpt2-medium)...
gpt2-medium ready in 3.09s.


{'model_key': 'gpt2-medium',
 'model_id': 'openai-community/gpt2-medium',
 'device': 'mps:0',
 'dtype': 'float32',
 'vocab_size': 50257,
 'max_length': 1024,
 'model_type': 'gpt2',
 'num_parameters': 354823168}

In [7]:
gpt2_medium_llm.clear_cache()

Cache directory '../models/.cache/gpt2-medium' removed.


In [8]:
# Download the pre-trained GPT-2 Large model
gpt2_large_llm: QuestGenLLM = QuestGenLLM(**download_model("gpt2-large"))
gpt2_large_llm.to_dict()

Downloading gpt2-large (openai-community/gpt2-large)...



Some parameters are on the meta device because they were offloaded to the disk.


gpt2-large ready in 144.39s.


{'model_key': 'gpt2-large',
 'model_id': 'openai-community/gpt2-large',
 'device': 'mps:0',
 'dtype': 'float32',
 'vocab_size': 50257,
 'max_length': 1024,
 'model_type': 'gpt2',
 'num_parameters': 774030080}

In [9]:
gpt2_large_llm.clear_cache()

Cache directory '../models/.cache/gpt2-large' removed.
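
The `num_parameters` values reported above can be cross-checked by hand from the GPT-2 architecture. The sketch below is a hypothetical helper, not part of the notebook. It assumes the published GPT-2 hidden sizes and layer counts (768/12, 1024/24, 1280/36), a 4x MLP expansion, and an output head tied to the token embedding so that it is counted only once. The vocabulary size and context length are taken from the outputs above.

```python
def gpt2_parameter_count(
    n_embd: int,
    n_layer: int,
    vocab_size: int = 50257,
    n_positions: int = 1024,
) -> int:
    """Count GPT-2 parameters, with the output head tied to the token embedding."""
    # Token and position embedding tables
    embeddings = vocab_size * n_embd + n_positions * n_embd
    # Fused query/key/value projection plus the attention output projection
    attention = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # MLP up-projection (4x wider) plus down-projection
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    # Two LayerNorms per block, each with a weight and a bias vector
    layer_norms = 2 * (2 * n_embd)
    # Final LayerNorm after the last block
    final_norm = 2 * n_embd
    return embeddings + n_layer * (attention + mlp + layer_norms) + final_norm


# (n_embd, n_layer) pairs assumed from the published GPT-2 configurations
configs = {"gpt2": (768, 12), "gpt2-medium": (1024, 24), "gpt2-large": (1280, 36)}

for key, (n_embd, n_layer) in configs.items():
    print(f"{key}: {gpt2_parameter_count(n_embd, n_layer):,}")
```

Under these assumptions the three results equal the reported counts exactly: 124,439,808, 354,823,168, and 774,030,080.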
