### 1. Installation and Environment Setup
**Purpose:** This block installs the **Unsloth** library and its essential dependencies. Unsloth is used to make fine-tuning significantly faster and more memory-efficient.

In [None]:
!pip install unsloth

Collecting unsloth
  Downloading unsloth-2025.12.9-py3-none-any.whl.metadata (65 kB)
Collecting unsloth_zoo>=2025.12.7 (from unsloth)
  Downloading unsloth_zoo-2025.12.7-py3-none-any.whl.metadata (32 kB)
Collecting tyro (from unsloth)
  Downloading tyro-1.0.3-py3-none-any.whl.metadata (12 kB)
Collecting xformers>=0.0.27.post2 (from unsloth)
  Downloading xformers-0.0.33.post2-cp39-abi3-manylinux_2_28_x86_64.whl.metadata (1.2 kB)
Collecting bitsandbytes!=0.46.0,!=0.48.0,>=0.45.5 (from unsloth)
  Downloading bitsandbytes-0.49.0-py3-none-manylinux_2_24_x86_64.whl.metadata (10 kB)
Collecting datasets!=4.0.*,!=4.1.0,<4.4.0,>=3.4.1 (from unsloth)
  Downloading datasets-4.3.0-py3-none-any.whl.metadata (18 kB)
Collecting trl!=0.19.0,<=0.24.0,>=0.18.2 (from u
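
After the install finishes, it can be useful to confirm which versions actually landed in the environment, since the pins above are fairly strict. The snippet below is a minimal sketch using only the standard library; the helper name `installed_version` is made up for this example, and the package list is simply the set of names that appear in the install log.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Package names taken from the pip log above.
for name in ("unsloth", "unsloth_zoo", "trl", "datasets", "bitsandbytes", "xformers"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If any entry prints `not installed`, re-running the install cell (and restarting the runtime) is usually the first thing to try.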

### 2. Loading the Model and Tokenizer
**Purpose:** This step downloads the pre-trained AI model and the "Tokenizer" (the tool that converts text into numbers the model can read).

**Key Parameters Explained:**
* **model_name:** Specifies the base model. Here, we use `unsloth/tinyllama-chat`, a small chat-tuned model built on the Llama architecture (about 1.1 billion parameters), which keeps downloads and training fast.
* **max_seq_length:** Defines the maximum length of text (in tokens) the model can handle at once. 2048 is a common choice and is plenty for short prompt/response pairs like the ones used here.
* **dtype = None:** This tells Unsloth to automatically pick the numeric precision for your specific GPU (Float16 or Bfloat16). On the Tesla T4 used in this run, the log below reports `Bfloat16 = FALSE`, so Float16 is used.
* **load_in_4bit = True:** This is a memory-saving technique. It loads the weights quantized to 4 bits, so the model uses much less VRAM than it would in 16-bit, allowing you to train on modest hardware without losing much accuracy.
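
To get a feel for what 4-bit loading saves, the weight storage can be estimated with simple arithmetic. This is a rough sketch, not a measurement: it counts weights only (no activations, optimizer state, or quantization metadata), and the helper name `weight_bytes` is made up for this example. The parameter count is derived from the training log later in this notebook.

```python
def weight_bytes(num_params, bits_per_weight):
    """Approximate storage for the weights alone, in bytes."""
    return num_params * bits_per_weight // 8


# Base parameter count derived from the training log later in this notebook:
# 1,150,511,104 total minus 50,462,720 LoRA adapter parameters.
BASE_PARAMS = 1_150_511_104 - 50_462_720

fp16_gb = weight_bytes(BASE_PARAMS, 16) / 1e9
int4_gb = weight_bytes(BASE_PARAMS, 4) / 1e9

print(f"16-bit weights: {fp16_gb:.2f} GB")
print(f" 4-bit weights: {int4_gb:.2f} GB")
```

The 16-bit estimate lines up with the 2.20G `model.safetensors` that appears in the export log in section 7. The 4-bit download reported in this section (762M) is larger than the naive estimate, presumably because some tensors stay in higher precision and the quantization constants are stored alongside the weights.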

In [None]:
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = 'unsloth/tinyllama-chat',
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.12.9: Fast Llama patching. Transformers: 4.57.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.1+cu128. CUDA: 7.5. CUDA Toolkit: 12.8. Triton: 3.5.1
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.33.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/762M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/438 [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

### 3. Data Loading and Formatting
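**Purpose:** This block transforms your raw JSON data into a format the AI can actually learn from. Chat models require a specific "template" to distinguish between what the user says and what the assistant should answer. Other model families use ChatML or Llama-3 headers; this model marks turns with `<|user|>` and `<|assistant|>`, as the inference output in section 6 shows.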

**Key Steps Explained:**
* **JSON Loading:** It reads your local file (`people_data.json`) and converts it into a Hugging Face `Dataset` object, which is optimized for fast processing.
* **The `to_text` Function:** This is a formatting pipeline.
    * It ensures the response is a string (even if it was originally a list or dictionary).
    * It organizes the data into a "Conversation" format with `user` and `assistant` roles.
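* **apply_chat_template:** This is the most important part. It uses the model's own tokenizer to wrap each turn in the special markers the model expects (here `<|user|>` and `<|assistant|>`, with `</s>` ending each turn; other models use markers such as `<|im_start|>` or `[INST]`) so the model learns the correct chat structure. `tokenize=False` returns plain text, and `add_generation_prompt=False` is used because the assistant's answer is already part of the training example.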
* **dataset.map:** This applies the formatting to every single row in your dataset and removes the old, unformatted columns.
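
The formatting steps above can be sketched without a tokenizer. In the snippet below, `render_chat` is a hypothetical stand-in for `tokenizer.apply_chat_template(..., tokenize=False)`: it only imitates the `<|user|>` / `<|assistant|>` layout visible in the inference output in section 6, and the real template may differ in whitespace details. The sample record is invented for illustration, reusing the `prompt` and `response` field names from the code below.

```python
import json

EOS = "</s>"  # assumed end-of-turn marker, as seen in the generation output


def render_chat(messages):
    """Hypothetical stand-in for apply_chat_template(tokenize=False)."""
    return "".join(f"<|{m['role']}|>\n{m['content']}{EOS}\n" for m in messages)


def to_text_sketch(ex):
    resp = ex["response"]
    # Non-string responses (dicts, lists) are serialised to a JSON string.
    if not isinstance(resp, str):
        resp = json.dumps(resp, ensure_ascii=False)
    msgs = [
        {"role": "user", "content": ex["prompt"]},
        {"role": "assistant", "content": resp},
    ]
    return {"text": render_chat(msgs)}


# Invented sample record in the shape the notebook expects.
example = {
    "prompt": "Mike is 30 years old, loves hiking and works as a coder.",
    "response": {"age": "30", "gender": "male", "job": "coder", "name": "Mike"},
}

formatted = to_text_sketch(example)
print(formatted["text"])
```

Printing a few formatted rows like this before training is a cheap way to catch template or schema problems early.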

In [None]:
import json
from datasets import Dataset

with open("/content/people_data.json", "r", encoding="utf-8") as f:
    data = json.load(f)

ds = Dataset.from_list(data)

def to_text(ex):
    resp = ex["response"]
    if not isinstance(resp, str):
        resp = json.dumps(resp, ensure_ascii=False)
    msgs = [
        {"role": "user", "content": ex["prompt"]},
        {"role": "assistant", "content": resp},
    ]
    return {
        "text": tokenizer.apply_chat_template(
            msgs, tokenize=False, add_generation_prompt=False
        )
    }

dataset = ds.map(to_text, remove_columns=ds.column_names)

Map:   0%|          | 0/300 [00:00<?, ? examples/s]

### 4. Configuring LoRA Adapters (PEFT)
**Purpose:** This block sets up **Parameter-Efficient Fine-Tuning (PEFT)** using the **LoRA** (Low-Rank Adaptation) method. Instead of training the whole model, we only train a small set of "adapter" weights, which makes training much faster and saves memory.

**Key Parameters Explained:**
* **r (Rank):** Sets the inner dimension of the adapter matrices. Higher values (like 64) give the adapters more capacity to learn complex patterns but add trainable parameters and use more memory.
* **target_modules:** The layers that receive adapters: the attention projections (`q_proj`, `k_proj`, `v_proj`, `o_proj`) and the feed-forward projections (`gate_proj`, `up_proj`, `down_proj`).
* **lora_alpha:** A scaling factor for the adapters. A common rule of thumb is to set this to $2 \times r$.
* **lora_dropout:** Usually set to 0 for efficiency. Setting it higher can help prevent the model from just "memorizing" your data (overfitting).
* **use_gradient_checkpointing:** Setting this to `'unsloth'` enables Unsloth's gradient checkpointing, which drastically reduces VRAM usage by recomputing intermediate activations during the backward pass instead of storing them all.
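
The effect of `r` on model size can be checked with a short calculation. For a linear layer with `in` inputs and `out` outputs, a LoRA adapter adds two small matrices (`r x in` and `out x r`), so it contributes `r * (in + out)` trainable weights. The sketch below assumes the usual TinyLlama 1.1B layer shapes (hidden size 2048, key/value width 256, feed-forward width 5632); those dimensions are assumptions, not something this notebook prints. The layer count of 22 comes from the patching log below.

```python
R = 64
LORA_ALPHA = 2 * R
NUM_LAYERS = 22  # from the "patched 22 layers" log line

# Assumed (in_features, out_features) per projection for TinyLlama 1.1B.
HIDDEN, KV_DIM, MLP_DIM = 2048, 256, 5632
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}


def lora_params(in_features, out_features, r):
    """A is (r x in) and B is (out x r), so the adapter adds r * (in + out) weights."""
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o, R) for i, o in SHAPES.values())
total = per_layer * NUM_LAYERS

print(f"adapter weights per layer: {per_layer:,}")
print(f"adapter weights in total:  {total:,}")
print(f"scaling (lora_alpha / r):  {LORA_ALPHA / R}")
```

Under these assumptions the total comes to 50,462,720, which matches the trainable parameter count reported in the training log in section 5. Halving `r` would halve that number.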

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 64,
    target_modules=[
        'q_proj', 'k_proj', 'v_proj', 'o_proj',
        'gate_proj', 'up_proj', 'down_proj',
    ],
    lora_alpha = 64 * 2,
    lora_dropout = 0,
    bias = 'none',
    use_gradient_checkpointing = 'unsloth',
)

Unsloth 2025.12.9 patched 22 layers with 22 QKV layers, 22 O layers and 22 MLP layers.


### 5. Initializing the Trainer and Starting Training
**Purpose:** This block sets the rules for the training process and kicks off the actual fine-tuning. We are using the **SFTTrainer** (Supervised Fine-Tuning Trainer) from the TRL library, which is optimized to work with Unsloth.

**Key Training Settings:**
* **per_device_train_batch_size:** The number of training examples processed at once by the GPU. A low number (like 2) helps prevent "Out of Memory" errors.
* **gradient_accumulation_steps:** Gradients from this many batches are accumulated before the weights are updated once. With a value of 4, the effective batch size is $2 \times 4 = 8$, without the memory cost of a larger batch.
* **max_steps & num_train_epochs:** These control how long the training lasts. An "epoch" is one full pass through your data; a "step" is one update to the model's weights. When both are set, `max_steps` takes precedence, so this run stops after 60 steps (roughly 1.6 epochs) rather than completing 3 epochs.
* **learning_rate & warmup_steps:** The learning rate controls how large each update is. During warmup (the first 10 steps here) it ramps up gradually, which keeps early updates from destabilizing the model or overwriting previous knowledge too quickly. The code below does not set `learning_rate`, so the `SFTConfig` default applies.
* **optim (adamw_8bit):** The widely used AdamW optimizer with its internal state stored in 8-bit, which saves memory.
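
The warmup behaviour can be illustrated with a small pure-Python sketch. It assumes a linear warmup followed by a linear decay to zero (Transformers' default `lr_scheduler_type` is `"linear"`). The base learning rate of `2e-4` is a hypothetical value chosen for illustration; it is not taken from this notebook, which leaves `learning_rate` at its default.

```python
def linear_warmup_decay(step, base_lr, warmup_steps, total_steps):
    """Learning rate at a given step: linear ramp up, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


BASE_LR = 2e-4      # hypothetical value, for illustration only
WARMUP_STEPS = 10   # matches warmup_steps in the config below
TOTAL_STEPS = 60    # matches max_steps in the config below

for step in (0, 5, 10, 35, 60):
    lr = linear_warmup_decay(step, BASE_LR, WARMUP_STEPS, TOTAL_STEPS)
    print(f"step {step:2d}: lr = {lr:.6f}")
```

The rate peaks at step 10 and then falls steadily, so the largest updates happen early in the run, right after warmup.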

**Execution:**
The final line `trainer.train()` begins the loop where the model reads your data, makes guesses, calculates errors, and improves itself.

In [None]:
from trl import SFTTrainer, SFTConfig

trainer = SFTTrainer(
    model = model,
    train_dataset = dataset,
    tokenizer = tokenizer,
    dataset_text_field = 'text',   # column produced by to_text()
    max_seq_length = 2048,         # same limit used when loading the model
    args = SFTConfig(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,   # effective batch size: 2 x 4 = 8
        warmup_steps = 10,
        max_steps = 60,                    # takes precedence over num_train_epochs
        logging_steps = 1,
        output_dir = "outputs",
        optim = "adamw_8bit",
        num_train_epochs = 3,              # not reached: training stops at max_steps
    ),
)

trainer.train()

Unsloth: Tokenizing ["text"] (num_proc=4):   0%|          | 0/300 [00:00<?, ? examples/s]

The model is already on multiple devices. Skipping the move to device specified in `args`.


🦥 Unsloth: Padding-free auto-enabled, enabling faster training.


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 300 | Num Epochs = 2 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 50,462,720 of 1,150,511,104 (4.39% trained)


Step,Training Loss
1,2.3525
2,2.2539
3,2.1945
4,2.1665
5,2.1751
6,1.9379
7,1.9593
8,1.7413
9,1.7107
10,1.5021


TrainOutput(global_step=60, training_loss=0.9728291019797325, metrics={'train_runtime': 119.5651, 'train_samples_per_second': 4.015, 'train_steps_per_second': 0.502, 'total_flos': 269533855236096.0, 'train_loss': 0.9728291019797325, 'epoch': 1.5866666666666667})
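
Several numbers in this log can be reproduced by hand, which is a useful check that the settings did what was intended. The sketch below uses only values that appear in the configuration and the log above; the variable names are chosen for this example.

```python
import math

num_examples = 300        # "Num examples = 300"
per_device_batch = 2
grad_accum = 4
num_gpus = 1
max_steps = 60
train_runtime = 119.5651  # seconds, from TrainOutput

effective_batch = per_device_batch * grad_accum * num_gpus
steps_per_epoch = num_examples / effective_batch
epochs_covered = max_steps / steps_per_epoch
epochs_started = math.ceil(epochs_covered)
samples_per_second = max_steps * effective_batch / train_runtime

print(f"effective batch size: {effective_batch}")
print(f"steps per epoch:      {steps_per_epoch}")
print(f"epochs covered:       {epochs_covered:.2f}")
print(f"epochs started:       {epochs_started}")
print(f"samples per second:   {samples_per_second:.3f}")
```

The run covers about 1.6 passes over the data, which is why the banner shows `Num Epochs = 2` even though `num_train_epochs = 3` was requested. The `epoch` value in `TrainOutput` (about 1.59) is slightly lower than this estimate, presumably because of how the trainer counts the partial final batch of each epoch.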

### 6. Inference and Testing
**Purpose:** This block is used to test the fine-tuned model. We provide a prompt (input) and see how the model responds based on its new training.

**Key Components:**
* **FastLanguageModel.for_inference(model):** This is a specific Unsloth command that optimizes the model for speed (2x faster inference) and lowers memory usage once training is finished.
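* **tokenizer.apply_chat_template:** This converts your list of role/content messages into the chat format the model was trained on and, with `tokenize=True`, into token IDs. `add_generation_prompt=True` appends the assistant marker so the model knows it should start answering.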
* **model.generate:** This is the core command that produces the response.
    * **max_new_tokens:** Limits the length of the model's answer.
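    * **temperature:** Controls randomness. Lower values (0.1) make the output more deterministic and repetitive; higher values (0.7+) make it more varied.
    * **do_sample & top_p:** `do_sample=True` makes the model sample from its probability distribution instead of always picking the most likely token. `top_p=0.9` restricts sampling to the smallest set of tokens whose combined probability reaches 90%, which filters out unlikely tokens.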
* **tokenizer.batch_decode:** Converts the model's numerical output back into human-readable text.
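
How `temperature` and `top_p` interact is easier to see on a toy example. The sketch below is a conceptual illustration in plain Python, not the Transformers implementation: the function names and the four-token "vocabulary" of logits are invented for this example.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their combined probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}  # renormalised


def sample(logits, temperature=0.7, top_p=0.9, rng=random):
    probs = softmax_with_temperature(logits, temperature)
    candidates = top_p_filter(probs, top_p)
    ids = list(candidates)
    return rng.choices(ids, weights=[candidates[i] for i in ids], k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -1.0]  # invented scores for four tokens
sharp = softmax_with_temperature(toy_logits, 0.1)
soft = softmax_with_temperature(toy_logits, 0.7)

print("T=0.1:", [round(p, 3) for p in sharp])
print("T=0.7:", [round(p, 3) for p in soft])
print("tokens kept at T=0.7, top_p=0.9:", sorted(top_p_filter(soft, 0.9)))
print("sampled token id:", sample(toy_logits, rng=random.Random(0)))
```

At a temperature of 0.1 almost all probability sits on the top token, so sampling behaves nearly like greedy decoding. At 0.7 the distribution is flatter, and `top_p` then trims the unlikely tail before a token is drawn.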

In [None]:
FastLanguageModel.for_inference(model)

messages = [
    {
        "role": "user",
        "content": "Mike is 30 years old, loves hiking and works as a coder."
    },
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to("cuda")

outputs = model.generate(
    input_ids=inputs,
    max_new_tokens=512,
    use_cache=True,
    temperature=0.7,
    do_sample=True,
    top_p=0.9,
)

response = tokenizer.batch_decode(outputs)[0]

print(response)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


<|user|>
Mike is 30 years old, loves hiking and works as a coder.</s> 
<|assistant|>
{"age": "30", "gender": "male", "job": "coder", "name": "Mike"}</s>


### 7. Exporting to GGUF (Local Deployment)
**Purpose:** This block converts and saves your fine-tuned model into the **GGUF** format. GGUF is the file format used by llama.cpp and the tools built on it, so the exported model can run locally on your own computer's CPU or GPU without needing a complex Python environment.

**Key Parameters Explained:**
* **"gguf_model_scratch_fixed":** This is the name of the folder where your model files will be saved.
* **quantization_method = "q4_k_m":** This refers to the specific way the model is compressed.
    * **Q4** means the weights are stored at roughly 4 bits each (small and fast).
    * **K_M** refers to llama.cpp's "k-quant" family of quantization schemes in its medium (M) variant. It is widely considered the "Goldilocks" setting because it offers a good balance between small file size and output quality.
* **maximum_memory_usage = 0.3:** This is a safety setting. It tells the computer to only use 30% of your available memory during the conversion process. This prevents "Out of Memory" (OOM) crashes while the model is being saved.

In [None]:
model.save_pretrained_gguf("gguf_model_scratch_fixed", tokenizer, quantization_method="q4_k_m", maximum_memory_usage = 0.3)

Unsloth: Merging model weights to 16-bit format...


config.json:   0%|          | 0.00/754 [00:00<?, ?B/s]

Found HuggingFace hub cache directory: /root/.cache/huggingface/hub
Checking cache directory for required files...
Cache check failed: model.safetensors not found in local cache.
Not all required files found in cache. Will proceed with downloading.
Checking cache directory for required files...
Cache check failed: tokenizer.model not found in local cache.
Not all required files found in cache. Will proceed with downloading.


Unsloth: Preparing safetensor model files:   0%|          | 0/1 [00:00<?, ?it/s]

model.safetensors:   0%|          | 0.00/2.20G [00:00<?, ?B/s]

Unsloth: Preparing safetensor model files: 100%|██████████| 1/1 [00:41<00:00, 41.55s/it]
Unsloth: Merging weights into 16bit: 100%|██████████| 1/1 [00:44<00:00, 44.90s/it]


Unsloth: Merge process complete. Saved to `/content/gguf_model_scratch_fixed`
Unsloth: Converting to GGUF format...
==((====))==  Unsloth: Conversion from HF to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF f16 might take 3 minutes.
\        /    [2] Converting GGUF f16 to ['q4_k_m'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: Updating system package directories
Unsloth: All required system packages already installed!
Unsloth: Install llama.cpp and building - please wait 1 to 3 minutes
Unsloth: Cloning llama.cpp repository
Unsloth: Install GGUF and other packages
Unsloth: Successfully installed llama.cpp!
Unsloth: Preparing converter script...
Unsloth: [1] Converting model into f16 GGUF format.
This might take 3 minutes...
Unsloth: Initial conversion completed! Files: ['tinyllama-chat.F16.gguf']


{'save_directory': 'gguf_model_scratch_fixed',
 'gguf_files': ['tinyllama-chat.Q4_K_M.gguf'],
 'modelfile_location': '/content/Modelfile',
 'want_full_precision': False,
 'is_vlm': False,
 'fix_bos_token': False}
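
A quick way to sanity-check an exported file is to read its header. The sketch below assumes the GGUF layout used by version 2 and later (a 4-byte `GGUF` magic, a 32-bit version, then 64-bit tensor and metadata counts, all little-endian). The helper name `read_gguf_header` is made up for this example, and the demo runs on a tiny synthetic header with invented counts rather than on the real export, so that it works anywhere.

```python
import os
import struct
import tempfile


def read_gguf_header(path):
    """Read the fixed-size GGUF header (assumes GGUF version 2 or later)."""
    with open(path, "rb") as f:
        header = f.read(24)
    if len(header) < 24 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    version, tensor_count, kv_count = struct.unpack("<IQQ", header[4:24])
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


# Synthetic stand-in file; the counts (201 tensors, 23 metadata keys) are invented.
with tempfile.NamedTemporaryFile(suffix=".gguf", delete=False) as tmp:
    tmp.write(b"GGUF" + struct.pack("<IQQ", 3, 201, 23))
    demo_path = tmp.name

info = read_gguf_header(demo_path)
os.remove(demo_path)
print(info)
```

To check the real export, pass the path of `tinyllama-chat.Q4_K_M.gguf` instead of `demo_path`. The exact directory depends on where Unsloth wrote the file, so locate it first.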