# 🤖 Pranav's Personal AI Assistant - Fine-tuning on Google Colab

**Train an AI model on your personal profile, projects, and knowledge!**

This notebook fine-tunes the **Llama 3.2 3B Instruct** model on Pranav's profile information, including:
- FTC/FRC robotics projects (Evergreen Dragons, Team 2854 Prototypes)
- DIY projects (Robotic Hand, Sim Racing Wheel, Robotic Arm)
- Technical preferences and skills
- General knowledge capabilities

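**Hardware Requirements:** Runs comfortably on the Google Colab free tier (Tesla T4 GPU with 16 GB of VRAM, about 14.5 GB of it usable, and roughly 12 GB of system RAM). The run recorded in this notebook peaked at about 3.1 GB of GPU memory.

To run this, open the *Runtime* menu and choose *Run all* on a **free** Tesla T4 Google Colab instance!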

<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://unsloth.ai/docs/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>
</div>

## 📋 Table of Contents
1. [Installation](#Installation)
2. [Load Model](#Model)
3. [Upload Your Dataset](#Dataset)
4. [Train](#Train)
5. [Test the Model](#Inference)
6. [Save & Export](#Save)

<a name="Installation"></a>
## 🔧 Installation

Install Unsloth and required dependencies. This takes about 1-2 minutes.

In [1]:
%%capture
import os, re
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth  # Do this in local & cloud setups
else:
    import torch; v = re.match(r'[\d]{1,}\.[\d]{1,}', str(torch.__version__)).group(0)
    xformers = 'xformers==' + {'2.10':'0.0.34','2.9':'0.0.33.post1','2.8':'0.0.32.post2'}.get(v, "0.0.34")
    !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer
    !pip install --no-deps unsloth_zoo bitsandbytes accelerate {xformers} peft trl triton unsloth
!pip install transformers==4.56.2
!pip install --no-deps trl==0.22.2

<a name="Model"></a>
## 🧠 Load Llama 3.2 3B Model

We use **Llama 3.2 3B Instruct** - perfect for 8GB VRAM! It's:
- Fast and efficient
- Great for conversational AI
- Supports 4-bit quantization for low memory usage
- Strong general knowledge + ability to learn your specific information

In [2]:
from unsloth import FastLanguageModel
import torch

max_seq_length = 2048  # Context window size
dtype = None  # Auto-detect: Float16 for T4, Bfloat16 for newer GPUs
load_in_4bit = True  # Use 4-bit quantization to fit in 8GB VRAM

print("🚀 Loading Llama 3.2 3B Instruct...")

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

print("✅ Model loaded successfully!")

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
🚀 Loading Llama 3.2 3B Instruct...
==((====))==  Unsloth 2026.2.1: Fast Llama patching. Transformers: 4.56.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.563 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.10.0+cu128. CUDA: 7.5. CUDA Toolkit: 12.8. Triton: 3.6.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.34. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


✅ Model loaded successfully!


## 🎯 Add LoRA Adapters

LoRA (Low-Rank Adaptation) freezes the base model weights and trains small low-rank adapter matrices instead, typically 1-10% of the total parameter count (about 1.5% in this run).
This saves memory and training time while maintaining quality!

In [3]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 32,  # LoRA rank - higher = more capacity (using 32 for good quality)
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 32,
    lora_dropout = 0,  # 0 is optimized
    bias = "none",
    use_gradient_checkpointing = "unsloth",  # Saves 30% VRAM!
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)

print("✅ LoRA adapters added!")

Unsloth 2026.2.1 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.


✅ LoRA adapters added!
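
The training banner further down reports 48,627,712 trainable parameters. The sketch below shows where a number like that can come from: a rank-`r` adapter on a linear layer adds `r * (in_features + out_features)` weights. The layer dimensions are assumptions taken from the published Llama 3.2 3B configuration (hidden size 3072, MLP size 8192, key/value projection width 1024, 28 layers); they are not printed anywhere in this notebook, and `count_lora_params` is an illustrative helper, not an Unsloth function.

```python
# Assumed Llama 3.2 3B dimensions (not shown in this notebook).
HIDDEN = 3072        # model width
INTERMEDIATE = 8192  # MLP width
KV_DIM = 1024        # key/value projection width (8 KV heads x 128)
NUM_LAYERS = 28      # matches "patched 28 layers" in the output above

# (in_features, out_features) for each module listed in target_modules.
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, rank):
    """A rank-r adapter holds A (rank x in) and B (out x rank)."""
    return rank * (in_features + out_features)


def count_lora_params(rank):
    """Total adapter weights across all target modules and layers."""
    per_layer = sum(lora_params(i, o, rank) for i, o in MODULE_SHAPES.values())
    return per_layer * NUM_LAYERS


total = count_lora_params(rank=32)
print(f"{total:,} trainable LoRA parameters")
```

Under these assumptions the total matches the banner, and halving `r` to 16 would halve the adapter size.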


<a name="Dataset"></a>
## 📂 Upload Your Dataset

**Option 1: Upload the dataset file**

Run the cell below and upload `pranav_profile_qa.jsonl` when prompted.

In [4]:
from google.colab import files
import os

print("📤 Upload your pranav_profile_qa.jsonl file:")
uploaded = files.upload()

# files.upload() writes each file into the working directory under its original name
dataset_path = "pranav_profile_qa.jsonl"
if dataset_path in uploaded:
    print("✅ Dataset uploaded successfully!")
    print(f"   File size: {len(uploaded[dataset_path]) / 1024:.2f} KB")
else:
    print("⚠️  Please upload pranav_profile_qa.jsonl")

📤 Upload your pranav_profile_qa.jsonl file:


Saving pranav_profile_qa.jsonl to pranav_profile_qa.jsonl
✅ Dataset uploaded successfully!
   File size: 63.50 KB


**Option 2: Download from GitHub**

If you have the dataset in a GitHub repo, uncomment and run:
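
A minimal sketch of such a download step, assuming the file is reachable at a raw URL. `download_dataset` is an illustrative helper and the URL in the comment is a hypothetical placeholder; neither comes from the original notebook.

```python
import urllib.request
from pathlib import Path


def download_dataset(url, dest="pranav_profile_qa.jsonl"):
    """Fetch `url` and write its bytes to `dest`; return the local path."""
    dest_path = Path(dest)
    with urllib.request.urlopen(url) as response:
        dest_path.write_bytes(response.read())
    return dest_path


# Hypothetical placeholder: replace with the raw URL of your own file.
# RAW_URL = "https://raw.githubusercontent.com/<user>/<repo>/<branch>/pranav_profile_qa.jsonl"
# dataset_path = str(download_dataset(RAW_URL))
```

If you use this path instead of the upload cell, make sure `dataset_path` is set before loading the dataset.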

## 📊 Load and Prepare Dataset

We'll load the JSONL dataset and convert it to Llama 3's chat format.

In [6]:
from datasets import load_dataset
from unsloth.chat_templates import get_chat_template

# Set up chat template for Llama 3.2
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",  # Llama 3.2 uses 3.1 format
)

# Load the dataset
print("📖 Loading dataset...")
dataset = load_dataset("json", data_files=dataset_path, split="train")

print(f"✅ Loaded {len(dataset)} training examples")
print(f"\nExample fields: {dataset.column_names}")
print("\nFirst example:")
print(dataset[0])

📖 Loading dataset...

✅ Loaded 235 training examples

Example fields: ['instruction', 'input', 'output']

First example:
{'instruction': "Answer using only Pranav's saved profile facts. If unknown, say you do not have that detail.", 'input': 'What FRC team did he join?', 'output': 'He joined FRC Team 2854 Prototypes and described himself as mechanical-focused.'}
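
Each line of the JSONL file is expected to be one JSON object with the three string fields shown above. A small pre-flight check can catch malformed rows before training; this is a sketch, and `validate_jsonl` is an illustrative helper rather than part of `datasets` or Unsloth.

```python
import json

REQUIRED_FIELDS = ("instruction", "input", "output")


def validate_jsonl(text):
    """Return a list of (line_number, problem) tuples for a JSONL string."""
    problems = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as error:
            problems.append((line_number, f"invalid JSON: {error.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((line_number, "not a JSON object"))
            continue
        for field in REQUIRED_FIELDS:
            if not isinstance(record.get(field), str):
                problems.append((line_number, f"missing or non-string field: {field}"))
    return problems


sample = json.dumps({
    "instruction": "Answer using only Pranav's saved profile facts. "
                   "If unknown, say you do not have that detail.",
    "input": "What FRC team did he join?",
    "output": "He joined FRC Team 2854 Prototypes and described himself as mechanical-focused.",
})
print(validate_jsonl(sample))
```

To check the uploaded file, pass its contents, for example `validate_jsonl(open(dataset_path, encoding="utf-8").read())`.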


In [7]:
# Format function to convert to chat format
def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs = examples["input"]
    outputs = examples["output"]

    texts = []
    for instruction, user_input, output in zip(instructions, inputs, outputs):
        # Create conversation with system instruction
        messages = [
            {"role": "system", "content": instruction},
            {"role": "user", "content": user_input},
            {"role": "assistant", "content": output}
        ]

        # Apply chat template
        text = tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=False
        )
        texts.append(text)

    return {"text": texts}

# Apply formatting
print("🔄 Formatting dataset for training...")
dataset = dataset.map(formatting_prompts_func, batched=True)

print("✅ Dataset formatted!")
print("\nFormatted example (first 500 chars):")
print(dataset[0]["text"][:500] + "...")

🔄 Formatting dataset for training...

✅ Dataset formatted!

Formatted example (first 500 chars):
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 July 2024

Answer using only Pranav's saved profile facts. If unknown, say you do not have that detail.<|eot_id|><|start_header_id|>user<|end_header_id|>

What FRC team did he join?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

He joined FRC Team 2854 Prototypes and described himself as mechanical-focused.<|eot_id|>...


<a name="Train"></a>
## 🏋️ Train the Model

Now let's fine-tune! The run recorded below took just under 10 minutes on a T4 for 235 examples; expect longer with a larger dataset or more steps.

**Training Settings:**
- Batch size: 2 per device (peak reserved GPU memory in this run was about 3.1 GB)
- Gradient accumulation: 4 steps (effective batch size of 8)
- Steps: 300, about 10 passes over the 235 examples (increase for more training)
- Learning rate: 2e-4 (a common starting point for LoRA)
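
The relationship between these settings and the number of epochs can be worked out with a few lines of arithmetic. This is an approximation of how the trainer counts optimizer steps, written as an illustrative helper (`training_schedule` is not a TRL or Unsloth function); the inputs are the values used in this notebook.

```python
import math


def training_schedule(num_examples, per_device_batch, grad_accum, max_steps, num_gpus=1):
    """Return (effective batch size, optimizer steps per epoch, epochs covered)."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    epochs = max_steps / steps_per_epoch
    return effective_batch, steps_per_epoch, epochs


effective_batch, steps_per_epoch, epochs = training_schedule(
    num_examples=235, per_device_batch=2, grad_accum=4, max_steps=300
)
print(f"effective batch size: {effective_batch}")
print(f"optimizer steps per epoch: {steps_per_epoch}")
print(f"epochs covered by max_steps: {epochs}")
```

With 235 examples this gives 30 steps per epoch, so 300 steps is 10 epochs, which agrees with the `Num Epochs = 10` line in the training banner below. Doubling `max_steps` would mean roughly 20 passes over the same data.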

In [8]:
from trl import SFTConfig, SFTTrainer
from transformers import DataCollatorForSeq2Seq

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    packing = False,  # Can enable for speed with short sequences
    args = SFTConfig(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 10,
        max_steps = 300,  # Increase to 600+ for stronger learning
        learning_rate = 2e-4,
        logging_steps = 10,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none",
    ),
)

print("✅ Trainer configured!")

✅ Trainer configured!


### Train Only on Assistant Responses

We'll mask the system and user turns so the loss is computed only on the assistant responses.

In [9]:
from unsloth.chat_templates import train_on_responses_only

trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
    response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)

print("✅ Training will focus on assistant responses only!")

✅ Training will focus on assistant responses only!
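
To make the masking idea concrete, here is a toy sketch on plain integer lists. It illustrates the concept only and is not Unsloth's implementation: the token IDs and marker sequences are made up, and `mask_non_response` is a hypothetical helper. Labels set to `-100` are skipped by PyTorch's cross-entropy loss by default, so only tokens inside assistant turns contribute to training.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def find_subsequence(haystack, needle, start=0):
    """Return the first index >= start where needle occurs, or -1."""
    for i in range(start, len(haystack) - len(needle) + 1):
        if haystack[i:i + len(needle)] == needle:
            return i
    return -1


def mask_non_response(input_ids, instruction_marker, response_marker):
    """Build labels that keep only the tokens inside assistant turns."""
    labels = [IGNORE_INDEX] * len(input_ids)
    position = 0
    while True:
        marker_at = find_subsequence(input_ids, response_marker, position)
        if marker_at == -1:
            break
        begin = marker_at + len(response_marker)
        # The assistant turn runs until the next user header or the end.
        end = find_subsequence(input_ids, instruction_marker, begin)
        if end == -1:
            end = len(input_ids)
        labels[begin:end] = input_ids[begin:end]
        position = end
    return labels


# Made-up IDs: [1, 2] stands for the user header, [3, 4] for the assistant header.
toy_ids = [9, 1, 2, 10, 11, 3, 4, 20, 21, 22]
print(mask_non_response(toy_ids, instruction_marker=[1, 2], response_marker=[3, 4]))
```

In the toy sequence only the last three tokens keep their IDs as labels; the system text, the user question, and both headers are masked.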


In [10]:
# Check memory before training
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"🎮 GPU: {gpu_stats.name}")
print(f"💾 Max memory: {max_memory} GB")
print(f"📊 Memory reserved: {start_gpu_memory} GB")

🎮 GPU: Tesla T4
💾 Max memory: 14.563 GB
📊 Memory reserved: 3.07 GB


In [11]:
# Start training!
print("🚀 Starting training...")
print("⏰ This will take approximately 15-30 minutes...")
print("=" * 50)

trainer_stats = trainer.train()

print("=" * 50)
print("✅ Training complete!")

🚀 Starting training...
⏰ This will take approximately 15-30 minutes...


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 235 | Num Epochs = 10 | Total steps = 300
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 48,627,712 of 3,261,377,536 (1.49% trained)


| Step | Training Loss |
|------|---------------|
| 10 | 3.9635 |
| 20 | 2.6889 |
| 30 | 1.6023 |
| 40 | 0.519 |
| 50 | 0.3108 |
| 60 | 0.198 |
| 70 | 0.18 |
| 80 | 0.0457 |
| 90 | 0.0161 |
| 100 | 0.0511 |


Unsloth: Will smartly offload gradients to save VRAM!
✅ Training complete!


In [12]:
# Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)

print("📊 Training Statistics:")
print(f"⏱️  Time: {round(trainer_stats.metrics['train_runtime']/60, 2)} minutes")
print(f"💾 Peak memory: {used_memory} GB ({used_percentage}% of {max_memory} GB)")
print(f"🎯 Memory for training: {used_memory_for_lora} GB ({lora_percentage}%)")

📊 Training Statistics:
⏱️  Time: 9.69 minutes
💾 Peak memory: 3.141 GB (21.568% of 14.563 GB)
🎯 Memory for training: 0.071 GB (0.488%)


<a name="Inference"></a>
## 🎯 Test Your Model!

Let's test the trained model with questions about Pranav's profile and projects.

In [13]:
from transformers import TextStreamer

# Enable fast inference
FastLanguageModel.for_inference(model)

print("✅ Model ready for inference!")
print("🎯 Let's test with some questions about Pranav's profile...\n")

✅ Model ready for inference!
🎯 Let's test with some questions about Pranav's profile...
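
The training examples were formatted with a system message followed by the user question, while the test cells in this section send only a user message. If you want inference prompts to mirror the training layout, a helper along these lines may be useful. This is a sketch: `build_messages` is an illustrative name, and the system prompt is copied from the first dataset example shown earlier (other rows may use different instructions).

```python
# Copied from the first dataset example; other rows may differ.
SYSTEM_PROMPT = (
    "Answer using only Pranav's saved profile facts. "
    "If unknown, say you do not have that detail."
)


def build_messages(question, system_prompt=SYSTEM_PROMPT):
    """Return a chat message list in the same role order used for training."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": question})
    return messages


profile_messages = build_messages("What FRC team did he join?")
general_messages = build_messages("What is the capital of France?", system_prompt=None)
print(profile_messages)
print(general_messages)
```

The returned list can be passed to `tokenizer.apply_chat_template` in place of the hand-written `messages` list. Passing `system_prompt=None` keeps the user-only format for general-knowledge questions, where a profile-only instruction would likely push the model to decline.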



### Test 1: FTC Team Information

In [14]:
messages = [
    {"role": "user", "content": "What FTC team is Pranav on and what are his leadership goals?"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True,
    return_tensors = "pt",
).to("cuda")

text_streamer = TextStreamer(tokenizer, skip_prompt = True)
print("Question: What FTC team is Pranav on and what are his leadership goals?\n")
print("Answer: ", end="")
_ = model.generate(
    input_ids = inputs,
    streamer = text_streamer,
    max_new_tokens = 128,
    use_cache = True,
    temperature = 0.7,
    top_p = 0.9
)
print("\n" + "="*70 + "\n")

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Question: What FTC team is Pranav on and what are his leadership goals?

Answer: He is on FTC team Evergreen Dragons and his leadership goals are mechanical lead by 10th grade and team captain by 11th grade.<|eot_id|>




### Test 2: DIY Projects

In [15]:
messages = [
    {"role": "user", "content": "Tell me about Pranav's sim racing steering wheel build."},
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True,
    return_tensors = "pt",
).to("cuda")

text_streamer = TextStreamer(tokenizer, skip_prompt = True)
print("Question: Tell me about Pranav's sim racing steering wheel build.\n")
print("Answer: ", end="")
_ = model.generate(
    input_ids = inputs,
    streamer = text_streamer,
    max_new_tokens = 150,
    use_cache = True,
    temperature = 0.7,
    top_p = 0.9
)
print("\n" + "="*70 + "\n")

Question: Tell me about Pranav's sim racing steering wheel build.

Answer: He built a custom steering wheel using an Arduino Leonardo, a BTS7960 driver, and 28BYJ-48 steppers, with a budget of around $200.<|eot_id|>




### Test 3: Technical Preferences

In [16]:
messages = [
    {"role": "user", "content": "What CAD software does Pranav use and what's his coding preference?"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True,
    return_tensors = "pt",
).to("cuda")

text_streamer = TextStreamer(tokenizer, skip_prompt = True)
print("Question: What CAD software does Pranav use and what's his coding preference?\n")
print("Answer: ", end="")
_ = model.generate(
    input_ids = inputs,
    streamer = text_streamer,
    max_new_tokens = 128,
    use_cache = True,
    temperature = 0.7,
    top_p = 0.9
)
print("\n" + "="*70 + "\n")

Question: What CAD software does Pranav use and what's his coding preference?

Answer: He uses SolidWorks for CAD and prefers Python as his coding language.<|eot_id|>




### Test 4: General Knowledge (Not in Profile)

In [17]:
messages = [
    {"role": "user", "content": "What is the capital of France?"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True,
    return_tensors = "pt",
).to("cuda")

text_streamer = TextStreamer(tokenizer, skip_prompt = True)
print("Question: What is the capital of France?\n")
print("Answer: ", end="")
_ = model.generate(
    input_ids = inputs,
    streamer = text_streamer,
    max_new_tokens = 100,
    use_cache = True,
    temperature = 0.7,
    top_p = 0.9
)
print("\n" + "="*70 + "\n")

Question: What is the capital of France?

Answer: The capital of France is Paris.<|eot_id|>




### 🎮 Try Your Own Questions!

Modify the question below to test any topic:

In [18]:
# ✏️ Edit the question here:
custom_question = "What motors does Pranav use in his sim racing wheel?"

messages = [
    {"role": "user", "content": custom_question},
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True,
    return_tensors = "pt",
).to("cuda")

text_streamer = TextStreamer(tokenizer, skip_prompt = True)
print(f"Question: {custom_question}\n")
print("Answer: ", end="")
_ = model.generate(
    input_ids = inputs,
    streamer = text_streamer,
    max_new_tokens = 150,
    use_cache = True,
    temperature = 0.7,
    top_p = 0.9
)
print("\n")

Question: What motors does Pranav use in his sim racing wheel?

Answer: His wheel uses a BTS7960 motor driver and 12V brushed planetary gearmotors around the 312 RPM output class.<|eot_id|>




<a name="Save"></a>
## 💾 Save Your Model

Now let's save the fine-tuned model! You have several options:

### Option 1: Save LoRA Adapters (Small, ~200MB)

This saves only the trained adapter weights. You'll need the base model + adapters to run it.

In [19]:
print("💾 Saving LoRA adapters...")

model.save_pretrained("pranav_assistant_lora")
tokenizer.save_pretrained("pranav_assistant_lora")

print("✅ LoRA adapters saved to 'pranav_assistant_lora' folder!")
print("📦 You can download this folder from the Colab file browser (left sidebar)")

💾 Saving LoRA adapters...
✅ LoRA adapters saved to 'pranav_assistant_lora' folder!
📦 You can download this folder from the Colab file browser (left sidebar)
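
The size estimates quoted for the save options (about 200 MB for adapters, about 6 GB for a merged 16-bit model) can be sanity-checked with simple arithmetic. The sketch below rests on two assumptions that this notebook does not confirm: that the adapter weights are stored as 32-bit floats, and that the base model has roughly 3.21 billion parameters. It also ignores tokenizer and config files.

```python
ADAPTER_PARAMS = 48_627_712   # trainable parameters reported in the training banner
BASE_PARAMS = 3_210_000_000   # assumed approximate size of Llama 3.2 3B


def size_in_bytes(num_params, bytes_per_param):
    """Raw weight storage, ignoring file headers and metadata."""
    return num_params * bytes_per_param


adapter_megabytes = size_in_bytes(ADAPTER_PARAMS, 4) / 1e6   # float32 assumed
merged_gigabytes = size_in_bytes(BASE_PARAMS, 2) / 1e9       # 16-bit weights

print(f"LoRA adapters: about {adapter_megabytes:.0f} MB")
print(f"Merged 16-bit model: about {merged_gigabytes:.1f} GB")
```

Both figures land close to the sizes quoted in the option headings, which suggests those estimates are reasonable for this configuration.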


### Option 2: Upload to Hugging Face Hub (Recommended!)

Upload to Hugging Face so you can use it anywhere. Get your token at: https://huggingface.co/settings/tokens

In [20]:
# Uncomment and fill in your details:
# HF_USERNAME = "your-username"  # Your Hugging Face username
# HF_TOKEN = "hf_..."  # Your Hugging Face token (avoid committing or sharing a notebook with a real token in it)

# model.push_to_hub(f"{HF_USERNAME}/pranav-assistant-3b-lora", token=HF_TOKEN)
# tokenizer.push_to_hub(f"{HF_USERNAME}/pranav-assistant-3b-lora", token=HF_TOKEN)

# print(f"‚úÖ Model uploaded to: https://huggingface.co/{HF_USERNAME}/pranav-assistant-3b-lora")

### Option 3: Save Merged 16-bit Model (Larger, ~6GB)

This merges LoRA weights into the base model for standalone use.

In [21]:
# Uncomment to save merged model:
# print("💾 Saving merged 16-bit model (this may take a few minutes)...")
# model.save_pretrained_merged("pranav_assistant_merged_16bit", tokenizer, save_method="merged_16bit")
# print("✅ Merged model saved!")

### Option 4: Save as GGUF for llama.cpp / Ollama

GGUF format works with llama.cpp, Ollama, LM Studio, and other tools.

In [22]:
# Uncomment to save as GGUF (Q8_0 quantization):
# print("💾 Saving GGUF model...")
# model.save_pretrained_gguf("pranav_assistant", tokenizer, quantization_method="q8_0")
# print("✅ GGUF model saved!")

# For better compression, try q4_k_m:
# model.save_pretrained_gguf("pranav_assistant", tokenizer, quantization_method="q4_k_m")

### 📥 Download Your Model

To download from Colab:
1. Click the folder icon 📁 in the left sidebar
2. Find your saved model folder
3. Right-click → Download

Or use this code to zip and download:

In [23]:
# Uncomment to zip and download:
# !zip -r pranav_assistant_lora.zip pranav_assistant_lora/
# from google.colab import files
# files.download("pranav_assistant_lora.zip")

## 🔄 Reload Your Model Later

To use your saved model in a new session:

In [24]:
# Uncomment to reload from saved LoRA:
# from unsloth import FastLanguageModel
#
# model, tokenizer = FastLanguageModel.from_pretrained(
#     model_name = "pranav_assistant_lora",  # Your saved folder
#     max_seq_length = 2048,
#     dtype = None,
#     load_in_4bit = True,
# )
# FastLanguageModel.for_inference(model)
#
# # Now you can use it for inference!

## 🎉 Congratulations!

You've successfully trained a personalized AI assistant! Here's what you accomplished:

✅ Fine-tuned Llama 3.2 3B on your personal profile  
✅ Trained on a free Colab T4 with about 3.1 GB of peak GPU memory  
✅ Model learns your projects, preferences, and background  
✅ Maintains general knowledge capabilities  
✅ Saved for future use  

### 🚀 Next Steps:

1. **Tune Training Length**: Change `max_steps` to 600-1000 if the model still misses facts, but note that the logged loss was already below 0.1 by step 80, so extra steps on the same 235 examples may mostly add memorization
2. **Add More Data**: Expand your dataset with more details and examples
3. **Deploy It**: Use on Ollama, LM Studio, or your own server
4. **Share It**: Upload to Hugging Face for others to try

### 📚 Resources:

- **Unsloth Docs**: https://unsloth.ai/docs/
- **Discord Community**: https://discord.gg/unsloth
- **GitHub**: https://github.com/unslothai/unsloth

### 💡 Tips for Better Results:

- More diverse examples = better generalization
- Include both factual Q&A and conversational examples
- Test regularly and add edge cases to your dataset
- For specialized knowledge, increase training steps

<div class="align-center">
  <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>

  ⭐️ If this helped you, star <a href="https://github.com/unslothai/unsloth">Unsloth on GitHub</a>! ⭐️
</div>