# TinyLlama Sentiment Fine-Tuning Guide

## Overview
This notebook walks through fine-tuning the TinyLlama language model on the IMDB movie review dataset so it can perform sentiment analysis. The workflow stays fully local, making it practical for laptops or desktops without high-end GPUs.

## Objectives
- Prepare the IMDB dataset for supervised fine-tuning
- Apply LoRA so only a small fraction of the model parameters are updated
- Train and evaluate the adapted model on local hardware
- Compare the model before and after fine-tuning to verify the improvement

## Notebook Roadmap
1. Environment setup: install the libraries and import the components needed later
2. Data loading: fetch the IMDB sentiment dataset and format it for instruction-style prompts
3. Baseline run: query a model that has not been fine-tuned (Llama 3.2 via Ollama) to capture behaviour before fine-tuning
4. Model preparation: configure TinyLlama and attach LoRA adapters
5. Training loop: fine-tune on a small subset of labeled reviews
6. Inference checks: run the adapted model on a review it has not seen
7. Result review: contrast baseline and fine-tuned predictions

## Concepts to Know
- LoRA (Low-Rank Adaptation) updates small adapter matrices instead of full model weights
- Instruction tuning frames each training example as an instruction-response pair to guide the model
- The IMDB dataset contains 50,000 labeled movie reviews split into positive and negative examples
- CPU-only training works, but expect longer runtimes than on a GPU

## Toolkit
- Model: TinyLlama (1.1B parameters) for a manageable yet capable baseline
- Training stack: Hugging Face Transformers, TRL (SFTTrainer), and PEFT for adapter management
- Hardware: runs on CPU; the recorded run trains on CPU even when a GPU is detected, because the notebook treats 4 GB of VRAM as too little for training
- Outputs: LoRA adapter files in the 10–50 MB range instead of multi-gigabyte checkpoints

## When This Notebook Helps
- You need a reproducible example of lightweight fine-tuning
- You plan to adapt language models for sentiment or classification tasks
- You want a reference workflow for running production-style experiments on personal hardware
- You are documenting a local ML project for classmates or teammates

## 1. Environment Setup
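We need `transformers` and `datasets` for models and data, `peft` and `trl` for adapter training, `bitsandbytes` for optional quantization, `accelerate` for device placement, and `ollama` for the baseline comparison.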

### What are we installing?
- **transformers**: Hugging Face library for loading and managing transformer models
- **datasets**: Library for loading and processing datasets like IMDB
- **peft**: Parameter-Efficient Fine-Tuning library for LoRA adapters
- **bitsandbytes**: Enables 4-bit quantization to reduce memory usage
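- **ollama**: Python client for a locally running Ollama server (used for baseline testing); the server itself is installed separately
- **trl**: Transformer Reinforcement Learning library, which provides `SFTTrainer`
- **accelerate**: Device placement plus multi-GPU and distributed training support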
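
Several warnings in the recorded outputs (for example the `torch_dtype` deprecation notice) depend on which library versions are installed, so it can help to record them next to the results. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of the notebook.

```python
from importlib.metadata import PackageNotFoundError, version

# Packages named in the install cell of this notebook.
PACKAGES = ["transformers", "datasets", "peft", "bitsandbytes", "ollama", "trl", "accelerate"]


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions(PACKAGES).items():
        print(f"{name:15s} {ver or 'not installed'}")
```

Saving this listing with the notebook makes it easier to tell whether a later behaviour change comes from the code or from a library upgrade.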


In [1]:
# Install necessary libraries
!pip install -q transformers datasets peft bitsandbytes ollama trl accelerate




In [2]:
import os
import torch
import ollama
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer

# Suppress excessive warnings
import warnings
warnings.filterwarnings('ignore')

### Core Imports Explained
- **os, torch**: Environment access and the PyTorch tensor/GPU runtime
- **ollama**: Client for querying the Ollama inference engine
- **load_dataset**: Loads datasets from Hugging Face Hub
- **AutoModelForCausalLM, AutoTokenizer**: Auto-detect and load the appropriate model/tokenizer
- **BitsAndBytesConfig**: Configuration for 4-bit quantization
- **TrainingArguments, SFTTrainer**: Core training components
- **LoraConfig, PeftModel**: LoRA (Low-Rank Adaptation) configuration and utilities


## 2. Load IMDB Dataset
We load the IMDB dataset from Hugging Face, which contains movie reviews with sentiment labels.

In [14]:
try:
    # Load IMDB dataset from Hugging Face
    dataset = load_dataset("imdb", split="train[:100]")  # Use only 100 samples for memory efficiency
    print(f"Dataset Loaded. Size: {len(dataset)} samples")
    print("Sample Entry:", dataset[0])
    print("\nDataset columns:", dataset.column_names)
except Exception as e:
    print(f"Error loading dataset: {e}")
    print("Ensure you have internet connection to download IMDB dataset.")

Dataset Loaded. Size: 100 samples
Sample Entry: {'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Real

### Loading the IMDB Dataset
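The IMDB dataset contains 50,000 movie reviews (25,000 for training and 25,000 for testing), each labeled as positive (1) or negative (0) based on sentiment. We use only the first 100 training samples for faster training on limited hardware. Note that the train split appears to be ordered by label, so an unshuffled `train[:100]` slice is likely to contain only negative reviews; shuffling before selecting gives the model both classes.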

**Dataset structure:**
- **text**: The full movie review text
- **label**: 0 (negative) or 1 (positive)

This cell downloads the dataset and displays sample information to verify loading.
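
One thing to watch with a small slice: if a split is ordered by label (which appears to be the case for the IMDB train split), taking the first 100 rows yields a single class. The sketch below illustrates the problem and one possible remedy on toy data. The helper `balanced_subset` and the toy records are hypothetical stand-ins, not part of the notebook.

```python
import random
from collections import defaultdict


def balanced_subset(records, per_class, seed=42):
    """Pick `per_class` random records for each label value."""
    by_label = defaultdict(list)
    for record in records:
        by_label[record["label"]].append(record)

    rng = random.Random(seed)
    subset = []
    for label in sorted(by_label):
        subset.extend(rng.sample(by_label[label], per_class))
    rng.shuffle(subset)  # avoid presenting one class first during training
    return subset


# Toy stand-in for a label-sorted split: 500 negatives followed by 500 positives.
toy = [{"text": f"review {i}", "label": 0 if i < 500 else 1} for i in range(1000)]

first_100 = toy[:100]
balanced = balanced_subset(toy, per_class=50)

print(sum(r["label"] for r in first_100))  # positives in the plain slice: 0
print(sum(r["label"] for r in balanced))   # positives in the balanced subset: 50
```

With Hugging Face `datasets`, calling `shuffle(seed=...)` followed by `select(range(n))` on the full split gives a random sample in a similar spirit, although it does not guarantee an exact 50/50 balance.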


## 3. Baseline Inference (Before Training)
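We use the locally running Ollama instance to check how Llama 3.2 performs sentiment analysis *without* fine-tuning. Note that this baseline is a different model from the TinyLlama checkpoint fine-tuned later, so the comparison is illustrative rather than like-for-like.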

In [15]:
def query_ollama(prompt, model="llama3.2"):
    try:
        response = ollama.chat(model=model, messages=[
            {'role': 'user', 'content': prompt}
        ])
        return response['message']['content']
    except Exception as e:
        return f"Ollama Error: {str(e)}"

# Test Review for Sentiment Analysis
test_instruction = "Analyze the sentiment of the following movie review: 'This film was absolutely amazing! The acting was brilliant, the storyline was engaging, and I couldn't take my eyes off the screen. I highly recommend it to everyone.'"

print("--- Baseline (Ollama) ---")
baseline_response = query_ollama(test_instruction)
print(f"Review Analysis: {test_instruction}\n")
print(f"Response:\n{baseline_response}")

--- Baseline (Ollama) ---
Review Analysis: Analyze the sentiment of the following movie review: 'This film was absolutely amazing! The acting was brilliant, the storyline was engaging, and I couldn't take my eyes off the screen. I highly recommend it to everyone.'

Response:
The sentiment of this movie review is extremely positive. The reviewer uses superlatives such as "absolutely amazing" and "brilliant" to describe their experience with the film. They also explicitly state that they "couldn't take my eyes off the screen," which suggests that the movie was so engaging and captivating that it held their full attention.

The reviewer's use of the phrase "I highly recommend it to everyone" further emphasizes their enthusiasm for the film, implying that they think it is a must-see for anyone. There is no criticism or negative comment in the review, only praise and admiration for the movie.

Overall, the sentiment of this review can be characterized as overwhelmingly positive and enthusia

### Testing Baseline Model with Ollama
Before fine-tuning, we query Llama 3.2 as served by Ollama to see how a model with no training on our specific task handles sentiment analysis.

**What this cell does:**
1. Defines `query_ollama()` to send a prompt to the local Ollama server
2. Creates a test movie review with clearly positive sentiment
3. Stores the model's response as the "before" reference for the later comparison


## 4. Fine-Tuning Setup (QLoRA)

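We will fine-tune using Hugging Face Transformers. 
**Important**: Ollama stores models in GGUF format, which standard training tools cannot train directly. We therefore download base weights from Hugging Face, train a LoRA adapter on them, and save the adapter. The notebook supports `meta-llama/Llama-3.2-1B-Instruct` (gated, requires authentication), but the recorded run uses the non-gated TinyLlama checkpoint. Because that run stays on CPU, the 4-bit quantization step is skipped and training is plain LoRA rather than QLoRA.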

In [16]:
from huggingface_hub import login

# Login to Hugging Face (only needed for gated models like Meta Llama)
# Option 1: Interactive login (uncomment and run to use)
# login()

# Option 2: Use HF_TOKEN environment variable
# To use this: Set environment variable HF_TOKEN=your_token_here
# Get your token from: https://huggingface.co/settings/tokens
import os
hf_token = os.getenv('HF_TOKEN')
if hf_token:
    login(token=hf_token)
    print("✓ Logged in to Hugging Face successfully!")
else:
    print("⚠ No HF_TOKEN environment variable found")
    print("  If using Meta Llama model, you need to:")
    print("  1. Accept the license: https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct")
    print("  2. Get your token: https://huggingface.co/settings/tokens")
    print("  3. Set HF_TOKEN environment variable or uncomment login() above")
    print("  ")
    print("  Using non-gated TinyLlama model instead (no authentication needed)")


⚠ No HF_TOKEN environment variable found
  If using Meta Llama model, you need to:
  1. Accept the license: https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct
  2. Get your token: https://huggingface.co/settings/tokens
  3. Set HF_TOKEN environment variable or uncomment login() above
  
  Using non-gated TinyLlama model instead (no authentication needed)


### Authentication & Model Loading
This section handles:
1. **Hugging Face login**: Sets up credentials for gated models (Meta Llama requires accepting its license)
2. **Model selection**: Uses TinyLlama (openly available, no authentication needed)
3. **Device selection**: Checks whether a GPU is present, but always loads the model on CPU in this notebook
4. **Quantization**: Skipped, since the 4-bit path only applies when loading on a GPU

The model is loaded with `device_map="cpu"` because the notebook treats 4 GB of VRAM as too little for training.


In [21]:
# Model ID - We use 1B or 3B for local efficiency.
# Option 1: Meta Llama (requires HF authentication - see previous cell)
# MODEL_NAME = "meta-llama/Llama-3.2-1B-Instruct"

# Option 2: Non-gated alternative (no authentication required)
MODEL_NAME = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

NEW_MODEL_NAME = "llama-sentiment-classifier"

# For 4GB GPU, we'll use CPU training (slower but works without GPU memory issues)
import torch
USE_GPU = torch.cuda.is_available()

if USE_GPU:
    print("GPU available, but using CPU for training due to memory constraints")
    print("(GPU training with 4GB VRAM is not feasible)")
    device_map = "cpu"
else:
    device_map = "cpu"

# QLoRA configuration (4-bit quantization). device_map is always "cpu" above,
# so this branch is skipped; it is kept for machines with enough VRAM.
if device_map != "cpu":
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=False,
    )
    
    # Load Base Model
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        quantization_config=bnb_config,
        device_map="auto"
    )
else:
    # Load Base Model for CPU training (no quantization needed)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        device_map="cpu",
        torch_dtype=torch.float32,
    )

model.config.use_cache = False
model.config.pretraining_tp = 1

# Load Tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

print(f"Model loaded on device: {device_map}")


GPU available, but using CPU for training due to memory constraints
(GPU training with 4GB VRAM is not feasible)


`torch_dtype` is deprecated! Use `dtype` instead!


Model loaded on device: cpu


### Model Initialization Details
**Model**: TinyLlama 1.1B parameters - compact and efficient for local training

**Key configurations:**
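- `use_cache=False`: Disables the key-value cache, which speeds up generation but is not needed during training
- `pretraining_tp=1`: Uses the standard linear-layer computation instead of emulating the tensor-parallel slicing used in pretraining
- Tokenizer setup: Pad token set to the EOS (end-of-sequence) token, with padding on the right

**Quantization skipped on CPU**: 4-bit quantization with `bitsandbytes` mainly targets GPU VRAM savings, so the CPU path loads full float32 weights.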
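
The notebook hard-codes CPU. For readers who want the choice to depend on available memory, the decision could be sketched as below. The function `choose_device` and its 6 GB threshold are illustrative assumptions, not measured requirements, and the function is kept free of PyTorch so the logic can be read and tested on its own.

```python
def choose_device(cuda_available, vram_gb=0.0, min_vram_gb=6.0):
    """Return "cuda" only when a GPU exists and reports enough memory, else "cpu".

    `min_vram_gb` is an illustrative threshold, not a measured requirement.
    """
    if cuda_available and vram_gb >= min_vram_gb:
        return "cuda"
    return "cpu"


# With PyTorch installed, the inputs could be gathered like this:
#   import torch
#   cuda_available = torch.cuda.is_available()
#   vram_gb = (torch.cuda.get_device_properties(0).total_memory / 1024**3
#              if cuda_available else 0.0)

print(choose_device(cuda_available=False))              # cpu
print(choose_device(cuda_available=True, vram_gb=4.0))  # cpu
print(choose_device(cuda_available=True, vram_gb=8.0))  # cuda
```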


### Formatting the Dataset
We convert each IMDB record (review text plus label) into a single instruction-style prompt string for training.

In [22]:
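def format_prompt(sample):
    # Skip samples that are already formatted, so re-running this cell
    # does not wrap the template around itself
    if sample["text"].startswith("### Instruction:"):
        return {"text": sample["text"]}
    # Format for sentiment analysis
    sentiment_label = "positive" if sample["label"] == 1 else "negative"
    text = f"### Instruction:\nAnalyze the sentiment of the following movie review.\n\n### Input:\n{sample['text']}\n\n### Response:\nSentiment: {sentiment_label}"
    return {"text": text}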

dataset = dataset.map(format_prompt)
print(f"Formatted dataset size: {len(dataset)} samples")
if len(dataset) > 0:
    print("Sample formatted entry:", dataset[0]["text"][:200] + "...")

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

Formatted dataset size: 100 samples
Sample formatted entry: ### Instruction:
Analyze the sentiment of the following movie review.

### Input:
### Instruction:
Analyze the sentiment of the following movie review.

### Input:
I rented I AM CURIOUS-YELLOW from my...


### Dataset Formatting
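We transform raw IMDB data into a standardized prompt format. The recorded output above shows the `### Instruction:` header twice, which is most likely the result of running the formatting cell a second time on an already formatted dataset; the intended format has a single header: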

**Format:**
```
### Instruction:
Analyze the sentiment of the following movie review.

### Input:
[Full review text]

### Response:
Sentiment: [positive/negative]
```

This format teaches the model to:
1. Understand the task (Instruction)
2. Process the input (review text)
3. Generate structured output (sentiment label)

This is called "Instruction Tuning" - training on task-specific prompt-response pairs.
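
The template above can be captured in one helper that serves both training and inference, so the two stay in sync. This is a minimal sketch; the name `build_prompt` and the constant `INSTRUCTION` are hypothetical and do not appear in the notebook.

```python
INSTRUCTION = "Analyze the sentiment of the following movie review."


def build_prompt(review, label=None):
    """Build the notebook's prompt template.

    With `label` (0 or 1) the answer is appended, giving a training example.
    Without it the prompt stops after the response header, ready for generation.
    """
    prompt = (
        f"### Instruction:\n{INSTRUCTION}\n\n"
        f"### Input:\n{review}\n\n"
        f"### Response:\n"
    )
    if label is not None:
        sentiment = "positive" if label == 1 else "negative"
        prompt += f"Sentiment: {sentiment}"
    return prompt


train_text = build_prompt("A warm, funny film.", label=1)
infer_text = build_prompt("A warm, funny film.")

print(train_text)
print(train_text.startswith(infer_text))  # True: the inference prompt is a prefix of the training text
```

Building the inference prompt from the same function means the model sees the same prefix at test time that it saw during training.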


In [23]:
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
)

### LoRA Configuration
**LoRA (Low-Rank Adaptation)** is parameter-efficient fine-tuning. Instead of updating all 1.1B model weights, we only train small adapter matrices.

**Key parameters:**
- `lora_alpha=16`: Scaling factor for LoRA updates; the adapter output is scaled by `lora_alpha / r`, here 16/64 = 0.25 (higher = stronger updates)
- `lora_dropout=0.1`: Regularization to prevent overfitting
- `r=64`: Rank of the low-rank matrices (trade-off between expressivity and parameters)
- `task_type="CAUSAL_LM"`: We're doing causal language modeling (next-token prediction)

**Benefits:**
- 🚀 Requires only ~1% of original parameters to train
- 💾 Saves memory (fits in 4GB GPU)
- ⚡ Faster training
- 📦 Produces tiny adapter files (~MB instead of GB)
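
The "~1%" and "MB instead of GB" figures can be sanity-checked with a little arithmetic. The sketch below assumes TinyLlama's published dimensions (hidden size 2048, 22 layers, a 256-wide `v_proj` output) and assumes that the adapters attach to `q_proj` and `v_proj` only, since the `LoraConfig` above does not list target modules. None of these values are stated in the notebook, so treat the result as an estimate.

```python
# Assumed TinyLlama-1.1B dimensions (not stated in the notebook).
HIDDEN_SIZE = 2048      # model width
NUM_LAYERS = 22         # transformer blocks
KV_DIM = 256            # output width of v_proj under grouped-query attention
TOTAL_PARAMS = 1.1e9    # "1.1B parameters" from the Toolkit section

R = 64                  # LoRA rank from LoraConfig
LORA_ALPHA = 16         # LoRA alpha from LoraConfig


def lora_params(in_features, out_features, rank):
    """A is (rank x in_features) and B is (out_features x rank)."""
    return rank * in_features + out_features * rank


# Assumed target modules: q_proj and v_proj in every layer.
per_layer = lora_params(HIDDEN_SIZE, HIDDEN_SIZE, R) + lora_params(HIDDEN_SIZE, KV_DIM, R)
trainable = per_layer * NUM_LAYERS

print(f"trainable LoRA parameters: {trainable:,}")
print(f"share of base model: {trainable / TOTAL_PARAMS:.2%}")
print(f"adapter size at float32: {trainable * 4 / 1e6:.1f} MB")
print(f"update scaling (alpha / r): {LORA_ALPHA / R}")
```

Under these assumptions the estimate comes to about 9 million trainable parameters, roughly 0.8% of the model and about 36 MB at float32, which is consistent with the figures quoted above.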


## 5. Training
We use the `SFTTrainer` (Supervised Fine-tuning Trainer) from `trl`.

In [24]:
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=1,  # Batch size of 1 for CPU
    gradient_accumulation_steps=1,
    optim="adamw_torch",  # Use standard AdamW for CPU (paged_adamw_32bit only for GPU)
    save_steps=10,
    logging_steps=2,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=False,  # No mixed precision on CPU
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=False,  # Disable for CPU to avoid extra overhead
    lr_scheduler_type="constant",  # Note: "constant" ignores warmup_ratio; "constant_with_warmup" would apply it
    report_to="none",
    no_cuda=True,  # Force CPU training (newer transformers releases prefer use_cpu=True)
)

print("Training on CPU (this will be slower but will use less memory)")
print(f"Total samples: {len(dataset)}")
print(f"Estimated training time: ~5-10 minutes")

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    args=training_args,
)

# Start training
print("\nStarting training...")
trainer.train()


Training on CPU (this will be slower but will use less memory)
Total samples: 100
Estimated training time: ~5-10 minutes


Adding EOS to train dataset:   0%|          | 0/100 [00:00<?, ? examples/s]

Tokenizing train dataset:   0%|          | 0/100 [00:00<?, ? examples/s]

Truncating train dataset:   0%|          | 0/100 [00:00<?, ? examples/s]

The model is already on multiple devices. Skipping the move to device specified in `args`.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'pad_token_id': 2}.



Starting training...


Step,Training Loss
2,2.6888
4,2.7055
6,2.3653
8,2.265
10,2.3487
12,2.1755
14,2.2577
16,2.5674
18,2.0214
20,2.1459


TrainOutput(global_step=100, training_loss=2.1466895055770876, metrics={'train_runtime': 1135.6586, 'train_samples_per_second': 0.088, 'train_steps_per_second': 0.088, 'total_flos': 257552055767040.0, 'train_loss': 2.1466895055770876})

### Training Configuration & Execution
**TrainingArguments** controls how the model is trained:

**Key settings:**
- `num_train_epochs=1`: One full pass through all 100 samples (100 optimizer steps at batch size 1)
- `per_device_train_batch_size=1`: Process 1 sample at a time (memory constraint)
- `learning_rate=2e-4`: Step size for each weight update, held fixed by the constant scheduler
- `fp16=False`: No mixed precision on CPU (keep everything as float32)
- `no_cuda=True`: Force CPU training to avoid VRAM issues
- `save_steps=10`: Save a checkpoint to `./results` every 10 training steps

**Training time**: the cell prints an estimate of ~5-10 minutes, but the recorded run took about 19 minutes (`train_runtime` of 1,135.66 s) on CPU for 100 samples.
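
The step count and per-step cost follow directly from the configuration and the recorded `TrainOutput`. A small worked calculation, using only values that appear in this notebook, is sketched here; the formula for steps is a rough estimate that ignores edge cases such as dropped partial batches.

```python
import math

# Values copied from the configuration and the recorded TrainOutput.
num_samples = 100
batch_size = 1
grad_accum = 1
epochs = 1
train_runtime_s = 1135.6586

steps_per_epoch = math.ceil(num_samples / (batch_size * grad_accum))
total_steps = steps_per_epoch * epochs
seconds_per_step = train_runtime_s / total_steps

print(f"optimizer steps: {total_steps}")              # 100, matching global_step=100
print(f"seconds per step: {seconds_per_step:.1f}")    # 11.4
print(f"total minutes: {train_runtime_s / 60:.1f}")   # 18.9
```

Multiplying the per-step time by a planned number of steps gives a quick runtime estimate before committing to a larger subset.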

**What happens during training:**
1. Load batch of formatted data
2. Forward pass: Model predicts next tokens
3. Calculate loss: Compare each predicted token with the actual next token in the formatted text (cross-entropy)
4. Backward pass: Compute gradients (how much to update)
5. Update LoRA adapters (only 1% of parameters updated)
6. Repeat for all 100 samples
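
Steps 2 and 3 above can be made concrete with a tiny next-token loss computed by hand. This is a pure-Python sketch with a toy 4-token vocabulary, not the trainer's actual implementation; `causal_lm_loss` is a hypothetical name. As far as the default handling of a plain `text` field goes, the loss is usually taken over every token of the formatted example, review included, not only over the final label.

```python
import math


def causal_lm_loss(logits, token_ids):
    """Mean next-token cross-entropy.

    logits[t] holds one score per vocabulary entry, produced after reading token t.
    The target for position t is the token at position t + 1, hence the shift.
    """
    losses = []
    for t in range(len(token_ids) - 1):
        scores = logits[t]
        target = token_ids[t + 1]
        # log-sum-exp computed stably by subtracting the max score
        peak = max(scores)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        losses.append(log_norm - scores[target])
    return sum(losses) / len(losses)


# Toy vocabulary of 4 tokens and a 3-token sequence.
token_ids = [0, 2, 1]
logits = [
    [0.1, 0.2, 3.0, 0.0],  # after token 0: favours token 2 (correct)
    [0.0, 2.5, 0.1, 0.3],  # after token 2: favours token 1 (correct)
    [0.0, 0.0, 0.0, 0.0],  # after the last token: unused, nothing follows it
]

loss = causal_lm_loss(logits, token_ids)
print(f"loss: {loss:.4f}  perplexity: {math.exp(loss):.2f}")
```

The same relation applies to the recorded run: the exponential of the reported training loss (about 2.15) gives the model's average per-token perplexity on the training text.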


In [25]:
# Save the fine-tuned adapter
trainer.model.save_pretrained(NEW_MODEL_NAME)
tokenizer.save_pretrained(NEW_MODEL_NAME)
print(f"Model adapter saved locally at: {NEW_MODEL_NAME}")

Model adapter saved locally at: llama-sentiment-classifier


### Saving the Fine-Tuned Adapter
After training completes, we save:
1. **LoRA adapter weights**: The trained low-rank matrices (small ~MB files)
2. **Tokenizer**: The vocabulary and tokenization rules used

**Why small files?** 
- Original model: 1.1B parameters × 2 bytes (float16) = ~2.2 GB
- LoRA adapters: Only rank-64 matrices for each layer = ~10-50 MB
- Tokenizer: ~1 MB

You can later combine these adapters with the original model for inference.
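
To check the size claim against what was actually written to disk, the saved directory can be listed with the standard library. A minimal sketch follows; `directory_report` is a hypothetical helper, and the directory name is the `NEW_MODEL_NAME` value used in this notebook.

```python
from pathlib import Path


def directory_report(path):
    """Return (file name, size in MB) pairs for files under `path`, largest first."""
    files = [p for p in Path(path).rglob("*") if p.is_file()]
    sizes = [(str(p.relative_to(path)), p.stat().st_size / 1e6) for p in files]
    return sorted(sizes, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    adapter_dir = "llama-sentiment-classifier"  # NEW_MODEL_NAME in the notebook
    if Path(adapter_dir).is_dir():
        for name, size_mb in directory_report(adapter_dir):
            print(f"{size_mb:8.2f} MB  {name}")
    else:
        print(f"{adapter_dir} not found; run the save cell first")
```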


## 6. Inference Comparison (After Training)
We now reload the model with the trained LoRA adapter to see the difference.

*Note: To run this in Ollama (outside Python), you would typically fuse this adapter with the base model and convert it to GGUF format using `llama.cpp`.*

In [26]:
from peft import PeftModel

# Load base model again (or reuse if memory allows, simpler to reload for clean state)
base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Load the adapter we just trained
ft_model = PeftModel.from_pretrained(base_model, NEW_MODEL_NAME)
ft_model = ft_model.merge_and_unload() # Merge for faster inference

ft_tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
ft_tokenizer.pad_token = ft_tokenizer.eos_token

### Loading Fine-Tuned Model for Inference
This cell prepares the trained model for testing:

**Steps:**
1. **Reload base model**: Load fresh TinyLlama from Hugging Face
2. **Load adapters**: Inject the trained LoRA adapters we just saved
3. **Merge and unload**: Fuse adapters into model weights for faster inference
4. **Load tokenizer**: Use same tokenizer as training for consistency

**Result**: A fully merged model that's ready to make predictions on new reviews.


In [27]:
def query_finetuned(instruction, input_text=""):
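    # Format prompt as in training when input_text is given;
    # without it the "### Input:" section is omitted and the prompt differs from training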
    if input_text:
        prompt = f"### Instruction:\n{instruction}\n\n### Input:\n{input_text}\n\n### Response:\n"
    else:
        prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"
        
    inputs = ft_tokenizer(prompt, return_tensors="pt").to(base_model.device)
    outputs = ft_model.generate(**inputs, max_new_tokens=200, use_cache=True)
    response = ft_tokenizer.decode(outputs[0], skip_special_tokens=True)
    
    # Extract just the response part if possible
    if "### Response:" in response:
        return response.split("### Response:")[1].strip()
    return response

print("--- Fine-Tuned Model Inference ---")
ft_response = query_finetuned(test_instruction)
print(f"Review Analysis: {test_instruction}\n")
print(f"Response:\n{ft_response}")

--- Fine-Tuned Model Inference ---
Review Analysis: Analyze the sentiment of the following movie review: 'This film was absolutely amazing! The acting was brilliant, the storyline was engaging, and I couldn't take my eyes off the screen. I highly recommend it to everyone.'

Response:
Sentiment: negative


### Running Inference with Fine-Tuned Model
This function performs sentiment analysis on new movie reviews:

**Process:**
1. **Format prompt**: Build the instruction-input-response template; note that the test call above passes the review inside the instruction with no `### Input:` section, so its prompt differs from the training format
2. **Tokenize**: Convert text to token IDs using the tokenizer
3. **Generate**: Model predicts next tokens to complete the response
4. **Decode**: Convert token IDs back to human-readable text
5. **Extract**: Return only the text after the `### Response:` marker

**Key parameters:**
- `max_new_tokens=200`: Upper bound on generated tokens; the expected answer needs only a few
- `to(base_model.device)`: Move data to same device as model (CPU or GPU)
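
The extraction step returns everything after the response marker, which may include extra generated text. For scoring, it helps to reduce that to a bare label. The sketch below shows one way to do so with a regular expression; `extract_label` is a hypothetical helper that assumes the model follows the `Sentiment: <label>` format from training.

```python
import re

LABEL_PATTERN = re.compile(r"sentiment\s*:\s*(positive|negative)", re.IGNORECASE)


def extract_label(response):
    """Return "positive", "negative", or None when no label is found."""
    match = LABEL_PATTERN.search(response)
    return match.group(1).lower() if match else None


print(extract_label("Sentiment: negative"))                   # negative
print(extract_label("sentiment:Positive\n### Instruction:"))  # positive
print(extract_label("The review is glowing."))                # None
```

Returning `None` for unparseable responses keeps format failures visible instead of silently counting them as one of the classes.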


## 7. Results Comparison
Side-by-side view of the Ollama baseline (Llama 3.2) and the fine-tuned local TinyLlama model.

In [29]:
print("=== SENTIMENT ANALYSIS COMPARISON ===\n")
print(f"REVIEW: {test_instruction}\n")

print("[BEFORE - Ollama Base Model]")
print(baseline_response)
print("\n" + "-"*30 + "\n")

print("[AFTER - Fine-Tuned Adapter (IMDB)]")
print(ft_response)


=== SENTIMENT ANALYSIS COMPARISON ===

REVIEW: Analyze the sentiment of the following movie review: 'This film was absolutely amazing! The acting was brilliant, the storyline was engaging, and I couldn't take my eyes off the screen. I highly recommend it to everyone.'

[BEFORE - Ollama Base Model]
The sentiment of this movie review is extremely positive. The reviewer uses superlatives such as "absolutely amazing" and "brilliant" to describe their experience with the film. They also explicitly state that they "couldn't take my eyes off the screen," which suggests that the movie was so engaging and captivating that it held their full attention.

The reviewer's use of the phrase "I highly recommend it to everyone" further emphasizes their enthusiasm for the film, implying that they think it is a must-see for anyone. There is no criticism or negative comment in the review, only praise and admiration for the movie.

Overall, the sentiment of this review can be characterized as overwhelmingl

### Final Comparison: Before vs After Fine-Tuning
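This cell displays the key result:

**What we're comparing:**
1. **[BEFORE]**: Response from Llama 3.2 via Ollama, with no fine-tuning
   - Has not been trained on our prompt format
   - Gives a free-form, multi-paragraph analysis that identifies the review as positive

2. **[AFTER]**: Response from the fine-tuned local TinyLlama model
   - Trained on 100 IMDB movie reviews
   - Answers in the trained `Sentiment: <label>` format
   - In the recorded run it answered `Sentiment: negative` for a clearly positive review

**Interpretation:**
- The change in output format shows that fine-tuning altered the model's behaviour, but a different answer is not by itself evidence of improvement; here the predicted label is wrong
- Likely contributors are the tiny training set, which may contain only one class when taken as an unshuffled slice, and the mismatch between the test prompt and the training template
- Measuring accuracy on held-out labeled reviews is a more reliable check than comparing a single response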
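
A single review cannot show whether fine-tuning helped, so a small held-out evaluation is a natural next step. The sketch below scores any `predict(text) -> label` callable and reports a majority-class baseline alongside accuracy. The function `evaluate`, the example reviews, and the `always_negative` stand-in are all hypothetical; a real run would wrap the fine-tuned model and draw examples from the IMDB test split.

```python
from collections import Counter


def evaluate(predict, examples):
    """Score a `predict(text) -> label` callable on (text, label) pairs."""
    gold = [label for _, label in examples]
    predicted = [predict(text) for text, _ in examples]
    correct = sum(p == g for p, g in zip(predicted, gold))
    majority_count = Counter(gold).most_common(1)[0][1]
    return {
        "accuracy": correct / len(examples),
        "majority_baseline": majority_count / len(examples),
        "predicted_counts": dict(Counter(predicted)),
    }


# Hypothetical held-out examples; a real run would draw these from the IMDB test split.
held_out = [
    ("Loved every minute.", "positive"),
    ("A tedious mess.", "negative"),
    ("Brilliant cast, sharp script.", "positive"),
    ("I walked out halfway.", "negative"),
]


# Stand-in predictor that always answers "negative", mimicking a collapsed model.
def always_negative(text):
    return "negative"


print(evaluate(always_negative, held_out))
```

If `predicted_counts` contains only one label and accuracy equals the majority baseline, the model has probably collapsed to a constant answer rather than learned the task.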
