# Phase 2: Fine-Tuning LawBot with Qwen2.5-1.5B

## Objectives:
1. Load Qwen2.5-1.5B-Instruct model
2. Apply 4-bit QLoRA using Unsloth
3. Fine-tune on legal Q&A data
4. Evaluate model performance
5. Save adapter weights

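## Step 0: Mount Google Drive and Install Dependencies
Run the cell below first. It mounts your Drive, installs Unsloth and its dependencies, and imports the libraries used in the later steps.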


In [1]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')
print("✅ Google Drive mounted!")

import os
base_dir = '/content/drive/MyDrive/LawBot'
print(f"Using base directory: {base_dir}")

# Install unsloth for fast fine-tuning
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.26" trl peft accelerate bitsandbytes

# Import libraries
from unsloth import is_bfloat16_supported
from unsloth import FastLanguageModel
import torch
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")


Mounted at /content/drive
✅ Google Drive mounted!
Using base directory: /content/drive/MyDrive/LawBot
Collecting unsloth@ git+https://github.com/unslothai/unsloth.git (from unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git)
  Cloning https://github.com/unslothai/unsloth.git to /tmp/pip-install-xid_m7qg/unsloth_fe788f4d28a14542aa01039f28fb644b
  Running command git clone --filter=blob:none --quiet https://github.com/unslothai/unsloth.git /tmp/pip-install-xid_m7qg/unsloth_fe788f4d28a14542aa01039f28fb644b
  Resolved https://github.com/unslothai/unsloth.git to commit 874b262b5da1e38160312e1b5689a7c01303a51e
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting unsloth_zoo>=2025.10.11 (from unsloth@ git+https://github.com/unslothai/unsloth.git->unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git)
  Downloading unsloth_zoo-2025.10.12-py3-



🦥 Unsloth Zoo will now patch everything to make training faster!
PyTorch version: 2.8.0+cu126
CUDA available: True
CUDA device: Tesla T4


## Step 1: Load and Prepare Data


In [2]:
import json

# Load train and validation data
def load_jsonl(filename):
    data = []
    with open(filename, 'r', encoding='utf-8') as f:
        for line in f:
            data.append(json.loads(line))
    return data

train_data = load_jsonl(f'{base_dir}/data/processed/train.jsonl')
val_data = load_jsonl(f'{base_dir}/data/processed/val.jsonl')

print(f"Training samples: {len(train_data)}")
print(f"Validation samples: {len(val_data)}")

# Save to temporary JSON files for datasets library
with open('/tmp/train.jsonl', 'w', encoding='utf-8') as f:
    for item in train_data:
        f.write(json.dumps(item, ensure_ascii=False) + '\n')

with open('/tmp/val.jsonl', 'w', encoding='utf-8') as f:
    for item in val_data:
        f.write(json.dumps(item, ensure_ascii=False) + '\n')

# Load with datasets library
dataset = load_dataset('json', data_files={'train': '/tmp/train.jsonl', 'val': '/tmp/val.jsonl'})
print(dataset)


Training samples: 11617
Validation samples: 2905


DatasetDict({
    train: Dataset({
        features: ['instruction', 'output', 'source'],
        num_rows: 11617
    })
    val: Dataset({
        features: ['instruction', 'output', 'source'],
        num_rows: 2905
    })
})
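
Before formatting, it can help to confirm that every record actually carries the fields the later steps read. The following is a minimal sketch, not part of the original notebook: `find_bad_records` and `REQUIRED_FIELDS` are hypothetical names, and the check assumes `instruction` and `output` must be non-empty strings.

```python
import json

# Fields the formatting step reads; 'source' is not needed for training.
REQUIRED_FIELDS = ("instruction", "output")

def find_bad_records(path, required=REQUIRED_FIELDS):
    """Return (line_number, reason) pairs for records that would break formatting."""
    problems = []
    with open(path, "r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((line_number, f"invalid JSON: {exc.msg}"))
                continue
            for field in required:
                value = record.get(field) if isinstance(record, dict) else None
                if not isinstance(value, str) or not value.strip():
                    problems.append((line_number, f"missing or empty '{field}'"))
    return problems
```

Calling `find_bad_records('/tmp/train.jsonl')` should return an empty list if every line is well formed.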


## Step 2: Load Qwen2.5-1.5B Model with QLoRA


In [3]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-1.5B-Instruct",
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    use_gradient_checkpointing=True,
    random_state=3407,
)

print(f"Model loaded: {model.config.name_or_path}")
print("Max sequence length: 2048")


==((====))==  Unsloth 2025.10.10: Fast Qwen2 patching. Transformers: 4.56.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unsloth: Dropout = 0 is supported for fast patching. You are using dropout = 0.05.
Unsloth will patch all other layers, except LoRA matrices, causing a performance hit.
Unsloth 2025.10.10 patched 28 layers with 0 QKV layers, 0 O layers and 0 MLP layers.


Model loaded: unsloth/qwen2.5-1.5b-instruct-unsloth-bnb-4bit
Max sequence length: 2048
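
The LoRA settings above determine how many parameters are trained. As a rough worked example, the sketch below counts the adapter parameters for `r=16` over the seven target modules. The layer dimensions are assumptions taken from the published Qwen2.5-1.5B configuration (they do not appear in this notebook); with them, the total agrees with the `Trainable parameters = 18,464,768` figure Unsloth prints when training starts.

```python
# Assumed Qwen2.5-1.5B dimensions (not shown anywhere in the notebook output).
HIDDEN = 1536        # model hidden size
INTERMEDIATE = 8960  # MLP intermediate size
KV_DIM = 256         # 2 key/value heads x 128 head dimension
NUM_LAYERS = 28      # consistent with "patched 28 layers" in the log
R = 16               # LoRA rank used in get_peft_model

# (in_features, out_features) of each targeted linear layer
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features, out_features, r):
    """LoRA adds A (r x in_features) and B (out_features x r) per layer."""
    return r * (in_features + out_features)

per_layer = sum(lora_params(i, o, R) for i, o in SHAPES.values())
total = per_layer * NUM_LAYERS

print(f"LoRA parameters per transformer layer: {per_layer:,}")
print(f"LoRA parameters in total: {total:,}")
# 1,562,179,072 is the total parameter count reported in the training banner.
print(f"Share of all parameters: {100 * total / 1_562_179_072:.2f}%")
```

With `lora_alpha=32` and `r=16`, the adapter update is scaled by `lora_alpha / r = 2`.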


## Step 3: Prepare Dataset Format


In [4]:
def format_instruction(data):
    """Format data for Qwen2.5 instruction following"""
    instruction = data["instruction"]
    output = data["output"]

    text = f"<|im_start|>user\n{instruction}<|im_end|>\n<|im_start|>assistant\n{output}<|im_end|>\n"
    return text

# Apply formatting
dataset = dataset.map(lambda x: {"text": format_instruction(x)})
print("Sample formatted text:")
print(dataset["train"][0]["text"])


Sample formatted text:
<|im_start|>user
Under what circumstances can the President call for supplementary, additional or excess grants according to article 115 of the Indian constitution?<|im_end|>
<|im_start|>assistant
The President can call for supplementary, additional or excess grants if the amount authorised by any law for a particular service for the current financial year is found to be insufficient, or when a need has arisen during the current financial year for supplementary or additional expenditure upon some new service not contemplated in the annual financial statement for that year, or if any money has been spent on any service during a financial year in excess of the amount granted for that service and for that year.<|im_end|>
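
The evaluation step later rebuilds this prompt by hand and splits the decoded output on the same markers, so the two templates have to stay in sync. One way to guard against drift is to derive both strings from shared helpers. This is a sketch only; `build_inference_prompt`, `build_training_text`, and `extract_reply` are hypothetical names that do not appear in the notebook.

```python
IM_START, IM_END = "<|im_start|>", "<|im_end|>"

def build_inference_prompt(instruction):
    """User turn plus an open assistant turn, as used at generation time."""
    return f"{IM_START}user\n{instruction}{IM_END}\n{IM_START}assistant\n"

def build_training_text(instruction, output):
    """Full training example: the inference prompt followed by the closed answer."""
    return build_inference_prompt(instruction) + f"{output}{IM_END}\n"

def extract_reply(decoded):
    """Take the text after the last assistant marker, up to the end-of-turn marker."""
    return decoded.split(f"{IM_START}assistant\n")[-1].split(IM_END)[0]

example = build_training_text("What is Article 115?", "It covers supplementary grants.")
print(example)
print(extract_reply(example))
```

Because the training text is built on top of the inference prompt, a change to one template automatically carries over to the other.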



## Step 4: Fine-Tuning Configuration


In [6]:
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset["train"],
    eval_dataset=dataset["val"],
    max_seq_length=2048,
    dataset_num_proc=2,
    packing=False,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        per_device_eval_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=50,
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=10,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
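        output_dir="../models/adapters",  # checkpoints; relative to the Colab working directory, not Drive
        save_strategy="epoch",
        eval_strategy="epoch",  # must match save_strategy when load_best_model_at_end is set
        load_best_model_at_end=True,  # reload the checkpoint with the lowest eval_loss after training
        metric_for_best_model="eval_loss",
        save_total_limit=3,
        logging_dir="./logs",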
    ),
)

print("Trainer initialized successfully")

Trainer initialized successfully
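
The batch settings above fix the length of the run. The sketch below reproduces the `Total steps = 4,359` figure from the training banner and shows the learning rate under linear warmup followed by linear decay. It assumes a single GPU and that partial batches round up. `total_optimizer_steps` and `linear_lr` are hypothetical helpers written for illustration, not Trainer APIs.

```python
import math

def total_optimizer_steps(num_examples, per_device_batch, grad_accum, epochs):
    """Optimizer updates for a single-GPU run, rounding partial batches up."""
    batches_per_epoch = math.ceil(num_examples / per_device_batch)
    updates_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    return updates_per_epoch * epochs

def linear_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

total = total_optimizer_steps(num_examples=11617, per_device_batch=2,
                              grad_accum=4, epochs=3)
print(f"Effective batch size: {2 * 4}")
print(f"Total optimizer steps: {total}")
for step in (0, 25, 50, total // 2, total):
    print(f"step {step:>5}: lr = {linear_lr(step, 2e-4, 50, total):.2e}")
```

With only 50 warmup steps out of several thousand, the schedule is decaying for almost the entire run.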


## Step 5: Train Model


In [7]:
# Train the model
trainer.train()

print("Training completed!")


The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None}.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 11,617 | Num Epochs = 3 | Total steps = 4,359
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 18,464,768 of 1,562,179,072 (1.18% trained)

| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 1.211         | 1.263281        |
| 2     | 0.9714        | 1.151895        |
| 3     | 0.7006        | 1.17287         |
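
Since the trainer was configured with `load_best_model_at_end=True` and `metric_for_best_model="eval_loss"`, the checkpoint kept at the end should be the one with the lowest validation loss. The short sketch below reads that off the table above and also prints the gap between validation and training loss, which widens each epoch. The `history` list simply restates the logged values.

```python
# Loss values copied from the training log above.
history = [
    {"epoch": 1, "train_loss": 1.211, "eval_loss": 1.263281},
    {"epoch": 2, "train_loss": 0.9714, "eval_loss": 1.151895},
    {"epoch": 3, "train_loss": 0.7006, "eval_loss": 1.17287},
]

# Lower eval_loss is better, so the best checkpoint minimizes it.
best = min(history, key=lambda row: row["eval_loss"])
print(f"Best epoch by eval_loss: {best['epoch']}")

# A growing gap between validation and training loss suggests overfitting.
gaps = [row["eval_loss"] - row["train_loss"] for row in history]
for row, gap in zip(history, gaps):
    print(f"epoch {row['epoch']}: eval - train = {gap:.4f}")
```

Training loss keeps falling in epoch 3 while validation loss rises, so a third epoch does not appear to help on this data.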


Unsloth: Not an error, but Qwen2ForCausalLM does not accept `num_items_in_batch`.
Using gradient accumulation will be very slightly less accurate.
Read more on gradient accumulation issues here: https://unsloth.ai/blog/gradient


Training completed!


## Step 6: Save Fine-Tuned Model


In [8]:
# Save model and tokenizer
model.save_pretrained(f"{base_dir}/models/adapters/lawbot_qwen_adapter")
tokenizer.save_pretrained(f"{base_dir}/models/adapters/lawbot_qwen_adapter")

print("Adapter weights saved successfully!")


Adapter weights saved successfully!
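
A quick sanity check after saving is to confirm that the adapter folder contains what a later load will need. The sketch below assumes the usual PEFT file names, `adapter_config.json` and `adapter_model.safetensors`. Only the latter appears in this notebook's upload log, so treat the list as an assumption and adjust it to whatever the folder actually contains. `missing_adapter_files` is a hypothetical helper.

```python
import os

# Assumed file names for a PEFT adapter saved with safetensors serialization.
EXPECTED_ADAPTER_FILES = ("adapter_config.json", "adapter_model.safetensors")

def missing_adapter_files(adapter_dir, expected=EXPECTED_ADAPTER_FILES):
    """Return the expected file names that are not present in adapter_dir."""
    return [
        name for name in expected
        if not os.path.isfile(os.path.join(adapter_dir, name))
    ]
```

For example, `missing_adapter_files(f"{base_dir}/models/adapters/lawbot_qwen_adapter")` should return an empty list once the save has finished and Drive has synced.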


## Step 7: Evaluate Model Performance


In [10]:
!pip install rouge_score sacrebleu

Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Collecting sacrebleu
  Downloading sacrebleu-2.5.1-py3-none-any.whl.metadata (51 kB)
Collecting portalocker (from sacrebleu)
  Downloading portalocker-3.2.0-py3-none-any.whl.metadata (8.7 kB)
Collecting colorama (from sacrebleu)
  Downloading colorama-0.4.6-py2.py3-none-any.whl.metadata (17 kB)
Downloading sacrebleu-2.5.1-py3-none-any.whl (104 kB)
Downloading colorama-0.4.6-py2.py3-none-any.whl (25 kB)
Downloading portalocker-3.2.0-py3-none-any.whl (22 kB)
Building wheels for collected packages: rouge_score
  Building wheel for rouge_score (setup.py) ... done
  Created wheel for rouge_score: filename=rouge_score

In [12]:
import os

# Create the directory if it doesn't exist
output_dir = '../data/processed'
os.makedirs(output_dir, exist_ok=True)

from rouge_score import rouge_scorer
from sacrebleu import BLEU
import json

# Load evaluation metrics
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
bleu_scorer = BLEU()

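def evaluate_model(model, tokenizer, dataset, num_samples=10):
    """Generate answers for the first `num_samples` records and score them with ROUGE and BLEU."""
    results = []

    # Switch the Unsloth model to inference mode before generating
    FastLanguageModel.for_inference(model)

    for sample in dataset[:num_samples]:
        # Same chat format as in training, left open at the assistant turn
        prompt = f"<|im_start|>user\n{sample['instruction']}<|im_end|>\n<|im_start|>assistant\n"
        inputs = tokenizer(prompt, return_tensors="pt").to("cuda")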

        outputs = model.generate(**inputs, max_new_tokens=512, use_cache=True)
        generated = tokenizer.decode(outputs[0], skip_special_tokens=False)

        # Extract generated text
        generated_text = generated.split("<|im_start|>assistant\n")[-1].split("<|im_end|>")[0]
        ground_truth = sample['output']

        # Calculate ROUGE scores
        rouge_scores = scorer.score(ground_truth, generated_text)

        # Calculate BLEU score
        bleu_score = bleu_scorer.sentence_score(generated_text, [ground_truth])

        results.append({
            'instruction': sample['instruction'][:100],
            'generated': generated_text[:200],
            'ground_truth': ground_truth[:200],
            'rouge1': rouge_scores['rouge1'].fmeasure,
            'rouge2': rouge_scores['rouge2'].fmeasure,
            'rougeL': rouge_scores['rougeL'].fmeasure,
            'bleu': bleu_score.score / 100.0
        })

    return results

# Run evaluation
eval_results = evaluate_model(model, tokenizer, val_data, num_samples=20)

# Calculate average scores
avg_rouge1 = sum(r['rouge1'] for r in eval_results) / len(eval_results)
avg_rouge2 = sum(r['rouge2'] for r in eval_results) / len(eval_results)
avg_rougeL = sum(r['rougeL'] for r in eval_results) / len(eval_results)
avg_bleu = sum(r['bleu'] for r in eval_results) / len(eval_results)

print(f"\nEvaluation Results:")
print(f"Average ROUGE-1: {avg_rouge1:.4f}")
print(f"Average ROUGE-2: {avg_rouge2:.4f}")
print(f"Average ROUGE-L: {avg_rougeL:.4f}")
print(f"Average BLEU: {avg_bleu:.4f}")

# Save evaluation results
with open(f'{output_dir}/evaluation_results.json', 'w') as f:
    json.dump({
        'avg_scores': {
            'rouge1': avg_rouge1,
            'rouge2': avg_rouge2,
            'rougeL': avg_rougeL,
            'bleu': avg_bleu
        },
        'detailed_results': eval_results[:5]  # Save first 5 for review
    }, f, indent=2)

print(f"\nEvaluation results saved to {output_dir}/evaluation_results.json")




Evaluation Results:
Average ROUGE-1: 0.3561
Average ROUGE-2: 0.1331
Average ROUGE-L: 0.3147
Average BLEU: 0.1124

Evaluation results saved to ../data/processed/evaluation_results.json
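
To make the ROUGE-1 number easier to interpret, here is a simplified sketch of what its F-measure captures: unigram overlap between the reference and the generated answer. It is not the `rouge_score` implementation (it lowercases and splits on whitespace, with no stemming or tokenization rules), so its values will differ somewhat from those reported above. `rouge1_f` is a hypothetical helper.

```python
from collections import Counter

def rouge1_f(reference, prediction):
    """Unigram-overlap F1 between a reference and a prediction (simplified ROUGE-1)."""
    ref_tokens = reference.lower().split()
    pred_tokens = prediction.lower().split()
    if not ref_tokens or not pred_tokens:
        return 0.0
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(ref_tokens) & Counter(pred_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)  # share of predicted words found in the reference
    recall = overlap / len(ref_tokens)      # share of reference words that were recovered
    return 2 * precision * recall / (precision + recall)

print(rouge1_f("the cat sat", "the cat"))
```

An average ROUGE-1 of about 0.36 therefore means that generated and reference answers share a moderate fraction of their words. It says nothing by itself about legal correctness, and 20 samples is a small basis for the average.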


## Step 8: Push Adapter to Hugging Face Hub


In [13]:
from huggingface_hub import login

# Login to HF (will ask for token the first time)
login()  # Or use: login(token="your_hf_token")

# Push adapter to your HF Hub
model.push_to_hub("DheepLearning/lawbot-qwen-1.5b-adapter")
tokenizer.push_to_hub("DheepLearning/lawbot-qwen-1.5b-adapter")

print("✅ Pushed to Hugging Face Hub!")
print("Model available at: https://huggingface.co/DheepLearning/lawbot-qwen-1.5b-adapter")


Saved model to https://huggingface.co/DheepLearning/lawbot-qwen-1.5b-adapter

✅ Pushed to Hugging Face Hub!
Model available at: https://huggingface.co/DheepLearning/lawbot-qwen-1.5b-adapter


## Summary

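Phase 2 is complete. This notebook:
1. ✅ Loaded the Qwen2.5-1.5B-Instruct model
2. ✅ Applied QLoRA with 4-bit quantization (18,464,768 trainable parameters, 1.18% of the model)
3. ✅ Fine-tuned on legal Q&A data for 3 epochs; validation loss was lowest after epoch 2 (1.1519) and rose slightly after epoch 3 (1.1729)
4. ✅ Evaluated 20 validation samples with ROUGE and BLEU (ROUGE-1 0.3561, ROUGE-2 0.1331, ROUGE-L 0.3147, BLEU 0.1124)
5. ✅ Saved the adapter weights to Drive and pushed them to the Hugging Face Hub

**Deliverables:**
- `models/adapters/lawbot_qwen_adapter/` (under the LawBot folder on Drive) - Fine-tuned adapter weights
- `../data/processed/evaluation_results.json` (relative to the Colab working directory) - Average scores plus the first 5 per-sample results
- Training history with validation loss tracking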
