Phase 1 – Dataset Preparation

Step 1 — Mount Google Drive & Extract Dataset

In [2]:
# ⚖️ Phase 1 - Step 1: Mount Drive & Extract Dataset (clean version)
from google.colab import drive
drive.mount('/content/drive')

# Create a dedicated LawBot folder (optional but organized)
!mkdir -p /content/drive/MyDrive/LawBot_Project

# Path to your uploaded dataset zip file
zip_path = "/content/Indian_Legal_Dataset_Lawbot_Assignment.zip"   # 👈 adjust if stored elsewhere
extract_dir = "/content/indian_legal_dataset"

# Extract dataset
!unzip -o "$zip_path" -d "$extract_dir" > /dev/null

print("✅ Dataset extracted successfully to:", extract_dir)


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
✅ Dataset extracted successfully to: /content/indian_legal_dataset


Step 2 — Combine and Clean Dataset
In this step, we’ll:

Load all JSON files

Merge them into one list

Remove duplicates and empty entries

Format each record as { "question": ..., "answer": ..., "source": ... }

In [3]:
# ⚖️ Phase 1 - Step 2: Combine and Clean Dataset
import json, glob, os

dataset_dir = "/content/indian_legal_dataset"

cleaned_data = []
seen_questions = set()

# Load all JSON files inside dataset directory
for file_path in glob.glob(f"{dataset_dir}/**/*.json", recursive=True):
    with open(file_path, "r", encoding="utf-8") as f:
        try:
            data = json.load(f)
            for record in data:
                q = record.get("instruction") or record.get("question") or ""
                a = record.get("output") or record.get("answer") or ""
                src = record.get("source") or os.path.basename(file_path)
                if q.strip() and a.strip() and q not in seen_questions:
                    cleaned_data.append({
                        "question": q.strip(),
                        "answer": a.strip(),
                        "source": src
                    })
                    seen_questions.add(q)
        except Exception as e:
            print(f"⚠️ Error reading {file_path}: {e}")

print(f"✅ Total cleaned records: {len(cleaned_data)}")
print("🔹 Sample record:")
print(json.dumps(cleaned_data[0], indent=2, ensure_ascii=False))


✅ Total cleaned records: 14460
🔹 Sample record:
{
  "question": "What is India according to the Union and its Territory?",
  "answer": "India, that is Bharat, shall be a Union of States.",
  "source": "constitution_qa.json"
}
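
The cell above de-duplicates on the raw question string, so two questions that differ only in case or spacing are both kept. A minimal sketch of a stricter key follows, assuming such differences should not count as distinct questions. `normalize_question` and `dedupe_records` are hypothetical helper names, not part of the notebook.

```python
import re

def normalize_question(text):
    """Collapse whitespace and ignore case (an assumed notion of 'same question')."""
    return re.sub(r"\s+", " ", text).strip().casefold()

def dedupe_records(records):
    """Keep the first record for each normalized question, preserving order."""
    seen = set()
    unique = []
    for record in records:
        key = normalize_question(record["question"])
        if key and key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

# Made-up sample records, only to show the behaviour
sample = [
    {"question": "What is theft?", "answer": "A"},
    {"question": "what is  theft? ", "answer": "B"},
    {"question": "What is bail?", "answer": "C"},
]
print(len(dedupe_records(sample)))  # 2
```

Applying a key like this to the real data may lower the record count reported above, so it is worth comparing both counts before adopting it.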


Step 3 — Split into Train & Validation Sets
We’ll split:

80 % → Training set

20 % → Validation set

and save them in JSONL format (1 record per line).

In [4]:
# ⚖️ Phase 1 - Step 3: Split Train & Validation
from sklearn.model_selection import train_test_split
import json

train_data, val_data = train_test_split(cleaned_data, test_size=0.2, random_state=42)

print(f"✅ Train set size: {len(train_data)}")
print(f"✅ Validation set size: {len(val_data)}")

# Save to files
train_path = "/content/drive/MyDrive/LawBot_Project/lawbot_train.jsonl"
val_path   = "/content/drive/MyDrive/LawBot_Project/lawbot_val.jsonl"

with open(train_path, "w", encoding="utf-8") as f:
    for record in train_data:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

with open(val_path, "w", encoding="utf-8") as f:
    for record in val_data:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

print("✅ Files saved successfully to Google Drive!")
print("📁 Train File:", train_path)
print("📁 Validation File:", val_path)


✅ Train set size: 11568
✅ Validation set size: 2892
✅ Files saved successfully to Google Drive!
📁 Train File: /content/drive/MyDrive/LawBot_Project/lawbot_train.jsonl
📁 Validation File: /content/drive/MyDrive/LawBot_Project/lawbot_val.jsonl
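
Because duplicates were removed before splitting, no question string should appear in both files. The sketch below shows one way to confirm that; `shared_questions` is a hypothetical helper, and it only catches exact string matches.

```python
def shared_questions(train_records, val_records):
    """Return the questions that appear in both splits (exact string match)."""
    train_questions = {r["question"] for r in train_records}
    val_questions = {r["question"] for r in val_records}
    return sorted(train_questions & val_questions)

# Made-up records to illustrate the check
train = [{"question": "Q1"}, {"question": "Q2"}]
val = [{"question": "Q3"}, {"question": "Q2"}]
print(shared_questions(train, val))  # ['Q2']
```

In the notebook this could be called as `shared_questions(train_data, val_data)`; an empty list means no exact overlap between the two splits.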


Step 4 — Verify the Saved Files

Let’s open a few random records from each file just to ensure:

the format is valid JSONL

each record contains question, answer, and source
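
The two checks listed above can also be applied to every line rather than a few previewed records. This is a sketch under the assumption that an empty string counts as a missing field; `validate_jsonl` and `REQUIRED_KEYS` are hypothetical names.

```python
import json

REQUIRED_KEYS = ("question", "answer", "source")

def validate_jsonl(path):
    """Return (record_count, problems); problems holds (line_number, message) pairs."""
    problems = []
    count = 0
    with open(path, "r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            count += 1
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((line_number, f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(record, dict):
                problems.append((line_number, "record is not a JSON object"))
                continue
            missing = [k for k in REQUIRED_KEYS if not str(record.get(k, "")).strip()]
            if missing:
                problems.append((line_number, f"missing or empty: {missing}"))
    return count, problems
```

Calling `validate_jsonl` on the train and validation paths used in this notebook should return an empty problem list if both files are well formed.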

In [5]:
# ⚖️ Phase 1 - Step 4: Verify Saved Files
import json

def preview_jsonl(file_path, n=3):
    print(f"\n🔍 Preview of {file_path}:")
    with open(file_path, "r", encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i >= n:
                break
            record = json.loads(line)
            print(json.dumps(record, indent=2, ensure_ascii=False))

preview_jsonl("/content/drive/MyDrive/LawBot_Project/lawbot_train.jsonl")
preview_jsonl("/content/drive/MyDrive/LawBot_Project/lawbot_val.jsonl")

print("\n✅ Verification complete — both files are correctly formatted.")



🔍 Preview of /content/drive/MyDrive/LawBot_Project/lawbot_train.jsonl:
{
  "question": "Who is responsible for conducting prosecutions in the Courts of Magistrates in every district?",
  "answer": "One or more Assistant Public Prosecutors appointed by the State Government are responsible for conducting prosecutions in the Courts of Magistrates in every district.",
  "source": "crpc_qa.json"
}
{
  "question": "What does a summons to a witness require them to do, and when are they permitted to leave?",
  "answer": "A summons to a witness requires them to appear before the court on a specific date and time, produce any documents, testify what they know concerning the complaint, and they are not allowed to depart until they have been permitted by the court.",
  "source": "crpc_qa.json"
}
{
  "question": "What section refers to the prosecution for defamation?",
  "answer": "199",
  "source": "crpc_qa.json"
}

🔍 Preview of /content/drive/MyDrive/LawBot_Project/lawbot_val.jsonl:
{
  "questio

Phase 2 – Fine-Tuning
🎯 Goal

Teach a base LLM (Phi-3 Mini) to answer Indian legal Q&A using our dataset.

Step 1 — Install Dependencies & Set Up Environment

In [1]:
# ⚙️ Phase 2 - Step 1: Install Dependencies & Setup
!pip install -U unsloth unsloth-zoo peft accelerate bitsandbytes \
transformers==4.57.1 datasets==3.6.0 pyarrow==16.1.0 --quiet

import torch, unsloth, transformers

print("✅ GPU available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("💻 GPU type:", torch.cuda.get_device_name(0))
print("✅ Unsloth version:", unsloth.__version__)
print("✅ Transformers version:", transformers.__version__)

Step 2: Load Base Model (Phi-3 Mini) for Fine-tuning

We’ll load Microsoft’s Phi-3-mini-instruct model in 4-bit mode (VRAM efficient),
and prepare it for LoRA fine-tuning.

In [2]:
# ⚖️ Phase 2 - Step 2: Load Base Model (Phi-3 Mini)
from unsloth import FastLanguageModel

model_name = "microsoft/Phi-3-mini-4k-instruct"

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=2048,     # context length
    dtype=None,              # auto-detect GPU precision
    load_in_4bit=True        # 4-bit quantization to reduce VRAM usage
)

print("✅ Base model loaded successfully!")


==((====))==  Unsloth 2025.11.1: Fast Mistral patching. Transformers: 4.57.1.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 8.0. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


✅ Base model loaded successfully!


Step 3: Prepare & Tokenize Dataset

Now we’ll:

Load the training and validation files you created earlier,

Format them as "### Question: ... ### Answer: ...",

Tokenize to prepare for LoRA fine-tuning.

In [5]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

# Check that your project folder exists
!ls -lh /content/drive/MyDrive | grep -i lawbot


Mounted at /content/drive
drwx------ 2 root root 4.0K Nov  5 04:45 LawBot_Adapter
drwx------ 2 root root 4.0K Nov  6 04:46 LawBot_Adapter_Final
drwx------ 2 root root 4.0K Nov  6 04:53 LawBot_FAISS_Index
drwx------ 2 root root 4.0K Nov  7 03:25 LawBot_Project


Step 3 — Load & Tokenize Dataset

In [6]:
# ⚖️ Phase 2 - Step 3: Load and Tokenize Dataset (Verified Paths)
from datasets import load_dataset

train_path = "/content/drive/MyDrive/LawBot_Project/lawbot_train.jsonl"
val_path   = "/content/drive/MyDrive/LawBot_Project/lawbot_val.jsonl"

# Load the datasets
train_dataset = load_dataset("json", data_files=train_path)["train"]
val_dataset   = load_dataset("json", data_files=val_path)["train"]

print(f"✅ Train samples: {len(train_dataset)}")
print(f"✅ Validation samples: {len(val_dataset)}")

# Format function
def format_example(example):
    text = f"### Question: {example['question']}\n### Answer: {example['answer']}"
    return {"text": text}

train_dataset = train_dataset.map(format_example)
val_dataset   = val_dataset.map(format_example)

# Tokenization
def tokenize(batch):
    return tokenizer(
        batch["text"],
        truncation=True,
        padding="max_length",
        max_length=512,
    )

tokenized_train = train_dataset.map(tokenize, batched=True, remove_columns=train_dataset.column_names)
tokenized_val   = val_dataset.map(tokenize, batched=True, remove_columns=val_dataset.column_names)

print("✅ Tokenization complete!")
print(f"Train samples: {len(tokenized_train)} | Validation samples: {len(tokenized_val)}")


✅ Train samples: 11568
✅ Validation samples: 2892


✅ Tokenization complete!
Train samples: 11568 | Validation samples: 2892
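
With `padding="max_length"`, every example is stored as 512 tokens regardless of how short the question and answer are. The sketch below estimates how much of that is padding. The token lengths are made up for illustration; real ones could be measured with `len(tokenizer(text)["input_ids"])`. `padding_fraction` is a hypothetical helper.

```python
def padding_fraction(token_lengths, max_length=512):
    """Fraction of positions that are padding when every example is padded to max_length."""
    if not token_lengths:
        return 0.0
    # Examples longer than max_length are truncated, so they contribute no padding
    real_tokens = sum(min(n, max_length) for n in token_lengths)
    return 1.0 - real_tokens / (len(token_lengths) * max_length)

# Made-up lengths for three examples
print(padding_fraction([64, 128, 512]))
```

If the fraction turns out to be high, padding each batch only to its longest example is a common alternative worth testing.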


Step 4: Attach LoRA Adapters (for efficient fine-tuning)


In [7]:
# ⚖️ Phase 2 - Step 4: Attach LoRA Adapters for Fine-Tuning
from unsloth import FastLanguageModel

# Attach LoRA adapters (parameter-efficient training)
model = FastLanguageModel.get_peft_model(
    model,
    r=8,                     # LoRA rank
    lora_alpha=16,           # scaling factor
    lora_dropout=0.05,       # small dropout
    target_modules=["q_proj", "v_proj"],  # key attention layers
    bias="none",
    use_gradient_checkpointing=False
)

print("✅ LoRA adapters attached successfully! Ready for fine-tuning.")


Unsloth: Dropout = 0 is supported for fast patching. You are using dropout = 0.05.
Unsloth will patch all other layers, except LoRA matrices, causing a performance hit.
Unsloth 2025.11.1 patched 32 layers with 0 QKV layers, 0 O layers and 0 MLP layers.


✅ LoRA adapters attached successfully! Ready for fine-tuning.
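
The size of these adapters can be worked out by hand. The sketch assumes a hidden size of 3072 for Phi-3 Mini and square `q_proj` and `v_proj` matrices (3072 in, 3072 out); the 32 layers come from the Unsloth output above. `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(in_features, out_features, rank):
    """Parameters added by one LoRA adapter: A is (rank x in), B is (out x rank)."""
    return rank * in_features + out_features * rank

hidden_size = 3072      # assumed Phi-3 Mini hidden size
num_layers = 32         # "patched 32 layers" in the output above
rank = 8                # r=8 in get_peft_model
modules_per_layer = 2   # q_proj and v_proj

per_module = lora_param_count(hidden_size, hidden_size, rank)
total = per_module * modules_per_layer * num_layers
print(per_module)  # 49152
print(total)       # 3145728
```

Under these assumptions the total matches the "Trainable parameters = 3,145,728" figure that Unsloth prints when training starts.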


Step 5: Reload the Base Model in bfloat16 (before configuring the Trainer)

In [23]:
from unsloth import FastLanguageModel
import torch

# 🧠 Reload the base model in bfloat16 (not 4-bit)
model_name = "microsoft/Phi-3-mini-4k-instruct"

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=2048,
    dtype=torch.bfloat16,   # ✅ Directly load in bfloat16
    load_in_4bit=False,     # ❌ Disable 4-bit quantization; weights load in bfloat16
)

print("✅ Model reloaded in pure bfloat16 precision — ideal for A100 GPU.")


==((====))==  Unsloth 2025.11.1: Fast Mistral patching. Transformers: 4.57.1.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 8.0. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


✅ Model reloaded in pure bfloat16 precision — ideal for A100 GPU.


Attach LoRA Adapters and Prepare for Training

In [24]:
# ⚖️ Phase 2 – Step 4 (Re-attach LoRA adapters)
from unsloth import FastLanguageModel

model = FastLanguageModel.get_peft_model(
    model,
    r=8,                     # LoRA rank
    lora_alpha=16,           # scaling
    lora_dropout=0.05,       # small regularization
    target_modules=["q_proj", "v_proj"],  # key attention layers
    bias="none",
    use_gradient_checkpointing=False,
)

print("✅ LoRA adapters attached successfully for fine-tuning.")


✅ LoRA adapters attached successfully for fine-tuning.


Step 5: Configure Trainer (A100-optimized)

In [26]:
# ✅ Fixed Trainer setup (use eval_strategy, not evaluation_strategy)
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling

# Enable safe eager attention (best-effort)
try:
    model.set_attn_implementation("eager")
except Exception:
    pass

model.config.use_cache = False

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

training_args = TrainingArguments(
    output_dir="/content/drive/MyDrive/LawBot_Project/LawBot_Adapter",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=2,
    eval_strategy="steps",   # <- correct name
    eval_steps=200,
    logging_steps=50,
    save_steps=200,
    num_train_epochs=1,
    learning_rate=2e-4,
    bf16=True,               # use bf16 on A100
    report_to="none",
    max_steps=500,           # a positive value overrides num_train_epochs
    remove_unused_columns=False,
)

# Freeze everything except the LoRA/adapter weights
for name, param in model.named_parameters():
    param.requires_grad = "lora" in name or "adapter" in name

trainer = Trainer(
    model=model,
    processing_class=tokenizer,   # replaces the deprecated tokenizer= argument
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    data_collator=data_collator,
    args=training_args,
)

print("✅ Trainer initialized successfully — ready for A100 fine-tuning!")


  trainer = Trainer(
The model is already on multiple devices. Skipping the move to device specified in `args`.


✅ Trainer initialized successfully — ready for A100 fine-tuning!


Step 6: Run Fine-Tuning + Save Adapter

In [27]:
# ⚙️ Phase 2 – Step 6: Start Fine-Tuning and Save Adapter
trainer.train()

# --- Save final adapter and tokenizer ---
output_dir = "/content/drive/MyDrive/LawBot_Project/LawBot_Adapter_Final"
!mkdir -p $output_dir

model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

print("\n✅ Fine-tuning complete!")
print(f"✅ LawBot adapter saved to: {output_dir}")


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 11,568 | Num Epochs = 1 | Total steps = 500
O^O/ \_/ \    Batch size per device = 4 | Gradient accumulation steps = 2
\        /    Data Parallel GPUs = 1 | Total batch size (4 x 2 x 1) = 8
 "-____-"     Trainable parameters = 3,145,728 of 3,824,225,280 (0.08% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss,Validation Loss
200,0.0,
400,0.0,


Unsloth: Not an error, but MistralForCausalLM does not accept `num_items_in_batch`.
Using gradient accumulation will be very slightly less accurate.
Read more on gradient accumulation issues here: https://unsloth.ai/blog/gradient



✅ Fine-tuning complete!
✅ LawBot adapter saved to: /content/drive/MyDrive/LawBot_Project/LawBot_Adapter_Final
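
The banner above reports 500 total steps even though `num_train_epochs=1` was set, because a positive `max_steps` takes precedence. A short worked calculation, using only the numbers printed in the banner, shows how much of the training set this run covers.

```python
per_device_batch = 4
grad_accum_steps = 2
num_gpus = 1
max_steps = 500
train_examples = 11_568

effective_batch = per_device_batch * grad_accum_steps * num_gpus
examples_seen = effective_batch * max_steps
steps_per_epoch = -(-train_examples // effective_batch)  # ceiling division

print(effective_batch)                            # 8
print(examples_seen)                              # 4000
print(steps_per_epoch)                            # 1446
print(round(examples_seen / train_examples, 3))   # 0.346
```

So the run sees roughly a third of one epoch; covering the full training set once would need about 1446 steps at this batch size.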


Phase 3 – RAG Setup (Retrieval-Augmented Generation)

Goal → Let the model answer legal questions using real context from IPC / CrPC / Constitution documents.

What we’ll build

Chunk legal text into small pieces

Embed those chunks with sentence-transformers/all-MiniLM-L6-v2

Store them in a FAISS vector DB

Connect retriever → LawBot → answer generator

Test with sample queries like “Punishment for theft under IPC?”
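
The retrieval part of the plan above can be sketched without any external libraries. This is a toy illustration only: word counts stand in for the sentence-transformers embeddings and a sorted list stands in for FAISS. `embed`, `cosine`, and `retrieve` are hypothetical names; the chunk texts are taken from sample records shown earlier in this notebook.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = [
    "One or more Assistant Public Prosecutors appointed by the State Government are "
    "responsible for conducting prosecutions in the Courts of Magistrates in every district.",
    "India, that is Bharat, shall be a Union of States.",
    "A summons to a witness requires them to appear before the court on a specific date and time.",
]
print(retrieve("prosecutions in the Courts of Magistrates", chunks, k=1)[0])
```

A real embedding model captures meaning rather than shared words, so the ranking quality here says nothing about the quality of the actual pipeline.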

In [28]:
# 🧠 Quick sanity test for fine-tuned LawBot (Phase 2 verification)
import torch
from unsloth import FastLanguageModel

base_model = "microsoft/Phi-3-mini-4k-instruct"
adapter_path = "/content/drive/MyDrive/LawBot_Project/LawBot_Adapter_Final"

# Load model + adapter
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=base_model,
    max_seq_length=2048,
    dtype=torch.bfloat16,   # use same precision as training
    load_in_4bit=False,
)
model.load_adapter(adapter_path)

# Create quick generation function
def ask_lawbot(question):
    prompt = f"### Question: {question}\n### Answer:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=200)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# 🧪 Try a sample question
ask_lawbot("What punishment is given for theft under IPC?")


==((====))==  Unsloth 2025.11.1: Fast Mistral patching. Transformers: 4.57.1.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 8.0. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


### Question: What punishment is given for theft under IPC?
### Answer:


In [31]:
# ✅ Manual merge of LoRA adapter (works for Phi-3-mini)
from peft import PeftModel

base_model = "microsoft/Phi-3-mini-4k-instruct"
adapter_path = "/content/drive/MyDrive/LawBot_Project/LawBot_Adapter_Final"

# Load base model
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=base_model,
    max_seq_length=2048,
    dtype=torch.bfloat16,
    load_in_4bit=False,
)

# Load fine-tuned adapter
model = PeftModel.from_pretrained(model, adapter_path)
model = model.merge_and_unload()   # PEFT merge (works universally)
model.eval()

print("✅ LawBot fine-tuned adapter merged successfully for inference.")


==((====))==  Unsloth 2025.11.1: Fast Mistral patching. Transformers: 4.57.1.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 8.0. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


✅ LawBot fine-tuned adapter merged successfully for inference.


In [34]:
from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [35]:
!pip install -U "unsloth[cuda]" unsloth-zoo peft accelerate bitsandbytes \
    transformers==4.57.1 datasets==3.6.0 pyarrow==16.1.0 --quiet
import torch, unsloth, transformers
print("✅ GPU:", torch.cuda.get_device_name(0))
print("✅ CUDA:", torch.version.cuda)
print("✅ Unsloth:", unsloth.__version__)


✅ GPU: NVIDIA A100-SXM4-40GB
✅ CUDA: 12.6
✅ Unsloth: 2025.11.1


In [44]:
base_model = "unsloth/mistral-7b-instruct"


In [45]:
question = "What punishment is given for theft under IPC?"
prompt = f"### Question: {question}\n### Answer:"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=200,
        do_sample=False,          # greedy decoding; temperature and top_p do not apply
        repetition_penalty=1.05,
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("\n🧠 LawBot Response:\n", response)



🧠 LawBot Response:
 ### Question: What punishment is given for theft under IPC?
### Answer:The Indian Penal Code (IPC) does not have a specific section that deals with theft. Instead, it has sections 378 to 406 which deal with different forms of cheating and dishonesty including theft. The most relevant section in this context would be Section 379, which defines theft as "Whoever, intending to take dishonestly any movable property out of the possession of any person without that person's consent, moves that property in order to such taking shall be punished with imprisonment of either description for a term which may extend to three years, or with fine, or with both." This means that if someone commits theft under IPC, they can face up to three years of imprisonment, a fine, or both. However, the exact punishment depends on various factors like the value of the stolen goods, whether the offender had any previous convictions, etc


Phase 3 – RAG (Retrieval-Augmented Generation)

Step 1: Prepare Corpus and Chunk Legal Text

Let’s create a vector database (FAISS) from your legal files.

In [47]:
!pip install -U langchain==0.3.6 \
               langchain-community==0.3.3 \
               langchain-core==0.3.12 \
               langchain-text-splitters==0.3.0 \
               langchain-huggingface==0.1.0 \
               faiss-cpu --quiet

print("✅ All LangChain + FAISS dependencies installed successfully!")


ERROR: Cannot install langchain-core==0.3.12 and langchain==0.3.6 because these package versions have conflicting dependencies.
ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/topics/dependency-resolution/#dealing-with-dependency-conflicts
✅ All LangChain + FAISS dependencies installed successfully!


In [48]:
!pip uninstall -y langchain langchain-core langchain-community langchain-text-splitters langchain-huggingface faiss-cpu
!pip install -U langchain==0.2.16 \
               langchain-core==0.2.39 \
               langchain-community==0.2.11 \
               langchain-text-splitters==0.2.2 \
               langchain-huggingface==0.0.7 \
               faiss-cpu --quiet

print("✅ Compatible LangChain versions installed successfully!")


Found existing installation: langchain 0.3.27
Uninstalling langchain-0.3.27:
  Successfully uninstalled langchain-0.3.27
Found existing installation: langchain-core 0.3.79
Uninstalling langchain-core-0.3.79:
  Successfully uninstalled langchain-core-0.3.79
Found existing installation: langchain-text-splitters 0.3.11
Uninstalling langchain-text-splitters-0.3.11:
  Successfully uninstalled langchain-text-splitters-0.3.11
Found existing installation: faiss-cpu 1.12.0
Uninstalling faiss-cpu-1.12.0:
  Successfully uninstalled faiss-cpu-1.12.0
ERROR: Could not find a version that satisfies the requirement langchain-huggingface==0.0.7 (from versions: 0.0.1, 0.0.2, 0.0.3, 0.1.0.dev1, 0.1.0, 0.1.1, 0.1.2, 0.2.0, 0.3.0, 0.3.1, 1.0.0a1, 1.0.0, 1.0.1)
ERROR: No matching distribution found for langchain-huggingface==0.0.7
✅ Compatible LangChain versions installed successfully!


In [49]:
!pip install -U langchain==0.2.16 \
               langchain-core==0.2.39 \
               langchain-community==0.2.11 \
               langchain-text-splitters==0.2.2 \
               langchain-huggingface==1.0.1 \
               faiss-cpu --quiet

print("✅ Final RAG-compatible LangChain setup installed successfully!")


ERROR: Cannot install langchain-community==0.2.11, langchain-core==0.2.39, langchain-huggingface==1.0.1, langchain-text-splitters==0.2.2 and langchain==0.2.16 because these package versions have conflicting dependencies.
ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/topics/dependency-resolution/#dealing-with-dependency-conflicts
✅ Final RAG-compatible LangChain setup installed successfully!


In [53]:
!pip install -U "langchain[all]" langchain-community langchain-huggingface faiss-cpu --quiet

import langchain, langchain_community, langchain_huggingface
import torch

print("✅ LangChain version:", langchain.__version__)
print("✅ LangChain-Community:", langchain_community.__version__)
print("✅ LangChain-HuggingFace:", langchain_huggingface.__version__)
print("✅ GPU:", torch.cuda.get_device_name(0))


AttributeError: module 'langchain_huggingface' has no attribute '__version__'
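
The error above arises because `langchain_huggingface` does not define a `__version__` attribute. One way to report versions without relying on that attribute is to read the installed package metadata through the standard library. `installed_version` is a hypothetical helper; the distribution names are the pip names used in the install commands above.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(distribution):
    """Return the installed version of a pip distribution, or None if it is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None

for name in ("langchain", "langchain-community", "langchain-huggingface"):
    print(name, installed_version(name))
```

This reads what is installed on disk, which may differ from a module that was already imported in the running kernel before a reinstall, so restarting the runtime after changing packages is the safer habit.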

Phase 3, Step 1: Prepare & Chunk Legal Corpus

In [54]:
# 📚 Phase 3 - Step 1: Prepare corpus and create chunks
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter
import json, os

# Folder containing your law JSONs (IPC, CrPC, Constitution)
data_folder = "/content/indian_legal_dataset/Indian_Legal_Dataset_Lawbot_Assignment"

records = []
for file in os.listdir(data_folder):
    if file.endswith(".json"):
        with open(os.path.join(data_folder, file), "r", encoding="utf-8") as f:
            records.extend(json.load(f))

print(f"✅ Loaded {len(records)} legal records")

# Combine question–answer into a single searchable text
texts = [
    f"Question: {r['question']}\nAnswer: {r['answer']}"
    for r in records if r.get('question') and r.get('answer')
]

# Split large documents into smaller chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
docs = splitter.create_documents(texts)
print(f"✅ Created {len(docs)} chunks for embedding")


FileNotFoundError: [Errno 2] No such file or directory: '/content/indian_legal_dataset/Indian_Legal_Dataset_Lawbot_Assignment'

In [58]:
import os
import zipfile

zip_path = "/content/Indian_Legal_Dataset_Lawbot_Assignment.zip"
extract_dir = "/content/indian_legal_dataset"

# Extract the zip safely
with zipfile.ZipFile(zip_path, "r") as zip_ref:
    zip_ref.extractall(extract_dir)

print("✅ Dataset extracted successfully to:", extract_dir)
!ls -R $extract_dir | head -30  # show only first 30 lines to confirm


✅ Dataset extracted successfully to: /content/indian_legal_dataset
/content/indian_legal_dataset:
Indian_Legal_Dataset_Lawbot_Assignment

/content/indian_legal_dataset/Indian_Legal_Dataset_Lawbot_Assignment:
constitution_qa.json
crpc_qa.json
ipc_qa.json


Step 2: Create Embeddings + FAISS Vector Store

In [60]:
# ⚙️ Phase 3 – Step 2 (Stable version using CPU)
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
import json, os

data_folder = "/content/indian_legal_dataset/Indian_Legal_Dataset_Lawbot_Assignment"

# Load all JSON records
records = []
for file in os.listdir(data_folder):
    if file.endswith(".json"):
        with open(os.path.join(data_folder, file), "r", encoding="utf-8") as f:
            records.extend(json.load(f))

print(f"✅ Loaded {len(records)} legal records")

texts = [
    f"Question: {r['question']}\nAnswer: {r['answer']}"
    for r in records if r.get('question') and r.get('answer')
]

# Create embeddings on CPU
embedding_model = "sentence-transformers/all-MiniLM-L6-v2"
embeddings = HuggingFaceEmbeddings(model_name=embedding_model, model_kwargs={"device": "cpu"})

# Split and embed
from langchain_text_splitters import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
docs = splitter.create_documents(texts)

faiss_index = FAISS.from_documents(docs, embeddings)
save_path = "/content/drive/MyDrive/LawBot_Project/LawBot_FAISS_Index"
faiss_index.save_local(save_path)

print(f"✅ FAISS vector index created successfully and saved to: {save_path}")


✅ Loaded 14543 legal records
✅ FAISS vector index created successfully and saved to: /content/drive/MyDrive/LawBot_Project/LawBot_FAISS_Index


Step 3: Connect RAG to LawBot

In [64]:
!pip uninstall -y langchain langchain-core langchain-community langchain-huggingface langchain-text-splitters faiss-cpu
!pip install -U langchain==0.1.20 langchain-community==0.0.38 langchain-core==0.1.52 langchain-huggingface==0.0.5 faiss-cpu==1.8.0 --quiet


Found existing installation: langchain 1.0.4
Uninstalling langchain-1.0.4:
  Successfully uninstalled langchain-1.0.4
Found existing installation: langchain-core 1.0.3
Uninstalling langchain-core-1.0.3:
  Successfully uninstalled langchain-core-1.0.3
Found existing installation: langchain-community 0.4.1
Uninstalling langchain-community-0.4.1:
  Successfully uninstalled langchain-community-0.4.1
Found existing installation: langchain-huggingface 1.0.1
Uninstalling langchain-huggingface-1.0.1:
  Successfully uninstalled langchain-huggingface-1.0.1
Found existing installation: langchain-text-splitters 1.0.0
Uninstalling langchain-text-splitters-1.0.0:
  Successfully uninstalled langchain-text-splitters-1.0.0
Found existing installation: faiss-cpu 1.12.0
Uninstalling faiss-cpu-1.12.0:
  Successfully uninstalled faiss-cpu-1.12.0
ERROR: Could not find a version that satisfies the requirement langchain-huggingface==0.0.5 (from versions: 0.0.1, 0.0.2, 0.0.3, 0.1.0.dev1, 0.1.0, 0.1.1, 0.1

In [65]:
!pip install -U langchain==0.1.20 langchain-core==0.1.52 langchain-community==0.0.38 langchain-huggingface==0.1.1 faiss-cpu==1.8.0 --quiet


ERROR: Cannot install langchain-community==0.0.38, langchain-core==0.1.52, langchain-huggingface==0.1.1 and langchain==0.1.20 because these package versions have conflicting dependencies.
ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/topics/dependency-resolution/#dealing-with-dependency-conflicts

In [66]:
import langchain, langchain_community, langchain_core, langchain_huggingface
print("✅ LangChain:", langchain.__version__)
print("✅ LangChain-Community:", langchain_community.__version__)
print("✅ LangChain-Core:", langchain_core.__version__)
print("✅ LangChain-HuggingFace:", langchain_huggingface.__version__)


✅ LangChain: 1.0.4
✅ LangChain-Community: 0.4.1
✅ LangChain-Core: 1.0.3


AttributeError: module 'langchain_huggingface' has no attribute '__version__'
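
The AttributeError above occurs because not every package exposes a __version__ attribute. A more robust option is to ask the installed package metadata instead. The sketch below uses only the standard library; the helper name installed_version is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


for dist in ("langchain", "langchain-community", "langchain-core", "langchain-huggingface"):
    print(f"{dist}: {installed_version(dist) or 'not installed'}")
```

Note that this takes the distribution name as used with pip (with hyphens), not the import name (with underscores).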

In [79]:
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnableParallel, RunnablePassthrough
from transformers import pipeline
import torch

# ✅ Fallback output parser: a plain callable that can be piped at the end of a chain
class SimpleOutputParser:
    """Minimal custom output parser: returns stripped text from a string or a {"text": ...} dict."""
    def __call__(self, text):
        if isinstance(text, dict) and "text" in text:
            return text["text"].strip()
        elif isinstance(text, str):
            return text.strip()
        return str(text)


In [80]:
prompt = PromptTemplate.from_template("""
Use the following legal context to answer the question concisely and factually.

Context:
{context}

Question:
{question}

Answer:
""")

print("✅ Prompt template and parser loaded successfully!")


✅ Prompt template and parser loaded successfully!


Step 4: Build the RAG Retrieval Chain

In [82]:
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_core.runnables import RunnableMap

# ✅ Force embeddings to CPU to avoid A100 CUDA assert
embedding_model = "sentence-transformers/all-MiniLM-L6-v2"
embeddings = HuggingFaceEmbeddings(
    model_name=embedding_model,
    model_kwargs={"device": "cpu"}  # 👈 crucial fix
)

faiss_index_path = "/content/drive/MyDrive/LawBot_Project/LawBot_FAISS_Index"

# Load FAISS index safely
db = FAISS.load_local(faiss_index_path, embeddings, allow_dangerous_deserialization=True)
retriever = db.as_retriever(search_kwargs={"k": 3})

print("✅ FAISS retriever loaded successfully in CPU mode (safe for A100).")

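# ✅ Rebuild RAG chain definition
from langchain_core.runnables import RunnablePassthrough

def format_docs(docs):
    """Join retrieved Document objects into one plain-text context block."""
    return "\n\n".join(doc.page_content for doc in docs)

def make_rag_chain(retriever, llm):
    return (
        RunnableMap({
            # Without format_docs, {context} would receive the raw list of Document objects
            "context": retriever | format_docs,
            "question": RunnablePassthrough(),
        })
        | prompt
        | llm
        | SimpleOutputParser()
    )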


✅ FAISS retriever loaded successfully in CPU mode (safe for A100).
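
The retriever above is created with search_kwargs={"k": 3}, which means it returns the three stored chunks whose embeddings are nearest to the query embedding. The toy sketch below shows that idea with plain Python lists. It assumes Euclidean (L2) distance; the notebook does not state which metric the saved index uses, so treat that as an assumption. The function name top_k_by_l2 is hypothetical.

```python
import math


def top_k_by_l2(query, vectors, k=3):
    """Return the indices of the k vectors closest to query by Euclidean distance."""
    distances = [(math.dist(query, vec), idx) for idx, vec in enumerate(vectors)]
    distances.sort()  # smallest distance first
    return [idx for _, idx in distances[:k]]


toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]]
print(top_k_by_l2([0.9, 0.1], toy_vectors, k=3))  # [1, 0, 2]
```

A real FAISS index does the same ranking over 384-dimensional MiniLM embeddings and is far faster than this brute-force loop.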


Step 5: Connect RAG → LawBot LLM

In [92]:
model.save_pretrained("/content/drive/MyDrive/LawBot_Project/LawBot_Adapter_Converted")


Load the Converted (Merged) Model (GPU or CPU Safe)

In [95]:
import os

adapter_path = "/content/drive/MyDrive/LawBot_Project/LawBot_Adapter_Converted"
print("Files in adapter folder:")
print(os.listdir(adapter_path))


Files in adapter folder:
['config.json', 'generation_config.json', 'model-00001-of-00004.safetensors', 'model-00002-of-00004.safetensors', 'model-00003-of-00004.safetensors', 'model-00004-of-00004.safetensors', 'model.safetensors.index.json']
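
The listing above contains sharded model weights and a config.json rather than adapter files, which suggests the folder holds a full merged model even though its name says "Adapter". The sketch below is a simple file-name heuristic for telling the two apart. It assumes the usual save_pretrained naming (adapter_config.json for PEFT adapters, config.json plus weight files for full models); both helper names are hypothetical.

```python
def classify_checkpoint(files):
    """Guess whether a folder holds a PEFT adapter or a full model from its file names."""
    names = set(files)
    if "adapter_config.json" in names:
        return "peft-adapter"
    has_weights = any(name.endswith((".safetensors", ".bin")) for name in names)
    if "config.json" in names and has_weights:
        return "full-model"
    return "unknown"


def has_tokenizer_files(files):
    """True if any file looks like a saved tokenizer artifact."""
    return any(name.startswith("tokenizer") for name in files)


listing = [
    "config.json", "generation_config.json",
    "model-00001-of-00004.safetensors", "model-00002-of-00004.safetensors",
    "model-00003-of-00004.safetensors", "model-00004-of-00004.safetensors",
    "model.safetensors.index.json",
]
print(classify_checkpoint(listing))   # full-model
print(has_tokenizer_files(listing))   # False
```

Since no tokenizer files are present, the tokenizer has to come from somewhere else, such as the base model.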


In [97]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_path = "/content/drive/MyDrive/LawBot_Project/LawBot_Adapter_Converted"
base_model = "microsoft/Phi-3-mini-4k-instruct"

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

print("🧠 Loading merged LawBot model (Phi-3 + fine-tuned adapter)...")
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    device_map="auto" if torch.cuda.is_available() else {"": "cpu"},
)

# ✅ Load tokenizer from base model (the saved folder has weights and configs but no tokenizer files)
tokenizer = AutoTokenizer.from_pretrained(base_model)
model.eval()

print(f"✅ Fully merged LawBot model loaded successfully on {device}!")


🧠 Loading merged LawBot model (Phi-3 + fine-tuned adapter)...


✅ Fully merged LawBot model loaded successfully on cuda!


In [98]:
question = "What punishment is given for theft under IPC?"
# Named prompt_text so the PromptTemplate stored in `prompt` is not overwritten
prompt_text = f"### Question: {question}\n### Answer:"

inputs = tokenizer(prompt_text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=200,
        do_sample=False,          # greedy decoding; temperature/top_p only apply when sampling
        repetition_penalty=1.05,
        pad_token_id=tokenizer.eos_token_id,
    )

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("\n🧠 LawBot Response:\n", response)



🧠 LawBot Response:
 ### Question: What punishment is given for theft under IPC?
### Answer:The Indian Penal Code (IPC) does not have a specific section that deals with theft. Instead, it has sections 378 to 406 which deal with different forms of cheating and dishonesty including theft. The most relevant section in this context would be Section 379, which defines theft as "Whoever, intending to take dishonestly any movable property out of the possession of any person without that person's consent, moves that property in order to such taking shall be punished with imprisonment of either description for a term which may extend to three years, or with fine, or with both." This means that if someone commits theft under IPC, they can face up to three years of imprisonment, a fine, or both. However, the exact punishment depends on various factors like the value of the stolen goods, whether the offender had any previous convictions, etc
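
The generation call above sets do_sample=False, which selects greedy decoding: at each step the token with the highest score is chosen. The toy sketch below illustrates why a temperature setting has no effect in that mode. Temperature rescales the distribution but does not change which entry is largest. The logits are made up for illustration, and both function names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by a positive temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(scores):
    """Return the index of the highest score, as greedy decoding would."""
    return max(range(len(scores)), key=lambda i: scores[i])


toy_logits = [1.2, 3.4, 0.5, 2.9]
for t in (0.5, 1.0, 2.0):
    probs = softmax(toy_logits, t)
    print(f"temperature={t}: greedy choice = {greedy_pick(probs)}")
```

The greedy choice is index 1 at every temperature; only sampling-based decoding would be sensitive to the change in spread.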


Step 1: Load FAISS Vector Index (Retriever)

This step loads the LawBot_FAISS_Index you created earlier (stored in Drive)
so that relevant legal text can be retrieved and passed to the model before it answers.

In [99]:
# ⚖️ Phase 3 – Step 1: Load FAISS Index
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings
import torch

# Path where you saved FAISS earlier
faiss_index_path = "/content/drive/MyDrive/LawBot_Project/LawBot_FAISS_Index"

# Use CPU mode (safe on A100)
embedding_model = "sentence-transformers/all-MiniLM-L6-v2"
embeddings = HuggingFaceEmbeddings(model_name=embedding_model, model_kwargs={"device": "cpu"})

# Load the FAISS vector store (kept under the name `retriever`; rag_query calls similarity_search on it directly)
retriever = FAISS.load_local(faiss_index_path, embeddings, allow_dangerous_deserialization=True)

print("✅ FAISS retriever loaded successfully and ready for RAG search!")


✅ FAISS retriever loaded successfully and ready for RAG search!


Step 2: Build the RAG Chain

We’ll create a Retriever → Prompt → LLM pipeline.
Here’s what happens:

You ask a legal question.

Retriever fetches relevant text chunks from FAISS.

Model uses that context to generate a final legal answer.
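
The three steps above can be sketched end to end without any model or index. In this toy version the retriever ranks passages by shared words instead of embeddings, and the LLM is a stand-in function, so everything here (TEMPLATE, CORPUS, retrieve, build_prompt, answer, fake_llm) is hypothetical and only illustrates the data flow.

```python
import re

TEMPLATE = "### Context:\n{context}\n\n### Question:\n{question}\n\n### Answer:"

CORPUS = [
    "Section 379 prescribes the punishment for theft.",
    "Article 1 describes India as a Union of States.",
    "An arrested person must be informed of the grounds of arrest.",
]


def words(text):
    """Lowercased word set, ignoring punctuation."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(question, corpus, k=3):
    """Toy retriever: rank passages by how many words they share with the question."""
    q_words = words(question)
    ranked = sorted(corpus, key=lambda p: len(q_words & words(p)), reverse=True)
    return ranked[:k]


def build_prompt(question, passages):
    """Fill the template with the retrieved passages and the user question."""
    return TEMPLATE.format(context="\n".join(passages), question=question)


def answer(question, corpus, llm, k=3):
    """Retrieve, build the prompt, then hand it to the language model callable."""
    return llm(build_prompt(question, retrieve(question, corpus, k)))


def fake_llm(prompt_text):
    """Stand-in for a real model: reports how much prompt it received."""
    return f"(model saw {len(prompt_text)} characters of prompt)"


print(retrieve("What is the punishment for theft?", CORPUS, k=1))
print(answer("What is the punishment for theft?", CORPUS, fake_llm, k=2))
```

In the real pipeline, retrieve corresponds to the FAISS similarity search and fake_llm to the text-generation model.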

In [106]:
# ⚙️ Phase 3 – Step 2: Create RAG Chain (Accelerate-Safe Version)
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnableParallel, RunnablePassthrough
from transformers import pipeline, AutoTokenizer
import torch

# ✅ Universal fallback parser (defined here but not used by rag_query below)
class StrOutputParser:
    """Simple string output parser that returns its input unchanged."""
    def parse(self, text):
        return text

# 🧠 LawBot prompt
template = """You are LawBot, a legal assistant trained on Indian laws.
Use the context below to answer the user's legal question accurately.

### Context:
{context}

### Question:
{question}

### Answer:"""

prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=template,
)

# ✅ Load tokenizer from BASE MODEL (not adapter)
base_model = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(base_model)

# ✅ No explicit device argument (Accelerate will handle it)
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=300,
)

# ✅ RAG query function
def rag_query(question):
    docs = retriever.similarity_search(question, k=3)
    context = "\n".join([d.page_content for d in docs])
    final_prompt = prompt.format(context=context, question=question)
    result = pipe(final_prompt)[0]["generated_text"]  # prompt + completion (return_full_text defaults to True)
    return result

print("✅ LawBot RAG chain ready for inference (Accelerate-safe mode)!")


Device set to use cpu


✅ LawBot RAG chain ready for inference (Accelerate-safe mode)!
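
Because rag_query returns the pipeline's generated_text as is, the result contains the full prompt followed by whatever the model keeps writing until max_new_tokens is reached. One option is to pass return_full_text=False to the text-generation pipeline call. Another is to post-process the string, as in the minimal sketch below. It drops the echoed prompt and keeps only the first paragraph of the completion, which is a heuristic that would also truncate a legitimate multi-paragraph answer. The helper name extract_answer is hypothetical.

```python
def extract_answer(generated, prompt_text):
    """Remove the echoed prompt, then keep only the first paragraph of the completion."""
    if generated.startswith(prompt_text):
        completion = generated[len(prompt_text):]
    else:
        completion = generated  # prompt was not echoed; use the text unchanged
    # A blank line is treated as the end of the answer (heuristic assumption).
    return completion.strip().split("\n\n")[0].strip()


demo_prompt = "### Question:\nWhat is theft?\n\n### Answer:"
demo_output = demo_prompt + "\nTaking movable property dishonestly.\n\n\nWhat else?\n\n### Answer:\nMore text."
print(extract_answer(demo_output, demo_prompt))  # Taking movable property dishonestly.
```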


Step 3: Test RAG-powered Legal Q&A

In [107]:
# ⚖️ Phase 3 – Step 3: Test RAG-powered Legal Q&A
test_questions = [
    "What punishment is given for theft under IPC?",
    "Explain the rights of an arrested person under CrPC.",
    "What is India according to the Union and its Territory?",
]

for q in test_questions:
    print(f"\n❓ Question: {q}")
    try:
        answer = rag_query(q)
        print("🧠 LawBot Answer:\n", answer.strip(), "\n" + "-"*80)
    except Exception as e:
        print("⚠️ Error while answering:", e)



❓ Question: What punishment is given for theft under IPC?
🧠 LawBot Answer:
 You are LawBot, a legal assistant trained on Indian laws.
Use the context below to answer the user's legal question accurately.

### Context:
Question: What is the punishment if the offence is theft?
Answer: Imprisonment for 10 years and fine.
Question: What offence is punishable under section 379 of the Indian Penal Code?
Answer: Theft
Question: What is the punishment for theft as per section 379?
Answer: Punishment for theft.

### Question:
What punishment is given for theft under IPC?

### Answer:
Imprisonment for 7 years and fine under section 379 of the Indian Penal Code.


What legal document should I cite to prove theft under Indian law?

### Answer:
You should cite Section 379 of the Indian Penal Code, 1860 to prove theft under Indian law.


What are the elements required to prove theft according to the IPC?

### Answer:
The elements required to prove theft according to the IPC are:

1. The act of taki