# LawLite: Fine-tuned LLaMA 2 Deployment with Gradio

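This notebook demonstrates how to deploy a fine-tuned LLaMA 2 model using Gradio.  
It covers loading the model, building the interface, and launching the app. The cells are written for Google Colab (they use `google.colab.files` and the `/content/` directory), so running locally or on Hugging Face Spaces requires adapting the upload and path steps.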


## 📦 Installing Required Packages
We begin by installing the necessary libraries for this project:  
- **transformers** → for loading and using the LLaMA-2 model.  
- **peft** → for loading the LoRA adapter produced by parameter-efficient fine-tuning.  
- **bitsandbytes** → for 8-bit/4-bit quantization to save memory.  
- **accelerate** → to place the model on the available devices (required for `device_map="auto"`).  
- **gradio** → to build a simple interactive web UI.  
- **torch** → PyTorch, the deep learning framework powering our model.

In [None]:
# Step 1: Install required packages
!pip install -q transformers
!pip install -q peft
!pip install -q bitsandbytes
!pip install -q accelerate
!pip install -q gradio
!pip install -q torch
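
Because the install commands above do not pin versions, it can help to record what was actually installed. The following is a minimal sketch using only the standard library; the helper name `report_versions` is an illustrative assumption and not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version

# Packages installed in Step 1
REQUIRED = ["transformers", "peft", "bitsandbytes", "accelerate", "gradio", "torch"]

def report_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

for name, ver in report_versions(REQUIRED).items():
    print(f"{name:15s} {ver or 'NOT INSTALLED'}")
```

Keeping this output alongside the notebook makes it easier to reproduce the environment later.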

## 📥 Importing Libraries
Here, we import all the Python libraries needed for deployment.  
This includes PyTorch, Gradio for the UI, Transformers for model handling,  
and additional utilities for model loading, tokenization, and warnings.


In [None]:
# Step 2: Import necessary libraries
import torch
import gradio as gr
import zipfile
import os
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    LlamaTokenizer
)
from peft import PeftModel
from google.colab import files
import warnings
warnings.filterwarnings('ignore')

## ⚡ Checking GPU Availability
Since large models like LLaMA-2 require significant compute,  
we check whether a GPU is available in this environment.  
If available, it prints the GPU name and memory size; otherwise, it falls back to CPU.


In [None]:
# Step 3: Check GPU availability and setup
print("Checking GPU availability...")
if torch.cuda.is_available():
    print(f"✅ GPU Available: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
else:
    print("❌ No GPU available - using CPU (will be slower)")


## 🔑 Logging into Hugging Face
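We log in to the Hugging Face Hub to authenticate.  
This is required here because the base model, `meta-llama/Llama-2-7b-chat-hf`, is gated: your account must have been granted access to it. Authentication is also needed if you’re pulling private models or pushing your model/app to the Hub.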


In [None]:
from huggingface_hub import notebook_login
notebook_login()

## 📂 Uploading the Fine-tuned Model
Here, we upload our **fine-tuned model zip file** into the notebook.  
The uploaded file will later be extracted and used for inference.


In [None]:
# Step 4: Upload and extract your fine-tuned model
print("\n📁 Please upload your fine-tuned model zip file...")
uploaded = files.upload()

In [None]:
# Get the uploaded file name
zip_filename = list(uploaded.keys())[0]
print(f"Uploaded: {zip_filename}")

## 📦 Extracting the Uploaded Model
Once the model zip file is uploaded, we extract it into the working directory.  
This makes the model weights, tokenizer files, and configuration accessible for loading.


In [None]:
# Extract the model
print("Extracting model files...")
with zipfile.ZipFile(zip_filename, 'r') as zip_ref:
    zip_ref.extractall("/content/")
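
Before extracting, it can be useful to inspect the archive and to extract into a dedicated directory rather than directly into `/content/`. The sketch below adds an explicit check that no archive member resolves outside the target directory. The function name `safe_extract` and the choice of target directory are illustrative assumptions, not part of the original notebook.

```python
import os
import zipfile

def safe_extract(zip_path: str, target_dir: str) -> list:
    """Extract zip_path into target_dir, refusing members that would escape it."""
    target_root = os.path.realpath(target_dir)
    with zipfile.ZipFile(zip_path, "r") as zf:
        for member in zf.namelist():
            dest = os.path.realpath(os.path.join(target_root, member))
            # Every member must resolve to a path inside target_root
            if os.path.commonpath([target_root, dest]) != target_root:
                raise ValueError(f"Unsafe path in archive: {member}")
        zf.extractall(target_root)
        return zf.namelist()
```

In the notebook this could be called as `safe_extract(zip_filename, "/content/model")`, where the subdirectory name is hypothetical; `MODEL_PATH` would then need to point inside that directory.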


In [None]:
# List extracted contents to find model path
# (loop variables are named so they do not shadow `files` from google.colab)
print("Extracted files:")
for root, _dirs, filenames in os.walk("/content/"):
    for filename in filenames:
        if filename.endswith(('.bin', '.safetensors')) or filename == 'adapter_config.json':
            print(os.path.join(root, filename))

In [None]:
# You may need to adjust this path based on your zip structure
MODEL_PATH = "/content/finetunedModel/checkpoint-20"
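
Instead of hard-coding the path, the adapter directory can be located by searching for `adapter_config.json`, which PEFT writes next to the adapter weights. The sketch below also reads the `base_model_name_or_path` field from that file, which is a quick way to confirm the adapter was trained on the base model you intend to load. The helper names `find_adapter_dirs` and `read_base_model_name` are hypothetical.

```python
import json
import os

def find_adapter_dirs(root: str) -> list:
    """Return every directory under root that contains adapter_config.json."""
    return sorted(
        dirpath
        for dirpath, _dirnames, filenames in os.walk(root)
        if "adapter_config.json" in filenames
    )

def read_base_model_name(adapter_dir: str):
    """Return the base model recorded in adapter_config.json, or None if absent."""
    config_file = os.path.join(adapter_dir, "adapter_config.json")
    with open(config_file, encoding="utf-8") as f:
        return json.load(f).get("base_model_name_or_path")

# Possible use in the notebook (results depend on your zip structure):
#   for d in find_adapter_dirs("/content/"):
#       print(d, "->", read_base_model_name(d))
```

If several checkpoints are listed, pick the one you want to deploy and assign it to `MODEL_PATH`.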

In [None]:
# Step 5: Configure quantization for memory efficiency
print("\n🔧 Setting up model configuration...")
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
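
To see why 4-bit loading matters, the weight memory can be estimated as parameters × bits per parameter ÷ 8. The sketch below does this arithmetic for a 7-billion-parameter model. The parameter count is rounded, and the estimate ignores quantization metadata, activations, and the KV cache, so real usage will be higher.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 1024**3

NUM_PARAMS = 7e9  # rounded parameter count for a 7B model (assumption)

for label, bits in [("float16/bfloat16", 16), ("8-bit", 8), ("4-bit (nf4)", 4)]:
    print(f"{label:18s} ~{weight_memory_gib(NUM_PARAMS, bits):.1f} GiB")
```

Comparing these figures with the GPU memory printed in Step 3 gives a rough idea of whether the model will fit.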

## ⚙️ Loading the Base Model and Tokenizer
We now load the **base LLaMA model** and its tokenizer from Hugging Face.  
The model is loaded with 4-bit quantization (via bitsandbytes) to reduce memory usage.


In [None]:
# Step 6: Load base model and tokenizer
print("📦 Loading base LLaMA-2-7B model...")
base_model_id = "meta-llama/Llama-2-7b-chat-hf"

In [None]:
# Load tokenizer
tokenizer = LlamaTokenizer.from_pretrained(
    base_model_id,
    use_fast=False,
    trust_remote_code=True,
    add_eos_token=False  # at inference time, do not append EOS to the prompt
)

if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': tokenizer.eos_token})

# Load base model with quantization
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    quantization_config=bnb_config,
    device_map="auto",  # places the weights on the available device(s)
    trust_remote_code=True,
    torch_dtype=torch.bfloat16
)

## 🔗 Attaching Fine-tuned Weights
Using **PEFT (Parameter-Efficient Fine-Tuning)**,  
we load our fine-tuned LoRA adapter on top of the quantized base model.  
The adapter is attached rather than merged, so the base weights stay unchanged while the model applies the domain-specific fine-tuning for legal text simplification.

## 🛠️ Preparing the Model for Inference
The model has already been placed on the appropriate device (GPU/CPU) by `device_map="auto"`.  
Once the adapter is attached, we set the model to **evaluation mode**, which disables training-only behavior such as dropout.


In [None]:
# Step 7: Load your fine-tuned LoRA adapters
print("\n📦 Loading fine-tuned LoRA adapters...")

# 'base_model' must already be loaded (and quantized) by the previous cell
model = None  # stays None if loading fails, so the check below is reliable

try:
    # Attach the adapter weights from the local path to the base model
    model = PeftModel.from_pretrained(base_model, MODEL_PATH)

    print(f"✅ Successfully loaded adapter from {MODEL_PATH}")

except Exception as e:
    print(f"❌ Failed to load adapter from {MODEL_PATH}. Error: {e}")
    print("Please ensure the zip file extracted the adapter files (like adapter_config.json and adapter_model.bin/safetensors) correctly to the specified MODEL_PATH.")
    print(f"You can check the contents of the directory using: !ls -lha {MODEL_PATH}")
    # Report explicitly if adapter_config.json is missing
    config_path = os.path.join(MODEL_PATH, 'adapter_config.json')
    if not os.path.exists(config_path):
        print(f"Error: adapter_config.json not found at {config_path}")

# The 'model' variable now holds the base model with the loaded adapter
# Ensure the model is in evaluation mode for inference
if model is not None:
    model.eval()
    print("Model is ready for inference.")
else:
    print("Model could not be loaded. Please check the model path and extracted files.")
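
To give a sense of how small a LoRA adapter is relative to the base model, the sketch below counts the parameters it adds. For a weight matrix of shape `d_out × d_in`, LoRA of rank `r` adds two low-rank matrices with `r * (d_in + d_out)` parameters in total. The rank and the set of adapted layers are hypothetical here, since the notebook does not state them; the actual values are recorded as `r` and `target_modules` in `adapter_config.json`.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA pair: A (rank x d_in) plus B (d_out x rank)."""
    return rank * (d_in + d_out)

HIDDEN = 4096          # hidden size of LLaMA-2-7B
LAYERS = 32            # transformer layers in LLaMA-2-7B
RANK = 16              # hypothetical LoRA rank (not stated in the notebook)
TARGETS_PER_LAYER = 2  # hypothetical: adapters on q_proj and v_proj only

total = LAYERS * TARGETS_PER_LAYER * lora_param_count(HIDDEN, HIDDEN, RANK)
print(f"Adapter parameters: {total:,} (~{total / 7e9:.2%} of a 7B base model)")
```

Under these assumptions the adapter is a small fraction of the base model, which is why the zip file uploaded earlier is much smaller than the base model download.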

## 📝 Defining the Response Function
Here we define a function that:
1. Accepts user input (a legal question) and formats it into a prompt.  
2. Tokenizes the input.  
3. Runs the model to generate an output.  
4. Decodes the tokens back into natural language.  

This function will later be connected to our Gradio interface.


In [None]:
# Step 8: Define inference function
def generate_legal_response(user_question, max_tokens=512, temperature=0.7):
    """
    Generate response from the fine-tuned legal model
    """
    try:
        # Format the prompt
        prompt = f"Question: {user_question}\nAnswer this legal question accurately and concisely:\n"

        # Tokenize input
        inputs = tokenizer(
            prompt,
            return_tensors="pt",
            truncation=True,
            max_length=512
        ).to(model.device)

        # Generate response
        model.eval()
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=int(max_tokens),  # Gradio sliders may pass floats
                temperature=temperature,
                top_p=0.9,
                repetition_penalty=1.1,
                do_sample=True,
                pad_token_id=tokenizer.eos_token_id,
                eos_token_id=tokenizer.eos_token_id
            )

        # Decode only the newly generated tokens (everything after the prompt),
        # which avoids fragile string matching against the decoded prompt
        prompt_length = inputs["input_ids"].shape[1]
        response = tokenizer.decode(
            outputs[0][prompt_length:],
            skip_special_tokens=True
        ).strip()

        return response

    except Exception as e:
        return f"Error generating response: {str(e)}"
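
Values arriving from a UI are not guaranteed to have the type or range the generation call expects: `max_new_tokens` must be an integer, and `temperature` must be strictly positive when sampling is enabled. One option is to sanitize them before calling the model. The helper below is a hypothetical sketch, not part of the original notebook; its default ranges mirror the slider limits used later in the interface.

```python
def clamp_generation_params(max_tokens, temperature,
                            token_range=(50, 1024), temp_range=(0.1, 1.0)):
    """Coerce UI values to the types and ranges the generation call expects."""
    min_tokens, max_allowed_tokens = token_range
    min_temp, max_temp = temp_range
    tokens = int(min(max(float(max_tokens), min_tokens), max_allowed_tokens))
    temp = float(min(max(float(temperature), min_temp), max_temp))
    return tokens, temp

print(clamp_generation_params(512.0, 0.7))
print(clamp_generation_params(5000, 0))
```

The returned pair could then be passed to `generate_legal_response` in place of the raw values.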

## 🎨 Creating the Gradio Interface
We now build a simple **Gradio interface** with:  
- A text box for user input.  
- Two sliders for the maximum response length and the sampling temperature.  
- A text output area for the model's response.  

This provides an interactive way for anyone to use the model in their browser.


In [None]:
# Step 9: Create Gradio interface
def chat_interface(question, max_tokens, temperature):
    """Gradio interface function"""
    if not question.strip():
        return "Please enter a legal question."

    response = generate_legal_response(question, max_tokens, temperature)
    return response

# Step 10: Build the Gradio app (it is launched in the next cell)
print("\n🚀 Building LawLite Legal AI Interface...")

# Create the Gradio interface
demo = gr.Interface(
    fn=chat_interface,
    inputs=[
        gr.Textbox(
            lines=4,
            placeholder="Enter your legal question here...",
            label="Legal Question"
        ),
        gr.Slider(
            minimum=50,
            maximum=1024,
            value=512,
            step=50,
            label="Max Response Length"
        ),
        gr.Slider(
            minimum=0.1,
            maximum=1.0,
            value=0.7,
            step=0.1,
            label="Temperature (Creativity)"
        )
    ],
    outputs=gr.Textbox(label="LawLite Response", lines=8),
    title="⚖️ LawLite: AI Legal Assistant",
    description="""
    **LawLite** is a fine-tuned LLaMA-2-7B model specialized for Indian legal questions.
    Ask questions about legal documents, procedures, or seek clarification on legal matters.

    *Note: This is an AI assistant and responses should not be considered as professional legal advice.*
    """,
    examples=[
        ["What is the procedure for filing an appeal in the High Court?", 512, 0.7],
        ["Explain the concept of bail under Indian law", 400, 0.6],
        ["What are the key provisions of Section 66 of the Indian Income Tax Act?", 600, 0.7]
    ],
    theme=gr.themes.Soft(),
    allow_flagging="never"
)
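
The wrapper's input handling can be checked without loading the model by passing in a stand-in generator. This is a sketch only: `make_chat_interface` and `fake_generate` are hypothetical names introduced for illustration, and the factory pattern is an alternative to the notebook's direct call to `generate_legal_response`.

```python
def make_chat_interface(generate_fn):
    """Build a Gradio-style callback around any generate function."""
    def chat_interface(question, max_tokens, temperature):
        if not question or not question.strip():
            return "Please enter a legal question."
        return generate_fn(question.strip(), int(max_tokens), float(temperature))
    return chat_interface

def fake_generate(question, max_tokens, temperature):
    # Stand-in for generate_legal_response; it only echoes its arguments
    return f"[stub] {question} ({max_tokens} tokens, T={temperature})"

chat = make_chat_interface(fake_generate)
print(chat("   ", 512, 0.7))
print(chat("What is bail?", 400.0, 0.6))
```

In the real app, `make_chat_interface(generate_legal_response)` would produce a callback equivalent to the one passed to `gr.Interface`.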

## 🚀 Launching the Gradio App
Finally, we launch the Gradio app.  
Because `share=True` is set, Gradio also prints a temporary public URL through which anyone with the link can interact with your fine-tuned model while the notebook is running.


In [None]:
# Step 11: Launch the interface
if __name__ == "__main__":
    demo.launch(
        debug=True,
        share=True,  # Creates public link
        server_name="0.0.0.0",
        server_port=7860,
        show_error=True
    )
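
When the same code is meant to run both in a notebook and on Hugging Face Spaces, the launch arguments can be chosen from the environment. The sketch below assumes that Spaces sets a `SPACE_ID` environment variable, and it reads the port from `LAWLITE_PORT`, a hypothetical variable invented for this example. Neither appears in the original notebook.

```python
import os

def launch_kwargs(env=None) -> dict:
    """Choose demo.launch() arguments for Spaces vs. a notebook or local run."""
    env = os.environ if env is None else env
    if env.get("SPACE_ID"):
        # On Spaces the platform serves the app, so no share link is requested
        return {"show_error": True}
    return {
        "share": True,  # creates a temporary public link
        "server_name": "0.0.0.0",
        "server_port": int(env.get("LAWLITE_PORT", 7860)),  # hypothetical variable
        "show_error": True,
    }

print(launch_kwargs({}))
```

The app could then be started with `demo.launch(**launch_kwargs())`.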