# 🤖 Mistral Legal Advisor Model - Colab API Server

**Purpose:** This notebook runs your fine-tuned Mistral model on a Google Colab GPU and exposes it through a public API that your local frontend can call.

**Model:** [`KASHH-4/Mistral-Model-Legal-Advisor`](https://huggingface.co/KASHH-4/Mistral-Model-Legal-Advisor)

**Use Case:** Legal Document Generator for Startups
- Generates comprehensive JSON lists of required legal documents
- Handles long prompts (30 questions' worth of data, truncated to 2048 input tokens)
- Generates up to 4096 new tokens per request by default, for detailed document lists

**What this does:**
1. Loads your fine-tuned model: `KASHH-4/Mistral-Model-Legal-Advisor`
2. Creates a Flask API server on port 5000
3. Exposes it via ngrok for public access
4. Your local frontend connects to the ngrok URL

**⚠️ Important:** Keep this notebook running while using the application!

## Step 1: Install Dependencies

In [None]:
!pip install -q transformers accelerate bitsandbytes flask flask-cors pyngrok huggingface_hub


## Step 2: Login to Hugging Face

Get your token from: https://huggingface.co/settings/tokens

In [None]:
from huggingface_hub import login

# Replace with your Hugging Face token
HF_TOKEN = "Paste your token here"  # Get from https://huggingface.co/settings/tokens

login(token=HF_TOKEN)
print("✅ Logged into Hugging Face!")
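
Pasting a token directly into a shared notebook makes it easy to leak. Below is a minimal sketch of one alternative, assuming the token is available in an environment variable named `HF_TOKEN`. The helper `resolve_token` and the variable name are illustrative and not part of the original notebook.

```python
import getpass
import os


def resolve_token(name, env=None, prompt_fn=getpass.getpass):
    """Return a token from the environment, or prompt for it if missing.

    `name` is the environment variable to look up (hypothetical naming).
    """
    env = os.environ if env is None else env
    token = env.get(name, "").strip()
    if not token:
        # Fall back to a hidden interactive prompt so the token is not echoed
        token = prompt_fn(f"Enter {name}: ").strip()
    if not token:
        raise ValueError(f"{name} is empty")
    return token


# Usage (commented out so the cell does not block waiting for input):
# HF_TOKEN = resolve_token("HF_TOKEN")
# login(token=HF_TOKEN)
```

The same helper could be reused for the ngrok token in the next step by passing a different variable name.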

## Step 3: Setup ngrok for Public Access

Get your ngrok auth token from: https://dashboard.ngrok.com/get-started/your-authtoken

In [None]:
from pyngrok import ngrok, conf

# Replace with your ngrok auth token
NGROK_TOKEN = "Paste your token here"  # Get from https://dashboard.ngrok.com/get-started/your-authtoken

conf.get_default().auth_token = NGROK_TOKEN
print("✅ ngrok configured!")

## Step 4: Load the Fine-tuned Mistral Model (4-bit Quantization)

The model is loaded with 4-bit NF4 quantization (double quantization enabled, float16 compute) to reduce GPU memory use while maintaining good output quality.

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Your model name on Hugging Face
MODEL_NAME = "KASHH-4/Mistral-Model-Legal-Advisor"

print(f"🔄 Loading model: {MODEL_NAME}")
print("Using 4-bit quantization for maximum memory efficiency...")
print("This may take 2-3 minutes...\n")

# Setup 4-bit quantization for maximum memory efficiency
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=False)

# Set padding token if not set
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Load model with 4-bit quantization
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=quantization_config,
    device_map="auto",
    trust_remote_code=True
)

print("✅ Model loaded successfully in 4-bit!")
print(f"Memory footprint: ~{torch.cuda.memory_allocated() / 1024**3:.2f} GB")
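
As a rough sanity check on the memory figure printed by the cell above, the weight storage can be estimated from the parameter count and the bits per parameter. The sketch below assumes a 7-billion-parameter model purely for illustration; the real count should be taken from the model card. It covers weights only, so the measured footprint will be higher once quantization constants, non-quantized layers, activations, and the KV cache are included.

```python
def estimate_weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for model weights alone, in GiB."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 1024**3


# Assumed parameter count for illustration only; check the model card for the real value
ASSUMED_PARAMS = 7_000_000_000

fp16_gb = estimate_weight_memory_gb(ASSUMED_PARAMS, 16)
nf4_gb = estimate_weight_memory_gb(ASSUMED_PARAMS, 4)

print(f"float16 weights: ~{fp16_gb:.1f} GiB")
print(f"4-bit weights:   ~{nf4_gb:.1f} GiB")
```

Under that assumption, 4-bit storage needs about a quarter of the float16 weight memory (roughly 3.3 GiB instead of 13 GiB).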

## Step 5: Test the Model Locally (Optional)

In [None]:
def generate_text(prompt, max_new_tokens=150, temperature=0.7, top_p=0.9):
    """Generate text from the model with configurable parameters"""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            top_p=top_p,
            do_sample=True,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )

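    # Decode only the newly generated tokens. Slicing the decoded string by
    # len(prompt) can misalign when decoding does not reproduce the prompt exactly.
    prompt_length = inputs["input_ids"].shape[1]
    generated_text = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True).strip()
    return generated_text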

# Test generation
test_prompt = "What are the key legal documents needed for a startup?"
print(f"Test Prompt: {test_prompt}\n")
print("Generated Output:")
print("-" * 80)
result = generate_text(test_prompt, max_new_tokens=100)
print(result)
print("-" * 80)
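
Because the use case expects JSON lists, and language models often wrap JSON in extra prose, it can help to pull the first JSON value out of the raw output before handing it to the frontend. The following is a minimal sketch of that idea; `extract_first_json` and the sample string are hypothetical and not part of the original notebook.

```python
import json


def extract_first_json(text):
    """Return the first JSON object or array embedded in `text`, or None."""
    decoder = json.JSONDecoder()
    for index, char in enumerate(text):
        if char not in "[{":
            continue
        try:
            # raw_decode parses one JSON value starting at `index` and ignores trailing text
            value, _end = decoder.raw_decode(text, index)
            return value
        except json.JSONDecodeError:
            continue  # Not valid JSON from this position; try the next candidate
    return None


# Made-up sample output, used only to demonstrate the helper
sample = 'Here is the list: [{"document": "NDA"}, {"document": "Founders Agreement"}] Hope this helps.'
print(extract_first_json(sample))
```

Returning `None` instead of raising lets the caller decide whether to retry generation or show the raw text.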


## Step 6: Create Flask API Server

This creates a `/api/generate` endpoint that your frontend can call.

In [None]:
from flask import Flask, request, jsonify
from flask_cors import CORS
from pyngrok import ngrok
import threading
import warnings
import sys

warnings.filterwarnings('ignore')

# Create Flask app
app = Flask(__name__)
CORS(app)

# Thread lock for sequential model access
model_lock = threading.Lock()

@app.route('/')
def home():
    return "🚀 Mistral Model API Server is Running!"

@app.route('/api/generate', methods=['POST'])
def generate_text_api():
    with model_lock:
        try:
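            # Parse the JSON body; fall back to an empty dict if it is missing or invalid
            data = request.get_json(silent=True) or {}
            prompt = data.get('prompt', '')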
            
            # Get generation parameters
            max_new_tokens = data.get('max_new_tokens', 4096)
            temperature = data.get('temperature', 0.7)
            top_p = data.get('top_p', 0.9)
            
            if not prompt:
                return jsonify({'error': 'No prompt provided'}), 400
            
            print("\n📝 Generating legal documents...")
            print(f"   Tokens: {max_new_tokens}, Temp: {temperature}, Top-p: {top_p}")
            sys.stdout.flush()
            
            # Tokenize
            inputs = tokenizer(
                prompt,
                return_tensors="pt",
                truncation=True,
                max_length=2048
            ).to(model.device)
            
            # Generate
            with torch.no_grad():
                outputs = model.generate(
                    **inputs,
                    max_new_tokens=max_new_tokens,
                    temperature=temperature,
                    top_p=top_p,
                    do_sample=True,
                    pad_token_id=tokenizer.pad_token_id,
                    eos_token_id=tokenizer.eos_token_id,
                    use_cache=True
                )
            
            # Decode only the newly generated tokens (everything after the prompt)
            prompt_length = inputs["input_ids"].shape[1]
            generated_text = tokenizer.decode(
                outputs[0][prompt_length:],
                skip_special_tokens=True
            ).strip()
            
            print(f"✅ Generation complete! ({len(generated_text)} chars)")
            print(f"\n{'='*80}")
            print("📄 COMPLETE GENERATED OUTPUT:")
            print(f"{'='*80}")
            sys.stdout.flush()
            
            # Print full output without truncation
            print(generated_text)
            sys.stdout.flush()
            
            print(f"{'='*80}\n")
            sys.stdout.flush()
            
            # Clear GPU cache after generation
            del inputs, outputs
            if torch.cuda.is_available():
                torch.cuda.empty_cache()
            
            return jsonify({
                'generated_text': generated_text,
                'status': 'success'
            })
            
        except Exception as e:
            print(f"\n❌ Error during generation: {str(e)}")
            sys.stdout.flush()
            return jsonify({
                'error': str(e),
                'status': 'error'
            }), 500

# Start Flask in a separate thread
def run_flask():
    app.run(host='0.0.0.0', port=5000, threaded=True)

flask_thread = threading.Thread(target=run_flask, daemon=True)
flask_thread.start()

print("\n" + "="*60)
print("🌐 Starting ngrok tunnel...")
print("="*60)
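
The endpoint above passes `max_new_tokens`, `temperature`, and `top_p` from the request body straight to `model.generate`. Since the URL is public, it may be worth validating those values first. The sketch below shows one way to do that; the helper name, the cap, and the accepted ranges are assumptions rather than part of the original server.

```python
def parse_generation_params(data, max_tokens_cap=4096):
    """Validate and clamp generation settings from a request body (a dict).

    The defaults mirror the endpoint above; the cap and ranges are assumptions.
    """
    def as_number(key, default, cast):
        try:
            return cast(data.get(key, default))
        except (TypeError, ValueError):
            raise ValueError(f"'{key}' must be a number") from None

    max_new_tokens = as_number("max_new_tokens", 4096, int)
    temperature = as_number("temperature", 0.7, float)
    top_p = as_number("top_p", 0.9, float)

    if max_new_tokens < 1:
        raise ValueError("'max_new_tokens' must be at least 1")
    if temperature <= 0:
        # Sampling divides the logits by the temperature, so it must be positive
        raise ValueError("'temperature' must be greater than 0")
    if not 0 < top_p <= 1:
        raise ValueError("'top_p' must be in the range (0, 1]")

    return {
        "max_new_tokens": min(max_new_tokens, max_tokens_cap),
        "temperature": temperature,
        "top_p": top_p,
    }
```

Inside the route, a `ValueError` from this helper could be turned into a 400 response, so that bad input is reported as a client error rather than a 500.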

## Step 7: Expose via ngrok (Public URL)

This creates a public URL that your frontend can use to connect to this Colab instance.

In [None]:
import time

# Give Flask a moment to start
time.sleep(2)

# Create ngrok tunnel and extract URL string
tunnel = ngrok.connect(5000)
public_url = tunnel.public_url  # Extract the actual URL string

print("\n" + "="*80)
print("🌐 PUBLIC API URL (use this in your frontend):")
print("="*80)
print(f"\n{public_url}\n")
print("="*80)
print("\nEndpoints:")
print(f"  Health check: {public_url}/")
print(f"  Generate text: {public_url}/api/generate")
print("\nExample JavaScript fetch:")
print(f'''
fetch('{public_url}/api/generate', {{
    method: 'POST',
    headers: {{'Content-Type': 'application/json'}},
    body: JSON.stringify({{
        prompt: 'Your prompt here',
        max_new_tokens: 4096
    }})
}})
.then(res => res.json())
.then(data => console.log(data.generated_text));
''')
print("="*80)
print("\n⚠️ Keep this notebook running to keep the API alive!")
print("⚠️ The URL will change if you restart the notebook.")
print("\n✅ Copy the URL above and update your .env file with:")
print(f"   API_URL={public_url}")
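
For a backend written in Python, the JavaScript `fetch` example printed above has a standard-library equivalent. The sketch below builds the same POST request; the function names, the 600-second timeout, and the placeholder URL are assumptions. The JSON keys match what the `/api/generate` endpoint reads.

```python
import json
import urllib.request


def build_generate_request(base_url, prompt, max_new_tokens=4096, temperature=0.7, top_p=0.9):
    """Build a POST request for the /api/generate endpoint."""
    payload = {
        "prompt": prompt,
        "max_new_tokens": max_new_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call_generate(base_url, prompt, timeout=600, **params):
    """Send the request and return the generated text (raises on HTTP errors)."""
    request = build_generate_request(base_url, prompt, **params)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["generated_text"]


# Placeholder URL; substitute the public_url printed by the cell above
example = build_generate_request("https://your-ngrok-url.example", "Your prompt here", max_new_tokens=256)
print(example.full_url)
```

Building the request separately from sending it makes the payload easy to inspect without a live tunnel.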

## Step 8: Update Your Frontend (.env file)

Copy the ngrok URL from above and update your `.env` file:

```bash
# In your .env file
API_URL=YOUR_NGROK_URL_HERE
PORT=7860
```

Then start your local Flask server:
```bash
python app.py
```

Open your browser at: `http://localhost:7860`
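
The local `app.py` is not shown in this notebook, so the following is only a sketch of how it might read these two settings. It assumes the `.env` values have already been loaded into the process environment (for example by a dotenv loader); `load_frontend_config` is a hypothetical helper.

```python
import os


def load_frontend_config(env=None):
    """Read API_URL and PORT the way a local app.py might (hypothetical helper)."""
    env = os.environ if env is None else env
    api_url = env.get("API_URL", "").strip().rstrip("/")
    if not api_url.startswith(("http://", "https://")):
        raise ValueError("API_URL must be set to the ngrok URL, including the scheme")
    # Default port matches the .env example above
    port = int(env.get("PORT", "7860"))
    return {"generate_endpoint": f"{api_url}/api/generate", "port": port}
```

Failing fast on a missing or malformed `API_URL` surfaces the most common setup mistake, a stale or unset ngrok URL, before the first request is sent.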

## Keep Alive Cell

Run this cell to keep the server running and see incoming requests.

In [None]:
import sys
import time  # Needed for time.sleep below if this cell runs in a fresh session

print("🟢 Server is running...")
print(f"Public URL: {public_url}")
print("\nWaiting for requests... (Press interrupt to stop)\n")
print("="*80)
print("REQUEST LOG:")
print("="*80)

# Flush output to ensure everything is displayed
sys.stdout.flush()

request_count = 0

try:
    while True:
        time.sleep(1)
        # Periodically flush output to display in real-time
        sys.stdout.flush()
except KeyboardInterrupt:
    print("\n" + "="*80)
    print("🛑 Server stopped")
    print("="*80)
    sys.stdout.flush()
