# 💬 Chat with Your Fine-Tuned Model

Use this notebook to have real conversations with your trained AI model!

**What you'll need:**
- Your trained model in Google Drive (`Finetune_Jobs/models/your-model-name/`)
- Free T4 GPU (Runtime → Change runtime type → T4 GPU)

**Time:** ~2-3 minutes to load the model, then a few seconds per response

## ⚙️ Step 1: Configuration

**IMPORTANT:** Update `MODEL_NAME` to match your trained model!

In [None]:
# ============================================================================
# CONFIGURATION - UPDATE THIS
# ============================================================================

# Your model name (from training notebook)
MODEL_NAME = "customer-support-bot-v1"  # ← CHANGE THIS

# Model path in Google Drive
MODEL_PATH = f"/content/drive/MyDrive/Finetune_Jobs/models/{MODEL_NAME}"

# Chat settings
TEMPERATURE = 0.7      # 0.0 = deterministic, 1.0 = creative
MAX_TOKENS = 256       # Maximum response length
TOP_P = 0.9           # Nucleus sampling

print("✅ Configuration loaded")
print(f"Model: {MODEL_NAME}")
print(f"Path: {MODEL_PATH}")
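To make the `TOP_P` setting more concrete, here is a minimal pure-Python sketch of nucleus filtering on a toy distribution. The helper name `nucleus_filter` and the toy probabilities are illustrative assumptions, not part of this notebook or of any library; real decoders apply the same idea across the model's full vocabulary.

```python
def nucleus_filter(probs, top_p):
    """Keep the smallest set of top-ranked tokens whose cumulative probability
    reaches top_p, then renormalize. `probs` maps token -> probability."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# Made-up next-token probabilities, for illustration only
toy_probs = {"yes": 0.5, "sure": 0.3, "maybe": 0.15, "banana": 0.05}
filtered = nucleus_filter(toy_probs, top_p=0.9)
print(filtered)
```

With `top_p=0.9`, the least likely token falls outside the nucleus and can no longer be sampled; lowering `top_p` shrinks the candidate set further.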

## 🔗 Step 2: Mount Google Drive

In [None]:
from google.colab import drive
import os

# Mount Google Drive
drive.mount('/content/drive')

# Verify model exists
if not os.path.exists(MODEL_PATH):
    print(f"❌ ERROR: Model not found at {MODEL_PATH}")
    print(f"\nAvailable models in Finetune_Jobs/models/:")
    models_dir = "/content/drive/MyDrive/Finetune_Jobs/models"
    if os.path.exists(models_dir):
        for model in os.listdir(models_dir):
            print(f"  - {model}")
    else:
        print("  (No models folder found)")
    raise FileNotFoundError(f"Please update MODEL_NAME in Step 1")

print("✅ Drive mounted")
print(f"✅ Model found: {MODEL_PATH}")

# Show model files
print("\n📁 Model files:")
for file in os.listdir(MODEL_PATH):
    print(f"  - {file}")
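The file listing above can also be checked automatically. The sketch below assumes the folder holds a LoRA adapter saved by PEFT, which normally writes `adapter_config.json` plus `adapter_model.safetensors` (or `adapter_model.bin` in older versions). The helper name `missing_adapter_files` is hypothetical, and the demo uses a temporary folder rather than Google Drive.

```python
import os
import tempfile


def missing_adapter_files(model_path):
    """Return the expected LoRA adapter files that are absent from model_path."""
    files = set(os.listdir(model_path))
    missing = []
    if "adapter_config.json" not in files:
        missing.append("adapter_config.json")
    # PEFT saves weights as .safetensors by default; older versions used .bin
    if not {"adapter_model.safetensors", "adapter_model.bin"} & files:
        missing.append("adapter_model.safetensors (or adapter_model.bin)")
    return missing


# Demo on a temporary folder standing in for MODEL_PATH
with tempfile.TemporaryDirectory() as tmp_dir:
    before = missing_adapter_files(tmp_dir)
    for name in ("adapter_config.json", "adapter_model.safetensors"):
        open(os.path.join(tmp_dir, name), "w").close()
    after = missing_adapter_files(tmp_dir)

print("Missing before:", before)
print("Missing after:", after)
```

In the notebook you would call `missing_adapter_files(MODEL_PATH)` after mounting Drive; a non-empty result suggests the folder contains something other than an adapter (for example, a merged full model).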

## 📦 Step 3: Install Unsloth (Fast Inference)

In [None]:
%%capture
# Install Unsloth for 2x faster inference
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.27" "trl<0.9.0" "peft" "accelerate" "bitsandbytes"

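# %%capture hides all output from this cell, including the message below;
# remove the %%capture line to see the install logs and this confirmation.
print("✅ Unsloth installed")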

## 🤖 Step 4: Load Your Model

**This takes ~2-3 minutes. Please wait...**

In [None]:
from unsloth import FastLanguageModel
import torch

print("⏳ Loading base model (Llama 3 8B)...")

# Load base model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)

print("⏳ Loading your fine-tuned adapter...")

# Load your LoRA adapter
from peft import PeftModel
model = PeftModel.from_pretrained(model, MODEL_PATH)

# Enable fast inference mode
FastLanguageModel.for_inference(model)

print("✅ Model loaded and ready!")
print(f"🎯 Using: {MODEL_NAME}")
print("⚡ Fast inference enabled")

## 💬 Step 5: Chat Function

In [None]:
def chat(message, history=None, temperature=TEMPERATURE, max_tokens=MAX_TOKENS):
    """
    Chat with your model.

    Args:
        message: Your question/message
        history: Previous conversation as a list of {"role", "content"} dicts (optional)
        temperature: Creativity level (0.0-1.0); 0.0 switches to greedy decoding
        max_tokens: Max response length, in newly generated tokens

    Returns:
        Model's response
    """
    # Build conversation history
    if history is None:
        history = []

    # Add current message
    messages = history + [{"role": "user", "content": message}]

    # Format for model
    try:
        # Try using the tokenizer's chat template
        inputs = tokenizer.apply_chat_template(
            messages,
            tokenize=True,
            add_generation_prompt=True,
            return_tensors="pt"
        ).to("cuda")
    except Exception:
        # Fallback to simple format (note: this branch ignores history)
        prompt = f"### User:\n{message}\n\n### Assistant:\n"
        inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

    # Both branches may return either a tensor or a dict-like object
    input_ids = inputs.input_ids if hasattr(inputs, "input_ids") else inputs

    # Sampling requires temperature > 0; at 0.0 use greedy (deterministic) decoding
    gen_kwargs = {"max_new_tokens": max_tokens, "pad_token_id": tokenizer.eos_token_id}
    if temperature > 0:
        gen_kwargs.update(do_sample=True, temperature=temperature, top_p=TOP_P)
    else:
        gen_kwargs["do_sample"] = False

    # Generate response
    with torch.no_grad():
        outputs = model.generate(input_ids=input_ids, **gen_kwargs)

    # Decode only the newly generated tokens (everything after the prompt)
    new_tokens = outputs[0][input_ids.shape[-1]:]
    response = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

    # The fallback prompt has no end-of-turn token, so cut at the next user marker
    response = response.split("### User:")[0].strip()

    return response

print("✅ Chat function ready")
print("You can now use: chat('your message here')")
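The fallback branch in `chat()` formats only the latest message, so earlier turns are lost whenever the tokenizer has no chat template. One possible way to keep them is to render the whole message list in the same `### User:` / `### Assistant:` layout. The helper `build_fallback_prompt` below is a hypothetical sketch; whether this layout matches the data your model was trained on is an assumption you would need to verify.

```python
def build_fallback_prompt(messages):
    """Render {"role", "content"} dicts in the '### User / ### Assistant' layout.
    Only the 'user' and 'assistant' roles are handled in this sketch."""
    labels = {"user": "### User:", "assistant": "### Assistant:"}
    parts = [f"{labels[m['role']]}\n{m['content']}\n" for m in messages]
    # End with an open assistant header so the model continues from there
    parts.append("### Assistant:\n")
    return "\n".join(parts)


demo_messages = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello! How can I help?"},
    {"role": "user", "content": "Reset my password"},
]
demo_prompt = build_fallback_prompt(demo_messages)
print(demo_prompt)
```

For a single user message this produces the same string as the fallback prompt in `chat()`, so it could replace that line without changing single-turn behavior.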

## 🧪 Step 6: Test Your Model

Try a quick test to make sure everything works!

In [None]:
# Test message
test_message = "Hello! Can you help me?"

print(f"👤 User: {test_message}")
print("\n⏳ Generating response...\n")

response = chat(test_message)

print(f"🤖 {MODEL_NAME}: {response}")
print("\n✅ Model is working! You can now chat below.")

## 💬 Step 7: Interactive Chat

**Run this cell and start chatting!**

Type your messages and press Enter. Type 'quit', 'exit', or 'bye' to stop.

In [None]:
import time

print("="*80)
print(f"💬 Chat with {MODEL_NAME}")
print("="*80)
print("Type your message and press Enter. Type 'quit', 'exit', or 'bye' to stop.\n")

conversation_history = []

while True:
    # Get user input
    user_message = input("\n👤 You: ").strip()
    
    # Check for exit
    if user_message.lower() in ['quit', 'exit', 'bye']:
        print("\n👋 Thanks for chatting! Goodbye!")
        break
    
    if not user_message:
        continue
    
    # Generate response
    start_time = time.time()
    response = chat(user_message, history=conversation_history)
    elapsed = time.time() - start_time
    
    # Display response
    print(f"\n🤖 {MODEL_NAME}: {response}")
    print(f"\n⏱️  Response time: {elapsed:.2f}s")
    
    # Update conversation history
    conversation_history.append({"role": "user", "content": user_message})
    conversation_history.append({"role": "assistant", "content": response})
    
    # Keep only last 5 exchanges to avoid context overflow
    if len(conversation_history) > 10:
        conversation_history = conversation_history[-10:]
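The slicing above keeps the last 10 messages (5 exchanges). An alternative, sketched here with simulated turns, is a `collections.deque` with `maxlen`, which drops the oldest entries automatically. The names `MAX_MESSAGES` and `recent` are illustrative and not part of the notebook.

```python
from collections import deque

MAX_MESSAGES = 10  # 5 exchanges = 5 user + 5 assistant messages

history = deque(maxlen=MAX_MESSAGES)
for turn in range(1, 8):  # simulate 7 exchanges
    history.append({"role": "user", "content": f"question {turn}"})
    history.append({"role": "assistant", "content": f"answer {turn}"})

# chat() concatenates history with a list, so convert the deque first,
# e.g. chat(user_message, history=list(history))
recent = list(history)
print(len(recent), recent[0]["content"], recent[-1]["content"])
```

Either approach limits the number of messages, not the number of tokens, so a few very long messages could still exceed the 2048-token `max_seq_length` set in Step 4.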

## 🎨 Step 8: Advanced Chat (With Settings)

Try different settings to control how your model responds!

In [None]:
# Try different temperatures:
# - Low (0.1-0.3): More focused, consistent, deterministic
# - Medium (0.5-0.7): Balanced
# - High (0.8-1.0): More creative, varied, random

question = "What is machine learning?"

print("Testing different temperature settings:\n")
print("="*80)

for temp in [0.3, 0.7, 1.0]:
    print(f"\n🌡️  Temperature: {temp}")
    print(f"👤 User: {question}")
    response = chat(question, temperature=temp)
    print(f"🤖 Bot: {response}")
    print("-"*80)
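To see why lower temperatures give more consistent answers, the following sketch applies temperature scaling to a made-up set of scores before converting them to probabilities. The function name `softmax_with_temperature` and the toy logits are assumptions for illustration; the model performs the equivalent step internally over its whole vocabulary.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for temp in [0.3, 0.7, 1.0]:
    probs = softmax_with_temperature(toy_logits, temp)
    print(temp, [round(p, 3) for p in probs])
```

As the temperature drops, more of the probability mass concentrates on the highest-scoring token, which is why low settings behave almost deterministically.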

## 📊 Step 9: Batch Questions (Optional)

Test multiple questions at once!

In [None]:
# List of questions to test
test_questions = [
    "What is machine learning?",
    "How can I reset my password?",
    "Tell me about neural networks",
    "What's the difference between AI and ML?",
]

print("🧪 Testing batch questions\n")
print("="*80)

for i, question in enumerate(test_questions, 1):
    print(f"\n{i}. 👤 {question}")
    response = chat(question)
    print(f"   🤖 {response}")
    print("-"*80)
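If you want to keep batch results instead of only printing them, one option is to collect them as rows and serialize them to CSV. This is a sketch under assumptions: `run_batch`, `rows_to_csv`, and the stand-in `fake_chat` are hypothetical names, and `fake_chat` exists only so the example runs without a loaded model.

```python
import csv
import io


def run_batch(questions, chat_fn):
    """Call chat_fn on each question and return a list of result rows."""
    return [{"question": q, "response": chat_fn(q)} for q in questions]


def rows_to_csv(rows):
    """Serialize result rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["question", "response"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Stand-in for the notebook's chat(); pass chat itself when running for real
def fake_chat(question):
    return f"(reply to: {question})"


rows = run_batch(["What is machine learning?", "How can I reset my password?"], fake_chat)
csv_text = rows_to_csv(rows)
print(csv_text)
```

In the notebook, `run_batch(test_questions, chat)` would produce the same structure with real responses, and the CSV text could then be written to a file in Drive.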

## 💾 Step 10: Save Conversation (Optional)

Save your chat history to Google Drive!

In [None]:
import json
import os
from datetime import datetime

# Use the history from Step 7; fall back to an empty list if that cell was never run
history_to_save = globals().get("conversation_history", [])

# Save conversation to file
if history_to_save:
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    save_path = f"/content/drive/MyDrive/Finetune_Jobs/conversations/chat_{timestamp}.json"

    # Create directory if needed
    os.makedirs(os.path.dirname(save_path), exist_ok=True)

    # Save conversation (UTF-8 so emoji and non-English text are written as-is)
    with open(save_path, 'w', encoding='utf-8') as f:
        json.dump({
            "model": MODEL_NAME,
            "timestamp": timestamp,
            "conversation": history_to_save
        }, f, indent=2, ensure_ascii=False)

    print(f"✅ Conversation saved to: {save_path}")
else:
    print("ℹ️  No conversation to save. Chat first in Step 7!")
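A saved file can later be read back and passed to `chat()` as `history` to continue where you left off. The sketch below assumes the JSON layout written by the cell above (`model`, `timestamp`, `conversation`); `load_conversation` is a hypothetical helper, and the sample data and temporary file stand in for a real file in Google Drive.

```python
import json
import os
import tempfile


def load_conversation(path):
    """Read a saved chat file and return its list of role/content messages."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return data["conversation"]


# Round-trip demo with sample data in a temporary file
sample = {
    "model": "customer-support-bot-v1",
    "timestamp": "20240101_120000",  # made-up value for the demo
    "conversation": [
        {"role": "user", "content": "Hello!"},
        {"role": "assistant", "content": "Hi, how can I help?"},
    ],
}
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "chat_demo.json")
    with open(demo_path, "w", encoding="utf-8") as f:
        json.dump(sample, f, indent=2)
    restored = load_conversation(demo_path)

print(restored)
```

In the notebook, `chat("next question", history=load_conversation(save_path))` would resume the saved conversation.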

## 🎉 All Done!

You're now chatting with your real fine-tuned model!

**Tips:**
- Use Step 7 for natural conversations
- Try Step 8 to experiment with temperature settings
- Use Step 9 to test multiple questions quickly
- Save important conversations with Step 10

**Performance (on the T4 GPU runtime):**
- First response: ~3-5 seconds (model warmup)
- Subsequent responses: ~1-2 seconds (longer replies take more time)
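To check these figures on your own runtime, you could time the first call separately from the rest. The sketch below is one way to do that; `measure_latency` is a hypothetical helper, and `fake_chat` is a stand-in that only sleeps briefly so the example runs without a model.

```python
import time


def measure_latency(fn, prompts):
    """Time fn on each prompt (at least one is required).
    Returns (first_call_seconds, mean_seconds_of_remaining_calls or None)."""
    timings = []
    for prompt in prompts:
        start = time.perf_counter()
        fn(prompt)
        timings.append(time.perf_counter() - start)
    rest = timings[1:]
    mean_rest = sum(rest) / len(rest) if rest else None
    return timings[0], mean_rest


# Stand-in for chat(); swap in the real function inside the notebook
def fake_chat(prompt):
    time.sleep(0.01)
    return prompt.upper()


first, steady = measure_latency(fake_chat, ["a", "b", "c"])
print(f"first: {first:.3f}s, steady-state mean: {steady:.3f}s")
```

Response time also grows with the number of generated tokens, so comparing prompts that produce answers of similar length gives more meaningful numbers.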

**Need help?** Check the Auto-Tuner documentation or ask in the community!