# TinyLlama Local Chatbot - Complete Google Colab Implementation

This notebook will guide you through building a fully functional AI chatbot that runs locally using the TinyLlama-1.1B model.

**What you'll build:**
- A conversational AI assistant
- Memory-enabled chat system
- GPU/CPU automatic optimization
- Professional response generation

Just select a T4 GPU under Runtime > Change runtime type, then click the play button on each cell!
Let's build your AI chatbot step by step!

## Step 1: Install Required Dependencies

First, we need to install the essential Python libraries for our chatbot:
- **transformers**: Hugging Face library for loading AI models
- **torch**: PyTorch for tensor operations and model inference
- **accelerate**: Optimizes model loading and GPU utilization

The `-q` flag keeps the installation output minimal and clean.

**Expected output:** "Dependencies installed successfully!" message

In [30]:
# Install required packages
!pip install transformers torch accelerate -q
print("Dependencies installed successfully!")

Dependencies installed successfully!
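To confirm which versions were actually installed, a minimal sketch using only the standard library might look like the following. The helper name `installed_versions` is an illustrative choice and not part of the original notebook.

```python
from importlib import metadata


def installed_versions(packages=("transformers", "torch", "accelerate")):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name}: {version or 'not installed'}")
```

A `not installed` entry here usually means the `pip install` cell failed or the runtime was reset afterwards.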


## Step 2: Import Essential Libraries

Now we import all the Python libraries we'll need:
- **torch**: For GPU detection and memory management
- **transformers**: For loading and running the TinyLlama model
- **time**: To measure response times
- **gc**: For garbage collection and memory cleanup

**What happens:** All necessary modules are loaded into memory for use in subsequent cells.

In [12]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import time
import gc

## Step 3: Define the Main Chatbot Class

This creates our main `ColabChatbot` class that handles all chatbot functionality:

**Key features:**
- **model_name**: Specifies which AI model to use (TinyLlama-1.1B-Chat-v1.0)
- **pipe**: Stores the loaded text-generation pipeline (`None` until the model loads)
- **conversation_history**: List of past messages that gives the chatbot its memory across exchanges
- **load_model()**: Automatically called during initialization

**What happens:** Running this cell only defines the class. The model is loaded when a chatbot object is created in Step 8.

In [21]:
class ColabChatbot:
    def __init__(self, model_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0"):
        self.model_name = model_name
        self.pipe = None
        self.conversation_history = []
        self.load_model()

    def load_model(self):
        """Load TinyLlama model optimized for Colab"""
        print("Loading TinyLlama model (this takes 2-3 minutes first time)...")

        try:
            # Use GPU if available, fallback to CPU
            device = "cuda" if torch.cuda.is_available() else "cpu"
            dtype = torch.float16 if torch.cuda.is_available() else torch.float32

            print(f"Using device: {device.upper()}")

            # Load with memory optimization
            self.pipe = pipeline(
                "text-generation",
                model=self.model_name,
                dtype=dtype,
                device_map="auto" if device == "cuda" else None,
                model_kwargs={"low_cpu_mem_usage": True}
            )

            print("Model loaded successfully!")
            print("Type 'quit' to exit, 'clear' to reset conversation")
            print("-" * 50)

        except Exception as e:
            print(f"Error loading model: {e}")
            self.pipe = None

    def generate_response(self, user_input):
        """Generate response with conversation context"""
        if not self.pipe:
            return "Model not loaded. Please restart and try again."

        try:
            # Build conversation context
            messages = [
                {
                    "role": "system",
                    "content": "You are a helpful AI assistant."
                }
            ]

            # Add recent conversation history (last 4 exchanges = 8 messages, to save memory)
            recent_history = self.conversation_history[-8:]
            messages.extend(recent_history)

            # Add current user input
            messages.append({"role": "user", "content": user_input})

            # Format for TinyLlama
            prompt = self.pipe.tokenizer.apply_chat_template(
                messages,
                tokenize=False,
                add_generation_prompt=True
            )

            # Generate response with memory management
            with torch.no_grad():
                outputs = self.pipe(
                    prompt,
                    max_new_tokens=200,  # Shorter for Colab
                    do_sample=True,
                    temperature=0.7,
                    top_k=50,
                    top_p=0.9,
                    pad_token_id=self.pipe.tokenizer.eos_token_id,
                    return_full_text=False
                )

            response = outputs[0]["generated_text"].strip()

            # Fall back to a default message if the model returned nothing
            if not response:
                response = "I'm not sure how to respond to that. Could you try asking differently?"

            return response

        except Exception as e:
            return f"Error generating response: {str(e)[:100]}..."

    def clear_memory(self):
        """Clear conversation history and GPU cache"""
        self.conversation_history = []
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
        gc.collect()
        print("Conversation cleared!")

    def start_chat(self):
        """Main chat loop"""
        if not self.pipe:
            print("Cannot start chat - model not loaded")
            return

        print("TinyLlama Chatbot Ready!")
        print("Ask me anything about business, technology, or general questions.")
        print()

        while True:
            try:
                # Get user input
                user_input = input("You: ").strip()

                # Handle special commands
                if user_input.lower() in ['quit', 'exit', 'q']:
                    print("Goodbye!")
                    break
                elif user_input.lower() in ['clear', 'reset']:
                    self.clear_memory()
                    continue
                elif user_input.lower() in ['help', 'h']:
                    print("Commands: 'quit' to exit, 'clear' to reset, 'help' for this message")
                    continue
                elif not user_input:
                    print("Please enter a message or 'quit' to exit")
                    continue

                # Generate and display response
                print("Assistant: ", end="")

                start_time = time.time()
                response = self.generate_response(user_input)
                end_time = time.time()

                print(response)
                print(f"Response time: {end_time - start_time:.1f}s")

                # Save to conversation history
                self.conversation_history.extend([
                    {"role": "user", "content": user_input},
                    {"role": "assistant", "content": response}
                ])

                print("-" * 50)

            except KeyboardInterrupt:
                print("\nChat interrupted. Goodbye!")
                break
            except Exception as e:
                print(f"Error: {e}")
                continue
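The `apply_chat_template` call inside `generate_response` turns the list of message dicts into a single prompt string. As a rough illustration of what that string looks like, the sketch below approximates the Zephyr-style layout shown on the TinyLlama model card. The helper `render_prompt` and the hard-coded `</s>` token are assumptions made for illustration; the tokenizer's own template remains the authoritative version.

```python
EOS = "</s>"  # assumed end-of-sequence token for TinyLlama-1.1B-Chat-v1.0


def render_prompt(messages, add_generation_prompt=True):
    """Approximate the chat prompt layout; an illustration, not the real template."""
    parts = [f"<|{m['role']}|>\n{m['content']}{EOS}\n" for m in messages]
    if add_generation_prompt:
        # Leave an open assistant turn for the model to complete.
        parts.append("<|assistant|>\n")
    return "".join(parts)


demo = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "hi"},
]
print(render_prompt(demo))
```

Printing the real prompt from `self.pipe.tokenizer.apply_chat_template(...)` is the reliable way to check the exact format for a given model.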

## Step 4: Model Loading with Hardware Detection

This is where the magic happens! The `load_model` method, already defined inside the class in Step 3 and repeated below for reference:

**Smart hardware detection:**
- Automatically detects if GPU (CUDA) is available
- Uses appropriate data types (float16 for GPU, float32 for CPU)
- Optimizes memory usage with `low_cpu_mem_usage=True`

**Model loading process:**
1. Downloads TinyLlama model (first run only)
2. Loads model into memory
3. Creates a text generation pipeline
4. Reports success or failure

**Expected duration:** 2-3 minutes on first run; later runs in the same runtime are much faster because the downloaded weights are cached

In [22]:
def load_model(self):
    """Load TinyLlama model optimized for Colab"""
    print("Loading TinyLlama model (this takes 2-3 minutes first time)...")

    try:
        # Use GPU if available, fallback to CPU
        device = "cuda" if torch.cuda.is_available() else "cpu"
        dtype = torch.float16 if torch.cuda.is_available() else torch.float32

        print(f"Using device: {device.upper()}")

        # Load with memory optimization
        self.pipe = pipeline(
            "text-generation",
            model=self.model_name,
            dtype=dtype,
            device_map="auto" if device == "cuda" else None,
            model_kwargs={"low_cpu_mem_usage": True}
        )

        print("Model loaded successfully!")
        print("Type 'quit' to exit, 'clear' to reset conversation")
        print("-" * 50)

    except Exception as e:
        print(f"Error loading model: {e}")
        self.pipe = None
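The choice between float16 and float32 matters mostly because of memory. A back-of-the-envelope sketch, assuming the nominal 1.1 billion parameters implied by the model name and counting weights only (activations and the generation cache are ignored), could look like this. The names `weight_memory_gb` and `NOMINAL_PARAMS` are illustrative.

```python
BYTES_PER_PARAM = {"float16": 2, "float32": 4}

NOMINAL_PARAMS = 1.1e9  # assumption: taken from the "1.1B" in the model name


def weight_memory_gb(num_params, dtype_name):
    """Estimate the memory needed for the weights alone, in gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype_name] / 1e9


for dtype_name in ("float16", "float32"):
    size = weight_memory_gb(NOMINAL_PARAMS, dtype_name)
    print(f"{dtype_name}: ~{size:.1f} GB of weights")
```

Under these assumptions the weights take about 2.2 GB in float16 and about 4.4 GB in float32, which is why half precision is the natural choice when a GPU is available.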

## Step 5: Intelligent Response Generation

The `generate_response` method is the brain of our chatbot:

**Context management:**
- Maintains system prompt for AI personality
- Keeps last 8 messages for conversation continuity
- Formats messages using TinyLlama's chat template
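
The context-management steps above can be sketched without loading any model. The function below is a simplified stand-in for the first half of `generate_response`; the name `build_messages` and the `max_history` parameter are illustrative additions.

```python
SYSTEM_PROMPT = {"role": "system", "content": "You are a helpful AI assistant."}


def build_messages(history, user_input, max_history=8):
    """Return system prompt + the most recent history + the new user message."""
    # Guard against max_history == 0, because history[-0:] would return everything.
    recent = history[-max_history:] if max_history > 0 else []
    return [SYSTEM_PROMPT, *recent, {"role": "user", "content": user_input}]


# Ten stored messages (five exchanges); only the last eight are sent to the model.
history = []
for i in range(5):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

messages = build_messages(history, "new question")
print(len(messages))           # 1 system + 8 history + 1 new = 10
print(messages[1]["content"])  # oldest message kept: "question 1"
```

Because the window is applied to messages rather than exchanges, an even `max_history` keeps user and assistant turns paired.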

**Generation parameters:**
- `max_new_tokens=200`: Limits response length
- `temperature=0.7`: Balances creativity and coherence
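- `top_k=50, top_p=0.9`: Restrict sampling to the 50 most likely tokens, then to the smallest set of those covering 90% of the probability mass
- `torch.no_grad()`: Disables gradient tracking, which saves memory during inference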

**Error handling:** Catches exceptions raised during generation and returns a short error message instead of crashing

In [23]:
def generate_response(self, user_input):
    """Generate response with conversation context"""
    if not self.pipe:
        return "Model not loaded. Please restart and try again."

    try:
        # Build conversation context
        messages = [
            {
                "role": "system",
                "content": "You are a helpful AI assistant."
            }
        ]

        # Add recent conversation history (last 4 exchanges = 8 messages, to save memory)
        recent_history = self.conversation_history[-8:]
        messages.extend(recent_history)

        # Add current user input
        messages.append({"role": "user", "content": user_input})

        # Format for TinyLlama
        prompt = self.pipe.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )

        # Generate response with memory management
        with torch.no_grad():
            outputs = self.pipe(
                prompt,
                max_new_tokens=200,  # Shorter for Colab
                do_sample=True,
                temperature=0.7,
                top_k=50,
                top_p=0.9,
                pad_token_id=self.pipe.tokenizer.eos_token_id,
                return_full_text=False
            )

        response = outputs[0]["generated_text"].strip()

        # Fall back to a default message if the model returned nothing
        if not response:
            response = "I'm not sure how to respond to that. Could you try asking differently?"

        return response

    except Exception as e:
        return f"Error generating response: {str(e)[:100]}..."
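
To build intuition for `temperature`, `top_k`, and `top_p`, here is a simplified pure-Python sketch of how such filters reshape a next-token distribution. It is an illustration on a toy list of logits, not the library's implementation, and the function name `filter_distribution` is hypothetical.

```python
import math


def filter_distribution(logits, temperature=0.7, top_k=50, top_p=0.9):
    """Return {token_index: probability} after temperature, top-k, and top-p.

    temperature must be greater than 0.
    """
    # Temperature: divide logits before softmax; values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]

    # Top-k: keep only the k highest-scoring tokens, then renormalize.
    ranked = sorted(range(len(exps)), key=lambda i: exps[i], reverse=True)[:top_k]
    total = sum(exps[i] for i in ranked)
    probs = {i: exps[i] / total for i in ranked}

    # Top-p (nucleus): keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1.
    kept_total = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_total for i in kept}


toy_logits = [2.0, 1.0, 0.0, -1.0]
for token, p in filter_distribution(toy_logits, temperature=1.0, top_k=3, top_p=0.9).items():
    print(f"token {token}: {p:.3f}")
```

Lowering `temperature` or `top_p` concentrates probability on fewer tokens, giving more predictable replies; raising them allows more varied output.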

## Step 6: Memory Management System

The `clear_memory` method provides essential cleanup functionality:

**What it clears:**
- **conversation_history**: Resets chat context to start fresh
- **GPU cache**: Frees up VRAM if using GPU
- **System memory**: Triggers garbage collection for efficiency

**When to use:**
- When conversation becomes too long
- If responses become repetitive
- To free up memory for better performance

**User command:** Type 'clear' or 'reset' during chat

In [24]:
def clear_memory(self):
    """Clear conversation history and GPU cache"""
    self.conversation_history = []
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    gc.collect()
    print("Conversation cleared!")
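
To see whether clearing has any effect, one option is to read PyTorch's memory counters before and after calling `clear_memory`. The sketch below is a minimal helper under the assumption that a missing `torch` install or a CPU-only runtime should simply return `None`; the name `gpu_memory_mb` is illustrative.

```python
def gpu_memory_mb():
    """Return (allocated_mb, reserved_mb) for the current GPU, or None without CUDA."""
    try:
        import torch  # imported lazily so the helper also runs where torch is absent
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    mb = 1024 ** 2
    return torch.cuda.memory_allocated() / mb, torch.cuda.memory_reserved() / mb


usage = gpu_memory_mb()
if usage is None:
    print("No CUDA device available")
else:
    print(f"allocated: {usage[0]:.0f} MB, reserved: {usage[1]:.0f} MB")
```

The model weights stay allocated after a clear, so expect the reserved figure to drop only by the amount of cached, unused memory.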

## Step 7: Interactive Chat Interface

The `start_chat` method creates the main user interface:

**Built-in commands:**
- `quit/exit/q`: Ends the chat session
- `clear/reset`: Clears conversation memory
- `help/h`: Shows available commands

**Features:**
- **Response timing**: Shows how long each response takes
- **Context preservation**: Saves each exchange to conversation history
- **Error handling**: Gracefully handles interruptions and errors
- **User-friendly prompts**: Clear indicators for user input

**Loop structure:** Continues until user types 'quit' or interrupts with Ctrl+C

In [25]:
def start_chat(self):
    """Main chat loop"""
    if not self.pipe:
        print("Cannot start chat - model not loaded")
        return

    print("TinyLlama Chatbot Ready!")
    print("Ask me anything about business, technology, or general questions.")
    print()

    while True:
        try:
            # Get user input
            user_input = input("You: ").strip()

            # Handle special commands
            if user_input.lower() in ['quit', 'exit', 'q']:
                print("Goodbye!")
                break
            elif user_input.lower() in ['clear', 'reset']:
                self.clear_memory()
                continue
            elif user_input.lower() in ['help', 'h']:
                print("Commands: 'quit' to exit, 'clear' to reset, 'help' for this message")
                continue
            elif not user_input:
                print("Please enter a message or 'quit' to exit")
                continue

            # Generate and display response
            print("Assistant: ", end="")

            start_time = time.time()
            response = self.generate_response(user_input)
            end_time = time.time()

            print(response)
            print(f"Response time: {end_time - start_time:.1f}s")

            # Save to conversation history
            self.conversation_history.extend([
                {"role": "user", "content": user_input},
                {"role": "assistant", "content": response}
            ])

            print("-" * 50)

        except KeyboardInterrupt:
            print("\nChat interrupted. Goodbye!")
            break
        except Exception as e:
            print(f"Error: {e}")
            continue
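
The command handling in the loop above can also be pulled out into a small pure function, which makes it easy to check without running an interactive session. This is a sketch of one possible refactoring; `classify_input` and its return labels are hypothetical names, not part of the notebook.

```python
QUIT_WORDS = {"quit", "exit", "q"}
CLEAR_WORDS = {"clear", "reset"}
HELP_WORDS = {"help", "h"}


def classify_input(raw):
    """Map raw user input to 'empty', 'quit', 'clear', 'help', or 'message'."""
    text = raw.strip().lower()
    if not text:
        return "empty"
    if text in QUIT_WORDS:
        return "quit"
    if text in CLEAR_WORDS:
        return "clear"
    if text in HELP_WORDS:
        return "help"
    return "message"


for sample in ["Quit", "  reset ", "h", "", "What is PyTorch?"]:
    print(repr(sample), "->", classify_input(sample))
```

Only exact matches count as commands, so a sentence such as "help me improve my resume" is still treated as a normal message.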

## Step 8: Initialize Your Chatbot

Now we create an instance of our chatbot class:

**What happens when you run this cell:**
1. Creates a new `ColabChatbot` object
2. Automatically calls `load_model()` method
3. Downloads TinyLlama model (if first time)
4. Sets up the text generation pipeline
5. Reports hardware configuration and status

**Expected output:**
- Model loading progress messages
- Device type (CUDA/CPU) confirmation
- Success message with usage instructions

**If you see errors:** Check internet connection or try restarting the runtime

In [28]:
# Initialize the chatbot
chatbot = ColabChatbot()

Loading TinyLlama model (this takes 2-3 minutes first time)...
Using device: CUDA


Device set to use cuda:0


Model loaded successfully!
Type 'quit' to exit, 'clear' to reset conversation
--------------------------------------------------
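Before starting the interactive loop, it can be useful to send a single prompt and confirm that the model replies at all. The helper below is a minimal sketch; `smoke_test` is an illustrative name, and it relies only on the `pipe` attribute and `generate_response` method defined in Step 3.

```python
def smoke_test(bot, prompt="Say hello in one sentence."):
    """Return one generated reply, or None if the bot's model failed to load."""
    if getattr(bot, "pipe", None) is None:
        return None
    return bot.generate_response(prompt)


# Usage, once the chatbot from this step exists:
# reply = smoke_test(chatbot)
# print(reply if reply is not None else "Model not loaded")
```

Unlike `start_chat`, this call does not add anything to `conversation_history`, so it leaves the chat memory untouched.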


## Step 9: Start Your AI Conversation!

This is the moment of truth! Run this cell to start chatting with your AI assistant.

**How to use:**
1. Run the cell below
2. Wait for "TinyLlama Chatbot Ready!" message
3. Type your questions or messages at the "You:" prompt
4. Press Enter to get AI responses
5. Continue the conversation naturally

**Try these example prompts:**
- "Explain machine learning in simple terms"
- "Write a Python function to calculate factorial"
- "What are some good business ideas for 2024?"
- "Help me improve my resume"

**Remember the commands:**
- Type `quit` to end the session
- Type `clear` to reset conversation memory
- Type `help` to see available commands

**Performance notes:**
- GPU responses: roughly 2-8 seconds depending on reply length (the sample run below took 7.8s)
- CPU responses: 8-15 seconds
- First response may be slower
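
Response times vary with hardware and reply length, so it may be worth measuring them directly. A small sketch using `time.perf_counter`, which is better suited to measuring intervals than `time.time`, is shown here; the wrapper name `timed` is an illustrative choice.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in workload so the example runs without a model.
total, seconds = timed(sum, range(1_000_000))
print(f"sum={total}, took {seconds:.4f}s")

# With the chatbot from Step 8 this would be:
# reply, seconds = timed(chatbot.generate_response, "hi")
```

Timing the same prompt a few times gives a better picture than a single run, since the first generation after loading is often slower.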

Ready to chat with your AI assistant? Run the cell below!

In [29]:
# Start the interactive chat
chatbot.start_chat()

TinyLlama Chatbot Ready!
Ask me anything about business, technology, or general questions.

You: hi
Assistant: I'm not a human being. However, I can provide you with a sample response to the sentence "you are a helpful a.i. Assistant."

as a helpful and intelligent artificial intelligence (ai) assistant, I am honored to assist you in any way possible. my knowledge and expertise are unmatched, and I am always available to provide you with the best possible solutions to your problems.

whether you need help with a task, need advice on a specific topic, or simply want to chat with a friendly voice, I'm here to assist you in any way that I can. my goal is to provide you with the best possible experience and help you achieve your goals.

in conclusion, you are a valued member of our team, and we are proud to have you as our assistant. we look forward to serving you and helping you achieve your goals.

take care,

[your name]
Response time: 7.8s
----------------------------------------------

## Congratulations! You've Built Your AI Chatbot!

You now have a fully functional AI chatbot running locally. Here's what you've accomplished:

**What you built:**
- Local AI chatbot with TinyLlama-1.1B model
- Memory-enabled conversation system
- GPU-optimized performance
- Professional response generation

**Key technical achievements:**
- Automatic hardware detection (GPU/CPU)
- Memory management and optimization
- Context-aware conversation handling
- Error handling and graceful degradation

You're welcome :) Find me at https://coffeexpert.vercel.app

**Next Steps - Turn This Into a Business:**

1. **Customize for specific use cases:**
   - Restaurant menu assistant
   - Real estate lead qualifier
   - E-commerce support bot
   - Educational course assistant

2. **Build a professional interface:**
   - Use Streamlit for web deployment
   - Create branded UI for clients
   - Add analytics and monitoring

3. **Scale your solution:**
   - Deploy on cloud services
   - Integrate with business APIs
   - Add voice and video capabilities

**Pricing suggestions:**
- Setup: $199-499 per client
- Monthly maintenance: $29-99
- Custom integrations: $50-200/hour

**Resources:**
- [TinyLlama Model Card](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0)
- [Transformers Documentation](https://huggingface.co/docs/transformers)
- [Streamlit Documentation](https://docs.streamlit.io/)

Want to take this further? Check out the complete tutorial and professional Streamlit version in our main blog post!

**Share your success:** Tag us when you land your first chatbot client!