# Getting Started with Local LLMs: Your First Steps 🚀

Welcome to the world of local Large Language Models (LLMs)! This notebook will guide you through running your first LLM locally using HuggingFace Transformers, helping you understand each component and concept.

## What you'll learn:
- What are the core components of a local LLM setup
- How to install and configure the necessary libraries
- Load your first model from HuggingFace
- Generate text and understand the process
- Compare different models and configurations
- Understand tokenization and model parameters

## Prerequisites:
- Python 3.8 or higher
- At least 8GB of RAM (16GB+ recommended)
- Basic understanding of Python
- Stable internet connection for initial model download
- A CUDA-based GPU is strongly recommended

## Step 1: Understanding the Components

Before we start coding, let's understand the key components involved in running LLMs locally:

### 🧠 **Large Language Model (LLM)**
A neural network trained on vast amounts of text to understand and generate human-like text.

### 🔤 **Tokenizer**
Converts text into numbers (tokens) that the model can understand, and vice versa.

### 🏗️ **Model Architecture**
The structure of the neural network (e.g., GPT, BERT, T5).

### ⚙️ **Generation Parameters**
Settings that control how text is generated (temperature, top-k, max length, etc.).

### 🤗 **HuggingFace Hub**
A platform hosting thousands of pre-trained models you can download and use.

## Step 2: Install Required Libraries

First, let's install the necessary libraries. Run this cell to install everything we need:

In [1]:
# Install required libraries
# Run this cell if you haven't installed these packages yet

import subprocess
import sys

def install_package(package):
    """Install a package using pip"""
    subprocess.check_call([sys.executable, "-m", "pip", "install", package])

# List of required packages
packages = [
    "transformers",      # HuggingFace transformers library
    "torch",            # PyTorch for model computations
    "accelerate",       # For optimized model loading
    "sentencepiece",    # For certain tokenizers
    "psutil",           # For system monitoring
]

print("📦 Installing required packages...")
for package in packages:
    try:
        install_package(package)
        print(f"✅ {package} installed successfully")
    except Exception as e:
        print(f"❌ Failed to install {package}: {e}")

print("\n🎉 Installation complete!")

📦 Installing required packages...
Collecting transformers
  Downloading transformers-4.56.1-py3-none-any.whl.metadata (42 kB)
Collecting filelock (from transformers)
  Downloading filelock-3.19.1-py3-none-any.whl.metadata (2.1 kB)
Collecting huggingface-hub<1.0,>=0.34.0 (from transformers)
  Downloading huggingface_hub-0.35.0-py3-none-any.whl.metadata (14 kB)
Collecting pyyaml>=5.1 (from transformers)
  Using cached PyYAML-6.0.2-cp312-cp312-macosx_11_0_arm64.whl.metadata (2.1 kB)
Collecting regex!=2019.12.17 (from transformers)
  Downloading regex-2025.9.18-cp312-cp312-macosx_11_0_arm64.whl.metadata (40 kB)
Collecting tokenizers<=0.23.0,>=0.22.0 (from transformers)
  Downloading tokenizers-0.22.1-cp39-abi3-macosx_11_0_arm64.whl.metadata (6.8 kB)
Collecting safetensors>=0.4.3 (from transformers)
  Downloading safetensors-0.6.2-cp38-abi3-macosx_11_0_arm64.whl.metadata (4.1 kB)
Collecting tqdm>=4.27 (from transformers)
  Using cached tqdm-4.67.1-py3-none-any.whl.metadata (57 kB)
Collectin


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m25.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


Collecting torch
  Downloading torch-2.8.0-cp312-none-macosx_11_0_arm64.whl.metadata (30 kB)
Collecting setuptools (from torch)
  Using cached setuptools-80.9.0-py3-none-any.whl.metadata (6.6 kB)
Collecting sympy>=1.13.3 (from torch)
  Using cached sympy-1.14.0-py3-none-any.whl.metadata (12 kB)
Collecting networkx (from torch)
  Using cached networkx-3.5-py3-none-any.whl.metadata (6.3 kB)
Collecting jinja2 (from torch)
  Using cached jinja2-3.1.6-py3-none-any.whl.metadata (2.9 kB)
Collecting mpmath<1.4,>=1.1.0 (from sympy>=1.13.3->torch)
  Using cached mpmath-1.3.0-py3-none-any.whl.metadata (8.6 kB)
Collecting MarkupSafe>=2.0 (from jinja2->torch)
  Using cached MarkupSafe-3.0.2-cp312-cp312-macosx_11_0_arm64.whl.metadata (4.0 kB)
Downloading torch-2.8.0-cp312-none-macosx_11_0_arm64.whl (73.6 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m73.6/73.6 MB[0m [31m4.7 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
[?25hUsing cached sympy-1.14.0-py3-none-any.whl (6.3 MB)



[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m25.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


Collecting accelerate
  Downloading accelerate-1.10.1-py3-none-any.whl.metadata (19 kB)
Downloading accelerate-1.10.1-py3-none-any.whl (374 kB)
Installing collected packages: accelerate
Successfully installed accelerate-1.10.1
✅ accelerate installed successfully



[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m25.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


Collecting sentencepiece
  Downloading sentencepiece-0.2.1-cp312-cp312-macosx_11_0_arm64.whl.metadata (10 kB)
Downloading sentencepiece-0.2.1-cp312-cp312-macosx_11_0_arm64.whl (1.3 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.3/1.3 MB[0m [31m1.7 MB/s[0m eta [36m0:00:00[0ma [36m0:00:01[0m
[?25hInstalling collected packages: sentencepiece
Successfully installed sentencepiece-0.2.1
✅ sentencepiece installed successfully



[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m25.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


✅ psutil installed successfully

🎉 Installation complete!



[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m25.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


## Step 3: Import Libraries and Check Environment

Let's import the necessary libraries and check our environment:

In [2]:
# Import required libraries
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import time
import psutil
import warnings

# Suppress some warnings for cleaner output
warnings.filterwarnings("ignore", category=UserWarning)

# Check system information
print("🖥️  SYSTEM INFORMATION")
print("=" * 50)
print(f"Python version: {sys.version}")
print(f"PyTorch version: {torch.__version__}")
print(f"Available RAM: {psutil.virtual_memory().total / (1024**3):.1f} GB")
print(f"Available disk space: {psutil.disk_usage('/').free / (1024**3):.1f} GB")

# Check for GPU availability
if torch.cuda.is_available():
    print(f"🚀 GPU available: {torch.cuda.get_device_name(0)}")
    print(f"   CUDA version: {torch.version.cuda}")
    device = "cuda"
elif torch.backends.mps.is_available():
    print("🚀 Apple Silicon GPU (MPS) available")
    device = "mps"
else:
    print("💻 Using CPU (GPU not available)")
    device = "cpu"

print(f"\n🎯 Selected device: {device}")
print("\n✅ Environment check complete!")

  from .autonotebook import tqdm as notebook_tqdm


🖥️  SYSTEM INFORMATION
Python version: 3.12.7 | packaged by Anaconda, Inc. | (main, Oct  4 2024, 08:22:19) [Clang 14.0.6 ]
PyTorch version: 2.8.0
Available RAM: 24.0 GB
Available disk space: 472.9 GB
🚀 Apple Silicon GPU (MPS) available

🎯 Selected device: mps

✅ Environment check complete!


## Step 4: Choose and Load Your First Model

For your first experience, we'll use a small, efficient model. We'll start with **distilgpt2** - a smaller, faster version of GPT-2 that's perfect for beginners.

In [3]:
# Model configuration
MODEL_NAME = "distilgpt2"  # Small, fast model for beginners
# Alternative models you can try:
# MODEL_NAME = "gpt2"                    # Classic GPT-2 base model
# MODEL_NAME = "microsoft/DialoGPT-small" # Conversational model
# MODEL_NAME = "facebook/opt-125m"        # Meta's OPT model (125M parameters)

print(f"📥 Loading model: {MODEL_NAME}")
print("⏳ This may take a few minutes on first run (downloading model)...")
print()

start_time = time.time()

try:
    # Load tokenizer
    print("🔤 Loading tokenizer...")
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    
    # Add padding token if it doesn't exist
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    
    print("✅ Tokenizer loaded successfully")
    
    # Load model
    print("🧠 Loading model...")
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        torch_dtype=torch.float16 if device != "cpu" else torch.float32,
        device_map="auto" if device != "cpu" else None,
        low_cpu_mem_usage=True
    )
    
    if device == "cpu":
        model = model.to(device)
    
    print("✅ Model loaded successfully")
    
    # Calculate loading time
    load_time = time.time() - start_time
    
    print(f"\n⏱️  Total loading time: {load_time:.2f} seconds")
    print(f"📊 Model parameters: {model.num_parameters():,}")
    print(f"💾 Model size: ~{model.num_parameters() * 2 / 1e9:.1f} GB")
    
except Exception as e:
    print(f"❌ Error loading model: {e}")
    print("💡 Try using a different model or check your internet connection")

📥 Loading model: distilgpt2
⏳ This may take a few minutes on first run (downloading model)...

🔤 Loading tokenizer...
✅ Tokenizer loaded successfully
🧠 Loading model...


`torch_dtype` is deprecated! Use `dtype` instead!


✅ Model loaded successfully

⏱️  Total loading time: 185.65 seconds
📊 Model parameters: 81,912,576
💾 Model size: ~0.2 GB


## Step 5: Understanding Tokenization

Before generating text, let's understand how tokenization works - the process of converting text to numbers that the model can understand.

In [5]:
# Demonstration of tokenization
sample_text = "Hello! How are you doing today?"

print("🔤 TOKENIZATION DEMONSTRATION")
print("=" * 50)
print(f"Original text: '{sample_text}'")
print()

# Tokenize the text
tokens = tokenizer.encode(sample_text)
print(f"Token IDs: {tokens}")
print(f"Number of tokens: {len(tokens)}")
print()

# Show individual tokens
print("Individual tokens:")
for i, token_id in enumerate(tokens):
    token_text = tokenizer.decode([token_id])
    print(f"  {i}: {token_id} → '{token_text}'")

print()

# Demonstrate decoding
decoded_text = tokenizer.decode(tokens)
print(f"Decoded back to text: '{decoded_text}'")

# Show vocabulary size
vocab_size = tokenizer.vocab_size
print(f"\n📚 Tokenizer vocabulary size: {vocab_size:,} tokens")

print("\n💡 Key takeaways:")
print("   - Text is split into subword tokens")
print("   - Each token has a unique numerical ID")
print("   - The model works with these IDs, not raw text")
print("   - Common words are usually single tokens")
print("   - Rare words might be split into multiple tokens")

🔤 TOKENIZATION DEMONSTRATION
Original text: 'Hello! How are you doing today?'

Token IDs: [15496, 0, 1374, 389, 345, 1804, 1909, 30]
Number of tokens: 8

Individual tokens:
  0: 15496 → 'Hello'
  1: 0 → '!'
  2: 1374 → ' How'
  3: 389 → ' are'
  4: 345 → ' you'
  5: 1804 → ' doing'
  6: 1909 → ' today'
  7: 30 → '?'

Decoded back to text: 'Hello! How are you doing today?'

📚 Tokenizer vocabulary size: 50,257 tokens

💡 Key takeaways:
   - Text is split into subword tokens
   - Each token has a unique numerical ID
   - The model works with these IDs, not raw text
   - Common words are usually single tokens
   - Rare words might be split into multiple tokens


## Step 6: Your First Text Generation

Now let's generate our first text! We'll start with a simple prompt and understand each step of the process.

In [6]:
def generate_text(prompt, max_length=50, temperature=0.7, do_sample=True):
    """
    Generate text using our loaded model.
    
    Args:
        prompt (str): The input text to continue
        max_length (int): Maximum total length (prompt + generated)
        temperature (float): Controls randomness (0.1 = conservative, 1.0 = creative)
        do_sample (bool): Whether to use sampling or greedy decoding
    """
    print(f"🤖 Generating text from prompt: '{prompt}'")
    print(f"⚙️  Parameters: max_length={max_length}, temperature={temperature}, do_sample={do_sample}")
    print("-" * 50)
    
    start_time = time.time()
    
    try:
        # Encode the prompt
        input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)
        print(f"📝 Prompt tokens: {len(input_ids[0])}")
        
        # Generate text
        with torch.no_grad():
            outputs = model.generate(
                input_ids,
                max_length=max_length,
                temperature=temperature,
                do_sample=do_sample,
                pad_token_id=tokenizer.eos_token_id,
                attention_mask=input_ids.ne(tokenizer.pad_token_id)
            )
        
        # Decode the generated text
        generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
        
        # Extract only the newly generated part
        new_text = generated_text[len(prompt):].strip()
        
        generation_time = time.time() - start_time
        total_tokens = len(outputs[0])
        new_tokens = total_tokens - len(input_ids[0])
        
        print(f"✅ Generation complete!")
        print(f"⏱️  Time taken: {generation_time:.2f} seconds")
        print(f"📊 New tokens generated: {new_tokens}")
        print(f"🚀 Generation speed: {new_tokens/generation_time:.1f} tokens/second")
        print()
        print("📝 COMPLETE OUTPUT:")
        print("=" * 30)
        print(f"Prompt: {prompt}")
        print(f"Generated: {new_text}")
        print("=" * 30)
        print(f"Full text: {generated_text}")
        
        return generated_text
        
    except Exception as e:
        print(f"❌ Error during generation: {e}")
        return None

# Test with a simple prompt
first_prompt = "The future of artificial intelligence is"
result = generate_text(first_prompt, max_length=80, temperature=0.7)

🤖 Generating text from prompt: 'The future of artificial intelligence is'
⚙️  Parameters: max_length=80, temperature=0.7, do_sample=True
--------------------------------------------------
📝 Prompt tokens: 6
✅ Generation complete!
⏱️  Time taken: 6.07 seconds
📊 New tokens generated: 74
🚀 Generation speed: 12.2 tokens/second

📝 COMPLETE OUTPUT:
Prompt: The future of artificial intelligence is
Generated: not yet clear. However, in an interview with the New York Times, David O'Keefe, director of the National Security Agency, said, "You can't predict if and how this could change."
Full text: The future of artificial intelligence is not yet clear. However, in an interview with the New York Times, David O'Keefe, director of the National Security Agency, said, "You can't predict if and how this could change."



































## Step 7: Understanding Generation Parameters

Let's experiment with different parameters to see how they affect the output:

In [7]:
# Test different parameters with the same prompt
test_prompt = "Once upon a time in a magical forest"

print("🧪 PARAMETER EXPERIMENTS")
print("=" * 60)
print(f"Testing prompt: '{test_prompt}'")
print()

# Experiment 1: Low temperature (conservative)
print("🥶 EXPERIMENT 1: Low Temperature (Conservative)")
generate_text(test_prompt, max_length=70, temperature=0.1)
print("\n" + "-"*60 + "\n")

# Experiment 2: High temperature (creative)
print("🔥 EXPERIMENT 2: High Temperature (Creative)")
generate_text(test_prompt, max_length=70, temperature=1.2)
print("\n" + "-"*60 + "\n")

# Experiment 3: Greedy decoding (deterministic)
print("🎯 EXPERIMENT 3: Greedy Decoding (Deterministic)")
generate_text(test_prompt, max_length=70, temperature=1.0, do_sample=False)
print("\n" + "-"*60 + "\n")

print("📚 PARAMETER EXPLANATION:")
print("-" * 40)
print("🌡️  Temperature:")
print("   • Low (0.1-0.5): More predictable, focused outputs")
print("   • Medium (0.6-0.9): Balanced creativity and coherence")
print("   • High (1.0+): More creative but potentially less coherent")
print()
print("🎲 Sampling:")
print("   • do_sample=True: Uses temperature and randomness")
print("   • do_sample=False: Greedy decoding (always picks most likely token)")
print()
print("📏 Max Length:")
print("   • Controls total output length (prompt + generated text)")
print("   • Longer = more content but slower generation")

🧪 PARAMETER EXPERIMENTS
Testing prompt: 'Once upon a time in a magical forest'

🥶 EXPERIMENT 1: Low Temperature (Conservative)
🤖 Generating text from prompt: 'Once upon a time in a magical forest'
⚙️  Parameters: max_length=70, temperature=0.1, do_sample=True
--------------------------------------------------
📝 Prompt tokens: 8
✅ Generation complete!
⏱️  Time taken: 1.25 seconds
📊 New tokens generated: 62
🚀 Generation speed: 49.5 tokens/second

📝 COMPLETE OUTPUT:
Prompt: Once upon a time in a magical forest
Generated: , a man is born.
Full text: Once upon a time in a magical forest, a man is born.

























































------------------------------------------------------------

🔥 EXPERIMENT 2: High Temperature (Creative)
🤖 Generating text from prompt: 'Once upon a time in a magical forest'
⚙️  Parameters: max_length=70, temperature=1.2, do_sample=True
--------------------------------------------------
📝 Prompt tokens: 8
✅ Generation complete!
⏱️  Ti

## Step 8: Using HuggingFace Pipelines (Simplified Approach)

HuggingFace also provides a simpler way to use models through pipelines:

In [10]:
# Create a text generation pipeline
print("🚰 Creating HuggingFace Pipeline...")

try:
    # Create pipeline with proper device handling
    if device == "cuda":
        # For CUDA, specify GPU device
        generator = pipeline(
            "text-generation",
            model=model,
            tokenizer=tokenizer,
            device=0,  # GPU device
            torch_dtype=torch.float16
        )
    elif device == "mps":
        # For Apple Silicon, don't specify device as model is already on MPS
        generator = pipeline(
            "text-generation",
            model=model,
            tokenizer=tokenizer,
            torch_dtype=torch.float16
        )
    else:
        # For CPU
        generator = pipeline(
            "text-generation",
            model=model,
            tokenizer=tokenizer,
            device=-1,  # CPU
            torch_dtype=torch.float32
        )
    
    print("✅ Pipeline created successfully!")
    print(f"🎯 Pipeline device: {device}")
    print()
    
    # Generate text using the pipeline
    pipeline_prompt = "The benefits of running AI models locally include"
    
    print(f"🤖 Pipeline generation with prompt: '{pipeline_prompt}'")
    print("-" * 50)
    
    start_time = time.time()
    
    # Generate multiple outputs
    outputs = generator(
        pipeline_prompt,
        max_length=100,
        num_return_sequences=2,  # Generate 2 different outputs
        temperature=0.8,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )
    
    generation_time = time.time() - start_time
    
    print(f"⏱️  Generation time: {generation_time:.2f} seconds")
    print()
    
    for i, output in enumerate(outputs, 1):
        print(f"📝 OUTPUT {i}:")
        print("-" * 20)
        print(output['generated_text'])
        print()
    
    print("💡 Pipeline advantages:")
    print("   • Simpler API")
    print("   • Automatic handling of tokenization")
    print("   • Built-in optimizations")
    print("   • Easy to generate multiple outputs")
    print("   • Works seamlessly across different devices (CPU, CUDA, MPS)")
    
except Exception as e:
    print(f"❌ Error creating pipeline: {e}")
    print("💡 Troubleshooting tips:")
    print("   • Model might be loaded with accelerate")
    print("   • Try reloading the model without device_map='auto'")
    print("   • For Apple Silicon, ensure you have the latest PyTorch with MPS support")

Device set to use mps
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Both `max_new_tokens` (=256) and `max_length`(=100) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


🚰 Creating HuggingFace Pipeline...
✅ Pipeline created successfully!
🎯 Pipeline device: mps

🤖 Pipeline generation with prompt: 'The benefits of running AI models locally include'
--------------------------------------------------
⏱️  Generation time: 9.22 seconds

📝 OUTPUT 1:
--------------------
The benefits of running AI models locally include:
































































































































































































































































📝 OUTPUT 2:
--------------------
The benefits of running AI models locally include a more sophisticated and better understanding of the effects of network connectivity and information exchange, as well as the benefits of a networked network with a larger audience.


































































































💡 Pipeline advantages:
   • Simp

## Step 9: Performance Measurement

Let's measure the performance of our model and understand resource usage:

In [11]:
def measure_model_performance(prompt, runs=3):
    """
    Measure the performance of our current model.
    """
    print(f"📊 PERFORMANCE MEASUREMENT")
    print("=" * 50)
    print(f"Model: {MODEL_NAME}")
    print(f"Device: {device}")
    print(f"Test prompt: '{prompt}'")
    print(f"Number of runs: {runs}")
    print()
    
    times = []
    token_counts = []
    
    for run in range(runs):
        print(f"🏃 Run {run + 1}/{runs}...")
        
        start_time = time.time()
        
        try:
            input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)
            
            with torch.no_grad():
                outputs = model.generate(
                    input_ids,
                    max_length=len(input_ids[0]) + 30,
                    temperature=0.7,
                    do_sample=True,
                    pad_token_id=tokenizer.eos_token_id
                )
            
            end_time = time.time()
            generation_time = end_time - start_time
            
            new_tokens = len(outputs[0]) - len(input_ids[0])
            
            times.append(generation_time)
            token_counts.append(new_tokens)
            
            print(f"   Time: {generation_time:.2f}s, Tokens: {new_tokens}")
            
        except Exception as e:
            print(f"   ❌ Error in run {run + 1}: {e}")
    
    if times:
        avg_time = sum(times) / len(times)
        avg_tokens = sum(token_counts) / len(token_counts)
        avg_speed = avg_tokens / avg_time
        
        print(f"\n📈 PERFORMANCE RESULTS:")
        print(f"   Average time: {avg_time:.2f} seconds")
        print(f"   Average tokens generated: {avg_tokens:.1f}")
        print(f"   Average speed: {avg_speed:.1f} tokens/second")
        print(f"   Fastest run: {min(times):.2f} seconds")
        print(f"   Slowest run: {max(times):.2f} seconds")
        
        return {
            "avg_time": avg_time,
            "avg_tokens": avg_tokens,
            "avg_speed": avg_speed
        }
    
    return None

# Measure performance
perf_prompt = "The key to successful machine learning is"
performance_results = measure_model_performance(perf_prompt, runs=3)

# Display model information
print(f"\n📋 MODEL INFORMATION:")
print(f"   Name: {MODEL_NAME}")
print(f"   Parameters: {model.num_parameters():,}")
if device == "cuda" and torch.cuda.is_available():
    print(f"   GPU memory usage: {torch.cuda.memory_allocated() / 1e9:.1f} GB")
else:
    print(f"   Running on: {device}")
print(f"   Tokenizer vocab size: {tokenizer.vocab_size:,}")

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


📊 PERFORMANCE MEASUREMENT
Model: distilgpt2
Device: mps
Test prompt: 'The key to successful machine learning is'
Number of runs: 3

🏃 Run 1/3...
   Time: 1.05s, Tokens: 30
🏃 Run 2/3...
   Time: 0.34s, Tokens: 30
🏃 Run 3/3...
   Time: 0.34s, Tokens: 30

📈 PERFORMANCE RESULTS:
   Average time: 0.57 seconds
   Average tokens generated: 30.0
   Average speed: 52.3 tokens/second
   Fastest run: 0.34 seconds
   Slowest run: 1.05 seconds

📋 MODEL INFORMATION:
   Name: distilgpt2
   Parameters: 81,912,576
   Running on: mps
   Tokenizer vocab size: 50,257


## Step 10: Understanding Different Model Types

Let's explore different types of models you can use locally:

In [12]:
# Information about different model types
print("🧠 UNDERSTANDING DIFFERENT MODEL TYPES")
print("=" * 60)
print()

model_info = {
    "Causal Language Models (GPT-style)": {
        "description": "Generate text by predicting the next token",
        "examples": ["gpt2", "distilgpt2", "microsoft/DialoGPT-small"],
        "use_cases": ["Text generation", "Chatbots", "Creative writing"],
        "pros": ["Great for open-ended generation", "Conversational"],
        "cons": ["Can be repetitive", "May generate false information"]
    },
    "Instruction-Tuned Models": {
        "description": "Trained to follow instructions and be helpful",
        "examples": ["facebook/opt-iml-1.3b", "EleutherAI/gpt-j-6B"],
        "use_cases": ["Question answering", "Task assistance", "Chatbots"],
        "pros": ["Better at following instructions", "More helpful responses"],
        "cons": ["May be less creative", "Can be overly cautious"]
    },
    "Specialized Models": {
        "description": "Trained for specific tasks or domains",
        "examples": ["bert-base-uncased", "facebook/bart-large-cnn"],
        "use_cases": ["Text classification", "Summarization", "Translation"],
        "pros": ["Excellent at specific tasks", "Often smaller and faster"],
        "cons": ["Limited to specific use cases", "Not for general chat"]
    }
}

for model_type, info in model_info.items():
    print(f"🏷️  {model_type}")
    print(f"   📝 Description: {info['description']}")
    print(f"   🔧 Examples: {', '.join(info['examples'])}")
    print(f"   💼 Use cases: {', '.join(info['use_cases'])}")
    print(f"   ✅ Pros: {', '.join(info['pros'])}")
    print(f"   ⚠️  Cons: {', '.join(info['cons'])}")
    print()

print("🎯 CHOOSING THE RIGHT MODEL:")
print("-" * 40)
print("For beginners, start with:")
print("• distilgpt2 - Fastest, smallest GPT-2 variant")
print("• gpt2 - Classic text generation model")
print("• microsoft/DialoGPT-small - Conversational AI")
print()
print("Consider these factors:")
print("• Model size (larger = better quality, slower speed)")
print("• Your hardware (RAM, GPU availability)")
print("• Intended use case (chat, generation, specific tasks)")
print("• Download time and storage space")

🧠 UNDERSTANDING DIFFERENT MODEL TYPES

🏷️  Causal Language Models (GPT-style)
   📝 Description: Generate text by predicting the next token
   🔧 Examples: gpt2, distilgpt2, microsoft/DialoGPT-small
   💼 Use cases: Text generation, Chatbots, Creative writing
   ✅ Pros: Great for open-ended generation, Conversational
   ⚠️  Cons: Can be repetitive, May generate false information

🏷️  Instruction-Tuned Models
   📝 Description: Trained to follow instructions and be helpful
   🔧 Examples: facebook/opt-iml-1.3b, EleutherAI/gpt-j-6B
   💼 Use cases: Question answering, Task assistance, Chatbots
   ✅ Pros: Better at following instructions, More helpful responses
   ⚠️  Cons: May be less creative, Can be overly cautious

🏷️  Specialized Models
   📝 Description: Trained for specific tasks or domains
   🔧 Examples: bert-base-uncased, facebook/bart-large-cnn
   💼 Use cases: Text classification, Summarization, Translation
   ✅ Pros: Excellent at specific tasks, Often smaller and faster
   ⚠️  Cons: L

## Step 11: Best Practices and Tips

Here are some important best practices for working with local LLMs:

In [13]:
# Best practices and tips
print("💡 BEST PRACTICES FOR LOCAL LLMS")
print("=" * 50)
print()

tips = {
    "🚀 Performance Optimization": [
        "Use GPU when available (CUDA or Apple Silicon)",
        "Use torch.float16 for faster inference on compatible hardware",
        "Consider model quantization for memory efficiency",
        "Batch multiple requests when possible",
        "Use torch.no_grad() for inference to save memory"
    ],
    "💾 Memory Management": [
        "Monitor system memory usage, especially with large models",
        "Use device_map='auto' for automatic GPU memory distribution",
        "Clear model cache when switching between models",
        "Consider using smaller models for development/testing",
        "Use low_cpu_mem_usage=True when loading models"
    ],
    "🔧 Generation Quality": [
        "Experiment with temperature settings (0.1-1.2)",
        "Use appropriate max_length to avoid truncation",
        "Add proper context and prompting for better results",
        "Consider using top_k and top_p sampling",
        "Set appropriate stopping criteria"
    ],
    "🛡️ Safety and Ethics": [
        "Implement content filtering for production use",
        "Be aware of potential biases in model outputs",
        "Don't rely on model outputs for factual information",
        "Consider privacy implications of local vs cloud models",
        "Monitor and log model usage for debugging"
    ],
    "📚 Development Workflow": [
        "Start with small models for prototyping",
        "Test thoroughly before deploying larger models",
        "Version control your model configurations",
        "Document your prompting strategies",
        "Keep track of model versions and performance metrics"
    ]
}

for category, tip_list in tips.items():
    print(f"{category}:")
    for tip in tip_list:
        print(f"   • {tip}")
    print()

# Resource usage check
print("📊 CURRENT RESOURCE USAGE:")
print("-" * 30)
memory = psutil.virtual_memory()
print(f"RAM usage: {memory.percent}% ({memory.used / 1e9:.1f}GB / {memory.total / 1e9:.1f}GB)")

if device == "cuda" and torch.cuda.is_available():
    gpu_memory = torch.cuda.memory_allocated() / 1e9
    gpu_total = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"GPU memory: {gpu_memory:.1f}GB / {gpu_total:.1f}GB")
elif device == "cpu":
    print("Running on CPU (no GPU memory usage)")

print(f"\n💡 Pro tip: The model '{MODEL_NAME}' is using approximately:")
print(f"   • {model.num_parameters() * 2 / 1e9:.1f}GB of storage space")
print(f"   • {model.num_parameters() * 4 / 1e9:.1f}GB of RAM when loaded (float32)")
print(f"   • {model.num_parameters() * 2 / 1e9:.1f}GB of RAM when loaded (float16)")

💡 BEST PRACTICES FOR LOCAL LLMS

🚀 Performance Optimization:
   • Use GPU when available (CUDA or Apple Silicon)
   • Use torch.float16 for faster inference on compatible hardware
   • Consider model quantization for memory efficiency
   • Batch multiple requests when possible
   • Use torch.no_grad() for inference to save memory

💾 Memory Management:
   • Monitor system memory usage, especially with large models
   • Use device_map='auto' for automatic GPU memory distribution
   • Clear model cache when switching between models
   • Consider using smaller models for development/testing
   • Use low_cpu_mem_usage=True when loading models

🔧 Generation Quality:
   • Experiment with temperature settings (0.1-1.2)
   • Use appropriate max_length to avoid truncation
   • Add proper context and prompting for better results
   • Consider using top_k and top_p sampling
   • Set appropriate stopping criteria

🛡️ Safety and Ethics:
   • Implement content filtering for production use
   • Be awa

## 🎯 Summary and Next Steps

Congratulations! You've successfully completed your first journey into local LLMs. Here's what you've accomplished:

### ✅ What you learned:
1. **Core Components**: Understanding LLMs, tokenizers, and generation parameters
2. **Model Loading**: How to download and load models from HuggingFace
3. **Text Generation**: Creating text with different parameters and approaches
4. **Tokenization**: How text is converted to numbers and back
5. **Performance**: Measuring speed and resource usage
6. **Best Practices**: Optimization and safety considerations

### 🔑 Key concepts mastered:
- **Local vs Cloud**: Privacy, control, and offline capabilities
- **Model Types**: Causal LMs, instruction-tuned, and specialized models
- **Generation Parameters**: Temperature, sampling, max length
- **Resource Management**: Memory, GPU usage, performance optimization

### 🚀 Suggested next steps:

#### Immediate experiments:
1. **Try different models**: Load `gpt2`, `microsoft/DialoGPT-small`, or `facebook/opt-125m`
2. **Experiment with parameters**: Test different temperature and sampling settings
3. **Create longer conversations**: Build a chat interface with memory
4. **Specialized tasks**: Try summarization, question-answering, or creative writing

#### Advanced projects:
1. **Fine-tuning**: Customize models for your specific use case
2. **RAG (Retrieval-Augmented Generation)**: Combine models with external knowledge
3. **Model optimization**: Quantization, pruning, and distillation
4. **Production deployment**: API servers, containerization, scaling

#### Learning resources:
- [HuggingFace Transformers Documentation](https://huggingface.co/docs/transformers)
- [HuggingFace Model Hub](https://huggingface.co/models)
- [PyTorch Tutorials](https://pytorch.org/tutorials/)
- [Prompt Engineering Guide](https://www.promptingguide.ai/)

### 💡 Remember:
- Start small and iterate
- Local models give you privacy and control
- Experiment with different models for different tasks
- Monitor resource usage and optimize accordingly
- Always consider safety and ethical implications

### 🔄 Compare with other approaches:
You can also try the Ollama approach for an even simpler setup:
- Check out `getting_started_ollama.ipynb` for a different approach
- Ollama provides pre-optimized models with simple API
- Both approaches have their advantages depending on your use case

Happy experimenting with your local LLMs! 🎉