# 1: Chapter 4 Notebook

*Notebook companion for Chapter 4 of Data Strategy for LLMs*


## Environment Setup

### Automatic environment setup (no action required)

To keep this notebook easy to run and reproducible, we prepare a clean Python environment automatically:

- Creates a small virtual environment in `.rag_env` if needed.
- Silently installs required packages (OpenAI, ChromaDB, etc.) without clutter.
- Registers a Jupyter kernel and injects the environment into this session.
- Sets `RAG_ENV_READY=1` so later cells know the environment is ready.

If your system already has the requirements, this cell simply confirms success and moves on. You don’t need to change anything; just run it once at the top.

In [1]:
# Auto Environment Bootstrap (no user action, quiet)
import sys, os, subprocess, platform, pathlib

REQUIRED = ['openai','python-dotenv','chromadb','tiktoken','packaging','nltk','ipykernel']
BASE = pathlib.Path.cwd()
VENV_DIR = BASE / '.rag_env'

def venv_python(venv_dir: pathlib.Path) -> str:
    if platform.system() == 'Windows':
        return str(venv_dir / 'Scripts' / 'python.exe')
    return str(venv_dir / 'bin' / 'python')

def site_packages_path(venv_dir: pathlib.Path):
    if platform.system() == 'Windows':
        return venv_dir / 'Lib' / 'site-packages'
    lib_parent = venv_dir / 'lib'
    cand = [p for p in lib_parent.glob('python*') if p.is_dir()]
    return (cand[0] / 'site-packages') if cand else None

def ensure_env():
    if not VENV_DIR.exists():
        subprocess.run([sys.executable, '-m', 'venv', str(VENV_DIR)], check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    py = venv_python(VENV_DIR)
    subprocess.run([py, '-m', 'pip', 'install', '--upgrade', 'pip', '-q'], check=False,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    for pkg in REQUIRED:
        subprocess.run([py, '-m', 'pip', 'install', pkg, '-q'], check=False,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    try:
        subprocess.run([py, '-m', 'ipykernel', 'install', '--user', '--name', 'rag-env',
                        '--display-name', 'Python (RAG Env)'], check=False,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    except Exception:
        pass
    sp = site_packages_path(VENV_DIR)
    if sp and str(sp) not in sys.path:
        sys.path.insert(0, str(sp))

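# pip distribution names whose importable module name differs
IMPORT_NAMES = {'python-dotenv': 'dotenv'}

def needs_env(pkgs):
    for p in pkgs:
        try:
            __import__(IMPORT_NAMES.get(p, p.replace('-', '_')))
        except Exception:
            return True
    return False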

try:
    if needs_env(REQUIRED):
        ensure_env()
        os.environ['RAG_ENV_READY'] = '1'
        print('SUCCESS: Isolated RAG environment prepared and activated in-session')
    else:
        os.environ['RAG_ENV_READY'] = '1'
        print('SUCCESS: System environment already satisfies requirements')
except Exception as e:
    # Keep quiet and never fail the run
    print('INFO: Auto environment setup skipped:', str(e)[:120])

SUCCESS: Isolated RAG environment prepared and activated in-session
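
To see which requirements the current interpreter can actually import, a small check like the one below may help. It is a minimal sketch and not part of the bootstrap cell: `missing_modules` and `IMPORT_NAMES` are hypothetical names introduced here, and the mapping exists because the `python-dotenv` distribution is imported as `dotenv`.

```python
import importlib.util

# Hypothetical mapping from pip distribution names to importable module names.
# Only names that differ need an entry; `python-dotenv` is imported as `dotenv`.
IMPORT_NAMES = {"python-dotenv": "dotenv"}

def missing_modules(packages):
    """Return the packages whose top-level module cannot be found."""
    missing = []
    for pkg in packages:
        module = IMPORT_NAMES.get(pkg, pkg.replace("-", "_"))
        # find_spec returns None when the module is not importable
        if importlib.util.find_spec(module) is None:
            missing.append(pkg)
    return missing

# Standard-library modules are used here so the check runs anywhere
print(missing_modules(["json", "os"]))
```

Passing the notebook's `REQUIRED` list instead of the standard-library names would report any package that still needs installing.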


### Jupyter Kernel Setup Fix

**If you're seeing an error like "Running cells with 'Python X.X.X' requires the ipykernel package", this cell will fix it!**

This is a common issue, especially on:
- Fresh Python installations
- Homebrew-managed Python environments on macOS
- Systems with multiple Python versions

**Run the cell below to automatically detect your Python environment and install the correct kernel.**

In [2]:
import sys
import subprocess
import os

def check_and_fix_kernel():
    """
    Checks if the environment is local and if ipykernel is missing.
    If both conditions are true, it attempts to install the kernel.
    """
    # Step 1: Detect if running in Google Colab
    if 'google.colab' in sys.modules:
        print(" Running in Google Colab. No kernel fix needed.")
        return

    # Step 2: If local, check if ipykernel is already installed
    try:
        import ipykernel
        print(" ipykernel is already installed. No fix needed.")
        return
    except ImportError:
        print(" ipykernel not found. Attempting installation...")

    # Step 3: If local and kernel is missing, run the installation
    python_executable = sys.executable
    python_version = f"{sys.version_info.major}.{sys.version_info.minor}.{sys.version_info.micro}"
    
    print(f"DETECTED Python: {python_executable}")
    print(f"PYTHON VERSION: {python_version}")
    
    # Method 1: Try standard installation
    try:
        subprocess.run(
            [python_executable, '-m', 'pip', 'install', 'ipykernel', '-U', '--user', '--force-reinstall'],
            capture_output=True, text=True, check=True
        )
        print("SUCCESS: Successfully installed ipykernel (Method 1)")
        method_used = 1
    except subprocess.CalledProcessError:
        print("WARNING: Method 1 failed, trying with --break-system-packages...")
        # Method 2: Try with --break-system-packages
        try:
            subprocess.run(
                [python_executable, '-m', 'pip', 'install', 'ipykernel', '-U', '--user', '--force-reinstall', '--break-system-packages'],
                capture_output=True, text=True, check=True
            )
            print("SUCCESS: Successfully installed ipykernel (Method 2 - with system override)")
            method_used = 2
        except subprocess.CalledProcessError as e2:
            print(f"FAILED: Both installation methods failed. Error: {e2.stderr}")
            print("\nConsider creating a virtual environment manually.")
            return

    # Install kernel spec for the current Python
    try:
        kernel_name = f"python{sys.version_info.major}{sys.version_info.minor}"
        display_name = f"Python {python_version}"
        
        subprocess.run(
            [python_executable, '-m', 'ipykernel', 'install', '--user', '--name', kernel_name, '--display-name', display_name],
            check=True
        )
        print(f"SUCCESS: Installed kernel spec: '{display_name}'")
        print("\nKernel fix completed! Please RESTART your Jupyter server and select the new kernel.")
    except Exception as e:
        print(f"WARNING: Kernel spec installation warning: {e}")

# Run the check and fix function
check_and_fix_kernel()

 ipykernel is already installed. No fix needed.


#### What This Fix Does

The cell above automatically handles the most common kernel installation scenarios:

**Method 1 - Standard Installation:**
- Tries the standard `pip install ipykernel` approach
- Works for most regular Python installations

**Method 2 - System Override (Homebrew/Externally Managed):**
- Uses `--break-system-packages` flag for Homebrew Python
- Handles "externally-managed-environment" errors
- Essential for macOS Homebrew Python environments

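**Method 3 - Manual Virtual Environment (not automated):**
- If both methods fail, the cell prints the error and suggests creating a virtual environment manually
- Installing ipykernel inside that environment keeps it isolated from the system Python
- Register the environment as a kernel with `python -m ipykernel install --user --name <name>`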

**After running the fix:**
- Your Jupyter interface should show available kernels
- Select the one that matches your Python version
- All notebook cells should run without kernel errors

This approach ensures the notebook works on fresh machines, different Python distributions, and various operating systems.
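
The "externally-managed-environment" error that Method 2 works around comes from PEP 668, which lets a distribution mark its Python as managed by the system package manager. The sketch below checks for that marker before any install is attempted. It is an illustration under stated assumptions, not part of the fix cell: `in_virtualenv` and `is_externally_managed` are hypothetical helper names, and the marker location follows the PEP's description of an `EXTERNALLY-MANAGED` file in the standard-library directory.

```python
import sys
import sysconfig
from pathlib import Path

def in_virtualenv() -> bool:
    """True when running inside a venv (prefix differs from the base install)."""
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)

def is_externally_managed() -> bool:
    """True when the interpreter carries a PEP 668 EXTERNALLY-MANAGED marker."""
    stdlib_dir = sysconfig.get_path("stdlib")
    if stdlib_dir is None:
        return False
    return (Path(stdlib_dir) / "EXTERNALLY-MANAGED").exists()

# The marker only restricts installs outside a virtual environment
if is_externally_managed() and not in_virtualenv():
    print("System Python is externally managed; prefer a virtual environment.")
else:
    print("No PEP 668 restriction detected for this interpreter.")
```

Checking first makes it possible to skip straight to a virtual environment rather than overriding the system protection with `--break-system-packages`.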

## Complete Future-Proof OpenAI Setup
### Comprehensive Error Handling & API Evolution Adaptation

This notebook provides a robust OpenAI API setup that handles common errors and tries to adapt to future API changes:

- **Error Handling:** Billing, authentication, model deprecation, rate limits, network issues
- **Future-Proofing:** SDK version compatibility, adaptive response parsing, flexible error patterns
- **Cross-Platform:** Local Jupyter, Google Colab, Python 3.8+

#### API Key Setup

Before we dive into the architecture, let's set up our environment to work with OpenAI. For this book, I'm using OpenAI as our primary LLM provider. It's not the only option - you could use Anthropic's Claude or local models with Ollama - but OpenAI gives us access to multiple models through a single API. I chose OpenAI for this book because it is easy to use and offers many models behind one unified API. Note that API usage is billed: you need to add credits to your account before the calls in this notebook will succeed.

In [3]:
# Smart Environment Setup
import sys, os, subprocess, importlib.util

IN_COLAB = 'google.colab' in sys.modules
print(f"Environment: {'Google Colab' if IN_COLAB else 'Local Jupyter'}")

def smart_install(package, min_version=None):
    """Install packages with multiple fallback strategies"""
    package_spec = f"{package}>={min_version}" if min_version else package
    strategies = [
        [sys.executable, '-m', 'pip', 'install', package_spec, '--quiet'],
        [sys.executable, '-m', 'pip', 'install', package_spec, '--user', '--quiet'],
        [sys.executable, '-m', 'pip', 'install', package_spec, '--break-system-packages', '--quiet']
    ]
    
    for cmd in strategies:
        try:
            subprocess.run(cmd, capture_output=True, check=True)
            print(f"SUCCESS: {package}")
            return True
        except subprocess.CalledProcessError:
            continue
    print(f"FAILED: {package}")
    return False

# Install required packages
packages = {'openai': '1.0.0', 'python-dotenv': None, 'packaging': None}
for pkg, ver in packages.items():
    smart_install(pkg, ver)

Environment: Local Jupyter
SUCCESS: openai
SUCCESS: python-dotenv
SUCCESS: packaging


In [4]:
# Import modules with graceful fallbacks
import os, re, time, json, getpass
from typing import Optional, List, Dict, Tuple

try:
    from dotenv import load_dotenv
    DOTENV_AVAILABLE = True
except ImportError:
    DOTENV_AVAILABLE = False
    def load_dotenv(): pass

try:
    from packaging import version
    VERSION_CHECK = True
except ImportError:
    VERSION_CHECK = False

print("Modules imported successfully!")

Modules imported successfully!


In [5]:
# Future-Proof API Key Validator
class APIKeyValidator:
    def __init__(self):
        self.patterns = [
            r'^sk-[A-Za-z0-9]{20,}$',
            r'^sk-proj-[A-Za-z0-9\-_]{20,}$',
            r'^sk-[A-Za-z0-9\-_]{40,}$'
        ]
        self.invalid_keys = {
            'your_api_key_here', 'sk-your-key-here', 'sk-...', 'sk-xxxxxxxx',
            'sk-placeholder', 'sk-example', 'sk-demo', 'sk-test'
        }
    
    def validate(self, key: str) -> Tuple[bool, str]:
        if not key or not isinstance(key, str):
            return False, "API key is empty"
        
        key = key.strip()
        
        if key.lower() in [k.lower() for k in self.invalid_keys]:
            return False, "API key appears to be a placeholder"
        
        if not key.startswith('sk-'):
            return False, "API keys should start with 'sk-'"
        
        if len(key) < 30:
            return False, "API key is too short"
        
        for pattern in self.patterns:
            if re.match(pattern, key):
                return True, "Valid API key format"
        
        # Heuristic check for unknown formats
        if self._heuristic_check(key):
            return True, "Format not recognized but appears valid"
        
        return False, "Invalid format"
    
    def _heuristic_check(self, key: str) -> bool:
        remaining = key[3:]  # Remove 'sk-'
        alphanumeric = sum(1 for c in remaining if c.isalnum())
        unique_chars = len(set(remaining.lower()))
        return alphanumeric >= len(remaining) * 0.8 and unique_chars >= 8

validator = APIKeyValidator()
print("API key validator ready")

API key validator ready


In [6]:
# Secure API Key Setup
from typing import Optional
import os, getpass

def setup_api_key(save_to_gdrive: Optional[bool] = None,
                  gdrive_env_path: str = "/content/drive/MyDrive/config/.env") -> Optional[str]:
    """
    Set up OPENAI_API_KEY with validation.
    - If running locally: saves to .env (same behavior as before).
    - If running in Colab:
        - By default, keeps 'session only'.
        - If save_to_gdrive=True OR user answers 'y' to the prompt, mounts Google Drive and saves to gdrive_env_path.
    Args:
        save_to_gdrive: If None, will ask interactively in Colab. If True/False, uses that decision without prompting.
        gdrive_env_path: Destination .env file path in Google Drive (Colab only).
    """
    if DOTENV_AVAILABLE:
        load_dotenv()

    api_key = os.getenv('OPENAI_API_KEY')

    if api_key:
        is_valid, message = validator.validate(api_key)
        if is_valid:
            print(f"SUCCESS: {message}")
            return api_key
        else:
            print(f"WARNING: {message}")

    print("\nAPI Key Setup Required")
    print("1. Visit: https://platform.openai.com/api-keys")
    print("2. Create new secret key")
    print("3. Add billing credits: https://platform.openai.com/settings/organization/billing/overview")

    for attempt in range(3):
        user_key = getpass.getpass(f"Enter API key (attempt {attempt + 1}/3): ")
        is_valid, message = validator.validate(user_key)

        if is_valid or "appears valid" in message:
            api_key = user_key.strip()
            os.environ['OPENAI_API_KEY'] = api_key

            if IN_COLAB:
                # Decide whether to save to Google Drive
                decision = save_to_gdrive
                if decision is None:
                    ans = input("Save API key to Google Drive for future sessions? [y/N]: ").strip().lower()
                    decision = ans == 'y'

                if decision:
                    try:
                        # Mount lazily to avoid unnecessary prompts
                        try:
                            from google.colab import drive  # type: ignore
                            drive.mount('/content/drive', force_remount=False)
                        except Exception as e:
                            print(f"NOTE: Could not import or mount Google Drive automatically: {e}")

                        # Ensure parent directory exists
                        parent_dir = os.path.dirname(gdrive_env_path)
                        if parent_dir and not os.path.exists(parent_dir):
                            os.makedirs(parent_dir, exist_ok=True)

                        with open(gdrive_env_path, 'w') as f:
                            f.write(f'OPENAI_API_KEY={api_key}\n')
                        print(f"SUCCESS: {message} (saved to Google Drive: {gdrive_env_path})")
                    except Exception as e:
                        print(f"SUCCESS: {message} (session only; failed to save to Drive: {e})")
                else:
                    print(f"SUCCESS: {message} (session only)")
            else:
                # Local environment: keep existing .env behavior
                try:
                    with open('.env', 'w') as f:
                        f.write(f'OPENAI_API_KEY={api_key}\n')
                    print(f"SUCCESS: {message} (saved to .env)")
                except Exception:
                    print(f"SUCCESS: {message} (session only)")

            return api_key
        else:
            print(f"INVALID: {message}")

    return None

API_KEY = setup_api_key()
if API_KEY:
    print("\nAPI key configured successfully!")
else:
    print("\nAPI key setup failed. Please try again.")

SUCCESS: Valid API key format

API key configured successfully!
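
The local branch of `setup_api_key()` opens `.env` in write mode, which replaces the whole file, so any other variables already stored there are lost. If that matters in your project, one option is to update only the relevant line. The helper below is a minimal sketch of that idea: `upsert_env_var` is a hypothetical name introduced here, and it assumes a simple `NAME=value` file without `export` prefixes or multi-line values.

```python
from pathlib import Path

def upsert_env_var(env_path, name, value):
    """Set NAME=value in a .env file, preserving every other line."""
    path = Path(env_path)
    lines = path.read_text().splitlines() if path.exists() else []
    new_line = f"{name}={value}"
    replaced = False
    for i, line in enumerate(lines):
        # Match only an assignment to this exact variable name
        if line.split("=", 1)[0].strip() == name:
            lines[i] = new_line
            replaced = True
    if not replaced:
        lines.append(new_line)
    path.write_text("\n".join(lines) + "\n")
```

Calling `upsert_env_var('.env', 'OPENAI_API_KEY', api_key)` would then replace an existing key or append a new one while leaving unrelated entries untouched.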


#### Connecting with OpenAI API

In [7]:
# Connection Test: OpenAI embeddings API
try:
    import os
    import openai
    key = os.getenv('OPENAI_API_KEY')
    if hasattr(openai, 'OpenAI'):
        client = openai.OpenAI(api_key=key)
    else:
        client = openai
        client.api_key = key
    _ = client.embeddings.create(model='text-embedding-3-small', input='ping')
    print('Connection test OK')
except Exception as e:
    print(f'Connection test failed: {e}')


Connection test OK


### OpenAI Assistant ask_ai()

In [9]:
# Future-Proof OpenAI Assistant (updated models and discovery)
import time

class FutureProofAssistant:
    def __init__(self, api_key=None):
        self.api_key = api_key or API_KEY  # assumes API_KEY set in a previous cell
        self.client = None
        # Prefer modern families; keep a reasonable fallback
        self.models = ['o4-mini', 'o4', 'gpt-4.1-mini', 'gpt-4.1', 'gpt-4o']
        self.selected_model = None
        self.max_retries = 3
        
        if not self.api_key:
            raise ValueError("No API key provided")
        
        self._initialize()
    
    def _initialize(self):
        print("Initializing Future-Proof Assistant...")
        self._setup_client()
        self._discover_models()
        self._select_model()
        print(f"Ready! Using model: {self.selected_model}")
    
    def _setup_client(self):
        try:
            import openai
            if hasattr(openai, 'OpenAI'):
                self.client = openai.OpenAI(api_key=self.api_key)
                print("Client initialized (modern API)")
            else:
                openai.api_key = self.api_key
                self.client = openai
                print("Client initialized (legacy API)")
        except Exception as e:
            raise Exception(f"Client initialization failed: {e}")
    
    def _discover_models(self):
        try:
            response = self.client.models.list()
            all_models = [m.id for m in response.data]
            # Prefer modern families; exclude legacy 3.5.
            # Future-proof: include patterns for potential future names (may not exist yet).
            include_patterns = ['o4', 'gpt-4.1', 'gpt-4o', 'gpt-5', 'gpt-4.5', 'gpt-6']
            chat_models = [
                m for m in all_models
                if any(p in m.lower() for p in include_patterns)
            ]
            self.models = self._prioritize_models(chat_models) or self.models
            print(f"Found {len(self.models)} models")
        except Exception as e:
            print(f"Model discovery failed: {e} - using defaults")
    
    def _prioritize_models(self, models):
        priority = ['o4-mini', 'o4', 'gpt-4.1-mini', 'gpt-4.1', 'gpt-4o']
        result = [m for m in priority if m in models]
        result.extend([m for m in sorted(models) if m not in result])
        return result
    
    def _select_model(self):
        for model in self.models[:3]:
            if self._test_model(model):
                self.selected_model = model
                return
        self.selected_model = self.models[0]
    
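    def _test_model(self, model):
        # Probe the model with a tiny request; any failure marks it unusable.
        # Note: reasoning models such as o4-mini may reject `max_tokens`,
        # in which case the probe fails and the next candidate is tried.
        try:
            self.client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": "Hi"}],
                max_tokens=5
            )
            return True
        except Exception:
            return False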
    
    def ask_ai(self, content: str) -> str:
        if not content or not content.strip():
            return "Error: Please provide a valid question."
        
        for attempt in range(self.max_retries):
            try:
                response = self.client.chat.completions.create(
                    model=self.selected_model,
                    messages=[{"role": "user", "content": content.strip()}],
                    max_tokens=1000,
                    temperature=0.7
                )
                return self._extract_content(response)
            
            except Exception as e:
                error_type = self._classify_error(e)
                
                if error_type == 'billing':
                    return self._billing_error_message()
                elif error_type == 'auth':
                    return self._auth_error_message()
                elif error_type == 'model':
                    return self._model_error_message()
                elif error_type == 'rate' and attempt < self.max_retries - 1:
                    wait_time = 2 ** attempt
                    print(f"Rate limited. Waiting {wait_time}s...")
                    time.sleep(wait_time)
                    continue
                elif attempt < self.max_retries - 1:
                    print(f"Attempt {attempt + 1} failed: {str(e)[:50]}...")
                    time.sleep(1)
                    continue
                else:
                    return f"Error after {self.max_retries} attempts: {str(e)[:100]}..."
    
    def _extract_content(self, response):
        # Try the chat format first, then the legacy completion format
        try:
            return response.choices[0].message.content
        except Exception:
            try:
                return response.choices[0].text
            except Exception:
                return str(response)
    
    def _classify_error(self, error):
        error_str = str(error).lower()
        if any(word in error_str for word in ['quota', 'billing', 'credit']):
            return 'billing'
        elif any(word in error_str for word in ['auth', 'key', 'unauthorized']):
            return 'auth'
        elif any(word in error_str for word in ['model', 'not_found']):
            return 'model'
        elif any(word in error_str for word in ['rate', 'limit', 'too_many']):
            return 'rate'
        return 'unknown'
    
    def _billing_error_message(self):
        return """BILLING ERROR: Insufficient credits.
        
To fix this:
1. Visit: https://platform.openai.com/settings/organization/billing/overview
2. Add a payment method
3. Purchase credits (minimum $5)
4. Wait a few minutes for credits to appear

Note: OpenAI requires prepaid credits for API usage."""
    
    def _auth_error_message(self):
        return """AUTHENTICATION ERROR: Invalid API key.
        
To fix this:
1. Check your API key at: https://platform.openai.com/api-keys
2. Create a new key if needed
3. Re-run the API key setup cell above

Make sure your key starts with 'sk-' and is complete."""
    
    def _model_error_message(self):
        return f"""MODEL ERROR: {self.selected_model} not available.
        
This usually means:
1. Model has been deprecated
2. Your account doesn't have access
3. Temporary service issue

Re-run the assistant initialization cell to select another available model."""

# Initialize assistant
if API_KEY:
    assistant = FutureProofAssistant(API_KEY)
else:
    print("Cannot initialize assistant without API key")

Initializing Future-Proof Assistant...
Client initialized (modern API)
Found 43 models
Ready! Using model: gpt-4.1-mini
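
Inside `ask_ai()`, rate-limited requests wait `2 ** attempt` seconds before the next try. The same exponential backoff pattern can be sketched in isolation as below. This is an illustrative helper, not the class's actual retry code: `retry_with_backoff` and its parameters are hypothetical, and the `sleep` argument exists only so the delay can be swapped out when experimenting.

```python
import time

def retry_with_backoff(call, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call `call()` until it succeeds or `max_retries` attempts are used up."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the last error
            # Delays grow as base_delay * 1, 2, 4, ...
            sleep(base_delay * (2 ** attempt))
```

Unlike `ask_ai()`, which returns an error string after the final attempt, this sketch re-raises the last exception so the caller decides how to report it.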


In [10]:
# Test the Assistant
def ask_ai(content: str) -> str:
    """Simple interface to the future-proof assistant"""
    if 'assistant' in globals():
        return assistant.ask_ai(content)
    else:
        return "Assistant not initialized. Please run the setup cells above."

# Test with various scenarios
if API_KEY:
    print("Testing assistant functionality...\n")
    
    # Basic test
    response = ask_ai("Say 'Hello, I am working!' in exactly those words.")
    print(f"Basic Test: {response}\n")
    
    # Empty input test
    response = ask_ai("")
    print(f"Empty Input Test: {response}\n")
    
    # Model info
    print(f"Selected Model: {assistant.selected_model}")
    print(f"Available Models: {assistant.models[:3]}...")
    
    print("\nAssistant is ready for use!")
else:
    print("Please complete API key setup first.")

Testing assistant functionality...

Basic Test: Hello, I am working!

Empty Input Test: Error: Please provide a valid question.

Selected Model: gpt-4.1-mini
Available Models: ['o4-mini', 'gpt-4.1-mini', 'gpt-4.1']...

Assistant is ready for use!


#### Usage Examples

Now you can use the `ask_ai()` function for any queries:

```python
# Simple question
response = ask_ai("What is machine learning?")
print(response)

# Complex analysis
response = ask_ai("Explain the benefits of using LLMs for data analysis")
print(response)
```

#### Future-Proof Features

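This setup is designed to handle:
- **API Changes**: Detects whether the modern (`openai.OpenAI`) or legacy client interface is available
- **Model Updates**: Discovers available models and picks the first preferred one that responds
- **Error Evolution**: Classifies errors by keyword matching on the error message
- **Response Formats**: Falls back across several content extraction methods

These fallbacks reduce the chance of breakage as OpenAI updates its API, but they cannot guarantee it. Re-check the model list and the error keywords whenever you upgrade the SDK.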
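
Keyword matching on error text can misfire. In `_classify_error`, for example, the model check runs before the rate-limit check, so a rate-limit message that happens to contain the word "model" is treated as a model error and is not retried. A sketch of an alternative is to classify on structured fields instead. The function below is hypothetical and rests on assumptions: that the caught exception exposes an HTTP status code and an API error code, and that quota exhaustion is reported as a 429 with the code `insufficient_quota`.

```python
def classify_api_error(status_code, error_code=None):
    """Map an HTTP status (and optional API error code) to a coarse category."""
    if status_code == 401:
        return "auth"
    if status_code == 404:
        return "model"
    if status_code == 429:
        # Assumption: quota exhaustion is reported with code 'insufficient_quota'
        return "billing" if error_code == "insufficient_quota" else "rate"
    return "unknown"
```

The categories match the ones `ask_ai()` already branches on, so a function like this could stand in for the keyword-based version once the status and error code have been read from the exception.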

In [12]:
ask_ai("tell me a joke")

"Sure! Here's a joke for you:\n\nWhy don't scientists trust atoms?  \nBecause they make up everything!"