# Lab 2: API Client Integration

In this lab you will build a production-ready LLM API client using **LiteLLM** and **OpenRouter**.

**By the end of this lab you will know how to:**
- Securely load API keys from environment variables
- Make your first LLM API call via LiteLLM
- Use LiteLLM's built-in retry logic
- Use LiteLLM's built-in response caching

## Step 1 — Environment Setup

We never hardcode API keys in source code. Instead, we load them from a `.env` file at runtime.

Create a `.env` file in this directory with:
```
OPENROUTER_API_KEY=sk-or-...
```

In [None]:
import os
from dotenv import load_dotenv

load_dotenv()  # Reads .env from the current directory


def get_api_key() -> str:
    """Retrieve and validate the OpenRouter API key."""
    token = os.getenv("OPENROUTER_API_KEY")
    if not token:
        raise EnvironmentError(
            "OPENROUTER_API_KEY not found. "
            "Create a .env file with your key or set the environment variable."
        )
    return token


get_api_key()  # Validate early — fail fast if the key is missing
print("API key loaded successfully.")
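
When debugging, it can help to confirm which key was loaded without ever printing it in full. The helper below is a minimal sketch; `mask_secret` is a hypothetical name introduced here, not part of `python-dotenv` or LiteLLM, and the key shown is a placeholder.

```python
def mask_secret(secret: str, visible: int = 4) -> str:
    """Return a log-safe preview that shows only the last `visible` characters."""
    if len(secret) <= visible:
        # Too short to reveal anything safely: hide every character.
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]


# Placeholder value for illustration only; not a real key.
print(mask_secret("sk-or-example-1234"))
```

In this notebook, `mask_secret(get_api_key())` would print such a preview instead of the raw key.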

## Step 2 — Your First API Call

[LiteLLM](https://docs.litellm.ai/docs/) provides a unified `completion()` interface across 100+ LLM providers.
We prefix the model name with `openrouter/` so LiteLLM knows to route through OpenRouter.
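
To make the layout of the model string concrete, the sketch below splits it at the first slash into the routing prefix and the remainder that is passed on to the provider. `split_model_id` is a hypothetical helper for illustration only; it shows how the string is structured, not how LiteLLM parses it internally.

```python
def split_model_id(model_id: str) -> tuple[str, str]:
    """Split '<provider>/<rest>' at the first slash."""
    provider, _, remainder = model_id.partition("/")
    if not remainder:
        raise ValueError(f"expected '<provider>/<model>', got {model_id!r}")
    return provider, remainder


provider, remainder = split_model_id("openrouter/meta-llama/llama-3-8b-instruct:free")
print(provider)   # the routing prefix
print(remainder)  # the model name as listed on OpenRouter
```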

Key features:

- Direct Python library integration in your codebase
- Router with retry/fallback logic across multiple deployments (e.g. Azure/OpenAI)
- Application-level load balancing and cost tracking
- Exception handling with OpenAI-compatible errors
- Observability callbacks (Lunary, MLflow, Langfuse, etc.)


In [None]:
from litellm import completion

MODEL_ID = "openrouter/meta-llama/llama-3-8b-instruct:free"

prompt = "Explain what a vector database is in one paragraph."

response = completion(
    model=MODEL_ID,
    messages=[{"role": "user", "content": prompt}],
    max_tokens=150,
    temperature=0.7,
)

print(response.choices[0].message.content)
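
Beyond the text itself, the response carries metadata worth checking, especially with `max_tokens=150`, where a reply may be cut short. The sketch below assumes the response follows the OpenAI chat-completion shape (`choices[0].finish_reason`, `usage.total_tokens`). `summarize_response` is a hypothetical helper, and a stand-in object replaces the real response so the example runs offline.

```python
from types import SimpleNamespace


def summarize_response(response) -> dict:
    """Collect the text, finish reason, and token usage from a chat completion."""
    choice = response.choices[0]
    return {
        "text": choice.message.content,
        # "length" means the reply stopped because it hit max_tokens.
        "finish_reason": choice.finish_reason,
        "total_tokens": response.usage.total_tokens,
    }


# Stand-in object mimicking the OpenAI-style response shape (made-up values).
fake_response = SimpleNamespace(
    choices=[
        SimpleNamespace(
            message=SimpleNamespace(content="A vector database stores..."),
            finish_reason="length",
        )
    ],
    usage=SimpleNamespace(prompt_tokens=12, completion_tokens=150, total_tokens=162),
)

print(summarize_response(fake_response))
```

Passing the real `response` from the cell above to `summarize_response` should work the same way, provided the assumed fields are present.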

## Step 3 — Retry Logic

Free-tier APIs are rate-limited. Rather than writing manual retry loops, you can pass LiteLLM's built-in `num_retries` parameter, which automatically retries failed calls (such as a `RateLimitError` or a network error) with exponential backoff.
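
For comparison, the kind of manual retry loop that `num_retries` replaces could be sketched as follows. This is a generic exponential-backoff illustration, not LiteLLM's implementation; `call_with_backoff` and `flaky` are hypothetical names, and the delay values are arbitrary.

```python
import random
import time


def call_with_backoff(fn, max_retries=3, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call fn(); on a retryable error, wait base_delay * 2**attempt plus jitter, then retry."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            sleep(delay)


# Demo with a simulated function that fails twice, then succeeds.
attempts = {"count": 0}


def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated transient failure")
    return "ok"


result = call_with_backoff(flaky, base_delay=0.01, retry_on=(ConnectionError,))
print(result, "after", attempts["count"], "attempts")
```

The delay doubles after each failed attempt, which gives a rate-limited service progressively more time to recover.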

In [None]:
response = completion(
    model=MODEL_ID,
    messages=[{"role": "user", "content": prompt}],
    max_tokens=150,
    temperature=0.7,
    num_retries=3,   # LiteLLM retries automatically on rate limit / network errors
    timeout=120,
)

print(response.choices[0].message.content)

## Step 4 — Response Caching

During development, you often run the same queries many times. Caching avoids redundant API calls, saving both time and quota.

LiteLLM ships with built-in caching that can be enabled globally with a single line.
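
Conceptually, a response cache maps the request parameters to a stored response. The sketch below illustrates that idea with a plain dictionary. It is not LiteLLM's implementation; `SimpleResponseCache` and `fake_completion` are hypothetical names, and the stand-in function replaces a real API call so the example runs offline.

```python
import hashlib
import json


class SimpleResponseCache:
    """In-memory cache keyed on a hash of the request parameters."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def make_key(**params) -> str:
        # sort_keys makes the key independent of keyword-argument order.
        payload = json.dumps(params, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(self, fn, **params):
        key = self.make_key(**params)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        self._store[key] = fn(**params)
        return self._store[key]


# Hypothetical stand-in for an API call.
def fake_completion(model, messages):
    return f"answer from {model} to: {messages[-1]['content']}"


cache = SimpleResponseCache()
msgs = [{"role": "user", "content": "What is retrieval-augmented generation?"}]
first = cache.get_or_call(fake_completion, model="demo-model", messages=msgs)
second = cache.get_or_call(fake_completion, model="demo-model", messages=msgs)
print(first == second, cache.hits, cache.misses)
```

In this sketch, any change to the parameters (a different model or a reworded message) produces a different key and therefore a cache miss.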

In [None]:
import litellm

# Enable in-memory caching for all subsequent completion() calls
litellm.enable_cache(type="local")

model = MODEL_ID  # Reuse the model defined in Step 2
messages = [{"role": "user", "content": "What is retrieval-augmented generation?"}]

print("--- First call (Cache MISS — hits API) ---")
result1 = completion(model=model, messages=messages, num_retries=3, timeout=120, max_tokens=200)
print(result1.choices[0].message.content[:200])

print("\n--- Second call (Cache HIT — served from memory) ---")
result2 = completion(model=model, messages=messages, num_retries=3, timeout=120, max_tokens=200)
print(result2.choices[0].message.content[:200])

print("\nResponses identical:", result1.choices[0].message.content == result2.choices[0].message.content)