## Welcome to the Second Lab - Week 1, Day 3

Today we will work with lots of models! This is a way to get comfortable with APIs.

<table style="margin: 0; text-align: left; width:100%">
    <tr>
        <td style="width: 150px; height: 150px; vertical-align: middle;">
            <img src="../assets/stop.png" width="150" height="150" style="display: block;" />
        </td>
        <td>
            <h2 style="color:#ff7800;">Important point - please read</h2>
            <span style="color:#ff7800;">The way I collaborate with you may be different to other courses you've taken. I prefer not to type code while you watch. Rather, I execute Jupyter Labs, like this, and give you an intuition for what's going on. My suggestion is that you carefully execute this yourself, <b>after</b> watching the lecture. Add print statements to understand what's going on, and then come up with your own variations.<br/><br/>If you have time, I'd love it if you submit a PR for changes in the community_contributions folder - instructions in the resources. Also, if you have a Github account, use this to showcase your variations. Not only is this essential practice, but it demonstrates your skills to others, including perhaps future clients or employers...
            </span>
        </td>
    </tr>
</table>

In [7]:
# Start with imports - ask ChatGPT to explain any package that you don't know

import os
import json
from dotenv import load_dotenv
from openai import OpenAI
from anthropic import Anthropic
from IPython.display import Markdown, display
from limits import parse
from limits.storage import RedisStorage
from limits.strategies import FixedWindowRateLimiter

In [23]:
# Always remember to do this!
load_dotenv(override=True)
storage = RedisStorage("redis://localhost:6379")

In [27]:
MODEL_LIMITS = {
    # --- Gemini 1.5 Series ---
    "gemini-1.5-flash": {
        "rpd": 50,
        "rpm": 15,
        "tpm": 250000
    },
    "gemini-1.5-flash-8b": {
        "rpd": 50,
        "rpm": 15,
        "tpm": 250000
    },
    "gemini-1.5-flash-exp": {
        "rpd": 0,
        "rpm": 0,
        "tpm": 0
    },
    "gemini-1.5-flash-8b-exp": {
        "rpd": 0,
        "rpm": 0,
        "tpm": 0
    },

    # --- Gemini 2.0 Series ---
    "gemini-2.0-flash": {
        "rpd": 200,
        "rpm": 15,
        "tpm": 1000000
    },
    "gemini-2.0-flash-lite": {
        "rpd": 200,
        "rpm": 30,
        "tpm": 1000000
    },
    "gemini-2.0-flash-exp": {
        "rpd": 50,
        "rpm": 10,
        "tpm": 250000
    },
    "gemini-2.0-flash-exp-audio": {
        # Note: RPD and RPM were not available for this model; the values below are placeholders.
        "rpd": 50, # Placeholder
        "rpm": 2,  # Placeholder
        "tpm": 4000000
    },
    "gemini-2.0-flash-exp-image": {
        # Note: RPD and RPM were not available for this model; the values below are placeholders.
        "rpd": 50, # Placeholder
        "rpm": 2,  # Placeholder
        "tpm": 4000000
    },
    "gemini-2.0-flash-live": {
        "rpd": None, # Unlimited
        "rpm": None, # Unlimited
        "tpm": 1000000
    },
    "gemini-2.0-flash-preview-image-generation": {
        "rpd": 100,
        "rpm": 10,
        "tpm": 200000
    },

    # --- Gemini 2.5 Series ---
    "gemini-2.5-flash": {
        "rpd": 250,
        "rpm": 10,
        "tpm": 250000
    },
    "gemini-2.5-flash-lite": {
        "rpd": 1000,
        "rpm": 15,
        "tpm": 250000
    },
    "gemini-2.5-flash-exp-native-audio-thinking-dialog": {
        "rpd": 5,
        "rpm": None, # Unlimited
        "tpm": 10000
    },
    "gemini-2.5-flash-native-audio-dialog": {
        "rpd": 5,
        "rpm": None, # Unlimited
        "tpm": 25000
    },
    "gemini-2.5-flash-live": {
        "rpd": None, # Unlimited
        "rpm": None, # Unlimited
        "tpm": None  # Assuming unlimited if not specified
    },
    "gemini-2.5-flash-preview-image": {
        "rpd": 50,
        "rpm": 20,
        "tpm": 200000
    },
    "gemini-2.5-flash-tts": {
        "rpd": 15,
        "rpm": 3,
        "tpm": 10000
    },

    # --- Other Models ---
    "learnlm-2.0-flash-experimental": {
        "rpd": 1500,
        "rpm": 15,
        "tpm": None # TPM not specified in your list
    },
}
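
One thing to notice in the table above: `None` marks "unlimited" while the experimental models carry `0`. A plain truthiness check such as `if limits.get("rpm")` treats both the same way. Here is a minimal sketch of telling them apart. `classify_limit` and `SAMPLE_LIMITS` are hypothetical names (the sample just copies three entries from the table), and I'm assuming that `0` means the model is not available at all.

```python
# Hypothetical helper; SAMPLE_LIMITS copies three entries from MODEL_LIMITS above
SAMPLE_LIMITS = {
    "gemini-2.0-flash": {"rpd": 200, "rpm": 15, "tpm": 1000000},
    "gemini-1.5-flash-exp": {"rpd": 0, "rpm": 0, "tpm": 0},
    "gemini-2.5-flash-live": {"rpd": None, "rpm": None, "tpm": None},
}


def classify_limit(limits_table, model, field="rpm"):
    """Return 'unknown', 'unlimited', 'blocked', or 'limited' for a model's limit."""
    if model not in limits_table:
        return "unknown"      # the model is not listed at all
    value = limits_table[model].get(field)
    if value is None:
        return "unlimited"    # None marks "no limit" in the table
    if value == 0:
        return "blocked"      # assumption: 0 means no requests are allowed
    return "limited"


for name in list(SAMPLE_LIMITS) + ["unknown-model"]:
    print(name, "->", classify_limit(SAMPLE_LIMITS, name))
```

If you adopt something like this, you can pass the real `MODEL_LIMITS` dictionary instead of the sample.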

In [24]:
# Print the key prefixes to help with any debugging

# openai_api_key = os.getenv('OPENAI_API_KEY')
# anthropic_api_key = os.getenv('ANTHROPIC_API_KEY')
google_api_key = os.getenv('GEMINI_API_KEY')
# These two are needed by the DeepSeek and Groq cells further down
deepseek_api_key = os.getenv('DEEPSEEK_API_KEY')
groq_api_key = os.getenv('GROQ_API_KEY')

# if openai_api_key:
#     print(f"OpenAI API Key exists and begins {openai_api_key[:8]}")
# else:
#     print("OpenAI API Key not set")
    
# if anthropic_api_key:
#     print(f"Anthropic API Key exists and begins {anthropic_api_key[:7]}")
# else:
#     print("Anthropic API Key not set (and this is optional)")

if google_api_key:
    print(f"Google API Key exists and begins {google_api_key[:2]}")
else:
    print("Google API Key not set (and this is optional)")

# if deepseek_api_key:
#     print(f"DeepSeek API Key exists and begins {deepseek_api_key[:3]}")
# else:
#     print("DeepSeek API Key not set (and this is optional)")

# if groq_api_key:
#     print(f"Groq API Key exists and begins {groq_api_key[:4]}")
# else:
#     print("Groq API Key not set (and this is optional)")

Google API Key exists and begins AI


In [None]:
# def callLLmApi(api_key,base_url,model):
#     llm = OpenAI(api_key=api_key, base_url=base_url)
#     model_name = model

#     response = llm.chat.completions.create(model=model_name, messages=messages)
#     response_content = response.choices[0].message.content

#     display(Markdown(response_content))
#     return response_content
# Updated callLLmApi function with rate limiting
def callLLmApi(api_key, base_url, model, messages):
    # Look up the requests-per-minute (rpm) limit for this model, if one is defined
    rpm = MODEL_LIMITS.get(model, {}).get("rpm")

    # Check the rate limit before making the API call
    if rpm:
        rpm_limiter = FixedWindowRateLimiter(storage)
        rpm_limit = parse(f"{rpm}/minute")
        try:
            # hit() records this request and returns False once the window is full
            if not rpm_limiter.hit(rpm_limit, f"api_call:{model}"):
                print(f"Rate limit exceeded for model {model}. Current limit: {rpm} requests per minute.")
                return None
        except Exception as e:
            # For example, Redis is not reachable
            print(f"Rate limiting error: {e}. Proceeding without rate limiting...")

    llm = OpenAI(api_key=api_key, base_url=base_url)
    response = llm.chat.completions.create(model=model, messages=messages)
    response_content = response.choices[0].message.content

    display(Markdown(response_content))
    return response_content
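
To build an intuition for what a fixed-window limiter does, here is a simplified in-memory sketch. It is not the `limits` library and it doesn't need Redis. `InMemoryFixedWindow` is a hypothetical class written for illustration, and the clock is injected so the example runs instantly.

```python
import time


class InMemoryFixedWindow:
    """Hypothetical, simplified fixed-window counter (not the limits library)."""

    def __init__(self, max_hits, window_seconds, clock=time.monotonic):
        self.max_hits = max_hits
        self.window_seconds = window_seconds
        self.clock = clock          # injectable so the behavior can be tested
        self.windows = {}           # key -> (window_start, hits_so_far)

    def hit(self, key):
        """Record one request for key; return False if the window is already full."""
        now = self.clock()
        start, hits = self.windows.get(key, (now, 0))
        if now - start >= self.window_seconds:
            start, hits = now, 0    # the old window expired, so start a new one
        if hits >= self.max_hits:
            return False
        self.windows[key] = (start, hits + 1)
        return True


# Simulated clock so the example does not have to wait a real minute
fake_now = [0.0]
limiter = InMemoryFixedWindow(max_hits=2, window_seconds=60, clock=lambda: fake_now[0])

print(limiter.hit("api_call:demo"))  # first request in the window
print(limiter.hit("api_call:demo"))  # second request in the window
print(limiter.hit("api_call:demo"))  # third request exceeds the limit of 2
fake_now[0] = 61.0                   # move past the end of the window
print(limiter.hit("api_call:demo"))  # a new window has started
```

The counter lives in this one Python process. A shared store such as Redis is what lets several notebooks or processes count against the same limit.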

In [10]:
request = "Please come up with a challenging, nuanced question that I can ask a number of LLMs to evaluate their intelligence. "
request += "Answer only with the question, no explanation."
messages = [{"role": "user", "content": request}]

In [None]:
# Test the updated callLLmApi function with rate limiting
# Example usage with a model that has rate limits

# Test with a model from MODEL_LIMITS
test_model = "gemini-2.0-flash"
test_messages = [{"role": "user", "content": "Hello, this is a test message."}]

print(f"Testing rate limiting for model: {test_model}")
print(f"Rate limits for this model: {MODEL_LIMITS.get(test_model, 'No limits defined')}")

# Note: This will only work if you have the appropriate API keys set up
# Uncomment the line below to test (make sure you have GEMINI_API_KEY set)
# result = callLLmApi(google_api_key, "https://generativelanguage.googleapis.com/v1beta/openai/", test_model, test_messages)


In [28]:
# Test rate limiting with dummy model (no real API calls)
import time

def test_rate_limiting():
    """Test the rate limiting functionality without making real API calls"""
    
    # Test with a model that has rate limits
    test_model = "gemini-2.0-flash"
    test_messages = [{"role": "user", "content": "Test message"}]
    
    print(f"🧪 Testing rate limiting for model: {test_model}")
    print(f"📊 Rate limits: {MODEL_LIMITS.get(test_model)}")
    print()
    
    # Simulate multiple rapid calls to test rate limiting
    for i in range(5):
        print(f"Attempt {i+1}:")
        
        # Check if model has rate limits
        if test_model in MODEL_LIMITS:
            limits = MODEL_LIMITS[test_model]
            if limits.get("rpm"):
                # Create rate limiter
                rpm_limiter = FixedWindowRateLimiter(storage)
                rpm_limit = parse(f"{limits['rpm']}/minute")
                
                # Create unique key for this model
                rate_limit_key = f"api_call:{test_model}"
                
                try:
                    if rpm_limiter.hit(rpm_limit, rate_limit_key):
                        print(f"  ✅ Rate limit check passed - would make API call")
                    else:
                        print(f"  ⚠️ Rate limit exceeded - API call blocked")
                        print(f"  📈 Current limit: {limits['rpm']} requests per minute")
                        break
                except Exception as e:
                    print(f"  ❌ Rate limiting error: {e}")
                    print(f"  🔄 Proceeding without rate limiting...")
            else:
                print(f"  ℹ️ No RPM limit defined for this model")
        else:
            print(f"  ℹ️ Model not found in MODEL_LIMITS - no rate limiting applied")
        
        # Small delay to simulate real usage
        time.sleep(0.1)
    
    print("\n🎯 Rate limiting test completed!")

# Run the test
test_rate_limiting()


🧪 Testing rate limiting for model: gemini-2.0-flash
📊 Rate limits: {'rpd': 200, 'rpm': 15, 'tpm': 1000000}

Attempt 1:
  ✅ Rate limit check passed - would make API call
Attempt 2:
  ✅ Rate limit check passed - would make API call
Attempt 3:
  ✅ Rate limit check passed - would make API call
Attempt 4:
  ✅ Rate limit check passed - would make API call
Attempt 5:
  ✅ Rate limit check passed - would make API call

🎯 Rate limiting test completed!


In [None]:
# Comprehensive rate limiting test with different models
def test_multiple_models():
    """Test rate limiting with different models to show various scenarios"""
    
    test_models = [
        "gemini-2.0-flash",           # Has RPM limit (15)
        "gemini-2.5-flash-lite",      # Has RPM limit (15) 
        "gemini-2.5-flash-live",      # No RPM limit (None)
        "unknown-model",              # Not in MODEL_LIMITS
    ]
    
    print("🔬 Comprehensive Rate Limiting Test")
    print("=" * 50)
    
    for model in test_models:
        print(f"\n📋 Testing model: {model}")
        
        if model in MODEL_LIMITS:
            limits = MODEL_LIMITS[model]
            print(f"   📊 Limits: {limits}")
            
            if limits.get("rpm"):
                print(f"   🚦 RPM Limit: {limits['rpm']} requests per minute")
                
                # Test rate limiting
                rpm_limiter = FixedWindowRateLimiter(storage)
                rpm_limit = parse(f"{limits['rpm']}/minute")
                rate_limit_key = f"api_call:{model}"
                
                # Try 3 rapid calls
                for i in range(3):
                    try:
                        if rpm_limiter.hit(rpm_limit, rate_limit_key):
                            print(f"   ✅ Call {i+1}: Allowed")
                        else:
                            print(f"   ⚠️ Call {i+1}: Blocked (rate limit exceeded)")
                            break
                    except Exception as e:
                        print(f"   ❌ Call {i+1}: Error - {e}")
            else:
                print(f"   ℹ️ No RPM limit - unlimited requests")
        else:
            print(f"   ❓ Model not in MODEL_LIMITS - no rate limiting")
    
    print(f"\n🎯 All tests completed!")

# Run comprehensive test
test_multiple_models()


In [None]:
# Mock version of callLLmApi for testing without real API calls
def mock_callLLmApi(api_key, base_url, model, messages):
    """Mock version that simulates the rate limiting without real API calls"""
    
    print(f"🔧 Mock API call for model: {model}")
    print(f"📝 Messages: {messages}")
    
    # Initialize rate limiter for this model (same logic as real function)
    if model in MODEL_LIMITS:
        limits = MODEL_LIMITS[model]
        print(f"📊 Found limits: {limits}")
        
        # Create rate limiter for requests per minute (rpm)
        if limits.get("rpm"):
            rpm_limiter = FixedWindowRateLimiter(storage)
            rpm_limit = parse(f"{limits['rpm']}/minute")
            print(f"🚦 RPM limit: {limits['rpm']} requests per minute")
        else:
            rpm_limiter = None
            rpm_limit = None
            print(f"ℹ️ No RPM limit defined")
    else:
        rpm_limiter = None
        rpm_limit = None
        print(f"❓ Model not in MODEL_LIMITS - no rate limiting")
    
    # Check rate limit before making API call (same logic as real function)
    if rpm_limiter and rpm_limit:
        rate_limit_key = f"api_call:{model}"
        try:
            if not rpm_limiter.hit(rpm_limit, rate_limit_key):
                print(f"⚠️ Rate limit exceeded for model {model}. Current limit: {limits['rpm']} requests per minute.")
                print("Please wait a moment before trying again.")
                return None
            else:
                print(f"✅ Rate limit check passed")
        except Exception as e:
            print(f"⚠️ Rate limiting error: {e}")
            print("Proceeding without rate limiting...")
    
    # Mock API response (instead of real API call)
    print(f"🌐 [MOCK] Making API call to {base_url} with model {model}")
    mock_response = f"Mock response from {model}: 'This is a simulated response to your query.'"
    
    print(f"📤 Mock response: {mock_response}")
    return mock_response

# Test the mock function
print("🧪 Testing Mock callLLmApi Function")
print("=" * 40)

test_cases = [
    ("gemini-2.0-flash", "https://generativelanguage.googleapis.com/v1beta/openai/"),
    ("gemini-2.5-flash-live", "https://generativelanguage.googleapis.com/v1beta/openai/"),
    ("unknown-model", "https://api.example.com/v1/"),
]

for model, base_url in test_cases:
    print(f"\n--- Testing {model} ---")
    result = mock_callLLmApi("fake_key", base_url, model, [{"role": "user", "content": "Test message"}])
    print(f"Result: {result}")
    print("-" * 30)


In [None]:
openai = OpenAI()
response = openai.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
)
question = response.choices[0].message.content
print(question)


In [7]:
competitors = []
answers = []
messages = [{"role": "user", "content": question}]

In [None]:
# The API we know well

model_name = "gpt-4o-mini"

response = openai.chat.completions.create(model=model_name, messages=messages)
answer = response.choices[0].message.content

display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)

In [None]:
# Anthropic has a slightly different API, and Max Tokens is required

model_name = "claude-3-7-sonnet-latest"

claude = Anthropic()
response = claude.messages.create(model=model_name, messages=messages, max_tokens=1000)
answer = response.content[0].text

display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)
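
Another difference worth knowing: the Anthropic Messages API takes the system prompt as a separate `system` argument rather than as a message with the role "system". The cell above has no system prompt, so it works as is. If you add one later, something like this sketch could do the conversion. `to_anthropic_kwargs` is a hypothetical helper name, not part of the Anthropic SDK.

```python
def to_anthropic_kwargs(messages, max_tokens=1000):
    """Split OpenAI-style messages into keyword arguments for messages.create()."""
    # System prompts move out of the message list and into a single string
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    chat = [m for m in messages if m["role"] != "system"]

    kwargs = {"messages": chat, "max_tokens": max_tokens}
    if system_parts:
        kwargs["system"] = "\n\n".join(system_parts)
    return kwargs


example = [
    {"role": "system", "content": "Answer concisely."},
    {"role": "user", "content": "What is an API?"},
]
print(to_anthropic_kwargs(example))

# Intended use (not run here, it needs an API key):
# response = claude.messages.create(model=model_name, **to_anthropic_kwargs(example))
```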

In [None]:
gemini = OpenAI(api_key=google_api_key, base_url="https://generativelanguage.googleapis.com/v1beta/openai/")
model_name = "gemini-2.0-flash"

response = gemini.chat.completions.create(model=model_name, messages=messages)
answer = response.choices[0].message.content

display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)

In [None]:
deepseek = OpenAI(api_key=deepseek_api_key, base_url="https://api.deepseek.com/v1")
model_name = "deepseek-chat"

response = deepseek.chat.completions.create(model=model_name, messages=messages)
answer = response.choices[0].message.content

display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)

In [None]:
groq = OpenAI(api_key=groq_api_key, base_url="https://api.groq.com/openai/v1")
model_name = "llama-3.3-70b-versatile"

response = groq.chat.completions.create(model=model_name, messages=messages)
answer = response.choices[0].message.content

display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)
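
The Gemini, DeepSeek and Groq cells above are the same code with a different base URL, key and model. One possible variation is to describe the providers as data and loop over them. This is only a sketch: `Provider`, `available_providers` and `ask_all` are hypothetical names, and `ask` is a function you supply (for example one that wraps `OpenAI(api_key=..., base_url=...).chat.completions.create(...)`). The demo uses a fake environment and a fake `ask`, so it makes no network calls.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Provider:
    name: str
    base_url: str
    key_env: str   # environment variable that holds the API key
    model: str


# Base URLs and model names are the ones used in the cells above
PROVIDERS = [
    Provider("gemini", "https://generativelanguage.googleapis.com/v1beta/openai/",
             "GEMINI_API_KEY", "gemini-2.0-flash"),
    Provider("deepseek", "https://api.deepseek.com/v1",
             "DEEPSEEK_API_KEY", "deepseek-chat"),
    Provider("groq", "https://api.groq.com/openai/v1",
             "GROQ_API_KEY", "llama-3.3-70b-versatile"),
]


def available_providers(providers, env=None):
    """Return only the providers whose API key is set in the environment."""
    env = os.environ if env is None else env
    return [p for p in providers if env.get(p.key_env)]


def ask_all(providers, messages, ask, env=None):
    """Call ask(provider, api_key, messages) for each provider that has a key."""
    env = os.environ if env is None else env
    results = {}
    for p in available_providers(providers, env):
        results[p.model] = ask(p, env[p.key_env], messages)
    return results


# Demo with a fake environment and a fake ask function (no network calls)
fake_env = {"GEMINI_API_KEY": "AI-demo", "GROQ_API_KEY": "gsk-demo"}


def echo(provider, api_key, msgs):
    return f"[{provider.name}] would answer: {msgs[-1]['content']}"


for model, answer in ask_all(PROVIDERS, [{"role": "user", "content": "Hi"}], echo, fake_env).items():
    print(model, "=>", answer)
```

Providers without a key are skipped, which suits a notebook where most keys are optional.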


## For the next cell, we will use Ollama

Ollama runs a local web service that gives an OpenAI compatible endpoint,  
and runs models locally using high performance C++ code.

If you don't have Ollama, install it by visiting https://ollama.com, pressing Download, and following the instructions.

After it's installed, you should be able to visit here: http://localhost:11434 and see the message "Ollama is running"
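
If you'd rather check from the notebook than from the browser, here is a small sketch using only the standard library. `is_ollama_running` is a hypothetical helper name. It assumes the default port 11434 and looks for the same "Ollama is running" message.

```python
import urllib.error
import urllib.request


def is_ollama_running(base_url="http://localhost:11434", timeout=2.0):
    """Return True if the Ollama root page responds with its status message."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, OSError):
        return False    # connection refused, timed out, or host not found
    return "Ollama is running" in body


print("Ollama reachable:", is_ollama_running())
```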

You might need to restart Cursor (and maybe reboot). Then open a Terminal (control+\`) and run `ollama serve`

Useful Ollama commands (run these in the terminal, or with an exclamation mark in this notebook):

`ollama pull <model_name>` downloads a model locally  
`ollama ls` lists all the models you've downloaded  
`ollama rm <model_name>` deletes the specified model from your downloads

<table style="margin: 0; text-align: left; width:100%">
    <tr>
        <td style="width: 150px; height: 150px; vertical-align: middle;">
            <img src="../assets/stop.png" width="150" height="150" style="display: block;" />
        </td>
        <td>
            <h2 style="color:#ff7800;">Super important - ignore me at your peril!</h2>
            <span style="color:#ff7800;">The model called <b>llama3.3</b> is FAR too large for home computers - it's not intended for personal computing and will consume all your resources! Stick with the nicely sized <b>llama3.2</b> or <b>llama3.2:1b</b> and if you want larger, try llama3.1 or smaller variants of Qwen, Gemma, Phi or DeepSeek. See <a href="https://ollama.com/models">the Ollama models page</a> for a full list of models and sizes.
            </span>
        </td>
    </tr>
</table>

In [None]:
!ollama pull llama3.2

In [None]:
ollama = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')
model_name = "llama3.2"

response = ollama.chat.completions.create(model=model_name, messages=messages)
answer = response.choices[0].message.content

display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)

In [None]:
# So where are we?

print(competitors)
print(answers)


In [None]:
# It's nice to know how to use "zip"
for competitor, answer in zip(competitors, answers):
    print(f"Competitor: {competitor}\n\n{answer}")


In [20]:
# Let's bring this together - note the use of "enumerate"

together = ""
for index, answer in enumerate(answers):
    together += f"# Response from competitor {index+1}\n\n"
    together += answer + "\n\n"

In [None]:
print(together)

In [22]:
judge = f"""You are judging a competition between {len(competitors)} competitors.
Each model has been given this question:

{question}

Your job is to evaluate each response for clarity and strength of argument, and rank them in order of best to worst.
Respond with JSON, and only JSON, with the following format:
{{"results": ["best competitor number", "second best competitor number", "third best competitor number", ...]}}

Here are the responses from each competitor:

{together}

Now respond with the JSON with the ranked order of the competitors, nothing else. Do not include markdown formatting or code blocks."""


In [None]:
print(judge)

In [29]:
judge_messages = [{"role": "user", "content": judge}]

In [None]:
# Judgement time!

openai = OpenAI()
response = openai.chat.completions.create(
    model="o3-mini",
    messages=judge_messages,
)
results = response.choices[0].message.content
print(results)


In [None]:
# OK let's turn this into results!

results_dict = json.loads(results)
ranks = results_dict["results"]
for index, result in enumerate(ranks):
    competitor = competitors[int(result)-1]
    print(f"Rank {index+1}: {competitor}")
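
The cell above trusts the judge to return clean JSON. Models sometimes wrap the JSON in a Markdown code fence or repeat a competitor number, so here is a sketch of a more defensive parser. `parse_ranking` is a hypothetical helper name. It assumes the `{"results": [...]}` format requested in the judge prompt and returns 0-based indexes into `competitors`.

```python
import json

FENCE = "`" * 3  # Markdown code fence marker


def parse_ranking(raw, n_competitors):
    """Parse the judge's reply into a list of 0-based competitor indexes."""
    text = raw.strip()
    # Drop a surrounding Markdown code fence (optionally tagged "json") if present
    if text.startswith(FENCE):
        text = text.strip("`").strip()
        if text.lower().startswith("json"):
            text = text[4:]

    ranks = json.loads(text)["results"]
    indexes = [int(r) - 1 for r in ranks]

    # Every competitor must appear exactly once
    if sorted(indexes) != list(range(n_competitors)):
        raise ValueError(f"Ranking must list each competitor exactly once: {ranks}")
    return indexes


sample_reply = FENCE + 'json\n{"results": ["2", "1", "3"]}\n' + FENCE
names = ["model-a", "model-b", "model-c"]  # placeholder competitor names
for position, idx in enumerate(parse_ranking(sample_reply, len(names)), start=1):
    print(f"Rank {position}: {names[idx]}")
```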

<table style="margin: 0; text-align: left; width:100%">
    <tr>
        <td style="width: 150px; height: 150px; vertical-align: middle;">
            <img src="../assets/exercise.png" width="150" height="150" style="display: block;" />
        </td>
        <td>
            <h2 style="color:#ff7800;">Exercise</h2>
            <span style="color:#ff7800;">Which pattern(s) did this use? Try updating this to add another Agentic design pattern.
            </span>
        </td>
    </tr>
</table>

<table style="margin: 0; text-align: left; width:100%">
    <tr>
        <td style="width: 150px; height: 150px; vertical-align: middle;">
            <img src="../assets/business.png" width="150" height="150" style="display: block;" />
        </td>
        <td>
            <h2 style="color:#00bfff;">Commercial implications</h2>
            <span style="color:#00bfff;">These kinds of patterns - to send a task to multiple models, and evaluate results,
            are common where you need to improve the quality of your LLM response. This approach can be universally applied
            to business projects where accuracy is critical.
            </span>
        </td>
    </tr>
</table>