# Handling Rate Limits & Retrying with the Responses API

This notebook covers:
1. **Rate Limits** – understanding, inspecting, and proactively respecting API rate limits.
2. **Retrying** – strategies for automatically retrying failed requests using exponential backoff.

*Most examples use the `Responses API` via `client.responses.create(...)`; the header-inspection and per-request retry cells call the Chat Completions endpoint.*

## Install Dependencies
Make sure you have the latest OpenAI client and retry libraries.

In [24]:
%pip install openai tenacity backoff --quiet --upgrade


Note: you may need to restart the kernel to use updated packages.


## Initialize Client
Use your `OPENAI_API_KEY` environment variable or pass `api_key=` explicitly.

In [25]:
from openai import OpenAI

In [26]:
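import os

MODEL = "gpt-4.1-mini"
# Read the key from the environment rather than hard-coding it in the notebook
API_KEY = os.environ.get("OPENAI_API_KEY", "")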

In [27]:
# Create an OpenAI client instance
client = OpenAI(
    # Pass the key explicitly, or omit api_key to let the client read OPENAI_API_KEY from the environment
    api_key=API_KEY,
    # Maximum number of automatic retries per request (the default is 2)
    max_retries=6
)

## 1. Rate Limits Overview

- **RPM**: requests per minute
- **TPM**: tokens per minute
- **RPD**, **TPD**, **IPM**: requests per day, tokens per day, and images per minute

Rate limits protect fairness and stability. You can view your current limits in your account dashboard.
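
To respect a requests-per-minute limit proactively, you can pace calls on the client side before they are sent. The following is a minimal sketch of a sliding-window throttle using only the standard library. The class name `RequestThrottle` and its parameters are hypothetical (they are not part of the OpenAI SDK), and the limit you pass in should come from your own account dashboard.

```python
import time
from collections import deque


class RequestThrottle:
    """Hypothetical helper: allow at most `max_requests` per `window_seconds`."""

    def __init__(self, max_requests, window_seconds=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock    # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._sent = deque()   # timestamps of recently recorded requests

    def wait(self):
        """Block until another request fits in the window, then record it."""
        now = self._clock()
        # Forget requests that have aged out of the window.
        while self._sent and now - self._sent[0] >= self.window_seconds:
            self._sent.popleft()
        if len(self._sent) >= self.max_requests:
            # Sleep until the oldest recorded request leaves the window.
            self._sleep(self.window_seconds - (now - self._sent[0]))
            now = self._clock()
            self._sent.popleft()
        self._sent.append(now)
```

Calling `throttle.wait()` immediately before each `client.responses.create(...)` keeps the request rate under the configured ceiling. This sketch only tracks request counts; a token-per-minute budget would need its own accounting.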

### Inspecting Rate Limit Headers
Each response exposes headers like `x-ratelimit-remaining-requests`. Let's see how to access them.

In [28]:
import requests

headers = {
    'Authorization': f'Bearer {API_KEY}',
    'Content-Type': 'application/json'
}

messages = [
    {"role": "user", "content": "j"}
]
data = {
    'model': MODEL, 
    'messages': messages,  # List of message dictionaries
    'max_tokens': 1,
}

response = requests.post('https://api.openai.com/v1/chat/completions', headers=headers, json=data)
# Surface HTTP errors (for example a bad key) before reading headers
response.raise_for_status()
remaining_requests = response.headers['x-ratelimit-remaining-requests']
print("Remaining requests: ", remaining_requests)

Remaining requests:  499
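
The remaining-requests header is one of several `x-ratelimit-*` headers. As a rough sketch, the helpers below collect them from any header mapping and convert a reset value into seconds. The function names `rate_limit_snapshot` and `parse_reset` are hypothetical, and the parser assumes reset values are duration strings such as `1s`, `6m0s`, or `250ms`.

```python
import re

# Rate-limit header names to look for in a response
RATE_LIMIT_HEADERS = (
    "x-ratelimit-limit-requests",
    "x-ratelimit-remaining-requests",
    "x-ratelimit-reset-requests",
    "x-ratelimit-limit-tokens",
    "x-ratelimit-remaining-tokens",
    "x-ratelimit-reset-tokens",
)


def rate_limit_snapshot(headers):
    """Pick the rate-limit headers out of any mapping of response headers."""
    lowered = {key.lower(): value for key, value in headers.items()}
    return {name: lowered[name] for name in RATE_LIMIT_HEADERS if name in lowered}


def parse_reset(value):
    """Convert a duration such as '6m0s', '1s', or '250ms' to seconds.

    Assumption: the header uses h/m/s/ms unit suffixes.
    """
    units = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 0.001}
    parts = re.findall(r"(\d+(?:\.\d+)?)(ms|h|m|s)", value)
    return sum(float(number) * units[unit] for number, unit in parts)
```

With the `response` object from the cell above, `rate_limit_snapshot(response.headers)` returns whichever of these headers the server sent, which is convenient for logging.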


## 2. Retrying Strategies
When you do hit a rate limit (`429`), automatic retries with exponential backoff help smooth over spikes.
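
To make the idea concrete, the sketch below computes the wait before each retry: the ceiling doubles on every attempt up to a cap, and jitter picks a random point below that ceiling. The function `backoff_delays` and its default values are illustrative assumptions, not settings taken from the OpenAI client.

```python
import random


def backoff_delays(attempts, base=1.0, factor=2.0, cap=60.0, jitter=True, rng=random):
    """Return the wait (in seconds) before each of `attempts` retries."""
    delays = []
    for attempt in range(attempts):
        # The ceiling grows geometrically but never exceeds `cap`.
        ceiling = min(cap, base * factor ** attempt)
        # Full jitter: pick uniformly in [0, ceiling] so clients spread out.
        delays.append(rng.uniform(0, ceiling) if jitter else ceiling)
    return delays


print(backoff_delays(6, jitter=False))  # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
```

Without jitter, every client that was throttled at the same moment would retry at the same moments too; randomizing the wait spreads those retries out.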

### 2.0 With the client itself

In [34]:
# Override the client-level max_retries for a single request:
client.with_options(max_retries=5).chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "How can I get the name of the current day in JavaScript?",
        }
    ],
    model=MODEL,
)

ChatCompletion(id='chatcmpl-BNGvsrH0r03LiTrOJV1ka4n4mXQ6N', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='To get the name of the current day in JavaScript, you can use the `Date` object along with an array of day names. Here\'s a simple example:\n\n```javascript\n// Create a new Date object for the current date and time\nconst today = new Date();\n\n// Array of day names\nconst days = [\'Sunday\', \'Monday\', \'Tuesday\', \'Wednesday\', \'Thursday\', \'Friday\', \'Saturday\'];\n\n// Get the current day index (0 = Sunday, 1 = Monday, etc.)\nconst dayIndex = today.getDay();\n\n// Get the name of the current day\nconst dayName = days[dayIndex];\n\nconsole.log(dayName);  // e.g., "Monday"\n```\n\n### Explanation:\n- `Date.prototype.getDay()` returns an integer between 0 and 6 representing the day of the week (0 for Sunday, 1 for Monday, etc.).\n- You use this index to access the corresponding day name from the `days` array.\n\nAlternat

### 2.1 Tenacity Example
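Use `tenacity` to retry on `RateLimitError` with randomized exponential backoff. The client's own `max_retries` setting still applies inside each attempt.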

In [30]:
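import openai
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_random_exponential

@retry(
    reraise=True,
    # Only retry rate-limit errors; other failures surface immediately
    retry=retry_if_exception_type(openai.RateLimitError),
    stop=stop_after_attempt(6),
    wait=wait_random_exponential(min=1, max=60)
)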
def create_with_backoff(**kwargs):
    return client.responses.create(**kwargs)

# call with retry logic
resp = create_with_backoff(
    model=MODEL,
    input=[{"role":"user","content":"Tell me a joke."}]
)
print(resp.output_text)

Sure! Here’s a joke for you:

Why don't scientists trust atoms?

Because they make up everything!


### 2.2 Manual Exponential Backoff
Implement your own retry decorator.

In [32]:
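import functools
import random
import time
import openai

def retry_with_backoff(
    func,
    initial_delay=1,
    factor=2,
    jitter=True,
    max_retries=5
):
    """Retry `func` on RateLimitError, multiplying the delay by `factor` each time."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        delay = initial_delay
        # One initial attempt plus up to `max_retries` retries
        for attempt in range(max_retries + 1):
            try:
                return func(*args, **kwargs)
            except openai.RateLimitError:
                if attempt == max_retries:
                    raise  # out of retries: surface the original error
                # Jitter stretches the delay by a random 0-100%
                sleep = delay * (1 + (random.random() if jitter else 0))
                print(f"Rate limited, retrying in {sleep:.1f}s...")
                time.sleep(sleep)
                delay *= factor
    return wrapper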

@retry_with_backoff
def create_manual(**kwargs):
    return client.responses.create(**kwargs)

# Call with manual backoff
resp3 = create_manual(
    model=MODEL,
    input=[{"role":"user","content":"Explain Kubernetes."}]
)
print(resp3.output_text)

Kubernetes, often abbreviated as **K8s**, is an open-source platform designed for automating the deployment, scaling, and management of containerized applications. Originally developed by Google, it is now maintained by the Cloud Native Computing Foundation (CNCF).

### Key Concepts of Kubernetes

1. **Containers**: Containers package an application and its dependencies into a single unit that can run reliably on any computing environment. Kubernetes is built specifically to orchestrate containers, often using Docker or other container runtimes.

2. **Cluster**: A Kubernetes cluster consists of a set of worker machines, called **nodes**, that run containerized applications. One or more master nodes manage the cluster.

3. **Pods**: The smallest deployable units in Kubernetes. A pod can contain one or more tightly coupled containers that share storage/network and a specification for how to run them.

4. **Nodes**: Machines that run your containers. Each node runs a container runtime and

### Next Steps & Best Practices
- Monitor your usage dashboard regularly.
- Surface rate-limit headers in your UI/logs.
- Tune your backoff parameters for your workload.
- Consider model fallback if certain models hit limits.
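
The model-fallback idea from the list above could look roughly like the sketch below. `create_with_fallback` is a hypothetical helper, not an SDK function: the caller supplies the request function, an ordered list of model IDs, and a predicate that recognizes a rate-limit error.

```python
def create_with_fallback(create_fn, models, is_rate_limit, **kwargs):
    """Try each model in order, moving on only when a rate limit is hit."""
    last_error = None
    for model in models:
        try:
            return create_fn(model=model, **kwargs)
        except Exception as exc:
            if not is_rate_limit(exc):
                raise  # unrelated failures should surface immediately
            last_error = exc
    if last_error is None:
        raise ValueError("models must not be empty")
    raise last_error  # every model was rate limited


# Example wiring (assumes `client`, `MODEL`, and `openai` from the cells above;
# the second model ID is a placeholder you would replace):
# resp = create_with_fallback(
#     client.responses.create,
#     models=[MODEL, "your-fallback-model"],
#     is_rate_limit=lambda exc: isinstance(exc, openai.RateLimitError),
#     input="Tell me a joke.",
# )
```

Passing the predicate in keeps the helper independent of any one SDK, and it can be combined with a backoff wrapper so each model gets a few retries before the next one is tried.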

> **Pro Tip:** Combining jitter with exponential backoff keeps many clients from retrying at the same instant, which avoids thundering-herd issues.