# Timeouts and Retries

When issuing HTTP requests and interacting with **external services**, we should have **timeout** and **retry** mechanisms in place. External services can be **slow or unreliable**, which can cause scripts to **hang** or **fail unexpectedly**.

Timeouts and retries help keep automation scripts **responsive** and **resilient** to flaky external behavior.

## Timeouts in `requests`

You can pass **`timeout`** in two ways:

| Form | Meaning |
|------|--------|
| **Single value** (e.g. `timeout=5`) | Same value used for both **connect** and **read** timeouts. |
| **Tuple** (e.g. `timeout=(2, 10)`) | **`(connect_timeout, read_timeout)`** – control each phase separately. |

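- **Connect timeout** – how long to wait for the **connection** to the server to be established. If it is exceeded (e.g. server unreachable, network slow), `requests` raises `ConnectTimeout`.
- **Read timeout** – how long to wait for the server to **send data** once connected: in practice, the wait for the first byte of the response and the gap between subsequent bytes. It is **not** a limit on the total time the request takes. If it is exceeded (e.g. server is slow to respond), `requests` raises `ReadTimeout`.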

So the problem can be either "couldn't connect" (connect timeout) or "connected but response took too long" (read timeout).
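As a rough sketch of the tuple form, the helper below passes separate connect and read timeouts and reports which phase, if any, timed out. The name `probe` and its default values (3 and 10 seconds) are illustrative assumptions, not part of `requests`.

```python
import requests


def probe(url, connect_timeout=3, read_timeout=10):
    """Issue a GET and report which phase, if any, timed out.

    `probe` and its defaults are hypothetical, chosen only for illustration.
    """
    try:
        # Tuple form: (connect_timeout, read_timeout)
        response = requests.get(url, timeout=(connect_timeout, read_timeout))
        return f"ok ({response.status_code})"
    except requests.exceptions.ConnectTimeout:
        # The connection could not be established in time
        return "connect timeout"
    except requests.exceptions.ReadTimeout:
        # Connected, but the server did not send data in time
        return "read timeout"
```

Catching the two exceptions separately lets a script react differently, for instance by reporting an unreachable host in one case and a slow endpoint in the other.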

## Simulating a read timeout with httpbin

**[httpbin.org](https://httpbin.org)** has a **`/delay/<seconds>`** endpoint that waits for the given number of seconds before responding. We'll request **`/delay/5`** (5 second delay) with a **2 second** timeout. The connection is established quickly, but the **response** takes longer than 2 seconds, so we get a **read timeout**.

```python
import requests
import time

HTTPBIN_ENDPOINT = "https://httpbin.org"
delay_url = f"{HTTPBIN_ENDPOINT}/delay/5"

start = time.perf_counter()
try:
    response = requests.get(delay_url, timeout=2)
    elapsed = time.perf_counter() - start
    print(f"Completed in {elapsed:.2f} seconds")
    print(f"Status: {response.status_code}")
except (requests.exceptions.ConnectTimeout, requests.exceptions.ReadTimeout) as timeout_error:
    elapsed = time.perf_counter() - start
    print(f"Timeout after {elapsed:.2f} seconds")
    print(timeout_error)
```

```text
Timeout after 2.74 seconds
HTTPSConnectionPool(host='httpbin.org', port=443): Read timed out. (read timeout=2)
```


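We get a **ReadTimeout**. The issue is with the **operation** (waiting for the response), not with establishing the connection. The connection was made, but the server waits 5 seconds before replying, and our **read timeout** is 2 seconds, so the request times out. The measured 2.74 seconds is more than 2 because it also includes the time spent establishing the connection before the read timer starts.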

**Try it yourself:** change `timeout=2` to `timeout=6` (or more) and run again; the request should complete and print the status code.

## Retries

**Transient issues** (network glitches, server overload, slow responses) can cause requests to **fail** or **time out**. A **retry mechanism** helps:

- **Retry** on: **server errors (5xx)** and **network exceptions** (e.g. timeouts). These are often temporary.
- **Do not retry** (break out): on **success**, or on **client errors (4xx)**. Client errors usually mean something is wrong with *our* request (bad URL, invalid payload, etc.); retrying the same request typically won't help.

You can use a **fixed delay** between retries for simplicity, or **exponential backoff** (with optional jitter) for a more robust approach. Below we use a fixed delay.

**Important:** Avoid retrying **non-idempotent** operations. An **idempotent** operation can be repeated multiple times with the **same end result** (e.g. GET, or a well-designed PUT). A **non-idempotent** operation can change state each time (e.g. POST that creates a new record). If you retry a failed POST, you might create duplicates. For such cases, handle the error differently instead of blindly retrying.
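The retry rules above can be sketched as a small predicate. The helper name `should_retry` and its parameters are hypothetical; the method set lists the methods that HTTP defines as idempotent, which is an assumption about how a given server actually behaves.

```python
# Methods that HTTP defines as idempotent; a real API may not honor this,
# so treat the set as an assumption to verify per service.
IDEMPOTENT_METHODS = {"GET", "HEAD", "OPTIONS", "PUT", "DELETE"}


def should_retry(method, status_code=None, timed_out=False):
    """Decide whether a failed request is worth retrying.

    Hypothetical helper: retry only idempotent methods, and only on
    server errors (5xx) or timeouts.
    """
    if method.upper() not in IDEMPOTENT_METHODS:
        return False  # e.g. POST: a retry might create a duplicate record
    if timed_out:
        return True  # network timeouts are often temporary
    # Success (2xx) and client errors (4xx) are not retried
    return status_code is not None and 500 <= status_code < 600
```

Under these assumptions, `should_retry("GET", 503)` is `True`, while `should_retry("GET", 404)` and `should_retry("POST", 503)` are both `False`.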

### Simple retry loop with fixed delay

We use httpbin's **`/status/<code>`** endpoint. To simulate flakiness (about **2/3 server error, 1/3 success**), we pick at random one of three URLs: two return **500**, one returns **200**. We retry up to **max_retries** times with a **fixed delay** between attempts. We **break** on success or on client errors (4xx); we **retry** on server errors (5xx) and optionally on timeouts.

```python
import requests
import time
import random

HTTPBIN_ENDPOINT = "https://httpbin.org"
# Simulate flakiness: 2/3 chance 500, 1/3 chance 200 (picked at random per request)
flaky_urls = [
    f"{HTTPBIN_ENDPOINT}/status/500",
    f"{HTTPBIN_ENDPOINT}/status/500",
    f"{HTTPBIN_ENDPOINT}/status/200",
]

max_retries = 3
delay = 2  # seconds between retries

for attempt in range(1, max_retries + 1):
    print(f"Attempt {attempt}/{max_retries}")
    try:
        url = random.choice(flaky_urls)
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # raises HTTPError for 4xx and 5xx
        print(f"Succeeded with status {response.status_code}")
        break
    except requests.exceptions.HTTPError as error:
        status_code = error.response.status_code
        if status_code < 500:
            # Client error: retrying the same request will not help
            print(f"Failed with client error {status_code}, skipping retry")
            break
        print(f"Failed with server error {status_code}")
        if attempt < max_retries:
            print(f"Waiting {delay} seconds before retry...")
            time.sleep(delay)
    except (requests.exceptions.ConnectTimeout, requests.exceptions.ReadTimeout) as timeout_error:
        print(f"Timeout: {timeout_error}")
        if attempt < max_retries:
            print(f"Waiting {delay} seconds before retry...")
            time.sleep(delay)
else:
    # Runs only if the loop finished without a break
    print(f"All {max_retries} attempts failed")
```

```text
Attempt 1/3
Succeeded with status 200
```


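The **for-else** clause runs only if the loop **completes all iterations without hitting `break`**, i.e. we never succeeded and never bailed out on a client error. Only in that case do we print "All 3 attempts failed"; in the run above, the first attempt succeeded, so the `else` branch was skipped.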

Run the cell multiple times: sometimes you'll succeed on the first attempt (200), sometimes after one or two retries, and occasionally all three attempts may hit 500. The randomness demonstrates how retries help with transient server errors.
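To see the for-else behavior in isolation, without any network calls, here is a minimal sketch. The function `run_attempts` and its list of simulated outcomes are hypothetical, used only to show when the `else` branch runs.

```python
def run_attempts(outcomes):
    """Walk through simulated attempt outcomes (True means success)."""
    for attempt, succeeded in enumerate(outcomes, start=1):
        if succeeded:
            result = f"succeeded on attempt {attempt}"
            break  # leaving via break skips the else clause
    else:
        # Reached only when the loop ran to the end without a break
        result = f"all {len(outcomes)} attempts failed"
    return result
```

Here `run_attempts([False, True, False])` returns `"succeeded on attempt 2"`, while `run_attempts([False, False, False])` returns `"all 3 attempts failed"`.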

## Exponential backoff and jitter

**Fixed delays** can overwhelm a recovering server: many clients might retry at exactly the same time (e.g. all after 2 seconds), causing a **synchronized retry spike**.

**Exponential backoff** increases the wait time after each failure (e.g. 1s → 2s → 4s → 8s). That gives the server more time to recover and spreads out load. We usually cap the delay with a **maximum** (e.g. 30 seconds) so we still retry within a reasonable time.

**Jitter** is a small random offset added to the delay. It prevents every client from retrying at the same moment. For example, instead of everyone waiting exactly 2 seconds, we wait 2 ± 0.2 seconds (random), so retries are staggered.

Use **exponential backoff with jitter** in production for a more robust approach than a fixed delay.
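As a small illustration of the arithmetic, the sketch below computes a backoff schedule that starts at 1 second, doubles each time, is capped at a maximum, and adds jitter of up to ±10% of each delay. The function name `backoff_delays`, its parameters, and the 10% figure are assumptions chosen to match the examples above.

```python
import random


def backoff_delays(attempts, initial_delay=1.0, max_delay=30.0, jitter_fraction=0.1):
    """Return the wait (in seconds) to use before each retry.

    Hypothetical helper: the base delay doubles after every retry and is
    capped at max_delay; each wait gets a random offset of up to
    +/- jitter_fraction of its base delay.
    """
    waits = []
    delay = initial_delay
    for _ in range(attempts):
        jitter = random.uniform(-jitter_fraction * delay, jitter_fraction * delay)
        waits.append(delay + jitter)
        delay = min(delay * 2, max_delay)  # exponential growth with a cap
    return waits
```

With jitter disabled, `backoff_delays(7, jitter_fraction=0)` gives base delays of 1, 2, 4, 8, 16, 30 and 30 seconds, which shows the cap taking effect after the fifth retry.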

### Implementation: `get_with_backoff`

The function below retries with **exponential backoff** and **jitter**:
- **Client errors (4xx):** no retry; raise `RuntimeError` so the caller can handle.
- **Server errors (5xx) or timeouts:** compute `wait = min(delay * 2, max_delay) + jitter` (jitter in ±10% of current delay), sleep, then double the delay for the next attempt (capped at `max_delay`).
- After all attempts, raise `RuntimeError` with a clear message.

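Example progression (with `initial_delay=1`): delay 1 → wait ~1.9–2.1s → delay 2; then wait ~3.8–4.2s → delay 4; then ~7.6–8.4s → delay 8; then ~15.2–16.8s → delay 16. From there, the doubled value (32) exceeds the default `max_delay` of 30, so any later waits are capped at about 30s, plus or minus jitter.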

```python
import requests
import time
import random

def get_with_backoff(url, max_retries=3, initial_delay=1, max_delay=30, timeout=10):
    """GET url, retrying on 5xx responses and timeouts with exponential backoff and jitter."""
    delay = initial_delay
    for attempt in range(1, max_retries + 1):
        print(f"Attempt {attempt}/{max_retries}")
        try:
            response = requests.get(url, timeout=timeout)
            response.raise_for_status()  # raises HTTPError for 4xx and 5xx
            print(f"Succeeded with status {response.status_code}")
            return response
        except requests.exceptions.HTTPError as error:
            status_code = error.response.status_code
            if status_code < 500:
                # Client error: do not retry, let the caller handle it
                print(f"Failed with client error {status_code}, skipping retry")
                raise RuntimeError("Client error, please review request") from error
            # Jitter is within +/-10% of the current delay
            jitter = random.uniform(-0.1 * delay, 0.1 * delay)
            print(f"the jitter is {jitter}")
            wait = min(delay * 2, max_delay) + jitter
            print(f"Failed with server error {status_code}, retrying in {wait:.2f} seconds")
            time.sleep(wait)
            delay = min(delay * 2, max_delay)  # double the delay, capped at max_delay
        except (requests.exceptions.ConnectTimeout, requests.exceptions.ReadTimeout) as timeout_error:
            jitter = random.uniform(-0.1 * delay, 0.1 * delay)
            wait = min(delay * 2, max_delay) + jitter
            print(f"Timeout: {timeout_error}, retrying in {wait:.2f} seconds")
            time.sleep(wait)
            delay = min(delay * 2, max_delay)
    raise RuntimeError(f"All {max_retries} retries to query {url} failed")
```

```python
HTTPBIN_ENDPOINT = "https://httpbin.org"
# Endpoint that always returns 503 so we can see backoff and final failure
url = f"{HTTPBIN_ENDPOINT}/status/503"

try:
    response = get_with_backoff(url, max_retries=4)
    print(response.status_code)
except RuntimeError as e:
    print(e)
```

```text
Attempt 1/4
the jitter is 0.07895069018484877
Failed with server error 503, retrying in 2.08 seconds
Attempt 2/4
the jitter is 0.16435706905713027
Failed with server error 503, retrying in 4.16 seconds
Attempt 3/4
the jitter is 0.15453549158149715
Failed with server error 503, retrying in 8.15 seconds
Attempt 4/4
the jitter is -0.7108179062691058
Failed with server error 503, retrying in 15.29 seconds
All 4 retries to query https://httpbin.org/status/503 failed
```


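You should see the wait **increase** after each failed attempt (e.g. ~2s, then ~4s, then ~8s, then ~16s). As written, the function also sleeps after the final failed attempt (the 15.29-second wait in the output above) even though no request follows; guarding the sleep with `if attempt < max_retries` would skip that wasted wait. The total time is the sum of the **request times** plus the **wait times**.

**Note:** The backoff delay is *between* attempts; it does not include the time the server takes to respond (or time out). With `timeout=10`, each attempt can spend up to 10s connecting plus up to 10s waiting for data, so 4 attempts could add 40s or more of request time on top of the backoff waits.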
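One possible variation, sketched under assumptions rather than as a definitive implementation: the helper below applies the same wait formula as `get_with_backoff` to any callable, raises immediately after the last failed attempt instead of sleeping first, and accepts an injectable `sleep` so the schedule can be inspected without actually waiting. `RetryableError` and `call_with_backoff` are hypothetical names; in real use the callable would translate 5xx responses and timeouts into `RetryableError`.

```python
import random
import time


class RetryableError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. 5xx, timeouts)."""


def call_with_backoff(func, max_attempts=4, initial_delay=1, max_delay=30, sleep=time.sleep):
    """Call func() until it succeeds or max_attempts is reached."""
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except RetryableError as error:
            if attempt == max_attempts:
                # No attempt follows, so fail now instead of sleeping first
                raise RuntimeError(f"All {max_attempts} attempts failed") from error
            # Same formula as get_with_backoff: doubled delay plus +/-10% jitter
            wait = min(delay * 2, max_delay) + random.uniform(-0.1 * delay, 0.1 * delay)
            sleep(wait)
            delay = min(delay * 2, max_delay)
```

Passing `sleep=waits.append` (with `waits = []`) records the waits instead of pausing; with four failing attempts, only three waits are recorded, one between each pair of attempts.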

## Common pitfalls and how to avoid them

| Pitfall | Why it matters | What to do |
|--------|----------------|------------|
| **Forgetting timeouts** | Scripts can hang indefinitely if the server or network never responds. | **Always** set a `timeout` on every request. |
| **Retrying client errors (4xx)** | Usually the request is wrong (bad URL, invalid payload); retrying the same request won't help. | Retry only on **server errors (5xx)** and **network/timeout** issues. Retry on 4xx only if you have a good reason (e.g. request is built dynamically from another flaky API and might succeed on retry). |
| **Retrying non-idempotent operations** | Can cause **duplicate actions** (e.g. "complete purchase" runs twice → charged twice). | Retry only **safe, idempotent** operations (e.g. GET). For non-idempotent calls (POST that creates a record), handle the error explicitly and do not retry blindly. |
| **Fixed retry delays** | All clients retry at the same time → **synchronized retry spike** → server gets hammered again. | In production, use **exponential backoff** with **jitter** (randomness) to spread out retries. |