# Lab 02 — Local API (`requests`) + SQL Extraction to pandas

**Focus Areas:** HTTP APIs with `requests`, SQL → pandas (SQLite)

---

## Outcomes

By the end of this lab, you will be able to:

1. Call a **local REST API** with query parameters, parse JSON, and implement **robust error handling** for status codes.
2. Implement **pagination** and an **exponential backoff** retry policy that respects `429 Too Many Requests` and `5xx` errors.
3. Extract data from **SQLite** into pandas using **parameterized queries** and `pd.read_sql_query`, including **chunked reads**.
4. Persist results to **Parquet** for downstream LLM preprocessing.

## Prerequisites & Setup

- Python 3.13 with `requests`, `pandas`, `numpy`, `matplotlib`, `pyarrow` installed.
- JupyterLab or VS Code with Jupyter extension.
- **SQLite** available (via Python's built‑in `sqlite3`).
- **Local API** served by **Datasette** exposing data from a SQLite database.

### Setup Steps

1. Create a project folder and environment
2. Get a sample SQLite DB (Northwind)
3. Run a **local REST API** with Datasette
4. Start this notebook

### 1) Create a project folder and environment

```bash
mkdir lab02 && cd lab02
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install --upgrade pip
pip install requests pandas numpy matplotlib pyarrow datasette
```

### 2) Get a sample SQLite DB (Northwind)

```bash
curl -L -o northwind.db \
  https://raw.githubusercontent.com/jpwhite3/northwind-SQLite3/main/dist/northwind.db
```

### 3) Run a **local REST API** with Datasette (read‑only JSON over SQLite)

```bash
# Serves a browsable site *and* JSON endpoints
# The terminal prints the local URL: http://127.0.0.1:8001
# The same command works on macOS, Linux, and Windows PowerShell
datasette northwind.db -h 127.0.0.1 -p 8001
```

Keep this server running. Open a second terminal for the notebook.

---

## Part A — HTTP API with `requests`

> You will call the Datasette JSON API. Endpoint pattern:  
> `http://127.0.0.1:8001/<db>/<table>.json?_size=PAGE_SIZE&_next=...`  
> We'll use the `Orders` and `Order Details` tables.

### A1. Warm‑up: GET with params & `.json()`

In [None]:
import requests
BASE = "http://127.0.0.1:8001"
DB = "northwind"
TABLE = "Orders"
PAGE_SIZE = 50

params = {"_size": PAGE_SIZE}
r = requests.get(f"{BASE}/{DB}/{TABLE}.json", params=params, timeout=10)
print(r.status_code)
data = r.json()  # raises if not JSON
list(data.keys()), data.get("rows", [])[:2]

**Checkpoint:** Identify where rows live (Datasette returns `rows`).

### A2. Robust request helper with **status handling**

In [None]:
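from typing import Any
import time

import requests

# Statuses worth retrying: rate limiting and transient server errors
RETRY_STATUSES = {429, 500, 502, 503, 504}

class APIError(Exception):
    """Raised when a request fails permanently or runs out of retries."""

def get_json(url: str, params: dict[str, Any] | None = None, *, max_retries: int = 5) -> dict:
    """GET `url` and return the parsed JSON, retrying transient failures with exponential backoff."""
    backoff = 0.5
    last_error = "no attempts made"
    for attempt in range(1, max_retries + 1):
        try:
            resp = requests.get(url, params=params, timeout=15)
        except (requests.Timeout, requests.ConnectionError) as e:
            last_error = repr(e)  # network-level failure: retry
        else:
            if resp.status_code == 200:
                return resp.json()
            if resp.status_code not in RETRY_STATUSES:
                # Other statuses (e.g. 400, 404) will not succeed on retry
                raise APIError(f"Unexpected status {resp.status_code}: {resp.text[:200]}")
            last_error = f"HTTP {resp.status_code}"
        # Log the failure, wait, then double the delay for the next attempt
        print(f"attempt {attempt}/{max_retries} failed ({last_error}); sleeping {backoff}s")
        time.sleep(backoff)
        backoff *= 2
    raise APIError(f"Failed after {max_retries} attempts: {url} (last error: {last_error})")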

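**Simulating failures:** Stop the server briefly or change the port in `BASE` to provoke connection errors and observe the retries. This exercises the `ConnectionError` branch rather than a real 429/5xx response, but both paths use the same backoff.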
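
Outcome 2 asks for a retry policy that respects `429 Too Many Requests`. One way to do that is to prefer the server's `Retry-After` header when it is present. The sketch below is a hypothetical helper (`retry_delay` is not part of the lab code) and assumes the header carries a number of seconds; the HTTP-date form is not parsed and simply falls back to the caller's backoff.

```python
def retry_delay(headers: dict, backoff: float, cap: float = 30.0) -> float:
    """Pick how long to sleep before the next retry.

    Hypothetical helper: prefers a numeric Retry-After header (seconds),
    otherwise uses the caller's current backoff. The result is capped at `cap`.
    """
    raw = headers.get("Retry-After")
    if raw is not None:
        try:
            return min(max(float(raw), 0.0), cap)
        except ValueError:
            pass  # HTTP-date form: not handled in this sketch
    return min(backoff, cap)


print(retry_delay({"Retry-After": "3"}, backoff=0.5))  # server asked for 3 seconds
print(retry_delay({}, backoff=0.5))                    # no header: use the backoff
```

Inside `get_json`, the sleep on a retryable status could become `time.sleep(retry_delay(resp.headers, backoff))`; `resp.headers` is case-insensitive, so the lookup also matches `retry-after`.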

### A3. **Pagination** (`_next` cursor)

In [None]:
import pandas as pd

def fetch_all(base: str, db: str, table: str, page_size: int = 100) -> pd.DataFrame:
    url = f"{base}/{db}/{table}.json"
    params = {"_size": page_size}
    out = []
    next_tok = None
    while True:
        if next_tok:
            params["_next"] = next_tok
        payload = get_json(url, params)
        rows = payload.get("rows", [])
        if not rows:
            break
        out.extend(rows)
        next_tok = payload.get("next")  # Datasette provides a cursor token
        if not next_tok:
            break
    return pd.DataFrame(out)

orders = fetch_all(BASE, DB, "Orders", page_size=200)
orders.head(), len(orders)

### A4. Query filters via params

Datasette turns query parameters into column filters: `column=value` is an exact match. Example: find orders with `ShipCountry = 'USA'`.

In [None]:
usa = get_json(f"{BASE}/{DB}/Orders.json", params={"ShipCountry": "USA", "_size": 50})
len(usa["rows"]), usa["rows"][:2]

**Checkpoint:** Note how query parameters map to column filters.
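
To see the mapping concretely, it can help to build the query string without sending a request. This sketch uses the standard library's `urlencode`; the `__gte` and `__contains` suffixes are Datasette column-filter operators, and the specific columns and values are only illustrative.

```python
from urllib.parse import urlencode

BASE = "http://127.0.0.1:8001"

# Plain column=value is an exact match; a double-underscore suffix selects an operator
filters = {
    "ShipCountry": "USA",            # ShipCountry = 'USA'
    "OrderDate__gte": "1997-01-01",  # OrderDate >= '1997-01-01'
    "ShipCity__contains": "San",     # ShipCity contains 'San'
    "_size": 50,                     # leading underscore: a Datasette option, not a column
}
url = f"{BASE}/northwind/Orders.json?{urlencode(filters)}"
print(url)
```

Passing the same dictionary as `params=` to `get_json` sends the equivalent request, with `requests` doing the encoding.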

---

## Part B — SQL → pandas with SQLite

### B1. Parameterized queries

In [None]:
import sqlite3
import pandas as pd

# Update the path as needed. If you launched the notebook from inside lab02/, use "northwind.db".
conn = sqlite3.connect("lab02/northwind.db")

country = "USA"  # from user input in real apps
q = """
SELECT OrderID, CustomerID, OrderDate, ShipCountry
FROM Orders
WHERE ShipCountry = ? AND OrderDate >= ?
ORDER BY OrderDate DESC
"""
params = (country, "1997-01-01")

safe_df = pd.read_sql_query(q, conn, params=params)
safe_df.head()

> **Why `?` placeholders?** They prevent SQL injection: the SQLite driver binds each value separately from the SQL text, so user input is never parsed as SQL.
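
The difference is easy to observe on a throwaway database. The sketch below uses an in-memory SQLite table with made-up rows (the `demo` connection and its data are not part of the lab) and runs the same hostile input through string formatting and through a bound parameter.

```python
import sqlite3

# Throwaway in-memory table; the rows are made up for illustration
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE Orders (OrderID INTEGER, ShipCountry TEXT)")
demo.executemany(
    "INSERT INTO Orders VALUES (?, ?)",
    [(1, "USA"), (2, "France"), (3, "Brazil")],
)

malicious = "nowhere' OR 1=1 --"

# Unsafe: the input becomes part of the SQL text and rewrites the WHERE clause
unsafe_sql = f"SELECT OrderID FROM Orders WHERE ShipCountry = '{malicious}'"
unsafe_rows = demo.execute(unsafe_sql).fetchall()

# Safe: the input is bound as a value and compared literally
safe_rows = demo.execute(
    "SELECT OrderID FROM Orders WHERE ShipCountry = ?", (malicious,)
).fetchall()

print(len(unsafe_rows), len(safe_rows))
demo.close()
```

The formatted query matches every row because `OR 1=1` is parsed as SQL, while the bound query matches none because no country is literally named `nowhere' OR 1=1 --`.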

### B2. Chunked reads for large tables

In [None]:
import pyarrow as pa
import pyarrow.parquet as pq

big_q = "SELECT * FROM [Order Details]"  # name contains a space, so it must be quoted (brackets or double quotes)
chunks = pd.read_sql_query(big_q, conn, chunksize=10_000)  # iterator of DataFrames

# Stream to Parquet one chunk at a time; the writer takes its schema from the first chunk
writer = None
try:
    for chunk in chunks:
        table = pa.Table.from_pandas(chunk, preserve_index=False)
        if writer is None:
            # Update file path as needed
            writer = pq.ParquetWriter("lab02/order_details.parquet", table.schema)
        writer.write_table(table)
finally:
    # Close even if a chunk fails, so the file handle is released
    if writer is not None:
        writer.close()

**Checkpoint:** Verify output file size and row count by re‑reading with pandas.
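
Chunking is also useful when you only need an aggregate and never need the whole table in memory. The sketch below shows the idea with the standard library's `cursor.fetchmany` on a synthetic in-memory table (the `details` table and its values are invented for the demo). Iterating over `pd.read_sql_query(..., chunksize=...)` follows the same shape, with each chunk arriving as a DataFrame.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE details (Quantity INTEGER, UnitPrice REAL)")
demo.executemany(
    "INSERT INTO details VALUES (?, ?)",
    [(i % 5 + 1, 2.0) for i in range(1000)],  # 1000 synthetic rows
)

cur = demo.execute("SELECT Quantity, UnitPrice FROM details")
total_rows = 0
revenue = 0.0
n_chunks = 0
while True:
    batch = cur.fetchmany(250)  # at most 250 rows held in memory at a time
    if not batch:
        break
    n_chunks += 1
    total_rows += len(batch)
    # Accumulate across chunks instead of materializing the full table
    revenue += sum(qty * price for qty, price in batch)

print(total_rows, n_chunks, revenue)
demo.close()
```

Only the running totals survive between iterations, so peak memory depends on the chunk size rather than the table size.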

### B3. Quick validation snapshot

In [None]:
import pandas as pd
# Update file path as needed
p = pd.read_parquet("lab02/order_details.parquet")
print(len(p))
print(p.select_dtypes(include='number').describe().T.head())

---

## Part C — Wrap‑Up

Answer the following questions:

### Question 1: How does your retry/backoff behave for 429 vs 500?

*Your answer here:*

Both 429 (Too Many Requests) and 500 (Internal Server Error) trigger the same exponential backoff in our `get_json` function. The delay starts at 0.5 seconds and doubles after each failed attempt (0.5s → 1s → 2s → 4s → 8s), for at most 5 attempts. The function does not read a `Retry-After` header, so a rate-limit response is treated no differently from a server error. Backing off gives the server time to recover instead of hitting it with immediate retries.

### Question 2: Why are **parameterized** queries the default choice? Provide a one‑sentence example of a potential injection if not parameterized.

*Your answer here:*

Parameterized queries prevent SQL injection because the driver binds user input as values instead of concatenating it into the SQL string. For example, if the country were concatenated directly, the input `' OR 1=1 --` would make the `WHERE` condition always true and return every order regardless of country. (The classic `'; DROP TABLE Orders; --` payload relies on stacked statements, which Python's `sqlite3` rejects in a single `execute` call, though other drivers may allow them.)

### Question 3: When would you choose chunked reads? What trade‑off do you incur?

*Your answer here:*

Chunked reads are the right choice when a table will not fit in memory at once, or when data can be processed incrementally (for example, streaming to Parquet or transforming one batch at a time). The trade-off is more complex code, some per-chunk overhead, and the loss of easy whole-table operations: sorts, joins, and global aggregates must be pushed into SQL or accumulated across chunks.

---

## Final Thoughts

- **Datasette tips:** JSON is at `.../table.json` (use `_size`, `_next`, and column filters)
- **Simulating errors:** Stop server to trigger connection errors; add a bogus param to force 4xx; explore retry logs.
- **Common pitfalls:**
  - Missing `timeout` in `requests` → hanging cells.
  - Forgetting `params` vs string concatenation in URLs.
  - Using f‑strings to inject SQL instead of `params=...`.

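**Artifacts to retain:** `order_details.parquet` (written in B2) and `orders.parquet`. The lab cells do not write the second file, so save the `orders` DataFrame from A3 yourself, e.g. `orders.to_parquet("orders.parquet")`.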

---

## Appendix — Solution Snippets Reference

**Backoff conditions you might retry:** `429, 500, 502, 503, 504` (idempotent GETs). Use jitter in production to avoid thundering herds.
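
A minimal sketch of jittered backoff, assuming the "full jitter" variant, where each delay is drawn uniformly between zero and an exponentially growing ceiling. The function name `backoff_delays` and its defaults are illustrative, not part of the lab code.

```python
import random


def backoff_delays(base: float = 0.5, cap: float = 30.0, attempts: int = 5):
    """Yield one sleep duration per attempt using full jitter.

    The ceiling doubles each attempt (base, 2*base, 4*base, ...) up to `cap`;
    the actual delay is drawn uniformly from [0, ceiling].
    """
    for attempt in range(attempts):
        ceiling = min(cap, base * 2 ** attempt)
        yield random.uniform(0, ceiling)


random.seed(0)  # seeded only so the demo is repeatable
for n, delay in enumerate(backoff_delays(), start=1):
    print(f"attempt {n}: sleep {delay:.2f}s")
```

Randomizing the delay spreads out clients that failed at the same moment, so they do not all retry in lockstep.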

**Cursor pagination recap (Datasette):** Examine `payload["next"]` and pass it back as `_next` with the same `_size` to retrieve subsequent pages.
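
The cursor loop can be tried without a running server by injecting the page-fetching function. In this sketch, `iter_rows`, `fake_fetch`, and the `FAKE_PAGES` data are hypothetical stand-ins; only the `rows`/`next` payload keys and the `_size`/`_next` parameters come from the lab.

```python
from typing import Callable, Iterator


def iter_rows(fetch_page: Callable[[dict], dict], page_size: int = 100) -> Iterator[dict]:
    """Yield rows page by page, following the `next` cursor until it is empty."""
    params = {"_size": page_size}
    while True:
        payload = fetch_page(params)
        yield from payload.get("rows", [])
        next_tok = payload.get("next")
        if not next_tok:
            break
        # Same page size, new cursor
        params = {"_size": page_size, "_next": next_tok}


# Stand-in for the server: three fake pages keyed by cursor token
FAKE_PAGES = {
    None: {"rows": [{"id": 1}, {"id": 2}], "next": "c2"},
    "c2": {"rows": [{"id": 3}, {"id": 4}], "next": "c4"},
    "c4": {"rows": [{"id": 5}], "next": None},
}


def fake_fetch(params: dict) -> dict:
    return FAKE_PAGES[params.get("_next")]


ids = [row["id"] for row in iter_rows(fake_fetch, page_size=2)]
print(ids)
```

Against the real API, `fetch_page` could be `lambda p: get_json(url, p)`, which keeps the retry logic and the pagination logic separate.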

**`read_sql_query` params:** Use `?` for SQLite, `%s` for Postgres, and pass values via `params=...` so the DB‑API binds them safely.
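
A common follow-up is filtering on a list of values. One approach, sketched here on a made-up in-memory table, is to generate one `?` per value so that only the placeholder count is formatted into the SQL text while the values themselves are still bound.

```python
import sqlite3

countries = ["USA", "France"]  # illustrative values

# One "?" per value; no user data is formatted into the SQL string
placeholders = ", ".join("?" for _ in countries)
sql = f"SELECT OrderID FROM Orders WHERE ShipCountry IN ({placeholders}) ORDER BY OrderID"

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE Orders (OrderID INTEGER, ShipCountry TEXT)")
demo.executemany(
    "INSERT INTO Orders VALUES (?, ?)",
    [(1, "USA"), (2, "Brazil"), (3, "France")],
)

print(sqlite3.paramstyle)  # placeholder style declared by the driver
matched = [row[0] for row in demo.execute(sql, countries)]
print(matched)
demo.close()
```

The same `sql` string and list can be passed to `pd.read_sql_query(sql, conn, params=countries)`.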