# Chicago Crime Data Loader

This notebook downloads raw crime data from the City of Chicago Open Data Portal using the Socrata Open Data API (SODA).

**Purpose:** Download raw data only. No data cleaning or feature engineering is performed here.

**Data Sources:**
- Training data: 2015–2024
- Test data: 2025

## 1. Environment Setup and Configuration

In [1]:
import os
import requests
import pandas as pd
from typing import Optional, List, Dict

# Configuration
BASE_API_URL = "https://data.cityofchicago.org/resource/ijzp-q8t2.json"
DEFAULT_BATCH_SIZE = 50000  # Rows per request (SODA 2.0 endpoints cap $limit at 50,000)

## 2. Core API Functions

In [2]:
def fetch_chicago_data(
    limit: int,
    offset: int,
    where: Optional[str] = None,
    select: Optional[str] = None
) -> List[Dict]:
    """
    Fetch a single batch of raw data from the Chicago Open Data API.

    Args:
        limit: Number of records to fetch
        offset: Offset for pagination
        where: SQL-like WHERE clause (optional)
        select: Columns to select (optional)

    Returns:
        List of raw records (each record is a dictionary)
    """
    params = {
        "$limit": limit,
        "$offset": offset,
        # Paging needs a stable sort order, otherwise pages may overlap or skip rows
        "$order": ":id",
    }

    if where:
        params["$where"] = where
    if select:
        params["$select"] = select

    # Without a timeout, a stalled connection would block the notebook indefinitely
    response = requests.get(BASE_API_URL, params=params, timeout=60)
    response.raise_for_status()

    return response.json()
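
To see what a single paged request looks like on the wire, the query string can be built offline with the standard library. This is a minimal sketch: `build_query_url` is a hypothetical helper that is not part of the notebook, and it only mirrors the `$limit`, `$offset`, and `$where` parameters used above. In the notebook itself, `requests` handles the encoding when it is given `params`.

```python
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlsplit

# Same endpoint as BASE_API_URL in the notebook
BASE_API_URL = "https://data.cityofchicago.org/resource/ijzp-q8t2.json"


def build_query_url(limit: int, offset: int, where: Optional[str] = None) -> str:
    """Hypothetical helper: return the full URL for one paged SODA request."""
    params = {"$limit": limit, "$offset": offset}
    if where:
        params["$where"] = where
    # urlencode escapes "$", "=", and spaces so the clause survives transport
    return f"{BASE_API_URL}?{urlencode(params)}"


url = build_query_url(limit=50000, offset=100000, where="year = 2025")
# Decode the query string again to confirm the round trip
decoded = parse_qs(urlsplit(url).query)
print(url)
print(decoded)
```

Printing the URL this way can help when debugging a `$where` clause in a browser before running a long download.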

In [3]:
def fetch_all_chicago_data(
    where: Optional[str] = None,
    batch_size: int = DEFAULT_BATCH_SIZE
) -> pd.DataFrame:
    """
    Fetch all records that satisfy the WHERE clause using pagination.

    Args:
        where: SQL-like WHERE clause
        batch_size: Number of records per API request

    Returns:
        pandas DataFrame containing ALL raw records
    """
    all_records: List[Dict] = []
    offset = 0

    while True:
        print(f"Fetching records {offset} to {offset + batch_size} ...")

        batch = fetch_chicago_data(
            limit=batch_size,
            offset=offset,
            where=where
        )

        if not batch:
            break

        all_records.extend(batch)
        offset += batch_size

    return pd.DataFrame(all_records)
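
The loop above stops only when a batch comes back empty, which costs one extra request at the end of every download. One possible variant stops as soon as a page is shorter than the batch size. The sketch below is an illustration under that assumption, not the notebook's implementation: `paginate`, `fetch_page`, and `fake_fetch` are hypothetical names, and the API is replaced by an in-memory list so the logic can be run without network access.

```python
from typing import Callable, Dict, List


def paginate(
    fetch_page: Callable[[int, int], List[Dict]],
    batch_size: int,
) -> List[Dict]:
    """Collect pages until one comes back shorter than batch_size."""
    records: List[Dict] = []
    offset = 0
    while True:
        page = fetch_page(batch_size, offset)
        records.extend(page)
        # A short (or empty) page means the result set is exhausted
        if len(page) < batch_size:
            break
        offset += batch_size
    return records


# Stand-in for the API: 7 fake rows served from memory
FAKE_ROWS = [{"id": i} for i in range(7)]
calls: List[int] = []


def fake_fetch(limit: int, offset: int) -> List[Dict]:
    calls.append(offset)  # Record each requested offset
    return FAKE_ROWS[offset:offset + limit]


rows = paginate(fake_fetch, batch_size=3)
print(len(rows), calls)  # prints: 7 [0, 3, 6]
```

This early exit assumes the server always returns a full page whenever more rows remain. If that cannot be relied on, stopping on an empty batch, as the notebook does, is the more conservative choice.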

## 3. IO Utilities

In [4]:
def save_to_csv(df: pd.DataFrame, filepath: str) -> None:
    """
    Save DataFrame to CSV.

    Args:
        df: DataFrame to save
        filepath: Output file path
    """
    directory = os.path.dirname(filepath)
    if directory:  # os.makedirs("") raises, so skip it for bare filenames
        os.makedirs(directory, exist_ok=True)
    df.to_csv(filepath, index=False)
    print(f"Saved raw data to {filepath}")
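
Directory creation can also be expressed with `pathlib`, which treats the parent of a bare filename as `.` and therefore needs no special case. The following is a minimal sketch: `write_rows` is a hypothetical helper, and it uses the standard-library `csv` module instead of pandas only to keep the example dependency-free. It also assumes a non-empty list of rows that all share the same keys.

```python
import csv
import tempfile
from pathlib import Path
from typing import Dict, List


def write_rows(rows: List[Dict], filepath: str) -> Path:
    """Hypothetical helper: write dict rows to CSV, creating parent folders."""
    path = Path(filepath)
    # Path("file.csv").parent is ".", so mkdir is safe for bare filenames too
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return path


# Write into a throwaway directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp:
    out = write_rows([{"id": "1", "year": "2025"}], f"{tmp}/data/raw/sample.csv")
    content = out.read_text(encoding="utf-8")
    print(content)
```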

## 4. Data Download Execution

### 4.1 Download Training Data (2015–2024)

In [None]:
print("Starting raw Chicago crime data download...")

# Training data: 2015–2024
train_where = "year >= 2015 AND year <= 2024"
df_train = fetch_all_chicago_data(where=train_where)

print(f"Training set size: {len(df_train):,}")
print(f"Training columns: {list(df_train.columns)}")

save_to_csv(
    df_train,
    "../data/raw/chicago_crimes_2015_2024_raw.csv"
)

Starting raw Chicago crime data download...
Fetching records 0 to 50000 ...
Fetching records 50000 to 100000 ...
Fetching records 100000 to 150000 ...
Fetching records 150000 to 200000 ...
Fetching records 200000 to 250000 ...
Fetching records 250000 to 300000 ...
Fetching records 300000 to 350000 ...
Fetching records 350000 to 400000 ...
Fetching records 400000 to 450000 ...
Fetching records 450000 to 500000 ...
Fetching records 500000 to 550000 ...
Fetching records 550000 to 600000 ...
Fetching records 600000 to 650000 ...
Fetching records 650000 to 700000 ...
Fetching records 700000 to 750000 ...
Fetching records 750000 to 800000 ...
Fetching records 800000 to 850000 ...
Fetching records 850000 to 900000 ...
Fetching records 900000 to 950000 ...
Fetching records 950000 to 1000000 ...
Fetching records 1000000 to 1050000 ...
Fetching records 1050000 to 1100000 ...
Fetching records 1100000 to 1150000 ...
Fetching records 1150000 to 1200000 ...
Fetching records 1200000 to 1250000 ...
[output truncated]
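
A ten-year download holds every record in memory before anything is written to disk. One way to bound memory use, sketched here as an assumption and not as part of the notebook, is to split the range into one filter per year. `yearly_where_clauses` is a hypothetical helper.

```python
from typing import List


def yearly_where_clauses(start_year: int, end_year: int) -> List[str]:
    """Hypothetical helper: one SoQL filter per calendar year, inclusive."""
    if start_year > end_year:
        raise ValueError("start_year must not exceed end_year")
    return [f"year = {year}" for year in range(start_year, end_year + 1)]


clauses = yearly_where_clauses(2015, 2024)
print(len(clauses), clauses[0], clauses[-1])  # prints: 10 year = 2015 year = 2024
```

Each clause could then be passed to `fetch_all_chicago_data(where=...)` and saved to its own file, so a failure partway through costs one year of progress and not the whole run.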

### 4.2 Download Test Data (2025)

In [None]:
# Test data: 2025
test_where = "year = 2025"
df_test = fetch_all_chicago_data(where=test_where)

print(f"Test set size: {len(df_test):,}")

save_to_csv(
    df_test,
    "../data/raw/chicago_crimes_2025_raw.csv"
)

print("Raw data download completed successfully.")

Fetching records 0 to 50000 ...
Fetching records 50000 to 100000 ...
Fetching records 100000 to 150000 ...
Fetching records 150000 to 200000 ...
Fetching records 200000 to 250000 ...
Fetching records 250000 to 300000 ...
Test set size: 236,292
Saved raw data to data/raw/chicago_crimes_2025_raw.csv
Raw data download completed successfully.
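
A cheap sanity check after a download is to compare the number of rows received with a server-side count for the same filter. The sketch below only builds the query parameters and parses a response; it makes no network call. `count_query_params` and `parse_count` are hypothetical helpers, the `total` alias is chosen here for illustration, and the response shape (a one-element list with the count as a string) is an assumption to verify against the live API.

```python
from typing import Dict, List


def count_query_params(where: str) -> Dict[str, str]:
    """Hypothetical helper: SoQL parameters asking for a row count only."""
    return {"$select": "count(*) AS total", "$where": where}


def parse_count(payload: List[Dict[str, str]]) -> int:
    """Read the count from a response assumed to look like [{"total": "123"}]."""
    return int(payload[0]["total"])


params = count_query_params("year = 2025")

# Simulated response for illustration; a real one would come from
# requests.get(BASE_API_URL, params=params).json()
simulated_payload = [{"total": "236292"}]
server_count = parse_count(simulated_payload)

downloaded = 236292  # len(df_test) as reported in the output above
print(server_count == downloaded)
```

Because the dataset is updated continuously, a small difference between the two numbers would not necessarily indicate a paging error.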
