# Chicago Crime Data Preprocessing

This notebook performs data cleaning and preprocessing on raw Chicago crime data.

**Purpose:** 
- Remove redundant columns
- Convert date formats
- Handle missing values (drop rows missing `latitude`, `longitude`, or `district`)
- Save cleaned data for analysis

**Input Files:**
- Training data: `chicago_crimes_2015_2024_raw.csv`
- Test data: `chicago_crimes_2025_raw.csv`

**Output Files:**
- Training data: `chicago_crimes_2015_2024_cleaned.csv`
- Test data: `chicago_crimes_2025_cleaned.csv`

## 1. Environment Setup and Configuration

In [1]:
import pandas as pd
import os

# Configuration
RAW_DATA_PATH = "../data/raw/"
PROCESSED_DATA_PATH = "../data/processed/"

# Columns to remove from the raw dataset
COLS_TO_DROP = [
    'id', 'case_number', 'block', 'iucr', 'fbi_code', 
    'x_coordinate', 'y_coordinate', 'location', 'updated_on'
]

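# Data overview
# Note: a notebook cell renders only its last bare expression, so the output
# below comes from df.info() (which prints directly) and df.isnull().sum().
# Wrap df.head() and df.describe() in display(...) to see them as well.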
df = pd.read_csv(os.path.join(RAW_DATA_PATH, "chicago_crimes_2015_2024_raw.csv"))
df.head()
df.info()
df.describe()
df.isnull().sum()


<class 'pandas.DataFrame'>
RangeIndex: 2519504 entries, 0 to 2519503
Data columns (total 22 columns):
 #   Column                Dtype  
---  ------                -----  
 0   id                    int64  
 1   case_number           str    
 2   date                  str    
 3   block                 str    
 4   iucr                  str    
 5   primary_type          str    
 6   description           str    
 7   location_description  str    
 8   arrest                bool   
 9   domestic              bool   
 10  beat                  int64  
 11  district              float64
 12  ward                  float64
 13  community_area        float64
 14  fbi_code              str    
 15  year                  int64  
 16  updated_on            str    
 17  x_coordinate          float64
 18  y_coordinate          float64
 19  latitude              float64
 20  longitude             float64
 21  location              str    
dtypes: bool(2), float64(7), int64(3), str(10)
memory usag

id                          0
case_number                 0
date                        0
block                       0
iucr                        0
primary_type                0
description                 0
location_description    12591
arrest                      0
domestic                    0
beat                        0
district                    1
ward                       55
community_area            176
fbi_code                    0
year                        0
updated_on                  0
x_coordinate            42228
y_coordinate            42228
latitude                42228
longitude               42228
location                42228
dtype: int64
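
The raw counts above are easier to judge as a share of all rows: 42,228 missing coordinates out of 2,519,504 entries is about 1.7%. The following is a minimal sketch of such a summary, run on a small made-up frame rather than the real file; the helper name `missing_summary` is hypothetical and not part of the notebook.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values per column, largest first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    # Keep only columns that actually have missing values
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame standing in for the raw data (values are made up for illustration)
toy = pd.DataFrame({
    "primary_type": ["THEFT", "BATTERY", "ASSAULT", "THEFT"],
    "latitude": [41.88, None, 41.90, None],
    "longitude": [-87.63, None, -87.65, None],
    "district": [1.0, 2.0, None, 4.0],
})

print(missing_summary(toy))
```

Applied to the real frame, the same helper could be called as `missing_summary(df)`.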

## 2. Preprocessing Function

In [2]:
def preprocess_crime_data(input_filename: str, output_filename: str):
    """
    Load raw CSV, drop redundant columns, and perform basic cleaning.

    Args:
        input_filename: Name of the input file in RAW_DATA_PATH
        output_filename: Name of the output file in PROCESSED_DATA_PATH
    """
    input_path = os.path.join(RAW_DATA_PATH, input_filename)
    output_path = os.path.join(PROCESSED_DATA_PATH, output_filename)

    if not os.path.exists(input_path):
        print(f"Error: {input_path} not found!")
        return

    print(f"Loading data from {input_filename}...")
    df = pd.read_csv(input_path)
    
    # 1. Drop redundant columns
    # errors='ignore' keeps the call from failing if some columns are already missing,
    # so report only the columns that were actually present
    dropped = [col for col in COLS_TO_DROP if col in df.columns]
    df.drop(columns=COLS_TO_DROP, inplace=True, errors='ignore')
    print(f"Dropped redundant columns: {dropped}")

    # 2. Handle date format (recommended to do this in preprocessing)
    # Convert string to real datetime objects for later feature extraction (weekday, hour, etc.)
    if 'date' in df.columns:
        print("Converting 'date' column to datetime objects...")
        df['date'] = pd.to_datetime(df['date'])

    # 3. Handle missing values (optional but recommended)
    # Rows with missing latitude/longitude usually cannot be used for spatial analysis;
    # rows with a missing district are dropped as well
    initial_len = len(df)
    df.dropna(subset=['latitude', 'longitude', 'district'], inplace=True)
    print(f"Removed {initial_len - len(df)} rows with missing critical values.")

    # 4. Save cleaned data
    os.makedirs(PROCESSED_DATA_PATH, exist_ok=True)
    df.to_csv(output_path, index=False)
    print(f"Successfully saved cleaned data to: {output_path}")
    print("-" * 30)
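
The date conversion in step 2 is what makes later feature extraction (weekday, hour, and so on) possible. Here is a minimal sketch of that extraction, assuming ISO-like timestamp strings; the actual format in the raw file is not shown in this notebook, and the column names `hour`, `weekday`, and `month` are illustrative.

```python
import pandas as pd

# Made-up timestamps; the ISO-like format is an assumption about the raw data
toy = pd.DataFrame({"date": ["2024-01-15 13:45:00", "2024-07-04 23:10:00"]})

# Same conversion as in preprocess_crime_data
toy["date"] = pd.to_datetime(toy["date"])

# The .dt accessor only works once the column holds real datetimes
toy["hour"] = toy["date"].dt.hour
toy["weekday"] = toy["date"].dt.dayofweek  # Monday = 0, Sunday = 6
toy["month"] = toy["date"].dt.month

print(toy)
```

If the raw format is known, passing it explicitly through the `format` argument of `pd.to_datetime` may speed up parsing on a file of this size and avoids ambiguous day/month interpretation.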

## 3. Data Preprocessing Execution

### 3.1 Preprocess Training Data (2015–2024)

In [3]:
print("Starting data preprocessing...")

preprocess_crime_data(
    "chicago_crimes_2015_2024_raw.csv", 
    "chicago_crimes_2015_2024_cleaned.csv"
)

Starting data preprocessing...
Loading data from chicago_crimes_2015_2024_raw.csv...
Dropped redundant columns: ['id', 'case_number', 'block', 'iucr', 'fbi_code', 'x_coordinate', 'y_coordinate', 'location', 'updated_on']
Converting 'date' column to datetime objects...
Removed 42229 rows with missing critical values.
Successfully saved cleaned data to: ../data/processed/chicago_crimes_2015_2024_cleaned.csv
------------------------------
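
The 42,229 removed rows are one more than the 42,228 rows missing coordinates in the overview. That would be consistent with the single missing `district` value sitting on a row that does have coordinates, although the output alone does not confirm it. The sketch below uses made-up data to show how `dropna(subset=...)` removes every row that is missing any of the listed columns, so the total is the union of the affected rows rather than the sum of the per-column counts.

```python
import pandas as pd

# Toy data: two rows lack coordinates, a different row lacks district
toy = pd.DataFrame({
    "latitude": [41.88, None, 41.90, None, 41.75],
    "longitude": [-87.63, None, -87.65, None, -87.60],
    "district": [1.0, 2.0, None, 4.0, 5.0],
})

critical = ["latitude", "longitude", "district"]

# Missing values counted column by column
per_column = toy[critical].isnull().sum()

# Rows with at least one missing critical value (what dropna will remove)
rows_dropped = int(toy[critical].isnull().any(axis=1).sum())

kept = toy.dropna(subset=critical)

print(per_column.to_dict())
print(f"Rows dropped: {rows_dropped}, rows kept: {len(kept)}")
```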


### 3.2 Preprocess Test Data (2025)

In [4]:
preprocess_crime_data(
    "chicago_crimes_2025_raw.csv", 
    "chicago_crimes_2025_cleaned.csv"
)

print("Preprocessing completed!")

Loading data from chicago_crimes_2025_raw.csv...
Dropped redundant columns: ['id', 'case_number', 'block', 'iucr', 'fbi_code', 'x_coordinate', 'y_coordinate', 'location', 'updated_on']
Converting 'date' column to datetime objects...
Removed 64 rows with missing critical values.
Successfully saved cleaned data to: ../data/processed/chicago_crimes_2025_cleaned.csv
------------------------------
Preprocessing completed!
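
One thing to keep in mind when loading the cleaned files later: CSV stores no dtype information, so the converted `date` column comes back as plain text unless it is parsed again. This is a minimal sketch of the round trip that uses an in-memory buffer and made-up rows instead of the real cleaned file.

```python
import io

import pandas as pd

# Stand-in for a cleaned frame (values are made up for illustration)
cleaned = pd.DataFrame({
    "date": pd.to_datetime(["2025-01-01 00:30:00", "2025-01-02 18:00:00"]),
    "primary_type": ["THEFT", "BATTERY"],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)

# Reading back without parse_dates leaves the column as text
buffer.seek(0)
plain = pd.read_csv(buffer)

# parse_dates restores the datetime dtype on load
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["date"])

print(plain["date"].dtype)
print(parsed["date"].dtype)
```

For the real output, the equivalent call would be `pd.read_csv("../data/processed/chicago_crimes_2015_2024_cleaned.csv", parse_dates=["date"])`.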
