### Data Cleaning: Handling Missing Values

**Decision:** Remove all rows containing missing values.

**Justification:**
* **Low Impact:** The rows with missing data represent approximately **2.6%** of the total dataset. Removing such a small fraction is unlikely to bias the results materially, provided the missingness is not strongly related to the target.
* **Critical Spatial Data:** Most missing values fall in the five location columns (`X Coordinate`, `Y Coordinate`, `Latitude`, `Longitude`, and `Location`), each of which is missing in 6,974 rows. Since the `Block` addresses are redacted for privacy (e.g., `015XX`), accurate imputation of exact coordinates is impossible. Without these coordinates, the rows are unusable for spatial analysis.
* **Data Integrity:** Removing these rows ensures a 100% complete dataset, preventing errors in downstream machine learning models (e.g., Logistic Regression) that cannot handle `NaN` values.
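
The first claim above (a small share of affected rows, with little risk of bias) can be checked directly before dropping anything. The sketch below runs on a small made-up frame, not the project data: the column names mirror the dataset, but the values and the helper name `missing_row_report` are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def missing_row_report(df: pd.DataFrame, target: str) -> dict:
    """Summarise how many rows contain at least one NaN and how the
    target rate differs between complete and incomplete rows."""
    has_missing = df.isnull().any(axis=1)
    return {
        "pct_rows_with_missing": has_missing.mean() * 100,
        "target_rate_complete": df.loc[~has_missing, target].mean(),
        "target_rate_incomplete": df.loc[has_missing, target].mean(),
    }


# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "Arrest": [True, False, False, True, False, False, False, True, False, False],
    "Latitude": [41.75, 41.95, np.nan, 41.91, 41.88, 41.80, np.nan, 41.77, 41.90, 41.85],
    "Ward": [17.0, 44.0, 50.0, 37.0, np.nan, 3.0, 20.0, 8.0, 27.0, 4.0],
})

report = missing_row_report(toy, target="Arrest")
print(report)
```

If the two target rates differ markedly on the real data, dropping the incomplete rows shifts the class balance, and the low-impact argument deserves a second look.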

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


df = pd.read_csv('../data/raw_data.csv')

# Count of missing values per column
print(df.isnull().sum())
print("==" * 40)

# Percentage of missing values per column
print((df.isnull().sum() / len(df)) * 100)
print("==" * 40)

# Drop all rows that have at least one missing value
df_clean = df.dropna()

# Verify that the data is now clean
print("Missing values after dropping:")
print(df_clean.isnull().sum().sum())
print("==" * 40)

# Check how many rows are left
print(f"New data shape: {df_clean.shape}")



ID                         0
Case Number                0
Date                       0
Block                      0
IUCR                       0
Primary Type               0
Description                0
Location Description     613
Arrest                     0
Domestic                   0
Beat                       0
District                   0
Ward                       2
Community Area            13
FBI Code                   0
X Coordinate            6974
Y Coordinate            6974
Year                       0
Updated On                 0
Latitude                6974
Longitude               6974
Location                6974
dtype: int64
ID                      0.000000
Case Number             0.000000
Date                    0.000000
Block                   0.000000
IUCR                    0.000000
Primary Type            0.000000
Description             0.000000
Location Description    0.231424
Arrest                  0.000000
Domestic                0.000000
Beat               

### Fix Data Types 

In [5]:
# Convert 'Date' to datetime objects
df_clean = df_clean.copy()
df_clean['Date'] = pd.to_datetime(df_clean['Date'], format='%m/%d/%Y %I:%M:%S %p')

# Ensure categorical columns are strings 
categorical_cols = ['Primary Type', 'Description', 'Location Description']
for col in categorical_cols:
    df_clean[col] = df_clean[col].astype(str)

print("Data types fixed.")
df_clean.info()



Data types fixed.
<class 'pandas.core.frame.DataFrame'>
Index: 257752 entries, 1 to 264880
Data columns (total 22 columns):
 #   Column                Non-Null Count   Dtype         
---  ------                --------------   -----         
 0   ID                    257752 non-null  int64         
 1   Case Number           257752 non-null  object        
 2   Date                  257752 non-null  datetime64[ns]
 3   Block                 257752 non-null  object        
 4   IUCR                  257752 non-null  object        
 5   Primary Type          257752 non-null  object        
 6   Description           257752 non-null  object        
 7   Location Description  257752 non-null  object        
 8   Arrest                257752 non-null  bool          
 9   Domestic              257752 non-null  bool          
 10  Beat                  257752 non-null  int64         
 11  District              257752 non-null  int64         
 12  Ward                  257752 non-null  float6
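
With an explicit `format`, `pd.to_datetime` raises an error at the first string that does not match. A hedged variant is to pass `errors="coerce"` and count the resulting `NaT` values, which shows how many rows fail to parse instead of stopping at the first one. The sample strings below are made up for illustration; only the format string comes from the cell above.

```python
import pandas as pd

# Hypothetical raw values in the same layout as the 'Date' column
raw = pd.Series([
    "12/31/2015 11:59:00 PM",
    "01/01/2015 12:00:00 AM",
    "not a date",  # deliberately malformed row
])

# Unparseable strings become NaT instead of raising an exception
parsed = pd.to_datetime(raw, format="%m/%d/%Y %I:%M:%S %p", errors="coerce")
n_bad = int(parsed.isna().sum())

print(parsed)
print(f"Unparseable rows: {n_bad}")
```

Because `%I` with `%p` is a 12-hour clock, `12:00:00 AM` maps to hour 0, which is worth confirming before deriving an `Hour` feature.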

### Feature Engineering

In [6]:
# Extract Basic Time Components
df_clean['Hour'] = df_clean['Date'].dt.hour
df_clean['Day_of_Week'] = df_clean['Date'].dt.dayofweek  # 0=Monday, 6=Sunday
df_clean['Month'] = df_clean['Date'].dt.month

# Identify Weekends (1 if Sat/Sun, else 0)
# Day 5 is Saturday, Day 6 is Sunday
df_clean['Is_Weekend'] = (df_clean['Day_of_Week'] >= 5).astype(int)

# Map Months to Seasons
def get_season(month):
    if month in [12, 1, 2]:
        return 'Winter'
    elif month in [3, 4, 5]:
        return 'Spring'
    elif month in [6, 7, 8]:
        return 'Summer'
    else:
        return 'Fall'

df_clean['Season'] = df_clean['Month'].apply(get_season)

print("Feature Engineering Complete. New columns added:")
print(df_clean[['Date', 'Hour', 'Day_of_Week', 'Is_Weekend', 'Season']].head())

Feature Engineering Complete. New columns added:
                 Date  Hour  Day_of_Week  Is_Weekend  Season
1 2015-12-31 23:59:00    23            3           0  Winter
2 2015-12-31 23:55:00    23            3           0  Winter
3 2015-12-31 23:50:00    23            3           0  Winter
4 2015-12-31 23:50:00    23            3           0  Winter
5 2015-12-31 23:45:00    23            3           0  Winter
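
The month-to-season rule can also be written as a lookup table and applied with `Series.map`, which avoids calling a Python function once per row. This is a sketch of an equivalent formulation on made-up dates; the name `SEASON_BY_MONTH` is an assumption, and the mapping copies the rule in `get_season`.

```python
import pandas as pd

# Same month-to-season rule as get_season, expressed as a dictionary
SEASON_BY_MONTH = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Fall", 10: "Fall", 11: "Fall",
}

# Hypothetical timestamps, one per season
dates = pd.Series(pd.to_datetime([
    "2015-12-31 23:59:00",
    "2015-07-04 14:30:00",
    "2015-04-12 08:15:00",
    "2015-10-05 19:00:00",
]))

season = dates.dt.month.map(SEASON_BY_MONTH)
print(season.tolist())
```

A month missing from the dictionary would come back as `NaN`, so a quick `season.isna().sum()` is a cheap sanity check.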


### Outlier Detection using IQR

In [7]:
def remove_outliers_iqr(df, column):
    # Calculate Q1 (25th percentile) and Q3 (75th percentile)
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    
    # Define bounds (1.5 times the IQR is the standard rule)
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    # Filter data to keep only values within bounds
    df_filtered = df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]
    return df_filtered

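# Apply to Latitude and Longitude to drop coordinates far from the bulk of the data.
# This removes likely GPS errors, but it can also remove valid points near the edges of the city.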
original_shape = df_clean.shape
df_clean = remove_outliers_iqr(df_clean, 'Latitude')
df_clean = remove_outliers_iqr(df_clean, 'Longitude')

print(f"Outliers removed. Shape changed from {original_shape} to {df_clean.shape}")

Outliers removed. Shape changed from (257752, 27) to (256553, 27)
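
To make the 1.5 × IQR rule concrete, the sketch below computes the fences on a handful of made-up latitudes and prints them before filtering. The helper name `iqr_bounds` and the values are assumptions for illustration, not taken from the dataset.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple:
    """Return the (lower, upper) fences of the k * IQR rule."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy latitudes (hypothetical): seven plausible values and one gross error
lat = pd.Series([41.70, 41.75, 41.80, 41.85, 41.90, 41.95, 42.00, 36.62])

lower, upper = iqr_bounds(lat)
# between() is inclusive on both ends, matching the >= and <= filter above
kept = lat[lat.between(lower, upper)]

print(f"Fences: [{lower:.3f}, {upper:.3f}]")
print(f"Kept {len(kept)} of {len(lat)} values")
```

Printing the fences first makes it easier to judge whether the rule is cutting real errors or simply trimming genuine locations that lie far from the densest part of the data.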


### Feature Engineering: Aggregating Crime Counts

In [8]:
# Strategy: Calculate "Crime Intensity" for each Community Area at each Hour.
# This answers: "How dangerous is this specific area at this specific time of day usually?"

# 1. Group by Area and Hour to count crimes
crime_intensity = df_clean.groupby(['Community Area', 'Hour']).size().reset_index(name='Area_Hour_Crime_Count')

# 2. Merge this count back into the main dataframe
# how='left' ensures we keep all original rows and just add the new info
df_clean = df_clean.merge(crime_intensity, on=['Community Area', 'Hour'], how='left')

# 3. Create a broader "Seasonal Risk" by Area and Season
seasonal_risk = df_clean.groupby(['Community Area', 'Season']).size().reset_index(name='Area_Season_Crime_Count')
df_clean = df_clean.merge(seasonal_risk, on=['Community Area', 'Season'], how='left')

print("Aggregated features created. Preview:")
print(df_clean[['Community Area', 'Hour', 'Area_Hour_Crime_Count', 'Season', 'Area_Season_Crime_Count']].head())

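# Optional: uncomment to write the master clean file.
# The next cell reads '../data/crime_data_clean.csv', so this must have been run at least once.
# import os
#
# # Create the folder if it doesn't exist
# os.makedirs('../data', exist_ok=True)
#
# # Save the master clean file
# master_output_path = '../data/crime_data_clean.csv'
# df_clean.to_csv(master_output_path, index=False)
# print(f"Master dataset saved to {master_output_path}")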

Aggregated features created. Preview:
   Community Area  Hour  Area_Hour_Crime_Count  Season  \
0            68.0    23                    351  Winter   
1            45.0    23                     53  Winter   
2             2.0    23                    131  Winter   
3             6.0    23                    255  Winter   
4            25.0    23                    767  Winter   

   Area_Season_Crime_Count  
0                     1416  
1                      303  
2                      728  
3                     1084  
4                     3708  
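
The group-then-merge pattern above has a one-step equivalent: `groupby(...).transform("size")` returns the group size aligned to the original rows. The sketch below compares the two on a made-up frame; the values and the non-default index are assumptions chosen to show one practical difference between the approaches.

```python
import pandas as pd

# Toy data (hypothetical), with a non-default index on purpose
toy = pd.DataFrame(
    {
        "Community Area": [25.0, 25.0, 25.0, 68.0, 68.0, 2.0],
        "Hour": [23, 23, 8, 23, 23, 23],
    },
    index=[10, 11, 12, 13, 14, 15],
)
keys = ["Community Area", "Hour"]

# Merge-based version, as in the cell above
counts = toy.groupby(keys).size().reset_index(name="Area_Hour_Crime_Count")
via_merge = toy.merge(counts, on=keys, how="left")

# transform-based version: same counts, original index preserved
via_transform = toy.copy()
via_transform["Area_Hour_Crime_Count"] = (
    via_transform.groupby(keys)["Hour"].transform("size")
)

print(via_merge)
print(via_transform)
```

Both produce the same counts, but the merge returns a fresh 0-based index while `transform` keeps the existing one, which matters if later steps rely on the original row labels.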


In [9]:
# Read the saved master file back from disk
df1 = pd.read_csv('../data/crime_data_clean.csv')

print(df1.head())

         ID Case Number                 Date                     Block  IUCR  \
0  10365064    HZ100370  2015-12-31 23:59:00       075XX S EMERALD AVE  1320   
1  10364662    HZ100006  2015-12-31 23:55:00  079XX S STONY ISLAND AVE  0430   
2  10364740    HZ100010  2015-12-31 23:50:00         024XX W FARGO AVE  0820   
3  10364683    HZ100002  2015-12-31 23:50:00          037XX N CLARK ST  0460   
4  10366580    HZ102701  2015-12-31 23:45:00        050XX W CONCORD PL  1310   

      Primary Type                    Description Location Description  \
0  CRIMINAL DAMAGE                     TO VEHICLE               STREET   
1          BATTERY  AGGRAVATED: OTHER DANG WEAPON               STREET   
2            THEFT                 $500 AND UNDER            APARTMENT   
3          BATTERY                         SIMPLE             SIDEWALK   
4  CRIMINAL DAMAGE                    TO PROPERTY            APARTMENT   

   Arrest  Domestic  ...   Latitude  Longitude  \
0   False     False  ...
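
One thing to keep in mind when reloading from CSV is that the format stores no dtypes, so a datetime column comes back as text unless `read_csv` is asked to parse it. The sketch below demonstrates this with an in-memory buffer and made-up rows; nothing here touches the project files.

```python
import io

import pandas as pd

# Hypothetical two-row frame with a proper datetime column
original = pd.DataFrame({
    "Date": pd.to_datetime(["2015-12-31 23:59:00", "2015-12-31 23:55:00"]),
    "Arrest": [False, True],
})

# Write to an in-memory CSV instead of a file on disk
buffer = io.StringIO()
original.to_csv(buffer, index=False)

buffer.seek(0)
plain = pd.read_csv(buffer)  # 'Date' is read back as text

buffer.seek(0)
restored = pd.read_csv(buffer, parse_dates=["Date"])  # 'Date' is parsed again

print(plain["Date"].dtype)
print(restored["Date"].dtype)
```

If the reloaded frame is used for anything that relies on `.dt` accessors, passing `parse_dates` avoids repeating the conversion step.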

### Step 1: Define Features

In [10]:
# 1. Define the base list of relevant columns
# We drop identifiers and redundant fields such as 'ID', 'Case Number', 'Date', 'Block',
# 'IUCR', 'Description', and 'Location'; any other column not listed below is excluded as well.
# We keep: categorical context, time info, coordinates, and the new aggregated counts
selected_features = [
    'Arrest',                  # Target
    'Primary Type',            # Feature (Categorical)
    'Location Description',    # Feature (Categorical)
    'Domestic',                # Feature (Boolean)
    'Hour',                    # Feature (Num)
    'Day_of_Week',             # Feature (Num)
    'Month',                   # Feature (Num)
    'Is_Weekend',              # Feature (Num)
    'Season',                  # Feature (Categorical)
    'Latitude',                # Feature (Num)
    'Longitude',               # Feature (Num)
    'Area_Hour_Crime_Count',   # Feature (Num - Aggregated)
    'Area_Season_Crime_Count'  # Feature (Num - Aggregated)
]

# Create a subset dataframe
df_model_base = df_clean[selected_features].copy()

# Fix Boolean to Int (True/False -> 1/0)
df_model_base['Arrest'] = df_model_base['Arrest'].astype(int)
df_model_base['Domestic'] = df_model_base['Domestic'].astype(int)

### Step 2: Create Dataframe for Tree Models (Random Forest, LGBM, XGB)

In [11]:
from sklearn.preprocessing import LabelEncoder

# Create a copy for Tree models
df_tree = df_model_base.copy()

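# Encode the string columns, using a separate encoder per column
# so each fitted mapping is kept and can be inverted later
categorical_cols = ['Primary Type', 'Location Description', 'Season']
label_encoders = {}

for col in categorical_cols:
    le = LabelEncoder()
    df_tree[col] = le.fit_transform(df_tree[col])
    label_encoders[col] = le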

# Save for Random Forest / XGB / LGBM
df_tree.to_csv('../data/data_for_tree_models.csv', index=False)
print("Tree-based data saved (Label Encoded).")
print(df_tree.head())

Tree-based data saved (Label Encoded).
   Arrest  Primary Type  Location Description  Domestic  Hour  Day_of_Week  \
0       0             6                   118         0    23            3   
1       0             2                   118         0    23            3   
2       0            30                    17         0    23            3   
3       1             2                   114         0    23            3   
4       0             6                    17         0    23            3   

   Month  Is_Weekend  Season   Latitude  Longitude  Area_Hour_Crime_Count  \
0     12           0       3  41.757367 -87.642993                    351   
1     12           0       3  41.751270 -87.585822                     53   
2     12           0       3  42.016804 -87.690709                    131   
3     12           0       3  41.949837 -87.658635                    255   
4     12           0       3  41.910470 -87.751597                    767   

   Area_Season_Crime_Count  
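
Integer codes are hard to read back without the mapping that produced them. As a sketch, the example below uses pandas categoricals in place of `LabelEncoder` (a swapped-in technique, not the method used in the cell above) because they expose the code-to-label mapping directly; on string columns both assign codes in sorted order. The toy values and the name `code_maps` are assumptions.

```python
import pandas as pd

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "Primary Type": ["THEFT", "BATTERY", "THEFT", "CRIMINAL DAMAGE"],
    "Season": ["Winter", "Summer", "Fall", "Spring"],
})

code_maps = {}
encoded = toy.copy()
for col in ["Primary Type", "Season"]:
    cat = encoded[col].astype("category")
    # Keep a {code: label} dictionary per column for later decoding
    code_maps[col] = dict(enumerate(cat.cat.categories))
    encoded[col] = cat.cat.codes

print(encoded)
print(code_maps["Season"])
```

Here `Winter` maps to 3 because the four season names sort as Fall, Spring, Summer, Winter, which agrees with the `Season` value of 3 shown for the December rows in the output above.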


### Step 3: Create Dataframe for Logistic Regression

In [12]:
# Create a copy for Linear models
df_log = df_model_base.copy()

# Apply One-Hot Encoding (get_dummies)
# drop_first=True drops the first level of each column to avoid perfect multicollinearity
# (e.g., if it's not Spring, Summer, or Winter, it MUST be Fall, the dropped baseline)
df_log = pd.get_dummies(df_log, columns=['Primary Type', 'Location Description', 'Season'], drop_first=True)

# Note: Logistic Regression usually benefits from feature scaling (e.g., StandardScaler).
# Fit the scaler on the training split only, after the train/test split, to avoid data leakage;
# the dataframe structure is ready for that step now.

# Save for Logistic Regression
df_log.to_csv('../data/data_for_logistic.csv', index=False)
print("Logistic Regression data saved (One-Hot Encoded).")
print(df_log.head())

Logistic Regression data saved (One-Hot Encoded).
   Arrest  Domestic  Hour  Day_of_Week  Month  Is_Weekend   Latitude  \
0       0         0    23            3     12           0  41.757367   
1       0         0    23            3     12           0  41.751270   
2       0         0    23            3     12           0  42.016804   
3       1         0    23            3     12           0  41.949837   
4       0         0    23            3     12           0  41.910470   

   Longitude  Area_Hour_Crime_Count  Area_Season_Crime_Count  ...  \
0 -87.642993                    351                     1416  ...   
1 -87.585822                     53                      303  ...   
2 -87.690709                    131                      728  ...   
3 -87.658635                    255                     1084  ...   
4 -87.751597                    767                     3708  ...   

   Location Description_VEHICLE - DELIVERY TRUCK  \
0                                          False  
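
The two notes in this step (the dropped baseline level, and scaling without leakage) can be illustrated on a tiny made-up frame. The sketch below standardises by hand with pandas so the arithmetic is visible; the positional split and all values are assumptions for illustration only.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "Arrest": [0, 1, 0, 0, 1, 0],
    "Hour": [23, 2, 14, 9, 20, 11],
    "Season": ["Winter", "Summer", "Fall", "Spring", "Winter", "Summer"],
})

# drop_first=True removes the first level in sorted order ('Fall'),
# which then acts as the baseline category
encoded = pd.get_dummies(toy, columns=["Season"], drop_first=True)

# Arbitrary positional split, purely for demonstration
train = encoded.iloc[:4].copy()
test = encoded.iloc[4:].copy()

# Statistics come from the training rows only
mean = train["Hour"].mean()
std = train["Hour"].std(ddof=0)

# The same training statistics are applied to both splits
train["Hour_scaled"] = (train["Hour"] - mean) / std
test["Hour_scaled"] = (test["Hour"] - mean) / std

print(list(encoded.columns))
print(test[["Hour", "Hour_scaled"]])
```

In practice, scikit-learn's `StandardScaler` placed inside a `Pipeline` handles the same bookkeeping, fitting on the training data and reusing those statistics at prediction time.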