# Project #1: Analyze Chicago Crime Dataset

- Bagdat Rakhimov

- Predictive Analytics Diploma Program, PACE, University of Winnipeg

- Applied Statistics for Data Science PA F24

- Hieu Dang, PhD

- March 3, 2025

## 1. Introduction

### 1.1. Background

This dataset includes reported crime incidents in Chicago from 2001 to 2022. Data comes from the Chicago Police Department's CLEAR system.

### 1.2. Objective

Analyze the Chicago Crime Dataset to generate insights and provide recommendations to the government and citizens of the City of Chicago.

### 1.3. Scope

The dataset represents crime incidents from January 1, 2001, to December 5, 2022, providing over 21 years of historical data.

## 2. Data Understanding & Preparation

### 2.1. Dataset Overview

The dataset has the following columns and datatypes:

- **id** (data type: Integer): Unique identifier for each crime incident.

- **case_number** (String): A unique case reference number assigned by the police.

- **date** (Date/Time): The date and time when the crime occurred.

- **block** (String): Approximate address where the crime happened (block-level precision).

- **iucr** (String): Illinois Uniform Crime Reporting (IUCR) code that categorizes the crime.

- **primary_type** (String): The general category of crime (e.g., THEFT, BURGLARY, ASSAULT).

- **description** (String): More detailed crime description (e.g., "THEFT OVER $500").

- **location_description** (String): The type of place where the crime happened (e.g., "Street", "Gas Station").

- **arrest** (Boolean): Whether an arrest was made (TRUE/FALSE).

- **domestic** (Boolean): Indicates if the incident was domestic violence-related (TRUE/FALSE).

- **beat** (Integer): Smallest police patrol unit area.

- **district** (Float): Larger police district where the crime happened.

- **ward** (Float): City council district where the crime occurred.

- **community_area** (Float): Community-level region in Chicago (1-77).

- **fbi_code** (String): FBI classification code for crime types.

- **x_coordinate** (Float): Geographic coordinate for mapping.

- **y_coordinate** (Float): Geographic coordinate for mapping.

- **year** (Integer): Year the incident occurred.

- **updated_on** (Date/Time): Date and time the record was last updated.

- **latitude** (Float): Latitude coordinate of crime location.

- **longitude** (Float): Longitude coordinate of crime location.

- **location** (String): Combined latitude and longitude values.

- **historical_wards_2003-2015** (Float): The City of Chicago's ward boundaries before the 2015 redistricting.

- **zip_codes** (Float): The postal ZIP code where the crime occurred.

- **community_areas** (Float): Community areas (1-77) are predefined geographic regions in Chicago, useful for demographic and economic analysis.

- **census_tracts** (Float): A geographic unit used by the U.S. Census Bureau.

- **wards** (Float): The current (post-2015) ward number where the crime occurred.

- **boundaries_-_zip_codes** (Float): Likely a spatial boundary indicator for ZIP codes, used for geospatial mapping and area-based crime trends.

- **police_districts** (Float): The larger police jurisdiction within which the crime was reported.

- **police_beats** (Float): The smallest geographic police unit, where officers regularly patrol.

### 2.2. Data Source

The link to the City of Chicago dataset:

https://drive.google.com/file/d/13DUbNq-HKRYI2xYSbaEWAt6vwALO-Esq/view

### 2.3. Data Exploration

#### 2.3.1. Tools Preparation

In [None]:
# Importing libraries and packages
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import numpy as np
import zipfile
import gdown
import os
import geopandas as gpd
import folium
from folium.plugins import HeatMap
from folium.features import GeoJsonTooltip
import branca.colormap as cm
from IPython.display import display

#### 2.3.2. Loading Data

In [None]:
# Define the folder dynamically
current_dir = os.getcwd()
data_folder_path = os.path.join(current_dir, 'data')
os.makedirs(data_folder_path, exist_ok=True)

# Define file paths dynamically
destination = os.path.join(data_folder_path, 'chicago_crime_dataset_v2.csv.zip')
csv_file_path = os.path.join(data_folder_path, 'chicago_crime_dataset_v2.csv')

In [None]:
# Download the dataset
file_id = '13DUbNq-HKRYI2xYSbaEWAt6vwALO-Esq'
gdown.download(f'https://drive.google.com/uc?id={file_id}', destination, quiet=False)
print('Download completed!')

In [None]:
# Unzip the dataset
try:
    with zipfile.ZipFile(destination, 'r') as zip_ref:
        zip_ref.extractall(data_folder_path)
        print(f'Files unzipped successfully to {data_folder_path}')
except zipfile.BadZipFile:
    print('Error: Invalid zip file')
except Exception as e:
    print(f'Failed to unzip: {e}')

In [None]:
# Load dataset
crime_data = pd.read_csv(csv_file_path)

In [None]:
# Verify dataset size and memory usage before cleaning
print(f'Dataset shape before cleaning: {crime_data.shape}')
print(crime_data.memory_usage(deep=True).sum(), 'bytes before cleaning')

#### 2.3.3. Initial Inspection

In [None]:
# Check the first few rows
crime_data.head()

In [None]:
# Check the number of rows and columns in the dataset
print(f"Dataset contains {crime_data.shape[0]} rows and {crime_data.shape[1]} columns.")

In [None]:
# Display column names and their data types
crime_data.info()

Observation:

- The dataset contains 7,691,209 rows and 30 columns, covering various data types, including integers (crime IDs, beats, years), floats (coordinates, districts, wards), booleans (arrest, domestic cases), and strings (case numbers, crime descriptions, locations).

In [None]:
# Check for missing values in each column
print(crime_data.isnull().sum())

Observation:

- Several columns have missing values. Some gaps are minimal (e.g., Case Number has only 4 missing values), while others are substantial: Ward and Community Area are missing 614,847 and 613,476 values respectively, which could affect demographic or political analysis. The geospatial columns (X Coordinate, Y Coordinate, Latitude, Longitude, and Location) each have 81,758 missing entries, which may affect mapping and geospatial analysis. Options for handling these gaps include imputation, removal, or filling with default values where appropriate.

| Column Name                   | Missing Values | Description | Comments |
|--------------------------------|---------------|-------------|----------|
| ID                             | 0             | Unique identifier for each crime record. | Keep – essential for tracking crimes. |
| Case Number                    | 4             | Unique case identifier; a small number of missing cases may indicate data entry issues. | Drop – redundant because ID already identifies each record. |
| Date                           | 0             | Date and time when the crime occurred. | Keep – essential for analysis. |
| Block                          | 0             | Block where the crime took place. | Keep – useful for location-based analysis. |
| IUCR                           | 0             | Illinois Uniform Crime Reporting (IUCR) code for the crime type. | Keep – necessary for crime classification. |
| Primary Type                   | 0             | General category of the crime. | Keep – important for crime trends. |
| Description                    | 0             | More specific description of the crime. | Keep – useful for detailed analysis. |
| Location Description           | 9,824         | Location type where the crime happened (e.g., street, residence, school). | Impute missing values with "Unknown". |
| Arrest                         | 0             | Indicates whether an arrest was made (True/False). | Keep – useful for law enforcement insights. |
| Domestic                       | 0             | Indicates whether the crime was domestic-related (True/False). | Keep – relevant for policy-making. |
| Beat                           | 0             | Police beat (smallest geographic unit for policing). | Drop due to lack of necessity. |
| District                       | 47            | Police district where the crime occurred. | Impute missing values with "Unknown". |
| Ward                           | 614,847       | Ward of the city where the crime took place (many missing values). | Drop – too many missing values. |
| Community Area                 | 613,476       | Community area where the crime occurred (similar to Ward, large missing values). | Drop – too many missing values. |
| FBI Code                       | 0             | FBI code representing the crime category. | Drop due to lack of necessity. |
| X Coordinate                   | 81,758        | X-coordinate for mapping crime locations (many missing). | Drop – Latitude and Longitude are kept for mapping. |
| Y Coordinate                   | 81,758        | Y-coordinate for mapping crime locations (many missing). | Drop – Latitude and Longitude are kept for mapping. |
| Year                           | 0             | Year of the crime occurrence. | Keep – necessary for trend analysis. |
| Updated On                     | 0             | Date when the record was last updated. | Drop due to lack of necessity. |
| Latitude                       | 81,758        | Latitude of the crime location (many missing). | Keep - useful for mapping. |
| Longitude                      | 81,758        | Longitude of the crime location (many missing). | Keep - useful for mapping. |
| Location                       | 81,758        | Concatenation of Latitude and Longitude (many missing). | Drop – redundant because Latitude and Longitude are kept. |
| Historical Wards 2003-2015      | 104,135       | Historical ward classification from 2003-2015 (many missing). | Drop – outdated data. |
| Zip Codes                      | 81,758        | ZIP code where the crime took place (many missing). | Drop – too many missing values. |
| Community Areas                | 101,079       | Community area codes, similar to Wards (many missing). | Keep - useful for mapping. |
| Boundaries - ZIP Codes         | 101,027       | Boundaries related to ZIP code mapping (many missing). | Drop – too many missing values. |
| Police Districts               | 99,918        | Police district classification (many missing). | Drop – too many missing values. |
| Police Beats                   | 99,895        | Police beats classification (many missing). | Drop – too many missing values. |
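
The missing-value counts in the table above are easier to compare as percentages of the total row count. The following is a minimal sketch on a small made-up frame (the `toy` data and the `missing_summary` name are illustrative assumptions, not part of the original notebook); the same expression could be applied to `crime_data`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for crime_data (hypothetical values)
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Ward": [np.nan, 12.0, np.nan, 7.0],
    "Latitude": [41.88, np.nan, 41.90, 41.75],
})

# Count and percentage of missing values per column, largest gap first
missing_summary = (
    toy.isnull().sum().to_frame("missing")
    .assign(percent=lambda d: 100 * d["missing"] / len(toy))
    .sort_values("missing", ascending=False)
)
print(missing_summary)
```

Sorting by the missing count puts the columns that need a drop-or-impute decision at the top.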

In [None]:
# Check for duplicate rows in the dataset
duplicate_count = crime_data.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_count}")

Observation:

- The dataset contains no duplicate rows, meaning that each crime record is unique.

In [None]:
# Display summary statistics for numerical columns
crime_data.describe()

Observation:

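- The dataset includes 7,691,209 crime records spanning 2001 to 2022, with a median (50%) year of 2009. A median of 2009 means at least half of the records fall in 2001-2009, the first nine of the 22 calendar years covered, so records are concentrated in the earlier years rather than spread evenly over time.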

In [None]:
# Display summary statistics for categorical columns
crime_data.describe(include='object')

Observation:

- The Case Number column has 7,690,678 unique values out of 7,691,205 records, suggesting that most cases have unique identifiers, but a few may be duplicated or missing.
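
To see which case numbers are repeated rather than only counting unique values, the repeated rows can be listed directly. This is a small sketch on made-up data (the `toy` frame and its values are hypothetical); on the real dataset the column name `"Case Number"` would be the same.

```python
import pandas as pd

# Hypothetical records: "HX2" appears twice and one Case Number is missing
toy = pd.DataFrame({
    "ID": [101, 102, 103, 104, 105],
    "Case Number": ["HX1", "HX2", "HX2", None, "HX3"],
})

# Rows whose non-missing Case Number appears more than once
non_missing = toy.dropna(subset=["Case Number"])
repeated = non_missing[non_missing.duplicated(subset="Case Number", keep=False)]
print(repeated)

n_missing = toy["Case Number"].isna().sum()   # missing identifiers
n_unique = toy["Case Number"].nunique()       # nunique ignores missing values
```

Passing `keep=False` marks every occurrence of a repeated value, so both rows of a duplicated pair are shown.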

In [None]:
# Check unique crime types in the dataset
print("Unique crime types:", crime_data["Primary Type"].nunique())
print(crime_data["Primary Type"].value_counts().head(10))  # Top 10 crime types

Observation:

- There are 36 unique crime types in the dataset, with THEFT (1,622,756 cases) being the most frequent, followed by BATTERY (1,408,218 cases) and CRIMINAL DAMAGE (876,928 cases).

In [None]:
# Check unique location descriptions
print("Unique crime locations:", crime_data["Location Description"].nunique())
print(crime_data["Location Description"].value_counts().head(10))  # Top 10 crime locations

Observation:

- The dataset contains 215 unique crime locations, with STREET (1,999,814 cases) being the most common crime scene, followed by RESIDENCE (1,293,786 cases) and APARTMENTS (861,299 cases).

In [None]:
# Convert date column to datetime format for better analysis
crime_data["Date"] = pd.to_datetime(crime_data["Date"], format="%m/%d/%Y %I:%M:%S %p")

In [None]:
# Check the earliest and latest dates in the dataset
print(f"Date range: {crime_data['Date'].min()} to {crime_data['Date'].max()}")

Observation:

- The dataset spans from January 1, 2001, to December 5, 2022, covering nearly 22 years of crime data. This long timeframe allows for trend analysis, seasonality detection, and historical comparisons.

### 2.4. Data Cleaning

In [None]:
# Drop unnecessary columns
columns_to_drop = [
    "Case Number", "Ward", "Beat", "Community Area", "FBI Code", "X Coordinate", "Y Coordinate", 
    "Updated On", "Location", "Historical Wards 2003-2015", "Zip Codes", 
    "Boundaries - ZIP Codes", "Police Districts", "Police Beats"
]
crime_data = crime_data.drop(columns=columns_to_drop)

# Fill missing values for 'Location Description' with "Unknown"
crime_data["Location Description"] = crime_data["Location Description"].fillna("Unknown")

# Fill missing values for 'District' with "Unknown"
crime_data["District"] = crime_data["District"].fillna("Unknown")

# Display the cleaned dataset info to confirm changes
crime_data.info()

In [None]:
# Verify dataset size and memory usage after cleaning
print(f'Dataset shape: {crime_data.shape}')
print(crime_data.memory_usage(deep=True).sum(), 'bytes')
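
Dropping columns is one way to reduce memory. Another option, not used in this notebook, is to store low-cardinality string columns with the pandas `category` dtype. The sketch below uses a made-up column to illustrate the difference; the row count and values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for a low-cardinality string column such as "Primary Type"
toy = pd.DataFrame({"Primary Type": ["THEFT", "BATTERY", "NARCOTICS", "ASSAULT"] * 25_000})

# Memory used when each value is stored as a separate string
bytes_as_strings = toy["Primary Type"].memory_usage(deep=True)

# Memory used when values are stored as small integer codes plus a lookup table
bytes_as_category = toy["Primary Type"].astype("category").memory_usage(deep=True)

print(f"strings:  {bytes_as_strings:,} bytes")
print(f"category: {bytes_as_category:,} bytes")
```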

### 2.5. Feature Engineering

In [None]:
# Get all unique crime types
unique_crime_types = crime_data["Primary Type"].unique()

# Display the crime types
print("All Unique Crime Types in the Dataset:\n", unique_crime_types)

In [None]:
# Extracting time-based features
crime_data["Year"] = crime_data["Date"].dt.year
crime_data["Month"] = crime_data["Date"].dt.month
crime_data["Day_of_Week"] = crime_data["Date"].dt.day_name()
crime_data["Hour"] = crime_data["Date"].dt.hour

# Creating a weekend flag (1 = weekend, 0 = weekday)
crime_data["Weekend"] = crime_data["Day_of_Week"].isin(["Saturday", "Sunday"]).astype(int)

# Creating a "Time of Day" feature
def get_time_of_day(hour):
    if 5 <= hour < 12:
        return "Morning"
    elif 12 <= hour < 17:
        return "Afternoon"
    elif 17 <= hour < 21:
        return "Evening"
    else:
        return "Night"

crime_data["Time_of_Day"] = crime_data["Hour"].apply(get_time_of_day)

# Categorizing crime severity based on crime type
high_severity = [
    "HOMICIDE", "CRIMINAL SEXUAL ASSAULT", "ROBBERY", "BATTERY", 
    "CRIM SEXUAL ASSAULT", "KIDNAPPING", "HUMAN TRAFFICKING", 
    "STALKING", "INTIMIDATION", "ARSON", "DOMESTIC VIOLENCE"
]

medium_severity = [
    "ASSAULT", "BURGLARY", "MOTOR VEHICLE THEFT", "WEAPONS VIOLATION", 
    "CRIMINAL DAMAGE", "CRIMINAL TRESPASS", "SEX OFFENSE", 
    "OFFENSE INVOLVING CHILDREN", "PROSTITUTION", "PUBLIC PEACE VIOLATION"
]

low_severity = [
    "THEFT", "NARCOTICS", "DECEPTIVE PRACTICE", "GAMBLING", "LIQUOR LAW VIOLATION", 
    "INTERFERENCE WITH PUBLIC OFFICER", "OBSCENITY", "PUBLIC INDECENCY", 
    "OTHER OFFENSE", "OTHER NARCOTIC VIOLATION", "CONCEALED CARRY LICENSE VIOLATION",
    "NON-CRIMINAL", "NON-CRIMINAL (SUBJECT SPECIFIED)", "NON - CRIMINAL", 
    "RITUALISM"
]

def classify_severity(crime):
    if crime in high_severity:
        return "High"
    elif crime in medium_severity:
        return "Medium"
    else:
        return "Low"

crime_data["Crime_Severity"] = crime_data["Primary Type"].apply(classify_severity)

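# Calculating a rolling arrest rate per district
# (mean over a window of up to 100 consecutive records within each district, in row order)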
crime_data["Arrest"] = crime_data["Arrest"].astype(int)
crime_data["Arrest_Rate"] = crime_data.groupby("District")["Arrest"].transform(lambda x: x.rolling(100, min_periods=1).mean())

# Display updated dataset
crime_data.head()

Observation:

- The dataset now contains 23 columns, including additional derived features such as Month, Day_of_Week, Hour, Weekend, Time_of_Day, Crime_Severity, and Arrest_Rate.
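
The `Arrest_Rate` column is computed with `rolling(100, min_periods=1)`, so it varies row by row within a district. If a single overall rate per district is wanted instead, a sketch like the following could be used; the toy data and the column name `District_Arrest_Rate` are hypothetical.

```python
import pandas as pd

# Hypothetical records: district 1 has 1 arrest in 3 incidents, district 2 has 2 in 2
toy = pd.DataFrame({
    "District": [1, 1, 1, 2, 2],
    "Arrest":   [1, 0, 0, 1, 1],
})

# One overall arrest rate per district
district_rate = toy.groupby("District")["Arrest"].mean()
print(district_rate)

# The same rate broadcast back onto every row of its district
toy["District_Arrest_Rate"] = toy.groupby("District")["Arrest"].transform("mean")
```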

## 3. Exploratory Data Analysis (EDA)

### 3.1. Descriptive Statistics

In [None]:
# Display statistics for numerical features
print("Summary Statistics for Numerical Features:\n", crime_data.describe())

Observation:

- The dataset spans 2001 to 2022 with a median year of 2009, so at least half of the records fall in the first nine years of the range; crime records are more concentrated in the earlier years than in the later ones.

In [None]:
# Display statistics for categorical features
print("\nSummary Statistics for Categorical Features:\n", crime_data.describe(include='object'))

In [None]:
# Display crime count per severity category
print("\nCrime Count by Severity:\n", crime_data["Crime_Severity"].value_counts())

### 3.2. Crime Distribution

#### 3.2.1. By Type

In [None]:
# Count of crimes by type
top_crimes = crime_data["Primary Type"].value_counts().head(10)

# Print top 10 crime types and their counts
print("Top 10 Crime Types in Chicago:")
print(top_crimes)

In [None]:
# Count of crimes by severity category
severity_counts = crime_data["Crime_Severity"].value_counts()

# Print crime severity counts
print("Crime Count by Severity Level:")
print(severity_counts)

#### 3.2.2. By Location 

In [None]:
# Count of crimes by location
top_locations = crime_data["Location Description"].value_counts().head(10)

# Print top 10 crime locations
print("Top 10 Crime Locations in Chicago:")
print(top_locations)

In [None]:
# Group by Community Area and count the number of crimes
crime_by_area = crime_data.groupby("Community Areas").size().reset_index(name="Crime Count")

# Map community area numbers to names
community_names = {i: name for i, name in enumerate([
    "Rogers Park", "West Ridge", "Uptown", "Lincoln Square", "North Center", "Lake View",
    "Lincoln Park", "Near North Side", "Edison Park", "Norwood Park", "Jefferson Park",
    "Forest Glen", "North Park", "Albany Park", "Portage Park", "Irving Park", "Dunning",
    "Montclare", "Belmont Cragin", "Hermosa", "Avondale", "Logan Square", "Humboldt Park",
    "West Town", "Austin", "West Garfield Park", "East Garfield Park", "Near West Side",
    "North Lawndale", "South Lawndale", "Lower West Side", "Loop", "Near South Side",
    "Armour Square", "Douglas", "Oakland", "Fuller Park", "Grand Boulevard", "Kenwood",
    "Washington Park", "Hyde Park", "Woodlawn", "South Shore", "Chatham", "Avalon Park",
    "South Chicago", "Burnside", "Calumet Heights", "Roseland", "Pullman", "South Deering",
    "East Side", "West Pullman", "Riverdale", "Hegewisch", "Garfield Ridge", "Archer Heights",
    "Brighton Park", "McKinley Park", "Bridgeport", "New City", "West Elsdon", "Gage Park",
    "Clearing", "West Lawn", "Chicago Lawn", "West Englewood", "Englewood",
    "Greater Grand Crossing", "Ashburn", "Auburn Gresham", "Beverly", "Washington Heights",
    "Mount Greenwood", "Morgan Park", "O'Hare", "Edgewater"
], start=1)}

# Replace Community Area numbers with their corresponding names from the dictionary
crime_by_area["Community Name"] = crime_by_area["Community Areas"].map(community_names)

# Remove rows where there was no matching name in the dictionary
crime_by_area = crime_by_area.dropna(subset=["Community Name"])

# Sort by Crime Count in descending order
crime_by_area_sorted = crime_by_area.sort_values(by="Crime Count", ascending=False)

print("Crime Count by Community Area in Chicago:")
print(crime_by_area_sorted)

#### 3.2.3. By Time

In [None]:
# Count of crimes by year
yearly_trend = crime_data["Year"].value_counts().sort_index()

print("Crime Trend Over the Years in Chicago:")
print(yearly_trend)

In [None]:
# Count of crimes by month
monthly_trend = crime_data["Month"].value_counts().sort_index()

print("Crime Trend by Month:")
print(monthly_trend)

In [None]:
# Count crimes by day of the week using the Day_of_Week feature created in section 2.5
# (reindex keeps the order from Monday to Sunday)
day_order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
day_of_week_trend = crime_data["Day_of_Week"].value_counts().reindex(day_order)

print("Crime Distribution by Day of the Week:")
print(day_of_week_trend)

In [None]:
# Ensure 'Date' is in datetime format
crime_data["Date"] = pd.to_datetime(crime_data["Date"], errors="coerce")

# Extract Hour from 'Date' column
crime_data["Hour"] = crime_data["Date"].dt.hour

# Count crimes by hour
hourly_trend = crime_data["Hour"].value_counts().sort_index()

print("Crime Distribution by Hour:")
print(hourly_trend)

### 3.3. Data Visualization

#### 3.3.1. Bar charts, histograms, heatmaps (crime density)

In [None]:
# Count of crimes by type
top_crimes = crime_data["Primary Type"].value_counts().head(10)

# Plot crime type distribution
plt.figure(figsize=(12, 6))
sb.barplot(x=top_crimes.index, y=top_crimes.values, palette="coolwarm")
plt.xticks(rotation=45)
plt.xlabel("Crime Type")
plt.ylabel("Number of Crimes")
plt.title("Top 10 Crime Types in Chicago")
plt.show()

Observation:

- The bar chart shows the top 10 most common crime types in Chicago, with THEFT being the most frequent (over 1.6 million cases).

In [None]:
# Count of crimes by severity category
severity_counts = crime_data["Crime_Severity"].value_counts()

# Plot crime severity distribution
plt.figure(figsize=(8, 5))
sb.barplot(x=severity_counts.index, y=severity_counts.values, palette="coolwarm")
plt.xlabel("Crime Severity")
plt.ylabel("Number of Crimes")
plt.title("Crime Count by Severity Level")
plt.show()

Observation:

- The bar chart illustrates the distribution of crimes by severity level.

In [None]:
# Count of crimes by location
top_locations = crime_data["Location Description"].value_counts().head(10)

# Plot crime location distribution
plt.figure(figsize=(12, 6))
sb.barplot(x=top_locations.index, y=top_locations.values, palette="viridis")
plt.xticks(rotation=45, fontsize=6)
plt.xlabel("Crime Location")
plt.ylabel("Number of Crimes")
plt.title("Top 10 Crime Locations in Chicago")
plt.show()

Observation:

- The bar chart displays the top 10 crime locations in Chicago, with STREET being the most common crime scene (over 2 million cases).

In [None]:
# Group by Community Area and count the number of crimes
crime_by_area = crime_data.groupby("Community Areas").size().reset_index(name="Crime Count")

# Replace Community Area numbers with their corresponding names from the dictionary
crime_by_area["Community Name"] = crime_by_area["Community Areas"].map(community_names)

# Remove rows where there was no matching name in the dictionary
crime_by_area = crime_by_area.dropna(subset=["Community Name"])

# Sort by Crime Count in descending order
crime_by_area_sorted = crime_by_area.sort_values(by="Crime Count", ascending=False)

# Plot Crime Count by Community Area with a dark brown to light beige gradient
plt.figure(figsize=(12, 6))
sb.barplot(data=crime_by_area_sorted, 
           x="Community Name",
           y="Crime Count", 
           palette="YlOrBr_r")

plt.xticks(rotation=90)
plt.title("Crime Count by Community Area in Chicago")
plt.xlabel("Community Area")
plt.ylabel("Crime Count")

# Show the plot
plt.show()

Observation:

- The bar chart displays crime distribution by community area in Chicago, highlighting areas with the highest and lowest crime counts.

In [None]:
# Count of crimes by month
monthly_trend = crime_data["Month"].value_counts().sort_index()

plt.figure(figsize=(12, 6))
sb.barplot(x=monthly_trend.index, y=monthly_trend.values, palette="magma")
plt.xlabel("Month")
plt.ylabel("Number of Crimes")
plt.title("Crime Trend by Month")
plt.show()

Observation:

- The bar chart illustrates monthly crime trends in Chicago, showing seasonal variations in criminal activity.
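
Raw monthly totals are affected by month length as well as by seasonality, since February has fewer days than January. One way to separate the two, sketched here on made-up data with exactly one incident per day, is to divide each month's count by the number of calendar days observed in that month. The variable names are illustrative assumptions.

```python
import pandas as pd

# Hypothetical data: exactly one incident per day across two full (non-leap) years
dates = pd.Series(pd.date_range("2021-01-01", "2022-12-31", freq="D"))

# Raw incident counts per calendar month (1-12), summed over all years
monthly_counts = dates.dt.month.value_counts().sort_index()

# Number of distinct calendar days observed in each month
days_observed = (
    dates.dt.normalize().drop_duplicates().dt.month.value_counts().sort_index()
)

# Average incidents per day, which removes the month-length effect
daily_average = monthly_counts / days_observed
print(daily_average)
```

In this toy example the raw counts differ between months while the daily average is identical, which is the effect the normalization is meant to remove.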

In [None]:
# Count of crimes by time of day
time_of_day_trend = crime_data["Time_of_Day"].value_counts()

plt.figure(figsize=(8, 5))
sb.barplot(x=time_of_day_trend.index, y=time_of_day_trend.values, palette="Set2")
plt.xlabel("Time of Day")
plt.ylabel("Number of Crimes")
plt.title("Crime Distribution by Time of Day")
plt.show()

Observation:

- The bar chart displays crime distribution by time of day, categorizing incidents into Morning, Afternoon, Evening, and Night.

In [None]:
# Count crimes by day of the week using the Day_of_Week feature created in section 2.5
# (reindex keeps the order from Monday to Sunday)
day_order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
day_of_week_trend = crime_data["Day_of_Week"].value_counts().reindex(day_order)

# Plot Day of the Week Crime Trends
plt.figure(figsize=(10, 5))
day_of_week_trend.plot(kind="bar", color="green")
plt.xlabel("Day of the Week")
plt.ylabel("Number of Crimes")
plt.title("Crime Distribution by Day of the Week")
plt.xticks(rotation=45)
plt.grid(axis="y")
plt.show()

Observation:

- The bar chart shows the crime distribution by day of the week, ordered from Monday to Sunday.

#### 3.3.2. Line graphs (crime trends over time)

In [None]:
# Count of crimes by year
yearly_trend = crime_data["Year"].value_counts().sort_index()

plt.figure(figsize=(12, 6))
sb.lineplot(x=yearly_trend.index, y=yearly_trend.values, marker="o", color="blue")
plt.xlabel("Year")
plt.ylabel("Number of Crimes")
plt.title("Crime Trend Over the Years in Chicago")
plt.grid(True)
plt.show()

Observation:

- The line chart illustrates the overall crime trend in Chicago from 2001 to 2022.

In [None]:
# Ensure 'Date' is in datetime format
crime_data["Date"] = pd.to_datetime(crime_data["Date"], errors="coerce")

# Extract Hour from 'Date' column
crime_data["Hour"] = crime_data["Date"].dt.hour

# Count crimes by hour
hourly_trend = crime_data["Hour"].value_counts().sort_index()

# Plot Crime Distribution by Hour
plt.figure(figsize=(10, 5))
plt.plot(hourly_trend.index, hourly_trend.values, marker="o", linestyle="-", color="red")
plt.xlabel("Hour of the Day")
plt.ylabel("Number of Crimes")
plt.title("Crime Distribution by Hour")
plt.xticks(range(0, 24))
plt.grid()
plt.show()

Observation:

- The line chart demonstrates crime distribution by hour of the day, showing clear peaks and declines in criminal activity.
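
Hourly and day-of-week patterns can also be examined together. The following sketch builds an hour-by-day count table with `pd.crosstab` from a handful of made-up timestamps; the `toy` frame is hypothetical, while the column names mirror the `Hour` and `Day_of_Week` features created earlier.

```python
import pandas as pd

# Hypothetical incident timestamps (2022-01-03 is a Monday)
dates = pd.to_datetime(pd.Series([
    "2022-01-03 08:15", "2022-01-03 22:40", "2022-01-08 22:05",
    "2022-01-09 01:30", "2022-01-10 08:55",
]))
toy = pd.DataFrame({"Day_of_Week": dates.dt.day_name(), "Hour": dates.dt.hour})

# Rows are hours, columns are weekdays, cells are incident counts
hour_by_day = pd.crosstab(toy["Hour"], toy["Day_of_Week"])
print(hour_by_day)
```

A table of this shape could be passed to a heatmap function to show when during the week incidents cluster.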

#### 3.3.3. Maps (geospatial distribution of crime)

In [None]:
# Remove rows with missing latitude or longitude
crime_data_map = crime_data.dropna(subset=["Latitude", "Longitude"])

# Create a folium map centered at the mean latitude & longitude
chicago_map = folium.Map(location=[crime_data_map['Latitude'].mean(), crime_data_map['Longitude'].mean()], zoom_start=11)

# Convert crime data to list format for HeatMap
heat_data = crime_data_map[['Latitude', 'Longitude']].values.tolist()

# Add heatmap layer
HeatMap(heat_data, radius=7).add_to(chicago_map)

# Display map in the notebook
display(chicago_map)

In [None]:
# Load Chicago Community Area boundaries (GeoJSON)
geojson_url = "https://data.cityofchicago.org/api/geospatial/igwz-8jzy?method=export&format=GeoJSON"
community_areas = gpd.read_file(geojson_url)

# Ensure correct column name exists
if "area_numbe" in community_areas.columns:
    community_areas.rename(columns={"area_numbe": "Community Area"}, inplace=True)
elif "area_number" in community_areas.columns:
    community_areas.rename(columns={"area_number": "Community Area"}, inplace=True)

# Group by Community Area and count the number of crimes
crime_by_area = crime_data.groupby("Community Areas").size().reset_index(name="Crime Count")

# Convert Community Area to integer
community_areas["Community Area"] = community_areas["Community Area"].fillna(-1).astype(int)

# Map community area numbers to names
community_names = {i: name for i, name in enumerate([
    "Rogers Park", "West Ridge", "Uptown", "Lincoln Square", "North Center", "Lake View",
    "Lincoln Park", "Near North Side", "Edison Park", "Norwood Park", "Jefferson Park",
    "Forest Glen", "North Park", "Albany Park", "Portage Park", "Irving Park", "Dunning",
    "Montclare", "Belmont Cragin", "Hermosa", "Avondale", "Logan Square", "Humboldt Park",
    "West Town", "Austin", "West Garfield Park", "East Garfield Park", "Near West Side",
    "North Lawndale", "South Lawndale", "Lower West Side", "Loop", "Near South Side",
    "Armour Square", "Douglas", "Oakland", "Fuller Park", "Grand Boulevard", "Kenwood",
    "Washington Park", "Hyde Park", "Woodlawn", "South Shore", "Chatham", "Avalon Park",
    "South Chicago", "Burnside", "Calumet Heights", "Roseland", "Pullman", "South Deering",
    "East Side", "West Pullman", "Riverdale", "Hegewisch", "Garfield Ridge", "Archer Heights",
    "Brighton Park", "McKinley Park", "Bridgeport", "New City", "West Elsdon", "Gage Park",
    "Clearing", "West Lawn", "Chicago Lawn", "West Englewood", "Englewood",
    "Greater Grand Crossing", "Ashburn", "Auburn Gresham", "Beverly", "Washington Heights",
    "Mount Greenwood", "Morgan Park", "O'Hare", "Edgewater"
], start=1)}

# Replace Community Area numbers with their corresponding names
crime_by_area["Community Name"] = crime_by_area["Community Areas"].map(community_names)

# Convert 'Community Areas' to integer type so it matches 'Community Area' in the boundary data
crime_by_area["Community Areas"] = crime_by_area["Community Areas"].astype(int)

# Merge crime data with community area boundaries
crime_map_data = community_areas.merge(
    crime_by_area, 
    left_on="Community Area", 
    right_on="Community Areas", 
    how="left"
)

# Fill missing crime counts with 0
crime_map_data["Crime Count"] = crime_map_data["Crime Count"].fillna(0)

# Create a color scale from light yellow to red based on crime count
max_crime = crime_map_data["Crime Count"].max()
min_crime = crime_map_data["Crime Count"].min()
colormap = cm.linear.YlOrRd_09.scale(min_crime, max_crime) 

# Create a Folium Map centered around Chicago
chicago_map = folium.Map(location=[41.8781, -87.6298], zoom_start=10)

# Add community areas to the map with gradient coloring and interactive tooltips
folium.GeoJson(
    crime_map_data,
    name="Community Areas",
    style_function=lambda feature: {
        "fillColor": colormap(feature["properties"]["Crime Count"]),
        "color": "black",
        "weight": 1,
        "fillOpacity": 0.7
    },
    tooltip=GeoJsonTooltip(
        fields=["Community Name", "Crime Count"],
        aliases=["Community Area:", "Total Crimes:"],
        localize=True,
        sticky=True,
        labels=True
    )
).add_to(chicago_map)

# Add color legend to the map
colormap.caption = "Crime Count by Community Area"
chicago_map.add_child(colormap)

# Display the interactive map in the Jupyter Notebook
display(chicago_map)

## 4. Statistical Analysis

### 4.1. Correlation Analysis

In [None]:
# Selecting only numeric columns for correlation analysis
numeric_data = crime_data.select_dtypes(include=['number'])

# Creating a heatmap of numeric correlations
plt.figure(figsize=(12, 8))
sb.heatmap(numeric_data.corr(), annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Heatmap of Crime Data")
plt.show()

Observation:

- The correlation heatmap visualizes the relationships between numerical features in the crime dataset.
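
Each cell of the heatmap is a Pearson correlation coefficient, which is the default method of `DataFrame.corr()`. As a small illustration on made-up values (the `toy` frame is hypothetical), the coefficient can be computed by hand and compared with the pandas result.

```python
import numpy as np
import pandas as pd

# Hypothetical values for two numeric features
toy = pd.DataFrame({"Hour": [0, 6, 12, 18, 23], "Arrest": [0, 0, 1, 1, 1]})

# Pearson r: covariance of the centred variables divided by the product of their spreads
x = toy["Hour"] - toy["Hour"].mean()
y = toy["Arrest"] - toy["Arrest"].mean()
manual_r = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

# pandas computes the same quantity
pandas_r = toy["Hour"].corr(toy["Arrest"])
print(manual_r, pandas_r)
```

The coefficient measures only linear association and always lies between -1 and 1.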

### 4.2. Regression Analysis

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Aggregate crimes per year
crime_trend = crime_data.groupby("Year").size().reset_index(name="Crime_Count")

# Define X and y
X = crime_trend[["Year"]]
y = crime_trend["Crime_Count"]

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict crime for future years (2023-2025)
future_years = pd.DataFrame({"Year": [2023, 2024, 2025]})
future_crime_predictions = model.predict(future_years)
future_crime_df = pd.DataFrame({"Year": [2023, 2024, 2025], "Crime_Count": future_crime_predictions})

print("Predicted crime counts for 2023-2025:", future_crime_predictions)

# Evaluate model performance
y_pred = model.predict(X_test)
print(f"Mean Absolute Error: {mean_absolute_error(y_test, y_pred):.2f}")
print(f"R-squared Score: {r2_score(y_test, y_pred):.2f}")

# Append predicted data to crime trend
total_crime_trend = pd.concat([crime_trend, future_crime_df], ignore_index=True)

print(total_crime_trend)

# Plot crime count per year
plt.figure(figsize=(12, 6))
sns.barplot(x=total_crime_trend["Year"], y=total_crime_trend["Crime_Count"], palette="Blues")
plt.xlabel("Year")
plt.ylabel("Number of Crimes")
plt.title("Number of Crimes by Year in Chicago (Including Predictions)")
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

Observation:

- Using a linear regression model fitted on yearly totals, the predicted number of crimes for 2025 is approximately 150,107.
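Because the goal is to forecast future years, a chronological hold-out (fit on earlier years, test on the most recent ones) may give a more realistic error estimate than a random split. This is only a sketch of that alternative: the yearly counts below are synthetic, and the choice of four test years is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Synthetic yearly counts with a downward trend (hypothetical, not the real figures)
years = np.arange(2001, 2023)
rng = np.random.default_rng(0)
counts = 480_000 - 12_000 * (years - 2001) + rng.normal(0, 8_000, size=years.size)
trend = pd.DataFrame({"Year": years, "Crime_Count": counts})

# Chronological split: the last n_test years are held out for evaluation
n_test = 4  # assumed hold-out size
train, test = trend.iloc[:-n_test], trend.iloc[-n_test:]

model = LinearRegression().fit(train[["Year"]], train["Crime_Count"])
holdout_mae = mean_absolute_error(test["Crime_Count"], model.predict(test[["Year"]]))
slope_per_year = model.coef_[0]

print(f"Estimated change per year: {slope_per_year:,.0f}")
print(f"MAE on the last {n_test} years: {holdout_mae:,.0f}")
```

With `crime_trend` from the cell above in place of the synthetic `trend`, the same split would test the model only on years it has not seen and that come after the training period.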

## 5. Key Findings & Insights

In [None]:
# Resample data to count crimes per year
crime_data["Date"] = pd.to_datetime(crime_data["Date"])
crime_data.set_index("Date", inplace=True)

# Resample yearly and compute rolling averages
crime_trend = crime_data.resample("Y").size()  # pandas 2.2+ deprecates "Y" in favor of "YE"
crime_trend_rolling = crime_trend.rolling(window=5).mean()  # 5-year rolling average

# Plot crime trend over the years
plt.figure(figsize=(12, 6))
plt.plot(crime_trend.index, crime_trend.values, label="Annual Crime Count", color="blue", marker="o")
plt.plot(crime_trend_rolling.index, crime_trend_rolling.values, label="5-Year Rolling Average", color="red", linestyle="dashed")

plt.xlabel("Year")
plt.ylabel("Number of Crimes")
plt.title("Crime Trend Over the Years (with Rolling Average)")
plt.legend()
plt.grid()
plt.show()

Observation:

- The blue line represents the annual crime count, showing a steady decline in crime from 2001 to 2022.

- The red dashed line represents the 5-year rolling average, which smooths out year-to-year fluctuations.

- Both lines show that the number of reported crimes in Chicago has been decreasing.
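The rolling average only produces a value once a full 5-year window is available, which is why the dashed line starts later than the annual line. A small sketch with made-up yearly counts illustrates this; `min_periods=1` is shown as one possible alternative, not as what the notebook uses.

```python
import pandas as pd

# Made-up yearly counts for illustration only
yearly = pd.Series([100, 90, 95, 80, 70, 60, 65], index=range(2001, 2008))

# Same call as in the notebook: the first four years have no full window, so they are NaN
rolling = yearly.rolling(window=5).mean()

# Alternative: average over however many years are available so far
rolling_partial = yearly.rolling(window=5, min_periods=1).mean()

print(pd.DataFrame({"count": yearly, "rolling": rolling, "partial": rolling_partial}))
```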

In [None]:
# Group by year and check crime trends
pre_pandemic = crime_data[crime_data["Year"] < 2020]["Year"].value_counts().sort_index()
post_pandemic = crime_data[crime_data["Year"] >= 2020]["Year"].value_counts().sort_index()

# Plot crime trends before and after 2020
plt.figure(figsize=(10,5))
plt.plot(pre_pandemic.index, pre_pandemic.values, marker='o', label="Before COVID-19")
plt.plot(post_pandemic.index, post_pandemic.values, marker='o', label="After COVID-19", linestyle="dashed")
plt.xlabel("Year")
plt.ylabel("Number of Crimes")
plt.title("Crime Trends Before and After COVID-19")
plt.legend()
plt.show()

Observation:

- The chart compares crime trends before and after COVID-19.

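- The number of reported crimes in Chicago is markedly lower from 2020 onward. The timing is consistent with reduced activity under COVID-19 social isolation, although the chart alone cannot establish the cause.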
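The size of the drop can be expressed as a year-over-year percentage change. The sketch below uses made-up counts. It also shows a naive way to scale a partial year, which is relevant because the dataset ends on December 5, 2022; the scaling assumes crimes are spread evenly across the year, which ignores seasonality.

```python
import pandas as pd

# Made-up yearly counts for illustration only
yearly = pd.Series({2018: 1000, 2019: 980, 2020: 800, 2021: 780}, name="crimes")

# Percentage change relative to the previous year (the first year has no predecessor)
yoy_pct = yearly.pct_change() * 100
print(yoy_pct.round(1))

# Naive annualization of a partial year (assumes a uniform daily rate)
days_observed = pd.Timestamp("2022-12-05").dayofyear
partial_count = 700  # hypothetical count for the partial year
annualized = partial_count * 365 / days_observed
print(f"Days observed: {days_observed}, annualized estimate: {annualized:.0f}")
```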

In [None]:
# Group by Year and calculate arrest percentage
crime_trends = crime_data.groupby("Year").agg({"Arrest": "mean", "ID": "count"}).rename(columns={"ID": "Total Crimes"})

# Plot Crime Trends vs. Arrest Rate
fig, ax1 = plt.subplots(figsize=(10, 5))

ax1.set_xlabel("Year")
ax1.set_ylabel("Total Crimes", color="red")
ax1.plot(crime_trends.index, crime_trends["Total Crimes"], color="red", marker="o", label="Total Crimes")
ax1.tick_params(axis="y", labelcolor="red")

ax2 = ax1.twinx()
ax2.set_ylabel("Arrest Rate", color="blue")
ax2.plot(crime_trends.index, crime_trends["Arrest"], color="blue", marker="s", linestyle="dashed", label="Arrest Rate")
ax2.tick_params(axis="y", labelcolor="blue")

fig.suptitle("Total Crimes vs. Arrest Rate Over Time")
fig.legend(loc="upper left")
plt.show()

Observation:

- The chart compares total crimes (red line) and arrest rates (blue line) over time.

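- Plotting both series on the same timeline makes it easy to see whether they move together. The chart shows association only; it does not by itself demonstrate that the arrest rate drives the total number of crimes.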
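One way to quantify how closely the two series move together is a correlation coefficient. Two series that both trend downward will correlate strongly in levels even if their year-to-year movements are unrelated, so the sketch below also correlates the yearly changes. The numbers are made up and stand in for `crime_trends`.

```python
import pandas as pd

# Made-up yearly totals and arrest rates for illustration only
trends = pd.DataFrame(
    {
        "Total Crimes": [500, 480, 450, 430, 400, 380],
        "Arrest": [0.30, 0.29, 0.27, 0.27, 0.25, 0.22],
    },
    index=range(2001, 2007),
)

# Correlation of the raw levels (dominated by the shared downward trend)
level_corr = trends["Total Crimes"].corr(trends["Arrest"])

# Correlation of year-over-year changes (removes the shared trend)
diff_corr = trends["Total Crimes"].diff().corr(trends["Arrest"].diff())

print(f"Level correlation: {level_corr:.2f}")
print(f"Correlation of yearly changes: {diff_corr:.2f}")
```

A large gap between the two values would suggest that the apparent relationship comes mainly from both series trending in the same direction.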

In [None]:
# Load Socio-Economic Data
income_data = pd.read_csv("https://data.cityofchicago.org/api/views/iqnk-2tcu/rows.csv")

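# Merge Socio-Economic Data with Crime Data
# Note: the join key and the column names used below are assumed to match the downloaded CSV;
# check income_data.columns and rename the columns if they differ.
crime_demographics = crime_by_area.merge(income_data, left_on="Community Areas", right_on="Community Area", how="left")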

# Plot: Crime Count vs. Per Capita Income
plt.figure(figsize=(8, 5))
plt.scatter(crime_demographics["Per Capita Income"], crime_demographics["Crime Count"], alpha=0.6)
plt.xlabel("Per Capita Income")
plt.ylabel("Number of Crimes")
plt.title("Crime vs. Income Level in Chicago")
plt.grid(True)
plt.show()

Observation:

- The scatter plot examines the relationship between per capita income and crime levels across different areas in Chicago.

- Community areas with a per capita income under $40,000 demonstrate higher crime counts than higher-income areas.
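The visual impression from the scatter plot can be checked numerically by comparing the two income groups and computing a rank correlation. This is a sketch with made-up values; the column names mirror the plot above, and the 40,000 threshold is taken from the observation.

```python
import numpy as np
import pandas as pd

# Made-up community-area values for illustration only
areas = pd.DataFrame({
    "Per Capita Income": [15_000, 22_000, 35_000, 41_000, 60_000, 88_000],
    "Crime Count": [9_000, 7_500, 8_000, 3_000, 2_500, 1_200],
})

# Label each area by income group and compare median crime counts
areas["Income Group"] = np.where(areas["Per Capita Income"] < 40_000, "under 40k", "40k and over")
median_by_group = areas.groupby("Income Group")["Crime Count"].median()

# Spearman rank correlation, computed as the Pearson correlation of the ranks
spearman_rho = areas["Per Capita Income"].rank().corr(areas["Crime Count"].rank())

print(median_by_group)
print(f"Spearman rho: {spearman_rho:.2f}")
```

Since these are raw counts, a more populous area can show more crimes regardless of income; dividing by population, if that figure is available, would give a rate that is easier to compare across areas.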

## 6. Inferences/Suggestions

### 6.1. Increase Police Presence in High-Crime Areas

- Focus patrols in West Garfield Park, Englewood, and Austin, where crime is highest.

- Boost security in streets, sidewalks, and parking lots, common crime locations.

- Strengthen policing in community areas to improve security.

### 6.2. Target Crime During Peak Times

- Increase patrols from 6 PM - 12 AM, when crime is most frequent.

- Deploy extra security in summer months (June–August).

- Improve surveillance and prevention in business districts.

### 6.3. Use Data-Driven Policing

- Implement real-time crime mapping and predictive analytics.

## 7. Conclusion

A mix of stronger policing, community involvement, and data-driven strategies can help reduce crime and improve public safety in the City of Chicago.