# <center>Chicago Crime Analysis 2022</center>

<img src="https://i.imgur.com/wEZQMIb.jpeg" alt="Crime-image" style="width:100%;">

## Dataset Overview
- ID – Unique identifier for each crime record.
- Case Number – The police case number associated with the crime.
- Date – The date and time when the crime occurred.
- Block – The location block where the crime took place.
- IUCR – Illinois Uniform Crime Reporting (IUCR) code for the crime.
- Primary Type – The category of crime (e.g., THEFT, BATTERY, ASSAULT).
- Description – A more specific description of the crime type.
- Location Description – Where the crime occurred (e.g., STREET, APARTMENT).
- Arrest – Indicates whether an arrest was made (True/False).
- Domestic – Whether the crime was domestic-related (True/False).
- Beat – A specific police patrol area.
- District – The larger police district in which the crime occurred.
- Ward – The political ward where the crime was reported.
- Community Area – The community designation for the crime location.
- FBI Code – FBI classification code for the type of offense.
- X Coordinate, Y Coordinate – Geospatial coordinates in the city system.
- Latitude, Longitude – GPS coordinates for crime mapping.
- Updated On – The date when the crime record was last updated.

## Importing The Required Libraries

In [None]:
#import the required libraries
import numpy as np 
import pandas as pd 
import seaborn as sns 
import matplotlib.ticker as mtick  
import matplotlib.pyplot as plt
from tabulate import tabulate
import plotly.express as px
import calendar
import folium
from folium.plugins import HeatMap
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

## Data Loading & Overview

In [None]:
# Loading the original dataset
main_df = pd.read_csv('crime prediction in chicago.csv')

In [None]:
# Making a copy of our original dataset
copy_df = main_df.copy()

In [None]:
# Dataset Overview
copy_df.head().T

In [None]:
# Shape of our dataset
copy_df.shape

There are 239,558 rows and 22 columns in our dataset.

In [None]:
# Dataset Info
copy_df.info()

In [None]:
# Statistical Overview
copy_df.describe()

## <center>Data Cleaning & Preprocessing</center>

## Null Values Handling

In [None]:
# Checking null values
copy_df.isnull().sum()

In [None]:
# Visual Representations of Checking Null Values
missing = pd.DataFrame((copy_df.isnull().sum())*100/copy_df.shape[0]).reset_index()
plt.figure(figsize=(16,5))
ax = sns.pointplot(data = missing, x="index", y=0)
plt.xticks(rotation =90,fontsize =7)
plt.title("Percentage of Missing values")
plt.ylabel("PERCENTAGE")
plt.show()

About 0.5% of the values are missing in Location Description, and about 2% are missing in each of X Coordinate, Y Coordinate, Latitude, Longitude, and Location. Ward has only 10 missing values, which is close to 0% of the rows.
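
The percentages read off the plot can also be tabulated directly. The sketch below is a minimal illustration on a hypothetical toy frame (`toy` and the helper name `missing_percentages` are assumptions, not part of the notebook); the same helper could be called on `copy_df`.

```python
import numpy as np
import pandas as pd

def missing_percentages(df: pd.DataFrame) -> pd.Series:
    """Return the percentage of null values per column, largest first.

    Columns without any nulls are left out of the report.
    """
    pct = df.isnull().mean() * 100
    return pct[pct > 0].sort_values(ascending=False)

# Hypothetical toy frame standing in for copy_df
toy = pd.DataFrame({
    "Ward": [1.0, 2.0, np.nan, 4.0],
    "Latitude": [41.8, np.nan, np.nan, 41.9],
    "Arrest": [True, False, False, True],
})

report = missing_percentages(toy)
print(report.round(2))
```

Using `isnull().mean()` divides by the total number of rows, so the result is the share of all records rather than the share of non-null records.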

### Handling Location Description Data

In [None]:
# Filtered out the null values
null_location_description = copy_df[copy_df['Location Description'].isnull()]

In [None]:
null_location_description.head().T

In [None]:
# Unique crime descriptions among rows with a missing location description
null_location_description['Description'].unique()

In [None]:
# Unique primary crime types among rows with a missing location description
null_location_description['Primary Type'].unique()

In [None]:
crime_types = [
    'DECEPTIVE PRACTICE', 'BATTERY', 'CRIMINAL SEXUAL ASSAULT', 
    'BURGLARY', 'THEFT', 'OTHER OFFENSE', 'ROBBERY'
]

# Filtering the dataset
filtered_df = copy_df.query("`Primary Type` in @crime_types")

# Display the filtered dataset
filtered_df.head().T

In [None]:
# Value Counts
filtered_df['Location Description'].value_counts().head(10)

In [None]:
# Mode of location description
Location_Description_Mode = filtered_df['Location Description'].mode()[0]
Location_Description_Mode

In [None]:
# Filling the null values with the mode
copy_df['Location Description'] = copy_df['Location Description'].fillna(Location_Description_Mode)

In [None]:
# Checking Null values in Location Description
copy_df['Location Description'].isnull().sum()

I first filtered the dataset where the "Location Description" column has missing values. Then, I identified the unique crime types from this filtered data. Next, I retrieved the corresponding "Location Description" values for these crime types from the main dataset. Finally, I determined the most frequently occurring (mode) "Location Description" and used it to fill the missing values.
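
The steps described above can be sketched as one small function. This is an illustrative sketch on hypothetical toy data, not the notebook's exact code: the helper name `fill_with_mode_of_affected_types` is an assumption, and it derives the affected crime types automatically instead of hard-coding the list.

```python
import numpy as np
import pandas as pd

def fill_with_mode_of_affected_types(df, target="Location Description", by="Primary Type"):
    """Fill nulls in `target` with the mode of `target` among the `by`
    categories that have at least one missing `target` value."""
    # Crime types that appear in rows where the target is missing
    affected = df.loc[df[target].isnull(), by].unique()
    # Most frequent target value among all rows of those crime types
    mode_value = df.loc[df[by].isin(affected), target].mode()[0]
    out = df.copy()
    out[target] = out[target].fillna(mode_value)
    return out, mode_value

# Hypothetical toy data: only THEFT has a missing location description
toy = pd.DataFrame({
    "Primary Type": ["THEFT", "THEFT", "THEFT", "BATTERY", "NARCOTICS", "NARCOTICS"],
    "Location Description": ["STREET", "STREET", np.nan, "APARTMENT", "ALLEY", "ALLEY"],
})

filled, mode_value = fill_with_mode_of_affected_types(toy)
print(mode_value)
print(filled)
```

Restricting the mode to the affected crime types matters in this toy example: over the whole frame STREET and ALLEY are tied, while among THEFT rows STREET is the clear mode.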

### Handling Ward

In [None]:
# Percentage of missing values in the 'Ward' column (share of all rows)
copy_df['Ward'].isnull().mean() * 100

About 0.004% of the values in the Ward column are missing, so we can safely drop those rows.

In [None]:
copy_df.dropna(subset=['Ward'], inplace=True)

In [None]:
# Checking Null values in Ward
copy_df['Ward'].isnull().sum()

### Handling Other Columns which contain Null Values

The X Coordinate, Y Coordinate, and Location columns are not needed for this analysis, so we drop them.

In [None]:
copy_df.drop(['X Coordinate', 'Y Coordinate', 'Location'], axis=1, inplace=True)

In [None]:
# Checking Null Values
copy_df.isnull().sum()

Only about 2% of the values are missing in Latitude and Longitude. Coordinates cannot be sensibly replaced with a substitute value, so we drop these rows from the dataset.

In [None]:
# Drop the rows with null values in Latitude or Longitude
copy_df.dropna(subset=['Latitude', 'Longitude'], inplace=True)

In [None]:
# Checking Null Values for each columns
copy_df.isnull().sum()

No column contains null values any more, so the null value handling is done.

## Handling Duplicate Values

In our dataset, 'ID' is the primary key, so we check whether any ID occurs more than once. If duplicates occur, we keep the first record and remove the others.

In [None]:
copy_df.drop_duplicates(subset=['ID'], keep='first', inplace=True)
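
Before dropping, it can help to count how many duplicate IDs actually exist. The following is a minimal sketch on a hypothetical toy frame (`toy`, `n_dupes`, and `deduped` are illustrative names, not from the notebook).

```python
import pandas as pd

# Hypothetical toy data with two repeated IDs
toy = pd.DataFrame({
    "ID": [101, 102, 102, 103, 101],
    "Primary Type": ["THEFT", "BATTERY", "BATTERY", "ROBBERY", "THEFT"],
})

# duplicated() marks every repeat after the first occurrence of an ID
n_dupes = int(toy.duplicated(subset=["ID"], keep="first").sum())
print(f"Duplicate IDs found: {n_dupes}")

# Same call as in the notebook, but returning a new frame instead of inplace
deduped = toy.drop_duplicates(subset=["ID"], keep="first")
print(deduped)
```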

## Dropping Unnecessary Columns I

In [None]:
copy_df.info()

In [None]:
copy_df.drop(['ID', 'Case Number', 'IUCR', 'FBI Code'], axis=1, inplace=True)

In [None]:
copy_df.info()

## Feature Extraction

### Handling Datetime

In [None]:
copy_df['Date']

In [None]:
copy_df['Crime_Day']=copy_df['Date'].str.split('/').str[1]
copy_df['Crime_Month']=copy_df['Date'].str.split('/').str[0]
copy_df['Crime_hour'] = copy_df['Date'].str.split('/').str[2].str.split(' ').str[1].str.split(':').str[0]
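
If the raw `Date` strings use a 12-hour clock with an AM/PM suffix (an assumption worth checking against the output of `copy_df['Date']`), string splitting keeps only the 12-hour number, so 11 AM and 11 PM would both come out as 11. A hedged alternative is to parse the column with `pd.to_datetime`. The sample rows and the format string below are assumptions for illustration, and `Crime_Weekday` is an extra column that the notebook does not create.

```python
import pandas as pd

# Hypothetical sample rows; the format string is an assumption about the CSV
sample = pd.DataFrame({
    "Date": ["01/15/2022 11:30:00 PM", "07/04/2022 12:05:00 AM"],
})

# %I is the 12-hour clock hour, %p is the AM/PM marker
parsed = pd.to_datetime(sample["Date"], format="%m/%d/%Y %I:%M:%S %p")

sample["Crime_Day"] = parsed.dt.day            # day of the month (1-31)
sample["Crime_Month"] = parsed.dt.month        # month number (1-12)
sample["Crime_hour"] = parsed.dt.hour          # hour on a 24-hour clock (0-23)
sample["Crime_Weekday"] = parsed.dt.day_name() # e.g. "Monday"

print(sample)
```

The datetime accessors return integers directly, so no later `pd.to_numeric` conversion would be needed for these columns.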

In [None]:
copy_df.head().T

In [None]:
copy_df['Updated_Day']=copy_df['Updated On'].str.split('/').str[1]
copy_df['Updated_Month']=copy_df['Updated On'].str.split('/').str[0]
copy_df['Updated_Year']=copy_df['Updated On'].str.split('/').str[2].str.split(' ').str[0]

In [None]:
copy_df.head().T

In [None]:
# Drop the 'Date' and 'Updated On' columns
copy_df.drop(['Date', 'Updated On'], axis=1, inplace=True)

In [None]:
# Data Type Checking
copy_df.info()

### Handling Block Column

In [None]:
# Unique values of Block column
copy_df['Block'].unique()

In [None]:
copy_df['Block_Code'] = copy_df['Block'].str.split(' ').str[0]
copy_df['Direction'] = copy_df['Block'].str.split(' ').str[1]
copy_df['Street_Name'] = copy_df['Block'].str.split(' ').str[2] + " " + copy_df['Block'].str.split(' ').str[3]
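
Taking only the third and fourth tokens truncates longer street names and yields a null when a block has just three tokens. A possible alternative, sketched here on hypothetical block strings (the exact block format in the CSV is an assumption), is to split at most twice so that everything after the direction stays in the street name.

```python
import pandas as pd

# Hypothetical block strings in an assumed "<code> <direction> <street>" layout
toy = pd.DataFrame({
    "Block": [
        "001XX N STATE ST",
        "045XX S DR MARTIN LUTHER KING JR DR",
        "0000X W TERMINAL",
    ]
})

# n=2 limits the split to two cuts, giving exactly three parts per row here
parts = toy["Block"].str.split(" ", n=2, expand=True)
parts.columns = ["Block_Code", "Direction", "Street_Name"]

toy = toy.join(parts)
print(toy)
```

This sketch assumes every block has at least three space-separated tokens; rows with fewer would still need a fill value for the missing parts.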

In [None]:
copy_df.head().T

In [None]:
copy_df.drop(['Block'], axis=1, inplace=True)

In [None]:
# Info
copy_df.info()

### Data Type Handling

In [None]:
# Converting object to int
copy_df['Crime_Month'] = pd.to_numeric(copy_df['Crime_Month'])
copy_df['Crime_Day'] = pd.to_numeric(copy_df['Crime_Day'])
copy_df['Crime_hour'] = pd.to_numeric(copy_df['Crime_hour'])
copy_df['Updated_Day'] = pd.to_numeric(copy_df['Updated_Day'])
copy_df['Updated_Month'] = pd.to_numeric(copy_df['Updated_Month'])
copy_df['Updated_Year'] = pd.to_numeric(copy_df['Updated_Year'])

# Converting Float to Int
copy_df['Ward'] = copy_df['Ward'].astype(int)

In [None]:
copy_df.info()

## Dropping Unnecessary Columns II

In [None]:
copy_df.drop(['Year'], axis=1, inplace=True)

## Checking Null Values II

In [None]:
# Checking Null
copy_df.isnull().sum()

In [None]:
# Filling null values (assignment avoids chained in-place modification)
copy_df['Street_Name'] = copy_df['Street_Name'].fillna('No Street Name')

In [None]:
# Checking Null
copy_df.isnull().sum()

## Clean Data Overview

In [None]:
# Dataset Information
copy_df.info()

In [None]:
# Clean Dataset Shape
copy_df.shape

In [None]:
# Statistical Overview of Clean Data
copy_df.describe().T

In [None]:
# Clean Data Overview
copy_df.head().T

## <center>Deep Dive into Analysis</center>

## Univariate Analysis

In [None]:
# Plot crime type distribution
plt.figure(figsize=(12, 8))
sns.countplot(y=copy_df["Primary Type"], order=copy_df["Primary Type"].value_counts().index, palette="viridis")
plt.title("Crime Type Distribution")
plt.xlabel("Count")
plt.ylabel("Crime Type")
plt.show()

In [None]:
# Plot Top 15 Crime Descriptions
plt.figure(figsize=(14, 6))
top_descriptions = copy_df["Description"].value_counts().index[:15]  # Select top 15 descriptions
sns.countplot(y=copy_df[copy_df["Description"].isin(top_descriptions)]["Description"], 
              order=top_descriptions, palette="viridis")
plt.title("Top 15 Crime Description")
plt.xlabel("Count")
plt.ylabel("Crime Description")
plt.show()

In [None]:
# Plot Top 15 Crime Location
plt.figure(figsize=(14, 6))
top_locations = copy_df["Location Description"].value_counts().index[:15]  # Select top 15 locations
sns.countplot(y=copy_df[copy_df["Location Description"].isin(top_locations)]["Location Description"], 
              order=top_locations, palette="viridis")
plt.title("Top 15 Crime Locations")
plt.xlabel("Count")
plt.ylabel("Location Description")
plt.show()

In [None]:
# Arrest
plt.figure(figsize=(6, 6))
copy_df["Arrest"].value_counts().plot.pie(autopct="%1.1f%%", colors=["lightblue", "salmon"], startangle=90, explode=[0, 0.1])
plt.title("Proportion of Crimes with Arrests")
plt.ylabel("")  
plt.show()

In [None]:
# Domestic Crime
plt.figure(figsize=(6, 6))
copy_df["Domestic"].value_counts().plot.pie(autopct="%1.1f%%", colors=["#FF6F61", "#EAC435"] , startangle=90, explode=[0, 0.1])
plt.title("Proportion of Domestic Crimes")
plt.ylabel("") 
plt.show()

In [None]:
# Top 15 Crime-Prone Beats
plt.figure(figsize=(14, 6))
top_beats = copy_df["Beat"].value_counts().head(15)  # Top 15 beats with the most crimes
sns.barplot(x=top_beats.index, y=top_beats.values, palette="viridis", order=top_beats.index)
plt.title("Top 15 Crime-Prone Beats")
plt.xlabel("Beat")
plt.ylabel("Crime Count")
plt.xticks(rotation=45)
plt.show()

In [None]:
# District, Ward, Community Area
copy_df["location_group"] = copy_df[["District", "Ward", "Community Area"]].astype(str).agg('-'.join, axis=1)

# Count occurrences of each unique combination
location_counts = copy_df["location_group"].value_counts().head(15)  # Top 15

# Plot the data
plt.figure(figsize=(14, 6))
sns.barplot(x=location_counts.index, y=location_counts.values, palette="coolwarm")
plt.xticks(rotation=45)
plt.xlabel("District-Ward-Community Area")
plt.ylabel("Number of Crimes")
plt.title("Top 15 Crime-Prone Location Combinations")
plt.show()

In [None]:
# Block, Direction, Street
copy_df["location_block_group"] = copy_df[["Block_Code", "Direction", "Street_Name"]].astype(str).agg('-'.join, axis=1)

# Count occurrences of each unique combination
location_counts = copy_df["location_block_group"].value_counts().head(15)  # Top 15

# Plot the data
plt.figure(figsize=(14, 6))
sns.barplot(x=location_counts.index, y=location_counts.values, palette="coolwarm")
plt.xticks(rotation=45)
plt.xlabel("Block-Direction-Street")
plt.ylabel("Number of Crimes")
plt.title("Top 15 Crime-Prone Blocks")
plt.show()

In [None]:
# Get the top 15 crime-prone location combinations
top_locations = copy_df["location_group"].value_counts().head(15).index

# Filter dataset for only these top locations
filtered_df = copy_df[copy_df["location_group"].isin(top_locations)][["location_group", "Block_Code", "Direction", "Street_Name"]]

# Grouping
filtered_location_df = filtered_df.groupby(['location_group', 'Block_Code', 'Direction', 'Street_Name']).size().reset_index(name='Crime Counts')

# Extracting District, Ward, Community Area
filtered_location_df['District']=filtered_location_df['location_group'].str.split('-').str[0]
filtered_location_df['Ward']=filtered_location_df['location_group'].str.split('-').str[1]
filtered_location_df['Community Area']=filtered_location_df['location_group'].str.split('-').str[2]

# Drop location_group column
filtered_location_df.drop(['location_group'], axis=1, inplace=True)

# reordering columns
filtered_location_df = filtered_location_df.iloc[:, [4, 5, 6, 0, 1, 2, 3]]

# Sorting and creating new dataframe
top_crime_locations = filtered_location_df.sort_values(by=['Crime Counts'], ascending=False).head(15)

# Add a serial number column
top_crime_locations['Serial No'] = range(1, len(top_crime_locations) + 1)

# Reordering Columns
top_crime_locations = top_crime_locations[['Serial No', 'District', 'Ward', 'Community Area', 'Block_Code',
                                          'Direction', 'Street_Name', 'Crime Counts']]

# Tabular Format Visualization
table = tabulate(top_crime_locations, headers = 'keys', tablefmt = 'pretty', showindex=False)

# Title of our table
title = "Top 15 Crime Areas"
title_centered = title.center(len(table.splitlines()[0]))

# Show Table
print(title_centered)
print(table)

In [None]:
# Crime distribution by Hour
plt.figure(figsize=(12, 6))
sns.countplot(x=copy_df["Crime_hour"], palette="viridis")

plt.xlabel("Hour of the Day", fontsize=12)
plt.ylabel("Number of Crimes", fontsize=12)
plt.title("Crime Distribution by Hour", fontsize=14)

plt.xticks(range(0, 24))
plt.show()

In [None]:
# Crime distribution by day of the month
plt.figure(figsize=(16, 7))
sns.countplot(x=copy_df["Crime_Day"], palette="viridis")

plt.xlabel("Day of the Month", fontsize=12)
plt.ylabel("Number of Crimes", fontsize=12)
plt.title("Crime Distribution by Day of the Month", fontsize=14)
plt.show()

In [None]:
# Crime Distribution by Month
plt.figure(figsize=(10, 5))
sns.countplot(x=copy_df["Crime_Month"], palette="magma")

plt.xlabel("Month", fontsize=12)
plt.ylabel("Number of Crimes", fontsize=12)
plt.title("Crime Distribution by Month", fontsize=14)

month_labels = [calendar.month_abbr[i] for i in range(1, 13)]
plt.xticks(ticks=range(12), labels=month_labels)
plt.show()
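
Relabelling the ticks with `range(12)` assumes that every month from January to December appears in the data. A hedged way to make that explicit is to count per month and reindex against all twelve month numbers, as in this sketch on a hypothetical toy series (`toy_months` and `monthly` are illustrative names).

```python
import calendar
import pandas as pd

# Hypothetical month numbers; February and most other months are absent
toy_months = pd.Series([1, 1, 3, 12, 3, 3], name="Crime_Month")

# Count per month, then force all 12 months to exist with a count of 0
monthly = toy_months.value_counts().reindex(range(1, 13), fill_value=0)

# Replace month numbers with abbreviated names for plotting
monthly.index = [calendar.month_abbr[m] for m in monthly.index]

print(monthly)
```

A bar plot of `monthly` then shows an empty slot for a month without records instead of silently shifting the labels.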

## Bivariate Analysis

In [None]:
# Arrest vs Top Crime Locations
# Filter once so that the y and hue values come from the same rows
top_location_df = copy_df[copy_df["location_group"].isin(top_locations)]
plt.figure(figsize=(14, 6))
sns.countplot(data=top_location_df, y="location_group", hue="Arrest",
              order=top_locations, palette="viridis")
plt.title("Arrests vs. Top 15 Crime Locations")
plt.xlabel("Count")
plt.ylabel("District-Ward-Community Area")
plt.legend(title="Arrest Made")
plt.show()
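
Raw counts make it hard to compare locations with very different crime volumes. One possible complement, sketched on hypothetical toy data (the location labels and the `arrest_rate` frame are illustrative), is the arrest rate per location group.

```python
import pandas as pd

# Hypothetical toy data; labels follow the District-Ward-Community Area pattern
toy = pd.DataFrame({
    "location_group": ["11-28-25", "11-28-25", "11-28-25", "11-28-25",
                       "1-42-32", "1-42-32"],
    "Arrest": [True, False, False, False, True, True],
})

# The mean of a boolean column is the share of True values
arrest_rate = (
    toy.groupby("location_group")["Arrest"]
       .agg(crimes="size", arrest_rate="mean")
       .sort_values("arrest_rate", ascending=False)
)
arrest_rate["arrest_rate"] = arrest_rate["arrest_rate"] * 100  # as a percentage

print(arrest_rate)
```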

In [None]:
# Crime Type vs Location
pivot_table = copy_df[copy_df["location_group"].isin(top_locations)].pivot_table(index="location_group", 
                                                                             columns="Primary Type", 
                                                                             aggfunc="size", 
                                                                             fill_value=0)
plt.figure(figsize=(14, 8))
sns.heatmap(pivot_table, cmap="viridis", linewidths=0.5)
plt.title("Crime Heatmap: Type vs. Location")
plt.xlabel("Crime Type")
plt.ylabel("District-Ward-Community Area")
plt.xticks(rotation=90)
plt.show()
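
The same type-by-location table can also be built with `pd.crosstab`, which additionally offers row normalization so that each location shows the share of each crime type rather than raw counts. This is a sketch on hypothetical toy data, offered as an alternative and not as the notebook's method.

```python
import pandas as pd

# Hypothetical toy data with two location groups
toy = pd.DataFrame({
    "location_group": ["A", "A", "A", "B"],
    "Primary Type": ["THEFT", "THEFT", "BATTERY", "BATTERY"],
})

# Raw counts: rows are locations, columns are crime types, missing pairs become 0
counts = pd.crosstab(toy["location_group"], toy["Primary Type"])

# Row-normalized shares: each row sums to 1
shares = pd.crosstab(toy["location_group"], toy["Primary Type"], normalize="index")

print(counts)
print(shares.round(2))
```

Passing `shares` to `sns.heatmap` instead of the count table would highlight the crime mix of each location independent of its total volume.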

In [None]:
# Crime type vs Hour
plt.figure(figsize=(14, 10))
sns.boxplot(data=copy_df, x="Crime_hour", y="Primary Type", showfliers=False, palette="coolwarm")

plt.xlabel("Hour of the Day", fontsize=12)
plt.ylabel("Crime Type", fontsize=12)
plt.title("Crime Types vs. Crime Hour", fontsize=14)
plt.show()

## Map Based Visualization

In [None]:
# Create a base map centered around Chicago
crime_map = folium.Map(location=[copy_df["Latitude"].mean(), copy_df["Longitude"].mean()], zoom_start=11)

# Add a heatmap layer
heat_data = copy_df[["Latitude", "Longitude"]].dropna().values.tolist()  # Drop missing values, convert to [lat, lon] pairs
HeatMap(heat_data, radius=10, blur=8, max_zoom=13).add_to(crime_map)

crime_map
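
With a couple of hundred thousand points, the heatmap layer can become heavy to render. One possible way to lighten it, sketched here with pandas only on hypothetical coordinates, is to round the coordinates into small bins and pass one weighted point per bin. The rounding precision of 3 decimals is an arbitrary assumption, and the sketch relies on `HeatMap` accepting points of the form `[lat, lng, weight]`.

```python
import pandas as pd

# Hypothetical coordinates; three of them fall into the same rounded bin
toy = pd.DataFrame({
    "Latitude": [41.88112, 41.88149, 41.75021, 41.88101],
    "Longitude": [-87.62981, -87.63011, -87.60012, -87.62979],
})

# Round to 3 decimals to group nearby incidents into one bin
binned = toy.round({"Latitude": 3, "Longitude": 3})

# One row per bin, with the number of incidents as the weight
weighted = (
    binned.groupby(["Latitude", "Longitude"])
          .size()
          .reset_index(name="weight")
)

# List of [lat, lon, weight] triples, e.g. for HeatMap(heat_points, ...)
heat_points = weighted[["Latitude", "Longitude", "weight"]].values.tolist()
print(heat_points)
```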

In [None]:
# Crime Cluster over the year
fig = px.scatter_mapbox(copy_df, 
                        lat="Latitude", 
                        lon="Longitude", 
                        color="Primary Type", 
                        size_max=10, 
                        zoom=11, 
                        title="Crime Clusters in Chicago",
                        hover_data=["Primary Type"])

# Update map settings
fig.update_layout(mapbox_style="carto-positron", mapbox_zoom=10)
fig.show()

In [None]:
# Monthly Crime Heatmap

# convert crime months into string
copy_df["crime_month_str"] = copy_df["Crime_Month"].apply(lambda x: calendar.month_abbr[x])

# Month Order
month_order = {calendar.month_abbr[i]: i for i in range(1, 13)}

# Sorting
copy_df["crime_month_order"] = copy_df["crime_month_str"].map(month_order)
copy_df = copy_df.sort_values(by="crime_month_order")  # Ensure correct order

# Aggregate crime data by month and location
df_grouped = copy_df.groupby(["Latitude", "Longitude", "crime_month_str"]).size().reset_index(name="crime_count")

# Create animated heatmap
fig = px.density_mapbox(df_grouped, 
                        lat="Latitude", 
                        lon="Longitude", 
                        z="crime_count", 
                        radius=10, 
                        animation_frame="crime_month_str",  # Animate by month
                        category_orders={"crime_month_str": list(month_order.keys())},
                        title="Crime Heatmap Over Month",
                        hover_data=["crime_count"])

# Update map settings
fig.update_layout(mapbox_style="open-street-map", mapbox_zoom=10)
fig.show()

In [None]:
# Crime Cluster over month

# Convert month numbers into abbreviated month names
copy_df["crime_month_str"] = copy_df["Crime_Month"].apply(lambda x: calendar.month_abbr[x])

# Create animated scatter map
fig = px.scatter_mapbox(copy_df, 
                        lat="Latitude", 
                        lon="Longitude", 
                        color="Primary Type", 
                        size_max=10, 
                        zoom=11, 
                        title="Crime Trends Over Months",
                        animation_frame="crime_month_str",  # Use month names
                        hover_data=["Primary Type"])

# Update map settings
fig.update_layout(mapbox_style="open-street-map", mapbox_zoom=10)
fig.show()