In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import calendar

# Load data

In [None]:
!pip install gdown

import gdown
import pandas as pd

# 1. Set the file ID taken from the Google Drive share link and the local file name
file_id = '1UiEn8ssAfzhZCrqYAVA_cmW63nv9RrtY' # The ID extracted from the link
output_filename = 'Crimes_data.csv' # The local file name used after downloading

# 2. Build the download link
download_url = f'https://drive.google.com/uc?id={file_id}'

# 3. Download the file using gdown
try:
    gdown.download(download_url, output_filename, quiet=False)
    print(f"The file has been successfully downloaded and saved as {output_filename}")

    # 4. Use pandas to read the downloaded CSV file
    data = pd.read_csv(output_filename)

    # Print the first few rows to confirm the file was read correctly
    print("\nData preview:")
    print(data.head())

except Exception as e:
    print(f"An error occurred when downloading or reading the file: {e}")
    print("Please check that the file ID is correct and that the file's sharing permission is set to 'Anyone with the link'.")

In [None]:
# View basic data information
data.info()
data.head()

### Summary of the Chicago Crime Dataset

The Chicago Crime Dataset contains **1,179,152** records with **22** features, documenting crime incidents in Chicago from **2020-03-20 to 2025-03-20**. This dataset, sourced from the Chicago Police Department's CLEAR system, includes key details about crime types, locations, and timestamps.

**Key Features Overview**

**1. Incident Information**

- **ID:** Unique identifier for each crime record

- **Case Number:** Case reference number

- **Date:** Date and time of the crime (currently in string format, requiring conversion to datetime)

- **Updated On:** Last update timestamp

**2. Crime Classification**

- **IUCR:** Crime classification code

- **Primary Type:** Major crime category (e.g., theft, narcotics, assault)

- **Description:** Specific crime details

- **FBI Code:** Federal classification of the crime

**3. Location Details**

- **Block:** Approximate address where the crime occurred

- **Location Description:** Specific place (e.g., residence, sidewalk, parking lot)

- **Beat / District / Ward / Community Area:** Administrative region identifiers

- **Latitude / Longitude:** Geographical coordinates (some missing values)

**4. Case Attributes**

- **Arrest:** Whether an arrest was made (Boolean)

- **Domestic:** Whether the crime was classified as domestic violence (Boolean)

**5. Geospatial Information**

- **X Coordinate / Y Coordinate:** Projected spatial coordinates

- **Location:** Combined latitude and longitude in tuple format

**6. Temporal Attributes**

- **Year:** Year in which the crime occurred

- **Date:** Needs to be converted to datetime format for extracting hour, day of the week, and month

**Data Quality Issues**

**1. Missing Values:**

Location Description, Ward, Community Area, X Coordinate, Y Coordinate, Latitude, and Longitude contain missing values.

**2. Data Format Issues:**

Date is in string format and must be converted to datetime for temporal analysis.

**3. Potential Data Cleaning:**

Case Number may not be necessary for analysis and could be removed.
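
The checks listed above can be sketched in a few lines. This is a minimal illustration on a hypothetical toy frame (`toy`), not the full dataset; the helper name `summarize_missing` and the date format string are assumptions, so confirm the format against the real `Date` column before relying on it.

```python
import pandas as pd

def summarize_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and share of missing values per column, worst first."""
    missing = df.isna().sum()
    summary = pd.DataFrame({"missing": missing, "share": missing / len(df)})
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Hypothetical toy frame standing in for `data`
toy = pd.DataFrame({
    "Case Number": ["JA1", "JA2", "JA3", "JA4"],
    "Date": [
        "03/20/2020 01:00:00 PM",
        "03/21/2020 02:30:00 AM",
        "03/22/2020 11:15:00 PM",
        "03/23/2020 09:00:00 AM",
    ],
    "Ward": [1.0, None, 3.0, None],
    "Latitude": [41.88, 41.90, None, 41.85],
})

# 1. Which columns have missing values, and how many?
missing_summary = summarize_missing(toy)
print(missing_summary)

# 2. Drop a column that is not needed for the analysis
toy_clean = toy.drop(columns=["Case Number"])

# 3. Parse the date strings (format is an assumption: MM/DD/YYYY hh:mm:ss AM/PM)
toy_clean["Date"] = pd.to_datetime(toy_clean["Date"], format="%m/%d/%Y %I:%M:%S %p")
```

On the real data, `summarize_missing(data)` would give the same table for all 22 columns. Passing an explicit `format` usually makes parsing more than a million rows noticeably faster than letting pandas infer it.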

# Exploratory Data Analysis (EDA)

## 1. Temporal Pattern Analysis

We will analyze the change in crime cases over time to see if there are certain patterns, such as seasonal or year-to-year trends.

In [None]:
# Convert the 'Date' column to the datetime type
data['Date'] = pd.to_datetime(data['Date'])

# Extract information such as year, month, and day
data['Year'] = data['Date'].dt.year
data['Month'] = data['Date'].dt.month
data['Day'] = data['Date'].dt.day
data['Weekday'] = data['Date'].dt.weekday  # Day of the week (Monday=0, Sunday=6)

# Set chart style
sns.set(style="whitegrid", palette="pastel")

In [None]:
data.head()

In [None]:
# ----- 1. Number of annual criminal cases -----
crime_per_year = data.groupby('Year').size()

plt.figure(figsize=(10, 6))
plt.plot(crime_per_year.index, crime_per_year.values, marker='o', color='b')
plt.title('Number of Crimes Per Year', fontsize=16)
plt.xlabel('Year', fontsize=12)
plt.ylabel('Number of Crimes', fontsize=12)
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()

**Annual Crime Trends**

The line chart shows fluctuations in the total number of crimes per year from 2020 to 2025.

**Key Observations:**

- Crime counts peaked in 2023 (more than 250,000 incidents) before declining in subsequent years.

- The lowest crime count occurred in 2020 (around 150,000). This most likely reflects a data limitation rather than a real dip: the dataset begins on 2020-03-20, so 2020 is a partial year.

- Post-2023, there is a downward trend, which may reflect changes in law enforcement or societal factors. The 2025 total covers only the period up to 2025-03-20, so it is not comparable with the full years.
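
Because the dataset runs from 2020-03-20 to 2025-03-20, the first and last years are partial, and raw yearly totals are not directly comparable. One way to check this is to report how many days each year actually covers and normalize by that. The sketch below uses a hypothetical helper, `yearly_counts_with_coverage`, and a toy series of dates; it is one possible normalization, not the notebook's method.

```python
import pandas as pd

def yearly_counts_with_coverage(dates: pd.Series) -> pd.DataFrame:
    """Count incidents per year and report the span of days each year covers."""
    dates = pd.to_datetime(dates)
    grouped = dates.groupby(dates.dt.year)
    out = pd.DataFrame({
        "incidents": grouped.size(),
        "first_date": grouped.min().dt.normalize(),
        "last_date": grouped.max().dt.normalize(),
    })
    # Inclusive number of calendar days between the first and last record
    out["days_covered"] = (out["last_date"] - out["first_date"]).dt.days + 1
    out["incidents_per_day"] = out["incidents"] / out["days_covered"]
    return out

# Hypothetical toy dates: 2020 starts on March 20, 2021 is a full year
toy_dates = pd.Series(pd.to_datetime([
    "2020-03-20", "2020-12-31",
    "2021-01-01", "2021-06-15", "2021-12-31",
]))
coverage = yearly_counts_with_coverage(toy_dates)
print(coverage)
```

With the real data, `yearly_counts_with_coverage(data['Date'])` would show at a glance which years are incomplete, and `incidents_per_day` gives a fairer basis for year-to-year comparison.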

In [None]:
# ----- 2. Monthly crime trends -----
crime_per_month = data.groupby(['Year', 'Month']).size().unstack()

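# DataFrame.plot creates its own figure, so figsize is passed here;
# a separate plt.figure() call would only leave an empty extra figure.
crime_per_month.plot(kind='line', marker='o', figsize=(12, 6))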
plt.title('Monthly Crime Trends (2020-2025)', fontsize=16)
plt.xlabel('Year', fontsize=12)
plt.ylabel('Number of Crimes', fontsize=12)
plt.legend(title='Month', loc='upper right', labels=[calendar.month_name[i] for i in range(1, 13)])
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()

**Monthly Crime Trends**

The multi-line chart plots one line per calendar month, with the year on the x-axis, so each line shows how that month's crime count changes from year to year.

**Key Insights:**

- Seasonality: Certain months (e.g., July, August) consistently show higher crime counts, possibly linked to warmer weather or holidays.

- Yearly Patterns: Despite annual fluctuations, the ordering of months repeats from year to year, reinforcing seasonal influences.

- Notable Peaks: The highest monthly crime count reached ~22,500 incidents (likely in mid-year months during peak years like 2023).

- One caveat: the dataset starts on March 20, 2020 and ends on March 20, 2025, so the January to March lines are distorted by incomplete months (no January or February data for 2020, and only partial March data in 2020 and 2025).

**Conclusion:**

Crime counts exhibit both annual variability and strong monthly seasonality, with peaks in mid-year and a notable surge in 2023. Further investigation into external factors (e.g., policy changes, economic conditions) could explain these trends.
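
Seasonality is often easier to read with months on the x-axis and one line per year. The sketch below builds that table; the helper name `monthly_by_year` and the toy dates are hypothetical. Months with no records are left as `NaN` rather than 0, so incomplete years show gaps instead of false dips.

```python
import pandas as pd

def monthly_by_year(dates: pd.Series) -> pd.DataFrame:
    """Return a Month x Year table of incident counts (NaN where no data exists)."""
    dates = pd.to_datetime(dates)
    year = dates.dt.year.rename("Year")
    month = dates.dt.month.rename("Month")
    table = dates.groupby([year, month]).size().unstack("Year")
    # Make sure all 12 months appear as rows, even if some have no records
    table = table.reindex(range(1, 13))
    table.index.name = "Month"
    return table

# Hypothetical toy dates
toy_dates = pd.Series(pd.to_datetime([
    "2020-03-25", "2020-07-01", "2020-07-04",
    "2021-01-10", "2021-07-20",
]))
season_table = monthly_by_year(toy_dates)
print(season_table)
```

On the real data, `monthly_by_year(data['Date']).plot(marker='o', figsize=(12, 6))` would draw one line per year across the twelve months, which makes a mid-year peak directly visible.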

## 2. Spatial Distribution Study

In this analysis, we will use latitude and longitude data to map the geographical distribution of crime cases.

In [None]:
# View the coordinate data range
print(data['X Coordinate'].describe())
print(data['Y Coordinate'].describe())

In [None]:
# Keep only rows with valid coordinates: the > 0 test drops zero values and also missing (NaN) values
data_cleaned = data[(data['X Coordinate'] > 0) & (data['Y Coordinate'] > 0)]

# Plot scatter plots
plt.figure(figsize=(10, 8))
plt.scatter(data_cleaned['X Coordinate'], data_cleaned['Y Coordinate'], s=0.5, alpha=0.1, color='blue')
plt.title('Spatial Distribution of Crimes', fontsize=16)
plt.xlabel('X Coordinate', fontsize=12)
plt.ylabel('Y Coordinate', fontsize=12)
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()



In [None]:
import folium
from folium.plugins import HeatMap

# Create a basic map
map_center = [data['Latitude'].mean(), data['Longitude'].mean()]
m = folium.Map(location=map_center, zoom_start=11)

# Heat map
# Drop rows with missing latitude or longitude
clean_data = data.dropna(subset=['Latitude', 'Longitude'])
# Build the point list in one vectorized step; iterrows() is very slow on more than a million rows
heat_data = clean_data[['Latitude', 'Longitude']].values.tolist()
HeatMap(heat_data).add_to(m)

# Show map
m


**Summary of Spatial Distribution of Crimes**

**Key Observations from Coordinate Data:**

1. Data Range:

- X Coordinate: Ranges from 0 to ~1.205 million, with a mean of ~1.165 million and noticeable clustering around 1.15–1.18 million.

- Y Coordinate: Ranges from 0 to ~1.952 million, with a mean of ~1.887 million and concentration between 1.86–1.91 million.

2. Data Cleaning:

- Records with X or Y coordinates = 0 (invalid/missing) were removed, retaining only valid locations for analysis.

3. Spatial Hotspots:

- The scatter plot reveals dense clusters of crimes in specific areas, particularly around:

  -  X: 1.15–1.18 million

  -  Y: 1.86–1.91 million

- Lower-density "halos" suggest sporadic crime occurrences radiating from central hotspots.

4. Potential Implications:

- High-density zones likely correlate with urban centers or high-population areas.

- The presence of outliers (e.g., coordinates near zero) may indicate data entry errors or isolated incidents.

**Conclusion:**

Crimes are not uniformly distributed but concentrated in specific geographic regions, highlighting potential socio-economic or infrastructural factors (e.g., proximity to transit hubs, commercial districts). Further analysis with geographic context (e.g., neighborhood boundaries) could refine hotspot identification for targeted interventions.

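(Note: X Coordinate and Y Coordinate are projected values, which the Chicago Data Portal documents as State Plane Illinois East NAD 1983, not latitude/longitude. Use the Latitude and Longitude columns to relate these clusters to actual places on a map.)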
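
The visual impression of hotspots can be backed with numbers by binning the coordinates into a grid and ranking the cells by incident count. The sketch below is one simple way to do that with `numpy.histogram2d`; the helper name `top_grid_cells`, the bin count, and the toy frame are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

def top_grid_cells(df, x_col="X Coordinate", y_col="Y Coordinate", bins=50, top_n=5):
    """Bin valid coordinates into a regular grid and return the busiest cells."""
    # Same validity filter as the scatter plot: drops zeros and NaN
    valid = df[(df[x_col] > 0) & (df[y_col] > 0)]
    counts, x_edges, y_edges = np.histogram2d(valid[x_col], valid[y_col], bins=bins)
    # Indices of the cells with the highest counts, largest first
    flat = np.argsort(counts, axis=None)[::-1][:top_n]
    ix, iy = np.unravel_index(flat, counts.shape)
    return pd.DataFrame({
        "x_min": x_edges[ix], "x_max": x_edges[ix + 1],
        "y_min": y_edges[iy], "y_max": y_edges[iy + 1],
        "incidents": counts[ix, iy].astype(int),
    })

# Hypothetical toy frame: five incidents at one spot, two elsewhere, one invalid (0, 0) row
toy = pd.DataFrame({
    "X Coordinate": [1153000] * 5 + [1200000, 1100000, 0],
    "Y Coordinate": [1883000] * 5 + [1950000, 1850000, 0],
})
hotspots = top_grid_cells(toy, bins=10, top_n=1)
print(hotspots)
```

The grid resolution is a judgment call: fewer bins give coarse regions, more bins give finer cells with noisier counts. Grouping by `Community Area` instead would tie the ranking to administrative boundaries.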

## 3. Crime Correlation Analysis

We will analyze correlations between different crime types to see if some crime types tend to occur together.

In [None]:
# One-hot encode the primary crime type: one column per type, one row per incident
crime_dummies = pd.get_dummies(data['Primary Type'])

# Compute the correlation matrix
crime_corr = crime_dummies.corr()

# Use a large figure so the axis labels and annotations stay readable
plt.figure(figsize=(16, 12))
sns.heatmap(crime_corr, cmap='coolwarm', annot=True, fmt=".2f", annot_kws={"size": 8}, linewidths=0.5)

# Title and tick labels
plt.title('Correlation Between Different Crime Types', fontsize=16)
plt.xticks(rotation=90)
plt.yticks(rotation=0)

# Show the figure
plt.tight_layout()
plt.show()


**Crime Correlation Heatmap Analysis**

**Key Observations**

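- The diagonal of 1.0 is each crime type's correlation with itself, which is expected.

- Most off-diagonal values are close to 0.0, and nearly all of them are slightly negative.

- This pattern is a consequence of the encoding rather than evidence about co-occurrence: each record has exactly one Primary Type, so the one-hot columns are mutually exclusive and any two of them are necessarily negatively correlated, more strongly for the more common types. The heatmap therefore cannot show whether crime types tend to occur together; that requires counts aggregated over a shared unit such as a day or a community area.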
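
To ask whether crime types rise and fall together, one option is to count incidents per type per day and correlate those daily counts. The sketch below illustrates the idea on a hypothetical toy frame; the helper name `daily_type_correlation` and the choice of a daily unit are assumptions, and aggregating by week or by community area would be equally valid alternatives.

```python
import pandas as pd

def daily_type_correlation(df, date_col="Date", type_col="Primary Type"):
    """Correlate crime types using their daily incident counts."""
    days = pd.to_datetime(df[date_col]).dt.normalize()
    # Rows: calendar days, columns: crime types, values: incident counts
    daily_counts = pd.crosstab(days, df[type_col])
    return daily_counts.corr()

# Hypothetical toy data: THEFT and BATTERY rise together, NARCOTICS falls
rows = []
for day, (theft, battery, narcotics) in zip(
    pd.date_range("2021-01-01", periods=4),
    [(1, 2, 4), (2, 4, 3), (3, 6, 2), (4, 8, 1)],
):
    rows += [(day, "THEFT")] * theft
    rows += [(day, "BATTERY")] * battery
    rows += [(day, "NARCOTICS")] * narcotics
toy = pd.DataFrame(rows, columns=["Date", "Primary Type"])

type_corr = daily_type_correlation(toy)
print(type_corr.round(2))
```

On the real data, `daily_type_correlation(data)` returns a matrix that can be passed to `sns.heatmap` exactly as above. Shared seasonality will push many pairs toward positive values, so correlations of this kind describe co-movement over time, not causal links between crime types.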
