# Boston Crime Data - Exploratory Data Analysis (EDA)

## Introduction

This notebook presents a comprehensive Exploratory Data Analysis (EDA) of Boston crime incident data. The dataset contains detailed information about various types of crimes and incidents reported in Boston, including temporal, geographical, and categorical attributes.

### What is EDA?
Exploratory Data Analysis is a critical step in data science that involves examining and understanding data through statistical summaries, visualizations, and pattern identification. It helps us:
- Understand the structure and quality of our data
- Identify patterns, trends, and anomalies
- Generate hypotheses for further analysis
- Prepare data for more advanced modeling

### Dataset Overview
The Boston crime dataset includes the following key information:
- **Temporal data**: Date, time, year, month, day of week, hour
- **Geographical data**: District, street, latitude, longitude
- **Categorical data**: Offense types, descriptions, UCR classification
- **Incident details**: Shooting incidents, reporting areas, incident numbers

### Goals of This Analysis
1. **Data Quality Assessment**: Identify missing values, data inconsistencies, and cleaning requirements
2. **Temporal Patterns**: Understand when crimes occur (time of day, day of week, seasonal trends)
3. **Geographical Distribution**: Analyze crime patterns across different districts and locations
4. **Crime Type Analysis**: Examine the distribution and characteristics of different offense types
5. **Shooting Incidents**: Examine how often incidents involve a shooting, and when and where those incidents occur
6. **Correlation Analysis**: Identify relationships between different variables

The findings are intended as a descriptive starting point for law enforcement, urban planning, and public safety work in Boston; they describe reported incidents and do not establish causes.


## Data Loading & Setup

In this section, we'll import the necessary libraries and load our crime dataset for analysis.


In [None]:
# Import necessary libraries for data analysis and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from datetime import datetime
import folium
from folium import plugins

# Set up plotting parameters for better visualization
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 10

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

print("Libraries imported successfully!")


In [None]:
# Load the crime data CSV file
df = pd.read_csv('boston_crime_data_20250927_123841.csv')

# Display basic information about the dataset
print("Dataset Shape:", df.shape)
print("\nColumn Names:")
print(df.columns.tolist())
print("\nFirst 5 rows of the dataset:")
df.head()
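Later cells assume that certain columns exist (for example `DISTRICT`, `HOUR`, `SHOOTING`, and `OCCURRED_ON_DATE`). A minimal sketch of a schema check, run on a made-up toy frame rather than the real file, might look like the following; the `expected_cols` set and the toy column names are illustrative assumptions, not a definitive list of what the CSV contains.

```python
import pandas as pd

# Toy frame with deliberately messy headers (illustrative only, not real data)
toy = pd.DataFrame({" DISTRICT": ["B2"], "HOUR ": [8], "SHOOTING": [0]})

# Columns the later analysis cells rely on (assumed list for this sketch)
expected_cols = {"DISTRICT", "HOUR", "SHOOTING", "OCCURRED_ON_DATE"}

# Compare against whitespace-stripped header names
actual_cols = set(toy.columns.str.strip())
missing_cols = sorted(expected_cols - actual_cols)

print("Missing expected columns:", missing_cols)
```

Running a check like this right after loading surfaces a renamed or absent column immediately, instead of as a `KeyError` several cells later.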


## Data Cleaning & Preprocessing

Before diving into analysis, we need to clean and preprocess our data to ensure quality and consistency.


In [None]:
# Check for missing values in the dataset
print("Missing values per column:")
print(df.isnull().sum())
print(f"\nTotal missing values: {df.isnull().sum().sum()}")
print(f"Percentage of missing values: {(df.isnull().sum().sum() / (df.shape[0] * df.shape[1])) * 100:.2f}%")
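The overall percentage hides which columns are responsible for the gaps. As a rough sketch, a per-column summary can be built from `isna().sum()` and `isna().mean()`; the toy frame below uses made-up values and only stands in for `df`.

```python
import pandas as pd

# Toy frame standing in for the crime data (values are made up for illustration)
toy = pd.DataFrame({
    "DISTRICT": ["B2", None, "C11", "D4"],
    "STREET": ["WASHINGTON ST", None, None, "BOYLSTON ST"],
    "HOUR": [1, 14, 22, 9],
})

# isna().mean() gives the fraction of missing entries in each column
missing_summary = pd.DataFrame({
    "missing_count": toy.isna().sum(),
    "missing_pct": (toy.isna().mean() * 100).round(2),
}).sort_values("missing_pct", ascending=False)

print(missing_summary)
```

Sorting by `missing_pct` puts the columns that need the most attention at the top.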


In [None]:
# Strip whitespace from column names first, so the column lookups below match
print("Stripping whitespace from column names...")
df.columns = df.columns.str.strip()
print("Column names after stripping whitespace:")
print(df.columns.tolist())

# Handle missing values in OFFENSE_CODE_GROUP
print("\nOFFENSE_CODE_GROUP missing values before cleaning:", df['OFFENSE_CODE_GROUP'].isnull().sum())
df['OFFENSE_CODE_GROUP'] = df['OFFENSE_CODE_GROUP'].fillna('Unknown')
print("OFFENSE_CODE_GROUP missing values after cleaning:", df['OFFENSE_CODE_GROUP'].isnull().sum())

# Convert OCCURRED_ON_DATE to datetime
print("\nConverting OCCURRED_ON_DATE to datetime...")
df['OCCURRED_ON_DATE'] = pd.to_datetime(df['OCCURRED_ON_DATE'])
print("Date conversion completed successfully!")
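By default, `pd.to_datetime` raises an error if a value cannot be parsed. A minimal sketch of a more forgiving variant, assuming you would rather flag bad values than stop the notebook, uses `errors="coerce"` and then counts how many values were lost. The sample strings are invented for illustration and do not come from the dataset.

```python
import pandas as pd

# Made-up raw values: one valid timestamp, one malformed string, one missing
raw = pd.Series(["2023-01-15 08:30:00", "not a date", None])

# errors="coerce" turns unparseable values into NaT instead of raising
parsed = pd.to_datetime(raw, errors="coerce")

# Values that were present in the input but could not be parsed
n_failed = int(parsed.isna().sum() - raw.isna().sum())
print(f"Values that failed to parse: {n_failed}")
```

Comparing missing counts before and after conversion separates values that were already absent from values that were present but malformed.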


In [None]:
# Additional data cleaning steps
# Remove any duplicate rows
print("Number of duplicate rows:", df.duplicated().sum())
df = df.drop_duplicates()
print("Number of rows after removing duplicates:", len(df))

# Check data types
print("\nData types:")
print(df.dtypes)

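# Clean string columns by stripping whitespace.
# Missing values are left untouched: a bare astype(str) can turn NaN into the literal string 'nan'.
string_columns = ['DISTRICT', 'OFFENSE_DESCRIPTION', 'DAY_OF_WEEK', 'STREET']
for col in string_columns:
    if col in df.columns:
        df[col] = df[col].where(df[col].isna(), df[col].astype(str).str.strip())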

print("\nData cleaning completed successfully!")
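`drop_duplicates()` removes only rows that are identical in every column. Since the overview compares total records with unique `INCIDENT_NUMBER` values, it can help to look at both kinds of repetition before dropping anything. The sketch below uses an invented toy frame; whether repeated incident numbers in the real data represent multiple offenses per incident is an assumption to verify against the file.

```python
import pandas as pd

# Toy data: I001 appears as an exact duplicate, I002 appears with two different offenses
toy = pd.DataFrame({
    "INCIDENT_NUMBER": ["I001", "I001", "I002", "I002"],
    "OFFENSE_DESCRIPTION": ["LARCENY", "LARCENY", "VANDALISM", "ASSAULT"],
})

# keep=False marks every member of a duplicate group, which is useful for inspection
full_dupes = toy[toy.duplicated(keep=False)]
repeated_ids = toy[toy.duplicated(subset="INCIDENT_NUMBER", keep=False)]

# Only exact row duplicates are removed here
deduped = toy.drop_duplicates()

print(f"Rows in exact-duplicate groups: {len(full_dupes)}")
print(f"Rows sharing an incident number: {len(repeated_ids)}")
print(f"Rows after drop_duplicates: {len(deduped)}")
```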


## Basic Data Overview

Let's get a comprehensive understanding of our dataset through statistical summaries and data exploration.


In [None]:
# Display comprehensive summary statistics
print("=== COMPREHENSIVE SUMMARY STATISTICS ===")
print(df.describe(include='all'))

print("\n=== DATASET INFORMATION ===")
print(f"Total number of records: {len(df):,}")
print(f"Date range: {df['OCCURRED_ON_DATE'].min()} to {df['OCCURRED_ON_DATE'].max()}")
print(f"Number of unique incidents: {df['INCIDENT_NUMBER'].nunique():,}")
print(f"Number of unique districts: {df['DISTRICT'].nunique()}")
print(f"Number of unique offense types: {df['OFFENSE_DESCRIPTION'].nunique()}")
print(f"Number of shooting incidents: {df['SHOOTING'].sum():,}")
print(f"Percentage of shooting incidents: {(df['SHOOTING'].sum() / len(df)) * 100:.2f}%")
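The shooting totals above rely on `SHOOTING` being numeric. As a hedged sketch, assuming some exports of this data encode the flag as `'Y'`/blank rather than `0`/`1` (check `df['SHOOTING'].unique()` to confirm for your file), a small helper can normalize either encoding before summing. `normalize_shooting` is a hypothetical helper name introduced for illustration.

```python
import numpy as np
import pandas as pd

def normalize_shooting(col: pd.Series) -> pd.Series:
    """Map 'Y', 1, or '1' to 1 and everything else (including missing) to 0."""
    as_text = col.astype(str).str.strip().str.upper()
    return as_text.isin(["Y", "1", "1.0"]).astype(int)

# Two made-up encodings of the same kind of flag
letters = pd.Series(["Y", np.nan, "Y", np.nan])
numbers = pd.Series([0, 1, 0, 0])

print(normalize_shooting(letters).sum())
print(normalize_shooting(numbers).sum())
```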


In [None]:
# Count unique values for categorical columns
print("=== CATEGORICAL COLUMNS ANALYSIS ===")
print("\nDistricts:")
district_counts = df['DISTRICT'].value_counts()
print(district_counts)

print("\nDay of Week:")
day_counts = df['DAY_OF_WEEK'].value_counts()
print(day_counts)

print("\nTop 10 Offense Descriptions:")
offense_counts = df['OFFENSE_DESCRIPTION'].value_counts().head(10)
print(offense_counts)

print("\nOffense Code Groups:")
offense_group_counts = df['OFFENSE_CODE_GROUP'].value_counts()
print(offense_group_counts)


## Univariate Analysis

In this section, we'll analyze individual variables to understand their distributions and characteristics.


In [None]:
# Distribution of incidents by district
plt.figure(figsize=(14, 8))
district_counts = df['DISTRICT'].value_counts()
bars = plt.bar(district_counts.index, district_counts.values, color='skyblue', edgecolor='navy', alpha=0.7)
plt.title('Distribution of Incidents by District', fontsize=16, fontweight='bold')
plt.xlabel('District', fontsize=12)
plt.ylabel('Number of Incidents', fontsize=12)
plt.xticks(rotation=45)
plt.grid(axis='y', alpha=0.3)

# Add value labels on bars
for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height + height*0.01,
             f'{int(height):,}', ha='center', va='bottom', fontsize=10)

plt.tight_layout()
plt.show()

print(f"District with most incidents: {district_counts.index[0]} ({district_counts.iloc[0]:,} incidents)")
print(f"District with fewest incidents: {district_counts.index[-1]} ({district_counts.iloc[-1]:,} incidents)")


In [None]:
# Distribution of incidents by day of week (ordered from Monday to Sunday)
plt.figure(figsize=(12, 8))

# Define the correct order for days of the week
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
day_counts = df['DAY_OF_WEEK'].value_counts()

# Reorder the data according to the day order (days absent from the data get a count of 0)
day_counts_ordered = day_counts.reindex(day_order, fill_value=0)

bars = plt.bar(day_counts_ordered.index, day_counts_ordered.values, 
               color='lightcoral', edgecolor='darkred', alpha=0.7)
plt.title('Distribution of Incidents by Day of Week', fontsize=16, fontweight='bold')
plt.xlabel('Day of Week', fontsize=12)
plt.ylabel('Number of Incidents', fontsize=12)
plt.xticks(rotation=45)
plt.grid(axis='y', alpha=0.3)

# Add value labels on bars
for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height + height*0.01,
             f'{int(height):,}', ha='center', va='bottom', fontsize=10)

plt.tight_layout()
plt.show()

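# The series is in calendar order, so use idxmax/idxmin rather than the first/last position
print(f"Day of week with most incidents: {day_counts_ordered.idxmax()} ({int(day_counts_ordered.max()):,} incidents)")
print(f"Day of week with fewest incidents: {day_counts_ordered.idxmin()} ({int(day_counts_ordered.min()):,} incidents)")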
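Another way to keep weekdays in calendar order is to store them as an ordered categorical, so every later `value_counts()`, `groupby`, or plot inherits the order without reindexing. This is a sketch on made-up values, not the approach used in the cells above.

```python
import pandas as pd

day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

# Made-up sample of weekday labels
days = pd.Series(["Friday", "Monday", "Friday", "Sunday", "Monday", "Friday"])

# An ordered categorical remembers the calendar order of its categories
day_dtype = pd.CategoricalDtype(categories=day_order, ordered=True)
days_cat = days.astype(day_dtype)

# Unobserved days appear with a count of 0; sort_index follows category order
counts = days_cat.value_counts().sort_index()

print(counts)
print("Busiest day:", counts.idxmax())
```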


In [None]:
# Distribution of incidents by hour of day
plt.figure(figsize=(14, 8))
hour_counts = df['HOUR'].value_counts().sort_index()

bars = plt.bar(hour_counts.index, hour_counts.values, 
               color='lightgreen', edgecolor='darkgreen', alpha=0.7)
plt.title('Distribution of Incidents by Hour of Day', fontsize=16, fontweight='bold')
plt.xlabel('Hour of Day', fontsize=12)
plt.ylabel('Number of Incidents', fontsize=12)
plt.xticks(range(0, 24))
plt.grid(axis='y', alpha=0.3)

# Add value labels on bars (only for every 2nd hour to avoid crowding)
for i, bar in enumerate(bars):
    height = bar.get_height()
    if i % 2 == 0:  # Only label every 2nd bar
        plt.text(bar.get_x() + bar.get_width()/2., height + height*0.01,
                 f'{int(height):,}', ha='center', va='bottom', fontsize=9)

plt.tight_layout()
plt.show()

print(f"Hour with most incidents: {hour_counts.idxmax()}:00 ({hour_counts.max():,} incidents)")
print(f"Hour with fewest incidents: {hour_counts.idxmin()}:00 ({hour_counts.min():,} incidents)")
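Hourly counts can be noisy, so it is sometimes easier to read them grouped into broader periods. The sketch below uses `pd.cut` with left-closed bins; the four period boundaries and labels are arbitrary choices for illustration, and the hours are made-up values rather than the `HOUR` column itself.

```python
import pandas as pd

# Made-up hours of day (0-23)
hours = pd.Series([0, 3, 6, 11, 12, 17, 18, 23])

# right=False makes each bin include its left edge: [0, 6), [6, 12), [12, 18), [18, 24)
bins = [0, 6, 12, 18, 24]
labels = ["Night (0-5)", "Morning (6-11)", "Afternoon (12-17)", "Evening (18-23)"]
period = pd.cut(hours, bins=bins, labels=labels, right=False)

# sort_index keeps the periods in chronological order
period_counts = period.value_counts().sort_index()
print(period_counts)
```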


In [None]:
# Top 10 offense descriptions (horizontal bar chart)
plt.figure(figsize=(12, 10))
top_offenses = df['OFFENSE_DESCRIPTION'].value_counts().head(10)

bars = plt.barh(range(len(top_offenses)), top_offenses.values, 
                color='gold', edgecolor='orange', alpha=0.7)
plt.title('Top 10 Offense Descriptions', fontsize=16, fontweight='bold')
plt.xlabel('Number of Incidents', fontsize=12)
plt.ylabel('Offense Description', fontsize=12)
plt.yticks(range(len(top_offenses)), top_offenses.index)

# Add value labels on bars
for i, bar in enumerate(bars):
    width = bar.get_width()
    plt.text(width + width*0.01, bar.get_y() + bar.get_height()/2.,
             f'{int(width):,}', ha='left', va='center', fontsize=10)

plt.tight_layout()
plt.show()

print("Top 5 most common offenses:")
for i, (offense, count) in enumerate(top_offenses.head(5).items(), 1):
    print(f"{i}. {offense}: {count:,} incidents")
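Raw counts do not show how much of the dataset the top offenses cover. A minimal sketch, using `value_counts(normalize=True)` on invented labels, adds each category's share and the running cumulative share.

```python
import pandas as pd

# Made-up offense labels for illustration
offenses = pd.Series(["LARCENY"] * 5 + ["VANDALISM"] * 3 + ["ASSAULT"] * 2)

# normalize=True returns fractions instead of counts; multiply by 100 for percentages
share = offenses.value_counts(normalize=True).mul(100)

summary = pd.DataFrame({
    "share_pct": share.round(1),
    "cumulative_pct": share.cumsum().round(1),
})
print(summary)
```

The cumulative column makes it easy to say, for example, how many offense types account for half of all records.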
