# Exploratory Data Analysis (EDA) – West Midlands Crime (Jun 2024 – Jul 2025)

This notebook explores the cleaned crime dataset to uncover key patterns and trends.  
Exploratory Data Analysis (EDA) helps us move beyond raw inspection and understand the story the data tells.  

## Objectives
- Summarise the dataset: coverage, size, and structure  
- Explore crime volumes over time (monthly totals)  
- Identify the most common crime types  
- Highlight geographic hotspots (locations/LSOAs)  
- Review outcomes (solved vs unresolved)  

## Notebook Roadmap
1. **Dataset Overview**  
   - Time range, record count, column types  

2. **Temporal Analysis**  
   - Monthly crime trends  
   - Seasonal variation  

3. **Categorical Analysis**  
   - Distribution of crime types  
   - Outcome categories  

4. **Geographic Patterns**  
   - Top locations / hotspots  

Each section will include code, outputs, and a short narrative summarising the findings.


In [6]:
# Load the cleaned crime dataset
import pandas as pd

cleaned_path = '../data/processed/cleaned_crime.csv'
df = pd.read_csv(cleaned_path)
print(f"Loaded cleaned dataset with shape: {df.shape}")
df.head()


Loaded cleaned dataset with shape: (384676, 15)


,Crime ID,Month,Reported by,Falls within,Longitude,Latitude,Location,LSOA code,LSOA name,Crime type,Last outcome category,Year,MonthName,MonthNum,Season
0,b345c4eb846684666b656286751722fd142fdc312a8fc7...,2024-06-01,West Midlands Police,West Midlands Police,-1.851067,52.588979,On or near Crown Lane,E01009417,Birmingham 001A,Vehicle crime,Investigation complete; no suspect identified,2024,June,6,Summer
1,7ef5395d6029ed9c8409821a7ef1a2781d5800fb924573...,2024-06-01,West Midlands Police,West Midlands Police,-1.845479,52.591165,On or near Four Oaks Common Road,E01009417,Birmingham 001A,Vehicle crime,Investigation complete; no suspect identified,2024,June,6,Summer
2,44ca5b37b556eef23f84d3063e15fe885e008643d035e4...,2024-06-01,West Midlands Police,West Midlands Police,-1.845479,52.591165,On or near Four Oaks Common Road,E01009417,Birmingham 001A,Vehicle crime,Investigation complete; no suspect identified,2024,June,6,Summer
3,048f1833119d4bf8fac88cac552d07971731a0e7b37011...,2024-06-01,West Midlands Police,West Midlands Police,-1.851067,52.588979,On or near Crown Lane,E01009417,Birmingham 001A,Violence and sexual offences,Unable to prosecute suspect,2024,June,6,Summer
4,779ba19c7294d2a78ef25196d23303ae51ee5b46a99fc2...,2024-06-01,West Midlands Police,West Midlands Police,-1.845479,52.591165,On or near Four Oaks Common Road,E01009417,Birmingham 001A,Violence and sexual offences,Unable to prosecute suspect,2024,June,6,Summer


In [None]:
# Inspect column dtypes and non-null counts, then count missing values per column
df.info()
df.isna().sum()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 384676 entries, 0 to 384675
Data columns (total 15 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   Crime ID               358449 non-null  object 
 1   Month                  384676 non-null  object 
 2   Reported by            384676 non-null  object 
 3   Falls within           384676 non-null  object 
 4   Longitude              384676 non-null  float64
 5   Latitude               384676 non-null  float64
 6   Location               384676 non-null  object 
 7   LSOA code              384676 non-null  object 
 8   LSOA name              384676 non-null  object 
 9   Crime type             384676 non-null  object 
 10  Last outcome category  358449 non-null  object 
 11  Year                   384676 non-null  int64  
 12  MonthName              384676 non-null  object 
 13  MonthNum               384676 non-null  int64  
 14  Season                 384676 non-nu

Crime ID                 26227
Month                        0
Reported by                  0
Falls within                 0
Longitude                    0
Latitude                     0
Location                     0
LSOA code                    0
LSOA name                    0
Crime type                   0
Last outcome category    26227
Year                         0
MonthName                    0
MonthNum                     0
Season                       0
dtype: int64

In [10]:
# Label missing outcomes explicitly, confirm none remain, then save
df["Last outcome category"] = df["Last outcome category"].fillna("Outcome not recorded")
assert df["Last outcome category"].isna().sum() == 0
df.to_csv("../data/processed/cleaned_crime.csv", index=False)
print("Updated cleaned dataset saved successfully!")




Updated cleaned dataset saved successfully!


In [5]:
# Analyse descriptive statistics for numerical columns
df.describe()


,Longitude,Latitude,Year,MonthNum
count,384676.0,384676.0,384676.0,384676.0
mean,-1.898442,52.493743,2024.489217,6.546834
std,0.176816,0.062383,0.499884,3.112873
min,-2.205059,52.348321,2024.0,1.0
25%,-2.015163,52.446838,2024.0,4.0
50%,-1.919471,52.484911,2024.0,7.0
75%,-1.828461,52.536918,2025.0,9.0
max,-1.431931,52.661008,2025.0,12.0


### Dataset Overview and Updates

The cleaned dataset contains **384,676 rows and 15 columns**.  
A descriptive summary of numeric fields shows:  
- **Longitude / Latitude**: Values fall within the expected West Midlands range (approx. -2.2 to -1.4 longitude, 52.3 to 52.7 latitude).  
- **Year**: Only 2024 and 2025 are present.  
- **MonthNum**: Spans 1–12, so every calendar month is represented at least once.  
- **Counts**: Every numeric column has 384,676 non-null values. The quartiles (25%, 50%, 75%) describe the distribution of each column's values, not monthly record volumes.  
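
The coordinate range check can also be written as an explicit test. The sketch below is illustrative only: the bounding box is an assumed approximation based on the ranges quoted in this summary (not an official boundary), `coords_within_bounds` is a helper name introduced here, and the small `sample` frame stands in for `df`.

```python
import pandas as pd

# Assumed approximate bounding box, based on the ranges quoted above
# (hypothetical values, not an official force-area boundary).
BOUNDS = {"Longitude": (-2.3, -1.4), "Latitude": (52.3, 52.7)}

def coords_within_bounds(frame: pd.DataFrame, bounds: dict = BOUNDS) -> pd.Series:
    """Return True for rows where every coordinate lies inside its range."""
    ok = pd.Series(True, index=frame.index)
    for col, (low, high) in bounds.items():
        ok &= frame[col].between(low, high)  # inclusive on both ends
    return ok

# Stand-in for df: the first two rows reuse the min/max values reported by
# describe(); the last row is a deliberately out-of-range point.
sample = pd.DataFrame({
    "Longitude": [-2.205059, -1.431931, 0.1],
    "Latitude": [52.348321, 52.661008, 51.5],
})
inside = coords_within_bounds(sample)
print(inside.tolist())
```

On the real data, `(~coords_within_bounds(df)).sum()` would give the number of rows outside the assumed box.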

### Handling Missing Outcomes

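Inspection showed 26,227 missing values (~7% of rows) in the **`Last outcome category`** column, and the same number in `Crime ID`.  
This figure equals the number of **Anti-social behaviour** records in the dataset (26,227), which suggests the gaps are anti-social behaviour incidents, published in the open data without a crime ID or outcome, rather than crimes still awaiting an outcome. This is a known limitation of the open dataset.  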

To avoid losing these records, missing values were replaced with the label **"Outcome not recorded"**.  
The processed dataset (`cleaned_crime.csv`) was **overwritten in place**, so this update is now permanent and consistent across all future analyses.  

This ensures:  
- No crimes are excluded due to missing outcomes.  
- All records without a published outcome are explicitly flagged as `"Outcome not recorded"`.  
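
One way to see which records the missing outcomes belong to is to break them down by crime type before filling them. The sketch below uses a small made-up frame (`toy`, with hypothetical rows) in place of `df`; on the real data the same breakdown would need to run before the `fillna` step.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw data (hypothetical rows, not taken from the dataset).
toy = pd.DataFrame({
    "Crime ID": ["a1", np.nan, "c3", np.nan],
    "Crime type": ["Burglary", "Anti-social behaviour",
                   "Vehicle crime", "Anti-social behaviour"],
    "Last outcome category": ["Under investigation", np.nan,
                              "Unable to prosecute suspect", np.nan],
})

# Which crime types do the missing outcomes belong to?
missing = toy["Last outcome category"].isna()
missing_by_type = toy.loc[missing, "Crime type"].value_counts()
print(missing_by_type)

# Fill the gaps with an explicit label, then confirm none remain.
toy["Last outcome category"] = toy["Last outcome category"].fillna("Outcome not recorded")
remaining = int(toy["Last outcome category"].isna().sum())
print(remaining)
```

Writing the result to a new file name instead of overwriting `cleaned_crime.csv` would keep this step repeatable; that is a design choice rather than a requirement.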


In [16]:
# Explore how the dataset is distributed across time
# (Month is stored as text in the CSV, so parse it before grouping by period)
month = pd.to_datetime(df["Month"])
df.groupby(month.dt.to_period("M")).size()




Month
2024-06    29294
2024-07    30876
2024-08    28420
2024-09    28069
2024-10    28444
2024-11    26107
2024-12    25276
2025-01    25226
2025-02    24165
2025-03    27428
2025-04    26746
2025-05    27736
2025-06    28123
2025-07    28766
Freq: M, dtype: int64

### Time Coverage & Trends
The dataset spans from **June 2024 to July 2025**, covering 14 months.  
Monthly crime volumes range from ~24k to ~31k incidents, with a peak in **July 2024 (30,876 crimes)** and a low in **February 2025 (24,165 crimes)**.  
Crime levels dip during the winter months (Dec 2024 – Feb 2025). With only one winter in the data, this is consistent with a seasonal pattern in which crime decreases in colder months, but it cannot confirm one.
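
The coverage and the extremes can be read directly from the monthly totals. A minimal sketch, using the figures copied from the output above rather than recomputing them from `df`:

```python
import pandas as pd

# Monthly totals copied from the notebook output above.
monthly = pd.Series(
    [29294, 30876, 28420, 28069, 28444, 26107, 25276,
     25226, 24165, 27428, 26746, 27736, 28123, 28766],
    index=pd.period_range("2024-06", "2025-07", freq="M"),
    name="crimes",
)

n_months = len(monthly)
peak_month, peak_value = monthly.idxmax(), int(monthly.max())
low_month, low_value = monthly.idxmin(), int(monthly.min())
print(n_months, peak_month, peak_value, low_month, low_value)
```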


In [17]:
# Check which crime types are most common
df["Crime type"].value_counts()


Crime type
Violence and sexual offences    155441
Shoplifting                      38741
Vehicle crime                    32797
Criminal damage and arson        29273
Anti-social behaviour            26227
Other theft                      23717
Public order                     21541
Burglary                         17889
Drugs                            10152
Other crime                       7982
Robbery                           7942
Possession of weapons             7587
Theft from the person             3086
Bicycle theft                     2301
Name: count, dtype: int64

### Crime Type Distribution
The most frequent category is **Violence and sexual offences** (155,441 incidents, about 40% of all records).  
Other major categories include **Shoplifting (38.7k)**, **Vehicle crime (32.8k)**, and **Criminal damage and arson (29.3k)**.  
Less common categories, such as **Bicycle theft (2.3k)** and **Theft from the person (3.1k)**, occur at much lower volumes.
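
Shares make the counts easier to compare. The sketch below rebuilds the counts from the output above as a Series (on the real data, `df["Crime type"].value_counts(normalize=True)` would give the same proportions directly):

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({
    "Violence and sexual offences": 155441,
    "Shoplifting": 38741,
    "Vehicle crime": 32797,
    "Criminal damage and arson": 29273,
    "Anti-social behaviour": 26227,
    "Other theft": 23717,
    "Public order": 21541,
    "Burglary": 17889,
    "Drugs": 10152,
    "Other crime": 7982,
    "Robbery": 7942,
    "Possession of weapons": 7587,
    "Theft from the person": 3086,
    "Bicycle theft": 2301,
}, name="count")

# Percentage share of each crime type, rounded to one decimal place.
share_pct = (counts / counts.sum() * 100).round(1)
print(share_pct.head(4))
```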


In [18]:
# Identify geographic hotspots (in a notebook, only the last expression's result is displayed)
df["Location"].value_counts().head(10)
df["LSOA name"].value_counts().head(10)


LSOA name
Birmingham 138A       6820
Walsall 030A          2607
Coventry 031E         2414
Birmingham 135C       2345
Birmingham 050E       2158
Coventry 031F         2109
Birmingham 050F       2094
Wolverhampton 020H    2031
Solihull 009A         1571
Birmingham 136B       1456
Name: count, dtype: int64

### Geographic Hotspots
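Recorded crimes are concentrated in a small number of LSOAs:  
- **Birmingham** dominates: **Birmingham 138A** alone has 6,820 records, more than twice the next LSOA, and four other Birmingham LSOAs (135C, 050E, 050F, 136B) appear in the top 10.  
- **Walsall 030A** (2,607) and **Coventry 031E/031F** (2,414 and 2,109) also appear as notable hotspots.  
- Wolverhampton 020H and Solihull 009A complete the top 10.  

These are raw counts, not rates per resident. The pattern is consistent with population density and city-centre activity driving the concentration of recorded crimes, but the counts alone do not establish it.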
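
To compare areas rather than individual LSOAs, the top-10 counts can be grouped by the area part of the LSOA name. This is a sketch under the assumption that every name has the form `"<area> <code>"`. It covers only the ten LSOAs shown above, so the totals are not area-wide figures.

```python
import pandas as pd

# Top-10 LSOA counts copied from the output above.
top_lsoas = pd.Series({
    "Birmingham 138A": 6820,
    "Walsall 030A": 2607,
    "Coventry 031E": 2414,
    "Birmingham 135C": 2345,
    "Birmingham 050E": 2158,
    "Coventry 031F": 2109,
    "Birmingham 050F": 2094,
    "Wolverhampton 020H": 2031,
    "Solihull 009A": 1571,
    "Birmingham 136B": 1456,
}, name="count")

# Drop the trailing LSOA code ("138A") to get the area name ("Birmingham"),
# then sum the counts per area.
by_area = (
    top_lsoas.groupby(lambda name: name.rsplit(" ", 1)[0])
    .sum()
    .sort_values(ascending=False)
)
print(by_area)
```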


In [19]:
# Discover how many crimes are solved vs unsolved
df["Last outcome category"].value_counts()


Last outcome category
Unable to prosecute suspect                            171805
Investigation complete; no suspect identified          111530
Outcome not recorded                                    26227
Under investigation                                     25363
Awaiting court outcome                                  16305
Local resolution                                        10078
Court result unavailable                                 9879
Status update unavailable                                6092
Action to be taken by another organisation               4351
Offender given a caution                                 2650
Further investigation is not in the public interest       157
Formal action is not in the public interest                95
Further action is not in the public interest               75
Suspect charged as part of another case                    69
Name: count, dtype: int64

### Outcome Categories
The most common outcomes are:  
- **Unable to prosecute suspect (171k cases)**  
- **Investigation complete; no suspect identified (111k cases)**  

Combined, these two categories account for about 74% of all records, so a majority of recorded crimes do not result in prosecution.  
~26k cases remain as **"Outcome not recorded"**, while a smaller proportion are **under investigation (25k)** or awaiting outcomes.  
This highlights both reporting delays and challenges in securing successful case outcomes.
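
The combined share of the two largest outcome categories can be checked from the counts above. A minimal sketch; the list name `no_prosecution` is introduced here for illustration and is not an official grouping:

```python
import pandas as pd

# Counts copied from the value_counts() output above.
outcomes = pd.Series({
    "Unable to prosecute suspect": 171805,
    "Investigation complete; no suspect identified": 111530,
    "Outcome not recorded": 26227,
    "Under investigation": 25363,
    "Awaiting court outcome": 16305,
    "Local resolution": 10078,
    "Court result unavailable": 9879,
    "Status update unavailable": 6092,
    "Action to be taken by another organisation": 4351,
    "Offender given a caution": 2650,
    "Further investigation is not in the public interest": 157,
    "Formal action is not in the public interest": 95,
    "Further action is not in the public interest": 75,
    "Suspect charged as part of another case": 69,
}, name="count")

# Illustrative grouping of the two largest categories.
no_prosecution = [
    "Unable to prosecute suspect",
    "Investigation complete; no suspect identified",
]
share = outcomes[no_prosecution].sum() / outcomes.sum()
print(f"{share:.1%}")
```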


In [22]:
# Cross analysis of seasonal breakdown
# Note: counting "Crime ID" skips rows where it is missing, so crime types without IDs show 0
df.pivot_table(
    index="Season",
    columns="Crime type",
    values="Crime ID",
    aggfunc="count",
    fill_value=0
)


Season,Anti-social behaviour,Bicycle theft,Burglary,Criminal damage and arson,Drugs,Other crime,Other theft,Possession of weapons,Public order,Robbery,Shoplifting,Theft from the person,Vehicle crime,Violence and sexual offences
Autumn,0,561,4286,6067,2205,1707,5354,1556,4402,1742,8912,726,7352,32276
Spring,0,461,3540,6251,2196,1586,4763,1683,4479,1629,8655,663,6983,33152
Summer,0,966,6418,11436,3497,3034,9001,2946,8884,3247,13544,1093,12142,59104
Winter,0,313,3645,5519,2254,1655,4599,1402,3776,1324,7630,604,6320,30909


### Seasonal Breakdown
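Seasonal analysis shows the largest totals in **Summer**, with:  
- **59k Violence and sexual offences**  
- **13.5k Shoplifting**  
- **11.4k Criminal damage and arson**  

These totals are not directly comparable across seasons. If `Season` follows the usual June–August definition of summer, the data contains five summer months (Jun–Aug 2024 and Jun–Jul 2025) but only three months for each other season, which inflates the Summer figures. Per month, Violence and sexual offences average about 11.8k in Summer against about 10.3k in Winter, so the gap remains but is much smaller than the raw totals suggest.  
Winter has the lowest totals for most categories, including shoplifting and vehicle crime, but not all: Burglary, Drugs, and Other crime are slightly higher in Winter than in Spring.  
The **Anti-social behaviour** column is 0 in every season because the pivot counts `Crime ID`, which is missing for those records. It does not mean no incidents occurred.  
Overall, the figures are consistent with seasonal and environmental influences, such as increased outdoor activity in summer, but one year of data cannot confirm them.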
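
Season totals depend on how many months each season contains, so a per-month average is a fairer comparison. The sketch below reuses the monthly totals from the time-coverage output. The `SEASONS` mapping is an assumption (December–February as Winter, and so on) and may differ from how the `Season` column was actually built.

```python
import pandas as pd

# Monthly totals copied from the time-coverage output in this notebook.
monthly = pd.Series(
    [29294, 30876, 28420, 28069, 28444, 26107, 25276,
     25226, 24165, 27428, 26746, 27736, 28123, 28766],
    index=pd.period_range("2024-06", "2025-07", freq="M"),
    name="crimes",
)

# Assumed month-to-season mapping (hypothetical; check against the Season column).
SEASONS = {12: "Winter", 1: "Winter", 2: "Winter",
           3: "Spring", 4: "Spring", 5: "Spring",
           6: "Summer", 7: "Summer", 8: "Summer",
           9: "Autumn", 10: "Autumn", 11: "Autumn"}
season = pd.Series([SEASONS[m] for m in monthly.index.month],
                   index=monthly.index, name="Season")

# Number of months, total records, and average records per month for each season.
summary = monthly.groupby(season).agg(months="count", total="sum", per_month="mean")
print(summary.round(0))
```

Under this mapping Summer covers five months and the other seasons three each, so the per-month column is the one to compare.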
