# Exploratory Data Analysis (EDA) – West Midlands Crime (Jun 2024 – Jul 2025)

This notebook explores the cleaned crime dataset to uncover key patterns and trends.  
Exploratory Data Analysis (EDA) helps us move beyond raw inspection and understand the story the data tells.  

## Objectives
- Summarise the dataset: coverage, size, and structure  
- Explore crime volumes over time (monthly totals)  
- Identify the most common crime types  
- Highlight geographic hotspots (locations/LSOAs)  
- Review outcomes (solved vs unresolved)  

## Notebook Roadmap
1. **Dataset Overview**  
   - Time range, record count, column types  

2. **Temporal Analysis**  
   - Monthly crime trends  
   - Seasonal variation  

3. **Categorical Analysis**  
   - Distribution of crime types  
   - Outcome categories  

4. **Geographic Patterns**  
   - Top locations / hotspots  

Each section will include code, outputs, and a short narrative summarising the findings.


In [1]:
# Load the cleaned crime dataset
import pandas as pd
from scipy.stats import chi2_contingency

cleaned_path = '../data/processed/cleaned_crime.csv'
df = pd.read_csv(cleaned_path)
print(f"Loaded cleaned dataset with shape: {df.shape}")
df.head()


Loaded cleaned dataset with shape: (384676, 15)


Unnamed: 0,Crime ID,Month,Reported by,Falls within,Longitude,Latitude,Location,LSOA code,LSOA name,Crime type,Last outcome category,Year,MonthName,MonthNum,Season
0,b345c4eb846684666b656286751722fd142fdc312a8fc7...,2024-06-01,West Midlands Police,West Midlands Police,-1.851067,52.588979,On or near Crown Lane,E01009417,Birmingham 001A,Vehicle crime,Investigation complete; no suspect identified,2024,June,6,Summer
1,7ef5395d6029ed9c8409821a7ef1a2781d5800fb924573...,2024-06-01,West Midlands Police,West Midlands Police,-1.845479,52.591165,On or near Four Oaks Common Road,E01009417,Birmingham 001A,Vehicle crime,Investigation complete; no suspect identified,2024,June,6,Summer
2,44ca5b37b556eef23f84d3063e15fe885e008643d035e4...,2024-06-01,West Midlands Police,West Midlands Police,-1.845479,52.591165,On or near Four Oaks Common Road,E01009417,Birmingham 001A,Vehicle crime,Investigation complete; no suspect identified,2024,June,6,Summer
3,048f1833119d4bf8fac88cac552d07971731a0e7b37011...,2024-06-01,West Midlands Police,West Midlands Police,-1.851067,52.588979,On or near Crown Lane,E01009417,Birmingham 001A,Violence and sexual offences,Unable to prosecute suspect,2024,June,6,Summer
4,779ba19c7294d2a78ef25196d23303ae51ee5b46a99fc2...,2024-06-01,West Midlands Police,West Midlands Police,-1.845479,52.591165,On or near Four Oaks Common Road,E01009417,Birmingham 001A,Violence and sexual offences,Unable to prosecute suspect,2024,June,6,Summer


In [2]:
# Data inspection of the new csv
df.info()
df.head()
df.isna().sum()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 384676 entries, 0 to 384675
Data columns (total 15 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   Crime ID               358449 non-null  object 
 1   Month                  384676 non-null  object 
 2   Reported by            384676 non-null  object 
 3   Falls within           384676 non-null  object 
 4   Longitude              384676 non-null  float64
 5   Latitude               384676 non-null  float64
 6   Location               384676 non-null  object 
 7   LSOA code              384676 non-null  object 
 8   LSOA name              384676 non-null  object 
 9   Crime type             384676 non-null  object 
 10  Last outcome category  384676 non-null  object 
 11  Year                   384676 non-null  int64  
 12  MonthName              384676 non-null  object 
 13  MonthNum               384676 non-null  int64  
 14  Season                 384676 non-nu

Crime ID                 26227
Month                        0
Reported by                  0
Falls within                 0
Longitude                    0
Latitude                     0
Location                     0
LSOA code                    0
LSOA name                    0
Crime type                   0
Last outcome category        0
Year                         0
MonthName                    0
MonthNum                     0
Season                       0
dtype: int64

In [3]:
df["Last outcome category"] = df["Last outcome category"].fillna("Outcome not recorded")
df["Last outcome category"].isna().sum()
df.to_csv("../data/processed/cleaned_crime.csv", index=False)
print("Updated cleaned dataset saved successfully!")




Updated cleaned dataset saved successfully!


In [4]:
# Analyse descriptive statistics for numerical columns
df.describe()


Unnamed: 0,Longitude,Latitude,Year,MonthNum
count,384676.0,384676.0,384676.0,384676.0
mean,-1.898442,52.493743,2024.489217,6.546834
std,0.176816,0.062383,0.499884,3.112873
min,-2.205059,52.348321,2024.0,1.0
25%,-2.015163,52.446838,2024.0,4.0
50%,-1.919471,52.484911,2024.0,7.0
75%,-1.828461,52.536918,2025.0,9.0
max,-1.431931,52.661008,2025.0,12.0


### Dataset Overview and Updates

The cleaned dataset contains **384,676 rows and 15 columns**.  
A descriptive summary of numeric fields shows:  
- **Longitude / Latitude**: Values fall within the expected West Midlands range (approx. -2.2 to -1.4 longitude, 52.3 to 52.7 latitude).  
- **Year**: Only 2024 and 2025 are present.  
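- **MonthNum**: Spans 1–12, so every calendar month appears at least once in the dataset.  
- **Quartiles**: The mean of `Year` is 2024.49, so roughly 49% of records fall in 2025. The `MonthNum` quartiles (4, 7, 9) describe how records are spread across calendar months; they say nothing about monthly record volumes.  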

### Handling Missing Outcomes

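Inspection showed ~26k missing values in the **`Last outcome category`** column. The `isna()` output above reports 0 for this column, presumably because the file had already been overwritten by an earlier run of the fill step.  
The **"Outcome not recorded"** count (26,227) matches both the number of missing `Crime ID` values and the number of **Anti-social behaviour** records, and the Crime Type × Outcome table later in this notebook shows that every one of these rows is Anti-social behaviour. These entries therefore reflect a category for which no outcome is published, a known limitation of the open dataset.  

To avoid losing these records, missing values were replaced with the label **"Outcome not recorded"**.  
The processed dataset (`cleaned_crime.csv`) was **overwritten in place**, so this update is now permanent and consistent across all future analyses.  

This ensures:  
- No crimes are excluded due to missing outcomes.  
- All records without a published outcome are explicitly flagged as `"Outcome not recorded"`.  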
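
A minimal sketch of the fill-and-check step described above, run on a small toy frame rather than the real file (the rows and the name `placeholder_by_type` are illustrative assumptions). It shows how to confirm which crime types the placeholder label ends up attached to.

```python
import pandas as pd

# Toy rows standing in for the real data (illustrative values only)
toy = pd.DataFrame({
    "Crime ID": ["a1", None, "b2", None],
    "Crime type": ["Burglary", "Anti-social behaviour", "Drugs", "Anti-social behaviour"],
    "Last outcome category": ["Under investigation", None, "Local resolution", None],
})

# Replace missing outcomes with an explicit label instead of dropping the rows
toy["Last outcome category"] = toy["Last outcome category"].fillna("Outcome not recorded")

# Count, per crime type, how many rows carry the placeholder label
placeholder_by_type = (
    toy.loc[toy["Last outcome category"] == "Outcome not recorded", "Crime type"]
    .value_counts()
)
print(placeholder_by_type)
```

On the real data the same check should show whether the placeholder is confined to a single category.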


In [5]:
# Ensure 'Month' is datetime
df["Month"] = pd.to_datetime(df["Month"])

# Set the index to 'Month'
df = df.set_index("Month")

# Find min/max range by month
print(df.index.min(), df.index.max())
print(df.groupby(df.index.to_period("M")).size())

2024-06-01 00:00:00 2025-07-01 00:00:00
Month
2024-06    29294
2024-07    30876
2024-08    28420
2024-09    28069
2024-10    28444
2024-11    26107
2024-12    25276
2025-01    25226
2025-02    24165
2025-03    27428
2025-04    26746
2025-05    27736
2025-06    28123
2025-07    28766
Freq: M, dtype: int64


### Time Coverage & Trends
The dataset spans from **June 2024 to July 2025**, covering 14 months.  
Monthly crime volumes range from ~24k to 31k incidents, with a peak in **July 2024 (30,876 crimes)** and a low in **February 2025 (24,165 crimes)**.  
Crime levels dip during the winter months (Dec 2024 – Feb 2025), suggesting a seasonal pattern where crime decreases in colder months.


In [6]:
# Check which crime types are most common
df["Crime type"].value_counts()


Crime type
Violence and sexual offences    155441
Shoplifting                      38741
Vehicle crime                    32797
Criminal damage and arson        29273
Anti-social behaviour            26227
Other theft                      23717
Public order                     21541
Burglary                         17889
Drugs                            10152
Other crime                       7982
Robbery                           7942
Possession of weapons             7587
Theft from the person             3086
Bicycle theft                     2301
Name: count, dtype: int64

### Crime Type Distribution
The most frequent category is **Violence and sexual offences** (155k incidents), about 40% of all recorded crimes.  
Other major categories include **Shoplifting (38.7k)**, **Vehicle crime (32.8k)**, and **Criminal damage and arson (29.3k)**.  
Less common categories, such as **Bicycle theft (2.3k)** and **Theft from the person (3.1k)**, occur at much lower volumes.


In [7]:
# Identify geographic hotspots
df["Location"].value_counts().head(10)
df["LSOA name"].value_counts().head(10)


LSOA name
Birmingham 138A       6820
Walsall 030A          2607
Coventry 031E         2414
Birmingham 135C       2345
Birmingham 050E       2158
Coventry 031F         2109
Birmingham 050F       2094
Wolverhampton 020H    2031
Solihull 009A         1571
Birmingham 136B       1456
Name: count, dtype: int64

### Geographic Hotspots
Crimes are most concentrated in urban centres:  
- **Birmingham** dominates: five of the top ten LSOAs are in Birmingham, and 138A (6,820) records more than 2.5 times as many crimes as the next LSOA.  
- **Walsall 030A** and **Coventry 031E/031F** also appear as notable hotspots.  
- Other key areas include Wolverhampton and Solihull.  

This is consistent with higher population density and city-centre activity driving the concentration of recorded crimes, although these are raw counts rather than rates per resident.


In [8]:
# Discover how many crimes are solved vs unsolved
df["Last outcome category"].value_counts()


Last outcome category
Unable to prosecute suspect                            171805
Investigation complete; no suspect identified          111530
Outcome not recorded                                    26227
Under investigation                                     25363
Awaiting court outcome                                  16305
Local resolution                                        10078
Court result unavailable                                 9879
Status update unavailable                                6092
Action to be taken by another organisation               4351
Offender given a caution                                 2650
Further investigation is not in the public interest       157
Formal action is not in the public interest                95
Further action is not in the public interest               75
Suspect charged as part of another case                    69
Name: count, dtype: int64

### Outcome Categories
The most common outcomes are:  
- **Unable to prosecute suspect (~172k cases)**  
- **Investigation complete; no suspect identified (~112k cases)**  

Combined, these two categories account for about 74% of records, so a clear majority of crimes do not result in prosecution.  
~26k cases remain as **"Outcome not recorded"**, while a smaller proportion are **under investigation (25k)** or awaiting court outcome (16k).  
This highlights both reporting delays and challenges in securing successful case outcomes.
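
As a rough check on the shares quoted above, the counts from the `value_counts()` output can be turned into percentages. The sketch below copies those counts by hand into a Series (the names `outcome_counts` and `shares` are assumptions); on the live DataFrame, `df["Last outcome category"].value_counts(normalize=True)` would give the same proportions.

```python
import pandas as pd

# Counts copied from the value_counts() output above
outcome_counts = pd.Series({
    "Unable to prosecute suspect": 171805,
    "Investigation complete; no suspect identified": 111530,
    "Outcome not recorded": 26227,
    "Under investigation": 25363,
    "Awaiting court outcome": 16305,
    "Local resolution": 10078,
    "Court result unavailable": 9879,
    "Status update unavailable": 6092,
    "Action to be taken by another organisation": 4351,
    "Offender given a caution": 2650,
    "Further investigation is not in the public interest": 157,
    "Formal action is not in the public interest": 95,
    "Further action is not in the public interest": 75,
    "Suspect charged as part of another case": 69,
})

# Percentage share of each outcome, rounded to one decimal place
shares = (outcome_counts / outcome_counts.sum() * 100).round(1)

# Combined share of the two largest outcomes
top_two_share = outcome_counts.iloc[:2].sum() / outcome_counts.sum() * 100
print(shares.head(4))
print(f"Top two outcomes combined: {top_two_share:.1f}%")
```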


In [9]:
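# Cross analysis of seasonal breakdown
# Note: counting "Crime ID" skips rows where it is missing, so crime types
# without a Crime ID (Anti-social behaviour) show 0 in every season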
df.pivot_table(
    index="Season",
    columns="Crime type",
    values="Crime ID",
    aggfunc="count",
    fill_value=0
)


Season,Anti-social behaviour,Bicycle theft,Burglary,Criminal damage and arson,Drugs,Other crime,Other theft,Possession of weapons,Public order,Robbery,Shoplifting,Theft from the person,Vehicle crime,Violence and sexual offences
Autumn,0,561,4286,6067,2205,1707,5354,1556,4402,1742,8912,726,7352,32276
Spring,0,461,3540,6251,2196,1586,4763,1683,4479,1629,8655,663,6983,33152
Summer,0,966,6418,11436,3497,3034,9001,2946,8884,3247,13544,1093,12142,59104
Winter,0,313,3645,5519,2254,1655,4599,1402,3776,1324,7630,604,6320,30909


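### Seasonal Breakdown
Raw seasonal totals are highest in **Summer**, with:  
- **59k Violence and sexual offences**  
- **13.5k Shoplifting**  
- **11.4k Criminal damage and arson**  

These totals are not directly comparable across seasons: Summer covers five months of this dataset (Jun–Aug 2024 and Jun–Jul 2025), while each other season covers three. Per month, Violence and sexual offences average about 11.8k in Summer against 10.3k in Winter, and Shoplifting is actually lower per month in Summer (about 2.7k) than in Autumn (about 3.0k).  
Winter has the lowest totals for most categories, though not all: Drugs, Burglary, and Other crime are slightly higher in Winter than in Spring.  
Anti-social behaviour shows 0 in every season because the pivot counts non-null `Crime ID` values and these records have no Crime ID.  
Overall, the pattern is consistent with seasonal and environmental influences, such as increased outdoor activity in summer, but the effect is smaller than the raw totals suggest.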
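
Because the seasons cover different numbers of months in this dataset, a per-month rate is a fairer comparison than a raw total. The sketch below normalises one column of the pivot output above; the month counts assume meteorological seasons (Dec–Feb, Mar–May, Jun–Aug, Sep–Nov) applied to the Jun 2024 – Jul 2025 window, and the variable names are illustrative.

```python
import pandas as pd

# Season totals for one category, copied from the pivot output above
violence = pd.Series({"Autumn": 32276, "Spring": 33152, "Summer": 59104, "Winter": 30909})

# Number of calendar months each season contributes in Jun 2024 - Jul 2025
# (assumption: Summer = Jun-Aug 2024 plus Jun-Jul 2025, i.e. five months)
months_in_season = pd.Series({"Autumn": 3, "Spring": 3, "Summer": 5, "Winter": 3})

# Average incidents per month within each season
per_month = (violence / months_in_season).round(0)
print(per_month.sort_values(ascending=False))
```

The same division can be applied to the whole pivot table with `pivot.div(months_in_season, axis=0)`.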


## Advanced Exploratory Data Analysis (EDA)

Beyond the initial overview of crime types, locations, outcomes, and seasonality, 
we conduct deeper checks to identify potential anomalies, data quality issues, 
and nuanced patterns that may not be visible in high-level summaries.

### 1. Data Quality Checks
- **Duplicates**: Confirm that no duplicate records remain after cleaning.  
- **Geographic range**: Check whether all latitude/longitude values fall within 
  the West Midlands region.  
- **Location labels**: Inspect for vague or unhelpful labels (e.g., 
  "No location specified", "On or near Other").  

### 2. Granular Time Analysis
- Compare **monthly totals between 2024 and 2025** directly.  
- Look for any unusual spikes or dips in particular months that may indicate 
  reporting delays or real-world anomalies.  
- Consider rolling averages (3-month windows) to smooth seasonal variation.

### 3. Cross-Feature Analysis
- **Crime type × Outcome**: Which crime categories are most/least likely to 
  result in a resolution?  
- **Location × Crime type**: Which crime categories dominate supermarkets, 
  parking areas, hospitals, or residential areas?  

### 4. Anomaly Detection
- Detect **outlier months** (unusually high or low volumes).  
- Flag categories where year-to-year differences are unexpectedly large.  
- Identify if certain LSOAs contribute disproportionately to particular crime types.

### 5. Preparing for Visualisation
- Create pivot tables and percentage breakdowns that will make charts easier 
  to interpret.  
- For example, calculate the percentage of unresolved outcomes by crime type, 
  or the share of each location type in overall crime.

---

This advanced EDA section allows us to move from a simple overview into 
richer insights that highlight the complexity of crime data. 
The aim is to capture nuances, anomalies, and cross-category patterns 
that would be essential for decision-making in a real-world intelligence setting.
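
As one illustration of the rolling-average idea listed under Granular Time Analysis, here is a minimal sketch using the monthly totals printed earlier in this notebook. The totals are copied by hand, and the choice of a trailing (rather than centred) 3-month window is an assumption.

```python
import pandas as pd

# Monthly totals copied from the Time Coverage output earlier in this notebook
monthly = pd.Series(
    [29294, 30876, 28420, 28069, 28444, 26107, 25276,
     25226, 24165, 27428, 26746, 27736, 28123, 28766],
    index=pd.period_range("2024-06", "2025-07", freq="M"),
    name="crimes",
)

# Trailing 3-month mean; the first two months have no full window and stay NaN
rolling_3m = monthly.rolling(window=3).mean()
print(rolling_3m.round(0))
```

With only 14 months of data the smoothed series has 12 usable points, so it is better suited to a chart than to further testing.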


In [10]:
# Check geographical range
df["Longitude"].describe(), df["Latitude"].describe()


(count    384676.000000
 mean         -1.898442
 std           0.176816
 min          -2.205059
 25%          -2.015163
 50%          -1.919471
 75%          -1.828461
 max          -1.431931
 Name: Longitude, dtype: float64,
 count    384676.000000
 mean         52.493743
 std           0.062383
 min          52.348321
 25%          52.446838
 50%          52.484911
 75%          52.536918
 max          52.661008
 Name: Latitude, dtype: float64)

### Geographic Range Check
Longitude values (-2.21 to -1.43) and latitude values (52.35 to 52.66) 
fall within the expected range for the West Midlands region.  
No geographic anomalies were detected at the bounding-box level, which is 
consistent with the dataset being restricted to the correct policing area.
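
The range check above can also be expressed as a reusable filter that returns any rows outside a bounding box. This is a sketch only: the box limits are assumptions taken from the observed min/max values rather than an official force boundary, and `outside_bbox` is a hypothetical helper name.

```python
import pandas as pd

# Assumed bounding box, derived from the observed min/max above (not an official boundary)
LON_RANGE = (-2.21, -1.43)
LAT_RANGE = (52.34, 52.67)

def outside_bbox(frame, lon_range=LON_RANGE, lat_range=LAT_RANGE):
    """Return the rows whose coordinates fall outside the bounding box."""
    in_lon = frame["Longitude"].between(*lon_range)
    in_lat = frame["Latitude"].between(*lat_range)
    return frame[~(in_lon & in_lat)]

# Toy example: one point from the dataset preview and one point far outside the region
toy = pd.DataFrame({
    "Longitude": [-1.851067, -0.1276],
    "Latitude": [52.588979, 51.5072],
})
print(outside_bbox(toy))
```

A bounding box is a coarse test; a point-in-polygon check against the force boundary would be stricter.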


In [11]:
# Check for unneeded location labels
df["Location"].value_counts().head(20)


Location
On or near Parking Area                           18325
On or near Supermarket                            17835
On or near Petrol Station                         10732
On or near Shopping Area                          10495
On or near Hospital                                2995
On or near Sports/Recreation Area                  2625
On or near High Street                             2360
On or near Nightclub                               2200
On or near Further/Higher Educational Building     1860
On or near New Street                              1496
On or near Worcester Street                        1347
On or near Conference/Exhibition Centre            1328
On or near Police Station                          1319
On or near Theatre/Concert Hall                    1308
On or near Bus/Coach Station                       1017
On or near Park/Open Space                          944
On or near Moor Street                              903
On or near Greenfield Crescent         

### Location Label Quality
The most frequent locations include **parking areas (18.3k)**, **supermarkets (17.8k)**, 
and **petrol stations (10.7k)**.  
These location categories are useful and interpretable, but some labels such as 
“On or near Union Street” or “On or near Pavilion Drive” are highly localised.  
No vague labels like “No location specified” were observed in the top locations, 
indicating good quality in the location field.


In [12]:
# Compare year-to-year monthly trends for anomalies
monthly_counts = df.groupby(["Year", "MonthNum"]).size().unstack(level=0, fill_value=0)
monthly_counts


MonthNum,2024,2025
1,0,25226
2,0,24165
3,0,27428
4,0,26746
5,0,27736
6,29294,28123
7,30876,28766
8,28420,0
9,28069,0
10,28444,0


### Year-to-Year Monthly Trends
Only June and July appear in both years, so a direct year-to-year comparison is limited to those two months.  
- Both were lower in 2025: June fell from 29.3k to 28.1k (about -4%) and July from 30.9k to 28.8k (about -7%).  
- July is the highest month in each year, with July 2024 (30.9k) the overall peak.  
- The single winter in the data (Dec 2024 – Feb 2025) shows the lowest volumes, with February 2025 
  at just 24.2k incidents.  

This aligns with the expected pattern of higher crime in warmer months, although one annual cycle is not enough to confirm it.
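
For the months present in both years, the year-to-year change can be computed directly. The sketch below copies the June and July counts from the table above; the names `overlap` and `pct_change` are illustrative.

```python
import pandas as pd

# June and July counts copied from the monthly_counts output above
overlap = pd.DataFrame(
    {2024: [29294, 30876], 2025: [28123, 28766]},
    index=pd.Index([6, 7], name="MonthNum"),
)

# Percentage change from 2024 to 2025 for each overlapping month
pct_change = ((overlap[2025] - overlap[2024]) / overlap[2024] * 100).round(1)
print(pct_change)
```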


In [13]:
# Identify which crimes are likely to lead to an outcome
crime_outcome = df.groupby(["Crime type", "Last outcome category"]).size().unstack(fill_value=0)
crime_outcome.head(10)


Crime type,Action to be taken by another organisation,Awaiting court outcome,Court result unavailable,Formal action is not in the public interest,Further action is not in the public interest,Further investigation is not in the public interest,Investigation complete; no suspect identified,Local resolution,Offender given a caution,Outcome not recorded,Status update unavailable,Suspect charged as part of another case,Unable to prosecute suspect,Under investigation
Anti-social behaviour,0,0,0,0,0,0,0,0,0,26227,0,0,0,0
Bicycle theft,4,24,34,0,0,0,1546,9,3,0,7,0,541,133
Burglary,4,758,510,1,0,0,12193,57,28,0,207,60,3086,985
Criminal damage and arson,39,842,492,2,4,0,13575,639,252,0,219,0,11553,1656
Drugs,21,971,612,15,7,2,565,4518,221,0,497,0,1679,1044
Other crime,247,562,405,2,1,89,1127,97,56,0,292,0,4460,644
Other theft,55,199,128,0,1,0,12716,248,33,0,304,1,8593,1439
Possession of weapons,24,1092,724,2,4,0,1469,121,201,0,273,0,2997,680
Public order,88,1191,631,55,4,8,4751,371,138,0,374,0,12225,1705
Robbery,1,500,321,0,0,0,3807,68,12,0,193,0,2299,741


### Crime Type × Outcome
Outcomes vary widely by crime type:  
- **Burglary**: ~12.2k of 17.9k cases (about 68%) had no suspect identified, with relatively few 
  resolved outcomes.  
- **Criminal damage and arson**: The largest outcome was no suspect identified (13.6k), followed by 
  unable to prosecute (11.6k); 639 had local resolutions.  
- **Drugs**: Stand out with higher resolution rates (4.5k local resolutions, about 45% of drug offences), 
  reflecting a greater likelihood of enforcement.  

This analysis highlights how enforcement success differs dramatically 
between crime categories.
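
Raw counts make categories of very different size hard to compare, so a row-wise percentage view can help. The sketch below uses two rows from the table above, with the remaining outcome columns collapsed into a single "Other outcomes" figure for brevity (that grouping and the variable names are assumptions made for this example).

```python
import pandas as pd

# Two rows copied from the crime_outcome table above; "Other outcomes" is the
# row total minus the three named columns (a simplification for this sketch)
subset = pd.DataFrame(
    {
        "Investigation complete; no suspect identified": [12193, 565],
        "Local resolution": [57, 4518],
        "Offender given a caution": [28, 221],
        "Other outcomes": [5611, 4848],
    },
    index=pd.Index(["Burglary", "Drugs"], name="Crime type"),
)

# Divide each row by its own total to get within-crime-type percentages
row_pct = subset.div(subset.sum(axis=1), axis=0).mul(100).round(1)
print(row_pct)
```

On the full table, `pd.crosstab(df["Crime type"], df["Last outcome category"], normalize="index")` produces the same kind of breakdown in one call.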


In [14]:
# Identify which crimes dominate hotspots
# Build the two tables the loop relies on: location frequencies and a
# location x crime type count matrix
top_hotspots = df["Location"].value_counts()
location_crime = pd.crosstab(df["Location"], df["Crime type"])

# For each top hotspot, show the dominant crime type and its count
for location in top_hotspots.head(10).index:
    crimes = location_crime.loc[location]
    dominant_crime = crimes.idxmax()
    dominant_count = crimes.max()
    print(f"{location}: {dominant_crime} ({dominant_count} incidents)")

### Hotspot × Crime Type

**Major roads** (e.g., A34, A4034):  
- Vehicle crime and violence offences are more common near major roads.  
- Likely linked to high traffic flow and concentrated pedestrian activity.

**Retail and residential areas:**  
- Shoplifting dominates in supermarkets, shopping centres, and petrol stations.  
- Anti-social behaviour and theft-related categories are also strongly represented.

**Night-time economy areas** (e.g., nightclubs, high streets):  
- Higher levels of violence and sexual offences, reflecting late-night risks.

**Institutions** (e.g., hospitals, universities):  
- Violence and sexual offences are most frequently reported.

**Takeaway:**  
Crime categories tend to cluster around specific environments (e.g., retail, nightlife, major roads).  
This provides a useful evidence base for resource allocation and targeted policing strategies.


In [None]:
# Identify high/low crime months
# Count crimes per calendar month (year and month together) so that June and
# July, the only months present in both years, are not added together
monthly_totals = df.groupby(df.index.to_period("M")).size()

# Identify the month(s) with highest and lowest crime totals
max_month = monthly_totals.idxmax()
min_month = monthly_totals.idxmin()
max_value = monthly_totals.max()
min_value = monthly_totals.min()

print(f"Highest crime month: {max_month} with {max_value:,} crimes")
print(f"Lowest crime month: {min_month} with {min_value:,} crimes")

# Flag months more than 2 standard deviations above or below the mean
mean_total, std_total = monthly_totals.mean(), monthly_totals.std()
outliers = monthly_totals[(monthly_totals - mean_total).abs() > 2 * std_total]
outliers



### Outlier Detection
Summing by month number alone would add the two Junes and the two Julys together (for July, 30,876 + 28,766 = 59.6k) and make both look like extreme outliers, so the check uses the 14 calendar-month totals instead:  
- **Highest month: July 2024 (30,876 crimes)**  
- **Lowest month: February 2025 (24,165 crimes)**  
- With a mean of about 27.5k and a standard deviation of about 1.8k, the ±2 SD band runs from 
  roughly 23.9k to 31.1k, and **no month falls outside it**.  

This indicates a summer peak and a winter trough rather than isolated outlier months, a pattern that should still be 
considered in operational planning.


## Statistical Testing

While exploratory analysis provides descriptive insights, 
statistical testing allows us to determine whether observed patterns 
are **significant** or could have occurred by chance.  

In this section, we will apply hypothesis tests to validate key findings:
- **Chi-square test**: Assess whether crime outcomes differ significantly by crime type.  
- **ANOVA**: Test whether seasonal differences in crime volumes are statistically significant.  
- **Correlation tests**: Identify crime types that are increasing or decreasing over time.  
- **Proportional tests**: Examine whether certain crimes are disproportionately concentrated in specific locations (e.g., shoplifting in supermarkets).  

This helps strengthen conclusions with statistical evidence, 
moving beyond descriptive patterns to more robust insights.


In [15]:
# Chi-square test: Do crime outcomes differ significantly by crime type?

from scipy.stats import chi2_contingency

# Create a contingency table of Crime type vs Last outcome category
contingency_table = pd.crosstab(df["Crime type"], df["Last outcome category"])

# Perform the Chi-square test
chi2, p, dof, expected = chi2_contingency(contingency_table)

print("Chi-square statistic:", chi2)
print("Degrees of freedom:", dof)
print("p-value:", p)

if p < 0.05:
    print("Result: There is a statistically significant association between crime type and outcome (p < 0.05).")
else:
    print("Result: No statistically significant association between crime type and outcome (p >= 0.05).")

Chi-square statistic: 586402.4416092532
Degrees of freedom: 169
p-value: 0.0
Result: There is a statistically significant association between crime type and outcome (p < 0.05).


### Chi-square Test: Crime Type × Outcome

We tested whether the distribution of outcomes differs significantly across crime types.  

- **Chi-square statistic**: 586,402.44  
- **Degrees of freedom**: 169  
- **p-value**: < 0.001  

Result: There is a **statistically significant association** between crime type and outcome.  
This means that the likelihood of a case leading to a particular outcome (e.g., no suspect identified, prosecution, caution) is **not random**, but strongly dependent on the type of crime committed.  
Two caveats apply: with ~385k records almost any difference is statistically significant, and the Anti-social behaviour row falls entirely under "Outcome not recorded", which inflates the statistic.  

For example, drug-related offences are more likely to end in a local resolution or court proceedings, 
while burglary or vehicle crime are more likely to remain unresolved.
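
Since the p-value says little at this sample size, an effect-size measure such as Cramér's V can indicate how strong the association is. This is an addition to the notebook's own analysis, not part of it; the sketch plugs in the chi-square statistic, record count, and 14 × 14 table shape reported above, and `cramers_v` is a hypothetical helper.

```python
import numpy as np

def cramers_v(chi2, n, n_rows, n_cols):
    """Cramér's V: sqrt(chi2 / (n * (min(rows, cols) - 1)))."""
    return np.sqrt(chi2 / (n * (min(n_rows, n_cols) - 1)))

# Values taken from the chi-square output above (14 crime types x 14 outcomes)
v = cramers_v(chi2=586402.4416092532, n=384676, n_rows=14, n_cols=14)
print(f"Cramér's V: {v:.3f}")
```

V ranges from 0 (no association) to 1 (perfect association). Because the Anti-social behaviour row contributes heavily to the statistic, recomputing it with that row excluded would be a sensible sensitivity check.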


In [16]:
# ANOVA test: Are there significant seasonal differences in crime volumes?
from scipy import stats

# Group crime counts by season
seasonal_counts = df.groupby("Season").size()

# Break down by season into separate arrays
winter_counts = df[df["Season"]=="Winter"].groupby("Month").size()
spring_counts = df[df["Season"]=="Spring"].groupby("Month").size()
summer_counts = df[df["Season"]=="Summer"].groupby("Month").size()
autumn_counts = df[df["Season"]=="Autumn"].groupby("Month").size()

# Run ANOVA test
anova = stats.f_oneway(winter_counts, spring_counts, summer_counts, autumn_counts)

print(f"F-statistic: {anova.statistic:.2f}, p-value: {anova.pvalue:.4f}")

if anova.pvalue < 0.05:
    print("Result: Seasonal differences in crime volumes are statistically significant (p < 0.05).")
else:
    print("Result: No statistically significant seasonal differences (p >= 0.05).")


F-statistic: 12.10, p-value: 0.0012
Result: Seasonal differences in crime volumes are statistically significant (p < 0.05).


### ANOVA Test: Seasonal Differences in Crime Volumes

We applied a one-way ANOVA test to compare monthly crime totals across the four seasons.  

- **F-statistic**: 12.10  
- **p-value**: 0.0012  

Result: The test indicates that **seasonal differences in crime volumes are statistically significant (p < 0.05)**.  
This supports earlier observations that summer months experience higher crime volumes compared to winter months. The result rests on only 14 monthly totals from a single annual cycle, with unequal groups (five Summer months, three in each other season), so it should be read as indicative rather than conclusive.
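
The test can be reproduced from the monthly totals printed earlier, which also makes the group sizes explicit. In this sketch the totals are copied by hand and the season mapping assumes meteorological seasons (Dec–Feb, Mar–May, Jun–Aug, Sep–Nov); the `Season` column in the dataset may have been built differently.

```python
import pandas as pd
from scipy import stats

# Monthly totals copied from the Time Coverage output earlier in this notebook
monthly = pd.Series(
    [29294, 30876, 28420, 28069, 28444, 26107, 25276,
     25226, 24165, 27428, 26746, 27736, 28123, 28766],
    index=pd.period_range("2024-06", "2025-07", freq="M"),
)

# Assumed month-to-season mapping (meteorological seasons)
season_of_month = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Autumn", 10: "Autumn", 11: "Autumn",
}
seasons = pd.Series([season_of_month[m] for m in monthly.index.month], index=monthly.index)

# Group sizes and means show how few observations each season has
grouped = monthly.groupby(seasons)
group_sizes = grouped.size()
group_means = grouped.mean().round(0)
print(pd.DataFrame({"months": group_sizes, "mean_crimes": group_means}))

# One-way ANOVA across the four seasonal groups
f_stat, p_value = stats.f_oneway(*[g.to_numpy() for _, g in grouped])
print(f"F-statistic: {f_stat:.2f}, p-value: {p_value:.4f}")
```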


In [17]:
# Create monthly counts per crime type
crime_trends = df.groupby([df.index.to_period("M"), "Crime type"]).size().unstack(fill_value=0)

# Reset index to ensure numeric time ordering
crime_trends.index = crime_trends.index.to_timestamp()

# Create a numeric time variable (month index)
crime_trends["MonthIndex"] = range(len(crime_trends))

# Spearman correlation of each crime type with time
correlations = crime_trends.drop(columns="MonthIndex").corrwith(crime_trends["MonthIndex"], method="spearman")

# Sort results
correlations.sort_values(ascending=False)


Crime type
Drugs                           0.729077
Possession of weapons           0.237624
Theft from the person           0.156216
Violence and sexual offences    0.107692
Shoplifting                    -0.098901
Criminal damage and arson      -0.169231
Public order                   -0.283516
Robbery                        -0.347635
Anti-social behaviour          -0.382839
Bicycle theft                  -0.406593
Other crime                    -0.440044
Vehicle crime                  -0.507692
Other theft                    -0.643956
Burglary                       -0.871288
dtype: float64

### Spearman Correlation: Crime Trends Over Time

We tested whether crime types showed increasing or decreasing trends across the dataset period (Jun 2024 – Jul 2025).  

**Key results:**  
- **Upward trends**:  
  - Drugs (+0.73) → clear increase in reported drug offences.  
  - Possession of weapons (+0.24) → weak increase.  
  - Theft from the person (+0.16) → slight upward movement.  

- **Stable / weak trend**:  
  - Violence and sexual offences (+0.11) → essentially flat.  
  - Shoplifting (-0.10) and Criminal damage and arson (-0.17) → no meaningful change.  

- **Downward trends**:  
  - Burglary (-0.87) → steep decline over time.  
  - Other theft (-0.64), Vehicle crime (-0.51) → clear decreases.  
  - Other crime (-0.44), Bicycle theft (-0.41), Anti-social behaviour (-0.38), Robbery (-0.35), and Public order (-0.28) → moderate to weak decreases.  

**Interpretation:**  
- Property-related crimes (burglary, theft, vehicle crime) are trending downwards.  
- Drug offences are trending upwards, and weapon possession shows a weaker rise, suggesting potential emerging issues in the region.  
- Violence and sexual offences remain consistently high but without strong directional change.  
- Each coefficient is based on only 14 monthly points and no p-values were computed, so the weaker correlations may reflect seasonality or noise rather than a real trend.
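
To judge which of these trends are distinguishable from noise, each coefficient can be paired with a p-value using `scipy.stats.spearmanr`. The sketch below runs on hypothetical monthly counts, not the real data, and `trend_table` is an illustrative helper name; it expects a DataFrame shaped like `crime_trends` without the `MonthIndex` column.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

def trend_table(monthly_by_type):
    """Spearman rho and p-value of each column against a 0..n-1 month index."""
    month_index = np.arange(len(monthly_by_type))
    rows = {}
    for crime_type, series in monthly_by_type.items():
        rho, p_value = spearmanr(month_index, series.to_numpy())
        rows[crime_type] = {"rho": rho, "p_value": p_value}
    return pd.DataFrame.from_dict(rows, orient="index").sort_values("rho")

# Hypothetical monthly counts for two made-up categories (illustration only)
toy = pd.DataFrame({
    "Falling type": [50, 48, 47, 45, 44, 40, 39, 35, 33, 30, 28, 27, 25, 20],
    "Flat type": [10, 12, 9, 11, 10, 12, 9, 11, 10, 12, 9, 11, 10, 12],
})
result = trend_table(toy)
print(result)
```

With 14 crime types tested at once, some correction for multiple comparisons would also be worth considering before labelling a trend significant.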


In [18]:
# Two-proportion z-test: Is shoplifting disproportionately concentrated in supermarkets?
from statsmodels.stats.proportion import proportions_ztest

# Build each boolean mask once and reuse it
is_supermarket = df["Location"].str.contains("Supermarket", case=False)
is_shoplifting = df["Crime type"] == "Shoplifting"

# Shoplifting in supermarkets
shop_super = (is_shoplifting & is_supermarket).sum()
total_super = is_supermarket.sum()

# Shoplifting in all other locations
shop_other = (is_shoplifting & ~is_supermarket).sum()
total_other = (~is_supermarket).sum()

# Run two-proportion z-test
count = [shop_super, shop_other]
nobs = [total_super, total_other]

stat, pval = proportions_ztest(count, nobs)

print(f"Z-statistic: {stat:.2f}, p-value: {pval:.4f}")
if pval < 0.05:
    print("Result: Shoplifting is disproportionately concentrated in supermarkets (p < 0.05).")
else:
    print("Result: No significant disproportion detected (p >= 0.05).")


Z-statistic: 192.80, p-value: 0.0000
Result: Shoplifting is disproportionately concentrated in supermarkets (p < 0.05).


### Proportional Test: Shoplifting in Supermarkets

We conducted a two-proportion z-test comparing the proportion of crimes that are shoplifting 
at supermarket locations with the same proportion at all other locations combined.  

- **Z-statistic**: 192.80  
- **p-value**: < 0.001  

Result: Shoplifting is **statistically overrepresented in supermarkets** (p < 0.05).  
This supports the view that supermarkets are a key hotspot for theft-related crimes: 
shoplifting makes up a much larger share of recorded crime there than in other environments.
