# London Crime Prediction Notebook


This notebook explores and predicts crime trends in London using historical ward-level data. It is designed as a template for junior data scientists and analysts to understand, forecast, and communicate crime patterns using simple, interpretable models.

## Objectives

- **Explore** the structure and trends in London crime data at the ward and category level.
- **Build** two predictive models:
  1. Forecast the percentage change in total crime for each ward (next month).
  2. Forecast the frequency of each crime category one year into the future.
- **Interpret** and communicate the results for practical use by local authorities and the public.
- **Provide** a clear, reproducible workflow for junior analysts to follow or extend.

---

# Change working directory

* The notebooks are assumed to live in a subfolder, so when you run one in the editor you first need to change the working directory.

We need to change the working directory from the notebook's folder to its parent folder.
* We access the current directory with `os.getcwd()`

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\jackr\\OneDrive\\Desktop\\my_projects\\London-Crime\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* `os.path.dirname()` gets the parent directory
* `os.chdir()` sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\jackr\\OneDrive\\Desktop\\my_projects\\London-Crime'
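Because the cells above move up one level every time they run, re-running them would leave the project folder. A minimal sketch of a re-runnable alternative follows; the helper name `project_root` is hypothetical, and it assumes the notebooks subfolder is called `jupyter_notebooks`, as in the path printed above.

```python
import os
from pathlib import Path

def project_root(cwd: Path, notebooks_dir: str = "jupyter_notebooks") -> Path:
    """Return the parent folder if cwd is the notebooks subfolder, else cwd itself."""
    return cwd.parent if cwd.name == notebooks_dir else cwd

# Safe to re-run: a second call leaves the directory unchanged
os.chdir(project_root(Path.cwd()))
```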

---

# Predictive Modeling: Crime in London Wards

In this section, we will build and explain two predictive models using the London crime dataset. The goal is to provide actionable insights for local authorities and the public by forecasting crime trends at the ward and category level.

**Model 1:** Predicts the percentage increase (or decrease) in total crime for each ward for the next month, using linear regression. This helps identify which wards may see rising or falling crime rates in the near future.

**Model 2:** Predicts the frequency of each crime category one year ahead (12 months after the last available month), also using linear regression. This can help prioritize resources for specific types of crime.

Both models are designed to be simple and interpretable, suitable for junior data scientists or analysts. The predictions are saved as CSV files for further analysis or visualization.

---

**Why Predict Crime?**
- Anticipating changes in crime rates helps with resource allocation, community safety planning, and public awareness.
- Simple models can provide a baseline for more advanced analytics in the future.

## 1. Load Data and Libraries

To begin, we import the necessary Python libraries for data analysis and modeling. We then load the processed crime data, which contains monthly crime counts for each ward and crime category in London. This data will be used as the basis for our predictive models.

- **pandas**: For data manipulation and analysis
- **numpy**: For numerical operations
- **scikit-learn**: For building linear regression models

We also preview the first few rows of the dataset to understand its structure.

In [6]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Load the processed crime data
file_path = 'data/clean/processed_crime_data.csv'
df = pd.read_csv(file_path)
df.head()

Unnamed: 0,Crime Category,Specific Crime Type,WardName,WardCode,LookUp_BoroughName,2023-11-01 00:00:00,2023-12-01 00:00:00,2024-01-01 00:00:00,2024-02-01 00:00:00,2024-03-01 00:00:00,...,2025-01-01 00:00:00,2025-02-01 00:00:00,2025-03-01 00:00:00,2025-04-01 00:00:00,2025-05-01 00:00:00,2025-06-01 00:00:00,2025-07-01 00:00:00,2025-08-01 00:00:00,2025-09-01 00:00:00,2025-10-01 00:00:00
0,ARSON AND CRIMINAL DAMAGE,ARSON,Heathrow Villages,E05013570,Aviation Security (SO18),0,1,3,0,2,...,2,1,1,0,0,5,3,2,0,0
1,ARSON AND CRIMINAL DAMAGE,CRIMINAL DAMAGE,Heathrow Villages,E05013570,Aviation Security (SO18),17,36,25,28,21,...,28,23,27,25,15,18,25,24,30,27
2,BURGLARY,BURGLARY BUSINESS AND COMMUNITY,Heathrow Villages,E05013570,Aviation Security (SO18),1,4,2,9,6,...,3,5,2,5,6,4,1,4,2,5
3,BURGLARY,RES BURGLARY OF A HOME,Heathrow Villages,E05013570,Aviation Security (SO18),7,8,11,7,3,...,5,4,10,2,1,8,1,4,2,3
4,BURGLARY,RES BURGLARY OF UNCONNECTED BUILDING,Heathrow Villages,E05013570,Aviation Security (SO18),3,3,1,0,1,...,1,0,0,1,0,1,0,2,1,0
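The preview shows a handful of identifier columns followed by one column per month. As a hedged sketch, the monthly columns could be picked out by parsing the column labels rather than by position; the helper name `find_date_columns` is hypothetical, and the label format is assumed to match the preview.

```python
import pandas as pd

def find_date_columns(frame: pd.DataFrame) -> list:
    """Return the column labels that parse as 'YYYY-MM-DD HH:MM:SS' timestamps."""
    labels = pd.Series(frame.columns.astype(str))
    parsed = pd.to_datetime(labels, format="%Y-%m-%d %H:%M:%S", errors="coerce")
    return [col for col, ts in zip(frame.columns, parsed) if pd.notna(ts)]

# Toy frame with the same kind of columns as the preview (values copied from row 0)
toy = pd.DataFrame({
    "Crime Category": ["ARSON AND CRIMINAL DAMAGE"],
    "WardName": ["Heathrow Villages"],
    "2023-11-01 00:00:00": [0],
    "2023-12-01 00:00:00": [1],
})
date_cols_by_name = find_date_columns(toy)
print(date_cols_by_name)
```

Selecting by label keeps the modeling cells working if an identifier column is added or removed upstream.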


## 2. Model 1: Predict Percentage Increase in Crime by Ward (with Train/Test Split)

### Approach
We use a simple linear regression for each ward to model the trend in total crime over time. To properly evaluate model performance, we implement a train/test split:
- **Training data**: The first ~80% of months (19 out of 24 months)
- **Test data**: The last ~20% of months (5 months)

The model is trained on the training period, then evaluated on the test period using Mean Absolute Error (MAE). After validation, we retrain on all available data to forecast the next month's value.

### Why Linear Regression?
Linear regression is a straightforward and interpretable method for time series forecasting when trends are roughly linear. It provides a baseline for more complex models and is easy to explain to non-technical stakeholders.
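To illustrate the interpretability point, here is a small sketch on a made-up series (not the London data): the fitted slope reads directly as "crimes gained or lost per month", and the forecast is the line extended by one step.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical series: starts at 10 crimes and grows by exactly 3 per month
months = np.arange(24).reshape(-1, 1)
counts = 10 + 3 * months.ravel()

model = LinearRegression().fit(months, counts)
slope = model.coef_[0]          # estimated change in crimes per month
intercept = model.intercept_    # fitted count at month index 0
next_month_forecast = model.predict([[24]])[0]

print(f"slope={slope:.2f}, intercept={intercept:.2f}, forecast={next_month_forecast:.2f}")
```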

### Why Train/Test Split?
A train/test split allows us to evaluate how well the model generalizes to unseen data. The MAE on the test set gives us a realistic measure of prediction accuracy.
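A test MAE is easier to judge next to a reference point. The sketch below is an addition, not part of the notebook's method: it compares a linear trend with a naive "repeat the last training month" baseline on the same chronological split. The function name and the toy series are hypothetical.

```python
import numpy as np

def trend_vs_naive_mae(y, train_size):
    """Return (trend_mae, naive_mae) on the months after train_size."""
    y = np.asarray(y, dtype=float)
    t = np.arange(len(y))
    # A degree-1 polyfit gives the same least-squares line as a linear regression on t
    slope, intercept = np.polyfit(t[:train_size], y[:train_size], 1)
    trend_pred = slope * t[train_size:] + intercept
    # Naive baseline: repeat the last training value for every test month
    naive_pred = np.full(len(y) - train_size, y[train_size - 1])
    y_test = y[train_size:]
    return (np.mean(np.abs(y_test - trend_pred)),
            np.mean(np.abs(y_test - naive_pred)))

# Hypothetical steadily rising series: the trend model should beat the naive baseline
rising = [2 * i for i in range(24)]
trend_mae, naive_mae = trend_vs_naive_mae(rising, 19)
print(trend_mae, naive_mae)
```

If the trend model does not beat the naive baseline for a ward, its forecast for that ward deserves extra caution.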

### What Does a Negative Percentage Mean?
A negative percentage increase means the model predicts a decrease in crime for the next month compared to the last month. This can help identify wards where crime is expected to fall, not just rise.
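As a worked example of the calculation, using the Abbey row from the Model 1 output (last month 192, predicted 180.927536); the helper name `pct_change` is hypothetical:

```python
def pct_change(last_month: float, predicted: float) -> float:
    """Percentage change from last_month to predicted; 0 when last_month is 0."""
    if last_month == 0:
        return 0.0  # mirrors the notebook's guard against division by zero
    return (predicted - last_month) / last_month * 100

# Abbey ward, values taken from the Model 1 output table
abbey = pct_change(192, 180.927536)
print(round(abbey, 2))  # -5.77
```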

### Output
The predictions are saved to `ward_crime_pct_increase_predictions.csv` and include:
- `WardName`: The name of the ward
- `LastMonth`: The total crime count in the most recent month
- `PredictedNextMonth`: The model's forecast for the next month
- `PctIncrease`: The percentage change (can be negative if a decrease is predicted)
- `Test_MAE`: Mean absolute error on the test set for that ward

In [7]:
# Prepare data: sum all crimes per ward per month
# The first five columns are identifiers; the remaining columns hold the monthly counts
date_cols = df.columns[5:]

ward_monthly = df.groupby('WardName')[date_cols].sum()

# Define train/test split (80/20 split on time series)
n_months = len(date_cols)
train_size = int(n_months * 0.8)  # ~19 months for training
print(f"Total months: {n_months}, Training months: {train_size}, Test months: {n_months - train_size}")

# Predict next month's total crime for each ward using linear regression with train/test split
results = []
for ward, row in ward_monthly.iterrows():
    y = row.values
    X = np.arange(len(y)).reshape(-1, 1)
    
    # Split into train and test sets
    X_train, y_train = X[:train_size], y[:train_size]
    X_test, y_test = X[train_size:], y[train_size:]
    
    # Train model on training data
    model = LinearRegression()
    model.fit(X_train, y_train)
    
    # Evaluate on test data
    y_test_pred = model.predict(X_test)
    test_mae = mean_absolute_error(y_test, y_test_pred)
    
    # Retrain on all data for final prediction
    model.fit(X, y)
    next_month = len(y)
    pred = model.predict([[next_month]])[0]
    last_month = y[-1]
    # Percentage change is undefined when last month's count is zero; report 0 in that case
    pct_increase = ((pred - last_month) / last_month) * 100 if last_month != 0 else 0
    
    results.append({
        'WardName': ward, 
        'LastMonth': last_month, 
        'PredictedNextMonth': pred, 
        'PctIncrease': pct_increase,
        'Test_MAE': test_mae
    })

ward_pred_df = pd.DataFrame(results)
ward_pred_df.to_csv('data/clean/ward_crime_pct_increase_predictions.csv', index=False)
ward_pred_df

Total months: 24, Training months: 19, Test months: 5


Unnamed: 0,WardName,LastMonth,PredictedNextMonth,PctIncrease,Test_MAE
0,Abbey,192,180.927536,-5.766908,22.568421
1,Abbey Road,75,60.811594,-18.917874,16.536842
2,Abbey Wood,147,127.420290,-13.319531,13.663158
3,Abingdon,93,78.847826,-15.217391,7.552281
4,Addiscombe East,68,66.532609,-2.157928,10.848421
...,...,...,...,...,...
664,Worcester Park North,24,32.521739,35.507246,4.162105
665,Worcester Park South,15,19.510870,30.072464,5.985263
666,Wormholt,35,37.528986,7.225673,5.272982
667,Yeading,111,98.391304,-11.359185,23.768421


In [12]:
# Summary statistics of Model 1 test performance
print("Model 1 Test Performance Summary:")
print(f"Average Test MAE across all wards: {ward_pred_df['Test_MAE'].mean():.2f}")
print(f"Median Test MAE: {ward_pred_df['Test_MAE'].median():.2f}")
print(f"Min Test MAE: {ward_pred_df['Test_MAE'].min():.2f}")
print(f"Max Test MAE: {ward_pred_df['Test_MAE'].max():.2f}")
print("\nWards with highest test error (top 5):")
print(ward_pred_df.nlargest(5, 'Test_MAE')[['WardName', 'Test_MAE', 'LastMonth']])
print("\nWards with lowest test error (top 5):")
print(ward_pred_df.nsmallest(5, 'Test_MAE')[['WardName', 'Test_MAE', 'LastMonth']])

Model 1 Test Performance Summary:
Average Test MAE across all wards: 17.34
Median Test MAE: 13.78
Min Test MAE: 2.28
Max Test MAE: 216.80

Wards with highest test error (top 5):
             WardName    Test_MAE  LastMonth
628          West End  216.800000       2387
532        St James's   78.564912       1783
135          Colville   77.957193        151
82        Camden Town   77.852632        260
300  Hounslow Central   72.231579        201

Wards with lowest test error (top 5):
                      WardName  Test_MAE  LastMonth
485  Selsdon Vale & Forestdale  2.282105         41
572                 Teddington  2.477895         60
73                Bruce Castle  2.707368        146
331              Kingston Gate  2.755789         36
567               Sutton North  2.877895         51


### Interpreting the Test MAE Score for Model 1

The Mean Absolute Error (MAE) on the test set tells us, on average, how many crimes the model's predictions are off by when forecasting **unseen future months** for each ward. A **lower Test MAE** means the model generalizes well and can accurately predict crime trends it hasn't seen during training.

**What is a good MAE?**
- Compare the Test MAE to the ward's typical monthly crime count (the `LastMonth` column is only a rough proxy, since it holds the most recent month rather than an average)
- As a rule of thumb, a Test MAE of about 5-10% of the typical monthly count suggests the linear trend fits that ward well
- A Test MAE of 20% or more suggests the ward is too variable for a linear trend to capture
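The comparison can be sketched with three rows copied from the test summary above. `MAE_pct` is a hypothetical column name, and `LastMonth` stands in for the typical monthly count.

```python
import pandas as pd

# Three rows copied from the Model 1 test summary
sample = pd.DataFrame({
    "WardName": ["West End", "Colville", "Teddington"],
    "Test_MAE": [216.8, 77.957193, 2.477895],
    "LastMonth": [2387, 151, 60],
})

# Relative error, using LastMonth as a stand-in for the typical monthly count
sample["MAE_pct"] = sample["Test_MAE"] / sample["LastMonth"] * 100
print(sample.round(1))
```

On this measure West End, which has the largest absolute error, sits at roughly 9%, while Colville is above 50%.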

**Why do some wards have higher errors?**
- High-crime wards tend to have more variability and thus higher absolute errors
- Wards with irregular patterns (sudden spikes or drops) are harder to predict with linear models
- Lower-crime wards typically have lower MAE but may show higher percentage errors

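The test set evaluation shows how the model performed on five months it was not trained on. It is a useful sanity check for next-month predictions, but it does not guarantee accuracy for any individual ward, especially those with a high Test MAE relative to their monthly count.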

## 3. Model 2: Predict Future Crime Category Frequency (One Year Ahead with Train/Test Split)

### Approach
For the second model, we use linear regression to predict the frequency of each crime category **one year into the future**. To properly evaluate model performance, we implement a train/test split:
- **Training data**: The first ~80% of months (19 out of 24 months)
- **Test data**: The last ~20% of months (5 months)

The model is trained on the training period, then evaluated on the test period using Mean Absolute Error (MAE). After validation, we retrain on all available data to forecast 12 months ahead from the last available month.
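The "12 months ahead" target can be made concrete with a little index arithmetic. This sketch assumes 24 monthly columns ending at 2025-10-01, as in the data preview; the helper name `forecast_position` is hypothetical.

```python
import pandas as pd

def forecast_position(n_months: int, months_ahead: int) -> int:
    """Index of the month that lies months_ahead after the last observed month."""
    last_index = n_months - 1
    return last_index + months_ahead

last_observed = pd.Timestamp("2025-10-01")  # last column in the data preview
target_index = forecast_position(24, 12)
target_month = last_observed + pd.DateOffset(months=12)
print(target_index, target_month.date())  # 35 2026-10-01
```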

### Why Linear Regression for Categories?
This approach is simple and interpretable, making it easy to communicate results to stakeholders. It also provides a quick way to spot categories that are trending up or down, which can inform resource allocation and policy decisions.

### Why Train/Test Split?
A train/test split allows us to evaluate how well the model generalizes to unseen data, ensuring our one-year forecasts are based on validated performance.

### Output
The predictions are saved to `crime_category_frequency_1year_predictions.csv` and include:
- `Crime Category`: The name of the crime category
- `LastMonth`: The total count in the most recent month
- `PredictedOneYearAhead`: The model's forecast for 12 months after the last available month
- `Test_MAE`: Mean absolute error on the test set for that category

In [10]:
# Prepare data: sum all crimes per category per month
cat_monthly = df.groupby('Crime Category')[date_cols].sum()

# Use the same train/test split as Model 1
print(f"Total months: {n_months}, Training months: {train_size}, Test months: {n_months - train_size}")

cat_results = []
for cat, row in cat_monthly.iterrows():
    y = row.values
    X = np.arange(len(y)).reshape(-1, 1)
    
    # Split into train and test sets
    X_train, y_train = X[:train_size], y[:train_size]
    X_test, y_test = X[train_size:], y[train_size:]
    
    # Train model on training data
    model = LinearRegression()
    model.fit(X_train, y_train)
    
    # Evaluate on test data
    y_test_pred = model.predict(X_test)
    test_mae = mean_absolute_error(y_test, y_test_pred)
    
    # Retrain on all data for final prediction
    model.fit(X, y)
    one_year_ahead = len(y) + 12 - 1  # 12 months ahead from last index
    pred = model.predict([[one_year_ahead]])[0]
    
    cat_results.append({
        'Crime Category': cat, 
        'LastMonth': y[-1], 
        'PredictedOneYearAhead': pred,
        'Test_MAE': test_mae
    })

cat_pred_df = pd.DataFrame(cat_results)
cat_pred_df.to_csv('data/clean/crime_category_frequency_1year_predictions.csv', index=False)
cat_pred_df

Total months: 24, Training months: 19, Test months: 5


Unnamed: 0,Crime Category,LastMonth,PredictedOneYearAhead,Test_MAE
0,ARSON AND CRIMINAL DAMAGE,4648,4643.404058,433.263158
1,BURGLARY,4237,3341.307391,272.564912
2,DRUG OFFENCES,4519,6014.827101,376.252632
3,FRAUD AND FORGERY,3,-3.148406,2.715789
4,MISCELLANEOUS CRIMES AGAINST SOCIETY,1082,1260.050145,57.972632
5,NFIB FRAUD,0,-1.254203,0.294737
6,POSSESSION OF WEAPONS,474,513.046957,135.115789
7,PUBLIC ORDER OFFENCES,4861,4976.547971,736.557895
8,ROBBERY,2564,2441.214058,162.8
9,SEXUAL OFFENCES,2151,2579.684348,185.528421


In [11]:
# Summary statistics of Model 2 test performance
print("Model 2 Test Performance Summary:")
print(f"Average Test MAE across all categories: {cat_pred_df['Test_MAE'].mean():.2f}")
print(f"Median Test MAE: {cat_pred_df['Test_MAE'].median():.2f}")
print(f"Min Test MAE: {cat_pred_df['Test_MAE'].min():.2f}")
print(f"Max Test MAE: {cat_pred_df['Test_MAE'].max():.2f}")
print("\nCategories with highest test error (top 5):")
print(cat_pred_df.nlargest(5, 'Test_MAE')[['Crime Category', 'Test_MAE', 'LastMonth']])
print("\nCategories with lowest test error (top 5):")
print(cat_pred_df.nsmallest(5, 'Test_MAE')[['Crime Category', 'Test_MAE', 'LastMonth']])


Model 2 Test Performance Summary:
Average Test MAE across all categories: 479.27
Median Test MAE: 272.56
Min Test MAE: 0.29
Max Test MAE: 2367.01

Categories with highest test error (top 5):
                 Crime Category     Test_MAE  LastMonth
12  VIOLENCE AGAINST THE PERSON  2367.010526      21132
10                        THEFT   837.854737      23503
7         PUBLIC ORDER OFFENCES   736.557895       4861
11             VEHICLE OFFENCES   662.589474       7377
0     ARSON AND CRIMINAL DAMAGE   433.263158       4648

Categories with lowest test error (top 5):
                         Crime Category    Test_MAE  LastMonth
5                            NFIB FRAUD    0.294737          0
3                     FRAUD AND FORGERY    2.715789          3
4  MISCELLANEOUS CRIMES AGAINST SOCIETY   57.972632       1082
6                 POSSESSION OF WEAPONS  135.115789        474
8                               ROBBERY  162.800000       2564


### Interpreting the Test MAE Score for Model 2

The Mean Absolute Error (MAE) on the test set tells us how accurately the model can predict crime category counts for **unseen future months**. A **lower Test MAE** means the model is reliable for forecasting trends in that crime category.

**What is a good MAE?**
- Compare the Test MAE to the category's typical monthly count (the `LastMonth` column is only a rough proxy, since it holds the most recent month rather than an average)
- As a rule of thumb, a Test MAE of about 5-10% of the typical monthly count suggests the linear trend fits that category well
- A Test MAE of 20% or more suggests the category has high variability or a non-linear trend

**Why do some categories have higher errors?**
- High-volume crime categories (e.g., THEFT, VIOLENCE AGAINST THE PERSON) have more variability and higher absolute errors
- Categories with seasonal patterns or sudden changes are harder to predict with simple linear models
- Lower-volume categories may have lower absolute MAE but can still show high percentage errors

**Implications for One-Year Forecasts:**
Model 2 predicts 12 months ahead, but the test set covers only the five months immediately after the training period. The Test MAE therefore measures short-horizon accuracy and likely understates the error of a one-year forecast. Categories with low test errors are better candidates for long-term planning, while high-error categories may benefit from more sophisticated forecasting methods or additional features (e.g., seasonality adjustments).
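One way to see how error depends on forecast distance is to break the test error down by the number of months past the training window. The sketch below uses a made-up series whose trend stops after month 19; it is an illustration under that assumption, not a result from the London data.

```python
import numpy as np

def abs_error_by_horizon(y, train_size):
    """Absolute error of a linear-trend forecast at 1, 2, ... months past the training window."""
    y = np.asarray(y, dtype=float)
    t = np.arange(len(y))
    slope, intercept = np.polyfit(t[:train_size], y[:train_size], 1)
    forecast = slope * t[train_size:] + intercept
    return np.abs(y[train_size:] - forecast)

# Hypothetical series: linear growth for 19 months, then flat
series = [100 + 2 * i for i in range(19)] + [136] * 5
errors = abs_error_by_horizon(series, 19)
print(errors.round(1))
```

When the trend breaks, the error grows with each step ahead, which is why a five-month test window says little about month 12.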

## 4. Summary and Interpretation

### What Did We Do?
- **Loaded and explored the London crime dataset** to understand its structure and content.
- **Built two simple predictive models** using linear regression:
  - The first model forecasts the percentage increase or decrease in total crime for each ward, helping to identify areas with rising or falling crime trends.
  - The second model predicts the frequency of each crime category **one year into the future**, highlighting which types of crime may require more attention over a longer period.
- **Saved the predictions to CSV files** for further analysis, reporting, or visualization.

### How to Interpret the Results
- **PctIncrease (Model 1):**
  - A positive value means the model predicts an increase in crime for the ward next month.
  - A negative value means a decrease is predicted. This happens when the fitted trend is downward, or when the most recent month sits above the trend line (for example after a spike), so the forecast falls back toward the line.
  - Wards with the highest positive values may need more resources or attention.
- **PredictedOneYearAhead (Model 2):**
  - Shows the expected number of crimes for each category 12 months after the last available month.
  - Categories with high or rising predictions may indicate emerging issues or persistent problems.
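A quick way to surface the wards with the largest predicted rises, sketched on five rows copied from the Model 1 output table:

```python
import pandas as pd

# Rows copied from the Model 1 output table
wards = pd.DataFrame({
    "WardName": ["Abbey", "Abbey Road", "Worcester Park North",
                 "Worcester Park South", "Wormholt"],
    "LastMonth": [192, 75, 24, 15, 35],
    "PctIncrease": [-5.766908, -18.917874, 35.507246, 30.072464, 7.225673],
})

# Largest predicted rises first
top_rises = wards.nlargest(3, "PctIncrease")
print(top_rises[["WardName", "LastMonth", "PctIncrease"]])
```

Keeping `LastMonth` next to `PctIncrease` matters: the two largest rises here belong to wards with 24 and 15 crimes last month, so a large percentage corresponds to only a handful of crimes.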

### Limitations and Next Steps
- **Model Simplicity:** These models use only linear trends and do not account for seasonality, external events, or other factors. They are intended as a baseline.
- **Data Quality:** Predictions are only as good as the data. Outliers or missing data can affect results.
- **Model Performance:** Model 1 has a median Test MAE of 13.78 crimes per ward, which is adequate as a baseline for short-term ward-level predictions. Model 2's Test MAE is a larger share of the monthly count for some categories, such as PUBLIC ORDER OFFENCES (about 15.2% of the last month's count) and POSSESSION OF WEAPONS (about 28.5%), and its linear trend produces negative forecasts for FRAUD AND FORGERY and NFIB FRAUD. More sophisticated methods such as ARIMA or seasonal decomposition would be valuable enhancements for long-term predictions.
- **Further Analysis:** Consider visualizing the predictions, comparing with actual future data, or using more advanced models (e.g., ARIMA, Random Forest) for improved accuracy.
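One visible symptom of the linear-trend limitation is that the Model 2 output contains negative forecasts for FRAUD AND FORGERY and NFIB FRAUD. A simple post-processing step, sketched here on rows copied from that output, is to floor forecasts at zero; `PredictedClipped` is a hypothetical column name.

```python
import pandas as pd

# Rows copied from the Model 2 output table
forecasts = pd.DataFrame({
    "Crime Category": ["FRAUD AND FORGERY", "NFIB FRAUD", "ROBBERY"],
    "PredictedOneYearAhead": [-3.148406, -1.254203, 2441.214058],
})

# Counts cannot be negative, so floor the forecasts at zero
forecasts["PredictedClipped"] = forecasts["PredictedOneYearAhead"].clip(lower=0)
print(forecasts)
```

Clipping only tidies the output; it does not make the underlying trend model any more suitable for very low-count categories.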

### Why This Matters
- **Proactive Policing:** Early identification of rising crime trends can help authorities allocate resources more effectively.
- **Community Awareness:** Public access to predictions can inform community safety initiatives.
- **Baseline for Improvement:** Simple models provide a starting point for more sophisticated analytics in the future.