# London Crime Prediction Notebook


This notebook explores and predicts crime trends in London using historical ward-level data. It is designed as a template for junior data scientists and analysts to understand, forecast, and communicate crime patterns using simple, interpretable models.

## Objectives

- **Explore** the structure and trends in London crime data at the ward and category level.
- **Build** two predictive models:
  1. Forecast the percentage change in total crime for each ward (next month).
  2. Forecast the frequency of each crime category one year into the future.
- **Interpret** and communicate the results for practical use by local authorities and the public.
- **Provide** a clear, reproducible workflow for junior analysts to follow or extend.

---

# Change working directory

* The notebooks are assumed to live in a subfolder, so when you run this notebook in an editor you need to change the working directory.

We need to change the working directory from the notebook's folder to its parent folder.
* `os.getcwd()` returns the current directory.

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\jackr\\OneDrive\\Desktop\\my_projects\\London-Crime\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* `os.path.dirname()` returns the parent directory
* `os.chdir()` sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\jackr\\OneDrive\\Desktop\\my_projects\\London-Crime'
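
Re-running the `os.chdir` cell moves up one more level each time, which can leave the notebook outside the project folder. A minimal sketch of a guard, assuming the notebooks folder is named `jupyter_notebooks` as in the path shown earlier (the `project_root` helper is hypothetical and not part of the original notebook):

```python
import os
from pathlib import Path


def project_root(start: Path, notebooks_dir: str = "jupyter_notebooks") -> Path:
    """Return the parent of `start` only when `start` is the notebooks folder."""
    return start.parent if start.name == notebooks_dir else start


# Safe to run more than once: after the first call the folder name no longer
# matches, so the working directory stays where it is.
os.chdir(project_root(Path.cwd()))
```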

---

# Predictive Modeling: Crime in London Wards

In this section, we will build and explain two predictive models using the London crime dataset. The goal is to provide actionable insights for local authorities and the public by forecasting crime trends at the ward and category level.

**Model 1:** Predicts the percentage increase (or decrease) in total crime for each ward using linear regression. This helps identify which wards may see rising or falling crime rates in the near future.

**Model 2:** Predicts the frequency of each crime category one year ahead, also using linear regression. This can help prioritize resources for specific types of crime.

Both models are designed to be simple and interpretable, suitable for junior data scientists or analysts. The predictions are saved as CSV files for further analysis or visualization.

---

**Why Predict Crime?**
- Anticipating changes in crime rates helps with resource allocation, community safety planning, and public awareness.
- Simple models can provide a baseline for more advanced analytics in the future.

## 1. Load Data and Libraries

To begin, we import the necessary Python libraries for data analysis and modeling. We then load the processed crime data, which contains monthly crime counts for each ward and crime category in London. This data will be used as the basis for our predictive models.

- **pandas**: For data manipulation and analysis
- **numpy**: For numerical operations
- **scikit-learn**: For building linear regression models

We also preview the first few rows of the dataset to understand its structure.

In [6]:
import pandas as pd
from sklearn.linear_model import LinearRegression
import numpy as np

# Load the processed crime data
file_path = 'data/clean/processed_crime_data.csv'
df = pd.read_csv(file_path)
df.head()

Unnamed: 0,Crime Category,Specific Crime Type,WardName,WardCode,LookUp_BoroughName,2023-11-01 00:00:00,2023-12-01 00:00:00,2024-01-01 00:00:00,2024-02-01 00:00:00,2024-03-01 00:00:00,...,2025-01-01 00:00:00,2025-02-01 00:00:00,2025-03-01 00:00:00,2025-04-01 00:00:00,2025-05-01 00:00:00,2025-06-01 00:00:00,2025-07-01 00:00:00,2025-08-01 00:00:00,2025-09-01 00:00:00,2025-10-01 00:00:00
0,ARSON AND CRIMINAL DAMAGE,ARSON,Heathrow Villages,E05013570,Aviation Security (SO18),0,1,3,0,2,...,2,1,1,0,0,5,3,2,0,0
1,ARSON AND CRIMINAL DAMAGE,CRIMINAL DAMAGE,Heathrow Villages,E05013570,Aviation Security (SO18),17,36,25,28,21,...,28,23,27,25,15,18,25,24,30,27
2,BURGLARY,BURGLARY BUSINESS AND COMMUNITY,Heathrow Villages,E05013570,Aviation Security (SO18),1,4,2,9,6,...,3,5,2,5,6,4,1,4,2,5
3,BURGLARY,RES BURGLARY OF A HOME,Heathrow Villages,E05013570,Aviation Security (SO18),7,8,11,7,3,...,5,4,10,2,1,8,1,4,2,3
4,BURGLARY,RES BURGLARY OF UNCONNECTED BUILDING,Heathrow Villages,E05013570,Aviation Security (SO18),3,3,1,0,1,...,1,0,0,1,0,1,0,2,1,0
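
The monthly columns are named with timestamps such as `2023-11-01 00:00:00`. One way to pick them out, sketched here under the assumption that every monthly column follows that exact format, is to parse the column names rather than rely on their position. The `find_date_columns` helper and the `demo` frame are hypothetical; the values in `demo` are copied from the first two months of the preview.

```python
import pandas as pd

# Tiny stand-in for the real file: column names follow the preview and the
# counts are the first two months of the ARSON and CRIMINAL DAMAGE rows.
demo = pd.DataFrame({
    "Crime Category": ["ARSON AND CRIMINAL DAMAGE"] * 2,
    "Specific Crime Type": ["ARSON", "CRIMINAL DAMAGE"],
    "WardName": ["Heathrow Villages"] * 2,
    "WardCode": ["E05013570"] * 2,
    "LookUp_BoroughName": ["Aviation Security (SO18)"] * 2,
    "2023-11-01 00:00:00": [0, 17],
    "2023-12-01 00:00:00": [1, 36],
})


def find_date_columns(frame: pd.DataFrame) -> list:
    """Return the column names that parse as 'YYYY-MM-DD HH:MM:SS' timestamps."""
    parsed = pd.to_datetime(
        list(frame.columns), format="%Y-%m-%d %H:%M:%S", errors="coerce"
    )
    # Names that do not match the format become NaT and are filtered out
    return [col for col, ok in zip(frame.columns, parsed.notna()) if ok]


demo_date_cols = find_date_columns(demo)
print(demo_date_cols)
```

Selecting by name keeps working if an extra identifier column is added to the file, whereas a positional slice would silently shift.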


## 2. Model 1: Predict Percentage Increase in Crime by Ward

### Approach
We use a simple linear regression for each ward to model the trend in total crime over time. The model is trained on the monthly total crime counts for each ward. We then forecast the next month's value and compare it to the most recent month to calculate the percentage increase or decrease.
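
To make the steps concrete, here is a minimal sketch on a single made-up series (the counts are illustrative, not taken from the dataset, and the `forecast_linear` helper is hypothetical). It fits a line to the month index, extends it by one month, and compares the forecast with the last observed value.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def forecast_linear(y, steps_ahead=1):
    """Fit a line to y against its month index and extrapolate steps_ahead months."""
    y = np.asarray(y, dtype=float)
    X = np.arange(len(y)).reshape(-1, 1)   # month index 0, 1, 2, ...
    model = LinearRegression().fit(X, y)
    target = len(y) - 1 + steps_ahead      # index of the month being forecast
    return float(model.predict([[target]])[0])


counts = [10, 12, 14, 16]                  # made-up, perfectly linear series
pred = forecast_linear(counts)             # the line continues to about 18
pct_change = (pred - counts[-1]) / counts[-1] * 100
print(round(pred, 2), round(pct_change, 2))
```

On this series the forecast is about 18 and the percentage change is about 12.5, because the series rises by 2 each month from a last value of 16.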

### Why Linear Regression?
Linear regression is a straightforward and interpretable method for time series forecasting when trends are roughly linear. It provides a baseline for more complex models and is easy to explain to non-technical stakeholders.

### What Does a Negative Percentage Mean?
A negative percentage increase means the model predicts a decrease in crime for the next month compared to the last month. This can help identify wards where crime is expected to fall, not just rise.
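
A negative value can appear even when the fitted trend is rising, because the forecast is compared with the last observed month rather than with the trend line. The sketch below uses a made-up series with a spike in the final month. `np.polyfit` with degree 1 is used here as a stand-in for `LinearRegression`; both fit the same least-squares line on a single feature.

```python
import numpy as np

counts = np.array([20, 20, 20, 20, 40], dtype=float)  # made-up: flat, then a spike
x = np.arange(len(counts))

slope, intercept = np.polyfit(x, counts, 1)            # least-squares line
pred_next = slope * len(counts) + intercept            # forecast for the next month
pct_change = (pred_next - counts[-1]) / counts[-1] * 100

print(round(slope, 2), round(pred_next, 2), round(pct_change, 2))
```

Here the slope is positive (about 4 crimes per month), yet the forecast of about 36 is below the spike of 40, giving a change of about -10%.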

### Output
The predictions are saved to `ward_crime_pct_increase_predictions.csv` and include:
- `WardName`: The name of the ward
- `LastMonth`: The total crime count in the most recent month
- `PredictedNextMonth`: The model's forecast for the next month
- `PctIncrease`: The percentage change (can be negative if a decrease is predicted)

In [9]:
# Prepare data: sum all crimes per ward per month
# The first five columns are identifiers; the remaining columns are monthly counts
date_cols = df.columns[5:]

ward_monthly = df.groupby('WardName')[date_cols].sum()

# Predict next month's total crime for each ward using linear regression
results = []
for ward, row in ward_monthly.iterrows():
    y = row.values
    X = np.arange(len(y)).reshape(-1, 1)
    model = LinearRegression()
    model.fit(X, y)
    next_month = len(y)
    pred = model.predict([[next_month]])[0]
    last_month = y[-1]
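    # Percentage change versus the last observed month; 0 is used when the
    # last month had no recorded crimes, to avoid dividing by zero
    pct_increase = ((pred - last_month) / last_month) * 100 if last_month != 0 else 0
    results.append({
        'WardName': ward,
        'LastMonth': last_month,
        'PredictedNextMonth': pred,
        'PctIncrease': pct_increase,
    })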

ward_pred_df = pd.DataFrame(results)
ward_pred_df.to_csv('data/clean/ward_crime_pct_increase_predictions.csv', index=False)
ward_pred_df

Unnamed: 0,WardName,LastMonth,PredictedNextMonth,PctIncrease
0,Abbey,192,180.927536,-5.766908
1,Abbey Road,75,60.811594,-18.917874
2,Abbey Wood,147,127.420290,-13.319531
3,Abingdon,93,78.847826,-15.217391
4,Addiscombe East,68,66.532609,-2.157928
...,...,...,...,...
664,Worcester Park North,24,32.521739,35.507246
665,Worcester Park South,15,19.510870,30.072464
666,Wormholt,35,37.528986,7.225673
667,Yeading,111,98.391304,-11.359185
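
Wards with small counts can show large percentage swings: Worcester Park North moves from 24 to a forecast of about 32.5, which is a 35.5% rise. A possible way to rank wards while setting aside very small ones is sketched below. The rows are copied from the output above; the `min_count` threshold of 30 is an arbitrary, hypothetical choice.

```python
import pandas as pd

# Rows copied from the prediction output
sample = pd.DataFrame({
    "WardName": ["Abbey", "Abbey Road", "Worcester Park North",
                 "Worcester Park South", "Wormholt"],
    "LastMonth": [192, 75, 24, 15, 35],
    "PctIncrease": [-5.766908, -18.917874, 35.507246, 30.072464, 7.225673],
})

# Unfiltered ranking: dominated by the two smallest wards
top_unfiltered = sample.nlargest(2, "PctIncrease")

# Ranking restricted to wards with at least min_count crimes last month
min_count = 30  # hypothetical threshold, tune to the use case
ranked = (
    sample[sample["LastMonth"] >= min_count]
    .sort_values("PctIncrease", ascending=False)
)
print(ranked[["WardName", "LastMonth", "PctIncrease"]])
```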


## 3. Model 2: Predict Future Crime Category Frequency (One Year Ahead)

### Approach
For the second model, we use linear regression to predict the frequency of each crime category **one year into the future**. This is done by aggregating the monthly counts for each category across all wards, then fitting a linear model to the time series for each category. The model then forecasts the value 12 months ahead from the last available month.
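
The aggregation and the 12-month horizon can be sketched on a small made-up frame (ward names, month labels, and counts are all illustrative). `np.polyfit` with degree 1 stands in for `LinearRegression` here; both fit the same least-squares line on a single feature.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "Crime Category": ["BURGLARY", "BURGLARY", "DRUG OFFENCES", "DRUG OFFENCES"],
    "WardName": ["Ward A", "Ward B", "Ward A", "Ward B"],
    "m0": [30, 20, 1, 1],
    "m1": [29, 19, 2, 2],
    "m2": [28, 18, 3, 3],
})
month_cols = ["m0", "m1", "m2"]

# Sum across wards so each category has one monthly series
cat_monthly_demo = demo.groupby("Crime Category")[month_cols].sum()

forecasts = {}
for cat, row in cat_monthly_demo.iterrows():
    y = row.values.astype(float)
    x = np.arange(len(y))
    slope, intercept = np.polyfit(x, y, 1)
    target = len(y) - 1 + 12          # 12 months after the last observed index
    forecasts[cat] = slope * target + intercept

print(forecasts)
```

With only three months of history the line is extended four times further than the data it was fitted on, which is why long-horizon forecasts from this method should be read as rough trend indicators.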

### Why Linear Regression for Categories?
This approach is simple and interpretable, making it easy to communicate results to stakeholders. It also provides a quick way to spot categories that are trending up or down, which can inform resource allocation and policy decisions.

### Output
The predictions are saved to `crime_category_frequency_1year_predictions.csv` and include:
- `Crime Category`: The name of the crime category
- `LastMonth`: The total count in the most recent month
- `PredictedOneYearAhead`: The model's forecast for 12 months after the last available month

In [11]:
# Prepare data: sum all crimes per category per month
cat_monthly = df.groupby('Crime Category')[date_cols].sum()

cat_results = []
for cat, row in cat_monthly.iterrows():
    y = row.values
    X = np.arange(len(y)).reshape(-1, 1)
    model = LinearRegression()
    model.fit(X, y)
    one_year_ahead = len(y) + 12 - 1  # 12 months ahead from last index
    pred = model.predict([[one_year_ahead]])[0]
    cat_results.append({'Crime Category': cat, 'LastMonth': y[-1], 'PredictedOneYearAhead': pred})

cat_pred_df = pd.DataFrame(cat_results)
cat_pred_df.to_csv('data/clean/crime_category_frequency_1year_predictions.csv', index=False)
cat_pred_df.head()

Unnamed: 0,Crime Category,LastMonth,PredictedOneYearAhead
0,ARSON AND CRIMINAL DAMAGE,4648,4643.404058
1,BURGLARY,4237,3341.307391
2,DRUG OFFENCES,4519,6014.827101
3,FRAUD AND FORGERY,3,-3.148406
4,MISCELLANEOUS CRIMES AGAINST SOCIETY,1082,1260.050145
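
The FRAUD AND FORGERY forecast is negative, which cannot be a real count; it comes from extending a falling line past zero. One simple option, shown as a sketch rather than as part of the original workflow, is to clip forecasts at zero before reporting them. The rows are copied from the output above, and the `PredictedClipped` column name is hypothetical.

```python
import pandas as pd

# Rows copied from the category prediction output
cat_sample = pd.DataFrame({
    "Crime Category": ["BURGLARY", "FRAUD AND FORGERY"],
    "LastMonth": [4237, 3],
    "PredictedOneYearAhead": [3341.307391, -3.148406],
})

# Counts cannot be negative, so floor the forecast at zero
cat_sample["PredictedClipped"] = cat_sample["PredictedOneYearAhead"].clip(lower=0)
print(cat_sample)
```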


## 4. Summary and Interpretation

### What Did We Do?
- **Loaded and explored the London crime dataset** to understand its structure and content.
- **Built two simple predictive models** using linear regression:
  - The first model forecasts the percentage increase or decrease in total crime for each ward, helping to identify areas with rising or falling crime trends.
  - The second model predicts the frequency of each crime category **one year into the future**, highlighting which types of crime may require more attention over a longer period.
- **Saved the predictions to CSV files** for further analysis, reporting, or visualization.

### How to Interpret the Results
- **PctIncrease (Model 1):**
  - A positive value means the model predicts an increase in crime for the ward next month.
  - A negative value means a decrease is predicted. This happens when the fitted trend is downward, or when the most recent month sits well above the trend line (for example after a spike), so the forecast falls back toward the line.
  - Wards with the highest positive values may need more resources or attention.
- **PredictedOneYearAhead (Model 2):**
  - Shows the expected number of crimes for each category 12 months after the last available month.
  - Categories with high or rising predictions may indicate emerging issues or persistent problems.

### Limitations and Next Steps
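- **Model Simplicity:** These models use only linear trends and do not account for seasonality, external events, or other factors. They are intended as a baseline. A straight line can also extrapolate below zero, as in the FRAUD AND FORGERY forecast of -3.148406, which is not a valid count.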
- **Data Quality:** Predictions are only as good as the data. Outliers or missing data can affect results.
- **Further Analysis:** Consider visualizing the predictions, comparing with actual future data, or using more advanced models (e.g., ARIMA, Random Forest) for improved accuracy.
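
Comparing forecasts with actual data does not have to wait for new months to arrive. A minimal holdout sketch, assuming the last observed month is held back as the test point: fit on the earlier months, forecast the held-back month, and compare the error with a naive "same as last month" baseline. The series is made up, the `holdout_errors` helper is hypothetical, and `np.polyfit` with degree 1 stands in for `LinearRegression`.

```python
import numpy as np


def holdout_errors(y):
    """Return (trend_error, naive_error) for a one-month holdout on series y."""
    y = np.asarray(y, dtype=float)
    train, actual = y[:-1], y[-1]
    x = np.arange(len(train))
    slope, intercept = np.polyfit(x, train, 1)
    trend_pred = slope * len(train) + intercept   # linear-trend forecast
    naive_pred = train[-1]                        # baseline: repeat the last value
    return abs(trend_pred - actual), abs(naive_pred - actual)


series = [100, 104, 108, 112, 116]                # made-up, steadily rising
trend_err, naive_err = holdout_errors(series)
print(round(trend_err, 2), round(naive_err, 2))
```

On this steadily rising series the trend model has almost no error while the naive baseline is off by 4. On real ward data the comparison could go either way, and averaging the two errors over all wards gives a quick check of whether the linear trend adds anything over the baseline.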

### Why This Matters
- **Proactive Policing:** Early identification of rising crime trends can help authorities allocate resources more effectively.
- **Community Awareness:** Public access to predictions can inform community safety initiatives.
- **Baseline for Improvement:** Simple models provide a starting point for more sophisticated analytics in the future.