
# Crime Prediction Project
**Objective:** Predict whether a crime will occur at a given location and shift using machine learning models.  
**Dataset:** Crime Incidents Dataset (2024)


## Dataset Overview and Loading

In [None]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
df = pd.read_csv('A_train_balanced.csv')
df.head()


## Exploratory Data Analysis (EDA)

In [None]:

# Check for nulls
df.isnull().sum()


In [None]:

# Visualizing Crime vs No Crime Count
sns.countplot(x='crime_label', data=df, palette='magma')
plt.title("Crime vs No Crime Distribution")
plt.xlabel("Label (0 = No Crime, 1 = Crime)")
plt.ylabel("Count")
plt.show()


In [None]:

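# Correlation heatmap over numeric columns only.
# Without numeric_only=True, df.corr() raises an error in pandas 2.x if any column is non-numeric.
plt.figure(figsize=(10,6))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')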
plt.title('Feature Correlation Heatmap')
plt.show()


## Random Forest Model - Feature Importance

In [None]:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay, accuracy_score

features = ['SHIFT', 'DISTRICT', 'WARD', 'PSA', 'BLOCK']
target = 'crime_label'

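# Label-encode each categorical feature as integers.
# Keep the fitted encoders so the same mapping can be reused on new data.
encoders = {}
for col in features:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col].astype(str))
    encoders[col] = le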

X = df[features]
y = df[target]

# Split data: 80% train, 20% validation, stratified so both splits keep the same class ratio
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Train Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Feature importance plot (impurity-based; these scores can favor high-cardinality features such as BLOCK)
importances = rf.feature_importances_
feature_names = X.columns
sns.barplot(x=importances, y=feature_names, palette="viridis")
plt.title("Feature Importance - Random Forest")
plt.show()
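
The encoding step above fits `LabelEncoder` on the full DataFrame before the split, and a category first seen at prediction time would make `LabelEncoder.transform` raise an error. One possible alternative, sketched here on made-up toy data (the `toy` frame and its values are illustrative assumptions, not the real dataset), wraps scikit-learn's `OrdinalEncoder` and the forest in a single pipeline so the encoder is fit on training rows only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data standing in for A_train_balanced.csv; values are made up.
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    "SHIFT": rng.choice(["DAY", "EVENING", "MIDNIGHT"], size=n),
    "DISTRICT": rng.choice(["1", "2", "3"], size=n),
    "crime_label": rng.integers(0, 2, size=n),
})

X_toy = toy[["SHIFT", "DISTRICT"]].astype(str)
y_toy = toy["crime_label"]
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# The encoder is fit inside the pipeline, so it only ever sees training rows.
# Categories unseen during fit are mapped to -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
pipe = make_pipeline(encoder, RandomForestClassifier(n_estimators=50, random_state=42))
pipe.fit(X_tr, y_tr)

# A category that never appeared in training is still accepted at predict time.
unseen = pd.DataFrame({"SHIFT": ["UNKNOWN_SHIFT"], "DISTRICT": ["1"]})
unseen_pred = pipe.predict(unseen)
val_accuracy = pipe.score(X_va, y_va)
print("Prediction for unseen category:", unseen_pred)
print("Validation accuracy on toy data:", val_accuracy)
```

On the real data, the same pipeline would take `df[features]` before any in-place encoding; the toy labels here are random, so the accuracy it prints carries no meaning.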


## Random Forest Model - Evaluation

In [None]:

# Prediction and Evaluation
y_pred = rf.predict(X_val)
print("Random Forest Classification Report:")
print(classification_report(y_val, y_pred))

cm = confusion_matrix(y_val, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['No Crime', 'Crime'])
disp.plot(cmap='Blues')
plt.title("Random Forest - Confusion Matrix")
plt.show()

accuracy = accuracy_score(y_val, y_pred)
print("Random Forest Accuracy:", accuracy)
plt.bar(["Random Forest"], [accuracy], color='green')
plt.ylim(0, 1)
plt.title("Random Forest - Accuracy Score")
plt.ylabel("Accuracy")
plt.show()
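
An accuracy figure is easier to interpret next to a trivial baseline. As a rough sketch, assuming synthetic integer-coded features rather than the real crime data (`X_demo` and `y_demo` are hypothetical), scikit-learn's `DummyClassifier` can serve as that reference point:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical integer-coded features and labels; not the real crime data.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 5, size=(300, 3))
# The label depends on the first feature so there is something to learn.
y_demo = (X_demo[:, 0] >= 3).astype(int)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Baseline: always predict the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

baseline_acc = accuracy_score(y_va, baseline.predict(X_va))
model_acc = accuracy_score(y_va, model.predict(X_va))
print(f"Baseline accuracy:      {baseline_acc:.3f}")
print(f"Random Forest accuracy: {model_acc:.3f}")
```

In this notebook the equivalent comparison would fit the dummy model on `X_train`, `y_train` and score it on `X_val`, `y_val`.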


## Conclusion / Observations


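- The Random Forest model performed well at predicting crime from the five location and shift features (SHIFT, DISTRICT, WARD, PSA, BLOCK) on the 20% validation split.
- The validation accuracy, per-class precision and recall, and the confusion matrix are reported in the evaluation cell above. If the classes are balanced, as the dataset file name suggests, accuracy should be read against a chance level of roughly 50%.
- The feature importance plot indicates that 'SHIFT' and 'DISTRICT' contribute significantly to predictions. These are impurity-based importances, which can overstate high-cardinality features such as 'BLOCK', so the ranking is worth confirming with a second method.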
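
One way to cross-check a feature ranking is permutation importance, which measures how much the validation score drops when a single column is shuffled. The sketch below uses synthetic data (the `demo` frame is a hypothetical stand-in in which the label depends on `SHIFT` only), so it illustrates the technique rather than reproducing this notebook's results.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Hypothetical encoded features: SHIFT has 3 levels, BLOCK has many.
rng = np.random.default_rng(1)
n = 400
demo = pd.DataFrame({
    "SHIFT": rng.integers(0, 3, size=n),
    "BLOCK": rng.integers(0, 150, size=n),
})
# By construction the label is driven by SHIFT only; BLOCK is pure noise.
y_demo = (demo["SHIFT"] == 2).astype(int)

X_tr, X_va, y_tr, y_va = train_test_split(
    demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)
model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Shuffle each column of the validation set and record the mean accuracy drop.
result = permutation_importance(model, X_va, y_va, n_repeats=10, random_state=42)
perm = pd.Series(result.importances_mean, index=demo.columns).sort_values(ascending=False)
print(perm)
```

Applied here, the call would take the fitted `rf` with `X_val` and `y_val`. Comparing its output with `rf.feature_importances_` shows whether the two rankings agree.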

**Next Steps:** Repeat the same analysis with Logistic Regression and XGBoost models, and compare all three on the same validation split.
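
For the Logistic Regression step, integer label codes would impose an arbitrary ordering on the categories, so one-hot encoding is usually the safer choice for a linear model. A minimal sketch under that assumption, again on made-up toy data (the `toy` frame, its values, and the 10% label noise are all hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; not the real crime dataset.
rng = np.random.default_rng(7)
n = 300
toy = pd.DataFrame({
    "SHIFT": rng.choice(["DAY", "EVENING", "MIDNIGHT"], size=n),
    "DISTRICT": rng.choice(["1", "2", "3", "4"], size=n),
})
# Label follows SHIFT == "MIDNIGHT", with roughly 10% of labels flipped as noise.
base = toy["SHIFT"] == "MIDNIGHT"
flip = rng.random(n) < 0.1
y_toy = (base ^ flip).astype(int)

X_tr, X_va, y_tr, y_va = train_test_split(
    toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# One-hot encode the categories; unseen categories become all-zero rows.
logreg = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    LogisticRegression(max_iter=1000),
)
logreg.fit(X_tr, y_tr)

logreg_pred = logreg.predict(X_va)
logreg_acc = accuracy_score(y_va, logreg_pred)
print("Logistic Regression accuracy on toy data:", logreg_acc)
```

`xgboost.XGBClassifier` exposes the same `fit`/`predict` interface, so it should slot into the evaluation cell in much the same way, assuming the `xgboost` package is installed.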
