<a href="https://colab.research.google.com/github/Srujgit/CrimeAnalysis/blob/main/CrimesInBoston.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Crimes in Boston

This project aims to analyse crimes in Boston.

Source: https://www.kaggle.com/datasets/AnalyzeBoston/crimes-in-boston

In [1]:
### Importing basic libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [28]:
filepath = "/content/drive/MyDrive/CrimeAnalysis/CrimesInBoston/crime.csv"

df = pd.read_csv(filepath, encoding='ISO-8859-1')
df.head(10)

Unnamed: 0,INCIDENT_NUMBER,OFFENSE_CODE,OFFENSE_CODE_GROUP,OFFENSE_DESCRIPTION,DISTRICT,REPORTING_AREA,SHOOTING,OCCURRED_ON_DATE,YEAR,MONTH,DAY_OF_WEEK,HOUR,UCR_PART,STREET,Lat,Long,Location
0,I182070945,619,Larceny,LARCENY ALL OTHERS,D14,808,,2018-09-02 13:00:00,2018,9,Sunday,13,Part One,LINCOLN ST,42.357791,-71.139371,"(42.35779134, -71.13937053)"
1,I182070943,1402,Vandalism,VANDALISM,C11,347,,2018-08-21 00:00:00,2018,8,Tuesday,0,Part Two,HECLA ST,42.306821,-71.0603,"(42.30682138, -71.06030035)"
2,I182070941,3410,Towed,TOWED MOTOR VEHICLE,D4,151,,2018-09-03 19:27:00,2018,9,Monday,19,Part Three,CAZENOVE ST,42.346589,-71.072429,"(42.34658879, -71.07242943)"
3,I182070940,3114,Investigate Property,INVESTIGATE PROPERTY,D4,272,,2018-09-03 21:16:00,2018,9,Monday,21,Part Three,NEWCOMB ST,42.334182,-71.078664,"(42.33418175, -71.07866441)"
4,I182070938,3114,Investigate Property,INVESTIGATE PROPERTY,B3,421,,2018-09-03 21:05:00,2018,9,Monday,21,Part Three,DELHI ST,42.275365,-71.090361,"(42.27536542, -71.09036101)"
5,I182070936,3820,Motor Vehicle Accident Response,M/V ACCIDENT INVOLVING PEDESTRIAN - INJURY,C11,398,,2018-09-03 21:09:00,2018,9,Monday,21,Part Three,TALBOT AVE,42.290196,-71.07159,"(42.29019621, -71.07159012)"
6,I182070933,724,Auto Theft,AUTO THEFT,B2,330,,2018-09-03 21:25:00,2018,9,Monday,21,Part One,NORMANDY ST,42.306072,-71.082733,"(42.30607218, -71.08273260)"
7,I182070932,3301,Verbal Disputes,VERBAL DISPUTE,B2,584,,2018-09-03 20:39:37,2018,9,Monday,20,Part Three,LAWN ST,42.327016,-71.105551,"(42.32701648, -71.10555088)"
8,I182070931,301,Robbery,ROBBERY - STREET,C6,177,,2018-09-03 20:48:00,2018,9,Monday,20,Part One,MASSACHUSETTS AVE,42.331521,-71.070853,"(42.33152148, -71.07085307)"
9,I182070929,3301,Verbal Disputes,VERBAL DISPUTE,C11,364,,2018-09-03 20:38:00,2018,9,Monday,20,Part Three,LESLIE ST,42.295147,-71.058608,"(42.29514664, -71.05860832)"


## Data preprocessing stage

In this stage, unnecessary columns such as INCIDENT_NUMBER are dropped, while alphanumeric values such as district are one-hot encoded.

In [29]:
df = df.drop(['INCIDENT_NUMBER','OFFENSE_CODE', 'OFFENSE_CODE_GROUP', 'OFFENSE_DESCRIPTION','SHOOTING', 'OCCURRED_ON_DATE','UCR_PART', 'STREET', 'Location'], axis = 1)
df.head()

Unnamed: 0,DISTRICT,REPORTING_AREA,YEAR,MONTH,DAY_OF_WEEK,HOUR,Lat,Long
0,D14,808,2018,9,Sunday,13,42.357791,-71.139371
1,C11,347,2018,8,Tuesday,0,42.306821,-71.0603
2,D4,151,2018,9,Monday,19,42.346589,-71.072429
3,D4,272,2018,9,Monday,21,42.334182,-71.078664
4,B3,421,2018,9,Monday,21,42.275365,-71.090361


District and days of the week are one-hot encoded.

In [30]:
dayEncoder = {
    'Monday': 1,
    'Tuesday': 2,
    'Wednesday': 3,
    'Thursday': 4,
    'Friday': 5,
    'Saturday': 6,
    'Sunday': 7
}

df['DAY_OF_WEEK'] = df['DAY_OF_WEEK'].map(dayEncoder)


In [5]:
df.head()

Unnamed: 0,DISTRICT,REPORTING_AREA,YEAR,MONTH,DAY_OF_WEEK,HOUR,Lat,Long
0,D14,808,2018,9,7,13,42.357791,-71.139371
1,C11,347,2018,8,2,0,42.306821,-71.0603
2,D4,151,2018,9,1,19,42.346589,-71.072429
3,D4,272,2018,9,1,21,42.334182,-71.078664
4,B3,421,2018,9,1,21,42.275365,-71.090361


To begin converting the district values to a numeric value, we first need to see how many districts are there.

In [31]:
uniqueDistricts = set()
distLetter = df.loc[:,'DISTRICT']
for i in range(len(distLetter)):
  letter = str(distLetter[i])
  uniqueDistricts.add(letter[0])
print("Number of unique districts in Boston: ", len(uniqueDistricts))
print("Names of the unique districts in Boston: ", uniqueDistricts)

Number of unique districts in Boston:  6
Names of the unique districts in Boston:  {'C', 'A', 'E', 'D', 'n', 'B'}


It can be seen that there are 5 district names, excluding 'NaN'.
For now, the "NaN" values can be ignored, while the rest are converted to numeric values. This will aid in converting the 'NaN' values as seen in the next section.

In [32]:
def districtEncoder(district_str):
    d_str = str(district_str)
    if d_str != 'nan':
      district_letter = d_str[0]
      district_number = int(d_str[1:])

      if district_letter == 'A':
          return district_number
      elif district_letter == 'B':
          return 100 + district_number
      elif district_letter == 'C':
          return 200 + district_number
      elif district_letter == 'D':
          return 300 + district_number
      else :
        return 400+ district_number

    else:
      district_str = float(district_str)
      return None

df['DISTRICT'] = df['DISTRICT'].apply(districtEncoder)

In [8]:
df.head(10)

Unnamed: 0,DISTRICT,REPORTING_AREA,YEAR,MONTH,DAY_OF_WEEK,HOUR,Lat,Long
0,314.0,808,2018,9,7,13,42.357791,-71.139371
1,211.0,347,2018,8,2,0,42.306821,-71.0603
2,304.0,151,2018,9,1,19,42.346589,-71.072429
3,304.0,272,2018,9,1,21,42.334182,-71.078664
4,103.0,421,2018,9,1,21,42.275365,-71.090361
5,211.0,398,2018,9,1,21,42.290196,-71.07159
6,102.0,330,2018,9,1,21,42.306072,-71.082733
7,102.0,584,2018,9,1,20,42.327016,-71.105551
8,206.0,177,2018,9,1,20,42.331521,-71.070853
9,211.0,364,2018,9,1,20,42.295147,-71.058608


Observing the file it can be seen that DISTRICT, REPORTING_AREA, Lat and Long have multiple missing values which need to be dealt with before making a model.

In [33]:
missing_vals = df['Lat'].isna().sum()
print("Number of missing values in Lat: ", missing_vals)

missing_vals = df['Long'].isna().sum()
print("Number of missing values in Long: ", missing_vals)

missing_vals = df['DISTRICT'].isna().sum()
print("Number of missing values in DISTRICT: ", missing_vals)


Number of missing values in Lat:  19999
Number of missing values in Long:  19999
Number of missing values in DISTRICT:  1765


It can be seen that 19999 values of Lat and Long are missing, while 1765 values of DISTRICT are missing.

In [34]:
df['Lat'] = df['Lat'].fillna(df['Lat'].median())
df['Long'] = df['Long'].fillna(df['Long'].median())

df['Lat'] = df['Lat'].replace(-1, df['Lat'].median())
df['Long'] = df['Long'].replace(-1, df['Long'].median())

missing_vals = df['Lat'].isna().sum()
print("Number of missing values in Lat: ", missing_vals)
missing_vals = df['Long'].isna().sum()
print("Number of missing values in Long: ", missing_vals)


Number of missing values in Lat:  0
Number of missing values in Long:  0


The missing values of REPORTING_AREA are going undected as it may have been replaced with an empty string instead of NaN/ None.

In [35]:
missing_vals = df['REPORTING_AREA'].isna().sum()
print("Number of missing values in REPORTING_AREA: ", missing_vals)
df['REPORTING_AREA'] = pd.to_numeric(df['REPORTING_AREA'], errors='coerce')
df['REPORTING_AREA'] = df['REPORTING_AREA'].fillna(df['REPORTING_AREA'].median())
missing_vals = df['REPORTING_AREA'].isna().sum()
print("Number of missing values in REPORTING_AREA: ", missing_vals)

Number of missing values in REPORTING_AREA:  0
Number of missing values in REPORTING_AREA:  0


Close to 1800 values in the dataset for district are missing. Data imputation is used to tackle such missing values. Imputation of the missing district values is done using regression imputation to help achieve more accurate results rather than filling values in based on mean or median (Simple imputation).

In [39]:
from sklearn.linear_model import LinearRegression
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

X = df[['REPORTING_AREA', 'YEAR', 'MONTH', 'DAY_OF_WEEK', 'HOUR', 'Lat', 'Long']]
y = df['DISTRICT']

df['DISTRICT'] = pd.to_numeric(df['DISTRICT'], errors='coerce')
df['DISTRICT'] = df['DISTRICT'].replace('nan', pd.NA)

imputer = IterativeImputer(max_iter=10, random_state=0,estimator=LinearRegression())
X_imputed = imputer.fit_transform(df)

# Assign the imputed values back to the original DataFrame
df[['REPORTING_AREA', 'YEAR', 'MONTH', 'DAY_OF_WEEK', 'HOUR', 'Lat', 'Long', 'DISTRICT']] = X_imputed

In [40]:
missing_vals = df['DISTRICT'].isna().sum()
print("Number of missing values in DISTRICT: ", missing_vals)

Number of missing values in DISTRICT:  0


In [41]:
missing_vals = df['HOUR'].isna().sum()
print("Number of missing values in HOUR: ", missing_vals)

missing_vals = df['YEAR'].isna().sum()
print("Number of missing values in YEAR: ", missing_vals)

missing_vals = df['MONTH'].isna().sum()
print("Number of missing values in MONTH: ", missing_vals)


missing_vals = df['DAY_OF_WEEK'].isna().sum()
print("Number of missing values in DAY_OF_WEEK: ", missing_vals)

df['HOUR'] = pd.to_numeric(df['HOUR'], errors='coerce')
df['MONTH'] = pd.to_numeric(df['MONTH'], errors='coerce')
df['YEAR'] = pd.to_numeric(df['YEAR'], errors='coerce')
df['DAY_OF_WEEK'] = pd.to_numeric(df['DAY_OF_WEEK'], errors='coerce')

df['HOUR'] = df['HOUR'].fillna(df['HOUR'].median())
df['MONTH'] = df['MONTH'].fillna(df['MONTH'].median())
df['YEAR'] = df['YEAR'].fillna(df['YEAR'].median())
df['DAY_OF_WEEK'] = df['DAY_OF_WEEK'].fillna(df['DAY_OF_WEEK'].median())

Number of missing values in HOUR:  0
Number of missing values in YEAR:  0
Number of missing values in MONTH:  0
Number of missing values in DAY_OF_WEEK:  0


With this, all the necessary values have been imputated.

##Training the model
The aim of this project is to predict the crime rates, hence we need to build regression models.
We will be comparing Linear Regression, Support Vector Regression and Decision Tree Regression.

In [51]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error as mse
from sklearn.model_selection import train_test_split

###Data split
There are 319074 samples.
Out of this, 20% samples, i.e 63815 samples are used for testing, while rest (255259) are used for training.
The training set is further broken down into a training and validation set, in a ratio of 90:10.
The final breakdown of samples is as follows-
Training samples =  229733
Validation samples = 25525
Testing samples = 63815  

In [71]:
X = df[['REPORTING_AREA', 'YEAR', 'MONTH', 'DAY_OF_WEEK', 'HOUR', 'Lat', 'Long']]
y = df['DISTRICT']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, shuffle = True)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size = 0.1, shuffle = True)

In [72]:
dtr = DecisionTreeRegressor(max_depth = 40, min_samples_split= 4, min_samples_leaf = 5, max_features = 6)
svr = SVR(kernel = 'poly', C = 2, degree = 3)
lin = LinearRegression(n_jobs=200, fit_intercept = True)

dtr_model = dtr.fit(X_train, y_train)
svr_model = svr.fit(X_train, y_train)
lin_model = lin.fit(X_train, y_train)

###Using the validation set to help tune the hyperparameters

In [73]:
dtr_val_pred = dtr.predict(X_val)
svr_val_pred = svr.predict(X_val)
lin_val_pred = lin.predict(X_val)

print("Decision Tree Regression mean squared error: ", mse(y_val, dtr_val_pred))
print("Support Vector Regression mean squared error: ", mse(y_val, svr_val_pred))
print("Linear Regression mean squared error: ", mse(y_val, lin_val_pred))

Decision Tree Regression mean squared error:  2.472434012927376e-06
Support Vector Regression mean squared error:  0.0009135991428229871
Linear Regression mean squared error:  0.0003751802404835075


In [74]:
dtr_val_pred = dtr.predict(X_test)
svr_val_pred = svr.predict(X_test)
lin_val_pred = lin.predict(X_test)

print("Decision Tree Regression mean squared error: ", mse(y_test, dtr_val_pred))
print("Support Vector Regression mean squared error: ", mse(y_test, svr_val_pred))
print("Linear Regression mean squared error: ", mse(y_test, lin_val_pred))

Decision Tree Regression mean squared error:  2.833764256050924e-06
Support Vector Regression mean squared error:  0.0009146558738485546
Linear Regression mean squared error:  0.0003748591819373055
