# Predicting LA Crime
## Phase 2 : Machine Learning Modelling 

### RMIT University
### Master in Analytics
#### Machine Learning (MATH2319)
#### Group name : 56 
#### Name (ID) : Fabian Caballero (s3912233)


## Table of contents
* [Introduction](#i)
  + [Phase 1 Summary](#p1s)
  + [Report Overview](#ro)
  + [Methodology](#ro)
* [Data](#dp)
* [Predictive Modelling](#pm)
  + [Feature Selection](#fs)
  + [Model Fitting and Tuning](#mft)
      + [Model 1](#m1)
      + [Model 2](#m2)
      + [Model 3](#m3)
  + [Model Fitting and Tuning for Neural Network](#mftnn)  
  + [Model Comparison](#mc)  
* [Limitations](#l)
* [Conclusions](#c)
* [References](#r)

## Introduction <a id='i'></a>

This study proposes a machine learning model to identify areas at risk of crime in Los Angeles, based on historical features of the LAPD crime dataset from 2020. The dataset contains more than 600,000 records of criminal incidents, which were preprocessed and sampled to obtain 25,850 records for analysis. The descriptive analysis revealed patterns in the distribution of crime across areas, crime types, times and victim characteristics. It concluded that a machine learning model able to forecast the level of risk for each area from these features is needed to optimize the allocation of police resources and prevent crime. The main findings of the descriptive analysis from Phase 1 are:

* LAPD divides Los Angeles into 21 areas. Some areas have a higher risk of crime than others, and area 1 appears to be the most affected by crime incidents.
* The average age of victims ranges from 30 to 40 years, and on average female victims are younger than male victims.


### Phase 1 Summary <a id='p1s'></a>



### Report Overview <a id='ro'></a>
The dataset contains observations of crimes reported in Los Angeles from 2020 until its last update on 25 March 2023. Each record contains the unique ID of the incident, the date of occurrence, the area and the geographic location. Additionally, each record has information about the demographics of the victim and the type of crime. This set of variables can help to predict which areas are susceptible to crime.

The LA Crime Dataset has 28 features and 690,454 observations. However, several variables are excluded because most of them duplicate a main variable in a different structure. A few others could introduce noise into the model because their values are unique to each record.


### Methodology <a id='ro'></a>

## Data  <a id='dp'></a>

The LA crime dataset contains crime records of the LAPD (Los Angeles Police Department) from 2020. The data was sourced from Data Catalog (LAPD, 2023).

Data preparation is essential to guarantee the quality of the model. The main objective of Phase 1 was to prepare the LAPD dataset for modelling. For this reason, the following actions were applied:
* Drop unsuitable variables for the model
* Rename columns
* Change data types
* Remove missing values
* Replace NaN values depending on the variable
    * For weapon used, all the NaN were replaced by 500.0
* Remove observations with inconsistent values
    * Records with victim age < 0
    * Records with victim sex different from "M" or "F"
* Sample 25,850 observations from 516,998 rows (After pre-processing)
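
The cleaning steps listed above can be sketched on a toy data frame. This is a minimal illustration and not the Phase 1 code: the rows are made up, the sample size is reduced to fit the toy data, and the `random_state` is an assumption.

```python
import numpy as np
import pandas as pd

# Toy records standing in for the raw data (values are made up for illustration)
raw = pd.DataFrame({
    "vict_age": [38, -1, 41, 19, 27],
    "vict_sex": ["F", "M", "X", "M", "F"],
    "weapon_used_code": [np.nan, 400.0, 500.0, np.nan, 106.0],
})

# Replace missing weapon codes with 500.0, as described in the list above
clean = raw.assign(weapon_used_code=raw["weapon_used_code"].fillna(500.0))

# Keep only records with a non-negative age and a sex of "M" or "F"
clean = clean[(clean["vict_age"] >= 0) & clean["vict_sex"].isin(["M", "F"])]

# Draw a reproducible random sample (the report samples 25,850 rows;
# n=2 and random_state=999 are assumptions for this toy example)
sample = clean.sample(n=2, random_state=999)
print(len(clean), len(sample))
```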

In [26]:
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tabulate import tabulate # Create tables 
from datetime import datetime # Extracting specific date values
from sklearn import preprocessing # Scaling Features
from sklearn import feature_selection as fs # Feature selection methods

# Modelling
from sklearn.model_selection import RepeatedKFold
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, GridSearchCV
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import PowerTransformer
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
from sklearn.metrics import RocCurveDisplay # ROC curve display
pd.set_option('display.max_columns', None) 


%matplotlib inline 
%config InlineBackend.figure_format = 'retina'
plt.style.use("seaborn")

In [4]:
df_sample = pd.read_csv("NYPD_sample_25850_rows.csv")
df_sample.head()

Unnamed: 0,occ_day,occ_militar_time,area,crm_code,vict_age,vict_sex,vict_descent,premis_code,weapon_used_code,lat,lon
0,1,1210,4,354,38,F,H,503.0,500.0,34.0234,-118.2102
1,3,1600,3,624,17,F,B,502.0,400.0,34.0055,-118.2937
2,5,1530,1,740,41,M,H,122.0,500.0,34.0459,-118.2526
3,0,1606,1,341,19,M,C,124.0,500.0,34.0566,-118.2318
4,0,1207,7,354,35,F,H,501.0,500.0,34.0623,-118.3462


In [5]:
df_sample.dtypes

occ_day               int64
occ_militar_time      int64
area                  int64
crm_code              int64
vict_age              int64
vict_sex             object
vict_descent         object
premis_code         float64
weapon_used_code    float64
lat                 float64
lon                 float64
dtype: object

## Predictive Modelling  <a id='pm'></a>

### Data Preparation for Modelling

#### Encoding categorical variables
Although the dataset was already treated in Phase 1, it required a few extra manipulation tasks before modelling. First, the target feature and the descriptive variables are split into two data frames. The code below shows the frequency of each of the 21 areas of Los Angeles.

Second, categorical data must be encoded as numerical values, because machine learning models operate on numerical inputs to produce their calculations and predictions.


In [7]:
data = df_sample.drop(columns='area')
target = df_sample['area']

In [8]:
target.value_counts()

1     1819
12    1717
14    1498
3     1424
6     1404
18    1327
15    1261
7     1227
8     1214
20    1202
21    1185
13    1172
9     1158
2     1127
11    1091
19    1059
10    1057
17    1042
5     1017
4      933
16     916
Name: area, dtype: int64

There are just two categorical variables to encode as numerical values: vict_sex and vict_descent. Fortunately for the model, LAPD already uses encoded values for the majority of the dataset variables. For example, it uses a float or integer code to identify the weapon used in the crime.

The encoding of vict_sex transforms the categorical values into a binary variable, with F = 0 and M = 1. However, vict_descent has multiple values, so the function `get_dummies()` is used to create a dummy feature for each value.
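
As a small illustration of the two encodings, the sketch below applies them to a three-row toy frame. The rows are made up; only the column names and the F = 0, M = 1 mapping come from the report.

```python
import pandas as pd

# Made-up rows using the two categorical columns of the dataset
toy = pd.DataFrame({
    "vict_sex": ["F", "M", "F"],
    "vict_descent": ["H", "B", "H"],
})

# Binary encoding: F -> 0, M -> 1
toy["vict_sex"] = toy["vict_sex"].map({"F": 0, "M": 1})

# One-hot encoding: one indicator column per descent code
encoded = pd.get_dummies(toy, columns=["vict_descent"])
print(encoded.columns.tolist())
```

Each distinct descent code becomes its own column, which is why the 11 original columns grow to 27 descriptive features once the target is removed.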

In [9]:
data.columns[data.dtypes==object].tolist()

['vict_sex', 'vict_descent']

In [10]:
data['vict_sex'] = data['vict_sex'].replace({'F':0,'M':1})
data['vict_sex'].nunique()

2

In [11]:
data = pd.get_dummies(data)

In [12]:
data.columns

Index(['occ_day', 'occ_militar_time', 'crm_code', 'vict_age', 'vict_sex',
       'premis_code', 'weapon_used_code', 'lat', 'lon', 'vict_descent_A',
       'vict_descent_B', 'vict_descent_C', 'vict_descent_D', 'vict_descent_F',
       'vict_descent_G', 'vict_descent_H', 'vict_descent_I', 'vict_descent_J',
       'vict_descent_K', 'vict_descent_O', 'vict_descent_P', 'vict_descent_S',
       'vict_descent_U', 'vict_descent_V', 'vict_descent_W', 'vict_descent_X',
       'vict_descent_Z'],
      dtype='object')

In [13]:
data.columns.nunique()

27

In [14]:
data.sample(5, random_state = 220)

Unnamed: 0,occ_day,occ_militar_time,crm_code,vict_age,vict_sex,premis_code,weapon_used_code,lat,lon,vict_descent_A,vict_descent_B,vict_descent_C,vict_descent_D,vict_descent_F,vict_descent_G,vict_descent_H,vict_descent_I,vict_descent_J,vict_descent_K,vict_descent_O,vict_descent_P,vict_descent_S,vict_descent_U,vict_descent_V,vict_descent_W,vict_descent_X,vict_descent_Z
7519,5,145,341,21,0,501.0,500.0,34.1172,-118.1799,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
10499,0,714,440,53,0,501.0,500.0,34.1961,-118.4563,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
16305,5,1800,230,41,1,101.0,106.0,34.0473,-118.2139,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
6856,5,430,330,31,1,102.0,500.0,34.2131,-118.3954,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
11430,2,1351,745,53,0,101.0,500.0,33.9994,-118.2499,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0


#### Scaling Features
Once all the descriptive variables are tidy and numerically encoded, the next step is to normalize the values of the dataset by rescaling the descriptive features. This is required because some variables have a wide range and others a narrow one, which can bias models that rely on distances or variances.

Applying feature scaling may improve the performance of the classifiers. For this task, the function `MinMaxScaler()` rescales each variable so that its minimum and maximum map to 0 and 1 respectively. As a result, all values lie in a common range from 0 to 1.
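
The rescaling itself is a simple column-wise formula, x' = (x - min) / (max - min). The sketch below reproduces it with NumPy on made-up values, only to show what the scaler computes; it is not part of the notebook.

```python
import numpy as np

# Two made-up columns on very different scales (a time in HHMM and an age)
X = np.array([[1210.0, 38.0],
              [1600.0, 17.0],
              [ 145.0, 41.0]])

col_min = X.min(axis=0)
col_max = X.max(axis=0)

# x' = (x - min) / (max - min), applied column by column
X_scaled = (X - col_min) / (col_max - col_min)
print(X_scaled)
```

After scaling, both columns span exactly the interval from 0 to 1, so neither dominates purely because of its units.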

In [16]:
data_original = data.copy()

In [17]:
scaler = preprocessing.MinMaxScaler()
# fit_transform() both fits the scaler and rescales the data, so a separate fit() call is not needed
data = scaler.fit_transform(data)
pd.DataFrame(data, columns=data_original.columns).sample(5, random_state=999)

Unnamed: 0,occ_day,occ_militar_time,crm_code,vict_age,vict_sex,premis_code,weapon_used_code,lat,lon,vict_descent_A,vict_descent_B,vict_descent_C,vict_descent_D,vict_descent_F,vict_descent_G,vict_descent_H,vict_descent_I,vict_descent_J,vict_descent_K,vict_descent_O,vict_descent_P,vict_descent_S,vict_descent_U,vict_descent_V,vict_descent_W,vict_descent_X,vict_descent_Z
24160,0.833333,0.775657,0.390071,0.185567,0.0,0.125287,0.963768,0.990099,0.003501,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
20159,0.666667,0.381255,0.609929,0.43299,0.0,0.45977,0.722222,0.993131,0.003142,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12596,0.666667,0.69084,0.874704,0.237113,0.0,0.008046,0.963768,0.993204,0.003879,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
20275,0.166667,0.557252,0.288416,0.463918,1.0,0.0,0.963768,0.990125,0.003533,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7421,0.0,0.381255,0.744681,0.237113,1.0,0.008046,0.963768,0.996271,0.001992,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### Feature Selection
Feature selection evaluates the importance of the descriptive features for the ML model. For this project, two feature selection methods were used, which is good practice because it reveals the features that both methods agree on.

The two methods are the F-score, for its statistical basis, and mutual information, for its analysis of the relationship between each feature and the target.

For this dataset, with a total of 27 descriptive features, both methods were used to select the best 10 and the best 20 features. This checks how consistent each method is as the number of features p increases and how much influence the features have on the model.

###### F-score method p=10

In [18]:
num_features = 10
fs_fit_fscore = fs.SelectKBest(fs.f_classif, k=num_features)
fs_fit_fscore.fit_transform(data, target)
fs_indices_fit_fscore = np.argsort(np.nan_to_num(fs_fit_fscore.scores_))[::-1][0:num_features]
best_f = list(zip(data_original.columns[fs_indices_fit_fscore].values,fs_fit_fscore.scores_))
best_f_sorted = sorted(best_f, key=lambda x: x[1],reverse=True)

print(tabulate(best_f_sorted, floatfmt=".4f"))


----------------  -------
vict_descent_K    48.5965
vict_descent_X    18.4528
vict_sex          17.2189
weapon_used_code  15.3553
vict_descent_O     8.6614
vict_descent_H     7.0587
vict_age           6.7561
vict_descent_A     2.5383
vict_descent_W     1.7610
vict_descent_B     1.0128
----------------  -------


###### F-score method p=20

In [19]:
num_features = 20
fs_fit_fscore = fs.SelectKBest(fs.f_classif, k=num_features)
fs_fit_fscore.fit_transform(data, target)
fs_indices_fit_fscore = np.argsort(np.nan_to_num(fs_fit_fscore.scores_))[::-1][0:num_features]
best_f = list(zip(data_original.columns[fs_indices_fit_fscore].values,fs_fit_fscore.scores_))
best_f_sorted = sorted(best_f, key=lambda x: x[1],reverse=True)

print(tabulate(best_f_sorted, floatfmt=".4f"))

----------------  --------
premis_code       178.0760
vict_descent_J    144.8283
vict_descent_K     48.5965
vict_descent_Z     36.2879
occ_militar_time   27.6077
vict_descent_X     18.4528
vict_sex           17.2189
weapon_used_code   15.3553
vict_descent_O      8.6614
vict_descent_H      7.0587
vict_age            6.7561
crm_code            6.0386
vict_descent_C      5.2285
vict_descent_V      3.3165
vict_descent_A      2.5383
vict_descent_W      1.7610
lon                 1.2429
vict_descent_B      1.0128
vict_descent_F      0.9239
lat                 0.7925
----------------  --------


###### Mutual Information, p=10

In [20]:
num_features = 10
fs_fit_mutual_info = fs.SelectKBest(fs.mutual_info_classif, k=num_features)
fs_fit_mutual_info.fit_transform(data, target)
fs_indices_mutual_info = np.argsort(fs_fit_mutual_info.scores_)[::-1][0:num_features]
best_features_mutual_info = data_original.columns[fs_indices_mutual_info].values
best_features_mutual_info

best_f = list(zip(data_original.columns[fs_indices_mutual_info].values,fs_fit_mutual_info.scores_))
best_f_sorted = sorted(best_f, key=lambda x: x[1],reverse=True)

print(tabulate(best_f_sorted, floatfmt=".4f"))

----------------  ------
weapon_used_code  2.1659
vict_age          2.0491
vict_descent_H    0.1768
premis_code       0.0872
vict_descent_B    0.0300
crm_code          0.0222
vict_descent_O    0.0134
vict_descent_W    0.0073
lat               0.0047
lon               0.0000
----------------  ------


###### Mutual Information, p=20

In [None]:
num_features = 20
fs_fit_mutual_info = fs.SelectKBest(fs.mutual_info_classif, k=num_features)
fs_fit_mutual_info.fit_transform(data, target)
fs_indices_mutual_info = np.argsort(fs_fit_mutual_info.scores_)[::-1][0:num_features]
best_features_mutual_info = data_original.columns[fs_indices_mutual_info].values
best_features_mutual_info

best_f = list(zip(data_original.columns[fs_indices_mutual_info].values,fs_fit_mutual_info.scores_))
best_f_sorted = sorted(best_f, key=lambda x: x[1],reverse=True)

print(tabulate(best_f_sorted, floatfmt=".4f"))

As a result, there are two important findings to consider. First, for the F-score selection method, it seems that a larger number of features p will perform better than a small one. As evidence, the p=10 table omits the two variables `premis_code` and `vict_descent_J`, whose F-scores of 178.0760 and 144.8283 are the highest in the p=20 table. Therefore, for this method it is better to use a large p. Second, for the mutual information method, the results appear to be consistent from one value of p to the other. However, it does not produce the same ranking of features as the F-score: its best descriptive variables are `weapon_used_code` and `vict_age`.
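
When reading such tables, it helps to remember that `scores_` on a fitted `SelectKBest` follows the original column order. If feature names are sorted by score, the scores must be indexed with the same sorted positions so that each name stays paired with its own score. A minimal sketch of an aligned ranking is shown below, using synthetic data and hypothetical feature names rather than the crime dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in data; the feature names are hypothetical
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)
names = np.array([f"feature_{i}" for i in range(X.shape[1])])

k = 3
selector = SelectKBest(f_classif, k=k).fit(X, y)

# scores_ is in original column order, so index it with the same
# sorted positions that are used to pick the names
scores = np.nan_to_num(selector.scores_)
top_idx = np.argsort(scores)[::-1][:k]
ranking = list(zip(names[top_idx], scores[top_idx]))

for name, score in ranking:
    print(f"{name:<12}{score:10.4f}")
```

With this pairing, the top k rows of a larger ranking are always identical to the rows of a smaller one, which makes tables for different values of p directly comparable.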

#### Cross-validation method
Finally, before proceeding to the modelling part, it is important to define how the data is split for training and testing. First, 30% of the data is held out as a test set. Cross-validation is then an effective method to repeatedly split the training data into fitting and validation folds. Given the size and number of features of this dataset, stratified cross-validation with 10 folds and 1 repetition is used, which is efficient for a project of this size.
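
The sizes implied by these settings can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the variable names are illustrative and stratified folds may differ from the computed size by a row or two.

```python
n_rows = 25_850          # sampled observations
test_pct = 30            # hold-out share in percent
n_splits, n_repeats = 10, 1

# Hold-out split
n_test = n_rows * test_pct // 100
n_train = n_rows - n_test

# Each fold validates on roughly 1/10 of the training rows
fold_size, remainder = divmod(n_train, n_splits)

# Number of model fits per candidate parameter combination
fits_per_candidate = n_splits * n_repeats

print(n_train, n_test, fold_size, remainder, fits_per_candidate)
# prints: 18095 7755 1809 5 10
```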

In [133]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RepeatedStratifiedKFold

d_train, d_test, t_train, t_test = train_test_split(data, 
                                                    target, 
                                                    test_size = 0.3, 
                                                    random_state=999)
cv_method = RepeatedStratifiedKFold(n_splits=10,
                          n_repeats=1, 
                          random_state=999)
cv_method

RepeatedStratifiedKFold(n_repeats=1, n_splits=10, random_state=999)

As a result of the split, the training set has 18,095 observations and the test set has 7,755 records.

In [138]:
print(f'Shape of the data to train : {d_train.shape}')
print(f'Shape of the data to test : {d_test.shape}')
print(f'Shape of the target to train : {t_train.shape}')
print(f'Shape of the target to test : {t_test.shape}')

Shape of the data to train : (18095, 27)
Shape of the data to test : (7755, 27)
Shape of the target to train : (18095,)
Shape of the target to test : (7755,)


### Model Fitting and Tuning  <a id='mft'></a>

The best predictive model for these descriptive features and this target variable is determined by experimenting with different machine learning algorithms, all of them from the scikit-learn (`sklearn`) Python module. The models belong to the following types of algorithms: information-based, probability-based and similarity-based. The experimentation starts with the selection of a basic algorithm. The hyperparameters are then tuned on the training set to identify the best combination of values for each model. Finally, each model is evaluated with metrics such as accuracy and the confusion matrix.

### Model 1 : Decision Tree <a id='m1'></a>

First, information-based models may help to find the best model for predicting the location of a crime among the 21 areas of Los Angeles. A fundamental algorithm of this type is the decision tree. It is a supervised classification method that uses the descriptive features as conditional nodes and predicts the target value by routing each observation through those nodes.

#### Pipeline and parameters setting
Setting up a decision tree requires the following information: the number of descriptive features, the split criterion and the maximum depth of the tree. With the sklearn module it is possible to set a range of values for each of these parameters. For this dataset, the tuning grid is the following:

* Number of features: 10, 20 and all the features
* Criterion: Gini index and entropy
* Maximum depth of the tree: 8 to 20

These ranges were selected to give the search enough options to find the combination with the best prediction accuracy. Additionally, sklearn provides the Pipeline class, which chains different processing steps into a single estimator. In this project, the Pipeline joins the feature selection method and the model.
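
The size of this search space follows directly from the parameter ranges. The sketch below enumerates the combinations with `itertools.product`; the score functions are written as plain strings here purely for counting purposes.

```python
from itertools import product

# Same ranges as the tuning grid, with score functions as placeholder strings
grid = {
    "fselector__score_func": ["f_classif", "mutual_info_classif"],
    "fselector__k": [10, 20, 27],
    "dt__criterion": ["gini", "entropy"],
    "dt__max_depth": list(range(8, 21)),
}

candidates = list(product(*grid.values()))
n_folds = 10

print(len(candidates), len(candidates) * n_folds)
# prints: 156 1560
```

Every extra value in any one range multiplies the total number of fits, which is worth keeping in mind before widening the grid.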

###### Decision Tree

In [139]:
df_classifier = DecisionTreeClassifier(random_state=999)

params_pipe_DT_fs = {'fselector__score_func': [f_classif, mutual_info_classif],
                     'fselector__k': [10,20,data.shape[1]],
                     'dt__criterion': ['gini', 'entropy'],
                     'dt__max_depth': [8,9,10,11,12,13,14,15,16,17,18,19,20]}
 
pipe_DT_fs = Pipeline([('fselector', SelectKBest()), 
                    ('dt', df_classifier)])

gs_pipe_DT_fs  = GridSearchCV(estimator=pipe_DT_fs, 
                           param_grid=params_pipe_DT_fs, 
                           cv=cv_method,
                           n_jobs=-2,
                           scoring='accuracy', 
                           verbose=1)



#### Model Fitting

The grid search ran for about 5 minutes, fitting 10 folds for each of the 156 possible combinations.

Across these parameter combinations, the best decision tree pipeline uses mutual information as the feature selection method with k equal to 10, and the entropy criterion with a maximum depth of 20.

In [140]:
%%time
gs_pipe_DT_fs.fit(d_train,t_train);

Fitting 10 folds for each of 156 candidates, totalling 1560 fits
CPU times: total: 5.11 s
Wall time: 5min 2s


GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=1, n_splits=10, random_state=999),
             estimator=Pipeline(steps=[('fselector', SelectKBest()),
                                       ('dt',
                                        DecisionTreeClassifier(random_state=999))]),
             n_jobs=-2,
             param_grid={'dt__criterion': ['gini', 'entropy'],
                         'dt__max_depth': [8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
                                           18, 19, 20],
                         'fselector__k': [10, 20, 27],
                         'fselector__score_func': [<function f_classif at 0x0000024395A045E0>,
                                                   <function mutual_info_classif at 0x0000024395D62700>]},
             scoring='accuracy', verbose=1)

In [141]:
gs_pipe_DT_fs.best_estimator_[0].k

10

In [142]:
gs_pipe_DT_fs.best_estimator_

Pipeline(steps=[('fselector',
                 SelectKBest(score_func=<function mutual_info_classif at 0x0000024395D62700>)),
                ('dt',
                 DecisionTreeClassifier(criterion='entropy', max_depth=20,
                                        random_state=999))])

#### Metrics
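The metrics used to evaluate this algorithm are the best cross-validated accuracy found by the grid search, the accuracy on the test set, the ROC AUC score and the confusion matrix.

BEST CROSS-VALIDATION SCORE (ACCURACY)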

In [143]:
gs_pipe_DT_fs.best_score_

0.984415277815954

ACCURACY SCORE

In [144]:
t_pred = gs_pipe_DT_fs.predict(d_test)
metrics.accuracy_score(t_test, t_pred)

0.9865892972275951

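ROC AUC SCORE

In [145]:
# t_prob is not defined in the cells shown; it is assumed to hold the class probabilities of the tuned pipeline
t_prob = gs_pipe_DT_fs.predict_proba(d_test)
metrics.roc_auc_score(t_test, t_prob, multi_class='ovo')

0.9939905943794494

CONFUSION MATRIX

In [146]: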
matrix = tabulate(metrics.confusion_matrix(t_test, t_pred),
                           tablefmt = 'html')
from IPython.display import HTML, display
display(HTML(matrix))

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20
538,6,0,0,0,0,0,1,0,0,0,0,6,0,0,0,0,0,0,0,0
1,336,0,0,0,0,0,0,0,0,7,0,0,0,0,0,0,0,0,3,0
0,0,442,0,0,1,2,0,0,0,1,1,1,0,0,0,0,0,0,0,0
1,0,0,264,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
0,0,0,0,302,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
0,1,0,0,2,421,1,0,0,0,1,0,0,0,2,0,0,0,1,0,0
0,0,0,0,0,1,361,0,0,0,0,0,0,0,0,0,0,0,1,3,0
0,0,0,0,0,0,3,379,0,1,0,1,0,1,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,328,0,0,0,0,0,1,0,0,0,1,0,0
0,0,0,0,0,0,0,0,0,335,0,0,0,0,0,0,3,0,0,0,2


In general, the metrics show a positive result for the decision tree as a predictive model for the LAPD data. Both the cross-validated accuracy and the test accuracy are high, which means the model has a high chance of predicting the target area correctly. The confusion matrix is consistent with this result: most observations lie on the diagonal, so the model captures the majority of true positives.
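
The link between the confusion matrix and accuracy can be made explicit. In the convention used by `metrics.confusion_matrix`, rows are true classes and columns are predicted classes, so correct predictions sit on the diagonal. The sketch below uses a made-up 3-class matrix, not the results above.

```python
import numpy as np

# Made-up 3-class confusion matrix: rows are true classes, columns are predictions
cm = np.array([[50,  2,  0],
               [ 3, 45,  2],
               [ 0,  1, 47]])

# Accuracy is the share of observations on the diagonal
accuracy = np.trace(cm) / cm.sum()

# Per-class recall: diagonal divided by row totals
recall = np.diag(cm) / cm.sum(axis=1)

print(round(float(accuracy), 4), np.round(recall, 4))
```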

To explore other information-based models, this project tests the same data with a Random Forest algorithm, which trains each tree on a subset of the features of the dataset.

###### Random Forest
Random Forest is an alternative information-based algorithm. Each tree grows to its maximum depth to identify patterns among the features that allow the model to predict the response variable.

In [148]:
from sklearn.ensemble import RandomForestClassifier
rf_classifier = RandomForestClassifier(max_depth=None,
                                       min_samples_split=2,
                                       random_state=999)

params_pipe_RF_fs = {'fselector__score_func': [f_classif, mutual_info_classif],
                     'fselector__k': [10,20,data.shape[1]]}
 
pipe_RF_fs = Pipeline([('fselector', SelectKBest()), 
                    ('df', rf_classifier)])

gs_pipe_RF_fs  = GridSearchCV(estimator=pipe_RF_fs, 
                           param_grid=params_pipe_RF_fs, 
                           cv=cv_method,
                           n_jobs=-2,
                           scoring='accuracy', 
                           verbose=1)


#### Model Fitting
The grid search ran for less than 1 minute, fitting 10 folds for each of the 6 candidates, and selected just 10 descriptive features.

In [151]:
%%time
gs_pipe_RF_fs.fit(d_train, t_train);

Fitting 10 folds for each of 6 candidates, totalling 60 fits
CPU times: total: 4.34 s
Wall time: 32.9 s


GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=1, n_splits=10, random_state=999),
             estimator=Pipeline(steps=[('fselector', SelectKBest()),
                                       ('df',
                                        RandomForestClassifier(random_state=999))]),
             n_jobs=-2,
             param_grid={'fselector__k': [10, 20, 27],
                         'fselector__score_func': [<function f_classif at 0x0000024395A045E0>,
                                                   <function mutual_info_classif at 0x0000024395D62700>]},
             scoring='accuracy', verbose=1)

In [152]:
gs_pipe_RF_fs.best_estimator_[0].k

10

#### Metrics
The metrics used to evaluate this algorithm are the best cross-validated accuracy found by the grid search, the accuracy on the test set and the confusion matrix.

BEST CROSS-VALIDATION SCORE (ACCURACY)

In [153]:
gs_pipe_RF_fs.best_score_

0.9690518249757962

ACCURACY SCORE

In [154]:
t_pred = gs_pipe_RF_fs.predict(d_test)
metrics.accuracy_score(t_test, t_pred)

0.970341715022566

CONFUSION MATRIX

In [155]:
metrics.confusion_matrix(t_test, t_pred)
matrix = tabulate(metrics.confusion_matrix(t_test, t_pred),
                           tablefmt = 'html')
from IPython.display import HTML, display
display(HTML(matrix))

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20
533,2,0,0,0,0,1,1,0,0,0,0,14,0,0,0,0,0,0,0,0
15,305,0,0,0,0,0,0,0,0,13,0,0,0,0,0,0,0,0,14,0
0,2,426,0,0,2,1,0,0,0,0,8,6,1,0,0,0,0,0,2,0
4,0,0,258,0,0,0,0,0,0,4,0,0,0,0,0,0,0,0,0,0
0,1,0,0,302,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1,1,0,2,414,6,0,0,0,1,0,0,0,1,0,0,0,0,2,0
0,0,1,0,0,4,358,0,0,0,0,0,0,0,0,0,0,0,1,2,0
0,0,0,0,0,0,5,377,0,0,0,0,0,3,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,317,1,0,0,0,0,11,0,0,0,1,0,0
0,0,0,0,0,0,0,0,4,335,0,0,0,0,0,0,1,0,0,0,0


Similarly to the decision tree, the Random Forest predicts the target with considerable accuracy. This is reflected in the confusion matrix, where the majority of observations are true positives on the diagonal.

### Model 2 : Naive Bayes <a id='m2'></a>
Second, probabilistic models are an alternative worth exploring. This type of model estimates the probability of the descriptive features given each class and uses it to predict the target feature. For this model, feature selection is not part of the tuning. However, it requires a parameter that smooths the variance estimates of the descriptive features.

###### Gaussian Naive Bayes

In [165]:
# Full data
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import PowerTransformer

nb_classifier = GaussianNB()

params_NB = {'var_smoothing': np.logspace(10,-2, num=100)}

gs_NB_1000 = GridSearchCV(estimator=nb_classifier, 
                     param_grid=params_NB, 
                     cv=cv_method,
                     verbose=1, 
                     scoring='accuracy',
                     return_train_score=True)

# Make the features more Gaussian-like before fitting the model
d_train_transformed = PowerTransformer().fit_transform(d_train)

#### Model Fitting
The grid search ran for about 1.5 minutes, fitting 10 folds for each of the 100 candidate values (1,000 fits in total), and it selected a variance smoothing (`var_smoothing`) of approximately 0.163 for this dataset.

In [166]:
%%time
gs_NB_1000.fit(d_train_transformed,t_train)

Fitting 10 folds for each of 100 candidates, totalling 1000 fits
CPU times: total: 1min 23s
Wall time: 1min 24s


GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=1, n_splits=10, random_state=999),
             estimator=GaussianNB(),
             param_grid={'var_smoothing': array([1.00000000e+10, 7.56463328e+09, 5.72236766e+09, 4.32876128e+09,
       3.27454916e+09, 2.47707636e+09, 1.87381742e+09, 1.41747416e+09,
       1.07226722e+09, 8.11130831e+08, 6.13590727e+08, 4.64158883e+08,
       3.51119173e+08, 2.65608778e+08,...
       2.00923300e+00, 1.51991108e+00, 1.14975700e+00, 8.69749003e-01,
       6.57933225e-01, 4.97702356e-01, 3.76493581e-01, 2.84803587e-01,
       2.15443469e-01, 1.62975083e-01, 1.23284674e-01, 9.32603347e-02,
       7.05480231e-02, 5.33669923e-02, 4.03701726e-02, 3.05385551e-02,
       2.31012970e-02, 1.74752840e-02, 1.32194115e-02, 1.00000000e-02])},
             return_train_score=True, scoring='accuracy', verbose=1)

In [167]:
gs_NB_1000.best_estimator_

GaussianNB(var_smoothing=0.16297508346206402)

#### Metrics

BEST CROSS-VALIDATION SCORE (ACCURACY)

In [168]:
gs_NB_1000.best_score_

0.225643574637555

ACCURACY SCORE

In [169]:
t_pred = gs_NB_1000.predict(d_test)
metrics.accuracy_score(t_test, t_pred)

0.04332688588007737

CONFUSION MATRIX

In [170]:
metrics.confusion_matrix(t_test, t_pred)
matrix = tabulate(metrics.confusion_matrix(t_test, t_pred),
                           tablefmt = 'html')
from IPython.display import HTML, display
display(HTML(matrix)) #18 & 21

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20
0,0,0,0,0,0,0,0,6,0,0,2,0,0,7,148,0,0,388,0,0
0,0,0,0,0,0,0,0,4,0,0,0,0,0,1,68,0,0,274,0,0
0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,43,0,0,403,0,0
0,0,0,0,0,0,0,0,2,0,0,1,0,0,2,21,0,0,240,0,0
0,0,0,0,0,0,0,0,1,0,0,1,0,0,2,65,0,0,234,0,0
0,0,0,0,0,0,0,0,1,0,0,4,0,0,2,168,0,0,254,0,0
0,0,0,0,0,0,0,0,4,0,0,1,0,0,2,125,0,0,234,0,0
0,0,0,0,0,0,0,0,4,0,0,1,0,0,5,225,0,0,150,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,125,0,0,205,0,0
0,0,0,0,0,0,0,0,1,0,0,0,0,0,3,123,0,0,213,0,0


Overall, the results for this model are not promising: both the cross-validated accuracy (0.226) and the test accuracy (0.043) are very low. Therefore, this model is not viable for predictions. Note that the training data was power-transformed while `d_test` was passed to `predict()` untransformed, which likely contributes to the gap between the two scores. Additionally, the confusion matrix shows that the model assigns nearly all records to just two of the 21 classes, so most predictions are false positives and the model cannot distinguish between the target areas.
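
In the cells above, the training data goes through `PowerTransformer` before fitting, while `d_test` is passed to `predict()` as it is. One way to keep both sets on the same scale is to place the transformer and the classifier in a single `Pipeline`. The sketch below illustrates the idea on synthetic data; the step names `pt` and `nb` are arbitrary choices and this is not the code used for the reported results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer

# Synthetic stand-in data, not the crime dataset
X, y = make_classification(n_samples=400, n_features=8, n_informative=4,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=999)

# The transformer is fitted on the training rows and then reused,
# unchanged, whenever the pipeline predicts on new rows
nb_pipe = Pipeline([("pt", PowerTransformer()),
                    ("nb", GaussianNB())])
nb_pipe.fit(X_train, y_train)

test_accuracy = nb_pipe.score(X_test, y_test)
print(round(test_accuracy, 3))
```

If such a pipeline were tuned with `GridSearchCV`, the smoothing parameter would be addressed as `nb__var_smoothing`, following the same step-name convention as the other pipelines in this report.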

### Model 3 : KNN <a id='m3'></a>
On the other hand, there are similarity-based methods that may fit better the type of prediction this project is looking for. The most common one is KNN (k-nearest neighbours), which looks at the nearest neighbours of an observation and infers the class of the target from them, in this case one of the 21 areas of Los Angeles.

The parameter grid covers three possible numbers of features [10, 20, 27], from 1 to 5 neighbours, and the Manhattan (p=1) and Euclidean (p=2) distance measures.
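
The two distance measures are special cases of the Minkowski distance, which is what the `p` parameter of `KNeighborsClassifier` controls. The helper below is a hypothetical illustration written for this report, not a scikit-learn function.

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance between two points; p=1 is Manhattan, p=2 is Euclidean."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

manhattan = minkowski([0, 0], [3, 4], p=1)  # 3 + 4
euclidean = minkowski([0, 0], [3, 4], p=2)  # sqrt(9 + 16)

print(manhattan, euclidean)
# prints: 7.0 5.0
```

Because both measures add up differences across all features, they are sensitive to feature scale, which is one reason the descriptive features are rescaled before modelling.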

In [172]:
from sklearn.model_selection import RepeatedKFold
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, GridSearchCV
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif

params_pipe_KNN = {'fselector__score_func': [f_classif, mutual_info_classif],
                   'fselector__k': [10, 20, data.shape[1]],
                   'knn__n_neighbors': [1, 2, 3, 4, 5],
                   'knn__p': [1, 2]}
 
pipe_KNN = Pipeline([('fselector', SelectKBest()), 
                     ('knn', KNeighborsClassifier())])

gs_pipe_KNN = GridSearchCV(estimator=pipe_KNN, 
                           param_grid=params_pipe_KNN, 
                           cv=cv_method,
                           scoring='accuracy',
                           verbose=1) 


#### Model Fitting
The grid search tests 60 possible combinations. It took around 16 minutes to complete.

In [182]:
%%time
gs_pipe_KNN.fit(d_train, t_train)

Fitting 10 folds for each of 60 candidates, totalling 600 fits
CPU times: total: 16min 42s
Wall time: 15min 34s


GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=1, n_splits=10, random_state=999),
             estimator=Pipeline(steps=[('fselector', SelectKBest()),
                                       ('knn', KNeighborsClassifier())]),
             param_grid={'fselector__k': [10, 20, 27],
                         'fselector__score_func': [<function f_classif at 0x0000024395A045E0>,
                                                   <function mutual_info_classif at 0x0000024395D62700>],
                         'knn__n_neighbors': [1, 2, 3, 4, 5],
                         'knn__p': [1, 2]},
             scoring='accuracy', verbose=1)

In [184]:
gs_pipe_KNN.best_estimator_[0].k

10

In [185]:
gs_pipe_KNN.best_estimator_

Pipeline(steps=[('fselector',
                 SelectKBest(score_func=<function mutual_info_classif at 0x0000024395D62700>)),
                ('knn', KNeighborsClassifier(n_neighbors=3, p=1))])

#### Metrics

CROSS-VALIDATION ACCURACY (BEST SCORE)

In [186]:
gs_pipe_KNN.best_score_

0.18905698640010504

ACCURACY SCORE

In [187]:
t_pred = gs_pipe_KNN.predict(d_test)
metrics.accuracy_score(t_test, t_pred)

0.13191489361702127

CONFUSION MATRIX

In [188]:
metrics.confusion_matrix(t_test, t_pred)
matrix = tabulate(metrics.confusion_matrix(t_test, t_pred),
                           tablefmt = 'html')
from IPython.display import HTML, display
display(HTML(matrix))

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20
204,53,43,30,13,36,33,21,24,7,12,21,12,9,3,2,2,10,2,8,6
81,42,33,26,10,39,15,16,6,6,12,20,8,8,4,3,2,1,0,14,1
93,41,92,33,15,23,24,21,9,7,9,42,19,7,3,2,1,2,1,4,0
55,38,22,52,6,9,14,9,7,3,10,17,6,3,5,1,2,3,1,1,2
36,31,34,27,58,17,6,17,3,5,5,26,4,6,3,1,1,18,2,2,1
83,44,37,25,18,57,32,29,16,12,22,8,5,10,9,8,2,4,0,5,3
83,46,37,17,9,41,36,25,6,7,12,8,4,9,6,0,2,1,4,8,5
57,36,30,15,9,58,35,44,18,19,8,9,6,15,6,2,6,3,1,5,3
51,22,22,11,16,40,27,20,37,17,10,8,6,5,15,3,7,1,7,1,4
26,27,34,20,9,46,21,30,24,34,10,10,0,7,9,14,4,2,6,4,3


The best cross-validated accuracy from the grid search is close to 19%, and the accuracy on the test set is about 13%. Both values are low, although uniform random guessing across 21 areas would give roughly 5%. The diagonal of the confusion matrix shows that a share of the records in each area is classified correctly, but many records are still assigned to the wrong area, most often to area 0. This model may need a wider parameter grid to improve its accuracy. Scikit-learn also offers related algorithms such as Nearest Centroid, which changes the paradigm: each class is represented by the centroid of its observations, and a new observation is classified by its distance to those centroids.
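The nearest-centroid idea can be sketched in a few lines: each class is summarised by the feature-wise mean of its training points, and a new observation receives the label of the closest centroid. The toy data and helper names below are hypothetical, and the sketch leaves out the `shrink_threshold` option.

```python
from math import dist

def fit_centroids(X, y):
    """Return {label: centroid}, where each centroid is the feature-wise mean."""
    groups = {}
    for row, label in zip(X, y):
        groups.setdefault(label, []).append(row)
    return {
        label: tuple(sum(col) / len(rows) for col in zip(*rows))
        for label, rows in groups.items()
    }

def predict(centroids, row):
    """Assign the label of the centroid with the smallest Euclidean distance."""
    return min(centroids, key=lambda label: dist(centroids[label], row))

# Hypothetical observations with two features and two classes
X = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
y = ['area_a', 'area_a', 'area_b', 'area_b']
centroids = fit_centroids(X, y)
label = predict(centroids, (2.0, 1.5))
print(centroids, label)
```

Since only one centroid per class is stored, fitting is far cheaper than a KNN search, at the cost of a much coarser decision boundary.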

###### Nearest Centroid

In [191]:
from sklearn.neighbors import NearestCentroid
KNN_centroid = NearestCentroid()

params_KNN_centroid = {'metric': ['euclidean','manhattan','haversine','cosine'],
                      'shrink_threshold':[0.1,0.2,0.3,0.4,0.5,0.7,0.9]}

gs_KNN_centroid = GridSearchCV(estimator=KNN_centroid, 
                     param_grid=params_KNN_centroid, 
                     cv=cv_method,
                     verbose=1, 
                     scoring='accuracy',
                     return_train_score=True)

#### Model Fitting

In [192]:
%%time
gs_KNN_centroid.fit(d_train, t_train)

Fitting 10 folds for each of 28 candidates, totalling 280 fits
CPU times: total: 3.45 s
Wall time: 5.39 s


GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=1, n_splits=10, random_state=999),
             estimator=NearestCentroid(),
             param_grid={'metric': ['euclidean', 'manhattan', 'haversine',
                                    'cosine'],
                         'shrink_threshold': [0.1, 0.2, 0.3, 0.4, 0.5, 0.7,
                                              0.9]},
             return_train_score=True, scoring='accuracy', verbose=1)

In [193]:
gs_KNN_centroid.best_estimator_

NearestCentroid(shrink_threshold=0.4)

#### Metrics

CROSS-VALIDATION ACCURACY (BEST SCORE)

In [194]:
gs_KNN_centroid.best_score_

0.11069337169279445

ACCURACY SCORE

In [195]:
t_pred = gs_KNN_centroid.predict(d_test)
metrics.accuracy_score(t_test, t_pred)

0.1102514506769826

CONFUSION MATRIX

In [196]:
metrics.confusion_matrix(t_test, t_pred)
matrix = tabulate(metrics.confusion_matrix(t_test, t_pred),
                           tablefmt = 'html')
from IPython.display import HTML, display
display(HTML(matrix))

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20
26,0,3,186,0,0,4,137,0,40,0,126,3,0,0,0,3,0,0,20,3
4,0,3,177,0,0,2,63,0,29,0,42,10,0,0,0,0,0,0,14,3
5,0,4,179,0,0,1,41,0,26,0,168,8,1,0,0,0,0,0,12,3
7,0,0,202,0,0,1,19,0,7,0,14,10,0,0,0,0,0,0,4,2
1,0,1,161,0,0,2,65,0,15,0,46,7,0,0,0,0,0,0,5,0
5,0,2,123,0,0,2,163,0,43,0,74,1,0,0,0,1,0,0,10,5
8,0,2,91,0,0,2,116,0,42,0,75,3,0,0,0,2,0,0,25,0
7,0,3,45,0,0,3,212,0,67,0,27,0,0,0,0,0,0,0,17,4
1,0,2,105,0,0,1,114,0,63,0,29,4,0,0,0,0,0,0,9,2
2,0,1,118,0,0,2,113,0,59,0,32,3,0,0,0,0,0,0,8,2


The results of this algorithm do not indicate a strong model for this type of dataset: the test accuracy (0.110) is lower than that of the original KNN algorithm (0.132), and the best cross-validation score is also considerably lower (0.111 against 0.189).

The previous experiments suggest that the decision tree, random forest and K-nearest neighbors algorithms are able to predict the target feature with a certain level of confidence and precision. However, before the project can run a paired comparison among the three models, there is a final algorithm that could reach a useful level of accuracy: the neural network.

### Model Fitting and Tunning for Neural Network  <a id='mftnn'></a>
Finally, a Neural Network (NN) is an algorithm loosely inspired by the way neurons work in humans, applied here to predictive models. It is able to identify relationships and patterns across the feature variables. The features are used as input and passed through a number of layers, each of which contains a number of neurons representing possible relationships. Layer by layer, the algorithm performs calculations to predict the target value.
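The layer-by-layer calculation can be illustrated with a forward pass through one hidden layer: each neuron computes a weighted sum of its inputs and applies an activation (ReLU here), and the output layer turns the final sums into class probabilities with a softmax. The weights below are arbitrary illustrative numbers, not values learned in this project.

```python
from math import exp

def relu(values):
    return [max(0.0, v) for v in values]

def softmax(values):
    shifted = [exp(v - max(values)) for v in values]  # subtract the max for stability
    total = sum(shifted)
    return [s / total for s in shifted]

def dense(inputs, weights, biases):
    """One layer: weights[j] holds the input weights of neuron j."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

x = [0.5, -1.0, 2.0]                         # one observation with 3 features
W1 = [[0.2, -0.1, 0.4], [-0.3, 0.8, 0.1]]    # hidden layer with 2 neurons
b1 = [0.0, 0.1]
W2 = [[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]]  # output layer with 3 classes
b2 = [0.0, 0.0, 0.0]

hidden = relu(dense(x, W1, b1))
probs = softmax(dense(hidden, W2, b2))
predicted_class = probs.index(max(probs))
print(hidden, probs, predicted_class)
```

Training consists of adjusting `W1`, `b1`, `W2` and `b2` so that the predicted probabilities match the observed classes, which is the job of the solver.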

Initially, the Neural Network algorithm can be run with a few variations to identify which solver, activation function and learning rate schedule work best for the LAPD data. Once this run provides results, it is possible to see which parameters are more convenient for the model.

The parameters to tune are the following:
* The activation function of the model: relu or logistic
* The type of solver: stochastic gradient descent (SGD) or Adaptive Moment Estimation (Adam)
* The learning rate schedule: constant or adaptive
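The two solvers differ in their update rule. As a rough sketch for a single weight: plain SGD moves against the gradient scaled by the learning rate, while Adam rescales the step using running averages of the gradient and of its square. The constants `beta1=0.9`, `beta2=0.999` and `eps=1e-8` are the commonly used defaults; the functions are illustrative and are not scikit-learn's internal code (its `sgd` solver also applies momentum by default, which is omitted here).

```python
from math import sqrt

def sgd_step(theta, grad, lr):
    """Plain stochastic gradient descent update, without momentum."""
    return theta - lr * grad

def adam_step(theta, grad, state, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; `state` carries (m, v, t) between calls."""
    m, v, t = state
    t += 1
    m = beta1 * m + (1 - beta1) * grad        # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (sqrt(v_hat) + eps)
    return theta, (m, v, t)

theta0, grad, lr = 1.0, 2.0, 0.01             # hypothetical weight, gradient, rate
theta_sgd = sgd_step(theta0, grad, lr)
theta_adam, state = adam_step(theta0, grad, (0.0, 0.0, 0), lr)
print(theta_sgd, theta_adam)
```

In this sketch the SGD step is proportional to the gradient, whereas the first Adam step has a size close to the learning rate regardless of the gradient's scale.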

In [197]:
from sklearn.neural_network import MLPClassifier

MLP = MLPClassifier(random_state= 999)

params_MLP = {'activation': ['relu','logistic'],
              'early_stopping': [True],
              'solver': ['sgd', 'adam'],
              'learning_rate': ['constant','adaptive'],
              'random_state' : [999]}

gs_MLP = GridSearchCV(estimator=MLP, 
                     param_grid=params_MLP, 
                     cv=cv_method,
                     verbose=1, 
                     scoring='accuracy',
                     return_train_score=True)

#### Model Fitting
This grid search tests 8 parameter combinations (80 fits) and takes about 4.5 minutes.

In [199]:
%%time
gs_MLP.fit(d_train, t_train)

Fitting 10 folds for each of 8 candidates, totalling 80 fits
CPU times: total: 1min 10s
Wall time: 4min 30s


GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=1, n_splits=10, random_state=999),
             estimator=MLPClassifier(random_state=999),
             param_grid={'activation': ['relu', 'logistic'],
                         'early_stopping': [True],
                         'learning_rate': ['constant', 'adaptive'],
                         'random_state': [999], 'solver': ['sgd', 'adam']},
             return_train_score=True, scoring='accuracy', verbose=1)

From this initial run of the neural network algorithm, it is possible to identify the best parameters for a predictive model with the current data.

In [200]:
gs_MLP.best_params_

{'activation': 'relu',
 'early_stopping': True,
 'learning_rate': 'constant',
 'random_state': 999,
 'solver': 'adam'}

In [201]:
gs_MLP.best_score_

0.1202533679057139

With these parameters set, it is possible to tune other parameters that also affect the model performance. For instance, the batch size and the initial learning rate are flexible parameters that can vary across a wide range of values, so a search can identify their best combination.

BATCH SIZE VARIATION

Initially, the batch size varies between 250, 500 and 750 to see how much effect it has on the model and which batch size is more convenient.
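The batch size determines how many weight updates happen in each pass over the training data: smaller batches mean more, and noisier, updates per epoch. A quick sketch, assuming a hypothetical training set of 20,000 records (the actual size of `d_train` is not shown in this section):

```python
from math import ceil

n_train = 20_000               # hypothetical number of training records
batch_sizes = [250, 500, 750]

# Each epoch is split into ceil(n_train / batch_size) mini-batches,
# and the solver performs one weight update per mini-batch.
updates_per_epoch = {b: ceil(n_train / b) for b in batch_sizes}
print(updates_per_epoch)
```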

In [202]:
MLP_BATCH = MLPClassifier(random_state=999)

params_MLP_BATCH = {'activation': ['relu'],
                    'early_stopping': [False],
                    'solver': ['adam'],
                    'batch_size': [250, 500, 750],
                    'learning_rate_init': [0.001],
                    'learning_rate': ['constant'],
                    'random_state': [999]}

gs_MLP_BATCH = GridSearchCV(estimator=MLP_BATCH, 
                     param_grid=params_MLP_BATCH,
                     cv=cv_method,
                     verbose=1, 
                     scoring='accuracy',
                     return_train_score=True)

#### Model Fitting
This grid search runs 30 fits in just under 6 minutes.

In [203]:
%%time
gs_MLP_BATCH.fit(d_train, t_train)

Fitting 10 folds for each of 3 candidates, totalling 30 fits
CPU times: total: 1min 32s
Wall time: 5min 52s


GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=1, n_splits=10, random_state=999),
             estimator=MLPClassifier(random_state=999),
             param_grid={'activation': ['relu'], 'batch_size': [250, 500, 750],
                         'early_stopping': [False],
                         'learning_rate': ['constant'],
                         'learning_rate_init': [0.001], 'random_state': [999],
                         'solver': ['adam']},
             return_train_score=True, scoring='accuracy', verbose=1)

In [205]:
gs_MLP_BATCH.best_estimator_

MLPClassifier(batch_size=250, random_state=999)

In [206]:
t_pred = gs_MLP_BATCH.predict(d_test)
metrics.accuracy_score(t_test, t_pred)

0.1400386847195358

LEARNING RATE VARIATION

Then, the constant learning rate of the NN is varied over 0.005, 0.01 and 0.05 to determine the best learning speed for this model.

In [207]:
MLP_LR = MLPClassifier(random_state=999)
params_MLP_LR = {'activation': ['relu'],
                 'early_stopping': [False],
                 'solver': ['adam'],
                 'batch_size': [250],
                 'learning_rate_init': [0.005, 0.01, 0.05],
                 'learning_rate': ['constant'],
                 'random_state': [999]}


gs_MLP_LR = GridSearchCV(estimator=MLP_LR,
                         param_grid=params_MLP_LR,
                     cv=cv_method,
                     verbose=1, 
                     scoring='accuracy',
                     return_train_score=True)

#### Model Fitting
This grid search runs 30 fits in about 6.5 minutes.

In [208]:
%%time
gs_MLP_LR.fit(d_train, t_train)

Fitting 10 folds for each of 3 candidates, totalling 30 fits
CPU times: total: 1min 16s
Wall time: 6min 31s


GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=1, n_splits=10, random_state=999),
             estimator=MLPClassifier(random_state=999),
             param_grid={'activation': ['relu'], 'batch_size': [250],
                         'early_stopping': [False],
                         'learning_rate': ['constant'],
                         'learning_rate_init': [0.005, 0.01, 0.05],
                         'random_state': [999], 'solver': ['adam']},
             return_train_score=True, scoring='accuracy', verbose=1)

In [211]:
gs_MLP_LR.best_estimator_

MLPClassifier(batch_size=250, learning_rate_init=0.01, random_state=999)

In [212]:
t_pred = gs_MLP_LR.predict(d_test)
metrics.accuracy_score(t_test, t_pred)

0.36002578981302386

#### Final NN Model 
The previous experiments have shown that the best-performing Neural Network model is the one with the following parameters:
* ReLU activation
* Batch size of 250 observations
* Constant learning rate of 0.01
* Adam solver (Adaptive Moment Estimation)

#### Metrics

CROSS-VALIDATION ACCURACY (BEST SCORE)

In [214]:
gs_MLP_LR.best_score_

0.3437415745092829

ACCURACY SCORE

In [216]:
t_pred = gs_MLP_LR.predict(d_test)
metrics.accuracy_score(t_test, t_pred)

0.36002578981302386

CONFUSION MATRIX

In [217]:
metrics.confusion_matrix(t_test, t_pred)
matrix = tabulate(metrics.confusion_matrix(t_test, t_pred),
                           tablefmt = 'html')
from IPython.display import HTML, display
display(HTML(matrix))

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20
248,20,1,2,7,0,1,0,0,0,102,4,79,0,0,0,0,83,0,4,0
34,48,7,0,34,12,1,0,0,0,34,77,30,1,0,1,0,42,0,26,0
9,16,132,0,8,40,16,0,0,0,3,159,8,2,8,19,1,2,1,24,0
29,0,0,205,0,0,0,0,0,0,19,1,7,0,0,0,0,5,0,0,0
25,32,11,0,33,20,1,0,0,0,31,81,30,0,0,0,1,29,0,9,0
0,3,97,0,0,152,46,2,0,0,1,58,0,8,33,14,0,1,0,14,0
0,0,83,0,1,70,50,0,0,0,0,18,0,21,79,30,0,0,0,14,0
0,0,2,1,0,0,3,97,47,21,0,0,0,114,36,14,18,0,28,0,4
0,0,0,0,0,0,1,64,86,1,0,0,0,84,1,2,2,0,89,0,0
0,0,0,0,0,0,0,14,14,110,0,0,0,3,0,0,133,0,39,0,27


The NN model reaches an accuracy score of 0.36, which is lower than the decision tree but higher than the KNN and Nearest Centroid models. The selected parameters show how the model improved from the initial settings to the final set, which suggests that it could improve further if the range of parameters were extended. However, there is not enough evidence that it performs better than the decision tree model.

### Model Comparison  <a id='mc'></a>

Model comparison is part of model evaluation. This stage of the data science project cycle presents the performance of the different models and evaluates them against each other to determine which model fits best. There are 4 candidate models to compare: Decision Tree (DT), Random Forest (RF), K-Nearest Neighbors (KNN) and the final Neural Network model (NN). The other models were discarded because their performance is lower than that of these candidates.

The comparison uses the cross-validation results of these 4 models, with the accuracy score as the point of reference.

#### Cross-Validation

In [224]:
%%time
from sklearn.model_selection import cross_val_score

cv_method_ttest = RepeatedStratifiedKFold(n_splits=10, 
                                          n_repeats=1, 
                                          random_state=999)

# Decision Tree
cv_results_DT = cross_val_score(estimator=gs_pipe_DT_fs.best_estimator_,
                                 X=data,
                                 y=target, 
                                 cv=cv_method_ttest, 
                                 n_jobs=-2,
                                 scoring='accuracy')
print(f'DT : {cv_results_DT.mean().round(3)}')

# Random Forest
cv_results_RF = cross_val_score(estimator=gs_pipe_RF_fs.best_estimator_,
                                X=data,
                                y=target, 
                                cv=cv_method_ttest, 
                                n_jobs=-2,
                                scoring='accuracy')
print(f'RF : {cv_results_RF.mean().round(3)}')

# KNN
cv_results_KNN = cross_val_score(estimator=gs_pipe_KNN.best_estimator_,
                                X=data,
                                y=target, 
                                cv=cv_method_ttest, 
                                n_jobs=-2,
                                scoring='accuracy')
print(f'KNN : {cv_results_KNN.mean().round(3)}')

# MLP_LR
cv_results_MLP_LR = cross_val_score(estimator=gs_MLP_LR.best_estimator_,
                                X=data,
                                y=target, 
                                cv=cv_method_ttest, 
                                n_jobs=-2,
                                scoring='accuracy')

print(f'MLP_LR : {cv_results_MLP_LR.mean().round(3)}')



DT : 0.986
RF : 0.974
KNN : 0.148
MLP_LR : 0.368
CPU times: total: 31.2 ms
Wall time: 49.4 s


In [228]:
print('DT : ',cv_results_DT.mean().round(3))
print('RF : ',cv_results_RF.mean().round(3))
print('KNN : ',cv_results_KNN.mean().round(3))
print('MLP_LR : ',cv_results_MLP_LR.mean().round(3))

DT :  0.986
RF :  0.974
KNN :  0.148
MLP_LR :  0.368


Consistent with the previous analysis of the metrics for each model, the KNN and NN algorithms show low average performance. The information-based algorithms perform much better, with similar mean accuracy scores, and the decision tree appears to do slightly better than the random forest. To test this, the next step is a paired t-test on the cross-validation accuracy scores.

#### Paired t-test

The paired t-test states the null hypothesis that the mean accuracies of both models are statistically equal (H0: μ1 = μ2); the alternative hypothesis is that they differ (H1: μ1 ≠ μ2).
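The paired t statistic is computed from the fold-by-fold differences between two models, which is appropriate here because both models are evaluated on the same cross-validation folds. A minimal sketch of the calculation with hypothetical fold accuracies (not the results of this project); `scipy.stats.ttest_rel` computes the same statistic and additionally returns the p-value.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees of freedom) with t = mean(d) / (sd(d) / sqrt(n))."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]  # per-fold differences
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / sqrt(n)), n - 1

# Hypothetical accuracies of two models on the same 5 folds
model_a = [0.80, 0.82, 0.79, 0.81, 0.83]
model_b = [0.75, 0.78, 0.76, 0.74, 0.77]
t_stat, dof = paired_t_statistic(model_a, model_b)
print(round(t_stat, 3), dof)
```

A large absolute value of the statistic relative to the t distribution with `n - 1` degrees of freedom leads to a small p-value and therefore to rejecting H0.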

In [236]:
from scipy import stats

# DT and KNN
print(stats.ttest_rel(cv_results_DT, cv_results_KNN))
# DT and NN
print(stats.ttest_rel(cv_results_DT,cv_results_MLP_LR))

# TO check equality between DT and RF
print(stats.ttest_rel(cv_results_DT, cv_results_RF))

# PLOTS 

Ttest_relResult(statistic=409.02794328819067, pvalue=1.588541092969237e-20)
Ttest_relResult(statistic=93.08419476804596, pvalue=9.663686706204715e-15)
Ttest_relResult(statistic=12.779141270558965, pvalue=4.497891631877257e-07)


The t-tests show that every p-value is smaller than the level of significance (0.05). Therefore, the null hypothesis that the mean accuracies of these models are statistically equal is rejected in each case. In particular, the difference in mean accuracy between the decision tree and the random forest models is statistically significant.

The results of this test provide enough evidence that the decision tree is the best-fitting model for this particular case.

## Limitations  <a id='l'></a>

Overall, the limitations of this project are related to the type of descriptive variables involved and the computing resources available for the modelling. However, these limitations do not prevent the project from experimenting with the current dataset and the modelling tools available.

Firstly, the current descriptive variables may not be able to describe the area of crime well enough for classification. The features seem to be correlated with the area of crime in LA, but not all of them are significant for a classification task. For example, the hour at which the crime occurred is a time variable with an extra level of complexity, which can affect the modelling process. Additionally, LAPD should collect more informative features about the 21 areas.

Although a larger dataset normally provides extra information for a better model, it also has disadvantages. One of them is that processing a large dataset requires time and computing resources to run a variety of possible tuning combinations. For the scope of this project, using a local computer for modelling is a limitation.

Finally, although it is not a clear-cut limitation, a multinomial target adds an extra level of complexity to the model evaluation. There are few ready-made visualizations for understanding the metrics of such a model, and the possible plots require a high level of programming proficiency.

## Summary & Conclusions  <a id='sic'></a>
### Project Summary  <a id='ps'></a>

In summary, the Los Angeles Police Department criminal records form a dataset with more than 600,000 observations recorded since 2020. This vast dataset contains over 21 descriptive features for each crime incident in the city. One of the goals of the Police Department is to guarantee citizen safety. Hence, it is important to work on technology tools able to predict where crime incidents may occur across the city. For this reason, the project focused on researching and experimenting with machine learning models, because an ML model is able to classify or predict a target variable; in this case, an area that is at risk of crime.

This project was completed in two phases. The first phase helped to understand the characteristics of the data collected. The second phase experimented with a range of machine learning algorithms to propose the best-fitting model for the current data. This summary describes the general points of both phases.

First, Phase 1 prepared the data to be processed by the ML models and described the variables and values of the LAPD database. Initially, the data was cleaned of noisy values and tidied by removing missing values and renaming incorrect values. Then, descriptive tools were applied to identify important findings about the areas and the victim demographics in the 21 areas of Los Angeles. For instance, there are areas where the female population is more affected than the male population. The data also shows that area 1 of Los Angeles has a high frequency of crimes compared to the other 20 areas. From this analysis, the project was able to continue to the next phase, where the data was processed with different machine learning models.

Secondly, Phase 2 experimented with different algorithms, including a Neural Network, to determine which one fits the current variables best. The sample data was split into training and testing sets following standard ML procedure. After that, 4 different algorithms were tuned over their main parameters, because tuning helps to obtain the best-fitting model with the current knowledge and data. Finally, the 4 models were compared to identify which one performs best. The information-based algorithms worked better than the other algorithm paradigms for this specific data. As a conclusion of this phase, one model performed better than the rest, but it requires further development before it can determine the area of crime for LAPD.


### Findings  <a id='fffff'></a>

This project generated a variety of results regarding the type of data used and the machine learning methods that can be applied. This is a summary of the findings:

* The data collected from the LAPD contains information about the type of crime and the victim demographics, including the area where the crime occurred. The project shows that some of the 21 areas of LA are more affected than others.

* Continuing with the data description, the mean age of the victims fluctuates between 35 and 45, and there are areas where the male population is less affected than the female population.

* Selecting the best features produced two important findings. Firstly, with the F-score selection method, a larger number of features (p) seems to provide better performance than a small number of features. Secondly, with the mutual information method, the results appear consistent from one value of p to the next, although it does not produce the same ranking of features.

* In terms of the machine learning research, three models were first tested to create a predictive model for the LAPD. The decision tree algorithm has high accuracy and F-score, which means it can capture the data and predict the area of crime properly. The Naive Bayes model is not viable for predictions because of its very low F-score, accuracy score and mAP. Finally, the KNN model reaches a cross-validated accuracy close to 20%, which is far below the decision tree.

* In addition to the previous findings, the Neural Network algorithm was also tested to create a predictive model for the LAPD. This algorithm is loosely inspired by how neurons work in humans and applies that idea to predictive models. However, the NN model reaches an accuracy score of 0.36, which is lower than the decision tree but higher than the KNN model.

* It is possible that the performance of the models would be better if there were more variables describing the area of crime. For example, LAPD could collect information about the industries in the area, the type of communities and the population size, among others. This type of variable can feed the models with more information about the area where the crime happened to predict future incidents.

### Conclusion  <a id='c'></a>
LAPD requires a machine learning model that can use time, space and descriptive variables to forecast the areas that are more vulnerable to crime occurrences. The decision tree algorithm has high accuracy and F-score, which means it can capture the data and predict the area of crime properly. The Naive Bayes, KNN and NN models have low performance, but they could improve if the quality of the variables improves by adding descriptive features related to the area of crime. Overall, this approach is beneficial for the LAPD: knowing where to assign police force across the different areas of the city helps to reduce the level of crime and guarantee safety within the community of Los Angeles.
