# Making predictions on the NYC criminality dataset

Antonio Ríos-Vila

In the previous notebook I shared, we performed an exploratory analysis of the data, mainly the New York City Police Department's historical records of arrests and complaints. As I said there, the dataset is large enough to experiment with predictions using simple Machine Learning techniques, that is, methods that are not based on deep learning. So, in this notebook, it is time to explore how easy or hard it is to predict some of the variables in this dataset.
The focus of this work is to predict where crimes occur (given some data) with both a classification approach, where we try to guess the borough in which the crime took place, and a regression approach, where we try to predict the exact location of the complaint.

First, we import all the libraries and tools we need.

In [None]:
import numpy as np 
import pandas as pd

from sklearn.linear_model import LogisticRegression, LinearRegression, Ridge, Lasso
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC, SVC, SVR, LinearSVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier


from sklearn.model_selection import StratifiedKFold, KFold 
from sklearn.preprocessing import MinMaxScaler, StandardScaler, label_binarize
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import f1_score, r2_score, roc_curve, auc
from sklearn.utils.random import sample_without_replacement
from sklearn.utils import shuffle
from sklearn import preprocessing

from itertools import cycle

import seaborn as sns
import matplotlib.pyplot as plt
import os

## Utility Functions

Before diving in, I wrote some utility functions that are called throughout this notebook; wrapping this logic in functions avoids code repetition.

In [None]:
# Function to train the classification problem with 10 - Fold CV
def trainClassificationCV(X,Y,numclasses, ylabel):
    #We define the KFOLD separator
    cv = StratifiedKFold(n_splits=10)
    
    #Some utility variables that will be useful during our train/test process
    mean_score_logistic = 0
    mean_score_NB = 0
    mean_score_SVM = 0
    mean_score_SVMK = 0
    iteration = 0
    
    #I would like to know the F1 scores of the models, as precision and recall are important when predicting
    #the target label we want to describe
    f1_scores_logistic = []
    f1_scores_NB = []
    f1_scores_SVM = []
    f1_scores_SVMK = []
    
    # I will also store all the historic variables of the models to obtain a ROC curve for each fold and method
    roc_curve_logistic = []
    roc_curve_NB = []
    roc_curve_SVM = []
    roc_curve_SVMK = []
    
    roc_curve_data = []
    
    for train_idx, test_idx in cv.split(X,Y):
        #Split into train_test datasets
        x_train, y_train = X.iloc[train_idx], Y.iloc[train_idx]
        x_test, y_test = X.iloc[test_idx], Y.iloc[test_idx]
        
        #We define the models to work with: Logistic regression, Naive Bayes, Linear SVM and Kernel SVM
        model_logistic = LogisticRegression(C=100, max_iter=10000)
        model_NB = GaussianNB()
        model_SVM = LinearSVC(random_state=1, tol=1e-5)
        model_SVMK = SVC(gamma="auto", C=10000)
        
        #Train step
        model_logistic.fit(x_train, y_train[ylabel].ravel())
        model_NB.fit(x_train, y_train[ylabel].ravel())
        model_SVM.fit(x_train, y_train[ylabel].ravel())
        model_SVMK.fit(x_train, y_train[ylabel].ravel())
        
        ## VALIDATION AND METRIC OBTAINMENT METHODS ##
        mean_score_logistic = mean_score_logistic + model_logistic.score(x_test,y_test[ylabel].ravel())
        mean_score_NB = mean_score_NB + model_NB.score(x_test,y_test[ylabel].ravel())
        mean_score_SVM = mean_score_SVM + model_SVM.score(x_test,y_test[ylabel].ravel())
        mean_score_SVMK = mean_score_SVMK + model_SVMK.score(x_test,y_test[ylabel].ravel())
        
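        # Per-class scores for the ROC curves; GaussianNB has no decision_function,
        # so its class probabilities are used instead
        score_roc_Logistic = model_logistic.decision_function(x_test)
        score_roc_NB = model_NB.predict_proba(x_test)
        score_roc_SVM = model_SVM.decision_function(x_test)
        score_roc_SVMK = model_SVMK.decision_function(x_test)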
        
        fprlog = dict()
        tprlog = dict()
        roc_auclog = dict()
        
        fprNB = dict()
        tprNB = dict()
        roc_aucNB = dict()
        
        fprSVM = dict()
        tprSVM = dict()
        roc_aucSVM = dict()
        
        fprSVMK = dict()
        tprSVMK = dict()
        roc_aucSVMK = dict()
        
        for i in range(numclasses):
            fprlog[i], tprlog[i], _ = roc_curve(label_binarize(y_test,classes=[i for i in range(numclasses)])[:, i], score_roc_Logistic[:,i])
            roc_auclog[i] = auc(fprlog[i], tprlog[i])
            
            fprNB[i], tprNB[i], _ = roc_curve(label_binarize(y_test,classes=[i for i in range(numclasses)])[:, i], score_roc_NB[:,i])
            roc_aucNB[i] = auc(fprNB[i], tprNB[i])
            
            fprSVM[i], tprSVM[i], _ = roc_curve(label_binarize(y_test,classes=[i for i in range(numclasses)])[:, i], score_roc_SVM[:,i])
            roc_aucSVM[i] = auc(fprSVM[i], tprSVM[i])
            
            fprSVMK[i], tprSVMK[i], _ = roc_curve(label_binarize(y_test,classes=[i for i in range(numclasses)])[:, i], score_roc_SVMK[:,i])
            roc_aucSVMK[i] = auc(fprSVMK[i], tprSVMK[i])
            
        
        roc_curve_data.append([[fprlog, tprlog, roc_auclog], [fprNB, tprNB, roc_aucNB], [fprSVM, tprSVM, roc_aucSVM], [fprSVMK, tprSVMK, roc_aucSVMK]])
    
        print("Iteration ", iteration)
        iteration+=1
        print("Accuracy Naive Bayes: ", model_NB.score(x_test,y_test[ylabel].ravel()))
        print("Accuracy Logistic Regression: ", model_logistic.score(x_test,y_test[ylabel].ravel()))
        print("Accuracy Linear SVM: ", model_SVM.score(x_test,y_test[ylabel].ravel()))
        print("Accuracy Kernel SVM: ", model_SVMK.score(x_test,y_test[ylabel].ravel()))

        y_predicted_lr = model_logistic.predict(x_test)
        y_predicted_NB = model_NB.predict(x_test)
        y_predicted_SVM = model_SVM.predict(x_test)
        y_predicted_SVMK = model_SVMK.predict(x_test)
    
        # Per-class F1 scores for the target column given by ylabel
        precision_lr = f1_score(y_test[ylabel].ravel(), y_predicted_lr, average=None)
        precision_nb = f1_score(y_test[ylabel].ravel(), y_predicted_NB, average=None)
        precision_svm = f1_score(y_test[ylabel].ravel(), y_predicted_SVM, average=None)
        precision_svmk = f1_score(y_test[ylabel].ravel(), y_predicted_SVMK, average=None)
        
        f1_scores_logistic.append(precision_lr)
        f1_scores_NB.append(precision_nb)
        f1_scores_SVM.append(precision_svm)
        f1_scores_SVMK.append(precision_svmk)

        print("F1-Score Naive Bayes: ", precision_nb)
        print("F1-Score Logistic Regression: ", precision_lr)
        print("F1-Score Linear SVM: ", precision_svm)
        print("F1-Score Kernel SVM: ", precision_svmk)
    

        print("-----------------")

    result_lr = np.round(mean_score_logistic/10,3)
    result_nb = np.round(mean_score_NB/10,3)
    result_svm = np.round(mean_score_SVM/10,3)
    result_svmk = np.round(mean_score_SVMK/10,3)
    print("Accuracy in Cross validation Naive Bayes: ", result_nb) 
    print("Accuracy in Cross validation Logistic Regresion: ", result_lr)    
    print("Accuracy in Cross validation Linear SVM: ", result_svm)   
    print("Accuracy in Cross validation Kernel SVM: ", result_svmk)
    
    #We return the ROC data
    return roc_curve_data
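
The per-class ROC computation inside trainClassificationCV can be isolated in a minimal sketch. It runs on a synthetic three-class problem (an assumption that only stands in for the borough data), and the names X_demo, y_demo and roc_auc_demo are hypothetical. Each class is treated one-vs-rest: the labels are binarized and the score column of that class is passed to roc_curve.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

# Synthetic 3-class problem (assumption: it only stands in for the borough data)
X_demo, y_demo = make_classification(n_samples=600, n_features=6, n_informative=4,
                                     n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          stratify=y_demo, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.decision_function(X_te)            # shape: (n_samples, n_classes)

classes = [0, 1, 2]
y_bin = label_binarize(y_te, classes=classes)   # one indicator column per class

roc_auc_demo = {}
for i in classes:
    # One-vs-rest: class i against all the others
    fpr, tpr, _ = roc_curve(y_bin[:, i], scores[:, i])
    roc_auc_demo[i] = auc(fpr, tpr)

print(roc_auc_demo)
```

Binarizing the labels once, outside the loop, gives the same curves as binarizing them again for every class and model.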

In [None]:
# Function to train the regression problem with 10 - Fold CV
# The procedure is similar to the function above, although there are fewer metrics to collect during
# the training process (we only want the R2 score)
def trainRegressionCV(X,Y, ylabel):
    cv = KFold(n_splits=10) 
    r2_linear = 0
    r2_svr = 0
    r2_rfo = 0
    r2_ridge = 0
    r2_lasso = 0
    r2_knn = 0
    iteration = 0
    for train_idx, test_idx in cv.split(X,Y):
        x_train, y_train = X.iloc[train_idx], Y.iloc[train_idx]
        x_test, y_test = X.iloc[test_idx], Y.iloc[test_idx]

        model_linear = LinearRegression()
        model_SVR = SVR(max_iter=1000, C=50, kernel="poly", coef0=2)
        # The default split criterion is squared error (called "mse" in older scikit-learn releases)
        model_RandomForest = RandomForestRegressor(bootstrap=True)
        model_Ridge = Ridge(alpha=1.0)
        model_Lasso = Lasso(alpha=0.1)
        model_KNN = KNeighborsRegressor(n_neighbors=100)
    
        model_linear.fit(x_train, y_train[ylabel].ravel())
        model_SVR.fit(x_train, y_train[ylabel].ravel())
        model_RandomForest.fit(x_train, y_train[ylabel].ravel())
        model_Ridge.fit(x_train, y_train[ylabel].ravel())
        model_Lasso.fit(x_train, y_train[ylabel].ravel())
        model_KNN.fit(x_train, y_train[ylabel].ravel())
    
        print("Iteration ", iteration)
        iteration+=1
    
        r2_linear = r2_linear + model_linear.score(x_test, y_test[ylabel].ravel())
        r2_svr = r2_svr + model_SVR.score(x_test, y_test[ylabel].ravel())
        r2_rfo = r2_rfo + model_RandomForest.score(x_test, y_test[ylabel].ravel())
        r2_ridge = r2_ridge + model_Ridge.score(x_test, y_test[ylabel].ravel())
        r2_lasso = r2_lasso + model_Lasso.score(x_test, y_test[ylabel].ravel())
        r2_knn = r2_knn + model_KNN.score(x_test, y_test[ylabel].ravel())
    
        print(f"R2 Linear Regression {model_linear.score(x_test, y_test[ylabel].ravel())}")
        print(f"R2 SVR {model_SVR.score(x_test, y_test[ylabel].ravel())}")
        print(f"R2 Random Forest {model_RandomForest.score(x_test, y_test[ylabel].ravel())}")
        print(f"R2 KNN100 {model_KNN.score(x_test, y_test[ylabel].ravel())}")
        print(f"R2 Ridge {model_Ridge.score(x_test, y_test[ylabel].ravel())}")
        print(f"R2 Lasso {model_Lasso.score(x_test, y_test[ylabel].ravel())}")
    
        print("-----------------")

    result_lm = np.round(r2_linear/10,3)
    result_svr = np.round(r2_svr/10,3)
    result_rfo = np.round(r2_rfo/10,3)
    result_rid = np.round(r2_ridge/10,3)
    result_las = np.round(r2_lasso/10,3)
    result_knn = np.round(r2_knn/10,3)
    print("Mean R2 Linear Regression: ", result_lm) 
    print("Mean R2 SVR: ", result_svr)
    print("Mean R2 RandomForest: ", result_rfo) 
    print("Mean R2 Ridge: ", result_rid)
    print("Mean R2 Lasso: ", result_las)
    print("Mean R2 KNN100: ", result_knn)
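
The manual accumulation of fold scores in trainRegressionCV can also be delegated to scikit-learn. The following is only a sketch on synthetic data (X_demo, y_demo and fold_r2 are hypothetical names), showing how cross_val_score returns one R2 value per fold for a single model.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem (assumption: a stand-in for the notebook's features)
X_demo, y_demo = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)

cv = KFold(n_splits=10)
# One R2 score per fold; the model is cloned and refitted on every training split
fold_r2 = cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv, scoring="r2")
mean_r2 = np.round(fold_r2.mean(), 3)

print(fold_r2.shape, mean_r2)
```

Taking the mean of the returned array avoids hard-coding the number of folds in the final division.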

In [None]:
#Boxplot function for our dataset
def boxes(xdata, ydata, labels, ylabel):
    idx = 0
    idy = 0
    fig, axes = plt.subplots(2,4, figsize=(30, 20))
    for label in labels:
        sns.boxplot(ax=axes[idx, idy], x=xdata[label], y = ydata[ylabel])
        idy+=1
        if idy>3:
            idy = 0
            idx += 1

In [None]:
#Somewhat analogous to the pairplot function of the R software
#However, we only make pairplots against our variable of interest (represented by ylabel) as we want
#to see if there is any kind of linear relationship between the regressors and the target variable
def pairplots(xdata, ydata, labels, ylabel, subpl):
    idx = 0
    idy = 0
    maxY = subpl[1]
    fig, axes = plt.subplots(subpl[0],subpl[1], figsize=(15, 5))
    for label in labels:
        sns.scatterplot(ax=axes[idx, idy], x=xdata[label], y = ydata[ylabel])
        idy+=1
        if idy>maxY-1:
            idy = 0
            idx += 1

In [None]:
#Monstrous function to plot the ROC curves
def plot_roc_curves_CV(dataroc, foldsToSee, yencoder, methods):
    lw = 2
    foldSee = foldsToSee if foldsToSee is not None else range(10)
    for fold in foldSee:
        fig, axs = plt.subplots(2, 2, figsize=(20,20))
        fig.suptitle(f"Fold {fold}", fontsize=16)
        fpr_complete = dataroc[fold]
        tpr_complete = dataroc[fold]
        rocauc_complete = dataroc[fold]
        x=0
        y=0
        for method in range(4):
            axis = axs[x,y]
    
            fpr = fpr_complete[method][0]
            tpr = tpr_complete[method][1]
            rocauc = rocauc_complete[method][2]

            colors = cycle(['aqua', 'darkorange', 'cornflowerblue', 'red', 'yellow'])
            for i, color in zip(range(5), colors):
                axis.plot(fpr[i], tpr[i], color=color, lw=lw,
                     label='ROC curve of class {0} (area = {1:0.2f})'
                     ''.format(yencoder.inverse_transform([i])[0], rocauc[i]))

            axis.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
            axis.axis(xmin=0.0, xmax=1.0, ymin=0.0, ymax=1.05)
            axis.set_xlabel('False Positive Rate')
            axis.set_ylabel('True Positive Rate')
            axis.set_title(f"ROC - {methods[method]}")
            axis.legend(loc="lower right")
            
            y+=1
            if y > 1:
                y = 0
                x+=1

## Classification

In this section, we will try to classify the borough to which an input belongs; in our case, the input is a biased crime complaint. I say biased because we are leaving out explicit or easily deducible location variables (such as the Cartesian coordinates and the latitude/longitude data). This makes the problem more interesting, as we are not handing the model a direct solution through coordinate variables.

Before starting the training of the models, let's prepare the data.

In [None]:
# Read the dataframe and drop null data. In my previous work I saw that it was residual (25,000 rows dropped out of 5,000,000 entries), but dropping it helps to avoid future errors
dataframe = pd.read_csv('../input/arrestsnypd/NYPDArrests.csv')
dataframe = dataframe.dropna()
# We keep the descriptors that I do not consider to "cheat" too much when predicting the borough. As we will see, ARREST_PRECINCT walks the thin line between cheating and
# non-trivial description; however, that is left for further analysis
data_criminality = dataframe[["ARREST_DATE","OFNS_DESC","LAW_CAT_CD","ARREST_BORO","ARREST_PRECINCT", "AGE_GROUP", "PERP_SEX", "PERP_RACE"]]
# Some age values, such as 12005, are not useful as class labels because they introduce unwanted variability, so we discard them directly, as they are also residual
# We stick to the five main age groups in this dataset
data_criminality = data_criminality[data_criminality.AGE_GROUP.isin(['45-64', '25-44', '18-24', '<18', '65+'])]

#This is just for my own comfort when handling the data; changing the names is not necessary, but I think it is more descriptive
data_criminality.loc[data_criminality.ARREST_BORO == "M", "ARREST_BORO"] = "Manhattan"
data_criminality.loc[data_criminality.ARREST_BORO == "B", "ARREST_BORO"] = "Bronx"
data_criminality.loc[data_criminality.ARREST_BORO == "K", "ARREST_BORO"] = "Brooklyn"
data_criminality.loc[data_criminality.ARREST_BORO == "Q", "ARREST_BORO"] = "Queens"
data_criminality.loc[data_criminality.ARREST_BORO == "S", "ARREST_BORO"] = "Staten Island"

data_criminality.head()

Before going further, we must take into account a tiny detail about this dataset: it has 5,000,000 entries. As much as I'd love to process all of the data, I believe it is infeasible (as Kaggle gives limited resources to freebies like me). So we have to build a smaller dataset to work with. In my opinion, a dataset of 100,000 samples is enough to accomplish our goals (and will not turn my Kaggle space into a fire hazard).

However, we have to be careful when reducing the dataset: we want a balanced dataframe so that we do not run into class-imbalance problems. The best way to do so, in my opinion, is to draw a random sample from each of the subsets defined by the target labels. In our case, as the model has to predict among five boroughs, we should build the corresponding subsets and check whether each has enough data.

In [None]:
manhattan_crimes = data_criminality.loc[data_criminality.ARREST_BORO == "Manhattan"]
print(len(manhattan_crimes))
bronx_crimes = data_criminality.loc[data_criminality.ARREST_BORO == "Bronx"]
print(len(bronx_crimes))
brooklyn_crimes = data_criminality.loc[data_criminality.ARREST_BORO == "Brooklyn"]
print(len(brooklyn_crimes))
queens_crimes = data_criminality.loc[data_criminality.ARREST_BORO == "Queens"]
print(len(queens_crimes))
statenisland_crimes = data_criminality.loc[data_criminality.ARREST_BORO == "Staten Island"]
print(len(statenisland_crimes))

These counts show that the dataset is already quite unbalanced (Staten Island has far fewer reports than the rest). However, every subset has enough data to build a 100,000-report dataset by randomly sampling 20,000 events from each one. Even though this is enough data for the models used in this work to converge, we should keep in mind that, by reducing the dataset, we are deliberately increasing the variance of our solution with respect to the optimal one. We have to do it for computational reasons, though, and I believe the results we obtain with these models will not be far from the ones we would obtain using all the samples.

So, without further ado, we proceed to create our reduced dataset, which I call the "Intermediate dataset".

In [None]:
mhc = manhattan_crimes.iloc[sample_without_replacement(n_population=len(manhattan_crimes), n_samples = 20000, random_state=1)]
brc = bronx_crimes.iloc[sample_without_replacement(n_population=len(bronx_crimes), n_samples = 20000, random_state=1)]
bxc = brooklyn_crimes.iloc[sample_without_replacement(n_population=len(brooklyn_crimes), n_samples = 20000, random_state=1)]
quc = queens_crimes.iloc[sample_without_replacement(n_population=len(queens_crimes), n_samples = 20000, random_state=1)]
sic = statenisland_crimes.iloc[sample_without_replacement(n_population=len(statenisland_crimes), n_samples = 20000, random_state=1)]
intermediate_dataset = pd.concat([mhc, brc, bxc, quc, sic], ignore_index=True)
intermediate_dataset = shuffle(intermediate_dataset, random_state=1)
print(len(intermediate_dataset))
intermediate_dataset.head()
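
As a side note, the same per-borough sampling can be sketched more compactly with a pandas groupby. This is an alternative formulation, not the code used in this notebook: the toy frame, the value column and n_per_class are hypothetical, and only the ARREST_BORO column name mirrors the real data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Toy, deliberately unbalanced frame (assumption: stands in for data_criminality)
toy = pd.DataFrame({
    "ARREST_BORO": rng.choice(
        ["Manhattan", "Bronx", "Brooklyn", "Queens", "Staten Island"],
        size=5000, p=[0.28, 0.22, 0.27, 0.19, 0.04]),
    "value": rng.normal(size=5000),
})

n_per_class = 100  # hypothetical; the notebook draws 20000 per borough
balanced = (toy.groupby("ARREST_BORO")
               .sample(n=n_per_class, random_state=1)   # same number of rows per borough
               .sample(frac=1, random_state=1)          # shuffle the rows
               .reset_index(drop=True))

print(balanced["ARREST_BORO"].value_counts())
```

Sampling without replacement requires every group to hold at least n_per_class rows, which is the same check the printed subset sizes provide.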

Once that is done, we can finally prepare our data for the training process.

In [None]:
X = intermediate_dataset[["ARREST_DATE","OFNS_DESC","LAW_CAT_CD", "ARREST_PRECINCT", "AGE_GROUP","PERP_SEX", "PERP_RACE"]]
Y = intermediate_dataset[["ARREST_BORO"]]

As most of our data is categorical, we need to label encode it to make it numerical

In [None]:
le_OFNS = preprocessing.LabelEncoder()
le_LAW = preprocessing.LabelEncoder()
le_AGE = preprocessing.LabelEncoder()
le_SEX = preprocessing.LabelEncoder()
le_RACE = preprocessing.LabelEncoder()
le_BOROUGH = preprocessing.LabelEncoder()

le_BOROUGH.fit(Y.ARREST_BORO)
le_OFNS.fit(X.OFNS_DESC)
le_LAW.fit(X.LAW_CAT_CD)
le_AGE.fit(X.AGE_GROUP)
le_SEX.fit(X.PERP_SEX)
le_RACE.fit(X.PERP_RACE)
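
To make the encoding step concrete, here is a small sketch of what a LabelEncoder does with a handful of borough names (the boroughs list and the encoder variable are illustrative only). The classes are stored in sorted order and each one is mapped to its position in that order.

```python
from sklearn.preprocessing import LabelEncoder

boroughs = ["Queens", "Bronx", "Manhattan", "Bronx", "Staten Island", "Brooklyn"]

encoder = LabelEncoder().fit(boroughs)
codes = encoder.transform(boroughs)

print(encoder.classes_.tolist())
# ['Bronx', 'Brooklyn', 'Manhattan', 'Queens', 'Staten Island']
print(codes.tolist())
# [3, 0, 2, 0, 4, 1]
print(encoder.inverse_transform([0, 4]).tolist())
# ['Bronx', 'Staten Island']
```

The integer codes impose an order that the categories do not really have, which is worth keeping in mind when reading the results of linear models trained on them.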

In [None]:
scaler = MinMaxScaler()
XDataframe = {'Month': pd.DatetimeIndex(X['ARREST_DATE']).month, 
             'Offense': le_OFNS.transform(X.OFNS_DESC),
             'Law_Code': le_LAW.transform(X.LAW_CAT_CD),
             'Precint': X.ARREST_PRECINCT,
             'Age': le_AGE.transform(X.AGE_GROUP),
             'Sex': le_SEX.transform(X.PERP_SEX),
             'Race': le_RACE.transform(X.PERP_RACE)}
X = pd.DataFrame(data=XDataframe)
X_Scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
Y_Encoded = pd.DataFrame(data={'Borough':le_BOROUGH.transform(Y.ARREST_BORO)})
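
As a quick illustration of the scaling step, the sketch below applies MinMaxScaler to a tiny frame with hypothetical values shaped like the encoded features. Each column is rescaled independently so that its minimum maps to 0 and its maximum to 1.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical values; only the column names mirror the notebook
toy = pd.DataFrame({"Month": [1, 6, 12], "Precint": [1, 60, 123], "Sex": [0, 1, 0]})

# fit_transform returns a NumPy array, so the column names are restored by hand
scaled = pd.DataFrame(MinMaxScaler().fit_transform(toy), columns=toy.columns)

print(scaled)
```

For example, the month 6 becomes (6 - 1) / (12 - 1), roughly 0.45, regardless of the ranges of the other columns.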

In [None]:
dataroc = trainClassificationCV(X_Scaled,Y_Encoded, 5, "Borough")
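
The F1 values printed for each fold are arrays because f1_score is called with average=None, which yields one score per class. A minimal sketch with made-up labels (y_true and y_pred are illustrative, not taken from the dataset):

```python
from sklearn.metrics import f1_score

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]

per_class_f1 = f1_score(y_true, y_pred, average=None)   # one value per class
macro_f1 = f1_score(y_true, y_pred, average="macro")    # unweighted mean of the above

print(per_class_f1)
print(macro_f1)
```

Here class 0 is predicted perfectly (F1 = 1), class 1 has precision 1 and recall 0.5 (F1 = 2/3), and class 2 has precision 2/3 and recall 1 (F1 = 0.8), so a single weak class stays visible instead of being averaged away.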

So, after the training process, we see that all the models are very accurate at predicting the borough from the data we provide (judging by both the accuracy and the F1 scores). However, it is hard to tell which one is the best, as they perform almost equally. We can get a bigger picture by plotting the ROC curves they describe in each fold. As they behave quite consistently across folds, I will only plot one fold for this analysis, but you can see all of them by passing None as the folds parameter of the function.

In [None]:
plot_roc_curves_CV(dataroc, [0], le_BOROUGH, ['Logistic Regression', 'Naive Bayes', 'Linear SVM', 'Kernel SVM'])

So, as we can see, even though all the models obtained good results during training, one model outperforms the rest in every class: the Kernel SVM. As the ROC curves traced by the Linear SVM show, the data we are handling is not linearly separable (if it were, the Linear SVM would reach 100% accuracy on the test set). The Logistic Regression and Naive Bayes methods confirm that. We can also observe that the hardest class to predict is the Brooklyn borough. However, if we apply a non-linear transformation to the data (the kernel used by the Kernel SVM), the ROC curve has an area of 1 in all classes, which is the best result we can obtain. In summary, the data is not linearly separable and, with a simple non-linear transformation, we obtain a model with great accuracy on the test set.
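
The linear-versus-kernel argument can be reproduced on a toy problem. The sketch below is not run on the arrests data: it uses scikit-learn's make_circles (two concentric rings, which no straight line can separate) as an assumed stand-in, and compares a linear SVM with an RBF-kernel SVM.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC, LinearSVC

# Two concentric rings: a classic example of data that is not linearly separable
X_demo, y_demo = make_circles(n_samples=600, factor=0.4, noise=0.05, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

acc_linear = LinearSVC(random_state=1, max_iter=10000).fit(X_tr, y_tr).score(X_te, y_te)
acc_rbf = SVC(kernel="rbf", gamma="auto").fit(X_tr, y_tr).score(X_te, y_te)

print(acc_linear, acc_rbf)
```

On this kind of data the linear model stays close to chance while the kernel model separates the rings, which is the same qualitative gap discussed above.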

However, such good results raise a question for me: does the model need all the regressors we have fed it?
To find out, we should inspect how each regressor is distributed across the boroughs.

In [None]:
boxes(X, Y, ["Month", "Offense", "Law_Code", "Precint", "Age", "Sex", "Race"], "ARREST_BORO")

As we can see, the regressor that best separates the classes is the Precinct one, as its values clearly differ between boroughs. So I believe that, using only the Precinct descriptor, we could get the same best model (Kernel SVM) in less time (and with a simpler formula, which is always worth pursuing in these problems). However, Brooklyn (the worst-performing class) can be hard to distinguish from the Bronx and Queens, so I believe we should add a supporting regressor that helps tell them apart. In this dataset, I believe the race of the offender can draw a difference between these boroughs (as we can observe in its boxplot). So, I'm going to train the models again, but considering only these two descriptors.

In [None]:
dataroc_2 = trainClassificationCV(X_Scaled[["Precint", "Race"]],Y_Encoded, 5, "Borough")

In [None]:
plot_roc_curves_CV(dataroc_2,[0], le_BOROUGH, ['Logistic Regression', 'Naive Bayes', 'Linear SVM', 'Kernel SVM'])

As we can see, the Kernel SVM maintains its performance results.

## Regression

In this section, we will try to predict exactly where a complaint occurred (by means of regression). In other words, we will try to predict the Latitude and the Longitude of the reported event from the other regressors the dataset contains, and we will analyze how well this works and whether it is feasible. To achieve this goal, we will train the regression models on both the Latitude and the Longitude of the events.


The following code is just like the one in the classification section, with the addition of the Latitude and Longitude variables that we will try to predict.

In [None]:
data_criminality = dataframe[["ARREST_DATE","OFNS_DESC","LAW_CAT_CD","ARREST_BORO","ARREST_PRECINCT", "AGE_GROUP", "PERP_SEX", "PERP_RACE", "Latitude", "Longitude"]]
data_criminality = data_criminality[data_criminality.AGE_GROUP.isin(['45-64', '25-44', '18-24', '<18', '65+'])]
data_criminality.loc[data_criminality.ARREST_BORO == "M", "ARREST_BORO"] = "Manhattan"
data_criminality.loc[data_criminality.ARREST_BORO == "B", "ARREST_BORO"] = "Bronx"
data_criminality.loc[data_criminality.ARREST_BORO == "K", "ARREST_BORO"] = "Brooklyn"
data_criminality.loc[data_criminality.ARREST_BORO == "Q", "ARREST_BORO"] = "Queens"
data_criminality.loc[data_criminality.ARREST_BORO == "S", "ARREST_BORO"] = "Staten Island"
manhattan_crimes = data_criminality.loc[data_criminality.ARREST_BORO == "Manhattan"]
bronx_crimes = data_criminality.loc[data_criminality.ARREST_BORO == "Bronx"]
brooklyn_crimes = data_criminality.loc[data_criminality.ARREST_BORO == "Brooklyn"]
queens_crimes = data_criminality.loc[data_criminality.ARREST_BORO == "Queens"]
statenisland_crimes = data_criminality.loc[data_criminality.ARREST_BORO == "Staten Island"]
mhc = manhattan_crimes.iloc[sample_without_replacement(n_population=len(manhattan_crimes), n_samples = 20000, random_state=1)]
brc = bronx_crimes.iloc[sample_without_replacement(n_population=len(bronx_crimes), n_samples = 20000, random_state=1)]
bxc = brooklyn_crimes.iloc[sample_without_replacement(n_population=len(brooklyn_crimes), n_samples = 20000, random_state=1)]
quc = queens_crimes.iloc[sample_without_replacement(n_population=len(queens_crimes), n_samples = 20000, random_state=1)]
sic = statenisland_crimes.iloc[sample_without_replacement(n_population=len(statenisland_crimes), n_samples = 20000, random_state=1)]

In [None]:
intermediate_dataset = pd.concat([mhc, brc, bxc, quc, sic], ignore_index=True)
intermediate_dataset = shuffle(intermediate_dataset, random_state=1)
intermediate_dataset = intermediate_dataset.reset_index()[["ARREST_DATE","OFNS_DESC","LAW_CAT_CD","ARREST_BORO","ARREST_PRECINCT", "AGE_GROUP", "PERP_SEX", "PERP_RACE", "Latitude", "Longitude"]]

In [None]:
X = intermediate_dataset[["ARREST_DATE","OFNS_DESC","LAW_CAT_CD", "ARREST_BORO","ARREST_PRECINCT", "AGE_GROUP", "PERP_SEX", "PERP_RACE"]]
Y_Lat = intermediate_dataset[["Latitude"]]
Y_Long = intermediate_dataset[["Longitude"]]
XDataframe = {'Month': pd.DatetimeIndex(X['ARREST_DATE']).month, 
             'Offense': le_OFNS.transform(X.OFNS_DESC),
             'Borough': le_BOROUGH.transform(X.ARREST_BORO),
             'Law_Code': le_LAW.transform(X.LAW_CAT_CD),
             'Precint': X.ARREST_PRECINCT,
             'Age': le_AGE.transform(X.AGE_GROUP),
             'Sex': le_SEX.transform(X.PERP_SEX),
             'Race': le_RACE.transform(X.PERP_RACE)}
X = pd.DataFrame(data=XDataframe)
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)

Once all of our data is prepared, we train the regression models.

In [None]:
trainRegressionCV(X_scaled, Y_Lat, "Latitude")

In [None]:
trainRegressionCV(X_scaled, Y_Long, "Longitude")
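
The R2 values reported by these runs are coefficients of determination: one minus the ratio between the residual sum of squares and the total sum of squares around the mean. A small sketch with made-up, latitude-like numbers (y_true and y_pred are hypothetical) checks the hand computation against r2_score:

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical values, only shaped like latitudes
y_true = np.array([40.70, 40.75, 40.80, 40.85])
y_pred = np.array([40.72, 40.74, 40.79, 40.86])

ss_res = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
r2_manual = 1 - ss_res / ss_tot

# A model that always predicts the mean explains none of the variance
r2_mean_model = r2_score(y_true, np.full_like(y_true, y_true.mean()))

print(r2_manual, r2_score(y_true, y_pred), r2_mean_model)
```

A score near 0 therefore means the model is no better than predicting the mean of the test fold, and the score becomes negative when it is worse.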

So, as we can observe, the models perform poorly on the prediction task, except for the Random Forest model on the Longitude prediction, which unexpectedly explains 98% of the variance present in the test set.
The poor performance was to be expected, as there is a lot of categorical data which, sadly, shows no linear relationship with the predicted variable. We can see this with some plots:

In [None]:
pairplots(X, Y_Lat, ["Month", "Offense", "Borough", "Law_Code", "Precint", "Age", "Sex", "Race"], "Latitude", (2,4))

As we can see in the Latitude case, the data shows little variability and no linear relationship can be seen between the Latitude and the different regressors. It is logical to think that a good regression is impossible with this kind of data. Indeed, this is to be expected, as most of the regressors used in this problem, except the Precinct one, are categorical.

In [None]:
pairplots(X, Y_Long, ["Month", "Offense", "Borough", "Law_Code", "Precint", "Age", "Sex", "Race"], "Longitude", (2,4))

Surprisingly, in the case of the Longitude, the Precinct regressor draws a trend with multiple clusters within the target variable, which explains why the models perform better than on the Latitude. This also helps explain why the Random Forest Regressor performs so well in this case, as it performs well when the data has non-linear, multidimensional relationships with the target variable.
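
This behaviour can be illustrated with a sketch on synthetic data (an assumption, not the arrests dataset): a single identifier-like feature is split into bands, each band has its own target level, and the levels do not grow with the feature. The names x, levels and band are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Five bands of the feature, each with its own (non-monotonic) target level
x = rng.uniform(0, 100, size=1000)
levels = np.array([3.0, -1.0, 4.0, 0.0, 2.0])
band = np.minimum((x // 20).astype(int), 4)
y = levels[band] + rng.normal(scale=0.1, size=x.size)

X_tr, X_te, y_tr, y_te = train_test_split(x.reshape(-1, 1), y, test_size=0.3, random_state=0)

r2_lin = LinearRegression().fit(X_tr, y_tr).score(X_te, y_te)
r2_rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr).score(X_te, y_te)

print(round(r2_lin, 3), round(r2_rf, 3))
```

The trees can split the feature at the band boundaries and predict each level separately, whereas a straight line has to average over all of them.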

Before using clustering or ensemble methods, the first thing to try is reducing the number of regressors in these models, which usually benefit in performance from such simplifications.
First of all, let's check for a multicollinearity problem, where one regressor already explains the variability of another and makes it redundant for the model, by plotting the correlations between the regressors in the X set.

In [None]:
correlation = X.corr()
correlation
mask = np.triu(np.ones_like(correlation, dtype=bool))

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(correlation, mask=mask, cmap=cmap, vmax=1, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})
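
Besides reading the heatmap by eye, the strongly correlated pairs can be listed programmatically. The following sketch uses a toy frame in which Borough is built to follow Precint (an assumption for illustration), and the 0.8 threshold is a hypothetical cut-off rather than a value taken from this analysis.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
precinct = rng.integers(1, 124, size=500)
toy = pd.DataFrame({
    "Precint": precinct,
    "Borough": precinct // 25 + rng.integers(0, 2, size=500),  # built to track Precint
    "Month": rng.integers(1, 13, size=500),
})

corr = toy.corr()
threshold = 0.8  # hypothetical cut-off
# Keep only the strict upper triangle so that each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
high_pairs = pairs[pairs.abs() > threshold]

print(high_pairs)
```

Only one regressor from each listed pair needs to be kept when simplifying the model.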

As we can observe, the Borough and the Precinct are highly correlated (as expected from the classification problem), so we can discard one of them without losing relevant information. There is also a close relation between Law Code and Offense, so we will keep the latter, as it has more categories.

The Latitude case is really hard, as there are no linear relationships. However, I believe that if we take the regressors that seem to have more variability (Month, Precinct, Offense, and Race), we could get some performance out of them. In the case of the Longitude, I think, from the scatterplots, that using the Precinct and the Offense would be enough to get the best performance out of the Random Forest method.

In [None]:
trainRegressionCV(X_scaled[["Offense", "Precint", "Month", "Race"]], Y_Lat, "Latitude")

In [None]:
trainRegressionCV(X_scaled[["Offense", "Precint"]], Y_Long, "Longitude")

## Bringing in ensembles

So far, we have performed several classification and regression tasks in this document. However, we have seen that, for example, in the regression task we had trouble predicting the location of a given crime, as there was no trend (correlation) between the regressors and the explained variable.
However, could we try to solve that?
We can try different approaches. One of the most interesting is to bring in ensembles (sets of prediction models that combine their predictions, in a way that depends on the method, to obtain better results). In this section, we are going to try several ensemble options in both the classification and the regression contexts and observe whether we obtain better results than with the previous methods. 
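
As a minimal sketch of the idea, the example below stacks three base classifiers on a synthetic three-class problem (the data and the names X_demo, y_demo, base_models and stack are assumptions for illustration). The base models produce cross-validated predictions, and a final logistic regression learns how to combine them.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class problem (assumption: it only stands in for the borough data)
X_demo, y_demo = make_classification(n_samples=600, n_features=8, n_informative=5,
                                     n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          stratify=y_demo, random_state=0)

base_models = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
    ("tree", DecisionTreeClassifier(random_state=0)),
]
# The final estimator is trained on the cross-validated outputs of the base models
stack = StackingClassifier(estimators=base_models,
                           final_estimator=LogisticRegression(max_iter=1000), cv=5)
stack.fit(X_tr, y_tr)

acc_stack = stack.score(X_te, y_te)
print(round(acc_stack, 3))
```

Boosting and bagging follow the same fit and score interface, but they combine many copies of one base learner instead of different model families.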

First, we should import the desired libraries to work with and redesign our utility functions to support ensembles.

In [None]:
from sklearn.ensemble import StackingClassifier, AdaBoostClassifier, GradientBoostingClassifier, BaggingClassifier
from sklearn.ensemble import StackingRegressor, AdaBoostRegressor, GradientBoostingRegressor, BaggingRegressor
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

Then, we redefine our utility functions for classification to train the proposed ensemble models with a 10-fold Cross Validation method. 

In [None]:
def trainClassification_EnsemblesCV(X,Y,numclasses, ylabel):
    #We define the KFOLD separator
    cv = StratifiedKFold(n_splits=10)
    
    #Some utility variables that will be useful during our train/test process
    mean_score_adaboost = 0
    mean_score_gradientboost = 0
    mean_score_stacking = 0
    mean_score_bagging = 0
    iteration = 0
    
    #I would like to know the F1 scores of the models, as precision and recall are important when predicting
    #the target label we want to describe
    f1_scores_adaboost = []
    f1_scores_gradientboost = []
    f1_scores_stacking = []
    f1_scores_bagging = []
    
    # I will also store all the historic variables of the models to obtain a ROC curve for each fold and method
    roc_curve_adaboost = []
    roc_curve_gradientboost = []
    roc_curve_stacking = []
    roc_curve_bagging = []
    
    roc_curve_data = []
    
    for train_idx, test_idx in cv.split(X,Y):
        #Split into train_test datasets
        x_train, y_train = X.iloc[train_idx], Y.iloc[train_idx]
        x_test, y_test = X.iloc[test_idx], Y.iloc[test_idx]
        
        model_adaboost = AdaBoostClassifier(n_estimators = 10)
        model_gradientBoost = GradientBoostingClassifier(n_estimators = 10, learning_rate=0.8)
        stacking_classifiers = [('lr', LogisticRegression(C=100, max_iter=10000)), ('knn', KNeighborsClassifier(n_neighbors=5)), ('tree', DecisionTreeClassifier())]
        model_stacking = StackingClassifier(estimators= stacking_classifiers, final_estimator = LogisticRegression(C=100, max_iter=10000), cv=10)
        model_bagging = BaggingClassifier(DecisionTreeClassifier(max_depth=10),n_estimators=10, max_samples=0.5, max_features=0.5)
        #model_SVMK = SVC(gamma="auto", C=10000)
        
        #Train step
        model_adaboost.fit(x_train, y_train[ylabel].ravel())
        model_gradientBoost.fit(x_train, y_train[ylabel].ravel())
        model_stacking.fit(x_train, y_train[ylabel].ravel())
        model_bagging.fit(x_train, y_train[ylabel].ravel())
        
        ## VALIDATION AND METRIC OBTAINMENT METHODS ##
        mean_score_adaboost = mean_score_adaboost + model_adaboost.score(x_test,y_test[ylabel].ravel())
        mean_score_gradientboost = mean_score_gradientboost + model_gradientBoost.score(x_test,y_test[ylabel].ravel())
        mean_score_stacking = mean_score_stacking + model_stacking.score(x_test,y_test[ylabel].ravel())
        mean_score_bagging = mean_score_bagging + model_bagging.score(x_test,y_test[ylabel].ravel())
        
        score_roc_adaboost = model_adaboost.decision_function(x_test)
        score_roc_gradientboost = model_gradientBoost.decision_function(x_test)
        score_roc_stacking = model_stacking.decision_function(x_test)
        score_roc_bagging = model_bagging.predict_proba(x_test)
        
        fprada = dict()
        tprada = dict()
        roc_aucada = dict()
        
        fprgb = dict()
        tprgb = dict()
        roc_aucgb = dict()
        
        fprsk = dict()
        tprsk = dict()
        roc_aucsk = dict()
        
        fprbg = dict()
        tprbg = dict()
        roc_aucbg = dict()
        
        for i in range(numclasses):
            fprada[i], tprada[i], _ = roc_curve(label_binarize(y_test,classes=[i for i in range(numclasses)])[:, i], score_roc_adaboost[:,i])
            roc_aucada[i] = auc(fprada[i], tprada[i])
            
            fprgb[i], tprgb[i], _ = roc_curve(label_binarize(y_test,classes=[i for i in range(numclasses)])[:, i], score_roc_gradientboost[:,i])
            roc_aucgb[i] = auc(fprgb[i], tprgb[i])
            
            fprsk[i], tprsk[i], _ = roc_curve(label_binarize(y_test,classes=[i for i in range(numclasses)])[:, i], score_roc_stacking[:,i])
            roc_aucsk[i] = auc(fprsk[i], tprsk[i])
            
            fprbg[i], tprbg[i], _ = roc_curve(label_binarize(y_test,classes=[i for i in range(numclasses)])[:, i], score_roc_bagging[:,i])
            roc_aucbg[i] = auc(fprbg[i], tprbg[i])
            
        
        roc_curve_data.append([[fprada, tprada, roc_aucada], [fprgb, tprgb, roc_aucgb], [fprsk, tprsk, roc_aucsk], [fprbg, tprbg, roc_aucbg]]) 
    
        print("Iteration ", iteration)
        iteration+=1
        print("Accuracy Adaboost: ", model_adaboost.score(x_test,y_test[ylabel].ravel()))
        print("Accuracy GradientBoost: ", model_gradientBoost.score(x_test,y_test[ylabel].ravel()))
        print("Accuracy Stacking: ", model_stacking.score(x_test,y_test[ylabel].ravel()))
        print("Accuracy Bagging: ", model_bagging.score(x_test,y_test[ylabel].ravel()))

        y_predicted_ada = model_adaboost.predict(x_test)
        y_predicted_gb = model_gradientBoost.predict(x_test)
        y_predicted_sk = model_stacking.predict(x_test)
        y_predicted_bagging = model_bagging.predict(x_test)
    
        # Per-class F1 scores (average=None) computed on the target column given by ylabel
        precision_ada = f1_score(y_test[ylabel].ravel(), y_predicted_ada, average=None)
        precision_gb = f1_score(y_test[ylabel].ravel(), y_predicted_gb, average=None)
        precision_sk = f1_score(y_test[ylabel].ravel(), y_predicted_sk, average=None)
        precision_bagging = f1_score(y_test[ylabel].ravel(), y_predicted_bagging, average=None)
        
        f1_scores_adaboost.append(precision_ada)
        f1_scores_gradientboost.append(precision_gb)
        f1_scores_stacking.append(precision_sk)
        f1_scores_bagging.append(precision_bagging)

        print("F1-Score AdaBoost: ", precision_ada)
        print("F1-Score GradientBoost: ", precision_gb)
        print("F1-Score Stacking: ", precision_sk)
        print("F1-Score Bagging: ", precision_bagging)
    

        print("-----------------")

    # Mean accuracy over the 10 folds for each ensemble
    result_ada = np.round(mean_score_adaboost/10,3)
    result_gb = np.round(mean_score_gradientboost/10,3)
    result_stacking = np.round(mean_score_stacking/10,3)
    result_bg = np.round(mean_score_bagging/10,3)
    print("Accuracy in Cross validation AdaBoost: ", result_ada)
    print("Accuracy in Cross validation GradientBoost: ", result_gb)
    print("Accuracy in Cross validation Stacking: ", result_stacking)
    print("Accuracy in Cross validation Bagging: ", result_bg)
    
    #We return the ROC data
    return roc_curve_data
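
The fold-by-fold bookkeeping in the function above could also be delegated to scikit-learn. The sketch below is one possible alternative, not the code used to produce the results in this notebook. It runs on synthetic data from `make_classification`, and the `models` dictionary, the `summary` dictionary and the smaller `n_splits=5` are assumptions chosen to keep it quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier)
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class problem standing in for the borough labels
X_toy, y_toy = make_classification(n_samples=300, n_features=6, n_informative=4,
                                   n_classes=3, random_state=0)

models = {
    "AdaBoost": AdaBoostClassifier(n_estimators=10),
    "GradientBoost": GradientBoostingClassifier(n_estimators=10, learning_rate=0.8),
    "Bagging": BaggingClassifier(DecisionTreeClassifier(max_depth=10), n_estimators=10,
                                 max_samples=0.5, max_features=0.5),
}

cv = StratifiedKFold(n_splits=5)
summary = {}
for name, model in models.items():
    # cross_validate refits a clone of the model on every fold and scores it
    scores = cross_validate(model, X_toy, y_toy, cv=cv,
                            scoring=["accuracy", "f1_macro"])
    summary[name] = {
        "accuracy": scores["test_accuracy"].mean(),
        "f1_macro": scores["test_f1_macro"].mean(),
    }
    print(name, summary[name])
```

Keeping the results in a dictionary keyed by model name makes it harder to print one model's score under another model's label.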

In the classification task, we saw that the key to obtaining the best predictions was the precinct number, as it was a polynomially separable variable that allowed us to obtain nearly perfect predictions. It seems pretty obvious that these classifiers will have trouble making these predictions if we remove this variable. This is what we are going to test here: we are going to see whether ensembles of classification methods are able to deal with this issue.

In [None]:
dataroc_2 = trainClassification_EnsemblesCV(X_Scaled[["Month", "Offense", "Law_Code","Age", "Sex", "Race"]],Y_Encoded, 5, "Borough")

In [None]:
plot_roc_curves_CV(dataroc_2,[0], le_BOROUGH, ['AdaBoost', 'Gradient Boost', 'Stacking', 'Bagging'])

As we can see from the accuracy, the models are not very accurate when predicting the borough in which a crime has been committed. However, although the ROC curves are somewhat low, they suggest that we could tackle the problem from a one-vs-many (OVM) perspective: when we look at each class separately, the models are able to differentiate that class from the rest. We can try this approach by running the same method as a binary classification task, where we do not try to handle all classes at once but focus on a single label at a time, which gives us four models (one per ensemble method) for each label that could later be combined in a pipeline.
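
To illustrate why the multiclass accuracy and the per-class ROC curves can tell different stories, here is a small sketch on synthetic data. The five-class data from `make_classification` and the single `GradientBoostingClassifier` are assumptions for illustration; the numbers it prints say nothing about the NYC dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

# Synthetic five-class problem standing in for the five boroughs
X_toy, y_toy = make_classification(n_samples=500, n_features=6, n_informative=4,
                                   n_classes=5, n_clusters_per_class=1,
                                   random_state=0)
x_tr, x_te, y_tr, y_te = train_test_split(X_toy, y_toy, stratify=y_toy,
                                          random_state=0)

model = GradientBoostingClassifier(n_estimators=10, learning_rate=0.8,
                                   random_state=0).fit(x_tr, y_tr)
proba = model.predict_proba(x_te)

# Each column of the binarized labels is a "this class vs. all others" problem
y_bin = label_binarize(y_te, classes=np.arange(5))
per_class_auc = [roc_auc_score(y_bin[:, k], proba[:, k]) for k in range(5)]

accuracy = model.score(x_te, y_te)
print("multiclass accuracy:", round(accuracy, 3))
print("one-vs-rest AUC per class:", np.round(per_class_auc, 3))
```

Accuracy only rewards the single top prediction, while the AUC of each class measures how well that class is ranked against all the others at every threshold, so the two metrics need not move together.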

In [None]:
from sklearn.preprocessing import LabelBinarizer
from sklearn.metrics import RocCurveDisplay  # successor of plot_roc_curve, which was removed in scikit-learn 1.2

In [None]:
def trainClassification_OVM_EnsemblesCV(X,Y,numclasses, ylabel):
    #We define the KFOLD separator
    cv = StratifiedKFold(n_splits=10)
    
    #Some utility variables that will be useful during our train/test process
    mean_score_adaboost = 0
    mean_score_gradientboost = 0
    mean_score_stacking = 0
    mean_score_bagging = 0
    iteration = 0
    
    #I would like to know the F1 scores of the models, as precision and recall are important when predicting
    #the targeted label we want to describe
    f1_scores_adaboost = []
    f1_scores_gradientboost = []
    f1_scores_stacking = []
    f1_scores_bagging = []
    
    # I will also store all the historic variables of the models to obtain a ROC curve for each fold and method
    roc_curve_adaboost = []
    roc_curve_gradientboost = []
    roc_curve_stacking = []
    roc_curve_bagging = []
    
    roc_curve_data = []
    
    for train_idx, test_idx in cv.split(X,Y):
        #Split into train_test datasets
        x_train, y_train = X.iloc[train_idx], Y.iloc[train_idx]
        x_test, y_test = X.iloc[test_idx], Y.iloc[test_idx]
        
        model_adaboost = AdaBoostClassifier(n_estimators = 10)
        model_gradientBoost = GradientBoostingClassifier(n_estimators = 10, learning_rate=0.8)
        stacking_classifiers = [('lr', LogisticRegression(C=100, max_iter=10000)), ('knn', KNeighborsClassifier(n_neighbors=5)), ('tree', DecisionTreeClassifier())]
        model_stacking = StackingClassifier(estimators= stacking_classifiers, final_estimator = LogisticRegression(C=100, max_iter=10000), cv=10)
        model_bagging = BaggingClassifier(DecisionTreeClassifier(max_depth=10),n_estimators=10, max_samples=0.5, max_features=0.5)
        #model_SVMK = SVC(gamma="auto", C=10000)
        
        #Train step
        model_adaboost.fit(x_train, y_train[ylabel].ravel())
        model_gradientBoost.fit(x_train, y_train[ylabel].ravel())
        model_stacking.fit(x_train, y_train[ylabel].ravel())
        model_bagging.fit(x_train, y_train[ylabel].ravel())
        
        ## VALIDATION AND METRIC OBTAINMENT METHODS ##
        mean_score_adaboost = mean_score_adaboost + model_adaboost.score(x_test,y_test[ylabel].ravel())
        mean_score_gradientboost = mean_score_gradientboost + model_gradientBoost.score(x_test,y_test[ylabel].ravel())
        mean_score_stacking = mean_score_stacking + model_stacking.score(x_test,y_test[ylabel].ravel())
        mean_score_bagging = mean_score_bagging + model_bagging.score(x_test,y_test[ylabel].ravel())
        
        score_roc_adaboost = model_adaboost.decision_function(x_test)
        score_roc_gradientboost = model_gradientBoost.decision_function(x_test)
        score_roc_stacking = model_stacking.decision_function(x_test)
        score_roc_bagging = model_bagging.predict_proba(x_test)
        
        #roc_curve_data.append([[fprada, tprada, roc_aucada], [fprgb, tprgb, roc_aucgb], [fprsk, tprsk, roc_aucsk], [fprbg, tprbg, roc_aucbg]]) 
        


    # Mean accuracy over the 10 folds for each ensemble
    result_ada = np.round(mean_score_adaboost/10,3)
    result_gb = np.round(mean_score_gradientboost/10,3)
    result_stacking = np.round(mean_score_stacking/10,3)
    result_bg = np.round(mean_score_bagging/10,3)
    print("Accuracy in Cross validation AdaBoost: ", result_ada)
    print("Accuracy in Cross validation GradientBoost: ", result_gb)
    print("Accuracy in Cross validation Stacking: ", result_stacking)
    print("Accuracy in Cross validation Bagging: ", result_bg)
    
    #roc_curve_data = [roc_curve_adaboost, roc_curve_gradientboost,roc_curve_stacking,roc_curve_bagging]
    #ROC data is not collected in this binary version, so the function returns nothing

In [None]:
label_bin = LabelBinarizer()
label_binarized = label_bin.fit_transform(Y_Encoded)
Y_Borough0 = pd.DataFrame([el[0] for el in label_binarized])
Y_Borough0.columns = ["Borough"]

Y_Borough1 = pd.DataFrame([el[1] for el in label_binarized])
Y_Borough1.columns = ["Borough"]

Y_Borough2 = pd.DataFrame([el[2] for el in label_binarized])
Y_Borough2.columns = ["Borough"]

Y_Borough3 = pd.DataFrame([el[3] for el in label_binarized])
Y_Borough3.columns = ["Borough"]

Y_Borough4 = pd.DataFrame([el[4] for el in label_binarized])
Y_Borough4.columns = ["Borough"]

In [None]:
print(f"BOROUGH 0 RESULTS")
trainClassification_OVM_EnsemblesCV(X_Scaled[["Month", "Offense", "Law_Code","Age", "Sex", "Race"]],Y_Borough0, 1, "Borough")
print(f"#################")
print(f"BOROUGH 1 RESULTS")
trainClassification_OVM_EnsemblesCV(X_Scaled[["Month", "Offense", "Law_Code","Age", "Sex", "Race"]],Y_Borough1, 1, "Borough")
print(f"#################")
print(f"BOROUGH 2 RESULTS")
trainClassification_OVM_EnsemblesCV(X_Scaled[["Month", "Offense", "Law_Code","Age", "Sex", "Race"]],Y_Borough2, 1, "Borough")
print(f"#################")
print(f"BOROUGH 3 RESULTS")
trainClassification_OVM_EnsemblesCV(X_Scaled[["Month", "Offense", "Law_Code","Age", "Sex", "Race"]],Y_Borough3, 1, "Borough")
print(f"#################")
print(f"BOROUGH 4 RESULTS")
trainClassification_OVM_EnsemblesCV(X_Scaled[["Month", "Offense", "Law_Code","Age", "Sex", "Race"]],Y_Borough4, 1, "Borough")
print(f"#################")

As we can see, we obtain a high accuracy when classifying from an OVM perspective, even though we lack a variable that is linearly separable across all classes. Of course, this adds complexity to the construction of our pipeline, which could be implemented as an ensemble of the best binary predictors whose results are handled as a voting system: the class with the highest probability is the one predicted by the system.
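
The voting system described above could look like the following sketch. It is an illustration under assumptions, not the pipeline of this notebook: the data is synthetic, a single `GradientBoostingClassifier` per class plays the role of "best predictor", and the names `binary_models` and `positive_proba` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic five-class problem standing in for the five boroughs
X_toy, y_toy = make_classification(n_samples=500, n_features=6, n_informative=4,
                                   n_classes=5, n_clusters_per_class=1,
                                   random_state=0)
x_tr, x_te, y_tr, y_te = train_test_split(X_toy, y_toy, stratify=y_toy,
                                          random_state=0)
classes = np.unique(y_tr)

# One binary model per class, trained on "class k vs. the rest"
binary_models = [
    GradientBoostingClassifier(n_estimators=10, random_state=0)
    .fit(x_tr, (y_tr == k).astype(int))
    for k in classes
]

# Column k holds the probability that binary model k assigns to "is class k"
positive_proba = np.column_stack(
    [m.predict_proba(x_te)[:, 1] for m in binary_models]
)

# Voting step: the class whose binary model is most confident wins
y_pred = classes[np.argmax(positive_proba, axis=1)]
print("accuracy of the combined system:", round(float(np.mean(y_pred == y_te)), 3))
```

scikit-learn packages the same idea in `sklearn.multiclass.OneVsRestClassifier`. Note that the accuracy of each binary model on its own is not directly comparable with a multiclass accuracy, since in every binary problem most samples belong to the "rest" class; the accuracy of the combined system is the comparable figure.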

Now, let's jump into the regression problem, which is our main point of interest in this work, as we had trouble predicting the crime latitude.

In [None]:
# Function to train the regression problem with 10-fold CV
# The procedure is similar to the function above, although there are fewer values to collect during
# the training process (we only want the R2 score)
def trainRegression_EnsemblesCV(X,Y, ylabel):
    cv = KFold(n_splits=10) 
    r2_st = 0
    r2_ada = 0
    r2_gb = 0
    r2_bg = 0
    r2_rfo = 0
    iteration = 0
    for train_idx, test_idx in cv.split(X,Y):
        x_train, y_train = X.iloc[train_idx], Y.iloc[train_idx]
        x_test, y_test = X.iloc[test_idx], Y.iloc[test_idx]
        
        # Each stacking entry must be a (name, estimator) pair
        stacking_regressors = [('lr', LinearRegression()), ('svr', SVR(max_iter=1000, C=50, kernel="poly", coef0=2)), ('ridge', Ridge(alpha=0.1)), ('lasso', Lasso(alpha=0.1))]
        model_ada = AdaBoostRegressor(n_estimators=100)
        model_gb = GradientBoostingRegressor()
        # Squared error is the default criterion; its name changed from "mse" to "squared_error" across scikit-learn versions
        model_RandomForest = RandomForestRegressor(bootstrap=True)
        model_stacking = StackingRegressor(estimators=stacking_regressors, final_estimator=RandomForestRegressor(n_estimators=10))
        # The base estimator is passed positionally because its keyword was renamed (base_estimator -> estimator)
        model_bagging = BaggingRegressor(DecisionTreeRegressor(), n_estimators=10)
    
        model_ada.fit(x_train, y_train[ylabel].ravel())
        model_gb.fit(x_train, y_train[ylabel].ravel())
        model_RandomForest.fit(x_train, y_train[ylabel].ravel())
        model_stacking.fit(x_train, y_train[ylabel].ravel())
        model_bagging.fit(x_train, y_train[ylabel].ravel())
    
        print("Iteration ", iteration)
        iteration+=1
    
        r2_ada = r2_ada + model_ada.score(x_test, y_test[ylabel].ravel())
        r2_gb = r2_gb + model_gb.score(x_test, y_test[ylabel].ravel())
        r2_rfo = r2_rfo + model_RandomForest.score(x_test, y_test[ylabel].ravel())
        r2_bg = r2_bg + model_bagging.score(x_test, y_test[ylabel].ravel())
        r2_st = r2_st + model_stacking.score(x_test, y_test[ylabel].ravel())
    
        print(f"R2 Ada Boost {model_ada.score(x_test, y_test[ylabel].ravel())}")
        print(f"R2 Gradient Boost {model_gb.score(x_test, y_test[ylabel].ravel())}")
        print(f"R2 Random Forest {model_RandomForest.score(x_test, y_test[ylabel].ravel())}")
        print(f"R2 Stacking {model_stacking.score(x_test, y_test[ylabel].ravel())}")
        print(f"R2 Bagging {model_bagging.score(x_test, y_test[ylabel].ravel())}")
    
        print("-----------------")

    result_ada = np.round(r2_ada/10,3)
    result_gb = np.round(r2_gb/10,3)
    result_rfo = np.round(r2_rfo/10,3)
    result_st = np.round(r2_st/10,3)
    result_bg = np.round(r2_bg/10,3)
    print("Mean R2 Ada Boost: ", result_ada) 
    print("Mean R2 Gradient Boost: ", result_gb)
    print("Mean R2 RandomForest: ", result_rfo) 
    print("Mean R2 Stacking: ", result_st)
    print("Mean R2 Bagging: ", result_bg)    

In [None]:
trainRegression_EnsemblesCV(X_scaled[["Offense", "Precint", "Month", "Race"]], Y_Lat, "Latitude")

In [None]:
trainRegression_EnsemblesCV(X_scaled[["Offense", "Precint", "Month", "Race"]], Y_Long, "Longitude")

As we can observe, we cannot improve the accuracy of the models when predicting the location of the crime, as they still struggle with the problem described before (no correlation between the predicted variable and the regressors). This brings up a clear idea for this work: more power does not necessarily bring better results, as these methods are completely data-driven and rely on assumptions that have to hold if we want to obtain good predictions from our models, which can perform better or worse depending on their structure or the properties of the data.
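
The point can be reproduced in isolation with a small sketch. Here the target is drawn independently of the regressors (an assumption that exaggerates the situation of our dataset), and both a simple and a more powerful model are evaluated with cross-validation; the arrays `X_noise` and `y_noise` are synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X_noise = rng.normal(size=(300, 4))   # regressors
y_noise = rng.normal(size=300)        # target drawn independently of X_noise

cv = KFold(n_splits=5, shuffle=True, random_state=0)
r2_linear = cross_val_score(LinearRegression(), X_noise, y_noise,
                            cv=cv, scoring="r2").mean()
r2_forest = cross_val_score(RandomForestRegressor(n_estimators=20, random_state=0),
                            X_noise, y_noise, cv=cv, scoring="r2").mean()

print("Mean R2 LinearRegression:", round(r2_linear, 3))
print("Mean R2 RandomForest:    ", round(r2_forest, 3))
```

When the regressors carry no information about the target, neither model can score meaningfully above zero on held-out data, and a more flexible model has more room to fit noise in the training folds.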

## Clustering

As I mentioned in the introduction, in my previous work I explored the criminality data of the dataset used in this project. In this section, we are going to apply clustering to try to unveil interesting information about how crimes and occurrences are grouped across the different boroughs that compose the dataset. Specifically, we are going to focus on how these boroughs are distributed for drug-dealing crimes, as they are the most frequent ones in our dataset and they involve the majority of underage and young groups.

The clustering operation will try to generate groups (which ideally will represent the boroughs) from the number of events and the population of each borough per year. That is, we are going to divide the data by neighborhoods in each borough and use the yearly population as a second variable (as we need to somehow contextualize the number of incidents). We will observe whether the clusters show any tendency to group boroughs or, if the method reaches an ideal state, whether they are able to isolate each borough in its own cluster.

We first need to prepare the dataset based on the neighborhoods and filter it to keep only the drug-related crimes.

In [None]:
neighborhoods = pd.read_csv('../input/nyc-neighborhoods-dataset/nynta.csv')

In [None]:
data_crim = dataframe[["ARREST_DATE","OFNS_DESC","ARREST_BORO","Neighborhood"]].dropna()
# Keep only the drug-related offenses; .copy() avoids modifying a view of data_crim
drugs_dataframe = data_crim[data_crim.OFNS_DESC.isin(['DANGEROUS DRUGS', 'LOITERING FOR DRUG PURPOSES'])].copy()
drugs_dataframe.reset_index(inplace=True)
drugs_dataframe["OFNS_DESC"] = "Drugs"
# Every row counts as one incident, so this column can be summed later
drugs_dataframe["Inc"] = 1
drugs_dataframe

Then, we need to isolate the year in the date-related field, as we are going to cluster by neighborhood and year (since the population changes from one year to another).

In [None]:
drugs_dataframe.ARREST_DATE = pd.to_datetime(drugs_dataframe.ARREST_DATE).dt.year
drugs_dataframe.head()

We remove the index column generated in the previous operations.

In [None]:
drugs_dataframe = drugs_dataframe[["ARREST_DATE", "OFNS_DESC", "ARREST_BORO", "Neighborhood", "Inc"]]

Now, we have to prepare the population dataframe, as we need this reference to contextualize the incidents across neighborhoods.

In [None]:
population = pd.read_csv('../input/new-york-city-population/new-york-city-population-by-borough-1950-2040.csv')
borough_interest = population[["Borough","2000", "2010", "2020"]]
borough_interest
population_dataframe = pd.DataFrame(columns=["Year","Total", "Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island"])

In [None]:
# Census figures are only available for 2000, 2010 and 2020; the years in between
# are left as NaN so that they can be filled by interpolation afterwards.
# Rows are collected in a list because DataFrame.append was removed in pandas 2.0.
borough_labels = {"Total": "NYC Total", "Bronx": "   Bronx", "Brooklyn": "   Brooklyn",
                  "Manhattan": "   Manhattan", "Queens": "   Queens", "Staten Island": "   Staten Island"}
population_rows = []
for i in range(2000, 2021):
    row = {"Year": str(i)}
    for column, label in borough_labels.items():
        if i in (2000, 2010, 2020):
            row[column] = borough_interest.loc[borough_interest["Borough"] == label, str(i)].iloc[0]
        else:
            row[column] = np.nan
    population_rows.append(row)
population_dataframe = pd.DataFrame(population_rows, columns=population_dataframe.columns)
population_dataframe = population_dataframe.set_index("Year")
population_dataframe = population_dataframe.apply(pd.to_numeric)
population_dataframe = population_dataframe.interpolate(method="linear")
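
As a quick check of what the linear interpolation does, here is a toy sketch. The two population figures are made up for illustration and are not the real census values.

```python
import numpy as np
import pandas as pd

# Hypothetical figures: only the first and last years are known
toy_population = pd.DataFrame(
    {"Year": [str(y) for y in range(2000, 2011)], "Bronx": np.nan}
).set_index("Year")
toy_population.loc["2000", "Bronx"] = 1_000_000
toy_population.loc["2010", "Bronx"] = 1_100_000

# method="linear" treats the rows as equally spaced and fills the gaps
# along a straight line between the known values
filled = toy_population.interpolate(method="linear")
print(filled)
```

Each intermediate year receives an equal share of the change between the two known values, so the midpoint year lands exactly halfway between them.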

In [None]:
populations = []
correspondence = {"B": "Bronx", "K": "Brooklyn", "M": "Manhattan", "Q":"Queens", "S":"Staten Island"}
for row in drugs_dataframe[['ARREST_DATE','ARREST_BORO']].itertuples(index=False):
    #print(str(row.ARREST_DATE))
    populations.append(population_dataframe.loc[str(row.ARREST_DATE)][correspondence[row.ARREST_BORO]])

Once we have obtained the list of populations, we append it as a column to our drug incidents dataframe in order to perform the groupby operation.

In [None]:
drugs_dataframe["Population"] = populations
drugs_dataframe

We then proceed with the group-by operation on Neighborhood.

In [None]:
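# Sum the incident counts and the populations per neighborhood (numeric columns only)
grouped_incidents = drugs_dataframe.groupby('Neighborhood')[["Inc", "Population"]].sum()
# Dividing the summed population by the number of incidents gives the mean population over those incidents
grouped_incidents["Population"] = grouped_incidents["Population"] / grouped_incidents["Inc"]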
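
The sum-then-divide step can be verified on a toy dataframe. The values below are invented, and the named-aggregation form is shown only as an equivalent alternative, not as the code used in this notebook.

```python
import pandas as pd

toy_incidents = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B"],
    "Inc": [1, 1, 1, 1, 1],
    "Population": [100.0, 100.0, 130.0, 50.0, 70.0],
})

# Approach used above: sum both columns, then divide the summed population by the count
summed = toy_incidents.groupby("Neighborhood")[["Inc", "Population"]].sum()
summed["Population"] = summed["Population"] / summed["Inc"]

# Equivalent formulation: count the incidents and average the population directly
direct = toy_incidents.groupby("Neighborhood").agg(
    Inc=("Inc", "sum"), Population=("Population", "mean")
)
print(direct)
```

Both versions give one row per neighborhood with the total number of incidents and the average population attached to those incidents.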

We plot the data to visually evaluate how difficult it is to cluster it into different groups.

In [None]:
plt.scatter(grouped_incidents['Population'], grouped_incidents['Inc'])

As we can observe, this data can be separated into (at least) three distinct groups by proximity. Even so, we are going to determine analytically how many groups this data can be clustered into, and then observe whether those groups correspond to the boroughs represented in the data. That is, we are going to determine whether each borough is actually separable and distinguishable from the others in terms of drug events for each year.

I first scale the data to ease the work of the clustering method.

In [None]:
scaled_drugs = pd.DataFrame(StandardScaler().fit_transform(grouped_incidents))
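
The reason scaling matters here is that populations are in the millions while incident counts are far smaller, so Euclidean distances would be dominated by the population axis. The sketch below shows the effect of `StandardScaler` on invented values; unlike the cell above, it also keeps the index and column names, which is an optional variation.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Invented values with very different magnitudes per column
toy_grouped = pd.DataFrame(
    {"Inc": [120, 300, 45, 800], "Population": [1.4e6, 2.5e6, 4.8e5, 1.6e6]},
    index=["N1", "N2", "N3", "N4"],
)

scaled_toy = pd.DataFrame(
    StandardScaler().fit_transform(toy_grouped),
    index=toy_grouped.index,
    columns=toy_grouped.columns,
)

# After scaling, every column has zero mean and unit (population) standard deviation
print(scaled_toy.mean().round(6))
print(scaled_toy.std(ddof=0).round(6))
```

With both columns on the same scale, a distance-based method such as KMeans or DBSCAN weighs incidents and population equally.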

In [None]:
from sklearn.cluster import DBSCAN, KMeans
import seaborn as sns

We then define a function that applies the elbow rule to determine the best number of clusters in the dataframe with the KMeans method, which is one of the methods we are going to use to perform clustering.

In [None]:
def elbow_rule(dataframe):
  plt.figure(figsize=(10, 8))
  wcss = []
  for i in range(1, 15):
    kmeans = KMeans(n_clusters = i, init = 'k-means++')
    kmeans.fit(dataframe)
    wcss.append(kmeans.inertia_) #criterion based on which K-means clustering works
  plt.plot(range(1, 15), wcss)
  plt.title('The Elbow Method')
  plt.xlabel('Number of clusters')
  plt.ylabel('WCSS')
  plt.show()

In [None]:
elbow_rule(scaled_drugs)

After running the elbow rule, we observe that the curve is very smooth, so it is hard to tell which is the best number of clusters for this dataset. In this case I will pick four, as this is roughly where the drop in WCSS starts to flatten towards the end of the curve, so we can take it as the elbow point of this plot.

In [None]:
kmeans = KMeans(n_clusters = 4, init = 'k-means++').fit(scaled_drugs)
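
When the elbow is hard to read, a complementary criterion is the silhouette score. This is not used in this notebook; the sketch below only shows how it could be applied, on synthetic points generated around four hand-picked centers (an assumption for illustration).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic points around four well-separated centers
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10]])
points = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

silhouettes = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, init="k-means++", n_init=10,
                    random_state=0).fit_predict(points)
    # Values close to 1 mean compact, well-separated clusters
    silhouettes[k] = silhouette_score(points, labels)

best_k = max(silhouettes, key=silhouettes.get)
print({k: round(v, 3) for k, v in silhouettes.items()})
print("k with the highest silhouette:", best_k)
```

On real data the maximum is usually less clear-cut than in this toy case, so the score is better read alongside the elbow plot than as a replacement for it.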

I then apply a second method for comparison. In this case I use DBSCAN, as it is based on the density of points in the feature space and, for a given `eps` and `min_samples`, it determines the number of clusters automatically. It is also able to isolate potential outliers that make clustering harder.

In [None]:
dbscan = DBSCAN(eps=0.5).fit(scaled_drugs)
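
To show how DBSCAN reports its clusters and outliers, here is a small sketch on hand-made points (two dense grids and one isolated point, all invented for illustration).

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense 5x5 grids of points plus one isolated point
grid = np.array([[x, y] for x in np.arange(5) * 0.1 for y in np.arange(5) * 0.1])
toy_points = np.vstack([grid, grid + 5.0, [[20.0, 20.0]]])

toy_labels = DBSCAN(eps=0.5, min_samples=5).fit(toy_points).labels_

# Noise points receive the label -1 and do not count as a cluster
n_clusters = len(set(toy_labels) - {-1})
n_noise = int(np.sum(toy_labels == -1))
print("clusters found:", n_clusters)
print("points labelled as noise:", n_noise)
```

Because the noise group is labelled `-1`, iterating over the distinct values of `labels_` (rather than over a fixed range starting at 0) is a simple way to make sure it is included when inspecting the results.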

We then attach the cluster labels to the dataframe so that we can plot the obtained clusters.

In [None]:
grouped_incidents['Cluster_dbscan'] = dbscan.labels_
grouped_incidents['Cluster_kmeans'] = kmeans.labels_
#plt.scatter(grouped_incidents['Population'], grouped_incidents['Inc'], labels=)

First, we plot the KMeans clusters.

In [None]:
fg = sns.FacetGrid(data=grouped_incidents, hue='Cluster_kmeans')
fg.map(plt.scatter, 'Population', 'Inc').add_legend()

As we can observe, the KMeans clustering method produces four clusters that can be reasonably split by distance in the dataset (except the third one, which picks up the upper points in the plot).

Next, we plot the clusters produced by DBSCAN.

In [None]:
fg = sns.FacetGrid(data=grouped_incidents, hue='Cluster_dbscan')
fg.map(plt.scatter, 'Population', 'Inc').add_legend()

In this case, five distinct groups have been found (plus another one composed of potential outliers of the dataset). This is interesting, as it suggests that DBSCAN could have separated the data into boroughs (since we are dealing with five different classes).

However, we need to compute the frequency of boroughs in each cluster to determine whether the methods were able to separate the data ideally or not.

In [None]:
borough_list = []
for row in grouped_incidents.reset_index().itertuples(index=False):
    borough_list.append(neighborhoods[neighborhoods.NTAName == row.Neighborhood]["BoroName"].values[0])

In [None]:
grouped_incidents["Borough"] = borough_list
grouped_incidents["Count"] = [1 for _ in borough_list]
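
The borough frequencies per cluster can also be read from a single table. The following sketch uses `pd.crosstab` on an invented dataframe; the column names mirror the ones above, but the rows are made up for illustration.

```python
import pandas as pd

toy_clusters = pd.DataFrame({
    "Borough": ["Bronx", "Bronx", "Queens", "Queens",
                "Staten Island", "Staten Island"],
    "Cluster_kmeans": [0, 0, 0, 1, 2, 2],
})

# Rows are clusters, columns are boroughs, cells are neighborhood counts
counts = pd.crosstab(toy_clusters["Cluster_kmeans"], toy_clusters["Borough"])

# normalize="index" turns every row into shares that add up to 1
shares = pd.crosstab(toy_clusters["Cluster_kmeans"], toy_clusters["Borough"],
                     normalize="index")
print(counts)
print(shares.round(2))
```

A cluster that isolates a borough shows up as a row with a single non-zero column, which is easy to spot without drawing one bar plot per cluster.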

We plot the number of neighborhoods of each borough within every cluster.

In [None]:
def plot_clusters(n_cluster, method):
    clusteringmth = f"Cluster_{method}"
    cluster = grouped_incidents[grouped_incidents[clusteringmth] == n_cluster]
    grouped = cluster.groupby("Borough").sum()["Count"]
    grouped = grouped.reset_index()
    fig = plt.figure()
    sns.barplot(x=grouped["Borough"], y=grouped["Count"])

for i in range(4):
    plot_clusters(i, "kmeans")

As we can observe, the KMeans clustering method is able to distinguish the Staten Island borough from the others, as it is the one with the lowest criminality rates of all the data in our dataframe. However, it is not able to make a clear separation between the rest of the boroughs, as they are mixed up across different clusters with no apparent logic (we do not have any economic or demographic dataset that could explain these decisions).

We now look at the results produced by the DBSCAN method.

In [None]:
for i in range(5):
    plot_clusters(i, "dbscan")

We observe again that the DBSCAN method is able to isolate Staten Island from the rest of the boroughs, due to the same reason explained before. However, the distribution of the rest of the boroughs is quite different from that of the method above. In this case, two clusters group the boroughs with the highest incident ratio and crime per capita in the dataset (as observed in my previous work): the Bronx and Manhattan, one being the most crowded and the other the most humble and the most prone to crime committed for survival. The rest of the clusters gather data from boroughs with similar crime per capita, although the method is able to partially isolate Brooklyn. This is interesting, as the clustering method is able to differentiate the most conflictive boroughs from the rest, even if it is not able to draw clear lines separating them. Again, to evaluate how representative these clusters are, we would need additional data (such as economic and demographic data) that we could not bring into the project due to processing infeasibility (as Kaggle has its limitations) or the lack of public resources from which to get the desired data.