We first use one of the simplest approaches to classification: a k-nearest neighbors (k-NN) classifier.

The scikit-learn library provides an efficient and easy-to-use implementation of the algorithm. We will tune two parameters: the number of neighbors and the metric used to measure the distance between elements of the dataset.

We use the $\ell_p$ metric and treat $p$ as a parameter to optimize.
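
For reference, the $\ell_p$ (Minkowski) distance between two feature vectors $x, y \in \mathbb{R}^d$ is usually written as follows (the symbols $x$, $y$ and $d$ are notation introduced here, not taken from the rest of the notebook):

$$d_p(x, y) = \left( \sum_{i=1}^{d} |x_i - y_i|^p \right)^{1/p}$$

With this definition, $p=1$ gives the Manhattan distance and $p=2$ the Euclidean distance. In `KNeighborsClassifier`, it corresponds to `metric='minkowski'` together with the parameter `p`.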

In [None]:
# %cd /content/drive/MyDrive/Ponts/MachineLearning/CrimeSF_Malap



In [None]:
import pandas as pd
import numpy as np


We first load our pre-processed data, then pick a sample on which to run the algorithm and optimize the parameters.

In [None]:
train_data = pd.read_csv('data/pre_processing_train_data.csv')
# Drop the first column (presumably an index column written during pre-processing)
train_data = train_data.iloc[:, 1:]

In [None]:
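# Work on a random sample of 20,000 rows to keep the parameter search fast
train_sample = train_data.sample(n=20000)
train_labels = train_sample['Category']
# Reassign instead of dropping in place, to avoid modifying a slice of train_data
train_sample = train_sample.drop('Category', axis=1)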

In [None]:
from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import log_loss

In [None]:
X = train_sample
y = train_labels

We split the labelled data into a training set and a test set, which we will use to check our performance.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
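knn = KNeighborsClassifier(n_neighbors=9, p=1)
knn.fit(X_train, y_train)

p_pred = knn.predict_proba(X_test)

# argmax gives a column index of p_pred; map it back to a label through knn.classes_
y_pred = knn.classes_[np.argmax(p_pred, axis=1)]
print(accuracy_score(y_test, y_pred))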

0.15125
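
The cell above imports `log_loss` but only reports accuracy. Since `predict_proba` returns class probabilities, a probability-based score can also be computed. The following is a minimal sketch on synthetic data; `X_demo`, `y_demo`, `knn_demo` and the other names are hypothetical stand-ins, not the notebook's variables, and the generated dataset only assumes numeric features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the pre-processed data (assumption: numeric features)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=6, n_informative=4,
    n_redundant=0, n_classes=3, random_state=0,
)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

knn_demo = KNeighborsClassifier(n_neighbors=9, p=1).fit(Xtr, ytr)
proba = knn_demo.predict_proba(Xte)

# The columns of `proba` follow knn_demo.classes_, so pass them explicitly as `labels`
loss = log_loss(yte, proba, labels=knn_demo.classes_)
print(loss)
```

Passing `labels` explicitly keeps the columns of the probability matrix aligned with the classes even when the test split does not contain every class.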


We now run a cross-validated grid search over the two parameters of our k-NN classifier: the number of neighbors and $p$.

In [None]:
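def cross_val_knn(X, y, nb_range, p_range):
    """Grid-search n_neighbors and p by cross-validated score; return the best pair."""
    n = len(nb_range)
    m = len(p_range)
    res = np.zeros((n, m))  # res[i, j] = mean CV score for (nb_range[i], p_range[j])
    for i in range(n):
        for j in range(m):
            print((nb_range[i], p_range[j]))
            knn = KNeighborsClassifier(n_neighbors=nb_range[i], p=p_range[j])
            # cross_val_score uses the classifier's default score (accuracy)
            scores = cross_val_score(knn, X, y)
            score = np.mean(scores)
            print(score)
            res[i, j] = score
    # Indices of the highest mean score in the grid
    (imax, jmax) = np.unravel_index(res.argmax(), res.shape)
    return (nb_range[imax], p_range[jmax])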

n_voisins, p_opti = cross_val_knn(X,y,nb_range=range(1,21),p_range=range(1,6))
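
The same kind of search can also be expressed with scikit-learn's `GridSearchCV`, which handles the double loop and keeps the per-combination scores. This is only a sketch on synthetic data with a deliberately small grid; `X_demo`, `y_demo`, `param_grid` and `search` are hypothetical names, and it is not the code that produced the results in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the sampled training data (assumption: numeric features)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4,
    n_redundant=0, n_classes=3, random_state=0,
)

# Small illustrative grid; the notebook explores range(1, 21) and range(1, 6)
param_grid = {"n_neighbors": [1, 5, 9], "p": [1, 2]}

search = GridSearchCV(KNeighborsClassifier(), param_grid, scoring="accuracy")
search.fit(X_demo, y_demo)

# Best parameter combination and its mean cross-validated accuracy
print(search.best_params_, search.best_score_)
```

After fitting, `search.cv_results_` holds the mean score of every combination, which plays the role of the `res` array in `cross_val_knn`.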


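We keep $n=20$ neighbors and $p=3$.

With that many neighbors and uniform weights, each predicted probability is a multiple of $1/20 = 0.05$, so the least represented categories will usually get a probability of exactly 0.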
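
To illustrate this effect, here is a small sketch on a toy imbalanced dataset (all names and data below are hypothetical and unrelated to the crime data). With `weights='uniform'`, the probability of a class is the fraction of the $k$ nearest neighbors that belong to it, so a class with very few training points rarely appears among the neighbors.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
k = 20

# Toy data: two large classes and one rare class with only 3 points
X_toy = rng.normal(size=(1003, 2))
y_toy = np.array([0] * 500 + [1] * 500 + [2] * 3)

knn_toy = KNeighborsClassifier(n_neighbors=k).fit(X_toy, y_toy)
proba_toy = knn_toy.predict_proba(rng.normal(size=(200, 2)))

# Each probability is (number of neighbors in the class) / k
print(np.allclose(proba_toy * k, np.round(proba_toy * k)))

# Share of query points for which the rare class gets probability exactly 0
zero_share = np.mean(proba_toy[:, 2] == 0)
print(zero_share)
```

One possible mitigation, not explored here, is `weights='distance'`, which changes how neighbors are weighted but still assigns 0 to any class absent from the $k$ neighbors.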

We now train our k-NN classifier on the whole training set.

In [None]:
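train_data = pd.read_csv('data/pre_processing_train_data.csv')
# Drop the first column (presumably an index column written during pre-processing)
train_data = train_data.iloc[:, 1:]
# Separate the labels from the features
train_labels = train_data['Category']
train_data.drop('Category', inplace=True, axis=1)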

In [None]:
knn = KNeighborsClassifier(n_neighbors=20, p=3)
knn.fit(train_data, train_labels)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=20, p=3,
                     weights='uniform')

In [None]:
test_data = pd.read_csv('data/pre_processing_test_data.csv')
test_data = test_data.iloc[:,1:]


In [None]:
p_pred = knn.predict_proba(test_data)
# Column index of the most probable class for each row (columns follow knn.classes_)
prediction = np.argmax(p_pred, axis=1)


We save the predicted probabilities to a `.npy` file, so that we do not have to run the prediction on the whole dataset again.


In [None]:
np.save('results/prediction.npy', p_pred)
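
In a later session, the saved array can be read back with `np.load`. The sketch below shows the round trip with a small stand-in array and a temporary directory; `p_demo`, `p_loaded` and the temporary path are hypothetical and do not touch `results/prediction.npy`.

```python
import os
import tempfile

import numpy as np

# Stand-in for `p_pred` (assumption: an array of shape (n_samples, n_classes))
p_demo = np.array([[0.05, 0.25, 0.70],
                   [0.60, 0.40, 0.00]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "prediction.npy")
    np.save(path, p_demo)      # write the array to disk
    p_loaded = np.load(path)   # read it back without recomputing anything

# The reloaded array is identical to the one that was saved
print(np.array_equal(p_demo, p_loaded))
```

Note that `np.save` does not create missing directories, so the `results/` folder has to exist before the cell above is run.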

In [None]:
print(p_pred)

[[0.05 0.25 0.15 ... 0.   0.   0.  ]
 [0.05 0.25 0.15 ... 0.   0.   0.  ]
 [0.05 0.25 0.2  ... 0.   0.   0.  ]
 ...
 [0.05 0.05 0.05 ... 0.   0.   0.  ]
 [0.05 0.05 0.05 ... 0.   0.   0.  ]
 [0.05 0.05 0.05 ... 0.   0.   0.  ]]
