# K-Nearest Neighbors with sklearn

In this Jupyter notebook, we train sklearn's KNeighborsClassifier, which serves as a benchmark for our own kNN model.

### Importing libraries

First, we import the necessary libraries. The model itself is KNeighborsClassifier from sklearn.

In [1]:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import KNeighborsClassifier
import numpy as np
from sklearn.model_selection import train_test_split
from math import sqrt
from sklearn.metrics import classification_report
from sklearn.preprocessing import OneHotEncoder, StandardScaler

### Data handling

Next, we load the input data file, which has already been preprocessed.

In [2]:
# Load the data file

data = pd.read_csv('../output/data/real_estate_preprocessed.csv')
data = data[['district', 'size', 'floor', 'registration', 'rooms', 'parking', 'balcony', 'state', 'price']]

In [3]:
# Add class labels based on prices, and drop the 'price' column afterwards

classes = list()
for i, d in data.iterrows():
    if d['price'] <= 49999:
        classes.append('<= 49.999')
    elif d['price'] <= 99999:
        classes.append('50.000 - 99.999')
    elif d['price'] <= 149999:
        classes.append('100.000 - 149.999')
    elif d['price'] <= 199999:
        classes.append('150.000 - 199.999')
    else:
        classes.append('>= 200.000')

data.drop(labels=['price'], axis=1, inplace=True)
data = data.assign(target=classes)
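The same binning can be expressed without an explicit row loop. The following is a minimal sketch, assuming `pd.cut` with right-closed intervals is an acceptable substitute for the `if`/`elif` chain; the `prices` series is hypothetical sample data, not taken from the real estate file.

```python
import pandas as pd

# Hypothetical sample prices (not from the dataset), chosen to hit the bin edges
prices = pd.Series([30000, 49999, 50000, 120000, 199999, 200000])

# Right-closed intervals reproduce the `<=` comparisons of the loop:
# (-inf, 49999], (49999, 99999], (99999, 149999], (149999, 199999], (199999, inf)
bins = [float('-inf'), 49999, 99999, 149999, 199999, float('inf')]
labels = ['<= 49.999', '50.000 - 99.999', '100.000 - 149.999',
          '150.000 - 199.999', '>= 200.000']

# astype(str) turns the categorical result into plain strings, like the loop does
target = pd.cut(prices, bins=bins, labels=labels, right=True).astype(str)
print(target.tolist())
```

One difference to keep in mind: a missing price falls through to `'>= 200.000'` in the loop, whereas `pd.cut` leaves it unlabeled.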

In [4]:
# Split data into feature vectors and outputs

X = np.array(data.iloc[:, 0:-1])
y = np.array(data.iloc[:, -1:])

In [5]:
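# Use OneHotEncoder for the categorical features (columns 0 and 7: 'district' and 'state')
# Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; use `sparse=False` on older versions

ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(sparse_output=False), [0, 7])], remainder='passthrough')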
X = np.array(ct.fit_transform(X))

In [6]:
# Split data into train and test sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [7]:
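# Use StandardScaler to scale the numeric inputs; the scaler is fitted on the train set only
# After one-hot encoding, columns -6, -5 and -3 are 'size', 'floor' and 'rooms'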

ct2 = ColumnTransformer([('standard_scaler', StandardScaler(), [-3, -5, -6])], remainder='passthrough')
X_train = ct2.fit_transform(X_train)
X_test = ct2.transform(X_test)
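Fitting the scaler on the train set and only transforming the test set means the test inputs are standardized with the train set's mean and standard deviation. A minimal NumPy sketch of that idea, using hypothetical values for a single feature column (this illustrates the arithmetic, not sklearn's internals):

```python
import numpy as np

# Hypothetical single-feature train and test columns (e.g. apartment sizes)
train = np.array([[50.0], [70.0], [90.0]])
test = np.array([[70.0], [110.0]])

# "fit": statistics come from the train set only (population std, ddof=0)
mean = train.mean(axis=0)
std = train.std(axis=0)

# "transform": both sets are scaled with the train statistics
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.ravel())
print(test_scaled.ravel())
```

The scaled train column has zero mean and unit variance by construction; the test column generally does not, which is expected.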

In [8]:
# Set the number of neighbors to the square root of the train set size, rounded up to an odd number

n_neigh = len(X_train)
n_neigh = int(np.ceil(sqrt(n_neigh)) // 2 * 2 + 1)
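The rounding in the cell above can be checked in isolation. This sketch wraps the same arithmetic in a hypothetical helper, `odd_k` (a name introduced here for illustration): take the ceiling of the square root, then move up to the next odd number if the result is even.

```python
from math import ceil, sqrt

def odd_k(n_train):
    """Square root of the train set size, rounded up to an odd integer."""
    k = ceil(sqrt(n_train))   # e.g. 100 -> 10
    return k // 2 * 2 + 1     # even k becomes k + 1, odd k is kept

print(odd_k(81))    # ceil(sqrt(81)) = 9, already odd
print(odd_k(100))   # ceil(sqrt(100)) = 10, bumped to 11
print(odd_k(4640))  # ceil(68.1...) = 69, already odd
```

An odd k rules out tied votes between two classes. With five classes, as here, ties among three or more classes remain possible.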

### Model training and evaluation

We use KNeighborsClassifier from sklearn.neighbors as our base model, and classification_report to inspect performance metrics such as accuracy and F1-score.

In [9]:
# Fit sklearn's KNeighborsClassifier on the train set and predict the test set

classifier = KNeighborsClassifier(n_neighbors=n_neigh)
classifier.fit(X_train, y_train.ravel())
y_pred = classifier.predict(X_test)

In [10]:
print(classification_report(y_test, y_pred))

                   precision    recall  f1-score   support

100.000 - 149.999       0.51      0.59      0.55       324
150.000 - 199.999       0.43      0.31      0.36       206
  50.000 - 99.999       0.57      0.63      0.60       306
        <= 49.999       1.00      0.06      0.12        63
       >= 200.000       0.70      0.79      0.74       261

         accuracy                           0.57      1160
        macro avg       0.64      0.48      0.47      1160
     weighted avg       0.58      0.57      0.55      1160
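The two average rows of the report can be reproduced from the per-class rows. The sketch below recomputes the macro and weighted F1-score from the rounded values printed above, so the last digit could in principle differ from what sklearn computes on unrounded scores.

```python
# Per-class (f1-score, support) pairs copied from the report above
per_class = {
    '100.000 - 149.999': (0.55, 324),
    '150.000 - 199.999': (0.36, 206),
    '50.000 - 99.999': (0.60, 306),
    '<= 49.999': (0.12, 63),
    '>= 200.000': (0.74, 261),
}

total_support = sum(support for _, support in per_class.values())

# Macro average: every class counts equally, regardless of its size
macro_f1 = sum(f1 for f1, _ in per_class.values()) / len(per_class)

# Weighted average: every class counts in proportion to its support
weighted_f1 = sum(f1 * support for f1, support in per_class.values()) / total_support

print(total_support, round(macro_f1, 2), round(weighted_f1, 2))
```

The macro average (0.47) is lower than the weighted average (0.55) because the small `<= 49.999` class, with an F1-score of 0.12 on 63 examples, carries the same weight as the larger classes in the macro average.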

