**Classification using KNN**

<u>Goal:</u>

The main goal is to learn how to use KNN with the scikit-learn library.

<u>To do:</u>

Complete the source code wherever you find <font color='red'>#?</font>

This case study concerns simple house data.

This data does not require preprocessing.

The goals are to:
- Train a KNN classifier on a labeled population of houses.
- Given a new house, predict the label of the class to which it belongs.

**1. Create the dataset**

The classification requires a <font color='red'>labeled dataset</font>.

In our case, the dataset is composed of 2 parts:
- <font color='red'>X</font> : data table of shape Nxd :
   - The N rows are houses
   - The d columns are the primitive characteristics/features of the houses.
      - They must be numerical values, as required for machine learning.
      - In our case, they are 'surface' and 'nb_rooms'
- <font color='red'>y</font> : a list of N labels.
   - The label values belong to a finite set of discrete class values.
   - In our case, the possible class values are {'cheap', 'expensive'}
   - Each label corresponds to the class of a house (row) in X.

In [1]:
# Import numpy module
import numpy as np

In [2]:
# Create X a numpy matrix that contains the primitive data (surface and nb_rooms) of houses:
# [[100,2],
# [100,2],
# [200,2],
# [200,3],
# [300,2],
# [300,2],
# [400,3],
# [450,4],
# [500,4],
# [500,5],
# [600,4],
# [650,5]]
X =  #?
X

array([[100,   2],
       [100,   2],
       [200,   2],
       [200,   3],
       [300,   2],
       [300,   2],
       [400,   3],
       [450,   4],
       [500,   4],
       [500,   5],
       [600,   4],
       [650,   5]])

In [3]:
# Create y that represents the house class labels : ['cheap','cheap','cheap','cheap','cheap','cheap','expensive','expensive','expensive','expensive','expensive','expensive']
y = #?
y

['cheap',
 'cheap',
 'cheap',
 'cheap',
 'cheap',
 'cheap',
 'expensive',
 'expensive',
 'expensive',
 'expensive',
 'expensive',
 'expensive']
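As a quick sanity check on the layout described above, the following sketch shows one possible way to build X and y and to confirm that they line up. The variable names `n_houses` and `n_features` are illustrative and are not part of the exercise.

```python
import numpy as np

# Feature matrix: one row per house, columns are (surface, nb_rooms)
X = np.array([
    [100, 2], [100, 2], [200, 2], [200, 3],
    [300, 2], [300, 2], [400, 3], [450, 4],
    [500, 4], [500, 5], [600, 4], [650, 5],
])

# Labels: one class per row of X, in the same order
y = ['cheap'] * 6 + ['expensive'] * 6

# Illustrative names (not used elsewhere in the notebook)
n_houses, n_features = X.shape
print(n_houses, n_features)    # N rows and d columns
print(len(y) == n_houses)      # there must be exactly one label per house
print(sorted(set(y)))          # the finite set of class values
```

If the number of labels differs from the number of rows, scikit-learn raises an error at fit time, so checking early can save some debugging.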

**2. Split the dataset into train & test**

The dataset is split into:
- train set:
    - It is the labeled subset used to train the model
    - It is composed of X_train and y_train
    - In this notebook, it represents 70% of the whole dataset (when no size is given, train_test_split uses 75%)
- test set:
    - It is the labeled subset used to evaluate the model
    - It is composed of X_test and y_test
    - In this notebook, it represents the remaining 30% of the whole dataset

To perform the splitting, we generally use the function <font color='red'>train_test_split</font> from the module sklearn.model_selection

In [4]:
# Import train_test_split from sklearn.model_selection
from  #? import  #?

In [5]:
# Call train_test_split
# Pass arguments :
# - X : Observable/Measurable/Input data
# - y : Target/Class/Output data
# - train_size : fraction of the data used for training, in our case 0.7
# - stratify : the array of class labels used to stratify the split
#       + It ensures that the distribution of classes or categories 
#         in the original dataset is preserved in both the training and testing sets.
#       + It is important when dealing with imbalanced datasets
X_train, X_test, y_train, y_test = train_test_split( #?,  #?, train_size= #?, stratify=y)

In [6]:
# Show y_train
y_train

['cheap',
 'expensive',
 'cheap',
 'cheap',
 'expensive',
 'expensive',
 'cheap',
 'expensive']

In [7]:
# Show y_test
y_test

['cheap', 'expensive', 'expensive', 'cheap']
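The effect of stratification can be checked by counting the classes on each side of the split. The sketch below is one possible version of the split, not the only one: `random_state=0` is an assumption added here for reproducibility. The notebook's call does not set it, so its split (and the outputs shown above) can change from one run to the next.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

X = np.array([
    [100, 2], [100, 2], [200, 2], [200, 3],
    [300, 2], [300, 2], [400, 3], [450, 4],
    [500, 4], [500, 5], [600, 4], [650, 5],
])
y = ['cheap'] * 6 + ['expensive'] * 6

# random_state is an assumption: it only makes the shuffle repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, stratify=y, random_state=0
)

# With 12 houses, a 0.7 train fraction yields 8 training and 4 test rows
print(len(y_train), len(y_test))

# Stratification keeps the 50/50 class balance on both sides
print(Counter(y_train))
print(Counter(y_test))
```

Which houses end up in each subset depends on the seed, but the class counts per subset should not.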

**3. Learning process using KNN**

In scikit-learn, <font color='red'>KNN</font> is implemented as the <font color='red'>KNeighborsClassifier class</font> in the <font color='red'>sklearn.neighbors</font> module.

The KNN class has:
- a <font color='red'>constructor</font> that initializes the hyperparameters
- a <font color='red'>fit()</font> method that fits the model to the training data
- a <font color='red'>predict()</font> method that predicts the class label of new data
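The three pieces listed above fit together as in the following minimal sketch. The six training houses are a hypothetical subset chosen for illustration, not the split produced earlier in the notebook.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy training data (surface, nb_rooms) for illustration only
X_train = np.array([[100, 2], [200, 2], [300, 2],
                    [450, 4], [500, 5], [650, 5]])
y_train = ['cheap', 'cheap', 'cheap',
           'expensive', 'expensive', 'expensive']

# Constructor: set the hyperparameters
knn = KNeighborsClassifier(n_neighbors=3)

# fit(): store the training data so that neighbors can be looked up later
knn.fit(X_train, y_train)

# predict(): one label per row of the input matrix
predictions = knn.predict([[150, 2], [600, 4]])
print(knn.classes_)
print(predictions)
```

Note that predict() expects a 2D input (one row per sample), even when classifying a single house.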

In [10]:
# Import KNeighborsClassifier class from sklearn.neighbors module
from  #? import  #?

In [11]:
# Create a KNN instance denoted knn from KNeighborsClassifier class
# Initialize the hyperparameter :
# - n_neighbors : the number of nearest neighbors that vote
#   on the class label during prediction
#   In our case, its value is 3
knn = #?(n_neighbors=3)

In [12]:
# Call knn.fit() function
# Pass as arguments :
# - the train data X_train
# - the train class labels y_train 
knn.fit(#?, #?)

In [13]:
# We can show the class labels found by the classifier
knn.classes_

array(['cheap', 'expensive'], dtype='<U9')
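To make the role of n_neighbors concrete, here is a rough sketch of the vote written with NumPy only. It assumes Euclidean distance and equal-weight votes, which match the defaults of KNeighborsClassifier, but the helper `knn_vote` and the toy data are hypothetical, and tie handling may differ from scikit-learn's.

```python
from collections import Counter

import numpy as np

def knn_vote(X_train, y_train, x_new, k=3):
    """Hypothetical helper: majority vote among the k closest training rows."""
    # Euclidean distance from x_new to every training row
    distances = np.linalg.norm(np.asarray(X_train) - np.asarray(x_new), axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Count the labels of those neighbors and keep the most frequent one
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy training data (surface, nb_rooms)
X_train = np.array([[100, 2], [200, 3], [300, 2],
                    [400, 3], [500, 4], [600, 4]])
y_train = ['cheap', 'cheap', 'cheap',
           'expensive', 'expensive', 'expensive']

print(knn_vote(X_train, y_train, [250, 2]))
print(knn_vote(X_train, y_train, [550, 4]))
```

Because the surface values are much larger than the room counts, the distance here is dominated by the surface column.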

**Evaluation using test dataset**

The evaluation step measures the model's performance.

Several metrics are computed on the test data:
- Confusion matrix: a matrix that summarizes the classification results as counts of TP, TN, FP and FN
- Accuracy score: the fraction of test samples that the classifier labels correctly

In [14]:
# Predict class labels for test data
# Call predict() and
# pass X_test as argument
y_test_pred= knn.#?(#?)
y_test_pred

array(['cheap', 'expensive', 'cheap', 'expensive'], dtype='<U9')

In [15]:
# Show matching between predicted class labels and true class labels
y_test_pred==y_test

array([ True,  True,  True,  True])

In [16]:
# Import confusion_matrix function from sklearn.metrics
from #? import #?

In [17]:
# compute confusion matrix
# Call confusion_matrix
# Pass as arguments :
# - y_test (true class labels of test data)
# - y_test_pred (predicted class labels of test data)

#?(#?, #?)

array([[2, 0],
       [0, 2]], dtype=int64)

In [18]:
# Import  accuracy_score function from sklearn.metrics
from #? import #?

In [19]:
# compute accuracy score
# Call accuracy_score function
# Pass as arguments :
# - y_test (true class labels of test data)
# - y_test_pred (predicted class labels of test data)

#?(#?, #?)

1.0
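A perfect score hides how the two metrics behave when the classifier makes mistakes. The sketch below uses hypothetical labels with one deliberate error, so these values are illustrative and are not results from the notebook. Passing `labels` fixes the row and column order of the matrix.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels, with one deliberate mistake:
# the third house is 'expensive' but is predicted 'cheap'
y_true = ['cheap', 'expensive', 'expensive', 'cheap']
y_pred = ['cheap', 'expensive', 'cheap', 'cheap']

# Rows are true classes, columns are predicted classes, in this order
labels = ['cheap', 'expensive']
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)

# Accuracy is the share of matching labels: 3 correct out of 4
acc = accuracy_score(y_true, y_pred)
print(acc)
```

The off-diagonal entry in the second row counts the expensive house that was labeled cheap. The diagonal holds the correct predictions.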

**4. Prediction process**

Once the model has been successfully evaluated, we can use it to predict the class label of a new house.

In [22]:
# Suppose we have data of a new house (that is not used in learning)
# For example: house_new = [550, 4]
# Create the new house as a matrix of 1 row : 
house_new = #?
house_new

[[550, 4]]

In [23]:
# Predict the class label of house_new
# Call predict function and
# Pass the matrix house_new as argument

y_house_new=knn.#?(#?)
y_house_new

array(['expensive'], dtype='<U9')
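To see why a house receives its label, it can help to inspect the neighbors that voted. The sketch below assumes, for simplicity, that the model is fitted on all 12 houses rather than on the training split, so its neighbors may differ from those in the notebook's run.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([
    [100, 2], [100, 2], [200, 2], [200, 3],
    [300, 2], [300, 2], [400, 3], [450, 4],
    [500, 4], [500, 5], [600, 4], [650, 5],
])
y = ['cheap'] * 6 + ['expensive'] * 6

# Assumption: fit on the full dataset to keep the example deterministic
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# A single house must still be passed as a matrix with one row
house_new = [[550, 4]]
y_house_new = knn.predict(house_new)
print(y_house_new)

# Share of the 3 votes per class, in the order of knn.classes_
proba = knn.predict_proba(house_new)
print(proba)

# Distances to, and row indices of, the 3 neighbors that voted
distances, indices = knn.kneighbors(house_new)
print(X[indices[0]])
```

Here all three neighbors are expensive houses, so the vote is unanimous.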