# PC Lab 2: Data Preprocessing and Nearest Neighbors
---

## 1. Nearest neighbor algorithm for classification

### Introduction

<img src="https://www.postnetwork.co/wp-content/uploads/2022/11/irishflowerrs.png" width=600>

### Classification Problem

In the previous lab session, the iris flower dataset was explored. Now imagine that for a new iris flower we know its characteristics (sepal and petal length and width), but not its species. A natural task is to assign the flower to one of the three possible species based on these characteristics (features). This task is called a <strong>classification problem</strong>. In this practical session, we describe an algorithm that solves this problem by looking at the closest training examples in the (labeled) dataset.

### Dataset
In the iris flower dataset (iris120.csv), each instance (i.e. flower) is described by 5 attributes:\
sepal length, sepal width, petal length, petal width and species.
For simplicity, we will only use the species, sepal length and sepal width. \
\
These properties can be seen as _variables_, and for a given flower, each of these variables takes a specific
value. In a classification setting, the aim is to predict the value of one of the variables (here the species), based on the values of the other variables (here sepal width and length). The variable whose values are to be predicted is called the _output variable_, and the variables used to make this prediction are called the _input variables_ or _features_. \
\
A dataset $\mathcal{D}$ consists of a set of $n$ observations of input-output couples $(\boldsymbol{x_i}, y_i)$, where $i \in \{1, \dots, n \}$. Here, $\boldsymbol{x_i} = (x_{i1}, \dots, x_{ip})^T \in \mathbb{R}^p$
are the observed values for the features (with $p$ the number of features), and $y_i$ are the observed output values, such that a training dataset $\mathcal{D}_{train}$ with $n$ instances can be written as
$$\mathcal{D}_{train} = \{ (\boldsymbol{x_1}, y_1), \dots, (\boldsymbol{x_n}, y_n) \}. $$ 

### Model and Problem Setup
Using the dataset, the goal is to build a _model_ $f$ that is able to predict the value of the output variable, given the value of the input variables. \
In our iris classification problem, both input variables take real values ($x_i \in \mathbb{R}^2$ for $i \in \{1, \dots, n \}$), whereas the output variable is nominal, taking values from the finite set $\{setosa, versicolor, virginica\}$. Hence, the model $f$ we are looking for is a mapping $$f: \mathbb{R}^2 \rightarrow \{setosa, versicolor, virginica\}.$$

### Nearest neighbour classification

A very simple technique to derive a classifier model from a given training dataset is the _nearest neighbour algorithm_. It departs from the assumption that instances whose features are highly similar are more likely to have the same label than those with very different features.\
In particular, the _k-nearest neighbors algorithm_ for classification classifies an instance by a plurality vote of its $k$ closest neighbours. If $k=1$, the model applies this idea in its most extreme form: the label is predicted as the label of the closest instance in the training dataset.\
\
In order to select the 'closest' instances in the training set, a suitable measure of distance $d(x_i, x_j)$ between two instances $x_i$ and $x_j$ is used. In our case, we will simply use the Euclidean distance:
$$ d_E (x_i, x_j) = \sqrt{\sum_{k=1}^p (x_{i,k} - x_{j,k})^2}.$$

Using this distance measure, the ($1$-) nearest neighbor algorithm consists of the following steps: \
1. For an instance with unknown label and known feature vector $\boldsymbol{x}$, calculate the distance to each instance in the dataset: $d_E(\boldsymbol{x}, \boldsymbol{x_i})$ where $i = 1, \dots, n$.
2. Select the closest instance and take its label as the prediction for the unknown label.
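
The two steps above can be sketched on a handful of made-up points. The arrays `train_X`, `train_y` and `query` are illustrative values, not taken from iris120.csv, and the sketch uses plain NumPy arrays rather than the data frames used in the exercises, so it is only one possible way to organise the computation.

```python
import numpy as np

# made-up training features (two features per instance) and labels
train_X = np.array([[5.0, 3.5],
                    [6.0, 2.8],
                    [6.9, 3.1]])
train_y = np.array(["setosa", "versicolor", "virginica"])

# made-up instance with unknown label
query = np.array([5.8, 2.7])

# step 1: Euclidean distance from the query to every training instance
distances = np.sqrt(((train_X - query) ** 2).sum(axis=1))

# step 2: the label of the closest training instance is the prediction
nearest_index = int(np.argmin(distances))
prediction = train_y[nearest_index]
print(distances, prediction)
```

Subtracting `query` from `train_X` relies on NumPy broadcasting, so all distances are computed in one expression without an explicit loop.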



The following code snippet downloads all necessary files for this PC lab:

In [None]:
import urllib.request

urllib.request.urlretrieve("https://raw.githubusercontent.com/BioML-UGent/MLLS/main/02_knn/abalone.csv",
                           "abalone.csv")
urllib.request.urlretrieve("https://raw.githubusercontent.com/BioML-UGent/MLLS/main/02_knn/iris120.csv", "iris120.csv")
urllib.request.urlretrieve("https://raw.githubusercontent.com/BioML-UGent/MLLS/main/02_knn/irisNA.csv", "irisNA.csv")

<div class="alert alert-success">

<b>EXERCISE 1.1</b>: **Load the dataset iris120.csv into memory and select the columns 'Sepal.Length', 'Sepal.Width', and 'Species'. Additionally, load the set of unclassified
instances (irisNA.csv) and select the same columns. Both datasets should be loaded as pandas data frames.**
</div>

Hint: In the following, always replace the placeholders (`".."`, `" "`, `""` or a bare `# TODO`) with the respective code.

In [None]:
import numpy as np
import pandas as pd

cols = ['Sepal.Length', 'Sepal.Width', 'Species']
# load the two datasets and select the respective columns

# iris120 as the training set
iris120 = " " # TODO: load the iris120.csv file into a pandas DataFrame
iris120 = "" # TODO: select only the columns in the 'cols' list
# irisNA as the test set
irisNA = "" # TODO: load the irisNA.csv file into a pandas DataFrame
irisNA = "" # TODO: select only the columns in the 'cols' list


In [None]:
iris120.head()

In [None]:
irisNA.head()

<div class="alert alert-success">

<b>EXERCISE 1.2</b>: **Implement the nearest neighbour algorithm (as explained above) for the iris problem in a function called `nn_iris_predict`.
Use this function to predict the species of the unknown flowers in the dataset irisNA.csv. Make sure
your function has the structure given below. \
(Here, _new_obs_features_ is an array or dataframe containing the two features for one specific new instance, and _train_dataset_ is the dataframe containing the features and labels (i.e. _iris120_ as initialized above)).**
</div>

In [None]:
def nn_iris_predict(new_obs_features, train_dataset):

    # extract features (i.e. first two columns) from training dataset to calculate the distances
    train_features = " " # TODO

    # extract species (labels) of training dataset in a separate variable
    train_labels = " " # TODO

    # create a variable 'dist_euc' which is an array containing the euclidean distance of
    # the new instance to all instances (rows) in the training data set
    dist_euc = " " # TODO


    # extract index of nearest neighbor (i.e. the index of the smallest value in the array 'dist_euc')
    nn_ind = " " # TODO

    # extract species label on the respective index
    nn_label = " " # TODO

    return nn_label


Now test the algorithm by predicting the species of the first instance of the unknown flowers dataset irisNA.csv.

In [None]:
# extract features of the first instance in the test set
new_obs_features = " " # TODO

# predict the species of the first instance in the test set
# TODO

## 2. Nearest neighbor algorithm for regression
In the previous section, the output $y$ was a nominal variable (i.e. one class out of a discrete set of possible classes). When the output is real-valued ($y \in \mathbb{R}$), the prediction problem is called a _regression problem_. \
\
As with nominal outputs, the nearest neighbor algorithm can also be applied in this case. It is identical to the one used in the classification task, except that the prediction is now the real-valued output of the closest instance in the training dataset.

### Data preprocessing
For solving the next task, some more elementary data preprocessing steps need to be introduced.

#### Dummy encoding of nominal variables 
Often, the features in a dataset are not numerical, but nominal (named) or ordinal (named and ordered). In this case, to still be able to use algorithms that rely on numerical values, such as the nearest neighbor algorithm, we can use _dummy encodings_ for each nominal variable.\
\
In dummy encodings, a variable (feature) $x^i$ that can take $k$ possible (nominal) values is replaced by $k$ new binary variables (features). As an example, consider a dataset where one feature (i.e. $x^1$) displays the weather status, taking the $3$ possible values $\{Sunny, Overcast, Rainy\}$. Each of these values can be represented by a dummy variable: $x^{1a}$, $x^{1b}$ and $x^{1c}$, with values 

* $x^{1a} = \begin{cases} 1, & \text{if } x^1 = \text{"Sunny"} \\
                     0, & \text{otherwise} \end{cases}$
* $x^{1b} = \begin{cases} 1, & \text{if } x^1 = \text{"Overcast"} \\
                     0, & \text{otherwise} \end{cases}$
* $x^{1c} = \begin{cases} 1, & \text{if } x^1 = \text{"Rainy"} \\
                     0, & \text{otherwise} \end{cases}$
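
As a small illustration of the weather example, the sketch below builds the three dummy variables with pandas. The observations in `weather` are made up for this purpose, and `dtype=int` is passed so that the result shows 0/1 values rather than booleans.

```python
import pandas as pd

# made-up weather observations (illustrative values only)
weather = pd.Series(["Sunny", "Overcast", "Rainy", "Sunny"], name="weather")

# one 0/1 column per possible value of the nominal variable
weather_dummies = pd.get_dummies(weather, dtype=int)
print(weather_dummies)
```

Each row contains exactly one 1, in the column that matches the original value.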


In this section, we will look at the task of predicting the age of abalone, a type of marine snail.

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/3/33/LivingAbalone.JPG/2560px-LivingAbalone.JPG" width=500>

The dataset _abalone.csv_ contains measurements of physical properties of several abalone specimens. Using these physical properties, the aim is to build a predictive model for the age of these animals (more information concerning this dataset can be found in [abalone.info](https://archive.ics.uci.edu/ml/datasets/Abalone)). In the following example, we replace the nominal variable 'sex' with 3 dummy variables (as many as the values it takes). In Python, functions such as pandas' [get_dummies()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html) can be used for this purpose (another option is scikit-learn's [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html)). First, we create the dummy variables, then we concatenate them with the original dataset, and finally we remove the original variable from the dataset.

In [None]:
# read abalone dataset as pandas dataframe
abalone = pd.read_csv("abalone.csv")
print(abalone.head())

In [None]:
# create dummy variables for the "sex" feature
dummies = pd.get_dummies(abalone.sex)
print(dummies.head())

In [None]:
# concatenate the dummy variables to the original dataframe
abalone_dummy_encoded = pd.concat([abalone, dummies], axis=1)
# remove the original 'sex' column
abalone_dummy_encoded = abalone_dummy_encoded.drop(['sex'], axis=1)
print(abalone_dummy_encoded.head())

### Dealing with missing values
Missing values are commonly encountered in data mining studies. Often, missing values are imputed
(replaced by a value). Several techniques exist to choose this value. A simple but often used method
is mean imputation, in which each missing value is replaced by the mean of the observed values for
that variable. More advanced methods consist of building separate models to predict the missing
values.
When implementing mean imputation, the [SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) of the scikit-learn library might be handy. The abalone dataset used in this PC lab has no missing values, however:

In [None]:
np.isnan(abalone_dummy_encoded).any()
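
Since the abalone data has nothing to impute, mean imputation can instead be sketched on a small made-up matrix. The array `X_missing` is purely illustrative; the sketch assumes a scikit-learn version that provides `sklearn.impute.SimpleImputer`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# made-up feature matrix with two missing entries (np.nan)
X_missing = np.array([[1.0, 10.0],
                      [np.nan, 20.0],
                      [3.0, np.nan]])

# replace each missing value by the mean of the observed values in its column
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X_missing)
print(X_imputed)
```

The missing entry in the first column becomes the mean of 1 and 3, and the one in the second column becomes the mean of 10 and 20.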

### Standardizing the data
In realistic datasets, most features have different means and standard deviations. For the nearest
neighbour algorithm, it can easily be seen that features with a high standard deviation will be more
influential than features with a lower standard deviation. In most cases, this is unwanted since it
is not known in advance which features are most important. To overcome this problem, features
are often _standardized_, a scaling method where the values are centered around the (sample) mean with a unit standard deviation.\
The standardized version of a feature $x^i$ can be obtained as   
$$ x^{i'} = \frac{x^i - \mu_i}{\sigma_i},$$    
where $\mu_i$ and $\sigma_i$ represent the sample mean and standard deviation of $\boldsymbol{x^i}$.
The [scale](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html) function of scikit-learn can be used to perform this standardization.

Note that scaling the training and test set together in one operation, as done in the next cell, is usually considered bad practice: it leaks information from the test set into the training set and hence biases model evaluation.

In [None]:
# import the function explicitly: 'import sklearn' alone does not guarantee
# that the 'preprocessing' submodule is loaded
from sklearn.preprocessing import scale

# split abalone_dummy_encoded into features and labels
y = abalone_dummy_encoded['age'].values # keep target variable
X = abalone_dummy_encoded.drop(['age'], axis=1) # remove it from the features

# scale the features: will return a numpy array
X = scale(X)
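
A leakage-free alternative to the joint scaling above is to estimate the mean and standard deviation on the training data only and reuse them for the test data. The sketch below shows this with scikit-learn's `StandardScaler` on made-up matrices; `X_train_toy` and `X_test_toy` are hypothetical stand-ins for a real train/test split.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# made-up feature matrices standing in for a training and a test split
X_train_toy = np.array([[1.0, 100.0],
                        [2.0, 200.0],
                        [3.0, 300.0]])
X_test_toy = np.array([[2.0, 400.0]])

# estimate mean and standard deviation on the training data only
scaler = StandardScaler().fit(X_train_toy)

# apply the same training statistics to both splits
X_train_scaled = scaler.transform(X_train_toy)
X_test_scaled = scaler.transform(X_test_toy)
print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
print(X_test_scaled)
```

The scaled training features have zero mean and unit standard deviation, whereas the scaled test features generally do not, because they are expressed relative to the training statistics.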

### Data splitting and Prediction quality

How do we test the performance of a (nearest neighbor) algorithm? \
One obvious idea is to use a test dataset with known labels, and compare the predictions made by the algorithm with the true labels. Hence, given our training dataset $\mathcal{D}_{train}$ and a test dataset $\mathcal{D}_{test}$ for the regression problem, we can use the _mean of squared residuals_ (MSR), also known as the mean squared error (MSE), to evaluate the quality of our model. For any dataset $\mathcal{D}$, it is calculated as follows:
$$ \text{MSR} = \frac{1}{|\mathcal{D} |} \sum_{(\boldsymbol{x_i}, y_i) \in \mathcal{D}}(f(\boldsymbol{x_i}) - y_i)^2$$
We can compute it for the test set as well as the training set for comparison.
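
As a quick check of the formula, the sketch below computes the mean of squared residuals by hand on made-up numbers and compares it with scikit-learn's `mean_squared_error`. The arrays `y_true` and `y_hat` are illustrative values only.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# made-up true outputs and model predictions
y_true = np.array([3.0, 5.0, 8.0])
y_hat = np.array([2.0, 5.0, 10.0])

# mean of squared residuals, following the formula above
msr = np.mean((y_hat - y_true) ** 2)

# scikit-learn's implementation should give the same number
msr_sklearn = mean_squared_error(y_true, y_hat)
print(msr, msr_sklearn)
```

The residuals are -1, 0 and 2, so the squared residuals sum to 5 and the mean is 5/3.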


<div class="alert alert-success">

<b>EXERCISE 2.1</b>: **To prepare the dataset for the following exercise, split the dataset into a portion (80%) we will use to train on, and a portion (20%) we will use to predict on (test set). See the documentation of scikit-learn's [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) for more info on how to do this.**
</div>

In [None]:
from sklearn.model_selection import train_test_split

# name the variables for train/test features and labels X_train, y_train, X_test, y_test
X_train, X_test, y_train, y_test = # TODO
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

## 3. K-nearest Neighbors for Regression

Classifying an instance only by the (one) nearest neighbor to it might not always be very accurate. Hence, a simple extension of the nearest neighbor algorithm consists of taking not only one, but multiple neighbors into account.

To this end, let $N_k(\boldsymbol{x}) \subset \mathcal{D}$ be the $k$ nearest neighbors of an instance with feature vector $\boldsymbol{x}$.

For __classification__:
- set $f(\boldsymbol{x})$ to be the _plurality vote_ of its neighbors, i.e. the label that occurs most often in $N_k(\boldsymbol{x})$

For __regression__:
- set $f(\boldsymbol{x})$ to be the _average_ of the outputs of the instances in $N_k(\boldsymbol{x})$, i.e. $$ f(\boldsymbol{x}) = \frac{1}{k} \sum_{(\boldsymbol{x_i}, y_i) \in N_k(\boldsymbol{x})} y_i.$$
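
Both rules can be illustrated on a tiny made-up dataset with $k=3$. All arrays below are hypothetical values chosen for the example; the sketch selects the neighbors with `np.argsort`, which is one straightforward option rather than the way scikit-learn does it internally.

```python
from collections import Counter
import numpy as np

# made-up 1-D training features with a nominal and a real-valued output
train_X = np.array([[1.0], [2.0], [3.0], [8.0], [9.0]])
train_class = np.array(["a", "a", "b", "b", "b"])
train_value = np.array([10.0, 12.0, 14.0, 30.0, 32.0])

query = np.array([2.4])
k = 3

# indices of the k training instances closest to the query
distances = np.sqrt(((train_X - query) ** 2).sum(axis=1))
neighbor_idx = np.argsort(distances)[:k]

# classification: label that occurs most often among the neighbors
vote = Counter(train_class[neighbor_idx].tolist()).most_common(1)[0][0]

# regression: average output of the neighbors
average = train_value[neighbor_idx].mean()
print(neighbor_idx, vote, average)
```

Here the three nearest neighbors are the instances with features 1, 2 and 3, so the vote is decided by two "a" labels against one "b", and the regression prediction is the mean of 10, 12 and 14.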

### K-nearest neighbors in Python

As with the $1$-nearest neighbour algorithm, it is possible to implement a version of the $k$-nearest neighbours
algorithm in Python from scratch. However, to simplify things, an alternative is to use a pre-implemented version of the algorithm.\
Here, we use the [scikit-learn](http://scikit-learn.org/stable/) library again, which has an implementation of
the $k$-nearest neighbor algorithm available. You can load this implementation and see more info about its usage by going to the [documentation page](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html)
or else by typing the following command:

In [None]:
from sklearn import neighbors
help(neighbors.KNeighborsRegressor)

As you can see in the help window, in scikit-learn, an estimator for classification/regression is a _Python object_ that implements the methods _fit(X, y)_ and _predict(T)_.\
\
The constructor of an estimator takes as arguments the parameters of the model (in our case the basic parameters are the number of neighbours and the distance metric). So the first step is to call this constructor. Here, we set the number of neighbors to be considered to $k=5$.

In [None]:
n_neighbors = 5
knn = neighbors.KNeighborsRegressor(n_neighbors=n_neighbors, metric='euclidean')


We call our estimator instance `knn`. It now must be fitted to the data, that is, it must learn from the data. This is done by passing our training set to the fit method.

In [None]:
knn.fit(X_train, y_train)

Now that the estimator is fitted to the training data, you can ask it to predict the age of the first example in our test dataset (remember: this compares the features of this example to the features of all training samples, determines the closest neighbors, and averages the ages of those neighbors to obtain the prediction).

In [None]:
y_pred_1 = knn.predict(X_test[0].reshape(1,-1))
y_pred_1

We can calculate the mean squared error of the prediction we got before (for the first example of the test dataset) by calculating it ourselves or by using the [mean_squared_error()](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html) function of scikit-learn.

In [None]:
from sklearn.metrics import mean_squared_error
print(y_test[0], y_pred_1)
# arguments are (y_true, y_pred); y_test[:1] keeps the first label as a 1-element array
print(mean_squared_error(y_test[:1], y_pred_1))

<div class="alert alert-success">

<b>EXERCISE 3.1</b>: **Build a 3-nearest-neighbor regressor in a similar manner as before. Predict the age for all training samples and test samples and compute the mean squared error of both prediction sets. Which predictions are better?**
</div>

In [None]:
# now use k=3 neighbors for training
n_neighbors = 3
# Instantiate the knn object
knn = # TODO
# Fit
# TODO
# Predict train
y_pred_train = # TODO
# Predict test
y_pred_test = # TODO
# print MSE train (arguments are y_true, y_pred)
print(mean_squared_error(y_train, y_pred_train))
# print MSE test
print(mean_squared_error(y_test, y_pred_test))

<div class="alert alert-success">

<b>EXERCISE 3.2</b>: **Now do the same but iterate over possible values for 'k' as the number of nearest neighbors taken into account, where k ranges between 1 and 50 (step size equal to 3). Plot the resulting errors of the models as a function of 'k'. 'k' is what we call a hyperparameter: a type of parameter that is (most often) tuned by hand according to the prediction task. For some datasets/tasks we will, for example, need more or fewer neighbors (as prediction tasks have varying difficulty).**
</div>

In [None]:
# use lists to store the results obtained from every iteration
results_train = []
results_test = [] 
for i in range(1, 50, 3):
    # instantiate the model
    ".." # TODO
    # Fit to train data
    ".." # TODO
    # Predict train
    y_pred_train = ".." # TODO 
    # Predict test
    y_pred_test = ".." # TODO
    # calculate and store the MSE train
    mse_train = ".." # TODO
    ".." # TODO append error in results list
    # calculate and store MSE test
    mse_test = ".." # TODO
    ".." # TODO append error in results list

import matplotlib.pyplot as plt

plt.plot(range(1, 50, 3), results_train) # training MSEs
plt.plot(range(1, 50, 3), results_test) # test MSEs
plt.ylabel('MSE')
plt.xlabel('$k$ nearest neighbors')
plt.legend(['Train MSE', 'Test MSE'])