## k-Nearest-Neighbor Classifier

The goal of this exercise is to practice using `pandas`, `numpy`, `matplotlib`, and `sklearn` to select the best hyperparameter `k` for a k-nearest-neighbor classifier on a modified version of the Iris dataset.

---
### Imports

In [2]:
# code

---
### Read the Data

Use `pandas` to read the `iris.csv` dataset located in the `./data/` directory. Note that this is not the original Iris dataset; it has been modified for the purpose of this exercise. After reading the data, the resulting DataFrame should have the following columns in this order:

1. `sepal_length`
2. `sepal_width`
3. `petal_length`
4. `petal_width`
5. `class`

In [3]:
# code
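
A minimal sketch of the reading step. The text does not say whether `iris.csv` has a header row, so this assumes it does not and passes the column names explicitly; drop `header=None` and `names=` if the file already has a matching header. The helper name `read_iris` and the two in-memory rows are illustrative only, so that the snippet runs without the file.

```python
import io

import pandas as pd

COLUMNS = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]


def read_iris(source):
    """Read a CSV (path or file-like object) into a DataFrame with the expected columns.

    Assumption: the file has no header row.
    """
    return pd.read_csv(source, header=None, names=COLUMNS)


# Illustrative stand-in for ./data/iris.csv (two sample rows, not the exercise data).
sample = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
)
df_sample = read_iris(sample)
print(df_sample.columns.tolist())

# With the real file one would call: df = read_iris("./data/iris.csv")
```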

---
### Clean the Data

Check for missing data and duplicate rows, and decide how to handle them.

#### Missing Data

In [4]:
# code
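
One way to check for missing values, sketched on a small made-up frame (`toy` is hypothetical and stands in for `df`). Dropping incomplete rows is only one possible policy; imputing values is another.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the exercise DataFrame `df`.
toy = pd.DataFrame({
    "sepal_length": [5.1, np.nan, 6.3],
    "petal_length": [1.4, 4.7, np.nan],
    "class": ["Iris-setosa", "Iris-versicolor", None],
})

# Count the missing entries in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# One possible policy (an assumption, not the only option): drop incomplete rows.
toy_clean = toy.dropna().reset_index(drop=True)
print(len(toy_clean))
```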

---
#### Duplicates

In [5]:
# code

Number of duplicates: 3
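
A sketch of how duplicate rows could be counted and removed, again on a made-up frame (`toy` is hypothetical). Whether to drop duplicates is a judgment call: two different flowers can have identical measurements.

```python
import pandas as pd

# Hypothetical toy frame: rows 0/1 and rows 2/3 are exact duplicates.
toy = pd.DataFrame({
    "sepal_length": [5.1, 5.1, 6.3, 6.3, 6.3],
    "sepal_width": [3.5, 3.5, 2.9, 2.9, 3.3],
    "class": ["Iris-setosa", "Iris-setosa",
              "Iris-virginica", "Iris-virginica", "Iris-virginica"],
})

# duplicated() marks every repeat of an earlier row; the first occurrence is not counted.
n_duplicates = int(toy.duplicated().sum())
print(f"Number of duplicates: {n_duplicates}")

# Keep the first occurrence of each row and renumber the index.
toy_unique = toy.drop_duplicates().reset_index(drop=True)
```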


---
### Explore the Data

Use the `seaborn` library to create a pairwise scatter plot of the data. Try to figure out how to color the points and distributions by class.

Examine the plot. If you were asked to discard one of the features, which one would you choose to drop?

In [5]:
# code

---
### Encode Class Labels

Convert the string class labels to numeric class labels 0, 1, and 2.

In [8]:
# Map each string label to an integer code; labels missing from class_map become NaN.
class_map = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
df['class'] = df['class'].map(class_map)

---
### k-NN Classifier

Implement the following function using sklearn's `KNeighborsClassifier`.

In [8]:
from sklearn.neighbors import KNeighborsClassifier

def knn(X_train, y_train, X_test, y_test, k=1):
    """
    Train a k-nearest-neighbor classifier on the training data and evaluate it on
    both the training and test sets. Return the training and test error rates
    as percentages.

    Parameters:
    -----------
        X_train : numpy array representing the train feature matrix
        y_train : numpy array representing the train classes
        X_test  : numpy array representing the test feature matrix
        y_test  : numpy array representing the test classes
        k       : number of neighbors to use (default: 1)

    Returns:
    --------
        train_error : training error rate in percent
        test_error  : test error rate in percent
    """

---
### Experiment

Write a Python script that performs k-Nearest Neighbors (kNN) classification on the cleaned Iris DataFrame `df` using different values of k and different train-test splits. 

The script should:

1. Split the data into training and test sets using a 2:1 ratio.
2. Standardize the training and test data, fitting the scaler on the training data only.
3. Perform kNN classification for k ranging from 1 to 20.
4. Repeat the above steps for 100 trials.

Store the train and test errors for each trial and value of `k` in two 2D arrays: `train_error` and `test_error`.

**Hints:**

+ Use `sklearn.model_selection.train_test_split` for splitting the data into a training and test set. 

+ Use `sklearn.preprocessing.StandardScaler` to standardize your data.

+ Consult the `sklearn` documentation for more details.


In [9]:
# code
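
The steps above could be sketched as follows. This is one possible implementation, not a reference solution: it includes a compact `knn` helper so that it runs on its own, stores trials as rows and values of `k` as columns (an assumed layout), seeds each split with the trial index for reproducibility, and uses the original Iris data from `sklearn` as a stand-in for the cleaned `df`. The name `run_experiment` is hypothetical.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler


def knn(X_train, y_train, X_test, y_test, k=1):
    """Fit a k-NN classifier and return (train_error, test_error) in percent."""
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # score() returns the mean accuracy in [0, 1]; the error rate is 1 - accuracy.
    train_error = 100.0 * (1.0 - clf.score(X_train, y_train))
    test_error = 100.0 * (1.0 - clf.score(X_test, y_test))
    return train_error, test_error


def run_experiment(X, y, n_trials=100, k_max=20):
    """Return two arrays of shape (n_trials, k_max) holding error rates in percent."""
    train_error = np.zeros((n_trials, k_max))
    test_error = np.zeros((n_trials, k_max))
    for trial in range(n_trials):
        # A fresh random 2:1 split per trial (one third of the rows go to the test set).
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=1 / 3, random_state=trial
        )
        # Fit the scaler on the training data only, then apply it to both sets.
        scaler = StandardScaler().fit(X_tr)
        X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)
        for k in range(1, k_max + 1):
            train_error[trial, k - 1], test_error[trial, k - 1] = knn(
                X_tr_s, y_tr, X_te_s, y_te, k=k
            )
    return train_error, test_error


# Stand-in data: the ORIGINAL Iris dataset from sklearn, not the modified iris.csv.
iris = load_iris()
train_error, test_error = run_experiment(iris.data, iris.target)
print(train_error.shape, test_error.shape)
```

With the exercise data, `X` would be the four feature columns of `df` and `y` the encoded `class` column.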

---
### Calculate Average and Standard Deviation of Error Rates

Calculate and store the average and standard deviation of the training and test error rates for each number `k` of nearest neighbors.

In [13]:
# code
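
A short sketch of the aggregation, assuming the error arrays hold one row per trial and one column per value of `k`. The small arrays below are made up for illustration and are not results. `np.std` computes the population standard deviation by default; pass `ddof=1` for the sample standard deviation.

```python
import numpy as np

# Hypothetical error arrays (3 trials x 2 values of k), made up for illustration.
train_error = np.array([[0.0, 2.0], [0.0, 4.0], [0.0, 3.0]])
test_error = np.array([[6.0, 4.0], [8.0, 2.0], [4.0, 6.0]])

# axis=0 aggregates over trials, leaving one value per k.
train_mean, train_std = train_error.mean(axis=0), train_error.std(axis=0)
test_mean, test_std = test_error.mean(axis=0), test_error.std(axis=0)

print(train_mean, test_mean)
```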

---
### Plot Training and Test Error Rates

Plot the average training and test error rates as a function of the number `k` of nearest neighbors. The plot should include a legend that indicates which line corresponds to the average training error and which line corresponds to the average test error.

What do you observe, and which number `k` of nearest neighbors would you choose for predicting future data?

In [10]:
# code
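
A minimal plotting sketch with `matplotlib`. The helper name `plot_error_rates` is hypothetical, and the values passed to it are placeholders made up only to exercise the function; they are not results of the experiment.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; remove this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np


def plot_error_rates(ks, train_mean, test_mean):
    """Plot average training and test error rates against k, with a legend."""
    fig, ax = plt.subplots()
    ax.plot(ks, train_mean, marker="o", label="average training error")
    ax.plot(ks, test_mean, marker="s", label="average test error")
    ax.set_xlabel("number of neighbors k")
    ax.set_ylabel("error rate (%)")
    ax.set_xticks(ks)
    ax.legend()
    return fig, ax


# Placeholder values (made up, NOT experimental results).
ks = np.arange(1, 6)
fig, ax = plot_error_rates(
    ks,
    np.array([0.0, 2.0, 2.5, 3.0, 3.2]),
    np.array([6.0, 5.0, 4.5, 4.8, 5.1]),
)
```

Adding `ax.fill_between` or `ax.errorbar` with the standard deviations would also show how much the error rates vary across trials.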