# INFO371 Lab: k-Nearest Neighbors

This lab asks you to:

* use $k$-nearest neighbors to categorize iris data
* use cross-validation to find the best metric and $k$

For hints about how to use cross-validation in Python, see the following [tutorial](https://towardsdatascience.com/cross-validation-explained-evaluating-estimator-performance-e51e5430ff85).

---
## Categorize Iris Flowers

Your main task is to classify iris flowers into the correct species (aka labels), setosa, versicolor, or virginica, based on four measurements (aka features). The data contains 50 flowers of each species (150 in total) and four measurements for each flower: petal length and width, and sepal length and width. All of these are numeric measures.

Here are the different iris flowers with the features (sepal and petals marked):
![dataset description](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Machine+Learning+R/iris-machinelearning.png)


1. Download the dataset from this [link](https://archive.ics.uci.edu/ml/datasets/Iris). Run a consistency check to ensure it is loaded correctly.
2. Split your data into a feature dataframe that contains only the attributes (i.e., the sepal and petal width and length) and a labels array that contains the species names.
3. Graphical exploration.  Make a few scatterplots of the data using different attributes on axes, while depicting the species with different colors.

      Note: sklearn's implementation of $k$-NN can easily handle string labels (like species names), but plotting cannot. If you are using matplotlib for plotting, it is useful to convert the string labels into numbers (for example, convert "setosa" to 1). You can also use the Seaborn library for plotting, which has built-in features to handle string data.
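
Steps 1 to 3 can be sketched as follows. This is only a minimal sketch, assuming the data comes from sklearn's `load_iris` rather than the downloaded file, that plain NumPy arrays stand in for the feature dataframe, and that matplotlib is used for plotting. The names `X`, `y`, `codes`, and the pair of features plotted are illustrative choices.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

iris = datasets.load_iris()

# Step 2: feature matrix and an array of species names (string labels)
X = iris.data
y = iris.target_names[iris.target]

# Step 1: consistency checks: 150 flowers, 4 numeric features, no missing values
assert X.shape == (150, 4)
assert not np.isnan(X).any()

# Count flowers per species; return_inverse also gives one integer code per flower
species, codes, counts = np.unique(y, return_inverse=True, return_counts=True)
print({str(s): int(c) for s, c in zip(species, counts)})
# {'setosa': 50, 'versicolor': 50, 'virginica': 50}

# Step 3: matplotlib needs numeric colors, so pass the integer codes instead of the strings
fig, ax = plt.subplots()
ax.scatter(X[:, 2], X[:, 3], c=codes)  # columns 2 and 3 are petal length and petal width
ax.set_xlabel(iris.feature_names[2])
ax.set_ylabel(iris.feature_names[3])
```

Swapping the column indices gives the other feature pairs.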


In [5]:
#this code is another option for getting access to the dataset:

from sklearn import datasets
iris = datasets.load_iris()

#in this version, iris['data'] returns a NumPy array of the features
# iris['target'] returns integer labels where 0 is 'setosa', 1 is 'versicolor', and 2 is 'virginica'

#otherwise, you'll need to download the dataset from this link: https://archive.ics.uci.edu/ml/datasets/Iris
iris.data

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [5.4, 3.7, 1.5, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [4.8, 3. , 1.4, 0.1],
       [4.3, 3. , 1.1, 0.1],
       [5.8, 4. , 1.2, 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [5.4, 3.9, 1.3, 0.4],
       [5.1, 3.5, 1.4, 0.3],
       [5.7, 3.8, 1.7, 0.3],
       [5.1, 3.8, 1.5, 0.3],
       [5.4, 3.4, 1.7, 0.2],
       [5.1, 3.7, 1.5, 0.4],
       [4.6, 3.6, 1. , 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5. , 3.4, 1.6, 0.4],
       [5.2, 3.5, 1.5, 0.2],
       [5.2, 3.4, 1.4, 0.2],
       [4.7, 3.2, 1.6, 0.2],
       [4.8, 3.1, 1.6, 0.2],
       [5.4, 3.4, 1.5, 0.4],
       [5.2, 4.1, 1.5, 0.1],
       [5.5, 4.2, 1.4, 0.2],
       ...

Now it is time to get into $k$-NN. Your task is to fit and cross-validate a series of models over a range of $k$ values, using four different metrics: Euclidean, Manhattan, Chebyshev, and Mahalanobis. You may find it helpful to check out sklearn's [documentation on distance metrics](https://scikit-learn.org/0.24/modules/generated/sklearn.neighbors.DistanceMetric.html) to get an idea of what each one does.
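
To get a feel for how the four metrics differ, it may help to compute them by hand for one pair of points. The sketch below uses NumPy only. The helper functions, the two points, and the toy covariance matrix `Sigma` are made up for illustration and are not part of the lab data.

```python
import numpy as np

def euclidean(a, b):
    # Straight-line distance: square root of the summed squared differences
    return float(np.sqrt(np.sum((a - b) ** 2)))

def manhattan(a, b):
    # Sum of absolute differences along each axis
    return float(np.sum(np.abs(a - b)))

def chebyshev(a, b):
    # Largest absolute difference along any single axis
    return float(np.max(np.abs(a - b)))

def mahalanobis(a, b, Sigma):
    # Euclidean-like distance after rescaling by the inverse covariance matrix
    d = a - b
    return float(np.sqrt(d @ np.linalg.inv(Sigma) @ d))

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])
Sigma = np.array([[9.0, 0.0], [0.0, 16.0]])  # toy covariance: variances 9 and 16, no correlation

print(euclidean(a, b))           # 5.0
print(manhattan(a, b))           # 7.0
print(chebyshev(a, b))           # 4.0
print(mahalanobis(a, b, Sigma))  # about 1.414, i.e. sqrt(3**2 / 9 + 4**2 / 16)
```

With an identity covariance matrix, the Mahalanobis distance reduces to the Euclidean one.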


4. Using the sklearn library, create a $k$-NN model using a single neighbor and Euclidean metric. 

5. Cross-validate (10-fold) the model using your feature set and labels.  Use accuracy as your score.  As a reminder, you'll need code like

```
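from sklearn.model_selection import cross_val_score
import numpy as np
cv = cross_val_score(m, X, y, cv=10)  # m is your k-NN model; classifiers are scored by accuracy by default
np.mean(cv)  # mean accuracy over the 10 folds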
```
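
Put together, steps 4 and 5 might look like the sketch below. It assumes the features and labels come from sklearn's `load_iris`, and it passes `scoring="accuracy"` explicitly even though that is already the default for classifiers.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Step 4: k-NN with a single neighbor and the Euclidean metric
m = KNeighborsClassifier(n_neighbors=1, metric="euclidean")

# Step 5: 10-fold cross-validation, one accuracy value per fold
cv = cross_val_score(m, X, y, cv=10, scoring="accuracy")
print(cv)
print(np.mean(cv))
```

For a classifier and an integer `cv`, `cross_val_score` uses stratified folds, so each fold keeps roughly the same species proportions even though the rows are sorted by species.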

6. Repeat the steps you did with Manhattan, Chebyshev, and Mahalanobis metric.  

    Note: For Mahalanobis, you need to compute the covariance matrix of the data and then create a $k$-NN model that requests the Mahalanobis distance, passing that matrix as a metric parameter. This can be done like so:

```
Sigma = np.cov(X, rowvar=False)
m = KNeighborsClassifier(n_neighbors=3,
                         metric="mahalanobis",
                         metric_params={"V":Sigma})
```
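
One possible way to organize step 6 is to keep the metrics and their extra parameters in a dictionary and loop over it. This is a sketch under a few assumptions: the data again comes from `load_iris`, $k$ is fixed at 3, and the inverse covariance is passed as `VI` instead of the covariance as `V` (sklearn's Mahalanobis distance documents both). The `metrics` dictionary and `scores` are illustrative names.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Only Mahalanobis needs an extra parameter: the inverse of the data covariance matrix
Sigma = np.cov(X, rowvar=False)
metrics = {
    "euclidean": None,
    "manhattan": None,
    "chebyshev": None,
    "mahalanobis": {"VI": np.linalg.inv(Sigma)},
}

scores = {}
for name, params in metrics.items():
    m = KNeighborsClassifier(n_neighbors=3, metric=name, metric_params=params)
    scores[name] = float(np.mean(cross_val_score(m, X, y, cv=10)))
    print(f"{name}: {scores[name]:.3f}")
```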

7. Now repeat the above for $k=1,2,\dots,15$. Each time, print out the cross-validated accuracy score. Which $k$ and which metric give you the best accuracy? Why do you think that is the case? Explain your reasoning. Is there a different distance metric you think would work better?
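
The sweep over $k$ and metric could be structured roughly as below. This is a hedged sketch, not a reference solution: `cv_accuracy` is a hypothetical helper, the data is assumed to come from `load_iris`, and ties for the best score are broken by whichever combination comes first. It only produces the numbers; interpreting them is still up to you.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X, y = iris.data, iris.target

def cv_accuracy(k, metric, metric_params=None):
    """Mean 10-fold cross-validated accuracy of a k-NN model (hypothetical helper)."""
    m = KNeighborsClassifier(n_neighbors=k, metric=metric, metric_params=metric_params)
    return float(np.mean(cross_val_score(m, X, y, cv=10)))

# Inverse covariance of the features, needed only for the Mahalanobis metric
VI = np.linalg.inv(np.cov(X, rowvar=False))
metric_params = {
    "euclidean": None,
    "manhattan": None,
    "chebyshev": None,
    "mahalanobis": {"VI": VI},
}

results = {}
for metric, params in metric_params.items():
    for k in range(1, 16):
        results[(metric, k)] = cv_accuracy(k, metric, params)
        print(f"metric={metric:<12} k={k:>2} accuracy={results[(metric, k)]:.3f}")

# Pick the (metric, k) pair with the highest mean accuracy
best_metric, best_k = max(results, key=results.get)
print("best:", best_metric, best_k, results[(best_metric, best_k)])
```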

In [None]:
#code goes here