# **Homework 10: K-Nearest Neighbors II**
---

### **Description**
In this notebook, you will continue practicing KNN, now combined with feature scaling and K-Folds Cross Validation.

<br>

### **Structure**
**Part 1**: [Real or Fake Money?](#p1)

**Part 2**: [Classifying Stars Revisited](#p2)






<br>

### **Learning Objectives**
By the end of this notebook, we will:
* Understand how to implement KNN models with sklearn and different K values.
* Recognize how to evaluate KNN models in sklearn.

<br>

### **Resources**
* [K-Nearest Neighbors with sklearn](https://docs.google.com/document/d/1QltUCIlM0FOkalov1aPXOkOVQme3Ot1AUThiSUbh-kI/edit?usp=drive_link)


* [Feature Scaling and K-Folds Cross Validation with sklearn](https://docs.google.com/document/d/1XCYdpH4jtrbKtCQvNRQPKI5H_UWFg4LiPdZ4qabHmfo/edit?usp=drive_link)


* [pandas Commands](https://docs.google.com/document/d/1xnKJsii1AsRH2t22XtrAh7FzSFGqAR0hAmW4oLYM4MI/edit)


* [Data Visualizations with matplotlib](https://docs.google.com/document/d/1_3hzeIBPvcT6VC-eK-DDGVsKUvdVSvylNepoSLn2-T4/edit?usp=drive_link)

<br>

**Before starting, run the code below to import all necessary functions and libraries.**

In [None]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn import datasets, model_selection, metrics
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import *
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import *

<a name = "p1"></a>

---
## **Part 1: Real or Fake Money?**
---

The provided dataset contains information about real and fake banknotes (paper money). Each row represents information about an image of one banknote. This data contains 5 columns:

* `range` is the range of patterns in the banknote image
* `asymmetry` is the lack of symmetry in the banknote image
* `outliers` is the number of patterns that don't fit in with the rest in the banknote image
* `information` is the amount of total information believed to be contained in the banknote image
* `class` is 0 if the banknote is real and 1 if the banknote is fake

#### **Problem #1.1**

**Run the code below to load in data.**

In [None]:
url = "https://docs.google.com/spreadsheets/d/e/2PACX-1vRdRzlASrap1oY15IoQxXZnB5hi0RhIUCp_thFmTYOnJOw_xjR0X8sGDVyTSdPesIwqYEUQL_yelQpj/pub?gid=1496556477&single=true&output=csv"
banknote_df = pd.read_csv(url)

banknote_df.head()

#### **Problem #1.2**

Split the data into a training and test set using `range` and `asymmetry` as the features to predict `class`.

In [None]:
# COMPLETE THIS CODE
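
As a reminder of the general pattern, here is a minimal sketch of a train/test split on a small made-up DataFrame. The column names (`feature_a`, `feature_b`, `label`), the values, and the choices `test_size=0.2` and `random_state=42` are all assumptions for illustration, not requirements of this problem.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data; names and values are invented for illustration.
toy_df = pd.DataFrame({
    "feature_a": [0.5, 1.2, 3.1, 4.8, 2.2, 0.9, 3.7, 4.1, 1.5, 2.9],
    "feature_b": [10, 12, 35, 40, 22, 11, 33, 45, 14, 30],
    "label":     [0, 0, 1, 1, 0, 0, 1, 1, 0, 1],
})

toy_features = toy_df[["feature_a", "feature_b"]]  # 2-D: one column per feature
toy_label = toy_df["label"]                        # 1-D: the value to predict

# test_size and random_state are illustrative choices only.
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_features, toy_label, test_size=0.2, random_state=42
)

print(toy_X_train.shape, toy_X_test.shape)
```

The `toy_` prefix keeps these names from overwriting the `X_train`, `X_test`, `y_train`, and `y_test` variables used in the rest of this notebook.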

#### **Problem #1.3**

Create a standardized version of the training and test data.

In [None]:
std_scaler = StandardScaler()
X_train_std = # COMPLETE THIS LINE
X_test_std = # COMPLETE THIS LINE
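
The sketch below illustrates the usual standardization pattern on made-up arrays: the scaler learns its mean and standard deviation from the training rows only, and the same values are then reused on the test rows. All `toy_` names and numbers are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two features on very different scales.
toy_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])
toy_test = np.array([[2.5, 250.0], [5.0, 500.0]])

toy_std_scaler = StandardScaler()
toy_train_std = toy_std_scaler.fit_transform(toy_train)  # learn mean/std from training rows
toy_test_std = toy_std_scaler.transform(toy_test)        # reuse the training mean/std

print(toy_train_std.mean(axis=0))  # approximately 0 for each column
print(toy_train_std.std(axis=0))   # approximately 1 for each column
print(toy_test_std)                # test rows need not have mean 0 or std 1
```

Calling `fit_transform` on the test data instead would compute a separate mean and standard deviation for the test set, so the two sets would no longer be on the same scale.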

#### **Problem #1.4**

Create a normalized version of the training and test data.

In [None]:
norm_scaler = MinMaxScaler()
X_train_norm = # COMPLETE THIS LINE
X_test_norm = # COMPLETE THIS LINE
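
For comparison, here is a similar sketch for normalization with `MinMaxScaler`, again on made-up arrays (all `toy_` names and values are assumptions). It rescales each training column to the range 0 to 1 and applies the same minimum and maximum to the test rows.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: two features on very different scales.
toy_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])
toy_test = np.array([[2.5, 250.0], [5.0, 500.0]])

toy_norm_scaler = MinMaxScaler()
toy_train_norm = toy_norm_scaler.fit_transform(toy_train)  # learn min/max from training rows
toy_test_norm = toy_norm_scaler.transform(toy_test)        # reuse the training min/max

print(toy_train_norm.min(axis=0), toy_train_norm.max(axis=0))  # 0 and 1 per column
print(toy_test_norm)  # a test value above the training max maps above 1
```

Test values outside the training range land outside 0 to 1. That is expected, because the scaler only knows the training minimum and maximum.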

#### **Problem #1.5**

Now it's time to model our data. Let's be particularly thorough and rigorous in this process by performing 10-Folds CV on the following models:

* 1NN on unscaled data. **NOTE**: This is provided for you.
* 1NN on standardized data.
* 1NN on normalized data.

* 5NN on unscaled data.
* 5NN on standardized data.
* 5NN on normalized data.

* 33NN on unscaled data. **NOTE**: $\sqrt{\text{length of training data}} \approx 33$
* 33NN on standardized data.
* 33NN on normalized data.

* 549NN on unscaled data. **NOTE**: $\frac{1}{2}(\text{length of training data}) \approx 549$
* 549NN on standardized data.
* 549NN on normalized data.

* Any other models you would like to try.


<br>

**NOTE**: This may seem like a *lot*, but there's very little that will need to change for each model.
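
The two larger K values in the list come from simple arithmetic on the training-set length. The sketch below reproduces that arithmetic for a hypothetical length of 1098 rows, chosen only because it is consistent with both notes above. In practice you would use `len(X_train)` from your own split.

```python
import math

n_train = 1098  # hypothetical training-set length; use len(X_train) in practice

k_sqrt = round(math.sqrt(n_train))  # square-root rule of thumb
k_half = n_train // 2               # half of the training set, a deliberately large K

# With 10-Folds CV, each model is fit on roughly 9/10 of the training rows,
# so K must not exceed that number.
approx_fold_train = n_train * 9 // 10

print(k_sqrt, k_half, approx_fold_train)
```

Under this assumption even the largest K fits inside each training fold, so all twelve models can be cross validated.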

In [None]:
# 1NN on unscaled data

knn_1_unscaled = KNeighborsClassifier(n_neighbors = 1)

scores_1_unscaled = cross_val_score(knn_1_unscaled, X_train, y_train, cv=10)
print("10-Folds CV Scores: " + str(scores_1_unscaled.mean()) + " +/- " + str(scores_1_unscaled.std()))
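
Because only K and the training data change between models, the provided pattern can be wrapped in a small function. The sketch below does this on synthetic data so that it runs on its own. The helper name `cv_report` and all `demo_` variables are hypothetical and not part of the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the banknote data).
demo_X, demo_y = make_classification(
    n_samples=200, n_features=2, n_informative=2, n_redundant=0, random_state=0
)
demo_X_std = StandardScaler().fit_transform(demo_X)

def cv_report(k, X, y, folds=10):
    """Hypothetical helper: run K-Folds CV for a KNN model with k neighbors."""
    model = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(model, X, y, cv=folds)
    print(f"{k}NN: {scores.mean():.3f} +/- {scores.std():.3f}")
    return scores

demo_scores_unscaled = cv_report(5, demo_X, demo_y)
demo_scores_std = cv_report(5, demo_X_std, demo_y)
```

Each call returns the array of per-fold accuracies, which is the kind of object the plotting cells later in this part expect.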

In [None]:
# 1NN on standardized data

In [None]:
# 1NN on normalized data

In [None]:
# 5NN on unscaled data

In [None]:
# 5NN on standardized data

In [None]:
# 5NN on normalized data

In [None]:
# 33NN on unscaled data

In [None]:
# 33NN on standardized data

In [None]:
# 33NN on normalized data

In [None]:
# 549NN on unscaled data

In [None]:
# 549NN on standardized data

In [None]:
# 549NN on normalized data

#### **Visualize the scores by running the cell below.**

**NOTE**: If you stored your cross validation scores under different variable names, update the names in the cells below to match.

In [None]:
plt.figure(figsize = (10, 8))

plt.plot(scores_1_unscaled, label = '1NN Unscaled')
plt.plot(scores_1_std, label = '1NN Standardized')
plt.plot(scores_1_norm, label = '1NN Normalized')

plt.plot(scores_5_unscaled, label = '5NN Unscaled')
plt.plot(scores_5_std, label = '5NN Standardized')
plt.plot(scores_5_norm, label = '5NN Normalized')

plt.plot(scores_33_unscaled, label = '33NN Unscaled')
plt.plot(scores_33_std, label = '33NN Standardized')
plt.plot(scores_33_norm, label = '33NN Normalized')

plt.plot(scores_549_unscaled, label = '549NN Unscaled')
plt.plot(scores_549_std, label = '549NN Standardized')
plt.plot(scores_549_norm, label = '549NN Normalized')

plt.xlabel('Fold')
plt.ylabel('Accuracy')
plt.legend(bbox_to_anchor=(1, 1), fontsize = 14)

plt.show()

In [None]:
# Calculate the mean and standard deviation of the CV scores for each model
scores = [scores_1_unscaled, scores_1_std, scores_1_norm,
          scores_5_unscaled, scores_5_std, scores_5_norm,
          scores_33_unscaled, scores_33_std, scores_33_norm,
          scores_549_unscaled, scores_549_std, scores_549_norm]

mean_values = [np.mean(score) for score in scores]
std_dev_values = [np.std(score) for score in scores]

# Labels for the bars
labels = ['1NN Unscaled', '1NN Standardized', '1NN Normalized',
          '5NN Unscaled', '5NN Standardized', '5NN Normalized',
          '33NN Unscaled', '33NN Standardized', '33NN Normalized',
          '549NN Unscaled', '549NN Standardized', '549NN Normalized']

# Bar width
bar_width = 0.5

# Plotting
plt.rcParams.update({'font.size': 12})
fig, ax = plt.subplots(figsize = (10, 8))

# Bar plots with error bars representing standard deviations
ax.bar(labels, mean_values, bar_width, yerr=std_dev_values, capsize=10)

# Adding labels and title
ax.set_xlabel('Model')
ax.set_ylabel('Mean CV Accuracy')
ax.set_title('Mean and Standard Deviation of 10-Folds CV Accuracy by Model')

# Show the plot
plt.xticks(rotation = 45)
plt.show()

#### **Problem #1.6**

Choose the best model from above by considering all the information from the outputs and the graphs, train it on the whole training set, and evaluate it on the test set using the accuracy and confusion matrix.

In [None]:
# COMPLETE THIS CODE

In [None]:
# COMPLETE THIS CODE
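
As a reference for the final evaluation step, here is a minimal sketch on synthetic data. The value K = 5 and all `demo_` names are arbitrary assumptions, not a suggested answer to this problem.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the banknote data).
demo_X, demo_y = make_classification(
    n_samples=200, n_features=2, n_informative=2, n_redundant=0, random_state=0
)
demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=0
)

demo_model = KNeighborsClassifier(n_neighbors=5)  # K chosen arbitrarily here
demo_model.fit(demo_X_train, demo_y_train)        # train on the whole training set
demo_pred = demo_model.predict(demo_X_test)       # predict on the held-out test set

demo_acc = accuracy_score(demo_y_test, demo_pred)
demo_cm = confusion_matrix(demo_y_test, demo_pred)  # rows: true class, columns: predicted class

print(demo_acc)
print(demo_cm)
```

If the model you choose was cross validated on standardized or normalized data, fit it on the matching scaled training set and evaluate it on the matching scaled test set.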

<a name = "p2"></a>

---
## **Part 2: Classifying Stars Revisited**
---

In this section, you will revisit the stars dataset from last week and model it more rigorously using your new hyperparameter tuning and validation skills.

#### **Problem #2.1**

**Run the code below to load in data.**

In [None]:
url = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vTCZgoegOHa49SFXYU-ZZTdCkgTp0sneU1BsEOa7vusjTXPPLcn0i3kXhX1nyqkApJHCKTkw0mWuWr4/pub?gid=753880827&single=true&output=csv'
stars_df = pd.read_csv(url)

stars_df.head()

#### **Problem #2.2**

Split the data into a training and test set using `temperature` and `size` as the features to predict `class`.

In [None]:
# COMPLETE THIS CODE

#### **Problem #2.3**

Create a standardized version of the training and test data.

In [None]:
std_scaler = StandardScaler()
X_train_std = # COMPLETE THIS LINE
X_test_std = # COMPLETE THIS LINE

#### **Problem #2.4**

Create a normalized version of the training and test data.

In [None]:
norm_scaler = MinMaxScaler()
X_train_norm = # COMPLETE THIS LINE
X_test_norm = # COMPLETE THIS LINE

#### **Problem #2.5**

Now it's time to model our data. Let's be particularly thorough and rigorous in this process by performing 10-Folds CV on the following models:

* 1NN on unscaled data. **NOTE**: The code for this model follows the same pattern as the cell provided in Problem #1.5.
* 1NN on standardized data.
* 1NN on normalized data.

* 5NN on unscaled data.
* 5NN on standardized data.
* 5NN on normalized data.

* 15NN on unscaled data. **NOTE**: $\sqrt{\text{length of training data}} \approx 15$
* 15NN on standardized data.
* 15NN on normalized data.

* 107NN on unscaled data. **NOTE**: $\frac{1}{2}(\text{length of training data}) \approx 107$
* 107NN on standardized data.
* 107NN on normalized data.

* Any other models you would like to try.


<br>

**NOTE**: This may seem like a *lot*, but there's very little that will need to change for each model.

In [None]:
# 1NN on unscaled data

In [None]:
# 1NN on standardized data

In [None]:
# 1NN on normalized data

In [None]:
# 5NN on unscaled data

In [None]:
# 5NN on standardized data

In [None]:
# 5NN on normalized data

In [None]:
# 15NN on unscaled data

In [None]:
# 15NN on standardized data

In [None]:
# 15NN on normalized data

In [None]:
# 107NN on unscaled data

In [None]:
# 107NN on standardized data

In [None]:
# 107NN on normalized data

#### **Visualize the scores by running the cell below.**

**NOTE**: If you stored your cross validation scores under different variable names, update the names in the cells below to match.

In [None]:
plt.figure(figsize = (10, 8))

plt.plot(scores_1_unscaled, label = '1NN Unscaled')
plt.plot(scores_1_std, label = '1NN Standardized')
plt.plot(scores_1_norm, label = '1NN Normalized')

plt.plot(scores_5_unscaled, label = '5NN Unscaled')
plt.plot(scores_5_std, label = '5NN Standardized')
plt.plot(scores_5_norm, label = '5NN Normalized')

plt.plot(scores_15_unscaled, label = '15NN Unscaled')
plt.plot(scores_15_std, label = '15NN Standardized')
plt.plot(scores_15_norm, label = '15NN Normalized')

plt.plot(scores_107_unscaled, label = '107NN Unscaled')
plt.plot(scores_107_std, label = '107NN Standardized')
plt.plot(scores_107_norm, label = '107NN Normalized')

plt.xlabel('Fold')
plt.ylabel('Accuracy')
plt.legend(bbox_to_anchor=(1, 1), fontsize = 14)

plt.show()

In [None]:
# Calculate the mean and standard deviation of the CV scores for each model
scores = [scores_1_unscaled, scores_1_std, scores_1_norm,
          scores_5_unscaled, scores_5_std, scores_5_norm,
          scores_15_unscaled, scores_15_std, scores_15_norm,
          scores_107_unscaled, scores_107_std, scores_107_norm]

mean_values = [np.mean(score) for score in scores]
std_dev_values = [np.std(score) for score in scores]

# Labels for the bars
labels = ['1NN Unscaled', '1NN Standardized', '1NN Normalized',
          '5NN Unscaled', '5NN Standardized', '5NN Normalized',
          '15NN Unscaled', '15NN Standardized', '15NN Normalized',
          '107NN Unscaled', '107NN Standardized', '107NN Normalized']

# Bar width
bar_width = 0.5

# Plotting
plt.rcParams.update({'font.size': 12})
fig, ax = plt.subplots(figsize = (10, 8))

# Bar plots with error bars representing standard deviations
ax.bar(labels, mean_values, bar_width, yerr=std_dev_values, capsize=10)

# Adding labels and title
ax.set_xlabel('Model')
ax.set_ylabel('Mean CV Accuracy')
ax.set_title('Mean and Standard Deviation of 10-Folds CV Accuracy by Model')

# Show the plot
plt.xticks(rotation = 45)
plt.show()
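
When many bars look similar, a sorted table can make the comparison easier to read. The sketch below builds one from invented per-fold accuracies. The model names and numbers are hypothetical, and in practice you would pass your own score arrays instead.

```python
import numpy as np
import pandas as pd

# Hypothetical per-fold accuracies, invented for illustration only.
demo_scores = {
    "model A": np.array([0.90, 0.95, 0.85, 0.90, 0.95]),
    "model B": np.array([0.92, 0.93, 0.91, 0.92, 0.92]),
}

# One row per model, sorted so the highest mean accuracy comes first.
demo_summary = pd.DataFrame({
    "mean": {name: s.mean() for name, s in demo_scores.items()},
    "std": {name: s.std() for name, s in demo_scores.items()},
}).sort_values("mean", ascending=False)

print(demo_summary)
```

In this made-up case, model B has both the higher mean and the smaller spread across folds. When two models have similar means, the one with the smaller standard deviation is often the safer choice.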

#### **Problem #2.6**

Choose the best model from above by considering all the information from the outputs and the graphs, train it on the whole training set, and evaluate it on the test set using the accuracy and confusion matrix.

In [None]:
# COMPLETE THIS CODE

In [None]:
# COMPLETE THIS CODE

---

# End of Notebook

© 2023 The Coding School, All rights reserved