## Prepare the Python environment


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

%matplotlib inline

In [None]:
random_state = 5  # Controls randomness across runs (e.g., dataset partitioning)

## Preparing the Diabetes Dataset (2 points)

We will use the diabetes dataset from the UCI Machine Learning Repository. Details about this dataset can be found [here](https://www.kaggle.com/uciml/pima-indians-diabetes-database). The objective of this project is to predict whether a female patient has diabetes based on the diagnostic measurements in the dataset.

The dataset consists of several medical predictor variables (features) and one target variable indicating whether the person has diabetes. The predictor variables are the number of pregnancies the patient has had, glucose level, blood pressure, skin thickness, insulin level, BMI, diabetes pedigree function, and age.

### Loading the dataset

In [None]:
# These are the names of the columns in the dataset. They include all features of the data and the label.
col_names = ['pregnancies', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']

# Download and load the dataset
import os
if not os.path.exists('diabetes.csv'): 
    !wget https://raw.githubusercontent.com/JHA-Lab/ece364_2022/master/dataset/diabetes.csv 
diabetes_data = pd.read_csv('diabetes.csv', header=1, names=col_names)

FEATURE_NAMES=diabetes_data.drop('label',axis=1).columns

# Display the first five instances in the dataset
diabetes_data.head(5)

In [None]:
# Check the type of data in each column
diabetes_data.info()

#### Look at some statistics of the data using the `describe` function in pandas.

In [None]:
# Display some stats
diabetes_data.describe()

In [None]:
fig, ax = plt.subplots(1, 1)
ax.pie(diabetes_data.label.value_counts(),autopct='%1.1f%%', labels=['Not diabetes','Diabetes'], colors=['green','red'])
plt.axis('equal')
plt.ylabel('');

### Extract target and descriptive features (1 point)

#### Separate the target and features from the data.

In [None]:
# Store all the features from the data in X
X = # insert your code here

# Store all the target labels in y
y = # insert your code here

In [None]:
# Convert data to numpy arrays
X = # insert your code here
y = # insert your code here
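
As a rough illustration of the two steps above, the sketch below separates features from a target column and converts both to NumPy arrays. It uses a made-up toy DataFrame, not the diabetes data; the names `toy`, `X_toy`, and `y_toy` are hypothetical and chosen only for this example.

```python
import pandas as pd

# Toy stand-in for the real data; the values are made up for illustration.
toy = pd.DataFrame({
    "glucose": [148, 85, 183, 89],
    "bmi": [33.6, 26.6, 23.3, 28.1],
    "label": [1, 0, 1, 0],
})

# Features: every column except the target.
X_toy = toy.drop("label", axis=1)

# Target: the label column on its own.
y_toy = toy["label"]

# Convert both to NumPy arrays.
X_toy = X_toy.to_numpy()
y_toy = y_toy.to_numpy()

print(X_toy.shape, y_toy.shape)
```

The same pattern should carry over to any DataFrame whose target lives in a single named column.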

### Create training and validation datasets (0.5 points)

Split the data into training and validation sets using `train_test_split`. See [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) for details. To get consistent results when splitting, set `random_state` to the value defined earlier. We use 80% of the data for training and 20% for validation.

In [None]:
X_train, X_val, y_train, y_val = # insert your code here
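
For reference, here is a minimal sketch of an 80/20 split on synthetic arrays. The arrays `X_demo` and `y_demo` and the variable `seed` are hypothetical stand-ins; `seed` plays the role that `random_state` plays in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

seed = 5  # hypothetical; stands in for the notebook's random_state

# Synthetic data: 20 samples with 2 features each, and alternating labels.
X_demo = np.arange(40, dtype=float).reshape(20, 2)
y_demo = np.array([0, 1] * 10)

# test_size=0.2 holds out 20% of the samples for validation.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=seed
)

# Repeating the call with the same seed reproduces the same partition.
X_tr2, X_va2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=seed
)

print(X_tr.shape, X_va.shape)
```

Fixing the seed is what makes results comparable across runs; changing it gives a different, equally valid partition.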

### Preprocess the dataset (0.5 points)

#### Preprocess the dataset by normalizing each feature to have zero mean and unit standard deviation. This can be done using the `StandardScaler` class. See [here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) for more details.

In [None]:
# Define the scaler for scaling the data
scaler = # insert your code here

# Normalize the training data
X_train = # insert your code here

# Use the scaler defined above to standardize the validation data by applying the same transformation to the validation data.
X_val = # insert your code here
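
The sketch below shows one common way to apply `StandardScaler` so that the statistics come from the training data only. It runs on randomly generated arrays, and all variable names ending in `_demo` or `_scaled` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic training and validation features (made up for illustration).
rng = np.random.default_rng(0)
X_tr_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
X_va_demo = rng.normal(loc=50.0, scale=10.0, size=(25, 3))

scaler_demo = StandardScaler()

# fit_transform learns the per-feature mean and standard deviation
# from the training data and applies them in one step.
X_tr_scaled = scaler_demo.fit_transform(X_tr_demo)

# transform reuses the training statistics; the scaler is not refit.
X_va_scaled = scaler_demo.transform(X_va_demo)

print(X_tr_scaled.mean(axis=0).round(6), X_tr_scaled.std(axis=0).round(6))
```

Because the validation data is scaled with the training statistics, its per-feature mean and standard deviation will generally be close to, but not exactly, 0 and 1.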

## Training K-nearest neighbor models (18 points)



#### We will use the `sklearn` library to train a K-nearest neighbors (kNN) classifier. Review Ch. 5 and see [here](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier) for more details.

### Exercise 1: Learning a kNN classifier (18 points)

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score 

#### Exercise 1a: Evaluate the effect of the number of neighbors (6 points)

#### Train kNN classifiers with different numbers of neighbors, chosen from {1, 5, 15, 100, len(X_train)}.

#### Keep all other parameters at their default values.  

#### Report the model's accuracy on the training and validation sets.

In [None]:
# insert your code here
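
One possible shape for such an experiment is sketched below on a synthetic dataset from `make_classification`, not on the diabetes data. The dataset, the `results` dictionary, and the other variable names are assumptions made for this illustration, and the accuracies it prints say nothing about what to expect on the real data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification problem (illustration only).
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0
)

results = {}
for k in [1, 5, 15, 100, len(X_tr)]:
    # All parameters other than n_neighbors stay at their defaults.
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_tr, y_tr)

    train_acc = accuracy_score(y_tr, knn.predict(X_tr))
    val_acc = accuracy_score(y_va, knn.predict(X_va))
    results[k] = (train_acc, val_acc)
    print(f"k={k:3d}  train={train_acc:.3f}  val={val_acc:.3f}")
```

Storing the accuracies in a dictionary keyed by `k` makes it easy to tabulate or plot them afterwards.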

#### Explain the effect of increasing the number of neighbors on performance on the training and validation sets.

**ANS**:

#### Exercise 1b: Evaluate the effect of a weighted kNN (6 points)

#### Train kNN classifiers with distance weighting, varying the number of neighbors among {1, 5, 15, 100, len(X_train)}.

#### Keep all other parameters at their default values.  

#### Report the model's accuracy on the training and validation sets.

In [None]:
# insert your code here
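
In `KNeighborsClassifier`, the `weights` parameter switches between uniform voting (`"uniform"`, the default) and voting weighted by inverse distance (`"distance"`). The sketch below runs both settings side by side on a synthetic dataset; the data and the `weighted_results` dictionary are hypothetical and serve only to show the mechanics.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification problem (illustration only).
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0
)

weighted_results = {}
for weights in ["uniform", "distance"]:
    for k in [1, 5, 15, 100, len(X_tr)]:
        # Only n_neighbors and weights differ from the defaults.
        knn = KNeighborsClassifier(n_neighbors=k, weights=weights)
        knn.fit(X_tr, y_tr)

        train_acc = accuracy_score(y_tr, knn.predict(X_tr))
        val_acc = accuracy_score(y_va, knn.predict(X_va))
        weighted_results[(weights, k)] = (train_acc, val_acc)
        print(f"{weights:8s} k={k:3d}  train={train_acc:.3f}  val={val_acc:.3f}")
```

Printing the two settings in the same loop keeps the comparison for each `k` on adjacent lines.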

#### Compare the effect of the number of neighbors on model performance (train and validation) under the distance-weighted kNN against the uniformly weighted kNN. Explain any differences observed.

**ANS**:

#### Exercise 1c: Evaluate the effect of the power parameter in the Minkowski distance metric (6 points)

#### Train kNN classifiers with different distance functions by varying the power parameter for the Minkowski distance among {1, 2, 10, 100}.

#### Fix the number of neighbors to be 15, and use the uniformly weighted kNN. Keep all other parameters at their default values.
#### Report the model's accuracy over the validation set.

In [None]:
# insert your code here
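
To build intuition for the power parameter (`p` in `KNeighborsClassifier`), the sketch below computes the Minkowski distance between two fixed points by hand with NumPy. The helper `minkowski` and the points `a` and `b` are hypothetical and written only for this example; they are not part of `sklearn`.

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance of order p between two 1-D arrays."""
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

for p in [1, 2, 10, 100]:
    # p=1 is the Manhattan distance and p=2 the Euclidean distance.
    # As p grows, the largest coordinate difference dominates the sum.
    print(f"p={p:3d}  distance={minkowski(a, b, p):.6f}")
```

For these two points, the distance moves from the sum of the coordinate differences at `p=1` toward the largest single coordinate difference as `p` increases.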

#### Explain any effect observed on the model performance upon increasing the power parameter. 

**ANS**:

[Submit](https://docs.google.com/document/d/1xSVzQorsVdyWGaj0zg3po1jcA2RuwlvS6K8cCqARi2Y) to Gradescope when finished.