### Developing the K Nearest Neighbors (KNN) Regression to Predict the Disease Progression of Diabetes

In Lecture 8, we implemented a KNN algorithm for classification, which computed a `simple majority vote` of the nearest neighbors of each test sample. 

In Homework 2, we will implement a KNN-based regression algorithm, in which the prediction for a test sample is the `mean of the target values` of its nearest neighbors.

`The diabetes dataset has 442 patients. Ten variables, age, sex, body mass index, average blood pressure, and six blood serum measurements, were obtained for each sample. The response/target variable is a quantitative measure of disease progression one year after baseline.`


[Task 1: Split the dataset](#task1)

[Task 2: Implement KNN for regression](#task2)

[Task 3: Evaluate](#task3)

[Task 4: Explore the impact of different K](#task4)

[Task 5: Improve KNN](#task5)

#### Load the diabetes dataset

In [1]:
import numpy as np
from sklearn import datasets as ds
from sklearn.model_selection import train_test_split

# 1. Data preparation
db = ds.load_diabetes()
X = db.data #feature vectors
y = db.target #target values

print(db['DESCR'])



.. _diabetes_dataset:

Diabetes dataset
----------------

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

  :Number of Instances: 442

  :Number of Attributes: First 10 columns are numeric predictive values

  :Target: Column 11 is a quantitative measure of disease progression one year after baseline

  :Attribute Information:
      - age     age in years
      - sex
      - bmi     body mass index
      - bp      average blood pressure
      - s1      tc, T-Cells (a type of white blood cells)
      - s2      ldl, low-density lipoproteins
      - s3      hdl, high-density lipoproteins
      - s4      tch, thyroid stimulating hormone
      - s5      ltg, lamotrigine
      - s6      glu, blood sugar level

Note: Each of these 10 feature va

#### Task 1: Split the dataset into training (75%) and test (25%) sets. 5 points. <a id="task1"></a>

In [2]:
# add your code here
print(X.shape, y.shape)






(442, 10) (442,)
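A 75/25 split can be sketched with `train_test_split`, which the notebook already imports. The `random_state=42` seed below is an assumption chosen only for reproducibility; the outputs printed later in this notebook came from an unknown split, so neighbor indices and error values will likely differ.

```python
from sklearn import datasets as ds
from sklearn.model_selection import train_test_split

db = ds.load_diabetes()
X, y = db.data, db.target

# test_size=0.25 holds out 25% of the samples for testing.
# random_state=42 is an assumed seed, not the one used for the outputs in this notebook.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```

With 442 samples, this split yields 331 training samples and 111 test samples.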


#### Task 2: Implement the KNN algorithm for regression. 20 points. <a id="task2"></a>

- For any input data sample, find its K nearest neighbors in the training set.
- The prediction function calculates the mean of the target values of the K nearest neighbors.
    

##### Task 2.1: Complete the following function to identify the K nearest neighbors of a given data sample. 
- return both the indices and Euclidean distances of the K nearest neighbors for a given query/new sample.

In [36]:
def findKNgbs(X_train, x_query, K = 5):
    '''find K nearest neighbors for a given data sample in X_train
        
        input:
            X_train: training set
            x_query: new data sample/a query
            K: the number of neighbors
        return:
            indK: the indices of the K nearest neighbors
            disK: the distances of the K nearest neighbors to x_query
    '''
    #add your code here


    
    
    
x = X_test[0]
print('Input data sample:\n', x)
k_inx, k_dis = findKNgbs(X_train, x, K=5)
print('K nearest neighbors of x in the training set:\n', k_inx, '\nTheir distances to the query:\n',k_dis)

Input data sample:
 [ 0.06713621  0.05068012 -0.04177375  0.01154374  0.0025589   0.00588854
  0.04127682 -0.03949338 -0.0594727  -0.02178823]
K nearest neighbors of x in the training set:
 [309 262 128 105 135] 
Their distances to the query:
 [0.0030955  0.005225   0.00563592 0.00712298 0.00905236]
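One possible way to fill in `findKNgbs` is sketched below: compute the Euclidean distance from the query to every training sample, sort, and keep the first K. This is a minimal sketch rather than the reference solution, and the toy arrays are hypothetical stand-ins for the diabetes data.

```python
import numpy as np

def findKNgbs(X_train, x_query, K=5):
    """Return indices and Euclidean distances of the K nearest training samples."""
    # distance from the query to every row of X_train
    dists = np.sqrt(np.sum((X_train - x_query) ** 2, axis=1))
    # indices of the K smallest distances, nearest first
    indK = np.argsort(dists)[:K]
    return indK, dists[indK]

# hypothetical toy data, not the diabetes dataset
X_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 4.0]])
ind, dis = findKNgbs(X_toy, np.array([0.0, 0.0]), K=2)
print(ind, dis)  # nearest two rows are 0 and 1, at distances 0 and 1
```

Sorting all distances costs O(n log n) per query; for large training sets, `np.argpartition` would be a cheaper alternative, at the cost of an extra sort of the K selected entries.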



##### Task 2.2: Predict the target value for new data samples.


In [37]:
def predict(X_in, X_train, y_train, K=5):
    '''predict the target values for input queries
        Input:
            X_in: new data samples/queries, shape (n, 10); may contain multiple data samples
            X_train: the feature vectors of training samples
            y_train: the target values of the training samples
            K: the number of neighbors

        return:
            y_pred: the predictions of the input queries
    '''

    #add your code here
    
    
    
    
    
    
    
n = 10 # number of test samples to predict
K = 6 # number of neighbors

X_in = X_test[0:n]
y_pred = predict(X_in, X_train, y_train, K)
print('Predictions:', y_pred)
print('True target values:', y_test[:n])

Predictions: [ 91 154 112 244 102 191 220 110 119 248]
True target values: [ 75. 128. 125. 332.  37. 121. 259.  72.  40. 281.]
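The prediction step can be sketched as a loop over the queries that averages the targets of each query's K nearest neighbors. The helper repeats a compact neighbor search so the block runs on its own, and the toy data is hypothetical. Note that the sample output above shows whole numbers, which suggests the reference solution rounds or casts its predictions; this sketch keeps the floating-point mean.

```python
import numpy as np

def findKNgbs(X_train, x_query, K=5):
    # Euclidean distances to all training samples, then the K closest
    dists = np.linalg.norm(X_train - x_query, axis=1)
    indK = np.argsort(dists)[:K]
    return indK, dists[indK]

def predict(X_in, X_train, y_train, K=5):
    """Predict each query as the mean target of its K nearest training samples."""
    y_pred = np.empty(len(X_in))
    for i, x in enumerate(X_in):
        indK, _ = findKNgbs(X_train, x, K)
        y_pred[i] = y_train[indK].mean()
    return y_pred

# hypothetical one-feature toy data
X_toy = np.array([[0.0], [1.0], [2.0], [10.0]])
y_toy = np.array([10.0, 20.0, 30.0, 100.0])
preds = predict(np.array([[0.9], [9.0]]), X_toy, y_toy, K=2)
print(preds)  # [15. 65.]
```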


#### Task 3: Evaluate the KNN method using MSE and MAE. 10 points. <a id="task3"></a>

- Implement the mean squared error (MSE) and mean absolute error (MAE) functions.
- Calculate the MSE and MAE of the KNN on the test set.

    - MSE: ((y_true[0]-y_pred[0])**2 + ... + (y_true[n-1]-y_pred[n-1])**2)/n, where n is the number of samples
    - MAE: (|y_true[0]-y_pred[0]| + ... + |y_true[n-1]-y_pred[n-1]|)/n

In [51]:
# calculate mean square error (MSE) between y_true and y_pred
def myMSE(y_true, y_pred):
    '''
        y_true: the true target values
        y_pred: predictions

        return: the MSE between y_true and y_pred
    '''    
    #add your code here
    
    
    
    
    

def myMAE(y_true, y_pred):
    '''
        y_true: the true target values
        y_pred: predictions

        return: the MAE between y_true and y_pred
    '''    
    #add your code here

    
    
    
    
y_pred = predict(X_test, X_train, y_train, K=4)
mse = myMSE(y_test, y_pred)
mae = myMAE(y_test, y_pred)
print('The MSE on the test set is', round(mse, 2))
print('The MAE on the test set is', round(mae, 2))

The MSE on the test set is 3620.64
The MAE on the test set is 47.99
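The two metrics follow directly from the formulas in Task 3. A minimal NumPy sketch, checked on hypothetical toy values rather than the diabetes predictions:

```python
import numpy as np

def myMSE(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

def myMAE(y_true, y_pred):
    """Mean of the absolute differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred))

# hypothetical toy values: errors are 1, -2, 0
y_true_toy = [3.0, 5.0, 2.0]
y_pred_toy = [2.0, 7.0, 2.0]
print(myMSE(y_true_toy, y_pred_toy))  # (1 + 4 + 0) / 3
print(myMAE(y_true_toy, y_pred_toy))  # (1 + 2 + 0) / 3 = 1.0
```

Because MSE squares each error, a few badly missed samples dominate it, while MAE weights all errors linearly; reporting both gives a fuller picture.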


#### Task 4: Explore the impact of different K. 10 points. <a id="task4"></a>

##### Task 4.1
- Calculate the MSE values for K from 1 to 30.
- Plot the MSE values as a curve. Set the xlabel to 'K' and the ylabel to 'MSE'. See https://matplotlib.org/stable/gallery/pyplots/axline.html#sphx-glr-gallery-pyplots-axline-py
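The sweep over K can be sketched as follows. To keep the block runnable on its own, it uses a hypothetical vectorized helper `knn_predict` and synthetic data; in the notebook, the same loop would call `predict` and `myMSE` on `X_test`, `X_train`, `y_train`, and `y_test`.

```python
import numpy as np
import matplotlib.pyplot as plt

def knn_predict(X_in, X_train, y_train, K):
    # pairwise Euclidean distances, shape (n_queries, n_train)
    d = np.linalg.norm(X_in[:, None, :] - X_train[None, :, :], axis=2)
    # indices of the K nearest training samples for each query
    idx = np.argsort(d, axis=1)[:, :K]
    return y_train[idx].mean(axis=1)

# synthetic data (an assumption for illustration, not the diabetes set)
rng = np.random.default_rng(0)
X_tr = rng.uniform(-1, 1, size=(200, 2))
y_tr = X_tr[:, 0] ** 2 + X_tr[:, 1] + rng.normal(0, 0.1, size=200)
X_te = rng.uniform(-1, 1, size=(50, 2))
y_te = X_te[:, 0] ** 2 + X_te[:, 1] + rng.normal(0, 0.1, size=50)

ks = list(range(1, 31))  # K from 1 to 30
mses = [np.mean((y_te - knn_predict(X_te, X_tr, y_tr, k)) ** 2) for k in ks]

# in a notebook the figure renders automatically; in a script, call plt.show()
fig, ax = plt.subplots()
ax.plot(ks, mses, marker='o')
ax.set_xlabel('K')
ax.set_ylabel('MSE')
```

Selecting K by test-set MSE, as this task does, tends to give an optimistic estimate; a separate validation split or cross-validation would be the more careful choice.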

In [3]:
import matplotlib.pyplot as plt

# evaluate the performance on the test set
mses = []

#add your code here

#### Task 4.2: Discuss your findings from the above curve, e.g., the trend, best K, and the impact of small and large Ks



Response: 

\
\
\

#### Task 5: Improve the KNN method. 5 points. <a id="task5"></a>
- 10 extra points for implementing an improvement that reduces the MSE or MAE

In [57]:
k_best = 14 # use your best K
y_pred = predict(X_test, X_train, y_train, k_best)
mse = myMSE(y_test, y_pred)
print('Using K=', k_best, ', the MSE on the test set is', round(mse,2))
mae = myMAE(y_test, y_pred)
print('Using K=', k_best, ', the MAE on the test set is', round(mae, 2))

# compare the predictions of the first 10 test samples with their true target values
for i in range(10):
    print(y_test[i], y_pred[i])

Using K= 14 , the MSE on the test set is 3009.05
Using K= 14 , the MAE on the test set is 44.53
75.0 92
128.0 129
125.0 108
332.0 216
37.0 81
121.0 167
259.0 210
72.0 98
40.0 107
281.0 240


From the above results, even when we use the best K, the MSE and MAE are still large. Please suggest possible improvements to the KNN algorithm.
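One commonly suggested direction is to weight each neighbor by the inverse of its distance, so that closer neighbors contribute more to the prediction. The sketch below illustrates the idea on hypothetical toy data; the function name `predict_weighted` and the `eps` guard are assumptions, and whether this lowers the MSE or MAE on the diabetes test set has not been verified here.

```python
import numpy as np

def predict_weighted(X_in, X_train, y_train, K=5, eps=1e-12):
    """Inverse-distance-weighted KNN regression (illustrative sketch)."""
    y_pred = np.empty(len(X_in))
    for i, x in enumerate(X_in):
        dists = np.linalg.norm(X_train - x, axis=1)
        indK = np.argsort(dists)[:K]
        # eps avoids division by zero when a query coincides with a training sample
        w = 1.0 / (dists[indK] + eps)
        y_pred[i] = np.sum(w * y_train[indK]) / np.sum(w)
    return y_pred

# hypothetical toy data: the query at 0.25 is three times closer to x=0 than to x=1
X_toy = np.array([[0.0], [1.0], [4.0]])
y_toy = np.array([0.0, 10.0, 40.0])
pred = predict_weighted(np.array([[0.25]]), X_toy, y_toy, K=2)
print(pred)  # about 2.5, versus 5.0 for the unweighted mean
```

Other directions worth testing include different distance metrics and selecting or reweighting features, since Euclidean distance treats all ten features as equally informative.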


Response:
    
    
    
    
    