
# CSC-321: Data Mining and Machine Learning
# Irene Yin
## Assignment 7: K-Nearest Neighbor

### Part 1: Implementation

For this assignment, I'm going to let you break down the implementation as you see fit. You're going to implement KNN, an example of lazy learning. In brief, this means:

- calculating the Euclidean distance between the feature values of a single test instance and the feature values of a single training instance
- making a prediction requires iterating through *all* the training instances, calculating the distances and storing each distance in a list, along with the corresponding class value (probably as some sort of tuple)
- sorting the list, smallest distances first
- selecting the *k* nearest neighbors (where k should be a parameter)
- making a prediction by choosing the class that appears the most in the k nearest neighbors

To calculate the Euclidean distance between an instance x1 and an instance x2, we iterate through the n input features of the two instances. For each feature i we take the difference x1[i] - x2[i], square that difference, and sum the squares over all features. At the end, we take the square root of the total. In other words:

$$\text{distance}=\sqrt{\sum_{i=1}^{n} (x1_{i} - x2_{i})^2}$$
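
As a quick sanity check of the formula, the sketch below computes the distance in plain Python. The function name `euclidean_distance` and the sample points are illustrative assumptions, not part of the assignment code.

```python
import math

def euclidean_distance(x1, x2):
    """Square root of the summed squared differences over all features."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

# A 3-4-5 right triangle: the distance between (0, 0) and (3, 4) is 5.
print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))  # 5.0

# An instance compared with itself is at distance 0.
print(euclidean_distance([1.5, 2.5], [1.5, 2.5]))  # 0.0
```

Using `zip` pairs up every feature of the two instances, so the sum covers all n features as long as neither list carries a class label.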


I would strongly suggest you follow the implementation outline of previous algorithms in terms of the functions you use, but I'm leaving it up to you.

Below is the same contrived dataset you've used before. If your code works, you should be able to take an instance of this data and compare it to all the others (including itself, where the distance SHOULD be 0). You should be able to select the k nearest neighbors, and make a prediction based on the most frequently occurring class in those k neighbors. Try it for different values of k, such as 1, 3 and 5.

Make sure you create a knn function that takes a training set (X_train, y_train), a test set (X_test) and a value for k, that returns a list of predictions - one prediction for each instance in the test set.

Run the algorithm over the sample dataset, using k=3. Print the predicted and the actual side by side.

In [None]:
# Contrived data set
import operator

dataset = [[3.393533211,2.331273381,0],
    [3.110073483,1.781539638,0],
    [1.343808831,3.368360954,0],
    [3.582294042,4.67917911,0],
    [2.280362439,2.866990263,0],
    [7.423436942,4.696522875,1],
    [5.745051997,3.533989803,1],
    [9.172168622,2.511101045,1],
    [7.792783481,3.424088941,1],
    [7.939820817,0.791637231,1]]

X_train = [i[:-1] for i in dataset]
X_test = X_train
y_train = [i[-1] for i in dataset]
y_test = y_train

def eucliDistance(instanceX1, instanceX2):
    # return: the Euclidean distance between the feature values of a single
    #         test instance and those of a single training instance.
    # Both instances hold features only (no class label), so every position
    # is included in the sum.
    distance = 0
    for i in range(len(instanceX1)):
        distance += (instanceX1[i] - instanceX2[i])**2
    totalDistance = distance**(0.5)
    return totalDistance


def disWithValue(X_train, instance):
    # return: a list of (distance, class value) tuples, one per training instance.
    # Note: the class values come from the module-level y_train, which must be
    # aligned row-for-row with X_train.
    DistancewithValue = []
    for j in range(len(X_train)):
        DistancewithValue.append((eucliDistance(instance, X_train[j]), y_train[j]))
    return DistancewithValue
  
def selectKNearest(distanceList, k):
  # return: the k nearest neighbours
    sortedList = sorted(distanceList)
    return sortedList[:k]


def predict(KnearestList, k):
  # return : The most frequent class
    classvalue = [i[-1] for i in KnearestList]
    return max(set(classvalue), key=classvalue.count)



def knn(X_train, y_train, X_test, k):
    predictions = []
    for i in range(len(X_test)):
        distance = disWithValue(X_train,X_test[i])
        kNearestList = selectKNearest(distance,k)
        predictions.append(predict(kNearestList,k))
    return predictions

k = 3
KnnPred = knn(X_train, y_train, X_test,k)
print('KNN prediction when k = 3:')
print('\nPREDICTED : ACTUAL')
print('------------------')
for i in range(len(KnnPred)):
  print('    {:.2f}  :  {:.2f}  '.format(KnnPred[i],y_test[i]))
print()



KNN prediction when k = 3:

PREDICTED : ACTUAL
------------------
    0.00  :  0.00  
    0.00  :  0.00  
    0.00  :  0.00  
    0.00  :  0.00  
    0.00  :  0.00  
    1.00  :  1.00  
    1.00  :  1.00  
    1.00  :  1.00  
    1.00  :  1.00  
    1.00  :  1.00  
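
The assignment also suggests trying several values of k. The sketch below is one compact way to do that on the same contrived data. It is an independent illustration rather than the code above: the helper name `knn_predict` is hypothetical, and it assumes Python 3.8+ for `math.dist`.

```python
import math
from collections import Counter

# Same contrived data as above: two features followed by the class label.
dataset = [[3.393533211, 2.331273381, 0],
           [3.110073483, 1.781539638, 0],
           [1.343808831, 3.368360954, 0],
           [3.582294042, 4.67917911, 0],
           [2.280362439, 2.866990263, 0],
           [7.423436942, 4.696522875, 1],
           [5.745051997, 3.533989803, 1],
           [9.172168622, 2.511101045, 1],
           [7.792783481, 3.424088941, 1],
           [7.939820817, 0.791637231, 1]]

def knn_predict(X_train, y_train, x, k):
    """Predict the class of one instance from its k nearest training instances."""
    neighbours = sorted(
        (math.dist(x, row), label) for row, label in zip(X_train, y_train)
    )[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

X = [row[:-1] for row in dataset]
y = [row[-1] for row in dataset]

for k in (1, 3, 5):
    predictions = [knn_predict(X, y, x, k) for x in X]
    correct = sum(p == a for p, a in zip(predictions, y))
    print(f"k={k}: {correct}/{len(y)} correct")
```

Because the test set here is the training set, every instance is its own nearest neighbour, so k=1 is trivially perfect and says nothing about generalization.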



### Part 2: Working with real data

Apply the KNN algorithm above to the abalone data set. You can find more about it here: http://archive.ics.uci.edu/ml/datasets/Abalone

I've started the process, because I want to show you another part of scikit learn. I've loaded in the data, and shown the head of the data. Pay attention to the sex column. 


In [None]:
import pandas as pd

labels = ['sex','length','diameter','height','whole_weight','shucked_weight',
          'viscera_weight','shell_weight','rings']

abalone_data = pd.read_csv('https://raw.githubusercontent.com/nixwebb/CSV_Data/master/abalone.csv',names=labels)
abalone_data.head()


|   | sex | length | diameter | height | whole_weight | shucked_weight | viscera_weight | shell_weight | rings |
|---|---|---|---|---|---|---|---|---|---|
| 0 | M | 0.455 | 0.365 | 0.095 | 0.514 | 0.2245 | 0.101 | 0.15 | 15 |
| 1 | M | 0.35 | 0.265 | 0.09 | 0.2255 | 0.0995 | 0.0485 | 0.07 | 7 |
| 2 | F | 0.53 | 0.42 | 0.135 | 0.677 | 0.2565 | 0.1415 | 0.21 | 9 |
| 3 | M | 0.44 | 0.365 | 0.125 | 0.516 | 0.2155 | 0.114 | 0.155 | 10 |
| 4 | I | 0.33 | 0.255 | 0.08 | 0.205 | 0.0895 | 0.0395 | 0.055 | 7 |


That first column is nominal data, not numeric. We first have to change it into numbers - either replacing each value with an integer (label encoding) or creating new columns to represent each possible value (one-hot encoding).
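
The choice of encoding matters for a distance-based method like KNN. The sketch below compares the distances each encoding produces between the three sex values. The integer codes mirror what the label encoder produces later in this notebook (F=0, I=1, M=2); the helper names are hypothetical and `math.dist` assumes Python 3.8+.

```python
import math

# Label encoding: one integer per category.
label_codes = {"F": 0, "I": 1, "M": 2}
# One-hot encoding: one 0/1 column per category.
one_hot_codes = {"F": [1, 0, 0], "I": [0, 1, 0], "M": [0, 0, 1]}

def label_distance(a, b):
    """Distance between two categories under label encoding."""
    return abs(label_codes[a] - label_codes[b])

def one_hot_distance(a, b):
    """Euclidean distance between two categories under one-hot encoding."""
    return math.dist(one_hot_codes[a], one_hot_codes[b])

for a, b in [("F", "I"), ("F", "M"), ("I", "M")]:
    print(a, b, label_distance(a, b), round(one_hot_distance(a, b), 3))
```

Under label encoding F and M end up twice as far apart as F and I, an ordering that comes only from the arbitrary integer codes. Under one-hot encoding every pair of distinct categories is the same distance apart.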

For label encoding, I use the method in scikit learn. For one-hot encoding, if my data is in a pandas dataframe, I prefer the pandas method. Ultimately it doesn't matter, and you should feel free to check out both.

I'll do one-hot encoding here. The method in pandas is called get_dummies. I'm going to do that in stages, to show you what happens, but once you're convinced it works you don't need to print after every step like this.

The get_dummies method in pandas creates my new columns based on feature values. There are three possible feature values for the sex feature (M,F,I - and you should know what these mean), so I create three columns.

Notice that in pandas I can refer to a column using a built in attribute. The abalone_data dataframe has a column labeled 'sex', so I can use abalone_data.sex to access that column. I'm creating column headings from the feature values, and adding the prefix 'sex' to each value.

In [None]:
abalone_sex = pd.get_dummies(abalone_data.sex, prefix='sex')
abalone_sex.head()

|   | sex_F | sex_I | sex_M |
|---|---|---|---|
| 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 1 |
| 2 | 1 | 0 | 0 |
| 3 | 0 | 0 | 1 |
| 4 | 0 | 1 | 0 |


Then I need to add these columns back into my overall dataframe. I'm using the pandas method concat to do that.

In [None]:
abalone_ohe = pd.concat([abalone_sex,abalone_data],axis=1)
abalone_ohe.head()

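|   | sex_F | sex_I | sex_M | sex | length | diameter | height | whole_weight | shucked_weight | viscera_weight | shell_weight | rings |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | M | 0.455 | 0.365 | 0.095 | 0.514 | 0.2245 | 0.101 | 0.15 | 15 |
| 1 | 0 | 0 | 1 | M | 0.35 | 0.265 | 0.09 | 0.2255 | 0.0995 | 0.0485 | 0.07 | 7 |
| 2 | 1 | 0 | 0 | F | 0.53 | 0.42 | 0.135 | 0.677 | 0.2565 | 0.1415 | 0.21 | 9 |
| 3 | 0 | 0 | 1 | M | 0.44 | 0.365 | 0.125 | 0.516 | 0.2155 | 0.114 | 0.155 | 10 |
| 4 | 0 | 1 | 0 | I | 0.33 | 0.255 | 0.08 | 0.205 | 0.0895 | 0.0395 | 0.055 | 7 |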


And finally, now I've encoded the sex using one-hot encoding, I'm going to drop the sex column from the dataframe. However, just to show you how the scikit learn label encoder works, execute the code in the first code cell below, and check out what happens to the sex column values. Then run the second cell to drop the sex column from the data.

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
abalone_ohe['sex'] = le.fit_transform(abalone_ohe.sex.values)
abalone_ohe.head()


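|   | sex_F | sex_I | sex_M | sex | length | diameter | height | whole_weight | shucked_weight | viscera_weight | shell_weight | rings |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 2 | 0.455 | 0.365 | 0.095 | 0.514 | 0.2245 | 0.101 | 0.15 | 15 |
| 1 | 0 | 0 | 1 | 2 | 0.35 | 0.265 | 0.09 | 0.2255 | 0.0995 | 0.0485 | 0.07 | 7 |
| 2 | 1 | 0 | 0 | 0 | 0.53 | 0.42 | 0.135 | 0.677 | 0.2565 | 0.1415 | 0.21 | 9 |
| 3 | 0 | 0 | 1 | 2 | 0.44 | 0.365 | 0.125 | 0.516 | 0.2155 | 0.114 | 0.155 | 10 |
| 4 | 0 | 1 | 0 | 1 | 0.33 | 0.255 | 0.08 | 0.205 | 0.0895 | 0.0395 | 0.055 | 7 |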


In [None]:
abalone_ohe.drop('sex',axis=1,inplace=True)
abalone_ohe.head()

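|   | sex_F | sex_I | sex_M | length | diameter | height | whole_weight | shucked_weight | viscera_weight | shell_weight | rings |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0.455 | 0.365 | 0.095 | 0.514 | 0.2245 | 0.101 | 0.15 | 15 |
| 1 | 0 | 0 | 1 | 0.35 | 0.265 | 0.09 | 0.2255 | 0.0995 | 0.0485 | 0.07 | 7 |
| 2 | 1 | 0 | 0 | 0.53 | 0.42 | 0.135 | 0.677 | 0.2565 | 0.1415 | 0.21 | 9 |
| 3 | 0 | 0 | 1 | 0.44 | 0.365 | 0.125 | 0.516 | 0.2155 | 0.114 | 0.155 | 10 |
| 4 | 0 | 1 | 0 | 0.33 | 0.255 | 0.08 | 0.205 | 0.0895 | 0.0395 | 0.055 | 7 |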


The y value for this data, the thing we're predicting, is the category represented by the number of rings.

Using this dataset, extract X and y data, normalize the X values, and run a 5-fold cross-validation, with k set as 5. Also run a classification baseline. Report on classification accuracy, and write up some results.

NOTE: This will be SLOW. If you're not sure your code is working, I recommend starting with a 3 fold cross-validation, and use k=1. You will probably get some UserWarnings from python also. I will explain these in class later.
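
Normalizing here means min-max scaling: each feature value is mapped to (value - min) / (max - min), so every column ends up in the range 0 to 1. The sketch below applies that formula by hand to the length of the first abalone, using the column minimum and maximum reported in the scaler output further down. The function name `min_max_scale` is a hypothetical helper, not a scikit-learn function.

```python
def min_max_scale(value, feature_min, feature_max):
    """Rescale one feature value to the range [0, 1]."""
    return (value - feature_min) / (feature_max - feature_min)

# Length of the first abalone: raw value 0.455, column min 0.075, column max 0.815.
scaled_length = min_max_scale(0.455, 0.075, 0.815)
print(round(scaled_length, 8))  # 0.51351351
```

Without this step, features with large ranges (such as whole_weight) would dominate the Euclidean distance over features with small ranges.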

In [None]:
from sklearn.preprocessing import MinMaxScaler
# extract X and y data. 

abalone_value = abalone_ohe.values
X_values = abalone_value[:, :-1]
y_values = abalone_value[:, -1]

rows,cols = X_values.shape
print("This is the abalone data set. It has", rows, "instances, and it has", cols, "input features.\n")
print("The first FIVE instances look like:")

# normalize X values:

# Show the first five instances
print(X_values[:5])
print()
# Load and fit the scaler

scaler = MinMaxScaler()
scaler.fit(X_values)
# Use some attributes of the scaler to show min and max values per feature
# Note these should align with the information from the pandas .describe
# method

print("MAX values:",scaler.data_max_)
print("MIN values:",scaler.data_min_)
print()

# Transform our X_values, so that data is now scaled
# Note we can apply this transform to any data, including new data
# and it will preserve the min and max values given above

X_values = scaler.transform(X_values)

# Take another look at those first five instances that should now be 
# normalized

print("After normalization, the first FIVE instances look like:")
print(X_values[:5])


This is the abalone data set. It has 4177 instances, and it has 10 input features.

The first FIVE instances look like:
[[0.     0.     1.     0.455  0.365  0.095  0.514  0.2245 0.101  0.15  ]
 [0.     0.     1.     0.35   0.265  0.09   0.2255 0.0995 0.0485 0.07  ]
 [1.     0.     0.     0.53   0.42   0.135  0.677  0.2565 0.1415 0.21  ]
 [0.     0.     1.     0.44   0.365  0.125  0.516  0.2155 0.114  0.155 ]
 [0.     1.     0.     0.33   0.255  0.08   0.205  0.0895 0.0395 0.055 ]]

MAX values: [1.     1.     1.     0.815  0.65   1.13   2.8255 1.488  0.76   1.005 ]
MIN values: [0.     0.     0.     0.075  0.055  0.     0.002  0.001  0.0005 0.0015]

After normalization, the first FIVE instances look like:
[[0.         0.         1.         0.51351351 0.5210084  0.0840708
  0.18133522 0.15030262 0.1323239  0.14798206]
 [0.         0.         1.         0.37162162 0.35294118 0.07964602
  0.07915707 0.06624075 0.06319947 0.06826109]
 [1.         0.         0.         0.61486486 0.61344538 0.11

In [None]:
from sklearn.model_selection import StratifiedKFold

X_train = X_values
X_test = X_train
y_train = y_values
y_test = y_train

# ZeroR
def accuracy(actual_value, predicted_value):
  # return how many times the function predicts correctly in ratio.
  counter = 0 
  for i in range(len(actual_value)):
    if actual_value[i] == predicted_value[i]:
      counter += 1
  return counter/len(actual_value)

def zeroRC(train, test):
  # return baseline
  commonElement = max(set(train), key = train.count)
  return [commonElement for i in range(len(test))]

# 5-fold cross validation
skf = StratifiedKFold(n_splits=5)
zeroR_scores = []
knn_scores = []
k = 3

for train_index, test_index in skf.split(X_values,y_values):
    X_train, X_test = X_values[train_index], X_values[test_index]
    y_train, y_test = y_values[train_index], y_values[test_index]
  
    # Run KNN and the ZeroR baseline on this fold,
    # then calculate accuracy and append each score to the appropriate list
    KNN_Pred = knn(X_train,y_train,X_test,k)
    zeroR_pred = zeroRC(y_train.tolist(),X_test)

    accuracy_zeroR = accuracy(y_test, zeroR_pred) * 100
    zeroR_scores.append(accuracy_zeroR)

    accuracy_knn = accuracy(y_test, KNN_Pred) * 100
    knn_scores.append(accuracy_knn)

print("knn score:", knn_scores)
print("ZeroR score:", zeroR_scores)
knn_average_accuracy1 = (sum(knn_scores)/len(knn_scores))
zr_average_accuracy1 = (sum(zeroR_scores)/len(zeroR_scores))
KNN_min = min(knn_scores)
KNN_max = max(knn_scores) 
zr_min = min(zeroR_scores) 
zr_max = max(zeroR_scores) 
print('KNN accuracy: {:.2f}% ({:.2f}% / {:.2f}%)'.format(knn_average_accuracy1, KNN_min, KNN_max))

print('ZeroR accuracy: {:.2f}% ({:.2f}% / {:.2f}%)'.format(zr_average_accuracy1, zr_min, zr_max))
  





knn score: [17.942583732057415, 24.16267942583732, 22.275449101796408, 23.592814371257482, 23.233532934131738]
ZeroR score: [16.507177033492823, 16.507177033492823, 16.526946107784433, 16.407185628742514, 16.526946107784433]
KNN accuracy: 22.24% (17.94% / 24.16%)
ZeroR accuracy: 16.50% (16.41% / 16.53%)


#### Analysis
The ZeroR baseline for the abalone dataset averages 16.50% accuracy across the five folds. KNN (k=3) performs better than the baseline, with an average accuracy of 22.24%.

In other words, always predicting the most common ring count is correct for about 16.5% of the test instances, while the KNN model predicts the correct ring count (and so the correct age class) for about 22.2% of them. That is an improvement over the baseline, but the absolute accuracy is still low.

We cannot say which features are more important based on the KNN model, since KNN only calculates the distances between instances and does not learn any feature weights.

Apart from that, it might be a good idea to use another classification model, because KNN takes a very long time to run. Every prediction compares the test instance with every training instance, so the more data there is, the less efficient KNN becomes. It is not a good choice for training and predicting on a large amount of data.
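
One small saving, offered as a sketch rather than a measured improvement, is to avoid sorting every distance when only the k smallest are needed. The standard library's `heapq.nsmallest` does this. The function name `select_k_nearest` and the sample pairs below are hypothetical.

```python
import heapq

def select_k_nearest(distance_list, k):
    """Return the k (distance, class) pairs with the smallest distances."""
    return heapq.nsmallest(k, distance_list)

# Hypothetical (distance, class) pairs for one test instance.
pairs = [(2.4, 1), (0.0, 0), (1.3, 0), (3.9, 1), (0.6, 0)]
print(select_k_nearest(pairs, 3))  # [(0.0, 0), (0.6, 0), (1.3, 0)]
```

This returns the same pairs as `sorted(pairs)[:3]`. The distance calculations themselves are unchanged, so any speed-up is likely to be modest.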



### Part 3: KNN regression

We can also run KNN as a regression algorithm. In this case, instead of predicting the most common class in the k nearest neighbors, we can assign a predicted value that is the mean of the values of the k neighbors. 

Make this change to your algorithm (presumably by simply implementing a new predict function below, and then calling this new predict function from your knn algorithm, because you divided your code up sensibly in Part 1), and run the abalone data as a regression problem. To do this, use the same number of folds and the same k value as before. Also run a regression baseline and report RMSE values for both. Give me some explanation of the results, both standalone and in comparison to the classification results above.
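
The two new pieces, a mean-based prediction and the RMSE metric, can be sketched in plain Python as follows. The names `predict_mean` and `rmse` and the neighbour values are illustrative assumptions only.

```python
import math

def predict_mean(k_nearest):
    """Average the target values of (distance, value) neighbour pairs."""
    values = [value for _, value in k_nearest]
    return sum(values) / len(values)

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical (distance, rings) pairs for the 3 nearest neighbours.
neighbours = [(0.10, 9), (0.15, 10), (0.22, 11)]
print(predict_mean(neighbours))  # 10.0

# Errors of 0 and 3 rings give sqrt((0 + 9) / 2), roughly 2.12.
print(rmse([10, 7], [10.0, 10.0]))
```

Unlike accuracy, RMSE gives partial credit: a prediction that is one ring off costs far less than one that is five rings off.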


In [None]:
import numpy as np
def getMean(inputList):
  # return: the average (the mean) for any list of values(numbers).
  
  return np.mean(inputList)


def predictReg(KnearestList, k):
  # return : The mean 
    classValue = [i[-1] for i in KnearestList]
    return getMean(classValue)

def knnReg(X_train, y_train, X_test, k):
    predictions = []
    for i in range(len(X_test)):
        distance = disWithValue(X_train,X_test[i])
        kNearestList = selectKNearest(distance,k)
        predictions.append(predictReg(kNearestList,k))
    return predictions

def zeroRR(ytrain, Xtest):
  # return: compute the mean from the ytrain values and return the list of predictions.
  meanOfytrain = getMean(ytrain)
  prediction = []
  for num in Xtest:
    prediction.append(meanOfytrain)
  return prediction

def rmse(actual, predicted):
  # return: prediction error.
  predictionError = 0.0
  for i in range(len(actual)):
    predictionError += (actual[i] - predicted[i])**2
  predictionError_avg = predictionError/len(actual)
  predictionError_avg = predictionError_avg**0.5
  return predictionError_avg

prediction = zeroRR(y_train, X_train)
print("ZeroR RMSE with same train and test:", rmse(y_train, prediction))
# 5-fold cross validation
skf = StratifiedKFold(n_splits=5)
zeroRR_scores = []
knnReg_scores = []
k = 3

for train_index, test_index in skf.split(X_values,y_values):
    X_train, X_test = X_values[train_index], X_values[test_index]
    y_train, y_test = y_values[train_index], y_values[test_index]

    # Run KNN regression and the ZeroR baseline on this fold,
    # then calculate RMSE and append each score to the appropriate list
    KnnReg_Pred = knnReg(X_train,y_train,X_test,k)
    rmse_knn = rmse(y_test, KnnReg_Pred)
    knnReg_scores.append(rmse_knn)

    zeroRR_pred = zeroRR(y_train,X_test)
    rmse_zeroRR = rmse(y_test, zeroRR_pred)
    zeroRR_scores.append(rmse_zeroRR)
    


print("KNN Regression score:", knnReg_scores)
print("ZeroR score:", zeroRR_scores)
print('KNN Regression RMSE: {:.2f} ({:.2f} / {:.2f})'.format(getMean(knnReg_scores), min(knnReg_scores), max(knnReg_scores)))

print('ZeroR RMSE: {:.2f} ({:.2f} / {:.2f})'.format(getMean(zeroRR_scores), min(zeroRR_scores), max(zeroRR_scores)))





ZeroR RMSE with same train and test: 0.5
KNN Regression score: [2.5749013493849255, 2.430014559675294, 2.398685601484791, 2.4536149278642325, 2.425359477714809]
ZeroR score: [3.276135506905344, 3.2095351101549667, 3.2037611342745986, 3.227275777276688, 3.2016461361389403]
KNN Regression RMSE: 2.46 (2.40 / 2.57)
ZeroR RMSE: 3.22 (3.20 / 3.28)


#### Analysis
According to the results, the ZeroR baseline has an average RMSE of 3.22. This means that always predicting the mean ring count of the training data gives a typical error of about 3.22 rings. The RMSE of the KNN regression is 2.46, which is smaller than the baseline. That is, predictions made with KNN have a typical error of about 2.46 rings.

This shows that KNN regression did better than the baseline. However, considering the time cost of KNN regression, I would rather choose another model.

Apart from that, it's also hard for us to know which features are more important, since KNN only calculates the distance between two instances.

Looking at the dataset and y_values, the class values are numeric ring counts. Classification is suited to predicting categorical values and treats every wrong ring count as equally wrong, whereas regression treats the target as a number and rewards predictions that are close. Thus, regression fits this task better.

### Part 4: Introduction to scikit-learn

One of the most popular open-source python machine learning libraries is scikit-learn. You can find out more in general at: https://scikit-learn.org/stable/index.html


As we go through this class I'll introduce you to some of the functionality. Below I want you to use BOTH a KNN Classifier and the KNN Regressor.

I also want you to explore the cross_val_score function. Previously you used the StratifiedKFold function. You then had to fit the model, then use the predict method to apply the model, then collect the scores, find the mean and the min and the max. cross_val_score does most of that for you. 

cross_val_score takes a model (the classifier, or regressor) you want to use, X and y data, a value for cv (the default is 5 for a 5-fold cross validation), and a scoring metric. It does all the fitting, prediction and collection of scores for you. By default, it uses the model's own score method, which is accuracy for classifiers, so the default is great for classification.

What is returned is an array of the cross-validation scores. You can apply the .mean(), .min() and .max() methods to this array to generate scores as you have before.

If we want to cross-validate a regression algorithm, then we need to change the scoring. Use the parameter scoring='neg_mean_squared_error'.

To turn this neg_mean_squared_error into a meaningful score for us, we'll need to take the absolute value (to reverse the sign) and then apply math.sqrt to transform the results into RMSE.
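
The conversion can be sketched on its own. The scores below are hypothetical values written in the sign convention of 'neg_mean_squared_error', not results from the abalone data.

```python
import math

# Hypothetical per-fold scores: negated mean squared errors.
neg_mse_scores = [-4.0, -6.25, -9.0]

# Reverse the sign, then take the square root to get RMSE per fold.
rmse_scores = [math.sqrt(abs(score)) for score in neg_mse_scores]
print(rmse_scores)  # [2.0, 2.5, 3.0]

# Summarize as before: mean, min and max across folds.
print(sum(rmse_scores) / len(rmse_scores), min(rmse_scores), max(rmse_scores))  # 2.5 2.0 3.0
```

The scores are negated because scikit-learn treats higher scores as better, whereas a smaller error is better.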

The links to the relevant documentation pages are:
- [KNN Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)
- [KNN Regressor](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html#sklearn.neighbors.KNeighborsRegressor)
- [cross_val_score](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score)

I'll load the relevant models from scikit-learn, but it's up to you to train and test them, and report the scores appropriately, including comparison to baselines (use scikit-learn's dummy classifier and dummy regressor) and write up. Your scores should be broadly the same as those from your code, above.

NOTE: I've added some code below to suppress the user warnings from earlier.


In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score
from sklearn.dummy import DummyClassifier
from sklearn.dummy import DummyRegressor
import math

# There are some user warnings that this time I'm going to suppress
import warnings
warnings.filterwarnings("ignore",category=UserWarning)


# NOTE: if the cells are run in order, X_train, y_train, X_test and y_test still
# hold the last fold from the loop above, so the cross_val_score calls below
# cross-validate on that held-out split only. Passing X_values and y_values
# instead would cross-validate on the full dataset.

# KNN classifier
knnClass = KNeighborsClassifier(n_neighbors=3)
KNNClass = knnClass.fit(X_train, y_train)
knnClass_pred = KNNClass.predict(X_test)

knnClassScore = cross_val_score(KNNClass, X_test, y_test, cv=5)
knnClassEval = [i*100 for i in knnClassScore]

# KNN Regressor (named so that it does not shadow the knnReg function from Part 3)
knnRegModel = KNeighborsRegressor(n_neighbors=3)
KNNReg = knnRegModel.fit(X_train, y_train)
knnReg_pred = KNNReg.predict(X_test)

knnRegScore = cross_val_score(KNNReg, X_test, y_test, scoring='neg_mean_squared_error')
knnRegEval = [abs(i)**(0.5) for i in knnRegScore]

# ZeroR classification baseline
zr_clf = DummyClassifier(strategy="most_frequent")
zeroR = zr_clf.fit(X_train, y_train)
zeroR_pred = zeroR.predict(X_test)
zrScore = cross_val_score(zeroR, X_test, y_test, cv=5)
zrEval = [j*100 for j in zrScore]

# ZeroR regression baseline (named so that it does not shadow the zeroRR function from Part 3)
zr = DummyRegressor()
zeroRRModel = zr.fit(X_train, y_train)
zr_predY = zeroRRModel.predict(X_test)

zeRScore = cross_val_score(zeroRRModel, X_test, y_test, scoring='neg_mean_squared_error')
zeREval = [abs(i)**(0.5) for i in zeRScore]

print()
print('KNN accuracy sklearn: {:.2f}% ({:.2f}% / {:.2f}%)'.format(getMean(knnClassEval), min(knnClassEval), max(knnClassEval)))
print('KNN accuracy: {:.2f}% ({:.2f}% / {:.2f}%)'.format(knn_average_accuracy1, KNN_min, KNN_max))
print()
print('ZeroR accuracy sklearn: {:.2f}% ({:.2f}% / {:.2f}%)'.format(getMean(zrEval), min(zrEval), max(zrEval)))
print('ZeroR accuracy: {:.2f}% ({:.2f}% / {:.2f}%)'.format(zr_average_accuracy1, zr_min, zr_max))
print()
print('KNN RMSE sklearn: {:.2f} ({:.2f} / {:.2f})'.format(getMean(knnRegEval), min(knnRegEval), max(knnRegEval)))
print('KNN RMSE: {:.2f} ({:.2f} / {:.2f})'.format(getMean(knnReg_scores), min(knnReg_scores), max(knnReg_scores)))
print()
print('ZeroR RMSE sklearn: {:.2f} ({:.2f} / {:.2f})'.format(getMean(zeREval), min(zeREval), max(zeREval)))
print('ZeroR RMSE: {:.2f} ({:.2f} / {:.2f})'.format(getMean(zeroRR_scores), min(zeroRR_scores), max(zeroRR_scores)))




KNN accuracy sklearn: 21.08% (17.37% / 23.35%)
KNN accuracy: 22.24% (17.94% / 24.16%)

ZeroR accuracy sklearn: 16.53% (16.17% / 16.77%)
ZeroR accuracy: 16.50% (16.41% / 16.53%)

KNN RMSE sklearn: 2.36 (1.43 / 3.91)
KNN RMSE: 2.46 (2.40 / 2.57)

ZeroR RMSE sklearn: 3.04 (1.91 / 5.20)
ZeroR RMSE: 3.22 (3.20 / 3.28)


#### Analysis
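The results from sklearn are similar to the results from the code we wrote ourselves, but they are not identical. The folds are not the same in the two runs: the sklearn scores come from cross-validating on the X_test and y_test split rather than on the full dataset, so each fold is much smaller, which also explains the wider spread between the minimum and maximum scores. The cross_val_score method in sklearn is efficient and easy to use, and it collects the scores for us automatically.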

For the KNN classification:

The sklearn baseline for the abalone dataset is 16.53% accuracy. KNN performs better than the baseline, with an average accuracy of 21.08%.

That is, always predicting the most common ring count is correct for 16.53% of the instances, whereas the KNN model predicts the correct ring count (and so the correct age class) for 21.08% of them.

For the KNN regression:

According to the results, the baseline RMSE is 3.04. This means that always predicting the mean ring count gives a typical error of about 3.04 rings. The RMSE of the KNN regression is 2.36, which is smaller than the baseline. That is, predictions made with KNN have a typical error of about 2.36 rings. This shows that KNN regression did better than the baseline.


Looking at the dataset and y_values, the class values are numeric ring counts. Classification is suited to predicting categorical values and treats every wrong ring count as equally wrong, whereas regression treats the target as a number and rewards predictions that are close. Thus, regression fits this task better.

We cannot say which features are more important based on the KNN model, since KNN only calculates the distances between instances.
