
**Consider the Abalone data set available at UCI ML Repository https://archive.ics.uci.edu/ml/index.php.**

In [67]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error



In [68]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data" 
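# The file has no header row, so read_csv takes the first record as column names
# (see the output below); passing header=None would keep it as data.
df = pd.read_csv(url)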

In [69]:
df.head()

Unnamed: 0,M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15
0,M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7
1,F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
2,M,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10
3,I,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7
4,I,0.425,0.3,0.095,0.3515,0.141,0.0775,0.12,8


**Change the names of the columns using the following code**

**df.columns = [ "Sex", "Length", "Diameter", "Height", "Whole weight", "Shucked weight", "Viscera weight", "Shell weight", "Rings"]**

In [70]:
df.columns = [ "Sex", "Length", "Diameter", 
              "Height", "Whole weight", "Shucked weight", 
              "Viscera weight", "Shell weight", "Rings"]

In [71]:
df.head()

Unnamed: 0,Sex,Length,Diameter,Height,Whole weight,Shucked weight,Viscera weight,Shell weight,Rings
0,M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7
1,F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
2,M,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10
3,I,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7
4,I,0.425,0.3,0.095,0.3515,0.141,0.0775,0.12,8



#### The goal is to use K-Nearest Neighbors Regressor to predict the number of rings df["Rings"] (target variable) using the features given. Note that one of the features is categorical and you can choose to drop it or to encode it numerically.

#### sklearn KNeighborsRegressor https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html

#### Divide the data set into subsets for training and testing.


In [72]:
# Encode the categorical column by assignment (in-place replace on a column selection is unreliable)
df['Sex'] = df['Sex'].map({'M': 2, 'F': 1, 'I': 0})


In [73]:
df['Rings'].unique()

array([ 7,  9, 10,  8, 20, 16, 19, 14, 11, 12, 15, 18, 13,  5,  4,  6, 21,
       17, 22,  1,  3, 26, 23, 29,  2, 27, 25, 24])

In [74]:
#df = df.drop(['Sex','Shucked weight'], axis = 1)
df = df.drop('Sex', axis = 1)

We drop 'Sex' because, unlike the other features, it is not a physical measurement.
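
If the categorical feature were kept instead, one-hot encoding is a common alternative to integer codes, because it does not impose an order between M, F and I that a distance-based model would treat as meaningful. A minimal sketch on a small made-up frame (the `toy` frame and its values are illustrative assumptions, not the notebook's data):

```python
import pandas as pd

# Toy frame standing in for the abalone data (illustrative values only).
toy = pd.DataFrame({
    "Sex": ["M", "F", "I", "M"],
    "Length": [0.35, 0.53, 0.33, 0.44],
})

# One indicator column per category; no ordering is implied between M, F and I.
encoded = pd.get_dummies(toy, columns=["Sex"], dtype=int)
print(encoded.columns.tolist())
```

Whether the extra columns help a KNN model depends on how they are scaled relative to the other features, so this is something to test rather than assume.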

In [75]:
y = df['Rings'].values
X = df.drop('Rings', axis = 1)
#X = df[['Shell weight', 'Diameter', 'Height','Length' ]].to_numpy()

In [76]:
y

array([ 7,  9, 10, ...,  9, 10, 12])

In [77]:
X

Unnamed: 0,Length,Diameter,Height,Whole weight,Shucked weight,Viscera weight,Shell weight
0,0.350,0.265,0.090,0.2255,0.0995,0.0485,0.0700
1,0.530,0.420,0.135,0.6770,0.2565,0.1415,0.2100
2,0.440,0.365,0.125,0.5160,0.2155,0.1140,0.1550
3,0.330,0.255,0.080,0.2050,0.0895,0.0395,0.0550
4,0.425,0.300,0.095,0.3515,0.1410,0.0775,0.1200
...,...,...,...,...,...,...,...
4171,0.565,0.450,0.165,0.8870,0.3700,0.2390,0.2490
4172,0.590,0.440,0.135,0.9660,0.4390,0.2145,0.2605
4173,0.600,0.475,0.205,1.1760,0.5255,0.2875,0.3080
4174,0.625,0.485,0.150,1.0945,0.5310,0.2610,0.2960


In [78]:
# hold out 20% of the data as the test set
X_train_full, X_test, y_train_full, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# split the remainder 75/25 into training and validation sets (validation is used to tune k)
X_train, X_valid, y_train, y_valid = train_test_split(X_train_full, y_train_full, test_size=0.25, random_state=42)


In [79]:
knnmod = KNeighborsRegressor(n_neighbors=14)
knnmod.fit(X_train, y_train)

In [80]:
y_pred = knnmod.predict(X_valid)
y_pred

array([10.78571429,  7.28571429,  8.21428571, 10.71428571,  4.35714286,
        9.        ,  8.64285714, 11.        , 10.07142857, 12.64285714,
       13.42857143, 11.57142857, 16.07142857, 11.        ,  8.64285714,
       11.5       , 12.92857143,  8.78571429, 12.5       ,  9.85714286,
        5.92857143, 10.64285714,  5.07142857,  5.07142857,  8.35714286,
        6.28571429,  7.28571429,  9.14285714, 11.57142857,  9.14285714,
       10.35714286,  9.64285714,  7.64285714,  6.42857143,  9.57142857,
        7.5       , 11.78571429, 10.92857143,  9.5       ,  6.71428571,
       10.28571429, 11.71428571, 10.21428571,  8.85714286, 11.92857143,
        7.5       ,  9.        , 14.5       , 10.85714286, 11.78571429,
       10.78571429, 10.21428571,  8.        , 10.78571429,  9.35714286,
        9.28571429, 13.78571429, 10.42857143, 11.5       ,  7.64285714,
       10.5       , 11.07142857,  5.35714286,  6.57142857, 16.5       ,
       11.92857143, 11.14285714,  8.07142857, 12.14285714, 12.42

We started with k = 5, which gave a validation MSE of 4.91, and then increased k in steps of 5. At k = 10 the MSE was 4.38; at k = 15 and k = 20 it was approximately 4.37. We then tested the values between 10 and 15 and found that k = 14 gave an MSE of 4.337, which is the value of k used above.
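
The manual search over k described above can be written as a loop. The sketch below uses synthetic data from `make_regression` as a stand-in so that it runs without the download; the names `X_demo`, `X_tr`, `X_va` and so on are assumptions for illustration, and in the notebook the `X_train`/`X_valid` split would take their place.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (assumption: replace with the abalone splits).
X_demo, y_demo = make_regression(n_samples=300, n_features=7,
                                 noise=10.0, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo,
                                          test_size=0.25, random_state=42)

# Validation MSE for each candidate k.
val_mse = {}
for k in range(5, 21):
    model = KNeighborsRegressor(n_neighbors=k).fit(X_tr, y_tr)
    val_mse[k] = mean_squared_error(y_va, model.predict(X_va))

# The k with the lowest validation MSE.
best_k = min(val_mse, key=val_mse.get)
print(best_k, val_mse[best_k])
```

Keeping every (k, MSE) pair in a dictionary makes it easy to plot the error against k afterwards.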

#### Since this is a regression problem, the appropriate error functions you may use are MSE, MAE, and RMSE. Consider several different choices of the hyper-parameter K and see how the error changes with respect to K.

In [81]:
mse = mean_squared_error(y_valid, y_pred)
mse

4.337400708786509

In [82]:
mean_absolute_error(y_valid, y_pred)

1.470059880239521

In [83]:
np.sqrt(mse)

2.082642722308968
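
The three error measures are closely related: MSE averages the squared errors, MAE averages the absolute errors, and RMSE is the square root of MSE, which puts it back in the units of the target (rings). A small hand-computed sketch with made-up values (`y_true` and `y_hat` here are illustrative, not the notebook's arrays):

```python
import math

# Made-up targets and predictions (illustrative only).
y_true = [7, 9, 10, 8]
y_hat = [8, 9, 12, 7]

errors = [t - p for t, p in zip(y_true, y_hat)]
mse = sum(e ** 2 for e in errors) / len(errors)   # mean squared error
mae = sum(abs(e) for e in errors) / len(errors)   # mean absolute error
rmse = math.sqrt(mse)                             # root mean squared error
print(mse, mae, rmse)
```

Because the errors are squared before averaging, MSE and RMSE penalize a few large misses more heavily than MAE does.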

In [84]:
knnmod.score(X_test, y_test)*100

55.58667533464421

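We can use the ‘score’ method to see how well our model's predictions match the actual results. For a regressor, ‘score’ returns the coefficient of determination R², not an accuracy. On the test set our model has R² ≈ 0.556, meaning it explains roughly 55.6% of the variance in the number of rings. Let's see if we can increase model performance.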
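
For a scikit-learn regressor, `score` returns the coefficient of determination R² = 1 − SS_res / SS_tot rather than a classification accuracy. The sketch below computes it by hand on made-up values; `r2_score_manual` is a hypothetical helper written for illustration, not a library function.

```python
def r2_score_manual(y_true, y_hat):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_hat))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up ring counts and predictions (illustrative only).
rings = [7, 9, 10, 8]
print(r2_score_manual(rings, [7.5, 9, 9.5, 8]))

# Always predicting the mean of the targets gives R^2 = 0.
print(r2_score_manual(rings, [8.5] * 4))
```

A value of 1 means perfect predictions, 0 means no better than predicting the mean, and negative values are possible for models that do worse than that.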

#### Use GridSearchCV to select the optimal values of K (and maybe other hyper-parameters) and report the model performance.

In [85]:

from sklearn.model_selection import cross_val_score, GridSearchCV, RandomizedSearchCV


In [86]:
# Reuse the same 80/20 split; GridSearchCV cross-validates within the full training portion
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=42)




In [87]:
gs = GridSearchCV(estimator = KNeighborsRegressor(),
                  #dictionary of all values we want to test for n_neighbors
                  param_grid = {"n_neighbors": range(1,31), #1 to 30; we get 30 choices
                                "weights": ['distance']},  
                  cv=10)

gs.fit(X_train, y_train)

gs.cv_results_

{'mean_fit_time': array([0.00408585, 0.00347254, 0.00358813, 0.00348098, 0.00314555,
        0.00315082, 0.00317593, 0.00314665, 0.00307841, 0.00319469,
        0.00312421, 0.00313148, 0.00348694, 0.00328956, 0.00329783,
        0.00316212, 0.00312788, 0.00309439, 0.00312278, 0.00315788,
        0.00312726, 0.00311556, 0.00317595, 0.00316815, 0.00310986,
        0.00314963, 0.00320845, 0.00312829, 0.00312293, 0.00318136]),
 'std_fit_time': array([1.63773117e-03, 3.49930474e-04, 4.88515934e-04, 3.83134859e-04,
        9.59566037e-05, 1.41389303e-04, 1.50629522e-04, 1.24262519e-04,
        6.20861207e-05, 1.12456726e-04, 1.01928694e-04, 7.07880942e-05,
        6.75142542e-04, 2.65858551e-04, 1.79994831e-04, 1.08967070e-04,
        1.35885207e-04, 7.33690944e-05, 1.03350285e-04, 1.31312956e-04,
        6.87803552e-05, 6.29429745e-05, 1.33906599e-04, 1.37629520e-04,
        1.06222637e-04, 1.34794588e-04, 1.52763839e-04, 1.10695515e-04,
        9.90414568e-05, 1.18667669e-04]),
 'mean_scor

In [88]:
gs.best_params_

{'n_neighbors': 19, 'weights': 'distance'}

In [89]:
gs.best_score_ *100

54.893911337394144

In [90]:
y_pred = gs.predict(X_test)
mean_squared_error(y_test, y_pred)

4.826267093812806

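GridSearchCV selects 19 as the optimal value for ‘n_neighbors’ (with distance weighting). The ‘best_score_’ attribute is the mean cross-validated R² for that setting, about 0.549, which is roughly 0.7 percentage points below the test-set R² of 0.556 obtained with k = 14. The test-set MSE of 4.83 is also higher than the 4.34 measured earlier, although that earlier figure was computed on the validation set, so the two are not directly comparable. On these numbers, the grid search did not improve the model, which is not what we want.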
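
One possible variant of the search, sketched here on synthetic stand-in data so it runs without the download: it adds `'uniform'` to the `weights` grid and sets `scoring='neg_mean_squared_error'`, so the cross-validated score is an MSE. The names `X_demo`, `y_demo` and `gs_demo` are assumptions for illustration; in the notebook, `X_train` and `y_train` would be passed to `fit`.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (assumption: replace with the abalone training set).
X_demo, y_demo = make_regression(n_samples=300, n_features=7,
                                 noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=42)

gs_demo = GridSearchCV(
    estimator=KNeighborsRegressor(),
    param_grid={"n_neighbors": range(1, 31),
                "weights": ["uniform", "distance"]},
    scoring="neg_mean_squared_error",  # scikit-learn maximizes, hence the sign
    cv=5,
)
gs_demo.fit(X_tr, y_tr)

print(gs_demo.best_params_)
print(-gs_demo.best_score_)  # mean cross-validated MSE of the best setting
```

Scoring with MSE keeps the grid-search result in the same units as the validation MSE reported earlier, which makes the two easier to compare.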