# K-nearest neighbor algorithm API

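sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, algorithm='auto')  
n_neighbors:  
int, optional (default = 5). The number of neighbors to use by default for k-neighbors queries.  
algorithm: {'auto', 'ball_tree', 'kd_tree', 'brute'}  
The algorithm used to compute the nearest neighbors. With 'auto', scikit-learn picks the most appropriate algorithm based on the values passed to fit.  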
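
As a minimal sketch of how these two parameters are passed, the following fits a classifier on a tiny one-dimensional data set. The points and labels are made up for illustration and are not part of the iris case below.

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy one-dimensional data, invented for illustration only.
X = [[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]]
y = [0, 0, 0, 1, 1, 1]

# n_neighbors sets k; algorithm selects the neighbor-search strategy.
knn = KNeighborsClassifier(n_neighbors=3, algorithm="auto")
knn.fit(X, y)

# Each query point is assigned the majority label of its 3 nearest neighbors.
pred = knn.predict([[1.5], [10.5]])
print(pred)
```

The three nearest neighbors of 1.5 all carry label 0 and those of 10.5 all carry label 1, so the two queries fall into different classes.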

# Case: Prediction of iris flower species with K tuning

## Introduction to the data set

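The Iris data set is a widely used data set for classification experiments, published by R. A. Fisher in 1936. Also called the iris flower data set, it is a multivariate data set: 150 samples from three iris species (setosa, versicolor, virginica), each described by four features (sepal length, sepal width, petal length, petal width).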
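
A quick way to check this structure is to inspect the object returned by load_iris. This is a small sketch; the variable name iris_data is an arbitrary choice.

```python
from sklearn.datasets import load_iris

iris_data = load_iris()

# data holds the feature matrix, target the integer class labels.
print(iris_data.data.shape)      # (number of samples, number of features)
print(iris_data.feature_names)   # names of the four measurements
print(iris_data.target_names)    # names of the three species
```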

## Analysis steps

1. Get the data set
2. Data processing: split the data set
3. Feature Engineering: Normalization and standardization
4. Machine learning (model training)
5. Model evaluation

## Code

### Import module

In [80]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier


### Get the data set 

1. Load the data set by calling load_iris

In [81]:
iris = load_iris()

In [82]:
#iris

### Data Processing: divide the data set

In [83]:
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, 
    iris.target, 
    test_size = 0.2, 
    random_state = 22)

In [84]:
#print(x_test)
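
To make the effect of test_size concrete, the sketch below repeats the split and prints the resulting shapes. The second call uses stratify, which is an optional variant and an assumption here (the original split does not use it); it keeps the class proportions the same in both parts.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

data = load_iris()

# Same call as in the case study: 20% of the 150 samples go to the test set.
x_tr, x_te, y_tr, y_te = train_test_split(
    data.data, data.target, test_size=0.2, random_state=22
)
print(x_tr.shape, x_te.shape)

# Optional variant (not used in the original): stratify on the labels so
# that every species is equally represented in the test set.
_, _, _, y_te_strat = train_test_split(
    data.data, data.target, test_size=0.2, random_state=22, stratify=data.target
)
print(np.bincount(y_te_strat))   # number of test samples per class
```

With 150 samples and test_size = 0.2, the training set has 120 rows and the test set has 30.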

### Feature Engineering: Normalization and standardization

- Standardize the feature variables

In [85]:
transfer = StandardScaler()

In [86]:
x_train = transfer.fit_transform(x_train)

In [87]:
# x_train

In [88]:
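# Reuse the mean and variance learned from the training set; calling
# fit_transform here would refit the scaler on the test data.
# The outputs recorded below came from a run that used fit_transform on the
# test set, so rerunning this cell may give different numbers.
x_test = transfer.transform(x_test)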

In [89]:
#x_test
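
The usual pattern with StandardScaler is to learn the mean and standard deviation from the training data only and then apply those same statistics to the test data. The sketch below shows this on made-up numbers; the arrays train and test are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values for illustration only.
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[2.0], [4.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean_ and scale_ from train
test_scaled = scaler.transform(test)        # reuses the training statistics

print(scaler.mean_, scaler.scale_)
print(train_scaled.ravel())
print(test_scaled.ravel())
```

Here the training mean is 2, so the test value 2.0 maps to 0 while 4.0 maps to a positive value well outside the training range. Calling fit_transform on the test array instead would recenter it around its own mean and hide that difference.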

### Model training and prediction

- Machine learning (model training):  
 1. Model selection and tuning: grid search with cross-validation  

In [142]:
estimator = KNeighborsClassifier()

 Parameter grid: the candidate values of K to try

In [143]:
param_dict = {"n_neighbors":[1,2,3,4,5,6,7]}

In [144]:
estimator2 = GridSearchCV(estimator, param_grid = param_dict, cv = 4)

 2. Model Training

In [145]:
estimator2.fit(x_train, y_train)

GridSearchCV(cv=4, estimator=KNeighborsClassifier(),
             param_grid={'n_neighbors': [1, 2, 3, 4, 5, 6, 7]})

 3. Model Prediction

In [146]:
y_predict = estimator2.predict(x_test)
print("predicted labels:", y_predict)

predicted labels: [0 2 1 1 1 1 1 1 1 0 2 1 2 2 0 2 1 1 1 1 0 2 0 1 1 0 1 2 2 1]
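
A common alternative, sketched here as an assumption rather than the notebook's own method, is to wrap the scaler and the classifier in a Pipeline and tune the pipeline. The scaler is then refit inside each cross-validation fold, so the validation part of a fold never influences the scaling statistics. The names pipe and grid and the step labels "scale" and "knn" are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

data = load_iris()
x_tr, x_te, y_tr, y_te = train_test_split(
    data.data, data.target, test_size=0.2, random_state=22
)

# Scaling and classification are chained into a single estimator.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# The "knn__" prefix routes the parameter to the classifier step.
grid = GridSearchCV(
    pipe,
    param_grid={"knn__n_neighbors": [1, 2, 3, 4, 5, 6, 7]},
    cv=4,
)
grid.fit(x_tr, y_tr)   # raw features go in; the pipeline scales them itself

print(grid.best_params_)
print(grid.score(x_te, y_te))
```

Because the pipeline handles scaling internally, the raw test features can be passed to predict or score directly.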


### Model evaluation

1. Method 1: Compare the predictions with the true labels

In [147]:
print("the comparison between the truth and the prediction of y", 
      y_predict == y_test)

the comparison between the truth and the prediction of y [ True  True  True False  True  True  True False  True  True  True  True
  True  True  True  True  True  True False  True  True  True  True  True
 False  True False  True  True False]


2. Method 2: Compute the accuracy

In [148]:
score = estimator2.score(x_test,y_test)

In [149]:
print("the rate of accuracy is: ",score)

the rate of accuracy is:  0.8
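
The accuracy is simply the fraction of predictions that match the true labels. In the comparison printed under Method 1, 24 of the 30 entries are True, and 24/30 = 0.8. The sketch below shows the same calculation on a small pair of hypothetical label arrays, once by hand and once with accuracy_score.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels, chosen only to illustrate the calculation.
y_true = np.array([0, 1, 2, 2, 1])
y_pred = np.array([0, 1, 1, 2, 1])

manual = np.mean(y_pred == y_true)        # fraction of matching entries
library = accuracy_score(y_true, y_pred)  # same quantity from scikit-learn

print(manual, library)
```

Four of the five hypothetical predictions are correct, so both values equal 0.8.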


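 3. Method 3: Inspect the grid search and cross-validation results:  
    Look at the best cross-validation score, the selected value of K, and the per-fold results. Note that best_score_ is the mean accuracy over the four validation folds of the training set, so it need not match the test accuracy above.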

In [150]:
print("the best score in the Cross validation: ", estimator2.best_score_)

the best score in the Cross validation:  0.9666666666666668


In [151]:
print("the best model with K: \n",  estimator2.best_params_)

the best model with K: 
 {'n_neighbors': 3}
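
cv_results_ is a dictionary of arrays with one entry per candidate value of K, which is hard to read when printed whole. As a sketch, the loop below prints one summary line per candidate. For brevity it fits the search on the full, unscaled iris data, which is an assumption and differs from the split and scaling used in this case.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

data = load_iris()
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 2, 3, 4, 5, 6, 7]},
    cv=4,
)
search.fit(data.data, data.target)

results = search.cv_results_
# One row per candidate: mean and spread of the fold scores, plus the rank.
for params, mean, std, rank in zip(
    results["params"],
    results["mean_test_score"],
    results["std_test_score"],
    results["rank_test_score"],
):
    print(f"k={params['n_neighbors']}  mean={mean:.4f}  std={std:.4f}  rank={rank}")

# best_score_ is the mean_test_score of the best-ranked candidate.
best_mean = results["mean_test_score"][search.best_index_]
print(best_mean == search.best_score_)
```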



In [154]:
estimator2.cv_results_

{'mean_fit_time': array([0.00150049, 0.00075305, 0.00099951, 0.00075179, 0.00100082,
        0.0010016 , 0.00050068]),
 'std_fit_time': array([4.99010613e-04, 4.34849100e-04, 1.89520277e-06, 4.34053428e-04,
        6.82206341e-07, 1.52662378e-06, 5.00680038e-04]),
 'mean_score_time': array([0.00500125, 0.00350189, 0.00300086, 0.00325334, 0.00299954,
        0.00299847, 0.00299948]),
 'std_score_time': array([1.58132610e-03, 5.00444488e-04, 7.07815062e-04, 4.40415473e-04,
        1.88486437e-06, 1.78813934e-06, 1.46365513e-06]),
 'param_n_neighbors': masked_array(data=[1, 2, 3, 4, 5, 6, 7],
              mask=[False, False, False, False, False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'n_neighbors': 1},
  {'n_neighbors': 2},
  {'n_neighbors': 3},
  {'n_neighbors': 4},
  {'n_neighbors': 5},
  {'n_neighbors': 6},
  {'n_neighbors': 7}],
 'split0_test_score': array([0.96666667, 0.96666667, 1.        , 1.        , 1.        ,
        1.        , 1.     