### KNN
<br>
- [Sklearn Official document URL - KNN](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)

> class sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, *, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None)

```python
from sklearn.neighbors import KNeighborsClassifier
```
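
To make the signature above more concrete, here is a minimal sketch of what a KNN classifier does under the default settings (`weights='uniform'`, Minkowski metric with `p=2`, i.e. Euclidean distance). The helper `knn_predict` and the toy arrays are hypothetical names made up for illustration; the real estimator uses faster neighbor search and may break ties differently.

```python
import numpy as np
from collections import Counter
from sklearn.neighbors import KNeighborsClassifier


def knn_predict(X_train, y_train, X_query, k=5):
    """Hypothetical helper: majority vote among the k nearest training points."""
    preds = []
    for q in X_query:
        # Euclidean distance (Minkowski with p=2) from q to every training point
        dists = np.sqrt(((X_train - q) ** 2).sum(axis=1))
        # Indices of the k closest training points
        nearest = np.argsort(dists)[:k]
        # Uniform weights: each neighbor gets one vote
        votes = Counter(y_train[nearest].tolist())
        preds.append(votes.most_common(1)[0][0])
    return np.array(preds)


# Toy data (an assumption for this sketch): two well-separated clusters
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
X_query = np.array([[0.05, 0.05], [5.0, 5.1]])

manual = knn_predict(X_toy, y_toy, X_query, k=3)
sk = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy).predict(X_query)
print(manual, sk)
```

On data this cleanly separated, the hand-written vote and `KNeighborsClassifier` should agree; on real data the two can differ when there are ties.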

### Example code

#####  Iris dataset

In [10]:
import numpy as np
import matplotlib.pyplot as plt

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

iris = datasets.load_iris()
# x : Sepal Length, Sepal Width, Petal Length and Petal Width
X = iris.data
# y : 0 - Setosa, 1 - Versicolour, and 2 - Virginica
y = iris.target

# Build a KNN classifier on the Iris dataset
# 1) Split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=1/3, 
                                                    random_state=1)
# 2) Plain KNN, and KNN in a Pipeline with PCA (dimensionality reduction to 2 components)
knn = KNeighborsClassifier(n_neighbors=5)

knn1, pca = KNeighborsClassifier(n_neighbors=5), PCA(n_components=2)
knn_pca = Pipeline(steps=[('feature_extractor_pca', pca), ('neighbor', knn1)])

# 3) Fit both models
knn.fit(X_train, y_train)
knn_pca.fit(X_train, y_train)

# 4) Compare performance (predict() already returns class labels, so no rounding is needed)
print("Only kneighbor Score :", accuracy_score(y_test, knn.predict(X_test)))
print("PCA+kneighbor Score :", accuracy_score(y_test, knn_pca.predict(X_test)))

Only kneighbor Score : 0.98 
PCA+kneighbor Score : 1.0


### Cross Validation

<br>

### K-fold Cross-validation

- [Sklearn Official document URL - cross_validation](https://scikit-learn.org/stable/modules/cross_validation.html)

```python
from sklearn.model_selection import cross_val_score
```
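
As a rough illustration of what `cross_val_score` does internally, the sketch below runs the folds by hand and compares the result with the one-line call. It assumes the Iris data and a plain KNN model purely for the demo. For a classifier and an integer `cv`, scikit-learn uses stratified folds without shuffling, so `StratifiedKFold(n_splits=5)` is used to mirror that.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Demo data and model (assumptions for this sketch)
X_iris, y_iris = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=5)

# Manual k-fold loop: fit on k-1 folds, score on the remaining fold
skf = StratifiedKFold(n_splits=5)
manual_scores = []
for train_idx, val_idx in skf.split(X_iris, y_iris):
    fold_model = clone(model)  # fresh, unfitted copy for each fold
    fold_model.fit(X_iris[train_idx], y_iris[train_idx])
    manual_scores.append(fold_model.score(X_iris[val_idx], y_iris[val_idx]))
manual_scores = np.array(manual_scores)

# The same procedure in one call
auto_scores = cross_val_score(model, X_iris, y_iris, cv=5)

print("manual :", manual_scores)
print("auto   :", auto_scores)
```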

In [5]:
import pandas as pd
import numpy as np

train = pd.read_csv('titanic/train_preprocessing.csv') # training features (X)
test = pd.read_csv('titanic/test_preprocessing.csv') # unlabeled features to predict on (X_test)
target = pd.read_csv('titanic/target_preprocessing.csv') # training labels (y)

print(f"list of missing value \n train : {train.isna().sum().values}\n test : {test.isna().sum().values}\n target : {target.isna().sum().values}")

train.fillna(0, inplace=True)
test.fillna(0, inplace=True)
# target.fillna(0, inplace=True)

x = train.to_numpy()[:,1:]
y = target.to_numpy()[:,1]
X_test = test.to_numpy()[:,2:]

list of missing value 
 train : [0 0 0 0 0 0 2 0 0]
 test : [0 0 0 0 0 0 0 0 1 0]
 target : [0 0]


In [7]:
# 1) x, y -> train/test
from sklearn.model_selection import train_test_split, cross_val_score

x_train, x_test, y_train, y_test = train_test_split(x,y,
                                                    test_size=1/8, 
                                                    random_state=0)

# 2) Build a Pipeline classifier: LDA + KNN
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Use distinct names so the instances do not shadow the imported classes
lda_step, kneighbor = LinearDiscriminantAnalysis(n_components=1), KNeighborsClassifier(n_neighbors=5)
clf = Pipeline(steps=[('LDA', lda_step), ('kneighbor', kneighbor)])

# 3) Check performance with k-fold cross-validation
from sklearn.metrics import accuracy_score

scores = cross_val_score(clf, x_train, y_train, cv=5)
print(f"Score Array : {scores} \n\
Score Mean : {np.mean(scores)}\n\
Score Std : {np.std(scores)}")

# 4) If the CV scores look reasonable, fit on the whole training split
clf.fit(x_train, y_train)

# 5) Validate on the held-out test split
print("kneighbor + LDA accuracy :", accuracy_score(y_test, clf.predict(x_test)))

# 6) Use the final fitted model to predict the unlabeled data below; store the result in res
res = clf.predict(X_test)

Score Array : [0.76282051 0.80128205 0.84615385 0.74358974 0.76774194] 
Score Mean : 0.784317617866005
Score Std : 0.03607533685172391
kneighbor + LDA accuracy : 0.7946428571428571
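
Because the whole Pipeline is passed to `cross_val_score`, the LDA step is refit on the training part of every fold, so the validation fold never influences the projection. The same pattern can be used to compare a few values of `n_neighbors`. The sketch below is only illustrative: the Titanic CSV files are not available here, so it assumes a synthetic binary dataset from `make_classification`, and the candidate values `3, 5, 11` are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the preprocessed Titanic features (an assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

mean_scores = {}
for k in (3, 5, 11):
    # Same structure as the LDA + KNN pipeline above, with a different k each time
    clf_k = Pipeline(steps=[
        ('LDA', LinearDiscriminantAnalysis(n_components=1)),
        ('kneighbor', KNeighborsClassifier(n_neighbors=k)),
    ])
    scores_k = cross_val_score(clf_k, X_demo, y_demo, cv=5)
    mean_scores[k] = scores_k.mean()
    print(f"n_neighbors={k}: mean={scores_k.mean():.3f}, std={scores_k.std():.3f}")
```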


### Grid-Search

<br>

##### GridSearchCV

- [Sklearn Official document URL - GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV)

```python
from sklearn.model_selection import GridSearchCV
```

##### RandomizedSearchCV

- [Sklearn Official document URL - RandomizedSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html)

> class sklearn.model_selection.RandomizedSearchCV(estimator, param_distributions, *, n_iter=10, scoring=None, n_jobs=None, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', random_state=None, error_score=nan, return_train_score=False)

```python
from sklearn.model_selection import RandomizedSearchCV
```
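
Unlike a grid search, `RandomizedSearchCV` samples `n_iter` parameter settings from `param_distributions` instead of trying every combination. A minimal sketch, assuming the Iris data, a decision tree, and `scipy.stats.randint` ranges chosen only for illustration:

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)

# Distributions to sample from (randint's upper bound is exclusive)
param_distributions = {
    'max_depth': randint(2, 6),          # integers 2..5
    'min_samples_split': randint(2, 5),  # integers 2..4
}

rand_search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,        # number of sampled parameter settings
    cv=3,
    random_state=0,  # makes the sampling reproducible
)
rand_search.fit(X_iris, y_iris)

print('best parameters : ', rand_search.best_params_)
print('best score : ', rand_search.best_score_)
```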

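The main `GridSearchCV` parameters are easiest to read off its signature (the `iid` argument, deprecated in older releases, has since been removed):

```
sklearn.model_selection.GridSearchCV(estimator, param_grid, *, scoring=None,
          n_jobs=None, refit=True, cv=None, verbose=0,
          pre_dispatch='2*n_jobs', error_score=nan, return_train_score=False)
```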

1. `estimator`: the model instance whose hyperparameters you want to tune.
2. `param_grid`: a dictionary mapping hyperparameter names to the lists of values to try.
3. `scoring`: the evaluation metric; pass a valid metric name as a string or a scorer object.
4. `cv`: the cross-validation strategy; an integer sets the number of folds used to evaluate each hyperparameter combination.
5. `verbose`: set to 1 or higher to print progress messages while fitting.
6. `n_jobs`: the number of jobs to run in parallel; `-1` uses all available processors.
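
The parameters listed above can be passed together as in the following sketch. The Iris data, the KNN estimator, and the grid values are assumptions picked for the demo; `n_jobs=1` keeps the run single-process, and `-1` would use all processors instead.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = load_iris(return_X_y=True)

grid = GridSearchCV(
    estimator=KNeighborsClassifier(),      # model to tune
    param_grid={                           # 3 x 2 = 6 combinations
        'n_neighbors': [3, 5, 7],
        'weights': ['uniform', 'distance'],
    },
    scoring='accuracy',                    # evaluation metric
    cv=5,                                  # 5 folds per combination
    verbose=0,                             # raise to 1+ for progress output
    n_jobs=1,                              # -1 would use all processors
)
grid.fit(X_iris, y_iris)

print('best parameters : ', grid.best_params_)
print('best score : ', grid.best_score_)
```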

In [7]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score

iris_data = load_iris()
label = iris_data.target
data = iris_data.data

X_train, X_val, y_train, y_val = train_test_split(data, label, test_size=0.2)


# Set up param_grid for GridSearchCV
params = {
    'max_depth': [2, 3],
    'min_samples_split': [2, 3]
}

dtc = DecisionTreeClassifier()

grid_tree = GridSearchCV(dtc, param_grid=params, cv=3, refit=True)
grid_tree.fit(X_train, y_train)

print('best parameters : ', grid_tree.best_params_)
print('best score : ', grid_tree.best_score_)
# Assign the refit best estimator before printing it
em = grid_tree.best_estimator_
print('best model parameter : ', em)
pred = em.predict(X_val)
accuracy_score(y_val, pred)

best parameters :  {'max_depth': 2, 'min_samples_split': 2}
best score :  0.9333333333333332
best model parameter :  DecisionTreeClassifier(max_depth=2)


0.9666666666666667
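
Beyond `best_params_` and `best_score_`, the fitted search object keeps the score of every combination in `cv_results_`, which is convenient to view as a DataFrame. A small sketch, assuming the same kind of decision-tree grid on Iris (with `random_state=0` added so repeated runs give the same table):

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={'max_depth': [2, 3], 'min_samples_split': [2, 3]},
    cv=3,
)
search.fit(X_iris, y_iris)

# One row per parameter combination, sorted so the best-ranked comes first
results = pd.DataFrame(search.cv_results_)
cols = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
print(results[cols].sort_values('rank_test_score'))
```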