# **(Part 5 Cross-validation and Hyper-parameter Tuning)**

## Objectives

Compare with other classification methods
   * Decision trees with **`tree.DecisionTreeClassifier()`**
   * K-nearest neighbors with **`neighbors.KNeighborsClassifier()`**
   * Random forests with **`ensemble.RandomForestClassifier()`**
   * Perceptron (both gradient and stochastic gradient) with **`mlxtend.classifier.Perceptron`**
   * Multilayer perceptron network (both gradient and stochastic gradient) with **`mlxtend.classifier.MultiLayerPerceptron`**

## Inputs

* Write here which data or information you need to run the notebook 

## Outputs

* Write here which files, code or artefacts you generate by the end of the notebook 

## Additional Comments

* In case you have any additional comments that don't fit in the previous bullets, please state them here. 


---

# Change working directory

* We are assuming you will store the notebooks in a subfolder, therefore when running the notebook in the editor, you will need to change the working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/Breast-Cancer-Prediction/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/Breast-Cancer-Prediction'

---

#### Cross Validation Techniques

#### K-Fold Validation

In [None]:
from sklearn.model_selection import KFold, cross_val_score

# RFC_model, LR_model, KNNI, X and Y are assumed to be defined in the earlier model-building steps
cv1 = KFold(n_splits=13, random_state=12, shuffle=True)

# Mean accuracy over the 13 folds for each classifier
scores_kfold_rfc = cross_val_score(RFC_model, X, Y, scoring='accuracy', cv=cv1, n_jobs=-1)
scores_kfold_lrc = cross_val_score(LR_model, X, Y, scoring='accuracy', cv=cv1, n_jobs=-1)
scores_kfold_knn = cross_val_score(KNNI, X, Y, scoring='accuracy', cv=cv1, n_jobs=-1)

print("The K-fold Cross Validation for our Random Forest Classifier yields %0.2f accuracy" % (scores_kfold_rfc.mean()))
print("The K-fold Cross Validation for our Logistic Regression Classifier yields %0.2f accuracy" % (scores_kfold_lrc.mean()))
print("The K-fold Cross Validation for our KNN Classifier yields %0.2f accuracy" % (scores_kfold_knn.mean()))

#### Stratified K-fold Validation

In [None]:
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Each of the 3 folds keeps roughly the same class proportions as the full data set
skfold = StratifiedKFold(n_splits=3, random_state=100, shuffle=True)

scores_skfold_rfc = cross_val_score(RFC_model, X, Y, scoring='accuracy', cv=skfold, n_jobs=-1)
scores_skfold_lrc = cross_val_score(LR_model, X, Y, scoring='accuracy', cv=skfold, n_jobs=-1)
scores_skfold_knn = cross_val_score(KNNI, X, Y, scoring='accuracy', cv=skfold, n_jobs=-1)

# Per-fold accuracies of the random forest
print(scores_skfold_rfc)

print("The Stratified K-fold Cross Validation for our Random Forest Classifier yields %0.2f accuracy" % (scores_skfold_rfc.mean()))
print("The Stratified K-fold Cross Validation for our Logistic Regression Classifier yields %0.2f accuracy" % (scores_skfold_lrc.mean()))
print("The Stratified K-fold Cross Validation for our KNN Classifier yields %0.2f accuracy" % (scores_skfold_knn.mean()))
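
To see what stratification changes compared with plain K-fold, the following minimal sketch measures the share of positive labels in each test fold. The labels are synthetic (90 negatives, 30 positives) and the helper `positive_share_per_fold` is a hypothetical name used only for this illustration; neither comes from the notebook's data.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Synthetic, imbalanced labels (assumption for illustration): 75% class 0, 25% class 1
y = np.array([0] * 90 + [1] * 30)
X = np.zeros((len(y), 1))  # feature values do not influence how the folds are built


def positive_share_per_fold(splitter):
    """Return the fraction of class-1 samples in each test fold."""
    return [float(y[test_idx].mean()) for _, test_idx in splitter.split(X, y)]


kfold_shares = positive_share_per_fold(KFold(n_splits=3, shuffle=True, random_state=12))
strat_shares = positive_share_per_fold(StratifiedKFold(n_splits=3, shuffle=True, random_state=100))

print("KFold positive share per fold:          ", kfold_shares)
print("StratifiedKFold positive share per fold:", strat_shares)
```

With plain K-fold the share can drift from fold to fold, whereas the stratified splitter keeps it at the overall class ratio, which matters most when one class is much rarer than the other.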

### Hyperparameter Tuning With RandomizedSearchCV and GridSearchCV

Hyperparameter tuning is the process of finding the set of hyperparameter values that gives the best model performance. In machine learning this is often referred to as a search, and the two approaches used here differ in how they explore the candidate values.

#### I. RandomizedSearch CV

Randomized search draws a fixed number of random hyperparameter combinations from the specified candidate values and evaluates each of them with cross-validation. It is well suited to models with many hyperparameters because it costs far less computation than trying every combination. Since the Random Forest Classifier depends on multiple hyperparameters, we use RandomizedSearchCV for it.
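
As a rough illustration of the cost saving, the sketch below counts how many combinations an exhaustive grid would contain and how many a randomized search would actually draw. The reduced `search_space` is an assumption chosen to keep the numbers small, and `n_iter=10` mirrors the scikit-learn default for `RandomizedSearchCV`.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Reduced, illustrative search space (assumption, not the full one used for the model)
search_space = {
    "criterion": ["entropy", "gini"],
    "min_samples_leaf": [4, 6, 8, 12],
    "min_samples_split": [3, 7, 10, 14],
}

# Every combination an exhaustive grid search would have to evaluate
full_grid = ParameterGrid(search_space)

# The combinations a randomized search with n_iter=10 would evaluate
sampled = list(ParameterSampler(search_space, n_iter=10, random_state=101))

print(f"Exhaustive grid: {len(full_grid)} combinations")
print(f"Randomized search: {len(sampled)} combinations")
for candidate in sampled[:3]:
    print(candidate)
```

Each sampled combination is then fitted once per cross-validation fold, so the total number of model fits is the number of sampled combinations multiplied by the number of folds.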

In [None]:
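import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=2)

# Candidate values for each hyperparameter.
# 'auto' is not listed for max_features: recent scikit-learn releases reject it
# (for classifiers it was equivalent to 'sqrt').
random_search = {'criterion': ['entropy', 'gini'],
 'max_depth': list(np.linspace(5, 1200, 10, dtype = int)) + [None],
 'max_features': ['sqrt', 'log2', None],
 'min_samples_leaf': [4, 6, 8, 12],
 'min_samples_split': [3, 7, 10, 14],
 'n_estimators': list(np.linspace(5, 1200, 3, dtype = int))}

model = RandomForestClassifier()

# n_iter is left at its default, so 10 random combinations are evaluated with 4-fold cross-validation
modelrf = RandomizedSearchCV(estimator = model, param_distributions = random_search, cv = 4, verbose = 5, random_state = 101, n_jobs = -1)

modelrf.fit(X_train, Y_train)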


After performing the search, the best-fit hyperparameters for our random forest model are found to be:

In [None]:
modelrf.best_params_

In [None]:
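from sklearn.metrics import classification_report

Randomized_YPred = modelrf.predict(X_test)

# classification_report expects the true labels first, then the predictions
print(classification_report(Y_test, Randomized_YPred))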

In [None]:
from sklearn import metrics

metrics.accuracy_score(Y_test, Randomized_YPred)

#### II. GridSearch CV

Next is GridSearchCV, which evaluates every combination in a grid of hyperparameter values with cross-validation and keeps the best one. Here it is applied to the KNN model to find the best value of k (`n_neighbors`), the hyperparameter of that model.

In [None]:
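from sklearn.model_selection import GridSearchCV, train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=2)

# Try every k from 1 to 9
params = {'n_neighbors': list(range(1, 10))}

# cv=100 runs 100-fold cross-validation on the training split for each k
KNN_cv = GridSearchCV(KNNI, params, cv=100)
KNN_cv.fit(X_train, Y_train)
Y_pred = KNN_cv.predict(X_test)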

In [None]:
k = KNN_cv.best_params_.get("n_neighbors")
print(f"The best fit value for k is {k}.")

In [None]:
KNN_cv.best_score_
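
Beyond the single best value, it can help to look at the cross-validated score of every k that the grid search tried. The sketch below is self-contained and rests on two assumptions: it uses scikit-learn's bundled breast cancer data set as a stand-in for the notebook's `X` and `Y`, and it uses 5 folds to keep the run short.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (assumption): scikit-learn's built-in breast cancer data set
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=2)

# Same grid of k values as above, with 5-fold cross-validation (assumption)
search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": list(range(1, 10))}, cv=5)
search.fit(X_tr, y_tr)

# cv_results_ stores one mean cross-validated accuracy per candidate k
scores_by_k = [
    (int(k), float(score))
    for k, score in zip(search.cv_results_["param_n_neighbors"],
                        search.cv_results_["mean_test_score"])
]
for k, score in scores_by_k:
    print(f"k={k}: mean CV accuracy = {score:.3f}")

best_k = search.best_params_["n_neighbors"]
print(f"Best k: {best_k}, held-out accuracy: {search.score(X_te, y_te):.3f}")
```

If several values of k score almost the same, the choice between them is not well determined by the data, and a smaller or larger k can be picked on other grounds such as prediction speed.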

---