# Model Tuning

- **Problem**: find the set of hyperparameters for a learning algorithm that gives the best model.
- **Solution**: search over candidate sets of hyperparameters and keep the set that results in an optimal model.
- **Optimal model**: the model that yields the best **score**.
- **Score**: in sklearn, an estimator's `score` method defaults to accuracy (classification) and R² (regression).
- **Important**: a model's generalization performance is estimated using cross-validation (CV).
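
As a quick check of the default scores, the sketch below compares an estimator's `score` method with `accuracy_score` and `r2_score`. The datasets (`load_breast_cancer`, `load_diabetes`) and the `max_depth=3` setting are illustrative assumptions, and the scores are computed on the training data only to keep the example short.

```python
from sklearn.datasets import load_breast_cancer, load_diabetes
from sklearn.metrics import accuracy_score, r2_score
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification: score() returns the mean accuracy
X_clf, y_clf = load_breast_cancer(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X_clf, y_clf)
clf_score = clf.score(X_clf, y_clf)
clf_accuracy = accuracy_score(y_clf, clf.predict(X_clf))

# Regression: score() returns the coefficient of determination (R^2)
X_reg, y_reg = load_diabetes(return_X_y=True)
reg = DecisionTreeRegressor(max_depth=3, random_state=1).fit(X_reg, y_reg)
reg_score = reg.score(X_reg, y_reg)
reg_r2 = r2_score(y_reg, reg.predict(X_reg))

# Both comparisons check that score() matches the explicit metric
print(abs(clf_score - clf_accuracy) < 1e-12)
print(abs(reg_score - reg_r2) < 1e-12)
```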

### Approaches
- **Grid Search**: 
    - Manually set a grid of discrete hyperparameter values.
    - Set a metric for scoring model performance.
    - Search exhaustively through the grid. 
    - For each set of hyperparameters, evaluate the corresponding model's CV score.
    - The optimal hyperparameters are those of the model achieving the best CV score. 

- **Random Search**:

- **Bayesian Optimization**:

- **Genetic Algorithms**:
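
The grid search steps listed above can be written out as an explicit loop. This is a minimal sketch, not how `GridSearchCV` is implemented internally: the grid values, the 5-fold setting, and the dataset are assumptions chosen to keep it fast.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import ParameterGrid, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# A small grid of discrete hyperparameter values (illustrative)
grid = {"max_depth": [2, 4], "min_samples_leaf": [1, 10]}

best_score, best_params = float("-inf"), None

# Search exhaustively: visit every combination in the grid
for params in ParameterGrid(grid):
    model = DecisionTreeClassifier(random_state=1, **params)
    # CV score of this candidate: mean accuracy over 5 folds
    cv_score = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    # Keep the hyperparameters of the best-scoring candidate
    if cv_score > best_score:
        best_score, best_params = cv_score, params

print(best_params, best_score)
```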



### Definitions
- **parameters**: learned from data
    - Example: split-point of a node, split-feature of a node...
- **hyperparameters**: not learned from data, set prior to training
    - Example: max_depth, min_samples_leaf, splitting criterion...
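
To make the distinction concrete, the sketch below sets two hyperparameters before training and then reads the learned split-features and split-points from the fitted tree. The dataset and the specific values (`max_depth=2`, `min_samples_leaf=5`) are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Hyperparameters: chosen before training
dt = DecisionTreeClassifier(max_depth=2, min_samples_leaf=5, random_state=1)
hyperparams = dt.get_params()
print(hyperparams["max_depth"], hyperparams["min_samples_leaf"])

# Parameters: learned from the data during fit()
dt.fit(X, y)
split_features = dt.tree_.feature    # split-feature index per node (placeholder value at leaves)
split_points = dt.tree_.threshold    # split-point per node (placeholder value at leaves)
print(split_features)
print(split_points)
```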


### Cons
- Tuning is expensive:
    - It is computationally expensive.
    - It sometimes yields only a very slight improvement.
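
A rough way to see the computational cost is to count model fits: every combination of grid values is trained once per CV fold. The sketch below assumes an illustrative grid with 4, 3, and 4 values and 10-fold CV.

```python
from itertools import product

# Illustrative grid: 4 x 3 x 4 candidate values
grid = {
    "max_depth": [3, 4, 5, 6],
    "min_samples_leaf": [0.04, 0.06, 0.08],
    "max_features": [0.2, 0.4, 0.6, 0.8],
}
n_folds = 10

# Every combination of values is one candidate model
candidates = list(product(*grid.values()))
n_candidates = len(candidates)

# Each candidate is trained once per CV fold
n_fits = n_candidates * n_folds

print(n_candidates)  # 48
print(n_fits)        # 480
```

Adding one more hyperparameter with k values multiplies both counts by k, which is why exhaustive search becomes costly quickly.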

### Grid Search example

```python
from sklearn.tree import DecisionTreeClassifier

seed = 1

# Instantiate the model
dt = DecisionTreeClassifier(random_state=seed)

# Print the model's hyperparameters and their current values
print(dt.get_params())
```

```
{'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'presort': 'deprecated', 'random_state': 1, 'splitter': 'best'}
```


```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split

# Load data
data = load_breast_cancer()
X = data.data
y = data.target

# Split the data: 70% for training (and CV), 30% held out for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=seed)

# Define the hyperparameter grid
# Float values for min_samples_leaf and max_features are fractions
# of the number of samples and features, respectively
params_dt = {
    "max_depth": [3, 4, 5, 6],
    "min_samples_leaf": [0.04, 0.06, 0.08],
    "max_features": [0.2, 0.4, 0.6, 0.8]
}

# Instantiate a grid search CV object (10-fold CV, all CPU cores)
grid_dt = GridSearchCV(estimator=dt,
                       param_grid=params_dt,
                       scoring='accuracy',
                       cv=10,
                       n_jobs=-1)

# Fit to the training data
grid_dt.fit(X_train, y_train)

# Extract best hyperparameters
best_hyperparams = grid_dt.best_params_
print(f"Best hyperparams:\n {best_hyperparams}")

# Extract best CV score
best_CV_score = grid_dt.best_score_
print(f"Best CV accuracy: \n {best_CV_score}")

# Extract the best model from 'grid_dt'
best_model = grid_dt.best_estimator_

# Evaluate test set accuracy
test_acc = best_model.score(X_test, y_test)
print(f"Test set accuracy of best model: \n {test_acc}")
```

```
Best hyperparams:
 {'max_depth': 3, 'max_features': 0.2, 'min_samples_leaf': 0.04}
Best CV accuracy: 
 0.9373076923076923
Test set accuracy of best model: 
 0.8888888888888888
```
