# Tweaking & Adjusting Your Model

Whether a machine learning model performs well often comes down to a lot of tweaking...
![machine_learning_xkcd](./images/machine_learning_xkcd.png)
You can think of hyperparameters as little dials, set before training rather than learned from the data, that you adjust to make it easier for the machine learning model to learn.
![dials](./images/dials.png)
But how do we know what values to set them to?!

# Grid Search: Find the best for us!
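[GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) gives us a way to search over multiple hyperparameter values for a given model: it tries every combination in the grid, scores each one with cross-validation, and keeps the best.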

# Basic Grid Search

In [1]:
# Getting some data
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                   random_state=27)

In [2]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

parameters = {
    'kernel': ['linear', 'rbf'],
    'C': [1, 10, 50]
}

clf_sv = SVC()
clf = GridSearchCV(clf_sv, parameters)
clf.fit(X_train, y_train)

GridSearchCV(estimator=SVC(),
             param_grid={'C': [1, 10, 50], 'kernel': ['linear', 'rbf']})
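
Once the search is fitted, we can ask it what it found. The sketch below repeats the setup above so it runs on its own (the name `search` is just for illustration) and reads the fitted attributes `best_params_`, `best_score_`, and `cv_results_`. By default the best combination is refit on the whole training set, so `score` can be called directly on the search object.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=27)

parameters = {'kernel': ['linear', 'rbf'], 'C': [1, 10, 50]}
search = GridSearchCV(SVC(), parameters)
search.fit(X_train, y_train)

# Best combination and its mean cross-validated accuracy
print(search.best_params_)
print(search.best_score_)

# Accuracy of the refit best model on the held-out test set
print(search.score(X_test, y_test))

# 2 kernels x 3 values of C = 6 candidate combinations
n_candidates = len(search.cv_results_['params'])
print(n_candidates)
```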

# Bad Grid Search!

Note we still have to be careful in performing a grid search!

We can accidentally "leak" information by fitting a transformation on **all of the data** we hand to the search, instead of only on the **training portion** of each cross-validation split!

## Example of leaking information

This will leak information when doing **cross-validation**:

In [3]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Scales over all of the X_train data! (validation set will be considered in scaling)
scaled_data = scaler.fit_transform(X_train)

parameters = {
    'kernel': ['linear', 'rbf'],
    'C': [1, 10]
}

clf_sv = SVC()
clf = GridSearchCV(clf_sv, parameters)
# Fit on the pre-scaled data: each validation fold was already used to fit the scaler
clf.fit(scaled_data, y_train)

GridSearchCV(estimator=SVC(),
             param_grid={'C': [1, 10], 'kernel': ['linear', 'rbf']})

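**Why is this a leak?**
- The scaler was fit on all of `X_train`, so every validation fold used during cross-validation already contributed to the mean and standard deviation behind *scaled_data*. The model has indirectly seen the validation set.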
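
To make the leak concrete, here is a small sketch that compares the statistics a scaler learns from all of `X_train` with what it learns from only the training portion of one split. It assumes the same iris split as above. `StratifiedKFold` with 5 splits is used because that is what `GridSearchCV` falls back to for classifiers when `cv` is not set; the names `leaky_scaler` and `clean_scaler` are made up for this illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=27)

# Leaky: statistics computed from every row, validation folds included
leaky_scaler = StandardScaler().fit(X_train)

# Clean: statistics computed from the training portion of one split only
folds = StratifiedKFold(n_splits=5)
fold_train_idx, fold_val_idx = next(iter(folds.split(X_train, y_train)))
clean_scaler = StandardScaler().fit(X_train[fold_train_idx])

print(leaky_scaler.mean_)
print(clean_scaler.mean_)

# The means differ, so the leaky version carries information
# about the rows that were supposed to be held out
means_differ = not np.allclose(leaky_scaler.mean_, clean_scaler.mean_)
print(means_differ)
```

On a small, well-behaved data set like iris the difference is tiny, but the same mistake with heavier preprocessing can make cross-validation scores look better than they should.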

## Example with no leakage

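We can prevent this kind of leak by using a **Pipeline** to chain the *Transformer* and the *Predictor* into a single new *Estimator*. During cross-validation the scaler is then re-fit on only the training portion of each split. Hyperparameters of a step are addressed as `<step name>__<parameter>`, for example `clf__C`.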

In [6]:
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', SVC())
])

parameters = {
    'scaler__with_mean': [True, False],
    'clf__kernel': ['linear', 'rbf'],
    'clf__C': [1, 10]
}

cv = GridSearchCV(pipeline, param_grid=parameters)

cv.fit(X_train, y_train)
y_pred = cv.predict(X_test)
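
As a rough follow-up, we can check what the pipeline search picked and how the predictions hold up on the test set. This sketch rebuilds the same pipeline and grid so it runs on its own; the variable name `search` and the use of `accuracy_score` as the metric are choices made here, not part of the original notebook.

```python
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=27)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', SVC())
])

parameters = {
    'scaler__with_mean': [True, False],
    'clf__kernel': ['linear', 'rbf'],
    'clf__C': [1, 10]
}

search = GridSearchCV(pipeline, param_grid=parameters)
search.fit(X_train, y_train)
y_pred = search.predict(X_test)

# Winning combination, using the step__parameter names from the grid
print(search.best_params_)

# Accuracy of the refit best pipeline on the held-out test set
test_accuracy = accuracy_score(y_test, y_pred)
print(test_accuracy)

# The scaler inside the best pipeline was fit as part of the pipeline
fitted_scaler = search.best_estimator_.named_steps['scaler']
print(fitted_scaler.scale_)
```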

In [9]:
scaler.get_params().keys()

dict_keys(['copy', 'with_mean', 'with_std'])

In [10]:
clf.get_params().keys()

dict_keys(['cv', 'error_score', 'estimator__C', 'estimator__break_ties', 'estimator__cache_size', 'estimator__class_weight', 'estimator__coef0', 'estimator__decision_function_shape', 'estimator__degree', 'estimator__gamma', 'estimator__kernel', 'estimator__max_iter', 'estimator__probability', 'estimator__random_state', 'estimator__shrinking', 'estimator__tol', 'estimator__verbose', 'estimator', 'n_jobs', 'param_grid', 'pre_dispatch', 'refit', 'return_train_score', 'scoring', 'verbose'])
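
The two `get_params()` calls above list the parameter names that the scaler and the earlier grid search accept. The same trick works on a pipeline, and it is a handy way to check that the keys in a parameter grid are spelled correctly. A minimal sketch, assuming the same two-step pipeline as before (`grid_keys` is just an illustrative list):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', SVC())
])

# Every tunable name, including nested step__parameter names
param_names = sorted(pipeline.get_params().keys())
print(param_names)

# Check the grid keys against the names the pipeline exposes
grid_keys = ['scaler__with_mean', 'clf__kernel', 'clf__C']
for key in grid_keys:
    print(key, key in param_names)
```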