# Tweaking & Adjusting Your Model

Whether a machine learning model performs well often comes down to a lot of tweaking...
![machine_learning_xkcd](./images/machine_learning_xkcd.png)
You can think of hyperparameters as little dials that we adjust to make it easier for the machine learning model to learn.
![dials](./images/dials.png)
But how do we know what to set them to?

# Grid Search: Find the best for us!
[GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)

`GridSearchCV` gives us a way to search over combinations of hyperparameter values for a given model: it scores each combination with cross-validation and keeps the best one.
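To make "searching over multiple hyperparameters" concrete, here is a minimal sketch that lists every combination a grid expands to. It uses scikit-learn's `ParameterGrid` helper; the grid values are an illustrative assumption, chosen only to keep the output short.

```python
from sklearn.model_selection import ParameterGrid

# Illustrative grid (assumed values): 2 kernels x 3 values of C
param_grid = {
    'kernel': ['linear', 'rbf'],
    'C': [1, 10, 50],
}

# ParameterGrid enumerates every combination in the grid
combinations = list(ParameterGrid(param_grid))
for params in combinations:
    print(params)

# A grid search fits and scores the model once per combination, per CV fold
print(len(combinations), "combinations")
```

The number of fits grows multiplicatively with each hyperparameter added, so it pays to keep grids small at first.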

# Basic Grid Search

In [1]:
# Getting some data
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                   random_state=27)

In [2]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

parameters = {
    'kernel': ['linear', 'rbf'],
    'C': [1, 10, 50]
}

clf_sv = SVC()
clf = GridSearchCV(clf_sv, parameters)
clf.fit(X_train, y_train)

GridSearchCV(cv=None, error_score=nan,
             estimator=SVC(C=1.0, break_ties=False, cache_size=200,
                           class_weight=None, coef0=0.0,
                           decision_function_shape='ovr', degree=3,
                           gamma='scale', kernel='rbf', max_iter=-1,
                           probability=False, random_state=None, shrinking=True,
                           tol=0.001, verbose=False),
             iid='deprecated', n_jobs=None,
             param_grid={'C': [1, 10, 50], 'kernel': ['linear', 'rbf']},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)
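Once the search has been fit, the results can be read from the fitted object. The sketch below repeats the setup so that it runs on its own; the variable name `search` is just an illustrative choice, and no particular scores are assumed.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Same data and split as above
X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=27)

search = GridSearchCV(SVC(), {'kernel': ['linear', 'rbf'], 'C': [1, 10, 50]})
search.fit(X_train, y_train)

# Best combination found and its mean cross-validated score
print(search.best_params_)
print(search.best_score_)

# With refit=True (the default), the best model is re-trained on all of
# X_train, so the search object can be scored on the held-out test set
print(search.score(X_test, y_test))
```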

# Bad Grid Search!

Note we still have to be careful in performing a grid search!

We can accidentally "leak" information by fitting transformations on data that includes the **validation set**, instead of just the **training set**!

## Example of leaking information

This will leak information when doing **cross-validation**:

In [7]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Fits the scaler on ALL of X_train! (rows later used as validation folds influence the scaling)
scaled_data = scaler.fit_transform(X_train)

parameters = {
    'kernel': ['linear', 'rbf'],
    'C': [1, 10]
}

clf_sv = SVC()
clf = GridSearchCV(clf_sv, parameters)
# Fit on the pre-scaled data, so every validation fold was already seen by the scaler
clf.fit(scaled_data, y_train)

GridSearchCV(cv=None, error_score=nan,
             estimator=SVC(C=1.0, break_ties=False, cache_size=200,
                           class_weight=None, coef0=0.0,
                           decision_function_shape='ovr', degree=3,
                           gamma='scale', kernel='rbf', max_iter=-1,
                           probability=False, random_state=None, shrinking=True,
                           tol=0.001, verbose=False),
             iid='deprecated', n_jobs=None,
             param_grid={'C': [1, 10], 'kernel': ['linear', 'rbf']},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)

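**Why is this leaking?**
- During cross-validation, each validation fold is a slice of `X_train`, and the scaler was already fit on all of `X_train`. The mean and variance used for scaling therefore include the validation rows, so the model has indirectly seen the validation set.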
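The leak can be made visible by comparing a scaler fit on all of the training data with one fit on the training part of a single split. This is a rough sketch: it uses a plain `KFold` with an arbitrary seed for illustration (for classifiers, `GridSearchCV` defaults to a stratified split), and the names `leaky_scaler` and `fold_scaler` are made up for this example.

```python
from sklearn import datasets
from sklearn.model_selection import KFold, train_test_split
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=27)

# Leaky: the scaler is fit on every row of X_train before cross-validation
leaky_scaler = StandardScaler().fit(X_train)

# Take the first of 5 splits (illustrative splitter and seed)
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
train_idx, val_idx = next(kfold.split(X_train))

# Leak-free: the scaler is fit on the training part of the split only
fold_scaler = StandardScaler().fit(X_train[train_idx])

# The leaky scaler has seen the validation rows; the fold scaler has not
print("rows seen (leaky):", leaky_scaler.n_samples_seen_)
print("rows seen (fold): ", fold_scaler.n_samples_seen_)
print("validation rows:  ", len(val_idx))

# The per-feature means differ because the validation rows shifted them
print(leaky_scaler.mean_)
print(fold_scaler.mean_)
```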

## Example with no leakage

We can help prevent leaking by using a **pipeline** to encapsulate the transformation: it chains a *Transformer* and a *Predictor* into a new *Estimator*, so the scaler is re-fit on only the training portion of each cross-validation split.

In [9]:
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', SVC())
])

parameters = {
    # Keys follow '<step name>__<parameter>', so the prefix must match a step name above
    'scaler__with_mean': [True, False],
    'clf__kernel': ['linear', 'rbf'],
    'clf__C': [1, 10]
}

cv = GridSearchCV(pipeline, param_grid=parameters)

cv.fit(X_train, y_train)
y_pred = cv.predict(X_test)
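Grid keys for a pipeline follow the `<step name>__<parameter>` pattern, and a key whose prefix does not match a step name is rejected with a `ValueError`. One way to catch this before running the search is to compare the grid keys against `get_params().keys()`. The helper names `valid_names`, `grid_keys`, and `unknown` below are illustrative, not part of scikit-learn.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', SVC())
])

# Every name that can legally appear as a key in the parameter grid
valid_names = sorted(pipeline.get_params().keys())
print(valid_names)

# Check the intended grid keys before handing them to GridSearchCV
grid_keys = ['scaler__with_mean', 'clf__kernel', 'clf__C']
unknown = [key for key in grid_keys if key not in valid_names]
print("unknown keys:", unknown)
```

A misspelled prefix such as `sacler__with_mean` would show up in `unknown` here instead of failing partway through `fit`.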