
In [1]:
# LIME is needed for the Model Explainer section; uncomment the next line to install it
# %pip install lime

## Learning Objectives

Today we will cover the full ML pipeline in all of its glory, from good clean data (that is a big if) through to the final predictions.

## The full picture

With cross validation we can show you the full picture of model building (after you have done the hard work of data munging). The magic that cross validation unlocks is twofold:

1. It allows you to train on more of your data, which tends to improve the model and gives you a more reliable estimate of its performance.
2. It actually simplifies the process. You no longer need to keep three sets of data (train, validation and test) and can get by with just two (train and test).
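To make the idea concrete, here is a minimal sketch of how k-fold cross validation carves up the training data. The helper `k_fold_indices` is a hypothetical name written for this illustration, not a library function; in practice scikit-learn's `KFold` does this job (and can shuffle as well).

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, val_idx) pairs, using each fold once for validation."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [
        n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
        for i in range(n_folds)
    ]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(n_samples=10, n_folds=5))
for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)
```

Every sample is used for validation exactly once and for training in the other rounds, which is why a separate validation set is no longer needed.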

Let's get started:

In [2]:
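# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older scikit-learn version to run as written
from sklearn.datasets import load_boston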
from sklearn.model_selection import train_test_split


boston_data = load_boston()

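# hold out 20% of the data as the test set
X_train, X_test, y_train, y_test = train_test_split(boston_data['data'], boston_data['target'], test_size=0.2, random_state=1)

# split off a validation set from the remaining training data
# (X_val and y_val are not used later in this notebook, since GridSearchCV cross-validates on the training data)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=1)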


    The Boston housing prices dataset has an ethical problem. You can refer to
    the documentation of this function for further details.

    The scikit-learn maintainers therefore strongly discourage the use of this
    dataset unless the purpose of the code is to study and educate about
    ethical issues in data science and machine learning.

    In this special case, you can fetch the dataset from the original
    source::

        import pandas as pd
        import numpy as np


        data_url = "http://lib.stat.cmu.edu/datasets/boston"
        raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
        data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
        target = raw_df.values[1::2, 2]

    Alternative datasets include the California housing dataset (i.e.
    :func:`~sklearn.datasets.fetch_california_housing`) and the Ames housing
    dataset. You can load the datasets as follows::

        from sklearn.datasets import fetch_california_h

Our next step will be to define the model that we are looking at:

In [3]:
from sklearn.tree import DecisionTreeRegressor

reg = DecisionTreeRegressor()

Then we determine which parameters we would like to search over:

In [4]:
params = {
    'max_depth': range(2, 20, 2),
    'min_samples_leaf': range(5, 25, 5)
}
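As a quick sanity check on the size of this search, the sketch below enumerates the grid with the standard library. The five folds are an assumption based on the default `cv` in recent scikit-learn versions.

```python
from itertools import product

params = {
    'max_depth': range(2, 20, 2),
    'min_samples_leaf': range(5, 25, 5),
}

# Every pairing of one max_depth value with one min_samples_leaf value.
combinations = [dict(zip(params, values)) for values in product(*params.values())]

n_folds = 5  # assumption: the default number of folds in recent scikit-learn
total_fits = len(combinations) * n_folds

print(len(combinations), "parameter combinations")
print(total_fits, "model fits during the search")
```

Nine depths times four leaf sizes gives 36 combinations, so the cost of a grid search grows quickly as parameters are added.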

And finally we use GridSearchCV, which tries every combination of these parameters and scores each one with cross validation:

In [5]:
from sklearn.model_selection import GridSearchCV

gs = GridSearchCV(reg, params, scoring='neg_mean_absolute_error')

In [6]:
gs.fit(X_train, y_train)

GridSearchCV(estimator=DecisionTreeRegressor(),
             param_grid={'max_depth': range(2, 20, 2),
                         'min_samples_leaf': range(5, 25, 5)},
             scoring='neg_mean_absolute_error')
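Roughly speaking, the search amounts to a loop over the grid with a cross-validated score for each candidate. The sketch below spells that loop out by hand. It uses synthetic data from `make_regression` and a smaller grid as stand-ins (both assumptions, not the notebook's data), and it leaves out the final refit that GridSearchCV performs.

```python
from itertools import product

from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption, not the notebook's dataset).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

params = {'max_depth': [2, 4, 6], 'min_samples_leaf': [5, 10]}

results = []
for values in product(*params.values()):
    candidate = dict(zip(params, values))
    model = DecisionTreeRegressor(random_state=0, **candidate)
    # Five-fold CV; scores are negated MAE, so closer to zero is better.
    scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_absolute_error')
    results.append((scores.mean(), candidate))

# Pick the candidate with the highest (least negative) mean score.
best_score, best_params = max(results, key=lambda r: r[0])
print(best_params, best_score)
```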

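We get a lot of goodies. We can see the best score and estimator. Because the scoring is neg_mean_absolute_error, the score is the negated mean absolute error averaged over the folds, so values closer to zero are better: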

In [7]:
gs.best_score_

-2.8928227256649848

In [8]:
gs.best_estimator_

DecisionTreeRegressor(max_depth=6, min_samples_leaf=5)
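Beyond the single best result, the fitted search keeps the scores for every candidate in `cv_results_`. The sketch below ranks the candidates, again on synthetic stand-in data and a reduced grid (both assumptions made so the example is self-contained).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption, not the notebook's dataset).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

gs = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    {'max_depth': [2, 4, 6], 'min_samples_leaf': [5, 10]},
    scoring='neg_mean_absolute_error',
)
gs.fit(X, y)

# rank_test_score is 1 for the best candidate; sort candidates by rank.
order = np.argsort(gs.cv_results_['rank_test_score'], kind='stable')
for idx in order[:3]:
    params_i = gs.cv_results_['params'][idx]
    mean_i = gs.cv_results_['mean_test_score'][idx]
    print(params_i, round(float(mean_i), 3))
```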

And we get to use the grid search object as that estimator as well, because by default it refits the best parameters on the full training data:

In [9]:
gs.predict(X_train[:5])

array([15.55      , 16.37777778, 21.72222222, 19.728     , 34.98571429])
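The last step of the pipeline is to score the tuned model on the held-out test set, which the search never saw. Below is a minimal sketch of the two-set workflow end to end. The synthetic data and the reduced grid are assumptions chosen to keep the example self-contained.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption, not the notebook's dataset).
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# Only two sets: cross validation replaces the separate validation split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

gs = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    {'max_depth': [2, 4, 6], 'min_samples_leaf': [5, 10]},
    scoring='neg_mean_absolute_error',
)
gs.fit(X_train, y_train)

cv_mae = -gs.best_score_  # flip the sign to get a plain MAE
test_mae = mean_absolute_error(y_test, gs.predict(X_test))
print(f"cross-validated MAE: {cv_mae:.2f}, test MAE: {test_mae:.2f}")
```

If the two numbers are far apart, that is a hint the cross-validated estimate was optimistic for the chosen parameters.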

## A note on hyperparam tuning

Grid search may start to look a bit old school over the next few years. With advancements like random search, Hyperband, Bayesian hyperparameter search and more, we might use a more advanced way to search through the available params. That being said, grid search is good to know and still widely used in ML.
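As one example of the alternatives, here is a sketch of random search with scikit-learn's `RandomizedSearchCV`. Instead of trying every combination, it samples a fixed number of candidates from distributions. The synthetic data, the `n_iter=10` budget and the use of `scipy.stats.randint` over the same ranges as the grid above are choices made for this illustration.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption, not the notebook's dataset).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

param_distributions = {
    'max_depth': randint(2, 20),         # integers from 2 to 19
    'min_samples_leaf': randint(5, 25),  # integers from 5 to 24
}

rs = RandomizedSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_distributions,
    n_iter=10,  # number of sampled candidates, fixed in advance
    scoring='neg_mean_absolute_error',
    random_state=0,
)
rs.fit(X, y)
print(rs.best_params_, rs.best_score_)
```

The budget stays at ten candidates no matter how many parameters are added, which is the main appeal over an exhaustive grid.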

## Model Explainer

In [10]:
import lime
import lime.lime_tabular
import numpy as np

In [None]:
# Treat any column with at most 10 distinct values as categorical
unique_counts = np.array([len(set(boston_data.data[:, x])) for x in range(boston_data.data.shape[1])])
categorical_features = np.argwhere(unique_counts <= 10).flatten()
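The heuristic in the cell above counts distinct values per column and flags the columns with few of them. Here is a small sketch of the same idea on a made-up array (the three toy columns are assumptions chosen for illustration), so the effect is easy to see.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 50

# Toy data: a binary flag, a continuous measurement, and a small integer code.
data = np.column_stack([
    rng.integers(0, 2, size=n_rows),  # at most 2 distinct values
    rng.normal(size=n_rows),          # continuous, many distinct values
    rng.integers(1, 5, size=n_rows),  # at most 4 distinct values
])

# Count distinct values in each column, then keep the low-cardinality columns.
unique_counts = np.array([len(np.unique(data[:, j])) for j in range(data.shape[1])])
categorical_features = np.argwhere(unique_counts <= 10).flatten()

print("distinct values per column:", unique_counts)
print("columns treated as categorical:", categorical_features)
```

Columns 0 and 2 are flagged, while the continuous column is not. The cutoff of 10 is a rule of thumb, so a numeric column that happens to take few values would also be flagged.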

In [None]:
explainer = lime.lime_tabular.LimeTabularExplainer(
    X_train,
    feature_names=boston_data.feature_names,
    class_names=['price'],
    categorical_features=categorical_features,
    verbose=True,
    mode='regression',
)

In [None]:
# explain the model's prediction for a single test row, using the 5 most influential features
i = 15
exp = explainer.explain_instance(X_test[i], gs.predict, num_features=5)

In [None]:
exp.show_in_notebook(show_table=True)

In [None]:
exp.as_list()