# Experiment 3

### Intro to Scikit-learn

### Getting Started

Scikit-learn is an open-source machine learning library that supports supervised and unsupervised learning. It also provides tools for model fitting, data preprocessing, model selection and evaluation, and many other utilities.

### Fitting and predicting: estimator basics

Scikit-learn provides dozens of built-in machine learning algorithms and models, called <b>estimators</b>. Each estimator can be fitted to some data using its fit method.

In [1]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=0)

x = [[1, 2, 3], [11, 12, 13]]  # 2 samples, 3 features
y = [0, 1]  # class label of each sample

clf.fit(x, y)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=0, verbose=0,
                       warm_start=False)

In [2]:
clf.predict(x)

array([0, 1])
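
Once fitted, the estimator does not have to be given the training data again. The following is a minimal sketch of predicting on unseen samples; the values in `X_new` are hypothetical and chosen only for illustration.

```python
from sklearn.ensemble import RandomForestClassifier

# Same toy data as above: 2 samples, 3 features
X = [[1, 2, 3], [11, 12, 13]]
y = [0, 1]

clf = RandomForestClassifier(random_state=0)
clf.fit(X, y)

# Hypothetical unseen samples (illustrative values, not from the original)
X_new = [[4, 5, 6], [14, 15, 16]]

predictions = clf.predict(X_new)          # one class label per row
probabilities = clf.predict_proba(X_new)  # one row per sample, one column per class

print(predictions)
print(clf.classes_)   # column order of predict_proba
print(probabilities)
```

The columns of `predict_proba` follow the order of `clf.classes_`, and each row sums to 1.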

### Transformers and pre-processors

Transformers and estimators (predictors) can be combined into a single unifying object: a <b>Pipeline</b>. The pipeline offers the same API as a regular estimator: it can be fitted and used for prediction with fit and predict. Using a pipeline also protects you from data leakage, i.e. disclosing some testing data in your training data.
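
Before chaining anything, it may help to see a transformer on its own. The sketch below assumes a tiny hand-made array (not from the original) and uses StandardScaler, which learns per-column statistics in fit and applies them in transform.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative data: 2 samples, 2 features on very different scales
X = np.array([[0, 15],
              [1, -10]])

scaler = StandardScaler()
scaler.fit(X)                    # learns the mean and standard deviation of each column
X_scaled = scaler.transform(X)   # subtracts the mean and divides by the std

print(scaler.mean_)   # [0.5 2.5]
print(X_scaled)       # [[-1.  1.]
                      #  [ 1. -1.]]
```

A transformer exposes transform instead of predict, which is what lets a pipeline pass its output on to the next step.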

In the following example, we load the Iris dataset, split it into train and test sets, and compute the accuracy score of a pipeline on the test data:

In [3]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [4]:
# create a pipeline object
pipe = make_pipeline(StandardScaler(), LogisticRegression(random_state=0))

# Load iris data set and split it into train and test sets
X, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=.3)

# fit the whole pipeline
pipe.fit(x_train, y_train)

Pipeline(memory=None,
         steps=[('standardscaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('logisticregression',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                    l1_ratio=None, max_iter=100,
                                    multi_class='auto', n_jobs=None,
                                    penalty='l2', random_state=0,
                                    solver='lbfgs', tol=0.0001, verbose=0,
                                    warm_start=False))],
         verbose=False)

In [5]:
pipe.predict(x_test) == y_test

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True, False,  True,  True,  True,  True,  True,  True,  True])

In [6]:
# the pipeline can now be used like any other estimator
# accuracy_score expects (y_true, y_pred)
accuracy_score(y_test, pipe.predict(x_test))

0.9777777777777777
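
To make the data-leakage point concrete, here is a small sketch that inspects the scaler inside the fitted pipeline. It assumes the same Iris split as above and relies on make_pipeline naming each step after its lowercased class name.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.3)

pipe = make_pipeline(StandardScaler(), LogisticRegression(random_state=0))
pipe.fit(x_train, y_train)

# Retrieve the fitted scaler step by its auto-generated name
scaler = pipe.named_steps["standardscaler"]

# The scaler's statistics were computed from the training split only
fitted_on_train_only = np.allclose(scaler.mean_, x_train.mean(axis=0))

# For comparison: the same check against statistics of the full dataset
matches_full_data = np.allclose(scaler.mean_, X.mean(axis=0))

print(fitted_on_train_only, matches_full_data)
```

Because the scaler never sees `x_test` during fit, the test rows are transformed with statistics learned from the training rows alone.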

### Model evaluation

Fitting a model to some data does not entail that it will predict well on unseen data. This needs to be directly evaluated. We have just seen the train_test_split helper that splits a dataset into train and test sets, but scikit-learn provides many other tools for model evaluation, in particular for <b>cross-validation</b>.

Here we briefly show how to perform a 5-fold cross-validation procedure using the cross_validate helper. Note that it is also possible to manually iterate over the folds, use different data splitting strategies, and use custom scoring functions.

In [7]:
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

In [8]:
X, y = make_regression(n_samples=10000, random_state=0)
print("Shape of dataset: ",X.shape)

lr = LinearRegression()
result = cross_validate(lr, X, y)  # defaults to 5-fold CV
result["test_score"]

Shape of dataset:  (10000, 100)


array([1., 1., 1., 1., 1.])
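
The alternatives mentioned above (a different splitting strategy, another scoring function, and manual iteration over the folds) could look roughly like the sketch below. The dataset size, the noise level, and the choice of KFold with shuffling are assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_validate

# Small noisy synthetic dataset (sizes are illustrative)
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# 1) A different splitting strategy and a non-default scorer
cv = KFold(n_splits=5, shuffle=True, random_state=0)
result = cross_validate(LinearRegression(), X, y, cv=cv,
                        scoring="neg_mean_squared_error")
mse_per_fold = -result["test_score"]  # the scorer returns negated MSE

# 2) Manual iteration over the same folds, scoring with R^2
manual_scores = []
for train_idx, test_idx in cv.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    manual_scores.append(r2_score(y[test_idx], model.predict(X[test_idx])))

print(mse_per_fold)
print(manual_scores)
```

Scikit-learn scorers follow a "greater is better" convention, which is why the mean squared error is exposed in negated form.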

### Automatic parameter searches

All estimators have parameters (often called hyper-parameters in the literature) that can be tuned. The generalization power of an estimator often critically depends on a few parameters. For example, a <b>RandomForestRegressor</b> has an n_estimators parameter that determines the number of trees in the forest, and a max_depth parameter that determines the maximum depth of each tree. Quite often, it is not clear what the exact values of these parameters should be, since they depend on the data at hand.
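
As a quick illustration, parameters can be set in the constructor and then read or changed through get_params and set_params. The specific values below are arbitrary.

```python
from sklearn.ensemble import RandomForestRegressor

# Parameters are passed as keyword arguments when the estimator is created
forest = RandomForestRegressor(n_estimators=10, max_depth=3, random_state=0)

# get_params returns a dict of every parameter and its current value
params = forest.get_params()
print(params["n_estimators"], params["max_depth"])  # 10 3

# set_params changes parameters on an existing estimator
forest.set_params(max_depth=5)
print(forest.get_params()["max_depth"])  # 5
```

Search tools rely on this same get_params / set_params interface to try out candidate values.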

Scikit-learn provides tools to automatically find the best parameter combinations (via cross-validation). In the following example, we randomly search over the parameter space of a random forest with a <b>RandomizedSearchCV</b> object. When the search is over, the RandomizedSearchCV behaves as a <b>RandomForestRegressor</b> that has been fitted with the best set of parameters.

In [9]:
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split
from scipy.stats import randint

In [10]:
X, y = fetch_california_housing(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=1, test_size=.3)

# define the parameter space that will be searched over
param_distributions = {'n_estimators': randint(1,5), 'max_depth':randint(5, 10)}

# now create a searchCV object and fit it to the data
search = RandomizedSearchCV(estimator=RandomForestRegressor(random_state=0),
                             n_iter=5,
                             param_distributions=param_distributions,
                             random_state=0)

search.fit(x_train, y_train)

RandomizedSearchCV(cv=None, error_score=nan,
                   estimator=RandomForestRegressor(bootstrap=True,
                                                   ccp_alpha=0.0,
                                                   criterion='mse',
                                                   max_depth=None,
                                                   max_features='auto',
                                                   max_leaf_nodes=None,
                                                   max_samples=None,
                                                   min_impurity_decrease=0.0,
                                                   min_impurity_split=None,
                                                   min_samples_leaf=1,
                                                   min_samples_split=2,
                                                   min_weight_fraction_leaf=0.0,
                                                   n_estimators=100,
                           

In [11]:
search.best_params_

{'max_depth': 9, 'n_estimators': 4}

In [12]:
# the search object now behaves like a normal random forest estimator
# with max_depth=9 and n_estimators=4

search.score(x_test, y_test)

0.7494733952983665
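
Beyond best_params_, a fitted search object exposes the refitted model and the per-candidate results. The sketch below swaps in a synthetic make_regression dataset (an assumption, so that it runs without downloading the housing data); the scores it prints are therefore not comparable to the ones above.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the housing data (illustrative sizes)
X, y = make_regression(n_samples=300, n_features=8, noise=5.0, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=1, test_size=0.3)

param_distributions = {"n_estimators": randint(1, 5), "max_depth": randint(5, 10)}
search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    n_iter=5,
    param_distributions=param_distributions,
    random_state=0,
)
search.fit(x_train, y_train)

# With the default refit=True, the best candidate is retrained on all of x_train
best_model = search.best_estimator_

# Mean cross-validated score of the best candidate
print(search.best_params_, search.best_score_)

# One entry per sampled candidate
results = search.cv_results_
for params, mean_score in zip(results["params"], results["mean_test_score"]):
    print(params, round(mean_score, 3))

# search.score delegates to the refitted best estimator
same_score = search.score(x_test, y_test) == best_model.score(x_test, y_test)
print(same_score)
```

Note that scipy's randint excludes its upper bound, so randint(1, 5) samples n_estimators from 1 to 4.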

<b>Note:</b> In practice, you almost always want to search over a pipeline instead of a single estimator. One of the main reasons is that if you apply a pre-processing step to the whole dataset without using a pipeline, and then perform any kind of cross-validation, you break the fundamental assumption of independence between training and testing data. Indeed, since you pre-processed the data using the whole dataset, some information about the test sets is available to the train sets. This leads to over-estimating the generalization power of the estimator.
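
A search over a pipeline might be sketched as follows. The synthetic dataset and the choice of StandardScaler as the pre-processing step are assumptions for illustration; the parameter names use the `<step name>__<parameter>` convention, with step names generated by make_pipeline.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data so the sketch runs offline (illustrative sizes)
X, y = make_regression(n_samples=300, n_features=8, noise=5.0, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=1, test_size=0.3)

# Pre-processing and model are bundled, so the scaler is refit inside every CV fold
pipe = make_pipeline(StandardScaler(), RandomForestRegressor(random_state=0))

# Parameters of a pipeline step are addressed as "<step name>__<parameter>"
param_distributions = {
    "randomforestregressor__n_estimators": randint(1, 5),
    "randomforestregressor__max_depth": randint(5, 10),
}

search = RandomizedSearchCV(pipe, param_distributions=param_distributions,
                            n_iter=5, random_state=0)
search.fit(x_train, y_train)

print(search.best_params_)
test_score = search.score(x_test, y_test)
print(test_score)
```

Within each fold the scaler is fitted only on that fold's training portion, so no statistics from the held-out portion reach the model.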