## Fitting and predicting: estimator basics

In [1]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=0)

X = [[ 1,  2,  3],  # 2 samples, 3 features
     [11, 12, 13]]
y = [0, 1]  # classes of each sample

clf.fit(X, y)

RandomForestClassifier(random_state=0)

- The fit method generally accepts 2 inputs:
    - The samples matrix X, of shape (n_samples, n_features): samples are represented as rows, features as columns.
    - The target values y: real numbers for regression tasks, or integers (or any other discrete set of values) for classification.
        - For unsupervised learning tasks, y does not need to be specified.
        - y is usually a 1d array where the ith entry corresponds to the target of the ith sample (row) of X.
- Both X and y are usually expected to be numpy arrays or equivalent array-like data types. Some estimators also work with other formats such as sparse matrices.
- Once the estimator is fitted, it can be used to predict target values of new data. You don’t need to re-train the estimator.

In [2]:
clf.predict(X)  # predict classes of the training data

array([0, 1])

In [3]:
clf.predict([[4, 5, 6], [14, 15, 16]])  # predict classes of new data

array([0, 1])
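The shape conventions for X and y described above can be checked directly. The following is a minimal sketch that reuses the toy data from this section, converted to NumPy arrays; the attributes `classes_` and `n_features_in_` are set by scikit-learn during `fit`, and the variable names are otherwise arbitrary.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Same toy data as above, written as explicit NumPy arrays
X = np.array([[1, 2, 3],
              [11, 12, 13]])
y = np.array([0, 1])

print(X.shape)  # (2, 3): (n_samples, n_features)
print(y.shape)  # (2,): one target per sample

clf = RandomForestClassifier(random_state=0).fit(X, y)

# Attributes learned during fit end with a trailing underscore
print(clf.classes_)        # the distinct class labels seen in y
print(clf.n_features_in_)  # number of columns of X seen during fit
```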

## Transformers and pre-processors
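- A typical pipeline has a pre-processing step that transforms or imputes the data, and a final predictor that predicts target values.
- Pre-processors and transformers follow the same API as estimators, but expose a transform method instead of predict.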

In [4]:
from sklearn.preprocessing import StandardScaler
X = [[0, 15], [1, -10]]

StandardScaler().fit(X).transform(X)

array([[-1.,  1.],
       [ 1., -1.]])
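To make the result above less opaque, here is a small sketch that reproduces the scaling by hand: each column has its mean subtracted and is divided by its (population) standard deviation. It assumes the same two-row X as the cell above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[0, 15], [1, -10]], dtype=float)

scaler = StandardScaler().fit(X)

# Per-column statistics learned during fit
print(scaler.mean_)   # column means
print(scaler.scale_)  # column standard deviations (ddof=0)

# Manual standardization: subtract the mean, divide by the std
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X_manual, scaler.transform(X)))  # True
```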

## Pipelines: chaining pre-processors and estimators
- Transformers and estimators (predictors) can be combined into a Pipeline.
    - Pipelines have the same API as regular estimators: they can be fitted and used for prediction with fit and predict. 
    - As we will see later, pipelines also help protect you from data leakage (disclosing test data in your training data).
- Below: load the Iris dataset, split it into train and test sets, and compute the accuracy score of a pipeline on the test data.

In [5]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# create a pipeline object
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=0)
)

# load the iris dataset and split it into train and test sets
X, y                             = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# fit the whole pipeline
pipe.fit(X_train, y_train)

# we can now use it like any other estimator
accuracy_score(y_test, pipe.predict(X_test))  # signature is (y_true, y_pred)

0.9736842105263158
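As a hedged follow-up to the cell above, the sketch below looks inside the fitted pipeline. It rebuilds the same pipeline so that it runs on its own, and relies on `make_pipeline` naming each step after its lowercased class name. It shows that the scaler only ever saw the training split, and that `pipe.score` gives the same accuracy as the explicit `accuracy_score` call.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = make_pipeline(StandardScaler(), LogisticRegression(random_state=0))
pipe.fit(X_train, y_train)

# Step names generated by make_pipeline
print(list(pipe.named_steps))

# The scaler inside the pipeline was fitted on the training rows only
scaler = pipe.named_steps["standardscaler"]
print(scaler.n_samples_seen_, len(X_train))

# Pipeline.score delegates to the final estimator (mean accuracy here)
acc_manual = accuracy_score(y_test, pipe.predict(X_test))
acc_score = pipe.score(X_test, y_test)
print(acc_manual, acc_score)
```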

## Model evaluation
- Fitting a model does not ensure that it will predict well on unseen data. This needs to be evaluated. We just saw the **train_test_split** helper that splits a dataset into train and test sets.
- Below: perform a **5-fold cross-validation** using the **cross_validate** helper. You can also manually iterate over the folds, use different data splitting strategies, and use custom scoring functions.

In [6]:
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

X, y   = make_regression(n_samples=1000, random_state=0)
lr     = LinearRegression()
result = cross_validate(lr, X, y)  # defaults to 5-fold CV
result['test_score']  # r_squared score is high because dataset is easy

array([1., 1., 1., 1., 1.])
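The alternatives mentioned above (manual iteration over folds, a different splitting strategy, a different metric) might look roughly like the following sketch. The choice of a shuffled `KFold` and of the built-in `"neg_mean_absolute_error"` scorer is an assumption made for illustration, not something the original cell uses.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_validate

X, y = make_regression(n_samples=1000, random_state=0)

# A different splitting strategy: shuffled 5-fold instead of contiguous folds
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# Manual iteration over the folds, fitting a fresh model on each one
manual_scores = []
for train_idx, test_idx in kf.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    manual_scores.append(r2_score(y[test_idx], model.predict(X[test_idx])))
print(np.round(manual_scores, 3))

# The same splitter passed to cross_validate, with a non-default metric
# (scikit-learn negates error metrics so that higher is always better)
result = cross_validate(LinearRegression(), X, y, cv=kf,
                        scoring="neg_mean_absolute_error")
print(result["test_score"])
```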

## Automatic parameter searches
- All estimators have tunable parameters (aka hyper-parameters). The generalization power of an estimator often depends on a few parameters. 
    - For example, a **RandomForestRegressor** has an **n_estimators** parameter that sets the number of trees in the forest, and a **max_depth** parameter that sets the maximum depth of each tree. It's usually unclear what the exact values of these parameters should be, since they depend on the data at hand.

- Scikit-learn can automatically find the best parameter combinations (via cross-validation). 
- Below: randomly search the parameter space of a random forest with a **RandomizedSearchCV** object. When the search is over, the **RandomizedSearchCV** object behaves like a **RandomForestRegressor** fitted with the best parameter set found.

In [7]:
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split
from scipy.stats import randint

X, y                             = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# define the parameter space
param_distributions = {'n_estimators': randint(1, 5),
                       'max_depth': randint(5, 10)}

# now create a searchCV object and fit it to the data
search = RandomizedSearchCV(estimator           = RandomForestRegressor(random_state=0),
                            n_iter              = 5,
                            param_distributions = param_distributions,
                            random_state        = 0)
search.fit(X_train, y_train)

print(search.best_params_)

# the search object now acts like a normal random forest estimator
# with max_depth=9 and n_estimators=4
print(search.score(X_test, y_test))

{'max_depth': 9, 'n_estimators': 4}
0.735363411343253


- In practice, you almost always want to search over a pipeline, instead of a single estimator. 
    - If you apply a pre-processing step to the whole dataset without using a pipeline, and then perform any kind of cross-validation, you would be breaking the fundamental assumption of independence between training and testing data. 
    - Since you pre-processed the data using the whole dataset, some information about the test sets is available to the train sets.
    - This leads to over-estimating the generalization power of the estimator.
    - Using a pipeline for cross-validation and searching will largely keep you from this common pitfall.
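A minimal sketch of searching over a pipeline rather than a single estimator is shown below. It assumes a small synthetic dataset from `make_regression` (chosen only so the example runs quickly) and adds a `StandardScaler` step purely for illustration, since random forests do not need scaled inputs. Parameters of a step are addressed as `<step name>__<parameter>`.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic dataset (an assumption, not the housing data used above)
X, y = make_regression(n_samples=200, n_features=10, noise=10.0,
                       random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = make_pipeline(StandardScaler(), RandomForestRegressor(random_state=0))

# Step names come from make_pipeline (lowercased class names)
param_distributions = {
    "randomforestregressor__n_estimators": randint(1, 5),
    "randomforestregressor__max_depth": randint(5, 10),
}

# Within each CV split, the scaler is re-fitted on the training folds only
search = RandomizedSearchCV(estimator=pipe,
                            param_distributions=param_distributions,
                            n_iter=5,
                            random_state=0)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.score(X_test, y_test))
```

Because the whole pipeline is cloned and re-fitted for every split, the pre-processing statistics never include rows from the held-out fold.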

## Next steps
- [User Guide](https://scikit-learn.org/stable/user_guide.html#user-guide)
- [API](https://scikit-learn.org/stable/modules/classes.html#api-ref)
- [Examples](https://scikit-learn.org/stable/auto_examples/index.html#general-examples)
- [Tutorials](https://scikit-learn.org/stable/tutorial/index.html#tutorial-menu)