# Getting Started

From this guide: https://scikit-learn.org/stable/getting_started.html

In [2]:
import sklearn
print("sklearn version:", sklearn.__version__)

sklearn version: 0.24.2


### 1. Estimator

In [3]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=0)

In [9]:
X = [[ 1,  2,  3],  # 2 samples, 3 features
     [11, 12, 13]]

y = [0, 1]  # classes of each sample

print(X)
print(y)

[[1, 2, 3], [11, 12, 13]]
[0, 1]


`.fit()` call:

In [10]:
clf.fit(X, y)

RandomForestClassifier(random_state=0)

`.predict()` call:

In [11]:
clf.predict(X)  # predict classes of the training data

array([0, 1])

In [12]:
clf.predict([[4, 5, 6], [14, 15, 16]])  # predict classes of new data

array([0, 1])

### 2. Transformers (pre-processors)

Note this quote:
> A typical pipeline consists of a pre-processing step that transforms or imputes the data, and *a final predictor that predicts target values*.

**💡 Design note:**
In scikit-learn, pre-processors and transformers follow the same API as estimator objects (they all inherit from the same `BaseEstimator` class).
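The shared base class can be checked directly. The snippet below is a minimal sketch (the variable names `scaler`, `forest`, `shares_base`, and `scaler_params` are illustrative, not from the guide): it confirms that a transformer and a predictor both derive from `BaseEstimator`, and that both get `get_params` from it.

```python
from sklearn.base import BaseEstimator
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
forest = RandomForestClassifier(random_state=0)

# both the transformer and the predictor derive from BaseEstimator
shares_base = isinstance(scaler, BaseEstimator) and isinstance(forest, BaseEstimator)
print(shares_base)

# BaseEstimator supplies get_params / set_params for every estimator
scaler_params = scaler.get_params()
print(scaler_params)

# the difference is in the methods: transformers expose transform,
# predictors expose predict
print(hasattr(scaler, "transform"), hasattr(forest, "predict"))
```

What differs between the two kinds of object is the method used after `fit`: `transform` for pre-processors, `predict` for predictors.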

In [13]:
from sklearn.preprocessing import StandardScaler

In [14]:
X = [[0, 15],
     [1, -10]]
# scale data according to computed scaling values
StandardScaler().fit(X).transform(X)

array([[-1.,  1.],
       [ 1., -1.]])
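To see where these values come from, the same scaling can be reproduced by hand. This is a sketch using NumPy, not part of the original guide: each column has its mean subtracted and is then divided by its population standard deviation (`ddof=0`), which is what `StandardScaler` stores in `mean_` and `scale_`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[0, 15],
              [1, -10]], dtype=float)

# per-column mean and population standard deviation (ddof=0)
mean = X.mean(axis=0)
std = X.std(axis=0)
manual = (X - mean) / std

scaler = StandardScaler().fit(X)
scaled = scaler.transform(X)

print(scaler.mean_)   # statistics learned during fit
print(scaler.scale_)
print(np.allclose(manual, scaled))
```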

### 3. Pipelines

> The pipeline offers the same API as a regular estimator: it can be fitted and used for prediction with `fit` and `predict`.

In the following example, we load the Iris dataset, split it into train and test sets, and compute the accuracy score of a pipeline on the test data:

In [15]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [16]:
# create a pipeline object
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression()
)

In [17]:
# load the iris dataset and split it into train and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [18]:
# fit the whole pipeline
pipe.fit(X_train, y_train)

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('logisticregression', LogisticRegression())])

In [19]:
# we can now use it like any other estimator
accuracy_score(y_test, pipe.predict(X_test))  # signature is (y_true, y_pred)

0.9736842105263158
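As a rough illustration of what the pipeline does internally, the sketch below repeats the two steps by hand and compares the predictions. The names `scaler`, `model`, `manual_pred`, `same_predictions`, and `step_names` are illustrative. The key point is that the scaler is fitted on the training data only and then reused, unchanged, on the test data.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)

# the same two steps done by hand: fit the scaler on the training data only
scaler = StandardScaler().fit(X_train)
model = LogisticRegression().fit(scaler.transform(X_train), y_train)
manual_pred = model.predict(scaler.transform(X_test))

same_predictions = np.array_equal(pipe.predict(X_test), manual_pred)
print(same_predictions)

# make_pipeline names each step after its lowercased class name
step_names = list(pipe.named_steps)
print(step_names)
```

The step names printed at the end match the ones in the `Pipeline(steps=[...])` repr above.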

### 4. Automatic parameter searches

Scikit-learn provides tools to automatically find the best parameter combinations **(via cross-validation)**. In the following example, we randomly search over the parameter space of a random forest with a `RandomizedSearchCV` object. 

When the search is over, the `RandomizedSearchCV` object behaves like a `RandomForestRegressor` that has been fitted with the best set of parameters.

In [20]:
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split
from scipy.stats import randint

In [21]:
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [22]:
# define the parameter space that will be searched over
param_distributions = {'n_estimators': randint(1, 5),
                       'max_depth': randint(5, 10)}

In [24]:
# now create a searchCV object and fit it to the data
search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    n_iter=5,
    param_distributions=param_distributions,
    random_state=0
)
search.fit(X_train, y_train)

RandomizedSearchCV(estimator=RandomForestRegressor(random_state=0), n_iter=5,
                   param_distributions={'max_depth': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fe751835bb0>,
                                        'n_estimators': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fe751265700>},
                   random_state=0)

In [25]:
search.best_params_

{'max_depth': 9, 'n_estimators': 4}

In [26]:
# the search object now acts like a normal random forest estimator
# with max_depth=9 and n_estimators=4
search.score(X_test, y_test)

0.735363411343253
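The fitted search object can also be inspected. The sketch below assumes synthetic data from `make_regression` as a stand-in (so nothing needs to be downloaded), so its scores are not comparable to the California housing result above. It shows the `cv_results_`, `best_estimator_`, and `best_params_` attributes, and that `predict` on the search object delegates to the refitted best estimator.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# synthetic stand-in data (an assumption, used to avoid a dataset download)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    n_iter=5,
    param_distributions={'n_estimators': randint(1, 5),
                         'max_depth': randint(5, 10)},
    random_state=0,
)
search.fit(X, y)

# cv_results_ holds one entry per sampled candidate
n_candidates = len(search.cv_results_['params'])
print(n_candidates)

# with the default refit=True, the winning candidate is refitted on all
# the data passed to fit and stored as best_estimator_
best = search.best_estimator_
print(type(best).__name__, search.best_params_)

# predict (and score) on the search object delegate to best_estimator_
delegates = bool((search.predict(X[:3]) == best.predict(X[:3])).all())
print(delegates)
```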

#### ‼️ Note: `Pipeline`s, hyperparameter search, and leakage

In practice, you almost always want to search over a pipeline instead of a single estimator. One of the main reasons is that if you apply a pre-processing step to the whole dataset without using a pipeline, and then perform any kind of cross-validation, you break the fundamental assumption of independence between training and testing data. Since the data was pre-processed using the whole dataset, some information about the test sets is available to the train sets. This leads to over-estimating the generalization power of the estimator (you can read more in this Kaggle post).

Using a pipeline for cross-validation and searching will largely keep you from this common pitfall.
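A minimal sketch of searching over a pipeline follows. It reuses the Iris pipeline from section 3. The candidate values for `C` and the `max_iter=1000` setting are illustrative assumptions, not recommendations from the guide. Parameters of a pipeline step are addressed as `<step name>__<parameter name>`.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# step parameters are addressed as <step name>__<parameter name>;
# the candidate values below are illustrative
param_distributions = {'logisticregression__C': [0.01, 0.1, 1.0, 10.0]}

search = RandomizedSearchCV(pipe, param_distributions, n_iter=3, random_state=0)

# in every cross-validation split, the whole pipeline (scaler included)
# is re-fitted on that split's training part only
search.fit(X_train, y_train)

print(search.best_params_)
test_accuracy = search.score(X_test, y_test)
print(test_accuracy)
```

Because the scaler lives inside the pipeline, its mean and scale are never computed from the validation part of a split, which is the leakage described above.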