In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline


# Part 1. `scikit-learn`  basics

[`sklearn`](https://scikit-learn.org/stable/getting_started.html) is a very popular Python library that implements plenty of useful machine learning tools (models, preprocessing methods, etc.).

We will consider three useful building blocks of sklearn:
- Estimators 
- Transformers 
- Pipeline


### 1. Estimators

Models, aka estimators, are objects which perform model estimation. They have 2 main methods:

* Method `fit()`:
    - takes as input design matrix `X` and target values `y`
    - `X` is supposed to have the shape `(n_samples, n_features)`
    - for unsupervised models `y=None` by default
    - trains the model (aka finds optimal parameters)
    
* Method `predict()`:
    - takes as input design matrix `X`
    - returns predicted target values
    - makes sense to use it **after** the `fit()`
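
To illustrate the last point, here is a minimal sketch of what happens when `predict()` is called before `fit()`. The arrays `X_demo` and `y_demo` are made up for this illustration and are not part of the dataset used below; `NotFittedError` is imported from `sklearn.exceptions`.

```python
import numpy as np
from sklearn.exceptions import NotFittedError
from sklearn.linear_model import LinearRegression

# Hypothetical toy data (assumption for this sketch): y = 2x + 1
X_demo = np.array([[0.0], [1.0], [2.0]])  # shape (n_samples, n_features) = (3, 1)
y_demo = np.array([1.0, 3.0, 5.0])

model = LinearRegression()

# Calling predict() on an estimator that was never fitted raises NotFittedError
try:
    model.predict(X_demo)
    failed_before_fit = False
except NotFittedError:
    failed_before_fit = True
print(failed_before_fit)

# After fit() the same kind of call works
model.fit(X_demo, y_demo)
pred_demo = model.predict(np.array([[3.0]]))
print(pred_demo)
```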
     

In [2]:
# Consider a toy dataset
X = np.array([[ 1,  2,  3], [11, 12, 13]])
y = np.array([2.1, 11.9])
print(X.shape, y.shape)

X_test = np.array([[2, 3, 4]])
print(X_test.shape)

(2, 3) (2,)
(1, 3)


`LinearRegression()` is an example of an estimator. You'll find out more about this model next week.


Fit a linear regression and predict the target variable for the test dataset.

In [3]:
from sklearn.linear_model import LinearRegression

# Init the model and set all the hyperparameters (if there are any)
lr = LinearRegression()
# Train the model
lr.fit(X, y)

# make a prediction
lr.predict(X_test)

array([3.08])
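
After `fit()`, the learned parameters are stored on the estimator in attributes whose names end with an underscore. The sketch below repeats the toy data from above and checks that, for `LinearRegression`, the prediction is just the linear formula applied with `coef_` and `intercept_`. It is an illustration of the convention, not a full description of the model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same toy data as in the cells above
X = np.array([[1, 2, 3], [11, 12, 13]])
y = np.array([2.1, 11.9])
X_test = np.array([[2, 3, 4]])

# fit() returns the estimator itself, so the call can be chained
lr = LinearRegression().fit(X, y)

print(lr.coef_.shape)  # one weight per feature
print(lr.intercept_)   # scalar bias term

# Reproduce predict() by hand from the learned parameters
manual_pred = X_test @ lr.coef_ + lr.intercept_
print(np.allclose(manual_pred, lr.predict(X_test)))
```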

### 2. Transformers

Transformers are objects that are mostly used to pre-process and transform the data. Their interface is very similar to that of estimators, but they have a `transform` method instead of `predict`.

     
Most sklearn transformers can be found in the following modules:
- [sklearn.impute](https://scikit-learn.org/stable/modules/impute.html#impute) --- missing values imputation
- [sklearn.preprocessing](https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing) --- scaling, centering, normalization and binarization

We will discuss some of them in the course.
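
As a small preview of `sklearn.impute`, here is a minimal sketch that fills a missing value with the column mean using `SimpleImputer`. The array `X_missing` is a made-up example, not data from this notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical design matrix with one missing value in the second column
X_missing = np.array([[1.0, np.nan],
                      [3.0, 4.0],
                      [5.0, 6.0]])

# fit() learns the per-column means, transform() fills the gaps with them
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X_missing)
print(X_filled)
```

The missing entry is replaced by the mean of the observed values in its column, here `(4 + 6) / 2 = 5`.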

Scale the design matrix `X` using `StandardScaler`.

In [5]:
from sklearn.preprocessing import StandardScaler
# Init, fit and transform
sc = StandardScaler()

sc.fit(X)
print(X)
      
X_transformed = sc.transform(X)
print(X_transformed)

[[ 1  2  3]
 [11 12 13]]
[[-1. -1. -1.]
 [ 1.  1.  1.]]


In [6]:
# the same, but in 1 line
sc.fit_transform(X)

array([[-1., -1., -1.],
       [ 1.,  1.,  1.]])
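
A fitted transformer remembers the statistics it learned, so new data is transformed with the training statistics rather than its own. A minimal sketch with the same toy arrays, using the `mean_` and `scale_` attributes of `StandardScaler`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Same toy data as in the cells above
X = np.array([[1, 2, 3], [11, 12, 13]])
X_test = np.array([[2, 3, 4]])

sc = StandardScaler().fit(X)
print(sc.mean_)   # per-feature mean learned from X
print(sc.scale_)  # per-feature standard deviation learned from X

# X_test is scaled with the statistics of X, not re-fitted
X_test_scaled = sc.transform(X_test)
print(X_test_scaled)
```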

### 3. Pipelines


Usually, we have to apply several transformers to our dataset (e.g. fill in missing values, standardize all the numerical features) before we can train the model:

<img src="1_ML_pipe.png" width=600 height=600 />

Recall that each block here requires calling `fit` and `transform`/`predict` to produce an output.

`Pipeline` allows us to combine transformers and estimators in one object, so that we can call only one `fit` to train 'em all!
<img src="2_ML_pipe.png" width=600 height=600 />

Consider the following setting:

1. Scale the features with `StandardScaler`
2. Fit a linear regression model
3. Make a prediction on `X_test`
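
For comparison, the three steps can first be sketched by hand, without a `Pipeline`. The arrays are the same toy data as above; the variable name `manual_pred` is just for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Same toy data as in the cells above
X = np.array([[1, 2, 3], [11, 12, 13]])
y = np.array([2.1, 11.9])
X_test = np.array([[2, 3, 4]])

# 1. Fit the scaler on the training data and transform it
sc = StandardScaler()
X_scaled = sc.fit_transform(X)

# 2. Fit the regression on the scaled training data
lr = LinearRegression()
lr.fit(X_scaled, y)

# 3. Scale the test data with the already fitted scaler, then predict
manual_pred = lr.predict(sc.transform(X_test))
print(manual_pred)
```

Every intermediate result has to be passed along manually, and it is easy to forget to transform `X_test` before predicting.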

With the `Pipeline`:

In [8]:
from sklearn.pipeline import Pipeline
# create a pipeline
pipe = Pipeline([
    ('scaling', StandardScaler()),
    ('regression', LinearRegression())
])

# fit the whole pipeline
pipe.fit(X, y)

# we can now use it like any other estimator and make a prediction
pipe.predict(X_test)

array([3.08])
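
The fitted steps stay accessible by the names given when the pipeline was created. A minimal sketch, assuming the same step names `'scaling'` and `'regression'` as above, that pulls the steps out through `named_steps` and checks that chaining them by hand matches `pipe.predict`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same toy data and pipeline as in the cells above
X = np.array([[1, 2, 3], [11, 12, 13]])
y = np.array([2.1, 11.9])
X_test = np.array([[2, 3, 4]])

pipe = Pipeline([
    ('scaling', StandardScaler()),
    ('regression', LinearRegression())
])
pipe.fit(X, y)

# Each fitted step can be looked up by its name
scaler = pipe.named_steps['scaling']
reg = pipe.named_steps['regression']
print(scaler.mean_)

# Chaining the steps by hand gives the same result as pipe.predict
chained_pred = reg.predict(scaler.transform(X_test))
print(np.allclose(chained_pred, pipe.predict(X_test)))
```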

There is much more in `sklearn`; you can read about all the classes and functions in the [user guide](https://scikit-learn.org/stable/user_guide.html#user-guide).

In [9]:
from sklearn.preprocessing import OneHotEncoder

In [13]:
enc = OneHotEncoder(handle_unknown='ignore')
X = [['Male', 1], ['Female', 10], ['Female', 2]]
enc.fit_transform(X).toarray()

array([[0., 1., 1., 0., 0.],
       [1., 0., 0., 0., 1.],
       [1., 0., 0., 1., 0.]])
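
The output columns follow the order of the learned `categories_`, one block per input column. The sketch below also shows what `handle_unknown='ignore'` does at transform time: a category not seen during `fit` gets all zeros in its block. The row `['Other', 1]` is a made-up example, and the name `X_cat` is used here only to avoid overwriting `X`.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Same categorical toy data as in the cell above
X_cat = [['Male', 1], ['Female', 10], ['Female', 2]]

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(X_cat)

# Categories learned for each input column, in output-column order
print(enc.categories_)

# Hypothetical new row: 'Other' was never seen during fit, 1 was
X_new = [['Other', 1]]
encoded_new = enc.transform(X_new).toarray()
print(encoded_new)
```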

In [None]:
X = pd.Dataf