# "Sklearn"
> "Subheader"

- author: Christopher Thiemann
- toc: true
- branch: master
- badges: true
- comments: true
- categories: [python, ]
- hide: true
- search_exclude: true


In [None]:
!pip install -U scikit-learn

In [1]:
#hide
import warnings

import numpy as np
import scipy as sp
import sklearn
import statsmodels.api as sm
from statsmodels.formula.api import ols


import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

import seaborn as sns
sns.set_context("poster")
sns.set(rc={'figure.figsize': (16, 9.)})
sns.set_style("whitegrid")

import pandas as pd
pd.set_option("display.max_rows", 120)
pd.set_option("display.max_columns", 120)



In [59]:
from sklearn.compose import ColumnTransformer, make_column_selector, make_column_transformer
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

In [3]:
df = sns.load_dataset("car_crashes")
df.abbrev = df.abbrev.astype('category')

num_cols = df.select_dtypes('float').columns.to_list()
num_cols.remove('total')
dep = 'total'

df.head()

|  | total | speeding | alcohol | not_distracted | no_previous | ins_premium | ins_losses | abbrev |
|---|---|---|---|---|---|---|---|---|
| 0 | 18.8 | 7.332 | 5.64 | 18.048 | 15.04 | 784.55 | 145.08 | AL |
| 1 | 18.1 | 7.421 | 4.525 | 16.29 | 17.014 | 1053.48 | 133.93 | AK |
| 2 | 18.6 | 6.51 | 5.208 | 15.624 | 17.856 | 899.47 | 110.35 | AZ |
| 3 | 22.4 | 4.032 | 5.824 | 21.056 | 21.28 | 827.34 | 142.39 | AR |
| 4 | 12.0 | 4.2 | 3.36 | 10.92 | 10.68 | 878.41 | 165.63 | CA |


## Pipeline

### Basic Example

Each object except the last one needs to implement a `fit` and a `transform` method; the last one needs a `fit` and a `predict` method.
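
To make that contract concrete, here is a minimal sketch of a custom step. `LogTransformer` is a hypothetical class written for this illustration (it is not part of scikit-learn), and the data is synthetic.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline


class LogTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: applies log(1 + x) to every feature."""

    def fit(self, X, y=None):
        # Nothing to learn here; returning self lets calls be chained.
        return self

    def transform(self, X):
        return np.log1p(X)


# Synthetic, non-negative data (an assumption made only for this example).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(size=50)

# Intermediate step: fit + transform. Final step: fit + predict.
pipe_custom = Pipeline([("log", LogTransformer()), ("ridge", Ridge())])
pipe_custom.fit(X, y)
preds = pipe_custom.predict(X)
```

Inheriting from `TransformerMixin` provides `fit_transform` for free, and `BaseEstimator` provides `get_params` and `set_params`, which the hyperparameter search relies on.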

In [15]:
pipe = Pipeline(
    [
        ('scaler', StandardScaler()),  # standardize every feature
        ('ridge', Ridge()),            # final estimator
    ]
)

pipe.fit(df.loc[:, num_cols], df[dep])
pipe.predict(df.loc[:, num_cols])

array([17.86228382, 17.90200464, 18.7561831 , 22.73033036, 12.49048332,
       14.08798741, 10.9956788 , 17.03614203,  6.74224775, 18.12711261,
       15.90388986, 17.81620763, 16.09238586, 14.20969611, 15.39380929,
       15.56291131, 16.40187115, 17.96284823, 21.02572903, 14.65131058,
       13.56723847,  8.40320825, 12.98428004, 10.16498961, 16.55899852,
       16.15101221, 21.47392571, 15.70972648, 15.9815532 , 11.55023624,
       10.26867416, 18.30452552, 11.44977103, 16.24816342, 24.79908013,
       14.39453087, 20.59331809, 12.30851163, 18.30540464, 11.04240863,
       23.51067113, 19.39009105, 18.14712038, 19.72229907, 11.33877896,
       14.60754975, 12.97136277, 10.71060558, 22.93849817, 12.58163609,
       17.37071736])

In this case all numeric feature columns are first passed to the `StandardScaler`, and the scaled result is then passed to the `Ridge` regressor.
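
As a sanity check, the pipeline should be equivalent to calling the two steps by hand. The sketch below uses synthetic data rather than the car crashes dataset; the names `pipe_demo`, `scaler` and `ridge` exist only for this illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (assumption for this example only).
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=40)

# The pipeline version.
pipe_demo = Pipeline([("scaler", StandardScaler()), ("ridge", Ridge())])
pipe_demo.fit(X, y)

# The same steps done manually: fit_transform the scaler, then fit the regressor.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
ridge = Ridge().fit(X_scaled, y)
manual_preds = ridge.predict(scaler.transform(X))

# Both routes should give the same predictions.
same = np.allclose(pipe_demo.predict(X), manual_preds)
```

The benefit of the pipeline is that the scaler is refit on the training portion only whenever the whole object is cross-validated, which is easy to get wrong by hand.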

### Visualizing the Pipeline

In [5]:
from sklearn import set_config
set_config(display='diagram')
pipe

### Column Transformer

Usually we want to apply specific transformations to individual columns or subsets of columns of the feature matrix $X$. The `ColumnTransformer` is a neat *transformer* that takes a list of `(name, transformer, columns)` tuples, so that each transformation is applied only to the columns listed for it.

In [18]:
col_tranform = ColumnTransformer(
    [
        ('scaler',          # name
         StandardScaler(),  # transformer for the specified columns
         ['speeding']),     # columns of df to transform
    ],
    remainder='passthrough'  # keep all other columns unchanged
)

pd.DataFrame(col_tranform.fit_transform(df.loc[:, num_cols])).head()

|  | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 1.168148 | 5.64 | 18.048 | 15.04 | 784.55 | 145.08 |
| 1 | 1.212695 | 4.525 | 16.29 | 17.014 | 1053.48 | 133.93 |
| 2 | 0.756709 | 5.208 | 15.624 | 17.856 | 899.47 | 110.35 |
| 3 | -0.483614 | 5.824 | 21.056 | 21.28 | 827.34 | 142.39 |
| 4 | -0.399524 | 3.36 | 10.92 | 10.68 | 878.41 | 165.63 |


Often we would like to apply certain transformations to all numerical columns, say standardization, and one-hot encoding to all categorical columns. Instead of typing in all column names by hand, we can use `make_column_selector`, which selects columns by dtype.

In [23]:
ct = make_column_transformer(
    (StandardScaler(),
     make_column_selector(dtype_include='float')),
    # `sparse_output` replaced `sparse` in scikit-learn 1.2; use `sparse=False` on older versions
    (OneHotEncoder(sparse_output=False),
     make_column_selector(dtype_include='category')))

pd.DataFrame(ct.fit_transform(df)).head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57
0,0.737446,1.168148,0.439938,1.002301,0.277692,-0.580083,0.430514,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.565936,1.212695,-0.211311,0.608532,0.807258,0.943258,-0.0229,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.688443,0.756709,0.187615,0.459357,1.033141,0.070876,-0.981778,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1.619498,-0.483614,0.547408,1.676052,1.9517,-0.337701,0.321125,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,-0.928653,-0.399524,-0.891763,-0.594276,-0.891968,-0.048418,1.266178,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
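
A selector is just a callable that takes a dataframe and returns the matching column names, so it can be inspected on its own. The sketch below uses a tiny hand-made frame (`toy` is an illustrative name) to show this. Note that a float selector also matches `total`, which is why the output above has seven scaled columns: the target is included unless it is dropped beforehand.

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Tiny frame with two float columns and one categorical column.
toy = pd.DataFrame(
    {
        "total": [18.8, 18.1, 18.6],
        "speeding": [7.332, 7.421, 6.51],
        "abbrev": pd.Categorical(["AL", "AK", "AZ"]),
    }
)

# Calling the selector on a dataframe returns a list of column names.
float_cols = make_column_selector(dtype_include="float")(toy)
cat_cols = make_column_selector(dtype_include="category")(toy)
```

Because the selection happens at fit time, the same transformer definition keeps working when columns are added to or removed from the dataframe.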


In [24]:
from sklearn import set_config
set_config(display='diagram')
ct

### Hyperparameter Search with Pipelines

In [45]:



# Pipeline steps are (name, estimator) pairs; per-column selection belongs
# in a ColumnTransformer, whose entries are (name, transformer, columns).
col_tranform = ColumnTransformer(
    [
        ('scaler', StandardScaler(), make_column_selector(dtype_include='float')),
        ('pca', PCA(n_components=2), make_column_selector(dtype_include='float')),
    ]
)

pd.DataFrame(col_tranform.fit_transform(df)).head()

|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.737446 | 1.168148 | 0.439938 | 1.002301 | 0.277692 | -0.580083 | 0.430514 | -101.126075 | 19.626205 |
| 1 | 0.565936 | 1.212695 | -0.211311 | 0.608532 | 0.807258 | 0.943258 | -0.0229 | 165.796019 | -14.90549 |
| 2 | 0.688443 | 0.756709 | 0.187615 | 0.459357 | 1.033141 | 0.070876 | -0.981778 | 10.315344 | -24.92548 |
| 3 | 1.619498 | -0.483614 | 0.547408 | 1.676052 | 1.9517 | -0.337701 | 0.321125 | -58.78429 | 13.517557 |
| 4 | -0.928653 | -0.399524 | -0.891763 | -0.594276 | -0.891968 | -0.048418 | 1.266178 | -5.74729 | 31.51926 |


In [67]:
ct = make_column_transformer(
    (StandardScaler(),
     make_column_selector(dtype_include='float'))
)

# make_pipeline names each step after its lowercased class name:
# 'columntransformer', 'pca', 'ridge'
clf = make_pipeline(ct, PCA(), Ridge()).fit(df.loc[:, num_cols], df.total)

# Nested parameters are addressed as <step>__<sub-step>__<parameter>
param_grid = {
    'columntransformer__standardscaler__with_mean': [True, False],
    'pca__n_components': [3, 4, 5],
}

grid = GridSearchCV(clf, n_jobs=1, param_grid=param_grid)
grid.fit(df.loc[:, num_cols], df.total)
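
Once the search is fitted, the winning combination and the per-candidate scores can be read off the search object. The following sketch uses synthetic data and a slightly simpler pipeline; `model`, `search` and the `ridge__alpha` grid are assumptions made for this example rather than part of the notebook above.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data with six features (assumption for this example).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 6))
y = X @ rng.normal(size=6) + rng.normal(scale=0.1, size=60)

model = make_pipeline(StandardScaler(), PCA(), Ridge())

# Keys follow the <step name>__<parameter> convention.
param_grid = {
    "pca__n_components": [3, 4, 5],
    "ridge__alpha": [0.1, 1.0, 10.0],
}
search = GridSearchCV(model, param_grid=param_grid, cv=5)
search.fit(X, y)

best_params = search.best_params_  # dict with the best value for each key
best_score = search.best_score_    # mean cross-validated R^2 of that candidate

# One row per candidate, sorted from best to worst.
results = (
    pd.DataFrame(search.cv_results_)
    .loc[:, ["params", "mean_test_score", "rank_test_score"]]
    .sort_values("rank_test_score")
)
```

By default `GridSearchCV` refits the best candidate on the full data, so `search.predict` can be used directly afterwards.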


In [58]:
clf.get_params()

{'columntransformer': ColumnTransformer(transformers=[('standardscaler', StandardScaler(),
                                  <sklearn.compose._column_transformer.make_column_selector object at 0x7fd266552550>)]),
 'columntransformer__n_jobs': None,
 'columntransformer__remainder': 'drop',
 'columntransformer__sparse_threshold': 0.3,
 'columntransformer__standardscaler': StandardScaler(),
 'columntransformer__standardscaler__copy': True,
 'columntransformer__standardscaler__with_mean': True,
 'columntransformer__standardscaler__with_std': True,
 'columntransformer__transformer_weights': None,
 'columntransformer__transformers': [('standardscaler',
   StandardScaler(),
   <sklearn.compose._column_transformer.make_column_selector at 0x7fd266552550>)],
 'columntransformer__verbose': False,
 'memory': None,
 'pca': PCA(),
 'pca__copy': True,
 'pca__iterated_power': 'auto',
 'pca__n_components': None,
 'pca__random_state': None,
 'pca__svd_solver': 'auto',
 'pca__tol': 0.0,
 'pca__whiten': F

## Sources

- Hello This is a markdown page {% cite signaltrain %}

https://github.com/ypeleg/HungaBunga

https://github.com/alegonz/baikal

https://github.com/jem1031/pandas-pipelines-custom-transformers

https://github.com/jundongl/scikit-feature

https://github.com/scikit-multilearn/scikit-multilearn

https://github.com/amueller/patsylearn

https://www.scikit-yb.org/en/latest/

https://github.com/koaning/scikit-lego

https://github.com/tmadl/sklearn-expertsys

https://scikit-learn.org/stable/tutorial/machine_learning_map/

https://medium.com/@chris_bour/an-extended-version-of-the-scikit-learn-cheat-sheet-5f46efc6cbb

https://twitter.com/justmarkham/status/1239900312862953473

https://twitter.com/amuellerml/status/1255662574416408577

## References

{% bibliography --cited %}