## What is a pipeline?

Series of transformations of data
- raw -> processed -> clean -> prediction

A useful abstraction in machine learning

## Why sklearn pipelines?

Readable
- workflows are easier to read

Reproducible
- pickle the pipeline object

Testable
- test individual transformers

Reusable
- everyone in the team can use the same transformations
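
As a minimal sketch of the reproducible and testable points above, the block below pickles a fitted pipeline and checks one transformer in isolation. The data is synthetic and the step names (`scaler`, `model`) are assumptions chosen for illustration, not part of the climate example.

```python
import pickle

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic data (an assumption for illustration, not the climate dataset)
rng = np.random.default_rng(0)
x = rng.normal(size=(20, 2))
y = x[:, 0] * 2.0 + x[:, 1]

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LinearRegression()),
]).fit(x, y)

# reproducible: serialize the fitted pipeline and load it back
restored = pickle.loads(pickle.dumps(pipe))
same_predictions = np.allclose(pipe.predict(x), restored.predict(x))

# testable: pull out a single step and check it on its own
scaled = pipe.named_steps['scaler'].transform(x)
columns_are_centered = np.allclose(scaled.mean(axis=0), 0.0)
```

In practice the pickled bytes would be written to a file and loaded with the same sklearn version that created them.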

## What is a sklearn pipeline?

Steps of **transformations**, ending in an **estimator**
- transformation = preprocessing, cleaning, feature engineering
- estimator = supervised or unsupervised learning model

Transformations can be either sequential or parallel

You can grid search over the parameters of any step in a pipeline

Remember sklearn expects numpy arrays
- it can be tricky to keep column names flowing through a pipeline

In [10]:
!rm -rf climate-data
!git clone https://github.com/ADGEfficiency/climate-data -q
!python climate-data/nasa.py

getting NASA temperatures
NASA Goddard Institute for Space Studies
https://data.giss.nasa.gov/gistemp/
getting NASA carbon
getting co2_annmean_gl.csv
getting co2_annmean_mlo.csv
getting co2_gr_gl.csv
getting co2_gr_mlo.csv
getting co2_mm_gl.csv
co2_mm_gl.csv
getting co2_mm_mlo.csv
co2_mm_mlo.csv
getting co2_trend_gl.csv
co2_trend_gl.csv
getting co2_weekly_mlo.csv
co2_weekly_mlo.csv
getting co2_weekly_mlo_lastyear.csv
co2_weekly_mlo_lastyear.csv


In [11]:
import os
import pandas as pd

def dataframe_health(raw):
    for col in raw.columns:
        dupe_mask = raw.loc[:, col].duplicated() 
        print('duplicates', col, sum(dupe_mask))

        miss_mask = raw.loc[:, col].isnull() 
        print('missing', col, sum(miss_mask))
        print(' ')

carbon = pd.read_csv(
    '{}/climate-data/carbon/clean/co2_mm_gl.csv'.format(os.environ['HOME']),
    index_col=0
)

temp = pd.read_csv(
    '{}/climate-data/temperature/clean/global.csv'.format(os.environ['HOME']),
    index_col=0
)

temp.head()

Year,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec,J-D,D-N,DJF,MAM,JJA,SON,Year.1
1880-01-01,-0.17,-0.23,-0.08,-0.15,-0.09,-0.2,-0.17,-0.09,-0.14,-0.22,-0.21,-0.17,-0.16,0.0,0.0,-0.11,-0.15,-0.19,1880
1881-01-01,-0.18,-0.13,0.04,0.06,0.07,-0.18,0.01,-0.02,-0.14,-0.2,-0.17,-0.06,-0.07,-0.08,-0.16,0.06,-0.06,-0.17,1881
1882-01-01,0.18,0.15,0.06,-0.15,-0.13,-0.21,-0.15,-0.06,-0.13,-0.23,-0.15,-0.35,-0.1,-0.07,0.09,-0.08,-0.14,-0.17,1882
1883-01-01,-0.28,-0.35,-0.11,-0.18,-0.17,-0.06,-0.07,-0.13,-0.21,-0.1,-0.23,-0.1,-0.17,-0.19,-0.33,-0.15,-0.08,-0.18,1883
1884-01-01,-0.12,-0.08,-0.36,-0.39,-0.34,-0.34,-0.32,-0.27,-0.27,-0.24,-0.32,-0.3,-0.28,-0.26,-0.1,-0.36,-0.31,-0.28,1884


In [12]:
carbon.head()

Unnamed: 0,year,month,decimal,average,trend
1980-01-01,1980,1,1980.042,338.45,337.82
1980-02-01,1980,2,1980.125,339.15,338.1
1980-03-01,1980,3,1980.208,339.48,338.13
1980-04-01,1980,4,1980.292,339.87,338.25
1980-05-01,1980,5,1980.375,340.3,338.78


In [13]:
raw = pd.concat([carbon, temp], axis=1, sort=True)
raw.dropna(axis=0, inplace=True)

raw.head()

Unnamed: 0,year,month,decimal,average,trend,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec,J-D,D-N,DJF,MAM,JJA,SON,Year.1
1980-01-01,1980.0,1.0,1980.042,338.45,337.82,0.3,0.4,0.3,0.3,0.35,0.2,0.22,0.18,0.2,0.13,0.3,0.22,0.26,0.28,0.39,0.32,0.2,0.21,1980.0
1981-01-01,1981.0,1.0,1981.042,340.09,339.46,0.52,0.42,0.48,0.32,0.24,0.29,0.32,0.35,0.15,0.12,0.23,0.41,0.32,0.31,0.39,0.35,0.32,0.17,1981.0
1982-01-01,1982.0,1.0,1982.042,341.27,340.66,0.05,0.16,0.03,0.15,0.18,0.06,0.14,0.03,0.14,0.13,0.17,0.42,0.14,0.14,0.2,0.12,0.08,0.15,1982.0
1983-01-01,1983.0,1.0,1983.042,342.28,341.62,0.53,0.43,0.41,0.28,0.33,0.22,0.18,0.35,0.37,0.16,0.3,0.17,0.31,0.33,0.46,0.34,0.25,0.28,1983.0
1984-01-01,1984.0,1.0,1984.042,344.23,343.57,0.31,0.14,0.26,0.06,0.33,0.02,0.19,0.19,0.21,0.14,0.07,-0.04,0.16,0.17,0.21,0.22,0.13,0.14,1984.0


## A simple example

- one hot encode the year & month
- fit a linear model

In [22]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, MinMaxScaler

from sklearn.linear_model import Lasso

from sklearn.model_selection import train_test_split

x = raw.loc[:, ['year', 'month']]
y = raw.loc[:, 'J-D']

x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.2)

pipe = Pipeline([
    ('one-hot', OneHotEncoder(categories='auto', handle_unknown='ignore')), 
    ('pred', Lasso())
])

pipe = pipe.fit(x_tr, y_tr)

In [15]:
pipe.predict(x_te)

array([0.52483871, 0.52483871, 0.52483871, 0.52483871, 0.52483871,
       0.52483871, 0.52483871, 0.52483871])
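
Every prediction above is the same number. A likely explanation is that `Lasso()` defaults to `alpha=1.0`, which is a strong penalty relative to a target measured in tenths of a degree: all coefficients shrink to zero and the model predicts the training mean. The sketch below reproduces this on synthetic data (an assumption for illustration, not the climate dataset).

```python
import numpy as np
from sklearn.linear_model import Lasso

# synthetic one-hot-like features and a small-scale target
rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=(30, 5)).astype(float)
y = 0.1 * x[:, 0] + rng.normal(scale=0.05, size=30)

# default penalty: strong enough to zero out every coefficient here
strong = Lasso(alpha=1.0).fit(x, y)
all_zero = bool(np.all(strong.coef_ == 0.0))
predicts_mean = np.allclose(strong.predict(x), y.mean())

# a much weaker penalty lets some coefficients survive
weak = Lasso(alpha=0.001).fit(x, y)
some_nonzero = bool(np.any(weak.coef_ != 0.0))
```

Lowering `alpha` is one way to get predictions that vary with the inputs.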

## Column transformers

- deal with different features in different ways

To apply different transformations to different columns, we use a `ColumnTransformer`
- allows different columns or column subsets of the input to be transformed separately
- the features generated by each transformer will be concatenated to form a single feature space
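
The concatenation can be seen on a toy frame. This is a sketch with made-up rows; `sparse_threshold=0` is set only so the result is a dense array that is easy to inspect.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# toy rows (illustrative values only)
toy = pd.DataFrame({
    'year': [1980, 1981, 1982, 1980],
    'month': [1, 1, 2, 2],
    'average': [338.45, 340.09, 341.27, 339.15],
})

transformer = ColumnTransformer(
    transformers=[
        ('one-hot', OneHotEncoder(handle_unknown='ignore'), ['year', 'month']),
        ('scaler', StandardScaler(), ['average']),
    ],
    sparse_threshold=0,  # always return a dense array
)

features = transformer.fit_transform(toy)

# 3 distinct years + 2 distinct months + 1 scaled column
n_columns = features.shape[1]
```

Columns not listed in any transformer are dropped by default (`remainder='drop'`).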

In [16]:
x = raw.loc[:, ['year', 'month', 'average']]
y = raw.loc[:, 'J-D']

x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.2)

x.head()

Unnamed: 0,year,month,average
1980-01-01,1980.0,1.0,338.45
1981-01-01,1981.0,1.0,340.09
1982-01-01,1982.0,1.0,341.27
1983-01-01,1983.0,1.0,342.28
1984-01-01,1984.0,1.0,344.23


In [17]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, MinMaxScaler
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

feature_engineering = ColumnTransformer(
    transformers = [
        # select columns by their DataFrame names
        ('one-hot', OneHotEncoder(categories='auto', handle_unknown='ignore'), ['year', 'month']),
        ('scaler', StandardScaler(), ['average'])
    ]
)

pipe = Pipeline(
    steps=[
        # the target is continuous, so the final step is a regressor
        ('features', feature_engineering), ('model', Lasso())
    ]
)

pipe = pipe.fit(x_tr, y_tr)

pipe.predict(x_te)

array([0.50322581, 0.50322581, 0.50322581, 0.50322581, 0.50322581,
       0.50322581, 0.50322581, 0.50322581])

## Transforming the target

One problem with the example above is that the target is left untouched. We can preprocess the target using a `TransformedTargetRegressor`, which transforms the target before fitting and inverts the transformation when predicting.
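
A minimal sketch of what the wrapper does, on synthetic data and with a `StandardScaler` standing in for the target transformer (both are assumptions for illustration): the inner regressor is fit on the transformed target, while `predict` returns values in the original units.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# synthetic data with a target far from zero
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=(50, 1))
y = 300.0 + 50.0 * x[:, 0]

model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    transformer=StandardScaler(),
).fit(x, y)

# predictions are inverse-transformed back to the original scale
pred = model.predict(x)

# the fitted inner regressor works in the scaled space
scaled_pred = model.regressor_.predict(x)
```

The fitted pieces are available as `model.regressor_` and `model.transformer_`.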

In [18]:
from sklearn.compose import TransformedTargetRegressor

from sklearn.preprocessing import QuantileTransformer

pipe = Pipeline(
    steps=[
        ('features', feature_engineering), ('model', TransformedTargetRegressor(regressor=Lasso(), transformer=QuantileTransformer()))
    ]
)

pipe = pipe.fit(x_tr, y_tr)

pipe.predict(x_te)


array([0.47225806, 0.47225806, 0.47225806, 0.47225806, 0.47225806,
       0.47225806, 0.47225806, 0.47225806])

## Grid searching

In [19]:
from sklearn.model_selection import GridSearchCV

params = {
    'model__alpha': [0.01, 0.1, 1.0]
}

feature_engineering = ColumnTransformer(
    transformers = [
        ('one-hot', OneHotEncoder(categories='auto', handle_unknown='ignore'), ['year', 'month']),
        ('scaler', StandardScaler(), ['average'])
    ]
)

pipe = Pipeline(
    steps=[
        ('features', feature_engineering), ('model', Lasso())
    ]
)

grid = GridSearchCV(pipe, cv=3, n_jobs=-1, param_grid=params)

grid = grid.fit(x_tr, y_tr)

grid.predict(x_te)



array([0.2835446 , 0.25750065, 0.2297704 , 0.57930266, 0.54360934,
       0.61911805, 0.35277653, 0.77360175])
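
Parameter names in the grid follow the `<step name>__<parameter>` convention, so `model__alpha` targets `alpha` on the step named `model`. The sketch below, on synthetic data (an assumption for illustration), shows how the selected value and the cross-validation scores could be inspected after fitting.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic regression data
rng = np.random.default_rng(0)
x = rng.normal(size=(60, 3))
y = x @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

pipe = Pipeline([('scaler', StandardScaler()), ('model', Lasso())])
params = {'model__alpha': [0.01, 0.1, 1.0]}

grid = GridSearchCV(pipe, param_grid=params, cv=3).fit(x, y)

# the winning parameter value and the mean score for each candidate
best_alpha = grid.best_params_['model__alpha']
mean_scores = grid.cv_results_['mean_test_score']

# the pipeline refit on all of the data with the best parameters
best_pipe = grid.best_estimator_
```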

## Feature unions

The last use case we have is to transform the same feature multiple ways and concatenate the results, which we can do with a feature union
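
As a small sketch on a single toy column (illustrative values only), a `FeatureUnion` of two scalers turns one input column into two output columns, one per transformer:

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# one feature, four samples
column = np.array([[338.45], [340.09], [341.27], [342.28]])

union = FeatureUnion([
    ('standard', StandardScaler()),
    ('minmax', MinMaxScaler()),
])

# the outputs of each transformer are stacked side by side
both = union.fit_transform(column)
standard_scaled = both[:, 0]
minmax_scaled = both[:, 1]
```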

In [27]:
from sklearn.pipeline import FeatureUnion

feature_engineering = ColumnTransformer(
    transformers = [
        ('one-hot', OneHotEncoder(categories='auto', handle_unknown='ignore'), ['year', 'month']),
        # FeatureUnion takes a list of (name, transformer) tuples
        ('scalers', FeatureUnion(
            [('sclr', StandardScaler()), ('sclr1', MinMaxScaler())]), ['average'])
    ]
)

pipe = Pipeline(
    steps=[
        ('features', feature_engineering), 
        ('model', TransformedTargetRegressor(regressor=Lasso(), transformer=QuantileTransformer()))
    ]
)

x = raw.loc[:, ['year', 'month', 'average']]
y = raw.loc[:, 'J-D']

x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.2)

pipe = pipe.fit(x_tr, y_tr)

pipe.predict(x_te)


array([0.45129032, 0.45129032, 0.45129032, 0.45129032, 0.45129032,
       0.45129032, 0.45129032, 0.45129032])

In [26]:
x_tr

Unnamed: 0,year,month
2010-01-01,2010.0,1.0
2016-01-01,2016.0,1.0
1986-01-01,1986.0,1.0
2018-01-01,2018.0,1.0
1992-01-01,1992.0,1.0
2015-01-01,2015.0,1.0
2011-01-01,2011.0,1.0
1998-01-01,1998.0,1.0
2006-01-01,2006.0,1.0
1997-01-01,1997.0,1.0
