# Pipelining In Machine Learning

In [109]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [110]:
# save the file path to a variable for easier access
melbourne_file_path = '../hitchhikersGuideToMachineLearning/home-data-for-ml-course/train.csv'
# read the data and store it in a DataFrame named train_data
train_data = pd.read_csv(melbourne_file_path)
# show the first five rows of the data
train_data.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


Instead of attacking the dataset directly, I will first cover the capabilities of the pipeline tools in the scikit-learn package:
- Pipelines and composite estimators
- Pipeline: chaining estimators
- Transforming target in regression
- FeatureUnion: composite feature spaces
- ColumnTransformer for heterogeneous data

FeatureUnion and ColumnTransformer are especially important.
I will also demonstrate how to incorporate custom techniques and tricks into pipelines.
We will use a few of these tricks on our dataset!

Data cleaning and preprocessing are a crucial step in any machine learning project.
Whenever new data points are added to the existing data, we need to perform the same preprocessing steps again before we can use the machine learning model to make predictions. This becomes a tedious and time-consuming process!

An alternative is to create a machine learning pipeline that remembers the complete set of preprocessing steps in the exact same order. Whenever a new data point is introduced, the pipeline performs the steps as defined and then uses the machine learning model to predict the target variable.

Setting up a machine learning algorithm involves more than the algorithm itself. You need to preprocess the data so that it fits the algorithm, and it is this preprocessing that often requires most of the work. Building a flexible pipeline is key. Here's how you can build one in Python.

#### What is a pipeline?
A pipeline in sklearn is a set of chained algorithms that extract features, preprocess them, and then train or use a machine learning algorithm.
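A minimal sketch of what the chaining buys you, assuming the iris data and the same two steps (a MinMaxScaler followed by a LogisticRegression) used in the cells that follow. The fitted pipeline should give the same predictions as scaling by hand and then calling the model. The names `demo_pipe`, `manual_pred` and `pipe_pred` are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Manual route: fit the scaler on the training data, then reuse it on the test data
scaler = MinMaxScaler().fit(X_tr)
manual_model = LogisticRegression(C=1).fit(scaler.transform(X_tr), y_tr)
manual_pred = manual_model.predict(scaler.transform(X_te))

# Pipeline route: one object remembers both steps and their order
demo_pipe = Pipeline([('minmax', MinMaxScaler()), ('lr', LogisticRegression(C=1))])
pipe_pred = demo_pipe.fit(X_tr, y_tr).predict(X_te)

print(np.array_equal(manual_pred, pipe_pred))
```

The manual route needs you to remember to call `scaler.transform` on every new batch; the pipeline does that for you inside `predict`.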

In [111]:
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

In [112]:
# Load and split the data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size= 0.2,random_state=42 )
X_train.shape

(120, 4)

#### Construction
The Pipeline is built using a list of (key, value) pairs, where the key is a string containing the name you want to give this step and value is an estimator object:


In [113]:
estimators = [('minmax', MinMaxScaler()),('lr', LogisticRegression(C=1))]
pipe = Pipeline(estimators)

In [114]:
pipe.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('minmax', MinMaxScaler(copy=True, feature_range=(0, 1))),
                ('lr',
                 LogisticRegression(C=1, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                    l1_ratio=None, max_iter=100,
                                    multi_class='auto', n_jobs=None,
                                    penalty='l2', random_state=None,
                                    solver='lbfgs', tol=0.0001, verbose=0,
                                    warm_start=False))],
         verbose=False)

In [115]:
score = pipe.score(X_test, y_test)
print('Logistic Regression pipeline test accuracy: %.3f' % score)

Logistic Regression pipeline test accuracy: 0.967


The utility function make_pipeline is a shorthand for constructing pipelines; it takes a variable number of estimators and returns a pipeline, filling in the names automatically:

In [116]:
from sklearn.pipeline import make_pipeline
pipe2= make_pipeline(MinMaxScaler(), LogisticRegression(C=10))
pipe2

Pipeline(memory=None,
         steps=[('minmaxscaler', MinMaxScaler(copy=True, feature_range=(0, 1))),
                ('logisticregression',
                 LogisticRegression(C=10, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                    l1_ratio=None, max_iter=100,
                                    multi_class='auto', n_jobs=None,
                                    penalty='l2', random_state=None,
                                    solver='lbfgs', tol=0.0001, verbose=0,
                                    warm_start=False))],
         verbose=False)

###### Accessing steps
The estimators of a pipeline are stored as a list in the steps attribute, but can be accessed by index or name by indexing (with [idx]) the Pipeline:

In [117]:
pipe.steps[0]

('minmax', MinMaxScaler(copy=True, feature_range=(0, 1)))

In [118]:
pipe['minmax']

MinMaxScaler(copy=True, feature_range=(0, 1))

Pipeline’s named_steps attribute allows accessing steps by name with tab completion in interactive environments:

In [119]:
pipe.named_steps.minmax

MinMaxScaler(copy=True, feature_range=(0, 1))

A sub-pipeline can also be extracted using the slicing notation commonly used for Python Sequences such as lists or strings (although only a step of 1 is permitted). This is convenient for performing only some of the transformations (or their inverse):

In [120]:
pipe2[:1]

Pipeline(memory=None,
         steps=[('minmaxscaler',
                 MinMaxScaler(copy=True, feature_range=(0, 1)))],
         verbose=False)
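As a small sketch of the "only some of the transformations (or their inverse)" use case, assuming a pipeline whose non-final steps all support `inverse_transform` (MinMaxScaler does): slice off the final estimator, transform the data, and map it back. The names `demo_pipe` and `preprocessing` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_iris(return_X_y=True)
demo_pipe = make_pipeline(MinMaxScaler(), LogisticRegression(C=10)).fit(X, y)

# Everything except the final estimator: here just the fitted scaler
preprocessing = demo_pipe[:-1]
X_scaled = preprocessing.transform(X)

# The sub-pipeline can also undo its transformations
X_back = preprocessing.inverse_transform(X_scaled)

print(X_scaled.min(), X_scaled.max())
print(np.allclose(X_back, X))
```

The slice shares the already fitted steps with the original pipeline, so nothing is refit here.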

###### Nested parameters
Parameters of the estimators in the pipeline can be accessed using the <estimator>__<parameter> syntax:

In [121]:
pipe.set_params(lr__C=2)

Pipeline(memory=None,
         steps=[('minmax', MinMaxScaler(copy=True, feature_range=(0, 1))),
                ('lr',
                 LogisticRegression(C=2, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                    l1_ratio=None, max_iter=100,
                                    multi_class='auto', n_jobs=None,
                                    penalty='l2', random_state=None,
                                    solver='lbfgs', tol=0.0001, verbose=0,
                                    warm_start=False))],
         verbose=False)

This is particularly important for doing grid searches!

Individual steps may also be replaced as parameters, and non-final steps may be skipped by setting them to 'passthrough'.

In [122]:
from sklearn.model_selection import GridSearchCV
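# Convergence warnings are left visible on purpose: with the scaler set to
# 'passthrough', lbfgs may not converge within the default 100 iterations.
# (The earlier `warnings.filterwarnings='ignore'` was an assignment, not a call,
# so it overwrote the function and silenced nothing.)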
param_grid = dict(minmax=['passthrough'],lr__C=[1, 2, 3])
grid_search = GridSearchCV(pipe, param_grid=param_grid)

In [123]:
gd=grid_search.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

In [124]:
gd.best_estimator_

Pipeline(memory=None,
         steps=[('minmax', 'passthrough'),
                ('lr',
                 LogisticRegression(C=1, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                    l1_ratio=None, max_iter=100,
                                    multi_class='auto', n_jobs=None,
                                    penalty='l2', random_state=None,
                                    solver='lbfgs', tol=0.0001, verbose=0,
                                    warm_start=False))],
         verbose=False)

In [125]:
score = pipe.score(X_test, y_test)
score

0.9666666666666667
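The grid above only tries 'passthrough' for the scaler step. As a hedged sketch of replacing a whole step with alternatives, the grid below lets the search choose between two scalers and no scaling at all, alongside a nested `lr__C` parameter. The step name `scale`, the candidate values, and `max_iter=1000` (set only to avoid convergence warnings) are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X, y = load_iris(return_X_y=True)
demo_pipe = Pipeline([('scale', MinMaxScaler()),
                      ('lr', LogisticRegression(max_iter=1000))])

# 'scale' replaces the whole step; 'lr__C' reaches inside the 'lr' step
param_grid = {
    'scale': [MinMaxScaler(), StandardScaler(), 'passthrough'],
    'lr__C': [0.1, 1, 10],
}
search = GridSearchCV(demo_pipe, param_grid=param_grid, cv=5).fit(X, y)

print(search.best_params_)
print(len(search.cv_results_['params']))  # 3 scaler options x 3 values of C
```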

Let's also explore a regression dataset so that I can drive home a few more useful points!

In [126]:
from sklearn.datasets import load_boston
from sklearn.compose import TransformedTargetRegressor
from sklearn.preprocessing import QuantileTransformer
from sklearn.linear_model import LinearRegression

In [127]:
X, y = load_boston(return_X_y=True)

In [128]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [129]:
estimators = [('minmax', MinMaxScaler()),('lr', LinearRegression())]
raw_Y_regr = Pipeline(estimators)

In [130]:
raw_Y_regr.fit(X_train, y_train)
print('R2 score: {0:.2f}'.format(raw_Y_regr.score(X_test, y_test)))

R2 score: 0.64


Okay, I can transform the input features and do all kinds of preprocessing, but what if I want to transform Y?
In many regression tasks you may have to transform Y before feeding it to the model, for example by taking the log of Y.
At prediction time we then have to scale the output back to the original units, that is, apply the inverse of the transformation to the predicted Y.
All of this can be done very easily with TransformedTargetRegressor!


In [131]:
transformer = QuantileTransformer(output_distribution='normal')
regressor = LinearRegression()
regr = TransformedTargetRegressor(regressor=regressor,transformer=transformer)

In [132]:
regr.fit(X_train, y_train)
print('R2 score: {0:.2f}'.format(regr.score(X_test, y_test)))

R2 score: 0.67



For simple transformations, instead of a Transformer object, a pair of functions can be passed, defining the transformation and its inverse mapping. It means you can make your custom transformers!

In [133]:
def func(x):
    return np.log(x)
def inverse_func(x):
    return np.exp(x)

In [134]:
regr = TransformedTargetRegressor(regressor=regressor,func=func,inverse_func=inverse_func)

In [135]:
regr.fit(X_train, y_train)

TransformedTargetRegressor(check_inverse=True,
                           func=<function func at 0x7fa075054400>,
                           inverse_func=<function inverse_func at 0x7fa0265b9048>,
                           regressor=LinearRegression(copy_X=True,
                                                      fit_intercept=True,
                                                      n_jobs=None,
                                                      normalize=False),
                           transformer=None)

In [136]:
print('R2 score: {0:.2f}'.format(regr.score(X_test, y_test)))

R2 score: 0.65
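To make the mechanics concrete, here is a small sketch on synthetic data (not the notebook's dataset) checking that TransformedTargetRegressor with `func=np.log` and `inverse_func=np.exp` behaves like fitting on log(y) by hand and exponentiating the predictions. The data-generating coefficients and the names `X_demo`, `y_demo`, `ttr` and `manual` are made up for illustration.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 1, size=(200, 3))
# Positive, skewed target so that a log transform makes sense
y_demo = np.exp(X_demo @ np.array([1.0, 2.0, 0.5]) + rng.normal(scale=0.1, size=200))

ttr = TransformedTargetRegressor(regressor=LinearRegression(),
                                 func=np.log, inverse_func=np.exp)
ttr.fit(X_demo, y_demo)

# By hand: fit on log(y), then map predictions back with exp
manual = LinearRegression().fit(X_demo, np.log(y_demo))
manual_pred = np.exp(manual.predict(X_demo))

print(np.allclose(ttr.predict(X_demo), manual_pred))
```

Note that `score` on the wrapped regressor is computed in the original units of y, which is why the R2 values in the cells above are directly comparable.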


###### FeatureUnion
This estimator applies a list of transformer objects in parallel to the input data, then concatenates the results. This is useful to combine several feature extraction mechanisms into a single transformer.

In [137]:
from sklearn.pipeline import FeatureUnion
from sklearn.decomposition import PCA
from sklearn.decomposition import KernelPCA
from sklearn.feature_selection import SelectKBest

estimators = [('linear_pca', PCA()), ('select_k_best', SelectKBest(k=10))]
combined_features = FeatureUnion(estimators)

With the settings above, PCA() keeps all 13 components by default and SelectKBest(k=10) keeps 10 features, so the combined space has 13 + 10 = 23 features.

In [138]:
X.shape[1]

13

In [139]:
# Use combined features to transform dataset:
X_features = combined_features.fit(X, y).transform(X)
print("Combined space has", X_features.shape[1], "features")


pipeline = Pipeline([("features", combined_features), ("lr", LinearRegression())])

Combined space has 23 features
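A hedged sketch of what "in parallel, then concatenates" means, on synthetic regression data with 13 features: the union's output should equal the two transformers' outputs stacked side by side. Here `f_regression` is passed explicitly because the target is continuous; that choice, the component counts, and the `X_demo`/`y_demo` names are assumptions for this example rather than the settings used in the cell above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import FeatureUnion

X_demo, y_demo = make_regression(n_samples=100, n_features=13, random_state=0)

union = FeatureUnion([
    ('linear_pca', PCA(n_components=4)),
    ('select_k_best', SelectKBest(score_func=f_regression, k=5)),
])
X_union = union.fit_transform(X_demo, y_demo)

# Same thing by hand: transform separately, then stack the columns side by side
X_manual = np.hstack([
    PCA(n_components=4).fit_transform(X_demo),
    SelectKBest(score_func=f_regression, k=5).fit_transform(X_demo, y_demo),
])

print(X_union.shape)
print(np.allclose(X_union, X_manual))
```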


In [140]:
param_grid = dict(features__linear_pca__n_components=[4, 6],
                  features__select_k_best__k=[5, 8])

grid_search = GridSearchCV(pipeline, param_grid=param_grid, verbose=10)
grid_search.fit(X, y)


Fitting 5 folds for each of 4 candidates, totalling 20 fits
[CV] features__linear_pca__n_components=4, features__select_k_best__k=5 
[CV]  features__linear_pca__n_components=4, features__select_k_best__k=5, score=0.246, total=   0.0s
[CV] features__linear_pca__n_components=4, features__select_k_best__k=5 
[CV]  features__linear_pca__n_components=4, features__select_k_best__k=5, score=0.660, total=   0.0s
[CV] features__linear_pca__n_components=4, features__select_k_best__k=5 
[CV]  features__linear_pca__n_components=4, features__select_k_best__k=5, score=0.197, total=   0.0s
[CV] features__linear_pca__n_components=4, features__select_k_best__k=5 
[CV]  features__linear_pca__n_components=4, features__select_k_best__k=5, score=-0.018, total=   0.0s
[CV] features__linear_pca__n_components=4, features__select_k_best__k=5 
[CV]  features__linear_pca__n_components=4, features__select_k_best__k=5, score=-0.243, total=   0.0s
[CV] features__linear_pca__n_components=4, features__select_k_best__

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:    0.2s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:    0.2s remaining:    0.0s


[CV]  features__linear_pca__n_components=6, features__select_k_best__k=5, score=0.476, total=   0.0s
[CV] features__linear_pca__n_components=6, features__select_k_best__k=5 
[CV]  features__linear_pca__n_components=6, features__select_k_best__k=5, score=0.732, total=   0.0s
[CV] features__linear_pca__n_components=6, features__select_k_best__k=5 
[CV]  features__linear_pca__n_components=6, features__select_k_best__k=5, score=0.663, total=   0.0s
[CV] features__linear_pca__n_components=6, features__select_k_best__k=5 
[CV]  features__linear_pca__n_components=6, features__select_k_best__k=5, score=0.052, total=   0.0s
[CV] features__linear_pca__n_components=6, features__select_k_best__k=5 
[CV]  features__linear_pca__n_components=6, features__select_k_best__k=5, score=-0.368, total=   0.0s
[CV] features__linear_pca__n_components=6, features__select_k_best__k=8 
[CV]  features__linear_pca__n_components=6, features__select_k_best__k=8, score=0.639, total=   0.0s
[CV] features__linear_pca__n

[Parallel(n_jobs=1)]: Done  20 out of  20 | elapsed:    0.4s finished


GridSearchCV(cv=None, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('features',
                                        FeatureUnion(n_jobs=None,
                                                     transformer_list=[('linear_pca',
                                                                        PCA(copy=True,
                                                                            iterated_power='auto',
                                                                            n_components=None,
                                                                            random_state=None,
                                                                            svd_solver='auto',
                                                                            tol=0.0,
                                                                            whiten=False)),
                                                                

In [141]:
print(grid_search.best_estimator_)

Pipeline(memory=None,
         steps=[('features',
                 FeatureUnion(n_jobs=None,
                              transformer_list=[('linear_pca',
                                                 PCA(copy=True,
                                                     iterated_power='auto',
                                                     n_components=6,
                                                     random_state=None,
                                                     svd_solver='auto', tol=0.0,
                                                     whiten=False)),
                                                ('select_k_best',
                                                 SelectKBest(k=8,
                                                             score_func=<function f_classif at 0x7fa027460730>))],
                              transformer_weights=None, verbose=False)),
                ('lr',
                 LinearRegression(copy_X=True, fit_intercept=True, n_

###### ColumnTransformer for heterogeneous data
Many datasets contain features of different types, say text, floats, and dates, where each type of feature requires separate preprocessing or feature extraction steps.

In [155]:
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer


Let's use a subset of our data to illustrate a few more points.

In [143]:
X=train_data.iloc[:,0:10].drop(['Alley'],axis=1)
X

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,LotShape,LandContour,Utilities
0,1,60,RL,65.0,8450,Pave,Reg,Lvl,AllPub
1,2,20,RL,80.0,9600,Pave,Reg,Lvl,AllPub
2,3,60,RL,68.0,11250,Pave,IR1,Lvl,AllPub
3,4,70,RL,60.0,9550,Pave,IR1,Lvl,AllPub
4,5,60,RL,84.0,14260,Pave,IR1,Lvl,AllPub
...,...,...,...,...,...,...,...,...,...
1455,1456,60,RL,62.0,7917,Pave,Reg,Lvl,AllPub
1456,1457,20,RL,85.0,13175,Pave,Reg,Lvl,AllPub
1457,1458,70,RL,66.0,9042,Pave,Reg,Lvl,AllPub
1458,1459,20,RL,68.0,9717,Pave,Reg,Lvl,AllPub


In [144]:
X.columns

Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'LotShape', 'LandContour', 'Utilities'],
      dtype='object')

For this data, we might want to encode the 'Street' column as a categorical variable using OneHotEncoder.

As we might use multiple feature extraction methods on the same column, we give each transformer a unique name, say 'street_category'.

By default, the remaining columns are ignored (remainder='drop').

We can keep the remaining columns by setting remainder='passthrough'. The remainder parameter can also be set to an estimator to transform the remaining columns.

In [145]:
column_trans = ColumnTransformer(
     [('street_category', OneHotEncoder(dtype='int'),['Street'])],
     remainder='drop')

column_trans.fit(X)

ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
                  transformer_weights=None,
                  transformers=[('street_category',
                                 OneHotEncoder(categories='auto', drop=None,
                                               dtype='int',
                                               handle_unknown='error',
                                               sparse=True),
                                 ['Street'])],
                  verbose=False)

In [146]:
column_trans.get_feature_names()

['street_category__x0_Grvl', 'street_category__x0_Pave']
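The three `remainder` options can be compared side by side. This is a sketch on a hypothetical four-row miniature of the housing subset (the `toy` frame and the `encode_street` helper are made up for illustration), looking only at the shape and content of the output.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical miniature of the housing subset
toy = pd.DataFrame({'Street': ['Pave', 'Grvl', 'Pave', 'Pave'],
                    'LotArea': [8450, 9600, 11250, 9550],
                    'MSSubClass': [60, 20, 60, 70]})

def encode_street(remainder):
    ct = ColumnTransformer([('street_category', OneHotEncoder(), ['Street'])],
                           remainder=remainder)
    out = ct.fit_transform(toy)
    # Densify if the result came back as a scipy sparse matrix
    return out.toarray() if hasattr(out, 'toarray') else np.asarray(out)

dropped = encode_street('drop')           # only the 2 one-hot columns
kept = encode_street('passthrough')       # one-hot columns + 2 untouched columns
scaled = encode_street(StandardScaler())  # one-hot columns + 2 standardized columns

print(dropped.shape, kept.shape, scaled.shape)
```

The remainder columns are appended after the explicitly transformed ones, so `kept[:, 2:]` holds the original LotArea and MSSubClass values.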

make_column_selector is used to select columns based on data type or column name. Let's use OneHotEncoder on the categorical data:

In [147]:
from sklearn.preprocessing import StandardScaler
from sklearn.compose import make_column_selector
ct = ColumnTransformer([
       ('scale', StandardScaler(),
       make_column_selector(dtype_include=np.number)),
       ('ohe',OneHotEncoder(),
       make_column_selector(pattern='Street', dtype_include=object))])
ct.fit_transform(X)

array([[-1.73086488,  0.07337496, -0.20803433, -0.20714171,  0.        ,
         1.        ],
       [-1.7284922 , -0.87256276,  0.40989452, -0.09188637,  0.        ,
         1.        ],
       [-1.72611953,  0.07337496, -0.08444856,  0.07347998,  0.        ,
         1.        ],
       ...,
       [ 1.72611953,  0.30985939, -0.16683907, -0.14781027,  0.        ,
         1.        ],
       [ 1.7284922 , -0.87256276, -0.08444856, -0.08016039,  0.        ,
         1.        ],
       [ 1.73086488, -0.87256276,  0.20391824, -0.05811155,  0.        ,
         1.        ]])

If you use LabelEncoder() in place of OneHotEncoder(), you will run into this error:

fit_transform() takes 2 positional arguments but 3 were given

Read here for a workaround:
https://stackoverflow.com/questions/46162855/fit-transform-takes-2-positional-arguments-but-3-were-given-with-labelbinarize
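The error arises because LabelEncoder is meant for the target: its fit_transform accepts only `y`, while pipelines and ColumnTransformer call `fit_transform(X, y)`. As an aside, and as a swapped-in alternative rather than the approach taken in this notebook, scikit-learn's OrdinalEncoder is the feature-side encoder that produces integer codes and fits into a ColumnTransformer directly. A sketch on a hypothetical three-row frame:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy frame with two string columns
toy = pd.DataFrame({'Street': ['Pave', 'Grvl', 'Pave'],
                    'LotShape': ['Reg', 'IR1', 'IR1']})

ct = ColumnTransformer([('ordinal', OrdinalEncoder(), ['Street', 'LotShape'])])
codes = ct.fit_transform(toy)

# Categories are sorted per column: Grvl=0, Pave=1 and IR1=0, Reg=1
print(codes)
```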

Didn't I tell you that we can use custom functionality? Let's change how the label encoder is implemented so that we can use it in our pipeline!

In [148]:
from sklearn.base import BaseEstimator, TransformerMixin

If you want to add custom functionality to your pipeline, it will be one of two things: a transformation that is not available in sklearn (or is available, but not in a suitable form), or a custom estimator.

You will have observed by now that we have to create an object of every transformer or estimator before calling its methods. For example:

>lr=LinearRegression()

>lr.fit(X, y)

Pipelines work the same way, so we need to implement our functionality as classes. You don't have to do everything from scratch, though: scikit-learn has you covered.

Custom functionality in sklearn relies on inheriting from two classes:


- class sklearn.base.TransformerMixin:
Mixin class for all transformers in scikit-learn. It provides fit_transform() once you define fit() and transform(), so every transformation you write should inherit from it.

- class sklearn.base.BaseEstimator:
Base class for all estimators in scikit-learn. It provides get_params() and set_params(), which pipelines and grid searches rely on.
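A minimal sketch of what the two parent classes give you. `ColumnClipper` is a hypothetical transformer invented for this illustration (it is not part of scikit-learn): only `__init__`, `fit` and `transform` are written by hand, while `fit_transform` and `get_params` are inherited.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class ColumnClipper(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: clip each column to the range seen during fit."""

    def __init__(self, margin=0.0):
        self.margin = margin  # hyperparameters are stored unchanged in __init__

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        self.low_ = X.min(axis=0) - self.margin   # learned state ends with "_"
        self.high_ = X.max(axis=0) + self.margin
        return self

    def transform(self, X):
        return np.clip(np.asarray(X, dtype=float), self.low_, self.high_)

clipper = ColumnClipper(margin=1.0)
train = np.array([[0.0, 10.0], [5.0, 20.0]])
new = np.array([[-3.0, 15.0], [9.0, 30.0]])

clipper.fit_transform(train)      # provided by TransformerMixin
clipped = clipper.transform(new)  # values outside [low - 1, high + 1] are clipped
print(clipped)
print(clipper.get_params())       # provided by BaseEstimator
```

Because `get_params` and `set_params` work, such a class can be used as a pipeline step and its `margin` tuned in a grid search with the usual `<step>__margin` syntax.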


So this is our custom functionality:

In [149]:
class MultiColumnLabelEncoder(BaseEstimator,TransformerMixin):
    
    def __init__(self, columns = None):
        self.columns = columns # list of column to encode
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        '''
        Transforms columns of X specified in self.columns using
        LabelEncoder(). If no columns specified, transforms all
        columns in X.
        '''
        
        output = X.copy()
        
        if self.columns is not None:
            for col in self.columns:
                output[col] = LabelEncoder().fit_transform(output[col])
        else:
            for colname, col in output.items():  # iteritems() was removed in pandas 2.0
                output[colname] = LabelEncoder().fit_transform(col)
        
        return output
    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)
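One caveat with the class above: it calls `LabelEncoder().fit_transform` inside `transform`, so the codes are relearned on whatever data is passed in and may differ between training and new data. A hedged sketch of a variant that learns one encoder per column in `fit` and reuses it in `transform` follows; the class name `FittedMultiColumnLabelEncoder` and the toy frames are hypothetical. Like LabelEncoder itself, it raises an error on labels that were not seen during fit.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelEncoder

class FittedMultiColumnLabelEncoder(BaseEstimator, TransformerMixin):
    """Hypothetical variant: one LabelEncoder per column, learned in fit()."""

    def __init__(self, columns=None):
        self.columns = columns  # list of columns to encode; None means all

    def fit(self, X, y=None):
        cols = list(X.columns) if self.columns is None else list(self.columns)
        self.encoders_ = {col: LabelEncoder().fit(X[col]) for col in cols}
        return self

    def transform(self, X):
        output = X.copy()
        for col, encoder in self.encoders_.items():
            output[col] = encoder.transform(output[col])
        return output

train = pd.DataFrame({'Street': ['Pave', 'Grvl', 'Pave']})
new = pd.DataFrame({'Street': ['Pave', 'Pave']})

enc = FittedMultiColumnLabelEncoder().fit(train)
new_codes = enc.transform(new)['Street'].tolist()
print(new_codes)  # 'Pave' keeps the code it was given during fit
```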

In [108]:
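# Encode the string columns first (the encoder needs a DataFrame), then impute.
# SimpleImputer is imported in the ColumnTransformer section above, so that cell
# must run first; the keyword is `strategy`, in lowercase.
cat_cols = ['MSZoning', 'Street', 'LotShape', 'LandContour', 'Utilities']
sk_pipe = Pipeline([("MLCLE", MultiColumnLabelEncoder(columns=cat_cols)),
                    ("missing", SimpleImputer(strategy='most_frequent')),
                    ("lr", LinearRegression())])

In [107]:
# The target must come from the same table as X: the Boston y has 506 rows,
# while X has 1460, which is what raised the inconsistent-samples error.
sk_pipe.fit(X, train_data['SalePrice'])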

One can also exploit FeatureUnion.

scikit-learn also provides FunctionTransformer in its preprocessing module. It can be used in a similar manner to the class above, but with less flexibility. If the input and output of the function are configured properly, the transformer supplies the fit/transform/fit_transform methods for the function and thus allows it to be used in a scikit-learn pipeline.

For example, if the input to a pipeline is a series, the transformer would be as follows:



In [43]:
def trans_func(input_series):
    # placeholder: replace the next line with your own transformation
    output_series = input_series
    return output_series

from sklearn.preprocessing import FunctionTransformer
name_transformer = FunctionTransformer(trans_func)

sk_pipe = Pipeline([("trans", name_transformer), ("lr", LinearRegression())])


In [44]:
sk_pipe

Pipeline(memory=None,
         steps=[('trans',
                 FunctionTransformer(accept_sparse=False, check_inverse=True,
                                     func=<function trans_func at 0x7f4ed9cbc1e0>,
                                     inv_kw_args=None, inverse_func=None,
                                     kw_args=None, validate=False)),
                ('lr',
                 LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
                                  normalize=False))],
         verbose=False)
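For a concrete function rather than a placeholder, here is a sketch that wraps NumPy's `log1p` (with `expm1` as its inverse) in a FunctionTransformer and uses it as a pipeline step. The tiny dataset is constructed so that the target is exactly linear in log1p(x); the names `log_step`, `X_demo`, `y_demo` and `demo_pipe` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# Stateless transformer built from a plain function and its inverse
log_step = FunctionTransformer(np.log1p, inverse_func=np.expm1)

X_demo = np.array([[1.0], [10.0], [100.0], [1000.0]])
y_demo = np.log1p(X_demo).ravel() * 3.0 + 2.0  # exactly linear in log1p(x)

demo_pipe = Pipeline([('log', log_step), ('lr', LinearRegression())])
demo_pipe.fit(X_demo, y_demo)

# The linear model sees log1p(x), so it should recover slope 3 and intercept 2
print(demo_pipe.named_steps['lr'].coef_, demo_pipe.named_steps['lr'].intercept_)
```

FunctionTransformer learns nothing in `fit`, which is the "less flexibility" mentioned above: anything that must be learned from the training data needs a custom class instead.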

Let's put all these concepts together and create a simple pipeline!

Before jumping to the pipeline, a few things still need to be taken care of:

In [165]:
train_data = pd.read_csv('../hitchhikersGuideToMachineLearning/home-data-for-ml-course/train.csv' , index_col ='Id') 
X_test_full = pd.read_csv('../hitchhikersGuideToMachineLearning/home-data-for-ml-course/test.csv', index_col='Id')


In [166]:
X_test_full['Neighborhood'].unique()
train_data['Neighborhood']

Id
1       CollgCr
2       Veenker
3       CollgCr
4       Crawfor
5       NoRidge
         ...   
1456    Gilbert
1457     NWAmes
1458    Crawfor
1459      NAmes
1460    Edwards
Name: Neighborhood, Length: 1460, dtype: object

In [167]:
train_data.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = train_data.SalePrice
train_data.drop(['SalePrice'], axis=1, inplace=True)

# Break off validation set from training data
X_train_full, X_valid_full, y_train, y_valid = train_test_split(train_data, y, train_size=0.8, test_size=0.2,
                                                                random_state=0)

# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary) for one hot encoding
categorical_cols_type1 = [cname for cname in X_train_full.columns if
                    X_train_full[cname].nunique() <= 10 and 
                    X_train_full[cname].dtype == "object"]

# Select categorical columns with high cardinality (more than 10 unique values)
categorical_cols_type2 = [cname for cname in X_train_full.columns if
                    X_train_full[cname].nunique() > 10 and 
                    X_train_full[cname].dtype == "object"]

# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns if 
                X_train_full[cname].dtype in ['int64', 'float64']]

# Keep selected columns only
my_cols = categorical_cols_type1+categorical_cols_type2 + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()
X_test = X_test_full[my_cols].copy()
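The same split can be tried on a tiny frame to see what lands in each list. This is a sketch with a hypothetical three-column frame and a cut-off of 2 instead of 10; it uses pandas' `is_numeric_dtype` to separate numeric from non-numeric columns, which is a swapped-in check rather than the dtype comparison used in the cell above.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical miniature of the training data
toy = pd.DataFrame({
    'Street': ['Pave', 'Grvl', 'Pave', 'Pave'],                    # 2 unique values
    'Neighborhood': ['CollgCr', 'Veenker', 'Crawfor', 'NoRidge'],  # 4 unique values
    'LotArea': [8450, 9600, 11250, 9550],
})
threshold = 2  # hypothetical cut-off for "low cardinality"

numeric = [c for c in toy.columns if is_numeric_dtype(toy[c])]
categorical = [c for c in toy.columns if not is_numeric_dtype(toy[c])]
low_card = [c for c in categorical if toy[c].nunique() <= threshold]
high_card = [c for c in categorical if toy[c].nunique() > threshold]

print(low_card, high_card, numeric)
```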

In [168]:
class LabelOneHotEncoder():
    def __init__(self):
        self.ohe = OneHotEncoder()
        self.le = LabelEncoder()
    def fit_transform(self, x):
        features = self.le.fit_transform(x)
        return self.ohe.fit_transform(features.reshape(-1, 1))
    def transform(self, x):
        # encode the labels first, then reshape to a column for the one-hot encoder
        return self.ohe.transform(self.le.transform(x).reshape(-1, 1))
    def inverse_transform(self, x):
        # the one-hot inverse returns a column of integer codes; flatten it for the label encoder
        return self.le.inverse_transform(self.ohe.inverse_transform(x).ravel())
    def inverse_labels(self, x):
        return self.le.inverse_transform(x)

In [169]:
class ModifiedLabelEncoder(LabelEncoder):

    def fit_transform(self, y, *args, **kwargs):
        return super().fit_transform(y).reshape(-1, 1)

    def transform(self, y, *args, **kwargs):
        return super().transform(y).reshape(-1, 1)


pipe = Pipeline([("le", ModifiedLabelEncoder()), ("ohe", OneHotEncoder())])
pipe.fit_transform(['dog', 'cat', 'dog'])

<3x2 sparse matrix of type '<class 'numpy.float64'>'
	with 3 stored elements in Compressed Sparse Row format>

In [171]:
pipe.fit_transform(X_train["Street"])

<1168x2 sparse matrix of type '<class 'numpy.float64'>'
	with 1168 stored elements in Compressed Sparse Row format>

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='constant')

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
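        # categorical_cols was never defined; use both categorical lists built above
        ('cat', categorical_transformer, categorical_cols_type1 + categorical_cols_type2)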
    ])

# Define model
model = RandomForestRegressor(n_estimators=100, random_state=0)

# Bundle preprocessing and modeling code in a pipeline
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('model', model)
                     ])

# Preprocessing of training data, fit model 
clf.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = clf.predict(X_valid)

print('MAE:', mean_absolute_error(y_valid, preds))
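To see what the preprocessor does with missing values and with categories that never appeared in training, here is a sketch on a hypothetical four-row frame. It mirrors the structure above (constant imputation for numbers, most-frequent imputation plus one-hot encoding with `handle_unknown='ignore'` for categories) but swaps in LinearRegression to keep the example tiny; the data and the names `toy_model` and `encoded_new` are made up.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'LotArea': [8450.0, np.nan, 11250.0, 9550.0],
                      'Street': ['Pave', 'Grvl', 'Pave', 'Pave']})
target = pd.Series([208500, 181500, 223500, 140000])
# 'Dirt' never appears in training, and LotArea is missing in the first row
new = pd.DataFrame({'LotArea': [np.nan, 10000.0], 'Street': ['Dirt', 'Pave']})

preprocess = ColumnTransformer([
    ('num', SimpleImputer(strategy='constant'), ['LotArea']),  # fills 0 for numbers
    ('cat', Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                      ('onehot', OneHotEncoder(handle_unknown='ignore'))]), ['Street']),
])
toy_model = Pipeline([('preprocessor', preprocess), ('model', LinearRegression())])
toy_model.fit(train, target)

encoded_new = toy_model.named_steps['preprocessor'].transform(new)
if hasattr(encoded_new, 'toarray'):  # densify if a sparse matrix came back
    encoded_new = encoded_new.toarray()

print(encoded_new)                   # the unseen 'Dirt' becomes an all-zero one-hot row
print(toy_model.predict(new).shape)
```

Without `handle_unknown='ignore'`, the unseen 'Dirt' value would raise an error at prediction time instead of being encoded as all zeros.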

In [None]:
# Preprocessing for numerical data
numerical_transformer1 = SimpleImputer(strategy='constant')

# Preprocessing for categorical data
categorical_transformer1 = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
preprocessor1 = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer1, numerical_cols),
        # categorical_cols was never defined; use both categorical lists built above
        ('cat', categorical_transformer1, categorical_cols_type1 + categorical_cols_type2)
    ])

# Define model
model1 = RandomForestRegressor(n_estimators=150, random_state=0)

In [None]:
# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor1),
                              ('model', model1)
                             ])

# Preprocessing of training data, fit model 
my_pipeline.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)

# Evaluate the model
score = mean_absolute_error(y_valid, preds)
print('MAE:', score)


