Pipelines sequentially apply **a list of transforms** and a **final estimator**. Intermediate steps of the pipeline must be ‘transforms’, that is, they must implement fit and transform methods. The final estimator only needs to implement fit.


### Summary

* Pipeline of transforms with a final estimator.

* Sequentially apply a list of transforms and a final estimator. 

    * Intermediate steps of the pipeline must be ‘transforms’, that is, they must implement fit and transform methods. 
    * The final estimator only needs to implement fit. The transformers in the pipeline can be cached using the memory argument.

* The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters. 
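
As a minimal sketch of what "cross-validated together" means, the snippet below passes a whole pipeline to `cross_val_score`. The names `X_demo`, `y_demo` and `demo_pipeline`, and the choice of `max_iter=200`, are illustrative assumptions and not part of the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Illustrative data: the full iris dataset
X_demo, y_demo = load_iris(return_X_y=True)

# Two transforms followed by a final estimator
demo_pipeline = Pipeline(steps=[
    ("scale", MinMaxScaler()),
    ("pca", PCA(n_components=2)),
    ("clf", LogisticRegression(max_iter=200)),
])

# In every fold the scaler and the PCA are refit on the training part only,
# so no information from the held-out part leaks into the preprocessing
cv_scores = cross_val_score(demo_pipeline, X_demo, y_demo, cv=5)
print(cv_scores.mean())
```

Because the preprocessing lives inside the pipeline, the same object can later be handed to a grid search without changing the code.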

In [1]:
import pandas as pd
import numpy as np

In [2]:
from sklearn import datasets
iris = datasets.load_iris()

X = iris.data
y = iris.target

from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                      random_state=0)

## Simple Pipeline

The simple pipeline is composed of the following steps:
- Transformation
    - Scaling values between 0 and 1
    - PCA (we keep 2 components): reduces the number of columns, keeping only two
- Estimator
    - Logistic Regression

### Transformation


In [3]:
iris.feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [4]:
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import Normalizer

preprocessing_transformer = Pipeline(steps=[('scale_01', MinMaxScaler(feature_range=(0, 1))),
                                            ('PCA', PCA(n_components=2))])
# the steps are a list of (label, operator) pairs: a name and the operator to apply
# here two operations are applied in cascade

### Estimator

In [5]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(solver='lbfgs', multi_class='auto')

### Creating and evaluating the Pipeline

In [15]:
from sklearn.metrics import accuracy_score

# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessing_transformer', preprocessing_transformer),
                              ('model', model)
                             ], verbose = True)
# verbose prints the execution time of each step


# Preprocessing of training data, fit model 
my_pipeline.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)

# Evaluate the model
score = accuracy_score(y_valid, preds)
print('Accuracy Score:', score)

[Pipeline]  (step 1 of 2) Processing preprocessing_transformer, total=   0.0s
[Pipeline] ............. (step 2 of 2) Processing model, total=   0.0s
Accuracy Score: 0.8666666666666667


### Analyzing the transformation

In [7]:
X_train

array([[6.4, 3.1, 5.5, 1.8],
       [5.4, 3. , 4.5, 1.5],
       [5.2, 3.5, 1.5, 0.2],
       [6.1, 3. , 4.9, 1.8],
       [6.4, 2.8, 5.6, 2.2],
       [5.2, 2.7, 3.9, 1.4],
       [5.7, 3.8, 1.7, 0.3],
       [6. , 2.7, 5.1, 1.6],
       [5.9, 3. , 4.2, 1.5],
       [5.8, 2.6, 4. , 1.2],
       [6.8, 3. , 5.5, 2.1],
       [4.7, 3.2, 1.3, 0.2],
       [6.9, 3.1, 5.1, 2.3],
       [5. , 3.5, 1.6, 0.6],
       [5.4, 3.7, 1.5, 0.2],
       [5. , 2. , 3.5, 1. ],
       [6.5, 3. , 5.5, 1.8],
       [6.7, 3.3, 5.7, 2.5],
       [6. , 2.2, 5. , 1.5],
       [6.7, 2.5, 5.8, 1.8],
       [5.6, 2.5, 3.9, 1.1],
       [7.7, 3. , 6.1, 2.3],
       [6.3, 3.3, 4.7, 1.6],
       [5.5, 2.4, 3.8, 1.1],
       [6.3, 2.7, 4.9, 1.8],
       [6.3, 2.8, 5.1, 1.5],
       [4.9, 2.5, 4.5, 1.7],
       [6.3, 2.5, 5. , 1.9],
       [7. , 3.2, 4.7, 1.4],
       [6.5, 3. , 5.2, 2. ],
       [6. , 3.4, 4.5, 1.6],
       [4.8, 3.1, 1.6, 0.2],
       [5.8, 2.7, 5.1, 1.9],
       [5.6, 2.7, 4.2, 1.3],
       [5.6, 2

In [8]:
transformed_Dataset = preprocessing_transformer.fit_transform(X_train)

In [None]:
#preprocessing_transformer.transform(X_valid)

In [9]:
transformed_Dataset

array([[ 3.92209763e-01,  4.84561075e-02],
       [ 9.04869356e-02, -8.95853266e-02],
       [-6.29042414e-01,  1.33227096e-01],
       [ 2.97394765e-01, -1.69364128e-02],
       [ 5.25821952e-01, -7.19394322e-02],
       [-8.81287220e-03, -2.16744469e-01],
       [-5.36677344e-01,  3.00755516e-01],
       [ 2.68771542e-01, -1.40752213e-01],
       [ 1.18143450e-01, -2.69059275e-02],
       [ 2.51091885e-02, -1.81673754e-01],
       [ 5.25692275e-01,  5.32019212e-02],
       [-6.94577142e-01, -3.59042800e-02],
       [ 5.43323143e-01,  1.04376017e-01],
       [-5.34922624e-01,  1.01994571e-01],
       [-6.15543324e-01,  2.31944792e-01],
       [-1.46419376e-01, -4.91909893e-01],
       [ 4.09356632e-01,  2.26626713e-02],
       [ 6.26859126e-01,  1.45225238e-01],
       [ 2.57239670e-01, -3.25774678e-01],
       [ 4.91330874e-01, -1.45418497e-01],
       [-3.11137441e-02, -2.39957062e-01],
       [ 7.51055857e-01,  1.48508623e-01],
       [ 2.30644832e-01,  1.25073917e-01],
       [-4.

In [10]:
type(transformed_Dataset) # in general, every transformer returns an array

numpy.ndarray

In [11]:
tra_df = pd.DataFrame(transformed_Dataset)

In [12]:
tra_df.shape

(120, 2)

In [13]:
tra_df.head()

          0         1
0  0.392210  0.048456
1  0.090487 -0.089585
2 -0.629042  0.133227
3  0.297395 -0.016936
4  0.525822 -0.071939


In [None]:
# Simple check

scaled = MinMaxScaler(feature_range=(0, 1))

scaled_X_train = scaled.fit_transform(X_train)

print(scaled_X_train[:4])

pcaed = PCA(n_components=2)
pca_X_train = pcaed.fit_transform(scaled_X_train)

In [None]:
pca_X_train[:4]
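
The manual check above can also be made explicit with `np.allclose`. This is a sketch under the assumption that the full iris data is used instead of `X_train`; the names `pipe`, `from_pipeline` and `by_hand` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X_demo, _ = load_iris(return_X_y=True)

# Scaling followed by PCA, wrapped in a pipeline
pipe = Pipeline(steps=[("scale_01", MinMaxScaler(feature_range=(0, 1))),
                       ("PCA", PCA(n_components=2))])
from_pipeline = pipe.fit_transform(X_demo)

# The same two operations applied by hand, one after the other
scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(X_demo)
by_hand = PCA(n_components=2).fit_transform(scaled)

# True when both routes produce the same transformed dataset
same = np.allclose(from_pipeline, by_hand)
print(same)
```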

## ColumnTransformer: Managing different kinds of transformers on different columns
It applies an operator only to the columns you select.
For example:
- apply PCA to a single column only
- apply the OneHot encoder **only** to the text columns


Extracted and extended from a kaggle.com tutorial

Applies transformers to columns of an array or pandas DataFrame.

This estimator allows different columns or column subsets of the input to be transformed separately and the features generated by each transformer will be concatenated to form a single feature space. This is useful for heterogeneous or columnar data, to combine several feature extraction mechanisms or transformations into a single transformer.
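
To make the concatenation concrete, here is a small sketch on a made-up DataFrame. The column names and values in `toy` are hypothetical, and `sparse_threshold=0` is used only to get a dense array that is easy to inspect.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical data: two numeric columns and one text column
toy = pd.DataFrame({
    "rooms": [2, 3, 4, 3],
    "distance": [2.5, 5.0, 7.5, 10.0],
    "type": ["h", "u", "t", "h"],
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", MinMaxScaler(), ["rooms", "distance"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["type"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# 2 scaled numeric columns + 3 one-hot columns (h, t, u) side by side
toy_features = toy_preprocessor.fit_transform(toy)
print(toy_features.shape)
```

The outputs are concatenated in the order in which the transformers are listed, so the numeric columns come first here.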

In [19]:
dataset = pd.read_csv("data/melb_data.csv")
dataset.head(5)
# we want to predict the price given all the other fields
# goal: a pipeline that handles the numerical fields in one way and the text fields in another


Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,3/12/2016,2.5,3067.0,...,1.0,1.0,202.0,,,Yarra,-37.7996,144.9984,Northern Metropolitan,4019.0
1,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,4/02/2016,2.5,3067.0,...,1.0,0.0,156.0,79.0,1900.0,Yarra,-37.8079,144.9934,Northern Metropolitan,4019.0
2,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,4/03/2017,2.5,3067.0,...,2.0,0.0,134.0,150.0,1900.0,Yarra,-37.8093,144.9944,Northern Metropolitan,4019.0
3,Abbotsford,40 Federation La,3,h,850000.0,PI,Biggin,4/03/2017,2.5,3067.0,...,2.0,1.0,94.0,,,Yarra,-37.7969,144.9969,Northern Metropolitan,4019.0
4,Abbotsford,55a Park St,4,h,1600000.0,VB,Nelson,4/06/2016,2.5,3067.0,...,1.0,2.0,120.0,142.0,2014.0,Yarra,-37.8072,144.9941,Northern Metropolitan,4019.0


In [None]:
dataset.describe()

In [None]:
dataset['Price']

In [None]:
#dataset = dataset[dataset['Price'].isnull()==False]

In [None]:
columns = dataset.columns.to_list()
columns

In [None]:
if 'Price' in columns:
  columns.remove('Price')

X = dataset[columns]
y = dataset['Price']

In [None]:
X.info()

In [None]:
#X = X[['Longtitude', 'Lattitude', 'YearBuilt', 'Car', 'Bathroom', 'Bedroom2', 'Rooms', 'Postcode']]
#X.isnull().any(axis=0)
#X.fillna(0, inplace=True)

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                      random_state=0)

We construct the full pipeline in three steps.
* Step 1: Define Preprocessing Steps
* Step 2: Define the Model
* Step 3: Create and Evaluate the Pipeline

### Step 1: Define Preprocessing Steps

Similar to how a pipeline bundles together preprocessing and modeling steps, we use the ColumnTransformer class to bundle together different preprocessing steps. The code below:
- imputes missing values in numerical data, and
- imputes missing values and applies a one-hot encoding to categorical data.

In [None]:
# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
categorical_cols = [cname for cname in X_train.columns if
                    X_train[cname].nunique() < 10 and 
                    X_train[cname].dtype == "object"]

# Select numerical columns
numerical_cols = [cname for cname in X_train.columns if 
                X_train[cname].dtype in ['int64', 'float64']]

In [None]:
categorical_cols

In [None]:
numerical_cols

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='most_frequent')



# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))  # sparse_output replaces sparse in scikit-learn >= 1.2
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
       ('num', numerical_transformer, numerical_cols),
       ('cat', categorical_transformer, categorical_cols)
    ])

### Step 2: Define the Model

In [None]:
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=100, random_state=0)

### Step 3: Create and Evaluate the Pipeline

In [None]:
from sklearn.metrics import mean_absolute_error

# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[
                              ('preprocessor', preprocessor),
                              ('model', model),
                             ])

# Preprocessing of training data, fit model 
my_pipeline.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_test)

# Evaluate the model
score = mean_absolute_error(y_test, preds)
print('MAE:', score)

In [None]:
np.isnan(preds).sum()  # NaN never compares equal to anything, so count missing predictions with np.isnan

In [None]:
from sklearn.metrics import r2_score

r2_score(y_test, preds)

In [None]:
pd.DataFrame({'label':y_test, 'preds':preds}).head(4)

### Parameter tuning

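Parameters of the individual steps can be set by joining the step name and the parameter name with a double underscore (‘__’). For nested objects the names are chained, for example preprocessor__cat__imputer__strategy.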
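
The naming scheme can be tried out directly with `set_params` and `get_params`. This is a minimal sketch; the two-step `demo_pipeline` and the values being set are assumptions made for illustration.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

demo_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("model", RandomForestRegressor(n_estimators=100, random_state=0)),
])

# <step name>__<parameter name> addresses a parameter of a single step
demo_pipeline.set_params(model__n_estimators=50, imputer__strategy="median")

# get_params lists every name that can be tuned, e.g. in a grid search
param_names = sorted(demo_pipeline.get_params().keys())
print(demo_pipeline.get_params()["model__n_estimators"])
```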

In [None]:
# Example using a Grid Search
from sklearn.model_selection import GridSearchCV

parameters = {
    'model__n_estimators': [10,50,100],
    'preprocessor__num__strategy': ['most_frequent','constant'],
    'preprocessor__cat__imputer__strategy': ['most_frequent','constant'],
}

gs_clf = GridSearchCV(my_pipeline, parameters, scoring='neg_mean_absolute_error', cv=5, n_jobs=-1)

gs_clf.fit(X, y)

In [None]:
gs_clf.best_params_

In [None]:
gs_clf.best_score_

Applying the transformer to only some of the attributes

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='most_frequent')


# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[('num', numerical_transformer, numerical_cols)],
       remainder='passthrough')

In [None]:
# This generates an error: remainder='passthrough' forwards the text columns unchanged, and the model cannot fit on strings

from sklearn.metrics import mean_absolute_error

# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[
                              ('preprocessor', preprocessor),
                              ('model', model),
                             ])

# Preprocessing of training data, fit model 
my_pipeline.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_test)

# Evaluate the model
score = mean_absolute_error(y_test, preds)
print('MAE:', score)
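
The failure can be reproduced on a tiny made-up dataset. In this sketch `toy_X`, `toy_y` and `passthrough_pipeline` are hypothetical names; the point is only that a passed-through text column reaches the model untouched.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Hypothetical data: one numeric column and one text column
toy_X = pd.DataFrame({"rooms": [2, 3, 4, 3], "type": ["h", "u", "t", "h"]})
toy_y = [1.0, 2.0, 3.0, 2.5]

passthrough_pipeline = Pipeline(steps=[
    ("preprocessor", ColumnTransformer(
        transformers=[("num", SimpleImputer(strategy="most_frequent"), ["rooms"])],
        remainder="passthrough")),  # 'type' is forwarded as raw strings
    ("model", RandomForestRegressor(n_estimators=10, random_state=0)),
])

# The model tries to convert the strings to floats and fails
try:
    passthrough_pipeline.fit(toy_X, toy_y)
    fit_error = None
except ValueError as exc:
    fit_error = exc

print(type(fit_error).__name__)
```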

Passing an estimator as the remainder

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='most_frequent')


# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[('num', numerical_transformer, numerical_cols)],
       remainder=OneHotEncoder(handle_unknown='ignore', sparse_output=False))  # sparse_output replaces sparse in scikit-learn >= 1.2

In [None]:

from sklearn.metrics import mean_absolute_error

# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[
                              ('preprocessor', preprocessor),
                              ('model', model),
                             ])

# Preprocessing of training data, fit model 
my_pipeline.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_test)

# Evaluate the model
score = mean_absolute_error(y_test, preds)
print('MAE:', score)

Alternative techniques to indicate columns

In [None]:
X.columns.get_indexer(numerical_cols)

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='most_frequent')


# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[('num', numerical_transformer, X.columns.get_indexer(numerical_cols))])

In [None]:
preprocessor.fit_transform(X_train)
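
Another way to indicate columns, not used in the original notebook, is `make_column_selector` from `sklearn.compose`, which picks columns by dtype at fit time. The sketch below uses a made-up DataFrame `toy`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer

# Hypothetical data with missing values in the numeric columns
toy = pd.DataFrame({
    "rooms": [2.0, np.nan, 4.0],
    "distance": [2.5, 5.0, np.nan],
    "type": ["h", "u", "t"],
})

# The selector is evaluated on the DataFrame passed to fit
selector_preprocessor = ColumnTransformer(
    transformers=[("num", SimpleImputer(strategy="most_frequent"),
                   make_column_selector(dtype_include=np.number))])

# The text column is dropped because remainder defaults to 'drop'
numeric_only = selector_preprocessor.fit_transform(toy)
print(numeric_only.shape)
```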

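Be careful!

In [None]:
# This code raises an error: SimpleImputer returns a NumPy array, so the column names in numerical_cols can no longer be resolved by the ColumnTransformer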

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer


# Preprocessing for all dataset
simple=SimpleImputer(strategy='most_frequent')

# Preprocessing for numerical data
numerical_transformer = Normalizer()


# Bundle preprocessing for numerical
preprocessor = ColumnTransformer(
    transformers=[('num', numerical_transformer, numerical_cols)])

# Bundle preprocessing 
preprocessing_transformer = Pipeline(steps=[('simple', simple),
                                            ('preprocessor', preprocessor)
                                            ])

In [None]:
preprocessing_transformer.fit_transform(X_train)

In [None]:
X_train.columns.get_indexer(numerical_cols)

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer


# Preprocessing for all dataset
simple=SimpleImputer(strategy='most_frequent')

# Preprocessing for numerical data
numerical_transformer = Normalizer()


# Bundle preprocessing for numerical
preprocessor = ColumnTransformer(
    transformers=[('num', numerical_transformer, X_train.columns.get_indexer(numerical_cols))])

# Bundle preprocessing 
preprocessing_transformer = Pipeline(steps=[('simple', simple),
                                            ('preprocessor', preprocessor)
                                            ])

In [None]:
preprocessing_transformer.fit_transform(X_train)
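
The difference between the two variants can be sketched on a small numeric DataFrame. The data in `toy` and the names `by_name` and `by_index` are assumptions for illustration; the default output configuration of scikit-learn (NumPy arrays) is assumed.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer

toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})

# After the imputer the data is a plain array: the column names are gone
imputed = SimpleImputer(strategy="most_frequent").fit_transform(toy)

# Selecting by name after the imputer fails
by_name = Pipeline(steps=[
    ("simple", SimpleImputer(strategy="most_frequent")),
    ("preprocessor", ColumnTransformer([("num", Normalizer(), ["a", "b"])])),
])
try:
    by_name.fit_transform(toy)
    name_error = None
except ValueError as exc:
    name_error = exc

# Selecting by position works, because positions survive the conversion
by_index = Pipeline(steps=[
    ("simple", SimpleImputer(strategy="most_frequent")),
    ("preprocessor", ColumnTransformer(
        [("num", Normalizer(), toy.columns.get_indexer(["a", "b"]))])),
])
result = by_index.fit_transform(toy)
print(result.shape)
```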

## FeatureUnion: Applying multiple transformers in parallel

Concatenates results of multiple transformer objects.

This estimator applies a list of transformer objects in parallel to the input data, then concatenates the results. This is useful to combine several feature extraction mechanisms into a single transformer.

In [None]:
from sklearn.pipeline import FeatureUnion
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.feature_selection import SelectKBest
from sklearn.preprocessing import MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

In [None]:
from sklearn import datasets
iris = datasets.load_iris()

X = iris.data
y = iris.target

from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                      random_state=0)

In [None]:
iris.feature_names

In [None]:
# This dataset is way too high-dimensional. Better do PCA:
pca = PCA(n_components=2)

# Maybe some original features were good, too?
selection = SelectKBest(k=2)

# Scaling the features to [0, 1] is often a reasonable default
scaler = MinMaxScaler(feature_range=(0, 1))

# Build estimator from PCA and Univariate selection:

combined_features = FeatureUnion([("pca", pca), ("univ_select", selection), ("normal", scaler)])

In [None]:
X.shape[1]

In [None]:
pca.fit_transform(X).shape[1]

In [None]:
selection.fit_transform(X, y).shape[1]

In [None]:
scaler.fit_transform(X).shape[1]

In [None]:
# Use combined features to transform dataset:
X_features = combined_features.fit(X, y).transform(X)
print("Combined space has", X_features.shape[1], "features")
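
As a sanity check, the output of a FeatureUnion should match the individual outputs stacked side by side. This sketch rebuilds the same three transformers on the full iris data; `union`, `union_features` and `stacked` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import MinMaxScaler

X_demo, y_demo = load_iris(return_X_y=True)

union = FeatureUnion([("pca", PCA(n_components=2)),
                      ("univ_select", SelectKBest(k=2)),
                      ("normal", MinMaxScaler(feature_range=(0, 1)))])
union_features = union.fit_transform(X_demo, y_demo)

# The same transformers applied one by one and concatenated column-wise
stacked = np.hstack([
    PCA(n_components=2).fit_transform(X_demo),
    SelectKBest(k=2).fit_transform(X_demo, y_demo),
    MinMaxScaler(feature_range=(0, 1)).fit_transform(X_demo),
])

# 2 PCA components + 2 selected features + 4 scaled features
print(union_features.shape, np.allclose(union_features, stacked))
```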

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
model = LogisticRegression()


# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('combined_features', combined_features),
                              ('model', model)
                             ], verbose = True)

# Preprocessing of training data, fit model 
my_pipeline.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)

# Evaluate the model
score = accuracy_score(y_valid, preds)
print('Accuracy Score:', score)

## FunctionTransformer: Building a transformer from an arbitrary callable

Constructs a transformer from an arbitrary callable.

A FunctionTransformer forwards its X (and optionally y) arguments to a user-defined function or function object and returns the result of this function. This is useful for stateless transformations such as taking the log of frequencies or doing custom scaling.
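
A minimal sketch of wrapping a stateless function is shown below, using `np.log1p` as the callable. The array `counts` is made-up data.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Wrap a plain NumPy function so that it can be used as a pipeline step
log_transformer = FunctionTransformer(np.log1p)

counts = np.array([[0.0, 1.0], [9.0, 99.0]])

# fit does nothing here; transform simply calls np.log1p on the input
logged = log_transformer.fit_transform(counts)
print(logged)
```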

In [None]:
from sklearn.preprocessing import FunctionTransformer
from sklearn.ensemble import RandomForestRegressor

dataset = pd.read_csv("data/melb_data.csv")  # same path as in the ColumnTransformer section
columns = dataset.columns.to_list()
columns.remove('Price')
X = dataset[columns]
y = dataset['Price']

X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                      random_state=0)

#Selecting the numerical columns
def columns_num(X):
    numerical_cols = [cname for cname in X.columns if  X[cname].dtype in ['int64', 'float64']]
    return X.loc[:,numerical_cols]

fill_na_transformer = Pipeline(steps=[ ('drop_cols', FunctionTransformer(columns_num, validate=False)),
                                       ('fill_na', SimpleImputer(strategy='most_frequent'))  ], verbose=True)


model = RandomForestRegressor(n_estimators=100, random_state=0)



# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', fill_na_transformer),
                              ('model', model)
                             ])

# Preprocessing of training data, fit model 
my_pipeline.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)

# Evaluate the model
#score = mean_absolute_error(y_valid, preds)
#print('MAE:', score)

In [None]:
fill_na_transformer.fit_transform(X) 

In [None]:
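# With validate=True the input is checked and converted to a numeric array before columns_num is called,
# so a DataFrame with text columns is expected to fail here
fill_na_transformer = Pipeline(steps=[ ('drop_cols', FunctionTransformer(columns_num, validate=True)),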
                                       ('fill_na', SimpleImputer(strategy='most_frequent'))  ])

In [None]:
fill_na_transformer.fit_transform(X) 

In [None]:
columns_num(X)
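
The same column selection can be sketched with pandas `select_dtypes`, which avoids listing dtype names by hand. The helper `select_numeric` and the DataFrame `toy` are hypothetical and stand in for `columns_num` and the Melbourne data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def select_numeric(frame):
    # Keep only the numeric columns of a DataFrame
    return frame.select_dtypes(include="number")

# Hypothetical data: two numeric columns with gaps and one text column
toy = pd.DataFrame({
    "rooms": [2.0, np.nan, 4.0],
    "distance": [2.5, 5.0, np.nan],
    "type": ["h", "u", "t"],
})

# validate is left at its default (False) so the function receives the DataFrame
toy_transformer = Pipeline(steps=[
    ("drop_cols", FunctionTransformer(select_numeric)),
    ("fill_na", SimpleImputer(strategy="most_frequent")),
])

filled = toy_transformer.fit_transform(toy)
print(filled.shape)
```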