# Pipeline

Le pipeline applicano una lista di trasformatori e un estimatore finale.

Gli step intermedi di  una pipeline devono sempre essere dei trasformatori, cioè devono implementare sempre  i metodi `fit` e `trasfrom`, l’estimatore finale dovrà solo possedere il metodo `fit`

l’obbiettivo della pipeline é quello di assemblare diversi passaggi che possono poi essere cross validati insieme mentre si scelgono i parametri differenti

In [1]:
import pandas as pd
import numpy as np

In [2]:
from sklearn import datasets
iris = datasets.load_iris()

X = iris.data
y = iris.target

from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                      random_state=0)


# Semplice Pipeline

una semplice pipeline potrebbe essere la seguente:

- **Trasformatori**
    - Scalare i valori tra 0 e 1(MinMaxScaler)
    - PCA tenendo due colonne
- **Estimatore**
    - Logistic Regression

In [3]:
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

preprocessing_transformer = Pipeline(steps=[('scale_01', MinMaxScaler(feature_range=(0, 1))),
                                            ('PCA', PCA(n_components=2))])


model = LogisticRegression(solver='lbfgs', multi_class='auto')


# pipeline
my_pipeline = Pipeline(steps=[('preprocessing_transformer', preprocessing_transformer),
                              ('model', model)
                              ], verbose = True)
# verbose stampa il tempo di eseguzione


# Preprocessing of training data, fit model
my_pipeline.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)

# Evaluate the model
score = accuracy_score(y_valid, preds)
print('Accuracy Score:', score)

[Pipeline]  (step 1 of 2) Processing preprocessing_transformer, total=   0.0s
[Pipeline] ............. (step 2 of 2) Processing model, total=   0.0s
Accuracy Score: 0.8666666666666667


Gli `steps` di una pipeline non sono altro che una lista di tuple con una label ( che possiamo scegliere il nome che vogliamo ) e l’operatore da applicare

# ColumnTrasformer:

Questo metodo permette di gestire differenti tipi di trasformatori per differenti tipi di colonne

un esempio sarebbe:

1. Applicare la PCA solo ad una colonna
2. Applicare il OHE a solo le colonne testuali

In [4]:
house = pd.read_csv("Data/zameen-updated.csv")
house.head()

Unnamed: 0,property_id,location_id,page_url,property_type,price,location,city,province_name,latitude,longitude,baths,area,purpose,bedrooms,date_added,agency,agent,Area Type,Area Size,Area Category
0,237062,3325,https://www.zameen.com/Property/g_10_g_10_2_gr...,Flat,10000000,G-10,Islamabad,Islamabad Capital,33.67989,73.01264,2,4 Marla,For Sale,2,02-04-2019,,,Marla,4.0,0-5 Marla
1,346905,3236,https://www.zameen.com/Property/e_11_2_service...,Flat,6900000,E-11,Islamabad,Islamabad Capital,33.700993,72.971492,3,5.6 Marla,For Sale,3,05-04-2019,,,Marla,5.6,5-10 Marla
2,386513,764,https://www.zameen.com/Property/islamabad_g_15...,House,16500000,G-15,Islamabad,Islamabad Capital,33.631486,72.926559,6,8 Marla,For Sale,5,07-17-2019,,,Marla,8.0,5-10 Marla
3,656161,340,https://www.zameen.com/Property/islamabad_bani...,House,43500000,Bani Gala,Islamabad,Islamabad Capital,33.707573,73.151199,4,2 Kanal,For Sale,4,04-05-2019,,,Kanal,2.0,1-5 Kanal
4,841645,3226,https://www.zameen.com/Property/dha_valley_dha...,House,7000000,DHA Defence,Islamabad,Islamabad Capital,33.492591,73.301339,3,8 Marla,For Sale,3,07-10-2019,Easy Property,Muhammad Junaid Ceo Muhammad Shahid Director,Marla,8.0,5-10 Marla


In [5]:
house['price']

0         10000000
1          6900000
2         16500000
3         43500000
4          7000000
            ...   
168441    26500000
168442    12500000
168443    27000000
168444    11000000
168445     9000000
Name: price, Length: 168446, dtype: int64

In [6]:
# possiamo togliere i due id che sono completamente inutili:
house = house.drop(columns=['property_id','location_id'])
house

Unnamed: 0,page_url,property_type,price,location,city,province_name,latitude,longitude,baths,area,purpose,bedrooms,date_added,agency,agent,Area Type,Area Size,Area Category
0,https://www.zameen.com/Property/g_10_g_10_2_gr...,Flat,10000000,G-10,Islamabad,Islamabad Capital,33.679890,73.012640,2,4 Marla,For Sale,2,02-04-2019,,,Marla,4.0,0-5 Marla
1,https://www.zameen.com/Property/e_11_2_service...,Flat,6900000,E-11,Islamabad,Islamabad Capital,33.700993,72.971492,3,5.6 Marla,For Sale,3,05-04-2019,,,Marla,5.6,5-10 Marla
2,https://www.zameen.com/Property/islamabad_g_15...,House,16500000,G-15,Islamabad,Islamabad Capital,33.631486,72.926559,6,8 Marla,For Sale,5,07-17-2019,,,Marla,8.0,5-10 Marla
3,https://www.zameen.com/Property/islamabad_bani...,House,43500000,Bani Gala,Islamabad,Islamabad Capital,33.707573,73.151199,4,2 Kanal,For Sale,4,04-05-2019,,,Kanal,2.0,1-5 Kanal
4,https://www.zameen.com/Property/dha_valley_dha...,House,7000000,DHA Defence,Islamabad,Islamabad Capital,33.492591,73.301339,3,8 Marla,For Sale,3,07-10-2019,Easy Property,Muhammad Junaid Ceo Muhammad Shahid Director,Marla,8.0,5-10 Marla
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
168441,https://www.zameen.com/Property/gulshan_e_maym...,House,26500000,Gadap Town,Karachi,Sindh,25.029909,67.137192,0,9.6 Marla,For Sale,6,07-18-2019,Al Shahab Enterprises,Shahmir,Marla,9.6,5-10 Marla
168442,https://www.zameen.com/Property/gadap_town_gul...,House,12500000,Gadap Town,Karachi,Sindh,25.017951,67.136393,0,8 Marla,For Sale,3,07-18-2019,Al Shahab Enterprises,Shahmir,Marla,8.0,5-10 Marla
168443,https://www.zameen.com/Property/gulshan_e_maym...,House,27000000,Gadap Town,Karachi,Sindh,25.015384,67.116330,0,9.6 Marla,For Sale,6,07-18-2019,Al Shahab Enterprises,Shahmir,Marla,9.6,5-10 Marla
168444,https://www.zameen.com/Property/gulshan_e_maym...,House,11000000,Gadap Town,Karachi,Sindh,25.013265,67.120818,0,7.8 Marla,For Sale,3,07-18-2019,Al Shahab Enterprises,Shahmir,Marla,7.8,5-10 Marla


In [7]:
# creiamo il dataset per prevedere il prezzo delle case
y = house['price']
X = house.drop(columns=['price'])
X

Unnamed: 0,page_url,property_type,location,city,province_name,latitude,longitude,baths,area,purpose,bedrooms,date_added,agency,agent,Area Type,Area Size,Area Category
0,https://www.zameen.com/Property/g_10_g_10_2_gr...,Flat,G-10,Islamabad,Islamabad Capital,33.679890,73.012640,2,4 Marla,For Sale,2,02-04-2019,,,Marla,4.0,0-5 Marla
1,https://www.zameen.com/Property/e_11_2_service...,Flat,E-11,Islamabad,Islamabad Capital,33.700993,72.971492,3,5.6 Marla,For Sale,3,05-04-2019,,,Marla,5.6,5-10 Marla
2,https://www.zameen.com/Property/islamabad_g_15...,House,G-15,Islamabad,Islamabad Capital,33.631486,72.926559,6,8 Marla,For Sale,5,07-17-2019,,,Marla,8.0,5-10 Marla
3,https://www.zameen.com/Property/islamabad_bani...,House,Bani Gala,Islamabad,Islamabad Capital,33.707573,73.151199,4,2 Kanal,For Sale,4,04-05-2019,,,Kanal,2.0,1-5 Kanal
4,https://www.zameen.com/Property/dha_valley_dha...,House,DHA Defence,Islamabad,Islamabad Capital,33.492591,73.301339,3,8 Marla,For Sale,3,07-10-2019,Easy Property,Muhammad Junaid Ceo Muhammad Shahid Director,Marla,8.0,5-10 Marla
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
168441,https://www.zameen.com/Property/gulshan_e_maym...,House,Gadap Town,Karachi,Sindh,25.029909,67.137192,0,9.6 Marla,For Sale,6,07-18-2019,Al Shahab Enterprises,Shahmir,Marla,9.6,5-10 Marla
168442,https://www.zameen.com/Property/gadap_town_gul...,House,Gadap Town,Karachi,Sindh,25.017951,67.136393,0,8 Marla,For Sale,3,07-18-2019,Al Shahab Enterprises,Shahmir,Marla,8.0,5-10 Marla
168443,https://www.zameen.com/Property/gulshan_e_maym...,House,Gadap Town,Karachi,Sindh,25.015384,67.116330,0,9.6 Marla,For Sale,6,07-18-2019,Al Shahab Enterprises,Shahmir,Marla,9.6,5-10 Marla
168444,https://www.zameen.com/Property/gulshan_e_maym...,House,Gadap Town,Karachi,Sindh,25.013265,67.120818,0,7.8 Marla,For Sale,3,07-18-2019,Al Shahab Enterprises,Shahmir,Marla,7.8,5-10 Marla


In [8]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 168446 entries, 0 to 168445
Data columns (total 17 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   page_url       168446 non-null  object 
 1   property_type  168446 non-null  object 
 2   location       168446 non-null  object 
 3   city           168446 non-null  object 
 4   province_name  168446 non-null  object 
 5   latitude       168446 non-null  float64
 6   longitude      168446 non-null  float64
 7   baths          168446 non-null  int64  
 8   area           168446 non-null  object 
 9   purpose        168446 non-null  object 
 10  bedrooms       168446 non-null  int64  
 11  date_added     168446 non-null  object 
 12  agency         124375 non-null  object 
 13  agent          124374 non-null  object 
 14  Area Type      168446 non-null  object 
 15  Area Size      168446 non-null  float64
 16  Area Category  168446 non-null  object 
dtypes: float64(3), int64(2), obje

In [9]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                    random_state=0)

# Primo metodo:
faremo 3 step:
1. Definire il processo degli steps
2. definire il modello
3. creare e valutare la pipeline

In [10]:
categorical_cols = [cname for cname in X_train.columns if
                    X_train[cname].nunique() < 10 and
                    X_train[cname].dtype == "object"]

numerical_cols = [cname for cname in X_train.columns if
                  X_train[cname].dtype in ['int64', 'float64']]

La condizione `X_train[cname].nunique() < 10` dice che se ci sono più di 10 elementi doppi nella colonna , la colona non sarà utilizzata come colonna d’interesse ( questo perché non ci fornisce nessuna conoscenza supplementare, es: un indirizzo di una via)

In [11]:
categorical_cols

['property_type', 'city', 'province_name', 'purpose', 'Area Type']

In [12]:
numerical_cols

['latitude', 'longitude', 'baths', 'bedrooms', 'Area Size']

In [13]:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# preprocessing per i valori numerici
numeric_trasformer = SimpleImputer(strategy="most_frequent")

# preprocessing per i valori testuali
categoric_trasform = Pipeline(steps=[
    ('imputer',SimpleImputer(strategy="most_frequent")),
    ('OHE',OneHotEncoder(handle_unknown='ignore',sparse=False))
    ])

# uniamo i due processi
preproces  = ColumnTransformer(
    transformers=[
        ('num',numeric_trasformer,numerical_cols),
        ('cat',categoric_trasform,categorical_cols)
    ])


Il metodo `SimpleImputer` é un trasformatore che toglie i valori nulli e in questo caso li sostituisce con il valore più frequente

il parametro `trasformers` é una lista di tuple di tre elem:

1. etichetta
2. operatore
3. colonne a cui applicare l’operatore( che posso essere sia gli indici che il nome)

### 2 step:

il modello che usiamo é un `RandomForestRegressor`

In [14]:
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators=50,random_state=0)

### 3 step:

In [19]:
from  sklearn.metrics import  mean_absolute_error
from math import sqrt

# creiamo la pipeline:
my_pipeline = Pipeline(steps=[
                        ('preprocessor',preproces),
                        ('model',model),
                        ])

my_pipeline.fit(X_train,y_train)

pred = my_pipeline.predict(X_test)

# valutiamo il modello
score = sqrt(mean_absolute_error(y_test,pred))
print("MAE: ", score)

  mode = stats.mode(array)


MAE:  1760.0313832423517


### Tuning dei parametri:

é possibile fare il tuning di vari parametri usando il nome dato per etichetta separato da `__` :

In [21]:
from sklearn.model_selection import  GridSearchCV

params = {
    'model__n_estimators': [20,50,75],
    'preprocessor__num__strategy': ['most_frequent','constant'],
    'preprocessor__cat__imputer__strategy': ['most_frequent','costant'],
}

gs = GridSearchCV(my_pipeline, params, scoring='neg_mean_absolute_error', cv=5, n_jobs=-1)

gs.fit(X,y)

  mode = stats.mode(array)
  mode = stats.mode(array)
  mode = stats.mode(array)
  mode = stats.mode(array)
  mode = stats.mode(array)
  mode = stats.mode(array)
  mode = stats.mode(array)
  mode = stats.mode(array)
  mode = stats.mode(array)
  mode = stats.mode(array)
  mode = stats.mode(array)
  mode = stats.mode(array)
  mode = stats.mode(array)
  mode = stats.mode(array)
  mode = stats.mode(array)
  mode = stats.mode(array)
  mode = stats.mode(array)
  mode = stats.mode(array)
  mode = stats.mode(array)
  mode = stats.mode(array)
  mode = stats.mode(array)
  mode = stats.mode(array)
  mode = stats.mode(array)
  mode = stats.mode(array)
  mode = stats.mode(array)
  mode = stats.mode(array)
  mode = stats.mode(array)
  mode = stats.mode(array)
  mode = stats.mode(array)
  mode = stats.mode(array)
30 fits failed out of a total of 60.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by settin

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('preprocessor',
                                        ColumnTransformer(transformers=[('num',
                                                                         SimpleImputer(strategy='most_frequent'),
                                                                         ['latitude',
                                                                          'longitude',
                                                                          'baths',
                                                                          'bedrooms',
                                                                          'Area '
                                                                          'Size']),
                                                                        ('cat',
                                                                         Pipeline(steps=[('imputer',
                                          

In [22]:
gs.best_estimator_

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  SimpleImputer(strategy='most_frequent'),
                                                  ['latitude', 'longitude',
                                                   'baths', 'bedrooms',
                                                   'Area Size']),
                                                 ('cat',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('OHE',
                                                                   OneHotEncoder(handle_unknown='ignore',
                                                                                 sparse=False))]),
                                                  ['property_type', 'city',
      

In [23]:
gs.best_score_

-3259086.925025675

# 2 Metodo:

Il problema con il primo metodo e che perdi le informazioni che scarti nella fase  del `ColumnTrasformer` .

Per evitare questo si utilizza un parametro del `ColumnTrasformer`  chiamato `remainder` che possiede tre possibile soluzione:

1. `passthrough` le colonne che non vengono interessata dal Trasformer vengono passate direttamente all’output
2. specificare un trasformatore di  ********default******** che viene applicato nelle colonne che non vengono interessate
3. Il default che elimina le colonne non interessante

In [None]:
numerical_transformer = SimpleImputer(strategy='most_frequent')

preproces = ColumnTransformer(transformers=
                           [('num',numerical_transformer,numerical_cols)],
                           remainder=OneHotEncoder(handle_unknown='ignore',sparse=False))

# creiamo la pipeline:
my_pipeline = Pipeline(steps=[
    ('preprocessor',preproces),
    ('model',model),
])

my_pipeline.fit(X_train,y_train)

pred = my_pipeline.predict(X_test)

# valutiamo il modello
score = sqrt(mean_absolute_error(y_test,pred))
print("MAE: ", score)

  mode = stats.mode(array)


# Metodo alternativo per indicare le colonne

Il problema che se indichiamo le colonne con il loro nome, e il trasformatore é dopo un altro invece di essere prima, questo causerà un problema.

Per risolverlo si utilizza:

In [16]:
X_test.columns.get_indexer(numerical_cols).tolist()

[5, 6, 7, 10, 15]

In [19]:
from sklearn.preprocessing import Normalizer
from sklearn.compose import ColumnTransformer

# preprocessing  per tutto il dataset
simple = SimpleImputer(strategy="most_frequent")

norma = Normalizer()


prepro = ColumnTransformer(transformers=[('num', norma, X_train.columns.get_indexer(numerical_cols))])

prep_transfor = Pipeline(steps=[('simple',simple),('prep',prepro)])

prep_transfor.fit_transform(X_train)

array([[0.39023229, 0.91889088, 0.03706551, 0.03706551, 0.02471034],
       [0.4163612 , 0.90430029, 0.        , 0.03712957, 0.08663567],
       [0.34392081, 0.92984375, 0.09705468, 0.08318972, 0.02772991],
       ...,
       [0.3941204 , 0.91487207, 0.        , 0.        , 0.08762535],
       [0.34751909, 0.93503173, 0.02777824, 0.02777824, 0.0583343 ],
       [0.39034891, 0.9066204 , 0.03721898, 0.02481265, 0.15383843]])

# FeatureUnion:

É una terza tipo di pipeline che prende delle altre pipeline e applica al dataset queste pipeline in parallelo:

In [22]:
from sklearn.pipeline import FeatureUnion
from sklearn import datasets
iris = datasets.load_iris()

X = iris.data
y = iris.target

from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                      random_state=0)

In [23]:
from sklearn.feature_selection import SelectKBest

pca = PCA(n_components=2)

selection =  SelectKBest(k=2)

scaler = MinMaxScaler(feature_range=(0,1))

combine_feature = FeatureUnion([("pca",pca), ("select",selection), ('normal',scaler)])

4

Il metodo `SelectKBest()` permette di estrarre solo le migliori k colonne.

Una cosa da osservare:

In [31]:
X.shape[1]

4

In [24]:
pca.fit_transform(X).shape[1]

2

In [27]:
selection.fit_transform(X,y).shape[1]

2

In [28]:
scaler.fit_transform(X).shape[1]

4

In [29]:
combine_feature.fit_transform(X,y).shape[1]

8

In [30]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
model = LogisticRegression()


# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('combined_features', combine_feature),
                              ('model', model)
                              ], verbose = True)

# Preprocessing of training data, fit model
my_pipeline.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)

# Evaluate the model
score = accuracy_score(y_valid, preds)
print('Accuracy Score:', score)

[Pipeline] . (step 1 of 2) Processing combined_features, total=   0.0s
[Pipeline] ............. (step 2 of 2) Processing model, total=   0.0s
Accuracy Score: 1.0


# FunctionTrasformer:

É possibile creare dei trasformatori personalizzati:
Teneto conto però che i trasformer ritornano degli array non dei DataFrame

In [34]:
from sklearn.preprocessing import FunctionTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

dataset = pd.read_csv("Data/zameen-updated.csv")
columns = dataset.columns.to_list()
columns.remove('price')
X = dataset[columns]
y = dataset['price']

X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                      random_state=0)

#Selecting the numerical columns -> toglie le colonne che sono di tipo testuali, mantiene solo le numeriche
def columns_num(X):
    numerical_cols = [cname for cname in X.columns if  X[cname].dtype in ['int64', 'float64']]
    return X.loc[:,numerical_cols]

fill_na_transformer = Pipeline(steps=[ ('drop_cols', FunctionTransformer(columns_num, validate=False)), # input viene lasciato così come e quindi permette di avviare la funzione ( perchè se validate=True renderebbe l'input un array monodimensionale
                                       ('fill_na', SimpleImputer(strategy='most_frequent'))  ], verbose=True)


model = RandomForestRegressor(n_estimators=5, random_state=0)



# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', fill_na_transformer),
                              ('model', model)
                              ])

# Preprocessing of training data, fit model
my_pipeline.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)

# Evaluate the model
score = mean_absolute_error(y_valid, preds)
print('MAE:', score)

[Pipeline] ......... (step 1 of 2) Processing drop_cols, total=   0.0s
[Pipeline] ........... (step 2 of 2) Processing fill_na, total=   0.1s


  mode = stats.mode(array)


MAE: 8392568.661241287
