Pipelines sequentially apply **a list of transforms** and a **final estimator**. Intermediate steps of the pipeline must be ‘transforms’, that is, they must implement fit and transform methods. The final estimator only needs to implement fit.


### Summary

* Pipeline of transforms with a final estimator.

* Sequentially apply a list of transforms and a final estimator. 

    * Intermediate steps of the pipeline must be ‘transforms’, that is, they must implement fit and transform methods. 
    * The final estimator only needs to implement fit. The transformers in the pipeline can be cached using the `memory` argument.

* The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters. 
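The `memory` argument mentioned above can be illustrated with a minimal sketch. The temporary cache directory, the step names, `max_iter=1000` and the `X_demo`/`y_demo` variables are assumptions made for this example and are not part of the notebook.

```python
from shutil import rmtree
from tempfile import mkdtemp

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X_demo, y_demo = load_iris(return_X_y=True)

# Temporary directory used as the transformer cache (illustrative choice)
cache_dir = mkdtemp()

cached_pipeline = Pipeline(
    steps=[
        ("scale", MinMaxScaler()),
        ("pca", PCA(n_components=2)),
        ("model", LogisticRegression(max_iter=1000)),
    ],
    memory=cache_dir,  # fitted transformers are cached in this directory
)

cached_pipeline.fit(X_demo, y_demo)
train_accuracy = cached_pipeline.score(X_demo, y_demo)
print("Training accuracy:", train_accuracy)

# Remove the cache directory once it is no longer needed
rmtree(cache_dir)
```

Caching is mostly useful when the same transformers are refitted many times with identical parameters and data, as happens during a grid search.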

In [1]:
import pandas as pd
import numpy as np

In [2]:
from sklearn import datasets
iris = datasets.load_iris()

X = iris.data
y = iris.target

from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                      random_state=0)

## Simple Pipeline

The simple pipeline is composed of the following steps:
- Transformation
    - Scaling values between 0 and 1
    - PCA (we keep 2 components), which reduces the number of columns to two
- Estimator
    - Logistic Regression

### Transformation


In [3]:
iris.feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [4]:
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import Normalizer

preprocessing_transformer = Pipeline(steps=[('scale_01', MinMaxScaler(feature_range=(0, 1))),
                                            ('PCA', PCA(n_components=2))])
# steps is a list of (label, transformer) pairs: a name and the operation to apply
# here two operations are applied in cascade

### Estimator

In [5]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(solver='lbfgs', multi_class='auto')

### Creating and evaluating the Pipeline

In [6]:
from sklearn.metrics import accuracy_score

# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessing_transformer', preprocessing_transformer),
                              ('model', model)
                             ], verbose = True)
# verbose prints the time taken by each step


# Preprocessing of training data, fit model 
my_pipeline.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)

# Evaluate the model
score = accuracy_score(y_valid, preds)
print('Accuracy Score:', score)

[Pipeline]  (step 1 of 2) Processing preprocessing_transformer, total=   0.0s
[Pipeline] ............. (step 2 of 2) Processing model, total=   0.0s
Accuracy Score: 0.8666666666666667
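Once a pipeline is fitted, its steps can be inspected by name. The sketch below rebuilds a pipeline with the same step names as above on the full iris data; the variable names and `max_iter=1000` are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X_demo, y_demo = load_iris(return_X_y=True)

inner = Pipeline(steps=[("scale_01", MinMaxScaler(feature_range=(0, 1))),
                        ("PCA", PCA(n_components=2))])
outer = Pipeline(steps=[("preprocessing_transformer", inner),
                        ("model", LogisticRegression(max_iter=1000))])
outer.fit(X_demo, y_demo)

# named_steps gives access to each fitted step, including nested ones
fitted_pca = outer.named_steps["preprocessing_transformer"].named_steps["PCA"]
print("Explained variance ratio:", fitted_pca.explained_variance_ratio_)

fitted_model = outer.named_steps["model"]
print("Coefficient matrix shape:", fitted_model.coef_.shape)
```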


### Analyzing the transformation

In [7]:
X_train

array([[6.4, 3.1, 5.5, 1.8],
       [5.4, 3. , 4.5, 1.5],
       [5.2, 3.5, 1.5, 0.2],
       [6.1, 3. , 4.9, 1.8],
       [6.4, 2.8, 5.6, 2.2],
       [5.2, 2.7, 3.9, 1.4],
       [5.7, 3.8, 1.7, 0.3],
       [6. , 2.7, 5.1, 1.6],
       [5.9, 3. , 4.2, 1.5],
       [5.8, 2.6, 4. , 1.2],
       [6.8, 3. , 5.5, 2.1],
       [4.7, 3.2, 1.3, 0.2],
       [6.9, 3.1, 5.1, 2.3],
       [5. , 3.5, 1.6, 0.6],
       [5.4, 3.7, 1.5, 0.2],
       [5. , 2. , 3.5, 1. ],
       [6.5, 3. , 5.5, 1.8],
       [6.7, 3.3, 5.7, 2.5],
       [6. , 2.2, 5. , 1.5],
       [6.7, 2.5, 5.8, 1.8],
       [5.6, 2.5, 3.9, 1.1],
       [7.7, 3. , 6.1, 2.3],
       [6.3, 3.3, 4.7, 1.6],
       [5.5, 2.4, 3.8, 1.1],
       [6.3, 2.7, 4.9, 1.8],
       [6.3, 2.8, 5.1, 1.5],
       [4.9, 2.5, 4.5, 1.7],
       [6.3, 2.5, 5. , 1.9],
       [7. , 3.2, 4.7, 1.4],
       [6.5, 3. , 5.2, 2. ],
       [6. , 3.4, 4.5, 1.6],
       [4.8, 3.1, 1.6, 0.2],
       [5.8, 2.7, 5.1, 1.9],
       [5.6, 2.7, 4.2, 1.3],
       [5.6, 2

In [8]:
transformed_Dataset = preprocessing_transformer.fit_transform(X_train)

In [9]:
#preprocessing_transformer.transform(X_valid)

In [10]:
transformed_Dataset

array([[ 3.92209763e-01,  4.84561075e-02],
       [ 9.04869356e-02, -8.95853266e-02],
       [-6.29042414e-01,  1.33227096e-01],
       [ 2.97394765e-01, -1.69364128e-02],
       [ 5.25821952e-01, -7.19394322e-02],
       [-8.81287220e-03, -2.16744469e-01],
       [-5.36677344e-01,  3.00755516e-01],
       [ 2.68771542e-01, -1.40752213e-01],
       [ 1.18143450e-01, -2.69059275e-02],
       [ 2.51091885e-02, -1.81673754e-01],
       [ 5.25692275e-01,  5.32019212e-02],
       [-6.94577142e-01, -3.59042800e-02],
       [ 5.43323143e-01,  1.04376017e-01],
       [-5.34922624e-01,  1.01994571e-01],
       [-6.15543324e-01,  2.31944792e-01],
       [-1.46419376e-01, -4.91909893e-01],
       [ 4.09356632e-01,  2.26626713e-02],
       [ 6.26859126e-01,  1.45225238e-01],
       [ 2.57239670e-01, -3.25774678e-01],
       [ 4.91330874e-01, -1.45418497e-01],
       [-3.11137441e-02, -2.39957062e-01],
       [ 7.51055857e-01,  1.48508623e-01],
       [ 2.30644832e-01,  1.25073917e-01],
       [-4.

In [11]:
type(transformed_Dataset) # by default, every transformer returns a NumPy array

numpy.ndarray

In [12]:
tra_df = pd.DataFrame(transformed_Dataset)

In [13]:
tra_df.shape

(120, 2)

In [14]:
tra_df.head()

Unnamed: 0,0,1
0,0.39221,0.048456
1,0.090487,-0.089585
2,-0.629042,0.133227
3,0.297395,-0.016936
4,0.525822,-0.071939


In [15]:
# Simple check

scaled = MinMaxScaler(feature_range=(0, 1))

scaled_X_train = scaled.fit_transform(X_train)

print(scaled_X_train[:4])

pcaed = PCA(n_components=2)
pca_X_train = pcaed.fit_transform(scaled_X_train)

[[0.58333333 0.45833333 0.75862069 0.70833333]
 [0.30555556 0.41666667 0.5862069  0.58333333]
 [0.25       0.625      0.06896552 0.04166667]
 [0.5        0.41666667 0.65517241 0.70833333]]


In [16]:
pca_X_train[:4]

array([[ 0.39220976,  0.04845611],
       [ 0.09048694, -0.08958533],
       [-0.62904241,  0.1332271 ],
       [ 0.29739477, -0.01693641]])
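The visual comparison of the first rows can also be done programmatically. This sketch repeats both computations on the same split and compares them with `np.allclose`; the variable names are chosen for the example only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, train_size=0.8, test_size=0.2, random_state=0
)

# Output of the two-step preprocessing pipeline
pipe_out = Pipeline(
    steps=[("scale_01", MinMaxScaler()), ("PCA", PCA(n_components=2))]
).fit_transform(X_tr)

# Same two operations applied by hand, one after the other
manual_out = PCA(n_components=2).fit_transform(MinMaxScaler().fit_transform(X_tr))

outputs_match = np.allclose(pipe_out, manual_out)
print("Pipeline and manual outputs match:", outputs_match)
```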

## ColumnTransformer: Managing different kinds of transformers on different columns
It applies a transformer only to the columns we choose.
For example:
- apply PCA to a single column only
- apply the one-hot encoder **only** to the text columns


Extracted and extended from a kaggle.com tutorial

Applies transformers to columns of an array or pandas DataFrame.

This estimator allows different columns or column subsets of the input to be transformed separately and the features generated by each transformer will be concatenated to form a single feature space. This is useful for heterogeneous or columnar data, to combine several feature extraction mechanisms or transformations into a single transformer.
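Before moving to the housing data, the idea can be sketched on a tiny hand-made DataFrame. The column names and values below are invented for illustration; `sparse_threshold=0` is used only so that the result is always a dense array.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy data: two numerical columns and one text column (hypothetical values)
toy = pd.DataFrame({
    "rooms": [2, 3, 4, 3],
    "distance": [2.5, 10.0, 7.5, 5.0],
    "type": ["h", "u", "h", "t"],
})

toy_transformer = ColumnTransformer(
    transformers=[
        ("scale", MinMaxScaler(), ["rooms", "distance"]),                # numerical columns
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["type"]),    # text column
    ],
    sparse_threshold=0,  # always return a dense array
)

toy_features = toy_transformer.fit_transform(toy)
print(toy_features.shape)
```

The two scaled columns and the three one-hot columns (one per distinct value of `type`) are concatenated side by side.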

In [17]:
dataset = pd.read_csv("data/melb_data.csv")
dataset.head(5)
# we want to predict the price from all the other fields
# goal: a pipeline that handles numerical fields one way and text fields another way

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,3/12/2016,2.5,3067.0,...,1.0,1.0,202.0,,,Yarra,-37.7996,144.9984,Northern Metropolitan,4019.0
1,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,4/02/2016,2.5,3067.0,...,1.0,0.0,156.0,79.0,1900.0,Yarra,-37.8079,144.9934,Northern Metropolitan,4019.0
2,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,4/03/2017,2.5,3067.0,...,2.0,0.0,134.0,150.0,1900.0,Yarra,-37.8093,144.9944,Northern Metropolitan,4019.0
3,Abbotsford,40 Federation La,3,h,850000.0,PI,Biggin,4/03/2017,2.5,3067.0,...,2.0,1.0,94.0,,,Yarra,-37.7969,144.9969,Northern Metropolitan,4019.0
4,Abbotsford,55a Park St,4,h,1600000.0,VB,Nelson,4/06/2016,2.5,3067.0,...,1.0,2.0,120.0,142.0,2014.0,Yarra,-37.8072,144.9941,Northern Metropolitan,4019.0


In [18]:
dataset.describe()

Unnamed: 0,Rooms,Price,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Lattitude,Longtitude,Propertycount
count,13580.0,13580.0,13580.0,13580.0,13580.0,13580.0,13518.0,13580.0,7130.0,8205.0,13580.0,13580.0,13580.0
mean,2.937997,1075684.0,10.137776,3105.301915,2.914728,1.534242,1.610075,558.416127,151.96765,1964.684217,-37.809203,144.995216,7454.417378
std,0.955748,639310.7,5.868725,90.676964,0.965921,0.691712,0.962634,3990.669241,541.014538,37.273762,0.07926,0.103916,4378.581772
min,1.0,85000.0,0.0,3000.0,0.0,0.0,0.0,0.0,0.0,1196.0,-38.18255,144.43181,249.0
25%,2.0,650000.0,6.1,3044.0,2.0,1.0,1.0,177.0,93.0,1940.0,-37.856822,144.9296,4380.0
50%,3.0,903000.0,9.2,3084.0,3.0,1.0,2.0,440.0,126.0,1970.0,-37.802355,145.0001,6555.0
75%,3.0,1330000.0,13.0,3148.0,3.0,2.0,2.0,651.0,174.0,1999.0,-37.7564,145.058305,10331.0
max,10.0,9000000.0,48.1,3977.0,20.0,8.0,10.0,433014.0,44515.0,2018.0,-37.40853,145.52635,21650.0


In [19]:
dataset['Price']

0        1480000.0
1        1035000.0
2        1465000.0
3         850000.0
4        1600000.0
           ...    
13575    1245000.0
13576    1031000.0
13577    1170000.0
13578    2500000.0
13579    1285000.0
Name: Price, Length: 13580, dtype: float64

In [None]:
#dataset = dataset[dataset['Price'].isnull()==False]

In [20]:
columns = dataset.columns.to_list()
columns

['Suburb',
 'Address',
 'Rooms',
 'Type',
 'Price',
 'Method',
 'SellerG',
 'Date',
 'Distance',
 'Postcode',
 'Bedroom2',
 'Bathroom',
 'Car',
 'Landsize',
 'BuildingArea',
 'YearBuilt',
 'CouncilArea',
 'Lattitude',
 'Longtitude',
 'Regionname',
 'Propertycount']

In [21]:
# remove the price from the columns and build the dataset:
if 'Price' in columns:
  columns.remove('Price')

X = dataset[columns]
y = dataset['Price']

In [22]:
X.info() # object = string

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13580 entries, 0 to 13579
Data columns (total 20 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Suburb         13580 non-null  object 
 1   Address        13580 non-null  object 
 2   Rooms          13580 non-null  int64  
 3   Type           13580 non-null  object 
 4   Method         13580 non-null  object 
 5   SellerG        13580 non-null  object 
 6   Date           13580 non-null  object 
 7   Distance       13580 non-null  float64
 8   Postcode       13580 non-null  float64
 9   Bedroom2       13580 non-null  float64
 10  Bathroom       13580 non-null  float64
 11  Car            13518 non-null  float64
 12  Landsize       13580 non-null  float64
 13  BuildingArea   7130 non-null   float64
 14  YearBuilt      8205 non-null   float64
 15  CouncilArea    12211 non-null  object 
 16  Lattitude      13580 non-null  float64
 17  Longtitude     13580 non-null  float64
 18  Region

In [None]:
#X = X[['Longtitude', 'Lattitude', 'YearBuilt', 'Car', 'Bathroom', 'Bedroom2', 'Rooms', 'Postcode']]
#X.isnull().any(axis=0)
#X.fillna(0, inplace=True)

In [23]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                      random_state=0)

We construct the full pipeline in three steps.
* Step 1: Define Preprocessing Steps
* Step 2: Define the Model
* Step 3: Create and Evaluate the Pipeline

### Step 1: Define Preprocessing Steps

Similar to how a pipeline bundles together preprocessing and modeling steps, we use the ColumnTransformer class to bundle together different preprocessing steps. The code below:
- imputes missing values in numerical data, and
- imputes missing values and applies a one-hot encoding to categorical data.

In [24]:
# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)

# build two lists with the names of the categorical and numerical columns
categorical_cols = [cname for cname in X_train.columns if
                    X_train[cname].nunique() < 10 and # nunique counts the distinct values in the column: a column with 10 or more distinct values
                    X_train[cname].dtype == "object"] # is not included in categorical_cols


# Select numerical columns
numerical_cols = [cname for cname in X_train.columns if 
                X_train[cname].dtype in ['int64', 'float64']]

In [25]:
categorical_cols

['Type', 'Method', 'Regionname']

In [26]:
numerical_cols

['Rooms',
 'Distance',
 'Postcode',
 'Bedroom2',
 'Bathroom',
 'Car',
 'Landsize',
 'BuildingArea',
 'YearBuilt',
 'Lattitude',
 'Longtitude',
 'Propertycount']

In [27]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer # transformer that replaces missing values (here with the most frequent value)
from sklearn.preprocessing import OneHotEncoder # transformer that turns each category value into its own column

# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='most_frequent')



# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))  # scikit-learn >= 1.2; older releases use sparse=False
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[ # a list of 3-element tuples: label, transformer, and the columns the transformer applies to
       ('num', numerical_transformer, numerical_cols),
       ('cat', categorical_transformer, categorical_cols)
    ])

### Step 2: Define the Model

In [30]:
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=100, random_state=0)

### Step 3: Create and Evaluate the Pipeline

In [31]:
from sklearn.metrics import mean_absolute_error

# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[
                              ('preprocessor', preprocessor),
                              ('model', model),
                             ])

# Preprocessing of training data, fit model 
my_pipeline.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_test)

# Evaluate the model with the mean absolute error: being off by about 160k on a house price is a poor estimate
score = mean_absolute_error(y_test, preds)
print('MAE:', score)


MAE: 161591.2226535872


In [None]:
# count the NaN predictions (an == comparison never matches NaN)
np.isnan(preds).sum()

In [None]:
from sklearn.metrics import r2_score

r2_score(y_test, preds)

In [None]:
pd.DataFrame({'label':y_test, 'preds':preds}).head(4)

### Parameter tuning

Parameters of the various steps are set by joining the step name and the parameter name with a double underscore (`__`); nested steps chain their names in the same way.
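The available parameter names can be listed with `get_params()` and changed with `set_params()`. The sketch below uses a reduced pipeline with the same step names as above; the column indices `[0, 1]` and the `demo_pipeline` name are placeholders for the example.

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

demo_pipeline = Pipeline(steps=[
    ("preprocessor", ColumnTransformer(
        transformers=[("num", SimpleImputer(strategy="most_frequent"), [0, 1])]
    )),
    ("model", RandomForestRegressor(n_estimators=100, random_state=0)),
])

# Every name containing "__" addresses a parameter of a (possibly nested) step
tunable = sorted(name for name in demo_pipeline.get_params() if "__" in name)
print(tunable[:10])

# The same names are accepted by set_params (and by GridSearchCV)
demo_pipeline.set_params(model__n_estimators=10,
                         preprocessor__num__strategy="median")
```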

In [32]:
# Example using a Grid Search
from sklearn.model_selection import GridSearchCV
# try the pipeline with different parameter values
parameters = {
    'model__n_estimators': [1,5,10], # stepname__parametername
    'preprocessor__num__strategy': ['most_frequent','constant'], # stepname__transformername__parametername
    'preprocessor__cat__imputer__strategy': ['most_frequent','constant'],
}

gs_clf = GridSearchCV(my_pipeline, parameters, scoring='neg_mean_absolute_error', cv=5, n_jobs=-1)

gs_clf.fit(X, y)


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('preprocessor',
                                        ColumnTransformer(transformers=[('num',
                                                                         SimpleImputer(strategy='most_frequent'),
                                                                         ['Rooms',
                                                                          'Distance',
                                                                          'Postcode',
                                                                          'Bedroom2',
                                                                          'Bathroom',
                                                                          'Car',
                                                                          'Landsize',
                                                                          'BuildingArea',
                                               

In [33]:
gs_clf.best_params_

{'model__n_estimators': 10,
 'preprocessor__cat__imputer__strategy': 'most_frequent',
 'preprocessor__num__strategy': 'most_frequent'}

In [34]:
gs_clf.best_score_

-190165.69738200435
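With `scoring='neg_mean_absolute_error'` the reported score is the negated MAE, so that a higher score is better; the value above therefore corresponds to a cross-validated MAE of about 190,166. The sketch below shows the conversion and how to list all tried combinations. It uses synthetic data from `make_regression` and a small grid, both assumptions made only to keep the example self-contained.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (not the Melbourne housing dataset)
X_syn, y_syn = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    {"n_estimators": [5, 10]},
    scoring="neg_mean_absolute_error",
    cv=3,
)
search.fit(X_syn, y_syn)

# Flip the sign to read the score as a plain MAE
best_mae = -search.best_score_
print("Best cross-validated MAE:", best_mae)

# One row per parameter combination, best first
results = pd.DataFrame(search.cv_results_)[["params", "mean_test_score", "rank_test_score"]]
print(results.sort_values("rank_test_score"))
```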

# ColumnTransformer variation:
The problem so far is that we lose the information in the columns that the preprocessing does not handle (such as Address).
### How can we keep them?
1. (Used in the past, no longer needed) apply an identity transformer that does nothing
2. Set the `remainder` parameter to `passthrough`, so that the columns not listed are passed straight through to the output
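The effect of `remainder` can be seen on a small hand-made DataFrame. The data and variable names are invented for illustration; the default value of `remainder` is `'drop'`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Toy data with missing values and one text column (hypothetical values)
toy = pd.DataFrame({
    "rooms": [2.0, np.nan, 4.0],
    "car": [1.0, 0.0, np.nan],
    "suburb": ["Abbotsford", "Richmond", "Abbotsford"],
})

# Default behaviour: columns not listed are dropped
dropping = ColumnTransformer(
    transformers=[("num", SimpleImputer(strategy="most_frequent"), ["rooms", "car"])]
)

# remainder='passthrough': columns not listed are appended unchanged
passing = ColumnTransformer(
    transformers=[("num", SimpleImputer(strategy="most_frequent"), ["rooms", "car"])],
    remainder="passthrough",
)

dropped_out = dropping.fit_transform(toy)
passed_out = passing.fit_transform(toy)
print(dropped_out.shape, passed_out.shape)
```

The passed-through text column is still text, so a model that expects numbers cannot consume this output directly.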


In [35]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='most_frequent')


# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[('num', numerical_transformer, numerical_cols)],
       remainder='passthrough')

In [36]:
# This generates an error,
# because the model also receives the remaining text columns as input


from sklearn.metrics import mean_absolute_error

# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[
                              ('preprocessor', preprocessor),
                              ('model', model),
                             ])

# Preprocessing of training data, fit model 
my_pipeline.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_test)

# Evaluate the model
score = mean_absolute_error(y_test, preds)
print('MAE:', score)


ValueError: could not convert string to float: 'St Kilda'

Remainder with estimator

In [37]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='most_frequent')


# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[('num', numerical_transformer, numerical_cols)],
       remainder=OneHotEncoder(handle_unknown='ignore', sparse_output=False)) # apply the OneHotEncoder to the remaining columns (scikit-learn >= 1.2; older releases use sparse=False)

In [38]:

from sklearn.metrics import mean_absolute_error

# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[
                              ('preprocessor', preprocessor),
                              ('model', model),
                             ])

# Preprocessing of training data, fit model 
my_pipeline.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_test)

# Evaluate the model
score = mean_absolute_error(y_test, preds)
print('MAE:', score)


KeyboardInterrupt: 
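One plausible explanation for the interrupted run (an assumption, since the notebook does not state the cause) is that the remainder now includes high-cardinality text columns such as Address, and one-hot encoding creates one output column per distinct value. The sketch below shows the effect on synthetic data; the column contents are invented.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

n_rows = 1000

# Synthetic data: "address" is unique per row, "type" has only three values
synthetic = pd.DataFrame({
    "address": [f"{i} Example St" for i in range(n_rows)],
    "type": ["h", "u", "t", "h"] * (n_rows // 4),
})

# Default (sparse) output keeps memory use low; only the shape is inspected here
encoded = OneHotEncoder(handle_unknown="ignore").fit_transform(synthetic)

print(synthetic.nunique().to_dict())
print(encoded.shape)
```

With a dense output, a matrix of this kind grows with the number of rows times the number of distinct values, which can make fitting a random forest on it very slow.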

Alternative techniques to indicate columns

In [40]:
X_test.columns.get_indexer(numerical_cols)

array([ 2,  7,  8,  9, 10, 11, 12, 13, 14, 16, 17, 19])

In [41]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='most_frequent')


# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[('num', numerical_transformer, X.columns.get_indexer(numerical_cols))]) # same behaviour as selecting the columns by name

In [42]:
preprocessor.fit_transform(X_train)


array([[ 1.0000000e+00,  5.0000000e+00,  3.1820000e+03, ...,
        -3.7859840e+01,  1.4498670e+02,  1.3240000e+04],
       [ 2.0000000e+00,  8.0000000e+00,  3.0160000e+03, ...,
        -3.7858000e+01,  1.4490050e+02,  6.3800000e+03],
       [ 3.0000000e+00,  1.2600000e+01,  3.0200000e+03, ...,
        -3.7798800e+01,  1.4482200e+02,  3.7550000e+03],
       ...,
       [ 4.0000000e+00,  6.7000000e+00,  3.0580000e+03, ...,
        -3.7735720e+01,  1.4497256e+02,  1.1204000e+04],
       [ 3.0000000e+00,  1.2000000e+01,  3.0730000e+03, ...,
        -3.7720570e+01,  1.4502615e+02,  2.1650000e+04],
       [ 4.0000000e+00,  6.4000000e+00,  3.0110000e+03, ...,
        -3.7794300e+01,  1.4488750e+02,  7.5700000e+03]])
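Another way to indicate columns, not used in this notebook, is `make_column_selector`, which picks columns by dtype (or by name pattern) when the transformer is fitted. The toy DataFrame below is invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "rooms": [2, 3, 4],
    "landsize": [202.0, np.nan, 134.0],
    "suburb": ["Abbotsford", "Richmond", "Abbotsford"],
})

# Callable that returns the names of all numerical columns of a DataFrame
selector = make_column_selector(dtype_include=np.number)
selected = selector(toy)
print(selected)

# The selector can be passed in place of an explicit column list
by_dtype = ColumnTransformer(
    transformers=[("num", SimpleImputer(strategy="most_frequent"), selector)]
)
imputed = by_dtype.fit_transform(toy)
print(imputed)
```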

Be careful!

In [43]:
# This code raises an error

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer


# Preprocessing for all dataset
simple=SimpleImputer(strategy='most_frequent')

# Preprocessing for numerical data
numerical_transformer = Normalizer()


# Bundle preprocessing for numerical
preprocessor = ColumnTransformer(
    transformers=[('num', numerical_transformer, numerical_cols)])

# Bundle preprocessing 
preprocessing_transformer = Pipeline(steps=[('simple', simple),
                                            ('preprocessor', preprocessor) # this fails: the columns are looked up by name, but SimpleImputer returns a NumPy array without column names; placing the preprocessor before the imputer would avoid this particular error
                                            ])

In [44]:
preprocessing_transformer.fit_transform(X_train)

ValueError: Specifying the columns using strings is only supported for pandas DataFrames

In [45]:
X_train.columns.get_indexer(numerical_cols)

array([ 2,  7,  8,  9, 10, 11, 12, 13, 14, 16, 17, 19])

In [46]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer


# Preprocessing for all dataset
simple=SimpleImputer(strategy='most_frequent')

# Preprocessing for numerical data
numerical_transformer = Normalizer()


# Bundle preprocessing for numerical
preprocessor = ColumnTransformer(
    transformers=[('num', numerical_transformer, X_train.columns.get_indexer(numerical_cols))])

# Bundle preprocessing 
preprocessing_transformer = Pipeline(steps=[('simple', simple),
                                            ('preprocessor', preprocessor)
                                            ])

In [47]:
preprocessing_transformer.fit_transform(X_train)

array([[ 7.26963825e-05,  3.63481912e-04,  2.31319889e-01, ...,
        -2.75227341e-03,  1.05400086e-02,  9.62500104e-01],
       [ 2.72783243e-04,  1.09113297e-03,  4.11357131e-01, ...,
        -5.16351401e-03,  1.97632142e-02,  8.70178547e-01],
       [ 5.72636606e-04,  2.40507375e-03,  5.76454183e-01, ...,
        -7.21499218e-03,  2.76434595e-02,  7.16750152e-01],
       ...,
       [ 3.39067054e-04,  5.67937315e-04,  2.59216763e-01, ...,
        -3.19873485e-03,  1.22888547e-02,  9.49726818e-01],
       [ 1.36582374e-04,  5.46329495e-04,  1.39905878e-01, ...,
        -1.71732166e-03,  6.60267194e-03,  9.85669464e-01],
       [ 4.77479959e-04,  7.63967935e-04,  3.59423039e-01, ...,
        -4.51150521e-03,  1.72952194e-02,  9.03630823e-01]])

## FeatureUnion: Applying multiple transformers in parallel
This is a third kind of composition.
It takes several transformers (or pipelines) and applies them to the dataset in parallel.

Concatenates results of multiple transformer objects.

This estimator applies a list of transformer objects in parallel to the input data, then concatenates the results. This is useful to combine several feature extraction mechanisms into a single transformer.

In [48]:
from sklearn.pipeline import FeatureUnion
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.feature_selection import SelectKBest
from sklearn.preprocessing import MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

In [49]:
from sklearn import datasets
iris = datasets.load_iris()

X = iris.data
y = iris.target

from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                      random_state=0)

In [50]:
iris.feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [51]:
# This dataset is way too high-dimensional. Better do PCA:
pca = PCA(n_components=2)

# Maybe some original features were good, too?
selection = SelectKBest(k=2) # selects the two best columns

# Scaling to [0, 1] is often a reasonable choice
scaler = MinMaxScaler(feature_range=(0, 1))

# Build estimator from PCA and Univariate selection:

combined_features = FeatureUnion([("pca", pca), ("univ_select", selection), ("normal", scaler)])

In [52]:
X.shape[1]

4

In [53]:
pca.fit_transform(X).shape[1]

2

In [54]:
selection.fit_transform(X, y).shape[1]

2

In [55]:
scaler.fit_transform(X).shape[1]

4

In [56]:
# Use combined features to transform dataset:
X_features = combined_features.fit(X, y).transform(X)
print("Combined space has", X_features.shape[1], "features")

Combined space has 8 features
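The 8 features are simply the outputs of the three transformers placed side by side (2 + 2 + 4). A minimal check, with variable names chosen for the example, compares the union with a manual `np.hstack`:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import MinMaxScaler

X_demo, y_demo = load_iris(return_X_y=True)

union = FeatureUnion([
    ("pca", PCA(n_components=2)),
    ("univ_select", SelectKBest(k=2)),
    ("normal", MinMaxScaler(feature_range=(0, 1))),
])
union_out = union.fit_transform(X_demo, y_demo)

# Same three transformers fitted independently and stacked column-wise
manual_out = np.hstack([
    PCA(n_components=2).fit_transform(X_demo),
    SelectKBest(k=2).fit_transform(X_demo, y_demo),
    MinMaxScaler(feature_range=(0, 1)).fit_transform(X_demo),
])

same = np.allclose(union_out, manual_out)
print(union_out.shape, same)
```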


In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
model = LogisticRegression()


# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('combined_features', combined_features),
                              ('model', model)
                             ], verbose = True)

# Preprocessing of training data, fit model 
my_pipeline.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)

# Evaluate the model
score = accuracy_score(y_valid, preds)
print('Accuracy Score:', score)

## FunctionTransformer: Constructs a transformer from an arbitrary callable.
It lets us define our own custom transformers.

A FunctionTransformer forwards its X argument to a user-defined function (or function object) and returns the result of this function. This is useful for stateless transformations such as taking the log of frequencies or doing custom scaling.
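As a minimal sketch of a FunctionTransformer, the example below wraps `np.log1p` and supplies `np.expm1` as its inverse. The choice of function and the three sample values are illustrative assumptions, not something the notebook does.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Wrap a plain NumPy function as a transformer; inverse_func enables inverse_transform
log_transformer = FunctionTransformer(np.log1p, inverse_func=np.expm1)

# Illustrative price-like values in a single column
prices = np.array([[85000.0], [903000.0], [9000000.0]])

logged = log_transformer.fit_transform(prices)
restored = log_transformer.inverse_transform(logged)

print(logged.ravel())
print(restored.ravel())
```

Because the wrapped function is stateless, `fit` learns nothing from the data; the transformer exists so that the function can be used as a pipeline step.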

In [None]:
from sklearn.preprocessing import FunctionTransformer
from sklearn.ensemble import RandomForestRegressor

dataset = pd.read_csv("data/melb_data.csv")  # same path as the earlier cell
columns = dataset.columns.to_list()
columns.remove('Price')
X = dataset[columns]
y = dataset['Price']

X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                      random_state=0)

#Selecting the numerical columns
def columns_num(X):
    numerical_cols = [cname for cname in X.columns if  X[cname].dtype in ['int64', 'float64']]
    return X.loc[:,numerical_cols]

fill_na_transformer = Pipeline(steps=[ ('drop_cols', FunctionTransformer(columns_num, validate=False)),
                                       ('fill_na', SimpleImputer(strategy='most_frequent'))  ], verbose=True)


model = RandomForestRegressor(n_estimators=100, random_state=0)



# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', fill_na_transformer),
                              ('model', model)
                             ])

# Preprocessing of training data, fit model 
my_pipeline.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)

# Evaluate the model
#score = mean_absolute_error(y_valid, preds)
#print('MAE:', score)

In [None]:
fill_na_transformer.fit_transform(X) 

In [None]:
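# With validate=True the input is validated and converted to a NumPy array before columns_num is called,
# so this variant is expected to fail on a DataFrame that contains text columns
fill_na_transformer = Pipeline(steps=[ ('drop_cols', FunctionTransformer(columns_num, validate=True)),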
                                       ('fill_na', SimpleImputer(strategy='most_frequent'))  ])

In [None]:
fill_na_transformer.fit_transform(X) 

In [None]:
columns_num(X)