Pipeline of transforms with a final estimator.

Sequentially apply a list of transforms and a final estimator. Intermediate steps of the pipeline must be ‘transforms’, that is, they must implement fit and transform methods. The final estimator only needs to implement fit. The fitted transformers in the pipeline can be cached using the memory argument.
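
As a rough illustration of the memory argument, the sketch below caches fitted transformers in a temporary directory. The synthetic data, the step names, and the variable names (`X_demo`, `cached_pipe`, `cache_dir`) are assumptions made for this example only, not part of the notebook.

```python
from shutil import rmtree
from tempfile import mkdtemp

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data, only for this illustration
X_demo, y_demo = make_classification(n_samples=100, random_state=0)

cache_dir = mkdtemp()  # temporary directory used as the cache location
try:
    cached_pipe = Pipeline(
        [("pca", PCA(n_components=5)), ("clf", LogisticRegression())],
        memory=cache_dir,  # fitted transformers are stored here and reused
    )
    cached_pipe.fit(X_demo, y_demo)
    n_predictions = cached_pipe.predict(X_demo).shape[0]
finally:
    rmtree(cache_dir)  # remove the cache once it is no longer needed
```

Caching is mostly useful when a transformer is expensive to fit and the same pipeline is fitted repeatedly, for example during a grid search.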

The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters. For this, it enables setting parameters of the various steps using their names and the parameter name separated by a '__', as in the example below. A step’s estimator may be replaced entirely by setting the parameter with its name to another estimator, or a transformer removed by setting it to 'passthrough' or None.
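
The naming scheme described above can be sketched with set_params. The step names ("scaler", "pca", "clf") and the parameter values here are arbitrary choices for illustration.

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("clf", LogisticRegression()),
])

# <step name>__<parameter name> reaches a parameter of a single step
demo_pipe.set_params(clf__C=0.5, pca__n_components=3)
c_value = demo_pipe.get_params()["clf__C"]

# Setting a step name to "passthrough" skips that transformer
demo_pipe.set_params(scaler="passthrough")

# Setting a step name to another estimator replaces the step entirely
demo_pipe.set_params(clf=SVC())
```

The same double-underscore keys are what a parameter grid uses when the whole pipeline is tuned with a search object.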

[https://www.youtube.com/watch?v=HZ9MUzCRlzI&ab_channel=KrishNaik]

class sklearn.pipeline.Pipeline(steps, *, memory=None, verbose=False)

[https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html]

In [1]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

## StandardScaler is a transformer and LogisticRegression is an estimator

In [2]:
steps = [("standard_scaler", StandardScaler()),
        ("classifier", LogisticRegression())]

print(steps)

[('standard_scaler', StandardScaler()), ('classifier', LogisticRegression())]


In [3]:
Pipeline(steps)

Pipeline(steps=[('standard_scaler', StandardScaler()),
                ('classifier', LogisticRegression())])

In [4]:
pipe = Pipeline(steps)

In [5]:
# Now we are going to visualize the pipeline

from sklearn import set_config
set_config(display="diagram")

In [6]:
pipe

In [7]:
## create dataset
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000)

In [8]:
X.shape, y.shape

((1000, 20), (1000,))

In [9]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, test_size = 0.3)

In [10]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((700, 20), (300, 20), (700,), (300,))

In [11]:
X_train

array([[-1.09684136, -0.2438515 ,  0.21127497, ...,  0.53861212,
        -0.75666194, -0.11702523],
       [ 1.08391049,  1.02214624,  2.88561218, ...,  1.9927919 ,
        -1.09011547,  2.06013994],
       [ 0.61230331, -0.67302279,  1.14870126, ...,  1.271442  ,
         0.56955592, -1.02641368],
       ...,
       [-1.04161287, -1.15185711,  1.26601888, ...,  1.04374367,
         0.19326706,  0.13513589],
       [ 0.80747217, -1.3002386 ,  0.11236367, ...,  1.73706923,
        -0.92800346,  0.68595133],
       [ 2.30960466, -0.68692388,  0.92239711, ...,  1.5269985 ,
        -0.22134621,  0.88897436]])

In [12]:
# No need to scale the data separately with StandardScaler: fit() first fits the scaler and
# transforms X_train inside the pipeline, then fits the logistic regression on the scaled data.
pipe.fit(X_train, y_train)

In [13]:
y_pred = pipe.predict(X_test)  # X_test is only transformed with the already-fitted scaler (no re-fitting), then classified

In [14]:
y_pred

array([1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0,
       0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,
       1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1,
       0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0,
       1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0,
       1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1,
       1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0,
       1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0,
       1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1,
       1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1,
       1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0])
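
To make the fit/predict behaviour concrete, the sketch below compares a pipeline with the equivalent manual steps on a small synthetic dataset. The data and the names (`X_tr`, `manual_pred`, and so on) are assumptions for this example and are kept separate from the notebook's variables.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=200, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# Pipeline version: scaling and classification in one object
demo_pipe = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression())])
demo_pipe.fit(X_tr, y_tr)
pipe_pred = demo_pipe.predict(X_te)

# Manual version: fit the scaler on the training data only,
# then reuse it (transform, not fit) on the test data
scaler = StandardScaler().fit(X_tr)
clf = LogisticRegression().fit(scaler.transform(X_tr), y_tr)
manual_pred = clf.predict(scaler.transform(X_te))

same = np.array_equal(pipe_pred, manual_pred)
print(same)
```

Both versions should produce the same predictions; the pipeline simply removes the risk of forgetting a step or fitting the scaler on the test data.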

This time we will combine pre-processing steps. Let's see how we can perform this using pipeline.

Standard Scaling ---> PCA ---> Estimator

In [15]:
from sklearn.decomposition import PCA
from sklearn.svm import SVC

In [16]:
steps = [("standard_scaler", StandardScaler()), 
        ("pca", PCA(n_components = 3)),
        ("svc", SVC())]

In [17]:
steps

[('standard_scaler', StandardScaler()),
 ('pca', PCA(n_components=3)),
 ('svc', SVC())]

In [18]:
pipe = Pipeline(steps)
print(pipe)

Pipeline(steps=[('standard_scaler', StandardScaler()),
                ('pca', PCA(n_components=3)), ('svc', SVC())])


In [19]:
pipe

In [20]:
# pipe['standard_scaler'].fit_transform(X_train)   # To check that a particular step of the pipeline works correctly,
                                                # access it by its name (the key given in `steps`)
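
Besides the name lookup shown in the comment above, a step can also be reached by position or through named_steps, and a slice gives a sub-pipeline. The following is a small sketch on synthetic data; the variable names are made up for the example.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_demo, y_demo = make_classification(n_samples=100, random_state=0)

demo_pipe = Pipeline([
    ("standard_scaler", StandardScaler()),
    ("pca", PCA(n_components=3)),
    ("svc", SVC()),
])
demo_pipe.fit(X_demo, y_demo)

by_name = demo_pipe["pca"]              # lookup by step name
by_attr = demo_pipe.named_steps["pca"]  # same object via named_steps
by_pos = demo_pipe[1]                   # same object via position

# Slicing returns a sub-pipeline; here everything except the final estimator,
# which is handy for inspecting the data the estimator actually receives
X_reduced = demo_pipe[:-1].transform(X_demo)
print(X_reduced.shape)  # (100, 3)
```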

In [21]:
pipe.fit(X_train, y_train)

In [22]:
pipe.predict(X_test)

array([0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0,
       1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1,
       1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1,
       1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0,
       1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1,
       0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1,
       1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1,
       1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0])

Next, we look at an example that uses ColumnTransformer to apply different preprocessing to numerical and categorical columns.

In [23]:
from sklearn.impute import SimpleImputer  # numerical variable imputing
import numpy as np

In [24]:
numeric_processor = Pipeline(steps = [("imputation_mean", SimpleImputer(missing_values=np.nan, strategy="mean")),
                    ("standard_scaler", StandardScaler())])

                    # this pipeline handles the numerical columns: mean imputation, then standard scaling

In [25]:
numeric_processor
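
As a quick check of what mean imputation does, here is a minimal sketch on a made-up single-column array (the values are not from the notebook's data).

```python
import numpy as np
from sklearn.impute import SimpleImputer

values = np.array([[1.0], [np.nan], [3.0]])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer.fit_transform(values)

# The missing entry is replaced by the mean of the observed values, (1 + 3) / 2
print(filled.ravel())  # [1. 2. 3.]
```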

In [26]:
from sklearn.preprocessing import OneHotEncoder

categorical_processor = Pipeline(steps = [("imputation_constant", SimpleImputer(fill_value= "missing", strategy="constant")),
                        ("onehot", OneHotEncoder(handle_unknown = "ignore"))])

In [27]:
categorical_processor
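
The handle_unknown="ignore" setting matters when the test data contains a category that was not present during fitting. A small sketch with made-up category labels:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(np.array([["B"], ["C"], ["D"]]))

# A category seen during fit gets a single 1 in its column
seen = encoder.transform(np.array([["C"]])).toarray()

# An unseen category is encoded as all zeros instead of raising an error
unseen = encoder.transform(np.array([["Z"]])).toarray()

print(seen)    # [[0. 1. 0.]]
print(unseen)  # [[0. 0. 0.]]
```

With the default handle_unknown="error", the second transform call would raise instead.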

Now we are going to combine both of these pipelines.

In [28]:
from sklearn.compose import ColumnTransformer

preprocessor=ColumnTransformer(
    [("categorical",categorical_processor,["gender","City"]),
    ("numerical",numeric_processor,["age","height"])])

In [29]:
preprocessor
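
To see what such a ColumnTransformer produces, the sketch below applies the same kind of preprocessing to a tiny, hand-made DataFrame with a few missing values. The data and the names `toy` and `demo_preprocessor` are assumptions for illustration; only the column names mirror the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; object dtype keeps the strings as plain Python objects
toy = pd.DataFrame({
    "gender": pd.Series(["M", "F", np.nan, "F"], dtype=object),
    "City": pd.Series(["D", "C", "R", "C"], dtype=object),
    "age": [23, 19, np.nan, 24],
    "height": [170, 154, 148, np.nan],
})

numeric = Pipeline([
    ("imputation_mean", SimpleImputer(strategy="mean")),
    ("standard_scaler", StandardScaler()),
])
categorical = Pipeline([
    ("imputation_constant", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

demo_preprocessor = ColumnTransformer([
    ("categorical", categorical, ["gender", "City"]),
    ("numerical", numeric, ["age", "height"]),
])

transformed = demo_preprocessor.fit_transform(toy)

# 3 gender columns (F, M, missing) + 3 City columns (C, D, R) + 2 numeric columns
print(transformed.shape)  # (4, 8)
```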

**Making final pipeline**

In [30]:
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(preprocessor, LogisticRegression())

In [31]:
pipe
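
Unlike Pipeline, make_pipeline does not take step names; it derives them from the lowercased class names. A minimal sketch, using a scaler in place of the preprocessor to keep it short:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

auto_named = make_pipeline(StandardScaler(), LogisticRegression())

step_names = [name for name, _ in auto_named.steps]
print(step_names)  # ['standardscaler', 'logisticregression']
```

These generated names are the ones to use in double-underscore parameter keys, for example logisticregression__C.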

In [33]:
import pandas as pd
df = pd.read_csv(r"C:\Users\USER\Desktop\Project\Brushing Up Machine Learning\pipeline_dataset.csv")
df.head()

  gender City  age  height  passed
0      M    D   23     170       1
1      F    C   19     154       1
2      F    R   21     148       0
3      M    B   20     160       1
4      F    K   24     151       1


In [34]:
X = df.iloc[:, 0:4]  # features: gender, City, age, height
y = df.iloc[:, 4]    # target: passed

In [35]:
X.head()

  gender City  age  height
0      M    D   23     170
1      F    C   19     154
2      F    R   21     148
3      M    B   20     160
4      F    K   24     151


In [36]:
y.head()

0    1
1    1
2    0
3    1
4    1
Name: passed, dtype: int64

In [37]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 3, test_size = 0.3)

In [38]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((3, 4), (2, 4), (3,), (2,))

In [39]:
pipe.fit(X_train, y_train)

In [40]:
y_pred = pipe.predict(X_test)

In [41]:
y_pred

array([1, 1], dtype=int64)

In [42]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           1       1.00      1.00      1.00         2

    accuracy                           1.00         2
   macro avg       1.00      1.00      1.00         2
weighted avg       1.00      1.00      1.00         2



In [43]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

In [44]:
import seaborn as sns

df = sns.load_dataset('tips')
df.head()

   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4


In [55]:
X = df.iloc[:, 1:]  # features: tip, sex, smoker, day, time, size
y = df.iloc[:, 0]   # target: total_bill

In [56]:
X.head()

    tip     sex smoker  day    time  size
0  1.01  Female     No  Sun  Dinner     2
1  1.66    Male     No  Sun  Dinner     3
2  3.50    Male     No  Sun  Dinner     3
3  3.31    Male     No  Sun  Dinner     2
4  3.61  Female     No  Sun  Dinner     4


In [57]:
y.head()

0    16.99
1    10.34
2    21.01
3    23.68
4    24.59
Name: total_bill, dtype: float64

In [58]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size = 0.3)

In [59]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((170, 6), (74, 6), (170,), (74,))

In [60]:
numeric_processor = Pipeline(steps = [("imputation_mean", SimpleImputer(missing_values=np.nan, strategy="mean")),
                            ("scaler", StandardScaler())])   

In [61]:
categorical_processor = Pipeline(steps = [("imputation_constant", SimpleImputer(fill_value="missing", strategy= "constant")),
                                ("onehot", OneHotEncoder(handle_unknown="ignore"))])

In [65]:
from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer(
    [("categorical", categorical_processor, ["sex", "smoker", "day", "time"]),
    ("numerical", numeric_processor, ["tip", "size"]),
    ]
)

In [66]:
pipe = Pipeline(steps=[("preprocessor", preprocessor),
                ("regressor", RandomForestRegressor())])

In [67]:
from sklearn import set_config

set_config(display="diagram")

In [68]:
pipe

In [69]:
pipe.fit(X_train, y_train)

In [70]:
pipe.predict(X_test)

array([17.8572    , 13.61934952, 21.4211    , 28.67476   , 13.04691262,
       14.14390952, 16.186     , 16.1605069 , 21.8257    , 21.443     ,
       19.17306667, 14.3143    , 10.89408   , 14.14390952, 11.59153095,
       15.4882    , 21.98434   , 19.563     , 14.717585  , 28.4399    ,
       20.6526    , 18.8079    , 20.0389    , 14.3143    , 23.8965    ,
       15.94487   , 13.46413333, 27.4667    , 21.4211    , 24.702     ,
       22.9147    , 14.6801    , 19.782     , 18.37893333, 20.7461    ,
       22.2433    , 12.9969    , 28.9715    , 19.67635   , 14.12793833,
       13.1534    , 11.92732171, 16.13089   , 15.7453    , 14.42991905,
       13.8564    , 19.6995    , 17.7591    , 11.45245   , 16.712675  ,
       14.33768   , 20.02703333, 26.5003    , 14.16387667, 21.6238    ,
       12.70275167, 26.73      , 12.4921    , 19.36401667, 33.44638667,
       31.196     , 19.1865    , 26.5362    , 12.30998857, 13.09687333,
       18.7556    , 14.43088   , 14.8506    , 30.831     , 21.10

In [71]:
import warnings
warnings.filterwarnings('ignore')

In [72]:
#hyperparameter tuning

param_grid = {
    "regressor__n_estimators": [100, 200, 500],
    "regressor__max_features": [1.0, "log2", "sqrt"],  # "auto" was removed in scikit-learn 1.3; 1.0 (all features) is its equivalent for regressors
    "regressor__max_depth": [4, 5, 6, 7, 8]
}

grid_search = GridSearchCV(param_grid=param_grid, estimator=pipe, n_jobs=-1)
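
To get a feel for the cost of this search, the number of parameter combinations can be counted with ParameterGrid. The sketch below uses a grid of the same shape as the one above; `demo_grid` is a name made up for the example.

```python
from sklearn.model_selection import ParameterGrid

demo_grid = {
    "regressor__n_estimators": [100, 200, 500],
    "regressor__max_features": [1.0, "log2", "sqrt"],
    "regressor__max_depth": [4, 5, 6, 7, 8],
}

# 3 * 3 * 5 parameter combinations
n_candidates = len(ParameterGrid(demo_grid))

# GridSearchCV uses 5-fold cross-validation when cv is not given,
# so every candidate is fitted 5 times (plus one final refit on all data)
n_fits = n_candidates * 5

print(n_candidates, n_fits)  # 45 225
```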

In [73]:
grid_search.fit(X_train, y_train)

In [74]:
grid_search.best_params_

{'regressor__max_depth': 5,
 'regressor__max_features': 'sqrt',
 'regressor__n_estimators': 100}

In [75]:
grid_search.best_estimator_
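
With the default refit=True, best_estimator_ is already a pipeline refitted on the full training data with the best parameters, so it can be used for prediction directly. The sketch below shows this on a small synthetic regression problem; the data, the reduced grid, and the `demo_` names are assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_regression(
    n_samples=120, n_features=5, noise=10.0, random_state=0
)

demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", RandomForestRegressor(random_state=0)),
])
demo_grid = {
    "regressor__n_estimators": [10, 20],
    "regressor__max_depth": [3, 5],
}

demo_search = GridSearchCV(demo_pipe, param_grid=demo_grid, cv=3)
demo_search.fit(X_demo, y_demo)

# The refitted best pipeline; no need to rebuild it by hand
best_pipe = demo_search.best_estimator_

# The search object delegates predict() to best_estimator_
preds_from_search = demo_search.predict(X_demo[:5])
preds_from_best = best_pipe.predict(X_demo[:5])
print(np.allclose(preds_from_search, preds_from_best))
```

Rebuilding the pipeline manually from best_params_ is still a valid approach, though the forest will generally differ slightly from best_estimator_ unless random_state is fixed.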

In [76]:
pipe = Pipeline(steps=[("preprocessor", preprocessor),
                ("regressor", RandomForestRegressor(max_depth=5, n_estimators=100, max_features="sqrt"))])

In [77]:
pipe.fit(X_train, y_train)

In [79]:
y_pred = pipe.predict(X_test)

In [80]:
y_pred

array([18.29189307, 14.95040684, 21.28242534, 32.37069579, 13.66804027,
       19.2273427 , 16.39544248, 15.05798717, 20.31553215, 17.44045535,
       18.85161674, 14.84817245, 12.27899523, 19.2273427 , 12.25035381,
       16.26690019, 20.81889965, 20.74472517, 16.65629044, 28.07401227,
       23.6988186 , 21.62705664, 20.05318533, 14.84817245, 23.70312128,
       15.46774019, 14.59042499, 23.81246621, 21.28242534, 25.91037758,
       23.2069682 , 17.3302064 , 21.93054199, 22.00306995, 21.35277365,
       24.11062409, 17.63484072, 26.3018108 , 16.856094  , 17.81751699,
       14.84449151, 12.52809884, 15.73162425, 18.62380091, 15.061647  ,
       15.89244817, 17.05352877, 20.04285882, 13.46097817, 16.73189958,
       17.91380225, 23.59880196, 25.89994118, 14.45387658, 20.35220999,
       13.8303674 , 26.45389329, 15.00409853, 20.50687296, 28.29630455,
       29.87973918, 18.15264209, 26.55099329, 13.04312375, 15.30405872,
       19.67560198, 17.94658392, 15.91927432, 27.27220298, 22.59

In [84]:
y_test

24     19.82
6       8.77
153    24.55
211    25.89
198    13.00
       ...  
165    24.52
154    19.77
216    28.15
79     17.29
29     19.65
Name: total_bill, Length: 74, dtype: float64

In [83]:
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

print(r2_score(y_test, y_pred))
print(mean_absolute_error(y_test, y_pred))
print(mean_squared_error(y_test, y_pred))

0.40572354135654387
4.599224833281552
41.39511725775138
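
The mean squared error is in squared units of total_bill, which makes it hard to compare with the mean absolute error. Taking the square root gives the RMSE in the same units as the target; the sketch below simply reuses the MSE value printed above.

```python
import math

mse = 41.39511725775138  # mean squared error reported above

# Root mean squared error, in the same units as total_bill
rmse = math.sqrt(mse)
print(round(rmse, 2))  # 6.43
```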
