Pipeline of transforms with a final estimator.

Sequentially apply a list of transforms and a final estimator. Intermediate steps of the pipeline must be ‘transforms’, that is, they must implement fit and transform methods. The final estimator only needs to implement fit. The transformers in the pipeline can be cached using the memory argument.
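As a minimal sketch of the caching mentioned above: the memory argument accepts a directory path in which fitted transformers are stored. The temporary directory, the synthetic data, and the names cache_dir and cached_pipe are assumptions made only for this illustration.

```python
from shutil import rmtree
from tempfile import mkdtemp

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for this illustration
X_demo, y_demo = make_classification(n_samples=200, random_state=0)

# Temporary directory that holds the cached, fitted transformers
cache_dir = mkdtemp()

cached_pipe = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("pca", PCA(n_components=3)),
        ("classifier", LogisticRegression()),
    ],
    memory=cache_dir,
)

# Fitting stores the fitted transformers in cache_dir; refitting with the
# same data and parameters can reuse them instead of recomputing.
cached_pipe.fit(X_demo, y_demo)
cached_accuracy = cached_pipe.score(X_demo, y_demo)

# Remove the cache once it is no longer needed
rmtree(cache_dir)
```

Caching tends to pay off when the transformers are expensive to fit and the pipeline is refit many times, for example during a grid search.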

The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters. For this, it enables setting parameters of the various steps using their names and the parameter name separated by a '__', as in the example below. A step’s estimator may be replaced entirely by setting the parameter with its name to another estimator, or a transformer removed by setting it to 'passthrough' or None.
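The naming scheme described above can be sketched as follows. The step names (scaler, pca, classifier), the synthetic data, and the parameter values are assumptions chosen for illustration, not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for this illustration
X_demo, y_demo = make_classification(n_samples=200, random_state=0)

demo_pipe = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("pca", PCA(n_components=3)),
        ("classifier", LogisticRegression()),
    ]
)

# <step name>__<parameter name> sets a parameter of a single step
demo_pipe.set_params(classifier__C=0.5, pca__n_components=5)

# Setting a step to "passthrough" skips that transformer entirely
demo_pipe.set_params(pca="passthrough")
demo_pipe.fit(X_demo, y_demo)

# Setting a step to another estimator replaces it
demo_pipe.set_params(pca=PCA())

# The same names drive a cross-validated grid search over several steps
param_grid = {
    "pca__n_components": [2, 5],
    "classifier__C": [0.1, 1.0],
}
search = GridSearchCV(demo_pipe, param_grid, cv=3)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

Because the scaler and PCA live inside the pipeline, each cross-validation fold refits them on its own training portion, which avoids leaking information from the held-out fold.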

[https://www.youtube.com/watch?v=HZ9MUzCRlzI&ab_channel=KrishNaik]

class sklearn.pipeline.Pipeline(steps, *, memory=None, verbose=False)

[https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html]

In [3]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

## StandardScaler is a transformer and LogisticRegression is an estimator

In [6]:
steps = [("standard_scaler", StandardScaler()),
        ("classifier", LogisticRegression())]

print(steps)

[('standard_scaler', StandardScaler()), ('classifier', LogisticRegression())]


In [7]:
Pipeline(steps)

Pipeline(steps=[('standard_scaler', StandardScaler()),
                ('classifier', LogisticRegression())])

In [8]:
pipe = Pipeline(steps)

In [9]:
# Now we are going to visualize the pipeline

from sklearn import set_config
set_config(display="diagram")

In [10]:
pipe

In [11]:
## create dataset
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000)

In [12]:
X.shape, y.shape

((1000, 20), (1000,))

In [13]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, test_size = 0.3)

In [14]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((700, 20), (300, 20), (700,), (300,))

In [15]:
X_train

array([[-0.73096954,  0.72161213,  0.59709576, ...,  1.25703038,
        -0.49860009,  1.05549933],
       [ 1.20328064, -0.30281421, -1.29242428, ..., -1.98274232,
        -1.74760109, -1.56782264],
       [-0.59169192,  0.62315217,  0.87528252, ..., -1.69220857,
         0.67135837, -0.31361111],
       ...,
       [-0.85936136, -0.24332362, -1.9811566 , ...,  1.395996  ,
        -0.25180935,  1.0299083 ],
       [-0.27260729, -1.13200134,  0.46995393, ...,  1.33836653,
         0.17655464, -0.75096714],
       [ 2.39817533, -0.2314117 , -0.01630132, ...,  1.52961345,
        -0.78749187,  2.02508267]])

In [16]:
pipe.fit(X_train, y_train)  # No need to scale the data separately with StandardScaler: the pipeline fits the scaler,
                            # transforms the training data, and then fits the logistic regression on the result.

In [17]:
y_pred = pipe.predict(X_test) # the already-fitted scaler only transforms X_test (nothing is refit), then the classifier predicts

In [18]:
y_pred

array([1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1,
       1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1,
       0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1,
       1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0,
       0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0,
       0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0,
       1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0,
       1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0,
       1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0,
       1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1,
       1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1])
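To make explicit what fit and predict do inside the pipeline, the sketch below repeats the same two steps by hand and compares the predictions. The seeded synthetic data and the variable names are assumptions for this illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Seeded synthetic data so the comparison is reproducible
X_demo, y_demo = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# Pipeline version: scaling and classification in one object
demo_pipe = Pipeline(
    [("scaler", StandardScaler()), ("classifier", LogisticRegression())]
)
demo_pipe.fit(X_tr, y_tr)
pipe_pred = demo_pipe.predict(X_te)

# Manual version: fit the scaler on the training data only,
# then reuse the fitted scaler to transform the test data
scaler = StandardScaler().fit(X_tr)
clf = LogisticRegression().fit(scaler.transform(X_tr), y_tr)
manual_pred = clf.predict(scaler.transform(X_te))

same_predictions = np.array_equal(pipe_pred, manual_pred)
print(same_predictions)
```

The pipeline removes the risk of accidentally calling fit on the test data, since predict only ever calls transform on the intermediate steps.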

This time we will chain several preprocessing steps before the estimator. Let's see how to do this with a pipeline.

Standard Scaling ---> PCA ---> Estimator

In [19]:
from sklearn.decomposition import PCA
from sklearn.svm import SVC

In [20]:
steps = [("standard_scaler", StandardScaler()), 
        ("pca", PCA(n_components = 3)),
        ("svc", SVC())]

In [21]:
steps

[('standard_scaler', StandardScaler()),
 ('pca', PCA(n_components=3)),
 ('svc', SVC())]

In [22]:
pipe = Pipeline(steps)
print(pipe)

Pipeline(steps=[('standard_scaler', StandardScaler()),
                ('pca', PCA(n_components=3)), ('svc', SVC())])


In [23]:
pipe

In [27]:
# pipe['standard_scaler'].fit_transform(X_train)   # To check that a particular step of the pipeline works correctly,
                                                # access it by its step name (the key).
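A few ways to reach individual steps, sketched on an assumed, freshly built pipeline with synthetic data: by name, by position, and by slicing off the final estimator to inspect the output of the preprocessing steps.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data, used only for this illustration
X_demo, y_demo = make_classification(n_samples=200, random_state=0)

demo_pipe = Pipeline(
    [
        ("standard_scaler", StandardScaler()),
        ("pca", PCA(n_components=3)),
        ("svc", SVC()),
    ]
)
demo_pipe.fit(X_demo, y_demo)

# Access a step by its name, through named_steps, or by position
scaler_by_key = demo_pipe["standard_scaler"]
scaler_by_name = demo_pipe.named_steps["standard_scaler"]
scaler_by_index = demo_pipe[0]

# Slicing returns a sub-pipeline; here, everything except the final SVC
X_reduced = demo_pipe[:-1].transform(X_demo)
print(X_reduced.shape)
```

Slicing is handy for debugging, because it shows exactly what the final estimator receives as input.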

In [28]:
pipe.fit(X_train, y_train)

In [29]:
pipe.predict(X_test)

array([1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0,
       0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0,
       1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0,
       0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0,
       1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0,
       0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0,
       1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0,
       1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0,
       1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1,
       1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1])

Next, we look at a new example that uses a column transformer to apply different preprocessing to different columns.

In [30]:
from sklearn.impute import SimpleImputer  # imputes missing values
import numpy as np

In [34]:
# Pipeline for the numerical columns: impute missing values with the mean, then standardize
numeric_processor = Pipeline(steps = [("imputation_mean", SimpleImputer(missing_values=np.nan, strategy="mean")),
                    ("standard_scaler", StandardScaler())])

In [35]:
numeric_processor

In [37]:
from sklearn.preprocessing import OneHotEncoder

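# Pipeline for the categorical columns: fill missing values with the constant "missing", then one-hot encode
categorical_processor = Pipeline(steps = [("imputation_constant", SimpleImputer(fill_value="missing", strategy="constant")),
                        ("onehot", OneHotEncoder(handle_unknown = "ignore"))])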

In [38]:
categorical_processor

Now we are going to combine both of these pipelines.

In [39]:
from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer(
    [("categorical", categorical_processor, ["gender", "City"]),
     ("numerical", numeric_processor, ["age", "height"])])

In [40]:
preprocessor
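To see what the combined preprocessor produces, the sketch below applies an equivalent ColumnTransformer to a small, made-up DataFrame with a few missing values. The toy data and the name toy are assumptions for illustration; the column names mirror the ones used above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data with one missing value in gender, age, and height
toy = pd.DataFrame(
    {
        "gender": pd.Series(["M", "F", np.nan, "F"], dtype=object),
        "City": pd.Series(["D", "C", "B", "C"], dtype=object),
        "age": [19.0, 23.0, np.nan, 21.0],
        "height": [170.0, 155.0, 145.0, np.nan],
    }
)

numeric_processor = Pipeline(
    steps=[
        ("imputation_mean", SimpleImputer(strategy="mean")),
        ("standard_scaler", StandardScaler()),
    ]
)
categorical_processor = Pipeline(
    steps=[
        ("imputation_constant", SimpleImputer(strategy="constant", fill_value="missing")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]
)

preprocessor = ColumnTransformer(
    [
        ("categorical", categorical_processor, ["gender", "City"]),
        ("numerical", numeric_processor, ["age", "height"]),
    ]
)

# Output columns: 3 gender categories (F, M, missing) + 3 city categories
# + 2 scaled numeric columns
transformed = preprocessor.fit_transform(toy)
print(transformed.shape)
```

Each sub-pipeline only sees the columns listed for it, and the results are concatenated side by side in the order the transformers are declared.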

**Making final pipeline**

In [41]:
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(preprocessor, LogisticRegression())

In [42]:
pipe
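Unlike Pipeline, make_pipeline does not take step names; it derives them from the lowercased class names. A minimal sketch, using a plain StandardScaler in place of the preprocessor above to keep the example self-contained:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# make_pipeline generates the step names automatically
auto_pipe = make_pipeline(StandardScaler(), LogisticRegression())

step_names = [name for name, _ in auto_pipe.steps]
print(step_names)  # ['standardscaler', 'logisticregression']

# The generated names work with the usual <step>__<parameter> syntax
auto_pipe.set_params(logisticregression__C=0.5)
```

With the pipeline built above, the first step would accordingly be named columntransformer.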

In [51]:
import pandas as pd
df = pd.read_csv(r"C:\Users\USER\Desktop\Project\Brushing Up Machine Learning\pipeline_dataset.csv")
df.head()

  gender City  age  height  passed
0      M    D   19     170       1
1      F    C   23     155       0
2      F    B   17     145       1
3      F    K   21     152       1
4      M    R   25     165       0


In [52]:
X = df.iloc[:, 0:4]
y = df.iloc[:, 4]

In [53]:
X.head()

  gender City  age  height
0      M    D   19     170
1      F    C   23     155
2      F    B   17     145
3      F    K   21     152
4      M    R   25     165


In [54]:
y.head()

0    1
1    0
2    1
3    1
4    0
Name: passed, dtype: int64

In [55]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 3, test_size = 0.3)

In [56]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((3, 4), (2, 4), (3,), (2,))

In [57]:
pipe.fit(X_train, y_train)

In [59]:
y_pred = pipe.predict(X_test)

In [60]:
y_pred

array([1, 0], dtype=int64)

In [61]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00         1
           1       1.00      1.00      1.00         1

    accuracy                           1.00         2
   macro avg       1.00      1.00      1.00         2
weighted avg       1.00      1.00      1.00         2

