
In scikit-learn's sklearn.pipeline module, both Pipeline and make_pipeline are used to construct data processing pipelines, but they differ slightly in usage and convenience.

1. Pipeline: Pipeline is a class in scikit-learn that allows you to define a sequence of data processing steps. It takes a list of tuples, where each tuple consists of a name and an estimator. The name is a string that identifies the particular step in the pipeline, and the estimator is an object that performs a specific transformation or modeling task. The Pipeline class enforces the order of the steps, ensuring that the data flows through the pipeline in the defined sequence. You can access and modify individual steps in the pipeline using their names.

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])

2. make_pipeline: make_pipeline is a function in scikit-learn that provides a more concise way to create pipelines without explicitly naming the steps. It automatically generates the step names from the lowercased class names of the estimators (for example, standardscaler and logisticregression).

pipeline = make_pipeline(StandardScaler(), LogisticRegression())
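As a minimal sketch of the difference, the snippet below builds the same two-step pipeline both ways and prints the step names. The set_params call and the value C=0.5 are illustrative choices, not part of the original example.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

# Pipeline: step names are chosen explicitly
named = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression()),
])

# make_pipeline: step names are derived from the lowercased class names
auto = make_pipeline(StandardScaler(), LogisticRegression())

named_step_names = [name for name, _ in named.steps]
auto_step_names = [name for name, _ in auto.steps]
print(named_step_names)  # ['scaler', 'classifier']
print(auto_step_names)   # ['standardscaler', 'logisticregression']

# A step's parameters can be changed through its name: <step>__<parameter>
named.set_params(classifier__C=0.5)  # 0.5 is an arbitrary illustrative value
print(named.named_steps['classifier'].C)
```

With make_pipeline the same parameter would be addressed as logisticregression__C, which is why explicit names are often easier to read in longer pipelines.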


To create a pipeline, we need one or more transformers and a final estimator. Before building the pipeline, collect them, in order, in a list of steps.

In [2]:
from sklearn.pipeline import Pipeline
##feature scaling
from sklearn.preprocessing import StandardScaler   #transformers
from sklearn.linear_model import LogisticRegression #estimators

In [3]:
steps = [('standard_scaler', StandardScaler()),
         ('classifier', LogisticRegression())]

In [4]:
steps

[('standard_scaler', StandardScaler()), ('classifier', LogisticRegression())]

In [6]:
# convert the steps into a pipeline
pipe = Pipeline(steps)

In [7]:
pipe

## Example 1

In [8]:
#creating a dataset
from sklearn.datasets import make_classification
X,y = make_classification(n_samples = 1000)

In [9]:
X.shape

(1000, 20)

In [10]:
y.shape

(1000,)

In [14]:
X[0]

array([ 3.0565521 ,  0.56248793,  1.1914865 , -1.22045874, -2.1380763 ,
        0.01094034,  0.38564112, -1.70898876, -0.03365053,  1.03153205,
        0.75696343,  0.34320985, -0.52695697, -2.5472051 , -1.61047995,
        0.82998414, -0.9137373 , -1.75627591, -1.0402325 , -1.00152366])

In [15]:
#train test split
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(X,y,train_size = 0.7,test_size = 0.3,random_state=42)

In [16]:
x_train.shape

(700, 20)

In [17]:
# passing the training data to the pipeline for transformation and training
pipe.fit(x_train,y_train)

### Prediction on testing data

When test data passes through the pipeline, the pipeline does not fit again. Each transformer only applies transform, using the parameters learned from the training data, and the final estimator then predicts.

In [18]:

y_pre = pipe.predict(x_test)

In [20]:
y_pre

array([1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1,
       1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1,
       0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0,
       1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1,
       0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1,
       1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0,
       1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0,
       1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0,
       0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0,
       1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1,
       1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1,
       0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1,
       1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0,
       1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0])
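The transform-then-predict behaviour at test time can be checked by hand. The sketch below rebuilds a comparable dataset (the random_state=0 passed to make_classification is an assumption added for reproducibility, so the numbers will differ from the output above) and compares pipe.predict with calling transform on the fitted scaler followed by predict on the fitted classifier.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# random_state=0 is an assumption made here so the sketch is reproducible
X, y = make_classification(n_samples=1000, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

pipe = Pipeline([('standard_scaler', StandardScaler()),
                 ('classifier', LogisticRegression())])
pipe.fit(x_train, y_train)

# Reproduce predict() manually with the already-fitted steps
scaler = pipe.named_steps['standard_scaler']
clf = pipe.named_steps['classifier']
manual_pred = clf.predict(scaler.transform(x_test))  # transform only, no refit
pipe_pred = pipe.predict(x_test)

same = np.array_equal(manual_pred, pipe_pred)
accuracy = pipe.score(x_test, y_test)  # mean accuracy on the test split
print(same)
print(accuracy)
```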

## Example 2

A pipeline with standard scaling, dimensionality reduction, and then an estimator.

In [21]:
from sklearn.decomposition import PCA
from sklearn.svm import SVC

In [22]:
steps = [('scaling', StandardScaler()),
         ('PCA', PCA(n_components=3)),
         ('SVC Model', SVC())]

In [23]:
#converting the steps into pipeline
pipe2 = Pipeline(steps)

In [24]:
pipe2

In [25]:
#fitting the pipeline by passing training data
pipe2.fit(x_train,y_train)

In [27]:
#prediction
y_pred = pipe2.predict(x_test)
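To see what the classifier actually receives, the intermediate output of the preprocessing steps can be inspected. The sketch below is self-contained and regenerates the data with random_state=0 (an assumption for reproducibility); it slices the fitted pipeline to drop the final estimator and looks at the PCA step by name.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# random_state=0 is an assumption made here so the sketch is reproducible
X, y = make_classification(n_samples=1000, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

pipe2 = Pipeline([('scaling', StandardScaler()),
                  ('PCA', PCA(n_components=3)),
                  ('SVC Model', SVC())])
pipe2.fit(x_train, y_train)

# pipe2[:-1] is a sub-pipeline containing every step except the final estimator
reduced = pipe2[:-1].transform(x_test)
print(reduced.shape)  # (300, 3): 300 test rows, 3 principal components

# Fitted attributes of a step are reachable through its name
variance_ratio = pipe2.named_steps['PCA'].explained_variance_ratio_
print(variance_ratio)
```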

## Example 3

In [32]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

In [30]:
## Numerical processing pipeline
# All missing values (np.nan) are replaced by the column mean
import numpy as np

numeric_processor = Pipeline(
                  steps = [('imputation_mean',SimpleImputer(missing_values=np.nan,strategy='mean')),
                           ('scaler',StandardScaler())
                        ])

In [31]:
numeric_processor

In [33]:
## Categorical processing pipeline
# With strategy='constant', every missing value is replaced by fill_value (here the string 'missing')
categorical_processor = Pipeline(
                  steps = [('imputation_constant',SimpleImputer(fill_value='missing',strategy='constant')),
                           ('onehot',OneHotEncoder(handle_unknown='ignore'))
                        ])
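As a small illustration of what the categorical pipeline does, the sketch below runs it on a made-up single column containing one missing entry. The city names are hypothetical sample values; the missing entry is first replaced by the string 'missing' and then one-hot encoded as its own category.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

categorical_processor = Pipeline(steps=[
    ('imputation_constant', SimpleImputer(fill_value='missing', strategy='constant')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])

# Hypothetical one-column data with a single missing value (np.nan)
cities = np.array([['Chennai'], [np.nan], ['Delhi'], ['Chennai']], dtype=object)

encoded = categorical_processor.fit_transform(cities)

# Categories learned by the encoder, including the imputed 'missing' label
categories = list(categorical_processor.named_steps['onehot'].categories_[0])
print(categories)

# OneHotEncoder returns a sparse matrix by default; densify it for display
dense = encoded.toarray()
print(dense)
```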

### Combining both pipelines

In [34]:
from sklearn.compose import ColumnTransformer 

In [35]:
preprocessor = ColumnTransformer(
    [("categorical", categorical_processor, ["gender", "City"]),  # columns handled by the categorical pipeline
     ("numerical", numeric_processor, ["age", "height"])]         # columns handled by the numerical pipeline
)

In [36]:
preprocessor

In [37]:
# finally, chain the preprocessor and the estimator into the final pipeline
from sklearn.pipeline import make_pipeline
pipe=make_pipeline(preprocessor,LogisticRegression())



In [38]:
pipe
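The notebook does not include data for this pipeline, so the sketch below fits it on a small made-up DataFrame. The column names match the ColumnTransformer above, but every value and the target array are hypothetical and exist only to show the pipeline running end to end.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_processor = Pipeline(steps=[
    ('imputation_mean', SimpleImputer(missing_values=np.nan, strategy='mean')),
    ('scaler', StandardScaler()),
])
categorical_processor = Pipeline(steps=[
    ('imputation_constant', SimpleImputer(fill_value='missing', strategy='constant')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])
preprocessor = ColumnTransformer([
    ('categorical', categorical_processor, ['gender', 'City']),
    ('numerical', numeric_processor, ['age', 'height']),
])
pipe = make_pipeline(preprocessor, LogisticRegression())

# Hypothetical toy data with missing values in every column
df = pd.DataFrame({
    'gender': pd.Series(['M', 'F', np.nan, 'F', 'M', 'F'], dtype=object),
    'City': pd.Series(['Chennai', 'Delhi', 'Chennai', np.nan, 'Delhi', 'Chennai'], dtype=object),
    'age': [25.0, 32.0, np.nan, 41.0, 29.0, 35.0],
    'height': [170.0, 160.0, 175.0, np.nan, 168.0, 158.0],
})
target = np.array([0, 1, 0, 1, 0, 1])  # made-up binary labels

# Imputation, encoding, scaling, and training all happen in one call
pipe.fit(df, target)
predictions = pipe.predict(df)
print(predictions)
```

Because the preprocessing lives inside the pipeline, the same imputation values, categories, and scaling statistics learned during fit are reused automatically whenever predict is called on new rows.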