Creating a pipeline in supervised learning is a structured way to automate the workflow for model training and evaluation, ensuring that the sequence of steps is consistently followed.

The Pipeline class in scikit-learn chains a sequence of transformations and a final estimator into a single object that is fitted and applied as one unit. Its key benefits are that the same steps are applied in the same order during both training and testing, that the risk of data leakage is reduced because each transformer is fitted only on the data passed to `fit`, and that the code is cleaner and easier to maintain.
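To make the "same sequence of steps" point concrete, here is a minimal sketch that compares a manual scale-then-classify workflow with the equivalent Pipeline. The choice of iris data, the step names, and `max_iter=1000` are assumptions made for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Manual workflow: every step has to be repeated correctly by hand
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training rows only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics
manual_model = LogisticRegression(max_iter=1000).fit(X_train_scaled, y_train)
manual_pred = manual_model.predict(X_test_scaled)

# Pipeline workflow: the same steps, bundled into one estimator
pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)
pipe_pred = pipe.predict(X_test)

# Both workflows apply identical steps, so their predictions should agree
print("Predictions match:", np.array_equal(manual_pred, pipe_pred))
```

The pipeline version leaves no room to forget the `transform` call on the test set or to call `fit_transform` on it by mistake.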

<h1>Data leakage</h1>

Data leakage, also known as information leakage, occurs when information from outside the training dataset is used to create the model. The model has access to data that it should not see during training, typically because that data will not be available in a real-world scenario. The result is overly optimistic performance estimates during evaluation and poor generalization to new, unseen data.

**Data Preprocessing on Entire Dataset:**

A common cause of leakage is performing data preprocessing (such as scaling, imputing, or encoding) on the entire dataset before splitting it into training and test sets. This lets information from the test set influence the training process.

<h3>Examples of Data Leakage</h3>

**Example with Feature Derivation**:

Suppose you're predicting whether a person will develop diabetes. If one of the features includes future medical records that indicate whether the person developed diabetes, this would leak future information into the training process.


**Example with Preprocessing**:
When you scale features using the mean and standard deviation of the entire dataset before splitting into training and test sets, those statistics are computed partly from the test rows. The scaled training data therefore carries information about the test set, which is data leakage.
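The preprocessing example can be sketched as follows. This is an illustration on the iris data under assumed variable names (`leaky_scaler`, `safe_scaler`), not a measurement of how much leakage affects any particular model.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Leaky: the scaler's statistics are computed from all rows,
# so the test rows contribute to the mean and standard deviation
leaky_scaler = StandardScaler().fit(X)

# Safe: the scaler's statistics come from the training rows only
safe_scaler = StandardScaler().fit(X_train)

# Any gap between the two means is information contributed by the test rows
mean_gap = np.abs(leaky_scaler.mean_ - safe_scaler.mean_)
print("Mean from all rows:     ", leaky_scaler.mean_)
print("Mean from training rows:", safe_scaler.mean_)
print("Absolute difference:    ", mean_gap)

# Correct usage: transform both sets with the training-only scaler
X_train_scaled = safe_scaler.transform(X_train)
X_test_scaled = safe_scaler.transform(X_test)
```

Placing the scaler inside a Pipeline gives the safe behaviour automatically, because `fit` only ever sees the training data it is given.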

# Simple pipeline

In [1]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, classification_report

# Load data
data = load_iris()
X, y = data.data, data.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define pipeline
pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])

# Train pipeline
pipeline.fit(X_train, y_train)

# Make predictions
y_pred = pipeline.predict(X_test)

# Evaluate model
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(f"Classification Report:\n {classification_report(y_test, y_pred)}")


Accuracy: 1.0
Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30



# Multi-model pipeline

In [10]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, classification_report

# Load data
data = load_iris()
X, y = data.data, data.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define preprocessing
preprocessing = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Define pipelines for each model
pipeline_lr = Pipeline(steps=[
    ('preprocessing', preprocessing),
    ('classifier', LogisticRegression())
])

pipeline_rf = Pipeline(steps=[
    ('preprocessing', preprocessing),
    ('classifier', RandomForestClassifier())
])

pipeline_svc = Pipeline(steps=[
    ('preprocessing', preprocessing),
    ('classifier', SVC())
])

# Define parameter grids for each pipeline
param_grid_lr = {
    'classifier__C': [0.1, 1, 10],
    'classifier__solver': ['lbfgs', 'liblinear']
}

param_grid_rf = {
    'classifier__n_estimators': [50, 100, 200],
    'classifier__max_features': ['sqrt', 'log2', None]  # None uses all features
}

param_grid_svc = {
    'classifier__C': [0.1, 1, 10],
    'classifier__kernel': ['linear', 'rbf']
}

# Combine pipelines and parameter grids
pipelines = [
    (pipeline_lr, param_grid_lr),
    (pipeline_rf, param_grid_rf),
    (pipeline_svc, param_grid_svc)
]

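# Run a separate grid search for each pipeline and keep its best result
# as a tuple: (fitted best estimator, best parameters, mean CV score)
best_models = []
for pipeline, param_grid in pipelines:
    grid_search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1)
    grid_search.fit(X_train, y_train)
    best_models.append((grid_search.best_estimator_, grid_search.best_params_, grid_search.best_score_))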

# Find the best overall model
best_model = max(best_models, key=lambda x: x[2])

print(f"Best Model: {best_model[0]}\n")
print(f"Best Parameters: {best_model[1]}\n")
print(f"Best Score: {best_model[2]}\n")

# Evaluate the best model on test data
y_pred = best_model[0].predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred)}\n")
print(f"Classification Report:\n {classification_report(y_test, y_pred)}\n")


Best Model: Pipeline(steps=[('preprocessing',
                 Pipeline(steps=[('imputer', SimpleImputer()),
                                 ('scaler', StandardScaler())])),
                ('classifier', LogisticRegression(C=1))])

Best Parameters: {'classifier__C': 1, 'classifier__solver': 'lbfgs'}

Best Score: 0.9583333333333334

Accuracy: 1.0

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30




**Pipeline Structure:**

A pipeline expects a fixed sequence of named steps. The code above builds one pipeline per model, each sharing the same preprocessing and ending in a `classifier` step, and runs a separate grid search for each. An alternative is a single pipeline whose `classifier` step holds any initial classifier (for example, LogisticRegression()) as a placeholder. This gives a valid pipeline structure that can be modified dynamically during the grid search.

**Dynamic Replacement in Grid Search:**

In the single-pipeline variant, GridSearchCV replaces the `classifier` step only when the step itself is listed as a parameter in param_grid, alongside the hyperparameters that belong to each candidate. The grid search then involves:

1. Replacing the classifier step with each candidate model (e.g., LogisticRegression(), RandomForestClassifier(), SVC()).
2. Setting the specified hyperparameters for each candidate model.
3. Evaluating the performance of each configuration through cross-validation.
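The replacement steps listed above can be sketched with a single pipeline and a list of parameter grids, one per candidate model. This is a minimal sketch rather than the code that produced the output shown earlier: the reduced grids, `max_iter=1000`, and `random_state=42` are assumptions added to keep the example short and reproducible.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# One pipeline; the classifier step is only a placeholder
pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])

# Each dict swaps in one candidate for the "classifier" step
# and lists only the hyperparameters that candidate accepts
param_grid = [
    {
        "classifier": [LogisticRegression(max_iter=1000)],
        "classifier__C": [0.1, 1, 10],
    },
    {
        "classifier": [RandomForestClassifier(random_state=42)],
        "classifier__n_estimators": [50, 100],
    },
    {
        "classifier": [SVC()],
        "classifier__C": [0.1, 1, 10],
        "classifier__kernel": ["linear", "rbf"],
    },
]

grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train, y_train)

best_classifier = grid_search.best_estimator_.named_steps["classifier"]
print("Best classifier:", type(best_classifier).__name__)
print("Best parameters:", grid_search.best_params_)
print("Best CV score:", grid_search.best_score_)
print("Test accuracy:", grid_search.score(X_test, y_test))
```

Compared with one grid search per pipeline, this form compares all candidates in a single search, so `best_estimator_` is already the overall winner and no manual `max` over results is needed.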