Imports

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV


####  **Creating a Simple Pipeline**

A typical pipeline consists of:

-   **Data preprocessing steps**: like scaling or imputation.
-   **Modeling step**: like classification or regression.

For this example, let's create a pipeline that:

1.  Imputes missing values in the dataset.
2.  Scales numerical features.
3.  Uses a `RandomForestClassifier` as the model.

Example Dataset

In [3]:
# Sample dataset with missing values
data = {
    'age': [25, 30, 35, 40, 45, None, 50, 55],
    'salary': [50000, 60000, 70000, 80000, 90000, 100000, None, 120000],
    'target': [1, 0, 1, 0, 1, 0, 1, 0]
}

df = pd.DataFrame(data)

# Features (X) and target (y)
X = df[['age', 'salary']]
y = df['target']


Building the Pipeline

Here we use `SimpleImputer` to fill in missing values, `StandardScaler` to standardize the numerical columns, and `RandomForestClassifier` as the model. A `ColumnTransformer` applies the imputation and scaling to the `age` and `salary` columns.

In [5]:
# pipeline for preprocessing and classification
pipeline = Pipeline([
    ('preprocessor', ColumnTransformer(
        transformers=[
            ('num', Pipeline([
                ('imputer', SimpleImputer(strategy='mean')),  # Impute missing values with the mean
                ('scaler', StandardScaler())  # Standardize the numerical features
            ]), ['age', 'salary'])
        ]
    )),
    ('classifier', RandomForestClassifier())  # Classifier step (no random_state set, so results may vary between runs)
])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)

print(predictions)


[1 1]
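Once a pipeline is fitted, its sub-steps can be inspected by name. The following is a minimal sketch, rebuilding the same data and pipeline so that it runs on its own; `random_state=0` on the classifier is an addition for repeatability and is not part of the original cell. It compares the means stored by the imputer with the column means of the training rows.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    'age': [25, 30, 35, 40, 45, None, 50, 55],
    'salary': [50000, 60000, 70000, 80000, 90000, 100000, None, 120000],
    'target': [1, 0, 1, 0, 1, 0, 1, 0],
})
X, y = df[['age', 'salary']], df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipeline = Pipeline([
    ('preprocessor', ColumnTransformer([
        ('num', Pipeline([
            ('imputer', SimpleImputer(strategy='mean')),
            ('scaler', StandardScaler()),
        ]), ['age', 'salary']),
    ])),
    # random_state is an assumption added here for repeatable results
    ('classifier', RandomForestClassifier(random_state=0)),
])
pipeline.fit(X_train, y_train)

# Fitted sub-steps are reachable through their step names
num_pipeline = pipeline.named_steps['preprocessor'].named_transformers_['num']
imputer = num_pipeline.named_steps['imputer']

# Means learned by the imputer vs. means of the training rows (NaN ignored)
print(dict(zip(['age', 'salary'], imputer.statistics_)))
print(X_train.mean().to_dict())
```

The two printed dictionaries should agree, since the imputer is fitted on `X_train` only; the test rows play no part in the learned statistics.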


Using GridSearchCV with Pipelines

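You can also tune hyperparameters of the whole pipeline with `GridSearchCV`. This lets you search over combinations of hyperparameters for both the preprocessing steps and the model. Each parameter is addressed by joining the step names and the parameter name with double underscores, for example `classifier__n_estimators`.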
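To see which parameter names a grid can target, it can help to list them directly. A minimal sketch, rebuilding the pipeline from the earlier cell so the block runs on its own (no fitting is needed for this):

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ('preprocessor', ColumnTransformer([
        ('num', Pipeline([
            ('imputer', SimpleImputer(strategy='mean')),
            ('scaler', StandardScaler()),
        ]), ['age', 'salary']),
    ])),
    ('classifier', RandomForestClassifier()),
])

# get_params() returns every nested parameter under its "step__param" name
tunable = sorted(pipeline.get_params().keys())

# Show only the two parameters tuned in this section
for name in tunable:
    if name.endswith(('imputer__strategy', 'classifier__n_estimators')):
        print(name)
```

Any key returned by `get_params()` can be used as a key in `param_grid`.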

Grid Search Example

In [10]:
# hyperparameters for GridSearchCV
param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],  # Imputation strategy
    'classifier__n_estimators': [50, 100, 150]  # Number of trees for RandomForestClassifier
}

# 3-fold cross-validation; with so few training rows the scores are very noisy
grid_search = GridSearchCV(pipeline, param_grid, cv=3)
grid_search.fit(X_train, y_train)

# Best parameters found
print("Best parameters:", grid_search.best_params_)

# Best model performance
print("Best model score:", grid_search.best_score_)

# Predictions with the best model
best_model = grid_search.best_estimator_
predictions = best_model.predict(X_test)
print(predictions)




Best parameters: {'classifier__n_estimators': 50, 'preprocessor__num__imputer__strategy': 'mean'}
Best model score: 0.3333333333333333
[1 1]
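Beyond the single best score, `GridSearchCV` records a score for every parameter combination in `cv_results_`. The sketch below repeats the setup above so it runs on its own and loads those results into a DataFrame; `random_state=0` on the classifier is an addition for repeatability. On a dataset this small, the ranking should be read as an illustration of the API rather than as a meaningful comparison.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    'age': [25, 30, 35, 40, 45, None, 50, 55],
    'salary': [50000, 60000, 70000, 80000, 90000, 100000, None, 120000],
    'target': [1, 0, 1, 0, 1, 0, 1, 0],
})
X, y = df[['age', 'salary']], df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipeline = Pipeline([
    ('preprocessor', ColumnTransformer([
        ('num', Pipeline([
            ('imputer', SimpleImputer(strategy='mean')),
            ('scaler', StandardScaler()),
        ]), ['age', 'salary']),
    ])),
    # random_state is an assumption added here for repeatable results
    ('classifier', RandomForestClassifier(random_state=0)),
])

param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'classifier__n_estimators': [50, 100, 150],
}
grid_search = GridSearchCV(pipeline, param_grid, cv=3)
grid_search.fit(X_train, y_train)

# One row per parameter combination (2 strategies x 3 forest sizes = 6 rows)
results = pd.DataFrame(grid_search.cv_results_)
columns = [
    'param_preprocessor__num__imputer__strategy',
    'param_classifier__n_estimators',
    'mean_test_score',
    'std_test_score',
    'rank_test_score',
]
print(results[columns].sort_values('rank_test_score'))
```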


##### **Advantages of Using Pipelines**

-   **Efficiency**: Pipelines allow you to automate preprocessing and modeling, reducing the chances of errors during data transformation or modeling.
-   **Reproducibility**: Preprocessing steps are fitted on the training data only and then applied unchanged to the test data, which prevents data leakage and makes the workflow easier to reproduce.
-   **Ease of Deployment**: Once a pipeline is defined and tested, it can be deployed as a single unit, making the deployment process cleaner and more manageable.
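The deployment point can be illustrated by saving a fitted pipeline to disk and loading it back as one object. This is a minimal sketch using `joblib`; the file name `pipeline.joblib`, the temporary directory, and `random_state=0` are assumptions made for the example.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    'age': [25, 30, 35, 40, 45, None, 50, 55],
    'salary': [50000, 60000, 70000, 80000, 90000, 100000, None, 120000],
    'target': [1, 0, 1, 0, 1, 0, 1, 0],
})
X, y = df[['age', 'salary']], df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipeline = Pipeline([
    ('preprocessor', ColumnTransformer([
        ('num', Pipeline([
            ('imputer', SimpleImputer(strategy='mean')),
            ('scaler', StandardScaler()),
        ]), ['age', 'salary']),
    ])),
    ('classifier', RandomForestClassifier(random_state=0)),  # random_state added for the example
])
pipeline.fit(X_train, y_train)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'pipeline.joblib')  # hypothetical file name
    joblib.dump(pipeline, model_path)   # preprocessing and model saved together
    restored = joblib.load(model_path)  # loaded back as a single object

# The restored pipeline reproduces the original predictions on raw input
same = bool((restored.predict(X_test) == pipeline.predict(X_test)).all())
print(same)  # True
```

Because the imputer, scaler, and classifier travel together, the loaded object accepts raw rows with missing values and needs no separate preprocessing code. A saved pipeline should generally be loaded with the same scikit-learn version that produced it.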

#####  **Advanced Pipeline Usage**

More advanced pipelines can chain additional steps before the model, such as:

-   **Feature selection** (e.g., using `SelectKBest`).
-   **Custom transformers** for specific preprocessing needs.
-   **Handling different feature types** (e.g., numerical and categorical features).

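For example, the following sketch combines feature selection with mixed feature types. It is left commented out because the sample dataset has no `category` column:

In [18]:
# Illustrative only: requires a 'category' column in X, plus the imports
# OneHotEncoder (sklearn.preprocessing) and SelectKBest (sklearn.feature_selection).
# pipeline = Pipeline([
#     ('preprocessor', ColumnTransformer(
#         transformers=[
#             ('num', Pipeline([
#                 ('imputer', SimpleImputer(strategy='mean')),
#                 ('scaler', StandardScaler())
#             ]), ['age', 'salary']),
#             ('cat', OneHotEncoder(handle_unknown='ignore'), ['category'])  # ignore categories unseen during fit
#         ]
#     )),
#     ('feature_selection', SelectKBest(k=2)),  # Select top 2 features
#     ('classifier', RandomForestClassifier())
# ])

# # Fit and predict using the updated pipeline
# pipeline.fit(X_train, y_train)
# predictions = pipeline.predict(X_test)
# print(predictions)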
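The commented-out pipeline above can be made runnable with a dataset that does contain a categorical column. The following is a minimal sketch under stated assumptions: the `category` column and its values are invented for illustration, `f_classif` is passed explicitly as the scoring function, and `random_state=0` is added for repeatability.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    'age': [25, 30, 35, 40, 45, None, 50, 55],
    'salary': [50000, 60000, 70000, 80000, 90000, 100000, None, 120000],
    'category': ['a', 'b', 'a', 'c', 'b', 'a', 'c', 'b'],  # hypothetical column
    'target': [1, 0, 1, 0, 1, 0, 1, 0],
})
X, y = df[['age', 'salary', 'category']], df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipeline = Pipeline([
    ('preprocessor', ColumnTransformer([
        # Numeric columns: impute, then standardize
        ('num', Pipeline([
            ('imputer', SimpleImputer(strategy='mean')),
            ('scaler', StandardScaler()),
        ]), ['age', 'salary']),
        # Categorical column: one-hot encode, ignoring categories unseen during fit
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['category']),
    ])),
    # Keep the 2 features with the highest ANOVA F-scores
    ('feature_selection', SelectKBest(score_func=f_classif, k=2)),
    ('classifier', RandomForestClassifier(random_state=0)),
])

pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
print(predictions)

# Boolean mask over the preprocessed features showing which ones were kept
selected_mask = pipeline.named_steps['feature_selection'].get_support()
print(selected_mask)
```

With so few rows the F-scores are unstable, so which two features are kept here carries little meaning; the sketch is meant to show how the steps fit together.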

In [19]:
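# Imports needed by the commented-out example above (not used in this cell)
from sklearn.feature_selection import SelectKBest
from sklearn.preprocessing import OneHotEncoder

# Simplified pipeline: scaling only, with no imputation or feature selection.
# Missing values therefore reach the classifier, which assumes a scikit-learn
# version whose RandomForestClassifier accepts NaN (1.4 or later).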
print(X_train.columns)

pipeline = Pipeline([
    ('preprocessor', ColumnTransformer(
        transformers=[
            ('num', StandardScaler(), ['age', 'salary'])  # Standardize 'age' and 'salary'
        ]
    )),
    ('classifier', RandomForestClassifier())
])

# Fit and predict using the updated pipeline
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
print(predictions)






Index(['age', 'salary'], dtype='object')
[1 1]
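Custom transformers, listed among the advanced options, are not demonstrated in the cells above. A minimal sketch follows, assuming a hypothetical `Log1pTransformer` class written for this example; it is not part of scikit-learn. Any class that implements `fit` and `transform` can be used as a pipeline step, and inheriting from `TransformerMixin` and `BaseEstimator` supplies `fit_transform` and parameter handling.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class Log1pTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical stateless transformer that applies log(1 + x) element-wise."""

    def fit(self, X, y=None):
        # Nothing to learn from the data
        return self

    def transform(self, X):
        return np.log1p(np.asarray(X, dtype=float))


df = pd.DataFrame({
    'age': [25, 30, 35, 40, 45, None, 50, 55],
    'salary': [50000, 60000, 70000, 80000, 90000, 100000, None, 120000],
    'target': [1, 0, 1, 0, 1, 0, 1, 0],
})
X, y = df[['age', 'salary']], df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Both columns are numeric, so the steps are chained without a ColumnTransformer
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),   # fill missing values first
    ('log', Log1pTransformer()),                   # custom step
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=0)),  # random_state added for the example
])

pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
print(predictions)
```

For a simple element-wise function like this one, scikit-learn's `FunctionTransformer(np.log1p)` would be a shorter alternative; a custom class becomes more useful when the step has to learn something in `fit`.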


Summary

The `Pipeline` class in scikit-learn streamlines machine learning workflows by combining preprocessing and modeling steps into a single object.

You can use GridSearchCV for hyperparameter tuning within a pipeline.

Pipelines make your workflow more reproducible, manageable, and less error-prone.