Pipelines in Machine Learning

In machine learning, a pipeline chains a series of data processing steps into a single workflow, where the output of each step feeds the next. Each step performs a specific task, such as data preprocessing, feature engineering, model selection, or prediction.

Components of a Pipeline:

1. Data Ingestion: Loading data from various sources.
2. Data Preprocessing: Cleaning, transforming, and preparing data for modeling.
3. Feature Engineering: Selecting and transforming relevant features.
4. Model Selection: Choosing a suitable machine learning algorithm.
5. Model Training: Training the model on the prepared data.
6. Model Evaluation: Evaluating the performance of the trained model.
7. Deployment: Deploying the model in a production-ready environment.

Benefits of Pipelines:

1. Reusability: Pipelines can be reused for similar tasks or datasets.
2. Efficiency: Pipelines automate the workflow, reducing manual effort.
3. Consistency: Pipelines ensure consistency in data processing and modeling.
4. Scalability: Pipelines can be scaled up or down depending on the dataset size and complexity.
5. Collaboration: Pipelines facilitate collaboration among data scientists and engineers.
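
As a rough illustration of the reusability point, the sketch below defines the preprocessing once and reuses it with two different models. The synthetic data from make_classification and the step names ("imputer", "scaler", "classifier") are assumptions chosen for this example, not part of the Titanic walkthrough that follows.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; any numeric feature matrix works)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# One pipeline definition: impute, scale, then classify
base = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("classifier", DecisionTreeClassifier(random_state=0)),
])

# Reuse the same preprocessing with a different model by swapping one step.
# clone() makes an unfitted copy, so the base definition is left untouched.
candidates = {
    "tree": clone(base),
    "logreg": clone(base).set_params(classifier=LogisticRegression(max_iter=1000)),
}

scores = {}
for name, pipe in candidates.items():
    pipe.fit(X, y)
    # Training accuracy, shown only to illustrate the call pattern
    scores[name] = pipe.score(X, y)

print(scores)
```

Only the final step changes between the two candidates; the imputation and scaling code is written once.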


First without Pipelines

In [1]:
import seaborn as sns
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

# Load Titanic dataset
data = sns.load_dataset("titanic")

# Drop columns that have too many missing values or are redundant
data = data.drop(columns=["deck", "embark_town", "alive", "class", "who", "adult_male"])

# Handle missing values (note: these statistics come from the full dataset, before the split)
data["age"] = data["age"].fillna(data["age"].median())
data["embarked"] = data["embarked"].fillna(data["embarked"].mode()[0])

# Encode categorical variables manually (get_dummies)
data = pd.get_dummies(data, drop_first=True)

# Features and target
X = data.drop(columns="survived")
y = data["survived"]

# Split data
train_X, test_X, train_Y, test_Y = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train Decision Tree
model = DecisionTreeClassifier(random_state=42)
model.fit(train_X, train_Y)

# Predict
y_pred = model.predict(test_X)

# Evaluate
print("Accuracy Score:", accuracy_score(test_Y, y_pred))
print("\nClassification Report:\n", classification_report(test_Y, y_pred))


Accuracy Score: 0.8156424581005587

Classification Report:
               precision    recall  f1-score   support

           0       0.83      0.87      0.85       110
           1       0.78      0.72      0.75        69

    accuracy                           0.82       179
   macro avg       0.81      0.80      0.80       179
weighted avg       0.81      0.82      0.81       179



Now with Pipelines

In [2]:
import seaborn as sns
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, classification_report

# Load Titanic dataset
data = sns.load_dataset("titanic")

# Drop irrelevant columns
data = data.drop(columns=["deck", "embark_town", "alive", "class", "who", "adult_male"])

# Features and target
X = data.drop(columns="survived")
y = data["survived"]

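# Separate numeric and categorical features
# Note: bool columns such as "alone" match neither selector, so the
# ColumnTransformer below drops them (its default is remainder="drop")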
numeric_features = X.select_dtypes(include=["int64", "float64"]).columns
categorical_features = X.select_dtypes(include=["object", "category"]).columns

# Preprocessing for numeric features: fill missing + scale
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

# Preprocessing for categorical features: fill missing + one-hot encode
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

# Combine preprocessing
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features)
    ]
)

# Final pipeline with DecisionTree
clf = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", DecisionTreeClassifier(random_state=42))
])

# Split dataset
train_X, test_X, train_Y, test_Y = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train pipeline
clf.fit(train_X, train_Y)

# Predict
y_pred = clf.predict(test_X)

# Evaluate
print("Accuracy Score:", accuracy_score(test_Y, y_pred))
print("\nClassification Report:\n", classification_report(test_Y, y_pred))


Accuracy Score: 0.8156424581005587

Classification Report:
               precision    recall  f1-score   support

           0       0.83      0.88      0.85       110
           1       0.79      0.71      0.75        69

    accuracy                           0.82       179
   macro avg       0.81      0.80      0.80       179
weighted avg       0.81      0.82      0.81       179



Without Pipeline (manual preprocessing)
Pros

Transparent: You see each preprocessing step (fillna, encoding, scaling, etc.) explicitly.

Flexible debugging: Easy to test transformations step by step.

Good for learning: Helps beginners understand preprocessing clearly.

Cons

Repetition: You must manually reapply the exact same preprocessing to the test data, to future unseen data, and again at deployment.

Error-prone: Easy to forget a step (e.g., forgetting to scale test data).

Messy code: Preprocessing and model training code can get long and hard to maintain.

Not reusable: If you want to try another model, you must duplicate preprocessing code.
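
To make the repetition and error-prone points concrete, here is a minimal sketch of what reapplying manual preprocessing to new data can look like. The two tiny frames are hypothetical stand-ins for the training data and for later unseen data, and the variable names are chosen for illustration only.

```python
import pandas as pd

# Tiny hypothetical frames standing in for training data and later unseen data
train = pd.DataFrame({
    "age": [22.0, None, 35.0, 58.0],
    "embarked": ["S", "C", "Q", "S"],
})
new = pd.DataFrame({
    "age": [None, 40.0],
    "embarked": ["S", "S"],  # "C" and "Q" never appear here
})

# Statistics must be learned from the training data and stored by hand
age_median = train["age"].median()

train_enc = pd.get_dummies(train.fillna({"age": age_median}))
train_columns = train_enc.columns

# The same steps have to be repeated for new data, reusing the stored median
new_enc = pd.get_dummies(new.fillna({"age": age_median}))

# get_dummies only creates columns for the categories it sees, so the new
# frame must be realigned to the training columns before prediction
new_aligned = new_enc.reindex(columns=train_columns, fill_value=0)

print(list(new_enc.columns))
print(list(new_aligned.columns))
```

Without the reindex step, the new frame would have fewer columns than the model was trained on, which is exactly the kind of step that is easy to forget.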

With Pipeline
Pros

Cleaner & modular: Preprocessing + model = one object (Pipeline).

Consistency: Ensures the same transformations are applied to both training and testing data.

Easy to deploy: You call .fit() on the training data and .predict() on new data, and the pipeline applies every preprocessing step automatically.

Works with hyperparameter tuning: You can directly tune model + preprocessing steps inside GridSearchCV / RandomizedSearchCV.

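Production ready: Helps prevent data leakage, because calling .fit() on the pipeline fits every transformer on the training data only, so test-set statistics never influence preprocessing (provided you split before fitting).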
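
The hyperparameter tuning point can be sketched as follows. The small synthetic frame is hypothetical data, not the Titanic set, but the pipeline mirrors the structure used above, so parameter names of the form step__substep__parameter should carry over. The grid values are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Small synthetic frame (hypothetical data, not the Titanic set)
rng = np.random.default_rng(0)
n = 120
X = pd.DataFrame({
    "age": rng.normal(30, 10, n),
    "fare": rng.exponential(30, n),
    "sex": rng.choice(["male", "female"], n),
})
X.loc[::10, "age"] = np.nan  # inject some missing values
y = ((X["sex"] == "female") & (X["fare"] > 15)).astype(int)

numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_transformer, ["age", "fare"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["sex"]),
])
clf = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", DecisionTreeClassifier(random_state=42)),
])

# Double underscores walk down the nesting: pipeline step -> transformer -> parameter
param_grid = {
    "preprocessor__num__imputer__strategy": ["mean", "median"],
    "classifier__max_depth": [2, 4, None],
}

# Each CV fold refits the whole pipeline, so imputation and scaling
# statistics are learned from that fold's training portion only
search = GridSearchCV(clf, param_grid, cv=3, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

Because the search refits the full pipeline per fold, the preprocessing choices are tuned alongside the model without leaking validation data into the imputer or scaler.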

Cons

Less transparent: Intermediate results are harder to inspect; you need to reach into the fitted steps through .named_steps.

Learning curve: Beginners may find ColumnTransformer and Pipeline syntax confusing.

Less control: Custom preprocessing logic (like domain-specific feature engineering) has to be wrapped in a transformer, for example with FunctionTransformer, which can feel restrictive.
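
Both of these drawbacks can be softened somewhat. The sketch below wraps a custom feature step in FunctionTransformer and then inspects intermediate results through .named_steps and pipeline slicing. The add_family_size function, the family_size column, and the six-row frame are hypothetical and exist only for this illustration.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.tree import DecisionTreeClassifier


def add_family_size(frame):
    """Hypothetical domain feature: siblings/spouses + parents/children + 1."""
    out = frame.copy()
    out["family_size"] = out["sibsp"] + out["parch"] + 1
    return out


# Tiny made-up frame using Titanic-style column names
X = pd.DataFrame({
    "sibsp": [1, 0, 3, 0, 1, 2],
    "parch": [0, 0, 2, 1, 1, 0],
    "fare": [7.25, 71.28, 21.08, 8.05, 30.0, 15.5],
})
y = [0, 1, 0, 1, 1, 0]

pipe = Pipeline(steps=[
    ("family", FunctionTransformer(add_family_size)),  # custom logic as a step
    ("scaler", StandardScaler()),
    ("classifier", DecisionTreeClassifier(random_state=42)),
])
pipe.fit(X, y)

# Inspect a fitted step by name: per-column means learned by the scaler
scaler_means = pipe.named_steps["scaler"].mean_
print(scaler_means)

# Slicing returns a sub-pipeline; this one is everything before the classifier,
# so transform() shows the exact features the model was trained on
features = pipe[:-1].transform(X)
print(features.shape)  # (6, 4): three original columns plus family_size
```

For logic that needs to learn state during fit, a custom transformer class would be the usual route; FunctionTransformer covers the stateless case shown here.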

Summary:

Use manual preprocessing when you’re exploring or learning step-by-step.

Use a Pipeline for real-world projects, experiments with multiple models, or production. It is more robust, reusable, and less error-prone.