<a href="https://www.kaggle.com/code/angelchaudhary/scikit-learn-pipelines-vs-custom-ml-workflows?scriptVersionId=292223160" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# End-to-End Machine Learning Pipeline: Scikit-learn Pipelines vs Custom ML Workflows

# Introduction
Building an end-to-end machine learning system involves more than model training: it includes data preprocessing, feature engineering, validation, and ensuring consistency across the entire workflow. These steps are commonly implemented either with Scikit-learn’s built-in Pipelines or through custom, manually structured ML pipelines. The choice affects reliability, scalability, and the risk of data leakage.

This case study aims to compare Scikit-learn Pipelines with custom ML workflows to understand their real-world trade-offs. While Scikit-learn enforces structure and safety, custom pipelines offer flexibility and control. Understanding both approaches is essential for building production-ready ML systems and for demonstrating strong system design thinking in interviews.

## Approach
We solve the same machine learning problem using two approaches:
1. An end-to-end Scikit-learn Pipeline using `Pipeline` and `ColumnTransformer`
2. A custom-built ML workflow with explicit preprocessing and modeling steps

Both approaches are evaluated on performance, maintainability, and robustness to highlight their strengths and limitations.


## Dataset Overview 
**This dataset is taken from the Kaggle competition "Student Score Prediction".** We'll use it to predict exam scores based on study habits, sleep patterns, attendance, and demographic information. It contains both numerical and categorical features, making it suitable for demonstrating end-to-end ML pipeline design.

### Target Variable
- `exam_score` (Regression)

### Features
- Numerical: `age`, `study_hours`, `class_attendance`, `sleep_hours`
- Categorical: `gender`, `course`, `internet_access`, `study_method`
- Ordinal: `sleep_quality`, `facility_rating`, `exam_difficulty`
- Identifier (not a predictive feature): `id`
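
The ordinal features listed above have a natural order that one-hot encoding discards. The notebook itself one-hot encodes them, so the following is only a minimal sketch of an alternative. It assumes the orderings poor < average < good and low < medium < high, which are inferred from the sample rows and may not cover every level in the full dataset.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frame; the level orderings below are assumptions, not documented dataset facts.
toy = pd.DataFrame({
    "sleep_quality": ["poor", "good", "average"],
    "facility_rating": ["high", "low", "medium"],
})

ordinal_encoder = OrdinalEncoder(
    categories=[
        ["poor", "average", "good"],  # assumed order for sleep_quality
        ["low", "medium", "high"],    # assumed order for facility_rating
    ]
)

# Each level is mapped to its position in the corresponding list above.
encoded = ordinal_encoder.fit_transform(toy)
print(encoded)
```

Whether an integer encoding helps depends on the model: a linear model would treat the gap between adjacent levels as equal, which may or may not be appropriate.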

In [2]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

import warnings
warnings.filterwarnings("ignore")

In [3]:
# Load dataset
df = pd.read_csv("/kaggle/input/kaggle-dataset/train.csv")
df.head()

|   | id | age | gender | course | study_hours | class_attendance | internet_access | sleep_hours | sleep_quality | study_method | facility_rating | exam_difficulty | exam_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 21 | female | b.sc | 7.91 | 98.8 | no | 4.9 | average | online videos | low | easy | 78.3 |
| 1 | 1 | 18 | other | diploma | 4.95 | 94.8 | yes | 4.7 | poor | self-study | medium | moderate | 46.7 |
| 2 | 2 | 20 | female | b.sc | 4.68 | 92.6 | yes | 5.8 | poor | coaching | high | moderate | 99.0 |
| 3 | 3 | 19 | male | b.sc | 2.0 | 49.5 | yes | 8.3 | average | group study | high | moderate | 63.9 |
| 4 | 4 | 23 | male | bca | 7.65 | 86.9 | yes | 9.6 | good | self-study | high | easy | 100.0 |


In [4]:
df.shape

(630000, 13)

In [5]:
df.dtypes

id                    int64
age                   int64
gender               object
course               object
study_hours         float64
class_attendance    float64
internet_access      object
sleep_hours         float64
sleep_quality        object
study_method         object
facility_rating      object
exam_difficulty      object
exam_score          float64
dtype: object

In [6]:
df.isnull().sum()

id                  0
age                 0
gender              0
course              0
study_hours         0
class_attendance    0
internet_access     0
sleep_hours         0
sleep_quality       0
study_method        0
facility_rating     0
exam_difficulty     0
exam_score          0
dtype: int64

In [7]:
df["exam_score"].describe()

count    630000.000000
mean         62.506672
std          18.916884
min          19.599000
25%          48.800000
50%          62.600000
75%          76.300000
max         100.000000
Name: exam_score, dtype: float64

In [8]:
X = df.drop(columns=["exam_score"])
y = df["exam_score"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)

Train shape: (504000, 12)
Test shape: (126000, 12)


### Feature Categorization
We explicitly separate numerical and categorical features to ensure consistent preprocessing.

In [9]:
numerical_features = [
    "study_hours",
    "sleep_hours",
    "class_attendance"
]

categorical_features = [
    "gender",
    "course",
    "study_method",
    "sleep_quality",
    "facility_rating"
]

## Baseline Model (No Pipeline)
Before building pipelines, we train a simple baseline model without structured preprocessing. This helps establish a reference point.

In [11]:
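# root_mean_squared_error requires scikit-learn >= 1.4
from sklearn.metrics import root_mean_squared_error, r2_score

baseline_model = LinearRegression()
# Train and test are one-hot encoded separately, so this only works
# when both splits contain exactly the same category levels.
baseline_model.fit(pd.get_dummies(X_train, drop_first=True), y_train)
# Predictions
baseline_preds = baseline_model.predict(pd.get_dummies(X_test, drop_first=True))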

# Evaluation
rmse = root_mean_squared_error(y_test, baseline_preds)
r2 = r2_score(y_test, baseline_preds)

print(f"Baseline RMSE: {rmse:.2f}")
print(f"Baseline R² Score: {r2:.3f}")

Baseline RMSE: 8.89
Baseline R² Score: 0.778


## Observation (Baseline Model)

The baseline Linear Regression model achieves an RMSE of **8.89** and an R² score of **0.778**, indicating that the model is able to explain a significant portion of the variance in exam scores. However, this approach relies on manual one-hot encoding using `pd.get_dummies`, which introduces several limitations:
- The preprocessing logic is **not reusable** for inference or deployment.
- There is a **risk of feature mismatch** if the training and test sets contain different categorical levels.
- Preprocessing is performed **outside the model**, making the workflow harder to maintain and scale.

While this baseline provides a reasonable performance reference, it highlights the need for a more structured and reliable pipeline-based approach.
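
To make the feature-mismatch risk concrete, here is a small sketch on hypothetical toy frames (not the competition data) in which the test split contains a category level that never appears in training. Calling `pd.get_dummies` on each split independently then yields different columns, and one manual workaround is to reindex the test columns to match the training columns.

```python
import pandas as pd

# Hypothetical toy frames: "diploma" appears only in the test split.
train = pd.DataFrame({"course": ["b.sc", "bca", "b.sc"]})
test = pd.DataFrame({"course": ["bca", "diploma"]})

train_enc = pd.get_dummies(train)
test_enc = pd.get_dummies(test)

# The two encodings disagree on which columns exist.
print(list(train_enc.columns))
print(list(test_enc.columns))

# Manual workaround: force the test frame onto the training columns.
# Unseen levels are dropped and missing levels are filled with 0.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_aligned.columns))
```

The workaround is easy to forget, which is part of why a fitted encoder that remembers the training levels is less error-prone.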

## Approach A - Scikit-learn Pipeline
In this approach, we use Scikit-learn’s `Pipeline` and `ColumnTransformer` to build a fully structured end-to-end ML workflow.
This ensures:
- Consistent preprocessing across training and testing
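- No preprocessing leakage: transformers are fit only on training data (and only on the training folds during cross-validation)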
- Better readability and maintainability
- Easier experimentation and deployment

### Preprocessing with ColumnTransformer

Numerical and categorical features require different preprocessing strategies.
We apply scaling to numerical features and one-hot encoding to categorical features
using a `ColumnTransformer`.

In [15]:
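# Automatically infer feature types from column dtypes.
# Note: this also selects `id`, a row identifier rather than a predictive feature.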
numerical_features = X_train.select_dtypes(include=["int64", "float64"]).columns.tolist()
categorical_features = X_train.select_dtypes(include=["object", "category"]).columns.tolist()

print("Numerical features:", numerical_features)
print("Categorical features:", categorical_features)

Numerical features: ['id', 'age', 'study_hours', 'class_attendance', 'sleep_hours']
Categorical features: ['gender', 'course', 'internet_access', 'sleep_quality', 'study_method', 'facility_rating', 'exam_difficulty']


In [16]:
numeric_transformer = Pipeline(steps=[("scaler", StandardScaler())])

categorical_transformer = Pipeline(steps=[("encoder", OneHotEncoder(handle_unknown="ignore"))])

preprocessor = ColumnTransformer(
    transformers=[("num", numeric_transformer, numerical_features),("cat", categorical_transformer, categorical_features)])
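
The `handle_unknown="ignore"` setting used in the encoder controls what happens when a category appears at prediction time that was absent during fitting. The sketch below uses toy values chosen for illustration and shows that such a row is encoded as all zeros instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data: the encoder only ever sees two study methods during fit.
train = pd.DataFrame({"study_method": ["coaching", "self-study"]})
new = pd.DataFrame({"study_method": ["self-study", "group study"]})

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train)

# "group study" is unknown to the encoder, so its row is all zeros.
out = enc.transform(new).toarray()
print(enc.categories_)
print(out)
```

The output width stays fixed at the number of levels seen during fit, so the downstream model always receives the feature count it was trained on.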


### Cross-Validation Evaluation

We'll evaluate the pipeline using 5-fold cross-validation. RMSE is used as the primary metric for consistency with the baseline comparison.

In [17]:
pipeline_lr = Pipeline(steps=[
    ("preprocessing", preprocessor),
    ("model", LinearRegression())
])

kf = KFold(n_splits=5, shuffle=True, random_state=42)

cv_rmse = -cross_val_score(
    pipeline_lr,
    X_train,
    y_train,
    cv=kf,
    scoring="neg_root_mean_squared_error"
)

print(f"CV RMSE Mean: {cv_rmse.mean():.2f}")
print(f"CV RMSE Std: {cv_rmse.std():.2f}")

CV RMSE Mean: 8.90
CV RMSE Std: 0.00


In [18]:
pipeline_lr.fit(X_train, y_train)

pipeline_preds = pipeline_lr.predict(X_test)

pipeline_rmse = root_mean_squared_error(y_test, pipeline_preds)
pipeline_r2 = r2_score(y_test, pipeline_preds)

print(f"Pipeline RMSE: {pipeline_rmse:.2f}")
print(f"Pipeline R² Score: {pipeline_r2:.3f}")

Pipeline RMSE: 8.89
Pipeline R² Score: 0.778


### Observation (Cross-Validation)

The pipeline achieves a mean cross-validation RMSE of **8.90** with a standard deviation of **0.00** (to two decimal places), indicating highly consistent performance across all folds.

This suggests that the model’s predictions are stable and not sensitive to different train-validation splits. The held-out test RMSE of **8.89** closely matches both the cross-validation estimate and the baseline, which is consistent with an evaluation free of leakage or optimistic bias, although matching scores alone do not prove it. The primary benefit of the pipeline lies not in improved accuracy but in reliability, reproducibility, and consistency across the entire machine learning workflow.
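
One way to see why a pipeline keeps cross-validation honest is to inspect the per-fold fitted estimators. The sketch below uses synthetic stand-in data, not the exam-score dataset, with `cross_validate(..., return_estimator=True)`. It shows that each fold gets its own copy of the pipeline, with a scaler fitted only on that fold's training rows.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; coefficients and noise level are arbitrary.
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

demo_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", LinearRegression()),
])

results = cross_validate(
    demo_pipeline,
    X_demo,
    y_demo,
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    scoring="neg_root_mean_squared_error",
    return_estimator=True,  # keep the fitted pipeline from every fold
)

# One row of scaler means per fold; they differ because each fold
# fits the scaler on a different subset of rows.
fold_means = np.array(
    [est.named_steps["scaler"].mean_ for est in results["estimator"]]
)
print(fold_means)
print(-results["test_score"])  # per-fold RMSE
```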

## Approach B: Custom ML Pipeline

In this approach, we manually implement each step of the machine learning workflow, including preprocessing, feature transformation, model training, and evaluation. While this method offers flexibility and transparency, it requires careful handling to avoid common pitfalls such as data leakage, feature mismatch, and inconsistent transformations between training and inference.
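
The data-leakage pitfall mentioned above usually comes down to where `fit` is called. As a hedged illustration on synthetic data (all variable names here are hypothetical), the sketch below contrasts a scaler fitted on every row with one fitted on the training rows only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data used purely for illustration.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
X_tr, X_te = train_test_split(X_demo, test_size=0.2, random_state=42)

# Leaky: the statistics are computed from all rows, including test rows.
leaky_scaler = StandardScaler().fit(X_demo)

# Safe: the statistics come from the training rows only.
safe_scaler = StandardScaler().fit(X_tr)

print("Leaky mean:", leaky_scaler.mean_)
print("Safe mean: ", safe_scaler.mean_)

# The test set is transformed with the training statistics, never refitted.
X_te_scaled = safe_scaler.transform(X_te)
```

On a large dataset the numerical difference is often tiny, but the safe pattern costs nothing and stays correct when the data or the transformer changes.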

### Manual Feature Separation

We explicitly separate numerical and categorical features to apply custom preprocessing steps.

In [20]:
# Explicit feature separation
num_features = numerical_features
cat_features = categorical_features

In [21]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train_num = scaler.fit_transform(X_train[num_features])
X_test_num = scaler.transform(X_test[num_features])

### Manual One-Hot Encoding of Categorical Features

A `OneHotEncoder` is fitted on the training data and used to transform both training and test sets.

In [22]:
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)

X_train_cat = encoder.fit_transform(X_train[cat_features])
X_test_cat = encoder.transform(X_test[cat_features])

### Feature Concatenation

Numerical and categorical features are manually combined to form the final feature matrices.

In [23]:
X_train_final = np.hstack([X_train_num, X_train_cat])
X_test_final = np.hstack([X_test_num, X_test_cat])

print("Final train shape:", X_train_final.shape)
print("Final test shape:", X_test_final.shape)

Final train shape: (504000, 31)
Final test shape: (126000, 31)


### Model Training and Evaluation

We train the same Linear Regression model on the manually processed features and evaluate it on the test set.

In [24]:
manual_model = LinearRegression()

manual_model.fit(X_train_final, y_train)

manual_preds = manual_model.predict(X_test_final)

manual_rmse = root_mean_squared_error(y_test, manual_preds)
manual_r2 = r2_score(y_test, manual_preds)

print(f"Manual Pipeline RMSE: {manual_rmse:.2f}")
print(f"Manual Pipeline R² Score: {manual_r2:.3f}")

Manual Pipeline RMSE: 8.89
Manual Pipeline R² Score: 0.778


## Observation (Manual vs Pipeline Comparison)

All three approaches (baseline, Scikit-learn Pipeline, and custom manual pipeline) produce nearly identical performance metrics. This confirms that the model learns the same underlying relationships when preprocessing is applied correctly.

The key difference lies in *how safe and maintainable* each approach is. While the manual pipeline matches the Scikit-learn Pipeline’s performance in this controlled setting, it relies heavily on careful implementation and bookkeeping. As the workflow grows more complex, this approach becomes increasingly fragile.

Scikit-learn Pipelines deliver the same performance while guarding against common errors such as data leakage, feature mismatch, and inconsistent inference logic, because the fitted transformers and the model travel together as a single object.
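
One practical consequence for inference is that a fitted pipeline can be saved and reloaded as one artifact. The sketch below is an illustration on toy rows modeled on the columns shown earlier. It uses `joblib`, and the file name is hypothetical.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows; values are illustrative only.
toy = pd.DataFrame({
    "study_hours": [7.91, 4.95, 4.68, 2.0],
    "gender": ["female", "other", "female", "male"],
    "exam_score": [78.3, 46.7, 99.0, 63.9],
})
X_toy, y_toy = toy[["study_hours", "gender"]], toy["exam_score"]

toy_pipeline = Pipeline(steps=[
    ("preprocessing", ColumnTransformer(transformers=[
        ("num", StandardScaler(), ["study_hours"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["gender"]),
    ])),
    ("model", LinearRegression()),
])
toy_pipeline.fit(X_toy, y_toy)

# Persist preprocessing and model together, then reload them.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pipeline.joblib")  # hypothetical file name
    joblib.dump(toy_pipeline, path)
    restored = joblib.load(path)

# Raw, unprocessed rows go straight into predict().
new_rows = pd.DataFrame({"study_hours": [5.0], "gender": ["male"]})
print(restored.predict(new_rows))
```

With the manual workflow, the scaler, the encoder, the column order, and the model would each need to be saved and reapplied in the right sequence.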

## Final Comparison

| Aspect | Baseline | Sklearn Pipeline | Manual Pipeline |
|------|--------|----------------|----------------|
| Test RMSE / R² | 8.89 / 0.778 | 8.89 / 0.778 | 8.89 / 0.778 |
| Data Leakage Risk | High | Low | Medium |
| Reusability | Low | High | Medium |
| Debuggability | Low | High | Medium |
| Production Ready | ❌ | ✅ | ❌ |


## Key Takeaways

- Model performance alone is not sufficient to judge an ML system.
- Structured pipelines reduce human error without sacrificing accuracy.
- Manual pipelines require strict discipline and do not scale well.
- Of the three approaches compared here, Scikit-learn Pipelines are the safest choice for production ML workflows.