# Streamlining Workflows with Scikit-Learn Pipelines 🚀

Most machine learning projects involve a sequence of steps: loading data, splitting it, preprocessing (like feature scaling), and finally, training a model. Managing these steps manually can be tedious and, more importantly, can lead to common mistakes like **data leakage**.

### The Problem with Manual Steps

A frequent error is to apply a preprocessing step, like scaling, to the entire dataset *before* splitting it into training and testing sets. This causes the scaler to learn from the test data, "leaking" information into the training process and leading to an overly optimistic evaluation of the model's performance. The correct method is to fit the scaler *only* on the training data and then use it to transform both the training and test sets.
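The difference between the two approaches can be sketched on a small synthetic array. This is only an illustration: the random data and the names `X_demo`, `leaky_scaler`, and `clean_scaler` are assumptions made for this example and are not part of the Raisin workflow used later.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
X_tr, X_te = train_test_split(X_demo, test_size=0.2, random_state=0)

# Leaky: the scaler's statistics are computed from every row, test rows included
leaky_scaler = StandardScaler().fit(X_demo)

# Correct: the scaler's statistics are computed from the training rows only
clean_scaler = StandardScaler().fit(X_tr)

# The same training-only scaler transforms both splits
X_tr_scaled = clean_scaler.transform(X_tr)
X_te_scaled = clean_scaler.transform(X_te)

print("Means learned from all rows:     ", leaky_scaler.mean_)
print("Means learned from training rows:", clean_scaler.mean_)
```

The two sets of means differ because the leaky scaler has already seen the rows that are supposed to be held out.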

### The Solution: `Pipeline`

A **`scikit-learn` Pipeline** solves this problem by chaining multiple steps together into a single "meta-estimator". It bundles preprocessing and modeling into one object, so the steps always run in the same order and every transformer is fitted only on the data passed to `fit`. As long as the pipeline is fitted on the training set, the test set cannot leak into the preprocessing.

This notebook will first demonstrate the manual process of scaling and training an SVM and then show how to achieve the same result more efficiently and safely with a `Pipeline`.

---

## 1. The Manual Workflow: Scaling and Training Separately

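We will use the Raisin dataset. The features have very different scales (in the sample below, `Area` is in the tens of thousands or more, while `Extent` is below 1), so scaling is an important step before training an SVM, whose RBF kernel depends on distances between samples.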

### Step 1.1: Data Loading and Splitting

In [1]:
import pandas as pd  # Reading .xlsx files with pandas requires the openpyxl package

df = pd.read_excel('Raisin_Dataset.xlsx')
df.sample(5)

|     | Area   | MajorAxisLength | MinorAxisLength | Eccentricity | ConvexArea | Extent   | Perimeter | Class   |
|-----|--------|-----------------|-----------------|--------------|------------|----------|-----------|---------|
| 51  | 114648 | 508.128933      | 288.953981      | 0.822571     | 118314     | 0.681905 | 1340.897  | Kecimen |
| 292 | 72219  | 376.650492      | 249.529454      | 0.749065     | 74373      | 0.777795 | 1050.221  | Kecimen |
| 54  | 111450 | 478.310971      | 298.630592      | 0.78115      | 113256     | 0.690093 | 1298.188  | Kecimen |
| 765 | 121080 | 573.403612      | 270.632507      | 0.881612     | 124432     | 0.72379  | 1418.385  | Besni   |
| 192 | 37569  | 232.427848      | 208.152006      | 0.44495      | 38874      | 0.794371 | 734.102   | Kecimen |


In [4]:
X = df[['Area', 'MajorAxisLength', 'MinorAxisLength', 'Eccentricity', 'ConvexArea', 'Extent', 'Perimeter']]
y = df['Class']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)

### Step 1.2: Manual Scaling and Model Training

Here, we manually apply `StandardScaler` and then train an SVM model on the scaled data.


In [11]:
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Manually scale the data
scaler = StandardScaler()
scaler.fit(X_train)  # Fit on the training data only so the test set does not leak into the scaler
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train the model on the scaled data
model = SVC(kernel='rbf')
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

       Besni       0.91      0.83      0.87        83
     Kecimen       0.87      0.93      0.90        97

    accuracy                           0.88       180
   macro avg       0.89      0.88      0.88       180
weighted avg       0.88      0.88      0.88       180



The model achieves an accuracy of **88%**. This process works, but it's cumbersome and requires us to manage the scaled and unscaled data separately.

## 2. The Efficient Workflow: Building a Pipeline

Now, let's achieve the same result using a `Pipeline`. We define a series of steps: first, scale the data, and second, apply the SVM classifier.


In [9]:
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC(kernel='rbf'))
])

### Using the Pipeline

The pipeline object now acts as our model. We can fit it directly on the **original, unscaled training data**. The pipeline will automatically handle the scaling process correctly.


In [12]:
# Fit the entire pipeline on the original (unscaled) training data
pipeline.fit(X_train, y_train)

# Make predictions on the original (unscaled) test data
y_pred = pipeline.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

       Besni       0.91      0.83      0.87        83
     Kecimen       0.87      0.93      0.90        97

    accuracy                           0.88       180
   macro avg       0.89      0.88      0.88       180
weighted avg       0.88      0.88      0.88       180



As you can see, the result is identical to the manual process, but the code is much cleaner and safer.

**What happens under the hood?**
* When we call `pipeline.fit(X_train, y_train)`, the pipeline first calls `fit_transform` on the `StandardScaler` using `X_train`, then passes the transformed data to the `SVC` model for fitting.
* When we call `pipeline.predict(X_test)`, it automatically calls `transform` on the `StandardScaler` using `X_test` and then passes the scaled data to the `SVC`'s `predict` method.
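The two bullets above can be checked by spelling out the same calls by hand and comparing the predictions. This is a minimal sketch on synthetic data from `make_classification`, used here only so the snippet runs without the Excel file; the names `X_syn`, `pipe`, `manual_scaler`, and `manual_svm` are assumptions for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the Raisin features (an assumption for illustration)
X_syn, y_syn = make_classification(n_samples=300, n_features=7, random_state=10)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=10
)

# Pipeline version: one fit call, one predict call
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC(kernel='rbf'))
])
pipe.fit(Xs_train, ys_train)
pipe_pred = pipe.predict(Xs_test)

# Hand-written equivalent of what the pipeline does internally
manual_scaler = StandardScaler()
manual_svm = SVC(kernel='rbf')
manual_svm.fit(manual_scaler.fit_transform(Xs_train), ys_train)    # fit: fit_transform, then fit
manual_pred = manual_svm.predict(manual_scaler.transform(Xs_test)) # predict: transform, then predict

print("Same predictions:", np.array_equal(pipe_pred, manual_pred))

# Fitted steps stay accessible by name, e.g. the means the scaler learned
print(pipe.named_steps['scaler'].mean_)
```

Because the fitted steps remain available through `named_steps`, bundling them into a pipeline does not hide the learned scaling parameters.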


## 3. Why Use a Pipeline?

* **Prevents Data Leakage:** When the pipeline is fitted on the training data, its transformers (like `StandardScaler`) learn their parameters from that data only, so the test set cannot influence the model.
* **Simplicity:** It simplifies the code by consolidating multiple steps into a single object. You only need to call `.fit()` and `.predict()` once.
* **Reproducibility:** The entire workflow is captured in one object, making it easy to save, load, and reuse, ensuring consistent results.
* **Grid Search:** Pipelines work well with hyperparameter tuning. Tools like `GridSearchCV` can tune parameters of the preprocessing steps and the final model at the same time, and because the whole pipeline is refitted on each training fold, the validation folds never leak into the scaler.
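The last two points can be sketched together: a grid search over the pipeline, followed by saving and reloading the best pipeline. This is a hedged example rather than a tuned setup. The synthetic data, the grid values, and the file name `raisin_pipeline.joblib` are assumptions chosen for illustration. Step parameters are addressed with the `<step name>__<parameter>` convention, and a whole step can be swapped by listing alternative estimators under its name.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the Raisin features (an assumption for illustration)
X_syn, y_syn = make_classification(n_samples=300, n_features=7, random_state=10)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=10
)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC(kernel='rbf'))
])

# Illustrative grid: try two scalers and a few SVM settings
param_grid = {
    'scaler': [StandardScaler(), MinMaxScaler()],
    'svm__C': [0.1, 1, 10],
    'svm__gamma': ['scale', 0.1],
}

# Each cross-validation split refits the whole pipeline on its training folds
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(Xs_train, ys_train)

print("Best parameters:", search.best_params_)
print("Test accuracy:", search.score(Xs_test, ys_test))

# Persist the best pipeline (scaler and SVM together) and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'raisin_pipeline.joblib')  # hypothetical file name
    joblib.dump(search.best_estimator_, model_path)
    reloaded = joblib.load(model_path)

reloaded_pred = reloaded.predict(Xs_test)
```

Saving the pipeline rather than the bare `SVC` keeps the fitted scaler with the model, so new data can be passed in unscaled after loading.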