# Practical Pipeline Building

Machine learning projects often require multiple steps:
- Data preprocessing (handling missing values, scaling, encoding)
- Feature selection or dimensionality reduction
- Model training and evaluation

Instead of applying these steps manually, we can chain them in a **scikit-learn `Pipeline`** for a clean and reproducible workflow.

## 1. Why Pipelines?
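- Ensure reproducibility: the whole workflow lives in one object that can be refit on new data
- Prevent data leakage: preprocessing steps are fit on the training data only, including inside each cross-validation fold
- Simplify code: a single `fit` and a single `predict` or `score` call cover every step
- Combine preprocessing and modeling in one object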
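
The leakage point above can be made concrete with a small comparison. The sketch below is illustrative only: the names `leaky_scores`, `leak_free`, and `safe_scores`, the use of `cross_val_score`, and the `max_iter=1000` setting are assumptions made for this example and are not part of the notebook's own code. It contrasts scaling the full dataset before cross-validation with scaling inside a pipeline, where the scaler is re-fit on each training fold.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Leaky variant: the scaler sees every row, including rows that
# later serve as validation folds.
X_scaled_all = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_scaled_all, y, cv=5
)

# Leak-free variant: the scaler is re-fit on the training part of each fold.
leak_free = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])
safe_scores = cross_val_score(leak_free, X, y, cv=5)

print("Scaled before CV:      ", leaky_scores.mean())
print("Scaled inside pipeline:", safe_scores.mean())
```

On a small, clean dataset such as Iris the two means are likely to be very close. The habit tends to matter more when preprocessing depends strongly on the data, for example with feature selection or imputation.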

In [None]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Load dataset
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target

# Hold out 20% of the rows for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## 2. Building a Simple Pipeline
The pipeline chains three steps:
- StandardScaler (standardization: zero mean and unit variance per feature)
- PCA (dimensionality reduction)
- Logistic Regression (classifier)
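
To show what chaining these three steps amounts to, here is a minimal manual sketch of the same sequence without a `Pipeline`. The variable names (`scaler`, `pca`, `clf`, `manual_accuracy`) are chosen for this example only. Each transformer is fit on the training data and its output is passed to the next step; the test data is only transformed.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
pca = PCA(n_components=2)
clf = LogisticRegression()

# Fitting: each transformer learns from the training data,
# then hands its transformed output to the next step.
X_train_scaled = scaler.fit_transform(X_train)
X_train_reduced = pca.fit_transform(X_train_scaled)
clf.fit(X_train_reduced, y_train)

# Scoring: the test data is transformed with the already-fitted steps,
# never used for fitting.
X_test_reduced = pca.transform(scaler.transform(X_test))
manual_accuracy = clf.score(X_test_reduced, y_test)
print("Manual accuracy:", manual_accuracy)
```

A `Pipeline` wraps this bookkeeping in one object, so the `fit_transform` versus `transform` distinction cannot be mixed up by accident.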

In [None]:
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2)),
    ('classifier', LogisticRegression())
])

# Train pipeline
pipeline.fit(X_train, y_train)

# Evaluate
accuracy = pipeline.score(X_test, y_test)
print("Pipeline accuracy:", accuracy)
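
After fitting, the individual steps remain accessible, which can help when checking what the preprocessing actually did. The sketch below is a self-contained illustration: it rebuilds an equivalent pipeline under the hypothetical name `demo_pipeline` and fits it on the full Iris data purely for inspection.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("classifier", LogisticRegression()),
])
demo_pipeline.fit(X, y)

# named_steps gives access to each fitted step by its name
fitted_pca = demo_pipeline.named_steps["pca"]
print("Explained variance ratio:", fitted_pca.explained_variance_ratio_)

# Slicing returns a sub-pipeline; here, every step except the classifier
X_reduced = demo_pipeline[:-1].transform(X)
print("Shape after scaling + PCA:", X_reduced.shape)
```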

## 3. Using GridSearch with Pipelines
- Pipelines integrate smoothly with **GridSearchCV**.
- We can tune preprocessing and model hyperparameters together.
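
To see which keys are valid in a parameter grid, the pipeline can be queried directly. In the sketch below, `demo_pipeline` and `tunable` are names made up for this example; `get_params` and `set_params` are the standard scikit-learn estimator methods. Nested parameters are addressed as the step name, two underscores, then the parameter name.

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("classifier", LogisticRegression()),
])

# Every nested parameter appears as <step name>__<parameter name>
tunable = sorted(name for name in demo_pipeline.get_params() if "__" in name)
print(tunable)

# The same keys work with set_params, which is what a grid search
# relies on when it tries each combination.
demo_pipeline.set_params(pca__n_components=3, classifier__C=10)
print(demo_pipeline.named_steps["pca"].n_components)
print(demo_pipeline.named_steps["classifier"].C)
```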

In [None]:
from sklearn.model_selection import GridSearchCV

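# Keys follow the pattern <step name>__<parameter name>
param_grid = {
    'pca__n_components': [2, 3],
    'classifier__C': [0.1, 1, 10]
}

# 3-fold cross-validation over all 2 x 3 = 6 combinations
grid = GridSearchCV(pipeline, param_grid, cv=3)
grid.fit(X_train, y_train)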

print("Best parameters:", grid.best_params_)
print("Best cross-validation score:", grid.best_score_)
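
The cross-validation score is computed on the training data only, so a natural follow-up is to check the tuned pipeline on the held-out test set. The sketch below is one way to do that and is not part of the original notebook. It is self-contained and uses the hypothetical names `search`, `test_accuracy`, and `results`. With the default `refit=True`, `best_estimator_` is the best configuration retrained on all of the training data.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA()),
    ("classifier", LogisticRegression()),
])
search = GridSearchCV(
    demo_pipeline,
    {"pca__n_components": [2, 3], "classifier__C": [0.1, 1, 10]},
    cv=3,
)
search.fit(X_train, y_train)

# Score the refit best pipeline on data the search never saw
test_accuracy = search.best_estimator_.score(X_test, y_test)
print("Held-out test accuracy:", test_accuracy)

# cv_results_ holds one row per parameter combination
results = pd.DataFrame(search.cv_results_)[
    ["params", "mean_test_score", "rank_test_score"]
]
print(results.sort_values("rank_test_score"))
```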

## ✅ Summary
- Pipelines combine preprocessing and modeling steps in a single estimator.
- They help prevent data leakage by fitting preprocessing on the training data only.
- They integrate with **GridSearchCV**, so preprocessing and model hyperparameters can be tuned together.
- They are a good practice for real-world ML projects!