
# 🧠 Complete Guide to Pipelines and GridSearchCV in Scikit-Learn

---
## 1. Introduction

In **Machine Learning Engineering**, you rarely just train a model; you build systems that combine preprocessing, feature transformation, and modeling.  
This notebook shows how **Pipelines** and **GridSearchCV** help automate and optimize these workflows in Scikit-Learn, while teaching concepts that generalize to other ML frameworks.

---


**Pipeline**: a Scikit-Learn tool that chains preprocessing and modeling steps so they are fitted and applied as a single estimator (e.g., scaling → model).

**GridSearchCV**: a tool that exhaustively evaluates every combination of hyperparameter values in a grid (like `C`, `gamma`, etc.) using cross-validation and reports the best-scoring settings.

How they work together:
1. Use Pipeline to streamline preprocessing and modeling.
2. Integrate GridSearchCV with the pipeline to optimize hyperparameters.
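
As a minimal sketch of the scaling → model chain described above, the following example builds a two-step pipeline and cross-validates it. The synthetic dataset from `make_classification` and the names `X_demo`, `y_demo`, and `demo_pipe` are assumptions made for illustration; they are not part of the Titanic example in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data stands in for a real dataset (illustrative assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Two named steps: scale the features, then fit the classifier.
demo_pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", SVC(kernel="rbf", C=1)),
])

# The pipeline behaves like a single estimator, so it can be cross-validated
# directly. The scaler is refit on the training portion of every fold.
scores = cross_val_score(demo_pipe, X_demo, y_demo, cv=5)
print("Fold accuracies:", scores)
```

Keeping the scaler inside the pipeline means its statistics are never computed from the held-out fold, which is one common reason to prefer a pipeline over scaling the full dataset up front.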


In [None]:

from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load Titanic dataset
titanic = fetch_openml("titanic", version=1, as_frame=True)
X = titanic.data[["pclass", "sex", "age", "fare"]]
y = titanic.target

# Define preprocessing: numeric columns are imputed and scaled,
# categorical columns are imputed and one-hot encoded
numeric_features = ["age", "fare"]
categorical_features = ["pclass", "sex"]

numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),  # Fill missing numeric values with the mean
    ("scaler", StandardScaler())
])
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),  # Fill missing categorical values with the most frequent value
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)

# Combine into pipeline
clf = Pipeline(steps=[("preprocessor", preprocessor),
                      ("classifier", SVC(kernel="rbf", C=1))])

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit and evaluate
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))



---
## 2. Optimizing Pipelines with GridSearchCV

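Once your pipeline is built, you can tune hyperparameters **inside** it using `GridSearchCV`. Each key in the parameter grid has the form `<step name>__<parameter>` (two underscores), so `classifier__C` refers to the `C` parameter of the step named `classifier`.

`GridSearchCV` fits the whole pipeline on every combination in the grid using cross-validation, so the preprocessing is refit on each training fold. By default it then refits the best combination on the full training set.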
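
The parameter names that a grid may reference can be listed from the pipeline itself. The sketch below assumes a small stand-in pipeline (`toy_pipe`, a hypothetical name) with the same step names as the one in this notebook, and uses `get_params()` and `set_params()` to show the naming scheme.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# A stand-in pipeline with a "scaler" step and a "classifier" step.
toy_pipe = Pipeline(steps=[("scaler", StandardScaler()), ("classifier", SVC())])

# get_params() exposes nested parameters as "<step name>__<parameter>".
svc_params = sorted(
    name for name in toy_pipe.get_params() if name.startswith("classifier__")
)
print(svc_params)

# set_params() accepts the same names, which is how a grid search
# applies each candidate combination to the pipeline.
toy_pipe.set_params(classifier__C=10, scaler__with_mean=False)
print("C is now:", toy_pipe.named_steps["classifier"].C)
```

A misspelled step or parameter name in a grid raises an error at fit time, so printing the available names first can save a debugging round trip.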


In [None]:

from sklearn.model_selection import GridSearchCV

param_grid = {
    "classifier__C": [0.1, 1, 10],
    "classifier__kernel": ["linear", "rbf"]
}

grid_search = GridSearchCV(clf, param_grid, cv=5, scoring="accuracy", n_jobs=-1)
grid_search.fit(X_train, y_train)

print("Best parameters:", grid_search.best_params_)
print("Best cross-validation accuracy:", grid_search.best_score_)

y_pred = grid_search.predict(X_test)
print("Test Accuracy:", accuracy_score(y_test, y_pred))
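
Beyond the single best combination, the fitted search object keeps per-candidate scores in `cv_results_`. The sketch below loads them into a pandas DataFrame. It is self-contained, so it runs its own small search on synthetic data; `X_demo`, `y_demo`, `demo_pipe`, `demo_grid`, and `demo_search` are illustrative names, not objects from the cells above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data and a small pipeline (illustrative assumptions).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
demo_pipe = Pipeline(steps=[("scaler", StandardScaler()), ("classifier", SVC())])
demo_grid = {
    "classifier__C": [0.1, 1, 10],
    "classifier__kernel": ["linear", "rbf"],
}

demo_search = GridSearchCV(demo_pipe, demo_grid, cv=5, scoring="accuracy")
demo_search.fit(X_demo, y_demo)

# One row per parameter combination: 3 values of C x 2 kernels = 6 rows.
results = pd.DataFrame(demo_search.cv_results_)
columns = [
    "param_classifier__C",
    "param_classifier__kernel",
    "mean_test_score",
    "std_test_score",
    "rank_test_score",
]
print(results[columns].sort_values("rank_test_score"))
```

Comparing `mean_test_score` against `std_test_score` helps judge whether the top-ranked candidate is meaningfully better than the runners-up or within fold-to-fold noise.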



---
## 3. Key ML Engineering Takeaways

| Concept | Purpose | Engineering Insight |
|----------|----------|---------------------|
| **Pipeline** | Combines preprocessing + model steps | Like a mini DAG for ML workflows |
| **GridSearchCV** | Automates hyperparameter tuning | Similar to parameter sweeps in any framework |
| **Serialization** | Pipelines can be saved & deployed | e.g., `joblib.dump(pipeline, 'model.joblib')` |

These ideas extend beyond Scikit-Learn: pipeline abstractions also exist in TensorFlow Extended, Spark ML, and managed ML platforms like Vertex AI and SageMaker.
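
The serialization row in the table above can be sketched as a save-and-reload round trip. This example assumes a small pipeline fitted on synthetic data and writes to a temporary directory; the names `demo_pipe`, `model_path`, and `restored` are illustrative.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Fit a small pipeline on synthetic data (illustrative assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
demo_pipe = Pipeline(steps=[("scaler", StandardScaler()), ("classifier", SVC())])
demo_pipe.fit(X_demo, y_demo)

# Save the fitted pipeline (preprocessing and model together), then reload it.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.joblib")
    joblib.dump(demo_pipe, model_path)
    restored = joblib.load(model_path)

# The restored pipeline should reproduce the original predictions.
same_predictions = np.array_equal(demo_pipe.predict(X_demo), restored.predict(X_demo))
print("Predictions match after reload:", same_predictions)
```

Because joblib files are pickle-based, they should only be loaded from trusted sources, and it is safest to load them with the same Scikit-Learn version that produced them.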

---
## ✅ Summary

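- Pipelines bundle preprocessing and modeling into one estimator that is fitted, evaluated, and saved as a unit.  
- GridSearchCV finds the best-scoring configuration within the grid you supply, as measured by cross-validation.  
- Both concepts generalize across ML frameworks and platforms.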

---
