# 🚀 Constructing Pipelines in Scikit-Learn

As datasets grow more complex, data preprocessing often involves multiple steps, such as imputing missing values, scaling features, and encoding categorical variables. These steps must be applied in the correct order and consistently across training, validation, test, and future production data.

To streamline this process, Scikit-Learn provides the **`Pipeline`** class, a utility that chains data transformations into a single object. ✨

---

## 🛠️ Building a Numerical Pipeline

A typical pipeline for numerical attributes might include:

* **Imputation** of missing values (e.g., with median)
* **Feature scaling** (e.g., with standardization)

```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # Step 1: Impute missing values
    ("standardize", StandardScaler()),            # Step 2: Standardize features
])
```

---

### 🔍 How It Works

* The pipeline takes a **list of steps** as `(name, transformer)` pairs.
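* **Names must be unique** and **must not contain double underscores** `__`, because `__` is reserved for addressing a step's parameters (e.g., `impute__strategy`).
* All intermediate steps must be **transformers** (i.e., they must implement `fit()` and `transform()`, and therefore support `fit_transform()`).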
* The **final step** can be either a transformer or a predictor.
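
As a minimal sketch of why step names matter, the snippet below rebuilds the same two-step pipeline and uses the names to look up a step and to change one of its parameters. The choice to switch the strategy to `"mean"` is only for illustration.

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("standardize", StandardScaler()),
])

# Each step is stored as a (name, estimator) pair and can be looked up by name.
step_names = [name for name, _ in num_pipeline.steps]
imputer = num_pipeline.named_steps["impute"]

# "<step name>__<parameter>" addresses a parameter of one step, which is
# why "__" is reserved and cannot be part of a step name.
num_pipeline.set_params(impute__strategy="mean")
new_strategy = num_pipeline.named_steps["impute"].strategy

print(step_names)    # ['impute', 'standardize']
print(new_strategy)  # mean
```

The same `step__parameter` syntax is what hyperparameter search tools use to tune parameters inside a pipeline.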

---

## ⚡ Using `make_pipeline`

If you don't want to manually name the steps, you can use **`make_pipeline()`**:

```python
from sklearn.pipeline import make_pipeline

num_pipeline = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
```

This automatically names the steps using the class names in lowercase. If the same class appears multiple times, a hyphen and an index are appended to each occurrence (e.g., `standardscaler-1`, `standardscaler-2`).
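
The naming rule can be checked directly by inspecting the `steps` attribute. The second pipeline below repeats `StandardScaler` only to show the numbering; it is a contrived example, not a recommended design.

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Distinct classes: each step is named after its lowercased class name.
auto_named = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
auto_names = [name for name, _ in auto_named.steps]
print(auto_names)

# Repeated class (contrived): the names get a numeric suffix.
repeated = make_pipeline(StandardScaler(), StandardScaler())
repeated_names = [name for name, _ in repeated.steps]
print(repeated_names)
```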

---

## 🔧 Applying the Pipeline

To apply all transformations in sequence, call `fit_transform()`:

```python
housing_num_prepared = num_pipeline.fit_transform(housing_num)
housing_num_prepared[:2].round(2)  # in a notebook, the cell's last expression is displayed
```

### Example Output:

```plaintext
array([[-1.42,  1.01,  1.86,  0.31,  1.37,  0.14,  1.39, -0.94],
       [ 0.60, -0.70,  0.91, -0.31, -0.44, -0.69, -0.37,  1.17]])
```

* Each row corresponds to a **transformed sample**.
* Each column corresponds to a **scaled feature**.
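
To apply the same preprocessing consistently to data the pipeline has not seen, the usual pattern is to fit once on the training data and then call `transform()` on everything else. The sketch below uses small hypothetical arrays (`X_train`, `X_new`) in place of the housing data.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two numerical features with missing values.
X_train = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [np.nan, 40.0]])
X_new = np.array([[np.nan, 20.0], [4.0, np.nan]])

pipeline = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

# fit_transform() learns the medians, means, and standard deviations from
# the training data and returns the transformed training data.
X_train_prepared = pipeline.fit_transform(X_train)

# transform() reuses those learned statistics; nothing is re-fitted.
X_new_prepared = pipeline.transform(X_new)

# Medians learned from the training data only.
print(pipeline.named_steps["simpleimputer"].statistics_)
print(X_new_prepared.round(2))
```

Calling `fit_transform()` on the new data instead would recompute the statistics from that data, which is exactly the inconsistency a pipeline is meant to prevent.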

---

## 🔄 Retrieving Feature Names

To turn the result back into a DataFrame with feature names:

```python
import pandas as pd

df_housing_num_prepared = pd.DataFrame(
    housing_num_prepared,
    columns=num_pipeline.get_feature_names_out(),
    index=housing_num.index
)
```

---

## 🔄 Pipeline as a Transformer or Predictor

* If the **last step** is a transformer, the pipeline behaves like a **transformer** (`fit()`, `transform()`, `fit_transform()`).
* If the **last step** is a **predictor** (e.g., a model), the pipeline behaves like a **predictor** (`fit()`, `predict()`).
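
A minimal sketch of the predictor case: the pipeline below ends with `LinearRegression`, chosen here only as an example model, and is trained on hypothetical toy data where the target is twice the feature.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: y = 2 * x, with one missing feature value.
X = np.array([[1.0], [2.0], [np.nan], [4.0], [5.0]])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

model = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LinearRegression(),  # final step is a predictor
)

# fit() runs fit_transform() on each preprocessing step, then fits the model.
model.fit(X, y)

# predict() runs transform() on each preprocessing step, then predicts.
predictions = model.predict(np.array([[3.0], [np.nan]]))
print(predictions.round(2))
```

The missing value is imputed with the training median (3.0) both during fitting and at prediction time, so the two inputs above receive the same prediction.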

This flexibility makes **`Pipeline`** the standard way to handle data preprocessing and modeling in Scikit-Learn projects. 🏆

In [2]:
import pandas as pd
import numpy as np

In [3]:
data = pd.read_csv("xlsx/California Housing Prices/housing.csv")

In [4]:
data["income_cat"] = pd.cut(data["median_income"], bins=[0, 1.5, 3.0, 4.5, 6.0, np.inf], labels=[1, 2, 3, 4, 5])

In [5]:
from sklearn.model_selection import StratifiedShuffleSplit

In [6]:
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

In [7]:
for train_index, test_index in split.split(data, data["income_cat"]):
    strat_train_set = data.loc[train_index]
    strat_test_set = data.loc[test_index]

In [8]:
# Remove the income_cat column
for sett in (strat_train_set, strat_test_set):
    sett.drop("income_cat", axis=1, inplace=True)

In [9]:
df = strat_train_set.copy()

In [10]:
df.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 12655 | -121.46 | 38.52 | 29.0 | 3873.0 | 797.0 | 2237.0 | 706.0 | 2.1736 | 72100.0 | INLAND |
| 15502 | -117.23 | 33.09 | 7.0 | 5320.0 | 855.0 | 2015.0 | 768.0 | 6.3373 | 279600.0 | NEAR OCEAN |
| 2908 | -119.04 | 35.37 | 44.0 | 1618.0 | 310.0 | 667.0 | 300.0 | 2.875 | 82700.0 | INLAND |
| 14053 | -117.13 | 32.75 | 24.0 | 1877.0 | 519.0 | 898.0 | 483.0 | 2.2264 | 112500.0 | NEAR OCEAN |
| 20496 | -118.7 | 34.28 | 27.0 | 3536.0 | 646.0 | 1837.0 | 580.0 | 4.4964 | 238300.0 | <1H OCEAN |


In [11]:
housing = strat_train_set.drop("median_house_value", axis=1)
housing_labels = strat_train_set["median_house_value"].copy()

In [12]:
housing = housing.drop("ocean_proximity", axis=1)

In [13]:
housing.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income |
|---|---|---|---|---|---|---|---|---|
| 12655 | -121.46 | 38.52 | 29.0 | 3873.0 | 797.0 | 2237.0 | 706.0 | 2.1736 |
| 15502 | -117.23 | 33.09 | 7.0 | 5320.0 | 855.0 | 2015.0 | 768.0 | 6.3373 |
| 2908 | -119.04 | 35.37 | 44.0 | 1618.0 | 310.0 | 667.0 | 300.0 | 2.875 |
| 14053 | -117.13 | 32.75 | 24.0 | 1877.0 | 519.0 | 898.0 | 483.0 | 2.2264 |
| 20496 | -118.7 | 34.28 | 27.0 | 3536.0 | 646.0 | 1837.0 | 580.0 | 4.4964 |


In [14]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

In [15]:
my_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("standardize", StandardScaler()),
])

In [16]:
my_pipeline.fit_transform(housing)

array([[-0.94135046,  1.34743822,  0.02756357, ...,  0.73260236,
         0.55628602, -0.8936472 ],
       [ 1.17178212, -1.19243966, -1.72201763, ...,  0.53361152,
         0.72131799,  1.292168  ],
       [ 0.26758118, -0.1259716 ,  1.22045984, ..., -0.67467519,
        -0.52440722, -0.52543365],
       ...,
       [-1.5707942 ,  1.31001828,  1.53856552, ..., -0.86201341,
        -0.86511838, -0.36547546],
       [-1.56080303,  1.2492109 , -1.1653327 , ..., -0.18974707,
         0.01061579,  0.16826095],
       [-1.28105026,  2.02567448, -0.13148926, ..., -0.71232211,
        -0.79857323, -0.390569  ]])