# 🚀 Final Preprocessing Code using Scikit-Learn Pipelines 🛠️

In this section, we'll consolidate everything we've done so far into one final script using **Scikit-Learn Pipelines**. This includes:

* 📊 **Creating a stratified test set**
* 🔄 **Handling missing values**
* 🔢 **Encoding categorical variables**
* ⚖️ **Scaling numerical features**
* 🔗 **Combining everything using Pipeline and ColumnTransformer**

This keeps the preprocessing code **clean**, **modular**, and **reproducible**, which suits both production and teaching! 😎

---

## 📝 Code Walkthrough

```python
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
# from sklearn.preprocessing import OrdinalEncoder  # Uncomment if you prefer ordinal
```

### 1. 📥 **Load the Data**

First, let's load our dataset:

```python
housing = pd.read_csv("housing.csv")
```

### 2. 🎯 **Create a Stratified Test Set based on Income Category**

We want the train and test sets to contain each income category in the same proportion as the full dataset. To do that, we first bin `median_income` into five categories:

```python
housing["income_cat"] = pd.cut(
    housing["median_income"],
    bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
    labels=[1, 2, 3, 4, 5]
)
```
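To see how the binning behaves, here is a small sketch on a handful of made-up `median_income` values (the numbers are hypothetical, not taken from `housing.csv`). It uses the same `bins` and `labels` as above:

```python
import numpy as np
import pandas as pd

# Hypothetical median_income values, for illustration only
incomes = pd.Series([0.5, 1.5, 2.0, 4.5, 5.0, 7.2, 15.0])

income_cat = pd.cut(
    incomes,
    bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
    labels=[1, 2, 3, 4, 5],
)

# Bins are right-inclusive by default: 1.5 falls in category 1, 4.5 in category 3
print(income_cat.tolist())
print(income_cat.value_counts().sort_index())
```

Everything above 6 ends up in category 5 because the last bin edge is `np.inf`.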

Next, we'll use **StratifiedShuffleSplit** to create a split based on the income category:

```python
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index].drop("income_cat", axis=1)
    strat_test_set = housing.loc[test_index].drop("income_cat", axis=1)
```
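One way to check that the split really is stratified is to compare category proportions before and after splitting. The sketch below does this on synthetic data (the log-normal incomes and the `toy` frame are assumptions made for illustration), so it runs without `housing.csv`:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in for the housing data (hypothetical values)
rng = np.random.default_rng(42)
toy = pd.DataFrame({"median_income": rng.lognormal(mean=1.2, sigma=0.5, size=1000)})
cats = pd.cut(
    toy["median_income"],
    bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
    labels=[1, 2, 3, 4, 5],
).astype(int)

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_index, test_index = next(split.split(toy, cats))

# Category proportions in the full data versus the test set
comparison = pd.DataFrame({
    "overall": cats.value_counts(normalize=True),
    "test": cats.iloc[test_index].value_counts(normalize=True),
}).sort_index()
print(comparison)
```

The two columns should be nearly identical. Note that `split()` yields positional indices, so `.iloc` is the safe accessor; `.loc` gives the same rows only when the frame has a default `RangeIndex`, as it does right after `read_csv`.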

### 3. 🧹 **Separate Predictors and Labels**

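Now, we'll separate the labels from the features. We work on the stratified **training set** only, so the test set stays untouched until final evaluation:

```python
# Use strat_train_set, not the full DataFrame, to avoid leaking test data
housing_labels = strat_train_set["median_house_value"].copy()
housing = strat_train_set.drop("median_house_value", axis=1)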
```

### 4. 🔠 **Separate Numerical and Categorical Columns**

We’ll divide the columns into **numerical** and **categorical**. Here, every column except `ocean_proximity` is numerical:

```python
num_attribs = housing.drop("ocean_proximity", axis=1).columns.tolist()
cat_attribs = ["ocean_proximity"]
```
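If you would rather not hard-code the column names, one possible alternative is to select columns by dtype with `select_dtypes`. This is a sketch on a toy frame (the values are made up; only the column names mirror the housing data):

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values
toy = pd.DataFrame({
    "median_income": [2.5, 4.1, 8.3],
    "total_rooms": [880, 7099, 1467],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY"],
})

# Numeric columns go to the numerical pipeline, everything else to the categorical one
num_attribs = toy.select_dtypes(include=np.number).columns.tolist()
cat_attribs = toy.select_dtypes(exclude=np.number).columns.tolist()

print(num_attribs)
print(cat_attribs)
```

This only works as intended if the dtypes are correct, so it is worth checking `df.dtypes` first.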

### 5. 🛠️ **Set Up Pipelines**

#### ➡️ **Numerical Pipeline**

We’ll handle missing values using a **SimpleImputer** (with median strategy) and scale the numerical features:

```python
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # Handle missing data
    ("scaler", StandardScaler()),  # Scale numerical features
])
```
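To make the two steps concrete, here is a minimal sketch that runs the same pipeline on a single made-up column containing one missing value:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# One column with a missing value (hypothetical numbers)
X = np.array([[1.0], [2.0], [np.nan], [6.0]])
X_prepared = num_pipeline.fit_transform(X)

# The median of the observed values (1, 2, 6) is 2, so the NaN is filled with 2
print(num_pipeline.named_steps["imputer"].statistics_)
# After scaling, the column has mean ~0 and standard deviation ~1
print(X_prepared.mean(axis=0), X_prepared.std(axis=0))
```

The learned median is stored in `statistics_`, so the same value is reused whenever the fitted pipeline transforms new data.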

#### ➡️ **Categorical Pipeline**

For categorical variables, we’ll use **OneHotEncoder**:

```python
cat_pipeline = Pipeline([
    # ("ordinal", OrdinalEncoder())  # Use this if you prefer ordinal encoding
    ("onehot", OneHotEncoder(handle_unknown="ignore"))  # OneHot encoding
])
```
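The `handle_unknown="ignore"` option matters when new data contains a category the encoder never saw during fitting. A small sketch with made-up rows shows the effect:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical rows; "ISLAND" does not appear in the training frame
train = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND"]})
new = pd.DataFrame({"ocean_proximity": ["NEAR BAY", "ISLAND"]})

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# The encoder returns a SciPy sparse matrix by default; toarray() makes it printable
encoded = encoder.transform(new).toarray()
print(encoder.categories_)
print(encoded)
```

The unseen `"ISLAND"` row is encoded as all zeros instead of raising an error, which is the default behavior (`handle_unknown="error"`).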

### 6. 🔗 **Full Pipeline with ColumnTransformer**

Combine the pipelines into a **ColumnTransformer** to process numerical and categorical features separately:

```python
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", cat_pipeline, cat_attribs),
])
```

### 7. 🔄 **Transform the Data**

Now, we fit the pipeline on the training features and transform them in one step:

```python
housing_prepared = full_pipeline.fit_transform(housing)
```

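The result, `housing_prepared`, is a **NumPy array** that's now ready for training a model! (`ColumnTransformer` can return a sparse matrix instead when the combined output is mostly zeros; here the dense numerical columns dominate, so the output is dense.)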

```python
# housing_prepared is now a NumPy array ready for training
print(housing_prepared.shape)
```
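When it is time to evaluate on the test set, the usual practice is to call `transform()` (not `fit_transform()`) so that the medians, scaling statistics, and categories learned from the training data are reused. The sketch below illustrates this on toy frames; `train` and `test` are hypothetical stand-ins for the features of `strat_train_set` and `strat_test_set`:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical train/test features
train = pd.DataFrame({
    "median_income": [1.0, 2.0, 3.0],
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND"],
})
test = pd.DataFrame({
    "median_income": [2.0, 5.0],
    "ocean_proximity": ["INLAND", "NEAR BAY"],
})

full_pipeline = ColumnTransformer([
    ("num", Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]), ["median_income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["ocean_proximity"]),
])

train_prepared = full_pipeline.fit_transform(train)  # learn statistics from train only
test_prepared = full_pipeline.transform(test)        # reuse them on test

# The training mean of median_income is 2.0, so a test value of 2.0 scales to 0
print(test_prepared)
```

Fitting on the test set would let information from it leak into preprocessing, which tends to make evaluation results look better than they really are.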