# Welcome to Pipelines

<h1 style="color:green">NOTE</h1>

#### I use this pipeline in the following GitHub repository: [My GitHub Repository](https://github.com/DevJelvehgar/house_prediction)


## 1. Introduction to Pipelines in Machine Learning
In machine learning, pipelines are a powerful way to organize and streamline the process of data preprocessing and model training. A pipeline allows you to chain together a series of steps—such as transforming data, selecting features, and training models—into a single cohesive workflow.<br>

A typical machine learning pipeline may consist of several steps:<br>

1. `Preprocessing`: Scaling, normalizing, or encoding data.

2. `Feature Selection`: Choosing the most important features that will contribute to model performance.

3. `Model Training`: Using a machine learning algorithm to fit a model to the data.

4. `Model Evaluation`: Assessing the performance of the model using validation metrics.
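
The four steps above can be sketched as a single object. This is a minimal illustration, not the housing workflow itself: it assumes synthetic data from `make_regression`, and the choice of `SelectKBest` and `LinearRegression` is only an example.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (an assumption for illustration, not the housing data)
X, y = make_regression(n_samples=200, n_features=10, n_informative=4,
                       noise=5.0, random_state=42)

workflow = make_pipeline(
    StandardScaler(),                  # 1. preprocessing
    SelectKBest(f_regression, k=4),    # 2. feature selection
    LinearRegression(),                # 3. model training
)

# 4. model evaluation: every fold refits the whole pipeline on its training part
scores = cross_val_score(workflow, X, y, cv=5, scoring="r2")
print(scores.mean())
```

Because the scaler and the selector live inside the pipeline, they are fitted only on the training portion of each fold.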

#### 1.1 Benefits of pipelines in an end-to-end ML project
- Simplifies Code and Workflow
- Improved Reproducibility
- Reduction of Data Leakage
- Hyperparameter Tuning
- Improved Code Maintenance
- Handling Multiple Steps in One Object
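
Two of these benefits, hyperparameter tuning and reduced data leakage, can be seen together when a pipeline is passed to `GridSearchCV`. The sketch below assumes synthetic data and a `Ridge` model chosen purely for illustration; the step names `scaler` and `ridge` are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = Pipeline([
    ("scaler", StandardScaler()),
    ("ridge", Ridge()),
])

# Parameters of a step are addressed as <step name>__<parameter name>
param_grid = {"ridge__alpha": [0.1, 1.0, 10.0]}

# The scaler is refitted inside every CV split, so validation rows
# never influence the scaling statistics
search = GridSearchCV(model, param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.score(X_test, y_test))
```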

## 2. Building Basic Pipelines

#### 2.0 First, import packages and data

In [1]:
# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split 

In [2]:
# Load a reduced version of the real California housing dataset
data = pd.read_csv('data/housing.csv')
housing = data.copy()

In [3]:
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)  # only train_set is used below

In [4]:
housing = train_set.drop('median_house_value', axis=1) # this is X_train
housing_label = train_set['median_house_value'].copy() # this is y_train

In [5]:
housing.head(2)

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | ocean_proximity |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 49 | -122.27 | 37.82 | 40 | 946.0 | 375.0 | 700.0 | 352.0 | 1.775 | <1H OCEAN |
| 70 | -122.29 | 37.81 | 26 | 768.0 | 152.0 | 392.0 | 127.0 | 1.7719 | NEAR BAY |


#### 2.1 Problem: the data contains `NaN` values, which must be imputed before feature scaling

In [6]:
housing.isnull().sum()

longitude             0
latitude              0
housing_median_age    0
total_rooms           3
total_bedrooms        5
population            1
households            4
median_income         1
ocean_proximity       3
dtype: int64

#### 2.2 We use pipelines to handle both steps for `Numerical` and `Categorical` features

In [7]:
# Display pipelines as diagrams
from sklearn import set_config
set_config(display='diagram')  

In [8]:
# Handle numerical features (option 1: recommended)
num_pipeline = make_pipeline(
    SimpleImputer(strategy='median'),
    StandardScaler())

##### option 2: (Not Recommended)
You need to give each step an explicit name, such as `imputer` or `scaler`.
```python
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
```
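
To see what the numerical pipeline does to a column, here is a small sketch on a toy array with one missing value. The array `X_toy` and the name `toy_pipeline` are assumptions for illustration only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# One column with a missing value (toy data, not the housing set)
X_toy = np.array([[1.0], [np.nan], [3.0]])

toy_pipeline = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
X_out = toy_pipeline.fit_transform(X_toy)

# The median of the observed values (1 and 3) is 2, so the column becomes
# [1, 2, 3] before it is standardized to zero mean and unit variance.
print(toy_pipeline.named_steps["simpleimputer"].statistics_)
print(X_out.ravel())
```

The imputer runs first, so the scaler never sees a `NaN`.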


In [9]:
num_pipeline

Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
                ('scaler', StandardScaler())])

In [29]:
# Handle categorical features (option 1: recommended)
cat_pipeline = make_pipeline(
    SimpleImputer(strategy='most_frequent'),
    OneHotEncoder(handle_unknown='ignore'))

In [34]:
# make_pipeline names the steps implicitly: 'simpleimputer' and 'onehotencoder'
cat_pipeline.named_steps

{'simpleimputer': SimpleImputer(strategy='most_frequent'),
 'onehotencoder': OneHotEncoder(handle_unknown='ignore')}
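
Step names are also the way to reach what each transformer learned during `fit`. The following sketch assumes a tiny toy column (`toy_cat`) rather than the housing data, and reads the fitted attributes `statistics_` and `categories_`.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy categorical column with one missing value (illustrative data only)
toy_cat = np.array([["INLAND"], ["NEAR BAY"], [np.nan], ["INLAND"]], dtype=object)

toy_cat_pipeline = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"),
)
toy_cat_pipeline.fit(toy_cat)

imputer = toy_cat_pipeline.named_steps["simpleimputer"]
encoder = toy_cat_pipeline["onehotencoder"]  # indexing by step name also works

print(imputer.statistics_)  # most frequent value learned for each column
print(encoder.categories_)  # categories learned for each column
```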

##### option 2: (Not Recommended)
You need to give each step an explicit name, such as `imputer` or `encoder`.
```python
cat_pipeline = Pipeline([
     ('imputer', SimpleImputer(strategy='most_frequent')),
     ('encoder', OneHotEncoder(handle_unknown='ignore'))
])
```
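
Both options pass `handle_unknown='ignore'` to the encoder. A short sketch of what that setting does, using made-up category values: a category that was not present during `fit` is encoded as a row of zeros instead of raising an error.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy categories (illustrative); "ISLAND" does not appear in the training data
train_cats = np.array([["INLAND"], ["NEAR BAY"], ["INLAND"]], dtype=object)
new_cats = np.array([["NEAR BAY"], ["ISLAND"]], dtype=object)

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_cats)

# The encoder returns a sparse matrix by default; convert it for printing
encoded = encoder.transform(new_cats).toarray()
print(encoded)
```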


In [30]:
cat_pipeline

Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='most_frequent')),
                ('onehotencoder', OneHotEncoder(handle_unknown='ignore'))])


In [12]:
# Fallback pipeline for any column not matched by the numerical or categorical selectors
defualt_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

## 3. Understanding and Using ColumnTransformer

| Feature                      | **Using `ColumnTransformer`**                                                                                                                 | **Using `make_column_transformer` and `make_column_selector`**                                                                                  |
| ---------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------- |
| **How Columns are Selected** | Columns are explicitly listed by name (e.g., `num_attribs`, `cat_attribs`)                                                                    | Columns are automatically selected based on data type (`dtype_include=np.number` for numerical and `dtype_include=object` for categorical)      |
| **Syntax**                   | You manually specify the columns in each transformation step (e.g., `"num"` and `"cat"` for numerical and categorical columns, respectively). | The selection of columns is done dynamically using `make_column_selector`. You don't need to explicitly list columns.                           |
| **Flexibility**              | More control over column selection, since you specify exactly which columns to apply each transformation to.                                  | More flexible and concise, automatically selecting columns based on their data type. You don't need to maintain lists of column names manually. |
| **Maintenance**              | Requires keeping track of column names (`num_attribs`, `cat_attribs`). If the dataset changes, you need to manually update these lists.       | Easier to maintain if column names are not fixed (e.g., if the dataset changes frequently, column types are automatically detected).            |
| **Custom Column Types**      | You define custom column subsets (e.g., `num_attribs`, `cat_attribs`) explicitly.                                                             | Automatically handles columns based on data type without needing predefined lists.                                                              |
| **Use Case**                 | Useful when you have specific column names or when the column types are mixed and need fine-tuned control.                                    | Ideal when you want a cleaner, simpler solution for handling typical column types (numeric and categorical).                                    |


In [13]:
from sklearn.compose import ColumnTransformer, make_column_selector

In [14]:
preprocessing = ColumnTransformer([
    ('numerical', num_pipeline,  make_column_selector(dtype_include=np.number)),
    ('categorical', cat_pipeline,  make_column_selector(dtype_include=object))
    ],
     remainder=defualt_pipeline)

##### option 2: (Not recommended)
The first option is shorter and cleaner, but you can also select the columns explicitly:
```python
housing_num = housing.select_dtypes(include=[np.number]).columns
housing_cat = housing.select_dtypes(include=[object]).columns

preprocessing = ColumnTransformer([
    ('numerical', num_pipeline, housing_num),
    ('categorical', cat_pipeline, housing_cat)
    ],
    remainder=defualt_pipeline)
```

In [24]:
housing_prepared = preprocessing.fit_transform(housing)

In [25]:
preprocessing.get_feature_names_out()

array(['numerical__longitude', 'numerical__latitude',
       'numerical__housing_median_age', 'numerical__total_rooms',
       'numerical__total_bedrooms', 'numerical__population',
       'numerical__households', 'numerical__median_income',
       'categorical__ocean_proximity_<1H OCEAN',
       'categorical__ocean_proximity_INLAND',
       'categorical__ocean_proximity_NEAR BAY'], dtype=object)

In [26]:
df_housing_prepared = pd.DataFrame(
    housing_prepared,
    columns= preprocessing.get_feature_names_out(),
    index=housing.index)

In [27]:
df_housing_prepared.head()

| | `numerical__longitude` | `numerical__latitude` | `numerical__housing_median_age` | `numerical__total_rooms` | `numerical__total_bedrooms` | `numerical__population` | `numerical__households` | `numerical__median_income` | `categorical__ocean_proximity_<1H OCEAN` | `categorical__ocean_proximity_INLAND` | `categorical__ocean_proximity_NEAR BAY` |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 49 | -0.107943 | 0.102321 | -0.318842 | -0.500065 | 0.083594 | -0.18023 | 0.133416 | -0.269572 | 1.0 | 0.0 | 0.0 |
| 70 | -0.146882 | 0.0817 | -1.438333 | -0.660964 | -0.906775 | -0.865386 | -0.914053 | -0.271708 | 0.0 | 0.0 | 1.0 |
| 68 | -0.166351 | 0.0817 | 0.640721 | -0.838133 | -1.097743 | -1.127881 | -1.123546 | -0.216786 | 0.0 | 0.0 | 1.0 |
| 15 | -0.088474 | 0.164184 | 0.480794 | -0.342782 | -0.324989 | -0.186903 | -0.276261 | -0.028384 | 0.0 | 1.0 | 0.0 |
| 39 | -0.088474 | 0.122942 | 0.640721 | 0.843168 | 1.593575 | 1.32578 | 1.734879 | 0.291914 | 0.0 | 0.0 | 1.0 |


## 4. Custom Transformers