Pipelines help automate and simplify the process of data preprocessing and model training.

Instead of manually transforming data in multiple steps, we combine them into one structured pipeline.

Using Grid Search and Random Search, we efficiently find the best hyperparameters for our model.

The final evaluation is based on Root Mean Squared Error (RMSE), where lower values indicate a better model.
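
As a quick illustration of the metric, RMSE is the square root of the mean of the squared differences between true and predicted values. The sketch below uses made-up numbers (not values from the housing dataset) and a hypothetical helper named `rmse`; it is only meant to show why predictions that sit closer to the targets give a lower score.

```python
import math

def rmse(y_true, y_pred):
    """Root Mean Squared Error for two equal-length sequences of numbers."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up target values, for illustration only
y_true = [100.0, 200.0, 300.0]
close_preds = [110.0, 190.0, 310.0]  # every prediction is off by 10
far_preds = [150.0, 150.0, 350.0]    # every prediction is off by 50

print(rmse(y_true, close_preds))  # 10.0
print(rmse(y_true, far_preds))    # 50.0
```

RMSE is expressed in the same units as the target, so for this notebook it can be read as a typical error in `median_house_value`.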

In [1]:
# Load the housing dataset
from pathlib import Path
import pandas as pd
import tarfile
import urllib.request

def load_housing_data():
    tarball_path = Path("datasets/housing.tgz")
    if not tarball_path.is_file():
        Path("datasets").mkdir(parents=True, exist_ok=True)
        url = "https://github.com/ageron/data/raw/main/housing.tgz"
        urllib.request.urlretrieve(url, tarball_path)
        with tarfile.open(tarball_path) as housing_tarball:
            housing_tarball.extractall(path="datasets")
    return pd.read_csv(Path("datasets/housing/housing.csv"))

housing = load_housing_data()

In [2]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error
import warnings
warnings.filterwarnings("ignore", category=UserWarning)


data = housing

print(data.info())

# Splitting data into features (X) and target (y)
X = data.drop(columns=["median_house_value"]) # Input features
y = data["median_house_value"] # Target variable

# Splitting data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

numeric_features = X.select_dtypes(include=['int64', 'float64']).columns # Numerical columns
categorical_features = X.select_dtypes(include=['object']).columns # Categorical columns

# Numeric pipeline: impute missing values, then scale
numeric_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')), # Fill missing values with the column mean
    ('scaler', StandardScaler()) # Standardize to zero mean and unit variance
])

# Categorical pipeline: impute missing values, then one-hot encode
categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')), # Fill missing values with the most frequent category
    ('onehot', OneHotEncoder(handle_unknown='ignore')) # One binary column per category; unseen categories become all zeros
])

# Combine both pipelines into a full preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_pipeline, numeric_features),
        ('cat', categorical_pipeline, categorical_features)
    ]
)

# Create the final pipeline with preprocessing and model (Linear Regression)
model_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', LinearRegression())
])

# Grid Search
# Grid Search evaluates every combination of the listed hyperparameter values with
# cross-validation and keeps the best one, replacing manual trial and error.
param_grid = {
    'model__fit_intercept': [True, False], # True: estimate the y-intercept; False: force the fit through the origin
    'model__positive': [True, False] # True: constrain all coefficients to be positive (non-negative)
}

grid_search = GridSearchCV(model_pipeline, param_grid, cv=3, scoring='neg_mean_squared_error', verbose=1)
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=UserWarning)
    grid_search.fit(X_train, y_train)

# Best model from Grid Search
print("Best Parameters from Grid Search:", grid_search.best_params_)

# Evaluate the model
y_pred_grid = grid_search.predict(X_test)
mse_grid = mean_squared_error(y_test, y_pred_grid)
rmse_grid = np.sqrt(mse_grid)
print(f"Grid Search RMSE: {rmse_grid}")

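# Random Search
# Random Search evaluates only n_iter randomly sampled combinations of the values below.
param_distributions = {
    'model__fit_intercept': [True, False],
    'model__positive': [True, False],
    'model__n_jobs': [-1, 1, 2, 3, 4]  # Number of CPU cores (-1 = all); affects speed only, not the fitted model
}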

random_search = RandomizedSearchCV(model_pipeline, param_distributions, cv=3, n_iter=5, scoring='neg_mean_squared_error', verbose=1, random_state=42)
random_search.fit(X_train, y_train)

# Best model from Random Search
print("Best Parameters from Random Search:", random_search.best_params_)

# Evaluate the model
y_pred_random = random_search.predict(X_test)
mse_random = mean_squared_error(y_test, y_pred_random)
rmse_random = np.sqrt(mse_random)
print(f"Random Search RMSE: {rmse_random}")


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
None
Fitting 3 folds for each of 4 candidates, totalling 12 fits


1 fits failed out of a total of 12.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
1 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/sklearn/model_selection/_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.11/dist-packages/sklearn/base.py", line 1389, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/sklearn/pipeline.py", line 662, in fit
    self._final_estimator.fit(Xt, y, **last_step_params["fit"])
  File "/usr/local/lib/python3.11/dist-packages/sklearn/base.py", line 1389, in wr

Best Parameters from Grid Search: {'model__fit_intercept': True, 'model__positive': False}
Grid Search RMSE: 69753.53427451904
Fitting 3 folds for each of 5 candidates, totalling 15 fits
Best Parameters from Random Search: {'model__positive': False, 'model__n_jobs': 3, 'model__fit_intercept': False}
Random Search RMSE: 69753.53427451904


2 fits failed out of a total of 15.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
2 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/sklearn/model_selection/_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.11/dist-packages/sklearn/base.py", line 1389, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/sklearn/pipeline.py", line 662, in fit
    self._final_estimator.fit(Xt, y, **last_step_params["fit"])
  File "/usr/local/lib/python3.11/dist-packages/sklearn/base.py", line 1389, in wr

**Grid Search & Random Search**

When training a model, **hyperparameters** (settings chosen before training, such as `fit_intercept` in `LinearRegression`) influence its performance. The goal is to **find the combination of these parameters** that gives **the best performance**, which for this regression task means the lowest error.

However, if we have **10 Hyperparameters**, it is impractical to manually adjust each one and test the model’s performance every time, as this would be **too time-consuming and exhausting**.

Thus, we need an **automated method** to **find the best combination of values** without manual trial and error. This is where the following techniques come in:

1️⃣ **Grid Search**

2️⃣ **Random Search**

Both methods **help optimize the model (Hyperparameter Tuning)** by **testing different parameter combinations** and selecting **the best one**.

**Grid Search**

Grid Search systematically tries all possible combinations of the specified hyperparameter values, then evaluates the model’s performance for each combination to select the best one.
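
To make "all possible combinations" concrete, here is a minimal sketch that expands the same grid used in the code cell into its individual candidates. The helper `expand_grid` is hypothetical and written with the standard library only; scikit-learn builds its candidate list internally and may order it differently.

```python
from itertools import product

# Same values as param_grid in the code cell
param_grid = {
    "model__fit_intercept": [True, False],
    "model__positive": [True, False],
}

def expand_grid(grid):
    """Return every combination of the listed values as a list of dicts."""
    names = list(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]

candidates = expand_grid(param_grid)
for candidate in candidates:
    print(candidate)
print(len(candidates), "candidates")
```

Two values for each of two hyperparameters give 2 × 2 = 4 candidates, which matches the "4 candidates" reported in the Grid Search output.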

**Random Search**

Random Search is a method for selecting the best hyperparameter combinations without trying every possible one, unlike **Grid Search**.

Instead, it randomly samples values from a predefined distribution, making it **faster** and less computationally expensive, especially when the number of hyperparameters is large.
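
The sampling idea can be sketched with the standard library alone. The helper `sample_candidates` below is hypothetical, and the seed is arbitrary. It draws distinct combinations from the full set, which is how scikit-learn documents its behavior when every parameter is given as a list, but the specific combinations it picks will not match those chosen by `RandomizedSearchCV`.

```python
import random
from itertools import product

# Same values as param_distributions in the code cell
param_distributions = {
    "model__fit_intercept": [True, False],
    "model__positive": [True, False],
    "model__n_jobs": [-1, 1, 2, 3, 4],
}

def sample_candidates(distributions, n_iter, seed=None):
    """Draw up to n_iter distinct combinations at random (without replacement)."""
    names = list(distributions)
    value_lists = [distributions[name] for name in names]
    all_combos = [dict(zip(names, values)) for values in product(*value_lists)]
    rng = random.Random(seed)
    return rng.sample(all_combos, min(n_iter, len(all_combos)))

candidates = sample_candidates(param_distributions, n_iter=5, seed=42)
for candidate in candidates:
    print(candidate)
```

Only 5 of the 20 possible combinations are evaluated, so the search is cheaper but can miss the best combination.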

---

To use Random Search, choose the range (or list) of values for each hyperparameter and the number of combinations to try (`n_iter`). Random Search then evaluates that many randomly sampled combinations and returns the best one it found.

**The result is approximate: the best combination found is not guaranteed to be the best one possible.**



---



Common Goal:

Both **Grid Search** and **Random Search** aim to **tune hyperparameters** so that the model reaches its **best performance** (here, the lowest RMSE).

Comparison Between Grid Search and Random Search:

1. **Coverage of the search space:**
    - **Grid Search**: Tests every combination of the listed hyperparameter values, so it always finds the best combination within the grid. However, this can be costly in terms of time and resources.
    - **Random Search**: Tests only a random sample of the combinations, so it may miss the best one. However, it often provides good results faster.
2. **Cost and Time:**
    - **Grid Search**: Often takes longer and requires more resources (such as memory and CPU), because the number of combinations multiplies with every hyperparameter added.
    - **Random Search**: Faster and less resource-intensive because the number of combinations tested is fixed in advance by `n_iter`.
3. **Quality of the best hyperparameters found:**
    - **Grid Search**: Returns the best combination among the grid values, but cannot find values that lie between or outside them.
    - **Random Search**: Doesn't guarantee finding the best combination but can often be sufficient and saves time.
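
The cost difference can be worked out with simple arithmetic: the number of cross-validation fits is the number of candidates multiplied by the number of folds. The sketch below uses the settings from this notebook (`cv=3`, `n_iter=5`); the helper names are hypothetical.

```python
from math import prod

def grid_size(grid):
    """Number of combinations in a grid: the product of the list lengths."""
    return prod(len(values) for values in grid.values())

def cv_fits(n_candidates, cv):
    """Each candidate is trained once per cross-validation fold."""
    return n_candidates * cv

param_grid = {
    "model__fit_intercept": [True, False],
    "model__positive": [True, False],
}
param_distributions = {
    "model__fit_intercept": [True, False],
    "model__positive": [True, False],
    "model__n_jobs": [-1, 1, 2, 3, 4],
}
cv = 3
n_iter = 5

# Grid Search evaluates every combination in param_grid
grid_fits = cv_fits(grid_size(param_grid), cv)

# Random Search evaluates only n_iter combinations, whatever the size of the space
random_space = grid_size(param_distributions)
random_fits = cv_fits(min(n_iter, random_space), cv)
exhaustive_fits = cv_fits(random_space, cv)

print(grid_fits)        # 12
print(random_space)     # 20
print(random_fits)      # 15
print(exhaustive_fits)  # 60
```

These counts agree with the "totalling 12 fits" and "totalling 15 fits" lines in the output, and show that an exhaustive search over the larger space would have needed 60 fits.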

Note the Result:

Best Parameters from Grid Search: {'model__fit_intercept': True, 'model__positive': False}

Grid Search RMSE: 69753.53427451904

------------------------------------------------------------------

Best Parameters from Random Search: {'model__positive': False, 'model__n_jobs': 3, 'model__fit_intercept': False}

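Random Search RMSE: 69753.53427451904

Both searches reach the same RMSE even though their best parameters differ in `fit_intercept`. A likely explanation is that the one-hot columns for `ocean_proximity` sum to 1 in every row, so a model without an explicit intercept can absorb it into those coefficients, which makes the two settings equivalent here. `n_jobs` only controls parallelism and does not change the predictions.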
