In this lesson we will learn how to use **combinatorial grid search** to find the best combination of parameters for a given model.
- For a given model, increasing the value of one parameter may improve model performance up to a certain point. However, if other parameters are set too high or too low, the model may still be doomed to overfitting or underfitting.
- The only way to be sure is to try every single combination, which is why grid search is sometimes referred to as exhaustive search.

Grid search systematically evaluates all possible combinations of hyperparameter values within a predefined grid, training and evaluating a model for each combination to identify the set that yields the best performance. This method is often employed in conjunction with cross-validation to estimate a model's generalization performance and select the best hyperparameter set.
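
To make the idea concrete, the loop below is a minimal hand-written sketch of what an exhaustive search does. It assumes the iris toy dataset bundled with scikit-learn and a small illustrative grid; neither comes from this lesson's data, and the variable names are hypothetical.

```python
from itertools import product

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Toy data, used only for illustration (an assumption, not the lesson's dataset)
X, y = load_iris(return_X_y=True)

# Illustrative grid: every value of one parameter is paired with every value of the other
grid = {
    'n_neighbors': [3, 5, 7],
    'weights': ['uniform', 'distance'],
}

results = []
for values in product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    # 5-fold cross-validation; the mean fold accuracy is the score of this combination
    scores = cross_val_score(KNeighborsClassifier(**params), X, y, cv=5)
    results.append((scores.mean(), params))

# Keep the combination with the highest mean cross-validated accuracy
best_score, best_params = max(results, key=lambda r: r[0])
print(len(results), best_params, round(best_score, 3))
```

The grid has 3 * 2 = 6 combinations, so the loop scores six candidate models, each with 5 fits.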

# Understanding GridSearchCV
GridSearchCV is a powerful tool in scikit-learn that performs an exhaustive search over specified parameter values for an estimator. It is particularly useful for hyperparameter tuning, where the goal is to find the combination of parameters that results in the highest model performance. The GridSearchCV object takes an estimator, a parameter grid, and a scoring metric as inputs, then evaluates the model's performance for every combination in the grid using the chosen scoring metric.
Simply put, GridSearchCV combines K-Fold Cross-Validation with a grid search of parameters.

Key components of GridSearchCV:

- `estimator`: The machine learning model to be tuned.
- `param_grid`: Dictionary that maps each parameter name to the values to try.
- `scoring`: Metric used to evaluate model performance.
- `cv`: Cross-validation strategy, for example the number of folds.

In [11]:
# Demonstration of how to use GridSearchCV
# from sklearn.model_selection import GridSearchCV
# from sklearn.neighbors import KNeighborsClassifier

# estimator_KNN = KNeighborsClassifier(algorithm='auto')

# parameters_KNN = {
#     'n_neighbors': (3,5,7,9),
#     'leaf_size': (20,40),
#     'p': (1,2),
#     'weights': ('uniform', 'distance'),
#     'metric': ('minkowski', 'chebyshev'),
# }

# # with GridSearch
# grid_search_KNN = GridSearchCV(
#     estimator=estimator_KNN,
#     param_grid=parameters_KNN,
#     scoring='accuracy',
#     n_jobs=-1,   # use all available CPU cores
#     cv=5         # 5-fold cross-validation
# )
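
As a runnable counterpart to the demonstration above, here is a small sketch that fits a `GridSearchCV` object and reads back its results. It assumes the iris toy dataset and a reduced two-parameter grid so that it finishes quickly; the name `demo_search` is hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Toy data, used only for illustration (an assumption, not the lesson's dataset)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

demo_search = GridSearchCV(
    estimator=KNeighborsClassifier(),
    param_grid={'n_neighbors': [3, 5, 7, 9], 'weights': ['uniform', 'distance']},
    scoring='accuracy',
    cv=5,
)
demo_search.fit(X_train, y_train)

print(demo_search.best_params_)            # best combination found within the grid
print(demo_search.best_score_)             # its mean cross-validated accuracy
print(demo_search.score(X_test, y_test))   # accuracy of the refitted best model on held-out data
```

With the default `refit=True`, the best combination is refitted on the whole training set, which is why the search object can be used directly for `score` and `predict`.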

## Drawbacks of GridSearchCV
The best combination of parameters found is only a conditional "best" combination: the search can only test the parameter values that you fed into `param_grid`, so a better value that lies outside the grid will never be found.

The main drawback of an exhaustive search such as GridSearchCV is that there is no way of telling what's best until we've exhausted all possibilities. This means training many versions of the same machine learning model, which can be very time consuming and computationally expensive.

Consider the example above: we have 5 different parameters with 4, 2, 2, 2 and 2 values to try respectively, and we also set cross-validation to 5 folds, meaning each combination is fitted 5 times and the resulting scores are averaged together.

Simple math: 4 * 2 * 2 * 2 * 2 = 64 combinations, each fitted 5 times, gives 64 * 5 = 320 different models trained.
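
The same count can be checked in a few lines before launching a search. This sketch only uses the Python standard library and copies the grid from the KNN example; `cv_folds`, `n_candidates` and `n_fits` are names introduced here for illustration.

```python
from math import prod

# Grid copied from the KNN example above
parameters_KNN = {
    'n_neighbors': (3, 5, 7, 9),
    'leaf_size': (20, 40),
    'p': (1, 2),
    'weights': ('uniform', 'distance'),
    'metric': ('minkowski', 'chebyshev'),
}
cv_folds = 5

# Number of combinations = product of the number of values per parameter
n_candidates = prod(len(values) for values in parameters_KNN.values())
# Each combination is fitted once per cross-validation fold
n_fits = n_candidates * cv_folds

print(n_candidates, n_fits)  # 64 320
```

Adding a single extra parameter with 3 values would triple the total to 960 fits, which is why grids grow expensive so quickly.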


# Pipelines
Pipelines are extremely useful tools for writing clean and manageable machine learning code. They are a streamlined process that automates the flow of data from raw inputs to valuable predictions or decisions, encompassing a series of steps including data collection, data preprocessing, feature engineering, model building and model evaluation.

Pipeline functionality can be found in scikit-learn's `sklearn.pipeline` module, which provides the `Pipeline` class.

In [12]:
# from sklearn.pipeline import Pipeline
# from sklearn.preprocessing import StandardScaler
# from sklearn.neighbors import KNeighborsClassifier


# # Each step is its own (name, transformer_or_estimator) tuple
# pipe = Pipeline([
#     ('scaler', StandardScaler()),
#     ('knn', KNeighborsClassifier())
# ])

# pipe.fit(X_train, y_train)

# pipe.score(X_test, y_test)
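
For a version that can be run as-is, the sketch below builds the same two-step pipeline on the iris toy dataset. The dataset and the name `demo_pipe` are assumptions made for illustration, not part of the lesson's data.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data, used only for illustration
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Each step is a (name, object) tuple; the last step is the model
demo_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier()),
])

# fit() learns the scaling from the training data only, then trains the classifier on the scaled data
demo_pipe.fit(X_train, y_train)

# score() applies the already-fitted scaler to the test data before predicting
print(demo_pipe.score(X_test, y_test))
print(list(demo_pipe.named_steps))
```

Because the scaler is fitted inside the pipeline, test data never influences the scaling statistics.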

In [13]:
# from sklearn.model_selection import GridSearchCV

# # Create the pipeline
# pipe = Pipeline([
#     ('scaler', StandardScaler()),
#     ('knn', KNeighborsClassifier())
# ])

# # Create the grid parameter; keys use the '<step name>__<parameter name>' form
# grid = [{'knn__n_neighbors': [3,5,7,9],
#          'knn__weights': ['uniform', 'distance']}]


# # Create the grid, with "pipe" as the estimator
# gridsearch = GridSearchCV(estimator=pipe,
#                           param_grid=grid,
#                           scoring='accuracy',
#                           cv=5)

# # Fit using grid search
# gridsearch.fit(X_train, y_train)

# # Calculate the test score
# gridsearch.score(X_test, y_test)
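
When a pipeline is the estimator, the keys of the parameter grid must name both the step and the parameter, joined by a double underscore. One way to see which names are valid is to list the pipeline's parameters, as in this small sketch (the name `demo_pipe` is hypothetical):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier()),
])

# get_params() returns every tunable name in '<step name>__<parameter name>' form
param_names = sorted(demo_pipe.get_params().keys())
knn_names = [name for name in param_names if name.startswith('knn__')]
print(knn_names)
```

Any of the printed names, such as `knn__n_neighbors` or `knn__weights`, can be used as a key in `param_grid`.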

In [14]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


ames_data = pd.read_csv('ames.csv')
ames_data.head()
ames_data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC

In [15]:
X = ames_data.drop(columns=['Id', 'SalePrice','Alley','FireplaceQu','PoolQC','Fence'])
y = ames_data['SalePrice']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

categorical_cols = X.select_dtypes(include=['object']).columns
numerical_cols = X.select_dtypes(exclude=['object']).columns

numerical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),  
    ('scaler', StandardScaler())  
])

categorical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),  
    ('onehot', OneHotEncoder(handle_unknown='ignore'))  
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_pipeline, numerical_cols),
        ('cat', categorical_pipeline, categorical_cols)
    ])

model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('knn', KNeighborsRegressor())
])

param_grid = {
    'knn__n_neighbors': [3, 5, 7, 9],
    'knn__weights': ['uniform', 'distance'],
    'knn__p': [1, 2]  
}

grid_search = GridSearchCV(model_pipeline, param_grid, cv=3, n_jobs=-1, verbose=1)

grid_search.fit(X_train, y_train)

best_params = grid_search.best_params_

print(best_params)

y_pred = grid_search.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = mse ** 0.5  # RMSE is the square root of the MSE; the squared=False argument is deprecated in recent scikit-learn releases
r2 = r2_score(y_test, y_pred)



Fitting 3 folds for each of 16 candidates, totalling 48 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.


{'knn__n_neighbors': 3, 'knn__p': 1, 'knn__weights': 'distance'}


[Parallel(n_jobs=-1)]: Done  48 out of  48 | elapsed:    5.4s finished
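
Beyond `best_params_`, the fitted search object keeps the score of every candidate in `cv_results_`. The sketch below shows one way to rank them in a table. It uses synthetic regression data from `make_regression` as a stand-in (an assumption, not the Ames data) and a simplified pipeline; all `demo_` names are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, used only for illustration
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

demo_pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('knn', KNeighborsRegressor()),
])
demo_grid = {
    'knn__n_neighbors': [3, 5, 7, 9],
    'knn__weights': ['uniform', 'distance'],
    'knn__p': [1, 2],
}

# No scoring argument: the regressor's default score (R^2) is used
demo_search = GridSearchCV(demo_pipeline, demo_grid, cv=3)
demo_search.fit(X_demo, y_demo)

# One row per candidate, best candidates first
results_table = (
    pd.DataFrame(demo_search.cv_results_)
    [['params', 'mean_test_score', 'std_test_score', 'rank_test_score']]
    .sort_values('rank_test_score')
    .reset_index(drop=True)
)
print(results_table.head())
```

Looking at the whole table, rather than only the winner, shows how close the runner-up combinations are and whether the best value sits at the edge of the grid.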
