# <p style="background-color:#05544b;font-family:newtimeroman;color:white;font-size:100%;text-align:center;border-radius:40px 40px;">Title | How to Choose the Best Model?</p>


<h1 style="font-family: 'sans-serif'; font-weight: bold; text-align: center; color: white;">Author: Haider Rasool Qadri</h1>

<h1 style="text-align: center">

[![Gmail](https://img.shields.io/badge/Gmail-Contact%20Me-red?style=for-the-badge&logo=gmail)](mailto:haiderqadri.07@gmail.com)
[![Kaggle](https://img.shields.io/badge/Kaggle-Profile-blue?style=for-the-badge&logo=kaggle)](https://www.kaggle.com/haiderrasoolqadri)
[![GitHub](https://img.shields.io/badge/GitHub-Profile-blue?style=for-the-badge&logo=github)](https://github.com/HaiderQadri)
[![LinkedIn](https://img.shields.io/badge/LinkedIn-Profile-blue?style=for-the-badge&logo=linkedin)](https://www.linkedin.com/in/haider-rasool-qadri-06a4b91b8)

</h1>

# <p style="background-color:#05544b;font-family:newtimeroman;color:white;font-size:100%;text-align:center;border-radius:40px 40px;">About this Notebook</p>

In this notebook I explain several very important concepts, `pipelines, column transformers, hyperparameter tuning and cross validation`, for both `regression tasks and classification tasks`, and then select the `best model from various models`. I apply all of these concepts to the `tips dataset` because it is small and requires little computation.

# <p style="background-color:#05544b;font-family:newtimeroman;color:white;font-size:100%;text-align:center;border-radius:40px 40px;">Some Basic Definitions</p>


| Term                  | Definition                                                                                                                                               |
|-----------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Pipeline**          | A sequence of data processing steps chained together to automate and streamline the machine learning (ML) workflow. A pipeline combines multiple data preprocessing and modeling steps into a single object, making the code easier to organize and manage. `Key components of a pipeline are:` 1. Data Preprocessing 2. Model Training 3. Model Evaluation 4. Predictions |
| **Hyperparameter Tuning** | The process of finding the best combination of hyperparameters for a given model, for example with grid search (`GridSearchCV`) or random search (`RandomizedSearchCV`). |
| **Cross-Validation**  | A technique for estimating how well a model performs on unseen data: the data is split into several folds, and each fold is held out once for validation while the model is trained on the rest. It is used to check how the model generalizes to new data. |
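
To make the first and last definitions concrete, here is a minimal sketch that chains a scaler and a model into one pipeline and scores it with 5-fold cross-validation. The synthetic data, the step names `scaler` and `model`, and the choice of `LinearRegression` are assumptions made only for illustration; they are not part of the tips workflow.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative assumption): y depends linearly on two features
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(100, 2))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=100)

# Pipeline: scaling and the model are chained into a single estimator
pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('model', LinearRegression()),
])

# 5-fold cross-validation: each fold is held out once and scored
# (the default score for regressors is R2)
scores = cross_val_score(pipe, X_toy, y_toy, cv=5)
print(scores.shape)   # one score per fold
print(scores.mean())  # average R2 across the folds
```

Because the scaler lives inside the pipeline, it is refit on the training part of every fold, so no information from the held-out fold leaks into preprocessing.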


# <p style="background-color:#05544b;font-family:newtimeroman;color:white;font-size:100%;text-align:center;border-radius:40px 40px;">Import Necessary Libraries</p>


In [53]:
# For data manipulation and analysis
import pandas as pd
import numpy as np

# For data visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Train/test splitting, plus grid search and randomized search with cross-validation
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV

# Preprocessing
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.preprocessing import QuantileTransformer, PowerTransformer

# Column transformer
from sklearn.compose import ColumnTransformer

# Pipeline
from sklearn.pipeline import Pipeline

# Import both regression and classification models
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.svm import SVR, SVC
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier, AdaBoostRegressor, AdaBoostClassifier, GradientBoostingRegressor, GradientBoostingClassifier
from xgboost import XGBRegressor, XGBClassifier
from catboost import CatBoostRegressor, CatBoostClassifier
from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB

# Import regression and classification metrics
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.metrics import classification_report, accuracy_score, f1_score, precision_score

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

# Saving the model
import pickle

# <p style="background-color:#05544b;font-family:newtimeroman;color:white;font-size:100%;text-align:center;border-radius:40px 40px;">Regression Models with Hyperparameters</p>

In [54]:
# Load the dataset using the pandas library
df = pd.read_csv(r'C:\Users\Admin\Desktop\PYTHON-For-Data-Science_and_AI\00_projects\05_tips_pipeline_hyperparameter_tunning_gridsearch_cv\data\tips.csv')

In [55]:
# Let's look at 5 random rows of the dataset
df.sample(5)

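|     | total_bill | tip  | sex    | smoker | day  | time   | size |
|-----|------------|------|--------|--------|------|--------|------|
| 3   | 23.68      | 3.31 | Male   | No     | Sun  | Dinner | 2    |
| 243 | 18.78      | 3.0  | Female | No     | Thur | Dinner | 2    |
| 108 | 18.24      | 3.76 | Male   | No     | Sat  | Dinner | 2    |
| 98  | 21.01      | 3.0  | Male   | Yes    | Fri  | Dinner | 2    |
| 180 | 34.65      | 3.68 | Male   | Yes    | Sun  | Dinner | 4    |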


In [56]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   total_bill  244 non-null    float64
 1   tip         244 non-null    float64
 2   sex         244 non-null    object 
 3   smoker      244 non-null    object 
 4   day         244 non-null    object 
 5   time        244 non-null    object 
 6   size        244 non-null    int64  
dtypes: float64(2), int64(1), object(4)
memory usage: 13.5+ KB


In [58]:
# Separate categorical and numerical features (the target 'tip' is not a feature)
categorical_features = ['sex', 'smoker', 'day', 'time']
numerical_features = ['total_bill', 'size']

# Define the transformers for preprocessing
# LabelEncoder is meant for target labels and cannot be used as a pipeline step,
# so the categorical feature columns are one-hot encoded instead
# (one-hot columns are already 0/1, so they are not scaled)
categorical_transformer = Pipeline(steps = [
    ('one_hot_encoder', OneHotEncoder(handle_unknown = 'ignore'))
])

numerical_transformer = Pipeline(steps = [
    ('transform', QuantileTransformer(output_distribution = 'normal')),
    ('standard_scaler', StandardScaler())
])

# Route each group of columns to its own transformer
preprocessor = ColumnTransformer(transformers = [
    ('num', numerical_transformer, numerical_features),
    ('cat', categorical_transformer, categorical_features)
])
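
A column transformer routes each group of columns to its own transformer and concatenates the results side by side. The sketch below shows this on a tiny hand-made frame with tips-like columns; the values and the name `toy_preprocessor` are illustrative assumptions. It uses `OneHotEncoder` for the categorical columns, since scikit-learn's `LabelEncoder` is designed for target labels rather than feature columns.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny hand-made frame with tips-like columns (values are illustrative only)
toy = pd.DataFrame({
    'total_bill': [10.0, 20.0, 30.0, 40.0],
    'size': [1, 2, 3, 4],
    'sex': ['Male', 'Female', 'Male', 'Female'],
    'time': ['Lunch', 'Dinner', 'Dinner', 'Lunch'],
})

# Numeric columns are scaled, categorical columns are one-hot encoded
toy_preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), ['total_bill', 'size']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['sex', 'time']),
])

transformed = toy_preprocessor.fit_transform(toy)

# 2 scaled numeric columns + 2 one-hot columns for 'sex' + 2 for 'time'
print(transformed.shape)
```

Columns that are not listed in any transformer are dropped by default, which is why the target column should stay out of the feature lists.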


In [None]:
# # Let's encode the categorical features separately with LabelEncoder, because at the end I will decode these features
# le_sex = LabelEncoder()
# le_smoker = LabelEncoder()
# le_day = LabelEncoder()
# le_time = LabelEncoder()

# df['sex'] = le_sex.fit_transform(df['sex'])
# df['smoker'] = le_smoker.fit_transform(df['smoker'])
# df['day'] = le_day.fit_transform(df['day'])
# df['time'] = le_time.fit_transform(df['time'])

# # Let's transform total_bill and tip column using QuantileTransformer
# qt_total_bill = QuantileTransformer(output_distribution = 'normal', random_state = 42)
# qt_tip = QuantileTransformer(output_distribution = 'normal', random_state = 42)

# df['total_bill'] = qt_total_bill.fit_transform(df[['total_bill']])
# df['tip'] = qt_tip.fit_transform(df[['tip']])

# # Let's scale the whole dataset
# sc_total_bill = StandardScaler() 
# sc_tip = StandardScaler() 
# sc_sex = StandardScaler() 
# sc_smoker = StandardScaler() 
# sc_day = StandardScaler() 
# sc_time = StandardScaler() 
# sc_size = StandardScaler() 

# df['total_bill'] = sc_total_bill.fit_transform(df[['total_bill']])
# df['tip'] = sc_tip.fit_transform(df[['tip']])
# df['sex'] = sc_sex.fit_transform(df[['sex']])
# df['smoker'] = sc_smoker.fit_transform(df[['smoker']])
# df['day'] = sc_day.fit_transform(df[['day']])
# df['time'] = sc_time.fit_transform(df[['time']])
# df['size'] = sc_size.fit_transform(df[['size']])


In [59]:
# Let's look at 5 random rows again: df itself is unchanged, because the preprocessing is applied inside the pipeline
df.sample(5)

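|     | total_bill | tip  | sex    | smoker | day  | time   | size |
|-----|------------|------|--------|--------|------|--------|------|
| 171 | 15.81      | 3.16 | Male   | Yes    | Sat  | Dinner | 2    |
| 182 | 45.35      | 3.5  | Male   | Yes    | Sun  | Dinner | 3    |
| 132 | 11.17      | 1.5  | Female | No     | Thur | Lunch  | 2    |
| 50  | 12.54      | 2.5  | Male   | No     | Sun  | Dinner | 2    |
| 225 | 16.27      | 2.5  | Female | Yes    | Fri  | Lunch  | 2    |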


In [61]:
# Separate the dataset into features (X) and labels (y)
X = df.drop('tip', axis = 1)
y = df['tip']

# Split the data into train and test sets
# train_test_split returns the splits in the order X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
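
The order in which `train_test_split` returns its outputs is easy to mix up: for two inputs it returns both pieces of the first array, then both pieces of the second. A quick check on toy arrays (the names `X_demo`, `y_demo` and the array contents are illustrative assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy arrays: 10 samples with 2 features each, and 10 labels
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Return order is: X_train, X_test, y_train, y_test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print(X_tr.shape, X_te.shape)  # feature splits keep both columns
print(y_tr.shape, y_te.shape)  # label splits are one-dimensional
```

With `test_size=0.2`, 8 of the 10 samples go to training and 2 to testing, and the feature and label splits have matching lengths.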

# <p style="background-color:#05544b;font-family:newtimeroman;color:white;font-size:100%;text-align:center;border-radius:40px 40px;">GridSearch Cross-Validation</p>
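
`GridSearchCV` takes a single estimator and a parameter grid as its positional arguments, so a preprocessor and a model have to be chained into one `Pipeline` before they are passed in. Grid keys then address a pipeline step with the `<step name>__<parameter>` convention. The sketch below shows this on synthetic data; the data, the step names `prep` and `model`, and the choice of `DecisionTreeRegressor` are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (illustrative assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 3))
y_demo = X_demo[:, 0] ** 2 + X_demo[:, 1] + rng.normal(scale=0.1, size=120)

# One estimator: preprocessing step followed by the model step
demo_pipeline = Pipeline(steps=[
    ('prep', StandardScaler()),
    ('model', DecisionTreeRegressor(random_state=42)),
])

# Keys are '<step name>__<parameter name>'
param_grid = {'model__max_depth': [None, 5, 10]}

# Every grid point is evaluated with 5-fold cross-validation
search = GridSearchCV(demo_pipeline, param_grid, cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_)  # the winning grid point, keyed by 'model__max_depth'
print(search.best_score_)   # mean cross-validated R2 of that grid point
```

After fitting, `search.predict` uses the best pipeline refit on all of the training data, so the preprocessing is applied to new data automatically.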

In [63]:
# Let's create a dictionary of models, each paired with the hyperparameter grid to search

models = { 
          'LinearRegression' : (LinearRegression(), {}),

          'SVR' : (SVR(), {'kernel': ['rbf', 'poly', 'sigmoid']}),

          'DecisionTreeRegressor' : (DecisionTreeRegressor(random_state=42), {'max_depth': [None, 5, 10]}),

          'RandomForestRegressor' : (RandomForestRegressor(random_state=42), {'n_estimators': [10, 100]}),

          'KNeighborsRegressor' : (KNeighborsRegressor(), {'n_neighbors': np.arange(3, 100, 2)}),

          'GradientBoostingRegressor' : (GradientBoostingRegressor(random_state=42),{'n_estimators': [10, 100]}),

          'XGBRegressor' : (XGBRegressor(), {'n_estimators': [10, 100]}),  

          'AdaBoostRegressor': (AdaBoostRegressor(random_state=42), {'n_estimators': [10, 100]}),        
          }

from sklearn.metrics import r2_score  # R2 is used in the loop below

# Minimal stand-ins for the heading helpers, which are not defined in this notebook
# (assumption: they only print a highlighted heading)
def print_boxed_zigzag_heading(text):
    print(f"\n~^~ {text} ~^~")

def print_boxed_blue_heading(text):
    print(f"[ {text} ]")

model_scores = []
# For loop to iterate over the models
for name, (model, params) in models.items():
    # Chain the preprocessor and the model into one estimator,
    # because GridSearchCV accepts a single estimator plus a parameter grid
    pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('model', model)])

    # Grid keys must be prefixed with the pipeline step name ('model__')
    param_grid = {f'model__{key}': values for key, values in params.items()}

    # Create the grid search with 5-fold cross-validation
    grid_search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1)

    # Fit the grid search on the training data
    grid_search.fit(X_train, y_train)

    # Make predictions with the best estimator found for each model
    y_pred = grid_search.predict(X_test)
    # Metrics
    r2 = r2_score(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    best_parameter = grid_search.best_params_
    model_scores.append((name, r2, mae, mse, best_parameter))

# Sort the models by R2 in ascending order, so the best model is listed last
sorted_models = sorted(model_scores, key=lambda x: x[1], reverse=False)

# Printing each model with its evaluation metrics
print_boxed_zigzag_heading('R2 Scores')
for model in sorted_models:
    print('R_2 for', f"{model[0]} is {model[1]: .2f}")
print('\n')
print_boxed_zigzag_heading('MAE Of Models')
for model in sorted_models:
    print('MAE for', f"{model[0]} is {model[2]: .2f}")
print('\n')
print_boxed_zigzag_heading('MSE Of Models')
for model in sorted_models:
    print('MSE for', f"{model[0]} is {model[3]: .2f}")

# Selecting the best model based on R2 (higher is better)
best_r2_model = max(model_scores, key=lambda x: x[1])
print_boxed_zigzag_heading(f"Best model based on R2 is {best_r2_model[0]} with R2 of {best_r2_model[1]:.2f}")
print_boxed_blue_heading(f'Best Parameters: {best_r2_model[4]}')
# Selecting the best model based on MAE (lower is better)
best_mae_model = min(model_scores, key=lambda x: x[2])
print_boxed_zigzag_heading(f"Best model based on MAE is {best_mae_model[0]} with MAE of {best_mae_model[2]:.2f}")
print_boxed_blue_heading(f'Best Parameters: {best_mae_model[4]}')
# Selecting the best model based on MSE (lower is better)
best_mse_model = min(model_scores, key=lambda x: x[3])
print_boxed_zigzag_heading(f"Best model based on MSE is {best_mse_model[0]} with MSE of {best_mse_model[3]:.2f}")
print_boxed_blue_heading(f'Best Parameters: {best_mse_model[4]}')