# <p style="background-color:#05544b;font-family:newtimeroman;color:white;font-size:100%;text-align:center;border-radius:40px 40px;">Title | How to Choose the Best Model ?</p>


<h1 style="font-family: 'sans-serif'; font-weight: bold; text-align: center; color: white;">Author: Haider Rasool Qadri</h1>

<h1 style="text-align: center"

[![Gmail](https://img.shields.io/badge/Gmail-Contact%20Me-red?style=for-the-badge&logo=gmail)](haiderqadri.07@gmail.com)
[![Kaggle](https://img.shields.io/badge/Kaggle-Profile-blue?style=for-the-badge&logo=kaggle)](https://www.kaggle.com/haiderrasoolqadri)
[![GitHub](https://img.shields.io/badge/GitHub-Profile-blue?style=for-the-badge&logo=github)](https://github.com/HaiderQadri)
[![LinkedIn](https://img.shields.io/badge/LinkedIn-Profile-blue?style=for-the-badge&logo=linkedin)](www.linkedin.com/in/haider-rasool-qadri-06a4b91b8)

</h1>

# <p style="background-color:#05544b;font-family:newtimeroman;color:white;font-size:100%;text-align:center;border-radius:40px 40px;">About this Notebook</p>

In this notebook my purpose is to explain the very import concepts like `pipelines, column transformers, hyperparameter tuning and cross validation` for both `regression tasks and classification tasks` and I will select the `best model from various models`. In this notebook I am using all these concepts on `tips dataset` because tips dataset is small and require less computation.

# <p style="background-color:#05544b;font-family:newtimeroman;color:white;font-size:100%;text-align:center;border-radius:40px 40px;">Some Basic Definitions</p>


| Term                  | Definition                                                                                                                                               |
|-----------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Pipeline**          | A sequence of data processing steps that are chained together to automate and streamline the machine learning (ML) flow. A pipeline allows you to combine multiple data preprocessing and machine learning steps into a single object, making it easier to organize and manage your machine learning code. `Key components of pipeline are:`   1. Data Preprocessing 2. Model Training 3. Model Evaluation 4. Predictions |
| **Hyperparameter Tuning** | Hyperparameter is the process of finding best combinations of hyperpameters for a give model for example GridSearch and RandomSearch. |
| **Cross-Validation**  | Cross-Validation is a technique used to evaluate the performance of a model on unseen data. It is used to check how the model generalizes to new data.


# <p style="background-color:#05544b;font-family:newtimeroman;color:white;font-size:100%;text-align:center;border-radius:40px 40px;">Import Necessary Liberaries</p>


In [54]:
# For data manipulation and analysis
import pandas as pd
import numpy as np

# For data visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Import train split test and grid search and random search for cross validation
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV

# Preprocessing
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder
from sklearn.preprocessing import QuantileTransformer, PowerTransformer

# Column transformer
from sklearn.compose import ColumnTransformer

# Pipeline
from sklearn.pipeline import Pipeline

# Import both regression and classification models
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.svm import SVR, SVC
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier, AdaBoostRegressor, AdaBoostClassifier, GradientBoostingRegressor, GradientBoostingClassifier
from xgboost import XGBRegressor, XGBClassifier
from catboost import CatBoostRegressor, CatBoostClassifier
from lightgbm import LGBMRegressor, LGBMClassifier
from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB

# Import regression and classification metrice 
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.metrics import classification_report, accuracy_score, f1_score, precision_score, recall_score

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

# Saning the model
import pickle

# <p style="background-color:#05544b;font-family:newtimeroman;color:white;font-size:100%;text-align:center;border-radius:40px 40px;">Regression Models with Hyperparameters</p>

In [55]:
# Load the tups dataset using pandas liberary
df = pd.read_csv(r'C:\Users\Admin\Desktop\PYTHON-For-Data-Science_and_AI\00_projects\05_tips_pipeline_hyperparameter_tunning_gridsearch_cv\data\tips.csv')

# <p style="background-color:#05544b;font-family:newtimeroman;color:white;font-size:100%;text-align:center;border-radius:40px 40px;">GridSearch Cross-Validation</p>

# <p style="background-color:#05544b;font-family:newtimeroman;color:white;font-size:100%;text-align:center;border-radius:40px 40px;">Classification Models with Hyperparameters</p>

In [56]:
# Load the iris dataset using pandas liberary
df = pd.read_csv(r'C:\Users\Admin\Desktop\PYTHON-For-Data-Science_and_AI\00_projects\05_tips_pipeline_hyperparameter_tunning_gridsearch_cv\data\Iris.csv')

In [57]:
# Let's the random five rows of the dataset
df.sample(5)

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
73,74,6.1,2.8,4.7,1.2,Iris-versicolor
79,80,5.7,2.6,3.5,1.0,Iris-versicolor
126,127,6.2,2.8,4.8,1.8,Iris-virginica
129,130,7.2,3.0,5.8,1.6,Iris-virginica
30,31,4.8,3.1,1.6,0.2,Iris-setosa


In [58]:
# Let's remove unnecessary column Id from the dataset
df.drop('Id', axis = 1, inplace = True)

In [59]:
# Let's the random three rows of the dataset
df.sample(3)

Unnamed: 0,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
112,6.8,3.0,5.5,2.1,Iris-virginica
77,6.7,3.0,5.0,1.7,Iris-versicolor
3,4.6,3.1,1.5,0.2,Iris-setosa


In [60]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   SepalLengthCm  150 non-null    float64
 1   SepalWidthCm   150 non-null    float64
 2   PetalLengthCm  150 non-null    float64
 3   PetalWidthCm   150 non-null    float64
 4   Species        150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


In [61]:
# Let's encode Species feature usin one-hot encoder
# Call the encoder
ohe = OneHotEncoder(handle_unknown = 'ignore', sparse_output = False, drop = 'first')

df['Species'] = ohe.fit_transform(df[['Species']])


In [62]:
# Dictionary of classification models with their respective hyperparameters for grid search
classification_models = {
    'Logistic Regression': {
        'model': LogisticRegression(),
        'params': {
            'model__C': [0.1],
            'model__max_iter': [1000]
        }
    },
        'Support Vector Classifier': {
        'model': SVC(),
        'params': {
            'model__kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
            'model__C': [0.1, 1, 10],
           
        }
    },
    
    'Decision Tree Classifier': {
        'model': DecisionTreeClassifier(),
        'params': {
            'model__splitter': ['best', 'random'],
            'model__max_depth': [None, 1, 2, 3, 4]
        }
    },
    'Random Forest Classifier': {
        'model': RandomForestClassifier(),
        'params': {
            'model__n_estimators': [10, 100],
            'model__max_depth': [None, 1, 2, 3, 4],
            'model__max_features': ['auto', 'sqrt', 'log2']
        }
    },
     'Gradient Boosting Classifier': {
        'model': GradientBoostingClassifier(),
        'params': {
            'model__n_estimators': [10, 100],
            'model__max_depth': [None, 1, 2, 3, 4]
        }
    },
    'AdaBoost Classifier': {
        'model': AdaBoostClassifier(),
        'params': {
            'model__n_estimators': [10, 100]
        }
    },
    'K-Nearest Neighbors Classifier': {
        'model': KNeighborsClassifier(),
        'params': {
            'model__n_neighbors': [3, 5, 7]
        }
    },
    'XGBoost Classifier': {
        'model': XGBClassifier(),
        'params': {
            'model__n_estimators': [10, 100],
            'model__max_depth': [None, 1, 2, 3]
        }
    },
    'CatBoost Classifier': {
        'model': CatBoostClassifier(verbose=0),
        'params': {
            'model__iterations': [10, 100],
            'model__depth': [1, 2, 3, 4]
        }
    },
    'LGBM Classifier': {
        'model': LGBMClassifier(),
        'params': {
            'model__n_estimators': [10, 100],
            'model__max_depth': [None, 1, 2, 3],
            'model__learning_rate': [0.1, 0.2, 0.3],
            'model__verbose': [-1]
        }
    },
    'GaussianNB': {
        'model': GaussianNB(),
        'params': {}
    },
}

In [63]:
# Choose Features (X) and Labels (y)
X = df.drop('Species', axis = 1)
y = df['Species']

# Split the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8, random_state = 42)

In [64]:
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Assuming `preprocessor` is defined, for example:
# from sklearn.preprocessing import StandardScaler
# preprocessor = StandardScaler()

# Initialize a list to store model performance metrics
model_scores = []
best_accuracy = 0
best_estimator = None

# Iterate through each model in the classification_models dictionary
for name, model_info in classification_models.items():
    model = model_info['model']
    params = model_info['params']

    # Create a pipeline with a preprocessor and the model
    pipeline = Pipeline(steps=[
        ('model', model)
    ])

    # Initialize GridSearchCV with the model and parameters
    grid_search_cv = GridSearchCV(
        estimator=pipeline,
        param_grid=params,
        cv=5,
        scoring='accuracy',
        verbose=1,  # Verbose output to see progress
        n_jobs=-1,  # Use all available cores
    )

    # Fit the GridSearchCV object to the training data
    grid_search_cv.fit(X_train, y_train)

    # Predict the target variable for the test set
    y_pred = grid_search_cv.predict(X_test)
    
    # Calculate the metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='binary' if len(set(y_test)) == 2 else 'macro')
    recall = recall_score(y_test, y_pred, average='binary' if len(set(y_test)) == 2 else 'macro')
    f1 = f1_score(y_test, y_pred, average='binary' if len(set(y_test)) == 2 else 'macro')

    # Append performance metrics of the current model to the list
    model_scores.append({
        'Model': name,
        'Accuracy': accuracy,
        'Precision': precision,
        'Recall': recall,
        'F1 Score': f1
    })

    # Check if this model is the best so far based on accuracy
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_estimator = grid_search_cv.best_estimator_

# Sort the models based on their name for readability
sorted_models = sorted(model_scores, key=lambda x: x['Model'])

# Convert sorted model performances to a DataFrame
metrics = pd.DataFrame(sorted_models)

# Identify the best performing model based on accuracy
best_clf_model = max(sorted_models, key=lambda x: x['Accuracy'])

# Print out the DataFrame with the metrics
print(metrics)

# Print the best classifier's details
print("Best Model based on Accuracy:")
print(best_clf_model)

# Print the best estimator object
print("Best Estimator Object:")
print(best_estimator)


Fitting 5 folds for each of 1 candidates, totalling 5 fits
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Fitting 5 folds for each of 30 candidates, totalling 150 fits
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Fitting 5 folds for each of 2 candidates, totalling 10 fits
Fitting 5 folds for each of 3 candidates, totalling 15 fits
Fitting 5 folds for each of 8 candidates, totalling 40 fits
Fitting 5 folds for each of 8 candidates, totalling 40 fits
Fitting 5 folds for each of 24 candidates, totalling 120 fits
Fitting 5 folds for each of 1 candidates, totalling 5 fits
                             Model  Accuracy  Precision    Recall  F1 Score
0              AdaBoost Classifier  0.933333   0.818182  1.000000  0.900000
1              CatBoost Classifier  1.000000   1.000000  1.000000  1.000000
2         Decision Tree Classifier  1.000000   1.000000  1.000000  1.000000
3                       Gaussia