# <p style="background-color:#05544b;font-family:newtimeroman;color:white;font-size:100%;text-align:center;border-radius:40px 40px;">Title | How to Choose the Best Model?</p>


<h1 style="font-family: 'sans-serif'; font-weight: bold; text-align: center; color: white;">Author: Haider Rasool Qadri</h1>

<h1 style="text-align: center">

[![Gmail](https://img.shields.io/badge/Gmail-Contact%20Me-red?style=for-the-badge&logo=gmail)](mailto:haiderqadri.07@gmail.com)
[![Kaggle](https://img.shields.io/badge/Kaggle-Profile-blue?style=for-the-badge&logo=kaggle)](https://www.kaggle.com/haiderrasoolqadri)
[![GitHub](https://img.shields.io/badge/GitHub-Profile-blue?style=for-the-badge&logo=github)](https://github.com/HaiderQadri)
[![LinkedIn](https://img.shields.io/badge/LinkedIn-Profile-blue?style=for-the-badge&logo=linkedin)](https://www.linkedin.com/in/haider-rasool-qadri-06a4b91b8)

</h1>

# <p style="background-color:#05544b;font-family:newtimeroman;color:white;font-size:100%;text-align:center;border-radius:40px 40px;">About this Notebook</p>

In this notebook, my purpose is to explain some very important concepts, namely `pipelines, column transformers, hyperparameter tuning and cross validation`, for both `regression tasks and classification tasks`, and to select the `best model from various models`. I apply these concepts to the `tips dataset` because it is small and requires little computation.

# <p style="background-color:#05544b;font-family:newtimeroman;color:white;font-size:100%;text-align:center;border-radius:40px 40px;">Some Basic Definitions</p>


| Term                  | Definition                                                                                                                                               |
|-----------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Pipeline**          | A sequence of data processing steps chained together to automate and streamline the machine learning (ML) workflow. A pipeline combines multiple preprocessing and modeling steps into a single object, which makes the code easier to organize and manage. `Key components of a pipeline are:`   1. Data Preprocessing 2. Model Training 3. Model Evaluation 4. Predictions |
| **Hyperparameter Tuning** | Hyperparameter tuning is the process of finding the best combination of hyperparameters for a given model, for example with grid search (`GridSearchCV`) or random search (`RandomizedSearchCV`). |
| **Cross-Validation**  | Cross-validation is a technique for estimating the performance of a model on unseen data: the data is split into several folds, and the model is repeatedly trained on some folds and evaluated on the remaining one. It is used to check how the model generalizes to new data. |
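
The three terms above can be sketched together in a few lines. This is only a minimal illustration, assuming synthetic data from `make_classification` and hypothetical names (`X_demo`, `demo_pipeline`, `demo_search`) that are not used elsewhere in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for illustration (not the notebook's dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)

# Pipeline: a preprocessing step followed by the model, wrapped in one object
demo_pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=1000)),
])

# Cross-validation: 5 train/validation splits, one accuracy score per split
cv_scores = cross_val_score(demo_pipeline, X_demo, y_demo, cv=5, scoring='accuracy')
print('Mean CV accuracy:', cv_scores.mean())

# Hyperparameter tuning: try each value of C and score it with 5-fold CV
demo_search = GridSearchCV(demo_pipeline, param_grid={'model__C': [0.1, 1.0, 10.0]}, cv=5)
demo_search.fit(X_demo, y_demo)
print('Best parameters:', demo_search.best_params_)
```

Because the scaler sits inside the pipeline, it is refitted on the training part of every fold, so no information from the validation fold leaks into preprocessing.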


# <p style="background-color:#05544b;font-family:newtimeroman;color:white;font-size:100%;text-align:center;border-radius:40px 40px;">Import Necessary Libraries</p>


In [31]:
# For data manipulation and analysis
import pandas as pd
import numpy as np

# For data visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Train/test split, plus grid search and randomized search with cross-validation
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV

# Preprocessing
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder
from sklearn.preprocessing import QuantileTransformer, PowerTransformer

# Column transformer
from sklearn.compose import ColumnTransformer

# Pipeline
from sklearn.pipeline import Pipeline

# Import both regression and classification models
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.svm import SVR, SVC
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier, AdaBoostRegressor, AdaBoostClassifier, GradientBoostingRegressor, GradientBoostingClassifier
from xgboost import XGBRegressor, XGBClassifier
from catboost import CatBoostRegressor, CatBoostClassifier
from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB

# Import regression and classification metrics
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.metrics import classification_report, accuracy_score, f1_score, precision_score

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

# Saving the model
import pickle

# <p style="background-color:#05544b;font-family:newtimeroman;color:white;font-size:100%;text-align:center;border-radius:40px 40px;">Regression Models with Hyperparameters</p>

In [11]:
# # Load the dataset using the pandas library
# df = pd.read_csv(r'C:\Users\Admin\Desktop\PYTHON-For-Data-Science_and_AI\00_projects\05_tips_pipeline_hyperparameter_tunning_gridsearch_cv\data\tips.csv')

In [12]:
# # Let's see 5 random rows of the dataset
# df.sample(5)

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   total_bill  244 non-null    float64
 1   tip         244 non-null    float64
 2   sex         244 non-null    object 
 3   smoker      244 non-null    object 
 4   day         244 non-null    object 
 5   time        244 non-null    object 
 6   size        244 non-null    int64  
dtypes: float64(2), int64(1), object(4)
memory usage: 13.5+ KB


In [14]:
df.columns

Index(['total_bill', 'tip', 'sex', 'smoker', 'day', 'time', 'size'], dtype='object')

In [15]:
# # Let's separate categorical and numeric features
# categorical_features = ['sex', 'smoker', 'day', 'time']
# numerical_features = ['total_bill', 'size']
# # Create a pipeline for categorical features
# categorical_transformer = Pipeline(steps = [
#     ('onehotencoder', OneHotEncoder())
# ])

# # Create a pipeline for numerical features
# numerical_transformer = Pipeline(steps = [
#     ('quantiletransformer', QuantileTransformer(output_distribution = 'normal', random_state = 42)),
#     ('minmaxscalar', MinMaxScaler())
# ])

# # Let's combine these two using column transformer
# preprocessor = ColumnTransformer(transformers = [
#     ('cat', categorical_transformer, categorical_features),
#     ('num', numerical_transformer, numerical_features)
# ])

In [16]:
# # Split the data into Features (X) and Labels (y)
# X = df.drop('tip', axis = 1)
# y = df['tip']

# # Split the data into train and test
# X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8, random_state = 42)

In [17]:
# # Let's encode each categorical feature with its own LabelEncoder so that it can be decoded at the end
# le_sex = LabelEncoder()
# le_smoker = LabelEncoder()
# le_day = LabelEncoder()
# le_time = LabelEncoder()

# df['sex'] = le_sex.fit_transform(df['sex'])
# df['smoker'] = le_smoker.fit_transform(df['smoker'])
# df['day'] = le_day.fit_transform(df['day'])
# df['time'] = le_time.fit_transform(df['time'])

# # Let's transform total_bill and tip column using QuantileTransformer
# qt_total_bill = QuantileTransformer(output_distribution = 'normal', random_state = 42)
# qt_tip = QuantileTransformer(output_distribution = 'normal', random_state = 42)

# df['total_bill'] = qt_total_bill.fit_transform(df[['total_bill']])
# df['tip'] = qt_tip.fit_transform(df[['tip']])

# # Let's scale the whole dataset
# sc_total_bill = StandardScaler() 
# sc_tip = StandardScaler() 
# sc_sex = StandardScaler() 
# sc_smoker = StandardScaler() 
# sc_day = StandardScaler() 
# sc_time = StandardScaler() 
# sc_size = StandardScaler() 

# df['total_bill'] = sc_total_bill.fit_transform(df[['total_bill']])
# df['tip'] = sc_tip.fit_transform(df[['tip']])
# df['sex'] = sc_sex.fit_transform(df[['sex']])
# df['smoker'] = sc_smoker.fit_transform(df[['smoker']])
# df['day'] = sc_day.fit_transform(df[['day']])
# df['time'] = sc_time.fit_transform(df[['time']])
# df['size'] = sc_size.fit_transform(df[['size']])
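
Keeping one fitted encoder or scaler per column is what makes decoding possible later. A minimal sketch of that round trip, assuming hypothetical values for a `day`-like and a `tip`-like column (`le_demo` and `sc_demo` are illustrative names):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical values standing in for the 'day' and 'tip' columns
days = np.array(['Sun', 'Sat', 'Thur', 'Fri', 'Sun'])
tips = np.array([[1.0], [2.5], [3.0], [4.5], [2.0]])

le_demo = LabelEncoder()
days_encoded = le_demo.fit_transform(days)               # strings -> integer codes
days_decoded = le_demo.inverse_transform(days_encoded)   # integer codes -> strings

sc_demo = StandardScaler()
tips_scaled = sc_demo.fit_transform(tips)                # zero mean, unit variance
tips_restored = sc_demo.inverse_transform(tips_scaled)   # back to original units

print(days_decoded)
print(tips_restored.ravel())
```

When several transformations are stacked on the same column, as in the cell above, they have to be undone in reverse order: first the scaler, then the quantile transformer or label encoder.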


In [18]:
# # Let's see 5 random rows of the dataset after preprocessing
# df.sample(5)

# <p style="background-color:#05544b;font-family:newtimeroman;color:white;font-size:100%;text-align:center;border-radius:40px 40px;">GridSearch Cross-Validation</p>

# <p style="background-color:#05544b;font-family:newtimeroman;color:white;font-size:100%;text-align:center;border-radius:40px 40px;">Classification Models with Hyperparameters</p>

In [20]:
df = pd.read_csv(r'C:\Users\Admin\Desktop\PYTHON-For-Data-Science_and_AI\00_projects\05_tips_pipeline_hyperparameter_tunning_gridsearch_cv\data\Iris.csv')

In [23]:
df.head()

   Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species
0   1            5.1           3.5            1.4           0.2  Iris-setosa
1   2            4.9           3.0            1.4           0.2  Iris-setosa
2   3            4.7           3.2            1.3           0.2  Iris-setosa
3   4            4.6           3.1            1.5           0.2  Iris-setosa
4   5            5.0           3.6            1.4           0.2  Iris-setosa


In [27]:
df.drop('Id', axis = 1, inplace = True)

In [28]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   SepalLengthCm  150 non-null    float64
 1   SepalWidthCm   150 non-null    float64
 2   PetalLengthCm  150 non-null    float64
 3   PetalWidthCm   150 non-null    float64
 4   Species        150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


In [29]:
df.columns

Index(['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm',
       'Species'],
      dtype='object')

In [32]:
# Separate the target from the numerical features.
# 'Species' is the label (y), not a feature, so it is encoded on its own below
# instead of inside the ColumnTransformer, which only ever sees X.
target = 'Species'
numerical_features = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']

# Create a pipeline for numerical features
numerical_transformer = Pipeline(steps = [
    ('standardscaler', StandardScaler())
])

# Wrap the numerical pipeline in a column transformer
preprocessor = ColumnTransformer(transformers = [
    ('num', numerical_transformer, numerical_features)
])

In [33]:
# Choose Features (X) and Labels (y)
X = df.drop(target, axis = 1)

# Encode the class names as integers; recent XGBoost versions expect numeric labels
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(df[target])

# Split the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8, random_state = 42)


In [None]:
# Create a dictionary that maps each model name to a (model, hyperparameter grid) tuple
models = { 
          'LogisticRegression' : (LogisticRegression(), {}),

          'SVC' : (SVC(), {'kernel': ['rbf', 'poly', 'sigmoid']}),

          'DecisionTreeClassifier' : (DecisionTreeClassifier(random_state=42), {'max_depth': [None, 5, 10],'random_state': [42]}),

          'RandomForestClassifier' : (RandomForestClassifier(random_state=42), {'n_estimators': [10, 100],'random_state': [42],'max_depth': [None, 5, 10]}),

          # Note: n_neighbors cannot exceed the number of training samples in a CV fold
          # (e.g. 96 with 5-fold CV on 120 training rows); larger values fail at prediction time
          'KNeighborsClassifier' : (KNeighborsClassifier(), {'n_neighbors': np.arange(3, 100, 2)}),

          'GradientBoostingClassifier' : (GradientBoostingClassifier(random_state=42),{'n_estimators': [10, 100],'random_state': [42]}),

          'XGBClassifier' : (XGBClassifier(), {'n_estimators': [10, 100]}),  

          'AdaBoostClassifier': (AdaBoostClassifier(random_state=42), {'n_estimators': [10, 100],'random_state': [42]}),

          }
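
One way such a dictionary can be used to pick the best model is to run a grid search per entry and keep the one with the highest cross-validated score. The following is a minimal sketch under stated assumptions, not necessarily the exact procedure of this notebook: it uses scikit-learn's bundled Iris data so that it runs without the CSV file, a reduced hypothetical dictionary `candidate_models`, 5-fold CV, and accuracy as the selection metric.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# scikit-learn's bundled copy of Iris, so the sketch runs without the CSV
X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, train_size=0.8, random_state=42, stratify=y_iris)

# A reduced, hypothetical version of the dictionary above
candidate_models = {
    'SVC': (SVC(), {'kernel': ['rbf', 'poly', 'sigmoid']}),
    'DecisionTreeClassifier': (DecisionTreeClassifier(random_state=42),
                               {'max_depth': [None, 5, 10]}),
    'KNeighborsClassifier': (KNeighborsClassifier(), {'n_neighbors': [3, 5, 7]}),
}

best_name, best_score, best_estimator = None, -1.0, None
for name, (model, params) in candidate_models.items():
    pipe = Pipeline(steps=[('scaler', StandardScaler()), ('model', model)])
    # Prefix each key with 'model__' so the grid targets the pipeline's model step
    grid = {f'model__{key}': values for key, values in params.items()}
    search = GridSearchCV(pipe, grid, cv=5, scoring='accuracy')
    search.fit(X_tr, y_tr)
    print(f'{name}: CV accuracy = {search.best_score_:.3f}, params = {search.best_params_}')
    if search.best_score_ > best_score:
        best_name, best_score = name, search.best_score_
        best_estimator = search.best_estimator_

print('Best model:', best_name)
print('Test accuracy:', best_estimator.score(X_te, y_te))
```

The test set is touched only once, after the winner has been chosen on cross-validated scores, so the final accuracy remains an honest estimate.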