# **Day88 Main Assignment**
# Selecting the Best Model with the Best Hyperparameters

**Author:** Shahid Umar\
**Enrolled:** In Data Science and AI Course\
**Email:** shahidcontacts@gmail.com\
**Contact:** +923455516634


---
- ### <span style="color:pink">Start a timer; the elapsed time is later stored in the variable 'total_time' and converted to minutes and seconds</span>

In [1]:
import time
# Start time
start_time = time.time()

---
# <span style="color:yellow;">**1)  IMPORT LIBRARIES AND LOAD THE DATASET**</span>
---

- ### <span style="color:pink">Import the necessary libraries</span>

In [2]:
%%time
# The %%time line above is a Jupyter magic command that measures how long this cell takes to run

# Import Basic libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Import train/test split and label-encoding utilities
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Import regression algorithms and regression metrics
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.neighbors import KNeighborsRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Import GridSearchCV library for cross validation
from sklearn.model_selection import GridSearchCV

# Import preprocessors libraries
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# To remove warnings from output
import warnings
warnings.filterwarnings('ignore')

CPU times: total: 3.98 s
Wall time: 34 s
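
The `%%time` output reports two numbers: CPU time (time the processor spent on the cell) and wall time (elapsed real time). A wall time much larger than the CPU time, as above, usually means the cell spent most of its time waiting rather than computing. The following is a minimal sketch of the same measurement in plain Python; the helper name `timed` is hypothetical and not part of the notebook.

```python
import time

def timed(func, *args, **kwargs):
    """Run func and return (result, cpu_seconds, wall_seconds)."""
    cpu_start = time.process_time()    # CPU time used by this process
    wall_start = time.perf_counter()   # elapsed real ("wall clock") time
    result = func(*args, **kwargs)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return result, cpu, wall

# Sleeping costs wall time but almost no CPU time
result, cpu, wall = timed(time.sleep, 0.1)
print(f"CPU time: {cpu:.3f} s, wall time: {wall:.3f} s")
```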


In [3]:
pd.set_option('display.max_columns', None) # this is to display all the columns in the dataframe
pd.set_option('display.max_rows', None) # this is to display all the rows in the dataframe
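
Setting these options globally affects every later cell. As an alternative sketch, `pd.option_context` applies the same display options only inside a `with` block; the small frame `df_demo` is a hypothetical stand-in for the real data.

```python
import pandas as pd

df_demo = pd.DataFrame({"a": range(3), "b": range(3)})  # hypothetical toy frame

default_rows = pd.get_option("display.max_rows")

# Inside the block all rows and columns are shown; the defaults return afterwards
with pd.option_context("display.max_rows", None, "display.max_columns", None):
    inside_rows = pd.get_option("display.max_rows")
    print(df_demo)

restored_rows = pd.get_option("display.max_rows")
```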

- ### <span style="color:pink">Load the dataset for regression tasks</span>

In [4]:
%%time
# load dataset
df = sns.load_dataset('tips')
# This dataset is loaded for performing regression tasks

CPU times: total: 15.6 ms
Wall time: 348 ms


---
# <span style="color:yellow;">**2) DATA PREPROCESSING FOR REGRESSION**</span>
---

In [5]:
# Display top 5 rows of the dataset
df.head()

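|   | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |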


In [6]:
# To check the column names
df.columns

Index(['total_bill', 'tip', 'sex', 'smoker', 'day', 'time', 'size'], dtype='object')

In [7]:
# To check the dataset brief information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   total_bill  244 non-null    float64 
 1   tip         244 non-null    float64 
 2   sex         244 non-null    category
 3   smoker      244 non-null    category
 4   day         244 non-null    category
 5   time        244 non-null    category
 6   size        244 non-null    int64   
dtypes: category(4), float64(2), int64(1)
memory usage: 7.4 KB


- There are four categorical variables in the dataset: `sex`, `smoker`, `day`, and `time`

In [8]:
# To check null or missing values
df.isnull().sum()

total_bill    0
tip           0
sex           0
smoker        0
day           0
time          0
size          0
dtype: int64

---
# <span style="color:yellow;">**3) REGRESSION TASKS**</span>
---

- ### <span style="color:pink">Label encoding the categorical variables (Independent variables)</span>

In [9]:
%%time
# select features (X) and target variable (y)
X = df.drop('tip', axis=1) # Independent variables
y = df['tip'] # Dependent variable

# label encode categorical variables
le = LabelEncoder()
X['sex'] = le.fit_transform(X['sex'])
X['smoker'] = le.fit_transform(X['smoker'])
X['day'] = le.fit_transform(X['day'])
X['time'] = le.fit_transform(X['time'])

# fit_transform: This method fits a transformation model to the data and applies the transformation to the dataset, returning the transformed data.


CPU times: total: 0 ns
Wall time: 4 ms
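
Because the single `le` object is refitted on each column, only the mapping for the last column (`time`) is still available afterwards. A minimal sketch of an alternative, assuming one encoder per column is kept in a dictionary (the `encoders` dict and the toy frame are hypothetical, not part of the notebook):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame using category values that appear in the tips dataset
toy = pd.DataFrame({
    "sex": ["Female", "Male", "Male"],
    "day": ["Sun", "Thur", "Fri"],
})

encoders = {}           # column name -> fitted LabelEncoder
encoded = toy.copy()
for col in toy.columns:
    enc = LabelEncoder()
    encoded[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# classes_ lists the original labels in sorted order; a label's position is its integer code
for col, enc in encoders.items():
    print(col, dict(zip(enc.classes_, range(len(enc.classes_)))))

# The stored encoder can map codes back to labels
decoded_days = encoders["day"].inverse_transform(encoded["day"])
```

Keeping the fitted encoders makes it possible to encode new rows the same way at prediction time.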


---
# <span style="color:yellow">**3.1) HYPERPARAMETER TUNING FOR REGRESSION MODELS**</span>
---

- ### <span style="color:pink">Split the data into training (80%) and test (20%) sets and select the best model by evaluating `regression metrics`</span>
- ### <span style="color:pink">Choose the best model with a *`for loop`*</span>

In [10]:
# split the data into train and test data with 80% training dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Create a dictionary of models with hyperparameters to evaluate
models = { 
          'LinearRegression' : (LinearRegression(), {}),
          'SVR' : (SVR(), {'kernel': ['rbf', 'poly', 'sigmoid']}),
          'DecisionTreeRegressor' : (DecisionTreeRegressor(), {'max_depth': [None, 5, 10]}),
          'RandomForestRegressor' : (RandomForestRegressor(), {'n_estimators': [10, 100]}),
          'KNeighborsRegressor' : (KNeighborsRegressor(), {'n_neighbors': np.arange(3, 100, 2)}),
          'GradientBoostingRegressor' : (GradientBoostingRegressor(), {'n_estimators': [10, 100]}),
          'XGBRegressor' : (XGBRegressor(), {'n_estimators': [10, 100]}),          
          }

# Initialize variables to track the best model and its performance
best_regression_model = None
best_mse = float('inf')
best_r2 = -float('inf')
best_mae = float('inf')
# float('inf') is used here because the code is initializing best_mse to a value that is guaranteed to be larger than any other real number.

# Iterate over each model, train, predict, and evaluate performance metrics
for name, (model, params) in models.items():
    # Wrap the model in GridSearchCV (stored in the variable `pipeline`, although it is a grid search, not a Pipeline)
    pipeline = GridSearchCV(model, params, cv=5) # 5-fold cross-validation
    
    # Fit the pipeline on the training data
    pipeline.fit(X_train, y_train)
    
    # Make predictions on the test data
    y_pred = pipeline.predict(X_test)
    
    # Calculate evaluation metrics
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    
    # Print the performance metrics
    print(name, 'MSE:', mse)
    print(name, 'R2:', r2)
    print(name, 'MAE:', mae)
    print()
    
    # Check if this model has better performance
    if mse < best_mse:
        best_regression_model = pipeline
        best_mse = mse
        best_r2 = r2
        best_mae = mae

# Print the best model's performance metrics
print('Best Regression Model:', best_regression_model.best_estimator_)
print('Best MSE:', best_mse)
print('Best R2:', best_r2)
print('Best MAE:', best_mae)

LinearRegression MSE: 0.9010765093466211
LinearRegression R2: 0.2688449578261525
LinearRegression MAE: 0.7030677287148895

SVR MSE: 0.722510769711588
SVR R2: 0.4137374719904898
SVR MAE: 0.6353762826521168

DecisionTreeRegressor MSE: 1.1103059656945642
DecisionTreeRegressor R2: 0.09907117014743683
DecisionTreeRegressor MAE: 0.8256388209661746

RandomForestRegressor MSE: 0.798906332244899
RandomForestRegressor R2: 0.3517482844281148
RandomForestRegressor MAE: 0.7049877551020411

KNeighborsRegressor MSE: 0.7213128467265923
KNeighborsRegressor R2: 0.4147094953664522
KNeighborsRegressor MAE: 0.647888198757764

GradientBoostingRegressor MSE: 0.6662961023916425
GradientBoostingRegressor R2: 0.4593513982539841
GradientBoostingRegressor MAE: 0.6740958714953833

XGBRegressor MSE: 0.7565200490013941
XGBRegressor R2: 0.3861415289429111
XGBRegressor MAE: 0.6937729422900142

Best Regression Model: GradientBoostingRegressor(n_estimators=10)
Best MSE: 0.6662961023916425
Best R2: 0.4593513982539841
Best MAE: 0.6740958714953833
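
The three metrics printed above can be written out by hand, which may help when reading the numbers: MSE is the mean of squared errors, MAE is the mean of absolute errors, and R2 is one minus the ratio of residual to total sum of squares. The sketch below uses NumPy and hypothetical values (not taken from the tips data); `regression_metrics` is an illustrative helper, not a scikit-learn function.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (mse, mae, r2) computed from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mse = np.mean(errors ** 2)                       # mean squared error
    mae = np.mean(np.abs(errors))                    # mean absolute error
    ss_res = np.sum(errors ** 2)                     # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mse, mae, r2

# Hypothetical values for illustration only
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]
mse, mae, r2 = regression_metrics(y_true, y_pred)
print(mse, mae, r2)
```

Lower MSE and MAE are better, while higher R2 is better, which is why the loop initializes `best_mse` to `float('inf')` and `best_r2` to `-float('inf')`.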

---
# <span style="color:yellow;">**4) CLASSIFICATION TASKS**</span>
---

In [11]:
# !pip install lightgbm -q

- ### <span style="color:pink">Import the necessary libraries for classification and load the dataset</span>

In [12]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier
import lightgbm as lgb
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

- ### <span style="color:pink">Split the data into train and test data</span>

In [13]:
# split the data into train and test data with 80% training dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

- ### <span style="color:pink">Generate a dictionary of classifier models</span>

In [14]:
# Create a dictionary of classifiers to evaluate
classifiers = {
    'Logistic Regression': LogisticRegression(),
    'Decision Tree': DecisionTreeClassifier(),
    'SVM': SVC(),
    'Random Forest': RandomForestClassifier(),
    'GradientBoosting': GradientBoostingClassifier(),
    'AdaBoost': AdaBoostClassifier(),
    'XGBoost' : XGBClassifier(),
    #'LightGBM': lgb.LGBMClassifier(),
    'KNN': KNeighborsClassifier(),
    'NaiveBayes': GaussianNB(),
}

- ### <span style="color:pink">Perform Cross-Validation with respect to mean accuracy</span>

In [15]:
# Perform k-fold cross-validation and calculate the mean accuracy
kfold = KFold(n_splits=10, shuffle=True, random_state=42)

for name, classifier in classifiers.items():
    scores = cross_val_score(classifier, X, y, cv=kfold)
    accuracy = np.mean(scores)
    print("Classifier:", name)
    print("Mean Accuracy:", accuracy)
    print()

Classifier: Logistic Regression
Mean Accuracy: 0.9733333333333334

Classifier: Decision Tree
Mean Accuracy: 0.9466666666666667

Classifier: SVM
Mean Accuracy: 0.9666666666666668

Classifier: Random Forest
Mean Accuracy: 0.9600000000000002

Classifier: GradientBoosting
Mean Accuracy: 0.9533333333333334

Classifier: AdaBoost
Mean Accuracy: 0.9466666666666667

Classifier: XGBoost
Mean Accuracy: 0.9400000000000001

Classifier: KNN
Mean Accuracy: 0.9733333333333334

Classifier: NaiveBayes
Mean Accuracy: 0.9600000000000002
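
To make the 10-fold split concrete, the sketch below shows what `KFold` hands to `cross_val_score`: ten train/test index pairs, where each sample lands in exactly one test fold. `X_demo` is a hypothetical stand-in with the same number of rows as Iris (150).

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(150).reshape(-1, 1)  # stand-in for 150 samples

kfold = KFold(n_splits=10, shuffle=True, random_state=42)

fold_sizes = []   # (train size, test size) for each fold
seen = []         # every index that appeared in some test fold
for train_idx, test_idx in kfold.split(X_demo):
    fold_sizes.append((len(train_idx), len(test_idx)))
    seen.extend(test_idx)

print(fold_sizes[0])
```

Each reported mean accuracy is therefore an average over ten models, each trained on 135 samples and scored on the remaining 15.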



- ### <span style="color:pink">**Method-1** - Select the best Classification model</span>

In [16]:
best_classification_model_1 = {}

for name, classifier in classifiers.items():
    scores = cross_val_score(classifier, X_train, y_train, cv=kfold)
    best_classification_model_1[name] = scores.mean()

# Pick the name of the classifier with the highest mean cross-validation score
best_classifier = max(best_classification_model_1, key=best_classification_model_1.get)
print("Best Classification Model: ", best_classifier)

Best Classification Model:  KNN
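
Method-1 yields the *name* of the best classifier and a dictionary of scores, not a fitted estimator. If a usable model is wanted, one option is to refit the winning classifier on the training data. The following is a sketch under that assumption, with only two candidates for brevity; `candidates`, `mean_scores`, and `max_iter=1000` are choices made for this example.

```python
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),  # max_iter raised for this sketch
    "KNN": KNeighborsClassifier(),
}
kfold = KFold(n_splits=10, shuffle=True, random_state=42)

# Mean cross-validation accuracy per candidate, computed on the training data only
mean_scores = {
    name: cross_val_score(clf, X_train, y_train, cv=kfold).mean()
    for name, clf in candidates.items()
}

# Name with the highest mean score, then a fresh copy fitted on the training set
best_name = max(mean_scores, key=mean_scores.get)
best_model = clone(candidates[best_name]).fit(X_train, y_train)
test_accuracy = best_model.score(X_test, y_test)
print(best_name, test_accuracy)
```

The fitted `best_model` can then be pickled and used for prediction, unlike the scores dictionary.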


- ### <span style="color:pink">**Method-2** - Select the best Classification model</span>

In [17]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
import lightgbm as lgb
from sklearn.linear_model import LogisticRegression


# Split the data into train and test data with 80% training dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a dictionary of models with hyperparameters to evaluate
models = { 
    'Logistic Regression': (LogisticRegression(), {}),
    'Decision Tree': (DecisionTreeClassifier(), {'max_depth': [None, 5, 10]}),
    'SVM': (SVC(), {'kernel': ['linear', 'rbf', 'poly']}),
    'Random Forest': (RandomForestClassifier(), {'n_estimators': [10, 100]}),
    'KNN': (KNeighborsClassifier(), {'n_neighbors': np.arange(3, 100, 2)}),
    'GradientBoosting': (GradientBoostingClassifier(), {'n_estimators': [10, 100]}),
    'XGBoost': (XGBClassifier(), {'n_estimators': [10, 100]}),
    #'LightGBM': (lgb.LGBMClassifier(), {}),
    'AdaBoost': (AdaBoostClassifier(), {}),
    'Naive Bayes': (GaussianNB(), {}),
}

# Initialize variables to track the best model and its performance
best_classification_model_2 = None
best_accuracy = -float('inf')
best_precision = -float('inf')
best_recall = -float('inf')
best_f1 = -float('inf')

# Iterate over each model, train, predict, and evaluate performance metrics
for name, (model, params) in models.items():
    # Wrap the model in GridSearchCV (stored in the variable `pipeline`, although it is a grid search, not a Pipeline)
    pipeline = GridSearchCV(model, params, cv=5) # 5-fold cross-validation
    
    # Fit the pipeline on the training data
    pipeline.fit(X_train, y_train)
    
    # Make predictions on the test data
    y_pred = pipeline.predict(X_test)
    
    # Calculate evaluation metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='weighted')
    recall = recall_score(y_test, y_pred, average='weighted')
    f1 = f1_score(y_test, y_pred, average='weighted')
    
    # Print the performance metrics
    print(name, 'Accuracy:', accuracy)
    print(name, 'Precision:', precision)
    print(name, 'Recall:', recall)
    print(name, 'F1:', f1)
    print()
    
    # Check if this model has better performance
    if accuracy > best_accuracy:
        best_classification_model_2 = pipeline
        best_accuracy = accuracy
        best_precision = precision
        best_recall = recall
        best_f1 = f1

# Print the best model's performance metrics
print('Best Classification Model:', best_classification_model_2.best_estimator_)
print('Best Accuracy:', best_accuracy)
print('Best Precision:', best_precision)
print('Best Recall:', best_recall)
print('Best F1:', best_f1)

Logistic Regression Accuracy: 1.0
Logistic Regression Precision: 1.0
Logistic Regression Recall: 1.0
Logistic Regression F1: 1.0

Decision Tree Accuracy: 1.0
Decision Tree Precision: 1.0
Decision Tree Recall: 1.0
Decision Tree F1: 1.0

SVM Accuracy: 1.0
SVM Precision: 1.0
SVM Recall: 1.0
SVM F1: 1.0

Random Forest Accuracy: 1.0
Random Forest Precision: 1.0
Random Forest Recall: 1.0
Random Forest F1: 1.0

KNN Accuracy: 1.0
KNN Precision: 1.0
KNN Recall: 1.0
KNN F1: 1.0

GradientBoosting Accuracy: 1.0
GradientBoosting Precision: 1.0
GradientBoosting Recall: 1.0
GradientBoosting F1: 1.0

XGBoost Accuracy: 1.0
XGBoost Precision: 1.0
XGBoost Recall: 1.0
XGBoost F1: 1.0

AdaBoost Accuracy: 1.0
AdaBoost Precision: 1.0
AdaBoost Recall: 1.0
AdaBoost F1: 1.0

Naive Bayes Accuracy: 1.0
Naive Bayes Precision: 1.0
Naive Bayes Recall: 1.0
Naive Bayes F1: 1.0

Best Classification Model: LogisticRegression()
Best Accuracy: 1.0
Best Precision: 1.0
Best Recall: 1.0
Best F1: 1.0
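
In the run above every model reaches 1.0 on the 30-sample test set, so the `accuracy > best_accuracy` check simply keeps the first model in the dictionary. One possible way to separate tied models is to compare the cross-validation score that `GridSearchCV` already computes on the training data (`best_score_`) together with the chosen hyperparameters (`best_params_`). The sketch below assumes that approach and uses two of the models for brevity; `random_state=42` on the tree is an addition for repeatability.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Decision Tree": (DecisionTreeClassifier(random_state=42), {"max_depth": [None, 5, 10]}),
    "SVM": (SVC(), {"kernel": ["linear", "rbf", "poly"]}),
}

results = {}  # name -> (mean CV accuracy of best setting, best hyperparameters)
for name, (model, params) in models.items():
    search = GridSearchCV(model, params, cv=5)  # 5-fold cross-validation
    search.fit(X_train, y_train)
    results[name] = (search.best_score_, search.best_params_)
    print(name, round(search.best_score_, 4), search.best_params_)

# Model whose best hyperparameter setting has the highest mean CV accuracy
best_name = max(results, key=lambda n: results[n][0])
print("Best by CV score:", best_name)
```

Selecting on the cross-validation score also keeps the test set untouched until the final evaluation.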


---
# <span style="color:yellow;">**5) SAVE THE MODELS**</span>
---

In [18]:
# Import library for saving (serializing) and loading models
import pickle

1. - ### <span style="color:pink">Save the Regression Model</span>

In [19]:
with open('best_regression_model.pkl', 'wb') as f:  # wb means write binary; the file is closed automatically
    pickle.dump(best_regression_model, f)

2. - ### <span style="color:pink">Save the Classification Model of **Method-1**</span>

In [20]:
with open('best_classification_model_1.pkl', 'wb') as f:
    pickle.dump(best_classification_model_1, f)  # note: this object is a dict of mean CV scores, not a fitted estimator

3. - ### <span style="color:pink">Save the Classification Model of **Method-2**</span>

In [21]:
with open('best_classification_model_2.pkl', 'wb') as f:
    pickle.dump(best_classification_model_2, f)

---
# <span style="color:yellow;">**6) LOAD THE MODELS**</span>
---

1. - ### <span style="color:pink">Load the Regression Model</span>

In [22]:
with open('best_regression_model.pkl', 'rb') as f:
    regression_model_load = pickle.load(f)

2. - ### <span style="color:pink">Load the Classification Model-1</span>

In [23]:
with open('best_classification_model_1.pkl', 'rb') as f:
    classification_model_1_load = pickle.load(f)
# Here rb means read binary

3. - ### <span style="color:pink">Load the Classification Model-2</span>

In [24]:
with open('best_classification_model_2.pkl', 'rb') as f:
    classification_model_2_load = pickle.load(f)
# Here rb means read binary
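
A quick way to check that saving and loading worked is a round trip: pickle a fitted model, load it back, and confirm both give the same predictions. This is a self-contained sketch on Iris with a hypothetical file name (`demo_model.pkl`) written to a temporary folder, so it does not touch the files saved above.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier().fit(X, y)

# Hypothetical file name inside a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), "demo_model.pkl")

# "wb" = write binary; the with-block closes the file automatically
with open(path, "wb") as f:
    pickle.dump(model, f)

# "rb" = read binary
with open(path, "rb") as f:
    loaded = pickle.load(f)

# The loaded model should reproduce the original predictions exactly
same_predictions = np.array_equal(model.predict(X), loaded.predict(X))
print(same_predictions)
```

Only load pickle files from sources you trust, since unpickling can execute arbitrary code.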

---
# <span style="color:yellow;">**7) PREDICT WITH THE LOADED MODEL**</span>
---

In [25]:
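# new_X must contain numeric values in the same column order used for training
# Example: first row of the tips data, label-encoded (LabelEncoder sorts classes
# alphabetically, so Female=0, No=0, Sun=2, Dinner=0)
new_X = pd.DataFrame([[16.99, 0, 0, 2, 0, 2]],
                     columns=['total_bill', 'sex', 'smoker', 'day', 'time', 'size'])

predictions = regression_model_load.predict(new_X)

# Print the predictions
print(predictions)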
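
The regression model was trained on numeric, label-encoded columns, so it cannot accept raw strings at prediction time. One way to avoid hand-encoding new rows is to make the encoding part of the model itself, using the `Pipeline` and `ColumnTransformer` classes imported at the top of the notebook. The sketch below assumes that design and is not what the notebook does: it uses `OrdinalEncoder`, a plain `LinearRegression`, and only the five rows shown by `df.head()` as training data, so the predicted value is illustrative only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# The five rows shown by df.head(); far too few for a real model
train = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59],
    "sex": ["Female", "Male", "Male", "Male", "Female"],
    "smoker": ["No"] * 5,
    "day": ["Sun"] * 5,
    "time": ["Dinner"] * 5,
    "size": [2, 3, 3, 2, 4],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61],
})

categorical = ["sex", "smoker", "day", "time"]
X = train.drop(columns="tip")
y = train["tip"]

# Encode the categorical columns, pass the numeric ones through unchanged
preprocess = ColumnTransformer(
    [("cat", OrdinalEncoder(), categorical)],
    remainder="passthrough",
)
model = Pipeline([("preprocess", preprocess), ("regressor", LinearRegression())])
model.fit(X, y)

# A new row can now be given with its original string values
new_X = pd.DataFrame([{
    "total_bill": 20.0, "sex": "Male", "smoker": "No",
    "day": "Sun", "time": "Dinner", "size": 2,
}])
prediction = model.predict(new_X)
print(prediction)
```

Pickling such a pipeline saves the encoder and the regressor together, so the loaded object accepts rows in their original form.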

# **Main Assignment:**

## Write the complete code to select the best regressor and classifier for the given dataset called diamonds `(if you have a high-end machine, you can use the whole dataset; otherwise use the sample dataset provided in the link)`, or use the Tips dataset for the regression task and the Iris dataset for the classification task.

## Try all feasible models with their best or candidate hyperparameters, compare them with each other, and select the best model for the given dataset.

## Your code should be complete and properly explained. Each step of the code should be commented so that a layman can follow it.

## Your code should also save the best model in a pickle file.

## In the last snippet of the code, also write the code to load the pickle file and use it for prediction.

## Submit your assignment to the Discord inbox. (Do not share the link to your notebook; just upload the notebook in the Discord inbox.) Do not share the notebook in public channels on our Discord server.


# **Deadline for Submission:**

## `29th December before 09:30 pm Pakistan time. (No late submission will be accepted).`


In [26]:
# End time
end_time = time.time()

# Calculate the total run time
total_time = end_time - start_time

# Print the total run time in seconds
print("Total run time: {:.2f} seconds".format(total_time))

Total run time: 106.70 seconds


In [28]:
# 'total_time' holds the run time in seconds
# Convert seconds to minutes and seconds
minutes, seconds = divmod(total_time, 60)
# Format the time as "mm:ss"
time_format = "{:02d}:{:02d}".format(int(minutes), int(seconds))
# Print the formatted time
print("Total run time (mm:ss): {}".format(time_format))

Total run time (mm:ss): 01:46
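
The conversion above can be wrapped in a small reusable function. This is a minimal sketch; the name `format_mmss` is hypothetical. `divmod(n, 60)` returns the quotient (whole minutes) and the remainder (leftover seconds) in one call.

```python
def format_mmss(total_seconds):
    """Format a duration in seconds as 'mm:ss', dropping fractional seconds."""
    minutes, seconds = divmod(int(total_seconds), 60)
    return "{:02d}:{:02d}".format(minutes, seconds)

print(format_mmss(106.70))
```

For the run time reported above (106.70 seconds) this gives `01:46`, matching the notebook output.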
