# **Day88 Main Assignment**
# Selecting the Best Model with the Best Hyperparameters

**Author:** Shahid Umar\
**Enrolled:** In Data Science and AI Course\
**Email:** shahidcontacts@gmail.com\
**Contact:** +923455516634


---
- ### <span style="color:pink">Record the start time; the total run time is computed at the end of the notebook and stored in the variable 'total_time'</span>

In [2]:
import time
# Start time
start_time = time.time()

- ### <span style="color:pink">Import the necessary libraries</span>

In [3]:
%%time
# %%time is an IPython cell magic that measures how long this cell takes to run

# Import Basic libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Import train/test splitting and label-encoding utilities
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Import regression algorithms and evaluation metrics
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Import GridSearchCV library for cross validation
from sklearn.model_selection import GridSearchCV

# Import preprocessing utilities (scalers, column transformer, pipeline)
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# To remove warnings from output
import warnings
warnings.filterwarnings('ignore')

CPU times: total: 2.94 s
Wall time: 3.29 s


- ### <span style="color:pink">Load the dataset for regression tasks</span>

In [4]:
%%time
# load dataset
df = sns.load_dataset('tips')
# This dataset is loaded for performing regression tasks

CPU times: total: 15.6 ms
Wall time: 27 ms


---
# <span style="color:yellow;">**DATA PREPROCESSING**</span>
---

In [5]:
# Display top 5 rows of the dataset
df.head()

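|   | total_bill | tip  | sex    | smoker | day | time   | size |
|---|------------|------|--------|--------|-----|--------|------|
| 0 | 16.99      | 1.01 | Female | No     | Sun | Dinner | 2    |
| 1 | 10.34      | 1.66 | Male   | No     | Sun | Dinner | 3    |
| 2 | 21.01      | 3.5  | Male   | No     | Sun | Dinner | 3    |
| 3 | 23.68      | 3.31 | Male   | No     | Sun | Dinner | 2    |
| 4 | 24.59      | 3.61 | Female | No     | Sun | Dinner | 4    |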


In [6]:
# To check the column names
df.columns

Index(['total_bill', 'tip', 'sex', 'smoker', 'day', 'time', 'size'], dtype='object')

In [7]:
# To check the dataset brief information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   total_bill  244 non-null    float64 
 1   tip         244 non-null    float64 
 2   sex         244 non-null    category
 3   smoker      244 non-null    category
 4   day         244 non-null    category
 5   time        244 non-null    category
 6   size        244 non-null    int64   
dtypes: category(4), float64(2), int64(1)
memory usage: 7.4 KB


- There are four categorical variables in the dataset: `sex`, `smoker`, `day`, and `time`.

In [8]:
# To check null or missing values
df.isnull().sum()

total_bill    0
tip           0
sex           0
smoker        0
day           0
time          0
size          0
dtype: int64

---
# <span style="color:yellow;">**REGRESSION TASKS**</span>
---

- ### <span style="color:pink">Label encoding the categorical variables (independent variables)</span>

In [9]:
%%time
# select the features (X) and the target (y)
X = df.drop('tip', axis=1) # Independent variables
y = df['tip'] # Dependent variable

# label encode categorical variables
le = LabelEncoder()
X['sex'] = le.fit_transform(X['sex'])
X['smoker'] = le.fit_transform(X['smoker'])
X['day'] = le.fit_transform(X['day'])
X['time'] = le.fit_transform(X['time'])

# fit_transform: This method fits a transformation model to the data and applies the transformation to the dataset, returning the transformed data.


CPU times: total: 0 ns
Wall time: 4 ms
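
The cell above reuses a single `LabelEncoder` for all four columns, so after the last `fit_transform` only the mapping for `time` is still available on `le`. The following is a minimal sketch, on a toy frame with illustrative values rather than the real tips data, of how `fit_transform` assigns integer codes (labels are sorted alphabetically) and how keeping one encoder per column preserves each mapping. The `encoders` dictionary and the toy values are assumptions made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with two of the categorical columns (values are illustrative only)
toy = pd.DataFrame({
    "day": ["Sun", "Sat", "Thur", "Fri"],
    "time": ["Dinner", "Dinner", "Lunch", "Lunch"],
})

# Keep one fitted encoder per column so every mapping can be inspected or inverted later
encoders = {}
encoded = toy.copy()
for col in toy.columns:
    enc = LabelEncoder()
    encoded[col] = enc.fit_transform(toy[col])  # learn the labels, then return integer codes
    encoders[col] = enc

day_classes = list(encoders["day"].classes_)  # labels in sorted order; position = code
day_codes = encoded["day"].tolist()           # integer code for each row
day_restored = list(encoders["day"].inverse_transform(encoded["day"]))  # codes back to labels

print(day_classes)
print(day_codes)
print(day_restored)
```

Because the codes follow alphabetical order, they carry no real ordering of the days; tree-based models tolerate this, while distance-based or linear models may be better served by one-hot encoding.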


- ### <span style="color:pink">Split the data into training (80%) and test (20%) sets and identify the best model using `regression metrics`</span>

1. Choosing the best model by the **`MAE`** metric

In [10]:
%%time
# split the data into train and test data with 80% training dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create a dictionary of models to evaluate
models = { 
          'LinearRegression' : LinearRegression(),
          'SVR' : SVR(),
          'DecisionTreeRegressor' : DecisionTreeRegressor(),
          'RandomForestRegressor' : RandomForestRegressor(),
          'KNeighborsRegressor' : KNeighborsRegressor(),
          'GradientBoostingRegressor' : GradientBoostingRegressor(),
          'XGBRegressor' : XGBRegressor()          
          }

# train and predict each model with evaluation metrics

model_scores = [] # create an empty list
for name, model in models.items():
    # fit each model from models on training data
    model.fit(X_train, y_train)
    
    # make prediction from each model
    y_pred = model.predict(X_test)
    metric = mean_absolute_error(y_test, y_pred) # score the model by its MAE on the test set
    model_scores.append((name, metric))
    
# rank the models by the metric (lowest error first)
sorted_models = sorted(model_scores, key=lambda x: x[1], reverse=False)
for model in sorted_models:
    print('Mean Absolute Error: ', f"{model[0]} is {model[1]: .2f}") 

Mean Absolute Error:  SVR is  0.57
Mean Absolute Error:  LinearRegression is  0.67
Mean Absolute Error:  XGBRegressor is  0.67
Mean Absolute Error:  KNeighborsRegressor is  0.73
Mean Absolute Error:  GradientBoostingRegressor is  0.73
Mean Absolute Error:  RandomForestRegressor is  0.79
Mean Absolute Error:  DecisionTreeRegressor is  0.95
CPU times: total: 688 ms
Wall time: 665 ms


2. Choosing the best model by the **`R-squared score`** metric

In [11]:
%%time
# split the data into train and test data with 80% training dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create a dictionary of models to evaluate
models = { 
          'LinearRegression' : LinearRegression(),
          'SVR' : SVR(),
          'DecisionTreeRegressor' : DecisionTreeRegressor(),
          'RandomForestRegressor' : RandomForestRegressor(),
          'KNeighborsRegressor' : KNeighborsRegressor(),
          'GradientBoostingRegressor' : GradientBoostingRegressor(),
          'XGBRegressor' : XGBRegressor()          
          }

# loop over the models: train each one, predict on the test set, and compute the metric

model_scores = []
for name, model in models.items():
    # fit each model from models on training data
    model.fit(X_train, y_train)
    
    # make prediction from each model
    y_pred = model.predict(X_test)
    metric = r2_score(y_test, y_pred)
    model_scores.append((name, metric))

# rank the models by the metric (highest R-squared first)
sorted_models = sorted(model_scores, key=lambda x: x[1], reverse=True)
for model in sorted_models:
    print('R_squared Score', f"{model[0]} is {model[1]: .2f}") 

R_squared Score SVR is  0.57
R_squared Score LinearRegression is  0.44
R_squared Score XGBRegressor is  0.41
R_squared Score GradientBoostingRegressor is  0.35
R_squared Score KNeighborsRegressor is  0.33
R_squared Score RandomForestRegressor is  0.23
R_squared Score DecisionTreeRegressor is -0.18
CPU times: total: 938 ms
Wall time: 711 ms
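
The negative score for `DecisionTreeRegressor` in the output above is not a bug: R-squared is defined as `1 - SS_res / SS_tot`, so any model whose squared error on the test set exceeds that of simply predicting the mean of `y_test` scores below zero. A small worked sketch with made-up numbers (the helper name `r2_manual` is hypothetical) illustrates the three regimes:

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, the standard coefficient of determination."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # squared error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # squared error of predicting the mean
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]                 # made-up targets, mean is 2.5

r2_perfect = r2_manual(y_true, y_true)        # perfect predictions
r2_mean = r2_manual(y_true, [2.5] * 4)        # always predict the mean
r2_bad = r2_manual(y_true, [4.0, 3.0, 2.0, 1.0])  # worse than predicting the mean

print(r2_perfect, r2_mean, r2_bad)
```

Here `SS_tot` is 5, so the reversed predictions (with `SS_res` of 20) give `1 - 20/5 = -3`.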


3. Choosing the best model by the **`MSE`** metric

In [12]:
%%time
# split the data into train and test data with 80% training dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create a dictionary of models to evaluate
models = { 
          'LinearRegression' : LinearRegression(),
          'SVR' : SVR(),
          'DecisionTreeRegressor' : DecisionTreeRegressor(),
          'RandomForestRegressor' : RandomForestRegressor(),
          'KNeighborsRegressor' : KNeighborsRegressor(),
          'GradientBoostingRegressor' : GradientBoostingRegressor(),
          'XGBRegressor' : XGBRegressor()          
          }

# loop over the models: train each one, predict on the test set, and compute the metric

model_scores = []
for name, model in models.items():
    # fit each model from models on training data
    model.fit(X_train, y_train)
    
    # make prediction from each model
    y_pred = model.predict(X_test)
    metric = mean_squared_error(y_test, y_pred)
    model_scores.append((name, metric))

# rank the models by the metric (lowest error first)
sorted_models = sorted(model_scores, key=lambda x: x[1], reverse=False)
for model in sorted_models:
    print('Mean Squared Error: ', f"{model[0]} is {model[1]: .2f}") 

Mean Squared Error:  SVR is  0.54
Mean Squared Error:  LinearRegression is  0.69
Mean Squared Error:  XGBRegressor is  0.74
Mean Squared Error:  GradientBoostingRegressor is  0.80
Mean Squared Error:  KNeighborsRegressor is  0.84
Mean Squared Error:  RandomForestRegressor is  0.92
Mean Squared Error:  DecisionTreeRegressor is  1.30
CPU times: total: 906 ms
Wall time: 596 ms


4. Choosing the best model by the **`RMSE`** metric

In [13]:
%%time
# split the data into train and test data with 80% training dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create a dictionary of models to evaluate
models = { 
          'LinearRegression' : LinearRegression(),
          'SVR' : SVR(),
          'DecisionTreeRegressor' : DecisionTreeRegressor(),
          'RandomForestRegressor' : RandomForestRegressor(),
          'KNeighborsRegressor' : KNeighborsRegressor(),
          'GradientBoostingRegressor' : GradientBoostingRegressor(),
          'XGBRegressor' : XGBRegressor()          
          }

# loop over the models: train each one, predict on the test set, and compute the metric

model_scores = []
for name, model in models.items():
    # fit each model from models on training data
    model.fit(X_train, y_train)
    
    # make prediction from each model
    y_pred = model.predict(X_test)
    metric = np.sqrt(mean_squared_error(y_test, y_pred))
    model_scores.append((name, metric))
    
# rank the models by the metric (lowest error first)
sorted_models = sorted(model_scores, key=lambda x: x[1], reverse=False)
for model in sorted_models:
    print('Root Mean Squared Error: ', f"{model[0]} is {model[1]: .2f}") 

Root Mean Squared Error:  SVR is  0.73
Root Mean Squared Error:  LinearRegression is  0.83
Root Mean Squared Error:  XGBRegressor is  0.86
Root Mean Squared Error:  GradientBoostingRegressor is  0.90
Root Mean Squared Error:  KNeighborsRegressor is  0.92
Root Mean Squared Error:  RandomForestRegressor is  0.97
Root Mean Squared Error:  DecisionTreeRegressor is  1.20
CPU times: total: 656 ms
Wall time: 590 ms
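
The four cells above refit every model once per metric. A possible consolidation, sketched below, fits each model once and collects all four metrics in a single table. The data here is a synthetic stand-in (an assumption, so the block runs on its own); in the notebook the encoded `X` and `y` from the tips dataset would be used instead, and only three of the seven models are shown to keep the sketch short.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): replace with the encoded tips X and y
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

demo_models = {
    "LinearRegression": LinearRegression(),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=42),
    "KNeighborsRegressor": KNeighborsRegressor(),
}

rows = []
for name, model in demo_models.items():
    # Fit once, predict once, then derive every metric from the same predictions
    y_hat = model.fit(X_tr, y_tr).predict(X_te)
    mse = mean_squared_error(y_te, y_hat)
    rows.append({
        "model": name,
        "MAE": mean_absolute_error(y_te, y_hat),
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "R2": r2_score(y_te, y_hat),
    })

# One row per model, sorted so the lowest MSE comes first
results = pd.DataFrame(rows).set_index("model").sort_values("MSE")
print(results.round(2))
```

Fitting once per model also keeps the metrics comparable for randomised estimators, which otherwise produce a slightly different model in every cell.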


- <span style="color:pink">Choosing the best model with a *`for loop`* over grid searches</span>

In [18]:
# Create a dictionary of models with hyperparameters to evaluate
models = { 
          'LinearRegression' : (LinearRegression(), {}),
          'SVR' : (SVR(), {'kernel': ['rbf', 'poly', 'sigmoid']}),
          'DecisionTreeRegressor' : (DecisionTreeRegressor(), {'max_depth': [None, 5, 10]}),
          'RandomForestRegressor' : (RandomForestRegressor(), {'n_estimators': [10, 100]}),
          'KNeighborsRegressor' : (KNeighborsRegressor(), {'n_neighbors': np.arange(3, 100, 2)}),
          'GradientBoostingRegressor' : (GradientBoostingRegressor(), {'n_estimators': [10, 100]}),
          'XGBRegressor' : (XGBRegressor(), {'n_estimators': [10, 100]}),          
          }

# Initialize variables to track the best model and its performance
best_model = None
best_mse = float('inf')
best_r2 = -float('inf')
best_mae = float('inf')
# float('inf') is used here because the code is initializing best_mse to a value that is guaranteed to be larger than any other real number.

# Iterate over each model, train, predict, and evaluate performance metrics
for name, (model, params) in models.items():
    # Wrap the model in a grid search (this is a GridSearchCV object, not an sklearn Pipeline, despite the variable name)
    pipeline = GridSearchCV(model, params, cv=5) # 5-fold cross-validation
    
    # Fit the grid search on the training data
    pipeline.fit(X_train, y_train)
    
    # Make predictions on the test data
    y_pred = pipeline.predict(X_test)
    
    # Calculate evaluation metrics
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    
    # Print the performance metrics
    print(name, 'MSE:', mse)
    print(name, 'R2:', r2)
    print(name, 'MAE:', mae)
    print()
    
    # Keep this model if its test MSE is lower than the best seen so far
    if mse < best_mse:
        best_model = pipeline
        best_mse = mse
        best_r2 = r2
        best_mae = mae

# Print the best model's performance metrics
print('Best Model:', best_model.best_estimator_)
print('Best MSE:', best_mse)
print('Best R2:', best_r2)
print('Best MAE:', best_mae)

LinearRegression MSE: 0.694812968628771
LinearRegression R2: 0.4441368826121932
LinearRegression MAE: 0.6703807496461157

SVR MSE: 1.460718141299992
SVR R2: -0.1686013018011976
SVR MAE: 0.8935334948775431

DecisionTreeRegressor MSE: 0.8774153020453993
DecisionTreeRegressor R2: 0.298051667053291
DecisionTreeRegressor MAE: 0.7189481629481629

RandomForestRegressor MSE: 0.9266166475510215
RandomForestRegressor R2: 0.25868968832338335
RandomForestRegressor MAE: 0.7701653061224492

KNeighborsRegressor MSE: 0.6640950568462677
KNeighborsRegressor R2: 0.4687117753876745
KNeighborsRegressor MAE: 0.6203721488595437

GradientBoostingRegressor MSE: 0.8106801524004928
GradientBoostingRegressor R2: 0.351441010654877
GradientBoostingRegressor MAE: 0.7657809818712309

XGBRegressor MSE: 0.6624107100882575
XGBRegressor R2: 0.4700592836840687
XGBRegressor MAE: 0.6549163442728472

Best Model: XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=
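
The loop above picks the winner by its MSE on the test set, which means the test set takes part in model selection and the reported score can be optimistic. A common alternative, sketched here under assumptions, compares the cross-validated score that `GridSearchCV` already computes (`best_score_`) and evaluates on the test set only once, for the chosen model. The data is a synthetic stand-in and the candidate list is shortened; neither comes from the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): replace with the encoded tips X and y
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

candidates = {
    "LinearRegression": (LinearRegression(), {}),
    "KNeighborsRegressor": (KNeighborsRegressor(), {"n_neighbors": [3, 5, 7]}),
    "DecisionTreeRegressor": (DecisionTreeRegressor(random_state=0), {"max_depth": [None, 5, 10]}),
}

best_name, best_search, best_cv_mse = None, None, float("inf")
for name, (model, params) in candidates.items():
    # scoring is negated MSE, so higher is better and best_score_ is <= 0
    search = GridSearchCV(model, params, cv=5, scoring="neg_mean_squared_error")
    search.fit(X_tr, y_tr)
    cv_mse = -search.best_score_  # mean validation MSE of the best setting
    if cv_mse < best_cv_mse:
        best_name, best_search, best_cv_mse = name, search, cv_mse

# The test set is touched only once, for the model chosen by cross-validation
test_mse = mean_squared_error(y_te, best_search.predict(X_te))
print(best_name, round(best_cv_mse, 3), round(test_mse, 3))
```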

---
# <span style="color:yellow">**HYPERPARAMETER TUNING**</span>
---

In [14]:
%%time
# Create a dictionary of models with the hyperparameter grids to search
models = { 
          'LinearRegression' : (LinearRegression(), {}),
          'SVR' : (SVR(), {'kernel': ['rbf', 'poly', 'sigmoid']}),
          'DecisionTreeRegressor' : (DecisionTreeRegressor(), {'max_depth': [None, 5, 10]}),
          'RandomForestRegressor' : (RandomForestRegressor(), {'n_estimators': [10, 100]}),
          'KNeighborsRegressor' : (KNeighborsRegressor(), {'n_neighbors': np.arange(3, 100, 2)}),
          'GradientBoostingRegressor' : (GradientBoostingRegressor(), {'n_estimators': [10, 100]}),
          'XGBRegressor' : (XGBRegressor(), {'n_estimators': [10, 100]}),          
          }

# loop over the models: tune each one with a grid search, predict on the test set, and print the metrics

for name, (model, params) in models.items():
    # create a grid search (5-fold cross-validation) for this model
    pipeline = GridSearchCV(model, params, cv=5)
    
    # fit the grid search on the training data
    pipeline.fit(X_train, y_train)
    
    # make prediction from each model
    y_pred = pipeline.predict(X_test)
    
      
    # print the performing metric
    print(name, 'MSE: ', mean_squared_error(y_test, y_pred))
    print(name, 'R2: ', r2_score(y_test, y_pred))
    print(name, 'MAE: ', mean_absolute_error(y_test, y_pred))
    print('\n')

LinearRegression MSE:  0.694812968628771
LinearRegression R2:  0.4441368826121932
LinearRegression MAE:  0.6703807496461157


SVR MSE:  1.460718141299992
SVR R2:  -0.1686013018011976
SVR MAE:  0.8935334948775431


DecisionTreeRegressor MSE:  0.8774153020453994
DecisionTreeRegressor R2:  0.2980516670532909
DecisionTreeRegressor MAE:  0.7189481629481629


RandomForestRegressor MSE:  0.9916518757142871
RandomForestRegressor R2:  0.2066603130827278
RandomForestRegressor MAE:  0.7970673469387758


KNeighborsRegressor MSE:  0.6640950568462677
KNeighborsRegressor R2:  0.4687117753876745
KNeighborsRegressor MAE:  0.6203721488595437


GradientBoostingRegressor MSE:  0.8106801524004932
GradientBoostingRegressor R2:  0.35144101065487676
GradientBoostingRegressor MAE:  0.7657809818712309


XGBRegressor MSE:  0.6624107100882575
XGBRegressor R2:  0.4700592836840687
XGBRegressor MAE:  0.6549163442728472


CPU times: total: 6.2 s
Wall time: 5.59 s


In [34]:
%%time
# Create a dictionary of models with larger hyperparameter grids to search
models = { 
          'LinearRegression' : (LinearRegression(), {}),
          'SVR' : (SVR(), {'kernel': ['rbf', 'poly', 'sigmoid'], 'C': [0.1, 1, 10], 'gamma': [1, 0.1, 0.01], 'epsilon': [0.1, 0.01, 0.001]}),
          'DecisionTreeRegressor' : (DecisionTreeRegressor(), {'max_depth': [None, 5, 10], 'splitter': ['best', 'random']}),
          'RandomForestRegressor' : (RandomForestRegressor(), {'n_estimators': [10, 100, 1000], 'max_depth': [None, 5, 10]}),
          'KNeighborsRegressor' : (KNeighborsRegressor(), {'n_neighbors': np.arange(3, 100, 2), 'weights': ['uniform', 'distance']}),
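          # NOTE: the loss names 'ls' and 'lad' were removed in scikit-learn 1.2 (renamed 'squared_error' and 'absolute_error').
          # On newer versions they fail parameter validation, which most likely explains the "30 fits failed out of a total of 60" warning in the output below.
          'GradientBoostingRegressor' : (GradientBoostingRegressor(), {'loss': ['ls', 'lad', 'huber', 'quantile'], 'n_estimators': [10, 100, 1000]}),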
          'XGBRegressor' : (XGBRegressor(), {'n_estimators': [10, 100, 1000], 'learning_rate': [0.1, 0.01, 0.001]}),          
          }

# loop over the models: tune each one with a grid search, predict on the test set, and print the metrics

for name, (model, params) in models.items():
    # create a grid search (5-fold cross-validation) for this model
    pipeline = GridSearchCV(model, params, cv=5)
    
    # fit the grid search on the training data
    pipeline.fit(X_train, y_train)
    
    # make prediction from each model
    y_pred = pipeline.predict(X_test)
    
      
    # print the performing metric
    print(name, 'MSE: ', mean_squared_error(y_test, y_pred))
    print(name, 'R2: ', r2_score(y_test, y_pred))
    print(name, 'MAE: ', mean_absolute_error(y_test, y_pred))
    print('\n')

LinearRegression MSE:  0.694812968628771
LinearRegression R2:  0.4441368826121932
LinearRegression MAE:  0.6703807496461157


SVR MSE:  0.6794885084267436
SVR R2:  0.45639673181592255
SVR MAE:  0.6309897323209411


DecisionTreeRegressor MSE:  0.955737583411837
DecisionTreeRegressor R2:  0.23539240557290553
DecisionTreeRegressor MAE:  0.7774590204502854


RandomForestRegressor MSE:  0.8612868964374735
RandomForestRegressor R2:  0.31095468732565246
RandomForestRegressor MAE:  0.7407757292986019


KNeighborsRegressor MSE:  0.6437675304097399
KNeighborsRegressor R2:  0.4849741693324664
KNeighborsRegressor MAE:  0.6385880398456918




30 fits failed out of a total of 60.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
15 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\DELL\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\model_selection\_validation.py", line 729, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\Users\DELL\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\base.py", line 1145, in wrapper
    estimator._validate_params()
  File "c:\Users\DELL\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\base.py", line 638, in _validate_params
    validate_parameter_constraints(
  File "c:\Users\DELL\AppData\Local\Programs\Python\Python311\Lib\site

GradientBoostingRegressor MSE:  0.760105541957492
GradientBoostingRegressor R2:  0.3919016265196056
GradientBoostingRegressor MAE:  0.695414310929991


XGBRegressor MSE:  0.7601696611425505
XGBRegressor R2:  0.3918503299956485
XGBRegressor MAE:  0.7351689690959697


CPU times: total: 9h 48min 57s
Wall time: 10h 1min 16s
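
Before launching a grid this large, it can help to count how many model fits it implies. The sketch below copies the grids from the cell above (with `range(3, 100, 2)` standing in for `np.arange(3, 100, 2)`) and multiplies the number of hyperparameter combinations by the number of folds. The helper name `count_fits` is hypothetical, and the count excludes the final refit that `GridSearchCV` performs on the full training set.

```python
from math import prod

# Hyperparameter grids copied from the cell above
param_grids = {
    "LinearRegression": {},
    "SVR": {"kernel": ["rbf", "poly", "sigmoid"], "C": [0.1, 1, 10],
            "gamma": [1, 0.1, 0.01], "epsilon": [0.1, 0.01, 0.001]},
    "DecisionTreeRegressor": {"max_depth": [None, 5, 10], "splitter": ["best", "random"]},
    "RandomForestRegressor": {"n_estimators": [10, 100, 1000], "max_depth": [None, 5, 10]},
    "KNeighborsRegressor": {"n_neighbors": list(range(3, 100, 2)),
                            "weights": ["uniform", "distance"]},
    "GradientBoostingRegressor": {"loss": ["ls", "lad", "huber", "quantile"],
                                  "n_estimators": [10, 100, 1000]},
    "XGBRegressor": {"n_estimators": [10, 100, 1000], "learning_rate": [0.1, 0.01, 0.001]},
}
cv_folds = 5

def count_fits(grid, folds):
    """Combinations = product of option counts; each combination is fitted once per fold."""
    return prod(len(values) for values in grid.values()) * folds

fits = {name: count_fits(grid, cv_folds) for name, grid in param_grids.items()}
total_fits = sum(fits.values())

for name, n in fits.items():
    print(f"{name}: {n} fits")
print("Total:", total_fits)
```

The 60 fits counted for `GradientBoostingRegressor` match the total reported in the warning above. Fit counts alone do not show which model dominated the run time, since the cost of a single fit varies widely with the hyperparameters; passing `n_jobs=-1` to `GridSearchCV` or trimming the grids are two usual ways to shorten such a run.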


# **Add preprocessor inside the pipeline**

## Assignment: Find the errors

In [40]:
%%time
# make a preprocessor

preprocessor = ColumnTransformer(
    transformers=['numeric_scaling', StandardScaler(), ['total_bill', 'size']], remainder='passthrough')


# Create a dictionary of models with the hyperparameter grids to search
models = { 
          'LinearRegression' : (LinearRegression(), {}),
          'SVR' : (SVR(), {'kernel': ['rbf', 'poly', 'sigmoid'], 'C': [0.1, 1, 10], 'gamma': [1, 0.1, 0.01], 'epsilon': [0.1, 0.01, 0.001]}),
          'DecisionTreeRegressor' : (DecisionTreeRegressor(), {'max_depth': [None, 5, 10], 'splitter': ['best', 'random']}),
          'RandomForestRegressor' : (RandomForestRegressor(), {'n_estimators': [10, 100, 1000], 'max_depth': [None, 5, 10]}),
          'KNeighborsRegressor' : (KNeighborsRegressor(), {'n_neighbors': np.arange(3, 100, 2), 'weights': ['uniform', 'distance']}),
          'GradientBoostingRegressor' : (GradientBoostingRegressor(), {'loss': ['ls', 'lad', 'huber', 'quantile'], 'n_estimators': [10, 100, 1000]}),
          'XGBRegressor' : (XGBRegressor(), {'n_estimators': [10, 100, 1000], 'learning_rate': [0.1, 0.01, 0.001]}),          
          }

# loop over the models: tune each one with a grid search, predict on the test set, and print the metrics

for name, (model, params) in models.items():
    # create a pipeline with the preprocessor
    pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('model', model)])   
    
    # make a grid search cv to tune the hyperparameter
    grid_search = GridSearchCV(pipeline, params, cv=5)
    
    
    # fit the grid search (which wraps the pipeline) on the training data
    grid_search.fit(X_train, y_train)
    
    # make prediction from each model
    y_pred = grid_search.predict(X_test)
    
      
    # print the performing metric
    print(name, 'MSE: ', mean_squared_error(y_test, y_pred))
    print(name, 'R2: ', r2_score(y_test, y_pred))
    print(name, 'MAE: ', mean_absolute_error(y_test, y_pred))
    print('\n')

ValueError: 
All the 5 fits failed.
It is very likely that your model is misconfigured.
You can try to debug the error by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
5 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\DELL\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\model_selection\_validation.py", line 729, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\Users\DELL\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\base.py", line 1152, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\DELL\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\pipeline.py", line 423, in fit
    Xt = self._fit(X, y, **fit_params_steps)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\DELL\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\pipeline.py", line 377, in _fit
    X, fitted_transformer = fit_transform_one_cached(
                            ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\DELL\AppData\Local\Programs\Python\Python311\Lib\site-packages\joblib\memory.py", line 349, in __call__
    return self.func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\DELL\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\pipeline.py", line 957, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\DELL\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\utils\_set_output.py", line 157, in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\DELL\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\base.py", line 1152, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\DELL\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\compose\_column_transformer.py", line 750, in fit_transform
    self._validate_transformers()
  File "c:\Users\DELL\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\compose\_column_transformer.py", line 430, in _validate_transformers
    names, transformers, _ = zip(*self.transformers)
                             ^^^^^^^^^^^^^^^^^^^^^^^
TypeError: 'StandardScaler' object is not iterable
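
The traceback points at the `ColumnTransformer` definition: `transformers` is a flat list, whereas scikit-learn expects a list of `(name, transformer, columns)` tuples. A second problem would appear once that is fixed: when the estimator sits inside a `Pipeline` step called `model`, the grid keys need the `model__` prefix. On recent scikit-learn versions the `'ls'` and `'lad'` loss names for `GradientBoostingRegressor` would likely fail as well. The sketch below shows the first two fixes for one model only; the data is a synthetic stand-in with the same column names as the encoded tips features (an assumption so the block runs on its own), and the grid is shortened.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in (assumption) mimicking the label-encoded tips features
rng = np.random.default_rng(42)
n = 120
X_demo = pd.DataFrame({
    "total_bill": rng.uniform(5, 50, n),
    "sex": rng.integers(0, 2, n),
    "smoker": rng.integers(0, 2, n),
    "day": rng.integers(0, 4, n),
    "time": rng.integers(0, 2, n),
    "size": rng.integers(1, 7, n),
})
y_demo = 0.15 * X_demo["total_bill"] + 0.2 * X_demo["size"] + rng.normal(0, 0.5, n)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Fix 1: transformers is a list of (name, transformer, columns) tuples
preprocessor = ColumnTransformer(
    transformers=[("numeric_scaling", StandardScaler(), ["total_bill", "size"])],
    remainder="passthrough",
)

pipeline = Pipeline(steps=[("preprocessor", preprocessor), ("model", SVR())])

# Fix 2: grid keys are "<step name>__<parameter>" because the estimator is a pipeline step
param_grid = {"model__kernel": ["rbf", "sigmoid"], "model__C": [0.1, 1, 10]}

grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_tr, y_tr)
y_hat = grid_search.predict(X_te)

print(grid_search.best_params_)
```

Placing the scaler inside the pipeline means it is refit on the training part of every fold, so no information from the validation fold leaks into the scaling.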


---
# <span style="color:yellow;">**CLASSIFICATION TASKS**</span>
---

In [37]:
%%time
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# Don't show warnings
import warnings
warnings.filterwarnings('ignore')

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Create a dictionary of classifiers to evaluate
classifiers = {
    'Logistic Regression': LogisticRegression(),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'SVM': SVC(),
    'KNN': KNeighborsClassifier()
}

# Perform k-fold cross-validation and calculate the mean accuracy
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

for name, classifier in classifiers.items():
    scores = cross_val_score(classifier, X, y, cv=kfold)
    accuracy = np.mean(scores)
    print("Classifier:", name)
    print("Mean Accuracy:", accuracy)
    print()

Classifier: Logistic Regression
Mean Accuracy: 0.9733333333333334

Classifier: Decision Tree
Mean Accuracy: 0.9533333333333335

Classifier: Random Forest
Mean Accuracy: 0.9533333333333335

Classifier: SVM
Mean Accuracy: 0.9666666666666668

Classifier: KNN
Mean Accuracy: 0.9733333333333334

CPU times: total: 1.34 s
Wall time: 2.66 s
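
The cell above uses a plain `KFold`. For classification, a stratified split keeps the class proportions equal in every fold, and reporting the standard deviation alongside the mean shows how much the accuracy varies between folds. This is a minimal sketch for one classifier; the scaling step and `max_iter=1000` are choices made here for illustration and are not part of the original cell.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)

# Scale the features inside the pipeline so scaling is refit on each training fold
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stratified folds preserve the 50/50/50 class balance of Iris in every split
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_scores = cross_val_score(clf, X_iris, y_iris, cv=skf)  # accuracy per fold
mean_acc, std_acc = fold_scores.mean(), fold_scores.std()

# Class counts in each test fold: 10 samples of each of the three classes
fold_class_counts = [
    np.bincount(y_iris[test_idx]).tolist() for _, test_idx in skf.split(X_iris, y_iris)
]

print("Accuracy: {:.3f} +/- {:.3f}".format(mean_acc, std_acc))
print(fold_class_counts)
```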


# **Main Assignment:**

## Write complete code to select the best regressor and classifier for the diamonds dataset `(if you have a high-end machine, you can use the whole dataset; otherwise use the sample dataset provided in the link)`, or use the Tips dataset for the regression task and the Iris dataset for the classification task.

## Choose all applicable models with their best or feasible hyperparameters, compare them with each other, and select the best model for the given dataset.

## Your code should be complete and properly explained. For a layperson, every step of the code should be commented.

## Your code should also save the best model in a pickle file.

## In the last snippet of the code, also write the code to load the pickle file and use it for prediction.

## Submit your assignment to the Discord inbox. (Do not share a link to your notebook; upload the notebook itself to the Discord inbox.) Do not share the notebook in public channels on our Discord server.
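
The save-and-load requirement can be sketched as follows. This is a minimal illustration, not the assignment solution: `best_model` is a stand-in `LinearRegression` fitted on synthetic data, and the file name `best_model.pkl` in the system temp directory is hypothetical. In the notebook, the fitted best estimator from the model comparison would take its place.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-in for the selected best model (assumption): fit on small synthetic data
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.1, size=50)
best_model = LinearRegression().fit(X_demo, y_demo)

# Hypothetical file location; any writable path works
model_path = os.path.join(tempfile.gettempdir(), "best_model.pkl")

# Save the fitted model to disk
with open(model_path, "wb") as f:
    pickle.dump(best_model, f)

# Load the model back and use it for prediction
with open(model_path, "rb") as f:
    loaded_model = pickle.load(f)

preds_before = best_model.predict(X_demo[:5])
preds_after = loaded_model.predict(X_demo[:5])
print(preds_after)
```

A pickled model should be loaded with the same scikit-learn version that created it, and pickle files from untrusted sources should not be loaded at all.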


# **Deadline for Submission:**

## `29th December before 09:30 pm Pakistan time. (No late submission will be accepted).`


<span style="color:yellow">
1. Specify the total time of notebook execution.
2. Provide a complete overview of all algorithms used in the notebook, separating regression and classification algorithms.
</span>

In [None]:
# End time
end_time = time.time()

# Calculate the total run time
total_time = end_time - start_time

# Print the total run time in seconds
print("Total run time: {:.2f} seconds".format(total_time))

In [None]:
# Convert the total run time (stored in 'total_time', in seconds) into minutes and seconds
# divmod returns the whole minutes and the remaining seconds
minutes, seconds = divmod(total_time, 60)
# Format the time as "mm:ss"
time_format = "{:02d}:{:02d}".format(int(minutes), int(seconds))
# Print the formatted time
print("Total run time: {}".format(time_format))
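
With the hyperparameter search above taking about ten hours, the `mm:ss` format would show a minutes value above 600. A small variant that also splits out hours is sketched here; the helper name `format_duration` is hypothetical.

```python
def format_duration(total_seconds):
    """Format a duration given in seconds as hh:mm:ss."""
    minutes, seconds = divmod(int(total_seconds), 60)  # whole minutes and leftover seconds
    hours, minutes = divmod(minutes, 60)               # whole hours and leftover minutes
    return "{:02d}:{:02d}:{:02d}".format(hours, minutes, seconds)

# Example: the 10 h 1 min 16 s wall time reported for the large grid search
example = format_duration(10 * 3600 + 1 * 60 + 16)
print("Total run time:", example)
```

In the notebook, `format_duration(total_time)` would be called after `total_time` has been computed.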