# **Main Assignment:**

### Write the complete code to select the best regressor and classifier for the given dataset, diamonds `(if you have a high-end machine, you can use the whole dataset; otherwise use the sample dataset provided in the link)`. Alternatively, you can use the Tips dataset for the regression task and the Iris dataset for the classification task.

### You have to try all suitable models with a range of hyperparameters, compare them with each other, and select the best model for the given dataset.

### Your code should be complete and properly explained. Every step of the code should be commented so that a layman can follow it.

### Your code should also save the best model in a pickle file.

### In the last snippet of the code, you should also load the pickle file and use it for prediction.

> We will use hyperparameter tuning to check which parameters are reliable for training our best model. We will also cross-validate all the models and check which model performs best on our dataset.

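As a rough illustration of the two ideas in the note above, the sketch below cross-validates one model and then grid-searches one hyperparameter. The synthetic data from `make_regression` and the `max_depth` grid are assumptions chosen so the example runs offline; they are not part of the assignment.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption, so the sketch needs no download)
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)

# 5-fold cross-validation: the model is trained and scored five times,
# each time holding out a different fifth of the data
tree = DecisionTreeRegressor(random_state=42)
fold_scores = cross_val_score(tree, X_demo, y_demo, cv=5, scoring="r2")
print("R2 per fold:", np.round(fold_scores, 3))
print("Mean R2:", fold_scores.mean())

# Grid search: repeat the same cross-validation for every candidate max_depth
# and keep the value with the highest mean R2
search = GridSearchCV(tree, {"max_depth": [None, 3, 5, 10]}, cv=5, scoring="r2")
search.fit(X_demo, y_demo)
print("Best max_depth:", search.best_params_["max_depth"])
print("Best mean CV R2:", search.best_score_)
```

`best_score_` is the mean cross-validated score of the winning setting, so it can be compared across different models without touching the test set.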
In [15]:
# Importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# for training and splitting
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Regression algorithms
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import GradientBoostingRegressor

# for evaluation
from sklearn.metrics import accuracy_score, mean_squared_error, r2_score

# for cross validation
from sklearn.model_selection import cross_val_score

# for hyperparameter tuning
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# To save the best model 
import pickle


In [16]:
# load the diamonds dataset and take a random sample of 2000 rows from it
df = sns.load_dataset('diamonds')
df = df.sample(n=2000, random_state=42)
df.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
1388,0.24,Ideal,G,VVS1,62.1,56.0,559,3.97,4.0,2.47
50052,0.58,Very Good,F,VVS2,60.0,57.0,2201,5.44,5.42,3.26
41645,0.4,Ideal,E,VVS2,62.1,55.0,1238,4.76,4.74,2.95
42377,0.43,Premium,E,VVS2,60.8,57.0,1304,4.92,4.89,2.98
17244,1.55,Ideal,E,SI2,62.3,55.0,6901,7.44,7.37,4.61


In [17]:
# we will predict the price of the diamond based on the other features
X = df.drop(['price'], axis=1)
y = df['price']

# encode the categorical features using label encoding
# (the encoder is refit for each column and assigns integer codes in alphabetical order)
label_encoder = LabelEncoder()
X['cut'] = label_encoder.fit_transform(X['cut'])
X['color'] = label_encoder.fit_transform(X['color'])
X['clarity'] = label_encoder.fit_transform(X['clarity'])


# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
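Label encoding as used above numbers the categories alphabetically, so for `cut` the code order (Fair, Good, Ideal, Premium, Very Good) does not follow the quality order. One possible alternative, shown here only as a sketch, is `OrdinalEncoder` with an explicit category order. The small `sample` frame and the name `cut_order` are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# A few cut grades taken as an illustrative sample
sample = pd.DataFrame({"cut": ["Ideal", "Very Good", "Premium", "Fair"]})

# Quality order of the cut grades, from worst to best
cut_order = ["Fair", "Good", "Very Good", "Premium", "Ideal"]

# Passing the order explicitly makes the integer codes follow the quality ranking
encoder = OrdinalEncoder(categories=[cut_order])
sample["cut_code"] = encoder.fit_transform(sample[["cut"]]).ravel().astype(int)
print(sample)  # Ideal -> 4, Very Good -> 2, Premium -> 3, Fair -> 0

# The fitted encoder can also map a code back to its label
decoded = encoder.inverse_transform([[4]])
print(decoded[0][0])  # Ideal
```

Keeping one fitted encoder per column also makes it possible to encode new rows the same way at prediction time.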


In [None]:
from sklearn.metrics import mean_absolute_error

best_model = None
best_r2 = -np.inf  # R2 can be negative, so start below any possible score
best_parameters = None

# Create a dictionary that maps each model name to (model, hyperparameter grid)
models = {
    'LinearRegression': (LinearRegression(), {}),
    'SVR': (SVR(), {'model__kernel': ['rbf', 'poly', 'sigmoid']}),
    'DecisionTreeRegressor': (DecisionTreeRegressor(), {'model__max_depth': [None, 5, 10]}),
    'RandomForestRegressor': (RandomForestRegressor(), {'model__n_estimators': [10, 100]}),
    'KNeighborsRegressor': (KNeighborsRegressor(), {'model__n_neighbors': np.arange(3, 100, 2)}),
    'GradientBoostingRegressor': (GradientBoostingRegressor(), {'model__n_estimators': [10, 100]}),
}

for name, (model, params) in models.items():
    # create a pipeline that holds only the model step (no preprocessing is applied)
    pipeline = Pipeline(steps=[('model', model)])
    # create a grid search
    grid_search = GridSearchCV(pipeline, params, cv=5, scoring='r2')
    
    # fit the model
    grid_search.fit(X_train, y_train)
    
    # predict on the test set with the best estimator found by the grid search
    y_pred = grid_search.predict(X_test)

    # calculate the R2 score on the test set
    r2 = r2_score(y_test, y_pred)

    # # Print the performance metrics
    # print("Model:", name)
    # print("Cross-validation R2:", grid_search.best_score_)
    # print("Test R2:", r2)
    # print()
    # print()
    
    # Keep this model if it has the best test R2 seen so far
    if r2 > best_r2:
        best_r2 = r2
        best_model = grid_search.best_estimator_
        best_parameters = grid_search.best_params_


# Report the best model, its parameters and its R2 score on the test set
print("Best Model:", best_model)
print("Best Parameters:", best_parameters)
print("Best R2:", best_r2)


# Save the best model (best_model stays None if no model was selected)
if best_model is not None:
    import os
    os.makedirs('./Save_models', exist_ok=True)  # create the folder if it is missing
    with open('./Save_models/diamond_price_prediction_model.pkl', 'wb') as f:
        pickle.dump(best_model, f)



Best Model: Pipeline(steps=[('model', GradientBoostingRegressor())])
Best Parameters: {'model__n_estimators': 100}
Best R2: 0.9566903331232955
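The loop above picks the winner by its score on the test set and feeds unscaled features to SVR and KNN, both of which are sensitive to feature scale. A possible variation, sketched here under assumptions, adds a `StandardScaler` step to the pipeline, selects the model by its cross-validated score, and uses the test set only once at the end. The synthetic data, the two candidate models and their grids are illustrative choices, not the notebook's setup.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (an assumption, so the sketch runs offline)
X_demo, y_demo = make_regression(n_samples=400, n_features=6, noise=15.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Two scale-sensitive models with small illustrative grids
candidates = {
    "SVR": (SVR(), {"model__C": [1, 10, 100]}),
    "KNeighborsRegressor": (KNeighborsRegressor(), {"model__n_neighbors": [3, 5, 9]}),
}

best_name, best_search = None, None
for name, (model, params) in candidates.items():
    # The scaler is fitted inside each CV fold, so no information leaks from held-out data
    pipe = Pipeline(steps=[("scaler", StandardScaler()), ("model", model)])
    search = GridSearchCV(pipe, params, cv=5, scoring="r2")
    search.fit(X_tr, y_tr)
    # Compare models on their mean cross-validated R2, not on the test set
    if best_search is None or search.best_score_ > best_search.best_score_:
        best_name, best_search = name, search

# The test set is used once, to report how the chosen model generalizes
y_pred = best_search.predict(X_te)
mae = mean_absolute_error(y_te, y_pred)
rmse = np.sqrt(mean_squared_error(y_te, y_pred))
print("Selected:", best_name, best_search.best_params_)
print("CV R2:", best_search.best_score_)
print("Test R2:", r2_score(y_te, y_pred), "MAE:", mae, "RMSE:", rmse)
```

Reporting MAE and RMSE next to R2 gives the error in the units of the target, which is often easier to explain than R2 alone.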




In [49]:
# predict the price for one encoded row: carat, cut, color, clarity, depth, table, x, y, z
print(best_model.predict([[0.58, 4, 2, 7, 60.0, 57.0, 5.44, 5.42, 3.26]]))

[2268.79721367]




In [44]:
X.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,x,y,z
1388,0.24,2,3,6,62.1,56.0,3.97,4.0,2.47
50052,0.58,4,2,7,60.0,57.0,5.44,5.42,3.26
41645,0.4,2,1,7,62.1,55.0,4.76,4.74,2.95
42377,0.43,3,1,7,60.8,57.0,4.92,4.89,2.98
17244,1.55,2,1,3,62.3,55.0,7.44,7.37,4.61


In [45]:
y.head()

1388      559
50052    2201
41645    1238
42377    1304
17244    6901
Name: price, dtype: int64

In [47]:
# Load the saved model from the pickle file
with open('./Save_models/diamond_price_prediction_model.pkl', 'rb') as f:
    model = pickle.load(f)

# Predict with the loaded model (same encoded row as above)
model.predict([[0.58, 4, 2, 7, 60.0, 57.0, 5.44, 5.42, 3.26]])



array([2268.79721367])
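The assignment also asks for a classifier, which the cells above do not cover. The following is a minimal sketch of the same workflow on the Iris dataset, using scikit-learn's bundled copy. The three candidate classifiers, their grids, and the temporary file name `iris_classifier_demo.pkl` are assumptions chosen for illustration.

```python
import os
import pickle
import tempfile

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Load Iris and keep 20% aside as a test set, preserving the class balance
X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=42, stratify=y_iris
)

# Candidate classifiers with small illustrative hyperparameter grids
classifiers = {
    "LogisticRegression": (LogisticRegression(max_iter=1000), {"model__C": [0.1, 1, 10]}),
    "KNeighborsClassifier": (KNeighborsClassifier(), {"model__n_neighbors": [3, 5, 7]}),
    "RandomForestClassifier": (
        RandomForestClassifier(random_state=42),
        {"model__n_estimators": [10, 100]},
    ),
}

best_name, best_search = None, None
for name, (clf, params) in classifiers.items():
    # Scale the features, then fit the classifier
    pipe = Pipeline(steps=[("scaler", StandardScaler()), ("model", clf)])
    search = GridSearchCV(pipe, params, cv=5, scoring="accuracy")
    search.fit(X_tr, y_tr)
    # Keep the classifier with the highest mean cross-validated accuracy
    if best_search is None or search.best_score_ > best_search.best_score_:
        best_name, best_search = name, search

test_accuracy = accuracy_score(y_te, best_search.predict(X_te))
print("Best classifier:", best_name, best_search.best_params_)
print("CV accuracy:", best_search.best_score_)
print("Test accuracy:", test_accuracy)

# Save the best pipeline to a pickle file, then load it back and predict
model_path = os.path.join(tempfile.gettempdir(), "iris_classifier_demo.pkl")
with open(model_path, "wb") as f:
    pickle.dump(best_search.best_estimator_, f)
with open(model_path, "rb") as f:
    loaded = pickle.load(f)
print("Predicted classes for three test rows:", loaded.predict(X_te[:3]))
```

Here accuracy is the appropriate metric because the target is a class label; for the diamonds regression above, R2 plays the corresponding role.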