# Car Price Prediction Challenge (Celebal Technologies)

## Table of Contents

[Task](#tsk)

[Essential Libraries](#Lib)


[Importing Datasets](#dss)


[Exploratory Data Analysis](#eda)  
    
   i. [Data Cleaning](#eda)  
   ii. [Data Visualisation & Insights](#edaii)  

[Preprocessing](#pre)      
      i. [Train Test Split](#pre)   
      ii. [Encoding](#prei)  

[Model Building](#mb)  
    i. [Base Model XGBOOST & Eval.](#xg)  
    ii. [Hyper Parameter XGBOOST & Eval.](#xg1)  
    iii. [Base Random Forest & Eval.](#rf)  
    iv. [Hyper Parameter Random Forest & Eval.](#rf1)

[Benchmark & Conclusion](#BNK)

# <div class="h1">Task</div> <a class="anchor" id="tsk"></a>

* Perform data cleaning and pre-processing.  
  What steps did you use in this process and how did you clean your data?

* Perform exploratory data analysis on the given dataset.  
  Explain each and every graph that you make.

* Train an ML model and evaluate it using different metrics.  
  Why did you choose that particular model? What was the accuracy?

* Hyperparameter optimization and feature selection are a plus.  
  Model deployment and use of MLflow is a plus. (optional)

* Perform model interpretation and show feature importance for your model.  
  Provide some explanation for the above point.

* Future steps.

# <div class="h1">Essential Libraries</div> <a class="anchor" id="Lib"></a>

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# Importing required libraries
# Data handling
import numpy as np
import pandas as pd

# For visualisation
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

#Preprocessing
#ENCODING
#pip install category_encoders
import category_encoders as ce
from sklearn.preprocessing import LabelEncoder

#Imputation:
# from sklearn.experimental import enable_iterative_imputer
# from sklearn.impute import IterativeImputer

#SCALING
from sklearn.preprocessing import StandardScaler

#Feature Selection:
from sklearn.feature_selection import VarianceThreshold
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.model_selection import train_test_split

#Model
#Regression:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import AdaBoostRegressor
import xgboost as xg

#Metrics and Validation:
from sklearn import metrics
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_absolute_percentage_error


# Warnings
# Handling warnings
import warnings
warnings.filterwarnings('ignore')

## <div class="h2">Importing Datasets</div> <a class="anchor" id="dss"></a>

In [None]:
df = pd.read_csv('../input/car-price-prediction-challenge/car_price_prediction.csv')
df.head()

## <div class="h2">Exploratory Data Analysis</div> <a class="anchor" id="eda"></a>

### <div class="h4">I Data Cleaning</div> <a class="anchor" id="eda"></a>

In [None]:
# Display all possible output rows for a better view of the data.
pd.set_option('display.max_rows',None)

In [None]:
df.shape

In [None]:
df.rename(columns = {'Drive wheels':'Drive_wheels' }, inplace = True)
df.rename(columns = {'Gear box type':'Transmission' }, inplace = True)
df.rename(columns = {'Fuel type':'FuelType' }, inplace = True)
df.rename(columns = {'Prod. year':'Year' }, inplace = True)

In [None]:
df.dtypes

In [None]:
df.isnull().sum(axis=0)

In [None]:
len(df[df.Levy=='-'])

In [None]:
len(df[df.Levy==0])

In [None]:
df.Levy = df.Levy.replace('-',0).astype('int64')

In [None]:
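# regex=True is required on recent pandas, where str.replace treats the pattern literally by default.
df.Mileage = df.Mileage.str.replace(r'[a-z]', '', regex=True).str.strip().astype('int64')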
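
The two cleaning steps above can be sketched on a toy frame. The sample values are made up for illustration and only assume the raw formats implied by the notebook: a `'-'` placeholder for a missing levy and a unit suffix on the mileage. In this sketch the placeholder is mapped to the string `'0'` before casting, which is an assumption chosen to keep the example independent of the pandas version.

```python
import pandas as pd

# Hypothetical sample rows mimicking the raw column formats.
toy = pd.DataFrame({
    "Levy": ["1399", "-", "862"],
    "Mileage": ["186005 km", "0 km", "200000 km"],
})

# '-' marks a missing levy; treat it as zero, then cast to integer.
toy["Levy"] = toy["Levy"].replace("-", "0").astype("int64")

# Drop the unit letters, strip leftover whitespace, then cast to integer.
toy["Mileage"] = (
    toy["Mileage"]
    .str.replace(r"[a-z]", "", regex=True)
    .str.strip()
    .astype("int64")
)

print(toy.dtypes)
```

Treating a missing levy as zero is a modelling choice, not a fact about the data; an imputed value would be another option.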

In [None]:
cat_col = list(df.select_dtypes('O').columns)
num_col = list(df.select_dtypes(np.number).columns)

In [None]:
for i in cat_col:
    print(i,df[i].unique(),sep=':\n',end='\n\n')

In [None]:
# ENGINE VOLUME IS CLEARLY A CONTINUOUS VALUE; ONLY A FEW ENTRIES ALSO CARRY A TURBO LABEL.

In [None]:
# Flag turbo engines by the label itself rather than by string length.
df['Turbo'] = df['Engine volume'].str.contains('turbo', case=False).map({True: 'Yes', False: 'No'})

In [None]:
# WE CAN SEPARATE TURBO INTO ITS OWN FEATURE, WHICH COULD HELP IN THE MODEL BUILDING PROCEDURE.

In [None]:
# Strip letters and whitespace (the turbo label) before casting to float.
df['Engine volume'] = df['Engine volume'].str.replace(r'[a-zA-Z\s]', '', regex=True).astype('float32')
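
As a small illustration of the turbo split, the sketch below works on a few made-up engine-volume strings. It assumes the raw values look like a number optionally followed by the word "Turbo"; `str.extract` is used here instead of character stripping, which is a substitute technique and not the notebook's own method.

```python
import pandas as pd

# Hypothetical engine-volume strings in the assumed raw format.
engine = pd.Series(["2.0 Turbo", "3.5", "1.8", "0.2 Turbo"])

# Flag turbo engines by looking for the word itself.
turbo = engine.str.contains("Turbo").map({True: "Yes", False: "No"})

# Keep only the leading number and cast it to float.
volume = engine.str.extract(r"(\d+(?:\.\d+)?)", expand=False).astype("float64")

print(pd.DataFrame({"Turbo": turbo, "Engine volume": volume}))
```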

In [None]:
df['Doors'] = df['Doors'].replace({'02-Mar': '2-3', '04-May': '4-5'})

In [None]:
### '2-3' and '4-5' were converted to dates when the data was saved as CSV, so we convert them back to categorical labels.

In [None]:
df['Cylinders'] = df['Cylinders'].astype('O')
df['Airbags'] = df['Airbags'].astype('O')
df['Year'] = df['Year'].astype('O')

In [None]:
### THESE VALUES ARE DISCRETE AND DO NOT SHOW ANY TREND WITH PRICE, SO WE TREAT THEM AS CATEGORICAL COLUMNS.

In [None]:
df.describe()

In [None]:
# PRICE HAS AN EXTREME VALUE THAT NEEDS TO BE INVESTIGATED MOVING FORWARD.

In [None]:
df_new = df.copy()

In [None]:
### DROPPING OUTLIERS FOR BETTER MODELING:

In [None]:
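# Quantiles are only defined for numeric columns, so restrict the check to them.
num_df = df_new.select_dtypes(np.number)
q1 = num_df.quantile(0.25)
q3 = num_df.quantile(0.75)
iqr = q3 - q1
# A row is an outlier if any numeric value lies more than 3*IQR outside the quartiles.
outlier_mask = ((num_df > q3 + 3*iqr) | (num_df < q1 - 3*iqr)).any(axis=1)
outlier_mask.sum()

In [None]:
df_new = df_new[~outlier_mask]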
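
The outlier rule used here keeps values inside the fences `Q1 - 3*IQR` and `Q3 + 3*IQR`. A minimal sketch on a single made-up price column shows how the fences are derived; the numbers are hypothetical and not taken from the dataset.

```python
import pandas as pd

# Hypothetical prices with one extreme value.
prices = pd.Series([100, 120, 130, 150, 170, 200, 5000])

q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1

# Fences at three interquartile ranges beyond the quartiles.
lower, upper = q1 - 3 * iqr, q3 + 3 * iqr

is_outlier = (prices < lower) | (prices > upper)
print(lower, upper)
print(prices[is_outlier].tolist())
```

A factor of 3 is more permissive than the common 1.5, so only the most extreme rows are dropped.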

In [None]:
df_new.shape

In [None]:
df.shape

In [None]:
cat_col = list(df.select_dtypes('O').columns)
num_col = list(df.select_dtypes(np.number).columns)

### <div class="h2">II Visualization & Insights </div> <a class="anchor" id="edaii"></a>

In [None]:
plt.figure(figsize=[15,7])
for i,j in enumerate(num_col):
    plt.subplot(2,4,i+1)
    # Pass the column by keyword; recent seaborn versions no longer accept it positionally as x.
    sns.boxplot(x=df_new[j])
plt.tight_layout()
plt.show()

In [None]:
# ID will not help us predict the price. Levy, Year, Cylinders and Airbags show fairly discrete values, so looking forward we can decide whether to treat them as categorical or numerical.

In [None]:
num_col.remove('ID')
df_new.drop(columns='ID',inplace=True)

In [None]:
for i in cat_col:
    plt.figure(figsize=[15,7])
    sns.countplot(x=df_new[i])
    plt.xticks(rotation=90)
    plt.tight_layout()
    plt.show()

In [None]:
fig = px.treemap(data_frame=df_new,path=["Manufacturer","Category","Model"],title='MANUFACTURER WISE DATA DISTRIBUTION')
fig.show()

In [None]:
fig = px.treemap(data_frame=df_new,path=["Manufacturer","Category","Model"],values='Price',title='MANUFACTURER WISE TOTAL PRICE | SALES DISTRIBUTION')
fig.show()

In [None]:
# HYUNDAI AND TOYOTA CLEARLY DOMINATE THE CAR MARKET, FOLLOWED BY MERCEDES.

# JEEPS AND SEDANS ARE PREFERRED OVER ALL OTHER CATEGORIES OF 4-WHEELERS, FOLLOWED BY HATCHBACKS.

# IN TERMS OF COMFORT, CARS WITH AUTOMATIC TRANSMISSION, LEFT WHEEL AND LEATHER INTERIOR ARE PREFERRED THE MOST.

# IN TERMS OF FUEL TYPE AND PERFORMANCE, PETROL CARS ARE PREFERRED THE MOST.

# ECONOMICAL CARS APPEAR TO BE PREFERRED: 2L ENGINES, FRONT WHEEL DRIVE AND CARS WITHOUT TURBO HAVE THE HIGHEST COUNTS.

# COLOR PREFERENCES ARE PRETTY MUCH AS EXPECTED (BLACK, WHITE, SILVER AND GREY).

# IN TERMS OF SALES, HYUNDAI DOMINATES THE MARKET WITH TOTAL SALES OF 18,41,93,610, FOLLOWED BY TOYOTA, MERCEDES AND OPEL.

# BRAND AND CATEGORY BOTH PLAY A VITAL ROLE IN UNIT SALES AND TOTAL SALES.

In [None]:
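# numeric_only=True skips the categorical columns, which recent pandas no longer drops silently.
sns.heatmap(data=df_new.corr(numeric_only=True),cmap='Blues',annot=True)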
plt.show()

In [None]:
sns.pairplot(df_new[num_col])
plt.show()

In [None]:
# NO NUMERICAL COLUMN SHOWS A REMARKABLE TREND WITH PRICE, BUT THERE ARE EXTREME POINTS THAT NEED TO BE EXPLORED FURTHER.
# A SLIGHT TREND OF PRICE WITH RESPECT TO LEVY, YEAR AND MILEAGE CAN BE SEEN.

In [None]:
for i in cat_col:
    if len(df[i].unique())<25:
        boxxx = px.box(data_frame=df_new,x=i,y='Price')
        boxxx.show()

In [None]:
# THE PRICE DISTRIBUTION VARIES ACROSS THE LEVELS OF EACH CATEGORICAL FEATURE, EXCEPT FOR DRIVE WHEELS (TO BE CONSIDERED FOR FURTHER VERIFICATION).

In [None]:
len(df.Model.unique())

## <div class="h4">Preprocessing</div> <a class="anchor" id="pre"></a>

In [None]:
num_col

In [None]:
for i in cat_col:
    print(i,':',len(df[i].unique()))

### <div class="h4">I Train Test Split</div> <a class="anchor" id="pre"></a>

In [None]:
X=df_new.copy()
X.drop(columns='Price',inplace=True)
y=df_new.copy()['Price']

In [None]:
X_train , X_test , y_train , y_test = train_test_split(X,y,random_state=1,test_size=0.3)

### <div class="h4">II Encoding</div> <a class="anchor" id="prei"></a>

In [None]:
yrs_disnt = list(X['Year'].sort_values().unique())
mapdisct = {j:i for i,j in enumerate(yrs_disnt)}
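
The year mapping built above is a rank encoding: each distinct year is replaced by its position in sorted order. A minimal sketch with made-up years (the variable names below are illustrative, not the notebook's):

```python
import pandas as pd

# Hypothetical production years.
years = pd.Series([2015, 1998, 2010, 2015, 2020])

# Rank each distinct year by its position in sorted order.
distinct_years = sorted(years.unique())
year_rank = {year: rank for rank, year in enumerate(distinct_years)}

encoded_years = years.map(year_rank)
print(encoded_years.tolist())
```

Note that a rank encoding keeps the ordering of the years but discards the size of the gaps between them.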

In [None]:
## ENCODING PLAN
## 'Manufacturer','Model','Category','FuelType','Transmission','Drive_wheels','Color' BINARY ENCODED
## 'Year','Doors' ORDINAL
## 'Leather interior','Wheel','Turbo' BINARY (0/1)
## 'Cylinders','Airbags' CONSIDERED AS CONTINUOUS VALUES.
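
To make the idea of binary encoding concrete, here is a hand-rolled sketch in plain pandas. It is an illustration of the general technique (category to integer code to binary digits), not the implementation or exact output of `category_encoders.BinaryEncoder`; the helper name `binary_encode` and the sample colors are hypothetical.

```python
import pandas as pd

def binary_encode(series: pd.Series) -> pd.DataFrame:
    """Illustration only: category -> integer code -> binary digits."""
    categories = sorted(series.unique())
    # Codes start at 1 so that 0 stays free, e.g. for unseen categories.
    codes = {cat: i + 1 for i, cat in enumerate(categories)}
    n_bits = max(codes.values()).bit_length()
    # One row of bits per value, most significant bit first.
    rows = [
        [(codes[value] >> bit) & 1 for bit in reversed(range(n_bits))]
        for value in series
    ]
    columns = [f"{series.name}_{i}" for i in range(n_bits)]
    return pd.DataFrame(rows, columns=columns, index=series.index)

colors = pd.Series(["Black", "White", "Silver", "Grey", "Black"], name="Color")
encoded = binary_encode(colors)
print(encoded)
```

The appeal for high-cardinality columns is the column count: with this scheme, 1,000 categories need 10 bit columns where one-hot encoding would need 1,000.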

In [None]:
enc = ce.BinaryEncoder(cols=['Manufacturer','Model','Category','FuelType','Transmission','Drive_wheels','Color'],drop_invariant=False,return_df=True)
X_train_enc = enc.fit_transform(X=X_train)
X_test_enc = enc.transform(X=X_test)

In [None]:
# Ordinal and 0/1 mappings, applied identically to train and test.
# Each column is mapped from its own values ('Doors' from 'Doors', not from 'Year').
value_maps = {
    'Year': mapdisct,
    'Doors': {'2-3':0,'4-5':1,'>5':2},
    'Leather interior': {'Yes':1,'No':0},
    'Wheel': {'Left wheel':1,'Right-hand drive':0},
    'Turbo': {'Yes':1,'No':0},
}
for col, mapping in value_maps.items():
    # The explicit cast guarantees a numeric dtype and fails loudly on unmapped values.
    X_train_enc[col] = X_train_enc[col].map(mapping).astype('int64')
    X_test_enc[col] = X_test_enc[col].map(mapping).astype('int64')

In [None]:
X_train_enc['Cylinders'],X_train_enc['Airbags'] = X_train_enc['Cylinders'].astype('i'),X_train_enc['Airbags'].astype('i')
X_test_enc['Cylinders'],X_test_enc['Airbags'] = X_test_enc['Cylinders'].astype('i'),X_test_enc['Airbags'].astype('i')

## <div class="h4">ML MODEL BUILDING & EVALUATION</div> <a class="anchor" id="mb"></a>


### <div class="h4">I Base XGBRegressor</div> <a class="anchor" id="xg"></a>


In [None]:
base_XGB = xg.XGBRegressor(n_estimators=100,booster='gbtree',random_state=10)

In [None]:
base_XGB.fit(X_train_enc,y_train)

In [None]:
y_train_pred_base_XGB = base_XGB.predict(X_train_enc)
y_test_pred_base_XGB = base_XGB.predict(X_test_enc)

In [None]:
Train_rsquare_base_XGB = r2_score(y_train,y_train_pred_base_XGB)
print("Train R-square associated with Base XG Boost Regression is : ", Train_rsquare_base_XGB,'\n')

Test_rsquare_base_XGB = r2_score(y_test , y_test_pred_base_XGB)
print("Test R-square associated with Base XG Boost Regression is : ", Test_rsquare_base_XGB,'\n')


print("Test Root_mean_squared_error associated Base XG Boost Regression is : ", 
      np.sqrt(mean_squared_error(y_test , y_test_pred_base_XGB)),'\n')

print("Test Mean_absolute_error associated with base XG Boost Regression is : ",
      mean_absolute_error(y_test , y_test_pred_base_XGB),'\n')
      
print("Test Mean_absolute_percentage_error associated with base XG Boost Regression is : ", 
      mean_absolute_percentage_error(y_test , y_test_pred_base_XGB),'\n')
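
For reference, the four metrics printed above can be computed by hand. The sketch below uses made-up true and predicted values, not results from this notebook, and follows the standard definitions of R-square, RMSE, MAE and MAPE.

```python
import numpy as np

# Hypothetical true and predicted prices.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 330.0, 380.0])

residuals = y_true - y_pred

rmse = np.sqrt(np.mean(residuals ** 2))             # root mean squared error
mae = np.mean(np.abs(residuals))                    # mean absolute error
mape = np.mean(np.abs(residuals) / np.abs(y_true))  # a fraction, not a percentage
r2 = 1 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(f"R2={r2:.3f}  RMSE={rmse:.2f}  MAE={mae:.2f}  MAPE={mape:.3f}")
```

Like this sketch, scikit-learn's `mean_absolute_percentage_error` returns a fraction, so a value of 0.075 means 7.5%.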

In [None]:
crossval_scores_base_XGB = cross_val_score(base_XGB,X_test_enc,y_test,cv=15,scoring='r2')
print('Test Scores:',crossval_scores_base_XGB)

print('Score Mean',crossval_scores_base_XGB.mean()*100, 'Score Standard Deviation', crossval_scores_base_XGB.std()*100)

In [None]:
plt.figure(figsize=(15,7))
sns.barplot(x=base_XGB.feature_importances_,y=base_XGB.feature_names_in_)
plt.tight_layout()
plt.show()

In [None]:
x_ax = range(len(y_test))
plt.figure(figsize=(25, 7))
plt.plot(x_ax, y_test, label="original")
plt.plot(x_ax, y_test_pred_base_XGB, label="predicted")
plt.title("Car Price dataset test and predicted data BASE XG BOOST")
plt.xlabel('X')
plt.ylabel('Price')
plt.legend(loc='best',fancybox=True, shadow=True)
plt.grid(True)
plt.show()

### <div class="h4">II Hyper Parameter XGBRegressor</div> <a class="anchor" id="xg1"></a>

In [None]:
learning_rates = [0.1,0.2,0.3,0.5,0.6]

n_estimators = [100,150,200,300]

max_depths = [int(x) for x in range(6,12)]

boosters = ['gbtree']

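# The key must be 'n_estimators'; a misspelled key is not a model parameter and would not be tuned.
xg_grid = {
    'n_estimators': n_estimators,
    'learning_rate': learning_rates,
    'max_depth': max_depths,
    'booster': boosters
}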

base_XGB_HPT = xg.XGBRegressor(random_state = 10,n_jobs=-1)

xgb_hpt = GridSearchCV(estimator = base_XGB_HPT, 
                   param_grid = xg_grid, 
                   cv = 15, n_jobs = -1)

#fit the model
xgb_hpt.fit(X_train_enc, y_train)
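
An exhaustive grid search fits every combination of the listed values once per fold, so it is worth counting the fits before running it. The sketch below enumerates the same candidate values with `itertools.product`; it only counts combinations and does not train anything.

```python
from itertools import product

# Same candidate values as the grid defined for the search.
learning_rates = [0.1, 0.2, 0.3, 0.5, 0.6]
n_estimators = [100, 150, 200, 300]
max_depths = list(range(6, 12))
boosters = ["gbtree"]

# Every combination the exhaustive search will try.
candidates = list(product(learning_rates, n_estimators, max_depths, boosters))

cv_folds = 15
total_fits = len(candidates) * cv_folds

print(len(candidates), "candidates,", total_fits, "fits")
```

With 15 folds this grid amounts to 1,800 model fits, plus one final refit on the full training set.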

In [None]:
print('Best Hyper parameters for XGB Regressor: ', xgb_hpt.best_params_, '\n')

In [None]:
xgb_best_tune = xg.XGBRegressor(booster='gbtree', learning_rate=0.1, max_depth=8, n_estimators=100, reg_lambda=1)

In [None]:
xgb_best_tune.fit(X_train_enc,y_train)

In [None]:
y_train_pred_base_XGB_hpt = xgb_best_tune.predict(X_train_enc)
y_test_pred_base_XGB_hpt = xgb_best_tune.predict(X_test_enc)

In [None]:
Train_rsquare_base_XGB_hpt = r2_score(y_train,y_train_pred_base_XGB_hpt)
print("Train R-square associated with Tuned XG Boost Regression is : ", Train_rsquare_base_XGB_hpt,'\n')

Test_rsquare_base_XGB_hpt = r2_score(y_test , y_test_pred_base_XGB_hpt)
print("Test R-square associated with Tuned XG Boost Regression is : ", Test_rsquare_base_XGB_hpt,'\n')


print("Test Root_mean_squared_error associated with Tuned XG Boost Regression is : ", 
      np.sqrt(mean_squared_error(y_test , y_test_pred_base_XGB_hpt)),'\n')

print("Test Mean_absolute_error associated with Tuned XG Boost Regression is : ",
      mean_absolute_error(y_test , y_test_pred_base_XGB_hpt),'\n')
      
print("Test Mean_absolute_percentage_error associated with Tuned XG Boost Regression is : ", 
      mean_absolute_percentage_error(y_test , y_test_pred_base_XGB_hpt),'\n')

In [None]:
scores_tuned_XGB_hpt = cross_val_score(xgb_best_tune,X_test_enc,y_test,cv=15,scoring='r2')
print('Test Scores:',scores_tuned_XGB_hpt)

print('Score Mean',scores_tuned_XGB_hpt.mean()*100, 'Score Standard Deviation', scores_tuned_XGB_hpt.std()*100)

In [None]:
### SLIGHTLY BETTER THAN THE BASE MODEL: THE ERROR TERMS ARE REDUCED.

In [None]:
plt.figure(figsize=(15,7))
sns.barplot(x=xgb_best_tune.feature_importances_,y=xgb_best_tune.feature_names_in_)
plt.tight_layout()
plt.show()

In [None]:
x_ax = range(len(y_test))
plt.figure(figsize=(25, 7))
plt.plot(x_ax, y_test, label="original")
plt.plot(x_ax, y_test_pred_base_XGB_hpt, label="predicted")
plt.title("Car Price dataset test and predicted data TUNED XGBOOST")
plt.xlabel('X')
plt.ylabel('Price')
plt.legend(loc='best',fancybox=True, shadow=True)
plt.grid(True)
plt.show()

### <div class="h4">III Base Random Forest</div> <a class="anchor" id="rf"></a>

In [None]:
base_rf = RandomForestRegressor(n_jobs=-1)

In [None]:
base_rf.fit(X_train_enc,y_train)

In [None]:
y_train_pred_base_rf = base_rf.predict(X_train_enc)
y_test_pred_base_rf = base_rf.predict(X_test_enc)

In [None]:
Train_rsquare_base_rf = r2_score(y_train,y_train_pred_base_rf)
print("Train R-square associated with Base Random Forest Regression is : ", Train_rsquare_base_rf,'\n')

Test_rsquare_base_rf = r2_score(y_test , y_test_pred_base_rf)
print("Test R-square associated with Base Random Forest Regression is : ", Test_rsquare_base_rf,'\n')


print("Test Root_mean_squared_error associated Base Random Forest Regression is : ", 
      np.sqrt(mean_squared_error(y_test , y_test_pred_base_rf)),'\n')

print("Test Mean_absolute_error associated with Base Random Forest Regression is : ",
      mean_absolute_error(y_test , y_test_pred_base_rf),'\n')
      
print("Test Mean_absolute_percentage_error associated with Base Random Forest Regression is : ", 
      mean_absolute_percentage_error(y_test , y_test_pred_base_rf),'\n')

In [None]:
scores_base_rf = cross_val_score(base_rf,X_test_enc,y_test,cv=15,scoring='r2')
print('Test Scores:',scores_base_rf)

print('Score Mean',scores_base_rf.mean()*100, 'Score Standard Deviation', scores_base_rf.std()*100)

In [None]:
plt.figure(figsize=(15,7))
sns.barplot(x=base_rf.feature_importances_,y=base_rf.feature_names_in_)
plt.tight_layout()
plt.show()

In [None]:
# visualizing in a plot
x_ax = range(len(y_test))
plt.figure(figsize=(25, 7))
plt.plot(x_ax, y_test, label="original")
plt.plot(x_ax, y_test_pred_base_rf, label="predicted")
plt.title("Car Price dataset test and predicted data BASE RANDOM FOREST")
plt.xlabel('X')
plt.ylabel('Price')
plt.legend(loc='best',fancybox=True, shadow=True)
plt.grid(True)
plt.show()

### <div class="h4">IV Hyper Parameter Random Forest</div> <a class="anchor" id="rf1"></a>

In [None]:
n_estimators = [150,200,250,300]

max_depths = [int(x) for x in range(5,16)]

criterions = ["squared_error", "absolute_error"]

min_samples_leafs = [5,10,15,20]

rf_grid = {
    'n_estimators' : n_estimators ,
    'criterion': criterions,
    'max_depth': max_depths,
    'min_samples_leaf': min_samples_leafs}

random_forest_reg = RandomForestRegressor(random_state = 10,n_jobs=-1)

rfr_hpt = RandomizedSearchCV(estimator = random_forest_reg,param_distributions=rf_grid, cv = 10,verbose=2)
# fit the model
rfr_hpt.fit(X_train_enc, y_train)
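
Unlike the exhaustive grid search, a randomized search only tries a sample of the grid. The sketch below counts the full grid and draws a sample with Python's `random` module to illustrate the idea. It assumes scikit-learn's default of `n_iter=10`, since the call above does not set it, and the seed is arbitrary; it does not reproduce the candidates scikit-learn would actually pick.

```python
import random
from itertools import product

# Same candidate values as the random forest grid.
n_estimators = [150, 200, 250, 300]
max_depths = list(range(5, 16))
criterions = ["squared_error", "absolute_error"]
min_samples_leafs = [5, 10, 15, 20]

full_grid = list(product(n_estimators, criterions, max_depths, min_samples_leafs))

n_iter = 10                 # RandomizedSearchCV default when not specified
cv_folds = 10
rng = random.Random(10)     # arbitrary seed, only for a reproducible sketch

# Draw distinct candidates, mirroring sampling without replacement from list-valued grids.
sampled = rng.sample(full_grid, n_iter)

print(len(full_grid), "possible,", len(sampled), "tried,", n_iter * cv_folds, "fits")
```

Only 10 of the 352 combinations are evaluated, so the result depends on the draw; passing `random_state` to `RandomizedSearchCV` makes the search repeatable.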

In [None]:
# get_params() returns a dict of all parameters; index it to read a single one.
rfr_hpt.best_estimator_.get_params()['max_features']

In [None]:
rfr_hpt.best_estimator_

In [None]:
rfr_best_tune = RandomForestRegressor(max_depth = 13 , min_samples_leaf= 1
                                      ,max_features=0.9, n_estimators=500, n_jobs=-1 , random_state= 10)

In [None]:
rfr_best_tune.fit(X_train_enc,y_train)
y_train_pred_base_RF_hpt = rfr_best_tune.predict(X_train_enc)
y_test_pred_base_RF_hpt = rfr_best_tune.predict(X_test_enc)

In [None]:
Train_rsquare_base_rf_hpt = r2_score(y_train,y_train_pred_base_RF_hpt)
print("Train R-square associated with Tuned Random Forest Regression is : ", Train_rsquare_base_rf_hpt,'\n')

Test_rsquare_base_rf_hpt = r2_score(y_test , y_test_pred_base_RF_hpt)
print("Test R-square associated with Tuned Random Forest Regression is : ", Test_rsquare_base_rf_hpt,'\n')


print("Test Root_mean_squared_error associated with Tuned Random Forest Regression is : ", 
      np.sqrt(mean_squared_error(y_test , y_test_pred_base_RF_hpt)),'\n')

print("Test Mean_absolute_error associated with Tuned Random Forest Regression is : ",
      mean_absolute_error(y_test , y_test_pred_base_RF_hpt),'\n')
      
print("Test Mean_absolute_percentage_error associated with Tuned Random Forest Regression is : ", 
      mean_absolute_percentage_error(y_test , y_test_pred_base_RF_hpt),'\n')

In [None]:
scores_tuned_Rf = cross_val_score(rfr_best_tune,X_test_enc,y_test,cv=15,scoring='r2')
print('Test Scores:',scores_tuned_Rf)

print('Score Mean',scores_tuned_Rf.mean()*100, 'Score Standard Deviation', scores_tuned_Rf.std()*100)


## <div class="h4">Benchmark & Conclusion</div> <a class="anchor" id="BNK"></a>

In [None]:
bench_mark = pd.DataFrame(columns=['XGBOOST','XGBOOST HPT','Random Forest','Random Forest HPT'],index = ['R2 Train','R2 Test','R2 Crossval','MAE','MAPE','Variance'])

In [None]:
bench_mark.loc['R2 Train',:] = [Train_rsquare_base_XGB,Train_rsquare_base_XGB_hpt,Train_rsquare_base_rf,Train_rsquare_base_rf_hpt]
bench_mark.loc['R2 Test',:] = [Test_rsquare_base_XGB,Test_rsquare_base_XGB_hpt,Test_rsquare_base_rf,Test_rsquare_base_rf_hpt]
bench_mark.loc['R2 Crossval',:] = [np.mean(crossval_scores_base_XGB),np.mean(scores_tuned_XGB_hpt),np.mean(scores_base_rf),np.mean(scores_tuned_Rf)]
bench_mark.loc['MAE',:] = [mean_absolute_error(y_test,y_test_pred_base_XGB),mean_absolute_error(y_test,y_test_pred_base_XGB_hpt),mean_absolute_error(y_test,y_test_pred_base_rf),mean_absolute_error(y_test,y_test_pred_base_RF_hpt)]
bench_mark.loc['MAPE',:] = [mean_absolute_percentage_error(y_test,y_test_pred_base_XGB),mean_absolute_percentage_error(y_test,y_test_pred_base_XGB_hpt),mean_absolute_percentage_error(y_test,y_test_pred_base_rf),mean_absolute_percentage_error(y_test,y_test_pred_base_RF_hpt)]
# The 'Variance' row holds mean/std of the cross-validation R2 scores, so a higher value means more consistent folds.
bench_mark.loc['Variance',:] = [np.mean(crossval_scores_base_XGB)/np.std(crossval_scores_base_XGB),np.mean(scores_tuned_XGB_hpt)/np.std(scores_tuned_XGB_hpt),np.mean(scores_base_rf)/np.std(scores_base_rf),np.mean(scores_tuned_Rf)/np.std(scores_tuned_Rf)]

In [None]:
bench_mark
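
The `Variance` row of the benchmark is computed as the mean of the cross-validation scores divided by their standard deviation. A minimal sketch with two made-up score arrays shows how that ratio separates models with the same mean but different fold-to-fold spread; the helper name `consistency` is hypothetical.

```python
import numpy as np

# Hypothetical cross-validation R2 scores with identical means.
scores_a = np.array([0.78, 0.80, 0.82, 0.79, 0.81])
scores_b = np.array([0.70, 0.90, 0.75, 0.85, 0.80])

def consistency(scores: np.ndarray) -> float:
    """Mean divided by standard deviation of the fold scores."""
    return float(np.mean(scores) / np.std(scores))

print(consistency(scores_a), consistency(scores_b))
```

Because the standard deviation is in the denominator, the steadier model gets the larger value, so lower variance across folds corresponds to a higher number in that row.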

#### * We can see from the benchmark that hyperparameter tuning reduced overfitting for both models.
#### * The cross-validation row gives the mean score over 15 folds. Both models perform very well.
#### * In terms of error minimisation, both models also perform quite well.
#### * "It's not about the scores, it's all about consistency."
We can see that XGBoost HPT achieves a lower variance at the same bias as Random Forest HPT.
        

## Conclusion:

### We achieved a bias-variance trade-off for both models, with XGBoost performing better in terms of variance.

### A few more algorithms and hyperparameters can be explored to get a better score, and there is room for more features. We could also drop the `Model` feature, as it only adds to the complexity of the model.