
# **Car Price Prediction**

The used car market in India is a dynamic and ever-changing landscape. Prices can fluctuate wildly based on a variety of factors including the make and model of the car, its mileage, its condition and the current market conditions. As a result, it can be difficult for sellers to accurately price their cars.

This dataset contains information about used cars.\
It can be used for many purposes, such as predicting used car prices with different machine learning techniques.

Data Description (Feature Information)

car_name: The car's full name, including the brand and the specific model name.\
brand: The brand name of the car.\
model: The exact model name of the car within its brand.\
seller_type: The type of seller offering the used car.\
fuel_type: The fuel the car runs on.\
transmission_type: The car's transmission type.\
vehicle_age: The number of years since the car was bought.\
mileage: Fuel efficiency, in kilometers per litre.\
engine: Engine capacity in cc (cubic centimeters).\
max_power: Maximum power output in BHP.\
seats: Total number of seats in the car.\
selling_price: The sale price listed on the website.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv('/content/drive/MyDrive/Datasets/Cleaned_Cardekho_dataset.csv')

In [3]:
df.columns

Index(['Unnamed: 0', 'model', 'vehicle_age', 'km_driven', 'seller_type',
       'fuel_type', 'transmission_type', 'mileage', 'engine', 'max_power',
       'seats', 'selling_price'],
      dtype='object')

In [4]:
df.drop('Unnamed: 0', axis = 1, inplace=True)
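
The feature description and the loaded columns do not line up exactly. As a quick check, the sketch below compares the names listed in the description with the columns printed by `df.columns`, leaving out `Unnamed: 0`. Both sets are copied by hand from this notebook, and the variable names are illustrative only.

```python
# Feature names as listed in the data description.
documented = {
    "car_name", "brand", "model", "seller_type", "fuel_type",
    "transmission_type", "vehicle_age", "mileage", "engine",
    "max_power", "seats", "selling_price",
}

# Columns reported by df.columns, without the 'Unnamed: 0' index column.
loaded = {
    "model", "vehicle_age", "km_driven", "seller_type", "fuel_type",
    "transmission_type", "mileage", "engine", "max_power", "seats",
    "selling_price",
}

only_documented = sorted(documented - loaded)  # described but not in the file
only_loaded = sorted(loaded - documented)      # in the file but not described

print("Described but not loaded:", only_documented)
print("Loaded but not described:", only_loaded)
```

With a live DataFrame, `loaded = set(df.columns)` would replace the hand-copied set.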

In [5]:
# Numeric columns, excluding the target (selling_price).
continuous_features = [feature for feature in df.columns if df.dtypes[feature] != 'O' and feature != 'selling_price']
continuous_features

['vehicle_age', 'km_driven', 'mileage', 'engine', 'max_power', 'seats']

In [6]:
categorical_features = [feature for feature in df.columns if df.dtypes[feature] == 'O']
categorical_features

['model', 'seller_type', 'fuel_type', 'transmission_type']

In [7]:
X = df.drop('selling_price', axis = 1)
y = df['selling_price']

In [8]:
from sklearn.preprocessing import LabelEncoder
LE = LabelEncoder()
X['model'] = LE.fit_transform(X['model'])
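
`LabelEncoder` replaces each distinct string with an integer code, assigned in sorted order of the labels. The toy example below uses hypothetical model names, not rows from the dataset, to make the mapping visible. The codes carry an arbitrary ordering, which tree-based models may tolerate better than linear ones.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical model names for illustration only.
toy_models = ["Swift", "Alto", "Creta", "Alto"]

encoder = LabelEncoder()
codes = encoder.fit_transform(toy_models)

# classes_ holds the sorted unique labels; each code is an index into it.
print(list(encoder.classes_))
print(list(codes))

# inverse_transform maps the integer codes back to the original strings.
restored = list(encoder.inverse_transform(codes))
print(restored)
```

Here `Alto` becomes 0, `Creta` becomes 1, and `Swift` becomes 2, purely because of alphabetical order.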

In [9]:
# Train/test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=31)

In [10]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
SC = StandardScaler()

OHE = OneHotEncoder(drop = 'first')

ct = ColumnTransformer(
    [
        ('StandardScaler', SC, continuous_features),
        ('OneHotEncoder', OHE, categorical_features[1:]),
    ], remainder='passthrough',
)

In [11]:
X_train.head()

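|       | model | vehicle_age | km_driven | seller_type | fuel_type | transmission_type | mileage | engine | max_power | seats |
|-------|-------|-------------|-----------|-------------|-----------|-------------------|---------|--------|-----------|-------|
| 11464 | 98    | 5           | 61000     | Dealer      | Diesel    | Manual            | 24.3    | 1248   | 88.5      | 5     |
| 10982 | 3     | 8           | 89000     | Dealer      | Diesel    | Automatic         | 14.49   | 2993   | 258.0     | 5     |
| 207   | 25    | 3           | 25000     | Individual  | Petrol    | Manual            | 17.4    | 1497   | 117.3     | 5     |
| 9569  | 32    | 8           | 51235     | Dealer      | Diesel    | Manual            | 19.01   | 1461   | 108.45    | 5     |
| 1061  | 118   | 3           | 50000     | Individual  | Petrol    | Manual            | 18.6    | 1197   | 81.83     | 5     |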


In [12]:
X_train = ct.fit_transform(X_train)
X_test = ct.transform(X_test)
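
The cell above fits the `ColumnTransformer` on the training split only and then reuses it on the test split, so the scaler statistics and one-hot categories come from training data alone. One way to keep that discipline automatic is to wrap the preprocessing and the model in a scikit-learn `Pipeline`. The sketch below is a minimal illustration on hypothetical toy rows (the values and the choice of `LinearRegression` are assumptions, not taken from the notebook).

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy rows, not taken from the Cardekho file.
toy = pd.DataFrame({
    "vehicle_age": [5, 8, 3, 8, 3, 6],
    "km_driven": [61000, 89000, 25000, 51235, 50000, 70000],
    "fuel_type": ["Diesel", "Diesel", "Petrol", "Diesel", "Petrol", "Petrol"],
    "selling_price": [550000, 1200000, 700000, 600000, 450000, 400000],
})
X_toy = toy.drop(columns="selling_price")
y_toy = toy["selling_price"]

preprocess = ColumnTransformer(
    [
        ("scale", StandardScaler(), ["vehicle_age", "km_driven"]),
        ("onehot", OneHotEncoder(drop="first"), ["fuel_type"]),
    ]
)

pipe = Pipeline([("preprocess", preprocess), ("model", LinearRegression())])

# fit() learns scaling and encoding from the first four rows only;
# predict() applies the already-fitted transformers to the held-out rows.
pipe.fit(X_toy.iloc[:4], y_toy.iloc[:4])
preds = pipe.predict(X_toy.iloc[4:])
print(preds.shape)
```

A pipeline like this can also be passed directly to a cross-validation search, so each fold refits the preprocessing on its own training portion.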

# **Model Training**

In [13]:
pip install xgboost



In [14]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor

models = {
    'LR' : LinearRegression(),
    'Lasso' : Lasso(),
    'Ridge' : Ridge(),
    'KNN' : KNeighborsRegressor(),
    'dtr' : DecisionTreeRegressor(),
    'rfr' : RandomForestRegressor(),
    'adr' : AdaBoostRegressor(),
    'gbr' : GradientBoostingRegressor(),
    'xgbr': XGBRegressor()
}

In [15]:
from sklearn.metrics import r2_score, mean_absolute_error

for model_name, model in list(models.items()):
  model.fit(X_train, y_train)
  y_pred = model.predict(X_test)

  print(f"Evaluation of {model_name} model ")
  print('r2_score : ', r2_score(y_test, y_pred))
  print('mean_absolute_error : \n', mean_absolute_error(y_test, y_pred))
  print("\n *********************************************************\n")

Evaluation of LR model 
r2_score :  0.6970780390013382
mean_absolute_error : 
 266897.00825787743

 *********************************************************

Evaluation of Lasso model 
r2_score :  0.6970793416005481
mean_absolute_error : 
 266891.4642698102

 *********************************************************

Evaluation of Ridge model 
r2_score :  0.6970721732801253
mean_absolute_error : 
 266832.5538440468

 *********************************************************

Evaluation of KNN model 
r2_score :  0.892950284108542
mean_absolute_error : 
 110745.44275056763

 *********************************************************

Evaluation of dtr model 
r2_score :  0.8207798177081544
mean_absolute_error : 
 136922.41863985296

 *********************************************************

Evaluation of rfr model 
r2_score :  0.9185742894450157
mean_absolute_error : 
 104430.9961936917

 *********************************************************

Evaluation of adr model 
r2_score :  0.69
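
For reference, the two metrics printed above can be reproduced by hand. R² is one minus the ratio of the residual sum of squares to the total sum of squares, and the mean absolute error is the average absolute difference between prediction and target. The sketch below uses hypothetical prices, not values from the dataset, and the function names are illustrative.

```python
def mean_absolute_error_manual(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score_manual(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical selling prices and predictions.
y_true_toy = [300000, 500000, 700000, 900000]
y_pred_toy = [350000, 450000, 700000, 800000]

mae_toy = mean_absolute_error_manual(y_true_toy, y_pred_toy)
r2_toy = r2_score_manual(y_true_toy, y_pred_toy)

print(mae_toy)          # 50000.0
print(f"{r2_toy:.3f}")  # 0.925
```

Because the mean absolute error is expressed in the same unit as `selling_price`, it is often easier to interpret than R² when comparing models on this task.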

In [16]:
rfr_params = {
    'n_estimators' : [50, 100, 150, 200],
    'criterion' : ['squared_error', 'absolute_error', 'friedman_mse', 'poisson'],
    'max_depth' : [2,3,4,5,6,7],
}
params_xgbr = {
    'learning_rate' : [0.1, 0.01],
    'max_depth' : [5, 8, 12, 20, 30],
    'n_estimators' : [100, 200, 300],
    'colsample_bytree' : [0.5, 0.8, 1, 0.3, 0.4]
}

In [17]:
models = [
    ('rfr', RandomForestRegressor(), rfr_params),
    ('xgbr', XGBRegressor(), params_xgbr)
]

In [18]:
from sklearn.model_selection import RandomizedSearchCV
for name, model, params in models:
  print(f'{name} is under going HP tunning')
  rscv = RandomizedSearchCV(estimator=model,
                            param_distributions=params, n_jobs=-1,
                            refit=True, cv=5, scoring='r2')
  rscv.fit(X_train, y_train)
  print("Eval o/p")
  print(rscv.best_params_)
  print(rscv.best_score_)
  y_pred = rscv.predict(X_test)

  print(f"Evaluation of {name} model ")
  print('r2 score : ', r2_score(y_test, y_pred))
  print('mean_absolute_error : \n', mean_absolute_error(y_test, y_pred))
  print("\n ******************************************************\n")


rfr is under going HP tunning
Eval o/p
{'n_estimators': 100, 'max_depth': 6, 'criterion': 'friedman_mse'}
0.8360033449292938
Evaluation of rfr model 
r2 score :  0.8910185083282498
mean_absolute_error : 
 138523.71405548332

 ******************************************************

xgbr is under going HP tunning
Eval o/p
{'n_estimators': 300, 'max_depth': 8, 'learning_rate': 0.1, 'colsample_bytree': 0.4}
0.8639708755900186
Evaluation of xgbr model 
r2 score :  0.9282568020028568
mean_absolute_error : 
 100895.54170647249

 ******************************************************
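
When reading these results, it helps to know how much of each search space was actually explored. `RandomizedSearchCV` was called without `n_iter`, so it uses its default of 10 sampled combinations per model. The sketch below counts the combinations in the two parameter dictionaries defined earlier; the helper name `grid_size` is illustrative.

```python
from math import prod

rfr_params = {
    'n_estimators': [50, 100, 150, 200],
    'criterion': ['squared_error', 'absolute_error', 'friedman_mse', 'poisson'],
    'max_depth': [2, 3, 4, 5, 6, 7],
}
params_xgbr = {
    'learning_rate': [0.1, 0.01],
    'max_depth': [5, 8, 12, 20, 30],
    'n_estimators': [100, 200, 300],
    'colsample_bytree': [0.5, 0.8, 1, 0.3, 0.4],
}


def grid_size(space):
    """Number of distinct combinations in a dict of candidate lists."""
    return prod(len(values) for values in space.values())


N_ITER = 10  # RandomizedSearchCV default when n_iter is not passed

for name, space in [('rfr', rfr_params), ('xgbr', params_xgbr)]:
    total = grid_size(space)
    print(f"{name}: {N_ITER} of {total} combinations sampled ({N_ITER / total:.1%})")
# rfr: 10 of 96 combinations sampled (10.4%)
# xgbr: 10 of 150 combinations sampled (6.7%)
```

Since only a small fraction of each space is tried and no `random_state` is set, reruns may select different parameters. Note also that `best_score_` is the mean cross-validated score on the training folds, so it need not match the R² measured on the test set.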

