
# **Car Price Prediction**

The used car market in India changes quickly. Prices depend on many factors, including the make and model of the car, its mileage, its condition, and current market conditions, so sellers often find it difficult to price their cars accurately.

This dataset contains information about used cars listed for sale.\
It can be used for tasks such as used car price prediction with different machine learning techniques.

Data Description (Feature Information)

car_name: Full name of the car, including the brand and the specific model name.\
brand: Brand name of the car.\
model: Exact model name of the car within its brand.\
seller_type: Type of seller offering the used car.\
fuel_type: Fuel used by the car.\
transmission_type: Transmission used by the car.\
vehicle_age: Number of years since the car was bought.\
km_driven: Total number of kilometres the car has been driven.\
mileage: Number of kilometres the car runs per litre of fuel.\
engine: Engine capacity in cc (cubic centimetres).\
max_power: Maximum power the engine produces, in BHP.\
seats: Total number of seats in the car.\
selling_price: Sale price listed on the website.

**Dataset:**
https://www.kaggle.com/datasets/manishkr1754/cardekho-used-car-data

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
df = pd.read_csv('/content/cardekho_dataset.csv')

In [None]:
df.head()

,Unnamed: 0,car_name,brand,model,vehicle_age,km_driven,seller_type,fuel_type,transmission_type,mileage,engine,max_power,seats,selling_price
0,0,Maruti Alto,Maruti,Alto,9,120000,Individual,Petrol,Manual,19.7,796,46.3,5,120000
1,1,Hyundai Grand,Hyundai,Grand,5,20000,Individual,Petrol,Manual,18.9,1197,82.0,5,550000
2,2,Hyundai i20,Hyundai,i20,11,60000,Individual,Petrol,Manual,17.0,1197,80.0,5,215000
3,3,Maruti Alto,Maruti,Alto,9,37000,Individual,Petrol,Manual,20.92,998,67.1,5,226000
4,4,Ford Ecosport,Ford,Ecosport,6,30000,Dealer,Diesel,Manual,22.77,1498,98.59,5,570000


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15411 entries, 0 to 15410
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         15411 non-null  int64  
 1   car_name           15411 non-null  object 
 2   brand              15411 non-null  object 
 3   model              15411 non-null  object 
 4   vehicle_age        15411 non-null  int64  
 5   km_driven          15411 non-null  int64  
 6   seller_type        15411 non-null  object 
 7   fuel_type          15411 non-null  object 
 8   transmission_type  15411 non-null  object 
 9   mileage            15411 non-null  float64
 10  engine             15411 non-null  int64  
 11  max_power          15411 non-null  float64
 12  seats              15411 non-null  int64  
 13  selling_price      15411 non-null  int64  
dtypes: float64(2), int64(6), object(6)
memory usage: 1.6+ MB


In [None]:
df.isnull().sum()

,0
Unnamed: 0,0
car_name,0
brand,0
model,0
vehicle_age,0
km_driven,0
seller_type,0
fuel_type,0
transmission_type,0
mileage,0


The data has no null values, so we can move straight to feature engineering.

# **Feature Engineering**

In [None]:
df.columns

Index(['Unnamed: 0', 'car_name', 'brand', 'model', 'vehicle_age', 'km_driven',
       'seller_type', 'fuel_type', 'transmission_type', 'mileage', 'engine',
       'max_power', 'seats', 'selling_price'],
      dtype='object')

In [None]:
# car_name is just brand + model, so drop car_name and brand and keep only model.
# 'Unnamed: 0' looks like a leftover row index, so drop it as well.
df.drop(['Unnamed: 0', 'car_name', 'brand'], axis=1, inplace=True)

In [None]:
df.head()

,model,vehicle_age,km_driven,seller_type,fuel_type,transmission_type,mileage,engine,max_power,seats,selling_price
0,Alto,9,120000,Individual,Petrol,Manual,19.7,796,46.3,5,120000
1,Grand,5,20000,Individual,Petrol,Manual,18.9,1197,82.0,5,550000
2,i20,11,60000,Individual,Petrol,Manual,17.0,1197,80.0,5,215000
3,Alto,9,37000,Individual,Petrol,Manual,20.92,998,67.1,5,226000
4,Ecosport,6,30000,Dealer,Diesel,Manual,22.77,1498,98.59,5,570000


In [None]:
df.model.unique()

array(['Alto', 'Grand', 'i20', 'Ecosport', 'Wagon R', 'i10', 'Venue',
       'Swift', 'Verna', 'Duster', 'Cooper', 'Ciaz', 'C-Class', 'Innova',
       'Baleno', 'Swift Dzire', 'Vento', 'Creta', 'City', 'Bolero',
       'Fortuner', 'KWID', 'Amaze', 'Santro', 'XUV500', 'KUV100', 'Ignis',
       'RediGO', 'Scorpio', 'Marazzo', 'Aspire', 'Figo', 'Vitara',
       'Tiago', 'Polo', 'Seltos', 'Celerio', 'GO', '5', 'CR-V',
       'Endeavour', 'KUV', 'Jazz', '3', 'A4', 'Tigor', 'Ertiga', 'Safari',
       'Thar', 'Hexa', 'Rover', 'Eeco', 'A6', 'E-Class', 'Q7', 'Z4', '6',
       'XF', 'X5', 'Hector', 'Civic', 'D-Max', 'Cayenne', 'X1', 'Rapid',
       'Freestyle', 'Superb', 'Nexon', 'XUV300', 'Dzire VXI', 'S90',
       'WR-V', 'XL6', 'Triber', 'ES', 'Wrangler', 'Camry', 'Elantra',
       'Yaris', 'GL-Class', '7', 'S-Presso', 'Dzire LXI', 'Aura', 'XC',
       'Ghibli', 'Continental', 'CR', 'Kicks', 'S-Class', 'Tucson',
       'Harrier', 'X3', 'Octavia', 'Compass', 'CLS', 'redi-GO', 'Glanza',
       

In [None]:
df.seller_type.unique()

array(['Individual', 'Dealer', 'Trustmark Dealer'], dtype=object)

In [None]:
df.fuel_type.unique()

array(['Petrol', 'Diesel', 'CNG', 'LPG', 'Electric'], dtype=object)

In [None]:
df.transmission_type.unique()

array(['Manual', 'Automatic'], dtype=object)

In [None]:
# Numeric (non-object) columns; [:-1] drops the last one, selling_price, which is the target
continuous_features = [feature for feature in df.columns if df.dtypes[feature] != 'O'][:-1]
continuous_features

['vehicle_age', 'km_driven', 'mileage', 'engine', 'max_power', 'seats']

In [None]:
categorical_features = [feature for feature in df.columns if df.dtypes[feature] == 'O']
categorical_features

['model', 'seller_type', 'fuel_type', 'transmission_type']
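
The two list comprehensions above split the columns by dtype and rely on `selling_price` being the last numeric column. A minimal sketch of the same split with `select_dtypes`, assuming a small toy frame (the rows are copied from the `df.head()` output; the names `toy`, `target`, `numeric_cols` and `categorical_cols` are only for illustration):

```python
import pandas as pd

# Toy frame reusing a few column names and rows from the notebook's data.
toy = pd.DataFrame({
    "model": ["Alto", "Grand", "i20"],
    "vehicle_age": [9, 5, 11],
    "fuel_type": ["Petrol", "Petrol", "Petrol"],
    "mileage": [19.7, 18.9, 17.0],
    "selling_price": [120000, 550000, 215000],
})

target = "selling_price"

# Numeric columns, with the target removed by name rather than by position.
numeric_cols = toy.select_dtypes(include="number").columns.drop(target).tolist()

# Everything that is not numeric is treated as categorical.
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(categorical_cols)
```

Dropping the target by name keeps working if the column order changes, which the positional `[:-1]` does not.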

In [None]:
df.columns

Index(['model', 'vehicle_age', 'km_driven', 'seller_type', 'fuel_type',
       'transmission_type', 'mileage', 'engine', 'max_power', 'seats',
       'selling_price'],
      dtype='object')

In [None]:
X = df.drop('selling_price', axis = 1)
y = df['selling_price']

In [None]:
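from sklearn.preprocessing import LabelEncoder

# Map each model name to an integer code. Fitting on the full column before
# the split guarantees that every model name seen at test time has a code.
LE = LabelEncoder()
X['model'] = LE.fit_transform(X['model'])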

In [None]:
# Train/test split: hold out 20% of the rows for evaluation
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=31)

For seller_type, fuel_type, and transmission_type we can use OneHotEncoder. The model column has many distinct values (see the list above), so it is encoded as integers with LabelEncoder instead, as done above.

In [None]:
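from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

SC = StandardScaler()
# drop='first' removes one dummy column per feature to avoid redundant columns
OHE = OneHotEncoder(drop='first')

ct = ColumnTransformer(
    [
        ('StandardScaler', SC, continuous_features),
        # categorical_features[1:] skips 'model', which is already integer-encoded
        ('OneHotEncoder', OHE, categorical_features[1:]),
    ],
    # 'model' is passed through unchanged and ends up as the last output column
    remainder='passthrough',
)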

In [None]:
X_train.head()

,model,vehicle_age,km_driven,seller_type,fuel_type,transmission_type,mileage,engine,max_power,seats
11464,98,5,61000,Dealer,Diesel,Manual,24.3,1248,88.5,5
10982,3,8,89000,Dealer,Diesel,Automatic,14.49,2993,258.0,5
207,25,3,25000,Individual,Petrol,Manual,17.4,1497,117.3,5
9569,32,8,51235,Dealer,Diesel,Manual,19.01,1461,108.45,5
1061,118,3,50000,Individual,Petrol,Manual,18.6,1197,81.83,5


In [None]:
X_train = ct.fit_transform(X_train)
X_test = ct.transform(X_test)

In [None]:
X_train[0]

array([-0.34994468,  0.12340515,  1.09668358, -0.45220642, -0.27683625,
       -0.39964597,  0.        ,  0.        ,  1.        ,  0.        ,
        0.        ,  0.        ,  1.        , 98.        ])

# **Model Training**

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.neighbors import KNeighborsRegressor

models = {
    'LR' : LinearRegression(),
    'Lasso' : Lasso(),
    'Ridge' : Ridge(),
    'KNN' : KNeighborsRegressor(),
    'dtr' : DecisionTreeRegressor(),
    'rfr' : RandomForestRegressor()
}

In [None]:
from sklearn.metrics import r2_score, mean_absolute_error

# Fit each model with default settings and score it on the held-out test set
for model_name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    print(f"Evaluation of {model_name} model ")
    print('r2_score : ', r2_score(y_test, y_pred))
    print('mean_absolute_error : \n', mean_absolute_error(y_test, y_pred))
    print("\n *********************************************************\n")

Evaluation of LR model 
r2_score :  0.6970780390013382
mean_absolute_error : 
 266897.00825787743

 *********************************************************

Evaluation of Lasso model 
r2_score :  0.6970793416005481
mean_absolute_error : 
 266891.4642698102

 *********************************************************

Evaluation of Ridge model 
r2_score :  0.6970721732801253
mean_absolute_error : 
 266832.5538440468

 *********************************************************

Evaluation of KNN model 
r2_score :  0.892950284108542
mean_absolute_error : 
 110745.44275056763

 *********************************************************

Evaluation of dtr model 
r2_score :  0.803405257620456
mean_absolute_error : 
 142190.54492377554

 *********************************************************

Evaluation of rfr model 
r2_score :  0.9180316143801208
mean_absolute_error : 
 104930.94203320736

 *********************************************************
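
For reading the numbers above: the mean absolute error is the average size of the prediction error in the same unit as selling_price, and the r2 score is one minus the ratio of the squared prediction error to the squared deviation from the mean price. A small sketch that computes both by hand and compares them with scikit-learn (the true prices come from the `df.head()` output; the predictions in `y_hat` are made up for illustration):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# True prices from the first five rows; predictions are hypothetical.
y_true = np.array([120000, 550000, 215000, 226000, 570000], dtype=float)
y_hat = np.array([150000, 500000, 200000, 260000, 600000], dtype=float)

# MAE: average absolute difference between truth and prediction.
mae_manual = np.mean(np.abs(y_true - y_hat))

# R^2: 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mae_manual, mean_absolute_error(y_true, y_hat))
print(r2_manual, r2_score(y_true, y_hat))
```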



KNN and the random forest regressor perform best, so let's tune the hyperparameters of these two models.

In [None]:
rfr_params = {
    'n_estimators' : [50, 100, 150, 200],
    'criterion' : ['squared_error', 'absolute_error', 'friedman_mse', 'poisson'],
    'max_depth' : [2,3,4,5,6,7],
}
knn_params = {
    'n_neighbors':[2,3,4,5,6,7,8],
    'weights' : ['uniform','distance'],
    'algorithm' : ['auto', 'ball_tree', 'kd_tree', 'brute'],
    'p' : [1, 2]
}
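
As a rough sense of scale, the sketch below counts how many combinations each of these two grids contains, using scikit-learn's `ParameterGrid` (the names `n_rfr` and `n_knn` are just for this example). A randomized search only tries `n_iter` of them, which is 10 unless set otherwise.

```python
from sklearn.model_selection import ParameterGrid

# Same search spaces as in the cell above.
rfr_params = {
    'n_estimators': [50, 100, 150, 200],
    'criterion': ['squared_error', 'absolute_error', 'friedman_mse', 'poisson'],
    'max_depth': [2, 3, 4, 5, 6, 7],
}
knn_params = {
    'n_neighbors': [2, 3, 4, 5, 6, 7, 8],
    'weights': ['uniform', 'distance'],
    'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
    'p': [1, 2],
}

# Total number of distinct combinations in each grid.
n_rfr = len(ParameterGrid(rfr_params))
n_knn = len(ParameterGrid(knn_params))

print(n_rfr, n_knn)
```

With the default `n_iter=10`, each search covers roughly a tenth of its grid, so the best parameters it reports are the best among the sampled combinations, not necessarily the best in the whole grid.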

In [None]:
models = [
    ('rfr', RandomForestRegressor(), rfr_params),
    ('knn', KNeighborsRegressor(), knn_params)
]

In [None]:
from sklearn.model_selection import RandomizedSearchCV

for name, model, params in models:
    print(f'{name} is under going HP tunning')
    # Randomized search: samples n_iter (default 10) parameter combinations
    # and scores each one with 5-fold cross-validation
    rscv = RandomizedSearchCV(estimator=model,
                              param_distributions=params, n_jobs=-1,
                              refit=True, cv=5, scoring='r2')
    rscv.fit(X_train, y_train)
    print("Eval o/p")
    print(rscv.best_params_)  # best combination found
    print(rscv.best_score_)   # its mean cross-validated r2
    # refit=True retrains the best configuration on all of X_train
    y_pred = rscv.predict(X_test)

    print(f"Evaluation of {name} model ")
    print('r2 score : ', r2_score(y_test, y_pred))
    print('mean_absolute_error : \n', mean_absolute_error(y_test, y_pred))
    print("\n ******************************************************\n")


rfr is under going HP tunning
Eval o/p
{'n_estimators': 50, 'max_depth': 7, 'criterion': 'absolute_error'}
0.85264102070715
Evaluation of rfr model 
r2 score :  0.8953092265713251
mean_absolute_error : 
 127630.15812520272

 ******************************************************

knn is under going HP tunning
Eval o/p
{'weights': 'distance', 'p': 2, 'n_neighbors': 2, 'algorithm': 'ball_tree'}
0.8794259806000998
Evaluation of knn model 
r2 score :  0.9156607865934887
mean_absolute_error : 
 109644.21589398784

 ******************************************************
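
The searches above do not set `random_state`, so a rerun can sample different combinations and report different best parameters. A minimal sketch, assuming synthetic data from `make_regression` and a reduced KNN search space (`X_demo`, `y_demo`, `knn_space` and `run_search` are hypothetical names), shows that passing the same `random_state` makes the search repeatable:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the car data, small enough to run quickly.
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=0)

knn_space = {
    "n_neighbors": [2, 3, 4, 5, 6, 7, 8],
    "weights": ["uniform", "distance"],
    "p": [1, 2],
}

def run_search(seed):
    """Run one randomized search with a fixed seed and return its result."""
    search = RandomizedSearchCV(
        estimator=KNeighborsRegressor(),
        param_distributions=knn_space,
        n_iter=5,
        cv=3,
        scoring="r2",
        random_state=seed,  # fixes which combinations are sampled
    )
    search.fit(X_demo, y_demo)
    return search.best_params_, search.best_score_

first = run_search(seed=31)
second = run_search(seed=31)

print(first)
print(first == second)
```

For the random forest, the estimator itself is also random, so it would need its own `random_state` as well for fully repeatable results.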

