# Used Car Price Prediction

## 1) Problem statement.

* This dataset comprises used cars sold on cardekho.com in India, along with important features of these cars.
* The goal is to predict the selling price of a car from its input features.
* The predictions can be used to suggest a price to new sellers based on market conditions.

## 2) Data Collection.
* The dataset was collected by scraping the cardekho website.
* The data consists of 13 columns and 15411 rows.

In [34]:
# Import Libraries
import pandas as pd
import numpy as np

from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

from sklearn.model_selection import train_test_split, RandomizedSearchCV

from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import (
    RandomForestRegressor,
    AdaBoostRegressor,
    GradientBoostingRegressor
)
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

import warnings
warnings.filterwarnings("ignore")

- Load the dataset

In [2]:
# Load the dataset
df = pd.read_csv("../dataset/cardekho_imputated.csv", index_col=[0])

In [3]:
df.head()

Unnamed: 0,car_name,brand,model,vehicle_age,km_driven,seller_type,fuel_type,transmission_type,mileage,engine,max_power,seats,selling_price
0,Maruti Alto,Maruti,Alto,9,120000,Individual,Petrol,Manual,19.7,796,46.3,5,120000
1,Hyundai Grand,Hyundai,Grand,5,20000,Individual,Petrol,Manual,18.9,1197,82.0,5,550000
2,Hyundai i20,Hyundai,i20,11,60000,Individual,Petrol,Manual,17.0,1197,80.0,5,215000
3,Maruti Alto,Maruti,Alto,9,37000,Individual,Petrol,Manual,20.92,998,67.1,5,226000
4,Ford Ecosport,Ford,Ecosport,6,30000,Dealer,Diesel,Manual,22.77,1498,98.59,5,570000


## Data Cleaning
### Handling Missing values

* Handling Missing values 
* Handling Duplicates
* Check data type
* Understand the dataset
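
The duplicate and data-type checks listed above could be sketched as follows. This is a minimal illustration on a hand-made frame, not on the real data: the name `toy` and its four rows are assumptions chosen so that one row is an exact duplicate.

```python
import pandas as pd

# Hypothetical toy frame; the last row repeats the first one exactly.
toy = pd.DataFrame({
    "model": ["Alto", "Grand", "i20", "Alto"],
    "vehicle_age": [9, 5, 11, 9],
    "km_driven": [120000, 20000, 60000, 120000],
    "selling_price": [120000, 550000, 215000, 120000],
})

# Count rows that are identical to an earlier row.
n_duplicates = int(toy.duplicated().sum())

# Drop the repeated rows and rebuild a clean 0..n-1 index.
deduped = toy.drop_duplicates().reset_index(drop=True)

print("Duplicate rows:", n_duplicates)
print(toy.dtypes)      # data type of every column
print(deduped.shape)   # shape after removing duplicates
```

Whether duplicates should be dropped from scraped listings is a judgment call, since two distinct cars can legitimately share every recorded feature.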

In [4]:
# Check Null values
df.isnull().sum()

car_name             0
brand                0
model                0
vehicle_age          0
km_driven            0
seller_type          0
fuel_type            0
transmission_type    0
mileage              0
engine               0
max_power            0
seats                0
selling_price        0
dtype: int64

- **Note:** There are no missing values in the dataset.

- Remove unnecessary features from the dataset. Some features carry the same information: `car_name` is the combination of `brand` and `model`. So, we can remove `car_name` and `brand` from the dataset.
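
Before dropping the columns, the redundancy can be checked directly. The sketch below uses four rows copied from the `df.head()` output; the names `toy`, `rebuilt` and `brands_per_model` are illustrative, and the same checks would need to be run on the full frame to be conclusive.

```python
import pandas as pd

# Rows taken from the head of the dataset shown earlier.
toy = pd.DataFrame({
    "car_name": ["Maruti Alto", "Hyundai Grand", "Hyundai i20", "Ford Ecosport"],
    "brand": ["Maruti", "Hyundai", "Hyundai", "Ford"],
    "model": ["Alto", "Grand", "i20", "Ecosport"],
})

# car_name is redundant if it can be rebuilt from brand and model.
rebuilt = toy["brand"] + " " + toy["model"]
is_redundant = bool((rebuilt == toy["car_name"]).all())

# brand is redundant only if every model maps to exactly one brand.
brands_per_model = toy.groupby("model")["brand"].nunique()
brand_is_implied = bool((brands_per_model == 1).all())

print(is_redundant, brand_is_implied)
```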

In [5]:
# Remove unnecessary features
df.drop(['car_name', 'brand'], axis=1, inplace=True)

In [6]:
df.head()

Unnamed: 0,model,vehicle_age,km_driven,seller_type,fuel_type,transmission_type,mileage,engine,max_power,seats,selling_price
0,Alto,9,120000,Individual,Petrol,Manual,19.7,796,46.3,5,120000
1,Grand,5,20000,Individual,Petrol,Manual,18.9,1197,82.0,5,550000
2,i20,11,60000,Individual,Petrol,Manual,17.0,1197,80.0,5,215000
3,Alto,9,37000,Individual,Petrol,Manual,20.92,998,67.1,5,226000
4,Ecosport,6,30000,Dealer,Diesel,Manual,22.77,1498,98.59,5,570000


### Getting all different types of features

In [7]:
# Categorical Features
cat_features = [features for features in df.columns if df[features].dtype == 'O']
print("Number of Categorical Features: ", len(cat_features))

# All Numerical Features
num_features = [features for features in df.columns if df[features].dtype != 'O']
print("Number of Numerical Features: ", len(num_features))

# Discrete Numerical Features
discrete_features = [features for features in num_features if len(df[features].unique()) <= 25]
print("Number of Discrete Numerical Features: ", len(discrete_features))

# Continuous Numerical Features
continuous_features = [features for features in num_features if features not in discrete_features]
print("Number of Continuous Numerical Features: ", len(continuous_features))

Number of Categorical Features:  4
Number of Numerical Features:  7
Number of Discrete Numerical Features:  2
Number of Continuous Numerical Features:  5


In [8]:
# Separate Dependent and Independent Features
X = df.drop("selling_price", axis=1) # Independent Features
y = df['selling_price'] # Dependent Feature

In [9]:
X.head()

Unnamed: 0,model,vehicle_age,km_driven,seller_type,fuel_type,transmission_type,mileage,engine,max_power,seats
0,Alto,9,120000,Individual,Petrol,Manual,19.7,796,46.3,5
1,Grand,5,20000,Individual,Petrol,Manual,18.9,1197,82.0,5
2,i20,11,60000,Individual,Petrol,Manual,17.0,1197,80.0,5
3,Alto,9,37000,Individual,Petrol,Manual,20.92,998,67.1,5
4,Ecosport,6,30000,Dealer,Diesel,Manual,22.77,1498,98.59,5


In [10]:
y.head()

0    120000
1    550000
2    215000
3    226000
4    570000
Name: selling_price, dtype: int64

## Feature Encoding and Scaling
**One-hot encoding for columns that have few unique values and are not ordinal**
* One-hot encoding converts a categorical variable into one binary column per category, a form that ML algorithms can use directly.

In [11]:
X['model'].value_counts()

model
i20             906
Swift Dzire     890
Swift           781
Alto            778
City            757
               ... 
Altroz            1
C                 1
Ghost             1
Quattroporte      1
Gurkha            1
Name: count, Length: 120, dtype: int64

In [12]:
X['model'].unique()

array(['Alto', 'Grand', 'i20', 'Ecosport', 'Wagon R', 'i10', 'Venue',
       'Swift', 'Verna', 'Duster', 'Cooper', 'Ciaz', 'C-Class', 'Innova',
       'Baleno', 'Swift Dzire', 'Vento', 'Creta', 'City', 'Bolero',
       'Fortuner', 'KWID', 'Amaze', 'Santro', 'XUV500', 'KUV100', 'Ignis',
       'RediGO', 'Scorpio', 'Marazzo', 'Aspire', 'Figo', 'Vitara',
       'Tiago', 'Polo', 'Seltos', 'Celerio', 'GO', '5', 'CR-V',
       'Endeavour', 'KUV', 'Jazz', '3', 'A4', 'Tigor', 'Ertiga', 'Safari',
       'Thar', 'Hexa', 'Rover', 'Eeco', 'A6', 'E-Class', 'Q7', 'Z4', '6',
       'XF', 'X5', 'Hector', 'Civic', 'D-Max', 'Cayenne', 'X1', 'Rapid',
       'Freestyle', 'Superb', 'Nexon', 'XUV300', 'Dzire VXI', 'S90',
       'WR-V', 'XL6', 'Triber', 'ES', 'Wrangler', 'Camry', 'Elantra',
       'Yaris', 'GL-Class', '7', 'S-Presso', 'Dzire LXI', 'Aura', 'XC',
       'Ghibli', 'Continental', 'CR', 'Kicks', 'S-Class', 'Tucson',
       'Harrier', 'X3', 'Octavia', 'Compass', 'CLS', 'redi-GO', 'Glanza',
       

In [13]:
len(X['model'].unique())

120

**The 'model' feature has 120 unique values and is not ordinal. One-hot encoding it would add too many columns, so we use label encoding.**
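
To make the effect of label encoding concrete, here is a minimal sketch on a hand-made list (`model_names` is an illustrative name). `LabelEncoder` sorts the distinct values and replaces each one with its position in that sorted order, so the resulting integers carry no meaning beyond identity.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of model names.
model_names = ["Alto", "Grand", "i20", "Alto", "Ecosport"]

le_demo = LabelEncoder()
codes = le_demo.fit_transform(model_names)

# Classes are stored in sorted order; uppercase letters sort before lowercase.
classes = list(le_demo.classes_)
print(classes)
print(codes.tolist())

# The mapping can be reversed to recover the original labels.
recovered = list(le_demo.inverse_transform(codes))
print(recovered)
```

Because the codes imply an ordering that does not exist between car models, distance-based and linear models may be affected by it more than tree-based models are.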

In [14]:
le = LabelEncoder()
X['model'] = le.fit_transform(X['model'])

In [15]:
X.head()

Unnamed: 0,model,vehicle_age,km_driven,seller_type,fuel_type,transmission_type,mileage,engine,max_power,seats
0,7,9,120000,Individual,Petrol,Manual,19.7,796,46.3,5
1,54,5,20000,Individual,Petrol,Manual,18.9,1197,82.0,5
2,118,11,60000,Individual,Petrol,Manual,17.0,1197,80.0,5
3,7,9,37000,Individual,Petrol,Manual,20.92,998,67.1,5
4,38,6,30000,Dealer,Diesel,Manual,22.77,1498,98.59,5


In [16]:
X['seller_type'].value_counts()

seller_type
Dealer              9539
Individual          5699
Trustmark Dealer     173
Name: count, dtype: int64

In [17]:
X['fuel_type'].value_counts()

fuel_type
Petrol      7643
Diesel      7419
CNG          301
LPG           44
Electric       4
Name: count, dtype: int64

In [18]:
X['transmission_type'].value_counts()

transmission_type
Manual       12225
Automatic     3186
Name: count, dtype: int64

**Note: We handle `seller_type`, `fuel_type` and `transmission_type` features using one hot encoding.** 

In [19]:
# Create Column Transformer with 2 types of transformers
X_num_features = X.select_dtypes(exclude='object').columns
ohe_features = ['seller_type', 'fuel_type', 'transmission_type']

numeric_transformer = StandardScaler()
oh_transformer = OneHotEncoder(drop='first')

preprocessor = ColumnTransformer(
    [
        ("OneHotEncoder", oh_transformer, ohe_features),
        ("StandardScaler", numeric_transformer, X_num_features)
    ], 
    remainder='passthrough'
)

In [20]:
preprocessor

In [21]:
X=preprocessor.fit_transform(X)
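
The cell above fits the preprocessor on the full feature matrix before the train/test split. A common alternative, sketched here and not what this notebook does, is to split first and fit the transformer on the training rows only, so that test-set statistics do not influence the scaler. The frame `toy` and the names `toy_preprocessor`, `train_t` and `test_t` are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with one categorical and one numeric column.
toy = pd.DataFrame({
    "fuel_type": ["Petrol", "Diesel", "Petrol", "Diesel",
                  "Petrol", "Diesel", "Petrol", "Diesel"],
    "km_driven": [120000, 20000, 60000, 37000, 30000, 45000, 80000, 15000],
})

# Split first, so the test rows stay unseen during fitting.
toy_train, toy_test = train_test_split(toy, test_size=0.25, random_state=42)

toy_preprocessor = ColumnTransformer([
    ("OneHotEncoder", OneHotEncoder(drop="first"), ["fuel_type"]),
    ("StandardScaler", StandardScaler(), ["km_driven"]),
])

# Learn categories, mean and variance from the training rows only.
train_t = toy_preprocessor.fit_transform(toy_train)
# Reuse the fitted statistics on the test rows.
test_t = toy_preprocessor.transform(toy_test)

print(train_t.shape, test_t.shape)
```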

In [22]:
pd.DataFrame(X)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,-1.519714,0.983562,1.247335,-0.000276,-1.324259,-1.263352,-0.403022
1,1.0,0.0,0.0,0.0,0.0,1.0,1.0,-0.225693,-0.343933,-0.690016,-0.192071,-0.554718,-0.432571,-0.403022
2,1.0,0.0,0.0,0.0,0.0,1.0,1.0,1.536377,1.647309,0.084924,-0.647583,-0.554718,-0.479113,-0.403022
3,1.0,0.0,0.0,0.0,0.0,1.0,1.0,-1.519714,0.983562,-0.360667,0.292211,-0.936610,-0.779312,-0.403022
4,0.0,0.0,1.0,0.0,0.0,0.0,1.0,-0.666211,-0.012060,-0.496281,0.735736,0.022918,-0.046502,-0.403022
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15406,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.508844,0.983562,-0.869744,0.026096,-0.767733,-0.757204,-0.403022
15407,0.0,0.0,0.0,0.0,0.0,1.0,1.0,-0.556082,-1.339555,-0.728763,-0.527711,-0.216964,-0.220803,2.073444
15408,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.407551,-0.012060,0.220539,0.344954,0.022918,0.068225,-0.403022
15409,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.426247,-0.343933,72.541850,-0.887326,1.329794,0.917158,2.073444


### Split the dataset into training and testing data

In [23]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=42)

In [24]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((12328, 14), (3083, 14), (12328,), (3083,))
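
The shapes above follow from how `train_test_split` sizes the test set: with a fractional `test_size`, it rounds the test count up and gives the remainder to training. A short check using only the row count reported for this dataset (the array `indices` is a stand-in for the real rows):

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 15411                      # number of rows in the dataset
indices = np.arange(n_rows)         # stand-in for the real feature matrix

train_idx, test_idx = train_test_split(indices, test_size=0.2, random_state=42)

# 15411 * 0.2 = 3082.2, which is rounded up for the test set.
expected_test = math.ceil(n_rows * 0.2)
expected_train = n_rows - expected_test

print(len(train_idx), len(test_idx))
```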

## **Model Training, Model Selection, and Evaluation**

In [25]:
# Create a function to evaluate model
def evaluate_model(true, predict):
    mae = mean_absolute_error(true, predict)
    mse = mean_squared_error(true, predict)
    rmse = np.sqrt(mse)
    r2 = r2_score(true, predict)
    return mae, rmse, r2
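
To show what the three metrics returned by `evaluate_model` measure, the sketch below computes them by hand on four made-up values and compares the results with scikit-learn. The arrays `y_true_demo` and `y_pred_demo` are assumptions, not data from this notebook.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical true and predicted prices.
y_true_demo = np.array([100.0, 200.0, 300.0, 400.0])
y_pred_demo = np.array([110.0, 190.0, 320.0, 380.0])

errors = y_true_demo - y_pred_demo           # -10, 10, -20, 20

# MAE: average size of the error, in the same unit as the target.
mae_manual = np.mean(np.abs(errors))         # (10 + 10 + 20 + 20) / 4

# RMSE: square root of the mean squared error; penalizes large errors more.
rmse_manual = np.sqrt(np.mean(errors ** 2))  # sqrt(1000 / 4)

# R2: share of the target's variance explained by the predictions.
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot              # 1 - 1000 / 50000

print(mae_manual, rmse_manual, r2_manual)
```

Since RMSE weights large errors more heavily than MAE, a wide gap between the two usually points to a few badly mispriced cars.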

In [26]:
# Beginning Model Training
models = {
    "Linear Regression": LinearRegression(),
    "Lasso": Lasso(),
    "Ridge": Ridge(),
    "K-Neighbors Regressor": KNeighborsRegressor(),
    "Decision Tree Regressor": DecisionTreeRegressor(),
    "Random Forest Regressor": RandomForestRegressor(),
    "AdaBoost Regressor": AdaBoostRegressor(),
    "Gradient Boosting Regressor": GradientBoostingRegressor(),
    "XGBoost Regressor": XGBRegressor()
}

for name, model in models.items():
    model.fit(X_train, y_train) # Train model

    # Make Predictions
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # Evaluate Train and Test Dataset
    model_train_mae, model_train_rmse, model_train_r2 = evaluate_model(y_train, y_train_pred)

    model_test_mae, model_test_rmse, model_test_r2 = evaluate_model(y_test, y_test_pred)

    print(name)

    print("Model Performance for Training set")
    print(" - Root Mean Squared Error: {:.4f}".format(model_train_rmse))
    print(" - Mean Absolute Error: {:.4f}".format(model_train_mae))
    print(" - R2 Score: {:.4f}".format(model_train_r2))

    print('----------------------------------')

    print("Model Performance for Testing set")
    print(" - Root Mean Squared Error: {:.4f}".format(model_test_rmse))
    print(" - Mean Absolute Error: {:.4f}".format(model_test_mae))
    print(" - R2 Score: {:.4f}".format(model_test_r2))

    print('='*35)
    print('\n')

Linear Regression
Model Performance for Training set
 - Root Mean Squared Error: 553855.6665
 - Mean Absolute Error: 268101.6071
 - R2 Score: 0.6218
----------------------------------
Model Performance for Testing set
 - Root Mean Squared Error: 502543.5930
 - Mean Absolute Error: 279618.5794
 - R2 Score: 0.6645


Lasso
Model Performance for Training set
 - Root Mean Squared Error: 553855.6710
 - Mean Absolute Error: 268099.2226
 - R2 Score: 0.6218
----------------------------------
Model Performance for Testing set
 - Root Mean Squared Error: 502542.6696
 - Mean Absolute Error: 279614.7461
 - R2 Score: 0.6645


Ridge
Model Performance for Training set
 - Root Mean Squared Error: 553856.3160
 - Mean Absolute Error: 268059.8015
 - R2 Score: 0.6218
----------------------------------
Model Performance for Testing set
 - Root Mean Squared Error: 502533.8230
 - Mean Absolute Error: 279557.2169
 - R2 Score: 0.6645


K-Neighbors Regressor
Model Performance for Training set
 - Root Mean Square

- **Note: The Random Forest Regressor, Gradient Boosting Regressor and XGBoost Regressor work well for this dataset.**

## **Hyperparameter Tuning**

In [30]:
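# Initialize a few parameters for hyperparameter tuning
# Note: "auto" was removed as a max_features option in scikit-learn 1.3. On such
# versions those candidates fail to fit, which appears consistent with the 0.0s
# fits in the search log. For RandomForestRegressor, None selects all features,
# which is what "auto" used to do.
rf_params = {
    "max_depth": [5, 8, 15, None, 10],
    "max_features": [5, 7, "auto", 8],
    "min_samples_split": [2, 8, 15, 20],
    "n_estimators": [100, 200, 500, 1000]
}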

xgboost_params = {
    "learning_rate": [0.1, 0.01],
    "max_depth": [5, 8, 12, 20, 30],
    "n_estimators": [100, 200, 300],
    "colsample_bytree": [0.5, 0.8, 1, 0.3, 0.4]
}

In [31]:
# Models list for Hyperparameter tuning
randomcv_models = [
    ("RF", RandomForestRegressor(), rf_params),
    ("XGboost", XGBRegressor(), xgboost_params)
]

In [32]:
# Hyperparameter Tuning - RandomizedSearchCV
model_params = {} 

for name, model, params in randomcv_models:
    randomcv = RandomizedSearchCV(
        estimator=model,
        param_distributions=params,
        n_iter=100,
        cv=3,
        verbose=2,
        n_jobs=-1
    )
    
    randomcv.fit(X_train, y_train)
    model_params[name] = randomcv.best_params_

for model_name in model_params:
    print(f"---------------- Best Params for {model_name} -------------------")
    print(model_params[model_name])


Fitting 3 folds for each of 100 candidates, totalling 300 fits
[CV] END max_depth=15, max_features=auto, min_samples_split=15, n_estimators=500; total time=   0.0s
[CV] END max_depth=15, max_features=auto, min_samples_split=15, n_estimators=500; total time=   0.0s
[CV] END max_depth=15, max_features=auto, min_samples_split=15, n_estimators=500; total time=   0.0s
[CV] END max_depth=10, max_features=7, min_samples_split=20, n_estimators=100; total time=   3.8s
[CV] END max_depth=10, max_features=7, min_samples_split=20, n_estimators=100; total time=   4.6s
[CV] END max_depth=10, max_features=7, min_samples_split=20, n_estimators=100; total time=   3.9s
[CV] END max_depth=5, max_features=auto, min_samples_split=15, n_estimators=500; total time=   0.0s
[CV] END max_depth=5, max_features=auto, min_samples_split=15, n_estimators=500; total time=   0.0s
[CV] END max_depth=5, max_features=auto, min_samples_split=15, n_estimators=500; total time=   0.0s
[CV] END max_depth=8, max_features=auto,

---------------- Best Params for RF -------------------

{'n_estimators': 100, 'min_samples_split': 2, 'max_features': 8, 'max_depth': 15}

---------------- Best Params for XGboost -------------------

{'n_estimators': 300, 'max_depth': 5, 'learning_rate': 0.1, 'colsample_bytree': 0.4}
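
The best parameters printed above are retyped by hand in the next cell. As an alternative, `RandomizedSearchCV` refits the best configuration on the whole training set by default and exposes it as `best_estimator_`, and passing `random_state` makes the sampled candidates reproducible. The sketch below shows this on synthetic data; the data from `make_regression`, the small grid `demo_params` and all `*_demo` names are assumptions made to keep the example fast.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic regression data standing in for the car dataset.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Deliberately small search space so the sketch runs in seconds.
demo_params = {
    "max_depth": [3, 5, None],
    "min_samples_split": [2, 8],
    "n_estimators": [10, 20],
}

search_demo = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions=demo_params,
    n_iter=4,          # sample 4 of the 12 possible combinations
    cv=3,
    random_state=0,    # fixes which combinations are sampled
)
search_demo.fit(Xd_train, yd_train)

# The refitted best model can be used directly, with no retyping of parameters.
best_model_demo = search_demo.best_estimator_
test_r2_demo = best_model_demo.score(Xd_test, yd_test)

print(search_demo.best_params_)
print(round(test_r2_demo, 4))
```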

In [33]:
# Retrain the models with best parameters
models = {
    "Random Forest Regressor": RandomForestRegressor(n_estimators=100, min_samples_split=2, max_features=8, max_depth=15),
    "XGboost Regressor": XGBRegressor(n_estimators=300, max_depth= 5, learning_rate= 0.1, colsample_bytree= 0.4)
}

for name, model in models.items():
    model.fit(X_train, y_train) # Train model

    # Make Predictions
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # Evaluate Train and Test Dataset
    model_train_mae, model_train_rmse, model_train_r2 = evaluate_model(y_train, y_train_pred)

    model_test_mae, model_test_rmse, model_test_r2 = evaluate_model(y_test, y_test_pred)

    print(name)

    print("Model Performance for Training set")
    print(" - Root Mean Squared Error: {:.4f}".format(model_train_rmse))
    print(" - Mean Absolute Error: {:.4f}".format(model_train_mae))
    print(" - R2 Score: {:.4f}".format(model_train_r2))

    print('----------------------------------')

    print("Model Performance for Testing set")
    print(" - Root Mean Squared Error: {:.4f}".format(model_test_rmse))
    print(" - Mean Absolute Error: {:.4f}".format(model_test_mae))
    print(" - R2 Score: {:.4f}".format(model_test_r2))

    print('='*35)
    print('\n')

Random Forest Regressor
Model Performance for Training set
 - Root Mean Squared Error: 136746.3961
 - Mean Absolute Error: 54296.7846
 - R2 Score: 0.9769
----------------------------------
Model Performance for Testing set
 - Root Mean Squared Error: 223614.0498
 - Mean Absolute Error: 99376.1262
 - R2 Score: 0.9336


XGboost Regressor
Model Performance for Training set
 - Root Mean Squared Error: 123586.9373
 - Mean Absolute Error: 78321.8203
 - R2 Score: 0.9812
----------------------------------
Model Performance for Testing set
 - Root Mean Squared Error: 275826.6057
 - Mean Absolute Error: 103221.3906
 - R2 Score: 0.8989




## **Summary**

After hyperparameter tuning, the Random Forest Regressor performed strongly on both the training and testing datasets, with an R2 score of 0.9769 on the training set and 0.9336 on the testing set. Its test errors (RMSE of about 223,614 and MAE of about 99,376) are the lower of the two tuned models, although the gap to its training errors (RMSE of about 136,746 and MAE of about 54,297) indicates some overfitting.

The XGboost Regressor fit the training set slightly better (R2 of 0.9812) but generalized worse: its test R2 is 0.8989 and its test RMSE is about 275,827, compared with about 223,614 for the Random Forest Regressor. This suggests that, with these hyperparameters, the Random Forest Regressor is better suited to this particular dataset.

Overall, the Random Forest Regressor appears to be the more effective model for predicting used car prices based on the given features.