# **Used Car Price Prediction**
## **1. Problem Statement**
* This dataset comprises used cars sold on cardekho.com in India, along with important features of these cars.
* The goal is to predict the price of a car from its input features.
* Prediction results can be used to suggest a price to new sellers based on market conditions.

## **2. Data Collection**
* The dataset was collected by scraping the cardekho website.
* The data consists of 13 columns and 15411 samples.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns 
import plotly.express as px

import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

In [2]:
df = pd.read_csv('cardekho_imputated.csv',index_col=[0])
df.head()

Unnamed: 0,car_name,brand,model,vehicle_age,km_driven,seller_type,fuel_type,transmission_type,mileage,engine,max_power,seats,selling_price
0,Maruti Alto,Maruti,Alto,9,120000,Individual,Petrol,Manual,19.7,796,46.3,5,120000
1,Hyundai Grand,Hyundai,Grand,5,20000,Individual,Petrol,Manual,18.9,1197,82.0,5,550000
2,Hyundai i20,Hyundai,i20,11,60000,Individual,Petrol,Manual,17.0,1197,80.0,5,215000
3,Maruti Alto,Maruti,Alto,9,37000,Individual,Petrol,Manual,20.92,998,67.1,5,226000
4,Ford Ecosport,Ford,Ecosport,6,30000,Dealer,Diesel,Manual,22.77,1498,98.59,5,570000


## **3. Data Cleaning**
### **Handling Missing Values**
* Handling missing values
* Handling Duplicates
* Check Datatypes
* Understand the Dataset
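
The duplicate and datatype checks listed above could look roughly like the sketch below. It runs on a small made-up frame (`toy`, a hypothetical stand-in for the real dataset), so the counts it prints say nothing about the cardekho data.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    "model": ["Alto", "Alto", "i20"],
    "km_driven": [120000, 120000, 60000],
    "selling_price": [120000, 120000, 215000],
})

# Count rows that are exact copies of an earlier row
n_duplicates = int(toy.duplicated().sum())

# Drop the copies and rebuild a clean index
deduped = toy.drop_duplicates().reset_index(drop=True)

# One dtype per column; object columns are the categorical candidates
dtypes = toy.dtypes

print(f"duplicate rows: {n_duplicates}")
print(dtypes)
```

Whether dropping exact duplicates is appropriate depends on the data: two identical listings may be two genuinely different cars.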

In [3]:
# Handling missing values
df.isnull().sum()

car_name             0
brand                0
model                0
vehicle_age          0
km_driven            0
seller_type          0
fuel_type            0
transmission_type    0
mileage              0
engine               0
max_power            0
seats                0
selling_price        0
dtype: int64

In [4]:
# Handling duplicates: preview the data to spot redundant columns
df.head()

Unnamed: 0,car_name,brand,model,vehicle_age,km_driven,seller_type,fuel_type,transmission_type,mileage,engine,max_power,seats,selling_price
0,Maruti Alto,Maruti,Alto,9,120000,Individual,Petrol,Manual,19.7,796,46.3,5,120000
1,Hyundai Grand,Hyundai,Grand,5,20000,Individual,Petrol,Manual,18.9,1197,82.0,5,550000
2,Hyundai i20,Hyundai,i20,11,60000,Individual,Petrol,Manual,17.0,1197,80.0,5,215000
3,Maruti Alto,Maruti,Alto,9,37000,Individual,Petrol,Manual,20.92,998,67.1,5,226000
4,Ford Ecosport,Ford,Ecosport,6,30000,Dealer,Diesel,Manual,22.77,1498,98.59,5,570000


* The brand and model features together carry the same information as the car_name feature.
* We'll drop car_name and brand because their information is repeated elsewhere, and keep only the model feature because it is the more informative one.
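
One way to check that claim before dropping anything is to rebuild car_name from brand and model and compare. The sketch below uses three rows copied from the preview above; the names `toy`, `rebuilt` and `is_redundant` are illustrative, and the result on the full dataset may differ.

```python
import pandas as pd

# Three rows taken from the df.head() preview
toy = pd.DataFrame({
    "car_name": ["Maruti Alto", "Hyundai i20", "Ford Ecosport"],
    "brand": ["Maruti", "Hyundai", "Ford"],
    "model": ["Alto", "i20", "Ecosport"],
})

# Rebuild car_name from its two parts and compare row by row
rebuilt = toy["brand"] + " " + toy["model"]
is_redundant = bool((rebuilt == toy["car_name"]).all())

print(f"car_name == brand + ' ' + model for every row: {is_redundant}")
```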


In [5]:
# Drop car_name and brand features
df.drop(['car_name','brand'],axis=1,inplace=True)
df[:3]

Unnamed: 0,model,vehicle_age,km_driven,seller_type,fuel_type,transmission_type,mileage,engine,max_power,seats,selling_price
0,Alto,9,120000,Individual,Petrol,Manual,19.7,796,46.3,5,120000
1,Grand,5,20000,Individual,Petrol,Manual,18.9,1197,82.0,5,550000
2,i20,11,60000,Individual,Petrol,Manual,17.0,1197,80.0,5,215000


In [6]:
df.model.value_counts()

model
i20            906
Swift Dzire    890
Swift          781
Alto           778
City           757
              ... 
Ghibli           1
Altroz           1
GTC4Lusso        1
Aura             1
Gurkha           1
Name: count, Length: 120, dtype: int64

In [7]:
len(df.model.unique())

120

In [8]:
num_features = [feature for feature in df.columns if df[feature].dtype != 'O']
print(f"Number of numerical feature : {len(num_features)}\n{num_features}")
cat_features = [feature for feature in df.columns if df[feature].dtype=='O']
print(f"Number of categorical feature : {len(cat_features)}\n{cat_features}")
discrete_features = [feature  for feature in num_features if len(df[feature].unique())<=25]
print(f"Number of discrete feature : {len(discrete_features)}\n{discrete_features}")
continuous_features = [feature for feature in num_features if feature not in discrete_features]
print(f"Number of continuous feature : {len(continuous_features)}\n{continuous_features}")

Number of numerical feature : 7
['vehicle_age', 'km_driven', 'mileage', 'engine', 'max_power', 'seats', 'selling_price']
Number of categorical feature : 4
['model', 'seller_type', 'fuel_type', 'transmission_type']
Number of discrete feature : 2
['vehicle_age', 'seats']
Number of continuous feature : 5
['km_driven', 'mileage', 'engine', 'max_power', 'selling_price']


## **4. Split The Data Into Independent And Dependent Features**

In [9]:
# Split the data into independent features (X) and the dependent target (y)
X = df.drop('selling_price',axis=1)
y = df['selling_price']

## **5. Encoding the model category label**

In [10]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
X['model'] = le.fit_transform(X['model'])
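
To make the encoding step concrete, here is a small sketch of what `LabelEncoder` produces. The list `models_seen` is made up for illustration. The integer codes follow the sorted order of the labels and carry no ranking, which tree-based models tolerate better than linear ones.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical handful of model names
models_seen = ["Alto", "i20", "Swift", "Alto"]

le = LabelEncoder()
codes = le.fit_transform(models_seen)

# classes_ holds the labels in sorted order; the position is the code
mapping = {str(label): int(code) for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes
restored = list(le.inverse_transform(codes))
```

Note that `transform` raises an error for a label the encoder has not seen during fitting, so an encoder fitted this way cannot score a car model that was absent from the data.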


## **6. Train Test Split (Split the data into training and test set)**

In [11]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)
X_train

Unnamed: 0,model,vehicle_age,km_driven,seller_type,fuel_type,transmission_type,mileage,engine,max_power,seats
14238,108,7,70252,Dealer,Diesel,Automatic,11.20,2400,215.00,5
1731,91,2,10000,Individual,Petrol,Manual,23.84,1199,84.00,5
13218,17,2,6000,Dealer,Diesel,Automatic,19.00,1950,241.30,5
403,25,7,63000,Dealer,Petrol,Manual,17.80,1497,117.30,5
13550,117,10,80292,Dealer,Petrol,Manual,20.36,1197,78.90,5
...,...,...,...,...,...,...,...,...,...,...
6581,42,7,127731,Dealer,Diesel,Manual,20.77,1248,88.80,7
17029,95,11,59000,Dealer,Petrol,Manual,16.09,1598,103.20,5
6839,100,7,20000,Individual,Petrol,Manual,20.51,998,67.04,5
1104,118,2,15000,Dealer,Petrol,Manual,18.60,1197,81.86,5


In [12]:
y_train

14238    1825000
1731      515000
13218    7500000
403       435000
13550     200000
          ...   
6581      665000
17029     249000
6839      250000
1104      620000
9295      960000
Name: selling_price, Length: 12328, dtype: int64

## **7. Feature Encoding and Scaling** 

In [13]:
X_train.model.unique()

array([108,  91,  17,  25, 117,  13,  87,  54,  30,  64,  62,  39,  44,
        89,  32,  38,  23,   0, 114, 111,  98,   7, 103,  97,  56,  90,
       119,  84,  49,  77,  24,  99, 118, 100,  85,  47,  14,  10,  96,
        70,  45,  73,   1,   4,  16,   2,  42,  60,  88,  36,  58,  78,
        11,  61,  80,  59,   3,  20,  41,  95,  65,  71,  46,  79,  83,
        19, 106,  81,  68,  26,  40, 112, 113, 101,  63,  92,  57, 104,
        37,  22,  94,  27,  34,  86,   5,  74, 116, 107,  35,  29,  53,
       102,   6,  93,  72,   9, 109, 115,   8,  48,  12,  33,  67,  55,
        31,  50,  28,  51,  43,  18, 110, 105,  21,  82,  76,  66,  75,
        52])

In [14]:
len(df.model.unique())

120

In [15]:
len(df.seller_type.unique()),len(df.fuel_type.unique()),len(df.transmission_type.unique())


(3, 5, 2)

In [16]:
from sklearn.preprocessing import LabelEncoder,OneHotEncoder,StandardScaler
from sklearn.compose import ColumnTransformer
one_hot_columns = ['seller_type','fuel_type','transmission_type']
num_features = X_train.select_dtypes(exclude='object').columns
preprocessor = ColumnTransformer(
    [
        ('OneHotEncoder', OneHotEncoder(drop='first'), one_hot_columns),
        ('StandardScaler',StandardScaler(),num_features)
    ],
    remainder='passthrough'
)
X_train = preprocessor.fit_transform(X_train)
X_test = preprocessor.transform(X_test)

In [17]:
X_train = pd.DataFrame(X_train)
X_test =pd.DataFrame(X_test)

In [18]:
X_train.head(),X_test.head()

(    0    1    2    3    4    5    6         7         8         9         10  \
 0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  1.255968  0.323969  0.349100 -2.050819   
 1  1.0  0.0  0.0  0.0  0.0  1.0  1.0  0.789199 -1.337798 -1.069394  0.985661   
 2  0.0  0.0  1.0  0.0  0.0  0.0  0.0 -1.242618 -1.337798 -1.163564 -0.177042   
 3  0.0  0.0  0.0  0.0  0.0  1.0  1.0 -1.022962  0.323969  0.178369 -0.465315   
 4  0.0  0.0  0.0  0.0  0.0  1.0  1.0  1.503081  1.321030  0.585469  0.149668   
 
          11        12        13  
 0  1.756765  2.681685 -0.403824  
 1 -0.547081 -0.382744 -0.403824  
 2  0.893542  3.296910 -0.403824  
 3  0.024564  0.396229 -0.403824  
 4 -0.550917 -0.502047 -0.403824  ,
     0    1    2    3    4    5    6         7         8         9         10  \
 0  0.0  0.0  0.0  0.0  0.0  1.0  1.0  1.503081  1.985737  0.413795  0.149668   
 1  1.0  0.0  1.0  0.0  0.0  0.0  1.0 -1.352446 -0.673091  0.060655  1.838470   
 2  0.0  0.0  1.0  0.0  0.0  0.0  1.0 -0.556193  0.323969 

## **8. Model Training, Model Evaluation And Model Selection**

In [19]:
# Importing models
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor

# Import model evaluation functions
from sklearn.metrics import r2_score,mean_absolute_error,mean_squared_error

In [20]:
def evaluate_model(y_test,y_pred):
    """
    Evaluate model performance using various metrics.
    
    Parameters:
    - y_test: array-like
        True target values.
    - y_pred: array-like
        Predicted values from the model.

    Returns:
    - r2: float
        R^2 score.
    - mae: float
        Mean Absolute Error.
    - mse: float
        Mean Squared Error.
    - rmse: float
        Root Mean Squared Error.
    """
    r2 = r2_score(y_test,y_pred)
    mae = mean_absolute_error(y_test,y_pred)
    mse = mean_squared_error(y_test,y_pred)
    rmse = np.sqrt(mse)

    return r2,mae,mse,rmse 
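
As a worked example of the four metrics, the sketch below computes them on three made-up prices. The arrays `y_true` and `y_pred` are illustrative and unrelated to the dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up true and predicted values
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

# Errors are 10, -10 and 30
mae = mean_absolute_error(y_true, y_pred)   # (10 + 10 + 30) / 3
mse = mean_squared_error(y_true, y_pred)    # (100 + 100 + 900) / 3
rmse = np.sqrt(mse)                         # same unit as the target
r2 = r2_score(y_true, y_pred)               # 1 - 1100 / 20000

print(f"MAE={mae:.2f}  MSE={mse:.2f}  RMSE={rmse:.2f}  R2={r2:.3f}")
```

RMSE penalises the single large error (30) more heavily than MAE does, which is why it comes out higher than MAE here.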

In [21]:
def train_models(models : dict, X_train, X_test, y_train, y_test):
    """
    Train multiple models and evaluate their performance.

    Parameters:
    - models: dict
        A dictionary where keys are model names and values are model instances.
    - X_train: array-like or pd.DataFrame
        Feature matrix for training data.
    - y_train: array-like or pd.Series
        Target vector for training data.
    - X_test: array-like or pd.DataFrame
        Feature matrix for test data.
    - y_test: array-like or pd.Series
        Target vector for test data.
    """
    for name, model in models.items():
        model.fit(X_train,y_train)
    
        # Make Predictions
        y_train_pred = model.predict(X_train)
        y_test_pred =model.predict(X_test)
    
        model_train_r2, model_train_mae, model_train_mse, model_train_rmse = evaluate_model(y_train,y_train_pred)
        model_test_r2, model_test_mae, model_test_mse, model_test_rmse = evaluate_model(y_test,y_test_pred)
        
        print(name)
        print("-----------------------")
        print('Model Performance on Training set')
        print("-----------------------------")
        print("- R2 Score : {:.2f}".format(model_train_r2))
        print("- Mean Absolute Error : {:.2f}".format(model_train_mae))
        print("- Mean Squared Error : {:.2f}".format(model_train_mse))
        print("- Root Mean Squared Error : {:.2f}".format(model_train_rmse))
    
        print('Model Performance on Test set')
        print("-----------------------------")
        print("- R2 Score : {:.2f}".format(model_test_r2))
        print("- Mean Absolute Error : {:.2f}".format(model_test_mae))
        print("- Mean Squared Error : {:.2f}".format(model_test_mse))
        print("- Root Mean Squared Error : {:.2f}".format(model_test_rmse))
        print("--------------------------------------------------------------")
        

In [22]:
models = {

    'Random Forest Regressor' : RandomForestRegressor(),
    'Gradient Boosting Regressor' : GradientBoostingRegressor(),

}

In [24]:
train_models(models,X_train, X_test, y_train, y_test)

Random Forest Regressor
-----------------------
Model Performance on Training set
-----------------------------
- R2 Score : 0.98
- Mean Absolute Error : 39806.50
- Mean Squared Error : 17640391740.29
- Root Mean Squared Error : 132817.14
Model Performance on Test set
-----------------------------
- R2 Score : 0.93
- Mean Absolute Error : 101384.30
- Mean Squared Error : 50678605789.59
- Root Mean Squared Error : 225119.09
--------------------------------------------------------------
Gradient Boosting Regressor
-----------------------
Model Performance on Training set
-----------------------------
- R2 Score : 0.95
- Mean Absolute Error : 111709.56
- Mean Squared Error : 42002252335.71
- Root Mean Squared Error : 204944.51
Model Performance on Test set
-----------------------------
- R2 Score : 0.91
- Mean Absolute Error : 126637.51
- Mean Squared Error : 65987544445.64
- Root Mean Squared Error : 256880.41
--------------------------------------------------------------


**We'll select both the Gradient Boosting and Random Forest models for hyperparameter tuning because they perform well on the test data.**
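
A single train/test split gives one score per model, so small differences can be noise. A possible complement, sketched below under the assumption that synthetic data from `make_regression` is an acceptable stand-in, is to compare the two models with k-fold cross-validation. The names `X_demo`, `y_demo` and `candidates` are hypothetical, and the scores printed are not the notebook's results.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression data, used only to keep the sketch self-contained
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

candidates = {
    "Random Forest Regressor": RandomForestRegressor(n_estimators=50, random_state=42),
    "Gradient Boosting Regressor": GradientBoostingRegressor(n_estimators=50, random_state=42),
}

# One R2 score per fold for each model
cv_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=3, scoring="r2")
    for name, model in candidates.items()
}

for name, scores in cv_scores.items():
    print(f"{name}: mean R2 = {scores.mean():.3f} (+/- {scores.std():.3f})")
```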

## **9. Hyperparameter Tuning**

In [26]:
# Initialize Hyperparameters for Tuning
rf_params = {
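    'max_depth' : [None, 5, 8, 10, 15],  # None (not the string 'None') means unlimited depth
    'max_features' : [1.0, 5, 7, 8],  # 'auto' was removed in scikit-learn 1.3; 1.0 (all features) is its equivalent for regressors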
    'min_samples_split' : [2, 8, 15, 20],
    'n_estimators' : [100, 200, 500, 1000]
}
gradient_boost_params = {
    'loss' : ['squared_error', 'huber', 'absolute_error'],
    'criterion': ['friedman_mse', 'squared_error'],  # 'mse' was the old alias of 'squared_error' and is no longer accepted
    'n_estimators': [100,200,500],
    'min_samples_split':[2,8,15,20],
    'max_depth': [5,8,15,None,10],
    'learning_rate' : [0.1, 0.01, 0.001]
}

In [27]:
random_cv_models = [
    ('Random Forest' ,RandomForestRegressor(),rf_params),
    ('Gradient Boosting Regressor' ,GradientBoostingRegressor(),gradient_boost_params)
]

In [28]:
from sklearn.model_selection import RandomizedSearchCV

model_params = {}
for name, model, params in random_cv_models:
    random = RandomizedSearchCV(
        estimator=model,
        param_distributions=params,
        n_iter=100,
        cv=3,
        verbose=2,
        n_jobs=-1,
    )
    random.fit(X_train,y_train)
    model_params[name] = random.best_params_

Fitting 3 folds for each of 100 candidates, totalling 300 fits
Fitting 3 folds for each of 100 candidates, totalling 300 fits


In [29]:
for model_name in model_params:
    print(f"Best Params for {model_name}:\n{model_params[model_name]}\n")

Best Params for Random Forest:
{'n_estimators': 100, 'min_samples_split': 2, 'max_features': 7, 'max_depth': 15}

Best Params for Gradient Boosting Regressor:
{'n_estimators': 100, 'min_samples_split': 2, 'max_depth': 8, 'loss': 'huber', 'learning_rate': 0.1, 'criterion': 'friedman_mse'}
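
Instead of copying the best parameters by hand into a new estimator, the fitted search object can be reused directly. The sketch below is a minimal illustration on synthetic data with a deliberately small, hypothetical grid (`demo_params`), assuming the default `refit=True`. It also fixes `random_state` so the sampled candidates are reproducible.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data, used only to keep the sketch self-contained
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)

# Hypothetical, intentionally small search space
demo_params = {
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 8],
    "n_estimators": [20, 50],
}

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_distributions=demo_params,
    n_iter=5,
    cv=3,
    scoring="r2",
    random_state=42,  # makes the sampled candidates reproducible
    n_jobs=1,
)
search.fit(X_demo, y_demo)

# With refit=True (the default), best_estimator_ is already retrained
# on all of X_demo using best_params_
best_model = search.best_estimator_
print(search.best_params_, round(search.best_score_, 3))
```

Using `best_estimator_` avoids transcription slips between the search output and the retrained model.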



In [33]:
# Retrain the models with tuned hyperparameters (values set by hand; several differ from the search output above)
models = {
    'Random Forest Regressor' :RandomForestRegressor(n_estimators=100,
                                                     min_samples_split=2,
                                                     max_features=8,
                                                     max_depth=10),
    'Gradient Boosting':GradientBoostingRegressor(n_estimators=200,
                                                  min_samples_split=8,
                                                  max_depth=10,
                                                  loss = 'huber',
                                                  criterion = 'squared_error',
                                                  learning_rate=0.1),
    
}
train_models(models,X_train, X_test, y_train, y_test)

Random Forest Regressor
-----------------------
Model Performance on Training set
-----------------------------
- R2 Score : 0.96
- Mean Absolute Error : 83946.50
- Mean Squared Error : 30027877192.73
- Root Mean Squared Error : 173285.54
Model Performance on Test set
-----------------------------
- R2 Score : 0.93
- Mean Absolute Error : 107505.29
- Mean Squared Error : 54719141969.61
- Root Mean Squared Error : 233921.23
--------------------------------------------------------------
Gradient Boosting
-----------------------
Model Performance on Training set
-----------------------------
- R2 Score : 0.99
- Mean Absolute Error : 38296.09
- Mean Squared Error : 4508298281.92
- Root Mean Squared Error : 67143.86
Model Performance on Test set
-----------------------------
- R2 Score : 0.93
- Mean Absolute Error : 98384.41
- Mean Squared Error : 55532319509.38
- Root Mean Squared Error : 235652.96
--------------------------------------------------------------
