# CAR DEKHO PROJECT

Problem Statement:
Used car prices in India vary widely with the make and model of the car, its mileage, its condition, and current market conditions. As a result, sellers often find it difficult to price their cars accurately.

Approach:
We develop a machine learning model that predicts the price of a used car from its features. The model is trained on a dataset of used cars that have been sold on Cardekho.com in India. Once trained, it can estimate the price of any used car, given its features.

Objective:
To build a suitable machine learning model for used car price prediction.

Benefits:
The benefits of this solution include:

- Sellers will be able to price their cars more accurately, which will help them sell faster and for a higher price.
- Buyers will be able to find cars that are priced more competitively.
- The overall used car market in India will become more efficient.

Project Summary: Used Car Price Prediction

In this project, we developed a machine learning model to predict used car prices in India using data from Cardekho.com. The model addresses the challenges of pricing in a volatile market influenced by factors such as mileage, engine specifications, and car condition.

Key Steps:

Data Cleaning & Preprocessing:

- Extracted numerical values from columns containing units (e.g., mileage, engine, max_power) using regular expressions.
- Removed the redundant 'car_name' column and dropped rows with missing values to ensure data consistency.
- Applied a log transformation to the target variable (selling_price) to reduce skewness.

Feature Engineering & Transformation:

- Defined numerical features (vehicle_age, km_driven, mileage, engine, max_power, seats) and categorical features (brand, model, seller_type, fuel_type, transmission_type).
- Constructed a preprocessing pipeline using a ColumnTransformer to scale numerical features (via StandardScaler) and one-hot encode categorical features (via OneHotEncoder).

Model Building:

- Integrated the preprocessing steps with a RandomForestRegressor into a single pipeline.
- Trained the model on 80% of the data and reserved 20% for testing. On the test set the model reached an RMSE of about 0.18 (on the log-price scale) and an R² of about 0.935.

Evaluation & Prediction:

- Evaluated the model using RMSE and R² score.
- Demonstrated the model on a sample used car (a 3-year-old petrol Hyundai i20), for which it predicted a price of about ₹690,000.

Benefits:

- Sellers can price their cars more accurately, reducing the time to sale and potentially achieving higher prices.
- Buyers gain access to more competitively priced vehicles, improving market efficiency.

This approach simplifies the pricing process and contributes to a more transparent and efficient used car market in India.

In [1]:
# Import the required libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import re

In [2]:
# loading dataset
data = pd.read_csv('Cardekho.csv')

In [3]:
data.head()

,car_name,brand,model,vehicle_age,km_driven,seller_type,fuel_type,transmission_type,mileage,engine,max_power,seats,selling_price
0,Maruti Alto,Maruti,Alto,9,120000,Individual,Petrol,Manual,19.7,796,46.3,5,120000
1,Hyundai Grand,Hyundai,Grand,5,20000,Individual,Petrol,Manual,18.9,1197,82.0,5,550000
2,Hyundai i20,Hyundai,i20,11,60000,Individual,Petrol,Manual,17.0,1197,80.0,5,215000
3,Maruti Alto,Maruti,Alto,9,37000,Individual,Petrol,Manual,20.92,998,67.1,5,226000
4,Ford Ecosport,Ford,Ecosport,6,30000,Dealer,Diesel,Manual,22.77,1498,98.59,5,570000


In [4]:
# To check number of rows and columns
data.shape

(15411, 13)

In [5]:
# To get the overview of data quickly
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15411 entries, 0 to 15410
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   car_name           15411 non-null  object 
 1   brand              15411 non-null  object 
 2   model              15411 non-null  object 
 3   vehicle_age        15411 non-null  int64  
 4   km_driven          15411 non-null  int64  
 5   seller_type        15411 non-null  object 
 6   fuel_type          15411 non-null  object 
 7   transmission_type  15411 non-null  object 
 8   mileage            15411 non-null  float64
 9   engine             15411 non-null  int64  
 10  max_power          15411 non-null  float64
 11  seats              15411 non-null  int64  
 12  selling_price      15411 non-null  int64  
dtypes: float64(2), int64(5), object(6)
memory usage: 1.5+ MB


In [6]:
# Summary statistics for data
data.describe()

,vehicle_age,km_driven,mileage,engine,max_power,seats,selling_price
count,15411.0,15411.0,15411.0,15411.0,15411.0,15411.0,15411.0
mean,6.036338,55616.48,19.701151,1486.057751,100.588254,5.325482,774971.1
std,3.013291,51618.55,4.171265,521.106696,42.972979,0.807628,894128.4
min,0.0,100.0,4.0,793.0,38.4,0.0,40000.0
25%,4.0,30000.0,17.0,1197.0,74.0,5.0,385000.0
50%,6.0,50000.0,19.67,1248.0,88.5,5.0,556000.0
75%,8.0,70000.0,22.7,1582.0,117.3,5.0,825000.0
max,29.0,3800000.0,33.54,6592.0,626.0,9.0,39500000.0


In [8]:
# Flag duplicate rows (True marks a row identical to an earlier one)
data.duplicated()

0        False
1        False
2        False
3        False
4        False
         ...  
15406    False
15407    False
15408    False
15409    False
15410    False
Length: 15411, dtype: bool
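
`data.duplicated()` returns one boolean per row, so the truncated output above does not by itself show whether any duplicates exist. A minimal sketch of how the result could be summarized, using a small made-up frame (`toy` is hypothetical, not the Cardekho data):

```python
import pandas as pd

# Hypothetical toy frame; the third row repeats the first.
toy = pd.DataFrame({
    "brand": ["Maruti", "Hyundai", "Maruti"],
    "model": ["Alto", "i20", "Alto"],
    "km_driven": [120000, 60000, 120000],
})

# Summing the boolean mask counts rows identical to an earlier row.
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates keeps the first occurrence of each repeated row.
deduplicated = toy.drop_duplicates()

print(n_duplicates)
print(deduplicated.shape)
```

On the real data, the same two calls on `data` would report the duplicate count and, if desired, remove the repeats before modeling.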

In [9]:
# Check the 'car_name' column for invalid values
data['car_name'].unique()

array(['Maruti Alto', 'Hyundai Grand', 'Hyundai i20', 'Ford Ecosport',
       'Maruti Wagon R', 'Hyundai i10', 'Hyundai Venue', 'Maruti Swift',
       'Hyundai Verna', 'Renault Duster', 'Mini Cooper', 'Maruti Ciaz',
       'Mercedes-Benz C-Class', 'Toyota Innova', 'Maruti Baleno',
       'Maruti Swift Dzire', 'Volkswagen Vento', 'Hyundai Creta',
       'Honda City', 'Mahindra Bolero', 'Toyota Fortuner', 'Renault KWID',
       'Honda Amaze', 'Hyundai Santro', 'Mahindra XUV500',
       'Mahindra KUV100', 'Maruti Ignis', 'Datsun RediGO',
       'Mahindra Scorpio', 'Mahindra Marazzo', 'Ford Aspire', 'Ford Figo',
       'Maruti Vitara', 'Tata Tiago', 'Volkswagen Polo', 'Kia Seltos',
       'Maruti Celerio', 'Datsun GO', 'BMW 5', 'Honda CR-V',
       'Ford Endeavour', 'Mahindra KUV', 'Honda Jazz', 'BMW 3', 'Audi A4',
       'Tata Tigor', 'Maruti Ertiga', 'Tata Safari', 'Mahindra Thar',
       'Tata Hexa', 'Land Rover Rover', 'Maruti Eeco', 'Audi A6',
       'Mercedes-Benz E-Class', 'Audi Q7'

In [10]:
# Checking number of unique values
data['car_name'].nunique()

121

In [11]:
# Check the 'fuel_type' column for invalid values
data['fuel_type'].unique()

array(['Petrol', 'Diesel', 'CNG', 'LPG', 'Electric'], dtype=object)

In [12]:
# Check the 'seller_type' column for invalid values
data['seller_type'].unique()

array(['Individual', 'Dealer', 'Trustmark Dealer'], dtype=object)

In [13]:
# Check the 'transmission_type' column for invalid values
data['transmission_type'].unique()

array(['Manual', 'Automatic'], dtype=object)

In [15]:
# No invalid values were found in the categorical columns checked above.
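
The checks above cover only categorical columns. The `data.describe()` output earlier reports a minimum of 0 for `seats`, so the numeric columns may deserve a similar look. A rough sketch on a made-up frame follows; the rule "a car has at least one seat" is an assumption made for illustration and is not part of the original analysis:

```python
import pandas as pd

# Hypothetical toy frame with one implausible seat count.
toy = pd.DataFrame({
    "seats": [5, 0, 7],
    "km_driven": [120000, 20000, 60000],
})

# Assumed plausibility rule: every car has at least one seat.
suspicious = toy[toy["seats"] < 1]

print(len(suspicious))
print(suspicious.index.tolist())
```

Whether such rows should be dropped, corrected, or kept is a judgment call that depends on how the values arose.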

In [14]:
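# Data Cleaning and Preprocessing
def extract_numeric(value):
    """Return the first number found in value as a float, or NaN if there is none."""
    if pd.isna(value):
        return np.nan
    # Match an integer or decimal such as "1197" or "19.7"
    numbers = re.findall(r"\d+\.?\d*", str(value))
    return float(numbers[0]) if numbers else np.nan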

In [17]:
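# Strip units from numerical columns. data.info() above shows these columns
# are already numeric, so the values are unchanged here (engine becomes float).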
data['mileage'] = data['mileage'].apply(extract_numeric)
data['engine'] = data['engine'].apply(extract_numeric)
data['max_power'] = data['max_power'].apply(extract_numeric)
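
To illustrate what `extract_numeric` does, here is a self-contained sketch that repeats the function from the cell above and applies it to a few inputs. The unit strings (`'19.7 kmpl'`, `'1197 CC'`) are hypothetical examples of what a raw export might contain; per `data.info()`, the columns loaded here are already numeric.

```python
import re

import numpy as np
import pandas as pd


def extract_numeric(value):
    """Return the first number found in value as a float, or NaN."""
    if pd.isna(value):
        return np.nan
    numbers = re.findall(r"\d+\.?\d*", str(value))
    return float(numbers[0]) if numbers else np.nan


# Hypothetical inputs: strings with units, a plain float, a missing value,
# and a string containing no digits at all.
raw = pd.Series(["19.7 kmpl", "1197 CC", 82.0, None, "unknown"])
cleaned = raw.apply(extract_numeric)

print(cleaned.tolist())
```

Note that the pattern ignores signs and exponents, so it is only suitable for plain non-negative values like the ones in this dataset.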

In [19]:
# Drop redundant column
data.drop('car_name', axis=1, inplace=True)

In [21]:
# Handle missing values
data = data.dropna()

In [24]:
# Define features and target
X = data.drop('selling_price', axis=1)
y = np.log(data['selling_price'])  # Log transform for skewed target
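
The effect of the log transform can be checked with `Series.skew()`. The sketch below uses only six prices (the five shown by `data.head()` plus the maximum from `data.describe()`), so the numbers are illustrative rather than representative of the full dataset. It also shows that `np.exp` inverts the transform, which is what the prediction step relies on.

```python
import numpy as np
import pandas as pd

# Five prices from data.head() plus the maximum from data.describe().
prices = pd.Series(
    [120000, 550000, 215000, 226000, 570000, 39500000], dtype=float
)
log_prices = np.log(prices)

# Sample skewness before and after the transform.
skew_raw = prices.skew()
skew_log = log_prices.skew()

# np.exp undoes np.log, recovering prices in rupees.
recovered = np.exp(log_prices)

print(f"skew raw: {skew_raw:.2f}, skew log: {skew_log:.2f}")
```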

In [25]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [26]:
# Preprocessing pipeline
categorical_features = ['brand', 'model', 'seller_type', 'fuel_type', 'transmission_type']
numerical_features = ['vehicle_age', 'km_driven', 'mileage', 'engine', 'max_power', 'seats']

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ])
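
As a sketch of what this kind of preprocessor produces, the example below fits a reduced version on a made-up frame with one numerical and one categorical column. The frames `train` and `new` and the name `toy_preprocessor` are hypothetical, and `sparse_threshold=0` is set only to force a dense array that is easy to inspect; the notebook's own preprocessor uses the default.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training rows and one new row with an unseen fuel type.
train = pd.DataFrame({
    "km_driven": [120000, 20000, 60000],
    "fuel_type": ["Petrol", "Diesel", "Petrol"],
})
new = pd.DataFrame({"km_driven": [30000], "fuel_type": ["CNG"]})

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["km_driven"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["fuel_type"]),
    ],
    sparse_threshold=0,  # always return a dense array (for inspection only)
)

# One scaled column plus one indicator column per category seen in training.
train_matrix = toy_preprocessor.fit_transform(train)

# With handle_unknown="ignore", the unseen "CNG" encodes as all zeros.
new_matrix = toy_preprocessor.transform(new)

print(train_matrix.shape)
print(new_matrix)
```

The all-zeros encoding means an unseen brand or model does not raise an error at prediction time, but the model then has no category information for that row.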


In [27]:
# Create pipeline with Random Forest Regressor
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', RandomForestRegressor(n_estimators=100, random_state=42))
])
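
The notebook applies `np.log` to the target by hand and `np.exp` to predictions later. An alternative, not used in the original, is scikit-learn's `TransformedTargetRegressor`, which stores both functions with the model so predictions come back in rupees directly. A minimal sketch on synthetic data (`X_toy` and `price_toy` are made up for illustration):

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import RandomForestRegressor

# Synthetic features and strictly positive "prices".
rng = np.random.default_rng(42)
X_toy = rng.uniform(0, 10, size=(200, 2))
price_toy = np.exp(10 + 0.3 * X_toy[:, 0] - 0.1 * X_toy[:, 1])

# The wrapper fits the forest on log(price) and applies exp to predictions.
wrapped = TransformedTargetRegressor(
    regressor=RandomForestRegressor(n_estimators=20, random_state=42),
    func=np.log,
    inverse_func=np.exp,
)
wrapped.fit(X_toy, price_toy)

pred = wrapped.predict(X_toy[:3])  # already on the original price scale
print(pred)
```

In the notebook's setting, the `regressor` argument could be the full preprocessing pipeline, which would remove the need to remember the `np.exp` step at prediction time.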


In [28]:
# Train the model
model.fit(X_train, y_train)

In [29]:
# Evaluate the model
# Note: y is log(selling_price), so the RMSE below is in log units, not rupees
y_pred = model.predict(X_test)

print(f'RMSE: {np.sqrt(mean_squared_error(y_test, y_pred))}')
print(f'R² Score: {r2_score(y_test, y_pred)}')

RMSE: 0.18011704781362217
R² Score: 0.9350697695938412
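
Because the target is `log(selling_price)`, the RMSE above is in log units. An error of 0.18 on that scale corresponds to a multiplicative error of roughly exp(0.18) ≈ 1.20, or about 20% of the price. To report an error in rupees, both the true and predicted values can be exponentiated first. A small sketch with made-up values (the arrays below are illustrative, not model output):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true and predicted prices, stored on the log scale
# the same way y_test and y_pred are in the notebook.
y_true_log = np.log(np.array([120000.0, 550000.0, 215000.0]))
y_pred_log = np.log(np.array([130000.0, 500000.0, 215000.0]))

# RMSE in log units (what the notebook reports).
rmse_log = np.sqrt(mean_squared_error(y_true_log, y_pred_log))

# RMSE in rupees, after undoing the log transform.
rmse_rupees = np.sqrt(
    mean_squared_error(np.exp(y_true_log), np.exp(y_pred_log))
)

print(f"RMSE (log scale): {rmse_log:.4f}")
print(f"RMSE (rupees): {rmse_rupees:.2f}")
```

The rupee-scale RMSE is dominated by expensive cars, so the two numbers answer different questions and are worth reporting side by side.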


In [30]:
# Example prediction
sample_data = {
    'brand': ['Hyundai'],
    'model': ['i20'],
    'vehicle_age': [3],
    'km_driven': [25000],
    'seller_type': ['Individual'],
    'fuel_type': ['Petrol'],
    'transmission_type': ['Manual'],
    'mileage': [18.6],
    'engine': [1197],
    'max_power': [81.83],
    'seats': [5]
}

sample_df = pd.DataFrame(sample_data)
predicted_price = np.exp(model.predict(sample_df))
print(f'\nPredicted Price: ₹{predicted_price[0]:.2f}')


Predicted Price: ₹689982.86
