# PriceTrack: Unlocking Car Market Insights

PriceTrack is a data science project designed to predict the valuation of second-hand cars based on key input parameters.
Leveraging a regression model, it provides data-driven insights to help buyers and sellers make informed decisions.

We will be using some common Python libraries, such as pandas, numpy, seaborn, and matplotlib.

In [414]:
import pandas as pd

## Importing Dataset

We used the Quikr car dataset, which comes from Quikr (https://quikr.com), to carry out exploratory data analysis.

The dataset includes 892 cars (rows) and 6 features (columns).

In [415]:
df=pd.read_csv("quikr_car.csv")
df.head()

Unnamed: 0,name,company,year,Price,kms_driven,fuel_type
0,Hyundai Santro Xing XO eRLX Euro III,Hyundai,2007,80000,"45,000 kms",Petrol
1,Mahindra Jeep CL550 MDI,Mahindra,2006,425000,40 kms,Diesel
2,Maruti Suzuki Alto 800 Vxi,Maruti,2018,Ask For Price,"22,000 kms",Petrol
3,Hyundai Grand i10 Magna 1.2 Kappa VTVT,Hyundai,2014,325000,"28,000 kms",Petrol
4,Ford EcoSport Titanium 1.5L TDCi,Ford,2014,575000,"36,000 kms",Diesel


### Understanding the type of data we're dealing with

Starting with datatypes

In [416]:
df.info()
# Every column loads as object (string) dtype, so each one needs converting

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 892 entries, 0 to 891
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   name        892 non-null    object
 1   company     892 non-null    object
 2   year        892 non-null    object
 3   Price       892 non-null    object
 4   kms_driven  840 non-null    object
 5   fuel_type   837 non-null    object
dtypes: object(6)
memory usage: 41.9+ KB


In [417]:
df.isna().sum()

name           0
company        0
year           0
Price          0
kms_driven    52
fuel_type     55
dtype: int64

## Data Preprocessing
#### Converting columns to respective datatypes

#### a) Year Column

Finding unique year values

In [418]:
print("Unique years: ", df["year"].unique())

Unique years:  ['2007' '2006' '2018' '2014' '2015' '2012' '2013' '2016' '2010' '2017'
 '2008' '2011' '2019' '2009' '2005' '2000' '...' '150k' 'TOUR' '2003'
 'r 15' '2004' 'Zest' '/-Rs' 'sale' '1995' 'ara)' '2002' 'SELL' '2001'
 'tion' 'odel' '2 bs' 'arry' 'Eon' 'o...' 'ture' 'emi' 'car' 'able' 'no.'
 'd...' 'SALE' 'digo' 'sell' 'd Ex' 'n...' 'e...' 'D...' ', Ac' 'go .'
 'k...' 'o c4' 'zire' 'cent' 'Sumo' 'cab' 't xe' 'EV2' 'r...' 'zest']


Converting the year strings to numbers using `errors="coerce"`, which turns every non-numeric value into a missing value. This ensures no invalid entries remain before the column is cast to a nullable integer type.

In [419]:
# Coerce non-numeric year strings to missing values, then cast to nullable Int64
df["year"] = pd.to_numeric(df["year"], errors='coerce').astype("Int64")
unique_years = df["year"].sort_values().unique()
print("Unique years: ", unique_years)
df.head()

Unique years:  <IntegerArray>
[1995, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011,
 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, <NA>]
Length: 22, dtype: Int64


Unnamed: 0,name,company,year,Price,kms_driven,fuel_type
0,Hyundai Santro Xing XO eRLX Euro III,Hyundai,2007,80000,"45,000 kms",Petrol
1,Mahindra Jeep CL550 MDI,Mahindra,2006,425000,40 kms,Diesel
2,Maruti Suzuki Alto 800 Vxi,Maruti,2018,Ask For Price,"22,000 kms",Petrol
3,Hyundai Grand i10 Magna 1.2 Kappa VTVT,Hyundai,2014,325000,"28,000 kms",Petrol
4,Ford EcoSport Titanium 1.5L TDCi,Ford,2014,575000,"36,000 kms",Diesel
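
The coercion step can be illustrated on a small, made-up sample. This is a minimal sketch: `raw_years` is a hypothetical Series that mixes valid years with the kind of scraped debris listed above, not data taken from the notebook.

```python
import pandas as pd

# Hypothetical sample: two valid years plus three debris strings
raw_years = pd.Series(["2007", "2018", "150k", "TOUR", "..."])

# errors="coerce" replaces anything unparseable with NaN instead of raising
coerced = pd.to_numeric(raw_years, errors="coerce")

# The nullable Int64 dtype keeps whole-number years and stores gaps as <NA>
years = coerced.astype("Int64")
print(years)
```

A plain `int64` cast would fail here because NaN has no integer representation, which is why the nullable `Int64` dtype is used.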


#### b) Km Driven Column

Converting kilometers to a numeric (nullable Float64) column and treating readings below 100 km as missing

In [420]:
# Strip every non-digit character ("45,000 kms" -> "45000"), then convert to a nullable float
kms_series = pd.to_numeric(df["kms_driven"].str.replace(r"\D", "", regex=True), errors="coerce").astype("Float64")

# Mark readings below 100 km as missing (<NA>)
kms_series[kms_series < 100] = pd.NA

df["kms_driven"] = kms_series
df.head()

Unnamed: 0,name,company,year,Price,kms_driven,fuel_type
0,Hyundai Santro Xing XO eRLX Euro III,Hyundai,2007,80000,45000.0,Petrol
1,Mahindra Jeep CL550 MDI,Mahindra,2006,425000,,Diesel
2,Maruti Suzuki Alto 800 Vxi,Maruti,2018,Ask For Price,22000.0,Petrol
3,Hyundai Grand i10 Magna 1.2 Kappa VTVT,Hyundai,2014,325000,28000.0,Petrol
4,Ford EcoSport Titanium 1.5L TDCi,Ford,2014,575000,36000.0,Diesel
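
The same cleaning logic can be traced on a few hand-written values. This is a minimal sketch under stated assumptions: `raw_kms` is a hypothetical sample, and the NA-valued comparison results are filled with `False` before indexing so that the mask is a plain boolean.

```python
import pandas as pd

# Hypothetical sample: a normal reading, an implausibly low one,
# a stray word with no digits, and a missing entry
raw_kms = pd.Series(["45,000 kms", "40 kms", "Petrol", None])

# Remove every non-digit character, leaving "45000", "40", "" and a missing value
digits = raw_kms.str.replace(r"\D", "", regex=True)

# Empty strings and missing entries become missing values after coercion
kms = pd.to_numeric(digits, errors="coerce").astype("Float64")

# Flag readings below 100 km; missing comparisons count as "not too low"
too_low = (kms < 100).fillna(False)
kms[too_low] = pd.NA
print(kms)
```

Because `\D` removes every non-digit, a decimal point would be removed as well, so this pattern only suits whole-number readings such as the ones in this dataset.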


#### c) Price Column

Converting Price to a numeric (nullable Float64) column; non-numeric entries such as "Ask For Price" become missing values

In [421]:
price_series = pd.to_numeric(df["Price"].str.replace(r"\D", "", regex=True), errors="coerce").astype("Float64")

df["Price"] = price_series
df.head()

Unnamed: 0,name,company,year,Price,kms_driven,fuel_type
0,Hyundai Santro Xing XO eRLX Euro III,Hyundai,2007,80000.0,45000.0,Petrol
1,Mahindra Jeep CL550 MDI,Mahindra,2006,425000.0,,Diesel
2,Maruti Suzuki Alto 800 Vxi,Maruti,2018,,22000.0,Petrol
3,Hyundai Grand i10 Magna 1.2 Kappa VTVT,Hyundai,2014,325000.0,28000.0,Petrol
4,Ford EcoSport Titanium 1.5L TDCi,Ford,2014,575000.0,36000.0,Diesel


#### d) Fuel Type Column

Converting Fuel Type to category

In [422]:
df["fuel_type"] = df["fuel_type"].astype("category")
df.head()

Unnamed: 0,name,company,year,Price,kms_driven,fuel_type
0,Hyundai Santro Xing XO eRLX Euro III,Hyundai,2007,80000.0,45000.0,Petrol
1,Mahindra Jeep CL550 MDI,Mahindra,2006,425000.0,,Diesel
2,Maruti Suzuki Alto 800 Vxi,Maruti,2018,,22000.0,Petrol
3,Hyundai Grand i10 Magna 1.2 Kappa VTVT,Hyundai,2014,325000.0,28000.0,Petrol
4,Ford EcoSport Titanium 1.5L TDCi,Ford,2014,575000.0,36000.0,Diesel


In [423]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 892 entries, 0 to 891
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   name        892 non-null    object  
 1   company     892 non-null    object  
 2   year        842 non-null    Int64   
 3   Price       857 non-null    Float64 
 4   kms_driven  823 non-null    Float64 
 5   fuel_type   837 non-null    category
dtypes: Float64(2), Int64(1), category(1), object(2)
memory usage: 38.6+ KB


In [424]:
df[["Price", "kms_driven"]].describe()

Unnamed: 0,Price,kms_driven
count,857.0,823.0
mean,404688.534422,46848.72175
std,465536.544629,34213.169031
min,30000.0,100.0
25%,175000.0,27000.0
50%,299999.0,41000.0
75%,485000.0,57461.5
max,8500003.0,400000.0


## Data Cleaning

#### Handling Missing values

In [425]:
df.isna().sum()

name           0
company        0
year          50
Price         35
kms_driven    69
fuel_type     55
dtype: int64

In [426]:
nokms_mask = df["kms_driven"].isna()
noyear_mask = df["year"].isna()
noprice_mask = df["Price"].isna()
nofuel_mask = df["fuel_type"].isna()

tempdf = df[((nokms_mask) & (noyear_mask) & (noprice_mask) & (nofuel_mask))]
print(f"{tempdf.shape[0]} rows are deleted because they don't have any of kms_driven, year, price or fuel_type")
df.drop(axis="index", inplace=True, index=tempdf.index)
print("Remaining dataset:")
df

12 rows are deleted because they don't have any of kms_driven, year, price or fuel_type
Remaining dataset:


Unnamed: 0,name,company,year,Price,kms_driven,fuel_type
0,Hyundai Santro Xing XO eRLX Euro III,Hyundai,2007,80000.0,45000.0,Petrol
1,Mahindra Jeep CL550 MDI,Mahindra,2006,425000.0,,Diesel
2,Maruti Suzuki Alto 800 Vxi,Maruti,2018,,22000.0,Petrol
3,Hyundai Grand i10 Magna 1.2 Kappa VTVT,Hyundai,2014,325000.0,28000.0,Petrol
4,Ford EcoSport Titanium 1.5L TDCi,Ford,2014,575000.0,36000.0,Diesel
...,...,...,...,...,...,...
887,Ta,Tara,,310000.0,,
888,Tata Zest XM Diesel,Tata,2018,260000.0,27000.0,Diesel
889,Mahindra Quanto C8,Mahindra,2013,390000.0,40000.0,Diesel
890,Honda Amaze 1.2 E i VTEC,Honda,2014,180000.0,,
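
The four separate masks can also be expressed as a single row-wise check. The following is a minimal sketch on a hypothetical three-row frame (`sample` and `key_cols` are illustrative names), showing two equivalent ways to drop rows in which every key column is missing.

```python
import pandas as pd

# Hypothetical frame: a complete row, a row with nothing usable, a partial row
sample = pd.DataFrame({
    "name": ["A car", "Debris", "B car"],
    "year": pd.array([2014, pd.NA, pd.NA], dtype="Int64"),
    "Price": pd.array([325000.0, pd.NA, 180000.0], dtype="Float64"),
    "kms_driven": pd.array([28000.0, pd.NA, pd.NA], dtype="Float64"),
    "fuel_type": ["Petrol", None, None],
})
key_cols = ["kms_driven", "year", "Price", "fuel_type"]

# True only for rows where all four key columns are missing
all_missing = sample[key_cols].isna().all(axis=1)
cleaned = sample[~all_missing]

# dropna with how="all" applies the same rule in one call
cleaned_alt = sample.dropna(subset=key_cols, how="all")
print(cleaned["name"].tolist())
```

Only the middle row is removed; the partial row stays because it still has a price.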


Removing rows with invalid names (shorter than 5 characters)

In [427]:
invalid_name_mask = df["name"].str.len()<5
print("Removed", df[invalid_name_mask].shape[0], "rows with invalid name")

df.drop(axis="index", index=df[invalid_name_mask].index, inplace=True)

df

Removed 3 rows with invalid name


Unnamed: 0,name,company,year,Price,kms_driven,fuel_type
0,Hyundai Santro Xing XO eRLX Euro III,Hyundai,2007,80000.0,45000.0,Petrol
1,Mahindra Jeep CL550 MDI,Mahindra,2006,425000.0,,Diesel
2,Maruti Suzuki Alto 800 Vxi,Maruti,2018,,22000.0,Petrol
3,Hyundai Grand i10 Magna 1.2 Kappa VTVT,Hyundai,2014,325000.0,28000.0,Petrol
4,Ford EcoSport Titanium 1.5L TDCi,Ford,2014,575000.0,36000.0,Diesel
...,...,...,...,...,...,...
886,Toyota Corolla Altis,Toyota,2009,300000.0,132000.0,Petrol
888,Tata Zest XM Diesel,Tata,2018,260000.0,27000.0,Diesel
889,Mahindra Quanto C8,Mahindra,2013,390000.0,40000.0,Diesel
890,Honda Amaze 1.2 E i VTEC,Honda,2014,180000.0,,


Removing LPG and \<NA\> fuel type records

In [428]:
print("Before:")
print(df["fuel_type"].value_counts())
print("Empty records:",df["fuel_type"].isna().sum())

# Drop LPG records and records with a missing fuel type
df = df[(df["fuel_type"] != "LPG") & (~df["fuel_type"].isna())].reset_index(drop=True)
df["fuel_type"] = df["fuel_type"].cat.remove_unused_categories()

print("\nAfter:")
print(df["fuel_type"].value_counts())
print("Empty records:",df["fuel_type"].isna().sum())

Before:
fuel_type
Petrol    440
Diesel    395
LPG         2
Name: count, dtype: int64
Empty records: 40

After:
fuel_type
Petrol    440
Diesel    395
Name: count, dtype: int64
Empty records: 0
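
Filtering rows out of a categorical column does not remove the category itself, which is why `remove_unused_categories()` is called above. A minimal sketch on a hypothetical Series (`fuel` is an illustrative name) shows the difference:

```python
import pandas as pd

# Hypothetical categorical column with one LPG entry and one missing entry
fuel = pd.Series(["Petrol", "Diesel", "LPG", None, "Petrol"], dtype="category")

# Keep only rows that are neither LPG nor missing
kept = fuel[(fuel != "LPG") & fuel.notna()].reset_index(drop=True)

# The rows are gone, but "LPG" is still a declared category with a zero count
print(kept.value_counts())

# Pruning drops categories that no longer occur in the data
pruned = kept.cat.remove_unused_categories()
print(pruned.value_counts())
```

Without the pruning step, later `value_counts()` or `groupby` calls may keep reporting an empty LPG group.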


Finding the average kilometers driven

In [429]:
mean_km = int(df.loc[~nokms_mask, "kms_driven"].mean())
mean_km

46822

Replacing missing values with the mean kilometers where both **year** and **kms_driven** are missing

In [430]:
df.loc[(noyear_mask) & (nokms_mask), "kms_driven"] = mean_km
df.isna().sum()

name           0
company        0
year           0
Price         21
kms_driven    14
fuel_type      0
dtype: int64
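
The masks `nokms_mask` and `noyear_mask` were built in In [426], before rows were dropped and the index was reset in In [428]. pandas aligns a boolean Series by index label, so a mask built earlier may no longer point at the same rows. The following is a minimal sketch on a hypothetical five-row frame (`frame`, `stale_mask` and `fresh_mask` are illustrative names, not from the notebook) that shows the mismatch and one way to avoid it by rebuilding the mask on the current frame.

```python
import pandas as pd

# Hypothetical frame with two missing readings
frame = pd.DataFrame({"kms_driven": [None, 100.0, 200.0, None, 300.0]})

# Mask built on the original five rows
stale_mask = frame["kms_driven"].isna()

# Drop the first row and renumber, as reset_index(drop=True) does
frame = frame.iloc[1:].reset_index(drop=True)

# Mask rebuilt on the current rows
fresh_mask = frame["kms_driven"].isna()

# Aligned by label, the old mask now flags different rows than the fresh one
aligned_stale = stale_mask.reindex(frame.index)
print(pd.DataFrame({"stale": aligned_stale, "fresh": fresh_mask}))
```

Recomputing the masks after every drop or index reset keeps each flag attached to the row it was computed from.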

Finding Mean Annual Kilometers driven based on **fuel_type**

In [431]:
base_year = unique_years.max()
df["Age"] = base_year - df["year"] + 1
useful_df = df.loc[(~nokms_mask) & (~noyear_mask)]

df["Annual_Km_Driven"] = useful_df.loc[:,"kms_driven"] / useful_df.loc[:,"Age"]
mean_table = df[~df["Annual_Km_Driven"].isna()].groupby("fuel_type", observed=True)["Annual_Km_Driven"].mean()
print(mean_table)
mean_table = mean_table.to_dict()

fuel_type
Diesel    9833.688496
Petrol    5134.568259
Name: Annual_Km_Driven, dtype: Float64
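
As a worked example of the two derived columns, the arithmetic for the first row of the dataset (a 2007 car with 45,000 km, and 2019 as the latest year in the data) can be checked by hand. This is a minimal sketch using plain Python variables rather than the DataFrame:

```python
# Values taken from the first row of the dataset shown above
base_year = 2019        # latest year present in the data
year = 2007
kms_driven = 45000.0

# Age counts the purchase year itself, hence the + 1
age = base_year - year + 1

# Average distance covered per year of age
annual_km = kms_driven / age
print(age, round(annual_km, 6))
```

The `+ 1` also means a car from the latest year has an age of 1 rather than 0, which avoids a division by zero.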



Filling the empty Annual_Km_Driven records with mean values based on their fuel type

In [432]:
df["Annual_Km_Driven"] = df['Annual_Km_Driven'].fillna(df['fuel_type'].map(mean_table))
df["Annual_Km_Driven"]

0      3461.538462
1      9833.688496
2          11000.0
3      4666.666667
4           6000.0
          ...     
830    5555.555556
831    2727.272727
832        12000.0
833        13500.0
834    5714.285714
Name: Annual_Km_Driven, Length: 835, dtype: Float64

In [433]:
df["kms_driven"] = df["kms_driven"].fillna(df["Annual_Km_Driven"] * df["Age"])
df

Unnamed: 0,name,company,year,Price,kms_driven,fuel_type,Age,Annual_Km_Driven
0,Hyundai Santro Xing XO eRLX Euro III,Hyundai,2007,80000.0,45000.0,Petrol,13,3461.538462
1,Mahindra Jeep CL550 MDI,Mahindra,2006,425000.0,137671.638942,Diesel,14,9833.688496
2,Maruti Suzuki Alto 800 Vxi,Maruti,2018,,22000.0,Petrol,2,11000.0
3,Hyundai Grand i10 Magna 1.2 Kappa VTVT,Hyundai,2014,325000.0,28000.0,Petrol,6,4666.666667
4,Ford EcoSport Titanium 1.5L TDCi,Ford,2014,575000.0,36000.0,Diesel,6,6000.0
...,...,...,...,...,...,...,...,...
830,Maruti Suzuki Ritz VXI ABS,Maruti,2011,270000.0,50000.0,Petrol,9,5555.555556
831,Tata Indica V2 DLE BS III,Tata,2009,110000.0,30000.0,Diesel,11,2727.272727
832,Toyota Corolla Altis,Toyota,2009,300000.0,132000.0,Petrol,11,12000.0
833,Tata Zest XM Diesel,Tata,2018,260000.0,27000.0,Diesel,2,13500.0
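
The two fill steps can be followed on a two-row example. This is a minimal sketch: `sample` is a hypothetical frame, and the values in `mean_table` are rounded, made-up stand-ins for the per-fuel means computed above.

```python
import pandas as pd

# Hypothetical rounded per-fuel means (not the notebook's actual values)
mean_table = {"Diesel": 10000.0, "Petrol": 5000.0}

# One row with a known reading, one row without
sample = pd.DataFrame({
    "fuel_type": ["Petrol", "Diesel"],
    "Age": [13, 14],
    "kms_driven": [45000.0, None],
})

# Annual distance is only computable where kms_driven is known
sample["Annual_Km_Driven"] = sample["kms_driven"] / sample["Age"]

# Step 1: fill the missing annual distance with the mean for that fuel type
sample["Annual_Km_Driven"] = sample["Annual_Km_Driven"].fillna(
    sample["fuel_type"].map(mean_table)
)

# Step 2: rebuild the missing total as annual distance times age
sample["kms_driven"] = sample["kms_driven"].fillna(
    sample["Annual_Km_Driven"] * sample["Age"]
)
print(sample)
```

Rows that already had a reading are left untouched, because `fillna` only writes into missing cells.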


In [434]:
df.isna().sum()

name                 0
company              0
year                 0
Price               21
kms_driven           0
fuel_type            0
Age                  0
Annual_Km_Driven     0
dtype: int64

Dropping the records where the price is missing

In [435]:
# Count the rows with a missing price, then keep only the rows that have one
print("No. of records with missing price:", df["Price"].isna().sum())
df = df[~df["Price"].isna()].reset_index(drop=True)
df

No. of records with missing price: 21


Unnamed: 0,name,company,year,Price,kms_driven,fuel_type,Age,Annual_Km_Driven
0,Hyundai Santro Xing XO eRLX Euro III,Hyundai,2007,80000.0,45000.0,Petrol,13,3461.538462
1,Mahindra Jeep CL550 MDI,Mahindra,2006,425000.0,137671.638942,Diesel,14,9833.688496
2,Hyundai Grand i10 Magna 1.2 Kappa VTVT,Hyundai,2014,325000.0,28000.0,Petrol,6,4666.666667
3,Ford EcoSport Titanium 1.5L TDCi,Ford,2014,575000.0,36000.0,Diesel,6,6000.0
4,Ford Figo,Ford,2012,175000.0,41000.0,Diesel,8,5125.0
...,...,...,...,...,...,...,...,...
809,Maruti Suzuki Ritz VXI ABS,Maruti,2011,270000.0,50000.0,Petrol,9,5555.555556
810,Tata Indica V2 DLE BS III,Tata,2009,110000.0,30000.0,Diesel,11,2727.272727
811,Toyota Corolla Altis,Toyota,2009,300000.0,132000.0,Petrol,11,12000.0
812,Tata Zest XM Diesel,Tata,2018,260000.0,27000.0,Diesel,2,13500.0


Exploring the data types

In [436]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 814 entries, 0 to 813
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype   
---  ------            --------------  -----   
 0   name              814 non-null    object  
 1   company           814 non-null    object  
 2   year              814 non-null    Int64   
 3   Price             814 non-null    Float64 
 4   kms_driven        814 non-null    Float64 
 5   fuel_type         814 non-null    category
 6   Age               814 non-null    Int64   
 7   Annual_Km_Driven  814 non-null    Float64 
dtypes: Float64(3), Int64(2), category(1), object(2)
memory usage: 49.5+ KB


In [437]:
print("We can see that there are no more missing values in our dataset")
df.isna().sum()

We can see that there are no more missing values in our dataset


name                0
company             0
year                0
Price               0
kms_driven          0
fuel_type           0
Age                 0
Annual_Km_Driven    0
dtype: int64

Storing the cleaned data into a CSV file

In [438]:
df.to_csv("cleaned_data.csv", index=False)
df

Unnamed: 0,name,company,year,Price,kms_driven,fuel_type,Age,Annual_Km_Driven
0,Hyundai Santro Xing XO eRLX Euro III,Hyundai,2007,80000.0,45000.0,Petrol,13,3461.538462
1,Mahindra Jeep CL550 MDI,Mahindra,2006,425000.0,137671.638942,Diesel,14,9833.688496
2,Hyundai Grand i10 Magna 1.2 Kappa VTVT,Hyundai,2014,325000.0,28000.0,Petrol,6,4666.666667
3,Ford EcoSport Titanium 1.5L TDCi,Ford,2014,575000.0,36000.0,Diesel,6,6000.0
4,Ford Figo,Ford,2012,175000.0,41000.0,Diesel,8,5125.0
...,...,...,...,...,...,...,...,...
809,Maruti Suzuki Ritz VXI ABS,Maruti,2011,270000.0,50000.0,Petrol,9,5555.555556
810,Tata Indica V2 DLE BS III,Tata,2009,110000.0,30000.0,Diesel,11,2727.272727
811,Toyota Corolla Altis,Toyota,2009,300000.0,132000.0,Petrol,11,12000.0
812,Tata Zest XM Diesel,Tata,2018,260000.0,27000.0,Diesel,2,13500.0
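
CSV files store only text, so the `Int64` and `category` dtypes set during preprocessing are not recorded in `cleaned_data.csv`. The following is a minimal sketch of restoring them on load. It uses a hypothetical two-row frame and an in-memory buffer in place of the real file, and the `dtype` mapping is an assumption about how a later notebook might read the data.

```python
import io
import pandas as pd

# Hypothetical stand-in for the cleaned frame
cleaned = pd.DataFrame({
    "year": pd.array([2007, 2014], dtype="Int64"),
    "fuel_type": pd.Categorical(["Petrol", "Diesel"]),
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)

# A plain read infers generic dtypes from the text
buffer.seek(0)
plain = pd.read_csv(buffer)

# Passing dtype restores the intended column types
buffer.seek(0)
typed = pd.read_csv(buffer, dtype={"year": "Int64", "fuel_type": "category"})
print(plain.dtypes)
print(typed.dtypes)
```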
