# PriceTrack: Unlocking Car Market Insights

PriceTrack is a data science project that predicts the value of second-hand cars from key input parameters.
Using a regression model, it provides data-driven insights that help buyers and sellers make informed decisions.

We will use some common Python libraries: pandas, numpy, matplotlib, and optionally seaborn.

In [207]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# import seaborn as sns

## Importing Dataset

We use the Quikr car dataset, which comes from Quikr (https://quikr.com), to carry out exploratory data analysis.

The dataset includes 892 cars (rows) and 6 features (columns).

In [208]:
df=pd.read_csv("quikr_car.csv")
df.head()

Unnamed: 0,name,company,year,Price,kms_driven,fuel_type
0,Hyundai Santro Xing XO eRLX Euro III,Hyundai,2007,80000,"45,000 kms",Petrol
1,Mahindra Jeep CL550 MDI,Mahindra,2006,425000,40 kms,Diesel
2,Maruti Suzuki Alto 800 Vxi,Maruti,2018,Ask For Price,"22,000 kms",Petrol
3,Hyundai Grand i10 Magna 1.2 Kappa VTVT,Hyundai,2014,325000,"28,000 kms",Petrol
4,Ford EcoSport Titanium 1.5L TDCi,Ford,2014,575000,"36,000 kms",Diesel


### Understanding the type of data we're dealing with

Starting with datatypes

In [209]:
df.info()
# categorize into types of variables

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 892 entries, 0 to 891
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   name        892 non-null    object
 1   company     892 non-null    object
 2   year        892 non-null    object
 3   Price       892 non-null    object
 4   kms_driven  840 non-null    object
 5   fuel_type   837 non-null    object
dtypes: object(6)
memory usage: 41.9+ KB


## Data Preprocessing
#### Converting columns to respective datatypes

#### a) Year Column

Finding unique year values

In [210]:
print("Unique years: ", df["year"].unique())

Unique years:  ['2007' '2006' '2018' '2014' '2015' '2012' '2013' '2016' '2010' '2017'
 '2008' '2011' '2019' '2009' '2005' '2000' '...' '150k' 'TOUR' '2003'
 'r 15' '2004' 'Zest' '/-Rs' 'sale' '1995' 'ara)' '2002' 'SELL' '2001'
 'tion' 'odel' '2 bs' 'arry' 'Eon' 'o...' 'ture' 'emi' 'car' 'able' 'no.'
 'd...' 'SALE' 'digo' 'sell' 'd Ex' 'n...' 'e...' 'D...' ', Ac' 'go .'
 'k...' 'o c4' 'zire' 'cent' 'Sumo' 'cab' 't xe' 'EV2' 'r...' 'zest']


Non-numeric values are converted to NaN using ```errors="coerce"```. This ensures that invalid entries become missing values before the column is converted to integers.

In [211]:
# Coerce non-numeric entries to <NA>, then store the years as nullable integers
df["year"] = pd.to_numeric(df["year"], errors='coerce').astype("Int64")
unique_years = df["year"].sort_values().unique()
print("Unique years: ", unique_years)
df.head()

Unique years:  <IntegerArray>
[1995, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011,
 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, <NA>]
Length: 22, dtype: Int64


Unnamed: 0,name,company,year,Price,kms_driven,fuel_type
0,Hyundai Santro Xing XO eRLX Euro III,Hyundai,2007,80000,"45,000 kms",Petrol
1,Mahindra Jeep CL550 MDI,Mahindra,2006,425000,40 kms,Diesel
2,Maruti Suzuki Alto 800 Vxi,Maruti,2018,Ask For Price,"22,000 kms",Petrol
3,Hyundai Grand i10 Magna 1.2 Kappa VTVT,Hyundai,2014,325000,"28,000 kms",Petrol
4,Ford EcoSport Titanium 1.5L TDCi,Ford,2014,575000,"36,000 kms",Diesel
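
To see what `errors="coerce"` does in isolation, here is a minimal sketch on a few values taken from the unique-years output above. The `sample` Series is illustrative only and is not part of the notebook's data flow.

```python
import pandas as pd

# Hypothetical sample: two valid years and three junk strings seen in the raw column
sample = pd.Series(["2007", "2014", "TOUR", "150k", "..."])

# Unparseable strings become NaN, which the nullable Int64 dtype stores as <NA>
years = pd.to_numeric(sample, errors="coerce").astype("Int64")

print(years.dtype)
print(int(years.isna().sum()), "values could not be parsed")
```

The nullable `Int64` dtype is what lets the column hold whole-number years and missing values at the same time; a plain `int64` column cannot store `<NA>`.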


#### b) Km Driven Column

Converting kilometers to integers and treating values below 100 as missing

In [212]:
# Strip every non-digit character (commas, " kms") and convert to nullable integers
kms_series = pd.to_numeric(df["kms_driven"].str.replace(r"\D", "", regex=True), errors="coerce").astype("Int64")

# Mark entries below 100 km as missing (<NA>)
kms_series[kms_series < 100] = pd.NA

df["kms_driven"] = kms_series
df.head()

Unnamed: 0,name,company,year,Price,kms_driven,fuel_type
0,Hyundai Santro Xing XO eRLX Euro III,Hyundai,2007,80000,45000.0,Petrol
1,Mahindra Jeep CL550 MDI,Mahindra,2006,425000,,Diesel
2,Maruti Suzuki Alto 800 Vxi,Maruti,2018,Ask For Price,22000.0,Petrol
3,Hyundai Grand i10 Magna 1.2 Kappa VTVT,Hyundai,2014,325000,28000.0,Petrol
4,Ford EcoSport Titanium 1.5L TDCi,Ford,2014,575000,36000.0,Diesel
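
The same cleaning step can be wrapped in a small helper, which makes the threshold explicit and easy to test. This is a sketch under assumptions: the function name `clean_kms`, its `min_km` parameter, and the sample values are hypothetical and do not appear in the notebook.

```python
import pandas as pd

def clean_kms(raw: pd.Series, min_km: int = 100) -> pd.Series:
    """Strip non-digits, parse to numbers, and blank out readings below min_km."""
    digits = raw.str.replace(r"\D", "", regex=True)
    kms = pd.to_numeric(digits, errors="coerce")
    # Values that fail the comparison (including missing ones) are left as missing
    kms = kms.where(kms >= min_km)
    return kms.astype("Int64")

# Hypothetical sample: a normal reading, a too-small reading, a stray word, a gap
sample_kms = pd.Series(["45,000 kms", "40 kms", "Petrol", None])
cleaned_kms = clean_kms(sample_kms)
print(cleaned_kms)
```

Only the first reading survives; the other three entries end up as `<NA>`, which matches how the "40 kms" row is handled in the table above.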


#### c) Price Column

Converting Price to integer; non-numeric entries such as "Ask For Price" become missing values

In [213]:
price_series = pd.to_numeric(df["Price"].str.replace(r"\D", "", regex=True), errors="coerce").astype("Int64")

df["Price"] = price_series
df.head()

Unnamed: 0,name,company,year,Price,kms_driven,fuel_type
0,Hyundai Santro Xing XO eRLX Euro III,Hyundai,2007,80000.0,45000.0,Petrol
1,Mahindra Jeep CL550 MDI,Mahindra,2006,425000.0,,Diesel
2,Maruti Suzuki Alto 800 Vxi,Maruti,2018,,22000.0,Petrol
3,Hyundai Grand i10 Magna 1.2 Kappa VTVT,Hyundai,2014,325000.0,28000.0,Petrol
4,Ford EcoSport Titanium 1.5L TDCi,Ford,2014,575000.0,36000.0,Diesel


#### d) Fuel Type Column

Converting Fuel Type to category

In [214]:
df["fuel_type"] = df["fuel_type"].astype("category")
df.head()

Unnamed: 0,name,company,year,Price,kms_driven,fuel_type
0,Hyundai Santro Xing XO eRLX Euro III,Hyundai,2007,80000.0,45000.0,Petrol
1,Mahindra Jeep CL550 MDI,Mahindra,2006,425000.0,,Diesel
2,Maruti Suzuki Alto 800 Vxi,Maruti,2018,,22000.0,Petrol
3,Hyundai Grand i10 Magna 1.2 Kappa VTVT,Hyundai,2014,325000.0,28000.0,Petrol
4,Ford EcoSport Titanium 1.5L TDCi,Ford,2014,575000.0,36000.0,Diesel
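
A categorical column stores each distinct label once and keeps a small integer code per row. The sketch below shows this on an illustrative Series; the sample values, including "LPG", are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical fuel types with one missing entry
fuel = pd.Series(["Petrol", "Diesel", None, "Petrol", "LPG"]).astype("category")

# Distinct labels, sorted alphabetically
print(list(fuel.cat.categories))

# Integer code per row; missing values are encoded as -1
print(fuel.cat.codes.tolist())
```

Missing fuel types stay missing after the conversion, so they still show up in `isna()` counts later on.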


In [215]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 892 entries, 0 to 891
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   name        892 non-null    object  
 1   company     892 non-null    object  
 2   year        842 non-null    Int64   
 3   Price       857 non-null    Int64   
 4   kms_driven  823 non-null    Int64   
 5   fuel_type   837 non-null    category
dtypes: Int64(3), category(1), object(2)
memory usage: 38.6+ KB


In [216]:
df.describe()

Unnamed: 0,year,Price,kms_driven
count,842.0,857.0,823.0
mean,2012.523753,404688.534422,46848.72175
std,4.024601,465536.544629,34213.169031
min,1995.0,30000.0,100.0
25%,2010.0,175000.0,27000.0
50%,2013.0,299999.0,41000.0
75%,2015.0,485000.0,57461.5
max,2019.0,8500003.0,400000.0


## Data Cleaning

#### Handling Missing values

In [217]:
df.isna().sum()

name           0
company        0
year          50
Price         35
kms_driven    69
fuel_type     55
dtype: int64

In [218]:
nokms_mask = df["kms_driven"].isna()
noyear_mask = df["year"].isna()
noprice_mask = df["Price"].isna()
nofuel_mask = df["fuel_type"].isna()

tempdf = df[((nokms_mask) & (noyear_mask) & (noprice_mask) & (nofuel_mask))]
print(f"{tempdf.shape[0]} rows are deleted because they don't have any of kms_driven, year, price or fuel_type")
df.drop(axis="index", inplace=True, index=tempdf.index)
print("Remaining dataset:")
df

12 rows are deleted because they don't have any of kms_driven, year, price or fuel_type
Remaining dataset:


Unnamed: 0,name,company,year,Price,kms_driven,fuel_type
0,Hyundai Santro Xing XO eRLX Euro III,Hyundai,2007,80000,45000,Petrol
1,Mahindra Jeep CL550 MDI,Mahindra,2006,425000,,Diesel
2,Maruti Suzuki Alto 800 Vxi,Maruti,2018,,22000,Petrol
3,Hyundai Grand i10 Magna 1.2 Kappa VTVT,Hyundai,2014,325000,28000,Petrol
4,Ford EcoSport Titanium 1.5L TDCi,Ford,2014,575000,36000,Diesel
...,...,...,...,...,...,...
887,Ta,Tara,,310000,,
888,Tata Zest XM Diesel,Tata,2018,260000,27000,Diesel
889,Mahindra Quanto C8,Mahindra,2013,390000,40000,Diesel
890,Honda Amaze 1.2 E i VTEC,Honda,2014,180000,,
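
The four-mask filter above can also be written with `DataFrame.dropna`, using `subset` together with `how="all"` so that a row is removed only when every listed column is missing. A minimal sketch on a toy frame (the `toy` data and the names `key_cols` and `kept` are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "year": pd.array([2014, None, None], dtype="Int64"),
    "Price": pd.array([325000, None, 310000], dtype="Int64"),
    "kms_driven": pd.array([28000, None, None], dtype="Int64"),
    "fuel_type": ["Petrol", None, None],
})

key_cols = ["kms_driven", "year", "Price", "fuel_type"]

# Row "B" has none of the four values, so it is dropped; row "C" still has a price
kept = toy.dropna(subset=key_cols, how="all")
print(f"{len(toy) - len(kept)} row(s) dropped")
```

Because `dropna` returns a new frame by default, the original data stays available for comparison.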


In [231]:
invalid_name_mask = df["name"].str.len()<5
invalid_name_mask.sum()

3

Finding the average kilometers driven

In [219]:
# Series.mean() skips missing values by default, so no mask is needed
df["kms_driven"].mean()
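
`Series.mean()` skips missing values by default (`skipna=True`), so the average can be taken over the column directly. A minimal sketch using the first four `kms_driven` values from the cleaned table shown earlier:

```python
import pandas as pd

# First four kms_driven values after cleaning; the second one is missing
kms = pd.Series([45000, None, 22000, 28000], dtype="Int64")

# The missing entry is ignored, so the mean is taken over three values
mean_kms = kms.mean()
print(mean_kms)
```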

In [None]:
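# Use the most recent model year in the data as the reference year;
# unique_years[-1] would be <NA> because sorting places missing values last
base_year = df["year"].max()
df["Age"] = base_year - df["year"]
df
# df["Annual_Km_Driven"] = df[""]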

Unnamed: 0,name,company,year,Price,kms_driven,fuel_type,Age
0,Hyundai Santro Xing XO eRLX Euro III,Hyundai,2007,80000,"45,000 kms",Petrol,12
1,Mahindra Jeep CL550 MDI,Mahindra,2006,425000,40 kms,Diesel,13
2,Maruti Suzuki Alto 800 Vxi,Maruti,2018,Ask For Price,"22,000 kms",Petrol,1
3,Hyundai Grand i10 Magna 1.2 Kappa VTVT,Hyundai,2014,325000,"28,000 kms",Petrol,5
4,Ford EcoSport Titanium 1.5L TDCi,Ford,2014,575000,"36,000 kms",Diesel,5
...,...,...,...,...,...,...,...
887,Ta,Tara,0,310000,,,2019
888,Tata Zest XM Diesel,Tata,2018,260000,"27,000 kms",Diesel,1
889,Mahindra Quanto C8,Mahindra,2013,390000,"40,000 kms",Diesel,6
890,Honda Amaze 1.2 E i VTEC,Honda,2014,180000,Petrol,,5
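
When the year column uses the nullable `Int64` dtype, the age calculation carries missing years through as `<NA>` instead of producing a misleading number. A minimal sketch with illustrative values (the `years` Series is hypothetical):

```python
import pandas as pd

years = pd.Series([2007, 2018, None, 2014], dtype="Int64")

# max() ignores the missing entry and returns the latest known year
base_year = years.max()

# Subtraction keeps <NA> where the year is unknown
age = base_year - years
print(age)
```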


- Price Column

In [None]:
invalid_row_count = df["Price"].str.contains(r'[a-zA-Z]').sum()
print("Number of Invalid Price Rows:", invalid_row_count)

Number of Invalid Price Rows: 35


Converting **Price** to a numeric datatype (cleaning non-numeric characters)
<br />
If an invalid value is found, we fill it with 0 for now

In [None]:
df["Price"] = pd.to_numeric(df["Price"].str.replace(r'\D', '', regex=True), errors='coerce').fillna(0).astype(int)
df["Price"].head()

0     80000
1    425000
2         0
3    325000
4    575000
Name: Price, dtype: int64

Here we can see that only **kms_driven** and **fuel_type** columns have null values

In [None]:
df["year"].unique()

array([2007, 2006, 2018, 2014, 2015, 2012, 2013, 2016, 2010, 2017, 2008,
       2011, 2019, 2009, 2005, 2000,    0, 2003, 2004, 1995, 2002, 2001])

In [None]:
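# Assumption: ds1 is not defined earlier in this notebook, so load a fresh copy of the raw CSV
ds1 = pd.read_csv("quikr_car.csv")

# Convert 'name', 'company', 'fuel_type' to string (if not already)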
ds1["name"] = ds1["name"].astype(str)
ds1["company"] = ds1["company"].astype(str)
ds1["fuel_type"] = ds1["fuel_type"].astype(str)

# Convert 'Price' and 'kms_driven' to numeric (cleaning non-numeric characters)
ds1["Price"] = pd.to_numeric(ds1["Price"].str.replace(r'\D', '', regex=True), errors='coerce').fillna(0).astype(int)
ds1["kms_driven"] = pd.to_numeric(ds1["kms_driven"].str.replace(r'\D', '', regex=True), errors='coerce').fillna(0).astype(int)

# Save the cleaned dataset
ds1.to_csv("my_cleaned_file.csv", index=False)

print("Data types successfully converted and saved!")


Data types successfully converted and saved!
