# Used Car Advertisement Exploratory Data Analysis

In this experiment, we take a look at a dataset of used car sales advertisements. We're interested in how price is distributed across vehicle brands and, judging by the odometer readings, at what mileage people start thinking about selling their car. We are also going to incorporate this data into an interactive web application so that users can make their own comparisons.

In [14]:
import pandas as pd
import plotly.express as px
from matplotlib import pyplot as plt
import streamlit as st


### Data Cleaning

In [15]:
data = pd.read_csv(r'C:\Users\Lance\Desktop\python_projects\car-banana\vehicles_us.csv')
print(data.info(show_counts=True))
print(data.head(5))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51525 entries, 0 to 51524
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   price         51525 non-null  int64  
 1   model_year    47906 non-null  float64
 2   model         51525 non-null  object 
 3   condition     51525 non-null  object 
 4   cylinders     46265 non-null  float64
 5   fuel          51525 non-null  object 
 6   odometer      43633 non-null  float64
 7   transmission  51525 non-null  object 
 8   type          51525 non-null  object 
 9   paint_color   42258 non-null  object 
 10  is_4wd        25572 non-null  float64
 11  date_posted   51525 non-null  object 
 12  days_listed   51525 non-null  int64  
dtypes: float64(4), int64(2), object(7)
memory usage: 5.1+ MB
None
   price  model_year           model  condition  cylinders fuel  odometer  \
0   9400      2011.0          bmw x5       good        6.0  gas  145000.0   
1  25500         NaN     

Of the columns with missing values, model_year and odometer might be worth taking a look at.

When making plots with price, I noticed that some price values are extremely low, such as nearly new vehicles supposedly selling for less than 500 dollars, which is odd. I'll take a look at that.

In [16]:
# keep vehicles priced at 500 USD or more
normal_price = data[data['price'] >= 500]

# inspect the listings priced below 500 USD
price_is_low = data[data['price'] < 500]
price_is_low.sample(10)

| | price | model_year | model | condition | cylinders | fuel | odometer | transmission | type | paint_color | is_4wd | date_posted | days_listed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 31157 | 1 | 2018.0 | dodge charger | excellent | 6.0 | gas | 7655.0 | automatic | sedan | black | NaN | 2018-09-23 | 25 |
| 14113 | 1 | 2016.0 | ford mustang | excellent | NaN | gas | NaN | other | coupe | black | 1.0 | 2018-05-14 | 62 |
| 25593 | 1 | 2007.0 | chevrolet tahoe | like new | 8.0 | gas | 1.0 | automatic | SUV | blue | NaN | 2018-07-26 | 20 |
| 24256 | 1 | NaN | dodge charger | excellent | 10.0 | gas | 59772.0 | other | sedan | custom | 1.0 | 2018-08-27 | 64 |
| 26941 | 1 | 2015.0 | jeep grand cherokee | excellent | 8.0 | gas | 42442.0 | automatic | SUV | black | 1.0 | 2019-04-17 | 5 |
| 12544 | 1 | 2017.0 | jeep wrangler | excellent | 10.0 | gas | 38617.0 | other | SUV | black | 1.0 | 2018-11-03 | 52 |
| 10171 | 1 | 2018.0 | toyota camry le | excellent | 4.0 | gas | 38860.0 | automatic | sedan | white | NaN | 2018-08-13 | 4 |
| 28489 | 1 | 1999.0 | honda accord | fair | NaN | gas | 218000.0 | automatic | sedan | grey | NaN | 2019-02-04 | 21 |
| 42138 | 69 | 2014.0 | honda accord | excellent | 4.0 | gas | 82315.0 | automatic | sedan | white | NaN | 2018-12-06 | 35 |
| 9181 | 1 | 2018.0 | nissan altima | excellent | 10.0 | gas | 42007.0 | other | sedan | custom | 1.0 | 2018-10-26 | 32 |


A price of 1 is most likely a placeholder. I'm guessing that the other sub-500 vehicles had their prices entered incorrectly (for example, with zeroes missing), or that the condition wasn't entered correctly. Luckily there aren't a lot of them, so I'll leave these out and it shouldn't affect the analysis.
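
To put a number on "not a lot", the low-price rows can be counted and split into placeholders and other suspicious values. This is only a minimal sketch: the `sample` frame is made-up stand-in data, and `summarize_low_prices` is an illustrative helper that is not part of the notebook.

```python
import pandas as pd

def summarize_low_prices(df: pd.DataFrame, threshold: int = 500) -> dict:
    """Count placeholder (price == 1) and other below-threshold listings."""
    low = df['price'] < threshold
    placeholder = df['price'] == 1
    return {
        'placeholder': int(placeholder.sum()),
        'other_low': int((low & ~placeholder).sum()),
        'share_low': float(low.mean()),  # fraction of all rows below threshold
    }

# Hypothetical stand-in for the real dataset
sample = pd.DataFrame({'price': [1, 1, 69, 9400, 25500, 5500, 1500, 14900]})
summary = summarize_low_prices(sample)
print(summary)
```

On the real data the same helper could be called as `summarize_low_prices(data)`.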

Here I add a manufacturer column to make it easier to compare different brands.

I want to do some analysis with model years later, so I'll look at the missing values in the model_year column.

In [23]:
# Add manufacturer column
data['manufacturer'] = data['model'].apply(lambda x: x.split()[0])

# filters for NaN values in model_year
no_year = data[data['model_year'].isnull()]

# looking at the number of vehicles with no model year for each manufacturer
no_year['manufacturer'].value_counts()

manufacturer
ford             882
chevrolet        726
toyota           386
honda            241
jeep             240
nissan           239
ram              229
gmc              171
dodge             97
subaru            94
hyundai           74
volkswagen        60
chrysler          58
kia               41
cadillac          27
bmw               21
buick             14
acura             12
mercedes-benz      7
Name: count, dtype: int64

I don't see a way to fill in these missing values without distorting the analysis. I'll leave these rows out of the model-year analysis, since they make up only about 7% of the dataset (3,619 of 51,525 rows).
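
One way to check that dropping these rows will not favor any brand is to compare the missing counts above against each manufacturer's total number of listings. The sketch below uses only counts copied from this notebook's own outputs (the per-manufacturer totals are printed later on), and covers just the four largest brands for brevity.

```python
# Counts copied from the value_counts() outputs in this notebook
missing_year = {'ford': 882, 'chevrolet': 726, 'toyota': 386, 'honda': 241}
total_listings = {'ford': 12672, 'chevrolet': 10611, 'toyota': 5445, 'honda': 3485}

# Share of each manufacturer's listings that lack a model year
missing_rate = {m: missing_year[m] / total_listings[m] for m in missing_year}

for manufacturer, rate in missing_rate.items():
    print(f'{manufacturer}: {rate:.1%}')

# Overall share, from data.info(): 47906 of 51525 rows have a model year
overall_rate = (51525 - 47906) / 51525
print(f'overall: {overall_rate:.1%}')
```

For these four brands the rate comes out close to the overall figure of about 7%, which suggests the missing years are spread fairly evenly and are not concentrated in one manufacturer.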

Checking for duplicates:

In [18]:
print(data.duplicated().value_counts())
print(data.duplicated().sum())

False    51525
Name: count, dtype: int64
0


Doesn't look like there are any duplicate entries.
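
Note that `duplicated()` with no arguments only flags rows that match on every column, so the same vehicle reposted on a different date would not be caught. A looser check is sketched below, using made-up rows and an assumed choice of identifying columns (`key_columns` is illustrative, not something the notebook defines).

```python
import pandas as pd

# Hypothetical listings: rows 0 and 1 describe the same vehicle, reposted later
listings = pd.DataFrame({
    'price': [9400, 9400, 5500],
    'model': ['bmw x5', 'bmw x5', 'honda accord'],
    'odometer': [145000.0, 145000.0, 110000.0],
    'date_posted': ['2018-06-23', '2018-07-30', '2019-02-07'],
})

# Exact duplicates: every column must match
exact_duplicates = int(listings.duplicated().sum())

# Assumed identifying columns; adjust to whatever defines "the same car"
key_columns = ['price', 'model', 'odometer']
# keep=False marks every member of a duplicate group, not just the repeats
likely_reposts = int(listings.duplicated(subset=key_columns, keep=False).sum())

print(exact_duplicates, likely_reposts)
```

Whether such near-duplicates should be dropped is a judgment call, since two different cars can legitimately share a price, model, and mileage.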

I'm interested in seeing how the top 3 manufacturers (Ford, Chevrolet, and Toyota) compare to the rest, since together they account for more than half of the entries in the dataset (28,728 of 51,525, or about 56%).

In [None]:
# Get the top 3 manufacturers
top_manufacturers = data['manufacturer'].value_counts().head(3).index.tolist()

# Filter for only top 3 manufacturers
filtered_top_data = data[data['manufacturer'].isin(top_manufacturers)]

# Filter for all other manufacturers
filtered_not_data = data[~data['manufacturer'].isin(top_manufacturers)]

data['manufacturer'].value_counts()

manufacturer
ford             12672
chevrolet        10611
toyota            5445
honda             3485
ram               3316
jeep              3281
nissan            3208
gmc               2378
subaru            1272
dodge             1255
hyundai           1173
volkswagen         869
chrysler           838
kia                585
cadillac           322
buick              271
bmw                267
acura              236
mercedes-benz       41
Name: count, dtype: int64
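
The share held by the top 3 manufacturers can be checked from the counts above. This sketch rebuilds the printed counts as a Series so it runs on its own; on the real data, `data['manufacturer'].value_counts()` would take its place.

```python
import pandas as pd

# Counts copied from the value_counts() output above
counts = pd.Series({
    'ford': 12672, 'chevrolet': 10611, 'toyota': 5445, 'honda': 3485,
    'ram': 3316, 'jeep': 3281, 'nissan': 3208, 'gmc': 2378,
    'subaru': 1272, 'dodge': 1255, 'hyundai': 1173, 'volkswagen': 869,
    'chrysler': 838, 'kia': 585, 'cadillac': 322, 'buick': 271,
    'bmw': 267, 'acura': 236, 'mercedes-benz': 41,
})

# Listings from the three largest manufacturers divided by all listings
top_3_share = counts.nlargest(3).sum() / counts.sum()
print(f'Top 3 share of listings: {top_3_share:.1%}')
```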

I'll take a look at the distribution of vehicle prices for the 3 manufacturers that show up most often in the dataset, compared to the rest of the manufacturers. I'm guessing that higher prices will show up slightly more often among the top 3 than in the rest of the dataset.

In [20]:
price_top_3 = px.histogram(
                            filtered_top_data,
                            x='price',
                            nbins=60,
                            title='Price Distribution of Ford, Toyota, and Chevrolet Vehicles',
                            labels={
                                'price': 'Price (USD)',
                                'count': 'Frequency'
                            })

price_top_3.update_xaxes(range=[0, 60000]) # Set x-axis range for clarity
price_top_3.update_layout(yaxis_title='Frequency')
price_top_3


In [21]:
price_other = px.histogram(
                            filtered_not_data,
                            x='price',
                            nbins=100,
                            title='Price Distribution of All Other Manufacturers',
                            labels={
                                'price': 'Price (USD)',
                                'count': 'Frequency'
                            })

price_other.update_xaxes(range=[0, 60000]) # Set x-axis range for clarity
price_other.update_layout(yaxis_title='Frequency')
price_other

Surprisingly, there isn't much of a difference between the top 3 brands and the rest of the brands. The most notable one is that prices in the 25-30k range show up more often among the top 3 brands than among the rest.
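
One caveat when reading the two histograms side by side: they use different bin counts (60 and 100) and the groups differ in size (28,728 top-3 listings versus 22,797 others), so bar heights are not directly comparable. Comparing the share of each group that falls in a price band avoids both issues. The sketch below uses made-up prices, and the `group` column is a hypothetical label standing in for the top-3 / other split.

```python
import pandas as pd

# Hypothetical prices; 'group' mimics the top-3 / other split used above
prices = pd.DataFrame({
    'group': ['top_3'] * 6 + ['other'] * 4,
    'price': [4000, 8000, 12000, 26000, 28000, 15000,
              3000, 7000, 11000, 27000],
})

# Flag listings in the 25-30k band, then average the flag within each group
# to get the fraction of that group's listings inside the band
in_band = prices['price'].between(25000, 30000)
share_in_band = in_band.groupby(prices['group']).mean()
print(share_in_band)
```

On the real data the same idea would apply to `filtered_top_data` and `filtered_not_data`, for example by taking `.between(25000, 30000).mean()` on each frame's `price` column.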

I'm also curious about the relationship between a vehicle's odometer reading and its price. Since there is an overwhelmingly large number of values, we'll look at this with a heatmap.

In [22]:
# Density heatmap of odometer vs. price; the query drops placeholder prices and extreme values
price_odometer = px.density_heatmap(
                                    data.query('2 <= price <= 60000 & odometer <= 200000'),  # Filter outliers
                                    x='odometer',
                                    y='price',
                                    nbinsx=25,
                                    nbinsy=25,
                                    color_continuous_scale='Viridis',
                                    labels={
                                        'odometer': 'Odometer (miles)',
                                        'price': 'Price (USD)'},
                                    title='Density of Odometer vs Price'
)

price_odometer

Looks like there's an inverse relationship between how a car is priced and the number of miles on its odometer. Listings are densest at around 100,000-140,000 miles and $5,000-$10,000, which suggests that this is when people consider selling their car.
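
The inverse relationship can also be summarized numerically by taking the median price within odometer bins. This is a rough sketch on made-up listings: the `cars` frame and the 50,000-mile bin width are assumptions for illustration, not values taken from the dataset.

```python
import pandas as pd

# Hypothetical listings, for illustration only
cars = pd.DataFrame({
    'odometer': [20000, 45000, 60000, 95000, 110000, 130000, 160000, 190000],
    'price':    [28000, 22000, 18000, 12000, 9000, 7500, 5000, 3500],
})

# 50,000-mile odometer bins up to 200,000 miles
bins = [0, 50000, 100000, 150000, 200000]
cars['odometer_bin'] = pd.cut(cars['odometer'], bins=bins)

# Median price per bin; observed=True skips bins that contain no listings
median_price = cars.groupby('odometer_bin', observed=True)['price'].median()
print(median_price)
```

If the medians fall steadily from one bin to the next on the real data, that would back up what the heatmap shows. The median is used here rather than the mean so that a few very expensive listings do not dominate a bin.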