In [23]:
import pandas as pd
import plotly.express as px

In [24]:
df = pd.read_csv("../vehicles_us.csv")

In [40]:
df.columns

Index(['price', 'model_year', 'model', 'condition', 'cylinders', 'fuel',
       'odometer', 'transmission', 'type', 'paint_color', 'is_4wd',
       'date_posted', 'days_listed'],
      dtype='object')

In [25]:
df.shape, df.head()

((51525, 13),
    price  model_year           model  condition  cylinders fuel  odometer  \
 0   9400      2011.0          bmw x5       good        6.0  gas  145000.0   
 1  25500         NaN      ford f-150       good        6.0  gas   88705.0   
 2   5500      2013.0  hyundai sonata   like new        4.0  gas  110000.0   
 3   1500      2003.0      ford f-150       fair        8.0  gas       NaN   
 4  14900      2017.0    chrysler 200  excellent        4.0  gas   80903.0   
 
   transmission    type paint_color  is_4wd date_posted  days_listed  
 0    automatic     SUV         NaN     1.0  2018-06-23           19  
 1    automatic  pickup       white     1.0  2018-10-19           50  
 2    automatic   sedan         red     NaN  2019-02-07           79  
 3    automatic  pickup         NaN     NaN  2019-03-22            9  
 4    automatic   sedan       black     NaN  2019-04-02           28  )

In [26]:
df.describe()

This gives us a basic summary of the numeric columns in our dataset. Here’s what stands out:

price:

Prices span a very wide range. The maximum sits far above the median, which points to a small number of very expensive outliers.

Some listings even show a price of 0, which doesn’t make sense and will need to be cleaned up.

model_year:

Covers a wide range of car ages, from classic cars to newer models.

There are some missing values here, so we’ll need to decide how to handle them later.

odometer:

Mileage varies a lot, with some very high readings.

The average and median aren’t too far apart though, which suggests the distribution isn’t heavily skewed.

days_listed:

This shows how long each car stayed listed for sale.

Some cars were listed for a very long time, possibly abandoned posts.

What to Keep an Eye On:
Clean up implausible values such as price = 0

Decide what to do with the missing model_year and odometer data

Possibly remove or adjust big outliers for better visuals and insights
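
One possible way to act on these points, sketched on a tiny hand-made frame rather than the real file. The sample rows, the `price > 0` rule, and the choice of per-model medians are all assumptions for illustration, not the cleaning this notebook commits to.

```python
import pandas as pd

# Hypothetical sample rows; only the column names come from the dataset.
sample = pd.DataFrame({
    "price": [9400, 0, 5500, 1500, 25500],
    "model": ["bmw x5", "ford f-150", "ford f-150", "ford f-150", "ford f-150"],
    "model_year": [2011.0, 2010.0, 2013.0, 2003.0, float("nan")],
    "odometer": [145000.0, 88705.0, float("nan"), 120000.0, 60000.0],
})

# Drop listings with a non-positive price (assumed to be invalid).
cleaned = sample[sample["price"] > 0].copy()

# Fill gaps with the median of the same model (an assumed strategy).
for col in ["model_year", "odometer"]:
    per_model_median = cleaned.groupby("model")[col].transform("median")
    cleaned[col] = cleaned[col].fillna(per_model_median)

print(cleaned)
```

Grouping by `model` keeps the filled values plausible for each vehicle, but a model with no known values at all would stay missing and need a fallback.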



In [27]:
df.isnull().sum()


price               0
model_year       3619
model               0
condition           0
cylinders        5260
fuel                0
odometer         7892
transmission        0
type                0
paint_color      9267
is_4wd          25953
date_posted         0
days_listed         0
dtype: int64
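
Raw counts are easier to judge as a share of the 51,525 rows. The sketch below reuses the counts and row total printed above, so it runs without the CSV.

```python
import pandas as pd

# Missing-value counts and row count copied from the outputs above.
n_rows = 51525
missing = pd.Series({
    "model_year": 3619,
    "cylinders": 5260,
    "odometer": 7892,
    "paint_color": 9267,
    "is_4wd": 25953,
})

# Share of rows missing each column, as a percentage, largest first.
missing_pct = (missing / n_rows * 100).round(1).sort_values(ascending=False)
print(missing_pct)
```

On the loaded frame, `df.isna().mean().mul(100).round(1)` should give the same figures. About half of the `is_4wd` values are missing, and the rows shown earlier contain only 1.0 or NaN in that column, so NaN may simply mean "not 4WD"; that reading is a guess that would need checking.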

In [28]:
df.duplicated().sum()


np.int64(0)

In [29]:
df.columns.tolist()


['price',
 'model_year',
 'model',
 'condition',
 'cylinders',
 'fuel',
 'odometer',
 'transmission',
 'type',
 'paint_color',
 'is_4wd',
 'date_posted',
 'days_listed']

In [30]:
df['fuel'].value_counts()
df['condition'].value_counts(dropna=False)
df['transmission'].unique()


array(['automatic', 'manual', 'other'], dtype=object)
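
A notebook cell only echoes its last expression, which is why the cell above shows the transmission values but not the two `value_counts()` results. A minimal sketch of printing every summary explicitly, using hypothetical sample rows:

```python
import pandas as pd

# Hypothetical sample rows; only the column names come from the dataset.
sample = pd.DataFrame({
    "fuel": ["gas", "gas", "diesel"],
    "condition": ["good", "fair", "good"],
    "transmission": ["automatic", "manual", "automatic"],
})

# Build one frequency table per column, then print each one explicitly.
summaries = {col: sample[col].value_counts(dropna=False) for col in sample.columns}
for col, counts in summaries.items():
    print(f"--- {col} ---")
    print(counts.to_string())
```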

In [31]:
px.box(df, y='price', title='Boxplot of Price').show()
px.box(df, y='odometer', title='Boxplot of Odometer').show()


Boxplots for Price and Odometer:
We used boxplots to check for outliers and overall spread in the price and odometer columns:

Price Boxplot:

There's a wide range of prices, with quite a few extreme values on the high end.

Some listings are marked at $0 or very low, which could be errors, placeholders, or scams.

High outliers might skew the average price, so cleaning or capping may be needed for clearer analysis.

Odometer Boxplot:

Most odometer readings fall within a normal range, but there are a few vehicles with very high mileage.

Unlike price, the odometer values seem more reasonable overall, but we’ll still keep an eye out for values that don’t make sense.

These boxplots help highlight where the data might be misleading or messy—and point out values we may want to filter out or adjust before digging deeper.
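
The points a boxplot draws as outliers are usually those beyond 1.5 times the interquartile range from the quartiles. The helper below is a sketch of that rule; `iqr_fences` is a hypothetical name, the prices are made up, and pandas' default quantile interpolation may differ slightly from the one Plotly uses.

```python
import pandas as pd

def iqr_fences(values: pd.Series, k: float = 1.5) -> tuple:
    """Return (lower, upper) fences placed k * IQR beyond the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return float(q1 - k * iqr), float(q3 + k * iqr)

# Hypothetical prices, chosen so one value sits far above the rest.
prices = pd.Series([1000, 2000, 3000, 4000, 100000])
lower, upper = iqr_fences(prices)

# Values outside the fences are the candidates for review.
outliers = prices[(prices < lower) | (prices > upper)]
print(lower, upper, outliers.tolist())
```

Flagged values are candidates for inspection, not automatic deletions: an expensive listing can be perfectly genuine.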

In [32]:
px.histogram(df, x='price', nbins=100, title='Vehicle Price Distribution').show()


Vehicle Price Distribution:
The histogram above shows the distribution of vehicle prices across the dataset:

Most listings fall under $20,000, with a large concentration between $5,000 and $15,000.

A noticeable number of vehicles are listed at or near $0, which could indicate:

Data entry errors

Placeholder values

Listings with missing or removed pricing

The long tail on the right side suggests a few high-end or luxury vehicles with significantly higher prices, which may skew the average.

To get clearer insights in later steps, we may want to:

Filter out or flag unrealistic prices (like 0)

Consider using log-scaled plots or boxplots to handle the skewed distribution
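
A sketch of both follow-ups on made-up prices. The `MIN_PRICE` cutoff and the `price_suspect` and `log10_price` column names are assumptions introduced here, not values taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical prices, including a zero and one very expensive listing.
sample = pd.DataFrame({"price": [0, 1, 5000, 15000, 375000]})

MIN_PRICE = 500  # assumed cutoff for a plausible listing

# Flag rather than drop, so suspicious rows stay available for inspection.
sample["price_suspect"] = sample["price"] < MIN_PRICE

# log10 is only defined for positive prices, so transform plausible rows only.
plausible = sample.loc[~sample["price_suspect"]].copy()
plausible["log10_price"] = np.log10(plausible["price"])
print(plausible)
```

Plotting `log10_price` instead of `price` compresses the long right tail, so the bulk of the listings is no longer squeezed into the first few bins.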

In [33]:
px.scatter(df, x='odometer', y='price', color='type', title='Price vs. Odometer by Type').show()


Price vs. Odometer by Vehicle Type:
This scatter plot shows how vehicle price changes in relation to odometer readings, with points colored by vehicle type:

There’s a clear downward trend: vehicles with higher mileage tend to be cheaper, which is expected.

A few cars with very high odometer values are still listed at high prices—these may be data issues or rare cases (e.g., vintage or restored vehicles).

The color grouping helps visualize how different vehicle types behave:

Trucks and SUVs seem to retain value better at higher mileage than other types.

Sedans and coupes cluster more densely in the lower price ranges.

This plot helps confirm the importance of odometer as a predictive feature for price and highlights which vehicle types may hold value longer.
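
Claims such as "trucks retain value better" are hard to judge from a dense scatter plot alone. One way to check them is to compare median prices per vehicle type within mileage bands. The rows and band edges below are illustrative assumptions.

```python
import pandas as pd

# Hypothetical rows; only the column names come from the dataset.
sample = pd.DataFrame({
    "type": ["sedan", "sedan", "sedan", "pickup", "pickup", "pickup"],
    "odometer": [20000, 90000, 160000, 30000, 80000, 170000],
    "price": [15000, 8000, 3000, 30000, 20000, 12000],
})

# Assumed mileage bands; the edges are illustrative.
bands = pd.cut(
    sample["odometer"],
    bins=[0, 50000, 100000, 200000],
    labels=["0-50k", "50k-100k", "100k-200k"],
)

# Median price per vehicle type (rows) and mileage band (columns).
median_price = (
    sample.groupby(["type", bands], observed=True)["price"]
    .median()
    .unstack()
)
print(median_price)
```

Reading across a row shows how quickly one type loses value with mileage, while reading down a column compares types at similar mileage.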

In [34]:
px.box(df, x='condition', y='price', title='Price Distribution by Condition').show()


Price Distribution by Vehicle Condition:
This boxplot shows how vehicle price varies across different condition categories:

As expected, better condition generally corresponds with a higher median price.

Vehicles marked as "like new" or "excellent" tend to have higher price ranges, though there’s still overlap with lower conditions.

Outliers are present in every condition category—including very high prices for even "fair" or "salvage" listings—which may indicate:

Overpriced listings

Data entry issues

Niche or collector vehicles

Interestingly, the pricing difference between some categories (e.g., "good" vs. "excellent") is not very wide, which may reflect subjective labeling by sellers.

This visualization helps us see how vehicle condition impacts pricing, though it also reminds us that condition alone doesn’t fully explain price variance.
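
Comparing conditions is easier when the categories follow a quality ranking rather than the order in which they happen to appear. The sketch below uses an ordered categorical. `CONDITION_ORDER` lists only the labels visible in the rows shown earlier, and the ranking itself is an assumption.

```python
import pandas as pd

# Condition labels seen in the rows above, ordered worst to best (assumed ranking).
# The full dataset may contain other labels; add them before relying on this.
CONDITION_ORDER = ["salvage", "fair", "good", "excellent", "like new"]

# Hypothetical rows; only the column names and labels come from the dataset.
sample = pd.DataFrame({
    "condition": ["good", "like new", "fair", "excellent", "good", "salvage"],
    "price": [9400, 5500, 1500, 14900, 7455, 2700],
})

sample["condition"] = pd.Categorical(
    sample["condition"], categories=CONDITION_ORDER, ordered=True
)

# With an ordered categorical, the grouped result follows the quality ranking.
median_by_condition = sample.groupby("condition", observed=True)["price"].median()
print(median_by_condition)
```

The same list can be passed to the plot as `category_orders={'condition': CONDITION_ORDER}` so the boxes appear in ranked order. Any label missing from the list becomes NaN in the categorical column, so the list should be checked against `df['condition'].unique()` first.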



In [35]:
# Keep only the numeric columns (covers every int and float dtype)
numeric_cols = df.select_dtypes(include='number')

# Compute the correlation matrix
correlation_matrix = numeric_cols.corr()

# Create a heatmap
fig = px.imshow(
    correlation_matrix,
    text_auto=True,
    color_continuous_scale='RdBu_r',
    title="🔍 Correlation Heatmap of Numeric Features")

fig.show()

Correlation Heatmap:
The heatmap above shows how different numeric features in the dataset are correlated with each other:

price is weakly correlated with most features, but may show small trends with odometer and model_year.

odometer is negatively correlated with model_year, which makes sense—newer cars tend to have less mileage.

Most variables show only mild correlations, which suggests that the data may require feature engineering or deeper categorical analysis for stronger predictive power.

This gives us a good starting point for identifying which features could be useful in modeling or deeper visual exploration.
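
A heatmap shows every pair at once, but for modeling the column of interest is usually the correlation with price. The sketch below pulls out that column under both Pearson and Spearman definitions. Spearman works on ranks, so it is less sensitive to the price outliers noted earlier. The rows are hypothetical.

```python
import pandas as pd

# Hypothetical numeric rows; only the column names come from the dataset.
sample = pd.DataFrame({
    "price": [25000, 18000, 12000, 7000, 3000],
    "model_year": [2018, 2016, 2013, 2009, 2003],
    "odometer": [20000, 45000, 90000, 140000, 210000],
    "days_listed": [19, 50, 79, 9, 28],
})

numeric = sample.select_dtypes(include="number")

# Correlation of every numeric column with price, under two definitions.
price_corr = pd.DataFrame({
    "pearson": numeric.corr(method="pearson")["price"],
    "spearman": numeric.corr(method="spearman")["price"],
}).drop(index="price")
print(price_corr.round(2))
```

A large gap between the two columns for the same feature would hint that a few extreme values, or a non-linear relationship, are driving the Pearson figure.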

In [41]:
# Group by fuel type and calculate average price
avg_price_by_fuel = df.groupby('fuel')['price'].mean().reset_index()

# Sort for cleaner look
avg_price_by_fuel = avg_price_by_fuel.sort_values(by='price', ascending=False)

# Plot
fig = px.bar(
    avg_price_by_fuel,
    x='fuel',
    y='price',
    title='⛽ Average Vehicle Price by Fuel Type',
    color='price',
    color_continuous_scale='Tealrose'
)
fig.show()


Average Price by Fuel Type:
This bar chart displays the average price of vehicles based on their fuel type:

Electric and hybrid vehicles tend to have higher average prices, reflecting their newer technology and market demand.

Gasoline-powered vehicles make up the bulk of listings and show a wide range of prices.

Diesel may appear higher due to commercial/truck listings.

Lower-priced categories could include older or less efficient vehicle types.
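
A mean on its own can mislead when a fuel type has only a handful of listings or a few extreme prices. The sketch below reports the listing count and median next to the mean. The rows are made up, and the fuel labels other than "gas" are assumptions.

```python
import pandas as pd

# Hypothetical rows; the fuel labels other than "gas" are assumptions.
sample = pd.DataFrame({
    "fuel": ["gas", "gas", "gas", "diesel", "diesel", "hybrid"],
    "price": [5000, 7000, 30000, 20000, 24000, 9000],
})

# Count, mean and median price per fuel type, highest mean first.
fuel_summary = (
    sample.groupby("fuel")["price"]
    .agg(listings="count", mean_price="mean", median_price="median")
    .sort_values("mean_price", ascending=False)
)
print(fuel_summary)
```

In this toy data the gas mean is pulled well above its median by a single expensive listing, which is the kind of effect worth ruling out before interpreting the bar chart.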


In [42]:
# Group and calculate average price for the hierarchy
sunburst_data = df.groupby(['type', 'condition', 'fuel'])['price'].mean().reset_index()

# Build the sunburst chart
fig = px.sunburst(
    sunburst_data,
    path=['type', 'condition', 'fuel'],
    values='price',
    color='price',
    color_continuous_scale='RdBu',
    title='🌞 Average Price by Type → Condition → Fuel')

fig.show()

Average Price Breakdown: Type → Condition → Fuel
This sunburst chart gives a layered view of average vehicle prices across three categories:

Inner layer (type) shows the overall vehicle types (e.g., SUV, truck, sedan)

Middle layer (condition) breaks down the condition of each type

Outer layer (fuel) shows the type of fuel used

It’s a great way to see how price trends shift based on vehicle category and quality. For example:

Some "like new" hybrids may show very high average prices.

Older or "salvage" gas cars will likely cluster at the lower end.

Clicking a sector zooms into that branch, which makes the chart well suited to interactive exploration.

This chart closes the visual section of the EDA with a view that combines all three categorical features.
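
Because `values='price'` is fed the group means, sector sizes in the chart above reflect mean prices rather than how many listings fall in each group. An alternative worth considering is to size sectors by listing count and use the mean price for color only. The sketch below prepares such a table on hypothetical rows. The `listings` and `avg_price` column names are introduced here for illustration.

```python
import pandas as pd

# Hypothetical rows; only the column names come from the dataset.
sample = pd.DataFrame({
    "type": ["SUV", "SUV", "SUV", "sedan", "sedan"],
    "condition": ["good", "good", "excellent", "good", "good"],
    "fuel": ["gas", "gas", "gas", "gas", "hybrid"],
    "price": [9000, 11000, 20000, 5000, 7000],
})

# One row per leaf of the hierarchy, with its listing count and mean price.
sunburst_counts = (
    sample.groupby(["type", "condition", "fuel"])["price"]
    .agg(listings="count", avg_price="mean")
    .reset_index()
)
print(sunburst_counts)
```

Passing this frame to `px.sunburst` with `values='listings'` and `color='avg_price'` should then size each sector by the number of listings and shade it by average price.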