## Exploratory Data Analysis: Used Car Advertisements

This notebook performs exploratory data analysis (EDA) on a dataset of used car listings in the United States. The goal is to understand the structure of the dataset, uncover patterns, handle missing values thoughtfully, and generate insights that can be visualized in a Streamlit web app.

## 📌 Objectives:
- Load and inspect the dataset
- Clean and preprocess the data, including handling missing values
- Explore key variables (e.g., price, odometer, condition)
- Visualize relationships between features using Plotly Express

In [57]:
import pandas as pd
import plotly.express as px

df = pd.read_csv(r'C:\Users\Sbeki\OneDrive\Desktop\Codes\software-engineering-tasks\vehicles_us.csv')  


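df.info()                # prints column dtypes and non-null counts
display(df.describe())   # display() is provided by Jupyter/IPython; a bare expression only renders when it is last in the cell
display(df.head())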

# Fill missing values in the dataset.
# Assumption: a missing is_4wd flag means the vehicle is not 4WD.
df['is_4wd'] = df['is_4wd'].fillna(0).astype('bool')
df['paint_color'] = df['paint_color'].fillna('unknown')
# Fill numeric gaps with the median of a related group:
# cylinders by vehicle type, model_year by model, odometer by model_year.
df['cylinders'] = df.groupby('type')['cylinders'].transform(lambda x: x.fillna(x.median()))
df['model_year'] = df.groupby('model')['model_year'].transform(lambda x: x.fillna(x.median()))
df['odometer'] = df.groupby('model_year')['odometer'].transform(lambda x: x.fillna(x.median()))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51525 entries, 0 to 51524
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   price         51525 non-null  int64  
 1   model_year    47906 non-null  float64
 2   model         51525 non-null  object 
 3   condition     51525 non-null  object 
 4   cylinders     46265 non-null  float64
 5   fuel          51525 non-null  object 
 6   odometer      43633 non-null  float64
 7   transmission  51525 non-null  object 
 8   type          51525 non-null  object 
 9   paint_color   42258 non-null  object 
 10  is_4wd        25572 non-null  float64
 11  date_posted   51525 non-null  object 
 12  days_listed   51525 non-null  int64  
dtypes: float64(4), int64(2), object(7)
memory usage: 5.1+ MB
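
One caveat with grouped medians: if every row in a group is missing the value, that group's median is itself `NaN`, so those rows stay unfilled. Below is a minimal sketch of a helper that falls back to the overall median in that case. The name `fill_with_group_median` and the toy data are hypothetical and not part of the notebook above.

```python
import pandas as pd


def fill_with_group_median(frame, column, by):
    """Fill NaNs in `column` with the median of each `by` group,
    falling back to the overall median when a group has no values."""
    group_median = frame.groupby(by)[column].transform("median")
    filled = frame[column].fillna(group_median)
    return filled.fillna(frame[column].median())


# Toy data (made up for illustration): every 'bus' row lacks a cylinder count.
toy = pd.DataFrame({
    "type": ["sedan", "sedan", "sedan", "bus", "bus"],
    "cylinders": [4.0, None, 6.0, None, None],
})
toy["cylinders"] = fill_with_group_median(toy, "cylinders", "type")
print(toy)
```

Whether the overall median is a sensible fallback depends on the column; checking `df.isna().sum()` after imputation shows whether any gaps remain.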


## Histograms of `price`, `model_year`, and `odometer`

In [58]:
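# Call .show() on each figure; otherwise only the last expression in the cell renders.
px.histogram(df, x='price', nbins=50, title='Distribution of Car Prices').show()

px.histogram(df, x='model_year', title='Distribution of Model Years').show()

px.histogram(df, x='odometer', title='Distribution of Odometer Readings').show()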


## Scatterplots: `price` vs. `odometer`, `price` vs. `model_year`

In [59]:
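# Call .show() on each figure; otherwise only the last expression in the cell renders.
px.scatter(df, x='odometer', y='price', color='condition', title='Price vs. Odometer by Condition').show()

px.scatter(df, x='model_year', y='price', title='Price vs. Model Year').show()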

## Distributions of Key Features

We'll now explore the distributions of some key features: price, odometer reading, and model year.


In [60]:

fig = px.histogram(df, x='price', nbins=50, title='Distribution of Car Prices')
fig.show()

In [61]:
fig = px.histogram(df, x='odometer', nbins=50, title='Distribution of Vehicle Odometer Readings')
fig.show()


In [62]:
fig = px.histogram(df, x='model_year', nbins=30, title='Distribution of Model Years')
fig.show()
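
If a long right tail stretches the x-axis and squeezes most listings into a few bins, one option is to trim the upper tail before plotting. The sketch below shows the idea on made-up prices. The helper `trim_upper_tail` and the quantile cutoff are assumptions for illustration, not a step the notebook performs.

```python
import pandas as pd


def trim_upper_tail(frame, column, quantile=0.99):
    """Return rows whose `column` value is at or below the given quantile.
    Rows where `column` is NaN are dropped as well."""
    cutoff = frame[column].quantile(quantile)
    return frame[frame[column] <= cutoff]


# Toy prices (made up): one extreme listing among ordinary ones.
toy = pd.DataFrame({"price": [4000, 5500, 6000, 7200, 8000, 9500, 375000]})
trimmed = trim_upper_tail(toy, "price", quantile=0.9)
print(len(toy), "->", len(trimmed))
```

On the real data this could be used as `px.histogram(trim_upper_tail(df, 'price'), x='price', nbins=50)`. Trimming only changes the plot's input and leaves `df` untouched.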

## Price by Odometer and Model Year

Let's explore how price varies with odometer reading and model year.

In [63]:
fig = px.scatter(df, x='odometer', y='price', color='condition',
                 title='Price vs. Odometer by Vehicle Condition')
fig.show()

In [64]:
fig = px.scatter(df, x='model_year', y='price', color='type',
                 title='Price vs. Model Year by Vehicle Type')
fig.show()
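
Scatterplots show the direction of a relationship but not its strength. A rank correlation is one way to put a number on it. The sketch below computes Spearman's coefficient as the Pearson correlation of ranks, using pandas only. The helper name `rank_correlation` and the toy values are hypothetical.

```python
import pandas as pd


def rank_correlation(frame, x, y):
    """Spearman correlation, computed as the Pearson correlation of ranks.
    Rows missing either value are dropped before ranking."""
    pair = frame[[x, y]].dropna()
    return pair[x].rank().corr(pair[y].rank())


# Toy data (made up): price falls steadily as mileage rises.
toy = pd.DataFrame({
    "odometer": [20000, 60000, None, 120000, 180000],
    "price": [21000, 15000, 9000, 8000, 3500],
})
rho = rank_correlation(toy, "odometer", "price")
print(round(rho, 2))
```

A value near -1 indicates that price falls consistently as the other variable rises, and a value near 0 indicates no monotonic trend. Applied to the notebook's data, the call would be `rank_correlation(df, 'odometer', 'price')`.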

In [65]:
avg_price_color = df.groupby('paint_color')['price'].mean().reset_index()

fig = px.bar(avg_price_color, x='paint_color', y='price',
             title='Average Car Price by Paint Color')
fig.show()
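
A mean per color can be skewed by a few expensive listings, especially for colors with few rows. As a hedged alternative, the sketch below reports the median and the number of listings next to the mean. The toy data and the column names `mean_price`, `median_price`, and `listings` are made up for illustration.

```python
import pandas as pd

# Toy data (made up): a single expensive orange car dominates its group mean.
toy = pd.DataFrame({
    "paint_color": ["white", "white", "white", "orange", "unknown", "unknown"],
    "price": [5000, 7000, 9000, 45000, 6000, 8000],
})

# Named aggregation: one output column per statistic.
color_summary = (
    toy.groupby("paint_color")["price"]
    .agg(mean_price="mean", median_price="median", listings="count")
    .reset_index()
    .sort_values("median_price", ascending=False)
)
print(color_summary)
```

Plotting `median_price` instead of `price`, and showing `listings` on hover, would make it easier to tell a real price difference from a small-sample effect.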

## Overall rundown

- Five columns contain missing values: `model_year`, `cylinders`, `odometer`, `paint_color`, and `is_4wd`. We filled `cylinders`, `model_year`, and `odometer` with grouped medians, set missing `paint_color` to `'unknown'`, and treated missing `is_4wd` as `False`.
- Most listings cluster at the lower end of the price range, with a few high-priced outliers.
- Price tends to fall as odometer readings rise, which is consistent with expected depreciation.
- Newer model years generally fetch higher prices, although vehicle type and condition also affect price.
- The data and visualizations prepared here can be translated into an interactive Streamlit dashboard for user exploration.
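
As a starting point for the dashboard, the filtering step can be kept as a plain pandas function that the app feeds with widget values. The sketch below assumes a hypothetical helper `filter_listings`. The Streamlit wiring is indicated only in comments and is not taken from an existing app.

```python
import pandas as pd


def filter_listings(frame, price_range, conditions):
    """Keep rows within an inclusive price range and a set of conditions."""
    low, high = price_range
    mask = frame["price"].between(low, high) & frame["condition"].isin(conditions)
    return frame[mask]


# Toy listings (made up for illustration).
toy = pd.DataFrame({
    "price": [3000, 12000, 25000, 40000],
    "condition": ["fair", "good", "excellent", "like new"],
})
subset = filter_listings(toy, price_range=(5000, 30000), conditions=["good", "excellent"])
print(subset)

# In a Streamlit script, the arguments could come from widgets such as
# st.slider (for the price range) and st.multiselect (for the conditions),
# and the resulting figure could be rendered with st.plotly_chart.
```

Keeping the filter separate from the widgets makes it possible to test it on small frames like the one above without launching the app.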
