In [26]:
%pip install pandas streamlit plotly

import pandas as pd
import streamlit as st
import plotly.express as px




This notebook explores and preprocesses the `vehicles_us.csv` dataset of used-vehicle advertisements. It fills missing values, plots the price distribution and the relationship between odometer reading and price, and prepares the data for a Streamlit web application.

In [27]:
df = pd.read_csv("vehicles_us.csv")

In [28]:
# Title for the Streamlit app
st.title("Exploratory Data Analysis of Used Vehicles")
st.header("Interactive Visualizations with Streamlit")



DeltaGenerator()

In [29]:
st.subheader("Dataset Overview")
st.write(df.head())  # Show the first 5 rows of the dataset
st.write("Shape of the dataset:", df.shape)



In [21]:
# Step 1: Check for missing values
st.subheader("Missing Values Before Filling")
st.write(df.isnull().sum())

No duplicates found.
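
The cell above counts missing values per column, while its "No duplicates found." output suggests a duplicate check was also run at some point. Both checks can be sketched as follows. This is a minimal illustration on made-up rows standing in for `vehicles_us.csv`; `summarize_quality` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
import pandas as pd


def summarize_quality(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages (hypothetical helper)."""
    missing = frame.isnull().sum()
    return pd.DataFrame({
        "missing": missing,
        "missing_pct": (missing / len(frame) * 100).round(1),
    })


# Toy data; the values are invented for illustration only.
toy = pd.DataFrame({
    "model": ["ford f-150", "ford f-150", "honda civic", "honda civic"],
    "model_year": [2011.0, 2011.0, None, 2015.0],
    "price": [9400, 9400, 5500, 7200],
})

# duplicated() flags every row that repeats an earlier row exactly.
duplicate_count = int(toy.duplicated().sum())
quality = summarize_quality(toy)

print("Duplicate rows:", duplicate_count)
print(quality)
```

In the app itself, the same values could be passed to `st.write` instead of `print`.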


In [30]:
# Step 2: Handle missing values
# Fill 'model_year' with the median grouped by 'model'
df['model_year'] = df['model_year'].fillna(df.groupby('model')['model_year'].transform('median'))

# Fill 'odometer' with the median grouped by 'model' and 'model_year'
df['odometer'] = df['odometer'].fillna(df.groupby(['model', 'model_year'])['odometer'].transform('median'))

# Fill 'is_4wd' with 0 (assuming missing means not 4WD)
df['is_4wd'] = df['is_4wd'].fillna(0)

# Fill 'paint_color' with 'Unknown'
df['paint_color'] = df['paint_color'].fillna('Unknown')

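# Fill 'cylinders' with the mode (most frequent value) within each 'fuel' group.
# Groups with no known cylinder count stay NaN.
def group_mode(values):
    modes = values.mode()
    return modes.iloc[0] if not modes.empty else float('nan')

df['cylinders'] = df['cylinders'].fillna(df.groupby('fuel')['cylinders'].transform(group_mode))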

# Step 3: Verify that missing values were handled
st.subheader("Missing Values After Filling")
st.write(df.isnull().sum())
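
A grouped fill can still leave gaps: if every row in a group is missing the value, the group median is itself NaN. One way to handle that, sketched here on made-up data, is to fall back to the overall median. The helper name `fill_with_group_median` and the fallback choice are assumptions for illustration, not part of the original cell.

```python
import pandas as pd


def fill_with_group_median(frame, column, group_cols):
    """Fill NaNs with the group median, then fall back to the overall median."""
    group_median = frame.groupby(group_cols)[column].transform("median")
    filled = frame[column].fillna(group_median)
    # Rows whose whole group was NaN are still missing here.
    return filled.fillna(frame[column].median())


# Toy data: model "b" has no known model_year at all.
toy = pd.DataFrame({
    "model": ["a", "a", "a", "b", "c", "c"],
    "model_year": [2010.0, 2012.0, None, None, 2018.0, 2018.0],
})

toy["model_year"] = fill_with_group_median(toy, "model_year", "model")
remaining_missing = int(toy["model_year"].isnull().sum())

print(toy)
print("Remaining missing values:", remaining_missing)
```

Whether an overall median is an acceptable fallback depends on the column; for some fields it may be better to leave the value missing or drop the row.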



In [31]:

# Step 4: Histogram
st.subheader("Histogram: Distribution of Vehicle Prices")
fig1 = px.histogram(df, x="price", nbins=50, title="Distribution of Vehicle Prices")
st.plotly_chart(fig1)






DeltaGenerator()
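
A few very expensive listings can stretch the x-axis of a price histogram so that most bins become hard to read. A possible remedy, sketched below on invented prices, is to trim the upper tail at a chosen quantile before plotting. The helper `trim_upper_tail` and the quantile value are assumptions, not part of the original analysis.

```python
import pandas as pd


def trim_upper_tail(frame, column, quantile=0.99):
    """Keep rows at or below the given quantile of `column` (hypothetical helper)."""
    cutoff = frame[column].quantile(quantile)
    return frame[frame[column] <= cutoff], cutoff


# Toy prices with one extreme listing; values are made up.
toy = pd.DataFrame({
    "price": [4000, 6500, 9000, 12000, 15000,
              18000, 21000, 24000, 27000, 375000],
})

trimmed, cutoff = trim_upper_tail(toy, "price", quantile=0.9)

print("Cutoff:", cutoff)
print("Rows kept:", len(trimmed), "of", len(toy))
```

The trimmed frame can then be passed to `px.histogram` in place of `df`; the untrimmed data stays available for the summary statistics.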

In [32]:
st.subheader("Scatter Plot: Odometer vs. Price")
fig2 = px.scatter(df, x="odometer", y="price", title="Odometer vs. Price", color="condition", hover_data=["model", "model_year"])
st.plotly_chart(fig2)



DeltaGenerator()

In [35]:

if st.checkbox("Show Dataset Summary"):
    st.subheader("Dataset Summary")
    st.write(df.describe())








In [34]:
st.subheader("Conclusion")
st.write("""
- Most vehicles are priced under $25,000, with a few high-priced outliers.
- Odometer readings tend to correlate negatively with price (higher mileage, lower price).
- Condition is associated with price: vehicles in 'new' or 'like new' condition tend to be listed at higher prices.
""")




- Newer vehicles and those in better condition tend to be priced higher.
- Missing values were filled using group-level medians and modes, plus explicit defaults (`is_4wd` = 0, `paint_color` = 'Unknown').
- The dataset is now clean and ready for use in the Streamlit app.
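
The conclusions about mileage and condition are read off the plots. They could also be checked numerically, for example with a correlation coefficient and a per-condition median. The sketch below uses invented rows with the same column names as the dataset, so its numbers say nothing about the real data.

```python
import pandas as pd

# Toy data; the values are made up for illustration only.
toy = pd.DataFrame({
    "price": [30000, 22000, 15000, 9000, 4000],
    "odometer": [5000, 40000, 90000, 140000, 210000],
    "condition": ["like new", "excellent", "good", "good", "fair"],
})

# Pearson correlation between price and odometer; a negative value
# means higher mileage goes with lower price.
price_odometer_corr = toy["price"].corr(toy["odometer"])

# Median price per condition, highest first.
median_price_by_condition = (
    toy.groupby("condition")["price"].median().sort_values(ascending=False)
)

print("Correlation (price vs. odometer):", round(price_odometer_corr, 3))
print(median_price_by_condition)
```

Running the same two computations on the cleaned `df` would show how strongly the real data supports each statement.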