# 🚗 Final Combined EDA Notebook

## 🔹 Introduction and Auto-Cleaned EDA

This notebook performs exploratory data analysis (EDA) on a dataset of US vehicle listings. We clean the data, check for duplicates and missing values, create visualizations, and draw conclusions that support a Streamlit dashboard application.

In [None]:
import pandas as pd
import plotly.express as px

# Load dataset
df = pd.read_csv('../vehicles_us.csv')
df.head()

## 🔍 Duplicate Records

In [None]:
duplicates = df.duplicated().sum()
print(f'Duplicates found: {duplicates}')
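
The cell above only counts exact duplicate rows. If the count is nonzero, one way to handle them is to drop the repeats and keep the first occurrence. A minimal sketch on a toy frame (the rows below are made up for illustration and are not taken from `vehicles_us.csv`):

```python
import pandas as pd

# Toy data standing in for the listings; values are illustrative only.
toy = pd.DataFrame({
    'model': ['ford f-150', 'ford f-150', 'honda civic'],
    'price': [15000, 15000, 7000],
})

# Rows that are identical to an earlier row in every column
n_duplicates = int(toy.duplicated().sum())

# Keep the first occurrence of each row and renumber the index
deduped = toy.drop_duplicates().reset_index(drop=True)

print(f'Duplicates found: {n_duplicates}')
print(f'Rows after dropping duplicates: {len(deduped)}')
```

Whether dropping is appropriate depends on the data: two genuinely separate listings could share every recorded field, so it is worth inspecting the flagged rows first.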

## 🔍 Missing Values Overview

In [None]:
df.isna().sum()
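
Raw counts are easier to judge next to the share of rows they represent. A small sketch, using a toy frame with made-up values, of reporting missing values as a percentage per column:

```python
import numpy as np
import pandas as pd

# Toy data; the real notebook would use df instead.
toy = pd.DataFrame({
    'cylinders': [4.0, np.nan, 6.0, np.nan],
    'price': [1000, 2000, 3000, 4000],
})

# isna() gives booleans, so the column mean is the fraction missing
missing_share = toy.isna().mean().mul(100).round(1)
print(missing_share)
```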

## 🔧 Filling Missing `cylinders` Using GroupBy Median by Model and Model Year

In [None]:
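# Fill with the median cylinder count of each (model, model_year) group
group_median = df.groupby(['model', 'model_year'])['cylinders'].transform('median')
df['cylinders'] = df['cylinders'].fillna(group_median)
df['cylinders'].isna().sum()  # values still missing after the fill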
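
A group-median fill can leave gaps: if every row in a (model, model year) group lacks a cylinder count, the group median is itself missing. One possible fallback, shown here as a sketch on toy data rather than as the notebook's method, is a second pass using the median per model alone:

```python
import numpy as np
import pandas as pd

# Toy data: model 'a' from 2012 has no known cylinder value at all.
toy = pd.DataFrame({
    'model': ['a', 'a', 'a', 'b', 'b'],
    'model_year': [2010, 2010, 2012, 2015, 2015],
    'cylinders': [4.0, np.nan, np.nan, 6.0, 8.0],
})

# Pass 1: median within each (model, model_year) group
by_model_year = toy.groupby(['model', 'model_year'])['cylinders'].transform('median')
toy['cylinders'] = toy['cylinders'].fillna(by_model_year)
remaining_after_pass_1 = int(toy['cylinders'].isna().sum())

# Pass 2 (assumed fallback): median within each model, ignoring the year
by_model = toy.groupby('model')['cylinders'].transform('median')
toy['cylinders'] = toy['cylinders'].fillna(by_model)
remaining_after_pass_2 = int(toy['cylinders'].isna().sum())

print(remaining_after_pass_1, remaining_after_pass_2)
```

The coarser fallback trades accuracy for coverage, so it is worth checking how many rows it touches before relying on it.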

In [None]:
df['vehicle_age'] = 2025 - df['model_year']
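
The hard-coded `2025` measures age relative to a fixed year, not relative to when each vehicle was listed. If the dataset has a listing-date column, age at listing time can be derived from it instead. The sketch below assumes a hypothetical column named `date_posted` holding ISO-formatted dates; that name and the sample values are assumptions, not something shown in this notebook:

```python
import pandas as pd

# Toy data; 'date_posted' is a hypothetical column name.
toy = pd.DataFrame({
    'model_year': [2010.0, 2015.0, None],
    'date_posted': ['2018-06-23', '2019-02-07', '2018-10-19'],
})

# Year in which each listing was posted
posted_year = pd.to_datetime(toy['date_posted']).dt.year

# Age at listing time; stays NaN where model_year is missing
toy['vehicle_age'] = posted_year - toy['model_year']
print(toy)
```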

## 🔹 User's EDA Notebook Content

In [7]:
px.histogram(df, x='vehicle_age', nbins=30,
             title='Distribution of Vehicle Age',
             labels={'vehicle_age': 'Vehicle Age (Years)'})


In [8]:
px.histogram(df, x='days_listed', nbins=30,
             title='Distribution of Days Listed on Website',
             labels={'days_listed': 'Days Listed'})


In [9]:
px.scatter(df, x='model_year', y='price',
           title='Price vs Model Year',
           labels={'model_year': 'Model Year', 'price': 'Price ($)'})



In [10]:
px.scatter(df, x='odometer', y='price', color='condition',
           title='Price vs Odometer Colored by Condition',
           labels={'odometer': 'Mileage', 'price': 'Price ($)'})


## 🔹 Auto-Generated Visualizations & Final Summary

## 📊 Distribution of Vehicle Age

In [None]:
fig1 = px.histogram(df, x='vehicle_age', nbins=30, title='Distribution of Vehicle Age')
fig1.show()

**Conclusion**: Most listed vehicles are between 5 and 15 years old, indicating a used market dominated by mid-age cars.

## 📈 Price vs. Odometer by Condition

In [None]:
fig2 = px.scatter(df, x='odometer', y='price', color='condition', title='Price vs Odometer by Condition')
fig2.show()

**Conclusion**: Vehicles in better condition (excellent/good) tend to be priced higher, even at high mileage.
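
A scatter plot with many overlapping points can be hard to read, so a numeric summary helps back up a claim like this. A sketch comparing median price per condition, on made-up toy values:

```python
import pandas as pd

# Toy data; the real notebook would use df instead.
toy = pd.DataFrame({
    'condition': ['excellent', 'excellent', 'good', 'good', 'fair', 'fair'],
    'odometer': [60000, 150000, 80000, 160000, 120000, 200000],
    'price': [12000, 9000, 8000, 6000, 3000, 2500],
})

# Median price per condition, highest first
median_price = (
    toy.groupby('condition')['price']
       .median()
       .sort_values(ascending=False)
)
print(median_price)
```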

## 📊 Distribution of Days Listed

In [None]:
fig3 = px.histogram(df, x='days_listed', nbins=30, title='Distribution of Days Listed on Website')
fig3.show()

**Conclusion**: Most cars are listed for under 50 days, indicating fast turnover in listings.
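
The "under 50 days" reading comes from the histogram; the same statement can be checked directly. A sketch on toy values (the 50-day threshold is taken from the conclusion above, the data is made up):

```python
import pandas as pd

# Toy data; the real notebook would use df['days_listed'].
toy = pd.DataFrame({'days_listed': [5, 19, 33, 47, 60, 120]})

# Fraction of listings that stayed up for fewer than 50 days
share_under_50 = (toy['days_listed'] < 50).mean()

# Median listing duration as a single summary figure
median_days = toy['days_listed'].median()

print(f'Share listed under 50 days: {share_under_50:.1%}')
print(f'Median days listed: {median_days}')
```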

## ✅ Final Conclusion

We examined vehicle listings, checked for duplicate rows, and filled missing `cylinders` values using the median for each model and model year. We then visualized the vehicle age distribution, the number of days listed, and the relationship between price and mileage by condition. These insights inform the design of the Streamlit dashboard.