# Car Advertisement Data Analysis

## Project Description
This project explores a dataset of car advertisements. We will analyze various features such as car prices, model years, engine cylinders, and odometer readings. The goal is to clean the data, handle missing values, and create interactive visualizations using the Plotly Express library to gain insights into the data.


In [40]:
pip install streamlit

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [41]:
import pandas as pd
import streamlit as st
import plotly.express as px


Load the dataset and display the first few rows.

In [42]:
file_path = 'C:/Users/User/Downloads/car.veh.csv'
df = pd.read_csv(file_path)

df.head()

,price,model_year,model,condition,cylinders,fuel,odometer,transmission,type,paint_color,is_4wd,date_posted,days_listed
0,9400,2011.0,bmw x5,good,6.0,gas,145000.0,automatic,SUV,,1.0,2018-06-23,19
1,25500,,ford f-150,good,6.0,gas,88705.0,automatic,pickup,white,1.0,2018-10-19,50
2,5500,2013.0,hyundai sonata,like new,4.0,gas,110000.0,automatic,sedan,red,,2019-02-07,79
3,1500,2003.0,ford f-150,fair,8.0,gas,,automatic,pickup,,,2019-03-22,9
4,14900,2017.0,chrysler 200,excellent,4.0,gas,80903.0,automatic,sedan,black,,2019-04-02,28


Summary of the DataFrame's structure

In [43]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51525 entries, 0 to 51524
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   price         51525 non-null  int64  
 1   model_year    47906 non-null  float64
 2   model         51525 non-null  object 
 3   condition     51525 non-null  object 
 4   cylinders     46265 non-null  float64
 5   fuel          51525 non-null  object 
 6   odometer      43633 non-null  float64
 7   transmission  51525 non-null  object 
 8   type          51525 non-null  object 
 9   paint_color   42258 non-null  object 
 10  is_4wd        25572 non-null  float64
 11  date_posted   51525 non-null  object 
 12  days_listed   51525 non-null  int64  
dtypes: float64(4), int64(2), object(7)
memory usage: 5.1+ MB
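
The summary above shows `date_posted` stored as `object` (plain strings). A minimal sketch of parsing it into a datetime column, using a toy frame built from rows of the `head()` output rather than the full `df`. The `date_removed` column is a hypothetical illustration and assumes `days_listed` counts days from posting until the ad was taken down.

```python
import pandas as pd

# Toy rows copied from the head() output; the notebook itself would use df.
sample = pd.DataFrame({
    "date_posted": ["2018-06-23", "2018-10-19", "2019-02-07"],
    "days_listed": [19, 50, 79],
})

# Parse the ISO-formatted strings into datetime64 values.
sample["date_posted"] = pd.to_datetime(sample["date_posted"], format="%Y-%m-%d")

# Hypothetical derived column (assumption: days_listed runs from the posting date).
sample["date_removed"] = sample["date_posted"] + pd.to_timedelta(sample["days_listed"], unit="D")

print(sample.dtypes)
```

Once the column is a datetime, accessors such as `sample["date_posted"].dt.month` become available for grouping by period.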


Descriptive statistics for all columns (numerical and categorical)

In [44]:
df.describe(include='all')

,price,model_year,model,condition,cylinders,fuel,odometer,transmission,type,paint_color,is_4wd,date_posted,days_listed
count,51525.0,47906.0,51525,51525,46265.0,51525,43633.0,51525,51525,42258,25572.0,51525,51525.0
unique,,,100,6,,5,,3,13,12,,354,
top,,,ford f-150,excellent,,gas,,automatic,SUV,white,,2019-03-17,
freq,,,2796,24773,,47288,,46902,12405,10029,,186,
mean,12132.46492,2009.75047,,,6.125235,,115553.461738,,,,1.0,,39.55476
std,10040.803015,6.282065,,,1.66036,,65094.611341,,,,0.0,,28.20427
min,1.0,1908.0,,,3.0,,0.0,,,,1.0,,0.0
25%,5000.0,2006.0,,,4.0,,70000.0,,,,1.0,,19.0
50%,9000.0,2011.0,,,6.0,,113000.0,,,,1.0,,33.0
75%,16839.0,2014.0,,,8.0,,155000.0,,,,1.0,,53.0


In [45]:
df['price'].describe()

count     51525.000000
mean      12132.464920
std       10040.803015
min           1.000000
25%        5000.000000
50%        9000.000000
75%       16839.000000
max      375000.000000
Name: price, dtype: float64
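
The gap between the 75th percentile and the maximum suggests a long right tail. One common way to quantify this is the 1.5 × IQR rule. The sketch below applies it to the quartiles printed above; the rule is an illustrative choice, not something the notebook itself uses.

```python
# Quartiles and maximum copied from the df['price'].describe() output.
q1, q3, max_price = 5000.0, 16839.0, 375000.0

# Interquartile range and the conventional 1.5 * IQR fences.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(f"IQR: {iqr}")
print(f"Fences: [{lower_fence}, {upper_fence}]")
print(f"Max price beyond upper fence: {max_price > upper_fence}")
```

Because the lower fence is negative, this rule does not flag the minimum price of 1; implausibly low prices would need a separate check.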

In [46]:
df.isnull().sum()

price               0
model_year       3619
model               0
condition           0
cylinders        5260
fuel                0
odometer         7892
transmission        0
type                0
paint_color      9267
is_4wd          25953
date_posted         0
days_listed         0
dtype: int64
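
Raw counts are easier to compare as a share of rows. For instance, `is_4wd` is missing in 25953 of 51525 rows, roughly half. A minimal sketch of computing missing percentages, shown on a toy frame built from the first five rows above; on the real data the same expression would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Toy frame built from the head() output (missing cells become NaN).
sample = pd.DataFrame({
    "price": [9400, 25500, 5500, 1500, 14900],
    "model_year": [2011.0, np.nan, 2013.0, 2003.0, 2017.0],
    "odometer": [145000.0, 88705.0, 110000.0, np.nan, 80903.0],
    "is_4wd": [1.0, 1.0, np.nan, np.nan, np.nan],
})

# isna() gives booleans; their mean is the fraction of missing values per column.
missing_pct = sample.isna().mean().mul(100).round(1).sort_values(ascending=False)

print(missing_pct)
```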

 Handling missing values

In [47]:
# 'is_4wd' is always 1.0 when present, so a missing value is treated as "not 4WD":
# fill with 0 and convert to boolean
df['is_4wd'] = df['is_4wd'].fillna(0).astype(bool)

# Fill missing values in 'paint_color' with 'unknown'
df['paint_color'] = df['paint_color'].fillna('unknown')

# Fill only the missing values in 'cylinders' with the median per 'type'
# (fillna keeps observed values; assigning the transform result directly would overwrite them)
df['cylinders'] = df['cylinders'].fillna(df.groupby('type')['cylinders'].transform('median'))

# Fill only the missing values in 'model_year' with the median per 'model'
df['model_year'] = df['model_year'].fillna(df.groupby('model')['model_year'].transform('median'))

# Fill only the missing values in 'odometer' with the median per 'model_year' and 'model'
df['odometer'] = df['odometer'].fillna(df.groupby(['model_year', 'model'])['odometer'].transform('median'))

# Groups with no odometer readings at all stay NaN; fall back to the overall median
df['odometer'] = df['odometer'].fillna(df['odometer'].median())

# Check if there are any missing values left
df.isnull().sum()

price           0
model_year      0
model           0
condition       0
cylinders       0
fuel            0
odometer        0
transmission    0
type            0
paint_color     0
is_4wd          0
date_posted     0
days_listed     0
dtype: int64
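
When imputing with a group median, observed values should stay untouched and only the gaps should be filled. A minimal sketch of that pattern on a toy frame; the column `cylinders_filled` is a hypothetical name used for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame: one missing 'cylinders' value in each vehicle type.
toy = pd.DataFrame({
    "type": ["sedan", "sedan", "sedan", "pickup", "pickup"],
    "cylinders": [4.0, 6.0, np.nan, 8.0, np.nan],
})

# transform('median') returns one value per row: the median of that row's group
# (NaN values are skipped when the median is computed).
group_median = toy.groupby("type")["cylinders"].transform("median")

# fillna only touches the missing entries, so 4.0, 6.0 and 8.0 are preserved.
toy["cylinders_filled"] = toy["cylinders"].fillna(group_median)

print(toy)
```

Assigning `group_median` directly to the column would instead replace every value with its group median, which flattens the variation within each group.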

Histograms of car prices and odometer readings.

In [48]:
fig = px.histogram(df, x='price', title='Distribution of Car Prices')
fig.show()

fig = px.histogram(df, x='odometer', title='Distribution of Odometer Readings')
fig.show()

Scatter plots of price vs. model year and price vs. odometer, colored by condition.

In [49]:
fig = px.scatter(df, x='model_year', y='price', color='condition', title='Price vs Model Year')
fig.show()


fig = px.scatter(df, x='odometer', y='price', color='condition', title='Price vs Odometer')
fig.show()

Overlaid histogram showing how car prices are distributed for each condition.

In [50]:
fig = px.histogram(df, x='price', color='condition', title='Price Distribution by Car Condition', barmode='overlay')
fig.show()

This scatter plot shows price (y-axis) against model year (x-axis), with points colored by fuel type, so you can compare how prices vary with model year across fuel types.

In [51]:
fig = px.scatter(df, x='model_year', y='price', color='fuel', title='Price vs Model Year by Fuel Type')
fig.show()

## Conclusions

- Car prices are right-skewed: half of the listings fall between 5,000 and 16,839 (median 9,000), while a few outliers reach as high as 375,000.
- Price tends to rise with model year: newer cars are generally priced higher.
- Odometer readings span a wide range, and lower-mileage cars generally have higher prices.
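
The direction of these relationships can be checked numerically with a correlation matrix. The sketch below uses only the five rows from the `head()` output, so the values are illustrative and not results for the full dataset; on the real data the same call would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Toy frame from the head() output; far too small for real conclusions.
sample = pd.DataFrame({
    "price": [9400, 25500, 5500, 1500, 14900],
    "model_year": [2011.0, np.nan, 2013.0, 2003.0, 2017.0],
    "odometer": [145000.0, 88705.0, 110000.0, np.nan, 80903.0],
})

# Pearson correlation; rows with a missing value are dropped pairwise.
corr = sample.corr()

print(corr["price"])
```

A positive coefficient for `model_year` and a negative one for `odometer` would be consistent with the conclusions above. Pearson correlation is sensitive to outliers, so trimming extreme prices first may be worthwhile.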