In [1]:
import pandas as pd 
from matplotlib import pyplot as plt
import plotly.express as px


In [2]:
project_data = pd.read_csv("vehicles_us.csv")
project_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51525 entries, 0 to 51524
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   price         51525 non-null  int64  
 1   model_year    47906 non-null  float64
 2   model         51525 non-null  object 
 3   condition     51525 non-null  object 
 4   cylinders     46265 non-null  float64
 5   fuel          51525 non-null  object 
 6   odometer      43633 non-null  float64
 7   transmission  51525 non-null  object 
 8   type          51525 non-null  object 
 9   paint_color   42258 non-null  object 
 10  is_4wd        25572 non-null  float64
 11  date_posted   51525 non-null  object 
 12  days_listed   51525 non-null  int64  
dtypes: float64(4), int64(2), object(7)
memory usage: 5.1+ MB


In [3]:
project_data.sample()

Unnamed: 0,price,model_year,model,condition,cylinders,fuel,odometer,transmission,type,paint_color,is_4wd,date_posted,days_listed
38056,18985,2010.0,toyota tacoma,excellent,6.0,gas,132925.0,automatic,pickup,white,1.0,2018-07-24,11


In [4]:
project_data = project_data.drop_duplicates()

The dataset has missing values in `model_year`, `cylinders`, `odometer`, `paint_color`, and `is_4wd`. `date_posted` should be converted to datetime. `is_4wd` is to be treated as a flag where 1 = True and 0 = False.

In [5]:
# Convert to datetime
project_data['date_posted'] = pd.to_datetime(project_data['date_posted'])

# Fill missing numeric values with the overall column median.
# groupby('model')[col].fillna(<scalar>) is deprecated and filled every group
# with that same overall median, so the column is filled directly instead.
project_data['model_year'] = project_data['model_year'].fillna(project_data['model_year'].median())
project_data['cylinders'] = project_data['cylinders'].fillna(project_data['cylinders'].median())
project_data['odometer'] = project_data['odometer'].fillna(project_data['odometer'].median())

# Assign the result back instead of using inplace=True on a column selection
project_data['paint_color'] = project_data['paint_color'].fillna('unknown')
project_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51525 entries, 0 to 51524
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   price         51525 non-null  int64         
 1   model_year    51525 non-null  float64       
 2   model         51525 non-null  object        
 3   condition     51525 non-null  object        
 4   cylinders     51525 non-null  float64       
 5   fuel          51525 non-null  object        
 6   odometer      51525 non-null  float64       
 7   transmission  51525 non-null  object        
 8   type          51525 non-null  object        
 9   paint_color   51525 non-null  object        
 10  is_4wd        25572 non-null  float64       
 11  date_posted   51525 non-null  datetime64[ns]
 12  days_listed   51525 non-null  int64         
dtypes: datetime64[ns](1), float64(4), int64(2), object(6)
memory usage: 5.1+ MB
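The fill above gives every vehicle the same overall median. If per-model medians are preferred (an assumption about intent, not something the notebook does), one possible approach is `groupby(...).transform('median')` with a fallback to the overall median for models that have no observed values. The sketch below runs on a hypothetical toy frame, not on `vehicles_us.csv`:

```python
import pandas as pd

# Hypothetical toy frame standing in for project_data
toy = pd.DataFrame({
    "model": ["civic", "civic", "civic", "f-150", "f-150", "rare"],
    "cylinders": [4.0, None, 4.0, 8.0, None, None],
})

# Median of each model, broadcast back to the original row order
per_model = toy.groupby("model")["cylinders"].transform("median")

# First fill from the per-model median ...
toy["cylinders"] = toy["cylinders"].fillna(per_model)
# ... then fall back to the overall median for models with no observed value
toy["cylinders"] = toy["cylinders"].fillna(toy["cylinders"].median())

print(toy)
```

The fallback step matters because a model whose values are all missing has no median of its own, so the first fill leaves it empty.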


In [6]:
project_data.sample()

Unnamed: 0,price,model_year,model,condition,cylinders,fuel,odometer,transmission,type,paint_color,is_4wd,date_posted,days_listed
45038,6790,2008.0,honda civic,good,4.0,gas,135399.0,manual,sedan,blue,,2019-03-29,144
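
The sampled row above still has an empty `is_4wd`, and the earlier note describes reading that column as 1 = True, 0 = False. A minimal sketch of such a conversion, on a hypothetical toy frame and under the assumption that a missing value means the vehicle is not four-wheel drive (the notebook does not confirm this):

```python
import pandas as pd

# Hypothetical toy frame standing in for project_data
toy = pd.DataFrame({"is_4wd": [1.0, None, 1.0, None]})

# Assumption: a missing is_4wd means "not four-wheel drive"
toy["is_4wd"] = toy["is_4wd"].fillna(0).astype(bool)

print(toy["is_4wd"].tolist())
```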


In [7]:
project_data['type'].value_counts()


type
SUV            12405
truck          12353
sedan          12154
pickup          6988
coupe           2303
wagon           1541
mini-van        1161
hatchback       1047
van              633
convertible      446
other            256
offroad          214
bus               24
Name: count, dtype: int64

In [8]:
# Normalize the text columns to lowercase
text_columns = ['type', 'model', 'fuel', 'condition', 'transmission', 'paint_color']
for column in text_columns:
    project_data[column] = project_data[column].str.lower()
project_data['type'].value_counts()

type
suv            12405
truck          12353
sedan          12154
pickup          6988
coupe           2303
wagon           1541
mini-van        1161
hatchback       1047
van              633
convertible      446
other            256
offroad          214
bus               24
Name: count, dtype: int64

In [9]:
# Distribution of Vehicle Prices
filtered_data = project_data[project_data['price'] > 0]
prices = px.histogram(
    filtered_data,
    x='price',
    nbins=50, 
    title='Distribution of Vehicle Prices',
    labels={'price': 'Price'}
)
prices.update_layout(xaxis_title='Price', yaxis_title='Number of Vehicles')
prices.show()

### Summary:
Listed prices range from $1 to $375,000, with a mean of $12,160 and a median of $9,000; the mean sitting above the median indicates a right-skewed distribution. The interquartile range (IQR) runs from $5,000 to $16,900, so the middle half of listings is moderately priced. The standard deviation of $10,082 is large relative to the mean, reflecting wide variability in pricing. A small number of high-priced vehicles pulls the average up while most listings are more affordable, which suggests segmenting vehicles by price tier when setting pricing strategy.
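
Figures like these can be obtained from `Series.describe()`. The sketch below uses a hypothetical five-value series rather than the real `price` column, so its numbers are illustrative only; it shows how the IQR and a simple mean-versus-median skew check fall out of the summary:

```python
import pandas as pd

# Hypothetical prices, not taken from vehicles_us.csv
prices = pd.Series([1000, 5000, 9000, 17000, 60000], name="price")

stats = prices.describe()

# Width of the interquartile range: 75th percentile minus 25th percentile
iqr = stats["75%"] - stats["25%"]

# A mean above the median is a quick indicator of right skew
right_skewed = stats["mean"] > stats["50%"]

print(stats)
print("IQR width:", iqr, "| right-skewed:", right_skewed)
```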

In [10]:
# Odometer Reading Distribution
odometer = px.histogram(
    project_data,
    x='odometer',
    nbins=50,
    title='Distribution of Odometer Readings'
)
odometer.update_layout(xaxis_title='Odometer Reading', yaxis_title='Number of Vehicles')
odometer.show()

### Summary:
After imputation the odometer column has 51,525 entries (7,892 of them filled with the median in the cleaning step), ranging from 0 to 990,000 miles. The mean reading is approximately 97,854 miles and the median is 99,114 miles, so a typical vehicle has close to 100,000 miles on it. The interquartile range (IQR) runs from 35,896 to 146,541 miles, meaning the middle half of vehicles falls within this range. The standard deviation of 72,940 miles reflects wide variation, with some vehicles showing far higher mileage than the rest. Mileage is a proxy for wear and tear, which can influence pricing, especially for used cars.
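
Because missing odometer readings were filled with a single median value, those rows all land in the same histogram bin. One way to keep observed and filled values separable is to record a flag before filling. This is a sketch on a hypothetical toy frame; the `odometer_imputed` column is an illustrative name, not part of the dataset:

```python
import pandas as pd

# Hypothetical toy frame standing in for project_data
toy = pd.DataFrame({"odometer": [20000.0, None, 90000.0, None, 150000.0]})

# Record which rows were missing before any filling happens
toy["odometer_imputed"] = toy["odometer"].isna()

# Same median fill as in the cleaning step
toy["odometer"] = toy["odometer"].fillna(toy["odometer"].median())

# Observed readings only, e.g. for a histogram without the filled values
observed = toy.loc[~toy["odometer_imputed"], "odometer"]

print(toy)
print(observed.tolist())
```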

In [11]:
# Scatter plot Price vs Odometer
filtered_data = project_data[project_data['price'] > 0]
price_odometer = px.scatter(
    filtered_data,
    x='odometer',
    y='price',
    title='Price vs Odometer'
)
price_odometer.update_layout(xaxis_title='Odometer Reading', yaxis_title='Price')
price_odometer.show()

### Summary:
The scatter plot of price versus odometer reading shows a general downward trend: vehicles with higher odometer readings tend to have lower prices, as expected when mileage reduces a vehicle's value. There are outliers, however, with some high-mileage vehicles priced well above the majority, which suggests that mileage is not the only factor. Other attributes, such as condition or brand, may also play a substantial role in determining price. For pricing strategy, higher-mileage vehicles might need adjustments or incentives to remain competitive in the market.
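
The visual trend can be backed by a correlation coefficient. The sketch below uses hypothetical values, not the real listings, and computes both the Pearson correlation and a rank-based (Spearman-style) correlation, the latter obtained by correlating the ranks so that no extra dependency is assumed:

```python
import pandas as pd

# Hypothetical toy data: price falls steadily as mileage rises
toy = pd.DataFrame({
    "odometer": [10000, 50000, 90000, 130000, 170000],
    "price": [25000, 18000, 12000, 8000, 5000],
})

# Linear (Pearson) correlation between mileage and price
pearson = toy["odometer"].corr(toy["price"])

# Rank-based correlation: Pearson correlation of the ranks
rank_corr = toy["odometer"].rank().corr(toy["price"].rank())

print("Pearson:", pearson)
print("Rank-based:", rank_corr)
```

A negative value supports the downward trend; the rank-based version is less sensitive to a few extreme prices than the linear one.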

In [12]:
# Scatter plot Model Year vs Price
filtered_data = project_data[(project_data['price'] > 0) & (project_data['model_year'] > 0)]
modelyear_price = px.scatter(
    filtered_data,
    x='model_year',
    y='price',
    title='Model Year vs Price'
)
modelyear_price.update_layout(xaxis_title='Model Year', yaxis_title='Price')
modelyear_price.show()

### Summary:
The scatter plot of "Model Year vs Price" shows how a vehicle's age relates to its price. Newer vehicles tend to have higher prices, with higher-priced listings becoming more concentrated as the model year approaches 2019. This indicates that buyers are willing to pay more for newer cars, which is consistent with general market trends. Prices for older model years are spread over a wider range, possibly due to variations in condition, mileage, and other factors. For pricing strategy, this suggests room for more competitive pricing on older models and premium pricing on newer vehicles.
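
One way to summarize this relationship numerically is the median price per model year, optionally alongside the vehicle's age at posting time. The sketch below uses a hypothetical toy frame; `age_years` is an illustrative derived column, not one present in the dataset:

```python
import pandas as pd

# Hypothetical toy frame standing in for project_data
toy = pd.DataFrame({
    "model_year": [2005.0, 2005.0, 2012.0, 2012.0, 2018.0],
    "price": [3000, 5000, 9000, 11000, 24000],
    "date_posted": pd.to_datetime([
        "2018-07-24", "2019-03-29", "2018-07-24", "2019-03-29", "2019-03-29",
    ]),
})

# Age of the vehicle in years when the listing was posted
toy["age_years"] = toy["date_posted"].dt.year - toy["model_year"]

# Median price for each model year
median_by_year = toy.groupby("model_year")["price"].median()

print(toy[["model_year", "age_years", "price"]])
print(median_by_year)
```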