## Vehicles



### Vehicle Statistics — A High Level Overview

In [None]:
import streamlit as st
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Load DF
vehicles_df = pd.read_csv('/Users/dianuselvenbough/Desktop/vehicles_us.csv')

#print head of data
print('Head of Data')
print(vehicles_df.head())
print()




ParserError: Error tokenizing data. C error: Expected 1 fields in line 42, saw 37


Calling head shows that the dataset's columns cover price, model year, model, condition, cylinders, fuel, odometer, transmission, type, paint color, four-wheel-drive status, the date the listing was posted, and the number of days the vehicle was listed before it was bought.

In [None]:
#print info
print('Vehicles Info')
print(vehicles_df.info())
print()

Vehicles Info


NameError: name 'vehicles_df' is not defined

There are 51,525 rows overall. Of the columns, price, model, condition, fuel, transmission, type, date_posted, and days_listed are completely filled in. model_year has only 47,906 of 51,525 values filled in, meaning there are 3,619 null values in the column. cylinders has 46,265 values, meaning there are 5,260 null values. odometer has 43,633 values, meaning there are 7,892 null values. paint_color has only 42,258 values, meaning there are 9,267 null values. is_4wd has only 25,572 values, meaning there are 25,953 null values.
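The null counts above can also be computed directly rather than subtracted by hand from the `info()` output. A minimal sketch, using a small made-up frame (`toy_df`, a hypothetical stand-in for `vehicles_df`, with invented values):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for vehicles_df; the values are made up for illustration
toy_df = pd.DataFrame({
    "price": [9400, 25500, 5500, 1500],
    "model_year": [2011.0, np.nan, 2013.0, 2003.0],
    "paint_color": [None, "white", "red", None],
    "is_4wd": [1.0, 1.0, np.nan, np.nan],
})

# Number of missing values in each column
null_counts = toy_df.isna().sum()

# The same counts expressed as a percentage of all rows
null_share = (null_counts / len(toy_df) * 100).round(1)

summary = pd.DataFrame({"nulls": null_counts, "percent": null_share})
print(summary)
```

Run against the real dataframe, the `nulls` column should match the differences listed above, assuming the file loads with the same 51,525 rows.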

In [None]:
#print describe
print('Vehicles Describe')
print(vehicles_df.describe())
print()

We can see from the mean and the standard deviation that the data is widely dispersed, especially price, which ranges from a minimum of 1 to a maximum of 375,000. Model years center on 2009. Just under 70% of the cars have between four and eight cylinders. The odometer readings are widely spread as well. Assuming null values mean no four-wheel drive, about 49% of the cars have four-wheel drive. Just under 70% of the cars were listed for between 11 and 70 days; this large spread indicates high variability in which characteristics predict that a car will sell easily.
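The "just under 70%" figures rest on the rule of thumb that roughly 68% of approximately normal data lies within one standard deviation of the mean. Skewed columns such as price or days_listed may not follow that rule closely, so it can be worth measuring the share directly. A minimal sketch, using a made-up series in place of the real days_listed column:

```python
import pandas as pd

# Hypothetical listing durations; not taken from the real dataset
days = pd.Series([5, 12, 19, 25, 33, 40, 47, 58, 66, 120], name="days_listed")

mean = days.mean()
std = days.std()  # sample standard deviation, the same one describe() reports

# Fraction of values that fall inside [mean - std, mean + std]
within_one_std = days.between(mean - std, mean + std).mean()

print(f"mean={mean:.1f}, std={std:.1f}, share within one std={within_one_std:.0%}")
```

Swapping `days` for `vehicles_df['days_listed']` or `vehicles_df['cylinders']` would show how close the real columns come to the 68% approximation.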

## Cleaning Data

Model year, paint color, cylinders, and odometer readings are very important for analyzing car sales data. Therefore, the rows with NaN values in those columns must be jettisoned.

In [None]:
# Get rid of rows with missing data
vehicles_df = vehicles_df.dropna(subset=['model_year', 'paint_color', 'cylinders', 'odometer'])

# Replace NaN values in is_4wd with 0
vehicles_df['is_4wd'] = vehicles_df['is_4wd'].fillna(0)


print(vehicles_df.head())

Now that we have a set of workable data, it's wise to make sure all data types in the dataframe make sense.

In [None]:
# Reprint info for reference
print('Vehicles Info')
print(vehicles_df.info())
print()

model_year, cylinders, and odometer should all be stored as int64, because the values in those columns are whole numbers and need no decimals.

model, condition, fuel, transmission, type, and paint_color should all be stored as strings.

is_4wd should be stored as a boolean.

date_posted should be stored as datetime.

In [None]:
# Converting model_year to int
vehicles_df['model_year'] = vehicles_df['model_year'].astype(int)

# Converting odometer to int
vehicles_df['odometer'] = vehicles_df['odometer'].astype(int)

# Converting cylinders to int
vehicles_df['cylinders'] = vehicles_df['cylinders'].astype(int)

# Converting model to string
vehicles_df['model'] = vehicles_df['model'].astype('string')

# Converting condition to string
vehicles_df['condition'] = vehicles_df['condition'].astype('string')

#Converting fuel to string
vehicles_df['fuel'] = vehicles_df['fuel'].astype('string')

# Converting transmission to string
vehicles_df['transmission'] = vehicles_df['transmission'].astype('string')

# Converting type to string
vehicles_df['type'] = vehicles_df['type'].astype('string')

# Converting paint_color to string
vehicles_df['paint_color'] = vehicles_df['paint_color'].astype('string')

# Converting is_4wd to bool
vehicles_df['is_4wd'] = vehicles_df['is_4wd'].astype(bool)

# Converting date_posted to datetime
vehicles_df['date_posted'] = pd.to_datetime(vehicles_df['date_posted'])

# Reprint info to check changes
print('Vehicles Info')
print(vehicles_df.info())
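The column-by-column conversions above can also be written as a single `astype` call with a dictionary that maps column names to target types. A minimal sketch on a made-up frame (`toy_df` and `dtype_map` are illustrative names, not part of the original notebook):

```python
import pandas as pd

# Hypothetical stand-in with the same kinds of columns as vehicles_df
toy_df = pd.DataFrame({
    "model_year": [2011.0, 2013.0],
    "cylinders": [6.0, 4.0],
    "odometer": [145000.0, 110000.0],
    "model": ["bmw x5", "hyundai sonata"],
    "is_4wd": [1.0, 0.0],
    "date_posted": ["2018-06-23", "2019-02-07"],
})

# One mapping from column name to target dtype
dtype_map = {
    "model_year": "int64",
    "cylinders": "int64",
    "odometer": "int64",
    "model": "string",
    "is_4wd": "bool",
}
toy_df = toy_df.astype(dtype_map)

# Dates are parsed separately with to_datetime
toy_df["date_posted"] = pd.to_datetime(toy_df["date_posted"])

print(toy_df.dtypes)
```

Note that converting to `int64` only works once the NaN rows have been dropped, as was done in the cleaning step.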


Thankfully, we still have 29,916 complete rows, enough to analyze the dataset reliably. The next step is to go through model, condition, fuel, transmission, type, and paint_color to make sure the values in each column are consistent and easy to compare.

In [None]:
# Unique values in each string column
print('Unique Values')
print()
print("Number of Models:", (vehicles_df['model'].nunique()))
print("Model:")
for model in sorted(vehicles_df['model'].str.strip().str.lower().unique()):
    print("-", model)
print()

print("Number of Conditions:", (vehicles_df['condition'].nunique()))
print("Condition:")
for condition in sorted(vehicles_df['condition'].str.strip().str.lower().unique()):
    print("-", condition)
print()

print("Number of Fuel Types:", (vehicles_df['fuel'].nunique()))
print("Fuel Types:")
for fuel in sorted(vehicles_df['fuel'].str.strip().str.lower().unique()):
    print("-", fuel)
print(f"Count of Other Kinds of Fuel Types: {vehicles_df['fuel'].str.strip().str.lower().value_counts().get('other', 0)} of {len(vehicles_df):,} values")
print()

print("Number of Transmission Types:", (vehicles_df['transmission'].nunique()))
print("Transmission Types:")
for transmission in sorted(vehicles_df['transmission'].str.strip().str.lower().unique()):
    print("-", transmission)
print(f"Count of Other Kinds of Transmissions: {vehicles_df['transmission'].str.strip().str.lower().value_counts().get('other', 0)} of {len(vehicles_df):,} values")
print()

print("Number of Vehicle Types:", (vehicles_df['type'].nunique()))
print("Vehicle Types:")
for vehicle_type in sorted(vehicles_df['type'].str.strip().str.lower().unique()):
    print("-", vehicle_type)
print()

print("Number of Paint Colors:", (vehicles_df['paint_color'].nunique()))
print("Paint Colors:")
for paint_color in sorted(vehicles_df['paint_color'].str.strip().str.lower().unique()):
    print("-", paint_color)
print()

The transmission and fuel columns contain a negligible number of 'other' values, so we will jettison those rows to keep the categories specific.

In [None]:
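# Drop rows where "other" is the value for fuel or transmission
# (compare on stripped, lowercased values so the filter matches the check below)
vehicles_df = vehicles_df[(vehicles_df['fuel'].str.strip().str.lower() != 'other') & (vehicles_df['transmission'].str.strip().str.lower() != 'other')]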

# Check to see if 'other' appears in fuel or transmission
print('Fuel and Transmission Check')
print()
print('Other Fuel:', vehicles_df['fuel'].str.strip().str.lower().value_counts().get('other', 0))
print('Other Transmission:', vehicles_df['transmission'].str.strip().str.lower().value_counts().get('other', 0))
print()
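Whether the 'other' rows really are negligible can be checked by measuring their share before dropping them. A minimal sketch on made-up values (`toy_df` is a hypothetical stand-in, and the stray spaces and capitals are deliberate to show why normalizing first matters):

```python
import pandas as pd

# Hypothetical values with inconsistent casing and whitespace
toy_df = pd.DataFrame({
    "fuel": ["gas", " Other", "diesel", "gas"],
    "transmission": ["automatic", "manual", "OTHER ", "automatic"],
})

# Normalize once and store the result, so later comparisons see clean values
for col in ["fuel", "transmission"]:
    toy_df[col] = toy_df[col].str.strip().str.lower()

# True for rows where either column is 'other'
is_other = toy_df[["fuel", "transmission"]].eq("other").any(axis=1)

# Share of rows that would be dropped
other_share = is_other.mean()

# Keep only the rows with a specific fuel and transmission
filtered_df = toy_df[~is_other].reset_index(drop=True)

print(f"Dropping {other_share:.0%} of rows; {len(filtered_df)} remain")
```

In this toy frame half the rows are dropped, which would not be negligible; the point is only to show how the share can be measured on the real data.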

## Data Analysis

To help our client make the most sales, we want to find out which factors affect how quickly a car sells and at what price. Therefore, we want to consider price in relation to all of the available factors. Let's start by figuring out which decade of model_year has the highest prices associated with it.

In [None]:
# Define decade bins
bins = np.arange(1900, 2030, 10)  # Decade bins from 1900 to 2020
labels = [f"{int(start)}s" for start in bins[:-1]]  # Create labels for each decade

# Assign each model_year to a decade bin
vehicles_df['decade'] = pd.cut(vehicles_df['model_year'], bins=bins, labels=labels, right=False)

# Create a histogram using Plotly Express
fig = px.histogram(vehicles_df, x="price", color="decade",
                   nbins=400, barmode="stack",
                   title="Histogram of Price Distribution by Decade",
                   labels={"price": "Price ($)", "decade": "Decade"},
                   opacity=0.7)

# Update the x-axis to only show prices from 0 to 60,000, since the majority of prices fall below that
fig.update_xaxes(range=[0, 60000])

# Show the plot
fig.show()
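A stacked histogram shows how many vehicles fall in each price range, but it does not directly answer which decade has the highest prices. One way to sketch that comparison is a per-decade summary table. The example below reuses the same binning approach on made-up values (`toy_df` and `decade_summary` are illustrative names, and the median is one possible choice of "typical" price):

```python
import numpy as np
import pandas as pd

# Hypothetical model years and prices; not taken from the real dataset
toy_df = pd.DataFrame({
    "model_year": [1995, 1998, 2004, 2008, 2012, 2016, 2018],
    "price": [3000, 4000, 5000, 7000, 12000, 18000, 21000],
})

# Same decade binning as in the cell above
bins = np.arange(1900, 2030, 10)
labels = [f"{start}s" for start in bins[:-1]]
toy_df["decade"] = pd.cut(toy_df["model_year"], bins=bins, labels=labels, right=False)

# Count and median price per decade, highest median first;
# observed=True skips decades that have no vehicles
decade_summary = (
    toy_df.groupby("decade", observed=True)["price"]
    .agg(["count", "median"])
    .sort_values("median", ascending=False)
)

print(decade_summary)
```

The median is used here instead of the mean because a few very expensive listings can pull the mean upward.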

In [None]:
# Define decade bins
bins = np.arange(1900, 2030, 10)  # Decade bins from 1900 to 2020
labels = [f"{int(start)}s" for start in bins[:-1]]  # Create labels for each decade

# Assign each model_year to a decade bin
vehicles_df['decade'] = pd.cut(vehicles_df['model_year'], bins=bins, labels=labels, right=False)

# Ensure models are sorted alphabetically before plotting
vehicles_df = vehicles_df.sort_values(by='model')

# Create a histogram using Plotly Express, using model as color
fig_1 = px.histogram(vehicles_df, x="price", color="model",
                   nbins=400, barmode="stack",
                   title="Histogram of Price Distribution by Model (Alphabetized)",
                   labels={"price": "Price ($)", "model": "Vehicle Model"},
                   opacity=0.7,
                   category_orders={"model": sorted(vehicles_df['model'].unique())})  # Ensures models are in alphabetical order

# Update the x-axis to only show prices from 0 to 60,000
fig_1.update_xaxes(range=[0, 60000])

# Show the plot
fig_1.show()

The 2010s have the highest number of vehicles in the dataset.

In [None]:
import plotly.express as px

# Create a histogram using Plotly Express
fig_2 = px.histogram(vehicles_df, x="price", color="paint_color",
                   title="Price Distribution by Paint Color",
                   labels={"price": "Price ($)", "paint_color": "Paint Color"},
                   opacity=0.7,
                   barmode="stack")  

# Make x-axis 0-60k
fig_2.update_xaxes(range=[0, 60000])

# Show the plot
fig_2.show()

## Scatter Plot of Price and Model Year

In [None]:
fig_3 = px.scatter(vehicles_df, 
                 x='model_year', 
                 y='price', 
                 title='Vehicle Price vs. Model Year',
                 labels={'model_year': 'Model Year', 'price': 'Price'},
                 trendline="ols")

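# Change trendline to red. Plotly Express draws the OLS trendline as a
# lines-mode trace (the points use markers mode), so select it by mode.
fig_3.update_traces(selector=dict(mode="lines"), line=dict(color='red'))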

#display plot
fig_3.show()
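The trendline in the scatter plot can be backed up with numbers. A minimal sketch that fits a straight line with `np.polyfit` and reports the Pearson correlation, using made-up values (`toy_df` is hypothetical, and an ordinary least-squares line is assumed to be a reasonable first summary of the relationship):

```python
import numpy as np
import pandas as pd

# Hypothetical model years and prices; not taken from the real dataset
toy_df = pd.DataFrame({
    "model_year": [2000, 2005, 2010, 2015],
    "price": [2000, 5000, 8000, 11000],
})

# Least-squares straight line: price ~ slope * model_year + intercept
slope, intercept = np.polyfit(
    toy_df["model_year"].to_numpy(dtype=float),
    toy_df["price"].to_numpy(dtype=float),
    deg=1,
)

# Pearson correlation between model year and price
corr = toy_df["model_year"].corr(toy_df["price"])

print(f"Each additional model year adds about ${slope:,.0f}; correlation = {corr:.2f}")
```

On the real data the correlation will be weaker than in this perfectly linear toy example, and the slope gives a rough estimate of how much price changes per model year.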
