Project Scope:

The purpose of this project is to showcase different kinds of graphs using the plotly.express library. We will first conduct some exploratory data analysis to check for missing data; the numpy library is imported in case it is needed. This notebook is part of a larger project whose main goal is to turn this analysis into a sample web app.

In [1]:
#Importing the necessary libraries
import pandas as pd
import numpy as np
import streamlit as st
import plotly.express as px

In [2]:
#loading our dataset
cars_us = pd.read_csv("C:\\Users\\Anthony\\Desktop\\python_folder\\software_dev_project_1\\vehicles_us.csv")
#creating a new column titled 'manufacturer', which gets the first word of the 'model' column
cars_us['manufacturer'] = cars_us['model'].apply(lambda x:
                                                 x.split()[0])

In [3]:
#A sample of what we're dealing with...
cars_us.sample(15)

Unnamed: 0,price,model_year,model,condition,cylinders,fuel,odometer,transmission,type,paint_color,is_4wd,date_posted,days_listed,manufacturer
16338,10965,2007.0,ford f-150,excellent,8.0,gas,81000.0,automatic,pickup,,1.0,2018-12-12,91,ford
19920,19500,2017.0,honda cr-v,like new,4.0,gas,12000.0,automatic,SUV,black,1.0,2018-12-21,118,honda
8156,2600,2004.0,hyundai santa fe,fair,6.0,gas,,automatic,SUV,,1.0,2018-05-06,1,hyundai
36321,2499,2012.0,ford f-150,excellent,6.0,gas,120179.0,automatic,pickup,,1.0,2018-11-23,35,ford
6195,28500,2006.0,ford f-250,excellent,8.0,diesel,74300.0,automatic,pickup,white,1.0,2018-08-30,56,ford
45147,15000,2017.0,honda civic lx,excellent,4.0,gas,41000.0,automatic,sedan,,,2019-03-14,13,honda
23454,1500,1999.0,jeep grand cherokee,fair,8.0,gas,21900.0,automatic,SUV,blue,1.0,2018-12-07,11,jeep
2970,5000,2001.0,chevrolet silverado 2500hd,good,8.0,diesel,241000.0,automatic,pickup,red,,2018-11-24,22,chevrolet
222,2750,2006.0,honda civic lx,good,4.0,gas,200.0,manual,sedan,black,,2018-07-06,41,honda
22242,13991,2015.0,chevrolet equinox,good,4.0,gas,76642.0,automatic,SUV,custom,1.0,2018-08-24,35,chevrolet


In [4]:
#What's under the hood?
cars_us.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51525 entries, 0 to 51524
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   price         51525 non-null  int64  
 1   model_year    47906 non-null  float64
 2   model         51525 non-null  object 
 3   condition     51525 non-null  object 
 4   cylinders     46265 non-null  float64
 5   fuel          51525 non-null  object 
 6   odometer      43633 non-null  float64
 7   transmission  51525 non-null  object 
 8   type          51525 non-null  object 
 9   paint_color   42258 non-null  object 
 10  is_4wd        25572 non-null  float64
 11  date_posted   51525 non-null  object 
 12  days_listed   51525 non-null  int64  
 13  manufacturer  51525 non-null  object 
dtypes: float64(4), int64(2), object(8)
memory usage: 5.5+ MB


Thus far, the columns with missing values are as follows:
- model_year (We could replace those missing values with the mode)
- cylinders (We could replace those missing values with the median value within that column)
- odometer (We could replace those missing values with the median value as well)
- paint_color (We can replace the missing value with the phrase 'Unknown')
- is_4wd (assuming the missing values are not 4 wheel drive)
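The list above can be cross-checked by counting nulls per column. The sketch below runs on a tiny invented frame (`toy` is a hypothetical stand-in, not the real dataset) that borrows a few of the column names:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for cars_us; the values are made up for illustration
toy = pd.DataFrame({
    "price": [10965, 19500, 2600, 15000],
    "model_year": [2007.0, np.nan, 2004.0, 2017.0],
    "odometer": [81000.0, 12000.0, np.nan, np.nan],
    "paint_color": [None, "black", None, "white"],
})

# Count missing values per column, then keep only the columns that have any
missing_counts = toy.isna().sum()
missing_counts = missing_counts[missing_counts > 0].sort_values(ascending=False)
print(missing_counts)
```

On the real frame the same two lines would be applied to `cars_us` instead of `toy`.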

In [5]:
#Which rows are missing both 'model_year' and 'odometer'?
cars_us[(cars_us['model_year'].isna()) & (cars_us['odometer'].isna())]

Unnamed: 0,price,model_year,model,condition,cylinders,fuel,odometer,transmission,type,paint_color,is_4wd,date_posted,days_listed,manufacturer
159,23300,,nissan frontier crew cab sv,good,,gas,,other,pickup,grey,1.0,2018-07-24,73,nissan
260,14975,,toyota 4runner,good,6.0,gas,,automatic,SUV,silver,,2018-05-13,57,toyota
370,4700,,kia soul,good,,gas,,manual,sedan,white,,2019-01-14,50,kia
586,26000,,toyota rav4,like new,4.0,gas,,automatic,SUV,,,2018-08-09,29,toyota
659,8400,,volkswagen jetta,good,4.0,diesel,,manual,wagon,,,2018-10-22,37,volkswagen
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
51195,21999,,ram 2500,good,6.0,diesel,,automatic,truck,white,1.0,2018-05-10,35,ram
51222,1000,,acura tl,good,6.0,gas,,automatic,sedan,grey,,2018-12-09,23,acura
51257,6500,,toyota corolla,good,4.0,gas,,automatic,sedan,white,,2018-10-16,75,toyota
51295,3850,,hyundai elantra,excellent,4.0,gas,,automatic,sedan,silver,,2019-03-16,83,hyundai


549 entries where we have neither mileage nor model year. Not a big bother for a dataframe of 51,525 entries though...
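For scale, 549 out of 51,525 is roughly 1% of the rows. A count like this can be taken directly from the boolean mask rather than read off the displayed frame. A minimal sketch on invented values (`toy` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the two columns of interest
toy = pd.DataFrame({
    "model_year": [np.nan, 2012.0, np.nan, 2006.0],
    "odometer": [np.nan, 120179.0, 74300.0, np.nan],
})

# True only where both fields are missing in the same row
both_missing = toy["model_year"].isna() & toy["odometer"].isna()

n_both = int(both_missing.sum())   # True counts as 1 when summed
share = n_both / len(toy)          # fraction of all rows
print(f"{n_both} of {len(toy)} rows ({share:.1%}) lack both fields")
```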

Assuming the missing values in the 'is_4wd' column are cars without 4-wheel drive, we can go ahead and replace those missing values with 0.
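This assumption is easier to defend if the column only ever holds 1.0 or nothing at all. One way to check, sketched here on invented data (`toy` is hypothetical), is to list the distinct non-null values before filling:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the is_4wd column
toy = pd.DataFrame({"is_4wd": [1.0, np.nan, 1.0, np.nan, np.nan]})

# Include NaN in the tally so the gap is visible next to the real values
counts = toy["is_4wd"].value_counts(dropna=False)
print(counts)

# Distinct values that are actually recorded
observed = set(toy["is_4wd"].dropna().unique())

# If 1.0 is the only recorded value, reading NaN as "not 4wd" is plausible
filled = toy["is_4wd"].fillna(0)
```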

In [6]:
#Assuming the missing values in the 'is_4wd' column are cars without 4-wheel drive, we can go ahead and replace those missing values with 0
cars_us['is_4wd'] = cars_us['is_4wd'].fillna(0)

In [7]:
#Replacing missing paint colors with the term 'Unknown'
cars_us['paint_color'] = cars_us['paint_color'].fillna('Unknown')

In [8]:
#Replacing missing values of the model year with the mode
#mode() returns a Series (ties are possible), so we take its first value as the scalar fill
cars_us['model_year'] = cars_us['model_year'].fillna(cars_us['model_year'].mode()[0])
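A note on filling with the mode: `Series.mode()` returns a Series rather than a single number, and `fillna` aligns a Series argument by index label. The sketch below, on invented values, contrasts passing the Series itself with passing its first element:

```python
import numpy as np
import pandas as pd

# Hypothetical model_year values with two gaps
s = pd.Series([2013.0, np.nan, 2013.0, 2007.0, np.nan])

mode_result = s.mode()  # a Series holding the most frequent value(s)

# Passing the Series: only NaNs whose index label also exists in mode_result are filled
partially_filled = s.fillna(mode_result)

# Passing a scalar: every NaN is filled
filled = s.fillna(mode_result.iloc[0])

print(partially_filled.isna().sum(), filled.isna().sum())
```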

In [9]:
#Replacing missing values of cylinders with the median
cars_us['cylinders'] = cars_us['cylinders'].fillna(cars_us['cylinders'].median())

In [10]:
#Replacing missing values within the odometer with the median
cars_us['odometer'] = cars_us['odometer'].fillna(cars_us['odometer'].median())

That should take care of the missing values.

In [11]:
#How many brands are on that list?
cars_us['manufacturer'].value_counts()

ford             12672
chevrolet        10611
toyota            5445
honda             3485
ram               3316
jeep              3281
nissan            3208
gmc               2378
subaru            1272
dodge             1255
hyundai           1173
volkswagen         869
chrysler           838
kia                585
cadillac           322
buick              271
bmw                267
acura              236
mercedes-benz       41
Name: manufacturer, dtype: int64

Ford and Chevrolet are the two most common brands on this list. But how do their prices compare against each other?
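Before plotting, one quick way to approach that question is a grouped summary of price. The sketch below uses a handful of invented rows (`toy` is hypothetical), so the numbers say nothing about the real dataset:

```python
import pandas as pd

# Hypothetical rows; prices are made up for illustration
toy = pd.DataFrame({
    "manufacturer": ["ford", "ford", "chevrolet", "chevrolet", "honda"],
    "price": [10965, 28500, 5000, 13991, 19500],
})

# Keep only the two brands being compared
two_brands = toy[toy["manufacturer"].isin(["ford", "chevrolet"])]

# Count, median and mean price per brand
summary = two_brands.groupby("manufacturer")["price"].agg(["count", "median", "mean"])
print(summary)
```

The median is included alongside the mean because used-car prices tend to be skewed by a few expensive listings.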

In [12]:
#What are the kinds of fuel options available?
cars_us['fuel'].value_counts()

gas         47288
diesel       3714
hybrid        409
other         108
electric        6
Name: fuel, dtype: int64

In [13]:
#And the recorded conditions of the vehicles?
cars_us['condition'].value_counts()

excellent    24773
good         20145
like new      4742
fair          1607
new            143
salvage        115
Name: condition, dtype: int64

In [14]:
st.header('Vehicle types by manufacturer')
#Our histogram
veh_type = px.histogram(cars_us,x='manufacturer', 
                   color='type')
#Showing the histogram with Streamlit
st.write(veh_type)

2023-05-07 22:25:40.473 
  command:

    streamlit run C:\Users\Anthony\AppData\Roaming\Python\Python311\site-packages\ipykernel_launcher.py [ARGUMENTS]


In [15]:
st.header('Histogram of condition vs model_year')
#And here's the histogram, colored by condition to match the header:
con_vs_yr = px.histogram(cars_us, x='model_year', 
                         color='condition')
#Let's show that histogram
st.write(con_vs_yr)

In [16]:
st.header('Scatterplot Example')
example = px.scatter(cars_us,x='price',y='odometer',color='type')
st.write(example)

In [17]:
example
#While this raises an error (nbformat is not installed), we can still see the scatterplot.

ValueError: Mime type rendering requires nbformat>=4.2.0 but it is not installed

Graph assessment:

From what we can gather, the median of the odometer appears as a thin horizontal line of dots at about 113k miles. This line is an artifact of filling the missing odometer values with the median. On another note, the densest region of the scatterplot appears to extend up to around 415,000 miles on the y-axis and roughly 60,000 dollars on the x-axis.
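A line of dots at the median is what median imputation would be expected to produce: every filled row lands on exactly the same odometer value. A minimal sketch on invented readings:

```python
import numpy as np
import pandas as pd

# Hypothetical odometer readings with three gaps
odometer = pd.Series([81000.0, np.nan, 120179.0, np.nan, 74300.0, np.nan, 241000.0])

median_value = odometer.median()       # NaNs are skipped by default
filled = odometer.fillna(median_value)

# Every imputed row shares one y-value, which shows up as a horizontal band in a scatterplot
n_at_median = int((filled == median_value).sum())
print(median_value, n_at_median)
```

If the band is distracting, one option would be to plot only the rows whose odometer was originally recorded.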

In [None]:
st.header('Price comparison between two different manufacturers')      
manufac_list = sorted(cars_us['manufacturer'].unique())                 #Our list of car manufacturers
manufacturer_1 = st.selectbox(                                          #Gets users inputs from dropdown menu
                            label= 'Choose the first manufacturer',     #Title of the select box
                            options=manufac_list,                       #Options listed in select box
                            index=manufac_list.index('ford'))           #Default pre-selected option

manufacturer_2 = st.selectbox(                                          #Repeat practice for second dropdown
                            label= 'Choose the second manufacturer',     
                            options=manufac_list,                       
                            index=manufac_list.index('chevrolet'))
mask_filter = (cars_us['manufacturer'] == manufacturer_1) | (cars_us['manufacturer'] == manufacturer_2)
cars_us_filtered = cars_us[mask_filter]
normalize = st.checkbox('Normalize histogram', value=True)
#Show percentages when the box is checked, raw counts otherwise
histnorm = 'percent' if normalize else None
st.write(px.histogram(cars_us_filtered,
                    x='price',
                    nbins=30,
                    color='manufacturer',
                    histnorm=histnorm,
                    barmode='overlay'))