## Exploratory Data Analysis

### Introduction

In this project, I will be developing and deploying a web application to a cloud service to be viewed by the public.

I have a used car inventory dataset. I will perform exploratory data analysis, clean the data, then develop the web application with Streamlit and deploy it to Render.

I hope to practice my software development skills, my project structure skills, and my data analysis skills with this task.


In [78]:
import pandas as pd
import plotly.express as px

In [79]:
#read data into dataframe
file_path = r'..\vehicles_us.csv'
df_raw = pd.read_csv(file_path)

In [80]:
#check the data for things we would want to fix
display(df_raw.sample(10))
df_raw.info()

Unnamed: 0,price,model_year,model,condition,cylinders,fuel,odometer,transmission,type,paint_color,is_4wd,date_posted,days_listed
44138,32500,2019.0,ford f150 supercrew cab xlt,excellent,6.0,gas,11381.0,automatic,pickup,white,1.0,2019-03-21,45
6362,3995,2012.0,chrysler 200,excellent,4.0,gas,198000.0,automatic,sedan,black,,2019-02-08,5
45137,2000,2003.0,volkswagen jetta,good,4.0,gas,207000.0,automatic,sedan,red,,2018-12-27,60
34122,8000,2013.0,toyota camry,good,4.0,gas,166000.0,automatic,sedan,silver,,2018-06-06,40
33106,2995,2005.0,ford taurus,good,6.0,gas,100702.0,automatic,sedan,,,2018-06-29,120
12918,6500,2010.0,toyota camry,good,6.0,gas,111153.0,automatic,sedan,grey,,2018-10-21,25
24016,12950,2008.0,gmc sierra 2500hd,good,8.0,gas,129029.0,automatic,truck,,1.0,2019-02-25,25
15556,9900,2011.0,chevrolet suburban,excellent,,gas,,automatic,SUV,grey,1.0,2018-11-15,77
11367,12999,2015.0,chevrolet equinox,excellent,6.0,gas,86853.0,automatic,SUV,,,2018-10-28,31
39970,11900,2007.0,gmc yukon,excellent,8.0,gas,165000.0,automatic,SUV,black,1.0,2019-04-13,43


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51525 entries, 0 to 51524
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   price         51525 non-null  int64  
 1   model_year    47906 non-null  float64
 2   model         51525 non-null  object 
 3   condition     51525 non-null  object 
 4   cylinders     46265 non-null  float64
 5   fuel          51525 non-null  object 
 6   odometer      43633 non-null  float64
 7   transmission  51525 non-null  object 
 8   type          51525 non-null  object 
 9   paint_color   42258 non-null  object 
 10  is_4wd        25572 non-null  float64
 11  date_posted   51525 non-null  object 
 12  days_listed   51525 non-null  int64  
dtypes: float64(4), int64(2), object(7)
memory usage: 5.1+ MB


In [81]:
#Check the data for duplicates
dups = df_raw.duplicated(subset=['model','price','model_year','cylinders','fuel','transmission','type','is_4wd','paint_color'])

dup_rows = df_raw[dups]

dup_value_counts = dup_rows.groupby(['model','price','model_year','cylinders','fuel','transmission','type','is_4wd','paint_color']).size()

print(dup_value_counts)

model              price  model_year  cylinders  fuel  transmission  type    is_4wd  paint_color
acura tl           8499   2011.0      6.0        gas   automatic     sedan   1.0     grey           5
bmw x5             4500   2008.0      8.0        gas   automatic     SUV     1.0     blue           1
                   6000   2004.0      6.0        gas   automatic     SUV     1.0     black          1
                   6500   2007.0      6.0        gas   manual        SUV     1.0     black          1
                   8888   2008.0      6.0        gas   automatic     SUV     1.0     silver         4
                                                                                                   ..
toyota tundra      28995  2015.0      8.0        gas   automatic     pickup  1.0     grey           2
                   30487  2015.0      8.0        gas   automatic     truck   1.0     silver         1
                   30898  2016.0      8.0        gas   automatic     truck   1.0     gr

In [82]:
#count the duplicate rows across the same subset of columns (odometer is not part of the subset)
df_raw.duplicated(subset=['model','price','model_year','cylinders','fuel','transmission','type','is_4wd','paint_color']).sum()

9272

In [83]:
#further research of duplicate acuras
print(df_raw[(df_raw['model'] == 'acura tl') & (df_raw['model_year'] == 2011) & (df_raw['price'] == 8499) & (df_raw['cylinders'] == 6)])

       price  model_year     model  condition  cylinders fuel  odometer  \
11095   8499      2011.0  acura tl  excellent        6.0  gas  189000.0   
11145   8499      2011.0  acura tl  excellent        6.0  gas  189000.0   
11250   8499      2011.0  acura tl  excellent        6.0  gas  189000.0   
11267   8499      2011.0  acura tl  excellent        6.0  gas  189000.0   
11413   8499      2011.0  acura tl  excellent        6.0  gas  189000.0   
20439   8499      2011.0  acura tl  excellent        6.0  gas  189000.0   
20534   8499      2011.0  acura tl  excellent        6.0  gas  189000.0   
20721   8499      2011.0  acura tl  excellent        6.0  gas       NaN   

      transmission   type paint_color  is_4wd date_posted  days_listed  
11095    automatic  sedan        grey     1.0  2018-12-11           67  
11145    automatic  sedan        grey     1.0  2019-04-03           92  
11250    automatic  sedan         NaN     1.0  2018-12-21           17  
11267    automatic  sedan       

In [84]:
#further research of duplicate volkswagens
print(df_raw[(df_raw['model'] == 'volkswagen passat') & (df_raw['model_year'] == 2013) & (df_raw['price'] == 15995)])

       price  model_year              model  condition  cylinders    fuel  \
1377   15995      2013.0  volkswagen passat  excellent        4.0  diesel   
1413   15995      2013.0  volkswagen passat  excellent        4.0  diesel   
18874  15995      2013.0  volkswagen passat  excellent        4.0  diesel   
19989  15995      2013.0  volkswagen passat  excellent        4.0  diesel   
22510  15995      2013.0  volkswagen passat  excellent        4.0  diesel   
27036  15995      2013.0  volkswagen passat  excellent        4.0  diesel   
37619  15995      2013.0  volkswagen passat  excellent        4.0  diesel   
40137  15995      2013.0  volkswagen passat  excellent        4.0  diesel   
40184  15995      2013.0  volkswagen passat  excellent        4.0  diesel   
44600  15995      2013.0  volkswagen passat  excellent        NaN  diesel   
44601  15995      2013.0  volkswagen passat  excellent        4.0  diesel   

       odometer transmission   type paint_color  is_4wd date_posted  \
1377

### I believe I need to get rid of rows that have duplicate model, year, price and odometer reading, and prioritize keeping those with fewer NaN values

In [85]:
#sort the records so that those with fewer NaN values are on top
df_sorted = df_raw.iloc[df_raw.isnull().sum(axis=1).argsort()]

In [86]:
#drop duplicates based on the columns I determined and keep the first records based on the results from above
df = df_sorted.drop_duplicates(subset=['model','price','model_year','cylinders','fuel','transmission','type','is_4wd','paint_color'],keep='first')
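
The sort-then-drop idea above can be sketched on a toy frame. This is only an illustration under my own assumptions: the `toy` data and the `key_cols` subset are made up, and `kind='stable'` is an addition so that rows with the same number of NaN values keep their original order.

```python
import pandas as pd
import numpy as np

# Toy data (hypothetical): two listings of the same car, one with a missing odometer
toy = pd.DataFrame({
    'model': ['acura tl', 'acura tl', 'bmw x5'],
    'price': [8499, 8499, 4500],
    'odometer': [np.nan, 189000.0, 120000.0],
})

# Hypothetical key columns that define a duplicate
key_cols = ['model', 'price']

# Count missing values per row and sort so the most complete rows come first;
# a stable sort keeps the original order among rows with the same count
nan_counts = toy.isna().sum(axis=1)
order = nan_counts.to_numpy().argsort(kind='stable')
toy_sorted = toy.iloc[order]

# keep='first' now keeps the most complete row of each duplicate group
toy_dedup = toy_sorted.drop_duplicates(subset=key_cols, keep='first')
print(toy_dedup)
```

In this toy example the acura row with the known odometer reading survives and the one with the NaN is dropped.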

In [87]:
#new info
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 42253 entries, 25762 to 20969
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   price         42253 non-null  int64  
 1   model_year    38772 non-null  float64
 2   model         42253 non-null  object 
 3   condition     42253 non-null  object 
 4   cylinders     37213 non-null  float64
 5   fuel          42253 non-null  object 
 6   odometer      36494 non-null  float64
 7   transmission  42253 non-null  object 
 8   type          42253 non-null  object 
 9   paint_color   33855 non-null  object 
 10  is_4wd        20348 non-null  float64
 11  date_posted   42253 non-null  object 
 12  days_listed   42253 non-null  int64  
dtypes: float64(4), int64(2), object(7)
memory usage: 4.5+ MB


In [88]:
#The model field should be split up into two different columns so we can create visualizations or drop downs based on the brand
#split the column up into 5 columns by spaces
df[['make','model','filler','filler1','filler2']] = df['model'].str.split(' ',expand=True)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[['make','model','filler','filler1','filler2']] = df['model'].str.split(' ',expand=True)
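
The warning above is pandas pointing out that `df` may still be tied to the frame it was derived from. One common way to avoid it is to take an explicit `.copy()` of the de-duplicated result before adding columns. A minimal sketch on made-up data (the `listings` and `clean` names are hypothetical, not from this notebook):

```python
import pandas as pd

# Hypothetical stand-in for the sorted listings
listings = pd.DataFrame({'model': ['ford f-150', 'ford f-150', 'kia soul'],
                         'price': [28650, 28650, 4950]})

# Taking an explicit copy of the de-duplicated result makes it an independent
# DataFrame, so later column assignments are unambiguous
clean = listings.drop_duplicates(subset=['model', 'price'], keep='first').copy()

# Assigning a new column to the copy leaves the source frame untouched
clean['make'] = clean['model'].str.split(' ').str[0]
print(clean)
```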

In [89]:
#combine the last 4 columns back into the model column and strip the trailing spaces left by empty filler values
df['model'] = (df['model'] + ' ' + df['filler'].fillna('') + ' ' + df['filler1'].fillna('') + ' ' + df['filler2'].fillna('')).str.strip()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['model'] = df['model'] + ' ' + df['filler'].fillna('') + ' ' + df['filler1'].fillna('') + ' ' + df['filler2'].fillna('')


In [90]:
#drop the filler columns
df.drop(['filler','filler1','filler2'],axis=1,inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.drop(['filler','filler1','filler2'],axis=1,inplace=True)
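
The split, recombine and drop steps above could also be done in one pass by limiting the split to the first space. This is a sketch of that alternative on a few sample strings, not what the notebook ran; `raw_models` and `parts` are hypothetical names.

```python
import pandas as pd

# Hypothetical sample of the raw 'model' strings (make followed by model name)
raw_models = pd.Series(['ford f150 supercrew cab xlt', 'chrysler 200', 'gmc sierra 2500hd'])

# n=1 splits on the first space only, so the make and the rest of the name
# land in exactly two columns regardless of how many words the model has
parts = raw_models.str.split(' ', n=1, expand=True)
parts.columns = ['make', 'model']
print(parts)
```

Because the rest of the name is never broken apart, there are no filler columns to recombine and no stray spaces to clean up.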


In [91]:
#add an age column
df['age'] = 2024 - df['model_year']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['age'] = 2024 - df['model_year']
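
The age above is measured against a fixed year (2024). An alternative, sketched here on made-up rows and not used in this notebook, is to measure age at the time each listing was posted, using the year from `date_posted`. The `posts` frame and the `age_at_posting` column are hypothetical.

```python
import pandas as pd

# Hypothetical listings with the same column names as the dataset
posts = pd.DataFrame({'model_year': [2011.0, 2015.0],
                      'date_posted': ['2018-12-11', '2019-04-03']})

# Parse the posting date and take its year
posted_year = pd.to_datetime(posts['date_posted']).dt.year

# Age of the car when the listing went up
posts['age_at_posting'] = posted_year - posts['model_year']
print(posts)
```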


In [92]:
#preview new dataframe
df.sample(5)

Unnamed: 0,price,model_year,model,condition,cylinders,fuel,odometer,transmission,type,paint_color,is_4wd,date_posted,days_listed,make,age
47379,4800,2005.0,corolla,excellent,,gas,104600.0,automatic,sedan,custom,,2018-09-05,70,toyota,19.0
14014,12000,2011.0,x5,excellent,6.0,gas,78000.0,automatic,SUV,,1.0,2019-02-10,20,bmw,13.0
46458,3950,2006.0,grand caravan,good,6.0,gas,168928.0,automatic,van,white,,2019-01-04,6,dodge,18.0
49005,27995,2014.0,silverado 2500hd,excellent,8.0,diesel,132864.0,automatic,truck,white,1.0,2019-02-10,23,chevrolet,10.0
16223,4895,,fusion,excellent,4.0,gas,,automatic,sedan,silver,,2019-04-12,26,ford,


##### Missing Values
###### - There are a good number of NaN model years. I don't love the idea of filling them with 0, but I would like to convert the column to int, and I can always exclude the 0s from visualizations later on.
###### - There are a lot of NaN cylinder values as well. We could replace these with the most common cylinder count for the type of car. I will remove this if needed.
###### - There are lots of missing odometer values. I will replace these with 0.
###### - There are missing paint_color values. I could fill these with 'black', but I think it's fine to leave them as NaN.
###### - There are tens of thousands of missing is_4wd values. Checking the unique values of this field below shows it is either 1 or NaN, so I will fill the missing values with 0.
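
The trade-off in the first bullet (an integer column versus a placeholder 0) can also be handled with pandas' nullable integer dtype. A minimal sketch on made-up values, shown as an option rather than what this notebook does:

```python
import pandas as pd
import numpy as np

# Hypothetical model years with one missing value
years = pd.Series([2011.0, np.nan, 2015.0])

# 'Int64' (capital I) is pandas' nullable integer dtype: values become integers
# while the missing entry stays missing instead of turning into 0
years_int = years.astype('Int64')
print(years_int)

# Aggregations skip the missing value, so no placeholder 0 skews the result
print(years_int.min())
```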

In [93]:
#check unique values of is_4wd
print(df['is_4wd'].unique())

[ 1. nan]


In [94]:
#replace those missing values with 0
df['is_4wd'] = df['is_4wd'].fillna(0).astype(int)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['is_4wd'] = df['is_4wd'].fillna(0).astype(int)


In [95]:
#replace those missing values in odometer with 0
df['odometer'] = df['odometer'].fillna(0).astype(int)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['odometer'] = df['odometer'].fillna(0).astype(int)


In [96]:
#Fill missing model_year values and convert to int
df['model_year'] = df['model_year'].fillna(0).astype(int)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['model_year'] = df['model_year'].fillna(0).astype(int)


#### Edit

Filling the NaN's with the median cylinder by model/year instead of filling them by type of car.

In [97]:
#alternative method for filling the NaN cylinders. I will grab the median cylinder from the dataframe grouped by model and model_year and use that value
def cylinder_estimation(df):
    # Calculate median cylinders grouped by car model and model year
    median_cylinders = df.groupby(['model', 'model_year'])['cylinders'].transform('median')

    # Fill NaN values with the calculated median
    df['cylinders'] = df['cylinders'].fillna(median_cylinders)
    
    return df

df = cylinder_estimation(df)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['cylinders'] = df['cylinders'].fillna(median_cylinders)


In [98]:
#checking if the NaN's are gone
df[df['cylinders'].isna()].count()

price           26
model_year      26
model           26
condition       26
cylinders        0
fuel            26
odometer        26
transmission    26
type            26
paint_color     23
is_4wd          26
date_posted     26
days_listed     26
make            26
age             26
dtype: int64

After replacing the NaN cylinders with the median for each model and year, there are still 26 cars with NaN cylinders. This is probably because those model/year groups contain only a few listings and none of them has a cylinder value. I fill the remaining rows with my original function `cyl`, which assigns a default based on the average cylinders for each car type.
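
The two-stage fill described above can be sketched on a toy frame. This is a variant under my own assumptions: the `cars` data is made up, and the second stage uses the median of the car type rather than the hand-picked defaults used in this notebook.

```python
import pandas as pd
import numpy as np

# Hypothetical listings: the 'soul' group has no known cylinder value at all
cars = pd.DataFrame({
    'model': ['camry', 'camry', 'camry', 'soul'],
    'model_year': [2013, 2013, 2013, 2015],
    'type': ['sedan', 'sedan', 'sedan', 'sedan'],
    'cylinders': [4.0, 4.0, np.nan, np.nan],
})

# Stage 1: median within each (model, model_year) group
by_model = cars.groupby(['model', 'model_year'])['cylinders'].transform('median')
cars['cylinders'] = cars['cylinders'].fillna(by_model)

# Stage 2: groups with no known value are still NaN, so fall back to the
# median of the broader vehicle type
by_type = cars.groupby('type')['cylinders'].transform('median')
cars['cylinders'] = cars['cylinders'].fillna(by_type)
print(cars)
```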

In [99]:
#check the unique values of cylinder and type to see what we need to assign default values to.
print(df['cylinders'].unique())
print(df['type'].unique())

[ 6.   8.  10.   4.   5.   3.  12.   7.   nan  9.   4.5]
['pickup' 'SUV' 'truck' 'other' 'coupe' 'wagon' 'sedan' 'convertible'
 'offroad' 'van' 'hatchback' 'mini-van' 'bus']


In [100]:
#see the average cylinders for each type of car.
df_cyl_avg = df[['type','cylinders']]
df_cyl_avg.groupby(['type']).mean().round()

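#I will round convertibles down to 6, pickups + trucks up to 8 since no 7 cylinder engines exist in the raw data, and buses to 10 since 9 doesn't exist either. (The 7, 9 and 4.5 in the unique values above are most likely medians produced by the previous fill.)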

Unnamed: 0_level_0,cylinders
type,Unnamed: 1_level_1
SUV,6.0
bus,8.0
convertible,7.0
coupe,6.0
hatchback,4.0
mini-van,6.0
offroad,6.0
other,6.0
pickup,7.0
sedan,5.0


In [101]:
#create a function for assigning default values of cylinders for each car type.
def cyl(x):
    if x in ['SUV','convertible','coupe','mini-van','offroad','van']:
        return 6
    elif x == 'bus':
        return 10
    elif x == 'hatchback':
        return 4
    elif x in ['pickup','truck']:
        return 8
    elif x in ['sedan','wagon']:
        return 5
    else:
        return 6 #this is for 'other' type
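
The same defaults can also be written as a dictionary lookup, which keeps every type and its default side by side. This is a sketch of an equivalent formulation of the intended defaults; `DEFAULT_CYLINDERS` and `default_cylinders` are hypothetical names, not part of this notebook.

```python
# Intended default cylinder counts per vehicle type, as listed in cyl
DEFAULT_CYLINDERS = {
    'SUV': 6, 'convertible': 6, 'coupe': 6, 'mini-van': 6, 'offroad': 6, 'van': 6,
    'bus': 10,
    'hatchback': 4,
    'pickup': 8, 'truck': 8,
    'sedan': 5, 'wagon': 5,
}

def default_cylinders(vehicle_type):
    # Hypothetical helper: unknown types (including 'other') fall back to 6
    return DEFAULT_CYLINDERS.get(vehicle_type, 6)

print(default_cylinders('sedan'))
print(default_cylinders('other'))
```

With pandas, a dictionary like this could be passed to `Series.map` and followed by `fillna(6)` to cover types that are not in the dictionary.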

In [102]:
df['cylinders'] = df['cylinders'].fillna(df['type'].apply(cyl)).astype(int)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['cylinders'] = df['cylinders'].fillna(df['type'].apply(cyl)).astype(int)


##### Data Types

In [103]:
df['price'] = df['price'].astype(float)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['price'] = df['price'].astype(float)


In [104]:
#checking the dataframe to make sure data types and missing values look good and look at a sample
display(df.sample(10))
df.info()

Unnamed: 0,price,model_year,model,condition,cylinders,fuel,odometer,transmission,type,paint_color,is_4wd,date_posted,days_listed,make,age
50017,12791.0,2007,sierra 1500,good,8,gas,124496,automatic,truck,white,1,2018-05-08,89,gmc,17.0
51459,7490.0,2010,edge,excellent,6,gas,124066,automatic,SUV,grey,1,2018-10-15,94,ford,14.0
16337,28650.0,2016,f-150,excellent,8,gas,49107,automatic,truck,silver,1,2018-06-10,42,ford,8.0
1744,2995.0,2001,altima,excellent,4,gas,125000,automatic,sedan,,0,2019-03-15,34,nissan,23.0
34721,14250.0,2007,tundra,excellent,8,gas,96966,automatic,pickup,,0,2019-03-21,22,toyota,17.0
35307,14999.0,2007,wrangler,excellent,6,gas,136777,manual,SUV,black,1,2019-02-17,28,jeep,17.0
34979,20000.0,2017,explorer,good,6,gas,24842,automatic,SUV,blue,1,2019-02-22,11,ford,7.0
42820,10995.0,2006,silverado 1500,good,8,gas,0,automatic,truck,white,1,2018-11-15,17,chevrolet,18.0
6183,3793.0,2008,impala,excellent,6,gas,165831,automatic,sedan,silver,0,2018-05-04,48,chevrolet,16.0
27973,3995.0,2005,1500,good,8,gas,220917,automatic,truck,black,0,2018-11-11,23,ram,19.0


<class 'pandas.core.frame.DataFrame'>
Index: 42253 entries, 25762 to 20969
Data columns (total 15 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   price         42253 non-null  float64
 1   model_year    42253 non-null  int32  
 2   model         42253 non-null  object 
 3   condition     42253 non-null  object 
 4   cylinders     42253 non-null  int32  
 5   fuel          42253 non-null  object 
 6   odometer      42253 non-null  int32  
 7   transmission  42253 non-null  object 
 8   type          42253 non-null  object 
 9   paint_color   33855 non-null  object 
 10  is_4wd        42253 non-null  int32  
 11  date_posted   42253 non-null  object 
 12  days_listed   42253 non-null  int64  
 13  make          42253 non-null  object 
 14  age           38772 non-null  float64
dtypes: float64(2), int32(4), int64(1), object(8)
memory usage: 4.5+ MB


## Visualizations

#### Histograms

In [105]:
#Price distribution
#There aren't many cars above $50k, so I excluded them from the distribution.
fig1 = px.histogram(df,
                    x='price',
                    color='type',
                    range_x=[0,50000],
                    nbins=500,
                    opacity=.6,
                    title='<b> Price Distribution by Type of Car </b>',
                    template='plotly_dark')

fig1.update_layout(yaxis_title='Amount of Cars',xaxis_title='Price (USD)',height=800)
fig1.show()

##### We can tell from the above histogram that most of the cars fall in the $2k to $10k range. The higher prices are mostly occupied by trucks and SUVs, but this data doesn't contain many high-priced vehicles. Few sedans appear at the high end of the price range.

In [106]:
#Brand distribution of modern luxury cars

#create a filtered dataframe of luxury models between 2010 and 2020
filtered_df_year = df[(df['model_year'] >= 2010) & (df['model_year'] <= 2020) & (df['make'].isin(['bmw','acura','mercedes-benz','cadillac','buick','lexus','audi','lincoln']))]

#plot histogram of brands with a color filter of years
fig2 = px.histogram(filtered_df_year,
                    x='make',
                    color='model_year',
                    title='<b> Modern Luxury Car Model Year Distribution </b>',
                    template='plotly_dark')

fig2.update_layout(yaxis_title='Amount in Inventory',xaxis_title='Brand',height=800)

fig2.show()

#### Based on the above histogram, Buick has the biggest stock of modern luxury cars, and Mercedes-Benz only has 34, all from one year. It appears that most of the luxury vehicles in the inventory come from 2011 and 2012.

In [107]:
#Create year ranges as eras and plot the distribution of all cars in these ranges

#create year_range function to pass the dataframe to
def year_range(x):
    if 1920 <= x <= 1940:
        return '1920-1940'
    elif 1940 < x <= 1960:
        return '1941-1960'
    elif 1960 < x <= 1980:
        return '1961-1980'
    elif 1980 < x <= 2000:
        return '1981-2000'
    elif 2000 < x <= 2020:
        return '2001-2020'
    else:
        return 'unknown'
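
The same eras can also be assigned in one vectorized step with `pd.cut`. This is a sketch on made-up years, not the notebook's method; the `edges` and `labels` lists are my own restatement of the ranges in `year_range`.

```python
import pandas as pd

# Hypothetical model years, with 0 standing in for a missing year as in the notebook
years = pd.Series([1935, 1960, 1961, 2015, 0])

# Bin edges are right-inclusive by default, i.e. (1940, 1960], (1960, 1980], ...
# The first edge is 1919 so that 1920 itself falls in the first bin
edges = [1919, 1940, 1960, 1980, 2000, 2020]
labels = ['1920-1940', '1941-1960', '1961-1980', '1981-2000', '2001-2020']

# Values outside every bin come back as NaN; label them 'unknown'
binned = pd.cut(years, bins=edges, labels=labels)
year_ranges = binned.cat.add_categories('unknown').fillna('unknown')
print(year_ranges.tolist())
```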

In [108]:
#create year_range column in dataframe. I will add this to my functions and to the dataframe for the app
df['year_range'] = df['model_year'].apply(year_range)



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



In [109]:
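#relabel the condition column based on odometer readings (this overwrites the original condition values). The thresholds are opinion based, and missing odometer values were filled with 0 above, so those rows are labeled 'new'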
def condition(x):
    if 0 <= x <= 200:
        return 'new'
    elif 200 < x <= 5000:
        return 'like new'
    elif 5000 < x <= 20000:
        return 'good'
    elif 20000 < x <= 50000:
        return 'used'
    elif 50000 < x <= 100000:
        return 'very used'
    else:
        return 'heavily used'
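
The threshold chain above can also be expressed as a sorted list of upper bounds searched with `bisect`, which keeps the cut-offs in one place. A minimal sketch assuming non-negative mileage; `BOUNDS`, `LABELS` and `mileage_label` are hypothetical names.

```python
from bisect import bisect_left

# Upper mileage bound (inclusive) of each label, mirroring the thresholds above
BOUNDS = [200, 5000, 20000, 50000, 100000]
LABELS = ['new', 'like new', 'good', 'used', 'very used', 'heavily used']

def mileage_label(miles):
    # bisect_left returns the index of the first bound >= miles,
    # which is also the index of the matching label; anything above
    # the last bound gets the final label. Assumes miles >= 0.
    return LABELS[bisect_left(BOUNDS, miles)]

for miles in (0, 200, 201, 100000, 100001):
    print(miles, mileage_label(miles))
```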

In [110]:
#add condition column
df['condition'] = df['odometer'].apply(condition)



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



In [111]:
#plot histogram of brands colored by year range
fig3 = px.histogram(df,
                    x='make',
                    color='year_range',
                    title='<b> Distribution of Brands by Year Range </b>',
                    template='plotly_dark')

fig3.update_layout(yaxis_title='Amount in Inventory',xaxis_title='Brand',height=800)

fig3.show()

##### Based on the above histogram, the vast majority of the inventory is from the last 20 years. There are a fair number of unknown model years, but there are not many cars from before 2000 to choose from.

#### Scatterplots

In [112]:
df[df['age'].isna()]

Unnamed: 0,price,model_year,model,condition,cylinders,fuel,odometer,transmission,type,paint_color,is_4wd,date_posted,days_listed,make,age,year_range
30394,13995.0,0,cr-v,very used,4,gas,78633,automatic,wagon,grey,1,2019-03-22,23,honda,,unknown
33922,17500.0,0,traverse,very used,6,gas,58300,automatic,SUV,silver,1,2018-11-22,66,chevrolet,,unknown
33231,19995.0,0,acadia,very used,6,gas,54850,automatic,SUV,silver,1,2018-07-06,31,gmc,,unknown
29655,14990.0,0,silverado 1500,heavily used,8,gas,147615,automatic,truck,grey,1,2018-05-22,36,chevrolet,,unknown
29831,9000.0,0,impreza,used,4,gas,34000,automatic,hatchback,grey,1,2019-01-11,33,subaru,,unknown
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17967,4950.0,0,soul,new,4,gas,0,other,wagon,,0,2019-04-11,31,kia,,unknown
23376,3500.0,0,4runner,new,6,gas,0,automatic,SUV,,0,2019-03-28,57,toyota,,unknown
43039,4200.0,0,fusion,new,4,gas,0,automatic,sedan,,0,2019-01-10,11,ford,,unknown
19877,34900.0,0,benze sprinter 2500,new,6,diesel,0,automatic,van,,0,2018-10-16,44,mercedes-benz,,unknown


In [113]:
#making a scatter plot to see if the mileage of the car has an effect on the price
fig4 = px.scatter(df[df['age'].notna()],
                  x='odometer',
                  y='price',
                  color='age',
                  title='<b> Mileage vs. Price </b>',
                  template='plotly_dark')

fig4.update_xaxes(range=[0, 200000])
fig4.update_yaxes(range=[0, 75000])
fig4.update_layout(yaxis_title='Price (USD)',xaxis_title='Mileage',height=900)
fig4.show()

##### You can see from the scatter plot above that the higher the mileage, the slightly lower the price tends to be. There aren't many cars under $10k with fewer than 40k miles. Toward the higher mileages in the scatter plot, the cars are typically very old, but there are a fair number of old cars with fewer than 100k miles.
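
A visual impression like the one above can be backed up with a correlation coefficient. The sketch below uses made-up numbers purely to show the call; it says nothing about the actual value for vehicles_us.csv.

```python
import pandas as pd

# Made-up listings for illustration only; not values from vehicles_us.csv
sample = pd.DataFrame({
    'odometer': [10000, 50000, 90000, 130000, 170000],
    'price': [30000, 22000, 15000, 9000, 5000],
})

# Pearson correlation between mileage and price: values near -1 mean price
# falls steadily as mileage rises, values near 0 mean little linear relationship
r = sample['odometer'].corr(sample['price'])
print(round(r, 2))
```

On the real data, rows where the odometer was filled with 0 would need to be excluded first, since they would otherwise count as zero-mileage cars.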

In [114]:
#making a scatter plot to see if days listed on the inventory has any effect on the price
fig5 = px.scatter(df,
                  x='days_listed',
                  y='price',
                  color='condition',
                  labels={'days_listed':'Days Listed','price':'Price'},
                  hover_data=['price','days_listed','condition','odometer'],
                  title='<b> Days Listed vs. Price </b>',
                  template='plotly_dark')

fig5.update_xaxes(range=[0, 200])
fig5.update_yaxes(range=[0, 60000])
fig5.update_layout(height=900)
fig5.show()

##### You can see above that most of the cars in the inventory haven't been listed for more than 100 days. Prices for cars listed 100 days or fewer do not vary that much with the number of days listed.

##### But below, if you take out the majority of the inventory and limit to 100-200 days on the site, you can see that the cars that have been sitting unsold for a while are typically much cheaper. This may point to problems with them.

In [115]:
fig6 = px.scatter(df,
                  x='days_listed',
                  y='price',
                  color='age',
                  title='<b> Days Listed between 100-200 Days vs. Price </b>',
                  template='plotly_dark')

fig6.update_xaxes(range=[100, 200])
fig6.update_yaxes(range=[0, 75000])
fig6.update_layout(yaxis_title='Price (USD)',xaxis_title='Days Listed',height=900)
fig6.show()
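
One way to check the impression that long-listed cars are cheaper is to compare the median price of the two groups directly. This is a sketch on made-up numbers; the `listing_age` column and its labels are hypothetical.

```python
import pandas as pd

# Made-up listings for illustration only
sample = pd.DataFrame({
    'days_listed': [5, 40, 80, 120, 150, 190],
    'price': [12000, 9000, 15000, 7000, 4000, 6000],
})

# Label each listing by whether it has been up for more than 100 days
sample['listing_age'] = sample['days_listed'].gt(100).map(
    {True: 'over 100 days', False: '100 days or fewer'})

# Compare the typical price of each group; the median is less sensitive
# to a few very expensive vehicles than the mean
median_by_group = sample.groupby('listing_age')['price'].median()
print(median_by_group)
```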

### Conclusion

I reviewed the vehicles_us.csv data, created functions to clean the data, and created some visualizations to get a feel for the data I was working with.

I see that there is a large variety of cars ranging from new to old, used to like new, cheap to expensive, and all types of different brands.

In my application, I allowed the user to filter the inventory by the make and the condition they would like to shop for, as well as letting them filter by the year of the model and select new listings only.

I also gave the users some visualizations, like the types of cars available by manufacturer, and allowed them to compare two different brands.

I provided a scatter plot of Mileage vs. Price, which aims to show that the more miles a car has, the cheaper it is likely to be in the market.

