<h1>Introduction</h1>
This analysis examines the fuel data in the vehicle spreadsheet provided for the project. It looks at how fuel type relates to transmission type and vehicle type, and compares the average price across those combinations.

These are the libraries needed for the analysis.

In [2]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from scipy import stats as st
from math import factorial as ft 
import plotly.express as px

This reads the data into a DataFrame, then displays the first five rows and a summary of the columns.

In [3]:
car_data = pd.read_csv('/Users/leahdeyoung/Desktop/GitHub/car-data-project-practicum/vehicles_us.csv', encoding = "utf-8")

display(car_data.head())
car_data.info()


,price,model_year,model,condition,cylinders,fuel,odometer,transmission,type,paint_color,is_4wd,date_posted,days_listed
0,9400,2011.0,bmw x5,good,6.0,gas,145000.0,automatic,SUV,,1.0,2018-06-23,19
1,25500,,ford f-150,good,6.0,gas,88705.0,automatic,pickup,white,1.0,2018-10-19,50
2,5500,2013.0,hyundai sonata,like new,4.0,gas,110000.0,automatic,sedan,red,,2019-02-07,79
3,1500,2003.0,ford f-150,fair,8.0,gas,,automatic,pickup,,,2019-03-22,9
4,14900,2017.0,chrysler 200,excellent,4.0,gas,80903.0,automatic,sedan,black,,2019-04-02,28


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51525 entries, 0 to 51524
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   price         51525 non-null  int64  
 1   model_year    47906 non-null  float64
 2   model         51525 non-null  object 
 3   condition     51525 non-null  object 
 4   cylinders     46265 non-null  float64
 5   fuel          51525 non-null  object 
 6   odometer      43633 non-null  float64
 7   transmission  51525 non-null  object 
 8   type          51525 non-null  object 
 9   paint_color   42258 non-null  object 
 10  is_4wd        25572 non-null  float64
 11  date_posted   51525 non-null  object 
 12  days_listed   51525 non-null  int64  
dtypes: float64(4), int64(2), object(7)
memory usage: 5.1+ MB


Data cleaning starts here. The model_year column has missing values, so I filled them with zeroes and converted the column to int.

In [4]:
print(car_data['model_year'].isna().sum())
car_data['model_year'] = car_data['model_year'].fillna(0)
car_data['model_year'] = car_data['model_year'].astype('int')
print(car_data['model_year'].isna().sum())

3619
0
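
As a point of comparison, here is a minimal sketch of the fill-with-zero approach next to pandas' nullable integer dtype, which keeps the gap as a missing value. The three-row Series is made up for illustration and is not taken from vehicles_us.csv.

```python
import pandas as pd

# Hypothetical miniature of the model_year column with one missing value.
model_year = pd.Series([2011.0, None, 2013.0], name="model_year")

# Approach used above: fill the gap with 0, then cast to a plain int.
filled = model_year.fillna(0).astype("int")

# Alternative: the nullable "Int64" dtype stores integers and keeps <NA>.
nullable = model_year.astype("Int64")

print(filled.tolist())        # the gap is now a 0
print(nullable.isna().sum())  # the gap is still counted as missing
print(filled.mean(), nullable.mean())
```

With the zero fill, any later average of model_year has to exclude the zeroes by hand; with the nullable dtype, pandas skips the missing value by default.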


If paint color was not listed, I added the string "Unknown".

In [5]:
print(car_data['paint_color'].isna().sum())
car_data['paint_color'] = car_data['paint_color'].fillna('Unknown')
print(car_data['paint_color'].isna().sum())

9267
0


If the number of cylinders was blank, I filled it in with zero. That way, if I decided to use this data for analysis, I knew to disregard anything that had a zero value. I also converted the datatype to int.

In [6]:
print(car_data['cylinders'].isna().sum())
car_data['cylinders'] = car_data['cylinders'].fillna(0)
car_data['cylinders'] = car_data['cylinders'].astype('int')
print(car_data['cylinders'].isna().sum())

5260
0


I wanted to fill in the blank fields for the odometer; however, I did not want to use zero as the placeholder value because a zero-value would indicate that there were no miles on the car. As a result, I chose 111111111 as a placeholder value as it was not used in the data. I then converted the datatype to int.

In [7]:
print(car_data['odometer'].isna().sum())
car_data['odometer'] = car_data['odometer'].fillna(111111111)
car_data['odometer'] = car_data['odometer'].astype('int')
print(car_data['odometer'].isna().sum())

7892
0
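
If the odometer column is used later, the placeholder rows would need to be filtered out first. A minimal sketch, assuming made-up readings and a hypothetical constant name `ODOMETER_PLACEHOLDER`:

```python
import pandas as pd

ODOMETER_PLACEHOLDER = 111111111  # same sentinel value as in the cell above

# Hypothetical rows: two real readings and one filled-in placeholder.
odometer = pd.Series([145000, ODOMETER_PLACEHOLDER, 88705], name="odometer")

# Keep only rows with a real reading before computing statistics.
known = odometer[odometer != ODOMETER_PLACEHOLDER]

print(odometer.mean())  # inflated by the placeholder
print(known.mean())     # average of the real readings only
```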


The is_4wd column is a 0/1 flag for false/true. I set the missing values to 0, assuming that a car with no value listed did not have four-wheel drive. Ultimately, I decided not to use this column in my analysis because of that uncertainty.

In [8]:
print(car_data['is_4wd'].isna().sum())
car_data['is_4wd'] = car_data['is_4wd'].fillna(0)
car_data['is_4wd'] = car_data['is_4wd'].astype('int')
print(car_data['is_4wd'].isna().sum())

25953
0


This checks that the DataFrame and its summary look correct after filling in missing values and converting datatypes.

In [9]:
display(car_data.sample(10))
car_data.info()

,price,model_year,model,condition,cylinders,fuel,odometer,transmission,type,paint_color,is_4wd,date_posted,days_listed
33528,24495,2013,chevrolet silverado 2500hd,excellent,0,diesel,213132,automatic,truck,red,1,2018-06-30,48
2801,5800,2006,chevrolet tahoe,excellent,8,gas,133000,automatic,SUV,Unknown,0,2018-05-09,71
16644,1,2018,ram 3500,excellent,10,gas,8530,other,truck,white,1,2018-11-08,17
40386,27999,2014,chevrolet silverado 1500,like new,8,gas,68000,automatic,truck,green,1,2019-01-10,19
24156,22500,2019,nissan frontier crew cab sv,good,6,gas,26344,other,pickup,Unknown,1,2019-03-15,118
49526,34000,2016,ford f250,excellent,8,gas,55500,automatic,pickup,brown,1,2018-11-29,34
22433,6900,2012,chevrolet malibu,good,4,gas,42000,automatic,sedan,silver,0,2019-04-02,63
34805,7500,2013,nissan altima,excellent,4,gas,111000,automatic,sedan,Unknown,0,2019-02-21,15
28974,22988,2014,ram 1500,like new,8,gas,75000,automatic,truck,black,1,2018-08-02,41
17871,8500,2011,chevrolet silverado 1500,excellent,8,gas,105125,automatic,pickup,black,1,2018-09-09,3


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51525 entries, 0 to 51524
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   price         51525 non-null  int64 
 1   model_year    51525 non-null  int64 
 2   model         51525 non-null  object
 3   condition     51525 non-null  object
 4   cylinders     51525 non-null  int64 
 5   fuel          51525 non-null  object
 6   odometer      51525 non-null  int64 
 7   transmission  51525 non-null  object
 8   type          51525 non-null  object
 9   paint_color   51525 non-null  object
 10  is_4wd        51525 non-null  int64 
 11  date_posted   51525 non-null  object
 12  days_listed   51525 non-null  int64 
dtypes: int64(6), object(7)
memory usage: 5.1+ MB


This checks the distinct values for fuel, transmission, and vehicle type to determine whether there are duplicate labels and whether analysis on these columns is a viable option.

In [10]:
print(car_data['fuel'].unique())
print(car_data['transmission'].unique())
print(car_data['type'].unique())

['gas' 'diesel' 'other' 'hybrid' 'electric']
['automatic' 'manual' 'other']
['SUV' 'pickup' 'sedan' 'truck' 'coupe' 'van' 'convertible' 'hatchback'
 'wagon' 'mini-van' 'other' 'offroad' 'bus']
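
Reading the lists by eye works for a handful of labels. One way to make the duplicate check mechanical is to normalize case and whitespace and compare the number of distinct values. This is a sketch on made-up labels, where the stray "suv " entry is hypothetical and does not appear in the real data:

```python
import pandas as pd

# Hypothetical labels, including two spellings of the same category.
vehicle_type = pd.Series(["SUV", "suv ", "pickup", "sedan"], name="type")

# Strip surrounding whitespace and lowercase every label.
normalized = vehicle_type.str.strip().str.lower()

# If the two counts differ, some labels differ only by case or whitespace.
print(vehicle_type.nunique(), normalized.nunique())
print(normalized.value_counts().to_dict())
```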


This groups the data by fuel and by transmission and counts the listings in each group, then creates a histogram for each column. The counts give context for the first scatter plot later.

In [11]:

# Count listings per category for reference
car_fuel = car_data.groupby('fuel')['fuel'].count()
car_transmission = car_data.groupby('transmission')['transmission'].count()
# px.histogram counts rows per category itself, so pass the full DataFrame
car_fuel_hist = px.histogram(car_data,
                             x='fuel',
                             title='Fuel Type Frequency')
car_trans_hist = px.histogram(car_data,
                              x='transmission',
                              title='Transmission Type Frequency')
car_fuel_hist.show()
car_trans_hist.show()

This groups the data by vehicle type and counts the listings in each group, then creates a histogram. The counts give context for the second scatter plot later.

In [12]:
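# Count listings per vehicle type for reference
car_type_frequency = car_data.groupby('type')['type'].count()
# px.histogram counts rows per category itself, so pass the full DataFrame
car_type_hist = px.histogram(car_data,
                             y='type',
                             title='Car Type Frequency')
car_type_hist.show()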

This groups the data by fuel and transmission and calculates the average price for each combination. The resulting Series is converted to a DataFrame and shown as a scatter plot.

In [17]:
grp = car_data.groupby(['fuel', 'transmission'])
# reset_index() turns the grouped Series into a DataFrame that already has a 'price' column
car_fuel_transmission = grp['price'].mean().reset_index()
car_fuel_scatter = px.scatter(car_fuel_transmission,
                              x='fuel',
                              y='price',
                              color='transmission',
                              labels={
                                'price': 'Average Price',
                                'fuel': 'Fuel Type',
                                'transmission': 'Transmission Type'},                              
                              title='Average Price per Type of Transmission and Fuel')
car_fuel_scatter.show()
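
To make the shape of the grouped result concrete, here is a small sketch of the same group, average, and reset_index steps on made-up listings. The rows and the names `listings` and `avg_price` are hypothetical.

```python
import pandas as pd

# Hypothetical listings, not taken from vehicles_us.csv.
listings = pd.DataFrame({
    "fuel": ["gas", "gas", "diesel", "diesel"],
    "transmission": ["automatic", "automatic", "automatic", "manual"],
    "price": [9400, 5500, 24495, 18000],
})

# Average price per (fuel, transmission) pair; reset_index() moves the
# group keys out of the index and back into ordinary columns.
avg_price = listings.groupby(["fuel", "transmission"])["price"].mean().reset_index()

print(list(avg_price.columns))
print(avg_price)
```

The result has one row per combination that occurs in the data, and the averaged column keeps the name `price`, which is the form `px.scatter` needs for its `x`, `y`, and `color` arguments.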

This groups the data by fuel and vehicle type and calculates the average price for each combination. The resulting Series is converted to a DataFrame and shown as a scatter plot.

In [18]:
grp2 = car_data.groupby(['fuel', 'type'])
# reset_index() turns the grouped Series into a DataFrame that already has a 'price' column
car_type = grp2['price'].mean().reset_index()
car_type_scatter = px.scatter(car_type,
                              x='type',
                              y='price',
                              color='fuel',
                              labels={
                                  'price': 'Average Price',
                                  'type': 'Vehicle Type',
                                  'fuel': 'Fuel Type'},
                              title='Average Price for Vehicle and Fuel Type')
car_type_scatter.show()
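
An average can rest on very few listings, so it may help to look at the group size next to each mean. This is a sketch only, on made-up rows; the column names `avg_price` and `n_listings` are hypothetical and not part of the analysis above.

```python
import pandas as pd

# Hypothetical listings, not taken from vehicles_us.csv.
listings = pd.DataFrame({
    "fuel": ["gas", "gas", "gas", "electric"],
    "type": ["sedan", "sedan", "SUV", "sedan"],
    "price": [5500, 6900, 9400, 30000],
})

# Named aggregation: one column for the mean, one for the group size.
summary = (
    listings.groupby(["fuel", "type"])["price"]
    .agg(avg_price="mean", n_listings="count")
    .reset_index()
)

print(summary)
```

A group with a single listing, like the electric sedan here, shows up with `n_listings` equal to 1, which is a signal to read its average with caution.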

<h1>Conclusion</h1>
Disregarding the "other" categories, diesel seems to be the more expensive option when it comes to fuel, and larger vehicles seem to be more expensive than smaller vehicles. Fuel type seems to have more impact on price than transmission type. Overall, I would not recommend basing a purchase on this data alone, but I think it is worth taking into account.