# Car EDA

In this project, we will:
- Import the **vehicles_us.csv** with **pandas**
- Perform a simple EDA with **plotly-express**

## Import csv
First, we'll open up this csv and see what we're working with.

In [1]:
import pandas as pd

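# Raw string, so the backslashes in the Windows path are taken literally.
df = pd.read_csv(r'C:\Users\Michael\project1\vehicles_us.csv')

print(df.head())          # first five rows
print(df.isnull().sum())  # missing values per column
df.info()                 # prints its own summary and returns None, so no print() is needed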

   price  model_year           model  condition  cylinders fuel  odometer  \
0   9400      2011.0          bmw x5       good        6.0  gas  145000.0   
1  25500         NaN      ford f-150       good        6.0  gas   88705.0   
2   5500      2013.0  hyundai sonata   like new        4.0  gas  110000.0   
3   1500      2003.0      ford f-150       fair        8.0  gas       NaN   
4  14900      2017.0    chrysler 200  excellent        4.0  gas   80903.0   

  transmission    type paint_color  is_4wd date_posted  days_listed  
0    automatic     SUV         NaN     1.0  2018-06-23           19  
1    automatic  pickup       white     1.0  2018-10-19           50  
2    automatic   sedan         red     NaN  2019-02-07           79  
3    automatic  pickup         NaN     NaN  2019-03-22            9  
4    automatic   sedan       black     NaN  2019-04-02           28  
price               0
model_year       3619
model               0
condition           0
cylinders        5260
fuel   

OK, so the data is not clean. I'm concerned about the thousands of missing values in **model_year**, **cylinders**, **odometer**, **paint_color**, and **is_4wd**. Out of 51,525 entries, a large fraction are missing in those columns. Let's try dropping every row that contains a null, and hopefully we'll have enough data left to work with.
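To put a number on "a large fraction", we could look at the share of missing values per column and at how many rows would survive a full `dropna()`. Here is a minimal sketch on a toy frame (the `toy` DataFrame and its values are made up for illustration; with the real data you would use `df` instead):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the listings table; the values are made up for illustration.
toy = pd.DataFrame({
    "price": [9400, 25500, 5500, 1500],
    "model_year": [2011.0, np.nan, 2013.0, 2003.0],
    "paint_color": [np.nan, "white", "red", np.nan],
})

# isna() marks missing cells; the mean of booleans is the fraction missing per column.
missing_share = toy.isna().mean()
print(missing_share)

# Fraction of rows that would survive dropna(), i.e. rows with no missing value at all.
rows_kept = len(toy.dropna()) / len(toy)
print(rows_kept)
```

If `rows_kept` turns out small, that is a hint that dropping every incomplete row is an expensive choice.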

In [2]:
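# Drop every row with at least one missing value, then renumber the index from 0.
clean_df = df.dropna().reset_index(drop=True)

print(clean_df.head())
print(clean_df.isnull().sum())
clean_df.info()  # prints its own summary and returns None, so no print() is needed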

   price  model_year                     model  condition  cylinders fuel  \
0  14990      2014.0              chrysler 300  excellent        6.0  gas   
1  15990      2013.0               honda pilot  excellent        6.0  gas   
2  19500      2011.0  chevrolet silverado 1500  excellent        8.0  gas   
3  12990      2009.0                 gmc yukon  excellent        8.0  gas   
4  14990      2010.0                  ram 1500  excellent        8.0  gas   

   odometer transmission    type paint_color  is_4wd date_posted  days_listed  
0   57954.0    automatic   sedan       black     1.0  2018-06-20           15  
1  109473.0    automatic     SUV       black     1.0  2019-01-07           68  
2  128413.0    automatic  pickup       black     1.0  2018-09-17           38  
3  132285.0    automatic     SUV       black     1.0  2019-01-31           24  
4  130725.0    automatic  pickup         red     1.0  2018-12-30           13  
price           0
model_year      0
model           0
con

Perfect. We've removed all of the nulls, and even though we've eliminated over two-thirds of our data, I believe 14,852 entries is still plenty for a simple EDA. We'll just have to bear in mind that our statistical population is now much smaller. Now let's visualize our DataFrame.

## EDA
Next, we'll make a couple of histograms and a scatterplot.

In [5]:
import plotly.express as px

histogram1 = px.histogram(clean_df, x='paint_color', title='Color Distribution')
histogram1.show()

It looks like people like white cars the most. Black comes in second, followed by silver and blue.

In [8]:
histogram2 = px.histogram(clean_df, x='price', title='Price Distribution')
histogram2.show()

The price distribution is heavily skewed to the right (a long tail toward high prices), suggesting extreme outliers among the expensive cars.
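The direction of a skew can also be checked numerically instead of by eye. Below is a minimal sketch, assuming a small made-up `prices` series rather than the real column: a positive `Series.skew()` together with a mean above the median points to a long tail toward high values.

```python
import pandas as pd

# Made-up prices: most are modest, one is very expensive (a long tail toward high values).
prices = pd.Series([5000, 7000, 9000, 12000, 15000, 120000])

skewness = prices.skew()        # sample skewness; positive means a long right tail
mean_price = prices.mean()
median_price = prices.median()

print(f"skewness={skewness:.2f}, mean={mean_price:.0f}, median={median_price:.0f}")
```

With the real data, the same three calls could be run on `clean_df['price']`.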

In [13]:
scatterplot = px.scatter(clean_df, x='price', y='model_year', title='Price correlation')
scatterplot.show()

Here we've compared each car's price against its model year. Newer cars make up the majority of the data points, and there is a positive correlation between model year and price. We can also point out the extreme outliers that distorted the distribution in the previous histogram. Some very high-priced cars from the late 90s to early 2000s (perhaps vintage collectibles) stretch the entire visual out to the 100,000 and greater price range, when in fact the majority of the data is in the 80,000 and lower range. This is where the difference between averages and medians comes into play. Averages are susceptible to being skewed by outliers. **Median statistics are more robust**. In fact, let's throw in an additional visualization to demonstrate.

In [18]:
boxplot = px.box(clean_df, x='price', title='Price IQR')
boxplot.show()

A better statistical summary would be the IQR (interquartile range), which is the difference between the 3rd and 1st quartiles. A quartile is a marker for a quarter of the data points. The median is the middle *data point* in our data set, while the IQR marks the *range* of the middle 50% of the data. (By the way, in the box plot above, the IQR is the *greyed-out* area in the box.)

- Q1 marks the first 25% of the data points.
- Q2 AKA the Median marks the first 50% of the data points.
- Q3 marks the first 75% of the data points.
- IQR = Q3-Q1
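The definitions above can be sketched with `Series.quantile`. The `prices` values here are made up so the quartiles are easy to check by hand; they are not taken from the data set.

```python
import pandas as pd

# Made-up prices, chosen only so the quartiles are easy to verify by hand.
prices = pd.Series([1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000])

# quantile() with a list returns a Series; unpacking it yields Q1, Q2 (the median), and Q3.
q1, q2, q3 = prices.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

print(f"Q1={q1}, median={q2}, Q3={q3}, IQR={iqr}")
```

Different tools use slightly different quartile conventions, so values read off a box plot may differ a little from what `quantile` returns.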

Our Q3 is 22,000 and our Q1 is 7,000, so our IQR is 22,000 - 7,000 = 15,000. This means the middle 50% of the cars in our data set are *actually* priced between 7,000 and 22,000 dollars, a spread of 15,000, with a median of 13,500. This kind of analysis is a great way to factor out extreme outliers.
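To see why quartile-based statistics hold up better against outliers than the average does, here is a small sketch with made-up prices. The `summarize` helper and both series are hypothetical, written only for this illustration.

```python
import pandas as pd

def summarize(prices: pd.Series) -> dict:
    """Return the mean, median, and IQR of a price series."""
    q1, q3 = prices.quantile([0.25, 0.75])
    return {"mean": prices.mean(), "median": prices.median(), "iqr": q3 - q1}

# Made-up prices for a handful of ordinary listings.
typical = pd.Series([7000, 10000, 13000, 16000, 22000])

# The same listings plus one extreme, made-up outlier.
with_outlier = pd.concat([typical, pd.Series([300000])], ignore_index=True)

before = summarize(typical)
after = summarize(with_outlier)

for name in ("mean", "median", "iqr"):
    print(f"{name}: {before[name]:.0f} -> {after[name]:.0f}")
```

In this toy example a single extreme listing moves the mean by tens of thousands, while the median and IQR shift by only a few thousand.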