# Sprint 4 - Software Development Tools Project
# Exploratory Data Analysis

# 1. Introduction
## 1.1 Background
Using the `vehicles_us` dataset, this project provides additional practice with the common software engineering tasks learned in this sprint: developing a web application and deploying it to a cloud service (Render, in this case).

## 1.2 Stages
This project uses a single dataset, `../vehicles_us.csv`.

To analyze this dataset, the following steps will be followed:
1. Data Overview
2. Data Preprocessing
3. Data Analysis

# 2. Data Overview
Import the libraries to be used in this project.

In [1]:
# Loading all the libraries
import pandas as pd
from scipy import stats
import streamlit as st


Open the dataset in a dataframe, `vehicles_df`.

In [2]:
# Load the data
vehicles_df = pd.read_csv('../vehicles_us.csv')

# Show dataframe
vehicles_df

Unnamed: 0,price,model_year,model,condition,cylinders,fuel,odometer,transmission,type,paint_color,is_4wd,date_posted,days_listed
0,9400,2011.0,bmw x5,good,6.0,gas,145000.0,automatic,SUV,,1.0,2018-06-23,19
1,25500,,ford f-150,good,6.0,gas,88705.0,automatic,pickup,white,1.0,2018-10-19,50
2,5500,2013.0,hyundai sonata,like new,4.0,gas,110000.0,automatic,sedan,red,,2019-02-07,79
3,1500,2003.0,ford f-150,fair,8.0,gas,,automatic,pickup,,,2019-03-22,9
4,14900,2017.0,chrysler 200,excellent,4.0,gas,80903.0,automatic,sedan,black,,2019-04-02,28
...,...,...,...,...,...,...,...,...,...,...,...,...,...
51520,9249,2013.0,nissan maxima,like new,6.0,gas,88136.0,automatic,sedan,black,,2018-10-03,37
51521,2700,2002.0,honda civic,salvage,4.0,gas,181500.0,automatic,sedan,white,,2018-11-14,22
51522,3950,2009.0,hyundai sonata,excellent,4.0,gas,128000.0,automatic,sedan,blue,,2018-11-15,32
51523,7455,2013.0,toyota corolla,good,4.0,gas,139573.0,automatic,sedan,black,,2018-07-02,71


# 3. Data Preprocessing
Explore the dataset to get an initial understanding of the data and make any necessary corrections.

In [3]:
# Get basic info
vehicles_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51525 entries, 0 to 51524
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   price         51525 non-null  int64  
 1   model_year    47906 non-null  float64
 2   model         51525 non-null  object 
 3   condition     51525 non-null  object 
 4   cylinders     46265 non-null  float64
 5   fuel          51525 non-null  object 
 6   odometer      43633 non-null  float64
 7   transmission  51525 non-null  object 
 8   type          51525 non-null  object 
 9   paint_color   42258 non-null  object 
 10  is_4wd        25572 non-null  float64
 11  date_posted   51525 non-null  object 
 12  days_listed   51525 non-null  int64  
dtypes: float64(4), int64(2), object(7)
memory usage: 5.1+ MB


In [4]:
# Add a brand column to the dataframe
vehicles_df['brand'] = vehicles_df['model'].apply(lambda x: x.split()[0])

# Insert new column at index 3 so it shows after model in dataframe
vehicles_df.insert(3, 'brand', vehicles_df.pop('brand'))

# Show dataframe
vehicles_df

Unnamed: 0,price,model_year,model,brand,condition,cylinders,fuel,odometer,transmission,type,paint_color,is_4wd,date_posted,days_listed
0,9400,2011.0,bmw x5,bmw,good,6.0,gas,145000.0,automatic,SUV,,1.0,2018-06-23,19
1,25500,,ford f-150,ford,good,6.0,gas,88705.0,automatic,pickup,white,1.0,2018-10-19,50
2,5500,2013.0,hyundai sonata,hyundai,like new,4.0,gas,110000.0,automatic,sedan,red,,2019-02-07,79
3,1500,2003.0,ford f-150,ford,fair,8.0,gas,,automatic,pickup,,,2019-03-22,9
4,14900,2017.0,chrysler 200,chrysler,excellent,4.0,gas,80903.0,automatic,sedan,black,,2019-04-02,28
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
51520,9249,2013.0,nissan maxima,nissan,like new,6.0,gas,88136.0,automatic,sedan,black,,2018-10-03,37
51521,2700,2002.0,honda civic,honda,salvage,4.0,gas,181500.0,automatic,sedan,white,,2018-11-14,22
51522,3950,2009.0,hyundai sonata,hyundai,excellent,4.0,gas,128000.0,automatic,sedan,blue,,2018-11-15,32
51523,7455,2013.0,toyota corolla,toyota,good,4.0,gas,139573.0,automatic,sedan,black,,2018-07-02,71


I added a `'brand'` column, taken from the first word of each model name, to capture the manufacturer of each model for further analysis down the line. There are some missing values that will be addressed in the next section.
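
As an aside, the same brand extraction can be written with pandas string methods instead of `apply`, inserting the column after `'model'` in one step. This is a minimal sketch on a toy frame (`toy_df` and its rows are illustrative, not the real dataset), and it assumes the brand is always the first whitespace-separated word of the model name.

```python
import pandas as pd

# Toy frame standing in for vehicles_df; the rows are illustrative only
toy_df = pd.DataFrame({
    'price': [9400, 25500, 5500],
    'model_year': [2011.0, None, 2013.0],
    'model': ['bmw x5', 'ford f-150', 'hyundai sonata'],
})

# Take the first whitespace-separated token of each model name as the brand
brand = toy_df['model'].str.split().str[0]

# Place the new column right after 'model' without adding and popping it first
toy_df.insert(toy_df.columns.get_loc('model') + 1, 'brand', brand)

print(toy_df.columns.tolist())
print(toy_df['brand'].tolist())
```

Looking up the position with `get_loc` avoids hard-coding the index 3, which would silently point to a different place if the column order changed.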

## 3.1 Missing Values
Check for any missing values and fill them appropriately.

In [5]:
# Check for missing values
vehicles_df.isna().sum()

price               0
model_year       3619
model               0
brand               0
condition           0
cylinders        5260
fuel                0
odometer         7892
transmission        0
type                0
paint_color      9267
is_4wd          25953
date_posted         0
days_listed         0
dtype: int64

Let's fill the missing values in the `'is_4wd'` column first.

In [6]:
# Let's ensure that all the missing values in the 'is_4wd' column are supposed to be 0
# Check the unique values in this column
vehicles_df['is_4wd'].unique()

array([ 1., nan])

In [7]:
# Replace the missing values in 'is_4wd' with 0 and change type to boolean
vehicles_df['is_4wd'] = vehicles_df['is_4wd'].fillna(0).astype(bool)

# Check for missing values
vehicles_df.isna().sum()

price              0
model_year      3619
model              0
brand              0
condition          0
cylinders       5260
fuel               0
odometer        7892
transmission       0
type               0
paint_color     9267
is_4wd             0
date_posted        0
days_listed        0
dtype: int64

The `'is_4wd'` column had 25,953 missing values. The `unique` method shows that the only recorded value is 1, so the missing values most likely stand for 0 (not four-wheel drive). I filled them with 0 and converted the column to boolean.
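
The reasoning can be checked on a small example. The sketch below uses a toy series (`flag` is a made-up stand-in for the real column) and shows why the fill has to happen before the boolean conversion: converting first would turn every NaN into `True`.

```python
import numpy as np
import pandas as pd

# Toy series mimicking a flag column that only ever records 1.0 or NaN
flag = pd.Series([1.0, np.nan, np.nan, 1.0, np.nan], name='is_4wd')

# Count every distinct value, keeping NaN as its own bucket
counts = flag.value_counts(dropna=False)
print(counts)

# Treat NaN as "not 4wd" (an assumption, as in the text), then convert to bool
flag_bool = flag.fillna(0).astype(bool)
print(flag_bool.tolist())

# Converting without filling first maps NaN to True, which is not what we want
unfilled_bool = flag.astype(bool)
print(unfilled_bool.tolist())
```

`value_counts(dropna=False)` gives the same information as `unique()` plus the size of each bucket, which makes it easier to judge how much of the column is being imputed.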

Let's replace the missing values in the `'paint_color'`, `'cylinders'`, and `'model_year'` columns with the most frequent value (mode) of each column within each model. The median wouldn't make sense here: although `'cylinders'` and `'model_year'` are stored as numbers, all three columns are categorical in nature.

In [8]:
# Replace missing paint colors with the most frequent value for each model
vehicles_df['paint_color'] = vehicles_df.groupby('model')['paint_color'].transform(lambda x: x.fillna(x.mode()[0]))

# Check for missing values
vehicles_df.isna().sum()

price              0
model_year      3619
model              0
brand              0
condition          0
cylinders       5260
fuel               0
odometer        7892
transmission       0
type               0
paint_color        0
is_4wd             0
date_posted        0
days_listed        0
dtype: int64

In [9]:
# Replace missing cylinders with the most frequent value for each model
vehicles_df['cylinders'] = vehicles_df.groupby('model')['cylinders'].transform(lambda x: x.fillna(x.mode()[0]))

# Check for missing values
vehicles_df.isna().sum()

price              0
model_year      3619
model              0
brand              0
condition          0
cylinders          0
fuel               0
odometer        7892
transmission       0
type               0
paint_color        0
is_4wd             0
date_posted        0
days_listed        0
dtype: int64

In [10]:
# Replace missing model years with the most frequent value for each model
vehicles_df['model_year'] = vehicles_df.groupby('model')['model_year'].transform(lambda x: x.fillna(x.mode()[0]))

# Check for missing values
vehicles_df.isna().sum()

price              0
model_year         0
model              0
brand              0
condition          0
cylinders          0
fuel               0
odometer        7892
transmission       0
type               0
paint_color        0
is_4wd             0
date_posted        0
days_listed        0
dtype: int64
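
The three mode fills above follow the same pattern, and `x.mode()[0]` would raise an error for any model whose values are all missing. That did not happen with this dataset, but a more defensive version might look like the sketch below. `fill_with_group_mode` is a hypothetical helper, and the toy data is illustrative only.

```python
import numpy as np
import pandas as pd


def fill_with_group_mode(df, group_col, target_col):
    """Fill NaNs in target_col with the most frequent value of each group.

    Hypothetical helper: groups with no observed value are left as NaN
    instead of failing on mode()[0].
    """
    def _fill(values):
        modes = values.mode()  # mode() ignores NaN by default
        if modes.empty:
            # Nothing observed in this group, so there is nothing to fill with
            return values
        # With ties, mode() returns sorted values, so this picks the smallest
        return values.fillna(modes.iloc[0])

    return df.groupby(group_col)[target_col].transform(_fill)


# Toy data: model 'c' has no observed cylinders value at all
toy_df = pd.DataFrame({
    'model': ['a', 'a', 'a', 'b', 'b', 'c'],
    'cylinders': [4.0, 4.0, np.nan, 6.0, np.nan, np.nan],
})
toy_df['cylinders'] = fill_with_group_mode(toy_df, 'model', 'cylinders')
print(toy_df)
```

Any value still missing afterwards signals a group with no data, which can then be handled explicitly rather than crashing the cell.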

Finally, let's fill the missing values in the `'odometer'` column with the median odometer value of each model.

In [11]:
# Check the median odometer value for each model
vehicles_df.groupby('model')['odometer'].median()

model
acura tl             141000.0
bmw x5               108500.0
buick enclave        113728.0
cadillac escalade    129176.0
chevrolet camaro      62655.5
                       ...   
toyota sienna        140715.0
toyota tacoma        125000.0
toyota tundra        120500.0
volkswagen jetta     107000.0
volkswagen passat     84905.0
Name: odometer, Length: 100, dtype: float64

In [12]:
# Replace missing odometer values with the median odometer value for each model
vehicles_df['odometer'] = vehicles_df.groupby('model')['odometer'].transform(lambda x: x.fillna(x.median()))

# Check for missing values
vehicles_df.isna().sum()


price            0
model_year       0
model            0
brand            0
condition        0
cylinders        0
fuel             0
odometer        41
transmission     0
type             0
paint_color      0
is_4wd           0
date_posted      0
days_listed      0
dtype: int64

Oops, it looks like this approach did not eliminate all the missing values in the `'odometer'` column, so let's take a look at the remaining ones.

In [13]:
# Check which models have no odometer value
vehicles_df[vehicles_df['odometer'].isna()]['model'].value_counts()

model
mercedes-benz benze sprinter 2500    41
Name: count, dtype: int64
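
The leftover values are what one would expect when every row of a group is missing: the group's median is itself NaN, so there is nothing to fill with. A minimal sketch on toy data (the frame below is illustrative, and it uses `transform('median')`, a vectorized equivalent of the lambda used above):

```python
import numpy as np
import pandas as pd

# Toy data: model 'b' has no observed odometer value at all
toy_df = pd.DataFrame({
    'model': ['a', 'a', 'a', 'b', 'b'],
    'odometer': [100.0, np.nan, 300.0, np.nan, np.nan],
})

# Median per model: 'b' has nothing to aggregate, so its median is NaN
medians = toy_df.groupby('model')['odometer'].median()
print(medians)

# Broadcast each model's median back to its rows and use it to fill the gaps
group_median = toy_df.groupby('model')['odometer'].transform('median')
filled = toy_df['odometer'].fillna(group_median)

# The rows of 'b' are still missing because their fill value is NaN
print(filled.isna().sum())
```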

In [14]:
# View all the models in the brand mercedes-benz
vehicles_df[vehicles_df['brand'] == 'mercedes-benz']

Unnamed: 0,price,model_year,model,brand,condition,cylinders,fuel,odometer,transmission,type,paint_color,is_4wd,date_posted,days_listed
42,34900,2013.0,mercedes-benz benze sprinter 2500,mercedes-benz,excellent,6.0,diesel,,automatic,van,black,False,2019-01-15,16
1642,34900,2013.0,mercedes-benz benze sprinter 2500,mercedes-benz,excellent,6.0,diesel,,automatic,van,black,False,2018-12-04,36
2232,34900,2013.0,mercedes-benz benze sprinter 2500,mercedes-benz,excellent,6.0,diesel,,automatic,van,black,False,2018-08-23,70
2731,34900,2013.0,mercedes-benz benze sprinter 2500,mercedes-benz,excellent,6.0,diesel,,automatic,van,black,False,2019-04-12,31
4149,34900,2013.0,mercedes-benz benze sprinter 2500,mercedes-benz,excellent,6.0,diesel,,automatic,van,black,False,2018-10-12,28
4681,34900,2013.0,mercedes-benz benze sprinter 2500,mercedes-benz,excellent,6.0,diesel,,automatic,van,black,False,2018-10-02,32
5681,34900,2013.0,mercedes-benz benze sprinter 2500,mercedes-benz,excellent,6.0,diesel,,automatic,van,black,False,2018-12-11,34
8975,34900,2013.0,mercedes-benz benze sprinter 2500,mercedes-benz,excellent,6.0,diesel,,automatic,van,black,False,2018-09-24,45
10600,34900,2013.0,mercedes-benz benze sprinter 2500,mercedes-benz,excellent,6.0,diesel,,automatic,van,black,False,2018-09-16,47
11541,34900,2013.0,mercedes-benz benze sprinter 2500,mercedes-benz,excellent,6.0,diesel,,automatic,van,black,False,2018-05-28,24


In [15]:
# Check whether mercedes-benz has other models whose odometer median could be used for the fill
vehicles_df[vehicles_df['brand'] == 'mercedes-benz']['model'].value_counts()

model
mercedes-benz benze sprinter 2500    41
Name: count, dtype: int64

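None of the mercedes-benz benze sprinter 2500 listings has an odometer value, and it is the only mercedes-benz model in the dataset, so neither a model-level nor a brand-level median is available. Instead, let's fill the missing values with the median odometer value for 6-cylinder vehicles.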

In [16]:
# Replace missing odometer values with the median odometer value for each cylinder type
vehicles_df['odometer'] = vehicles_df.groupby('cylinders')['odometer'].transform(lambda x: x.fillna(x.median()))

# Check for missing values
vehicles_df.isna().sum()

price           0
model_year      0
model           0
brand           0
condition       0
cylinders       0
fuel            0
odometer        0
transmission    0
type            0
paint_color     0
is_4wd          0
date_posted     0
days_listed     0
dtype: int64
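
The two-stage fill (by model first, then by cylinders) can also be expressed as one chain of fallbacks. The sketch below is only an illustration: `fill_median_with_fallback` is a hypothetical helper, the toy data is made up, and unlike the cells above it computes every group median from the originally observed values rather than from the partially filled column.

```python
import numpy as np
import pandas as pd


def fill_median_with_fallback(df, target_col, group_cols):
    """Fill NaNs in target_col using group medians, trying each grouping in order.

    Hypothetical helper: any value still missing after all groupings is
    filled with the overall median of the column.
    """
    result = df[target_col].copy()
    for col in group_cols:
        # Median of the originally observed values within each group
        group_median = df.groupby(col)[target_col].transform('median')
        result = result.fillna(group_median)
    return result.fillna(df[target_col].median())


# Toy data: model 'b' has no odometer values, so it falls back to its cylinder group
toy_df = pd.DataFrame({
    'model': ['a', 'a', 'b', 'b', 'c'],
    'cylinders': [4.0, 4.0, 6.0, 6.0, 6.0],
    'odometer': [100.0, np.nan, np.nan, np.nan, 500.0],
})
toy_df['odometer'] = fill_median_with_fallback(toy_df, 'odometer', ['model', 'cylinders'])
print(toy_df)
```

Ordering the groupings from most specific to least specific keeps each imputed value as close as possible to comparable vehicles.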

All missing values have now been filled, so let's look for duplicates in the dataset, both explicit and implicit.

## 3.2 Duplicate Values
Check for fully duplicate rows or duplicates in the `'model'` and `'model_year'` columns.

In [17]:
# Check for fully duplicate rows
vehicles_df.duplicated().sum()

np.int64(0)
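
Implicit duplicates, such as the same model written with different casing or spacing, do not show up in `duplicated()`. One way to look for them is to normalize the names and compare the number of unique values before and after. This is a sketch on made-up model names (`toy_models` is not taken from the dataset):

```python
import pandas as pd

# Made-up model names containing case, spacing and spelling variants
toy_models = pd.Series(['ford f-150', 'Ford F-150', 'ford f150', 'ford  f-150 ', 'bmw x5'])

# Normalize case and whitespace so that trivial variants collapse together
normalized = (
    toy_models.str.lower()
    .str.strip()
    .str.replace(r'\s+', ' ', regex=True)
)

# Fewer unique values after normalization suggests implicit duplicates
print(toy_models.nunique(), normalized.nunique())

# A sorted list puts near-duplicates such as 'f-150' and 'f150' next to each other
print(sorted(normalized.unique()))
```

Spelling variants that survive normalization still need a manual review of the sorted list before deciding whether to merge them.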

There are no fully duplicated rows. Next, let's categorize the dataset further by grouping on the `'brand'` column.

Histogram of fuel vs model_year
Scatterplot of price vs manufacturer