# Exploratory Data Analysis

## INTRODUCTION

At the beginning of my data science journey, when I was learning and practicing on data science projects, I usually paid little attention to exploratory data analysis (EDA). That changed with my first Kaggle competition. After reading code from other participants, I noticed that some of them emphasized EDA in their method. I realized that I had already been practicing some parts of EDA without noticing it, and from then on I started to take EDA seriously.

The main benefit I have felt since making EDA a definite step in my data science projects is that I am more organized and it is easier to see what to do next. So I decided to create a framework that provides a clear guide to EDA (at least for me).

There are a lot of sources on EDA out there, but the one I find most helpful is Hands-On Exploratory Data Analysis with Python, a book by Suresh Kumar Mukhiya and Usman Ahmed. A video on YouTube also helps (https://www.youtube.com/watch?v=YEBRkLo568Q&list=PLqW0gOGkUieY3Ljs9S0TuFP36xqefLDcg&index=3&t=1420s).

## What is Exploratory Data Analysis (EDA)?

According to IBM, Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods. It helps to determine how best to manipulate data sources to get the answers you need, making it easier for data scientists to discover patterns, spot anomalies, test a hypothesis, or check assumptions.

For me, EDA is all about understanding your data, checking the connections between its entities, and finding patterns. The insight we can get about our data by performing EDA is quite remarkable. Creating an EDA framework is like creating a path: a clear pathway gets us to our destination faster and more safely.


## My EDA framework: IDENTIFICATION -> TRANSFORMATION -> VISUALIZATION AND EXPLORATION



    
## IDENTIFICATION

When we do a data science project, we usually deal with a set of data, or what we call a data set. Merriam-Webster defines data as factual information (such as measurements or statistics) used as a basis for reasoning, discussion, or calculation. A data set usually consists of columns and rows. A column represents a variable, an attribute that we want to investigate, while a row holds the values of those variables for one observation. A feature is just another name for a variable.

A complete explanation of types of data can be seen here: https://www.upgrad.com/blog/types-of-data/

### Why Does Identification Matter?

Identifying the type of data is extremely important. Further analysis is easier if we know what kind of data we are dealing with, because statistical methods are designed to work with certain types of data. Methods meant for numerical data often cannot be used for categorical data. Knowing our data types reduces the chance of choosing the wrong method of analysis.

Nominal data allow us to calculate frequencies, proportions, and percentages. They can be illustrated using a bar chart or pie chart. For ordinal data, the summary calculations that can be evaluated are frequencies, percentiles, mode, median, and interquartile range. Ordinal data also work well with bar and pie charts.

Continuous data give us the most choices for summarization: percentiles, interquartile range, central tendency (mean, median, and mode), and standard deviation are all possible. For visualization, continuous data are better represented by a histogram or box plot.
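
To make the link between data type and allowed summaries concrete, here is a minimal sketch. The toy table, its column names, and the ordering of the condition levels are assumptions made up for illustration; they are not taken from the data set used later.

```python
import pandas as pd

# Hypothetical toy data (assumption), one column per data type
toy = pd.DataFrame({
    "fuel": ["gas", "diesel", "gas", "gas", "electric", "gas"],          # nominal
    "condition": ["good", "fair", "excellent", "good", "good", "fair"],  # ordinal
    "price": [5000.0, 12000.0, 30000.0, 8000.0, 15000.0, 7000.0],        # continuous
})

# Nominal: frequencies and proportions
fuel_counts = toy["fuel"].value_counts()
fuel_share = toy["fuel"].value_counts(normalize=True)

# Ordinal: declare the order first, then take the median on the ordered codes
order = ["fair", "good", "excellent"]  # assumed ranking, lowest to highest
cond = pd.Categorical(toy["condition"], categories=order, ordered=True)
median_code = int(pd.Series(cond.codes).quantile(0.5, interpolation="lower"))
median_condition = order[median_code]

# Continuous: mean, median, standard deviation, and interquartile range
price_mean = toy["price"].mean()
price_median = toy["price"].median()
price_std = toy["price"].std()
price_iqr = toy["price"].quantile(0.75) - toy["price"].quantile(0.25)

print(fuel_counts, fuel_share, median_condition, sep="\n")
print(price_mean, price_median, price_std, price_iqr)
```

Note that a mean of the condition column would not be meaningful, which is why the ordinal part only uses order-based statistics.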


## TRANSFORMATION

Data transformation changes the structure, format, and values of the data so that they become more accessible for analysis. There are a lot of techniques for data transformation, and there is no single right way to do it. First, we need to set the objectives of our project. Second, we need to understand the procedures or steps to achieve those goals. Only then will we be able to identify the appropriate data transformations to apply to our data.


## VISUALIZATION AND EXPLORATION

Visualization is usually about charts, variable descriptions, and relationships between variables. It is important to be familiar with the kinds of charts and to understand when to use each of them. The book I mentioned above also gives a great explanation of chart types, when to use them, and examples of how to create them in Python.

In the end, identification, transformation, and visualization and exploration are all about gaining insight into our data. They are not separate processes. You might do transformation and visualization at once, or one after the other. There is no absolute way to do EDA; everyone has their own method. The result, however, can be surprising. Sometimes we find information we never thought of before.

As always, the first step is to import the libraries.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

Load the dataset that will be used

In [None]:
vehicles = pd.read_csv('../input/craigslist-carstrucks-data/vehicles.csv')
vehicles.head()

In [None]:
# print shape and column names
print(vehicles.columns)
print(vehicles.shape)

## IDENTIFICATION

In [None]:
vehicles.describe()

In [None]:
# check type of each variable
vehicles.info()

Since the data set has 26 columns, it is hard to tell at a glance whether each column has the desired data type. So we slice the data into groups of columns for checking.

In [None]:
vehicles.iloc[:, 0:11]

In [None]:
vehicles.iloc[:, 11:22]

In [None]:
vehicles.iloc[:, 22:27]

After checking the values and type of each column, there are some columns whose data type needs to change.
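
Besides slicing by position, the columns can also be grouped by data type. The following is a minimal sketch on a made-up miniature table (the `sample` frame and its values are assumptions, only loosely modelled on the column names above):

```python
import pandas as pd

# Hypothetical miniature stand-in for the vehicles table (assumption)
sample = pd.DataFrame({
    "price": [5000, 12000],
    "year": [2012.0, 2017.0],
    "manufacturer": ["ford", "toyota"],
    "posting_date": ["2021-04-01T10:00:00-0500", "2021-04-02T11:30:00-0500"],
})

# Split the column names into numeric and non-numeric groups
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
other_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("non-numeric:", other_cols)
```

Here a year stored as float and a date stored as text stand out immediately as candidates for conversion.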

## TRANSFORMATION

All transformations in this project are for visualization purposes only; no machine learning model will be used.

### Handling null/missing values

Before we convert data types, let's check for null values.

In [None]:
# total null value for each column
vehicles.isnull().sum()

In [None]:
# sorting columns based on their total null
null_val = pd.DataFrame(vehicles.isnull().sum(), columns = ['Nan_sum'])
null_val = null_val[null_val['Nan_sum']>0]
null_val['Percentage'] = (null_val['Nan_sum']/len(vehicles))*100
null_val = null_val.sort_values(by=['Nan_sum'], ascending=False)
null_val

How we treat columns with null values depends on the project's goal. For the sake of this project, rows that are entirely null and variables with more than 30% null values will be dropped, rows with a null drive or type will also be dropped, and the remaining nulls will be filled with the mode.
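
The 30% rule can also be applied programmatically instead of typing the column names by hand. This is a minimal sketch on a made-up frame (the `toy` data and the `threshold` name are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (assumption)
toy = pd.DataFrame({
    "size": [np.nan, np.nan, "compact", np.nan, np.nan],  # 80% null
    "fuel": ["gas", None, "gas", "diesel", "gas"],        # 20% null
    "price": [5000, 7000, 8000, 12000, 15000],            # 0% null
})

threshold = 30  # percent of nulls above which a column is dropped

# isnull().mean() gives the fraction of nulls per column
null_pct = toy.isnull().mean() * 100
cols_to_drop = null_pct[null_pct > threshold].index.tolist()
reduced = toy.drop(columns=cols_to_drop)

print(cols_to_drop)
print(reduced.columns.tolist())
```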

In [None]:
# dropna returns a new DataFrame, so assign the result back
vehicles = vehicles.dropna(how='all')
vehicles.shape

It seems that no row is entirely null. Let's drop some columns and fill the others with the mode value.

In [None]:
vehicles.drop(['size', 'condition', 'VIN', 'cylinders', 'paint_color'], inplace=True, axis=1)
vehicles.dropna(subset=['drive', 'type'], how='any', inplace=True) # drop any row from drive and type columns that contains null
vehicles.shape

In [None]:
filled_mode = ['odometer', 'manufacturer', 'lat', 'long', 'model', 'fuel',
       'title_status', 'transmission', 'year', 'description', 'image_url',
       'posting_date'] 
for x in filled_mode:
    vehicles[x] = vehicles[x].fillna(vehicles[x].mode()[0])

In [None]:
# re-check null value
vehicles.isnull().sum()

There are a ton of ways to handle null/missing values; the ones above are only some of them. Keep in mind that the treatment of missing values may differ between numerical and categorical variables. **Hands-On Exploratory Data Analysis with Python**, a book by Suresh Kumar Mukhiya and Usman Ahmed, is a really good source if we want to know different ways to handle missing/null values.
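
As one illustration of treating the two kinds of variables differently, the sketch below fills numeric columns with the median and all other columns with the mode. It is not what this notebook does (everything above is filled with the mode); the toy frame is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (assumption)
toy = pd.DataFrame({
    "odometer": [10000.0, np.nan, 30000.0, 250000.0],
    "fuel": ["gas", "gas", None, "diesel"],
})

for col in toy.columns:
    if pd.api.types.is_numeric_dtype(toy[col]):
        # Numeric: the median is less affected by extreme values than the mean
        toy[col] = toy[col].fillna(toy[col].median())
    else:
        # Categorical: use the most frequent value
        toy[col] = toy[col].fillna(toy[col].mode()[0])

print(toy)
```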

### Converting datatype

Now it's time to convert the data type of some columns.

In [None]:
# converting 'year' from float to int
vehicles['year'] = vehicles['year'].astype(int)

In [None]:
# converting 'posting_date' from object to datetime
vehicles['posting_date'] = pd.to_datetime(vehicles['posting_date'], utc=True)

In [None]:
# check type of each variable
vehicles.info()
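
The `utc=True` argument matters when timestamps carry different UTC offsets: every value is converted to one common time zone. A minimal sketch, assuming timestamps in an ISO-like text format (the two strings below are made up, not taken from the data set):

```python
import pandas as pd

# Hypothetical timestamps with different UTC offsets (assumption)
raw = pd.Series(["2021-04-01T10:00:00-0500", "2021-04-01T10:00:00-0700"])

# utc=True converts every value to UTC, giving one timezone-aware column
parsed = pd.to_datetime(raw, utc=True)

print(parsed)
print(parsed.dt.hour.tolist())
```

The same local clock time ends up as two different UTC hours, which keeps later comparisons and sorting consistent.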

### Filtering rows based on specific values

For simplicity, I will keep only vehicles from years after 2000.

In [None]:
vehicles = vehicles.loc[vehicles.year>2000, :] # only year after 2000 are selected
vehicles.shape

In [None]:
# Select models that appear more than 500 times
# (value_counts already sorts in descending order)
model_counts = vehicles['model'].value_counts()
vehicle_model = model_counts[model_counts > 500].to_frame(name='model')
vehicle_model

In [None]:
vehicles = vehicles.loc[vehicles['model'].isin(vehicle_model.index), :]
vehicles.shape

In [None]:
# Select manufacturers that appear more than 500 times
# (value_counts already sorts in descending order)
manufacturer_counts = vehicles['manufacturer'].value_counts()
manufacturers = manufacturer_counts[manufacturer_counts > 500].to_frame(name='manufacturer')
manufacturers

In [None]:
vehicles = vehicles.loc[vehicles['manufacturer'].isin(manufacturers.index), :]
vehicles.shape

### Replacing row values

In [None]:
vehicles.model.unique()

Let's treat f-150 and f150 as the same model and replace one with the other.
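
Spelling variants like these can be hard to spot by eye in a long list. One possible way to surface candidates is to compare the names after stripping case and punctuation. This is a minimal sketch on made-up model names; the normalization rule is an assumption and may merge names that are genuinely different, so the result should be reviewed by hand.

```python
import pandas as pd

# Hypothetical model names (assumption)
models = pd.Series(["f-150", "f150", "F 150", "camry", "civic", "Camry"])

# Normalized key: lowercase, letters and digits only
key = models.str.lower().str.replace(r"[^a-z0-9]", "", regex=True)

# Count how many distinct spellings share each key
n_variants = models.groupby(key).nunique()
ambiguous = n_variants[n_variants > 1].index.tolist()

print(ambiguous)
```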

In [None]:
vehicles.model.value_counts() #before replacement

In [None]:
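# assign the result back; inplace=True on a selected column may not update the DataFrame in newer pandas
vehicles['model'] = vehicles['model'].replace({'f150': 'f-150'})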

In [None]:
# check total f-150 model after replacement
vehicles.model[vehicles.model == 'f-150'].value_counts()

## VISUALIZATION & EXPLORATION

For this project, I will mainly focus on categorical variables.

### Distribution for category variables

In [None]:
# DISTRIBUTION OF MANUFACTURER, FUEL, DRIVE, AND TITLE STATUS BASED ON THEIR TOTAL NUMBER 

cat_type = ['manufacturer', 'fuel', 'drive', 'title_status']
fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(18, 10)) 

for i, var in enumerate(cat_type):
    row = i//2
    pos = i % 2    
    plot = sns.countplot(x=var, data=vehicles, order = vehicles[var].value_counts().index, ax=axs[row][pos])
    var = plot.set_xticklabels(plot.get_xticklabels(), rotation=90)
fig.tight_layout(pad=2.0)

Some conclusions we can draw:
* 3 of the top 5 car manufacturers are from the USA; the other two are from Japan
* Gas dominates as the fuel type
* 4wd and fwd account for almost the same number of cars
* Almost all used cars have a clean title status

In [None]:
# DISTRIBUTION OF STATE WITH MOST CARS, TYPE, AND TRANSMISSION 

cat_type2 = ['state', 'type', 'transmission']
fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(18, 10)) 

for i, var in enumerate(cat_type2):
    row = i//2
    pos = i % 2    
    plot = sns.countplot(x=var, data=vehicles, order = vehicles[var].value_counts().index, ax=axs[row][pos])
    var = plot.set_xticklabels(plot.get_xticklabels(), rotation=90)
axs[1][1].set_visible(False)  # only three variables, so hide the unused fourth panel
fig.tight_layout(pad=2.0)

* California (ca) is by far the state with the largest number of cars, followed by Florida (fl)
* US citizens love sedans and SUVs
* They also like automatic transmissions

In [None]:
# YEAR
plt.figure(figsize=(9,5))
sns.set_theme(style="darkgrid")
stat = sns.countplot(x="year", data=vehicles)
var = stat.set_xticklabels(stat.get_xticklabels(), rotation=90)

We can see from the chart above that 2017 is the year with the highest number of used cars posted. That number decreased massively in 2020. The price of new cars probably affects the decision to keep the old car.

In [None]:
# PRICE MEAN FOR EACH MANUFACTURER WITH ALL TYPE COMBINED

price_mean = vehicles[['price','manufacturer']].groupby('manufacturer').mean()

plt.figure(figsize=(9,5))
pr = sns.barplot(x=price_mean.index, y="price", data=price_mean)
var = pr.set_xticklabels(pr.get_xticklabels(), rotation=90)

If there is a possibility that you will want to re-sell your car in the future, consider Chevrolet. Its mean used-car price is far higher than that of most other manufacturers, which suggests lower depreciation, although the mean listing price alone does not measure depreciation directly.
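
The mean is sensitive to extreme listings, so it can help to look at the median and the group size next to it. A minimal sketch on made-up numbers (the toy frame and its prices are assumptions, not values from the data set):

```python
import pandas as pd

# Hypothetical listings, including a 0 price and one extreme price (assumption)
toy = pd.DataFrame({
    "manufacturer": ["ford", "ford", "ford", "toyota", "toyota", "toyota"],
    "price": [0, 20000, 1000000, 15000, 16000, 17000],
})

# Mean, median, and number of listings per manufacturer
summary = toy.groupby("manufacturer")["price"].agg(["mean", "median", "count"])

print(summary)
```

In this toy example the two outliers pull the Ford mean far above its median, while the Toyota mean and median agree.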

In [None]:
# MANUFACTURER WITH MOST USED CARS FOR SALE EACH YEAR

manf_ser = vehicles.groupby('year').manufacturer.value_counts()
manf_ser_df = pd.DataFrame(manf_ser.unstack())

plt.subplots(figsize=(12, 7))
sns.heatmap(manf_ser_df, cmap='Blues', linecolor='white', linewidth=1)

Used cars from manufacturers like Ford, Chevrolet, and Toyota dominated the listings, especially for 2012 to 2018. This suggests that cars from those manufacturers are favourites among people in the US.
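
The count tables behind these heatmaps are built with `groupby(...).value_counts().unstack()`. An alternative that gives the same counts is `pd.crosstab`, which fills missing combinations with 0 instead of NaN. A minimal sketch on made-up rows (the toy frame is an assumption):

```python
import pandas as pd

# Hypothetical listings (assumption)
toy = pd.DataFrame({
    "year": [2016, 2016, 2017, 2017, 2017],
    "manufacturer": ["ford", "toyota", "ford", "ford", "honda"],
})

# Approach used in this notebook: absent combinations become NaN
via_groupby = toy.groupby("year")["manufacturer"].value_counts().unstack()

# Alternative: absent combinations become 0
via_crosstab = pd.crosstab(toy["year"], toy["manufacturer"])

print(via_groupby)
print(via_crosstab)
```

Which one to use is a matter of taste: NaN cells are left blank in a heatmap, while 0 cells are drawn in the lightest color.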

In [None]:
cols = ['ford', 'chevrolet', 'toyota', 'honda','jeep']
manf_ser_df2 = manf_ser_df[cols].copy()
manf_ser_df2

In [None]:
# TIME SERIES FOR FOR-SALE USED CARS EACH YEAR FROM TOP 5 MANUFACTURERS

fig, ax = plt.subplots(figsize=(18, 10))
sns.lineplot(data=manf_ser_df2)
ax.set_xlim(2001,2021)
ax.set_xticks(range(2001,2021))
plt.show()

They had similar fluctuations in for-sale used cars, with Ford dominating almost every year.

In [None]:
plt.subplots(figsize=(9, 5))
sns.heatmap(manf_ser_df2, cmap='Blues', linecolor='white', linewidth=1)

In [None]:
# DISTRIBUTION OF USED CARS FOR SALE IN EVERY STATE, BY MANUFACTURER

state_ser = vehicles.groupby('manufacturer').state.value_counts()
state_ser_df = pd.DataFrame(state_ser.unstack())

plt.subplots(figsize=(12, 7))
sns.heatmap(state_ser_df, cmap='Blues', linecolor='white', linewidth=1)

The pattern was similar in every state: the used cars that owners wanted to sell were mostly from Ford, Chevrolet, and Toyota.

In [None]:
# WHAT TYPE OF USED CARS WERE FOR SALE?

type_ser = vehicles.groupby('manufacturer').type.value_counts()
type_ser_df = pd.DataFrame(type_ser.unstack())

plt.subplots(figsize=(12, 7))
sns.heatmap(type_ser_df, cmap='Blues', linecolor='white', linewidth=1)

Ford and Chevrolet listings covered SUVs, pickups, sedans, and trucks. For the Japanese manufacturers Toyota and Honda, the used cars were mostly sedans. Surprisingly, many Jeep SUV owners wanted to sell their cars.

In [None]:
# TRANSMISSION TYPE

trans_ser = vehicles.groupby('manufacturer').transmission.value_counts()
trans_ser_df = pd.DataFrame(trans_ser.unstack())

plt.subplots(figsize=(12, 7))
sns.heatmap(trans_ser_df, cmap='Blues', linecolor='white', linewidth=1)

It was no surprise that automatic cars dominated the used car market.

In [None]:
# FUEL

fuel_ser = vehicles.groupby('manufacturer').fuel.value_counts()
fuel_ser_df = pd.DataFrame(fuel_ser.unstack())

plt.subplots(figsize=(12, 7))
sns.heatmap(fuel_ser_df, cmap='Blues', linecolor='white', linewidth=1)

Same thing with fuel. Gas ruled.

In [None]:
# DRIVE TYPE

drive_ser = vehicles.groupby('manufacturer').drive.value_counts()
drive_ser_df = pd.DataFrame(drive_ser.unstack())

plt.subplots(figsize=(12, 7))
sns.heatmap(drive_ser_df, cmap='Blues', linecolor='white', linewidth=1)

### Price Distribution for each category

In [None]:
manf7000 = ['ford', 'chevrolet', 'toyota', 'honda', 'jeep']
to_drop = ['Unnamed: 0', 'id', 'url', 'region_url', 'image_url', 'model', 'description', 'lat', 'long']
vehicles_ca_2017 = vehicles.loc[(vehicles['manufacturer'].isin(manf7000))&(vehicles['state']=='ca')&((vehicles['year']==2017)), :].copy()
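# errors='ignore' skips any listed column that is not present in this version of the data
vehicles_ca_2017.drop(to_drop, inplace=True, axis=1, errors='ignore')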
vehicles_ca_2017.head(3)

In [None]:
man_type = ['type', 'drive', 'fuel', 'transmission']
fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(20, 14)) 

for i, var in enumerate(man_type):
    row = i//2
    pos = i % 2   
    plot = sns.boxplot(x=var, y='price', data=vehicles_ca_2017, hue='manufacturer', ax=axs[row][pos])
fig.tight_layout(pad=2.0)                                              

Ford had the widest price range in almost all categories. Other manufacturers had a wide price range in some categories but were absent from others.

In [None]:
# Just want to know a little about region

reg_ser = vehicles_ca_2017.groupby('manufacturer').region.value_counts()
reg_ser_df = pd.DataFrame(reg_ser.unstack())

plt.subplots(figsize=(12, 7))
sns.heatmap(reg_ser_df, cmap='Blues', linecolor='white', linewidth=1)

In [None]:
# USED CARS DISTRIBUTION IN CALIFORNIA IN 2017

state_ser = vehicles_ca_2017.groupby('manufacturer').state.value_counts()
state_ser_df = pd.DataFrame(state_ser.unstack())

plt.subplots(figsize=(12, 7))
sns.heatmap(state_ser_df, cmap='Blues', linecolor='white', linewidth=1)

It turned out that many of the used cars for sale in California were Fords, followed by Toyotas. Japanese cars are loved by Americans.

In [None]:
# TOP 5 MANUFACTURER IN CA STATE IN 2017
plt.figure(figsize=(9,5))
sns.set_theme(style="darkgrid")
stat = sns.countplot(x="manufacturer", data=vehicles_ca_2017)
var = stat.set_xticklabels(stat.get_xticklabels(), rotation=90)

Above is the exact number of for-sale used cars in CA in 2017.

In [None]:
# PRICE DISTRIBUTION AND TENDENCY OF USED CARS IN CA IN 2017

man_type = ['type', 'drive', 'fuel', 'transmission']
fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(20, 14)) 

for i, var in enumerate(man_type):
    row = i//2
    pos = i % 2   
    sns.violinplot(x=var, y='price', data=vehicles_ca_2017, hue='manufacturer', ax=axs[row][pos])
fig.tight_layout(pad=2.0)    

It is always good for customers to know the price range of the used car they want to buy. The charts above display the price tendency for the top 5 manufacturers in California in 2017.

In [None]:
# 2017: WHAT TYPES OF CARS PEOPLE IN CALIFORNIA WANTED TO SELL

plt.subplots(figsize=(18, 12)) 
sns.countplot(y="type", hue="manufacturer", data=vehicles_ca_2017)

If you wanted to buy an SUV or pickup, there were many choices from Ford. If you wanted a Toyota, you would see mostly sedans.

In [None]:
g = sns.catplot(y="type", hue="manufacturer", col="drive", data=vehicles_ca_2017, kind="count", height=8, aspect=.8)

In [None]:
# REGRESSION PLOT OF PRICE VS ODOMETER FOR THE TOP 5 MANUFACTURERS, EXCLUDING 0-PRICE LISTINGS
vehiclesnon0_ca_2017 = vehicles_ca_2017.loc[(vehicles_ca_2017['price']!=0), :].copy()
sns.set_context('paper', font_scale=1.4)
sns.lmplot(x='odometer', y='price', hue='manufacturer', data=vehiclesnon0_ca_2017,
          scatter_kws={'s': 100, 'linewidth': 0.5, 'edgecolor': 'w'}, height=8, aspect=4)

An odometer or odograph is an instrument used for measuring the distance traveled by a vehicle, such as a bicycle or car (Wikipedia). A price of 0 probably means the owner didn't state a desired price, so I eliminated those listings. The regression chart above helps to estimate the general price of used cars from the top 5 manufacturers in CA in 2017.
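
To read an actual number off such a trend, a straight line can also be fitted numerically. This is a minimal sketch with NumPy on made-up, perfectly linear data (the toy values are assumptions; real listings are far noisier):

```python
import numpy as np
import pandas as pd

# Hypothetical listings, including one 0-price row (assumption)
toy = pd.DataFrame({
    "odometer": [10000, 50000, 80000, 100000, 150000],
    "price": [29000, 25000, 0, 20000, 15000],
})

# Drop 0-price rows before fitting, as done for the chart
nonzero = toy.loc[toy["price"] != 0, :]

# Fit price = slope * odometer + intercept (degree-1 polynomial)
slope, intercept = np.polyfit(nonzero["odometer"], nonzero["price"], 1)

# Estimated price at 60,000 on the odometer
estimate = slope * 60000 + intercept

print(slope, intercept, estimate)
```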

In [None]:
# CA, 2017: PRICE RANGE FOR EVERY TYPE FROM TOP 5 MANUFACTURERS

g = sns.catplot(x="type", y="price", col="manufacturer",
                data=vehiclesnon0_ca_2017, saturation=.5,
                kind="bar", ci=None, aspect=.6)
(g.set_axis_labels("", "Price")
  .set_titles("{col_name} {col_var}")
  # keep the tick labels derived from the data; a hardcoded list can mislabel the bars
  .set_xticklabels(rotation=90)
  .despine(left=True)) 

If you wanted to buy a certain type of used car, the chart above could help you decide which manufacturer to go for.

So, that is some of the information we can obtain by doing EDA. There is much more insight to gain from this one dataset if we dig deeper. There might be some mistakes in the code, but you get the idea. Thanks.