# **Lab 6**

## Exploratory Data Analysis (EDA)

In this lab, you will learn how to explore the data further through statistical and correlation analysis. Many functions are available in Python (and its packages) to help you do this with ease. You will also learn how to explore data from the ground up. Of course, exploratory analysis also involves using graphical representations to provide a meaningful "pictorial" description of the data.

> **Credit note:** A portion of this lab was adapted from [sanithps98's repo](https://github.com/sanithps98/Automobile-Dataset-Analysis) on data analysis.

In [1]:
import numpy as np
import pandas as pd

from IPython.display import display

To visualize data, let's load the relevant packages. We have two packages that we can use, Matplotlib and Seaborn:

In [2]:
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns

## Describing the Data

We will be using the cleaned car data that we prepared in Lab 4.

In [3]:
df = pd.read_csv('car_data_CLEANED.csv', index_col=0)   # first column is used as index
df.head()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price,horsepower-binned
0,3,122,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,mpfi,3.47,2.68,9.0,111,5000.0,21,27,13495.0,Low
1,3,122,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,mpfi,3.47,2.68,9.0,111,5000.0,21,27,16500.0,Low
2,1,122,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,mpfi,2.68,3.47,9.0,154,5000.0,19,26,16500.0,Medium
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,mpfi,3.19,3.4,10.0,102,5500.0,24,30,13950.0,Low
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,mpfi,3.19,3.4,8.0,115,5500.0,18,22,17450.0,Low


The most basic way to describe the data with summary statistics is the `describe()` function.

In [4]:
df.describe()

Unnamed: 0,symboling,normalized-losses,wheel-base,length,width,height,curb-weight,engine-size,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
count,201.0,201.0,201.0,201.0,201.0,201.0,201.0,201.0,201.0,201.0,201.0,201.0,201.0,201.0,201.0,201.0
mean,0.840796,122.0,98.797015,0.837102,0.915126,0.899108,2555.666667,-3.053798,3.330692,3.256874,10.164279,103.402985,5117.665368,25.179104,30.686567,13207.129353
std,1.254802,31.99625,6.066366,0.059213,0.029187,0.040933,517.296727,0.024129,0.268072,0.316048,4.004965,37.36565,478.113805,6.42322,6.81515,7947.066342
min,-2.0,65.0,86.6,0.678039,0.8375,0.799331,1488.0,-3.092056,2.54,2.07,7.0,48.0,4150.0,13.0,16.0,5118.0
25%,0.0,101.0,94.5,0.801538,0.890278,0.869565,2169.0,-3.070568,3.15,3.11,8.6,70.0,4800.0,19.0,25.0,7775.0
50%,1.0,122.0,97.0,0.832292,0.909722,0.904682,2414.0,-3.057791,3.31,3.29,9.0,95.0,5125.369458,24.0,30.0,10295.0
75%,2.0,137.0,102.4,0.881788,0.925,0.928094,2926.0,-3.045594,3.58,3.41,9.4,116.0,5500.0,30.0,34.0,16500.0
max,3.0,256.0,120.9,1.0,1.0,1.0,4066.0,-2.938151,3.94,4.17,23.0,262.0,6600.0,49.0,54.0,45400.0


In [5]:
df.dtypes

symboling              int64
normalized-losses      int64
make                  object
fuel-type             object
aspiration            object
num-of-doors          object
body-style            object
drive-wheels          object
engine-location       object
wheel-base           float64
length               float64
width                float64
height               float64
curb-weight            int64
engine-type           object
num-of-cylinders      object
engine-size          float64
fuel-system           object
bore                 float64
stroke               float64
compression-ratio    float64
horsepower             int64
peak-rpm             float64
city-mpg               int64
highway-mpg            int64
price                float64
horsepower-binned     object
dtype: object

Notice that all the attributes that are of `object` type (mainly strings, or other non-numerical types) are ignored when we use `describe()`.

Interestingly, you can also "describe" the non-numerical data. To do so, set the `include` option to `'object'` (the data type). It then reports a set of relevant measures: **count** (how many values), **unique** (how many unique values), **top** (the most frequent value), and **freq** (how often the most frequent value occurs).

In [6]:
df.describe(include=['object'])

Unnamed: 0,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,engine-type,num-of-cylinders,fuel-system,horsepower-binned
count,201,201,201,201,201,201,201,201,201,201,201
unique,22,2,2,2,5,3,2,6,7,8,3
top,toyota,gas,std,four,sedan,fwd,front,ohc,four,mpfi,Low
freq,32,181,165,115,94,118,198,145,157,92,153


We can now see that most cars fall in the "Low" horsepower category (153 of them), and that the most common body style is sedan (94 of them).

**Descriptive Question 1**: What are the average prices for each category of car body styles?

To obtain the unique types of car body styles, use `unique()` on the column `body-style`:

In [7]:
df['body-style'].unique()

array(['convertible', 'hatchback', 'sedan', 'wagon', 'hardtop'],
      dtype=object)

**AE1**: Find the average prices for each of these categories.

*Recall: You could use* `groupby` *function here to group the data by the body styles, and have each group calculate its mean.*

In [None]:
# write your code here
average = 
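The `groupby` pattern recalled above can be sketched on a small made-up table before you apply it to the car data. The `toy` DataFrame and its values are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data; the values are illustrative only.
toy = pd.DataFrame({
    "body-style": ["sedan", "sedan", "wagon", "wagon", "hatchback"],
    "price": [10000.0, 14000.0, 9000.0, 11000.0, 8000.0],
})

# Group the rows by category, then average the price within each group.
toy_average = toy.groupby("body-style")["price"].mean()
print(toy_average)
```

The result is a Series indexed by the group labels, so an individual group mean can be read with `toy_average["sedan"]`.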


### Crosstabs / Pivot Tables

Crosstabs and pivot tables look similar at first glance. Both allow us to analyze data that is aggregated by more than one attribute, and both keep the data in tabular format but in a more summarized form. Generally, they differ only in how certain software implements them (read [here](https://www.mtab.com/difference-crosstabs-pivot-tables/) for more information): pivot tables are normally regarded as more dynamic than crosstabs, because users can drag, drop, and rearrange data on the spot.
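In pandas, the two ideas map onto `pd.crosstab` and `DataFrame.pivot_table`. Here is a minimal sketch on a hypothetical `toy` table (the rows are made up for illustration): the crosstab counts how many rows fall into each combination, while the pivot table aggregates a third column for each combination.

```python
import pandas as pd

# Hypothetical toy data; the values are illustrative only.
toy = pd.DataFrame({
    "drive-wheels": ["fwd", "fwd", "rwd", "rwd", "rwd"],
    "body-style": ["sedan", "wagon", "sedan", "sedan", "wagon"],
    "price": [10000.0, 9000.0, 15000.0, 17000.0, 12000.0],
})

# Crosstab: number of cars for each (drive-wheels, body-style) combination.
counts = pd.crosstab(toy["drive-wheels"], toy["body-style"])

# Pivot table: mean price for each (drive-wheels, body-style) combination.
mean_price = toy.pivot_table(
    index="drive-wheels", columns="body-style", values="price", aggfunc="mean"
)

print(counts)
print(mean_price)
```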

If you are planning to perform grouping of data with multiple variables (say, grouping by drive type and body style), you might do something like this:

In [None]:
df_gp1 = df[['drive-wheels','body-style','price']]
gp1 = df_gp1.groupby(['drive-wheels','body-style'],as_index=False).mean()
gp1

Basically, every combination of drive type and body style that appears in the data has been aggregated by its mean price. The result may be much easier to read if it is turned into a pivot table instead:

In [None]:
gp_pivot = gp1.pivot(index='drive-wheels',columns='body-style')
gp_pivot

Often, we do not have data for some of the pivot cells (because that combination just didn't exist). We can fill these missing cells with the value 0, or simply leave them as NaN if no further processing is going to happen.
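Filling the empty cells can be sketched with `fillna`. The `grouped` table below is hypothetical and deliberately lacks one combination (rwd wagon). Passing `values="price"` to `pivot` is a choice made here to keep the column labels flat.

```python
import pandas as pd

# Hypothetical grouped means; the rwd/wagon combination is missing on purpose.
grouped = pd.DataFrame({
    "drive-wheels": ["fwd", "fwd", "rwd"],
    "body-style": ["sedan", "wagon", "sedan"],
    "price": [10000.0, 9000.0, 16000.0],
})

pivot = grouped.pivot(index="drive-wheels", columns="body-style", values="price")

# The missing combination shows up as NaN; fillna(0) replaces it with 0.
pivot_filled = pivot.fillna(0)

print(pivot)
print(pivot_filled)
```

Keep in mind that a 0 in a table of mean prices can be misread as "these cars cost nothing", so leaving NaN in place is often the safer choice for display.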

### Dispersion

**Exploratory Question 2**: What is the dispersion of the price of cars in this data? 

A boxplot is a good graphical representation to show dispersion, particularly for continuous numerical values such as prices.

In [None]:
df_price = df[["price"]]
df_price

If any rows contain NaN, we should drop them. (In the cleaned data, `describe()` reports 201 prices for 201 rows, so there may be nothing to drop.)

In [None]:
df_price = df_price.dropna()   # reassign rather than use inplace=True on a slice of df, which triggers SettingWithCopyWarning
df_price

We use the Seaborn package's [`boxplot`](https://seaborn.pydata.org/generated/seaborn.boxplot.html). Check out the documentation to see other options for customization.

In [None]:
sns.boxplot(y="price", width=0.15, data=df_price)    
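The numbers behind a boxplot can also be computed directly. This sketch assumes Seaborn's default whisker setting (`whis=1.5`), where points further than 1.5 times the interquartile range (IQR) from the box are drawn as outliers. The `prices` Series is hypothetical.

```python
import pandas as pd

# Hypothetical prices; the values are illustrative only.
prices = pd.Series([5000.0, 7000.0, 8000.0, 10000.0, 12000.0, 16000.0, 45000.0])

q1 = prices.quantile(0.25)      # lower edge of the box
median = prices.quantile(0.50)  # line inside the box
q3 = prices.quantile(0.75)      # upper edge of the box
iqr = q3 - q1                   # interquartile range: the height of the box

# Points beyond these limits are plotted individually as outliers.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = prices[(prices < lower_limit) | (prices > upper_limit)]

print(q1, median, q3, iqr)
print(outliers.tolist())
```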

Multiple boxplots can be shown together if we define more than one "dimension". Seaborn makes it really simple: you just need to define what goes along 'x' and 'y'.

In [None]:
sns.boxplot(x="body-style", y="price", width=0.75, data=df)

We see that the price distributions of the different body-style categories overlap significantly (especially the hatchback, sedan and wagon styles). We could therefore "guess" that body style and price are not strongly related, and that the body-style attribute may not be a good predictor of price (if we intend to train a model using this attribute).

But this is not a good way to look at the relationship between attributes. The better way is to calculate the correlation itself.

### Correlation

**Exploratory Question 3**: What are a few attributes that correlate the most with the car price?

Pandas is really convenient. You can immediately compute the correlation between attributes by using the `corr()` function, without needing to worry about NaN values interfering.

> Technical note: A correlation score cannot be calculated from NaN values, so pandas' `corr()` excludes them automatically. If you use NumPy's `corrcoef` function instead, you have to handle them manually.
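The difference in NaN handling can be seen in a small sketch. The `toy` table is hypothetical, with one missing value placed in `x` on purpose.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in x.
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, np.nan],
    "y": [2.0, 4.0, 6.0, 8.0, 10.0],
})

# pandas leaves out the incomplete row before computing the coefficient.
r_pandas = toy["x"].corr(toy["y"])

# NumPy propagates the NaN, so the result is NaN as well.
r_numpy_raw = np.corrcoef(toy["x"].to_numpy(), toy["y"].to_numpy())[0, 1]

# Dropping the incomplete rows first gives NumPy the same input pandas used.
clean = toy.dropna()
r_numpy_clean = np.corrcoef(clean["x"].to_numpy(), clean["y"].to_numpy())[0, 1]

print(r_pandas, r_numpy_raw, r_numpy_clean)
```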

In [16]:
df_esp = df[["engine-size", "price"]]

In [None]:
df_esp.corr()

This is a matrix containing the correlation coefficients ([Pearson's](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) by default) between pairs of attributes. It is also symmetrical because the order of the attributes is not important when computing correlation (the correlation between x and y is the same as the correlation between y and x).

Why do you think the diagonal values are 1?

Do you think engine size correlates positively with price?
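To check your answers, you can compute Pearson's coefficient by hand and compare it with what `corr()` returns. This is only a sketch: the helper name `pearson` and the `toy` values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data; the values are illustrative only.
toy = pd.DataFrame({
    "engine-size": [90.0, 110.0, 130.0, 150.0, 200.0],
    "price": [7000.0, 9500.0, 12000.0, 16000.0, 30000.0],
})

def pearson(x, y):
    """Pearson's r: co-variation of x and y, scaled by their individual spreads."""
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_xy = pearson(toy["engine-size"], toy["price"])
r_yx = pearson(toy["price"], toy["engine-size"])        # swapping the order
r_xx = pearson(toy["engine-size"], toy["engine-size"])  # an attribute with itself

print(r_xy, r_yx, r_xx)
print(toy.corr())
```

Swapping `x` and `y` leaves the formula unchanged, and an attribute paired with itself makes the numerator and denominator equal.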

You can pass this matrix into Seaborn's heatmap function, which provides some colors and shading to give a graphical representation to correlation. 

In [18]:
?sns.heatmap

In [None]:
sns.heatmap(df_esp.corr(), cmap='cividis')   # cividis is a colormap; other options include viridis and YlGnBu

Let's try for the entire dataframe (numerical data only):

In [None]:
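sns.heatmap(df.corr(numeric_only=True), cmap='cividis')   # numeric_only=True skips the object columns (needed in pandas 2.0+, available from pandas 1.5)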

Which of the numerical attributes correlate strongly with car price?

<br><br>

These attributes are the ones that make good candidates as features for subsequent data mining or machine learning tasks. They are able to tell apart a car that has a high or low price.
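Besides reading the heatmap, you can rank the attributes by the strength of their correlation with price. The sketch below uses a hypothetical `toy` table. `select_dtypes` is used here to keep only the numerical columns before calling `corr()`.

```python
import pandas as pd

# Hypothetical data; the values are illustrative only.
toy = pd.DataFrame({
    "make": ["a", "b", "c", "d", "e"],
    "engine-size": [90.0, 110.0, 130.0, 150.0, 200.0],
    "highway-mpg": [40.0, 36.0, 30.0, 27.0, 20.0],
    "stroke": [3.4, 3.1, 3.5, 3.0, 3.3],
    "price": [7000.0, 9500.0, 12000.0, 16000.0, 30000.0],
})

# Correlation of every numerical attribute with price (price itself removed).
corr_with_price = toy.select_dtypes(include="number").corr()["price"].drop("price")

# Rank by absolute value: a strong negative correlation is as useful as a strong positive one.
order = corr_with_price.abs().sort_values(ascending=False).index
ranked = corr_with_price.reindex(order)

print(ranked)
```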

### Scatter Plot

The correlation between two attributes can be better visualized in a scatter plot of data points. Here is Seaborn's [`scatterplot`](https://seaborn.pydata.org/generated/seaborn.scatterplot.html).

In [None]:
sns.scatterplot(x="engine-size", y="price", data=df)

Seaborn has another function [`regplot`](https://seaborn.pydata.org/generated/seaborn.regplot.html) which plots the scatter plot plus the fitted regression line. 

This line is one that characterises the distribution of the data in a linear way. In correlation terms, it gives us an idea of whether the correlation is positive or negative, strong or weak, or whether there is no correlation between the attributes at all.

In [None]:
sns.regplot(x="engine-size", y="price", data=df)
plt.ylim(0,)

As the engine size goes up, the price seems to go up as well: this indicates a positive correlation between these two variables. It is the tightness of the points around the regression line, rather than how steep the line looks, that tells us the correlation between these two attributes is pretty strong. In predictive modeling terms, if we know a car's engine size, we are likely to be able to predict (regress) its price quite well. 
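Note that the steepness of a fitted line and the strength of a correlation are separate things. The sketch below uses made-up numbers and NumPy's `polyfit` (a swapped-in tool, not what `regplot` is documented to use). It fits a line to two hypothetical datasets that share the same slope but differ in how far the points scatter around the line.

```python
import numpy as np

# Hypothetical engine sizes, plus a disturbance built to leave the fitted slope unchanged.
x = np.array([100.0, 120.0, 140.0, 160.0, 180.0])
noise = np.array([1.0, -2.0, 2.0, -2.0, 1.0])

y_tight = 150.0 * x + 100.0 * noise    # points hug the line
y_loose = 150.0 * x + 5000.0 * noise   # points scatter widely around the line

# Degree-1 polyfit returns [slope, intercept] of the least-squares line.
slope_tight, _ = np.polyfit(x, y_tight, 1)
slope_loose, _ = np.polyfit(x, y_loose, 1)

# Pearson correlation for each dataset.
r_tight = np.corrcoef(x, y_tight)[0, 1]
r_loose = np.corrcoef(x, y_loose)[0, 1]

print(slope_tight, r_tight)
print(slope_loose, r_loose)
```

Both fits give essentially the same slope, yet the correlation is much weaker for the widely scattered data.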
<br><br>

**AE2**: Visualize the scatter plots of several other variables listed below, versus the price.
* highway-mpg
* peak-rpm
* stroke

Analyse the scatter plots and also determine their correlation scores to see if they match your analysis: 

In [None]:
# write your code here
sns.scatterplot(x="highway-mpg", y="price", data=df)
df_sc = df[["highway-mpg", "price"]]

df_sc.corr()



The [Matplotlib](https://matplotlib.org/) package will be explored in more detail when we come to the Data Visualization topic later. For now, Seaborn is quite straightforward and easy to use, but it may seem that you do not have fine control over how the plot comes out. (Note: Actually you do, but you need some knowledge of how to manipulate plots with Matplotlib, which Seaborn runs on top of.)
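As a small preview, here is a sketch of that fine control. Seaborn's plotting functions draw onto a Matplotlib `Axes` object, which can then be adjusted with ordinary Matplotlib calls. The `toy` data and the labels are hypothetical. The `Agg` backend line is only there so that the sketch also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical data; the values are illustrative only.
toy = pd.DataFrame({
    "engine-size": [90.0, 110.0, 130.0, 150.0, 200.0],
    "price": [7000.0, 9500.0, 12000.0, 16000.0, 30000.0],
})

# Create the Matplotlib figure and axes first, then let Seaborn draw onto them.
fig, ax = plt.subplots(figsize=(6, 4))
sns.scatterplot(x="engine-size", y="price", data=toy, ax=ax)

# From here on, these are plain Matplotlib calls on the same axes.
ax.set_title("Price vs. engine size")
ax.set_xlabel("Engine size")
ax.set_ylabel("Price")
ax.set_ylim(bottom=0)
```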