# Details on the dataset and EDA can be found in
https://www.kaggle.com/code/devanshbesain/exploration-and-analysis-auto-mpg

# Exploration and analysis of the auto-mpg data set.
Welcome to this notebook created for exploration and analysis of the Auto-MPG data-set from the UCI Machine Learning Repository.
The data-set is fairly standard on Kaggle but can be accessed separately from the UCI Machine Learning Repository along with many other interesting data-sets. Check http://archive.ics.uci.edu/ml/index.php for more.

This notebook aims primarily to demonstrate use of pandas and seaborn for exploration and visualization of the data-set.

## So what is the auto-mpg data set?
 The following description can be found on the UCI Repository page for the data-set (http://archive.ics.uci.edu/ml/datasets/Auto+MPG)
 -  This dataset is a slightly modified version of the dataset provided in the StatLib library. In line with the use by Ross Quinlan (1993) in predicting the attribute "mpg", 8 of the original instances were removed because they had unknown values for the "mpg" attribute. The original dataset is available in the file "auto-mpg.data-original".

"The data concerns city-cycle fuel consumption in miles per gallon, to be predicted in terms of 3 multivalued discrete and 5 continuous attributes." (Quinlan, 1993)

Let's first import the libraries.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy.stats import norm
from scipy import stats
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline


We have imported all the packages and libraries we will use for the initial exploration of the data. This notebook focuses on exploration and visualization using the pandas and seaborn packages.

Let us load the data to explore for hidden treasures.

In [None]:
data = pd.read_csv('/content/drive/MyDrive/Colab-Notebooks/auto-mpg.csv',index_col='car name')
data

Let's have a look at the data.

In [None]:
display(data.head())
print(data.index)
print(data.columns)

We can see that the dataset has the following columns (with their type):

 - **mpg**: continuous
 - **cylinders**: multi-valued discrete
 - **displacement**: continuous
 - **horsepower**: continuous
 - **weight**: continuous
 - **acceleration**: continuous
 - **model year**: multi-valued discrete
 - **origin**: multi-valued discrete
 - **car name**: string (unique for each instance)

In [None]:
data.shape

In [None]:
data.isnull().any()

Nothing seems to be missing, at least as far as isnull() can tell: it only flags NaN values, not placeholder strings.

In [None]:
#data.dtypes
data.info(verbose=True) # Verbose is to print all the available information

Interestingly, horsepower has dtype object and not float, even though the values we saw above were clearly numbers. Before converting the column with astype(), let's look at the unique elements of horsepower to look for discrepancies.

In [None]:
# unique() gives a compact summary of a column's values, which makes discrepancies easier to spot in a large dataset.
data.horsepower.unique()

When we print out all the unique values in horsepower, we find a '?' that was used as a placeholder for missing values. Let's remove these entries.

In [None]:
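# .copy() gives an independent frame, so later column assignments do not act on a slice of the original
data = data[data.horsepower != '?'].copy()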

In [None]:
# `in` on a Series tests the index labels, so check the values explicitly
print('?' in data.horsepower.values)

In [None]:
data.shape

In [None]:
data.dtypes

So we see that all entries with '?' as a placeholder are removed. However, the horsepower column is still of object type and not float. That is because pandas read the entire column as object when we imported the data set, due to the '?' entries. So let's change that column to float.

In [None]:
data.horsepower = data.horsepower.astype('float')
data.dtypes
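
An alternative to filtering on the '?' string and then casting is to let pandas do the coercion. The following is only a sketch on a small made-up frame (the name `toy` and its values are assumptions, not the real data set): `pd.to_numeric` with `errors='coerce'` turns anything that is not a number into NaN, and `dropna` then removes those rows.

```python
import pandas as pd

# Toy frame standing in for the real data; the values are made up for illustration.
toy = pd.DataFrame({"mpg": [18.0, 25.0, 21.0],
                    "horsepower": ["130", "?", "95"]})

# Anything that cannot be parsed as a number (here '?') becomes NaN.
toy["horsepower"] = pd.to_numeric(toy["horsepower"], errors="coerce")

# Drop the rows where horsepower is missing.
toy_clean = toy.dropna(subset=["horsepower"])

print(toy_clean.dtypes)
print(toy_clean.shape)
```

Passing `na_values='?'` to `pd.read_csv` should give the same result at load time, after which `isnull()` would report the missing entries directly.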

Now everything looks in order, so let's continue and describe the dataset.

In [None]:
data.describe()

- The first quartile (Q1), 17 MPG, is the value below which 25% of all MPG observations fall; the other 75% are larger.
- Q2, 22.75 MPG, is the same as the median (50% of MPG observations are smaller than Q2, 50% are larger).
- Only 25% of the observations are greater than the third quartile (Q3), 29 MPG.
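
The quartile definitions above can be checked directly with `Series.quantile`. This is a minimal sketch on made-up values (`mpg_sample` is an assumption, not the Auto-MPG column):

```python
import pandas as pd

# Made-up MPG values, used only to illustrate how quartiles are computed.
mpg_sample = pd.Series([10, 15, 17, 20, 23, 26, 29, 33, 40])

# quantile() accepts a list of probabilities and returns one value per entry.
quartiles = mpg_sample.quantile([0.25, 0.5, 0.75])
print(quartiles)

# The median is by definition the same as the 50% quantile.
print(mpg_sample.median() == quartiles.loc[0.5])

# Share of observations strictly below the first quartile.
share_below_q1 = (mpg_sample < quartiles.loc[0.25]).mean()
print(share_below_q1)
```

On the real data, the same call on `data['mpg']` should reproduce the 25%, 50% and 75% rows of `describe()`.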

## Step 1: Let's look at mpg

In [None]:
data.mpg.describe()

So the minimum value is 9 and the maximum is 46.6; the mean is about 23.4 with a standard deviation of about 7.8.

In [None]:
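# distplot() is deprecated in recent seaborn releases; histplot() with a KDE overlay gives a comparable plot
sns.histplot(data['mpg'], kde=True, stat='density')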


In [None]:
print("Skewness: %f" % data['mpg'].skew()) # A measure of distortion of symmetric distribution
print("Kurtosis: %f" % data['mpg'].kurt())
# Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution.
# That is, data sets with high kurtosis tend to have heavy tails, or outliers

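From the plot and the statistics above, mpg shows:

 - Slight positive skewness of 0.45
 - Kurtosis of -0.51 (pandas reports excess kurtosis, so a negative value means lighter tails than a normal distribution)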
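
To get a feel for how to read these two numbers, it can help to compute them on samples whose shape is known. The sketch below uses synthetic data (the names `normal_sample` and `right_skewed` are assumptions): a normal sample should give values near 0 for both statistics, while an exponential sample, with its long right tail, should give clearly positive values.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Synthetic samples, not the Auto-MPG data.
normal_sample = pd.Series(rng.normal(size=10_000))
right_skewed = pd.Series(rng.exponential(size=10_000))

# For a normal sample both statistics should be close to 0.
print(normal_sample.skew(), normal_sample.kurt())

# An exponential sample has a long right tail: positive skewness and kurtosis.
print(right_skewed.skew(), right_skewed.kurt())
```

Against that background, a skewness of 0.45 and a kurtosis of -0.51 describe a distribution that leans slightly to the right and has somewhat lighter tails than a normal curve.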

### Let's visualise some relationships between these data points
Before we do, we need to scale them to the same range of [0,1]. In order to do so, let's define a function scale().

In [None]:
def scale(a):
    b = (a-a.min())/(a.max()-a.min())
    return b

In [None]:
data_scale = data.copy()
data_corr = data.copy() # to be used for correlation matrix later

In [None]:
data_scale['displacement'] = scale(data_scale['displacement'])
#data_scale['horsepower'] = scale(data_scale['horsepower'])
data_scale['acceleration'] = scale(data_scale['acceleration'])
data_scale['weight'] = scale(data_scale['weight'])
data_scale['mpg'] = scale(data_scale['mpg'])

In [None]:
data_scale['mpg'].head()
#data_scale.to_csv("data_scale.csv")

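The displacement, acceleration, weight and mpg columns are now scaled to the same range of [0,1] (horsepower is left unscaled, since its line is commented out). This will help us visualize the data better. We used a copy of the original dataset for these operations.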
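
As a quick sanity check of min-max scaling, here is a small sketch on made-up numbers. The helper name `min_max_scale` and the frame `toy` are assumptions for illustration; the formula is the same as in scale() above.

```python
import pandas as pd

def min_max_scale(col):
    """Rescale a numeric Series linearly so its minimum maps to 0 and its maximum to 1."""
    return (col - col.min()) / (col.max() - col.min())

# Made-up values for illustration only.
toy = pd.DataFrame({"weight": [2000.0, 3000.0, 5000.0],
                    "mpg": [30.0, 20.0, 10.0]})

# apply() runs the function on every column, so all columns are scaled in one call.
toy_scaled = toy.apply(min_max_scale)
print(toy_scaled)
```

Note that a constant column would lead to a division by zero here, so this form is only suitable for columns with at least two distinct values.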

In [None]:
# Map the numeric origin codes to country names; these are easier to read and to plot as a category
data['Country_code'] = data.origin.replace([1,2,3],['USA','Europe','Japan'])
data_scale['Country_code'] = data.origin.replace([1,2,3],['USA','Europe','Japan'])
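
The same mapping could also be written with a dictionary and `Series.map`, which keeps each code next to its label. This is only a sketch on made-up codes (`origin` and `origin_names` are assumed names), using the code-to-country assignment from the cell above:

```python
import pandas as pd

# Origin codes as used in the cell above: 1 = USA, 2 = Europe, 3 = Japan.
origin_names = {1: "USA", 2: "Europe", 3: "Japan"}

origin = pd.Series([1, 3, 2, 1])  # made-up codes for illustration

# map() looks each code up in the dictionary; unknown codes would become NaN.
country = origin.map(origin_names)
print(country.tolist())
```

One difference from replace() is that map() turns any code missing from the dictionary into NaN instead of leaving it unchanged, which makes unexpected codes easier to notice.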

In [None]:
data_scale.head()

Let's look at MPG's relation to the categorical variables.

In [None]:
var = 'Country_code'
data_plt = pd.concat([data_scale['mpg'], data_scale[var]], axis=1)
f, ax = plt.subplots(figsize=(8, 6))
palette = ['plum', 'g', 'orange']
fig = sns.boxplot(x=var, y="mpg", data=data_plt, palette=palette)
fig.axis(ymin=0, ymax=1)
plt.axhline(data_scale.mpg.mean(),color='r',linestyle='dashed',linewidth=2)


The red line marks the average of the whole set. From the above plot we can observe:

 - The majority of the cars from the USA (almost 75%) have MPG below the global average.
 - The majority of the cars from Japan and Europe have MPG above the global average.
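
Observations read off a box plot can be backed up with numbers. The sketch below computes, per country, the share of cars whose MPG lies below the overall mean. It runs on made-up rows (the frame `toy` is an assumption), so the printed shares are not those of the real data.

```python
import pandas as pd

# Made-up rows for illustration; swap in the real frame to check the plot.
toy = pd.DataFrame({
    "Country_code": ["USA", "USA", "USA", "USA", "Japan", "Japan", "Europe", "Europe"],
    "mpg": [14.0, 16.0, 18.0, 30.0, 32.0, 35.0, 20.0, 31.0],
})

global_mean = toy["mpg"].mean()

# Boolean Series: True where a car is below the overall mean.
below = toy["mpg"] < global_mean

# The mean of a boolean group is the fraction of True values in that group.
share_below = below.groupby(toy["Country_code"]).mean()
print(share_below)
```

Since min-max scaling preserves order, the shares are the same whether they are computed on the scaled or the unscaled mpg column.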

### Let's look at the year-wise distribution of MPG

In [None]:
var = 'model year'
data_plt = pd.concat([data_scale['mpg'], data_scale[var]], axis=1)
f, ax = plt.subplots(figsize=(8, 6))
palette = ['plum', 'g', 'orange', 'b', 'r']
fig = sns.boxplot(x=var, y="mpg", data=data_plt, palette=palette)
fig.axis(ymin=0, ymax=1)
plt.axhline(data_scale.mpg.mean(),color='r',linestyle='dashed',linewidth=2)


- Newer model years tend to have higher MPG. The trend is not strictly monotonic, but the positive correlation is visible.

### And MPG distribution for cylinders

In [None]:
var = 'cylinders'
data_plt = pd.concat([data_scale['mpg'], data_scale[var]], axis=1)
f, ax = plt.subplots(figsize=(8, 6))
palette = ['plum', 'g', 'orange', 'b', 'r']
fig = sns.boxplot(x=var, y="mpg", data=data_plt, palette=palette)
fig.axis(ymin=0, ymax=1)
plt.axhline(data_scale.mpg.mean(),color='r',linestyle='dashed',linewidth=2)


## Now let us look at the correlation heatmap

In [None]:
corrmat = data_corr.corr()
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, square=True, vmin=-1, vmax=1, annot=True, cmap='BrBG');

In [None]:
factors = ['cylinders','displacement','horsepower','acceleration','weight','mpg']
corrmat = data[factors].corr()
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmin=-1, vmax=1, annot=True, cmap='BrBG');
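
When only the relation to mpg matters, the mpg column of the correlation matrix can be pulled out and sorted, which is often easier to read than the full heat map. A minimal sketch on made-up numbers (the frame `toy` is an assumption):

```python
import pandas as pd

# Made-up numbers for illustration only.
toy = pd.DataFrame({
    "mpg": [30.0, 25.0, 20.0, 15.0, 10.0],
    "weight": [2000.0, 2600.0, 3100.0, 3900.0, 4500.0],
    "acceleration": [19.0, 16.0, 17.0, 13.0, 12.0],
})

# Pearson correlation of every column with mpg, excluding mpg itself,
# sorted from the most negative to the most positive.
mpg_corr = toy.corr()["mpg"].drop("mpg").sort_values()
print(mpg_corr)
```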

In [None]:
#scatterplot
sns.set()
# `size` was renamed to `height` in newer seaborn releases; diag_kind="hist" is another option for the diagonal
sns.pairplot(data_scale, height=2.0, kind='scatter', hue='Country_code', markers=["o", "s", "D"])
plt.show()

## How are we doing so far?
We have explored the data to get a feel for it. For example, we saw the spread of the target variable MPG across the discrete variables, namely origin, model year and cylinders.

## Let's get back to the data distribution

In [None]:
var='mpg'
data[data[var]== data[var].min()] # the minimum mpg value and its other columns, as a whole row-data.

In [None]:
data[data[var]== data[var].max()]

In [None]:
var='displacement'
data[data[var]== data[var].min()]

In [None]:
data[data[var]== data[var].max()]

In [None]:
var='horsepower'
data[data[var]== data[var].min()]

In [None]:
data[data[var]== data[var].max()]

In [None]:
var='weight'
data[data[var]== data[var].min()]

In [None]:
data[data[var]== data[var].max()]

In [None]:
var='acceleration'
data[data[var]== data[var].min()]

In [None]:
data[data[var]== data[var].max()]
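
The min/max lookups above repeat the same pattern for each column, so they could be wrapped in a small helper. This is a sketch only: `extreme_rows` is a hypothetical helper name and `toy` is a made-up frame, not the real data.

```python
import pandas as pd

def extreme_rows(df, column):
    """Return the rows holding the minimum and maximum of `column` (hypothetical helper)."""
    is_extreme = (df[column] == df[column].min()) | (df[column] == df[column].max())
    return df[is_extreme].sort_values(column)

# Made-up rows for illustration only.
toy = pd.DataFrame(
    {"mpg": [9.0, 46.6, 20.0], "weight": [4700, 2100, 3000]},
    index=["car a", "car b", "car c"],
)

for col in ["mpg", "weight"]:
    print(extreme_rows(toy, col))
```

A boolean mask is used instead of idxmin()/idxmax() because the car names in the index are not guaranteed to be unique, and the mask also returns every row that ties for the extreme value.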

We have looked at the distribution of the data along the discrete variables and at some scatter plots from the seaborn pairplot. Now let's look for plausible explanations for the variations in mpg. We will use seaborn's lmplot() function, which draws a scatter plot together with a fitted regression line, to understand the trends in these relations. We can then compare what we see with the correlation heat map to check whether the conclusions drawn are consistent. We prefer lmplot() over regplot() because it accepts a hue argument and can therefore fit a separate line per category.

In [None]:
# Older seaborn releases accepted x and y positionally: sns.lmplot('horsepower', 'mpg', data=data, hue='Country_code')
# Recent releases require keyword arguments, as used here.
plot = sns.lmplot(data=data, x='horsepower', y='mpg', hue='Country_code')
plot.set(ylim=(0, 50))


In [None]:
plot = sns.lmplot(data=data, x='displacement',y='mpg',hue='Country_code')
plot.set(ylim = (0,50))

In [None]:
plot = sns.lmplot(data=data,x='weight',y='mpg',hue='Country_code')
plot.set(ylim = (0,50))

In [None]:
plot = sns.lmplot(data=data,x='acceleration',y='mpg',hue='Country_code')
plot.set(ylim = (0,50))
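
The regression lines drawn by lmplot() are not returned as numbers, so to compare slopes between countries one option is to fit a straight line per group separately. The sketch below uses `np.polyfit` on made-up points (the frame `toy` and the dictionary `fits` are assumed names); it illustrates the idea and is not a reproduction of seaborn's internal fit.

```python
import numpy as np
import pandas as pd

# Made-up points for illustration only.
toy = pd.DataFrame({
    "Country_code": ["USA"] * 3 + ["Japan"] * 3,
    "weight": [3000.0, 3500.0, 4000.0, 2000.0, 2200.0, 2400.0],
    "mpg": [20.0, 17.0, 14.0, 34.0, 32.0, 30.0],
})

# Degree-1 least-squares fit per country: polyfit returns [slope, intercept].
fits = {
    name: np.polyfit(group["weight"], group["mpg"], deg=1)
    for name, group in toy.groupby("Country_code")
}

for name, (slope, intercept) in fits.items():
    print(f"{name}: mpg = {slope:.4f} * weight + {intercept:.2f}")
```

A more negative slope means that mpg drops faster as the predictor (here weight) increases for that group of cars.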

In [None]:
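# Power-to-weight ratio: horsepower converted to kW (1 hp is about 0.7457 kW) divided by weight.
# Weight in this data set is assumed to be in pounds, so the result is in kW per lb.
data['Power_to_weight'] = (data.horsepower * 0.7457) / data.weight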
#When it comes to cars being sold everywhere around the world, automakers have to be very mindful of their fuel economy figures.
#This is where the power to weight ratios come into play. It doesn’t matter if you have the most powerful engine in the world
#when it is attached to the heaviest chassis in the world. If the engine doesn’t have to work as hard to get the vehicle up to speed,
#then the acceleration rate can be increased. https://www.autodeal.com.ph/articles/car-features/what-power-weight-ratio-and-why-it-important

In [None]:
data.sort_values(by='Power_to_weight',ascending=False ).head()
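
If the ratio is wanted in SI units, the weight has to be converted as well. The sketch below assumes that the weight column is given in pounds (an assumption about the data set, not something verified in this notebook); `power_to_weight_kw_per_kg` and the two constants are names introduced only for this illustration.

```python
HP_TO_KW = 0.7457        # 1 mechanical horsepower is about 0.7457 kW
LB_TO_KG = 0.45359237    # 1 pound is exactly 0.45359237 kg

def power_to_weight_kw_per_kg(horsepower, weight_lb):
    """Power-to-weight ratio in kW per kg (assumes weight is given in pounds)."""
    return (horsepower * HP_TO_KW) / (weight_lb * LB_TO_KG)

# Made-up example car: 150 hp and 3000 lb.
ratio = power_to_weight_kw_per_kg(150, 3000)
print(ratio)
```

Because both conversions are constant factors, the ranking of cars by power-to-weight ratio is the same in either unit; only the absolute numbers change.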


## Overview
So far, we have had a look at our data using various pandas methods and visualized it using the seaborn package. We looked at:

### MPG's relation to discrete variables
 - MPG distribution over the given years of manufacturing
 - MPG distribution by country of origin
 - MPG distribution by number of cylinders

### MPG's relation to other continuous variables
 - Pair-wise scatter plot of all variables in the data

### Correlation
 - We looked at the correlation heat map of all columns in our data
