# Tutorial 7: Exploratory Data Analysis (EDA)

## Objectives

After this tutorial you will be able to:

*   Understand the importance of EDA
*   Apply EDA techniques to different data types
*   Assess relationships between variables
*   Communicate findings effectively
*   Apply EDA to real-world datasets

<h2>Table of Contents</h2>

<ol>
    <li>
        <a href="#import">Import the dataset</a>
    </li>
    <br>
    <li>
        <a href="#desc">Descriptive Analysis</a>
    </li>
    <br>
    <li>
        <a href="#corr">Correlation Statistics</a>
    </li>
    <br>
</ol>


<hr id="import">

<h2>1. Import the dataset</h2>

Import `Pandas` and the other libraries used in this tutorial

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

Read the data from `csv` into a `Pandas DataFrame`

In [None]:
df = pd.read_csv('CO2_Emissions_Canada.csv')
df.head()

In [None]:
df.tail()

Get information about the columns of the `DataFrame`

In [None]:
df.info()

<hr id="desc">

<h2>2. Descriptive Analysis</h2>

Summarizing numerical data: measures of central tendency (mean, median, mode) and dispersion (range, variance, standard deviation)

In [None]:
print('Mean engine size:', df['Engine Size [L]'].mean())
print('Median engine size:', df['Engine Size [L]'].median())
print('Mode engine size:', df['Engine Size [L]'].mode())
print('Range of engine size:', df['Engine Size [L]'].max() - df['Engine Size [L]'].min())
print('Variance of engine size:', df['Engine Size [L]'].var())
print('Standard deviation of engine size:', df['Engine Size [L]'].std())

A quicker way to get most of these summary statistics at once is the `Pandas` method `DataFrame.describe()`

In [None]:
df.describe()

In [None]:
# create a boxplot for engine size
df.plot(kind='box', y='Engine Size [L]', title='Boxplot of Engine Size [L]', ylabel='Engine Size [L]')
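
The box in a boxplot spans the first to the third quartile, and with matplotlib's default settings the whiskers reach the last data points within 1.5 times the interquartile range (IQR) of the box. The sketch below reproduces those numbers by hand. The `sizes` values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical engine sizes in litres (illustrative values, not from the CSV)
sizes = pd.Series([1.5, 2.0, 2.0, 2.4, 3.0, 3.5, 3.6, 5.0, 8.4], name='Engine Size [L]')

# Quartiles and interquartile range: the edges and height of the box
q1 = sizes.quantile(0.25)
q3 = sizes.quantile(0.75)
iqr = q3 - q1

# Conventional 1.5 * IQR fences; points outside them are drawn as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = sizes[(sizes < lower_fence) | (sizes > upper_fence)]

print('Q1:', q1, 'Q3:', q3, 'IQR:', iqr)
print('Outliers:', outliers.tolist())
```

The same steps should work on the real column by replacing `sizes` with `df['Engine Size [L]']`.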

Summarizing categorical data: frequency tables and mode.  
We can use the `describe()` method on string (`object`) columns as follows:

In [None]:
df.describe(include='object')

We can also create a **frequency table** for a categorical column using the `value_counts()` method

In [None]:
df['Fuel Type'].value_counts()
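
Raw counts are easier to compare across columns when shown next to relative frequencies. A minimal sketch, using a small made-up `fuel` Series in place of the real column, passes `normalize=True` to `value_counts()` to get proportions:

```python
import pandas as pd

# Hypothetical fuel type codes (illustrative values, not from the CSV)
fuel = pd.Series(['X', 'X', 'Z', 'X', 'D', 'Z', 'E', 'X'], name='Fuel Type')

# Absolute and relative frequencies
counts = fuel.value_counts()
shares = fuel.value_counts(normalize=True)

# Combine both into one frequency table, aligned on the category labels
table = pd.DataFrame({'count': counts, 'share': shares.round(3)})
print(table)
```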

In [None]:
# create a countplot for fuel type
sns.countplot(x='Fuel Type', data=df)
plt.title('Fuel Type Counts')

<hr id="corr">

<h2>3. Correlation Statistics</h2>

<h3> Pearson Correlation Coefficient (Pearson's r)</h3>


The Pearson correlation coefficient, also known as Pearson's r, is a statistical measure of the **linear correlation** between two variables. It is a number between -1 and 1.

**Pearson's r**
- A value close to 1 indicates positive correlation
- A value close to -1 indicates negative correlation
- A value close to 0 indicates no linear correlation (a non-linear relationship may still exist)

**P-value**  
The P-value is the probability of observing a correlation at least as strong as the one in the sample if the two variables were in fact uncorrelated (the null hypothesis).   
It is typical to choose a significance level of 0.05: when `p < 0.05` we reject the null hypothesis and call the correlation statistically significant.  
  
| P-value            | Evidence that the correlation is significant |
| ---                | ---                                          |
| `p < 0.001`        | strong evidence                              |
| `0.001 ≤ p < 0.05` | moderate evidence                            |
| `0.05 ≤ p < 0.1`   | weak evidence                                |
| `p ≥ 0.1`          | no evidence                                  |

We can calculate the Pearson correlation between numerical columns in `Pandas` using the `DataFrame.corr()` method

In [None]:
df[['Engine Size [L]', 'CO2 Emissions [g/km]']].corr()

We can also calculate the Pearson correlation using the `scipy.stats` module

In [None]:
# calculate correlation between two columns
stats.pearsonr(df['Engine Size [L]'], df['CO2 Emissions [g/km]'])
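
To see what `pearsonr` computes, the sketch below evaluates Pearson's r directly from its definition (the covariance of the two variables divided by the product of their standard deviations) and compares it with the SciPy result. The arrays `engine_size` and `co2` hold made-up values for illustration only.

```python
import numpy as np
from scipy import stats

# Hypothetical paired measurements (illustrative values, not from the CSV)
engine_size = np.array([1.6, 2.0, 2.4, 3.0, 3.5, 5.0])
co2 = np.array([150.0, 180.0, 200.0, 240.0, 260.0, 350.0])

# Deviations from the mean of each variable
dx = engine_size - engine_size.mean()
dy = co2 - co2.mean()

# Pearson's r: sum of cross-products over the root of the product of squared deviations
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# SciPy returns the same coefficient together with a two-sided P-value
r_scipy, p_value = stats.pearsonr(engine_size, co2)

print(f'manual r = {r_manual:.4f}, scipy r = {r_scipy:.4f}, p = {p_value:.4g}')
```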

We can also calculate a correlation matrix between **ALL** numerical variables in a dataframe as follows:

In [None]:
# calculate the correlation matrix
df.corr(numeric_only=True)

An appropriate way to visualize a correlation matrix is through a heatmap

In [None]:
# plot the correlation matrix
df_corr = df.corr(numeric_only=True)
sns.heatmap(df_corr, annot=True, cmap='coolwarm')

An appropriate way to visualize the correlation between two variables is through a scatter plot

In [None]:
# plot the scatter plot between engine size and CO2 emissions
df.plot(kind='scatter', x='Engine Size [L]', y='CO2 Emissions [g/km]', figsize=(10, 6))
plt.title('Engine size vs CO2 emissions')

In [None]:
# create a scatter matrix
pd.plotting.scatter_matrix(df, figsize=(20, 20))

<h3>Chi-Square (<i>χ</i><sup>2</sup>) Test</h3>

The chi-square test is a statistical hypothesis test that is used to determine whether there is a significant association between two **categorical variables**.  
The chi-square test is based on the comparison of **observed** and **expected** frequencies in a contingency table.  

A chi-square value that is large relative to the table's degrees of freedom gives a small P-value, which is evidence that the two categorical variables are associated rather than independent.

In [None]:
# create the pivot/crosstab contingency table
pivot = pd.crosstab(df['Vehicle Class'], df['Fuel Type'])
pivot

In [None]:
# perform the chi-square test
stats.chi2_contingency(pivot)
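
`chi2_contingency` returns four values: the test statistic, the P-value, the degrees of freedom, and the table of expected frequencies. The sketch below unpacks them and rebuilds the expected frequencies by hand as `row total * column total / grand total`. The `observed` table is made up for illustration and does not come from the dataset.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical 2x2 contingency table (illustrative counts, not from the CSV)
observed = pd.DataFrame(
    [[30, 10], [20, 40]],
    index=['COMPACT', 'SUV'],
    columns=['X', 'Z'],
)

# Unpack the four results of the test
chi2, p_value, dof, expected = stats.chi2_contingency(observed)

# Expected counts under independence: row total * column total / grand total
row_totals = observed.sum(axis=1).to_numpy()
col_totals = observed.sum(axis=0).to_numpy()
expected_manual = np.outer(row_totals, col_totals) / observed.to_numpy().sum()

print('chi2:', chi2, 'p-value:', p_value, 'dof:', dof)
print(expected_manual)
```

For a 2x2 table such as this one, `chi2_contingency` applies Yates' continuity correction by default, which changes the statistic but not the expected frequencies.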

In [None]:
# create a heatmap
sns.heatmap(pivot, annot=True, fmt='d', cmap='coolwarm')

# title
plt.title('Vehicle Class vs Fuel Type')

<h3>ANOVA: Analysis of Variance</h3>

The Analysis of Variance (ANOVA) is a statistical method used to determine whether there are significant differences between the means of two or more groups. It is a powerful tool for analyzing data from experiments and observational studies.  

The `F-score` is the ratio of the variation between the group means to the variation within the groups, so a high `F-score` suggests that the grouping (independent) variable has an effect on the dependent variable (outcome).  
The `P-value` determines whether the `F-score` is statistically significant or not.

In [None]:
# group the CO2 emissions by fuel type for the ANOVA test
# (group by a single column name, not a list, so get_group('X') accepts a plain key)
df_anova = df[['Fuel Type', 'CO2 Emissions [g/km]']].groupby('Fuel Type')
df_anova.head()

In [None]:
# perform the ANOVA test
anova_results = stats.f_oneway(
    df_anova.get_group('X')['CO2 Emissions [g/km]'],
    df_anova.get_group('Z')['CO2 Emissions [g/km]'],
    df_anova.get_group('D')['CO2 Emissions [g/km]'],
    df_anova.get_group('E')['CO2 Emissions [g/km]'],
    df_anova.get_group('N')['CO2 Emissions [g/km]']
)
anova_results
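
As a rough illustration of what `f_oneway` computes, the sketch below derives the F-score by hand as the between-group mean square divided by the within-group mean square and checks it against SciPy. The three `groups` arrays are made-up emission values, not data from the CSV.

```python
import numpy as np
from scipy import stats

# Hypothetical CO2 emissions for three fuel types (illustrative values)
groups = [
    np.array([200.0, 210.0, 190.0, 205.0]),
    np.array([250.0, 260.0, 240.0, 255.0]),
    np.array([180.0, 175.0, 185.0, 190.0]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)        # number of groups
n = all_values.size    # total number of observations

# Variation of the group means around the grand mean
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
# Variation of the observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = between-group mean square / within-group mean square
f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))

f_scipy, p_value = stats.f_oneway(*groups)
print(f'manual F = {f_manual:.3f}, scipy F = {f_scipy:.3f}, p = {p_value:.4g}')
```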

In [None]:
# box plot of CO2 emissions for each fuel type
sns.boxplot(x='Fuel Type', y='CO2 Emissions [g/km]', data=df)

# show grid lines
plt.grid(axis='y')

<hr style="margin-top: 4rem;">
<h2>Author</h2>

<a href="https://github.com/SamerHany">Samer Hany</a>

<h2>References</h2>
<a href="https://www.w3schools.com/python/default.asp">w3schools.com</a>
<br>
<a href="https://www.kaggle.com/datasets/mrmorj/car-fuel-emissions">CO2 emissions dataset (kaggle.com)</a>