# Tutorial 6: Data Visualization

## Objectives

After this tutorial you will be able to:

*   Understand basic concepts of data visualization
*   Use Pandas, Matplotlib, and Seaborn to create different types of plots
*   Customize plots to make them more informative and visually appealing
*   Apply data visualization techniques to visualize and explore datasets

<h2>Table of Contents</h2>

<ol>
    <li>
        <a href="#import-1">Import dataset 1 (Categorical)</a>
    </li>
    <br>
    <li>
        <a href="#heatmap">Cross Tabulation and Heatmap Plot</a>
    </li>
    <br>
    <li>
        <a href="#line">Line Plot</a>
    </li>
    <br>
    <li>
        <a href="#bar">Bar Plot</a>
    </li>
    <br>
    <li>
        <a href="#pie">Pie Plot</a>
    </li>
    <br>
    <li>
        <a href="#import-2">Import dataset 2 (Numerical)</a>
    </li>
    <br>
    <li>
        <a href="#histogram">Histograms</a>
    </li>
    <br>
    <li>
        <a href="#box">Box Plot</a>
    </li>
    <br>
    <li>
        <a href="#scatter">Scatter Plot</a>
    </li>
    <br>
</ol>


<hr id="import-1">

<h2>1. Import dataset 1</h2>

Import the `Pandas`, `NumPy`, `Matplotlib`, and `Seaborn` libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Read the data from `csv` into a `Pandas DataFrame`

In [None]:
df = pd.read_csv('IHMStefanini_industrial_safety_and_health_database.csv')
df.head()

Get information about the columns of the `DataFrame`

In [None]:
df.info()

Standardize the data types

In [None]:
# convert column to datetime
df['Date'] = pd.to_datetime(df['Date'])
df.info()

In [None]:
# create a 'Month' column in YYYY-MM format (e.g. 2020-10)
df['Month'] = df['Date'].dt.year.astype(str) + '-' + df['Date'].dt.month.astype(str).str.zfill(2)
df.head()

In [None]:
df.tail()
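
As a small illustration of why the month is zero-padded, the sketch below builds the same `YYYY-MM` key on a hypothetical three-row frame (the dates are made up, not taken from the dataset). It uses `dt.strftime('%Y-%m')` as a shorter equivalent of the string concatenation above, and compares the result with an unpadded key.

```python
import pandas as pd

# Hypothetical toy data; the real notebook reads the dates from the CSV file.
toy = pd.DataFrame({'Date': ['2016-01-05', '2016-10-12', '2016-02-20']})
toy['Date'] = pd.to_datetime(toy['Date'])

# Zero-padded YYYY-MM keys sort chronologically even though they are strings.
toy['Month'] = toy['Date'].dt.strftime('%Y-%m')
padded_order = sorted(toy['Month'])

# Without zero-padding, October ('2016-10') sorts before February ('2016-2').
unpadded = toy['Date'].dt.year.astype(str) + '-' + toy['Date'].dt.month.astype(str)
unpadded_order = sorted(unpadded)

print(padded_order)
print(unpadded_order)
```

Because the padded keys sort chronologically as plain strings, a later `groupby('Month')` returns its groups in calendar order.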

<hr id="heatmap">

<h2>2. Cross Tabulation and Heatmap Plot</h2>

Heatmaps show the magnitude of a variable across two dimensions with colors.

In [None]:
# group by country and accident level, and count the number of accidents
df_country_level = df.groupby(['Country', 'Accident Level'], as_index=False)['Date'].count()
df_country_level


A more convenient and compact way to see the distribution of a parameter across two groups or categories of data is a `pivot table` (cross tabulation), which can then be drawn as a `heatmap plot`.

In [None]:
# create a pivot table showing the number of accidents by country and by accident level
pivot = pd.crosstab(df['Country'], df['Accident Level'])
pivot
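
To see how the two tables relate, here is a minimal sketch on a hypothetical five-row frame (toy values, not the accident data): the grouped counts are the long form of the same numbers that `pd.crosstab` lays out as a grid.

```python
import pandas as pd

# Hypothetical toy records standing in for the accident dataset.
toy = pd.DataFrame({
    'Country': ['A', 'A', 'B', 'B', 'B'],
    'Accident Level': ['I', 'II', 'I', 'I', 'II'],
})

# Long form: one count per (country, level) pair.
long_counts = toy.groupby(['Country', 'Accident Level']).size()

# Wide form: countries as rows, accident levels as columns.
wide_counts = pd.crosstab(toy['Country'], toy['Accident Level'])

# Unstacking the long form should reproduce the crosstab grid.
unstacked = long_counts.unstack(fill_value=0)
same_values = (unstacked.to_numpy() == wide_counts.to_numpy()).all()

print(wide_counts)
print(same_values)
```

The wide form is the one a heatmap needs, since each cell of the grid maps to one colored square.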

To create any plot using the `matplotlib` library, we simply use the `pyplot` module (assigned an alias of `plt` in this file) and use the corresponding plot type method.

Common plot types:

| PLOT TYPE         |  pyplot method        |
| ---               |  ---                  |
| line (default)    |  `plot(x, y)`         |
| scatter           |  `scatter(x, y)`      |
| bar               |  `bar(x, height)`     |
| pie               |  `pie(x, labels)`     |
| histogram         |  `hist(x)`            |
| box               |  `boxplot(x)`         |
| heatmap           |  `pcolor(matrix)`     |

In [None]:
# plot heatmap
plt.pcolor(pivot)

That's how easy it is to create a plot with `matplotlib`!  
  
But we can add a few adjustments to make it much clearer (e.g. title, axes labels, etc.)  
There are a lot of available methods to add different elements to the active plot.  
Once a plot is created, all we need to do is call the required method from the `pyplot (plt)` module, and that method will be applied to the active plot.

In [None]:
# create a blank figure with a size of 8x6 inches
plt.figure(figsize=(8, 6))

# create heatmap plot
plt.pcolor(pivot, cmap='RdBu')

# add a colorbar
plt.colorbar()

# add a plot title
plt.title('Heatmap of number of accidents by country and accident level', y=1.1)

# add Y-axis title
plt.ylabel('Country')

# update Y-axis ticks/labels (+0.5 centers each tick on its cell)
plt.yticks(ticks=np.arange(len(pivot)) + 0.5, labels=pivot.index)

# add X-axis title
plt.xlabel('Accident Level')

# update X-axis ticks/labels
plt.xticks(np.arange(len(pivot.columns)) + 0.5, pivot.columns)

# show plot (only required in Python script files; Jupyter Notebooks render the figure automatically)
plt.show()

Now this looks much better and clearer!  
We can also use `seaborn` to plot a similar heatmap with a single call, with the count annotated in each cell.

In [None]:
sns.heatmap(pivot, cmap='RdBu', annot=True, fmt='d')
plt.title('Heatmap of number of accidents by country and accident level', y=1.1)
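
The `plt.title()`, `plt.xlabel()` and similar calls used in this section all act on the currently active figure. As a hedged alternative, the sketch below draws the same kind of heatmap through explicit `Figure` and `Axes` objects, which makes it clear which plot each call modifies. The 2x3 `counts` matrix and its labels are hypothetical stand-ins for `pivot`, and the `Agg` backend is selected only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical 2x3 count matrix standing in for the pivot table.
counts = np.array([[5, 2, 1], [3, 4, 0]])
row_labels = ['Country A', 'Country B']
col_labels = ['I', 'II', 'III']

# Create the figure and its Axes explicitly instead of relying on the active plot.
fig, ax = plt.subplots(figsize=(8, 6))
mesh = ax.pcolor(counts, cmap='RdBu')
fig.colorbar(mesh, ax=ax)

# Cell centers sit at 0.5, 1.5, ... so the tick positions are offset by 0.5.
ax.set_xticks(np.arange(counts.shape[1]) + 0.5)
ax.set_xticklabels(col_labels)
ax.set_yticks(np.arange(counts.shape[0]) + 0.5)
ax.set_yticklabels(row_labels)

ax.set_xlabel('Accident Level')
ax.set_ylabel('Country')
ax.set_title('Heatmap drawn through an explicit Axes object')
```

Each `plt.something()` call generally has an `ax.set_something()` counterpart, which is useful once a figure contains more than one plot.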

<hr id="line">

<h2>3. Line Plot</h2>

Line plots show trends and changes over time.

In [None]:
# group by month and count the accidents
df_months = df.groupby('Month', as_index=False)['Accident Level'].count()
df_months

We can plot the `line` chart using the `pyplot (plt)` module

In [None]:
plt.plot(df_months['Month'], df_months['Accident Level'])

Alternatively, `Pandas` integrates with `matplotlib` and has plotting capabilities built into `DataFrame` objects.  
To plot data from a `DataFrame`, call its `plot()` method and set the `kind` parameter to the required plot type.  

For example, here are the equivalent `Pandas` methods to the `plt` methods mentioned above:

| PLOT TYPE         |  pyplot method        | DataFrame method                  |
| ---               |  ---                  | ---                               |
| line (default)    |  `plot(x, y)`         | `plot(kind='line', x, y)`         |
| scatter           |  `scatter(x, y)`      | `plot(kind='scatter', x, y)`      |
| bar               |  `bar(x, height)`     | `plot(kind='bar', x, y)`          |
| pie               |  `pie(x, labels)`     | `plot(kind='pie', x, y)`          |
| histogram         |  `hist(x)`            | `plot(kind='hist', x)`            |
| box               |  `boxplot(x)`         | `plot(kind='box', x)`             |

In [None]:
# line plot with adjustments

# we can also specify the figure size in the DataFrame plot method
df_months.plot(
    kind='line', 
    figsize=(15, 5), 
    x='Month', 
    y='Accident Level', 
    title='Number of accidents per month', 
    ylabel='Number of accidents', 
    legend=False,
    color='red',                # line color
    linestyle='dotted',         # line style
    marker='o',                 # marker type
    markersize=5,               # marker size
    markerfacecolor='blue',     # marker fill color
    markeredgecolor='blue',     # marker border color
)

# X-axis ticks/labels
plt.xticks(ticks=range(len(df_months)), labels=df_months['Month'], rotation=45)

# set X margins to 0 (spacing between X-axis and the first/last datapoint)
plt.margins(x=0)

# show grid lines
plt.grid()
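
The `plt.xticks()`, `plt.margins()` and `plt.grid()` calls above work because `DataFrame.plot()` draws on a regular `matplotlib` Axes and returns it. A minimal sketch on a hypothetical three-month frame (made-up counts), with the `Agg` backend selected so it runs without a display:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical monthly counts standing in for df_months.
toy = pd.DataFrame({
    'Month': ['2016-01', '2016-02', '2016-03'],
    'Accidents': [4, 7, 5],
})

# DataFrame.plot returns the matplotlib Axes it drew on.
ax = toy.plot(kind='line', x='Month', y='Accidents', legend=False)

# That Axes is also the "active" one that pyplot functions act on.
is_active_axes = ax is plt.gca()

# So the same adjustments can be made through the Axes object directly.
ax.set_ylabel('Number of accidents')
ax.grid(True)

print(is_active_axes)
```

Keeping the returned `ax` is optional in a notebook, but it avoids ambiguity when several figures are open at once.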

<hr id="bar">

<h2>4. Bar Plot</h2>

Bar plots compare the values between different categories or groups.

In [None]:
# get accident level counts
df_accident_level = df.groupby('Accident Level')['Accident Level'].count()
df_accident_level

In [None]:
# bar plot

# since the result is a Series (a single column of counts), we can call plot without specifying x and y
df_accident_level.plot(kind='bar', ylabel='Number of accidents', title='Number of accidents per level')

# X-axis ticks/labels
plt.xticks(rotation=0)

# show grid lines
plt.grid(axis='y')

We can use `seaborn.countplot()` to produce the same plot as above without having to group by accident level.

In [None]:
# bar plot with counts of unique values
order_list = ['I', 'II', 'III', 'IV', 'V']
sns.countplot(data=df, x='Accident Level', order=order_list)

# show grid lines
plt.grid(axis='y')

<hr id="pie">

<h2>5. Pie Plot</h2>

Pie plots show the proportion of each category or group within the whole.

In [None]:
# count accidents per industry sector
df_industry_sector = df.groupby('Industry Sector')['Industry Sector'].count()
df_industry_sector

In [None]:
# pie plot
df_industry_sector.plot(kind='pie')

In [None]:
# adjust plot
df_industry_sector.plot(
    kind='pie', 
    figsize=(10, 6), 
    title='Number of accidents per industry sector',
    ylabel='',              # remove Y-axis label
    labels=None,            # disable automatic labels (we will add them in the legend in the next step)
    autopct='%1.1f%%',      # format of the percentage values
    startangle=90,          # start angle of the first pie slice
    explode=[0, 0, 0.1]     # fraction of the radius by which to offset each pie slice (here we explode the "Others" slice)
)

# add legend
plt.legend(labels=df_industry_sector.index, loc='lower right')
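
The percentages printed by `autopct` are each slice's share of the total. A small sketch, on a hypothetical list of sector labels (not the real data), of how the same counts and shares can be computed directly with `Series.value_counts()`:

```python
import pandas as pd

# Hypothetical sector labels standing in for df['Industry Sector'].
sectors = pd.Series(
    ['Mining', 'Metals', 'Mining', 'Others', 'Mining', 'Metals'],
    name='Industry Sector',
)

# Counts per category, sorted by label like the groupby result.
counts = sectors.value_counts().sort_index()

# Shares of the total, expressed as percentages rounded to one decimal.
shares = (sectors.value_counts(normalize=True).sort_index() * 100).round(1)

print(counts)
print(shares)
```

Checking the shares numerically is a quick way to confirm that the pie labels and the legend entries are in the same order.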

<hr id="import-2">

<h2>6. Import dataset 2</h2>

Read the data from `csv` into a `Pandas DataFrame`

<h4>Understanding the Data</h4>

**Model**	
- 4WD/4X4 = Four-wheel drive
- AWD = All-wheel drive
- FFV = Flexible-fuel vehicle
- SWB = Short wheelbase
- LWB = Long wheelbase
- EWB = Extended wheelbase  

**Transmission**	
- A = automatic
- AM = automated manual
- AS = automatic with select shift
- AV = continuously variable
- M = manual
- 3 - 10 = Number of gears  

**Fuel type**	
- X = regular gasoline
- Z = premium gasoline
- D = diesel
- E = ethanol (E85)
- N = natural gas  

**Fuel consumption**	
- City and highway fuel consumption ratings are shown in litres per 100 kilometres (L/100 km) 
- The combined rating (55% city, 45% hwy) is shown in L/100 km and in miles per imperial gallon (mpg)  

**CO2 emissions**	
- Tailpipe emissions of carbon dioxide (in grams per kilometre) for combined city and highway driving
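
As a worked example of the fuel consumption definitions above, the sketch below combines a city and a highway rating with the 55% / 45% weighting and converts the result to miles per imperial gallon. The two input ratings are hypothetical, and the function names are illustrative only; the conversion constants are the standard definitions of the imperial gallon and the mile.

```python
# Conversion constants (standard definitions).
LITRES_PER_IMPERIAL_GALLON = 4.54609
KM_PER_MILE = 1.609344


def combined_rating(city_l_per_100km, hwy_l_per_100km):
    """Weighted average: 55% city driving, 45% highway driving."""
    return 0.55 * city_l_per_100km + 0.45 * hwy_l_per_100km


def l_per_100km_to_mpg_imperial(l_per_100km):
    """Convert litres per 100 km to miles per imperial gallon."""
    miles_per_100km = 100 / KM_PER_MILE
    gallons_per_100km = l_per_100km / LITRES_PER_IMPERIAL_GALLON
    return miles_per_100km / gallons_per_100km


# Hypothetical ratings: 10.0 L/100 km in the city, 7.0 L/100 km on the highway.
combined = combined_rating(10.0, 7.0)
mpg = l_per_100km_to_mpg_imperial(combined)

print(round(combined, 2), round(mpg, 1))
```

Note that the two units move in opposite directions: a lower L/100 km value means a higher mpg value.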

In [None]:
df = pd.read_csv('CO2_Emissions_Canada.csv')
df.head()

In [None]:
df.info()

<hr id="histogram">

<h2>7. Histograms</h2>

Histograms show how a variable is distributed by counting the values that fall into discrete intervals called `bins`.

In [None]:
# histogram
df['Engine Size [L]'].plot(kind='hist')

As we see in the default plot, the X-axis ticks do not align with the bin edges.  
The `numpy` function `histogram(values, bins)` computes the histogram for a given number of bins.  
It returns an array with the frequency of each bin and another array with the bin edges, which we can use as ticks.

In [None]:
# use numpy to get the histogram data
freq, bin_edges = np.histogram(df['Engine Size [L]'], 10)
print(freq)
print(bin_edges)
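
To make the return values concrete, here is a minimal sketch on six hypothetical values (not taken from the dataset). With 4 bins over the range 1 to 5, `np.histogram` returns 4 counts and 5 bin edges.

```python
import numpy as np

# Hypothetical engine sizes, chosen so the bin edges are round numbers.
values = [1.0, 1.5, 2.0, 2.0, 3.5, 5.0]

freq, bin_edges = np.histogram(values, bins=4)

print(freq)       # counts per bin
print(bin_edges)  # always one more edge than there are bins
```

Each bin includes its left edge and excludes its right edge, except the last bin, which includes both. That is why the two values equal to 2.0 are counted in the second bin and 5.0 in the last one.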

In [None]:
# plot histogram
df['Engine Size [L]'].plot(kind='hist', bins=10, title='Histogram of engine size', xlabel='Engine size [L]')

# X-axis ticks/labels
plt.xticks(bin_edges, rotation=90)

# show grid lines
plt.grid()

<hr id="box">

<h2>8. Box Plot</h2>

A box plot (also called a box-and-whisker plot) summarizes the distribution of a single variable and flags its outliers.

It shows the following statistics:
- minimum (Q0, excluding outliers)
- first quartile (Q1)
- median (Q2)
- third quartile (Q3)
- maximum (Q4, excluding outliers)
- interquartile range (IQR = Q3 - Q1)
- max outliers (> Q3 + 1.5*IQR)
- min outliers (< Q1 - 1.5*IQR)
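
The statistics listed above can be computed by hand. The sketch below does so with `numpy` on ten hypothetical values (not from the dataset), using the 1.5 * IQR rule to separate the whiskers from the outliers. It assumes the default linear interpolation of `np.percentile`.

```python
import numpy as np

# Hypothetical sample with one clearly extreme value.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Fences: anything beyond them is drawn as an individual outlier point.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

inside = values[(values >= lower_fence) & (values <= upper_fence)]
outliers = values[(values < lower_fence) | (values > upper_fence)]

# The whiskers end at the most extreme values that are not outliers.
whisker_low, whisker_high = inside.min(), inside.max()

print(q1, median, q3, iqr)
print(whisker_low, whisker_high, outliers)
```

Here the value 30 lies above the upper fence (14.5), so the upper whisker stops at 9 and 30 is reported as an outlier.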

In [None]:
# box plot
df['CO2 Emissions [g/km]'].plot(kind='box', title='Box plot of CO2 emissions', ylabel='CO2 Emissions [g/km]')

# show grid lines
plt.grid()

We can compare how the values are distributed across the groups of a categorical column by using `seaborn.boxplot()` with the `hue` parameter.

In [None]:
# box plot
sns.boxplot(y=df['CO2 Emissions [g/km]'], hue=df['Fuel Type'])

# show grid lines
plt.grid(axis='y')

We can also plot multiple box plots on the same graph as follows:

In [None]:
# multiple box plots
df[['Fuel Consumption City [L/100 km]', 'Fuel Consumption Hwy [L/100 km]', 'Fuel Consumption Comb [L/100 km]']].plot(
    kind='box',
    title='Box plot of fuel consumption for different driving conditions',
    ylabel='Fuel Consumption [L/100 km]',
    xlabel='Fuel Consumption'
)

# X-axis ticks/labels
plt.xticks(ticks=[1, 2, 3], labels=['City', 'Highway', 'Combined'])

# show grid lines
plt.grid()

<hr id="scatter">

<h2>9. Scatter Plot</h2>

Scatter plots show the relationship between two numerical variables.

In [None]:
# scatter plot
df.plot(kind='scatter', x='Engine Size [L]', y='CO2 Emissions [g/km]', title='Scatter plot of engine size vs. CO2 emissions', figsize=(15, 5))

# show grid lines
plt.grid()

Calculate the trend line using `numpy.polyfit()`. With a degree of 1, it returns the slope and the intercept of the best-fit line.

In [None]:
# trend line
fit = np.polyfit(df['Engine Size [L]'], df['CO2 Emissions [g/km]'], 1)
fit
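
For a degree-1 fit, `np.polyfit` returns the slope and intercept of the least-squares line. As a sanity check, the sketch below compares it with the closed-form formulas (slope = covariance of x and y divided by the variance of x) on four hypothetical points that are not from the dataset.

```python
import numpy as np

# Hypothetical points that lie close to, but not exactly on, a straight line.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 10.0])

# Degree-1 polynomial fit: coefficients come back highest power first.
slope, intercept = np.polyfit(x, y, 1)

# Closed-form least-squares solution for comparison.
x_mean, y_mean = x.mean(), y.mean()
slope_manual = ((x - x_mean) * (y - y_mean)).sum() / ((x - x_mean) ** 2).sum()
intercept_manual = y_mean - slope_manual * x_mean

# Evaluate the fitted line at the original x values.
y_pred = np.polyval([slope, intercept], x)

print(round(slope, 4), round(intercept, 4))
print(round(slope_manual, 4), round(intercept_manual, 4))
```

Both approaches give a slope of about 2.3 and an intercept of about 0.5 for these points.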

<h4>Regression Plot</h4>

A regression plot is a scatter plot with a linear regression trend line drawn over it.

In [None]:
# scatter plot
df.plot(kind='scatter', x='Engine Size [L]', y='CO2 Emissions [g/km]', title='Scatter plot of engine size vs. CO2 emissions', figsize=(15, 5))

# plot trend line
y = fit[0] * df['Engine Size [L]'] + fit[1]
plt.plot(df['Engine Size [L]'], y, color='red')

# annotate the trend line equation
plt.annotate('y = {0:.2f}x + {1:.2f}'.format(fit[0], fit[1]), xy=(5, 200), color='red')

# show grid lines
plt.grid()

We can also use `seaborn.regplot()` to create the above regression plot with trendline without having to perform the fitting step.

In [None]:
plt.figure(figsize=(15, 5))
sns.regplot(x='Engine Size [L]', y='CO2 Emissions [g/km]', data=df)
plt.title('Scatter plot of engine size vs. CO2 emissions')


<h4>Bubble Plot</h4>

A bubble plot is a scatter plot that encodes an additional dimension in the size of the points.

In [None]:
# bubble plot: point size encodes the number of cylinders
plt.figure(figsize=(15, 5))
sns.scatterplot(data=df, x='Engine Size [L]', y='CO2 Emissions [g/km]', size='Cylinders', alpha=0.5, sizes=(20, 400), color='green')

# title
plt.title('Scatter plot of engine size vs. CO2 emissions')

# show grid lines
plt.grid()

<hr style="margin-top: 4rem;">
<h2>Author</h2>

<a href="https://github.com/SamerHany">Samer Hany</a>

<h2>References</h2>
<a href="https://www.w3schools.com/python/default.asp">w3schools.com</a>
<br>
<a href="https://www.kaggle.com/datasets/ihmstefanini/industrial-safety-and-health-analytics-database">industrial safety dataset (kaggle.com)</a>
<br>
<a href="https://www.kaggle.com/datasets/mrmorj/car-fuel-emissions">CO2 emissions dataset (kaggle.com)</a>