# Chapter 4: Data Visualization

In [None]:
!pip install matplotlib

In [None]:
%reset
low_memory=False
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## 4.1 Introduction & Problem Setting

In this lesson we are going to go over the basics of plotting in Python. For this we will use the matplotlib library. There are other Python libraries (seaborn, plotnine, ...) that offer different types of functionality, and tools in other programming languages that are well suited to data visualisation (e.g. ggplot2 in R), but let's take the path of least resistance for now.

For those of you who are interested in the specifics of including local/web images in your jupyter notebook: consult the StackOverflow oracle at https://stackoverflow.com/questions/32370281/how-to-embed-image-or-picture-in-jupyter-notebook-either-from-a-local-machine-o

Those interested in Markdown and how it relates to HTML can visit https://www.xaprb.com/blog/how-to-style-images-with-markdown/

## 4.2 Basic data visualization

Before we get started with visualizing our first plots, remember that the internet is a beautiful place! There are plenty of tutorials out there which offer a unique view on the subject. Two examples are [this tutorial](https://www.skytowner.com/explore/getting_started_with_matplotlib) and the official [documentation](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.show.html).

Now, let's get started by plotting a single data point!

In [None]:
plt.scatter(1, 1, color = 'r')

We can easily add a second point to our visualization by passing a list of x coordinates and a list of y coordinates to the function.

In [None]:
plt.scatter([1,3],[1,3], color = 'r')

We can also play around with the colour!

There are multiple options available. You can use single-letter abbreviations (such as 'r' for red), RGBA tuples, HEX codes or even common CSS named colours.

In [None]:
plt.scatter([1,2],[1,2], color = 'gold')

In [None]:
plt.scatter([1,2],[1,2], color = (0.1, 0.2, 0.5, 0.3))
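
To make the different colour notations a bit more concrete, here is a small sketch that converts four ways of writing "red" to the same RGBA tuple. It uses `matplotlib.colors.to_rgba`; the variable names `specs` and `rgba_values` are just for illustration.

```python
from matplotlib import colors as mcolors

# Four ways of writing the same colour: abbreviation, CSS name, HEX code, RGBA tuple
specs = ['r', 'red', '#ff0000', (1.0, 0.0, 0.0, 1.0)]

# to_rgba turns any valid colour specification into an (r, g, b, a) tuple
rgba_values = [mcolors.to_rgba(spec) for spec in specs]

for spec, rgba in zip(specs, rgba_values):
    print(spec, '->', rgba)
```

The last value in each tuple is the alpha channel (opacity), which is why the RGBA example above looks faded: its alpha is only 0.3.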

Right now we are working in a **scatterplot**. This works great for visualizing separate data points. But what do we do if we want to show an evolution of data and we want to connect the dots?

Instead of remaining in our plt.scatter world, we use plt.plot! This allows us to draw lines with ease.

In [None]:
plt.plot([1,2],[1,2], color = 'r')
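
If you want both the dots and the connecting line in one go, `plt.plot` also accepts a `marker` argument. A minimal sketch with three made-up points:

```python
import matplotlib.pyplot as plt

# marker='o' draws a dot on every data point, the line connects them
lines = plt.plot([1, 2, 3], [1, 3, 2], color='r', marker='o')

# plt.plot returns a list with one Line2D object per drawn line
first_line = lines[0]
plt.close()
```

Remove the `plt.close()` line if you want the notebook to display the result directly.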

Everything we've seen so far is good for basic visualizations, but when it comes to proper data plotting we are going to need a bit more structure. Luckily, matplotlib already defines such a structure for us to use!

A visualization is what we usually call a **Figure**. Each figure can consist of multiple plots or **axes**. Remember, these are not to be confused with the x and y **axis** of a plot! Please study the graph below to get a feeling of the structure.

![](https://storage.googleapis.com/skytowner_public/images/uG4nzrYfQ9KDk2M87YDI/matplotlib_figure_axes_axis%20(4).png)

Let's try this out! Just like always we are going to store the output of our plot in a variable so we can use it later. We will be making one figure with a single axes containing a simple plot with two data points.

Normally, when we run our code it would display the output instantly. To stop this, simply add 'plt.close()' *after* your plot code.

In [None]:
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
ax.scatter([1,2],[1,2], color = 'b')
plt.close()

In [None]:
fig
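
As a side note, matplotlib also offers `plt.subplots()`, which creates the figure and its axes in a single call. The sketch below should be equivalent to the cell above for the single-axes case; the name `n_axes` is only there to show that the figure keeps track of its axes.

```python
import matplotlib.pyplot as plt

# One call returns both the Figure and a single Axes object
fig, ax = plt.subplots()
ax.scatter([1, 2], [1, 2], color='b')

# Close the figure so it is not displayed straight away
plt.close(fig)

# The figure remembers which axes belong to it
n_axes = len(fig.axes)
```

Whether you use `fig.add_subplot` or `plt.subplots` is mostly a matter of taste; both give you an axes object to draw on.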

#### Question 1: Create a new plot which has two axes: the one from the plot above and the one where we drew the line. Can you put them next to each other? What about below each other?

### 4.2.1 Data reading
Let's start by having a look at some actual data. There are 3 files with weather station data (extracted from [ncdc.noaa.gov](https://www.ncdc.noaa.gov/cdo-web/search)) on Canvas called '72356013968.csv', '72546014933.csv' and '72450003928.csv'. Download them and place them in the same directory as this notebook.

#### Question 2: Recap! Load all three csv files in memory. Look at the number of columns and records. Can you see an issue with this? Merge the top 100 records of each dataframe into a single new dataframe.
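
The rest of this chapter assumes the merged result lives in a dataframe called `station_data`. One possible way to do the merging step is sketched below. The helper `combine_top_records` and the tiny `toy_a` / `toy_b` frames are made up for illustration; with the real files you would pass the three dataframes returned by `pd.read_csv` instead.

```python
import pandas as pd

def combine_top_records(frames, n=100):
    """Stack the first n rows of every dataframe into one new dataframe."""
    return pd.concat([df.head(n) for df in frames], ignore_index=True)

# Made-up stand-ins for the real station files (NOT actual weather data)
toy_a = pd.DataFrame({'STATION': [1, 1, 1], 'TMP': ['50,0', '51,5', '49,0']})
toy_b = pd.DataFrame({'STATION': [2, 2, 2], 'TMP': ['40,0', '41,5', '39,0']})

# Keep only the top 2 rows of each toy frame
combined = combine_top_records([toy_a, toy_b], n=2)
print(combined)
```

Using `ignore_index=True` gives the combined dataframe a fresh 0..n-1 index instead of repeating the index of each source file.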

## 4.3 Plotting the data


Just like we structured our figure in separate axes, we can bring some structure to our individual plots. Study the image below to get a feeling of what we will be able to control!

![](https://storage.googleapis.com/skytowner_public/images/uG4nzrYfQ9KDk2M87YDI/Anatomy_1.png)

We can now easily plot this data using the methods we've seen above.

In [None]:
plt.plot(station_data['DATE'], station_data['TMP'])

Hang on, this looks like absolute garbage, what's going on? The data points are not readable at all, and we are not able to see any difference between our separate stations. We can fix this by pulling the data of each station apart once more and layering them!

### 4.3.1 Plotting layer by layer


First, we can check how many different stations our dataset contains and how many records each station has. This should be 3 x 100 as we just composed our dataset, but it never hurts to check!

In [None]:
station_data.groupby(['STATION', 'NAME']).size()

Beautiful, we can even see the name of the stations! Now let's create a few separate series, two for each station. We are creating two per station because we want to display something on our x axis (date) and something on our y axis (temperature).

In [None]:
TU_date = station_data[station_data['STATION'] == 72356013968]['DATE']
TU_temp = station_data[station_data['STATION'] == 72356013968]['TMP']

WI_date = station_data[station_data['STATION'] == 72450003928]['DATE']
WI_temp = station_data[station_data['STATION'] == 72450003928]['TMP']

DM_date = station_data[station_data['STATION'] == 72546014933]['DATE']
DM_temp = station_data[station_data['STATION'] == 72546014933]['TMP']

And now we can plot this. Remember to give each station its own colour!

In [None]:
plt.plot(TU_date, TU_temp, color = 'r')
plt.plot(WI_date, WI_temp, color = 'g')
plt.plot(DM_date, DM_temp, color = 'b')
plt.show()
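
Writing out the same filter six times works, but it gets tedious with more stations. A possible alternative is to let `groupby` split the dataframe and draw one line per group. The sketch below uses a made-up `toy` dataframe with the same column names as `station_data`; the values are invented.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up data with the same column names as station_data (NOT real measurements)
toy = pd.DataFrame({
    'STATION': [1, 1, 2, 2],
    'DATE': pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-01', '2024-01-02']),
    'TMP': [50.0, 52.0, 40.0, 43.0],
})

fig, ax = plt.subplots()
# groupby yields one (station id, sub-dataframe) pair per station
for station_id, group in toy.groupby('STATION'):
    ax.plot(group['DATE'], group['TMP'], label=str(station_id))
ax.legend()
plt.close(fig)
```

Because no colour is passed, matplotlib picks the next colour of its default cycle for every line, so each station still gets its own colour.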

### 4.3.2 Issues

Currently we have a plot, but it is not very appealing. There are still several issues we can find here.

#### Question 3: What can we improve in the plot above?

Before we continue, did you notice that neither the temperatures nor the dates were recognised as their respective data types, and thus did not display correctly? No? Then there you already have one thing to improve on the plot :D

Anyway, before we continue we must convert these columns. While we're at it, let's also convert the temperature from Fahrenheit to Celsius.

In [None]:
station_data['TMP'].dtype

In [None]:
station_data['TMP'].head()

In [None]:
station_data['TMP'] = station_data['TMP'].str.replace(',', '.').astype(float)

In [None]:
TU_date = pd.to_datetime(station_data[station_data['STATION'] == 72356013968]['DATE'], format='%d/%m/%Y %H:%M')
TU_temp = (5/9)*(station_data[station_data['STATION'] == 72356013968]['TMP'] - 32)

WI_date = pd.to_datetime(station_data[station_data['STATION'] == 72450003928]['DATE'], format='%d/%m/%Y %H:%M')
WI_temp = (5/9)*(station_data[station_data['STATION'] == 72450003928]['TMP']  -32)

DM_date = pd.to_datetime(station_data[station_data['STATION'] == 72546014933]['DATE'], format='%d/%m/%Y %H:%M')
DM_temp = (5/9)*(station_data[station_data['STATION'] == 72546014933]['TMP'] -32)
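
To see what each conversion step does in isolation, here is a small sketch on a few made-up values. The helper name `fahrenheit_to_celsius` and the sample strings are assumptions for illustration; the operations are the same ones used in the cells above.

```python
import pandas as pd

def fahrenheit_to_celsius(temp_f):
    """Convert a temperature (or a Series of temperatures) from °F to °C."""
    return (5 / 9) * (temp_f - 32)

# Made-up raw values using a decimal comma, as in the TMP column
raw_tmp = pd.Series(['32,0', '212,0', '50,0'])
tmp_f = raw_tmp.str.replace(',', '.').astype(float)  # text -> float
tmp_c = fahrenheit_to_celsius(tmp_f)                 # °F -> °C

# Made-up date strings in day/month/year hour:minute format
raw_dates = pd.Series(['01/02/2024 13:00', '15/02/2024 00:30'])
dates = pd.to_datetime(raw_dates, format='%d/%m/%Y %H:%M')

print(tmp_c)
print(dates)
```

Passing an explicit `format` matters here: without it, '01/02/2024' could be read as January 2nd instead of February 1st.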

Now it is time to actually improve the plot! We will do so by adding a title, adding labels to our axes, adding a legend and also adding a light grid to improve readability of the actual data. You'll notice that matplotlib automatically picks sensible tick positions on both axes now that we have converted the data into the correct data types.

In [None]:
better_plot = plt.figure()
plt.plot(TU_date, TU_temp, color = 'tab:blue', label='TULSA INTERNATIONAL AIRPORT')
plt.plot(WI_date, WI_temp, color = 'tab:orange', label='WICHITA DWIGHT D. EISENHOWER NATIONAL AIRPORT')
plt.plot(DM_date, DM_temp, color = 'tab:purple', label='DES MOINES INTERNATIONAL AIRPORT')
plt.legend()
plt.xlabel('Date')
plt.ylabel('Average Temp (C)')
plt.title('Daily average temperature')
plt.grid(linewidth=0.2)
plt.close()

In [None]:
better_plot

### 4.3.3 Dots or lines?

What's a better visualization? Dots or lines? It all depends on the situation.

In theory, lines 'make up' data that does not exist, because they connect data points with values that were never measured. That is why, overall, a **scatter plot** is preferred. Especially when your dataset is large, the dots alone get the message across well enough.

When your dataset becomes a lot smaller, for example when zooming in on a small part, a scatterplot starts to become less meaningful. In such a case, it might be worth looking into a **line plot**.

Of course the line between them is a gray area. In the end, the most important thing is the **story you wish to tell**. Your graph has to support this as well as possible, while still portraying the truth.

In [None]:
fig = plt.figure(figsize=(15,8))

ax1 = fig.add_subplot(2,2,1)
ax1.scatter(TU_date,TU_temp, color = 'tab:blue', label='TULSA INTERNATIONAL AIRPORT')
ax1.scatter(WI_date,WI_temp, color = 'tab:orange', label='WICHITA DWIGHT D. EISENHOWER NATIONAL AIRPORT')
ax1.scatter(DM_date,DM_temp, color = 'tab:purple', label='DES MOINES INTERNATIONAL AIRPORT')
ax1.legend()
ax1.set_xlabel('Date')
ax1.set_ylabel('Average Temp (C)')
ax1.set_title('Daily average temperature')
ax1.grid(linewidth=0.2)

ax2 = fig.add_subplot(2,2,2)
ax2.scatter(TU_date[0:10],TU_temp[0:10], color = 'tab:blue', label='TULSA INTERNATIONAL AIRPORT')
ax2.legend()
ax2.set_xlabel('Date')
ax2.set_ylabel('Average Temp (C)')
ax2.set_title('Zoom Tulsa International Airport')
ax2.grid(linewidth=0.2)

ax3 = fig.add_subplot(2,2,3)
ax3.plot(TU_date,TU_temp, color = 'tab:blue', label='TULSA INTERNATIONAL AIRPORT')
ax3.plot(WI_date,WI_temp, color = 'tab:orange', label='WICHITA DWIGHT D. EISENHOWER NATIONAL AIRPORT')
ax3.plot(DM_date,DM_temp, color = 'tab:purple', label='DES MOINES INTERNATIONAL AIRPORT')
ax3.legend()
ax3.set_xlabel('Date')
ax3.set_ylabel('Average Temp (C)')
ax3.set_title('Daily average temperature')
ax3.grid(linewidth=0.2)

ax4 = fig.add_subplot(2,2,4)
ax4.plot(TU_date[0:10],TU_temp[0:10], color = 'tab:blue', label='TULSA INTERNATIONAL AIRPORT')
ax4.scatter(TU_date[0:10],TU_temp[0:10], color = 'tab:green', label='TULSA INTERNATIONAL AIRPORT')
ax4.legend()
ax4.set_xlabel('Date')
ax4.set_ylabel('Average Temp (C)')
ax4.set_title('Zoom Tulsa International Airport')
ax4.grid(linewidth=0.2)

## 4.4 Adding some extra fluff

Our plot is starting to look a lot better! Now let's add some more fluff. What about a fancier title and a line that displays the average temperature per station?

In [None]:
plt.figure()
plt.plot(TU_date, TU_temp, color = 'tab:blue', label='TULSA INTERNATIONAL AIRPORT')
plt.plot(WI_date, WI_temp, color = 'tab:orange', label='WICHITA DWIGHT D. EISENHOWER NATIONAL AIRPORT')
plt.plot(DM_date, DM_temp, color = 'tab:purple', label='DES MOINES INTERNATIONAL AIRPORT')
plt.legend()
plt.xlabel('Date')
plt.ylabel('Average Temp (C)')
plt.title('Daily average temperature', fontsize = 14, fontweight ='bold')
plt.grid(linewidth=0.2)
plt.axhline(TU_temp.mean(), color = 'tab:blue', linestyle = "dotted")  #horizontal line
plt.axhline(WI_temp.mean(), color = 'tab:orange', linestyle = "dotted")  #horizontal line
plt.axhline(DM_temp.mean(), color = 'tab:purple', linestyle = "dotted")  #horizontal line

Amazing! Let's say you want to put the focus on the data from Wichita Dwight D. Eisenhower National Airport. We can do so by playing with the opacity (alpha) of the other plots.

In [None]:
plt.figure()
plt.plot(TU_date, TU_temp, color = 'tab:blue', label='TULSA INTERNATIONAL AIRPORT', alpha = 0.3)
plt.plot(WI_date, WI_temp, color = 'tab:orange', label='WICHITA DWIGHT D. EISENHOWER NATIONAL AIRPORT')
plt.plot(DM_date, DM_temp, color = 'tab:purple', label='DES MOINES INTERNATIONAL AIRPORT', alpha = 0.3)
plt.legend()
plt.xlabel('Date')
plt.ylabel('Average Temp (C)')
plt.title('Daily average temperature', fontsize = 14, fontweight ='bold')
plt.grid(linewidth=0.2)
plt.axhline(TU_temp.mean(), color = 'tab:blue', linestyle = "dotted", alpha = 0.3)  #horizontal line
plt.axhline(WI_temp.mean(), color = 'tab:orange', linestyle = "dotted")  #horizontal line
plt.axhline(DM_temp.mean(), color = 'tab:purple', linestyle = "dotted", alpha = 0.3)  #horizontal line

The temperature always changes a bit over the course of a day. Since we have no information about this in our dataset, let's assume it changes about 20 degrees Fahrenheit in each direction on any given day. Can we visualise this in our plot?

In [None]:
TU_sd_min = (5/9)*(station_data[station_data['STATION'] == 72356013968]['TMP'] - 20 - 32)
TU_sd_max = (5/9)*(station_data[station_data['STATION'] == 72356013968]['TMP'] + 20 - 32)

WI_sd_min = (5/9)*(station_data[station_data['STATION'] == 72450003928]['TMP'] - 20 - 32)
WI_sd_max = (5/9)*(station_data[station_data['STATION'] == 72450003928]['TMP'] + 20 - 32)

DM_sd_min = (5/9)*(station_data[station_data['STATION'] == 72546014933]['TMP'] - 20 - 32)
DM_sd_max = (5/9)*(station_data[station_data['STATION'] == 72546014933]['TMP'] + 20 - 32)
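
A small sanity check on the arithmetic: a temperature *difference* converts without the offset of 32, so a band of 20 °F corresponds to (5/9) * 20, roughly 11.1 °C. The sketch below verifies on made-up values that subtracting 20 °F before converting gives the same result as subtracting that Celsius width after converting. All names in it are illustrative.

```python
import pandas as pd

# Width of the band in Celsius: only the scale factor applies, not the -32 offset
delta_c = (5 / 9) * 20

# Made-up Fahrenheit temperatures (NOT real measurements)
toy_temp_f = pd.Series([50.0, 68.0, 86.0])

# Route 1: shift in Fahrenheit first, then convert (as in the cell above)
band_min_direct = (5 / 9) * (toy_temp_f - 20 - 32)

# Route 2: convert first, then shift by the Celsius width
band_min_via_c = (5 / 9) * (toy_temp_f - 32) - delta_c

print(round(delta_c, 1))
print((band_min_direct - band_min_via_c).abs().max())
```

In other words, the bands could also be built directly from the Celsius series, for example `TU_temp - delta_c` and `TU_temp + delta_c`.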

In [None]:
plt.figure()
plt.plot(TU_date, TU_temp, color = 'tab:blue', label='TULSA INTERNATIONAL AIRPORT')
plt.plot(WI_date, WI_temp, color = 'tab:orange', label='WICHITA DWIGHT D. EISENHOWER NATIONAL AIRPORT')
plt.plot(DM_date, DM_temp, color = 'tab:purple', label='DES MOINES INTERNATIONAL AIRPORT')

plt.fill_between(TU_date, TU_sd_min, TU_sd_max, color = 'tab:blue', alpha = 0.2)
plt.fill_between(WI_date, WI_sd_min, WI_sd_max, color = 'tab:orange', alpha = 0.2)
plt.fill_between(DM_date, DM_sd_min, DM_sd_max, color = 'tab:purple', alpha = 0.2)

plt.legend()
plt.xlabel('Date')
plt.ylabel('Average Temp (C)')
plt.title('Daily average temperature', fontsize = 14, fontweight ='bold')
plt.grid(linewidth=0.5)
plt.axhline(TU_temp.mean(), color = 'tab:blue', linestyle = "dotted", alpha = 0.3)  #horizontal line
plt.axhline(WI_temp.mean(), color = 'tab:orange', linestyle = "dotted", alpha = 0.3)  #horizontal line
plt.axhline(DM_temp.mean(), color = 'tab:purple', linestyle = "dotted", alpha = 0.3)  #horizontal line

Now, let's do something about those ugly dates. We can create a set of month-end dates using pandas and use matplotlib's `plt.xticks` to label them with month names!

In [None]:
pd.date_range(start='2024-01-01', end='2024-10-31', freq='1ME')

In [None]:
plt.figure()
plt.plot(TU_date, TU_temp, color = 'tab:blue', label='TULSA INTERNATIONAL AIRPORT')
plt.plot(WI_date, WI_temp, color = 'tab:orange', label='WICHITA DWIGHT D. EISENHOWER NATIONAL AIRPORT')
plt.plot(DM_date, DM_temp, color = 'tab:purple', label='DES MOINES INTERNATIONAL AIRPORT')

plt.fill_between(TU_date, TU_sd_min, TU_sd_max, color = 'tab:blue', alpha = 0.2)
plt.fill_between(WI_date, WI_sd_min, WI_sd_max, color = 'tab:orange', alpha = 0.2)
plt.fill_between(DM_date, DM_sd_min, DM_sd_max, color = 'tab:purple', alpha = 0.2)

plt.legend()
plt.xlabel('Month (end)')
plt.ylabel('Average Temp (C)')
plt.title('Daily average temperature', fontsize = 14, fontweight ='bold')
plt.xlim([TU_date.min(), TU_date.max()])
plt.xticks(ticks = pd.to_datetime(pd.date_range(start='2024-01-01', end='2024-10-31', freq='1ME')), 
           labels = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October'], rotation=45, ha = 'right')
plt.grid(linewidth=0.2)
plt.axhline(TU_temp.mean(), color = 'tab:blue', linestyle = "dotted", alpha = 0.3)  #horizontal line
plt.axhline(WI_temp.mean(), color = 'tab:orange', linestyle = "dotted", alpha = 0.3)  #horizontal line
plt.axhline(DM_temp.mean(), color = 'tab:purple', linestyle = "dotted", alpha = 0.3)  #horizontal line
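
Typing the month names by hand invites typos. As a possible alternative, pandas can derive the labels from the tick dates themselves with `month_name()`. Note that this sketch swaps the `'1ME'` alias for the equivalent `pd.offsets.MonthEnd()` object, so that it also runs on pandas versions older than 2.2; the variable names are illustrative.

```python
import pandas as pd

# Month-end dates from January to October 2024
month_ends = pd.date_range(start='2024-01-01', end='2024-10-31',
                           freq=pd.offsets.MonthEnd())

# English month names derived from the dates themselves
month_labels = month_ends.month_name().tolist()

print(month_labels)
```

These two objects could then be passed to the plot as `plt.xticks(ticks=month_ends, labels=month_labels, rotation=45, ha='right')`.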

### 4.4.1 Additional annotations

You can go absolutely crazy with what you can show on a plot! For example, you can draw a vertical line, analogous to the horizontal lines we have for the average temperature. You can even create a shaded 'zone' to indicate a certain period!

In [None]:
plt.figure()
plt.plot(TU_date, TU_temp, color = 'tab:blue', label='TULSA INTERNATIONAL AIRPORT')
plt.plot(WI_date, WI_temp, color = 'tab:orange', label='WICHITA DWIGHT D. EISENHOWER NATIONAL AIRPORT')
plt.plot(DM_date, DM_temp, color = 'tab:purple', label='DES MOINES INTERNATIONAL AIRPORT')

plt.fill_between(TU_date, TU_sd_min, TU_sd_max, color = 'tab:blue', alpha = 0.2)
plt.fill_between(WI_date, WI_sd_min, WI_sd_max, color = 'tab:orange', alpha = 0.2)
plt.fill_between(DM_date, DM_sd_min, DM_sd_max, color = 'tab:purple', alpha = 0.2)

plt.legend()
plt.xlabel('Month (end)')
plt.ylabel('Average Temp (C)')
plt.title('Daily average temperature', fontsize = 14, fontweight ='bold')
plt.xlim([TU_date.min(), TU_date.max()])
plt.xticks(ticks = pd.to_datetime(pd.date_range(start='2024-01-01', end='2024-10-31', freq='1ME')), 
           labels = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October'], rotation=45)
plt.grid(linewidth=0.2)

plt.axhline(TU_temp.mean(), color = 'tab:blue', linestyle = "dotted", alpha = 0.3)  #horizontal line
plt.axhline(WI_temp.mean(), color = 'tab:orange', linestyle = "dotted", alpha = 0.3)  #horizontal line
plt.axhline(DM_temp.mean(), color = 'tab:purple', linestyle = "dotted", alpha = 0.3)  #horizontal line

plt.axvline(pd.to_datetime('2024-07-04'), zorder = 0,color = 'tab:red', linestyle = "dotted")  #vertical line
plt.axvspan(pd.to_datetime('2024-06-04'), pd.to_datetime('2024-07-04'), zorder = 0, color = 'tab:red', alpha = 0.1)  #vertical shading
plt.annotate("4th of July preparations", (pd.to_datetime('2024-06-25'), 0), rotation='vertical', ha = 'left', va = 'center')

### 4.4.2 Has anyone thought about the colourblind yet?

When creating a plot, some things might be obvious to you but not to others. One important thing to take into consideration is that a substantial part of the population has some form of colour blindness. We need to adapt to this!

First, let's save our figure using plt.savefig().

In [None]:
plt.figure()
plt.plot(TU_date, TU_temp, color = 'tab:blue', label='TULSA INTERNATIONAL AIRPORT')
plt.plot(WI_date, WI_temp, color = 'tab:orange', label='WICHITA DWIGHT D. EISENHOWER NATIONAL AIRPORT')
plt.plot(DM_date, DM_temp, color = 'tab:purple', label='DES MOINES INTERNATIONAL AIRPORT')

plt.fill_between(TU_date, TU_sd_min, TU_sd_max, color = 'tab:blue', alpha = 0.2)
plt.fill_between(WI_date, WI_sd_min, WI_sd_max, color = 'tab:orange', alpha = 0.2)
plt.fill_between(DM_date, DM_sd_min, DM_sd_max, color = 'tab:purple', alpha = 0.2)

plt.legend()
plt.xlabel('Month (end)')
plt.ylabel('Average Temp (C)')
plt.title('Daily average temperature', fontsize = 14, fontweight ='bold')
plt.xlim([TU_date.min(), TU_date.max()])
plt.xticks(ticks = pd.to_datetime(pd.date_range(start='2024-01-01', end='2024-10-31', freq='1ME')), 
           labels = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October'], rotation=45)
plt.grid(linewidth=0.2)

plt.axhline(TU_temp.mean(), color = 'tab:blue', linestyle = "dotted", alpha = 0.3)  #horizontal line
plt.axhline(WI_temp.mean(), color = 'tab:orange', linestyle = "dotted", alpha = 0.3)  #horizontal line
plt.axhline(DM_temp.mean(), color = 'tab:purple', linestyle = "dotted", alpha = 0.3)  #horizontal line

plt.axvline(pd.to_datetime('2024-07-04'), zorder = 0,color = 'tab:red', linestyle = "dotted")  #vertical line
plt.axvspan(pd.to_datetime('2024-06-04'), pd.to_datetime('2024-07-04'), zorder = 0, color = 'tab:red', alpha = 0.1)  #vertical shading
plt.annotate("4th of July preparations", (pd.to_datetime('2024-06-25'), 0), rotation='vertical', ha = 'left', va = 'center')

plt.savefig('station_fig.png', dpi = 500)
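
One common way to make a plot less dependent on colour alone is to also vary the line style, optionally combined with a palette designed for colour blindness. The sketch below is only a hint, drawn on made-up data: it uses matplotlib's built-in `'tableau-colorblind10'` style sheet and three different line styles. Whether this is enough for your plot is something you should still check yourself.

```python
import matplotlib.pyplot as plt

line_styles = ['solid', 'dashed', 'dashdot']

# Apply the colour-blind friendly palette only inside this block
with plt.style.context('tableau-colorblind10'):
    fig, ax = plt.subplots()
    for i, style in enumerate(line_styles):
        # Made-up toy series: each line differs in colour AND in line style
        ax.plot([0, 1, 2], [i, i + 1, i], linestyle=style, label=f'toy series {i + 1}')
    ax.legend()
plt.close(fig)

# Collect the styles that were actually applied to the lines
used_styles = [line.get_linestyle() for line in ax.lines]
```

Because the legend shows the line style as well, the series stay distinguishable even when the colours look alike.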

#### Question 4: Put your saved file through [COBLIS](https://www.color-blindness.com/coblis-color-blindness-simulator) (**Co**lour **Bli**ndness **S**imulator) and compare the different kinds of colour blindness. Does your plot still stand? Could you improve it?