# Python beginners course - Level 3 - Matplotlib
In the previous notebook, we saw how to represent data in DataFrames and covered some basic functionality for exploring that data. This notebook builds on that exploration by introducing visualisation techniques provided by the ```Matplotlib``` package, which is the backbone of visualisation in Python.

This tutorial consists of the following parts:
- What is Matplotlib
- Line plots
- Histograms
- Box plots
- Scatter plots

## 1. What is Matplotlib
When working with any kind of data, it is often hard to get a good understanding of the data just by looking at the numbers. Especially as the data grows, it is more convenient to create visualisations that help in understanding it. This is where `Matplotlib` comes in.

`Matplotlib` is the most popular visualisation library in Python. It is a Python 2D plotting library which produces publication quality figures in a variety of formats, and is even able to create interactive figures. Matplotlib tries to make easy things easy and hard things possible. You can generate plots, histograms, power spectra, bar charts, errorcharts, scatterplots, and many more with just a few lines of code. This notebook will explore the core functionality ```Matplotlib``` has to offer.

In [None]:
# Before we start we have to import the packages that we will be using
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## 2. Line plot

Line plots are among the most frequently used chart types.

A line plot is a type of chart which displays information as a series of data points connected by straight line segments. It is a basic type of chart common in many fields. It is similar to a scatter plot except that the measurement points are ordered (typically by their x-axis value) and joined with straight line segments. Line plots are often used to visualize a trend in data over intervals of time – a time series – thus the line is often drawn chronologically. 

A typical line plot is shown below.

![line plot](../assets/linechart.png)

Earlier we saw how to use the `NumPy` library to create arrays. We will now use this knowledge to make some nice-looking line plots.

To start, we will plot a Sine wave, which is defined by the function $y = \sin(x)$, for values of $x$ between $0$ and $6$.

In [None]:
# Create an array of 5 evenly spaced values between 0 and 6
x = np.linspace(0, 6, 5)

# For each of these values calculate the Sine
y = np.sin(x)

# Plot the Sine wave using matplotlib
plt.plot(x, y, marker='*')
plt.show()

See how easy it is? With just one call to the ```plt.plot``` function we created a decent-looking chart!

## Exercise 1: Create and plot a Sine wave with 100 markers.
The line plot above consists of 5 markers (stars) which are connected by straight lines. This is because we only have 5 data points. If we want to visualize a smoother Sine wave, we'll need more data points. To achieve this, you will now create the same Sine wave, only with more markers.

Replace the ```___``` in the following cells to complete the exercise.

In [None]:
# Create a Sine wave with 100 markers
x_new = np.linspace(0, 6, ___)
y_new = np.sin(___)

# Plot the new Sine wave
plt.plot(x, y, marker='*')
plt.plot(___, ___)
plt.show()

We can see that the line plot with 100 markers looks much nicer, and actually looks like a Sine wave now.

In `Matplotlib` there are many options to customize your plots. In the figure above, we already see two of those (do you know which ones?).

## Exercise 2: Create and plot other functions
Using the skills we learned while plotting the Sine wave, plot the following functions

1. $y = x$
2. $y = 2x$
3. $y = \sqrt x$
4. $y = x^2$ 

in the same figure.

Replace the ```___``` in the following cell to complete the exercise.

In [None]:
# Create the horizontal axis values using NumPy
x = np.linspace(0, 5, 20)

# Create the vertical axis values of the 4 plots
y1 = ___
y2 = ___
y3 = ___
y4 = ___

# Add one line per function to the same figure
plt.plot(x, ___)
plt.plot(x, ___)
plt.plot(x, ___)
plt.plot(x, ___)

# Display the figure
plt.show()

As mentioned, `Matplotlib` allows for extensive customization of your plots. Some settings that can be customized are:
- Line color
- Line style (dashed, dotted, ...)
- Line width
- Marker style (star, cross, ...)
- Marker color
- Marker size
- Axis labels
- Title
- Legend
- ...

An example of a plot which contains some of these settings is shown below.

In [None]:
# Create the horizontal axis values using NumPy
x = np.linspace(0, 2*np.pi, 10)

# Calculate the Sine of all x-values
y = np.sin(x)

# Plot the values of y against x using Matplotlib
plt.plot(x,
         y, 
         color='blue',       # line color: blue
         linestyle='dotted', # line style: dotted
         linewidth=5,        # line width: 5
         marker='o',         # marker: circles
         markersize=12,      # marker size: 12
         markerfacecolor='r' # marker color: red   
        )

# Show the figure
plt.show()

## Exercise 3: Play around with the figure settings above such that:
- The line color is green
- There are no markers
- The line style is dashed
- The line width is thinner

A good figure is one where the reader is presented with sufficient information such that it is clear what is being shown. A figure without any labels or a title is meaningless for the unknowing reader, as it is often impossible to decipher what the values represent. 

Therefore, we will show how to add axis labels, a title and a legend to a `Matplotlib` figure. We will use the previous figure as an illustration.

In [None]:
# Create the horizontal axis values using NumPy
x = np.linspace(0, 2*np.pi, 10)

# Calculate the Sine of all x-values
y = np.sin(x)

# Plot the values of y against x using Matplotlib
plt.plot(x,
         y, 
         color='blue',       # line color: blue
         linestyle='dotted', # line style: dotted
         linewidth=5,        # line width: 5
         marker='o',         # marker: circles
         markersize=12,      # marker size: 12
         markerfacecolor='r' # marker color: red   
        )

# Add the axis labels
plt.xlabel('x values')        # horizontal axis label: 'x values'
plt.ylabel('y values')        # vertical axis label: 'y values'

# Add a title
plt.title('Illustration of Matplotlib settings')        # plot title: 'Illustration of Matplotlib settings'

# Add a legend
plt.legend(['Sine wave'])     # add a legend to the figure

# Show the figure
plt.show()

## 3. Histograms
A histogram is a plot that lets you discover the underlying frequency distribution (shape) of a set of continuous data. This allows the inspection of the data for its underlying distribution (e.g., normal distribution), outliers, skewness, etc.
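
To make the idea of a frequency distribution concrete, here is a minimal sketch of the counting behind a histogram. The values are made up for illustration and are not taken from any dataset in this course; `np.histogram` is used only to expose the bin counts that a histogram plot draws as bars.

```python
import numpy as np

# Made-up example values (an assumption for illustration, not real data)
values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 9])

# Split the range of the data (1 to 9) into 4 equally wide bins
# and count how many values fall into each bin
counts, bin_edges = np.histogram(values, bins=4)

print(bin_edges)  # the 5 edges that delimit the 4 bins
print(counts)     # the number of values that fall into each bin
```

Each count corresponds to the height of one bar. The isolated value 9 ends up alone in the last bin, which is how outliers tend to show up in a histogram.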

Let's explore histograms using the Boston dataset from the previous notebook.
Before we dive in, we'll read the data again and give a quick recap of the dataset.

In [None]:
# Let's read the data in again
data = pd.read_csv('../data/boston_dataset.csv')

data.head(3)

Remember that the columns have the following meaning (**you don't have to remember this**):

| Column        | Represents                                      |
| :------------- |------------------------------------------------ | 
| CRIM | per capita crime rate by town |
| ZN | proportion of residential land zoned for lots over 25,000 sq.ft.|
| INDUS | proportion of non-retail business acres per town.|
| CHAS | Charles River dummy variable (1 if tract bounds river; 0 otherwise)|
| NOX | nitric oxides concentration (parts per 10 million)|
| RM | average number of rooms per dwelling|
| AGE | proportion of owner-occupied units built prior to 1940|
| DIS | weighted distances to five Boston employment centres|
| RAD | index of accessibility to radial highways|
| TAX | full-value property-tax rate per 10,000 dollar|
| PTRATIO | pupil-teacher ratio by town|
| B | 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town|
| LSTAT | percentage lower status of the population|
| MEDV | Median value of owner-occupied homes (1 = 1000 dollar)|

An example of a histogram of the `LSTAT` feature can be shown using the following code:

In [None]:
# Extract the LSTAT column as a Pandas Series
LSTAT_col = data['LSTAT']

# Plot the histogram using the Series, giving it a label 'LSTAT' and defining how many bars (bins) we want to see
plt.hist(LSTAT_col, label='LSTAT', bins=70)
plt.xlabel('Value')        # Set the horizontal axis label to 'Value'
plt.ylabel('Frequency')    # Set the vertical axis label to 'Frequency'
plt.legend()               # Make the legend visible
plt.show()                 # Show the histogram

## Exercise 4: Create histograms for the RM and MEDV columns of our dataset.
Replace the ```___``` in the following cell to complete the exercise.

In [None]:
# Histogram for the feature RM
RM_col = ___

plt.hist(RM_col, label=___, bins=70)
plt.xlabel('Value')        # Set the horizontal axis label to 'Value'
plt.ylabel('Frequency')    # Set the vertical axis label to 'Frequency'
plt.legend()               # Make the legend visible
plt.show()                 # Show the histogram

# Histogram for the target MEDV
MEDV_col = ___

plt.hist(MEDV_col, label=___, bins=70)
plt.xlabel('Value')        # Set the horizontal axis label to 'Value'
plt.ylabel('Frequency')    # Set the vertical axis label to 'Frequency'
plt.legend()               # Make the legend visible
plt.show()                 # Show the histogram

As we see in the figure of the exercise above, plotting multiple figures using `Matplotlib` will automatically put them underneath each other. However, sometimes we might want to arrange figures side-by-side. This is useful if we have many figures that we want to put into a report. Luckily, Matplotlib makes this possible. 

Consider a scenario where we want to combine multiple figures in a grid of 2 rows of 3 figures. To achieve this using ```Matplotlib``` we need to create a so-called ```subplot```, which is basically a grid where each field can be filled with a figure of choice. The grid can have any size we want.

To illustrate how to create such a subplot grid, consider the following code:

In [None]:
# Step 1: Set the figure size to 8 by 8 inches and create a grid of 2 rows and 3 columns
plt.subplots(figsize=[8,8], nrows=2, ncols=3) 

# Step 2: Set the position of this plot in the grid. 
# Below we indicate that our figure should be placed in position 1 (counting left to right and top to bottom)
plt.subplot(2, 3, 1)

# Step 3: Plot a histogram that we want to show
plt.hist(data['LSTAT'], label='LSTAT', bins=70)         

# Step 4: Show the figure
plt.show()

Above we created a figure of 8 by 8 inches that contains a grid of 2 rows and 3 columns. Then, we put a histogram in position 1 (counting from left to right and top to bottom).

We can add more plots to the free positions to fill up our grid. To get a more intuitive idea of what the grid generated by ```subplot``` looks like, it is visualized in the figure below.
![](../assets/subplots.png)

As can be seen, the grid positions are numbered as expected, starting with 1 in the top-left corner, and increasing horizontally.
With the code above, you _only_ fill the figure at location (2,3,1), so you still need to fill the other locations using `plt.subplot(2,3,2)`, ..., `plt.subplot(2,3,6)`. Let's now try to do this ourselves.
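
Before the exercise, the position numbering can be checked with a minimal sketch. It is not part of the original notebook and assumes one thing the code above does not use: the `fig` and `axes` values returned by `plt.subplots`. Each cell of a 2 by 3 grid is labelled with the position number that `plt.subplot(2, 3, position)` would select.

```python
import matplotlib.pyplot as plt

# Create a figure with a grid of 2 rows and 3 columns.
# `axes` is a 2-by-3 NumPy array holding one plotting area per grid cell.
fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(8, 5))

# axes.flat walks through the grid row by row, which matches the
# position numbers used by plt.subplot(2, 3, position)
for position, ax in enumerate(axes.flat, start=1):
    ax.text(0.5, 0.5, str(position), ha='center', va='center', fontsize=20)
    ax.set_xticks([])  # hide the tick marks to keep the sketch uncluttered
    ax.set_yticks([])

plt.show()
```

The numbers run from 1 to 3 along the top row and from 4 to 6 along the bottom row.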

## Exercise 5: Create a subplot figure of the histograms of `LSTAT`, `RM` and `MEDV`. The subplot should have 3 rows and 1 column.
Replace the ```___``` in the following cell to complete the exercise.

In [None]:
plt.subplots(figsize=[8,8], nrows=___, ncols=___) # Set the figure size

plt.subplot(___) # Fill the 1st spot in our subplot of 3 rows and 1 column
plt.hist(___, label=___, bins=70)
plt.legend()

plt.subplot(___) # Fill the 2nd spot in our subplot of 3 rows and 1 column
plt.hist(___, label=___, bins=70)
plt.legend()

plt.subplot(___) # Fill the 3rd spot in our subplot of 3 rows and 1 column
plt.hist(___, label=___, bins=70)
plt.legend()

plt.show()

## Exercise 6: Adjust the code above such that:
1. The top histogram is blue
2. The middle histogram is red
3. The bottom histogram is green

**Hint:** blue: 'b', red: 'r', green: 'g'

In [None]:
# Copy the code from exercise 5 here and edit it

## Exercise 7: Adjust the code above such that:
1. The histograms are now side by side instead of above each other.
2. Add a title to each of the histograms.
3. Add labels to the axes.

In [None]:
# Copy the code from exercise 6 here and edit it

## (Optional) Challenge 1: Make sure the histograms:
- share the y-axis if the figures are side-by-side in the subplot
- share the x-axis if the figures are above each other in the subplot

**Hint:** Google and StackOverflow are your friends.

In [None]:
# Copy the code from exercise 7 here and edit it

## 4. Boxplots

Do you remember the `.describe()` function from earlier notebooks? It was used to show us statistics of DataFrame columns such as the mean, standard deviation and median. Let's recall what the output of ```describe()``` looks like:

In [None]:
# Calculate the basic statistics of the LSTAT column
data[['LSTAT']].describe()

When looking at these statistics it is often hard to get a feeling for the actual distribution of a dataset. Therefore, there also exists a graphical representation of the information calculated with `.describe()`, which is called a **boxplot**. 

A boxplot is a standardized way of displaying the distribution of data based on a five number summary:

1. maximum (excluding outliers)
2. upper quartile (Q3)
3. median
4. lower quartile (Q1)
5. minimum (excluding outliers)

It can tell you about your outliers and what their values are. It can also tell you if your data is symmetrical, how tightly your data is grouped, and if and how your data is skewed. An illustrative example of a boxplot is shown in the figure below.

![](../assets/box-plot-explained.gif)
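
As a rough illustration of the five number summary, the sketch below computes it by hand with `NumPy`. The values are made up, and the outlier rule is an assumption: a point counts as an outlier when it lies more than 1.5 times the interquartile range away from the box, which is the convention `plt.boxplot` uses by default (`whis=1.5`).

```python
import numpy as np

# Made-up example values (an assumption for illustration, not real data)
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

# Lower quartile, median and upper quartile
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1  # interquartile range: the height of the box

# Assumed outlier rule: more than 1.5 * IQR away from the box
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
is_inside = (values >= lower_limit) & (values <= upper_limit)
inside = values[is_inside]
outliers = values[~is_inside]

print('maximum (excluding outliers):', inside.max())
print('upper quartile (Q3):', q3)
print('median:', median)
print('lower quartile (Q1):', q1)
print('minimum (excluding outliers):', inside.min())
print('outliers:', outliers)
```

Here the value 30 is flagged as an outlier, so the upper whisker would end at 9 rather than at the largest value in the data.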

Using ```Matplotlib``` again, it is relatively easy to make a boxplot of your data. The statistics of the `LSTAT` column are the ones shown in the `.describe()` output above.

The corresponding boxplot for this column can be made using the following code:

In [None]:
# Extract the LSTAT column as a Pandas Series
LSTAT_col = data['LSTAT']

# Create the boxplot using our Series and adding a label
# (from Matplotlib 3.9 onwards the `labels` parameter is called `tick_labels`)
plt.boxplot(LSTAT_col, labels=['LSTAT'])

# Show the boxplot
plt.show()

As you can see, the boxplot visualizes the minimum, lower quartile (25%), median (50%), upper quartile (75%) and maximum that were calculated using `.describe()`. Keep in mind that the whiskers exclude outliers, so their ends can differ from the `min` and `max` values reported by `.describe()`.

Looking at the boxplot, it is easy to see that the data is not symmetrically distributed, which is less obvious when looking at the results of `.describe()`.

## 5. Scatter plots
A scatter plot is a type of plot that displays the values of (typically) two variables for a set of data. The data is displayed as a collection of points, where the value of one variable determines the position on the horizontal axis and the value of the other variable determines the position on the vertical axis.

A scatter plot can be used either when you want to investigate if there exists a relation between two variables, or to see that there is no relation between the variables.

Consider a scenario where we suspect that there is a relationship between two variables in our dataset. The column `RM` represents the average number of rooms per dwelling. It is to be expected that this variable has some positive relationship with the price of the dwelling, `MEDV`. To check this assumption, we will make a scatter plot of these two variables.

In [None]:
# Extract the data as Pandas Series
RM_col = data['RM']
MEDV_col = data['MEDV']

# Create a scatter plot using the Series
plt.scatter(x=RM_col, y=MEDV_col)

# Set the axes labels
plt.xlabel('RM')
plt.ylabel('MEDV')

# Set the title
plt.title('Scatter plot of RM and MEDV')

# Show the scatter plot
plt.show()

In the scatter plot above, every point represents a row in our original dataset. Its location on the horizontal axis is determined by the value in the `RM` column, and its vertical position by the value in the `MEDV` column.

Looking at the scatter plot, there seems to be a clear pattern between the two columns. As the value of `RM` increases, the value of `MEDV` generally also increases, which suggests a positive correlation between the two. This result is as we anticipated, since we expected a positive relation between house price and average number of rooms.
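
A visual impression like this can also be expressed as a number. The sketch below is a minimal illustration on synthetic data, not on the Boston dataset: the variable names `rooms` and `price` and the way the data is generated are assumptions made for the example. It computes the Pearson correlation coefficient with `np.corrcoef`, where values close to 1 indicate a strong positive linear relationship.

```python
import numpy as np

# Synthetic data (an assumption for illustration, not the Boston dataset):
# the price grows with the number of rooms, plus some random noise
rng = np.random.default_rng(seed=0)
rooms = rng.uniform(4, 9, size=200)
price = 5 * rooms + rng.normal(0, 3, size=200)

# np.corrcoef returns a 2-by-2 correlation matrix;
# the off-diagonal entry is the correlation between the two variables
r = np.corrcoef(rooms, price)[0, 1]

print('Correlation between rooms and price:', round(r, 2))
```

For the columns used in this notebook, the same quantity could be obtained with `data['RM'].corr(data['MEDV'])`.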

In [None]:
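# Extract the data as Pandas Series
RM_col = data['RM']
LSTAT_col = data['LSTAT']
MEDV_col = data['MEDV']

# Scatter plot of RM against LSTAT, coloring each point by its MEDV value
plt.scatter(RM_col, LSTAT_col, c=MEDV_col)
plt.xlabel('RM')
plt.ylabel('LSTAT')
plt.colorbar(label='MEDV')  # Show which colors correspond to which MEDV values
plt.show()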

## Exercise 8
To conclude this notebook, let's combine all the plots we learned about into one grid.

Below, create a subplot and place the following figures side-by-side:

- Position 1: a green histogram of column ```RM```.
- Position 2: a boxplot of column ```LSTAT```.
- Position 3: a red scatterplot of columns ```RM``` and ```LSTAT```.

We give you an empty cell below, but feel free to copy snippets from the previous exercises.