<img src="images/Picture0.png" width=200x />

# Notebook 11 - Visualization with Matplotlib

In this notebook we will learn how to visualize data using the Matplotlib module. 

You’ll learn how to present your data visually using the following graphs:

- Box plots
- Histograms
- Pie charts
- Bar charts

You'll also learn how to create figures with multiple plots.

**Note:** This section focuses on representing data and keeps stylistic settings to a minimum. [Here is a link](https://matplotlib.org/stable/api/pyplot_summary.html#basic) to the official documentation for the `matplotlib.pyplot` routines used here, so you can explore the options that you won’t see in this notebook.

### Credits



In [None]:
import numpy as np
import matplotlib.pyplot as plt

## Figures
What if we wanted multiple plots? This is where figures help us. A figure is the top-level container for a collection of plots. You can think of the figure object as a canvas that holds all the subplots and other plot elements. A figure can contain one or more subplots, called axes, arranged in rows and columns. A figure created with `plt.subplots()` always has at least one axes.

First let's look at a simple figure.

In [None]:
# Create Figure and Subplots
fig, ax1 = plt.subplots() 

# Plot
ax1.scatter([1,2,3,4,5], [1,2,3,4,10], color= 'purple', marker= '*') #scatter plot 

# Title, x and y labels, x and y limits
ax1.set_title('Scatterplot Purple Stars')
ax1.set_xlabel('x')  # x label
ax1.set_ylabel('y') # y label
ax1.set_xlim(0, 6)   # x axis limits
ax1.set_ylim(0, 12)  # y axis limits

plt.show()

We use `plt.subplots()` to create a figure that contains a plot. This creates and returns two objects:
- the figure 
- the axes (subplots) inside the figure
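
To see what these two returned objects look like, here is a minimal sketch. The 2-by-2 grid and the names `fig_grid` and `axes_grid` are illustrative assumptions, not part of the example above.

```python
import matplotlib.pyplot as plt

# A single plot: one Figure object and one Axes object
fig, ax = plt.subplots()
print(type(fig).__name__)   # Figure

# A grid of plots: the second return value becomes a NumPy array of axes
fig_grid, axes_grid = plt.subplots(2, 2)
print(axes_grid.shape)      # (2, 2)

# fig.axes lists every axes that belongs to a figure
print(len(fig.axes), len(fig_grid.axes))   # 1 4

plt.close('all')  # nothing needs to be displayed here, so release the figures
```

With a grid, you pick an individual subplot by indexing the array, for example `axes_grid[0, 1]` for the first row and second column.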

We called `ax1.scatter()` to draw the points. Since `plt.subplots()` creates a single axes by default, the points were drawn on that axes.

Note that, to add a title and labels, we now use methods that are applied to the specific axes:
- `.set_title()`
- `.set_ylabel()`
- `.set_xlabel()`

Suppose we want to draw two sets of points (purple stars and orange circles) in two separate plots side by side instead of in the same plot. How would you do that?

You can do that by creating two separate subplots (axes) using `plt.subplots(1, 2)`.

### Exercise
- Create a figure with two plots
- In the first plot, recreate the plot from above.
- In the second plot, draw a line with orange circles.
- Remember to label everything! 



In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2)
# Add code here!

Mmmmm... That's a very small graph... We can use `figsize=(10,4)` as one of the `plt.subplots()` arguments to change the size. 

### Exercise
- Add `figsize=(10,4)` to the arguments of `plt.subplots()` to change the figure size. 
- Play around with it! 

### Exercise

What if you wanted 2 rows and 1 column? What would you have to change? 

## Saving a Figure

You can save a figure to a file, for example a .png file, with `plt.savefig()`. By default the file is saved in your current working directory (for a notebook, this is usually the folder that contains the notebook). The only argument you need to pass to this function is the file name, as in this example:



In [None]:
# Create Figure and Subplots
fig, ax1 = plt.subplots() 

# Plot
ax1.scatter([1,2,3,4,5], [1,2,3,4,10], color= 'purple', marker= '*') #scatter plot 

# Title, x and y labels, x and y limits
ax1.set_title('Scatterplot Purple Stars')
ax1.set_xlabel('x')  # x label
ax1.set_ylabel('y') # y label
ax1.set_xlim(0, 6)   # x axis limits
ax1.set_ylim(0, 12)  # y axis limits

plt.savefig("myfig.png")

plt.show()
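
Saving also works as a method on the figure object, and it accepts optional arguments. The sketch below is one possible variation: the temporary folder, the file name `myfig_highres.png`, and the chosen `dpi` value are assumptions made for illustration.

```python
import os
import tempfile
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.scatter([1, 2, 3, 4, 5], [1, 2, 3, 4, 10], color='purple', marker='*')

# Hypothetical output location: a temporary folder, so no existing file is overwritten
out_path = os.path.join(tempfile.mkdtemp(), 'myfig_highres.png')

# dpi sets the resolution; bbox_inches='tight' trims extra white space around the plot
fig.savefig(out_path, dpi=200, bbox_inches='tight')

print(os.path.exists(out_path))   # True
plt.close(fig)
```

The file extension in the name determines the output format, so a name ending in `.pdf` would produce a PDF instead of a PNG.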

## Visualizing Data

Let's generate some arrays:

In [None]:
x = np.random.randn(1000)
y = np.random.randn(100)
z = np.random.randn(10)

### Boxplot

In [None]:
fig, ax = plt.subplots()
ax.boxplot((x, y, z), vert=False, showmeans=True, meanline=True,
           labels=('x', 'y', 'z'), patch_artist=True,
           medianprops={'linewidth': 2, 'color': 'purple'},
           meanprops={'linewidth': 2, 'color': 'red'})
plt.show()

The parameters of `.boxplot()` define the following:

- **x:** is your data.
- **vert:** sets the plot orientation to horizontal when False. The default orientation is vertical.
- **showmeans**: shows the mean of your data when True.
- **meanline:** represents the mean as a line when True. The default representation is a point.
- **labels:** the labels of your data.
- **patch_artist:** fills the boxes with color when True. By default the boxes are drawn as outlines only.
- **medianprops:** denotes the properties of the line representing the median.
- **meanprops:** indicates the properties of the line or dot representing the mean.

There are other parameters, but their analysis is beyond the scope of this tutorial.

You can see three box plots. Each of them corresponds to a single dataset (x, y, or z) and shows the following:

- The **mean** is the red dashed line.
- The **median** is the purple line.
- The **first quartile** is the left edge of the blue rectangle.
- The **third quartile** is the right edge of the blue rectangle.
- The **interquartile range** is the length of the blue rectangle.
- The **range** contains everything from left to right, including the outliers.
- The **outliers** are the circles beyond the ends of the whiskers, to the left and right.

A box plot can show so much information in a single figure!
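
The quantities listed above can also be computed directly with NumPy. The following is a rough sketch: the seeded `sample` array stands in for `x`, and the 1.5 × IQR rule is Matplotlib's default for deciding which points are drawn as outliers.

```python
import numpy as np

rng = np.random.default_rng(0)       # fixed seed, chosen only for reproducibility
sample = rng.standard_normal(1000)   # stand-in for the x array above

# Quartiles and interquartile range: the edges and the length of the box
q1, median, q3 = np.percentile(sample, [25, 50, 75])
iqr = q3 - q1

# Points farther than 1.5 * IQR from the box are treated as outliers
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = sample[(sample < lower_limit) | (sample > upper_limit)]

print(f"Q1={q1:.2f}, median={median:.2f}, Q3={q3:.2f}, IQR={iqr:.2f}")
print(f"mean={sample.mean():.2f}, number of outliers={outliers.size}")
```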

### Histograms

Histograms are particularly useful when there are a large number of unique values in a dataset. The histogram divides the values from a sorted dataset into intervals, also called **bins**. Often, all bins are of equal width, though this doesn’t have to be the case. The values of the lower and upper bounds of a bin are called the **bin edges**.


The **frequency** is a single value that corresponds to each bin. It’s the number of elements of the dataset with the values between the edges of the bin. By convention, all bins but the rightmost one are half-open. They include the values equal to the lower bounds, but exclude the values equal to the upper bounds. The rightmost bin is closed because it includes both bounds. If you divide a dataset with the bin edges 0, 5, 10, and 15, then there are three bins:

- The first and leftmost bin contains the values greater than or equal to 0 and less than 5.
- The second bin contains the values greater than or equal to 5 and less than 10.
- The third and rightmost bin contains the values greater than or equal to 10 and less than or equal to 15.

The function `np.histogram()` is a convenient way to get data for histograms:

In [None]:
hist, bin_edges = np.histogram(x, bins=10)

In [None]:
hist

In [None]:
bin_edges

It takes the array with your data and the number (or edges) of bins and returns two NumPy arrays:

- `hist` contains the frequency, or the number of items, corresponding to each bin
- `bin_edges` contains the edges, or bounds, of the bins
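
To check the half-open convention on a tiny dataset, here is a small sketch that passes the bin edges 0, 5, 10, and 15 explicitly. The `values` array is made up for illustration.

```python
import numpy as np

# Made-up data that sits exactly on some of the bin edges
values = np.array([0, 4.9, 5, 10, 14, 15])

counts, edges = np.histogram(values, bins=[0, 5, 10, 15])
print(counts)   # [2 1 3]
```

The value 5 is counted in the second bin rather than the first, while 15 is still counted in the last bin because the rightmost bin is closed.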

Let's make a histogram using `.hist()`

In [None]:
fig, ax = plt.subplots()
ax.hist(x, bin_edges, cumulative=False)
ax.set_title('Histogram')
ax.set_xlabel('x')
ax.set_ylabel('Frequency')
plt.show()

### Pie Charts
Pie charts represent data with a small number of labels and given relative frequencies. They work well even with the labels that can’t be ordered (like nominal data).

Let’s define data associated with three labels:

In [None]:
x, y, z = 128, 256, 1024

Let's create a pie chart with `.pie()`

In [None]:
fig, ax = plt.subplots()
ax.pie((x, y, z), labels=('x', 'y', 'z'), autopct='%1.1f%%')
ax.set_title('Pie Chart')
plt.show()

**Note:** `autopct` defines the format of the relative frequencies shown on the figure.
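
The percentages on the chart can be reproduced by hand. This sketch applies the same format string, `'%1.1f%%'`, to each value's share of the total. The names `values`, `percentages`, and `formatted` are illustrative.

```python
import numpy as np

values = np.array([128, 256, 1024])

# Each slice is the value's share of the total, expressed as a percentage
percentages = 100 * values / values.sum()

# '%1.1f%%' means one decimal place followed by a literal percent sign
formatted = ['%1.1f%%' % p for p in percentages]
print(formatted)   # ['9.1%', '18.2%', '72.7%']
```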


### Bar Charts

Bar charts also illustrate data that correspond to given labels or discrete numeric values. They can show the pairs of data from two datasets. Items of one set are the labels, while the corresponding items of the other are their frequencies. Optionally, they can show the errors related to the frequencies, as well.

The bar chart shows parallel rectangles called bars. Each bar corresponds to a single label and has a height proportional to the frequency or relative frequency of its label. 

In [None]:
x = np.arange(21)
y = np.random.randint(21, size=21)

We can create a bar chart using the method `.bar()`:

In [None]:
fig, ax = plt.subplots()
ax.bar(x, y)
ax.set_xlabel('x')
ax.set_ylabel('y')
plt.show()
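
The text above notes that bar charts can optionally show errors. One way to do that is the `yerr` argument of `.bar()`, sketched here with made-up heights and error sizes (`heights` and `errors` are not data from this notebook).

```python
import numpy as np
import matplotlib.pyplot as plt

positions = np.arange(5)
heights = np.array([12, 7, 15, 9, 11])          # made-up frequencies
errors = np.array([1.5, 0.8, 2.0, 1.0, 1.2])    # made-up error sizes

fig, ax = plt.subplots()
# yerr draws a vertical error bar on each bar; capsize adds small caps at the ends
bars = ax.bar(positions, heights, yerr=errors, capsize=4)
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('Bar chart with error bars')
plt.show()
```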

What if we have categorical data? We can visualize it using bar charts as well! 

In [None]:
animals = ['Lions', 'Elephants', 'Birds', 'Sloths', 'Monkeys']
amounts = [23,17,35,29,12]

fig, ax= plt.subplots()
ax.bar(animals, amounts)
ax.set_title('Amount of Animals in the Zoo')
plt.show()

What if you wanted different colors for each bar? 


### Exercise
- Create an array with the colors you want each animal to have. Note that it should go in the same order as your animal array! 
- Where do you include this array in your bar chart script? 

<hr>
<font face="verdana" style="font-size:30px" color="blue">---------- Optional Advanced Material ----------</font>

We can plot multiple bar charts side by side by adjusting the width and the positions of the bars. The `data` variable is a nested list containing three series of four values: the number of candies of each type sold each year. The following script draws three bar charts of four bars each. Each bar is 0.25 units wide, and each bar chart is shifted 0.25 units to the right of the previous one.

Note the new function `plt.xticks()`, which sets the tick positions and labels on the x-axis. Here it receives four tick positions (`X+0.25`, the center of each group of bars), followed by the label for each tick, in order.

In [None]:
data = [[30, 25, 50, 20],
        [40, 23, 51, 17],
        [35, 22, 45, 19]]
X = np.arange(4)
plt.bar(X + 0.00, data[0], width = 0.25, label= 'Lollipops' , color="pink")
plt.bar(X + 0.25, data[1], width = 0.25, label= 'Chocolate', color="brown")
plt.bar(X + 0.50, data[2], width = 0.25, label= 'Gummies', color='yellow')
plt.title('Candy sold from 2017-2020')
plt.xticks(X+0.25, ('2017','2018', '2019', '2020')) 
plt.legend()
plt.show()
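
The same chart can also be written with the axes methods from the Figures section and a loop, which makes the offset arithmetic explicit. This is only one possible sketch. The lists `names`, `colors`, and `years` are introduced here for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

data = [[30, 25, 50, 20],
        [40, 23, 51, 17],
        [35, 22, 45, 19]]
names = ['Lollipops', 'Chocolate', 'Gummies']
colors = ['pink', 'brown', 'yellow']
years = ['2017', '2018', '2019', '2020']

width = 0.25
X = np.arange(len(years))

fig, ax = plt.subplots()
for i, (series, name, color) in enumerate(zip(data, names, colors)):
    # Series i is shifted i * width to the right of the first series
    ax.bar(X + i * width, series, width=width, label=name, color=color)

# Place each tick under the center of its group of bars
ax.set_xticks(X + width * (len(data) - 1) / 2)
ax.set_xticklabels(years)
ax.set_title('Candy sold from 2017-2020')
ax.legend()
plt.show()
```

With three series, the tick offset works out to 0.25, the same value used in the script above.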

### Exercise

Using NumPy, create an array that has 3 rows and 5 columns with random numbers in the interval [30,60]. Each row represents a color and each column represents a flower.

- Rows:
  - Row 1: Orange
  - Row 2: Purple
  - Row 3: Green
- Columns:
  - Column 1: Orchids
  - Column 2: Tulips
  - Column 3: Daisies
  - Column 4: Roses
  - Column 5: Poppies

1. Make a pie chart that represents the percentage of each color of flowers. Each slice should have the actual color of the flowers, i.e. the slice of percentage of orange flowers should be orange. Make sure to include title and labels.

2. Make a bar chart to visualize the amount of flowers of each color. You should have 5 ticks on your x-axis, each with the name of a flower. Each tick should have 3 bars, each bar representing the amount of flowers of a specific color. The color of each bar should match the color of the flower. Make sure to include a title, labels, and correct limits.