
## Matplotlib

Matplotlib is a Python graphical library that can be used to produce a variety of different graph types. 

pandas is tightly integrated with matplotlib: it includes functions that automatically call matplotlib functions to produce graphs.

Although we are using Matplotlib in this lesson, other graphical libraries that work with pandas data are available from within Python, such as seaborn and plotnine (a Python implementation of the ideas behind R's ggplot2).

## Importing matplotlib

The matplotlib library can be imported just like any other library. Like pandas, it is almost invariably given an alias, in this case 'plt' (strictly, the alias refers to matplotlib's 'pyplot' module). Almost any example code using matplotlib will use 'plt' as the alias.

In addition to importing the library, in a Jupyter notebook environment we need to tell Jupyter that when we produce a graph we want it to be displayed in a cell in the notebook just like any other results. To do this we use the '%matplotlib inline' directive.  

If you forget to do this, your graphs may not appear.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline


## Numpy 

Numpy is another Python library, used for multi-dimensional array processing. In our case we just want its random number generation functions, which we will use to create some fake data to demonstrate some of the graphing functions of matplotlib.

numpy is usually given the alias of 'np', a convention we will follow.

## Example 1 - bar charts

~~~
np.random.rand(20)
~~~

will generate 20 random numbers between 0 and 1.
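
As a quick check of what this call returns, the following sketch generates the numbers and confirms how many there are and the range they fall in. The variable name `values` is ours, chosen only for illustration.

```python
import numpy as np

np.random.seed(12345)           # fixed seed so the same numbers come back on each run
values = np.random.rand(20)     # 20 draws from the uniform distribution on [0, 1)

print(values.shape)             # a one-dimensional array holding 20 numbers
print(values.min() >= 0.0, values.max() < 1.0)
```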

We are using these to create a pandas Series of values. 

A bar chart only needs a single set of values. Each 'bar' represents one value from the Series.
A pandas Series (like a DataFrame) has a method called 'plot'. We only need to tell plot what kind of graph we want.

The 'x' axis represents the index values of the Series.

In [None]:
import numpy as np
import pandas as pd

np.random.seed(12345)            # set a seed value to ensure reproducibility of the plots 
s = pd.Series(np.random.rand(20) )
#s
# plot the bar chart
s.plot(kind='bar')
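
To see that the x axis really does follow the index, here is a small sketch that gives a Series a text index instead of the default 0, 1, 2, ... The values and labels (`fruit`, 'apples' and so on) are made up for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical values and labels, used only to show where the x axis labels come from
fruit = pd.Series([3, 7, 5], index=['apples', 'pears', 'plums'])

ax = fruit.plot(kind='bar')     # plot returns the matplotlib Axes it drew on
x_labels = [label.get_text() for label in ax.get_xticklabels()]
print(x_labels)                 # the index values, used as the bar labels
plt.show()
```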

Internally the pandas 'plot' method has called the 'bar' method of matplotlib and provided a set of parameters, including the pandas.Series s to generate the graph.

We can use matplotlib directly to produce a similar graph. In this case we need to pass two parameters: the x positions of the bars (here 'range(len(s))', one position per value) and the pandas Series holding the values.

We also have to explicitly call the 'show' function to produce the graph.



In [None]:

plt.bar(range(len(s)), s)
plt.show()

## Exercise

Compare the two graphs we have just drawn. How do they differ? Are the differences significant?

## Solution

Most importantly, the data in the graphs is the same. There are cosmetic differences in the tick marks on the x and y axes and in the width of the bars.

The width of the bars can be changed with the 'width' parameter of the 'bar' function.

~~~
plt.bar(range(len(s)), s, width=0.5)   # the default width is 0.8
~~~
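
If we also want a tick under every bar, as in the pandas version, one possible approach is to pass the same positions to the 'xticks' function. This is a sketch rather than the only way to do it, and the names `positions` and `bars` are ours.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

np.random.seed(12345)
s = pd.Series(np.random.rand(20))

positions = range(len(s))                 # one x position per value: 0, 1, ..., 19
bars = plt.bar(positions, s, width=0.5)   # narrower bars than the 0.8 default
plt.xticks(positions)                     # put a tick under every bar
print(len(bars))                          # one bar per value in the Series
plt.show()
```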

We can plot histograms in a similar way, either directly from pandas or from Matplotlib.

In [None]:
s = pd.Series(np.random.rand(20))
# plot the histogram
s.plot(kind='hist')

In [None]:
plt.hist(s) 
plt.show()

For the histogram, each data point is allocated to one of 10 (by default) 'bins' of equal size (range of numbers), which are indicated along the x axis. The number of points in each bin (the frequency) is shown on the y axis.
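
The binning can be inspected without drawing anything. The sketch below uses numpy's 'histogram' function, which also defaults to 10 equal-width bins. The data is freshly generated here, so the counts will not match the graphs above exactly.

```python
import numpy as np

np.random.seed(12345)
values = np.random.rand(20)

# np.histogram uses 10 equal-width bins by default
counts, edges = np.histogram(values)

print(len(counts))    # number of bins
print(len(edges))     # bin boundaries: one more than the number of bins
print(counts.sum())   # every data point lands in exactly one bin
```

A different number of bins can be requested with the 'bins' parameter, for example `plt.hist(s, bins=5)`.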

In this case the graphs are almost identical. The only difference is that in the first graph the y axis has the label 'Frequency'.

We can add the same label to the second graph with a call to the 'ylabel' function.


In [None]:

plt.ylabel('Frequency')
plt.hist(s)
plt.show()
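
The label that pandas adds for us can be read back from the object that 'plot' returns. This is a small sketch to show where the 'Frequency' text comes from; the name `ax` is our choice.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

s = pd.Series(np.random.rand(20))

ax = s.plot(kind='hist')     # pandas labels the y axis for us
print(ax.get_ylabel())       # the label text pandas set on the y axis
plt.show()
```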

In general, most graphs can be broken down into a series of elements which, although typically related in some way, can all exist independently of each other. This allows us to create the graph in a rather piecemeal fashion.

The labels (if any) on the x and y axes are independent of the data values being represented. The title and the legend are also independent objects within the overall graph.

In matplotlib you create the graph by providing values for all of the individual components you choose to include. When you are ready, you call the 'show' function.

Using this same approach we can plot two sets of data on the same graph.

We will use a scatter plot to demonstrate some of the available features.

For a scatter plot we need two sets of data points, one for the x values and the other for the y values.

In [None]:
# Generate some data for 2 sets of points.
x1 = pd.Series(np.random.rand(20) - 0.5 )
y1 = pd.Series(np.random.rand(20) - 0.5 )

x2 = pd.Series(np.random.rand(20) + 0.5 )
y2 = pd.Series(np.random.rand(20) + 0.5 )


# Add some features
plt.title('Scatter Plot')
plt.ylabel('Range of y values')
plt.xlabel('Range of x values')

# plot the points in a scatter plot
plt.scatter(x1,y1, c='red', label='Red Range' )  # 'c' parameter is the colour and 'label' is the text for the legend
plt.scatter(x2,y2, c='blue', label='Blue Range')

plt.legend( loc=4 )  # the locations 1,2,3 and 4 are top-right, top-left, bottom-left and bottom-right
# Show the graph with the two sets of points
plt.show()
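
The numeric location codes can be hard to remember. The 'legend' function also accepts a descriptive string, so `loc='lower right'` places the legend in the same corner as `loc=4`. The sketch below repeats a cut-down version of the scatter plot with the string form; the variable names are ours.

```python
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(12345)
plt.scatter(np.random.rand(20) - 0.5, np.random.rand(20) - 0.5, c='red', label='Red Range')
plt.scatter(np.random.rand(20) + 0.5, np.random.rand(20) + 0.5, c='blue', label='Blue Range')

legend = plt.legend(loc='lower right')   # same position as loc=4
legend_text = [entry.get_text() for entry in legend.get_texts()]
print(legend_text)                       # the 'label' values passed to scatter
plt.show()
```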

## Boxplot

A boxplot provides a simple representation of several statistical properties of a single set of data values: the median, the quartiles, the overall spread and any outliers.



In [None]:
x = pd.Series(np.random.standard_normal(256))

# Show a boxplot of the data 
plt.boxplot(x)
plt.show()
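
The numbers behind the picture can be calculated directly with pandas. The sketch below assumes matplotlib's default whisker setting (`whis=1.5`), under which points more than 1.5 times the interquartile range beyond the box are drawn as individual outliers. The variable names are ours.

```python
import numpy as np
import pandas as pd

np.random.seed(12345)
x = pd.Series(np.random.standard_normal(256))

q1, median, q3 = x.quantile([0.25, 0.5, 0.75])   # edges of the box and the line inside it
iqr = q3 - q1                                    # interquartile range: the height of the box

# With the default whis=1.5, points beyond these fences are drawn as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = x[(x < lower_fence) | (x > upper_fence)]

print(round(median, 3), round(iqr, 3), len(outliers))
```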

A common use of the boxplot is to compare the statistical variation across a set of variables.

The variables can be independent Series or columns within a dataframe.

You can plot individual columns from the dataframe.

In [None]:
df = pd.DataFrame(np.random.normal(size=(100,5)), columns=list('ABCDE'))
plt.boxplot(df.A, labels=['A'])  # 'labels' takes a list with one entry per box (named 'tick_labels' from matplotlib 3.9)
plt.show()

## Exercise 

Can you change the code above so that columns 'A' , 'C' and 'D' are all displayed on the same graph?

## Solution

In [None]:
df = pd.DataFrame(np.random.normal(size=(100,5)), columns=list('ABCDE'))
plt.boxplot([df.A, df.C, df.D], labels = ['A', 'C', 'D'])
plt.show()

Passing a complete dataframe directly to the matplotlib boxplot function is not reliable. Depending on your version of matplotlib, the code

~~~
df = pd.DataFrame(np.random.normal(size=(100,5)), columns=list('ABCDE'))
plt.boxplot(df)
plt.show()
~~~

may fail.

However, we can use the pandas plot method.

In [None]:
df = pd.DataFrame(np.random.normal(size=(100,5)), columns=list('ABCDE'))
df.plot(kind = 'box', return_type='axes') # the return_type='axes' is only needed for forward compatibility
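
The same method works on a subset of the dataframe. As a sketch, selecting columns 'A', 'C' and 'D' first gives the pandas equivalent of the exercise above; the name `ax` is our choice.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.normal(size=(100, 5)), columns=list('ABCDE'))

# Select the columns first, then let pandas draw one box per remaining column
ax = df[['A', 'C', 'D']].plot(kind='box')
print(len(ax.get_xticks()))     # one tick per box
plt.show()
```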

We can add a title to the above by adding the 'title' parameter. However, older versions of pandas have no parameters for adding the axis labels.
To add labels we can use matplotlib directly.

In [None]:

df = pd.DataFrame(np.random.normal(size=(100,5)), columns=list('ABCDE'))
df.plot(kind = 'box', return_type='axes')

plt.title('Box Plot')
plt.xlabel('xlabel')
plt.ylabel('ylabel')
plt.show()
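
Because the plot method returns the matplotlib object it drew on, the labels can also be set on that object rather than through 'plt'. The following is a sketch of that alternative, using the 'title' parameter mentioned above; the name `ax` is ours.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.normal(size=(100, 5)), columns=list('ABCDE'))

ax = df.plot(kind='box', title='Box Plot')   # 'title' is a parameter of the plot method
ax.set_xlabel('xlabel')                      # label the axes on the returned object
ax.set_ylabel('ylabel')
print(ax.get_title(), ax.get_xlabel(), ax.get_ylabel())
plt.show()
```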

If you wish to save your graph as an image, you can do so using the 'savefig' function. The image can be saved as a pdf, jpg or png file by changing the file extension.

In [None]:
df = pd.DataFrame(np.random.normal(size=(100,5)), columns=list('ABCDE'))
df.plot(kind = 'box', return_type='axes')

plt.title('Box Plot')
plt.xlabel('xlabel')
plt.ylabel('ylabel')
#plt.show()
plt.savefig('boxplot_from_df.pdf')
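
The sketch below saves the same graph in two formats and checks that the files were written. It writes to a temporary folder so that nothing in your working directory is overwritten; the folder and the names `png_path` and `pdf_path` are our choices for illustration. The 'dpi' parameter of 'savefig' controls the resolution of the png image. Note that 'savefig' is called before 'show': with many setups the figure is cleared once it has been shown, and saving afterwards would produce an empty image.

```python
import os
import tempfile

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.normal(size=(100, 5)), columns=list('ABCDE'))
df.plot(kind='box')
plt.title('Box Plot')

out_dir = tempfile.mkdtemp()                     # temporary folder for the output files
png_path = os.path.join(out_dir, 'boxplot_from_df.png')
pdf_path = os.path.join(out_dir, 'boxplot_from_df.pdf')

plt.savefig(png_path, dpi=150)                   # the file extension selects the format
plt.savefig(pdf_path)
print(os.path.exists(png_path), os.path.exists(pdf_path))
plt.show()                                       # show after saving, not before
```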

In [None]:
# Generate some data for 2 sets of points.
# and additional data for the sizes - suitably scaled
x1 = pd.Series(np.random.rand(20) - 0.5 )
y1 = pd.Series(np.random.rand(20) - 0.5 )
z1 = pd.Series(np.random.rand(20)*200 )

x2 = pd.Series(np.random.rand(20) + 0.5 )
y2 = pd.Series(np.random.rand(20) + 0.5 )
z2 = pd.Series(np.random.rand(20)*200 )

# Add some features
plt.title('Scatter Plot')
plt.ylabel('Range of y values')
plt.xlabel('Range of x values')

# plot the points in a scatter plot
plt.scatter(x1,y1, c='red', label='Red Range', s=z1, alpha=0.5 )  # 's' parameter is the dot size 
plt.scatter(x2,y2, c='blue', label='Blue Range', s=z2, alpha=0.5) # 'alpha' is the opacity

plt.legend( loc=4 ) 
plt.show()
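
In the scatter plot above, the 's' parameter gives each point its own size, which lets the graph show a third variable, and 'alpha' makes the points partly transparent so that overlapping points remain visible. matplotlib interprets 's' as the area of the marker (in points squared), not its diameter. The sketch below uses three made-up sizes to show that the values are stored per point; the names `sizes` and `points` are ours.

```python
import matplotlib.pyplot as plt

# Hypothetical values: each marker has four times the area of the one before
sizes = [20, 80, 320]
points = plt.scatter([0, 1, 2], [0, 1, 2], s=sizes, alpha=0.5)

stored_sizes = [float(size) for size in points.get_sizes()]
print(stored_sizes)             # one size per point, in the order given
plt.show()
```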