## Table of Contents
<a href="#Background"><font size="+1">Background</font></a>
* Learning Objectives
* Matplotlib
* Getting Started - an important piece of magic!
* Objects in matplotlib

<a href="#Basic-Plotting"><font size="+1">Basic Plotting</font></a>
* Using `plt.subplots()`
* Bar Charts
* Scatter Plots
* Pie Charts
* <a href="#Exercise-1">Exercise 1</a>
* Box Plots
* Histograms
* Multiple Axes
* <a href="#Exercise-2">Exercise 2</a>
* Saving/Exporting Plots

<a href="#Consolidation"><font size="+1">Consolidation</font></a>
* Seaborn
* Bokeh

# Background

### Learning Objectives

The goal of this session is to:

* Become familiar with plotting using matplotlib
* Experiment with a range of different plot types

### 1. Matplotlib

Matplotlib is Python's core plotting library. If you want to produce graphical summaries of data it is a good place to start, as simple graphs can be created relatively quickly. Matplotlib is also a really useful library for understanding basic plotting functionality in general. Understanding some matplotlib will really help you out even if you ultimately end up using one of:

* seaborn - makes basic matplotlib plotting look nice, and adds some more advanced plot types, like violin plots, strip plots, kde plots etc. Open-source.
* bokeh - interactive visualisation for web. Open source.
* plotnine - a port of R's established ggplot2 library. Open-source.

In general, when using matplotlib, you go through the following process:
1. Create a figure with one or more axes; each axis is effectively a blank canvas.
2. Plot some data onto the axes using one or more graph types.
3. Decorate and adjust the visual aspects of the plot (e.g. labels, ticks, titles etc.)

This will become clearer as you learn how to make plots!

### Getting Started

The cells below show how you can import matplotlib.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

In the import above, you have imported pandas, which should be familiar, as well as matplotlib:

```python
import matplotlib.pyplot as plt
```

A specific module of matplotlib called pyplot is commonly used for plotting. It allows plots to be made using the same approach that is implemented by MATLAB. I'm not entirely sure why the decision to align with MATLAB was made, but it is now the standard way of producing plots in Python!

As you can see, the pyplot module of matplotlib is imported with the nickname, or alias, `plt`.

In addition, I have included a magic command, `%matplotlib inline`. This is a **really** useful magic to include, as it means that any plot we produce will be automatically embedded below the code cell once the code to make it has been executed.

Try the code cell below for a quick example.

In [None]:
# make some data - the cube function

# list of numbers -10 to +10
x = list(range(-10,11))

# cubes of x
y = []
for num in x:
    y.append(num**3)
# NB: a oneliner for this is: [i**3 for i in x] this is a 'list comprehension'
    
# Make an axis and plot the data! Hopefully it will appear below the cell!
fig, ax = plt.subplots()
ax.plot(x,y);

## Objects in matplotlib

Working with matplotlib requires at least a vague understanding of the main objects that matplotlib uses to make plots: `Figure` and `Axes`.

### The `Figure` object

The `Figure` object is the outermost container for a plot. A simple `Figure` contains one `Axes` object, but it can contain multiple `Axes` objects. In addition, the `Figure` also contains and manages other objects associated with the plot (e.g. titles, legends etc.)

### The `Axes` object

The `Axes` object is effectively the plot itself and is the main thing you work with when creating a plot. An `Axes` is composed of many objects which can be used to define what is actually represented in the plot. There are a large number of `Axes` methods that allow you to create particular types of plot, like line graphs, bar charts, scatter plots and so on. In addition, many of the objects that comprise an `Axes` can be set from the `Axes` object itself, such as the axis labels, ticks and ranges.

### Other matplotlib objects

You might also see references to `Axis` objects which sit inside `Axes` and set graph limits and ticks, `Locator` objects which set tick locations, `Formatter` objects which format labels, as well as the more generic `Artist` object from which all kinds of useful objects like `Line2D`, `Patch` etc. derive. Although you can interact with all these matplotlib objects to tailor an output, most of the time you can do everything you need through the `Axes` object.
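
As a rough sketch of how these objects relate to one another, the snippet below creates a figure and checks a few of the relationships described above. The data plotted is arbitrary and only there to produce a `Line2D` artist.

```python
import matplotlib.pyplot as plt
from matplotlib.axis import XAxis
from matplotlib.figure import Figure

fig, ax = plt.subplots()

# The Figure is the outer container and keeps a list of its Axes
print(isinstance(fig, Figure))
print(fig.axes == [ax])

# Each Axes knows which Figure it belongs to
print(ax.figure is fig)

# Axis objects (note the singular) sit inside the Axes
print(isinstance(ax.xaxis, XAxis))

# Things drawn on the Axes are Artist objects, here a Line2D
(line,) = ax.plot([0, 1, 2], [0, 1, 4])
print(type(line).__name__)
```

In everyday use you rarely need these checks; they are only here to make the containment (Figure, then Axes, then Axis and Artists) concrete.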

# 2. Basic Plotting

Let's load the titanic dataset and start to explore plotting.

In [None]:
titanic = pd.read_csv('../input/titanic/titanic.csv')
titanic.head()

When we make plots, we're not going to call `plt` directly, although this is possible. Instead we will create an axis object that we can then manipulate. This approach will ultimately prove *really* useful, particularly when you come to style your plot or make figures with multiple axes.

## Using `plt.subplots()`

In order to use matplotlib objects effectively for plotting we always start with the same call:
```python
fig, ax = plt.subplots()
```
This code creates an instance of a figure (`fig`) with a single axis (`ax`). This is because the default call is effectively:
```python
fig, ax = plt.subplots(nrows = 1, ncols = 1)
```
Setting nrows and ncols to values > 1 creates multiple axes in a grid formation. These axes can be further customised using the [gridspec](https://matplotlib.org/tutorials/intermediate/gridspec.html) module, but that is a more advanced undertaking.

Often, the most useful parameter is figsize, which allows you to set the width and height of the figure in inches.
```python
fig, ax = plt.subplots(figsize = (8,8))
```
All of our plots will start with this syntax.
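
To see what `nrows`, `ncols` and `figsize` actually give you, here is a small sketch. The grid size and titles are arbitrary choices for illustration.

```python
import matplotlib.pyplot as plt

# A 2 x 3 grid of axes on a figure 9 inches wide and 6 inches high
fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(9, 6))

# With more than one row and column, the axes come back as a 2D array
print(axes.shape)
print(fig.get_size_inches())

# Individual axes are picked out by row and column
axes[0, 2].set_title('top right')
axes[1, 0].set_title('bottom left');
```

Note that with the default single axis, `ax` is one `Axes` object rather than an array.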

## 3. Bar Charts

Have a look at the basic plot below. It uses the `.barh()` method of `ax` (the `Axes` object we created and called `ax`). This tells matplotlib to create a horizontal bar chart. We could also use `.bar()` to create a vertical bar chart.

In [None]:
# Firstly, as the titanic data are disaggregate we need to aggregate!
survivors = titanic['survived'].value_counts(normalize=True)

In [None]:
# Now let's make a simple bar chart of survivors.

# create a figure and an axis instance- the default is nrow=1 and ncol=1
fig, ax = plt.subplots(figsize=(8,4))

# Use the axis to plot a horizontal bar chart.
ax.barh(survivors.index,survivors, tick_label = survivors.index)

# Now we need to decorate the plot
ax.set_xlabel('Proportion of Passengers')
ax.set_ylabel('Survived')
ax.set_title('Titanic Passenger Survival');

The above code has several steps:
1. Data preparation - the `titanic` DataFrame is aggregated by 'survived' and normalized (refer to the [aggregation](Aggregation.ipynb) section for more info).
2. Create Figure and Axes instance - a figure of size 8x4 inches is created with a single axes.
3. Make `barh()` plot - the horizontal barchart here takes 3 parameters:
    * `survivors.index` positions the bars on the y-axis by index position.
    * `survivors` provides the values to plot.
    * `tick_label = survivors.index` labels the bars according to their index value, 0 or 1.
4. Decorate the Axes - an xlabel, ylabel and title are added using Axes object set methods.

The basic bar chart is then displayed below the code cell.
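
For comparison, a vertical version using `.bar()` might look like the sketch below. The proportions here are made-up illustrative values, not figures taken from the titanic data.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Illustrative proportions only, not taken from the titanic data
props = pd.Series([0.6, 0.4], index=[0, 1])

fig, ax = plt.subplots(figsize=(4, 6))

# .bar() takes the x positions first, then the bar heights
bars = ax.bar(props.index, props, tick_label=props.index)

# The axis labels swap over compared with the horizontal version
ax.set_xlabel('Survived')
ax.set_ylabel('Proportion of Passengers')
ax.set_title('Vertical bar chart sketch');
```

The only real differences from `.barh()` are that the values become bar heights rather than widths, and the x and y labels change places.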

Pandas also implements a simple interface to matplotlib, which is a bit friendlier and you may prefer to use:

In [None]:
# create a figure and an axis instance
fig, ax = plt.subplots(figsize=(8,4))

# pandas plot method uses the axis instance created above
survivors.plot(kind='barh', ax=ax)

# Decorate
ax.set_title('Titanic Passenger Survival')
ax.set_xlabel('Proportion of Passengers')
ax.set_ylabel('Survived');

In the cell above, the pandas `.plot()` method draws a particular `kind` of plot onto a specific axes instance, `ax`. It implements a range of parameters to set particular properties, and additional decoration can be carried out on the axes itself.

Pandas is really useful for making more sophisticated plots in a simple way. For instance, if we wanted a survivor bar chart additionally grouped by passenger class, we could do the following:

In [None]:
# create a figure and an axis instance
fig, ax = plt.subplots(figsize=(8,4))

# Get survivor proportions by passenger classes
titanic.groupby('pclass')['survived'].mean().plot(kind='barh', ax=ax)

# Decorate
ax.set_title('Titanic Passenger Survival by Class')
ax.set_xlabel('Proportion of Passengers Survived')
ax.set_ylabel('Passenger Class');

Breaking this down, we can see that the important work is being done by pandas. In a single line, we:
* Group the titanic data by passenger class ('pclass'), effectively splitting the data into 3 dataframes.
* Take just the 'survived' variable from the grouped dataframe.
* Take the mean of 'survived' (here working out to be the proportion) for each 'pclass'.
* Plot the resultant series as a horizontal bar chart.
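
The grouping steps can be seen one at a time on a tiny, made-up table. The `toy` DataFrame below is hypothetical and only borrows the column names from the titanic data.

```python
import pandas as pd

# A tiny made-up table with the same column names as the titanic data
toy = pd.DataFrame({
    'pclass':   [1, 1, 2, 2, 3, 3, 3, 3],
    'survived': [1, 1, 1, 0, 0, 0, 1, 0],
})

grouped = toy.groupby('pclass')   # split into one group per class
survived = grouped['survived']    # keep only the 'survived' column
rates = survived.mean()           # mean of 0/1 values = proportion

print(rates)
```

Because 'survived' only takes the values 0 and 1, its mean within each group is the proportion of that group who survived. Calling `.plot(kind='barh')` on `rates` would then draw one bar per class.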

We could take this further using nested groups, for instance a bar chart grouped by passenger sex and passenger class:

In [None]:
# create a figure and an axis instance
fig, ax = plt.subplots(figsize=(8,4))

# Get survivor proportions by passenger classes
titanic.groupby(['pclass','sex'])['survived'].mean().plot(kind='barh', color = ['dodgerblue','indianred'] * 3, ax=ax)

# make bespoke legend
from matplotlib.patches import Patch
ax.legend(handles = [Patch(facecolor = 'indianred',label='Male'), Patch(facecolor = 'dodgerblue',label='Female')])

# Decorate
ax.set_title('Titanic Passenger Survival by Passenger Sex and Class')
ax.set_xlabel('Proportion Survived')
ax.set_ylabel('Passenger Class')
ax.set_yticklabels([1,1,2,2,3,3])
ax.xaxis.grid();

The code for the above bar chart looks more complicated. This is because I've adjusted the styling so that the Male and Female groups are picked out by the bar colour, rather than by an axis label, which would be the default.

This means specifying the `color` to use for 'female' and 'male': `['dodgerblue','indianred']`, which is repeated with `* 3` to cover the female/male pair for each passenger class (1, 2, 3). This produces the list: `['dodgerblue','indianred','dodgerblue','indianred','dodgerblue','indianred']`, which gives a colour for each bar in the bar chart.

I also had to make a bespoke legend to identify the bar colours. I used `Patch` objects for this; each patch has a 'facecolor' and a corresponding 'label'. I create a list of the appropriate number of patches, here 2, and pass that list to the 'handles' parameter of the `Axes` object's `.legend()` method. In long form this would look like:
```python
# import Patch artist object
from matplotlib.patches import Patch

# make 2 patches
male_patch = Patch(facecolor = 'indianred',label='Male')
female_patch = Patch(facecolor = 'dodgerblue',label='Female')

# make list of patches
patches = [male_patch, female_patch]

# pass to axes legend object.
ax.legend(handles = patches)
```
However, this is done above with two lines:
```python
# import Patch artist object
from matplotlib.patches import Patch
# make bespoke legend
ax.legend(handles = [Patch(facecolor = 'indianred',label='Male'), Patch(facecolor = 'dodgerblue',label='Female')])
```
The legend is automatically positioned in what matplotlib computes to be a 'good' place, but you can change this if you wish. Use the 'loc' parameter with either the string or the number code:

|Location String|Location Code|
|--|--|
|'best'|0|
|'upper right'|1|
|'upper left'|2|
|'lower left'|3|
|'lower right'|4|
|'right'|5|
|'center left'|6|
|'center right'|7|
|'lower center'|8|
|'upper center'|9|
|'center'|10|

NB the default 'loc' is 'best'.
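
As a small sketch of the 'loc' parameter in use, the snippet below places the same bespoke legend first by string and then by number code. The bar values are arbitrary placeholders.

```python
import matplotlib.pyplot as plt
from matplotlib.patches import Patch

fig, ax = plt.subplots()

# Placeholder bars so that there is something to attach a legend to
ax.barh(['a', 'b'], [0.3, 0.7], color=['dodgerblue', 'indianred'])

handles = [Patch(facecolor='indianred', label='Male'),
           Patch(facecolor='dodgerblue', label='Female')]

# Place the legend by name...
legend = ax.legend(handles=handles, loc='lower right')

# ...or by the equivalent number code; this call replaces the legend above
legend = ax.legend(handles=handles, loc=4)
```

The string form is usually easier to read back later, so it is probably the better habit.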

## 4. Scatter Plots

Scatter plots are useful for visually exploring the association between two variables.

The titanic dataset doesn't have much continuous data in it, but let's explore the relationship between fare and age.

In [None]:
# Make a figure and an axis
f, ax = plt.subplots(figsize=(7,7))

# Make a scatter plot with 'x' markers and a mid-grey colour.
ax.scatter(titanic['fare'],titanic['age'], marker = 'x', color = '0.5')

# decorate the plot
ax.set_xlabel('Fare')
ax.set_ylabel('Age')

In the above plot we specify the x and y axis values as well as some styling parameters ('color' and 'marker').

The color parameter can take a range of values:
* matplotlib built-in colours: 'b' (blue); 'g' (green); 'r' (red); 'c' (cyan); 'm' (magenta); 'y' (yellow); 'k' (black); 'w' (white).
* Grayscale - string values between 0 and 1. e.g '0.5' for mid-grey, '0.0' for black, '1.0' for white.
* Legal names - X11 colour names can be used (e.g. dodgerblue or indianred), see: https://en.wikipedia.org/wiki/X11_color_names#Color_name_chart
* RGB tuples - you can express a colour as an RGB tuple e.g. (0.54, 0.17, 0.89) gives 'blueviolet'.
* Web colours - hexadecimal colours also work e.g. '#DC143C' for Crimson.
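
If you want to check how matplotlib interprets a colour specification, the `matplotlib.colors` module has conversion helpers. The sketch below runs one example of each format from the list above through `to_rgb()` or `to_hex()`.

```python
from matplotlib import colors

# Single-letter built-in colour
print(colors.to_rgb('r'))

# Grayscale string
print(colors.to_rgb('0.5'))

# Named colour, shown as a hexadecimal string
print(colors.to_hex('blueviolet'))

# RGB tuple, which comes out very close to 'blueviolet'
print(colors.to_hex((0.54, 0.17, 0.89)))

# Hexadecimal string, shown as an RGB tuple
print(colors.to_rgb('#DC143C'))
```

All of these forms can be passed to any `color` parameter interchangeably.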

There are a wide range of marker styles, see here for details: https://matplotlib.org/api/markers_api.html

Let's take the plot above and add a best-fit line to it. We'll use statsmodels to do this; for more information on using statsmodels, have a look at the other vignette. We also have courses and notes on Introduction to Machine Learning and on practical machine learning with R and Python, which cover the theory and application of statistical models in more depth.

In [None]:
# Firstly, get the OLS model from the statsmodels library
# OLS means ordinary least squares - a linear regression estimator.
from statsmodels.api import OLS

In [None]:
# Now we can regress age on fare (age is the dependent variable).
model = OLS.from_formula('age ~ fare', data = titanic).fit()

# predict some points to make a line
pred = pd.DataFrame(data = {'fare':[0,titanic['fare'].max()]})
pred['y_hat'] = model.predict(pred)

In [None]:
# Now draw the plot
# Make a figure and an axis
f, ax = plt.subplots(figsize=(7,7))

# Make a scatter plot with 'x' markers and a mid-grey colour.
ax.scatter(titanic['fare'],titanic['age'], marker = 'x', color = '0.5', label = 'Passengers')

# add the best fit line
ax.plot(pred['fare'],pred['y_hat'], color='r', label = 'Best Fit Line')

# Add a legend
ax.legend()

# decorate the plot
ax.set_xlabel('Fare')
ax.set_ylabel('Age')

In the above example I am plotting two graphs (a scatter and a line) on the same axis. Note how adding a label to the plot command (either `.scatter()` or `.plot()` in this case) means that label is pulled in by the legend.

Although we've not expressly looked at it, `.plot()` is how you would plot a line graph.

Below is a more complex example. In particular, note how the `.map()` method allows us to specify marker colours.

In [None]:
# Now draw the plot
# Make a figure and an axis
f, ax = plt.subplots(figsize=(7,7))

# Now we can regress age on fare and sex.
model = OLS.from_formula('age ~ fare + sex', data = titanic).fit()

# predict some points to make a line
pred = pd.DataFrame(data = {'fare':[0,titanic['fare'].max()] * 2, 'sex':['male']*2 +['female']*2})
pred['y_hat'] = model.predict(pred)

# Make list of colours
colours = titanic['sex'].map({'female': 'indianred', 'male': 'dodgerblue'})
# Make a scatter plot with 'x' markers and colour by sex.
ax.scatter(titanic['fare'],titanic['age'], marker = 'x', color = colours, label = 'Passengers')

cols = {'female': 'darkred', 'male': 'darkblue'}
# add the best fit lines, using pandas plotting
for label, df in pred.groupby('sex'):
    # NB pandas adds the legend automatically
    df.plot(x = 'fare', y = 'y_hat', color = cols[label], label = label, ax=ax)

# decorate the plot
ax.set_xlabel('Fare')
ax.set_ylabel('Age');

Although the code for the above plot is getting a little complicated, hopefully it demonstrates the flexibility of matplotlib. The underpinning philosophy of matplotlib is that it should make simple plots easy, and complicated plots possible! Ideally, you will be able to create a simple plot with just a few commands.
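
The colour-mapping step used in the plot above can be looked at in isolation. In this sketch the `sex` Series holds made-up values standing in for the titanic column.

```python
import pandas as pd

# Made-up values standing in for the 'sex' column
sex = pd.Series(['male', 'female', 'female', 'male'])

palette = {'female': 'indianred', 'male': 'dodgerblue'}

# .map() looks up each value in the dictionary and returns a new Series
colours = sex.map(palette)
print(colours.tolist())
```

The result has one colour per row, which is why it can be passed straight to the `color` parameter of `.scatter()`.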

## 5. Pie Charts

Let's move on to looking at pie charts. As a rule of thumb, a pie chart shouldn't have more than 6 wedges, otherwise the chart becomes very hard to read.

The pie chart function, `.pie()`, is quite straightforward. At its simplest it requires only a single parameter: the data counts to be divided into pie slices.

## Exercise 1

See if you can work out how to make a pie chart. We'll try and do things in stages.

<u>Stage 1</u>

Let's represent the passenger class variable ('pclass') as a pie chart.
* First, get counts of the 'pclass' column values (Hint: try using `.value_counts()`), store as an object called 'pclass_count'.
* Now, make a figure and an axis instance, set figsize to (8,8).
* Call `ax.pie()` and give it the pclass_count object.
* Label the pie slices using the `labels` parameter by passing the index of  pclass_count.

<u>Stage 2</u>

* Make the wedge labels bigger. First make a dictionary with the key:value pair 'fontsize':12, and pass this to the `textprops` parameter of `ax.pie()`.
* Improve the look of the pie chart by passing a list of colours to the `colors` parameter of `ax.pie()`, for instance:
    * `colors = ['r','g','b']` for a really basic red-green-blue
    * `colors = ['lightseagreen','lightcoral','lightsteelblue']` for some nice named colours.
    * `colors = ['#a6cee3','#1f78b4','#b2df8a']` which are colourblind safe.
    * `colors = ['0.2','0.5','0.8']` which are shades of grey.
* Assign the string '%.1f' to the parameter `autopct`; this labels the wedges with their percentages to 1 decimal place.
* Add a title to the axis that reads "Titanic Passengers by Passenger Class"

<u>Stage 3</u>

* If you are happy, save the figure and see what it looks like!  

In [None]:
## Type code in here - add more cells if you need to.

In [None]:
#Exercise 1 Solutions

## sort_index sorts the data by index value
#Pass_Class = titanic['pclass'].value_counts().sort_index()
#fig,ax = plt.subplots(figsize=(7,7))
#ax.pie(Pass_Class, labels = ('C1','C2','C3'))

## 6. Box Plots

A box plot is a useful way of describing the distribution of a continuous variable, and lends itself to being grouped. A standard box plot has two main components:
* The box - representing the quartiles of the data (the 25th, 50th (i.e. median) and 75th percentiles)
* The whiskers - extending from each end of the box to the most extreme data point that lies within 1.5 times the interquartile range of the box edge. Points beyond the whiskers are drawn individually as outliers.

Taken together, a box plot gives an indication of the dispersion (spread), skew (symmetry about the median), and outliers of a variable.
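
To make the box and whisker positions concrete, the sketch below works them out by hand for a small made-up sample, assuming matplotlib's default whisker length of 1.5 times the interquartile range, and compares the result with matplotlib's own `boxplot_stats()` helper.

```python
import numpy as np
from matplotlib.cbook import boxplot_stats

# Made-up data with one obviously large value
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])

# The box: lower quartile, median and upper quartile
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# The whiskers may not extend past these fences
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# They stop at the most extreme data points inside the fences
inside = data[(data >= lower_fence) & (data <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything outside the fences is drawn as an individual outlier
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(q1, median, q3, iqr)
print(whisker_low, whisker_high, outliers)

# matplotlib's own calculation, for comparison
stats = boxplot_stats(data)[0]
print(stats['whislo'], stats['whishi'], stats['fliers'])
```

Here the upper whisker stops at 9 rather than at the fence itself, and the value 50 is flagged as an outlier.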

In [None]:
# simple boxplot
f, ax = plt.subplots(figsize = (4,6))

# remove na values from 'fare'
fare = titanic[titanic['fare'].notnull()]['fare']

# make a boxplot
ax.boxplot(fare)

# label the y axis
ax.set_ylabel('Fare');

In the plot above it is clear that the 'fare' variable is positively skewed: there is a long tail of outliers and the median is towards the bottom of the box.

We should be able to see a difference by sex, so we will group the boxplot.

In [None]:
# simple boxplot
f, ax = plt.subplots(figsize = (6,6))

# remove na values
fare = titanic[titanic['fare'].notnull()]

# make a boxplot
grps = fare.groupby('sex')['fare']

ax.boxplot([grps.get_group('male'),grps.get_group('female')], labels=['male','female'])

# label the y axis
ax.set_ylabel('Fare');

The grouped box plots are slightly awkward - you have to pass a list in which each list item is a set of data to be plotted. Here I make a `.groupby()` object, and then assign the groups to a list using the `.get_group()` method.

## 7. Histograms

Possibly the most commonly used data visualisation approach, histograms are another way of inspecting the distribution of a variable (usually continuous, but possibly a discrete variable like a count). They work by putting data into 'bins' according to their value, counting the number of data points in each bin, and then plotting these frequencies.

The trickiest part is selecting an appropriate number of bins: too many bins and the histogram is too noisy (bumpy), too few and it is too general (smooth).
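
The effect of the bin count can be seen without drawing anything. This sketch uses numpy's `histogram()` function on a small made-up sample to show how the same data is counted with 2, 4 and 8 equal-width bins.

```python
import numpy as np

# Made-up values
data = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 9])

counts_by_bins = {}
for n_bins in (2, 4, 8):
    # counts: number of points per bin; edges: the bin boundaries
    counts, edges = np.histogram(data, bins=n_bins)
    counts_by_bins[n_bins] = counts
    print(n_bins, 'bins:', counts)
```

With 2 bins almost everything lands in the first bin and the shape is lost; with 8 bins the peak around 3 is visible, but empty bins start to appear.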

In [None]:
# histogram of the distribution of fares
f, ax = plt.subplots(figsize = (8,5))

ax.hist(titanic['fare'], bins = 20)

ax.set_xlabel("Fare")
ax.set_ylabel("Frequency");

Looking at the histogram above, it's clear that the fare variable is not normally distributed. If we were going to use it to report an average to represent a typical value, or as a dependent variable in a linear model, we may wish to transform it.

## 8. Multiple Axes

Using the `plt.subplots()` method defaults we've been creating figures with a single axis onto which we've been plotting. It is straightforward then to create additional axes within the figure and make multiple plots. The axes will be composed as a grid of `nrows` and `ncols`, and will produce a single figure as normal and an array of axes instances to plot on. As we have done previously, it can be useful to 'unpack' the axes instances. Compare these two snippets of code:

```python
f, axes = plt.subplots(ncols = 2)
```
and
```python
f, (ax1, ax2) = plt.subplots(ncols = 2)
```
In the first code snippet, the axes are obtained by specifying `axes[0]` and `axes[1]`, whereas in the second the axes are separately referenced as `ax1` and `ax2`.

When making multiple axes instances, the `sharex` and `sharey` parameters (`False` by default) are useful if you are looking at the same data on a particular axis.

In [None]:
# sharey as both axes will show survived proportion.
f, (ax1, ax2) = plt.subplots(1,2,figsize=(12,6), sharey = True)

# Plot ax1
titanic.groupby('embarked')['survived'].mean().plot(kind='bar', ax=ax1)
# decorate ax1
ax1.set_xticklabels(['Cherbourg','Queenstown','Southampton'], rotation = 0)
ax1.set_xlabel('Port of Embarkation')
ax1.set_ylabel('Proportion of Passengers who Survived')

# Plot ax2
titanic.groupby('sex')['survived'].mean().plot(kind='bar', color = ['cadetblue','chocolate'], ax=ax2)
# decorate ax2
ax2.set_xticklabels(['Female','Male'], rotation = 0)
ax2.set_xlabel('Passenger Sex')

# Add a figure title
f.suptitle("Titanic Survival by Passenger Port of Embarkation and by Sex");
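
To see what `sharey` does in isolation, here is a minimal sketch with placeholder bars. Setting the y limits on one axis is enough to change them on the other.

```python
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(ncols=2, sharey=True)

# Placeholder values with different ranges on each axis
ax1.bar(['a', 'b'], [0.2, 0.4])
ax2.bar(['c', 'd'], [0.6, 0.9])

# Changing the y limits on one axis changes them on the other
ax1.set_ylim(0, 1)
print(ax2.get_ylim())
```

Without `sharey=True`, each axis would scale to its own data, which can make bars of different heights look misleadingly similar.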

## Exercise 2

In this section, we've looked at the plotting functions `.bar()`/`.barh()`, `.pie()`, `.scatter()`, `.boxplot()`, `.hist()` and `.plot()`. We've made a number of simple and more complicated plots to demonstrate the power of matplotlib.

Let's have a go at making some of the plots that we've encountered. Try to add appropriate decoration. If you don't know how to do something, try googling for the answer! (There is masses of matplotlib help on stackoverflow etc. as it is such a widely used package).

1. Create a bar chart showing the proportion of people who embarked in Southampton, Cherbourg, and Queenstown.
2. Create a new variable called 'child' which is 1 if the passenger is < 18 and 0 otherwise. Create a plot of survival proportion by 'child'.
3. Create a histogram of 'age', what does the distribution look like?
4. Create a grouped boxplot of fare by place of embarkation.
5. Create a figure with two axes.
    * On axis 1 put the plot from question 2 (child against survived proportion).
    * On axis 2 put a bar plot of lone passengers against survived proportion. Assume that lone passengers have `sibsp` and `parch` equal to 0.

In [None]:
# Start coding here, you'll probably need to add more cells.


In [None]:
# Exercise 2.1 solutions

#f, ax = plt.subplots(figsize = (6,8))

# aggregate the data
#embark = titanic['embarked'].value_counts(normalize = True)

# labels
#cities = {'S':'Southampton', 'Q':'Queenstown', 'C':'Cherbourg'}

# plot the data
#ax.bar(x = embark.index, 
#       height = embark, 
#       tick_label = embark.index.map(cities), 
#       color = ['#66c2a5','#fc8d62','#8da0cb'])

# decorate the plot
#ax.set_ylabel('Proportion of Passengers')
#ax.set_title('Proportion of Titanic Passengers by Embarkation City')

In [None]:
# Exercise 2.2 solutions

# create child variable
#titanic['child'] = (titanic['age'] < 18).astype(int)

# create a figure and an axis instance
#fig, ax = plt.subplots(figsize=(8,4))

# Get survivor proportions by age group
#titanic.groupby('child')['survived'].mean().plot(kind='bar', color = ['hotpink','goldenrod'], ax=ax)

# Decorate
#ax.set_title('Titanic Passenger Survival by Age Group')
#ax.set_ylabel('Proportion of Passengers Survived')
#ax.set_xlabel('Age Group')
#ax.set_xticklabels(['Adult', 'Child'], rotation = 0);

In [None]:
# Exercise 2.3 solutions

#f, ax = plt.subplots(figsize=(10,6))

# Make histogram, 25 bins looks reasonable.
#ax.hist(titanic['age'], bins = 25, color = 'coral')

# decorate
#ax.set_ylabel('Frequency')
#ax.set_xlabel('Passenger Age')
#ax.set_title('Age Distribution of Titanic Passengers')


In [None]:
# Exercise 2.4 solutions

# simple boxplot
#f, ax = plt.subplots(figsize = (6,6))

# remove na values
#fare = titanic[titanic['fare'].notnull()]

# make a boxplot
#grps = fare.groupby('embarked')['fare']

#ax.boxplot([grps.get_group('S'),grps.get_group('Q'),grps.get_group('C')],
#           labels=['Southampton', 'Queenstown', 'Cherbourg'])

# label the y axis
#ax.set_ylabel('Fare');

In [None]:
# Exercise 2.5 solutions

# sharey as both axes will show survived proportion.
#f, (ax1, ax2) = plt.subplots(1,2,figsize=(12,6), sharey = True)

# Plot ax1
#titanic.groupby('child')['survived'].mean().plot(kind='bar', ax=ax1)
# decorate ax1
#ax1.set_xticklabels(['Adult','Child'], rotation = 0)
#ax1.set_xlabel('Age Group')
#ax1.set_ylabel('Proportion of Passengers who Survived')

# Plot ax2
#titanic['lone_passenger'] = ((titanic['sibsp'] == 0) & (titanic['parch'] == 0)).astype(int)
#titanic.groupby('lone_passenger')['survived'].mean().plot(kind='bar', color = ['darkcyan','deeppink'], ax=ax2)
# decorate ax2
#ax2.set_xticklabels(['Passenger in Group','Lone Passenger'], rotation = 0)
#ax2.set_xlabel('Passenger Status')

# Add a figure title
#f.suptitle("Titanic Survival by Passenger Age and by Group Status");

## 9. Saving/Exporting Plots

Saving and Exporting plots is really simple. Just make a plot, and then call `plt.savefig()` at the end with the appropriate parameters.

In [None]:
# create a figure and an axis instance
fig, ax = plt.subplots(figsize=(8,4))

# Get survivor proportions by passenger classes
titanic.groupby('pclass')['survived'].mean().plot(kind='barh', ax=ax)

# Decorate
ax.set_title('Titanic Passenger Survival by Class')
ax.set_xlabel('Proportion of Passengers Survived')
ax.set_ylabel('Passenger Class')

# Now save this figure
plt.savefig('../example.png', dpi=200, bbox_inches = 'tight')

In the `savefig()` function I have set the format (the 'png' image format) by using the appropriate file extension (.png) as part of the filepath for saving the figure. Because 'png' is an image file type, I can also set the resolution that I want the image to be with the dpi (dots per inch). Publications often ask for a minimum of 300 dpi.

Matplotlib can save figures in a range of different filetypes as standard, including:
* Image (Raster) Formats
    * .png - 'portable network graphics'
    * .jpg /.jpeg - 'joint photographic experts group'
    * .tif /.tiff - 'tagged image file format'
* Vector Formats
    * .ps /.eps - Postscript/ Encapsulated Postscript
    * .pdf - 'portable document format'
    * .svg - 'scalable vector graphics'
    
Image file formats are basically regular grids of cells, each with a value representing how that cell should be shaded. These kinds of files are easily manipulated in software like Adobe Photoshop or GIMP (open source photoshop) if you want to do additional touch-ups or graphic design.

Vector file formats encode all the elements of the plot as paths. This means they don't have a resolution, so the quality of the output doesn't degrade the more you zoom in, unlike a raster image. The .pdf is a widely used and highly shareable file format which can be opened by modern web browsers. The postscript formats are primarily used by publishers for printing; you may find that official publications or academic journals request figures as .ps or .eps. The .svg format is an open vector file format that can be used with software like Adobe Illustrator or Inkscape (open source illustrator) to make additional adjustments to the figure.

In the cell below try saving the figure as a .pdf type, then have a look at it by opening the pdf file. Compare the file to the png that you've already saved. If you zoom in the text should get 'fuzzy' in the png, but not in the pdf. That is the key difference between vector and raster, one has an implicit resolution and hence an ideal scale at which it should be viewed to avoid distortion.

Note that I have to recreate the plot as plots are specific to code cells in the notebook.

In [None]:
# create a figure and an axis instance
fig, ax = plt.subplots(figsize=(8,4))

# Get survivor proportions by passenger classes
titanic.groupby('pclass')['survived'].mean().plot(kind='barh', ax=ax)

# Decorate
ax.set_title('Titanic Passenger Survival by Class')
ax.set_xlabel('Proportion of Passengers Survived')
ax.set_ylabel('Passenger Class')

# Now save the figure again as pdf
plt.savefig('../example.pdf', dpi=300, bbox_inches = 'tight')
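
As a quick way of seeing the raster/vector difference without opening any files, the sketch below saves a small placeholder plot into in-memory buffers. It uses the figure's own `savefig()` method and the `format` parameter in place of a file extension; the plotted data is arbitrary.

```python
import io
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(4, 3))
ax.plot([0, 1, 2], [0, 1, 4])

# Save to in-memory buffers rather than to disk
png_buffer = io.BytesIO()
fig.savefig(png_buffer, format='png', dpi=200, bbox_inches='tight')

svg_buffer = io.BytesIO()
fig.savefig(svg_buffer, format='svg', bbox_inches='tight')

png_bytes = png_buffer.getvalue()
svg_text = svg_buffer.getvalue().decode('utf-8')

# A raster file is binary pixel data behind a short header
print(png_bytes[:4])

# A vector file such as SVG is text describing the shapes to draw
print('<svg' in svg_text)
```

Raising `dpi` makes the PNG larger because there are more pixels to store, whereas the SVG describes the same paths regardless of how big it is displayed.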

# Consolidation

Matplotlib is a very large package that forms the core of Python's plotting capabilities. In this vignette you have barely scratched the surface of its capabilities! Hopefully though you should have a sense of how to make a range of simple visualisations. One of Matplotlib's huge advantages is that you can plot almost anything using it, although sometimes actually achieving this can be tricky. Nonetheless, it is an important library to have at least a passing familiarity with, as other libraries make use of matplotlib as a backend for specific purposes.

A great example of this is [seaborn](https://seaborn.pydata.org), which you may find much more accessible for making data visualisation. In fact, seaborn actually implements a number of interesting and novel plotting methods that aren't directly available through matplotlib. In particular it makes plotting grouped data much easier (see example below).

If interactive, web-style graphics are more your thing, then [bokeh](https://bokeh.pydata.org/) (pronounced as either: bo-kay or bo-kuh) might be worth your time. The simple scatter plot example is reproduced below. Try interacting with the plot.

Finally, if you are an R user who has familiarity using ggplot2, or are otherwise interested in the 'grammar of graphics' (gg) approach, then 'plotnine' may interest you. However, it is not currently part of the anaconda distribution, so you need to install it yourself (use: 'pip install plotnine' from the command line), which can unfortunately be tricky. As a result, plotnine is not demonstrated below.

## 10. Seaborn

In [None]:
# import seaborn
import seaborn as sns

In [None]:
# make a swarmplot - can take a moment to produce.
f, ax = plt.subplots(figsize=(6,6))
sns.swarmplot(x = 'sex', y = 'fare', data = titanic, ax=ax);

## 11. Bokeh

In [None]:
from bokeh.io import output_notebook, show
from bokeh.plotting import figure

In [None]:
output_notebook()

In [None]:
# create a new plot with default tools, using figure
# NB: recent bokeh releases use width/height rather than plot_width/plot_height
p = figure(width=600, height=600, title = 'Titanic Passengers Age against Fare Paid.')

# add circle markers with x and y coordinates, size, color, and alpha
# NB: scatter() draws circle markers by default
p.scatter(titanic['fare'], titanic['age'], size=8, line_color="navy", fill_color="orange", fill_alpha=0.5)

# Add some decoration
p.xaxis.axis_label = 'Fare'
p.yaxis.axis_label = 'Age'

show(p) # show the results