#### Objectives
* Create a time series plot showing a single data set.

* Create a scatter plot showing the relationship between two data sets.


#### Matplotlib
* We will mostly use a sub-library called matplotlib.pyplot.
* The Jupyter Notebook will render plots inline if we ask it to using a “magic” command.
* A good [explainer for Matplotlib](http://pbpython.com/effective-matplotlib.html)


In [None]:
#inline plotting and importing matplotlib as plt
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
from IPython.display import Image

### Matplotlib is best used in an object-oriented way
You will have a figure, which is the canvas with the frame. The 'axes' object is the picture you draw onto the canvas. You can draw multiple pictures, aka 'axes', on the same canvas.
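
The figure and axes idea can be sketched in a few lines. This is a minimal example with made-up numbers, not the workshop data: one figure (the canvas) holds two axes (the pictures), and each axes is drawn on separately.

```python
import matplotlib.pyplot as plt

# One figure (the canvas) holding two axes (the pictures), side by side.
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(8, 3))

# Each axes object is drawn on independently of the other.
axes[0].plot([1, 2, 3, 4], [1, 4, 9, 16])
axes[0].set_title('left axes')

axes[1].plot([1, 2, 3, 4], [16, 9, 4, 1])
axes[1].set_title('right axes')

# Both axes belong to the same figure.
print(len(fig.axes))
```

With a single picture, plt.subplots() returns one axes object instead of an array, which is the fig, ax pattern used in the rest of these notes.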

In [None]:
#from https://stackoverflow.com/questions/5575451/difference-between-axes-and-axis-in-matplotlib
Image(filename='../fig/Fig_axes_in_matplotlib.png')

In [None]:
#explain fig and ax on the board


### Matplotlib interprets integers and floats as you would expect
If you have a number encoded as a string, convert it to an int or float type so that Matplotlib treats it as a number rather than as a category label.
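
A minimal sketch of that conversion, using a small made-up table (the column names dose and response are hypothetical, not from the workshop data). Either astype or pd.to_numeric can do the job.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical example values: numbers that were read in as strings.
df = pd.DataFrame({'dose': ['1', '2', '10'],
                   'response': ['0.5', '1.5', '3.0']})

# Convert the string columns to numeric types before plotting.
df['dose'] = df['dose'].astype(int)
df['response'] = pd.to_numeric(df['response'])

fig, ax = plt.subplots()
ax.plot(df['dose'], df['response'], marker='o')
ax.set_xlabel('dose')
ax.set_ylabel('response')
```

Without the conversion, the x positions would be evenly spaced labels '1', '2', '10' instead of the numbers 1, 2 and 10.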

In [None]:
#import the resistant interaction data this time
res_df = pd.read_excel()

#rename the columns and plot



### Matplotlib is REALLY fine-tunable
This is both its strength and its weakness.

In [None]:





ax.set_ylabel('Log fold change', fontdict={'size': 10})
ax.tick_params(axis='both', which='major', labelsize=10)
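
As an illustration of how much can be adjusted, here is a self-contained sketch with made-up data. The title, labels, limits and colours are arbitrary choices for the example, not values from the workshop.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot([0, 1, 2, 3], [0.0, 1.2, 0.8, 2.1],
        color='darkgreen', linestyle='--', marker='o')

# Text and font sizes
ax.set_title('Example title', fontdict={'size': 14})
ax.set_xlabel('Time point', fontdict={'size': 10})
ax.set_ylabel('Log fold change', fontdict={'size': 10})

# Tick label size, axis limits and tick positions
ax.tick_params(axis='both', which='major', labelsize=10)
ax.set_ylim(-0.5, 2.5)
ax.set_xticks([0, 1, 2, 3])
```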

### Matplotlib lets you plot different plot types easily
* .plot = lineplot
* .bar = bar graph
* .scatter = scatter plot
* .boxplot = box plots
* .violinplot = violin plot
* many many more. See gallery [here](https://matplotlib.org/gallery/index.html). 
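
The plot types listed above can be compared side by side. The sketch below uses small made-up lists, one axes per method.

```python
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [2, 3, 5, 4]
# Made-up samples for the two distribution plots
groups = [[1, 2, 2, 3, 4], [2, 3, 3, 4, 6]]

fig, axes = plt.subplots(nrows=1, ncols=5, figsize=(15, 3))

axes[0].plot(x, y)           # line plot
axes[1].bar(x, y)            # bar graph
axes[2].scatter(x, y)        # scatter plot
axes[3].boxplot(groups)      # box plots, one per sample
axes[4].violinplot(groups)   # violin plots, one per sample

for ax, name in zip(axes, ['plot', 'bar', 'scatter', 'boxplot', 'violinplot']):
    ax.set_title(name)
```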

In [None]:
fig, ax = plt.subplots(figsize=(12,12))

gene1 = 'IWGSC_CSS_4AS_scaff_6010640.mRNA.Traes_4AS_8A64DBE8E.1'
gene2 = 'IWGSC_CSS_5AS_scaff_1506888.mRNA.Traes_5AS_FECFF9702.1'

ax.scatter()
ax.scatter()


ax.set_ylabel('log fold expression change', fontdict={'size': 20})
ax.tick_params(axis='both', which='major', labelsize=20)


### Matplotlib has different plotting styles


In [None]:
plt.style.use('ggplot')

fig, ax = plt.subplots(figsize=(12,12))

bar_width = 0.1

gene1 = 'IWGSC_CSS_6BS_scaff_2971039.mRNA.Traes_6BS_0BDACE205.1'
gene2 = 'IWGSC_CSS_5AS_scaff_1506888.mRNA.Traes_5AS_FECFF9702.1'
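
To see which styles are available, and to try one without changing the global settings, something like the following sketch can be used. The bar heights are made up.

```python
import matplotlib.pyplot as plt

# List the style names this Matplotlib installation knows about.
print(plt.style.available)

# Apply a style only inside the with block, leaving the global settings alone.
with plt.style.context('ggplot'):
    fig, ax = plt.subplots()
    ax.bar(['a', 'b', 'c'], [3, 1, 2])
```

plt.style.use() changes the style for every later plot in the session, whereas plt.style.context() limits the change to the block.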




#### Exercise
Copy the raw text below into a new code cell. Fill in the blanks to plot the expression of the five genes that are most highly upregulated during a virulent interaction. The data for the infection of a susceptible wheat cultivar are found in Table S9. Try to explain what the other bits and pieces do.

In [None]:
sus_df = pd.read_excel('../data/12864_2016_2684_MOESM1_ESM.xlsx',
                       sheet_name='Table S9', skiprows=2, index_col='gene_id')

In [None]:
sus_df.head()

#### Exercise
Modify the example in the notes to create a scatter plot showing the relationship between the minimum and maximum GDP per capita among the countries in Asia for each year in the data set. What relationship do you see (if any)?

Start thinking about the following:
* Read in the GDP data for Asia using pd.read_csv.
* Make the fig and ax using plt.subplots().
* Calculate the min and max for Asia over time using the describe method of the dataframe object.
* Transpose your dataframe.
* Plot ax.scatter with min and max.

In [None]:
data_asia = pd.read_csv('data/gapminder_gdp_asia.csv', index_col='country')

fig, ax = plt.subplots()

ax.scatter(data_asia.describe().T['min'], data_asia.describe().T['max'])

### Matplotlib lets you save files in different formats
The figure can be saved with its own method, .savefig, which writes the figure to a file. The file format is deduced automatically from the file name extension (for example pdf, ps, eps, png, or svg). You can also adjust the resolution with the dpi argument.

In [None]:
fig.savefig('test.png')
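
A slightly fuller sketch of saving in two formats. The plot is made up and the files go into a temporary directory (an assumption for this example) so that the working directory is not cluttered.

```python
import os
import tempfile
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 2, 3], [1, 4, 9])

# Write into a temporary directory rather than the working directory.
out_dir = tempfile.mkdtemp()
png_path = os.path.join(out_dir, 'test.png')
pdf_path = os.path.join(out_dir, 'test.pdf')

# Raster format: the resolution is set by dpi.
fig.savefig(png_path, dpi=300)
# Vector format: bbox_inches='tight' trims the surrounding whitespace.
fig.savefig(pdf_path, bbox_inches='tight')
```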

## Plotting with Seaborn, a modern plotting library
[Seaborn](https://seaborn.pydata.org/) is a statistical plotting library with a simpler interface than Matplotlib.

This part is based on https://github.com/aspp-apac/2019-data-tidying-and-visualisation


Seaborn builds on Matplotlib. Some nice features are:

* works directly with Pandas dataframes, concise syntax
* lots of plot types, including some more advanced options
* statistical plotting: many plots do summary statistics for you
* good default aesthetics and easy control of aesthetics
* building on Matplotlib brings Matplotlib's benefits - many backends, lots of control
* underlying Matplotlib objects are easy to tweak directly

#### Setup

In [None]:
import pandas as pd
import numpy as np

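Be aware that Seaborn versions before 0.8 automatically changed Matplotlib's defaults on import, so that not only your Seaborn plots but also your Matplotlib plots looked different once Seaborn was imported. From version 0.8 onwards the Seaborn styling is only applied when you ask for it, for example with sns.set(). Either way, you can return to the original Matplotlib settings by calling sns.reset_orig().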

In [None]:
import seaborn as sns
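
A minimal sketch of switching the Seaborn look on and off explicitly. It assumes Seaborn 0.11 or later, where sns.set_theme() is available (older versions use sns.set() for the same purpose). It inspects one Matplotlib setting, the axes background colour, to show the effect.

```python
import matplotlib.pyplot as plt
import seaborn as sns

before = plt.rcParams['axes.facecolor']

# Explicitly switch on the Seaborn theme (assumes Seaborn 0.11 or later).
sns.set_theme()
themed = plt.rcParams['axes.facecolor']

# Restore the Matplotlib settings that were active when Seaborn was imported.
sns.reset_orig()
after = plt.rcParams['axes.facecolor']

print(before, themed, after)
```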

In [None]:
sales = pd.read_csv("../data/toy_data/housing-data-10000.csv", 
                    usecols=['id','date','price','zipcode','lat','long', 'bedrooms',
                             'waterfront','view','grade','sqft_living','sqft_lot'],
                    parse_dates=['date'], 
                    dtype={'zipcode': 'category',
                           'waterfront': 'bool'})

In [None]:
import datetime as dt

In [None]:
dt.datetime.utcnow().strftime("%d_%b_%Y_%Ih_%Mmin_%Ssec")

In [None]:
str(dt.datetime.utcnow().strftime("%d_%b_%Y_%Ih_%Mmin"))

In [None]:
sales.describe().loc[:, 'price']

In [None]:
sales.head()

In [None]:
sales.dtypes


Note that as well as specifying that the date field should be parsed as a date, we specified that certain variables are categorical (as opposed to integers). Some plotting commands understand pandas DataFrames and will treat categorical variables differently to numerical variables.
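
The difference between an integer column and a categorical column can be seen on a tiny made-up table (the values below are illustrative, not read from the housing file).

```python
import pandas as pd

# Made-up values; a zipcode is a label, not a quantity.
df = pd.DataFrame({'zipcode': [98178, 98125, 98178],
                   'price': [221900, 538000, 180000]})
print(df.dtypes)   # zipcode starts out as an integer column

# Tell pandas that zipcode is a categorical variable.
df['zipcode'] = df['zipcode'].astype('category')
print(df.dtypes)   # zipcode is now categorical
print(df['zipcode'].cat.categories)
```

Passing dtype={'zipcode': 'category'} to pd.read_csv, as in the cell above, has the same effect at load time.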

### Toy data

In [None]:
from io import StringIO

data_string = """name	number	engine_type	colour	wheels	top_speed_mph	weight_tons
Thomas	1	Tank	Blue	6	40	52
Edward	2	Tender	Blue	14	70	41
Henry	3	Tender	Green	18	90	72.2
Gordon	4	Tender	Blue	18	100	91.35
James	5	Tender	Red	14	70	46
Percy	6	Tank	Green	4	40	22.85
Toby	7	Tank	Brown	6	20	27
Emily	12	Tender	Green	8	85	45
Rosie	37	Tank	Purple	6	65	37
Hiro	51	Tender	Black	20	55	76.8"""

trains = pd.read_table(StringIO(data_string))
trains['size'] = pd.cut(trains['weight_tons'], 3, labels=['Small','Medium','Big'])

trains

In [None]:
df = pd.DataFrame({
    'Time': [1,2,3,4,5],
    'Projected': [2,5,10,17,26],
    'Actual': [1,4,9,11,9]
})

fig, ax = plt.subplots()
sns.scatterplot(data=df, x='Time', y='Actual', color='red', ax=ax)
sns.lineplot(data=df, x='Time', y='Projected', color='blue', ax=ax)

ax.set_ylabel('Huge profits')

ax.annotate("where it all went wrong",
            xy=(3, 10), xytext=(1, 12),
            arrowprops={'width': 2})

In [None]:
fig, ax = plt.subplots()
sns.barplot(data=trains, x='engine_type', y='top_speed_mph', ax=ax)


Here Seaborn has interpreted the x and y arguments as field names in the supplied DataFrame. Notice also that Seaborn has performed the summary statistics for us - in this case, using the default estimator, which is mean().
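
To check what the bars represent, the summary can be computed by hand. The sketch below uses a few rows taken from the trains table above and also shows how a different statistic could be requested through the estimator parameter.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# A few rows taken from the trains table above.
trains_small = pd.DataFrame({
    'engine_type': ['Tank', 'Tender', 'Tender', 'Tender', 'Tank'],
    'top_speed_mph': [40, 70, 90, 100, 20],
})

# The bar heights Seaborn draws by default are the per-group means.
group_means = trains_small.groupby('engine_type')['top_speed_mph'].mean()
print(group_means)

# A different summary statistic can be requested with estimator.
fig, ax = plt.subplots()
sns.barplot(data=trains_small, x='engine_type', y='top_speed_mph',
            estimator=np.median, ax=ax)
```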

Notice also what happens if we simply swap the x and y parameters. Seaborn will automatically deduce that the categorical or string-like variable must be the bar labels, and the numeric variable must be the numeric axis:
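
A minimal sketch of the swap, again using a few rows taken from the trains table above:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# A few rows taken from the trains table above.
trains_small = pd.DataFrame({
    'engine_type': ['Tank', 'Tender', 'Tender', 'Tender', 'Tank'],
    'top_speed_mph': [40, 70, 90, 100, 20],
})

fig, ax = plt.subplots()
# Numeric variable on x, string variable on y: the bars become horizontal.
sns.barplot(data=trains_small, x='top_speed_mph', y='engine_type', ax=ax)
```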

#### Exercise
Create a (vertical) bar plot using the **sales** data, showing how house prices vary with the value of the property grade.

In [None]:
sns.barplot(data=sales, x='grade', y='price')

Bar plots are often deplored as a way of showing statistical estimates, as only the top of the bar is really important, and the bar itself is a visual distraction. A point plot is an alternative, and plots like box plots can show more information. Several other plot types also show distributional information within categories.



#### Exercise

Reproduce the plot you just made, using instead each of the Seaborn functions:

* pointplot()
* boxplot()
* violinplot() (try the scale parameter, which is called density_norm from Seaborn 0.13 onwards)
* boxenplot()
* stripplot() [SEE WARNING] (try the jitter parameter)
* swarmplot() [SEE WARNING]

Note what sort of information about the distribution is shown by each.

WARNING: stripplot() and swarmplot() will plot individual data points. There are too many house sales to easily display in this way - you may want to subsample the dataframe with e.g.  data=sales.sample(100).
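
The subsampling step can be tried on a synthetic stand-in for the sales table. The data below are randomly generated for illustration only, and the random_state argument is an optional addition that makes the subsample reproducible.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic stand-in for the sales table (made-up grades and prices).
rng = np.random.default_rng(0)
fake_sales = pd.DataFrame({
    'grade': rng.integers(5, 10, size=5000),
    'price': rng.lognormal(mean=13, sigma=0.5, size=5000),
})

# Draw 100 rows at random; random_state makes the draw repeatable.
subsample = fake_sales.sample(100, random_state=1)

fig, ax = plt.subplots()
sns.stripplot(data=subsample, x='grade', y='price', jitter=0.2, ax=ax)
```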

In [None]:
sns.swarmplot(data=sales.sample(100), x='grade', y='price')

In [None]:
sns.barplot(data=sales, x='price', y='grade', orient='h')

In [None]:
sales_fixed = sales.copy()
sales_fixed['grade'] = sales['grade'].astype('category')

sns.barplot(data=sales_fixed, x='price', y='grade')

### Hue
Many Seaborn plotting functions take a hue parameter. This colours the plot elements by some categorical variable, but more than this, summary statistics are calculated for each level of the hue variable.
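
A small sketch of what "summary statistics for each level" means, using a few rows taken from the trains table above. The groupby computes one mean per combination of x category and hue level, and the bar plot draws one bar for each of them.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# A few rows taken from the trains table above.
trains_small = pd.DataFrame({
    'engine_type': ['Tank', 'Tank', 'Tender', 'Tender', 'Tender', 'Tender'],
    'colour': ['Blue', 'Green', 'Blue', 'Green', 'Blue', 'Green'],
    'top_speed_mph': [40, 40, 70, 90, 100, 85],
})

# One mean per combination of x category and hue level
per_level = trains_small.groupby(['engine_type', 'colour'])['top_speed_mph'].mean()
print(per_level)

# The hue parameter draws one bar per combination.
fig, ax = plt.subplots()
sns.barplot(data=trains_small, x='engine_type', y='top_speed_mph',
            hue='colour', ax=ax)
```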

In [None]:
# It appears that my hypothesis that more wheels make you faster is flawed
sns.lmplot(data=trains, x='wheels', y='top_speed_mph', 
           hue='engine_type', palette=['red', 'blue'])

#### Exercises
* Create an lmplot of house price against living area. Do this without a hue parameter, then add in waterfront as the hue parameter. What information is the hue giving in this case?
* Try adding the hue parameter to one of your previous plots of some other type - for instance, a box plot.

In [None]:
g = sns.lmplot(data=sales, x='sqft_living', y='price', hue='waterfront')
g.savefig('limplot.png')

In [None]:
sns.barplot(data=sales, x='grade', y='price', hue='waterfront')

#### Compound plots

Seaborn has some plotting functions which create more complex figures made of multiple subplots. These include pairplot(), catplot(), jointplot(), lmplot() and clustermap(). Let's see a few examples:

In [None]:
# jointplot shows a scatter or density plot, with marginal distributions
sns.jointplot(data=sales, x='sqft_living', y='price') #, kind='reg')

In [None]:

# pairplot shows pairwise relationships between variables
# Note that a variable like engine_type would be ignored as it is not numeric
sns.pairplot(data=trains[['wheels', 'top_speed_mph', 'weight_tons']])

# TODO: maybe demo with sales

In [None]:
# catplot conditions different subplots on different variable values
# we map variables to row and column of a grid of plots (as well as to hue)
# in this example, we just use columns, and so have only one row
sns.catplot(data=trains, kind='bar',
               x='size', y='top_speed_mph', 
               col='engine_type')

#### Exercise
Design a plot using sns.catplot to show the relationship between house price and (at least): grade, waterfront, and view. Available channels of information are:

* x and y coordinates
* hue
* row and column of subplot (row and col)

You do not have to use all of these channels - in fact your plot may be difficult to take in if you do.

You can set the kind parameter to the kind of plot you want to make: point, bar, count, box, violin, and strip.

You can control the size of the overall figure with height and aspect (height was called size before Seaborn 0.9).
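
A sketch of how the figure size is determined, using a few rows taken from the trains table above. It assumes Seaborn 0.11.2 or later for the g.figure attribute, and it uses the parameter name height, which replaced size in Seaborn 0.9. Each subplot is height inches tall and height * aspect inches wide; the legend is switched off here so that nothing else changes the figure size.

```python
import pandas as pd
import seaborn as sns

# A few rows taken from the trains table above.
trains_small = pd.DataFrame({
    'engine_type': ['Tank', 'Tank', 'Tender', 'Tender', 'Tender', 'Tender'],
    'colour': ['Blue', 'Green', 'Blue', 'Green', 'Blue', 'Green'],
    'top_speed_mph': [40, 40, 70, 90, 100, 85],
})

# Two columns of subplots, each 3 inches tall and 3 * 1.5 inches wide.
g = sns.catplot(data=trains_small, kind='bar',
                x='colour', y='top_speed_mph', col='engine_type',
                height=3, aspect=1.5, legend=False)

fig_width, fig_height = g.figure.get_size_inches()
print(fig_width, fig_height)
```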

In [None]:
# One option
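# Extra keyword arguments such as scale are passed straight through to violinplot()
# (scale is called density_norm from Seaborn 0.13 onwards);
# the subplot size is set with height (named size before Seaborn 0.9)
sns.catplot(data=sales, y='price', x='grade', row='view', hue='waterfront', kind='violin',
            scale='width', height=2, aspect=3)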