<img src='../../img/anaconda-logo.png' align='left' style="padding:10px">
<br>
*Copyright Continuum 2012-2016 All Rights Reserved.*

# Bokeh Charts

* Charts provide a high-level interface to basic statistical plotting. 
* Charts can simplify the generation of figures.
* Charts work directly with Pandas DataFrames, so grouping and aggregation are handled automatically.

# Table of Contents
* [Bokeh Charts](#Bokeh-Charts)
* [Learning Objectives:](#Learning-Objectives:)
* [Chart types](#Chart-types)
	* [Scatter](#Scatter)
	* [BoxPlot](#BoxPlot)
	* [Bar](#Bar)
	* [Histogram](#Histogram)
	* [Time Series](#Time-Series)
* [Exercise](#Exercise)


## Learning Objectives:

After completion of this module, learners should be able to:

* generate statistical plots using the high-level Charts interface
* explain the mapping between Pandas DataFrames and Chart options
* plot TimeSeries data

## Set-Up

In [1]:
import pandas as pd
from bokeh.io import output_notebook, show
output_notebook()

# Chart Types

There are several Charts available with Bokeh version 0.11. Here, we'll cover:
* `Scatter`
* `BoxPlot`
* `Bar`
* `Histogram`
* `TimeSeries`

See the [reference documentation](http://bokeh.pydata.org/en/latest/docs/reference/charts.html) for the complete list of Charts available in version 0.11.

## Scatter

The Scatter Chart is a convenient way to plot `x` and `y` data.

Start by loading data into a Pandas DataFrame:

In [2]:
flowers = pd.read_csv('data/iris.csv')
flowers.head()

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


Import the `Scatter` constructor from `bokeh.charts` and use it to create and render the plot.

In [3]:
from bokeh.charts import Scatter

plot = Scatter(flowers, x='petal_length', y='petal_width')

show(plot)

Notice the groupings or "clusters" of points in the previous plot. They might represent different categories of data.

<div class='alert alert-info'>
<img src='img/topics/Essential-Concept.png' align='left' style='padding:10px'>
<br><big><big>
Scatter can perform a groupby based on a column using <tt>color=</tt> or <tt>marker=</tt>.
</big></big>
<br><br>
</div>

In [4]:
# Note the species column
flowers.head()

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


In [5]:
flowers.species.unique()

array(['setosa', 'versicolor', 'virginica'], dtype=object)
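Conceptually, grouping on a column splits the rows into one subset per unique value. The following is a minimal pandas sketch of that idea, not Bokeh's actual implementation; `flowers_toy` and its values are hypothetical stand-ins for the iris data.

```python
import pandas as pd

# Hypothetical toy rows; the real iris data has many more rows per species.
flowers_toy = pd.DataFrame({
    'petal_length': [1.4, 1.3, 4.5, 4.7, 6.0, 5.1],
    'petal_width':  [0.2, 0.2, 1.5, 1.4, 2.5, 1.9],
    'species': ['setosa', 'setosa', 'versicolor',
                'versicolor', 'virginica', 'virginica'],
})

# One subset of rows per unique species value. A grouped chart would
# draw each subset with its own color (or marker).
subsets = {name: rows for name, rows in flowers_toy.groupby('species')}

for name, rows in subsets.items():
    print(name, len(rows), 'rows')
```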

The Bokeh Scatter chart lets you visualize differences based on a categorical column.

In [9]:
from bokeh.charts import Scatter

plot = Scatter(flowers, x='petal_length', y='petal_width',
               color='species',
               legend='top_left', 
               title='Flower Morphology')
show(plot)

The glyph in a Scatter chart can be changed with the `marker` keyword. 

In [10]:
dir(plot)

['__cached_all__overridden_defaults__',
 '__cached_all__properties__',
 '__cached_all__properties_with_refs__',
 '__class__',
 '__container_props__',
 '__dataspecs__',
 '__delattr__',
 '__deprecated_attributes__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__overridden_defaults__',
 '__properties__',
 '__properties_with_refs__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__repr_html__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__subtype__',
 '__view_model__',
 '__weakref__',
 '_active_drag',
 '_active_scroll',
 '_active_tap',
 '_attach_document',
 '_axis',
 '_builders',
 '_built',
 '_callbacks',
 '_check_colon_in_category_label',
 '_check_missing_renderers',
 '_check_no_data_renderers',
 '_check_required_range',
 '_check_snapped_toolbar_and_axis',
 '_clone',
 '_defaults',
 '_detach_document',
 '_document',
 

In [11]:
plot = Scatter(flowers, x='petal_length', y='petal_width', 
               color='species', 
               marker='triangle',
               legend='top_left', 
               title='Flower Morphology')
show(plot)

*Note: The full list of available markers for Scatter plots is available through the `bokeh.models.markers` module.*

## BoxPlot (aka "Whisker Diagram")

Box plots are a standardized way of displaying the distribution of data based on a five-number summary:
* minimum
* first quartile
* median
* third quartile
* maximum

Visual symbols:

* The box spans the first through third quartiles (25th-75th percentiles, aka the "interquartile range" or IQR)
* The middle line of the box is the median (50th percentile)
* The "whiskers" extend to the minimum and maximum values of the data or, if outliers are present, to the outlier cutoffs defined in the Outliers section below
* Red dots represent outliers, when present.

The `BoxPlot` chart does the work of determining the quartiles, median, and outliers.
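As a rough illustration of the five-number summary, here is a small pandas sketch. The values in `widths` are hypothetical toy numbers, and the quantile interpolation is pandas' default, which is not necessarily what `BoxPlot` uses internally.

```python
import pandas as pd

# Hypothetical toy values, not taken from the iris data.
widths = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=float)

# Minimum, first quartile, median, third quartile, maximum.
summary = widths.quantile([0.0, 0.25, 0.5, 0.75, 1.0])
summary.index = ['min', 'q1', 'median', 'q3', 'max']

# The interquartile range is the height of the box.
iqr = summary['q3'] - summary['q1']

print(summary)
print('IQR:', iqr)
```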

### Outliers

[John Tukey](https://en.wikipedia.org/wiki/John_Tukey) defined two types of outliers:
* Outliers lie more than 3 IQR above the third quartile or below the first quartile.
* Suspected outliers lie more than 1.5 IQR above the third quartile or below the first quartile.

Checking the [boxplot.py source code on GitHub](https://github.com/bokeh/bokeh/blob/master/examples/plotting/file/boxplot.py) reveals that Bokeh uses the "suspected outliers" definition:

```python
upper = q3 + 1.5*iqr
lower = q1 - 1.5*iqr
```

*Note: Outliers can be turned off in Bokeh using the optional input parameter `outliers=False`*
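To make the 1.5 IQR rule concrete, here is a minimal standard-library sketch. The `data` values are hypothetical, and the `'inclusive'` quantile method is an assumption chosen for illustration; Bokeh may compute quartiles differently.

```python
from statistics import quantiles

# Hypothetical toy values with one obviously extreme point.
data = [2, 3, 3, 4, 4, 5, 5, 6, 20]

# Quartile cut points (first quartile, median, third quartile).
q1, q2, q3 = quantiles(data, n=4, method='inclusive')
iqr = q3 - q1

# Cutoffs for "suspected outliers".
upper = q3 + 1.5 * iqr
lower = q1 - 1.5 * iqr

# Points beyond either cutoff would be drawn as outlier dots.
outliers = [x for x in data if x < lower or x > upper]
print(lower, upper, outliers)
```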

<div class='alert alert-info'>
<img src='img/topics/Essential-Concept.png' align='left' style='padding:10px'>
<br><big><big>
Box Plots:
<br><tt>label=</tt> on the x-axis
<br><tt>values=</tt> on the y-axis.
</big></big>
<br><br>
</div>

In [12]:
from bokeh.charts import BoxPlot
p = BoxPlot(
    flowers, label='species', values='petal_width',
    xlabel='',
    ylabel='petal width, cm',
    title='Distribution of petal widths',
    color='aqua',
)
show(p)

### BoxPlots and Hierarchical Indexing

Passing a list of column names as the `label` groups the boxes hierarchically, one level per column.

In [13]:
auto = pd.read_csv('data/auto-mpg.csv')

When we plot the data, notice the `x-axis` labels repeat the cylinder numbers as the origin changes.

In [16]:
from bokeh.charts import BoxPlot
p = BoxPlot(
      auto, label=['cyl','origin'], values='mpg', color='origin', legend=False
)

show(p)
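The hierarchical grouping used above can be pictured as a pandas groupby on two columns, which produces a two-level index. This is only a sketch under assumptions: `auto_toy` is hypothetical toy data, and the `origin` labels shown here may not match the values in `data/auto-mpg.csv`.

```python
import pandas as pd

# Hypothetical toy rows standing in for the auto-mpg data.
auto_toy = pd.DataFrame({
    'cyl':    [4, 4, 4, 8, 8],
    'origin': ['Asia', 'US', 'US', 'US', 'US'],
    'mpg':    [30.0, 26.0, 28.0, 14.0, 16.0],
})

# Grouping on a list of columns yields a hierarchical (MultiIndex) result:
# the outer level is 'cyl', the inner level is 'origin'.
groups = auto_toy.groupby(['cyl', 'origin'])['mpg'].median()
print(groups)
```

Each (cyl, origin) pair that occurs in the data corresponds to one box on the x-axis.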

## Bar

Bar charts aggregate the values of one DataFrame column for each unique entry in a label column.

In the call to `Bar()` below, `agg` is the aggregation algorithm. The possible algorithms are:

* `sum`  (default)
* `mean`
* `count`
* `nunique`
* `median`
* `min`
* `max`

In [17]:
from bokeh.charts import Bar
p = Bar( auto, label='yr', values='mpg', 
         agg='median',
         title="Median MPG by YR", 
         legend='top_left'
)
show(p)

<div class='alert alert-info'>
<img src='img/topics/Essential-Concept.png' align='left' style='padding:10px'>
<br><big><big>
Bar Plots:
<br><tt>label=</tt> on the x-axis
<br><tt>values=</tt> on the y-axis.
</big></big>
<br><br>
</div>

Bar charts support groupby aggregations on categorical columns in the input DataFrame using the `group` input parameter:

In [18]:
from bokeh.charts import Bar
p = Bar( auto, label='yr', values='mpg', 
         agg='median', 
         group='origin',
         title="Median MPG by YR, grouped by ORIGIN", 
         legend='top_left'
)
show(p)

<div class='alert alert-info'>
<img src='img/topics/Essential-Concept.png' align='left' style='padding:10px'>
<br><big><big>
Bar Plots:
<br><tt>group=</tt> column to be grouped
<br><br>Performs a groupby aggregation and then plots each group separately.
</big></big>
<br><br>
</div>

This graph is a little easier to read as a `stacked` Bar graph.

In [22]:
from bokeh.charts import Bar
p = Bar( auto, label='yr', values='mpg',
         agg='mean', 
         stack='origin', # Use the stack feature
         title="Mean MPG by YR, stacked by ORIGIN",
         legend='top_left'
)
show(p)
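The difference between `group` and `stack` can be sketched in pandas: both start from the same per-group aggregation, but stacking places each group's bar on top of the previous one. The data and `origin` labels below are hypothetical, and the cumulative-sum view is only an approximation of how a stacked chart is laid out.

```python
import pandas as pd

# Hypothetical toy rows standing in for the auto-mpg data.
auto_toy = pd.DataFrame({
    'yr':     [70, 70, 70, 70, 71, 71],
    'origin': ['US', 'US', 'Asia', 'Asia', 'US', 'Asia'],
    'mpg':    [15.0, 17.0, 24.0, 26.0, 18.0, 30.0],
})

# Mean mpg per (yr, origin): rows are years, columns are origins.
# A grouped chart draws these values side by side.
grouped = auto_toy.groupby(['yr', 'origin'])['mpg'].mean().unstack('origin')
print(grouped)

# A stacked chart draws them on top of each other, so the top of each
# segment is the running total across the origin columns.
stacked_tops = grouped.cumsum(axis=1)
print(stacked_tops)
```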

## Histogram

> "A histogram is a graphical representation of the distribution of numerical data. It is an estimate of the probability distribution of a continuous variable (quantitative variable) and was first introduced by Karl Pearson." -- https://en.wikipedia.org/wiki/Histogram

In [23]:
from bokeh.charts import Histogram

plot = Histogram( auto, values='hp',
                  title="HP Distribution",
                  legend='top_right')
show(plot)

Notice that this looks like several distributions mixed together.
* Let's separate the groups with a groupby aggregation.
* For the `Histogram()` chart, this can be done with the `color` input parameter.
* Data will be binned separately for each unique entry in the chosen `color` column.
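Per-group binning can be sketched with NumPy and pandas as follows. The toy values and the shared bin edges are assumptions made for illustration; Bokeh may choose its bins differently.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows standing in for the auto-mpg data.
auto_toy = pd.DataFrame({
    'cyl': [4, 4, 4, 8, 8, 8],
    'hp':  [60, 70, 110, 140, 160, 200],
})

# Three equal-width bins with edges at 50, 100, 150 and 200.
edges = np.linspace(50, 200, 4)

# Bin the hp values separately for each unique cyl entry.
counts = {
    int(cyl): np.histogram(rows['hp'], bins=edges)[0]
    for cyl, rows in auto_toy.groupby('cyl')
}

for cyl, hist in counts.items():
    print(cyl, hist)
```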

In [24]:
from bokeh.charts import Histogram

plot = Histogram(auto, values='hp', color='cyl',
              title="HP Distribution (color grouped by CYL)",
              legend='top_right')

show(plot)

## Time Series

Time series plots require that at least one column of the DataFrame (or its Index) has `dtype` `datetime64`.
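If dates arrive as plain strings, they can be converted before plotting. A minimal sketch, assuming a hypothetical `raw` DataFrame with made-up prices:

```python
import pandas as pd

# Hypothetical toy data with dates stored as strings.
raw = pd.DataFrame({
    'Date':  ['2010-01-04', '2010-01-05', '2010-01-06'],
    'Close': [10.0, 10.5, 10.2],
})

# Convert the string column to datetime64.
raw['Date'] = pd.to_datetime(raw['Date'])
print(raw['Date'].dtype)

# Optionally make the dates the Index, giving a DatetimeIndex.
indexed = raw.set_index('Date')
print(type(indexed.index).__name__)
```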

In [25]:
from pandas_datareader import data
aapl = data.DataReader('AAPL', 'yahoo', '2010-1-1')
aapl.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1672 entries, 2010-01-04 to 2016-08-23
Data columns (total 6 columns):
Open         1672 non-null float64
High         1672 non-null float64
Low          1672 non-null float64
Close        1672 non-null float64
Volume       1672 non-null int64
Adj Close    1672 non-null float64
dtypes: float64(5), int64(1)
memory usage: 91.4 KB


Here a TimeSeries plot will be generated from stock data:
* Apple
* Microsoft
* IBM

In [26]:
aapl = data.DataReader('AAPL', 'yahoo', '2010-1-1')
msft = data.DataReader('MSFT', 'yahoo', '2010-1-1')
ibm = data.DataReader('IBM', 'yahoo', '2010-1-1')

Let's pull out the `Adj Close` column from each data set, and use the `DatetimeIndex` from the AAPL data as a `Date` column in the combined data:

In [27]:
import pandas as pd

stocks = pd.DataFrame( {'AAPL':aapl['Adj Close'],
                        'MSFT':msft['Adj Close'],
                        'IBM':ibm['Adj Close'],
                        'Date':aapl.index})

stocks.head()

| Date | AAPL | Date | IBM | MSFT |
|---|---|---|---|---|
| 2010-01-04 | 27.990226 | 2010-01-04 | 113.304536 | 25.884104 |
| 2010-01-05 | 28.038618 | 2010-01-05 | 111.935822 | 25.892466 |
| 2010-01-06 | 27.592626 | 2010-01-06 | 111.208683 | 25.733566 |
| 2010-01-07 | 27.541619 | 2010-01-07 | 110.823732 | 25.465944 |
| 2010-01-08 | 27.724725 | 2010-01-08 | 111.935822 | 25.641571 |
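When a DataFrame is built from a dict of Series, pandas aligns the Series on their index labels, which is why the three price columns line up by date. A small sketch with hypothetical values, where one series is missing a day:

```python
import pandas as pd

# Hypothetical toy series; 'b' has no entry for 2010-01-05.
a = pd.Series([1.0, 2.0, 3.0],
              index=pd.to_datetime(['2010-01-04', '2010-01-05', '2010-01-06']))
b = pd.Series([10.0, 30.0],
              index=pd.to_datetime(['2010-01-04', '2010-01-06']))

# Rows are matched by date; missing dates are filled with NaN.
combined = pd.DataFrame({'A': a, 'B': b})
print(combined)
```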


Now use the `TimeSeries()` chart from `bokeh.charts` to plot the data:
* the DataFrame `stocks` is the first input
* the columns to plot on the y-axis are specified as a list `y=[]`
* the x-labels are discovered by Bokeh when it checks the `Index` of the DataFrame and finds that it has `dtype` `datetime64`

In [28]:
from bokeh.charts import TimeSeries

plot = TimeSeries( stocks,
                   y=['AAPL','IBM','MSFT'],
                   legend=True,
                   title='Stocks',
                   ylabel='Close Price')

show(plot)

# Exercise

<img src='img/topics/Exercise.png' align='left' style='padding:10px'>

<a href='./Bokeh_ex_charts.ipynb' class='btn btn-primary btn-lg'>Making Charts</a>

----
<a href='./Bokeh_plotting.ipynb' class='btn btn-primary'>Plotting Interface</a>
