# Bar Plots with Bokeh

In [1]:
import pandas as pd
pd.set_option('display.max_columns', 8)

In [2]:
from bokeh.io import show, output_notebook
from bokeh.plotting import figure
output_notebook();

## Student Academics Performance Dataset

Reading an `arff` file.

In [3]:
from io import StringIO
import urllib.request
from scipy.io.arff import loadarff

In [4]:
stAcademic_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00467/Sapfile1.arff"
resp = urllib.request.urlopen(stAcademic_url)

In [5]:
data, meta = loadarff(StringIO(resp.read().decode('utf-8')))

`data` contains the records and `meta` contains the metadata (attribute names, types, and allowed values).

In [6]:
meta

Dataset: Sapfile1
	ge's type is nominal, range is ('M', 'F')
	cst's type is nominal, range is ('G', 'ST', 'SC', 'OBC', 'MOBC')
	tnp's type is nominal, range is ('Best', 'Vg', 'Good', 'Pass', 'Fail')
	twp's type is nominal, range is ('Best', 'Vg', 'Good', 'Pass', 'Fail')
	iap's type is nominal, range is ('Best', 'Vg', 'Good', 'Pass', 'Fail')
	esp's type is nominal, range is ('Best', 'Vg', 'Good', 'Pass', 'Fail')
	arr's type is nominal, range is ('Y', 'N')
	ms's type is nominal, range is ('Married', 'Unmarried')
	ls's type is nominal, range is ('T', 'V')
	as's type is nominal, range is ('Free', 'Paid')
	fmi's type is nominal, range is ('Vh', 'High', 'Am', 'Medium', 'Low')
	fs's type is nominal, range is ('Large', 'Average', 'Small')
	fq's type is nominal, range is ('Il', 'Um', '10', '12', 'Degree', 'Pg')
	mq's type is nominal, range is ('Il', 'Um', '10', '12', 'Degree', 'Pg')
	fo's type is nominal, range is ('Service', 'Business', 'Retired', 'Farmer', 'Others')
	mo's type is nominal, ran

In [7]:
# meta.names() is the public way to get the attribute names
columns_name = meta.names()
df = pd.DataFrame(data, columns=columns_name)
df.head(3)

| | ge | cst | tnp | twp | ... | ss | me | tt | atd |
|---|---|---|---|---|---|---|---|---|---|
| 0 | b'F' | b'G' | b'Good' | b'Good' | ... | b'Govt' | b'Asm' | b'Small' | b'Good' |
| 1 | b'M' | b'OBC' | b'Vg' | b'Vg' | ... | b'Govt' | b'Asm' | b'Average' | b'Average' |
| 2 | b'F' | b'OBC' | b'Good' | b'Good' | ... | b'Govt' | b'Asm' | b'Large' | b'Good' |


The nominal columns are read as byte strings stored in object columns; for instance, instead of `F`, we have `b'F'`. To solve this problem, we loop over the object columns and decode them as UTF-8.

Decoding the object columns

In [8]:
# decoding the object columns
for c in df.columns:
    if df[c].dtype == 'object':
        df[c] = df[c].str.decode('UTF-8')
df.head()

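| | ge | cst | tnp | twp | ... | ss | me | tt | atd |
|---|---|---|---|---|---|---|---|---|---|
| 0 | F | G | Good | Good | ... | Govt | Asm | Small | Good |
| 1 | M | OBC | Vg | Vg | ... | Govt | Asm | Average | Average |
| 2 | F | OBC | Good | Good | ... | Govt | Asm | Large | Good |
| 3 | M | MOBC | Pass | Good | ... | Govt | Asm | Average | Average |
| 4 | M | G | Good | Good | ... | Private | Asm | Small | Good |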
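
The decoding step can be tried on a tiny frame without downloading the dataset. The sketch below uses a hypothetical two-row frame (the `score` column is invented for illustration) and decodes only the columns whose values are all `bytes`, which is a slightly stricter check than testing for the `object` dtype:

```python
import pandas as pd

# Hypothetical frame mimicking loadarff output:
# nominal values arrive as bytes, numeric values as floats.
raw = pd.DataFrame({
    "ge": [b"F", b"M"],
    "fo": [b"Service", b"Farmer"],
    "score": [1.0, 2.0],  # invented numeric column, not in the dataset
})

decoded = raw.copy()
for col in decoded.columns:
    values = decoded[col]
    # Decode only columns whose entries are all bytes objects.
    if values.map(lambda v: isinstance(v, bytes)).all():
        decoded[col] = values.str.decode("utf-8")

print(decoded["ge"].tolist())  # ['F', 'M']
```

Numeric columns are left untouched because none of their entries are `bytes`.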


## Bar Charts

Let's work with the variable: 
- `fo`: Father Occupation (Service, Business, Retired, Farmer, Others)  

In [9]:
fo = df.fo.value_counts()
fo

Service     38
Business    34
Others      29
Farmer      27
Retired      3
Name: fo, dtype: int64

### Vertical Bar Chart

Vertical bar charts can be drawn using the `vbar()` method. We will use a `ColumnDataSource` object to feed the data to the graph.

The `ColumnDataSource` object provides the data to the glyphs of your graph. It offers advanced capabilities, such as sharing data between plots, filtering, etc.
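
As a minimal sketch of how a `ColumnDataSource` stores named columns (the values and column names below are illustrative, not read from the dataset):

```python
from bokeh.models import ColumnDataSource

# Illustrative data; the column names are arbitrary choices.
source = ColumnDataSource(data=dict(
    values=["Service", "Business", "Retired"],
    counts=[38, 34, 3],
))

# Columns can be added later, as long as they match the existing length.
source.data["color"] = ["#31a354", "#a1d99b", "#e5f5e0"]

# Each key is a named column that glyph arguments can refer to by name.
print(sorted(source.column_names))  # ['color', 'counts', 'values']
```

Glyph methods such as `vbar()` then receive column names as strings (for example `x='values'`) together with `source=source`.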

In [10]:
from bokeh.models   import ColumnDataSource
from bokeh.palettes import Greens, Spectral

Defining sourcef with the father data

In [11]:
# Defining sourcef with the father data
sourcef = ColumnDataSource(data=dict(values=list(fo.index), counts=fo, color=Greens[len(fo)]))

Set the `x_range` to the list of categories

In [12]:
# Set the x_range to the list of categories
pf_v = figure(x_range=list(fo.index), height=350, title="Father Occupation")

In [13]:
pf_v.vbar(x='values', top='counts', width=0.8, source=sourcef)
show(pf_v)

Adding `color` and the legend 

In [14]:
pf_v.vbar(x='values', top='counts', width=0.8, color='color', legend_field='values', source=sourcef)
show(pf_v)
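
One detail worth knowing: each glyph call adds a new renderer to the figure rather than replacing the previous one, so calling `vbar()` twice on the same figure draws two sets of bars on top of each other. The sketch below, with illustrative data, counts the renderers to make this visible:

```python
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure

cats = ["Service", "Business", "Retired"]
source = ColumnDataSource(data=dict(values=cats, counts=[38, 34, 3]))

p = figure(x_range=cats, height=350, title="Renderer count demo")
before = len(p.renderers)

p.vbar(x="values", top="counts", width=0.8, source=source)
p.vbar(x="values", top="counts", width=0.8, color="green", source=source)

# Each glyph call appends a renderer instead of replacing the previous one.
added = len(p.renderers) - before
print(added)  # 2
```

If only the colored version is wanted, creating a fresh `figure()` before the second call avoids the overdraw.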

Printing `sourcef.data`:

In [15]:
print(sourcef.data)

{'values': ['Service', 'Business', 'Others', 'Farmer', 'Retired'], 'counts': Service     38
Business    34
Others      29
Farmer      27
Retired      3
Name: fo, dtype: int64, 'color': ('#006d2c', '#31a354', '#74c476', '#bae4b3', '#edf8e9')}


The `Greens` palette is a dictionary that maps a number of colors to a tuple of hex color strings of that length:

In [16]:
Greens.keys()

dict_keys([3, 4, 5, 6, 7, 8, 9, 256])

In [17]:
Greens[3]

('#31a354', '#a1d99b', '#e5f5e0')

In [18]:
Greens[8]

('#005a32',
 '#238b45',
 '#41ab5d',
 '#74c476',
 '#a1d99b',
 '#c7e9c0',
 '#e5f5e0',
 '#f7fcf5')
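
Since the smallest key is 3, an expression like `Greens[len(fo)]` raises a `KeyError` when there are fewer than three categories. One possible workaround is a small helper that picks the smallest palette that is large enough and truncates it. The helper name `colors_for` is hypothetical, and the dictionary below holds only the two entries printed above rather than the full Bokeh palette:

```python
# Two entries copied from the outputs above; the real Greens
# dictionary in bokeh.palettes has more sizes.
GREENS = {
    3: ('#31a354', '#a1d99b', '#e5f5e0'),
    8: ('#005a32', '#238b45', '#41ab5d', '#74c476',
        '#a1d99b', '#c7e9c0', '#e5f5e0', '#f7fcf5'),
}

def colors_for(n, palettes):
    """Return n colors from the smallest palette holding at least n."""
    sizes = sorted(k for k in palettes if k >= n)
    if not sizes:
        raise ValueError(f"no palette with at least {n} colors")
    return palettes[sizes[0]][:n]

print(colors_for(2, GREENS))  # ('#31a354', '#a1d99b')
print(len(colors_for(5, GREENS)))  # 5
```

The same function should work when passed the real `Greens` or `Spectral` dictionaries, since it relies only on integer keys and tuple values.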

### Horizontal Bar Chart

The `hbar()` method is used for plotting horizontal bar charts.

Let's work with the variable: 
- `mo`: Mother Occupation (Service, Business, Retired, Housewife, Others)  

In [19]:
mo = df.mo.value_counts()
mo

Housewife    115
Service       12
Others         2
Business       1
Retired        1
Name: mo, dtype: int64

Defining `sourcem` with the mother data

In [20]:
# Defining sourcem with the mother data
sourcem = ColumnDataSource(data=dict(values=list(mo.index), counts=mo, color=Spectral[len(mo)]))

Set the `y_range` to the list of categories

In [21]:
# Set the y_range to the list of categories
pm_h = figure(y_range=list(mo.index), height=400, title="Mother Occupation")

In [22]:
pm_h.hbar(y='values', right='counts', color='color', height=0.8, legend_field='values', source=sourcem)
show(pm_h)

In [23]:
# Spectral palette colors
Spectral.keys()

dict_keys([3, 4, 5, 6, 7, 8, 9, 10, 11])

## Stacked Bars

A stacked bar chart is a variation of a bar chart that splits each bar by an additional variable. Suppose you want to plot the father's and mother's occupations together.

The variables `fo` (father occupation) and `mo` (mother occupation) take almost the same values. The only exception is that `fo` has 'Farmer' where `mo` has 'Housewife'. We will replace 'Housewife' with 'Farmer' in `mo` so that both variables have exactly the same labels.

In [24]:
df.mo.value_counts()

Housewife    115
Service       12
Others         2
Business       1
Retired        1
Name: mo, dtype: int64

Replacing `Housewife` with `Farmer` in the `mo` column (mother occupation)

In [25]:
# Replacing 'Housewife' with 'Farmer' in the mo column (mother occupation)
df.loc[df.mo == 'Housewife', 'mo'] = 'Farmer'
df.mo.value_counts()

Farmer      115
Service      12
Others        2
Business      1
Retired       1
Name: mo, dtype: int64

Getting the occupations. Notice that we sort them in alphabetical order.

In [26]:
# Getting the occupations, sorted in alphabetical order
occupations = df.mo.value_counts().sort_index().index.tolist()
occupations

['Business', 'Farmer', 'Others', 'Retired', 'Service']

In [27]:
# Getting the father occupations sorted by alphabetical order
df.fo.value_counts().sort_index()

Business    34
Farmer      27
Others      29
Retired      3
Service     38
Name: fo, dtype: int64

Getting the sorted values of the occupation of the father and mother

In [28]:
fo_val = list(df.fo.value_counts().sort_index())
mo_val = list(df.mo.value_counts().sort_index())
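
Sorting both series by index lines the counts up only because, after the relabeling, both columns contain exactly the same five labels. As an alternative sketch (not the approach used in this notebook), `reindex` can align two count series on a shared label list and fill missing labels with 0, which keeps 'Housewife' as its own category. The counts are copied from the outputs above:

```python
import pandas as pd

# Counts taken from the outputs above, before 'Housewife' is relabeled.
fo = pd.Series({"Service": 38, "Business": 34, "Others": 29,
                "Farmer": 27, "Retired": 3})
mo = pd.Series({"Housewife": 115, "Service": 12, "Others": 2,
                "Business": 1, "Retired": 1})

# Union of the labels, sorted alphabetically.
labels = sorted(set(fo.index) | set(mo.index))

# reindex puts both series in the same order and fills missing labels with 0.
fo_val = fo.reindex(labels, fill_value=0).tolist()
mo_val = mo.reindex(labels, fill_value=0).tolist()

print(labels)  # ['Business', 'Farmer', 'Housewife', 'Others', 'Retired', 'Service']
print(fo_val)  # [34, 27, 0, 29, 3, 38]
print(mo_val)  # [1, 0, 115, 2, 1, 12]
```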

### Vertical Stacked Bars

Defining the data. Note that `fo_val` and `mo_val` must follow the same label order as `occupations`.

In [29]:
# Defining the data. Both lists must follow the same label order as `occupations`
datap = {'occupations': occupations, 
        'Father': fo_val,
        'Mother': mo_val}

In [30]:
# Defining the stack names; each one must be a key of datap
parent = ['Father', 'Mother']

Defining an empty figure

In [31]:
p_v = figure(x_range=occupations, height=350, title="Parents' Occupations",
           toolbar_location=None, tools="hover", tooltips="($name) @occupations: @$name")

In [32]:
p_v.vbar_stack(parent, x='occupations', width=0.8, color=['skyblue','salmon'], 
                source=datap, legend_label=parent)
p_v.y_range.start = 0
p_v.x_range.range_padding = 0.1
p_v.legend.location = "top_right"
p_v.legend.orientation = "vertical"

show(p_v)
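
To see what stacking means numerically, the bar edges can be worked out by hand: the first layer runs from 0 to its own value, and the second layer starts where the first one ends. This plain-Python sketch uses the counts from above and only illustrates the arithmetic; it is not Bokeh's internal code:

```python
occupations = ['Business', 'Farmer', 'Others', 'Retired', 'Service']
father = [34, 27, 29, 3, 38]
mother = [1, 115, 2, 1, 12]

# First layer: bars start at zero and end at the father counts.
father_bottom = [0] * len(occupations)
father_top = father

# Second layer: bars start where the first layer ends.
mother_bottom = father_top
mother_top = [f + m for f, m in zip(father, mother)]

print(mother_top)  # [35, 142, 31, 4, 50]
```

The total bar height for each occupation is therefore the sum of both counts, for example 27 + 115 = 142 for 'Farmer'.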

In [33]:
# The legend orientation can be horizontal
p_v.vbar_stack(parent, x='occupations', width=0.8, color=['steelblue','coral'], 
                source=datap, legend_label=parent)
p_v.y_range.start = 0
p_v.x_range.range_padding = 0.1
p_v.legend.location = "top_right"
p_v.legend.orientation = "horizontal"

show(p_v)

### Horizontal Stacked Bars

In [34]:
p_h = figure(y_range=occupations, width=600, height=400 , title="Parents' Occupations",
           toolbar_location=None, tools="hover", tooltips="($name) @occupations: @$name")

In [35]:
p_h.hbar_stack(parent, y='occupations', height=0.9, color=['skyblue','salmon' ], 
        source=datap, legend_label=parent)
p_h.x_range.start = 0
p_h.y_range.range_padding = 0.1
p_h.legend.location = "top_right"
p_h.legend.orientation = "vertical"

show(p_h)

In [36]:
# The legend orientation can be horizontal. Let's locate it at the center of the graph.
p_h.hbar_stack(parent, y='occupations', height=0.9, color=['steelblue','coral' ], 
        source=datap, legend_label=parent)
p_h.x_range.start = 0
p_h.y_range.range_padding = 0.1
p_h.legend.location = "center"
p_h.legend.orientation = "horizontal"

show(p_h)

### Grouping

In [37]:
from bokeh.models import FactorRange
from bokeh.transform import factor_cmap

In [38]:
# this creates [ ("Business", "Father"), ("Business", "Mother"), ("Farmer", "Father"), ...]
X_val = [ (occ, par) for occ in occupations for par in parent ]
X_val

[('Business', 'Father'),
 ('Business', 'Mother'),
 ('Farmer', 'Father'),
 ('Farmer', 'Mother'),
 ('Others', 'Father'),
 ('Others', 'Mother'),
 ('Retired', 'Father'),
 ('Retired', 'Mother'),
 ('Service', 'Father'),
 ('Service', 'Mother')]

In [39]:
counts = sum(zip(datap['Father'], datap['Mother']), ()) 
counts

(34, 1, 27, 115, 29, 2, 3, 1, 38, 12)
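
The `sum(zip(...), ())` idiom interleaves the two lists so that each count lands next to its `(occupation, parent)` pair. A possibly more readable equivalent uses `itertools.chain`; the sketch below rebuilds the values from above and checks one pairing:

```python
from itertools import chain

occupations = ['Business', 'Farmer', 'Others', 'Retired', 'Service']
parent = ['Father', 'Mother']
father = [34, 27, 29, 3, 38]
mother = [1, 115, 2, 1, 12]

X_val = [(occ, par) for occ in occupations for par in parent]

# zip pairs father/mother counts per occupation; chain flattens the pairs.
counts = tuple(chain.from_iterable(zip(father, mother)))
print(counts)  # (34, 1, 27, 115, 29, 2, 3, 1, 38, 12)

# Each factor tuple now lines up with its count.
paired = dict(zip(X_val, counts))
print(paired[('Farmer', 'Mother')])  # 115
```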

In [40]:
sourceg = ColumnDataSource(data=dict(x=X_val, counts=counts))

In [41]:
pg = figure(x_range=FactorRange(*X_val), width=700, height=350, title="Parents' Occupations",
            toolbar_location=None, tools="hover", tooltips="@counts")

In [42]:
pg.vbar(x='x', top='counts', width=0.9, source=sourceg, line_color="white",
       fill_color=factor_cmap('x', palette=['silver','pink'], 
       factors=parent, start=1, end=2))
pg.y_range.start = 0
pg.x_range.range_padding = 0.1
pg.xaxis.major_label_orientation = 1
pg.xgrid.grid_line_color = None

show(pg)
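
The `start=1, end=2` arguments tell `factor_cmap` to look only at the second level of each nested factor, so bars are colored by parent rather than by occupation. The following plain-Python sketch imitates that idea with a tuple slice; the `color_of` helper is hypothetical and is not how Bokeh implements the mapping:

```python
parent = ['Father', 'Mother']
palette = ['silver', 'pink']
start, end = 1, 2

def color_of(factor):
    """Pick a color from the sub-factor selected by start and end."""
    key = factor[start:end]  # ('Business', 'Mother') -> ('Mother',)
    return palette[parent.index(key[0])]

print(color_of(('Business', 'Father')))  # silver
print(color_of(('Farmer', 'Mother')))    # pink
```

With `start=0, end=1` the slice would select the occupation instead, and `factors` would then need to list the occupations.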

## References

- Hussain S, Dahan N.A, Ba-Alwi F.M, Ribata N. Educational Data Mining and Analysis of Students' Academic Performance Using WEKA. Indonesian Journal of Electrical Engineering and Computer Science. 2018; Vol. 9, No. 2. February. pp. 447-459
- https://docs.bokeh.org/en/latest/