# **TEVA Post-Processing**
*Notebook by Ryan van der Heijden  |  rvanderh@uvm.edu*

The goal of this notebook is to make the TEVA model output files easier to interpret. It walks through building an interactive dashboard with the *Bokeh* package so that the CCs and DNFs can be explored without working directly in the raw output files. *Bokeh* produces an HTML dashboard that can hold multiple interactive plots, tables, and other elements, and that can be saved and reopened later without re-processing.

**This notebook is divided into four main sections:**
1.   Import Packages and TEVA Output Files
2.   Process TEVA Output Files
3.   Build *Bokeh* Plot Structures
4.   Create *Bokeh* Dashboard

**The resulting dashboard contains four main components:**
1.   Plot of positive predictive value vs. coverage with fitness contours
2.   Table showing the features in each CC
3.   Plot showing feature usage ("popularity") across all CCs
4.   Plot showing CC usage across all DNFs

# (1) Import Packages and TEVA Output Files
*   Import the base packages and Bokeh modules that will be used to visualize the TEVA output.
*   Set plot dimensions.
*   Import the CC and DNF output files from TEVA.

Note that the TEVA output files are Excel spreadsheets, so we need to specify which sheet to import. Each sheet corresponds to an output class in the algorithm.

Inline plotting is used for this demonstration, but the `output_notebook()` call can be commented out to open the figure in its own window.
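
To keep a copy of the dashboard that can be reopened later, Bokeh can also write a standalone HTML file instead of (or in addition to) rendering inline. The following is a minimal sketch using a small stand-in figure; `demo`, `out_path`, and the file name are placeholders and not part of the TEVA workflow.

```python
import os
import tempfile

from bokeh.io import save
from bokeh.plotting import figure
from bokeh.resources import CDN

# Stand-in figure (hypothetical data); in practice this would be the dashboard layout
demo = figure(width=300, height=200, title="demo")
demo.scatter([0.2, 0.5, 0.8], [0.9, 0.6, 0.3], size=10)

# Write a standalone HTML file that loads BokehJS from the CDN
out_path = os.path.join(tempfile.mkdtemp(), "teva_dashboard_demo.html")
saved_path = save(demo, filename=out_path, resources=CDN, title="TEVA dashboard (demo)")
```

The same `save()` call should work on a `row(...)`/`column(...)` layout, so the finished dashboard could be written out in the same way.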

In [None]:
# Base Packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import colors
import matplotlib.tri as tri

# Bokeh Imports
from bokeh.io import output_notebook
from bokeh.plotting import figure, show
from bokeh.layouts import column, row
from bokeh.models import ColumnDataSource, CDSView, GroupFilter, NumeralTickFormatter, TableColumn, DataTable, Div, TabPanel, Tabs
from bokeh.transform import linear_cmap
from bokeh.models import HoverTool
from bokeh.palettes import varying_alpha_palette, Category20

# Set Bokeh to use inline plotting
output_notebook()

# set up plot dimensions (of main plot)
h = 800     # height in pixels
w = 900     # width in pixels

In [None]:
# Import TEVA output files
ccs = pd.read_excel('ccs_2DOC_CAMELS.xlsx', sheet_name='CCEA_Low')
dnfs = pd.read_excel('dnfs_2DOC_CAMELS.xlsx', sheet_name='DNFEA_Low')

# (2) Process TEVA Output Files
*   Create lists of the features associated with each CC and the CCs associated with each DNF.
*   Create a DataFrame containing feature counts sorted by CC order.
*   Create DataFrame containing CC counts sorted by DNF order.
*   Create custom color maps for plotting CCs, DNFs, and fitness contours.
*   Create fitness contours.
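
The first step above can be illustrated on toy data. This is a sketch under stated assumptions: the tables `toy_ccs` and `toy_dnfs`, their column names, and the `cc_` prefix are invented for illustration, and unused features are represented only by `NaN` (the notebook additionally treats zeros as unused).

```python
import numpy as np
import pandas as pd

# Hypothetical toy CC table: NaN means the feature is not used by that CC
toy_ccs = pd.DataFrame(
    {"feat_a": [0.5, np.nan, 1.2],
     "feat_b": [np.nan, np.nan, 3.0],
     "feat_c": [2.0, 7.0, np.nan]},
    index=["cc_0", "cc_1", "cc_2"],
)

# Hypothetical toy DNF table: 1 marks a CC that belongs to the DNF
toy_dnfs = pd.DataFrame(
    {"cc_0": [1, 0], "cc_1": [1, 1], "cc_2": [0, 1]},
    index=["dnf_0", "dnf_1"],
)

# Features used by each CC: the column names of the non-missing entries
toy_cc_features = [row.dropna().index.tolist() for _, row in toy_ccs.iterrows()]

# CCs used by each DNF: columns flagged with 1, with the 3-character prefix removed
toy_dnf_ccs = [row[row == 1].index.str[3:].tolist() for _, row in toy_dnfs.iterrows()]
```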



In [None]:
# Build a dataframe containing features and ranges for each dnf
'''
Each DNF is composed of several CCs, depending on the order. Each CC contains multiple features.
Build a dataframe that will hold all the CCs and the features they are composed of for each DNF.
'''
# Grab feature names from the CC output
feature_names = list(ccs.columns[12:])

# For each DNF, find the associated CCs and then their variables and ranges
all_ccs = []
for j in range(0, len(dnfs['mask'])):
    item = dnfs.iloc[j].iloc[12:]
    item_ccs = item[item==1].index.values.tolist()
    item_ccs = list(map(lambda i: i[3:], item_ccs))
    all_ccs.append(item_ccs)

# For each CC, collect the names of the features it uses (entries that are neither missing nor zero).
cc_features = []
for j in range(0, len(ccs)):
    # Treat missing values as "feature not used"; assign the result rather than filling in place
    cc_values = ccs.iloc[j].iloc[12:].fillna(0)
    cc_values = dict(cc_values[cc_values != 0])
    # cc_features.append(cc_values)
    cc_features.append(list(cc_values.keys()))

In [None]:
# Count how many times each CC shows up in a DNF
# Function to flatten list of lists
def flatten(xss):
    return [x for xs in xss for x in xs]

all_ccs_flat = np.array(flatten(all_ccs))
unique_ccs = list(np.unique(all_ccs_flat))

# Loop through the unique CCs and count how often each appears across all DNFs
cc_counts = []
for i in range(len(unique_ccs)):
    cc_counts.append(np.count_nonzero(all_ccs_flat==unique_ccs[i]))

In [None]:
# Count how many times each feature shows up in a CC
# Flatten to get all features used by CCs
cc_features_flat = np.array(flatten(cc_features))
unique_features = list(np.unique(cc_features_flat))

# Loop through the unique features and count how often each appears across all CCs
feature_counts = []
for i in range(len(unique_features)):
    feature_counts.append(np.count_nonzero(cc_features_flat==unique_features[i]))
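
The flatten-and-count pattern used in the two cells above can also be written with `collections.Counter`. This is an alternative sketch on toy data, not the notebook's method; `toy_cc_features` is a made-up list of feature lists.

```python
from collections import Counter

# Hypothetical feature lists, one per CC
toy_cc_features = [["feat_a", "feat_c"], ["feat_c"], ["feat_a", "feat_b"]]

# Count every feature across all CCs without building an intermediate flat array
toy_counts = Counter(f for feats in toy_cc_features for f in feats)

# Features ranked from most to least used
toy_ranked = toy_counts.most_common()
```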

In [None]:
#%% Construct stacked feature data
# CC order
cc_len = np.arange(1, max(ccs['order']) + 1 , 1)
cc_col_names = ['Feature']

for j in range(len(cc_len)):
    cc_col_names.append('Order ' + str(j+1))

cc_col_names.append('Total')
feature_order = pd.DataFrame(columns=cc_col_names)
feature_order['Feature'] = unique_features
feature_order['Total'] = feature_counts

# List subset by length
for k in range(len(cc_len)):
    # subset cc_features by order
    subset = [sublist for sublist in cc_features if len(sublist) == cc_len[k]]
    subset = np.array(flatten(subset))

    for i in range(len(unique_features)):
        feature_order.loc[i, 'Order ' + str(k+1)] = np.count_nonzero(subset==unique_features[i])

feature_order.sort_values(by=['Total'], ascending=False, inplace=True)
feature_order.drop(columns=['Total'], inplace=True)
stack_plot_feature = dict(feature_order)
stack_names_feature = cc_col_names[1:-1]
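
The per-order table built above is a cross-tabulation of feature against CC order. A compact sketch with `pandas.crosstab` on toy data is shown below; it assumes, as the cell above does, that a CC's order equals the number of features it uses. The names `toy`, `long_form`, and `toy_table` are illustrative.

```python
import pandas as pd

# Hypothetical feature lists, one per CC
toy = [["a", "c"], ["c"], ["a", "b"]]

# One row per (feature, order of the CC it appears in)
long_form = pd.DataFrame(
    [(f, len(feats)) for feats in toy for f in feats],
    columns=["Feature", "Order"],
)

# Rows are features, columns are CC orders, cells are usage counts
toy_table = pd.crosstab(long_form["Feature"], long_form["Order"])
```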

In [None]:
#%% Construct stacked CC data
# DNF order
dnf_len = np.arange(1, max(dnfs['order']) + 1 , 1)
dnf_col_names = ['CC']

for j in range(len(dnf_len)):
    dnf_col_names.append('Order ' + str(j+1))

dnf_col_names.append('Total')
cc_order = pd.DataFrame(columns=dnf_col_names)
cc_order['CC'] = unique_ccs
cc_order['Total'] = cc_counts

# List subset by length
for k in range(len(dnf_len)):
    # subset by order
    subset = [sublist for sublist in all_ccs if len(sublist) == dnf_len[k]]
    subset = np.array(flatten(subset))

    for i in range(len(unique_ccs)):
        cc_order.loc[i, 'Order ' + str(k+1)] = np.count_nonzero(subset==unique_ccs[i])

cc_order.sort_values(by=['Total'], ascending=False, inplace=True)
cc_order.drop(columns=['Total'], inplace=True)
stack_plot_cc = dict(cc_order)
stack_names_cc = dnf_col_names[1:-1]

In [None]:
# Custom color maps
'''
Create custom colormaps: one for CCs, one for DNFs, and one for fitness contours.
Colormap indices range from 0 to 255, but it is best to trim the lightest and darkest portions out.
'''
cc_colors = []
dnf_colors = []

# CCs and DNFs
for i in range(20,220):
    cc_colors.append(colors.rgb2hex(plt.get_cmap('Blues_r')(i)))
    dnf_colors.append(colors.rgb2hex(plt.get_cmap('Oranges_r')(i)))

# Fitness contours
contour_colors = varying_alpha_palette(color='black', start_alpha=150, end_alpha=10)
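
To see what the trimming does, the sketch below samples the same matplotlib colormap with a single vectorized call and checks how many colors remain. It is illustrative only; `trimmed` is a hypothetical name and the index range 20 to 219 simply mirrors the loop above.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import colors

cmap = plt.get_cmap("Blues_r")
n_entries = cmap.N  # number of entries in the colormap lookup table

# Sample indices 20..219 in one call, then convert each RGBA row to a hex string
trimmed = [colors.to_hex(rgba) for rgba in cmap(np.arange(20, 220))]
```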

In [None]:
# Fitness Contour Interpolation
'''
Since the fitness values are irregularly spaced, we need to set up a grid and interpolate
fitness values within the plot domain in order to plot fitness contours. Use matplotlib's
linear triangular interpolator.
'''
n_grid = min(h, w)  # set grid size to min plot size
x_i = np.linspace(0, 1, n_grid)
y_i = np.linspace(0, 1, n_grid)
xplot, yplot = np.meshgrid(x_i, y_i)
triangles = tri.Triangulation(pd.concat([dnfs['cov'],ccs['cov']]),
                              pd.concat([dnfs['ppv'], ccs['ppv']]))
fitness = pd.concat([dnfs['fitness'], ccs['fitness']])
interpolator = tri.LinearTriInterpolator(triangles, fitness)
z_i = interpolator(xplot, yplot)
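
A small example may help show how the triangular interpolator behaves. In this sketch the four sample points and the function `z = x + y` are made up; the point is that values inside the triangulated region are interpolated linearly, while query points outside it come back masked, which is why contours only cover the region spanned by the CC and DNF points.

```python
import numpy as np
import matplotlib.tri as tri

# Hypothetical scattered samples at the corners of the unit square, with z = x + y
px = np.array([0.0, 1.0, 0.0, 1.0])
py = np.array([0.0, 0.0, 1.0, 1.0])
pz = px + py

toy_tri = tri.Triangulation(px, py)
toy_interp = tri.LinearTriInterpolator(toy_tri, pz)

# One query point inside the triangulation, one outside it
qx = np.array([0.25, 2.0])
qy = np.array([0.50, 2.0])
vals = toy_interp(qx, qy)

inside_value = float(vals[0])                        # linear interpolation of x + y
outside_masked = bool(np.ma.getmaskarray(vals)[1])   # True: no triangle contains the point
```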

# (3) Build *Bokeh* Plot Structures
Create ColumnDataSource structures for CCs, DNFs, and fitness contours.
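
As a minimal illustration of the structure, the sketch below builds a `ColumnDataSource` from a plain dictionary of equal-length columns. The values are made up; the column names mirror those used in this notebook so that glyphs and hover tooltips can refer to them by name (for example `'@x_values'`).

```python
from bokeh.models import ColumnDataSource

# Hypothetical data: each key becomes a named column, and all columns share one length
toy_source = ColumnDataSource(data={
    "x_values": [0.2, 0.5, 0.8],
    "y_values": [0.9, 0.6, 0.3],
    "Order": [1, 2, 2],
})

columns_present = sorted(toy_source.column_names)
n_rows = len(toy_source.data["x_values"])
```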

In [None]:
# Bokeh Data Sources
'''
Bokeh uses a data structure called a "Column Data Source" (CDS) for plotting.
The easiest way to create one is to pass your data as a dictionary.
'''

# column data source for CCs
cc_plot_data = {'x_values': ccs['cov'],
                'y_values': ccs['ppv'],
                'CC': ccs['Unnamed: 0'],
                'Order': ccs['order'],
                'Features': cc_features}
cc_plot_source = ColumnDataSource(data=cc_plot_data)

# column data source for DNFs
dnf_plot_data = {'x_values': dnfs['cov'],
                 'y_values': dnfs['ppv'],
                 'Order': dnfs['order'],
                 'DNF': dnfs['Unnamed: 0'],
                 'CCs': all_ccs}
dnf_plot_source = ColumnDataSource(data=dnf_plot_data)

# column data source for fitness contours
dnf_cont_data = {'x_values': x_i,
                 'y_values': y_i,
                 'z_values': z_i}
dnf_cont_source = ColumnDataSource(dnf_cont_data)

# column data source for CC feature table
table_data = pd.DataFrame(data=cc_features, columns=cc_col_names[1:-1])
table_source = ColumnDataSource(table_data)
columns = []
for j in range(len(cc_len)):
    columns.append(TableColumn(field=table_data.columns[j], title=table_data.columns[j]))

# (4) Create *Bokeh* Dashboard
Create and show the interactive dashboard.

In [None]:
# Bokeh Figure Setup
# tooltips to display when you hover over a data point
dnf_TOOLS = [
    ('DNF #', '@DNF'),
    ('Order', '@Order'),
    ('PPV', '@y_values'),
    ('COV', '@x_values'),
    ('CCs', '@CCs')]

cc_TOOLS = [
    ('CC #', '@CC'),
    ('Order', '@Order'),
    ('PPV', '@y_values'),
    ('COV', '@x_values'),
    ('Features', '@Features')]

#### Figure 1
p1 = figure(width = w, height = h,
           y_range=(0,1.05),
           x_range=(0,1.05),
           x_axis_label='Observation Coverage',
           y_axis_label='Positive Predictive Value',
           hidpi=True,
           tools='crosshair, pan, tap, wheel_zoom, zoom_in, zoom_out, box_zoom, undo, redo, reset, save, lasso_select, help')

cont_levels = np.linspace(min(fitness), max(fitness), 10)
contour_renderer = p1.contour(x_i, y_i, z_i,
                             levels=cont_levels,
                             line_color='gray',
                             fill_color=contour_colors,
                             line_dash='dashed')
# Plot CCs, colored by order
all_cc_plots = []
for i in range(0, len(cc_len)):
    cc_plot = p1.scatter('x_values', 'y_values', source=cc_plot_source,
              view=CDSView(filter=GroupFilter(column_name='Order', group=len(cc_len) - i)),
              size=12,
              marker='square',
              line_color='white',
              fill_color=linear_cmap('Order', cc_colors, low=min(ccs['order']), high=max(ccs['order'])),
              hover_color='black',
              legend_label='CC Order {}'.format(len(cc_len) - i),
              fill_alpha=1)
    all_cc_plots.append(cc_plot)

# Add hover tool for CCs (restricted to the CC renderers so it does not fire on DNFs or contours)
p1.add_tools(HoverTool(renderers=all_cc_plots,
                      tooltips=cc_TOOLS,
                      mode='mouse',
                      point_policy='follow_mouse'))

# Plot DNFs, colored by order
all_dnf_plots = []
for i in range(0, len(dnf_len)):
    dnf_plot = p1.scatter('x_values', 'y_values', source=dnf_plot_source,
              view=CDSView(filter=GroupFilter(column_name='Order', group=len(dnf_len) - i)),
              size=13,
              marker='circle',
              line_color='white',
              fill_color=linear_cmap('Order', dnf_colors, low=min(dnfs['order']), high=max(dnfs['order'])),
              hover_color='black',
              legend_label='DNF Order {}'.format(len(dnf_len) - i),
              fill_alpha=1)
    all_dnf_plots.append(dnf_plot)

# Add separate hover tool for DNFs
p1.add_tools(HoverTool(renderers = all_dnf_plots,
                      tooltips=dnf_TOOLS,
                      mode='mouse',
                      point_policy='follow_mouse'))

# Add color bar for fitness contours
colorbar = contour_renderer.construct_color_bar(height=int(h/2),
                                                location=(0,int(h/4)),
                                                formatter = NumeralTickFormatter(format='0 a'),
                                                bar_line_color='black',
                                                major_tick_line_color='black')

# General formatting
p1.legend.click_policy='hide'
p1.legend.location='bottom_left'
p1.add_layout(colorbar, 'right')
# To dim unselected glyphs, pass nonselection_fill_alpha=0.2 to the p1.scatter() calls above

#### FIGURE 2
p2 = figure(width = w, height = int(h * 0.6),
            x_range=stack_plot_feature['Feature'],
            x_axis_label='Feature',
            y_axis_label='Count',
            hidpi=True,
            tools='crosshair, reset, save, help')

p2.vbar_stack(stack_names_feature,
              x='Feature',
              width=0.6,
              color=Category20[20][:len(cc_len)],  # slice: Category20 has no palette for fewer than 3 colors
              source=stack_plot_feature,
              legend_label=stack_names_feature)

# General formatting
p2.xaxis.major_label_orientation = 1
p2.y_range.start = 0
p2.legend.location = "top_right"
p2.legend.orientation = 'vertical'

tab1 = TabPanel(child=p2, title='CC Feature Usage')

#### FIGURE 3
p3 = figure(width = w, height = int(h * 0.6),
            x_range=stack_plot_cc['CC'],
            x_axis_label='CC',
            y_axis_label='Count',
            hidpi=True,
            tools='crosshair, reset, save, help')

p3.vbar_stack(stack_names_cc,
              x='CC',
              width=0.6,
              color=Category20[20][:len(dnf_len)],  # slice: Category20 has no palette for fewer than 3 colors
              source=stack_plot_cc,
              legend_label=stack_names_cc)

# General formatting
p3.xaxis.major_label_orientation = 1
p3.y_range.start = 0
p3.legend.location = "top_right"
p3.legend.orientation = 'vertical'

tab2 = TabPanel(child=p3, title='DNF CC Usage')

#### Data Table
data_table = DataTable(source = table_source, columns=columns, width=w, height=int(h * 0.4))

#### Show Dashboard
# Text
p1_title = Div(text='PPV vs. COV', width=w, height=20, styles={'font-size': '150%', 'color': 'blue'})
p2_title = Div(text='Feature and CC Usage', width=w, height=20, styles={'font-size': '150%', 'color': 'blue'})
table_title = Div(text='CC Features', width=w, height=20, styles={'font-size': '150%', 'color': 'blue'})

# Construct layout
show(row(column(p1_title, p1), column(table_title, data_table, p2_title, Tabs(tabs=[tab1, tab2]))))