In [2]:

import altair as alt
import pandas as pd

yeast_dataframe = pd.read_csv('YeastLocationData.csv')

input_dropdown = alt.binding_select(options=['CYT','ERL','EXC','ME1','ME2','ME3','MIT','NUC','POX','VAC'])
dropdown_selection = alt.selection_single(fields=['location'], bind=input_dropdown, name='Actual location')
# Base chart for the lower row: the dropdown filters which points are drawn,
# and the x channel is supplied per panel when the row is assembled below.
selectable_location_points_plot = alt.Chart(yeast_dataframe).mark_point().encode(
    y='von Heijni Signal Recognition:Q',
    color='location:N'
).properties(
    width=250,
    height=250
).add_selection(
    dropdown_selection
).transform_filter(
    dropdown_selection)

triple_selectable_location_points_plot = (
    selectable_location_points_plot.encode(x='McG Signal Sequence:Q')
    | selectable_location_points_plot.encode(x='vacuolar and extracellar amino acid composition:Q')
    | selectable_location_points_plot.encode(x='nuclear/non nuclear discriminant:Q')
)


user_selection = alt.selection_interval()

scatter_plot_basis = alt.Chart(yeast_dataframe).mark_point().encode(
    y='von Heijni Signal Recognition:Q',
    color=alt.condition(user_selection, 'location:N', alt.value('lightgray'))
).properties(
    width=250,
    height=250
).add_selection(
    user_selection
)

triple_scatter_plot = (
    scatter_plot_basis.encode(x='McG Signal Sequence:Q')
    | scatter_plot_basis.encode(x='vacuolar and extracellar amino acid composition:Q')
    | scatter_plot_basis.encode(x='nuclear/non nuclear discriminant:Q')
)

bar_plot = alt.Chart(yeast_dataframe).mark_bar().encode(
    y='location:N',
    color='location:N', 
    x='count(location):Q'
).transform_filter(user_selection)

finalplot = triple_scatter_plot & bar_plot & triple_selectable_location_points_plot
finalplot.save('AltairDashboard_YeastProteinLocalisationDataset.html')
finalplot



This data set consists of different types of protein found in common baker's yeast (Latin name Saccharomyces cerevisiae), with points coloured by the location of each protein within the cell. The upper three plots allow points to be selected by clicking on a plot and dragging the mouse pointer to create a highlighted region; the bar chart then shows the relative abundance of proteins from each location within that region. The lower plots allow points to be selected according to their cellular location.

The set was downloaded from the University of California Irvine machine learning data set repository and consists of standard code names for the different protein types together with the results of machine learning analysis of the sequence of amino acids that constitute each protein. The aim is clearly to predict where in the yeast cell a protein is located and thus to infer something about its possible function. Some of the columns in the data table have the same value for every protein and I have ignored these, assuming that they are erroneous. A similar problem is seen in the data for discriminating whether a protein is localised in the nucleus, as demonstrated by the figure on the far right of each triple plot, but only for a small proportion of the proteins. The two figures on the left of the triple plots show how well McGeoch's (McG) and von Heijne's methods detect the signal sequence on the proteins. This is the sequence that targets proteins to the endoplasmic reticulum (ERL), from where they are exported to the cellular membrane (ME3, ME2, ME1) and into the extracellular environment (EXC). Using the mouse pointer to select the high-scoring proteins in the top left panel of the upper triplet of figures demonstrates the high abundance of proteins from these locations and the relatively low abundance of cytosolic and nuclear proteins. These two groups together make up about 60% of the proteins in the data set, as shown in the table below, yet only a handful of them score highly with McGeoch's and von Heijne's methods, which suggests that both methods are very effective.

The rationale for the plots shown here is to let the reader see how well the metrics on the axes separate the proteins according to their true cellular location. The upper plots allow the user to select part of a plot with the mouse pointer and see, in the bar chart, the relative abundance of proteins from different locations within that selection. The lower plots can be adjusted via the drop-down menu to show all the proteins from one particular location in isolation, which provides a complementary view of the data to that in the upper panels.
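What the interval selection and the linked bar chart compute can be sketched outside Altair as a rectangular filter followed by a count per location. The function name, the toy rows and their score values below are hypothetical and are not taken from YeastLocationData.csv; only the column names match the ones used in the dashboard code.

```python
from collections import Counter


def count_locations_in_region(rows, x_field, y_field, x_range, y_range):
    """Count proteins per location inside a rectangular region of a scatter plot."""
    x_lo, x_hi = sorted(x_range)
    y_lo, y_hi = sorted(y_range)
    selected = [
        row["location"]
        for row in rows
        if x_lo <= row[x_field] <= x_hi and y_lo <= row[y_field] <= y_hi
    ]
    return Counter(selected)


X = "McG Signal Sequence"
Y = "von Heijni Signal Recognition"

# Hypothetical toy rows for illustration only (not real data set values).
toy_rows = [
    {"location": "CYT", X: 0.40, Y: 0.45},
    {"location": "NUC", X: 0.45, Y: 0.50},
    {"location": "ME1", X: 0.80, Y: 0.75},
    {"location": "EXC", X: 0.75, Y: 0.70},
]

# Equivalent of dragging a box over the high-scoring corner of the plot.
counts = count_locations_in_region(toy_rows, X, Y, (0.7, 1.0), (0.6, 1.0))
print(counts)
```

In the dashboard itself this work is done in the browser by `transform_filter(user_selection)` combined with the `count(location)` encoding, so a helper like this is only useful for checking a selection offline.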

I would have liked to embed the table below in the dashboard, but haven't yet taken the time to do this. I would also have liked to add a little more annotation, but again have yet to find a satisfactory method for doing this using Altair. The only way I can see to do this at the moment is to edit the .html file directly to add a table.
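Editing the saved .html file can at least be scripted rather than done by hand. The following is a minimal sketch using only the Python standard library: it builds an HTML table and inserts it just before the closing `</body>` tag. The helper names are hypothetical, the `page` string is a stand-in for the saved file's contents, and the sketch assumes the saved dashboard contains a `</body>` tag.

```python
import html


def build_table_html(caption, rows):
    """Return an HTML <table> string for a list of (label, count) rows."""
    body = "\n".join(
        f"  <tr><td>{html.escape(label)}</td><td>{count}</td></tr>"
        for label, count in rows
    )
    return (
        "<table>\n"
        f"  <caption>{html.escape(caption)}</caption>\n"
        "  <tr><th>Location</th><th>Proteins</th></tr>\n"
        f"{body}\n"
        "</table>\n"
    )


def insert_before_body_end(page_html, fragment):
    """Insert fragment before the last </body> tag, or append it if none is found."""
    index = page_html.rfind("</body>")
    if index == -1:
        return page_html + fragment
    return page_html[:index] + fragment + page_html[index:]


# Stand-in for the contents of the saved dashboard file (assumption).
page = '<html><body><div id="vis"></div></body></html>'

table = build_table_html(
    "Relative abundance of the proteins from different locations",
    [("CYT (cytosolic or cytoskeletal)", 463), ("NUC (nuclear)", 429)],
)
patched = insert_before_body_end(page, table)
print(patched)
```

To apply this to the real dashboard, the text of `AltairDashboard_YeastProteinLocalisationDataset.html` would be read in place of `page` and the patched string written back, with all ten rows of the abundance table passed to `build_table_html`.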

Relative abundance of the proteins from different locations. 
  CYT (cytosolic or cytoskeletal)                    463
  NUC (nuclear)                                      429
  MIT (mitochondrial)                                244
  ME3 (membrane protein, no N-terminal signal)       163
  ME2 (membrane protein, uncleaved signal)            51
  ME1 (membrane protein, cleaved signal)              44
  EXC (extracellular)                                 37
  VAC (vacuolar)                                      30
  POX (peroxisomal)                                   20
  ERL (endoplasmic reticulum lumen)                    5
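
The counts above can be turned into shares with a few lines of Python, which is a quick way to check statements about how much of the data set a given location accounts for. The dictionary simply restates the table; the variable names are my own.

```python
# Counts copied from the table above.
location_counts = {
    "CYT": 463, "NUC": 429, "MIT": 244, "ME3": 163, "ME2": 51,
    "ME1": 44, "EXC": 37, "VAC": 30, "POX": 20, "ERL": 5,
}

total = sum(location_counts.values())
shares = {loc: count / total for loc, count in location_counts.items()}

# Print each location's share of the table total as a percentage.
for loc, share in shares.items():
    print(f"{loc}: {share:6.1%}")

# Combined share of cytosolic and nuclear proteins.
cyt_nuc_share = shares["CYT"] + shares["NUC"]
print(f"CYT + NUC: {cyt_nuc_share:.1%}")
```

By these counts the cytosolic and nuclear proteins together account for roughly 60% of the entries in the table.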