# Making Waterfall Plots with Altair
#### By Alyson Freeman

For this assignment, I'm going to show you how to make and use waterfall plots using the Python library Altair. Waterfall plots are a great way of comparing experimental data across samples; for example, treating a panel of cell lines in the same way and comparing the effect.

The data that I'm going to use is from a screen performed at the Broad Institute. One of the main purposes of the screen was to look across cancer cell lines and identify new drug targets that could have clinical impact. To do this, they used shRNA directed at a gene to decrease the amount of that gene's corresponding protein in the cell. If a cell line's growth decreased due to having less of a certain protein, that protein could be a good target for a drug. Researchers took hundreds of cancer cell lines and used shRNA to knock down hundreds of genes in each cell line. In this way, they could compare the effect of losing expression of each gene across all cell lines. Ideally, a protein would make a good target if it is important not just in one cell line but in a significant patient population.

When working with large datasets in the biological sciences, visualizations are an indispensable way to analyze and communicate the results of experiments. The goal of the following demonstration is to show what waterfall plots are, how they can be used, and how to make them.

### Visualization Technique

Waterfall plots are a type of bar chart with one categorical axis and one numerical axis. The data can be sorted in descending order so that it's easy to see trends in the variable being plotted. An alternative to consider would be a heatmap, where the effect is encoded by color. However, waterfall plots encode the data as length, which is easier to read accurately than differences in color. A heatmap can also be used to plot multiple treatments at the same time, but because a waterfall plot shows one treatment sorted in descending order, it is easier to see the overall effect of that treatment. Waterfall plots are not appropriate when more than one numeric variable needs to be plotted, in which case a scatter plot might be better suited.

In the dataset that I'm using in the example, I'm going to plot the cell lines on the x-axis and the effect of losing expression of a certain gene on the y-axis. The stronger the effect, the lower the y-value will be. For this dataset, a significant effect is set at -1.
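To make the idea concrete before bringing in the real data, here is a minimal sketch in plain Python. The cell line names and scores are made up for illustration, and treating anything below -1 as sensitive is my reading of the cutoff described above.

```python
# Hypothetical gene-effect scores for five made-up cell lines.
gene_effect = {
    "LINE_A": -0.2,
    "LINE_B": -1.6,
    "LINE_C": 0.1,
    "LINE_D": -1.1,
    "LINE_E": -0.7,
}

CUTOFF = -1  # significance cutoff used for this dataset

# Sort cell lines from the highest score to the lowest (descending order),
# which is the left-to-right order of the bars in a waterfall plot.
ranked = sorted(gene_effect.items(), key=lambda item: item[1], reverse=True)

# Cell lines whose score falls below the cutoff count as sensitive.
sensitive = [name for name, score in ranked if score < CUTOFF]

for name, score in ranked:
    flag = "sensitive" if score < CUTOFF else ""
    print(f"{name:8s} {score:5.1f} {flag}")
```

Because the bars are sorted, the sensitive cell lines always end up grouped together at the right-hand end of the plot.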

Example waterfall plot:

![](waterfall_example.jpg)

### Visualization Library

Altair is a declarative visualization library for Python based on the Vega and Vega-Lite visualization grammars. Altair is simple to use yet can make interactive, compelling visualizations, which makes it ideal for data exploration. Altair is open-source and was developed by Jake Vanderplas and Brian Granger with the UW Interactive Data Lab.

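It can be installed, along with the example datasets in vega_datasets, using pip: `pip install altair vega_datasets`

Or using the conda package: `conda install -c conda-forge altair vega_datasets`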

Keep in mind that, by default, Altair raises an error when a dataset has more than 5000 rows, so if your dataset is much larger than that, another package may be preferable. (The limit can be lifted with alt.data_transformers.disable_max_rows(), but the full dataset is then embedded in the chart, which can make notebooks large and slow.) If time permitted, I would have loved to do a side-by-side comparison of Altair and Plotly using the same data to make similar visualizations and learn more about both packages. But for this assignment, the focus will stay on Altair due to its ease of use.

### Demonstration

First, I'm going to import all of the tools that we need for importing, cleaning, manipulating, and visualizing the data.

In [1]:
import pandas as pd
import altair as alt
from urllib.request import urlretrieve
import os

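The data that I'm going to use comes from the DepMap project at the Broad Institute (https://depmap.org/portal/download/). The "avana_gene_effect" data contains the data to be plotted and the "sample_info" data includes the cell line names and lineages. Note that the Avana screen uses CRISPR-Cas9 knockout rather than shRNA knockdown, but the scores are read the same way here: the more negative the value, the stronger the effect of losing the gene. I'll merge the two files together so that we have all of the data that we need in one dataframe.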

These files are large, and it is not necessary to download them if they are already available locally. Therefore, I'll use os.path.exists and only download the data using urlretrieve if the file is not already there. Note: if you are running this in the Coursera Jupyter notebook, it might not work. It did work in my local environment.

In [3]:
# Define the urls for dataset downloads from DepMap
sample_info_url = "https://ndownloader.figshare.com/files/26261569"
avana_gene_effect_url = "https://ndownloader.figshare.com/files/26261293"

# Download data if necessary
if not os.path.exists('sample_info.csv'):
    urlretrieve(sample_info_url, 'sample_info.csv')
if not os.path.exists('avana_gene_effect.csv'):
    urlretrieve(avana_gene_effect_url, 'avana_gene_effect.csv')
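If you have more than a couple of files, the same check can be wrapped in a small helper. This is only a sketch: the name download_if_missing is my own and not part of any library.

```python
import os
from urllib.request import urlretrieve


def download_if_missing(url, filename):
    """Download url to filename unless the file already exists locally.

    Returns True if a download happened and False if it was skipped.
    """
    if os.path.exists(filename):
        return False
    urlretrieve(url, filename)
    return True
```

It would be called as download_if_missing(sample_info_url, 'sample_info.csv'), and the return value tells you whether the file was actually fetched.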

We can use pandas to read the CSV files into dataframes and then merge the two dataframes. I'll also clean up the column names so that each one is just the gene name, which will make it easier to call the visualizations below. In the visualizations, I'm going to color by lineage. The default color scheme in Altair has 10 different colors, so I'm going to check how many lineages I have in the dataset so that I can later define a color scheme.

In [4]:
samples = pd.read_csv('sample_info.csv')
avana = pd.read_csv('avana_gene_effect.csv')

# Merge Avana data with the CCLE sample cell line data
avana_samples = avana.merge(samples, how='outer', on='DepMap_ID')
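# Strip the parenthesized suffix from each column name, leaving just the gene name.
# regex=True is needed because newer pandas versions treat the pattern literally by default.
avana_samples.columns = avana_samples.columns.str.replace(r"\(.*\)", "", regex=True).str.strip()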

lineages = len(pd.unique(avana_samples['lineage']))
lineages

39
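To make the column cleanup above concrete, here is the same regular expression applied with Python's built-in re module. The column names are hypothetical examples that assume the raw gene columns look like a gene name followed by an identifier in parentheses.

```python
import re

# Hypothetical column names in a "gene name (identifier)" style.
raw_columns = ["DepMap_ID", "BRAF (673)", "KRAS (3845)", "lineage"]

# Remove anything in parentheses, then trim the leftover whitespace.
clean_columns = [re.sub(r"\(.*\)", "", name).strip() for name in raw_columns]

print(clean_columns)
```

Columns without parentheses, like DepMap_ID and lineage, pass through unchanged, which is why it is safe to apply the cleanup to every column at once.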

OK, now it's time for some visualizations! 

In the next block, I'm going to define a function that will show a waterfall plot for a gene of interest and I'll explain it in the order of the code. First, I'll make a new dataframe just with the columns that we need.

I want to color the plot by lineage. The default color scheme in Altair only has 10 colors but I have 39 lineages, so I'm going to change the theme to "category20" so that the colors are not repeated so often. Even with 20 colors, some lineages will still share a color. You can find additional color schemes here: https://vega.github.io/vega/docs/schemes/.

Next, I'm going to make a brush so that I can select a section of the waterfall plot to zoom in on, which makes the chart interactive. Then comes the actual waterfall plot that I'm calling "gene_chart". The x-axis will be the cell line names and the y-axis will be the gene effect for gene_name, sorted in descending order. I will turn off the ticks and labels for the x-axis: there's just too much there to be able to read it. Each axis gets a title using alt.Axis, and the chart is colored using alt.condition so that bars inside the brush are colored by lineage and bars outside it are light gray. Finally, I'll add the brush selection.

As I mentioned above, the cutoff for a significant effect is -1. So I'll draw a line across the chart at -1 to make the sensitive cell lines easier to see.

Finally, I'll make one more chart that will show the section of the first chart that is highlighted by the brush.

What's returned is a waterfall plot of all cell lines for a certain gene with a line at -1, and a second waterfall plot of anything highlighted in the first. In the second plot, we get a zoomed-in view that includes the cell line names. I'll call the function with an example gene, BRAF.

In [5]:
# Define a function for a waterfall plot of a specific gene
def waterfall_gene(gene_name):
    # Make a new dataframe with just the columns of interest.
    gene_df = avana_samples[['DepMap_ID', 'stripped_cell_line_name', gene_name, 'lineage']].dropna()
    
    # Define a custom theme for colors.
    def my_theme():
        return {
            'config': {
                'range': {'category': {'scheme': 'category20'}}
            }
        }

    # Register and enable the theme.
    alt.themes.register('my_theme', my_theme)
    alt.themes.enable('my_theme')
    
    # Make a brush to be used for selecting data on the chart.
    brush = alt.selection(type='interval')
    
    # Make the waterfall chart.
    gene_chart = alt.Chart(gene_df).mark_bar().encode(
        x=alt.X('stripped_cell_line_name',
                sort='-y',
                axis=alt.Axis(title='Cell Lines', labels=False, ticks=False)),
        y=alt.Y(gene_name,
                axis=alt.Axis(title='Dropout')),
        color=alt.condition(brush, 'lineage', alt.value('lightgray'))
    ).properties(
        width=600,
        height=400
    ).add_selection(
        brush
    )
    
    # Make a line to delineate the sensitivity cutoff.
    line = alt.Chart(pd.DataFrame({'y': [-1]})).mark_rule(color='black',strokeDash=[3,3]).encode(
        y='y')
    
    # Make a second waterfall chart for the highlighted portion of the first chart.
    zoom = alt.Chart(gene_df).mark_bar().encode(
        x=alt.X('stripped_cell_line_name',
                sort='-y',
                axis=alt.Axis(title='Cell Lines')),
        y=alt.Y(gene_name,
                axis=alt.Axis(title='Dropout')),
        color='lineage'
    ).transform_filter(
        brush
    )
    
    # Return the two waterfall plots with the -1 cutoff line on the graphs.
    return (gene_chart + line) & (zoom + line)

# Example chart
waterfall_gene('BRAF')

From this, we can see which cell lines are sensitive and the size of the effect for each. We can also see that the majority of the sensitive cell lines are of either breast or skin lineage; because those two lineages are assigned the same color, it's not possible to tell them apart from the plot alone. One way to address this in the future is to make a new waterfall plot of just the sensitive cell lines by filtering the data instead of relying on the interactivity.
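The filtering idea could look something like the sketch below. It uses a made-up toy dataframe with the same column names as gene_df, and it assumes that "sensitive" means a score below the -1 cutoff.

```python
import pandas as pd

# Hypothetical toy data with the same column names as gene_df.
toy_df = pd.DataFrame({
    'stripped_cell_line_name': ['LINE_A', 'LINE_B', 'LINE_C', 'LINE_D'],
    'BRAF': [-0.2, -1.6, 0.1, -1.1],
    'lineage': ['lung', 'skin', 'breast', 'skin'],
})

CUTOFF = -1  # assumed sensitivity cutoff, as used in the plots

# Keep only the rows below the cutoff, sorted in descending order.
sensitive_df = (
    toy_df[toy_df['BRAF'] < CUTOFF]
    .sort_values('BRAF', ascending=False)
    .reset_index(drop=True)
)

print(sensitive_df)
```

Passing a dataframe like sensitive_df to alt.Chart in place of gene_df should give a much smaller plot, where fewer lineages need a color and the cell line names stay readable.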

I also wanted to show another way of using waterfall plots. We can trellis the plot based on lineage so that we can easily see the differences between them.

This function is very similar to our last one but instead of having a brush to highlight certain data points, we're going to use "facet" to make individual waterfall plots for each lineage. I put them four columns across for readability.

In [6]:
# Define a function for a chart of waterfall plots trellised by lineage
def waterfall_lineage(gene_name):
    
    # Make a new dataframe with just the columns of interest.
    gene_df = avana_samples[['DepMap_ID', 'stripped_cell_line_name', gene_name, 'lineage']].dropna()
    
    # Make the waterfall chart.
    gene_chart = alt.Chart(gene_df).mark_bar(color='black').encode(
        x=alt.X('stripped_cell_line_name',
                sort='-y',
                axis=alt.Axis(title='Cell Lines', labels=False, ticks=False)),
        y=alt.Y(gene_name,
                axis=alt.Axis(title='Dropout'))
    ).properties(
        width=200,
        height=100
    )
    
    # Make a line to delineate the sensitivity cutoff.
    line = alt.Chart().mark_rule(color='black', strokeDash=[3, 3]).encode(
        y='a:Q'
    )

    # Trellis by lineage using "facet".
    chart = alt.layer(
        gene_chart, line,
        data=gene_df
    ).transform_calculate(
        a='-1'
    ).facet(
        facet='lineage:N',
        columns=4
    )
    
    # Return the trellis chart.
    return chart

# Example
waterfall_lineage('BRAF')


Now each lineage is plotted on its own subplot. Scrolling through, you can see that the skin plot has the majority of its cell lines on the right-hand side of the graph, below the -1 cutoff. There are no other lineages that show an effect this strong, although colorectal lines show a similar but weaker trend. Therefore, you can say that the majority of skin cancer cell lines are sensitive to the loss of BRAF.
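A claim like "the majority of skin lines are sensitive" can also be checked numerically. Here is a rough sketch with pandas on made-up numbers (not the real DepMap values), again assuming that sensitive means a score below -1.

```python
import pandas as pd

# Hypothetical scores for a handful of made-up cell lines.
toy_df = pd.DataFrame({
    'lineage': ['skin', 'skin', 'skin', 'colorectal', 'colorectal', 'lung'],
    'BRAF': [-1.8, -1.3, -0.4, -1.2, -0.3, 0.1],
})

CUTOFF = -1  # assumed sensitivity cutoff

# Fraction of cell lines in each lineage that fall below the cutoff.
sensitive_fraction = (
    toy_df.assign(sensitive=toy_df['BRAF'] < CUTOFF)
    .groupby('lineage')['sensitive']
    .mean()
    .sort_values(ascending=False)
)

print(sensitive_fraction)
```

Run on the real gene_df, a table like this would put a number next to each subplot and make it easier to compare lineages that have very different numbers of cell lines.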

### Summary

In the end, I hope I've convinced you that using waterfall plots in Altair is an easy and effective way of visualizing certain types of experimental data. This technique and library can be used to plot anything with one categorical and one quantitative variable that can be ranked or sorted, such as patients' measured responses to a treatment or how far a person can run or bike in a certain time frame.