I've been building or tinkering with data visualization ideas in Python for a while now, both in my free time and in my (still nascent) professional work. With three different libraries (`py_d3`, `geoplot`, and `missingno`) under my belt, so to speak, I've had time to form an opinion on the mildly specific topic of data visualization API design, and I wanted to share my thoughts here. `[flesh this out more]`

To start things off, let's discuss some general data visualization philosophy.

The objective of a data visualization library is some combination of just two things: user comprehension, and audience comprehension.

User comprehension is the ability of the individual building the visualization to understand some interesting feature of the dataset. What constitutes interesting obviously varies, but the broad theme is: tell me something about this dataset that I couldn't learn with a table alone. User comprehension is performed by a data analyst.

Audience comprehension, by contrast, is the ability of users _besides_ the individual doing the visualization to understand interesting features. In this case, the objective to maximize is how well the visualization communicates the desired information content to some target audience. Audience comprehension is performed by a data presenter.

These two tasks are fundamentally different.

Data analysts are generally spending valuable time on visualization. They are willing to spend more time interpreting a more complicated result, so long as the visualizations they build are relatively easy to make. This leads to a strong preference for expedient solutions and reusable templates.

Data presenters, by contrast, are generally presenting their work to an external audience, one less (possibly much less) experienced at visual interpretation than the analyst building the graphic. Good data presenters are more willing to accept harder-to-build visualizations, if that means more easily interpretable results.

A good example of a highly data analysis-specific task would be optimizing a machine learning model (perhaps whilst participating in a Kaggle competition). In this setting it is very important to have a deep understanding of every aspect of the dataset, as this understanding is fundamental to the approach you take and the performance you ultimately get from the result. An ML practitioner invariably builds heaps of short-lived, messy-looking charts, none of which survive long enough to be seen by anyone else (save collaborators).

Libraries like [`yellowbrick`](http://www.scikit-yb.org/en/latest/index.html) are designed to suit this use case. Here is a `yellowbrick` "discrimination threshold" visualization:

![](http://www.scikit-yb.org/en/latest/_images/spam_discrimination_threshold.png)

Interpreting this graph takes heaps of domain knowledge, experience with this plot type, and knowledge about the dataset being visualized; all things highly specific to the person creating the chart.

A good example of a highly data presentation-specific task, by contrast, is data journalism. In data journalism you are attempting to communicate some story or viewpoint to a public audience. The public audience is visually uninformed, and has a very low threshold for interpretive complexity, so good data journalists must be skilled in using sophisticated tools to build highly custom "data experiences". Here is one example of just such a visualization, from a New York Times article showing the web of collaboration amongst Oscar winners in Hollywood:

![](https://static01.nyt.com/images/2013/02/21/movies/awardsseason/21oscar-network-sf-image/21oscar-network-sf-image-superJumbo.png)

While algorithmically complex (a few words on that [here](https://bost.ocks.org/mike/example/)), this visualization is beautifully simple to interpret.

These examples, meant to prove a point, are extreme. Most of the time, data visualizations are not strictly in one camp or the other, but instead lie somewhere on the spectrum in between. Tools that are great for data presentation tend to be very complicated and low-level, but as a consequence also very feature-rich and customizable; tools that are great for data analysis are easy to use, but can be hard to customize, and often suffer from [leaky abstractions](https://www.joelonsoftware.com/2002/11/11/the-law-of-leaky-abstractions/).

For a data visualization module designer, the trick is to strike an appealing balance: between easy-to-use and easy-to-customize, between simple and complex, between limiting and expressive.

With that in mind, let's look at some of the design choices in the Python data visualization ecosystem, and what they bring to the table.

## Abstraction stacks

In computer science, an **abstraction** is a self-contained system that hides complexity from its user. Abstractions are everywhere. The Python `requests` library, renowned for its particularly beautiful API, is an abstraction built on top of `httplib` and friends that makes network requests easier to use and more powerful. CPython is an abstraction that keeps you from having to write machine code. SQL is an abstraction that seamlessly hides database details like table indexes and data locality.

Abstractions help us work faster and do more sophisticated things. But sometimes we want to leave the abstraction we are currently in and dig deeper. This usually happens when we need or want to do something that the current level of abstraction does not intrinsically support.

This need to downshift is especially common in data visualization. It's impossible to come up with an API that covers every possible use case, and well-designed data visualization libraries don't even try. Instead, mature Python data visualization tools provide easy access to their abstraction stack.
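As a minimal sketch of what "easy access to the abstraction stack" can look like, consider a convenience function that returns the lower-level object it draws on. The helper `quick_bar` below is hypothetical (it is not part of any library); it stands in for a high-level charting call, and `matplotlib` is assumed as the layer underneath:

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt


def quick_bar(labels, values):
    """Hypothetical high-level helper: one call, sensible defaults."""
    fig, ax = plt.subplots()
    ax.bar(labels, values, color="steelblue")
    ax.set_ylabel("count")
    # Returning the Axes is what keeps the lower layer reachable.
    return ax


# The high-level layer covers the common case in one line...
ax = quick_bar(["a", "b", "c"], [3, 1, 2])

# ...and the returned Axes lets us downshift for anything the helper
# does not support directly.
ax.patches[0].set_facecolor("crimson")  # recolor a single bar
ax.spines["top"].set_visible(False)     # hide the top border of the plot frame
```

The design choice worth noticing is the `return ax`: had the helper returned nothing, the only way to recolor one bar would be to abandon it and rebuild the chart from scratch at the lower level.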

A good example of this principle in action is `plotly`. ...more words...chart layer, glyph layer.

## State machines

The `matplotlib` thing.

## Widget-based interactivity

The `bqplot` value prop. Some background on `mpld3`.

## Mixed notation

`holoviews`

## User-chosen basal layers

Talk about how e.g. `holoviews` allows you to choose a different basal layer for visualization, either `bokeh` or `matplotlib`.

## Grammars of graphics

`plotnine` and the alternative thinker's approach. As well as `altair`.