# Seeing is believing on planet data

Sections
* Tufte's principles of graphical excellence
* Haslett's visual vocabulary
* Other data visualization considerations

This lecture draws from Edward Tufte's "Chapter 1: Graphical Excellence" in _The Visual Display of Quantitative Information_ and from the _Financial Times'_ "Chart Doctor" repository.

---
# 1. Tufte's principles of graphical excellence

(Content adapted from Tufte's "The Visual Display of Quantitative Information")

<br>

Presenting your data in visual form is one of the most valuable, and underappreciated, aspects of data analysis. A good graphic can take a complex idea and frame it in an intuitive, easy-to-understand manner. Yet in research environments, plotting is often done in a _pro forma_ fashion, adopting conventions from previous papers or studies without careful thought about how to convey the most relevant information simply and efficiently.

Addressing this problem was the goal of Edward Tufte's book "The Visual Display of Quantitative Information," which was the first real attempt to identify core theoretical principles in scientific visualization. In particular, Tufte outlined the concept of _graphical excellence._

<br>

**Graphical Excellence:** The efficient communication of complex quantitative ideas.

&nbsp;&nbsp;&nbsp;&nbsp; Graphics should _reveal_ data in a meaningful and intuitive way.

<br>

According to Tufte, good graphical displays should have nine properties. They should:

* show the data

* induce the viewer to think about the substance rather than about
methodology, graphic design, the technology of graphic production,
or something else

* avoid distorting what the data have to say

* present many numbers in a small space

* make large data sets coherent

* encourage the eye to compare different pieces of data

* reveal the data at several levels of detail, from a broad overview
to the fine structure

* serve a reasonably clear purpose: description, exploration,
tabulation, or decoration

* be closely integrated with the statistical and verbal descriptions
of a data set

<br>

Let's consider a simple example of this.


## Example: Four x,y data pairs

<br>

You can present raw data in table format, which gives the reader full access to the information, but it is difficult to immediately see the different relationships.

![Table](imgs/L4Table1.png)

<br>
Consider the same data visualized as scatter plots. The differences in relationships "pop".

![Figure](imgs/L4Fig1.png)
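
To see why the table alone hides these differences, it can help to compute the usual summary statistics for each data set. The sketch below is plain Python and assumes the four data sets in the table are Anscombe's quartet, which is the example Tufte uses in this chapter; if the table shows different numbers, substitute them. The helper names (`pearson_r`, `summarize`) are made up for illustration.

```python
from math import sqrt

# Anscombe's quartet (Anscombe, 1973). ASSUMPTION: these match the table above.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def mean(values):
    return sum(values) / len(values)

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def summarize(x, y):
    """Summary statistics rounded to two decimals, as a table would report them."""
    return {
        "mean_x": round(mean(x), 2),
        "mean_y": round(mean(y), 2),
        "r": round(pearson_r(x, y), 2),
    }

for name, (x, y) in QUARTET.items():
    print(name, summarize(x, y))
```

Under that assumption, all four data sets produce the same rounded means and correlation, even though their scatter plots look nothing alike. The numbers alone cannot tell them apart; the graphic can.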

_"An ill-specified or preposterous model or a puny data set cannot be rescued by a graphic (or by calculation), no matter how clever or fancy." -Tufte_

<br>

Tufte was careful to emphasize that visualizations of quantitative information are like statistical analyses: they are only as good as what you put into them. You could, for example, come up with a clear visualization of something like this.

![Fig2](imgs/L4Fig2.png)

But of course the relationships between these variables are arbitrary and meaningless. So don't think that achieving graphical excellence means you'll necessarily be revealing hidden truths.



***
#  Verstynen's first rule of visualization: 

_The prettier the picture, the more "[truthy](https://en.wikipedia.org/wiki/Truthiness)" the information._


---
# 2. A visual vocabulary

<br>
While Tufte presented what are arguably the first concrete steps toward schemas and taxonomies for presenting quantitative information, the field of scientific visualization has exploded over the past two decades. Entire disciplines are now dedicated to thinking about how to present graphical information.

Some of the best places to find creative illustrations of quantitative information are newspapers and magazines. The graphics producers at these publications have spent years thinking about how to present clean, intuitive, and accurate data visualizations.

For example, the _Financial Times_ has set up a guide to what it calls a [**visual vocabulary**](http://ft-interactive.github.io/visual-vocabulary/), which breaks down the general categories of graphics that are useful for conveying specific ideas. The general taxonomy has been released as a poster that you can download [here](https://github.com/ft-interactive/chart-doctor/tree/master/visual-vocabulary).

![Visual Vocabulary](imgs/L4VisualVocabulary.jpg)

<br>

In this section we will explore the visual vocabulary proposed by _Financial Times_ graphic artist [Bob Haslett](https://twitter.com/bobhaslett?lang=en) and see how it can provide a helpful roadmap for deciding how to present your work. Much of this information is borrowed from Haslett's work.

<br>

---

## The 9 categories of graphical information 

The categories in Haslett's visual vocabulary cover nine types of patterns or relationships that you may be trying to get across.

<br> 

### Deviation

Emphasizes variations (+/-) from a fixed reference point. Typically the reference point is zero, but it can also be a target or a long-term average. Can also be used to show sentiment (positive/neutral/negative). For example:

<div>
<img src="imgs/L4DeviationPlot.png" width="800">
</div>

Presenting deviations is more nuanced than presenting mean values, because a deviation collapses two values (an observation and its baseline) into a single difference score. This lets you convey more information in the same space, but it can also hide things like natural base rates.
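
As a minimal sketch of that trade-off, the plain-Python snippet below computes difference scores for two groups. The measurements and the function names (`difference_scores`, `relative_changes`) are hypothetical, chosen only to make the base-rate problem visible.

```python
# Hypothetical (baseline, observed) pairs, invented for illustration.
measurements = {
    "group A": (10.0, 12.0),
    "group B": (100.0, 102.0),
}

def difference_scores(pairs):
    """Collapse each (baseline, observed) pair into one difference score."""
    return {name: observed - baseline
            for name, (baseline, observed) in pairs.items()}

def relative_changes(pairs):
    """Express each difference as a fraction of its baseline (the base rate)."""
    return {name: (observed - baseline) / baseline
            for name, (baseline, observed) in pairs.items()}

print(difference_scores(measurements))
print(relative_changes(measurements))
```

Both groups have the same difference score, so a deviation plot would draw identical bars for them, even though the change is ten times larger relative to group A's baseline.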

**Examples of Deviation plots**
* Diverging bar plots
* Diverging stacked bar plots
* Spine charts
* Surplus/deficit filled line


<br>

### Correlation

Show the relationship between two or more variables. Be mindful that, unless you tell them otherwise, __many readers will assume the relationships you show them to be causal__ (i.e., one causes the other).


<div>
<img src="imgs/L4CorrelationPlot.png" width="800">
</div>


Statistically, a correlation means that two or more variables covary: as one changes, the other changes too. So plots designed to illustrate correlative relationships have to get across this mutual dependency in the data. Typically this is done with a simple scatter plot, but as the example above shows, you can often convey multiple dimensions of information in a single correlation plot.

**Examples of Correlation plots** 
* Scatterplot
* Line + Column
* Connected scatterplot
* Bubble plots
* 2-dimensional heatmaps


<br>

### Ranking

Use ranking plots when an item’s position in an ordered list is more important than its absolute or relative value. Often you can highlight the points of interest along the way. For example, consider this ranked plot of World Tennis Association players based on their _current_ ranking.


<div>
<img src="imgs/L4RankingPlot.png" width="800">
</div>

Notice that this graph conveys several dimensions of information. First, you can see the current ranking of each player, with the highest-ranked players at the top. Next, you can also see how each player's ranking moved over time (during their tenure as a player). But the ranked information "pops" out first.
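
Before drawing a ranking plot, the values have to be turned into positions in an ordered list. A minimal sketch in plain Python, using hypothetical player names and ranking points (not taken from the figure) and a made-up helper `rank_items` that does not handle ties:

```python
# Hypothetical player names and ranking points, invented for illustration.
points = {"Player A": 6120, "Player B": 7385, "Player C": 5410, "Player D": 6900}

def rank_items(scores):
    """Return (rank, name, score) tuples, highest score first (ties not handled)."""
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(rank, name, score)
            for rank, (name, score) in enumerate(ordered, start=1)]

for rank, name, score in rank_items(points):
    print(f"{rank}. {name} ({score})")
```

Once the data are in this form, the rank (not the raw score) is what you would map to position on the axis.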

**Examples of Ranking plots** 
* Ordered bar plots
* Ordered column plots
* Ordered proportional symbols
* Dot strip plot
* Slope
* Lollipop chart

<br>

### Distribution
Show values in a dataset and how often they occur. The shape or skew of a distribution can be a useful way of highlighting the lack of uniformity or equality in the data. Consider how the _Financial Times_ compares the distribution of income for all Americans between 1971 and 2015. 


<div>
<img src="imgs/L4DistributionPlot.png" width="800">
</div>

The values for the 1971 income distribution are presented as a simple line plot, outlining the envelope of that distribution, while the values for 2015 are presented as a standard histogram with values binned by income bracket. You can easily see that: a) income has a rightward skew, and b) the income distribution has shifted toward the ultra-rich over 44 years.
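
The binning behind a histogram is simple to sketch. The plain-Python example below uses hypothetical incomes (not the _Financial Times_ data) and a made-up helper `bin_counts`; it assumes non-negative values and an integer bin width.

```python
from statistics import mean, median

# Hypothetical household incomes in thousands of dollars, invented for illustration.
incomes = [18, 22, 27, 31, 35, 38, 44, 52, 67, 95, 240]

def bin_counts(values, bin_width):
    """Count values in each [lower, lower + bin_width) bracket, keeping empty bins."""
    top = int(max(values) // bin_width) * bin_width
    counts = {lower: 0 for lower in range(0, top + bin_width, bin_width)}
    for value in values:
        counts[int(value // bin_width) * bin_width] += 1
    return counts

# A text histogram: one row per income bracket, one '#' per household.
for lower, count in bin_counts(incomes, 50).items():
    print(f"{lower:>4}-{lower + 50:<4} {'#' * count}")

print("mean:", round(mean(incomes), 1), "median:", median(incomes))
```

With a rightward skew like this one, a few very large values pull the mean well above the median, which is one reason to show the whole distribution rather than a single summary number.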

**Examples of Distribution plots** 
* Histogram
* Boxplot
* Violin plot
* Population pyramid
* Dot strip plot
* Barcode plot
* Cumulative curve


<br>

### Change Over Time

Give emphasis to changing trends. These can be short (within-day) movements or extended series traversing decades or centuries. Choosing the correct time period is important to provide suitable context for the reader. Often we think of time as being seconds/minutes/days/months/years, but you can also present time as relative phases. For example, consider a recent illustration of "bubble" dynamics for investment plotted against Bitcoin value.

<div>
<img src="imgs/L4ChangesOverTimePlot.jpg" width="800">
</div>

Notice how clearly you can see events evolve over time. This plot takes advantage of multiple ways of presenting temporal information. There's the seismogram showing the rapid fluctuations of the Bitcoin value, there's the curve plot showing the standard patterns of bubble dynamics, and there's the linear slope illustrating the simple linear trend expected if the investment wasn't hyped.

**Examples of Change Over Time plots**
* Line plot
* Column plot
* Line + column
* Area chart
* Stock price plot
* Fan chart
* Connected scatterplot
* Calendar heatmap
* Priestley timeline
* Circle timeline
* Seismogram

<br>

### Part-to-whole
Show how a single entity can be broken down into its component elements. Perhaps the most common version of this is the simple pie chart. However, you can also display proportional information in other ways. Consider this plot of the Catalan parliament elections in 2015.

<div>
<img src="imgs/L4PartToWholePlot.png" width="600">
</div>

Here we not only see the party breakdown, but we can also see the liberal vs. conservative spectrum (left-to-right along the image) and which groups form a coalition to control the government (dashed black line).
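
The arithmetic behind any part-to-whole plot is the conversion of counts into shares of a total. A minimal plain-Python sketch, using made-up parties and seat counts (not the Catalan results) and hypothetical helpers `shares` and `has_majority`:

```python
# Hypothetical seat counts for made-up parties, invented for illustration.
seats = {"Party A": 50, "Party B": 30, "Party C": 15, "Party D": 5}

def shares(parts):
    """Convert raw counts into fractions of the whole (they sum to 1)."""
    total = sum(parts.values())
    return {name: count / total for name, count in parts.items()}

def has_majority(parts, coalition):
    """True if the named parts together hold more than half of the whole."""
    total = sum(parts.values())
    return sum(parts[name] for name in coalition) > total / 2

for name, share in shares(seats).items():
    print(f"{name}: {share:.0%}")

print(has_majority(seats, ["Party A"]))
print(has_majority(seats, ["Party A", "Party D"]))
```

The second helper corresponds to the kind of threshold line drawn on the figure: it marks where a group of parts crosses half of the whole.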

**Examples of Part-to-Whole plots**
* Stacked column 
* Proportional stacked bar
* Pie 
* Donut
* Treemap
* Voronoi
* Arc
* Gridplot
* Venn
* Waterfall


<br>

### Magnitude
Show size comparisons between variables. These can be relative or absolute. Usually these show raw counts (for example, barrels, dollars, or people) rather than a calculated rate or percentage. Of course, a simple bar plot will suffice to show this information, but there are other ways you can integrate magnitude information with other sources of information (e.g., spatial). For example, look at this plot of oil imports/exports and prices across different countries.

<div>
<img src="imgs/L4MagnitudePlot.jpeg" width="800">
</div>

In keeping with the theme so far, notice that we get several types of information here, including the relative magnitudes of several variables: net imports, net exports, revenues, and prices.

**Examples of Magnitude plots**
* Column
* Bar
* Paired column
* Paired bar
* Proportional stacked bar
* Proportional symbol
* Isotype/pictogram
* Lollipop chart

<br>

### Spatial

Used only when precise locations or geographical patterns in data are more important to the reader than anything else. Often these are maps simply showing a color code of value across regions, but even spatial maps can convey multiple dimensions. Consider the population-adjusted map of 2016 election results across the United States.

<div>
<img src="imgs/L4SpatialPlot.jpg" width="800">
</div>

Here we not only get the distribution of vote tallies for each state, but the spatial warping gives an estimate of the relative magnitude of the population in each state as well.

Maps are also commonly used in neuroimaging research to convey the same type of information. Here's a map of the regions of the brain that encode visually-cued finger movements.


<div>
<img src="imgs/L4BrainMap.png" width="800">
</div>

**Examples of Spatial plots** 
* Basic choropleth
* Proportional symbol
* Flow map
* Contour map
* Equalized cartogram
* Scaled cartogram
* Dot density
* Heat map

<br>

### Flow

Show the reader volumes or intensity of movement between two or more states or conditions. These might be logical sequences or geographical locations. For example, take a look at this map showing the origins and routes of refugees into Europe in 2015.

<div>
<img src="imgs/L4FlowPlot.png" width="800">
</div>

Here you can not only see the physical routes of migration over time, but you can also get a sense of the density of people traveling different routes and how many refugees originate from different countries.
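
Flow plots are usually drawn from counts of movements between pairs of states. As a rough sketch in plain Python, the snippet below tallies hypothetical journeys (the place names are placeholders, not data from the map) using made-up helpers `flow_volumes` and `totals_by_origin`:

```python
from collections import Counter

# Hypothetical journeys as (origin, destination) pairs, invented for illustration.
journeys = [
    ("Origin A", "Country X"),
    ("Origin A", "Country X"),
    ("Origin A", "Country Y"),
    ("Origin B", "Country X"),
    ("Origin B", "Country Y"),
    ("Origin B", "Country Y"),
    ("Origin B", "Country Y"),
]

def flow_volumes(pairs):
    """Volume of each route: how many journeys share an (origin, destination)."""
    return Counter(pairs)

def totals_by_origin(pairs):
    """How many journeys start from each origin, regardless of route."""
    return Counter(origin for origin, _ in pairs)

for (origin, destination), volume in flow_volumes(journeys).most_common():
    print(f"{origin} -> {destination}: {volume}")
print(dict(totals_by_origin(journeys)))
```

In a Sankey or flow map, the route volumes would set the width of each band, and the per-origin totals would set the size of each starting node.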

**Examples of Flow plots**
* Sankey/river plot
* Waterfall 
* Chord
* Network


---
# 3. Other things to consider

<br>
So far we have presented Tufte's idea of _graphical excellence_ and Haslett's _visual vocabulary_ as the ends and means of data visualization, respectively. But let's consider a few additional principles for presenting your scientific information.

<br>

## 1. Figures can convey what words cannot.

The key to using good visualizations is to reduce the complexity of written prose when a visual image will get across the same information more intuitively. This is often helpful when describing your methods. Consider this figure of a behavioral task.

<div>
<img src="imgs/L4SSbayesMethod.jpg" width="800">
</div>

The top panel (A) shows a task where participants are supposed to stop a rising bar when it intersects with a target line at 520ms. Participants get points for being as close to the target line as possible. On some trials the bar stops and turns red (a stop signal), indicating that participants shouldn't press a button to stop the bar anymore. The bottom panel (B) shows the different groups, where the stop signal is sampled from one of 3 probability distributions (Uniform, Early, & Late). Probe stop signals are also included for all subjects at 200, 250, 300, 350, and 400ms. 

Notice that the dynamics of the task shown in the figure augment how I describe the task in the text. Hopefully the trial dynamics and the critical differences between the groups are clear.

<br>

## 2. Cluttered figures discourage the reader from trying to understand your argument.

While a good figure can make your story easier to tell, a bad figure can completely ruin your narrative. If you put too much in your figure (e.g., too many panels, a high density of visual information, too much text), then your reader spends more time examining the figure than reading the text. Remember: _graphics are supposed to make the reader's job easier, not harder_.

Perhaps no one does high-density, cluttered visualizations worse (better?) than the military.

<div>
<img src="imgs/L4MilitaryPlot.jpg" width="800">
</div>

Can you read that thing? I can't.

<br>

## 3. Show your variance properly

It's a little-known fact that there is an eighth deadly sin: not accurately showing your variance on a bar plot.

Now, it seems intuitive that just throwing on simple error bars reflecting the standard error of the mean will save your soul. But you'd be wrong.

Consider this example of the same data plotted three different ways.

<div>
<img src="imgs/L4Errorbars.png" width="800">
</div>

The left panel shows your standard bar plot of the means with standard error across conditions. But symmetric error bars built from the standard error (or from the standard deviation or variance) imply that the data are distributed symmetrically around the mean. This is rarely the case in real data. You can see this in the middle panel, which shows the mean and standard error as black dots and error bars, respectively, but also plots each individual data point in each group, so you can clearly see the whole spread of the data. Another way to show this (if you have a lot of data and showing individual data points is tricky) is to use violin plots (right panel), which show the estimated kernel density of the data instead.
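
To make the problem concrete, the plain-Python sketch below compares a mean ± standard error interval with the actual range of a skewed sample. The reaction times are hypothetical (not the data in the figure), and `spread_summary` is a made-up helper name.

```python
from math import sqrt
from statistics import mean, median, stdev

# Hypothetical reaction times in milliseconds, invented for illustration.
reaction_times = [310, 320, 330, 335, 340, 350, 360, 380, 420, 900]

def spread_summary(values):
    """Compare a symmetric mean +/- SEM interval with the actual range of the data."""
    center = mean(values)
    sem = stdev(values) / sqrt(len(values))  # standard error of the mean
    return {
        "mean": center,
        "sem_interval": (center - sem, center + sem),
        "median": median(values),
        "range": (min(values), max(values)),
    }

summary = spread_summary(reaction_times)
for key, value in summary.items():
    print(key, value)
```

The error bar extends the same distance above and below the mean by construction, while the data themselves stretch much farther above the median than below it. Plotting the individual points or a violin plot shows that asymmetry; the error bar alone cannot.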

In either case, it's not as easy as you think to avoid committing this sin.





---
# Useful tools

<br>

In this course you will be learning to use the data visualization functions in R (particularly ggplot2). While R has immensely useful data visualization tools, you may also wish to consider other plotting methods. For example, Bob Haslett has published the code for his different plot styles (using JavaScript) on [GitHub](https://github.com/ft-interactive/visual-vocabulary).

It's worth checking out his tools or finding others that graphic artists have developed and released to maximize the impact of your data presentation for your projects.