In [8]:
options(repr.plot.width=8, repr.plot.height=5)
library(tidyverse)
library(ggplot2)

# Lecture 6: Advanced ggplot

<div style="border: 1px double black; padding: 10px; margin: 10px">

**In today's lecture you will:**
* See some examples of great and not-so-great visualizations.
* Along the way, learn how to use ggplot to make clear and effective plots.

This lecture note corresponds (loosely) to Chapters 11-13 of your book. 
</div>


    




## Basic rules for making good plots
1. Label the axes.
2. Include units.
3. Explain anything that is "encoded" (color scales, size variation, etc.).
4. Use appropriate geometry.
5. Include sources/attribution.
6. Use the simplest possible design necessary to convey the information.

## Good visualizations

![faith in america](https://www.pewresearch.org/religion/wp-content/uploads/sites/7/2022/09/PF_2022.09.13_religious-projections_01-01.png?w=553)

![image.png](attachment:image.png)

![euro](https://www.royfrancis.com/assets/images/posts/2015/2015-10-01-elegant-scientific-graphs-learning-from-examples/reuters-europe-unemployment.jpg)

![stem cell](https://www.royfrancis.com/assets/images/posts/2015/2015-10-01-elegant-scientific-graphs-learning-from-examples/ns-network-stem-cell.jpg)

![image.png](attachment:image.png)

## Bad visualizations
Sometimes the easiest way to learn what to do is to study what not to do...

(Mostly taken from https://badvisualisations.tumblr.com).


### 131%

![image.png](attachment:image.png)

In [None]:
tribble(
    ~event, ~probability,
    "NDA staying below 220", .09,
    "NDA crossing 250", .72,
    "NDA getting a majority", .5
)

event,probability
<chr>,<dbl>
NDA staying below 220,0.09
NDA crossing 250,0.72
NDA getting a majority,0.5


#### A giant amongst women
![image.png](attachment:image.png)

In [None]:
tribble(
    ~country, ~height,
    "Latvia", "5'5",
    "Australia", "5'4",
    "Scotland", "5'4",
    "Peru", "5'4",
    "South Africa", "5'2",
    "India", "5'0"
)

country,height
<chr>,<chr>
Latvia,5'5
Australia,5'4
Scotland,5'4
Peru,5'4
South Africa,5'2
India,5'0


<span style="font-size: 6px;">That'sreallyalotofinformationforsuchasmallspace</span>
![image.png](attachment:image.png)

#### Bangaloroscope
![image.png](attachment:image.png)

#### Where is the bathroom
![image.png](attachment:image.png)

#### y-not
![image.png](attachment:image.png)

#### Time warp

![image.png](attachment:image.png)


![image.png](attachment:image.png)

(Link to the paper): https://www.jstor.org/stable/2683253#metadata_info_tab_contents

## Rule #1: Show as few data as possible

- [Tufte](https://www.edwardtufte.com/tufte/) (famous data viz guy) defines the "data density index" (ddi) as "the number of numbers plotted per square inch." 
- In order to make a bad plot, you want to strive for a ddi that is _as low as possible_.
- Rough guidelines:
    - $\text{ddi}=1$: novice
    - $\text{ddi}=.5$: intermediate
    - $\text{ddi}=.1$: elite-level, or first-year art student
- (Anything above ddi=20 places you at risk of making a good plot.)




![image.png](attachment:image.png)

There are three numbers plotted in a $5\times 3 \text{in}$ graphic, so $\text{ddi}=.2$.
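
The arithmetic is easy to check with a one-line helper. This is only a sketch: the function name `ddi()` and its argument names are my own and do not come from any package.

```r
# Data density index: numbers plotted per square inch of graphic.
# `ddi()` is a hypothetical helper written for this note.
ddi <- function(n_numbers, width_in, height_in) {
  n_numbers / (width_in * height_in)
}

# Three numbers in a 5 x 3 inch graphic: 3 / 15 = 0.2
ddi(3, 5, 3)
```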

In [None]:
df1a <- tribble(
    ~year, ~jp_pct,
    1967,  .44,
    1972,  .623,
    1977, .70
)


![image.png](attachment:image.png)

In [10]:
df1b <- tribble(
    ~year, ~expenditures,
    1972, 75,
    1974, 80,
    1976, 125,
    1978, 200,
    1980, 245,
    1982, 305
)
library(ggplot2)
# Bar chart of expenditures by year, with each bar labeled by its value
ggplot(df1b, aes(x = year, y = expenditures)) +
  geom_col() +
  geom_text(aes(label = expenditures), nudge_y = 10) +
  labs(x = "Year", y = "Expenditures ($ millions)")

## Poll

Which graphic do I prefer?

<ol style="list-style-type: upper-alpha;">
    <li>The monster</li>
    <li>The bar chart</li>
</ol>

## Rule #2: Hide the data
- In situations where it is necessary to show data, make sure it is well hidden.
- Add visual distractions such as grids, illustrations, and other doo-dads that draw the eye away from the data points.
- Minimize contrast, ensuring that older/visually impaired readers have an especially difficult time reading your plot.
- Minimize variation by choosing a scale that is several orders of magnitude larger than the natural range of the data.

![image.png](attachment:image.png)

In [None]:
df2 <- tribble(
    ~school_year, ~type, ~n,
    1930, "private", 9275,
    1940, "private", 10000,
    1950, "private", 10375,
    1960, "private", 13574,
    1970, "private", 14372,
    1930, "public", 255000,
    1940, "public", 191000,
    1950, "public", 140000,
    1960, "public", 105000,
    1970, "public", 85000,
)
# Stacked bars: one bar per school year, filled by school type
ggplot(df2, aes(x = school_year, y = n, fill = type)) +
  geom_col()
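
One possible way to undo the "oversized scale" trick is to give each school type its own panel and its own y-axis, so the private-school counts are not flattened by the much larger public-school counts. This is a sketch, not the original graphic's design: the choice of `facet_wrap()` with free scales is mine, and the axis label is generic because the table itself does not state what `n` counts.

```r
library(ggplot2)
library(tibble)

# Same values as the df2 table, written column-wise
df2 <- tibble(
  school_year = rep(seq(1930, 1970, by = 10), times = 2),
  type = rep(c("private", "public"), each = 5),
  n = c(9275, 10000, 10375, 13574, 14372,
        255000, 191000, 140000, 105000, 85000)
)

# scales = "free_y" lets each panel choose a y-range that fits its own data
p2 <- ggplot(df2, aes(x = school_year, y = n)) +
  geom_line() +
  geom_point() +
  facet_wrap(~ type, scales = "free_y") +
  labs(x = "School year", y = "Count (n)")
p2
```

Free scales make within-group trends visible, at the cost of making the two panels harder to compare with each other.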

![image.png](attachment:image.png)

## Rule #3: Trick the eye
- The human eye is easily deceived by changes in ordering, scale, and visual metaphor. 
    - Switch up the scale, preferably in the same plot.
    - Represent smaller numbers using larger objects, and vice versa.
    - Level 99 trick: represent lengths by area.

![image.png](attachment:image.png)

In [None]:
df3 <- tribble(
    ~year, ~president, ~ppower,
    1958,  "eisenhower", 1.0,
    1963, "kennedy", .94,
    1968, "johnson", .83,
    1973, "nixon", .64,
    1975, "ford", .60,
    1978, "carter", .44
)
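
A plain bar chart avoids the "lengths by area" trick, because only the bar height varies with the value. The sketch below assumes that `ppower` is measured relative to 1958 (where it equals 1.0); that reading is my assumption and is not stated in the table.

```r
library(ggplot2)
library(tibble)

# Same values as the df3 table, written column-wise
df3 <- tibble(
  year = c(1958, 1963, 1968, 1973, 1975, 1978),
  president = c("eisenhower", "kennedy", "johnson", "nixon", "ford", "carter"),
  ppower = c(1.0, 0.94, 0.83, 0.64, 0.60, 0.44)
)

# Bars all have the same width, so height alone encodes the value
p3 <- ggplot(df3, aes(x = factor(year), y = ppower)) +
  geom_col() +
  geom_text(aes(label = president), vjust = -0.5) +
  expand_limits(y = 1.1) +  # leave room for the label above the tallest bar
  labs(x = "Year", y = "ppower (assumed: relative to 1958)")
p3
```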

![image-2.png](attachment:image-2.png)

![image.png](attachment:image.png)

In [None]:
df4 <- tribble(
    ~year, ~paper, ~subscribers,
    1977, "post", 503000,
    1978, "post", 621000,
    1979, "post", 642000,
    1980, "post", 654000,
    1981, "post", 732000,
    1977, "news", 1911000,
    1978, "news", 1829000,
    1979, "news", 1636000,
    1980, "news", 1555000,
    1981, "news", 1491000

)
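
A possible honest version of this comparison puts both papers on a single y-axis that starts at zero, so equal vertical distances mean equal numbers of subscribers. This is only a sketch; dividing by 1000 and the axis labels are my own choices.

```r
library(ggplot2)
library(tibble)

# Same values as the df4 table, written column-wise
df4 <- tibble(
  year = rep(1977:1981, times = 2),
  paper = rep(c("post", "news"), each = 5),
  subscribers = c(503000, 621000, 642000, 654000, 732000,
                  1911000, 1829000, 1636000, 1555000, 1491000)
)

# One shared scale for both papers, anchored at zero
p4 <- ggplot(df4, aes(x = year, y = subscribers / 1000, color = paper)) +
  geom_line() +
  geom_point() +
  expand_limits(y = 0) +
  labs(x = "Year", y = "Subscribers (thousands)", color = "Paper")
p4
```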


![image-2.png](attachment:image-2.png)

In [None]:
df5 <- tribble(
    ~year, ~doctor, ~other,
    1939, 3262, 1809,
    1947, 8744, NA,
    1951, 13150, 4071,
    1955, 16107, 5055,
    1963, 25050, 7182,
    1965, 28960, 7798,
    1967, 34740, 8882,
    1970, 43100, 10722,
    1972, 46780, 12097,
    1973, 50823, 12977,
    1974, 54140, 13391,
    1975, 58440, 14311,
    1976, 62799, 15272
)

In [None]:
# doctor plot
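
One way to fill in the doctor plot is sketched below. The years in `df5` are unevenly spaced (8 years between the first two rows, 1 year between the last few), so the sketch maps `year` to a continuous x-axis, which keeps those gaps at their true size. The names `group` and `amount` are labels I made up for the reshaped data, and the table does not state the units.

```r
library(ggplot2)
library(tibble)
library(tidyr)

# Same values as the df5 table, written column-wise
df5 <- tibble(
  year = c(1939, 1947, 1951, 1955, 1963, 1965, 1967,
           1970, 1972, 1973, 1974, 1975, 1976),
  doctor = c(3262, 8744, 13150, 16107, 25050, 28960, 34740,
             43100, 46780, 50823, 54140, 58440, 62799),
  other = c(1809, NA, 4071, 5055, 7182, 7798, 8882,
            10722, 12097, 12977, 13391, 14311, 15272)
)

# Reshape to one row per (year, group); drop the missing 1947 value
df5_long <- pivot_longer(
  df5,
  cols = c(doctor, other),
  names_to = "group",
  values_to = "amount",
  values_drop_na = TRUE
)

# year is numeric, so ggplot spaces the points by actual elapsed time
p5 <- ggplot(df5_long, aes(x = year, y = amount, color = group)) +
  geom_line() +
  geom_point() +
  labs(x = "Year", y = "Amount (units not given)", color = "Group")
p5
```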

## Rule #4: Eliminate context
- Context helps the viewer obtain a global/big picture understanding of the data.
- But sometimes the big picture is, inconveniently, at odds with what you wish to be true.

![image.png](attachment:image.png)

In [None]:
df6 <- tribble(
    ~year, ~plan, ~payments,
    1982, "wmc", 2350,
    1982, "president", 2300,
    1983, "wmc", 2300,
    1983, "president", 2250,
    1984, "wmc", 2350,
    1984, "president", 2200,
    1985, "wmc", 2400,
    1985, "president", 2200,
    1986, "wmc", 2450,
    1986, "president", 2200
)
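
One possible way to restore some context for these numbers is to draw both plans on a y-axis that includes zero, so the gap between them can be judged against the overall size of the payments. This is a sketch under my own assumptions: the table does not give the units of `payments`, and including zero is only one of several ways to add context.

```r
library(ggplot2)
library(tibble)

# Same values as the df6 table, written column-wise
df6 <- tibble(
  year = rep(1982:1986, each = 2),
  plan = rep(c("wmc", "president"), times = 5),
  payments = c(2350, 2300, 2300, 2250, 2350, 2200, 2400, 2200, 2450, 2200)
)

# expand_limits(y = 0) forces the axis to include zero
p6 <- ggplot(df6, aes(x = year, y = payments, color = plan)) +
  geom_line() +
  geom_point() +
  expand_limits(y = 0) +
  labs(x = "Year", y = "Payments", color = "Plan")
p6
```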

## Rule #5: Dwell on the trivial
- Prefer plots that emphasize inconsequential differences in the data, thus obscuring the important ones.

![image.png](attachment:image.png)

I'm too lazy to manually type in all the numbers in the previous plot, so I went straight to the source: [CPS Historical Time Series Tables]. After some finagling (which you can find in the `cps.ipynb` notebook in this folder, if you are curious) I produced the following dataset containing similar data:

In [None]:
# earnings differences, men vs women
load(url('https://datasets.stats306.org/cps.RData'))

The ``facet_grid()`` command here told ggplot to generate a separate plot for each level of the discrete variable **sex**. It also went ahead and arranged them into a nice 2x1 grid format.

The syntax to facet may look a little funny: 
```{r}
facet_grid(sex ~ .)
```
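The tilde (``~``) denotes what is called a **formula** in R. We will discuss formulas later in the class when we talk about modeling. For now, just keep in mind that the facet command must be written just so for things to work: the variable to the left of the tilde defines the rows of the grid, the variable to the right defines the columns, and a dot (``.``) means "do not facet in this direction".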
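
Here is a minimal sketch of the two orientations. Since the column names of the CPS data are not shown here, it uses the `mpg` dataset that ships with ggplot2 as a stand-in, with `drv` playing the role of the discrete faceting variable.

```r
library(ggplot2)

base <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# Variable on the left of the tilde: one ROW of panels per level of drv
p_rows <- base + facet_grid(drv ~ .)

# Variable on the right of the tilde: one COLUMN of panels per level of drv
p_cols <- base + facet_grid(. ~ drv)

p_rows
p_cols
```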

## Rule #5: Label nothing
- Labeling the axes of a plot, and giving it a title, is a sign of weakness.
- Therefore, doing none of these is a sign of strength.

#### I think your Gibbs sampler is broken
![bad plot](https://www.biostat.wisc.edu/~kbroman/topten_worstgraphs/mykland_fig1.jpg)
<caption> (Actual plot from a paper in a respected statistics journal.)</caption>
<small>Source: <a href="https://www.biostat.wisc.edu/~kbroman/topten_worstgraphs/">https://www.biostat.wisc.edu/~kbroman/topten_worstgraphs/</a></small>

## Labels in ggplot
Plot labels in ggplot can be set using the `labs()` function. The main types of labels are:
- Title: summarizes the main finding. Avoid titles that just describe what the plot is, e.g. “A scatterplot of engine displacement vs. fuel economy”.
- Subtitle: adds additional detail in a smaller font beneath the title.
- Caption: adds text at the bottom right of the plot, often used to describe the source of the data.
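
A short sketch of all three label types together, using the `mpg` dataset that ships with ggplot2. The wording of the title, subtitle, and axis labels is my own and is only meant as an illustration.

```r
library(ggplot2)

p_labs <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  labs(
    title = "Fuel economy generally decreases with engine size",
    subtitle = "Each point is one car model in the mpg dataset",
    caption = "Data from fueleconomy.gov",
    x = "Engine displacement (liters)",
    y = "Highway fuel economy (miles per gallon)"
  )
p_labs
```

Note that the axis labels are also set through `labs()`, using the names of the aesthetics (`x`, `y`) as argument names.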

## Rule #6 (my own rule): Abuse correlation
- Correlation is everywhere. 
- Continue sifting through the data until you find a correlation that supports whatever point you wish to make. 
- Then plot this correlation for maximal effect.


I downloaded time series data on a) [per-capita income in Ann Arbor](https://fred.stlouisfed.org/series/ANNA426PCPI) and b) [the total population of Paraguay](https://data.worldbank.org/indicator/SP.POP.TOTL?locations=PY) and put them in a table `a2i.pp`:

In [None]:
load(url('https://datasets.stats306.org/a2pp.RData'))
print(a2i.pp)

# A data frame: 48 × 3
    year a2.income para.pop
   <int>     <int>    <dbl>
 1  1969      4679  2412566
 2  1970      4640  2474106
 3  1971      5085  2535359
 4  1972      5538  2596739
 5  1973      6084  2659088
 6  1974      6313  2723523
 7  1975      7100  2790962
 8  1976      7916  2861581
 9  1977      8807  2935375
10  1978      9938  3012829
# … with 38 more rows

## 🤔 Quiz

What's the correlation of `a2.income` and `para.pop`?

<ol style="list-style-type: upper-alpha;">
    <li>Something negative</li>
    <li>0.0 - 0.5</li>
    <li>0.5 - 0.8</li>
    <li>0.8 - 0.9</li>
    <li>Above 0.9</li>
</ol>

In [None]:
# plot of a2i.pp
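
A sketch of how the correlation and the plot could be produced. To keep it self-contained, it uses only the ten rows printed above (1969 to 1978) rather than the full 48-row table, so the number it returns is not the answer for the whole dataset. The name `a2_head` is my own.

```r
library(ggplot2)
library(tibble)

# First ten rows of a2i.pp, copied from the printed output
a2_head <- tibble(
  year = 1969:1978,
  a2.income = c(4679, 4640, 5085, 5538, 6084, 6313, 7100, 7916, 8807, 9938),
  para.pop = c(2412566, 2474106, 2535359, 2596739, 2659088,
               2723523, 2790962, 2861581, 2935375, 3012829)
)

# Pearson correlation between the two unrelated series
r <- cor(a2_head$a2.income, a2_head$para.pop)
r

# Scatterplot: two series that both trend upward over time line up neatly
p_cor <- ggplot(a2_head, aes(x = para.pop, y = a2.income)) +
  geom_point() +
  labs(
    x = "Population of Paraguay",
    y = "Per-capita income in Ann Arbor",
    caption = "Sources: FRED (ANNA426PCPI), World Bank (SP.POP.TOTL)"
  )
p_cor
```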