<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Storytelling-w/-Data-Visualization" data-toc-modified-id="Storytelling-w/-Data-Visualization-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Storytelling w/ Data Visualization</a></span><ul class="toc-item"><li><span><a href="#Line-Plot" data-toc-modified-id="Line-Plot-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Line Plot</a></span></li><li><span><a href="#Clean-Visuals" data-toc-modified-id="Clean-Visuals-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Clean Visuals</a></span><ul class="toc-item"><li><span><a href="#Cleaning-TIck-Marks" data-toc-modified-id="Cleaning-TIck-Marks-1.2.1"><span class="toc-item-num">1.2.1&nbsp;&nbsp;</span>Cleaning TIck Marks</a></span></li><li><span><a href="#Cleaning-Spines" data-toc-modified-id="Cleaning-Spines-1.2.2"><span class="toc-item-num">1.2.2&nbsp;&nbsp;</span>Cleaning Spines</a></span></li><li><span><a href="#Comparing-multiple-Line-Charts,-making-them-consistent" data-toc-modified-id="Comparing-multiple-Line-Charts,-making-them-consistent-1.2.3"><span class="toc-item-num">1.2.3&nbsp;&nbsp;</span>Comparing multiple Line Charts, making them consistent</a></span></li><li><span><a href="#Color-Pallettes" data-toc-modified-id="Color-Pallettes-1.2.4"><span class="toc-item-num">1.2.4&nbsp;&nbsp;</span>Color Pallettes</a></span></li><li><span><a href="#Increase-line-widths" data-toc-modified-id="Increase-line-widths-1.2.5"><span class="toc-item-num">1.2.5&nbsp;&nbsp;</span>Increase line widths</a></span></li><li><span><a href="#Plot-Spacing" data-toc-modified-id="Plot-Spacing-1.2.6"><span class="toc-item-num">1.2.6&nbsp;&nbsp;</span>Plot Spacing</a></span></li><li><span><a href="#Legends" data-toc-modified-id="Legends-1.2.7"><span class="toc-item-num">1.2.7&nbsp;&nbsp;</span>Legends</a></span></li></ul></li></ul></li></ul></div>

# Storytelling w/ Data Visualization

In the **Exploratory Data Visualization** course, we learned how to use visualizations to explore and understand data. Because we were focused on exploring trends and getting familiar with the data, we didn't focus much on tweaking the appearance of the plots to make them more presentable to others. We instead focused on the workflow of quickly creating, tweaking, displaying, and iterating on plots.

In this course, we'll focus on how to use data visualization to communicate insights and tell stories. In this mission, we'll start with a standard matplotlib plot and improve its appearance to better communicate the patterns we want a viewer to understand. Along the way, we'll introduce the principles that informed those changes and provide a framework for you to apply them in the future. Here's a preview that demonstrates some of the improvements we make in this course:

## Line Plot
`import pandas as pd
import matplotlib.pyplot as plt
plt.plot(df.col_x, df.col_y)
plt.show()`

`plt.plot(df.Year, df.Biology, c='blue', label='Women')
plt.plot(df.Year, 100 - df.Biology, c='green', label='Men')
plt.title('Percentage of Biology Degrees Awarded By Gender')
plt.legend(loc='upper right')
plt.show()`

## Clean Visuals
Although our plot is better, it still contains some extra visual elements that aren't necessary to understand the data. We're interested in helping people understand the gender gap in different fields across time. These excess elements, sometimes known as [chartjunk](https://en.wikipedia.org/wiki/Chartjunk), increase as we add more plots for visualizing the other degrees, making it harder for anyone trying to interpret our charts. In general, we want to maximize the [data-ink ratio](https://infovis-wiki.net/wiki/Data-Ink_Ratio), which is the fractional amount of the plotting area dedicated to displaying the data.

The following is an animated GIF by [Darkhorse Analytics](https://www.darkhorseanalytics.com/blog/data-looks-better-naked) that shows a series of tweaks for boosting the data-ink ratio:

Non-data ink includes any elements in the chart that don't directly display data points. This includes tick markers, tick labels, and legends. Data ink includes any elements that display and depend on the data points underlying the chart. In a line chart, data ink would primarily be the lines and in a scatter plot, the data ink would primarily be in the markers. As we increase the data-ink ratio, we decrease non-data ink that can help a viewer understand certain aspects of the plots. We need to be mindful of this trade-off as we work on tweaking the appearance of plots to tell a story, because plots we create could end up telling the wrong story.

This principle was originally set forth by [Edward Tufte](https://en.wikipedia.org/wiki/Edward_Tufte), a pioneer of the field of data visualization. Tufte's first book, [The Visual Display of Quantitative Information](https://www.edwardtufte.com/tufte/books_vdqi), is considered a bible among information designers. We cover some of the ideas presented in the book in this course, but we recommend going through the entire book for more depth.

To improve the data-ink ratio, let's make the following changes to the plot we created in the last step:

- Remove all of the axis tick marks.
- Hide the spines, which are the lines that connect the tick marks, on each axis.

### Cleaning TIck Marks

To customize the appearance of the ticks, we use the `Axes.tick_params()` method. Using this method, we can modify which tick marks and tick labels are displayed. Older versions of matplotlib display the tick marks on all four sides of the plot by default, while newer versions (2.0 and later) draw them only on the bottom and left sides. Here are the four sides for a standard line chart:

- The left side is the y-axis.
- The bottom side is the x-axis.
- The top side is across from the x-axis.
- The right side is across from the y-axis.

The parameters for enabling or disabling tick marks are conveniently named after the sides. To hide all of them, we need to pass in the following values for each parameter when we call `Axes.tick_params()`:

- `bottom: False`
- `top: False`
- `left: False`
- `right: False`

`plt.tick_params(bottom=False, top=False, left=False, right=False)`
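As a minimal sketch of the call above, the following hides the tick marks on a small line chart while keeping the tick labels. The data values are made up for illustration, and the `Agg` backend is selected only so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt

# Hypothetical data, used only to have something to draw.
years = [1970, 1980, 1990, 2000, 2010]
pct_women = [30, 42, 50, 58, 59]

fig, ax = plt.subplots()
ax.plot(years, pct_women)

# Booleans switch the tick marks off on each side; the tick labels stay visible.
ax.tick_params(bottom=False, top=False, left=False, right=False)

# Render once, then inspect the first x-axis tick to confirm the result.
fig.canvas.draw()
first_x_tick = ax.xaxis.get_major_ticks()[0]
marks_hidden = not first_x_tick.tick1line.get_visible()
labels_kept = first_x_tick.label1.get_visible()
```

In a notebook, calling `plt.show()` after `tick_params()` displays the chart with the labels intact but no tick marks.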

### Cleaning Spines

With the axis tick marks gone, the data-ink ratio is improved and the chart looks much cleaner. In addition, the spines in the chart are no longer necessary. When we're exploring data, the spines and the ticks complement each other to help us refer back to specific data points or ranges. When a viewer is looking at our chart and trying to understand the insight we're presenting, the ticks and spines can get in the way. As we mentioned earlier, chartjunk becomes much more noticeable when you have multiple plots in the same chart. By keeping the axis tick labels but not the spines or tick marks, we strike an appropriate balance between hiding chartjunk and making the data visible.

In matplotlib, the spines are represented using the `matplotlib.spines.Spine` class. When we create an Axes instance, four Spine objects are created for us. In older matplotlib versions, running `print(ax.spines)` prints a dictionary of the Spine objects; newer versions return a dictionary-like container that is indexed the same way:

`{'right': <matplotlib.spines.Spine object at 0x111089c18>, 'bottom': <matplotlib.spines.Spine object at 0x111060898>, 'top': <matplotlib.spines.Spine object at 0x1110606a0>, 'left': <matplotlib.spines.Spine object at 0x11107cd30>}`

To hide all of the spines, we need to:

- access each Spine object in the dictionary
- call the `Spine.set_visible()` method
- pass in the Boolean value `False`

The following line of code hides the spine on the right side:

`ax.spines["right"].set_visible(False)`
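The three steps listed above can be sketched as a short loop over all four sides. The plotted values are hypothetical, and the `Agg` backend is chosen only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1970, 1990, 2010], [30, 50, 59])  # hypothetical values

sides = ("left", "right", "top", "bottom")

# Access each Spine by its side name and hide it.
for side in sides:
    ax.spines[side].set_visible(False)

# Collect any spines that are still visible (expected to be none).
visible_spines = [side for side in sides if ax.spines[side].get_visible()]
```

Looping over the side names avoids repeating four nearly identical lines and makes it easy to keep one spine, for example by dropping `"bottom"` from `sides`.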

### Comparing multiple Line Charts, making them consistent

`# women_degrees: DataFrame with a 'Year' column and one column per degree category
major_cats = ['Biology', 'Computer Science', 'Engineering', 'Math and Statistics']
fig = plt.figure()  # figure that holds the 2 by 2 grid of subplots
for sp in range(0,4):
    ax = fig.add_subplot(2,2,sp+1)
    ax.plot(women_degrees['Year'], women_degrees[major_cats[sp]], c='blue', label='Women')
    ax.plot(women_degrees['Year'], 100-women_degrees[major_cats[sp]], c='green', label='Men')
    # Use the same axis ranges in every subplot so the charts are comparable
    ax.set_xlim(1968, 2011)
    ax.set_ylim(0, 100)
    ax.spines['right'].set_visible(False)
    ax.spines['left'].set_visible(False)
    ax.spines['bottom'].set_visible(False)
    ax.spines['top'].set_visible(False)
    ax.tick_params(bottom=False, top=False, left=False, right=False)
    ax.set_title(major_cats[sp])
plt.legend(loc='upper right')
plt.show()`

By spending just a few seconds reading the chart, we can conclude that Computer Science and Engineering have large gender gaps while the gaps in Biology and Math and Statistics are quite small. In addition, Computer Science and Engineering are dominated by men while Biology and Math and Statistics are much more balanced. This chart can still be improved, however, and we'll explore more techniques in the next mission.

In this mission, we explored how to enhance a chart's storytelling capabilities by minimizing chartjunk and encouraging comparison. In the next mission, we'll explore how to use color, spacing, and weights to further enhance the storytelling capability of the plots.

### Color Pallettes

If we want to publish the data visualizations we create, we need to be mindful of color blindness. Thankfully, there are color palettes we can use that are friendly for people with color blindness. One of them is called `Color Blind 10` and was released by Tableau, the company that makes the data visualization platform of the same name. Navigate to [this page](http://tableaufriction.blogspot.com/2012/11/finally-you-can-use-tableau-data-colors.html) and select just the `Color Blind 10` option from the list of palettes to see the ten colors included in the palette.

These numbers represent the **RGB values** for each color. The RGB color model describes how the three primary colors (red, green, and blue) can be combined in different proportions to form other colors. The RGB color model is very familiar to people who work in photography, filmmaking, graphic design, and any other field that uses colors extensively. In computers, each RGB value can range between 0 and 255, because 8 bits can represent 256 integer values.

The first color in the palette is a color that resembles dark blue and has the following RGB values:

- Red: 0
- Green: 107
- Blue: 164

To specify a line color using RGB values, we pass in a tuple of the values to the `c` parameter when we generate the line chart. Matplotlib expects each value to be scaled down and to range between 0 and 1 (not 0 and 255). In the following code, we scale the first color, which resembles dark blue, in the `Color Blind 10` palette and set it as the line color:

`cb_dark_blue = (0/255,107/255,164/255)
ax.plot(women_degrees['Year'], women_degrees['Biology'], label='Women', c=cb_dark_blue)`

Putting the colors together, we use dark blue for women and orange for men:

`cb_dark_blue = (0/255,107/255,164/255)
cb_orange = (255/255,128/255,14/255)`

`ax.plot(women_degrees['Year'], women_degrees[major_cats[sp]], c=cb_dark_blue, label='Women')
ax.plot(women_degrees['Year'], 100-women_degrees[major_cats[sp]], c=cb_orange, label='Men')`
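The scaling step described above can be wrapped in a small helper. This is only a sketch: `scale_rgb` is a hypothetical name introduced here, not a matplotlib function.

```python
def scale_rgb(red, green, blue):
    """Convert 0-255 RGB channel values to the 0-1 floats matplotlib expects."""
    for value in (red, green, blue):
        if not 0 <= value <= 255:
            raise ValueError(f"RGB values must be between 0 and 255, got {value}")
    return (red / 255, green / 255, blue / 255)

# The two Color Blind 10 colors used in this mission.
cb_dark_blue = scale_rgb(0, 107, 164)
cb_orange = scale_rgb(255, 128, 14)
```

The resulting tuples can be passed directly to the `c` parameter, and the range check catches a mistyped channel value before it reaches the plot.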

### Increase line widths

By default, the actual lines reflecting the underlying data in the line charts we've been generating are quite thin. The white color in the blank area in the line charts is still a dominating color. To emphasize the lines in the plots, we can increase the width of each line. Increasing the line width also improves the data-ink ratio a little bit, because more of the chart area is used to showcase the data.

When we call the `Axes.plot()` method, we can use the `linewidth` parameter to specify the line width. Matplotlib expects a float value for this parameter:

`ax.plot(women_degrees['Year'], women_degrees['Biology'], label='Women', c=cb_dark_blue, linewidth=2)`

The higher the line width, the thicker each line will be.
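To make the effect of `linewidth` concrete, here is a small sketch that draws one line at the default width and one at a width of 3, then reads both widths back from the line objects. The data values are hypothetical, and the `Agg` backend is used only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt

# Hypothetical percentages, used only for illustration.
years = [1970, 1980, 1990, 2000, 2010]
pct_women = [30, 42, 50, 58, 59]
pct_men = [100 - p for p in pct_women]

fig, ax = plt.subplots()
# Axes.plot() returns a list of Line2D objects, one per plotted line.
(thin_line,) = ax.plot(years, pct_women, label="Women")
(thick_line,) = ax.plot(years, pct_men, label="Men", linewidth=3)

default_width = thin_line.get_linewidth()
thick_width = thick_line.get_linewidth()
```

Comparing `default_width` with `thick_width` shows how much heavier the second line is drawn than matplotlib's default.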

### Plot Spacing

So far, we've been generating our line charts on a 2 by 2 subplot grid. If we wanted to visualize all six STEM degrees, we'd need to either add a new column or a new row. Unfortunately, neither solution orders the plots in a way that benefits the viewer. Scanning horizontally or vertically doesn't reveal any new information, which can be frustrating as the viewer's gaze jumps around the image.

To make the viewing experience more coherent, we can:

- use a layout of a single row with multiple columns
- order the plots in decreasing order of initial gender gap

The leftmost plot has the largest gender gap in 1968 while the rightmost plot has the smallest gender gap in 1968. If we're instead interested in the recent gender gaps in STEM degrees, we can order the plots from largest to smallest ending gender gaps. Here's what that would look like:

In this exercise, you'll order the charts by decreasing ending gender gap. We've populated the list `stem_cats` with the six STEM degree categories, ordering them by decreasing ending gender gap. In the next step, we'll explore how we can replace the legend, which is currently overlapping with the rightmost line chart.

`stem_cats = ['Engineering', 'Computer Science', 'Psychology', 'Biology', 'Physical Sciences', 'Math and Statistics']
fig = plt.figure(figsize=(18, 3))
for sp in range(0,6):
    ax = fig.add_subplot(1,6,sp+1)
    ax.plot(women_degrees['Year'], women_degrees[stem_cats[sp]], c=cb_dark_blue, label='Women', linewidth=3)
    ax.plot(women_degrees['Year'], 100-women_degrees[stem_cats[sp]], c=cb_orange, label='Men', linewidth=3)
    for key,spine in ax.spines.items():
        spine.set_visible(False)
    ax.set_xlim(1968, 2011)
    ax.set_ylim(0,100)
    ax.set_title(stem_cats[sp])
    ax.tick_params(bottom=False, top=False, left=False, right=False)
plt.legend(loc='upper right')
plt.show()`
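The list `stem_cats` above was supplied already ordered. As a rough sketch of how such an ordering could be derived, the snippet below sorts categories by the size of the gap between the two genders in the final year. The percentages are invented for illustration and are not the course data, and `gender_gap` is a hypothetical helper.

```python
# Hypothetical final-year percentages of degrees awarded to women.
# These numbers are invented for illustration only.
ending_pct_women = {
    "Engineering": 17.0,
    "Biology": 58.0,
    "Psychology": 77.0,
    "Math and Statistics": 43.0,
}

def gender_gap(pct_women):
    """Absolute difference between the women's and men's percentages."""
    pct_men = 100 - pct_women
    return abs(pct_women - pct_men)

# Largest gap first, so the leftmost subplot shows the biggest imbalance.
ordered_cats = sorted(
    ending_pct_women,
    key=lambda cat: gender_gap(ending_pct_women[cat]),
    reverse=True,
)
```

To order by the initial gap instead, the same sort could be applied to the first-year percentages.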

### Legends

The purpose of a legend is to ascribe meaning to symbols or colors in a chart. We're using it to inform the viewer of what gender corresponds to each color. Tufte encourages removing legends entirely if the same information can be conveyed in a cleaner way. Legends consist of non-data ink and take up precious space that could be used for the visualizations themselves (data-ink).

Instead of trying to move the legend to a better location, we can replace it entirely by annotating the lines directly with the corresponding genders:

Notice that even the position of the text annotations has meaning. In both plots, the annotation for `Men` is positioned above the orange line while the annotation for `Women` is positioned below the dark blue line. This positioning subtly suggests that men are a majority for the degree categories the line charts are representing (`Engineering` and `Math and Statistics`) and women are a minority for those degree categories.

Combined, these two observations suggest that we should stick with annotating just the leftmost and the rightmost line charts, prioritizing the data-ink ratio over the consistency of elements.

To add text annotations to a matplotlib plot, we use the `Axes.text()` method. This method has a few required parameters:

- x: x-axis coordinate (as a float)
- y: y-axis coordinate (as a float)
- s: the text we want in the annotation (as a string value)

The values in the coordinate grid match exactly with the data ranges for the x-axis and the y-axis. If we want to add text at the intersection of `1970` from the `x-axis` and `0` from the `y-axis`, we would pass in those values:

`ax.text(1970, 0, "starting point")`
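Here is a self-contained sketch of the same call on a chart whose axis ranges match the ones used in this mission. The plotted values are hypothetical, and the `Agg` backend is used only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1970, 1990, 2010], [30, 50, 59])  # hypothetical values
ax.set_xlim(1968, 2011)
ax.set_ylim(0, 100)

# x and y are given in data coordinates, so (1970, 0) sits on the bottom edge.
label = ax.text(1970, 0, "starting point")

label_position = label.get_position()
label_text = label.get_text()
```

Because the coordinates are in data units, changing the axis limits moves the annotation on screen but not its position relative to the data.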


`fig = plt.figure(figsize=(18, 3))
for sp in range(0,6):
    ax = fig.add_subplot(1,6,sp+1)
    ax.plot(women_degrees['Year'], women_degrees[stem_cats[sp]], c=cb_dark_blue, label='Women', linewidth=3)
    ax.plot(women_degrees['Year'], 100-women_degrees[stem_cats[sp]], c=cb_orange, label='Men', linewidth=3)
    for key,spine in ax.spines.items():
        spine.set_visible(False)
    ax.set_xlim(1968, 2011)
    ax.set_ylim(0,100)
    ax.set_title(stem_cats[sp])
    ax.tick_params(bottom=False, top=False, left=False, right=False)
    if sp == 0:
        ax.text(2005, 87, 'Men')
        ax.text(2002, 8, 'Women')
    if sp == 5:
        ax.text(2005, 62, 'Men')
        ax.text(2001, 35, 'Women')
plt.show()`

In this mission, we learned how to improve the viewing experience by making our plots more color-blind friendly and thickening the line widths. We then explored how to use the layout and ordering of the plots, as well as annotations placed directly on the plots, to enhance the story that's being told to the viewer. Next in this course is a guided project, where we'll extend the work we did in this mission to all of the degree categories.