<img style="float: center;" src="images/CI_horizontal.png" width="600">
<center>
    <span style="font-size: 1.5em;">
        <a href='https://www.coleridgeinitiative.org'>Website</a>
    </span>
</center>

# Data Visualization in Python

## Introduction

In this module, you will learn to quickly and flexibly generate a range of visualizations to explore data and communicate with your audience. This module contains a practical introduction to data visualization in Python and covers important rules to follow when creating visualizations.

## Learning Objectives

* Learn critical rules about data visualization (selecting graph types; labeling visual encodings; referencing data sources).

* Become familiar with two core Python data visualization tools, Matplotlib and seaborn.

* Start to develop the ability to conceptualize which visualizations can best reveal various types of patterns in your data.

## Choosing a Data Visualization Package

<!-- 
Matplotlib is always capitalized, like a typical proper noun.
Seaborn is capitalized like an ordinary word, so it's lowercase if "seaborn" appears in the middle of a sentence.
-->

There are many excellent data visualization modules available in Python. You can read more about different options for data visualization in Python in the [More Resources](#More-Resources) section at the bottom of this notebook. For this tutorial we will stick to a tried and true combination of 2-D plotting libraries: Matplotlib and seaborn.

Matplotlib is very expressive, meaning that it has functionality to allow extensive and fine-tuned creation of figures. It makes no assumptions about data, so it can be used to make historical timelines and fractals as well as bar charts and scatter plots. Matplotlib's flexibility comes at the cost of additional complexity in its use. 

Seaborn is a higher-level module, trading some of the expressiveness and flexibility of Matplotlib for a more concise and easier syntax. For our purposes, seaborn improves on Matplotlib in several ways: it makes it easier to create small multiples, improves the default colors and aesthetics, and includes direct support for some visualizations such as regression model results. Seaborn's creator, Michael Waskom, has compared the two:

> If Matplotlib "tries to make easy things easy and hard things possible", seaborn tries to make a well-defined set of hard things easy too. 

### Seaborn and Matplotlib together

It may seem like we need to choose between these two approaches, but happily this is not the case. Seaborn is itself built on top of Matplotlib (and you will sometimes see seaborn called a "wrapper" around Matplotlib). We can use seaborn to make graphs quickly, then Matplotlib for specific adjustments. Whenever you see `plt` referenced in the code below, we are using `matplotlib.pyplot`, a submodule of `matplotlib`.

## Import Packages and Set Up


In [None]:
# These abbreviations (pandas -> pd; seaborn -> sns) may seem arbitrary,
# but they are community conventions that will help keep your work easy
# to read and compare with that of other Python users.

import pandas as pd
from sqlalchemy import create_engine

import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

# Jupyter-specific "magic command" to plot images directly in the notebook.
%matplotlib inline

# Engine to connect to SQL database
# We'll create this once and provide to pandas whenever we use read_sql()
engine = create_engine("postgresql://stuffed.adrf.info/appliedda")

## Motivation

In this notebook, we are going to tackle a series of questions. To answer them, we will introduce you to various visualizations which will provide a clearer view of the data than just using summary statistics, and help you create powerful graphics that better convey the point you want to make.

The questions we will focus on in this notebook are:
- About how old are graduate students when they finish their dissertations? That is, what is the distribution of age at dissertation? How does this differ by field of study?
- What are the differences in starting salary by field of PhD? How has starting salary changed over the years, both overall and by field?
- What are primary sources of funding for students in various fields of study?
- What are the funding histories of graduate students in the three years leading up to their dissertation? How do the funding histories differ, and what are the most frequent funding sequences?

## Load DataFrames

We've separated out the SQL queries that construct the DataFrames we will use in this notebook, and we read them in from `.sql` files. Since they are not the focus of this notebook, we won't go into detail on how we've built up the queries, but we suggest you take a look at the `joined_person.sql` and `joined_semester.sql` files to make sure you understand how they're created.

`person_df` is at the individual level across the entire range and has individual level statistics on characteristics such as their demographics, academic achievements and debt levels.

`semester_df` is at the person-semester level and has information about their funders, team size, and semester.

In both of these DataFrames, we've done some cleaning in SQL (which you can see in the `.sql` files) to make the visualizations easier to create.

In [None]:
from pathlib import Path
person_level_query = Path('./joined_person.sql').read_text()
person_df = pd.read_sql(person_level_query, engine)
person_df

In [None]:
from pathlib import Path
semester_level_query = Path('./joined_semester.sql').read_text()

semester_df = pd.read_sql(semester_level_query, engine)
semester_df

## Matplotlib

We'll begin with some straightforward Matplotlib functions. We'll start with some motivation questions, then use the appropriate Matplotlib commands to create a visualization that helps answer that question.

### Prepare the data

When designing visualizations, it can help to just draw a sketch on paper first. Once you have an idea of what type of graph is best suited to illustrate the fact that you want to show, consider how to prepare the data you need for the graph. 

We can provide Matplotlib a `pd.DataFrame` or `pd.Series` that we've created using pandas. We'll want to ensure that the DataFrame includes exactly the information we want to plot, because Matplotlib won't be doing much more than simple aggregation.

### Histogram

**Motivating Question: About how old are graduate students when they finish their dissertations? That is, what is the distribution of age at dissertation? How does this differ by field of study?**

Since age is a numerical variable, we'll want to use a visualization such as a histogram. We'll start by plotting a histogram of a single variable, then customizing the figure. For a histogram, we'll want to consider the scale -- whether we should plot everything or a subset of values. Plotting our data as a histogram makes it easier to quickly observe some features, such as the overall shape of the distribution and its skewness and kurtosis. 

In [None]:
# For now, we'll just take a single series: age at dissertation
ages = person_df.age_at_diss.dropna()
ages.describe()

An easy way to get started with Matplotlib is to use its state-based interface, `matplotlib.pyplot`, which we have already imported above as `plt`. We can create a graph, then adjust its current state a bit at a time using `plt` functions.
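
To make "state-based" a little more concrete, here is a minimal sketch using made-up ages (not the SED data). Each `plt` call acts on the current figure and axes, and `plt.gca()` ("get current axes") returns the object that `pyplot` has been updating behind the scenes. The `Agg` backend line is only there so the snippet can run as a standalone script.

```python
import matplotlib
matplotlib.use("Agg")  # standalone-script backend; omit this line inside the notebook
import matplotlib.pyplot as plt

toy_ages = [26, 27, 28, 28, 29, 30, 31, 33, 35, 41]  # made-up values for illustration

plt.figure()        # start a fresh "current" figure
plt.hist(toy_ages)  # draws onto the current axes (default: 10 bins)
plt.xlabel("Age")   # updates that same axes

ax = plt.gca()                    # the axes object pyplot has been updating
current_xlabel = ax.get_xlabel()  # the label we set through plt.xlabel()
n_bars = len(ax.patches)          # one rectangle per histogram bin
```

Holding on to `ax` like this is also the entry point to Matplotlib's object-oriented interface, which seaborn functions hand back to us later in this notebook.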

To create a new histogram, we'll simply pass our `ages` series into `plt.hist()`.

In [None]:
plt.hist(ages)

# The show() function outputs the current state of `pyplot`: our current fig.
plt.show()

The `.describe()` above already suggested a strong right skew, but this visualization shows us the distribution in much greater detail.

But this is bare: let's at least add some labels.

In [None]:
plt.hist(ages)

plt.ylabel('Dissertators', fontsize='medium', labelpad=10)
plt.xlabel('Age', fontsize='medium', labelpad=10)

# In the notebook environment, the figure will automatically be
# displayed if the Python code cell ends with an update to the plot,
# so we can skip plt.show() in many cases.

###  Built-in styles

Now let's see how we can improve the style of this visualization. Every part of this figure can be customized in several ways, and Matplotlib includes several popular styles built-in.

In [None]:
print('Built-in style names:', ', '.join(sorted(plt.style.available)))

In [None]:
# Change the default style (affects font, color, positioning, and more)
plt.style.use('fivethirtyeight')

plt.xlabel('Age', fontsize='medium', labelpad=10)
plt.ylabel('Dissertators', fontsize='medium', labelpad=10)

# We need to replot the data in each new notebook cell.
plt.hist(ages, bins=30)
plt.show()
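
If we would rather not change the style for every later figure, Matplotlib also offers `plt.style.context`, which applies a style only inside a `with` block. The sketch below is meant to be run on its own, before any `plt.style.use` call; it only inspects one setting (`axes.facecolor`) to show that the change is temporary.

```python
import matplotlib as mpl
import matplotlib.pyplot as plt

facecolor_before = mpl.rcParams["axes.facecolor"]

with plt.style.context("fivethirtyeight"):
    # Inside the block, the style's settings are active
    facecolor_inside = mpl.rcParams["axes.facecolor"]

# Outside the block, the previous settings are restored
facecolor_after = mpl.rcParams["axes.facecolor"]

print(facecolor_before, facecolor_inside, facecolor_after)
```

Any plotting calls placed inside the `with` block would be drawn with the temporary style.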

### Style customization

That's a bit better, but Matplotlib allows us to customize every individual component on the fly.

> *How can we reset customizations?* In a notebook with multiple figures, we may want to reset everything before our next visualization. Or, having explored several options, we might want to undo all the stylistic tweaks without having to rerun the entire notebook. `matplotlib.rc_file_defaults()` will return just about everything to default settings.

In [None]:
mpl.rc_file_defaults()

# Change the figure size -- let's make it big.
plt.rc('figure', figsize=(8, 5))

# Because `pyplot` works by incrementally updating the state of `plt`,
# some changes must be made prior to creating those elements in the figure.
# We'll make the axes spines (the box around the plot) invisible
mpl.rc('axes', edgecolor='white', titlepad=20)

# These will remove the axes ticks
mpl.rc('xtick', bottom=False)
mpl.rc('ytick', left=False)

# Now we'll replot the data. With such a large sample, let's make a bin for each year.
n_bins = int(ages.max() - ages.min())
plt.hist(
    ages, 
    bins=n_bins, 
    align='left',
    color='xkcd:sage'
)

# Just after adding the data is a good time to remember to source it.
plt.annotate(
    'Sources: NCSES SED', 
    fontsize='x-small',
    xycoords="figure fraction", # specify x and y positions as % of the overall figure
    xy=(1, 0.01), # 100% to the right (x) and 1% to the top (y) means bottom right
    horizontalalignment='right', # the text will align appropriately for bottom right
)

# Add a title to the top of the figure
plt.title("Dissertations Typically Completed Near Age Thirty", fontsize='large')

# Add axis labels, with a bit more padding than default between the label and the axes
plt.xlabel('Age of Dissertator', fontsize='medium', labelpad=10)
plt.ylabel('Number of Dissertations', fontsize='medium', labelpad=10)

# Reduce the size of the tick labels
plt.xticks(fontsize=9)
plt.yticks(fontsize=9)

# Add horizontal gridlines using negative space across the bars
plt.grid(
    color='white', 
    linewidth=1,
    axis='y'
)
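
A side note on `bins=n_bins` in the cell above: when we pass a number of bins, the range from the minimum to the maximum is split into equal-width bins, and the last bin includes its right edge. If ages are recorded as whole numbers (an assumption about `age_at_diss`), the two oldest ages therefore share one bar. The sketch below uses made-up ages and `np.histogram`, which follows the same binning rule as `plt.hist`, to compare that with explicit yearly bin edges.

```python
import numpy as np

toy_ages = np.array([25, 26, 26, 27, 29, 29, 30])  # made-up whole-number ages

# Same rule as the cell above: number of bins = max - min
n_bins = int(toy_ages.max() - toy_ages.min())
counts_equal, edges_equal = np.histogram(toy_ages, bins=n_bins)

# Alternative: explicit edges, one bin per year, including the oldest age
yearly_edges = np.arange(toy_ages.min(), toy_ages.max() + 2)
counts_yearly, _ = np.histogram(toy_ages, bins=yearly_edges)

print(counts_equal)   # ages 29 and 30 share the last bin
print(counts_yearly)  # each age gets its own bin
```

An array like `yearly_edges` could be passed as `bins=` to `plt.hist` in place of the bin count.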

### Data sourcing

A critical aspect of any data visualization intended for release is a reference to the source of the data being used. In these examples, we simply reference the agencies and names of the datasets. Whenever possible, we would provide a direct path so that our audience can find the data we used to build the figure. When this is not feasible -- as with these restricted-access data -- be sure to direct the reader to documentation describing the data.

Either way, providing clear sourcing for the underlying data is an absolute requirement of responsible dissemination. Transparent communication of sources and references builds trust between analyst and audience and helps enable the reproducibility of analyses.

In [None]:
# If we're repeatedly doing the same kind of annotation, 
# it helps a lot to turn that into a function.
def add_sourcing(plt, source_string, fontsize='x-small'):
    """Add small sourcing note to lower-right of current plot
    
    We would be using the same arguments over and over to do this.
    So a quick function will make it simpler. Now we can simply:
    
    add_sourcing(plt, 'Sources: IRIS UMETRICS, NCSES SED')
    """
    return plt.annotate(
        source_string,
        fontsize=fontsize,
        xycoords="figure fraction", # specify x and y positions as % of the overall figure
        xy=(1, 0.01), # 100% to the right (x) and 1% to the top (y) means bottom right
        horizontalalignment='right', # the text will align appropriately for bottom right    
    )
print("Now we can simply run:\n   add_sourcing(plt, 'Text goes here')")

### Multiple plots in one figure

Matplotlib is allowing us to make consecutive changes to the same plot, then display it whenever we're ready. The same process allows us to layer on multiple plots. By default, the first graph you create will be at the lowest layer, with each successive graph layered on top.

Below, we observe a difference in mean age of dissertation by field. Let's overlay one of the higher and one of the lower, to visualize the difference in their distributions.

In [None]:
person_df.groupby('phd_major_field')['age_at_diss'].agg(['mean', 'count']).sort_values('mean')

In [None]:
fields_of_interest = ['Physical Sciences', 'Education']

# Create a subset of person_df, plotting a histogram of age for each
for major_field in fields_of_interest:
    field_ages = person_df[person_df.phd_major_field == major_field].age_at_diss.dropna()
    plt.hist(field_ages, bins='doane', alpha=0.5)
    
# We'll definitely need a label to keep these apart...
plt.legend(
    labels=fields_of_interest, #ensure that labels are in the same order as above
    loc='center right', # the default is upper right, so move a little closer
    frameon=False, # remove the box around the legend
)

add_sourcing(plt, 'Sources: NCSES SED')

<font color = red> <h2>Checkpoint #1: Histogram</h2></font>

Try customizing your own histogram. If you want to try something other than age, another continuous variable in `person_df` is `salary_k`, the PhD graduate's anticipated salary (in thousands). Or `semester_df` includes the `team_size` variable, measuring the size of the federally-funded teams that a student is working with in each semester.

You'll definitely want to include:
- A title (`plt.title`)
- Axis labels (`plt.xlabel` and `plt.ylabel`)
- Data sourcing (`plt.annotate` or the `add_sourcing` function defined above)

If you use multiple colors, you'll want to add a legend as well (`plt.legend`)

Here we will change the variable from `age_at_diss` to `salary_k`. We add `plt.title`, `plt.xlabel`, `plt.ylabel`, and the `add_sourcing` function. If you want to change the range shown on the x-axis, you can use `plt.xlim` and define the needed range (e.g. from 0 to 200K). To change the range on the y-axis, you can use `plt.ylim`.

In [None]:
fields_of_interest = ['Physical Sciences', 'Education']

# Create a subset of person_df, plotting a histogram of anticipated salary for each
for major_field in fields_of_interest:
    field_salary = person_df[person_df.phd_major_field == major_field].salary_k.dropna()
    plt.hist(field_salary, bins='doane', alpha=0.5)

# Titles, labels, and limits only need to be set once, after the loop
plt.title('Most students expect salary to be between $35-75K')
plt.xlabel('Anticipated salary (thousands of dollars)')
plt.ylabel('Number of students')
plt.xlim(0, 200)  # limit the x-axis range
    
# We'll definitely need a label to keep these apart...
plt.legend(
    labels=fields_of_interest, #ensure that labels are in the same order as above
    loc='center right', # the default is upper right, so move a little closer
    frameon=False, # remove the box around the legend
)

add_sourcing(plt, 'Sources: NCSES SED')

## Seaborn

Seaborn provides a high-level interface to Matplotlib, which is powerful but sometimes unwieldy. Seaborn provides many useful defaults, so that we can quickly have:
- More aesthetically pleasing defaults
- A better range of color palettes
- More complex graphs with less code
- Small multiples (a sequence of small graphs in one figure)

As you'll see, these libraries are complementary. Some tweaks will still require reaching back into Matplotlib.

### Bar chart

For this section, consider the following question:

**What are the differences in starting salary by field of PhD? How has starting salary changed over the years, both overall and by field?**

A bar plot presents categorical data with rectangular bars proportional to the values that they represent. In this case, we plot a horizontal bar plot. A bar plot represents an estimate of central tendency for a numeric variable with the length of each rectangle, and the seaborn `barplot()` function also includes an indication of the uncertainty around the estimate using error bars.
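
To give a feel for what those error bars represent: by default, as far as we are aware, `barplot()` draws the mean of each group with a bootstrapped 95% confidence interval. The sketch below is not seaborn's implementation; it is a plain-Python illustration of a percentile bootstrap on made-up salaries, and the helper `bootstrap_ci` is a hypothetical name.

```python
import random
import statistics

random.seed(0)  # make the resampling repeatable
toy_salaries = [48, 52, 55, 60, 61, 65, 70, 72, 80, 95]  # made-up salaries in $K


def bootstrap_ci(values, n_boot=1000, level=0.95):
    """Percentile bootstrap confidence interval for the mean."""
    # Resample with replacement, record the mean of each resample, then sort
    means = sorted(
        statistics.mean(random.choices(values, k=len(values)))
        for _ in range(n_boot)
    )
    lower_idx = int((1 - level) / 2 * n_boot)
    upper_idx = int((1 + level) / 2 * n_boot) - 1
    return means[lower_idx], means[upper_idx]


sample_mean = statistics.mean(toy_salaries)  # what the bar length would show
ci_low, ci_high = bootstrap_ci(toy_salaries)  # what the error bar would span
print(sample_mean, ci_low, ci_high)
```

Groups with fewer observations or more spread tend to produce wider intervals, which is why some bars carry longer error bars than others.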

In [None]:
mpl.rc_file_defaults() # reset most Matplotlib features to defaults

# By convention, a returned Axes object is often called `ax`
ax = sns.barplot(
    y="phd_major_field", # seaborn is clever enough to create a horizontal chart
    x="salary_k", 
    data=person_df.sort_values('phd_major_field'), # order in data to order in figure
)

# We can use either `ax` or `plt` here; either will work
add_sourcing(ax, 'Sources: NCSES SED')

ax.set_title('Anticipated Salary Varies Considerably Across Fields of PhD')

### Line chart

We can use a line plot (the seaborn `lineplot()` function) for tracking change in a value over time (a time series graph). Here we look at the trend in salary expectations over time.

> Note: The title of a visualization occupies the most valuable real estate on the page. If nothing else, you can be reasonably sure a viewer will at least read the title and glance at your visualization. This is why you want to put thought into making a clear and effective title that acts as a **narrative** for your chart. It is best to avoid purely _descriptive_ titles, such as: "Average Expected Salary over Time (2008-2017)". This title is correct, yes -- but it isn't very useful. It is likely to be redundant, since "salary" and "year" are probably labels on the axes already. Instead, use the title to reinforce and explain the core point of the visualization. It should answer the question **"Why is this graph important?"** and focus the viewer on the most critical take-away.

In [None]:
mpl.rc_file_defaults() # reset most settings to defaults

# A `with` statement (context manager) can be used to temporarily set figure styles
with sns.axes_style('darkgrid'):
    axes = sns.lineplot(data=person_df, x='phd_year', y='salary_k', color="#229900")
    axes.set_title('Anticipated Salary Has Been Increasing')

add_sourcing(plt, 'Sources: IRIS UMETRICS, NCSES SED')

### Small multiples

Small multiples can be a great way to compare across categories, so that we can see several similarly plotted versions in the same overall figure. Seaborn offers an easy interface for combining multiple plots into a single figure using the `FacetGrid` class. Because `FacetGrid` was designed for exactly this use, seaborn has helpful defaults such as automatically synchronized axes.

We've looked at salary expectations over time and across field; here we consider all three variables at once:

In [None]:
# Prepare our grid, which will share axes across multiple plots (wrapping after 5 columns)
g = sns.FacetGrid(person_df, col='phd_major_field', ylim=(0, 140), col_wrap=5)

# Create a lineplot for each cell of the grid
g = g.map(sns.lineplot, "phd_year", "salary_k")

add_sourcing(plt, 'Source: NCSES SED', fontsize='medium')

# Simplify the titles inside each cell
g.set_titles("{col_name}")

# Remove the spine (vertical line) along the y axis
sns.despine(left=True)
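
For comparison, here is a rough Matplotlib-only sketch of the shared-axes idea behind small multiples, using made-up salary series rather than `person_df`. It is not how `FacetGrid` is implemented; it simply shows that `plt.subplots(..., sharey=True)` keeps the y-axis limits synchronized across panels, which `FacetGrid` arranges for us automatically. The `Agg` backend line is only there so the snippet can run as a standalone script.

```python
import matplotlib
matplotlib.use("Agg")  # standalone-script backend; omit this line inside the notebook
import matplotlib.pyplot as plt

years = [2014, 2015, 2016, 2017]
toy_series = {  # made-up anticipated salaries in $K
    "Field A": [50, 52, 55, 59],
    "Field B": [70, 71, 75, 90],
    "Field C": [40, 41, 41, 43],
}

# One panel per field, all sharing the same x and y axes
fig, axes = plt.subplots(1, len(toy_series), sharex=True, sharey=True, figsize=(9, 3))
for ax, (field, salaries) in zip(axes, toy_series.items()):
    ax.plot(years, salaries)
    ax.set_title(field)

# Every panel reports the same y-axis limits because the axis is shared
shared_ylims = {ax.get_ylim() for ax in axes}
```

With shared limits, a reader can compare the height of a line in one panel directly against another.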

### Colors

The colors used in figures in both Matplotlib and seaborn can be represented in code in many ways, but here are two that Matplotlib, seaborn, and many other modern visualization packages handle:

#### Hex triplets

The hex triplet is a specification for the RGB color model commonly used for website and browser-rendered colors. These are formatted as a string with a pound sign `#` followed by six hexadecimal digits. Each pair of hexadecimal digits (i.e., two of 0-9 and A-F) represents one byte of color information for red, green, and blue, in that order: `"#RRGGBB"`. A low value (minimum 00) contributes less of that primary color, a high value (maximum FF) a larger amount. Together, these can specify over 16 million colors. An additional two hex digits can be added to indicate alpha (transparency) where 00 is completely transparent and FF is completely opaque. Hex triplets are very common across many platforms and packages well beyond data and visualization.
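
As a small illustration of how the pairs of digits map to color channels, the sketch below decodes a hex string by hand. The helper `hex_to_rgb` is a hypothetical name written for this example, not part of Matplotlib or seaborn.

```python
def hex_to_rgb(hex_color):
    """Convert '#RRGGBB' or '#RRGGBBAA' into a tuple of 0-255 integers."""
    digits = hex_color.lstrip("#")
    if len(digits) not in (6, 8):
        raise ValueError(f"expected 6 or 8 hex digits, got {hex_color!r}")
    # Read two hex digits at a time: red, green, blue, then optional alpha
    return tuple(int(digits[i:i + 2], 16) for i in range(0, len(digits), 2))


sage = hex_to_rgb("#87ae73")           # red=135, green=174, blue=115
green_half = hex_to_rgb("#22990080")   # the same green as our line chart, about half transparent
n_colors = 256 ** 3                    # 256 levels for each of three channels

print(sage, green_half, n_colors)
```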

#### XKCD names

A relatively new standard, XKCD names were the result of an online study created by Randall Munroe where volunteers entered free-form names of colors displayed on screen. Following the input of tens of thousands of participants, 954 common and distinguishing names were codified. Behind the scenes, these are still equivalent to specific hex triplets, but they can be more convenient. The result is a list of color names that many English speakers will find intuitive, from basics such as "gold," "green," and "light grey" to rarely used terms. In Matplotlib and seaborn these are written as a string prefixed by `xkcd:`, for example: `"xkcd:cement"` (#a5a391), `"xkcd:pale magenta"` (#d767ad), `"xkcd:sage"` (#87ae73), and `"xkcd:green/blue"` (#01c08d).
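
If you are curious which hex triplet sits behind a name, Matplotlib's `matplotlib.colors` module can translate it. The short sketch below looks up the sage color used in our histogram.

```python
from matplotlib import colors

# Translate an XKCD name into its hex triplet
sage_hex = colors.to_hex("xkcd:sage")

# Check whether a string is something Matplotlib can interpret as a color
sage_is_color = colors.is_color_like("xkcd:sage")
typo_is_color = colors.is_color_like("xkcd:not a real color name")

print(sage_hex, sage_is_color, typo_is_color)
```

`is_color_like` can be a quick way to catch a misspelled color name before plotting.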

<font color = red><h2> Checkpoint #2: Small Multiples </h2></font>

Try using `sns.FacetGrid` in combination with a histogram, bar, or line chart of your choice. Separating simple charts into several categories with small multiples can be a big improvement over trying to graph several things on the same chart.

Try experimenting with color choices. Remember to add source and use a title that highlights the main take-away.

If you would like to include another variable in a line plot (for example, to look at differences by gender), you can pass the `hue='sex'` argument to `FacetGrid`.

In [None]:
# Prepare our grid, which will share axes across multiple plots (wrapping after 5 columns)
g = sns.FacetGrid(person_df, col='phd_major_field', hue='sex', ylim=(0, 140), col_wrap=5)

# Create a lineplot for each cell of the grid
g = g.map(sns.lineplot, "phd_year", "salary_k")

# With a hue variable, add a legend so readers can tell the lines apart
g.add_legend()

add_sourcing(plt, 'Source: NCSES SED', fontsize='medium')

# Simplify the titles inside each cell
g.set_titles("{col_name}")

# Remove the spine (vertical line) along the y axis
sns.despine(left=True)

## More visualization methods and motivating examples

### Scatter plot 



We can represent the relationship between the age of doctorate recipients and their expected salaries using a scatter plot and the seaborn `scatterplot()` function.

In [None]:
scatter_df = person_df[person_df.salary_k < 300]
ax = sns.scatterplot(
    x='age_at_diss', 
    y='salary_k', 
    color="xkcd:sea blue",
    data=scatter_df,
    alpha=.1, # we have a LOT of points, so make each nearly transparent
    linewidth=0, # and remove the default (white) circle around each
)
add_sourcing(plt, 'Sources: IRIS UMETRICS, NCSES SED')

### Heat map

Consider the following question:

**What are primary sources of funding for students in various fields of study?**

For something like this, we might want to use a heatmap. This can give a sort of visual summary similar to a crosstab.
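
If crosstabs are new to you, the toy example below shows what `pd.crosstab` produces: one row per field, one column per funder, and a count in each cell. The rows here are made up for illustration; only the column names mirror our data.

```python
import pandas as pd

toy = pd.DataFrame({
    "phd_major_field": ["Physics", "Physics", "Physics", "Biology", "Biology", "Education"],
    "modal_funder": ["NSF", "NSF", "DOE", "NIH", "NIH", "ED"],
})

# Count how many rows fall into each (field, funder) combination
toy_table = pd.crosstab(toy["phd_major_field"], toy["modal_funder"])
print(toy_table)
```

A heatmap then replaces each count in a table like this with a color, so that large and small cells stand out at a glance.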

In [None]:
MAJOR_FUNDERS = 'NIH NSF DOD DOE USDA NASA ED'.split()

pre_heatmap_df = pd.merge(semester_df, person_df, how='left', on='drf_id')[['modal_funder', 'phd_major_field']]

heatmap_df = pd.crosstab(pre_heatmap_df.phd_major_field, pre_heatmap_df.modal_funder)[MAJOR_FUNDERS]

ax = sns.heatmap(heatmap_df, cmap=sns.cubehelix_palette(light=1, as_cmap=True))
add_sourcing(ax, 'Sources: IRIS UMETRICS, NCSES SED')

# This heatmap fix is only necessary for 3.1.0 < Matplotlib <= 3.1.1; see https://stackoverflow.com/questions/56948670
ax.set_ylim(len(heatmap_df), 0)

## Funding sequence chart

Consider the following question:

**What are the funding histories of graduate students in the three years leading up to their dissertation? How do the funding histories differ, and what are the most frequent funding sequences?**

To create a graphic that lets us answer this question, we need both semester level funding information and time of dissertation. In other words, we need to use a linked dataset with UMETRICS and SED. The UMETRICS data allows us to get the funding history of students, which we can use in conjunction with SED data to see what the funding histories look like leading up to the dissertation.

In the following, we use the flexibility of pandas and these visualization libraries to create an unusual kind of chart. We will display the top ten most common patterns of federal funding in the time before and during the year that a student receives the PhD. 

### Conceptual design

We have the idea, so we'll first want to think about what it will look like in the end, then work backwards to determine how we need to handle the data to create the table we'll need.

It really helps to get concrete, particularly if you aren't doing a standard kind of figure. The final visualization we're aiming for will be organized something like this:

```
funding pattern 

- - - X X X X X X | 11%
X X X X X X X X X | 10%
- - - X X - X X - | 9%
- - - X X X X X - | 8%
- - - X X X X - - | 7%    percent
- - X X X X - - - | 6%    of sample
X X - - - - - - - | 5%
- - - X X - - - - | 4%
X X X - - - - - - | 4%
- - - - X X X - - | 4%
__________________|
 -2    -1     0
      year

```
We'll put the percentages on the right, to help reinforce that we are looking back in time from the PhD award. Each row is a pattern where an `X` indicates federal funding and a `-` indicates no funding. If these were the real data, the first row would tell us that 11% of the PhD awardees had federal funding only during the last two years before their degree was awarded. The second row shows 10% with federal funding every single semester, nine in a row. The numbers here are arbitrary -- the point is to get a sense of what we're aiming for.
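
To check that the design is workable before touching the real data, we can mock it up in a few lines of plain Python. The sketch below counts identical patterns in a handful of made-up funding histories and prints them in the layout drawn above; the data and variable names are invented for illustration.

```python
from collections import Counter

# Made-up funding histories: one tuple per student, one 0/1 flag per semester
toy_histories = [
    (0, 0, 0, 1, 1, 1, 1, 1, 1),
    (0, 0, 0, 1, 1, 1, 1, 1, 1),
    (1, 1, 1, 1, 1, 1, 1, 1, 1),
    (0, 0, 0, 1, 1, 0, 1, 1, 0),
]

# Count how often each distinct pattern occurs
pattern_counts = Counter(toy_histories)

rows = []
for pattern, count in pattern_counts.most_common(10):  # ten most frequent patterns
    marks = " ".join("X" if funded else "-" for funded in pattern)
    rows.append(f"{marks} | {count / len(toy_histories):.0%}")

print("\n".join(rows))
```

The real version needs the same two steps, counting identical patterns and converting counts to shares, just with pandas and many more students.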

### Data preparation

Before we get there, though, we'll have a fair amount of data preparation. We'll plan to use a heatmap to present the pattern of yes/no funding, which means an aggregated and simplified dataset to pass into seaborn.

From end to beginning:
  - Top ten rows by % of total, nine columns of yes/no semester funding
  - ...will need to be counted from a unique student-level dataset that has nine columns of yes/no funding
  - ...pivoted from the full student X semester-level dataset we have as `semester_df`
  - ...created from those students we have covered in UMETRICS for the entire time period (we'll cut by institution)
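
The pivot and count steps in this plan can be sketched on a tiny made-up table before running them on `semester_df`. The column names below mirror ours, but the rows are invented, and `fillna(0).astype(int)` is used here as one simple way to turn missing semesters into zeros.

```python
import pandas as pd

# Long format: one row per student per semester (made-up rows)
toy_long = pd.DataFrame({
    "drf_id":      [1, 1, 1, 2, 2],
    "semester":    ["14spr", "14sum", "14fal", "14spr", "14fal"],
    "any_federal": [1, 0, 1, 1, 1],
})

# Wide format: one row per student, one column per semester
toy_wide = toy_long.pivot(index="drf_id", columns="semester", values="any_federal")
toy_wide = toy_wide[["14spr", "14sum", "14fal"]]  # put semesters in chronological order

# A semester with no record becomes NaN after pivoting; treat it as "not funded"
toy_wide = toy_wide.fillna(0).astype(int)

# Count how many students share each funding pattern
pattern_sizes = toy_wide.groupby(list(toy_wide.columns)).size()
print(pattern_sizes)
```

Here both students end up with the same pattern (funded, not funded, funded), so the result has a single row with a size of 2.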
  

In [None]:
# 1. DATA CLEANING

# Sixteen schools have UMETRICS coverage 2012-2015, and we only want them for this chart
sequence_sample = semester_df[semester_df['sequence_data_coverage'] == 1]


# 2. PIVOT FROM STUDENTxSEMESTER TO STUDENT

# We pivot to unique rows by person (drf_id) and a column for each semester
pivoted = sequence_sample.pivot(index='drf_id', columns='semester', values='any_federal')

# convert any_federal to 1 if federal funding (True); 0 if not (NaN)
pivoted = pivoted.applymap(lambda x: int(x == 1)) 

# We have two different phd_year so we'll need to adjust which semesters are relative to which
pivoted = pivoted.merge(right=person_df[['drf_id', 'phd_year']], how='inner', on='drf_id')


# 3. A WRINKLE IN THE DATA

# Oops -- our data are recorded by calendar semester, but not everyone graduated in the same year: 
# some 2014, some 2015. We want relative semesters, so we need to account for that.

# If we were going to do this more than once, we'd probably want to create a function to do it automatically...
phd2014 = pivoted[pivoted.phd_year == 2014][['12spr', '12sum', '12fal', '13spr', '13sum', '13fal', '14spr', '14sum', '14fal']]
phd2015 = pivoted[pivoted.phd_year == 2015][['13spr', '13sum', '13fal', '14spr', '14sum', '14fal', '15spr', '15sum', '15fal']]

RELATIVE_COLUMNS = ['-2 Spr', '-2 Sum', '-2 Fal', '-1 Spr', '-1 Sum', '-1 Fal', 'Spr', 'Sum', 'Fal']
phd2015.columns = phd2014.columns = RELATIVE_COLUMNS

# pd.concat() stacks them back together, now that we have re-synchronized their columns
funding_df = pd.concat((phd2014, phd2015))

# To clarify our sample, because our data are not independent of having received federal funding,
# we'll just keep those that show any federal funding during these nine semesters.
funding_df = funding_df[funding_df.sum(axis=1) > 0]


# 4. AGGREGATING TOGETHER BY PATTERNS

# We cluster the data by these semesters (equivalent to the list of columns)
# using size() to count the number of dissertators with each pattern
aggregated = funding_df.groupby(list(funding_df.columns)).size().reset_index()

# Rename the not-so-helpful 0 column from groupby().size() into 'size'
aggregated = aggregated.rename(columns={0: 'size'})
# Keep only the most common ten patterns (10 rows with the highest value in the size column)
aggregated = aggregated.nlargest(10, 'size')


# 5. MINOR BITS OF FORMATTING

# Convert the count size into percentage of the total students
aggregated['size'] = aggregated['size'] / len(pivoted)
# And format the new field so that it will be a nice percentage in our chart
aggregated['size'] = aggregated['size'].apply(lambda x: '{:,.1%}'.format(float(x)))

# Conveniently, seaborn will automatically use the index as the y-axis
aggregated = aggregated.set_index('size')
aggregated

That's exactly the table we want. In reality, putting this together took several iterations to understand exactly how the data should be combined and to minimize errors in sample construction and aggregation.

The `aggregated` df is now ready to pass into seaborn.

In [None]:
mpl.rc_file_defaults() # reset to defaults

ax = sns.heatmap(
    aggregated, 
    cbar=False, # We don't need the heatmap's color bar 
    cmap='Greens', 
    linewidth=.5,  
)
add_sourcing(plt, 'Sources: IRIS UMETRICS, NCSES SED')

# Move the y axis labels over to the right side to help guide the chronology
ax.tick_params(left=False, bottom=False, labelleft=False, labelright=True, labelrotation=1)

ax.set_title('Most Common Funding Patterns of Federally Funded PhD Recipients')
ax.set_xlabel('Semesters of funding, relative to year of PhD')
ax.set_ylabel('')

# This heatmap fix is only necessary for 3.1.0 < Matplotlib <= 3.1.1; see https://stackoverflow.com/questions/56948670
ax.set_ylim(len(aggregated), 0)

## Saving visualizations

When you are satisfied with your visualization, you will likely want a copy outside of your notebook. This can be done directly from the code using the `savefig` function of Matplotlib. When using `savefig`, the extension of the filename you choose is important. Image formats such as PNG and JPEG are **not ideal**. Instead, save visualizations as a vector image via PDF or SVG: `plt.savefig("yourfilename.pdf")`

> *Why not PNG or JPEG?* Raster image formats such as PNG and JPEG store pictures as compressed information on the colors of pixels. They are great for cases like photographs, where we want to minimize the perceived loss of visual quality while saving storage space.
> But with visualizations, we care about *semantic* components: selected fonts, precise curves, and exact distances. Vector images are recorded as coded paths with specific characteristics. A PDF or SVG saved from Matplotlib can be later opened in a program such as Inkscape or Adobe Illustrator to make useful changes. You can shrink the size of a label, change the font used in a title, adjust the position of legends, or scale the entire visualization to the size of a large poster, all with no loss in quality.
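
One detail worth knowing if you plan to edit text in an SVG afterwards: as far as we are aware, Matplotlib's default is to convert SVG text into drawn outlines, and the `svg.fonttype` setting can be changed to `"none"` to keep it as real text. The sketch below uses a made-up line chart and saves to an in-memory buffer rather than a file, just to show that the result is text-based markup. The `Agg` backend line is only there so the snippet can run as a standalone script.

```python
import io

import matplotlib
matplotlib.use("Agg")  # standalone-script backend; omit this line inside the notebook
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([2014, 2015, 2016], [55, 58, 62])  # made-up values
ax.set_title("Toy Example")

buffer = io.BytesIO()
# Temporarily ask Matplotlib to keep text as text instead of outlines
with plt.rc_context({"svg.fonttype": "none"}):
    fig.savefig(buffer, format="svg")

svg_text = buffer.getvalue().decode("utf-8")
is_vector_markup = "<svg" in svg_text   # SVG files are XML describing shapes
has_editable_text = "<text" in svg_text  # titles and labels stored as text elements
```

Passing a filename such as `"example.svg"` instead of `buffer` would write the same content to disk.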

In [None]:
mpl.rc_file_defaults() # reset to defaults

# Scale all font sizes slightly smaller
sns.set_context('paper')

grid = sns.catplot(
    data=semester_df,
    x='team_size', 
    y='modal_suborg', 
    kind='strip',
    marker='.',
    jitter=0.2,
    palette=sns.color_palette('deep'),
    order=semester_df['modal_suborg'].value_counts(ascending=True).index
)

sns.color_palette()
grid.set_axis_labels("Size of Student's Team(s) during Semester", "Modal Suborganization of Federal Funding")
add_sourcing(plt, 'Sources: IRIS UMETRICS, NCSES SED')

# Save the current state of the plot to PDF
plt.savefig("example.pdf")
plt.show()

<font color = red><h2> Checkpoint #3: Saving Visualizations</h2></font>

Try saving some of the visualizations that we've created, either from previous checkpoints, or from the examples shown. Additionally, think about the underlying data. We'll cover this in more detail later, but to export anything you create from the ADRF, you'll need to show that the underlying data for any visualizations you want to export pass the disclosure review process. Carefully considering the data and the process that goes into any visualization you save is the first step towards making sure you are able to export the figures you create.

Use the `plt.savefig()` function from above and specify the format that you are interested in. We will talk more about the disclosure review and export process in Module 2.

## More Resources

### Matplotlib

* [Matplotlib Documentation](https://matplotlib.org)

* [Matplotlib visualization tutorials](https://matplotlib.org/tutorials/index.html)

### Seaborn

* [Seaborn Documentation](http://seaborn.pydata.org)

* [Advanced Functionality in Seaborn](http://blog.insightdatalabs.com/advanced-functionality-in-seaborn)

### Colors

Tools like [Adobe Color](https://color.adobe.com) and this [Hex Calculator](https://www.w3schools.com/colors/colors_hexadecimal.asp) can help you get used to the hex triplet system.

The [official XKCD color list](https://xkcd.com/color/rgb/) lists all the named colors and their hex triplets; w3schools.com has also published an [XKCD color chart](https://www.w3schools.com/colors/colors_xkcd.asp) with larger swatches.

### Other Python Visualization Libraries

[A Dramatic Tour through Python's Data Visualization Landscape](https://dsaber.com/2016/10/02/a-dramatic-tour-through-pythons-data-visualization-landscape-including-ggplot-and-altair) discusses and compares Matplotlib, seaborn, ggplot, and Altair.

* [Plotly](https://plot.ly) focuses on interactive visualizations, including online hosting.

* [Bokeh](http://bokeh.pydata.org) prioritizes ease of use, also with an emphasis on in-browser, interactive charts.

* [ggplot](http://ggplot.yhathq.com) is largely a port of R's heavily-used ggplot2 library, inspired by *The Grammar of Graphics*.

* [Altair](https://altair-viz.github.io) is designed to be accessible and language independent, using the Vega-Lite syntax.