<img style="float: center;" src="images/CI_horizontal.png" width="600">
<center>
    <span style="font-size: 1.5em;">
        <a href='https://www.coleridgeinitiative.org'>Website</a>
    </span>
</center>

# Data Visualization in Python

## Introduction

In this module, you will learn to quickly and flexibly generate a range of visualizations to explore data and communicate with your audience. This module contains a practical introduction to data visualization in Python and covers important rules to follow when creating visualizations.

## Learning Objectives

* Learn critical rules about data visualization (selecting graph types; labeling visual encodings; referencing data sources).

* Become familiar with two core Python data visualization tools, `Matplotlib` and `seaborn`.

* Start to develop the ability to conceptualize which visualizations can best reveal various types of patterns in your data.

## Choosing a Data Visualization Package

<!-- 
Matplotlib is always capitalized, like a typical proper noun.
Seaborn is capitalized like an ordinary word, so it's lowercase if "seaborn" appears in the middle of a sentence.
-->

There are many excellent data visualization modules available in Python. You can read more about different options for data visualization in Python in the [More Resources](#More-Resources:) section at the bottom of this notebook. For this tutorial, you will stick to a tried and true combination of 2-D plotting libraries: `Matplotlib` and `seaborn`.

`Matplotlib` is very expressive, meaning that it has functionality to allow extensive and fine-tuned creation of figures. It makes no assumptions about data, so it can be used to make historical timelines and fractals as well as bar charts. `Matplotlib`'s flexibility comes at the cost of additional complexity in its use. 

`Seaborn` is a higher-level module, trading some of the expressiveness and flexibility of `Matplotlib` for a more concise and easier syntax. For our purposes, `seaborn` improves on `Matplotlib` in several ways: it makes it easier to create small multiples, improves the default colors and aesthetics, and includes direct support for some visualizations, such as regression model results. `Seaborn`'s creator, Michael Waskom, has compared the two:

> If `Matplotlib` "tries to make easy things easy and hard things possible," `seaborn` tries to make a well-defined set of hard things easy too.

### `seaborn` and `Matplotlib` together

It may seem like we need to choose between these two approaches, but happily this is not the case. `Seaborn` is itself built on top of `Matplotlib` (and you will sometimes see `seaborn` called a "wrapper" around `Matplotlib`). You can use `seaborn` to make graphs quickly, then `Matplotlib` for specific adjustments. Whenever you see `plt` referenced in the code below, you are using `matplotlib.pyplot`, a submodule of `Matplotlib`.

## Import Packages and Set Up


In [None]:
# These abbreviations (pandas -> pd; seaborn -> sns) may seem arbitrary,
# but they are community conventions that will help keep your work easy
# to read and compare with that of other Python users.

# pandas-related imports
import pandas as pd

# Numpy
import numpy as np

# database interaction imports
import sqlalchemy

import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

# Jupyter-specific "magic command" to plot images directly in the notebook.
%matplotlib inline

# Work with date
import datetime

In [None]:
# to create a connection to the database, 
# we need to pass the name of the database and host of the database

host = 'stuffed.adrf.info'
DB = 'appliedda'

connection_string = "postgresql://{}/{}".format(host, DB)
conn = sqlalchemy.create_engine(connection_string)

## Motivation

In this notebook, we are going to tackle a series of questions raised in the [Data Exploration](01_2_Data_Exploration.ipynb) notebook, using Ohio HEI data and Ohio and Indiana UI wage records. To answer them, you will be introduced to various visualizations that provide a clearer view of the data than summary statistics alone and help you create powerful graphics that better convey the point(s) you want to make.

The questions we will focus on in this notebook are:
- What is the distribution of earnings in Ohio during the first year after graduation for 2012-13 community college graduates? How does this differ by degree fields?
- How have earnings in Ohio changed over time for 2012-13 community college graduates? How do one-year earnings differ by industry?
- How do the degree fields of 2012-13 community college graduates differ across Ohio's regions?
- What are the employment patterns of 2012-13 community college graduates one year after graduation? What are their cross-state movement patterns?

## `Matplotlib`

You will begin with some straightforward `Matplotlib` functions. First, you will start with some motivating questions, and then you will go on to use the appropriate `Matplotlib` commands to create your own visualizations.

### Prepare the data

When designing visualizations, it can help to just draw a sketch on paper first. Once you have an idea of what type of graph is best suited to illustrate the fact that you want to show, you should consider how to prepare the data for the graph. 

You can provide `Matplotlib` a `pandas` `DataFrame` or `Series`. You will want to ensure that the object includes exactly the information you want to plot, because `Matplotlib` does little data preparation for you beyond simple aggregation such as binning.

### Histogram

<font color=red> **Motivating Question #1**:</font>

What is the distribution of earnings during the first year after graduation for 2012-13 community college graduates? How does this differ by degree fields?

Since earnings is a numerical variable, you will want to use a visualization that can display a variety of numerical outputs on a continuous scale, such as a histogram. You will start by plotting a histogram of a single variable and customizing the figure. For a histogram, you will want to consider the scale -- whether you should plot everything or a subset of values. Plotting your data as a histogram makes it easier to quickly observe some features, such as the overall shape of the distribution and its skewness and kurtosis. 
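
To make the idea of binning concrete, here is a minimal sketch of what a histogram computes, using only the standard library. The earnings values and the helper `bin_counts` are hypothetical and are not taken from the Ohio data; `plt.hist()` performs a comparable equal-width binning for you.

```python
def bin_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins  # assumes the values are not all identical
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin rather than a new one.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


# Hypothetical annual earnings with a long right tail
earnings = [500, 1200, 3000, 4500, 8000, 9500, 12000, 15000, 41000, 100500]
counts = bin_counts(earnings, n_bins=4)
print(counts)  # [8, 1, 0, 1]
```

With a strong right skew, most observations pile up in the first bin, which is why the choice of scale and number of bins matters.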

> Recall that in the Dataset Exploration notebook, we created `cohort_oh_jobs` by joining the community college graduates table with the quarterly Ohio UI wage records. This table has the most recent degree record of each person and their UI wage records during the four quarters after graduation.

In [None]:
qry = '''
select *
from ada_20_osu.cohort_oh_jobs
'''
df = pd.read_sql(qry, conn)

In [None]:
#Let's take a look at the table first.
df.head()

In [None]:
#The above DataFrame only has quarterly earnings. Let's use .groupby() to calculate the first year earnings
earn_1y=df.groupby(['ssn_hash'])['sumwages'].agg('sum')

earn_1y.head()

In [None]:
# Earnings distribution of 2012-13 Ohio community college graduates 
# who have positive earnings during the first year after graduation
earn_1y.describe()

An easy way to get started with `Matplotlib` is to use its state-based interface, `matplotlib.pyplot`, which we have already imported above as `plt`. We can create a graph, then adjust its current state a bit at a time using `plt` functions.

To create a new histogram, we'll simply pass our earning series into `plt.hist()`.

In [None]:
# bare histogram of earnings distribution
plt.hist(earn_1y)

# The show() function outputs the current state of `pyplot`: our current fig.
plt.show()

The output from `.describe()` above already suggested a strong right skew, but this visualization shows us the distribution in much greater detail.

Take a look at the earnings distribution. Do you need to transform the earnings, for example by top-coding the outliers or taking the log of the earnings?
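
As a rough illustration of those two options, the sketch below applies each one to a small made-up list. The values and the ceiling of 50,000 are arbitrary assumptions chosen for the example, not recommendations for the Ohio data.

```python
import math

# Hypothetical earnings with one extreme value
earnings = [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 250000]

# Top-coding: cap every value at a chosen ceiling (an assumed 50,000 here).
cap = 50000
topcoded = [min(v, cap) for v in earnings]

# Log transformation: compresses the right tail; only valid for positive values.
logged = [math.log10(v) for v in earnings]

print(max(topcoded))        # 50000
print(round(logged[-1], 2)) # 5.4
```

Top-coding changes only the extreme values, while the log transformation rescales every value, so the two choices lead to differently shaped histograms.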

Regardless, this is bare: let's at least add some labels.

In [None]:
plt.hist(earn_1y)

plt.ylabel('Count', fontsize='medium', labelpad=10)
plt.xlabel('Earnings', fontsize='medium', labelpad=10)

# In the notebook environment, the figure will automatically be
# displayed if the Python code cell ends with an update to the plot,
# so we can skip plt.show() in many cases.

###  Built-in styles

Now let's see how we can improve the style of this visualization. Every part of this figure can be customized in several ways, and `Matplotlib` includes several popular styles built-in.

In [None]:
print('Built-in style names:', ', '.join(sorted(plt.style.available)))

In [None]:
# Change the default style (affects font, color, positioning, and more)
plt.style.use('fivethirtyeight')

plt.xlabel('Earnings', fontsize='medium', labelpad=10)
plt.ylabel('Count', fontsize='medium', labelpad=10)

# We need to replot the data in each new notebook cell.
plt.hist(earn_1y, bins=30)
plt.show()

### Style customization

That's a bit better, but `Matplotlib` allows us to customize every individual component on the fly.

> *How can you reset customizations?* In a notebook with multiple figures, you may want to reset everything before your next visualization. Or, having explored several options, you might want to undo all the stylistic tweaks without having to rerun the entire notebook. `matplotlib.rc_file_defaults()` will return just about everything to default settings.

In [None]:
mpl.rc_file_defaults()

# Change the figure size -- let's make it big.
plt.rc('figure', figsize=(8, 5))

# Because `pyplot` works by incrementally updating the state of `plt`,
# some changes must be made prior to creating those elements in the figure.
# We'll make the axes spines (the box around the plot) invisible
mpl.rc('axes', edgecolor='white', titlepad=20)

# These will remove the axes ticks
mpl.rc('xtick', bottom=False)
mpl.rc('ytick', left=False)

# Now we'll replot the data. Since the default is 10, but it seems like we can capture
# more variation due to our sample size and distribution, let's try 50 bins.
n_bins = 50
plt.hist(
    earn_1y,
    bins=n_bins,
    align='left',
    color='xkcd:sage'
)

# Just after adding the data is a good time to remember to source it.
plt.annotate(
    'Sources: Ohio HEI data and UI wage record', 
    fontsize='x-small',
    xycoords="figure fraction", # specify x and y positions as % of the overall figure
    xy=(1, 0.01), # 100% to the right (x) and 1% to the top (y) means bottom right
    horizontalalignment='right', # the text will align appropriately for bottom right
)

# Add a title to the top of the figure
plt.title("Earnings of 2012-13 Ohio Community College Graduates, First Year After Graduation", fontsize='large')

# Add axis labels, with a bit more padding than default between the label and the axes
plt.xlabel('Earnings One Year Post-Graduation in Ohio', fontsize='medium', labelpad=10)
plt.ylabel('Number of 2012-13 Graduates', fontsize='medium', labelpad=10)

# Reduce the size of the axis labels
plt.xticks(fontsize=9)
plt.yticks(fontsize=9)

# Add horizontal gridlines using negative space across the bars
plt.grid(
    color='white', 
    linewidth=1,
    axis='y'
)

### Data sourcing

A critical aspect of any data visualization intended for release is a reference to the source of the data being used. In these examples, we simply reference the agencies and names of the datasets. Whenever possible, we want to provide a direct path so that our audience can find the data we used to build the figure. When this is not feasible -- as with these restricted-access data -- be sure to direct the reader to documentation describing the data.

Either way, providing clear sourcing for the underlying data is an absolute requirement of responsible dissemination. Transparent communication of sources and references builds trust between analyst and audience and helps enable the reproducibility of analyses.

In [None]:
# If we're repeatedly doing the same kind of annotation, 
# it helps a lot to turn that into a function.
def add_sourcing(plt, source_string, fontsize='x-small'):
    """Add small sourcing note to lower-right of current plot
    
    We would be using the same arguments over and over to do this.
    So a quick function will make it simpler. Now we can simply:
    
    add_sourcing(plt, 'Sources: Ohio Longitudinal Data Archive')
    """
    return plt.annotate(
        source_string,
        fontsize=fontsize,
        xycoords="figure fraction", # specify x and y positions as % of the overall figure
        xy=(1, 0.01), # 100% to the right (x) and 1% to the top (y) means bottom right
        horizontalalignment='right', # the text will align appropriately for bottom right    
    )
print("Now we can simply run:\n   add_sourcing(plt, 'Text goes here')")

### Multiple plots in one figure

`Matplotlib` allows you to make consecutive changes to the same plot, then display it whenever you are ready. The same process allows you to layer on multiple plots. By default, the first graph you create will be at the lowest layer, with each successive graph layered on top.

To create these layered plots, you just need to run `plt.hist()` multiple times on different data frames. Then, as long as you provide the correct labels for your legend, you can run `plt.legend()` once to create a comprehensive legend.
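
The layering described above can be sketched with made-up data as follows. The two samples and their labels are hypothetical stand-ins generated at random; only the pattern of calling `plt.hist()` twice and `plt.legend()` once is the point.

```python
import random

import matplotlib.pyplot as plt

random.seed(0)
# Hypothetical samples standing in for two groups of graduates
group_a = [random.gauss(20000, 5000) for _ in range(500)]
group_b = [random.gauss(26000, 6000) for _ in range(500)]

# The first call is drawn on the lowest layer; the second is drawn on top.
# A partially transparent fill (alpha) keeps both distributions visible.
plt.hist(group_a, bins=30, alpha=0.6, label="Group A")
plt.hist(group_b, bins=30, alpha=0.6, label="Group B")

# A single legend call picks up the label of every layer.
legend = plt.legend()
labels = [text.get_text() for text in legend.get_texts()]
print(labels)  # ['Group A', 'Group B']
```

Without `alpha`, the second histogram would hide whatever part of the first one it overlaps.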

<font color = red> <h2>Checkpoint #1: Histogram</h2></font>

1. Plot a histogram for students who graduated during the spring semester. Recall that we have created a degree date variable, `deg_date`. For students who graduated in spring, `deg_date`=`2013-04-01`.

2. Try to plot the histogram for autumn semester graduates on the same graph, i.e.: `deg_date`=`2012-10-01`.

You'll definitely want to include:
- A title (`plt.title`)
- Axis labels (`plt.xlabel` and `plt.ylabel`)
- Data sourcing (`plt.annotate` or the `add_sourcing` function defined above)

If you use multiple colors (as you should with layered histograms), you'll want to add a legend as well (`plt.legend`).

## `Seaborn`

`Seaborn` provides a high-level interface to `Matplotlib`, which is powerful but sometimes unwieldy. `Seaborn` provides many useful defaults, so that we can quickly have:
- More aesthetically pleasing defaults
- A better range of color palettes
- More complex graphs with less code
- Small multiples (a sequence of small graphs in one figure)

As you'll see, these libraries are complementary. Some tweaks will still require reaching back into `Matplotlib`.

### Bar chart

In this section, you will consider the differences in earnings for 2012-13 community college graduates in different degree fields. Recall that in the [Data Exploration](01_2_Dataset_Exploration.ipynb) notebook, you created two temporary tables:
- `cc_grads`: all 2012-13 Ohio community college graduates. Some people have more than one record.
- `cc_grads_recent`: only has the most recent degree record for each 2012-13 Ohio community college graduate

You will recreate the same temporary table in this notebook, just this time using one query to create `cc_grads`.

A bar plot presents categorical data with rectangular bars proportional to the values that they represent. In this case, you will graph a horizontal bar plot. A bar plot represents an estimate of central tendency for a numeric variable with the length of each rectangle, and the `seaborn` `barplot()` function also indicates the uncertainty around the estimate using error bars.
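
To give a feel for what the bar length and error bar summarize, here is a minimal sketch of a mean with a percentile-bootstrap interval, written with the standard library only. It is an approximation of the idea, not `seaborn`'s actual implementation, and the earnings values and the helper `bootstrap_ci` are hypothetical.

```python
import random
import statistics


def bootstrap_ci(values, n_boot=1000, level=0.95, seed=0):
    """Percentile-bootstrap interval for the mean of `values`."""
    rng = random.Random(seed)
    # Resample with replacement many times and record each resample's mean.
    means = sorted(
        statistics.mean(rng.choices(values, k=len(values)))
        for _ in range(n_boot)
    )
    lower_idx = int((1 - level) / 2 * n_boot)
    upper_idx = int((1 + level) / 2 * n_boot) - 1
    return means[lower_idx], means[upper_idx]


# Hypothetical first-year earnings for one degree field
earnings = [18000, 21000, 19500, 30000, 25000, 22000, 40000, 17000]
mean = statistics.mean(earnings)   # the bar length
low, high = bootstrap_ci(earnings) # the ends of the error bar
print(mean, low, high)
```

A field with few graduates or widely varying earnings will show a wider interval, which is worth keeping in mind when comparing bars.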

In [None]:
# create cc_grads in one step
qry = '''
create temp table cc_grads as
select a.*, lkp.*
from data_ohio_olda_2018.oh_hei_long a
left join data_ohio_olda_2018.oh_hei_campus_county_lkp lkp
on a.degcert_campus = lkp.campus_num
where ((a.degcert_yr_earned = '2012' and (a.degcert_term_earned = '4' or a.degcert_term_earned = '1')) or 
    (a.degcert_yr_earned = '2013' and (a.degcert_term_earned = '2' or a.degcert_term_earned = '3'))) and 
    lkp.campus_type_code in ('TC', 'SC', 'CC')
'''
conn.execute(qry)

In [None]:
# Find most recent graduation within the span of 2012-13 academic year
# also get two-digit subject code
qry = '''
create temp table cc_grads_recent as
select distinct on (ssn_hash) *, left(degcert_subject, 2) as subject
from (
SELECT *, 
    CASE WHEN degcert_term_earned = 4 THEN
        format('%%s-%%s-01', degcert_yr_earned, 7)::date 
    WHEN degcert_term_earned = 1 THEN
        format('%%s-%%s-01', degcert_yr_earned, 10)::date 
    WHEN degcert_term_earned = 2 THEN
        format('%%s-%%s-01', degcert_yr_earned, 1)::date 
    WHEN degcert_term_earned = 3 THEN
        format('%%s-%%s-01', degcert_yr_earned, 4)::date 
    END AS deg_date
    from cc_grads
) q
order by ssn_hash, deg_date DESC
'''
conn.execute(qry)

In [None]:
# read in most_recent with two-digit subject code
qry = '''
select ssn_hash,subject from cc_grads_recent
'''
subject_df = pd.read_sql(qry, conn)

In [None]:
subject_df.head()

In [None]:
# First, drop records with the subject code 'TR'
subject_df = subject_df[subject_df['subject'] != 'TR']

> Even when using the 2-digit subject code, you still have 38 degree fields. In this case, you can either group the degree fields into fewer categories or only show a few representative fields to the audience. Here, you will show the average earnings of students in the ten fields with the most graduates in 2012-13 academic year.

In [None]:
# just grabbing subject codes for 10 most popular subjects of graduates in this cohort 
subject_df.groupby(['subject'])['ssn_hash'].count().sort_values(ascending=False)[0:10]

In [None]:
# select these subjects so we can subset most_recent and add the corresponding subject description
# need to set as tuple so we can use .format() properly
pop_subs = tuple(subject_df.groupby(['subject'])['ssn_hash'].count().sort_values(
    ascending=False)[0:10].reset_index()['subject'])

pop_subs

In [None]:
# now get everyone who graduated with a degree in one of the 10 most popular subjects along with
# the year and term they graduated, as well as the corresponding subject description from the lookup table
qry= '''
select cc.ssn_hash, cc.deg_date, cc.subject, lkp.subject_desc 
from cc_grads_recent cc
join data_ohio_olda_2018.oh_subject_codes_lkp lkp
on cc.subject=lkp.subject_code_2010::varchar
where cc.subject in {}
limit 5
'''.format(pop_subs)
pd.read_sql(qry,conn)

In [None]:
# save as temp table ten_subs
qry= '''
create temp table ten_subs as
select cc.ssn_hash, cc.deg_date, cc.subject, lkp.subject_desc 
from cc_grads_recent cc
join data_ohio_olda_2018.oh_subject_codes_lkp lkp
on cc.subject=lkp.subject_code_2010::varchar
where cc.subject in {}
'''.format(pop_subs)
conn.execute(qry)

In [None]:
# Now that we have this, we can match it to the cohort_oh_jobs table because it already contains the earnings
# for most recent graduation within this time
qry = '''
select distinct t.*, j.deg_date, j.sumwages
from ten_subs t
join ada_20_osu.cohort_oh_jobs j
on j.ssn_hash = t.ssn_hash
'''
top_subs_wage = pd.read_sql(qry, conn)

In [None]:
top_subs_wage.head()

In [None]:
# Calculate each person's earnings during the first year after graduation
df_by_ssn = top_subs_wage.groupby(['ssn_hash', 'subject_desc'])['sumwages'].agg('sum').reset_index()

In [None]:
#Now the table has each person's degree field and first year earnings
#Recall that we only keep people who are in the TOP 10 fields with the most graduates
df_by_ssn.head()

In [None]:
# one-year earnings distribution by subject
df_by_ssn.groupby('subject_desc')['sumwages'].agg(['describe'])

In [None]:
mpl.rc_file_defaults() # reset most Matplotlib features to defaults

plt.rc('figure', figsize=(15, 10))

# By convention, a returned Axes object is often called `ax`
ax = sns.barplot(
    y="subject_desc", # seaborn is clever enough to create a horizontal chart
    x="sumwages", 
    data=df_by_ssn, # order in data to order in figure
    palette='vlag'
)

# We can use either `ax` or `plt` here; either will work
add_sourcing(ax, 'Sources: Ohio HEI data and UI wage record')

ax.set_title('First-Year Earnings Vary Considerably Across Degree Fields')

### Line chart

<font color=red> **Motivating Question #2**: </font>

How have 2012-13 community college graduates' earnings changed over time in Ohio? 

We can use a line plot (`Seaborn` `lineplot()` function) for tracking change in a value over time (a time series graph). 

We have created a table for you, `cohort_oh_wages_big`, in the `ada_20_osu` schema. It contains earnings by quarter over time for this cohort. To create this table, we used the entire `oh_ui_wage_by_quarter` table instead of `small_ohio_ui` when joining `cc_grads_recent` to the Unemployment Insurance wage records.

In [None]:
qry = '''
select * from ada_20_osu.cohort_oh_wages_big
'''
df_wages = pd.read_sql(qry, conn)

In [None]:
# see table
df_wages.head()

In [None]:
#Convert time_after_grad to quarters
df_wages['quarter_after_grad']=(df_wages['time_after_grad']/90).round(0).astype(int)

In [None]:
df_wages.head()

In [None]:
# group by person and time after graduation (in quarters)
df_by_ssn = df_wages.groupby(['ssn_hash', 'quarter_after_grad'])['sumwages'].agg('sum').reset_index()

In [None]:
df_by_ssn.head()

In [None]:
# earnings by time after graduation
df_by_ssn.groupby('quarter_after_grad')['sumwages'].agg(['describe'])

In [None]:
#Let's adjust the nominal earnings to real earnings with a function
def cpi_adj(year, wage):
    """Adjust nominal earnings to 2017 dollars using end-of-period CPI.

    `year` is a four-character string such as '2013'. Returns NaN when
    no CPI value is defined for that year.
    """
    ref = 247.847  # reference CPI (2017)
    cpi = {
        '2007': 211.445, '2008': 211.398, '2009': 217.347, '2010': 220.472,
        '2011': 227.223, '2012': 229.594, '2013': 232.957, '2014': 236.252,
        '2015': 237.761, '2016': 242.712, '2017': ref,
    }
    if year not in cpi:
        # A missing value (rather than a string) keeps the column numeric
        return np.nan
    return wage * ref / cpi[year]

In [None]:
#Let's create the job year variable so that we can use it in the function
#The code below gives you the first four characters of `job_date`, which is the year of the job date
df_wages['job_year']=df_wages['job_date'].astype(str).str[0:4]
df_wages['job_year'].head()

In [None]:
# Now let's adjust sumwages to its real value (2017 price level)
df_wages['real_sumwages'] = df_wages[['job_year', 'sumwages']].apply(
    lambda row: cpi_adj(row['job_year'], row['sumwages']), axis=1
).round()

In [None]:
df_by_ssn = df_wages.groupby(['ssn_hash', 'quarter_after_grad'])['real_sumwages'].agg('sum').reset_index()
df_by_ssn.head()

> Note: The title of a visualization occupies the most valuable real estate on the page. If nothing else, you can be reasonably sure a viewer will at least read the title and glance at your visualization. This is why you want to put thought into making a clear and effective title that acts as a **narrative** for your chart. It is best to avoid purely _descriptive_ titles, such as: "Earnings of 2012-13 Ohio Community College Graduates Over Time". This title is correct, yes -- but it isn't very useful. It is likely to be redundant, since "earnings" and "time" are probably labels on the axes already. Instead, use the title to reinforce and explain the core point of the visualization. It should answer the question **"Why is this graph important?"** and focus the viewer on the most critical take-away.

In [None]:
mpl.rc_file_defaults() # reset most settings to defaults

# A `with` statement (context manager) can be used to temporarily set figure styles
with sns.axes_style('darkgrid'):
    axes = sns.lineplot(data=df_by_ssn, x='quarter_after_grad', y='real_sumwages', color="#229900")
    axes.set_title('Anticipated Salary Has Been Increasing')
    
plt.xlabel('Quarters After Graduation')
plt.ylabel('Real Quarterly Earnings')

add_sourcing(plt, 'Sources: Ohio HEI data and UI wage records')

### Small multiples

Small multiples can be a great way to compare across categories, so that you can see several similarly plotted versions in the same overall figure. `Seaborn` offers an  easy interface for combining multiple plots into a single figure using the `FacetGrid` class. Because `FacetGrid` was designed for exactly this use, `seaborn` has helpful defaults such as automatically synchronized axes.

You have looked at how earnings change over time after graduation; here, you will refocus on earnings during the first year after graduation to see how they vary by industry of employment (NAICS codes):

In [None]:
# load in wages by employer table
qry = '''
select *
from ada_20_osu.cohort_oh_jobs_emp emp
join data_ohio_olda_2018.oh_naics3_codes_lkp lkp
on emp.naics_3_digit=lkp.naics_3_digit_num
'''
df_emps = pd.read_sql(qry, conn)

In [None]:
df_emps.head()

In [None]:
# see top 10 industries in this cohort
df_emps.groupby(['naics_3_digit_label'])['ssn_hash'].count().sort_values(ascending=False).reset_index()[0:10]

In [None]:
top_naics = tuple(df_emps.groupby(['naics_3_digit_label'])['ssn_hash'].count().sort_values(ascending=False).reset_index()
      [0:10]['naics_3_digit_label'])

In [None]:
top_naics

In [None]:
# subset df to only include those who found jobs in one of top_naics industries
df_naics = df_emps[df_emps['naics_3_digit_label'].isin(top_naics)]

In [None]:
df_naics.head()

In [None]:
df_by_ssn = df_naics.groupby(['ssn_hash', 'naics_3_digit_label'])['wages'].agg('sum').reset_index()

In [None]:
# one-year earnings distribution by naics
df_by_ssn.groupby(['naics_3_digit_label'])['wages'].agg(['describe'])

In [None]:
# Prepare our grid, which will share axes across multiple plots (wrapping after 5 columns)
g = sns.FacetGrid(df_by_ssn, col='naics_3_digit_label', col_wrap=5)

# Create a histogram for each cell of the grid
g = g.map(plt.hist, "wages", color="lightcoral")

add_sourcing(plt, 'Source: Ohio HEI data and UI wage record', fontsize='medium')

# Simplify the titles inside each cell
g.set_titles("{col_name}")

# Remove the spine (vertical line) along the y axis
sns.despine(left=True)

### Colors

The colors used in figures in both `Matplotlib` and `seaborn` can be represented in code in many ways, but here are two naming conventions that `Matplotlib`, `seaborn`, and many other modern visualization packages handle:

#### Hex triplets

The hex triplet is a specification for the RGB color model commonly used for website and browser-rendered colors. These are formatted as a string with a pound sign `#` followed by six hexadecimal digits. Each pair of hexadecimal digits (i.e., two of 0-9 and A-F) represents one byte of color information for red, green, and blue, in that order: `"#RRGGBB"`. A low value (minimum 00) contributes less of that primary color, a high value (maximum FF) a larger amount. Together, these can specify over 16 million colors. An additional two hex digits can be added to indicate alpha (transparency), where 00 is completely transparent and FF is completely opaque. Hex triplets are very common across many platforms and packages well beyond data visualization.
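
To see how the digits map onto color channels, here is a small sketch that decodes a hex string by hand. The helper `hex_to_rgba` is a hypothetical name written for this example; it is not part of `Matplotlib` or `seaborn`.

```python
def hex_to_rgba(color):
    """Convert '#RRGGBB' or '#RRGGBBAA' to a tuple of integers from 0 to 255."""
    digits = color.lstrip('#')
    if len(digits) not in (6, 8):
        raise ValueError(f"expected 6 or 8 hex digits, got {color!r}")
    # Each pair of hex digits is one channel: red, green, blue, then alpha.
    channels = [int(digits[i:i + 2], 16) for i in range(0, len(digits), 2)]
    if len(channels) == 3:
        channels.append(255)  # no alpha given, so treat the color as opaque
    return tuple(channels)


print(hex_to_rgba('#229900'))    # (34, 153, 0, 255)
print(hex_to_rgba('#87ae7380'))  # (135, 174, 115, 128)
```

The first color is the green used for the line chart earlier in this notebook; the second adds an alpha of `80`, roughly half transparent.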

#### XKCD names

A relatively new standard, XKCD names were the result of an online study crafted by Randall Munroe in which volunteers entered free-form names of colors displayed on screen. Following the input of tens of thousands of participants, 954 common and distinguishing names were codified. Behind the scenes, these are still equivalent to specific hex triplets, but they can be more convenient. The result is a list of color names that many English speakers will find intuitive, from basics such as "gold," "green," and "light grey" to rarely-used terms. In `Matplotlib` and `seaborn`, these are written as a string prefixed by `xkcd:`, for example: `"xkcd:cement"` (#a5a391), `"xkcd:pale magenta"` (#d767ad), `"xkcd:sage"` (#87ae73), and `"xkcd:green/blue"` (#01c08d).
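
If you want to check which hex triplet a given name resolves to, `matplotlib.colors.to_hex` accepts any color specification that `Matplotlib` understands. The short sketch below looks up the four names mentioned above.

```python
from matplotlib import colors as mcolors

names = ['xkcd:cement', 'xkcd:pale magenta', 'xkcd:sage', 'xkcd:green/blue']

# Resolve each XKCD name to its equivalent hex triplet
resolved = {name: mcolors.to_hex(name) for name in names}
for name, hex_value in resolved.items():
    print(name, '->', hex_value)
```

This can be handy when you need to reuse a color in a tool that only accepts hex triplets.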

<font color = red><h2> Checkpoint #2: Small Multiples </h2></font>

Try using `sns.FacetGrid` in combination with a histogram, bar, or line chart. Separating simple charts into several categories with small multiples can be a big improvement over trying to graph several things on the same chart.

Try experimenting with color choices. Remember to add your source(s) and use a title that highlights the main conclusion.

For example, you can explore the earnings distributions for community college students that graduated in different Ohio job regions (`jobsohioregion`). 

Hint: You will need the temporary table `cc_grads_recent` that you created earlier. You will also need the `cohort_oh_jobs` to get the quarterly earnings one year after graduation. Try to join these two tables first and consider what variables you should use.

## More visualization methods and motivating examples

### Heat map

<font color=red> **Motivating Question #3:**</font>

How do the degree fields of 2012-13 community college graduates differ across Ohio's regions?

For something like this, we might want to use a heatmap. This can give a sort of visual summary similar to a crosstab.
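
As a reminder of what a crosstab contains, here is a toy example with `pandas`. The five records and the column name `region` are made up for illustration; a heatmap simply colors each cell of such a table by its count.

```python
import pandas as pd

# Hypothetical records: one row per graduate
toy = pd.DataFrame({
    'subject_desc': ['Nursing', 'Nursing', 'Business', 'Business', 'Nursing'],
    'region': ['Central', 'Northeast', 'Central', 'Central', 'Central'],
})

# Rows are degree fields, columns are regions, cells are counts of graduates
counts = pd.crosstab(toy['subject_desc'], toy['region'])
print(counts)
```

Combinations that never occur in the data, such as Business in the Northeast here, appear as zeros rather than being dropped.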

In [None]:
# recall the temp table ten_subs, which subsets our 2012-13 cohort to those who graduated in one of the
# 10 most popular subjects
qry = '''
select * 
from ten_subs 
limit 5
'''
pd.read_sql(qry, conn)

In [None]:
# get jobsohioregion by matching ten_subs back to cc_grads_recent
qry = '''
create temp table region as
select t.*, cc.jobsohioregion
from ten_subs t
left join cc_grads_recent cc
on t.ssn_hash = cc.ssn_hash and cc.deg_date = t.deg_date
'''
conn.execute(qry)

In [None]:
# read into python
qry = '''
select * from region
'''
df_region = pd.read_sql(qry, conn)

In [None]:
# job region breakdown
df_region.groupby(['jobsohioregion'])['ssn_hash'].count()

In [None]:
heatmap_df = pd.crosstab(df_region['subject_desc'], df_region['jobsohioregion'])

ax = sns.heatmap(heatmap_df, cmap=sns.cubehelix_palette(light=1, as_cmap=True))
add_sourcing(ax, 'Sources: Ohio Longitudinal Data Archive')

# This heatmap fix is only necessary for 3.1.0 < Matplotlib <= 3.1.1; see https://stackoverflow.com/questions/56948670
ax.set_ylim(len(heatmap_df), 0)

## Employment sequence chart

<font color=red> **Motivating Question #4:**</font>

What are the employment patterns of 2012-13 community college graduates one year after graduation? What are their cross-state movement patterns?

To create a graphic that lets us answer this question, we need both time of graduation and quarterly earnings during the four quarters after graduation. In other words, we need to use both HEI data and UI wage records to create several dummy variables to indicate whether a person was employed during a quarter.

In the following, we use the flexibility of `pandas` and these visualization libraries to create an unusual kind of chart. We will display the ten most common patterns of employment in the time after students receive their degrees.

### Conceptual design

We have the idea, so we'll first want to think about what it will look like in the end, then work backwards to determine how we need to handle the data to create the table we'll need.

It really helps to get concrete, particularly if you aren't doing a standard kind of figure. The final visualization we're aiming for will be organized something like this:

```
  employment pattern 

     - - - X | 11%
     X X X X | 10%
     X - X - | 9%
     - - - X | 8%
     - - X X | 7%    percent
     - X X X | 6%    of sample
     X X - - | 5%
     X X - X | 4%
     X X X - | 4%
     - - - - | 4%
    _________|
    0     1
      year

```
Each row is a pattern in which an `X` indicates positive earnings during that quarter and a `-` indicates no earnings in the UI wage records. If these were the real data, the first row would tell us that 11% of the 2012-13 community college graduates had positive earnings in Ohio or Indiana only in the fourth quarter after graduation. The second row would show that 10% of graduates had positive earnings in Ohio or Indiana in all four quarters after graduation. The numbers here are arbitrary -- the point is to get a sense of what we're aiming for.
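
The pattern rows in the sketch can be built and counted with a few lines of plain Python. The five people below and the helper `pattern` are hypothetical; the code only illustrates how quarters with earnings turn into `X`/`-` strings and shares of the sample.

```python
from collections import Counter

# Hypothetical data: for each person, the quarters (1-4) with positive earnings
employed_quarters = {
    'a': {1, 2, 3, 4},
    'b': {4},
    'c': {1, 2, 3, 4},
    'd': set(),
    'e': {3, 4},
}


def pattern(quarters, n_quarters=4):
    """Render quarters with earnings as a string such as '- - X X'."""
    return ' '.join('X' if q in quarters else '-' for q in range(1, n_quarters + 1))


# Count how many people share each pattern, then express it as a share of the sample
pattern_counts = Counter(pattern(q) for q in employed_quarters.values())
total = len(employed_quarters)
for pat, n in pattern_counts.most_common(10):
    print(f'{pat} | {n / total:.0%}')
```

Note that the shares are computed against everyone in the sample, including people with no earnings in any quarter.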

You've already done the analysis numerically in the Data Exploration notebook. Now, you need to visualize your findings in `seaborn`.
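
Before working with the real tables, it can help to try the reshaping idea on a few made-up rows. The sketch below uses a hypothetical toy table (the names `toy`, `person`, and `quarter` are illustrative and not part of the course data) and turns one-row-per-job records into the `X`/`-` patterns from the conceptual design.

```python
import pandas as pd

# Hypothetical toy wage records: one row per person per quarter with earnings
toy = pd.DataFrame({
    'person': ['a', 'a', 'a', 'a', 'b', 'c', 'c', 'd'],
    'quarter': [1, 2, 3, 4, 4, 1, 2, 4],
    'wages': [100, 120, 90, 110, 300, 50, 60, 200],
})

# One row per person, one column per quarter; 1 = any earnings, 0 = none
wide = (toy.groupby(['person', 'quarter'])['wages'].count()
           .unstack('quarter')
           .reindex(columns=[1, 2, 3, 4])
           .fillna(0)
           .clip(upper=1)
           .astype(int))

# Render each person's row as a string such as '- - - X'
patterns = wide.apply(lambda row: ' '.join('X' if v else '-' for v in row), axis=1)

# Share of people with each pattern, most common first
pattern_share = patterns.value_counts(normalize=True)
print(pattern_share)
```

In this toy example, half of the four people have the pattern `- - - X`. The cells that follow apply the same group, unstack, and count steps to the real data.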

In [None]:
qry = '''
select *
from ada_20_osu.cohort_in_jobs
limit 1
'''
pd.read_sql(qry, conn)

In [None]:
qry = '''
select *
from ada_20_osu.cohort_oh_jobs
limit 1
'''
pd.read_sql(qry, conn)

In [None]:
# temp table of two job tables unioned
qry = '''
create temp table jobs_combined as 
select *, 'in' as state
from ada_20_osu.cohort_in_jobs
union
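-- assumption: the column missing from the original list is time_after_grad,
-- which later cells read from df_combined
select ssn_hash, deg_date, job_date, sumwages, time_after_grad, 'oh' as state 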
from ada_20_osu.cohort_oh_jobs
'''
conn.execute(qry)

In [None]:
qry = '''
select * from jobs_combined
'''
df_combined = pd.read_sql(qry, conn)

In [None]:
df_combined.head()

In [None]:
#Convert time_after_grad to quarters
df_combined['quarter_after_grad']=(df_combined['time_after_grad']/90).round(0).astype(int)
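
To see what this conversion does, here is a small sketch on made-up values. It assumes `time_after_grad` is a number of days, which the code above suggests but the data dictionary should confirm. The variable `days_after_grad` is hypothetical.

```python
import pandas as pd

# Hypothetical values: days between graduation and each wage record
days_after_grad = pd.Series([30, 90, 100, 180, 275, 365])

# Same arithmetic as the cell above: 90-day quarters, rounded to the nearest integer
quarters = (days_after_grad / 90).round(0).astype(int)

print(quarters.tolist())  # [0, 1, 1, 2, 3, 4]
```

Records from the first several weeks after graduation round down to quarter 0, and the renaming step later labels only quarters 1 through 4, so it is worth checking which quarter values actually occur in your data.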

In [None]:
df_combined.groupby(['ssn_hash', 'quarter_after_grad'])['wages'].count().unstack(['quarter_after_grad'])

In [None]:
df_tmp = df_combined.groupby(['ssn_hash', 'quarter_after_grad'])['wages'].count().unstack(['quarter_after_grad'])

In [None]:
# replace NaN with 0
df_tmp.fillna(0, inplace=True)

# and set values >1 to 1, so each cell is a 0/1 employment indicator
df_tmp[df_tmp>1] = 1

In [None]:
# make ID value a column instead of an index - then we can count it when we group by the quarter columns
df_tmp.reset_index(inplace=True)
df_tmp=df_tmp.rename(columns={1:'Q1',2:'Q2',3:'Q3',4:'Q4'})
df_tmp.head()

In [None]:
# group by all columns to count number of people with the same pattern
df_tmp.groupby(['Q1','Q2','Q3','Q4'])['ssn_hash'].count().reset_index().sort_values(by='ssn_hash', ascending = False)

In [None]:
# grab the top 10 for a visualization
df_tmp_top = df_tmp.groupby(['Q1','Q2','Q3','Q4'])['ssn_hash'].count().reset_index().sort_values(
    by='ssn_hash', ascending = False)[0:10]

In [None]:
df_tmp_top

In [None]:
# load the full cohort to use as the denominator when calculating proportions
qry = '''
select * from cc_grads_recent
'''
df = pd.read_sql(qry, conn)

In [None]:
# calculate percentage of cohort in each group:
df_tmp_top['pct_cohort'] = df_tmp_top['ssn_hash'].astype(float) / df['ssn_hash'].nunique()

In [None]:
df_tmp_top
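
One thing to check in the percentage calculation: `df_tmp` is built from wage records, so if the job tables hold rows only for quarters with earnings (an assumption about these tables), graduates with no earnings in any quarter never appear in it, and the all-`-` pattern from the conceptual design would be missing. The sketch below uses hypothetical toy data (`cohort_ids` and `employed` are illustrative names) to show how reindexing against the full cohort could restore that pattern.

```python
import pandas as pd

# Hypothetical cohort of five graduates; 'e' has no wage records at all
cohort_ids = pd.Index(['a', 'b', 'c', 'd', 'e'], name='person')

# Hypothetical employment indicators for people who appear in the wage data
employed = pd.DataFrame(
    {'Q1': [1, 0, 1, 0], 'Q2': [1, 0, 1, 0], 'Q3': [1, 0, 0, 0], 'Q4': [1, 1, 0, 1]},
    index=pd.Index(['a', 'b', 'c', 'd'], name='person'),
)

# Add an all-zero row for cohort members missing from the wage data
full = employed.reindex(cohort_ids, fill_value=0)

# Count people per pattern and divide by the cohort size
counts = full.groupby(['Q1', 'Q2', 'Q3', 'Q4']).size().sort_values(ascending=False)
pct_cohort = counts / len(cohort_ids)
print(pct_cohort)
```

Because every cohort member now has a row, the shares across all patterns sum to 1. The top ten patterns alone will generally sum to less than that.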

In [None]:
cols_for_viz = ['Q1','Q2','Q3','Q4']
# visualize with a simple heatmap
sns.heatmap(df_tmp_top[cols_for_viz],cmap=sns.cubehelix_palette(light=1, as_cmap=True))

The default visualization leaves a lot to be desired. Now let's customize the same heatmap.

In [None]:
# Create the matplotlib object so we can tweak graph properties later
fig, ax = plt.subplots(figsize = (14,8))

# create the list of labels we want on our y-axis
ylabs = ['{:.2f}%'.format(x*100) for x in df_tmp_top['pct_cohort']]

# make the heatmap, drawing on the axes created above
sns.heatmap(df_tmp_top[cols_for_viz], linewidths=0.01, linecolor='grey', yticklabels=ylabs, cbar=False, cmap="Blues", ax=ax)

# make y-labels horizontal and change tickmark font size
plt.yticks(rotation=0, fontsize=12)
plt.xticks(fontsize=12)

# add axis labels
ax.set_ylabel('Percent of cohort', fontsize=16)
ax.set_xlabel('Quarter after graduation', fontsize=16)

## Data Sourcing:
ax.annotate('Source: Ohio HEI data, Ohio UI, and Indiana UI', 
            xy=(0.5,-0.15), xycoords="axes fraction", fontsize=12)

## add a title
fig.suptitle('Top 10 most common employment patterns of cohort', fontsize=18)
ax.set_title('Blue is "employed" and white is "not employed"', fontsize=12)

plt.show()

<font color = red><h2> Checkpoint #3: Compare with Indiana </h2></font>

How do these employment patterns differ for those who found employment in Indiana?

Hint: You can use the DataFrame you created above, `df_combined`, and limit the records to those where `df_combined['state'] == 'in'`. Then follow the steps shown above.
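
As a minimal sketch of the filtering step only, here is the same idea on a hypothetical toy DataFrame (`df_toy` is illustrative; the column names mirror `df_combined`).

```python
import pandas as pd

# Hypothetical toy records with a state column like the one in df_combined
df_toy = pd.DataFrame({
    'ssn_hash': ['a', 'a', 'b', 'c'],
    'state': ['oh', 'in', 'in', 'oh'],
    'wages': [100, 200, 300, 400],
})

# `==` compares values and returns a boolean mask; a single `=` would be an assignment
in_mask = df_toy['state'] == 'in'

# Keep only the rows where the mask is True
df_in = df_toy[in_mask]
print(df_in)
```

Applying the same mask to `df_combined` gives you the Indiana records to feed through the grouping and plotting steps.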

## More Resources:

### `Matplotlib`

* [Matplotlib Documentation](https://matplotlib.org)

* [Matplotlib visualization tutorials](https://matplotlib.org/tutorials/index.html)

### `Seaborn`

* [Seaborn Documentation](http://seaborn.pydata.org)

* [Advanced Functionality in Seaborn](http://blog.insightdatalabs.com/advanced-functionality-in-seaborn)

### Colors

Tools like [Adobe Color](https://color.adobe.com) and this [Hex Calculator](https://www.w3schools.com/colors/colors_hexadecimal.asp) can help you get used to the hex triplet system.

The [official XKCD color list](https://xkcd.com/color/rgb/) lists all the named colors and their hex triplets; w3schools.com has also published an [XKCD color chart](https://www.w3schools.com/colors/colors_xkcd.asp) with larger swatches.
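
If you want to experiment with these color formats in code, the sketch below shows one way to do it with `matplotlib.colors`. The color values are arbitrary examples chosen for illustration.

```python
from matplotlib import colors

# A hex triplet packs red, green, and blue as two hexadecimal digits each
hex_code = '#ff8000'
red, green, blue = (int(hex_code[i:i + 2], 16) for i in (1, 3, 5))
print(red, green, blue)  # 255 128 0

# Matplotlib converts between hex strings and RGB tuples scaled from 0 to 1
rgb = colors.to_rgb(hex_code)
round_trip = colors.to_hex(rgb)

# XKCD color names can be used anywhere Matplotlib accepts a color, with the 'xkcd:' prefix
sky_blue_hex = colors.to_hex('xkcd:sky blue')
print(round_trip, sky_blue_hex)
```

Any of these forms (hex string, RGB tuple, or `xkcd:` name) can be passed to the `color` arguments used in the plots above.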

### Other Python Visualization Libraries

[A Dramatic Tour through Python's Data Visualization Landscape](https://dsaber.com/2016/10/02/a-dramatic-tour-through-pythons-data-visualization-landscape-including-ggplot-and-altair) discusses and compares Matplotlib, seaborn, ggplot, and Altair.

* [Plotly](https://plot.ly) focuses on interactive visualizations, including online hosting.

* [Bokeh](http://bokeh.pydata.org) prioritizes ease of use, also with an emphasis on in-browser, interactive charts.

* [ggplot](http://ggplot.yhathq.com) is largely a port of R's heavily-used ggplot2 library, inspired by *The Grammar of Graphics*.

* [Altair](https://altair-viz.github.io) is designed to be accessible and language independent, using the Vega-Lite syntax.