# Supplemental Visualizations

Benjamin Feder, Brian Kim, Robert Truex, Matthew VanEseltine, Ekaterina Levitskaya, Allison Nunez

## Introduction

In this notebook, we describe how to create two types of visualizations: a funding sequence chart and a choropleth map. These use techniques that build on the material covered in the Data Visualization notebook.

## Funding sequence chart

Consider the following question:

**What are the funding histories of graduate students in the three years leading up to their dissertation? How do the funding histories differ, and what are the most frequent funding sequences?**

To create a graphic that answers this question, we need both semester-level funding information and the time of the dissertation. In other words, we need a dataset that links UMETRICS and SED. The UMETRICS data provide each student's funding history, which can be combined with the SED data to see what funding looked like leading up to the dissertation.

The example below displays the most common patterns of federal funding in the years leading up to, and including, the year in which a student receives the PhD. The conceptual design shows ten patterns; the code that follows plots the top 15.

### Conceptual design

The final visualization will be organized the following way:

```
funding pattern 

- - - X X X X X X | 11%
X X X X X X X X X | 10%
- - - X X - X X - | 9%
- - - X X X X X - | 8%
- - - X X X X - - | 7%    percent
- - X X X X - - - | 6%    of sample
X X - - - - - - - | 5%
- - - X X - - - - | 4%
X X X - - - - - - | 4%
- - - - X X X - - | 4%
__________________|
 -2    -1     0
      year

```
Each row is a pattern in which an `X` indicates federal funding and a `-` indicates no funding. If these were the real data, the first row would show that 11% of the PhD awardees had federal funding only during the last two years before their degree was awarded. The second row shows 10% with federal funding in every semester, nine in a row. The numbers here are arbitrary.

### Data preparation

From end to beginning:
  - Top ten rows by % of total, nine columns of yes/no semester funding
  - ...will need to be counted from a unique student-level dataset that has nine columns of yes/no funding
  - ...pivoted from the full student X semester-level dataset as `semester_df`
  - ...created from those students covered in UMETRICS for the entire time period (cut by institution)
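
The steps in this list can be sketched on a small made-up dataset before running them on the real tables. Everything in this sketch is hypothetical: the data frame `toy_long`, its columns `id`, `semester`, and `funded`, and the values themselves exist only to illustrate the long-to-wide pivot and the pattern count.

```r
library(dplyr)
library(tidyr)

# Hypothetical long data: one row per student and semester
toy_long <- tibble(
  id       = rep(1:4, each = 3),
  semester = rep(c("s1", "s2", "s3"), times = 4),
  funded   = c("no",  "yes", "yes",
               "yes", "yes", "yes",
               "no",  "yes", "yes",
               "yes", "no",  "no")
)

# Pivot to one row per student, count students per pattern, keep the top two
toy_top <- toy_long %>%
  pivot_wider(names_from = semester, values_from = funded) %>%
  count(s1, s2, s3, name = "n_students", sort = TRUE) %>%
  mutate(share = n_students / sum(n_students)) %>%
  slice_head(n = 2)

toy_top
```

Here the most common pattern (no, yes, yes) is shared by two of the four students. The notebook follows the same idea with twelve semester columns and the 15 most frequent patterns.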
  

In [None]:
# database interaction imports
library(odbc)

# for data manipulation/visualization
library(tidyverse)
library(lubridate)
library(sf)
library(maps)

# for calculating percentages
library(scales)

# for easier viewing of graphs
theme_set(theme_gray(base_size = 24))
options(repr.plot.width = 20, repr.plot.height = 12)

In [None]:
# Connect to the database
con <- DBI::dbConnect(odbc::odbc(),
                     Driver = "SQL Server",
                     Server = "msssql01.c7bdq4o2yhxo.us-gov-west-1.rds.amazonaws.com",
                     Trusted_Connection = "True")

#### Step 1: Restrict to the sixteen schools with UMETRICS coverage for 2012-2015

In [None]:
# filter for these institutions (comes from the joined_semester.sql file)

qry <- "
select *
from ds_nsf_ncses.dbo.nsf_sed
WHERE phdfy = '2015' and 
phdinst in ('104179', '110680', '139658', '141574', '151351', '153658', '155317', '164988', '170976', '201885', '204796', '209542', '214777', '228778', '240444', '243780')
"
sed_upd_cohort <- dbGetQuery(con, qry)

In [None]:
# see sed_upd_cohort
head(sed_upd_cohort)

In [None]:
# see amount of people
nrow(sed_upd_cohort)

#### Step 2: Match the 2015 cohort to UMETRICS data

In [None]:
# join sed to umetrics using umetrics_xwalk
qry <- "
select c.*, d.semester, d.team_size from (
select a.*, b.emp_number 
from ds_nsf_ncses.dbo.nsf_sed a
inner join tr_uncf_excelencia.dbo.sed_umetrics_xwalk b
on a.drf_id = b.drf_id
where a.phdfy = '2015' and a.phdinst in ('104179', '110680', '139658', '141574', '151351', '153658', '155317', '164988', '170976', 
    '201885', '204796', '209542', '214777', '228778', '240444', '243780') ) c
inner join ds_iris_umetrics.dbo.semester d
on c.emp_number = d.emp_number
"
cohort_joined <- dbGetQuery(con, qry)

In [None]:
# see cohort_joined
head(cohort_joined)

In [None]:
names(cohort_joined)

In [None]:
# see number of rows relative to individuals
cohort_joined %>%
    summarize(n=n(), n_people = n_distinct(drf_id))

#### Step 3: Transform data frame

In [None]:
unique(cohort_joined$semester)

In [None]:
# relabel each semester relative to the graduation year (e.g. '-1 Spr'); semesters in the graduation year keep only the term name
cohort_joined <- cohort_joined %>%
    mutate(
        sem_structure = paste(as.character(as.numeric(substring(semester, 1, 4)) - as.numeric(phdcy)), substring(semester, 6, 8))
    ) %>%
    mutate(sem_fix_month = case_when(
        word(sem_structure, 2) == 'may' ~ paste(word(sem_structure, 1), 'Sum'),
        word(sem_structure, 2) == 'jan' ~ paste(word(sem_structure, 1), 'Spr'),
        TRUE ~ paste(word(sem_structure, 1), 'Fal'),
        )
          ) %>%
    mutate(upd_semester = case_when (
        word(sem_fix_month, 1) == '0' ~ word(sem_fix_month, 2),
        TRUE ~ sem_fix_month
    )
          ) %>%
    select(-c(sem_structure, sem_fix_month))
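
The relabeling logic is easier to check on a few rows. The sketch below reproduces it on made-up records. The `YYYY-mon` semester format is an assumption inferred from the `substring()` positions used above, and the data frame `toy` is hypothetical.

```r
library(dplyr)
library(stringr)

# Hypothetical records; the "YYYY-mon" semester format is assumed, not confirmed
toy <- tibble(
  semester = c("2013-aug", "2014-jan", "2015-may"),
  phdcy    = c("2015", "2015", "2015")
)

toy <- toy %>%
  mutate(
    # years between the semester and the graduation year (0 = graduation year)
    rel_year = as.numeric(str_sub(semester, 1, 4)) - as.numeric(phdcy),
    # map the month abbreviation to a term, as in the cell above
    term = case_when(
      str_sub(semester, 6, 8) == "jan" ~ "Spr",
      str_sub(semester, 6, 8) == "may" ~ "Sum",
      TRUE ~ "Fal"
    ),
    # the graduation year keeps only the term name
    upd_semester = if_else(rel_year == 0, term, paste(rel_year, term))
  )

toy$upd_semester
```

The three rows become `-2 Fal`, `-1 Spr`, and `Sum`, matching the column names selected later when the data are pivoted.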


In [None]:
# see updated dataframe
head(cohort_joined)

In [None]:
# received federal funding if team size is at least 1
cohort_joined <- cohort_joined %>%
    mutate(
        fed_funding = ifelse(team_size >= 1, 1, 0)
    )

In [None]:
unique(cohort_joined$upd_semester)

In [None]:
# complete cohort_joined so that every student has a row for every semester, with fed_funding = 0 where no funding record exists
cohort_joined <- cohort_joined %>%
    complete(drf_id,upd_semester, fill=list(fed_funding=0))

unique(cohort_joined$fed_funding)
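
As a minimal illustration of what `complete()` does here, the sketch below uses a hypothetical three-row data frame `toy` in which student 2 has no fall record.

```r
library(dplyr)
library(tidyr)

# Hypothetical funding records: student 2 has no row for "Fal"
toy <- tibble(
  id          = c(1, 1, 2),
  semester    = c("Spr", "Fal", "Spr"),
  fed_funding = c(1, 1, 1)
)

# Add the missing id/semester combinations and fill fed_funding with 0
toy_full <- toy %>%
  complete(id, semester, fill = list(fed_funding = 0))

toy_full
```

The result has four rows, one per student and semester. Columns not named in `fill` are set to `NA` in the added rows, which is why the SED columns in `cohort_joined` are missing for the semesters added by this step.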

In [None]:
# see updated dataframe
head(cohort_joined)

In [None]:
# find federal funding presence by semester
by_sem <- cohort_joined %>%
    group_by(drf_id, upd_semester) %>%
    summarize(fed_pres = sum(fed_funding)) %>%
    ungroup() %>%
    # use >= 1 so that a student with several funded records in one semester still counts as funded
    mutate(fed_ind = ifelse(fed_pres >= 1, 'yes', 'no')) %>%
    select(-c(fed_pres))

head(by_sem)

In [None]:
# use pivot_wider
funding_by_sem <- by_sem %>% 
    pivot_wider(names_from = upd_semester, values_from = fed_ind) %>%
    select(drf_id, "-3 Spr", "-3 Sum", "-3 Fal", "-2 Spr", "-2 Sum", "-2 Fal", "-1 Spr", "-1 Sum", "-1 Fal", "Spr", "Sum", "Fal")

head(funding_by_sem)

#### Step 4: Find Counts per Individual and Plot Patterns

In [None]:
# find counts per pattern
patterns <- funding_by_sem %>%
            group_by(`-3 Spr`, `-3 Sum`, `-3 Fal`, `-2 Spr`, `-2 Sum`, `-2 Fal`, `-1 Spr`, `-1 Sum`, `-1 Fal`, `Spr`, `Sum`, `Fal`) %>%
            summarise(count = n_distinct(drf_id)) %>%
            arrange(desc(count)) %>%
            ungroup()

head(patterns)

In [None]:
patterns <- patterns %>%
    mutate(pct = percent(count/sum(count), accuracy = .01))

head(patterns)
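
For reference, here is a small sketch of how `percent()` from the `scales` package formats shares. The vector `toy_counts` is made up for illustration.

```r
library(scales)

# Hypothetical counts for three patterns
toy_counts <- c(50, 30, 20)

# accuracy = 0.01 rounds to two decimal places
toy_pct <- percent(toy_counts / sum(toy_counts), accuracy = 0.01)

toy_pct
```

The result is a character vector (`"50.00%"`, `"30.00%"`, `"20.00%"`), so any sorting should be done on the numeric counts before formatting, as in the cells above.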

In [None]:
# grab first 15 patterns
patterns_graph <- patterns %>%
    head(15)

In [None]:
# Save counts to use later in the heatmap - we cannot use the counts as index, as there could be duplicate values 
counts <- patterns_graph$count
pcts <- patterns_graph$pct

In [None]:
patterns_graph$Pattern <- seq.int(nrow(patterns_graph))
patterns_graph$count <- NULL
patterns_graph$pct <- NULL

In [None]:
patterns_graph

Next, convert this table from wide to long format, since `geom_tile()` expects one row per tile. Whereas `pivot_wider()` was used to create `funding_by_sem`, here use `pivot_longer()` to create a data frame in which each row corresponds to a pattern/semester/status combination.

In [None]:
# convert to long format
patterns_long <- pivot_longer(patterns_graph, cols = -Pattern, names_to = 'Semester', values_to = 'Status')

In [None]:
# see patterns_long
head(patterns_long)
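
To see the reshaping in isolation, the sketch below applies `pivot_longer()` to a hypothetical two-pattern, two-semester table named `toy_wide`.

```r
library(tibble)
library(tidyr)

# Hypothetical wide table: one row per pattern, one column per semester
toy_wide <- tibble(
  Pattern = 1:2,
  Spr     = c("yes", "no"),
  Fal     = c("yes", "yes")
)

# Every column except Pattern becomes a Semester/Status pair
toy_long <- pivot_longer(toy_wide, cols = -Pattern,
                         names_to = "Semester", values_to = "Status")

toy_long
```

Two patterns across two semesters give four rows, one for each tile that `geom_tile()` will draw.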

Now create the visualization using `geom_tile` in `ggplot`:

In [None]:
# initial plot

ggplot(data = patterns_long, aes(x = Semester, y = Pattern)) + 
geom_tile(aes(fill = Status), colour = 'black')

In [None]:
# make 1 the highest level and sort the semesters in order

# provide order to display the semesters
sem_order <- c("-3 Spr", "-3 Sum", "-3 Fal", "-2 Spr", "-2 Sum", "-2 Fal", "-1 Spr", "-1 Sum", "-1 Fal", "Spr", "Sum", "Fal")

levels = ordered(c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15))  # specify in which order to add the rows from our wide table (called "patterns") 
                                                             # we want to preserve the same ordering of rows as they are sorted in the table from highest to lowest

# build the plot once, store it as viz for reuse, then display it
viz <- ggplot(data = patterns_long, aes(x = factor(Semester, levels=sem_order), y = ordered(Pattern, levels=rev(levels)))) +    # sort y-axis according to levels specified above
geom_tile(aes(fill = Status), colour = 'black')

viz
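
The explicit ordering matters because ggplot2 arranges a discrete axis by factor levels, and a character column converted to a factor gets sorted levels by default. A minimal base R sketch, using hypothetical labels:

```r
# Hypothetical semester labels, listed in the order we want them displayed
toy_order  <- c("-1 Fal", "Spr", "Sum", "Fal")
toy_labels <- c("Fal", "Spr", "-1 Fal", "Sum", "Spr")

# Without explicit levels, the levels are sorted rather than chronological
default_levels <- levels(factor(toy_labels))

# With explicit levels, the factor follows the order we supply
ordered_levels <- levels(factor(toy_labels, levels = toy_order))

print(default_levels)
print(ordered_levels)
```

In the default ordering `Fal` is placed before `Spr`, which would put the fall semester ahead of spring on the x-axis.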

In [None]:
# change color palette, specify font size, have semesters at top, add titles, and rename y-axis ticks

viz +
scale_fill_brewer(palette = "Set1") +                                                        # specify a color palette
theme(text=element_text(size=14,face="bold")) +                                                          # specify font size
scale_x_discrete(position = 'top') +                                                         # include x-axis labels on top of the plot
labs(
    y = "Individuals - Percentages",
    title = 'Federal Funding Patterns by Semester',
    caption = 'Source: SED NCSES and UMETRICS data',
    x = "Semester"
) +                                               
scale_y_discrete(labels=rev(pcts))  # rename the y-axis ticks to show the percentages from the table

The y-axis ticks can also show counts instead of percentages.

In [None]:
# Full code for the plot

levels = ordered(c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15))  # specify in which order to add the rows from our wide table (called "patterns") 
                                                             # we want to preserve the same ordering of rows as they are sorted in the table from highest to lowest

ggplot(data = patterns_long, aes(x = factor(Semester, levels=sem_order), y = ordered(Pattern, levels=rev(levels)))) +    # sort y-axis according to levels specified above
geom_tile(aes(fill = Status), colour = 'black') +                                            # fill the table with value from Status column, create black contouring
scale_fill_brewer(palette = "Set1") +                                                        # specify a color palette
theme(text=element_text(size=14,face="bold")) +                                                          # specify font size
scale_x_discrete(position = 'top') +                                                         # include x-axis labels on top of the plot
labs(
    y = "Individuals - Counts",
    title = 'Federal Funding Patterns by Semester',
    caption = 'Source: SED NCSES and UMETRICS data',
    x = "Semester"
) +                                               
scale_y_discrete(labels=rev(counts))  # rename the y-axis ticks to correspond to the counts from the table

## Choropleth map

This example explains how to show regional differences in the number of graduates by state using a map. A choropleth map shades each region according to a value, which makes regional differences easy to compare and to communicate to external audiences.

First, generate the base table of graduates by the state of their doctoral institution.

Get the 2015 SED cohort, with the individual ID (`drf_id`) and institution code (`phdinst`):

In [None]:
# Get the 2015 SED cohort
qry <- "
select drf_id, phdinst
from ds_nsf_ncses.dbo.nsf_sed
where phdfy = '2015'
"
cohort_2015 <- dbGetQuery(con, qry)

In [None]:
head(cohort_2015)

Get the table with the geographic location of educational institutions (using IPEDS code):

In [None]:
# Get the table with the geographic location of institutions
qry <- "
select *
from ds_public_1.dbo.ipeds_location
"
institution_location <- dbGetQuery(con, qry)

In [None]:
head(institution_location)

These two tables share a common variable, the IPEDS code for the educational institution: `phdinst` in the `cohort_2015` table and `unitid` in the `institution_location` table.

Merge the two tables on this common variable:

In [None]:
cohort_inst_location <- merge(cohort_2015, institution_location, by.x ='phdinst', by.y = 'unitid')

In [None]:
head(cohort_inst_location)
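
By default `merge()` keeps only rows that match in both tables, so graduates whose institution code is missing from the location table are dropped silently. The sketch below shows one way to check for this using hypothetical data frames `toy_cohort` and `toy_inst`. It uses the dplyr join functions rather than `merge()`, and the codes are made up.

```r
library(dplyr)

# Hypothetical cohort and institution tables; code "999" has no location row
toy_cohort <- tibble(drf_id = 1:3, phdinst = c("100", "200", "999"))
toy_inst   <- tibble(unitid = c("100", "200"), stabbr = c("MD", "VA"))

# Rows that match in both tables (same result as merge() with its defaults)
toy_matched <- inner_join(toy_cohort, toy_inst, by = c("phdinst" = "unitid"))

# Rows from the cohort with no matching institution
toy_unmatched <- anti_join(toy_cohort, toy_inst, by = c("phdinst" = "unitid"))

nrow(toy_matched)
toy_unmatched
```

If the unmatched table is not empty, the state counts computed next will undercount the cohort by that many graduates.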

In [None]:
# Using group_by and summarise(n_distinct) function
state_counts <- cohort_inst_location %>%
                    group_by(stabbr) %>%
                    summarise(counts = n_distinct(drf_id))

In [None]:
head(state_counts)

There is a built-in dataset with state geometry in the `maps` package:
> Note: this built-in dataset contains geometry for the 48 contiguous states and the District of Columbia; Alaska and Hawaii are not included.

In [None]:
states <- map_data("state")

In [None]:
head(states)

In [None]:
ggplot(data = states,
      mapping = aes(x = long, y = lat, group = group)) +
    geom_polygon(fill = 'white', color='black')

In [None]:
unique(states$region)

Create a crosswalk between the fully spelled state name and state abbreviation:

In [None]:
region <- unique(states$region)
stabbr <- c('AL','AZ','AR','CA','CO','CT','DE','DC','FL','GA','ID','IL','IN','IA','KS','KY','LA','ME','MD',
                  'MA','MI','MN','MS','MO','MT','NE','NV','NH','NJ','NM','NY','NC','ND','OH','OK','OR','PA',
                  'RI','SC','SD','TN','TX','UT','VT','VA','WA','WV','WI','WY')
state_crosswalk <- data.frame(stabbr, region)

In [None]:
head(state_crosswalk)
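
The hand-typed vector above depends on the abbreviations being listed in the same order as `unique(states$region)`. As an alternative sketch, the crosswalk can be built by matching on names with R's built-in `state.name` and `state.abb` vectors. Those vectors cover the 50 states only, so the District of Columbia is added by hand here. The `toy_regions` vector is a hypothetical stand-in for `unique(states$region)`.

```r
library(dplyr)

# Hypothetical stand-in for unique(states$region): lowercase full names
toy_regions <- c("alabama", "district of columbia", "new york")

# Lookup built from base R's state.name / state.abb, plus DC added by hand
lookup <- tibble(
  region = c(tolower(state.name), "district of columbia"),
  stabbr = c(state.abb, "DC")
)

# Join by name, so the result does not depend on the order of the regions
toy_crosswalk <- tibble(region = toy_regions) %>%
  left_join(lookup, by = "region")

toy_crosswalk
```

Any region without a match would show `NA` in `stabbr`, which makes a mismatch easy to spot.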

In [None]:
states <- left_join(states,state_crosswalk, by='region')

In [None]:
# keep only the states that appear in both tables; this attaches the polygon coordinates to each state's count
state_counts <- inner_join(state_counts, states, by='stabbr')

In [None]:
head(state_counts)

In [None]:
ggplot(data = state_counts,
      mapping = aes(x = long, y = lat, group = group, fill = counts)) +
        geom_polygon(color="gray90") +
        labs(
        title = 'Most PhD students in 2015 cohort are from REDACTED',
        caption = "Source: SED NCSES Data"
    )