<center> <img style="float: center;" src="images/CI_horizontal.png" width="450">
<center>
    <span style="font-size: 1.5em;">
        <a href='https://www.coleridgeinitiative.org'>Website</a>
    </span> 
    <br>
    Kshitiz Rastogi, Joshua Edelmann, Benjamin Feder, Nathan Barrett</center>
    <a href="https://doi.org/10.5281/zenodo.6407262"><img src="https://zenodo.org/badge/DOI/10.5281/zenodo.6407262.svg" alt="DOI"></a>


# **Data Visualization**

## **1. Introduction**
In the two data exploration notebooks, we formed an analytical cohort of graduates and explored the cohort's characteristics and various employment outcomes. Sometimes, it is hard to see trends and to communicate with your audience by only looking at summary statistics. In this notebook, we will import the csv files we saved in the previous two notebooks and learn how to turn them into informative visualizations. We will also discuss what type of visualization to choose for different types of analyses and how to use labels and plot adjustments to better present your results. As you work through the notebook, we will have checkpoints for you to practice using the code. You can think about how you might apply any of the techniques and code presented in this notebook to your project.

## **2. Learning Objectives**

After you finish this notebook, you should know:

- how to use the R package `ggplot2` to create line plots, bar charts, and heatmaps, sometimes creating side-by-side visuals broken down by subgroup
- how to choose visualization types for your analyses
- how to improve your visualizations with informative titles, labels, and plot settings

#### **Research Questions** 
We will create various visualizations by using the summary statistics we created in the two data exploration notebooks. The questions we seek to answer are:

- What are the average quarterly earnings and number of individuals employed by quarter in our cohort? Do they vary by major?
- What were the most common major types within the cohort? Do they vary by gender?
- What are the stable employment outcomes of our cohort? Do they vary by gender?
- What are the most common employment patterns of our cohort?

#### **Datasets** ####

We will explore data provided by the Tennessee Board of Regents (TBR), the Tennessee Department of Labor & Workforce Development (TDLWD), and the Kentucky Center for Statistics, using the csv files created in the previous notebooks:

- **Tennessee Unemployment Insurance (UI) wage records**: the `ui_wages` table in the `ds_tn_tdlwd` database contains employment data from 2006Q1 to 2021Q1
- **Kentucky UI wage records**: the `ui_wages` table in the `ds_ky_kystats` database contains employment data from 2007Q1 to 2019Q4
- **Community College Graduates**: The graduates table is provided by TBR. The data include graduations at all TBR community colleges and covers the time period of summer 2009 through fall 2020.
- **Community College Enrollments**:  Also provided by TBR and contains all enrollment data at TBR community colleges from summer 2009 through fall 2020.

#### **Methods**
We will cover the following visualizations in this notebook:
- **Line Plot**: is typically used for time series data to show how a variable changes over time
- **Bar Plot**: visualizes relationships between numerical and categorical variables
- **Small multiples**: compares information by different groups using a series of mini-graphs
- **Heat Map**: adds highlights to your data with color-coding

We use the R package `ggplot2` to create all visualizations. Here is a brief introduction to **the syntax of `ggplot2`**:

- start with `ggplot()` <br>
- then, supply a dataset and aesthetic mapping with `x` pertaining to the variable on the x-axis, and so on, for example: `ggplot(dataset, aes(x = ..., y = ...))` <br>
- from there, provide a geometric object represented by `geom_` to convey the desired type of visualization <br>
- finally, add additional layers if necessary using `+` <br>

For example, we can use the code below to create a line plot (geometric object).

    ggplot(data, aes(x = ... , y = ...)) + 
        geom_line()
   
We can add an additional layer `labs()` to create a line plot with a title.

    ggplot(data, aes(x = ... , y = ...)) + 
        geom_line() + 
        labs(title = 'My plot title')
        
The `aes()` call can contain additional arguments outside of `x` and `y` to potentially match the `fill`, `color`, `linetype`, and additional specifications of specific variables in a dataset.
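
As a minimal sketch of these extra aesthetics, the example below maps a grouping variable to both `color` and `linetype`. The data frame `toy` and its columns are hypothetical and exist only for illustration; they are not part of the notebook's datasets.

```r
library(ggplot2)

# Hypothetical toy data: two groups tracked over four quarters
toy <- data.frame(
    quarter = rep(1:4, times = 2),
    value   = c(10, 12, 15, 14, 8, 9, 13, 16),
    group   = rep(c("A", "B"), each = 4)
)

# Mapping `group` to color and linetype draws one distinctly styled line per group
toy_plot <- ggplot(toy, aes(x = quarter, y = value, color = group, linetype = group)) +
    geom_line() +
    labs(title = "My plot title", color = "Group", linetype = "Group")

print(toy_plot)
```

Because both aesthetics point to the same variable and share a legend title, `ggplot2` merges them into a single legend.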

## 3. Notebook Setup

Before we can get started, let's load in the necessary R libraries and connect to the proper server.

In [None]:
# Database interaction imports
library(odbc, warn.conflicts=F, quietly=T)

# For data manipulation/visualization
library(tidyverse, warn.conflicts=F, quietly=T)

# For faster date conversions
library(lubridate, warn.conflicts=F, quietly=T)

# Use percent() function
library(scales, warn.conflicts=F, quietly=T)

In [None]:
# Connect to the server
con <- DBI::dbConnect(odbc::odbc(),
                     Driver = "SQL Server",
                     Server = "msssql01.c7bdq4o2yhxo.us-gov-west-1.rds.amazonaws.com",
                     Trusted_Connection = "True")

Now that we have established a connection to our server, let's read in the csvs containing the information to answer the first Research Question, 
**"What are the average quarterly earnings and number of individuals employed by quarter in our cohort? Do they vary by major?"**

Here is a reminder of the summary statistics that each file contains:
- **`avg_and_num.csv`**: average quarterly earnings and number employed by quarter
- **`avg_and_num_major.csv`**: average quarterly earnings and number employed by quarter for most common majors

<font color=red> Before you run the cell below, make sure you have run through the data exploration notebooks and have saved the csv files in your `U:\\..\\TN Training\\Results\\` directory. Replace `..` with your username.</font>

In [None]:
# average quarterly earnings and number employed by quarter
avg_and_num <- read_csv("U:\\..\\TN Training\\Results\\avg_and_num.csv")

# average quarterly earnings and number employed by quarter (common majors)
avg_and_num_major <- read_csv("U:\\..\\TN Training\\Results\\avg_and_num_major.csv")

## 4. Average quarterly earnings and number employed by quarter (lineplot)

Previously, by looking at the counts in data frame `avg_and_num`, we identified that after 4 quarters post-graduation, the number of individuals employed in Tennessee in our cohort begins to decline for a quarter and then again starts picking up, while the average quarterly earnings rise. However, when presenting your results, a list of numbers may not be an effective way for your audience to understand and absorb the information. In this example, we will focus on the time trend of the number of individuals employed per quarter, using a line plot with quarter after graduation as the x-axis and number of individuals employed as the y-axis.

In [None]:
# see avg_and_num
avg_and_num

In [None]:
#Example ggplot2 syntax to make a simple visualization of trends in number of people employed

#Specify source dataset and x and y variables
ggplot(avg_and_num, aes(x = quarter_number, y = n_employed)) + 

#Plots a line on the graph
geom_line()

When we looked at the data in the table, the number of graduates employed appeared fairly consistent. When we look at the graph, however, we can see a small decline in employment in Tennessee after the first, fourth, and fifth quarters before it starts rising. With this basic visualization, we can clearly identify the employment trends through the 12-quarter cutoff point. 

Although we can see the trends from the above graph quickly and clearly, there are several things we need to improve so that the graph provides the audience some context and delivers information more clearly.

1. **Graph Title**: The title of the visualization should convey the major takeaway(s) of the plot and should answer the original question. In this case, depending on the information you want to address, you may have different titles. For example, if you want to describe the increase in the number of graduates employed in Tennessee after 6 quarters after graduation, the title could be "Employment of TN graduates in TN climbs after the sixth quarter after graduation." However, if you want to describe a past trend, the title could be "Employment of TN graduates in TN decreases in the 1st, 4th, and 5th quarters following graduation".
2. **X-axis and Y-axis Labels**: The labels in our current graph are variable names. We need to change them to short and easy-to-understand descriptions, such as "quarter after graduation" and "number employed".
3. **Data Source**: Providing clear reference and source of the underlying data used for the visualization can increase the credibility and enable the reproducibility of your results. The reference can be the data agency or the name of the dataset. In this analysis, we use TBR and TDLWD as our data sources. 

> We can add a `labs()` layer to the visualization to change the title, axes' names, and caption of the plot. We simply use `+` to add this layer to the code we showed earlier. 

        labs(
            title = "YOUR TITLE",
            x = "X-AXIS LABEL",
            y = "Y-AXIS LABEL",
            caption = "DATA SOURCE OR NOTES"
            )

4. **X-axis and Y-axis Tick Marks**: On our current graph, the x-axis only shows tick marks at 2.5-quarter intervals. It is hard for the audience to see key time points, such as when the number of individuals employed in TN was lowest. Therefore, we want to adjust the x-axis ticks so that they show the time points more frequently. In this case, we use `scale_x_continuous()` and `scale_y_continuous()` to specify a continuous data type and to adjust tick marks and axis extent. In both cases, the `breaks=` argument sets where the tick marks (and their labels) are drawn, while the `limits=` argument sets the minimum and maximum values of the axis.

5. **Key Time Points**: In this analysis, labelling key timepoints, such as when the cohort had its fewest number of individuals employed in TN, helps the audience understand the context. However, it is not necessary to add key timepoints to every graph. We can add vertical lines to the graph to represent significant events using `geom_vline()` and control labels for these lines using `annotate()`. Within `geom_vline()` we must specify where to place the line using the `xintercept=` argument. In our case, we assign the quarter with the fewest individuals employed to a variable and then pass this variable to `xintercept=`. When we use `annotate()`, we must specify the location of the label using the `x=` and `y=` arguments. Keep in mind you will likely need to adjust these based on the specific details of your graph. In the example below, we use a height of 7375, which places the label right above the trough.
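
Rather than reading the key quarter off the table by hand, it can also be derived from the data. The sketch below assumes a data frame shaped like `avg_and_num`; the values in `toy_counts` are made up for illustration, and the names `low_overall_toy` and `trough_plot` are hypothetical.

```r
library(dplyr)
library(ggplot2)

# Hypothetical values standing in for avg_and_num (not real results)
toy_counts <- tibble(
    quarter_number = 1:12,
    n_employed = c(7450, 7420, 7440, 7410, 7390, 7380,
                   7400, 7430, 7460, 7490, 7520, 7550)
)

# Row with the fewest individuals employed
low_row <- toy_counts %>%
    slice_min(n_employed, n = 1, with_ties = FALSE)

low_overall_toy <- low_row$quarter_number
low_value_toy <- low_row$n_employed

# Pass the derived quarter to geom_vline() and annotate();
# the label is placed at the top of the observed range
trough_plot <- ggplot(toy_counts, aes(x = quarter_number, y = n_employed)) +
    geom_line() +
    geom_vline(xintercept = low_overall_toy, color = "blue", linetype = "dashed") +
    annotate("text", x = low_overall_toy, y = max(toy_counts$n_employed),
             color = "blue", hjust = -0.1, label = "Trough") +
    scale_x_continuous(breaks = seq(1, 12, 1), limits = c(1, 12))

print(trough_plot)
```

Deriving the quarter this way keeps the vertical line in sync with the data if the underlying csv file changes.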

In [None]:
# Code adjusting overall graph attributes

# For easier reading, increase base font size
theme_set(theme_gray(base_size = 16))
# Adjust repr.plot.width and repr.plot.height to change the size of graphs
options(repr.plot.width = 12, repr.plot.height = 8)

In [None]:
# First, assigning the overall low quarter to a variable.
# We will use this to plot a vertical line on the graph representing this quarter

# Overall low point
low_overall <- 6

In [None]:
#Example ggplot2 syntax to make a simple visualization of trends in number of people employed

#Specify source dataset and x and y variables
num_employed_plot <- ggplot(avg_and_num, aes(x = quarter_number, y = n_employed)) + 

#Plots a line on the graph
geom_line() +

# Add a blue vertical dashed line at the trough for number of individuals employed in TN
# with a label at the top explaining what it is
geom_vline(xintercept = low_overall,
    color = "blue", linetype = "dashed") +
annotate("text", x = low_overall, y = 7375, color = "blue",
    hjust = -0.1, label = "Trough") +

# Adjust the y scale to assign start and end points as well
# as the interval for tick marks
scale_y_continuous(
    labels = scales::comma,
    breaks = seq(7300, 7600, 100),
    limits = c(7300, 7600)) + 

# Adjust the x scale to assign start and end points as well
# as the interval for tick marks
scale_x_continuous(
    breaks = seq(1, 12, 1),
    limits = c(1,12)) + 

#Add a title, labels for the x and y axes, and data source
labs(title = "Employment of TN graduates in TN climbs after the sixth quarter after graduation",
    x = "Quarter after Graduation", y = "Number Employed",
    caption = "Data Source: TBR, TDLWD Data") +

# shift labels over slightly to clean up appearance
theme(axis.text.x = element_text(vjust=.5))

#Display the graph that we just created
print(num_employed_plot)

#### **Checkpoint 1: Create A Line Plot**
Create a line plot to show the trend of average quarterly wages for the cohort in `avg_and_num`. Remember that your plot should include an informative title, axes' labels, data source, and key timepoints. Adjust axes' tick marks and graph sizes to make it easy to read.  

Comment on the trend of average quarterly wages. During which quarter did the average quarterly wages peak?

In [None]:
#Example ggplot2 syntax to make a simple visualization of trends in average quarterly wages

#Specify source dataset and x and y variables
ggplot(___, aes(x = ___, y = ___)) + 

#Plots a line on the graph
geom_line()

In [None]:
# We will use this to plot a vertical line on the graph representing this quarter

# Overall peak point
peak_overall <- ___

In [None]:
#Example ggplot2 syntax to make a simple visualization of trends in quarterly earnings

#Specify source dataset and x and y variables
avg_wages_plot <- ggplot(___, aes(x = ___, y = ___)) + 

#Plots a line on the graph
geom_line() +

# Add a blue vertical dashed line at the peak for quarterly earnings of the cohort in TN
# with a label at the top explaining what it is
geom_vline(xintercept = ___,
    color = "blue", linetype = "dashed") +
annotate("text", x = ___, y = ___, color = "blue",
    hjust = -0.1, label = "Peak") +

# Adjust the y scale to assign start and end points as well
# as the interval for tick marks
scale_y_continuous(
    labels = scales::comma,
    breaks = seq(___, ___, ___),
    limits = c(___, ___)) + 

# Adjust the x scale to assign start and end points as well
# as the interval for tick marks
scale_x_continuous(
    breaks = seq(1, 12, 1),
    limits = c(1,12)) + 

#Add a title, labels for the x and y axes, and data source
labs(title = "___",
    x = "___", y = "___",
    caption = "___") +

# shift labels over slightly to clean up appearance
theme(axis.text.x = element_text(vjust=.5))

#Display the graph that we just created
print(avg_wages_plot)

### By Major (Line Plot, Multiple Lines)

In the data exploration wage analysis notebook, we then followed by looking at these trends within the five most common degree fields. By adding to the `aes()` call, we can visualize the differences amongst subgroups using separate lines. 

In the following example, we will visualize the trends in average quarterly wages by quarter for the five most common majors in the cohort. Recall that we already have the wages relative to the quarter after graduation for all graduates in our cohort amongst these five most common majors in the `avg_and_num_major` data frame.

In [None]:
# see avg_and_num_major
head(avg_and_num_major)

Let's create our visualization. Here, we will add in a `color` aesthetic to designate the differences in majors, as the color of the line will correspond to the specific major. 

> You can manually insert colors using `scale_color_manual()` or `scale_colour_brewer()`. Before exporting, we recommend that you choose contrasting colors that satisfy ADA requirements. Examples can be found on ColorBrewer.
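
The two options can be sketched as follows, assuming a small made-up data frame (`toy_majors` and its values are hypothetical). The `Dark2` palette is used only as an example of a ColorBrewer qualitative palette; whether a given palette meets your accessibility requirements should be checked separately.

```r
library(ggplot2)

# Hypothetical toy data: three majors over three quarters
toy_majors <- data.frame(
    quarter_number = rep(1:3, times = 3),
    mean_wage = c(5000, 5200, 5400, 6000, 6300, 6500, 7000, 7400, 7900),
    major = rep(c("Business", "Engineering", "Health"), each = 3)
)

base_plot <- ggplot(toy_majors, aes(x = quarter_number, y = mean_wage, color = major)) +
    geom_line()

# Option 1: let ggplot2 pick colors from a named ColorBrewer palette
brewer_plot <- base_plot +
    scale_color_brewer(palette = "Dark2")

# Option 2: assign a specific color to each major by name
manual_plot <- base_plot +
    scale_color_manual(values = c("Business" = "#1B9E77",
                                  "Engineering" = "#D95F02",
                                  "Health" = "#7570B3"))

print(brewer_plot)
print(manual_plot)
```

A named vector in `scale_color_manual()` keeps each major tied to the same color even if the order of the groups changes between plots.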

In [None]:
#Example ggplot2 syntax to make a simple visualization of trends in quarterly wages by major

#Specify source dataset and x, y, and color variables
# only take the first word of the major title because the full title makes the graph harder to read
# you can also use all words but make the legend font smaller
avg_wages_major_plot <- ggplot(avg_and_num_major, aes(x = quarter_number, y = mean_wage, color = word(CIP_Family, 1))) + 

#Plots a line on the graph
geom_line() +

# Adjust the y scale to assign start and end points as well
# as the interval for tick marks
scale_y_continuous(
    labels = scales::comma,
    breaks = seq(4000, 13000, 2000),
    limits = c(4000,13000)) + 

# Adjust the x scale to assign start and end points as well
# as the interval for tick marks
scale_x_continuous(
    breaks = seq(1, 12, 1),
    limits = c(1,12)) + 

#Add a title, labels for the x and y axes, color legend, and data source
labs(title = "TN Graduates majoring in Health Professions experienced higher earnings after graduation",
    x = "Quarter after Graduation", y = "Average Quarterly Wages",
    caption = "Data Source: TBR, TDLWD Data",
    color = "Major") +

# shift labels over slightly to clean up appearance
theme(axis.text.x = element_text(vjust=.5)) 

#Display the graph that we just created
print(avg_wages_major_plot)

### Earnings and Number Employed Together (Line Plot, Small Multiples)

If we wanted to discuss the relationship (or lack thereof) between average quarterly wages and the number of individuals employed in Tennessee, we might consider creating a single visualization with two y-axes, one for each variable. However, the two variables are on entirely different scales (the `scale_y_continuous()` ranges in the previous visualizations did not overlap), so a dual-axis plot may not be advisable. 

However, it is possible to leverage a small multiples visualization, where we use a combination of mini-graphs to visualize the relationship of these two variables, relative to the quarter after graduation using `ggplot2`. This will require some data manipulation, as `facet_grid()` and `facet_wrap()` work by displaying subgroups within specific variables. Therefore, we will need to create a new variable, with its subgroups corresponding to either `mean_wage` or `n_employed` for each `quarter_number` value, to use this method.

To do so, we can use `pivot_longer`, where we will take all of the column names except `quarter_number`, and have them become subgroups within one overarching column, while tracking the values for these columns in a separate variable.

In [None]:
# adjust data frame so variables on y axis are in the same column "variable"
# -c(quarter_number) is used so that pivot_longer ignores the column instead of including it as a subgroup in variable
avg_and_num_long <- avg_and_num %>%
    pivot_longer(names_to = "variable", values_to = "value", -c(quarter_number))

head(avg_and_num_long)

Now that the data frame is suitable for displaying small grids, we can include the `facet_grid()` call inside of our plot to stack the two graphs on top of each other. With this plot, keep in mind that the scale changes between the two panels.
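
By default, each panel is labelled with the raw value of the faceting variable (for example, `mean_wage`). One way to show friendlier panel titles is the `labeller` argument, sketched below on made-up data. The data frame `toy_wide`, its values, and the label text are hypothetical.

```r
library(ggplot2)
library(tidyr)

# Hypothetical wide data shaped like avg_and_num
toy_wide <- data.frame(
    quarter_number = 1:4,
    mean_wage = c(5000, 5300, 5600, 5900),
    n_employed = c(7450, 7420, 7400, 7430)
)

# Reshape so both measures share one `value` column
toy_long <- pivot_longer(toy_wide, cols = -quarter_number,
                         names_to = "variable", values_to = "value")

# Map raw variable names to readable panel titles
panel_labels <- as_labeller(c(mean_wage = "Average Quarterly Wages",
                              n_employed = "Number Employed"))

facet_plot <- ggplot(toy_long, aes(x = quarter_number, y = value)) +
    geom_line() +
    facet_grid(variable ~ ., scales = "free_y", labeller = panel_labels)

print(facet_plot)
```

With readable panel titles, a generic y-axis label such as "Value" becomes less of a burden on the reader.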

In [None]:
#Example ggplot2 syntax to make a simple visualization of trends in quarterly wages and number of people employed 

#Specify source dataset and x and y variables
both_vars_plot <- ggplot(data = avg_and_num_long, mapping = aes(x = quarter_number, y = value)) +

#Plots a line on the graph
geom_line() +

# creates separate panels for the subgroups in variable
# scales is set so that each panel's y-axis can have its own range
facet_grid(variable~., scales = "free_y") +

# Adjust the x scale to assign start and end points as well
# as the interval for tick marks
scale_x_continuous(
    breaks = seq(1, 12, 1),
    limits = c(1,12)) + 

#Add a title, labels for the x and y axes and data source
labs(title = "TN Graduates Experience Different Trends in Earnings and Number Employed Relative to \n Quarters after Graduation",
    x = "Quarter after Graduation", y = "Value",
    caption = "Data Source: TBR and TDLWD Data") +

# shift labels over slightly to clean up appearance
theme(axis.text.x = element_text(vjust=.5)) 

#Display the graph that we just created
print(both_vars_plot)

#### Checkpoint 2: Reproduce for Dominant Earnings

Import the dominant earnings csv files you saved in the data exploration notebook. Then create a line plot to show the trend of average quarterly earnings by quarter after graduation. Remember that your plot should include an informative title, axes labels, data source, and key timepoints. Adjust axes tick marks and graph sizes to make it easy to read.  

Comment on the trend of average quarterly dominant earnings for the two major groups. Do they follow similar trends relative to non-dominant earnings?

Recall that we saved the following file:
- **`avg_and_num_dom_major.csv`**: average dominant quarterly earnings and number employed by quarter for the most common majors

<font color=red> You need to change the directory in the read_csv() statements below. Replace ".." with your username.</font>

In [None]:
# average dominant quarterly earnings and number employed by quarter (common majors)
avg_and_num_dom_major <- read_csv("U:\\..\\TN Training\\Results\\avg_and_num_dom_major.csv")

In [None]:
# match to avg_and_num_major_dom and keep desired columns
avg_and_num_major_dom <- __ %>%
  #  left_join(cip_xwalk, by = __) %>%
    select(___, ___, ___, ___)

In [None]:
#Example ggplot2 syntax to make a simple visualization of trends in dominant quarterly wages by major

#Specify source dataset and x, y, and color variables
# only take the first word of the major title because the full title makes the graph harder to read
avg_wages_major_plot <- ggplot(___, aes(x = ___, y = ___, 
                                                      color = ___)) + 

#Plots a line on the graph
geom_line() +

# Adjust the y scale to assign start and end points as well
# as the interval for tick marks
scale_y_continuous(
    labels = scales::comma,
    breaks = seq(___, ___, ___),
    limits = c(___,___)) + 

# Adjust the x scale to assign start and end points as well
# as the interval for tick marks
scale_x_continuous(
    breaks = seq(___, ___, ___),
    limits = c(___,___)) + 

#Add a title, labels for the x and y axes, color legend, and data source
labs(title = "___",
    x = "___", y = "___",
    caption = "___",
    color = "___") +

# shift labels over slightly to clean up appearance
theme(axis.text.x = element_text(vjust=.5)) 

#Display the graph that we just created
print(avg_wages_major_plot)

## 5. Most Common Majors (Bar plot)

In the previous visualizations, we graphed the earnings outcomes broken down by the five most common degree fields. Let's take a deeper look into the most common majors in the cohort, first on their own before segmenting by gender.

Recall that we saved the following files:
- **`common_major.csv`**: 5 most common majors
- **`common_major_gender.csv`**: 5 most common majors by gender

<font color=red> You need to change the directory in the read_csv() statements below. Replace ".." with your username.</font>

In [None]:
# most common majors
common_major <- read_csv("U:\\..\\TN Training\\Results\\common_major.csv")

# most common majors by gender
common_major_gender <- read_csv("U:\\..\\TN Training\\Results\\common_major_gender.csv")

# see common_major
common_major

Although we will not create additional graphs segmented by major, to make sure the color scheme is consistent, we can **assign a color palette to each major**.  

In [None]:
#Assign a custom color palette to use with the bar graphs
palette_color <- c("Liberal Arts & Science" = "orange",
                   "Health Professions & Related Services" = "blue",
                   "Business Management & Admin. Services" = "red",
                   "Engineering" = "purple",
                   "Computer & Information Sciences" = "green"
                  )

Now we can create the bar plot to show the number of graduates in each major. The code structure is similar to the code we used to create the line plot, but instead of `geom_line()`, we use `geom_col()`. The `x` and `y` variables we include in `aes()` are `n` and `CIP_Family`. This way, we get a horizontal bar chart with the x-axis showing the counts and the y-axis showing the descriptions of majors. If you want to create a vertical bar chart, you just need to switch the `x` and `y` variables. In the `scale_fill_manual()` statement, we use `values=` to specify how we want each series to display by passing the `palette_color` variable that we assigned above. `palette_color` relates potential values of `CIP_Family` to the display color for each. This statement can accept common English names for many colors as well as hexadecimal color codes.
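
The sketch below illustrates two of the points above on made-up data: mixing hexadecimal codes with color names in a palette, and switching `x` and `y` to change the bar orientation. The data frame `toy_bar`, the palette `toy_palette`, and the specific colors are hypothetical.

```r
library(ggplot2)

# Hypothetical counts for three majors
toy_bar <- data.frame(
    major = c("Arts", "Health", "Business"),
    n = c(300, 200, 100)
)

# A palette can mix hexadecimal codes and English color names
toy_palette <- c("Arts" = "#E69F00", "Health" = "blue", "Business" = "#D55E00")

# Horizontal bars: counts on x, categories on y
horizontal_plot <- ggplot(toy_bar, aes(x = n, y = reorder(major, n), fill = major)) +
    geom_col() +
    scale_fill_manual(values = toy_palette, guide = "none") +
    labs(x = "Number of Graduates", y = "Major")

# Vertical bars: swap the x and y mappings
vertical_plot <- ggplot(toy_bar, aes(x = reorder(major, n), y = n, fill = major)) +
    geom_col() +
    scale_fill_manual(values = toy_palette, guide = "none") +
    labs(x = "Major", y = "Number of Graduates")

print(horizontal_plot)
print(vertical_plot)
```

Horizontal bars tend to be easier to read when the category labels are long, since the labels do not need to be rotated.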

In [None]:
#Create a bar chart showing the 5 majors most represented in the graduating cohort

#Specify source dataset and x and y variables
common_major_plot <- ggplot(common_major, aes(x = n, y = reorder(word(CIP_Family, 1),n), fill=CIP_Family)) + 

#Plots bars on the graph
geom_col() +

#Apply your color palette
scale_fill_manual("", values = palette_color, guide = "none") +

#Adjust the x scale to set the interval for tick marks
scale_x_continuous(labels = scales::comma,
    breaks = seq(0, 6200,1000),
    limits = c(0, 6200)) +

#Add titles and axis labels
labs(title = "Liberal Arts & Science Majors more than Double Any Other Major",
     subtitle = "Five Most Common Majors",
     x = "Number of Graduates", y = "Major",
     caption = "Data Source: TBR & TDLWD Data")

#Display the graph we just made
print(common_major_plot)

### By Gender (Side-by-side Bar plot)

Now let's segment the most common majors by gender. Due to disclosure review limitations (we will discuss this in a later notebook), we will not include values where `Gender` is NA in this visualization. Here, we will create a side-by-side bar plot for the five most common majors by gender, displaying the proportion of graduates receiving the degree by gender.

First, we need to isolate the five most common majors for just the male and female genders in `common_major_gender`.

In [None]:
# isolate most common majors for males and females
top_5 <- common_major_gender %>%
    filter(!is.na(Gender))

top_5

As you can see, the most common majors for males are not always the most common majors for females. As of now, if a major exists in `top_5` for one gender but not the other, it will appear as a single bar in the side-by-side plot. These single bars will not retain the same ordering as the paired bars, so we will leverage the `complete()` function to add observations for all missing `CIP_Family` and `Gender` combinations, setting their values to 0.

To create a side-by-side plot, add the `fill` aesthetic, as well as `position_dodge()` inside the `geom_col()` function.
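
To see what `complete()` does, here is a minimal sketch on made-up data. The data frame `toy_top`, its values, and the `"F"`/`"M"` gender codes are hypothetical and may not match the coding in `common_major_gender`.

```r
library(tidyr)

# Hypothetical data: Engineering has no row for "F", Health has no row for "M"
toy_top <- data.frame(
    CIP_Family = c("Business", "Business", "Engineering", "Health"),
    Gender = c("F", "M", "M", "F"),
    n = c(50, 40, 30, 60),
    prop = c(0.25, 0.20, 0.15, 0.30)
)

# Add one row per missing CIP_Family/Gender combination, filling n and prop with 0
toy_complete <- complete(toy_top, CIP_Family, Gender, fill = list(n = 0, prop = 0))

print(toy_complete)
```

After this step every major has exactly one row per gender, so each group in the dodged bar chart reserves a slot for both bars.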

In [None]:
# add in rows for missing CIP_Family/Gender combinations
top_5 <- top_5 %>%
    complete(CIP_Family, Gender, fill=list(prop=0, n=0))

In [None]:
#Create a bar chart showing percentage of graduates by major and gender for most common majors within male and female gender
#Specify source dataset and x and y variables
top_5_major_gender_plot <- ggplot(top_5, aes(x = word(CIP_Family, 1), y = prop, fill=Gender)) + 

#Plots bars on the graph as side by side
geom_col(position=position_dodge()) +

#Adjust the y scale to set the interval for tick marks
scale_y_continuous(labels = scales::comma,
    breaks = seq(0, 1, .1),
    limits = c(0, 1)) +

#Add titles and axis labels
labs(title = "The Most Common Majors Vary across Gender",
     subtitle = "Limited to Five Most Common Majors per Gender",
     x = "Major", y = "Proportion",
     caption = "Data Source: TBR & TDLWD data") +

scale_fill_discrete(name = "Gender", labels = c("Female", "Male")) + 

# rotate x axis text
theme(axis.text.x = element_text(angle = 90))

#Display the graph we just made
print(top_5_major_gender_plot)

## 6. Stable Employment Outcomes (Bar plot, Side-by-side)

Instead of comparing the differences in majors by gender for the cohort, let's take a look at the gender differences in earnings outcomes. Recall that in the second data exploration notebook, we found the proportion of individuals in the cohort experiencing at least one quarter of full-quarter employment. In this visualization, again utilizing a side-by-side bar plot, we will compare the gender composition of all graduates relative to those who experienced at least one quarter of full-quarter employment in their first three years after graduation.

Recall that we saved the following files:
- **`common_gender.csv`**: Breakdown by gender for the entire cohort
- **`full_q_stats_gender.csv`**: Breakdown by gender for those experiencing full-quarter employment

In [None]:
# gender breakdown 
common_gender <- read_csv("U:\\..\\TN Training\\Results\\common_gender.csv")

# full quarter info by gender
full_q_stats_gender <- read_csv("U:\\..\\TN Training\\Results\\full_q_stats_gender.csv")

Here, we will combine the two data frames, `full_q_stats_gender` and `common_gender`, so that we have the gender breakdown for both the full cohort and those experiencing at least one quarter of full-quarter employment in the same data frame. Before doing so, to avoid confusion between the `prop` variables in the two data frames, we will rename one of them. Afterwards, similar to our approach in visualizing the number of individuals employed and their average quarterly earnings by quarter, we will reshape these two `prop` columns into long format so that each row is either a full-quarter or full cohort observation, with a separate variable denoting its status.

In [None]:
# manipulate data frame for visualization
full_q_stats_gender <- full_q_stats_gender %>%
   rename(full_q_prop = prop) %>%
    left_join(common_gender, by='Gender') %>%
    select(-c(num_individuals, avg_wage, n)) %>%
    pivot_longer(names_to = "var", values_to = "prop", -c(Gender))

full_q_stats_gender

From here, we can employ a similar strategy as we did in the previous visualization, adding in a `fill` aesthetic, as well as `position_dodge()` to ensure the side by side nature of the visualization.

In [None]:
#Create a bar chart showing gender breakdown for entire cohort and full-quarter employment
#Specify source dataset and x and y variables
gender_breakdown_plot <- ggplot(full_q_stats_gender, aes(x = Gender, y = prop, fill=as.factor(var))) + 

#Plots bars on the graph as side by side
geom_bar(stat = "identity", position=position_dodge()) +

#Adjust the y scale to set the interval for tick marks
scale_y_continuous(labels = scales::comma,
    breaks = seq(0, 1, .1),
    limits = c(0, 1)) +

#Add titles and axis labels
labs(title = "The Proportions of those with Full-Quarter Employment Reflect that of the Original Cohort",
     x = "Gender", y = "Proportion",
     caption = "Data Source: TBR & TDLWD data") +

scale_fill_discrete(name = "Variable", labels = c("Full-Quarter Employment", "Entire Cohort"))

#Display the graph we just made
print(gender_breakdown_plot)

## 7. Employment patterns by quarters (Heatmap)

The final visualization in this notebook is a heatmap displaying the employment patterns by quarter of our cohort of Tennessee graduates, focusing on the 15 most common patterns. We do not use a heatmap in the classic way, where each "box" in the map corresponds to a proportion or number. Instead, we will use the heatmap as a format for mapping employment patterns, color-coding each box according to whether or not the pattern has employment in a specific quarter. We will start by reading in the .csv file of employment patterns.

Recall that we saved the following file:
- **`patterns.csv`**: Employment patterns for entire cohort

In [None]:
# read in patterns
patterns <- read_csv("U:\\..\\TN Training\\Results\\patterns.csv") %>%
    head(15)

In [None]:
# Save counts to use later in the heatmap - we cannot use the counts as index, as there could be duplicate values 
counts <- patterns$cnt
pcts <- patterns$prop

We will add an index of unique sequential numbers and remove the `cnt` and `prop` columns:

In [None]:
patterns$Pattern <- seq.int(nrow(patterns))
patterns$cnt <- NULL
patterns$prop <- NULL

We now need to convert this table from wide to long format, since our `geom_tile()` function only works with long data frames. Instead of using `pivot_wider()` as we did to create `patterns` in the second data exploration notebook, we will use `pivot_longer()` to create a data frame with each row corresponding to a pattern/quarter/status combination.

In [None]:
# convert to long format
patterns_long <- pivot_longer(patterns, cols = -Pattern, names_to = 'Quarter', values_to = 'Status')

In [None]:
# see patterns_long
head(patterns_long)
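
To make the wide-to-long conversion concrete, here is a minimal sketch on a made-up table. The data frame `toy_wide` and its status values are hypothetical and only mimic the general shape of `patterns`.

```r
library(tidyr)

# Hypothetical wide table: one row per pattern, one column per quarter
toy_wide <- data.frame(
    Pattern = c(1, 2),
    Q1 = c("Employed", "Not Employed"),
    Q2 = c("Employed", "Employed")
)

# Pivot every column except Pattern into Quarter/Status pairs
toy_long <- pivot_longer(toy_wide, cols = -Pattern,
                         names_to = "Quarter", values_to = "Status")

toy_long
```

Each of the two wide rows becomes one long row per quarter, so `toy_long` has four rows with the columns `Pattern`, `Quarter`, and `Status`.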

Now we are ready to create the visualization using the `geom_tile()` layer:

In [None]:
# Full code for the plot

levels = ordered(c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15))  # specify in which order to add the rows from our wide table (called "patterns") 
                                                             # we want to preserve the same ordering of rows as they are sorted in the table, from highest to lowest

patterns_long$Quarter <- factor(patterns_long$Quarter, levels=c("Q1", "Q2", "Q3", "Q4", "Q5", "Q6", "Q7", "Q8", "Q9", "Q10", "Q11", "Q12"))

ggplot(data = patterns_long, aes(x = Quarter, y = ordered(Pattern, levels=rev(levels)))) +    # sort y-axis according to levels specified above
geom_tile(aes(fill = Status), colour = 'black') +                                            # fill the table with value from Status column, create black contouring
scale_fill_brewer(palette = "Set1") +                                                        # specify a color palette
theme(text=element_text(size=14,face="bold")) +                                                          # specify font size
scale_x_discrete(position = 'top') +                                                         # include x-axis labels on top of the plot
labs(
    y = "Employment - Percentages",
    title = 'Employment Patterns by Quarters',
    caption = 'Source: TBR & TDLWD data'
) +
scale_y_discrete(labels=rev(pcts))  # rename the y-axis ticks to correspond to the percentages from the table

We can also have counts on the left side of the y-axis instead.

In [None]:
# Full code for the plot

levels = ordered(c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15))  # specify in which order to add the rows from our wide table (called "patterns") 
                                                             # we want to preserve the same ordering of rows as they are sorted in the table, from highest to lowest

patterns_long$Quarter <- factor(patterns_long$Quarter, levels=c("Q1", "Q2", "Q3", "Q4", "Q5", "Q6", "Q7", "Q8", "Q9", "Q10", "Q11", "Q12"))

ggplot(data = patterns_long, aes(x = Quarter, y = ordered(Pattern, levels=rev(levels)))) +    # sort y-axis according to levels specified above
geom_tile(aes(fill = Status), colour = 'black') +                                            # fill the table with value from Status column, create black contouring
scale_fill_brewer(palette = "Set1") +                                                        # specify a color palette
theme(text=element_text(size=14,face="bold")) +                                                          # specify font size
scale_x_discrete(position = 'top') +                                                         # include x-axis labels on top of the plot
labs(
    y = "Employment - Counts",
    title = 'Employment Patterns by Quarters',
    caption = 'Source: TBR & TDLWD data'
) +
scale_y_discrete(labels=rev(counts))  # rename the y-axis ticks to correspond to the counts from the table
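
Both heatmaps rely on the same ordering trick: a discrete y-axis is drawn from bottom to top, so the levels are reversed to put the most common pattern in the top row, and the labels are reversed to match. The sketch below isolates that step on made-up data. The names `toy_long` and `toy_counts` and all of their values are hypothetical.

```r
library(ggplot2)

# Hypothetical long data: 3 patterns across 2 quarters (statuses are made up)
toy_long <- data.frame(
    Pattern = rep(1:3, each = 2),
    Quarter = rep(c("Q1", "Q2"), times = 3),
    Status  = c("Employed", "Employed",
                "Employed", "Not Employed",
                "Not Employed", "Not Employed")
)
toy_counts <- c(50, 30, 20)   # made-up counts, sorted from most to least common

# Reverse the levels so that Pattern 1 ends up in the top row
toy_long$Pattern <- factor(toy_long$Pattern, levels = rev(1:3))

toy_heatmap <- ggplot(toy_long, aes(x = Quarter, y = Pattern)) +
    geom_tile(aes(fill = Status), colour = "black") +
    scale_x_discrete(position = "top") +
    scale_y_discrete(labels = rev(toy_counts))   # labels are applied in level order

print(toy_heatmap)
```

Because the labels are applied in level order, reversing only the levels or only the labels would attach the wrong count to each row.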

Of course, this notebook only contains a few of the possible visualizations you can create using the `ggplot2` package. Luckily, the R community has created countless resources you can use for making all types of visualizations!

> Note: You can save a visualization with the `png()` command by supplying the file path (and name) before running the plot code, then calling `dev.off()` to close the file.
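
As a rough sketch of saving a plot to disk, the example below shows two common approaches. The plot `example_plot` is a hypothetical stand-in built from R's built-in `mtcars` data, and the files are written to temporary paths rather than to any project folder.

```r
library(ggplot2)

# Hypothetical plot built from R's built-in mtcars data
example_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()

# Option 1: ggsave() writes a plot object directly to a file
ggsave_path <- tempfile(fileext = ".png")
ggsave(ggsave_path, plot = example_plot, width = 8, height = 5)

# Option 2: open a png() device, print the plot, then close the device
png_path <- tempfile(fileext = ".png")
png(png_path, width = 800, height = 500)
print(example_plot)
dev.off()
```

With `png()`, the file is only complete once `dev.off()` has been called. `ggsave()` opens and closes the device for you.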

# References

Feder, Benjamin, Barrett, Nathan, & Simone, Sean. (2022, March 30). Data Visualization using New Jersey Education to Earnings Data System Tables. Zenodo. https://doi.org/10.5281/zenodo.6399326