<center> <img style="float: center;" src="images/CI_horizontal.png" width="450">
<center>
    <span style="font-size: 1.5em;">
        <a href='https://www.coleridgeinitiative.org'>Website</a>
    </span> 
    <br>
    David Currie, Joseph Chappell, Alex Gorbunov, Nathan Barrett, Benjamin Feder, Josh Edelmann, Sean Simone, Angie Tombari </center>

# Data Exploration: Wages

## **1. Introduction**
Few would argue against the notion that individuals and society benefit from a strong education system. Indeed, the median individual employed full time with a bachelor’s degree earns about 80% more than a similar individual with a high school diploma. More educated populations benefit from increased tax revenue, reduced crime rates, reduced dependence on public assistance programs, and greater civic engagement. It should come as no surprise then that, with the expectation of future benefits, individuals and governments alike invest a tremendous amount of money into education. In Tennessee, colleges' and universities' annual revenue totals about 3.6 billion dollars: about 35% comes from student tuition, with the remainder from state and local appropriations, dormitories, bookstores, hospitals, and grants and contracts (State Higher Education Finance report 2020 by State Higher Education Executive Officers Association). In 2020, the average Tennessee college graduate had a student loan debt balance approaching 36,200 dollars (Student Loan Debt by State by Educationdata.org).

There is room for debate, however, about whether the public is fully and fairly realizing the potential benefits of these investments. There is growing concern about a skills gap whereby individuals with postsecondary degrees are unable to fill the available jobs or find consistent employment with the degrees they have. There are also concerns over equity gaps in the workforce outcomes of postsecondary graduates. Accordingly, there is growing interest in developing and using the data systems needed to better understand the education-to-workforce pipeline. To date, the federal government has allocated over 750 million dollars to states to develop Statewide Longitudinal Data Systems (SLDS) with the goal of making this possible. Tennessee is one of the states that received such funding. This notebook will leverage these data, from the P20 Connect TN system, to begin to understand the education-to-workforce pipeline in Tennessee.


The Applied Data Analytics training uses a project-based approach to develop your analytic skills. You will begin by working with your team to develop and refine a research question. A crucial part of this is data exploration. You will implement techniques using SQL and R to explore and better understand the data that are available to you and refine your research question. This will form the basis of all the other types of analyses you will do in this class and is a crucial first step for any data analysis workflow. As you work through the notebook, we will have checkpoints for you to practice writing code by making small adjustments, but you should also think about how you might apply any of the techniques and code presented with other datasets to address your research question.

The guiding research questions we will use for the notebooks are quite general:

>**What are the employment outcomes of the 2015-16 community college graduates? How do these outcomes vary by cohort characteristics and employer characteristics?**

This will allow the code we use to have the most versatility. In the last notebook, you defined a cohort of interest. We will now track their earnings and employment outcomes over time. The exploration of the supply side of the labor market will later be supplemented by an analysis of the demand side to enhance our understanding of the overall labor market.

We are going to show just a portion of what you might be interested in investigating to answer these overarching questions, so don't feel restricted by the questions we have decided to try to answer.

## **2. Learning Objectives**

We will continue to follow the guiding research questions previously introduced in the project template:

>**What are the employment outcomes of the 2015-16 community college graduates? How do these outcomes vary by cohort characteristics and employer characteristics?**

Recall that we have already defined a cohort of interest as 2015-16 associate's degree earners in Tennessee in the first data exploration notebook. Here, we will introduce you to the available wage records and walk through how we can link our cohort to the wage records tables to track the cohort's employment outcomes up to three years post-graduation. We will provide code and explanations for various outcome measures and compare them amongst subgroups such as major and gender. At the end of the notebook, we will save the data frames containing these results as csv files so that we can easily use them in the next notebook where we will visualize these descriptive statistics.

#### **Notebook 2 Questions and Goals** 
In this notebook, we focus on the following questions:
- What are the average quarterly earnings of our cohort in Tennessee? Do they vary by major?
- What are the stable employment outcomes of our cohort? Do they vary by gender?
- What are the average quarterly earnings of those in our cohort employed in Kentucky?
- What are the most common employment patterns of our cohort?

>Note: We also provide code in a Supplemental notebook for you to explore additional employment outcomes.

After completing this notebook you should be able to perform the following analytical tasks:
- Load R libraries and establish a connection to the server
- Link an education cohort to multiple sets of wage data
- Identify full quarter employment
- Identify wage outcomes by different subgroups

#### **Datasets** ####
We will explore data provided by the Tennessee Board of Regents, the Tennessee Department of Labor & Workforce Development, and the Kentucky Center for Statistics:
- **Tennessee Unemployment Insurance (UI) wage records**: the `ui_wages` table in the `ds_tn_tdlwd` database contains employment data from 2006Q1 to 2021Q1.
- **Kentucky UI wage records**: the `ui_wages` table in the `ds_ky_kystats` database contains employment data from 2007Q1 to 2019Q4.

We will also continue to use our cohort formed from data provided by the Tennessee Board of Regents: 
- **Community College Graduates**: The graduates table is provided by TBR. The data include graduations at all TBR community colleges and cover the period from summer 2009 through fall 2020.
- **Community College Enrollments**: This table is also provided by TBR and contains all enrollment data at TBR community colleges from summer 2009 through fall 2020.

> Note: When linking our cohort to the Kentucky UI wage records, we will need to leverage the `ui_person` table, which provides a crosswalk between the person identifier in the `ds_ky_kystats` database and the social security number columns referenced in other tables.

## 3. Notebook Setup

Before we can get started, let's load in the necessary R libraries, connect to the server, and load in the table containing our analytical cohort established in the first data exploration notebook.

In [None]:
#Run these database interaction (R package) imports.
#Do not panic if you see a few build version warnings.
#The versions used here were built with R 4.0.5 (Shake and Throw).
#The latest R version is 4.1.0 (Camp Pontanezen).
library(odbc, warn.conflicts=F, quietly=T)

# For data manipulation/visualization
library(tidyverse, warn.conflicts=F, quietly=T)

# For faster date conversions
library(lubridate, warn.conflicts=F, quietly=T)

# Use percent() function
library(scales, warn.conflicts=F, quietly=T)

In [None]:
#Connect to the server.
#You will not see any output when the connection is made.
#Jupyter will post a warning if a connection cannot be made or if a connection breaks.
con <- DBI::dbConnect(odbc::odbc(),
                     Driver = "SQL Server",
                     Server = "msssql01.c7bdq4o2yhxo.us-gov-west-1.rds.amazonaws.com",
                     Trusted_Connection = "True")

> If you are not properly connected to the server and/or have not loaded the packages to do so, you will receive an error message when running the following code cell.

Now that we are connected to the proper server, we can load our cohort established in the first notebook, `grads1516`, into R.

> Recall `df_cohort` is saved in the `tr_tn_2021` database as `grads1516`.

In [None]:
# Let's get the data cohort from notebook 1 and put it in an R data frame (df_cohort)
# Still no exciting output? Don't panic. You are creating a data frame but not viewing it yet.
qry <- "
select * 
from tr_tn_2021.dbo.grads1516
"

df_cohort <- dbGetQuery(con, qry)

To see some output from `df_cohort`, we can take a look at the first six rows of the data frame.

In [None]:
#Recall that this is an R command, so you can run it in a code cell
#Do a quick scan through the column headers
head(df_cohort)

## 4. A Note on SQL and R for Processing Data

SQL is designed to allow for quick and efficient processing of massive amounts of information, such as UI wage records files. Although you may not have trouble narrowing down a cohort from the graduates table in R, you will run into memory issues reading larger tables into R prior to significantly limiting their size. Particularly because we will need to link our original cohort to the UI wage records to begin to understand the cohort's employment outcomes, we have saved our resulting analytical file formed at the end of the first data exploration notebook as a table in SQL. This will allow us to easily perform a linkage to their employment outcomes in SQL, as opposed to reading the entire UI wage records table into R to perform the linkage. Once we have our final table of wage outcomes specific to our cohort within a defined time period, we should be able to read this table into R to perform more complex analyses, as it is just a small subset of the original UI wage records file.
 
Oftentimes, analysts working with large datasets will begin their analysis in SQL to define their analytical frame before reading the resulting table into R. This workflow typically maximizes the power of the two languages, as SQL will be much more efficient when working with massive amounts of data, and R allows for more complex analyses and visualizations. 

## 5. Linking Cohort to Wage Records

Since our cohort, `df_cohort`, does not contain employment outcomes, we will need to figure out a method to extract post-graduation earnings from Tennessee's UI wage records. This section will walk you through a possible linkage procedure.

### Understanding Tennessee's UI Wage Records

Before we can try to link `df_cohort` to any wage records, we need to get a better sense of the contents of the wage records table. Let's take a look at the column headers in the `ui_wages` table and see if we can spot any common variables by which we can create a potential linkage.

In [None]:
# see five rows of data from tn ui wage records table
qry <- "
select top 5 *
from ds_tn_tdlwd.dbo.ui_wages
"
dbGetQuery(con, qry)

As you may have noticed, the hashed Social Security number is common to both our cohort table (`tr_tn_2021.dbo.grads1516`) and the wages table (`ds_tn_tdlwd.dbo.ui_wages`). As a slight complication, the column is named `SSN` in the cohort table but `ssn` in the UI wage records.

Recall that we also only want to include employment outcomes after graduation. Since Tennessee's UI wage records table contains information going back as far as 2006, we will want to limit the time frame within our query.

#### **Checkpoint 1: Time Travel**

Given the available variables in the `ui_wages` table and `df_cohort` (`tr_tn_2021.dbo.grads1516` in SQL), can you identify potential variables we can use to define a specific time frame (up to three years post-graduation)? Refer to the data dictionaries for complete column definitions.

> Note: You don't need to perform the linkage--we will be doing that in a handful of code cells--but please think about potential variables we might be able to use in the future.

In [None]:
# which columns appear in the TN ui_wages table? You can also consult the data dictionaries
qry <- "
SELECT COLUMN_NAME
FROM ds_tn_tdlwd.INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'ui_wages'
"
dbGetQuery(con, qry)

In [None]:
# which columns appear in the cohort table? You can also consult the data dictionaries
qry <- "
SELECT COLUMN_NAME
FROM tr_tn_2021.INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'grads1516'
"
dbGetQuery(con, qry)

--------

### Data Manipulation

There are a few different ways we can approach linking these two tables so that they satisfy a specific time constraint. Although the solution presented may not line up with your answer to Checkpoint 1, it is one that can be applied to a lot of other datasets.

The general idea is to create new variables in each of the tables that represent graduation and employment information in terms of calendar dates. From there, we can take advantage of SQL and R's date-specific functions to extract employment data within a three-year timespan. If you refer back to the original tables from which we are taking wage and graduation data, you will notice that there are no columns indicating specific dates (i.e. mm/dd/yyyy format) within either. Luckily, though, there are columns in both tables that can allow us to approximate these dates in a consistent manner. 

For example, in Tennessee's UI wage records, the variable `wge_yr_qtr` tracks the year and fiscal quarter corresponding to each employment record in the format "yyyyq".

> One benefit of working with dates in the specific `date` type is that there are built-in functions to calculate the time elapsed between two dates.

In [None]:
# see wge_yr_qtr in ui_wages
qry <- "
select top 5 wge_yr_qtr
from ds_tn_tdlwd.dbo.ui_wages
"
dbGetQuery(con, qry)

We will have to do a bit of manipulation to get a rough but consistent "date" of employment across all of the wage records. To do so, we will approximate the job date as the first day of the quarter, so employment in **Q1** will correspond to **January 1**, **Q2** to **April 1**, **Q3** to **July 1**, and **Q4** to **October 1**. A quick way to map these quarters to their corresponding month given this rule is that you can multiply each quarter by 3 and subtract 2. 

Therefore, our strategy to add in a variable `job_date`, which will be a date-formatted approximation of the date of employment in `mm/dd/yyyy` format, will be as follows:

1. Extract the quarter (fifth of five characters) from `wge_yr_qtr` using `substring()`.
    - In order to use `substring()`, we will need to convert `wge_yr_qtr` from an `integer` to a `varchar` using `cast()` so we can isolate the fifth character.
2. Multiply the quarter by 3 and subtract 2 to get the month.
3. Isolate the year (first through fourth characters) from `wge_yr_qtr`, again after casting it to a `varchar`.
4. Combine the month (step 2) and the year (step 3) using `concat()` so that the date format is mm/dd/yyyy, with dd always corresponding to '01', the first day of the quarter.
5. Convert the manipulated date string, which is of type `varchar` after running `concat()`, to `datetime` so that `job_date` registers as a date type.

If you were to write the code out in steps, with each step building on the last, the code to create a `job_date` variable could look as follows:

#### **Step 1**: Extract the quarter from `wge_yr_qtr`

In [None]:
# extract quarter
# wge_yr_qtr must be of type 'varchar' (currently integer) to use substring()
qry <- "
select top 5 wge_yr_qtr, substring(cast(wge_yr_qtr as varchar), 5, 5) as quarter_for_job_date
from ds_tn_tdlwd.dbo.ui_wages
"
dbGetQuery(con, qry)

#### **Step 2**: Adjust quarter to correspond to a month

In [None]:
# map extracted quarter to month by multiplying by 3 and subtracting 2
# showing example where the quarter is not quarter 1
qry <- "
select top 5 wge_yr_qtr, substring(cast(wge_yr_qtr as varchar), 5, 5)*3-2 as month_for_job_date
from ds_tn_tdlwd.dbo.ui_wages
where wge_yr_qtr = 20062
"
dbGetQuery(con, qry)

#### **Step 3**: Isolate year from `wge_yr_qtr`

In [None]:
# extract year from wge_yr_qtr, which is first four characters
# creating new column "year_for_job_date" to illustrate example
# again, wge_yr_qtr must be of type 'varchar' to use 'substring()' to extract the year
qry <- "
select top 5 wge_yr_qtr, 
    substring(cast(wge_yr_qtr as varchar), 5, 5)*3-2 as month_for_job_date, 
    substring(cast(wge_yr_qtr as varchar), 1, 4) as year_for_job_date
from ds_tn_tdlwd.dbo.ui_wages
"
dbGetQuery(con, qry)

#### **Step 4**: Coerce extracted month and year into a date-like format

In [None]:
# combine extracted month, day (always 01), and year into date-like format using 'concat()'
# want format mm/dd/yyyy with '/' separators
# have combined month_for_job_date and year_for_job_date into 'job_date'
# notice type is 'chr'
qry <- "
select top 5 wge_yr_qtr,
    concat(substring(cast(wge_yr_qtr as varchar), 5, 5)*3-2, '/', '01', '/', substring(cast(wge_yr_qtr as varchar), 1, 4)) as job_date
from ds_tn_tdlwd.dbo.ui_wages
"
dbGetQuery(con, qry)

#### **Step 5**: Convert `job_date` to `datetime` type

In [None]:
# convert job_date to datetime in SQL using convert()
# first argument of convert() is the type to which you would like to convert the variable
qry <- "
select top 5 wge_yr_qtr,
    convert(datetime, concat(substring(cast(wge_yr_qtr as varchar), 5, 5)*3-2, '/', '01', '/', substring(cast(wge_yr_qtr as varchar), 1, 4))) as job_date
from ds_tn_tdlwd.dbo.ui_wages
"
dbGetQuery(con, qry)

Now you have code to generate a rough `job_date` variable from the Tennessee UI wage records. While it is possible to do this in R using the `lubridate` package, the size of the table means you will often run into memory and speed issues.
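
For intuition only, the same five steps can be mirrored in base R on a handful of values that fit comfortably in memory. This is a minimal sketch: the `wge_yr_qtr` values below are made up for illustration and are not drawn from the wage records.

```r
# Made-up year-quarter codes in the same "yyyyq" integer layout as wge_yr_qtr
wge_yr_qtr <- c(20161L, 20162L, 20173L, 20184L)

# Steps 1-3: split each code into quarter, month, and year
yrq_chr <- as.character(wge_yr_qtr)
qtr     <- as.integer(substr(yrq_chr, 5, 5))  # fifth character is the quarter
mon     <- qtr * 3 - 2                        # quarters 1, 2, 3, 4 -> months 1, 4, 7, 10
yr      <- substr(yrq_chr, 1, 4)              # first four characters are the year

# Steps 4-5: build an mm/dd/yyyy string and convert it to a Date
job_date <- as.Date(sprintf("%02d/01/%s", mon, yr), format = "%m/%d/%Y")

data.frame(wge_yr_qtr, job_date)
```

Running the SQL version against the full table remains the better choice at scale; the R version is mainly useful for checking the logic on a small extract.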

> Note that we have just demonstrated the code while printing five rows and have not made any permanent changes to the `ui_wages` table (you do not have permission to do so) or created a separate table with the `job_date` column yet. However, we have already saved a permanent version of this table (`ui_wages_dated`) for you in the `tr_tn_2021` database using the code below. Note that we have limited the `ui_wages` table to only include records for individuals who appear in our cohort table created in the first notebook, `grads1516`. This is done to speed up the future linkage between the cohort and the wage records.

    select *, convert(datetime, concat(substring(cast(wge_yr_qtr as varchar), 5, 5)*3-2, '/', '01', '/', substring(cast(wge_yr_qtr as varchar), 1, 4))) as job_date
    into tr_tn_2021.dbo.ui_wages_dated 
    from ds_tn_tdlwd.dbo.ui_wages
    where ssn in (
        select distinct(SSN) 
        from tr_tn_2021.dbo.grads1516
    )

Now that we have successfully created the `job_date` variable, we can follow a similar process for generating `grad_date`. There are a few columns you can use from the `grads1516` table. One option is a combination of `TermAward`, which tracks the academic term of graduation, and `YearAward`, the year of graduation. Keep in mind that the academic terms *do not* correspond to fiscal quarters.

- In `TermAward`, '1' maps to Fall, '3' to Spring, and '4' to 'Summer'.

In [None]:
# see TermAward and YearAward
qry <- "
select top 5 TermAward, TermDesc, YearAward
from tr_tn_2021.dbo.grads1516
"
dbGetQuery(con, qry)

To add `grad_date`, as you can see, we need to manipulate either `TermAward` or `TermDesc` so that it can correspond to the first day of the proper fiscal quarter. Here, we will consider the first month of the Fall semester to correspond to October 1, or the first day of the 4th fiscal quarter, Spring's to correspond to April 1, and Summer's to July 1. Let's try creating a new column, `new_month`, that applies this transformation.

In [None]:
# create new_month
qry <- "
select top 5 TermAward, TermDesc,
case
    when TermAward = 1 then 10
    when TermAward = 3 then 4
    else 7 end as new_month
from tr_tn_2021.dbo.grads1516
"
dbGetQuery(con, qry)

With the `new_month` column, we now have the information needed to map degree dates to quarters. In order to preserve the `new_month` column, we could create a permanent table with this column, or add it to `grads1516`. Another option, though, is to use a common table expression (CTE), where we will create intermediate results that we can combine together to create our final table in one query. To start our first CTE, we will begin by using a `with` clause in SQL. `with` enables us to essentially define an intermediate table without writing it to the database before using the intermediate table to derive our desired result. In this example, the code creating `new_month` will be contained within the `with` clause, and then we will use this intermediate result set to create our desired `grad_date` variable.

In [None]:
# see grad date variable
qry <- "
with upd_month_table as (
    select *,
    case
        when TermAward = 1 then 10
        when TermAward = 3 then 4
        else 7 end as new_month
    from tr_tn_2021.dbo.grads1516
)
select top 5 TermDesc, YearAward, convert(datetime, concat(new_month, '/', '01', '/', YearAward)) as grad_date
from upd_month_table
"
dbGetQuery(con, qry)
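
To check the term-to-date rule outside of SQL, a small base R sketch can apply the same mapping to a toy data frame. The three rows below are invented for illustration, and the sketch assumes `TermAward` and `YearAward` are numeric, as the SQL comparison `TermAward = 1` suggests.

```r
# Made-up graduation records using the TermAward codes described above
# (1 = Fall, 3 = Spring, 4 = Summer)
grads <- data.frame(
    TermAward = c(1, 3, 4),
    YearAward = c(2015, 2016, 2016)
)

# Same rule as the SQL case statement: Fall -> 10, Spring -> 4, anything else -> 7
grads$new_month <- ifelse(grads$TermAward == 1, 10,
                   ifelse(grads$TermAward == 3, 4, 7))

# Build an mm/dd/yyyy string and convert it to a Date
grads$grad_date <- as.Date(
    sprintf("%02d/01/%d", grads$new_month, grads$YearAward),
    format = "%m/%d/%Y"
)

grads
```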

We have slightly adapted this code to create a permanent version of the table (`grads1516_dated`) in the `tr_tn_2021` database. This new table contains the exact same records as `grads1516` but also contains the `grad_date` and `new_month` columns. The code we used is pasted below:

    with new_table as (
        select *,
        case
            when TermAward = 1 then 10
            when TermAward = 3 then 4
            else 7 end as new_month
        from tr_tn_2021.dbo.grads1516
    )
    select *, convert(datetime, concat(new_month, '/', '01', '/', YearAward)) as grad_date
    into tr_tn_2021.dbo.grads1516_dated 
    from new_table

### Joining Updated Tables

At this point, we have dated and undated versions of tables at our disposal to find employment history up to *3* years after graduation: `tr_tn_2021.dbo.grads1516` or `tr_tn_2021.dbo.grads1516_dated` and `ds_tn_tdlwd.dbo.ui_wages` or `tr_tn_2021.dbo.ui_wages_dated`. As mentioned previously, we can link between these two tables based on common `ssn` values (hashed SSN numbers) and limit the time frame using SQL's date functions.

We will use a simple `join` statement and add our time constraints to the `where` clause. The time constraint will be implemented by only taking `job_date` values (found in `tr_tn_2021.dbo.ui_wages_dated`) that occur within 13 quarters (3 years plus 1 quarter) of graduation.

The date-specific function we will use in SQL is `dateadd()`, as it allows us to add different time intervals to date variables.

> Note: We will filter out all wage records with `wge_amt` values of 0, as employers are only required to report non-zero wages paid during the quarter. In this notebook, we will assume that the records with 0 quarterly wages are reflective of HR efforts to report future earnings instead of employment with 0 wages earned in the quarter.

In [None]:
# link wage and education tables for up to 13 quarters post-graduation
qry <- "
select cohort.*, w.indst_cde, w.file_yr_qtr, w.wge_yr_qtr, w.empr_nbr, w.job_date, w.wge_amt
from tr_tn_2021.dbo.grads1516_dated cohort
join tr_tn_2021.dbo.ui_wages_dated w
on cohort.SSN  = w.ssn
where w.job_date >= cohort.grad_date and dateadd(quarter, 13, cohort.grad_date) >= w.job_date and w.wge_amt > 0
"
df_wages <- dbGetQuery(con, qry)

head(df_wages)
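
To see which quarters the `where` clause keeps, the window logic can be reproduced in base R on a few made-up dates. This is only a sketch of the date arithmetic under the assumption that a quarterly `seq()` behaves like `dateadd(quarter, 13, ...)` for first-of-quarter dates; it does not query the database.

```r
# A single made-up graduation date and several made-up job dates
grad_date <- as.Date("2016-04-01")
job_date  <- as.Date(c("2016-01-01", "2016-04-01", "2019-04-01",
                       "2019-07-01", "2019-10-01"))

# 13 quarters after grad_date: the 14th element of a quarterly sequence
# that starts at grad_date
window_end <- seq(grad_date, by = "quarter", length.out = 14)[14]

# Same condition as the SQL: grad_date <= job_date <= grad_date + 13 quarters
in_window <- job_date >= grad_date & job_date <= window_end

data.frame(job_date, in_window)
```

Both endpoints are inclusive, which is why the linked table contains the quarter of graduation as well as the 13th quarter afterward.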

#### **Checkpoint 2: Time-Keeping**

Adjust the query above to only include wage records in the two years (8 quarters) prior to graduation. Return five rows to confirm your results. (Hint: you can re-use the dateadd function by switching the order of the variables and reversing the inequality signs.)

In [None]:
# link wage and education tables for the two years (8 quarters) prior to graduation
qry <- "
select top 5 __
from ___
join ___
on ___
where __
"
dbGetQuery(con, qry)

--------

## 6. Quick Exploration

Before we start evaluating different employment measures, we should get a better grasp of the data frame. Let's start by finding the number of individuals, as well as jobs that were linked across the two tables.

In [None]:
# see number of jobs and individuals employed in TN according to ui records
df_wages %>%
    summarize(
        n_ind = n_distinct(SSN),
        n_jobs = n()
    )

For reference, you can compare the number of individuals who were employed in at least one quarter according to the wage records within their first 13 quarters after graduation with the number of individuals in the original cohort.

In [None]:
# size of original cohort
df_cohort %>%
    summarize(
        n_ind = n_distinct(SSN)
    )

Does this proportion surprise you?
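
One way to put a number on that comparison is to divide the count of distinct individuals with wage records by the count in the cohort. The sketch below uses two made-up identifier vectors in place of `df_wages$SSN` and `df_cohort$SSN`, so the resulting share is purely illustrative.

```r
# Made-up hashed identifiers standing in for the cohort and the linked wage records
cohort_ssn <- c("a1", "b2", "c3", "d4", "e5")
wage_ssn   <- c("a1", "a1", "b2", "d4", "d4", "d4")

n_cohort   <- length(unique(cohort_ssn))   # individuals in the cohort
n_employed <- length(unique(wage_ssn))     # individuals with at least one wage record

share_employed <- n_employed / n_cohort
sprintf("%.1f%%", 100 * share_employed)
```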

We assume that there should be one entry for each individual-employer-quarter combination. Let's confirm that assumption by counting the number of entries within each `SSN`-`empr_nbr`-`wge_yr_qtr` combination.

In [None]:
#see if people have multiple filings for same employer and job date
#we are only looking at the head here, but if n > 1 in a row, then you have multiple filings.
#note the count is arranged in descending order so higher numbers come first.
df_wages %>%
    group_by(SSN, empr_nbr, job_date) %>%
    count() %>%
    arrange(desc(n)) %>%
    head()

It turns out that it is possible to have more than one observation within an `SSN`-`empr_nbr`-`wge_yr_qtr` combination. This occurs because employers are allowed to refile to correct previously submitted employment records. To keep only the most recent record, we select the one with the latest `file_yr_qtr` within each combination.

In [None]:
#Unduplicate wages
df_wages_undup <- df_wages %>%
    arrange(SSN, empr_nbr, job_date, desc(file_yr_qtr)) %>%
    distinct(SSN, empr_nbr, job_date, .keep_all=TRUE)
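
The sort-then-keep-first pattern used above can be checked on a toy example. The base R sketch below relies on `order()` and `duplicated()` rather than `dplyr`, and every value in the `filings` data frame is invented for illustration.

```r
# Made-up filings: the first two rows are a refiling for the same
# person, employer, and quarter
filings <- data.frame(
    SSN         = c("a1", "a1", "a1", "b2"),
    empr_nbr    = c("e9", "e9", "e7", "e9"),
    wge_yr_qtr  = c(20162, 20162, 20162, 20162),
    file_yr_qtr = c(20163, 20171, 20163, 20163),
    wge_amt     = c(5000, 5200, 800, 4100)
)

# Sort so that the most recent filing comes first within each combination
ord <- filings[order(filings$SSN, filings$empr_nbr, filings$wge_yr_qtr,
                     -filings$file_yr_qtr), ]

# Keep only the first (most recent) row per person-employer-quarter
undup <- ord[!duplicated(ord[, c("SSN", "empr_nbr", "wge_yr_qtr")]), ]

undup
```

In this toy case the refiled amount of 5200 survives and the earlier 5000 record is dropped.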

In [None]:
#See number of jobs and individuals employed in TN according to ui records
df_wages_undup %>%
    summarize(
        n_ind = n_distinct(SSN),
        n_jobs = n()
    )

For reference, by comparing `n_jobs` in the code cell above with the value for `df_wages`, you can see that relatively few records were duplicate entries differing by `file_yr_qtr`. However, it is important to make sure information is not being duplicated or miscast. Let's confirm that we successfully de-duplicated `df_wages_undup`.

In [None]:
# confirm de-duplication
df_wages_undup %>%
    group_by(SSN, empr_nbr, job_date) %>%
    count() %>%
    arrange(desc(n)) %>%
    head()

Now that we have successfully de-duplicated the linked wage records up to three years post-graduation, we can save `df_wages_undup` to the `tr_tn_2021` database as a permanent table using the code below:

    qry <- " use tr_tn_2021;
    "
    DBI::dbExecute(con, qry)

    DBI::dbWriteTable(
        conn = con,
        name = DBI::SQL("dbo.nb_cohort_wages_link"), 
        value = df_wages_undup
    )

In [None]:
# test that we can query from the new table
qry <- "
select top 5 *
from tr_tn_2021.dbo.nb_cohort_wages_link
"
dbGetQuery(con, qry)

## 7. Employment Outcomes

Connecting education data to employment data is only the first part of understanding the outcomes for Tennessee graduates. There are many ways that one could define and evaluate these outcomes. We present a few here and several more in the Supplemental notebook. While working through this section, think through what outcomes would be most relevant for your research question.

We will look at the following outcomes:
- Average quarterly earnings and the total workers employed by quarter (data: TN wages)
    - Distribution by major
- Average quarterly earnings and the total workers employed by quarter (data: dominant TN wages)
    - Distribution by major
- Full quarter employment (data: full quarter TN wages, non-dominant)
    - Distribution by gender
- Average quarterly earnings and the total workers employed by quarter of those in our cohort employed in Kentucky (data: KY wages)
- Employment patterns (data: quarterly wages-TN and KY)

### Average quarterly earnings and number employed by quarter

As mentioned earlier, we plan to focus on the first 12 quarters post-graduation for each individual. However, `df_wages_undup` currently contains employment outcomes in the quarter of graduation (quarter 0), as well as in the 13th quarter post-graduation. To isolate each quarter post-graduation, we will create a new variable, `quarter_number`, which leverages some of R's functionality when it comes to working with date variables. After converting `grad_date` and `job_date` to date objects in R, we can calculate the difference in weeks between the two values (thanks to base R's `difftime` function), divide by 13 since there are roughly 13 weeks in each quarter, and round to the nearest whole number.

In [None]:
# get quarter from graduation
df_wages_undup <- df_wages_undup %>%
    mutate(
        quarter_number = round(as.double(difftime(as.Date(job_date), as.Date(grad_date), units = "weeks")/13), 0)
    )

# see evidence
df_wages_undup %>%
    select(grad_date, job_date, quarter_number) %>%
    head()
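
To see why dividing by 13 and rounding recovers the quarter count, the calculation can be traced on a few made-up dates. This sketch uses only base R and assumes every date falls on the first day of a quarter, as in the tables above.

```r
# One made-up graduation date and four made-up job dates
grad_date <- as.Date("2016-04-01")
job_date  <- as.Date(c("2016-04-01", "2016-07-01", "2017-04-01", "2019-07-01"))

# Weeks elapsed between graduation and each job date
weeks_elapsed <- as.double(difftime(job_date, grad_date, units = "weeks"))

# Quarters are not exactly 13 weeks long, so round to the nearest whole quarter
quarter_number <- round(weeks_elapsed / 13)

data.frame(job_date, weeks_elapsed, quarter_number)
```

The last job date is about 169.4 weeks after graduation, which is 13.03 quarters before rounding, so small differences in quarter length do not change the result.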

With `quarter_number`, we can sum each individual's total earnings by quarter while excluding all observations where `quarter_number` is either 0 or 13.

In [None]:
# ignore quarters 0 and 13
df_wages_undup <- df_wages_undup %>%
    filter(!(quarter_number %in% c(0, 13)))

# find quarterly wages
df_wages_undup %>%
    group_by(SSN, quarter_number) %>%
    summarize(
        quarterly_wages = sum(wge_amt),
    ) %>%
    ungroup() %>%
    head()

Let's save these results to the data frame `quarterly_wages` so we can compute the cohort average quarterly earnings.

In [None]:
#save as quarterly_wages
quarterly_wages <- df_wages_undup %>%
    group_by(SSN, quarter_number) %>%
    summarize(
        total_wages = sum(wge_amt),
    ) %>%
    ungroup()

Now that we have `quarterly_wages` to capture each individual's total quarterly earnings, we can compute the cohort average quarterly earnings, as well as the total number of individuals employed, broken down by quarter after graduation.

In [None]:
#average wages and number of grads with wages by quarter after graduation
avg_and_num <- quarterly_wages %>%
    group_by(quarter_number) %>%
    summarize(
        mean_wage = mean(total_wages),
        n_employed = n_distinct(SSN)
    )

avg_and_num
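
The two-stage aggregation (sum within person and quarter, then average across people) can be verified by hand on a toy example. The base R sketch below uses invented wage records; the column names mirror those in `df_wages_undup`, but none of the values come from the data.

```r
# Made-up wage records: person a1 holds two jobs in quarter 1
wages <- data.frame(
    SSN            = c("a1", "a1", "b2", "a1"),
    quarter_number = c(1, 1, 1, 2),
    wge_amt        = c(1000, 500, 3000, 2000)
)

# Stage 1: total earnings per person per quarter
per_person <- aggregate(wge_amt ~ SSN + quarter_number, data = wages, FUN = sum)
names(per_person)[names(per_person) == "wge_amt"] <- "total_wages"

# Stage 2: average of those totals, and number of people, per quarter
mean_wage  <- tapply(per_person$total_wages, per_person$quarter_number, mean)
n_employed <- tapply(per_person$SSN, per_person$quarter_number,
                     function(x) length(unique(x)))

summary_by_qtr <- data.frame(
    quarter_number = as.numeric(names(mean_wage)),
    mean_wage      = as.vector(mean_wage),
    n_employed     = as.vector(n_employed)
)

summary_by_qtr
```

Averaging the raw rows instead would give 1500 for quarter 1 rather than 2250, because person a1's two jobs would be counted as two separate earners.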

We can see that the number of graduates employed is fairly consistent, and that over time, the average quarterly earnings rise. Keep in mind that the `mean_wage` encompasses average quarterly earnings, so if an individual had multiple earning sources in a quarter, we are currently taking the sum of them. Let's see if we see similar trends amongst those receiving the most common degrees within the cohort.

#### By Major

Recall the first data exploration notebook, where we identified the most common degrees earned within the cohort. We used the following code to isolate the most common majors:

    df_common_major <- df %>%
        count(CIP_Family) %>%
        arrange(desc(n)) %>%
        mutate(
            prop = n/sum(n)
        ) %>%
        head(5)
        
Let's see if there are consistent trends relative to the entire cohort of earners for those who received degrees in the five most common fields. Recall that whereas the cohort was saved in the data frame `df` in the first notebook, it is saved as `df_cohort` here.

In [None]:
# 5 most common majors
com_majors <- df_cohort %>%
    count(CIP_Family) %>%
    arrange(desc(n)) %>%
    mutate(
        prop = n/sum(n)
    ) %>%
    head(5)

com_majors

Now that we have the most common majors, we can `filter` `df_wages_undup` for only observations for these majors and evaluate earnings and the number of individuals employed within these five majors.

In [None]:
# earnings and number employed for most common majors
# first find quarterly wages for each person while including major
# then find average wages within groups
avg_and_num_major <- df_wages_undup %>%
    group_by(SSN, quarter_number, CIP_Family) %>%
    summarize(
        total_wages = sum(wge_amt)
    ) %>%
    ungroup() %>%
    filter(CIP_Family %in% com_majors$CIP_Family) %>%
    group_by(CIP_Family, quarter_number) %>%
    summarize(
        mean_wage = mean(total_wages),
        n_employed = n_distinct(SSN)
    )

avg_and_num_major

Naturally, we may wonder whether the same trend in `mean_wage` holds when we isolate dominant earnings, that is, the earnings from the job in which the individual had the highest wages in a given quarter.

### Average quarterly earnings and number employed by quarter (Dominant Earnings)

Before we can perform our analysis, we need to restrict `df_wages_undup` to only include the highest earnings per quarter for each individual (referred to here as dominant earnings). We will take an approach similar to the one we used to unduplicate wage records: we first arrange the records within each individual/quarter combination by descending wages (rather than by file date) and then keep the first, highest-earning record. We will save the resulting data frame as `df_dom_wages`.

In [None]:
# identify dominant wages in each quarter
df_dom_wages <- df_wages_undup %>%
    arrange(SSN,quarter_number, desc(wge_amt)) %>%
    distinct(SSN, quarter_number, .keep_all=TRUE)

head(df_dom_wages)

In [None]:
# confirm we have one entry for each SSN/quarter_number combination
df_dom_wages %>%
    count(SSN, quarter_number) %>%
    arrange(desc(n)) %>%
    head()

Now that we have isolated dominant earnings in `df_dom_wages`, we can recycle the same code that we applied to `df_wages_undup`.

In [None]:
# average dominant wages and number of grads with wages by quarter after graduation
avg_and_num_dom <- df_dom_wages %>%
    group_by(quarter_number) %>%
    summarize(
        mean_wage = mean(wge_amt),
        n_employed = n_distinct(SSN)
    )

avg_and_num_dom

Interestingly enough, you can see that a similar trend appears within `mean_wage`. Let's see if this trend is similar for graduates of the five most common majors.

> Note that `n_employed` is unchanged from the results of the previous code on `df_wages_undup` because `df_dom_wages` contains the same number of individuals, but only keeps records of dominant wages in a given quarter.
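
The `arrange`/`distinct` idiom and the unchanged head count can be checked on a small made-up example. This is only a sketch; `toy_wages` and its values are hypothetical.

```r
library(dplyr)

# Made-up wage records: some person/quarter combinations have two jobs
toy_wages <- tibble(
    SSN = c("A", "A", "B", "A", "B", "B"),
    quarter_number = c(1, 1, 1, 2, 2, 2),
    wge_amt = c(3000, 1000, 5000, 4500, 200, 5200)
)

# Sort so the highest-paying job comes first within each person/quarter,
# then keep only that first row
toy_dom <- toy_wages %>%
    arrange(SSN, quarter_number, desc(wge_amt)) %>%
    distinct(SSN, quarter_number, .keep_all = TRUE)

# Number of distinct people per quarter, before and after
n_before <- toy_wages %>%
    group_by(quarter_number) %>%
    summarize(n_employed = n_distinct(SSN))

n_after <- toy_dom %>%
    group_by(quarter_number) %>%
    summarize(n_employed = n_distinct(SSN))

toy_dom
```

Only the secondary jobs are dropped, so every person who had any wages in a quarter still appears exactly once in that quarter.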

#### By Major

In [None]:
# dominant earnings and number employed for most common majors
avg_and_num_dom_major <- df_dom_wages %>%
    group_by(SSN, quarter_number, CIP_Family) %>%
    summarize(
        total_wages = sum(wge_amt)
    ) %>%
    ungroup() %>%
    filter(CIP_Family %in% com_majors$CIP_Family) %>%
    group_by(CIP_Family, quarter_number) %>%
    summarize(
        mean_wage = mean(total_wages),
        n_employed = n_distinct(SSN)
    )

avg_and_num_dom_major

Notice the different salary growth experiences, on average, between graduates of these majors. However, these two employment measures, using total quarterly and dominant wages, do not include any measure of employment stability, which is often an important job aspect, especially for recent graduates.

### Full-Quarter Employment

There are many ways in which to define stable employment. Sometimes, it may be useful to define stable employment by a consecutive number of quarters worked with the same employer. Other times, stable employment may assume an alternative definition. Here, we will define stable employment as full-quarter employment. Full-quarter employment in quarter *t* is indicated by a presence of wages with the same employer in quarters *t-1*, *t*, and *t+1*.

**Example: Data Needed to Assess Full-Quarter (FQ) Employment for Person1/EmployerA in Each of the 4 Quarters Post-Graduation**

|Person/Employer Combination|High Degree |FQ YearQtr |t-1|t|t+1|
|---|---|---|---|---|---|
|_Person 1/Employer A_ |_2013 Q3_ |_2013 Q4_ |<font color=green>2013 Q3</font> |**2013 Q4** |2014 Q1 |
|_Person 1/Employer A_ |_2013 Q3_ |_2014 Q1_ |2013 Q4 |**2014 Q1** |2014 Q2 |
|_Person 1/Employer A_ |_2013 Q3_ |_2014 Q2_ |2014 Q1 |**2014 Q2** |2014 Q3 |
|_Person 1/Employer A_ |_2013 Q3_ |_2014 Q3_ |2014 Q2 |**2014 Q3** |<font color=green>2014 Q4</font> |

As can be seen in the table above, calculating full-quarter employment for a four-quarter span requires two additional quarters of wage information (highlighted in green): the quarter of graduation and one quarter after the final quarter of interest. For our cohort, this means including employment in the quarter prior to quarter 1 and the quarter after quarter 12. This is why we initially brought in 14 quarters' worth of data: we need all 14 quarters to calculate full-quarter employment within the 12 on which we want to focus.
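
The definition can also be sketched in `dplyr` on made-up records. This is an illustration under stated assumptions, not the approach used in this notebook: it assumes a running quarter index `quarter` (0 for the graduation quarter), and `toy_jobs`, `prev_q`, and `next_q` are hypothetical names.

```r
library(dplyr)

# Made-up person/employer/quarter records
toy_jobs <- tibble(
    SSN = c("A", "A", "A", "A", "B", "B", "B"),
    empr_nbr = c("E1", "E1", "E1", "E1", "E2", "E2", "E3"),
    quarter = c(0, 1, 2, 3, 1, 3, 2),
    wge_amt = c(1000, 4000, 4100, 4200, 3000, 3100, 2500)
)

# Shift the keys so that a record from t-1 (or t+1) lines up with quarter t
prev_q <- toy_jobs %>% transmute(SSN, empr_nbr, quarter = quarter + 1)
next_q <- toy_jobs %>% transmute(SSN, empr_nbr, quarter = quarter - 1)

# Keep quarter-t records that have the same employer in both t-1 and t+1
toy_full_q <- toy_jobs %>%
    semi_join(prev_q, by = c("SSN", "empr_nbr", "quarter")) %>%
    semi_join(next_q, by = c("SSN", "empr_nbr", "quarter"))

toy_full_q
```

Person A has full-quarter employment with employer E1 in quarters 1 and 2 only. Quarters 0 and 3 cannot qualify here because the toy data has no earlier or later quarter to compare against, which is the same reason the real data needs the two extra quarters.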

In practice, we can isolate instances of full-quarter employment by joining three copies of our `nb_cohort_wages_link` table, matching them on `empr_nbr` and `SSN` values while requiring that their `job_date` values correspond to quarters *t-1*, *t*, and *t+1*.


In [None]:
# get full quarter instances
qry <- "
select b.SSN, b.empr_nbr, b.wge_amt, b.job_date, b.grad_date, b.Gender
from  tr_tn_2021.dbo.nb_cohort_wages_link a,  tr_tn_2021.dbo.nb_cohort_wages_link b,  tr_tn_2021.dbo.nb_cohort_wages_link c
where a.SSN = b.SSN and a.empr_nbr = b.empr_nbr and a.job_date = dateadd(month, 3, b.job_date)
and a.SSN = c.SSN and a.empr_nbr = c.empr_nbr and b.job_date = dateadd(month, 3, c.job_date)
"
full_q_wages <- dbGetQuery(con, qry)

head(full_q_wages)

Let's check how many individuals experienced at least one quarter of stable employment, how many distinct employers provided that stable employment, and the average wages earned in those quarters.

In [None]:
# see number with at least one quarter of full quarter, number of unique employers, and what are their average wages
full_q_stats <- full_q_wages %>%
    summarize(
        num_individuals = n_distinct(SSN),
        num_employers = n_distinct(empr_nbr),
        avg_wage = mean(wge_amt)
    )

full_q_stats

Is there a difference by gender? In our next subsection, we will try to answer this question.

#### By Gender

At the end of the first data exploration notebook, after exploring the most common majors within the cohort, we also analyzed the gender breakdown of the cohort. Here, we will compare the overall gender breakdown to that of those who experienced at least one quarter of full-quarter employment.

Recall the code from the first notebook in the code cell below:      

In [None]:
# gender breakdown
df_gender <- df_cohort %>%
    count(Gender) %>%
    arrange(desc(n)) %>%
    mutate(
        prop = n/sum(n)
    )

# see df_gender
df_gender

Now we can apply it to the `full_q_wages` data frame and calculate the number and proportion of individuals by gender, as well as their average wages.

In [None]:
# see number with at least one quarter of full quarter by gender, what are their average wages
full_q_stats_gender <- full_q_wages %>%
    group_by(Gender) %>%
    summarize(
        num_individuals = n_distinct(SSN),
        avg_wage = mean(wge_amt)
    ) %>%
    mutate(
        prop = num_individuals/sum(num_individuals)
    )

full_q_stats_gender

We can see that roughly the same proportions exist, and that among those who experienced stable employment, there appears to be a major discrepancy in average wages across genders. What do you think are some potential reasons for this?

------

#### **Checkpoint 3: Stable Employment by Major**

Recreate the table above (besides `prop`) for the five most common majors. Do the results surprise you?

In [None]:
# replace ___
full_q_wages %>%
    group_by(___) %>%
    summarize(
        num_individuals = n_distinct(SSN),
        avg_wage = mean(wge_amt)
    )

### Finding Employment Records for our Cohort outside of Tennessee

For the last part of this notebook, we will try to identify portions of our cohort that were employed in Kentucky by linking our cohort to the Kentucky UI wage records. After linking Tennessee graduates to Kentucky wage records, we will analyze common employment patterns over the 12 potential quarters of employment post graduation.

#### Manipulating Kentucky's UI Wage Records

Before we link our cohort to Kentucky's employment outcomes, let's take a look at the Kentucky UI wage records data, which is stored in the `ds_ky_kystats` database, to get a better sense of the data. We will work with the `ui_wages` table.

In [None]:
# look at KY UI wages
qry <- "
select top 5 *
from ds_ky_kystats.dbo.ui_wages
"

dbGetQuery(con, qry)

You may have noticed that there is not a common social security variable that we can use to join to our cohort like we did with Tennessee's UI wage records. Instead, there is a `coleridge_id` column, which is a common individual-based identifier that exists across all tables in the `ds_ky_kystats` database. Luckily, the table `ui_person` in the `ds_ky_kystats` database maps `coleridge_id` values to `ssn` values, so we can link these individuals to tables in other databases.

Let's take a quick peek at `ui_person`.

In [None]:
# look at ui_person
qry <- "
select top 5 *
from ds_ky_kystats.dbo.ui_person
"
dbGetQuery(con, qry)

Similar to the process for matching Tennessee's UI wage records to the graduates cohort, we will create a `job_date` column in Kentucky's UI wage records. However, due to the lack of a pre-existing common hashed social security number, we will also need to leverage the `ui_person` table to add social security numbers for matching. Therefore, if we were to break down this operation into a set of steps, it may look as follows:

1. Create `job_date` variable using a combination of the existing `calendaryear` and `qtr` variables in `ui_wages`.
2. Join `ui_person` to `ui_wages` to include social security numbers.
3. Subset `ui_wages` to only include wage records of individuals in our original cohort. While unnecessary, this will speed up the eventual linkage between the modified Kentucky UI wage records table and the original cohort.

We can apply these steps in an iterative process.
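
Before walking through the SQL, the three steps can be sketched in R on made-up tables. Everything here is hypothetical (`toy_ui_wages`, `toy_ui_person`, `toy_cohort`, and their values); the sketch only mirrors the column names described above.

```r
library(dplyr)

# Made-up stand-ins for ui_wages, ui_person, and the cohort
toy_ui_wages <- tibble(
    coleridge_id = c("c1", "c1", "c2", "c3"),
    calendaryear = c(2016, 2016, 2017, 2017),
    qtr = c(1, 4, 2, 3),
    wages = c(4000, 4300, 5000, 2800)
)
toy_ui_person <- tibble(
    coleridge_id = c("c1", "c2", "c3"),
    ssn = c("A", "B", "C")
)
toy_cohort <- tibble(SSN = c("A", "C"))

toy_dated <- toy_ui_wages %>%
    # Step 1: first month of the quarter is qtr * 3 - 2 (1, 4, 7, or 10)
    mutate(
        job_date = as.Date(sprintf(
            "%d-%02d-01",
            as.integer(calendaryear),
            as.integer(qtr * 3 - 2)
        ))
    ) %>%
    # Step 2: attach the identifier used for linking
    inner_join(toy_ui_person, by = "coleridge_id") %>%
    # Step 3: keep only records for members of the cohort
    filter(ssn %in% toy_cohort$SSN)

toy_dated
```

Each quarter is represented by the first day of its first month, so quarter 4 of 2016 becomes 2016-10-01.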

##### Step 1: Create `job_date`

In [None]:
# create job_date
qry <- "
select top 5 *,
    convert(datetime, concat(qtr*3-2, '/', '01', '/', calendaryear)) as job_date
from ds_ky_kystats.dbo.ui_wages
"
dbGetQuery(con, qry)

##### Step 2: Add hashed social security numbers

In [None]:
# add hashed ssns
qry <- "
select top 5 wages.*, person.ssn,
    convert(datetime, concat(wages.qtr*3-2, '/', '01', '/', wages.calendaryear)) as job_date
from ds_ky_kystats.dbo.ui_wages wages
join ds_ky_kystats.dbo.ui_person person
on wages.coleridge_id = person.coleridge_id
"
dbGetQuery(con, qry)

##### Step 3: Subset original `ui_wages` table

In [None]:
# subset ui_wages table
qry <- "
select top 5 wages.*, person.ssn,
    convert(datetime, concat(wages.qtr*3-2, '/', '01', '/', wages.calendaryear)) as job_date
from ds_ky_kystats.dbo.ui_wages wages
join ds_ky_kystats.dbo.ui_person person
on wages.coleridge_id = person.coleridge_id
WHERE ssn in (
    SELECT DISTINCT(SSN) 
    FROM tr_tn_2021.dbo.grads1516_dated
)
"
dbGetQuery(con, qry)

We have slightly adapted this code to create a permanent version of the table (`ky_ui_wages_dated`) in the `tr_tn_2021` database. The code we used is pasted below:

    select wages.*, person.ssn,
        convert(datetime, concat(wages.qtr*3-2, '/', '01', '/', wages.calendaryear)) as job_date
    into tr_tn_2021.dbo.ky_ui_wages_dated
    from ds_ky_kystats.dbo.ui_wages wages
    join ds_ky_kystats.dbo.ui_person person
    on wages.coleridge_id = person.coleridge_id
    WHERE ssn in (
        SELECT DISTINCT(SSN) 
        FROM tr_tn_2021.dbo.grads1516_dated
    )

In [None]:
#query the new ky_ui_wages_dated table
qry <- "
select top 5 * 
from tr_tn_2021.dbo.ky_ui_wages_dated
"
dbGetQuery(con, qry)

#### Linking Cohort to Kentucky's wage records

With the new `ky_ui_wages_dated` table, we now have Kentucky's UI wage records table in a format by which we can link to our cohort within our desired time period. We can do so by recycling the code linking to Tennessee's wage records, as long as we substitute in the proper variable names.

In [None]:
# link wage and education tables
qry <- "
select cohort.*, w.naics, w.wages, w.coleridge_id, w.employeeno, w.job_date 
from tr_tn_2021.dbo.grads1516_dated cohort
join tr_tn_2021.dbo.ky_ui_wages_dated w
on cohort.SSN = w.ssn 
where w.job_date >= cohort.grad_date and DATEADD(quarter, 13, cohort.grad_date) >= w.job_date and w.wages > 0;
"
df_wages_ky <- dbGetQuery(con, qry)

head(df_wages_ky)

#### (Brief) Exploration of Employment Outcomes in Kentucky

Let's see the number of individuals (and number of jobs) from our original cohort that had employment outcomes in Kentucky.

In [None]:
# number of jobs and number of individuals 
df_wages_ky %>%
    summarize(
        n_jobs = n(),
        n_individuals = n_distinct(SSN)
    )

We can also inspect the wage records for any potential duplication on the person-employer-quarter combination.

In [None]:
# inspect for potential wage duplication
df_wages_ky %>% 
    count(SSN, employeeno, job_date) %>%
    arrange(desc(n)) %>% 
    head()

Finally, let's see potential quarterly wage progression and movement of jobs to or from Kentucky. To do so, we can insert the `quarter_number` variable into `df_wages_ky`.
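
As a quick illustration of how a quarter offset can be derived from two dates, here is a minimal sketch on made-up dates. It divides the elapsed weeks by 13 and rounds; the dates themselves are hypothetical.

```r
# Made-up graduation quarter and job quarters (first day of each quarter)
grad_date <- as.Date("2016-04-01")
job_date <- as.Date(c(
    "2016-04-01", "2016-07-01", "2016-10-01",
    "2017-01-01", "2019-04-01", "2019-07-01"
))

# Elapsed time in weeks, then in 13-week quarters, rounded to a whole number
weeks_apart <- as.double(difftime(job_date, grad_date, units = "weeks"))
quarter_number <- round(weeks_apart / 13, 0)

quarter_number
```

A calendar quarter is slightly longer than 13 weeks, so the unrounded values drift upward a little (the last one is about 13.03), but over 13 quarters the drift stays far below half a quarter and rounding still recovers the intended offset.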

In [None]:
# get quarter from graduation
df_wages_ky <- df_wages_ky %>%
    mutate(
        quarter_number = round(as.double(difftime(as.Date(job_date), as.Date(grad_date), units = "weeks")/13), 0)
    )

# see evidence
df_wages_ky %>%
    select(grad_date, job_date, quarter_number) %>%
    head()

After ignoring all entries where `quarter_number` is either 0 or 13, we can calculate the quarterly wages per individual for employment outcomes in Kentucky.

In [None]:
# ignore quarters 0 and 13
df_wages_ky <- df_wages_ky %>%
    filter(!(quarter_number %in% c(0, 13)))

#save as quarterly_wages_ky
quarterly_wages_ky <- df_wages_ky %>%
    group_by(SSN, quarter_number) %>%
    summarize(
        total_wages = sum(wages),
    ) %>%
    ungroup()

Now that we have `quarterly_wages_ky`, we can compute the cohort average quarterly earnings, as well as the total number of individuals employed, broken down by quarter after graduation.

In [None]:
#average wages and number of grads with wages by quarter after graduation in KY
avg_and_num_ky <- quarterly_wages_ky %>%
    group_by(quarter_number) %>%
    summarize(
        mean_wage = mean(total_wages),
        n_employed = n_distinct(SSN)
    )

avg_and_num_ky

We do not observe similar patterns for employment outcomes in Kentucky. There is a lot of room to evaluate other factors at play, such as the potential for employment in multiple states in the same quarter.

##### **Checkpoint 4: Kentucky's Dominant Earnings**

Recreate the table above for dominant earnings in Kentucky. Do you observe any changes?

In [None]:
# replace ___
df_dom_wages_ky <- df_wages_ky %>%
    arrange(SSN, quarter_number, desc(___)) %>%
    distinct(SSN, quarter_number, .keep_all=___)

df_dom_wages_ky %>%
    group_by(___) %>%
    summarize(
        mean_wage = ___,
        n_employed = ___
    )

------

### Employment Patterns

Employment outcomes can be defined by more than just earnings. While we looked at stability, we can also look more comprehensively at employment patterns. How quickly do graduates find employment? How many spells of nonemployment do graduates experience, and for how long? These are important factors to consider as graduates enter and navigate the labor market, particularly as we consider differences across major and demographic groupings.

At the end of this section, we hope to have found the most common employment patterns for everyone in the original cohort, not just those who matched to the Tennessee and/or Kentucky UI wage records. To start, let's get a sense of the number of individuals that are missing from the combination of `quarterly_wages` and `quarterly_wages_ky`, which contain entries of quarterly earnings for individuals of our cohort employed in Tennessee and Kentucky, respectively.

In [None]:
# see size of original cohort
df_cohort %>%
    summarize(
        num_inds = n_distinct(SSN)
    )

In [None]:
# see amount of people with employment outcomes in TN
quarterly_wages %>%
    summarize(n_distinct(SSN))

In [None]:
# see amount of people with employment outcomes in ky
quarterly_wages_ky %>%
    summarize(n_distinct(SSN))

Note that the number of individuals not tracked in either Tennessee's or Kentucky's wage records is not simply the difference between the number of individuals in `df_cohort` and the sum of those in `quarterly_wages` and `quarterly_wages_ky`, because some individuals may have employment outcomes in both states at some point in the 12 quarters post-graduation.
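
To see why individuals who appear in both states matter for this count, here is a minimal sketch on made-up identifiers. The vectors `cohort`, `tn`, and `ky` are hypothetical and not drawn from the data.

```r
# Made-up identifiers: "C" has wage records in both states
cohort <- c("A", "B", "C", "D", "E", "F")
tn <- c("A", "B", "C")
ky <- c("C", "D")

# People found in at least one state, and in both states
n_either <- length(union(tn, ky))
n_both <- length(intersect(tn, ky))

# People found in neither state
n_neither <- length(setdiff(cohort, union(tn, ky)))

# Subtracting the two state counts separately counts "C" twice
n_naive <- length(cohort) - (length(tn) + length(ky))

c(n_either = n_either, n_both = n_both, n_neither = n_neither, n_naive = n_naive)
```

Here two people ("E" and "F") are in neither state, but subtracting the two state counts from the cohort size gives one, because the person employed in both states is subtracted twice.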

Let's combine records from `quarterly_wages` and `quarterly_wages_ky` so that we have one row for each `SSN`-`quarter_number` combination with wage records in either state. To do so, we can use an `anti_join` to find the `SSN`-`quarter_number` combinations that exist in `quarterly_wages_ky` but not in `quarterly_wages`.

> Because we are focused on employment patterns, we don't need to sum earnings across the two states to get accurate quarterly earnings for individuals who were employed in both states in the same quarter.

In [None]:
# find ssn/quarter_number combinations in quarterly_wages_ky that don't exist in quarterly_wages
# assign to ky_not_tn
ky_not_tn <- quarterly_wages_ky %>%
    anti_join(quarterly_wages, by = c("SSN", "quarter_number"))

head(ky_not_tn)

Now, if we add all of the rows from `ky_not_tn` to `quarterly_wages`, we will have one row for all `SSN`/`quarter_number` combinations for members of the original cohort that appear either in the Tennessee or Kentucky UI wage records. We can do so using `rbind`.

> `rbind` will only work if the column names in the two data frames are the same.
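
The `anti_join` followed by `rbind` pattern can be tried out on two small made-up data frames. This is only a sketch; `toy_tn`, `toy_ky`, and their values are hypothetical.

```r
library(dplyr)

# Made-up quarterly wages in each state, with identical column names
toy_tn <- tibble(
    SSN = c("A", "A", "B"),
    quarter_number = c(1, 2, 1),
    total_wages = c(4000, 4500, 5000)
)
toy_ky <- tibble(
    SSN = c("A", "B", "C"),
    quarter_number = c(2, 2, 1),
    total_wages = c(300, 5200, 2800)
)

# Kentucky rows whose SSN/quarter_number combination is absent from Tennessee
toy_ky_not_tn <- toy_ky %>%
    anti_join(toy_tn, by = c("SSN", "quarter_number"))

# Stack the two: each SSN/quarter_number combination now appears once
toy_combined <- toy_tn %>%
    rbind(toy_ky_not_tn)

toy_combined
```

Person A's quarter 2 record from Kentucky is dropped because that combination already exists in the Tennessee data, so the Tennessee wage value is the one that is kept.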

In [None]:
# join missing SSN/quarter_number combinations from ky_not_tn to quarterly_wages
# save resulting data frame as combined_wages
combined_wages <- quarterly_wages %>%
    rbind(ky_not_tn)

# see number of individuals with at least one wage record in either TN or KY
combined_wages %>%
    summarize(n_distinct(SSN))

Before we manipulate any existing data frames, let's confirm that if we join `combined_wages` to `df_cohort`, the number of individuals for whom `quarter_number` (or any other variable found only in `combined_wages`) is `NA` equals the difference in the number of individuals between `df_cohort` and `combined_wages`.

In [None]:
# see that the number with na quarter equals the number who didn't show up in combined_wages
df_cohort %>%
    left_join(combined_wages, by = "SSN") %>%
    filter(is.na(quarter_number)) %>%
    summarize(n_distinct(SSN))

Now that we have confirmed that our join should work as intended, as there are no instances of any observations that may be duplicated, we will join `combined_wages` to `df_cohort`. After doing so, we will set all instances where `quarter_number` is `NA` equal to 1, so that we will eventually be able to have 12 observations for each individual, one for each potential quarter of employment.

In [None]:
# set all where quarter is na equal to one so we can use complete
full_wages <- df_cohort %>%
    left_join(combined_wages, by = "SSN") %>%
    mutate(
        quarter_number = ifelse(is.na(quarter_number), 1, quarter_number)
    )

# see potential quarter numbers
full_wages %>%
    distinct(quarter_number) %>%
    arrange(quarter_number)

Now that we have all potential `SSN` values, as well as instances of all desired `quarter_number` values, we can leverage `complete`, which will add additional rows for any combinations of `SSN`/`quarter_number` that do not currently exist in `full_wages`. If the combination does not appear in `full_wages`, the resulting `total_wages` value will be `NA`, signifying the individual was not employed in this quarter. As verification, the number of rows should equal the number of individuals multiplied by 12.
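
Here is a minimal sketch of what `complete` does, using a made-up data frame with three quarters instead of twelve. The data frame `toy_full` and its values are hypothetical.

```r
library(dplyr)
library(tidyr)

# Made-up data: "C" had no wage records and carries a placeholder quarter 1
toy_full <- tibble(
    SSN = c("A", "A", "B", "C"),
    quarter_number = c(1, 3, 2, 1),
    total_wages = c(4000, 4200, 5000, NA)
)

# Add a row for every SSN/quarter_number combination that is missing
toy_completed <- toy_full %>%
    complete(SSN, quarter_number, fill = list(total_wages = NA))

toy_completed
```

The result has 3 individuals times 3 quarters, or 9 rows, of which only the 3 original wage values are non-missing. Note that `complete` only crosses values that are present in the data, so each quarter number must appear for at least one individual for this to work.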

In [None]:
# complete file
completed <- full_wages %>%
    complete(SSN, quarter_number, fill=list(total_wages=NA))

# see that n should equal n_inds multiplied by 12
completed %>%
    summarize(
        n = n(),
        n_inds = n_distinct(SSN),
        test = n_inds*12 == n
    )

In [None]:
# see completed
head(completed)

Now that we have created `completed`, we just need to aggregate and manipulate the data frame so that each column is a quarter, and each observation is an individual, with the corresponding columns indicating whether the individual was employed in the given quarter. To start, let's create a variable `wage_ind`, which will be "yes" if the individual was employed in the quarter, and "no" otherwise. Additionally, for ease of column manipulation, we will change each `quarter_number` value from 1, 2, 3,..., 12 to Q1, Q2, Q3,..., Q12 and call this variable `quarter`.

In [None]:
# create wage_ind and quarter variables
patterns <- completed %>%
    mutate(
        wage_ind = ifelse(is.na(total_wages), "no", "yes"),
        quarter = paste("Q",quarter_number, sep="")
    )

head(patterns)

Now, we need to figure out how to "pivot" the data frame so that each column is a value of `quarter`, with `wage_ind` values for the `SSN` values. To do so, we will use `pivot_wider`, which allows us to take a tidy data frame (one observation per row) and "widen" it so that each column becomes values from what was previously a single column (`quarter`) and the rows are occupied by those from a corresponding column (`wage_ind`). 

After manipulating the data frame, we can aggregate by the quarter columns, and count the number of observations within each of these patterns to discover the most common employment patterns for the cohort.
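
The reshaping and counting steps can be sketched on a made-up long data frame with three quarters. The data frame `toy_long` is hypothetical, and `count` is used here as a shorthand for grouping and tallying rather than as the approach taken in this notebook.

```r
library(dplyr)
library(tidyr)

# Made-up long data: one row per person and quarter
toy_long <- tibble(
    SSN = rep(c("A", "B", "C"), each = 3),
    quarter = rep(c("Q1", "Q2", "Q3"), times = 3),
    wage_ind = c("yes", "yes", "yes", "no", "yes", "yes", "yes", "yes", "yes")
)

toy_patterns <- toy_long %>%
    # One row per person, one column per quarter
    pivot_wider(names_from = quarter, values_from = wage_ind) %>%
    # Tally each distinct yes/no sequence, most common first
    count(Q1, Q2, Q3, name = "cnt", sort = TRUE) %>%
    mutate(prop = cnt / sum(cnt))

toy_patterns
```

Two of the three individuals were employed in all three quarters, so that pattern appears first, followed by the pattern of the individual who was not employed in Q1.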

In [None]:
# find most common employment patterns
patterns <- patterns %>%
    select(SSN, quarter, wage_ind) %>%
    pivot_wider(names_from = quarter, values_from = wage_ind) %>%
    group_by(Q1, Q2, Q3, Q4, Q5, Q6, Q7, Q8, Q9, Q10, Q11, Q12) %>%
    summarize(cnt = n_distinct(SSN)) %>%
    arrange(desc(cnt)) %>%
    ungroup() %>%
    mutate(
        prop = percent(cnt/sum(cnt), .01)
    ) 

head(patterns)

## 8. Save as csvs

Before we finish the notebook, let's save our results as .csv files so that they can be referenced in the Data Visualization notebook.

<font color=red> Note that you need to change the directory in the write_csv() statements below. Replace ".." with your username.</font>

In [None]:
# Save dataframes to CSV to use in later notebook

# average quarterly earnings and number employed by quarter
write_csv(avg_and_num, "U:\\..\\TN Training\\Results\\avg_and_num.csv")

# average quarterly earnings and number employed by quarter (common majors)
write_csv(avg_and_num_major, "U:\\..\\TN Training\\Results\\avg_and_num_major.csv")

# average dominant quarterly earnings and number employed by quarter
write_csv(avg_and_num_dom, "U:\\..\\TN Training\\Results\\avg_and_num_dom.csv")

# average dominant quarterly earnings and number employed by quarter (common majors)
write_csv(avg_and_num_dom_major, "U:\\..\\TN Training\\Results\\avg_and_num_dom_major.csv")

# full quarter info
write_csv(full_q_stats, "U:\\..\\TN Training\\Results\\full_q_stats.csv")

# full quarter info by gender
write_csv(full_q_stats_gender, "U:\\..\\TN Training\\Results\\full_q_stats_gender.csv")

# employment patterns
write_csv(patterns, "U:\\..\\TN Training\\Results\\patterns.csv")

-----