<center> <img style="float: center;" src="images/CI_horizontal.png" width="450">
<center>
    <span style="font-size: 1.5em;">
        <a href='https://www.coleridgeinitiative.org'>Website</a>
    </span> 
    <br>
    Heath Prince, Rukhshan Mian, Benjamin Feder, Nathan Barrett </center>
    <a href="https://doi.org/10.5281/zenodo.6412649"><img src="https://zenodo.org/badge/DOI/10.5281/zenodo.6412649.svg" alt="DOI"></a>


# Data Exploration: Wages

## 1. Introduction

It is generally accepted that individuals and society benefit from higher levels of educational attainment. Postsecondary credentials, in particular, appear to be the keys to individual self-sufficiency, greater civic participation, and higher levels of family well-being, as well as the catalysts for local, regional, and national economic growth. Economist Anthony Carnevale (2010) referred to access to postsecondary education and training as the "arbiter of opportunity in America," and this statement continues to hold true; the median individual employed full-time today earns 80 percent more than a similar individual with only a high school diploma. Perhaps more so than ever, success in the labor market requires workers to demonstrate competencies in thinking critically and applying new skills to ever more complex technology, as well as to demonstrate the ability to learn wholly new skills in short order. Employers benefit from a more highly skilled and productive workforce, and society benefits from increased tax revenue, reduced crime rates and dependence on public assistance programs, and greater civic engagement.

Texas is certainly no exception to this general rule. In their analysis of returns to postsecondary attainment in Texas, Murdock et al. (2003) found that for every \\$1.00 invested in higher education, the state received more than \\$4.00 in returns in terms of reduced public assistance, lower incarceration rates, and increased tax revenue. Similarly, in their study of the return on state investments in workforce services in Texas, King et al. (2008) found an annualized ROI over a 10-year period of 38 percent for participants and 25 percent for taxpayers, with investments in high-intensity services (including postsecondary education up to the associate degree level) yielding somewhat higher earnings than low-intensity investments.

Despite these generally positive findings, it is likely that the value of a postsecondary credential is considerably more nuanced than simply whether one holds one. In their analyses of NLSY79 data, Roksa and Levey (2010) found "occupationally specific degrees are beneficial at the point of entry into the labor market but have the lowest growth in occupational status over time. Students earning credentials focusing on general skills, in contrast, begin in jobs with low occupational status but subsequently report the greatest growth."

These mixed findings suggest that there are many unanswered questions regarding state-level investments in postsecondary education and training: how is the value defined for and attributed to education and training certificates and credentials; what role, if any, does prior work experience play in determining education and training-related labor market outcomes; how do these outcomes differ by race, gender, family income quartile, etc., plus a host of others. 

## **2. Learning Objectives** 

Recall the guiding research questions we will use for this series of notebooks are quite general: 

>**What are the employment outcomes of the 2015 bachelor's degree recipients? How do these outcomes vary by cohort characteristics?**

In the first data exploration notebook, `1.Data_Exploration.ipynb`, we have already defined a cohort of interest as 2015 calendar year bachelor's degree earners in Texas. Here, we will introduce you to the available wage records and walk through how we can link our cohort to the wage records tables to track the cohort's employment outcomes up to three years post-graduation. We will provide code and explanations for various outcome measures and compare them amongst subgroups such as major and gender. At the end of the notebook, we will save the data frames containing these results as csv files so that we can easily use them in the next notebook where we will visualize these descriptive statistics.

At the end of this notebook, you should test your skills by performing the following tasks:

(1) Linking wage records with one or more cohorts related to your specific research question <br>
(2) Examining earnings over time and by subgroups



In [None]:
# Import the file with Checkpoint hints and solutions
source("nb2_hints_and_solutions.txt")

### **Notebook 2 Questions and Goals**
In this notebook, we focus on the following questions:
- What are the average quarterly earnings of our cohort in Texas? Do they vary by major?
- What are the stable employment outcomes of our cohort? Do they vary by gender?
- What are the most common employment patterns of our cohort?

After completing this notebook you should be able to perform the following analytical tasks:
- Link an education cohort to multiple sets of wage data
- Identify full quarter employment
- Identify wage outcomes by different subgroups

#### **Datasets**

We will explore the Texas Workforce Commission wage table and build upon your work begun in Notebook 1 with the Texas Higher Education Coordinating Board (THECB) completers table. More specifically, we will leverage the following datasets:

- **Quarterly Wages**: The `wage_records_lehd` table is provided by the TWC and taken from Unemployment Insurance (UI) wage data. The data include individual quarterly earnings.
- **College Graduates**: The graduates table is provided by the THECB. The data include graduations at all Texas colleges and universities and cover the time period of January 2011 through December 2020.

## 3. Notebook Setup

In [None]:
# Run these database interaction (R package) imports.
# Do not panic if you see a few build version warnings.
# The versions used here were built with R 4.0.5 (Shake and Throw).
# The latest R version is 4.1.0 (Camp Pontanezen).
library(odbc, warn.conflicts=F, quietly=T)

# For data manipulation/visualization
library(tidyverse, warn.conflicts=F, quietly=T)

# For faster date conversions
library(lubridate, warn.conflicts=F, quietly=T)

# Use percent() function
library(scales, warn.conflicts=F, quietly=T)

To make your life easier, please insert your ADRF username to replace the ____ inside the quotations in the following cell.

In [None]:
# insert ADRF username Firstname.Lastname.UserID
username <- "___"

In [None]:
# Connect to the server.
# You will not see any output when the connection is made.
# Jupyter will post a warning if a connection cannot be made or if a connection breaks.
con <- DBI::dbConnect(odbc::odbc(),
                     Driver = "SQL Server",
                     Server = "msssql01.c7bdq4o2yhxo.us-gov-west-1.rds.amazonaws.com",
                     Trusted_Connection = "True")

> If you are not properly connected to the server and/or have not loaded the packages to do so, you will receive an error message running the following code cell.

Now that we are connected to the proper server, we can load the cohort established in the first notebook, `grads15`, into R.

> Recall `bachelors`, the data frame from the first notebook, is saved as `grads15` in the `tr_tx_2021` database.

In [None]:
# Let's get the data cohort from Notebook 1 and put it in an R data frame (df_cohort)
# Still no exciting output? Don't panic. You are creating a data frame but not viewing it yet.
qry <- 
"
SELECT * 
FROM tr_tx_2021.dbo.grads15
"

df_cohort <- dbGetQuery(con, qry)

To see some output from `df_cohort`, we can take a look at the first six rows of the data frame.

In [None]:
# Recall that this is an R command, so you can run it in a code cell
# Do a quick scan through the column headers
head(df_cohort)

## 4. A Note on SQL and R for Processing Data

SQL is designed to allow for quick and efficient processing of massive amounts of information, such as UI wage records files. Although you may not have trouble narrowing down a cohort from the graduates table in R, you will run into memory issues reading larger tables into R prior to significantly limiting their size. Particularly because we will need to link our original cohort to the UI wage records to begin to understand the cohort's employment outcomes, we saved our resulting analytical file formed at the end of Notebook 1 as a table in SQL. This will allow us to easily perform a linkage to the cohort's employment outcomes in SQL, as opposed to reading the entire UI wage records table into R to perform the linkage. Once we have our final table of wage outcomes specific to our cohort within a defined time period, we should be able to read this table into R to perform more complex analyses, as it is just a small subset of the original UI wage records file.
 
Oftentimes, analysts working with large datasets will begin their analysis in SQL to define their analytical frame before reading the resulting table into R. This workflow typically maximizes the power of the two languages, as SQL will be much more efficient when working with massive amounts of data, and R allows for more complex analyses and visualizations. 

## 5. Linking the Cohort to Wage Records

Since our cohort, `df_cohort`, does not contain employment outcomes, we will need to figure out a method to extract post-graduation earnings for these individuals from Texas's UI wage records table. This section will walk you through a possible linkage procedure.

### Understanding Texas's UI Wage Records

Before we can try to link `df_cohort` to any wage records, we need to get a better sense of the contents of the wage records table. Let's take a look at the column headers in the `wage_records_lehd` table and see if we can spot any common variables by which we can create a potential linkage.

In [None]:
# see five rows of data from tx wage_record table
qry <- "
SELECT top 5 *
FROM ds_tx_twc.dbo.wage_records_lehd
"
dbGetQuery(con, qry)

As you may have noticed, the individual identifier in the `wage_records_lehd` table (`ssn`) has a different name from the identifier in the `grads15` table (`gradid`). Despite this, the two identifiers represent the same hashed social security number for the individual. For the purposes of linking the two tables and conducting the analysis, we will use `gradid` going forward as the individual identifier.

<font color=orange> <h3> **Checkpoint 1: Time Travel** </h3> </font>

Given the available variables in the `wage_records_lehd` table and `df_cohort` (`tr_tx_2021.dbo.grads15` in SQL), identify potential variables we can use to define a specific time frame (up to three years post-graduation). Refer to the data dictionaries for complete column definitions.

> Note: You don't need to perform the linkage — we will be doing that in a handful of code cells — but please think about potential variables we might be able to use in the future.

In [None]:
# which variables appear in the two tables? You can also consult the data dictionaries
qry <- "
SELECT COLUMN_NAME
FROM tr_tx_2021.INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'grads15'
"
dbGetQuery(con, qry)

We have not provided hints or solutions for this checkpoint. You are encouraged to utilize the data dictionaries to think about what variables could be used to define a time frame. We will be going over an answer for this checkpoint in the next section. 

### Wage Records: Data Exploration

There are a few different ways we can approach linking our cohort with Texas' wage records so that they satisfy a specific time constraint. Although the solution presented may not line up with your answer to Checkpoint 1, it is one that can be applied to a lot of other datasets.

The general idea is to create new variables in each of the tables that represent graduation and employment information in terms of calendar dates. From there, we can take advantage of SQL and R's date-specific functions to extract employment data within a three-year timespan. If you refer back to the original tables from which we are taking wage and graduation data, you will notice that there are no columns indicating specific dates (i.e. mm/dd/yyyy format) within either. Luckily, though, there are columns in both tables that can allow us to approximate these dates in a consistent manner. 

For example, in Texas's UI wage records, the variable `quarter` tracks the calendar quarter and `year` denotes the calendar year corresponding to each employment record.

> One benefit of working with dates in the specific `datetime` type is that there are built-in functions to calculate the time elapsed between two dates.

In [None]:
# see quarter and year in wage_records_lehd
qry <- "
SELECT TOP 5 quarter, year
FROM ds_tx_twc.dbo.wage_records_lehd
"
dbGetQuery(con, qry)

We will have to do a bit of manipulation to get a rough but consistent "date" of employment across all of the wage records from the `quarter` and `year` variables. To do so, we will approximate the job date as the first day of the quarter, so employment in:

- **Q1** will correspond to **January 1**
- **Q2** will correspond to **April 1** 
- **Q3** will correspond to **July 1**
- **Q4** will correspond to **October 1** 

A quick way to map these quarters to their corresponding month given this rule is to multiply each quarter by 3 and subtract 2. For example, to get the 1st month associated with the 3rd quarter, we can compute $(3*3 - 2) = 7$. That is, the first month associated with the 3rd quarter is July (month = 7).

Therefore, our strategy to add in a variable `job_date`, which will be a date-formatted approximation of the date of employment in `mm/dd/yyyy` format, will be as follows:

1. Multiply `quarter` by 3 and subtract 2 to get the month.
2. Combine the month (step 1) and the calendar year using `concat()` so that the date format is mm/dd/yyyy, with dd always corresponding to '01', the first day of the quarter.
3. Convert the manipulated date string, which is of type `varchar` after running `concat()`, to `datetime` so that `job_date` registers as a date type.

The following table illustrates the output for each of these steps:

|`year`    | `quarter` |Step 1| Step 2| Step 3 |
| ----------- | :-----------: | :-----------: |:-----------: |:-----------:|
| 2015      |  3| 7 | 07/01/2015|2015-07-01

If you were to write the code out in steps, with each step building on the last, the code to create a `job_date` variable could look as follows:

#### Adjust `quarter` to correspond to a month

In [None]:
# map extracted quarter to month by multiplying by 3 and subtracting 2
# showing example where the quarter is not quarter 1
qry <- "
SELECT TOP 5 (quarter)*3-2 as month_for_job_date
FROM ds_tx_twc.dbo.wage_records_lehd
WHERE quarter = 3
"
dbGetQuery(con, qry)

#### Coerce extracted month and calendar year into a date-like format

In [None]:
# combine extracted month, day (always 01), and calendar year into date-like format using 'concat()'
# want format mm/dd/yyyy with '/' separators
# have combined month_for_job_date and year into 'job_date'
# notice type is 'chr'
qry <- "
SELECT TOP 5 quarter, year,
    CONCAT((quarter)*3-2, '/', '01', '/', year) as job_date
FROM ds_tx_twc.dbo.wage_records_lehd
"
dbGetQuery(con, qry)

#### Convert `job_date` to `datetime` type

In [None]:
# convert job_date to datetime in SQL using convert()
# first argument of convert() is the type to which you would like to convert the variable
qry <- "
SELECT TOP 5 quarter, year,
    CONVERT(datetime, concat((quarter)*3-2, '/', '01', '/', year)) as job_date
FROM ds_tx_twc.dbo.wage_records_lehd
"

dbGetQuery(con, qry)

Now, you have code to generate a rough `job_date` variable from the Texas UI wage records. While it is possible to do this in R using the `lubridate` package, the size of the table means you will often run into memory and speed issues.
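
For a table small enough to fit in memory, the same three steps can be sketched in base R. This is only an illustration under stated assumptions: `quarter_to_date()` is a hypothetical helper that does not exist in the notebook, and `toy_wages` is made-up data rather than an extract of `wage_records_lehd`.

```r
# Hypothetical helper (not part of the notebook): year and quarter -> first day of the quarter
quarter_to_date <- function(year, quarter) {
    first_month <- quarter * 3 - 2    # Q1 -> 1, Q2 -> 4, Q3 -> 7, Q4 -> 10
    as.Date(sprintf("%d-%02d-01", as.integer(year), as.integer(first_month)))
}

# Made-up records standing in for a small extract of wage data
toy_wages <- data.frame(year = c(2015, 2015, 2016), quarter = c(3, 4, 1))
toy_wages$job_date <- quarter_to_date(toy_wages$year, toy_wages$quarter)

toy_wages
```

The resulting `job_date` column is of class `Date`, so it can be compared and subtracted like any other date in R.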

### Assessing duplicates

When it comes to wage records, one of our assumptions is that there should be one entry for each individual-employer-quarter-year combination. Let's confirm that assumption by counting the number of entries within each `ssn-empr_no-quarter-year` combination.

The following code checks our assumption:

In [None]:
# check for duplicates
qry <- 
"
SELECT TOP 5 ssn, empr_no, year, quarter, COUNT(*) AS count
FROM ds_tx_twc.dbo.wage_records_lehd
GROUP BY ssn, empr_no, year, quarter 
HAVING COUNT(*) > 1
ORDER BY COUNT(*) DESC
"

dbGetQuery(con, qry)

It turns out that there are no instances of more than one observation within a `ssn`-`empr_no`-`quarter`-`year` combination in this table. Sometimes, you may run into duplication when working with wage tables because employers may refile to amend previously submitted employment records. It is necessary to assess all duplicates to avoid double-counting. Luckily, though, these data were already cleaned to keep only the most recent filing for each employment record.
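
To see what the check would return if duplicates did exist, here is a minimal base R sketch on made-up data. The data frame `toy_wages` and its values are assumptions for illustration only; the logic mirrors the `GROUP BY ... HAVING COUNT(*) > 1` pattern rather than reproducing anything found in the real table.

```r
# Made-up wage records: person "a" has two rows for the same employer, year, and quarter
toy_wages <- data.frame(
    ssn     = c("a", "a", "b", "b"),
    empr_no = c(1, 1, 2, 3),
    year    = c(2015, 2015, 2015, 2015),
    quarter = c(1, 1, 1, 1),
    wage    = c(100, 50, 200, 300)
)

# Count rows within each ssn-empr_no-year-quarter combination
combo_counts <- aggregate(wage ~ ssn + empr_no + year + quarter,
                          data = toy_wages, FUN = length)
names(combo_counts)[names(combo_counts) == "wage"] <- "count"

# Keep only combinations that appear more than once
duplicates <- combo_counts[combo_counts$count > 1, ]
duplicates
```

In this toy example only the combination for person "a" is returned, with a count of 2. An empty result, as with the real table, indicates one record per combination.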

In review, we have been adhering to the following steps to prepare the wage records for the linkage procedure:

1. Create a column called `job_date` that reflects the 1st day of every quarter-calendar year combination [Done!]
2. Assess any potential duplicate records and apply de-duplication process [Not necessary]
3. Keep wages that are greater than zero [To do]

We have already created a table, `wage_record_dated_dedup`, for you in the `tr_tx_2021` database using the code below, which accounts for all three of these steps. Additionally, we limited the wage records to individuals who appear in our cohort table created in the first notebook, `grads15`. This is done to speed up certain data exploration and linkage tasks. The code for creating the `wage_record_dated_dedup` table is as follows:

```
SELECT *, convert(datetime, concat(substring(cast(wage_qtr as varchar), 5, 5)*3-2, '/', '01', '/', substring(cast(wage_qtr as varchar), 1, 4))) as job_date
INTO tr_tx_2021.dbo.wage_record_dated_dedup
FROM ds_tx_twc.dbo.wage_records_lehd 
WHERE wage > 0 and ssn in (
    SELECT DISTINCT(gradid) 
    FROM tr_tx_2021.dbo.grads15
)
```

### Creating `grad_date`
Now that we have successfully created the `job_date` variable and evaluated the wage records for potential duplication, we can follow a similar process for generating `grad_date`. There are a couple of columns you can use from the `grads15` table. One option is a combination of `gradmonth`, which tracks the month of graduation, and `gradyear`, the year of graduation. However, as discussed in the first notebook, `gradyear` is actually the fiscal year of graduation, not the calendar year. Luckily, though, we know that the calendar year for all of these graduates is 2015 due to our original cohort construction.

In [None]:
# see gradmonth and gradyear
qry <- "
SELECT TOP 5 gradmonth, gradyear
FROM tr_tx_2021.dbo.grads15
"
dbGetQuery(con, qry)

In the case of Texas wage records, we had the `quarter` and `year` variables that provided us with a quarter-calendar year combination. As you can see, the graduations file looks somewhat similar with `gradmonth` and `gradyear` columns. To create `grad_date`, we need to manipulate `gradmonth` to correspond to the first day of the proper quarter to ensure that it will align with `job_date`. Here, we will treat Fall semester graduations as corresponding to October 1, the first day of the 4th quarter; Spring semester graduations as corresponding to April 1; and Summer semester graduations as corresponding to July 1. Let's try creating a new column, `new_quarter`, that applies this transformation.

Our end goal is to line up `gradmonth` to correspond to our method we applied to the Texas wage records. For example: If we have `gradmonth = 5`, we would create `grad_date` using the following method:

1. Extract the quarter from `gradmonth` (`new_quarter`). We do this by utilizing the `CEILING` function in SQL. We divide `gradmonth` by 3 and then we apply `CEILING` to **round up** the result. 
2. Multiply the quarter by 3 and subtract 2 to extract the first month for each quarter (`first_month`).
3. Combine the first day of `first_month` (01), `first_month` itself and the corresponding calendar year (2015).
4. Convert combined date into SQL's datetime format.



|`gradmonth`     |calendar year| `new_quarter`| `first_month` | combined date| final result | 
|:-----------:|:---:| :-----------:| :----------:|:-----------:| :----------:|
| 5      | 2015 |2| 4 | 04/01/2015|2015-04-01

Let's translate these steps to code.

In [None]:
# extract quarter from month 
qry <- "
SELECT TOP 5 gradmonth, ceiling(gradmonth/3.0) as new_quarter
    FROM tr_tx_2021.dbo.grads15
"

dbGetQuery(con, qry)

With the `new_quarter` column, we now have the information needed to map degree dates to quarters. To preserve `new_quarter`, we could create a permanent table with this column or add it to `grads15`. Another option, though, is to use a Common Table Expression (CTE), which lets us create intermediate results (here, `new_quarter` for every graduate) and combine them into our final table in one query.

A CTE is introduced with a `with` clause in SQL. `with` enables us to define an intermediate table, without writing it to the database, and then use it to derive our desired result. The code below uses a CTE for step 1 and then carries out steps 2 through 4 in the final `SELECT`.

In [None]:
# see grad date variable
qry <- "
WITH new_table as (
    SELECT *,
    CEILING(gradmonth/CAST(3 AS float)) as new_quarter
    FROM tr_tx_2021.dbo.grads15
)
SELECT top 5 gradmonth, CONVERT(datetime, CONCAT(new_quarter*3 - 2, '/', '01', '/', 2015)) as grad_date
FROM new_table
"
dbGetQuery(con, qry)

To avoid long processing times, we have already created a new table, `grads15_dated`, that incorporates the `grad_date` variable using the code below:
```
with new_table as (
    select *,
    ceiling (gradmonth/CAST(3 AS float)) as new_quarter
    from tr_tx_2021.dbo.grads15
)
select *, convert(datetime, concat((new_quarter*3) - 2, '/', '01', '/', 2015)) as grad_date
into tr_tx_2021.dbo.grads15_dated
from new_table;
```
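
As a check on the arithmetic behind `grad_date`, the mapping from `gradmonth` to the first month of its quarter can be tabulated in base R for all twelve months. This is a sketch on generated values, not on the `grads15` table; the calendar year 2015 is hard-coded in the same way as in the SQL above.

```r
# All possible graduation months
gradmonth <- 1:12

# Step 1: month -> quarter, rounding up
new_quarter <- ceiling(gradmonth / 3)

# Step 2: quarter -> first month of that quarter
first_month <- new_quarter * 3 - 2

# Steps 3 and 4: build the date (calendar year fixed at 2015 for this cohort)
grad_date <- as.Date(sprintf("2015-%02d-01", as.integer(first_month)))

data.frame(gradmonth, new_quarter, first_month, grad_date)
```

The row for `gradmonth = 5` shows `new_quarter = 2`, `first_month = 4`, and a `grad_date` of 2015-04-01, matching the worked example in the table above.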

### Performing the Join

At this point, we have both dated and undated versions of tables at our disposal to find employment history up to *3* years after graduation: 

- `ds_tx_thecb.dbo.graduates` ---> `tr_tx_2021.dbo.grads15` --->  `tr_tx_2021.dbo.grads15_dated`
- `ds_tx_twc.dbo.wage_records_lehd` ---> `tr_tx_2021.dbo.wage_record_dated_dedup`

As mentioned previously, we can link these two dated tables by matching the `ssn` values (hashed SSNs) in the `wage_record_dated_dedup` table to the `gradid` values in the `grads15_dated` table, and then limit the time frame using SQL's date functions.

We will use a `join` statement and add our time constraints to the `where` clause. The time constraint will be implemented by only taking `job_date` values (found in `tr_tx_2021.dbo.wage_record_dated_dedup`) that occur within 14 quarters (including the quarter of graduation) of graduation. We include 14 quarters so that we can calculate full-quarter employment later on in this notebook.

The date-specific function we will use in SQL is `dateadd()`, which allows us to add different time intervals to date variables.
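
To build intuition for the window that this kind of date arithmetic defines, the following base R sketch marks which job dates fall between a graduation date and the date 13 quarters later. Here `seq()` with `by = "quarter"` stands in for `dateadd()`, and the graduation date and job dates are made up for illustration.

```r
# Hypothetical graduation date (first day of a quarter)
grad_date <- as.Date("2015-04-01")

# First day of each of the 14 quarters in the window (quarter of graduation + 13 more)
window_quarters <- seq(grad_date, by = "quarter", length.out = 14)
window_end <- window_quarters[14]

# Made-up job dates: one before graduation, two inside the window, one after it
job_dates <- as.Date(c("2015-01-01", "2015-04-01", "2018-07-01", "2018-10-01"))

# Same condition as in the WHERE clause: grad_date <= job_date <= grad_date + 13 quarters
in_window <- job_dates >= grad_date & job_dates <= window_end

data.frame(job_dates, in_window)
```

With these inputs, `window_end` is 2018-07-01, so only the second and third job dates are kept.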

In [None]:
# link wage and education tables for up to 13 quarters post-graduation and include quarter of graduation
qry <- "
SELECT cohort.*, w.naics, w.year, w.quarter, w.empr_no, w.job_date, w.wage
FROM tr_tx_2021.dbo.grads15_dated cohort
JOIN tr_tx_2021.dbo.wage_record_dated_dedup w
ON cohort.gradid = w.ssn
WHERE w.job_date >= cohort.grad_date AND dateadd(quarter, 13, cohort.grad_date) >= w.job_date
"

# we call this df_wages_undup because the wages data frame does not have any duplicates in terms of individuals-employer-quarter-year
df_wages_undup <- dbGetQuery(con, qry)

head(df_wages_undup)

<font color=orange> <h3> **Checkpoint 2: Time Keeping** </h3> </font>


Adjust the query above to only include wage records in the two years (8 quarters) after graduation, including the quarter of graduation. Return five rows to confirm your results.

In [None]:
# link wage and education tables for two years after graduation
qry <- "
SELECT TOP 5 __
FROM ___
JOIN ___
ON ___
WHERE __
"
dbGetQuery(con, qry)

Uncomment the lines below if you would like to see a hint or a solution.

In [None]:
# checkpoint_2.hint()

In [None]:
# checkpoint_2.solution()

## 6. Saving linked cohort-wages data to a table

We can use the following code to save our `df_wages_undup` data frame as a table within the `tr_tx_2021` database.
```
qry <- " use tr_tx_2021;"
DBI::dbExecute(con, qry)

DBI::dbWriteTable(
    conn = con,
    name = DBI::SQL("dbo.nb_cohort_wages_link"), 
    value = df_wages_undup
    )
```

In [None]:
# test that we can query from the new table
qry <- "
SELECT TOP 5 *
FROM tr_tx_2021.dbo.nb_cohort_wages_link
"
dbGetQuery(con, qry)

## 7. Employment Outcomes

Connecting education data to employment data is only the first part of understanding the outcomes for Texas' bachelor's degree recipients. There are many ways that one could define and evaluate these outcomes. We present a few here. While working through this section, think through the outcomes that would be most relevant for your research question.

We will look at the following outcomes:
- Average quarterly earnings and the total workers employed by quarter (data: TX wages)
    - Distribution by major
- Full quarter employment (data: full quarter TX wages)
    - Distribution by gender
- Employment Patterns (data: quarterly TX wages)

### Average quarterly earnings and number employed by quarter

As mentioned earlier, we plan to focus on the first 12 quarters post-graduation for each individual. However, `df_wages_undup` currently contains employment outcomes in the quarter of graduation, as well as in the 13th quarter post-graduation. To isolate each quarter post-graduation, we will create a new variable, `quarter_number`, which leverages some of R's functionality for working with date variables. After converting `grad_date` and `job_date` to date objects in R, we can calculate the difference in weeks between the two values (using `difftime`, a base R function), divide by 13 since there are roughly 13 weeks in each quarter, and round to the nearest whole number to find the quarter of employment relative to graduation.
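
Because quarters are not all exactly 13 weeks long, it is worth confirming that the rounding still recovers whole quarter numbers. The base R sketch below does this for a hypothetical graduation date of April 1, 2015 and the first day of each of the following 13 quarters; the dates are generated for illustration and are not drawn from the cohort.

```r
# Hypothetical graduation date and the first day of the 14 quarters in the window
grad_date <- as.Date("2015-04-01")
job_dates <- seq(grad_date, by = "quarter", length.out = 14)

# Weeks elapsed since graduation, as a plain number
weeks_elapsed <- as.double(difftime(job_dates, grad_date, units = "weeks"))

# Roughly 13 weeks per quarter, rounded to the nearest whole quarter
quarter_number <- round(weeks_elapsed / 13)

data.frame(job_dates, weeks_elapsed = round(weeks_elapsed, 2), quarter_number)
```

In this example the unrounded values drift slightly above whole numbers (the last one is about 13.04), but rounding returns 0 through 13 as intended.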

In [None]:
# get quarter from graduation
df_wages_undup <- df_wages_undup %>%
    mutate(
        quarter_number = round(as.double(difftime(as.Date(job_date), as.Date(grad_date), units = "weeks")/13), 0)
    )

# see evidence
df_wages_undup %>%
    select(grad_date, job_date, quarter_number) %>%
    head()

With `quarter_number`, we can sum each individual's total earnings by quarter while excluding all observations where `quarter_number` is either 0 or 13.

In [None]:
# ignore quarters 0 and 13
df_wages_undup <- df_wages_undup %>%
    filter(!(quarter_number %in% c(0, 13)))

# find each individual's total wages in each quarter
df_wages_undup %>%
    group_by(gradid, quarter_number) %>%
    summarize(
        total_wages = sum(wage)
    ) %>%
    ungroup() %>%
    head()

Let's save these results to the data frame `quarterly_wages` so we can compute the cohort's average quarterly earnings by quarter.

In [None]:
# save as quarterly_wages
quarterly_wages <- df_wages_undup %>%
    group_by(gradid, quarter_number) %>%
    summarize(
        total_wages = sum(wage),
    ) %>%
    ungroup()

Now that we have `quarterly_wages` to capture each individual's total quarterly earnings, we can compute the cohort average quarterly earnings, as well as the total number of individuals employed, broken down by quarter after graduation.
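
The reason for summing within each person and quarter before averaging can be seen in a small base R sketch. The data frame `toy` is made up: person `p1` holds two jobs in the same quarter and person `p2` holds one.

```r
# Made-up job-level records for a single quarter
toy <- data.frame(
    gradid         = c("p1", "p1", "p2"),
    quarter_number = c(1, 1, 1),
    wage           = c(3000, 2000, 7000)
)

# Step 1: total wages per person per quarter
person_quarter <- aggregate(wage ~ gradid + quarter_number, data = toy, FUN = sum)

# Step 2: average across people
person_mean <- mean(person_quarter$wage)   # 6000: average of 5000 and 7000

# For contrast: averaging job records directly understates earnings per person
job_mean <- mean(toy$wage)                 # 4000: average of 3000, 2000, and 7000

c(person_mean = person_mean, job_mean = job_mean)
```

Averaging the job-level rows directly would treat each job as a separate earner, which is why the notebook builds `quarterly_wages` first.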

In [None]:
# average wages and number of grads with wages by quarter after graduation
avg_and_num <- quarterly_wages %>%
    group_by(quarter_number) %>%
    summarize(
        mean_wage = mean(total_wages),
        n_employed = n_distinct(gradid)
    )

avg_and_num

We can see that the number of graduates employed in Texas is fairly consistent, and that average quarterly earnings rise over time. Keep in mind that `mean_wage` is an average of each individual's total quarterly earnings, so if an individual had multiple earnings sources in a quarter, their wages are summed before averaging. Let's see whether similar trends appear amongst those receiving the most common degrees within the cohort.

#### By Major

Let's see if there are consistent trends relative to the entire cohort of earners for those who received degrees in the five most common fields. Recall that whereas the cohort was saved in the data frame `bachelors` in the first notebook, it is saved as `df_cohort` here.

As a brief recap, we can use the following code to rediscover the 5 most common majors in our cohort. This is equivalent to what we did in `1.Data_Exploration.ipynb`.

In [None]:
# df_cohort: Create a 2 digit CIP program code from the full CIP code in `gradmaj`
df_cohort <- df_cohort %>%
    mutate(
        CIP_Program = substring(gradmaj, 1, 2)
    )

# df_wages_undup: Create a 2 digit CIP program code from the full CIP code in `gradmaj`
df_wages_undup <- df_wages_undup %>%
    mutate(
        CIP_Program = substring(gradmaj, 1, 2)
    )
    
# load CIP crosswalk into R
qry <- "
SELECT *
FROM ds_public_1.dbo.cip_lookup
"
cip_lookup <- dbGetQuery(con, qry)

# only select 2010 columns
cip_lookup <- cip_lookup %>%
    select(ends_with("2010"))

# 5 most common majors
com_majors <- df_cohort %>%
    count(CIP_Program) %>%
    arrange(desc(n)) %>%
    mutate(
        prop = n/sum(n)
    ) %>%
    head(5) %>%
    inner_join(cip_lookup, by = c("CIP_Program" = "CIPCode2010"))


com_majors

Now that we have the most common majors, we can `filter` `df_wages_undup` for only observations for these majors and evaluate earnings and the number of individuals employed within these five majors.

In [None]:
# earnings and number employed for most common majors
# first find quarterly wages for each person while including major
# then find average wages within groups
avg_and_num_major <- df_wages_undup %>%
    filter(CIP_Program %in% com_majors$CIP_Program) %>%
    group_by(gradid, quarter_number, CIP_Program) %>%
    summarize(
        total_wages = sum(wage)
    ) %>%
    ungroup() %>%
    group_by(CIP_Program, quarter_number) %>%
    summarize(
        mean_wage = mean(total_wages),
        n_employed = n_distinct(gradid)
    ) %>%
    inner_join(cip_lookup, by = c("CIP_Program" = "CIPCode2010"))


avg_and_num_major

Notice the different salary growth experiences, on average, between graduates of these majors. However, these two employment measures do not include any measure of employment stability, which is often an important job aspect, especially for recent graduates.

### Full-Quarter Employment

There are many ways to define stable employment. Sometimes, it may be useful to define stable employment by a consecutive number of quarters worked with the same employer. Other times, stable employment may assume an alternative definition. Here, we will define stable employment as full-quarter employment. Full-quarter employment in quarter *t* is indicated by a presence of wages with the same employer in quarters *t-1*, *t*, and *t+1*.

**Example: Data Needed to Assess Full-Quarter (FQ) Employment for Person1/EmployerA in Each of the 4 Quarters Post-Graduation**

|Person/Employer Combination| Degree |FQ YearQtr |t-1|t|t+1|
|---|---|---|---|---|---|
|_Person 1/Employer A_ |_2013 Q3_ |_2013 Q4_ |<font color=green>2013 Q3</font> |**2013 Q4** |2014 Q1 |
|_Person 1/Employer A_ |_2013 Q3_ |_2014 Q1_ |2013 Q4 |**2014 Q1** |2014 Q2 |
|_Person 1/Employer A_ |_2013 Q3_ |_2014 Q2_ |2014 Q1 |**2014 Q2** |2014 Q3 |
|_Person 1/Employer A_ |_2013 Q3_ |_2014 Q3_ |2014 Q2 |**2014 Q3** |<font color=green>2014 Q4</font> |

As can be seen in the table above, calculating full-quarter employment for the same four-quarter span requires two additional quarters of wage information. This requires us to extend our data frame to include employment in the quarter of graduation as well as one additional quarter after our final quarter of interest. In this example, it means including employment in the quarter prior to quarter 1 and the quarter after quarter 12. Now, it should be clear why we initially brought in 14 quarters' worth of data, as we need all 14 quarters to calculate full-quarter employment within the 12 on which we want to focus.
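The full-quarter rule can also be sketched in R on a small made-up table. Everything here is hypothetical: `toy_jobs` and its running quarter index `qtr` (0 = graduation quarter) stand in for real dates, and the shifted-key approach is just one way to express the t-1/t/t+1 condition.

```r
library(dplyr)

# hypothetical toy data: one person, two employers
toy_jobs <- tibble(
    gradid  = c(1, 1, 1, 1, 1, 1),
    empr_no = c("A", "A", "A", "A", "B", "B"),
    qtr     = c(0, 1, 2, 4, 4, 5),
    wage    = c(2000, 5000, 5200, 5100, 800, 900)
)

# shift each record's quarter so it lines up with its neighboring quarter
prev_q <- toy_jobs %>% transmute(gradid, empr_no, qtr = qtr + 1)  # a record exists in t-1
next_q <- toy_jobs %>% transmute(gradid, empr_no, qtr = qtr - 1)  # a record exists in t+1

# keep quarter t only if the same person/employer pair also appears in t-1 and t+1
toy_full_q <- toy_jobs %>%
    semi_join(prev_q, by = c("gradid", "empr_no", "qtr")) %>%
    semi_join(next_q, by = c("gradid", "empr_no", "qtr"))

toy_full_q
```

Only quarter 1 with employer A qualifies. Quarter 2 lacks a record in quarter 3, quarter 4 with employer A lacks records on both sides, and employer B never has three consecutive quarters.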

In practice, we can isolate instances of full-quarter employment by creating three copies of our `nb_cohort_wages_link` table and matching them on `empr_no` and `gradid` values, while requiring `job_date` differences that correspond to quarters t-1, t, and t+1.

In [None]:
# get full quarter instances
qry <- "
SELECT b.gradid, b.empr_no, b.wage, b.job_date, b.grad_date, b.gradgen, b.gradmaj
FROM  tr_tx_2021.dbo.nb_cohort_wages_link a,  tr_tx_2021.dbo.nb_cohort_wages_link b,  tr_tx_2021.dbo.nb_cohort_wages_link c
WHERE a.gradid = b.gradid AND a.empr_no = b.empr_no AND a.job_date = dateadd(month, 3, b.job_date)
AND a.gradid = c.gradid and a.empr_no = c.empr_no AND b.job_date = dateadd(month, 3, c.job_date)
"
full_q_wages <- dbGetQuery(con, qry)

head(full_q_wages)

Let's check how many individuals experienced at least one quarter of stable employment, as well as the number of employers by which members of the cohort experienced stable employment and their average wages.

In [None]:
# see number with at least one quarter of full quarter, number of unique employers, and what are their average wages
full_q_stats <- full_q_wages %>%
    summarize(
        num_individuals = n_distinct(gradid),
        num_employers = n_distinct(empr_no),
        avg_wage = mean(wage)
    )

full_q_stats

For reference, we can compare this to the number of individuals in the cohort that were employed in at least one quarter in Texas.

In [None]:
# see number with at least one quarter, number of unique employers, and what are their average quarterly wages (per employer)
wage_stats <- df_wages_undup %>%
    summarize(
        num_individuals = n_distinct(gradid),
        num_employers = n_distinct(empr_no),
        avg_wage = mean(wage)
    )

wage_stats

Interestingly enough, it appears as though most members of the cohort that found any employment in Texas within this time frame experienced at least one quarter of stable employment. But is there a difference by gender? In our next subsection, we will try to answer this question.

#### By Gender

At the end of the first data exploration notebook, after exploring the most common majors within the cohort, we also analyzed the gender breakdown of the cohort. Here, we will compare the overall gender breakdown to that of those who experienced at least one quarter of full-quarter employment.

Recall the code from the first notebook in the code cell below:   

In [None]:
# gender breakdown
df_gender <- df_cohort %>%
    count(gradgen) %>%
    arrange(desc(n)) 

# see df_gender
df_gender

Now we can apply this code concept to the `full_q_wages` data frame and calculate the number and proportion of individuals by gender experiencing at least one quarter of full quarter employment, as well as their average wages.

In [None]:
# see number with at least one quarter of full quarter by gender, what are their average wages
full_q_stats_gender <- full_q_wages %>%
    group_by(gradgen) %>%
    summarize(
        num_individuals = n_distinct(gradid),
        avg_wage = mean(wage)
    ) %>%
    inner_join(df_gender, by='gradgen') %>%
    mutate(
        prop = num_individuals/n
    )

full_q_stats_gender

Let's do the same for individuals who were employed in at least one quarter in Texas.

In [None]:
# see number with at least one quarter by gender, and what are their average quarterly wages (per employer)
wage_stats_gender <- df_wages_undup %>%
    group_by(gradgen) %>%
    summarize(
        num_individuals = n_distinct(gradid),
        avg_wage = mean(wage)
    ) %>%
    inner_join(df_gender, by = 'gradgen') %>%
    mutate(
        prop = num_individuals/n
    )

wage_stats_gender

We can see that roughly the same proportional differences exist across the two measures, and that there appears to be a major discrepancy in average wages across genders. What do you think are some potential reasons for this?

<font color=orange> <h3> **Checkpoint 3: Stable Employment by Major** </h3> </font>

Recreate the table above (besides `prop`) for the five most common majors. Do the results surprise you?

In [None]:
# replace ____
full_q_wages %>%
    group_by(____) %>%
    summarize(
        num_individuals = n_distinct(gradid),
        avg_wage = mean(wage)
    ) %>%
    arrange(desc(num_individuals)) %>%
    mutate(CIP_Program = substring(gradmaj, 1, 2)) %>%
    inner_join(cip_lookup, by = c("CIP_Program" = "CIPCode2010")) %>%
    head(5)


Uncomment the lines below to look at a hint or a solution.

In [None]:
# checkpoint_3.hint()

In [None]:
# checkpoint_3.solution()

### Employment Patterns

Employment outcomes can be defined by more than just earnings. While we looked at stability, we can also look more comprehensively at employment patterns. How quickly do graduates find employment? How many spells of nonemployment do graduates experience, and for how long? These are important factors to consider as graduates enter and navigate the labor market, particularly as we consider differences across major and demographic groupings.

At the end of this section, we hope to have found the most common employment patterns for everyone in the original cohort. To start, let's get a sense of the number of individuals who are missing from `quarterly_wages`, which contains quarterly earnings entries for members of our cohort employed in Texas.

In [None]:
# see size of original cohort
df_cohort %>%
    summarize(
        num_inds = n_distinct(gradid)
    )

In [None]:
# see number of people with employment outcomes in TX
quarterly_wages %>%
    summarize(n_distinct(gradid))

Before we manipulate any existing data frames, let's confirm that if we join `quarterly_wages` to `df_cohort`, the number of individuals for whom `quarter_number` (or any other variable found only in `quarterly_wages`) is `NA` equals the difference in the number of individuals between `df_cohort` and `quarterly_wages`.

In [None]:
# see that the number of people with an NA quarter equals the number who didn't show up in quarterly_wages
df_cohort %>%
    left_join(quarterly_wages, by = "gradid") %>%
    filter(is.na(quarter_number)) %>%
    summarize(n_distinct(gradid))
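As a cross-check on the same idea, the sketch below uses `anti_join` on made-up data to count cohort members with no wage records. The `toy_cohort` and `toy_quarterly` tables are hypothetical, and the comparison assumes every `gradid` in the wage table also appears in the cohort.

```r
library(dplyr)

# hypothetical toy cohort and wage records
toy_cohort <- tibble(gradid = c(1, 2, 3, 4))
toy_quarterly <- tibble(
    gradid = c(1, 1, 3),
    quarter_number = c(1, 2, 1),
    total_wages = c(4000, 4200, 3900)
)

# cohort members with no matching wage record at all
toy_missing <- toy_cohort %>%
    anti_join(toy_quarterly, by = "gradid")

# this count should equal the difference in distinct individuals
n_missing <- nrow(toy_missing)
n_diff <- n_distinct(toy_cohort$gradid) - n_distinct(toy_quarterly$gradid)

n_missing == n_diff
```

If the two numbers disagree on real data, that would suggest wage records for individuals outside the cohort, or duplicate identifiers in the cohort table.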

Now that we have confirmed that our join should work as intended, we will join `quarterly_wages` to `df_cohort`. After doing so, we will set all instances where `quarter_number` is `NA` equal to 1, so that we will eventually be able to have 12 observations for each individual, one for each potential quarter of employment.

In [None]:
# set all where quarter is na equal to one so we can use complete
full_wages <- df_cohort %>%
    left_join(quarterly_wages, by = "gradid") %>%
    mutate(
        quarter_number = ifelse(is.na(quarter_number), 1, quarter_number)
    )

# see potential quarter numbers
full_wages %>%
    distinct(quarter_number) %>%
    arrange(quarter_number)

Now that we have all potential `gradid` (SSN) values, as well as instances of all desired `quarter_number` values, we can leverage the tidyverse's `complete` function, which will add additional rows for any combinations of `gradid`/`quarter_number` that do not currently exist in `full_wages`. If the combination does not appear in `full_wages`, the resulting `total_wages` value will be `NA`, signifying the individual was not employed in this quarter. As verification, the number of rows should equal the number of individuals multiplied by 12.

In [None]:
# complete file
completed <- full_wages %>%
    complete(gradid, quarter_number, fill=list(total_wages=NA))

# see that n should equal n_inds multiplied by 12
completed %>%
    summarize(
        n = n(),
        n_inds = n_distinct(gradid),
        test = n_inds*12 == n
    )

In [None]:
# see completed
head(completed)
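To see what `complete` does on a smaller scale, here is a sketch with a hypothetical three-person, three-quarter table (`toy_full`). Note that `complete` only expands over values that already appear somewhere in the data, which is why person 3, who has no wages, is assigned quarter 1 with an `NA` wage, mirroring the step above.

```r
library(dplyr)
library(tidyr)

# hypothetical toy data: person 3 has no wage records
toy_full <- tibble(
    gradid = c(1, 1, 2, 3),
    quarter_number = c(1, 3, 2, 1),
    total_wages = c(4000, 4100, 3500, NA)
)

# add a row for every gradid/quarter_number combination not yet present
toy_completed <- toy_full %>%
    complete(gradid, quarter_number)

toy_completed

# verification: rows should equal individuals times quarters (3 * 3)
nrow(toy_completed) == n_distinct(toy_completed$gradid) * 3
```

The newly added rows carry `NA` in `total_wages`, which is what later gets interpreted as "not employed in this quarter."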

Now that we have created `completed`, we just need to aggregate and manipulate the data frame so that each column is a quarter, and each observation is an individual, with the corresponding columns indicating whether the individual was employed in the given quarter. To start, let's create a variable `wage_ind`, which will be "yes" if the individual was employed in the quarter, and "no" otherwise. Additionally, for ease in column manipulation, we will change each `quarter_number` value from 1, 2, 3,..., 12 to Q1, Q2, Q3,..., Q12 and call this variable `quarter`.

In [None]:
# create wage_ind and quarter variables
patterns <- completed %>%
    mutate(
        wage_ind = ifelse(is.na(total_wages), "no", "yes"),
        quarter = paste("Q", quarter_number, sep="")
    )

head(patterns)

Now, we need to figure out how to "pivot" the data frame so that each column is a value of `quarter`, with `wage_ind` values for the `gradid` values. To do so, we will use `pivot_wider`, which allows us to take a tidy data frame (one observation per row) and "widen" it so that each column becomes values from what was previously a single column (`quarter`) and the rows are occupied by those from a corresponding column (`wage_ind`). 

After manipulating the data frame, we can aggregate by the quarter columns and count the number of observations within each of these patterns to discover the most common employment patterns for the cohort.

In [None]:
# find most common employment patterns
patterns <- patterns %>%
    select(gradid, quarter, wage_ind) %>%
    pivot_wider(names_from = quarter, values_from = wage_ind) %>%
    group_by(Q1, Q2, Q3, Q4, Q5, Q6, Q7, Q8, Q9, Q10, Q11, Q12) %>%
    summarize(cnt = n_distinct(gradid)) %>%
    arrange(desc(cnt)) %>%
    ungroup() %>%
    mutate(
        prop = percent(cnt/sum(cnt), .01)
    ) 

head(patterns)

From this data frame, we can see that about half of the cohort received non-zero wages in Texas for all of their first 12 quarters after graduation. Additionally, we can assess some other employment patterns within the cohort to get a better sense of their diverging experiences.
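The pivot-then-count logic can be illustrated with a hypothetical two-quarter table (`toy_long`). This sketch uses `count()` as a shorthand for the `group_by`/`summarize` combination, which works here because each individual occupies exactly one row after pivoting.

```r
library(dplyr)
library(tidyr)

# hypothetical long-format data: one row per person per quarter
toy_long <- tibble(
    gradid = c(1, 1, 2, 2, 3, 3),
    quarter = c("Q1", "Q2", "Q1", "Q2", "Q1", "Q2"),
    wage_ind = c("yes", "yes", "yes", "no", "yes", "yes")
)

# widen to one row per person, then count each yes/no pattern
toy_patterns <- toy_long %>%
    pivot_wider(names_from = quarter, values_from = wage_ind) %>%
    count(Q1, Q2, name = "cnt", sort = TRUE) %>%
    mutate(prop = cnt / sum(cnt))

toy_patterns
```

Two of the three toy individuals share the "yes, yes" pattern, so it appears first with a proportion of two thirds.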

## 8. Save as csvs
Before we finish the notebook, let's save your work as .csv files so that they can be referenced in the Data Visualization notebook.

In [None]:
# Save dataframes to CSV to use in later notebook

# average quarterly earnings and number employed by quarter
write_csv(avg_and_num, sprintf("U:\\%s\\TX Training\\Results\\avg_and_num.csv", username))

# average quarterly earnings and number employed by quarter (common majors)
write_csv(avg_and_num_major, sprintf("U:\\%s\\TX Training\\Results\\avg_and_num_major.csv", username))

# full quarter info
write_csv(full_q_stats, sprintf("U:\\%s\\TX Training\\Results\\full_q_stats.csv", username))

# full quarter info by gender
write_csv(full_q_stats_gender, sprintf("U:\\%s\\TX Training\\Results\\full_q_stats_gender.csv", username))

# any quarter info
write_csv(wage_stats, sprintf("U:\\%s\\TX Training\\Results\\wage_stats.csv", username))

# any quarter info by gender
write_csv(wage_stats_gender, sprintf("U:\\%s\\TX Training\\Results\\wage_stats_gender.csv", username))

# patterns
write_csv(patterns, sprintf("U:\\%s\\TX Training\\Results\\patterns.csv", username))

## References

Currie, David, Edelmann, Joshua, Feder, Benjamin, & Barrett, Nathan. (2022, April 1). Data Exploration and Linkage using Tennessee Unemployment Insurance Data. Zenodo. https://doi.org/10.5281/zenodo.6407258

Carnevale, A. "Postsecondary Education and Training As We Know It is Not Enough: Why We Need to Leaven Postsecondary Strategy with More Attention to Employment Policy, Social Policy, and Career and Technical Education in High School." Paper prepared for the Georgetown University and Urban Institute Conference on Reducing Poverty and Economic Distress after ARRA. (2010).

King, C., Tang, Y., Smith, T., Schroeder, D., Barnow, B., "Returns from Investments in Workforce Services: Estimations for Participants, Taxpayers, and Society", Ray Marshall Center, LBJ School of Public Affairs, University of Texas at Austin, (2008).

Murdock, S., White, S., Hoque, N., Pecotte, B., You, X., and Balkan, J., "The New Texas Challenge: Population Change and the Future of Texas," College Station: Texas A&M University Press, (2003).

Roksa, J., and Levey, T., "What Can You Do with That Degree: College Major and Occupational Status of College Graduates over Time," Social Forces 89.2 (2010): 389-416.

Schudde, L., Bernell, K., "Educational Attainment and Nonwage Labor Market Returns in the United States," AERA Open 5.3 (2019): 1-18.




## Appendix

### Dominant Earnings

Another potential earnings outcome to measure is dominant earnings, which only includes the record of the highest earnings per quarter for each individual. We will first arrange employment for each individual/quarter combination by descending wages before taking the highest earnings per `gradid`-`quarter_number` combination. We will save this resulting data frame as `df_dom_wages`.

In [None]:
# identify dominant wages in each quarter
df_dom_wages <- df_wages_undup %>%
    arrange(gradid, quarter_number, desc(wage)) %>%
    distinct(gradid, quarter_number, .keep_all=TRUE)

head(df_dom_wages)

In [None]:
# confirm we have one entry for each gradid/quarter_number combination
df_dom_wages %>%
    count(gradid, quarter_number) %>%
    arrange(desc(n)) %>%
    head()
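The `arrange` plus `distinct` idiom relies on `distinct` keeping the first row it encounters for each key. A minimal sketch on hypothetical data (`toy_wages`) shows the effect:

```r
library(dplyr)

# hypothetical toy data: person 1 has two employers in quarter 1
toy_wages <- tibble(
    gradid = c(1, 1, 2),
    quarter_number = c(1, 1, 1),
    empr_no = c("A", "B", "C"),
    wage = c(1000, 3000, 5000)
)

# sort so the highest wage comes first within each person/quarter,
# then keep only that first row per person/quarter
toy_dom <- toy_wages %>%
    arrange(gradid, quarter_number, desc(wage)) %>%
    distinct(gradid, quarter_number, .keep_all = TRUE)

toy_dom
```

For person 1, the record with employer B (wage 3000) is retained and the lower-paying employer A record is dropped. If two jobs tie for the highest wage, this approach keeps whichever one happens to be sorted first.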

Now that we have isolated dominant earnings in `df_dom_wages`, we can recycle the same code that we applied to `df_wages_undup`.

In [None]:
# average dominant wages and number of grads with wages by quarter after graduation
avg_and_num_dom <- df_dom_wages %>%
    group_by(quarter_number) %>%
    summarize(
        mean_wage = mean(wage),
        n_employed = n_distinct(gradid)
    )

avg_and_num_dom