# Linkage Notebook

Brian Kim, Maryah Garner, Rukhshan Mian, Ekaterina Levitskaya.

This notebook provides short helper code for joining the datasets available in the class.

# R Setup

As always, start by importing the required libraries, as well as creating the connection to the database.

In [None]:
# Switching off warnings
options(warn = -1)

# Database interaction imports
suppressMessages(library(odbc))

# for data manipulation/visualization
suppressMessages(library(tidyverse))

# scaling data, calculating percentages, overriding default graphing
suppressMessages(library(scales))

# add weights to data
suppressMessages(library(survey))

# Switching warnings back on
options(warn = 0)

In [None]:
# Connect to the database
con <- DBI::dbConnect(odbc::odbc(),
                     Driver = "SQL Server",
                     Server = "msssql01.c7bdq4o2yhxo.us-gov-west-1.rds.amazonaws.com",
                     Trusted_Connection = "True")

![Diagram](images/Database_Diagram_2021_version.jpg)

# Before we start

When performing a join with multiple tables, remember:
- **use LEFT JOIN** if you would like to keep everyone in your original table (cohort); 
- **use INNER JOIN** if you only want to keep individuals found in all joined tables;
- **only pull in the columns of interest**, so as not to overload memory (note that the SED and SDR data sources have hundreds of columns). The examples in this notebook select different variables; feel free to substitute your own variables of interest.
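
To make the difference between the two join types concrete, here is a minimal sketch on toy data frames using `dplyr`. The IDs and field values are made up for illustration and are not drawn from the class database; only the column names `drf_id`, `phdfy`, and `bafield` follow the tables used later in this notebook.

```r
library(dplyr)

# Toy data (hypothetical values): a three-person cohort and a second
# table that covers only two of them
cohort <- tibble(drf_id = c("A1", "A2", "A3"), phdfy = "2015")
bacc   <- tibble(drf_id = c("A1", "A3"), bafield = c("Physics", "Biology"))

# LEFT JOIN keeps every cohort member; unmatched rows get NA in bafield
left_result <- left_join(cohort, bacc, by = "drf_id")

# INNER JOIN keeps only the members found in both tables
inner_result <- inner_join(cohort, bacc, by = "drf_id")

nrow(left_result)   # 3
nrow(inner_result)  # 2
```

The same logic applies to the SQL queries below: the row count after a LEFT JOIN never drops below the size of the original cohort, while an INNER JOIN can shrink it.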

# SED <> SED (Previous/Other Degrees)

If you are interested in SED variables related to the previous or other degrees of doctoral students, they are stored in separate tables in the database:

- Baccalaureate degree - `nsf_sed_bacc`
- Master's degree - `nsf_sed_mast`
- Associate's degree (only available for the 2017 cohort) - `nsf_sed_assoc`
- Professional doctoral degree (only available for the 2017 cohort) - `nsf_sed_doct`

Get the first baccalaureate field (**bafield**) for the 2015 cohort

In [None]:
qry <- " 
select sed.drf_id, bafield
from ds_nsf_ncses.dbo.nsf_sed sed 
INNER JOIN ds_nsf_ncses.dbo.nsf_sed_bacc bacc
on sed.drf_id = bacc.drf_id
where sed.phdfy = '2015'"

cohort_2015_bacc_field <- dbGetQuery(con, qry)

Get the most recent master's degree institution HBCU indicator (**mahbcu**) for the 2015 cohort

In [None]:
qry <- " 
select sed.drf_id, mahbcu
from ds_nsf_ncses.dbo.nsf_sed sed 
INNER JOIN ds_nsf_ncses.dbo.nsf_sed_mast mast
on sed.drf_id = mast.drf_id
where sed.phdfy = '2015'"

cohort_2015_mast_hbcu <- dbGetQuery(con, qry)

Get the first associate's degree year (**aayear**) - data available only for the 2017 cohort

In [None]:
qry <- " 
select sed.drf_id, aayear
from ds_nsf_ncses.dbo.nsf_sed sed 
INNER JOIN ds_nsf_ncses.dbo.nsf_sed_assoc assoc
on sed.drf_id = assoc.drf_id
where sed.phdfy = '2017'"

cohort_2017_assoc_year <- dbGetQuery(con, qry)

Get the professional doctorate institution (**profinst**) - data available only for the 2017 cohort

In [None]:
qry <- " 
select sed.drf_id, profinst
from ds_nsf_ncses.dbo.nsf_sed sed 
INNER JOIN ds_nsf_ncses.dbo.nsf_sed_doct doct
on sed.drf_id = doct.drf_id
where sed.phdfy = '2017'"

cohort_2017_prof_inst <- dbGetQuery(con, qry)

# SED <> SDR

SDR contains information on the employment of doctoral recipients and can give insight into career pathways of doctoral students from SED.

It is possible to track individuals longitudinally, from their graduation in SED to their employment status in SDR 2015, in SDR 2017, and in SDR 2019 (if they remain eligible for the survey).

Note that in order to check the sample size, we will use INNER JOIN in this query. SDR is a sample, not a census like SED, and it represents around 10% of the SED population. An individual who is missing from SDR was not included in that particular SDR cycle. For example, the SDR 2019 target population includes individuals who meet the following criteria:
- Earned an SEH research doctorate degree from a U.S. academic institution prior to 1 July 2017
- Are not institutionalized or terminally ill on 1 February 2019
- Are less than 76 years of age as of 1 February 2019 

In this query we show an example of joining multiple tables: SED 2013 cohort with SDR 2015, SDR 2017, and SDR 2019. We compare doctoral students' salary in different years of their employment (**salary** variable). 

Note that each SDR table has its own crosswalk. For example, in order to join the SED 2013 cohort with SDR 2015, we first need to join with the SDR 2015 crosswalk (the `sdr_drfid_2015` table, which contains only the individual identifiers for SED and SDR), and then join with the actual SDR 2015 table (the `nsf_sdr_2015` table, which contains all SDR variables). We then repeat the same steps for SDR 2017 and SDR 2019.

In [None]:
qry <- "
select sed.drf_id, sdr_2015.salary as salary_2015, sdr_2017.salary as salary_2017, sdr_2019.salary as salary_2019
from ds_nsf_ncses.dbo.nsf_sed sed 
INNER JOIN ds_nsf_ncses.dbo.sdr_drfid_2015 sdr_2015_xwalk
on sed.drf_id = sdr_2015_xwalk.drf_id
INNER JOIN ds_nsf_ncses.dbo.nsf_sdr_2015 sdr_2015
on sdr_2015_xwalk.refid = sdr_2015.refid
INNER JOIN ds_nsf_ncses.dbo.sdr_drfid_2017 sdr_2017_xwalk
on sed.drf_id = sdr_2017_xwalk.drf_id
INNER JOIN ds_nsf_ncses.dbo.nsf_sdr_2017 sdr_2017
on sdr_2017_xwalk.refid = sdr_2017.refid
INNER JOIN ds_nsf_ncses.dbo.sdr_drfid_2019 sdr_2019_xwalk
on sed.drf_id = sdr_2019_xwalk.drf_id
INNER JOIN ds_nsf_ncses.dbo.sdr_2019 sdr_2019
on sdr_2019_xwalk.refid = sdr_2019.refid
where sed.phdfy = '2013'
"

longitudinal_cohort_2013 <- dbGetQuery(con, qry)

In [None]:
# Check the sample size - the number of unique DRF IDs matched between all tables (SDR 2015, SDR 2017, and SDR 2019)
length(unique(longitudinal_cohort_2013$drf_id))
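
Once the three salary columns sit side by side, one way to compare them is to reshape the table to long format and summarise by survey year. The sketch below uses a toy stand-in for `longitudinal_cohort_2013` with made-up salaries, and the names `toy_longitudinal` and `salary_by_year` are hypothetical. It is also unweighted: because SDR is a sample, population-level estimates would need the survey weights.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for longitudinal_cohort_2013 (salary values are made up)
toy_longitudinal <- tibble(
  drf_id      = c("A1", "A2", "A3"),
  salary_2015 = c(60000, 72000, 55000),
  salary_2017 = c(66000, 80000, 58000),
  salary_2019 = c(75000, 91000, 64000)
)

# Reshape to one row per person and survey year, then summarise by year
salary_by_year <- toy_longitudinal %>%
  pivot_longer(cols = starts_with("salary_"),
               names_to = "sdr_year",
               names_prefix = "salary_",
               values_to = "salary") %>%
  group_by(sdr_year) %>%
  summarise(n = n_distinct(drf_id),
            median_salary = median(salary, na.rm = TRUE))

salary_by_year
```

Replacing `toy_longitudinal` with the query result should give the same kind of table for the matched 2013 cohort, assuming the column aliases from the query above.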

# SED <> UMETRICS Semester

UMETRICS is an administrative dataset which contains award transaction data and which can give further insight into funding patterns of doctoral students.

If you would like to check the sample size after the join, use INNER JOIN:

In [None]:
qry <- " 
select sed.drf_id, semester
from ds_nsf_ncses.dbo.nsf_sed sed 
INNER JOIN tr_ncses_2021.dbo.umetrics_inst_xwalk inst_xwalk
on sed.phdinst = inst_xwalk.phdinst
INNER JOIN tr_ncses_2021.dbo.sed_umetrics_xwalk indiv_xwalk
on indiv_xwalk.drf_id = sed.drf_id
INNER JOIN ds_iris_umetrics.dbo.semester sem
on sem.emp_number = indiv_xwalk.emp_number
where sed.phdfy = '2015'"

sed_umetrics_inner_join <- dbGetQuery(con, qry)

In [None]:
# Check the number of individuals with the inner join
# We only included individuals who have funding
length(unique(sed_umetrics_inner_join$drf_id))

If you would like to have a comparison group for your research project (e.g. individuals who did not receive funding), use LEFT JOIN. This keeps all the SED individuals from UMETRICS universities (the left table), whether or not they appear in UMETRICS.

We will use an institutional crosswalk between UMETRICS institution id and SED institution id (`umetrics_inst_xwalk`), in order to get all SED individuals from UMETRICS institutions. We will then use the crosswalk with individuals (`sed_umetrics_xwalk`).

In [None]:
qry <- " 
select sed.drf_id, semester
from ds_nsf_ncses.dbo.nsf_sed sed 
INNER JOIN tr_ncses_2021.dbo.umetrics_inst_xwalk inst_xwalk
on sed.phdinst = inst_xwalk.phdinst
LEFT JOIN tr_ncses_2021.dbo.sed_umetrics_xwalk indiv_xwalk
on indiv_xwalk.drf_id = sed.drf_id
LEFT JOIN ds_iris_umetrics.dbo.semester sem
on sem.emp_number = indiv_xwalk.emp_number
where sed.phdfy = '2015'"

sed_umetrics_left_join <- dbGetQuery(con, qry)

In [None]:
# Check the number of individuals with the left join - this is everyone from the 2015 SED cohort at UMETRICS institutions
length(unique(sed_umetrics_left_join$drf_id))

In [None]:
# Those who don't have funding will have NA in the semester variable
sed_umetrics_left_join %>%
    filter(is.na(semester)) %>% head
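
To turn the left-join result into a comparison group, one option is to collapse it to one row per individual with a funding flag. This is a sketch on toy data: the `semester` values and the names `toy_left_join`, `funding_flag`, and `funded` are hypothetical, and we assume, as above, that a missing `semester` means no match in UMETRICS.

```r
library(dplyr)

# Toy stand-in for sed_umetrics_left_join (values are made up)
toy_left_join <- tibble(
  drf_id   = c("A1", "A1", "A2", "A3"),
  semester = c("2014Fall", "2015Spring", NA, "2014Fall")
)

# One row per individual: funded is TRUE if any semester record matched
funding_flag <- toy_left_join %>%
  group_by(drf_id) %>%
  summarise(funded = any(!is.na(semester)))

# Size of the funded group and the comparison group
count(funding_flag, funded)
```

Grouping by `drf_id` first matters because a funded individual can have several semester rows, so counting rows directly would overstate the funded group.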

# SED <> UMETRICS Employee

If you link SED with the UMETRICS Employee table, you can find out whether a given employee was full-time (**fte_status**) or what their occupation was (**umetrics_occupational_class**).

In [None]:
# Get the occupation of employees
qry <- " 
select sed.drf_id, umetrics_occupational_class
from ds_nsf_ncses.dbo.nsf_sed sed 
INNER JOIN tr_ncses_2021.dbo.umetrics_inst_xwalk inst_xwalk
on sed.phdinst = inst_xwalk.phdinst
LEFT JOIN tr_ncses_2021.dbo.sed_umetrics_xwalk indiv_xwalk
on indiv_xwalk.drf_id = sed.drf_id
LEFT JOIN ds_iris_umetrics.dbo.core_employee employee
on employee.emp_number = indiv_xwalk.emp_number
where sed.phdfy = '2015'"

# The query above uses LEFT JOIN, so the result is named accordingly
sed_umetrics_employee_left_join <- dbGetQuery(con, qry)

# SED <> UMETRICS Employee <> UMETRICS Award

If you link with the UMETRICS Award table, you will be able to bring in the variables related to awards, such as **award_title**. 

Note that the unit of observation in the UMETRICS Award table is the combination of grant, institution, and a specific transaction period (**period_start_date** and **period_end_date**).

In this query we will show an example of getting an award title. We will use INNER JOIN to only get individuals who are found in the UMETRICS dataset.

In [None]:
# Get the award title
qry <- " 
select sed.drf_id, award.period_start_date, award.period_end_date, award_title
from ds_nsf_ncses.dbo.nsf_sed sed 
INNER JOIN tr_ncses_2021.dbo.umetrics_inst_xwalk inst_xwalk
on sed.phdinst = inst_xwalk.phdinst
INNER JOIN tr_ncses_2021.dbo.sed_umetrics_xwalk indiv_xwalk
on indiv_xwalk.drf_id = sed.drf_id
INNER JOIN ds_iris_umetrics.dbo.core_employee employee
on employee.emp_number = indiv_xwalk.emp_number
INNER JOIN ds_iris_umetrics.dbo.core_award award
on award.unique_award_number = employee.unique_award_number and award.institution_id = employee.institution_id 
and award.cfda = employee.cfda and award.recipient_account_number = employee.recipient_account_number 
and award.period_start_date = employee.period_start_date and award.period_end_date = employee.period_end_date 
where sed.phdfy = '2015'"

sed_umetrics_employee_award_inner_join <- dbGetQuery(con, qry)

In [None]:
head(sed_umetrics_employee_award_inner_join)

Note that there are duplicate award titles, because the unit of observation in the UMETRICS Award table is the transaction period. In this case, we are only interested in the unique award titles per individual, so we can drop the transaction dates and then remove duplicate rows.

In [None]:
# Remove columns with the transaction dates
sed_umetrics_employee_award_inner_join <- sed_umetrics_employee_award_inner_join %>%
                                            select(-c(period_start_date,period_end_date))

In [None]:
# Drop duplicates
sed_umetrics_employee_award_inner_join <- sed_umetrics_employee_award_inner_join %>%
                                                distinct()

# UMETRICS Employee <> Federal RePORTER

Federal RePORTER data provides an abstract for a grant. Note that we are using a LEFT JOIN from the UMETRICS Employee table, to get all matched individuals in the UMETRICS dataset, even if they didn't receive an NIH grant (those individuals will have missing values in the abstract).

There are three tables with grant abstracts:
- NIH grants - `umetrics_nih_grants`
- NSF grants - `umetrics_nsf_grants`
- USDA grants - `umetrics_usda_grants`

The example below uses NIH grants.

In [None]:
# Get the grant abstract
qry <- " 
select sed.drf_id, nih.abstract as nih_abstract
from ds_nsf_ncses.dbo.nsf_sed sed 
INNER JOIN tr_ncses_2021.dbo.umetrics_inst_xwalk inst_xwalk
on sed.phdinst = inst_xwalk.phdinst
INNER JOIN tr_ncses_2021.dbo.sed_umetrics_xwalk indiv_xwalk
on indiv_xwalk.drf_id = sed.drf_id
INNER JOIN ds_iris_umetrics.dbo.core_employee employee
on employee.emp_number = indiv_xwalk.emp_number
LEFT JOIN ds_iris_umetrics.dbo.umetrics_nih_grants nih
on employee.unique_award_number = nih.unique_award_number and employee.institution_id = nih.institution_id
where sed.phdfy = '2015'"

sed_umetrics_employee_nih_abstract <- dbGetQuery(con, qry)

In [None]:
head(sed_umetrics_employee_nih_abstract, 1)

# SED <> HERD

HERD data provides institutional characteristics, such as a medical school flag (**med_sch_flag**) or total R&D expenditures (**total_rd**).

Note that the unit of observation is year and institution - e.g. each institution has an entry per year between 2010 and 2017. 

In some cases there is a duplicate row, because the original **ipeds_inst_name** variable has two spelling versions of the institution.

In [None]:
# Explore HERD based on one institution - 111966
qry <- " 
select *
from ds_nsf_ncses.dbo.nsf_herd
where ipeds_inst_id = '111966'"

herd <- dbGetQuery(con, qry)

In [None]:
# Note the difference in the institution name in the "ipeds_inst_name" column
head(herd, 2)

To link HERD to the individuals in SED, we join on both the institution (**phdinst**, the IPEDS ID, in SED and **ipeds_inst_id** in HERD) and the year (**phdfy** in SED and **year** in HERD). We also select distinct values (`select distinct`) to de-duplicate rows.

In [None]:
qry <- " 
select distinct drf_id, phdinst, med_sch_flag
from ds_nsf_ncses.dbo.nsf_sed sed
INNER JOIN ds_nsf_ncses.dbo.nsf_herd herd
on sed.phdinst = herd.ipeds_inst_id and sed.phdfy = herd.year
where sed.phdfy = '2015'"

sed_herd <- dbGetQuery(con, qry)

In [None]:
head(sed_herd)
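
The effect of the duplicate institution rows can be reproduced on toy data. In the sketch below, the institution names, the second institution ID, and the flag values are made up; only the column names follow the SED and HERD tables used above. One institution appears twice for the same year under two spellings, so a plain join repeats its students.

```r
library(dplyr)

# Toy SED rows: two individuals at two institutions
toy_sed <- tibble(
  drf_id  = c("A1", "A2"),
  phdinst = c("111966", "222222"),
  phdfy   = c("2015", "2015")
)

# Toy HERD rows: institution 111966 is listed twice with two name spellings
toy_herd <- tibble(
  ipeds_inst_id   = c("111966", "111966", "222222"),
  year            = c("2015", "2015", "2015"),
  ipeds_inst_name = c("Example Univ", "Example University", "Other College"),
  med_sch_flag    = c("Y", "Y", "N")
)

# Joining on institution and year duplicates individual A1
joined <- inner_join(toy_sed, toy_herd,
                     by = c("phdinst" = "ipeds_inst_id", "phdfy" = "year"))
nrow(joined)   # 3

# Keeping only the distinct columns of interest restores one row per person
deduped <- distinct(joined, drf_id, phdinst, med_sch_flag)
nrow(deduped)  # 2
```

A quick check after any such join is to compare the number of rows with the number of unique `drf_id` values; if they differ, some individuals are still duplicated.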

# SED <> IPEDS

In this example, we show how to pull institutional name and location (state) using IPEDS data. For the list of IPEDS variables available in the class, please refer to the IPEDS data dictionary.

The unit of observation in IPEDS data is the institution. We will link on the **phdinst** variable in SED (IPEDS ID) and the **unitid** variable in IPEDS.

In [None]:
qry <- " 
select distinct drf_id, phdinst, instnm, stabbr
from ds_nsf_ncses.dbo.nsf_sed sed 
INNER JOIN tr_ncses_2021.dbo.ipeds ipeds
on sed.phdinst = ipeds.unitid
where sed.phdfy = '2015'"

sed_ipeds <- dbGetQuery(con, qry)

In [None]:
head(sed_ipeds)