<center> <img style="float: center;" src="images/CI_horizontal.png" width="450">
<center>
    <span style="font-size: 1.5em;">
        <a href='https://www.coleridgeinitiative.org'>Website</a>
    </span> 
    <br>
    
# SED/SDR Dataset Exploration
    
<br>
    
<center>Maryah Garner, Ekaterina Levitskaya, Allison Nunez, Ben Feder, Rukhshan Arif Mian.</center>

### Learning Objectives

This notebook provides hands-on practice with the SED and SDR data. It demonstrates several techniques for exploring these datasets with SQL and R.


### R Setup

Before using R functions that are not available in `base` R, load them using the built-in function `library()`. For example, running `library(tidyverse)` loads the `tidyverse` suite of packages.


In this notebook, we load the libraries slightly differently: execute the cell below to load them. We use the `suppressMessages` and `options` functions to reduce clutter in our output. 

In [None]:
# Switching off warnings
options(warn = -1)

# Database interaction imports
suppressMessages(library(odbc))

# for data manipulation/visualization
suppressMessages(library(tidyverse))

# scaling data, calculating percentages, overriding default graphing
suppressMessages(library(scales))

# add weights to data
suppressMessages(library(survey))

#Switching on warnings
options(warn = 0)

The purpose of `suppressMessages` is to hide all messages associated with reading in a library and we use `options(warn = -1)` to switch off warnings. These two functions do not impact our results in any way but serve to reduce clutter in our output. `options(warn = 0)` is used to switch on the warnings as we do not want to ignore any future warnings in our code but only the ones associated with reading in the libraries.
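The same pattern can be sketched on a toy function, where `noisy_load` is a hypothetical stand-in for a chatty `library()` call. The sketch also shows one possible alternative to hard-coding `options(warn = 0)`: `options()` returns the previous settings, so they can be restored exactly as they were.

```r
# Hypothetical stand-in for a library() call that prints a startup message
noisy_load <- function() {
  message("Attaching packages...")
  invisible("loaded")
}

# Switch warnings off and keep the previous setting in old_opts
old_opts <- options(warn = -1)

# suppressMessages() hides the message but still returns the value
result <- suppressMessages(noisy_load())

# Restore whatever warning level was set before, rather than assuming 0
options(old_opts)
```

Restoring the saved value is a small design choice: it leaves the session unchanged even if the warning level was not 0 to begin with.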

The code for loading the libraries without suppressing the messages and warnings is as follows:

```
# database interaction imports
library(odbc)

# for data manipulation/visualization
library(tidyverse)

# scaling data, calculating percentages, overriding default graphing
library(scales)

# add weights to data
library(survey)
```

## Load the Data

### Establish a Connection to the Database

The first step is to create a connection to the database using the `dbConnect` function. The `Driver` argument specifies the type of SQL database, while the `Server` argument points to where the database is located within the ADRF. 

When creating a new notebook in this course, make sure to copy the following code chunk to be able to connect to the database.

__Database Connection__

In [None]:
# Connect to the database
con <- DBI::dbConnect(odbc::odbc(),
                     Driver = "SQL Server",
                     Server = "msssql01.c7bdq4o2yhxo.us-gov-west-1.rds.amazonaws.com",
                     Trusted_Connection = "True")

### Formulate Data Query

This part is similar to writing a SQL query in DBeaver. Depending on the questions of interest, different queries can be used to pull different data. In this example, the query below is used to pull all columns from the SED data for doctoral students who graduated in 2015.

In [None]:
# Create query character string
    # Database name: ds_nsf_ncses
    # Schema name: dbo
    # Table name: nsf_sed

query <- "
SELECT TOP 10 *
FROM ds_nsf_ncses.dbo.nsf_sed
WHERE phdfy = '2015'
"

Here, `SELECT TOP` is used to read in only the first 10 rows. This gives a quick preview of the data and avoids using up memory by reading a large data frame into R. 

> `SELECT TOP` provides one simple way to get a "sample" of data; however, using `TOP` does **not provide a _random_** sample. It may return different samples of data, but it is just based on what is fastest for the database to return.
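If a random sample is needed, one common SQL Server idiom is to order the rows by `NEWID()`, which generates a new unique identifier for each row. The sketch below only builds the query string; `query_random` and `df_random` are names introduced here for illustration. Sorting by `NEWID()` generally requires the database to process every matching row, so it can be slow on large tables.

```r
# Same preview query as above, but with rows returned in random order
query_random <- "
SELECT TOP 10 *
FROM ds_nsf_ncses.dbo.nsf_sed
WHERE phdfy = '2015'
ORDER BY NEWID()
"

# Running it requires the database connection created earlier:
# df_random <- dbGetQuery(con, query_random)
```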


### Reading in the Data 

Now the data can be read into an R data frame using the `con` and `query` as inputs to `dbGetQuery()`. 

> Recall that `con` (our server connection) includes a reference that tells it what driver to use. Forgetting to set up the driver would cause an error. 

In [None]:
# Read the data into an R dataframe called df
df <- dbGetQuery(con, query)

In [None]:
# see first 5 rows of df
head(df,5)

## Summary Statistics

This section covers aggregating statistics on the data. The goal of this exercise is to get a better understanding of the data. Ask the following questions: Are the data generally clean? What are possible sources of error? What are the types of objects and what are the variables?

> Note: __Large tables__ can take a long time to process on shared databases, so the examples below do as much aggregation as possible in SQL and limit how much data is read back into R.

To answer these broader research questions, start by looking at simple aggregate statistics in each of the data sources.

## Data Exploration #1: **Survey of Earned Doctorates (SED)**

**Motivating Questions:**
- What are the most pursued fields of study by doctorate students?
- What are the most common sources of funding?

In order to avoid pulling a large amount of information, only pull in the data with the unique identifier of a person (**drf_id**) and their field of study (in the SED data this variable is called **PHDFIELD_NAME**).

### Field of Study

In this sub-section, we explore the counts for different fields of study within the SED data. For this purpose, we first take a look at the data dictionary to identify the variable associated with field of study. In this case, the relevant variable is: **PHDFIELD_NAME**. We then use SQL to aggregate counts by each **PHDFIELD_NAME**.

Because this class uses confidential data that must be submitted for export review before it can be shared publicly, every statistic based on a count of individuals must also report the related count of institutions. According to the NCSES/IRIS export rules, only statistics based on at least 10 individuals from at least 3 institutions can be disclosed publicly. Therefore, we will add two variables: **indiv_count** (count of individuals) and **inst_count** (count of institutions).

In [None]:
# Create query character string
    # Database name: ds_nsf_ncses
    # Schema name: dbo
    # Table name: nsf_sed

qry = "
SELECT PHDFIELD_NAME, COUNT(DISTINCT(drf_id)) as indiv_count, COUNT(DISTINCT(phdinst)) as inst_count
FROM ds_nsf_ncses.dbo.nsf_sed
WHERE phdfy = '2015'
GROUP BY PHDFIELD_NAME
"

Note the use of `DISTINCT` within the `COUNT` function in the code above. It ensures that we count unique values of **drf_id**, so individuals who appear in more than one row are not double counted. We can also see that each group in this table represents more than 10 individuals from more than 3 institutions; therefore, this table would pass the disclosure review to be released publicly. 

In [None]:
# Read the data into an R dataframe called sed_fos
sed_fos <- dbGetQuery(con, qry)

# View the entire sed_fos data frame
sed_fos
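The distinct counts and the disclosure thresholds can also be illustrated in base R on made-up data. This is a minimal sketch, not the export team's procedure: the `toy` data frame and the `passes_disclosure` column are hypothetical, and the comparison assumes the "at least 10 individuals from at least 3 institutions" reading of the rule, so adjust it if your export guidance differs.

```r
# Made-up row-level data; the same person can appear on several rows
toy <- data.frame(
  field   = c("Biology", "Biology", "Biology", "History", "History"),
  drf_id  = c(1, 1, 2, 3, 4),
  phdinst = c("U1", "U1", "U2", "U3", "U3")
)

# Base R equivalent of COUNT(DISTINCT ...) with GROUP BY field
count_distinct <- function(x) length(unique(x))
indiv <- tapply(toy$drf_id, toy$field, count_distinct)
inst  <- tapply(toy$phdinst, toy$field, count_distinct)

toy_counts <- data.frame(
  field       = names(indiv),
  indiv_count = as.vector(indiv),
  inst_count  = as.vector(inst)
)

# Flag groups that meet both thresholds (assumed: >= 10 individuals, >= 3 institutions)
toy_counts$passes_disclosure <- toy_counts$indiv_count >= 10 & toy_counts$inst_count >= 3
toy_counts
```

In this toy table, Biology has three rows but only two distinct individuals, which is the double counting that `DISTINCT` prevents. Neither group would pass the thresholds.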

#### Calculate percentages for each field of study and arrange them in descending order.

It’s important to note that since we are using confidential data that needs to be submitted for export prior to being shared publicly, every produced statistic will require the underlying count of individuals and the count of their associated unique institutions. Only those statistics produced using 10 or more individuals from at least 3 universities can be disclosed publicly. Further, counts and percentages need to be rounded. 

To calculate the percentage for the disclosure review process, we need to follow these steps:
- **Step 1. Round the counts in both numerator and denominator**  (**indiv_count_rounded** and **total_indiv_rounded**). Both numerator and denominator (before rounding) need to have at least 10 individuals from at least 3 institutions.
> According to the NCSES/IRIS rounding rules, numbers below 999 should be rounded to the nearest ten, and numbers above 999 should be rounded to the nearest hundred. 
> When using the `round` function, rounding to a negative number of digits means rounding to a power of ten, so `round(x, digits = -1)` rounds to the nearest ten while `round(x, digits = -2)` rounds to the nearest hundred.
> When you do not specify the number of digits, the `round` function rounds to the nearest whole number.
- **Step 2. Calculate the percentage on the rounded counts** (**percentage**)
- **Step 3. Round the percentage to the whole number** (**percentage_rounded**) - this would be the final percentage that could be released publicly, if the unrounded counts in the numerator and the denominator pass the threshold mentioned above.


We need to show the evidence of both unrounded and rounded counts for the disclosure review process, so we will keep all these variables in the table. 
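The three steps can be traced on a single pair of made-up counts. This is a small sketch in base R; `round_count` is a helper name introduced here for illustration, and the numbers are not taken from the SED data.

```r
# Round a count: nearest 100 above 999, nearest 10 otherwise
round_count <- function(x) {
  ifelse(x > 999, round(x, digits = -2), round(x, digits = -1))
}

indiv_count <- 347    # made-up numerator
total_indiv <- 1234   # made-up denominator

# Step 1: round both counts
indiv_count_rounded <- round_count(indiv_count)   # 350
total_indiv_rounded <- round_count(total_indiv)   # 1200

# Step 2: calculate the percentage on the rounded counts
percentage <- indiv_count_rounded / total_indiv_rounded * 100

# Step 3: round the percentage to a whole number
percentage_rounded <- round(percentage)           # 29
```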

In [None]:
sed_fos <- sed_fos %>%
# Create a new column for total individuals (it will repeat for each observation) 
    mutate(total_indiv = sum(indiv_count),    # use the sum function to add together all the indiv_count
           indiv_count_rounded = case_when(indiv_count > 999 ~ round(indiv_count, digits = -2),   # round to the nearest 100
                                           indiv_count <= 999 ~ round(indiv_count, digits = -1)), # round to the nearest 10 (<= so that 999 is not left as NA)
           total_indiv_rounded = case_when(total_indiv > 999 ~ round(total_indiv, digits = -2),   # round to the nearest 100
                                           total_indiv <= 999 ~ round(total_indiv, digits = -1)), # round to the nearest 10 (<= so that 999 is not left as NA)
           percentage = (indiv_count_rounded / total_indiv_rounded) * 100,                      # calculate percentage
           percentage_rounded = round(percentage)) %>%                                          # round percent to the nearest whole number
    arrange(desc(percentage))                                                                   # sort in descending order 

# View the entire sed_fos data frame
sed_fos

### Source of Funding

Next, we address the following question: What are the most common sources of funding?

We start our data exploration by calling the relevant variable for sources of funding: **srceprim**.

In [None]:
# Create the query and select only two variables: unique identifier (drf_id) and primary source of support (srceprim)
    # Database name: ds_nsf_ncses
    # Schema name: dbo
    # Table name: nsf_sed

query <- "
SELECT drf_id, srceprim
FROM ds_nsf_ncses.dbo.nsf_sed
WHERE phdfy = '2015'
"

# Read the data into an R dataframe called sed_ncses_2015
sed_ncses_2015 <- dbGetQuery(con, query)

# View the first 6 rows of the table
head(sed_ncses_2015)

The `unique` function returns its input (a vector, data frame, or array) with duplicate elements or rows removed.
Below we use the `unique` function to identify all of the values the variable **srceprim** takes on. 
- Since **srceprim** is a vector inside the **sed_ncses_2015** data frame, a vector of values is returned with all the duplicates removed (i.e., the unique values). 

In [None]:
#Check what are the unique values in the primary support variable (srceprim)
unique(sed_ncses_2015$srceprim)

Note that one of the values listed above is an empty string (`""`). These observations are not coded as NA; rather, they have an empty string as their recorded value. This is an important distinction we will have to make when recoding and grouping the primary source of funding below.
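The difference between an empty string and `NA` can be seen on a small made-up vector (the values below are illustrative, not SED data):

```r
x <- c("A", "", NA, "A", "B")   # made-up values

unique(x)     # duplicates removed; "" and NA are kept as two separate values
is.na(x)      # TRUE only for the NA element, not for the empty string
x == ""       # TRUE for the empty string, NA for the missing value

# Count each kind separately
n_missing <- sum(is.na(x))
n_empty   <- sum(x == "", na.rm = TRUE)
```

Because `is.na()` does not detect empty strings, a recode that only handles `NA` would leave these observations uncategorized.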

Use the `COUNT` function together with the `GROUP BY` and `ORDER BY` clauses in SQL to aggregate the number of graduates in each category and sort them in descending order. As a reminder, we also need to add the counts of institutions for the disclosure review process.

In [None]:
# Count the number of graduates (their distinct drf_id), 
# group by a primary source of support variable, 
# sort the counts in a descending order
query <- "
SELECT srceprim, COUNT(DISTINCT(drf_id)) AS indiv_count, COUNT(DISTINCT(phdinst)) AS inst_count
FROM ds_nsf_ncses.dbo.nsf_sed
WHERE phdfy = '2015'
GROUP BY srceprim
ORDER BY indiv_count DESC
"

# Read the data into an R dataframe called primary_support
primary_support <- dbGetQuery(con, query)

# View the results (i.e. the entire dataframe)
primary_support  

#### Understanding Primary Source categories
We have arrived at the most common sources of funding, but at this point it is unclear what each row refers to. That is, we do not know which source of funding 'A', 'B', or any of the remaining codes represents. To address this, we look up the categories in the SED data dictionary and recode our values accordingly, using `case_when`. 

The data dictionary can be found here: `tr-ncses-2021\Documentation\2017 SED Codebook.docx`



#### Recoding – Part 1:

In [None]:
primary_support <- primary_support %>% 
    mutate(source_name=case_when(
                        srceprim == "A" ~ "Fellowship, scholarship",
                        srceprim == "B" ~ "Dissertation grant",
                        srceprim == "C" ~ "Teaching assistantship",
                        srceprim == "D" ~ "Research assistantship",
                        srceprim == "E" ~ "Other assistantship",
                        srceprim == "F" ~ "Traineeship",
                        srceprim == "G" ~ "Internship, clinical residency",
                        srceprim == "H" ~ "Loans (from any source)",
                        srceprim == "I" ~ "Personal savings",
                        srceprim == "J" ~ "Personal earnings during graduate school (other than sources listed above)",
                        srceprim == "K" ~ "Spouse's, partner's, or family's earnings or savings",
                        srceprim == "L" ~ "Employer reimbursement/assistance",
                        srceprim == "M" ~ "Foreign (non-U.S.) support",
                        srceprim == "" ~ "Unknown",
                        srceprim == "N" ~ "Other - specify"))

# View the first 6 rows of the data frame
head(primary_support)

#### Recoding Part 2 – Creating a broader category:

We will group the 15 sources of funding into the 6 broader categories listed below. 
For your own research project, there are a few different reasons you might want to group specific outcomes into broader categories. For instance, you might want to connect the three main data sources (SED, SDR, and UMETRICS) for your final analysis, and you might not have sufficient coverage to analyze the more nuanced differences in funding sources or other variables of interest.

1. Fellowship Grant
2. Research Assistantship (RA)
3. Teaching Assistantship (TA)
4. Own Funds/Loan
5. Other
6. Unknown

We use these 6 categories because they are consistent with groupings used in the literature.  

Use the `==` operator to test whether **srceprim** takes on a specific value (e.g., `srceprim == "C"`) and the `%in%` operator to test whether **srceprim** takes on one of several values (e.g., `srceprim %in% c("H", "I", "J")`). 
- In the latter example (`srceprim %in% c("H", "I", "J") ~ "Own Funds/Loan"`), if **srceprim** takes on either "H", "I", or "J", **source_cat** will be assigned the value "Own Funds/Loan".
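The two operators can be compared on a short made-up vector of codes. This is a minimal base R sketch; `codes`, `is_ta`, and `is_own` are names introduced here for illustration.

```r
codes <- c("C", "H", "J", "N", "")   # made-up sample of srceprim values

# == compares every element against a single value
is_ta <- codes == "C"

# %in% tests membership in a set of values
is_own <- codes %in% c("H", "I", "J")

# The operators also treat missing values differently
NA == "C"      # NA: the comparison itself is missing
NA %in% "C"    # FALSE: %in% always returns TRUE or FALSE
```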

In [None]:
# Using mutate and case_when, create a new variable called source_cat
primary_support <- primary_support %>% 
   mutate(source_cat = case_when(srceprim %in% c("A", "B") ~ "Fellowship Grant",
                                  srceprim == "C" ~ "Teaching \nAssistantship", 
                                  srceprim == "D" ~ "Research \nAssistantship",
                                  srceprim %in% c("H", "I", "J") ~ "Own Funds/Loan", 
                                  srceprim %in% c("E", "F", "G", "K", "L", "M","N") ~ "Other",
                                  srceprim == '' ~ 'Unknown'))

# View entire primary_support dataframe
primary_support

## Data Exploration #2: **Survey of Doctorate Recipients (SDR)**

**Motivating Question:** What is the distribution of earnings by gender? You will look at race/ethnicity in the checkpoints notebook.

Because the SDR data are a sub-sample of the SED population, survey weights need to be used in the calculations. Applying survey weights accounts for unequal sample selection, which makes the statistics we compute more representative of the population.
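The effect of weighting can be seen with two made-up respondents. In this sketch the values are illustrative only: the first respondent stands in for three people in the population and the second for one.

```r
salary  <- c(50000, 100000)   # made-up salaries
wtsurvy <- c(3, 1)            # made-up survey weights

# Unweighted mean treats both respondents equally
unweighted <- mean(salary)

# Weighted mean gives the first respondent three times the influence
weighted <- weighted.mean(salary, wtsurvy)

# The same calculation written out by hand
by_hand <- sum(salary * wtsurvy) / sum(wtsurvy)
```

Here the unweighted mean is 75000 while the weighted mean is 62500, because the lower salary represents more people in the population.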

### Distribution of earnings by gender 
Find the distribution of earnings from the 2017 SDR for the 2015 cohort.
> Use the variable **sdryr** (the year of first award of a U.S. PhD degree) to subset the data for the 2015 cohort.

In [None]:
# Create the query and select only 4 variables: refid (the unique identifier), gender, salary, wtsurvy (survey weights)
    # Database name: ds_nsf_ncses
    # Schema name: dbo
    # Table name: sdr_2017
    # subset: 2015 cohort
query <- "
SELECT refid, gender, salary, wtsurvy 
FROM ds_nsf_ncses.dbo.sdr_2017
WHERE sdryr = '2015'
"

# Read the data into an R dataframe called gender_earnings
gender_earnings <- dbGetQuery(con, query)

#View the first 6 rows of the data frame
head(gender_earnings)

Next, notice that some of the recorded salary values do not make sense as earnings. These values mark logical skips, which Qualtrics, a widely used survey platform, defines as follows:

*Skip logic allows you to send respondents to a future point in the survey based on how they answer a question. For instance, if a respondent indicates that they don’t agree to your survey’s consent form, they could immediately be skipped to the end of the survey.*

For the purpose of our data exploration tasks, we remove any salaries associated with logical skips.

In [None]:
# remove all observations with a recorded salary of 9999998
gender_earnings <- gender_earnings[gender_earnings$salary != 9999998, ]

#View the first 6 rows of the data frame
head(gender_earnings)
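One detail of this kind of filter is worth knowing: if the column being compared contains `NA`, logical subsetting returns a row of `NA`s for each missing value. Whether **salary** contains missing values in the SDR table is not established here, so the sketch below uses a made-up data frame (`toy`) to show the behaviour and two possible ways to handle it.

```r
# Made-up data with one logical skip (9999998) and one missing salary
toy <- data.frame(refid = 1:4, salary = c(50000, 9999998, NA, 80000))

# Plain logical subsetting: the NA comparison produces a row filled with NAs
naive <- toy[toy$salary != 9999998, ]

# which() keeps only rows where the condition is TRUE, dropping FALSE and NA
safe <- toy[which(toy$salary != 9999998), ]

# Alternative: drop the logical skip but keep the respondent with a missing salary
keep_na <- toy[is.na(toy$salary) | toy$salary != 9999998, ]
```

Which variant is appropriate depends on whether respondents with a missing salary should remain in the analysis.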

#### Applying weights
When loading the R libraries at the beginning of the notebook, an R package called `survey` was imported (by calling `library(survey)`). This package makes it possible to calculate weighted statistics by applying survey weights to the data.

Apply the `svydesign` function to the unweighted data frame called `gender_earnings` to create a weighted survey design object that can be used to calculate the weighted earnings.

In [None]:
gender_earnings_weighted <- svydesign(ids=~1, data=gender_earnings, weights=gender_earnings$wtsurvy)

Instead of a data frame, the `svydesign` function returns an object of class `survey.design` - try to call `gender_earnings_weighted`:

In [None]:
gender_earnings_weighted

The output is not a regular table like the one printed for a data frame. To work with this new object, use the functions provided in the `survey` package. For example, to find a weighted mean of earnings, call the `svymean` function.

#### Weighted mean by group (gender)

Since we are hoping to look at the mean of **salary** by each **gender**, we combine two functions: `svyby` and `svymean`. The code syntax will be as follows:

```
svyby(~var_to_calc_mean_for, ~var_to_group_by, weighted_df_name, svymean, keep.names=FALSE)
```

For our implementation, the associated variables would be:

- var_to_calc_mean_for: **salary**
- var_to_group_by: **gender**
- weighted_df_name: **gender_earnings_weighted**
- keep.names: This is always equal to FALSE in our case

The implementation is as follows:

In [None]:
weighted_earnings_by_gender <- svyby(~salary, ~gender, gender_earnings_weighted, svymean, keep.names = FALSE)
weighted_earnings_by_gender

#### Unweighted mean by group (gender)

Compare the weighted mean with the unweighted mean for each gender (using the unweighted data frame called `gender_earnings`):

In [None]:
unweighted_earnings_by_gender <- gender_earnings %>%
    group_by(gender) %>%
    summarise_at(vars(salary), list(mean=mean), na.rm=TRUE)

unweighted_earnings_by_gender 

There is a slight difference between the unweighted and weighted means. Remember: with survey data, always report the weighted estimates.

## Saving Results

We have already created a directory structure for you. Within the `Module 2` folder, you will see the following:

- Figures
- Tables

We will now save some of the tables that we created above as csv files.

For this purpose, we utilize the following code:

`write.csv(df_to_save, "Tables\\df_to_save.csv", row.names = FALSE)`

Here:

- `df_to_save` refers to the R dataframe we want to save. 
- `"Tables\\df_to_save.csv"` refers to the file path where we would like to save our dataframe.
- `row.names = FALSE` excludes row numbers when we save our dataframe as a csv. If we had `TRUE` instead of `FALSE`, we would have an extra column (of row numbers) as the first column in our csv. 
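The effect of these arguments can be checked on a throwaway file. The sketch below is illustrative only: it writes a made-up data frame to R's temporary directory rather than the course `Tables` folder, and it builds the path with `file.path()`, which inserts the path separator for you as an alternative to typing `\\` by hand.

```r
# Made-up table to save
toy <- data.frame(source_cat = c("Other", "Unknown"), indiv_count = c(120, 30))

# Build the output path inside a temporary folder (not the course directory)
out_dir <- file.path(tempdir(), "Tables")
dir.create(out_dir, showWarnings = FALSE)
out_file <- file.path(out_dir, "toy_table.csv")

# Save without row numbers, then read the file back to inspect it
write.csv(toy, out_file, row.names = FALSE)
check <- read.csv(out_file)
names(check)   # only the two original columns, no extra row-number column
```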


We have added the saving code for each dataframe we want to save and you do not have to make any updates to the code below in order for it to run.

In [None]:
# Primary source of support
write.csv(primary_support, "Tables\\primary_support_coverage.csv", row.names = FALSE)

## Closing the Database connection

Uncomment the code below to close the database connection. Closing the connection frees up resources (most importantly, memory) for future work. However, close the connection only after you have executed all of the cells above. 

**Note**: Once the connection is closed, you will not be able to run any SQL-related code in R. To interact with the tables again after running `dbDisconnect(con)`, re-run the code at the start of the notebook that establishes the database connection.

In [None]:
# Close the database connection
# dbDisconnect(con)