# Checkpoints

**The purpose of this notebook and all other checkpoint notebooks is to get you to practice making changes to the code that will help construct your own research project. We have given hints and solutions but these are more applicable to a generic research project. You are encouraged to think about how these checkpoints, hints/solutions may help formulate and address your research question.**

This notebook serves as an overview of what was discussed in `02_Creating_a_cohort.ipynb` through **4 checkpoints**. 

At each checkpoint you will be replacing the `___` with the appropriate variable, function, or R code snippet. 

Participants are encouraged to attempt the checkpoints on their own. Hints and suggested solutions are also provided and can be accessed with the following code:

Hints: `check_#.hint()`

Solutions: `check_#.solution()` – your solutions may vary based on how you define your cohort. We have shared our suggested solutions.

In both cases, # refers to the checkpoint number. For example, we can access the hint and solution for Checkpoint 2 using `check_2.hint()` and `check_2.solution()` respectively.

> Note: The code for accessing hints and solutions is currently commented out. To run it, you will need to uncomment it first.

In [None]:
options(warn = -1)                   # switches warnings off

suppressMessages(library(odbc))      # allows R to connect with the database
suppressMessages(library(tidyverse)) # useful for data manipulation and visualization
suppressMessages(library(scales))    # to calculate percentages, graphing
suppressMessages(library(lubridate)) # for easy working with dates 

options(warn = 0)                    # switches warnings on 
options(scipen=999)                  # prevents scientific notation

source("02_Creating_a_cohort_hints_solutions.txt") # defining hints + solutions

In [None]:
# server connection
con <- DBI::dbConnect(odbc::odbc(),
                     Driver = "SQL Server",
                     Server = "msssql01.c7bdq4o2yhxo.us-gov-west-1.rds.amazonaws.com",
                     Trusted_Connection = "True")

## Checkpoint 1: Looking at TANF enterers

In `02_Creating_a_cohort.ipynb`, we defined our population of interest focusing on TANF exiters during the second quarter of 2018 (2018 Q2). For your project, you might want to select your cohort based on: 

1. TANF enterers or TANF exiters 
2. Looking at a different year-quarter combination
3. Including more than one year-quarter combination
4. Focusing on a certain demographic group (based on gender, race, etc.)

As a starting point, for this checkpoint, we would like you to select your cohort based on TANF entry (**tanf_start = 1**) or TANF exit (**tanf_end = 1**) for a year-quarter combination that will help you answer your research question (1 and 2 from above).
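
If you expect to try several quarters, one option is to build the query from two arguments instead of editing the string by hand. The sketch below is only an illustration: `build_cohort_query` is a hypothetical helper that is not part of the notebook, and the quarter value passed to it is a placeholder, so check how `yr_quarter` is actually formatted in the table before using it.

```r
# Hypothetical helper (not part of the original notebook): builds the cohort
# query from a year-quarter value and an entry/exit flag column.
build_cohort_query <- function(yr_quarter, flag = c("tanf_start", "tanf_end")) {
    # match.arg() stops with an error unless flag is one of the two columns
    flag <- match.arg(flag)

    query_template <- paste(
        "SELECT ssn, caseid, dob, yr_month, welig, affil, tanf_total_months, tanf_start, tanf_end, hispan, native, asian, black, hawaia, white, county",
        "FROM tr_tdc_2022.dbo.person_month_clean",
        "WHERE yr_quarter = '%s' AND %s = 1",
        sep = "\n"
    )

    sprintf(query_template, yr_quarter, flag)
}

# "<your quarter>" is a placeholder, not a real value from the data
example_query <- build_cohort_query("<your quarter>", "tanf_end")
cat(example_query)
```

The resulting string can be passed to `dbGetQuery(con, ...)` in the same way as `query`.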

> Note: You might want to select an earlier quarter if you want to be able to look at long-run TANF experiences or employment outcomes.


In [None]:
# The first blank corresponds to the year-quarter combination of your choice 
# The second blank corresponds to the variable of your choice (either tanf_start or tanf_end)

query <- "
SELECT ssn, caseid, dob, yr_month, welig, affil, tanf_total_months, tanf_start, tanf_end, hispan, native, asian, black, hawaia, white, county
FROM tr_tdc_2022.dbo.person_month_clean
WHERE yr_quarter = '___'  AND ___ = 1
"

tanf_cohort <- dbGetQuery(con, query)

head(tanf_cohort)

In [None]:
# Uncomment the line below if you would like to see the hint
#check_1.hint()

In [None]:
# Uncomment the line below if you would like to see the solution
#check_1.solution()

Once you have defined the new cohort, we encourage you to see if there are any individuals who show up more than once. This is similar to what we checked for in `02_Creating_a_cohort.ipynb`. 

Run the code below to see if there are any such cases.

> Note: You are not supposed to fill out any blanks in the following chunks of code as these are not checkpoints. 

In [None]:
# find number of rows, unique people, and unique cases in TANF data
tanf_cohort %>%
    summarize(
        n_rows = n(),
        n_cases = n_distinct(caseid),
        n_people = n_distinct(ssn)
    )

In [None]:
# find example of someone showing up multiple times
# use pull to isolate the contents of the variable (individual ssn) and save as an object
ssn_dup <- tanf_cohort %>%
    count(ssn) %>%
    arrange(desc(n)) %>%
    head(1) %>% 
    pull(ssn)

tanf_cohort %>%
    filter(ssn == ssn_dup)

As discussed in `02_Creating_a_cohort.ipynb`, there are cases where we do not want to double-count individuals and instead want to focus on their most recent occurrence. We keep only the most recent record for each individual using the following code:

In [None]:
# Select latest entry (yr_month) for each ssn
tanf_cohort_clean <- tanf_cohort %>%
    arrange(ssn, desc(yr_month)) %>%
    distinct(ssn, .keep_all = TRUE)

Run the code below to see if there are any duplicate entries in our cohort. 
> Note: **n_rows** should equal **n_people**. 

In [None]:
# confirm we have one row per ssn
tanf_cohort_clean %>%
    summarize(
        n_rows = n(),
        n_people = n_distinct(ssn)
    )

For the purpose of your project, it might be useful to think of cases where keeping the latest entry would not be the best way to remove duplicates. It might also be worthwhile to see if there are other methods you could use to address duplicate entries. Finally, there could be cases where you would want to keep individuals who appear more than once in your data (for example: looking at the same individual over time).
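
As a rough illustration of two such alternatives, the sketch below uses a small made-up data frame (`toy_cohort`, with invented `ssn`, `caseid`, and `yr_month` values) rather than the real table. It assumes `yr_month` sorts chronologically, which you should verify for the actual column.

```r
library(dplyr)

# Toy data standing in for tanf_cohort; all values are made up for illustration
toy_cohort <- tibble(
    ssn      = c("A", "A", "B", "C", "C"),
    caseid   = c(1, 2, 3, 4, 4),
    yr_month = c("201804", "201806", "201805", "201804", "201805")
)

# Alternative 1: keep the earliest record per person instead of the latest
toy_earliest <- toy_cohort %>%
    arrange(ssn, yr_month) %>%
    distinct(ssn, .keep_all = TRUE)

# Alternative 2: keep every row, but record how often each person appears
toy_flagged <- toy_cohort %>%
    add_count(ssn, name = "n_records")
```

The second approach keeps repeated individuals in the data, which may suit a project that follows the same person over time.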

## Checkpoint 2: Age

In `02_Creating_a_cohort.ipynb`, we filtered out individuals who were not adults (younger than 18). We reuse the code from the previous notebook to create a variable **age_at_event** and keep individuals who are at or above an age threshold (that you will define) at the time they entered/exited TANF (based on how you defined your cohort).

If you are evaluating TANF experiences, you might be interested in a sample that contains children in addition to adults. In this case, you would not want to keep only ages 18 or older. Furthermore, if you are looking at employment outcomes, you might want to select an age bracket of individuals who would be less likely to have competing interests, such as college.
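
If you do want an age bracket rather than a single cutoff, one possible approach is `dplyr::between()`, applied once **age_at_event** exists. This is a minimal sketch on made-up data; the bounds `lower_age` and `upper_age` are hypothetical and should follow from your research question.

```r
library(dplyr)

# Toy data; ages are made up for illustration
toy_ages <- tibble(
    ssn          = c("A", "B", "C", "D"),
    age_at_event = c(16, 22, 30, 60)
)

# Hypothetical bracket bounds (both ends are inclusive in between())
lower_age <- 25
upper_age <- 54

toy_bracket <- toy_ages %>%
    filter(between(age_at_event, lower_age, upper_age))
```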

There are multiple blanks in this checkpoint:

1. The first blank corresponds to the last date associated with the year-quarter combination that you chose in Checkpoint 1. That is, if you chose '2017 Q2', this blank will contain '2017/06/30'. 
2. The second blank corresponds to the age cutoff of your choice. The third and fourth blanks correspond to the category names (as we defined them in `02_Creating_a_cohort.ipynb`).
3. The last blank corresponds to the age cutoff you have defined above. 
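
To see how the age calculation behaves around a cutoff, the sketch below runs the same expression on three made-up birth dates. The event date reuses the '2017/06/30' example from the list above, while the cutoff of 18 and the labels "adult" and "minor" are assumptions chosen only for illustration.

```r
library(dplyr)
library(lubridate)

# Toy data; birth dates are made up for illustration
toy_people <- tibble(
    ssn = c("A", "B", "C"),
    dob = as.Date(c("1990-01-15", "1999-06-30", "1999-07-01"))
)

# Hypothetical choices: quarter-end date for 2017 Q2 and a cutoff of 18
event_date <- ymd("2017/06/30")
age_cutoff <- 18

toy_people <- toy_people %>%
    mutate(
        # interval from birth to event date, in whole years
        age_at_event = trunc((dob %--% event_date) / years(1)),
        # label each person relative to the cutoff
        age_ind = ifelse(age_at_event >= age_cutoff, "adult", "minor")
    )

toy_people
```

Person B turns 18 on the event date itself and is kept by a `>=` filter, while person C, born one day later, is not.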

In [None]:
# numerical summary of age
tanf_cohort_clean <- tanf_cohort_clean %>%
    mutate(
        # fill in the blank for your year-quarter combination of interest in: 'YYYY/MM/DD'
        age_at_event = trunc((dob %--% ymd('____')) / years(1)),
        # fill in the blank with the age cutoff of your choice. Define categories accordingly. 
        age_ind = ifelse(age_at_event >= ___, "____", "____")
    ) %>%
    # filter based on the age cutoff of your choice (from above)
    filter(age_at_event >= ___)


In [None]:
# Uncomment the line below if you would like to see the hint
# check_2.hint()

In [None]:
# Uncomment the line below if you would like to see the solution
# check_2.solution()

Based on your results from above, we will visualize the age distribution of the remaining TANF enterers/exiters. Summaries like this can be useful for your research project because they help contextualize later findings. Here, the plot shows the age distribution of the individuals who remain in your cohort after the age filter.

In [None]:
# visual summary of age (not a checkpoint; there are no blanks to fill in)
tanf_cohort_clean %>%
    ggplot(aes(x=age_at_event)) + 
    geom_histogram()
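
As a numeric complement to the histogram, ages can also be tabulated into bands. The sketch below uses made-up ages, and the break points are hypothetical; choose bands that suit your own cohort.

```r
library(dplyr)

# Toy data; ages are made up for illustration
toy_ages <- tibble(age_at_event = c(18, 19, 23, 27, 31, 34, 35, 42, 58))

# Hypothetical age bands; right = FALSE makes each band include its lower bound
age_bands <- toy_ages %>%
    mutate(
        age_band = cut(age_at_event,
                       breaks = c(18, 25, 35, 45, 55, 65),
                       right = FALSE)
    ) %>%
    count(age_band)

age_bands
```

By default `count()` omits bands with no observations, so an empty band will not appear in the output.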

## Checkpoint 3: Affiliation

In `02_Creating_a_cohort.ipynb`, we used the **affil** variable, which tracks an individual's affiliation to the family receiving assistance. We concentrated on members of the family receiving assistance (**affil = 1**).

Within this checkpoint, we encourage you to explore additional values that **affil** can take on. 
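
Before choosing which values to keep, it can help to see how often each value occurs. The sketch below is a minimal illustration on made-up data; `keep_affil` is a hypothetical selection, not a recommendation.

```r
library(dplyr)

# Toy data standing in for tanf_cohort_clean; values are made up
toy_cohort <- tibble(
    ssn   = c("A", "B", "C", "D", "E"),
    affil = c(1, 1, 2, 5, 1)
)

# Frequency of each affil value, most common first
affil_counts <- toy_cohort %>%
    count(affil, sort = TRUE)

# Hypothetical selection of values to keep
keep_affil <- c(1, 2)

toy_kept <- toy_cohort %>%
    filter(affil %in% keep_affil)
```

Comparing the number of rows before and after the filter shows how much each choice of values shrinks the cohort.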

From the data dictionary, we observe that **affil** can take on multiple values (ranging from 1 to 5). For example, a value of 2 refers to "Parent of minor child in the eligible family receiving assistance". Is this a case that you would like to keep in your cohort? Are there other values of **affil** that interest you? We encourage you to leverage the data dictionary for this checkpoint:

In [None]:
# filter to keep additional values of affil. These can be more than 3 as well
tanf_cohort_clean <- tanf_cohort_clean %>%
    # format: c(value_1, value_2, value_3, ...) – 3 blanks are placeholders, you can choose any range of values.
    filter(affil %in% c(__, __, __)) 

In [None]:
# counting the number of individuals (not a checkpoint)
tanf_cohort_clean %>% 
    summarize(
        n_people = n_distinct(ssn)
    )

In [None]:
# Uncomment the line below if you would like to see the hint
#check_3.hint()

In [None]:
# Uncomment the line below if you would like to see the solution
#check_3.solution()

## Checkpoint 4: Saving as a Permanent Table

In this checkpoint, we ask you to save the cohort that you created above. When doing so, please save your table with the prefix **team#\_name** to differentiate between tables for each team and individual. **#** refers to your team number and **name** refers to your name (in lower case). For example: **team5_cohort_john**.


> Note: The **dbo.** string will stay as it is and does not need to be changed. 



In [None]:
qry <- "use tr_tdc_2022;"
DBI::dbExecute(con, qry)

DBI::dbWriteTable(
    conn = con,
    # Replace the blank below with a string in the following format: team#_name (example provided above).
    name = DBI::SQL("dbo.___"), # Only replace the blank. Keep "dbo." as it is
    value = tanf_cohort_clean, 
    overwrite = TRUE # Overwrites any existing tables 
)

In [None]:
# Uncomment the line below if you would like to see the hint
#check_4.hint()

We have not provided a solution for this. You can confirm that your table saved properly by querying from it.
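
One possible check is to count the rows in the saved table and compare that count with the data frame you wrote. The sketch below only builds the query string. The table name reuses the **team5_cohort_john** example from above and should be replaced with your own, and the database calls are commented out because they need a live connection.

```r
# Hypothetical table name, taken from the example above; replace with your own
table_name <- "team5_cohort_john"

# Row-count query against the saved table
confirm_query <- sprintf(
    "SELECT COUNT(*) AS n_rows FROM tr_tdc_2022.dbo.%s",
    table_name
)

# With a live connection, compare the saved row count with the data frame
# (commented out because it requires database access):
# saved <- dbGetQuery(con, confirm_query)
# saved$n_rows == nrow(tanf_cohort_clean)
```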