# Checkpoints

<a href="https://doi.org/10.5281/zenodo.10408116"><img src="https://zenodo.org/badge/DOI/10.5281/zenodo.10408116.svg" alt="DOI"></a>


**The purpose of this notebook and all other checkpoint notebooks is to get you to practice making changes to the code that will help construct your own research project. We have given hints and solutions, but these apply to a generic research project. You are encouraged to think about how these checkpoints, hints, and solutions may help you formulate and address your research question.**

This notebook serves as an overview of what was discussed in `02_Creating_a_cohort.ipynb` and `01_EDA.ipynb` through **4 checkpoints**. 

At each checkpoint you will be replacing the `___` with the appropriate variable, function, or R code snippet. 

Participants are encouraged to attempt the checkpoints on their own. Hints and suggested solutions are also provided and can be accessed with the following code:

Hints: `check_#.hint()`

Solutions: `check_#.solution()` – your solutions may vary based on how you define your cohort. We have shared our suggested solutions.

In both cases, # refers to the checkpoint number. For example, we can access the hint and solution for Checkpoint 2 using `check_2.hint()` and `check_2.solution()`, respectively.

> Note: The code for accessing hints and solutions is currently commented out. To run it, you will need to uncomment it first.

In [None]:
options(warn = -1)                   # switches warnings off

suppressMessages(library(odbc))      # allows R to connect with the database
suppressMessages(library(tidyverse)) # useful for data manipulation and visualization
suppressMessages(library(scales))    # to calculate percentages, graphing
suppressMessages(library(lubridate)) # for easy working with dates 

options(warn = 0)                    # switches warnings on 
options(scipen=999)                  # prevents scientific notation

source("02_Creating_a_cohort_hints_solutions.txt") # defining hints + solutions

In [None]:
# server connection
con <- DBI::dbConnect(odbc::odbc(),
                     Driver = "SQL Server",
                     Server = "msssql01.c7bdq4o2yhxo.us-gov-west-1.rds.amazonaws.com",
                     Trusted_Connection = "True")

## Checkpoint 1: Looking at Apprenticeship Enterers

In `02_Creating_a_cohort.ipynb`, we defined our population of interest focusing on apprenticeship completers. For your project, you might want to select your cohort based on: 

1. Apprenticeship start or end date
1. Whether participants completed their program
1. Specific demographic characteristics

As a starting point, for this checkpoint, we would like you to select your cohort based on apprenticeship entry or completion date.

> Note: You might want to select different years or whether or not they have completed the training program. The answers given in the checkpoint might not align with your answers!

In [None]:
# question
query <- "
SELECT
RP.psnumber,
RP.progname,
RP.standardstype,
RA.apprnumber,
RA.naicscode,
NNI.Name,
RA.occupationtitle,
RA.onetsoccode,
RA.apprstatus,
RA.progstate,
RA.progzip5,
RA.county,
RA.termlengthmin,
RA.gender,
RA.race,
RA.ethnicity,
RA.vetstatind,
RA.disabled,
RA.ageatstart,
RA.exitwagedt
FROM ds_public_1.dbo.rapids_apprentice RA
JOIN ds_public_1.dbo.rapids_program RP ON (RP.psnumber =RA.psnumber)
LEFT JOIN ds_ar_dws.dbo.NAICS_National_Industry NNI ON (NNI.Code = RA.naicscode)
WHERE RA.progstate='AR' --RESTRICT TO ARKANSAS PARTICIPANTS ONLY
AND YEAR(RA._____) BETWEEN ___ AND ___ --RESTRICT TO INDIVIDUALS THAT ENTERED OR COMPLETED THE TRAINING DURING THESE YEARS
AND RA.apprstatus = '___' -- RESTRICT TO COMPLETION STATUS
"

checkpoint_cohort <- dbGetQuery(con, query)

# view the first 6 observations
head(checkpoint_cohort)

In [None]:
# Uncomment the line below if you would like to see the hint
#check_1.hint()

To see the answer for an entry cohort, uncomment and run the code cell below.

In [None]:
# Uncomment the line below if you would like to see the solution
#check_1_entry.solution()

To see the answer for a completion cohort, uncomment and run the code cell below.

In [None]:
# Uncomment the line below if you would like to see the solution
#check_1_exit.solution()

Once you have defined the new cohort, we encourage you to see if there are any individuals who show up more than once. This is similar to what we checked for in `02_Creating_a_cohort.ipynb`. 

Run the code below to see if there are any such cases.

> Note: You are not supposed to fill out any blanks in the following chunks of code as these are not checkpoints. 

In [None]:
# confirm we have one row per apprentice (apprnumber)
checkpoint_cohort %>%
    summarize(
        n_rows = n(),
        n_people = n_distinct(apprnumber)
    )

For the purpose of your project, it might be useful to think of cases where keeping the latest entry would not be the best way to remove duplicates. It might also be worthwhile to see if there are other methods you could use to address duplicate entries. Finally, there could be cases where you would want to keep individuals who appear more than once in your data (for example: looking at the same individual over time).
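The alternatives mentioned above can be sketched on a small made-up data frame. This is only an illustration: the values are invented, and `record_date` is a hypothetical column standing in for whichever date you decide to rank duplicate records by. It first lists the identifiers that appear more than once, then compares keeping the latest record with keeping the earliest one.

```r
library(dplyr)

# Made-up records for illustration only; apprentice 102 appears twice.
# `record_date` is a hypothetical column, not a field from the cohort query.
toy_cohort <- tibble(
    apprnumber  = c(101, 102, 102, 103),
    record_date = as.Date(c("2019-03-01", "2018-06-15", "2020-01-10", "2019-09-30"))
)

# Identifiers that occur more than once
dup_ids <- toy_cohort %>%
    count(apprnumber) %>%
    filter(n > 1)

# Option 1: keep the latest record per apprentice
latest <- toy_cohort %>%
    group_by(apprnumber) %>%
    slice_max(record_date, n = 1, with_ties = FALSE) %>%
    ungroup()

# Option 2: keep the earliest record per apprentice
earliest <- toy_cohort %>%
    group_by(apprnumber) %>%
    slice_min(record_date, n = 1, with_ties = FALSE) %>%
    ungroup()

dup_ids
latest
earliest
```

Which option is appropriate depends on your research question. If you want to follow the same individual over time, you may prefer to skip deduplication and keep every row.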

## Checkpoint 2: Race Distribution

In `02_Creating_a_cohort.ipynb`, we looked at our cohort by gender. Let's explore this cohort by race.

In [None]:
# question
# numerical summary of race
checkpoint_cohort %>%
    count(___) %>%
    arrange(desc(n))


In [None]:
# Uncomment the line below if you would like to see the hint
#check_2.hint()

In [None]:
# Uncomment the line below if you would like to see the solution
#check_2.solution()

Based on your results from above, we will visualize the race distribution. This may prove useful for your research projects, as such methods help contextualize future findings. In this case, it will help us further develop an understanding of the race distribution.
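Before plotting, it can also help to express the counts as shares of the cohort, since percentages are easier to compare across cohorts of different sizes. The sketch below uses a made-up data frame with a hypothetical `category` column rather than the actual cohort. It also shows that missing values get their own row in the summary.

```r
library(dplyr)

# Made-up data for illustration only; `category` is a hypothetical column
toy <- tibble(category = c(rep("A", 5), rep("B", 3), NA, NA))

shares <- toy %>%
    count(category) %>%                        # one row per value, including NA
    arrange(desc(n)) %>%                       # most common value first
    mutate(pct = round(100 * n / sum(n), 1))   # share of all rows, in percent

shares
```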

## Checkpoint 2A: Race Distribution Visual

In [None]:
# fill in the blank below to create a visual summary of race
# question
checkpoint_cohort %>%
    ggplot(aes(x=___)) + 
    geom_bar() + 
    theme(axis.text.x=element_text(angle=45, hjust=1))

In [None]:
#check_2A.hint()

In [None]:
#check_2A.solution()

## Checkpoint 3: Age

In the `01_EDA.ipynb` notebook, we looked at the distribution of the **ageatstart** variable, which is the apprentice's age at the beginning of the program. Let's see the age breakdown of the individuals in our cohort at the start of their apprenticeships.

In [None]:
# question
# frequency table of age at the start of the apprenticeship
age_freq <- checkpoint_cohort %>%
    count(____) %>%
    arrange(___)
age_freq

In [None]:
# not a checkpoint
age_freq %>% 
    ggplot(aes(x = as.numeric(ageatstart), y = n)) +
    geom_line()

In [None]:
# Uncomment the line below if you would like to see the hint
# check_3.hint()

In [None]:
# Uncomment the line below if you would like to see the solution
# check_3.solution()

## Checkpoint 4: Saving as a Permanent Table

In this checkpoint, we ask you to save the cohort that you created above. When doing so, please save your table with the prefix **team#\_name** to differentiate between tables for each team and individual. **#** refers to your team number and **name** refers to your name (in lower case). For example: **team5_cohort_john**.


> Note: The **dbo.** string will stay as it is and does not need to be changed. 



In [None]:
qry <- "use tr_ar_2022;"
DBI::dbExecute(con, qry)

DBI::dbWriteTable(
    conn = con,
    # Replace the blank below with a string in the following format: team#_name (example provided above).
    name = DBI::SQL("dbo.___"), # Only replace the blank. Keep "dbo." as it is
    value = checkpoint_cohort, 
    overwrite = TRUE # Overwrites any existing tables 
)

In [None]:
# Uncomment the line below if you would like to see the hint
#check_4.hint()

We have not provided a solution for this. You can confirm that your table saved properly by querying from it.
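One way to run that check is sketched below. To keep the example self-contained, it uses an in-memory SQLite database (via the `RSQLite` package) as a stand-in for the workshop server, and a made-up data frame in place of your cohort. Both are assumptions for illustration only. It writes the table, confirms that it exists, and compares the saved row count with the number of rows in the data frame.

```r
library(DBI)

# Stand-in connection (assumption): in-memory SQLite instead of the SQL Server
demo_con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Made-up data standing in for checkpoint_cohort
toy_cohort <- data.frame(
    apprnumber = c(101, 102, 103),
    ageatstart = c(19, 24, 31)
)

# Save the table, replacing any existing table with the same name
dbWriteTable(demo_con, "team5_cohort_john", toy_cohort, overwrite = TRUE)

# Check 1: the table exists
table_exists <- dbExistsTable(demo_con, "team5_cohort_john")

# Check 2: the saved row count matches the data frame
saved_rows <- dbGetQuery(
    demo_con,
    "SELECT COUNT(*) AS n FROM team5_cohort_john"
)$n
rows_match <- saved_rows == nrow(toy_cohort)

dbDisconnect(demo_con)

table_exists
rows_match
```

On the workshop connection, the analogous check would run the same kind of `COUNT(*)` query through `con` against your own `dbo.`-prefixed table name and compare the result with `nrow(checkpoint_cohort)`.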