<center><br><br>
    Arkansas Work-Based Learning to Workforce Outcomes <br>
    Applied Data Analytics Training | Spring 2022
    <h1> Characterizing Demand: Unsupervised Machine Learning Checkpoints </h1>
    <span style="font-size: 1.5em;">
        <a href='https://www.coleridgeinitiative.org'>Coleridge Initiative</a>
    </span>
    <center> Joshua Edelmann, Rukhshan Arif Mian, Benjamin Feder</center>
</center>

***

# Introduction

The purpose of this checkpoint notebook is to apply the unsupervised machine learning methods we used in `04_Characterizing_Demand_Advanced.ipynb` to your cohort. 

In the checkpoint notebook for `02_Creating_a_cohort.ipynb`, we asked you to create and save your cohort as a SQL table. You will use that cohort in this checkpoint notebook. 

At each checkpoint, replace the `___` with the appropriate variable, function, or R code snippet. 

You are encouraged to attempt the checkpoints on your own. If you get stuck, hints and suggested solutions are available through the following code:

Hints: `check_#.hint()`

Solutions: `check_#.solution()` – your solutions may vary based on how you define your cohort. We have shared our suggested solutions.

In both cases, # refers to the checkpoint number. For example: we can access the hint and solution for Checkpoint 2 using: `check_2.hint()` and `check_2.solution()`, respectively. 

> Note: This checkpoint notebook has been created by keeping a cohort of apprenticeship graduates in mind. We encourage you to reach out to your team facilitator to learn more about the methods you can use to Characterize Demand if your cohort is defined differently, such as by apprenticeship starters. Also, the code for accessing hints and solutions is currently commented out – in order for the cells to run, you will need to uncomment them first. 

In [None]:
options(warn=-1)
# Database interaction imports
suppressMessages(library(odbc))

# For data manipulation/visualization
suppressMessages(library(tidyverse))

# For faster date conversions
suppressMessages(library(lubridate))

# Use percent() function
suppressMessages(library(scales))

suppressMessages(library(zoo))

# clustering
suppressMessages(library(cluster))
options(warn=0)

# set seed to ensure work is reproducible because k-means has random starting points
set.seed(1)

# ignore scientific notation
options(scipen=999)

source('04_Characterizing_Demand_Advanced_checkpoints_hints_solutions.txt')

In [None]:
# Connect to the server
con <- DBI::dbConnect(odbc::odbc(),
                     Driver = "SQL Server",
                     Server = "msssql01.c7bdq4o2yhxo.us-gov-west-1.rds.amazonaws.com",
                     Trusted_Connection = "True")

In [None]:
# Code adjusting overall graph attributes
# For easier reading, increase base font size
theme_set(theme_gray(base_size = 16))
# Adjust repr.plot.width and repr.plot.height to change the size of graphs
options(repr.plot.width = 12, repr.plot.height = 8)

## Checkpoint 1: Reading in the year of your interest

For this checkpoint, update the code below with the year prior to the one selected for your cohort. If your cohort contains individuals from multiple years, focus on only one year's worth of data for this notebook. We recommend the year prior to your cohort selection because, assuming a cohort defined by completion, it ideally reflects the most recent employer information available to cohort members before they complete their apprenticeship. For example, if the cohort you created in the checkpoint notebook for `02_Creating_a_cohort.ipynb` consists of apprenticeship completers in 2016, update the code below with 2015.

In [None]:
# read aggregated employer data for a specific year
query <- "select *
from tr_ar_2022.dbo.employer_yearly_agg
where year = ___
"

emp <- dbGetQuery(con, query)

In [None]:
# hint
# check_1.hint()

In [None]:
# suggested solution
# check_1.solution()

In [None]:
# create ratio of full quarter employees variable
emp <- emp %>% 
    mutate(ratio_full_total = avg_full_num_employed/avg_num_employed)


# see employers
head(emp)

In [None]:
# do.call constructs and executes a function call from a name or a function and a list of arguments to be passed to it.
# we call the order function on all of emp's columns
emp <- emp[do.call(order, emp), ]
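
To make the `do.call(order, ...)` idiom above more concrete, here is a minimal sketch on a toy data frame. The `toy` values are hypothetical and are not drawn from the employer table. Passing a data frame to `do.call(order, ...)` sorts by the first column and breaks ties with the following columns.

```r
# Toy data frame (hypothetical values, not from the employer table)
toy <- data.frame(
    a = c(2, 1, 2, 1),
    b = c(10, 30, 5, 20)
)

# do.call(order, toy) behaves like order(toy$a, toy$b):
# sort by column a, then break ties using column b
idx <- do.call(order, toy)
idx  # 4 2 3 1

# reorder the rows using the resulting index
toy_sorted <- toy[idx, ]
toy_sorted
```

Sorting on every column gives the rows a deterministic order, which may help keep later steps reproducible from run to run.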

In [None]:
# Remove features without explanatory power
emp_ml <- emp %>%
    select(-c(federal_ein, year, NAICS_National_Industry_ID, two_digit_naics ))

### Examine Scales Across Variables

Use the `str` function to see if there are any categorical variables that remain in your data frame (now called `emp_ml`).

In [None]:
# Check data type of all variables - make sure all of them are numeric
str(emp_ml)

Convert all variables to a **numeric** type and scale them.

In [None]:
# convert all variables to numeric type; otherwise integer64 columns won't scale
emp_ml_num <- emp_ml %>%
    sapply(as.numeric)

In [None]:
# Scale the features since variables like avg_emp_rate are much smaller than avg_total_earnings
emp_ml_scale <- scale(emp_ml_num)

# View first rows after scaling
emp_ml_scale %>% 
   head()
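
As a small illustration of what `scale` does, the sketch below standardizes a toy matrix whose two columns sit on very different scales. The column names and values are hypothetical. After scaling, each column has mean 0 and standard deviation 1, so neither dominates the distance calculations in k-means.

```r
# Toy matrix with columns on very different scales (hypothetical values)
toy <- cbind(
    rate = c(0.1, 0.2, 0.3, 0.4),
    earnings = c(10000, 20000, 30000, 40000)
)

# scale() subtracts each column's mean and divides by its standard deviation
toy_scaled <- scale(toy)

# each column now has mean (approximately) 0 and standard deviation 1
colMeans(toy_scaled)
apply(toy_scaled, 2, sd)
```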

### Analyze Missingness

The clustering below requires complete rows, so any employer with missing information in any column is dropped first.

> Note: avoid removing data whenever possible. In a real-world setting, you would likely fill in missing data using an imputation method or a baseline assumption.

In [None]:
# Check number of rows (where each row is a unique employer/year combination)
nrow(emp_ml_scale)

In [None]:
# na.omit will remove any rows with any NA values
emp_ml_scale <- na.omit(emp_ml_scale)

In [None]:
# Check number of rows after dropping rows with any NA values
nrow(emp_ml_scale)
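
The sketch below contrasts dropping incomplete rows with a simple median imputation on a toy matrix. The imputation step is an illustrative assumption, not the method used in this notebook, and the `toy` values are hypothetical.

```r
# Toy matrix with missing values (hypothetical values)
toy <- cbind(
    x = c(1, 2, NA, 4),
    y = c(10, NA, 30, 40)
)

# Option 1: drop every row that has at least one NA (what this notebook does)
dropped <- na.omit(toy)
nrow(dropped)  # 2

# Option 2 (assumption, for illustration only): replace each missing value
# with the median of the observed values in the same column
imputed <- apply(toy, 2, function(col) {
    col[is.na(col)] <- median(col, na.rm = TRUE)
    col
})
nrow(imputed)  # 4
```

Dropping rows halves this toy data set, while imputation keeps all four rows at the cost of introducing assumed values.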

### Elbow Method

Use the elbow method to choose an appropriate number of clusters, *k*.

In [None]:
# function to compute total within-cluster sum of squares
# we can run this for multiple values of k – showcased later in this notebook
wss <- function(k) {
    kmeans(emp_ml_scale, centers=k, nstart=20)$tot.withinss
}


Utilize `map_dbl` to run the `wss` function for each value of k. 

> Note: This code may take a few minutes to run and may produce some warnings. Refer to `04_Characterizing_Demand_Advanced.ipynb` for more information.

In [None]:
# compute wss for k = 1 to k = 15
k.values <- 1:15

# extract wss values for each k
wss_values <- map_dbl(k.values, wss)
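
To see what an elbow looks like when the answer is known, here is a sketch on simulated data with three well-separated groups. The data, the `toy_wss` helper, and the range of *k* are all hypothetical choices for illustration. The total within-cluster sum of squares should fall sharply up to k = 3 and flatten afterwards.

```r
library(purrr)

set.seed(1)

# Three well-separated groups of 50 points each in two dimensions (simulated)
toy <- rbind(
    matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
    matrix(rnorm(100, mean = 5, sd = 0.3), ncol = 2),
    matrix(rnorm(100, mean = 10, sd = 0.3), ncol = 2)
)

# total within-cluster sum of squares for a given k
toy_wss <- function(k) {
    kmeans(toy, centers = k, nstart = 20)$tot.withinss
}

# run the helper for k = 1 to k = 6
toy_wss_values <- map_dbl(1:6, toy_wss)

# large drops up to k = 3, then only small improvements
round(toy_wss_values, 1)
```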


Once you have `wss_values`, you can plot these using the code below:

In [None]:
wss_df <- data.frame(wss_values, k.values)

In [None]:
# plotting wss_df
wss_df %>%
    ggplot(aes(x=k.values, y=wss_values)) + 
    geom_line() + 
    geom_point()

## Checkpoint 2: Choosing *k*
Based on the plot above, choose an appropriate value for *k*. Try choosing a number around the inflection point (the "elbow"), beyond which the decrease in the total within-cluster sum of squares becomes negligible. Store this value by filling in the blank below with your choice of *k*.

In [None]:
k <- ___

In [None]:
# hint
# check_2.hint()

In [None]:
# solution
# check_2.solution()

### Try Model

Now that you have chosen **k** using the elbow method, initialize the kmeans model on the scaled employer measures using this value. 

In [None]:
# Initialize the model and run on emp_ml_scale with centers = k
set.seed(2)
k_means <- kmeans(emp_ml_scale, centers = k, nstart = 20)

The `kmeans` function returns an object with the following components:

In [None]:
names(k_means)

Check the size of each cluster by using the code below:

In [None]:
# see size of cluster
k_means$size
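
The sketch below fits `kmeans` to simulated data with two obvious groups and inspects a few of the returned components. The data and the `toy_km` name are hypothetical. The `cluster` component holds one label per row, `size` counts the rows in each cluster, and `centers` has one row per cluster.

```r
set.seed(2)

# Two well-separated groups: 20 rows near 0 and 30 rows near 3 (simulated)
toy <- rbind(
    matrix(rnorm(40, mean = 0, sd = 0.2), ncol = 2),
    matrix(rnorm(60, mean = 3, sd = 0.2), ncol = 2)
)

toy_km <- kmeans(toy, centers = 2, nstart = 20)

# one cluster label per row of the input
length(toy_km$cluster)

# number of rows assigned to each cluster (cluster numbering is arbitrary)
sort(toy_km$size)

# cluster centers: k rows, one column per feature
dim(toy_km$centers)
```

The cluster numbers themselves carry no meaning, so the same grouping can come back labeled differently under a different seed.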

### Describe Features across Clusters

Drop the `NAICS_National_Industry_ID` and `two_digit_naics` columns.

In [None]:
emp_few_cols <- emp %>%
    select(-c(NAICS_National_Industry_ID, two_digit_naics))

Remove missing values and create a variable `k4.cluster` that allows you to identify the cluster each employer falls in. 

In [None]:
# remove rows with missing values (this should leave the same rows that were clustered)
emp_few_cols <- na.omit(emp_few_cols) 

# add cluster number to the original dataframe
frame_4 <- emp_few_cols %>% 
    mutate(k4.cluster = k_means$cluster)  


head(frame_4)

### Summarizing Clusters

Summarize the clusters based on the provided employer measures. 

#### Mean

Use the mean as the summary statistic of interest to summarize your clusters. 

In [None]:
# remove federal_ein and year (the NAICS columns were already dropped)
frame_4_few_cols <- frame_4 %>%
    select(-c(federal_ein, year))

# summarize and add in sizes of each cluster
frame_4_few_cols %>%
    group_by(k4.cluster) %>%
    # getting averages for each cluster
    # add suffix "by_employer" to each summarize variable
    summarise(across(everything(), # adds the suffix across every column in our dataframe
                     list(by_employer=mean))) %>%
    mutate(
        size = k_means$size
    ) %>%
    # relocates the size column after the k4.cluster column
    relocate(size, .after=k4.cluster)
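
The `across(everything(), list(by_employer = mean))` pattern above can be seen in isolation in this sketch, which uses a hypothetical toy data frame. Supplying a named list of functions makes `across` append the list name to each summarized column.

```r
library(dplyr)

# Toy data frame with a cluster label and two measures (hypothetical values)
toy <- data.frame(
    cluster = c(1, 1, 2, 2),
    wages = c(100, 300, 50, 70),
    hires = c(2, 4, 1, 3)
)

toy_means <- toy %>%
    group_by(cluster) %>%
    # the grouping column is excluded from everything() automatically
    summarise(across(everything(), list(by_employer = mean)))

# "cluster" "wages_by_employer" "hires_by_employer"
names(toy_means)
toy_means
```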

#### Standard Deviation
Use the standard deviation as the summary statistic of interest to summarize your clusters. 

In [None]:
# summarize and add in sizes of each cluster
frame_4_few_cols %>%
    group_by(k4.cluster) %>%
    # getting standard deviations for each cluster
    # add suffix "by_employer" to each summarize variable
    summarise(across(everything(), # adds the suffix across every column in our dataframe
                     list(by_employer=sd))) %>%
    mutate(
        size = k_means$size
    ) %>%
    # relocates the size column after the k4.cluster column
    relocate(size, .after=k4.cluster)

## Checkpoint 3: Linking to cohort 

Update the code below to read in your cohort joined with Arkansas' wage records using the tables from the dimensional model. There are two blanks to fill in: the first with the name of your cohort table and the second with the year used to restrict your cohort (see the `--RESTRICT COHORT YEAR` comment in the query).

In [None]:
# using dimensional model to get primary employer information
qry <- "
SELECT
F.Quarter_ID - P.Apprenticeship_End_Quarter_ID AS Quarters_Relative_to_Completion,
P.Person_ID,
F.Primary_Employer_Wages,
PE.Federal_EIN
FROM 
tr_ar_2022.dbo.___ C --COHORT
JOIN tr_ar_2022.dbo.AR_MDIM_Person P ON (P.Apprentice_Number=C.apprnumber) --PERSON
JOIN tr_ar_2022.dbo.AR_FACT_Quarterly_Observation F --QUARTERLY OBSERVATION FACT
	ON (P.Person_ID=F.Person_ID) 
	AND (F.Quarter_ID BETWEEN (P.Apprenticeship_End_Quarter_ID) AND (P.Apprenticeship_End_Quarter_ID+4))  --QTRS POST COMPLETION
JOIN tr_ar_2022.dbo.AR_RDIM_NAICS_National_Industry NNI ON (P.Apprenticeship_NAICS_National_Industry_ID=NNI.NAICS_National_Industry_ID) --APPRENTICESHIP INDUSTRY
JOIN tr_ar_2022.dbo.AR_MDIM_Employer PE ON (PE.Employer_ID=F.Primary_Employer_ID)  --PRIMARY EMPLOYER
WHERE P.Apprenticeship_Completer='Y' and YEAR(C.exitwagedt) = ___ --RESTRICT COHORT YEAR
"

cohort_wages_empr <- dbGetQuery(con, qry)

head(cohort_wages_empr)

In [None]:
# hint
# check_3.hint()

In [None]:
# solution
# check_3.solution()

Link the **cohort_wages_empr** dataframe with **emp** to identify which clusters the primary employers of your cohort members fall into. 

In [None]:
# add cluster number to the original dataframe
# (assumes no rows were dropped by na.omit earlier, so the lengths match)
frame_4 <- emp %>%                     
    mutate(k4.cluster = k_means$cluster)  

# Join wages table with frame_4 clustering results
cohort_wages_empr_clus <- cohort_wages_empr %>%
    inner_join(frame_4, by=c('Federal_EIN' = 'federal_ein'))

head(cohort_wages_empr_clus)
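
If `na.omit` dropped any rows before clustering, `k_means$cluster` has fewer entries than `emp` has rows, and attaching the labels with `mutate` would fail. One possible way to keep labels aligned is sketched below on a hypothetical toy data frame: the `na.action` attribute that `na.omit` attaches records which rows were removed. This is an illustrative assumption, not the approach taken in this notebook.

```r
# Toy data with an identifier and one feature containing an NA (hypothetical)
toy <- data.frame(
    id = c("a", "b", "c", "d", "e"),
    x = c(1, 1.1, NA, 9, 9.2)
)

# scale the feature, then drop incomplete rows
toy_scaled <- na.omit(scale(as.matrix(toy["x"])))

# na.omit records the positions of the rows it removed
dropped_rows <- attr(toy_scaled, "na.action")

set.seed(2)
toy_km <- kmeans(toy_scaled, centers = 2, nstart = 5)

# keep only the rows that were actually clustered, then attach the labels
keep <- setdiff(seq_len(nrow(toy)), dropped_rows)
toy_labeled <- toy[keep, ]
toy_labeled$cluster <- toy_km$cluster

toy_labeled
```

Because the labels are matched by row position, this only works when the clustering matrix was built from the data frame without reordering rows in between.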

#### Number of Employers by cluster

In [None]:
# see number of employers by cluster that primarily employed someone in the cohort
cohort_wages_empr_clus %>%
    group_by(k4.cluster) %>%
    summarise(emp_cohort = n_distinct(Federal_EIN))

#### Number of Primary Employers by cluster and quarter

In [None]:
# see number of employers by cluster and quarter that primarily employed someone in the cohort
cohort_wages_empr_clus %>%
    group_by(k4.cluster, Quarters_Relative_to_Completion) %>%
    summarise(emp_cohort = n_distinct(Federal_EIN))

#### Comparing within cohort primary employers to all employers

In [None]:
# compare within cohort primary employers to all employers in original clusters
# Get number of unique employers per cluster in the full dataframe (all employers)
frame_4 %>%
    group_by(k4.cluster) %>%
    summarise(emp_all = n_distinct(federal_ein))

In [None]:
# compare with percentages
cohort_emp <- cohort_wages_empr_clus %>%
    group_by(k4.cluster) %>%
    summarise(emp_cohort = n_distinct(Federal_EIN))

emp_all <- frame_4 %>%
    group_by(k4.cluster) %>%
    summarise(emp_all = n_distinct(federal_ein))

# Join cohort primary employers with all employers, and find percentage
cohort_emp %>%
    inner_join(emp_all, by = 'k4.cluster') %>%
    mutate(percentage = (emp_cohort / emp_all) * 100)
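
One thing to keep in mind with `inner_join` is that a cluster with no cohort employers disappears from the result. The sketch below, using hypothetical toy counts, shows an alternative that starts from all clusters with `left_join` and fills the missing counts with zero. This is an optional variation, not the notebook's method.

```r
library(dplyr)

# Employers per cluster overall, and per cluster within the cohort (hypothetical)
toy_all <- data.frame(cluster = 1:3, emp_all = c(100L, 50L, 10L))
toy_cohort <- data.frame(cluster = c(1L, 3L), emp_cohort = c(5L, 2L))

toy_pct <- toy_all %>%
    # keep every cluster, even those with no cohort employers
    left_join(toy_cohort, by = "cluster") %>%
    mutate(
        # clusters missing from the cohort table get a count of zero
        emp_cohort = coalesce(emp_cohort, 0L),
        percentage = emp_cohort / emp_all * 100
    )

toy_pct
```

Here cluster 2 stays in the table with a percentage of 0 rather than being silently removed.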

To keep this notebook short, we will stop the analysis here. However, we encourage you to continue onwards and analyze the employment outcomes for your cohort by the cluster of their primary employer.