# **Checkpoints**
## **Part 2: Prediction Models and Model Evaluation**

This notebook provides checkpoints for you to practice on. Its structure mirrors that of the second ML notebook, but we removed the text between code cells so that you can focus on the code. **For detailed explanations of the analysis steps and code, refer back to the [machine learning prediction notebook](./4.Machine_Learning_Prediction.ipynb).**

## **1. Load the Data**

In [None]:
# Database interaction imports
library(odbc)

# For data manipulation/visualization
library(tidyverse)

# For faster date conversions
library(lubridate)

# Classification and Regression Training package, for streamlining the creation of predictive models
library(caret)

# For fitting the Decision Tree model and plotting the tree
library(rpart)
library(rpart.plot)

# For Random Forest model
library(randomForest)

# For formatting numbers as percentages (label_percent)
library(scales)

In [None]:
# Connect to the database
con <- DBI::dbConnect(odbc::odbc(),
                     Driver = "SQL Server",
                     Server = "msssql01.c7bdq4o2yhxo.us-gov-west-1.rds.amazonaws.com",
                     Database = "tr_dol_eta",
                     Trusted_Connection = "True")

Let's import the data and check its top records. <font color=red> Note that you need to change the directory in the `read.csv()` statement below: replace `..` in the file path with your username.</font>

> Note that the csv file name is `reg_ml_data.csv`.

In [None]:
#Import the data that we have cleaned in the first ML notebook
df_ml <- read.csv("U:\\..\\ETA Training\\Data\\reg_ml_data.csv")

#See the top records of the DataFrame
head(df_ml)

In [None]:
#Convert categorical variables into factor type
df_ml$gender <- factor(df_ml$gender)
df_ml$race <- factor(df_ml$race)
df_ml$ethnicity <- factor(df_ml$ethnicity)
df_ml$disability <- factor(df_ml$disability)
df_ml$education <- factor(df_ml$education)
df_ml$naics_maj_code_rv <- factor(df_ml$naics_maj_code_rv)
df_ml$occupation_code <- factor(df_ml$occupation_code)

#### **Checkpoint 1: Split the Training Set and the Testing Set**

In this project, we will train ML models on cohort 1, claimants who entered the UI program during the week ending March 28th and the week ending April 4th. Then, we will validate the models on cohort 2, claimants who entered the UI program during the week ending June 27th and the week ending July 4th. Based on this information, split the data into a training set and a testing set.

In [None]:
# The training set is the COVID-19 cohort (cohort 1)
# Replace '___' with the condition you can use to limit the data to the training set
df_training <- df_ml %>% 
    filter(___) %>%
    select(-c(ssn_id,cohort,byr_start_week)) # remove identifiers, since we do not need them in the ML model

# The testing set is the cohort of claimants who entered 13 weeks later (cohort 2)
# Replace '___' with the condition you can use to limit the data to the testing set
df_testing <- df_ml %>% 
    filter(___) %>%
    select(-c(ssn_id,cohort,byr_start_week)) # remove identifiers, since we do not need them in the ML model
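
The general pattern of splitting by a group indicator and then dropping identifier columns can be sketched on toy data. This is a minimal illustration, not the checkpoint solution: the data frame `toy` and its columns `id`, `wave`, and `age` are hypothetical, and the sketch uses base R only so that it runs without extra packages.

```r
# Hypothetical toy data; `wave` stands in for a group indicator
toy <- data.frame(
    id    = 1:6,
    wave  = c("early", "early", "early", "late", "late", "late"),
    label = c(1, 0, 1, 0, 0, 1),
    age   = c(25, 40, 31, 52, 28, 45)
)

# Columns that identify rows rather than describe them
id_cols <- c("id", "wave")

# Split by wave and drop the identifier columns
toy_training <- toy[toy$wave == "early", setdiff(names(toy), id_cols)]
toy_testing  <- toy[toy$wave == "late",  setdiff(names(toy), id_cols)]

# Sanity checks: every row lands in exactly one set, and both sets share the same columns
stopifnot(nrow(toy_training) + nrow(toy_testing) == nrow(toy))
stopifnot(identical(names(toy_training), names(toy_testing)))

# Share of positive labels in each set
mean(toy_training$label == 1)
mean(toy_testing$label == 1)
```

Checking row counts and the share of positive labels after a split is a quick way to catch a filter condition that matches nothing.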

## **2. Create Functions**

You just need to run the next three blocks of code so that they will be ready for you to use in the model evaluation.

In [None]:
#Create a function, which returns a DataFrame containing precision and recall at K% of the population
precision_recall <- function(test_data, label, pscore) {
    
    # Get the actual label and predicted score from the data passed to the function
    df_temp <- test_data[, c(label, pscore)] 
    
    #Calculate Precision and Recall at K
    df_temp <- df_temp %>%
        arrange(desc(!!sym(pscore))) %>% # Sort the rows descendingly based on the predicted score
        mutate(rank = row_number()) %>% # Add rank to each row
        mutate(recall = cumsum(.data[[label]] == 1)/sum(.data[[label]] == 1),  # Calculate Recall at K
               precision = cumsum(.data[[label]] == 1)/rank, # Calculate precision at k
               k = rank/(nrow(test_data))) # Percent of population
    
    return(df_temp)
}
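
To make the arithmetic behind precision and recall at K concrete, here is a minimal worked sketch in base R. The vectors `labels` and `scores` are made-up toy values, not output from any model in this notebook.

```r
# Hypothetical toy data: true labels and model scores for 10 people
labels <- c(1, 0, 1, 1, 0, 0, 1, 0, 0, 0)
scores <- c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05)

# Sort the labels from the highest score to the lowest
sorted_labels <- labels[order(scores, decreasing = TRUE)]
rank <- seq_along(sorted_labels)

# Precision at each rank: share of the top-ranked people who are actual positives
precision <- cumsum(sorted_labels == 1) / rank

# Recall at each rank: share of all actual positives found among the top-ranked people
recall <- cumsum(sorted_labels == 1) / sum(sorted_labels == 1)

# Percent of the population covered at each rank
toy_pr <- data.frame(k = rank / length(sorted_labels), precision, recall)

# Row 3 corresponds to the top 30% of the population
toy_pr[3, ]
```

Among the top three scores, two people are actual positives, so precision there is 2/3; those two are half of the four positives overall, so recall is 0.5.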

In [None]:
#Create a function, which returns precision at K
precision_at_k <- function(k=0.1, test_data, label, pscore) {
    
    # Get the Precision-Recall at K DataFrame
    df <- precision_recall(test_data, label, pscore)
    
    # Assign a few parameters
    pct_pop <- k  # Percent of population the resource can cover
    test_pop <- nrow(test_data) # Total number of people in the testing data
    pop_at_k <- as.integer(pct_pop * test_pop) # At K percent of the population, how many people the resource can cover
    
    # Get precision at K% from the Precision-Recall DataFrame
    prec_at_k <- df$precision[pop_at_k]
    
    return(prec_at_k)
}

In [None]:
# Create a function, which returns a DataFrame containing measures for baseline models 
# Baseline model 1: Randomly assign the label (0 or 1)
# Baseline model 2: Guess everyone is a slow exiter

compare_w_baseline <- function(k=0.1, model, test_data, label, pscore) {
    # Set a seed so we get consistent results
    set.seed(42)
    
    # Assign a few parameters
    pct_pop <- k  # Percent of population the resource can cover
    test_pop <- nrow(test_data) # Total number of people in the testing data
    pop_at_k <- as.integer(pct_pop * test_pop) # At K percent of the population, how many people the resource can cover
    
    # Get the Precision-Recall at K DataFrame
    df <- precision_recall(test_data, label, pscore)
    
    # Generate Precision-Recall at K for baseline model 1
    df_random <- df %>%
        mutate(random_score = runif(nrow(df))) %>% # Generate a column of random scores
        arrange(desc(random_score)) %>% # Sort the data by the random scores
        mutate(random_rank = row_number()) %>% # Add rank to each row
        mutate(random_recall = cumsum(.data[[label]]==1)/sum(.data[[label]]==1), # Calculate Recall at K
           random_precision = cumsum(.data[[label]]==1)/random_rank, # Calculate Precision at K
           random_k = random_rank/(nrow(df)))
    
    # Precision at K of the model
    model_precision_at_k <- precision_at_k(k, test_data, label, pscore)
    
    # Precision at K of baseline model 1
    random_precision_at_k <- df_random$random_precision[pop_at_k]
    
    # Precision at K of baseline model 2 (everyone is predicted positive, so precision equals the share of actual positives)
    allstay_precision_at_k <- sum(test_data[[label]] == 1)/test_pop
    
    # Create a DataFrame which shows all measures
    df_compare_prec <- data.frame("model" = c(model, "Random", "All Slow Exiters"),
                              "precision" = c(model_precision_at_k, random_precision_at_k, allstay_precision_at_k))
    
    return(df_compare_prec)
}
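
The logic of the two baselines can be sketched on simulated data. This is an illustration under stated assumptions: the labels are drawn at random with a made-up 30% positive rate, and nothing here comes from the notebook's data.

```r
set.seed(42)

# Hypothetical labels: 1000 people, roughly 30% positives
labels <- rbinom(1000, 1, 0.3)
base_rate <- mean(labels == 1)

# Baseline 2: predict that everyone is positive.
# Everyone is selected, so precision is the share of actual positives.
all_positive_precision <- sum(labels == 1) / length(labels)

# Baseline 1: rank people by a random score and select the top 10%
random_scores <- runif(length(labels))
top_n <- as.integer(0.1 * length(labels))
top_idx <- order(random_scores, decreasing = TRUE)[seq_len(top_n)]
random_precision <- mean(labels[top_idx] == 1)

data.frame(model = c("Random", "All Positive"),
           precision = c(random_precision, all_positive_precision))
```

Because random scores carry no information about the label, the random baseline's precision fluctuates around the base rate. A useful model should clear both baselines at the chosen K.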

## **3. Logistic Regression**

In [None]:
# Run the logit regression with the training dataset
lr_model <- glm(label ~ ., family = binomial(link = 'logit'), data = df_training)

#Show the coefficient estimates and their significance
summary(lr_model)
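
As a small, self-contained illustration of what `glm()` with a logit link returns, the sketch below fits a logistic regression on R's built-in `mtcars` data. The choice of dataset and of the formula `am ~ wt` is an assumption made only for demonstration.

```r
# Fit a logistic regression on a built-in dataset (illustrative choice of formula)
fit <- glm(am ~ wt, family = binomial(link = "logit"), data = mtcars)

# type = "link" returns predictions on the log-odds scale
log_odds <- predict(fit, mtcars, type = "link")

# type = "response" returns predicted probabilities between 0 and 1
prob <- predict(fit, mtcars, type = "response")

# The two are connected by the logistic function
stopifnot(isTRUE(all.equal(plogis(log_odds), prob)))

head(data.frame(log_odds, prob))
```

The predicted probabilities from `type = "response"` are the scores that get thresholded or ranked in the evaluation steps.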

#### **Checkpoint 2: Evaluate the Logistic Regression**

1. In the confusion matrix section, change the value of the threshold to see how the logistic regression's accuracy changes. Which threshold will you use eventually and why?

2. In the precision at K section, assign a value to `k_pct`. Why do you choose this value? What's the interpretation of the precision you get?

3. In the compare our model with baselines section, assign a value to k. How well does your model perform compared to the baseline models?

4. Comment on the precision-recall curve.

#### **Accuracy**

In [None]:
# Predict the slow exiters with the coefficients of our model and the testing set
df_testing$predict_score <- predict(lr_model, df_testing, type = 'response')

# Set a threshold for the predicted score
# Assume people whose predicted score is greater than ___ will be slow exiters
# Replace '___' with the threshold of your choice
threshold <- ___

# Add the predicted outcome to a new column in df_testing
# If the predicted score is greater than the threshold, then predict label = 1; otherwise, predict label = 0.
df_testing$predict_label <- ifelse(df_testing$predict_score > threshold, 1, 0) 

#Confusion matrix
confusionMatrix(factor(df_testing$predict_label), factor(df_testing$label))
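
To see how the threshold drives accuracy, the following sketch sweeps a few thresholds over toy data in base R. The labels, scores, and candidate thresholds are all made up for illustration.

```r
# Hypothetical toy data: true labels and predicted scores
labels <- c(1, 0, 1, 1, 0, 0, 1, 0, 0, 0)
scores <- c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05)

# Candidate thresholds to compare
thresholds <- c(0.25, 0.5, 0.75)

# For each threshold, predict 1 when the score exceeds it, then measure accuracy
accuracy <- sapply(thresholds, function(t) {
    predicted <- ifelse(scores > t, 1, 0)
    mean(predicted == labels)
})

data.frame(threshold = thresholds, accuracy = accuracy)
```

A low threshold flags more people and produces more false positives, while a high threshold misses more actual positives, so accuracy typically peaks somewhere in between.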

#### **Precision at K**

In [None]:
# Check precision at K%
# Replace '___' with the value of your choice. Note that the value should be greater than 0 and less than 1.
k_pct <- ___ # Here, we are checking precision at ___% of the population, change the value to check precision at different K

lr_prec_at_k <- precision_at_k(k_pct, df_testing, "label", "predict_score")

print(paste0("In the Logistic Regression Model, precision at ", label_percent()(k_pct), 
             " of the population is: ", round(lr_prec_at_k,5)))

#### **Compare our model with baselines**

In [None]:
# Compare precision at K% of the population with baseline models
# Replace '___' with a value of your choice. It should be greater than 0 and less than 1.
df_compare <- compare_w_baseline(k=___, "Logistic Regression", df_testing, "label", "predict_score")

df_compare

In [None]:
# Use a bar plot to show the comparison of the Logistic Regression model and the baselines (random guess or guess everyone stays)
# Replace '___' in the labs() layer title with the k of your choice

# For easier reading, increase base font size
theme_set(theme_gray(base_size = 16))
# Adjust repr.plot.width and repr.plot.height to change the size of graphs
options(repr.plot.width = 10, repr.plot.height = 5)

#Specify source dataset and x and y variables
lr_baseline_plot <- ggplot(df_compare, aes(x = model, y = precision)) + 
    geom_col() + #Plots bars on the graph
    geom_text(size = 5, vjust = -0.5, aes(label = format(precision, digits = 3))) + # Show values on top of the bar
    scale_y_continuous(breaks = seq(0, 1, 0.2), limits = c(0, 1)) + #Adjust the y scale to set the interval for tick marks
    labs(title = "Precision at ___% against the baseline, Logistic Regression", # Add graph title
         x = " ", y = 'Precision at ___%') + 
    theme(axis.title.x = element_text(face="bold"), #Adjust the style of X-axis label
          axis.title.y = element_text(face="bold"), #Adjust the style of Y-axis label
          axis.text.x = element_text(face = "bold", size = 16),
          plot.title = element_text(hjust = 0.5))  #Center the graph title

print(lr_baseline_plot)

#### **Precision-Recall Curve**

In [None]:
# Get Precision-Recall at K DataFrame
df_measure_at_k <- precision_recall(df_testing, "label", "predict_score")

# See the top records of the DataFrame
head(df_measure_at_k)

In [None]:
# For easier reading, increase base font size
theme_set(theme_gray(base_size = 16))
# Adjust repr.plot.width and repr.plot.height to change the size of graphs
options(repr.plot.width = 10, repr.plot.height = 5)

#Create the precision-recall curve
lr_pr_curve <- ggplot(df_measure_at_k, aes(x=k)) + # Plot percent of population (k) on the x-axis
    geom_line(aes(y=precision), color = 'blue') + # Add the precision curve
    geom_line(aes(y=recall), color = 'red') + # Add the recall curve
    scale_y_continuous(       # We need to create a dual-axis graph, so we need to define two y axes
        name = "Precision",     # Label of the first axis
        sec.axis = sec_axis(~.*1,name="Recall"), # Add a second axis and specify its label
        breaks = seq(0, 1, 0.1)) +  # Adjust the tick mark on Y-axis
    scale_x_continuous(breaks = seq(0, 1, 0.2)) + # Adjust the tick mark on X-axis
    labs(title = "Precision-Recall Curve, Logistic Regression", # Add graph title
         x = "Percent of Population") + # Add X-axis label
    theme(axis.title.x = element_text(face="bold"), #Adjust the style of X-axis label
          axis.title.y.left = element_text(face="bold", color="blue"), #Adjust the styles of the two Y-axes labels
          axis.title.y.right = element_text(face = 'bold', color = 'red'),
          plot.title = element_text(hjust = 0.5))  #Center the graph title

#Display the graph that we just created
print(lr_pr_curve)

## **4. Decision Tree**

In [None]:
# Run the decision tree model
dt_model <- rpart(label ~ ., method = 'class', data = df_training)

# Print results
printcp(dt_model)

In [None]:
# Print the tree
prp(dt_model, # Your decision tree model
    type = 0, # type of trees
    extra = 100, # what information to show in each node
    main = "Decision Tree Model") # Add a title to your decision tree graph

In [None]:
# Get the Decision Tree model predicted score and save it in a column in the testing DataFrame
df_testing$dt_predict_score <- predict(dt_model, df_testing, type = 'prob')[,2]

# Get the Decision Tree model Precision-Recall at K% DataFrame
df_dt_measure_at_k <- precision_recall(df_testing, "label", "dt_predict_score")
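
The `[,2]` indexing relies on the layout of the matrix that `predict()` returns for a classification tree. The sketch below shows that layout using the `kyphosis` dataset that ships with `rpart`; the dataset and formula are chosen purely for illustration.

```r
library(rpart)

# kyphosis ships with rpart; Kyphosis is a factor with levels "absent" and "present"
fit <- rpart(Kyphosis ~ Age + Number + Start, method = "class", data = kyphosis)

# type = "prob" returns one column of class probabilities per class
probs <- predict(fit, kyphosis, type = "prob")
colnames(probs)

# Column 2 is the second class; selecting by name makes the choice explicit
score_by_position <- probs[, 2]
score_by_name <- probs[, "present"]
stopifnot(isTRUE(all.equal(score_by_position, score_by_name)))

# Everyone in the same leaf receives the same probability, so scores have few distinct values
length(unique(score_by_position))
```

With a 0/1 label, the second column would correspond to class 1, provided the classes are ordered as 0 then 1. Selecting the column by name is a slightly more defensive alternative.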

In [None]:
# For easier reading, increase base font size
theme_set(theme_gray(base_size = 16))
# Adjust repr.plot.width and repr.plot.height to change the size of graphs
options(repr.plot.width = 10, repr.plot.height = 5)

#Create the precision-recall curve
dt_pr_curve <- ggplot(df_dt_measure_at_k, aes(x=k)) + 
    geom_line(aes(y=precision), color = 'blue') + # Add the precision curve
    geom_line(aes(y=recall), color = 'red') + # Add the recall curve
    scale_y_continuous(       # We need to create a dual-axis graph, so we need to define two y axes
        name = "Precision",     # Label of the first axis
        sec.axis = sec_axis(~.*1,name="Recall"), # Add a second axis and specify its label
        breaks = seq(0, 1, 0.1)) +  # Adjust the tick mark on Y-axis
    scale_x_continuous(breaks = seq(0, 1, 0.2)) + # Adjust the tick mark on X-axis
    labs(title = "Precision-Recall Curve, Decision Tree", # Add graph title
         x = "Percent of Population") + # Add X-axis label
    theme(axis.title.x = element_text(face="bold"), #Adjust the style of X-axis label
          axis.title.y.left = element_text(face="bold", color="blue"), #Adjust the styles of the two Y-axes labels
          axis.title.y.right = element_text(face = 'bold', color = 'red'),
          plot.title = element_text(hjust = 0.5))  #Center the graph title

#Display the graph that we just created
print(dt_pr_curve)

## **5. Compare Multiple Models**

In [None]:
# The training set is the COVID-19 cohort (cohort 1)
df_training <- df_ml %>% 
    filter(cohort == 'cohort1') %>%
    select(-c(ssn_id,cohort,byr_start_week)) # remove identifiers, since we do not need them in the ML model

# The testing set is the cohort of claimants who entered 13 weeks later (cohort 2)
df_testing <- df_ml %>% 
    filter(cohort == 'cohort2') %>%
    select(-c(ssn_id,cohort,byr_start_week)) # remove identifiers, since we do not need them in the ML model

In [None]:
#Create a list to include all the models we want to compare
#LR: Logistic Regression
#DT: Decision Tree
#RF: Random Forest
model_list <- c("LR", "DT", "RF")

#Define percent of population the resource can cover
k <- 0.1
pct_pop <- k  # Percent of population the resource can cover
test_pop <- nrow(df_testing) # Total number of people in the testing data
pop_at_k <- as.integer(pct_pop * test_pop) # At K percent of the population, how many people the resource can cover

#Define an empty DataFrame to save our results
n <- length(model_list) #Number of rows of the DataFrame
df_compare_models <- data.frame(Model = model_list, accuracy = double(n), 
                                precision_at_k = double(n), recall_at_k = double(n)) # Define the columns of the DataFrame

In [None]:
for (model in model_list) {
    
    # Logistic Regression Model
    if (model=="LR") {
        fit <- glm(label ~ ., family = binomial(link = 'logit'), data = df_training) # Fit the model
        df_testing$predict_score <- predict(fit, df_testing, type = 'response') # Predict scores
    }
    
    #Decision Tree Model
    if (model=="DT") {
        fit <- rpart(label ~ ., method = 'class', data = df_training) # Fit the model
        df_testing$predict_score <- predict(fit, df_testing, type = 'prob')[,2] # Predict scores
    }
    
    #Random Forest Model
    if (model == "RF"){
        fit <- randomForest(label ~ ., data = df_training, type = 'class', ntree = 500, mtry = 6, importance = TRUE) # Fit the model
        df_testing$predict_score <- predict(fit, df_testing) # Predict scores
    }
    
    # Get Precision-Recall DataFrame
    df_prec_rec <- precision_recall(df_testing, "label", "predict_score")
    
    #Calculate accuracy
    threshold <- df_prec_rec$predict_score[pop_at_k] # Get the predicted score at K%
    df_testing$predict_label <- ifelse(df_testing$predict_score > threshold, 1, 0) # Predict the label: 1 if the score is greater than the threshold, otherwise 0
    df_testing <- df_testing %>% mutate(accurate = 1*(label == predict_label)) # Generate an indicator of whether the prediction is correct
    acc <- (sum(df_testing$accurate)/nrow(df_testing)) # Calculate accuracy
    df_compare_models$accuracy <- ifelse(df_compare_models$Model == model,acc,df_compare_models$accuracy) # Save accuracy to the DataFrame
    
    #Calculate precision and save it in the df_compare_models DataFrame
    prec_at_k <- precision_at_k(k, df_testing, "label", "predict_score")
    df_compare_models$precision_at_k <- ifelse(df_compare_models$Model == model, prec_at_k, df_compare_models$precision_at_k)
    
    #Calculate Recall and save it in the df_compare_models DataFrame
    rec_at_k <- df_prec_rec$recall[pop_at_k]
    df_compare_models$recall_at_k <- ifelse(df_compare_models$Model == model, rec_at_k, df_compare_models$recall_at_k)
    
}
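
One detail worth keeping in mind when reading the accuracy column: the loop sets the threshold to the score at rank `pop_at_k` and then flags scores strictly greater than it. When many people share the same score, which is common for tree-based models, fewer than `pop_at_k` people may end up flagged. The sketch below illustrates this with made-up scores; the rank-based alternative shown is one possible workaround, not the notebook's method.

```r
# Hypothetical scores with heavy ties, as a tree with two leaves might produce
scores <- c(0.8, 0.8, 0.8, 0.8, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2)
pop_at_k <- 2  # The resource can cover 2 people

# Threshold taken as the score at rank pop_at_k
threshold <- sort(scores, decreasing = TRUE)[pop_at_k]

# A strict comparison excludes everyone tied at the threshold
n_flagged_strict <- sum(scores > threshold)

# Alternative: flag by rank, breaking ties by order of appearance
n_flagged_rank <- sum(rank(-scores, ties.method = "first") <= pop_at_k)

c(strict = n_flagged_strict, by_rank = n_flagged_rank)
```

Here the strict comparison flags nobody, while the rank-based rule flags exactly `pop_at_k` people. Precision and recall at K are unaffected because they are computed from ranks rather than from the thresholded labels.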

#### **Checkpoint 3: Compare Multiple Models' Results**

Based on the results in the DataFrame `df_compare_models`, which model would you choose to predict slow exiters in this project? Which measure(s) do you use to select the model and why?

In [None]:
df_compare_models
