<center>
<img style="float: center;" src="images/CI_horizontal.png" width="400">
</center>
<center>
    <span style="font-size: 1.5em;">
        <a href='https://www.coleridgeinitiative.org'>Website</a>
    </span>
</center>


<center> Joshua Edelmann, Benjamin Feder, Nathan Barrett </center>
<a href="https://doi.org/10.5281/zenodo.6412954"><img src="https://zenodo.org/badge/DOI/10.5281/zenodo.6412954.svg" alt="DOI"></a>



# **Machine Learning Part 2: Prediction Models and Model Evaluation**

Now that the features and label have been created for supervised machine learning, we can deploy and evaluate various algorithms in trying to predict future employment for a cohort of graduates. Recall that we will build our model on 2017 bachelor's degree recipients in Texas and then test the model on the 2018 cohort.

**If you have not done so, please review the [features creation notebook](./5A.ML_Feature_Creation.ipynb) prior to running any code in this notebook.**

## **1. Read in the Data**

As always, let's load the R packages and establish the database connection first. Note that in this notebook we also load `caret`, `rpart`, `rpart.plot`, and `randomForest`, which provide functions for calculating evaluation metrics and fitting models.

In [None]:
# Database interaction imports
library(odbc)

# For data manipulation/visualization
library(tidyverse)

# For faster date conversions
library(lubridate)

# Classification and regression training package. For streamlining the process for creating predictive models
library(caret)

# For Decision Tree model and to plot the tree
library(rpart)
library(rpart.plot)

# For Random Forest model
library(randomForest)

# For formatting percentages in printed output (label_percent)
library(scales)

In [None]:
# Connect to the database
con <- DBI::dbConnect(odbc::odbc(),
                     Driver = "SQL Server",
                     Server = "msssql01.c7bdq4o2yhxo.us-gov-west-1.rds.amazonaws.com",
                     Trusted_Connection = "True")

In [None]:
# Import the training set that we created in the first ML notebook
# Adding in major name instead of CIP code
query <- "
SELECT nts.gradid, nts.gradtypi, SUBSTRING(cl.CIPTitle2020, 1, 10) gradmaj, nts.instregion, nts.transfer_ind, nts.total_sems, nts.total_hours, nts.employed_count, nts.employed_prop, nts.wage_ind label
FROM tr_tx_2021.dbo.nb_training_set nts 
LEFT JOIN ds_public_1.dbo.cip_lookup cl
ON nts.gradmaj = cl.CIPCode2020
" 

training_set <- dbGetQuery(con, query)

In [None]:
head(training_set)

In [None]:
# Import the testing set that we created in the first ML notebook
query <- "
SELECT nts.gradid, nts.gradtypi, SUBSTRING(cl.CIPTitle2020, 1, 10) gradmaj, nts.instregion, nts.transfer_ind, nts.total_sems, nts.total_hours, nts.employed_count, nts.employed_prop, nts.wage_ind label
FROM tr_tx_2021.dbo.nb_testing_set nts 
LEFT JOIN ds_public_1.dbo.cip_lookup cl
ON nts.gradmaj = cl.CIPCode2020
" 

testing_set <- dbGetQuery(con, query)

In [None]:
head(testing_set)

## **2. Prepare the Data**

Similar to clustering, a dataset must have certain properties for running a prediction model. We will clean our data frames by following 5 steps:

1. Remove overlap
1. Remove non-explanatory features
1. Set variable types
1. Analyze missingness
1. Examine scales across variables

### **Remove Overlap**

When running and assessing prediction algorithms, there should not be any units (individuals, in this case) present in both the training and testing sets, as we want to provide the algorithm with unseen test data.

In [None]:
# see number of individuals before removal
testing_set %>%
    summarize(
        n_distinct(gradid)
    )

# remove all gradid values from testing_set that are in training_set
testing_set <- testing_set %>% 
    anti_join(training_set, by = 'gradid')

# see number of individuals after removal
testing_set %>%
    summarize(
        n_distinct(gradid)
    )
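
As a minimal sketch of what the overlap removal does, the toy example below uses base R's `%in%` operator in place of `anti_join()`. The IDs and the `test_clean` name are hypothetical and exist only for illustration.

```r
# Hypothetical IDs, made up for illustration
train_ids <- c(101, 102, 103, 104)
test <- data.frame(gradid = c(103, 104, 105, 106),
                   label  = c(1, 0, 1, 1))

# Keep only test rows whose ID never appears in the training IDs
test_clean <- test[!(test$gradid %in% train_ids), ]

# No ID is shared once the overlap is removed
length(intersect(test_clean$gradid, train_ids))
test_clean
```

Only the rows whose `gradid` is absent from the training IDs remain, which is the same result `anti_join()` gives when joining on a single key.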

### **Remove non-explanatory features**

Like we did with clustering, we will want to remove any non-explanatory features before running a supervised machine learning model.

In [None]:
# remove gradid
training_set <- training_set %>% select(-gradid)

testing_set <- testing_set %>% select(-gradid)

### **Set Variable Types**

Take a look at the type of each column. Note that the categorical features are still in character type, such as `instregion` and `gradmaj`, or integer type, such as `gradtypi` and `transfer_ind`. We need to convert them to factors so that they can be used as dummy variables in ML models. 

In [None]:
# look at structure of data frame
str(training_set)

In [None]:
# change all character variables to factors
training_set <- training_set %>% mutate_if(is.character, as.factor)
testing_set <- testing_set %>% mutate_if(is.character, as.factor) 

# convert some numeric variables into factor type. We will leave the label as numeric
training_set$gradtypi <- factor(training_set$gradtypi)
training_set$transfer_ind <- factor(training_set$transfer_ind)

testing_set$gradtypi <- factor(testing_set$gradtypi)
testing_set$transfer_ind <- factor(testing_set$transfer_ind)
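
To see why factors matter, the sketch below shows how a model formula expands a factor into 0/1 dummy columns. The `region` values are made up and `toy` is a hypothetical data frame; `model.matrix()` is the base R function that modeling functions rely on for this expansion.

```r
# Hypothetical region values, made up for illustration
toy <- data.frame(region = factor(c("North", "South", "Central", "South")))

# Expand the factor into the 0/1 columns a model formula would use
dummies <- model.matrix(~ region, data = toy)

levels(toy$region)   # the first level is the omitted reference category
colnames(dummies)
dummies
```

The first factor level has no column of its own: a row belonging to it has zeros in every dummy column, which is why one category is always left out of the model output.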


In [None]:
# save data frames as df_training and df_testing b/c will need original data frames later
df_training <- training_set
df_testing <- testing_set

### **Analyze Missingness**

Before we run a machine learning model, we want to make sure that there are no missing values in our data so that all rows are preserved.

In [None]:
# Check if columns have missing values
sapply(df_training, function(x) sum(is.na(x)))
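
The same check can be tried on a small example to see what a nonzero count would look like. The data frame below is made up for illustration; `colSums(is.na(...))` is an equivalent base R way to count missing values per column.

```r
# Toy data with a few missing values, made up for illustration
toy <- data.frame(total_hours = c(120, NA, 135, 128),
                  region      = c("North", "South", NA, "South"),
                  label       = c(1, 0, 1, 1))

# Count of missing values per column
colSums(is.na(toy))

# Rows with no missing values at all; with R's default na.action option
# (na.omit), modeling functions such as glm() silently drop the other rows
complete_rows <- complete.cases(toy)
sum(complete_rows)
toy[complete_rows, ]
```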

### **Examine Scales Across Variables**

Just like we did prior to clustering, we will want to confirm that all of the numerical features are on similar scales to prevent inappropriate overweighting of certain features.

In [None]:
# Get descriptions of each variable using "summary" function
summary(df_training)

In [None]:
# Scale the numeric variables for both sets
df_training <- df_training %>%
    mutate(
        total_sems = scale(total_sems)[,1],
        total_hours = scale(total_hours)[,1],
        employed_count = scale(employed_count)[,1],
        employed_prop = scale(employed_prop)[,1]
    )

df_testing <- df_testing %>%
    mutate(
        total_sems = scale(total_sems)[,1],
        total_hours = scale(total_hours)[,1],
        employed_count = scale(employed_count)[,1],
        employed_prop = scale(employed_prop)[,1]
    )

# see evidence of scaling
head(df_testing)
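
The cell above standardizes each set with its own mean and standard deviation. A common alternative, which this notebook does not use, is to learn the scaling constants from the training data and apply them to the testing data, so that both sets share the same scale. The sketch below illustrates that alternative on made-up numbers; all variable names are hypothetical.

```r
# Toy data standing in for one numeric feature (values are made up)
train_hours <- c(120, 130, 125, 140, 135)
test_hours  <- c(128, 150)

# Learn the centering and scaling constants from the training data only
train_scaled <- scale(train_hours)
train_mean <- attr(train_scaled, "scaled:center")
train_sd   <- attr(train_scaled, "scaled:scale")

# Apply the training constants to the test data
test_scaled <- (test_hours - train_mean) / train_sd

train_mean
train_sd
test_scaled
```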

## **3. Create Functions**

Some code will be used many times in this notebook, such as the code to calculate precision and recall, to get precision at k, and to compare a model's results with baselines. We can wrap this code in functions so that each calculation can be repeated with a single line of code.

The description of these functions is available below.

1. `precision_recall(test_data, label, pscore)`: calculates precision and recall 

    After we run an ML model and use its results to predict outcomes for the testing data, we can use the `precision_recall()` function to calculate precision and recall. **This function returns a data frame which contains precision and recall at various k (0<k<1).** It has three arguments:
    - `test_data`: the data frame of the testing data
    - `label`: the name of the label column. It should be a **string**.
    - `pscore`: the name of the predicted score column. It should be a **string**. 

2. `precision_at_k(k, test_data, label, pscore)`: calculates precision at k

    We can use the `precision_at_k()` function to get the precision at the k% we choose. **This function returns a number.** It has four arguments:
    - `k`: the share of the population we have enough resources to serve (e.g., to help graduates find employment)
    - `test_data`: the data frame of the testing data
    - `label`: the name of the label column. It should be a **string**.
    - `pscore`: the name of the predicted score column. It should be a **string**. 

3. `compare_w_baseline(k, model, test_data, label, pscore)`: compares a model's precision at k with baselines

    This function compares our ML model with two baseline models. In the first baseline model, we rank people by randomly generated scores, which amounts to guessing at random. In the second baseline model, we assign 1 to every person. **This function returns a data frame which includes precision at k (we choose the k) of our ML model and the two baseline models.** It has five arguments: 
    - `k`: the share of the population we have enough resources to serve (e.g., to help graduates find employment)
    - `model`: the name of the ML model. It should be a **string**
    - `test_data`: the data frame of the testing data
    - `label`: the name of the label column. It should be a **string**.
    - `pscore`: the name of the predicted score column. It should be a **string**. 
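
Before defining the functions, it may help to see the underlying calculation on a tiny example. The sketch below uses made-up labels and scores and plain base R. It is an illustration of the idea behind `precision_recall()`, not the function itself.

```r
# Toy labels and predicted scores, made up for illustration
toy <- data.frame(label = c(1, 0, 1, 1, 0),
                  score = c(0.9, 0.8, 0.7, 0.4, 0.2))

# Rank from highest to lowest predicted score
toy <- toy[order(-toy$score), ]
toy$rank <- seq_len(nrow(toy))

# Cumulative positives among the top-ranked rows drive both metrics
toy$precision <- cumsum(toy$label == 1) / toy$rank
toy$recall    <- cumsum(toy$label == 1) / sum(toy$label == 1)
toy$k         <- toy$rank / nrow(toy)

toy
```

Reading across a row gives precision and recall if we act on everyone ranked at or above that row. Precision at k is simply the `precision` value in the row where `k` matches the share of the population we can serve.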

Now we just need to run the next three blocks of code so that they will be ready for us to use in the model evaluation.

In [None]:
# Create a function, which returns a DataFrame containing precision and recall at K% of the population
precision_recall <- function(test_data, label, pscore) {
    
    # Get the actual label and predicted score from the data frame passed in
    df_temp <- test_data[, c(label, pscore)] 
    
    # Calculate Precision and Recall at K
    df_temp <- df_temp %>%
        arrange(desc(!!sym(pscore))) %>% # Sort the rows in descending order of the predicted score
        mutate(rank = row_number()) %>% # Add rank to each row
        mutate(recall = cumsum(.data[[label]] == 1)/sum(.data[[label]] == 1),  # Calculate Recall at K
               precision = cumsum(.data[[label]] == 1)/rank, # Calculate precision at k
               k = rank/(nrow(test_data))) # Percent of population
    
    return(df_temp)
}

In [None]:
# Create a function, which returns precision at K
precision_at_k <- function(k=0.1, test_data, label, pscore) {
    
    # Get the Precision-Recall at K DataFrame
    df <- precision_recall(test_data, label, pscore)
    
    # Assign a few parameters
    pct_pop <- k  # Percent of population the resource can cover
    test_pop <- nrow(test_data) # Total number of people in the testing data
    pop_at_k <- as.integer(pct_pop * test_pop) # At K percent of the population, how many people the resource can cover
    
    # Get precision at K% from the Precision-Recall DataFrame
    prec_at_k <- df$precision[pop_at_k]
    
    return(prec_at_k)
}

In [None]:
# Create a function, which returns a DataFrame containing measures for baseline models 
# Baseline model 1: Randomly assign the label (0 or 1)
# Baseline model 2: Guess everyone is employed

compare_w_baseline <- function(k=0.1, model, test_data, label, pscore) {
    # Set a seed so we get consistent results
    set.seed(42)
    
    # Assign a few parameters
    pct_pop <- k  # Percent of population the resource can cover
    test_pop <- nrow(test_data) # Total number of people in the testing data
    pop_at_k <- as.integer(pct_pop * test_pop) # At K percent of the population, how many people the resource can cover
    
    # Get the Precision-Recall at K DataFrame
    df <- precision_recall(test_data, label, pscore)
    
    # Generate Precision-Recall at K for baseline model 1
    df_random <- df %>%
        mutate(random_score = runif(nrow(df))) %>% # Generate a column of random scores
        arrange(desc(random_score)) %>% # Sort the data by the random scores
        mutate(random_rank = row_number()) %>% # Add rank to each row
        mutate(random_recall = cumsum(.data[[label]] == 1)/sum(.data[[label]] == 1), # Calculate Recall at K
           random_precision = cumsum(.data[[label]] == 1)/random_rank, # Calculate Precision at K
           random_k = random_rank/(nrow(df)))
    
    # Precision at K of the model
    model_precision_at_k <- precision_at_k(k, test_data, label, pscore)
    
    # Precision at K of baseline model 1
    random_precision_at_k <- df_random$random_precision[pop_at_k]
    
    # Precision at K of baseline model 2: the share of positives in the testing data
    allemp_precision_at_k <- sum(test_data[[label]] == 1)/test_pop
    
    # Create a DataFrame which shows all measures
    df_compare_prec <- data.frame("model" = c(model, "Random", "All Employed"),
                              "precision" = c(model_precision_at_k, random_precision_at_k, allemp_precision_at_k))
    
    return(df_compare_prec)
}

## **4. Logistic Regression**

In [None]:
# Run the logit regression with the training dataset
# label is outcome variable and we will predict on all other features
# family = binomial (link = 'logit') specifies functional form and error distribution
lr_model <- glm(label ~ ., family = binomial(link = 'logit'), data = df_training)

# Show coefficient estimates and their statistical significance
summary(lr_model)

In the above results, the **Estimate** column shows each feature's coefficient on the log-odds scale. Its sign tells us whether the feature is associated with a higher or lower chance that a graduate will be employed in the 4th quarter after graduation, holding the other features fixed. The stars at the end of each row indicate whether a coefficient is statistically significant. Note that our model automatically leaves out one category for each group of dummy variables. For example, of all the `gradmaj` dummies, agriculture is the omitted category. 

Based on our logistic regression model, most of our features are statistically significant predictors. Additionally, the estimate for each factor level can be interpreted as a comparison to the omitted category.
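
Because logistic regression coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to read. The sketch below illustrates this with R's built-in `mtcars` data standing in for our training set; `toy_model` and the chosen variables are assumptions for illustration only.

```r
# Built-in mtcars data stands in for the notebook's training set
toy_model <- glm(am ~ wt, family = binomial(link = "logit"), data = mtcars)

# Coefficients are on the log-odds scale; exponentiate to get odds ratios
log_odds    <- coef(toy_model)
odds_ratios <- exp(log_odds)

round(cbind(log_odds, odds_ratios), 3)
```

A negative estimate corresponds to an odds ratio below 1 (lower odds of the outcome as the feature increases), and a positive estimate corresponds to an odds ratio above 1.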

### **Model Evaluation**

Next, let's evaluate the logistic regression's performance on the testing set. There are a variety of evaluation metrics we can use, and we will introduce the following:
- Confusion matrix
- Precision at k
- Comparison to baseline models
- Precision-recall curve

#### **Confusion Matrix**

First, let's use the function `predict()` to predict outcomes of graduates in our testing data based on the results of our logistic regression. Recall that the predicted scores don't directly tell us the predicted outcome of a graduate. We can say that a graduate with a score of 0.9 is predicted to be more likely to be employed in the 4th quarter after graduation than a graduate with a score of 0.7. But whether 0.9 implies the graduate will be employed or not depends on our choice of threshold. For example, if we choose a threshold of 0.5, any graduate with a predicted score greater than 0.5 will be classified as employed and any graduate with a predicted score of 0.5 or less will be classified as not employed. 

In [None]:
# Predict on testing set
# type = "response" ensures we get predicted probabilities
df_testing$predict_score <- predict(lr_model, df_testing, type = "response")

# Set a threshold for the predicted score
# Assume people who get more than 0.5 predicted score will be employed 4th quarter after graduation
threshold <- .5

# Add the predicted outcome to a new column in df_testing
# If predicted score is greater than the threshold, then predict label = 1, otherwise, predict label =0.
df_testing$predict_label <- ifelse(df_testing$predict_score > threshold, 1, 0) 

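# Confusion matrix
# Fix the factor levels and treat 1 (employed) as the positive class
confusionMatrix(factor(df_testing$predict_label, levels = c(0, 1)),
                factor(df_testing$label, levels = c(0, 1)),
                positive = "1")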

From the results of `confusionMatrix()`, you can view the number of true positives, false positives, true negatives, and false negatives predicted by the model. Additionally, you can see the specificity, sensitivity, and accuracy of the model. We will discuss some of these metrics in the following sections.
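
To make these quantities concrete, the sketch below builds a confusion matrix by hand with base R's `table()` on made-up labels and scores, treating 1 (employed) as the positive class. The object names are hypothetical.

```r
# Toy actual labels and predicted scores, made up for illustration
actual <- c(1, 1, 1, 0, 0, 1, 0, 1)
score  <- c(0.9, 0.8, 0.4, 0.7, 0.2, 0.6, 0.3, 0.55)

# Apply a 0.5 threshold to turn scores into predicted labels
predicted <- ifelse(score > 0.5, 1, 0)

# Cross-tabulate predictions against the truth
cm <- table(Predicted = factor(predicted, levels = c(0, 1)),
            Actual    = factor(actual, levels = c(0, 1)))
cm

tp <- cm["1", "1"]; fp <- cm["1", "0"]
tn <- cm["0", "0"]; fn <- cm["0", "1"]

accuracy    <- (tp + tn) / sum(cm)
sensitivity <- tp / (tp + fn)   # share of actual positives that were found
specificity <- tn / (tn + fp)   # share of actual negatives that were found
c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
```

Changing the threshold moves cases between the rows of the table, which is why every metric derived from it depends on the threshold we choose.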

#### **Precision at K**

Most of the time, we may not care about how accurate our model is in predicting both positive (employed) and negative (not employed) outcomes. Instead, we may want to check how accurate our model is in predicting positive outcomes, which is captured by **precision**.

$$\text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}}$$

Instead of arbitrarily choosing a threshold as we did before, we can ask what percentage of the population we can help given our resources, such as funding, time, and staff. In the context of this project, we need to decide what percentage of graduates we can assist so that they have a better opportunity to be employed after graduation. Suppose that we have enough resources to cover 10% of the graduates. We can use the function we created in section 3, `precision_at_k()`, to get the precision at 10% of the population. 

> Note: For an intervention, it may be more useful to define `label = 1` as the negative outcome (not employed), so that the top-ranked graduates are those least likely to find employment.
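
One way to act on the note above is to flip both the label and the score so that the ranking surfaces the graduates at highest risk of not being employed. The sketch below is a hypothetical illustration on made-up data; the column names `not_employed` and `risk_score` are assumptions and do not appear elsewhere in this notebook.

```r
# Toy labels (1 = employed) and predicted employment scores, made up for illustration
toy <- data.frame(label = c(1, 1, 0, 1, 0, 1, 1, 0, 1, 1),
                  score = c(0.95, 0.9, 0.35, 0.8, 0.2, 0.7, 0.6, 0.1, 0.85, 0.3))

# Flip both so that 1 now marks "not employed" and high scores mean high risk
toy$not_employed <- 1 - toy$label
toy$risk_score   <- 1 - toy$score

# Take the top 40% by risk and compute precision for the flipped label
k <- 0.4
n_top <- as.integer(round(k * nrow(toy)))
top <- toy[order(-toy$risk_score), ][seq_len(n_top), ]
precision_not_employed <- mean(top$not_employed == 1)
precision_not_employed
```

Here precision answers a different question: of the graduates flagged for assistance, what share would in fact not have been employed.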

In [None]:
# Check precision at K%
k_pct <- .1 # Here, we are checking precision at 10% of the population, change the value to check precision at different K

lr_prec_at_10 <- precision_at_k(k_pct, df_testing, "label", "predict_score")

print(paste0("In the Logistic Regression Model, precision at ", label_percent()(k_pct), 
             " of the population is: ", (lr_prec_at_10)))

The result implies that at 10% of the population, among all the graduates the logistic regression model predicts to be employed, REDACTED were actually employed in the 4th quarter after graduation. <font color=red> You can change the value assigned to `k_pct` in the first line of the code to explore precision at other percentages of the population.</font>

#### **Compare our model with baselines**

How good is the logistic regression's precision at 10% of the population? Is it accurate enough or not? Recall the function `compare_w_baseline()`, which allows us to compare our model with two baselines. 

In [None]:
# Compare precision at K% of the population with baseline models
df_compare <- compare_w_baseline(k=0.1, "Logistic Regression", df_testing, "label", "predict_score")

df_compare

We can see that the logistic regression model's precision is higher than the two baseline models' precision. This implies that at 10% of the population, the logistic regression model is better at predicting who will be employed than randomly guessing who will be employed or guessing that everyone will be employed. 
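
The two baselines tend to land close to each other because the expected precision of a random ranking equals the share of positives in the data, which is exactly the precision obtained by guessing that everyone is employed. The small simulation below sketches this on made-up labels; the 70% positive rate and the helper `random_precision_at_k()` are assumptions for illustration.

```r
set.seed(42)  # for reproducibility of the simulation

# Toy labels with a 70% positive rate, made up for illustration
label <- rep(c(1, 0), times = c(700, 300))
base_rate <- mean(label)

# Precision in the top 10% under one random ranking
random_precision_at_k <- function(label, k = 0.1) {
    n_top <- as.integer(round(k * length(label)))
    shuffled <- sample(label)          # random order = random scores
    mean(shuffled[seq_len(n_top)] == 1)
}

# Average over many random rankings
sims <- replicate(500, random_precision_at_k(label))
c(base_rate = base_rate, mean_random_precision = mean(sims))
```

A useful model therefore has to beat the share of positives in the testing data, not just some arbitrary value such as 0.5.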

In [None]:
# Use a bar plot to show the comparison of the Logistic Regression model and the baselines (random guess or guess everyone is employed)

# For easier reading, increase base font size
theme_set(theme_gray(base_size = 16))
# Adjust repr.plot.width and repr.plot.height to change the size of graphs
options(repr.plot.width = 10, repr.plot.height = 5)

# Specify source dataset and x and y variables
lr_baseline_plot <- ggplot(df_compare, aes(x = model, y = precision)) + 
    geom_col() + # Plots bars on the graph
    geom_text(size = 5, vjust = -0.5, aes(label = format(precision, digits = 3))) + # Show values on top of the bar
    scale_y_continuous(breaks = seq(0, 1, 0.2), limits = c(0, 1)) + # Adjust the y scale to set the interval for tick marks
    labs(title = "Precision at 10% against the baseline, Logistic Regression", # Add graph title
         x = " ", y = 'Precision at 10%') + 
    theme(axis.title.x = element_text(face="bold"), # Adjust the style of X-axis label
          axis.title.y = element_text(face="bold"), # Adjust the style of Y-axis label
          axis.text.x = element_text(face="bold", size = 16),
          plot.title = element_text(hjust = 0.5))  # Center the graph title

print(lr_baseline_plot)

#### **Precision-Recall Curve**

Another measure we often use to evaluate the performance of an ML model is recall. It shows us what percentage of people with actual positive outcomes our model can capture. In the context of this project, recall tells us what percentage of those who found employment our ML model correctly identifies. 

Again, we do not need to set a specific threshold. Instead, we can plot precision and recall on a line chart so that it is easier to see how they change with our choice of k.

$$\text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}}$$

$$\text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}}$$

In [None]:
# Get Precision-Recall at K data frame
df_measure_at_k <- precision_recall(df_testing, "label", "predict_score")

# See the top records of the DataFrame
head(df_measure_at_k)

In [None]:
# For easier reading, increase base font size
theme_set(theme_gray(base_size = 16))
# Adjust repr.plot.width and repr.plot.height to change the size of graphs
options(repr.plot.width = 10, repr.plot.height = 5)

# Create the precision-recall curve
lr_pr_curve <- ggplot(df_measure_at_k, aes(x=k)) + # Plot percent of population (k) on the x-axis
    geom_line(aes(y=precision), color = 'blue') + # Add the precision curve
    geom_line(aes(y=recall), color = 'red') + # Add the recall curve
    scale_y_continuous(       # We need to create a dual-axis graph, so we need to define two y axes
        name = "Precision",     # Label of the first axis
        sec.axis = sec_axis(~.*1,name="Recall"), # Add a second axis and specify its label
        breaks = seq(0, 1, 0.1)) +  # Adjust the tick mark on Y-axis
    scale_x_continuous(breaks = seq(0, 1, 0.2)) + # Adjust the tick mark on X-axis
    labs(title = "Precision-Recall Curve, Logistic Regression", # Add graph title
         x = "Percent of Population") + # Add X-axis label
    theme(axis.title.x = element_text(face="bold"), # Adjust the style of X-axis label
          axis.title.y.left = element_text(face="bold", color="blue"), # Adjust the styles of the two Y-axes labels
          axis.title.y.right = element_text(face='bold', color = 'red'),
          plot.title = element_text(hjust = 0.5))  # Center the graph title

# Display the graph that we just created
print(lr_pr_curve)

In the above graph, the blue line represents precision. We can see that it stays relatively constant as we increase k. The red line shows recall. Its value increases as we increase k. At 10% of the population, precision is high and recall is low.

This pattern is expected. When k is low, only graduates with the highest predicted scores are classified as employed, so precision starts at a high point. As k increases (meaning we select a larger share of the population), we relax the threshold and classify graduates with relatively low predicted scores as employed as well. Precision therefore tends to drift downward, while recall can only rise because more of the actual positives are captured. 

## **5. Decision Tree**

There are other supervised machine learning algorithms besides logistic regressions. Here, we will try a decision tree model.

In [None]:
# Run the decision tree model
# method = 'class' is to designate the classification tree method (label is binary)
dt_model <- rpart(label ~ ., method = 'class', data = df_training)

# Print results
printcp(dt_model)

In the above result, we can see that the only variable the decision tree model actually used in prediction was `total_sems`. In the logistic regression model, by contrast, we saw that a majority of the variables were statistically significant. 
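
The variables a fitted tree actually splits on can also be listed programmatically from the model's `frame` component. The sketch below uses the `kyphosis` data that ships with `rpart` as a stand-in for our training set; `toy_tree` and `used_vars` are hypothetical names.

```r
library(rpart)

# kyphosis ships with rpart and stands in for the notebook's training set
toy_tree <- rpart(Kyphosis ~ Age + Number + Start, method = "class", data = kyphosis)

# Each row of $frame is a node; rows marked "<leaf>" are terminal nodes,
# all others name the variable used for that split
split_vars <- as.character(toy_tree$frame$var)
used_vars <- unique(split_vars[split_vars != "<leaf>"])
used_vars

# Variables that never appear in a split
setdiff(c("Age", "Number", "Start"), used_vars)
```

A variable that is absent from the splits did not improve the tree enough under the default complexity settings, which is not the same as having no relationship with the label.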

We can also view the tree graphically. 

In [None]:
# Plot the tree
prp(dt_model, # Your decision tree model
    type = 0, # type of trees
    extra = 100, # what information to show in each node
    main = "Decision Tree Model") # Add a title to your decision tree graph

We can evaluate the decision tree model in a similar fashion.

In [None]:
# Get the Decision Tree model predicted score and save it in a column in the testing DataFrame
df_testing$dt_predict_score <- predict(dt_model, df_testing, type = 'prob')[,2]

# Get the Decision Tree model Precision-Recall at K% DataFrame
df_dt_measure_at_k <- precision_recall(df_testing, "label", "dt_predict_score")

In [None]:
# For easier reading, increase base font size
theme_set(theme_gray(base_size = 16))
# Adjust repr.plot.width and repr.plot.height to change the size of graphs
options(repr.plot.width = 10, repr.plot.height = 5)

# Create the precision-recall curve
dt_pr_curve <- ggplot(df_dt_measure_at_k, aes(x=k)) + 
    geom_line(aes(y=precision), color = 'blue') + # Add the precision curve
    geom_line(aes(y=recall), color = 'red') + # Add the recall curve
    scale_y_continuous(       # We need to create a dual-axis graph, so we need to define two y axes
        name = "Precision",     # Label of the first axis
        sec.axis = sec_axis(~.*1,name="Recall"), # Add a second axis and specify its label
        breaks = seq(0, 1, 0.1)) +  # Adjust the tick mark on Y-axis
    scale_x_continuous(breaks = seq(0, 1, 0.2)) + # Adjust the tick mark on X-axis
    labs(title = "Precision-Recall Curve, Decision Tree", # Add graph title
         x = "Percent of Population") + # Add X-axis label
    theme(axis.title.x = element_text(face="bold"), # Adjust the style of X-axis label
          axis.title.y.left = element_text(face="bold", color="blue"), # Adjust the styles of the two Y-axes labels
          axis.title.y.right = element_text(face = 'bold', color = 'red'),
          plot.title = element_text(hjust = 0.5))  # Center the graph title

# Display the graph that we just created
print(dt_pr_curve)

We can see that the precision-recall curve for the decision tree model looks relatively similar to the curve for the logistic regression model. However, the precision at low values of *k* is lower than that of the logistic regression.

## **6. Compare Multiple Models**

In this section, we provide code to compare evaluation metrics for several ML models all at once. 

In [None]:
# "refresh" the training data and the testing data to remove predicted scores

df_training <- training_set
df_testing <- testing_set

In addition to a logistic regression and decision tree, we will also try inputting our data into a random forest. This list can be expanded to include other ML models.

In [None]:
# Create a list to include all the models we want to compare
# LR: Logistic Regression
# DT: Decision Tree
# RF: Random Forest
model_list <- c("LR", "DT", "RF")

# Define percent of population the resource can cover
k <- 0.1
pct_pop <- k  # Percent of population the resource can cover
test_pop <- nrow(df_testing) # Total number of people in the testing data
pop_at_k <- as.integer(pct_pop * test_pop) # At K percent of the population, how many people the resource can cover

# make data frame number of rows to be number of models
n <- length(model_list)
df_compare_models <- data.frame(
    Model = model_list, 
    accuracy = double(n),
    precision_at_k = double(n),
    recall_at_k = double(n)
) 

We can then loop through the models to run them all in the same series of commands.

> Note: The code cell below will take approximately 8 minutes to execute. 

In [None]:
for (model in model_list) {
    
    # Logistic Regression Model
    if (model=="LR") {
        fit <- glm(label ~ ., family = binomial(link = 'logit'), data = df_training) # Fit the model
        df_testing$predict_score <- predict(fit, df_testing, type = 'response') # Predict scores
    }
    
    # Decision Tree Model
    if (model=="DT") {
        fit <- rpart(label ~ ., method = 'class', data = df_training) # Fit the model
        df_testing$predict_score <- predict(fit, df_testing, type = 'prob')[,2] # Predict scores
    }
    
    # Random Forest Model
    if (model == "RF"){
        df_training$label <- factor(df_training$label) # for the algorithm to run we need to convert the label to a factor.
        df_testing$label <- factor(df_testing$label)
        # ntree determines number of trees
        # mtry is number of variables randomly sampled as candidates at each split
        # importance = TRUE to include feature importances
        fit <- randomForest(label ~ ., data = df_training, type = 'class', ntree = 500, mtry = 6, importance = TRUE) # Fit the model
        df_testing$predict_score <- predict(fit, df_testing, type = 'prob')[,2] # Predict scores (probability that label = 1)
        df_testing$label <- as.numeric(as.character(df_testing$label)) # for our previously defined functions to work, we need to convert the label back to numeric.
    }
    
    
    # Get Precision-Recall DataFrame
    df_prec_rec <- precision_recall(df_testing, "label", "predict_score")
    
    # Calculate accuracy
    threshold <- df_prec_rec$predict_score[pop_at_k] # Get the predicted score at K%
    df_testing$predict_label <- ifelse(df_testing$predict_score >= threshold, 1, 0) # Predict the label: if >= threshold, then 1; otherwise 0
    df_testing <- df_testing %>% mutate(accurate = 1*(label == predict_label)) # Generate an indicator of whether the prediction is correct
    acc <- (sum(df_testing$accurate)/nrow(df_testing)) # Calculate accuracy
    df_compare_models$accuracy <- ifelse(df_compare_models$Model == model,acc,df_compare_models$accuracy) # Save accuracy to the DataFrame
    
    # Calculate precision and save it in the df_compare_models DataFrame
    prec_at_k <- precision_at_k(k, df_testing, "label", "predict_score")
    df_compare_models$precision_at_k <- ifelse(df_compare_models$Model == model, prec_at_k, df_compare_models$precision_at_k)
    
    # Calculate Recall and save it in the df_compare_models DataFrame
    rec_at_k <- df_prec_rec$recall[pop_at_k]
    df_compare_models$recall_at_k <- ifelse(df_compare_models$Model == model, rec_at_k, df_compare_models$recall_at_k)
    
}

# check results
df_compare_models

We can see that at 10% of the population, the decision tree model is the most accurate. However, the logistic regression model is better at predicting whether a graduate will be employed and captures more of the graduates who will be employed than the other two models.
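
Accuracy and precision at k can diverge because the loop above classifies only the top k% as employed and everyone else as not employed. The toy sketch below, on made-up labels with a perfect ranking, shows how accuracy can be low even when precision is perfect. All names and values are hypothetical.

```r
# Toy test set: 7 of 10 graduates are employed, and the model ranks them perfectly
label <- c(1, 1, 1, 1, 1, 1, 1, 0, 0, 0)   # already sorted by predicted score

# Only the top 10% (one person) is predicted to be employed
k <- 0.1
n_top <- as.integer(round(k * length(label)))
predicted <- c(rep(1, n_top), rep(0, length(label) - n_top))

precision_at_k <- sum(predicted == 1 & label == 1) / sum(predicted == 1)
recall_at_k    <- sum(predicted == 1 & label == 1) / sum(label == 1)
accuracy       <- mean(predicted == label)

c(precision = precision_at_k, recall = recall_at_k, accuracy = accuracy)
```

When most graduates are employed, a small k forces many employed graduates into the "not employed" group, so accuracy at k is best read alongside precision and recall rather than on its own.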

## **7. References**

Lou, Tian. (2022, March 18). Machine Learning Model Deployment and Evaluation Using Illinois Unemployment Insurance Data. Zenodo. https://doi.org/10.5281/zenodo.6369160

## **Footnotes:**
<span id="fn1"> 1. For more information, see <a href='http://www.milbo.org/rpart-plot/prp.pdf'>plotting rpart trees with the rpart.plot package</a>. </span>  

[[Go back]](#13)

<span id="fn2"> 2. For more information, see <a href='https://cran.r-project.org/web/packages/randomForest/randomForest.pdf'>Breiman and Cutler's Random Forests for Classification and
Regression</a>. </span>  

[[Go back]](#14)
