<center> <img style="float: center;" src="images/CI_horizontal.png" width="450"> </center>
<center>
    <span style="font-size: 1.5em;">
        <a href='https://www.coleridgeinitiative.org'>Website</a>
    </span>
</center>

# **<center>Machine Learning Part 1: Data Preparation</center>**

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.6368905.svg)](https://doi.org/10.5281/zenodo.6368905)

<br>
<center>Tian Lou</center>

## **1. Introduction**

In the [cohort analysis notebook](./2.Data_Exploration_Cohort_Analysis.ipynb), we tracked the records of the COVID-19 cohort (claimants who entered during the week ending March 28th and the week ending April 4th, 2020) and calculated their exit rates. We also examined how claimants from different industries exit at different rates. For example, we found that claimants from REDACTED and REDACTED have slower exit rates than claimants from other industries. However, industry is just one of the factors related to claimants' exit rates. What about other factors, such as demographics, occupation, prior earnings, location, and the total amount of benefits received from the UI program? Can we use these factors to predict which groups of workers are the most likely to stay in the UI program and eventually exhaust their benefits? Can interventions, such as economic shocks and/or training programs, accelerate claimants' return to work?

In the two machine learning (ML) notebooks, we will develop a machine learning model to answer these questions. We will use it to predict claimants who are the most likely to stay in the UI program for 13 weeks or longer and to examine how economic shocks influence claimants' exit rates. In the first ML notebook, we will formulate the research questions and clean up the data, including identifying and constructing rows, and creating labels (outcome, Y) and features (predictors, X). In the second ML notebook, we will split the data into a training set and a testing set. We will train a few models, such as logistic regression, decision tree, and random forest. Finally, we will evaluate these models by using several measures, such as accuracy, precision at K, and recall at K. 

After you work through the machine learning notebooks, you can practice the code in two checkpoint notebooks. Think about how you might apply the techniques and code presented in these notebooks to your own project. The model we develop in these notebooks can be used to predict other outcomes, such as claimants who are at risk of exhausting their benefits, and to examine the effectiveness of other interventions, such as different job search assistance services and job training programs.

## **2. Learning Objectives**

Throughout this series of notebooks, our overarching questions are: How can states best produce information that local workforce boards can use to inform resource allocation for unemployed workers? What information is the most useful to workforce boards for strategic resource planning and efficient resource allocation? What measures are most useful to local workforce boards, and how do we characterize variation in those measures? We use Illinois administrative data, Bureau of Labor Statistics (BLS) workforce data, and Opportunity Insights Economic Tracker data as an example and show you step by step how we develop a project, including: exploring the data set, selecting a sample, defining outcome measures, conducting subgroup analyses, creating visualizations, and building a predictive machine learning model.

After you finish the two ML notebooks, you should know:
- What a row means and how to create rows by using the certified claims data
- How to use consumer spending data and how to join it with the claims data
- How to create the label (outcome variable, Y)
- How to create features (predictor variables, X)
- How to deal with missing values
- How to split data into a training set and a testing set
- How to run ML models
- How to create evaluation measures (e.g., accuracy, precision, recall)
- How to evaluate ML models

#### **Research Questions**

In this project, we are interested in **predicting claimants who will stay in the UI program for 13 weeks or longer**. We define claimants who left before the 13th week as **fast exiters** and those who left during or after the 13th week as **slow exiters**. The questions we seek to answer are:

- Which claimants will stay in the UI program for 13 weeks or longer?
- How important is an economic shock in determining whether a claimant becomes a slow exiter? 

#### **Datasets**
We will use the Illinois PROMIS file and the Consumer Spending data from the Opportunity Insights Economic Tracker:
- **2020 Illinois PROMIS certified claims file**: weekly UI claims data. Each record represents a certified claim in a certain week. The data includes a claimant's demographics, education level, prior industry, occupation, and location. It also contains detailed information about the claim, such as program type, claim type, certification status, benefit starting date, and benefit amount. Federal Pandemic Unemployment Compensation (FPUC, $600/week) and dependent benefits are included in the total amount paid.

The full certified claims data has more than REDACTED million rows. Using the full dataset for data exploration is time-consuming. Therefore, **we will use a 1% random sample of the certified claims data in all notebooks. You should also use the 1% random sample when walking through all notebooks and when identifying the scope of your project.** After you decide your research questions and analysis scope, only pull the data and the variables you need from the full dataset.

- **Opportunity Insights Economic Tracker, Consumer Spending Data**[<sup>1</sup>](#fn1) <a id = "9"> </a>: county-level daily credit/debit card spending relative to January 4-31, 2020 in all merchant category codes (MCC). The data is seasonally adjusted and is presented as a 7-day lookback moving average. It captures about 10% of debit and credit card spending in the U.S. and covers a broad range of industries, but over-represents spending in industries such as accommodation and food services and clothing. For privacy and confidentiality reasons, small cells and outliers have been suppressed. Only REDACTED (out of REDACTED) IL counties' spending data are consistently available over time. [<sup>2</sup>](#fn2) <a id = "10"> </a> 

#### **Analytical Methods**
The specific techniques include, but are not limited to:
- **SQL statements/keywords**:
 - `SELECT ... FROM`: select data from a table in the database
 - `WHERE`: select a subset of rows from a table based on one or more conditions
 - `CONVERT`/`CAST`: convert one data type into another
 

- **R code**:
 - `str_pad`: pad a string, such as adding leading zeros
 - `replace`: replace values in a column
 - `scale`: standardize/normalize values in a column
 - `factor`: convert a character/numeric type column to factor type
 - `glm`: fit generalized linear models. In this project, we use it for logistic regression
 - `rpart`: fit decision tree model
 - `randomForest`: fit random forest model

## **3. Load the Data**

In this notebook, we use data in the table `il_des_promis_1pct` and the table `affinity_county_daily` (consumer spending data) in the schema `tr_dol_eta.dbo`. First, we need to load R packages and establish the database connection.

In [None]:
# Database interaction imports
library(odbc)

# For data manipulation/visualization
library(tidyverse)

# For faster date conversions
library(lubridate)

In [None]:
# Connect to the database
con <- DBI::dbConnect(odbc::odbc(),
                     Driver = "SQL Server",
                     Server = "msssql01.c7bdq4o2yhxo.us-gov-west-1.rds.amazonaws.com",
                     Database = "tr_dol_eta",
                     Trusted_Connection = "True")

#### **IL PROMIS Certified Claims File**

**We want to predict claimants who will stay in the UI program for 13 weeks or longer with a machine learning model.** In the two data exploration notebooks, we selected the data by using `week_end_date` to get the cross-sectional sample and sliced the data by using `byr_start_week` to get the cohort sample. Only the cohort sample allows us to observe a claimant's records from program entry to program exit and to calculate how long a claimant stayed in the UI program. Therefore, to construct the data for the machine learning model, we select the data the same way as we did in the cohort analysis.  

We need at least two cohorts to develop the model. We will train the model on the first cohort and validate it on the second cohort. Therefore, our data includes the COVID-19 cohort (claimants who entered during the week ending March 28th and the week ending April 4th, 2020) and a cohort of claimants who entered 13 weeks later (claimants who entered during the week ending June 27th and the week ending July 4th, 2020). The table below shows the number of weeks after program entry for both cohorts.

||Week Ending Date|3-28|4-4|4-11|4-18|...|6-20|6-27|7-4|7-11|7-18|...|9-19|9-26|...|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|**Training set**|**Cohort 1** |1|2| 3 | 4 |...| 13 |...|
|**Training set**|**Cohort 1** | | 1 | 2 | 3 |...| 12 | 13 |...|
|**Testing set**|**Cohort 2**|||||||1|2|3|4|...|13|...|
|**Testing set**|**Cohort 2**||||||||1|2|3|...|12|13|...|

The SQL query we use to load the data is the same as the query we use in the cohort analysis notebook. We still only look at Regular state UI program (`program_type=1 AND sub_program_type=1`) and new claims (`claim_type=1`). The only difference is that we need to select four benefit year starting weeks (`byr_start_week in ('2020-03-28','2020-04-04','2020-06-27','2020-07-04')`).

In [None]:
# Select PROMIS certified claimant records from the database to a dataframe

# Store SQL query to a character variable
query <- "
SELECT ssn_id,
    week_end_date,
    byr_start_week,
    birth_date,
    gender,
    race,
    ethnicity,
    disability,
    education,
    county_fips_code,
    naics_code,
    occupation_code,
    total_pay,
    wages_2019
FROM tr_dol_eta.dbo.il_des_promis_1pct
WHERE sub_program_type = 1
AND program_type = 1
AND claim_type = 1
AND byr_start_week in ('2020-03-28','2020-04-04','2020-06-27','2020-07-04');
"

# Execute query
df_claimants <- DBI::dbGetQuery(con, query)

# R interprets dates as character when pulling from the database, must convert with ymd()
df_claimants <- df_claimants %>%
    mutate(week_end_date=ymd(week_end_date),
          byr_start_week=ymd(byr_start_week),
          birth_date=ymd(birth_date))

# See top records in the dataframe
head(df_claimants)

We add a new column `cohort` to `df_claimants`. It indicates which cohort a claimant is from. We will use this cohort indicator frequently in the later analysis, for example to join the claims data with the consumer spending data and to split the data into the training set and the testing set.

In [None]:
#Add the cohort indicator to df_claimants
df_claimants <- df_claimants %>%
    mutate(cohort = case_when(byr_start_week == '2020-03-28'|byr_start_week == '2020-04-04' ~ 'cohort1',
                              byr_start_week == '2020-06-27'|byr_start_week == '2020-07-04' ~ 'cohort2',
                              TRUE ~ 'other'))

#See top records of the cohort indicator we just generated
head(df_claimants %>% select(ssn_id, byr_start_week, cohort))

#### **Opportunity Insights Economic Tracker: Consumer Spending Data**

In addition to the variables in the PROMIS file, we also include county-level consumer spending as a feature and use it to capture local economic shocks. The data is stored in the table `affinity_county_daily` in the schema `tr_dol_eta.dbo`. In the query below, we combine the `year`, `month`, and `day` columns into one variable `spend_date` by using the `CONVERT()` and `CAST()` statements.

In [None]:
# Select IL Consumer spending data from the database to a dataframe

# Store SQL query to a character variable
# In this query, we use CONVERT() and CAST() statements to combine the year, month, and day columns into a date variable 'spend_date'
query <- "
SELECT CONVERT(date, CAST([year] AS varchar(4)) + '-' +  
                     CAST([month] AS varchar(2)) + '-' + 
                     CAST([day] AS varchar(2))) AS spend_date,
    county_fips,
    spend_all
FROM tr_dol_eta.dbo.affinity_county_daily;
"

# Execute query
df_spend <- DBI::dbGetQuery(con, query)

# R interprets dates as character when pulling from the database, must convert with ymd()
df_spend <- df_spend %>%
    mutate(spend_date=ymd(spend_date))

# See top records in the dataframe
head(df_spend)

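The query above builds `spend_date` inside SQL Server with `CONVERT()` and `CAST()`. As a rough R-side equivalent, the sketch below assembles a date from separate year, month, and day columns with `lubridate::make_date()`. The data frame `toy_spend` and all of its values are made up for illustration; only the column names mirror the spending table.

```r
library(lubridate)

# Made-up rows that mimic the year/month/day layout of the spending table
toy_spend <- data.frame(
    year = c(2020, 2020, 2020),
    month = c(1, 3, 12),
    day = c(13, 28, 5),
    spend_all = c(NA, -0.12, 0.03)
)

# make_date() combines the three parts into a Date column,
# so no character-to-date conversion is needed afterwards
toy_spend$spend_date <- make_date(toy_spend$year, toy_spend$month, toy_spend$day)

# Check the type of the new column
str(toy_spend$spend_date)
```
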
There are three columns in the DataFrame `df_spend`: 1) `spend_date` indicates which day the spending data is for; 2) `county_fips` combines the state and county FIPS codes: the last three digits are the county FIPS code and the first one or two digits are the state FIPS code; 3) `spend_all` is the daily credit/debit card spending index of a county. The spending index is seasonally adjusted and is a 7-day moving average. 

> The earliest spending data available to us is from 1/13/2020. That's why `spend_all` is NA in the first few rows.[<sup>3</sup>](#fn3) <a id = "11"> </a>

The data contains spending indexes for multiple states, so we need to limit it to IL data by using the FIPS code. First, for states whose FIPS code starts with a zero, `county_fips` has only four digits. We use the `str_pad` function to add a leading zero to these codes so that every code has five digits and the state FIPS code can be extracted correctly. Second, we use the `substr` function to split `county_fips` into a two-digit state FIPS code and a three-digit county FIPS code. Finally, we filter the data to `state_fips==REDACTED` (IL) and remove an IL county that does not appear consistently in the data over time. 

In [None]:
# Clean up FIPS code and limit the spending data to IL counties that show in the data consistently over time
df_spend <- df_spend %>%
    mutate(county_fips = str_pad(county_fips, 5, side = c("left"), pad="0")) %>% # Add a leading zero to county fips code
    mutate(state_fips = substr(county_fips, 1, 2), county_fips_code = substr(county_fips, 3, 5)) %>% # Generate two-digit state FIPS code and three-digit county FIPS code
    filter(state_fips == 'REDACTED' & county_fips_code!= 'REDACTED') %>% # Only keep IL data and remove a county that is not consistently available during the time period we look at
    select(spend_date, spend_all, county_fips_code) # Only keep the columns we need

#Check county_fips_code and each FIPS code's frequency
table(df_spend$county_fips_code)

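To make the padding and splitting steps concrete, here is a small sketch on a made-up vector of combined FIPS codes. The three input values are illustrative codes from other states, not rows from `df_spend`, and the object names `raw_fips` and `toy_fips` are hypothetical.

```r
library(stringr)

# Made-up combined FIPS codes stored as numbers, so leading zeros are lost
raw_fips <- c(1001, 6037, 36061)

# Pad to five characters so that the first two are always the state code
padded <- str_pad(as.character(raw_fips), width = 5, side = "left", pad = "0")

# Split into a two-digit state code and a three-digit county code
toy_fips <- data.frame(
    county_fips = padded,
    state_fips = substr(padded, 1, 2),
    county_fips_code = substr(padded, 3, 5)
)

print(toy_fips)
```

Without the padding step, `substr(..., 1, 2)` on the four-digit codes would read `10` and `60` as state codes, which is why the order of the two steps matters.
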
In this model, we use the **county-level average consumer spending during one week, four weeks, and eight weeks before each cohort's entry** as features. These features capture short-term and longer-term local economic activity prior to a claimant's job separation. We will test their importance in determining how long claimants stay in the UI program. One might argue that the economic conditions after UI program entry might also affect a claimant’s exit decision. For example, when the economy is expanding, there are more job opportunities and claimants are more likely to leave the program. However, we cannot use post-entry conditions as a feature in the predictive machine learning model, because when we predict cohort 2's (6/27 and 7/4 entrants) exit decisions, we don't know the future economic conditions. Imagine that you are using this model on 7/5/2020. You don't know how consumer spending will change in the next 13 weeks. **In general, we must avoid using future data to predict outcomes**. 

The variable we suggest here is not the only way you can use the consumer spending data. Depending on your assumptions about how economic activities may impact UI claimants' exit decisions, you may create the variables in different ways. 

In [None]:
# Calculate weekly average spending
df_weekly_spend <- df_spend %>% 
    mutate(spend_week = epiweek(spend_date)) %>% # Create a week indicator. For example, any date between 12-29-2019 and 01-04-2020 will be labelled as 1.
    group_by(spend_week, county_fips_code) %>% 
    summarize(avg_spend = mean(spend_all)) # Calculate average weekly spending index for each county

#See the top records
head(df_weekly_spend)

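The week indicator comes from `lubridate::epiweek()`, which numbers weeks that run from Sunday to Saturday. The sketch below checks a few dates by hand and shows the week arithmetic for the first cohort. The object names `week_lookup` and `lookback_weeks` are hypothetical. Note that epiweek numbers restart every year, so subtracting weeks this way assumes the lookback window stays within the same year; if a table covered more than one year, one option would be to group by `epiyear()` as well.

```r
library(lubridate)

# A few dates around the start of 2020 and the two cohort entry weeks
dates <- as.Date(c("2019-12-29", "2020-01-04", "2020-03-28", "2020-06-27"))

week_lookup <- data.frame(
    date = dates,
    weekday = weekdays(dates),
    spend_week = epiweek(dates)
)
print(week_lookup)

# Weeks that fall 1, 4, and 8 weeks before the first cohort's entry week
cohort1_week <- epiweek(as.Date("2020-03-28"))
lookback_weeks <- cohort1_week - c(1, 4, 8)
print(lookback_weeks)
```
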
In [None]:
# Generate the two cohorts' week indicators
cohort1_week <- epiweek('2020-03-28')
cohort2_week <- epiweek('2020-06-27')

# Average weekly spending indexes one week, four weeks, and eight weeks prior to Cohort 1's entry
df_cohort1_avg_spend <- df_weekly_spend %>%
    filter(spend_week %in% c(cohort1_week-1, cohort1_week-4, cohort1_week-8)) %>% # Subset the data to 1, 4, 8 weeks prior program entry week
    pivot_wider(names_from = spend_week, values_from = avg_spend) %>% # Reshape the data to wide format so that each week's spending index is a column
    `colnames<-`(c('county_fips_code', 'avg_spend_8', 'avg_spend_4', 'avg_spend_1')) %>% # Rename columns
    mutate(cohort = 'cohort1') # Generate the cohort indicator

# Average weekly spending indexes one week, four weeks, and eight weeks prior to Cohort 2's entry
df_cohort2_avg_spend <- df_weekly_spend %>%
    filter(spend_week %in% c(cohort2_week-1, cohort2_week-4, cohort2_week-8)) %>% # Subset the data to 1, 4, 8 weeks prior program entry week
    pivot_wider(names_from = spend_week, values_from = avg_spend) %>% # Reshape the data to wide format so that each week's spending index is a column
    `colnames<-`(c('county_fips_code', 'avg_spend_8', 'avg_spend_4', 'avg_spend_1')) %>% # Rename columns
    mutate(cohort = 'cohort2') # Generate the cohort indicator

# Append(Combine) the two DataFrames
df_avg_spend <- rbind(df_cohort1_avg_spend, df_cohort2_avg_spend)

# See the top records of the DataFrame
head(df_avg_spend)

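The code above renames the reshaped columns by position, which relies on the week columns appearing in the order 8, 4, 1. As an alternative sketch (not the approach used in this notebook), the column names can be derived from the number of weeks before entry, so the names do not depend on column order. The data in `toy_weekly` is made up, and `weeks_before` is a hypothetical helper column.

```r
library(dplyr)
library(tidyr)

entry_week <- 13  # epiweek of the week ending 2020-03-28

# Made-up weekly averages for two made-up counties
toy_weekly <- tibble(
    spend_week = rep(c(5, 9, 12), each = 2),
    county_fips_code = rep(c("001", "003"), times = 3),
    avg_spend = c(0.01, 0.02, 0.00, -0.01, -0.05, -0.08)
)

toy_wide <- toy_weekly %>%
    mutate(weeks_before = entry_week - spend_week) %>% # 8, 4, or 1 weeks before entry
    select(-spend_week) %>%
    pivot_wider(names_from = weeks_before,
                values_from = avg_spend,
                names_prefix = "avg_spend_") # Column names become avg_spend_8, avg_spend_4, avg_spend_1

print(toy_wide)
```
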
#### **Combine Datasets**

Now we can join the claims data and the consumer spending data on county FIPS code (`county_fips_code`) and the cohort indicator (`cohort`). Note that we use `inner_join` to keep observations that are in both datasets, i.e., claimants who are living in the REDACTED counties that are included in the consumer spending data.

In [None]:
#Inner join the claims data with the average spending data on county FIPS code and the cohort indicator
#Inner join keeps the rows that are in both df_claimants and df_avg_spend
df_claimants <- df_claimants %>%
    inner_join(df_avg_spend, by = c('county_fips_code','cohort'))

#See top records of the DataFrame
head(df_claimants)           

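Because `inner_join` silently drops claimants whose county is not in the spending data, it can be useful to count how many rows are lost. The sketch below uses two made-up tables (`toy_claimants` and `toy_spend`, both hypothetical) and `anti_join` to list the rows that have no match.

```r
library(dplyr)

# Made-up claimants; the claimant in county "005" has no spending data
toy_claimants <- tibble(
    ssn_id = 1:4,
    county_fips_code = c("001", "001", "003", "005"),
    cohort = "cohort1"
)

# Made-up spending features for two counties
toy_spend <- tibble(
    county_fips_code = c("001", "003"),
    cohort = "cohort1",
    avg_spend_1 = c(-0.05, -0.08)
)

# Rows kept by the inner join
kept <- inner_join(toy_claimants, toy_spend, by = c("county_fips_code", "cohort"))

# Rows that the inner join drops
dropped <- anti_join(toy_claimants, toy_spend, by = c("county_fips_code", "cohort"))

nrow(kept)
dropped
```
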
## **4. Create the Label**

In this section, we construct the label (i.e., the outcome variable, the dependent variable, or Y). The value of the label is usually binary, i.e., 0 or 1.[<sup>4</sup>](#fn4) <a id = "12"> </a> Since we are interested in predicting claimants who will stay for 13 weeks or longer in the UI program, we define the label the following way:

- **label = 1, if the claimant stayed 13 weeks or longer**
- **label = 0, if the claimant left before the 13th week**

To create the label, we need to generate an indicator `week_number` first. It represents the number of weeks after program entry (similar to what we did in the [cohort analysis notebook](./2.Data_Exploration_Cohort_Analysis.ipynb)). Note that a small number of claimants' benefit year starting weeks are not consistent over time. We assume this is a data error and adjust these people's `byr_start_week` to their earliest starting weeks. This adjustment only applies to the cohorts we are looking at, because the changes we observe are mostly one week. Also, we only have one year of data, so it's very unlikely that a person started two benefit years. If you are looking at multiple years of data and longer term outcomes, you need to investigate whether the changes are due to data error or multiple program entries.

We define **stayers** as claimants who consistently appeared in the UI program over time. If a claimant left the UI program for one week, we assume he/she would not return and define this person as an **exiter**. Based on this definition, we only need a claimant's records during the first 13 weeks after program entry to create the label. We count each claimant's number of records during the first 13 weeks. If the number of records is 13, the person has a record in every week, which implies that he/she stayed in the program for at least 13 weeks, and his/her label is 1. If a person has fewer than 13 records, he/she must have left the program during the initial 13 weeks and his/her label is 0.

In [None]:
# A few people's benefit year starting date changed, adjust them to the earliest date
df_claimants <- df_claimants %>%
    group_by (ssn_id, cohort) %>%
    mutate(byr_start_week = min(byr_start_week)) %>% # For each person in each cohort, adjust the 'byr_start_week' to the earliest date
    ungroup() #return data to non-grouped form. If we don't include this code, you may get errors in the later analysis

# Create week number field and keep the first 13 weeks of records for each person
df_claimants <- df_claimants %>% 
    mutate(week_number = as.integer(difftime(week_end_date, byr_start_week, units = "weeks")) + 1) %>%
    filter(week_number <= 13) # Only keep each person's records from the first 13 weeks

#Create the label
df_label <- df_claimants %>% 
    group_by(ssn_id, cohort) %>%
    summarize(stay_weeks = n()) %>% #count each person's number of records
    mutate(label = case_when(stay_weeks >= 13 ~ 1, TRUE ~ 0)) %>% #create the label, if stayed 13 weeks or more, then label=1, otherwise, label=0
    ungroup() #return data to non-grouped form. If we don't include this code, you may get errors in the later analysis

#Check the top rows of df_label
head(df_label)

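The labeling rule can be checked on a small made-up example. In the sketch below, claimant 1 certifies in every one of the first 13 weeks, while claimant 2 skips the 6th week. The data frame `toy_claims` and its values are hypothetical; the `week_number` and counting logic follow the code above.

```r
library(dplyr)

start <- as.Date("2020-03-28")

# Made-up certified claims for two claimants
toy_claims <- bind_rows(
    tibble(ssn_id = 1L, week_end_date = start + 7 * (0:12)),       # a record in all 13 weeks
    tibble(ssn_id = 2L, week_end_date = start + 7 * c(0:4, 6:12))  # no record in the 6th week
) %>%
    mutate(byr_start_week = start,
           week_number = as.integer(difftime(week_end_date, byr_start_week, units = "weeks")) + 1)

# Count records in the first 13 weeks and create the label
toy_label <- toy_claims %>%
    filter(week_number <= 13) %>%
    group_by(ssn_id) %>%
    summarize(stay_weeks = n()) %>%
    mutate(label = if_else(stay_weeks >= 13, 1, 0))

print(toy_label)
```

Claimant 2 has 12 records, so the label is 0 even though this person returned after the gap, which matches the assumption that a claimant who leaves for one week is treated as an exiter.
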
The code below calculates the percentages of fast exiters and slow exiters in each cohort. This helps us understand the distribution of our label. Depending on the context of your project, you may need to adjust your label definitions and/or research questions. For example, if only 1% of claimants stayed for 13 weeks or longer, we may want to change our goal and use a shorter cutoff, such as whether a claimant left the program within 6 weeks. 

In [None]:
# Understand what percent of claimants in each cohort left before the 13th week

#Count number of people by cohort and label
df_label_freq <- df_label %>% group_by(cohort,label) %>% summarize(freq = n())

#Count number of claimants in each cohort
df_cohort_pop <- df_label %>% group_by(cohort) %>% summarize(cohort_pop = n_distinct(ssn_id))

#Left join the two DataFrames and calculate the percentages of fast exiters and slow exiters in each cohort
df_label_freq <- df_label_freq %>%
    left_join(df_cohort_pop, by = 'cohort') %>%
    mutate(percent = round((freq/cohort_pop),3))

#Look at the statistics
df_label_freq

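As a quick cross-check of the percentages, base R's `table()` and `prop.table()` give the same row shares in two lines. This is a sketch on a made-up label table (`toy_label` is hypothetical), not a replacement for the code above.

```r
# Made-up labels: 3 of 4 slow exiters in cohort1, 1 of 4 in cohort2
toy_label <- data.frame(
    cohort = c(rep("cohort1", 4), rep("cohort2", 4)),
    label = c(1, 1, 1, 0, 1, 0, 0, 0)
)

# Cross-tabulate cohort by label, then convert counts to shares within each cohort (margin = 1 means by row)
shares <- prop.table(table(toy_label$cohort, toy_label$label), margin = 1)

round(shares, 3)
```
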
## **5. Create the Features**

In this section, we clean up the rest of the variables and create the features (i.e., predictor variables, independent variables, or X). The list below shows the variables that are available to us in this project. There are two types of features: categorical features and numeric features. Later, we will transform them into the format that we can use in a machine learning model. But first, we need to clean up the data, generate a few variables, such as age at program entry, and regroup categorical variables, such as converting 6-digit NAICS code to 2-digit NAICS code.

- **Categorical features**: gender, race, ethnicity, disability, education, naics_code, occupation_code 
- **Numeric features**: birth_date(age), total_pay(weekly average), wages_2019, county-level average consumer spending

#### **Clean Up Data and Features**

Our current claims data is in long format, i.e., each record represents a certified claim during a specific week and each claimant could have more than one record. However, **in the data we need for the machine learning model, each row should represent a claimant in a specific cohort.** Technically, a claimant can appear in both cohorts, but since our first cohort entered only 13 weeks prior to the second cohort, this case doesn't apply to our data. Therefore, in this model, you can also interpret each row as a claimant. 

Most of our features are time-invariant, except for the total pay. It may change from week to week due to part-time employment, start and end of the Federal Pandemic Unemployment Compensation (FPUC), etc. Since we only need one record for each claimant, we need to aggregate the data on `ssn_id` and `cohort` and calculate each claimant's average weekly total pay first. 

Moreover, in administrative data, we often find that some people's time-invariant attributes, such as gender and race, change over time. These changes could be data errors and/or corrections of wrong attributes in the most recent records. In our data, we also find that a few claimants' characteristics are different across records. In this case, we sort the data in ascending order by `ssn_id` and `week_end_date` and, for each claimant, use his/her first record in the data. Some people's characteristics may have been corrected in the more recent records. However, remember that at the time of the prediction (7/5/2020), we don't know the testing cohort's corrected data yet and we need to avoid using data from the "future". 

In [None]:
#First, we need to calculate average weekly total pay
df_claimants <- df_claimants %>%
    group_by(ssn_id, cohort) %>%
    mutate(avg_total_pay = mean(total_pay)) %>%
    ungroup() #return data to non-grouped form. If we don't include this code, you may get errors in the later analysis

#Keep each person's first record, since we only need one set of features for each person
df_feature <- df_claimants %>% 
    arrange(ssn_id, week_end_date) %>% #Sort the data in ascending order by ssn_id and week_end_date
    group_by(ssn_id, cohort) %>%
    mutate(record_id = row_number()) %>% #create record ID for each person
    filter(record_id == 1)  %>% #Keep each person's first record 
    ungroup() #return data to non-grouped form. If we don't include this code, you may get errors in the later analysis

# See top records of df_feature
head(df_feature)

Now, our data is in the format we want. Next, we generate age at program entry and regroup categorical variables. To calculate age at program entry, we use the `difftime` function to get the number of days between a person's date of birth and date of UI program entry and then divide it by 365.25. A small number of claimants' recorded ages are too young, so we bottom code them to REDACTED. For the other categorical variables, we regroup them in the same way as we suggested in the two data exploration notebooks. **If a variable has missing values, we do not recommend dropping those records. Instead, we usually create an "Unknown" category for the variable.**

- **Race**: white, African American, other race, race unknown
- **Education Level**: less than high school, high school graduate or some college, Associate's degree, Bachelor's degree, Master's degree or higher, education unknown (coded as `Other` in the code below)
- **Occupation**: use 2-digit occupation codes and combine small categories if needed
- **Industry**: use 2-digit NAICS codes and combine small industries

In [None]:
# Calculate age at program entry and bottom code outliers
df_feature <- df_feature %>%
    mutate(age = as.integer(difftime(byr_start_week, birth_date, units='days')/365.25)) %>%
    mutate(age = case_when(age<REDACTED ~ as.integer(REDACTED), TRUE~age)) # A small number of people are too young. We bottom code their ages

# Recode gender, race, ethnicity, disability, and education
df_feature <- df_feature %>%
    mutate(gender = case_when(gender == 1 ~ 'Male', gender == 2 ~ 'Female', TRUE ~ 'Unknown'),
           race = case_when(race == 1 ~ 'White', race == 2 ~ 'African_American', race %in% c(3, 4, 6) ~ 'Other_race', TRUE ~ 'Unknown'),
           ethnicity = case_when(ethnicity == 1 ~ 'Hispanic', ethnicity == 2 ~ 'Not_Hispanic', TRUE ~ 'Unknown'),
           disability = case_when(disability == 1 ~'Disabled', disability == 2 ~ 'Not_Disabled', TRUE ~ 'Unknown'),
           education = case_when (education >= 1 & education <= 13 ~ "Less_than_HS",
                                    education >= 14 & education <= 18 ~ "HS_graduate_or_some_college",
                                    education >= 19 & education <= 20 ~ "Associate",
                                    education >= 21 & education <= 22 ~ "Bachelor",
                                    education >= 23 ~ "Master_or_higher",
                                    TRUE ~ "Other" ))

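The age calculation and bottom coding can be tried on a few made-up birth dates. In the sketch below, `min_age` is a made-up placeholder chosen only for illustration; it is not the threshold used in this notebook. `pmax()` is used here as a compact alternative to `case_when` for raising values below a floor.

```r
entry_date <- as.Date("2020-03-28")
birth_dates <- as.Date(c("1980-01-15", "1995-12-31", "2010-06-01"))

# Days between birth and program entry, converted to whole years (as.integer() truncates)
age <- as.integer(difftime(entry_date, birth_dates, units = "days") / 365.25)

# Bottom code: any age below the floor is raised to the floor.
# min_age is a hypothetical value, NOT the threshold used in this notebook.
min_age <- 16L
age_coded <- pmax(age, min_age)

data.frame(birth_dates, age, age_coded)
```
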
In [None]:
# Combine NAICS major codes based on grouping used for UI dashboard 
# Import NAICS code crosswalk
naics_groups <- read_csv('P:\\tr-dol-eta\\ETA Class Notebooks\\xwalks\\naics_groups.csv', col_types = "ccc")    

# Convert 6-digit NAICS codes to 2-digit NAICS codes by keeping only the first two characters
df_feature <- df_feature %>% mutate(naics_maj_code = substr(naics_code,1,2))

# Join NAICS groupings to claimant dataset
df_feature <- df_feature %>% 
    left_join(naics_groups, by = 'naics_maj_code') %>%
    mutate(naics_maj_code_rv = case_when(naics_maj_code %in% c('REDACTED') ~ 'REDACTED', TRUE ~ naics_maj_code_rv)) 

# Check the top records of df_feature
head(df_feature)

At this point, we have created all the variables we need. Let's remove the columns we no longer need so that it's easier to manage the data and we don't accidentally use them in the model.

In [None]:
#Only keep the columns we need
df_feature <- df_feature %>% 
    select(ssn_id, cohort, byr_start_week, # Variables to identify person and cohort
           gender, race, ethnicity, disability, education, naics_maj_code_rv, occupation_code, # Categorical Variables
           wages_2019, avg_spend_1, avg_spend_4, avg_spend_8, avg_total_pay, age) #Numeric variables

#### **Check Missing Values**

Before we run a machine learning model, we want to make sure that there are no missing values in our data. Previously, we recoded missing values of categorical variables into "unknown" categories. Let's check if our data has other missing values. The code below returns the number of missing values for each variable.

In [None]:
#Check which columns have missing values
sapply(df_feature, function(x) sum(is.na(x)))

We can see that some claimants' 2019 earnings are missing. There are various reasons for missing 2019 earnings. For example, some people may have only worked in 2020 but have earned enough income to qualify for UI benefits. Depending on your assumptions about the reasons for missing earnings, you may impute the earnings in different ways. Here, we replace missing 2019 earnings with the average earnings of claimants with the same education level. 

In [None]:
#Fill in missing values
#Replace missing 2019 earnings with the education level mean earnings
df_feature <- df_feature %>%
    group_by(education) %>%
    mutate(wages_2019 = replace(wages_2019, is.na(wages_2019), mean(wages_2019, na.rm = TRUE))) %>%
    ungroup()  #return data to non-grouped form. If we don't include this code, you may get errors in the later analysis

#Check missing values again
sapply(df_feature, function(x) sum(is.na(x)))

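The group-mean imputation can be verified on a small made-up table. The sketch below uses the same `replace` pattern as the code above; `toy` and its wage values are hypothetical. One thing to keep in mind is that if every value in a group were missing, `mean(..., na.rm = TRUE)` would return `NaN` and the gap would remain.

```r
library(dplyr)

# Made-up earnings with one missing value in each education group
toy <- tibble(
    education = c("Bachelor", "Bachelor", "Bachelor", "Associate", "Associate"),
    wages_2019 = c(40000, NA, 60000, 30000, NA)
)

# Replace each missing value with the mean of the non-missing values in the same education group
toy_imputed <- toy %>%
    group_by(education) %>%
    mutate(wages_2019 = replace(wages_2019, is.na(wages_2019), mean(wages_2019, na.rm = TRUE))) %>%
    ungroup()

print(toy_imputed)
```
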
#### **Numeric Features**

Next, we need to convert the features into formats that can be used in ML models. **Specifically, we use the `scale()` function to standardize all of our numeric features.** That way, the importance of variables of different magnitudes/units can be assessed on the same scale. This also enables our models to work more efficiently than when we don't scale numeric variables. For example, `wages_2019` has values such as 1,000 or 10,000, but `age` is less than 100. We usually consider 10 years as a big age difference, but $100 as a very small annual income difference. However, in a distance-based ML algorithm, these differences are treated as plain numbers and a difference of 100 is more significant than a difference of 10. It's very likely that annual income will be used as a more important predictor than age simply because income has a larger magnitude than age.

The `scale()` function subtracts a variable's mean from each value and then divides the result by the variable's standard deviation. The scaled variables' means are 0 and their standard deviations are 1. **Note that you need to scale the training set's and the testing set's data separately**, because a variable's mean and standard deviation may be different in the training set than in the testing set. In the code below, we group the data by `cohort` first and then scale the numeric features.

In [None]:
# Scale the numeric variables by cohort
df_feature <- df_feature %>%
    group_by(cohort) %>%
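    # scale() returns a one-column matrix; as.numeric() keeps each new column a plain numeric vector
    mutate(age_scaled = as.numeric(scale(age)), avg_total_pay_scaled = as.numeric(scale(avg_total_pay)), 
           avg_spend_1_scaled = as.numeric(scale(avg_spend_1)), avg_spend_4_scaled = as.numeric(scale(avg_spend_4)),
           avg_spend_8_scaled = as.numeric(scale(avg_spend_8)), wages_2019_scaled = as.numeric(scale(wages_2019))) %>%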
    select(-c(age, avg_total_pay, avg_spend_1, avg_spend_4, avg_spend_8,wages_2019)) %>% # Remove columns we don't need
    ungroup()  #return data to non-grouped form. If we don't include this code, you may get errors in the later analysis

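To see what `scale()` does, the sketch below standardizes a made-up vector by hand and compares the result with `scale()`. The vector `x` is hypothetical. It also shows that `scale()` returns a one-column matrix rather than a plain vector, which is worth knowing when you store the result in a data frame column.

```r
# Made-up ages
x <- c(20, 30, 40, 50, 60)

# Standardize by hand: subtract the mean, divide by the standard deviation
z_manual <- (x - mean(x)) / sd(x)

# Standardize with scale() and drop the matrix structure
z_scale <- as.numeric(scale(x))

all.equal(z_manual, z_scale) # The two approaches agree
mean(z_scale)                # Approximately 0
sd(z_scale)                  # 1
is.matrix(scale(x))          # scale() itself returns a matrix
```
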
#### **Categorical Features**
We also use the `factor` function to turn all the categorical features into factor type. ML model functions will treat factor columns as sets of dummy variables. Note that we will write the data to a csv file and read it in the second ML notebook. During this process, factor type columns will be converted back to character type. So, **we will only use `gender` as an example to show you how to use the `factor` function here. We will convert all the categorical features into factor type at the beginning of the next notebook.** 

In [None]:
# Use gender as an example
# Convert character type into factor type; numeric variables can be converted into factor type as well
df_feature$gender <- factor(df_feature$gender)

#Check the types of ssn_id, gender, race. They should have <int>, <fct>, <chr> types, respectively.
head(df_feature %>%  select(ssn_id, gender, race))

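To illustrate how a factor turns into dummy variables, the sketch below uses base R's `model.matrix()`, which builds the kind of design matrix that formula-based model functions such as `glm` work with. The vector `race` here is made up. The first factor level (alphabetical by default) becomes the reference category and gets no column of its own.

```r
# Made-up categorical feature
race <- factor(c("White", "African_American", "Other_race", "White"))

# Levels are sorted alphabetically by default
levels(race)

# Expand the factor into an intercept plus one indicator column per non-reference level
dummies <- model.matrix(~ race)
colnames(dummies)
dummies
```
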
## **6. Combine DataFrames**

Finally, we combine the label DataFrame we created in section 4, `df_label`, and the feature DataFrame we created in section 5, `df_feature`. We export the final DataFrame `df_ml` to a csv file and will use it in the second ML notebook.

In [None]:
# Combine the label DataFrame (df_label) 
# and the feature DataFrame with recoded categorical variables and scaled numeric variables (df_feature)
df_ml <- df_label[,c('ssn_id','cohort','label')] %>%
    left_join(df_feature, by = c('ssn_id','cohort')) # dplyr joins use the 'by' argument to specify the join keys

# See top records of the final DataFrame
head(df_ml)

<font color=red> Note that you need to create a "Data" folder in your "ETA Training" folder first. Then change the directory in the write.csv() statement below by replacing ".." with your username.</font>

In [None]:
# Export the data to a csv file. We will use it in the next notebook
write.csv(df_ml, "U:\\..\\ETA Training\\Data\\ml_data.csv", row.names = FALSE)

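The sketch below shows why factor columns need to be rebuilt after the csv round trip. It writes a made-up data frame to a temporary file (not to the project folder) and reads it back with base R's `read.csv`; the object names are hypothetical, and the reader function used in the next notebook may differ.

```r
# Made-up data with a factor column
toy <- data.frame(
    ssn_id = 1:3,
    gender = factor(c("Male", "Female", "Unknown"))
)

# Write to a temporary csv file and read it back
tmp <- tempfile(fileext = ".csv")
write.csv(toy, tmp, row.names = FALSE)
toy_back <- read.csv(tmp, stringsAsFactors = FALSE)

class(toy$gender)      # "factor" before writing
class(toy_back$gender) # "character" after reading

# Convert the column back to factor type before modeling
toy_back$gender <- factor(toy_back$gender)
```
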
## **Footnotes:**
<span id="fn1"> 1. The data is publicly available and can be downloaded  <a href='https://github.com/OpportunityInsights/EconomicTracker'>here</a>. </span>  

[[Go back]](#9)

<span id="fn2"> 2. Chetty, R., Friedman, J. N., Hendren, N., Stepner, M., & The Opportunity Insights Team. (2020). <a href='https://opportunityinsights.org/wp-content/uploads/2020/05/tracker_paper.pdf'>The economic impacts of COVID-19: Evidence from a new public database built using private sector data</a> (No. w27431). National Bureau of Economic Research. </span>  

[[Go back]](#10)

<span id="fn3"> 3. For more information about the data coverage, see: <a href='https://github.com/OpportunityInsights/EconomicTracker/blob/main/docs/oi_tracker_data_documentation.md'> Opportunity Insights Economic Tracker Data Documentation. </a> </span>  

[[Go back]](#11)

<span id="fn4"> 4.<a href='https://github.com/dssg/hitchhikers-guide/blob/master/sources/curriculum/3_modeling_and_machine_learning/machine-learning/machine_learning_clean.ipynb'> Hitchhiker's Guide to Data Science for Social Good. </a> </span>  

[[Go back]](#12)

> Note that the above links don't work inside of the ADRF since you don't have internet access.

> Click [Go back] to go back to where you were.