<center><br><br>
    Arkansas Work-Based Learning to Workforce Outcomes <br>
    Applied Data Analytics Training | Spring 2022
    <h1> Exploratory Data Analysis and Dataset Introduction </h1>
    <span style="font-size: 1.5em;">
        <a href='https://www.coleridgeinitiative.org'>Coleridge Initiative</a>
    </span>
    <center>Robert McGough, Joshua Edelmann, Benjamin Feder</center>
</center>

***

Exploratory Data Analysis (EDA) is a vital first step in any data analysis process. It provides an opportunity to get a better sense of the data available in your project and may provide interesting insights worth exploring in the future. In this notebook, we will walk through a basic EDA process on the primary dataset available for you in this training program, the RAPIDS (Registered Apprenticeship) data.

Even if you are confident your analysis plan is fully developed, going through the EDA process is still essential, because it also serves as a data quality check.

## 1. Getting started

Before we can dive into the data, we need to load certain packages in R and establish a connection to the proper data source. You will need to do this in every R notebook you create, and we recommend copying these first two code blocks to start any R notebook in the future.

> Note: The `options` and `suppressMessages` functions prevent a long warning message from being displayed after running the first code block.

In [None]:
options(warn = -1)                   # switches warnings off

suppressMessages(library(odbc))      # allows R to connect with the database
suppressMessages(library(tidyverse)) # useful for data manipulation and visualization
suppressMessages(library(scales))    # to calculate percentages, graphing
suppressMessages(library(lubridate)) # for easy working with dates 

options(warn = 0)                    # switches warnings on 
options(scipen=999)                  # prevents scientific notation

In [None]:
# server connection
con <- DBI::dbConnect(odbc::odbc(),
                     Driver = "SQL Server",
                     Server = "msssql01.c7bdq4o2yhxo.us-gov-west-1.rds.amazonaws.com",
                     Trusted_Connection = "True")

## 2. RAPIDS Data 

The RAPIDS data is publicly available, individual-level, de-identified data provided by the Employment and Training Administration. RAPIDS covers 43 states/territories, and you will have access to a crosswalk table (**ar_rapids_xwalk**) provided by the Arkansas Office of Skills Development (**ds_ar_osd**) to link this data to other Arkansas administrative data. The primary RAPIDS-based table we will be working with is **rapids_apprentice**, which is stored in the public database (**ds_public_1**).

> You will get an error if you try to read in all of the columns from **rapids_apprentice**, so you must subset variables before bringing data into R.

### Defining a Row

We will begin our EDA for the Arkansas RAPIDS data by taking a quick look at the data and then defining a row. Depending on the dataset, a row may represent many different things; it may be a person, an apprenticeship, or something else entirely. Understanding what a row represents will help you plan your analytical frame and anticipate the decisions you may need to make (filters, deduplication, etc.) to build it.

First, we will explore five rows of the data and then count the number of rows and compare that to the number of records of individuals and total apprenticeships in the data.

> Use the data documentation to look up what each of these columns represents.

In [None]:
# Select first 5 rows of AR RAPIDS data
# Join RAPIDS data to AR crosswalk to subset the RAPIDS data
query <- "
SELECT TOP 5
RX.SSN,
RA.apprnumber,
RA.naicscode,
RA.occupationtitle,
RA.onetsoccode,
RA.progstate,
RA.progzip5,
RA.county,
RA.termlengthmin,
RA.gender,
RA.race,
RA.ethnicity,
RA.vetstatind,
RA.disabled,
RA.ageatstart,
RA.startingwage,
RA.startdt,
RA.exitwage
FROM 
ds_public_1.dbo.rapids_apprentice RA
JOIN ds_ar_osd.dbo.ar_rapids_xwalk RX 
ON (RX.rapids_number=RA.apprnumber) --RESTRICT DATA SET TO ARKANSAS LINKAGE;
"

# read in query results in the object "temp"
temp <- dbGetQuery(con, query)

# view the temp data frame
temp

In [None]:
# see variable names in temp
names(temp)

# remove temp from the environment
# since we only used temp to see the variable names, we will remove it for efficiency
rm(temp)

After referring to the data dictionary, we can see that **ssn** and **apprnumber** are the person-level and apprenticeship-level identifiers in the data set, respectively.

In [None]:
# Find number of rows, unique people, and unique apprenticeships in RAPIDS data

query <- "
SELECT COUNT(*) AS number_rows, COUNT(DISTINCT(RX.ssn)) AS number_people, COUNT(DISTINCT(RA.apprnumber)) AS count_appr
FROM ds_public_1.dbo.rapids_apprentice RA
JOIN ds_ar_osd.dbo.ar_rapids_xwalk RX 
ON RX.rapids_number = RA.apprnumber;
"

dbGetQuery(con, query)

Each row of **rapids_apprentice** represents an observation--or a record--of a person's apprenticeship. Note that the number of unique individuals (**ssn**) is equal to the total number of rows and apprenticeships (**apprnumber**) in the sample, suggesting that an individual will only have one record of an apprenticeship.
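The same row-definition check can be repeated in R once data has been read in. The sketch below is only an illustration: `toy` is a small made-up data frame with invented identifiers (not RAPIDS records), built so that one person appears twice and you can see what a repeated identifier would look like.

```r
library(dplyr)

# Made-up example data (hypothetical): three apprenticeship records for two people
toy <- data.frame(
    ssn        = c("A1", "A1", "B2"),
    apprnumber = c("X100", "X101", "X200"),
    stringsAsFactors = FALSE
)

# Compare the number of rows to the number of unique identifiers
row_check <- toy %>%
    summarize(
        number_rows   = n(),
        number_people = n_distinct(ssn),
        count_appr    = n_distinct(apprnumber)
    )

row_check

# List any people with more than one record
repeat_people <- toy %>%
    count(ssn, name = "n_records") %>%
    filter(n_records > 1)

repeat_people
```

If `number_people` were smaller than `number_rows` in the real data, a table like `repeat_people` would be a starting point for deciding how to handle individuals with multiple apprenticeships.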

### Range of the Data Source

Within the project scoping process, it is essential to understand the coverage of the data from a time perspective. In this training program, where groups are expected to build out longitudinal analyses, confirming the range of the data source is a necessary part of EDA. 

> Note: We will read the data into R, which may take a few seconds to complete. You can continue to explore the data in SQL as well.

In [None]:
# query the RAPIDS data for Arkansas
# please feel free to select in different variables!
query <- "
SELECT
RX.ssn,
RA.apprnumber,
RA.naicscode,
RA.occupationtitle,
RA.onetsoccode,
RA.progstate,
RA.progzip5,
RA.county,
RA.termlengthmin,
RA.gender,
RA.race,
RA.ethnicity,
RA.vetstatind,
RA.disabled,
RA.ageatstart,
RA.startingwage,
RA.startdt,
RA.exitwage
FROM 
ds_public_1.dbo.rapids_apprentice RA
JOIN ds_ar_osd.dbo.ar_rapids_xwalk RX 
ON RX.rapids_number=RA.apprnumber;
"

# read in query results to a data frame in R
df_rapids_apprentice <- dbGetQuery(con, query)

# View the first 6 observations
head(df_rapids_apprentice)

If you scroll to the end of the output above, you can see that **startdt**, the start date of the apprenticeship, is stored as a character string instead of a date. Since **startdt** is in month-day-year order, the corresponding `mdy` function from the `lubridate` package can convert it to a date.

In [None]:
# find range of startdt
df_rapids_apprentice <- df_rapids_apprentice %>% 
    mutate(
        startdt=mdy(startdt)
    ) 

df_rapids_apprentice %>% 
    pull(startdt) %>%
    range()

Data between July 1997 and January 2022 provide a long time frame over which to analyze those participating in registered apprenticeship programs in Arkansas.
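Before relying on a date range, it can help to confirm that every string was converted successfully, since `mdy` returns `NA` for values it cannot parse. The following is a minimal sketch on made-up strings (`raw_dates` is hypothetical, not taken from the RAPIDS data) showing one way to count parse failures.

```r
library(lubridate)

# Made-up date strings in month-day-year order; the last one is not a date
raw_dates <- c("07/01/1997", "01/15/2022", "unknown")

# quiet = TRUE suppresses the parse-failure warning so we can count failures ourselves
parsed <- mdy(raw_dates, quiet = TRUE)

# Number of strings that could not be converted to a date
n_failed <- sum(is.na(parsed))
n_failed

# Range of the dates that did parse
date_range <- range(parsed, na.rm = TRUE)
date_range
```

Note that values that were already missing before conversion would also be counted here, so a nonzero count is a prompt to look at the original strings rather than proof of a formatting problem.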

### Explore Columns of Interest

Columns often represent variables in data tables. At this point in your project, you may have identified certain columns of interest. We will walk you through exploring one numeric and one non-numeric variable.

#### Numeric Variable Exploration

Let's explore the **ageatstart** variable. If you look at the data frame above, you will see this variable is also stored as a character. We need to change this variable type to be a numeric variable. We do this so we can see the summary statistics for this variable. 

In [None]:
# see age distribution with a quick numerical summary
df_rapids_apprentice <- df_rapids_apprentice %>%
    mutate(
        ageatstart = as.numeric(ageatstart)
    ) 

df_rapids_apprentice %>%
    pull(ageatstart) %>%
    summary()
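Converting a character column with `as.numeric` turns any entry that is not a valid number into `NA` (with a warning), which can quietly change the summary statistics. The sketch below uses made-up values (`age_chr` is hypothetical) to show one way to find the entries lost in conversion.

```r
# Made-up ages stored as character; one entry is not a number
age_chr <- c("19", "24", "unknown", "31")

# suppressWarnings hides the "NAs introduced by coercion" warning
age_num <- suppressWarnings(as.numeric(age_chr))

# Entries that were present as text but became NA after conversion
failed <- !is.na(age_chr) & is.na(age_num)
sum(failed)
age_chr[failed]

# summary() reports the number of NA values alongside the usual statistics
summary(age_num)
```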

We can also view this distribution visually to help inform our understanding. With a numerical variable, a histogram can be a helpful visual option for exploring its distribution. We will leverage the `ggplot2` package (part of the `tidyverse`) to create a histogram of **ageatstart**.

In [None]:
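# see age distribution instead with a quick visual summary
# after_stat(density) scales the y-axis to density rather than counts
# (after_stat() requires ggplot2 >= 3.3.0 and replaces the deprecated stat())
df_rapids_apprentice %>%
    ggplot(aes(x=ageatstart, y = after_stat(density))) +
    geom_histogram()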

Notice that most of the individuals we see at this point in this dataset are under REDACTED years of age. This may align with our preconceived notion that the majority of individuals begin an apprenticeship in their late teens to early 20s.

If you are interested in employment histories prior to entering a registered apprenticeship program, you may want to include only those over a certain age when you later create your cohort.
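When no bin settings are supplied, `geom_histogram` falls back to 30 bins and prints a message suggesting you pick a `binwidth`. The sketch below shows one way to set the bins explicitly. The ages in `toy_ages` are made up, and the five-year bins starting at age 15 are an assumption chosen for illustration, not a recommendation for the RAPIDS data.

```r
library(ggplot2)

# Made-up ages (hypothetical), used only to illustrate the bin settings
toy_ages <- data.frame(ageatstart = c(18, 19, 19, 21, 22, 24, 27, 31, 35, 42))

# binwidth sets the width of each bar in years;
# boundary aligns the bin edges to multiples of 5 starting from 15
p <- ggplot(toy_ages, aes(x = ageatstart, y = after_stat(density))) +
    geom_histogram(binwidth = 5, boundary = 15)

p
```

Trying a few bin widths is a quick way to check whether a peak in the distribution is real or an artifact of where the bin edges fall.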

#### Non-Numeric Variable Exploration

Non-numeric variables can be explored in a different fashion. Whereas you can look at the distribution of a numeric variable by finding the mean or the median, non-numeric variables require different approaches. Here we will explore the number of individuals entering an apprenticeship by *year* by counting the number of individuals within each year of `startdt`. To do so, we can select only the year from a date using the `year` function from the `lubridate` package. 

> Note: Missing values will often appear as a separate category for non-numeric variables. We will discuss missingness in future lectures; within EDA, the goal is to identify potential missingness in key variables.
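As a small illustration of the note above, the sketch below counts a made-up column (`toy_years` is hypothetical) that contains missing values. With `dplyr::count`, `NA` is kept as its own group, so missingness is visible directly in the frequency table.

```r
library(dplyr)

# Made-up start years (hypothetical), including two missing values
toy_years <- data.frame(start_year = c(2015, 2016, 2016, NA, NA))

# count() keeps NA as a separate group in the result
year_counts <- toy_years %>%
    count(start_year)

year_counts
```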

In [None]:
# extract year and assign variable as start_year
df_rapids_apprentice <- df_rapids_apprentice %>%
    mutate(
        start_year = year(startdt)
    )

head(df_rapids_apprentice)

You can see the new **start_year** variable by scrolling to the right on the output above.

In [None]:
# count number of individuals entering registered apprenticeship programs
freq <- df_rapids_apprentice %>%
    group_by(start_year) %>%
    summarize(individuals = n_distinct(ssn)) %>%
    ungroup()

head(freq)

Because **start_year** takes many values, this distribution is harder to digest in a tabular format. We can instead view it with a line graph: although **start_year** is stored as a numeric variable, it is really a component of a date rather than a true numeric measure.

> Note: Another visual option for viewing non-numeric variables is a bar plot.

In [None]:
# line graph of frequency of individuals entering apprenticeship training by year
freq %>% 
    ggplot(aes(x = start_year, y = individuals)) +
    geom_line()

We can see that the number of individuals entering registered apprenticeship programs in Arkansas (among those in the crosswalk) increases sharply around 2015 and then drops off in 2022. The 2022 drop likely reflects partial coverage, since the start dates in the data end in January 2022. We recommend choosing an analytical frame of individuals who started no earlier than 2015.
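Since the start dates in this data end in January 2022, the final year is only partially observed, and its count is not comparable to earlier years. The sketch below shows one possible way to flag partially observed years before plotting. The dates in `toy` are made up, and the `complete_year` column is a hypothetical helper introduced here for illustration.

```r
library(dplyr)
library(lubridate)

# Made-up start dates (hypothetical); the data ends partway through 2022
toy <- data.frame(
    startdt = ymd(c("2020-03-01", "2020-11-15", "2021-06-30",
                    "2021-12-01", "2022-01-10"))
)

# Latest date observed in the data
last_date <- max(toy$startdt)

freq_toy <- toy %>%
    mutate(start_year = year(startdt)) %>%
    count(start_year, name = "records") %>%
    mutate(
        # a year counts as fully observed only if it falls before the final year,
        # or if the data runs through December 31 of the final year
        complete_year = start_year < year(last_date) |
            last_date == make_date(year(last_date), 12, 31)
    )

freq_toy
```

Filtering to `complete_year` before drawing the line graph would keep the partial final year from reading as a real decline.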

## 3. References

TDC EDA Notebook (link to come)