<center><br><br>
    Arkansas Work-Based Learning to Workforce Outcomes <br>
    Applied Data Analytics Training | Spring 2022
    <h1> Linked Dataset Construction for Longitudinal Analysis </h1>
    <span style="font-size: 1.5em;">
        <a href='https://www.coleridgeinitiative.org'>Coleridge Initiative</a>
    </span>
    <center>Robert McGough, Nishav Mainali, Benjamin Feder, Josh Edelmann</center>
</center>

***

This notebook covers record linkage and creating a linked dataset that allows for longitudinal analysis for our cohort. 

Analyses involving administrative data often require:
-	Linking observations from multiple sources
-	Mediating differences in semantics
-	Mediating differences in grain (month versus quarter)
-	Mediating differences in cardinality and the potential to unintentionally exclude or overreport values
-	Facilitating intuitive and efficient processing and analysis of very large record sets
-	Mediating differences in names and relationships over time

This notebook will introduce and demonstrate some helpful techniques for linking administrative data while mediating the above issues.  The output of the notebook should provide a flexible and performant framework that meets the needs of most projects and can be easily customized to include additional variables or characteristics.

The linked data asset documented in this notebook has already been completely created and loaded in the **tr_ar_2022 database** as tables beginning with an "AR" prefix, with the final fact table titled **AR_FACT_Quarterly_Observation**.  This notebook will not create or load duplicative copies of the linked dataset but rather cover the techniques used to construct and load the model and hopefully serve as a resource to use when building future linked data sets. 

In [None]:
# Switching off warnings
options(warn = -1)

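# Database interaction imports
# DBI supplies dbConnect() and dbGetQuery(); odbc supplies the SQL Server driver
suppressMessages(library(DBI))
suppressMessages(library(odbc))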

# data manipulation/visualization
suppressMessages(library(tidyverse))

# scaling data, calculating percentages, overriding default graphing
suppressMessages(library(scales))

# for as.yearqtr()
suppressMessages(library(zoo))

# Switching warnings back on
options(warn = 0)

In [None]:
# Connect to the database
con <- DBI::dbConnect(odbc::odbc(),
                     Driver = "SQL Server",
                     Server = "msssql01.c7bdq4o2yhxo.us-gov-west-1.rds.amazonaws.com",
                     Trusted_Connection = "True")

## Record Linkage

Record linkage is an important component of almost any analysis, and especially of analyses that link administrative records; the only exception would be a fictitious, perfectly curated dataset with no messiness or missing variables. Unlike survey data, which allow for carefully selected variables (with some potential for messiness), administrative data are tailored to administrative purposes, not academic ones. That means we will not have all of the variables we would ideally want, and the data can be messy (missing responses, or variables that we do not fully understand or do not have at our disposal). While we may not directly address missing responses (more on indirectly addressing this in the inference lecture), we can enrich our dataset by pulling in relevant information from other datasets. Below, we describe how to link registered apprenticeship participants and UI wage record data to create a panel of individual records over time. We also describe some of the issues that arise when linking records of various sorts.

## Dimensional Modeling

To facilitate easy and performant analysis of very large record sets (quarterly wages), we will be formatting the data in a dimensional model.  This type of model:
-	Facilitates efficient and intuitive storage, processing, and analysis of very large, linked data sets
-	Facilitates slicing, dicing, and drilling for exploratory data analysis
-	Provides excellent performance for dashboards and visualizations

The modeling process involves “conjugating” the data into events and observations (verbs/facts) and the entities and attributes with which they are associated and by which they are analyzed (nouns/dimensions) (Kimball and Ross, 2019).

The SQL scripts for the actual creation of the dimensional model used in this notebook are in the supplemental folder in a subfolder titled "Linked Data Model Scripts." They include statements to create the tables and foreign key constraints/indices used to enforce relational integrity and enhance query performance.  Logical and physical diagrams and metadata for this dimensional model have been added to the References page of the class site.

You will not need to create additional tables, but you may wish to review the SQL as a reference for creating dimensional models for future projects.

### Merged Dimensions

The modeling process starts with identifying the “dimensions” that describe the observations of interest and by which they will be analyzed.
These will be combined into dimension entities (tables) that merge attributes (columns) from multiple data sources.  Some of the advantages of using merged dimensions include:

-	Mediating differences in semantics
-	Facilitating easy hierarchy navigation
-	Improving query performance by reducing the number of joins involved and facilitating joins to the fact table with numeric surrogate IDs that require less storage space than character-based natural keys
-	Allowing for easy expansion with additional attributes without disrupting the much larger table of observations
-	Referencing data dimensions that have an external registration authority for interoperability across departments, states, and sectors (such as FIPS county and NAICS codes)
-	Accommodating changes in naming or attributes over time through a “time variant” or “slowly changing” dimension (this technique was not used in this model)

The selection logic for the county dimension is below, illustrating some of the techniques used to pull together and format merged dimensions.  The complete SQL script also includes an insert statement for loading the table.

In [None]:
# county dimension example
qry <- "SELECT 
CAST(C.Code AS SMALLINT) AS County_ID,  --HERE WE ARE CASTING THE FIPS COUNTY ID AS AN INTEGER REPRESENTATION FOR MORE PERFORMANT JOINS.  THE FEWER BYTES THE BETTER.
C.Code AS County_Code, --HERE WE ARE RENAMING THE FIPS COUNTY CODE TO A MORE INTUITIVE NAME FOR WRITING SQL AND ANALYZING RESULTS
C.Name AS County_Name,
'AR' AS State_Code, --WE DON'T HAVE STATE CODE IN OUR TABLE, SO WE ARE INCLUDING IT AS A STRING
'Arkansas' AS State_Name,
C.Rural_Urban_Continuum AS 'Rural_Urban_Continuum_Code',

--HERE WE ARE DECODING THE USDA RURAL URBAN CONTINUUM CODE TO THE DESCRIPTIVE NAMES FOR EACH CODE
CASE WHEN C.Rural_Urban_Continuum = 1 THEN 'Counties in metro areas of 1 million population or more'
	WHEN C.Rural_Urban_Continuum = 2 THEN 'Counties in metro areas of 250,000 to 1 million population'
	WHEN C.Rural_Urban_Continuum = 3 THEN 'Counties in metro areas of fewer than 250,000 population'
	WHEN C.Rural_Urban_Continuum = 4 THEN 'Urban population of 20,000 or more, adjacent to a metro area'
	WHEN C.Rural_Urban_Continuum = 5 THEN 'Urban population of 20,000 or more, not adjacent to a metro area'
	WHEN C.Rural_Urban_Continuum = 6 THEN 'Urban population of 2,500 to 19,999, adjacent to a metro area'
	WHEN C.Rural_Urban_Continuum = 7 THEN 'Urban population of 2,500 to 19,999, not adjacent to a metro area'
	WHEN C.Rural_Urban_Continuum = 8 THEN 'Completely rural or less than 2,500 urban population, adjacent to a metro area'
	WHEN C.Rural_Urban_Continuum = 9 THEN 'Completely rural or less than 2,500 urban population, not adjacent to a metro area'
	END AS Rural_Urban_Continuum_Name,

C.Local_Workforce_Development_Area

FROM 
ds_ar_dws.dbo.CountyByLWDA C;"

countydim <- dbGetQuery(con, qry)

head(countydim)
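
As an aside, the decoding done by the CASE expression above can also be expressed as a join to a small lookup table. The sketch below uses base R and toy data so that it runs without a database connection; the `counties` rows and the `rucc_lookup` and `county_dim_toy` names are hypothetical, and the continuum codes assigned to the sample counties are made up for illustration.

```r
# Hypothetical county rows (the continuum codes here are illustrative only)
counties <- data.frame(
  County_Code = c("05001", "05003", "05005"),
  Rural_Urban_Continuum_Code = c(6L, 9L, 7L),
  stringsAsFactors = FALSE
)

# Lookup table with the same descriptions used in the CASE expression
rucc_lookup <- data.frame(
  Rural_Urban_Continuum_Code = 1:9,
  Rural_Urban_Continuum_Name = c(
    "Counties in metro areas of 1 million population or more",
    "Counties in metro areas of 250,000 to 1 million population",
    "Counties in metro areas of fewer than 250,000 population",
    "Urban population of 20,000 or more, adjacent to a metro area",
    "Urban population of 20,000 or more, not adjacent to a metro area",
    "Urban population of 2,500 to 19,999, adjacent to a metro area",
    "Urban population of 2,500 to 19,999, not adjacent to a metro area",
    "Completely rural or less than 2,500 urban population, adjacent to a metro area",
    "Completely rural or less than 2,500 urban population, not adjacent to a metro area"
  ),
  stringsAsFactors = FALSE
)

# Cast the character FIPS code to an integer surrogate key, as the SQL does
counties$County_ID <- as.integer(counties$County_Code)

# Left join: every county is kept even if its code has no match in the lookup
county_dim_toy <- merge(counties, rucc_lookup,
                        by = "Rural_Urban_Continuum_Code", all.x = TRUE)
county_dim_toy <- county_dim_toy[order(county_dim_toy$County_ID), ]
county_dim_toy
```

A lookup table keeps the descriptions in one place if several queries need them, while the CASE expression keeps the dimension load self-contained; either approach yields the same decoded column.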


### Time Dimension

A special type of dimension that is helpful for longitudinal analysis is a time dimension.  This is a dimension that stores all possible values for a period of time (day, week, quarter, month, year) across a long period and allows for easy cross-referencing across time periods such as day to quarter, state fiscal year, academic year, etc.

Using an incrementing integer identifier as the primary key for time dimensions is particularly useful for longitudinal analysis as it facilitates comparison across periods through simple arithmetic.  For example, in order to find outcomes for the 4th quarter following quarter of completion t, you simply need to look up t+4.
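
To make the t+4 arithmetic concrete, here is a minimal base R sketch. It assumes a quarter dimension that starts at 2001 Q1 with `Quarter_ID` 1 and increments by one per quarter; the `quarter_dim_toy` table and the `quarter_id()` and `quarter_code()` helpers are hypothetical names introduced only for this illustration.

```r
# Toy quarter dimension: 2001Q1 has Quarter_ID 1 and each later quarter adds 1
quarter_dim_toy <- data.frame(
  Quarter_ID = 1:84,
  Calendar_Year = rep(2001:2021, each = 4),
  Calendar_Quarter = rep(1:4, times = 21)
)
quarter_dim_toy$Quarter_Code <- paste0(quarter_dim_toy$Calendar_Year, "Q",
                                       quarter_dim_toy$Calendar_Quarter)

# Hypothetical helper: quarter code -> integer ID
quarter_id <- function(code) {
  quarter_dim_toy$Quarter_ID[match(code, quarter_dim_toy$Quarter_Code)]
}

# Hypothetical helper: integer ID -> quarter code
quarter_code <- function(id) {
  quarter_dim_toy$Quarter_Code[match(id, quarter_dim_toy$Quarter_ID)]
}

completion_id <- quarter_id("2015Q1")  # quarter of completion, t
quarter_code(completion_id + 4)        # fourth quarter after completion: "2016Q1"
```

Because the key is a plain integer, the year boundary needs no special handling: adding 4 to the ID for 2015 Q1 lands on 2016 Q1 without any date logic.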

By encoding all dates at a consistent grain (quarter) and representation (incrementing integer), we make it easy to conduct analyses based on relative longitudinal outcomes (4 quarters past exit for all completions between 2015 and 2018) in addition to absolute longitudinal outcomes (2015 Q3 employment for 2015 Q1 completers).  This is especially helpful when smaller data sets limit the cohort size for absolute cohort outcomes.

To construct the time dimension, we set up a loop that increments between a starting and ending period and derives various time period representations and relationships needed for our analysis.

In [None]:
# time dimension
qry2 <- "
SET NOCOUNT ON --ALLOWS US TO CREATE TEMP TABLES INSIDE AN R KERNEL IN JUPYTERLAB

DECLARE
@StartDate DATE,
@EndDate DATE,
@Date DATE,
@ID SMALLINT


SET @StartDate = '2001-01-01' -- DATE BEFORE EARLIEST DATE IN DATA OF INTEREST
SET @EndDate = '2021-12-31' -- DATE AFTER LATEST DATE IN DATA OF INTEREST
SET @ID = 1

SET @Date = @StartDate

--CREATE TEMPORARY TABLE TO STORE RESULTS.  THE REAL QUERY INSERTS TO THE TIME DIMENSION WITH EACH LOOP ITERATION. TEMP TABLES, AS YOU MAY HAVE GUESSED, ARE TEMPORARY IN NATURE
CREATE TABLE #Temp_AR_RDIM_Quarter (
	Quarter_ID smallint NOT NULL,
	Quarter_Code char(6) NULL,
	Calendar_Year smallint NULL,
	Calendar_Quarter tinyint NULL,
	Calendar_Month_Number_Start tinyint NULL,
	Calendar_Month_Number_End tinyint NULL,
	Start_Date date NULL,
	End_Date date NULL
)

--LOOP BETWEEN BEGIN AND END DATES DERIVING THE DESIRED VALUES FOR EACH PERIOD
WHILE @Date <= @EndDate
BEGIN

INSERT INTO #Temp_AR_RDIM_Quarter (
	Quarter_ID,
	Quarter_Code,
	Calendar_Year,
	Calendar_Quarter,
	Calendar_Month_Number_Start,
	Calendar_Month_Number_End,
	Start_Date,
	End_Date
)
VALUES
(
	@ID, --Quarter_ID,
	CAST(DATEPART(YY,@Date) AS CHAR(4)) + 'Q' + CAST(DATEPART(Q,@Date) AS CHAR(1)), --Quarter_Code
	DATEPART(YY,@Date), --Calendar_Year
	DATEPART(Q,@Date), --Calendar_Quarter
	DATEPART(MM,@Date), --Calendar_Month_Number_Start
	DATEPART(MM,@Date) + 2, --Calendar_Month_Number_End
	@Date, --Start_Date
	DATEADD(D,-1,DATEADD(Q,1,@Date)) --End_Date
)
	
	--INCREMENT THE DATE BY ONE QUARTER
	SET @Date = DATEADD(MM,3,@Date)
	--INCREMENT THE SURROGATE PRIMARY KEY ID BY ONE
	SET @ID = @ID + 1
END

--SELECT THE LOOP RESULTS
SELECT * FROM #Temp_AR_RDIM_Quarter

DROP TABLE #Temp_AR_RDIM_Quarter
;"

timedim <- dbGetQuery(con, qry2)

head(timedim)
tail(timedim)
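
For comparison, the same set of rows can be generated on the R side without a loop. This is only a sketch of an alternative, not the method used to load the actual table; it uses base R date sequences, and the `quarter_dim_r` name is hypothetical.

```r
start_date <- as.Date("2001-01-01")
end_date   <- as.Date("2021-12-31")

# First day of every quarter in the range
starts <- seq(start_date, end_date, by = "quarter")

# First day of the following quarter, used to derive each quarter's last day
next_starts <- seq(start_date, by = "quarter", length.out = length(starts) + 1)[-1]

month_start      <- as.integer(format(starts, "%m"))
calendar_year    <- as.integer(format(starts, "%Y"))
calendar_quarter <- (month_start - 1L) %/% 3L + 1L

quarter_dim_r <- data.frame(
  Quarter_ID = seq_along(starts),                      # incrementing surrogate key
  Quarter_Code = paste0(calendar_year, "Q", calendar_quarter),
  Calendar_Year = calendar_year,
  Calendar_Quarter = calendar_quarter,
  Calendar_Month_Number_Start = month_start,
  Calendar_Month_Number_End = month_start + 2L,
  Start_Date = starts,
  End_Date = next_starts - 1,                          # day before the next quarter starts
  stringsAsFactors = FALSE
)

head(quarter_dim_r)
tail(quarter_dim_r)
```

Both approaches rely on the start date falling on the first day of a quarter; with a mid-quarter start date, the derived start and end months would no longer line up with calendar quarters.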

## Mastering
Unlike reference data that is consistent across states (NAICS, SOC), master data refer to the unique collection of persons, employers, or households served by each state.  A state can have many different references to the same real-world entity, and mastering is the process of assembling a set that has one member (record) for each unique instance of an entity in the real world.  

This master record can merge attributes from multiple sources, resulting in a “golden record” with a higher completeness than is available in individual sources.  When multiple references to the same entity have different values, those differences are resolved through a process called survivorship in which decisions are made about which value to keep (most recent, most frequent, highest quality source, etc.).

In our example, there can be multiple QCEW records for each employer, so the record with the highest total wages from the most recent quarter was selected for the surviving NAICS and County values.  This was chosen because the most recent employer location and industry is the most relevant for supporting current work-based learning policy, strategy, and consumer information.  The optimal survivorship strategy may vary based on the intended audience and purpose.

We can also simplify more complex logic during the load process in order to make analysis easier, more performant, and more consistent across products.  For example, in this query we are decoding the Multi Establishment Employer Indicator (MEEI) to create a simpler flag for identifying which employers have multiple worksites/establishments reporting in a single record.  We sometimes wish to identify or exclude these from location-specific analyses since the location recorded is not necessarily the location of employment.

In [None]:
# QCEW mastering
# code takes a little while to run because it's processing so much data
empqry <- "
--DECLARE COMMON TABLE EXPRESSION TO RANK QCEW ENTRIES BY MOST RECENT REPORT AND HIGHEST TOTAL WAGES FOR DEDUPLICATING EMPLOYER RECORDS
WITH QCEW_EMPLOYERS_RANKED AS (
	SELECT
	QCEW.EIN,
	Q.Quarter_ID,
	I.NAICS_National_Industry_ID,
	C.County_ID,
	
	CASE WHEN QCEW.MEEI_Code IN (1,2,3) THEN 'N'
		WHEN QCEW.MEEI_Code IN (4,5,6) THEN 'Y'
		ELSE NULL END
	AS Multiple_Worksite_Record,
	
	QCEW.QCEW_ID,
	QCEW.Total_Wages AS Total_Wages,

	--THIS RANKS EACH EMPLOYER QCEW RECORD BY MOST RECENT AND HIGHEST TOTAL WAGES
	ROW_NUMBER ( )   
	    OVER (PARTITION BY 	QCEW.EIN ORDER BY Q.Quarter_ID DESC, QCEW.Total_Wages DESC) --MOST RECENT THEN HIGHEST WAGES
	AS ROW
	    
	FROM
	ds_ar_dws.dbo.qcew QCEW
	JOIN tr_ar_2022.dbo.AR_RDIM_NAICS_National_Industry I ON (I.NAICS_National_Industry_Code=QCEW.NAICS_Code)
	JOIN tr_ar_2022.dbo.AR_RDIM_County C ON (C.County_Code= QCEW.State_FIPS+QCEW.County_Code)
	JOIN tr_ar_2022.dbo.AR_RDIM_Quarter Q ON (Q.Calendar_Year=QCEW.Reference_Year) AND (Q.Calendar_Quarter=QCEW.Reference_Quarter)
	
	WHERE 
	QCEW.Total_Wages > 0
	AND QCEW.EIN IS NOT NULL
),

--DECLARE COMMON TABLE EXPRESSION TO SURVIVE QCEW ENTRIES BY MOST RECENT REPORT AND HIGHEST TOTAL WAGES FOR DEDUPLICATING EMPLOYER RECORDS
QCEW_EMPLOYERS_SURVIVED AS (
SELECT 
QER.EIN,
QER.Quarter_ID,
QER.NAICS_National_Industry_ID,
QER.County_ID,
QER.Multiple_Worksite_Record

FROM
QCEW_EMPLOYERS_RANKED QER
WHERE
QER.ROW = 1  --THIS SELECTS ONLY THE MOST RECENT EMPLOYER QCEW RECORD WITH THE HIGHEST WAGES
)

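--CREATE EMPLOYER MASTER DIMENSION WITH UNIQUE EMPLOYER IDENTITIES AND SURVIVED QCEW ATTRIBUTES
--THE ACTUAL LOAD IS AN INSERT OF OVER 100K ROWS.  THIS SELECTS A SAMPLE OF 100
--DISTINCT COLLAPSES THE MANY WAGE RECORDS PER EMPLOYER INTO ONE ROW PER EMPLOYER IDENTITY
SELECT DISTINCT TOP 100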
UIWL.Federal_EIN,
UIWL.State_EIN,
QCEW.NAICS_National_Industry_ID,
QCEW.County_ID,
QCEW.Multiple_Worksite_Record
FROM 
ds_ar_dws.dbo.ui_wages_lehd UIWL
LEFT JOIN QCEW_EMPLOYERS_SURVIVED QCEW ON (UIWL.Federal_EIN=QCEW.EIN)
;"

empdim <- dbGetQuery(con, empqry)

head(empdim)
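
The rank-then-filter survivorship logic in the query above can be illustrated on a handful of rows. The sketch below is base R on toy data; the `qcew_toy` rows are invented, and the column names simply mirror those in the query.

```r
# Hypothetical QCEW-style rows: employer A has three reports, employer B has two
qcew_toy <- data.frame(
  EIN = c("A", "A", "A", "B", "B"),
  Quarter_ID = c(80L, 84L, 84L, 83L, 82L),
  Total_Wages = c(900, 100, 500, 300, 700),
  MEEI_Code = c(1L, 1L, 4L, 2L, 2L),
  County_ID = c(5001L, 5003L, 5005L, 5007L, 5009L),
  stringsAsFactors = FALSE
)

# Decode MEEI into a simple flag, mirroring the CASE expression in the query
qcew_toy$Multiple_Worksite_Record <- ifelse(
  qcew_toy$MEEI_Code %in% 1:3, "N",
  ifelse(qcew_toy$MEEI_Code %in% 4:6, "Y", NA_character_)
)

# Rank within employer: most recent quarter first, then highest total wages
ranked <- qcew_toy[order(qcew_toy$EIN, -qcew_toy$Quarter_ID, -qcew_toy$Total_Wages), ]

# Survivorship: keep only the top-ranked row for each employer
survived <- ranked[!duplicated(ranked$EIN), ]
survived[, c("EIN", "Quarter_ID", "County_ID", "Multiple_Worksite_Record")]
```

Employer A keeps its quarter 84 record with wages of 500 rather than the older quarter 80 record with wages of 900, because recency is ranked ahead of wages. Swapping the sort order would implement a different survivorship rule.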


## Fact Table
The fact table stores the actual observations (facts) of interest.  Since this table often contains a large number of records, each row should ideally take up only a small number of bytes and consist primarily of indexed foreign keys to dimension tables and observation-specific measures.  This allows for storage of large record sets with low storage cost and extremely high query performance (extremely helpful for supporting dashboards).

In this example, the fact table is at the grain of one row per person per quarter.  We will create a record for every quarter between the first and last observations of a person in either the employment or apprenticeship data sets, regardless of employment or apprenticeship participation in a given quarter.  These “missing” observation quarters are materialized because unemployment and non-participation may be just as interesting for some analyses, and because longitudinal analysis benefits from consistent representation across time periods of consistent grain.
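
A minimal base R sketch of materializing the “missing” quarters is shown here on toy data. The `person_toy`, `wages_toy`, and `fact_toy` tables and their values are hypothetical, and the sketch covers only the employment range, not the apprenticeship range.

```r
# Hypothetical person dimension with first and last observed quarter IDs
person_toy <- data.frame(
  Person_ID = c(1L, 2L),
  First_Quarter_ID = c(10L, 12L),
  Last_Quarter_ID = c(13L, 13L)
)

# Hypothetical wage records: person 1 has no wages in quarters 11 and 12
wages_toy <- data.frame(
  Person_ID = c(1L, 1L, 2L, 2L),
  Quarter_ID = c(10L, 13L, 12L, 13L),
  Total_Wages = c(4000, 5200, 3100, 3300)
)

# One row per person per quarter between the first and last observation
grid <- do.call(rbind, lapply(seq_len(nrow(person_toy)), function(i) {
  data.frame(
    Person_ID = person_toy$Person_ID[i],
    Quarter_ID = seq(person_toy$First_Quarter_ID[i], person_toy$Last_Quarter_ID[i])
  )
}))

# Left join keeps the quarters that have no wage record
fact_toy <- merge(grid, wages_toy, by = c("Person_ID", "Quarter_ID"), all.x = TRUE)
fact_toy <- fact_toy[order(fact_toy$Person_ID, fact_toy$Quarter_ID), ]
fact_toy$Employed <- ifelse(is.na(fact_toy$Total_Wages), "N", "Y")
fact_toy
```

Person 1 ends up with four rows, two of them flagged as not employed, so a count of rows per quarter gives the right denominator for an employment rate without any further gap-filling.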

Some of our cohort members have observations for multiple employers in a single quarter.  Since our unit of analysis is the person, not the person-employer combination, we need to resolve these one-to-many relationships into a single observation while retaining the information pertinent to analysis.  In this example, the primary employer and associated wages were identified and recorded based on the employer with the largest wages.  In order to minimize loss of potentially relevant information, the total wages and number of employers are also included on each observation.
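
The collapse from person-employer-quarter rows to one row per person-quarter can be sketched as follows. This is base R on invented data; `wage_records` and `person_quarter` are hypothetical names, and the output columns mirror those in the fact table.

```r
# Hypothetical wage records: person 1 has two employers in quarter 10
wage_records <- data.frame(
  Person_ID = c(1L, 1L, 1L, 2L),
  Quarter_ID = c(10L, 10L, 11L, 10L),
  Employer_ID = c(501L, 502L, 501L, 503L),
  Wage = c(1200, 3400, 2500, 4100)
)
keys <- c("Person_ID", "Quarter_ID")

# Primary employer: the highest-wage record within each person-quarter
ranked <- wage_records[order(wage_records$Person_ID, wage_records$Quarter_ID,
                             -wage_records$Wage), ]
primary <- ranked[!duplicated(ranked[keys]), ]
names(primary)[names(primary) == "Employer_ID"] <- "Primary_Employer_ID"
names(primary)[names(primary) == "Wage"] <- "Primary_Employer_Wages"

# Summaries across all employers, so no wage information is discarded
totals <- aggregate(Wage ~ Person_ID + Quarter_ID, data = wage_records, FUN = sum)
names(totals)[names(totals) == "Wage"] <- "Total_Wages"
counts <- aggregate(Employer_ID ~ Person_ID + Quarter_ID, data = wage_records, FUN = length)
names(counts)[names(counts) == "Employer_ID"] <- "Employer_Count"

# Combine into one row per person per quarter
person_quarter <- Reduce(function(x, y) merge(x, y, by = keys),
                         list(primary, totals, counts))
person_quarter <- person_quarter[order(person_quarter$Person_ID,
                                       person_quarter$Quarter_ID), ]
person_quarter
```

For person 1 in quarter 10, employer 502 is recorded as primary with wages of 3400, while the total of 4600 and the employer count of 2 preserve the fact that a second job existed.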

> This query is looking at large volumes of data and will run for several minutes. It also contains some code for calculating employment measures, which is noted in the code comments and will be covered in the next section.

In [None]:
# Fact table of all individuals in RAPIDS or UI Wages data
factqry <- "
--DECLARE COMMON TABLE EXPRESSION FOR DISTINCT PERSON QUARTER OBSERVATIONS IN RANGE OF AR EMPLOYMENT OR APPRENTICESHIP
WITH Person_Quarter_Observations AS (
	SELECT
	P.Person_ID,
	Q.Quarter_ID,
		
	CASE WHEN (Q.Quarter_ID BETWEEN P.Apprenticeship_Start_Quarter_ID AND P.Apprenticeship_End_Quarter_ID) THEN 'Y'
		WHEN ((P.Apprenticeship_End_Quarter_ID IS NULL) AND (Q.Quarter_ID >= P.Apprenticeship_Start_Quarter_ID)) THEN 'Y'
		ELSE 'N'
	END AS Apprenticeship_Participation
	
	FROM 
	tr_ar_2022.dbo.AR_MDIM_Person P
	JOIN tr_ar_2022.dbo.AR_RDIM_Quarter Q ON
		(Q.Quarter_ID BETWEEN P.First_AR_Employment_Quarter_ID  AND P.Last_AR_Employment_Quarter_ID) --EMPLOYMENT QUARTERS
			OR (Q.Quarter_ID BETWEEN P.Apprenticeship_Start_Quarter_ID AND P.Apprenticeship_End_Quarter_ID) --COMPLETED APPRENTICE QUARTERS
			OR ((P.Apprenticeship_End_Quarter_ID IS NULL) AND (Q.Quarter_ID >= P.Apprenticeship_Start_Quarter_ID)) --CURRENT APPRENTICE QUARTER
),
--RANK EMPLOYMENT RECORDS FOR EACH PERSON AND QUARTER BY HIGHEST WAGE AMOUNT
Wage_Rank AS (
	SELECT
	P.Person_ID,
	Q.Quarter_ID,
	ROW_NUMBER() OVER(PARTITION BY P.Person_ID, Q.Quarter_ID ORDER BY W.Employee_Wage_Amount DESC) AS RANK,
	E.Employer_ID,
	W.Employee_Wage_Amount
	
	FROM 
	ds_ar_dws.dbo.ui_wages_lehd W
	JOIN tr_ar_2022.dbo.AR_MDIM_Person P ON (W.Employee_SSN=P.SSN)
	JOIN tr_ar_2022.dbo.AR_MDIM_Employer E ON (E.State_EIN=W.State_EIN)
	JOIN tr_ar_2022.dbo.AR_RDIM_Quarter Q ON ((W.Reporting_Period_Year=Q.Calendar_Year) AND (W.Reporting_Period_Quarter=Q.Calendar_Quarter))
),
--SELECT ONLY THE PRIMARY EMPLOYER RECORD FOR EACH PERSON AND QUARTER
Primary_Employer_Wage AS (
	SELECT
	WR.Person_ID,
	WR.Quarter_ID,
	WR.Employer_ID AS Primary_Employer_ID,
	WR.Employee_Wage_Amount AS Primary_Employer_Wages
	
	FROM
	Wage_Rank WR
	
	WHERE
	WR.RANK=1
),
--SELECT SUMMARY RECORDS ACROSS ALL EMPLOYERS FOR EACH PERSON AND QUARTER
All_Employer_Wage AS (
	SELECT 
	WR.Person_ID,
	WR.Quarter_ID,
	COUNT(WR.Employer_ID) AS Employer_Count,
	SUM(WR.Employee_Wage_Amount) AS Total_Wages
	
	FROM 
	Wage_Rank WR
	
	GROUP BY
	WR.Person_ID,
	WR.Quarter_ID
)

--LOAD QUARTERLY OBSERVATION FACT TABLE.  THE ACTUAL QUERY INSERTS 65M RECORDS.  WE ARE SELECTING A SMALL SAMPLE.
SELECT TOP 100
PQO.Person_ID,
PQO.Quarter_ID,
CASE WHEN PEW.Person_ID IS NULL THEN 'N' ELSE 'Y' END AS Employed,
PEW.Primary_Employer_ID,
PEW.Primary_Employer_Wages,
AEW.Total_Wages,
AEW.Employer_Count,
PQO.Apprenticeship_Participation,
--CREATING EMPLOYMENT MEASURES, DISCUSSED IN NEXT SECTION
CASE WHEN PEW.Primary_Employer_ID = PPEW.Primary_Employer_ID THEN 'Y'
	WHEN PEW.Primary_Employer_ID <> PPEW.Primary_Employer_ID THEN 'N' --<> IS THE SAME AS !=
	ELSE 'N'
END AS Primary_Employer_Beginning_of_Quarter_Employment,

CASE WHEN PEW.Primary_Employer_ID = SPEW.Primary_Employer_ID THEN 'Y'
	WHEN PEW.Primary_Employer_ID <> SPEW.Primary_Employer_ID THEN 'N'
	ELSE 'N'
END AS Primary_Employer_End_of_Quarter_Employment,

CASE WHEN (PEW.Primary_Employer_ID = PPEW.Primary_Employer_ID) AND (PEW.Primary_Employer_ID = SPEW.Primary_Employer_ID) THEN 'Y'
	WHEN (PEW.Primary_Employer_ID <> PPEW.Primary_Employer_ID) OR (PEW.Primary_Employer_ID <> SPEW.Primary_Employer_ID) THEN 'N'
	ELSE 'N'
END AS Primary_Employer_Full_Quarter_Employment

FROM
Person_Quarter_Observations PQO
LEFT JOIN Primary_Employer_Wage PEW ON (PEW.Person_ID=PQO.Person_ID) AND (PEW.Quarter_ID=PQO.Quarter_ID) 
LEFT JOIN All_Employer_Wage AEW ON (AEW.Person_ID=PQO.Person_ID) AND (AEW.Quarter_ID=PQO.Quarter_ID) 
LEFT JOIN Primary_Employer_Wage PPEW ON (PPEW.Person_ID=PQO.Person_ID) AND (PPEW.Quarter_ID=PQO.Quarter_ID-1) --PRIOR QUARTER (t-1) PRIMARY EMPLOYER WAGE
LEFT JOIN Primary_Employer_Wage SPEW ON (SPEW.Person_ID=PQO.Person_ID) AND (SPEW.Quarter_ID=PQO.Quarter_ID+1) --SUBSEQUENT QUARTER (t+1) PRIMARY EMPLOYER WAGE
;"

fact <- dbGetQuery(con, factqry)

head(fact)


## Employment Measures

While loading the fact table, we added some attributes to support job-level employment analysis, particularly when looking at retention.  Quarterly wages are reported if there is even one day of employment in a quarter, so it is helpful to know whether an individual was employed at the beginning of a quarter, the end of a quarter, or both (full quarter employment).  The three metrics below are based on the Quarterly Workforce Indicator definitions used by the U.S. Census Bureau, but they are restricted to the primary employer.

Primary Employer Beginning of Quarter Employment (**Primary_Employer_Beginning_of_Quarter_Employment** in the above query) - This is a Y/N flag that evaluates as true if the Primary Employer is the same in quarters t (reference) and t-1 (previous).  This is the same as the QWI Beginning-of-Quarter employment measure but only for the primary employer (highest wages).

Primary Employer End of Quarter Employment (**Primary_Employer_End_of_Quarter_Employment** in the above query) - This is a Y/N flag that evaluates as true if the Primary Employer is the same in quarters t (reference) and t+1 (subsequent).  This is the same as the QWI End-of-Quarter employment measure but only for the primary employer (highest wages).

Primary Employer Full Quarter Employment (**Primary_Employer_Full_Quarter_Employment** in the above query) - This is a Y/N flag that evaluates as true if the Primary Employer is the same in quarters t (reference), t-1 (previous), and t+1 (subsequent).  This is the same as the QWI Full-Quarter (Stable) Employment measure but only for the primary employer (highest wages).
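
The three flags can be reproduced on a toy history to see how they behave around a job change and a gap. The sketch below is base R; the `obs` data frame is hypothetical, with `NA` standing in for a quarter that has no wage record.

```r
# Hypothetical primary-employer history for one person, one row per quarter;
# NA means no wage record in that quarter
obs <- data.frame(
  Quarter_ID = 1:6,
  Primary_Employer_ID = c(501L, 501L, 501L, 502L, NA, 502L)
)

# Look up the neighbouring quarters by Quarter_ID arithmetic (t-1 and t+1)
prior     <- obs$Primary_Employer_ID[match(obs$Quarter_ID - 1L, obs$Quarter_ID)]
following <- obs$Primary_Employer_ID[match(obs$Quarter_ID + 1L, obs$Quarter_ID)]

# `%in% TRUE` turns NA comparisons into FALSE, like the ELSE 'N' branch in the SQL
same_as_prior <- (obs$Primary_Employer_ID == prior) %in% TRUE
same_as_next  <- (obs$Primary_Employer_ID == following) %in% TRUE

obs$Beginning_of_Quarter <- ifelse(same_as_prior, "Y", "N")
obs$End_of_Quarter       <- ifelse(same_as_next, "Y", "N")
obs$Full_Quarter         <- ifelse(same_as_prior & same_as_next, "Y", "N")
obs
```

Only quarter 2 is a full quarter in this history: quarter 1 has no prior record, quarter 3 precedes the switch to employer 502, and the gap in quarter 5 breaks continuity on both sides of it.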

The query below selects a sample person with more than two distinct primary employers so that you can see how the values change when the primary employer changes.

In [None]:
# employment measure example query
empmeasqry <- "
SELECT 
* 
FROM 
tr_ar_2022.dbo.AR_FACT_Quarterly_Observation 
WHERE
Person_ID IN (
	SELECT TOP 1
	F.Person_ID
	FROM 
	tr_ar_2022.dbo.AR_FACT_Quarterly_Observation F
	GROUP BY 
	F.Person_ID
	HAVING
	COUNT(DISTINCT F.Primary_Employer_ID) > 2
)
;"

empmeas <- dbGetQuery(con, empmeasqry)

empmeas


## References

Abowd, J. M., et al. (2006). The LEHD Infrastructure Files and the Creation of the Quarterly Workforce Indicators. https://lehd.ces.census.gov/doc/technical_paper/tp-2006-01.pdf

Kimball, R., & Ross, M. (2019). The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling, Ed. Wiley.