Project: 2023-006 Employment Center 2.0 Data QC \
Author: PSI \
Test Description: Validate Stored Procedure outputs by manually calculating outputs for File 1 and File 2 for select employment center characteristics 


In [None]:
import pandas as pd 
import numpy as np 
import pyodbc

File 1: DTE provided a shapefile for the new employment centers, which are based on MGRA15. DTE noted that this shapefile (a) does not contain the two "super-centers", Sorrento Valley and Kearny Mesa, which are each aggregations of two employment centers, and (b) does not include the new sub-centers. The MGRA13-to-MGRA15 xref file provided by the Estimates and Forecast team could not be used for this analysis because not every MGRA15 has an MGRA13 assigned to it: 1,051 MGRA15 have no MGRA13 xref. Instead, the QA team used ArcGIS Pro to identify which MGRA points fall within the employment center shapefile. The ArcGIS Pro output was saved in the results folder on SharePoint and imported into Python for further calculation.

In [4]:
# importing the employment center to MGRA15 xref file and mgra13-15 xref file
mgra_xref = pd.read_excel(r'C:\Users\psi\San Diego Association of Governments\SANDAG QA QC - Documents\Projects\2023\2023-006 Employment Centers 2.0\data\mgra13_15_xref.xlsx')
mgra_empct = pd.read_excel(r'C:\Users\psi\San Diego Association of Governments\SANDAG QA QC - Documents\Projects\2023\2023-006 Employment Centers 2.0\data\00_MASTER_MGRA15_to_EC2_XRef_2023_01_31.xlsx', sheet_name='MGRA-EC XRef 2023_01_24', skiprows=1)


In [5]:
mgra_empct= mgra_empct[["MGRA15", "EC_ID", "EC_Name"]].dropna().reset_index(drop= True)
mgra_empct_merge= pd.merge(mgra_empct, mgra_xref, on= "MGRA15", how= "left")

In [7]:
# Counting NaN values after the merge with the MGRA13-15 xref file. The MGRA13 count shows that not all MGRA15 have an MGRA13 assignment, so this file could not be used for the manual calculation
mgra_empct_merge.isna().sum()

MGRA15        0
EC_ID         0
EC_Name       0
MGRA13     1051
dtype: int64

In [13]:
#check duplicates on MGRA 15 in mgra_xref 
mgra15_dup= mgra_xref[mgra_xref.duplicated(['MGRA15'], keep=False)]

# check which MGRA 15 values in empct are not in mgra_xref 
mgra15_xref= list(mgra_xref["MGRA15"])
mgra15_na= mgra_empct[~mgra_empct["MGRA15"].isin(mgra15_xref)]
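
The coverage check above can also be done in a single merge. The following is a minimal sketch on made-up toy data (`empct_demo` and `xref_demo` are hypothetical stand-ins for `mgra_empct` and `mgra_xref`, not the project files); it uses the `indicator` option of `merge` to list keys that have no xref match and separately lists keys that occur more than once in the xref.

```python
import pandas as pd

# Hypothetical toy data standing in for mgra_empct and mgra_xref.
empct_demo = pd.DataFrame({"MGRA15": [1, 2, 3, 4], "EC_ID": [10, 10, 20, 20]})
xref_demo = pd.DataFrame({"MGRA15": [1, 2, 2], "MGRA13": [101, 102, 103]})

# indicator=True adds a "_merge" column recording where each row's key was found.
merged = empct_demo.merge(xref_demo, on="MGRA15", how="left", indicator=True)

# Rows flagged "left_only" have no MGRA13 assignment in the xref.
unmatched = merged.loc[merged["_merge"] == "left_only", "MGRA15"].tolist()

# MGRA15 values that appear more than once in the xref (one-to-many assignments).
duplicated_keys = (
    xref_demo.loc[xref_demo.duplicated("MGRA15", keep=False), "MGRA15"]
    .unique()
    .tolist()
)

print(unmatched)
print(duplicated_keys)
```

On the real data, the length of `unmatched` would be expected to line up with the NaN count reported for MGRA13, provided the xref itself has no blank MGRA13 cells.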

In [48]:
# Loading the ArcGIS Pro output, which assigns MGRA13 to employment centers
mgra13_empct_int = pd.read_excel(r'C:\Users\psi\San Diego Association of Governments\SANDAG QA QC - Documents\Projects\2023\2023-006 Employment Centers 2.0\results\EmpCent_MGRA13_Intersect_TableToExcel.xlsx')
mgra13_empct_int= mgra13_empct_int[["MGRA","EC_ID","FIRST_EC_I" ,"FIRST_EC_N" ]]
mgra13_empct_int.nunique()

MGRA          7678
EC_ID          101
FIRST_EC_I     101
FIRST_EC_N     101
dtype: int64

In [28]:
# Setting up the sql connection to bring in demographic warehouse data for total population [datasource_id=45]
conn = pyodbc.connect('Driver={ODBC Driver 17 for SQL Server};'
                    'Server=DDAMWSQL16.sandag.org;'
                    'Database=demographic_warehouse;'
                    'Trusted_Connection=yes;')

pop_query= '''SELECT  mgra_denormalize.mgra
,SUM(population.population) as pop
FROM demographic_warehouse.fact.population
INNER JOIN demographic_warehouse.dim.mgra_denormalize
ON mgra_denormalize.mgra_id = population.mgra_id
WHERE population.datasource_id = 45 AND population.yr_id = 2021
GROUP BY mgra_denormalize.mgra
 '''

# Run the population query defined above (the variable is pop_query)
mgra_pop= pd.read_sql_query(pop_query, conn)

In [188]:
# merging the employment center MGRA13 intersect file with mgra_pop query output from demographic_warehouse
empct_pop= pd.merge(mgra13_empct_int, mgra_pop, how= "left", left_on= "MGRA", right_on= "mgra")
empct_pop= empct_pop.groupby(['EC_ID', 'FIRST_EC_I', 'FIRST_EC_N'], as_index= False )['pop'].sum().sort_values(by=["EC_ID"])
#empct_pop
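
One point worth checking in the aggregation above: after a left merge, MGRAs with no population record carry NaN, and `sum()` skips NaN by default, so an employment center whose MGRAs are all unmatched would show 0 rather than a missing value. The sketch below illustrates this on made-up toy data (`intersect_demo` and `pop_demo` are hypothetical, not the project tables) and shows the `min_count=1` option as one possible way to keep such cases visible.

```python
import pandas as pd

# Hypothetical toy data: MGRA 3 has no population record.
intersect_demo = pd.DataFrame({"MGRA": [1, 2, 3], "EC_ID": [10, 10, 20]})
pop_demo = pd.DataFrame({"mgra": [1, 2], "pop": [50, 70]})

merged = intersect_demo.merge(pop_demo, how="left", left_on="MGRA", right_on="mgra")

# Number of MGRAs that found no population row in the merge.
n_unmatched = int(merged["pop"].isna().sum())

# Default sum treats NaN as absent, so EC_ID 20 totals 0.
default_sum = merged.groupby("EC_ID")["pop"].sum()

# min_count=1 requires at least one non-NaN value, so EC_ID 20 stays NaN.
strict_sum = merged.groupby("EC_ID")["pop"].sum(min_count=1)

print(n_unmatched)
print(default_sum)
print(strict_sum)
```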

In [146]:
# Comparing the QA calculation for total population with the File 1 total population numbers.
file1 = pd.read_excel(r'C:\Users\psi\San Diego Association of Governments\SANDAG QA QC - Documents\Projects\2023\2023-006 Employment Centers 2.0\data\EC2_Data_File_1_Build.xlsx', sheet_name= 'Output File Build', skiprows= 3)
file1= file1[["EC_ID", "EC_Name", "Pop_Total"]].sort_values(by=["EC_ID"])
comp_file1= pd.merge(file1, empct_pop, how= 'left', on= "EC_ID")
comp_file1['diff']= comp_file1['Pop_Total']- comp_file1['pop']
#comp_file1.to_excel(r'C:\Users\psi\San Diego Association of Governments\SANDAG QA QC - Documents\Projects\2023\2023-006 Employment Centers 2.0\results\file1_pop_sp_manual comparison.xlsx')- commented this out to avoid rewriting outputs
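
A possible follow-up to the comparison above is to separate two kinds of findings: employment centers with no manual value at all, and employment centers where both values exist but differ. This is a minimal sketch on made-up numbers (`file1_demo` and `manual_demo` are hypothetical and do not reflect the actual File 1 results).

```python
import pandas as pd

# Hypothetical toy data: EC_ID 3 has no manual value, EC_ID 2 differs by 10.
file1_demo = pd.DataFrame({"EC_ID": [1, 2, 3], "Pop_Total": [100, 250, 75]})
manual_demo = pd.DataFrame({"EC_ID": [1, 2], "pop": [100, 240]})

comp = file1_demo.merge(manual_demo, on="EC_ID", how="left")
comp["diff"] = comp["Pop_Total"] - comp["pop"]

# Centers with no manual calculation to compare against.
missing = comp[comp["pop"].isna()]

# Centers where both values exist and the difference is non-zero.
mismatch = comp[comp["diff"].notna() & (comp["diff"] != 0)]

print(missing[["EC_ID", "Pop_Total"]])
print(mismatch[["EC_ID", "Pop_Total", "pop", "diff"]])
```

Keeping the two groups apart avoids reading a NaN difference as agreement.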

File 2: Comparing SQL query outputs with the File 2 'Jobs_by_Demo_JT00' and 'Jobs_by_Demo_JT02' data. The QA team recreated the stored procedure (the queries for both JT00 and JT02 are saved in the results folder) and compared the query outputs with the File 2 outputs.

In [192]:
#JT00
file2_JT00= pd.read_excel(r'C:\Users\psi\San Diego Association of Governments\SANDAG QA QC - Documents\Projects\2023\2023-006 Employment Centers 2.0\data\EC2_Data_File_2_Build.xlsx', sheet_name= 'Jobs_by_Demo_JT00', skiprows= 2)
file2_JT00= file2_JT00.set_index(['tier', 'employment_center_id', 'employment_center_name'], inplace= False)
sql_output_JT00= pd.read_csv(r'C:\Users\psi\San Diego Association of Governments\SANDAG QA QC - Documents\Projects\2023\2023-006 Employment Centers 2.0\results\LEHD_queryresult_JT00.csv', index_col=False)
sql_output_JT00= sql_output_JT00.set_index(['tier', 'employment_center_id', 'employment_center_name'], inplace= False)
diff_JT00= sql_output_JT00.subtract(file2_JT00)


In [193]:
#JT02
file2_JT02= pd.read_excel(r'C:\Users\psi\San Diego Association of Governments\SANDAG QA QC - Documents\Projects\2023\2023-006 Employment Centers 2.0\data\EC2_Data_File_2_Build.xlsx', sheet_name= 'Jobs_by_Demo_JT02', skiprows= 2)
file2_JT02= file2_JT02.set_index(['tier', 'employment_center_id', 'employment_center_name'], inplace= False)
sql_output_JT02= pd.read_excel(r'C:\Users\psi\San Diego Association of Governments\SANDAG QA QC - Documents\Projects\2023\2023-006 Employment Centers 2.0\results\LEHD_queryresult_JT02.xlsx')
sql_output_JT02= sql_output_JT02.set_index(['tier', 'employment_center_id', 'employment_center_name'], inplace= False)
diff_JT02= sql_output_JT02.subtract(file2_JT02)
diff_JT02.head(3)

tier,employment_center_id,employment_center_name,jobs,male,female,age_lt30,age_30to54,age_55plus,educ30_lt_hs,educ30_hs,educ30_some_college,educ30_bachelor_plus,...,jobs_firms_age_0_to_1,jobs_firms_age_2_to_3,jobs_firms_age_4_to_5,jobs_firms_age_6_to_10,jobs_firms_age_11_plus,jobs_firms_size_0_to_19,jobs_firms_size_20_to_49,jobs_firms_size_50_to_249,jobs_firms_size_250_to_499,jobs_firms_size_500_plus
0,80,Kearny Mesa,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
0,81,Sorrento Valley,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
0,1001,Carslabd Palomar Airport Sub-Center: Airport,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
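
Because `subtract` aligns the two frames on their index labels, a row present in only one source yields NaN rather than an error. The sketch below, on made-up toy data with a shortened two-level index (`sql_demo` and `file_demo` are hypothetical), shows one way to list such rows and to pull out the rows with a non-zero difference.

```python
import pandas as pd

keys = ["tier", "employment_center_id"]

# Hypothetical toy data: center 5 exists only in the query output,
# center 6 only in the file, and center 81 differs by 5 jobs.
sql_demo = pd.DataFrame(
    {"tier": [0, 0, 1], "employment_center_id": [80, 81, 5], "jobs": [10, 20, 30]}
).set_index(keys)
file_demo = pd.DataFrame(
    {"tier": [0, 0, 1], "employment_center_id": [80, 81, 6], "jobs": [10, 25, 30]}
).set_index(keys)

diff = sql_demo.subtract(file_demo)

# Index labels present on one side only; these rows are NaN in diff.
only_in_sql = sql_demo.index.difference(file_demo.index)
only_in_file = file_demo.index.difference(sql_demo.index)

# Rows where at least one column has a non-zero difference.
nonzero = diff[(diff.fillna(0) != 0).any(axis=1)]

print(list(only_in_sql), list(only_in_file))
print(nonzero)
```

An all-zero `diff` with empty `only_in_sql` and `only_in_file` would indicate that the two sources agree on both coverage and values.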


In [194]:
# writing out the diff_JT00 and diff_JT02 outputs in excel  
with pd.ExcelWriter(r'C:\Users\psi\San Diego Association of Governments\SANDAG QA QC - Documents\Projects\2023\2023-006 Employment Centers 2.0\results\sp_validation.xlsx') as writer:
    diff_JT00.to_excel(writer, sheet_name='JT00')
    diff_JT02.to_excel(writer, sheet_name='JT02')

In [165]:
# comparing file 2 with sql query output-- JT00
# sql query output 
#sql_output_JT00= pd.read_csv(r'C:\Users\psi\San Diego Association of Governments\SANDAG QA QC - Documents\Projects\2023\2023-006 Employment Centers 2.0\results\LEHD_queryresult_JT00.csv', index_col=False)
#sql_output_JT00= sql_output_JT00.sort_values(by='employment_center_id', inplace= False)
#sql_output_JT00= sql_output_JT00.reset_index(drop= True, inplace= False)
#sql_output_JT00

#file2_JT00= pd.read_excel(r'C:\Users\psi\San Diego Association of Governments\SANDAG QA QC - Documents\Projects\2023\2023-006 Employment Centers 2.0\data\EC2_Data_File_2_Build.xlsx', sheet_name= 'Jobs_by_Demo_JT00', skiprows= 2)
#file2_JT00= file2_JT00.sort_values(by=["employment_center_id"], inplace= False)
#file2_JT00= file2_JT00.reset_index(drop= True, inplace= False)

#diff_JT00= file2_JT00.compare(sql_output_JT00, keep_shape=True, keep_equal= True)

#diff_JT00.to_excel(r'C:\Users\psi\San Diego Association of Governments\SANDAG QA QC - Documents\Projects\2023\2023-006 Employment Centers 2.0\results\JT00_sp_calcdiff.xlsx')



In [166]:
# comparing file 2 with sql query output-- JT02

#sql_output_JT02= pd.read_excel(r'C:\Users\psi\San Diego Association of Governments\SANDAG QA QC - Documents\Projects\2023\2023-006 Employment Centers 2.0\results\LEHD_queryresult_JT02.xlsx')
#sql_output_JT02= sql_output_JT02.sort_values(by='employment_center_id', inplace= False)
#sql_output_JT02= sql_output_JT02.reset_index(drop= True, inplace= False)
#sql_output_JT02

#file2_JT02= pd.read_excel(r'C:\Users\psi\San Diego Association of Governments\SANDAG QA QC - Documents\Projects\2023\2023-006 Employment Centers 2.0\data\EC2_Data_File_2_Build.xlsx', sheet_name= 'Jobs_by_Demo_JT02', skiprows= 2)
#diff_JT02= file2_JT02.compare(sql_output_JT02, keep_shape= True, keep_equal= True)
#diff_JT02

#diff_JT02.to_excel(r'C:\Users\psi\San Diego Association of Governments\SANDAG QA QC - Documents\Projects\2023\2023-006 Employment Centers 2.0\results\JT02_sp_calcdiff.xlsx')