# Join Flags for 2016-2020 joining
In SQL Server, an initial join was made between the 2020 employment inventory table and the 2016 table on LOCNUM (2020) = INFOID_16 (2016). Most records had a match from this join.

## Join methodology
Join SQL File - https://github.com/SACOG/emp-inventory/blob/main/EMP2020/SQL/Join16_20_Tests_SG.sql

Steps in SQL:
1. Join 2016 to 2020 based on LOCNUM = INFOID_16
2. If there was no 2016 match to a 2020 record after doing the ID-based join, then fill in missing 2016 values based on company name and street addresses both having exact matches (i.e., address2020 = address2016 and bizname2020 = bizname2016).
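
The two steps above can be sketched in plain Python. This is only an illustration of the logic, not the contents of the linked SQL file; the function name `join_2016_to_2020` and the sample records are hypothetical, and the sketch assumes each ID and each name/address pair appears at most once in the 2016 table.

```python
def join_2016_to_2020(recs20, recs16):
    """Attach a 2016 record to each 2020 record: by ID first, then by exact name + address."""
    by_id = {r["infoid16"]: r for r in recs16}
    by_name_addr = {(r["coname16"], r["staddr16"]): r for r in recs16}
    empty16 = {"infoid16": None, "coname16": None, "staddr16": None}

    joined = []
    for r20 in recs20:
        # Step 1: ID-based join (LOCNUM = INFOID_16)
        r16 = by_id.get(r20["locnum"])
        # Step 2: if no ID match, fall back to an exact name and address match
        if r16 is None:
            r16 = by_name_addr.get((r20["coname"], r20["staddr"]))
        joined.append({**r20, **(r16 or empty16)})
    return joined


# Hypothetical sample records, for illustration only
recs16 = [
    {"infoid16": 1, "coname16": "ACME HARDWARE", "staddr16": "123 MAIN ST"},
    {"infoid16": 2, "coname16": "RIVER CITY BAKERY", "staddr16": "45 OAK AVE"},
]
recs20 = [
    {"locnum": 1, "coname": "ACME HARDWARE INC", "staddr": "123 MAIN ST"},  # matches on ID
    {"locnum": 9, "coname": "RIVER CITY BAKERY", "staddr": "45 OAK AVE"},   # matches on name + address
    {"locnum": 7, "coname": "NEW CAFE", "staddr": "9 ELM ST"},              # no 2016 match
]

joined = join_2016_to_2020(recs20, recs16)
for row in joined:
    print(row["locnum"], row["infoid16"])
```

Rows that come out of step 2 carry a 2016 ID that differs from the 2020 LOCNUM, which is what the name/address flags described below pick up.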

The resulting table still has 2020 rows with no corresponding 2016 values. Some of these rows should have a 2016 value but do not, because the address or biz name changed slightly between the two years. One role of this script is to identify such cases using the `fuzzywuzzy` Python library.
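
To give a feel for what "changed slightly" looks like as a similarity score, here is a small sketch. It uses `difflib.SequenceMatcher` from the standard library as a stand-in for `fuzz.ratio`, so the scores may not be identical to what `fuzzywuzzy` returns. The helper name `similarity` and the sample strings are made up for illustration.

```python
from difflib import SequenceMatcher

match_threshold = 80  # same cutoff used later in this notebook


def similarity(a: str, b: str) -> int:
    """Return a 0-100 similarity score (stdlib stand-in for fuzz.ratio)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


# Hypothetical (2020 value, 2016 value) pairs
pairs = [
    ("ACME HARDWARE INC", "ACME HARDWARE"),      # suffix added to the name
    ("123 MAIN ST", "123 MAIN STREET"),          # abbreviation change
    ("ACME HARDWARE INC", "RIVER CITY BAKERY"),  # unrelated businesses
]

for val20, val16 in pairs:
    score = similarity(val20, val16)
    is_match = score > match_threshold
    print(f"{val20!r} vs {val16!r}: score={score}, fuzzy match={is_match}")
```

An exact string comparison would reject the first two pairs, while a score-based comparison with a cutoff of 80 keeps them and still rejects the third.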

## Fields considered
* LOCNUM / INFOID_16 - the "unique ID" fields for the respective years
* Business name field (coname / coname16)
* Business address field (staddr / staddr16)

## Possible check results
* FullExMatch = the LOCNUM in 2020 has a matching INFOID_16 value, and the biz name and address are an EXACT match
* FullFzMatch = the LOCNUM in 2020 has a matching INFOID_16 value, and the biz name and address are a FUZZY match (`fuzz.ratio` > 80)
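* IDMatchNameChg = the IDs match between the two years and the biz address is still a FUZZY match, but the biz name changed (`fuzz.ratio` of 80 or less)
* IDMatchAddrChg = the IDs match between the two years and the biz name is still a FUZZY match, but the biz address changed (`fuzz.ratio` of 80 or less)
* IDMatchNameAddrChg = the IDs match between the two years, but both the biz name and the biz address changed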
* NamAddrExMatch = IDs do not match between 2016 and 2020, but the business name and address are an EXACT match
* NamAddrFzMatch = IDs do not match between 2016 and 2020, but the business name and address are a FUZZY match (`fuzz.ratio` > 80)
* NotMatch16 = the IDs do not match, and the name and address are not both at least a FUZZY match



In [1]:
# define key variables and parameters
in_csv = r"\\data-svr\Monitoring\Employment Inventory\Employment 2020\SQL\TestDupeFlag20210412_1754_2.csv"

match_threshold = 80  # fuzzy scores at or below this number flag the two values as different

fld_coname20 = 'coname'
fld_locnum20 = 'locnum'
fld_staddr20 = 'staddr'
fld_zip = 'zip'
fld_naics = 'naics'
fld_naicsd = 'naicsd'
fld_home = 'home'
fld_locemp20 = 'locemp'
fld_latitude = 'latitude'
fld_longitude = 'longitude'
fld_geo_level = 'geo_level'
fld_naics4 = 'naics4'
fld_dupe_flag = 'dupe_flag'
fld_latlon_uid = 'latlon_uid'
fld_coname16 = 'coname16'
fld_staddr16 = 'staddr16'
fld_emp16 = 'emp16'
fld_notes16 = 'notes16'
fld_infoid16 = 'infoid16'

fld_jflag = 'join_flag'





In [7]:
# Load and set up table and flag function

import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz

df = pd.read_csv(in_csv)

df[fld_jflag] = '_'

df[fld_locnum20] = df[fld_locnum20].astype(str)
df[fld_infoid16] = df[fld_infoid16].astype(str)





In [8]:
df.loc[df[fld_locnum20] == df[fld_infoid16]].shape

(109111, 20)

In [9]:
# Find and flag records that have ID match in both years but whose names and address significantly differ

"""
* FullExMatch = the LOCNUM in 2020 has a matching INFOID_16 value, and the biz name and address are an EXACT match
* FullFzMatch = the LOCNUM in 2020 has a matching INFOID_16 value, and the biz name and address are a FUZZY match (`fuzz.ratio` > 80)
* IDMatchNameChg = the IDs match between the two years, but the biz name changed
* IDMatchAddrChg = the IDs match between the two years, but the biz address changed
* IDMatchNameAddrChg = the IDs match between the two years, but the biz address and the biz name changed
* NamAddrExMatch = IDs do not match between 2016 and 2020, but the business name and address are an EXACT match
* NamAddrFzMatch = IDs do not match between 2016 and 2020, but the business name and address are a FUZZY match (`fuzz.ratio` > 80)
* NotMatch16 = the IDs do not match, and the name and address are not both at least a FUZZY match
"""

def get_jflag_1(in_row):
    
    jflag_fullmatch = 'FullExMatch'
    jflag_idfuzzmatch = 'FullFzMatch'
    jflag_newname = "IDMatchNameChg" 
    jflag_newaddr = "IDMatchAddrChg" 
    jflag_nmaddrchg = "IDMatchNameAddrChg" 
    jflag_nmaddrematch = 'NamAddrExMatch' 
    jflag_nmaddrfmatch = 'NamAddrFzMatch' 
    jflag_nomatch16 = 'NotMatch16' 
    
    id16 = in_row[fld_infoid16]
    id20 = in_row[fld_locnum20]
    # str() turns missing values (NaN) into the literal string 'nan'
    name16 = str(in_row[fld_coname16])
    name20 = str(in_row[fld_coname20])
    addr16 = str(in_row[fld_staddr16])
    addr20 = str(in_row[fld_staddr20])

    id_match = id16 == id20
    name_addr_ematch = name16 == name20 and addr16 == addr20

    # fuzz.ratio returns a 0-100 similarity score
    name_fuzzmatch = fuzz.ratio(name20, name16) > match_threshold
    addr_fuzzmatch = fuzz.ratio(addr20, addr16) > match_threshold

    # a fuzzy match requires both the name and the address to clear the threshold
    name_addr_fmatch = name_fuzzmatch and addr_fuzzmatch
    
    if id_match:
        if name_addr_ematch:
            output = jflag_fullmatch
        elif name_addr_fmatch:
            output = jflag_idfuzzmatch
        elif name_fuzzmatch and not addr_fuzzmatch: # address changed
            output = jflag_newaddr
        elif addr_fuzzmatch and not name_fuzzmatch: # biz name changed
            output = jflag_newname
        elif not addr_fuzzmatch and not name_fuzzmatch: # biz name and address changed
            output = jflag_nmaddrchg
        else:
            output = 'ERROR'
    else:
        if name_addr_ematch:
            output = jflag_nmaddrematch
        elif name_addr_fmatch:
            output = jflag_nmaddrfmatch
        else:
            output = jflag_nomatch16
            
    
    return output
        
        
        
        
        

In [10]:
%timeit df[fld_jflag] = df.apply(get_jflag_1, axis=1)

testcols = [fld_locnum20, fld_coname20, fld_staddr20, fld_infoid16, fld_coname16, fld_staddr16, fld_jflag]





12.3 s ± 106 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [11]:
df[fld_jflag].value_counts()

# df.loc[(df[fld_jflag] == 'No16ID') & (pd.notnull(df['locnum'])) & (pd.notnull(df['infoid16']))][testcols].head()

# dft = df.loc[df[fld_locnum20].isin([104833801, 403881922])]
# print(dft.iloc[0]['locnum'])
# print(dft.iloc[0]['infoid16'])
# print(dft.iloc[0]['infoid16'] - dft.iloc[0]['locnum'])
# dft['lnum_str']

NotMatch16            262700
FullExMatch            72301
FullFzMatch            26461
IDMatchAddrChg          4970
IDMatchNameChg          4805
NamAddrExMatch          3963
IDMatchNameAddrChg       574
NamAddrFzMatch           143
Name: join_flag, dtype: int64
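
As a quick consistency check on the counts above, the five flags that require an ID match should add up to the number of ID-matched rows reported by the earlier `.shape` output (109111). The sketch below only re-adds the numbers printed in this notebook; the variable names are illustrative.

```python
# Counts copied from the value_counts() output above
counts = {
    "NotMatch16": 262700,
    "FullExMatch": 72301,
    "FullFzMatch": 26461,
    "IDMatchAddrChg": 4970,
    "IDMatchNameChg": 4805,
    "NamAddrExMatch": 3963,
    "IDMatchNameAddrChg": 574,
    "NamAddrFzMatch": 143,
}

# Flags that can only be assigned when LOCNUM == INFOID_16
id_match_flags = [
    "FullExMatch",
    "FullFzMatch",
    "IDMatchAddrChg",
    "IDMatchNameChg",
    "IDMatchNameAddrChg",
]

id_match_total = sum(counts[flag] for flag in id_match_flags)
total_rows = sum(counts.values())

print("ID-matched rows:", id_match_total)
print("All flagged rows:", total_rows)
```

The ID-matched total agrees with the 109111 rows found by the direct ID comparison, so no ID-matched row fell through to the `'ERROR'` branch.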

In [12]:
df.to_csv(r"\\data-svr\Monitoring\Employment Inventory\Employment 2020\SQL\Recs_w_2016jnflag_20210419.csv", index=False)