# Production Versions of Fuzzy Match Algorithms

## Algorithms Overview
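- We want to fuzzy match addresses and owner names to create clusters of records that belong to the same owner. However, naive fuzzy matching compares every record against every other record, so its cost grows quadratically with the number of records.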
- Further, we may want to do a fuzzy merge between owner address and/or name and business registry data.
- Therefore, we should first find approximate nearest neighbors (ANN), e.g. a group of owner names or addresses that are similar, rather than searching the entire search space. That way, fuzzy comparisons are only made within each reduced search space.
- The goal is to have an algorithm that is highly accurate while being efficient.
- We can do this by making some assumptions, as below.
### Assumptions
**For Addresses**
- *Zip code*. We can create clusters of potentially similar owners by using owner zip code.
- *Address number*. Within zip code search spaces, we can further narrow down the search by owner address number.
- Both the above comparisons are quick because they are vectorized.
- Once we find owners with the same zip code and address number, then we can compare names using fuzzy matching.
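
The address assumptions above can be sketched as a two-step process: group records by exact zip code and address number, then fuzzy match names only inside each group. This is a minimal sketch, not the production algorithm. The sample records are illustrative (the pairing of names to addresses is an assumption), the function names and the 0.85 threshold are hypothetical, and `difflib.SequenceMatcher` from the standard library stands in for whatever fuzzy matcher is eventually chosen.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Illustrative records: (owner name, address number, zip code).
# The name-to-address pairings are assumptions made for this example.
records = [
    ("HABITAT FOR HUMANITY IN ATLANTA INC", "519", "30312"),
    ("HABITAT FOR HUMANITY ATLANTA INC", "519", "30312"),
    ("CENTURY FUND LLC", "420253", "30342"),
    ("MAPLE STREET RE LLC", "420253", "30342"),
    ("EXAMPLE OWNER LLC", "100", "30315"),
]


def block_by_zip_and_number(rows):
    """Group unique owner names by exact (zip code, address number)."""
    blocks = defaultdict(set)
    for name, adrno, zip_code in rows:
        blocks[(zip_code, adrno)].add(name)
    return blocks


def fuzzy_pairs_within_blocks(blocks, threshold=0.85):
    """Compare names only inside each block; keep pairs at or above threshold."""
    matches = []
    for key, names in blocks.items():
        for a, b in combinations(sorted(names), 2):
            score = SequenceMatcher(None, a, b).ratio()
            if score >= threshold:
                matches.append((key, a, b, score))
    return matches


blocks = block_by_zip_and_number(records)
matches = fuzzy_pairs_within_blocks(blocks)
for key, a, b, score in matches:
    print(key, a, "<->", b, round(score, 3))
```

With five unique names, an exhaustive comparison would need ten fuzzy comparisons. Blocking reduces this to two, one per block that holds more than one name.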

**For Names**
- If we are only clustering on names, without considering addresses, there are few ways to reduce the search space other than the options listed under **Both**.

**Both**
- *Tokenize and use vector comparison to create ANN search space*. Within both owner names and owner addresses, it is likely that at least one token (i.e. a character sequence separated by spaces) will be identical to a token in another record of the same cluster. For instance, "PROGRESS" will likely appear in the owner name of any property owned by "PROGRESS RESIDENTIAL". For addresses, this token is typically the address number, but it can also be something like "MAIN" for "MAIN ST" or "MAIN STREET".
- *Word embeddings and Word2Vec*. Word embeddings represent each word token as a vector learned from the tokens that appear near it, so tokens that occur in similar contexts end up close together. A true ANN algorithm can then be run on these vectors, although the initial computation can be quite heavy. This approach may also help identify corporate names, because LLC and other indicator tokens appear frequently in those owner names.
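
The tokenization idea can be sketched with an inverted index that maps each token to the records containing it; only records that share a token become candidate pairs for fuzzy matching. This is a rough sketch under stated assumptions: `candidate_pairs` and `STOP_TOKENS` are hypothetical names, "PROGRESS RESIDENTIAL BORROWER LLC" is a made-up variant, and the stop-token list is an assumption added so that indicator tokens such as LLC do not link every corporate owner to every other.

```python
from collections import defaultdict
from itertools import combinations

# Assumed list of tokens too common to be useful for candidate generation.
STOP_TOKENS = {"LLC", "INC", "CORP", "CO", "THE"}

names = [
    "PROGRESS RESIDENTIAL",
    "PROGRESS RESIDENTIAL BORROWER LLC",  # hypothetical variant
    "MAPLE STREET RE LLC",
    "CENTURY FUND LLC",
]


def candidate_pairs(names, stop_tokens=STOP_TOKENS):
    """Return index pairs of names that share at least one non-stop token."""
    index = defaultdict(set)
    for i, name in enumerate(names):
        for token in set(name.split()):
            if token not in stop_tokens:
                index[token].add(i)
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Only the two PROGRESS RESIDENTIAL names share a non-stop token.
print(candidate_pairs(names))
# Without stop tokens, LLC alone links three otherwise unrelated owners.
print(sorted(candidate_pairs(names, stop_tokens=set())))
```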
---


- Hide company flag: does the model predict the value should be what it actually was sold for?
- Naive approach: same buyer buying more than 2 properties in the time period -> investor.
- Merge appeals too?
- Interactive map over years showing investor activity?

In [43]:
import polars as pl

In [44]:
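# infer_schema_length=0 reads every column as a string, so identifiers and
# zip codes keep any leading zeros
atl_sales_parcel = pl.read_csv('../output/geocoded/csv/atl_sales_parcel_neighborhoods.csv',
                               infer_schema_length=0)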

In [45]:
parid_owner = atl_sales_parcel[['parid_strip', 'own1', 'own_adrno',
                                'own_adrstr', 'own_zip', 'neighborhood']]

In [46]:
two_neighborhoods = parid_owner.filter(
    (pl.col('neighborhood') == 'Thomasville Heights') |
    (pl.col('neighborhood') == 'South Atlanta')
)

In [47]:
two_neighborhoods.groupby('own1').count().sort(by='count', descending=True)

own1,count
str,u32
"""CHARIS SOUTH A…",51
"""HABITAT FOR HU…",38
"""RCA SUPPORT CO…",32
"""MAPLE STREET R…",29
"""MC KINNEY JANI…",22
"""CENTRAL LIBERT…",21
"""FULTON COUNTY …",18
"""FEDERAL NATION…",16
"""RON CLARK ACAD…",16
"""LBAK HOLDINGS …",15


In [48]:
(
    two_neighborhoods.groupby(['own_adrno', 'own_adrstr'])
    .count().filter(
        pl.col('count') > 1
    ).sort(by='count', descending=True)
)

own_adrno,own_adrstr,count
str,str,u32
"""228""","""MARGARET""",50
"""519""","""MEMORIAL""",36
"""420253""","""PO BOX 420253""",31
"""750""","""GLENWOOD""",27
"""17628""","""P O BOX 17628""",25
"""650043""","""P O BOX 650043…",23
"""2063""","""PHILLIPS""",22
"""3710""","""LONE TREE""",21
"""1297""","""MCDONOUGH""",20
"""150316""","""P O BOX 150316…",18


In [49]:
sum((
    two_neighborhoods.groupby(['own_adrno', 'own_adrstr'])
    .count().filter(
        pl.col('count') > 1
    )
)['count'])

1640

In [50]:
len(two_neighborhoods)

2218
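
The two counts above give the share of rows whose owner address number and street also appear on at least one other row. The arithmetic is spelled out below; the variable names are hypothetical.

```python
# Values copied from the two cell outputs above.
rows_in_shared_groups = 1640  # rows in (own_adrno, own_adrstr) groups of size > 1
total_rows = 2218             # all rows in the two neighborhoods

share = rows_in_shared_groups / total_rows
print(f"{share:.1%}")  # prints 73.9%
```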

In [56]:
two_neighborhoods.groupby(['own_adrno', 'own_zip']).count().sort(by='count', descending=True)

own_adrno,own_zip,count
str,str,u32
"""228""","""30315""",50
"""519""","""30312""",36
"""420253""","""30342""",31
"""1297""","""30315""",27
"""650043""","""75265""",27
"""750""","""30316""",27
"""17628""","""30316""",24
"""2063""","""30315""",22
"""3710""","""94509""",21
"""34""","""30303""",18


In [51]:
two_neighborhoods.groupby(['own_adrno', 'own_adrstr']).count().sort(by='count', descending=True)

own_adrno,own_adrstr,count
str,str,u32
"""228""","""MARGARET""",50
"""519""","""MEMORIAL""",36
"""420253""","""PO BOX 420253""",31
"""750""","""GLENWOOD""",27
"""17628""","""P O BOX 17628""",25
"""650043""","""P O BOX 650043…",23
"""2063""","""PHILLIPS""",22
"""3710""","""LONE TREE""",21
"""1297""","""MCDONOUGH""",20
"""150316""","""P O BOX 150316…",18


In [None]:
two_neighborhoods.filter(
    (pl.col('own_adrstr') == 'MARGARET') |
    (pl.col('own_adrno') == '228')
)['own1'].unique().to_list()

In [57]:
two_neighborhoods.filter(
    (pl.col('own_adrno') == '650043')
)['own1'].unique().to_list()

['FEDERAL NATL MTG ASSN',
 'FEDERAL NATIONAL MORTGAGE ASSN',
 'FEDERAL NATIONAL MORTGAGE ASSOCIATION',
 'FEDERAL NATIONAL MORTGAGE']

In [53]:
two_neighborhoods.filter(
    (pl.col('own_adrstr') == 'MEMORIAL') |
    (pl.col('own_adrno') == '519')
)['own1'].unique().to_list()

['HABITAT FOR HUMANITY IN ATLANTA INC',
 'HABITAT FOR HUMANITY ATLANTA INC',
 'HOLLINGSWORTH GEORGE A',
 'HABITAT FOR HUMANITY IN ATLANTA INC.',
 'UMOJA CHINUA T']

In [54]:
two_neighborhoods.filter(
    (pl.col('own_adrno') == '420253')
)['own1'].unique().to_list()

['CENTURY FUND LLC', 'MAPLE STREET RE LLC']

In [55]:
two_neighborhoods.filter(
    (pl.col('own_adrno') == '420253')
)

parid_strip,own1,own_adrno,own_adrstr,own_zip,neighborhood
str,str,str,str,str,str
"""14005600100077…","""CENTURY FUND L…","""420253""","""PO BOX 420253""","""30342""","""South Atlanta"""
"""14005600100077…","""CENTURY FUND L…","""420253""","""PO BOX 420253""","""30342""","""South Atlanta"""
"""14005700100241…","""MAPLE STREET R…","""420253""","""PO BOX 420253""","""30342""","""South Atlanta"""
"""14005700100241…","""MAPLE STREET R…","""420253""","""PO BOX 420253""","""30342""","""South Atlanta"""
"""14005700100241…","""MAPLE STREET R…","""420253""","""PO BOX 420253""","""30342""","""South Atlanta"""
"""14005700100241…","""MAPLE STREET R…","""420253""","""PO BOX 420253""","""30342""","""South Atlanta"""
"""14005700100241…","""MAPLE STREET R…","""420253""","""PO BOX 420253""","""30342""","""South Atlanta"""
"""14005700100241…","""MAPLE STREET R…","""420253""","""PO BOX 420253""","""30342""","""South Atlanta"""
"""14005700100241…","""MAPLE STREET R…","""420253""","""PO BOX 420253""","""30342""","""South Atlanta"""
"""14005700100241…","""MAPLE STREET R…","""420253""","""PO BOX 420253""","""30342""","""South Atlanta"""
