# Process Data for Equity Loss Analysis

## Data Sources
- Fulton County digest parcel data from 2011 to 2022 (filtered to LUC = 101, single-family homes), Excel
- Fulton County digest parcel data for 2022 (for geocoding), GeoJSON
- Fulton County sales data from 2011 to 2022, tab-delimited txt
- Atlanta Neighborhood Statistical Areas (NSAs) with supplemental data from Census (), 2022, CSV from Neighborhood Nexus
- Neighborhood characteristics? unknown

**Note: NSAs in DeKalb County are excluded because we do not have data for all years.**

Those neighborhoods are:
- Candler Park, Druid Hills
- Lake Claire
- East Lake
- Kirkwood
- Edgewood
- East Atlanta
- Emory University/Center for Disease Control
- Part of Morningside/Lenox Park

This leaves _ neighborhoods (see appendix for list)

In [1]:
import os
import csv
import pandas as pd
import geopandas as gpd

pd.set_option('display.max_columns', 150)
pd.options.display.float_format = '{:.2f}'.format

In [6]:

def contains_keyword(filename: str, keywords: list[str]) -> bool:
    return any(keyword in filename for keyword in keywords)

def clean_and_cast_column(column: pd.Series, var_map: dict) -> pd.Series:
    to_dtype = var_map[column.name]
    fill_val = None
    
    if to_dtype == "int" or to_dtype == "float":
        fill_val = 0
    elif to_dtype == "string":
        fill_val = ""
    else:
        raise ValueError(f"{to_dtype} is not a valid data type!")
    
    if ((to_dtype == "int" or to_dtype == "float")
        and (column.dtype == "string" or column.dtype == "object")):
        # Remove commas from number strings before converting
        column = column.astype("str").str.replace(",", "").astype('float')
    
    # Record number of filled nulls
    print(f"Number of nulls in column {column.name}: {column.isna().sum()}")
    column = column.fillna(fill_val)
    return column.astype(to_dtype)
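
As a quick illustration of what the helper does to a numeric column that arrives as text, the sketch below applies the same steps (strip commas, cast to float, count and fill nulls, cast to the target type) to a toy Series. The values are made up for illustration; only the column name `D Yrblt` comes from the digest.

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical values): years stored as strings, one missing entry
raw = pd.Series(["1,948", "2004", np.nan], name="D Yrblt", dtype="object")

# Same steps clean_and_cast_column takes for an "int" target:
# 1. strip thousands separators and cast to float (missing values stay NaN)
as_float = raw.astype("str").str.replace(",", "", regex=False).astype("float")

# 2. record how many nulls will be filled
n_nulls = int(as_float.isna().sum())

# 3. fill nulls with 0 and cast to the target dtype
cleaned = as_float.fillna(0).astype("int")

print(f"Number of nulls in column {raw.name}: {n_nulls}")
print(cleaned.tolist())
```

Note that filling with 0 makes a missing year indistinguishable from a recorded 0, so downstream code may need to treat 0 as "unknown" for these columns.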

### Create a file with Fulton County digest data for all years

In [3]:
# Select desired digest files
FULTON_DIR = './data/raw_fulton/'
fulton_files = os.listdir(FULTON_DIR)

keywords = ["DIGEST", "NF", "SF"]
digest_cols = {
    "Taxyr": "int",
    "Parid": "string",
    "Situs Adrno": "string",
    "Situs Adrdir": "string",
    "Situs Adrstr": "string",
    "Situs Adrsuf": "string",
    "Cityname": "string",
    "Luc": "string",
    "Calcacres": "float",
    "Own1": "string",
    "Own2": "string",
    "Owner Adrno": "string",
    "Owner Adradd": "string",
    "Owner Adrdir": "string",
    "Owner Adrstr": "string",
    "Owner Adrsuf": "string",
    "Cityname.1": "string",
    "Statecode": "string",
    "Zip1": "string",
    "D Yrblt": "int",
    "D Effyr": "int",
    "D Yrremod": "int",
    "Sfla": "float"
}

desired_files = filter(lambda file: contains_keyword(file, keywords), fulton_files)

# Read desired files and only parse desired cols
# Need to ensure Luc is read in as a str so we can filter appropriately
desired_files_dfs = [
    pd.read_excel(
        FULTON_DIR + file,
        usecols=list(digest_cols),  # pass the column names explicitly
        dtype={"Luc": "str"}
    ) for file in desired_files
]

In [4]:
# Concat selected digest files and select for LUC = 101 (SFH)
# Complete duplicates are dropped and the parcel count is recorded in the next cell
digest_full = pd.concat(desired_files_dfs)
digest_full = digest_full[digest_full['Luc'] == '101']

In [7]:
# After filtering for Luc, we can continue to cast other columns
rename = {
    "Taxyr": "TAXYR",
    "Parid": "PARID",
    "Cityname.1": "own_cityname",
    "Zip1": "own_zip"
}
init_len = len(digest_full)

digest_full = digest_full.drop_duplicates()
print(f"Init len: {init_len}. Number of dropped duplicates: {init_len - len(digest_full)}. Final len: {len(digest_full)}")

# Record nulls and set data types
for column in digest_full.columns:
    digest_full[column] = clean_and_cast_column(digest_full[column], digest_cols)
    
digest_full = digest_full.rename(columns=rename)

Init len: 2785428. Number of dropped duplicates: 0. Final len: 2785428
Number of nulls in column Taxyr: 0
Number of nulls in column Parid: 0
Number of nulls in column Situs Adrno: 45
Number of nulls in column Situs Adrdir: 2783362
Number of nulls in column Situs Adrstr: 22
Number of nulls in column Situs Adrsuf: 189812
Number of nulls in column Cityname: 4973
Number of nulls in column Luc: 0
Number of nulls in column Calcacres: 572
Number of nulls in column Own1: 0
Number of nulls in column Own2: 2331336
Number of nulls in column Owner Adrno: 67785
Number of nulls in column Owner Adradd: 2780399
Number of nulls in column Owner Adrdir: 2723357
Number of nulls in column Owner Adrstr: 3572
Number of nulls in column Owner Adrsuf: 213848
Number of nulls in column Cityname.1: 3167
Number of nulls in column Statecode: 3482
Number of nulls in column Zip1: 4131
Number of nulls in column D Yrblt: 3899
Number of nulls in column D Effyr: 2262396
Number of nulls in column D Yrremod: 2659392
Number 

In [8]:
digest_full[digest_full['Owner Adrno'] == ""].sample(3)

,TAXYR,PARID,Situs Adrno,Situs Adrdir,Situs Adrstr,Situs Adrsuf,Cityname,Luc,Calcacres,Own1,Own2,Owner Adrno,Owner Adradd,Owner Adrdir,Owner Adrstr,Owner Adrsuf,own_cityname,Statecode,own_zip,D Yrblt,D Effyr,D Yrremod,Sfla
112795,2010,14F0045 LL0250,1748.0,,NISKEY LAKE,RD,ATLANTA,101,0.9,FOXWORTHY INC,,,,,P O BOX 724017,,ATLANTA,GA,31139.0,1948,0,0,864.0
26377,2016,09F270101091407,5478.0,,ROCK LAKE,DR,FUL,101,0.33,PAVUS INVEST LLC,,,,,,,,,,2004,0,0,2304.0
59239,2016,12 315309000443,10560.0,,SUMMER RIDGE,DR,FUL,101,0.42,TEN FIVE SIXTY SUMMER RIDGE,LAND TRUST & STEVENS RONALD S,,,,P O BOX 468602,,ATLANTA,GA,31146.0,1986,0,0,2874.0


Note: Owner Adrno is often empty because the owner's address is a PO Box, which is stored in the Owner Adrstr column.
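
To put a number on "often", one could check what share of rows without a street number carry a PO Box in `Owner Adrstr`. The sketch below does this on a toy frame; the rows are hypothetical (modeled on the sample above) and the regular expression is an assumption about how PO Boxes are written in the digest.

```python
import pandas as pd

# Hypothetical rows modeled on the sample: empty street numbers, some with PO Boxes
owners = pd.DataFrame({
    "Owner Adrno": ["", "", "", "123"],
    "Owner Adrstr": ["P O BOX 724017", "", "P O BOX 468602", "MAIN"],
})

# Rows where the owner street number was null and got filled with ""
no_number = owners[owners["Owner Adrno"] == ""]

# Matches "PO BOX", "P O BOX", "P.O. BOX" at the start of the street field (assumed spellings)
is_po_box = no_number["Owner Adrstr"].str.contains(
    r"^P\.?\s*O\.?\s*BOX", case=False, regex=True
)

print(f"{is_po_box.sum()} of {len(no_number)} rows without a street number are PO Boxes")
```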

In [9]:
# Quickly validate no data quality issues by looking at number of parcels per year
digest_full.groupby("TAXYR")['PARID'].count()

TAXYR
2010    208196
2011    208421
2012    208964
2013    209749
2014    210860
2015    212335
2016    202821
2017    216088
2018    217793
2019    219790
2020    222101
2021    223659
2022    224651
Name: PARID, dtype: int64

Note: The parcel count drops by about 9.5K between 2015 and 2016 (212,335 to 202,821) before recovering to 216,088 in 2017, so the 2016 digest appears to be missing parcels.
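
A year-over-year difference makes this kind of gap easy to spot. The sketch below uses the counts printed above, hard-coded so that it runs on its own; in the notebook the Series would come straight from the `groupby`.

```python
import pandas as pd

# Parcel counts per tax year, copied from the output above
counts = pd.Series(
    {
        2010: 208196, 2011: 208421, 2012: 208964, 2013: 209749,
        2014: 210860, 2015: 212335, 2016: 202821, 2017: 216088,
        2018: 217793, 2019: 219790, 2020: 222101, 2021: 223659,
        2022: 224651,
    },
    name="PARID",
)
counts.index.name = "TAXYR"

# Change in parcel count relative to the previous year
yoy_change = counts.diff()

# Years where the count fell, which is unexpected for a growing housing stock
drops = yoy_change[yoy_change < 0]
print(drops)
```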

In [10]:
# Export to CSV, Parquet
OUTPUT_PATH = 'output/fulton_parcels_all'
digest_full.to_csv(OUTPUT_PATH + '.csv')
digest_full.to_parquet(OUTPUT_PATH + '.parquet')

### Create a file with Fulton County sales data for all years

In [12]:
# Select desired sales files
keywords = ["STANDARDS SALES"]
sale_cols = {
    "Taxyr": "int",
    "Parid": "string",
    "Saledt": "string",
    "Luc": "string",
    "SALES PRICE": "float",
    "FAIR MARKET VALUE": "float",
    "DEED TYPE": "string",
    "Saleval": "string",
    "Costval": "string",
    "GRANTOR": "string",
    "GRANTEE": "string"
}

desired_files = filter(lambda file: contains_keyword(file, keywords), fulton_files)

desired_files_dfs = [
    pd.read_csv(
        FULTON_DIR + file,
        sep='\t',
        encoding='latin-1',
        usecols=list(sale_cols),
        quoting=csv.QUOTE_NONE,
        skipfooter=1,
        engine='python',  # skipfooter is only supported by the python engine
        on_bad_lines="warn"
    ) for file in desired_files
]



In [13]:
# Concat selected sales files, select for LUC = 101 (SFH), drop complete duplicates
# TODO: drop by DEED TYPE and drop low sales (not done in this cell)
# Record total number of sales
rename = {
    "Taxyr": "TAXYR",
    "Parid": "PARID",
}

sales_full = pd.concat(desired_files_dfs)
sales_full = sales_full[sales_full['Luc'] == '101']

init_len = len(sales_full)
sales_full = sales_full.drop_duplicates()
print(f"Init len: {init_len}. Number of dropped duplicates: {init_len - len(sales_full)}")

# Record nulls and set data types
for column in sales_full.columns:
    sales_full[column] = clean_and_cast_column(sales_full[column], sale_cols)
    
sales_full = sales_full.rename(columns=rename)

Init len: 275814. Number of dropped duplicates: 461
Number of nulls in column Taxyr: 0
Number of nulls in column Parid: 0
Number of nulls in column Luc: 0
Number of nulls in column Saledt: 0
Number of nulls in column SALES PRICE: 19
Number of nulls in column FAIR MARKET VALUE: 0
Number of nulls in column DEED TYPE: 2
Number of nulls in column Costval: 0
Number of nulls in column Saleval: 118
Number of nulls in column GRANTOR: 18
Number of nulls in column GRANTEE: 8


In [14]:
# Check distribution of data for quick validation
sales_full.groupby("TAXYR")['PARID'].count()

TAXYR
2011    26777
2012    22684
2013    26074
2014    11711
2015    11256
2016    14978
2017    13210
2018    28281
2019    29134
2020    29570
2021    28645
2022    33033
Name: PARID, dtype: int64

In [15]:
sales_full.groupby("TAXYR").describe()

,SALES PRICE,SALES PRICE,SALES PRICE,SALES PRICE,SALES PRICE,SALES PRICE,SALES PRICE,SALES PRICE,FAIR MARKET VALUE,FAIR MARKET VALUE,FAIR MARKET VALUE,FAIR MARKET VALUE,FAIR MARKET VALUE,FAIR MARKET VALUE,FAIR MARKET VALUE,FAIR MARKET VALUE
TAXYR,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max
2011,26777.0,162950.94,630129.89,0.0,0.0,56050.0,189900.0,51718108.0,26777.0,210525.7,292278.99,0.0,42100.0,115300.0,276200.0,8750000.0
2012,22684.0,164550.67,424239.44,0.0,0.0,49500.0,200515.0,20500000.0,22684.0,241854.35,325947.29,1500.0,56600.0,142800.0,314800.0,13698600.0
2013,26074.0,166081.78,440832.36,0.0,0.0,47000.0,212000.0,17038094.0,26074.0,238072.86,323868.49,0.0,39500.0,142225.0,325900.0,6322800.0
2014,11711.0,331640.54,351925.01,0.0,79000.0,255000.0,455000.0,6350000.0,11711.0,294653.6,320690.06,100.0,63900.0,226960.0,402150.0,6189700.0
2015,11256.0,357182.81,392959.23,0.0,101425.0,275000.0,475000.0,13914719.0,11256.0,314046.58,324030.38,1390.0,74887.5,244970.0,428770.0,6151600.0
2016,14978.0,348266.13,401923.49,0.0,85000.0,258325.0,474500.0,17455046.0,14978.0,346104.72,368631.02,0.0,91500.0,257500.0,470000.0,6000000.0
2017,13210.0,386344.64,386724.4,0.0,139000.0,294000.0,517500.0,7200000.0,13210.0,306637.0,320307.43,0.0,75725.0,225900.0,425000.0,6000000.0
2018,28281.0,268587.37,931256.97,0.0,1.0,108000.0,360000.0,135635876.0,28281.0,307217.92,359431.09,0.0,75800.0,198500.0,418000.0,7150000.0
2019,29134.0,522038.36,2227323.74,0.0,1.0,137000.0,379900.0,40120000.0,29134.0,328591.65,367677.75,100.0,107200.0,207000.0,427975.0,7861300.0
2020,29570.0,303400.4,1085088.76,0.0,1.0,155000.0,379000.0,58800000.0,29570.0,364300.06,393939.77,0.0,133500.0,243900.0,467500.0,11350000.0
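
In several years the 25th percentile of SALES PRICE is 0 or 1, which suggests that a large share of records are nominal transfers rather than market sales. A minimal sketch of how such records could be flagged and dropped is shown below; the rows are toy data and `MIN_PRICE` is a hypothetical cutoff, not a value used in this analysis.

```python
import pandas as pd

MIN_PRICE = 1000  # hypothetical cutoff for a "real" sale; an assumption, not from the source

# Toy sales records (hypothetical values)
sales = pd.DataFrame({
    "TAXYR": [2018, 2018, 2018, 2019, 2019],
    "SALES PRICE": [0.0, 1.0, 360000.0, 1.0, 379900.0],
})

# Flag sales below the cutoff and compute their share per tax year
sales["is_nominal"] = sales["SALES PRICE"] < MIN_PRICE
nominal_share = sales.groupby("TAXYR")["is_nominal"].mean()

# Keep only sales at or above the cutoff
market_sales = sales[~sales["is_nominal"]].drop(columns="is_nominal")

print(nominal_share)
print(f"Kept {len(market_sales)} of {len(sales)} sales")
```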


In [16]:
# Export to CSV, Parquet
# Write the sales data to its own files (file name assumed, mirroring the parcel export)
OUTPUT_PATH = 'output/fulton_sales_all'
sales_full.to_csv(OUTPUT_PATH + '.csv')
sales_full.to_parquet(OUTPUT_PATH + '.parquet')

### Geocode digest data and determine NSA of each parcel

In [17]:
# Join the all-years digest to the 2022 Fulton County geocoded parcel boundaries on ParcelID
# Record the number of parcels not matched (parcels that changed over the years and now have a different ParcelID)
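
This cell is still a stub. One way the join could look, as a sketch only: a left merge on `PARID` with `indicator=True`, which labels each digest row as matched or not. The frames below are toy data, and the strings in `geometry` are placeholders for the real 2022 parcel polygons, which would presumably be loaded with `gpd.read_file`.

```python
import pandas as pd

# Toy all-years digest (hypothetical rows)
digest = pd.DataFrame({
    "TAXYR": [2011, 2022, 2011],
    "PARID": ["A1", "A1", "B2"],
})

# Toy 2022 parcel boundaries; strings stand in for real geometries
parcels_2022 = pd.DataFrame({
    "PARID": ["A1", "C3"],
    "geometry": ["<polygon A1>", "<polygon C3>"],
})

# Left join keeps every digest row; _merge records whether a 2022 parcel was found
merged = digest.merge(parcels_2022, on="PARID", how="left", indicator=True)
unmatched = merged[merged["_merge"] == "left_only"]

print(f"Unmatched digest rows: {len(unmatched)}")
print(f"Unmatched parcel IDs: {unmatched['PARID'].nunique()}")
```

Counting both rows and distinct IDs is useful here because one retired ParcelID can account for many unmatched rows across years.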

In [18]:
# Spatially join parcel geodata to Atlanta NSAs to get the neighborhood of each parcel
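
This cell is also a stub. A minimal sketch of the spatial join with `geopandas.sjoin` follows, using toy geometries. The NSA names, coordinates, and CRS are assumptions for illustration. Points stand in for one representative point per parcel, which is one way to avoid a parcel polygon matching two neighboring NSAs.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Toy NSA polygons (hypothetical names and coordinates)
nsas = gpd.GeoDataFrame(
    {"NSA": ["North", "South"]},
    geometry=[box(0, 1, 2, 2), box(0, 0, 2, 1)],
    crs="EPSG:4326",
)

# Toy parcels, each reduced to a single point; C3 lies outside every NSA
parcels = gpd.GeoDataFrame(
    {"PARID": ["A1", "B2", "C3"]},
    geometry=[Point(1, 1.5), Point(1, 0.5), Point(5, 5)],
    crs="EPSG:4326",
)

# Left join keeps all parcels; parcels outside every NSA get a null NSA
parcels_nsa = gpd.sjoin(parcels, nsas, how="left", predicate="within")
nsa_by_parcel = parcels_nsa.set_index("PARID")["NSA"]

print(nsa_by_parcel)
print(f"Parcels without an NSA: {nsa_by_parcel.isna().sum()}")
```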

In [19]:
# Drop NSAs from the Atlanta NSA df that have no matches (e.g. those outside of Fulton County), though we want to retain all of Fulton for comparison
# Write the NSAs we include and those we don't to a text file for the Appendix
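
A sketch of the appendix step, assuming the matched NSA names are already available from the spatial join. The NSA lists are a hypothetical subset, and the output file name and temporary directory are placeholders rather than the project's real paths.

```python
import os
import tempfile

# Hypothetical subset of NSA names, and those that matched at least one parcel
all_nsas = ["Kirkwood", "Midtown", "East Lake", "Grant Park"]
matched_nsas = {"Midtown", "Grant Park"}

included = sorted(n for n in all_nsas if n in matched_nsas)
excluded = sorted(n for n in all_nsas if n not in matched_nsas)

# Placeholder location; the notebook would more likely write under output/
out_path = os.path.join(tempfile.mkdtemp(), "nsa_appendix.txt")
with open(out_path, "w") as f:
    f.write("Included NSAs\n")
    f.writelines(f"- {name}\n" for name in included)
    f.write("\nExcluded NSAs\n")
    f.writelines(f"- {name}\n" for name in excluded)

print(f"Wrote {len(included)} included and {len(excluded)} excluded NSAs to {out_path}")
```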

### Join parcel to sales?