# Owner and Transaction Scale Calculations

In [1]:
import pandas as pd

pd.set_option('display.max_columns', 150)
pd.options.display.float_format = '{:,.3f}'.format

In [2]:
# Cleaned digest and sales data from clean_data.ipynb
FILES_PATH = 'output/'
digest_full = pd.read_parquet(FILES_PATH + 'digest_full_clean.parquet')
sales_full = pd.read_parquet(FILES_PATH + 'sales_full_clean.parquet')

## Modify data to enable aggregating on entity key

### Identify same owners in parcel data
- Drop any rows without an owner address.
- Create an owner address column (labeled: "owner_addr") that is the concatenation of the owner address number, owner address string, and owner ZIP.
- If the address string contains "BOX" followed by a number, treat it as a PO Box. These are formatted in many different ways (P O BOX 123, PO BOX 123, P.O. BOX 123, etc.), so we keep only the number from the address string and prepend "PO BOX" to give them all an identical format.

**Why:** These values give us a highly accurate key for identifying the same owner. The owner address string generally does not contain suffixes such as ST or AVE that might cause mismatches. Combined with the owner address number and owner ZIP, it lets us say with high confidence that two addresses are the same while avoiding many common variations of one address (ST vs. STREET, etc.). This method is preferred over matching on names, which has a higher chance of false positives and which misses large corporations that operate through differently named subsidiaries. It may undercount if a company uses multiple addresses, but this is somewhat unlikely, and we accept the undercount as a limitation: large investors (the owners most likely to use several addresses) hold so many properties under each subsidiary that they still fall into the correct size bin.
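As a minimal sketch of the key-building rule described above, the following applies the same regular expressions to single strings using only the standard library. The helper name `build_owner_addr` and the sample inputs are illustrative assumptions, not part of the notebook.

```python
import re

# Same patterns the notebook uses on the pandas columns
RE_BOX_AND_NUMBERS = re.compile(r".*BOX.*[0-9].*")
RE_CAPTURE_NUMBERS = re.compile(r"([0-9]+)")
RE_DOTS_COMMAS = re.compile(r"[.,]+")
RE_MULTIPLE_SPACES = re.compile(r"\s{2,}")


def build_owner_addr(adrno, adrstr, zip_code):
    """Hypothetical helper: build the owner key for one record."""
    street = adrstr
    # Any "BOX ... <digits>" variant collapses to "PO BOX <first number>"
    if RE_BOX_AND_NUMBERS.search(street):
        street = "PO BOX " + RE_CAPTURE_NUMBERS.search(street).group(1)
    key = f"{adrno} {street} {zip_code}"
    key = RE_DOTS_COMMAS.sub("", key)       # drop dots and commas
    key = RE_MULTIPLE_SPACES.sub(" ", key)  # collapse repeated whitespace
    return key.upper()


variants = ["P O BOX 123", "PO BOX 123", "P.O. BOX 123"]
keys = {build_owner_addr(0, v, "75265") for v in variants}
print(keys)
```

All three spellings collapse to the single key `0 PO BOX 123 75265`, which is what allows the later group-by on `owner_addr` to treat them as one owner.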

In [3]:
# Re-format PO BOXES
re_box_and_numbers = r".*BOX.*[0-9].*"
re_capture_numbers = r"([0-9]+)"

digest_full["mod_own_adrstr"] = digest_full["Owner Adrstr"].copy(deep=True)

# na=False keeps the mask strictly boolean even if an address string is missing
mask = digest_full["mod_own_adrstr"].str.contains(re_box_and_numbers, regex=True, na=False)

digest_full.loc[mask, "mod_own_adrstr"] = "PO BOX " + digest_full.loc[
    mask, "mod_own_adrstr"
].str.extract(re_capture_numbers)[0]

In [4]:
mask.sum()

62973

In [5]:
# Print total number of PO BOXES without a number in their address string
re_po_box_no_number = r"^(?!.*\d)[P]+.* BOX.*"
len(digest_full[digest_full["mod_own_adrstr"].str.contains(
    re_po_box_no_number, regex=True
)][["Owner Adrno", "mod_own_adrstr"]])

37
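To make the negative-lookahead pattern in the cell above easier to read, here is a small standard-library sketch that runs it on a few strings. The example strings are made up for illustration; the pattern itself is copied from the notebook.

```python
import re

# (?!.*\d) rejects any string containing a digit; the rest requires a
# leading "P" and a later " BOX"
re_po_box_no_number = re.compile(r"^(?!.*\d)[P]+.* BOX.*")

examples = [
    "PO BOX",            # no number: flagged
    "P O BOX",           # no number: flagged
    "POST OFFICE BOX",   # no number: flagged
    "PO BOX 123",        # has a number: not flagged
    "BOX",               # does not start with "P": not flagged
    "HARBOUR OVERLOOK",  # ordinary street: not flagged
]
flagged = [s for s in examples if re_po_box_no_number.search(s)]
print(flagged)
```

These are the records the PO Box normalization cannot standardize, since there is no box number to retain.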

In [6]:
# Regex to clean by replacing dots, commas, and multiple spaces
# Also make all strings uppercase (they should be already)

re_dots_commas = r"[.,]+"
re_multiple_spaces = r"\s{2,}"

digest_full["owner_addr"] = (
    digest_full["Owner Adrno"].astype(str) + " " +
    digest_full["mod_own_adrstr"] + " " +
    digest_full["own_zip"]
).str.replace(
    re_dots_commas,
    "",
    regex=True
).str.replace(
    re_multiple_spaces,
    " ",
    regex=True
).str.upper()

In [7]:
digest_full[
    ["Owner Adrno", "Owner Adrstr", "mod_own_adrstr", "own_zip", "owner_addr"]
].sample(10)

Unnamed: 0,Owner Adrno,Owner Adrstr,mod_own_adrstr,own_zip,owner_addr
1531599,6215,HARBOUR OVERLOOK,HARBOUR OVERLOOK,30005,6215 HARBOUR OVERLOOK 30005
2706444,6680,CEDAR HURST,CEDAR HURST,30349,6680 CEDAR HURST 30349
680950,1015,NORTH VIRGINIA,NORTH VIRGINIA,30306,1015 NORTH VIRGINIA 30306
1119471,320,PARK GLEN,PARK GLEN,30327,320 PARK GLEN 30327
2274673,810,NISKEY LAKE,NISKEY LAKE,30331,810 NISKEY LAKE 30331
2434162,150,MOSS CREEK,MOSS CREEK,30214,150 MOSS CREEK 30214
2352472,4036,RIVER BREEZE,RIVER BREEZE,23321,4036 RIVER BREEZE 23321
689918,12,HIGHLAND PARK,HIGHLAND PARK,30306,12 HIGHLAND PARK 30306
2020504,2316,PINE RIDGE,PINE RIDGE,34109,2316 PINE RIDGE 34109
550069,885,EDWIN,EDWIN,30318,885 EDWIN 30318


### Identify same owners in sales data

**Method**:
- Sales data does not contain buyer or seller addresses. We can't simply use the GRANTEE or GRANTOR name, because names can differ for the same owner corporation (subsidiaries, typos). Instead we identify buyer and seller addresses as follows:
    - Match GRANTEE name to parcel data on GRANTEE = Own1 (owner name) and extract owner_addr for CURRENT TAXYR
    - Match GRANTOR name to parcel data on GRANTOR = Own1 (owner name) and extract owner_addr for PREVIOUS TAXYR
    - Where the GRANTEE or GRANTOR name doesn't match exactly (due to typos, etc.), take the owner_addr for the same PARID and TAXYR, but ONLY IF there was a single sale in that TAXYR. When a parcel sells multiple times in one TAXYR, the last purchaser appears to be the one recorded in the parcel data as the owner (see evidence below), so matching an earlier sale in that year would return the wrong owner address. This matters because we need the purchaser address for each individual sale, for example to account for flipping activity.
    - Else, try to find an exact owner name match from all parcel data, not limited to PARID and TAXYR; use the last match if a match is found (last because that is most recent address of company).
- In short:
    - Try to match by owner name, PARID, and TAXYR
    - If no match, get match from just PARID and TAXYR, ONLY IF there is a single transaction in the given TAXYR for that PARID
    - Else, try to find an exact owner name match from all parcel data, not limited to PARID and TAXYR; use the last match if a match is found.
    - Where none of the above methods work, drop if total count is insignificant
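The fallback order summarized above can be sketched for a single sale with plain Python. This is only an illustration under assumed toy data: the function `match_owner_addr`, the parcel IDs, names, and addresses are all hypothetical, and the notebook itself implements the same idea with pandas merges.

```python
def match_owner_addr(sale, digest_rows, sales_per_parcel_year, name_col="GRANTEE"):
    """Hypothetical helper: return (owner_addr, method) for one sale."""
    parid, taxyr, name = sale["PARID"], sale["TAXYR"], sale[name_col]
    same_parcel_year = [
        row for row in digest_rows
        if row["PARID"] == parid and row["TAXYR"] == taxyr
    ]
    # 1. Exact match on owner name, PARID, and TAXYR
    for row in same_parcel_year:
        if row["Own1"] == name:
            return row["owner_addr"], "exact"
    # 2. PARID and TAXYR only, allowed when the parcel sold once that year
    if sales_per_parcel_year.get((parid, taxyr), 0) == 1 and same_parcel_year:
        return same_parcel_year[0]["owner_addr"], "single_sale"
    # 3. Exact name anywhere in the parcel data; keep the last occurrence
    same_name = [row for row in digest_rows if row["Own1"] == name]
    if same_name:
        return same_name[-1]["owner_addr"], "only_exact_name"
    return None, "unmatched"


digest_rows = [
    {"PARID": "A1", "TAXYR": 2020, "Own1": "ACME HOMES LLC", "owner_addr": "1 MAIN 30303"},
    {"PARID": "B2", "TAXYR": 2020, "Own1": "SMITH JOHN", "owner_addr": "2 OAK 30305"},
    {"PARID": "C3", "TAXYR": 2021, "Own1": "ACME HOMES LLC", "owner_addr": "9 PEACH 30309"},
]
sales_per_parcel_year = {("A1", 2020): 1, ("B2", 2020): 1, ("D4", 2020): 2}
sales = [
    {"PARID": "A1", "TAXYR": 2020, "GRANTEE": "ACME HOMES LLC"},  # exact
    {"PARID": "B2", "TAXYR": 2020, "GRANTEE": "SMITH JON"},       # typo, single sale
    {"PARID": "D4", "TAXYR": 2020, "GRANTEE": "ACME HOMES LLC"},  # name-only fallback
    {"PARID": "D4", "TAXYR": 2020, "GRANTEE": "UNKNOWN BUYER"},   # no match
]
results = [match_owner_addr(s, digest_rows, sales_per_parcel_year) for s in sales]
for sale, result in zip(sales, results):
    print(sale["GRANTEE"], "->", result)
```

For the GRANTOR side, the same cascade would run against parcel data from the previous TAXYR, as described in the method list.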

In [8]:
# Minor cleaning on GRANTEE, GRANTOR, and Own1 (parcel data)
# Regex to clean by replacing dots, commas, and multiple spaces
# Also make all strings uppercase (they should be already)
re_dots_commas = r"[.,]+"
re_multiple_spaces = r"\s{2,}"

for col in ["GRANTEE", "GRANTOR"]:
    sales_full[col] = sales_full[col].str.replace(
        re_dots_commas, "", regex=True
    ).str.replace(
        re_multiple_spaces, " ", regex=True
    ).str.upper()
    
digest_full["Own1"] = digest_full["Own1"].str.replace(
    re_dots_commas, "", regex=True
).str.replace(
    re_multiple_spaces, " ", regex=True
).str.upper()

In [9]:
count_sales_yr = pd.DataFrame(
    sales_full.groupby(["TAXYR", "PARID"])["PARID"].count()
).rename(columns={"PARID": "count_sales_yr"})

sales_full = sales_full.merge(
    count_sales_yr,
    on=["TAXYR", "PARID"],
    how="inner"
)

more_than_one_sale_yr = len(
    sales_full[sales_full["count_sales_yr"] > 1].drop_duplicates(
        subset=["TAXYR", "PARID"]
    )
)

more_than_one_sale_yr_valid = len(
    sales_full[
       (sales_full["count_sales_yr"] > 1)
       & (sales_full["Saleval"] == "0")
    ].drop_duplicates(
        subset=["TAXYR", "PARID"]
    )
)

print(f"Count of properties that sold multiple times in one year: {more_than_one_sale_yr}")
print(f"Count of properties that sold multiple times in one year (valid only): {more_than_one_sale_yr_valid}")
count_sales_yr.sort_values(by="count_sales_yr", ascending=False).head(5)

Count of properties that sold multiple times in one year: 12354
Count of properties that sold multiple times in one year (valid only): 5937


TAXYR,PARID,count_sales_yr
2022,07 320100600219,8
2011,14 013500030907,5
2021,09F370001531401,4
2013,14 015800010133,4
2011,13 0194 LL0810,4


In [10]:
print(len(sales_full))
for person in ["GRANTEE", "GRANTOR"]:
    
    digest_df = digest_full[['PARID', 'TAXYR', 'owner_addr', 'Own1']].copy(deep=True)
    sale_df = sales_full[["TAXYR", "PARID", f"{person}", "count_sales_yr"]].copy(deep=True)
    
    if person == "GRANTOR":
        digest_df["TAXYR"] = digest_df["TAXYR"] + 1

    exact_match = sale_df.merge(
        digest_df,
        left_on=["PARID", "TAXYR", f"{person}"],
        right_on=["PARID", "TAXYR", "Own1"],
    ).rename(
        columns={"Own1": f"{person}_exact", "owner_addr": f"{person}_exact_addr"}
    ).drop_duplicates(subset=["PARID", "TAXYR"])

    exact_name_match = sale_df.merge(
        digest_df[["Own1", "owner_addr"]].drop_duplicates(subset="Own1", keep="last"),
        left_on=[f"{person}"],
        right_on=["Own1"],
    ).rename(columns={
        "Own1": f"{person}_only_exact_name", "owner_addr": f"{person}_only_exact_name_addr"
    }).drop_duplicates(subset=[f"{person}_only_exact_name"])

    single_sale_match = sale_df[
        sale_df["count_sales_yr"] < 2
    ].merge(
        digest_df,
        left_on=["PARID", "TAXYR"],
        right_on=["PARID", "TAXYR"],
    ).rename(
        columns={"Own1": f"{person}_single_sale", "owner_addr": f"{person}_single_sale_addr"}
    ).drop_duplicates(subset=["PARID", "TAXYR"])
    
    sales_full = sales_full.merge(
        exact_match[["TAXYR", "PARID", f"{person}_exact", f"{person}_exact_addr"]],
        left_on=["TAXYR", "PARID", f"{person}"],
        right_on=["TAXYR", "PARID", f"{person}_exact"],
        how="left"
    )
    
    sales_full = sales_full.merge(
        single_sale_match[["TAXYR", "PARID", f"{person}_single_sale", f"{person}_single_sale_addr"]],
        on=["TAXYR", "PARID"],
        how="left"
    )
    
    sales_full = sales_full.merge(
        exact_name_match[[f"{person}_only_exact_name", f"{person}_only_exact_name_addr"]],
        left_on=[f"{person}"],
        right_on=[f"{person}_only_exact_name"],
        how="left"
    )

len(sales_full)

180692


180692
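The two row counts printed above are equal, which suggests the left merges did not duplicate any sales rows. As a hedged sketch of why that check matters, the toy example below shows a left merge growing when the lookup table has duplicate keys, and how pandas' `validate` argument to `merge` can turn that into an error. The frames and their values are invented for illustration.

```python
import pandas as pd

sales = pd.DataFrame({"PARID": ["A1", "B2"], "TAXYR": [2020, 2020]})
lookup = pd.DataFrame({
    "PARID": ["A1", "A1", "B2"],
    "TAXYR": [2020, 2020, 2020],
    "owner_addr": ["1 MAIN 30303", "1 MAIN ST 30303", "2 OAK 30305"],
})

# Duplicate (PARID, TAXYR) keys on the right side duplicate the left row
unchecked = sales.merge(lookup, on=["PARID", "TAXYR"], how="left")

# De-duplicating the lookup first keeps one row per sale
deduped = lookup.drop_duplicates(subset=["PARID", "TAXYR"])
checked = sales.merge(
    deduped, on=["PARID", "TAXYR"], how="left", validate="many_to_one"
)

# With validate set, the duplicated lookup raises instead of silently growing
try:
    sales.merge(lookup, on=["PARID", "TAXYR"], how="left", validate="many_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True

print(len(sales), len(unchecked), len(checked), raised)
```

This is the same safeguard the notebook gets from calling `drop_duplicates` on each match table before merging it back into `sales_full`.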

In [11]:
for person in ["GRANTEE", "GRANTOR"]:
    print(f"Person: {person} ---")
    sales_full[f"{person}_match"] = sales_full[f"{person}_exact"]
    sales_full[f"{person}_match_addr"] = sales_full[f"{person}_exact_addr"]
    num_matched = len(sales_full[sales_full[f'{person}_match'].notna()])
    print(f"Number exact matched: {num_matched}")
    print(f"Pct exact matched: {num_matched / len(sales_full)}")
    print("")
    
    for match in ["single_sale", "only_exact_name"]:
        sales_full[f"{person}_match"] = sales_full[f"{person}_match"].fillna(
            sales_full[f"{person}_{match}"]
        )
        sales_full[f"{person}_match_addr"] = sales_full[f"{person}_match_addr"].fillna(
            sales_full[f"{person}_{match}_addr"]
        )
        prev_matched = num_matched
        num_matched = len(sales_full[sales_full[f'{person}_match'].notna()])
        print(f"Number of additional matches with {match}: {num_matched - prev_matched}")
        print(f"Number prev matches + {match} matched: {num_matched}")
        print(f"Pct prev matches + {match} matched: {num_matched / len(sales_full)}")
        print("")
    
    print("")
    print("")

Person: GRANTEE ---
Number exact matched: 142531
Pct exact matched: 0.7888063666349368

Number of additional matches with single_sale: 20342
Number prev matches + single_sale matched: 162873
Pct prev matches + single_sale matched: 0.9013846766874017

Number of additional matches with only_exact_name: 12199
Number prev matches + only_exact_name matched: 175072
Pct prev matches + only_exact_name matched: 0.9688973501870587



Person: GRANTOR ---
Number exact matched: 64062
Pct exact matched: 0.35453700219157464

Number of additional matches with single_sale: 80562
Number prev matches + single_sale matched: 144624
Pct prev matches + single_sale matched: 0.8003896132645607

Number of additional matches with only_exact_name: 21671
Number prev matches + only_exact_name matched: 166295
Pct prev matches + only_exact_name matched: 0.9203229805414739





Save a random sample of 100 to manually verify

In [12]:
sales_full.sample(100)[
    ["TAXYR", "PARID", "GRANTEE", "GRANTEE_match", "GRANTOR", "GRANTOR_match"]
].sort_values(by="TAXYR").to_csv('output/sales_matches_sample.csv', index=False)

## Drop govt and bank entities from transaction data
- Retain a separate dataset in which these entities are dropped only when they are the GRANTEE: someone buying from one of these entities still influences the market, so such purchases should count toward the buyer's transaction scale

In [13]:
govt_keywords = ['FEDERAL', 'SECRETARY', 'CITY', 'COUNTY', 'FANNIE', 'FREDDIE']
bank_keywords = [
    'BANK', 'MORTGAGE', 'LENDING', 'LOAN',
    'FINANCE', 'FUND', 'CREDIT', 'TRUST'
]
govt = []
banks = []

govt += sales_full[
    sales_full['GRANTEE'].apply(lambda x: any([key in str(x) for key in govt_keywords]))
]['GRANTEE'].unique().tolist() + sales_full[
    sales_full['GRANTOR'].apply(lambda x: any([key in str(x) for key in govt_keywords]))
]['GRANTOR'].unique().tolist() + digest_full[
    digest_full["Own1"].apply(lambda x: any([key in str(x) for key in govt_keywords]))
]['Own1'].unique().tolist()

banks += sales_full[
    sales_full['GRANTEE'].apply(lambda x: any([key in str(x) for key in bank_keywords]))
]['GRANTEE'].unique().tolist() + sales_full[
    sales_full['GRANTOR'].apply(lambda x: any([key in str(x) for key in bank_keywords]))
]['GRANTOR'].unique().tolist() + digest_full[
    digest_full["Own1"].apply(lambda x: any([key in str(x) for key in bank_keywords]))
]['Own1'].unique().tolist()

# Save record of entities classified as govt or bank
with open("./output/govt.txt", "w") as f:
    f.write("\n".join(govt))
with open("./output/banks.txt", "w") as f:
    f.write("\n".join(banks))

In [14]:
init_len_sales = len(sales_full)
init_len_digest = len(digest_full)

sales_grantee_dropped = sales_full[
    ~(sales_full["GRANTEE"].isin(govt + banks))
].copy(deep=True)

sales_full = sales_full[
    ~((sales_full["GRANTEE"].isin(govt + banks))
    | (sales_full["GRANTOR"].isin(govt + banks)))
]

print("Number dropped from sales: ", init_len_sales - len(sales_full))

Number dropped from sales:  30408
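The keyword lists above are matched as plain substrings, so a keyword also hits when it is part of a longer word. The sketch below contrasts that behavior with a whole-word regex; it is an assumption-labeled illustration, not the notebook's method. `whole_word_hit` is a hypothetical helper, and apart from "DAVY TRUST" (which appears in the parcel sample further down) the names are made up.

```python
import re

bank_keywords = [
    'BANK', 'MORTGAGE', 'LENDING', 'LOAN',
    'FINANCE', 'FUND', 'CREDIT', 'TRUST'
]


def substring_hit(name, keywords):
    # Same test as the notebook's lambda: keyword anywhere in the string
    return any(key in str(name) for key in keywords)


def whole_word_hit(name, keywords):
    # Hypothetical alternative: keyword must stand alone as a word
    pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b"
    return re.search(pattern, str(name)) is not None


names = ["WELLS FARGO BANK NA", "DAVY TRUST", "BANKS JAMES R", "TRUSTY MARY"]
substring_flags = [substring_hit(n, bank_keywords) for n in names]
whole_word_flags = [whole_word_hit(n, bank_keywords) for n in names]
print(substring_flags)
print(whole_word_flags)
```

Whole-word matching removes hits such as "BANKS" and "TRUSTY", but a name like "DAVY TRUST" is still classified as a bank under either rule, so the saved `govt.txt` and `banks.txt` lists remain worth reviewing by hand.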


## Identify corporate owners, create corp owner flags for each record
- GRANTEE, GRANTOR in sales
- Own1 in digest

In [15]:
# Keywords at risk of false positives, like "CO", need a leading or trailing space
corp_keywords = [
    'LLC', ' INC', 'LLP', 'L.L.C', 'L.L.P', 'I.N.C', 'L L C',
    'L L P', ' L P', ' LP', 'LTD', ' CORP', 'CORPORATION',
    'COMPANY', ' CO ', 'LIMITED', 'PARTNERSHIP', 'PARTNERSHIPS',
    'ASSOCIATION', 'ASSOC', 'INCORPORATED', 'INCORP',
    'L.T.D', "HOME", "SOLUTIONS"
]

# Make a list of all corp owners
corps = sales_full[
    sales_full['GRANTEE'].apply(lambda x: any([key in str(x) for key in corp_keywords]))
]['GRANTEE'].unique().tolist() + sales_full[
    sales_full['GRANTOR'].apply(lambda x: any([key in str(x) for key in corp_keywords]))
]['GRANTOR'].unique().tolist() + digest_full[
    digest_full["Own1"].apply(lambda x: any([key in str(x) for key in corp_keywords]))
]['Own1'].unique().tolist()

with open("./output/corp_names.txt", "w") as f:
    f.write("\n".join(corps))

In [16]:
# Flag for any corp owner
sales_full["GRANTEE_corp_flag"] = sales_full['GRANTEE'].isin(corps).astype(int)
sales_full["GRANTOR_corp_flag"] = sales_full['GRANTOR'].isin(corps).astype(int)

digest_full["own_corp_flag"] = digest_full["Own1"].isin(corps).astype(int)

In [17]:
sales_full[['GRANTEE', 'GRANTEE_corp_flag', 'GRANTOR', 'GRANTOR_corp_flag']].sample(10)

Unnamed: 0,GRANTEE,GRANTEE_corp_flag,GRANTOR,GRANTOR_corp_flag
161067,LEVY SAMUEL A &,0,ROBERT S FELDBERG AND ELAINE Z FELDBER,0
76736,WESTLAKE KAELEN,0,MALCOLM ANDREW LAW JR,0
107800,LESCAR BRIAN A,0,T-DOMUS LLC,1
62157,PARIKH AMY &,0,DUBOSE VIVIAN N,0
90359,REPP THOMAS X & HEATHER LEA,0,SWEITZER ARIELLE NAPP & MATTHEW JAMES,0
48719,HSR PROPERTIES LLC,1,KS03 HOLDINGS LLC,1
89536,BLACKBURN BRADLEY &,0,SHORT WILLIAM B,0
27065,SELLARS KESHA,0,ROUSE II LETONY,0
180156,CHINN JAMES & LEISHA,0,CODY BARBARA CHINN & JAMES N,0
60005,PORTER ASA & FINDLING NATALIE,0,MOORE TERRY & HOPE,0


In [18]:
digest_full[['Own1', 'own_corp_flag']].sample(10)

Unnamed: 0,Own1,own_corp_flag
1096059,COOK KATHRYN J,0
167735,YOSHIKAWA HITOE,0
2740576,SCOTT DIXON DELITTA,0
2717711,DOWNEY HUFF MELODY M,0
2589292,BELL JESSICA E,0
1253174,WEBB PATTIE J,0
585003,MARTINEZ IVAN SORIANO,0
1916200,WANG SWAN,0
2533775,DAVY TRUST,0
2166866,KIM MYUNG HOON & KYUNG SIK,0


## Create a rental property flag

In [19]:
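# Flag a parcel as a rental only when BOTH the address number and the street
# string differ between the property (situs) and the owner's mailing address
digest_full["rental_flag"] = 0
digest_full.loc[
    ((digest_full["Situs Adrno"] != digest_full["Owner Adrno"])
    & (digest_full["Situs Adrstr"] != digest_full["Owner Adrstr"])),
    "rental_flag"
] = 1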

In [20]:
digest_full[['Situs Adrno', 'Owner Adrno', 'Situs Adrstr', 'Owner Adrstr', 'rental_flag']].sample(20)

Unnamed: 0,Situs Adrno,Owner Adrno,Situs Adrstr,Owner Adrstr,rental_flag
2450348,314,314,WOODMILL,WOODMILL,0
1299344,4075,0,KENORA,PO BOX 595,1
2669791,3680,3680,EMILY,EMILY,0
1949646,205,205,SPRING HOLLOW,SPRING HOLLOW,0
860872,6530,6530,WRIGHT,WRIGHT,0
2433960,6882,7720,FIRESIDE,16TH,1
254041,3390,1087,LA VISTA,TRAILBLAZER,1
2679461,760,760,BIRCH,BIRCH,0
1609159,595,595,SOUTHSHORE,SHORE,0
580199,3367,3367,RUBY H HARPER,RUBY H. HARPER,0
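The rule in In [19] requires both the address number and the street string to differ. A small sketch, using four rows from the sample output above, compares that rule with a hypothetical "either differs" variant to show what each one does with near-identical addresses. The function names are illustrative assumptions.

```python
def rental_flag_both(situs_no, owner_no, situs_str, owner_str):
    # Notebook rule: number AND street must both differ
    return int(situs_no != owner_no and situs_str != owner_str)


def rental_flag_either(situs_no, owner_no, situs_str, owner_str):
    # Hypothetical stricter variant: any difference counts
    return int(situs_no != owner_no or situs_str != owner_str)


# (Situs Adrno, Owner Adrno, Situs Adrstr, Owner Adrstr) from the sample above
rows = [
    (314, 314, "WOODMILL", "WOODMILL"),
    (4075, 0, "KENORA", "PO BOX 595"),
    (595, 595, "SOUTHSHORE", "SHORE"),
    (3367, 3367, "RUBY H HARPER", "RUBY H. HARPER"),
]
both = [rental_flag_both(*r) for r in rows]
either = [rental_flag_either(*r) for r in rows]
print(both)
print(either)
```

The "both" rule tolerates small spelling differences in the street string (SOUTHSHORE vs. SHORE, a stray period) as long as the number agrees, whereas the "either" variant would flag those rows unless the strings were normalized first.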


## For each sale, create a dummy variable for each sale type
- corp purchase from ind
- corp sale to ind
- ind to ind
- corp to corp

In [21]:
# Sale type matrix

sales_full['corp_bought_ind'] = 0
sales_full['corp_sold_ind'] = 0
sales_full['ind_to_ind'] = 0
sales_full['corp_to_corp'] = 0

sales_full.loc[
    (sales_full["GRANTEE_corp_flag"] == 1) & (sales_full["GRANTOR_corp_flag"] == 0), 'corp_bought_ind'
] = 1
sales_full.loc[
    (sales_full["GRANTEE_corp_flag"] == 0) & (sales_full["GRANTOR_corp_flag"] == 0), 'ind_to_ind'
] = 1
sales_full.loc[
    (sales_full["GRANTEE_corp_flag"] == 0) & (sales_full["GRANTOR_corp_flag"] == 1), 'corp_sold_ind'
] = 1
sales_full.loc[
    (sales_full["GRANTEE_corp_flag"] == 1) & (sales_full["GRANTOR_corp_flag"] == 1), 'corp_to_corp'
] = 1

# Validate sale matrix is correct
sales_full[[
    "GRANTEE", "GRANTEE_corp_flag", "GRANTOR", "GRANTOR_corp_flag", "corp_bought_ind", "ind_to_ind",
    "corp_sold_ind", "corp_to_corp"
]].sample(10)

Unnamed: 0,GRANTEE,GRANTEE_corp_flag,GRANTOR,GRANTOR_corp_flag,corp_bought_ind,ind_to_ind,corp_sold_ind,corp_to_corp
49447,CHALLENGER LLEWELLYN E JR &,0,ASHTON ATLANTA RESIDENTIAL LLC,1,0,0,1,0
121200,GILBERT BILLY C,0,D R HORTON INC,1,0,0,1,0
49069,ECHEGOYEN ROSSANA BOBADILLA,0,SALDIERNA ANDREA D ZUNIGA,0,0,1,0,0
146144,REYNOLDS JENNIFER,0,VAUGHN RANDALL D & HOLLY C,0,0,1,0,0
150455,BROWN SASHA J,0,WLODAR BRYAN M,0,0,1,0,0
128648,DOEJODE RAVISHAMKAR J,0,KLUGH JASON & CHRISTINE,0,0,1,0,0
90155,ARVM 5 LLC,1,WILSON NAOMI,0,1,0,0,0
41795,RICHARDSON TERRA JONES & ANTHONIO,0,SUNRISE BUILDERS INC,1,0,0,1,0
138654,RIGDON CLAYTON C &,0,CONSUEGRA ESTELA M,0,0,1,0,0
76234,DONOHUE ROBERT,0,ALLEN SPENCER R JR,0,0,1,0,0
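Because the four dummy columns are derived from two binary flags, every sale should have exactly one of them set. The sketch below expresses the same matrix as a lookup table and checks that property for all flag combinations; `SALE_TYPE_BY_FLAGS` and `sale_dummies` are hypothetical names used only for illustration.

```python
# Keys are (GRANTEE_corp_flag, GRANTOR_corp_flag)
SALE_TYPE_BY_FLAGS = {
    (1, 0): "corp_bought_ind",
    (0, 1): "corp_sold_ind",
    (0, 0): "ind_to_ind",
    (1, 1): "corp_to_corp",
}
SALE_TYPES = ["corp_bought_ind", "ind_to_ind", "corp_sold_ind", "corp_to_corp"]


def sale_dummies(grantee_corp_flag, grantor_corp_flag):
    """Hypothetical helper: one-hot dummies for a single sale."""
    label = SALE_TYPE_BY_FLAGS[(grantee_corp_flag, grantor_corp_flag)]
    return {name: int(name == label) for name in SALE_TYPES}


# ARVM 5 LLC (corp) buying from WILSON NAOMI (individual), as in the sample
example = sale_dummies(1, 0)
print(example)

# Exactly one dummy is set for every combination of flags
row_sums = [sum(sale_dummies(g, r).values()) for g in (0, 1) for r in (0, 1)]
print(row_sums)
```

The same check could be run on the real table by summing the four dummy columns per row and confirming every sum equals 1.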


## Create ownership scale table

In [22]:
owned_fulton_yr = pd.DataFrame(
    digest_full.groupby(["TAXYR", "owner_addr"])["PARID"].count()
).rename(columns={"PARID": "count_owned_fulton_yr"}).reset_index()

assoc_owner_names = pd.DataFrame(
    digest_full.groupby(["owner_addr"]).agg({"Own1": list})
).rename(columns={"Own1": "assoc_owner_names"}).reset_index()

owner_scale = owned_fulton_yr.merge(
    assoc_owner_names,
    on=["owner_addr"],
    how="left"
)

owner_scale.sort_values(by="count_owned_fulton_yr", ascending=False).head(5)

Unnamed: 0,TAXYR,owner_addr,count_owned_fulton_yr,assoc_owner_names
2367978,2022,5001 PLAZA ON THE 78746,714,"[ALTO ASSET COMPANY 2 LLC, ALTO ASSET COMPANY ..."
2173221,2021,5001 PLAZA ON THE 78746,646,"[ALTO ASSET COMPANY 2 LLC, ALTO ASSET COMPANY ..."
187082,2011,0 PO BOX 650043 75265,639,"[FEDERAL NATIONAL MORTGAGE ASSOCIATION, FEDERA..."
2284826,2022,1850 PARKWAY 30067,598,"[FKH SFR PROPCO D L P, RM1 SFR PROPCO A LP, FK..."
2235644,2022,0 PO BOX 4090 85261,578,"[YAMASA CO LTD, HOME SFR BORROWER IV LLC, HOME..."
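To make the shape of this table concrete, here is a standard-library sketch on invented records. It mirrors two details of the cell above: parcel counts are per (TAXYR, owner_addr), while `assoc_owner_names` is grouped by `owner_addr` alone, so it collects names from every year and repeats a name once per parcel record. The records and the de-duplication step are illustrative assumptions.

```python
from collections import Counter, defaultdict

# (TAXYR, owner_addr, PARID, Own1), made-up records
records = [
    (2020, "1 MAIN 30303", "P1", "ACME HOMES LLC"),
    (2020, "1 MAIN 30303", "P2", "ACME HOMES 2 LLC"),
    (2021, "1 MAIN 30303", "P1", "ACME HOMES LLC"),
    (2021, "2 OAK 30305", "P3", "SMITH JOHN"),
]

# Parcels owned per address per year
count_owned = Counter((taxyr, addr) for taxyr, addr, _, _ in records)

# Names associated with an address across all years (with repeats)
names_by_addr = defaultdict(list)
for _, addr, _, name in records:
    names_by_addr[addr].append(name)

# Optional: drop repeats while keeping first-seen order
unique_names = {
    addr: list(dict.fromkeys(names)) for addr, names in names_by_addr.items()
}

print(count_owned[(2020, "1 MAIN 30303")], names_by_addr["1 MAIN 30303"])
print(unique_names["1 MAIN 30303"])
```

De-duplicating the name lists is not something the notebook does; it is shown only because the repeated names in the output above make the lists harder to scan.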


In [23]:
owner_scale[
    owner_scale["TAXYR"] == 2020
].sort_values(by="count_owned_fulton_yr", ascending=False).head(15)

Unnamed: 0,TAXYR,owner_addr,count_owned_fulton_yr,assoc_owner_names
1979621,2020,5001 PLAZA ON THE 78746,539,"[ALTO ASSET COMPANY 2 LLC, ALTO ASSET COMPANY ..."
1893046,2020,1717 MAIN 75201,493,"[2018 3 IH BORROWER LP, 2018 3 IH BORROWER LP,..."
1949988,2020,3505 KOGER BLVD 30096,345,"[FYR SFR BORROWER LLC, FYR SFR BORROWER LLC, F..."
1949403,2020,3495 PIEDMONT 30305,299,"[BUTLER GLIDDEN COOPER LLC, BUTLER GLIDDEN COO..."
1885939,2020,1508 BROOKHOLLOW 92705,219,"[TAH 2017 2 BORROWER LLC, TAH 2017 2 BORROWER ..."
1973405,2020,4645 HAWTHORNE 20016,211,"[SFR ATL OWNER 7 L P, SFR ATL OWNER 8 L P, SFR..."
1897285,2020,1850 PARKWAY 30067,173,"[FKH SFR PROPCO D L P, RM1 SFR PROPCO A LP, FK..."
1895624,2020,180 STETSON 60601,150,"[HPA II BORROWER 2020 1 LLC, HPA II BORROWER 2..."
2009136,2020,6836 MORRISON 28211,123,"[MNSF II W1 LLC, MNSF II W1 LLC, MNSF II W1 LL..."
1949986,2020,3505 KOGER 30096,123,"[RNTR 3 LLC, RNTR 3 LLC, FYR SFR BORROWER LLC,..."


In [24]:
owner_scale[
    (owner_scale["TAXYR"] == 2020)
    & (owner_scale["owner_addr"] == "591 PUTNAM 6830")
].sort_values(by="count_owned_fulton_yr", ascending=False).head(15)

Unnamed: 0,TAXYR,owner_addr,count_owned_fulton_yr,assoc_owner_names
1996439,2020,591 PUTNAM 6830,3,"[STAR 2021-SFR2 BORROWER L P, STAR 2021 SFR2 B..."


### Identify major institutional investors

In [66]:
import re

owner_keywords = {
    "Amherst": ["AMHERST", "ARVM"],
    "Cerberus": ["CERBERUS", "FKH", "RM1 ", "RMI "],
    "Progress": ["PROGRESS", "FYR"],
    "Invitation": ["INVITATION", "IH "],
    "Colony": ["COLONY", "STARWOOD", "CSH", "CAH "],
    "Sylvan": ["SYLVAN", "RNTR"],
    "Tricon": ["TRICON", "TAH"]
}

for owner in owner_keywords:
    query_str = "|".join(owner_keywords[owner])

    filtered_rows = owner_scale[
        (owner_scale["TAXYR"] == 2020) &
        owner_scale["assoc_owner_names"].apply(lambda x: any(
            ((re.search(query_str, name)) for name in x)
        ))
    ]
    
    filtered_rows = filtered_rows[
        filtered_rows["count_owned_fulton_yr"] > 49
    ]
    
    total = filtered_rows["count_owned_fulton_yr"].sum()
    print(f"{owner} owned {total} properties in 2020")
    display(filtered_rows)
    
    addresses = filtered_rows["owner_addr"].unique().tolist()
    print(addresses)
    
    names = [set(x) for x in filtered_rows[
        filtered_rows["count_owned_fulton_yr"] > 49
    ]["assoc_owner_names"].to_list()]

    names = set().union(*names)
    names = ", ".join(names)

    with open(f"./output/{owner}_names.txt", "w") as f:
        f.write("KEYWORDS: " + query_str + "\n")
        f.write("ADDRESSES: " + ", ".join(addresses) + "\n\n")
        f.write("NAMES\n--------------------\n")
        f.write(names)

Amherst owned 720 properties in 2020


Unnamed: 0,TAXYR,owner_addr,count_owned_fulton_yr,assoc_owner_names
1979621,2020,5001 PLAZA ON THE 78746,539,"[ALTO ASSET COMPANY 2 LLC, ALTO ASSET COMPANY ..."
1979626,2020,5001 PLAZA ON THE LAKE 78746,91,"[JEFF 1 LLC, VICTRUM ANDREW, JEFF 1 LLC, JEFF ..."
2026786,2020,8300 MOPAC EXPRESSWAY 78759,90,"[LHF 4 ASSETS LLC, FIREBIRD SFE I LLC, FIREBIR..."


['5001 PLAZA ON THE 78746', '5001 PLAZA ON THE LAKE 78746', '8300 MOPAC EXPRESSWAY 78759']
Cerberus owned 256 properties in 2020


Unnamed: 0,TAXYR,owner_addr,count_owned_fulton_yr,assoc_owner_names
1848500,2020,0 PO BOX 2249 30028,83,"[CERBERUS SFR HOLDINGS LP, CERBERUS SFR HOLDIN..."
1897285,2020,1850 PARKWAY 30067,173,"[FKH SFR PROPCO D L P, RM1 SFR PROPCO A LP, FK..."


['0 PO BOX 2249 30028', '1850 PARKWAY 30067']
Progress owned 619 properties in 2020


Unnamed: 0,TAXYR,owner_addr,count_owned_fulton_yr,assoc_owner_names
1848935,2020,0 PO BOX 4090 85261,101,"[YAMASA CO LTD, HOME SFR BORROWER IV LLC, HOME..."
1949986,2020,3505 KOGER 30096,123,"[RNTR 3 LLC, RNTR 3 LLC, FYR SFR BORROWER LLC,..."
1949988,2020,3505 KOGER BLVD 30096,345,"[FYR SFR BORROWER LLC, FYR SFR BORROWER LLC, F..."
1982216,2020,5100 TAMARIND REEF 820,50,"[FYR SFR BORROWER LLC, FYR SFR BORROWER LLC, F..."


['0 PO BOX 4090 85261', '3505 KOGER 30096', '3505 KOGER BLVD 30096', '5100 TAMARIND REEF 820']
Invitation owned 677 properties in 2020


Unnamed: 0,TAXYR,owner_addr,count_owned_fulton_yr,assoc_owner_names
1893046,2020,1717 MAIN 75201,493,"[2018 3 IH BORROWER LP, 2018 3 IH BORROWER LP,..."
2029627,2020,8665 HARTFORD 85255,122,"[CSH PROPERTY ONE LLC, CSH PROPERTY ONE LLC, C..."
2032169,2020,901 MAIN 75202,62,"[2015 3 IH2 BORROWER LP, 2015 3 IH2 BORROWER L..."


['1717 MAIN 75201', '8665 HARTFORD 85255', '901 MAIN 75202']
Colony owned 122 properties in 2020


Unnamed: 0,TAXYR,owner_addr,count_owned_fulton_yr,assoc_owner_names
2029627,2020,8665 HARTFORD 85255,122,"[CSH PROPERTY ONE LLC, CSH PROPERTY ONE LLC, C..."


['8665 HARTFORD 85255']
Sylvan owned 422 properties in 2020


Unnamed: 0,TAXYR,owner_addr,count_owned_fulton_yr,assoc_owner_names
1949403,2020,3495 PIEDMONT 30305,299,"[BUTLER GLIDDEN COOPER LLC, BUTLER GLIDDEN COO..."
1949986,2020,3505 KOGER 30096,123,"[RNTR 3 LLC, RNTR 3 LLC, FYR SFR BORROWER LLC,..."


['3495 PIEDMONT 30305', '3505 KOGER 30096']
Tricon owned 219 properties in 2020


Unnamed: 0,TAXYR,owner_addr,count_owned_fulton_yr,assoc_owner_names
1885939,2020,1508 BROOKHOLLOW 92705,219,"[TAH 2017 2 BORROWER LLC, TAH 2017 2 BORROWER ..."


['1508 BROOKHOLLOW 92705']
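In the output above, some addresses are listed under more than one investor, which means their parcels are counted in more than one total. A short sketch using the printed address lists finds those overlaps; the variable names are illustrative.

```python
from collections import defaultdict

# Address lists copied from the printed output above
addresses_by_investor = {
    "Amherst": ["5001 PLAZA ON THE 78746", "5001 PLAZA ON THE LAKE 78746",
                "8300 MOPAC EXPRESSWAY 78759"],
    "Cerberus": ["0 PO BOX 2249 30028", "1850 PARKWAY 30067"],
    "Progress": ["0 PO BOX 4090 85261", "3505 KOGER 30096",
                 "3505 KOGER BLVD 30096", "5100 TAMARIND REEF 820"],
    "Invitation": ["1717 MAIN 75201", "8665 HARTFORD 85255", "901 MAIN 75202"],
    "Colony": ["8665 HARTFORD 85255"],
    "Sylvan": ["3495 PIEDMONT 30305", "3505 KOGER 30096"],
    "Tricon": ["1508 BROOKHOLLOW 92705"],
}

# Invert the mapping: which investors claim each address?
investors_by_addr = defaultdict(list)
for investor, addrs in addresses_by_investor.items():
    for addr in addrs:
        investors_by_addr[addr].append(investor)

shared = {
    addr: investors
    for addr, investors in investors_by_addr.items()
    if len(investors) > 1
}
print(shared)
```

The two shared addresses are worth checking in the saved `*_names.txt` files before attributing their parcel counts to a single investor.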


## Create transaction scale table (includes purchases from govt + banks)

In [None]:
purchases_fulton_yr = pd.DataFrame(
    sales_grantee_dropped.groupby(["TAXYR", "GRANTEE_match_addr"])["PARID"].count().astype(int)
).reset_index().rename(columns={"PARID": "Purchases Fulton", "GRANTEE_match_addr": "entity_addr"})

sales_fulton_yr = pd.DataFrame(
    sales_grantee_dropped.groupby(["TAXYR", "GRANTOR_match_addr"])["PARID"].count()
).reset_index().rename(columns={"PARID": "Sales Fulton", "GRANTOR_match_addr": "entity_addr"})

sale_scale = purchases_fulton_yr.merge(
    sales_fulton_yr,
    on=["TAXYR", "entity_addr"],
    how="outer"
)

sale_scale = sale_scale.fillna(0)
sale_scale["Purchases Fulton"] = sale_scale["Purchases Fulton"].astype(int)
sale_scale["Sales Fulton"] = sale_scale["Sales Fulton"].astype(int)
sale_scale["total_trans_fulton"] = sale_scale["Purchases Fulton"] + sale_scale["Sales Fulton"]

sale_scale.sort_values(by="total_trans_fulton", ascending=False).head(5)

Unnamed: 0,TAXYR,entity_addr,Purchases Fulton,Sales Fulton,total_trans_fulton
8855,2012,0 PO BOX 650043 75265,17,555,572
107106,2020,5001 PLAZA ON THE 78746,362,121,483
110878,2020,8800 ROSWELL 30350,11,417,428
121246,2021,5001 PLAZA ON THE 78746,250,172,422
139504,2022,8800 ROSWELL 30350,45,334,379
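The transaction scale is the sum of purchases and sales per (TAXYR, address), with a missing side counted as zero. A minimal standard-library sketch of that arithmetic on invented sales follows. Sales with an unmatched address are skipped here, which mirrors pandas `groupby` leaving out rows whose key is missing by default.

```python
from collections import Counter

# (TAXYR, buyer address, seller address), made-up sales; None = unmatched
sales = [
    (2020, "1 MAIN 30303", "2 OAK 30305"),
    (2020, "1 MAIN 30303", "3 ELM 30307"),
    (2020, "2 OAK 30305", "1 MAIN 30303"),
    (2021, "1 MAIN 30303", None),
]

purchases = Counter((yr, buyer) for yr, buyer, _ in sales if buyer is not None)
sold = Counter((yr, seller) for yr, _, seller in sales if seller is not None)

# Counter addition treats a missing key as 0, like the outer merge + fillna(0)
total_trans = purchases + sold
for key in sorted(total_trans):
    print(key, purchases[key], sold[key], total_trans[key])
```

An entity that only bought or only sold in a year still appears in the total, which is the reason the notebook uses an outer merge rather than an inner one.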


## Save owner and transaction scale

In [None]:
OUTPUT_PATH = 'output/'

owner_scale.to_csv(OUTPUT_PATH + 'owner_scale.csv', index=False)
sale_scale.to_csv(OUTPUT_PATH + 'sale_scale.csv', index=False)

## Create a valid sales only dataset

In [None]:
# Finish sales cleaning after dropping govt inst and banks
sales_full_valid = sales_full[sales_full["Saleval"] == "0"]
print(f"Number of valid sales: {len(sales_full_valid)}")

Number of valid sales: 112689


## Save

In [None]:
sales_full.to_parquet(OUTPUT_PATH + 'sales_full_final.parquet', index=False)
sales_full_valid.to_parquet(OUTPUT_PATH + 'sales_full_valid.parquet', index=False)
digest_full.to_parquet(OUTPUT_PATH + 'digest_full_final.parquet', index=False)