# Owner and Transaction Scale Calculations
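This notebook derives an address-based owner key from the cleaned parcel digest, matches the buyer and seller of each sale to that key, flags government, bank, and corporate entities and rental parcels, and builds the per-year ownership and transaction scale tables.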

In [51]:
import pandas as pd

pd.set_option('display.max_columns', 150)
pd.options.display.float_format = '{:,.3f}'.format

In [52]:
# Cleaned digest and sales data from clean_data.ipynb
FILES_PATH = 'output/'
digest_full = pd.read_parquet(FILES_PATH + 'digest_full_clean.parquet')
sales_full = pd.read_parquet(FILES_PATH + 'sales_full_clean.parquet')

## Modify data to enable aggregating on entity key

### Identify same owners in parcel data
- Drop any rows without Owner Address.
- Create an Owner Address column (labeled "owner_addr") that is the concatenation of owner address number, owner address string, and owner zip.
- If the address string contains the word BOX and a number, treat it as a PO box. These are formatted inconsistently (P O BOX 123, PO BOX 123, P.O. BOX 123, etc.), so we keep the number from the address string and prepend PO BOX so that all of them share one format.

**Why:** These fields give a highly accurate key for identifying the same owner. The owner address string omits suffixes such as ST or AVE, so common variations of one address (ST vs. STREET, etc.) do not split the key. Combined with the owner address number and owner zip, a matching key indicates with high confidence that the address is the same. This method is preferred over matching on names, which carries a higher chance of false positives and breaks down when large corporations operate through differently named subsidiaries. It may undercount when a company uses multiple addresses, but this is somewhat unlikely and is an acceptable limitation: large investors (the owners most likely to use several addresses) hold so many properties under each subsidiary that they fall in the correct bin regardless.
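
The key construction described above can be illustrated on a few toy strings. The following is a minimal plain-Python sketch, assuming the same regular expressions the notebook applies with pandas; the helper names `normalize_street` and `build_owner_key` are hypothetical and exist only for this illustration.

```python
import re

RE_BOX_AND_NUMBERS = re.compile(r".*BOX.*[0-9].*")
RE_CAPTURE_NUMBERS = re.compile(r"([0-9]+)")
RE_DOTS_COMMAS = re.compile(r"[.,]+")
RE_MULTIPLE_SPACES = re.compile(r"\s{2,}")


def normalize_street(adrstr: str) -> str:
    """Rewrite any 'BOX <number>' variant as 'PO BOX <number>' (hypothetical helper)."""
    if RE_BOX_AND_NUMBERS.search(adrstr):
        # Keep only the first run of digits found in the string.
        return "PO BOX " + RE_CAPTURE_NUMBERS.search(adrstr).group(1)
    return adrstr


def build_owner_key(adrno, adrstr: str, own_zip: str) -> str:
    """Join number, street, and zip, then strip dots, commas, and repeated spaces."""
    raw = f"{adrno} {normalize_street(adrstr)} {own_zip}"
    raw = RE_DOTS_COMMAS.sub("", raw)
    raw = RE_MULTIPLE_SPACES.sub(" ", raw)
    return raw.upper()


# Three spellings of the same PO box collapse to a single key.
variants = ["P O BOX 123", "PO BOX 123", "P.O. BOX 123"]
po_box_keys = {build_owner_key(0, v, "75265") for v in variants}
street_key = build_owner_key(4671, "DERBY LOOP", "30213")
print(po_box_keys)
print(street_key)
```

Street addresses pass through unchanged apart from punctuation and spacing, so only the PO box spellings are rewritten.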

In [53]:
# Re-format PO BOXES
re_box_and_numbers = r".*BOX.*[0-9].*"
re_capture_numbers = r"([0-9]+)"

digest_full["mod_own_adrstr"] = digest_full["Owner Adrstr"].copy(deep=True)

mask = digest_full["mod_own_adrstr"].str.contains(re_box_and_numbers, regex=True)

digest_full.loc[mask, "mod_own_adrstr"] = "PO BOX " + digest_full.loc[
    mask, "mod_own_adrstr"
].str.extract(re_capture_numbers)[0]

In [54]:
# Print total number of PO BOXES without a number in their address string
re_po_box_no_number = r"^(?!.*\d)[P]+.* BOX.*"
len(digest_full[digest_full["mod_own_adrstr"].str.contains(
    re_po_box_no_number, regex=True
)][["Owner Adrno", "mod_own_adrstr"]])

37

In [55]:
# Regex to clean by replacing dots, commas, and multiple spaces
# Also make all strings uppercase (they should be already)

re_dots_commas = r"[.,]+"
re_multiple_spaces = r"\s{2,}"

digest_full["owner_addr"] = (
    digest_full["Owner Adrno"].astype(str) + " " +
    digest_full["mod_own_adrstr"] + " " +
    digest_full["own_zip"]
).str.replace(
    re_dots_commas,
    "",
    regex=True
).str.replace(
    re_multiple_spaces,
    " ",
    regex=True
).str.upper()

In [56]:
digest_full[
    ["Owner Adrno", "Owner Adrstr", "mod_own_adrstr", "own_zip", "owner_addr"]
].sample(10)

Unnamed: 0,Owner Adrno,Owner Adrstr,mod_own_adrstr,own_zip,owner_addr
2333086,4671,DERBY LOOP,DERBY LOOP,30213,4671 DERBY LOOP 30213
1732274,5595,MILLWICK,MILLWICK,30005,5595 MILLWICK 30005
1889013,760,WINDWALK,WINDWALK,30076,760 WINDWALK 30076
813112,280,KING,KING,30342,280 KING 30342
372255,395,LAKE,LAKE,30318,395 LAKE 30318
1411989,12650,OLD SURREY,OLD SURREY,30075,12650 OLD SURREY 30075
639844,261,POLAR ROCK,POLAR ROCK,30315,261 POLAR ROCK 30315
2338643,920,HIGHLAND HILL,HIGHLAND HILL,30349,920 HIGHLAND HILL 30349
1472152,235,BLUE SPRUCE,BLUE SPRUCE,30005,235 BLUE SPRUCE 30005
1593356,11095,LINBROOK,LINBROOK,30097,11095 LINBROOK 30097


### Identify same owners in sales data

**Method**:
- Sales data does not contain buyer or seller addresses. We can't simply use the GRANTEE or GRANTOR name, because one owner corporation can appear under different names (subsidiaries, typos). Instead, we identify buyer and seller addresses as follows:
    - Match the GRANTEE name to parcel data on GRANTEE = Own1 (owner name) and take owner_addr for the CURRENT TAXYR.
    - Match the GRANTOR name to parcel data on GRANTOR = Own1 (owner name) and take owner_addr for the PREVIOUS TAXYR.
    - Where the GRANTEE or GRANTOR name doesn't match exactly (due to typos, etc.), take the owner_addr for the same PARID and TAXYR ONLY IF there was a single sale in that TAXYR. When a parcel sells several times in one TAXYR, the parcel data appears to record the last purchaser as the owner (see evidence below), so matching an earlier sale in that year would return the wrong owner address. This matters because we want the purchaser address for each sale, for example to account for flipping activity correctly.
    - Else, try to find an exact owner name match in all parcel data, not limited to PARID and TAXYR; use the last match if one is found (last because that is the most recent address of the company).
- In short:
    - Try to match by owner name, PARID, and TAXYR.
    - If no match, match on just PARID and TAXYR, ONLY IF there is a single transaction in the given TAXYR for that PARID.
    - Else, try to find an exact owner name match in all parcel data, not limited to PARID and TAXYR; use the last match if one is found.
    - Where none of the above methods work, drop the record if the total count is insignificant.
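
The fallback order above can be sketched for a single buyer on toy records. This is a minimal plain-Python illustration rather than the notebook's pandas merges: the function name `match_buyer_addr`, the rule labels, and all records are hypothetical, and "last match" here simply means the last row in list order.

```python
from collections import Counter


def match_buyer_addr(sale, digest, sale_counts):
    """Return (owner_addr, rule) for the buyer of one sale, or (None, None)."""
    parid, taxyr, name = sale["PARID"], sale["TAXYR"], sale["GRANTEE"]

    # Rule 1: same parcel, same tax year, same owner name.
    for row in digest:
        if (row["PARID"], row["TAXYR"], row["Own1"]) == (parid, taxyr, name):
            return row["owner_addr"], "exact"

    # Rule 2: same parcel and tax year, only if the parcel sold once that year.
    if sale_counts[(parid, taxyr)] == 1:
        for row in digest:
            if (row["PARID"], row["TAXYR"]) == (parid, taxyr):
                return row["owner_addr"], "single_sale"

    # Rule 3: same owner name anywhere in the digest; keep the last occurrence.
    name_rows = [row for row in digest if row["Own1"] == name]
    if name_rows:
        return name_rows[-1]["owner_addr"], "only_exact_name"

    return None, None


digest = [
    {"PARID": "A", "TAXYR": 2020, "Own1": "SMITH JOHN", "owner_addr": "10 OAK 30301"},
    {"PARID": "B", "TAXYR": 2020, "Own1": "ACME HOMES LLC", "owner_addr": "0 PO BOX 9 30302"},
    {"PARID": "C", "TAXYR": 2020, "Own1": "FLIPCO LLC", "owner_addr": "5 ELM 30303"},
    {"PARID": "D", "TAXYR": 2019, "Own1": "QUICK BUY LLC", "owner_addr": "1 PINE 30304"},
    {"PARID": "D", "TAXYR": 2021, "Own1": "QUICK BUY LLC", "owner_addr": "2 MAIN 30305"},
]
sales = [
    {"PARID": "A", "TAXYR": 2020, "GRANTEE": "SMITH JOHN"},        # exact match
    {"PARID": "B", "TAXYR": 2020, "GRANTEE": "ACME HOMES L L C"},  # name typo, single sale
    {"PARID": "C", "TAXYR": 2020, "GRANTEE": "QUICK BUY LLC"},     # first of two sales in the year
    {"PARID": "C", "TAXYR": 2020, "GRANTEE": "FLIPCO LLC"},        # last buyer, recorded as owner
    {"PARID": "E", "TAXYR": 2020, "GRANTEE": "NOBODY"},            # no match under any rule
]
sale_counts = Counter((s["PARID"], s["TAXYR"]) for s in sales)
results = [match_buyer_addr(s, digest, sale_counts) for s in sales]
for sale, result in zip(sales, results):
    print(sale["GRANTEE"], "->", result)
```

For sellers the same cascade would apply with GRANTOR in place of GRANTEE and the previous tax year's parcel record in place of the current one.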

In [57]:
# Minor cleaning on GRANTEE, GRANTOR, and Own1 (parcel data)
# Regex to clean by replacing dots, commas, and multiple spaces
# Also make all strings uppercase (they should be already)
re_dots_commas = r"[.,]+"
re_multiple_spaces = r"\s{2,}"

for col in ["GRANTEE", "GRANTOR"]:
    sales_full[col] = sales_full[col].str.replace(
        re_dots_commas, "", regex=True
    ).str.replace(
        re_multiple_spaces, " ", regex=True
    ).str.upper()
    
digest_full["Own1"] = digest_full["Own1"].str.replace(
    re_dots_commas, "", regex=True
).str.replace(
    re_multiple_spaces, " ", regex=True
).str.upper()

In [58]:
count_sales_yr = pd.DataFrame(
    sales_full.groupby(["TAXYR", "PARID"])["PARID"].count()
).rename(columns={"PARID": "count_sales_yr"})

sales_full = sales_full.merge(
    count_sales_yr,
    on=["TAXYR", "PARID"],
    how="inner"
)

more_than_one_sale_yr = len(
    sales_full[sales_full["count_sales_yr"] > 1].drop_duplicates(
        subset=["TAXYR", "PARID"]
    )
)

more_than_one_sale_yr_valid = len(
    sales_full[
       (sales_full["count_sales_yr"] > 1)
       & (sales_full["Saleval"] == "0")
    ].drop_duplicates(
        subset=["TAXYR", "PARID"]
    )
)

print(f"Count of properties that sold multiple times in one year: {more_than_one_sale_yr}")
print(f"Count of properties that sold multiple times in one year (valid only): {more_than_one_sale_yr_valid}")
count_sales_yr.sort_values(by="count_sales_yr", ascending=False).head(5)

Count of properties that sold multiple times in one year: 12354
Count of properties that sold multiple times in one year (valid only): 5937


TAXYR,PARID,count_sales_yr
2022,07 320100600219,8
2011,14 013500030907,5
2021,09F370001531401,4
2013,14 015800010133,4
2011,13 0194 LL0810,4


In [59]:
print(len(sales_full))
for person in ["GRANTEE", "GRANTOR"]:
    
    digest_df = digest_full[['PARID', 'TAXYR', 'owner_addr', 'Own1']].copy(deep=True)
    sale_df = sales_full[["TAXYR", "PARID", f"{person}", "count_sales_yr"]].copy(deep=True)
    
    if person == "GRANTOR":
        digest_df["TAXYR"] = digest_df["TAXYR"] + 1

    exact_match = sale_df.merge(
        digest_df,
        left_on=["PARID", "TAXYR", f"{person}"],
        right_on=["PARID", "TAXYR", "Own1"],
    ).rename(
        columns={"Own1": f"{person}_exact", "owner_addr": f"{person}_exact_addr"}
    ).drop_duplicates(subset=["PARID", "TAXYR"])

    exact_name_match = sale_df.merge(
        digest_df[["Own1", "owner_addr"]].drop_duplicates(subset="Own1", keep="last"),
        left_on=[f"{person}"],
        right_on=["Own1"],
    ).rename(columns={
        "Own1": f"{person}_only_exact_name", "owner_addr": f"{person}_only_exact_name_addr"
    }).drop_duplicates(subset=[f"{person}_only_exact_name"])

    single_sale_match = sale_df[
        sale_df["count_sales_yr"] < 2
    ].merge(
        digest_df,
        left_on=["PARID", "TAXYR"],
        right_on=["PARID", "TAXYR"],
    ).rename(
        columns={"Own1": f"{person}_single_sale", "owner_addr": f"{person}_single_sale_addr"}
    ).drop_duplicates(subset=["PARID", "TAXYR"])
    
    sales_full = sales_full.merge(
        exact_match[["TAXYR", "PARID", f"{person}_exact", f"{person}_exact_addr"]],
        left_on=["TAXYR", "PARID", f"{person}"],
        right_on=["TAXYR", "PARID", f"{person}_exact"],
        how="left"
    )
    
    sales_full = sales_full.merge(
        single_sale_match[["TAXYR", "PARID", f"{person}_single_sale", f"{person}_single_sale_addr"]],
        on=["TAXYR", "PARID"],
        how="left"
    )
    
    sales_full = sales_full.merge(
        exact_name_match[[f"{person}_only_exact_name", f"{person}_only_exact_name_addr"]],
        left_on=[f"{person}"],
        right_on=[f"{person}_only_exact_name"],
        how="left"
    )

len(sales_full)

180692


180692

In [60]:
for person in ["GRANTEE", "GRANTOR"]:
    print(f"Person: {person} ---")
    sales_full[f"{person}_match"] = sales_full[f"{person}_exact"]
    sales_full[f"{person}_match_addr"] = sales_full[f"{person}_exact_addr"]
    num_matched = len(sales_full[sales_full[f'{person}_match'].notna()])
    print(f"Number exact matched: {num_matched}")
    print(f"Pct exact matched: {num_matched / len(sales_full)}")
    print("")
    
    for match in ["single_sale", "only_exact_name"]:
        sales_full[f"{person}_match"] = sales_full[f"{person}_match"].fillna(
            sales_full[f"{person}_{match}"]
        )
        sales_full[f"{person}_match_addr"] = sales_full[f"{person}_match_addr"].fillna(
            sales_full[f"{person}_{match}_addr"]
        )
        prev_matched = num_matched
        num_matched = len(sales_full[sales_full[f'{person}_match'].notna()])
        print(f"Number of additional matches with {match}: {num_matched - prev_matched}")
        print(f"Number prev matches + {match} matched: {num_matched}")
        print(f"Pct prev matches + {match} matched: {num_matched / len(sales_full)}")
        print("")
    
    print("")
    print("")

Person: GRANTEE ---
Number exact matched: 143965
Pct exact matched: 0.7967425231886304

Number of additional matches with single_sale: 20541
Number prev matches + single_sale matched: 164506
Pct prev matches + single_sale matched: 0.9104221548269984

Number of additional matches with only_exact_name: 11777
Number prev matches + only_exact_name matched: 176283
Pct prev matches + only_exact_name matched: 0.9755993624510216



Person: GRANTOR ---
Number exact matched: 64527
Pct exact matched: 0.35711044207823256

Number of additional matches with single_sale: 80991
Number prev matches + single_sale matched: 145518
Pct prev matches + single_sale matched: 0.8053372589821354

Number of additional matches with only_exact_name: 21275
Number prev matches + only_exact_name matched: 166793
Pct prev matches + only_exact_name matched: 0.9230790516458947





Save a random sample of 100 matches for manual verification.

In [61]:
sales_full.sample(100)[
    ["TAXYR", "PARID", "GRANTEE", "GRANTEE_match", "GRANTOR", "GRANTOR_match"]
].sort_values(by="TAXYR").to_csv('output/sales_matches_sample.csv', index=False)

## Drop govt and bank entities from transaction data
- Also retain a dataset where these entities are dropped only when they are the GRANTEE, because someone buying from one of these entities is still influencing the market, so the purchase should count toward the buyer's transaction scale.
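
The difference between the two retained datasets can be seen on three toy sales. This is a minimal plain-Python sketch under the assumption that an entity is classified by substring keyword match on its name; the helper `is_govt_or_bank` and the example names are hypothetical.

```python
GOVT_KEYWORDS = ["FEDERAL", "SECRETARY", "CITY", "COUNTY", "FANNIE", "FREDDIE"]
BANK_KEYWORDS = ["BANK", "MORTGAGE", "LENDING", "LOAN", "FINANCE", "FUND", "CREDIT", "TRUST"]


def is_govt_or_bank(name: str) -> bool:
    """True if the name contains any government or bank keyword as a substring."""
    return any(key in name for key in GOVT_KEYWORDS + BANK_KEYWORDS)


toy_sales = [
    {"GRANTEE": "SMITH JOHN", "GRANTOR": "JONES MARY"},
    {"GRANTEE": "ACME HOMES LLC", "GRANTOR": "FIRST EXAMPLE BANK"},
    {"GRANTEE": "FIRST EXAMPLE BANK", "GRANTOR": "DOE JANE"},
]

# Dropped only when the buyer is a govt or bank entity: purchases FROM a bank stay.
grantee_dropped = [s for s in toy_sales if not is_govt_or_bank(s["GRANTEE"])]

# Dropped when either party is a govt or bank entity.
both_dropped = [
    s for s in toy_sales
    if not (is_govt_or_bank(s["GRANTEE"]) or is_govt_or_bank(s["GRANTOR"]))
]

print(len(grantee_dropped), len(both_dropped))
```

Because the match is by substring, a name such as `DOE FAMILY TRUST` would also be classified under the `TRUST` keyword, so the saved lists of classified names are worth reviewing.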

In [62]:
govt_keywords = ['FEDERAL', 'SECRETARY', 'CITY', 'COUNTY', 'FANNIE', 'FREDDIE']
bank_keywords = [
    'BANK', 'MORTGAGE', 'LENDING', 'LOAN',
    'FINANCE', 'FUND', 'CREDIT', 'TRUST'
]
govt = []
banks = []

govt += sales_full[
    sales_full['GRANTEE'].apply(lambda x: any([key in str(x) for key in govt_keywords]))
]['GRANTEE'].unique().tolist() + sales_full[
    sales_full['GRANTOR'].apply(lambda x: any([key in str(x) for key in govt_keywords]))
]['GRANTOR'].unique().tolist() + digest_full[
    digest_full["Own1"].apply(lambda x: any([key in str(x) for key in govt_keywords]))
]['Own1'].unique().tolist()

banks += sales_full[
    sales_full['GRANTEE'].apply(lambda x: any([key in str(x) for key in bank_keywords]))
]['GRANTEE'].unique().tolist() + sales_full[
    sales_full['GRANTOR'].apply(lambda x: any([key in str(x) for key in bank_keywords]))
]['GRANTOR'].unique().tolist() + digest_full[
    digest_full["Own1"].apply(lambda x: any([key in str(x) for key in bank_keywords]))
]['Own1'].unique().tolist()

# Save record of entities classified as govt or bank
with open("./output/govt.txt", "w") as f:
    f.write("\n".join(govt))
with open("./output/banks.txt", "w") as f:
    f.write("\n".join(banks))

In [63]:
init_len_sales = len(sales_full)
init_len_digest = len(digest_full)

sales_grantee_dropped = sales_full[
    ~(sales_full["GRANTEE"].isin(govt + banks))
].copy(deep=True)

sales_full = sales_full[
    ~((sales_full["GRANTEE"].isin(govt + banks))
    | (sales_full["GRANTOR"].isin(govt + banks)))
]

print("Number dropped from sales: ", init_len_sales - len(sales_full))

Number dropped from sales:  30408


## Identify corporate owners, create corp owner flags for each record
- GRANTEE and GRANTOR in sales
- Own1 in digest

In [64]:
# Keywords with a risk of false positives, like "CO", need a space prepended or appended
# Note: periods were stripped from names earlier, so dotted variants such as 'L.L.C' no longer match
corp_keywords = [
    'LLC', ' INC', 'LLP', 'L.L.C', 'L.L.P', 'I.N.C', 'L L C',
    'L L P', ' L P', ' LP', 'LTD', ' CORP', 'CORPORATION',
    'COMPANY', ' CO ', 'LIMITED', 'PARTNERSHIP', 'PARTNERSHIPS',
    'ASSOCIATION', 'ASSOC', 'INCORPORATED', 'INCORP',
    'L.T.D', "HOME", "SOLUTIONS"
]

# Make a list of all corp owners
corps = sales_full[
    sales_full['GRANTEE'].apply(lambda x: any([key in str(x) for key in corp_keywords]))
]['GRANTEE'].unique().tolist() + sales_full[
    sales_full['GRANTOR'].apply(lambda x: any([key in str(x) for key in corp_keywords]))
]['GRANTOR'].unique().tolist() + digest_full[
    digest_full["Own1"].apply(lambda x: any([key in str(x) for key in corp_keywords]))
]['Own1'].unique().tolist()

with open("./output/corp_names.txt", "w") as f:
    f.write("\n".join(corps))
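
Space-padded substring keywords have two edge cases that a small comparison makes visible. The sketch below contrasts a subset of the notebook's keywords with a hypothetical alternative that matches whole tokens using regex word boundaries; the alternative is not the method used in this notebook, and `flag_substring` and `flag_token` are illustrative names.

```python
import re

# Subset of the substring keywords used in the notebook.
SUBSTRING_KEYWORDS = ["LLC", " INC", " LP", " CORP", " CO ", "COMPANY", "HOME"]

# Hypothetical alternative: match whole tokens only.
TOKEN_RE = re.compile(r"\b(?:LLC|INC|LP|CORP|CO|COMPANY)\b")


def flag_substring(name: str) -> bool:
    """Notebook-style check: any keyword appears anywhere in the name."""
    return any(key in name for key in SUBSTRING_KEYWORDS)


def flag_token(name: str) -> bool:
    """Alternative check: a keyword appears as a standalone word."""
    return TOKEN_RE.search(name) is not None


for name in ["ACME TRADING CO", "YAMASA CO LTD", "HOMER JANE", "COLLINS DENISE"]:
    print(f"{name!r}: substring={flag_substring(name)}, token={flag_token(name)}")
```

`ACME TRADING CO` is missed by the padded `' CO '` keyword because nothing follows the final `CO`, while `HOMER JANE` is flagged by the unpadded `HOME` keyword. A token-based rule avoids both, but it would not catch spaced spellings such as `L L C` without extra patterns.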

In [65]:
# Flag for any corp owner
sales_full["GRANTEE_corp_flag"] = sales_full['GRANTEE'].isin(corps).astype(int)
sales_full["GRANTOR_corp_flag"] = sales_full['GRANTOR'].isin(corps).astype(int)

digest_full["own_corp_flag"] = digest_full["Own1"].isin(corps).astype(int)

In [66]:
sales_full[['GRANTEE', 'GRANTEE_corp_flag', 'GRANTOR', 'GRANTOR_corp_flag']].sample(10)

Unnamed: 0,GRANTEE,GRANTEE_corp_flag,GRANTOR,GRANTOR_corp_flag
75323,RADOW LISA ROSE &,0,CHESHIRE RANNEY W,0
158816,ANDRO PROPERTIES LLC,1,BIRKHOLZ JOHN,0
55842,COLLINS DENISE & LEE KEVIN R,0,PRISCO AARON,0
48790,DIVAN AMMAR &,0,MURPHY SR THOMAS W,0
73538,ALLEN ANASTASIA,0,JENKINS TIFFANY N,0
176411,CHILD JOSHUA,0,TPG HOMES AT CRABAPPLE LLC,1
18125,BANTA TERRY C,0,STUMM RICHARD LYNN,0
148106,SARVIS JACOB LEE &,0,DAVIS LAKISHA J,0
104426,NAIR HARIKRISHNAN DAMODARAN,0,KRUGER PAMELA JEAN,0
93130,XING GUANGZHEN & LI LI,0,BERGENSKE PETER D & CHERY L,0


In [67]:
digest_full[['Own1', 'own_corp_flag']].sample(10)

Unnamed: 0,Own1,own_corp_flag
2424774,ARNOLD MARY E ET AL,0
1333857,ARNOLD SANDRA K,0
212341,BATTLE NOEL & MARTHA H,0
2564392,HENRY SHIRLEY M,0
1749541,IRISH DAVID B & DAPHNE,0
2288087,RAGLIN ALISA J,0
2265064,BURGESS DONALD & IMOGENE,0
79467,MORALES MIGUEL ALVAREZ,0
789953,PATTERSON J0HN S,0
1263436,MONROE PATRICIA M,0


## Create a rental property flag

In [68]:
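# Flag a parcel as a rental when the owner's mailing address differs from the
# property (situs) address in BOTH street number and street name
digest_full["rental_flag"] = 0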
digest_full.loc[
    ((digest_full["Situs Adrno"] != digest_full["Owner Adrno"])
    & (digest_full["Situs Adrstr"] != digest_full["Owner Adrstr"])),
    "rental_flag"
] = 1
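
The rule above requires both the street number and the street name to differ. The following plain-Python sketch compares it with a looser, hypothetical variant that flags a parcel when either field differs; the function names and the third example row are illustrative only.

```python
def rental_flag_and(situs_no, situs_str, owner_no, owner_str) -> int:
    """Notebook rule: flag only when number AND street name both differ."""
    return int(situs_no != owner_no and situs_str != owner_str)


def rental_flag_or(situs_no, situs_str, owner_no, owner_str) -> int:
    """Looser hypothetical variant: flag when either field differs."""
    return int(situs_no != owner_no or situs_str != owner_str)


rows = [
    (146, "MADO", 9201, "SELBORNE"),         # different number and street
    (550, "STONEBRIAR", 550, "STONEBRIAR"),  # identical address
    (100, "KING", 120, "KING"),              # same street, different number
]
flags = [(rental_flag_and(*r), rental_flag_or(*r)) for r in rows]
for row, pair in zip(rows, flags):
    print(row, pair)
```

The two rules disagree only on the third row: an owner whose mailing address is on the same street as the parcel is not flagged under the notebook's rule. Which behaviour is preferable depends on how often owners hold neighbouring parcels, which this sketch does not measure.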

In [69]:
digest_full[['Situs Adrno', 'Owner Adrno', 'Situs Adrstr', 'Owner Adrstr', 'rental_flag']].sample(20)

Unnamed: 0,Situs Adrno,Owner Adrno,Situs Adrstr,Owner Adrstr,rental_flag
2498444,550,550,STONEBRIAR,STONEBRIAR,0
718367,2383,2383,ARNO,ARNO,0
2597376,146,9201,MADO,SELBORNE,1
618458,2935,2935,HEATHER,HEATHER,0
1108611,742,742,YORKSHIRE,YORKSHIRE,0
670033,3375,3375,HARRIS,HARRIS,0
1555686,15386,480,BIRMINGHAM,SADDLE HORN,1
803158,5735,5735,RIVERWOOD,RIVERWOOD,0
1550354,275,275,WILDE GREEN,WILDE GREEN,0
1598103,5760,5760,HERSHINGER CLOSE,HERSHINGER CLOSE,0


## For each sale, create a dummy variable for each sale type
- corp purchase from ind
- corp sale to ind
- ind to ind
- corp to corp
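
Because the four sale types are defined by the two corporate flags, exactly one dummy is set for every sale. A minimal plain-Python sketch of that mapping, using a hypothetical helper `sale_type_dummies`:

```python
# (GRANTEE_corp_flag, GRANTOR_corp_flag) -> sale type
SALE_TYPES = {
    (1, 0): "corp_bought_ind",
    (0, 1): "corp_sold_ind",
    (0, 0): "ind_to_ind",
    (1, 1): "corp_to_corp",
}


def sale_type_dummies(grantee_corp_flag: int, grantor_corp_flag: int) -> dict:
    """Return one 0/1 dummy per sale type for a single sale."""
    chosen = SALE_TYPES[(grantee_corp_flag, grantor_corp_flag)]
    return {name: int(name == chosen) for name in SALE_TYPES.values()}


all_dummies = {flags: sale_type_dummies(*flags) for flags in SALE_TYPES}
for flags, dummies in all_dummies.items():
    print(flags, dummies)
```

Checking that the four dummies sum to 1 on every row is a quick validation of the sale type matrix.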

In [70]:
# Sale type matrix

sales_full['corp_bought_ind'] = 0
sales_full['corp_sold_ind'] = 0
sales_full['ind_to_ind'] = 0
sales_full['corp_to_corp'] = 0

sales_full.loc[
    (sales_full["GRANTEE_corp_flag"] == 1) & (sales_full["GRANTOR_corp_flag"] == 0), 'corp_bought_ind'
] = 1
sales_full.loc[
    (sales_full["GRANTEE_corp_flag"] == 0) & (sales_full["GRANTOR_corp_flag"] == 0), 'ind_to_ind'
] = 1
sales_full.loc[
    (sales_full["GRANTEE_corp_flag"] == 0) & (sales_full["GRANTOR_corp_flag"] == 1), 'corp_sold_ind'
] = 1
sales_full.loc[
    (sales_full["GRANTEE_corp_flag"] == 1) & (sales_full["GRANTOR_corp_flag"] == 1), 'corp_to_corp'
] = 1

# Validate sale matrix is correct
sales_full[[
    "GRANTEE", "GRANTEE_corp_flag", "GRANTOR", "GRANTOR_corp_flag", "corp_bought_ind", "ind_to_ind",
    "corp_sold_ind", "corp_to_corp"
]].sample(10)

Unnamed: 0,GRANTEE,GRANTEE_corp_flag,GRANTOR,GRANTOR_corp_flag,corp_bought_ind,ind_to_ind,corp_sold_ind,corp_to_corp
33204,KADAVIL JOE &,0,NELSEN MATTHEW S,0,0,1,0,0
92486,NELSON ANDREW R,0,BAILEY BOYD & RITA I,0,0,1,0,0
4964,RESIDENTIAL HOME OWNER ATLANTA LLC,1,ATL 3 SF LLC,1,0,0,0,1
178216,CANNON EDWARD J III &,0,THOMPSON MICHAEL C,0,0,1,0,0
1003,FYR SFR BORROWER LLC,1,ARLP REO 400 LLC A DELAWARE LIMITED L,1,0,0,0,1
31898,KOCH ANDREW & JORDAN,0,KUTCHINSKI JULIE A,0,0,1,0,0
152868,LI LEI,0,ABRUTIENE EGLE,0,0,1,0,0
142528,SYKES DIONETTA,0,LAKES CHARLES E,0,0,1,0,0
66703,HUSTLE 4 HOUSES LLC,1,FALCON MUTUAL LLC,1,0,0,0,1
153795,CARLOS KARI A,0,SOLOMON JULIE,0,0,1,0,0


## Create ownership scale table

In [71]:
owned_fulton_yr = pd.DataFrame(
    digest_full.groupby(["TAXYR", "owner_addr"])["PARID"].count()
).rename(columns={"PARID": "count_owned_fulton_yr"}).reset_index()
owned_fulton_yr

assoc_owner_names = pd.DataFrame(
    digest_full.groupby(["owner_addr"]).agg({"Own1": list})
).rename(columns={"Own1": "assoc_owner_names"}).reset_index()

owner_scale = owned_fulton_yr.merge(
    assoc_owner_names,
    on=["owner_addr"],
    how="left"
)

owner_scale.sort_values(by="count_owned_fulton_yr", ascending=False).head(5)

Unnamed: 0,TAXYR,owner_addr,count_owned_fulton_yr,assoc_owner_names
2374287,2022,5001 PLAZA ON THE 78746,714,"[EPH 2 ASSETS LLC, ALTO ASSET COMPANY 2 LLC, A..."
2178468,2021,5001 PLAZA ON THE 78746,646,"[EPH 2 ASSETS LLC, ALTO ASSET COMPANY 2 LLC, A..."
187161,2011,0 PO BOX 650043 75265,639,"[FEDERAL NATIONAL MORTGAGE ASSOCIATION, FEDERA..."
2290679,2022,1850 PARKWAY 30067,604,"[FKH SFR PROPCO D L P, FKH SFR PROPCO D L P, R..."
2241215,2022,0 PO BOX 4090 85261,582,"[YAMASA CO LTD, HOME SFR BORROWER IV LLC, HOME..."


## Create transaction scale table (includes purchases from govt + banks)

In [72]:
purchases_fulton_yr = pd.DataFrame(
    sales_grantee_dropped.groupby(["TAXYR", "GRANTEE_match_addr"])["PARID"].count().astype(int)
).reset_index().rename(columns={"PARID": "Purchases Fulton", "GRANTEE_match_addr": "entity_addr"})

sales_fulton_yr = pd.DataFrame(
    sales_grantee_dropped.groupby(["TAXYR", "GRANTOR_match_addr"])["PARID"].count()
).reset_index().rename(columns={"PARID": "Sales Fulton", "GRANTOR_match_addr": "entity_addr"})

sale_scale = purchases_fulton_yr.merge(
    sales_fulton_yr,
    on=["TAXYR", "entity_addr"],
    how="outer"
)

sale_scale = sale_scale.fillna(0)
sale_scale["Purchases Fulton"] = sale_scale["Purchases Fulton"].astype(int)
sale_scale["Sales Fulton"] = sale_scale["Sales Fulton"].astype(int)
sale_scale["total_trans_fulton"] = sale_scale["Purchases Fulton"] + sale_scale["Sales Fulton"]

sale_scale.sort_values(by="total_trans_fulton", ascending=False).head(5)

Unnamed: 0,TAXYR,entity_addr,Purchases Fulton,Sales Fulton,total_trans_fulton
8893,2012,0 PO BOX 650043 75265,17,555,572
107983,2020,5001 PLAZA ON THE 78746,362,121,483
111792,2020,8800 ROSWELL 30350,16,415,431
122222,2021,5001 PLAZA ON THE 78746,250,172,422
140685,2022,8800 ROSWELL 30350,45,334,379
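
The table above combines per-year purchase and sale counts for each entity address. As a rough plain-Python equivalent on toy data (hypothetical addresses `A`, `B`, `C`), a `Counter` returns zero for a missing key, which plays the role of the outer merge followed by `fillna(0)`; sales with no matched address are skipped, as a pandas `groupby` does by default for missing keys.

```python
from collections import Counter

toy_sales = [
    {"TAXYR": 2020, "GRANTEE_match_addr": "A", "GRANTOR_match_addr": "B"},
    {"TAXYR": 2020, "GRANTEE_match_addr": "A", "GRANTOR_match_addr": "C"},
    {"TAXYR": 2020, "GRANTEE_match_addr": "B", "GRANTOR_match_addr": "A"},
    {"TAXYR": 2021, "GRANTEE_match_addr": "C", "GRANTOR_match_addr": None},
]

# Count purchases and sales per (tax year, entity address).
purchases = Counter(
    (s["TAXYR"], s["GRANTEE_match_addr"])
    for s in toy_sales if s["GRANTEE_match_addr"] is not None
)
sold = Counter(
    (s["TAXYR"], s["GRANTOR_match_addr"])
    for s in toy_sales if s["GRANTOR_match_addr"] is not None
)

# Union of keys mirrors the outer merge; missing counts default to 0.
toy_scale = {
    key: {
        "Purchases Fulton": purchases[key],
        "Sales Fulton": sold[key],
        "total_trans_fulton": purchases[key] + sold[key],
    }
    for key in sorted(set(purchases) | set(sold))
}
for key, row in toy_scale.items():
    print(key, row)
```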


## Save owner and transaction scale

In [73]:
OUTPUT_PATH = 'output/'

owner_scale.to_csv(OUTPUT_PATH + 'owner_scale.csv', index=False)
sale_scale.to_csv(OUTPUT_PATH + 'sale_scale.csv', index=False)

## Create a valid sales only dataset

In [74]:
# Finish sales cleaning after dropping govt institutions and banks: keep valid sales only (Saleval == "0")
sales_full_valid = sales_full[sales_full["Saleval"] == "0"]
print(f"Number of valid sales: {len(sales_full_valid)}")

Number of valid sales: 112689


## Save

In [75]:
sales_full.to_parquet(OUTPUT_PATH + 'sales_full_final.parquet', index=False)
sales_full_valid.to_parquet(OUTPUT_PATH + 'sales_full_valid.parquet', index=False)
digest_full.to_parquet(OUTPUT_PATH + 'digest_full_final.parquet', index=False)