# Find the comparables: real_acct.txt

The file `real_acct.txt` contains important property information such as total appraised value (the target in this exercise), neighborhood, school district, economic group, land value, and more. Let's load this file and take a subset of the important columns to continue our study.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from pathlib import Path
import pickle

import pandas as pd

from src.definitions import ROOT_DIR
from src.data.utils import Table, save_pickle

In [3]:
real_acct_fn = ROOT_DIR / 'data/external/2016/Real_acct_owner/real_acct.txt'
assert real_acct_fn.exists()

In [4]:
real_acct = Table(real_acct_fn, '2016')

In [5]:
real_acct.get_header()

['acct',
 'yr',
 'mailto',
 'mail_addr_1',
 'mail_addr_2',
 'mail_city',
 'mail_state',
 'mail_zip',
 'mail_country',
 'undeliverable',
 'str_pfx',
 'str_num',
 'str_num_sfx',
 'str',
 'str_sfx',
 'str_sfx_dir',
 'str_unit',
 'site_addr_1',
 'site_addr_2',
 'site_addr_3',
 'state_class',
 'school_dist',
 'map_facet',
 'key_map',
 'Neighborhood_Code',
 'Neighborhood_Grp',
 'Market_Area_1',
 'Market_Area_1_Dscr',
 'Market_Area_2',
 'Market_Area_2_Dscr',
 'econ_area',
 'econ_bld_class',
 'center_code',
 'yr_impr',
 'yr_annexed',
 'splt_dt',
 'dsc_cd',
 'nxt_bld',
 'bld_ar',
 'land_ar',
 'acreage',
 'Cap_acct',
 'shared_cad',
 'land_val',
 'bld_val',
 'x_features_val',
 'ag_val',
 'assessed_val',
 'tot_appr_val',
 'tot_mkt_val',
 'prior_land_val',
 'prior_bld_val',
 'prior_x_features_val',
 'prior_ag_val',
 'prior_tot_appr_val',
 'prior_tot_mkt_val',
 'new_construction_val',
 'tot_rcn_val',
 'value_status',
 'noticed',
 'notice_dt',
 'protested',
 'certified_date',
 'rev_dt',
 'rev_by',
 '

# Load accounts and columns of interest
Let's remove the account numbers that don't meet the free-standing single-family home criteria we identified while processing the `building_res.txt` file.

The columns listed above also include a lot of value information and property groupings that might come in handy when predicting the appraised value. Let's take a slice with some of the important columns.

In [6]:
skiprows = real_acct.get_skiprows()

In [7]:
cols = [
    'acct',
    'site_addr_3', # Zip
    'school_dist',
    'Neighborhood_Code',
    'Market_Area_1_Dscr',
    'Market_Area_2_Dscr',
    'center_code',
    'bld_ar',
    'land_ar',
    'acreage',
    'land_val',
    'tot_appr_val', # Target
    'prior_land_val',
    'prior_tot_appr_val',
    'new_own_dt',  # New owner date
]

In [8]:
real_acct_df = real_acct.get_df(skiprows=skiprows, usecols=cols)
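`Table.get_df` is a project helper whose implementation is not shown here. As a rough illustration of what a call with `skiprows` and `usecols` could amount to, here is a minimal sketch using plain `pandas.read_csv`. The tab delimiter, the sample values, and the names `sample`, `wanted_cols`, and `rows_to_skip` are assumptions made for the example, not details taken from the project code.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for a delimited export (made-up values,
# tab delimiter assumed for illustration).
sample = io.StringIO(
    "acct\tschool_dist\ttot_appr_val\tmailto\n"
    "0001\t1\t145200\tA\n"
    "0002\t1\t75000\tB\n"
    "0003\t4\t85929\tC\n"
)

wanted_cols = ["acct", "school_dist", "tot_appr_val"]
# 0-based line numbers in the file; line 0 is the header row,
# so line 2 is the second data row.
rows_to_skip = [2]

subset_df = pd.read_csv(
    sample,
    sep="\t",
    usecols=wanted_cols,    # load only the columns of interest
    skiprows=rows_to_skip,  # drop the rows flagged as not meeting the criteria
    dtype={"acct": str},    # read account numbers as text to keep leading zeros
)
print(subset_df)
```

Filtering at read time like this avoids loading the unwanted rows and columns into memory at all, which matters for wide files such as this one.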

In [9]:
real_acct_df.head()

Unnamed: 0,acct,site_addr_3,school_dist,Neighborhood_Code,Market_Area_1_Dscr,Market_Area_2_Dscr,center_code,bld_ar,land_ar,acreage,land_val,tot_appr_val,prior_land_val,prior_tot_appr_val,new_own_dt
0,21440000001,77003,1,8400.12,"1C Midtown, Riverside Terrace, University Areas","1C Midtown, Riverside Terrace, University Areas",61,2537,5000,0.1148,125000.0,145200.0,75000.0,132000.0,2012-09-11 00:00:00.000
1,21470000008,77003,1,8400.12,"1C Midtown, Riverside Terrace, University Areas","1C Midtown, Riverside Terrace, University Areas",61,1000,5000,0.1148,74900.0,75000.0,64000.0,65000.0,1988-01-02 00:00:00.000
2,21480000002,77003,1,8400.12,"1C Midtown, Riverside Terrace, University Areas","1C Midtown, Riverside Terrace, University Areas",61,1496,5000,0.1148,125000.0,85929.0,75000.0,78118.0,2004-07-28 00:00:00.000
3,21650000007,77003,1,8400.12,"1C Midtown, Riverside Terrace, University Areas","1C Midtown, Riverside Terrace, University Areas",61,3387,5000,0.1148,125000.0,549004.0,75000.0,75000.0,2013-10-14 00:00:00.000
4,21650000011,77003,1,8400.12,"1C Midtown, Riverside Terrace, University Areas","1C Midtown, Riverside Terrace, University Areas",61,1508,6250,0.1435,140625.0,181370.0,84375.0,164882.0,2001-05-05 00:00:00.000


Double-check that each account number appears in only one row.

In [10]:
assert real_acct_df['acct'].is_unique
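If the assertion above ever fails, it helps to see which account numbers repeat. The following is a small sketch on a made-up frame; `find_duplicate_accounts` is a hypothetical helper written for this example, not part of `src.data.utils`.

```python
import pandas as pd


def find_duplicate_accounts(df, key="acct"):
    """Return every row whose key value occurs more than once."""
    # keep=False marks all occurrences of a repeated value, not just the later ones.
    mask = df[key].duplicated(keep=False)
    return df[mask].sort_values(key)


# Made-up data with one repeated account number.
example_df = pd.DataFrame(
    {
        "acct": ["0001", "0002", "0002"],
        "tot_appr_val": [145200.0, 75000.0, 76000.0],
    }
)

dupes = find_duplicate_accounts(example_df)
print(dupes)
```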

# Export real_acct

In [11]:
save_fn = ROOT_DIR / 'data/raw/2016/real_acct_comps.pickle'
save_pickle(real_acct_df, save_fn)
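`save_pickle` comes from the project's `src.data.utils` module and its implementation is not shown. A helper of this kind could be sketched with the standard library as below; the names `save_pickle_sketch` and `load_pickle_sketch` are hypothetical, and the real helper may behave differently (for example, it may not create parent directories).

```python
import pickle
import tempfile
from pathlib import Path


def save_pickle_sketch(obj, path):
    """Serialize obj to path, creating parent directories if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as fh:
        pickle.dump(obj, fh)


def load_pickle_sketch(path):
    """Load a previously pickled object from path."""
    with Path(path).open("rb") as fh:
        return pickle.load(fh)


# Round-trip a small made-up record through a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "raw" / "example.pickle"
    record = {"acct": "0001", "tot_appr_val": 145200.0}
    save_pickle_sketch(record, target)
    restored = load_pickle_sketch(target)
```

Reading the file back right after writing it is a cheap way to confirm the export is usable by the next notebook in the pipeline.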