# Harris County Appraisal District (HCAD) real and personal property data
Harris County is the [third most populous](https://en.wikipedia.org/wiki/List_of_the_most_populous_counties_in_the_United_States) county in the USA. Its appraisal district (HCAD) publishes a rich yearly dataset describing the characteristics of each appraised property (appraised value, fixtures, features, and so on). In this notebook we explore the data for the year 2016 to understand what is available and to select variables that can help us decide whether a given property was appraised fairly.

[HCAD data](https://pdata.hcad.org/download/2016.html) consists of several text files, grouped in zipped files as follows:

1. Real_acct_owner.zip
    * **Real_acct.txt**: account including owner name, owner mailing address, $\color{red}{values}$, $\color{red}{site~address}$, and legal descriptions.
    * **Real_neighborhood_code.txt**: $\color{red}{neighborhood~code}$, group code, and description.
    * **Parcel_tiebacks.txt**
    * **Permits.txt**: account, including permit type, permit description, and status.
    * **Owners.txt**: multiple owners.
    * **Deeds.txt**: deed information.


2. Real_building_land.zip
    * **Building_res.txt**: $\color{red}{all~residential~information}$
    * **Building_other.txt**: all other real properties, such as commercial, plus information for income-producing properties, including occupancy rates and operating income.
    * **Exterior.txt**: $\color{red}{general~data~about~buildings~and~sub~areas,~(style~or~use,~size,~year~built).}$
    * **Fixture.txt**: $\color{red}{characteristics~of~the~building}$. This includes bedrooms, fireplace, bathrooms, stories for residential. Also contains wall height, elevators, and other descriptions for commercial property.
    * **Extra_features.txt**: extra features for an account, such as wood deck, $\color{red}{pool}$, storage shed, and detached garage. Also contains information on cracked slabs.
    * **Structural_elem1.txt**: $\color{red}{Single~Family}$, Multi Family, Condos, Town homes. $\color{red}{Home~Information}$ (CDU, Grade Adjustment, Physical Condition).
    * **Structural_elem2.txt**: Commercial and exempt Properties. These files contain structural elements of a property. This includes information like $\color{red}{physical~condition,~grade,~exterior~wall,~and~foundation~type}$.
    * **Land.txt**: land use, acreage, and land units.
    * **Land_ag.txt**: agricultural and timber land information including land use, acreage, and land units.


3. Real_jur_exempt.zip
    * **Jur_exempt.txt**: Lists the jurisdictions and exemptions associated with an account and the tax rates.
    * **Jur_exemption_cd**: Lists the exemption code associated with an account.
    * **Jur_exemption_dscr**: Lists the jurisdictions and their exemption description.
    * **Jur_tax_district_exempt_value.txt**: Lists the jurisdictions and their exemption values.
    * **Jur_tax_district_percent_rate**: Lists the Taxing district percent rates.
    * **Jur_value.txt**: Lists the jurisdictions and values associated with an account.


4. PP_files.zip
    * **T_business_acct.txt**: account, including owner name, owner mailing address, values, site address, and legal descriptions.
    * **T_business_detail.txt**: account, items, description and item values.
    * **T_jur_exempt.txt**: Lists the jurisdictions and exemptions associated with an account and the tax rates.
    * **T_jur_value.txt**: Lists the jurisdictions and values associated with an account.
    * **T_jur_tax_district_exempt_value.txt**: Lists the jurisdictions and their exemption values.
    * **T_jur_tax_district_percent_rate**: Lists the Taxing district percent rates.
    * **T_pp_c.txt**: c Pipelines data.
    * **T_pp_e.txt**: e Minerals data.
    * **T_pp_l.txt**: l Electrical Transmission / Distribution Lines data.


5. Hearing_files.zip
    * **ARB_hearings_pp.txt**: account, state code, owner, date of hearing, release date, conclusion code, initial and final values, etc.
    * **ARB_hearings_real.txt**: account, state code, owner, date of hearing, release date, conclusion code, initial and final values, etc.
    * **ARB_protest_pp.txt**: account, protest by, date of protest.
    * **ARB_protest_real.txt**: account, protest by, date of protest.

In addition, the following files are also available for deciphering codes in some of the variables:

1. code_desc_real.txt
2. code_nh_numbers.txt
3. code_nh_numbers_adj.txt
4. code_desc_personal.txt
5. code_jur_list.txt

The first step is to identify the variables that are likely to influence the appraised value (highlighted in red above). Next, we filter the data to keep only comparable properties, in this case free-standing single-family houses.

# Find the comparables: building_res.txt

The file `building_res.txt` contains part of each property's description, including the HCAD account number (column `acct`) it is associated with. Let's find the account numbers of the free-standing single-family properties.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from pathlib import Path

import pandas as pd

from src.definitions import ROOT_DIR
from src.data.utils import Table

In [3]:
building_res_fn = ROOT_DIR / 'data/external/2016/Real_building_land/building_res.txt'
assert building_res_fn.exists()

The column names of the HCAD files are stored in the file `Layout_and_Length.txt`. The `src.data.utils.Table` class has methods to read a file's column names and to load the data file as a pandas DataFrame.

In [4]:
building_res = Table(building_res_fn, '2016')

In [5]:
building_res.get_header()

['acct',
 'property_use_cd',
 'bld_num',
 'impr_tp',
 'impr_mdl_cd',
 'structure',
 'structure_dscr',
 'dpr_val',
 'cama_replacement_cost',
 'accrued_depr_pct',
 'qa_cd',
 'dscr',
 'date_erected',
 'eff',
 'yr_remodel',
 'yr_roll',
 'appr_by',
 'appr_dt',
 'notes',
 'im_sq_ft',
 'act_ar',
 'heat_ar',
 'gross_ar',
 'eff_ar',
 'base_ar',
 'perimeter',
 'pct',
 'bld_adj',
 'rcnld',
 'size_index',
 'lump_sum_adj']

Since the file may not fit into memory, let's load only the columns of interest.

In [6]:
cols = [
    'acct',      # Property unique account number
    'property_use_cd',
    'bld_num',
    'impr_mdl_cd',
    'impr_tp',
    'structure_dscr',
    'pct',
    'dscr',      # Quality description
    'date_erected',
    'yr_remodel',
    'im_sq_ft',  # Improvement square feet
    'act_ar',    # Actual area
    'heat_ar',   # Heat area
    'gross_ar',  # Gross area
    'eff_ar',    # Effective area
    'base_ar',   # Base area
    'perimeter',
]

In [7]:
building_res_df = building_res.get_df(usecols=cols)
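The `Table` class is project-specific and its implementation is not shown in this notebook. As a rough sketch of what a loader like `get_df` might do, the snippet below uses `pandas.read_csv` with `names` and `usecols` so that only the requested columns are parsed. The tab delimiter, the absence of a header row, and the helper name `read_hcad_table` are assumptions made for illustration; they are not confirmed by HCAD or by `src.data.utils`.

```python
import io

import pandas as pd


def read_hcad_table(source, header, usecols=None):
    """Hypothetical loader: read a headerless, tab-delimited text file.

    `header` is the full list of column names (e.g. from `get_header()`),
    and `usecols` optionally restricts which of them are parsed.
    """
    return pd.read_csv(
        source,
        sep="\t",      # assumption: fields are tab-separated
        header=None,   # assumption: the data file has no header row
        names=header,
        usecols=usecols,
    )


# Tiny in-memory stand-in for a data file (made-up values).
sample = io.StringIO("1\tA1\t2014\n2\tA1\t1920\n")
sample_df = read_hcad_table(
    sample,
    header=["acct", "property_use_cd", "date_erected"],
    usecols=["acct", "date_erected"],
)
```

Passing `usecols` means the skipped columns are never materialized, which is what keeps the memory footprint down when the full file is large.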

In [8]:
building_res_df.head()

| | acct | property_use_cd | bld_num | impr_tp | impr_mdl_cd | structure_dscr | dscr | date_erected | yr_remodel | im_sq_ft | act_ar | heat_ar | gross_ar | eff_ar | base_ar | perimeter | pct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 21660000012 | A1 | 1 | 1001 | 101 | Residential | Good | 2014 | 0 | 2328 | 2328 | 2328 | 2328 | 2212 | 2328 | 300 | 1.0 |
| 1 | 21700000013 | A1 | 1 | 1001 | 101 | Residential | Low | 1920 | 2004 | 1434 | 1622 | 1434 | 1622 | 1453 | 1434 | 170 | 1.0 |
| 2 | 21870000006 | A1 | 1 | 1001 | 101 | Residential | Very Low | 1940 | 0 | 572 | 652 | 572 | 652 | 598 | 572 | 96 | 1.0 |
| 3 | 21960000001 | A1 | 2 | 1001 | 101 | Residential | Low | 1940 | 1999 | 4304 | 4984 | 4304 | 4984 | 4253 | 4304 | 444 | 1.0 |
| 4 | 22080000004 | A1 | 2 | 1001 | 101 | Residential | Low | 1920 | 0 | 528 | 576 | 528 | 576 | 544 | 528 | 96 | 1.0 |


In [9]:
building_res_df.dtypes

acct                 int64
property_use_cd     object
bld_num              int64
impr_tp              int64
impr_mdl_cd          int64
structure_dscr      object
dscr                object
date_erected         int64
yr_remodel           int64
im_sq_ft             int64
act_ar               int64
heat_ar              int64
gross_ar             int64
eff_ar               int64
base_ar              int64
perimeter            int64
pct                float64
dtype: object

In [10]:
building_res_df.memory_usage(deep=True)

Index                   128
acct                9210776
property_use_cd    67929473
bld_num             9210776
impr_tp             9210776
impr_mdl_cd         9210776
structure_dscr     78292330
dscr               72476161
date_erected        9210776
yr_remodel          9210776
im_sq_ft            9210776
act_ar              9210776
heat_ar             9210776
gross_ar            9210776
eff_ar              9210776
base_ar             9210776
perimeter           9210776
pct                 9210776
dtype: int64

Let's reduce the DataFrame's memory footprint by converting the categorical columns to the `category` dtype and downcasting the numerical columns where possible.

In [11]:
building_res_df['property_use_cd'] = building_res_df['property_use_cd'].astype('category')
building_res_df['bld_num'] = building_res_df['bld_num'].astype('category')
building_res_df['impr_mdl_cd'] = building_res_df['impr_mdl_cd'].astype('category')
building_res_df['impr_tp'] = building_res_df['impr_tp'].astype('category')
building_res_df['structure_dscr'] = building_res_df['structure_dscr'].astype('category')
building_res_df['dscr'] = building_res_df['dscr'].astype('category')
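The six conversions above can also be expressed as a single `astype` call with a dict. The following is a minimal sketch on made-up data (the `toy` frame and its values are not HCAD data) that also measures the effect on a repetitive string column.

```python
import pandas as pd

# Made-up stand-in with two highly repetitive string columns.
toy = pd.DataFrame({
    "property_use_cd": ["A1"] * 1000,
    "structure_dscr": ["Residential"] * 1000,
    "im_sq_ft": range(1000),
})

categorical_cols = ["property_use_cd", "structure_dscr"]

# Convert all listed columns in one call; other columns are left untouched.
toy_cat = toy.astype({col: "category" for col in categorical_cols})

bytes_before = toy.memory_usage(deep=True).sum()
bytes_after = toy_cat.memory_usage(deep=True).sum()
```

A categorical column stores each distinct value once plus a small integer code per row, so the saving grows with the number of repeated values.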

In [12]:
building_res_df['acct'] = pd.to_numeric(building_res_df['acct'], downcast='unsigned')

In [13]:
building_res_df.loc[:, 'date_erected':'base_ar'] = building_res_df.loc[:, 'date_erected':'base_ar'].apply(pd.to_numeric, downcast='unsigned')
building_res_df['perimeter'] = pd.to_numeric(building_res_df['perimeter'], downcast='integer')
building_res_df['pct'] = pd.to_numeric(building_res_df['pct'], downcast='float')
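As a small illustration of how `pd.to_numeric(..., downcast=...)` behaves, the sketch below downcasts a made-up frame one column at a time. A column that contains a negative value is left unchanged by `downcast='unsigned'`, which is one possible reason (not verified here) for `yr_remodel` remaining `int64` in the output below.

```python
import pandas as pd

# Made-up values; the negative entry in yr_remodel is purely illustrative.
toy = pd.DataFrame({
    "date_erected": [2014, 1920, 1940],
    "yr_remodel": [0, 2004, -1],
    "perimeter": [300, 170, 96],
    "pct": [1.0, 1.0, 0.5],
})

# Assign column by column so each column takes the dtype of its new values.
for col in ["date_erected", "yr_remodel"]:
    toy[col] = pd.to_numeric(toy[col], downcast="unsigned")
toy["perimeter"] = pd.to_numeric(toy["perimeter"], downcast="integer")
toy["pct"] = pd.to_numeric(toy["pct"], downcast="float")

downcast_dtypes = toy.dtypes.astype(str).to_dict()
```

Assigning one column at a time also sidesteps a subtlety: depending on the pandas version, assigning a block of columns through `.loc[:, 'a':'b'] = ...` may write into the existing columns and keep their original dtypes, in which case the downcast would silently have no effect.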

In [14]:
building_res_df.dtypes

acct                 uint64
property_use_cd    category
bld_num            category
impr_tp            category
impr_mdl_cd        category
structure_dscr     category
dscr               category
date_erected         uint16
yr_remodel            int64
im_sq_ft             uint16
act_ar               uint16
heat_ar              uint16
gross_ar             uint16
eff_ar               uint16
base_ar              uint16
perimeter             int16
pct                 float32
dtype: object

In [15]:
building_res_df.memory_usage(deep=True)

Index                  128
acct               9210776
property_use_cd    1154220
bld_num            1154355
impr_tp            1152139
impr_mdl_cd        1152179
structure_dscr     1152243
dscr               1152109
date_erected       2302694
yr_remodel         9210776
im_sq_ft           2302694
act_ar             2302694
heat_ar            2302694
gross_ar           2302694
eff_ar             2302694
base_ar            2302694
perimeter          2302694
pct                4605388
dtype: int64
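To put the two `memory_usage(deep=True)` listings side by side, the per-column byte counts can simply be summed. The sketch below copies the figures printed above; the variable names are ours.

```python
# Byte counts copied from the two memory_usage(deep=True) listings above.
INDEX = 128
EIGHT_BYTE_COL = 9_210_776   # one int64/float64/uint64 column
TWO_BYTE_COL = 2_302_694     # one uint16/int16 column
FOUR_BYTE_COL = 4_605_388    # one float32 column

# Before: 14 eight-byte numeric columns plus three object (string) columns.
before = INDEX + 14 * EIGHT_BYTE_COL + 67_929_473 + 78_292_330 + 72_476_161

# After: six categorical columns, eight two-byte columns, one float32 column,
# and two columns (acct, yr_remodel) that are still eight bytes wide.
after = (
    INDEX
    + 2 * EIGHT_BYTE_COL
    + 1_154_220 + 1_154_355 + 1_152_139 + 1_152_179 + 1_152_243 + 1_152_109
    + 8 * TWO_BYTE_COL
    + FOUR_BYTE_COL
)

reduction_factor = before / after
```

This works out to roughly 348 MB before and 48 MB after, about a sevenfold reduction, most of it from the three string columns. On a live DataFrame the same totals come from `df.memory_usage(deep=True).sum()`.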

In [16]:
building_res_df.head()

| | acct | property_use_cd | bld_num | impr_tp | impr_mdl_cd | structure_dscr | dscr | date_erected | yr_remodel | im_sq_ft | act_ar | heat_ar | gross_ar | eff_ar | base_ar | perimeter | pct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 21660000012 | A1 | 1 | 1001 | 101 | Residential | Good | 2014 | 0 | 2328 | 2328 | 2328 | 2328 | 2212 | 2328 | 300 | 1.0 |
| 1 | 21700000013 | A1 | 1 | 1001 | 101 | Residential | Low | 1920 | 2004 | 1434 | 1622 | 1434 | 1622 | 1453 | 1434 | 170 | 1.0 |
| 2 | 21870000006 | A1 | 1 | 1001 | 101 | Residential | Very Low | 1940 | 0 | 572 | 652 | 572 | 652 | 598 | 572 | 96 | 1.0 |
| 3 | 21960000001 | A1 | 2 | 1001 | 101 | Residential | Low | 1940 | 1999 | 4304 | 4984 | 4304 | 4984 | 4253 | 4304 | 444 | 1.0 |
| 4 | 22080000004 | A1 | 2 | 1001 | 101 | Residential | Low | 1920 | 0 | 528 | 576 | 528 | 576 | 544 | 528 | 96 | 1.0 |


# Select comparables

From the file `data/external/2016/code_desc_real` we find that the following column values identify free-standing single-family homes:

1. property_use_cd: **A1** = Real, Residential, Single-Family
2. impr_mdl_cd: **101** = Residential 1 Family
3. impr_tp: **1001** = Residential Single Family
4. structure_dscr: **Residential**
5. pct: **1** = 100% built

Let's filter the rows based on these criteria.

In [17]:
cond0 = building_res_df['property_use_cd'] == 'A1'
cond1 = building_res_df['impr_mdl_cd'] == 101
cond2 = building_res_df['impr_tp'] == 1001
cond3 = building_res_df['structure_dscr'] == 'Residential'
cond4 = building_res_df['pct'] == 1  # 100% built home

In [18]:
building_res_comps = building_res_df.loc[cond0 & cond1 & cond2 & cond3 & cond4, :]

In [19]:
building_res_comps.shape

(993906, 17)
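The same kind of filter can be driven by a dictionary of required values, which keeps the criteria in one place. This is a minimal sketch on made-up rows; the `B1` and `102` codes are placeholders for non-matching values, not codes taken from the HCAD code tables.

```python
import pandas as pd

# Made-up rows: only the first one satisfies every criterion.
toy = pd.DataFrame({
    "acct": [1, 2, 3, 4],
    "property_use_cd": ["A1", "A1", "B1", "A1"],
    "impr_mdl_cd": [101, 101, 101, 102],
    "pct": [1.0, 0.5, 1.0, 1.0],
})

criteria = {"property_use_cd": "A1", "impr_mdl_cd": 101, "pct": 1}

# Start from an all-True mask and AND in one equality test per criterion.
mask = pd.Series(True, index=toy.index)
for col, value in criteria.items():
    mask &= toy[col] == value

toy_comps = toy.loc[mask]
```

The result is the same as chaining the conditions with `&` by hand; the dictionary form is mainly convenient when the criteria need to be logged or reused for another year of data.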

The property values are reported in the `real_acct.txt` file. There is only one set of appraised values for each account number. Let's make sure our filtered `building_res_comps` only contains accounts that have just one building.

In [20]:
total_bld_per_acct = building_res_comps.groupby('acct')['bld_num'].count()
one_bld_in_acct = total_bld_per_acct[total_bld_per_acct == 1].index

In [21]:
assert one_bld_in_acct.is_unique, f"Non-unique accounts: {one_bld_in_acct}"

In [22]:
cond0 = building_res_comps['acct'].isin(one_bld_in_acct)
building_res_comps = building_res_comps.loc[cond0, :]

In [23]:
building_res_comps.shape

(957687, 17)
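An equivalent way to keep only accounts that appear once is to compute the group size per row and filter on it directly. The sketch below uses made-up account numbers and is an alternative formulation, not the method used in the cells above.

```python
import pandas as pd

# Made-up accounts: account 11 has two buildings, the others have one.
toy = pd.DataFrame({
    "acct": [10, 11, 11, 12],
    "bld_num": [1, 1, 2, 1],
})

# Number of rows sharing each row's account number, aligned with `toy`.
rows_per_acct = toy.groupby("acct")["acct"].transform("size")
single_bld = toy.loc[rows_per_acct == 1]

# Same selection expressed with duplicated(): drop every repeated account.
single_bld_alt = toy.loc[~toy["acct"].duplicated(keep=False)]
```

Like the cells above, this counts rows in the already filtered table, so any buildings excluded by the earlier filter are not counted. Note also that `count()` ignores missing values whereas `size` does not, which only matters if `bld_num` can be missing.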

# Select columns in comparables
All remaining accounts share the same values in the columns that define free-standing single-family homes, so we export only the columns that vary within this subset.

In [24]:
building_res_comps.columns

Index(['acct', 'property_use_cd', 'bld_num', 'impr_tp', 'impr_mdl_cd',
       'structure_dscr', 'dscr', 'date_erected', 'yr_remodel', 'im_sq_ft',
       'act_ar', 'heat_ar', 'gross_ar', 'eff_ar', 'base_ar', 'perimeter',
       'pct'],
      dtype='object')
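Which columns are constant in a subset can be checked with `nunique` rather than assumed. A minimal sketch on made-up data follows; whether every dropped column is constant in the real subset is not checked here.

```python
import pandas as pd

# Made-up subset in which the filter columns hold a single value each.
toy_comps = pd.DataFrame({
    "acct": [1, 2, 3],
    "property_use_cd": ["A1", "A1", "A1"],
    "structure_dscr": ["Residential"] * 3,
    "dscr": ["Good", "Low", "Low"],
})

# Count distinct values per column and split constant from varying columns.
n_unique = toy_comps.nunique()
constant_cols = n_unique[n_unique <= 1].index.tolist()
varying_cols = n_unique[n_unique > 1].index.tolist()
```

Columns listed in `constant_cols` carry no information for comparing properties within the subset and are safe to drop.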

In [25]:
cols = [
    'acct',     # Property unique account number
    'dscr',     # Quality description
    'date_erected',
    'yr_remodel',
    'im_sq_ft', # Improvement square feet
    'act_ar',   # Actual area
    'heat_ar',  # Heat area
    'gross_ar', # Gross area
    'eff_ar',   # Effective area
    'base_ar',  # Base area
    'perimeter',
]

In [26]:
building_res_comps = building_res_comps.loc[:, cols]

In [27]:
building_res_comps.head()

| | acct | dscr | date_erected | yr_remodel | im_sq_ft | act_ar | heat_ar | gross_ar | eff_ar | base_ar | perimeter |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 21660000012 | Good | 2014 | 0 | 2328 | 2328 | 2328 | 2328 | 2212 | 2328 | 300 |
| 1 | 21700000013 | Low | 1920 | 2004 | 1434 | 1622 | 1434 | 1622 | 1453 | 1434 | 170 |
| 3 | 21960000001 | Low | 1940 | 1999 | 4304 | 4984 | 4304 | 4984 | 4253 | 4304 | 444 |
| 6 | 22080000008 | Low | 1940 | 0 | 2240 | 2732 | 2240 | 2732 | 2270 | 2240 | 272 |
| 8 | 22620000008 | Good | 2013 | 0 | 1778 | 1808 | 1778 | 1808 | 1699 | 1778 | 256 |


In [28]:
building_res_comps.shape

(957687, 11)

# Export building_res_comps
That was a lot of work! Let's save it as a pickle file and continue the data conditioning in the next notebook.

In [29]:
from src.data.utils import save_pickle

In [30]:
save_fn = ROOT_DIR / 'data/raw/2016/building_res_comps.pickle'

In [31]:
save_pickle(building_res_comps, save_fn)
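The `save_pickle` helper lives in `src.data.utils` and its code is not shown here. A helper of this kind could look roughly like the sketch below, where `save_pickle_sketch` is a hypothetical stand-in and the round trip uses a temporary directory with made-up data.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd


def save_pickle_sketch(obj, path):
    """Hypothetical stand-in for `save_pickle`: pickle `obj` to `path`."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)  # create missing folders
    with path.open("wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)


# Made-up frame with a categorical column to check that dtypes survive.
toy = pd.DataFrame({"acct": [1, 2], "dscr": pd.Categorical(["Good", "Low"])})

with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "2016" / "toy.pickle"
    save_pickle_sketch(toy, target)
    restored = pd.read_pickle(target)
```

Unlike a CSV export, a pickle preserves the `category` and downcast numeric dtypes set earlier, so the next notebook does not need to repeat the conversions.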

# Export unique account numbers of interest: one_bld_in_acct

In [32]:
save_fn = ROOT_DIR / 'data/raw/2016/one_bld_in_acct.pickle'

In [33]:
save_pickle(one_bld_in_acct, save_fn)