# Harris County Appraisal District (HCAD) real and personal property data
Harris County is the [third most populous](https://en.wikipedia.org/wiki/List_of_the_most_populous_counties_in_the_United_States) county in the USA. Its appraisal district (HCAD) provides a fantastic dataset with the characteristics of each appraised property (appraised value, fixtures, features...) on a yearly basis. In this notebook we explore the data for the year 2016 to understand what is available and to select variables that can help us answer whether a given property was appraised fairly.

[HCAD data](https://pdata.hcad.org/download/2016.html) consists of several text files, grouped in zipped files as follows:

1. Real_acct_owner.zip
    * **Real_acct.txt**: account including owner name, owner mailing address, $\color{red}{values}$, $\color{red}{site~address}$, and legal descriptions.
    * **Real_neighborhood_code.txt**:  $\color{red}{neighborhood~code}$, group code and description
    * **Parcel_tiebacks.txt**
    * **Permits.txt**: an account including permit type, permit description, and status.
    * **Owners.txt**: multiple owners.
    * **Deeds.txt**: deed information.

2. Real_building_land.zip
    * **Building_res.txt**: $\color{red}{all~residential~information}$
    * **Building_other.txt**: all other real properties, such as commercial and information for income producing properties including occupancy rates and operating income.
    * **Exterior.txt**: $\color{red}{general~data~about~buildings~and~sub~areas,~(style~or~use,~size,~year~built).}$
    * **Fixture.txt**: $\color{red}{characteristics~of~the~building}$. This includes bedrooms, fireplace, bathrooms, stories for residential. Also contains wall height, elevators, and other descriptions for commercial property.
    * **Extra_features.txt**: extra features for an account. This includes wood deck, $\color{red}{pool}$, storage shed, detached garage, etc. This also contains information on cracked slabs.
    * **Structural_elem1.txt**: $\color{red}{Single~Family}$, Multi Family, Condos, Town homes. $\color{red}{Home~Information}$ (CDU, Grade Adjustment, Physical Condition).
    * **Structural_elem2.txt**: Commercial and exempt Properties. These files contain structural elements of a property. This includes information like $\color{red}{physical~condition,~grade,~exterior~wall,~and~foundation~type}$.
    * **Land.txt**: land use, acreage, and land units.
    * **Land_ag.txt**: agricultural and timber land information including land use, acreage, and land units.

3. Real_jur_exempt.zip
    * **Jur_exempt.txt**: Lists the jurisdictions and exemptions associated with an account and the tax rates.
    * **Jur_exemption_cd**: Lists the exemption code associated with an account.
    * **Jur_exemption_dscr**: Lists the jurisdictions and their exemption description.
    * **Jur_tax_district_exempt_value.txt**: Lists the jurisdictions and their exemption values.
    * **Jur_tax_district_percent_rate**: Lists the Taxing district percent rates.
    * **Jur_value.txt**: Lists the jurisdictions and values associated with an account.

4. PP_files.zip
    * **T_business_acct.txt**: account, including owner name, owner mailing address, values, site address, legal descriptions, etc.
    * **T_business_detail.txt**: account, items, description and item values.
    * **T_jur_exempt.txt**: Lists the jurisdictions and exemptions associated with an account and the tax rates.
    * **T_jur_value.txt**: Lists the jurisdictions and values associated with an account.
    * **T_jur_tax_district_exempt_value.txt**: Lists the jurisdictions and their exemption values.
    * **T_jur_tax_district_percent_rate**: Lists the Taxing district percent rates.
    * **T_pp_c.txt**: c Pipelines data.
    * **T_pp_e.txt**: e Minerals data.
    * **T_pp_l.txt**: l Electrical Transmission / Distribution Lines data.

5. Hearing_files.zip
    * **ARB_hearings_pp.txt**: account, state code, owner, date of hearing, release date, conclusion code, initial and final values, etc.
    * **ARB_hearings_real.txt**: account, state code, owner, date of hearing, release date, conclusion code, initial and final values, etc.
    * **ARB_protest_pp.txt**: account, protest by, date of protest.
    * **ARB_protest_real.txt**: account, protest by, date of protest.
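
The archives can be inspected before unpacking them. The following is a minimal sketch using the standard library `zipfile` module; it builds a tiny in-memory archive as a stand-in for `Real_acct_owner.zip`, and the helper name `list_text_files` is hypothetical (it is not part of this project).

```python
import io
import zipfile

# Build a tiny in-memory archive that stands in for Real_acct_owner.zip.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode="w") as archive:
    archive.writestr("Real_acct.txt", "placeholder\n")
    archive.writestr("Owners.txt", "placeholder\n")


def list_text_files(zip_source):
    """Return the sorted names of the .txt members of a zip archive."""
    with zipfile.ZipFile(zip_source) as archive:
        return sorted(
            name for name in archive.namelist() if name.lower().endswith(".txt")
        )


members = list_text_files(buffer)
print(members)
```

With a real download, `zip_source` would be the path to the zip file, and `ZipFile.extractall` could then unpack it into the data directory.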

In addition, the following files are also available for deciphering codes in some of the variables:

1. code_desc_real.txt
2. code_nh_numbers.txt
3. code_nh_numbers_adj.txt
4. code_desc_personal.txt
5. code_jur_list.txt

The first step is then to identify the variables that are likely to influence the appraised value (highlighted in red above). Next, we would like to filter the data to only contain comparables, in this case free-standing single-family houses.

# Find the comparables

The file `building_res.txt` contains some of the property descriptions, including the HCAD account number (column `acct`) each one is associated with. Let's find the account numbers of the free-standing single-family properties.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from pathlib import Path

import pandas as pd

from src.definitions import ROOT_DIR
from src.data.utils import Table

In [3]:
building_res_fn = ROOT_DIR / 'data/external/2016/Real_building_land/building_res.txt'
assert building_res_fn.exists()

The column names of the HCAD files are stored in the file `Layout_and_Length.txt`. The `src.data.utils.Table` class has methods to read the file column names, and read the data file as a pandas DataFrame.

In [4]:
building_res = Table(building_res_fn, '2016')

In [5]:
building_res.get_header()

['acct',
 'property_use_cd',
 'bld_num',
 'impr_tp',
 'impr_mdl_cd',
 'structure',
 'structure_dscr',
 'dpr_val',
 'cama_replacement_cost',
 'accrued_depr_pct',
 'qa_cd',
 'dscr',
 'date_erected',
 'eff',
 'yr_remodel',
 'yr_roll',
 'appr_by',
 'appr_dt',
 'notes',
 'im_sq_ft',
 'act_ar',
 'heat_ar',
 'gross_ar',
 'eff_ar',
 'base_ar',
 'perimeter',
 'pct',
 'bld_adj',
 'rcnld',
 'size_index',
 'lump_sum_adj']
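
The implementation of `Table` is not shown in this notebook. As a rough sketch of the idea, reading a data file whose column names live in a separate layout file could look like the following with plain pandas. The tab delimiter, the absence of a header row, and the two sample rows are assumptions made for illustration only.

```python
import io

import pandas as pd

# Column names as they would come from the layout file (first three only).
column_names = ["acct", "property_use_cd", "bld_num"]

# Stand-in for a headerless, tab-delimited data file (assumed layout).
raw_text = "21660000012\tA1\t1\n21700000013\tA1\t1\n"

sample_df = pd.read_csv(
    io.StringIO(raw_text),
    sep="\t",
    header=None,         # the file itself carries no header row
    names=column_names,  # names supplied from the layout file
)
print(sample_df)
```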

In [6]:
building_res_df = building_res.get_df()

In [7]:
building_res_df.head()

Unnamed: 0,acct,property_use_cd,bld_num,impr_tp,impr_mdl_cd,structure,structure_dscr,dpr_val,cama_replacement_cost,accrued_depr_pct,...,heat_ar,gross_ar,eff_ar,base_ar,perimeter,pct,bld_adj,rcnld,size_index,lump_sum_adj
0,21660000012,A1,1,1001,101,R,Residential,256712,259305,0.99,...,2328,2328,2212,2328,300,1.0,1.5,171141.33,0.83,22599
1,21700000013,A1,1,1001,101,R,Residential,55492,135347,0.41,...,1434,1622,1453,1434,170,1.0,1.5,36994.67,0.92,6669
2,21870000006,A1,1,1001,101,R,Residential,10461,47552,0.22,...,572,652,598,572,96,1.0,1.84,5685.33,1.11,2887
3,21960000001,A1,2,1001,101,R,Residential,136211,272422,0.5,...,4304,4984,4253,4304,444,1.0,1.84,74027.72,0.73,7715
4,22080000004,A1,2,1001,101,R,Residential,15308,56696,0.27,...,528,576,544,528,96,1.0,1.84,8319.57,1.11,3754


We would like to decode the values in the columns `property_use_cd`, `impr_tp`, and `impr_mdl_cd`. For this we need the tables stored in the file `code_desc_real`: we read them with the `get_code` method and pass its return value to the `map_codes_to_column` method of the Table object.

In [8]:
codes_fn = ROOT_DIR / 'data/external/2016/code_desc_real'
assert codes_fn.exists()

Most column codes are reported as pairs (value, description), but the `property_use_cd` codes are reported as triplets (code, 2nd Cd, description). For this reason we must pass the `split_parts=2` argument to the `get_code` method.

In [9]:
building_state_class_code = building_res.get_code(codes_fn, 'State Class', split_parts=2)

In [10]:
building_type_code = building_res.get_code(codes_fn, 'Building Type Code')

In [11]:
building_style_code = building_res.get_code(codes_fn, 'Building Style')

In [12]:
building_res_df = building_res.map_codes_to_column('property_use_cd', building_state_class_code)
building_res_df = building_res.map_codes_to_column('impr_tp', building_type_code)
building_res_df = building_res.map_codes_to_column('impr_mdl_cd', building_style_code)
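
Conceptually, decoding a column is a dictionary lookup. The sketch below shows the idea with `Series.map` rather than the project's `map_codes_to_column`, whose internals are not shown here. The pair `1001` and "Residential Single Family" is taken from the outputs in this notebook; the account numbers and the code `9999` are hypothetical.

```python
import pandas as pd

# Lookup table built from a code/description pair seen in this notebook.
building_type_lookup = {1001: "Residential Single Family"}

# Hypothetical accounts; 9999 is a made-up code with no description.
toy_df = pd.DataFrame({"acct": [1, 2], "impr_tp": [1001, 9999]})

# Series.map replaces each code with its description; unknown codes become NaN.
toy_df["impr_tp_dscr"] = toy_df["impr_tp"].map(building_type_lookup)

unmapped = toy_df["impr_tp_dscr"].isna().sum()
print(toy_df)
print(f"codes without a description: {unmapped}")
```

Counting the unmapped values after decoding is a cheap way to notice codes that are missing from the lookup table.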

In [13]:
building_res_df.head()

Unnamed: 0,acct,property_use_cd,bld_num,impr_tp,impr_mdl_cd,structure,structure_dscr,dpr_val,cama_replacement_cost,accrued_depr_pct,...,heat_ar,gross_ar,eff_ar,base_ar,perimeter,pct,bld_adj,rcnld,size_index,lump_sum_adj
0,21660000012,"Real, Residential, Single-Family",1,Residential Single Family,Residential 1 Family,R,Residential,256712,259305,0.99,...,2328,2328,2212,2328,300,1.0,1.5,171141.33,0.83,22599
1,21700000013,"Real, Residential, Single-Family",1,Residential Single Family,Residential 1 Family,R,Residential,55492,135347,0.41,...,1434,1622,1453,1434,170,1.0,1.5,36994.67,0.92,6669
2,21870000006,"Real, Residential, Single-Family",1,Residential Single Family,Residential 1 Family,R,Residential,10461,47552,0.22,...,572,652,598,572,96,1.0,1.84,5685.33,1.11,2887
3,21960000001,"Real, Residential, Single-Family",2,Residential Single Family,Residential 1 Family,R,Residential,136211,272422,0.5,...,4304,4984,4253,4304,444,1.0,1.84,74027.72,0.73,7715
4,22080000004,"Real, Residential, Single-Family",2,Residential Single Family,Residential 1 Family,R,Residential,15308,56696,0.27,...,528,576,544,528,96,1.0,1.84,8319.57,1.11,3754


Let's look at the value distribution for the property type columns.

In [14]:
building_res_df['property_use_cd'].value_counts().head(10)

Real, Residential, Single-Family    1028348
Personal Prop. Mobile Home            26938
Condo - Apartment Style               26927
Condo - Townhouse (2+ stories)        14454
Real, Residential, Mobile Homes       12130
Condo - Apartment Conversion          11967
Condo - High Rise                      9881
Real, Residential, Two-Family          9062
Inventory Improved                     4140
Real, Residential, 1/2 Duplex          2632
Name: property_use_cd, dtype: int64

In [15]:
building_res_df['impr_tp'].value_counts().head(10)

Residential Single Family         1011477
Residential Condo                   63592
Residential Mobile Homes            41266
Residential Townhome                22083
Residential Duplex                   9809
Mixed Residential / Commercial       1812
Residential Triplex                   641
Farm Single Family Dwelling           488
Residential Fourplex                  138
Recreational/Health                    10
Name: impr_tp, dtype: int64

In [16]:
building_res_df['impr_mdl_cd'].value_counts().head(10)

Residential 1 Family                   1011641
Condominium (Common Element)             63592
Single Wide Residential Mobile Home      33280
Townhome (with Common Element)           21958
Residential 2 Family                      9806
Double Wide Residential Mobile Home       7987
Mixed Res/Com, Res Structure              1808
Residential 3 Family                       620
Farm with Dwelling                         484
Residential 4 Family or More               129
Name: impr_mdl_cd, dtype: int64

In [17]:
building_res_df['structure_dscr'].value_counts().head(10)

Residential            1109970
Mobile Home              41267
Reinforced Concrete         36
Masonry Bearing             33
Wood or Light Steel         25
Open Steel Skeleton         13
Fireproofed Steel            2
Undefined                    1
Name: structure_dscr, dtype: int64

Since we are interested in free-standing single-family homes, let's filter `building_res_df` to keep only these properties.

In [21]:
cond0 = building_res_df['property_use_cd'] == 'Real, Residential, Single-Family'
cond1 = building_res_df['impr_mdl_cd'] == 'Residential 1 Family'
cond2 = building_res_df['impr_tp'] == 'Residential Single Family'
cond3 = building_res_df['structure_dscr'] == 'Residential'

In [22]:
building_res_comps = building_res_df.loc[cond0 & cond1 & cond2 & cond3, :]
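
The value counts above differ between columns (1,028,348 single-family use codes versus 1,011,477 single-family building types), so the conditions are not redundant and all of them must hold at once. A small sketch of the same filter on a toy table, with the conditions kept in a dictionary; the rows are hypothetical and only two of the four columns are used.

```python
import pandas as pd

toy_df = pd.DataFrame(
    {
        "acct": [1, 2, 3],  # hypothetical account numbers
        "property_use_cd": [
            "Real, Residential, Single-Family",
            "Real, Residential, Single-Family",
            "Condo - High Rise",
        ],
        "impr_tp": [
            "Residential Single Family",
            "Residential Townhome",
            "Residential Condo",
        ],
    }
)

required_values = {
    "property_use_cd": "Real, Residential, Single-Family",
    "impr_tp": "Residential Single Family",
}

# Start from an all-True mask and AND in one equality test per column.
mask = pd.Series(True, index=toy_df.index)
for column, value in required_values.items():
    mask &= toy_df[column] == value

toy_comps = toy_df.loc[mask]
print(toy_comps["acct"].tolist())
```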

The property values are reported in the `real_acct.txt` file. There is only one set of appraised values for each account number. Let's make sure our filtered table `building_res_comps` only contains accounts that have just one building.

In [23]:
total_bld_per_acct = building_res_comps.groupby('acct')['bld_num'].count()
one_bld_in_acct = total_bld_per_acct[total_bld_per_acct == 1].index

In [24]:
assert one_bld_in_acct.is_unique

In [25]:
cond0 = building_res_comps['acct'].isin(one_bld_in_acct)
building_res_comps = building_res_comps.loc[cond0, :]

In [30]:
building_res_comps['bld_num'].value_counts()

1      961725
2        1696
3         166
4          33
5          11
6           8
10          5
8           4
7           3
101         2
9           2
224         1
120         1
100         1
12          1
Name: bld_num, dtype: int64

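**Note**: It is not obvious why some building numbers above differ from one, given that we already filtered out accounts with more than one building. One possible explanation is that the buildings were counted on the already filtered table `building_res_comps`, so an account whose other buildings did not pass the filter still counts as having one building, and `bld_num` keeps the original label of the remaining one. Either way, we have one account mapped to one building, which is what we need to unambiguously join the `building_res.txt` file with the `real_acct.txt` file.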
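
The effect of counting buildings after versus before the type filter can be checked on a toy table. The sketch below uses hypothetical accounts and is only an illustration of the mechanism, not a claim about what the HCAD data contain.

```python
import pandas as pd

# Hypothetical account 10 has two buildings, only one of them single-family.
toy_df = pd.DataFrame(
    {
        "acct": [10, 10, 20],
        "bld_num": [1, 2, 1],
        "impr_tp": [
            "Residential Mobile Homes",
            "Residential Single Family",
            "Residential Single Family",
        ],
    }
)

toy_comps = toy_df[toy_df["impr_tp"] == "Residential Single Family"]

# Counting after the filter: account 10 looks like a one-building account.
count_filtered = toy_comps.groupby("acct")["bld_num"].count()

# Counting before the filter: account 10 is seen to hold two buildings.
count_all = toy_df.groupby("acct")["bld_num"].count()
single_bld_accts = count_all[count_all == 1].index

strict_comps = toy_comps[toy_comps["acct"].isin(single_bld_accts)]
print(count_filtered.to_dict())
print(strict_comps["bld_num"].tolist())
```

Counting on the unfiltered table is the stricter choice: it drops accounts that carry any additional building, at the cost of a smaller set of comparables.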

## Select columns in comparables
Not all columns in `building_res.txt` are clearly related to the appraised value. Let's get the columns that appear to be relevant.

In [41]:
building_res_comps.columns

Index(['acct', 'property_use_cd', 'bld_num', 'impr_tp', 'impr_mdl_cd',
       'structure', 'structure_dscr', 'dpr_val', 'cama_replacement_cost',
       'accrued_depr_pct', 'qa_cd', 'dscr', 'date_erected', 'eff',
       'yr_remodel', 'yr_roll', 'appr_by', 'appr_dt', 'notes', 'im_sq_ft',
       'act_ar', 'heat_ar', 'gross_ar', 'eff_ar', 'base_ar', 'perimeter',
       'pct', 'bld_adj', 'rcnld', 'size_index', 'lump_sum_adj'],
      dtype='object')

In [49]:
cols = [
    'acct',
    'dscr',     # Quality description
    'date_erected',
    'yr_remodel',
    'im_sq_ft', # Improvement square feet
    'act_ar',   # Actual area
    'heat_ar',  # Heat area
    'gross_ar', # Gross area
    'eff_ar',   # Effective area
    'base_ar',  # Base area
    'perimeter',
    'pct',      # Percent completed
]

In [50]:
building_res_comps = building_res_comps.loc[:, cols]

In [51]:
building_res_comps

Unnamed: 0,acct,dscr,date_erected,yr_remodel,im_sq_ft,act_ar,heat_ar,gross_ar,eff_ar,base_ar,perimeter,pct
0,21660000012,Good,2014,0,2328,2328,2328,2328,2212,2328,300,1.00
1,21700000013,Low,1920,2004,1434,1622,1434,1622,1453,1434,170,1.00
3,21960000001,Low,1940,1999,4304,4984,4304,4984,4253,4304,444,1.00
6,22080000008,Low,1940,0,2240,2732,2240,2732,2270,2240,272,1.00
8,22620000008,Good,2013,0,1778,1808,1778,1808,1699,1778,256,1.00
...,...,...,...,...,...,...,...,...,...,...,...,...
1124519,1371570020005,Good,2015,0,3682,4512,3682,4512,3854,3682,472,0.51
1124520,1372310010003,Low,1910,0,692,746,692,746,710,692,114,1.00
1124522,1373350010027,Average,2015,0,3181,3823,3181,3823,3237,3181,376,0.66
1124523,1953080320060,Average,1983,0,1723,2259,1723,2259,1923,1723,216,1.00


In [47]:
building_res_comps['pct'].value_counts()

1.00    957493
0.66      1105
0.51      1046
0.80       965
0.41       365
         ...  
0.12         1
0.11         1
0.18         1
0.37         1
0.14         1
Name: pct, Length: 97, dtype: int64

# Join comparables with fixtures

In [None]:
fixtures_fn = ROOT_DIR / 'data/external/2016/Real_building_land/fixtures.txt'
assert fixtures_fn.exists()

fixtures = Table(fixtures_fn, '2016')

In [None]:
real_acct_fn = ROOT_DIR / 'data/external/2016/Real_acct_owner/real_acct.txt'
assert real_acct_fn.exists()

real_acct = Table(real_acct_fn, '2016')

In [None]:
cols = [
    'site_addr_1',
    'site_addr_2',
    'site_addr_3',
    'land_val',
    'bld_val',
    'prior_bld_val',
    'x_features_val',
    'ag_val',
    'assessed_val',
    'tot_appr_val',
    'tot_mkt_val',
    'econ_bld_class',
    'nxt_bld',
    'lgl_1',
    'new_construction_val',
    'value_status',
]
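
The column list above does not include `acct`, which is needed as the join key. A minimal sketch of how the comparables could be joined with the selected value columns, assuming both tables expose an `acct` column with one row per account; the toy frames and their values are hypothetical.

```python
import pandas as pd

# Toy stand-ins; account numbers and values are hypothetical.
toy_comps = pd.DataFrame({"acct": [1, 2], "im_sq_ft": [2328, 1434]})
toy_real_acct = pd.DataFrame(
    {"acct": [1, 2, 3], "tot_appr_val": [300000, 120000, 90000]}
)

value_cols = ["tot_appr_val"]

# Inner join on the account number; validate raises MergeError if either
# side has duplicate accounts, which guards the one-building assumption.
toy_joined = toy_comps.merge(
    toy_real_acct[["acct"] + value_cols],
    on="acct",
    how="inner",
    validate="one_to_one",
)
print(toy_joined)
```

The `validate` argument turns the "one account, one building" requirement into an explicit check instead of a silent assumption.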

In [None]:
data_fn = ROOT_DIR / 'data/external/2016'  # base directory of the 2016 HCAD files
building_other_fn = data_fn / 'Real_building_land/building_other.txt'

In [None]:
fixtures_fn = data_fn / 'Real_building_land/fixtures.txt'

In [None]:
extra_features_fn = data_fn / 'Real_building_land/extra_features.txt'