# Harris County Appraisal District (HCAD) real and personal property data
Harris County is the [third most populous](https://en.wikipedia.org/wiki/List_of_the_most_populous_counties_in_the_United_States) county in the USA. Its appraisal district (HCAD) publishes a rich yearly dataset with the characteristics of each appraised property (appraised value, fixtures, features...). In this notebook we explore the data for the year 2016 to understand what is available, and to select variables that can help us answer whether a given property was appraised fairly.

[HCAD data](https://pdata.hcad.org/download/2016.html) consists of several text files, grouped in zipped files as follows:

1. Real_acct_owner.zip
    * **Real_acct.txt**: account including owner name, owner mailing address, $\color{red}{values}$, $\color{red}{site~address}$, and legal descriptions.
    * **Real_neighborhood_code.txt**:  $\color{red}{neighborhood~code}$, group code and description
    * **Parcel_tiebacks.txt**
    * **Permits.txt**: an account including permit type, permit description, and status.
    * **Owners.txt**: multiple owners.
    * **Deeds.txt**: deed information.


2. Real_building_land.zip
    * **Building_res.txt**: $\color{red}{all~residential~information}$
    * **Building_other.txt**: all other real properties, such as commercial ones, plus information for income-producing properties, including occupancy rates and operating income.
    * **Exterior.txt**: $\color{red}{general~data~about~buildings~and~sub~areas,~(style~or~use,~size,~year~built).}$
    * **Fixture.txt**: $\color{red}{characteristics~of~the~building}$. This includes bedrooms, fireplace, bathrooms, stories for residential. Also contains wall height, elevators, and other descriptions for commercial property.
    * **Extra_features.txt**: extra features for an account. This includes wood deck, $\color{red}{pool}$, storage shed, detached garage, etc. This also contains information on cracked slabs.
    * **Structural_elem1.txt**: $\color{red}{Single~Family}$, Multi Family, Condos, Town homes. $\color{red}{Home~Information}$ (CDU, Grade Adjustment, Physical Condition).
    * **Structural_elem2.txt**: Commercial and exempt Properties. These files contain structural elements of a property. This includes information like $\color{red}{physical~condition,~grade,~exterior~wall,~and~foundation~type}$.
    * **Land.txt**: land use, acreage, and land units.
    * **Land_ag.txt**: agricultural and timber land information including land use, acreage, and land units.


3. Real_jur_exempt.zip
    * **Jur_exempt.txt**: Lists the jurisdictions and exemptions associated with an account and the tax rates.
    * **Jur_exemption_cd**: Lists the exemption code associated with an account.
    * **Jur_exemption_dscr**: Lists the jurisdictions and their exemption description.
    * **Jur_tax_district_exempt_value.txt**: Lists the jurisdictions and their exemption values.
    * **Jur_tax_district_percent_rate**: Lists the Taxing district percent rates.
    * **Jur_value.txt**: Lists the jurisdictions and values associated with an account.


4. PP_files.zip
    * **T_business_acct.txt**: account, including owner name, owner mailing address, values, site address, and legal descriptions.
    * **T_business_detail.txt**: account, items, description and item values.
    * **T_jur_exempt.txt**: Lists the jurisdictions and exemptions associated with an account and the tax rates.
    * **T_jur_value.txt**: Lists the jurisdictions and values associated with an account.
    * **T_jur_tax_district_exempt_value.txt**: Lists the jurisdictions and their exemption values.
    * **T_jur_tax_district_percent_rate**: Lists the Taxing district percent rates.
    * **T_pp_c.txt**: c Pipelines data.
    * **T_pp_e.txt**: e Minerals data.
    * **T_pp_l.txt**: l Electrical Transmission / Distribution Lines data.


5. Hearing_files.zip
    * **ARB_hearings_pp.txt**: account, state code, owner, date of hearing, release date, conclusion code, initial and final values, etc.
    * **ARB_hearings_real.txt**: account, state code, owner, date of hearing, release date, conclusion code, initial and final values, etc.
    * **ARB_protest_pp.txt**: account, protest by, date of protest.
    * **ARB_protest_real.txt**: account, protest by, date of protest.

In addition, the following files are also available for deciphering codes in some of the variables:

1. code_desc_real.txt
2. code_nh_numbers.txt
3. code_nh_numbers_adj.txt
4. code_desc_personal.txt
5. code_jur_list.txt

The first step is to identify the variables that influence the appraised value (highlighted in red above). Next, we filter the data to keep only comparables, in this case free-standing single-family properties (houses).

# Find the comparables: building_res.txt

The file `building_res.txt` contains some of the property descriptions, including the HCAD account number (column `acct`) each one is associated with. Let's find the account numbers of the free-standing single-family properties.

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
from pathlib import Path

import pandas as pd

from src.definitions import ROOT_DIR
from src.data.utils import Table

In [None]:
building_res_fn = ROOT_DIR / 'data/external/2016/Real_building_land/building_res.txt'
assert building_res_fn.exists()

The column names of the HCAD files are stored in the file `Layout_and_Length.txt`. The `src.data.utils.Table` class has methods to read the file column names, and read the data file as a pandas DataFrame.

In [None]:
building_res = Table(building_res_fn, '2016')

In [None]:
building_res.get_header()

In [None]:
building_res_df = building_res.get_df()

In [None]:
building_res_df.head()
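As a rough illustration of what reading a header-less data file with separately stored column names involves, the sketch below uses plain pandas instead of the project's `Table` class. It is not the `Table` implementation: the tab delimiter, the sample rows, and the column list are all assumptions made for the example.

```python
import io

import pandas as pd

# Hypothetical column names; the real ones are listed in Layout_and_Length.txt.
columns = ['acct', 'bld_num', 'impr_tp']

# Two made-up, tab-delimited rows standing in for a header-less data file.
raw = io.StringIO("0010010000013\t1\t1001\n0010020000001\t2\t1001\n")

# Read every column as a string so leading zeros in account numbers survive.
sample_df = pd.read_csv(raw, sep='\t', header=None, names=columns, dtype=str)
print(sample_df)
```

Reading the columns as strings is a deliberate choice here: identifiers such as account numbers would lose their leading zeros if pandas inferred an integer type.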

We would like to decode the values in the columns `property_use_cd`, `impr_tp`, and `impr_mdl_cd`. For this we need the tables stored in the file `code_desc_real`, so we use the `get_code` method and pass its return value to the `map_codes_to_column` method of the `Table` object.

In [None]:
codes_fn = ROOT_DIR / 'data/external/2016/code_desc_real'
assert codes_fn.exists()

Most column codes are reported in pairs (value, description), but the `property_use_cd` codes are reported as triplets (code, 2nd Cd, description). For this reason we must pass the `split_parts=2` argument to the `get_code` method.

In [None]:
building_state_class_code = building_res.get_code(codes_fn, 'State Class', split_parts=2)

In [None]:
building_type_code = building_res.get_code(codes_fn, 'Building Type Code')

In [None]:
building_style_code = building_res.get_code(codes_fn, 'Building Style')

In [None]:
building_res_df = building_res.map_codes_to_column('property_use_cd', building_state_class_code)
building_res_df = building_res.map_codes_to_column('impr_tp', building_type_code)
building_res_df = building_res.map_codes_to_column('impr_mdl_cd', building_style_code)
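The decoding step above can be pictured with plain pandas. This is only a sketch of the idea, not how `map_codes_to_column` is implemented; the codes, descriptions, and the `toy_codes` frame are made up for the example.

```python
import pandas as pd

# Made-up codes and descriptions, for illustration only.
code_to_description = {'A1': 'Single-family', 'B2': 'Multi-family'}

toy_codes = pd.DataFrame({'property_use_cd': ['A1', 'B2', 'A1', 'ZZ']})

# Series.map replaces each code with its description;
# codes missing from the dictionary become NaN.
toy_codes['property_use_cd'] = toy_codes['property_use_cd'].map(code_to_description)
print(toy_codes['property_use_cd'].value_counts(dropna=False))
```

Counting with `dropna=False` makes unmapped codes visible, which is a quick way to spot gaps in a code table.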

In [None]:
building_res_df.head()

Let's look at the value distribution for the property type columns.

In [None]:
building_res_df['property_use_cd'].value_counts().head(10)

In [None]:
building_res_df['impr_tp'].value_counts().head(10)

In [None]:
building_res_df['impr_mdl_cd'].value_counts().head(10)

In [None]:
building_res_df['structure_dscr'].value_counts().head(10)

Since we are interested in free-standing single-family homes, let's filter `building_res_df` to keep only these properties.

In [None]:
cond0 = building_res_df['property_use_cd'] == 'Real, Residential, Single-Family'
cond1 = building_res_df['impr_mdl_cd'] == 'Residential 1 Family'
cond2 = building_res_df['impr_tp'] == 'Residential Single Family'
cond3 = building_res_df['structure_dscr'] == 'Residential'
cond4 = building_res_df['pct'] == 1 # 100% built home

In [None]:
building_res_comps = building_res_df.loc[cond0 & cond1 & cond2 & cond3 & cond4, :]
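For readers less familiar with boolean masks, here is a minimal, self-contained sketch of the same filtering pattern on a made-up frame. The `toy` data and the two conditions are illustrative assumptions, not HCAD records.

```python
import pandas as pd

# Made-up records, for illustration only.
toy = pd.DataFrame({
    'acct': ['a', 'b', 'c', 'd'],
    'structure_dscr': ['Residential', 'Residential', 'Other', 'Residential'],
    'pct': [1.0, 0.5, 1.0, 1.0],
})

is_residential = toy['structure_dscr'] == 'Residential'
is_complete = toy['pct'] == 1

# & is the element-wise AND: a row is kept only if every condition holds.
toy_comps = toy.loc[is_residential & is_complete, :]
print(toy_comps['acct'].tolist())
```

Naming each mask, as the notebook does with `cond0` to `cond4`, keeps the final `.loc` call short and makes it easy to inspect how many rows each condition removes.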

The property values are reported in the `real_acct.txt` file. There is only one set of appraised values for each account number. Let's make sure our filtered `building_res_comps` only contains accounts that have just one building.

In [None]:
total_bld_per_acct = building_res_comps.groupby('acct')['bld_num'].count()
one_bld_in_acct = total_bld_per_acct[total_bld_per_acct == 1].index

In [None]:
assert one_bld_in_acct.is_unique

In [None]:
cond0 = building_res_comps['acct'].isin(one_bld_in_acct)
building_res_comps = building_res_comps.loc[cond0, :]

In [None]:
building_res_comps['bld_num'].value_counts()

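**Note**: I don't know for certain why some of the building numbers above differ from one, given that we already filtered out accounts with more than one building. A possible reason, which I have not verified, is that the buildings were counted after the filters were applied, so an account whose only matching building carries a `bld_num` other than one would still be kept. Nonetheless, we have one account mapped to one building, which is what we need to unambiguously join the `building_res.txt` file with the `real_acct.txt` file.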
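The toy example below shows one way building numbers other than one can survive the one-building filter. It assumes `bld_num` is a per-account label assigned before any filtering. The records are made up, and whether this is what happens in the HCAD data has not been checked.

```python
import pandas as pd

# Made-up records: account 'x' has two buildings, account 'y' has one.
toy = pd.DataFrame({
    'acct': ['x', 'x', 'y'],
    'bld_num': [1, 2, 1],
    'structure_dscr': ['Other', 'Residential', 'Residential'],
})

# Filter first (as in the notebook), then count buildings per account.
filtered = toy.loc[toy['structure_dscr'] == 'Residential', :]
counts = filtered.groupby('acct')['bld_num'].count()
single = counts[counts == 1].index

# Account 'x' is kept with its building labeled 2.
kept = filtered.loc[filtered['acct'].isin(single), :]
print(kept['bld_num'].value_counts())
```

Under this assumption each kept account still maps to exactly one row, so the join on `acct` remains unambiguous even though `bld_num` is not always one.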

# Select columns in comparables
Not all columns in `building_res.txt` are clearly related to the appraised value. Let's get the columns that appear to be relevant.

In [None]:
building_res_comps.columns

In [None]:
cols = [
    'acct',     # Property unique account number
    'dscr',     # Quality description
    'date_erected',
    'yr_remodel',
    'im_sq_ft', # Improvement square feet
    'act_ar',   # Actual area
    'heat_ar',  # Heat area
    'gross_ar', # Gross area
    'eff_ar',   # Effective area
    'base_ar',  # Base area
    'perimeter',
    'pct',      # Percent completed
]

In [None]:
building_res_comps = building_res_comps.loc[:, cols]

In [None]:
building_res_comps.head()

# Export building_res_comps
That was a lot of work! Let's save it as a pickle file and continue the data conditioning in the next notebook.

In [None]:
from src.data.utils import save_pickle

In [None]:
save_fn = ROOT_DIR / 'data/raw/2016/building_res_comps.pickle'

In [None]:
save_pickle(building_res_comps, save_fn)
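The project's `save_pickle` helper is not shown in this notebook. As a rough stand-in, the sketch below does a pickle round trip with pandas' own `to_pickle` and `read_pickle`; the `toy` frame and the temporary file name are assumptions made for the example.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up frame, for illustration only.
toy = pd.DataFrame({'acct': ['a', 'b'], 'im_sq_ft': [1500, 2100]})

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_fn = Path(tmp_dir) / 'toy_comps.pickle'
    toy.to_pickle(toy_fn)             # write the frame to disk
    restored = pd.read_pickle(toy_fn) # read it back

# The restored frame has the same values, index, and dtypes.
print(restored.equals(toy))
```

Pickle keeps dtypes and the index intact, which is convenient between notebooks, but pickle files should only be loaded from trusted sources.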

# Export unique account numbers of interest: one_bld_in_acct

In [None]:
save_fn = ROOT_DIR / 'data/raw/2016/one_bld_in_acct.pickle'

In [None]:
save_pickle(one_bld_in_acct, save_fn)