# Find the comparables: structural_elem1.txt

The file `structural_elem1.txt` contains structural information about each building, such as foundation type, exterior wall composition, heating/AC, and more. Let's load this file and keep the subset of variables that matter for our study.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from pathlib import Path
import pickle

import pandas as pd

from src.definitions import ROOT_DIR
from src.data.utils import Table, save_pickle

In [3]:
structural_elem1_fn = ROOT_DIR / 'data/external/2016/Real_building_land/structural_elem1.txt'
assert structural_elem1_fn.exists()

In [4]:
structural_elem1 = Table(structural_elem1_fn, '2016')

# Load accounts of interest
Let's remove the account numbers that don't meet the free-standing single-family home criteria that we defined while processing the `building_res.txt` file.

In [5]:
skiprows = structural_elem1.get_skiprows()

In [6]:
structural_elem1_df = structural_elem1.get_df(skiprows=skiprows)

In [7]:
structural_elem1_df.head()

Unnamed: 0,acct,bld_num,code,adj,type,type_dscr,category_dscr,dor_cd
0,1234170010001,1,4,0.86,CDU,Cond / Desir / Util,Average,A1
1,1234170010001,1,1,0.0,FND,Foundation Type,Slab,A1
2,1247460020012,1,4,0.88,CDU,Cond / Desir / Util,Average,A1
3,1247460020012,1,8,1.29,GRD,Grade Adjustment,B,A1
4,1247460020012,1,6,111.0,XWR,Exterior Wall,Brick / Veneer,A1


# Grab slice of the structural_elem1 variables of interest
Let's count how often each `type_dscr` appears in the data using value counts.

In [8]:
structural_elem1_df.type_dscr.value_counts().apply(lambda x: format(x, 'f'))

Exterior Wall            1359397.000000
Foundation Type           960762.000000
Physical Condition        960682.000000
Heating / AC              960678.000000
Grade Adjustment          960677.000000
Cond / Desir / Util       960676.000000
Cost and Design           152421.000000
Economic Obsolescence          7.000000
Cooling Type                   5.000000
Construction Type              5.000000
Heating Type                   5.000000
Partition Type                 5.000000
Functional Utility             5.000000
Plumbing Type                  5.000000
Sprinkler Type                 5.000000
Name: type_dscr, dtype: object

The first seven entries above account for nearly all rows; each of the remaining types appears only five to seven times. Let's keep those seven before building the pivot table.

In [9]:
cols = structural_elem1_df.type_dscr.value_counts().head(7).index

In [10]:
cond0 = structural_elem1_df['type_dscr'].isin(cols)
structural_elem1_df = structural_elem1_df.loc[cond0, :]
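
The filtering step above can be tried on a small frame. This is a minimal sketch with made-up rows, not the HCAD data; `toy` and the threshold of 2 are assumptions for illustration. It also shows a count-threshold variant, which avoids hard-coding how many types to keep.

```python
import pandas as pd

# Toy stand-in for structural_elem1_df (values are made up for illustration)
toy = pd.DataFrame({
    'acct': [1, 1, 2, 2, 3, 3],
    'type_dscr': ['Exterior Wall', 'Foundation Type', 'Exterior Wall',
                  'Foundation Type', 'Exterior Wall', 'Sprinkler Type'],
})

counts = toy['type_dscr'].value_counts()

# Variant 1: keep the n most frequent descriptions (n=2 here, 7 in the notebook)
top_types = counts.head(2).index
toy_subset = toy.loc[toy['type_dscr'].isin(top_types), :]

# Variant 2: keep every description seen at least `min_count` times (hypothetical threshold)
min_count = 2
frequent_types = counts[counts >= min_count].index
toy_subset_thr = toy.loc[toy['type_dscr'].isin(frequent_types), :]

print(sorted(toy_subset['type_dscr'].unique()))
```

Both variants select the same rows in this toy case; on the real data the threshold would have to be chosen by inspecting the counts.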

# Build pivot table
Let's look at one example from a random property account:

In [11]:
structural_elem1_df[structural_elem1_df['acct'] == 1347180010021]

Unnamed: 0,acct,bld_num,code,adj,type,type_dscr,category_dscr,dor_cd
5072046,1347180010021,1,1,0.0,FND,Foundation Type,Slab,A1
5072047,1347180010021,1,3,8.0,HAC,Heating / AC,Central Heat/AC,A1
5072048,1347180010021,1,4,,PCR,Physical Condition,Average,A1
5072049,1347180010021,1,4,0.99,CDU,Cond / Desir / Util,Average,A1
5073997,1347180010021,1,9,1.17,GRD,Grade Adjustment,B-,A1
5073998,1347180010021,1,6,77.7,XWR,Exterior Wall,Brick / Veneer,A1
5073999,1347180010021,1,1,30.0,XWR,Exterior Wall,Frame / Concrete Blk,A1


We would like to build a pivot table using the `type_dscr` entries as columns and the `category_dscr` entries as values. However, note that `type_dscr` is not unique for each account number (`acct`). In the example above, the `type_dscr` Exterior Wall appears twice for the same property account. For the moment, let's keep only the first occurrence of each `type_dscr`. If any of these variables turns out to be highly correlated with the property's appraised value, we might need a different approach, for example accounting for all descriptions by relabeling the repeated entries.
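
To make the trade-off concrete, here is a minimal sketch on made-up rows. It shows what `aggfunc='first'` discards and one possible way to relabel repeated entries using `groupby(...).cumcount()`. The relabeling scheme (appending a running number to `type_dscr`) is an assumption for illustration, not something this notebook applies.

```python
import pandas as pd

# Toy rows: account 10 has two Exterior Wall entries
toy = pd.DataFrame({
    'acct': [10, 10, 10, 20],
    'type_dscr': ['Foundation Type', 'Exterior Wall', 'Exterior Wall', 'Exterior Wall'],
    'category_dscr': ['Slab', 'Brick / Veneer', 'Frame / Concrete Blk', 'Stucco'],
})

# aggfunc='first' keeps only the first category per (acct, type_dscr) pair
first_pivot = toy.pivot_table(index='acct', columns='type_dscr',
                              values='category_dscr', aggfunc='first')

# Alternative: number the repeats so that every description survives the pivot
relabeled = toy.copy()
repeat_idx = relabeled.groupby(['acct', 'type_dscr']).cumcount() + 1
relabeled['type_dscr'] = relabeled['type_dscr'] + ' ' + repeat_idx.astype(str)
all_pivot = relabeled.pivot_table(index='acct', columns='type_dscr',
                                  values='category_dscr', aggfunc='first')

print(first_pivot)
print(all_pivot)
```

In the relabeled version, accounts with a single wall type simply get a missing value in the `Exterior Wall 2` column.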

In [12]:
structural_elem1_pivot = structural_elem1_df.pivot_table(index='acct',
                                                         columns='type_dscr',
                                                         values='category_dscr',
                                                         aggfunc='first')

In [13]:
structural_elem1_pivot.reset_index(inplace=True)

In [14]:
structural_elem1_pivot.head(20)

type_dscr,acct,Cond / Desir / Util,Cost and Design,Exterior Wall,Foundation Type,Grade Adjustment,Heating / AC,Physical Condition
0,21440000001,Fair,New / Rebuilt,Stucco,Slab,B-,Central Heat/AC,Average
1,21470000008,Unsound,Econ Misimprovement,Frame / Concrete Blk,Slab,D-,,Unsound
2,21480000002,Poor,Econ Misimprovement,Frame / Concrete Blk,Crawl Space,D,,Poor
3,21650000007,Average,New / Rebuilt,Stucco,Slab,B+,Central Heat/AC,Average
4,21650000011,Fair,New / Rebuilt,Frame / Concrete Blk,Slab,C,Central Heat/AC,Average
5,21660000011,Fair,Econ Misimprovement,Asbestos,Crawl Space,D,Central Heat/AC,Fair
6,21660000012,Average,,Frame / Concrete Blk,Slab,B,Central Heat/AC,Average
7,21700000013,Average,Partial,Stucco,Crawl Space,D+,,Average
8,21750000003,Average,,Brick / Veneer,Slab,B+,Central Heat/AC,Average
9,21750000013,Fair,Extensive,Frame / Concrete Blk,Crawl Space,B,Central Heat/AC,Average


In [15]:
assert structural_elem1_pivot['acct'].is_unique

# Fix column names
We would like the column names to be all lower case, with no spaces or other non-alphanumeric characters.

In [16]:
from src.data.utils import fix_column_names

In [17]:
structural_elem1_pivot.columns

Index(['acct', 'Cond / Desir / Util', 'Cost and Design', 'Exterior Wall',
       'Foundation Type', 'Grade Adjustment', 'Heating / AC',
       'Physical Condition'],
      dtype='object', name='type_dscr')

In [18]:
structural_elem1_pivot = fix_column_names(structural_elem1_pivot)

In [19]:
structural_elem1_pivot.columns

Index(['acct', 'cond_desir_util', 'cost_and_design', 'exterior_wall',
       'foundation_type', 'grade_adjustment', 'heating_ac',
       'physical_condition'],
      dtype='object')
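
The source of `fix_column_names` is not shown in this notebook. The sketch below is a hypothetical stand-in (`fix_column_names_sketch`) that reproduces the renaming seen in the output above, assuming the helper lower-cases names, collapses runs of non-alphanumeric characters into underscores, and drops the `type_dscr` label left on the columns by the pivot.

```python
import re

import pandas as pd


def fix_column_names_sketch(df):
    """Hypothetical stand-in for src.data.utils.fix_column_names."""
    def clean(name):
        # Collapse every run of non-alphanumeric characters into one underscore
        name = re.sub(r'[^0-9a-zA-Z]+', '_', str(name))
        return name.strip('_').lower()

    out = df.rename(columns=clean)
    out.columns.name = None  # drop the 'type_dscr' label left by the pivot
    return out


toy = pd.DataFrame(
    [[1, 'Average', 'Central Heat/AC']],
    columns=pd.Index(['acct', 'Cond / Desir / Util', 'Heating / AC'], name='type_dscr'),
)
fixed = fix_column_names_sketch(toy)
print(list(fixed.columns))
```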

## Find duplicated rows

In [20]:
cond0 = structural_elem1_pivot.duplicated()
structural_elem1_pivot.loc[cond0, :]

Unnamed: 0,acct,cond_desir_util,cost_and_design,exterior_wall,foundation_type,grade_adjustment,heating_ac,physical_condition


# Describe and clean the columns

Now we describe each column by covering its:

* Meaning
* Descriptive statistics or value counts
* Data type

HCAD does not provide a document explaining all the variables, but most are easy to guess from their names.

## Condition, desirability, utility: cond_desir_util

In [21]:
from src.data.utils import fix_category_col

In [22]:
structural_elem1_pivot['cond_desir_util'].value_counts(normalize=True)

Average      0.691470
Good         0.140875
Fair         0.093587
Very Good    0.041447
Poor         0.020060
Very Poor    0.005749
Excellent    0.005571
Unsound      0.001242
Name: cond_desir_util, dtype: float64

In [23]:
order = ['Excellent', 'Very Good', 'Good', 'Average', 'Fair', 'Poor', 'Very Poor', 'Unsound']
structural_elem1_pivot = fix_category_col(structural_elem1_pivot, 'cond_desir_util', order=order)

The new column type is: CategoricalDtype(categories=['Excellent', 'Very Good', 'Good', 'Average', 'Fair', 'Poor',
                  'Very Poor', 'Unsound'],
                 ordered=True)


The number of missing values is: 0
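
The implementation of `fix_category_col` is not shown here. Judging only from its printed output, it appears to cast the column to a categorical dtype (ordered when `order` is given) and report the missing-value count. The function below, `fix_category_col_sketch`, is a hypothetical reconstruction under that assumption.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype


def fix_category_col_sketch(df, col, order=None):
    """Hypothetical stand-in for src.data.utils.fix_category_col."""
    if order is None:
        # Unordered: categories are inferred from the data
        dtype = CategoricalDtype(ordered=False)
    else:
        dtype = CategoricalDtype(categories=order, ordered=True)
    df = df.copy()
    df[col] = df[col].astype(dtype)
    print(f'The new column type is: {df[col].dtype}')
    print(f'The number of missing values is: {df[col].isna().sum()}')
    return df


# The last value is deliberately misspelled to show what happens to unlisted labels
toy = pd.DataFrame({'cond': ['Good', 'Average', 'Goood']})
toy = fix_category_col_sketch(toy, 'cond', order=['Good', 'Average'])
```

Values that are not listed in `order` become missing after the cast, which is one reason to check the missing-value count before and after converting.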


## cost_and_design

In [24]:
structural_elem1_pivot['cost_and_design'].value_counts(normalize=True)

New / Rebuilt          0.421937
Partial                0.258141
Extensive              0.199091
Total                  0.060875
Econ Misimprovement    0.059746
Condo Code 1           0.000158
Condo Code 4           0.000053
Name: cost_and_design, dtype: float64

In [25]:
structural_elem1_pivot = fix_category_col(structural_elem1_pivot, 'cost_and_design')

The new column type is: CategoricalDtype(categories=['Condo Code 1', 'Condo Code 4', 'Econ Misimprovement',
                  'Extensive', 'New / Rebuilt', 'Partial', 'Total'],
                 ordered=False)


The number of missing values is: 805429


## exterior_wall

In [26]:
structural_elem1_pivot['exterior_wall'].value_counts(normalize=True)

Brick / Veneer          0.470118
Frame / Concrete Blk    0.300622
Brick / Masonry         0.102839
Aluminum / Vinyl        0.040302
Stucco                  0.031921
Shake Shingle           0.020870
Asbestos                0.019353
Stone                   0.013973
Metal, Light            0.000001
Frame / Res Stucco      0.000001
Name: exterior_wall, dtype: float64

In [27]:
structural_elem1_pivot = fix_category_col(structural_elem1_pivot, 'exterior_wall')

The new column type is: CategoricalDtype(categories=['Aluminum / Vinyl', 'Asbestos', 'Brick / Masonry',
                  'Brick / Veneer', 'Frame / Concrete Blk',
                  'Frame / Res Stucco', 'Metal, Light', 'Shake Shingle',
                  'Stone', 'Stucco'],
                 ordered=False)


The number of missing values is: 1


## foundation_type

In [28]:
structural_elem1_pivot['foundation_type'].value_counts(normalize=True)

Slab                0.903931
Crawl Space         0.094759
Full Basement       0.000673
Partial Basement    0.000636
Name: foundation_type, dtype: float64

In [29]:
structural_elem1_pivot = fix_category_col(structural_elem1_pivot, 'foundation_type')

The new column type is: CategoricalDtype(categories=['Crawl Space', 'Full Basement', 'Partial Basement', 'Slab'], ordered=False)


The number of missing values is: 0


## grade_adjustment

In [30]:
structural_elem1_pivot['grade_adjustment'].value_counts(normalize=True)

C     0.347699
C+    0.206818
B-    0.096475
B     0.082971
C-    0.072634
B+    0.050494
D     0.038653
D+    0.027677
A-    0.021622
A     0.018226
A+    0.014111
D-    0.010852
X-    0.005150
X     0.002758
E     0.001774
E+    0.001227
X+    0.000628
E-    0.000231
Name: grade_adjustment, dtype: float64

In [31]:
letters = ['A', 'B', 'C', 'D', 'E', 'X']
signs = ['+', '', '-']
order = [letter + sign for letter in letters for sign in signs]

In [32]:
structural_elem1_pivot = fix_category_col(structural_elem1_pivot, 'grade_adjustment', order=order)

The new column type is: CategoricalDtype(categories=['A+', 'A', 'A-', 'B+', 'B', 'B-', 'C+', 'C', 'C-', 'D+', 'D',
                  'D-', 'E+', 'E', 'E-', 'X+', 'X', 'X-'],
                 ordered=True)


The number of missing values is: 0
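
An ordered categorical allows comparisons and sorting by list position. The sketch below uses a few made-up grades with the same `order` list. Note that the ordering is purely positional: "smaller" means "earlier in the list", and placing `X` after `E` is an assumption inherited from the list above rather than something verified against HCAD's grading scale.

```python
import pandas as pd

letters = ['A', 'B', 'C', 'D', 'E', 'X']
signs = ['+', '', '-']
order = [letter + sign for letter in letters for sign in signs]

# Made-up grades for illustration
grades = pd.Series(['C', 'B-', 'A+', 'D'],
                   dtype=pd.CategoricalDtype(categories=order, ordered=True))

# Keep grades at or before 'B-' in the category list
b_minus_or_earlier = grades[grades <= 'B-']
print(b_minus_or_earlier.tolist())

# Sorting follows the category order, not alphabetical order
print(grades.sort_values().tolist())
```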


## heating_ac

In [33]:
structural_elem1_pivot['heating_ac'].value_counts(normalize=True)

Central Heat/AC    0.897521
None               0.079938
Central Heat       0.020844
A/C Only           0.001697
Name: heating_ac, dtype: float64

`None` here means the property has no heating or AC unit. It is a category label that `value_counts` counts like any other, not a missing value.

In [34]:
structural_elem1_pivot = fix_category_col(structural_elem1_pivot, 'heating_ac')

The new column type is: CategoricalDtype(categories=['A/C Only', 'Central Heat', 'Central Heat/AC', 'None'], ordered=False)


The number of missing values is: 1
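
The difference between the label `'None'` and a truly missing entry can be checked directly. A minimal sketch on a made-up series:

```python
import numpy as np
import pandas as pd

# Two properties labeled 'None' (no unit) and one with no record at all
toy = pd.Series(['Central Heat/AC', 'None', 'None', np.nan], name='heating_ac')

# value_counts() skips NaN by default, but counts the string 'None'
counts = toy.value_counts()
print(counts)

# isna() flags only the truly missing entry
n_missing = toy.isna().sum()
print(n_missing)
```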


## physical_condition

In [35]:
structural_elem1_pivot['physical_condition'].value_counts(normalize=True)

Average      0.831449
Good         0.083243
Fair         0.048969
Very Good    0.016354
Poor         0.011392
Excellent    0.005605
Very Poor    0.001959
Unsound      0.001030
Name: physical_condition, dtype: float64

In [36]:
order = ['Excellent', 'Very Good', 'Good', 'Average', 'Fair', 'Poor', 'Very Poor', 'Unsound']

In [37]:
structural_elem1_pivot = fix_category_col(structural_elem1_pivot, 'physical_condition', order=order)

The new column type is: CategoricalDtype(categories=['Excellent', 'Very Good', 'Good', 'Average', 'Fair', 'Poor',
                  'Very Poor', 'Unsound'],
                 ordered=True)


The number of missing values is: 0


# Export structural_elem1_comps

In [38]:
save_fn = ROOT_DIR / 'data/raw/2016/structural_elem1_comps.pickle'
save_pickle(structural_elem1_pivot, save_fn)
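
One practical reason to export as a pickle is that it keeps the categorical dtypes set above, including their order. The sketch below demonstrates the round trip with pandas' own `to_pickle` and `read_pickle` on a toy frame. It does not use the project's `save_pickle` helper, whose implementation is not shown here, and the file name is made up.

```python
import tempfile
from pathlib import Path

import pandas as pd

order = ['Good', 'Average', 'Fair']
toy = pd.DataFrame({
    'acct': [1, 2],
    'physical_condition': pd.Categorical(['Average', 'Good'],
                                         categories=order, ordered=True),
})

# Write to a temporary directory and read back
with tempfile.TemporaryDirectory() as tmp_dir:
    fn = Path(tmp_dir) / 'toy_comps.pickle'
    toy.to_pickle(fn)
    restored = pd.read_pickle(fn)

# The ordered categorical dtype survives the round trip
print(restored['physical_condition'].dtype)
```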