# Find the comparables: structural_elem1.txt

The file `structural_elem1.txt` describes the structural elements of each building, such as foundation type, exterior wall composition, and heating/AC. Let's load this file and keep the subset of variables we need to continue our study.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from pathlib import Path
import pickle

import pandas as pd

from src.definitions import ROOT_DIR
from src.data.utils import Table, save_pickle

In [3]:
structural_elem1_fn = ROOT_DIR / 'data/external/2016/Real_building_land/structural_elem1.txt'
assert structural_elem1_fn.exists()

In [4]:
structural_elem1 = Table(structural_elem1_fn, '2016')

In [5]:
structural_elem1_df = structural_elem1.get_df()

# Load accounts of interest
Let's keep only the accounts that meet the free-standing single-family home criteria, which we identified while processing the `building_res.txt` file.

In [6]:
one_bld_in_acct_fn = ROOT_DIR / 'data/raw/2016/one_bld_in_acct.pickle'

In [7]:
with open(one_bld_in_acct_fn, 'rb') as f:
    one_bld_in_acct = pickle.load(f)

In [8]:
cond0 = structural_elem1_df['acct'].isin(one_bld_in_acct)
structural_elem1_df = structural_elem1_df.loc[cond0, :]

In [9]:
structural_elem1_df.head()

Unnamed: 0,acct,bld_num,code,adj,type,type_dscr,category_dscr,dor_cd
0,975030000036,1,92,1.35,CAD,Cost and Design,Extensive,A1
1,982110000009,1,4,,PCR,Physical Condition,Average,A1
2,924620000001,1,91,1.5,CAD,Cost and Design,Partial,A1
3,924620000001,1,4,,PCR,Physical Condition,Average,A1
4,924620000001,1,1,0.0,FND,Foundation Type,Slab,A1


# Grab slice of the structural_elem1 variables of interest
Let's count how many records each `type_dscr` has using `value_counts`.

In [10]:
structural_elem1_df.type_dscr.value_counts()

Exterior Wall            1359393
Foundation Type           960758
Physical Condition        960678
Heating / AC              960674
Grade Adjustment          960673
Cond / Desir / Util       960672
Cost and Design           152419
Economic Obsolescence          7
Cooling Type                   5
Sprinkler Type                 5
Functional Utility             5
Partition Type                 5
Heating Type                   5
Plumbing Type                  5
Construction Type              5
Name: type_dscr, dtype: int64

The first six type descriptions above account for the vast majority of the records (about 97.6% of the rows). Let's keep only those before building the pivot table.

In [11]:
cols = structural_elem1_df.type_dscr.value_counts().head(6).index

In [12]:
cond0 = structural_elem1_df['type_dscr'].isin(cols)
structural_elem1_df = structural_elem1_df.loc[cond0, :]
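
Before discarding the less frequent types, it can help to check how much of the data the most frequent ones cover. The following is a minimal sketch on a toy frame, not the notebook's data: `toy_df`, its values, and `top_n` are made up for illustration.

```python
import pandas as pd

# Toy stand-in for structural_elem1_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    'type_dscr': ['Exterior Wall'] * 5 + ['Foundation Type'] * 3 + ['Cooling Type'],
})

top_n = 2  # the notebook keeps the top 6

# normalize=True returns each type's share of the rows, sorted in descending order
shares = toy_df['type_dscr'].value_counts(normalize=True)

# Fraction of rows that survive if only the top_n types are kept
coverage = shares.head(top_n).sum()
print(f'Top {top_n} type descriptions cover {coverage:.1%} of the rows')
```

Running the same two lines against `structural_elem1_df` before the filter in cell 12 would give the share of rows kept by the top six types.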

# Build pivot table
Let's look at one example from a random property account:

In [13]:
structural_elem1_df[structural_elem1_df['acct'] == 1347180010021]

Unnamed: 0,acct,bld_num,code,adj,type,type_dscr,category_dscr,dor_cd
3549080,1347180010021,1,1,30.0,XWR,Exterior Wall,Frame / Concrete Blk,A1
3549082,1347180010021,1,1,0.0,FND,Foundation Type,Slab,A1
3549466,1347180010021,1,3,8.0,HAC,Heating / AC,Central Heat/AC,A1
3549865,1347180010021,1,4,0.99,CDU,Cond / Desir / Util,Average,A1
3549924,1347180010021,1,6,77.7,XWR,Exterior Wall,Brick / Veneer,A1
3549926,1347180010021,1,9,1.17,GRD,Grade Adjustment,B-,A1
3550240,1347180010021,1,4,,PCR,Physical Condition,Average,A1


We would like to build a pivot table using the `type_dscr` entries as columns and the `category_dscr` entries as values. Note, however, that `type_dscr` is not unique within an account number (`acct`): in the example above, Exterior Wall appears twice for the same property account. For the moment, let's keep only the first occurrence of each `type_dscr` per account. If any of these variables turns out to be highly correlated with the property's appraised value, we might need a different approach, e.g. keeping all descriptions by relabeling the repeated entries.
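
To judge how much information the first-occurrence choice sets aside, one could count the repeated (`acct`, `type_dscr`) pairs. This is a rough sketch on a toy frame: `toy_df` and its values are illustrative only, and the same calls would apply to `structural_elem1_df`.

```python
import pandas as pd

# Toy stand-in for structural_elem1_df; account 1 has two Exterior Wall rows.
toy_df = pd.DataFrame({
    'acct': [1, 1, 1, 2, 2],
    'type_dscr': ['Exterior Wall', 'Exterior Wall', 'Foundation Type',
                  'Exterior Wall', 'Foundation Type'],
    'category_dscr': ['Frame / Concrete Blk', 'Brick / Veneer', 'Slab',
                      'Stucco', 'Slab'],
})

# True for every row beyond the first occurrence of an (acct, type_dscr) pair
is_repeat = toy_df.duplicated(subset=['acct', 'type_dscr'], keep='first')

# Number of repeated rows per type description
repeats_per_type = toy_df.loc[is_repeat, 'type_dscr'].value_counts()

# Number of accounts with at least one repeated type description
n_accts_affected = toy_df.loc[is_repeat, 'acct'].nunique()

print(repeats_per_type)
print(n_accts_affected)
```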

In [14]:
structural_elem1_pivot = structural_elem1_df.pivot_table(index='acct',
                                                         columns='type_dscr',
                                                         values='category_dscr',
                                                         aggfunc='first')
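
If the repeated entries turn out to matter, one possible way to keep them is to relabel each repeat before pivoting. The sketch below is an assumption about how that could look, not what this notebook does: `toy_df`, the `type_label` column, and the numbered-suffix scheme are hypothetical.

```python
import pandas as pd

# Toy stand-in for structural_elem1_df; account 1 has two Exterior Wall rows.
toy_df = pd.DataFrame({
    'acct': [1, 1, 1, 2, 2],
    'type_dscr': ['Exterior Wall', 'Exterior Wall', 'Foundation Type',
                  'Exterior Wall', 'Foundation Type'],
    'category_dscr': ['Frame / Concrete Blk', 'Brick / Veneer', 'Slab',
                      'Stucco', 'Slab'],
})

# 0 for the first occurrence of a type within an account, 1 for the second, ...
occurrence = toy_df.groupby(['acct', 'type_dscr']).cumcount()

# Hypothetical labeling scheme: first occurrences keep the plain name,
# repeats get a numbered suffix such as 'Exterior Wall 2'
toy_df['type_label'] = toy_df['type_dscr'].where(
    occurrence == 0,
    toy_df['type_dscr'] + ' ' + (occurrence + 1).astype(str),
)

# (acct, type_label) is now unique, so a plain pivot works without an aggfunc
wide = toy_df.pivot(index='acct', columns='type_label', values='category_dscr')
print(wide)
```

Accounts without a second entry simply get a missing value in the extra column, so this trades a wider, sparser table for not discarding any description.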

In [15]:
structural_elem1_pivot.reset_index(inplace=True)

In [16]:
structural_elem1_pivot.head()

type_dscr,acct,Cond / Desir / Util,Exterior Wall,Foundation Type,Grade Adjustment,Heating / AC,Physical Condition
0,21440000001,Fair,Stucco,Slab,B-,Central Heat/AC,Average
1,21470000008,Unsound,Frame / Concrete Blk,Slab,D-,,Unsound
2,21480000002,Poor,Frame / Concrete Blk,Crawl Space,D,,Poor
3,21650000007,Average,Stucco,Slab,B+,Central Heat/AC,Average
4,21650000011,Fair,Frame / Concrete Blk,Slab,C,Central Heat/AC,Average


In [17]:
assert structural_elem1_pivot['acct'].is_unique

# Export structural_elem1_comps

In [18]:
save_fn = ROOT_DIR / 'data/raw/2016/structural_elem1_comps.pickle'
save_pickle(structural_elem1_pivot, save_fn)