# Find the comparables: extra_features.txt

The file `extra_features.txt` describes the extra features of each property, such as the number and quality of pools, detached garages, outbuildings, and canopies. Let's load this file and keep a subset with the important columns to continue our study.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from pathlib import Path
import pickle

import pandas as pd

from src.definitions import ROOT_DIR
from src.data.utils import Table, save_pickle

In [3]:
extra_features_fn = ROOT_DIR / 'data/external/2016/Real_building_land/extra_features.txt'
assert extra_features_fn.exists()

In [4]:
extra_features = Table(extra_features_fn, '2016')

In [5]:
extra_features.get_header()

['acct',
 'bld_num',
 'count',
 'grade',
 'cd',
 's_dscr',
 'l_dscr',
 'cat',
 'dscr',
 'note',
 'uts']

# Load accounts of interest
Let's skip the accounts that don't meet the free-standing single-family home criteria, which we identified while processing the `building_res.txt` file.

In [6]:
skiprows = extra_features.get_skiprows()

In [7]:
extra_features_df = extra_features.get_df(skiprows=skiprows)
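
The `Table` helper comes from this project's `src.data.utils`, and its internals are not shown in this notebook. As a rough sketch of the general idea only, and assuming that `get_skiprows` yields the file row positions to drop, the snippet below does the same kind of filtering with plain `pandas.read_csv` and its `skiprows` argument. The in-memory sample and the `accts_to_drop` set are hypothetical.

```python
import io

import pandas as pd

# Hypothetical tab-separated sample standing in for extra_features.txt.
raw = (
    "acct\tl_dscr\tuts\n"
    "1\tGunite Pool\t368\n"
    "2\tFrame Utility Shed\t110\n"
    "3\tFrame Detached Garage\t225\n"
)

# Hypothetical accounts that failed the single-family home criteria.
accts_to_drop = {2}

# First pass: read only the account column to locate the rows to skip.
accts = pd.read_csv(io.StringIO(raw), sep="\t", usecols=["acct"])["acct"]
# Add 1 because line 0 of the file is the header.
skiprows = [int(i) + 1 for i in accts[accts.isin(accts_to_drop)].index]

# Second pass: load the full table without the unwanted rows.
sample_df = pd.read_csv(io.StringIO(raw), sep="\t", skiprows=skiprows)
```

Skipping rows at read time keeps the unwanted accounts out of memory entirely, which matters for a file of this size.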

In [8]:
extra_features_df.head()

Unnamed: 0,acct,bld_num,count,grade,cd,s_dscr,l_dscr,cat,dscr,note,uts
0,21440000001,0,2,4,RRS1,WDUtSh,Frame Utility Shed,OB,Outbuildings,,110.0
1,21440000001,0,2,4,RRS1,WDUtSh,Frame Utility Shed,OB,Outbuildings,,130.0
2,21480000002,1,1,4,ROGV,OtherRs,Residential Other Gross Value,MS,Miscellaneous,SALV GAR APMT.,0.5
3,21650000007,0,1,4,RRP5,GnPool,Gunite Pool,PL,Pools,,368.0
4,21700000013,0,1,5,RRG1,FrmGar,Frame Detached Garage,GR,Garage,,225.0


In [9]:
extra_features_df.l_dscr.value_counts().head(25)

Frame Detached Garage                       181621
Frame Utility Shed                           93378
Gunite Pool                                  87999
Canopy - Residential                         85893
Carport - Residential                        76603
Pool SPA with Heater                         37243
Metal Utility Shed                           19563
Foundation Repaired                          17655
Cracked Slab                                 16123
Residential Other Gross Value                14816
Brick or Stone Detached Garage               13804
Frame Detached Garage w/living area abov      7059
Custom Outdoor Kitchen                        4301
Reinforced Concrete Pool                      3729
Basic Outdoor Kitchen                         2304
Utility Building - Metal                      2007
4 Side closed Metal Pole Barn                 1889
Brick or Stone Detached Garage w/living       1461
Utility Building - Frame                      1430
Light Wood Deck Lt Posts Boat D

# Grab slice of the extra features of interest
The value counts on the extra feature description above show that the majority of the features fall in the top 15 categories. Let's filter out the rest of the categories.

In [10]:
cols = extra_features_df.l_dscr.value_counts().head(15).index

In [11]:
cond0 = extra_features_df['l_dscr'].isin(cols)
extra_features_df = extra_features_df.loc[cond0, :]
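
To make the top-N filter concrete, here is a minimal sketch on a hypothetical toy frame. It also computes the share of rows that the kept categories cover, which is one way to check a "majority" claim like the one above. The names `toy`, `top_labels`, and `coverage` are illustrative only.

```python
import pandas as pd

# Hypothetical toy data: one row per extra feature.
toy = pd.DataFrame({
    "acct": [1, 1, 2, 3, 4, 5],
    "l_dscr": ["Garage", "Shed", "Garage", "Pool", "Garage", "Shed"],
})

counts = toy["l_dscr"].value_counts()

# Labels of the 2 most frequent categories (the notebook keeps 15).
top_labels = counts.head(2).index

# Fraction of all rows that belong to the kept categories.
coverage = counts.head(2).sum() / counts.sum()

# Keep only the rows whose description is one of the top labels.
toy_top = toy.loc[toy["l_dscr"].isin(top_labels), :]
```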

# Build pivot tables for uts and grade
There appear to be two important values related to each extra feature: `uts` (the unit area in square feet) and `grade`. Since a property can have multiple features of the same class, e.g. two frame utility sheds, let's aggregate them per account by summing the `uts` values and by taking the mean of the grades within each class.

Let's build individual pivot tables for each and merge them before saving them out.

In [12]:
extra_features_pivot_uts = extra_features_df.pivot_table(index='acct',
                                                         columns='l_dscr',
                                                         values='uts',
                                                         aggfunc='sum',
                                                         fill_value=0)

In [13]:
extra_features_pivot_uts.head()

acct,Basic Outdoor Kitchen,Brick or Stone Detached Garage,Canopy - Residential,Carport - Residential,Cracked Slab,Custom Outdoor Kitchen,Foundation Repaired,Frame Detached Garage,Frame Detached Garage w/living area abov,Frame Utility Shed,Gunite Pool,Metal Utility Shed,Pool SPA with Heater,Reinforced Concrete Pool,Residential Other Gross Value
21440000001,0.0,0,0.0,0,0,0.0,0.0,0,0.0,240.0,0.0,0,0.0,0,0.0
21480000002,0.0,0,0.0,0,0,0.0,0.0,0,0.0,0.0,0.0,0,0.0,0,0.5
21650000007,0.0,0,0.0,0,0,0.0,0.0,0,0.0,0.0,368.0,0,0.0,0,0.0
21700000013,0.0,0,0.0,0,0,0.0,0.0,225,0.0,0.0,0.0,0,0.0,0,0.0
21750000013,0.0,0,0.0,0,0,0.0,0.0,450,300.0,0.0,0.0,0,0.0,0,0.0


In [14]:
extra_features_pivot_grade = extra_features_df.pivot_table(index='acct',
                                                           columns='l_dscr',
                                                           values='grade',
                                                           aggfunc='mean',
                                                           )

In [15]:
extra_features_pivot_grade.head()

acct,Basic Outdoor Kitchen,Brick or Stone Detached Garage,Canopy - Residential,Carport - Residential,Cracked Slab,Custom Outdoor Kitchen,Foundation Repaired,Frame Detached Garage,Frame Detached Garage w/living area abov,Frame Utility Shed,Gunite Pool,Metal Utility Shed,Pool SPA with Heater,Reinforced Concrete Pool,Residential Other Gross Value
21440000001,,,,,,,,,,4.0,,,,,
21480000002,,,,,,,,,,,,,,,4.0
21650000007,,,,,,,,,,,4.0,,,,
21700000013,,,,,,,,5.0,,,,,,,
21750000013,,,,,,,,4.0,4.0,,,,,,
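
The two aggregations can be seen side by side on a small example. The sketch below uses a hypothetical toy frame in which account 1 owns two sheds: their `uts` values are added, and their grades are averaged. The grade values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: account 1 has two sheds, account 2 has a pool.
toy = pd.DataFrame({
    "acct": [1, 1, 2],
    "l_dscr": ["Frame Utility Shed", "Frame Utility Shed", "Gunite Pool"],
    "uts": [110.0, 130.0, 368.0],
    "grade": [4, 5, 4],
})

# Total area per account and feature class; absent features become 0.
toy_uts = toy.pivot_table(index="acct", columns="l_dscr",
                          values="uts", aggfunc="sum", fill_value=0)

# Mean grade per account and feature class; absent features stay NaN.
toy_grade = toy.pivot_table(index="acct", columns="l_dscr",
                            values="grade", aggfunc="mean")
```

Leaving `fill_value` out of the grade table is a deliberate choice in this sketch: a grade of 0 would look like a real (very poor) grade, while `NaN` marks the feature as absent.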


In [16]:
extra_features_uts_grade = extra_features_pivot_uts.merge(extra_features_pivot_grade,
                                                          how='left',
                                                          left_index=True,
                                                          right_index=True,
                                                          suffixes=('_uts', '_grade'),
                                                          validate='one_to_one')

In [17]:
extra_features_uts_grade.head()

acct,Basic Outdoor Kitchen_uts,Brick or Stone Detached Garage_uts,Canopy - Residential_uts,Carport - Residential_uts,Cracked Slab_uts,Custom Outdoor Kitchen_uts,Foundation Repaired_uts,Frame Detached Garage_uts,Frame Detached Garage w/living area abov_uts,Frame Utility Shed_uts,...,Custom Outdoor Kitchen_grade,Foundation Repaired_grade,Frame Detached Garage_grade,Frame Detached Garage w/living area abov_grade,Frame Utility Shed_grade,Gunite Pool_grade,Metal Utility Shed_grade,Pool SPA with Heater_grade,Reinforced Concrete Pool_grade,Residential Other Gross Value_grade
21440000001,0.0,0,0.0,0,0,0.0,0.0,0,0.0,240.0,...,,,,,4.0,,,,,
21480000002,0.0,0,0.0,0,0,0.0,0.0,0,0.0,0.0,...,,,,,,,,,,4.0
21650000007,0.0,0,0.0,0,0,0.0,0.0,0,0.0,0.0,...,,,,,,4.0,,,,
21700000013,0.0,0,0.0,0,0,0.0,0.0,225,0.0,0.0,...,,,5.0,,,,,,,
21750000013,0.0,0,0.0,0,0,0.0,0.0,450,300.0,0.0,...,,,4.0,4.0,,,,,,
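
Because both pivot tables share the same column labels, the `suffixes` argument is what keeps the merged columns apart. A minimal sketch on hypothetical two-row tables:

```python
import pandas as pd

acct_index = pd.Index([1, 2], name="acct")

# Hypothetical pivot outputs with identical column labels.
uts = pd.DataFrame({"Gunite Pool": [368.0, 0.0]}, index=acct_index)
grade = pd.DataFrame({"Gunite Pool": [4.0, None]}, index=acct_index)

# Overlapping column names receive the suffixes; validate raises a
# MergeError if either index contains a repeated account number.
merged = uts.merge(grade,
                   how="left",
                   left_index=True,
                   right_index=True,
                   suffixes=("_uts", "_grade"),
                   validate="one_to_one")
```

Note that pandas only applies the suffixes to column names present in both frames, so a column unique to one side would keep its original name.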


In [18]:
assert extra_features_uts_grade.index.is_unique

Move `acct` from the index back into a column to make the upcoming merges easier.

In [19]:
extra_features_uts_grade.reset_index(inplace=True)

# Fix column names
We would like the column names to be all lower case, with no spaces or other non-alphanumeric characters.

In [20]:
from src.data.utils import fix_column_names

In [21]:
extra_features_uts_grade.columns

Index(['acct', 'Basic Outdoor Kitchen_uts',
       'Brick or Stone Detached Garage_uts', 'Canopy - Residential_uts',
       'Carport - Residential_uts', 'Cracked Slab_uts',
       'Custom Outdoor Kitchen_uts', 'Foundation Repaired_uts',
       'Frame Detached Garage_uts',
       'Frame Detached Garage w/living area abov_uts',
       'Frame Utility Shed_uts', 'Gunite Pool_uts', 'Metal Utility Shed_uts',
       'Pool SPA with Heater_uts', 'Reinforced Concrete Pool_uts',
       'Residential Other Gross Value_uts', 'Basic Outdoor Kitchen_grade',
       'Brick or Stone Detached Garage_grade', 'Canopy - Residential_grade',
       'Carport - Residential_grade', 'Cracked Slab_grade',
       'Custom Outdoor Kitchen_grade', 'Foundation Repaired_grade',
       'Frame Detached Garage_grade',
       'Frame Detached Garage w/living area abov_grade',
       'Frame Utility Shed_grade', 'Gunite Pool_grade',
       'Metal Utility Shed_grade', 'Pool SPA with Heater_grade',
       'Reinforced Concrete Poo

In [22]:
extra_features_uts_grade = fix_column_names(extra_features_uts_grade)

In [23]:
extra_features_uts_grade.columns

Index(['acct', 'basic_outdoor_kitchen_uts',
       'brick_or_stone_detached_garage_uts', 'canopy_residential_uts',
       'carport_residential_uts', 'cracked_slab_uts',
       'custom_outdoor_kitchen_uts', 'foundation_repaired_uts',
       'frame_detached_garage_uts',
       'frame_detached_garage_w_living_area_abov_uts',
       'frame_utility_shed_uts', 'gunite_pool_uts', 'metal_utility_shed_uts',
       'pool_spa_with_heater_uts', 'reinforced_concrete_pool_uts',
       'residential_other_gross_value_uts', 'basic_outdoor_kitchen_grade',
       'brick_or_stone_detached_garage_grade', 'canopy_residential_grade',
       'carport_residential_grade', 'cracked_slab_grade',
       'custom_outdoor_kitchen_grade', 'foundation_repaired_grade',
       'frame_detached_garage_grade',
       'frame_detached_garage_w_living_area_abov_grade',
       'frame_utility_shed_grade', 'gunite_pool_grade',
       'metal_utility_shed_grade', 'pool_spa_with_heater_grade',
       'reinforced_concrete_pool_grade'
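
The source of `fix_column_names` is not shown in this notebook. Judging only from the before and after column names above, a helper with similar behavior could look like the sketch below. The function name `fix_column_names_sketch` is hypothetical, and the real implementation in `src.data.utils` may differ.

```python
import re

import pandas as pd


def fix_column_names_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with lower-case, underscore-separated column names."""

    def clean(name) -> str:
        # Lower-case, then collapse each run of non-alphanumerics into one "_".
        return re.sub(r"[^0-9a-z]+", "_", str(name).lower()).strip("_")

    return df.rename(columns=clean)


# Hypothetical empty frame carrying a few of the original column names.
demo = pd.DataFrame(columns=["acct",
                             "Canopy - Residential_uts",
                             "Frame Detached Garage w/living area abov_grade"])
demo_fixed = fix_column_names_sketch(demo)
```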

### Find duplicated rows

In [24]:
cond0 = extra_features_uts_grade.duplicated()
extra_features_uts_grade.loc[cond0, :]

Unnamed: 0,acct,basic_outdoor_kitchen_uts,brick_or_stone_detached_garage_uts,canopy_residential_uts,carport_residential_uts,cracked_slab_uts,custom_outdoor_kitchen_uts,foundation_repaired_uts,frame_detached_garage_uts,frame_detached_garage_w_living_area_abov_uts,...,custom_outdoor_kitchen_grade,foundation_repaired_grade,frame_detached_garage_grade,frame_detached_garage_w_living_area_abov_grade,frame_utility_shed_grade,gunite_pool_grade,metal_utility_shed_grade,pool_spa_with_heater_grade,reinforced_concrete_pool_grade,residential_other_gross_value_grade
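
`DataFrame.duplicated()` with no arguments only flags rows that repeat in every column. As a small illustration on hypothetical data, the sketch below contrasts that with a check restricted to `acct`, which would also catch accounts that appear more than once with different values.

```python
import pandas as pd

# Hypothetical frame in which account 2 appears twice.
toy = pd.DataFrame({"acct": [1, 2, 2],
                    "gunite_pool_uts": [368.0, 0.0, 0.0]})

# Flags the second and later copies of a fully identical row.
full_row_dupes = toy.duplicated()

# Flags every row whose account number occurs more than once.
acct_dupes = toy.duplicated(subset="acct", keep=False)
```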


# Describe

In [25]:
extra_features_uts_grade.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 429701 entries, 0 to 429700
Data columns (total 31 columns):
 #   Column                                          Non-Null Count   Dtype  
---  ------                                          --------------   -----  
 0   acct                                            429701 non-null  int64  
 1   basic_outdoor_kitchen_uts                       429701 non-null  float64
 2   brick_or_stone_detached_garage_uts              429701 non-null  int64  
 3   canopy_residential_uts                          429701 non-null  float64
 4   carport_residential_uts                         429701 non-null  int64  
 5   cracked_slab_uts                                429701 non-null  int64  
 6   custom_outdoor_kitchen_uts                      429701 non-null  float64
 7   foundation_repaired_uts                         429701 non-null  float64
 8   frame_detached_garage_uts                       429701 non-null  int64  
 9   frame_detached_garage_w_li

In [26]:
extra_features_uts_grade.describe()

Unnamed: 0,acct,basic_outdoor_kitchen_uts,brick_or_stone_detached_garage_uts,canopy_residential_uts,carport_residential_uts,cracked_slab_uts,custom_outdoor_kitchen_uts,foundation_repaired_uts,frame_detached_garage_uts,frame_detached_garage_w_living_area_abov_uts,...,custom_outdoor_kitchen_grade,foundation_repaired_grade,frame_detached_garage_grade,frame_detached_garage_w_living_area_abov_grade,frame_utility_shed_grade,gunite_pool_grade,metal_utility_shed_grade,pool_spa_with_heater_grade,reinforced_concrete_pool_grade,residential_other_gross_value_grade
count,429701.0,429701.0,429701.0,429701.0,429701.0,429701.0,429701.0,429701.0,429701.0,429701.0,...,4294.0,17653.0,180708.0,7047.0,87819.0,87949.0,18365.0,37211.0,3729.0,14550.0
mean,927251900000.0,0.006876,19.495272,52.361754,71.469364,58.690378,0.010087,74.098459,213.68257,9.658162,...,3.261877,4.001076,4.54628,3.964027,4.490624,3.88246,4.630054,3.846954,3.998391,3.992131
std,264289100000.0,0.498158,116.572033,146.89009,185.677686,315.733131,0.101181,375.256007,273.810794,79.669605,...,0.960405,0.043224,0.551933,0.716937,0.594026,0.465277,0.577235,0.49462,0.284616,0.284215
min,21440000000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
25%,770500100000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,3.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0
50%,993750000000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,4.0,4.0,5.0,4.0,4.5,4.0,5.0,4.0,4.0,4.0
75%,1143770000000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,462.0,0.0,...,4.0,4.0,5.0,4.0,5.0,4.0,5.0,4.0,4.0,4.0
max,1373580000000.0,256.0,11086.0,15450.0,19000.0,9153.0,5.0,8952.0,10406.0,2920.0,...,6.0,5.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0


# Export extra_features_uts_grade

In [27]:
save_fn = ROOT_DIR / 'data/raw/2016/extra_features_uts_grade_comps.pickle'
save_pickle(extra_features_uts_grade, save_fn)
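
`save_pickle` is a project helper whose code is not shown here. As a stand-in, the sketch below uses pandas' own `to_pickle` and `read_pickle` to show how a saved frame can be read back and compared with the original. The toy frame and the temporary file name are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical frame standing in for extra_features_uts_grade.
toy = pd.DataFrame({"acct": [1, 2], "gunite_pool_uts": [368.0, 0.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_fn = Path(tmp_dir) / "toy.pickle"
    toy.to_pickle(tmp_fn)               # write the frame to disk
    restored = pd.read_pickle(tmp_fn)   # read it back

# True when values, dtypes, index, and columns all match.
round_trip_ok = toy.equals(restored)
```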