# Find the comparables: extra_features.txt

The file `extra_features.txt` contains additional property information, such as the number and quality of pools, detached garages, outbuildings, canopies, and more. Let's load this file and keep a subset of the most relevant features to continue our study.

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
from pathlib import Path
import pickle

import pandas as pd

from src.definitions import ROOT_DIR
from src.data.utils import Table, save_pickle

In [None]:
extra_features_fn = ROOT_DIR / 'data/external/2016/Real_building_land/extra_features.txt'
assert extra_features_fn.exists()

In [None]:
extra_features = Table(extra_features_fn, '2016')

In [None]:
extra_features_df = extra_features.get_df()

# Load accounts of interest
Let's keep only the accounts that meet the free-standing single-family home criteria we established while processing the `building_res.txt` file, and drop the rest.

In [None]:
one_bld_in_acct_fn = ROOT_DIR / 'data/raw/2016/one_bld_in_acct.pickle'

In [None]:
with open(one_bld_in_acct_fn, 'rb') as f:
    one_bld_in_acct = pickle.load(f)

In [None]:
cond0 = extra_features_df['acct'].isin(one_bld_in_acct)
extra_features_df = extra_features_df.loc[cond0, :]
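
As a quick illustration of the filter above, here is a minimal sketch on a toy frame. The account IDs, descriptions, and the `keep_accts` set are made up for the example; the real `one_bld_in_acct` object may be a different container type.

```python
import pandas as pd

# Toy stand-in for extra_features_df; all values are invented for illustration.
toy_df = pd.DataFrame({
    'acct': ['A1', 'A1', 'B2', 'C3'],
    'dscr': ['Pool', 'Canopy', 'Pool', 'Garage'],
})

# Hypothetical set of accounts that passed the single-building check.
keep_accts = {'A1', 'C3'}

# isin builds a boolean mask; .loc keeps only the rows where the mask is True.
mask = toy_df['acct'].isin(keep_accts)
toy_filtered = toy_df.loc[mask, :]
print(toy_filtered)
```

Note that an account can appear on several rows (one per extra feature), and all of its rows are kept or dropped together.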

In [None]:
extra_features_df.head()

In [None]:
extra_features_df.columns

In [None]:
extra_features_df.dscr.value_counts()

# Grab slice of the extra features of interest
The value counts on the extra feature description above show that the majority of the features fall into the top 6 categories. Let's keep only the rows in those categories and filter out the rest.

In [None]:
cols = extra_features_df.dscr.value_counts().head(6).index

In [None]:
cond0 = extra_features_df['dscr'].isin(cols)
extra_features_df = extra_features_df.loc[cond0, :]
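
To make the top-N selection concrete, here is a small sketch on invented data. It keeps the two most frequent descriptions (rather than six) and also computes the share of rows retained, which is one way to check how much a cutoff actually covers. The names `toy_df`, `top_dscr`, and `kept_share` are hypothetical.

```python
import pandas as pd

# Invented descriptions; the real file has many more categories.
toy_df = pd.DataFrame({
    'dscr': ['Pool', 'Pool', 'Pool', 'Canopy', 'Canopy', 'Shed'],
})

# value_counts sorts by frequency (descending), so head(2) yields the
# two most common labels; .index extracts the labels themselves.
top_dscr = toy_df['dscr'].value_counts().head(2).index

# Keep only the rows whose description is one of the top labels.
toy_top = toy_df.loc[toy_df['dscr'].isin(top_dscr), :]

# Fraction of the original rows that survive the filter.
kept_share = len(toy_top) / len(toy_df)
print(sorted(top_dscr), kept_share)
```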

# Build pivot tables for count and grade
There appear to be two important values related to each extra feature: count and grade. Let's build a separate pivot table for each and merge them before saving the result.

In [None]:
extra_features_pivot_count = extra_features_df.pivot_table(index='acct',
                                                           columns='dscr',
                                                           values='count',
                                                           fill_value=0)

In [None]:
extra_features_pivot_count.head()

In [None]:
extra_features_pivot_grade = extra_features_df.pivot_table(index='acct',
                                                           columns='dscr',
                                                           values='grade')

In [None]:
extra_features_pivot_grade.head()
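
One detail worth keeping in mind: `pivot_table` aggregates with the mean by default, so if an account has several rows for the same feature, their values are averaged. The sketch below shows this on a toy frame, with `aggfunc='sum'` as a possible alternative for counts. The data is invented, and it assumes `grade` is numeric, which may not match the real column.

```python
import pandas as pd

# Toy data: account A1 has two Canopy rows, account B2 has no Canopy at all.
toy_df = pd.DataFrame({
    'acct': ['A1', 'A1', 'A1', 'B2'],
    'dscr': ['Pool', 'Canopy', 'Canopy', 'Pool'],
    'count': [1, 2, 4, 1],
    'grade': [3.0, 2.0, 4.0, 5.0],
})

# Default aggfunc is the mean: A1's Canopy counts (2 and 4) average to 3.
count_mean = toy_df.pivot_table(index='acct', columns='dscr',
                                values='count', fill_value=0)

# With aggfunc='sum' the same two rows add up to 6 instead.
count_sum = toy_df.pivot_table(index='acct', columns='dscr',
                               values='count', aggfunc='sum', fill_value=0)

# Without fill_value, a missing account/feature pair stays NaN.
grade_mean = toy_df.pivot_table(index='acct', columns='dscr', values='grade')

print(count_mean, count_sum, grade_mean, sep='\n\n')
```

Whether the mean or the sum is the right aggregate for `count` depends on how the source file records repeated features, so it may be worth checking a few accounts by hand.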

In [None]:
extra_features_count_grade = extra_features_pivot_count.merge(extra_features_pivot_grade,
                                                              how='left',
                                                              left_index=True,
                                                              right_index=True,
                                                              suffixes=('_count', '_grade'),
                                                              validate='one_to_one')

In [None]:
extra_features_count_grade.head()
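
Here is a minimal sketch of what the index-on-index merge does to overlapping column names, using two hand-built pivots with invented values. Because both frames share the same feature columns, the `suffixes` argument is what tells the count and grade columns apart in the result.

```python
import pandas as pd

# Hand-built stand-ins for the count and grade pivots (values are made up).
accts = pd.Index(['A1', 'B2'], name='acct')
count_pivot = pd.DataFrame({'Canopy': [3, 0], 'Pool': [1, 1]}, index=accts)
grade_pivot = pd.DataFrame({'Canopy': [3.0, float('nan')], 'Pool': [3.0, 5.0]},
                           index=accts)

# Join on the index; overlapping column names receive the given suffixes.
# validate='one_to_one' raises if either index contains duplicate keys.
merged = count_pivot.merge(grade_pivot,
                           how='left',
                           left_index=True,
                           right_index=True,
                           suffixes=('_count', '_grade'),
                           validate='one_to_one')
print(merged)
```

A zero count paired with a missing grade, as for account `B2` and `Canopy` here, simply means the account does not have that feature.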

In [None]:
assert extra_features_count_grade.index.is_unique

Turn the `acct` index back into a regular column to make the upcoming merges easier.

In [None]:
extra_features_count_grade.reset_index(inplace=True)

# Export extra_features_count_grade

In [None]:
save_fn = ROOT_DIR / 'data/raw/2016/extra_features_count_grade_comps.pickle'
save_pickle(extra_features_count_grade, save_fn)