# Load previously selected subdivision samples and features

The file imported below was built in the previous exploratory notebook (2.2-rp-hcad-eda-appraised-value-perc-diff-subdiv-comps-hypothesis-test). The data have been cleaned, merged, and selected, so we can start preparing them for modeling.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from pathlib import Path
import pickle

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from src.definitions import ROOT_DIR
from src.data.utils import save_pickle

In [3]:
plt.style.use('seaborn-poster')

In [4]:
subdiv_fn = ROOT_DIR / 'data/interim/2016/subdiv.pickle'
assert subdiv_fn.exists()

In [5]:
subdiv = pd.read_pickle(subdiv_fn)

In [6]:
subdiv.sample(10)

acct,date_erected,im_sq_ft,act_ar,heat_ar,gross_ar,eff_ar,perimeter,fireplace_metal_prefab,fixtures_addl,fixtures_total,...,new_own_dt,lgl_1,base_area_pri,base_area_upr,mas_brk_garage_pri,open_frame_porch_pri,open_mas_porch_pri,exterior_wall,yy_pct_change,kmeans_label
1347180010002,2015,2321.0,2914,2321.0,2914,2537.0,244.0,1.0,2.0,12.0,...,2015-05-08,LT 2 BLK 1,2321.0,0.0,399.0,0.0,152.0,Stone,5.919743,2
1347180010052,2014,3078.0,3872,3078.0,3872,3223.0,352.0,1.0,2.0,15.0,...,2014-05-30,LT 52 BLK 1,1778.0,1300.0,400.0,210.0,144.0,Stone,0.184155,1
1347180010010,2014,2030.0,2528,2030.0,2528,2214.0,238.0,1.0,2.0,10.0,...,2015-05-14,LT 10 BLK 1,2030.0,0.0,396.0,102.0,0.0,Stone,3.863895,2
1347180010037,2014,3206.0,3968,3206.0,3968,3341.0,360.0,1.0,2.0,15.0,...,2014-08-29,LT 37 BLK 1,1894.0,1312.0,440.0,220.0,62.0,Brick / Veneer,0.092911,0
1347180030004,2014,1824.0,2472,1824.0,2472,2058.0,216.0,1.0,2.0,10.0,...,2014-12-12,LT 4 BLK 3,1824.0,0.0,400.0,100.0,148.0,Frame / Concrete Blk,0.045583,2
1347180030019,2014,1943.0,2437,1943.0,2437,2125.0,220.0,1.0,2.0,10.0,...,2014-08-27,LT 19 BLK 3,1943.0,0.0,380.0,102.0,12.0,Stone,0.072175,1
1347180020012,2015,2587.0,3297,2587.0,3297,2737.0,450.0,1.0,2.0,12.0,...,2015-08-24,LT 12 BLK 2,1492.0,1095.0,491.0,130.0,89.0,Stone,6.219183,0
1347180020013,2014,1874.0,2465,1874.0,2465,2090.0,220.0,0.0,2.0,10.0,...,2015-05-26,LT 13 BLK 2,1874.0,0.0,420.0,122.0,49.0,Stone,2.378514,2
1347180020028,2014,2832.0,3448,2832.0,3448,2930.0,438.0,1.0,2.0,12.0,...,2015-01-16,LT 28 BLK 2,1542.0,1290.0,462.0,130.0,24.0,Brick / Veneer,0.04362,0
1347180010006,2014,1844.0,2406,1844.0,2406,2051.0,218.0,1.0,2.0,10.0,...,2015-04-24,LT 6 BLK 1,1844.0,0.0,420.0,112.0,30.0,Brick / Veneer,1.512663,2


# Drop prior-year features
Since some of the houses weren't finished by January of the prior year (2015), features that rely on prior-year values will be biased toward lower values. Let's drop these features. The drop list below also removes the latest purchase date (`new_own_dt`) and the lot description (`lgl_1`).

In [7]:
subdiv.shape

(111, 33)

In [8]:
subdiv.info()

<class 'pandas.core.frame.DataFrame'>
UInt64Index: 111 entries, 1347180010011 to 1347180030030
Data columns (total 33 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   date_erected            111 non-null    uint16        
 1   im_sq_ft                111 non-null    float64       
 2   act_ar                  111 non-null    uint16        
 3   heat_ar                 111 non-null    float64       
 4   gross_ar                111 non-null    uint16        
 5   eff_ar                  111 non-null    float64       
 6   perimeter               111 non-null    float64       
 7   fireplace_metal_prefab  111 non-null    float32       
 8   fixtures_addl           111 non-null    float32       
 9   fixtures_total          111 non-null    float32       
 10  room_bedroom            111 non-null    float32       
 11  room_full_bath          111 non-null    float32       
 12  room_half_bath          111

In [9]:
drop = [
    'yy_pct_change', # Some of the houses weren't finished in 2015
    'prior_land_val', # Some of the houses weren't finished in 2015
    'prior_tot_appr_val', # Some of the houses weren't finished in 2015
    'new_own_dt', # Latest purchase date
    'lgl_1', # House lot
]

In [10]:
subdiv.drop(drop, axis=1, inplace=True)

In [11]:
subdiv.info()

<class 'pandas.core.frame.DataFrame'>
UInt64Index: 111 entries, 1347180010011 to 1347180030030
Data columns (total 28 columns):
 #   Column                  Non-Null Count  Dtype   
---  ------                  --------------  -----   
 0   date_erected            111 non-null    uint16  
 1   im_sq_ft                111 non-null    float64 
 2   act_ar                  111 non-null    uint16  
 3   heat_ar                 111 non-null    float64 
 4   gross_ar                111 non-null    uint16  
 5   eff_ar                  111 non-null    float64 
 6   perimeter               111 non-null    float64 
 7   fireplace_metal_prefab  111 non-null    float32 
 8   fixtures_addl           111 non-null    float32 
 9   fixtures_total          111 non-null    float32 
 10  room_bedroom            111 non-null    float32 
 11  room_full_bath          111 non-null    float32 
 12  room_half_bath          111 non-null    float32 
 13  room_rec                111 non-null    float32 
 14  roo

# Create dummy features for categorical variables
Since the `exterior_wall` categorical was defined using all the houses in HCAD, seven of its categories do not appear in this subdivision. Let's remove them from the category list so that `pd.get_dummies` doesn't create all-zero dummy columns for them.
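
As a minimal sketch of why this matters, consider a toy categorical Series (the values are illustrative and not taken from the HCAD data). `pd.get_dummies` emits one column per declared category, including categories that never occur, and those empty columns go away once the unused categories are pruned.

```python
import pandas as pd

# Toy categorical with a declared category ("Stucco") that never occurs.
wall = pd.Series(
    pd.Categorical(
        ["Stone", "Brick / Veneer", "Stone"],
        categories=["Brick / Veneer", "Stone", "Stucco"],
    ),
    name="exterior_wall",
)

# get_dummies creates a column for every declared category, even unused ones.
with_unused = pd.get_dummies(wall)
print(list(with_unused.columns))

# After pruning, only the categories present in the data remain.
pruned = pd.get_dummies(wall.cat.remove_unused_categories())
print(list(pruned.columns))
```

The empty `Stucco` column carries no information, so pruning it before encoding keeps the feature matrix free of constant columns.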

In [12]:
subdiv['exterior_wall'].value_counts()

Brick / Veneer          44
Frame / Concrete Blk    36
Stone                   31
Stucco                   0
Shake Shingle            0
Metal, Light             0
Frame / Res Stucco       0
Brick / Masonry          0
Asbestos                 0
Aluminum / Vinyl         0
Name: exterior_wall, dtype: int64

In [13]:
subdiv['exterior_wall'] = subdiv['exterior_wall'].cat.remove_unused_categories()

In [14]:
subdiv['exterior_wall'].value_counts()

Brick / Veneer          44
Frame / Concrete Blk    36
Stone                   31
Name: exterior_wall, dtype: int64

In [15]:
subdiv_dummy = pd.get_dummies(subdiv, columns=['kmeans_label', 'exterior_wall'], drop_first=True)

In [16]:
subdiv_dummy.head()

acct,date_erected,im_sq_ft,act_ar,heat_ar,gross_ar,eff_ar,perimeter,fireplace_metal_prefab,fixtures_addl,fixtures_total,...,tot_appr_val,base_area_pri,base_area_upr,mas_brk_garage_pri,open_frame_porch_pri,open_mas_porch_pri,kmeans_label_1,kmeans_label_2,exterior_wall_Frame / Concrete Blk,exterior_wall_Stone
1347180010011,2014,2869.0,3545,2869.0,3545,3003.0,374.0,1.0,2.0,15.0,...,276747.0,1764.0,1105.0,399.0,220.0,57.0,0,0,1,0
1347180010013,2014,1858.0,2524,1858.0,2524,2098.0,218.0,0.0,2.0,10.0,...,215920.0,1858.0,0.0,399.0,112.0,155.0,0,1,0,1
1347180010021,2014,2579.0,3214,2579.0,3214,2683.0,230.0,1.0,2.0,15.0,...,258753.0,2130.0,0.0,399.0,136.0,100.0,0,0,0,0
1347180010023,2014,2031.0,2626,2031.0,2626,2246.0,234.0,1.0,2.0,10.0,...,165326.0,2031.0,0.0,396.0,119.0,80.0,0,1,1,0
1347180010035,2014,2371.0,3133,2371.0,3133,2458.0,216.0,0.0,2.0,12.0,...,238820.0,1683.0,0.0,566.0,100.0,96.0,0,1,0,1


# Split data

In [17]:
from sklearn.model_selection import train_test_split

In [18]:
X = subdiv_dummy.drop(['tot_appr_val'], axis=1)
y = subdiv_dummy['tot_appr_val']

In [19]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 10-fold cross-validation, feature selection and modeling

I'll follow a variant of the steps on page 245 of [Hastie, Tibshirani, and Friedman (2009)](https://web.stanford.edu/~hastie/ElemStatLearn/). In this case, I first split the data (previous step) and then use only the training data in cross-validation. Within each fold, I will:
1. Standardize the variables
2. Perform feature elimination with Lasso
3. Build a Ridge regression model (multiple linear regression with L2 regularization) on the selected features
4. Determine the model scores on the held-out data fold

By summarizing the model scores across folds (the cross-validation error), we can form an idea of the prediction error on unseen data.
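
The steps above can be sketched with a scikit-learn `Pipeline`, which refits the scaler and the Lasso selector inside every training fold so that nothing leaks from the held-out fold. This is only an illustration under stated assumptions: the data are synthetic (`make_regression` stands in for `X_train` and `y_train`), and the `alpha` values, the step names, and the R² scoring are placeholders rather than choices made in this notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training split (assumption: roughly the size
# of a 70% split of 111 rows; the real features are not used here).
X_demo, y_demo = make_regression(
    n_samples=77, n_features=30, n_informative=5, noise=10.0, random_state=42
)

# Each step is fit on the training part of a fold only.
pipe = Pipeline([
    ("scale", StandardScaler()),                                     # step 1
    ("select", SelectFromModel(Lasso(alpha=1.0, max_iter=10_000))),  # step 2
    ("ridge", Ridge(alpha=1.0)),                                     # step 3
])

# Step 4: score the fitted pipeline on each held-out fold (10 folds).
cv = KFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="r2")
print(f"R^2 across folds: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

Keeping the selection step inside the pipeline is the point of the design: selecting features on the full training set before cross-validating would make the fold scores optimistic.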

Further reading:
- https://sebastianraschka.com/blog/2016/model-evaluation-selection-part3.html
- https://www.tomasbeuzen.com/post/scikit-learn-gridsearch-pipelines/
- https://www.aaai.org/Papers/Workshops/2007/WS-07-05/WS07-05-007.pdf
- https://jcheminf.biomedcentral.com/articles/10.1186/1758-2946-6-10
- https://machinelearningmastery.com/nested-cross-validation-for-machine-learning-with-python/
- https://stats.stackexchange.com/questions/266225/step-by-step-explanation-of-k-fold-cross-validation-with-grid-search-to-optimise
