<a href="https://colab.research.google.com/github/LucasO21/ml-with-python/blob/main/ames-housing-prediction/ames_prediction_feature_selection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Introduction

In this notebook, I explore feature selection techniques to determine the best features for predicting *saleprice* in the Ames housing dataset.

This is part of a larger project to predict *saleprice* for the Ames dataset. The project focuses on exploring machine learning techniques and their implementation in Python. Since this is a learning project, my goal is to reduce the number of features I work with so I can focus on implementation without being overwhelmed by the high number of features.

## Notebook Settings

In [1]:
# mount drive
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)

Mounted at /content/gdrive


In [2]:
# set working directory
import os

cwd = '/content/gdrive/MyDrive/ml-with-python/aimes_analysis'
os.chdir(cwd)

## Package Imports

In [3]:
# installs
! pip install feature_engine &> /dev/null

In [4]:
# imports
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

## Load Data

In [7]:
# load data
df = pd.read_csv('data/ames_training_data.csv')
df.shape

(2197, 82)

In [8]:
# view head
df.head()

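|   | order | pid | ms_subclass | ms_zoning | lot_frontage | lot_area | street | alley | lot_shape | land_contour | ... | pool_area | pool_qc | fence | misc_feature | misc_val | mo_sold | yr_sold | sale_type | sale_condition | saleprice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 534 | 531363010 | 20 | RL | 80.0 | 9605 | Pave | NaN | Reg | Lvl | ... | 0 | NaN | NaN | NaN | 0 | 4 | 2009 | WD | Normal | 159000 |
| 1 | 803 | 906203120 | 20 | RL | 90.0 | 14684 | Pave | NaN | IR1 | Lvl | ... | 0 | NaN | NaN | NaN | 0 | 6 | 2009 | WD | Normal | 271900 |
| 2 | 956 | 916176030 | 20 | RL | NaN | 14375 | Pave | NaN | IR1 | Lvl | ... | 0 | NaN | NaN | NaN | 0 | 1 | 2009 | COD | Abnorml | 137500 |
| 3 | 460 | 528180130 | 120 | RL | 48.0 | 6472 | Pave | NaN | Reg | Lvl | ... | 0 | NaN | NaN | NaN | 0 | 4 | 2009 | WD | Normal | 248500 |
| 4 | 487 | 528290030 | 80 | RL | 61.0 | 9734 | Pave | NaN | IR1 | Lvl | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2009 | WD | Normal | 167000 |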


__Observation:__ At first glance, *order* and *pid* are record identifiers rather than house attributes, so they are not useful for predicting house prices. I make a note to drop these features later.

## Data Inspection

In [9]:
# check for nulls
nulls = []
for feature in df.columns:
    if df[feature].isnull().sum() > 0:
        nulls.append(feature)

features_with_nulls_df = df[nulls].isnull().sum() / len(df)
features_with_nulls_df

lot_frontage      0.164770
alley             0.934911
mas_vnr_type      0.010014
mas_vnr_area      0.010014
bsmt_qual         0.030496
bsmt_cond         0.030496
bsmt_exposure     0.031406
bsmtfin_type_1    0.030496
bsmtfin_sf_1      0.000455
bsmtfin_type_2    0.030951
bsmtfin_sf_2      0.000455
bsmt_unf_sf       0.000455
total_bsmt_sf     0.000455
electrical        0.000455
bsmt_full_bath    0.000455
bsmt_half_bath    0.000455
fireplace_qu      0.485207
garage_type       0.054620
garage_yr_blt     0.055530
garage_finish     0.055530
garage_cars       0.000455
garage_area       0.000455
garage_qual       0.055530
garage_cond       0.055530
pool_qc           0.994538
fence             0.809285
misc_feature      0.963587
dtype: float64

In [10]:
# get list of features to drop

# 1: features where the proportion of null values is >= 0.2
# 2: order and pid can also be dropped

features_to_drop_names = features_with_nulls_df[features_with_nulls_df >= 0.2].index.tolist()
features_to_drop_names.extend(['order', 'pid'])
features_to_drop_names

['alley', 'fireplace_qu', 'pool_qc', 'fence', 'misc_feature', 'order', 'pid']

In [11]:
# drop unwanted features
df.drop(features_to_drop_names, axis=1, inplace=True)
df.shape

(2197, 75)
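
The null check and threshold filter above can also be written without an explicit loop, since the mean of a boolean mask is the fraction of `True` values. The following is a minimal sketch on a toy frame; the values are made up for illustration and are not taken from the Ames data.

```python
import numpy as np
import pandas as pd

# toy frame; values are illustrative only
toy = pd.DataFrame({
    'lot_frontage': [80.0, np.nan, 61.0, 48.0],
    'alley': [np.nan, np.nan, np.nan, 'Grvl'],
    'lot_area': [9605, 14684, 14375, 6472],
})

# fraction of missing values per column
null_fraction = toy.isnull().mean()

# keep only the columns that have at least one missing value
null_fraction = null_fraction[null_fraction > 0]

# columns at or above the 0.2 cut-off become drop candidates
drop_candidates = null_fraction[null_fraction >= 0.2].index.tolist()
print(drop_candidates)
```

On the real data, `df.isnull().mean()` should give the same proportions as the loop, with zero-null columns included until they are filtered out.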

## Train Test Split

In [12]:
# imports
from sklearn.model_selection import train_test_split

In [13]:
# get x and y features
X = df.drop('saleprice', axis=1)
y = df['saleprice']

# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=100)
X_train.shape, X_test.shape

((1537, 74), (660, 74))
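
The 1537 / 660 row counts follow from how the split sizes are rounded. As far as I understand scikit-learn's behaviour, a float `test_size` is rounded up to a whole number of rows and, with `train_size` left unset, the training set takes the remainder. A quick arithmetic check under that assumption:

```python
import math

n_samples = 2197   # rows in df
test_size = 0.3

# a float test_size is rounded up to a whole number of rows
n_test = math.ceil(test_size * n_samples)

# with train_size left unset, the training set takes the rest
n_train = n_samples - n_test

print(n_train, n_test)
```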

In [14]:
# make copy of X_train and X_test
X_train_copy = X_train.copy()
X_test_copy = X_test.copy()

## Data Cleaning

### Missing Value Imputation

In [15]:
# imports
from feature_engine.imputation import MeanMedianImputer, CategoricalImputer
from sklearn.pipeline import Pipeline

In [16]:
# separate feature names
num_feature_names = X_train.select_dtypes(exclude='object').columns.tolist()
cat_feature_names = X_train.select_dtypes(include='object').columns.tolist()

In [17]:
# create imputers
num_imputer = MeanMedianImputer(imputation_method='median', variables=num_feature_names)
cat_imputer = CategoricalImputer(imputation_method='frequent', variables=cat_feature_names)

In [18]:
# imputer pipeline
imputer_pipe = Pipeline([
    ('num_imputer', num_imputer),
    ('cat_imputer', cat_imputer)
])

In [19]:
# fit imputer pipeline to train data
imputer_pipe.fit(X_train)

# transform train and test data
X_train = imputer_pipe.transform(X_train)
X_test = imputer_pipe.transform(X_test)

X_train.shape, X_test.shape

((1537, 74), (660, 74))
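
The key point of the fit/transform pattern above is that the fill values are learned from the training rows only and then reused on the test rows, which avoids leaking test information into preprocessing. The sketch below mimics that idea in plain pandas on toy data; it is an illustration of the concept, not the feature_engine implementation, and the values are made up.

```python
import numpy as np
import pandas as pd

# toy train/test frames; values are illustrative only
train = pd.DataFrame({
    'lot_frontage': [60.0, 80.0, np.nan, 100.0],
    'garage_type': ['Attchd', np.nan, 'Attchd', 'Detchd'],
})
test = pd.DataFrame({
    'lot_frontage': [np.nan, 70.0],
    'garage_type': [np.nan, 'Detchd'],
})

# "fit": learn the fill values from the training rows only
fill_values = {
    'lot_frontage': train['lot_frontage'].median(),
    'garage_type': train['garage_type'].mode().iloc[0],
}

# "transform": apply the same learned values to both splits
train_imputed = train.fillna(fill_values)
test_imputed = test.fillna(fill_values)

print(fill_values)
print(test_imputed)
```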

### Categorical Feature Encoding

In [20]:
# imports
from feature_engine.encoding import OrdinalEncoder, RareLabelEncoder

In [21]:
# encode categorical features
cat_encode_pipe = Pipeline([
    ('rare_label_enc', RareLabelEncoder(tol=0.05, n_categories=4, variables=cat_feature_names)),
    ('ordinal_enc', OrdinalEncoder(encoding_method='arbitrary'))
])

In [22]:
# fit categorical encoder to train data
cat_encode_pipe.fit(X_train)

# transform train and test data
X_train_enc = cat_encode_pipe.transform(X_train)
X_test_enc = cat_encode_pipe.transform(X_test)

X_train_enc.shape, X_test_enc.shape

((1537, 74), (660, 74))
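
The encoding pipeline does two things: it groups infrequent labels into a single bucket, then replaces each label with an integer. The sketch below reproduces that idea in plain pandas on a toy column. It is a simplified illustration, not the feature_engine code: the `tol` value, the labels, and the `group_rare` helper are all hypothetical, and the `n_categories` condition is left out.

```python
import pandas as pd

# toy categorical column; values are illustrative only
train_col = pd.Series(['RL'] * 15 + ['RM'] * 3 + ['FV'] * 1 + ['RH'] * 1)
test_col = pd.Series(['RL', 'FV', 'C (all)'])

tol = 0.1  # hypothetical cut-off, chosen so the toy data has rare labels

# learn the frequent labels from the training column only
freq = train_col.value_counts(normalize=True)
frequent_labels = freq[freq >= tol].index.tolist()

def group_rare(col):
    # labels outside the frequent set (including unseen ones) become 'Rare'
    return col.where(col.isin(frequent_labels), 'Rare')

train_grouped = group_rare(train_col)
test_grouped = group_rare(test_col)

# arbitrary ordinal encoding: integers assigned in order of first appearance
mapping = {label: i for i, label in enumerate(train_grouped.unique())}
train_encoded = train_grouped.map(mapping)
test_encoded = test_grouped.map(mapping)

print(mapping)
print(test_encoded.tolist())
```

Because the integers are arbitrary, they carry no ordering information; that is usually acceptable for tree-based models but can mislead linear models.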

## Feature Selection - Data Prep

### Remove Constants & Quasi-Constants

In [23]:
# imports
from feature_engine.selection import DropConstantFeatures

In [25]:
# create selector
constants_selector = DropConstantFeatures(tol=0.99, variables=None, missing_values='raise')

# fit selector
constants_selector.fit(X_train_enc)

# view constant features to drop
constants_selector.features_to_drop_

['street', 'utilities', 'condition_2', 'pool_area']

__Observation:__ *street*, *utilities*, *condition_2* and *pool_area* will be dropped.

In [26]:
# transform train and test data
X_train_qc = constants_selector.transform(X_train_enc)
X_test_qc = constants_selector.transform(X_test_enc)

X_train_qc.shape, X_test_qc.shape

((1537, 70), (660, 70))
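
A quasi-constant feature is one where a single value dominates almost every row. With `tol=0.99`, my understanding is that a feature is flagged when its most frequent value covers roughly 99% or more of the observations. The pandas sketch below shows that rule on toy data; the frame is made up, and the exact comparison used by feature_engine is an assumption here.

```python
import pandas as pd

# toy frame with 200 rows; values are illustrative only
toy = pd.DataFrame({
    'street': ['Pave'] * 199 + ['Grvl'],
    'lot_shape': ['Reg'] * 120 + ['IR1'] * 80,
    'pool_area': [0] * 200,
})

tol = 0.99

# share of rows taken by the single most frequent value in each column
top_share = toy.apply(lambda col: col.value_counts(normalize=True).iloc[0])

# columns whose dominant value covers at least `tol` of the rows
quasi_constant = top_share[top_share >= tol].index.tolist()
print(quasi_constant)
```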

### Remove Duplicated Features

In [27]:
# imports
from feature_engine.selection import DropDuplicateFeatures

In [28]:
# create duplicate features selector
dup_features = DropDuplicateFeatures(variables=None, missing_values='raise')

# fit duplicate features selector to train data
dup_features.fit(X_train_qc)

# view duplicated features
dup_features.features_to_drop_

[]

__Observation:__ There are no duplicated features to drop.
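
Duplicated features are columns whose values match another column in every row. A minimal pandas sketch of that check is shown below; the toy frame and the `garage_cars_copy` column are hypothetical, and transposing is only practical for small frames.

```python
import pandas as pd

# toy frame; the second column is an exact copy of the first
toy = pd.DataFrame({
    'garage_cars': [1, 2, 2, 0],
    'garage_cars_copy': [1, 2, 2, 0],
    'fireplaces': [0, 1, 2, 0],
})

# transpose so each column becomes a row, then flag rows that repeat an earlier one
is_duplicate = toy.T.duplicated(keep='first')
duplicated_features = is_duplicate[is_duplicate].index.tolist()
print(duplicated_features)
```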

### Remove Correlated Numeric Features

In [29]:
# import
from feature_engine.selection import SmartCorrelatedSelection
from sklearn.ensemble import RandomForestRegressor

In [30]:
# random forest model
rf = RandomForestRegressor(n_estimators=200, random_state=42, max_depth=4)

# correlation selector
corr_selector = SmartCorrelatedSelection(
    variables=X_train_qc.select_dtypes(exclude='object').columns.tolist(),
    method='pearson',
    threshold=0.8,
    missing_values='raise',
    selection_method='model_performance',
    estimator=rf,
    scoring='neg_mean_absolute_error',
    cv=5

)

# fit
corr_selector.fit(X_train_qc, y_train)

SmartCorrelatedSelection(cv=5,
                         estimator=RandomForestRegressor(max_depth=4,
                                                         n_estimators=200,
                                                         random_state=42),
                         missing_values='raise',
                         scoring='neg_mean_absolute_error',
                         selection_method='model_performance',
                         variables=['ms_subclass', 'ms_zoning', 'lot_frontage',
                                    'lot_area', 'lot_shape', 'land_contour',
                                    'lot_config', 'land_slope', 'neighborhood',
                                    'condition_1', 'bldg_type', 'house_style',
                                    'overall_qual', 'overall_cond',
                                    'year_built', 'year_remod/add',
                                    'roof_style', 'roof_matl', 'exterior_1st',
                                    'exterior_

In [31]:
# view correlated feature sets
corr_selector.correlated_feature_sets_

[{'2nd_flr_sf', 'house_style'},
 {'exterior_1st', 'exterior_2nd'},
 {'1st_flr_sf', 'total_bsmt_sf'},
 {'gr_liv_area', 'totrms_abvgrd'},
 {'garage_area', 'garage_cars'}]

In [32]:
# view dropped correlated features
corr_selector.features_to_drop_

['house_style', 'exterior_2nd', '1st_flr_sf', 'totrms_abvgrd', 'garage_area']

In [33]:
# apply the fitted correlation selector to train and test sets
X_train_corr = corr_selector.transform(X_train_qc)
X_test_corr = corr_selector.transform(X_test_qc)

X_train_corr.shape, X_test_corr.shape

((1537, 65), (660, 65))
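
The first step of the correlation selector is finding pairs of features whose absolute Pearson correlation exceeds the threshold; the selector then keeps one feature per group based on model performance. The sketch below covers only the pair-finding step, on synthetic data built so that two columns are strongly related. The column names are borrowed from the dataset for readability, but the values are randomly generated.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
gr_liv_area = rng.normal(1500, 400, n)
toy = pd.DataFrame({
    'gr_liv_area': gr_liv_area,
    # built to track gr_liv_area closely
    'totrms_abvgrd': gr_liv_area / 250 + rng.normal(0, 0.3, n),
    # generated independently of the other two
    'lot_area': rng.normal(10000, 3000, n),
})

threshold = 0.8
corr = toy.corr(method='pearson').abs()

# keep the upper triangle so each pair is listed once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
correlated_pairs = pairs[pairs > threshold].index.tolist()
print(correlated_pairs)
```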

## Feature Importance - Embedded Methods

### Tree Importance

In [34]:
# import
from sklearn.feature_selection import SelectFromModel, SelectKBest

In [35]:
# create selector
ti_selector = SelectFromModel(RandomForestRegressor(n_estimators=100, max_depth=4, random_state=42))

# fit selector
ti_selector.fit(X_train_corr, y_train)

# view selected features
ti_selected_features = X_train_corr.columns[(ti_selector.get_support())]
ti_selected_features

Index(['overall_qual', 'year_built', 'total_bsmt_sf', 'gr_liv_area',
       'screen_porch'],
      dtype='object')

In [36]:
# transform train and test data
X_train_ti = X_train_corr[ti_selected_features.tolist()]
X_test_ti = X_test_corr[ti_selected_features.tolist()]

X_train_ti.shape, X_test_ti.shape

((1537, 5), (660, 5))
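
Only 5 of 65 features survive here because of the default cut-off. When no `threshold` is passed, `SelectFromModel` keeps the features whose importance is at least the mean importance for estimators like random forests; since forest importances sum to 1, that mean is about 1/65 for this data. The sketch below checks this rule on synthetic data (generated with `make_regression`, not the Ames data).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

# synthetic data: 10 features, 3 of which drive the target
X, y = make_regression(n_samples=200, n_features=10, n_informative=3, random_state=0)

selector = SelectFromModel(RandomForestRegressor(n_estimators=50, max_depth=4, random_state=42))
selector.fit(X, y)

importances = selector.estimator_.feature_importances_

# with no explicit threshold, the cut-off is the mean importance
manual_mask = importances >= importances.mean()

print(selector.threshold_, importances.mean())
print(np.array_equal(selector.get_support(), manual_mask))
```

Passing an explicit `threshold` (for example `'median'` or a float) or `max_features` would be one way to keep more or fewer features.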

### Recursive Feature Elimination

In [37]:
# imports
from sklearn.feature_selection import RFE

In [38]:
# create selector
rs_selector = RFE(RandomForestRegressor(n_estimators=100, max_depth=4, random_state=42))

# fit selector
rs_selector.fit(X_train_corr, y_train)

# view selected features
rs_selected_features = X_train_corr.columns[(rs_selector.get_support())]
rs_selected_features

Index(['ms_zoning', 'lot_frontage', 'lot_area', 'land_slope', 'neighborhood',
       'overall_qual', 'year_built', 'year_remod/add', 'roof_matl',
       'exterior_1st', 'mas_vnr_area', 'exter_qual', 'bsmt_qual',
       'bsmt_exposure', 'bsmtfin_sf_1', 'bsmt_unf_sf', 'total_bsmt_sf',
       'central_air', '2nd_flr_sf', 'gr_liv_area', 'bsmt_full_bath',
       'full_bath', 'bedroom_abvgr', 'kitchen_qual', 'fireplaces',
       'garage_type', 'garage_finish', 'garage_cars', 'wood_deck_sf',
       'open_porch_sf', 'screen_porch', 'mo_sold'],
      dtype='object')

In [39]:
# get train and test sets
X_train_rfe = X_train_corr[rs_selected_features.tolist()]
X_test_rfe = X_test_corr[rs_selected_features.tolist()]

X_train_rfe.shape, X_test_rfe.shape

((1537, 32), (660, 32))
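
The count of 32 is a consequence of the `RFE` default: when `n_features_to_select` is not set, half of the input features are kept, and 65 // 2 = 32. The sketch below confirms this on synthetic data with the same width. `LinearRegression` stands in for the random forest only to keep the example fast; it is not what the cell above uses.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# synthetic data with 65 features, matching the width of X_train_corr
X, y = make_regression(n_samples=200, n_features=65, n_informative=10, random_state=0)

# n_features_to_select is left at its default
rfe = RFE(LinearRegression())
rfe.fit(X, y)

n_kept = int(rfe.get_support().sum())
print(rfe.n_features_, n_kept, 65 // 2)
```

Setting `n_features_to_select` explicitly, or using `RFECV` to choose the count by cross-validation, are possible alternatives if half the features is not the desired target.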

### Lasso Feature Selection

In [40]:
# imports
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.preprocessing import StandardScaler

In [41]:
# standard scaler
sc = StandardScaler()
sc.fit(X_train_corr)

StandardScaler()

In [42]:
# get selector
lasso_selector = SelectFromModel(Lasso(alpha=100))

# fit selector
lasso_selector.fit(sc.transform(X_train_corr), y_train)

# view selected features
lasso_selected_features = X_train_corr.columns[(lasso_selector.get_support())]
lasso_selected_features

Index(['ms_subclass', 'ms_zoning', 'lot_frontage', 'lot_area', 'lot_shape',
       'land_contour', 'lot_config', 'land_slope', 'neighborhood',
       'condition_1', 'bldg_type', 'overall_qual', 'overall_cond',
       'year_built', 'year_remod/add', 'roof_style', 'roof_matl',
       'exterior_1st', 'mas_vnr_type', 'mas_vnr_area', 'exter_qual',
       'exter_cond', 'foundation', 'bsmt_qual', 'bsmt_exposure',
       'bsmtfin_type_1', 'bsmtfin_sf_1', 'bsmtfin_type_2', 'bsmtfin_sf_2',
       'total_bsmt_sf', 'heating_qc', 'central_air', '2nd_flr_sf',
       'low_qual_fin_sf', 'gr_liv_area', 'bsmt_full_bath', 'bsmt_half_bath',
       'full_bath', 'half_bath', 'bedroom_abvgr', 'kitchen_abvgr',
       'kitchen_qual', 'functional', 'fireplaces', 'garage_type',
       'garage_yr_blt', 'garage_finish', 'garage_cars', 'garage_qual',
       'garage_cond', 'paved_drive', 'wood_deck_sf', 'open_porch_sf',
       'enclosed_porch', '3ssn_porch', 'screen_porch', 'misc_val', 'mo_sold',
       'yr_sold', '

In [43]:
# get train and test sets
X_train_lasso = X_train_corr[lasso_selected_features.tolist()]
X_test_lasso = X_test_corr[lasso_selected_features.tolist()]

X_train_lasso.shape, X_test_lasso.shape

((1537, 61), (660, 61))
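
Lasso selects features by shrinking some coefficients to zero, and `SelectFromModel` keeps the features whose absolute coefficient stays above a very small cut-off (1e-5 for Lasso, as far as I know). How many features survive depends on `alpha` relative to the scale of the target, which may be why `alpha=100` removes only 4 of 65 features when *saleprice* is measured in dollars. The sketch below varies `alpha` on synthetic data; the alpha values and the data are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# synthetic data: 20 features, 5 informative
X, y = make_regression(n_samples=300, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

kept_counts = {}
for alpha in (0.1, 5.0, 50.0):
    selector = SelectFromModel(Lasso(alpha=alpha)).fit(X_scaled, y)
    coef = selector.estimator_.coef_
    # features survive when their coefficient is not shrunk to (near) zero
    n_selected = int(selector.get_support().sum())
    n_nonzero = int((np.abs(coef) >= 1e-5).sum())
    kept_counts[alpha] = (n_selected, n_nonzero)

print(kept_counts)
```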

## Comparing Model Performance

In [44]:
# import
from sklearn.model_selection import cross_val_score

In [45]:
# build a random forest and report 5-fold cross-validated RMSE on the train and test splits
def get_random_forest_comparison(x_train, y_train, x_test, y_test, method):
    rf = RandomForestRegressor(n_estimators=200, random_state=42, max_depth=4)

    # cross_val_score clones and refits rf within each fold, so the two scores
    # below come from independent cross-validations, one per split
    train_score = cross_val_score(rf, x_train, y_train, cv=5, scoring='neg_mean_squared_error')
    train_score = np.sqrt(abs(train_score.mean()))

    test_score = cross_val_score(rf, x_test, y_test, cv=5, scoring='neg_mean_squared_error')
    test_score = np.sqrt(abs(test_score.mean()))

    print('Feature Selection Method: ', method)
    print('------------------------------------------------')
    print('Train Set RMSE: ', train_score)
    print('Test Set: RMSE', test_score)
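
The scores returned with `scoring='neg_mean_squared_error'` are negative because scikit-learn scorers follow a "higher is better" convention, which is why the function takes the absolute value before the square root. There are two common ways to turn per-fold scores into one RMSE, and they give slightly different numbers. The sketch below compares them using hypothetical fold scores (not values from this notebook).

```python
import numpy as np

# hypothetical per-fold scores as returned by scoring='neg_mean_squared_error'
neg_mse_scores = np.array([-9.0e8, -1.0e9, -1.21e9, -1.0e9, -8.1e8])

# approach used in the function: average the MSE, then take the root
rmse_of_mean = np.sqrt(abs(neg_mse_scores.mean()))

# alternative: take the root per fold, then average
mean_of_rmse = np.sqrt(-neg_mse_scores).mean()

print(rmse_of_mean, mean_of_rmse)
```

The second value can never exceed the first, so the two should not be mixed when comparing methods.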

In [46]:
# correlated features selection
get_random_forest_comparison(X_train_corr, y_train, X_test_corr, y_test, method='Correlated Feature Selection')

Feature Selection Method:  Correlated Feature Selection
------------------------------------------------
Train Set RMSE:  31877.45755874098
Test Set: RMSE 33385.724257097805


In [47]:
# tree importance
get_random_forest_comparison(X_train_ti, y_train, X_test_ti.fillna(0), y_test, method='Tree Importance')

Feature Selection Method:  Tree Importance
------------------------------------------------
Train Set RMSE:  31832.601335170697
Test Set: RMSE 33387.93711807962


In [48]:
# recursive feature selection
get_random_forest_comparison(X_train_rfe, y_train, X_test_rfe.fillna(0), y_test, method='Recursive Feature Selection')

Feature Selection Method:  Recursive Feature Selection
------------------------------------------------
Train Set RMSE:  31795.67586603215
Test Set: RMSE 33313.12216390058


In [49]:
# lasso
get_random_forest_comparison(X_train_lasso, y_train, X_test_lasso.fillna(0), y_test, method='Lasso Feature Selection')

Feature Selection Method:  Lasso Feature Selection
------------------------------------------------
Train Set RMSE:  31858.423447196503
Test Set: RMSE 33221.59665143214


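__Observation:__ All four feature sets perform almost identically: test RMSE ranges from 33,221.60 (Lasso) to 33,387.94 (tree importance), a spread of under 1%. Lasso and RFE have the two lowest test RMSE values, and RFE also has the lowest train RMSE (31,795.68), so either method is a reasonable choice. For the purpose of this project, we'll use the features derived from the RFE method, which keeps 32 features versus 61 for Lasso.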

## Save Training & Test Feature Lists

In [50]:
# view the features selected by recursive feature elimination
X_train_rfe.columns

Index(['ms_zoning', 'lot_frontage', 'lot_area', 'land_slope', 'neighborhood',
       'overall_qual', 'year_built', 'year_remod/add', 'roof_matl',
       'exterior_1st', 'mas_vnr_area', 'exter_qual', 'bsmt_qual',
       'bsmt_exposure', 'bsmtfin_sf_1', 'bsmt_unf_sf', 'total_bsmt_sf',
       'central_air', '2nd_flr_sf', 'gr_liv_area', 'bsmt_full_bath',
       'full_bath', 'bedroom_abvgr', 'kitchen_qual', 'fireplaces',
       'garage_type', 'garage_finish', 'garage_cars', 'wood_deck_sf',
       'open_porch_sf', 'screen_porch', 'mo_sold'],
      dtype='object')
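
The cell above displays the selected columns but does not write them anywhere. One possible way to persist the list for later notebooks is a small JSON file, sketched below. The file name is hypothetical, only the first five feature names are used to keep the example short, and a temporary directory stands in for the project folder.

```python
import json
import tempfile
from pathlib import Path

# first few RFE-selected feature names from the listing above
selected_features = ['ms_zoning', 'lot_frontage', 'lot_area', 'land_slope', 'neighborhood']

# hypothetical output location; a temp directory keeps the sketch side-effect free
out_path = Path(tempfile.mkdtemp()) / 'rfe_selected_features.json'

# write the list to disk
out_path.write_text(json.dumps(selected_features, indent=2))

# read it back, e.g. in a later modelling notebook
loaded_features = json.loads(out_path.read_text())
print(loaded_features == selected_features)
```

In the notebook itself, `X_train_rfe.columns.tolist()` would supply the full list.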