In [1]:
import numpy as np
import pandas as pd

In [5]:
df_numeric = pd.read_csv("data/df_numeric.csv")

In [6]:
df_numeric.head(2)

Unnamed: 0,MSSubClass,LotFrontage,LotArea,Street,LotShape,Utilities,LandSlope,OverallQual,OverallCond,YearBuilt,...,MoSold,YrSold,SalePrice,GarageYrBlt_missing_ind,LotFrontage_missing_ind,MasVnrArea_missing_ind,1stFlrSF_log,1stFlr_2ndFlr_SF,OverallGrade,SimplGarageQual
0,60,65.0,8450,2,4,4,3,7,5,2003,...,2,2008,208500,0,0,0,6.75227,1710,35,1
1,20,80.0,9600,2,4,4,3,6,8,1976,...,5,2007,181500,0,0,0,7.140453,1262,48,1


We are going to build filter-based feature selection using the following criteria:

* Remove features with small variance
* Remove one feature from each pair whose absolute correlation exceeds a chosen threshold x

### 1. Extract Target Variable
Before doing any transformations, we extract the target variable so that it stays unchanged. The target can be transformed too, but it is good practice to do that separately from the features:

In [7]:
y = df_numeric.SalePrice
df_numeric.drop("SalePrice",axis=1, inplace=True)
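
The same split can also be written without modifying the original table in place. Below is a minimal sketch on a tiny made-up frame; the `toy` DataFrame, its values, and the names `X_toy` and `y_toy` are illustrative assumptions, not part of the housing data:

```python
import pandas as pd

# Hypothetical miniature stand-in for df_numeric
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "OverallQual": [7, 6, 7],
    "SalePrice": [208500, 181500, 223500],
})

# Keep the target separately and build the feature table without it
y_toy = toy["SalePrice"]
X_toy = toy.drop(columns="SalePrice")  # returns a new frame, toy is unchanged

print(list(X_toy.columns))
print(y_toy.tolist())
```

Leaving the source frame untouched makes it safe to re-run the cell, whereas `inplace=True` raises a KeyError on the second run because the column is already gone.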

## Part 1: Removing Features With Small Variance
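First, we remove the columns with very little variance. A feature whose values are nearly the same for all houses carries little information for telling houses apart, so it usually has little predictive power. Note that variance depends on the scale of a feature, so a fixed threshold such as 0.1 is most meaningful when the features are on comparable scales.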

In [8]:
from sklearn.feature_selection import VarianceThreshold

vt = VarianceThreshold(0.1)
df_transformed = vt.fit_transform(df_numeric)
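
To see what the threshold does on a small scale, here is a sketch on made-up data (the `toy` frame and its column names are assumptions for illustration only). It also shows that variance depends on the unit of measurement: the same area expressed in square kilometers falls below the threshold.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy data: ten houses
toy = pd.DataFrame({
    # nine identical values and one different: variance = 0.1 * 0.9 = 0.09
    "almost_constant": [1, 1, 1, 1, 1, 1, 1, 1, 1, 2],
    # living area in square meters: large variance
    "area_m2": [50, 60, 70, 80, 90, 100, 110, 120, 130, 140],
})
# The same area expressed in square kilometers: tiny variance
toy["area_km2"] = toy["area_m2"] / 1_000_000

vt_toy = VarianceThreshold(threshold=0.1)
vt_toy.fit(toy)

# Variance computed for each column, then the columns that pass the threshold
print(dict(zip(toy.columns, vt_toy.variances_)))
kept = list(toy.columns[vt_toy.get_support()])
print(kept)
```

Only `area_m2` survives here, even though `area_km2` carries exactly the same information, which is why it can be worth checking feature scales before picking a threshold.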

In [9]:
# Check the number of variables in the table and find out how many features we have deleted.
df_numeric.shape

(1458, 59)

In [10]:
df_numeric.head()

Unnamed: 0,MSSubClass,LotFrontage,LotArea,Street,LotShape,Utilities,LandSlope,OverallQual,OverallCond,YearBuilt,...,MiscVal,MoSold,YrSold,GarageYrBlt_missing_ind,LotFrontage_missing_ind,MasVnrArea_missing_ind,1stFlrSF_log,1stFlr_2ndFlr_SF,OverallGrade,SimplGarageQual
0,60,65.0,8450,2,4,4,3,7,5,2003,...,0,2,2008,0,0,0,6.75227,1710,35,1
1,20,80.0,9600,2,4,4,3,6,8,1976,...,0,5,2007,0,0,0,7.140453,1262,48,1
2,60,68.0,11250,2,3,4,3,7,5,2001,...,0,9,2008,0,0,0,6.824374,1786,35,1
3,70,60.0,9550,2,3,4,3,7,5,1915,...,0,2,2006,0,0,0,6.867974,1717,35,1
4,60,84.0,14260,2,3,4,3,8,5,2000,...,0,12,2008,0,0,0,7.04316,2198,40,1


In [11]:
df_transformed.shape

(1458, 50)

In [12]:
df_transformed.head()

AttributeError: 'numpy.ndarray' object has no attribute 'head'

#### Note!
fit_transform() in sklearn returns a numpy array even when the input is a DataFrame, so the column names are lost and we need a small trick to get them back!

We don't need column names for modeling, but they help with interpreting the modeling results.

In [13]:
# columns we have selected
# get_support() is a method of VarianceThreshold; it returns a boolean mask
# with one entry per input column (True = column was kept).
selected_columns = df_numeric.columns[vt.get_support()]

# rebuild a DataFrame from the array and restore the column labels
df_transformed = pd.DataFrame(df_transformed, columns=selected_columns)

## Part 2: Removing Correlated Features

The goal of this part is to remove one feature from each highly correlated pair.

We are going to do this in 3 steps:

1. Calculate a correlation matrix
2. Get pairs of highly correlated features
3. Remove correlated columns

In [14]:
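# step 1: absolute correlation matrix
df_corr = df_transformed.corr().abs()

# step 2: (row, column) positions with correlation above 0.8;
# x < y keeps each pair only once and skips the diagonal
indices = np.where(df_corr > 0.8)
indices = [(df_corr.index[x], df_corr.columns[y])
           for x, y in zip(*indices)
           if x < y]

# step 3: drop the second feature of each pair
for idx in indices:  # each pair
    try:
        df_transformed.drop(idx[1], axis=1, inplace=True)
    except KeyError:
        # the column was already dropped as part of an earlier pair
        pass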

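The code above drops the second column of each pair whose absolute correlation is greater than 0.8. A column can be the second element of more than one pair (for example, 1stFlr_2ndFlr_SF in the list printed below), so the second attempt to drop it raises a KeyError; the try-except block lets the loop continue in that case.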
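
The same rule can be expressed without a try-except block by looking only at the upper triangle of the correlation matrix and collecting the columns to drop first. This is a sketch on made-up data (the `toy` frame is an assumption for illustration); it drops the same columns as the loop above, namely every column that correlates above 0.8 with some earlier column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2.1, 3.9, 6.2, 7.8, 10.1, 11.9],  # roughly 2 * a
    "c": [5, 1, 4, 2, 6, 3],                 # unrelated to a and b
})

corr = toy.corr().abs()

# Keep only the part above the diagonal so each pair appears once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

# A column is dropped if it correlates above 0.8 with any earlier column
to_drop = [col for col in upper.columns if (upper[col] > 0.8).any()]
reduced = toy.drop(columns=to_drop)

print(to_drop)
print(list(reduced.columns))
```

Because each column name appears in `to_drop` at most once, a single `drop` call is enough and no KeyError can occur.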

In [16]:
# We can check the correlated columns by printing the indices:
print(list(indices))

[('TotalBsmtSF', '1stFlrSF'), ('GrLivArea', 'TotRmsAbvGrd'), ('GrLivArea', '1stFlr_2ndFlr_SF'), ('TotRmsAbvGrd', '1stFlr_2ndFlr_SF'), ('GarageCars', 'GarageArea'), ('GarageQual', 'GarageCond')]


In [17]:
# Check the number of variables in the table and find out how many features we have deleted.
df_transformed.shape

(1458, 45)

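## Part 3: Selecting the K Best Features
So far we have removed the low-variance features and the highly correlated ones. The last thing we will do before modeling is to select the k best features in terms of their relationship with the target variable. We will use SelectKBest for that. Note that this is a univariate filter method: each feature is scored against the target on its own, unlike a forward wrapper method, which adds features one at a time based on the performance of a model: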

In [18]:
from sklearn.feature_selection import f_regression, SelectKBest

skb = SelectKBest(f_regression, k=10)
X = skb.fit_transform(df_transformed, y)

We need to import SelectKBest and decide which scoring function to use for the actual selection. Since the target is continuous, we also imported f_regression, which scores each feature with a univariate F-test of its linear relationship with the target. We could use a different scoring function, such as f_classif or chi2, if the target variable were categorical.
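
To make the scoring visible, the sketch below fits SelectKBest on synthetic data and prints the F-score of each feature. The data, the seed, and the names `toy_X`, `toy_y`, and `skb_toy` are assumptions made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Hypothetical synthetic data: the target depends on x1 and x2 only
rng = np.random.default_rng(0)
n = 200
toy_X = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "x3": rng.normal(size=n),  # pure noise
})
toy_y = 3 * toy_X["x1"] + 2 * toy_X["x2"] + rng.normal(scale=0.1, size=n)

skb_toy = SelectKBest(f_regression, k=2)
skb_toy.fit(toy_X, toy_y)

# One F-score per feature, each computed independently of the others
scores = pd.Series(skb_toy.scores_, index=toy_X.columns)
print(scores.sort_values(ascending=False))

selected = list(toy_X.columns[skb_toy.get_support()])
print(selected)
```

The noise feature `x3` receives a much lower score than `x1` and `x2`, so it is the one left out when k=2.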

#### We have assigned our target variable SalePrice into y in the beginning of this tutorial

#### The type of X was again changed to array

Convert X back to a DataFrame and restore the correct column names.

HINT: Use the method get_support() from the SelectKBest instance to find the features that were selected.

In [21]:
# Position of the top 10 columns (boolean mask)
skb.get_support()

# column names
df_transformed.columns[skb.get_support()]

X = pd.DataFrame(X, columns=df_transformed.columns[skb.get_support()])


Now X consists of the 10 features with the strongest univariate linear relationship to our target variable, SalePrice, which should make them reasonably good predictors.
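
For comparison, a forward wrapper selection works differently from SelectKBest: it adds one feature at a time, keeping the one that most improves a model's cross-validated score. A minimal sketch, assuming scikit-learn 0.24 or later (which provides SequentialFeatureSelector) and using made-up synthetic data, so it is not the method applied above:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Hypothetical synthetic data: the target depends on x1 and x2 only
rng = np.random.default_rng(0)
n = 200
toy_X = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "x3": rng.normal(size=n),  # pure noise
})
toy_y = 3 * toy_X["x1"] + 2 * toy_X["x2"] + rng.normal(scale=0.1, size=n)

# Forward selection: start from no features, add the best one at each step
sfs = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=2,
    direction="forward",
)
sfs.fit(toy_X, toy_y)

selected_forward = list(toy_X.columns[sfs.get_support()])
print(selected_forward)
```

A wrapper like this fits many models and is therefore slower than a univariate filter, but it can account for how features work together in the model.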