## House Prices dataset: Feature Selection

In the following cells, we will select the most predictive variables and use them to build our machine learning model.

### Why do we select variables?

- For production: Fewer variables mean smaller client input requirements (e.g. customers filling out a form on a website or mobile app), and hence less code for error handling. This reduces the chances of introducing bugs.

- For model performance: Fewer variables mean simpler, more interpretable models that tend to generalize better.


**We will select variables using Lasso regression: its L1 penalty can shrink the coefficients of non-informative variables to exactly zero. This way we can identify those variables and remove them from our final model.**
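
As a minimal sketch of this property, the snippet below fits a Lasso on synthetic data in which only the first two of five columns influence the target. The data, the coefficient values and `alpha=0.1` are assumptions chosen for illustration; they are not taken from the House Prices dataset.

```python
import numpy as np
from sklearn.linear_model import Lasso

# synthetic data: 200 rows, 5 features, only the first two are informative
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# alpha is the strength of the L1 penalty (illustrative value)
lasso = Lasso(alpha=0.1, random_state=0)
lasso.fit(X, y)

# features whose coefficient was not shrunk to zero
informative = lasso.coef_ != 0
print(lasso.coef_)
print(informative)
```

The coefficients of the three noise columns come out as exactly zero, while the two informative ones are only shrunk slightly towards zero.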


### Setting the seed

It is important to note that we are engineering variables and pre-processing data with a view to deploying the model. Therefore, from now on, for each step that includes some element of randomness, it is extremely important that we **set the seed**. This way, we can obtain reproducibility between our research and our development code.

This is perhaps one of the most important lessons that you need to take away from this course: **Always set the seeds**.
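
A small sketch of what setting the seed buys us, using a toy array and `train_test_split` as an example of a step with randomness (both are assumptions for illustration and are not part of this notebook's pipeline):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy data, for illustration only
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# two independent calls with the same seed produce the same split
a_train, a_test, _, _ = train_test_split(X, y, test_size=0.3, random_state=0)
b_train, b_test, _, _ = train_test_split(X, y, test_size=0.3, random_state=0)
same_split = np.array_equal(a_train, b_train) and np.array_equal(a_test, b_test)

# the same holds for NumPy's own random generators
draws_a = np.random.default_rng(0).normal(size=3)
draws_b = np.random.default_rng(0).normal(size=3)
same_draws = np.array_equal(draws_a, draws_b)

print(same_split, same_draws)
```

Without a fixed `random_state`, the two splits would generally differ from run to run, which makes research results hard to reproduce in deployment code.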

Let's go ahead and load the dataset.

In [6]:
# to handle datasets
import pandas as pd
import numpy as np

# for plotting
import matplotlib.pyplot as plt

# to build the models
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel

# to visualise all the columns in the dataframe
pd.set_option('display.max_columns', None)

# loading data
import os

def load_data(path):
    """Interactively browse `path` and return the full path of the chosen .csv file."""
    current = path
    while True:
        # list the entries of the current directory with their index
        with os.scandir(current) as entries:
            files = [entry.name for entry in entries]
        for count, name in enumerate(files):
            print(f"{count}) {name}")

        choice = os.path.join(current, files[int(input('Enter the file index: '))])
        if os.path.isdir(choice):
            # descend into the chosen directory and list it on the next pass
            current = choice
            print("\nThis is a directory\n")
        elif choice.endswith('.csv'):
            print(choice)
            return choice
        else:
            print("\nNot a .csv file, choose again\n")

path = r'E:\Documents\Data\Datasets'

In [8]:
# first call: pick the training set (xtrain.csv)
data1 = load_data(path)
# second call: pick the test set (xtest.csv)
data2 = load_data(path)

0) ArXiv_old.csv
1) mbti_1.csv
2) news.csv
3) US_Accidents_Dec19.csv
4) cannabis.csv
5) r_dataisbeautiful_posts.csv
6) deepnlp
7) fake-and-real-news-dataset
8) game-of-thrones-srt
9) books.csv
10) graduate-admissions
11) Islander_data.csv
12) mushrooms.csv
13) netflix_titles.csv
14) news-headlines-dataset-for-sarcasm-detection
15) AB_NYC_2019.csv
16) winequality-red.csv
17) StudentsPerformance.csv
18) ted-talks
19) young-people-survey
20) diamonds.csv
21) fake_job_postings.csv
22) developer_survey_2019
23) ETH_1h.csv.csv
24) Video 2020-04-17_2020-05-15 iCburks
25) Video 2013-04-06_2020-05-15 Gabe Flomo
26) Yelp Data
27) Udemy
28) NLP Data
29) FATAL ENCOUNTERS DOT ORG SPREADSHEET (See Read me tab) - Form Responses.csv
30) data-police-shootings-master
Enter the file index: 27

This is a directory

0) House Prices
Enter the file index: 0

This is a directory

0) data_description.txt
1) test.csv
2) train.csv
3) xtrain.csv
4) xtest.csv
Enter the file index: 3
E:\Documents\Data\Datasets\Udem

In [9]:
x_train = pd.read_csv(data1)
x_test = pd.read_csv(data2)

In [10]:
x_train

Unnamed: 0,Id,SalePrice,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,LotFrontage_na,MasVnrArea_na,GarageYrBlt_na
0,931,12.211060,0.000000,0.75,0.461171,0.377048,1.0,1.0,0.333333,1.000000,1.0,0.0,0.0,0.875000,0.375,0.5,0.75,0.571429,0.777778,0.50,0.014706,0.049180,0.2,0.285714,0.857143,0.933333,0.25,0.000000,0.666667,0.75,1.0,0.75,0.75,0.75,1.000000,0.002835,0.666667,0.0,0.673479,0.239935,1.0,1.00,1.0,1.0,0.559760,0.000000,0.0,0.523250,0.000000,0.0,0.666667,0.0,0.375,0.333333,0.666667,0.416667,1.0,0.000000,0.2,0.833333,0.018692,1.000000,0.75,0.430183,0.6,0.8,1.0,0.116686,0.032907,0.0,0.000000,0.000,0.0,0.0,0.75,0.5,0.0,0.545455,0.75,0.5,0.8,0.0,0.0,0.0
1,657,11.887931,0.000000,0.75,0.456066,0.399443,1.0,1.0,0.333333,0.333333,1.0,0.0,0.0,0.416667,0.375,0.5,0.75,0.571429,0.444444,0.75,0.360294,0.049180,0.2,0.285714,0.571429,0.600000,0.50,0.033750,0.666667,0.75,0.4,0.50,0.75,0.25,0.666667,0.142807,0.666667,0.0,0.114724,0.172340,1.0,1.00,1.0,1.0,0.434539,0.000000,0.0,0.406196,0.333333,0.0,0.333333,0.5,0.375,0.333333,0.666667,0.250000,1.0,0.000000,0.2,0.833333,0.457944,0.666667,0.25,0.220028,0.6,0.8,1.0,0.000000,0.000000,0.0,0.000000,0.000,0.0,0.0,0.50,0.5,0.0,0.636364,0.50,0.5,0.8,0.0,0.0,0.0
2,46,12.675764,0.588235,0.75,0.394699,0.347082,1.0,1.0,0.000000,0.333333,1.0,0.0,0.0,0.958333,0.375,0.5,1.00,0.571429,0.888889,0.50,0.036765,0.098361,0.6,0.285714,0.428571,0.400000,0.50,0.257500,1.000000,0.75,1.0,1.00,0.75,0.25,1.000000,0.080794,0.666667,0.0,0.601951,0.286743,1.0,1.00,1.0,1.0,0.627205,0.000000,0.0,0.586296,0.333333,0.0,0.666667,0.0,0.250,0.333333,1.000000,0.333333,1.0,0.333333,0.8,0.833333,0.046729,0.666667,0.50,0.406206,0.6,0.8,1.0,0.228705,0.149909,0.0,0.000000,0.000,0.0,0.0,0.75,0.5,0.0,0.090909,1.00,0.5,0.8,0.0,0.0,0.0
3,1349,12.278393,0.000000,0.75,0.388581,0.493677,1.0,1.0,0.666667,0.666667,1.0,0.0,0.0,0.500000,0.375,0.5,0.75,0.571429,0.666667,0.50,0.066176,0.163934,0.2,0.285714,0.857143,0.933333,0.25,0.000000,0.666667,0.75,1.0,0.75,0.75,1.00,1.000000,0.255670,0.666667,0.0,0.018114,0.242553,1.0,1.00,1.0,1.0,0.566920,0.000000,0.0,0.529943,0.333333,0.0,0.666667,0.0,0.375,0.333333,0.666667,0.250000,1.0,0.333333,0.4,0.833333,0.084112,0.666667,0.50,0.362482,0.6,0.8,1.0,0.469078,0.045704,0.0,0.000000,0.000,0.0,0.0,0.75,0.5,0.0,0.636364,0.25,0.5,0.8,1.0,0.0,0.0
4,56,12.103486,0.000000,0.75,0.577658,0.402702,1.0,1.0,0.333333,0.333333,1.0,0.0,0.0,0.416667,0.375,0.5,0.75,0.571429,0.555556,0.50,0.323529,0.737705,0.2,0.285714,0.571429,0.666667,0.50,0.170000,0.333333,0.75,0.4,0.50,0.75,0.25,0.333333,0.086818,0.666667,0.0,0.434278,0.233224,1.0,0.75,1.0,1.0,0.549026,0.000000,0.0,0.513216,0.000000,0.0,0.666667,0.0,0.375,0.333333,0.333333,0.416667,1.0,0.333333,0.8,0.833333,0.411215,0.666667,0.50,0.406206,0.6,0.8,1.0,0.000000,0.000000,0.0,0.801181,0.000,0.0,0.0,0.75,0.5,0.0,0.545455,0.50,0.5,0.8,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1309,764,12.727838,0.235294,0.75,0.504203,0.387820,1.0,1.0,0.000000,0.333333,1.0,0.0,0.0,1.000000,0.375,0.5,0.75,0.857143,0.777778,0.50,0.073529,0.180328,0.2,0.285714,0.857143,0.933333,0.50,0.420625,0.666667,0.75,1.0,0.75,0.75,0.50,1.000000,0.206060,0.666667,0.0,0.041338,0.204910,1.0,1.00,1.0,1.0,0.504851,0.586004,0.0,0.692428,0.333333,0.0,0.666667,0.5,0.375,0.333333,0.666667,0.500000,1.0,0.333333,0.8,0.833333,0.093458,0.666667,0.75,0.603667,0.6,0.8,1.0,0.000000,0.234004,0.0,0.000000,0.375,0.0,0.0,0.75,0.5,0.0,0.545455,0.75,0.5,0.8,0.0,0.0,0.0
1310,836,11.759786,0.000000,0.75,0.388581,0.391317,1.0,1.0,0.000000,0.333333,1.0,0.0,0.0,0.250000,0.375,0.5,0.75,0.571429,0.333333,0.75,0.441176,0.262295,0.2,0.285714,0.857143,0.600000,0.25,0.000000,0.333333,0.75,0.4,0.75,0.75,0.25,0.333333,0.078313,0.666667,0.0,0.290293,0.174632,1.0,0.50,1.0,1.0,0.439537,0.000000,0.0,0.410869,0.000000,0.0,0.666667,0.0,0.250,0.333333,0.666667,0.166667,0.5,0.000000,0.2,0.833333,0.130841,0.333333,0.50,0.307475,0.6,0.8,1.0,0.338390,0.000000,0.0,0.000000,0.000,0.0,0.0,0.75,0.5,0.0,0.090909,1.00,0.5,0.8,0.0,0.0,0.0
1311,1217,11.626254,0.411765,0.25,0.434909,0.377157,1.0,1.0,0.000000,0.333333,1.0,0.0,0.0,0.250000,0.125,0.5,0.25,0.285714,0.555556,0.50,0.235294,0.540984,0.2,0.285714,0.857143,0.933333,0.25,0.000000,0.333333,0.75,0.0,0.00,0.25,0.00,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,1.0,0.50,1.0,1.0,0.519487,0.311966,0.0,0.615356,0.000000,0.0,0.666667,0.0,0.500,0.666667,0.333333,0.500000,1.0,0.000000,0.2,0.833333,0.299065,0.333333,0.50,0.380113,0.6,0.8,1.0,0.000000,0.000000,0.0,0.000000,0.000,0.0,0.0,0.75,0.5,0.0,0.272727,1.00,0.5,0.8,0.0,0.0,0.0
1312,560,12.363076,0.588235,0.75,0.388581,0.176055,1.0,1.0,0.000000,0.333333,1.0,0.0,0.0,0.625000,0.375,0.5,1.00,0.571429,0.666667,0.50,0.022059,0.049180,0.2,0.285714,0.857143,0.933333,0.50,0.011250,0.666667,0.75,1.0,0.75,0.75,1.00,0.833333,0.000000,0.666667,0.0,0.638179,0.224877,1.0,1.00,1.0,1.0,0.582551,0.000000,0.0,0.544554,0.000000,0.0,0.666667,0.0,0.250,0.333333,0.666667,0.416667,1.0,0.333333,0.6,0.833333,0.028037,1.000000,0.50,0.296192,0.6,0.8,1.0,0.166861,0.036563,0.0,0.000000,0.000,0.0,0.0,0.75,0.5,0.0,0.818182,0.00,0.5,0.8,1.0,0.0,0.0


In [11]:
# get the target (remember the target is log transformed)
y_train = x_train['SalePrice']
y_test = x_test['SalePrice']

# drop unnecessary columns from our datasets
x_train.drop(['Id','SalePrice'], axis = 1, inplace = True)
x_test.drop(['Id','SalePrice'], axis = 1, inplace = True)

### Feature Selection

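Let's go ahead and select a subset of the most predictive features. With scikit-learn's default `selection='cyclic'`, the Lasso fit is deterministic; `random_state` only has an effect when `selection='random'`. We set the seed anyway, so that the results stay reproducible if that setting is ever changed.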

* [Select from model docs](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html)
* [Lasso regression docs](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html)
* [Article on L1 and L2](https://towardsdatascience.com/l1-and-l2-regularization-methods-ce25e7fc831c)
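
To get a feel for how the penalty strength drives the selection, the sketch below fits `SelectFromModel` with several values of `alpha` and counts the selected features. The synthetic data, the true coefficients and the grid of `alpha` values are assumptions for illustration only, not a recommendation for this dataset.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# synthetic data: 10 features, only the first four influence the target,
# with coefficients of decreasing size
rng = np.random.RandomState(0)
X = rng.normal(size=(300, 10))
true_coef = np.array([2.0, 1.0, 0.5, 0.1, 0, 0, 0, 0, 0, 0])
y = X @ true_coef + rng.normal(scale=0.1, size=300)

# count the selected features for a few penalty strengths
n_selected = {}
for alpha in [0.001, 0.05, 0.3, 0.8]:
    sel = SelectFromModel(Lasso(alpha=alpha, random_state=0))
    sel.fit(X, y)
    n_selected[alpha] = int(sel.get_support().sum())

print(n_selected)
```

A very small `alpha` keeps nearly everything, while a large one keeps only the strongest predictors. In practice, `alpha` is usually tuned, for example with cross-validation, rather than fixed by hand.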

In [12]:
# We will do the model fitting and feature selection
# together in a few lines of code.

# First, we specify the Lasso regression model and
# choose a suitable alpha (the strength of the penalty).
# The bigger the alpha, the fewer features will be selected.

# Then we use the SelectFromModel object from sklearn, which
# automatically selects the features whose coefficients are non-zero.

# remember to set the seed (the random_state of the estimator)
sel = SelectFromModel(Lasso(alpha = .005, random_state = 0))

# train Lasso model and select features
sel.fit(x_train, y_train)

SelectFromModel(estimator=Lasso(alpha=0.005, copy_X=True, fit_intercept=True,
                                max_iter=1000, normalize=False, positive=False,
                                precompute=False, random_state=0,
                                selection='cyclic', tol=0.0001,
                                warm_start=False),
                max_features=None, norm_order=1, prefit=False, threshold=None)

In [13]:
# the selected features are marked with True
sel.get_support()

array([ True,  True, False, False, False, False, False, False, False,
       False, False,  True, False, False, False, False,  True,  True,
       False,  True, False, False, False, False, False, False, False,
       False, False,  True, False,  True, False, False, False, False,
       False, False, False,  True,  True, False,  True, False, False,
        True,  True, False, False, False, False, False,  True, False,
       False,  True,  True,  True, False,  True,  True, False, False,
       False,  True, False, False, False, False, False, False, False,
       False, False, False, False, False, False,  True, False, False,
       False])

In [14]:
# names of the columns kept by the selector
selected = x_train.columns[sel.get_support()]

print(f'Total Features: {x_train.shape[1]}')
print(f'Selected Features: {len(selected)}')
print(f'Non selected features: {x_train.shape[1] - len(selected)}')    

Total Features: 82
Selected Features: 21
Non selected features: 61


In [15]:
selected

Index(['MSSubClass', 'MSZoning', 'Neighborhood', 'OverallQual', 'OverallCond',
       'YearRemodAdd', 'BsmtQual', 'BsmtExposure', 'HeatingQC', 'CentralAir',
       '1stFlrSF', 'GrLivArea', 'BsmtFullBath', 'KitchenQual', 'Fireplaces',
       'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageCars', 'PavedDrive',
       'SaleCondition'],
      dtype='object')

In [16]:
# cross-check: the same features are the ones with a non-zero Lasso coefficient
selected = x_train.columns[sel.estimator_.coef_ != 0]
selected

Index(['MSSubClass', 'MSZoning', 'Neighborhood', 'OverallQual', 'OverallCond',
       'YearRemodAdd', 'BsmtQual', 'BsmtExposure', 'HeatingQC', 'CentralAir',
       '1stFlrSF', 'GrLivArea', 'BsmtFullBath', 'KitchenQual', 'Fireplaces',
       'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageCars', 'PavedDrive',
       'SaleCondition'],
      dtype='object')
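
Once the features are selected, the datasets can be reduced to those columns. The sketch below shows two equivalent ways of doing so on a small synthetic frame. The column names `a` to `d` and the data are hypothetical, and the claim that `transform` returns a NumPy array assumes scikit-learn's default output configuration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# synthetic frame with hypothetical column names; only 'a' and 'c' matter
rng = np.random.RandomState(0)
X = pd.DataFrame(rng.normal(size=(200, 4)), columns=['a', 'b', 'c', 'd'])
y = 2.0 * X['a'] - 1.5 * X['c'] + rng.normal(scale=0.1, size=200)

sel = SelectFromModel(Lasso(alpha=0.1, random_state=0)).fit(X, y)
kept = X.columns[sel.get_support()]

# option 1: index by column name, which keeps a DataFrame with its labels
X_by_name = X[kept]

# option 2: let the selector drop the columns
# (returns a NumPy array unless pandas output is configured)
X_by_transform = sel.transform(X)

print(list(kept))
print(np.allclose(X_by_name.to_numpy(), X_by_transform))
```

Indexing by name is convenient when the list of features is stored and reused later, since it does not require the fitted selector object.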

In [19]:
# save the selected feature names, one per line (no index, no header row)
pd.Series(selected).to_csv(r'E:\Documents\Data\Datasets\Udemy\selected_features.csv', index = False, header = False)
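
A sketch of how the saved list could be read back in a later notebook or in deployment code. It assumes the file was written without a header row (`header=False`); the feature names and the temporary file location are hypothetical stand-ins for illustration.

```python
import os
import tempfile

import pandas as pd

# hypothetical feature names and a temporary file, for illustration only
selected = pd.Index(['OverallQual', 'GrLivArea', 'GarageCars'])
path = os.path.join(tempfile.mkdtemp(), 'selected_features.csv')

# write one feature name per line, without index or header
pd.Series(selected).to_csv(path, index=False, header=False)

# read the names back into a plain Python list
features = pd.read_csv(path, header=None)[0].tolist()
print(features)
```

The resulting list can then be used to subset a dataframe, for example `x_train[features]`.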

  """Entry point for launching an IPython kernel.
