# House Price Dataset: Feature Selection

In this notebook, we select the most predictive features to use when building our machine learning model.

**We will use Lasso regression. Its L1 penalty shrinks coefficients toward zero, and the coefficients of features that contribute little to the prediction are set to exactly zero. Features whose coefficients are zero can then be removed from the model.**
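The zeroing behaviour described above can be illustrated on synthetic data. This is a minimal sketch, not part of the original notebook: the data, the variable names (`X_toy`, `y_toy`) and the value `alpha=0.1` are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Hypothetical toy data: 5 features, only the first two drive the target.
rng = np.random.RandomState(0)
n_samples = 200
X_toy = rng.normal(size=(n_samples, 5))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=n_samples)

# alpha controls the strength of the L1 penalty (illustrative value).
toy_lasso = Lasso(alpha=0.1)
toy_lasso.fit(X_toy, y_toy)

toy_coef = toy_lasso.coef_
print(toy_coef)

# Features with a zero coefficient carry no weight in the fitted model.
print("Zeroed features:", np.where(toy_coef == 0)[0])
```

In a run like this, the two informative features should keep non-zero (slightly shrunken) coefficients, while the pure-noise features should be set to exactly zero.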

In [17]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

pd.set_option('display.max_columns', None)

import warnings
warnings.simplefilter(action='ignore')

In [9]:
# Load the preprocessed train and test sets
X_train = pd.read_csv('xtrain.csv')
X_test = pd.read_csv('xtest.csv')

print(X_train.head())
print(X_test.head())

     Id  MSSubClass  MSZoning  LotFrontage   LotArea  Street  Alley  LotShape  \
0   931    0.000000      0.75     0.461171  0.377048     1.0    0.5  0.333333   
1   657    0.000000      0.75     0.456066  0.399443     1.0    0.5  0.333333   
2    46    0.588235      0.75     0.394699  0.347082     1.0    0.5  0.000000   
3  1349    0.000000      0.75     0.388581  0.493677     1.0    0.5  0.666667   
4    56    0.000000      0.75     0.577658  0.402702     1.0    0.5  0.333333   

   LandContour  Utilities  ...  MiscFeature  MiscVal    MoSold  YrSold  \
0     1.000000        1.0  ...          1.0      0.0  0.545455    0.75   
1     0.333333        1.0  ...          1.0      0.0  0.636364    0.50   
2     0.333333        1.0  ...          1.0      0.0  0.090909    1.00   
3     0.666667        1.0  ...          1.0      0.0  0.636364    0.25   
4     0.333333        1.0  ...          1.0      0.0  0.545455    0.50   

   SaleType  SaleCondition  SalePrice  LotFrontage_na  MasVnrArea_na

In [10]:
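# Separate the target from the predictors
y_train = X_train['SalePrice']
y_test = X_test['SalePrice']

# Drop the identifier and the target so that neither is used as a feature
X_train.drop(['Id', 'SalePrice'], axis=1, inplace=True)
X_test.drop(['Id', 'SalePrice'], axis=1, inplace=True)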

## Feature Selection


In [11]:
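# Wrap a Lasso model in SelectFromModel: features whose coefficient is
# (effectively) zero after fitting are discarded.
# Note: random_state only affects Lasso when selection='random'.
sfm_ = SelectFromModel(estimator=Lasso(alpha=0.005, random_state=1))

sfm_.fit(X_train, y_train)

# Boolean mask: True for features that were kept
sfm_.get_support()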

array([ True,  True, False, False, False, False, False, False, False,
       False, False,  True, False, False, False, False,  True,  True,
       False,  True, False, False, False, False, False, False, False,
       False, False,  True, False,  True, False, False, False, False,
       False, False, False,  True,  True, False,  True, False, False,
        True,  True, False, False, False, False, False,  True, False,
       False,  True,  True,  True, False,  True,  True, False, False,
       False,  True, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False])
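The boolean mask above is derived from the fitted coefficients. The sketch below shows that relationship on a small synthetic frame; the data and the names `X_demo`, `y_demo` and `selector` are assumptions for illustration and are not the house price data. It relies on the fitted `threshold_` attribute of `SelectFromModel` rather than hard-coding a cut-off.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Hypothetical data: six features, of which only f0 and f3 matter.
rng = np.random.RandomState(1)
X_demo = pd.DataFrame(rng.normal(size=(300, 6)),
                      columns=[f"f{i}" for i in range(6)])
y_demo = 2.0 * X_demo["f0"] + 1.5 * X_demo["f3"] + rng.normal(scale=0.1, size=300)

selector = SelectFromModel(estimator=Lasso(alpha=0.1))
selector.fit(X_demo, y_demo)

mask = selector.get_support()        # boolean array, one entry per feature
coef = selector.estimator_.coef_     # coefficients of the fitted Lasso
threshold = selector.threshold_      # cut-off applied to |coef|

# A feature is kept when the magnitude of its coefficient reaches the threshold.
manual_mask = np.abs(coef) >= threshold

kept = list(X_demo.columns[mask])
X_reduced = selector.transform(X_demo)  # array with only the kept columns

print("Kept features:", kept)
print("Reduced shape:", X_reduced.shape)
```

`transform` returns only the retained columns, which is an alternative to indexing the DataFrame with the column names.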

In [14]:
# Names of the features that Lasso kept
selected_feat = X_train.columns[sfm_.get_support()]
print('Total Features: {}'.format(X_train.shape[1]))
print('Selected Features: {}'.format(len(selected_feat)))
print('Features with coefficient shrunk to zero: {}'.format(np.sum(sfm_.estimator_.coef_ == 0)))

Total Features: 82
Selected Features: 20
Features with coefficient shrunk to zero: 62
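How many features survive depends on the penalty strength `alpha`. The following sketch, on synthetic data with hypothetical names (`X_sweep`, `y_sweep`) and an arbitrary grid of `alpha` values, counts the non-zero coefficients for each setting. It is only meant to show the general trend, not to suggest a value for this dataset.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Hypothetical data: 8 features, only the first two are informative.
rng = np.random.RandomState(42)
X_sweep = rng.normal(size=(200, 8))
y_sweep = 3.0 * X_sweep[:, 0] - 2.0 * X_sweep[:, 1] + rng.normal(scale=0.5, size=200)

alphas = [0.001, 0.01, 0.1, 1.0, 10.0]  # illustrative grid
n_kept = []
for alpha in alphas:
    model = Lasso(alpha=alpha).fit(X_sweep, y_sweep)
    # Count the coefficients that were not shrunk to exactly zero
    n_kept.append(int(np.sum(model.coef_ != 0)))

for alpha, kept_count in zip(alphas, n_kept):
    print(f"alpha={alpha}: {kept_count} non-zero coefficients")
```

A small `alpha` keeps most features, while a sufficiently large `alpha` removes all of them, so in practice the value is usually tuned, for example by cross-validation.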


In [15]:
print('Selected features are :',selected_feat)

Selected features are : Index(['MSSubClass', 'MSZoning', 'Neighborhood', 'OverallQual', 'OverallCond',
       'YearRemodAdd', 'BsmtQual', 'BsmtExposure', 'HeatingQC', 'CentralAir',
       '1stFlrSF', 'GrLivArea', 'BsmtFullBath', 'KitchenQual', 'Fireplaces',
       'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageCars',
       'PavedDrive'],
      dtype='object')


In [19]:
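# Save the selected feature names for use in later notebooks.
# header expects a bool or a list of names; True writes the default column label (0).
pd.Series(selected_feat).to_csv('selected_features.csv', index=False, header=True)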
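A later step would typically read this file back and use it to subset the data. The sketch below assumes a file written with a single header line, as in the cell above; the temporary path, the shortened feature list and the `demo` frame are hypothetical stand-ins.

```python
import os
import tempfile

import pandas as pd

# Hypothetical short feature list standing in for the real selection
features = pd.Series(["OverallQual", "GrLivArea", "GarageCars"])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "selected_features_demo.csv")
    features.to_csv(path, index=False, header=True)

    # The file has one header row followed by one feature name per line
    loaded = pd.read_csv(path)

# Take the first (and only) column as a plain Python list
reloaded = loaded.iloc[:, 0].tolist()

# Hypothetical frame containing the selected columns plus an extra one
demo = pd.DataFrame({
    "OverallQual": [0.5],
    "GrLivArea": [0.3],
    "GarageCars": [0.75],
    "LotArea": [0.4],
})
demo_selected = demo[reloaded]
print(demo_selected.columns.tolist())
```

Selecting by position with `iloc[:, 0]` avoids depending on the exact header label written to the file.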