# Feature Selection

## Outline

- [**01. Importing libraries**](#01)
- [**02. Load DataSet**](#02)
- [**03. Feature Selection**](#03)


---

<a id="01"></a>

### **01. Importing libraries** 


---



In [15]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt 
import seaborn as sns

from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel


# display all columns when printing a DataFrame
pd.set_option('display.max_columns', None)


---

<a id="02"></a>

### **02. Load DataSet** 


---



In [16]:
X_train = pd.read_csv('Data/CleanedData/xtrain.csv')
X_test = pd.read_csv('Data/CleanedData/xtest.csv')

X_train.head()

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,LotShape,LandContour,LotConfig,Neighborhood,Condition1,BldgType,HouseStyle,OverallQual,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,GrLivArea,BsmtFullBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,ScreenPorch,PoolArea,MiscVal,SaleType,SaleCondition
0,0.75,0.75,0.321429,0.358615,0.333333,1.0,0.0,0.863636,0.4,0.75,0.6,0.777778,0.014706,0.04918,0.0,0.0,1.0,1.0,0.333333,0.0,0.666667,1.0,0.666667,0.666667,0.666667,1.0,0.146757,0.825429,0.347048,1.0,1.0,1.0,1.0,0.585326,0.0,0.521003,0.0,0.666667,0.0,0.375,0.333333,0.666667,0.416667,1.0,0.0,0.75,0.018692,1.0,0.75,0.503432,0.5,0.5,1.0,0.738563,0.457777,0.0,0.0,0.0,0.0,0.666667,0.75
1,0.75,0.75,0.316824,0.380598,0.333333,0.333333,0.0,0.363636,0.4,0.75,0.6,0.444444,0.360294,0.04918,0.0,0.0,0.6,0.5,0.666667,0.739293,0.666667,0.5,0.333333,0.666667,0.0,0.8,0.583073,0.343417,0.271419,1.0,1.0,1.0,1.0,0.460219,0.0,0.403993,0.333333,0.333333,0.5,0.375,0.333333,0.666667,0.25,1.0,0.0,0.75,0.457944,0.5,0.25,0.291435,0.5,0.5,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.666667,0.75
2,0.916667,0.75,0.263578,0.329323,0.0,0.333333,0.0,0.954545,0.4,1.0,0.6,0.888889,0.036765,0.098361,1.0,0.0,0.4,0.3,0.666667,0.918887,1.0,1.0,1.0,0.666667,0.0,1.0,0.492331,0.781495,0.396136,1.0,1.0,1.0,1.0,0.651365,0.0,0.584125,0.333333,0.666667,0.0,0.25,0.333333,1.0,0.333333,1.0,0.333333,0.75,0.046729,0.5,0.5,0.480457,0.5,0.5,1.0,0.824968,0.692969,0.0,0.0,0.0,0.0,0.666667,0.75
3,0.75,0.75,0.307738,0.473952,0.666667,0.666667,0.0,0.454545,0.4,0.75,0.6,0.666667,0.066176,0.163934,0.0,0.0,1.0,1.0,0.333333,0.0,0.666667,1.0,0.666667,0.666667,1.0,1.0,0.688983,0.129421,0.349856,1.0,1.0,1.0,1.0,0.592381,0.0,0.5277,0.333333,0.666667,0.0,0.375,0.333333,0.666667,0.25,1.0,0.333333,0.75,0.084112,0.5,0.5,0.437886,0.5,0.5,1.0,0.912638,0.507473,0.0,0.0,0.0,0.0,0.666667,0.75
4,0.75,0.75,0.434355,0.383803,0.333333,0.333333,0.0,0.363636,0.4,0.75,0.6,0.555556,0.323529,0.737705,0.0,0.0,0.6,0.7,0.666667,0.888552,0.333333,0.5,0.333333,0.666667,0.0,0.6,0.503143,0.666204,0.339815,1.0,0.75,1.0,1.0,0.574729,0.0,0.510963,0.0,0.666667,0.0,0.375,0.333333,0.333333,0.416667,1.0,0.333333,0.75,0.411215,0.5,0.5,0.480457,0.5,0.5,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.666667,0.75


In [17]:
y_train = pd.read_csv('Data/CleanedData/ytrain.csv')
y_test = pd.read_csv('Data/CleanedData/ytest.csv')

---

<a id="03"></a>

### **03. Feature Selection** 


---



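In this section we fit a model and use it to select features.

First, we specify a Lasso regression model and choose a suitable `alpha`, the strength of the L1 penalty. **The larger the alpha, the fewer features are selected**, because a stronger penalty shrinks more coefficients to exactly zero.
Then we wrap the model in scikit-learn's `SelectFromModel`, which automatically keeps the features whose coefficients are non-zero (for L1-penalized estimators such as Lasso, the default cutoff is an absolute coefficient of 1e-5).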

In [18]:
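# Lasso with a small L1 penalty; random_state is set for reproducibility
model = Lasso(alpha=0.0009, random_state=42)
# SelectFromModel keeps the features whose Lasso coefficients are non-zero
sel_ = SelectFromModel(model)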

sel_.fit(X_train, y_train)
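
To illustrate the effect of `alpha` described above, here is a minimal sketch on synthetic data. The dataset from `make_regression`, the `alphas` grid, and the names `X_demo`, `y_demo`, and `counts` are assumptions made for illustration; they are not part of this notebook's data or results.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Synthetic stand-in data (an assumption; not the housing dataset used here)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=1.0, random_state=0
)

# Hypothetical alpha grid, from a weak penalty to a very strong one
alphas = [0.001, 1.0, 1000.0]
counts = []
for alpha in alphas:
    selector = SelectFromModel(Lasso(alpha=alpha, random_state=42))
    selector.fit(X_demo, y_demo)
    # get_support() is a boolean mask over the input columns
    counts.append(int(selector.get_support().sum()))
    print(f"alpha={alpha}: {counts[-1]} features selected")
```

On data like this, the count should shrink as `alpha` grows, and a large enough penalty drives every coefficient to zero so that nothing is selected. The right `alpha` for a real dataset is usually chosen by cross-validation rather than by hand.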

In [19]:
print(f"Number of selected features: {sel_.get_support().sum()}")

selected_features = X_train.columns[sel_.get_support()]
print(f"Selected features:\n{selected_features}")

Number of selected features: 40
Selected features:
Index(['MSSubClass', 'MSZoning', 'LotArea', 'LotShape', 'LandContour',
       'LotConfig', 'Neighborhood', 'Condition1', 'HouseStyle', 'OverallQual',
       'YearRemodAdd', 'RoofStyle', 'Exterior1st', 'ExterQual', 'Foundation',
       'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtUnfSF', 'HeatingQC', 'CentralAir', '1stFlrSF', 'GrLivArea',
       'BsmtFullBath', 'HalfBath', 'KitchenQual', 'TotRmsAbvGrd', 'Functional',
       'Fireplaces', 'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea',
       'PavedDrive', 'WoodDeckSF', 'OpenPorchSF', 'ScreenPorch',
       'SaleCondition'],
      dtype='object')
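
To see why a given feature was kept or dropped, one option is to look at the fitted Lasso coefficients directly. The sketch below uses synthetic data (an assumption, as are the names `X_demo`, `y_demo`, and `manual_mask`) and relies on the `estimator_` and `threshold_` attributes that `SelectFromModel` exposes after fitting.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Synthetic stand-in data (an assumption; not the housing dataset used here)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=1.0, random_state=0
)

selector = SelectFromModel(Lasso(alpha=1.0, random_state=42)).fit(X_demo, y_demo)

# The fitted Lasso is stored on the selector as `estimator_`
coefs = selector.estimator_.coef_
mask = selector.get_support()

# A feature is kept when its absolute coefficient reaches the fitted threshold
manual_mask = np.abs(coefs) >= selector.threshold_

for i, (coef, kept) in enumerate(zip(coefs, mask)):
    print(f"feature {i}: coef={coef:+.4f} kept={kept}")
```

The same inspection should apply to the `sel_` object fitted above, for example by pairing `sel_.estimator_.coef_` with `X_train.columns`.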


In [20]:
selected_features = pd.Series(selected_features)
selected_features.to_csv('SelectedFeatures/selected_features.csv', index=False)
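
A later notebook would typically reload this file and keep only the selected columns. The following is a minimal sketch of that round trip; the toy frame `X_demo`, its column names, and the temporary directory are hypothetical stand-ins so the example runs on its own.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for X_train (hypothetical column names)
X_demo = pd.DataFrame({"a": [0.1, 0.2], "b": [0.3, 0.4], "c": [0.5, 0.6]})
selected = pd.Series(["a", "c"])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "selected_features.csv")
    # Same call as above: one column of feature names, no index column
    selected.to_csv(path, index=False)

    # The file has a single column; read it back as a plain list of names
    features = pd.read_csv(path).iloc[:, 0].tolist()

# Keep only the selected columns
X_demo_selected = X_demo[features]
print(X_demo_selected.columns.tolist())
```

Reading the first column by position avoids depending on the header that `to_csv` writes for an unnamed Series.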