# Feature Selection

## Feature Selection

Source: http://scikit-learn.org/stable/modules/feature_selection.html

### Feature Selection using f_regression

In [1]:
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression, f_classif, mutual_info_regression

# Load the iris data as a feature matrix X and a target vector y.
iris = load_iris()
X, y = iris.data, iris.target

print(f'X.shape = {X.shape[0]} rows x {X.shape[1]} columns')

# Keep the 2 features with the highest F scores against the target.
X_new = SelectKBest(f_regression, k=2).fit_transform(X, y)

print(f'X_new.shape = {X_new.shape[0]} rows x {X_new.shape[1]} columns')

X.shape = 150 rows x 4 columns
X_new.shape = 150 rows x 2 columns
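
The shapes alone do not show which two columns were kept. The sketch below refits the same selector and reads its `scores_` and `get_support()` attributes to list the outcome for each feature. Note that the iris target is a class label, so `f_regression` treats it as a number here; `f_classif` is the test usually paired with classification targets. The `selected` variable is a name introduced only for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_regression

iris = load_iris()
X, y = iris.data, iris.target

# Fit (without transforming) so the fitted attributes can be inspected.
selector = SelectKBest(f_regression, k=2).fit(X, y)

# Pair each feature name with its F score and whether it was kept.
for name, score, kept in zip(iris.feature_names, selector.scores_, selector.get_support()):
    print(f"{name:20s} F = {score:10.2f}  kept = {kept}")

# Names of the selected features, in their original column order.
selected = [name for name, kept in zip(iris.feature_names, selector.get_support()) if kept]
print(selected)
```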


## Lab Instructions

### Part 1: Importing the Dataset

Import "Feature Selection Lab.csv". 

In [2]:
import pandas as pd
df = pd.read_csv('Feature Selection Lab.csv')

### Part 2: Preprocessing

Preprocess the dataset, and make sure you understand what each step does.

In [3]:
df.head()

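| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |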


### Part 3: Perform Feature Selection

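Reuse the `SelectKBest` code from above to perform feature selection. Complete the following steps:
1. Score the features with the F-test for regression (`f_regression`).
2. Score the features with mutual information for regression (`mutual_info_regression`).
3. Summarize which features are the most important.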
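
The two scoring methods can disagree: the F-test measures linear dependence only, while mutual information can also pick up nonlinear dependence. The sketch below illustrates this on synthetic data, not on the lab dataset. The data-generating formula, the `_demo` names, and `k=2` are assumptions chosen for illustration. `functools.partial` is used only to fix `random_state`, because the mutual information estimate is randomized.

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression, mutual_info_regression

# Synthetic data (assumption): y depends linearly on feature 0,
# nonlinearly on feature 1, and not at all on feature 2.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0.0, 1.0, size=(1000, 3))
y_demo = X_demo[:, 0] + np.sin(6 * np.pi * X_demo[:, 1]) + 0.1 * rng.normal(size=1000)

# f_regression returns (F scores, p-values); mutual_info_regression returns scores only.
f_scores, p_values = f_regression(X_demo, y_demo)
mi_scores = mutual_info_regression(X_demo, y_demo, random_state=0)

for i in range(X_demo.shape[1]):
    print(f"feature {i}: F = {f_scores[i]:8.2f}  MI = {mi_scores[i]:.3f}")

# SelectKBest accepts any score function; here it keeps the 2 highest-MI features.
mi_selector = SelectKBest(partial(mutual_info_regression, random_state=0), k=2)
mi_selector.fit(X_demo, y_demo)
print(mi_selector.get_support())
```

On the lab data, the same pattern applies with `X` and `y` in place of the synthetic arrays.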

In [4]:
import warnings
warnings.filterwarnings("ignore")

In [5]:
columns = df.columns[df.isnull().sum()>500].tolist()

In [6]:
df = df.drop(columns=columns)

In [7]:
df = df.dropna(axis=0)

In [8]:
df.shape

(1094, 76)
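
The three cells before `df.shape` first drop every column with more than 500 missing values, then drop any remaining row that still has a missing value. A toy sketch of the same pattern follows; the `toy` frame and the threshold of 2 are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: column "b" is mostly missing, column "a" has one gap.
toy = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": ["x", "y", "z", "w"],
})

threshold = 2  # assumption for this toy example; the lab uses 500

# Step 1: find and drop columns with more missing values than the threshold.
mostly_missing = toy.columns[toy.isnull().sum() > threshold].tolist()
cleaned = toy.drop(columns=mostly_missing)

# Step 2: drop rows that still contain a missing value.
cleaned = cleaned.dropna(axis=0)

print(mostly_missing)
print(cleaned)
```

Dropping the sparse columns first matters: calling `dropna` on the original frame would discard every row that is missing a value in a mostly empty column.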

In [9]:
df_y = df.SalePrice.copy()
df.drop('SalePrice', axis=1, inplace=True)

In [10]:
df = pd.concat([pd.get_dummies(df.select_dtypes(include=['object'])), df.select_dtypes(include=['int64','float64'])], axis=1)

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1094 entries, 0 to 1459
Columns: 260 entries, MSZoning_C (all) to YrSold
dtypes: float64(3), int64(34), uint8(223)
memory usage: 563.0 KB
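
The jump to 260 columns comes from `pd.get_dummies`, which replaces each categorical column with one indicator column per category. A small sketch on a made-up frame shows the effect. The `toy` data is hypothetical, and `exclude=["number"]` is used here as an assumption to pick the non-numeric columns, rather than the `include=['object']` filter in the cell above.

```python
import pandas as pd

# Hypothetical two-column frame: one categorical, one numeric.
toy = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave"],
    "LotArea": [8450, 9600, 11250],
})

categorical = toy.select_dtypes(exclude=["number"])
numeric = toy.select_dtypes(include=["number"])

# One indicator column per category, followed by the untouched numeric columns.
encoded = pd.concat([pd.get_dummies(categorical), numeric], axis=1)
print(encoded)
```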


In [12]:
import numpy as np
from sklearn.model_selection import train_test_split

df.reset_index(drop=True, inplace=True)
df = df.reindex(sorted(df.columns), axis=1)  # reindex_axis was removed from pandas
X = df.copy().values
y = df_y.values

In [13]:
# Score every feature against SalePrice and mark the 100 highest-scoring ones.
selector = SelectKBest(f_regression, k=100)
selector.fit(X, y)

In [14]:
feat_scores = pd.DataFrame()
feat_scores["F Score"] = selector.scores_
feat_scores["P Value"] = selector.pvalues_
feat_scores["Support"] = selector.get_support()
feat_scores["Attribute"] = df.columns

In [15]:
# Show only the features the selector kept.
feat_scores.loc[feat_scores.Support]

| | F Score | P Value | Support | Attribute |
|---|---|---|---|---|
| 0 | 673.687042 | 4.425681e-116 | True | 1stFlrSF |
| 1 | 110.045058 | 1.351709e-24 | True | 2ndFlrSF |
| 3 | 31.906116 | 2.062805e-08 | True | BedroomAbvGr |
| 9 | 22.370140 | 2.542702e-06 | True | BsmtCond_Fa |
| 13 | 28.689769 | 1.035516e-07 | True | BsmtExposure_Av |
| 14 | 130.962813 | 1.015168e-28 | True | BsmtExposure_Gd |
| 16 | 133.235514 | 3.655997e-29 | True | BsmtExposure_No |
| 17 | 182.803517 | 1.257949e-38 | True | BsmtFinSF1 |
| 20 | 28.173368 | 1.342927e-07 | True | BsmtFinType1_BLQ |
| 21 | 284.317343 | 7.126722e-57 | True | BsmtFinType1_GLQ |
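
The table above is ordered alphabetically by attribute, which makes it hard to see which features matter most. Sorting by F score gives a direct ranking. The sketch below builds the same kind of table on synthetic data and sorts it. The `make_regression` data, the `feat_i` names, and `k=3` are assumptions for illustration only.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic stand-in for the lab data: 8 features, 3 of them informative.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, n_informative=3, noise=10.0, random_state=0
)
names = [f"feat_{i}" for i in range(X_demo.shape[1])]

selector_demo = SelectKBest(f_regression, k=3).fit(X_demo, y_demo)

# Same columns as the table above, then ranked from highest to lowest F score.
ranking = (
    pd.DataFrame({
        "Attribute": names,
        "F Score": selector_demo.scores_,
        "P Value": selector_demo.pvalues_,
        "Support": selector_demo.get_support(),
    })
    .sort_values("F Score", ascending=False)
    .reset_index(drop=True)
)
print(ranking.head(3))
```

With the lab's `feat_scores` frame, the equivalent call would be `feat_scores.sort_values("F Score", ascending=False)`.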


In [16]:
# The selector is already fitted, so only the column filtering is needed here.
X_new = selector.transform(X)

### Part 4: Perform Linear Regression

After feature selection, fit a linear regression on both the full and the reduced feature sets to see whether selection improved or hurt model performance.

In [17]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [18]:
X_train_new, X_test_new, y_train_new, y_test_new = train_test_split(X_new, y, test_size=0.33, random_state=42)

In [19]:
from sklearn import linear_model

lr = linear_model.LinearRegression()
lr.fit(X_train, y_train)
lr.score(X_test, y_test)

-101414924.38836738

In [20]:
lr2 = linear_model.LinearRegression()
lr2.fit(X_train_new, y_train_new)
lr2.score(X_test_new, y_test_new)

0.8291373335289451
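
In the cells above, the selector is fit on all rows before `train_test_split`, so the test rows influence which features are kept. One common way to avoid that is to put the selector and the regressor in a single scikit-learn pipeline, so that selection is learned from the training rows only. A minimal sketch on synthetic data follows; the `make_regression` data, `k=10`, and the `_demo` names are assumptions for illustration, not the lab's values.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the lab data: 50 features, 5 of them informative.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=50, n_informative=5, noise=5.0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

# The pipeline fits SelectKBest on the training split only,
# then fits the regression on the selected columns.
model = make_pipeline(SelectKBest(f_regression, k=10), LinearRegression())
model.fit(X_tr, y_tr)

# score() applies the same column selection to the test split before computing R^2.
r2 = model.score(X_te, y_te)
print(f"test R^2 = {r2:.3f}")
```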