# Feature selection

## Dataset: Pima Indians Diabetes Database
- https://www.kaggle.com/uciml/pima-indians-diabetes-database

In [2]:
import pandas as pd

In [3]:
data_file = './data/diabetes.csv'
data = pd.read_csv(data_file)

In [4]:
data.head(10)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
5,5,116,74,0,0,25.6,0.201,30,0
6,3,78,50,32,88,31.0,0.248,26,1
7,10,115,0,0,0,35.3,0.134,29,0
8,2,197,70,45,543,30.5,0.158,53,1
9,8,125,96,0,0,0.0,0.232,54,1


In [5]:
X = data.drop(['Outcome'], axis=1)
Y = data['Outcome']

## 1. VarianceThreshold
- `sklearn.feature_selection.VarianceThreshold`
- http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html#sklearn.feature_selection.VarianceThreshold

In [6]:
from sklearn.feature_selection import VarianceThreshold

In [7]:
selector = VarianceThreshold(threshold=0.2)
X_reduced = selector.fit_transform(X)

In [8]:
print(X.shape)
print(X_reduced.shape)

(768, 8)
(768, 7)


In [9]:
print(X.head(2))
print(60 * '=')
print(X_reduced[0:2, :])

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35        0  33.6   
1            1       85             66             29        0  26.6   

   DiabetesPedigreeFunction  Age  
0                     0.627   50  
1                     0.351   31  
[[  6.  148.   72.   35.    0.   33.6  50. ]
 [  1.   85.   66.   29.    0.   26.6  31. ]]


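### Notes
- Variance depends on the scale of each variable, so the same threshold can mean very different things for different columns. Choose it with care.
- With `threshold=0`, only variables whose values are all identical are removed.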
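
The two notes above can be checked on a small example. The sketch below uses a made-up toy frame (the column names `height_m`, `height_cm`, and `constant` are assumptions for illustration, not part of the diabetes data) in which the same quantity appears in two units.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy data: one height measured in metres and in centimetres,
# plus a column that never changes.
toy = pd.DataFrame({
    "height_m": [1.50, 1.60, 1.70, 1.80],
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "constant": [1.0, 1.0, 1.0, 1.0],
})

# threshold=0.2 drops height_m purely because of its unit (variance 0.0125),
# while height_cm carries the same information with variance 125.
by_threshold = VarianceThreshold(threshold=0.2).fit(toy)
kept_by_threshold = list(toy.columns[by_threshold.get_support()])

# threshold=0 removes only the constant column.
by_zero = VarianceThreshold(threshold=0.0).fit(toy)
kept_by_zero = list(toy.columns[by_zero.get_support()])

print(by_threshold.variances_)
print(kept_by_threshold)
print(kept_by_zero)
```

One possible way around the unit problem is to bring the columns to a comparable range before thresholding, but whether that suits a given dataset is a judgment call.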

## 2. Chi-squared based feature selection
- `sklearn.feature_selection.SelectKBest`: http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest
- `sklearn.feature_selection.chi2`: http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html

In [12]:
from sklearn.feature_selection import SelectKBest, chi2

In [17]:
selector = SelectKBest(chi2, k='all')
# selector = SelectKBest(chi2, k=5) would keep only the 5 highest-scoring features
X_reduced = selector.fit_transform(X, Y)

In [18]:
print(X.shape)
print(X_reduced.shape)

(768, 8)
(768, 8)


In [19]:
print(X.head(2))
print(60 * '=')
print(X_reduced[0:2, :])

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35        0  33.6   
1            1       85             66             29        0  26.6   

   DiabetesPedigreeFunction  Age  
0                     0.627   50  
1                     0.351   31  
[[  6.    148.     72.     35.      0.     33.6     0.627  50.   ]
 [  1.     85.     66.     29.      0.     26.6     0.351  31.   ]]


In [20]:
print(selector.scores_)
print(selector.pvalues_)

[ 111.51969064 1411.88704064   17.60537322   53.10803984 2175.56527292
  127.66934333    5.39268155  181.30368904]
[4.55261043e-026 5.48728628e-309 2.71819252e-005 3.15697650e-013
 0.00000000e+000 1.32590849e-029 2.02213728e-002 2.51638830e-041]
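
The raw `scores_` array is easier to read once it is paired with column names. The following is a minimal sketch on synthetic count data (the columns `signal` and `noise` are invented for this example); it also relies on the fact that `chi2` expects non-negative feature values.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)

# Hypothetical non-negative count features.
toy = pd.DataFrame({
    "signal": rng.poisson(lam=2 + 6 * y),  # counts depend on the class
    "noise": rng.poisson(lam=5, size=n),   # counts unrelated to the class
})

selector = SelectKBest(chi2, k=1).fit(toy, y)

# Pair each score with its column name and sort from best to worst.
ranking = pd.Series(selector.scores_, index=toy.columns).sort_values(ascending=False)
selected = list(toy.columns[selector.get_support()])

print(ranking)
print(selected)
```

The same pattern, `pd.Series(selector.scores_, index=X.columns)`, should apply to the diabetes features used in this notebook.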


## 3. Recursive feature elimination (RFE)
- A wrapper-style feature selection method provided by scikit-learn.
- Similar to backward elimination: it starts from all features and removes them step by step.
- `sklearn.feature_selection.RFE`: http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html#sklearn.feature_selection.RFE
- `sklearn.feature_selection.RFECV`: http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFECV.html#sklearn.feature_selection.RFECV
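
To make the backward-elimination idea concrete, here is a rough hand-written loop next to `RFE`. It is only a sketch under assumptions: the data is synthetic, `LogisticRegression` stands in for the estimator, and `backward_eliminate` is a hypothetical helper, not a scikit-learn function.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data for illustration only.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)

def backward_eliminate(X, y, n_keep):
    """Hypothetical helper: refit, drop the weakest feature, repeat."""
    kept = list(range(X.shape[1]))
    while len(kept) > n_keep:
        model = LogisticRegression(max_iter=1000).fit(X[:, kept], y)
        # The feature with the smallest absolute coefficient is removed.
        weakest = int(np.argmin(np.abs(model.coef_[0])))
        del kept[weakest]
    return kept

kept = backward_eliminate(X_toy, y_toy, n_keep=3)

# The library version of the same idea, removing one feature per step.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3, step=1)
rfe.fit(X_toy, y_toy)
rfe_kept = np.flatnonzero(rfe.support_).tolist()

print(kept)
print(rfe_kept)
```

With one feature removed per step, the two lists would normally be expected to agree. `RFECV` adds cross-validation on top of this loop to choose how many features to keep.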

In [21]:
from sklearn.feature_selection import RFECV
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold

In [22]:
svc = SVC(kernel='linear')
rfecv = RFECV(estimator=svc, step=1, cv=StratifiedKFold(n_splits=2), scoring='accuracy', verbose=1)
rfecv.fit(X, Y)

Fitting estimator with 8 features.
Fitting estimator with 7 features.
Fitting estimator with 6 features.
Fitting estimator with 5 features.
Fitting estimator with 4 features.
Fitting estimator with 3 features.
Fitting estimator with 2 features.
Fitting estimator with 8 features.
Fitting estimator with 7 features.
Fitting estimator with 6 features.
Fitting estimator with 5 features.
Fitting estimator with 4 features.
Fitting estimator with 3 features.
Fitting estimator with 2 features.


RFECV(cv=StratifiedKFold(n_splits=2, random_state=None, shuffle=False),
   estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
   n_jobs=1, scoring='accuracy', step=1, verbose=1)

In [23]:
print(rfecv.n_features_)
print(rfecv.support_)

7
[ True  True  True  True False  True  True  True]


In [24]:
X.columns

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age'],
      dtype='object')

### Notes
- `RFECV` itself acts as a scikit-learn estimator: once fitted, it can be used to predict and score directly.
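
As a sketch of what the note above means in practice, the fitted selector can be called like any other classifier. The example runs on synthetic data, and `LogisticRegression` replaces the linear `SVC` used earlier purely to keep it fast; both choices are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic data for illustration only.
X_toy, y_toy = make_classification(
    n_samples=400, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0
)

rfecv = RFECV(
    estimator=LogisticRegression(max_iter=1000),
    step=1,
    cv=StratifiedKFold(n_splits=5),
    scoring="accuracy",
)
rfecv.fit(X_train, y_train)

# predict and score reduce the input to the selected features internally,
# then delegate to the estimator refit on those features.
predictions = rfecv.predict(X_test)
test_accuracy = rfecv.score(X_test, y_test)

# transform returns only the selected columns.
X_test_reduced = rfecv.transform(X_test)

print(rfecv.n_features_, test_accuracy)
```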

## 4. Selecting features by model
- `sklearn.feature_selection.SelectFromModel`: http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html#sklearn.feature_selection.SelectFromModel

### 4.1. L1-regularized model

In [25]:
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

In [26]:
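# liblinear supports the L1 penalty; the lbfgs solver (the default in newer scikit-learn) does not
lr = LogisticRegression(penalty='l1', C=0.05, solver='liblinear')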
lr.fit(X, Y)

LogisticRegression(C=0.05, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l1', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [27]:
lr.coef_

array([[ 0.09659289,  0.01787956, -0.02350164,  0.00109293,  0.00036831,
         0.01597492,  0.        , -0.00232724]])

In [33]:
selector = SelectFromModel(estimator=lr, threshold=0.000001, prefit=True)
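# Drops the features the L1 model did not select: with a very small threshold,
# any feature whose absolute coefficient is below it (here, the one shrunk to 0) is removed.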
X_reduced = selector.transform(X)

In [34]:
print(X.shape)
print(X_reduced.shape)

(768, 8)
(768, 7)


In [35]:
print(X.head(2))
print(60 * '=')
print(X_reduced[0:2, :])

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35        0  33.6   
1            1       85             66             29        0  26.6   

   DiabetesPedigreeFunction  Age  
0                     0.627   50  
1                     0.351   31  
[[  6.  148.   72.   35.    0.   33.6  50. ]
 [  1.   85.   66.   29.    0.   26.6  31. ]]


### 4.2. Tree model

In [36]:
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier

In [37]:
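# Note: for classifiers, max_features='auto' meant sqrt(n_features); newer scikit-learn releases removed it, so use 'sqrt' there.
rf = RandomForestClassifier(n_estimators=20, max_features='auto')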
rf.fit(X, Y)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=20, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [38]:
rf.feature_importances_

array([0.08596673, 0.26743884, 0.08765176, 0.06696782, 0.06910012,
       0.17024112, 0.1209441 , 0.13168951])

In [39]:
selector = SelectFromModel(estimator=rf, threshold=0.1, prefit=True)
X_reduced = selector.transform(X)

In [40]:
print(X.shape)
print(X_reduced.shape)

(768, 8)
(768, 4)


In [41]:
print(X.head(2))
print(60 * '=')
print(X_reduced[0:2, :])

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35        0  33.6   
1            1       85             66             29        0  26.6   

   DiabetesPedigreeFunction  Age  
0                     0.627   50  
1                     0.351   31  
[[148.     33.6     0.627  50.   ]
 [ 85.     26.6     0.351  31.   ]]


### Notes
- Only estimators that expose a **coef_** or **feature_importances_** attribute after fitting can be used: Lasso, LogisticRegression, LinearSVC, and tree-based models.
- See also: http://scikit-learn.org/stable/auto_examples/feature_selection/plot_select_from_model_boston.html
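
Both cells above pass an already fitted model with `prefit=True`. As an alternative sketch, `SelectFromModel` can fit the estimator itself and take a relative threshold such as `"mean"`. The data and the column names `f0` to `f5` below are synthetic placeholders, not the diabetes features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data for illustration only; with shuffle=False the first two
# columns are the informative ones.
X_arr, y_toy = make_classification(
    n_samples=300, n_features=6, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)
toy = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

# Without prefit=True the selector fits a clone of the estimator in fit().
# threshold="mean" keeps features whose importance is at least the mean importance.
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=50, random_state=0), threshold="mean"
)
selector.fit(toy, y_toy)

# The fitted clone is available as estimator_.
importances = pd.Series(selector.estimator_.feature_importances_, index=toy.columns)
selected = list(toy.columns[selector.get_support()])
X_sel = selector.transform(toy)

print(importances.sort_values(ascending=False))
print(selected)
```

Letting the selector do the fitting also makes it usable as a step inside a `Pipeline`, which is harder to arrange with a prefit model.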