## Feature Selection Techniques

In [2]:
import pandas as pd
import numpy as np
import datetime
import warnings
import matplotlib.pyplot as plt
import seaborn as sns
warnings.filterwarnings('ignore')

numerical = pd.read_csv('7.03/numerical.csv')
categorical = pd.read_csv('7.03/categorical.csv')
targets = pd.read_csv('7.03/target.csv')

### Correlations
But you already know that :)

In [3]:
targets.merge(numerical, left_index=True, right_index=True).corr()['TARGET_D'].sort_values(ascending=False)

TARGET_D    1.000000
TARGET_B    0.774232
LASTGIFT    0.061678
RAMNTALL    0.044810
AVGGIFT     0.044299
              ...   
IC6        -0.023673
POBC2      -0.024375
IC15       -0.025528
EC4        -0.026064
CLUSTER2   -0.029085
Name: TARGET_D, Length: 317, dtype: float64

In [25]:
numerical.shape, targets.shape

((95412, 315), (95412, 2))

In [27]:
numerical.corrwith(targets['TARGET_D']).sort_values(ascending=False)

LASTGIFT    0.061678
RAMNTALL    0.044810
AVGGIFT     0.044299
HV2         0.039860
MAXRAMNT    0.039224
              ...   
IC6        -0.023673
POBC2      -0.024375
IC15       -0.025528
EC4        -0.026064
CLUSTER2   -0.029085
Length: 315, dtype: float64
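Sorting by the signed correlation, as above, pushes strong negative correlations to the bottom of the list. A minimal sketch of ranking by absolute correlation instead, on hypothetical toy data (the column names `a`, `b`, `c` and the values are made up for illustration and are not part of the dataset):

```python
import pandas as pd

# Hypothetical toy data standing in for `numerical` and `targets['TARGET_D']`.
features = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [6, 5, 4, 3, 2, 1],      # mirror image of "a": strong negative correlation
    "c": [1, -1, 1, -1, 1, -1],   # alternating values: weak correlation
})
target = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 7.0])

# Signed Pearson correlation of every column with the target.
corr = features.corrwith(target)

# Rank by magnitude so that strong negative correlations are not pushed to the end.
abs_corr = corr.abs().sort_values(ascending=False)
top_features = list(abs_corr.head(2).index)

print(corr)
print(top_features)
```

The same two lines (`corrwith` followed by `.abs().sort_values(...)`) should carry over to `numerical` and `targets['TARGET_D']`, assuming the indexes line up as they do in the cells above.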

### Variance Threshold

Variance is the average of the squared differences between each value and the mean of the column.
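As a quick check of that definition, here is a small sketch on a made-up column (the values are hypothetical, chosen so the arithmetic is easy to follow):

```python
import numpy as np

# Hypothetical toy column, not taken from the dataset.
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = values.mean()                    # 5.0
squared_diffs = (values - mean) ** 2    # squared distance of each value from the mean
variance = squared_diffs.mean()         # population variance: divide by n

print(variance)          # 4.0
print(np.var(values))    # same value; np.var uses ddof=0 by default
```

Note that pandas' `Series.var()` defaults to `ddof=1` (the sample variance), so it gives a slightly larger number than `np.var` on the same data.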

The coefficient of determination (R2, also called R-squared) is related to the correlation coefficient (R, also called the Pearson correlation).

R is the correlation between y_pred and y_test. For a least-squares linear regression, R-squared is the square of this coefficient, and it indicates the share of the total variation that your regression line explains.
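The relation between R and R-squared can be verified numerically. The sketch below uses made-up `x` and `y` values and a straight-line fit from `np.polyfit`; it compares predictions against the same data the line was fitted on, which is the setting where the two quantities are expected to coincide (on held-out data they can differ):

```python
import numpy as np

# Hypothetical toy data, roughly linear.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least-squares straight line with an intercept.
slope, intercept = np.polyfit(x, y, 1)
y_pred = slope * x + intercept

# Pearson correlation between observed and predicted values.
r = np.corrcoef(y, y_pred)[0, 1]

# R-squared from its definition: 1 - (unexplained variation / total variation).
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(r ** 2, r_squared)
```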

This is not a perfect method and should always be compared with others to find which one fits your dataset best. High-variance features do not guarantee a good model, but low-variance features are likely to have low predictive ability, especially if they are close to constant.

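Check the docs [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html). `VarianceThreshold` does the filtering for you: it drops the columns whose variance falls below a threshold. It looks only at the features, not at the target or at any correlation: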

In [10]:
# Using the variance threshold technique
from sklearn.feature_selection import VarianceThreshold
selection = VarianceThreshold(threshold=0.9)
# fit() computes the variance of every column;
# transform() then drops the columns whose variance does not exceed the threshold
selection.fit(numerical)
temp = selection.transform(numerical)
temp

array([[0.00000000e+00, 6.00000000e+01, 5.00000000e+00, ...,
        9.55150000e+04, 4.00000000e+00, 3.90000000e+01],
       [1.00000000e+00, 4.60000000e+01, 6.00000000e+00, ...,
        1.48535000e+05, 2.00000000e+00, 1.00000000e+00],
       [1.00000000e+00, 6.16116493e+01, 3.00000000e+00, ...,
        1.50780000e+04, 4.00000000e+00, 6.00000000e+01],
       ...,
       [1.00000000e+00, 6.00000000e+01, 5.00000000e+00, ...,
        1.89641000e+05, 3.00000000e+00, 3.40000000e+01],
       [0.00000000e+00, 5.80000000e+01, 7.00000000e+00, ...,
        4.69300000e+03, 4.00000000e+00, 1.10000000e+01],
       [2.00000000e+00, 8.00000000e+01, 5.00000000e+00, ...,
        1.85114000e+05, 1.00000000e+00, 1.20000000e+01]])

In [11]:
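# Keep the names of the surviving columns instead of the default 0..n-1 labels
temp = pd.DataFrame(temp, columns=numerical.columns[selection.get_support()])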
numerical.shape, temp.shape

((95412, 315), (95412, 305))
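`transform` returns a bare NumPy array, so the column names are lost unless they are put back by hand. A minimal sketch of recovering them with the boolean mask from `get_support()`, on a hypothetical three-column frame (the column names and the `0.2` threshold are made up for illustration):

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy frame; the names describe how each column behaves.
toy = pd.DataFrame({
    "spread": [1.0, 5.0, 9.0, 13.0],          # variance 20
    "almost_constant": [1.0, 1.0, 1.0, 1.1],  # variance close to 0
    "binary": [0.0, 1.0, 0.0, 1.0],           # variance 0.25
})

selector = VarianceThreshold(threshold=0.2)
selector.fit(toy)

# Boolean mask: True for the columns that survive the threshold.
mask = selector.get_support()

# Index the original frame with the mask to keep the column names.
reduced = toy.loc[:, mask]
print(list(reduced.columns))
```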

In [12]:
pd.DataFrame(data=(numerical.columns,selection.variances_,selection.get_support()), index=('column_name','variance','statement'))

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 305 | 306 | 307 | 308 | 309 | 310 | 311 | 312 | 313 | 314 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| column_name | TCODE | AGE | INCOME | WEALTH1 | HIT | MALEMILI | MALEVET | VIETVETS | WWIIVETS | LOCALGOV | ... | CARDGIFT | MINRAMNT | MAXRAMNT | LASTGIFT | TIMELAG | AVGGIFT | CONTROLN | HPHONE_D | RFA_2F | CLUSTER2 |
| variance | 909809.748379 | 208.815897 | 2.888193 | 7.320489 | 86.617457 | 25.659247 | 131.572233 | 227.939712 | 313.607695 | 19.286819 | ... | 20.490351 | 77.161898 | 628.385285 | 194.785482 | 60.482486 | 115.991629 | 3056354531.68346 | 0.25 | 1.150778 | 351.620199 |
| statement | True | True | True | True | True | True | True | True | True | True | ... | True | True | True | True | True | True | True | False | True | True |


In [15]:
removed_columns = pd.DataFrame(data=(numerical.columns,selection.variances_,selection.get_support()), index=('column_name','variance','statement')).T
removed_columns[removed_columns['statement'] == False]

| | column_name | variance | statement |
|---|---|---|---|
| 26 | ETH6 | 0.46141 | False |
| 171 | TPE6 | 0.539863 | False |
| 172 | TPE7 | 0.483732 | False |
| 254 | ANC5 | 0.500561 | False |
| 255 | ANC6 | 0.284182 | False |
| 260 | ANC11 | 0.750543 | False |
| 264 | ANC15 | 0.150632 | False |
| 288 | HC15 | 0.243664 | False |
| 296 | MHUC2 | 0.751454 | False |
| 312 | HPHONE_D | 0.25 | False |


High variance does not mean high correlation!
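A small sketch of why this matters, using made-up arrays: `high_var` has a huge variance but no linear relation to the target, while `low_var` is a rescaled copy of the target that a variance threshold would happily discard. All names and values here are hypothetical.

```python
import numpy as np

# Hypothetical target and two hypothetical features.
target = np.arange(8, dtype=float)

# Large swings, arranged so that they cancel out against the target.
high_var = 1000.0 * np.array([1, -1, -1, 1, 1, -1, -1, 1])

# A tiny-scale copy of the target: almost no variance, perfect correlation.
low_var = 0.01 * target

corr_high = np.corrcoef(high_var, target)[0, 1]
corr_low = np.corrcoef(low_var, target)[0, 1]

print(np.var(high_var), corr_high)   # large variance, correlation ~0
print(np.var(low_var), corr_low)     # tiny variance, correlation ~1
```

Because variance depends on the units of each column, it may be worth scaling the features to a comparable range before applying a single threshold to all of them.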

### Recursive Feature Elimination

The algorithm works like backward elimination. It starts with all the features in the training set, fits the estimator, and iteratively removes the least important features (for a linear model, those with the smallest coefficients) until the specified number of features is reached.
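The elimination loop can be sketched by hand to make the idea concrete. The example below uses synthetic data (an assumption for illustration: five random features, of which only columns 0 and 3 drive the target) and drops one feature per round based on the absolute size of its coefficient, then compares the result with scikit-learn's `RFE` using `step=1`. It is a simplified illustration, not a reproduction of the library's internals.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data: only features 0 and 3 actually influence y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=200)

# Hand-rolled backward elimination: refit, drop the weakest feature, repeat.
remaining = list(range(X.shape[1]))
while len(remaining) > 2:
    model = LinearRegression().fit(X[:, remaining], y)
    weakest = int(np.argmin(np.abs(model.coef_)))  # position of the smallest |coefficient|
    remaining.pop(weakest)

# The same selection with scikit-learn's RFE, removing one feature per round.
rfe_toy = RFE(LinearRegression(), n_features_to_select=2, step=1).fit(X, y)
rfe_kept = [int(i) for i in np.flatnonzero(rfe_toy.support_)]

print(remaining, rfe_kept)
```

Coefficient size depends on the scale of each feature, so on real data with mixed units it is usually sensible to standardize the features before relying on this ranking.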

In [18]:
from sklearn.feature_selection import RFE
from sklearn import linear_model

lm = linear_model.LinearRegression()
rfe = RFE(lm, n_features_to_select=20, verbose=False, step=50)
rfe.fit(numerical, targets)

RFE(estimator=LinearRegression(), n_features_to_select=20, step=50,
    verbose=False)

In [19]:
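# After we run the algorithm, the selected features are ranked 1. The rest get higher
# ranks the earlier they were eliminated (a higher rank means less important);
# with step=50, features dropped in the same round share a rank.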
df = pd.DataFrame(data = rfe.ranking_, columns=['Rank'])
df['Column_name'] = numerical.columns
df[df['Rank']==1]

| | Rank | Column_name |
|---|---|---|
| 2 | 1 | INCOME |
| 16 | 1 | POP90C1 |
| 17 | 1 | POP90C2 |
| 18 | 1 | POP90C3 |
| 19 | 1 | POP90C4 |
| 20 | 1 | POP90C5 |
| 30 | 1 | ETH10 |
| 76 | 1 | DW3 |
| 96 | 1 | HHD5 |
| 97 | 1 | HHD6 |


In [21]:
df

| | Rank | Column_name |
|---|---|---|
| 0 | 7 | TCODE |
| 1 | 7 | AGE |
| 2 | 1 | INCOME |
| 3 | 6 | WEALTH1 |
| 4 | 6 | HIT |
| ... | ... | ... |
| 310 | 5 | AVGGIFT |
| 311 | 7 | CONTROLN |
| 312 | 3 | HPHONE_D |
| 313 | 2 | RFA_2F |


In [23]:
numerical

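| | TCODE | AGE | INCOME | WEALTH1 | HIT | MALEMILI | MALEVET | VIETVETS | WWIIVETS | LOCALGOV | ... | CARDGIFT | MINRAMNT | MAXRAMNT | LASTGIFT | TIMELAG | AVGGIFT | CONTROLN | HPHONE_D | RFA_2F | CLUSTER2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 60.000000 | 5 | 9 | 0 | 0 | 39 | 34 | 18 | 10 | ... | 14 | 5.0 | 12.0 | 10.0 | 4 | 7.741935 | 95515 | 0 | 4 | 39 |
| 1 | 1 | 46.000000 | 6 | 9 | 16 | 0 | 15 | 55 | 11 | 6 | ... | 1 | 10.0 | 25.0 | 25.0 | 18 | 15.666667 | 148535 | 0 | 2 | 1 |
| 2 | 1 | 61.611649 | 3 | 1 | 2 | 0 | 20 | 29 | 33 | 6 | ... | 14 | 2.0 | 16.0 | 5.0 | 12 | 7.481481 | 15078 | 1 | 4 | 60 |
| 3 | 0 | 70.000000 | 1 | 4 | 2 | 0 | 23 | 14 | 31 | 3 | ... | 7 | 2.0 | 11.0 | 10.0 | 9 | 6.812500 | 172556 | 1 | 4 | 41 |
| 4 | 0 | 78.000000 | 3 | 2 | 60 | 1 | 28 | 9 | 53 | 26 | ... | 8 | 3.0 | 15.0 | 15.0 | 14 | 6.864865 | 7112 | 1 | 2 | 26 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95407 | 1 | 61.611649 | 5 | 9 | 0 | 14 | 36 | 47 | 11 | 7 | ... | 0 | 25.0 | 25.0 | 25.0 | 9 | 25.000000 | 184568 | 0 | 1 | 12 |
| 95408 | 1 | 48.000000 | 7 | 9 | 1 | 0 | 31 | 43 | 19 | 4 | ... | 0 | 20.0 | 20.0 | 20.0 | 9 | 20.000000 | 122706 | 1 | 1 | 2 |
| 95409 | 1 | 60.000000 | 5 | 9 | 0 | 0 | 18 | 46 | 20 | 7 | ... | 4 | 3.0 | 10.0 | 10.0 | 3 | 8.285714 | 189641 | 1 | 3 | 34 |
| 95410 | 0 | 58.000000 | 7 | 9 | 0 | 0 | 28 | 35 | 20 | 9 | ... | 18 | 5.0 | 21.0 | 18.0 | 4 | 12.146341 | 4693 | 1 | 4 | 11 |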


In [20]:
rfe.ranking_

array([7, 7, 1, 6, 6, 5, 6, 6, 5, 6, 7, 6, 5, 7, 7, 7, 1, 1, 1, 1, 1, 4,
       2, 7, 4, 2, 3, 4, 4, 3, 1, 3, 7, 2, 2, 2, 2, 3, 3, 4, 2, 4, 3, 3,
       2, 2, 3, 3, 5, 4, 6, 5, 3, 3, 2, 2, 2, 2, 2, 5, 4, 3, 4, 3, 3, 4,
       5, 2, 4, 4, 6, 6, 6, 3, 3, 6, 1, 3, 2, 4, 3, 4, 3, 7, 7, 4, 6, 2,
       2, 2, 2, 7, 2, 2, 2, 2, 1, 1, 2, 1, 2, 3, 3, 5, 4, 5, 3, 2, 2, 2,
       5, 6, 6, 7, 6, 6, 5, 5, 3, 3, 2, 1, 2, 2, 3, 5, 6, 4, 5, 6, 5, 7,
       6, 7, 7, 7, 7, 7, 7, 7, 7, 4, 4, 3, 3, 3, 3, 6, 3, 3, 7, 6, 6, 7,
       5, 4, 3, 5, 5, 5, 5, 6, 5, 1, 1, 6, 3, 3, 1, 1, 1, 1, 2, 3, 3, 7,
       7, 5, 6, 5, 7, 3, 4, 6, 5, 4, 7, 7, 7, 7, 6, 2, 3, 2, 2, 3, 2, 2,
       3, 2, 2, 2, 2, 3, 4, 7, 7, 5, 5, 4, 5, 5, 6, 6, 4, 6, 6, 5, 4, 5,
       4, 5, 4, 7, 5, 4, 7, 4, 4, 4, 4, 4, 5, 5, 5, 3, 4, 4, 4, 3, 2, 3,
       6, 6, 6, 7, 7, 6, 5, 6, 4, 7, 4, 7, 4, 1, 7, 5, 4, 4, 5, 3, 4, 4,
       3, 5, 6, 6, 4, 5, 5, 6, 5, 5, 7, 5, 6, 6, 6, 5, 6, 6, 5, 7, 6, 6,
       6, 7, 1, 7, 6, 7, 7, 4, 7, 7, 2, 4, 5, 3, 4,

In [8]:
list(df[df['Rank']==1]['Column_name'])

['INCOME',
 'POP90C1',
 'POP90C2',
 'POP90C3',
 'ETH10',
 'DW3',
 'DW5',
 'HV4',
 'RHP4',
 'HUPA1',
 'HUPA2',
 'MC1',
 'MC2',
 'TPE3',
 'TPE4',
 'TPE5',
 'TPE6',
 'ANC6',
 'HC15',
 'CARDPM12']


#### Which one is better for our case study? 