--------------<Feature Selection>------------
1. Definition
    The process of selecting the most significant features from a given dataset.

2. Importance
   It enables the machine learning algorithm to train faster.
   It reduces the complexity of a model and makes it easier to interpret.
   It improves the accuracy of a model if the right subset is chosen.
   It reduces overfitting.

3. Topics
    1) Difference between feature selection and dimensionality reduction
    2) Different types of feature selection methods
    3) Implementation of different feature selection methods with scikit-learn  

#----🔥Feature Selection VS. Dimensionality Reduction------
1) Both methods reduce the number of attributes in the dataset.
2) Dimensionality reduction: creates new combinations of attributes (sometimes known as feature transformation).
   Examples: Principal Component Analysis, Singular Value Decomposition, Linear Discriminant Analysis.
3) Feature selection: includes or excludes attributes present in the data without changing them.
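
The difference can be checked directly. The following is a minimal sketch on synthetic data (the array names and the random data are assumptions made for illustration, not part of the case study below): the selected columns are exact copies of original columns, while the PCA columns are new combinations.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
# Hypothetical label that depends only on columns 0 and 3
y_demo = (X_demo[:, 0] + X_demo[:, 3] > 0).astype(int)

# Feature selection: keep 2 of the 5 original columns, unchanged
selector = SelectKBest(score_func=f_classif, k=2).fit(X_demo, y_demo)
X_selected = selector.transform(X_demo)
kept = selector.get_support(indices=True)

# Dimensionality reduction: build 2 new columns from all 5
X_pca = PCA(n_components=2).fit_transform(X_demo)

# Selected columns are copies of original columns
selection_is_subset = np.array_equal(X_selected, X_demo[:, kept])
# PCA columns generally do not coincide with any original column
pca_matches_original = any(
    np.allclose(X_pca[:, i], X_demo[:, j]) for i in range(2) for j in range(5)
)
print(kept, selection_is_subset, pca_matches_original)
```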


# 3 Methods: Filter methods, Wrapper methods, and Embedded methods.

-------------Method1: Filter Method (data preprocessing step)--------------
#Filter out irrelevant features before the classification process starts.
 ✨Features are ranked by statistical scores that measure each feature's correlation with the outcome variable.
 ⚠️ Correlation measures for different types of data

    Feature \ Response    Continuous               Categorical
    Continuous            Pearson's correlation    LDA
    Categorical           ANOVA                    Chi-square

       Assumptions:
         1) Pearson's correlation: independent observations, linear relationship

 ⚠️Examples: chi-squared test, information gain, and correlation coefficient scores
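
As a rough illustration of matching the test to the data type, the sketch below scores synthetic features against a categorical response: an ANOVA F-test (`f_classif`) for continuous features and a chi-squared test (`chi2`) for non-negative count features. All data and variable names here are assumptions made for the example.

```python
import numpy as np
from sklearn.feature_selection import chi2, f_classif

rng = np.random.default_rng(1)
y_demo = rng.integers(0, 2, size=200)

# Continuous features: one shifted by the class, one pure noise
informative = rng.normal(loc=y_demo * 2.0, scale=1.0)
noise = rng.normal(size=200)
X_cont = np.column_stack([informative, noise])
f_scores, f_pvalues = f_classif(X_cont, y_demo)

# Count features (chi2 requires non-negative values):
# one whose rate depends on the class, one that does not
counts = rng.poisson(lam=1 + 4 * y_demo)
noise_counts = rng.poisson(lam=3, size=200)
X_counts = np.column_stack([counts, noise_counts])
chi_scores, chi_pvalues = chi2(X_counts, y_demo)

print("ANOVA F scores:", f_scores)
print("Chi-squared scores:", chi_scores)
```

In both cases the class-dependent feature should receive the higher score, which is the ranking a filter method would use.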

---------Method2: Wrapper Method--------------
 ✨Searches for the feature subset best suited to the machine learning algorithm and aims to improve mining performance. Candidate subsets are evaluated by predictive accuracy for classification tasks and by goodness of cluster for clustering tasks.
 
 🔥Examples:
 1) Forward Selection: The procedure starts with an empty set of features [reduced set]. The best of the original features is determined and added to the reduced set. At each subsequent iteration, the best of the remaining original attributes is added to the set.

2) Backward Elimination: The procedure starts with the full set of attributes. At each step, it removes the worst attribute remaining in the set.

3) Combination of forward selection and backward elimination: 
The stepwise forward selection and backward elimination methods can be combined so that, at each step, the procedure selects the best attribute and removes the worst from among the remaining attributes.

4) Recursive Feature Elimination (RFE): a greedy search for the best-performing feature subset.
   1. Iteratively builds models and, at each iteration, sets aside the best or eliminates the worst-performing feature.
   2. Builds the next model with the remaining features until all features have been explored.
   3. Ranks the features by the order of their elimination. Because the search is greedy, removing one feature per iteration needs on the order of N model fits for N features; an exhaustive search over all subsets would need 2^N.
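
Forward selection and backward elimination (items 1 and 2 above) can be sketched with scikit-learn's `SequentialFeatureSelector`, assuming a release that ships it (0.24 or later). The synthetic data, the choice of logistic regression, and `cv=3` are assumptions for illustration only.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X_demo = rng.normal(size=(150, 6))
# Hypothetical label driven by columns 1 and 4 only
y_demo = (X_demo[:, 1] - X_demo[:, 4] > 0).astype(int)

estimator = LogisticRegression(max_iter=1000)

# Forward: start from the empty set and add the best feature each round
forward = SequentialFeatureSelector(
    estimator, n_features_to_select=2, direction="forward", cv=3
).fit(X_demo, y_demo)

# Backward: start from all features and drop the worst each round
backward = SequentialFeatureSelector(
    estimator, n_features_to_select=2, direction="backward", cv=3
).fit(X_demo, y_demo)

print("forward keeps:", forward.get_support(indices=True))
print("backward keeps:", backward.get_support(indices=True))
```

Each candidate subset is scored by cross-validating the estimator, which is why wrapper methods cost more than filter methods.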
 
----------Method3: Embedded Method------------------
1) Performs feature selection during model training: at each training iteration, the features that contribute most to that iteration are retained.
2) Introduces additional constraints into the optimization of a predictive algorithm (such as a regression algorithm) that bias the model toward lower complexity (fewer or smaller coefficients).
3) Examples: regularization (penalization) methods
          --> penalize coefficients so that weak features fall below a coefficient threshold
          e.g., LASSO,
          Elastic Net,
          Ridge Regression (helps build a parsimonious model when the number of predictor variables exceeds the number of observations, or when the data set has multicollinearity; it shrinks coefficients but does not set them exactly to zero).
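
A small sketch of how the penalties differ in practice, on synthetic data where only two of eight features carry signal (the data, the alpha values, and the variable names are assumptions chosen for illustration): the L1 penalty of LASSO sets weak coefficients exactly to zero, while the L2 penalty of Ridge only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(3)
X_demo = rng.normal(size=(200, 8))
# Only the first two features influence the target
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.3).fit(X_demo, y_demo)
ridge_demo = Ridge(alpha=1.0).fit(X_demo, y_demo)

# Count coefficients that are exactly zero under each penalty
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge_demo.coef_ == 0))

print("Lasso coefficients:", np.round(lasso.coef_, 3))
print("Ridge coefficients:", np.round(ridge_demo.coef_, 3))
print("exact zeros (lasso, ridge):", n_zero_lasso, n_zero_ridge)
```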
          
          
-----------Extra: Some thinking about those methods❓---------
filter method: 
       1) does not involve the ML model; 
       2) may fail to find the best subset of features when there is not enough data to model the statistical correlation of the features
       
wrapper method:
       1) involves the ML model --> may lead to overfitting
       2) can find a good subset for that model even with limited data, but is not guaranteed to find the best one;
   

In [4]:
#---------Case Study in Python-----------
import pandas as pd
import numpy as np
# load data
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pd.read_csv(url, names=names)

array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
array

array([[  6.   , 148.   ,  72.   , ...,   0.627,  50.   ,   1.   ],
       [  1.   ,  85.   ,  66.   , ...,   0.351,  31.   ,   0.   ],
       [  8.   , 183.   ,  64.   , ...,   0.672,  32.   ,   1.   ],
       ...,
       [  5.   , 121.   ,  72.   , ...,   0.245,  30.   ,   0.   ],
       [  1.   , 126.   ,  60.   , ...,   0.349,  47.   ,   1.   ],
       [  1.   ,  93.   ,  70.   , ...,   0.315,  23.   ,   0.   ]])

In [21]:
#---------Method1:Filter Method-----------
#-----1.1. Use Variance------
from sklearn.feature_selection import VarianceThreshold
VarianceThreshold(threshold=3).fit_transform(X)

array([[  6. , 148. ,  72. , ...,   0. ,  33.6,  50. ],
       [  1. ,  85. ,  66. , ...,   0. ,  26.6,  31. ],
       [  8. , 183. ,  64. , ...,   0. ,  23.3,  32. ],
       ...,
       [  5. , 121. ,  72. , ..., 112. ,  26.2,  30. ],
       [  1. , 126. ,  60. , ...,   0. ,  30.1,  47. ],
       [  1. ,  93. ,  70. , ...,   0. ,  30.4,  23. ]])
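
To see what the threshold does, here is a tiny hand-made example (the matrix is an assumption made for illustration). `VarianceThreshold` computes the population variance of each column and keeps only the columns whose variance exceeds the threshold; note that variance depends on the scale of each feature.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X_demo = np.array([
    [0.0, 1.0, 10.0],
    [0.0, 2.0, 20.0],
    [0.0, 1.0, 30.0],
    [0.0, 2.0, 40.0],
])

selector = VarianceThreshold(threshold=3).fit(X_demo)
print(selector.variances_)     # per-column variances: 0, 0.25, 125
print(selector.get_support())  # only the third column exceeds 3
```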

In [31]:
#------1.2. Use Correlation Coefficient / Statistical Tests--------
# SelectKBest can be used with a suite of different statistical tests
# --> to select a specific number of features
# Import the necessary libraries first
from sklearn.feature_selection import SelectKBest, chi2
from scipy.stats import pearsonr

# SelectKBest calls score_func(X, y) on the whole matrix and expects
# (scores, p-values) arrays, so pearsonr is applied column by column.
def pearson_scores(X, y):
    results = [pearsonr(X[:, j], y) for j in range(X.shape[1])]
    r_values, p_values = zip(*results)
    return np.abs(r_values), np.array(p_values)

#Test1: Use Pearson
test1 = SelectKBest(score_func=pearson_scores, k=4)
fit1 = test1.fit(X, Y)
np.set_printoptions(precision=3)
print(fit1.scores_)
features_T1 = fit1.transform(X)
# Summarize selected features
print(features_T1[0:5,:])

#Test2: Use Chi-square (requires non-negative features)
test2 = SelectKBest(score_func=chi2, k=4)
fit2 = test2.fit(X, Y)

# Summarize scores for each attribute
print(fit2.scores_)

features_T2 = fit2.transform(X)
# Summarize selected features
print(features_T2[0:5,:])


Recorded output of Test2 (chi-squared):
[ 111.52  1411.887   17.605   53.108 2175.565  127.669    5.393  181.304]
[[148.    0.   33.6  50. ]
 [ 85.    0.   26.6  31. ]
 [183.    0.   23.3  32. ]
 [ 89.   94.   28.1  21. ]
 [137.  168.   43.1  33. ]]


In [32]:
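#Test3: Maximal Information Coefficient (MIC, a mutual-information-based score)
#⚠️Measures the strength of association (linear or non-linear) between each feature and Y
#Requires the third-party minepy package
from sklearn.feature_selection import SelectKBest
from minepy import MINE

def mic(x, y):
     # MIC between one feature column and the target
     m = MINE()
     m.compute_score(x, y)
     return m.mic()

def mic_scores(X, y):
     # SelectKBest passes the whole matrix, so score each column separately
     scores = np.array([mic(X[:, j], y) for j in range(X.shape[1])])
     return scores, np.full(X.shape[1], 0.5) #0.5: fixed placeholder p-value

test3 = SelectKBest(score_func=mic_scores, k=4)
fit3 = test3.fit(X, Y)

print(fit3.scores_)
features_T3 = fit3.transform(X)
# Summarize selected features
print(features_T3[0:5,:])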

ModuleNotFoundError: No module named 'minepy'
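
If minepy is not available, one possible substitute (not the MIC statistic itself) is scikit-learn's own mutual information estimator, `mutual_info_classif`. The sketch below uses synthetic data with a non-linear dependence; the data and the wrapper name `mi_scores` are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(4)
X_demo = rng.normal(size=(300, 4))
# Non-linear dependence: the label depends on |x0| only,
# which a linear correlation coefficient would largely miss
y_demo = (np.abs(X_demo[:, 0]) > 1.0).astype(int)

def mi_scores(X, y):
    # Fix random_state so the estimate is reproducible;
    # SelectKBest accepts a score_func that returns scores only
    return mutual_info_classif(X, y, random_state=0)

selector = SelectKBest(score_func=mi_scores, k=1).fit(X_demo, y_demo)
best = selector.get_support(indices=True)
print(selector.scores_)
print("selected column:", best)
```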

In [14]:
#--------------Method2:Wrapper Methods----------
#1. Recursive Feature Elimination 
#How it works: 
#(1)Recursively removes attributes and builds a model on those attributes that remain
#(2)Uses the fitted model's coefficients (or feature importances) to identify which attributes 
# (and combination of attributes) contribute the most to predicting the target attribute

# Import your necessary dependencies
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Feature extraction
# liblinear was the default solver in older scikit-learn releases; setting it
# explicitly is an assumption about the original run and avoids convergence warnings
model = LogisticRegression(solver='liblinear')
rfe = RFE(model, n_features_to_select=3)
fit = rfe.fit(X, Y)
print("Num Features: %s" % (fit.n_features_))
print("Selected Features: %s" % (fit.support_))
print("Feature Ranking: %s" % (fit.ranking_))
#🔥Top 3 features as preg, mass, and pedi.

Num Features: 3
Selected Features: [ True False False False False  True  True False]
Feature Ranking: [1 2 3 5 6 1 1 4]
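
The boolean mask and ranking are easier to read once mapped back to column names. A small sketch using the recorded output above (the mask and ranking are copied from it by hand; `feature_names` is the `names` list without the `class` column):

```python
import numpy as np

feature_names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']
support = np.array([True, False, False, False, False, True, True, False])
ranking = np.array([1, 2, 3, 5, 6, 1, 1, 4])

# Names of the features RFE kept (rank 1)
selected = [name for name, keep in zip(feature_names, support) if keep]
# All features ordered from most to least important; a stable sort
# keeps the original column order among ties
ordered = [feature_names[i] for i in np.argsort(ranking, kind="stable")]

print(selected)  # ['preg', 'mass', 'pedi']
print(ordered)   # ['preg', 'mass', 'pedi', 'plas', 'pres', 'age', 'skin', 'test']
```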




In [18]:
#-----Method3: Embedded Method-------
# Try: Ridge regression, which is a regularization technique
from sklearn.linear_model import Ridge
ridge = Ridge(alpha=1.0)
ridge.fit(X,Y)

Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
      normalize=False, random_state=None, solver='auto', tol=0.001)

In [19]:
# A helper method for pretty-printing the coefficients
def pretty_print_coefs(coefs, names=None, sort=False):
    # Default names are X0, X1, ... when none are given
    if names is None:
        names = ["X%s" % x for x in range(len(coefs))]
    lst = zip(coefs, names)
    if sort:
        # Largest absolute coefficient first
        lst = sorted(lst, key=lambda x: -np.abs(x[0]))
    return " + ".join("%s * %s" % (round(coef, 3), name)
                                   for coef, name in lst)

print("Ridge model:", pretty_print_coefs(ridge.coef_))

Ridge model: 0.021 * X0 + 0.006 * X1 + -0.002 * X2 + 0.0 * X3 + -0.0 * X4 + 0.013 * X5 + 0.145 * X6 + 0.003 * X7


In [39]:
#More Thoughts about Embedded Methods: L1 and L2
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

#Use Logistic Regression with L1 for Feature Selection
#(the L1 penalty needs a solver that supports it, e.g. liblinear or saga):
Selection1=SelectFromModel(LogisticRegression(penalty="l1", C=0.1, solver="liblinear")).fit_transform(X, Y)

#Use L1+L2 to construct a new LR model:
class LR(LogisticRegression):
    def __init__(self, threshold=0.01, dual=False, tol=1e-4, C=1.0,
                 fit_intercept=True, intercept_scaling=1, class_weight=None,
                 random_state=None, solver='liblinear', max_iter=100,
                 multi_class='ovr', verbose=0, warm_start=False, n_jobs=1):
        self.threshold = threshold
        LogisticRegression.__init__(self, penalty='l1', dual=dual, tol=tol, C=C,
                 fit_intercept=fit_intercept, intercept_scaling=intercept_scaling, class_weight=class_weight,
                 random_state=random_state, solver=solver, max_iter=max_iter,
                 multi_class=multi_class, verbose=verbose, warm_start=warm_start, n_jobs=n_jobs)
        # Companion L2 model with the same settings, used to find correlated features
        self.l2 = LogisticRegression(penalty='l2', dual=dual, tol=tol, C=C,
                 fit_intercept=fit_intercept, intercept_scaling=intercept_scaling, class_weight=class_weight,
                 random_state=random_state, solver=solver, max_iter=max_iter,
                 multi_class=multi_class, verbose=verbose, warm_start=warm_start, n_jobs=n_jobs)

    def fit(self, X, y, sample_weight=None):
        # Fit the L1 model (this class) and keep a copy of its sparse coefficients
        super(LR, self).fit(X, y, sample_weight=sample_weight)
        self.coef_old_ = self.coef_.copy()
        # Fit the L2 model on the same data
        self.l2.fit(X, y, sample_weight=sample_weight)
        cntOfRow, cntOfCol = self.coef_.shape

        for i in range(cntOfRow):
            for j in range(cntOfCol):
                coef = self.coef_[i][j]
                if coef != 0:
                    idx = [j]
                    coef1 = self.l2.coef_[i][j]
                    for k in range(cntOfCol):
                        coef2 = self.l2.coef_[i][k]
                        # Feature k joins feature j's group if their L2 coefficients differ by
                        # less than the threshold and the L1 model zeroed out feature k
                        if abs(coef1-coef2) < self.threshold and j != k and self.coef_[i][k] == 0:
                            idx.append(k)
                    # Share the L1 coefficient equally across the group
                    mean = coef / len(idx)
                    self.coef_[i][idx] = mean
        return self

#threshold is the difference between two coefficients        
Selection2=SelectFromModel(LR(threshold=0.5, C=0.1)).fit_transform(X, Y)




#⚠️Some tips when applying Ridge Regression:
1. It is also known as L2 regularization.
2. Correlated features tend to get similar coefficients.
3. Features with coefficients close to zero contribute little to the prediction; the sign only gives the direction of the effect. When you are dealing with many features, these coefficient magnitudes can help in the final feature selection decision, provided the features are on comparable scales.
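
Tip 2 can be illustrated with an extreme case: two identical copies of the same feature. This is a sketch on synthetic data (all names and values are assumptions for the example); Ridge splits the weight evenly between the two copies instead of picking one.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(5)
x = rng.normal(size=200)
# Two identical copies of the same signal plus an unrelated feature
X_demo = np.column_stack([x, x, rng.normal(size=200)])
y_demo = 2.0 * x + rng.normal(scale=0.1, size=200)

ridge_demo = Ridge(alpha=1.0).fit(X_demo, y_demo)
# The two copies share the true weight of 2.0, so each lands near 1.0
print(np.round(ridge_demo.coef_, 3))
```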

In [40]:
#-----3.3.Based on Tree-----
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import GradientBoostingClassifier

#Use GBDT
SelectFromModel(GradientBoostingClassifier()).fit_transform(X, Y)


array([[148. ,  33.6,  50. ],
       [ 85. ,  26.6,  31. ],
       [183. ,  23.3,  32. ],
       ...,
       [121. ,  26.2,  30. ],
       [126. ,  30.1,  47. ],
       [ 93. ,  30.4,  23. ]])
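
For a tree-based estimator with no explicit `threshold`, `SelectFromModel` keeps the features whose importance is at least the mean importance. A sketch on synthetic data (data and names are assumptions for illustration) that makes this rule visible:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(6)
X_demo = rng.normal(size=(200, 5))
# Hypothetical label that depends on column 2 only
y_demo = (X_demo[:, 2] > 0).astype(int)

selector = SelectFromModel(GradientBoostingClassifier(random_state=0)).fit(X_demo, y_demo)
importances = selector.estimator_.feature_importances_
mask = selector.get_support()

print("importances:", np.round(importances, 3))
print("threshold (mean importance):", round(float(selector.threshold_), 3))
print("kept columns:", np.flatnonzero(mask))
```

Passing `threshold="median"` or a numeric value to `SelectFromModel` changes how many features survive.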

In [None]:
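#------🔥4. Addition: Reducing Dimensionality with PCA + LDA--------
from sklearn.decomposition import PCA
PCA(n_components=2).fit_transform(X)

# The old sklearn.lda module was removed; LDA now lives in sklearn.discriminant_analysis
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
# LDA yields at most n_classes - 1 components, and this dataset has 2 classes
LDA(n_components=1).fit_transform(X, Y)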
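
A self-contained sketch of both transforms on synthetic two-class data (the data and variable names are assumptions for illustration). PCA ignores the labels and reports how much variance each new component explains; LDA uses the labels and can return at most `n_classes - 1` components.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(7)
X_demo = rng.normal(size=(120, 6))
y_demo = (X_demo[:, 0] > 0).astype(int)  # two classes

# Unsupervised: two new components built from all six columns
pca = PCA(n_components=2).fit(X_demo)
X_pca = pca.transform(X_demo)
print("explained variance ratio:", np.round(pca.explained_variance_ratio_, 3))

# Supervised: with two classes only one discriminant component is available
X_lda = LinearDiscriminantAnalysis(n_components=1).fit_transform(X_demo, y_demo)
print(X_pca.shape, X_lda.shape)
```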


Reference
1. DataCamp: Beginner's Guide to Feature Selection in Python
https://www.datacamp.com/community/tutorials/feature-selection-python
2. Buntine, Wray (et al.), Subspace, Latent Structure, and Feature Selection: Statistical and Optimization Perspectives Workshop
3. Feature Selection using Genetic Algorithms: https://topepo.github.io/caret/feature-selection-using-genetic-algorithms.html
4.  https://www.zhihu.com/question/28641663/answer/139203996