## Feature selection

Feature selection means choosing the best subset of features from a larger set. It brings several benefits, and there are many ways to do it.

### Intro

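Feature selection, also called variable selection, is the process of choosing a subset of relevant features (variables) from all the features in a dataset in order to build a machine learning model.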

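Its benefits include:

    - Potentially better accuracy
    - Simpler models that are easier to interpret
    - Shorter training time
    - Better generalization through reduced overfitting
    - Easier implementation
    - Lower risk of data errors when the model is in use (irrelevant or noisy features can introduce prediction errors)
    - Handling of redundant variables (dropping features that contribute little avoids unnecessary computation)
    - Avoiding poor learning behavior in high-dimensional spaces (with many features, a model can suffer from the curse of dimensionality; feature selection focuses the model on the most informative features, which can improve overall performance)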

Methods:
    Feature selection techniques fall into three main families: filter methods, wrapper methods, and embedded methods:

    1. Filter methods
        - Basic methods
        - Univariate methods
        - Information gain
        - Fisher score (F statistic)
        - Correlation matrix with heatmap

    2. Wrapper methods
        - Forward selection
        - Backward elimination
        - Exhaustive feature selection
        - Recursive feature elimination
        - Recursive feature elimination with cross-validation

    3. Embedded methods
        - Lasso
        - Ridge
        - Tree importance
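As a rough illustration of how the three families differ in practice, the sketch below applies one representative of each to the same data. The synthetic dataset, the choice of estimators, and the variable names are assumptions made for this example only; they are not part of the notebook's own workflow.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Synthetic data (assumption for illustration): 20 features, 5 of them informative.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, n_redundant=2, random_state=0
)

# Filter: score each feature with a univariate statistic (ANOVA F-value); no model is trained.
filter_sel = SelectKBest(score_func=f_classif, k=5).fit(X, y)

# Wrapper: repeatedly fit a model and drop the weakest feature (recursive feature elimination).
wrapper_sel = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)

# Embedded: selection is a by-product of training; here, tree-based feature importances
# with SelectFromModel's default "mean" importance threshold.
embedded_sel = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=0)
).fit(X, y)

for name, selector in [
    ("filter", filter_sel),
    ("wrapper", wrapper_sel),
    ("embedded", embedded_sel),
]:
    print(name, int(selector.get_support().sum()), "features kept")
```

All three selectors expose the same `get_support()` mask, so their choices can be compared directly even though they reach them in different ways.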

In [10]:
'''

Attribute Information: (classes: edible=e, poisonous=p)

cap-shape:
bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s

cap-surface:
fibrous=f, grooves=g, scaly=y, smooth=s

cap-color:
brown=n, buff=b, cinnamon=c, gray=g, green=r, pink=p, purple=u, red=e, white=w, yellow=y

bruises:
bruises=t, no=f

odor:
almond=a, anise=l, creosote=c, fishy=y, foul=f, musty=m, none=n, pungent=p, spicy=s

gill-attachment:
attached=a, descending=d, free=f, notched=n

gill-spacing:
close=c, crowded=w, distant=d

gill-size:
broad=b, narrow=n

gill-color:
black=k, brown=n, buff=b, chocolate=h, gray=g, green=r, orange=o, pink=p, purple=u, red=e, white=w, yellow=y

stalk-shape:
enlarging=e, tapering=t

stalk-root:
bulbous=b, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r, missing=?

stalk-surface-above-ring:
fibrous=f, scaly=y, silky=k, smooth=s

stalk-surface-below-ring:
fibrous=f, scaly=y, silky=k, smooth=s

stalk-color-above-ring:
brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y

stalk-color-below-ring:
brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y

veil-type:
partial=p, universal=u

veil-color:
brown=n, orange=o, white=w, yellow=y

ring-number:
none=n, one=o, two=t

ring-type:
cobwebby=c, evanescent=e, flaring=f, large=l, none=n, pendant=p, sheathing=s, zone=z

spore-print-color:
black=k, brown=n, buff=b, chocolate=h, green=r, orange=o, purple=u, white=w, yellow=y

population:
abundant=a, clustered=c, numerous=n, scattered=s, several=v, solitary=y

habitat:
grasses=g, leaves=l, meadows=m, paths

'''


In [11]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

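## 1. Filter methods

Filter methods are typically used as a preprocessing step in which features are selected independently of any machine learning algorithm. Instead, features are chosen according to their scores in statistical tests of their relationship with the target variable. These methods have the following characteristics: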

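    - Rely on the characteristics of the features themselves
    - Do not use a machine learning algorithm
    - Model-agnostic
    - Computationally cheap
    - Usually give lower predictive performance than wrapper methods
    - Well suited to quickly screening out irrelevant features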

Filter methods include the following techniques:

    - Basic methods
    - Univariate feature selection
    - Information gain
    - Fisher score
    - ANOVA F-value for feature selection
    - Correlation matrix with heatmap

The usual workflow for a filter method is:

Set of all features → select the best subset → train the algorithm → evaluate performance
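The four stages of this workflow can be sketched as a small scikit-learn pipeline. This is a minimal example under stated assumptions: the data is synthetic, the scoring function (`f_classif`) and `k=10` are arbitrary picks, and names such as `pipe` and `accuracy` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Stage 1: the set of all features (synthetic, 30 columns).
X, y = make_classification(
    n_samples=600, n_features=30, n_informative=6, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# Stage 2 (select the best subset) and stage 3 (train the algorithm) in one pipeline.
# The filter step scores features without consulting the downstream model.
pipe = make_pipeline(
    SelectKBest(score_func=f_classif, k=10),
    LogisticRegression(max_iter=1000),
)
pipe.fit(X_tr, y_tr)

# Stage 4: evaluate performance on held-out data.
accuracy = pipe.score(X_te, y_te)
print(f"held-out accuracy using 10 of 30 features: {accuracy:.3f}")
```

Putting the selector inside the pipeline means it is fitted only on the data passed to `fit`, which keeps the evaluation data out of the selection step.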

### 1.1 Basic methods

The basic methods remove constant and quasi-constant features.

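#### 1.1.1 Remove constant features

A constant feature has the same value (a single value) for every sample in the dataset. Such a feature carries no information that would let a machine learning model discriminate between or predict the target.

Identifying and removing constant features is a first step toward feature selection and more interpretable machine learning models. To identify them, we can use the VarianceThreshold class from sklearn.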
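To make the idea concrete before moving to real data, here is a small sketch on a made-up table. The `toy` DataFrame and its column names are hypothetical; the pandas `nunique` check is included only as an independent cross-check of what VarianceThreshold reports.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy data: columns "b" and "d" never change.
toy = pd.DataFrame(
    {
        "a": [1, 2, 3, 4],
        "b": [7, 7, 7, 7],          # constant
        "c": [0.5, 0.1, 0.9, 0.3],
        "d": [0, 0, 0, 0],          # constant
    }
)

# threshold=0 keeps only the features whose variance is strictly greater than zero.
selector = VarianceThreshold(threshold=0).fit(toy)
mask = selector.get_support()

kept = toy.columns[mask].tolist()
dropped = toy.columns[~mask].tolist()

# Cross-check with plain pandas: a constant column has exactly one distinct value.
constant_by_nunique = [col for col in toy.columns if toy[col].nunique() == 1]

print("kept:", kept)
print("dropped:", dropped)
print("constant (nunique check):", constant_by_nunique)
```

The `nunique` check also works for non-numeric columns, which VarianceThreshold cannot handle directly.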

Next, we use the Santander customer satisfaction data to show how to identify constant features.

In [12]:
X_train = pd.read_csv('s_train.csv', nrows = 35000)
X_test = pd.read_csv('s_test.csv', nrows = 15000)

X_train.drop(['TARGET'], axis = 1, inplace = True)

X_train

Unnamed: 0,ID,var3,var15,imp_ent_var16_ult1,imp_op_var39_comer_ult1,imp_op_var39_comer_ult3,imp_op_var40_comer_ult1,imp_op_var40_comer_ult3,imp_op_var40_efect_ult1,imp_op_var40_efect_ult3,...,saldo_medio_var29_ult3,saldo_medio_var33_hace2,saldo_medio_var33_hace3,saldo_medio_var33_ult1,saldo_medio_var33_ult3,saldo_medio_var44_hace2,saldo_medio_var44_hace3,saldo_medio_var44_ult1,saldo_medio_var44_ult3,var38
0,1,2,23,0.0,0.00,0.00,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,39205.170000
1,3,2,34,0.0,0.00,0.00,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,49278.030000
2,4,2,23,0.0,0.00,0.00,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,67333.770000
3,8,2,37,0.0,195.00,195.00,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,64007.970000
4,10,2,39,0.0,0.00,0.00,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,117310.979016
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
34995,69974,2,48,0.0,0.00,0.00,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,98642.430000
34996,69976,2,65,0.0,0.00,0.00,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,128930.100000
34997,69977,2,23,0.0,0.00,0.00,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,117310.979016
34998,69981,2,28,0.0,0.00,0.00,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,114747.060000


In [13]:
X_test

Unnamed: 0,ID,var3,var15,imp_ent_var16_ult1,imp_op_var39_comer_ult1,imp_op_var39_comer_ult3,imp_op_var40_comer_ult1,imp_op_var40_comer_ult3,imp_op_var40_efect_ult1,imp_op_var40_efect_ult3,...,saldo_medio_var29_ult3,saldo_medio_var33_hace2,saldo_medio_var33_hace3,saldo_medio_var33_ult1,saldo_medio_var33_ult3,saldo_medio_var44_hace2,saldo_medio_var44_hace3,saldo_medio_var44_ult1,saldo_medio_var44_ult3,var38
0,2,2,32,0.0,0.0,0.0,0.0,0.0,0,0,...,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,40532.100000
1,5,2,35,0.0,0.0,0.0,0.0,0.0,0,0,...,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,45486.720000
2,6,2,23,0.0,0.0,0.0,0.0,0.0,0,0,...,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,46993.950000
3,7,2,24,0.0,0.0,0.0,0.0,0.0,0,0,...,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,187898.610000
4,9,2,23,0.0,0.0,0.0,0.0,0.0,0,0,...,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,73649.730000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14995,29822,2,23,0.0,0.0,0.0,0.0,0.0,0,0,...,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,117310.979016
14996,29824,2,23,0.0,0.0,0.0,0.0,0.0,0,0,...,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,141868.410000
14997,29825,2,53,0.0,0.0,0.0,0.0,0.0,0,0,...,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,55323.540000
14998,29827,2,23,0.0,0.0,0.0,0.0,0.0,0,0,...,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,65211.420000


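#### Important!

In every feature selection procedure, it is good practice to select features by examining the training set only. This avoids overfitting, for example by preventing information from the test set from leaking into training.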
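A minimal sketch of this practice, assuming synthetic data and hypothetical variable names: the selector is fitted on the training split alone, and the test split is only transformed with the already-fitted selector.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): 5 features, with column 2 forced to be constant.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 2] = 1.0
y = rng.integers(0, 2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

selector = VarianceThreshold(threshold=0)

# Learn which features to keep from the training data only.
X_tr_sel = selector.fit_transform(X_tr)

# Apply the same decision to the test data; the selector is never refitted on it.
X_te_sel = selector.transform(X_te)

print(X_tr_sel.shape, X_te_sel.shape)
```

Calling `fit` on the test split, or on the full data before splitting, would let test-set statistics influence which features are kept.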

In [15]:
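# sklearn's VarianceThreshold is a simple baseline approach to feature selection.
# It removes every feature whose variance does not exceed a given threshold. By
# default it removes all zero-variance features, i.e. features that have the same
# value in every sample.

from sklearn.feature_selection import VarianceThreshold

sel = VarianceThreshold(threshold=0)
sel.fit(X_train)  # fit finds the features with zero variance

# get_support() returns a boolean mask marking the features that are kept,
# so counting the selected columns gives the number of non-constant features

len(X_train.columns[sel.get_support()])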

319

So 370 - 319 = 51 variables are constant. Next we use the transform method to reduce the number of features in the training and test sets.

In [16]:
X_train = sel.transform(X_train)
X_test = sel.transform(X_test)

X_train.shape

(35000, 319)

In [17]:
X_test.shape

(15000, 319)

Removing the constant features shrinks the feature space considerably!
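One practical detail: the transform method returns a NumPy array by default, so the column names are lost after the step above. If the names are needed later, one option is to rebuild a DataFrame from the selector's mask. The sketch below uses a hypothetical toy frame (`frame`, `sel_demo`) rather than the Santander data.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy frame; "x2" is constant.
frame = pd.DataFrame(
    {"x1": [1.0, 2.0, 3.0], "x2": [5.0, 5.0, 5.0], "x3": [0.0, 1.0, 0.0]}
)

sel_demo = VarianceThreshold(threshold=0).fit(frame)

# Look up the surviving column names before they are lost in the transform.
kept_columns = frame.columns[sel_demo.get_support()]

# transform() yields a plain array; wrap it back into a labeled DataFrame.
reduced = pd.DataFrame(
    sel_demo.transform(frame), columns=kept_columns, index=frame.index
)

print(reduced.columns.tolist())
```

The same pattern should carry over to the training and test sets, as long as the column names are read from the original DataFrame before it is overwritten.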

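#### 1.1.2 Remove quasi-constant features

A quasi-constant feature shows the same value for the vast majority of samples in the dataset. In general such features carry little information, if any, for discriminating between or predicting the target, but there are exceptions, so we should be careful when removing them.

To remove quasi-constant features, we again use VarianceThreshold from sklearn.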
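As a hedged sketch of the idea, the example below raises the threshold above zero on made-up data. The threshold value of 0.01, the `demo` frame, and its column names are assumptions for illustration; a suitable threshold depends on the scale of each feature.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

n = 1000
rng = np.random.default_rng(1)

# Hypothetical data: "mostly_zero" takes the value 0 in 99.5% of the rows.
demo = pd.DataFrame(
    {
        "mostly_zero": np.r_[np.ones(5), np.zeros(n - 5)],   # variance = 0.005 * 0.995
        "balanced": np.r_[np.ones(500), np.zeros(500)],      # variance = 0.25
        "noise": rng.normal(size=n),                          # variance close to 1
    }
)

# A threshold above zero also removes features that are almost constant.
quasi_sel = VarianceThreshold(threshold=0.01).fit(demo)
quasi_kept = demo.columns[quasi_sel.get_support()].tolist()

# Share of the most frequent value, a scale-independent way to inspect a feature.
dominant_share = demo["mostly_zero"].value_counts(normalize=True).iloc[0]

print("kept:", quasi_kept)
print("dominant share of mostly_zero:", dominant_share)
```

Because variance depends on units, inspecting the share of the most frequent value can be a useful complement before deciding to drop a feature.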