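# Chapter 5 Extracting Features with Transformers

This chapter discusses how to extract numerical and categorical features from a dataset and how to select the best of them, provided the dataset actually contains such features. It also introduces common patterns and techniques for feature extraction.

The chapter covers the following topics:
- Extracting features from datasets
- Creating new features
- Selecting good features
- Creating transformers to process datasets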

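## 5.1 Feature Extraction

Feature extraction is one of the most important steps in a data mining task. In general, it affects the final result more than the choice of data mining algorithm does.

#### 5.1.1 Representing Reality in Models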

In [2]:
import os
import pandas as pd

In [7]:
adult_filename = './Data/adult.data'
adult = pd.read_csv(adult_filename, header=None,
                    names=["Age", "Work-Class", "fnlwgt",
                          "Education", "Education-Num",
                          "Marital-Status", "Occupation",
                           "Relationship", "Race", "Sex",
                           "Capital-gain", "Capital-loss",
                           "Hours-per-week", "Native-Country",
                           "Earnings-Raw"])

In [8]:
adult.columns

Index(['Age', 'Work-Class', 'fnlwgt', 'Education', 'Education-Num',
       'Marital-Status', 'Occupation', 'Relationship', 'Race', 'Sex',
       'Capital-gain', 'Capital-loss', 'Hours-per-week', 'Native-Country',
       'Earnings-Raw'],
      dtype='object')

In [9]:
adult.tail()

| | Age | Work-Class | fnlwgt | Education | Education-Num | Marital-Status | Occupation | Relationship | Race | Sex | Capital-gain | Capital-loss | Hours-per-week | Native-Country | Earnings-Raw |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 32556 | 27 | Private | 257302 | Assoc-acdm | 12 | Married-civ-spouse | Tech-support | Wife | White | Female | 0 | 0 | 38 | United-States | <=50K |
| 32557 | 40 | Private | 154374 | HS-grad | 9 | Married-civ-spouse | Machine-op-inspct | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
| 32558 | 58 | Private | 151910 | HS-grad | 9 | Widowed | Adm-clerical | Unmarried | White | Female | 0 | 0 | 40 | United-States | <=50K |
| 32559 | 22 | Private | 201490 | HS-grad | 9 | Never-married | Adm-clerical | Own-child | White | Male | 0 | 0 | 20 | United-States | <=50K |
| 32560 | 52 | Self-emp-inc | 287927 | HS-grad | 9 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 15024 | 0 | 40 | United-States | >50K |


#### 5.1.2 Common Feature Patterns

In [10]:
adult["Work-Class"].unique()

array([' State-gov', ' Self-emp-not-inc', ' Private', ' Federal-gov',
       ' Local-gov', ' ?', ' Self-emp-inc', ' Without-pay', ' Never-worked'], dtype=object)

In [12]:
adult["LongHours"] = adult["Hours-per-week"] > 40

In [13]:
adult.tail()

| | Age | Work-Class | fnlwgt | Education | Education-Num | Marital-Status | Occupation | Relationship | Race | Sex | Capital-gain | Capital-loss | Hours-per-week | Native-Country | Earnings-Raw | LongHours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 32556 | 27 | Private | 257302 | Assoc-acdm | 12 | Married-civ-spouse | Tech-support | Wife | White | Female | 0 | 0 | 38 | United-States | <=50K | False |
| 32557 | 40 | Private | 154374 | HS-grad | 9 | Married-civ-spouse | Machine-op-inspct | Husband | White | Male | 0 | 0 | 40 | United-States | >50K | False |
| 32558 | 58 | Private | 151910 | HS-grad | 9 | Widowed | Adm-clerical | Unmarried | White | Female | 0 | 0 | 40 | United-States | <=50K | False |
| 32559 | 22 | Private | 201490 | HS-grad | 9 | Never-married | Adm-clerical | Own-child | White | Male | 0 | 0 | 20 | United-States | <=50K | False |
| 32560 | 52 | Self-emp-inc | 287927 | HS-grad | 9 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 15024 | 0 | 40 | United-States | >50K | False |
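
The two patterns seen so far, thresholding a numeric column and cleaning up a categorical one, can be sketched on a small made-up frame. The `toy` DataFrame and its values are hypothetical and are not taken from the Adult dataset. `pd.get_dummies` is shown only as one common way to turn categories into binary columns; it is an assumption here, not something the cells above do.

```python
import pandas as pd

# Hypothetical miniature of the Adult data (values are made up).
toy = pd.DataFrame({
    "Hours-per-week": [38, 40, 45, 60],
    "Work-Class": [" Private", " State-gov", " Private", " ?"],
})

# Numeric -> binary: True when someone works more than 40 hours a week.
toy["LongHours"] = toy["Hours-per-week"] > 40

# The raw values carry a leading space; strip it before encoding.
toy["Work-Class"] = toy["Work-Class"].str.strip()

# Categorical -> one binary column per category.
dummies = pd.get_dummies(toy["Work-Class"], prefix="Work-Class")

print(toy["LongHours"].tolist())
print(sorted(dummies.columns))
```

Stripping the whitespace first keeps category names such as `Private` from being encoded with a stray leading space.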


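#### 5.1.3 Creating Good Features

Modeling requires simplifying real-world objects, and that simplification loses information. This is why no single data mining method works for every dataset. A skilled data mining practitioner needs knowledge of the domain the data comes from, or must be willing to acquire it. Only after understanding the problem and the data available can they build a model that solves it.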

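## 5.2 Feature Selection

We often have a large number of features but want to use only a small subset of them, for several reasons:

- **Reducing complexity**: Many data mining algorithms need more time and resources as the number of features grows. Reducing the number of features is a good way to make an algorithm run faster and use fewer resources.

- **Reducing noise**: Adding extra features does not always improve performance. Extra features can mislead an algorithm into finding correlations and patterns that have no real meaning (this is common on small datasets). Choosing only appropriate features reduces the chance of picking up such spurious correlations.

- **Improving model readability**: Answering a question with a model built on thousands of features is easy for a computer, but the resulting model is opaque to us. Using fewer features to build a model we can understand is therefore worthwhile.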

In [14]:
import numpy as np
X = np.arange(30).reshape([10,3])
X

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14],
       [15, 16, 17],
       [18, 19, 20],
       [21, 22, 23],
       [24, 25, 26],
       [27, 28, 29]])

In [16]:
X[:,1] = 1
X

array([[ 0,  1,  2],
       [ 3,  1,  5],
       [ 6,  1,  8],
       [ 9,  1, 11],
       [12,  1, 14],
       [15,  1, 17],
       [18,  1, 20],
       [21,  1, 23],
       [24,  1, 26],
       [27,  1, 29]])

In [17]:
from sklearn.feature_selection import VarianceThreshold

In [18]:
vt = VarianceThreshold()

In [19]:
Xt = vt.fit_transform(X)

In [20]:
Xt

array([[ 0,  2],
       [ 3,  5],
       [ 6,  8],
       [ 9, 11],
       [12, 14],
       [15, 17],
       [18, 20],
       [21, 23],
       [24, 26],
       [27, 29]])

In [21]:
vt.variances_

array([ 74.25,   0.  ,  74.25])
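
The variances reported above can be reproduced with plain NumPy. This is a minimal sketch that assumes the default behaviour of dropping every column whose variance is not above 0; the name `keep` is introduced here for illustration only.

```python
import numpy as np

# Rebuild the example matrix and overwrite the second column with a constant.
X = np.arange(30).reshape((10, 3))
X[:, 1] = 1

# Variance of each column (np.var uses ddof=0 by default).
variances = X.var(axis=0)
print(variances)

# Keep only the columns whose variance exceeds the threshold of 0.
keep = variances > 0
Xt = X[:, keep]
print(Xt.shape)
```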

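Whenever you receive a dataset, it is worth running simple, direct analyses like this first to get a feel for the data. Features with zero variance add nothing to data mining and can slow the algorithm down.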

##### Selecting the Best Features

In [22]:
X = adult[["Age", "Education-Num", "Capital-gain", "Capital-loss","Hours-per-week"]].values

In [23]:
y = (adult["Earnings-Raw"] == ' >50K').values

In [24]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
transformer = SelectKBest(score_func=chi2, k=3)

In [25]:
Xt_chi2 = transformer.fit_transform(X, y)

In [26]:
Xt_chi2

array([[   39,  2174,     0],
       [   50,     0,     0],
       [   38,     0,     0],
       ..., 
       [   58,     0,     0],
       [   22,     0,     0],
       [   52, 15024,     0]], dtype=int64)

In [27]:
transformer.scores_

array([  8.60061182e+03,   2.40142178e+03,   8.21924671e+07,
         1.37214589e+06,   6.47640900e+03])

In [28]:
from scipy.stats import pearsonr

In [29]:
def multivariate_pearsonr(X, y):
    scores, pvalues = [], []
    for column in range(X.shape[1]):
        cur_score, cur_p = pearsonr(X[:,column], y)
        scores.append(abs(cur_score))
        pvalues.append(cur_p)
    return (np.array(scores), np.array(pvalues))  

In [30]:
transformer = SelectKBest(score_func=multivariate_pearsonr, k=3)
Xt_pearson = transformer.fit_transform(X, y)
print(transformer.scores_)

[ 0.2340371   0.33515395  0.22332882  0.15052631  0.22968907]
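
To see which columns each scoring function kept, the printed scores can be ranked by hand. The sketch below copies the score values shown in the two outputs above and pairs them with the column names used to build `X`. The helper `top_k` is hypothetical and only mimics a "keep the k highest scores" rule.

```python
import numpy as np

features = ["Age", "Education-Num", "Capital-gain",
            "Capital-loss", "Hours-per-week"]

# Scores copied from the outputs above.
chi2_scores = np.array([8.60061182e+03, 2.40142178e+03, 8.21924671e+07,
                        1.37214589e+06, 6.47640900e+03])
pearson_scores = np.array([0.2340371, 0.33515395, 0.22332882,
                           0.15052631, 0.22968907])


def top_k(scores, names, k=3):
    """Return the k names with the highest scores, best first."""
    order = np.argsort(scores)[::-1][:k]
    return [names[i] for i in order]


print("chi2:   ", top_k(chi2_scores, features))
print("pearson:", top_k(pearson_scores, features))
```

Under these scores the two functions agree only on `Age`: chi-squared favours the two capital columns, while Pearson correlation favours `Education-Num` and `Hours-per-week`.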


In [32]:
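from sklearn.tree import DecisionTreeClassifier
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import cross_val_score
clf = DecisionTreeClassifier(random_state=14)
# cv=3 matches the three folds shown in the output below
scores_chi2 = cross_val_score(clf, Xt_chi2, y, scoring='accuracy', cv=3)
scores_pearson = cross_val_score(clf, Xt_pearson, y, scoring='accuracy', cv=3)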

In [33]:
print("scores_chi2", scores_chi2)
print("scores_pearson", scores_pearson)

scores_chi2 [ 0.82577851  0.82992445  0.83009306]
scores_pearson [ 0.76930164  0.7694859   0.77315028]
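
Averaging the fold scores makes the comparison easier to read. This sketch simply copies the accuracies printed above; no new experiment is run.

```python
import numpy as np

# Per-fold accuracies copied from the output above.
scores_chi2 = np.array([0.82577851, 0.82992445, 0.83009306])
scores_pearson = np.array([0.76930164, 0.7694859, 0.77315028])

mean_chi2 = scores_chi2.mean()
mean_pearson = scores_pearson.mean()

print("chi2 mean accuracy:    {:.1f}%".format(100 * mean_chi2))
print("pearson mean accuracy: {:.1f}%".format(100 * mean_pearson))
```

On this dataset the chi-squared subset reaches roughly 83% accuracy, against roughly 77% for the Pearson subset.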


## 5.3 Feature Creation

In [34]:
data_filename = "./Data/ad.data"

In [38]:
def convert_number(x):
    try:
        return float(x)
    except ValueError:
        return np.nan

In [39]:
from collections import defaultdict

In [42]:
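# Note: pandas only applies converters for keys already present in the dict,
# so the defaultdict fallback is never used here; only the converter added
# for column 1558 below takes effect.
converters = defaultdict(convert_number)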

In [45]:
converters[1558] = lambda x: 1 if x.strip() == "ad." else 0

In [46]:
ads = pd.read_csv(data_filename, header=None, converters=converters)


In [47]:
ads.head()

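| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 1549 | 1550 | 1551 | 1552 | 1553 | 1554 | 1555 | 1556 | 1557 | 1558 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 125 | 125 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 57 | 468 | 8.2105 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 33 | 230 | 6.9696 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 60 | 468 | 7.8 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 60 | 468 | 7.8 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |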
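
How per-column converters behave can be checked on a tiny in-memory CSV. The three-column data below is made up, and the converters are passed as an explicit dictionary keyed by column position, which is an assumption about how one might apply `convert_number` to every feature column rather than what the cells above do.

```python
import io

import numpy as np
import pandas as pd


def convert_number(x):
    """Parse a cell as float, mapping unparseable cells to NaN."""
    try:
        return float(x)
    except ValueError:
        return np.nan


# Hypothetical sample: two feature columns and a label column.
raw = "125,125,ad.\n   ?,468,nonad.\n"

# One converter per feature column, plus a label converter for the last one.
converters = {i: convert_number for i in range(2)}
converters[2] = lambda x: 1 if x.strip() == "ad." else 0

sample = pd.read_csv(io.StringIO(raw), header=None, converters=converters)
print(sample)
```

Listing every column explicitly means a missing value such as `?` ends up as `NaN` in the numeric columns instead of leaving the column with mixed types.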
