<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#数据处理" data-toc-modified-id="数据处理-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Data processing</a></span><ul class="toc-item"><li><span><a href="#Standardization,-or-mean-removal-and-variance-scaling" data-toc-modified-id="Standardization,-or-mean-removal-and-variance-scaling-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Standardization, or mean removal and variance scaling</a></span></li><li><span><a href="#Normalization" data-toc-modified-id="Normalization-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Normalization</a></span></li><li><span><a href="#Binarization（离散化）" data-toc-modified-id="Binarization（离散化）-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Binarization (discretization)</a></span></li><li><span><a href="#Encoding-categorical-features" data-toc-modified-id="Encoding-categorical-features-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Encoding categorical features</a></span></li><li><span><a href="#Imputation-of-missing-values" data-toc-modified-id="Imputation-of-missing-values-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Imputation of missing values</a></span></li></ul></li><li><span><a href="#特征选择" data-toc-modified-id="特征选择-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Feature selection</a></span><ul class="toc-item"><li><span><a href="#SelectFromModel" data-toc-modified-id="SelectFromModel-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>SelectFromModel</a></span></li><li><span><a href="#※各特征独立考量" data-toc-modified-id="※各特征独立考量-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>※ Considering each feature independently</a></span><ul class="toc-item"><li><span><a href="#Removing-features-with-low-variance" data-toc-modified-id="Removing-features-with-low-variance-2.2.1"><span class="toc-item-num">2.2.1&nbsp;&nbsp;</span>Removing features with low variance</a></span></li><li><span><a href="#Univariate-feature-selection" data-toc-modified-id="Univariate-feature-selection-2.2.2"><span 
class="toc-item-num">2.2.2&nbsp;&nbsp;</span>Univariate feature selection</a></span></li></ul></li></ul></li><li><span><a href="#降维（考虑了所有特征间的整体贡献）" data-toc-modified-id="降维（考虑了所有特征间的整体贡献）-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Dimensionality reduction (considers the joint contribution of all features)</a></span><ul class="toc-item"><li><span><a href="#主成分分析PCA" data-toc-modified-id="主成分分析PCA-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Principal component analysis (PCA)</a></span></li><li><span><a href="#Truncated-SVD（截断的奇异矩阵分解）" data-toc-modified-id="Truncated-SVD（截断的奇异矩阵分解）-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Truncated SVD (truncated singular value decomposition)</a></span></li></ul></li></ul></div>

# sklearn Module Basics [2]

Quick-start reference documentation:  
  
>  https://sklearn.apachecn.org/docs/0.21.3/      
>  https://sklearn.apachecn.org/docs/0.21.3/50.html

## Data processing

Dataset: [ML DATASETS](http://archive.ics.uci.edu/ml/)

### Standardization, or mean removal and variance scaling

In practice we often ignore the shape of the distribution and just transform the data to center it by removing the mean value of each feature, **_then scale it by dividing non-constant features by their standard deviation_**.

For instance, many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the l1 and l2 regularizers of linear models) assume that all features are centered around zero and have variance in the same order. If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.

[Should I normalize/standardize/rescale the data](http://www.faqs.org/faqs/ai-faq/neural-nets/part2/section-16.html "Should I normalize/standardize/rescale the data")

[**StandardScaler**](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler)

The preprocessing module further provides a utility class StandardScaler that implements the Transformer API to compute the mean and standard deviation on a training set so as to be able to later reapply the same transformation on the testing set.


[**MinMaxScaler**](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler)

MinMaxScaler scales each feature to lie between a given minimum and maximum value, often zero and one. Scaling so that the maximum absolute value of each feature becomes one is handled by the related MaxAbsScaler instead.

In [1]:
import numpy as np 
import pandas as pd 

In [2]:
df  = pd.read_csv("forestfires.csv")
df.head()

Unnamed: 0,X,Y,month,day,FFMC,DMC,DC,ISI,temp,RH,wind,rain,area
0,7,5,mar,fri,86.2,26.2,94.3,5.1,8.2,51,6.7,0.0,0.0
1,7,4,oct,tue,90.6,35.4,669.1,6.7,18.0,33,0.9,0.0,0.0
2,7,4,oct,sat,90.6,43.7,686.9,6.7,14.6,33,1.3,0.0,0.0
3,8,6,mar,fri,91.7,33.3,77.5,9.0,8.3,97,4.0,0.2,0.0
4,8,6,mar,sun,89.3,51.3,102.2,9.6,11.4,99,1.8,0.0,0.0


In [3]:
df1 = df.loc[:,"FFMC":"rain"]
df1.head()

Unnamed: 0,FFMC,DMC,DC,ISI,temp,RH,wind,rain
0,86.2,26.2,94.3,5.1,8.2,51,6.7,0.0
1,90.6,35.4,669.1,6.7,18.0,33,0.9,0.0
2,90.6,43.7,686.9,6.7,14.6,33,1.3,0.0
3,91.7,33.3,77.5,9.0,8.3,97,4.0,0.2
4,89.3,51.3,102.2,9.6,11.4,99,1.8,0.0


In [4]:
from sklearn.model_selection import train_test_split

# 70% train, 30% test
X_train, X_test, y_train, y_test = train_test_split(df1.astype(float), df['area'], test_size=0.3)

In [5]:
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()

ss.fit(X_train.astype(float))

X_train_ss = ss.transform(X_train)
X_test_ss = ss.transform(X_test)

In [6]:
X_test_ss

array([[ 0.19833798,  0.10387793,  0.15785763, ...,  2.44483192,
         1.94361423, -0.08050154],
       [-0.41459806,  2.50988941,  1.02064002, ..., -1.03279191,
        -0.2445808 , -0.08050154],
       [ 0.31184465,  1.6142209 ,  0.54338085, ...,  0.61450358,
        -1.01044906, -0.08050154],
       ...,
       [ 1.08369003, -0.0956917 , -0.07322406, ..., -0.78874814,
         0.24776308, -0.08050154],
       [ 0.03942863, -1.38890292, -1.94161823, ...,  0.24843792,
         0.74010696, -0.08050154],
       [ 0.92478069,  0.03841909,  0.44467761, ..., -1.33784664,
         0.24776308, -0.08050154]])

In [7]:
print(X_test.shape, X_test_ss.shape)

(156, 8) (156, 8)


In [8]:
X_test_ss.mean(axis=0)

array([-0.06141768,  0.26343129,  0.20168851,  0.16762858,  0.14628027,
        0.0728359 , -0.05346441, -0.06223793])

In [9]:
X_test_ss.std(axis=0)

array([1.69632577, 1.04685974, 0.84088763, 1.48284975, 0.99770054,
       0.98003792, 0.92811413, 0.1872176 ])
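The test-set means and standard deviations above are not exactly 0 and 1 because the scaler reuses the statistics it learned from the training set. The following is a minimal sketch of that behaviour on synthetic data; the array `X`, its distribution parameters, and the random seeds are illustrative assumptions and are not part of the forest fires example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): two features on very different scales
rng = np.random.RandomState(0)
X = rng.normal(loc=[10.0, 500.0], scale=[2.0, 100.0], size=(300, 2))
X_train, X_test = train_test_split(X, test_size=0.3, random_state=0)

# Learn mean and standard deviation from the training set only
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

# The same transformation written by hand, using the training statistics
manual = (X_test - X_train.mean(axis=0)) / X_train.std(axis=0)

print(X_train_std.mean(axis=0), X_train_std.std(axis=0))  # close to 0 and 1
print(X_test_std.mean(axis=0), X_test_std.std(axis=0))    # close to, but not exactly, 0 and 1
```

Fitting on the training set alone keeps information about the test set out of the preprocessing step.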

In [10]:
from sklearn.preprocessing import MinMaxScaler

mms = MinMaxScaler()

mms.fit(X_train)

MinMaxScaler(copy=True, feature_range=(0, 1))

In [11]:
mms.transform(X_train)

array([[0.91266376, 0.34735744, 0.87750767, ..., 0.20238095, 0.15294118,
        0.        ],
       [0.88646288, 0.44193324, 0.80800094, ..., 0.52380952, 0.63529412,
        0.        ],
       [0.90174672, 0.11613352, 0.08602785, ..., 0.14285714, 0.47058824,
        0.        ],
       ...,
       [0.87991266, 0.14464534, 0.09971678, ..., 0.53571429, 0.36470588,
        0.        ],
       [0.87772926, 0.14360223, 0.80127449, ..., 0.11904762, 0.25882353,
        0.        ],
       [0.94104803, 0.47635605, 0.69188105, ..., 0.44047619, 0.57647059,
        0.        ]])
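With the default `feature_range=(0, 1)`, the transformation amounts to `(X - min) / (max - min)` per feature, with `min` and `max` taken from the data passed to `fit`. A small sketch with made-up numbers (the arrays are hypothetical) shows this, and also that unseen data can fall outside the range:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical training and test data
X_train = np.array([[1.0, 200.0], [3.0, 400.0], [5.0, 1000.0]])
X_test = np.array([[6.0, 100.0]])

mms = MinMaxScaler().fit(X_train)
train_scaled = mms.transform(X_train)

# The same scaling written by hand from the training min and max
col_min = X_train.min(axis=0)
col_max = X_train.max(axis=0)
manual = (X_train - col_min) / (col_max - col_min)

# Test values beyond the training range land outside [0, 1]
test_scaled = mms.transform(X_test)
print(test_scaled)
```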

###  Normalization

Normalization is the process of scaling individual samples to have unit norm. This process can be useful if you plan to use a quadratic form such as the dot-product or any other kernel to quantify the similarity of any pair of samples.

This assumption is the base of the Vector Space Model often used in text classification and clustering contexts.

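The Normalizer class also has fit, transform, and the other usual transformer API methods, but fit does nothing meaningful for it: normalization transforms each sample on its own, so there is no statistic to learn across samples. The methods exist only so that the object behaves consistently with other transformers when it is passed to sklearn APIs such as pipeline.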

In [12]:
from sklearn.preprocessing import Normalizer

norm = Normalizer()

In [13]:
norm.fit(X_train)
X_train_norm = norm.transform(X_train)
X_train_norm

array([[0.12051853, 0.13372067, 0.98231751, ..., 0.04182856, 0.00287571,
        0.        ],
       [0.12761217, 0.18160194, 0.97125486, ..., 0.08273756, 0.00883469,
        0.        ],
       [0.69673891, 0.2720093 , 0.61392044, ..., 0.20514668, 0.03723032,
        0.        ],
       ...,
       [0.60495241, 0.29347195, 0.6162911 , ..., 0.40018902, 0.02667927,
        0.        ],
       [0.13036744, 0.06288143, 0.98840393, ..., 0.03597336, 0.0044607 ,
        0.        ],
       [0.15074612, 0.22474876, 0.9580037 , ..., 0.08383742, 0.0093511 ,
        0.        ]])

In [14]:
sum(np.square(X_train_norm[1]))

1.0
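As a rough illustration of what "unit norm per sample" means, the sketch below normalizes a tiny hand-made array (an assumption, not the notebook data) and compares the result with dividing each row by its own norm.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical samples; rows 0 and 2 point in the same direction
X = np.array([[3.0, 4.0], [1.0, 0.0], [6.0, 8.0]])

# Default is the L2 norm: each row is divided by its Euclidean length
l2 = Normalizer(norm="l2").fit_transform(X)
manual_l2 = X / np.linalg.norm(X, axis=1, keepdims=True)

# With the L1 norm each row is divided by the sum of its absolute values
l1 = Normalizer(norm="l1").fit_transform(X)

print(l2)
print(l1)
```

Rows that differ only in length map to the same normalized vector, which is why the technique suits similarity measures based on dot products.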

### Binarization (discretization)

Feature binarization is the process of thresholding numerical features to get boolean values. This can be useful for downstream probabilistic estimators that assume the input data is distributed according to a multi-variate Bernoulli distribution. For instance, this is the case for the sklearn.neural_network.BernoulliRBM.

In [15]:
df['DC'].mean()

547.9400386847191

In [16]:
from sklearn.preprocessing import Binarizer

bi = Binarizer(threshold=548)  # values > 548 become 1, all others 0
DC_bi = bi.fit_transform(df[['DC']])
DC_bi

array([[0.],
       [1.],
       [1.],
       [0.],
       [0.],
       [0.],
       [0.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [0.],
       [1.],
       [0.],
       [0.],
       [1.],
       [1.],
       [0.],
       [0.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [0.],
       [0.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [0.],
       [0.],
       [0.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [0.],
       [0.],
       [1.],
       [0.],
       [1.],
       [1.],
       [0.],
       [0.],

In [17]:
df['DC_bi'] = DC_bi[:, 0]

In [18]:
df.head()

Unnamed: 0,X,Y,month,day,FFMC,DMC,DC,ISI,temp,RH,wind,rain,area,DC_bi
0,7,5,mar,fri,86.2,26.2,94.3,5.1,8.2,51,6.7,0.0,0.0,0.0
1,7,4,oct,tue,90.6,35.4,669.1,6.7,18.0,33,0.9,0.0,0.0,1.0
2,7,4,oct,sat,90.6,43.7,686.9,6.7,14.6,33,1.3,0.0,0.0,1.0
3,8,6,mar,fri,91.7,33.3,77.5,9.0,8.3,97,4.0,0.2,0.0,0.0
4,8,6,mar,sun,89.3,51.3,102.2,9.6,11.4,99,1.8,0.0,0.0,0.0
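The thresholding rule can be checked on a few hand-picked values. This is only a sketch: the four numbers below are chosen for illustration, with one value sitting exactly on the threshold.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical DC-like values; 548.0 sits exactly on the threshold
X = np.array([[94.3], [669.1], [548.0], [686.9]])

binarized = Binarizer(threshold=548).fit_transform(X)

# Equivalent comparison: strictly greater than the threshold maps to 1
manual = (X > 548).astype(float)

print(binarized.ravel())
```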


Return indices of half-open bins to which each value of x belongs.
```python
pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False)
```  
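When `bins` is an integer, pandas.cut splits the value range into that many equal-width intervals; quantile-based (equal-frequency) binning is done with pandas.qcut.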

In [19]:
pd.cut(df['DC'], 5)

0       (7.047, 178.44]
1      (519.52, 690.06]
2      (519.52, 690.06]
3       (7.047, 178.44]
4       (7.047, 178.44]
5      (348.98, 519.52]
6      (348.98, 519.52]
7      (519.52, 690.06]
8       (690.06, 860.6]
9       (690.06, 860.6]
10      (690.06, 860.6]
11      (690.06, 860.6]
12     (519.52, 690.06]
13     (519.52, 690.06]
14      (690.06, 860.6]
15      (690.06, 860.6]
16      (7.047, 178.44]
17     (519.52, 690.06]
18      (7.047, 178.44]
19      (7.047, 178.44]
20      (690.06, 860.6]
21      (690.06, 860.6]
22     (178.44, 348.98]
23     (519.52, 690.06]
24     (519.52, 690.06]
25     (519.52, 690.06]
26     (519.52, 690.06]
27     (519.52, 690.06]
28      (690.06, 860.6]
29      (690.06, 860.6]
             ...       
487    (519.52, 690.06]
488    (519.52, 690.06]
489    (519.52, 690.06]
490    (519.52, 690.06]
491    (519.52, 690.06]
492    (519.52, 690.06]
493    (519.52, 690.06]
494    (519.52, 690.06]
495    (519.52, 690.06]
496    (519.52, 690.06]
497    (519.52, 
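To make the difference between equal-width and quantile-based bins concrete, here is a small sketch on a made-up series with one outlier (the values are an assumption chosen to exaggerate the effect):

```python
import pandas as pd

# Hypothetical values with a single large outlier
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# pd.cut: three intervals of equal width over the range [1, 100]
equal_width = pd.cut(values, 3)
width_counts = equal_width.value_counts(sort=False).tolist()

# pd.qcut: two intervals holding the same number of samples
equal_freq = pd.qcut(values, 2)
freq_counts = equal_freq.value_counts(sort=False).tolist()

print(width_counts)  # [9, 0, 1]
print(freq_counts)   # [5, 5]
```

With skewed data, equal-width bins can end up nearly empty, while quantile bins stay balanced at the cost of uneven interval widths.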

### Encoding categorical features

We could encode categorical features as integers, but such integer representation can not be used directly with scikit-learn estimators, as these expect continuous input, and would interpret the categories as being ordered, which is often not desired.

One possibility to convert categorical features to features that can be used with scikit-learn estimators is to use a one-of-K or one-hot encoding, which is implemented in [OneHotEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder). This estimator transforms each categorical feature with m possible values into m binary features, with only one active.

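**OneHotEncoder** (generally not used; signature shown for scikit-learn 0.22, which dropped the older `n_values` and `categorical_features` parameters)
```python
class sklearn.preprocessing.OneHotEncoder(categories='auto', drop=None, sparse=True, dtype=<class 'numpy.float64'>, 
                                          handle_unknown='error')
```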

Convert categorical variable into dummy/indicator variables
```python
pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False)
```

In [20]:
modelData = pd.get_dummies(data=df, columns=['month','day'])
modelData.head()

Unnamed: 0,X,Y,FFMC,DMC,DC,ISI,temp,RH,wind,rain,...,month_nov,month_oct,month_sep,day_fri,day_mon,day_sat,day_sun,day_thu,day_tue,day_wed
0,7,5,86.2,26.2,94.3,5.1,8.2,51,6.7,0.0,...,0,0,0,1,0,0,0,0,0,0
1,7,4,90.6,35.4,669.1,6.7,18.0,33,0.9,0.0,...,0,1,0,0,0,0,0,0,1,0
2,7,4,90.6,43.7,686.9,6.7,14.6,33,1.3,0.0,...,0,1,0,0,0,1,0,0,0,0
3,8,6,91.7,33.3,77.5,9.0,8.3,97,4.0,0.2,...,0,0,0,1,0,0,0,0,0,0
4,8,6,89.3,51.3,102.2,9.6,11.4,99,1.8,0.0,...,0,0,0,0,0,0,1,0,0,0
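For comparison with `get_dummies`, the sketch below uses OneHotEncoder on two tiny hypothetical frames (`train` and `test` are made up for illustration). One reason to reach for the encoder is that it remembers the categories seen during `fit`, so new data is encoded into the same columns; `handle_unknown="ignore"` is used here so that an unseen category becomes an all-zero block instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training and test frames; "aug" never appears in training
train = pd.DataFrame({"month": ["mar", "oct", "oct"], "day": ["fri", "tue", "sat"]})
test = pd.DataFrame({"month": ["aug"], "day": ["fri"]})

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train)

# The default output is a sparse matrix; convert to dense for display
train_onehot = enc.transform(train).toarray()
test_onehot = enc.transform(test).toarray()

print(enc.categories_)  # categories learned per column, sorted
print(train_onehot)
print(test_onehot)      # the unseen month encodes as all zeros
```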


###  Imputation of missing values

For various reasons, many real world datasets contain missing values, often encoded as blanks, NaNs or other placeholders. Such datasets however are incompatible with scikit-learn estimators which assume that all values in an array are numerical, and that all have and hold meaning. A basic strategy to use incomplete datasets is to discard entire rows and/or columns containing missing values. However, this comes at the price of losing data which may be valuable (even though incomplete). **_A better strategy is to impute the missing values, i.e., to infer them from the known part of the data._**


The [SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer) class provides basic strategies for imputing missing values, either using a constant value or using a statistic (mean, median or most frequent value) of the column in which the missing values are located. This class also allows for different missing values encodings.



**The imputation strategy:**
1. If “mean”, then replace missing values using the mean along each column.
2. If “median”, then replace missing values using the median along each column.
3. If “most_frequent”, then replace missing values using the most frequent value along each column.
4. If “constant”, then replace missing values with fill_value. Can be used with strings or numeric data.

```python
class sklearn.impute.SimpleImputer(missing_values=nan, strategy='mean', fill_value=None, verbose=0, copy=True, add_indicator=False)
```
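The four strategies can be compared side by side on a tiny array. This is a sketch under stated assumptions: the matrix `X` and the sentinel `-999` are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing value in each column
X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, np.nan],
              [3.0, 20.0]])

results = {}
for strategy in ("mean", "median", "most_frequent"):
    results[strategy] = SimpleImputer(strategy=strategy).fit_transform(X)

# "constant" needs an explicit fill_value
results["constant"] = SimpleImputer(strategy="constant", fill_value=-999).fit_transform(X)

for name, filled in results.items():
    # Show the value that replaced each missing entry
    print(name, filled[1, 0], filled[2, 1])
```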

In [71]:
df.loc[:, 'DC_na'] = np.nan
df.loc[df['DC']>=600, 'DC_na'] = df['DC']

In [72]:
from sklearn.impute import SimpleImputer

im = SimpleImputer()

In [73]:
df.head()

Unnamed: 0,X,Y,month,day,FFMC,DMC,DC,ISI,temp,RH,wind,rain,area,DC_bi,DC_na
0,7,5,mar,fri,86.2,26.2,94.3,5.1,8.2,51,6.7,0.0,0.0,0.0,
1,7,4,oct,tue,90.6,35.4,669.1,6.7,18.0,33,0.9,0.0,0.0,1.0,669.1
2,7,4,oct,sat,90.6,43.7,686.9,6.7,14.6,33,1.3,0.0,0.0,1.0,686.9
3,8,6,mar,fri,91.7,33.3,77.5,9.0,8.3,97,4.0,0.2,0.0,0.0,
4,8,6,mar,sun,89.3,51.3,102.2,9.6,11.4,99,1.8,0.0,0.0,0.0,


In [28]:
im.fit_transform(df[['DC_na']])

array([[703.07754491],
       [669.1       ],
       [686.9       ],
       [703.07754491],
       [703.07754491],
       [703.07754491],
       [703.07754491],
       [608.2       ],
       [692.6       ],
       [698.6       ],
       [698.6       ],
       [713.        ],
       [665.3       ],
       [686.5       ],
       [699.6       ],
       [713.9       ],
       [703.07754491],
       [664.2       ],
       [703.07754491],
       [703.07754491],
       [692.6       ],
       [724.3       ],
       [703.07754491],
       [703.07754491],
       [703.07754491],
       [601.4       ],
       [668.        ],
       [686.5       ],
       [721.4       ],
       [728.6       ],
       [692.3       ],
       [709.9       ],
       [706.8       ],
       [718.3       ],
       [724.3       ],
       [730.2       ],
       [669.1       ],
       [682.6       ],
       [686.9       ],
       [703.07754491],
       [703.07754491],
       [624.2       ],
       [647.1       ],
       [698

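**[Some practical tips]**  
1. Try not to drop samples just because a few of their feature values are missing; in practice it is better to fill the gaps with reasonable guesses based on domain knowledge, so the samples remain usable.
2. If there is no suitable way to infer a value, fill with a meaningless sentinel such as -999 or -1.
3. Other tools that may come in handy:
    * np.nan
    * np.inf
    * df.fillna
    * df.replace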

## Feature selection

### SelectFromModel

The classes in the `sklearn.feature_selection` module can be used for feature selection/dimensionality reduction on sample sets, either to improve estimators’ accuracy scores or to boost their performance on very high-dimensional datasets.

SelectFromModel is a meta-transformer that can be used along with any estimator that has a coef_ or feature_importances_ attribute after fitting. The features are considered unimportant and removed, if the corresponding coef_ or feature_importances_ values are below the provided threshold parameter. Apart from specifying the threshold numerically, there are built-in heuristics for finding a threshold using a string argument. Available heuristics are “mean”, “median” and float multiples of these like “0.1*mean”.

```python
class sklearn.feature_selection.SelectFromModel(estimator, threshold=None, prefit=False, norm_order=1, max_features=None)
```

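- L1-based feature selection (used less often in practice than tree models; demonstrated here by selecting features with an L1-regularized model)
- Tree-based feature selection (the first choice in practice; demonstrated here with a random forest)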

In [74]:
# View the help documentation
from sklearn.feature_selection import SelectFromModel
help(SelectFromModel)

Help on class SelectFromModel in module sklearn.feature_selection._from_model:

class SelectFromModel(sklearn.base.MetaEstimatorMixin, sklearn.feature_selection._base.SelectorMixin, sklearn.base.BaseEstimator)
 |  SelectFromModel(estimator, threshold=None, prefit=False, norm_order=1, max_features=None)
 |  
 |  Meta-transformer for selecting features based on importance weights.
 |  
 |  .. versionadded:: 0.17
 |  
 |  Parameters
 |  ----------
 |  estimator : object
 |      The base estimator from which the transformer is built.
 |      This can be both a fitted (if ``prefit`` is set to True)
 |      or a non-fitted estimator. The estimator must have either a
 |      ``feature_importances_`` or ``coef_`` attribute after fitting.
 |  
 |  threshold : string, float, optional default None
 |      The threshold value to use for feature selection. Features whose
 |      importance is greater or equal are kept while the others are
 |      discarded. If "median" (resp. "mean"), then the ``th

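> - estimator: object. The base estimator from which the feature selector is built. If `prefit` is True, an already fitted estimator can be passed; if `prefit` is False, pass an unfitted instance, which is trained when the selector's `fit` is called.  
  
  
> - threshold: string or float, optional, default None. The threshold used for feature selection: features whose importance (for example, the coefficient value) is greater than or equal to it are kept, and the others are removed. As a string, it can be "mean" (the mean of the importances), "median" (their median), or a scaled form such as "0.x*mean" or "0.x*median". When it is None, the threshold is 1e-5 if the estimator has a penalty set to l1 (explicitly or implicitly, as with Lasso); otherwise "mean" is used.  
  
  
> - prefit: bool, default False. Whether the estimator passed in is expected to be fitted already. If True, the estimator must have been trained beforehand and `transform` can be called directly; if False, it is enough to pass the estimator instance and train it through the selector's `fit`.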

In [29]:
from sklearn.feature_selection import SelectFromModel

In [37]:
xdata = modelData.drop("area",axis = 1).fillna(-999)
ydata = modelData['area']

In [38]:
from sklearn.linear_model import Lasso

lasso = Lasso()
lasso.fit(xdata, ydata)

Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
      normalize=False, positive=False, precompute=False, random_state=None,
      selection='cyclic', tol=0.0001, warm_start=False)

In [39]:
xdata.head()

Unnamed: 0,X,Y,FFMC,DMC,DC,ISI,temp,RH,wind,rain,...,month_nov,month_oct,month_sep,day_fri,day_mon,day_sat,day_sun,day_thu,day_tue,day_wed
0,7,5,86.2,26.2,94.3,5.1,8.2,51,6.7,0.0,...,0,0,0,1,0,0,0,0,0,0
1,7,4,90.6,35.4,669.1,6.7,18.0,33,0.9,0.0,...,0,1,0,0,0,0,0,0,1,0
2,7,4,90.6,43.7,686.9,6.7,14.6,33,1.3,0.0,...,0,1,0,0,0,1,0,0,0,0
3,8,6,91.7,33.3,77.5,9.0,8.3,97,4.0,0.2,...,0,0,0,1,0,0,0,0,0,0
4,8,6,89.3,51.3,102.2,9.6,11.4,99,1.8,0.0,...,0,0,0,0,0,0,1,0,0,0


In [40]:
lasso.coef_

array([ 1.85720489,  0.        , -0.03700954,  0.09217834, -0.01157946,
       -0.53603543,  0.72312094, -0.231194  ,  1.26363755, -0.        ,
        0.        ,  0.        , -0.        ,  0.        ,  0.        ,
        0.        , -0.        , -0.        , -0.        ,  0.        ,
       -0.        ,  0.        ,  5.21977984, -0.        , -0.        ,
        7.00192208, -0.        ,  0.        ,  0.        , -0.        ])

In [35]:
model = SelectFromModel(lasso, prefit=True)

In [36]:
model.transform(xdata)

array([[  7. ,  86.2,  26.2, ...,   6.7,   0. ,   0. ],
       [  7. ,  90.6,  35.4, ...,   0.9,   0. ,   0. ],
       [  7. ,  90.6,  43.7, ...,   1.3,   0. ,   1. ],
       ...,
       [  7. ,  81.6,  56.7, ...,   6.7,   0. ,   0. ],
       [  1. ,  94.4, 146. , ...,   4. ,   0. ,   1. ],
       [  6. ,  79.5,   3. , ...,   4.5,   0. ,   0. ]])

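**Note: in practice we generally do not drop individual one-hot encoded columns; they are dropped only when the coefficients of all of those columns are 0.**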

In [41]:
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor()
rf.fit(xdata, ydata)

RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=None, oob_score=False,
                      random_state=None, verbose=0, warm_start=False)

In [42]:
rf.feature_importances_

array([5.34644708e-02, 4.62341430e-02, 5.39475470e-02, 1.20864442e-01,
       3.27448183e-02, 4.69004734e-02, 4.02714863e-01, 6.77922778e-02,
       6.71592368e-02, 1.55172745e-04, 3.90417828e-03, 5.71109779e-04,
       7.49947160e-03, 2.16358895e-04, 7.54656719e-04, 1.95300809e-08,
       8.14230713e-03, 2.60470690e-04, 6.83718444e-04, 9.32061675e-04,
       1.04199943e-09, 3.38023628e-04, 8.17767218e-03, 9.13170294e-04,
       1.11723857e-02, 2.00352999e-02, 4.47294814e-03, 2.82117074e-02,
       7.89782089e-03, 3.83917306e-03])

In [51]:
model_rf = SelectFromModel(rf, prefit=True)
model_rf.transform(xdata)

array([[ 7. ,  5. , 86.2, ...,  8.2, 51. ,  6.7],
       [ 7. ,  4. , 90.6, ..., 18. , 33. ,  0.9],
       [ 7. ,  4. , 90.6, ..., 14.6, 33. ,  1.3],
       ...,
       [ 7. ,  4. , 81.6, ..., 21.2, 70. ,  6.7],
       [ 1. ,  4. , 94.4, ..., 25.6, 42. ,  4. ],
       [ 6. ,  3. , 79.5, ..., 11.8, 31. ,  4.5]])

In [52]:
model_rf.transform(xdata).shape

(517, 8)

In [53]:
xdata.shape

(517, 30)
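To see which columns a selector keeps and what threshold it actually applied, `get_support()` and `threshold_` can be inspected. The sketch below is not the notebook's data: it assumes a synthetic regression problem in which only the first two of six features matter, and it lets SelectFromModel fit the forest itself (`prefit=False`) with the "mean" heuristic.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

# Synthetic data (assumption): only features 0 and 1 drive the target
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] + 3.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

selector = SelectFromModel(
    RandomForestRegressor(n_estimators=50, random_state=0),
    threshold="mean",  # keep features with importance >= mean importance
)
selector.fit(X, y)

mask = selector.get_support()       # boolean mask of kept features
X_selected = selector.transform(X)  # only the kept columns

print(selector.threshold_)
print(selector.estimator_.feature_importances_)
print(mask, X_selected.shape)
```

Applying the mask to the original column names (for example `xdata.columns[mask]` on a DataFrame) is a convenient way to read off the selected features.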

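### ※ Considering each feature independently
  
**[Note] Rarely used, because in practice it is hard to decide what threshold / count / percentage to set**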

#### Removing features with low variance

VarianceThreshold is a simple baseline approach to feature selection. It removes all features whose variance doesn’t meet some threshold. By default, it removes all zero-variance features, i.e. features that have the same value in all samples.

In [54]:
from sklearn.feature_selection import VarianceThreshold

v = VarianceThreshold(0.5)
v.fit_transform(xdata)

array([[ 7. ,  5. , 86.2, ...,  8.2, 51. ,  6.7],
       [ 7. ,  4. , 90.6, ..., 18. , 33. ,  0.9],
       [ 7. ,  4. , 90.6, ..., 14.6, 33. ,  1.3],
       ...,
       [ 7. ,  4. , 81.6, ..., 21.2, 70. ,  6.7],
       [ 1. ,  4. , 94.4, ..., 25.6, 42. ,  4. ],
       [ 6. ,  3. , 79.5, ..., 11.8, 31. ,  4.5]])

In [55]:
v.fit_transform(xdata).shape

(517, 9)
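For boolean features the variance is `p * (1 - p)`, so a threshold can be read as "drop features that take the same value in more than a given share of samples". The sketch below uses a small hand-made boolean matrix (an assumption for illustration) and removes features that are constant in more than 80% of the rows.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hypothetical boolean features; column 0 is zero in 5 of 6 samples
X = np.array([[0, 0, 1],
              [0, 1, 0],
              [1, 0, 0],
              [0, 1, 1],
              [0, 1, 0],
              [0, 1, 1]])

# Variance of a Bernoulli feature is p * (1 - p); 0.8 * 0.2 = 0.16
sel = VarianceThreshold(threshold=0.8 * (1 - 0.8))
kept = sel.fit_transform(X)

print(sel.variances_)     # per-feature variances computed during fit
print(sel.get_support())  # which columns passed the threshold
print(kept.shape)
```

Because the variance depends on the scale of each feature, a single threshold applied to unscaled columns with different units compares quantities that are not directly comparable.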

#### Univariate feature selection
1. SelectKBest removes all but the k highest scoring features
2. SelectPercentile removes all but a user-specified highest scoring percentage of features

These objects take as input a scoring function that returns univariate scores and p-values (or only scores for SelectKBest and SelectPercentile):
* For regression: f_regression, mutual_info_regression
* For classification: chi2, f_classif, mutual_info_classif

In [58]:
from sklearn.feature_selection import SelectKBest

In [62]:
# View the official documentation
help(SelectKBest)

Help on class SelectKBest in module sklearn.feature_selection._univariate_selection:

class SelectKBest(_BaseFilter)
 |  SelectKBest(score_func=<function f_classif at 0x000000001AAFFBF8>, k=10)
 |  
 |  Select features according to the k highest scores.
 |  
 |  Read more in the :ref:`User Guide <univariate_feature_selection>`.
 |  
 |  Parameters
 |  ----------
 |  score_func : callable
 |      Function taking two arrays X and y, and returning a pair of arrays
 |      (scores, pvalues) or a single array with scores.
 |      Default is f_classif (see below "See also"). The default function only
 |      works with classification tasks.
 |  
 |  k : int or "all", optional, default=10
 |      Number of top features to select.
 |      The "all" option bypasses selection, for use in a parameter search.
 |  
 |  Attributes
 |  ----------
 |  scores_ : array-like of shape (n_features,)
 |      Scores of features.
 |  
 |  pvalues_ : array-like of shape (n_features,)
 |      p-values of featur

In [67]:
from sklearn.feature_selection import f_regression

skb = SelectKBest(f_regression, k=20)
skb.fit_transform(xdata, ydata)

array([[ 7. ,  5. , 86.2, ...,  0. ,  0. ,  0. ],
       [ 7. ,  4. , 90.6, ...,  0. ,  0. ,  0. ],
       [ 7. ,  4. , 90.6, ...,  1. ,  0. ,  0. ],
       ...,
       [ 7. ,  4. , 81.6, ...,  0. ,  1. ,  0. ],
       [ 1. ,  4. , 94.4, ...,  1. ,  0. ,  0. ],
       [ 6. ,  3. , 79.5, ...,  0. ,  0. ,  0. ]])

In [68]:
skb.fit_transform(xdata, ydata).shape

(517, 20)
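The fitted selector exposes the per-feature scores, which helps when deciding on `k` or a percentage. The following sketch assumes a synthetic regression problem (only features 0 and 3 carry signal) rather than the notebook data.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, SelectPercentile, f_regression

# Synthetic data (assumption): the target depends on features 0 and 3 only
rng = np.random.RandomState(0)
X = rng.normal(size=(300, 5))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.5, size=300)

# Keep the 2 features with the highest univariate F-scores
skb = SelectKBest(f_regression, k=2).fit(X, y)
kept_k = np.flatnonzero(skb.get_support())

# Keep the top 40% of features (2 out of 5)
sp = SelectPercentile(f_regression, percentile=40).fit(X, y)
kept_pct = np.flatnonzero(sp.get_support())

print(skb.scores_)   # F-score per feature
print(skb.pvalues_)  # p-value per feature
print(kept_k, kept_pct)
```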

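## Dimensionality reduction (considers the joint contribution of all features)  
  
  **[Note] Also used fairly rarely in practice**
  - The resulting components rarely correspond to the business features you need
  - Machine learning models use regularization terms to penalize collinearity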

```python
class sklearn.decomposition.PCA(n_components=None, copy=True, whiten=False, svd_solver='auto', tol=0.0, iterated_power='auto', random_state=None)  
  
class sklearn.decomposition.TruncatedSVD(n_components=2, algorithm='randomized', n_iter=5, random_state=None, tol=0.0)
```

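### Principal component analysis (PCA)

PCA works by mapping the original dataset into a new space whose column vectors are mutually orthogonal. From a data analysis perspective, PCA transforms the covariance matrix of the data into column vectors that each "explain" a certain proportion of the variance.

- Maximum-variance interpretation (the more components kept, the more variance explained): http://www.cnblogs.com/jerrylead/archive/2011/04/18/2020209.html
- Minimum squared error interpretation:
http://www.cnblogs.com/jerrylead/archive/2011/04/18/2020216.html  
  
[Note] PCA is actually rarely used on ordinary datasets; it is used most often for image compression (for example, keeping only the main face region of a picture)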

In [64]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA, TruncatedSVD

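# Always standardize the data before applying PCA (feature scales affect the covariance matrix)
ss = StandardScaler()
xdata_ss = ss.fit_transform(xdata)

pca = PCA(20)  # keep 20 principal components
pca.fit_transform(xdata_ss)

# Check what proportion of the variance the 20 components explain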
np.sum(pca.explained_variance_ratio_)

0.9165650429690326

In [69]:
pca.fit_transform(xdata_ss).shape

(517, 20)
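Instead of fixing the number of components up front, `n_components` can be given as a fraction of variance to retain. The sketch below assumes synthetic data built from three latent factors (all names and sizes are illustrative); it also checks that the components are orthonormal, as described above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): 10 observed features driven by 3 latent factors
rng = np.random.RandomState(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 10))
X_std = StandardScaler().fit_transform(X)

# Cumulative explained variance over all components
pca_full = PCA().fit(X_std)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# A float in (0, 1) keeps just enough components to exceed that variance share
pca_95 = PCA(n_components=0.95, svd_solver="full").fit(X_std)
k = pca_95.n_components_

# Rows of components_ are orthonormal directions in feature space
gram = pca_95.components_ @ pca_95.components_.T

print(cumulative)
print(k)
```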

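### Truncated SVD (truncated singular value decomposition)  
  
TruncatedSVD is very similar to PCA, but it works directly on the sample matrix X rather than on its covariance matrix; in particular, it does not center the data first.

Truncated SVD differs from ordinary SVD in that the factorization it produces has exactly as many columns as the truncation number we specify.
For example, given an n×n matrix, ordinary SVD produces matrices with n columns, while truncated SVD produces matrices with the number of columns we specify.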

In [65]:
tsvd = TruncatedSVD(20)  # keep 20 components
tsvd.fit_transform(xdata_ss)

array([[ 4.20393887, -0.81969546, -0.46081681, ...,  0.60921858,
         0.01381626,  0.16708439],
       [-0.02041855,  0.88110739, -0.91015821, ..., -0.20278749,
        -1.44123271,  0.20772519],
       [ 0.2674066 ,  1.06898908, -1.0189032 , ...,  0.49693935,
        -0.40004288,  1.56095058],
       ...,
       [ 0.50563481,  0.56382819,  2.85012055, ..., -0.12241451,
         0.30171322, -0.02277013],
       [-1.76737222, -0.79189878,  0.27323316, ...,  0.44888657,
         0.2317635 , -0.08663814],
       [ 4.23336759,  0.70608683, -0.42100672, ...,  7.92798779,
         4.41944341,  6.03773562]])

In [70]:
tsvd.fit_transform(xdata_ss).shape

(517, 20)
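The relationship between the two methods can be checked numerically: on data that has been centered by hand, TruncatedSVD should give the same projection as PCA, up to the sign of each component. This is a sketch on synthetic data (the matrix, the offset of 5.0, and the choice of the `arpack` solver for a deterministic, exact result are assumptions for illustration).

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD

# Synthetic data (assumption): correlated features with a non-zero mean
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 8)) @ rng.normal(size=(8, 8)) + 5.0
X_centered = X - X.mean(axis=0)

# PCA centers the data internally
pca_out = PCA(n_components=3, svd_solver="full").fit_transform(X)

# TruncatedSVD does not center, so it is given the centered matrix here
tsvd_centered = TruncatedSVD(n_components=3, algorithm="arpack", random_state=0).fit_transform(X_centered)

# On the raw, uncentered matrix the result is different
tsvd_raw = TruncatedSVD(n_components=3, algorithm="arpack", random_state=0).fit_transform(X)

same_when_centered = np.allclose(np.abs(pca_out), np.abs(tsvd_centered), atol=1e-6)
same_when_raw = np.allclose(np.abs(pca_out), np.abs(tsvd_raw), atol=1e-6)
print(same_when_centered, same_when_raw)
```

In the notebook the input `xdata_ss` has already been standardized, and therefore centered, which is why the two methods behave so similarly there.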