## Chapter 05: Dataset Transformations in Sklearn

***

### Dataset transformations mainly include
- Data preprocessing
- Feature extraction
- Feature transformation
- Dimensionality reduction

***

### Main contents of this chapter

- 4.1 Pipeline and Feature union
- 4.2 Feature extraction
- 4.3 Preprocessing data
- 4.4 Dimensionality reduction
- 4.5 Random projection
- 4.6 Kernel approximation
- 4.7 Pairwise metrics, Affinities, Kernels
- 4.8 Transforming the prediction target

***

- 4.1 Pipeline and Feature union

A Pipeline is built from a list of (key, value) pairs, where each key is a string naming the step and each value is an estimator object.

In [2]:
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.decomposition import PCA
estimators = [('reduce_dim', PCA()), 
              ('clf', SVC())]
pipe = Pipeline(estimators)
pipe

Pipeline(steps=[('reduce_dim', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('clf', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])
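
Once built, the pipeline behaves like a single estimator: calling `fit` fits each step in order and hands the transformed data to the next step, and `predict` or `score` reuse the same chain. The following is a minimal sketch of that behaviour. The iris dataset, the train/test split, and the name `demo_pipe` are assumptions made for illustration and are not part of the original notes.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Assumption: iris is used purely as a small illustrative dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Same two steps as above, with PCA reduced to 2 components for the demo
demo_pipe = Pipeline([('reduce_dim', PCA(n_components=2)),
                      ('clf', SVC())])

# fit() fits and applies PCA on the training data, then fits SVC on the result
demo_pipe.fit(X_train, y_train)

# predict() applies the fitted PCA transform, then calls SVC.predict
predictions = demo_pipe.predict(X_test)

# score() reports the accuracy of the final classifier on the test data
accuracy = demo_pipe.score(X_test, y_test)
print(accuracy)
```

Because the transformer is fitted inside `fit`, the test data never influences the PCA projection, which is one of the main reasons to wrap preprocessing and model in one object.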

In [3]:
# Inspect all steps in pipe
pipe.steps

[('reduce_dim',
  PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)),
 ('clf', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False))]

In [7]:
# Look up a step by its name
pipe.named_steps['reduce_dim']

PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)

In [8]:
# Modify a parameter of one estimator in pipe
# pipe.set_params(<step name>__<parameter>=value)
# Note: the step name and the parameter name are joined by two underscores
pipe.set_params(clf__C = 10)

Pipeline(steps=[('reduce_dim', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('clf', SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])
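
The double-underscore names work in both directions: a value set with `set_params` can be read back from `get_params` under the same key, and it is also visible on the step object itself. A small sketch of that round trip follows; the variable names `c_from_pipe` and `c_from_step` are illustrative only.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

pipe = Pipeline([('reduce_dim', PCA()), ('clf', SVC())])

# Several nested parameters can be set in a single call
pipe.set_params(clf__C=10, reduce_dim__n_components=2)

# Read the value back through the pipeline, using the same nested key
c_from_pipe = pipe.get_params()['clf__C']

# The change is applied to the estimator object stored in the step
c_from_step = pipe.named_steps['clf'].C
n_components = pipe.named_steps['reduce_dim'].n_components

print(c_from_pipe, c_from_step, n_components)
```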

In [9]:
# Example: grid search over the parameters of the pipeline steps
from sklearn.model_selection import GridSearchCV
params = dict(reduce_dim__n_components = [2, 5, 10], 
              clf__C = [0.1 , 10, 100])
grid_search = GridSearchCV(pipe, param_grid = params)

In [None]:
# An entire step can be swapped out through a parameter, and non-final steps can also be set to None to skip them
from sklearn.linear_model import LogisticRegression
params = dict(reduce_dim = [None, PCA(5), PCA(10)], 
              clf = [SVC(), LogisticRegression()], 
              clf__C = [0.1, 10, 100])
grid_search = GridSearchCV(pipe, param_grid = params)
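
To see which step objects win, the search has to be fitted. The sketch below does that for the grid above, under the same illustrative assumptions as before: the digits dataset, a 300-sample subset, and `cv=3`. `max_iter=5000` is added to `LogisticRegression` only to reduce convergence warnings on unscaled data and is not in the original grid.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Assumption: a small digits subset keeps the 18-candidate search fast
X, y = load_digits(return_X_y=True)
X, y = X[:300], y[:300]

pipe = Pipeline([('reduce_dim', PCA()), ('clf', SVC())])

# 'reduce_dim' and 'clf' replace whole steps; 'clf__C' applies to whichever
# classifier is currently in the 'clf' slot (both SVC and LogisticRegression have C)
params = dict(reduce_dim=[None, PCA(5), PCA(10)],
              clf=[SVC(), LogisticRegression(max_iter=5000)],
              clf__C=[0.1, 10, 100])

grid_search = GridSearchCV(pipe, param_grid=params, cv=3)
grid_search.fit(X, y)

n_candidates = len(grid_search.cv_results_['params'])  # 3 * 2 * 3 combinations
best_reducer = grid_search.best_params_['reduce_dim']  # None or a PCA instance
best_clf = grid_search.best_estimator_.named_steps['clf']
print(type(best_clf).__name__, best_reducer)
```

Recent scikit-learn documentation also describes the string `'passthrough'` as a way to skip a non-final step; `None` is kept here to match the text.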