<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#数据预处理" data-toc-modified-id="数据预处理-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Data preprocessing</a></span><ul class="toc-item"><li><span><a href="#标准化（Standardization,-Z-score-normalization,-or-mean-removal-and-variance-scaling）" data-toc-modified-id="标准化（Standardization,-Z-score-normalization,-or-mean-removal-and-variance-scaling）-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Standardization (Z-score normalization, or mean removal and variance scaling)</a></span></li><li><span><a href="#归一化（Scaling-features-to-a-range）" data-toc-modified-id="归一化（Scaling-features-to-a-range）-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Normalization (Scaling features to a range)</a></span></li></ul></li><li><span><a href="#缺失值填补" data-toc-modified-id="缺失值填补-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Missing-value imputation</a></span></li><li><span><a href="#特征选择" data-toc-modified-id="特征选择-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Feature selection</a></span></li><li><span><a href="#降维" data-toc-modified-id="降维-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Dimensionality reduction</a></span></li></ul></div>

In [1]:
import pandas as pd

#### Data preprocessing

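[`sklearn.preprocessing`](https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing) provides preprocessing methods for tabular data, including normalization, standardization, and encoding. They serve two purposes: keeping differences in feature range or units from distorting models that are sensitive to scale (for example `K-means`), and converting data that a model cannot handle directly into an encoded form it can use.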
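
To make the point about scale concrete, here is a minimal sketch with made-up numbers (the feature meanings and values are hypothetical, not taken from any dataset). When one feature is measured in much larger units, it dominates the Euclidean distance that algorithms such as `K-means` rely on; after standardizing each column, both features contribute on a comparable scale.

```python
import numpy as np

# Hypothetical samples: column 0 is a height in metres, column 1 an income in dollars.
X_demo = np.array([
    [1.60, 30000.0],
    [1.90, 30500.0],
    [1.61, 60000.0],
])

def euclidean(a, b):
    """Plain Euclidean distance between two 1-D arrays."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

# On the raw data, the income column decides almost everything.
d_raw_01 = euclidean(X_demo[0], X_demo[1])
d_raw_02 = euclidean(X_demo[0], X_demo[2])

# Standardize each column (zero mean, unit variance), then compare again.
X_demo_std = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
d_std_01 = euclidean(X_demo_std[0], X_demo_std[1])
d_std_02 = euclidean(X_demo_std[0], X_demo_std[2])

print("raw distances:         ", d_raw_01, d_raw_02)
print("standardized distances:", d_std_01, d_std_02)
```

On the raw data the large height difference between samples 0 and 1 is almost invisible next to the income difference between samples 0 and 2; after standardization the two distances are of similar size.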

##### Standardization (Z-score normalization, or mean removal and variance scaling)

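Standardization subtracts the mean and then divides by the standard deviation. The result always has mean 0 and variance 1; if the original feature is normally distributed, the result follows the standard normal distribution. The standardized value is sometimes called the Z-score, i.e. the number of standard deviations by which a value differs from the mean:
$$x^{\prime} = \frac{x - \bar{x}}{\sigma}$$
where $\bar{x}$ is the feature's mean and $\sigma$ is its standard deviation.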
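
As a quick check of the formula, the sketch below computes z-scores by hand with NumPy and compares them with `StandardScaler`. The small array `X_small` is made up for illustration. The comparison assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: 4 samples, 2 features on very different scales.
X_small = np.array([[1.0, 100.0],
                    [2.0, 300.0],
                    [3.0, 500.0],
                    [4.0, 700.0]])

# Manual z-score per column: subtract the mean, divide by the standard deviation.
col_mean = X_small.mean(axis=0)
col_std = X_small.std(axis=0)  # ddof=0, i.e. the population standard deviation
X_manual = (X_small - col_mean) / col_std

# The same operation through scikit-learn.
X_sklearn = StandardScaler().fit_transform(X_small)

print(np.allclose(X_manual, X_sklearn))
print(X_manual.mean(axis=0), X_manual.std(axis=0))
```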

In [3]:
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons

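# Generate a toy dataset with two features.
data_set = make_moons(n_samples=200, noise=0.2, random_state=0)
X, y = data_set[0], data_set[1]

# fit() learns the per-feature mean and variance from X.
scaler = StandardScaler()
scaler.fit(X)

# print(scaler.mean_, scaler.var_)

# transform() applies the learned mean and standard deviation.
X_std = scaler.transform(X)

# print(X_std.mean(), X_std.std())

# fit_transform() does both steps in one call and gives the same result here.
X_std_trans = scaler.fit_transform(X)
pd.DataFrame(X_std_trans)

# The scaling can be undone with scaler.inverse_transform(X_std_trans)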

| | 0 | 1 |
|---|---|---|
| 0 | 0.287164 | 0.388199 |
| 1 | 1.306415 | -1.680581 |
| 2 | -0.718878 | -0.005701 |
| 3 | -1.755822 | -1.237113 |
| 4 | 1.429622 | -0.968282 |
| ... | ... | ... |
| 195 | -1.170324 | 1.194141 |
| 196 | 1.247791 | -0.227332 |
| 197 | -0.233006 | -0.465483 |
| 198 | -0.252812 | 1.475487 |
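
The last comment in the cell above mentions `scaler.inverse_transform`. A minimal sketch of that round trip, regenerating the same `make_moons` data so the block runs on its own, might look like this:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

X, _ = make_moons(n_samples=200, noise=0.2, random_state=0)

scaler = StandardScaler()
X_std = scaler.fit_transform(X)

# Each column now has (numerically) zero mean and unit standard deviation.
print(X_std.mean(axis=0), X_std.std(axis=0))

# inverse_transform undoes the scaling using the fitted mean_ and scale_.
X_back = scaler.inverse_transform(X_std)
print(np.allclose(X_back, X))
```

Keeping the fitted `scaler` object around is what makes this possible: the statistics learned in `fit` are reused both for transforming new data and for mapping results back to the original units.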


##### Normalization (Scaling features to a range)

Normalization rescales the data into a fixed range such as $[0,1]$. In `sklearn` this is mainly done with `sklearn.preprocessing.MinMaxScaler` and `sklearn.preprocessing.MaxAbsScaler`.

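`MinMaxScaler(feature_range=(0, 1))` compresses (or stretches) each feature so that its values fall between 0 and 1. This range is the default and can be changed through `feature_range`: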
$$x^{\prime} = \frac{x - x_{min}}{x_{max}-x_{min}}$$
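
Since `feature_range` can be changed, here is a small sketch of what a non-default range does. The array `X_small` and the target range `(-1, 1)` are chosen purely for illustration. The manual version first applies the formula above and then stretches and shifts the result into the target range, which should match what `MinMaxScaler` returns.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up data: 3 samples, 2 features.
X_small = np.array([[1.0, -5.0],
                    [2.0,  0.0],
                    [4.0, 15.0]])

lo, hi = -1.0, 1.0  # target range, chosen for illustration

# Manual version: scale each column to [0, 1], then map [0, 1] onto [lo, hi].
col_min = X_small.min(axis=0)
col_max = X_small.max(axis=0)
X_unit = (X_small - col_min) / (col_max - col_min)
X_manual = X_unit * (hi - lo) + lo

# The same operation through scikit-learn.
X_sklearn = MinMaxScaler(feature_range=(lo, hi)).fit_transform(X_small)

print(X_sklearn)
print(np.allclose(X_manual, X_sklearn))
```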

`sklearn.preprocessing.MaxAbsScaler()` instead scales the data into the range $[-1,1]$ by dividing each feature by its maximum absolute value:
$$x^{\prime} = \frac{x }{\vert x\vert_{max}}$$

In [12]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.datasets import make_circles

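data_set = make_circles(n_samples=200, noise=0.4, random_state=0)
X, y = data_set[0], data_set[1]

# Rescale each feature to the default range [0, 1].
scaler = MinMaxScaler()

X_minmax = scaler.fit_transform(X)
# Per-feature minimum and maximum after scaling.
print(X_minmax.min(axis=0), X_minmax.max(axis=0))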

[0. 0.] [1. 1.]
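
The cell above only exercises `MinMaxScaler`. For comparison, a minimal sketch of `MaxAbsScaler` on the same `make_circles` data follows; the variable names are illustrative. It also checks the result against a manual division by each column's largest absolute value.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.preprocessing import MaxAbsScaler

X, _ = make_circles(n_samples=200, noise=0.4, random_state=0)

scaler = MaxAbsScaler()
X_maxabs = scaler.fit_transform(X)

# Manual version: divide each column by its largest absolute value.
X_manual = X / np.abs(X).max(axis=0)

print(np.allclose(X_maxabs, X_manual))
# Every value lies in [-1, 1]; each column reaches -1 or 1 at least once.
print(X_maxabs.min(axis=0), X_maxabs.max(axis=0))
print(np.abs(X_maxabs).max(axis=0))
```

Unlike `MinMaxScaler`, this scaler does not shift the data, so zeros stay zeros and a column need not span the whole interval $[-1,1]$.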


#### Missing-value imputation

#### Feature selection

#### Dimensionality reduction

For other dimensionality-reduction algorithms, see [heucoder/dimensionality_reduction_alo_codes](https://github.com/heucoder/dimensionality_reduction_alo_codes).