#### Author: 马肖
#### E-Mail: maxiaoscut@aliyun.com
#### GitHub: https://github.com/Albertsr

### 1. One-Hot Encoding

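### The number of state-register bits needed for each feature column equals the number of distinct values in that column
#### An N-bit state register encodes N states: each state has its own register bit, and at any time exactly one bit is active.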
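The definition above can be sketched by hand with NumPy before turning to scikit-learn. This is only an illustration under assumed data: the `values` array is a hypothetical single feature column, and the variable names are not from the original notebook.

```python
import numpy as np

# Hypothetical example data: one feature column with three distinct values.
values = np.array(['CN', 'USA', 'UK', 'CN'])

# np.unique returns the sorted distinct values (the "states") and, for each
# sample, the index of its value among those states.
states, codes = np.unique(values, return_inverse=True)

# N distinct values -> N bits per sample, with exactly one bit set per row.
one_hot = np.zeros((len(values), len(states)), dtype=int)
one_hot[np.arange(len(values)), codes] = 1

print(states)
print(one_hot)
```

Each row sums to 1, which is the "only one bit active at a time" property stated above.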

In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

In [2]:
X = pd.DataFrame([['Male', 'CN'], ['Female', 'USA'], ['Female', 'UK']])
X

|   | 0 | 1 |
|---|---|---|
| 0 | Male | CN |
| 1 | Female | USA |
| 2 | Female | UK |


In [3]:
enc = OneHotEncoder()
enc.fit_transform(X).toarray()

array([[0., 1., 1., 0., 0.],
       [1., 0., 0., 0., 1.],
       [1., 0., 0., 1., 0.]])

In [4]:
enc.transform([['Female', 'UK'], ['Male', 'CN']]).toarray()

array([[1., 0., 0., 1., 0.],
       [0., 1., 1., 0., 0.]])

#### categories_ : list of arrays

In [5]:
enc.categories_

[array(['Female', 'Male'], dtype=object),
 array(['CN', 'UK', 'USA'], dtype=object)]

#### Return feature names for output features

In [6]:
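# get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2;
# get_feature_names_out() is its replacement.
enc.get_feature_names_out()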

array(['x0_Female', 'x0_Male', 'x1_CN', 'x1_UK', 'x1_USA'], dtype=object)

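#### The handle_unknown parameter
- handle_unknown='ignore': for a category not seen during fit, all dummy columns for that feature are set to 0
- handle_unknown='error' (the default): a category not seen during fit raises an error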

In [7]:
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(X)
enc.transform([['Female', 'UK'], ['Male', 'JP']]).toarray()

array([[1., 0., 0., 1., 0.],
       [0., 1., 0., 0., 0.]])

In [8]:
enc = OneHotEncoder(handle_unknown='error')
enc.fit(X)

try:
    enc.transform([['Female', 'UK'], ['Male', 'JP']]).toarray()
except ValueError:
    print('got exception')

got exception


### 2. Building Dummy Variables with pandas.get_dummies

##### `pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)`

In [9]:
dict_ = {'Nation': ['CN', 'US', 'UK'], 'Explorer': ['Firefox','Chrome','Safari'], 'Quantity': [1, 2, 3]}
df = pd.DataFrame(dict_)
df

|   | Nation | Explorer | Quantity |
|---|---|---|---|
| 0 | CN | Firefox | 1 |
| 1 | US | Chrome | 2 |
| 2 | UK | Safari | 3 |


In [10]:
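# dtype=int keeps the 0/1 output shown here; pandas >= 2.0 defaults to bool (True/False).
df_dummies = pd.get_dummies(df, prefix=['Nation', 'Explorer'], prefix_sep='_', dtype=int)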
df_dummies

|   | Quantity | Nation_CN | Nation_UK | Nation_US | Explorer_Chrome | Explorer_Firefox | Explorer_Safari |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 |
| 1 | 2 | 0 | 0 | 1 | 1 | 0 | 0 |
| 2 | 3 | 0 | 1 | 0 | 0 | 0 | 1 |
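The call above uses only `prefix` and `prefix_sep`. As a rough sketch of two other parameters from the signature, `columns` restricts encoding to the listed columns and `drop_first` drops the first level of each encoded column. The variable name `nation_only` is introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({'Nation': ['CN', 'US', 'UK'],
                   'Explorer': ['Firefox', 'Chrome', 'Safari'],
                   'Quantity': [1, 2, 3]})

# columns=['Nation'] encodes only that column; Explorer is left untouched.
# drop_first=True drops the first level (CN), leaving N-1 indicator columns.
# dtype=int forces 0/1 output regardless of the pandas version's default dtype.
nation_only = pd.get_dummies(df, columns=['Nation'], drop_first=True, dtype=int)
print(nation_only)
```

With `drop_first=True`, a row of all zeros in the `Nation_*` columns stands for the dropped level (`CN`), which avoids perfectly collinear dummy columns in linear models.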
