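## Handling Categorical Variables

**Task:** Apply ordinal encoding or one-hot encoding to the categorical variables of the Titanic passenger dataset.

**Duration:** 0.5 class hours

**Objectives:**

* Understand the role of categorical variable handling in machine learning;
* Understand how ordinal encoding and one-hot encoding are related and how they differ;
* Learn how to implement categorical variable handling.

As the data shows, the variables `pclass`, `sex`, and `embarked` are categorical: their values are category labels rather than quantities. `sex` and `embarked` are stored as strings, while `pclass` is stored as the integers 1, 2, and 3 but denotes first, second, and third class. Most modeling functions in Scikit-learn accept only numeric variables, so categorical variables must be converted to numeric ones. For example, in `pclass` every first-class value (`1`) is converted to 0, every second-class value (`2`) to 1, and every third-class value (`3`) to 2.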
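To make the idea of "converting labels to numbers" concrete before any encoder is used, the mapping can be written out by hand. This is only a sketch on made-up rows (the frame `toy` and the two dictionaries are illustrative assumptions, not part of the dataset or the lab code):

```python
import pandas as pd

# Toy rows made up for illustration; not read from titanic3.xls.
toy = pd.DataFrame({
    "pclass": [1, 3, 2, 3],
    "sex": ["male", "female", "female", "male"],
})

# Hand-written label-to-code mappings, following the sorted order of the labels.
pclass_codes = {1: 0, 2: 1, 3: 2}
sex_codes = {"female": 0, "male": 1}

encoded = pd.DataFrame({
    "pclass": toy["pclass"].map(pclass_codes),
    "sex": toy["sex"].map(sex_codes),
})
print(encoded)
```

The encoders used in the rest of this lab automate exactly this bookkeeping and remember the mapping so that it can be reapplied to the test set.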

Load the required packages.

In [46]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
pd.set_option('mode.chained_assignment',None)

Read the Titanic passenger dataset.

In [47]:
titanic3_file_path="./titanic3.xls"
titanic3 = pd.read_excel(titanic3_file_path)
titanic3.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


Randomly split the data into a training set and a test set.

In [48]:
X = titanic3[['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked']]
y = titanic3['survived']
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size = 0.3, random_state = 123)
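The effect of `test_size` and `random_state` is easier to see on a tiny array. The sketch below uses made-up data (`X_toy`, `y_toy` are illustrative names) and only shows that a 10-row input is split 7/3 and that the same seed reproduces the same split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up samples with two features each.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# First split with a fixed seed.
a_train, a_test, b_train, b_test = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=123
)

# Second split with the same seed gives the same partition.
c_train, c_test, d_train, d_test = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=123
)

print(a_train.shape, a_test.shape)
print(np.array_equal(a_test, c_test))
```

This reproducibility is what later allows the lab to recreate the unencoded training and test sets by calling `train_test_split` again with `random_state = 123`.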

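Fill in the missing values in the training and test sets. The median imputer is fitted on the training set only and then applied to both sets.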

In [49]:
imp = SimpleImputer(strategy='median')
train_X[['age','fare']] = imp.fit_transform(train_X[['age','fare']])
test_X[['age','fare']] = imp.transform(test_X[['age','fare']])
train_X.head()

Unnamed: 0,pclass,sex,age,sibsp,parch,fare,embarked
164,1,male,35.0,0,0,26.55,C
974,3,male,30.0,1,0,16.1,S
759,3,female,36.0,1,0,17.4,S
613,3,male,26.0,0,0,18.7875,C
848,3,male,41.0,2,0,14.1083,S
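As a small check of what the imputer does, the sketch below fits a median `SimpleImputer` on a made-up training frame and reuses the learned medians on a made-up test frame. The frames `train_toy` and `test_toy` are assumptions for illustration only:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up frames with missing values in both columns.
train_toy = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, 30.0],
    "fare": [10.0, 8.0, np.nan, 50.0],
})
test_toy = pd.DataFrame({"age": [np.nan, 60.0], "fare": [np.nan, 7.0]})

imp_toy = SimpleImputer(strategy="median")
train_filled = imp_toy.fit_transform(train_toy)  # learns the medians from train_toy
test_filled = imp_toy.transform(test_toy)        # reuses those medians

# Medians learned from the training data, one per column.
print(imp_toy.statistics_)
print(test_filled)
```

The test-set gaps are filled with the training medians, so no information from the test set leaks into preprocessing.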


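### 1. Ordinal Encoding

Ordinal encoding assigns a separate integer code to each distinct value of a categorical variable.

Call the constructor `OrdinalEncoder()` from the `sklearn.preprocessing` package to create an ordinal encoder.

Call the encoder's `fit()` method to fit it on the training set.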

In [50]:
oe = OrdinalEncoder()
oe.fit(train_X[['pclass','sex','embarked']])

OrdinalEncoder()

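Call the encoder's `transform()` method to encode the training set; it returns the encoded values as a NumPy array. Inspect the encoder's `categories_` attribute to see which category each code corresponds to: a category's position in its array is its code.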

In [51]:
train_X[['pclass','sex','embarked']] = oe.transform(train_X[['pclass','sex','embarked']])
oe.categories_

[array([1, 2, 3], dtype=int64),
 array(['female', 'male'], dtype=object),
 array(['C', 'Q', 'S', nan], dtype=object)]
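The relationship between the codes and `categories_` can be verified on a toy array: each code is the index of the label in the sorted category list, and `inverse_transform()` maps codes back to labels. The array `toy` below is made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Made-up (sex, embarked) pairs.
toy = np.array(
    [["female", "S"], ["male", "C"], ["male", "Q"]],
    dtype=object,
)

enc = OrdinalEncoder()
codes = enc.fit_transform(toy)

# Each code is the position of the label in the matching categories_ array.
print(enc.categories_)
print(codes)

# Map the codes back to the original labels.
restored = enc.inverse_transform(codes)
print(restored)
```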

Inspect the encoded result.

In [52]:
train_X.head()

Unnamed: 0,pclass,sex,age,sibsp,parch,fare,embarked
164,0.0,1.0,35.0,0,0,26.55,0.0
974,2.0,1.0,30.0,1,0,16.1,2.0
759,2.0,0.0,36.0,1,0,17.4,2.0
613,2.0,1.0,26.0,0,0,18.7875,0.0
848,2.0,1.0,41.0,2,0,14.1083,2.0


Call the encoder's `transform()` method to encode the test set.

In [53]:
test_X[['pclass','sex','embarked']] = oe.transform(test_X[['pclass','sex','embarked']])
test_X.head()

Unnamed: 0,pclass,sex,age,sibsp,parch,fare,embarked
1244,2.0,0.0,16.0,1,1,8.5167,0.0
798,2.0,1.0,27.0,0,0,7.05,2.0
437,1.0,0.0,24.0,1,2,65.0,2.0
84,0.0,1.0,39.0,1,0,71.2833,0.0
1307,2.0,1.0,27.0,0,0,7.225,0.0


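As can be seen, all categorical variables have been converted to numeric variables. For a variable with more than two distinct categories, this treatment implies an order among the categories. For example, in `embarked`, 0 corresponds to Cherbourg, 1 to Queenstown, and 2 to Southampton, and the fourth entry of the category list (index 3, `nan`) stands for an unknown port. Depending on the scikit-learn version, missing values may be passed through as NaN rather than encoded as 3. Left as is, the codes imply the ordering Cherbourg < Queenstown < Southampton, which is clearly not what we want.

### 2. One-Hot Encoding

One-hot encoding creates one dummy variable for each distinct value. For example, after category encoding the variable `embarked` has 4 distinct values: 0, 1, 2, and 3. One-hot encoding turns this variable into 4 variables: a record with value 0 is encoded as `[1,0,0,0]`, a record with value 1 as `[0,1,0,0]`, and so on.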
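The difference between the two encodings can be sketched with plain NumPy. The example below is an illustration only (the code array is made up, and indexing `np.eye` is just one convenient way to build one-hot vectors, not the method used in this lab): ordinal codes put categories at different distances from each other, while one-hot vectors keep every pair of distinct categories equally far apart.

```python
import numpy as np

# Made-up ordinal codes for four records: C, S, Q, unknown.
codes = np.array([0, 2, 1, 3])

# Row i of the identity matrix is the one-hot vector for code i.
one_hot = np.eye(4, dtype=int)[codes]
print(one_hot)

# With ordinal codes, C-S and C-Q are at different distances.
gap_c_s = abs(int(codes[1]) - int(codes[0]))
gap_c_q = abs(int(codes[2]) - int(codes[0]))

# With one-hot vectors, every pair of distinct categories is equally far apart.
dist_c_s = np.linalg.norm(one_hot[0] - one_hot[1])
dist_c_q = np.linalg.norm(one_hot[0] - one_hot[2])
print(gap_c_s, gap_c_q, dist_c_s, dist_c_q)
```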

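Restore the training and test sets to their state before encoding. Using the same `random_state` reproduces the earlier split, and the already fitted imputer is reused.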

In [54]:
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size = 0.3, random_state = 123)
train_X[['age','fare']] = imp.transform(train_X[['age','fare']])
test_X[['age','fare']] = imp.transform(test_X[['age','fare']])
train_X.head()

Unnamed: 0,pclass,sex,age,sibsp,parch,fare,embarked
164,1,male,35.0,0,0,26.55,C
974,3,male,30.0,1,0,16.1,S
759,3,female,36.0,1,0,17.4,S
613,3,male,26.0,0,0,18.7875,C
848,3,male,41.0,2,0,14.1083,S


Get the names of the categorical and numeric variables.

In [55]:
categorical_cols = ['pclass','sex','embarked']
numeric_cols = list(set(X.columns) - set(categorical_cols))
numeric_cols

['sibsp', 'age', 'fare', 'parch']
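Because `numeric_cols` is built from a set difference, the order of its elements is not guaranteed and can change between runs; any column labels derived from it must follow whatever order it actually has. If a stable order is wanted, one option (a sketch, not the approach used in this lab) is to filter the original column list instead:

```python
# Column names as they appear in X, written out here so the sketch is self-contained.
columns = ["pclass", "sex", "age", "sibsp", "parch", "fare", "embarked"]
categorical = ["pclass", "sex", "embarked"]

# Keep the original column order while dropping the categorical names.
numeric_ordered = [c for c in columns if c not in categorical]
print(numeric_ordered)
```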

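Call the constructor `OneHotEncoder()` from the `sklearn.preprocessing` package to create a one-hot encoder, where

* the `categories` parameter gives the distinct values of each categorical variable: `'auto'` determines them automatically from the data, or a list of distinct values can be supplied for each variable; the default is `'auto'`;
* the `drop` parameter specifies which one-hot variable to remove in order to avoid collinearity: `'first'` removes the first variable of each feature and `None` removes nothing; the default is `None`;
* the `sparse` parameter specifies whether to output a sparse matrix; the default is `True`. In scikit-learn 1.2 and later this parameter is named `sparse_output`, and the old name `sparse` was removed in version 1.4.

Call the encoder's `fit()` method to fit the one-hot encoder. To avoid collinearity, the first one-hot variable of each feature is dropped here.
For example, one-hot encoding turns `sex` into 2 variables: the first indicates whether the value is `female` and the second whether it is `male`. The two always sum to 1, so if the first is 1 the second must be 0; one is fully determined by the other, which is collinearity. Dropping the first one-hot variable removes this redundancy.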

In [56]:
ohe = OneHotEncoder(drop = 'first', sparse = False)
ohe.fit(train_X[['pclass','sex','embarked']])

OneHotEncoder(drop='first', sparse=False)

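Call the encoder's `transform()` method to one-hot encode the training set; it returns the encoded values as a numeric NumPy array, which is then concatenated column-wise with the other numeric variables.

Inspect the encoder's `categories_` attribute to see which category each one-hot variable corresponds to.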

In [57]:
train_X = np.hstack((ohe.transform(train_X[categorical_cols]), train_X[numeric_cols]))
ohe.categories_

[array([1, 2, 3], dtype=int64),
 array(['female', 'male'], dtype=object),
 array(['C', 'Q', 'S', nan], dtype=object)]

Generate the variable names after one-hot encoding and create a data frame. The numeric column names are taken from `numeric_cols`, so the labels follow the same order in which the columns were stacked.

In [58]:
cols = ["pclass_2nd","pclass_3rd","sex_male","embarked_Queenstown","embarked_Southampton","embarked_Unknown"] + numeric_cols
train_X = pd.DataFrame(train_X, columns = cols)
train_X.head()

Unnamed: 0,pclass_2nd,pclass_3rd,sex_male,embarked_Queenstown,embarked_Southampton,embarked_Unknown,sibsp,age,fare,parch
0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,35.0,26.55,0.0
1,0.0,1.0,1.0,0.0,1.0,0.0,1.0,30.0,16.1,0.0
2,0.0,1.0,0.0,0.0,1.0,0.0,1.0,36.0,17.4,0.0
3,0.0,1.0,1.0,0.0,0.0,0.0,0.0,26.0,18.7875,0.0
4,0.0,1.0,1.0,0.0,1.0,0.0,2.0,41.0,14.1083,0.0


Call the encoder's `transform()` method to one-hot encode the test set; it returns the encoded values as a numeric NumPy array, which is then concatenated column-wise with the other numeric variables.

In [59]:
test_X = np.hstack((ohe.transform(test_X[categorical_cols]), test_X[numeric_cols]))
test_X = pd.DataFrame(test_X, columns = cols)
test_X.head()

Unnamed: 0,pclass_2nd,pclass_3rd,sex_male,embarked_Queenstown,embarked_Southampton,embarked_Unknown,sibsp,age,fare,parch
0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,16.0,8.5167,1.0
1,0.0,1.0,1.0,0.0,1.0,0.0,0.0,27.0,7.05,0.0
2,1.0,0.0,0.0,0.0,1.0,0.0,1.0,24.0,65.0,2.0
3,0.0,0.0,1.0,0.0,0.0,0.0,1.0,39.0,71.2833,0.0
4,0.0,1.0,1.0,0.0,0.0,0.0,0.0,27.0,7.225,0.0


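After one-hot encoding, the column indices have the following meanings:

* Column indices 0 - 1: whether `pclass` is 2nd or 3rd class; 1 means yes, 0 means no;
* Column index 2: whether `sex` is `male`; 1 means yes, 0 means no;
* Column indices 3 - 5: whether `embarked` is Queenstown, Southampton, or Unknown; 1 means yes, 0 means no;
* Column indices 6 - 9: the variables `sibsp`, `age`, `fare`, and `parch`, in the order given by `numeric_cols`.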