# 4.2. Handling Categorical Data

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.DataFrame([
    ['green', 'M', 10.1, 'class1'],
    ['red', 'L', 13.5, 'class2'],
    ['blue', 'XL', 15.3, 'class1']
])
df.columns = ['color', 'size', 'price', 'classlabel']

In [3]:
df

   color size  price classlabel
0  green    M   10.1     class1
1    red    L   13.5     class2
2   blue   XL   15.3     class1


## 4.2.1. Mapping Ordinal Features

In [4]:
size_mapping = {
    'XL': 3,
    'L': 2,
    'M': 1
}

In [5]:
help(size_mapping)

Help on dict object:

class dict(object)
 |  dict() -> new empty dictionary
 |  dict(mapping) -> new dictionary initialized from a mapping object's
 |      (key, value) pairs
 |  dict(iterable) -> new dictionary initialized as if via:
 |      d = {}
 |      for k, v in iterable:
 |          d[k] = v
 |  dict(**kwargs) -> new dictionary initialized with the name=value pairs
 |      in the keyword argument list.  For example:  dict(one=1, two=2)
 |  
 |  Methods defined here:
 |  
 |  __contains__(self, key, /)
 |      True if D has a key k, else False.
 |  
 |  __delitem__(self, key, /)
 |      Delete self[key].
 |  
 |  __eq__(self, value, /)
 |      Return self==value.
 |  
 |  __ge__(self, value, /)
 |      Return self>=value.
 |  
 |  __getattribute__(self, name, /)
 |      Return getattr(self, name).
 |  
 |  __getitem__(...)
 |      x.__getitem__(y) <==> x[y]
 |  
 |  __gt__(self, value, /)
 |      Return self>value.
 |  
 |  __init__(self, /, *args, **kwargs)
 |      Initialize s

In [6]:
df['size'] = df['size'].map(size_mapping)

In [7]:
df

   color  size  price classlabel
0  green     1   10.1     class1
1    red     2   13.5     class2
2   blue     3   15.3     class1


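If the integer values later need to be converted back to the original ordinal strings, we can define an inverse mapping dictionary, inv_size_mapping, and apply it to the transformed feature column with the map method.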

In [8]:
inv_size_mapping = {v:k for k, v in size_mapping.items()}

In [9]:
inv_size_mapping

{3: 'XL', 2: 'L', 1: 'M'}

In [10]:
df_cp = df.copy()
df_cp['size'] = df_cp['size'].map(inv_size_mapping)
df_cp

   color size  price classlabel
0  green    M   10.1     class1
1    red    L   13.5     class2
2   blue   XL   15.3     class1
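One thing to watch for with map is a label that the mapping dictionary does not cover. The following is a minimal sketch, assuming a hypothetical extra size 'S' that is not part of the dataset above; it shows that such labels come back as NaN instead of raising an error.

```python
import pandas as pd

size_mapping = {'XL': 3, 'L': 2, 'M': 1}

# Hypothetical column containing 'S', which size_mapping does not cover.
sizes = pd.Series(['M', 'L', 'XL', 'S'])

encoded = sizes.map(size_mapping)
print(encoded)

# Labels without a mapping become NaN; listing them makes the gap visible.
unmapped = sizes[encoded.isna()].unique().tolist()
print(unmapped)
```

Checking for NaN right after the mapping step is a cheap way to catch labels that were forgotten in the dictionary.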


## 4.2.2. Encoding Class Labels

Class labels are not ordinal, and it does not matter which integer we assign to a particular string label.

In [11]:
class_mapping = {label:idx for idx, label in 
                enumerate(np.unique(df['classlabel']))}
class_mapping

{'class1': 0, 'class2': 1}

Next, we can use the mapping dictionary to convert the class labels to integers.

In [12]:
df['classlabel'] = df['classlabel'].map(class_mapping)
df

   color  size  price  classlabel
0  green     1   10.1           0
1    red     2   13.5           1
2   blue     3   15.3           0


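Inverting the key-value pairs of the mapping dictionary lets us convert the encoded class labels back to their original string representation.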

In [13]:
inv_class_mapping = {v:k for k, v in class_mapping.items()}
df['classlabel'] = df['classlabel'].map(inv_class_mapping)
df

   color  size  price classlabel
0  green     1   10.1     class1
1    red     2   13.5     class2
2   blue     3   15.3     class1


The LabelEncoder class in scikit-learn offers a more convenient way to integer-encode class labels.

In [14]:
from sklearn.preprocessing import LabelEncoder
help(LabelEncoder)

Help on class LabelEncoder in module sklearn.preprocessing._label:

class LabelEncoder(sklearn.base.TransformerMixin, sklearn.base.BaseEstimator)
 |  Encode target labels with value between 0 and n_classes-1.
 |  
 |  This transformer should be used to encode target values, *i.e.* `y`, and
 |  not the input `X`.
 |  
 |  Read more in the :ref:`User Guide <preprocessing_targets>`.
 |  
 |  .. versionadded:: 0.12
 |  
 |  Attributes
 |  ----------
 |  classes_ : array of shape (n_class,)
 |      Holds the label for each class.
 |  
 |  Examples
 |  --------
 |  `LabelEncoder` can be used to normalize labels.
 |  
 |  >>> from sklearn import preprocessing
 |  >>> le = preprocessing.LabelEncoder()
 |  >>> le.fit([1, 2, 2, 6])
 |  LabelEncoder()
 |  >>> le.classes_
 |  array([1, 2, 6])
 |  >>> le.transform([1, 1, 2, 6])
 |  array([0, 0, 1, 2]...)
 |  >>> le.inverse_transform([0, 0, 1, 2])
 |  array([1, 1, 2, 6])
 |  
 |  It can also be used to transform non-numerical labels (as long as they

In [15]:
class_le = LabelEncoder()
y = class_le.fit_transform(df['classlabel'].values)
y

array([0, 1, 0])

The inverse_transform method converts the integer class labels back to their original string representation.

In [16]:
class_le.inverse_transform(y)

array(['class1', 'class2', 'class1'], dtype=object)
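To see which integer each label received, the fitted encoder's classes_ attribute can be inspected. The sketch below rebuilds the class labels as a standalone array and also tries a hypothetical label 'class3' that was not present during fitting; the names learned and unseen_ok are only illustrative.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(['class1', 'class2', 'class1'])

class_le = LabelEncoder()
y = class_le.fit_transform(labels)

# classes_ holds the sorted unique labels; each label's index is its code.
learned = {str(label): idx for idx, label in enumerate(class_le.classes_)}
print(learned)

# A label that was absent during fit cannot be encoded and raises ValueError.
try:
    class_le.transform(['class3'])
    unseen_ok = True
except ValueError:
    unseen_ok = False
print(unseen_ok)
```

The resulting dictionary matches the class_mapping built by hand earlier, because both np.unique and LabelEncoder sort the labels before numbering them.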

We can use the same approach to process the nominal color column of the dataset.

In [17]:
X = df[['color', 'size', 'price']].values
color_le = LabelEncoder()
X[:, 0] = color_le.fit_transform(X[:, 0])
X

array([[1, 1, 10.1],
       [2, 2, 13.5],
       [0, 3, 15.3]], dtype=object)

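Although the color values have no particular order, a learning algorithm will assume that green is larger than blue and that red is larger than green. This assumption is incorrect, yet the algorithm can still produce useful results. Those results, however, may not be optimal.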
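The implied ordering comes from how LabelEncoder assigns codes: it sorts the labels and numbers them in that order. A small sketch, using a standalone array with the three colors from the dataset:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

colors = np.array(['green', 'red', 'blue'])

color_le = LabelEncoder()
codes = color_le.fit_transform(colors)

# Pair each color with the integer code it received.
color_codes = {str(c): int(i) for c, i in zip(colors, codes)}
print(color_codes)

# Sorting by code reveals the artificial ranking a model would see.
implied_order = sorted(color_codes, key=color_codes.get)
print(implied_order)
```

The ranking is purely alphabetical, so it carries no information about the colors themselves.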

## 4.2.3. One-Hot Encoding on Nominal Features

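A common way to avoid this problem is one-hot encoding.

The idea behind this approach is to create new dummy features, with each dummy column representing one unique value of the nominal feature.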
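As a quick illustration of what the dummy features look like, here is a sketch that uses pandas' get_dummies function rather than scikit-learn. It rebuilds the feature columns of the example DataFrame with size already mapped to integers; the dtype=int argument is a choice made here so the dummy columns show 0 and 1 instead of booleans.

```python
import pandas as pd

df = pd.DataFrame({
    'color': ['green', 'red', 'blue'],
    'size': [1, 2, 3],
    'price': [10.1, 13.5, 15.3],
})

# Expand only the color column; size and price are left unchanged.
dummies = pd.get_dummies(df, columns=['color'], dtype=int)
print(dummies)
```

Each row has exactly one 1 among the color_blue, color_green and color_red columns, so no ordering between colors is implied.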

In [18]:
from sklearn.preprocessing import OneHotEncoder
help(OneHotEncoder)

Help on class OneHotEncoder in module sklearn.preprocessing._encoders:

class OneHotEncoder(_BaseEncoder)
 |  Encode categorical features as a one-hot numeric array.
 |  
 |  The input to this transformer should be an array-like of integers or
 |  strings, denoting the values taken on by categorical (discrete) features.
 |  The features are encoded using a one-hot (aka 'one-of-K' or 'dummy')
 |  encoding scheme. This creates a binary column for each category and
 |  returns a sparse matrix or dense array (depending on the ``sparse``
 |  parameter)
 |  
 |  By default, the encoder derives the categories based on the unique values
 |  in each feature. Alternatively, you can also specify the `categories`
 |  manually.
 |  
 |  This encoding is needed for feeding categorical data to many scikit-learn
 |  estimators, notably linear models and SVMs with the standard kernels.
 |  
 |  Note: a one-hot encoding of y labels should use a LabelBinarizer
 |  instead.
 |  
 |  Read more in the :ref: