# 5.4 Feature Engineering

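## 5.4.1 Categorical Features

One-hot encoding is a common way to represent categorical features: each category gets its own column, which holds 1 when the sample belongs to that category and 0 otherwise.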

In [2]:
data = [
    {'price': 850000, 'rooms': 4, 'neighborhood': 'Queen Anne'},
    {'price': 700000, 'rooms': 3, 'neighborhood': 'Fremont'},
    {'price': 650000, 'rooms': 3, 'neighborhood': 'Wallingford'},
    {'price': 600000, 'rooms': 2, 'neighborhood': 'Fremont'}
]

In [3]:
from sklearn.feature_extraction import DictVectorizer

In [4]:
vec = DictVectorizer(sparse=False, dtype=int)

In [5]:
vec.fit_transform(data)

array([[     0,      1,      0, 850000,      4],
       [     1,      0,      0, 700000,      3],
       [     0,      0,      1, 650000,      3],
       [     1,      0,      0, 600000,      2]])

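To see what each column means, inspect the feature names. (In scikit-learn 1.0 and later, `get_feature_names()` is deprecated in favor of `get_feature_names_out()`, and it was removed in 1.2.)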

In [6]:
vec.get_feature_names()

['neighborhood=Fremont',
 'neighborhood=Queen Anne',
 'neighborhood=Wallingford',
 'price',
 'rooms']
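
To make the encoding concrete, here is a minimal sketch of the same transformation in plain Python. The helper `one_hot_encode` is hypothetical, written only for illustration; it is not how scikit-learn implements `DictVectorizer`, and it assumes every string-valued field is categorical and every other field is numeric.

```python
# Hypothetical helper for illustration only; not part of scikit-learn.
def one_hot_encode(records):
    """Expand string-valued fields into 0/1 indicator columns."""
    columns = set()
    for record in records:
        for key, value in record.items():
            # String values become "key=value" indicator columns;
            # numeric values keep a single column named after the key.
            columns.add(f"{key}={value}" if isinstance(value, str) else key)
    columns = sorted(columns)

    rows = []
    for record in records:
        row = dict.fromkeys(columns, 0)
        for key, value in record.items():
            if isinstance(value, str):
                row[f"{key}={value}"] = 1
            else:
                row[key] = value
        rows.append([row[c] for c in columns])
    return columns, rows


data = [
    {'price': 850000, 'rooms': 4, 'neighborhood': 'Queen Anne'},
    {'price': 700000, 'rooms': 3, 'neighborhood': 'Fremont'},
    {'price': 650000, 'rooms': 3, 'neighborhood': 'Wallingford'},
    {'price': 600000, 'rooms': 2, 'neighborhood': 'Fremont'},
]

columns, rows = one_hot_encode(data)
print(columns)
for row in rows:
    print(row)
```

With the columns sorted alphabetically, the rows line up with the array produced by `DictVectorizer` above.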

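Because the encoded data contains many zeros, a sparse matrix is an efficient way to represent it.

This is where a sparse matrix pays off: it stores only the nonzero values and their positions.

Internally, `DictVectorizer` builds a sparse matrix directly from the data; if `sparse` is `False`, it then converts that sparse matrix into a `numpy.ndarray`.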

In [16]:
vec = DictVectorizer(sparse=True, dtype=int)

In [17]:
vec.fit_transform(data)

<4x5 sparse matrix of type '<class 'numpy.int32'>'
	with 12 stored elements in Compressed Sparse Row format>

In [18]:
vec.fit_transform(data).toarray()

array([[     0,      1,      0, 850000,      4],
       [     1,      0,      0, 700000,      3],
       [     0,      0,      1, 650000,      3],
       [     1,      0,      0, 600000,      2]])

## 5.4.2 Text Features

In [10]:
sample = ['problem of evil', 'evil queen', 'horizon problem']

In [12]:
from sklearn.feature_extraction.text import CountVectorizer

In [19]:
vec = CountVectorizer()

In [20]:
x = vec.fit_transform(sample)

In [21]:
x

<3x5 sparse matrix of type '<class 'numpy.int64'>'
	with 7 stored elements in Compressed Sparse Row format>

In [23]:
import pandas as pd

In [24]:
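# On scikit-learn 1.0+, use vec.get_feature_names_out() instead;
# get_feature_names() was removed in scikit-learn 1.2.
pd.DataFrame(x.toarray(), columns=vec.get_feature_names())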

|   | evil | horizon | of | problem | queen |
|---|------|---------|----|---------|-------|
| 0 | 1 | 0 | 1 | 1 | 0 |
| 1 | 1 | 0 | 0 | 0 | 1 |
| 2 | 0 | 1 | 0 | 1 | 0 |
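
The word-count table above can be rebuilt by hand, which shows what a bag-of-words encoding contains. This is a simplified sketch rather than scikit-learn's implementation: it assumes whitespace tokenization, whereas `CountVectorizer` by default uses a regular expression that also drops single-character tokens.

```python
from collections import Counter

sample = ['problem of evil', 'evil queen', 'horizon problem']

# Simplified tokenizer (an assumption for this sketch): lowercase, split on whitespace.
tokenized = [text.lower().split() for text in sample]

# The vocabulary is the sorted set of all distinct words; each word is one column.
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# Each document becomes a row holding the count of every vocabulary word.
counts = []
for tokens in tokenized:
    word_counts = Counter(tokens)
    counts.append([word_counts[word] for word in vocabulary])

print(vocabulary)
for row in counts:
    print(row)
```

For these three phrases the simplified tokenizer and the default `CountVectorizer` settings agree, so the rows match the table above.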
