## 5.2.1 Feature Hashing

### Example 5-5. Feature hashing (also known as "the hashing trick")

In [1]:
import pandas as pd
import json

In [3]:
# Load the first 10,000 reviews
js = []
with open('data/yelp_dataset/review.json') as review_file:
    for _ in range(10000):
        js.append(json.loads(review_file.readline()))
review_df = pd.DataFrame(js)
review_df.shape

(10000, 9)

In [6]:
review_df.head()

|   | business_id | cool | date | funny | review_id | stars | text | useful | user_id |
|---|---|---|---|---|---|---|---|---|---|
| 0 | ujmEBvifdJM6h6RLv4wQIg | 0 | 2013-05-07 04:34:36 | 1 | Q1sbwvVQXV2734tPgoKj4Q | 1.0 | Total bill for this horrible service? Over $8G... | 6 | hG7b0MtEbXx5QzbzE6C_VA |
| 1 | NZnhc2sEQy3RmzKTZnqtwQ | 0 | 2017-01-14 21:30:33 | 0 | GJXCdrto3ASJOqKeVWPi6Q | 5.0 | I \*adore\* Travis at the Hard Rock's new Kelly ... | 0 | yXQM5uF2jS6es16SJzNHfg |
| 2 | WTqjgwHlXbSFevF32_DJVw | 0 | 2016-11-09 20:09:03 | 0 | 2TzJjDVDEuAW6MR5Vuc1ug | 5.0 | I have to say that this office really has it t... | 3 | n6-Gk65cPZL6Uz8qRm3NYw |
| 3 | ikCg8xy5JIg_NGPx-MSIDA | 0 | 2018-01-09 20:56:38 | 0 | yi0R0Ugj_xUx_Nek0-_Qig | 5.0 | Went in for a lunch. Steak sandwich was delici... | 0 | dacAIZ6fTM6mqwW5uxkskg |
| 4 | b1b1eb3uo-w561D0ZfCEiQ | 0 | 2018-01-30 23:07:38 | 0 | 11a8sVPMUFtaC7_ABRkmtw | 1.0 | Today was my second out of three sessions I ha... | 7 | ssoyf2_x0EQMed6fgHeMyQ |


In [4]:
# Define m as the number of unique business_id values
m = len(review_df.business_id.unique())
m

4618

In [7]:
from sklearn.feature_extraction import FeatureHasher

In [8]:
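h = FeatureHasher(n_features=m, input_type='string')
# With input_type='string', each sample must be an iterable of strings,
# so wrap every ID in a one-element list. Older scikit-learn releases hash
# a bare string character by character; newer ones reject it.
f = h.transform([[business_id] for business_id in review_df['business_id']])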
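
To make the transform above less opaque, here is a minimal pure-Python sketch of the hashing trick. It is not scikit-learn's implementation: `FeatureHasher` uses a signed 32-bit MurmurHash3 and alternates the sign of the stored value by default, while this sketch substitutes MD5 from the standard library and always stores 1.0. The helper names `hash_index` and `hash_row` are hypothetical.

```python
import hashlib


def hash_index(value, n_features):
    """Map a string to a column index in [0, n_features)."""
    # MD5 is a stand-in for MurmurHash3 (an assumption, to stay stdlib-only).
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_features


def hash_row(value, n_features):
    """Return a dense row with a single 1.0 at the hashed column."""
    row = [0.0] * n_features
    row[hash_index(value, n_features)] = 1.0
    return row


ids = ["ujmEBvifdJM6h6RLv4wQIg", "NZnhc2sEQy3RmzKTZnqtwQ"]
n_features = 16  # deliberately small so the rows are easy to inspect
rows = [hash_row(business_id, n_features) for business_id in ids]

for business_id, row in zip(ids, rows):
    print(business_id, "-> column", row.index(1.0))
```

The key property is that no vocabulary is stored: the column is computed from the string itself, which is why the mapping cannot be read backwards.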

In [9]:
# Effect of hashing on feature interpretability
review_df['business_id'].unique().tolist()[0:5]

['ujmEBvifdJM6h6RLv4wQIg',
 'NZnhc2sEQy3RmzKTZnqtwQ',
 'WTqjgwHlXbSFevF32_DJVw',
 'ikCg8xy5JIg_NGPx-MSIDA',
 'b1b1eb3uo-w561D0ZfCEiQ']

In [12]:
f.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
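
The array above hides a second cost besides unreadable columns: when `m` distinct IDs share `m` buckets, some IDs land in the same column, so a column index cannot be traced back to a single business. The sketch below estimates how often that happens. It uses synthetic IDs and an MD5-based stand-in for the library's hash (`bucket_of` and the generated IDs are hypothetical), so the exact counts will differ from what `FeatureHasher` produces on the Yelp data.

```python
import hashlib
from collections import Counter


def bucket_of(value, n_buckets):
    """Assign a string to one of n_buckets using a stand-in hash (MD5)."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets


n_ids = 4618  # same count as m in the example
synthetic_ids = [f"business-{i}" for i in range(n_ids)]  # made-up IDs

# Count how many IDs fall into each bucket when buckets == distinct IDs.
buckets = Counter(bucket_of(b, n_ids) for b in synthetic_ids)
occupied = len(buckets)
colliding_ids = sum(count for count in buckets.values() if count > 1)

print("occupied buckets:", occupied, "of", n_ids)
print("IDs sharing a bucket:", colliding_ids)
```

With as many buckets as distinct values, a uniform hash is expected to leave a fraction of about 1/e of the buckets empty, so collisions are the norm rather than the exception.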

In [13]:
# Not a rigorous comparison, but look at the storage size of each representation.
# Note: getsizeof is shallow, so the sparse-matrix figure excludes its data buffers.
from sys import getsizeof

print('Our pandas Series, in bytes: ', getsizeof(review_df['business_id']))
print('Our hashed sparse matrix, in bytes: ', getsizeof(f))

Our pandas Series, in bytes:  790104
Our hashed sparse matrix, in bytes:  56


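We can clearly see that feature hashing offers a substantial computational benefit at the cost of immediate interpretability for the user. For large datasets, this is an easy trade-off to accept when moving from data exploration and visualization to a machine learning pipeline.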
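
One caveat on the size comparison: `sys.getsizeof` reports only the shallow size of an object, so for a SciPy sparse matrix it leaves out the underlying value and index buffers. A rough way to measure the real footprint is to sum the `nbytes` of those buffers. The sketch below does this on a synthetic matrix with one nonzero per row rather than on the actual hashed output; the column assignment is arbitrary and only stands in for hashed indices.

```python
import sys

import numpy as np
from scipy.sparse import csr_matrix

n_rows, n_cols = 10000, 4618
rows = np.arange(n_rows)
cols = rows % n_cols          # arbitrary stand-in for hashed column indices
data = np.ones(n_rows)        # one nonzero value per row
X = csr_matrix((data, (rows, cols)), shape=(n_rows, n_cols))

shallow_bytes = sys.getsizeof(X)  # Python object only, buffers excluded
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
dense_bytes = n_rows * n_cols * X.dtype.itemsize  # computed, never allocated

print("shallow getsizeof:", shallow_bytes)
print("sparse buffers:   ", sparse_bytes)
print("dense equivalent: ", dense_bytes)
```

The sparse representation remains far smaller than a dense array of the same shape, but its true size is the sum of its buffers, not the 56 bytes reported for the wrapper object.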