2.3 Standardizing Data

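Different features are measured in different units, so their numeric values differ in scale. To remove the influence that differences in units and value ranges may have, the data need to be standardized.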

2.3.1 Min-max normalization (0-1 normalization; maps the raw data onto the interval [0, 1]): $x_1=\frac{x-\min}{\max-\min}$

In [17]:
import pandas as pd 
import numpy as np

In [2]:
data = pd.read_excel('data/meal_order_detail.xlsx')
display(data.head())
data[['counts','amounts']].describe()

Unnamed: 0,detail_id,order_id,dishes_id,logicprn_name,parent_class_name,dishes_name,itemis_add,counts,amounts,cost,place_order_time,discount_amt,discount_reason,kick_back,add_inprice,add_info,bar_code,picture_file,emp_id
0,2956,417,610062,,,蒜蓉生蚝,0,1,49,,2016-08-01 11:05:36,,,,0,,,caipu/104001.jpg,1442
1,2958,417,609957,,,蒙古烤羊腿\r\n\r\n\r\n,0,1,48,,2016-08-01 11:07:07,,,,0,,,caipu/202003.jpg,1442
2,2961,417,609950,,,大蒜苋菜,0,1,30,,2016-08-01 11:07:40,,,,0,,,caipu/303001.jpg,1442
3,2966,417,610038,,,芝麻烤紫菜,0,1,25,,2016-08-01 11:11:11,,,,0,,,caipu/105002.jpg,1442
4,2968,417,610003,,,蒜香包,0,1,13,,2016-08-01 11:11:30,,,,0,,,caipu/503002.jpg,1442


Unnamed: 0,counts,amounts
count,2779.0,2779.0
mean,1.111191,45.337172
std,0.625428,36.80855
min,1.0,1.0
25%,1.0,25.0
50%,1.0,35.0
75%,1.0,56.0
max,10.0,178.0


In [3]:
# Define the min-max scaling function
def min_max_scaler(x):
    return (x-x.min())/(x.max()-x.min())


In [4]:
data['max_min_amounts'] = min_max_scaler(data['amounts'])
data.head()


Unnamed: 0,detail_id,order_id,dishes_id,logicprn_name,parent_class_name,dishes_name,itemis_add,counts,amounts,cost,place_order_time,discount_amt,discount_reason,kick_back,add_inprice,add_info,bar_code,picture_file,emp_id,max_min_amounts
0,2956,417,610062,,,蒜蓉生蚝,0,1,49,,2016-08-01 11:05:36,,,,0,,,caipu/104001.jpg,1442,0.271186
1,2958,417,609957,,,蒙古烤羊腿\r\n\r\n\r\n,0,1,48,,2016-08-01 11:07:07,,,,0,,,caipu/202003.jpg,1442,0.265537
2,2961,417,609950,,,大蒜苋菜,0,1,30,,2016-08-01 11:07:40,,,,0,,,caipu/303001.jpg,1442,0.163842
3,2966,417,610038,,,芝麻烤紫菜,0,1,25,,2016-08-01 11:11:11,,,,0,,,caipu/105002.jpg,1442,0.135593
4,2968,417,610003,,,蒜香包,0,1,13,,2016-08-01 11:11:30,,,,0,,,caipu/503002.jpg,1442,0.067797


In [16]:
data[['max_min_amounts','amounts','counts']].corr(method='pearson')     # Pearson correlation matrix

Unnamed: 0,max_min_amounts,amounts,counts
max_min_amounts,1.0,1.0,-0.174648
amounts,1.0,1.0,-0.174648
counts,-0.174648,-0.174648,1.0
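
The correlation of exactly 1.0 between amounts and max_min_amounts above is expected: min-max scaling is a linear transform with a positive slope, and the Pearson coefficient is unchanged by such transforms. A minimal sketch on made-up toy values (not taken from meal_order_detail.xlsx) that checks this:

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "amounts": [49, 48, 30, 25, 13],
    "counts": [1, 2, 1, 3, 1],
})

def min_max_scaler(x):
    # Same formula as in this section: (x - min) / (max - min)
    return (x - x.min()) / (x.max() - x.min())

toy["scaled"] = min_max_scaler(toy["amounts"])

# Correlation with a third column is the same before and after scaling
r_raw = toy["amounts"].corr(toy["counts"])
r_scaled = toy["scaled"].corr(toy["counts"])
print(r_raw, r_scaled)
```

Because of this, scaling a feature does not change its linear relationship with other features; it only changes the range of its values.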


In [15]:
data['counts'].corr(data['amounts'])   # correlation coefficient between two columns

-0.1746479366151912

In [13]:
data.corr?

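2.3.2 Standard-deviation standardization (also called zero-mean or z-score standardization; the processed data have mean 0 and standard deviation 1): $x_1=\frac{x-\mathrm{mean}}{\mathrm{std}}$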

In [14]:
# Define the z-score scaling function
def stander_scaler(x):
    return (x-x.mean())/(x.std())


In [18]:
data['stander_amounts'] = stander_scaler(data['amounts'])

In [21]:
data['stander_amounts'].mean()

1.469577328565541e-16

In [22]:
data['stander_amounts'].std()

0.999999999999995
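
One detail worth keeping in mind: pandas' Series.std() uses the sample standard deviation (ddof=1) by default, while NumPy's np.std uses the population standard deviation (ddof=0). The sketch below, on made-up toy values, shows that the standardized data have unit standard deviation only under the same convention that was used for scaling:

```python
import numpy as np
import pandas as pd

x = pd.Series([49.0, 48.0, 30.0, 25.0, 13.0])  # toy values, for illustration

# Scale with the sample std (pandas default, ddof=1)
z_sample = (x - x.mean()) / x.std()
# Scale with the population std (ddof=0, the np.std default)
z_population = (x - x.mean()) / x.std(ddof=0)

print(z_sample.std())              # checked with ddof=1
print(z_population.std(ddof=0))    # checked with ddof=0
print(np.std(x.to_numpy()), x.std(ddof=0))  # the two population stds agree
```

For a few thousand rows, as in this dataset, the difference between the two conventions is small, but it can matter when comparing results across libraries.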

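2.3.3 Decimal scaling (shifts the decimal point to map the data onto the interval [-1, 1]; the number of places shifted depends on the largest absolute value in the data): $x_1=\frac{x}{10^k}$, where $k=\lceil\log_{10}(\max|x|)\rceil$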

In [22]:
def decimal_scaler(x):
    return (x/10 ** np.ceil(np.log10(x.abs().max())))

In [31]:
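data['decimal_amounts'] = decimal_scaler(data['amounts'])
print(data[['amounts','decimal_amounts']])
# Largest and smallest scaled values
print(data['decimal_amounts'].max(), data['decimal_amounts'].min())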

      amounts  decimal_amounts
0          49            0.049
1          48            0.048
2          30            0.030
3          25            0.025
4          13            0.013
...       ...              ...
2774       10            0.010
2775       40            0.040
2776       13            0.013
2777       30            0.030
2778       33            0.033

[2779 rows x 2 columns]
0.178 0.001
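
To make the choice of k concrete: the largest amount is 178, and log10(178) is about 2.25, so k = 3 and every value is divided by 1000, which is why 178 becomes 0.178. The sketch below repeats the calculation on made-up toy values that include a negative number, to show why the absolute value is taken. It assumes the column has at least one non-zero value; an all-zero column is not handled.

```python
import numpy as np
import pandas as pd

def decimal_scaler(x):
    # k = ceil(log10(max |x|)), then shift the decimal point by k places
    k = np.ceil(np.log10(x.abs().max()))
    return x / 10 ** k

x = pd.Series([-250.0, 49.0, 178.0])  # toy values, for illustration

k = int(np.ceil(np.log10(x.abs().max())))  # log10(250) is about 2.40
scaled = decimal_scaler(x)

print(k)
print(scaled.tolist())
```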


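### Summary
Min-max normalization, standard-deviation (z-score) standardization, and decimal scaling each have their strengths. Min-max normalization is simple and easy to understand, but extreme values at either end, very large or very small, distort the result. Z-score standardization is less sensitive to the shape of the data distribution. Decimal scaling is broadly applicable and likewise less sensitive to the data distribution; compared with the first two methods, its suitability is moderate.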
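
The outlier sensitivity of min-max normalization can be seen on a small made-up series: one very large value stretches the denominator, so all ordinary values are squeezed into a narrow band near 0. This is only an illustrative sketch with invented numbers.

```python
import pandas as pd

def min_max_scaler(x):
    # (x - min) / (max - min), as defined in section 2.3.1
    return (x - x.min()) / (x.max() - x.min())

# Four ordinary values and one extreme value (invented for illustration)
x = pd.Series([10.0, 20.0, 30.0, 40.0, 1000.0])
scaled = min_max_scaler(x)

print(scaled.round(4).tolist())
```

The four ordinary values all end up below 0.05, so most of the [0, 1] range is spent on the gap between them and the outlier.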

2.4 Transforming Data

2.4.1 Dummy-variable encoding of categorical data

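A good share of the algorithms used in data analysis require numeric input features, but real data also contain many categorical features. These must be converted to dummy variables before they can be fed into a model. pandas provides get_dummies for this, with the following signature:
pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False)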

In [33]:
pd.get_dummies(data[['amounts','dishes_name']])

Unnamed: 0,amounts,dishes_name_ 42度海之蓝,dishes_name_ 北冰洋汽水,dishes_name_38度剑南春,dishes_name_50度古井贡酒,dishes_name_52度泸州老窖,dishes_name_53度茅台,dishes_name_一品香酥藕,dishes_name_三丝鳝鱼,dishes_name_三色凉拌手撕兔,...,dishes_name_香辣腐乳炒虾,dishes_name_香酥两吃大虾,dishes_name_鱼香肉丝拌面,dishes_name_鲜美鳝鱼,dishes_name_鸡蛋、肉末肠粉,dishes_name_麻辣小龙虾,dishes_name_黄尾袋鼠西拉子红葡萄酒,dishes_name_黄油曲奇饼干,dishes_name_黄花菜炒木耳,dishes_name_黑米恋上葡萄
0,49,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,48,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,30,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,25,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,13,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2774,10,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2775,40,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2776,13,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2777,30,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [34]:
data['dishes_name'].value_counts()

白饭/大碗          92
凉拌菠菜           77
谷稻小庄           72
麻辣小龙虾          65
白饭/小碗          60
               ..
百里香奶油烤紅酒牛肉      1
冰镇花螺            1
红酒土豆烧鸭腿\r\n     1
五香酱驴肉\r\n       1
照烧鸡腿\r\n        1
Name: dishes_name, Length: 154, dtype: int64
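
Some dish names in the output above carry leading spaces or trailing \r\n characters. Since get_dummies creates one column per distinct string, a name that differs only by such whitespace would get its own column. The sketch below, on a small made-up frame that borrows a few dish names from the data, strips the whitespace before encoding. Whether the real file contains such near-duplicates is an assumption that would need checking.

```python
import pandas as pd

# Toy frame; the second and third names differ only by trailing "\r\n"
toy = pd.DataFrame({
    "amounts": [49, 48, 48, 13],
    "dishes_name": ["蒜蓉生蚝", "蒙古烤羊腿\r\n\r\n\r\n", "蒙古烤羊腿", " 蒜香包"],
})

# Encoding the raw strings yields four dummy columns
raw = pd.get_dummies(toy[["amounts", "dishes_name"]])

# Strip surrounding whitespace first, then encode as 0/1 integers
cleaned = toy.assign(dishes_name=toy["dishes_name"].str.strip())
encoded = pd.get_dummies(cleaned[["amounts", "dishes_name"]], dtype=int)

print(raw.shape, encoded.shape)
print(encoded.columns.tolist())
```

Numeric columns such as amounts pass through unchanged; only the string column is expanded.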

2.4.2 Discretizing continuous variables

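(1) Equal-width binning
Equal-width binning splits the value range into intervals of equal width. The number of intervals is determined by the characteristics of the data or specified by the user, much like building a frequency distribution table.
pandas provides the cut function for equal-width discretization of continuous data; its basic syntax is:
pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False)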

In [42]:
pd.cut(data['amounts'],bins = 5)
# pd.cut(data['amounts'],bins = 5,right=False)
# pd.cut(data['amounts'],5,labels=list('abcde'))    # labels names each bin after discretization; right sets whether the right edge is closed (default True)

0        (36.4, 71.8]
1        (36.4, 71.8]
2       (0.823, 36.4]
3       (0.823, 36.4]
4       (0.823, 36.4]
            ...      
2774    (0.823, 36.4]
2775     (36.4, 71.8]
2776    (0.823, 36.4]
2777    (0.823, 36.4]
2778    (0.823, 36.4]
Name: amounts, Length: 2779, dtype: category
Categories (5, interval[float64]): [(0.823, 36.4] < (36.4, 71.8] < (71.8, 107.2] < (107.2, 142.6] < (142.6, 178.0]]

In [43]:
# Count the dishes in each bin
pd.cut(data['amounts'],5).value_counts()

(0.823, 36.4]     1488
(36.4, 71.8]       885
(71.8, 107.2]      233
(142.6, 178.0]     130
(107.2, 142.6]      43
Name: amounts, dtype: int64
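
The bin edges above can be reproduced by hand: the range from 1 to 178 is divided into five intervals of width (178 - 1) / 5 = 35.4. When bins is an integer, pandas also moves the lowest edge down by 0.1% of the range so that the minimum falls inside the first interval, which is where 0.823 = 1 - 0.177 comes from. A minimal sketch on made-up values that share the same minimum and maximum:

```python
import numpy as np
import pandas as pd

# Toy values with the same min (1) and max (178) as the amounts column
x = pd.Series([1, 13, 25, 49, 178])

# Five equal-width bins need six evenly spaced edges
edges = np.linspace(x.min(), x.max(), 6)
print(edges)

# pd.cut lowers the first edge slightly so that the minimum is included
binned = pd.cut(x, bins=5)
print(binned.cat.categories)
```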

In [51]:
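# Convert amounts into a categorical feature using these rules:
# (<=20: 特价菜, discount dish), (<=100: 常规菜, regular dish), (>100: 特色菜, specialty dish)
# The new column 菜品类别 means "dish category"
data['菜品类别'] = pd.cut(data['amounts'],bins=[0,20,100,data['amounts'].max()],labels=['特价菜','常规菜','特色菜'])
display(data[['amounts','菜品类别']])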

Unnamed: 0,amounts,菜品类别
0,49,常规菜
1,48,常规菜
2,30,常规菜
3,25,常规菜
4,13,特价菜
...,...,...
2774,10,特价菜
2775,40,常规菜
2776,13,特价菜
2777,30,常规菜


In [52]:
# Count the orders in each dish category
data['菜品类别'].value_counts()

常规菜    1920
特价菜     686
特色菜     173
Name: 菜品类别, dtype: int64

In [54]:
# The same counts via groupby and aggregation
data.groupby('菜品类别')['菜品类别'].count()
# data.groupby('菜品类别')['菜品类别'].agg('count')

菜品类别
特价菜     686
常规菜    1920
特色菜     173
Name: 菜品类别, dtype: int64

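The drawback of equal-width binning is that it depends heavily on the data distribution: if the data are unevenly distributed, the bin counts become very uneven too, with some intervals holding many values and others very few, which can seriously degrade the resulting model.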

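(2) Equal-frequency binning
Equal-frequency binning places roughly the same number of values in each interval. Compared with equal-width binning it avoids unevenly populated bins, but to meet the per-bin count it may assign two values that are very close together to different intervals.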

In [69]:
#https://blog.csdn.net/Accelerating/article/details/116048021
data.quantile?

In [73]:
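def SameRateCut(data,k):
    # Bin edges at the 0, 1/k, 2/k, ..., 1 quantiles
    w = data.quantile(np.linspace(0,1,k+1))
    # Note: with the default right=True the first bin is open on the left,
    # so rows equal to the minimum stay NaN unless include_lowest=True is passed
    data = pd.cut(data,w)
    return data

result = SameRateCut(data['amounts'],5)
result.value_counts()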

(39.0, 58.0]     580
(18.0, 32.0]     564
(58.0, 178.0]    530
(1.0, 18.0]      530
(32.0, 39.0]     515
Name: amounts, dtype: int64
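
pandas also ships a built-in for quantile-based binning, pd.qcut, which can stand in for a hand-written helper. Unlike pd.cut with explicit edges and default arguments, qcut includes the minimum value in the first bin. The sketch below compares equal-width and equal-frequency binning on a made-up skewed series; it is an illustration, not a rerun of the cell above.

```python
import pandas as pd

# Skewed toy data: nine small values and one large one
x = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Equal width: almost everything lands in the first bin
equal_width = pd.cut(x, bins=5).value_counts().sort_index()

# Equal frequency: qcut places the edges at the 20% quantiles
equal_freq = pd.qcut(x, q=5).value_counts().sort_index()

print(equal_width)
print(equal_freq)
```

When many values are identical, several quantiles can coincide and qcut raises an error about duplicate edges; its duplicates='drop' option merges such bins at the cost of fewer, less even groups.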

(3) Cluster-analysis method