### Quantile and Bucket Analysis

In [1]:
import numpy as np
np.random.seed(42)
import pandas as pd

In [2]:
frame = pd.DataFrame(np.random.randn(1000), columns=['data'])
frame.head()

,data
0,0.496714
1,-0.138264
2,0.647689
3,1.52303
4,-0.234153


In [3]:
# Split the range of the data into 4 equal-width bins (not true quartiles, despite the name)
quartiles = pd.cut(frame.data, 4)
quartiles

0       (0.306, 2.079]
1      (-1.468, 0.306]
2       (0.306, 2.079]
3       (0.306, 2.079]
4      (-1.468, 0.306]
            ...       
995    (-1.468, 0.306]
996     (0.306, 2.079]
997     (0.306, 2.079]
998    (-1.468, 0.306]
999     (0.306, 2.079]
Name: data, Length: 1000, dtype: category
Categories (4, interval[float64]): [(-3.248, -1.468] < (-1.468, 0.306] < (0.306, 2.079] < (2.079, 3.853]]
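As a quick check on what `cut` did here, the following minimal sketch rebuilds the same frame and asks `cut` for its bin edges through `retbins=True`. The names `binned`, `edges`, `widths`, and `counts` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

np.random.seed(42)
frame = pd.DataFrame(np.random.randn(1000), columns=["data"])

# retbins=True also returns the bin edges that cut computed
binned, edges = pd.cut(frame["data"], 4, retbins=True)

widths = np.diff(edges)                    # width of each of the 4 bins
counts = binned.value_counts(sort=False)   # observations per bin, in bin order

print(widths)
print(counts)
```

The widths should come out nearly identical (the lowest bin can be marginally wider, because `cut` extends the range slightly so that the minimum is included), while the counts differ a lot from bin to bin. That is the sense in which these buckets are equal-length rather than equal-size.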

The categorical object returned by `cut` can be passed directly to `groupby`:

In [4]:
def get_stats(group):
    return {'min': group.min(), 'max': group.max(), 'count': group.count(), 'mean': group.mean()}


grouped = frame.data.groupby(quartiles)
grouped.apply(get_stats).unstack()

data,min,max,count,mean
"(-3.248, -1.468]",-3.241267,-1.478522,60.0,-1.856827
"(-1.468, 0.306]",-1.463515,0.301547,562.0,-0.442008
"(0.306, 2.079]",0.3073,2.075401,357.0,0.91812
"(2.079, 3.853]",2.092387,3.852731,21.0,2.446729

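The same per-bucket summary can also be sketched with `agg` and a list of aggregation names, which avoids relying on how `apply` handles a dict return value and on the follow-up `unstack`. This is an alternative under the assumption of a reasonably recent pandas, not the notebook's own method; the names `bins` and `stats` are illustrative.

```python
import numpy as np
import pandas as pd

np.random.seed(42)
frame = pd.DataFrame(np.random.randn(1000), columns=["data"])
bins = pd.cut(frame["data"], 4)

# observed=False keeps every bin in the result, even one with no rows
stats = frame["data"].groupby(bins, observed=False).agg(["min", "max", "count", "mean"])
print(stats)
```

One visible difference is that `count` stays an integer column here, whereas the dict-and-`unstack` route shows it as a float because all four statistics pass through a single Series first.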

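These are equal-length buckets. To get equal-size buckets based on sample quantiles, use `qcut` instead. Passing `labels=False` returns only the quantile numbers (integer bucket codes) rather than interval labels: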

In [5]:
grouping = pd.qcut(frame.data, 10, labels=False)
grouped = frame.data.groupby(grouping)
grouped.apply(get_stats).unstack()

data,min,max,count,mean
0,-3.241267,-1.245739,100.0,-1.656631
1,-1.244655,-0.808298,100.0,-0.997777
2,-0.802277,-0.52452,100.0,-0.651002
3,-0.52286,-0.241236,100.0,-0.391406
4,-0.240325,0.02451,100.0,-0.102573
5,0.026091,0.248221,100.0,0.140939
6,0.249384,0.513786,100.0,0.374631
7,0.514439,0.81351,100.0,0.657138
8,0.813517,1.305479,100.0,1.023412
9,1.307143,3.852731,100.0,1.796588


In [6]:
grouping

0      6
1      4
2      7
3      9
4      4
      ..
995    3
996    9
997    7
998    2
999    7
Name: data, Length: 1000, dtype: int64
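To make the link between `qcut` and the sample quantiles explicit, this minimal sketch retrieves the bucket edges with `retbins=True` and compares them with the deciles computed by `Series.quantile`. The names `codes`, `edges`, `per_bucket`, `deciles`, and `edges_match` are illustrative and do not appear in the original notebook.

```python
import numpy as np
import pandas as pd

np.random.seed(42)
frame = pd.DataFrame(np.random.randn(1000), columns=["data"])

# labels=False returns integer bucket codes 0..9 instead of interval labels
codes, edges = pd.qcut(frame["data"], 10, labels=False, retbins=True)

per_bucket = codes.value_counts().sort_index()   # observations per bucket

# The edges should coincide with the sample deciles of the data
deciles = frame["data"].quantile(np.linspace(0, 1, 11)).to_numpy()
edges_match = np.allclose(edges, deciles)

print(per_bucket)
print(edges_match)
```

With 1000 distinct values and 10 buckets, each bucket should hold the same number of observations, which is the equal-size property that `cut` does not provide.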