# Quantile and Bucket Analysis

Pandas provides tools, in particular cut and qcut, for slicing data into buckets, either with bins of your choosing or by sample quantiles. Combining these functions with groupby makes it simple to perform bucket or quantile analysis on a data set. Consider a simple random data set and an equal-length bucket categorization using cut:

In [1]:
import pandas as pd
import numpy as np
from pandas import DataFrame, Series

In [5]:
frame = DataFrame({'data1': np.random.randn(1000),
                'data2': np.arange(1000)})

In [9]:
factor = pd.cut(frame.data1, bins = 4)

In [11]:
help(pd.cut)

# cut is also useful for converting a continuous variable into a categorical one

Help on function cut in module pandas.core.reshape.tile:

cut(x, bins, right: 'bool' = True, labels=None, retbins: 'bool' = False, precision: 'int' = 3, include_lowest: 'bool' = False, duplicates: 'str' = 'raise', ordered: 'bool' = True)
    Bin values into discrete intervals.
    
    Use `cut` when you need to segment and sort data values into bins. This
    function is also useful for going from a continuous variable to a
    categorical variable. For example, `cut` could convert ages to groups of
    age ranges. Supports binning into an equal number of bins, or a
    pre-specified array of bins.
    
    Parameters
    ----------
    x : array-like
        The input array to be binned. Must be 1-dimensional.
    bins : int, sequence of scalars, or IntervalIndex
        The criteria to bin by.
    
        * int : Defines the number of equal-width bins in the range of `x`. The
          range of `x` is extended by .1% on each side to include the minimum
          and maximum values 

In [35]:
factor.head(10)

0    (-1.457, 0.0217]
1       (1.501, 2.98]
2     (0.0217, 1.501]
3    (-1.457, 0.0217]
4       (1.501, 2.98]
5    (-1.457, 0.0217]
6    (-1.457, 0.0217]
7    (-2.943, -1.457]
8     (0.0217, 1.501]
9    (-1.457, 0.0217]
Name: data1, dtype: category
Categories (4, interval[float64, right]): [(-2.943, -1.457] < (-1.457, 0.0217] < (0.0217, 1.501] < (1.501, 2.98]]
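The help text above notes that bins can also be a pre-specified sequence of edges rather than an integer. A minimal sketch of that form, following the age-range example the docstring mentions; the ages, edges, and label names here are hypothetical and chosen only for illustration:

```python
import pandas as pd

# Hypothetical ages, used only for illustration.
ages = pd.Series([5, 17, 18, 30, 45, 61, 80])

# Explicit bin edges. Intervals are right-closed by default:
# (0, 18], (18, 35], (35, 60], (60, 100]
edges = [0, 18, 35, 60, 100]
labels = ['child', 'young adult', 'adult', 'senior']  # hypothetical names

age_groups = pd.cut(ages, bins=edges, labels=labels)
print(age_groups.value_counts(sort=False))
```

Because right=True is the default, a value that falls exactly on an edge (18 here) goes into the lower bucket.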

The categorical object returned by cut can be passed directly to groupby, so we can compute a set of statistics for the data2 column within each data1 bucket like so:

In [36]:
def stats(group):
    # Summary statistics for the values that fall in one bucket
    return {'min': group.min(), 'max': group.max(),
            'count': group.count(), 'mean': group.mean(),
            'median': group.median()}

In [37]:
grouped = frame.data2.groupby(factor)

In [42]:
operation = grouped.apply(stats).unstack()

In [43]:
operation

                  min    max  count        mean  median
data1
(-2.943, -1.457]  7.0  990.0   54.0  518.870370   509.0
(-1.457, 0.0217]  0.0  999.0  447.0  489.221477   483.0
(0.0217, 1.501]   2.0  995.0  438.0  510.246575   511.5
(1.501, 2.98]     1.0  987.0   61.0  480.508197   484.0

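The same kind of per-bucket summary can likely be produced without a helper function by passing a list of aggregation names to agg. This is a sketch rather than the approach used in the cells above; the seed and the variable name summary are assumptions, so the numbers will differ from the table shown.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility
frame = pd.DataFrame({'data1': rng.standard_normal(1000),
                      'data2': np.arange(1000)})

# Four equal-width buckets over the range of data1
factor = pd.cut(frame['data1'], bins=4)

# observed=False keeps every bucket in the result, even an empty one
summary = frame['data2'].groupby(factor, observed=False).agg(
    ['min', 'max', 'count', 'mean', 'median'])
print(summary)
```

This yields one row per bucket and one column per statistic directly, so no unstack step is needed.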

These were equal-length buckets; to compute equal-size buckets based on sample quantiles, use qcut. I’ll pass labels=False to just get quantile numbers.

In [56]:
# Return quantile numbers (0 through 9) instead of interval labels
grouping = pd.qcut(frame.data1, 10, labels = False)

In [61]:
# Regroup data2 by the decile numbers; the earlier `grouped` still refers to the cut buckets
grouped = frame.data2.groupby(grouping)
grouped.apply(stats).unstack()

The result has one row per decile, labeled 0 through 9. Because qcut produces equal-size buckets, each decile contains 100 of the 1,000 observations.

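To make the difference between equal-length and equal-size buckets concrete, here is a small sketch comparing bucket counts from cut and qcut on the same random sample. The seed and variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen arbitrarily
data = pd.Series(rng.standard_normal(1000))

# Equal-width buckets: counts vary, with most values near the center
width_counts = pd.cut(data, bins=4).value_counts(sort=False)

# Equal-size buckets: edges are sample quartiles, so counts match
size_counts = pd.qcut(data, q=4).value_counts(sort=False)

print(width_counts)
print(size_counts)
```

With cut the middle buckets hold most of the observations, while with qcut each of the four buckets holds a quarter of the sample.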