# Discretization and Binning

Continuous data is often discretized or otherwise separated into “bins” for analysis. Suppose you have data about a group of people in a study, and you want to group them into discrete age buckets:

In [1]:
import pandas as pd
import numpy as np
from pandas import DataFrame, Series

In [2]:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

Let’s divide these into bins of 18 to 25, 26 to 35, 36 to 60, and finally 61 and older. To do so, use cut, a function in pandas:

In [3]:
bins = [18, 25, 35, 60, 100]

In [4]:
cats = pd.cut(ages, bins)

cats

[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64, right]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

In [5]:
cats.value_counts()

(18, 25]     5
(25, 35]     3
(35, 60]     3
(60, 100]    1
dtype: int64
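
The object returned by cut is a pandas Categorical, so the bin assignment can also be read off directly. The following is a minimal sketch, reusing the ages and bins defined above, that inspects its codes and categories attributes:

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)

# Integer position of each age's bin within cats.categories
codes = cats.codes

# The bins themselves, stored as an IntervalIndex
categories = cats.categories

print(codes)
print(categories)
```

Each code is the position of the matching interval in categories, which is why the counts reported by value_counts line up with the four bins.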

Consistent with mathematical notation for intervals, a parenthesis means that the side is open (exclusive), while a square bracket means it is closed (inclusive). Which side is closed can be changed by passing right=False:

In [6]:
pd.cut(ages, [18, 26, 36, 61, 100], right=False)

[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, interval[int64, left]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]
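
To see what the closed side changes in practice, it helps to bin values that sit exactly on the bin edges. This sketch uses a small, made-up list of edge values (edge_ages is an illustrative name, not part of the data above) and compares the default with right=False:

```python
import pandas as pd

bins = [18, 25, 35, 60, 100]
edge_ages = [18, 25, 35, 60]  # hypothetical values lying exactly on bin edges

right_closed = pd.cut(edge_ages, bins)              # default: right=True
left_closed = pd.cut(edge_ages, bins, right=False)  # left edge inclusive

# A code of -1 marks a value that fell into no bin (shown as NaN)
print(right_closed.codes)
print(left_closed.codes)
```

With the default, 18 lies on the open side of the first bin and is left unassigned, while 25 belongs to the first bin. With right=False, every edge value moves into the bin that starts at it.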

You can also supply your own bin names by passing a list or array to the labels option:

In [7]:
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']

In [8]:
pd.cut(ages, bins, labels=group_names)

['Youth', 'Youth', 'Youth', 'YoungAdult', 'Youth', ..., 'YoungAdult', 'Senior', 'MiddleAged', 'MiddleAged', 'YoungAdult']
Length: 12
Categories (4, object): ['Youth' < 'YoungAdult' < 'MiddleAged' < 'Senior']
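
Named bins are often most useful as a new column in a table. As a sketch, assuming the ages live in a DataFrame column called age (the DataFrame and the age_group column name are illustrative assumptions), the labels can be attached and counted like this:

```python
import pandas as pd

df = pd.DataFrame({"age": [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]})
bins = [18, 25, 35, 60, 100]
group_names = ["Youth", "YoungAdult", "MiddleAged", "Senior"]

# cut accepts a Series and returns a Series of category dtype
df["age_group"] = pd.cut(df["age"], bins, labels=group_names)

# Count how many people fall into each named group
counts = df["age_group"].value_counts()
print(counts)
```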

If you pass cut an integer number of bins instead of explicit bin edges, it will compute equal-length bins based on the minimum and maximum values in the data. Consider the case of some uniformly distributed data chopped into fourths:

In [9]:
data = np.random.rand(20)

data

array([0.14553824, 0.26160565, 0.4774652 , 0.65485408, 0.06597414,
       0.22830944, 0.98807244, 0.32894473, 0.19533165, 0.41152602,
       0.14073897, 0.13427171, 0.43359351, 0.78083827, 0.9024486 ,
       0.16895479, 0.46754164, 0.648236  , 0.74745253, 0.68364899])

In [10]:
pd.cut(data, 4, precision=1)

[(0.07, 0.3], (0.07, 0.3], (0.3, 0.5], (0.5, 0.8], (0.07, 0.3], ..., (0.07, 0.3], (0.3, 0.5], (0.5, 0.8], (0.5, 0.8], (0.5, 0.8]]
Length: 20
Categories (4, interval[float64, right]): [(0.07, 0.3] < (0.3, 0.5] < (0.5, 0.8] < (0.8, 1.0]]

The precision option controls the precision at which the bin labels are stored and displayed. The same call works on integer data:

In [14]:
arr = np.arange(20)

pd.cut(arr, 2, precision=1)

[(-0.02, 9.5], (-0.02, 9.5], (-0.02, 9.5], (-0.02, 9.5], (-0.02, 9.5], ..., (9.5, 19.0], (9.5, 19.0], (9.5, 19.0], (9.5, 19.0], (9.5, 19.0]]
Length: 20
Categories (2, interval[float64, right]): [(-0.02, 9.5] < (9.5, 19.0]]
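
The negative lower edge in the output above may look surprising for data that starts at 0. One way to inspect the edges cut actually computed is the retbins=True option, which returns them alongside the categories. A minimal sketch:

```python
import numpy as np
import pandas as pd

arr = np.arange(20)

# retbins=True returns the computed bin edges as a second value
cats, edges = pd.cut(arr, 2, retbins=True)

print(edges)
```

The interior and upper edges are evenly spaced between the minimum and maximum, while the lowest edge sits slightly below the minimum so that the smallest value still falls inside the first right-closed bin.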

A closely related function, qcut, bins the data based on sample quantiles. Depending on the distribution of the data, using cut will not usually result in each bin having the same number of data points. Since qcut uses sample quantiles instead, by definition you will obtain roughly equal-size bins:

In [20]:
data = np.random.randn(1000) # Normally distributed

In [21]:
cats = pd.qcut(data, 4) # Cut into quartiles

In [23]:
cats

[(-3.255, -0.683], (-3.255, -0.683], (0.689, 3.048], (-0.683, 0.0145], (0.0145, 0.689], ..., (0.689, 3.048], (0.0145, 0.689], (0.0145, 0.689], (-0.683, 0.0145], (-3.255, -0.683]]
Length: 1000
Categories (4, interval[float64, right]): [(-3.255, -0.683] < (-0.683, 0.0145] < (0.0145, 0.689] < (0.689, 3.048]]

In [24]:
cats.value_counts()

(-3.255, -0.683]    250
(-0.683, 0.0145]    250
(0.0145, 0.689]     250
(0.689, 3.048]      250
dtype: int64
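
The difference between the two functions is easiest to see side by side. This sketch bins the same normally distributed sample with both cut and qcut; the seed is an arbitrary choice made only so the run is repeatable:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed for repeatability
data = rng.standard_normal(1000)

# Equal-width bins: counts follow the shape of the distribution
equal_width = pd.cut(data, 4).value_counts()

# Quantile-based bins: counts are (roughly) equal by construction
equal_count = pd.qcut(data, 4).value_counts()

print(equal_width)
print(equal_count)
```

With cut, most points land in the two central bins and few in the tails. With qcut, the bin edges move so that each bin holds a quarter of the sample.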

Similar to cut, you can pass your own quantiles (numbers between 0 and 1, inclusive):

In [25]:
pd.qcut(data, [0, 0.1, 0.5, 0.9, 1])

[(-3.255, -1.239], (-1.239, 0.0145], (1.308, 3.048], (-1.239, 0.0145], (0.0145, 1.308], ..., (0.0145, 1.308], (0.0145, 1.308], (0.0145, 1.308], (-1.239, 0.0145], (-1.239, 0.0145]]
Length: 1000
Categories (4, interval[float64, right]): [(-3.255, -1.239] < (-1.239, 0.0145] < (0.0145, 1.308] < (1.308, 3.048]]
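
Counting the members of each bin confirms that custom quantiles split the sample in the requested proportions. A small sketch, again with an arbitrarily seeded sample of 1000 distinct values:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed for repeatability
data = rng.standard_normal(1000)

# Bottom 10%, next 40%, next 40%, top 10%
quantile_cats = pd.qcut(data, [0, 0.1, 0.5, 0.9, 1])
quantile_counts = quantile_cats.value_counts()

print(quantile_counts)
```

The outer bins each hold about a tenth of the points and the inner bins about four tenths each, matching the gaps between consecutive quantiles.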

In [26]:
pd.Series(data).value_counts()

-1.274335    1
 0.848151    1
-0.862615    1
-0.992831    1
 0.245990    1
            ..
 0.458109    1
-0.141726    1
-0.397333    1
 0.255278    1
-0.736576    1
Length: 1000, dtype: int64