# Pandas `qcut()`

This is the companion notebook for the Medium article [All Pandas qcut() you should know for binning numerical data based on sample quantiles](https://bindichen.medium.com/all-pandas-qcut-you-should-know-for-binning-numerical-data-based-on-sample-quantiles-c8b13a8ed844).

Please check out the article for instructions.

**License**: [BSD 2-Clause](https://opensource.org/licenses/BSD-2-Clause)


In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.DataFrame({'age': [2, 67, 40, 32, 4, 15, 82, 99, 26, 30, 50, 78]})
df

Unnamed: 0,age
0,2
1,67
2,40
3,32
4,4
5,15
6,82
7,99
8,26
9,30
10,50
11,78


In [3]:
df.describe()

Unnamed: 0,age
count,12.0
mean,43.75
std,31.729324
min,2.0
25%,23.25
50%,36.0
75%,69.75
max,99.0


## 1. Discretizing into equal-sized buckets

In [4]:
df = pd.DataFrame({'age': [2, 67, 40, 32, 4, 15, 82, 99, 26, 30, 50, 78]})

df['age_group'] = pd.qcut(df['age'], 3)

In [5]:
df

Unnamed: 0,age,age_group
0,2,"(1.999, 28.667]"
1,67,"(55.667, 99.0]"
2,40,"(28.667, 55.667]"
3,32,"(28.667, 55.667]"
4,4,"(1.999, 28.667]"
5,15,"(1.999, 28.667]"
6,82,"(55.667, 99.0]"
7,99,"(55.667, 99.0]"
8,26,"(1.999, 28.667]"
9,30,"(28.667, 55.667]"
10,50,"(28.667, 55.667]"
11,78,"(55.667, 99.0]"


In [6]:
df['age_group'].value_counts()

(1.999, 28.667]     4
(28.667, 55.667]    4
(55.667, 99.0]      4
Name: age_group, dtype: int64

In [7]:
# Take a look at the result
df['age_group']

0      (1.999, 28.667]
1       (55.667, 99.0]
2     (28.667, 55.667]
3     (28.667, 55.667]
4      (1.999, 28.667]
5      (1.999, 28.667]
6       (55.667, 99.0]
7       (55.667, 99.0]
8      (1.999, 28.667]
9     (28.667, 55.667]
10    (28.667, 55.667]
11      (55.667, 99.0]
Name: age_group, dtype: category
Categories (3, interval[float64]): [(1.999, 28.667] < (28.667, 55.667] < (55.667, 99.0]]

In [8]:
df.sort_values('age_group')

Unnamed: 0,age,age_group
0,2,"(1.999, 28.667]"
4,4,"(1.999, 28.667]"
5,15,"(1.999, 28.667]"
8,26,"(1.999, 28.667]"
2,40,"(28.667, 55.667]"
3,32,"(28.667, 55.667]"
9,30,"(28.667, 55.667]"
10,50,"(28.667, 55.667]"
1,67,"(55.667, 99.0]"
6,82,"(55.667, 99.0]"
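
As a cross-check on the three-bucket split above, the bin edges chosen by `qcut()` should coincide with the sample quantiles of the column. The following is a minimal sketch, assuming the same `age` values as the notebook; the names `edges`, `groups` and `right_edges` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

age = pd.Series([2, 67, 40, 32, 4, 15, 82, 99, 26, 30, 50, 78], name='age')

# Sample quantiles at 0, 1/3, 2/3 and 1 (linear interpolation is the default)
edges = age.quantile([0, 1/3, 2/3, 1]).to_numpy()

# Bucket the same values into 3 quantile-based bins
groups = pd.qcut(age, 3)

# Right edge of each interval, as stored in the categories (rounded for display)
right_edges = groups.cat.categories.right.to_numpy()

print(edges.round(3))
print(right_edges)
```

The right edges match the quantiles up to the default rounding of 3 decimal places, which is why each bucket ends up with the same number of rows.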


## 2. Discretizing into buckets with a list of quantiles

In [9]:
df['age_group'] = pd.qcut(df['age'], [0, .1, .5, 1])
df.sort_values('age_group')

Unnamed: 0,age,age_group
0,2,"(1.999, 5.1]"
4,4,"(1.999, 5.1]"
3,32,"(5.1, 36.0]"
5,15,"(5.1, 36.0]"
8,26,"(5.1, 36.0]"
9,30,"(5.1, 36.0]"
1,67,"(36.0, 99.0]"
2,40,"(36.0, 99.0]"
6,82,"(36.0, 99.0]"
7,99,"(36.0, 99.0]"


In [10]:
df['age_group'].value_counts()

(36.0, 99.0]    6
(5.1, 36.0]     4
(1.999, 5.1]    2
Name: age_group, dtype: int64

In [11]:
# Same as pd.qcut(df['age'], 4)
df['age_group'] = pd.qcut(df['age'], [0, .25, .5, .75, 1])

In [12]:
df

Unnamed: 0,age,age_group
0,2,"(1.999, 23.25]"
1,67,"(36.0, 69.75]"
2,40,"(36.0, 69.75]"
3,32,"(23.25, 36.0]"
4,4,"(1.999, 23.25]"
5,15,"(1.999, 23.25]"
6,82,"(69.75, 99.0]"
7,99,"(69.75, 99.0]"
8,26,"(23.25, 36.0]"
9,30,"(23.25, 36.0]"
10,50,"(36.0, 69.75]"
11,78,"(69.75, 99.0]"


In [13]:
df['age_group']

0     (1.999, 23.25]
1      (36.0, 69.75]
2      (36.0, 69.75]
3      (23.25, 36.0]
4     (1.999, 23.25]
5     (1.999, 23.25]
6      (69.75, 99.0]
7      (69.75, 99.0]
8      (23.25, 36.0]
9      (23.25, 36.0]
10     (36.0, 69.75]
11     (69.75, 99.0]
Name: age_group, dtype: category
Categories (4, interval[float64]): [(1.999, 23.25] < (23.25, 36.0] < (36.0, 69.75] < (69.75, 99.0]]

In [14]:
df['age_group'].value_counts()

(1.999, 23.25]    3
(23.25, 36.0]     3
(36.0, 69.75]     3
(69.75, 99.0]     3
Name: age_group, dtype: int64

In [15]:
df.describe()

Unnamed: 0,age
count,12.0
mean,43.75
std,31.729324
min,2.0
25%,23.25
50%,36.0
75%,69.75
max,99.0
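
The comment in cell 11 states that the quartile list is the same as passing `4`. A short sketch of how that could be verified, assuming the same `age` values; `by_count`, `by_list`, `same` and `quartiles` are illustrative names only.

```python
import pandas as pd

age = pd.Series([2, 67, 40, 32, 4, 15, 82, 99, 26, 30, 50, 78], name='age')

by_count = pd.qcut(age, 4)                      # q as a number of quantiles
by_list = pd.qcut(age, [0, .25, .5, .75, 1])    # q as explicit quantile points

# Compare both the bin assignment (codes) and the bins themselves (categories)
same = (
    by_count.cat.codes.equals(by_list.cat.codes)
    and by_count.cat.categories.equals(by_list.cat.categories)
)
print(same)

# The inner edges are the 25%, 50% and 75% rows reported by describe()
quartiles = age.quantile([.25, .5, .75]).tolist()
print(quartiles)
```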


## 3. Adding custom labels

In [16]:
labels = ['Millennial', 'Gen X', 'Boomer', 'Greatest']
df['age_group'] = pd.qcut(df['age'], [0, .1, .3, .6, 1], labels=labels)

In [17]:
df

Unnamed: 0,age,age_group
0,2,Millennial
1,67,Greatest
2,40,Boomer
3,32,Boomer
4,4,Millennial
5,15,Gen X
6,82,Greatest
7,99,Greatest
8,26,Gen X
9,30,Boomer
10,50,Greatest
11,78,Greatest


In [18]:
df['age_group']

0     Millennial
1       Greatest
2         Boomer
3         Boomer
4     Millennial
5          Gen X
6       Greatest
7       Greatest
8          Gen X
9         Boomer
10      Greatest
11      Greatest
Name: age_group, dtype: category
Categories (4, object): ['Millennial' < 'Gen X' < 'Boomer' < 'Greatest']

In [19]:
df.sort_values('age_group')

Unnamed: 0,age,age_group
0,2,Millennial
4,4,Millennial
5,15,Gen X
8,26,Gen X
2,40,Boomer
3,32,Boomer
9,30,Boomer
1,67,Greatest
6,82,Greatest
7,99,Greatest


In [20]:
df['age_group'].value_counts().sort_index()

Millennial    2
Gen X         2
Boomer        3
Greatest      5
Name: age_group, dtype: int64
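
Two related behaviours of the `labels` argument may be worth knowing: passing `labels=False` returns integer bin positions instead of names, and the number of labels has to match the number of bins. The sketch below assumes the same `age` values and quantile list as the cells above; `codes` and `mismatch_raises` are illustrative names.

```python
import pandas as pd

age = pd.Series([2, 67, 40, 32, 4, 15, 82, 99, 26, 30, 50, 78], name='age')
quantiles = [0, .1, .3, .6, 1]

# labels=False returns the integer position of each bin instead of a label
codes = pd.qcut(age, quantiles, labels=False)
print(codes.tolist())

# Four bins need exactly four labels; passing three should raise ValueError
try:
    pd.qcut(age, quantiles, labels=['Millennial', 'Gen X', 'Boomer'])
    mismatch_raises = False
except ValueError:
    mismatch_raises = True
print(mismatch_raises)
```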

## 4. Returning bins with retbins=True

In [21]:
# retbins=True is most useful when q is a single integer,
# because the bin edges are then not known in advance
result, bins = pd.qcut(
    df['age'],
    5,                  # q as a single integer: the number of quantiles
    retbins=True
)

In [22]:
bins

array([ 2. , 17.2, 30.8, 46. , 75.8, 99. ])
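
One common reason to keep the returned edges is to apply the same binning to other data with `pd.cut()`. This is a sketch under assumptions rather than part of the original notebook: `new_ages` is hypothetical data, and `include_lowest=True` is used so that the minimum value (2) still falls inside the first bin.

```python
import pandas as pd

age = pd.Series([2, 67, 40, 32, 4, 15, 82, 99, 26, 30, 50, 78], name='age')
result, bins = pd.qcut(age, 5, retbins=True)

# Hypothetical new observations, not part of the notebook
new_ages = pd.Series([10, 45, 99, 120])

# Reuse the quantile edges learned from `age` on the new values
new_groups = pd.cut(new_ages, bins=bins, include_lowest=True)
print(new_groups)
```

Values outside the original range, such as 120 here, are not assigned to any bin and come back as `NaN`.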

## 5. Configuring the bin precision with `precision`

In [23]:
# The bin edges produced so far are shown with up to
# 3 decimal places (the default precision), for example:
pd.qcut(df['age'], 3)

0      (1.999, 28.667]
1       (55.667, 99.0]
2     (28.667, 55.667]
3     (28.667, 55.667]
4      (1.999, 28.667]
5      (1.999, 28.667]
6       (55.667, 99.0]
7       (55.667, 99.0]
8      (1.999, 28.667]
9     (28.667, 55.667]
10    (28.667, 55.667]
11      (55.667, 99.0]
Name: age, dtype: category
Categories (3, interval[float64]): [(1.999, 28.667] < (28.667, 55.667] < (55.667, 99.0]]

In [24]:
# Set precision=0 to round the bin edges to whole numbers.
pd.qcut(df['age'], 3, precision=0)

0      (1.0, 29.0]
1     (56.0, 99.0]
2     (29.0, 56.0]
3     (29.0, 56.0]
4      (1.0, 29.0]
5      (1.0, 29.0]
6     (56.0, 99.0]
7     (56.0, 99.0]
8      (1.0, 29.0]
9     (29.0, 56.0]
10    (29.0, 56.0]
11    (56.0, 99.0]
Name: age, dtype: category
Categories (3, interval[float64]): [(1.0, 29.0] < (29.0, 56.0] < (56.0, 99.0]]

In [25]:
# Set precision=1 to round the bin edges to 1 decimal place.
pd.qcut(df['age'], 3, precision=1)

0      (1.9, 28.7]
1     (55.7, 99.0]
2     (28.7, 55.7]
3     (28.7, 55.7]
4      (1.9, 28.7]
5      (1.9, 28.7]
6     (55.7, 99.0]
7     (55.7, 99.0]
8      (1.9, 28.7]
9     (28.7, 55.7]
10    (28.7, 55.7]
11    (55.7, 99.0]
Name: age, dtype: category
Categories (3, interval[float64]): [(1.9, 28.7] < (28.7, 55.7] < (55.7, 99.0]]
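
In the outputs above, changing `precision` alters the interval labels while every row stays in the same bucket. A small sketch that checks this for the notebook's data, assuming the same `age` values; `default`, `rounded` and `same_assignment` are illustrative names.

```python
import pandas as pd

age = pd.Series([2, 67, 40, 32, 4, 15, 82, 99, 26, 30, 50, 78], name='age')

default = pd.qcut(age, 3)                # default precision of 3
rounded = pd.qcut(age, 3, precision=0)   # edges rounded to whole numbers

# Compare the bucket position of every row, ignoring the interval labels
same_assignment = default.cat.codes.equals(rounded.cat.codes)
print(same_assignment)
print(rounded.value_counts().sort_index())
```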

### Thanks for reading