# Transforming numerical data into categorical data

- Discretizing into equal-sized bins
- Adding custom bins
- Adding labels to bins
- Configuring leftmost edge with right=False
- Including the lowest value with include_lowest=True
- Passing an IntervalIndex to bins
- Returning bins with retbins=True
- Creating unordered categories

In [3]:
# %load command1.py
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity='all'

%config InlineBackend.figure_format='svg'
plt.rcParams['figure.dpi']=120

pd.options.display.float_format='{:,.2f}'.format
pd.set_option('display.max_colwidth', None)


**Discretizing into equal-sized bins**

In [7]:
df = pd.DataFrame({'age': [2, 67, 40, 32, 4, 15, 82, 99, 26, 30]})
df['age_group'] = pd.cut(df['age'], 3)
df

df['age_group']

,age,age_group
0,2,"(1.903, 34.333]"
1,67,"(66.667, 99.0]"
2,40,"(34.333, 66.667]"
3,32,"(1.903, 34.333]"
4,4,"(1.903, 34.333]"
5,15,"(1.903, 34.333]"
6,82,"(66.667, 99.0]"
7,99,"(66.667, 99.0]"
8,26,"(1.903, 34.333]"
9,30,"(1.903, 34.333]"


0     (1.903, 34.333]
1      (66.667, 99.0]
2    (34.333, 66.667]
3     (1.903, 34.333]
4     (1.903, 34.333]
5     (1.903, 34.333]
6      (66.667, 99.0]
7      (66.667, 99.0]
8     (1.903, 34.333]
9     (1.903, 34.333]
Name: age_group, dtype: category
Categories (3, interval[float64, right]): [(1.903, 34.333] < (34.333, 66.667] < (66.667, 99.0]]

The result has `dtype: category` with three interval labels: `(1.903, 34.333]`, `(34.333, 66.667]`, and `(66.667, 99.0]`. The `<` symbol indicates that these categories are ordered. Behind the scenes, pandas computes the bin width as follows to generate the equal-sized (that is, equal-width) bins:

interval = (max_value - min_value) / num_of_bins
         = (99 - 2) / 3
         = 32.33333
         
        (<--32.3333-->] < (<--32.3333-->] < (<--32.3333-->]
        
        (1.903, 34.333] < (34.333, 66.667] < (66.667, 99.0]
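
The first edge is 1.903 rather than 2 because, per the `pd.cut` documentation, the range is extended by 0.1% so that the minimum value lands inside the first right-closed interval. The sketch below reproduces the edges by hand under that assumption and compares them with the edges pandas reports; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

ages = pd.Series([2, 67, 40, 32, 4, 15, 82, 99, 26, 30])
num_of_bins = 3

# Equal-width edges between the minimum and the maximum.
edges = np.linspace(ages.min(), ages.max(), num_of_bins + 1)

# Assumption: lower the first edge by 0.1% of the range so that the
# minimum (2) falls inside the first right-closed interval.
edges[0] -= (ages.max() - ages.min()) * 0.001

# Compare with the edges pandas reports via retbins=True.
_, pandas_edges = pd.cut(ages, num_of_bins, retbins=True)

print(edges)
print(np.allclose(edges, pandas_edges))
```

Note that "equal-sized" here refers to the width of each interval, not to the number of rows that fall into it.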

**Adding custom bins**

In [15]:
df['age_group'] = pd.cut(df['age'], bins=[0, 12, 19, 61, 100])
df

df.sort_values('age_group')

# count how many values fall into each bin
df['age_group'].value_counts() # 'age_group' becomes index
df['age_group'].value_counts().sort_index() # sort the above result by the newly created index

,age,age_group
0,2,"(0, 12]"
1,67,"(61, 100]"
2,40,"(19, 61]"
3,32,"(19, 61]"
4,4,"(0, 12]"
5,15,"(12, 19]"
6,82,"(61, 100]"
7,99,"(61, 100]"
8,26,"(19, 61]"
9,30,"(19, 61]"


,age,age_group
0,2,"(0, 12]"
4,4,"(0, 12]"
5,15,"(12, 19]"
2,40,"(19, 61]"
3,32,"(19, 61]"
8,26,"(19, 61]"
9,30,"(19, 61]"
1,67,"(61, 100]"
6,82,"(61, 100]"
7,99,"(61, 100]"


(19, 61]     4
(61, 100]    3
(0, 12]      2
(12, 19]     1
Name: age_group, dtype: int64

(0, 12]      2
(12, 19]     1
(19, 61]     4
(61, 100]    3
Name: age_group, dtype: int64
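
Custom edges also decide which values fall outside every bin. As a small sketch with made-up ages (not taken from the data above), values below the first edge, exactly on the first edge, or above the last edge come back as `NaN`, since each default interval is open on the left:

```python
import pandas as pd

# Hypothetical ages, including values on and beyond the outer edges.
ages = pd.Series([-1, 0, 5, 100, 101])
groups = pd.cut(ages, bins=[0, 12, 19, 61, 100])

print(groups)

# -1, 0 and 101 do not belong to any interval, so they are missing.
print(groups.isna().tolist())
```

Checking `isna()` after binning is a cheap way to spot values that the chosen edges do not cover.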

**Adding labels to bins**

In [19]:
bins=[0, 12, 19, 61, 100]
labels=['<12', 'Teen', 'Adult', 'Older']
df['age_group'] = pd.cut(df['age'], bins, labels=labels)
df

df.sort_values('age_group')
df['age_group'].value_counts().sort_index()

,age,age_group
0,2,<12
1,67,Older
2,40,Adult
3,32,Adult
4,4,<12
5,15,Teen
6,82,Older
7,99,Older
8,26,Adult
9,30,Adult


,age,age_group
0,2,<12
4,4,<12
5,15,Teen
2,40,Adult
3,32,Adult
8,26,Adult
9,30,Adult
1,67,Older
6,82,Older
7,99,Older


<12      2
Teen     1
Adult    4
Older    3
Name: age_group, dtype: int64
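
When only the position of the bin is needed, for example as a numeric feature, `labels=False` returns integer indicators instead of names. The following sketch reuses the same ages and bins and shows that the integers match the category codes of the labelled result:

```python
import pandas as pd

ages = pd.Series([2, 67, 40, 32, 4, 15, 82, 99, 26, 30])
bins = [0, 12, 19, 61, 100]

# labels=False returns the zero-based position of each value's bin.
codes = pd.cut(ages, bins, labels=False)
print(codes.tolist())

# The same positions are available from a labelled result via .cat.codes.
labelled = pd.cut(ages, bins, labels=['<12', 'Teen', 'Adult', 'Older'])
print(labelled.cat.codes.tolist())
```

The list of labels must contain exactly one entry per bin, that is, one fewer than the number of edges.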

**Configuring leftmost edge with `right=False`**

The `right` argument of pandas `cut()` controls whether bins include their rightmost edge. It defaults to `True`, which means bins like `[0, 12, 19, 61, 100]` are interpreted as `(0, 12]`, `(12, 19]`, `(19, 61]`, `(61, 100]`. To include the leftmost edge instead, set `right=False`:

In [20]:
pd.cut(df['age'], bins=[0, 12, 19, 61, 100], right=False)

0      [0, 12)
1    [61, 100)
2     [19, 61)
3     [19, 61)
4      [0, 12)
5     [12, 19)
6    [61, 100)
7    [61, 100)
8     [19, 61)
9     [19, 61)
Name: age, dtype: category
Categories (4, interval[int64, left]): [[0, 12) < [12, 19) < [19, 61) < [61, 100)]
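
The difference only matters for values that sit exactly on an edge. The sketch below uses hypothetical ages placed on the edges to show where each one lands under both settings:

```python
import pandas as pd

# Ages chosen to sit exactly on the bin edges.
edge_ages = pd.Series([12, 19, 61, 100])
bins = [0, 12, 19, 61, 100]

right_closed = pd.cut(edge_ages, bins)              # default right=True
left_closed = pd.cut(edge_ages, bins, right=False)  # intervals like [0, 12)

print(pd.DataFrame({
    'age': edge_ages,
    'right=True': right_closed,
    'right=False': left_closed,
}))
```

With `right=False` each edge value moves to the next bin, and 100 becomes `NaN` because the last interval `[61, 100)` no longer contains its right edge.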

**Including the lowest value with `include_lowest=True`**

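Suppose you would like to divide the above age values into 2–12, 12–19, 19–61, 61–100 instead. Setting the bins to `[2, 12, 19, 61, 100]` produces a `NaN` for the age 2, because the first interval `(2, 12]` excludes its left edge. Passing `include_lowest=True` makes the first interval include the lowest value: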

In [23]:
df['age_group'] = pd.cut(df['age'], bins=[2, 12, 19, 61, 100])
df

df['age_group'] = pd.cut(
    df['age'], 
    bins=[2, 12, 19, 61, 100], 
    include_lowest=True
)
df

,age,age_group
0,2,
1,67,"(61.0, 100.0]"
2,40,"(19.0, 61.0]"
3,32,"(19.0, 61.0]"
4,4,"(2.0, 12.0]"
5,15,"(12.0, 19.0]"
6,82,"(61.0, 100.0]"
7,99,"(61.0, 100.0]"
8,26,"(19.0, 61.0]"
9,30,"(19.0, 61.0]"


,age,age_group
0,2,"(1.999, 12.0]"
1,67,"(61.0, 100.0]"
2,40,"(19.0, 61.0]"
3,32,"(19.0, 61.0]"
4,4,"(1.999, 12.0]"
5,15,"(12.0, 19.0]"
6,82,"(61.0, 100.0]"
7,99,"(61.0, 100.0]"
8,26,"(19.0, 61.0]"
9,30,"(19.0, 61.0]"


**Passing an IntervalIndex to bins**

In [26]:
bins = pd.IntervalIndex.from_tuples([(0, 12), (19, 61), (61, 100)])
bins

# Next, let’s pass it to the argument bins
df['age_group'] = pd.cut(df['age'], bins)
df

IntervalIndex([(0, 12], (19, 61], (61, 100]], dtype='interval[int64, right]')

,age,age_group
0,2,"(0.0, 12.0]"
1,67,"(61.0, 100.0]"
2,40,"(19.0, 61.0]"
3,32,"(19.0, 61.0]"
4,4,"(0.0, 12.0]"
5,15,
6,82,"(61.0, 100.0]"
7,99,"(61.0, 100.0]"
8,26,"(19.0, 61.0]"
9,30,"(19.0, 61.0]"
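
The age 15 is missing in the result above because the `IntervalIndex` was built without a `(12, 19]` interval. If contiguous bins are wanted, one option is `pd.IntervalIndex.from_breaks`, which builds adjacent intervals from a list of edges. A minimal sketch, assuming the same edges as in the earlier sections:

```python
import pandas as pd

ages = pd.Series([2, 67, 40, 32, 4, 15, 82, 99, 26, 30])

# from_breaks builds adjacent right-closed intervals, so no range is skipped.
bins = pd.IntervalIndex.from_breaks([0, 12, 19, 61, 100])
age_group = pd.cut(ages, bins)

print(bins)
print(age_group.isna().sum())  # number of ages not covered by any interval
```

Building the intervals with `from_tuples`, as in the cell above, remains useful when gaps between bins are intentional.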


**Returning bins with `retbins=True`**

The `retbins` argument controls whether the bin edges are returned. If it is set to `True`, `cut()` returns the computed `bins` along with the result, which is useful when `bins` is passed as a single number:

In [29]:
result, bins = pd.cut(
    df['age'], 
    bins=4,            # A single number value
    retbins=True
)

result
bins

0    (1.903, 26.25]
1     (50.5, 74.75]
2     (26.25, 50.5]
3     (26.25, 50.5]
4    (1.903, 26.25]
5    (1.903, 26.25]
6     (74.75, 99.0]
7     (74.75, 99.0]
8    (1.903, 26.25]
9     (26.25, 50.5]
Name: age, dtype: category
Categories (4, interval[float64, right]): [(1.903, 26.25] < (26.25, 50.5] < (50.5, 74.75] < (74.75, 99.0]]

array([ 1.903, 26.25 , 50.5  , 74.75 , 99.   ])
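
One common use of the returned edges is to bin other data with exactly the same intervals. The sketch below applies the edges computed from the original ages to a hypothetical second series `new_ages`, which is not part of the data above:

```python
import pandas as pd

ages = pd.Series([2, 67, 40, 32, 4, 15, 82, 99, 26, 30])
_, bins = pd.cut(ages, bins=4, retbins=True)

# Hypothetical new observations to be binned with the edges learned above.
new_ages = pd.Series([10, 45, 80, 120])
new_groups = pd.cut(new_ages, bins=bins)

print(new_groups)
```

Because the edges are fixed, 120 lies beyond the last edge and comes back as `NaN` rather than stretching the bins.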

**Creating unordered categories**

`ordered=False` will result in unordered categories when labels are passed. This parameter can be used to allow non-unique labels:

In [30]:
pd.cut(
    df['age'], 
    bins=[0, 12, 19, 61, 100], 
    labels=['<12', 'Teen', 'Adult', 'Older'], 
    ordered=False,
)

0      <12
1    Older
2    Adult
3    Adult
4      <12
5     Teen
6    Older
7    Older
8    Adult
9    Adult
Name: age, dtype: category
Categories (4, object): ['<12', 'Teen', 'Adult', 'Older']
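
The call above still uses unique labels. As a sketch of the non-unique case, the hypothetical label list below reuses `'Minor'` for the first two bins; with the default `ordered=True` pandas rejects it, while `ordered=False` merges both bins under one label:

```python
import pandas as pd

ages = pd.Series([2, 67, 40, 32, 4, 15, 82, 99, 26, 30])
bins = [0, 12, 19, 61, 100]
labels = ['Minor', 'Minor', 'Adult', 'Older']  # 'Minor' is used for two bins

# With the default ordered=True, duplicate labels are rejected.
try:
    pd.cut(ages, bins, labels=labels)
    duplicate_error = None
except ValueError as exc:
    duplicate_error = exc

# ordered=False accepts them and merges both bins under one label.
merged = pd.cut(ages, bins, labels=labels, ordered=False)

print(duplicate_error)
print(merged.tolist())
print(merged.cat.ordered)
```

Since the resulting categories have no order, comparisons and order-based operations such as `min()` or `max()` on the categorical are not available.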