# Binning

Binning transforms a continuous numerical variable into discrete categorical 'bins' (or 'buckets'). pandas provides two main functions for this:

- `cut()`: creates bins of equal width, so the number of observations in each bin may vary
- `qcut()`: creates bins with (approximately) equal numbers of observations, so the bin widths may vary
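
As a minimal sketch of the difference, the snippet below bins a small made-up array with both functions and counts the observations per bin. The values are hypothetical, chosen so that a single outlier stretches the range.

```python
import numpy as np
import pandas as pd

# Hypothetical skewed sample: most values are small, one is large
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100])

equal_width = pd.cut(values, 3)    # three bins of the same width
equal_count = pd.qcut(values, 3)   # three bins with the same number of observations

print(equal_width.value_counts())
print(equal_count.value_counts())
```

With `cut`, the outlier pushes eight of the nine values into the first bin and leaves the middle bin empty; with `qcut`, each bin holds three values.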

### cut function

When given an array, the cut function returns a `Categorical` object (when given a Series, it returns a Series with a categorical dtype). Displaying the object shows the category assigned to each element. It also has two attributes, `categories` and `codes`. `categories` is an index of the unique categories in their proper order. `codes` is an array of integers giving, for each element, the position of its category within `categories`.


See: https://pandas.pydata.org/docs/reference/api/pandas.cut.html and https://stackoverflow.com/questions/66775129/hide-labels-from-pandas-cut-customised-interval-index

In [None]:
import pandas as pd
import numpy as np

# Discretize into three equal-width bins.
# The output shows the bin assigned to each data point, followed by the three bins.
s = np.array([1, 7, 5, 6.5, 6, 3])
# cut returns a Categorical object when given an array
result = pd.cut(s, 3)
print(result)

In [None]:
print('\ncategories\n', result.categories)
print('codes\n', result.codes)
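
To make the relationship between `codes` and `categories` concrete, here is a small sketch that rebuilds the bin of every element by indexing `categories` with `codes`. It reuses the same array as above and assumes there are no missing values (a missing value would get the code -1).

```python
import numpy as np
import pandas as pd

s = np.array([1, 7, 5, 6.5, 6, 3])
result = pd.cut(s, 3)

# Each code is a position in `categories`; indexing with the codes
# recovers the interval assigned to every element.
rebuilt = result.categories[result.codes]

print(result.codes.tolist())
print(list(rebuilt) == list(result))
```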

In [None]:
# setting retbins=True also returns the cutoffs (bin edges)
bins, cutoffs = pd.cut(s, 3, retbins=True)
print('bins\n', bins)
print('\ncutoffs\n', cutoffs)

In [None]:
# add labels to the bins
pd.cut(s, 3, labels=["Small", "Medium", "Large"])

In [None]:
# count the number of observations in each bin
pd.cut(s, 3, labels=["Small", "Medium", "Large"]).value_counts()

### Sample dataset

In [None]:
# read sample dataset
data = pd.read_csv('../datasets/feedback.csv')
data.head()

In [None]:
# let's bin the price 
pd.cut( data['price'], 3 )

In [None]:
# add the bin as a new variable on the dataset
# labels=False returns integer bin numbers (here 0, 1, 2) instead of interval labels
data['price_bin'] = pd.cut( data['price'], 3, labels=False)
data.head()

In [None]:
# let's see how well the observations are spread over the bins 
data['price_bin'].value_counts()

### Providing the cutoffs

You can also specify the cutoffs yourself, for example with numpy's `linspace()`, which returns evenly spaced numbers over a specified interval (see https://numpy.org/doc/stable/reference/generated/numpy.linspace.html).

Similar functions are numpy's `arange()` (https://numpy.org/doc/stable/reference/generated/numpy.arange.html) and pandas' `interval_range()` (https://pandas.pydata.org/docs/reference/api/pandas.interval_range.html).
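
The three functions can produce the same ten bins between 0 and 300. The sketch below is one way to compare them; the `prices` values are hypothetical and only serve to show that each set of bins assigns the same bin numbers.

```python
import numpy as np
import pandas as pd

edges_linspace = np.linspace(0, 300, 11)   # 11 edges, endpoint included
edges_arange = np.arange(0, 301, 30)       # step of 30; stop is exclusive, hence 301
intervals = pd.interval_range(start=0, end=300, periods=10)  # 10 right-closed intervals

prices = pd.Series([15, 45, 299.5])  # hypothetical prices

from_linspace = pd.cut(prices, bins=edges_linspace, labels=False)
from_arange = pd.cut(prices, bins=edges_arange, labels=False)
# with an IntervalIndex the result is categorical, so read the bin numbers from .cat.codes
from_intervals = pd.cut(prices, bins=intervals).cat.codes

print(from_linspace.tolist())
print(from_arange.tolist())
print(from_intervals.tolist())
```

Note that `linspace` takes the number of edges, `arange` takes the step size, and `interval_range` takes the number of intervals.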

In [None]:
data['price'].describe()

In [None]:
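# keep only observations priced at 300 or less
# .copy() gives an independent DataFrame, so adding columns later does not trigger a SettingWithCopyWarning
data = data[data['price'] <= 300].copy()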

In [None]:
# evenly spaced out numbers between 0 and 300
# 11 numbers to get 10 buckets
np.linspace(0, 300, 11)

In [None]:
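# note: these are ten equal-width bins, not true deciles (qcut would give deciles)
# with the default right=True the bins are (0, 30], (30, 60], ...; a price of exactly 0 becomes NaN
data['price_decile'] = pd.cut( data['price'], bins=np.linspace(0, 300, 11), labels=False )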
data.head()

In [None]:
data['price_decile'].value_counts()
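
When cutoffs are given explicitly, values on the edges deserve a second look. By default the bins are closed on the right, so the lowest edge itself is excluded. The sketch below uses hypothetical prices on or near the edges to show the effect of the `include_lowest` parameter of `pd.cut`.

```python
import numpy as np
import pandas as pd

prices = pd.Series([0, 30, 30.01, 300])  # hypothetical values on or near the edges
edges = np.linspace(0, 300, 11)

# default: bins are (0, 30], (30, 60], ... so 0 falls outside and becomes NaN
default = pd.cut(prices, bins=edges, labels=False)

# include_lowest=True closes the first bin on the left as well
with_lowest = pd.cut(prices, bins=edges, labels=False, include_lowest=True)

print(default.tolist())
print(with_lowest.tolist())
```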

#### Zip

Let's use the `zip` function to pair the cutoffs with descriptive labels.


In [None]:
# retbins=True also returns the cutoffs; 5 bins give 6 cutoffs
bins, cutoffs = pd.cut(data['price'], 5, retbins=True)
cutoffs

In [None]:
bin_labels = ['very low', 'low', 'medium', 'high', 'very high']

In [None]:
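# cutoffs holds 6 edges but there are only 5 labels; zip stops at the shorter
# sequence, so each label is paired with the lower edge of its bin
bin_info = pd.DataFrame(zip(cutoffs, bin_labels),
                        columns=['Cutoff', 'Description'])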
bin_info
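
Since five bins have six edges, a variation is to zip the lower edges, the upper edges and the labels together so that no edge is dropped. This is only a sketch: the `prices` Series is a hypothetical stand-in for `data['price']`, and the column names are illustrative.

```python
import pandas as pd

# Hypothetical prices standing in for data['price']
prices = pd.Series([10, 55, 120, 180, 240, 300])
bin_labels = ['very low', 'low', 'medium', 'high', 'very high']

_, cutoffs = pd.cut(prices, 5, retbins=True)

# cutoffs[:-1] are the lower edges, cutoffs[1:] the upper edges
bin_info = pd.DataFrame(
    list(zip(cutoffs[:-1], cutoffs[1:], bin_labels)),
    columns=['Lower', 'Upper', 'Description'],
)
print(bin_info)
```

The first lower edge is slightly below the minimum price because `cut` extends the range a little so that the minimum is included in the first bin.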

### qcut

Let's repeat this with `qcut`. Because `qcut` places (approximately) the same number of observations in each bin, outliers have little influence on the bin boundaries.
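
The bin edges that `qcut` chooses are quantiles of the data. As a rough check, the sketch below compares the edges returned with `retbins=True` against `np.quantile` for the small array used earlier; the comparison assumes both use their default linear interpolation.

```python
import numpy as np
import pandas as pd

s = np.array([1, 7, 5, 6.5, 6, 3])

# three bins means edges at the 0, 1/3, 2/3 and 1 quantiles
bins, edges = pd.qcut(s, 3, retbins=True)
quantiles = np.quantile(s, [0, 1/3, 2/3, 1])

print(edges)
print(quantiles)
print(bins.value_counts())
```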

In [None]:
print( 'cut\n', pd.cut(s, 3) )
print( '\nqcut\n', pd.qcut(s, 3) )

In [None]:
print( 'cut\n', pd.cut(s, 3).value_counts() )
print( '\nqcut\n', pd.qcut(s, 3).value_counts() )

In [None]:
# re-read sample dataset
data = pd.read_csv('../datasets/feedback.csv')

# add the bin as a new variable on the dataset
# labels=False returns integer bin numbers (here 0, 1, 2) instead of interval labels
data['price_bin'] = pd.qcut( data['price'], 3, labels=False)
data.head()

In [None]:
data['price_bin'].value_counts()

In [None]:
# why are the counts not exactly equal? sort by price to inspect the values
df_sorted=data.sort_values('price')
df_sorted.head()

In [None]:
# inspect a slice of the sorted prices by position
df_sorted[["price"]].iloc[17091:17105]
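
One common reason for unequal counts with `qcut` is ties: identical values always land in the same bin, so a run of equal prices at a bin edge cannot be split. The sketch below illustrates this with hypothetical prices; it does not use the feedback dataset.

```python
import pandas as pd

# Hypothetical prices with a run of ties at 20
prices = pd.Series([10, 20, 20, 20, 20, 30, 40, 50, 60])

tertile = pd.qcut(prices, 3, labels=False)

# the four tied values of 20 all fall in the first bin
counts = tertile.value_counts().sort_index()
print(counts)
```

Here the bins end up with 5, 1 and 3 observations instead of 3 each. If ties make two edges identical, `qcut` raises a `ValueError` unless `duplicates='drop'` is passed.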