### As we saw in the previous class... discretisation helps handle outliers

Discretisation helps handle outliers by **placing these values into the lowest or highest intervals**, together with the remaining inlier values of the distribution. Thus, these outlier observations no longer differ from the rest of the values at the tails of the distribution, as they are now all together in the same interval / bucket. In addition, by creating appropriate bins or intervals, **discretisation can help spread the values of a skewed variable** across a set of bins with an equal number of observations.
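
To make this concrete, here is a minimal sketch on made-up values (the series below is hypothetical, not course data). It uses pandas `qcut` to show an extreme value ending up in the same bin as ordinary values from the upper tail.

```python
import pandas as pd

# Hypothetical toy values: eleven ordinary observations plus one extreme outlier.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 1000], name="x")

# Four equal-frequency bins, labelled 0..3 instead of interval boundaries.
bins = pd.qcut(values, q=4, labels=False)

# The outlier (1000) shares the top bin with the inliers 10 and 11.
top_bin_members = values[bins == bins.max()].tolist()
print(top_bin_members)  # [10, 11, 1000]
```

Once binned, the model only sees "top bin", so the distance between 11 and 1000 no longer matters.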

####  Unsupervised discretisation methods

Last class, we talked about one method, *equal width discretisation*, and some ways to do it with pandas and NumPy, Feature-engine and Scikit-learn. Today we'll learn about another method: **equal frequency discretisation**.

## Equal frequency discretisation

Equal frequency discretisation divides the range of possible values of the variable into N bins, where each bin holds the same number of observations. This is particularly useful for skewed variables, as it spreads the observations evenly over the bins. The interval boundaries are the quantiles of the variable.

In other words, we divide the continuous variable into N quantiles, with N defined by the user.

Equal frequency binning is straightforward to implement, and by spreading the observations more evenly across the bins it may help improve the performance of the model trained on the variable.
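
As a rough illustration of the idea, the sketch below computes the interval boundaries directly as quantiles with NumPy and then counts the observations per interval. The data are a hypothetical exponential sample, and the choice of 5 bins is arbitrary.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample (exponential), not the Titanic data.
rng = np.random.default_rng(0)
x = pd.Series(rng.exponential(scale=10.0, size=1000))

n_bins = 5
# Interval boundaries are the quantiles at 0, 1/N, 2/N, ..., 1.
edges = np.quantile(x.to_numpy(), np.linspace(0, 1, n_bins + 1))

# Assign each value to its interval; include_lowest keeps the minimum in the first bin.
binned = pd.cut(x, bins=edges, include_lowest=True)

# Observations per interval, in interval order: roughly 1000 / 5 each.
counts = binned.value_counts(sort=False)
print(counts)
```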

### Let's begin!

As we did with equal width discretisation, we will learn how to perform equal frequency discretisation using

- pandas and NumPy
- Feature-engine
- Scikit-learn

In [0]:
pip install -U feature-engine

In [0]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import KBinsDiscretizer
from feature_engine.discretisation import EqualFrequencyDiscretiser

First, let's load numerical variables of the Titanic dataset and separate into test and train sets.

In [0]:
data = pd.read_csv('/dbfs/FileStore/CDS2024/titanic.csv', usecols=['Age', 'Fare', 'Survived'])
data.head()

In [0]:
X_train, X_test, y_train, y_test = train_test_split(
    data[['Age', 'Fare']],
    data['Survived'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

The variables Age and Fare contain missing data, which we will fill with a random sample of the variable, just like we did last class.

In [0]:
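def impute_na(data, variable):
    # fill NA in `variable` with values sampled at random from the train set
    # (note: reads the global X_train, so the train/test split must be done first)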

    df = data.copy()

    # random sampling
    df[variable+'_random'] = df[variable]

    # extract the random sample to fill the na
    random_sample = X_train[variable].dropna().sample(
        df[variable].isnull().sum(), random_state=0)

    # pandas needs to have the same index in order to merge datasets
    random_sample.index = df[df[variable].isnull()].index
    df.loc[df[variable].isnull(), variable+'_random'] = random_sample

    return df[variable+'_random']

In [0]:
# replace NA in both train and test sets

X_train['Age'] = impute_na(data, 'Age')
X_test['Age'] = impute_na(data, 'Age')

X_train['Fare'] = impute_na(data, 'Fare')
X_test['Fare'] = impute_na(data, 'Fare')

Let's plot the distributions of Age and Fare.

In [0]:
X_train[['Age', 'Fare']].hist(bins=30, figsize=(8,4))
plt.show()

## Equal frequency discretisation with pandas and NumPy

The interval limits are the quantile limits. We can find them with the pandas *qcut* (quantile cut) function. We need to indicate how many bins we want; in this case, let's use 10.

In [0]:
Age_disccretised, intervals = pd.qcut(X_train['Age'], 10, labels=None, retbins=True, precision=1, duplicates='raise')

X_train['Age_disccretised'] = Age_disccretised
X_train.head()
# pd.concat([Age_disccretised, X_train['Age']], axis=1).head(10)
# retbins=True also returns the limits of each interval (so we can then use them to cut the test set)

We can see in the output above that each Age observation was placed within one interval. Note, however, that the interval widths differ.
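
To see why the widths differ, here is a small sketch on a hypothetical right-skewed sample (not the Titanic data). It takes the cut points returned by `qcut` and computes the width of each interval with `np.diff`.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample standing in for a variable such as Fare.
rng = np.random.default_rng(0)
x = pd.Series(rng.exponential(scale=10.0, size=1000))

# retbins=True also returns the array of cut points.
_, edges = pd.qcut(x, q=5, retbins=True)

# Width of each interval: narrow where the data are dense, wide in the sparse tail.
widths = np.diff(edges)
print(np.round(widths, 2))
```

Each interval holds the same number of observations, so the intervals covering the sparse tail have to be wider.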

We can visualise the interval cut points below:

In [0]:
intervals

And because we generated the bins using the quantile cut method, we should have roughly the **same number of observations per bin**. Let's check it:

In [0]:
Age_disccretised.value_counts()

We can also add labels to the bins instead of having the interval boundaries, as follows:

In [0]:
labels = ['Q'+str(i) for i in range(1,11)]
labels

In [0]:
Age_disccretised, intervals = pd.qcut(X_train['Age'], 10, labels=labels,
                                      retbins=True,
                                      precision=1, duplicates='raise')

Age_disccretised.head()

To transform the test set, we need to use the pandas *cut* function (instead of *qcut*) and pass the quantile edges calculated on the training set.

In [0]:
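# include_lowest=True keeps a test value equal to the lowest training edge in the first bin
X_test['Age_disc_label'] = pd.cut(x=X_test['Age'], bins=intervals, labels=labels, include_lowest=True)
X_test['Age_disc'] = pd.cut(x=X_test['Age'], bins=intervals, include_lowest=True)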

X_test.head(10)

Let's check if we have equal frequency (equal number of observations per bin):

In [0]:
X_test.groupby('Age_disc')['Age'].count().plot.bar()

The intervals have roughly the same number of observations. However, we can see that the top intervals have fewer observations. This may happen with skewed distributions if we try to divide the variable into a high number of intervals. To make the value spread more homogeneous, we should discretise into **fewer intervals**.
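
A related caveat when reusing the training edges with pandas *cut*: test values that fall outside the range seen in training are not assigned to any bin and come back as NaN. The sketch below shows this on made-up numbers. Widening the outermost edges to minus and plus infinity is one possible workaround, assumed here for illustration rather than something the steps above do.

```python
import numpy as np
import pandas as pd

# Hypothetical train and test values; the test set has values outside the training range.
train = pd.Series([2, 5, 8, 12, 20, 25, 31, 40, 47, 60])
test = pd.Series([1, 5, 33, 75])

# Learn the quantile edges on the training data only.
_, edges = pd.qcut(train, q=4, retbins=True)

# Reusing the raw edges leaves out-of-range test values (1 and 75) unassigned (NaN).
raw = pd.cut(test, bins=edges, labels=False)

# Assumption for this sketch: widen the outermost edges so every value falls in a bin.
open_edges = edges.copy()
open_edges[0], open_edges[-1] = -np.inf, np.inf
widened = pd.cut(test, bins=open_edges, labels=False)

print(raw.tolist())
print(widened.tolist())  # [0, 0, 2, 3]
```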

## Equal frequency discretisation with Feature-Engine
First, let's follow the same steps as we did before: separate into train and test sets and replace missing values.

In [0]:
X_train, X_test, y_train, y_test = train_test_split(
    data[['Age', 'Fare']],
    data['Survived'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

In [0]:
X_train['Age'] = impute_na(data, 'Age')
X_test['Age'] = impute_na(data, 'Age')

X_train['Fare'] = impute_na(data, 'Fare')
X_test['Fare'] = impute_na(data, 'Fare')

With Feature-engine we can automate the process for many variables in one line of code.

In [0]:
disc = EqualFrequencyDiscretiser(q=10, variables = ['Age', 'Fare'])
disc.fit(X_train)

In [0]:
# in the binner dict, we can see the limits of the intervals. Note that the intervals have different widths.
disc.binner_dict_

In [0]:
# Let's transform train and test:
train_t = disc.transform(X_train)
test_t = disc.transform(X_test)

In [0]:
train_t.head()

Let's explore the number of observations per bucket:

In [0]:
t1 = train_t.groupby(['Age'])['Age'].count() / len(train_t)
t2 = test_t.groupby(['Age'])['Age'].count() / len(test_t)

tmp = pd.concat([t1, t2], axis=1)
tmp.columns = ['train', 'test']
tmp.plot.bar()
plt.xticks(rotation=0)
plt.ylabel('Fraction of observations per bin')

In [0]:
t1 = train_t.groupby(['Fare'])['Fare'].count() / len(train_t)
t2 = test_t.groupby(['Fare'])['Fare'].count() / len(test_t)

tmp = pd.concat([t1, t2], axis=1)
tmp.columns = ['train', 'test']
tmp.plot.bar()
plt.xticks(rotation=0)
plt.ylabel('Fraction of observations per bin')

Note how equal frequency discretisation obtains a more even spread of values across the intervals than equal width discretisation did.
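
To put the two methods side by side, here is a minimal sketch on a hypothetical right-skewed sample (not the Titanic data). It computes the share of observations per bin for equal width (`pd.cut`) and equal frequency (`pd.qcut`), with 10 bins each.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample, not the Titanic data.
rng = np.random.default_rng(0)
x = pd.Series(rng.exponential(scale=10.0, size=1000))

# Share of observations per bin under each method, in bin order.
equal_width = pd.cut(x, bins=10).value_counts(normalize=True, sort=False)
equal_freq = pd.qcut(x, q=10).value_counts(normalize=True, sort=False)

# Drop the interval index so the two columns line up by bin position.
comparison = pd.DataFrame({
    "equal_width": equal_width.reset_index(drop=True),
    "equal_frequency": equal_freq.reset_index(drop=True),
})
print(comparison.round(3))
```

With a skewed variable, equal width puts most observations in the first few bins, while equal frequency keeps each share close to 1/10.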

## Equal frequency discretisation with Scikit-learn
Finally, let's use the *KBinsDiscretizer* transformer from Scikit-learn, setting the strategy to 'quantile' to obtain equal frequency bins.
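
Before applying it to the Titanic data, here is a minimal sketch of *KBinsDiscretizer* on a hypothetical single-column array. The data and the choice of 4 bins are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical right-skewed feature with 1000 rows and a single column.
rng = np.random.default_rng(0)
X = rng.exponential(scale=10.0, size=(1000, 1))

# strategy='quantile' requests equal-frequency bins; encode='ordinal' returns bin indices.
disc = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="quantile")
X_binned = disc.fit_transform(X)

# bin_edges_ holds one array of edges per feature.
print(disc.bin_edges_[0])

# Count the observations that landed in each bin.
bin_ids, counts = np.unique(X_binned, return_counts=True)
print(dict(zip(bin_ids.astype(int).tolist(), counts.tolist())))
```

The transformer returns a NumPy array rather than a DataFrame, which is why the cells below wrap the output in `pd.DataFrame`.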

In [0]:
# Let's separate into train and test set

X_train, X_test, y_train, y_test = train_test_split(
    data[['Age', 'Fare']],
    data['Survived'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

In [0]:
# replace NA in both train and test sets

X_train['Age'] = impute_na(data, 'Age')
X_test['Age'] = impute_na(data, 'Age')

X_train['Fare'] = impute_na(data, 'Fare')
X_test['Fare'] = impute_na(data, 'Fare')

In [0]:
disc = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='quantile')
disc.fit(X_train[['Age', 'Fare']])

In [0]:
disc.bin_edges_

In [0]:
train_t = disc.transform(X_train[['Age', 'Fare']])
train_t = pd.DataFrame(train_t, columns = ['Age', 'Fare'])
train_t.head()

In [0]:
test_t = disc.transform(X_test[['Age', 'Fare']])
test_t = pd.DataFrame(test_t, columns = ['Age', 'Fare'])

In [0]:
t1 = train_t.groupby(['Age'])['Age'].count() / len(train_t)
t2 = test_t.groupby(['Age'])['Age'].count() / len(test_t)

tmp = pd.concat([t1, t2], axis=1)
tmp.columns = ['train', 'test']
tmp.plot.bar()
plt.xticks(rotation=0)
plt.ylabel('Fraction of observations per bin')

In [0]:
t1 = train_t.groupby(['Fare'])['Fare'].count() / len(train_t)
t2 = test_t.groupby(['Fare'])['Fare'].count() / len(test_t)

tmp = pd.concat([t1, t2], axis=1)
tmp.columns = ['train', 'test']
tmp.plot.bar()
plt.xticks(rotation=0)
plt.ylabel('Fraction of observations per bin')