In [None]:
!pip install openturns

# Exploration of Target Distribution
In this notebook we explore the distribution of the `target` feature. There are many investment IDs, and more data does not always help (for example, it increases training time), so we try to aggregate some investments. The assumption is that many investments correlate with each other or belong to the same investment category, which should lead to the same underlying distribution. The goal is to cluster similar distributions in order to merge different investment IDs.

In [None]:
import os

import matplotlib.pyplot as plt
import numpy as np
import openturns as ot
import pandas as pd
import scipy.stats  # import the submodule explicitly; `import scipy` alone does not guarantee it is loaded
import seaborn as sns

sns.set_style("ticks", {'axes.grid': True})

market_data_path = r"../input/ubiquant-parquet"
model_data_path = r"../input/distribution-data-targets-raw"

We define a helper function to simplify reading the data:

In [None]:
def get_data(path: str, investment_id: int = None, columns: list = None):
    """
    Get the market data.

    Parameters
    ----------
    path : str
        Directory containing `train.parquet` and the `investment_ids` folder.
    investment_id : int, optional
        An investment ID between 0 and 3773. If None (default), the whole training data is returned.
    columns : list, optional
        Columns to import. If None (default), all columns are imported.
    """
    if investment_id is None:
        train_path = os.path.join(path, "train.parquet")

        return pd.read_parquet(train_path, columns=columns)
    else:
        id_path = os.path.join(path, "investment_ids", "{0}.parquet")

        return pd.read_parquet(id_path.format(investment_id), columns=columns)

To retrieve the data we use the custom `get_data` function, which can also load the data of an individual investment ID.

In [None]:
target = get_data(market_data_path, columns=["target"])

First, we look at the histogram of the raw `target` data.

In [None]:
bins_width = int(180 / 2)
sns.histplot(target, color='darkblue', stat='density', bins=bins_width, alpha=0.50, label="Target")


Next, we will try to fit a `normal distribution`,

In [None]:
_, bins = np.histogram(target, bins=bins_width)

params = scipy.stats.norm.fit(target)
norm_fit_line = scipy.stats.norm.pdf(bins, *params)


and plot its results:

In [None]:
sns.histplot(target, color='darkblue', stat='density', bins=bins_width, alpha=0.50, label="Target")
plt.plot(bins, norm_fit_line, color="red", label="Normal Dist.")
plt.legend()

We can see that this distribution is far from a normal distribution. Moreover, it is not symmetrical, which can hurt the performance of a machine learning algorithm. The distribution may look more symmetric once the outliers are removed.
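
The visual impression of asymmetry can also be quantified. The following is a minimal sketch, assuming synthetic samples as stand-ins for the real `target` data; the helper names `sample_skewness` and `sample_excess_kurtosis` are hypothetical and not part of the notebook. Both statistics are close to zero for a normal sample.

```python
import numpy as np


def sample_skewness(x):
    """Biased sample skewness (third standardized moment)."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return np.mean(centered**3) / np.mean(centered**2) ** 1.5


def sample_excess_kurtosis(x):
    """Biased sample excess kurtosis (fourth standardized moment minus 3)."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return np.mean(centered**4) / np.mean(centered**2) ** 2 - 3.0


# Synthetic stand-ins, NOT the competition data
rng = np.random.default_rng(0)
symmetric = rng.normal(size=100_000)
skewed = rng.lognormal(sigma=0.5, size=100_000)

print("normal   :", sample_skewness(symmetric), sample_excess_kurtosis(symmetric))
print("lognormal:", sample_skewness(skewed), sample_excess_kurtosis(skewed))
```

Applied to the flattened `target` values, a skewness clearly different from zero would support the observation above.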

## Log Returns
As the description states, the `targets` are `return rates`. These arithmetic `return rates` ($r_\alpha$) are well suited for most uses, but some of their characteristics complicate their use in certain academic and valuation settings. Therefore, we take their logarithm (referred to as `log returns` and denoted as $r_l$) and then assess whether the transformed `targets` fit a normal distribution more closely. Note that the code below divides the `target` by 100 before the transformation, i.e. it treats the values as percentages. We can transform the `return rates` into `log returns` with the relation:
$$r_l=\ln{\left(r_\alpha + 1\right)}.$$
The back-transformation is then defined as:
$$r_\alpha = \exp{\left(r_l\right)}-1.$$
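
As a quick sanity check of the two relations, here is a minimal sketch with made-up return values; the helper names `to_log_return` and `to_arithmetic_return` are hypothetical. It uses `np.log1p` and `np.expm1`, which compute $\ln(1 + x)$ and $\exp(x) - 1$ accurately for small $x$.

```python
import numpy as np


def to_log_return(r_alpha):
    """r_l = ln(r_alpha + 1)."""
    return np.log1p(r_alpha)


def to_arithmetic_return(r_l):
    """r_alpha = exp(r_l) - 1."""
    return np.expm1(r_l)


# Illustrative arithmetic returns (as fractions, not percentages)
r_alpha = np.array([-0.05, 0.0, 0.02, 0.10])

r_l = to_log_return(r_alpha)
recovered = to_arithmetic_return(r_l)

# The back-transformation recovers the original returns
print(np.allclose(recovered, r_alpha))

# Log returns are additive over periods: their sum equals the log of the compounded growth
print(np.isclose(r_l.sum(), np.log(np.prod(1.0 + r_alpha))))
```

The additivity shown in the last line is one reason log returns are often easier to work with than arithmetic returns.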

In [None]:
log_target = np.log(target/100 + 1)

Let us have a look at its distribution now:

In [None]:
_, log_bins = np.histogram(log_target, bins=bins_width)

log_params = scipy.stats.norm.fit(log_target)
log_norm_fit_line = scipy.stats.norm.pdf(log_bins, *log_params)

sns.histplot(log_target, color='darkblue', stat='density', bins=bins_width, alpha=0.50, label="Log Return")
plt.plot(log_bins, log_norm_fit_line, color="red", label="Normal Dist.")
plt.legend()

As one can see, the `log returns` fit the normal curve better. Moreover, the maximum density of the data now lies near the center, which makes the distribution more symmetrical.

## Distribution of the Investment IDs
We can still rule out a normal distribution. One possible way to find the underlying distribution is the `openturns` package. It fits many candidate distributions and returns the one that performs best on a criterion of choice. Our criterion of choice is the [BIC](https://en.wikipedia.org/wiki/Bayesian_information_criterion). Furthermore, we will run the test for every investment ID.
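
To illustrate the idea behind BIC-based selection, here is a minimal sketch that compares only two candidates, using closed-form maximum-likelihood estimates and the textbook definition $\mathrm{BIC} = k \ln n - 2 \ln \hat{L}$. It is not how `openturns` is implemented, and `openturns` may scale the criterion differently (for example, per observation), so absolute values need not match. The sample is synthetic and the function names are hypothetical.

```python
import numpy as np


def bic(log_likelihood, n_params, n_obs):
    """Textbook BIC: k * ln(n) - 2 * ln(L). Lower is better."""
    return n_params * np.log(n_obs) - 2.0 * log_likelihood


def normal_log_likelihood(x):
    """Log-likelihood of a normal fit at its MLE (mean, biased variance)."""
    sigma2 = np.var(x)
    return -0.5 * x.size * (np.log(2.0 * np.pi * sigma2) + 1.0)


def laplace_log_likelihood(x):
    """Log-likelihood of a Laplace fit at its MLE (median, mean absolute deviation)."""
    b = np.mean(np.abs(x - np.median(x)))
    return -x.size * (np.log(2.0 * b) + 1.0)


# Synthetic heavy-tailed sample, NOT the competition data
rng = np.random.default_rng(42)
sample = rng.laplace(loc=0.0, scale=0.01, size=5_000)

scores = {
    "Normal": bic(normal_log_likelihood(sample), n_params=2, n_obs=sample.size),
    "Laplace": bic(laplace_log_likelihood(sample), n_params=2, n_obs=sample.size),
}
best = min(scores, key=scores.get)
print(scores, "-> best:", best)
```

Because both candidates have two parameters here, the penalty term is identical and the comparison reduces to the likelihood; the penalty matters once candidates with different parameter counts are compared.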

In [None]:
# Since the target data is very large, this computation takes a lot of time. Thus, the result was precomputed and saved.
# target_distribution = pd.DataFrame(columns=["ID", "Model", "Parameter", "BIC"])
# target_distribution.index.name = "Index"
# total_ids = 3773
#
# for ids in range(0, total_ids):
#     try:
#         print("\r>Processing ID (Total: {0}) ".format(total_ids), end=str(ids))
#
#         target = get_data(market_data_path, investment_id=ids, columns=["target"]).values.flatten()
#         target = np.log(target/100 + 1)
#
#         sample = ot.Sample([[x] for x in target])
#         tested_factories = ot.DistributionFactory.GetContinuousUniVariateFactories()
#         best_model_bic, best_bic = ot.FittingTest.BestModelBIC(sample, tested_factories)
#         split = str(best_model_bic).split("(")
#
#         target_distribution.loc[ids] = [ids, split[0], split[-1][:-1], best_bic]
#
#     except Exception:
#         print("\r> Could not calculate ID .", end=str(ids))
#
# target_distribution.to_csv(os.path.join(model_data_path, "distribution_data_targets.csv"))

We can now group the determined distributions:

In [None]:
target_distribution = pd.read_csv(os.path.join(model_data_path, "distribution_data_targets_raw.csv"))

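x = target_distribution['Model'].value_counts()
# Take the labels from the counts themselves: `value_counts` sorts by frequency,
# whereas `unique` keeps order of appearance, which can mislabel the slices.
label = x.index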

plt.figure(figsize=(5, 5))
plt.pie(x, labels=label, autopct='%1.1f%%')
plt.tight_layout()
plt.show()

As we can see, more than half of the `targets` could follow a `Laplace` distribution and 34% a `Logistic` distribution. In the next step, we will try to merge the minor distribution classes. In order to cluster the `other` classes, we look at the distribution of their combined data. First, we filter the `other` class:

In [None]:
other = target_distribution[(target_distribution["Model"] != "Student") & (target_distribution["Model"] != "Logistic") & (
        target_distribution["Model"] != "Laplace")]
other.head()

Second, we join all the `target` values of the corresponding IDs and look at their distribution:

In [None]:
ids = other["ID"].values

other_targets = [get_data(market_data_path, investment_id=item, columns=["target"]).values.tolist() for item in ids]
other_targets = np.array([val[0] for sublist in other_targets for val in sublist])
other_targets = np.log(other_targets/100 + 1)

bins_width = int(180 / 2)
sns.histplot(other_targets, color='darkblue', stat='density', bins=bins_width, alpha=0.50, label="Target")

The next step is to determine the best fitted distribution with the `openturns` package:

In [None]:
sample = ot.Sample([[x] for x in other_targets])
tested_factories = ot.DistributionFactory.GetContinuousUniVariateFactories()
best_model_bic, best_bic = ot.FittingTest.BestModelBIC(sample, tested_factories)

print(best_model_bic)

The result indicates that a `Student's t Distribution` matches the data best. Let us have a look:

In [None]:
_, bins = np.histogram(other_targets, bins=bins_width)

t_params = scipy.stats.t.fit(other_targets)
t_fit_line = scipy.stats.t.pdf(bins, *t_params)

sns.histplot(other_targets, color='darkblue', stat='density', bins=bins_width, alpha=0.50, label="Log Return")
plt.plot(bins, t_fit_line, color="red", label="Student's t Dist.")
plt.legend()
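
A visual overlay can be complemented by a numeric check. The following is a minimal sketch, assuming a synthetic Student's t sample in place of `other_targets`, that compares the Kolmogorov-Smirnov statistic of a fitted Student's t distribution with that of a fitted normal distribution. Only the statistics are compared, since the p-values are optimistic when the parameters are estimated from the same sample.

```python
import numpy as np
from scipy import stats

# Synthetic heavy-tailed sample, NOT the competition data
rng = np.random.default_rng(1)
sample = stats.t.rvs(df=5, loc=0.0, scale=0.01, size=5_000, random_state=rng)

# Fit both candidates by maximum likelihood
df, loc, scale = stats.t.fit(sample)
mu, sigma = stats.norm.fit(sample)

# A smaller KS statistic means the fitted CDF is closer to the empirical CDF
ks_t = stats.kstest(sample, "t", args=(df, loc, scale))
ks_norm = stats.kstest(sample, "norm", args=(mu, sigma))

print("Student's t:", ks_t.statistic)
print("Normal     :", ks_norm.statistic)
```
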

Indeed, the `Student's t Distribution` fits the data very well. We therefore assign the `other` class to the `Student's t Distribution` and set its parameters accordingly:

In [None]:
# Select every investment whose best model is not one of the three main classes
is_other = ~target_distribution["Model"].isin(["Student", "Logistic", "Laplace"])

target_distribution.loc[is_other, "Parameter"] = "nu = 4.75649, mu = -3.49823e-05, sigma = 0.00661145"
target_distribution.loc[is_other, "BIC"] = best_bic
target_distribution.loc[is_other, "Model"] = "Student"

Let us plot the new percentages of the three distributions:

In [None]:
x = target_distribution['Model'].value_counts()
# Use the index of the counts so that every label matches its own slice
label = x.index

plt.figure(figsize=(5, 5))
plt.pie(x, labels=label, autopct='%1.1f%%')
plt.tight_layout()
plt.show()