<a href="https://colab.research.google.com/github/KarelZe/thesis/blob/feature-engineering/notebooks/3.0-mb-data_preprocessing_explanatory_data_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install gcsfs==2022.10.0
!pip install scikit-learn==1.1.3
# !pip install SciencePlots==1.0.9
!pip install pandas-datareader
!pip install seaborn==0.12.1

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
ERROR: Could not find a version that satisfies the requirement scikit-learn==1.1.3 (from versions: 0.9, 0.10, 0.11, 0.12, 0.12.1, 0.13, 0.13.1, 0.14, 0.14.1, 0.15.0b1, 0.15.0b2, 0.15.0, 0.15.1, 0.15.2, 0.16b1, 0.16.0, 0.16.1, 0.17b1, 0.17, 0.17.1, 0.18, 0.18.1, 0.18.2, 0.19b2, 0.19.0, 0.19.1, 0.19.2, 0.20rc1, 0.20.0, 0.20.1, 0.20.2, 0.20.3, 0.20.4, 0.21rc2, 0.21.0, 0.21.1, 0.21.2, 0.21.3, 0.22rc2.post1, 0.22rc3, 0.22, 0.22.1, 0.22.2, 0.22.2.post1, 0.23.0rc1, 0.23.0, 0.23.1, 0.23.2, 0.24.dev0, 0.24.0rc1, 0.24.0, 0.24.1, 0.24.2, 1.0rc1, 1.0rc2, 1.0, 1.0.1, 1.0.2)
ERROR: No matching distribution found for scikit-learn==1.1.3

In [2]:
import os
import random

from dateutil.relativedelta import relativedelta

import gcsfs
import google.auth
from google.colab import auth

import numpy as np
from numpy.testing import assert_almost_equal
from matplotlib.colors import LinearSegmentedColormap
import matplotlib.pyplot as plt
import pandas as pd
import pandas_datareader.data as web

from scipy import stats
from scipy.stats import kurtosis, skew

import seaborn as sns
from typing import List, Tuple, Optional, Union

In [3]:
# set N used in n-largest or smallest
N = 10

In [4]:
# set style
plt.style.use('seaborn-notebook')

# set ratio of figure
ratio = (16,9)

# plt.style.use(['science','nature', 'no-latex'])

In [5]:
# connect to google cloud storage
auth.authenticate_user()
credentials, _ = google.auth.default()
fs = gcsfs.GCSFileSystem(project="thesis", token=credentials)
fs_prefix = "gs://"


In [6]:
# set fixed seed
def seed_everything(seed):
    """
    Seeds basic parameters for reproducibility of results.
    """
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    # pandas and numpy as discussed here: https://stackoverflow.com/a/52375474/5755604
    np.random.seed(seed)


seed = 42
seed_everything(seed)


In [None]:
# replace with sampled data set later
files = fs.glob(
    "thesis-bucket-option-trade-classification/data/preprocessed/matched_ise_quotes_min_mem_usage_part_*.parquet",
    recursive=True,
)
files = [fs_prefix + sub for sub in files]

columns = [
    "UNDERLYING_SYMBOL",
    "QUOTE_DATETIME",
    "SEQUENCE_NUMBER",
    "ROOT",
    "EXPIRATION",
    "STRK_PRC",
    "OPTION_TYPE",
    "TRADE_SIZE",
    "TRADE_PRICE",
    "BEST_BID",
    "BEST_ASK",
    "order_id",
    "ask_ex",
    "bid_ex",
    "bid_size_ex",
    "ask_size_ex",
    "price_all_lead",
    "price_all_lag",
    "optionid",
    "day_vol",
    "price_ex_lead",
    "price_ex_lag",
    "buy_sell",
]

dfs = []
for gc_file in files:
    df = pd.read_parquet(gc_file, columns=columns)
    dfs.append(df)
data = pd.concat(dfs)


In [None]:
data = data.sample(frac=0.1, axis=0, random_state=seed)

## Notes on data set 🗃️

**Overview of ticker symbols in the U.S. 🇺🇸:**
- Symbols in the `others` group are probably identified by a `.`. Indices are probably identified by a `^` prefix, e. g., `^NDX` for the Nasdaq-100. The `SPY` ETF, however, is just `SPY`.
- 5th letter has a special meaning as found in [this table](https://en.wikipedia.org/wiki/Ticker_symbol):

| Letter                  | Letter contd.              | Letter contd.                                    |
|--------------------------------|-------------------------------------|------------------------------------------------|
| A – Class "A"                  | K – Nonvoting (common)              | U – Units                                      |
| B – Class "B"                  | L – Miscellaneous                   | V – Pending issue and distribution             |
| C – NextShares                 | M – fourth class – preferred shares | W – Warrants                                   |
| D – New issue or reverse split | N – third class – preferred shares  | X – Mutual fund                                |
| E – Delinquent SEC filings     | O – second class – preferred shares | Y – American depositary receipt (ADR)          |
| F – Foreign                    | P – first class preferred shares    | Z – Miscellaneous situations                   |
| G – first convertible bond     | Q – In bankruptcy                   | Special codes                                  |
| H – second convertible bond    | R – Rights                          | PK – A Pink Sheet, indicating over-the-counter |
| I – third convertible bond     | S – Shares of beneficial interest   | SC – Nasdaq Small Cap                          |
| J – Voting share – special     | T – With warrants or rights         | NM – Nasdaq National Market                    |


**Coverage:**

*	Options on U.S. listed Stock, ETFs, and Indices disseminated over the Options Price Reporting Authority (OPRA) market data feed 
*	Global Trading Hours (GTH) trades are included if they occur between 03:00 and 09:15 U.S. Eastern, or during the 16:15 to 17:00 Curb session. GTH trades outside of these time ranges will *not* be included.

Found [here.](https://datashop.cboe.com/documents/Option_Trades_Layout.pdf)

**Exchange Identifier:**

- 5 = Chicago Board Options Exchange (CBOE)
- 6 = International Securities Exchange (ISE)

Found [here.](https://datashop.cboe.com/documents/livevol_exchange_ids.csv)

Adapted from the cboe data shop found at [option trades](https://datashop.cboe.com/documents/Option_Trades_Layout.pdf) and [option quotes](https://datashop.cboe.com/documents/Option_Quotes_Layout.pdf).

|     Column Label                                                          |     Data   Type     |     Description                                                                                                                                                                                                         |
|---------------------------------------------------------------------------|---------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|     underlying_symbol                                                     |     string          |     The underlying stock or index.  An index will utilize a caret (^) prefix,   i.e. ^NDX,^SPX,^VIX…etc.  Underlyings   with classes may utilize a dot (.) instead of a slash or space, i.e. BRK.B,   RDS.A, RDS.B.     |
|     quote_datetime                                                        |     datetime        |     The trading date and timestamp of the trade in   U.S. Eastern time. Ex:  yyyy-mm-dd   hh:mm:ss.000                                                                                                                  |
|     sequence_number                                                       |     integer         |     Trade Sequence Number for the execution reported   by OPRA                                                                                                                                                          |
|     root                                                                  |     string          |     The option trading class symbol.  Non-standard roots may end with a digit                                                                                                                                           |
|     expiration                                                            |     date            |     The explicit expiration date of the option:   yyyy-mm-dd                                                                                                                                                            |
|     strike                                                                |     numeric         |     The exercise/strike price of the option                                                                                                                                                                             |
|     option_type                                                           |     string          |     C for Call options, P for Put options                                                                                                                                                                               |
|     exchange_id                                                           |     integer         |     An identifier for the options exchange the trade   was executed on.  For a mapping, please   see Exchange ID   Mappings                                                                                             |
|     trade_size                                                            |     integer         |     The trade quantity                                                                                                                                                                                                  |
|     trade_price                                                           |     numeric         |     The trade price                                                                                                                                                                                                     |
|     trade_condition_id                                                    |     integer         |     The trade or sale condition of the execution.  For a mapping, please see Trade   Condition ID Mapping                                                                                                               |
|     canceled_trade_condition_id                                           |     integer         |     This field is no longer supported and will default   to 0 (zero).  See IDs 40-43 in the   Trade Condition ID Mapping file above                                                                                     |
|     best_bid                                                              |     numeric         |     The best bid price (NBB) at the time of the trade                                                                                                                                                                   |
|     best_ask                                                              |     numeric         |     The best ask/offer price (NBO) at the time of the   trade                                                                                                                                                           |
|     bid_size              |     integer         |     The largest size from an options exchange   participant on the best bid price (NBB)                                                                                                                                   |
|     bid                   |     numeric         |     The best bid price (NBB) at the interval time   stamp                                                                                                                                                                 |
|     ask_size              |     integer         |     The largest size from an options exchange   participant on the best offer price (NBO)                                                                                                                                 |
|     ask                   |     numeric         |     The best offer price (NBO) at the interval time   stamp                                                                                                                                                               |

## Dtypes, distributions, and memory consumption 🔭

In [None]:
data.head()

In [None]:
data.describe()

In [None]:
data.info()

In [None]:
print(data.shape)

In [None]:
print(data.shape)
# drop identical rows, if present 
data.drop_duplicates(inplace=True)
print(data.shape)

In [None]:
data.nunique()

In [None]:
data.head().T

## Basic features🧸

### Correlations 🎲

In [None]:
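# restrict to numeric columns; newer pandas versions raise on string columns
corr = data.select_dtypes(include="number").corr()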
sns.heatmap(corr, xticklabels=corr.columns.values, yticklabels=corr.columns.values)

In [None]:
sample = data.sample(n=1000, random_state=seed)
sns.pairplot(sample, vars=["STRK_PRC","TRADE_SIZE", "TRADE_PRICE", "BEST_BID", "BEST_ASK", "ask_ex", "bid_ex", "ask_size_ex", "bid_size_ex", "price_all_lag", "price_all_lead", "day_vol"])

### Correlation with target 🎲

In [None]:
sort_criteria = corr["buy_sell"].abs().sort_values(ascending=False)
corr_target = corr.sort_values("buy_sell", ascending=False)["buy_sell"]
corr_target.loc[sort_criteria.index].to_frame()

**Observation:**
* Overall, correlations are relatively low, which is typical for financial data.
* Size-related features like `ask_size_ex` or `bid_size_ex` have the highest correlation with the target and are promising candidates for the model. Make sure that size features are included when constructing feature sets.
* Identifiers like `optionid`, `order_id`, and `SEQUENCE_NUMBER` are also among the features with the highest correlations. As identifiers should carry no information about the trade direction, remove these misleading columns.

In [None]:
# remove some columns, which will NOT be used in model
data.drop(columns=["optionid","SEQUENCE_NUMBER", "order_id"], inplace=True)

In [None]:
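# `UNDERLYING_SYMBOL` is still needed for the underlying features and the
# symbol analysis below. Drop it before modelling and just keep ROOT.
# data.drop(columns="UNDERLYING_SYMBOL", inplace=True)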

### Collinearity of features 🎲

In [None]:
# adapted from here: https://www.kaggle.com/code/willkoehrsen/featuretools-for-good

# Select upper triangle of correlation matrix
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))

# Find feature columns with an absolute correlation greater than 0.975
to_drop = [column for column in upper.columns if (upper[column].abs() > 0.975).any()]

print(to_drop)

In [None]:
# Set the threshold
threshold = 0.975

# Empty dictionary to hold correlated variables
above_threshold_vars = {}

# For each column, record the variables that are above the threshold
for col in corr:
    above_threshold_vars[col] = list(corr.index[corr[col].abs() > threshold])

In [None]:
correlating_cols = pd.Series(above_threshold_vars)
correlating_cols

**Observations:**
* Some columns are highly correlated. This is intuitive, e. g., for the national best bid (`BEST_BID`) and the bid at the exchange (`bid_ex`).
* It seems problematic to include both `BEST_BID` and `bid_ex`. This is also true for `BEST_ASK` and `ask_ex`. `price_all_lead` and `price_all_lag` seem to be less problematic.
* Define feature sets such that the number of highly correlated variables is minimized, but maintain groups so that a comparison with classical rules is still possible.
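
To decide which of two collinear features to keep, it can help to list the offending pairs explicitly instead of only the columns. The following is a minimal sketch on synthetic data; the helper `correlated_pairs` and the toy frame are assumptions made for illustration and are not part of the notebook.

```python
import numpy as np
import pandas as pd


def correlated_pairs(frame: pd.DataFrame, threshold: float = 0.975) -> pd.Series:
    """Return feature pairs whose absolute Pearson correlation exceeds `threshold`."""
    corr_abs = frame.corr().abs()
    # keep the strict upper triangle so that every pair is listed only once
    mask = np.triu(np.ones(corr_abs.shape, dtype=bool), k=1)
    pairs = corr_abs.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


# toy quotes (hypothetical): `bid_ex` is almost a copy of `BEST_BID`,
# `TRADE_SIZE` is drawn independently
rng = np.random.default_rng(42)
best_bid = rng.uniform(1, 10, size=500)
toy = pd.DataFrame(
    {
        "BEST_BID": best_bid,
        "bid_ex": best_bid + rng.normal(0, 0.01, size=500),
        "TRADE_SIZE": rng.integers(1, 100, size=500),
    }
)

pairs = correlated_pairs(toy)
print(pairs)
```

Listing pairs rather than columns makes it easier to keep the member of each pair that a classical rule relies on.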

## Preparation 🥗

### Visualization helper 🐜

In [None]:
def plot_kde_target(var_name:str, clip:Optional[List[float]]=None):
  """
  Plot kde plots for buys (+1) and sells (-1) with regard to \
  the feature 'var_name'.

  Args:
      var_name (str): name of the feature
      clip (Optional[List[float]], optional): clipping range. Defaults to None.
  """
  corr_var = data["buy_sell"].corr(data[var_name])

  median_sell = data[data['buy_sell'] == -1][var_name].median()
  median_buy = data[data['buy_sell'] == 1][var_name].median()

  fig, ax = plt.subplots()
  for i in [-1,1]:
    sns.kdeplot(data=data[data["buy_sell"]==i], x=var_name, clip=clip, label=str(i), cumulative=False, common_grid=True)
  ax.title.set_text(f"Distribution of '{var_name}'")
  ax.legend()
  sns.move_legend(ax, "lower center", bbox_to_anchor=(0.5, -0.3))
  plt.show()
  print(f"The correlation between {var_name} and the 'buy_sell' is {corr_var: 0.4f}")
  print(f'Median value of sells = {median_sell: 0.4f}') 
  print(f'Median value of buys = {median_buy: 0.4f}')  

In [None]:
def plot_kde_target_comparsion(var_name:str, clip:Optional[List[float]]=None, years:Tuple[int, ...]=(2006, 2015, 2016))->None:
    """
    Plot several kde plots side by side for the feature.

    Args:
        var_name (str): name of the feature
        clip (Optional[List[float]], optional): clipping range. Defaults to None.
        years (Tuple[int, ...], optional): years to compare. Defaults to (2006, 2015, 2016).
    """
    f, ax = plt.subplots(nrows=1, ncols=len(years), figsize=(18, 4))

    f.suptitle(f"Distribution of `{var_name}`")

    for y, year in enumerate(years):
      for i in [-1,1]:
          sns.kdeplot(data=data[(data["buy_sell"]==i) & (data["year"] == year)], x=var_name, clip=clip, label="_" * y + str(i), cumulative=False, common_grid=True, ax=ax[y])
          ax[y].xaxis.label.set_text(str(year))

    f.legend()

In [None]:
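# use the raw timestamps, as the `date` column is only created in the time features below
us_rec = web.DataReader("USREC", "fred", data["QUOTE_DATETIME"].min(), data["QUOTE_DATETIME"].max())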


def plot_recessions()->None:
    """
    Add recession indicator to plot and entry to legend.
    """
    l = 0
    month = relativedelta(months=+1)
    for i, (date, val) in enumerate(us_rec["USREC"].items()):
        if val == 1:
            # if boolean = 1 -> print bar until next month
            # _ labels are ignored in legend https://stackoverflow.com/a/44633022/5755604
            plt.axvspan(date, date + month, edgecolor="none", alpha=0.25, label =  "_"*l + "recession")
            l += 1

In [None]:
def plot_time_series(feature: Union[str, List[str]], aggregation:Union[str, List[str]]="count")->pd.DataFrame:
    """
    Plot feature over time. Aggregate using 'aggregation'.

    Args:
        feature (Union[str, List[str]]): feature(s) to plot.
        aggregation (Union[str, List[str]], optional): aggregation operation(s). Defaults to "count".
    
    Returns:
        pd.DataFrame: time series with one column per (feature, aggregation) pair
    """

    if isinstance(feature, str):
      feature = [feature]
    if isinstance(aggregation, str):
      aggregation = [aggregation]

    time_series = data[feature].groupby(data["date"]).agg(aggregation)
    time_series.columns = time_series.columns.to_flat_index()

    ax = sns.lineplot(data=time_series)
    ax.yaxis.label.set_text(' / '.join(aggregation))
    ax.title.set_text(f"'{' / '.join(feature)}' over time")
    plot_recessions()
    ax.legend()
    plt.show()
    
    return time_series

### Time features ⏰

In [None]:
# apply positional (cyclical) encoding to the month
data["date_month_sin"] = np.sin(2 * np.pi * data["QUOTE_DATETIME"].dt.month / 12)
data["date_month_cos"] = np.cos(2 * np.pi * data["QUOTE_DATETIME"].dt.month / 12)

# time (daily)
seconds_in_day = 24*60*60
seconds = (data["QUOTE_DATETIME"] - data["QUOTE_DATETIME"].dt.normalize()).dt.total_seconds()

data["date_time_sin"] = np.sin(2*  np.pi* seconds / seconds_in_day)
data["date_time_cos"] = np.cos(2 * np.pi* seconds / seconds_in_day)

# year min-max scaled
data["date_year_min"] = (data["QUOTE_DATETIME"].dt.year - 2005) / (2017 - 2005)

# time to maturity
data["ttm"] = (
    data["EXPIRATION"].dt.to_period("M")
    - data["QUOTE_DATETIME"].dt.to_period("M")
).apply(lambda x: x.n)

# day, month and year
data["day"] = data["QUOTE_DATETIME"].dt.day
data["month"] = data["QUOTE_DATETIME"].dt.month
data["year"] = data["QUOTE_DATETIME"].dt.year
data["date"] = data["QUOTE_DATETIME"].dt.date
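
The sine / cosine encoding used for the month and the time of day places a periodic value on the unit circle, so that the end of a period sits next to its start. Below is a minimal sketch in plain Python; the helpers `encode_cyclical` and `distance` are hypothetical names introduced only for this illustration.

```python
import math


def encode_cyclical(value: float, period: float) -> tuple:
    """Map a periodic value onto the unit circle as (sin, cos)."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def distance(a: tuple, b: tuple) -> float:
    """Euclidean distance between two encoded points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


january, june, december = (encode_cyclical(month, 12) for month in (1, 6, 12))

# December and January are neighbours on the circle, June is far from both
print(distance(december, january))
print(distance(june, january))
```

With a plain integer encoding, December (12) and January (1) would be eleven units apart, although they are adjacent months.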

### Binned features 🥫

Bin features similarly to how they are used in the robustness tests.

In [None]:
bins_tradesize = [0, 1, 3, 5, 11, np.inf]
trade_size_labels = ["(0,1]", "(1,3]", "(3,5]", "(5,11]", ">11"]
data["TRADE_SIZE_binned"] = pd.cut(
    data["TRADE_SIZE"], bins_tradesize, labels=trade_size_labels
)

bins_years = [2004, 2007, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017]
year_labels = [
    "2005-2007",
    "2008-2010",
    "2011",
    "2012",
    "2013",
    "2014",
    "2015",
    "2016",
    "2017",
]
data["year_binned"] = pd.cut(data["year"], bins_years, labels=year_labels)

bins_ttm = [-1, 1, 2, 3, 6, 12, np.inf]
ttm_labels = [
    "ttm <= 1 month",
    "ttm (1-2] month",
    "ttm (2-3] month",
    "ttm (3-6] month",
    "ttm (6-12] month",
    "ttm > 12 month",
]
data["ttm_binned"] = pd.cut(data["ttm"], bins_ttm, labels=ttm_labels)
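
`pd.cut` builds right-closed intervals by default, so a value that equals a bin edge falls into the lower bin. The sketch below reuses the time-to-maturity bins from above on a handful of made-up values to show where the edges land.

```python
import numpy as np
import pandas as pd

bins_ttm = [-1, 1, 2, 3, 6, 12, np.inf]
ttm_labels = [
    "ttm <= 1 month",
    "ttm (1-2] month",
    "ttm (2-3] month",
    "ttm (3-6] month",
    "ttm (6-12] month",
    "ttm > 12 month",
]

# made-up maturities in months, including values that sit exactly on an edge
ttm = pd.Series([0, 1, 2, 5, 12, 13])
binned = pd.cut(ttm, bins_ttm, labels=ttm_labels)

for months, label in zip(ttm, binned):
    print(months, "->", label)
```

Values at or below the lowest edge (here `-1`) fall outside all intervals and become `NaN`, which is worth keeping in mind when counting missing values later.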

### Trade features 💴
Construct features that are used in classical rules.

In [None]:
# Trade size relative to the quoted size at the bid and ask
data["rel_bid_size_ex"] = data["TRADE_SIZE"] / data["bid_size_ex"]
data["rel_ask_size_ex"] = data["TRADE_SIZE"] / data["ask_size_ex"]

# spread in $ between ask and bid
data['spread_ex'] = data['ask_ex'] - data['bid_ex']

# Calculate change similar to reverse tick rule (subsequent trade price)
data["chg_lead_ex"] = data["TRADE_PRICE"] - data["price_ex_lead"]

# Calculate change similar to tick rule (previous trade price)
data["chg_lag_ex"] = data["TRADE_PRICE"] - data["price_ex_lag"]

# Midspread
mid = 0.5 * (data["ask_ex"] + data["bid_ex"])

# Signed distance of the trade price from the mid (not the absolute value, despite the column name)
data["abs_mid_ex"] = data["TRADE_PRICE"] - mid
data["mid_ex"] = mid
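
To show how such features feed into classical rules, here is a minimal sketch of the textbook quote rule (sign of the distance to the mid) with the tick rule (sign of the change against the previous price) as a fallback for trades at the mid. The toy prices and the names `quote_rule`, `tick_rule`, and `combined` are assumptions for illustration; the notebook's own rule implementations are not shown in this section.

```python
import numpy as np
import pandas as pd

# toy trades (hypothetical values, chosen to be exactly representable as floats)
toy = pd.DataFrame(
    {
        "TRADE_PRICE": [2.00, 1.00, 1.50, 1.75],
        "bid_ex": [1.00, 1.00, 1.00, 1.00],
        "ask_ex": [2.00, 2.00, 2.00, 2.00],
        "price_ex_lag": [1.50, 1.50, 1.00, 2.00],
    }
)

mid = 0.5 * (toy["ask_ex"] + toy["bid_ex"])

# quote rule: above the mid -> buy (1), below -> sell (-1), at the mid -> undecided (0)
quote_rule = np.sign(toy["TRADE_PRICE"] - mid).astype(int)

# tick rule: price rose against the previous price -> buy (1), fell -> sell (-1)
tick_rule = np.sign(toy["TRADE_PRICE"] - toy["price_ex_lag"]).astype(int)

# use the quote rule where it decides, otherwise fall back to the tick rule
combined = quote_rule.where(quote_rule != 0, tick_rule)
print(pd.DataFrame({"quote": quote_rule, "tick": tick_rule, "combined": combined}))
```

The third trade executes exactly at the mid, so only the tick rule can assign a direction to it.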

### Underlying features 🫀

In [None]:
data["symbol_is_index"] = data['UNDERLYING_SYMBOL'].str.startswith("^")

# special code from 5th character in symbol
data["special_code"] = data['UNDERLYING_SYMBOL'].str[4]

# Security type similar to Grauer et al. p. 35
data['security_type'] = np.where(data["symbol_is_index"],"index option", np.where(data["special_code"].notnull(),"other", "stock option"))
data['security_type'] = data['security_type'].astype("category")

### Categorical features 🎰

In [None]:
# label-encode categorical features

# select categorical e. g., option type and strings e. g., ticker
cat_columns = data.select_dtypes(include=["category", "object"]).columns.tolist()
print(cat_columns)

cat_columns_bin = ["bin_" + x for x in cat_columns]

# map each category to an integer code similar to Borisov et al.
data[cat_columns_bin] = data[cat_columns].apply(lambda x: pd.factorize(x)[0])
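
`pd.factorize` assigns integer codes in order of first appearance and, by default, marks missing values with `-1`. A small sketch on a made-up option type column:

```python
import pandas as pd

# made-up option types with one missing entry
option_type = pd.Series(["C", "P", "C", None, "P"])

codes, uniques = pd.factorize(option_type)
print(codes)    # integer code per row, -1 marks the missing value
print(uniques)  # categories in order of first appearance
```

As the codes depend on the order of appearance, the same mapping has to be reused for validation and test data rather than being re-fitted.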

## General overview 🌄

### Trade price and sizes 🤝

#### Trades over time ⌚

In [None]:
trades_per_day = plot_time_series("TRADE_PRICE", "count")

In [None]:
# `plot_time_series` returns a frame; select its only column first
trades_per_day.iloc[:, 0].nlargest(N)

In [None]:
# `plot_time_series` returns a frame; select its only column first
trades_per_day.iloc[:, 0].nsmallest(N)

**Observation:**
* Number of trades increases over time.
* There is no obvious explanation why the number of trades spikes on certain days.

#### Trade size

In [None]:
# Think about outliers
ax = sns.histplot(data, x="TRADE_SIZE", bins=50)
ax.title.set_text("Histogram of trade size")

**Observation:**
* Trade size is highly skewed with few outliers.
* Similar to the price, a $\log(\cdot)$ transform helps a little bit.
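
How much a log transform reduces skewness can be checked numerically. The sketch below uses synthetic, heavy-tailed trade sizes (an assumption, not the ISE sample) and compares the sample skewness before and after `np.log1p`.

```python
import numpy as np
import pandas as pd

# synthetic, heavy-tailed trade sizes of at least one contract (assumption)
rng = np.random.default_rng(42)
trade_size = pd.Series(np.ceil(rng.lognormal(mean=1.0, sigma=1.5, size=10_000)))

skew_raw = trade_size.skew()
skew_log = np.log1p(trade_size).skew()

print(f"skewness raw: {skew_raw: 0.2f}")
print(f"skewness log: {skew_log: 0.2f}")
```

`np.log1p(x)` equals `np.log(x + 1)`, i.e., it corresponds to a constant of 1.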

In [None]:
trades_over_time = plot_time_series("TRADE_SIZE", ["mean", "median"])

In [None]:
trade_ask_bid_size = plot_time_series(["TRADE_SIZE", "ask_size_ex", "bid_size_ex"], "mean")


**Observation:**
* There is a slow downward trend in `TRADE_SIZE`.
* Conversely, the average number of trades per day increases over time.
* The market share of the ISE has decreased over time, as reported in https://www.sifma.org/wp-content/uploads/2022/03/SIFMA-Insights-Market-Structure-Compendium-March-2022.pdf.

In [None]:
data["TRADE_SIZE"].describe()

In [None]:
data[data["TRADE_SIZE"].max()==data["TRADE_SIZE"]]

In [None]:
data.nlargest(N, "TRADE_SIZE", keep='first').T

In [None]:
# const not really needed here, due to the trade size being >=1
const = 1
data['log_trade_size'] = np.log(data["TRADE_SIZE"]+const)
ax = sns.histplot(data, x="log_trade_size", bins=50)
ax.title.set_text(f"Histogram of trade size (log) with const = {const}")

In [None]:
fig, ax = plt.subplots()

for i in [-1,1]:
  sns.kdeplot(data=data[data["buy_sell"]==i], x="log_trade_size", clip=[0, 6], label=str(i), cumulative=False)
# clip=[-0.5, 0.5]
ax.title.set_text("Distribution of buys and sells")

#### Trade price

In [None]:
# Think about outliers
ax = sns.histplot(data, x="TRADE_PRICE", bins=50)
ax.title.set_text("Histogram of trade price")

In [None]:
ax = sns.boxplot(data = data, x="buy_sell", y = "TRADE_PRICE")
ax.title.set_text("Box plot of 'TRADE_PRICE' for buys (1) and sells (-1)")

**Observations:**
* Very few very large trade prices, many very small trade prices.
* Scaling can be problematic if outliers strongly affect it. Try a $\log(\cdot)$ transform, which could improve results.
* Trade price is hardly informative, as the distributions for buys and sells are very similar.

In [None]:
const = 1e-2
data['log_trade_price'] = np.log(data["TRADE_PRICE"] + const)
#data['log_trade_price'].replace([np.inf, -np.inf], np.nan, inplace=True)

In [None]:
fig, ax = plt.subplots()

sns.histplot(data, x="log_trade_price", bins=50, stat='density', label="log price")

# extract the limits for the x-axis and fit a normal distribution
x0, x1 = ax.get_xlim()
x_pdf = np.linspace(x0, x1, 100)
mu, sigma = stats.norm.fit(data["log_trade_price"].dropna())
y_pdf = stats.norm.pdf(x_pdf, loc=mu, scale=sigma)

pdf = pd.DataFrame({'x':x_pdf,'y':y_pdf})
sns.lineplot(data = pdf,x='x', y='y',label="pdf", color="r")


ax.title.set_text("Distribution of log prices")                                                   
ax.legend()

In [None]:
ax = sns.boxplot(data=data, x="buy_sell", y="log_trade_price")
ax.title.set_text("Box plot of log prices for buys (1) and sells (-1)")

In [None]:
data.nlargest(N, "TRADE_PRICE", keep='first').T

In [None]:
trade_price_over_time = plot_time_series("TRADE_PRICE",['mean','median'])

In [None]:
trade_price_over_time = plot_time_series(["TRADE_PRICE", "price_ex_lead", "price_ex_lag"],'mean')

In [None]:
trade_price_over_time = plot_time_series(["TRADE_PRICE", "price_ex_lead", "price_ex_lag"],'median')

### Time to Maturity ⌚

In [None]:
ttm_over_time = plot_time_series("ttm", "mean")

In [None]:
sample = data.sample(n=1000, random_state=seed)

plot = sns.displot(data = sample, 
                x = "ttm", 
                y = "TRADE_PRICE", kind="kde", hue="OPTION_TYPE")
plot.figure.subplots_adjust(top=0.9)
plot.figure.suptitle("Trade Price vs. Time to Maturity");

In [None]:
ax = sns.scatterplot(data = sample, 
                x = "ttm", 
                y = "bid_ex",
                hue= "OPTION_TYPE")
ax.title.set_text("Scatter plot of time to maturity (months) and bid (ex)")

In [None]:
ax = sns.histplot(data = data[data["bid_ex"]==0.0], 
                  x = "ttm", bins=50)
ax.title.set_text("Count of transactions with regard to time to maturity (months)")

In [None]:
# TODO: ask of zero plausible?
sns.histplot(data = data[data["ask_ex"]==0.0], 
                x = "ttm", bins=50)

### Buy Sell 👛

In [None]:
ratio_buy_sell = data["buy_sell"].value_counts() / data["buy_sell"].count()
ratio_buy_sell

**Observation:**
* Ratios are similar to the ones reported in Grauer et al., yet not identical, as the calculation is done on a sample.
* As both classes have a probability of $\approx 0.5$, I would not rebalance. Rebalancing through sampling etc. introduces a bias of its own.
* Ratios seem to be stable over time (see below). Thus, the distribution should be similar for the training, validation, and test sets.
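
The stability of the class ratios can also be checked numerically per year. A minimal sketch on a made-up frame, using `value_counts(normalize=True)` overall and within groups:

```python
import pandas as pd

# made-up labels: 1 = buy, -1 = sell
toy = pd.DataFrame(
    {
        "year": [2015, 2015, 2015, 2015, 2016, 2016],
        "buy_sell": [1, -1, -1, -1, 1, -1],
    }
)

# share of buys and sells over all trades
overall = toy["buy_sell"].value_counts(normalize=True)

# share of buys and sells within each year, one row per year
per_year = (
    toy.groupby("year")["buy_sell"].value_counts(normalize=True).unstack(fill_value=0)
)

print(overall)
print(per_year)
```

Rows of `per_year` that deviate strongly from `overall` would hint at a distribution shift between training, validation, and test periods.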

#### By option type

In [None]:
ax = sns.countplot(data=data,x="OPTION_TYPE", hue="buy_sell")
ax.title.set_text("Distribution of Buy / Sell indicator with regard to option type")
sns.move_legend(ax, "lower center", bbox_to_anchor=(0.5, -0.3))

#### By year

In [None]:
ax = sns.countplot(data=data,x="year_binned", hue="buy_sell")
ax.title.set_text("Distribution of Buy / Sell indicator with regard to year (binned)")
# sns.move_legend(ax, "lower center", bbox_to_anchor=(0.5, -0.3))
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="center")
plt.tight_layout()
plt.show()

#### By time to maturity

In [None]:
ax = sns.countplot(data=data,x="ttm_binned", hue="buy_sell")
ax.title.set_text("Distribution of Buy / Sell indicator with regard to time to maturity (binned)")
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="center")
plt.tight_layout()
plt.show()

#### Over time

In [None]:
trades_over_time = data.groupby(data['date'])["buy_sell"].value_counts().unstack(fill_value=0)
ax = trades_over_time.plot(kind="line", figsize=ratio, title="buy / sell count over time", xlabel="date", ylabel="sell (-1) / buy (1)")
plot_recessions()
ax.legend()
plt.show()

### $n$ most frequent symbols, indices, and special codes 🔢

In [None]:
alphanumeric_symbols = data[~data['UNDERLYING_SYMBOL'].str.isalpha()].drop_duplicates()

In [None]:
overlong_symbols = data[data['UNDERLYING_SYMBOL'].str.len()>=5].drop_duplicates()

In [None]:
# name the index explicitly, as the default column name after reset_index differs between pandas versions
most_frequent_symbols = data["UNDERLYING_SYMBOL"].value_counts().head(N).rename_axis("Symbol").reset_index(name="Count")

ax = sns.barplot(data=most_frequent_symbols, x="Symbol", y="Count")
ax.title.set_text(f"{N} most frequently traded symbols")
most_frequent_symbols

In [None]:
list_freq_symbols = most_frequent_symbols.Symbol.tolist()

In [None]:
frequent_symbols_over_time = data[data["UNDERLYING_SYMBOL"].isin(list_freq_symbols)]

In [None]:
# count trades per month and symbol
frequent_symbols_trades_per_day = (
    frequent_symbols_over_time.groupby(
        [frequent_symbols_over_time.QUOTE_DATETIME.dt.to_period("M"), "UNDERLYING_SYMBOL"]
    )["TRADE_SIZE"]
    .count()
    .reset_index()
    .rename(columns={"TRADE_SIZE": "count", "QUOTE_DATETIME": "date", "UNDERLYING_SYMBOL": "Symbol"})
)


In [None]:
frequent_symbols_over_time = frequent_symbols_trades_per_day.groupby(["date", "Symbol"])['count'].first().unstack()

In [None]:
frequent_symbols_over_time.plot(kind="line", title=f"{N} most frequently traded underlyings over time")

In [None]:
# name the index explicitly, as the default column name after reset_index differs between pandas versions
most_frequent_indices = data[data["symbol_is_index"]]["UNDERLYING_SYMBOL"].value_counts().head(N).rename_axis("Symbol").reset_index(name="Count")

ax = sns.barplot(data=most_frequent_indices, x="Symbol", y="Count")
ax.title.set_text(f"{N} most frequently traded indices (symbols with ^)")

In [None]:
ax = sns.countplot(data=data,x="symbol_is_index", hue="buy_sell")
ax.title.set_text("Distribution of Buy / Sell indicator with regard to whether underlying is an index")
sns.move_legend(ax, "lower center", bbox_to_anchor=(0.5, -0.3))

In [None]:
ratios_is_index = data.groupby(['symbol_is_index', "buy_sell"])['buy_sell'].count() / data.groupby(["symbol_is_index"])["buy_sell"].count()
ratios_is_index

**Observation:**
- The feature may be important, as a trade is more likely to be a sell than a buy if the underlying is not an index.
- However, the difference isn't too pronounced and could be due to sampling effects.

In [None]:
data["special_code"].value_counts(dropna=False)

**Observation:**
* `L`: Misc. (compare Google Shares)
* `B`: Class "B"
* `A`: Class "A"
* `K`: "Non-voting"
* `X`: "Mutual funds"
* `Y`: "ADRs"

Grauer et al. also include ETFs in `others`. Not sure how they are identified.

In [None]:
ax = sns.countplot(data=data,x="security_type")
ax.title.set_text("No. of transactions by security type")
ax.xaxis.label.set_text("security type")

In [None]:
ax = sns.countplot(data=data, x = "special_code")
ax.title.set_text("No. of special codes in trades")

In [None]:
ax = sns.countplot(data=data, x = "ttm_binned")
ax.title.set_text("No. of trades by time to maturity (binned)")
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="center")

In [None]:
trades_over_time = data[["TRADE_SIZE", "ask_size_ex", "bid_size_ex"]].groupby(data['date']).agg(['mean'])
trades_over_time.plot(kind="line", figsize=ratio, title="Trade size over time (mean)", xlabel="Timestamp", ylabel="contracts")

### Ask, bid, and mid

In [None]:
bid_ask_over_time = plot_time_series(["bid_ex", "ask_ex", "BEST_ASK", "BEST_BID"],'mean')

In [None]:
ax = sns.histplot(data, x="bid_ex", bins=50)
ax.title.set_text("Histogram of bid (exchange)")

In [None]:
const = 1
data['log_bid_ex'] = np.log(data["bid_ex"]+const)
ax = sns.histplot(data, x="log_bid_ex", bins=50)
ax.title.set_text(f"Histogram of bid exchange (log) with const = {const}")

In [None]:
fig, ax = plt.subplots()

for i in [-1,1]:
  sns.kdeplot(data=data[data["buy_sell"]==i], x="log_bid_ex", clip=[0, 5], label=str(i), cumulative=False)
ax.title.set_text(f"Distribution of buys and sells after (log) with const = {const}")

In [None]:
const = 1e-2
data['log_bid_ex'] = np.log(data["bid_ex"]+const)
ax = sns.histplot(data, x="log_bid_ex", bins=50)
ax.title.set_text("Histogram of bid exchange (log)")

In [None]:
fig, ax = plt.subplots()

for i in [-1,1]:
  sns.kdeplot(data=data[data["buy_sell"]==i], x="log_bid_ex", clip=[-5, 6], label=str(i), cumulative=False)

ax.title.set_text(f"Distribution of buys and sells after (log) with const = {const}")

**Observation:**
- Different constants are possible, but small constants, e. g., `const=1e-2`, give fuzzy results. A larger constant like `const=1` is the better choice.
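
One possible explanation, stated as an assumption rather than a finding: quotes of zero are mapped to $\log(\text{const})$, which for `const=1e-2` lies far to the left of all other values and forms a separate spike. A minimal sketch with made-up bid values:

```python
import numpy as np

# hypothetical bids including a zero quote; values are illustrative only
bid = np.array([0.0, 0.05, 0.5, 1.0, 5.0, 50.0])

gaps = {}
for const in (1.0, 1e-2):
    log_bid = np.log(bid + const)
    # gap between the zero quote and the smallest positive quote after the transform
    gaps[const] = log_bid[1] - log_bid[0]
    print(f"const={const}: zero maps to {log_bid[0]:.2f}, gap to next value {gaps[const]:.2f}")

# for const=1, np.log1p is the numerically stable equivalent
assert np.allclose(np.log1p(bid), np.log(bid + 1.0))
```

The smaller the constant, the further the zero quotes are pushed away from the remaining observations.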

**Observation:**
- Taking the log of the size seems to worsen results.
- `TODO:` investigate the reason further, e. g., remaining skewness or outliers.

# NaNs 🪲

In [None]:
def visualize_nan():
    """
    Visualize NaN values in a heatmap to learn about patterns.
    """
    plt.subplots()
    sns.heatmap(data.head(50).isnull(), cbar=False)
    plt.xlabel("feature")
    plt.ylabel("row")
    plt.title("Missing values (colored in bright beige)")
    plt.show()

In [None]:
visualize_nan()

In [None]:
isna_vals = data.isna().sum().sort_values(ascending=False)
isna_vals = isna_vals.loc[lambda x: x > 0]

ax = isna_vals.plot(kind="bar", figsize=ratio, legend=False, 
                    xlabel="feature", 
                    ylabel="No. of missing values", 
                    title="Missing values")

In [None]:
isna_vals_over_time = data[isna_vals.index.tolist()].groupby(data['QUOTE_DATETIME'].dt.date).agg(lambda x: x.isnull().sum())
isna_vals_over_time.plot(kind="line", figsize=ratio, title="Missing values over time", xlabel="Timestamp", ylabel="No. of missing values")

In [None]:
# adapted from: https://github.com/ResidentMario/missingno/blob/master/missingno/missingno.py

isna_data = data.iloc[:, [i for i, n in enumerate(np.var(data.isnull(), axis='rows')) if n > 0]]

corr_mat = isna_data.isnull().corr()
mask = np.zeros_like(corr_mat)
mask[np.triu_indices_from(mask)] = True

fig, ax = plt.subplots(figsize=(9,9)) 
ax = sns.heatmap(corr_mat, mask=mask, annot=False, annot_kws={'size':10}, ax=ax)
ax.title.set_text("Correlation between missing features")

In [None]:
# TODO: Check if there is a pattern between the missing values

# Correlations of engineered features 🎲

## ($\log$) mid spread ✔️

In [None]:
ax = sns.histplot(data, x="mid_ex", bins=50)
ax.title.set_text("Histogram of mid spread")

In [None]:
data["mid_ex"].describe()

In [None]:
mean_spread = data["spread_ex"].groupby(data['date']).agg(['mean'])

In [None]:
ax = sns.lineplot(mean_spread)
ax.title.set_text("Abs. spread (mean) over time")
ax.yaxis.label.set_text("spread")
plot_recessions()
plt.show()

In [None]:
fig, ax = plt.subplots()

for i in [-1,1]:
  sns.kdeplot(data=data[data["buy_sell"]==i], x="spread_ex", clip=[-1,1], label=str(i), cumulative=False)
ax.title.set_text(f"Distribution of buys and sells")

In [None]:
# log-transform the mid; the additive constant avoids log(0)
const = 1
data["log_mid_ex"] = np.log(data["mid_ex"]+const)

fig, ax = plt.subplots()
for i in [-1,1]:
  sns.kdeplot(data=data[data["buy_sell"]==i], x="log_mid_ex", label=str(i), cumulative=False, ax=ax)
ax.title.set_text(f"Distribution of buys and sells after log with const = {const}")
ax.legend()

## bid-ask-ratio

In [None]:
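# note: computed as ask over bid; a bid of zero yields inf, which is set to NaN
data["bid_ask_ratio_ex"] = (data["ask_ex"] / data["bid_ex"]).replace([np.inf, -np.inf], np.nan)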

In [None]:
ax = sns.histplot(data, x="bid_ask_ratio_ex", bins=50)
ax.title.set_text("Histogram of bid-ask ratio")

In [None]:
fig, ax = plt.subplots()

for i in [-1,1]:
  sns.kdeplot(data=data[data["buy_sell"]==i], x="bid_ask_ratio_ex", clip=[0.6,3], label=str(i), cumulative=False)
ax.title.set_text(f"Distribution of buys and sells")

In [None]:
data["log_bid_ask_ratio_ex"] = np.log(data["ask_ex"]+1e-2) - np.log(data["bid_ex"]+1e-2)

In [None]:
fig, ax = plt.subplots()

for i in [-1,1]:
  sns.kdeplot(data=data[data["buy_sell"]==i], x="log_bid_ask_ratio_ex",clip=[-0.5, 1], label=str(i), cumulative=False)
ax.title.set_text(f"Distribution of buys and sells")

**Observation:**
- The picture is the same as for the untransformed ratio. This is expected, as $\log(a/b) = \log a - \log b$ and $\log$ is a monotonic function.
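
The two properties used in this observation can be checked numerically. A minimal sketch with made-up quotes; the constant of `1e-2` is left out here, so the identity holds exactly:

```python
import numpy as np

# hypothetical quotes; values are illustrative only
ask = np.array([0.55, 1.15, 2.40, 10.5])
bid = np.array([0.50, 1.00, 2.00, 10.0])

ratio = ask / bid
log_ratio = np.log(ask) - np.log(bid)

# log(a / b) == log(a) - log(b)
assert np.allclose(np.log(ratio), log_ratio)

# a monotonic transform keeps the ordering of the observations
assert (np.argsort(ratio) == np.argsort(log_ratio)).all()
```

With the added constant the identity holds only approximately, which mostly affects quotes close to zero.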

## distance of trade price from mid ✔️

In [None]:
fig, ax = plt.subplots()

for i in [-1,1]:
  sns.kdeplot(data=data[data["buy_sell"]==i], x="abs_mid_ex", clip=[-0.5,0.5], label=str(i), cumulative=False)
ax.title.set_text(f"Distribution of buys and sells")

In [None]:
# log of the ratio of trade price to mid, shifted by 1
data["log_abs_mid_ex"] = np.log((data["TRADE_PRICE"] / data["mid_ex"])+1)

fig, ax = plt.subplots()
for i in [-1,1]:
  sns.kdeplot(data=data[data["buy_sell"]==i], x="log_abs_mid_ex", clip=[0.5,1], label=str(i), cumulative=False, ax=ax)
ax.legend()

In [None]:
data[["STRK_PRC"]].hist(figsize=ratio, bins=50)

In [None]:
fig, ax = plt.subplots()

data['log_strike_price'] = np.log(data["STRK_PRC"])

ax = sns.histplot(data, x= "log_strike_price", bins=50, stat='density', label="log strike price")
ax.title.set_text("Histogram of strike price (log)")

# extract the limits for the x-axis and fit a normal distribution to the log strike price
x0, x1 = ax.get_xlim()  
x_pdf = np.linspace(x0, x1, 100)
mu, sigma = stats.norm.fit(data["log_strike_price"].dropna())
y_pdf = stats.norm.pdf(x_pdf, loc=mu, scale=sigma)

pdf = pd.DataFrame({'x':x_pdf,'y':y_pdf})
sns.lineplot(data = pdf,x='x', y='y',label="pdf", color="r", ax=ax)

In [None]:
fig, ax = plt.subplots()

for i in [-1,1]:
  sns.kdeplot(data=data[data["buy_sell"]==i], x="spread_ex",clip=[-0.1, 0.5], label=str(i), cumulative=False)

# ax.title.set_text("Difference between $P_{t}$ and $P_{t}^{mid}$ with $x \in(-0.5, 0.5)$ (quote rule)")
# ax.xaxis.label.set_text("$x = P_{t} - P_{t}^{mid}$")
ax.legend()
plt.show()

### Quote rule

In [None]:
fig, ax = plt.subplots()

for i in [-1,1]:
  sns.kdeplot(data=data[data["buy_sell"]==i], x="abs_mid_ex",clip=[-0.5, 0.5], label=str(i), cumulative=False)

ax.title.set_text(r"Difference between $P_{t}$ and $P_{t}^{mid}$ with $x \in(-0.5, 0.5)$ (quote rule)")
ax.xaxis.label.set_text("$x = P_{t} - P_{t}^{mid}$")
ax.legend()
#sns.move_legend(ax, "lower center", bbox_to_anchor=(0.5, -0.3))
plt.show()
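
For reference, the quote rule classifies a trade as a buy if the trade price is above the midpoint and as a sell if it is below; trades at the midpoint remain unclassified. A minimal sketch, where the helper `quote_rule` and the toy values are assumptions for illustration and not code from this notebook:

```python
import numpy as np
import pandas as pd


def quote_rule(trade_price: pd.Series, mid: pd.Series) -> pd.Series:
    """Classify trades: 1 (buy) above mid, -1 (sell) below mid, NaN at mid."""
    diff = trade_price - mid
    # the sign of the distance to the mid gives the direction; 0 stays unclassified
    return np.sign(diff).replace(0.0, np.nan)


# hypothetical toy trades; values are illustrative only
toy = pd.DataFrame({"TRADE_PRICE": [1.05, 0.95, 1.00], "mid_ex": [1.00, 1.00, 1.00]})
toy["quote_rule"] = quote_rule(toy["TRADE_PRICE"], toy["mid_ex"])
print(toy)
```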

In [None]:
years = [2005, 2006, 2014, 2016]
f, ax = plt.subplots(nrows=1, ncols=len(years), figsize=(18, 4))

f.suptitle("Distribution of buys and sells with regard to `abs_mid_ex`")

for y, year in enumerate(years):
  for i in [-1,1]:
    sns.kdeplot(data=data[(data["buy_sell"]==i) & (data["year"] == year)], x="abs_mid_ex", clip=[-1, 1], label="_" * y + str(i), cumulative=False, common_grid=True, ax=ax[y])
    ax[y].xaxis.label.set_text(str(year))

f.legend()

**Observations:**
- `TODO:` Analyze effects that lead to a fuzzy distribution at the beginning of the sample, but largely overlapping distributions at the end.

### day vol

In [None]:
fig, ax = plt.subplots()

for i in [-1,1]:
  sns.kdeplot(data=data[data["buy_sell"]==i], x="day_vol", clip=[0, 500], label=str(i), cumulative=False)
# clip=[-0.5, 0.5]
# ax.title.set_text("Difference between $P_{t}$ and $P_{t}^{mid}$ with $x \in(-0.5, 0.5)$ (quote rule)")
# ax.xaxis.label.set_text("$x = P_{t} - P_{t}^{mid}$")
ax.legend()
#sns.move_legend(ax, "lower center", bbox_to_anchor=(0.5, -0.3))
plt.show()

In [None]:
years = [2005, 2006, 2014, 2016]
f, ax = plt.subplots(nrows=1, ncols=len(years), figsize=(18, 4))

f.suptitle("Distribution of buys and sells with regard to `day_vol`")

for y, year in enumerate(years):
  for i in [-1,1]:
    sns.kdeplot(data=data[(data["buy_sell"]==i) & (data["year"] == year)], x="day_vol", clip=[0, 200], label="_" * y + str(i), cumulative=False, common_grid=True, ax=ax[y])
    ax[y].xaxis.label.set_text(str(year))

f.legend()

### year

In [None]:
fig, ax = plt.subplots()

for i in [-1,1]:
  sns.kdeplot(data=data[data["buy_sell"]==i], x="year", label=str(i), cumulative=False)

### tick test

In [None]:
fig, ax = plt.subplots()

for i in [-1,1]:
  sns.kdeplot(data=data[data["buy_sell"]==i], x="chg_lead_ex", clip=[-5, 5], label=str(i), cumulative=False, common_grid=True)

ax.title.set_text("Density plot of change from $P_{t-1}$ to $P_{t}$ cropped at -5 and 5 (tick test)")
ax.xaxis.label.set_text("$x = P_{t} - P_{t-1}$")
ax.legend()
sns.move_legend(ax, "lower center", bbox_to_anchor=(0.5, -0.3))
plt.show()
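
For reference, the tick test classifies a trade as a buy if its price is above the previous distinct price (uptick) and as a sell if it is below (downtick). A minimal sketch; the helper `tick_test` and the toy prices are assumptions for illustration, and the notebook's `chg_lead_ex` column is not recomputed here:

```python
import numpy as np
import pandas as pd


def tick_test(price: pd.Series) -> pd.Series:
    """1 (buy) on an uptick, -1 (sell) on a downtick; zero ticks inherit the last direction."""
    direction = np.sign(price.diff())
    # zero ticks carry forward the direction of the last non-zero price change
    return direction.replace(0.0, np.nan).ffill()


# hypothetical price sequence of one option series; values are illustrative only
prices = pd.Series([1.00, 1.10, 1.10, 1.05, 1.05, 1.20])
result = tick_test(prices)
print(result.tolist())
```

The first trade has no predecessor and therefore stays unclassified.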

In [None]:

years = [2005, 2014, 2016, 2017]
f, ax = plt.subplots(nrows=1, ncols=len(years), figsize=(18, 4))

f.suptitle("Distribution of buys and sells with regard to `chg_lead_ex`")

for y, year in enumerate(years):
  for i in [-1,1]:
    sns.kdeplot(data=data[(data["buy_sell"]==i) & (data["year"] == year)], x="chg_lead_ex", clip=[-2, 2], label="_" * y + str(i), cumulative=False, common_grid=True, ax=ax[y])
    ax[y].xaxis.label.set_text(str(year))

f.legend()

### trade size

In [None]:
fig, ax = plt.subplots()

for i in [-1,1]:
  sns.kdeplot(data=data[data["buy_sell"]==i], x="TRADE_SIZE", clip=[0, 40], label=str(i), cumulative=False)
# clip=[-0.5, 0.5]
ax.title.set_text("Distribution of buys and sells")
# ax.xaxis.label.set_text("$x = P_{t} - P_{t}^{mid}$")
ax.legend()
#sns.move_legend(ax, "lower center", bbox_to_anchor=(0.5, -0.3))
plt.show()

In [None]:
years = [2006, 2015, 2016]
f, ax = plt.subplots(nrows=1, ncols=len(years), figsize=(18, 4))

f.suptitle("Distribution of buys and sells with regard to `TRADE_SIZE`")

for y, year in enumerate(years):
  for i in [-1,1]:
    sns.kdeplot(data=data[(data["buy_sell"]==i) & (data["year"] == year)], x="TRADE_SIZE", clip=[0, 40], label="_" * y + str(i), cumulative=False, common_grid=True, ax=ax[y])
    ax[y].xaxis.label.set_text(str(year))

f.legend()

### Reverse tick test

In [None]:
fig, ax = plt.subplots()

for i in [-1,1]:
  sns.kdeplot(data=data[data["buy_sell"]==i], x="chg_lag_ex", clip=[-5, 5], label=str(i), cumulative=False, common_grid=True)

ax.title.set_text("Density plot of change from $P_{t}$ to $P_{t+1}$ cropped at -5 and 5 (reverse tick test)")
ax.xaxis.label.set_text("$x = P_{t} - P_{t+1}$")
ax.legend()
#sns.move_legend(ax, "lower center", bbox_to_anchor=(0.5, -0.3))
plt.show()

# Bid, Ask, and Spread 🛍️

## ($\log$) bid ex ✔️

In [None]:
ax = sns.histplot(data, x="bid_ex", bins=50)
ax.title.set_text("Histogram of bid (exchange)")

In [None]:
fig, ax = plt.subplots()

for i in [-1,1]:
  sns.kdeplot(data=data[data["buy_sell"]==i], x="bid_ex", clip=[0, 10], label=str(i), cumulative=False)

ax.title.set_text("Distribution of buys and sells")

In [None]:
const = 1
data['log_bid_ex'] = np.log(data["bid_ex"]+const)
ax = sns.histplot(data, x="log_bid_ex", bins=50)
ax.title.set_text(f"Histogram of bid exchange (log) with const = {const}")

In [None]:
fig, ax = plt.subplots()

for i in [-1,1]:
  sns.kdeplot(data=data[data["buy_sell"]==i], x="log_bid_ex", clip=[0, 5], label=str(i), cumulative=False)
ax.title.set_text(f"Distribution of buys and sells after (log) with const = {const}")

In [None]:
const = 1e-2
data['log_bid_ex'] = np.log(data["bid_ex"]+const)
ax = sns.histplot(data, x="log_bid_ex", bins=50)
ax.title.set_text("Histogram of bid exchange (log)")

In [None]:
fig, ax = plt.subplots()

for i in [-1,1]:
  sns.kdeplot(data=data[data["buy_sell"]==i], x="log_bid_ex", clip=[-5, 6], label=str(i), cumulative=False)

ax.title.set_text(f"Distribution of buys and sells after (log) with const = {const}")

**Observation:**
- Different constants are possible, but small constants, e. g., `const=1e-2`, give fuzzy results. A larger constant like `const=1` is the better choice.

## ($\log$) ask ex ✔️

In [None]:
fig, ax = plt.subplots()

for i in [-1,1]:
  sns.kdeplot(data=data[data["buy_sell"]==i], x="ask_ex", clip=[0, 10], label=str(i), cumulative=False)

ax.title.set_text("Distribution of buys and sells")

In [None]:
const = 1
data['log_ask_ex'] = np.log(data["ask_ex"]+const)
ax = sns.histplot(data, x="log_ask_ex", bins=50)
ax.title.set_text(f"Histogram of ask exchange (log) with const = {const}")

In [None]:
fig, ax = plt.subplots()

for i in [-1,1]:
  sns.kdeplot(data=data[data["buy_sell"]==i], x="log_ask_ex", clip=[0, 5], label=str(i), cumulative=False)
ax.title.set_text(f"Distribution of buys and sells after (log) with const = {const}")

In [None]:
const = 1e-2
data['log_ask_ex'] = np.log(data["ask_ex"]+const)
ax = sns.histplot(data, x="log_ask_ex", bins=50)
ax.title.set_text(f"Histogram of ask exchange (log) with const = {const}")

In [None]:
fig, ax = plt.subplots()

for i in [-1,1]:
  sns.kdeplot(data=data[data["buy_sell"]==i], x="log_ask_ex", clip=[-5, 6], label=str(i), cumulative=False)
ax.title.set_text(f"Distribution of buys and sells after (log) with const = {const}")

### Date features ⏰

In [None]:
date_cols = data.columns[data.columns.str.startswith("date_")].tolist()

date_target_cols = [*date_cols, "buy_sell", "day","month" , "year"]

corr = data[date_target_cols].corr()
sns.heatmap(corr, xticklabels=corr.columns.values, yticklabels=corr.columns.values)

In [None]:
sort_criteria = corr["buy_sell"].abs().sort_values(ascending=False)
corr_target = corr.sort_values("buy_sell", ascending=False)["buy_sell"]
corr_target.loc[sort_criteria.index].to_frame()

**Observation:**
* Correlation with the date features is relatively high compared with other classical features.
* For `day`, the correlation is highest if the day is not mapped onto the unit circle. Taken together, however, $\sin$ and $\cos$ have a greater feature importance.
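
Mapping the day onto the unit circle means encoding it with a $\sin$ and a $\cos$ component, so that the last day of a month ends up close to the first. A minimal sketch, assuming a fixed period of 31 days; the column names `day_sin` and `day_cos` are hypothetical and need not match the notebook's `date_*` columns:

```python
import numpy as np
import pandas as pd

# hypothetical toy frame; `day` is the day of the month
toy = pd.DataFrame({"day": [1, 8, 16, 24, 31]})

period = 31  # assumption: fixed period, ignores shorter months
angle = 2 * np.pi * (toy["day"] - 1) / period
toy["day_sin"] = np.sin(angle)
toy["day_cos"] = np.cos(angle)

# every encoded day lies on the unit circle
assert np.allclose(toy["day_sin"] ** 2 + toy["day_cos"] ** 2, 1.0)
print(toy.round(3))
```

A single component cannot distinguish two days that share the same sine (or cosine) value, which is one reason to use the pair together.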

In [None]:
kde_target_plot("day")

In [None]:
kde_target_plot("date_year_min")

**Observation:**
* Judging from the plot, there seems to be a seasonal pattern, e. g., more buys at the beginning of the month and more sells towards the end of the month.
* Given the distributions, it could make sense to include date features in some feature sets.

**Observation:**
- Taking the log of the size seems to worsen results.
- `TODO:` investigate the reason further, e. g., remaining skewness or outliers.

In [None]:
fig, ax = plt.subplots()

for i in [-1,1]:
  sns.kdeplot(data=data[data["buy_sell"]==i], x="log_trade_size", clip=[0, 6], label=str(i), cumulative=False, ax=ax)
ax.title.set_text("Distribution of buys and sells")
ax.legend()
