Module imports

In [1]:
import numpy as np
import pandas as pd
from data.DataManager import DataManager
from data.DataGenerator import DataGenerator
from data.DataCollectors import HistoricalDataCollector, RealTimeDataCollector
from main.RandomnessAnalysis import RandomnessAnalysis
from main.PredictableDayAnalysis import PredictableDayAnalysis
from utils.MultiTester import MultiTester
from utils.VisualizationTools import plot_block_frequencies

To import the data, first declare the asset pairs and the symbols associated with the return intervals to be studied. The `symbols` variable must be a dictionary whose keys are the symbols and whose values are lists of two tuples. The first tuple is the return interval associated with the key, and the second specifies whether each bound of the interval is included (True) or not (False).

In [2]:
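asset_pairs = ["BTCUSDT", "ETHUSDT", "BNBUSDT", "XRPUSDT", "SOLUSDT",
               "ADAUSDT", "DOGEUSDT", "DOTUSDT", "COMPUSDT", "RENDERUSDT"]
# symbol: [(lower bound, upper bound), (lower bound included, upper bound included)]
symbols = {
    0: [(-np.inf, 0), (False, False)],  # negative returns
    1: [(0, np.inf), (False, False)],   # positive returns
}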
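
To make the format concrete, here is a minimal sketch of how a return could be mapped to a symbol under this layout. The helper `symbol_for_return` is hypothetical and not part of the library; it only illustrates the interval and bound-inclusion convention described above.

```python
import math

# Same layout as the `symbols` dictionary above, using math.inf for a standalone example.
symbols = {
    0: [(-math.inf, 0), (False, False)],
    1: [(0, math.inf), (False, False)],
}


def symbol_for_return(r, symbols):
    """Return the symbol whose interval contains r, or None if no interval matches.

    Hypothetical helper, for illustration only.
    """
    for symbol, ((low, high), (low_included, high_included)) in symbols.items():
        above_low = r >= low if low_included else r > low
        below_high = r <= high if high_included else r < high
        if above_low and below_high:
            return symbol
    return None


print(symbol_for_return(-0.01, symbols))  # negative return
print(symbol_for_return(0.02, symbols))   # positive return
print(symbol_for_return(0.0, symbols))    # zero is excluded by both open intervals
```

With both bounds excluded, a return of exactly 0 falls in neither interval, so under this sketch it gets no symbol.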

To start collecting data, there are two options:
- the class `RealTimeDataCollector()` collects it in real time for a given duration in hours
- the class `HistoricalDataCollector()` collects it for a given year and month (and, as in the example below, a given day)

To get a large amount of data quickly, the second option is preferable.

In [3]:
real_time_collection = RealTimeDataCollector(pairs=asset_pairs,
                                             duration_hours=1,
                                             update_interval=30)
# run() is a coroutine: without `await` the call only returns a coroutine object
# and nothing is collected. Notebook cells support top-level `await`.
await real_time_collection.run()
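
`RealTimeDataCollector.run` is a coroutine function, so calling it without `await` only creates a coroutine object and collects nothing. The sketch below shows the difference with a hypothetical stand-in class (`FakeCollector` is not part of the library, and its parameters are made up for the illustration).

```python
import asyncio
import inspect


class FakeCollector:
    """Hypothetical stand-in for a collector whose run() is a coroutine."""

    def __init__(self):
        self.ticks = 0

    async def run(self, n_updates=3, update_interval=0.01):
        # Simulate periodic polling: sleep, then record one update.
        for _ in range(n_updates):
            await asyncio.sleep(update_interval)
            self.ticks += 1
        return self.ticks


collector = FakeCollector()

# Calling run() without awaiting only builds a coroutine object.
pending = collector.run()
is_coroutine = inspect.iscoroutine(pending)
ticks_before = collector.ticks  # still 0: nothing has executed yet
pending.close()  # discard it cleanly to avoid a "never awaited" warning

# In a script, asyncio.run() drives the coroutine to completion.
# In a notebook cell, `await collector.run()` does the same.
ticks_after = asyncio.run(collector.run())
print(is_coroutine, ticks_before, ticks_after)
```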

For every requested asset pair, the collector downloads the zip file of historical trades for the given date, unzips it, saves the contents as a csv file and deletes the original zip file. If the csv file for a given pair and date is already present in the folder data/raw_data, the method `collect()` detects it and skips the download to save execution time.

In [3]:
historical_collector = HistoricalDataCollector(pairs=asset_pairs, year=2024, month=11, day=5)
historical_collector.collect()

[SYSTEM] Processing BTCUSDT...
[SYSTEM] Downloading BTCUSDT-trades-2024-11-05.zip from https://data.binance.vision/data/spot/daily/trades/BTCUSDT/BTCUSDT-trades-2024-11-05.zip ...
[SYSTEM] File downloaded: BTCUSDT-trades-2024-11-05.zip
[SYSTEM] Extracting .\BTCUSDT-trades-2024-11-05.zip ...
[SYSTEM] Extraction finished.
[SYSTEM] ZIP file deleted: BTCUSDT-trades-2024-11-05.zip
[SYSTEM] Processing ETHUSDT...
[SYSTEM] Downloading ETHUSDT-trades-2024-11-05.zip from https://data.binance.vision/data/spot/daily/trades/ETHUSDT/ETHUSDT-trades-2024-11-05.zip ...
[SYSTEM] File downloaded: ETHUSDT-trades-2024-11-05.zip
[SYSTEM] Extracting .\ETHUSDT-trades-2024-11-05.zip ...
[SYSTEM] Extraction finished.
[SYSTEM] ZIP file deleted: ETHUSDT-trades-2024-11-05.zip
[SYSTEM] Processing BNBUSDT...
[SYSTEM] Downloading BNBUSDT-trades-2024-11-05.zip from https://data.binance.vision/data/spot/daily/trades/BNBUSDT/BNBUSDT-trades-2024-11-05.zip ...
[SYSTEM] File downloaded: BNBUSDT-trades-2024-11-05.zip
[SYSTE
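
The download, extract, delete and skip-if-present steps described above can be sketched with the standard library alone. This is not the implementation of `HistoricalDataCollector`: `collect_one`, `fetch_bytes` and `fake_fetch` are hypothetical names, the URL pattern is copied from the log, and the archive is assumed to contain a CSV with the same stem as the zip. The demonstration uses an offline stand-in for the download so that it runs without network access.

```python
import io
import tempfile
import zipfile
from pathlib import Path
from urllib.request import urlopen

BASE_URL = "https://data.binance.vision/data/spot/daily/trades"  # taken from the log above


def fetch_bytes(url):
    """Download a URL and return its content as bytes."""
    with urlopen(url) as response:
        return response.read()


def collect_one(pair, year, month, day, raw_dir, fetch=fetch_bytes):
    """Download and extract one daily trades file unless the CSV already exists.

    Returns (csv_path, downloaded); downloaded is False when the download was skipped.
    Assumes the archive contains a file named <stem>.csv.
    """
    stem = f"{pair}-trades-{year}-{month:02d}-{day:02d}"
    raw_dir = Path(raw_dir)
    csv_path = raw_dir / f"{stem}.csv"
    if csv_path.exists():
        return csv_path, False  # already collected: skip the download

    raw_dir.mkdir(parents=True, exist_ok=True)
    zip_path = raw_dir / f"{stem}.zip"
    zip_path.write_bytes(fetch(f"{BASE_URL}/{pair}/{stem}.zip"))
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(raw_dir)
    zip_path.unlink()  # delete the original zip file
    return csv_path, True


def fake_fetch(url):
    """Offline stand-in for fetch_bytes: builds a tiny zip in memory."""
    name = url.rsplit("/", 1)[-1].replace(".zip", ".csv")
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(name, "1,100.0,0.5\n")  # made-up row
    return buffer.getvalue()


with tempfile.TemporaryDirectory() as tmp:
    first = collect_one("BTCUSDT", 2024, 11, 5, tmp, fetch=fake_fetch)
    second = collect_one("BTCUSDT", 2024, 11, 5, tmp, fetch=fake_fetch)
    zip_left = (Path(tmp) / "BTCUSDT-trades-2024-11-05.zip").exists()
    print(first[1], second[1], zip_left)
```

Passing the download function as an argument keeps the caching logic testable offline; the second call returns immediately because the CSV is already in place.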

Once the data is collected, it can be preprocessed by creating an instance of the class `DataManager()` and calling the method `block_constructor()` to build the blocks of a given size. The method returns a dictionary in which each key is a pair and each value is a `pd.DataFrame` containing the blocks built for that pair.

In [4]:
data_manager = DataManager(asset_pairs, symbols, year=2024, month=11, day=5, aggregation_level=5)
blocks = data_manager.block_constructor(block_size=2, overlapping=False)
blocks_btc = blocks['BTCUSDT']
blocks_btc.head(3)

FileNotFoundError: [Errno 2] No such file or directory: 'data/raw_data\\BTCUSDT-trades-2024-11-5.csv'
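
The path in the error above ends in `2024-11-5.csv`, whereas the collector log shows files named with a zero-padded day (`2024-11-05`). The log also shows extraction into the current directory (`.\`), so the file location may need checking as well. How `DataManager` builds its path is not shown here, so the following is only a sketch of zero-padded name construction; `trades_csv_path` is a hypothetical helper.

```python
from pathlib import Path


def trades_csv_path(pair, year, month, day, raw_dir="data/raw_data"):
    """Build the CSV path with zero-padded month and day (hypothetical helper)."""
    return Path(raw_dir) / f"{pair}-trades-{year}-{month:02d}-{day:02d}.csv"


padded = trades_csv_path("BTCUSDT", 2024, 11, 5)
unpadded_name = f"BTCUSDT-trades-{2024}-{11}-{5}.csv"  # the form seen in the error message
print(padded.name)
print(unpadded_name)
```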

Once the blocks are built, the absolute and relative block frequencies can be computed by creating an instance of the class `RandomnessAnalysis()` and calling the method `compute_blocks_frequencies()`, which returns a `pd.DataFrame` with the blocks as index and the absolute and relative frequencies as columns.

In [6]:
s = 2
analyser = RandomnessAnalysis(blocks_df=blocks_btc, s=s)
frequencies_df = analyser.compute_blocks_frequencies()
frequencies_df

,block,absolute frequency,relative frequency
0,"(0, 0)",4987751,0.430066
1,"(0, 1)",694530,0.059885
2,"(1, 0)",693941,0.059835
3,"(1, 1)",5221418,0.450214
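
The two steps above, cutting a symbol sequence into blocks and counting them, can be illustrated on a toy sequence. This is a sketch and not the library code: `build_blocks` and `block_frequencies` are hypothetical helpers, and the reading of `overlapping` (sliding window with step 1 versus disjoint windows) is an assumption.

```python
from collections import Counter


def build_blocks(sequence, block_size, overlapping=False):
    """Cut a symbol sequence into blocks (tuples) of length block_size."""
    step = 1 if overlapping else block_size
    last_start = len(sequence) - block_size
    return [tuple(sequence[i:i + block_size]) for i in range(0, last_start + 1, step)]


def block_frequencies(blocks):
    """Return {block: (absolute frequency, relative frequency)}, sorted by block."""
    counts = Counter(blocks)
    total = sum(counts.values())
    return {block: (n, n / total) for block, n in sorted(counts.items())}


toy_sequence = [0, 0, 1, 1, 1, 0, 1, 1]  # made-up symbols, for illustration only
non_overlapping = build_blocks(toy_sequence, block_size=2)
overlapping = build_blocks(toy_sequence, block_size=2, overlapping=True)
frequencies = block_frequencies(non_overlapping)
print(non_overlapping)
print(frequencies)
```

On this toy sequence the disjoint windows give four blocks and the sliding window gives seven, which is why the choice of `overlapping` changes the sample size seen by the tests.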


To better visualize the frequency distribution, one can apply the function `plot_block_frequencies()` to the dataframe.

In [7]:
plot_block_frequencies(frequencies_df)

The class `RandomnessAnalysis()` also has two other methods:
- `entropy_bias_test()` to launch a predictability hypothesis test based on Entropy Bias
- `KL_divergence_test()` to launch a predictability hypothesis test based on KL Divergence

For both methods, the output is a `pd.DataFrame` containing the test results.

In [8]:
test_entropy = analyser.entropy_bias_test()
test_entropy

,Entropy Bias test
Bias,7585106.567047
Quantile 90%,6.251389
Quantile 95%,7.814728
Quantile 99%,11.344867
P-value,0.0
Mean,3
Hypothesis 1,True


In [9]:
test_divergence = analyser.KL_divergence_test()
test_divergence

,KL Divergence test
KL Divergence,7575690.115326
Quantile 90%,2.705543
Quantile 95%,3.841459
Quantile 99%,6.634897
P-value,0.0
Mean,1
Hypothesis 1,True
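
The quantiles and means in the two result tables coincide numerically with those of chi-square distributions with 3 degrees of freedom (Entropy Bias) and 1 degree of freedom (KL Divergence); the mean of a chi-square distribution equals its degrees of freedom. That the library obtains them this way is an assumption. The sketch below recomputes the quantiles with the standard library only, using the closed-form CDF for 3 degrees of freedom and the normal quantile for 1 degree of freedom.

```python
import math
from statistics import NormalDist


def chi2_cdf_df3(x):
    """Closed-form CDF of the chi-square distribution with 3 degrees of freedom."""
    return math.erf(math.sqrt(x / 2)) - math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


def chi2_quantile_df1(p):
    """Chi-square(1) is the square of a standard normal: quantile = z((1 + p) / 2) ** 2."""
    return NormalDist().inv_cdf((1 + p) / 2) ** 2


def chi2_quantile_df3(p, low=0.0, high=100.0):
    """Invert the df=3 CDF by bisection (the CDF is increasing in x)."""
    for _ in range(200):
        mid = (low + high) / 2
        if chi2_cdf_df3(mid) < p:
            low = mid
        else:
            high = mid
    return (low + high) / 2


for p in (0.90, 0.95, 0.99):
    print(p, round(chi2_quantile_df1(p), 6), round(chi2_quantile_df3(p), 6))
```

A test statistic in the millions against 99% quantiles of about 6.6 and 11.3 is why both tables report a p-value of 0.0.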


If one wants to perform multiple predictability tests for different sets of parameters (block size or aggregation level), the class `MultiTester()` handles this with the methods `test_by_block_size()` and `test_by_aggregation_level()`. The output is a `pd.DataFrame` with the test statistic, the quantiles and the mean of the theoretical distribution for each block size or aggregation level.

In [10]:
btc_multi_tester = MultiTester(asset='BTCUSDT',symbols=symbols,overlapping=False)
btc_multi_tester.test_by_block_size(test='Entropy Bias',
                                    max_block_size=15,
                                    year=2024,
                                    month=11,
                                    aggregation_level=50)

Block size,Test statistic,Quantile 99,Quantile 95,Quantile 90,Mean
1,4716.604,6.634897,3.841459,2.705543,1
2,7585107.0,11.344867,7.814728,6.251389,3
3,1648109.0,18.475307,14.06714,12.017037,7
4,1862117.0,30.577914,24.99579,22.30713,15
5,1995203.0,52.191395,44.985343,41.421736,31
6,2083016.0,92.010024,82.528727,77.745385,63
7,2145522.0,166.98739,154.301516,147.804813,127
8,2193572.0,310.457388,293.247835,284.335908,255
9,2231650.0,588.297794,564.696133,552.373933,511
10,2263570.0,1131.158739,1098.520782,1081.379444,1023
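
The Mean column above follows a simple pattern: for block size k with two symbols it equals 2^k - 1, the number of possible blocks minus one. Reading this as the degrees of freedom of a chi-square reference distribution is an interpretation of the table, not something stated by the library.

```python
alphabet_size = 2  # two symbols (0 and 1) in this tutorial
block_sizes = range(1, 11)

# Number of possible blocks of size k, minus one.
means = [alphabet_size ** k - 1 for k in block_sizes]
print(means)  # [1, 3, 7, 15, 31, 63, 127, 255, 511, 1023]
```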


In [11]:
btc_multi_tester.test_by_aggregation_level(test='KL Divergence',
                                           max_aggregation_level=50,
                                           year=2024,
                                           month=11,
                                           block_size=2)

Aggregation level,Test statistic,Quantile 90,Quantile 95,Quantile 99,Mean
1,7575690.0,2.705543,3.841459,6.634897,1
2,7575690.0,2.705543,3.841459,6.634897,1
3,7575690.0,2.705543,3.841459,6.634897,1
4,7575690.0,2.705543,3.841459,6.634897,1
5,7575690.0,2.705543,3.841459,6.634897,1
6,7575690.0,2.705543,3.841459,6.634897,1
7,7575690.0,2.705543,3.841459,6.634897,1
8,7575690.0,2.705543,3.841459,6.634897,1
9,7575690.0,2.705543,3.841459,6.634897,1
10,7575690.0,2.705543,3.841459,6.634897,1


It is also possible to perform multiple hypothesis tests by block size and by aggregation level and to view the test results in a 3D plot with the method `plot_3D_test_result()`.

In [18]:
import importlib
import utils.MultiTester
importlib.reload(utils.MultiTester)
from utils.MultiTester import MultiTester

In [None]:
btc_multi_tester = MultiTester(asset='BTCUSDT',symbols=symbols,overlapping=False)
btc_multi_tester.plot_3D_test_result(asset='BTCUSDT',
                                     test='Entropy Bias',
                                     max_block_size=15,
                                     year=2024,month=11,
                                     max_aggregation_level=50)

To generate artificial transaction data and see the fraction of predictable days by aggregation level and by time lag, the class `DataGenerator()` has three methods, one per model, that produce those charts:
- `lambda_model()` generates the data according to the λ-model
- `OD_model()` generates the data according to the Order Driven model
- `TS_model()` generates the data according to the TS model

In each method, the argument `plots` specifies which charts to draw: both the fraction of predictable days by aggregation level and by time lag (`plots=(True,True)`), or just one of them.

In [None]:
data_gen = DataGenerator()
data_gen.lambda_model(test='KL Divergence',
                      overlapping=True,
                      max_aggregation_level=5,
                      max_time_lag=5,
                      n_days=2,
                      plots=(False,True))
data_gen.OD_model(test='KL Divergence',
                  overlapping=True,
                  max_aggregation_level=5,
                  max_time_lag=5,
                  n_days=2,
                  plots=(False,True))
data_gen.TS_model(test='KL Divergence',
                  overlapping=True,
                  max_aggregation_level=10,
                  max_time_lag=5,
                  n_days=2,
                  plots=(False,True))

To observe the fraction of predictable days by aggregation level or by time lag for the data generated by the three models on a single chart, one can use the method `plot_all()` and specify the desired chart in the `what` argument.

In [None]:
data_gen.plot_all(what='aggregation level')
data_gen.plot_all(what='time lag')

Finally, to study the specific features of efficient and inefficient days for a given pair, month, year and aggregation level, one can use the class `PredictableDayAnalysis`, instantiated with the name of the pair and the `data_manager` instance related to that pair. The method `analyze_days()` is then called with the `pd.DataFrame` of the blocks associated with the pair.

In [None]:
analysis = PredictableDayAnalysis(pair="BTCUSDT",data_manager=data_manager)
analysis.analyze_days(blocks_btc)