Module imports

In [2]:
import numpy as np
import pandas as pd
from data.DataManager import DataManager
from data.DataGenerator import DataGenerator
from data.DataCollectors import HistoricalDataCollector, RealTimeDataCollector
from main.RandomnessAnalysis import RandomnessAnalysis
from main.PredictableDayAnalysis import PredictableDayAnalysis
from utils.Analysis import get_assets_properties, intervals_analysis, localization_predictable_intervals
from utils.MultiTester import MultiTester
from utils.VisualizationTools import plot_block_frequencies, plot_test

To import the data, one must first declare the asset pairs and the symbols associated with the return intervals to be studied. The `symbols` variable must be a dictionary with the symbols as keys and lists of two tuples as values. The first tuple gives the bounds of the interval associated with the key, and the second specifies whether the lower and upper bounds are included (True) or not (False).

In [3]:
asset_pairs = ['BTCUSDT','ETHUSDT','SOLUSDT']
symbols = {
    0: [(-np.inf, 0), (False, False)],
    1: [(0, np.inf), (False, False)],
}
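
As an illustration of how such a dictionary can be read, the sketch below maps a return to its symbol. The helper `return_to_symbol` is hypothetical and not part of the package; it assumes that the second tuple means (lower bound included, upper bound included), and it uses `math.inf`, which is the same float as `np.inf`.

```python
import math

# Same layout as the `symbols` dictionary above.
symbols = {
    0: [(-math.inf, 0), (False, False)],
    1: [(0, math.inf), (False, False)],
}

def return_to_symbol(value, symbols):
    """Return the symbol whose interval contains `value`, or None (hypothetical helper)."""
    for symbol, ((low, high), (low_in, high_in)) in symbols.items():
        above = value >= low if low_in else value > low
        below = value <= high if high_in else value < high
        if above and below:
            return symbol
    return None

print(return_to_symbol(-0.002, symbols))  # 0
print(return_to_symbol(0.001, symbols))   # 1
print(return_to_symbol(0.0, symbols))     # None: 0 is excluded from both intervals
```

With both bounds open, a return of exactly zero falls in neither interval; how such returns are handled depends on the package's preprocessing, which is not shown here.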

To start collecting data, one can either:
- use the class `RealTimeDataCollector()` to collect trades in real time for a given duration in hours
- use the class `HistoricalDataCollector()` to collect historical trades for a given year and month

To get a large amount of data quickly, the second option is preferable.

In [3]:
real_time_collection = RealTimeDataCollector(pairs=asset_pairs,
                                             duration_hours=1,
                                             update_interval=30)
# run() is a coroutine: it must be awaited, otherwise the call only returns a coroutine object
await real_time_collection.run()
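
Since `RealTimeDataCollector.run` is a coroutine function, calling it without `await` does not start the collection. The self-contained sketch below shows the difference with a hypothetical `DummyCollector` class standing in for the real collector; only the standard `asyncio` module is used.

```python
import asyncio

class DummyCollector:
    """Hypothetical stand-in for a collector whose run() is a coroutine."""

    async def run(self):
        await asyncio.sleep(0)  # placeholder for the real network calls
        return "done"

async def main():
    collector = DummyCollector()
    pending = collector.run()   # nothing has run yet: this is a coroutine object
    is_coro = asyncio.iscoroutine(pending)
    result = await pending      # awaiting actually executes run()
    return is_coro, result

is_coro, result = asyncio.run(main())
print(is_coro, result)  # True done
```

In a notebook, an event loop is already running, so one writes `await collector.run()` directly in a cell instead of wrapping the call in `asyncio.run(...)`.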

For every requested asset pair, the collector downloads the zip file of the historical trades for the given year and month (and day, when one is given), unzips it, saves the content as a csv file and deletes the original zip file. If the csv file for a given pair and date is already present in the folder data/raw_data, the method `collect()` detects it and skips the download to save execution time.

In [3]:
historical_collector = HistoricalDataCollector(pairs=asset_pairs, year=2024, month=11, day=1)
historical_collector.collect()

[SYSTEM] Processing BTCUSDT...
[SYSTEM] Downloading BTCUSDT-trades-2024-11-01.zip from https://data.binance.vision/data/spot/daily/trades/BTCUSDT/BTCUSDT-trades-2024-11-01.zip ...
[SYSTEM] File downloaded: BTCUSDT-trades-2024-11-01.zip
[SYSTEM] Extracting .\BTCUSDT-trades-2024-11-01.zip ...
[SYSTEM] Extraction finished.
[SYSTEM] ZIP file deleted: BTCUSDT-trades-2024-11-01.zip
[SYSTEM] Processing ETHUSDT...
[SYSTEM] Downloading ETHUSDT-trades-2024-11-01.zip from https://data.binance.vision/data/spot/daily/trades/ETHUSDT/ETHUSDT-trades-2024-11-01.zip ...
[SYSTEM] File downloaded: ETHUSDT-trades-2024-11-01.zip
[SYSTEM] Extracting .\ETHUSDT-trades-2024-11-01.zip ...
[SYSTEM] Extraction finished.
[SYSTEM] ZIP file deleted: ETHUSDT-trades-2024-11-01.zip
[SYSTEM] Processing SOLUSDT...
[SYSTEM] Downloading SOLUSDT-trades-2024-11-01.zip from https://data.binance.vision/data/spot/daily/trades/SOLUSDT/SOLUSDT-trades-2024-11-01.zip ...
[SYSTEM] File downloaded: SOLUSDT-trades-2024-11-01.zip
[SYSTE
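
The skip-if-present behaviour can be sketched as follows. This is not the package's implementation: the three helpers are hypothetical, the URL pattern is read off the log above, and the folder name is the one mentioned in the text.

```python
from pathlib import Path

BASE_URL = "https://data.binance.vision/data/spot/daily/trades"

def daily_file_stem(pair, year, month, day):
    """File name without extension, e.g. BTCUSDT-trades-2024-11-01."""
    return f"{pair}-trades-{year:04d}-{month:02d}-{day:02d}"

def daily_zip_url(pair, year, month, day):
    """URL of the daily archive, following the pattern seen in the log."""
    return f"{BASE_URL}/{pair}/{daily_file_stem(pair, year, month, day)}.zip"

def needs_download(pair, year, month, day, folder="data/raw_data"):
    """True when the csv for this pair and date is not in `folder` yet."""
    csv_path = Path(folder) / f"{daily_file_stem(pair, year, month, day)}.csv"
    return not csv_path.exists()

url = daily_zip_url("BTCUSDT", 2024, 11, 1)
print(url)
```

Checking for the extracted csv rather than the zip matches the workflow described above, where the archive is deleted after extraction.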

Once the data is collected, one can preprocess it by creating an instance of the class `DataManager()` and then calling the method `block_constructor()` to build the blocks of a given size. The method returns a dictionary where each key is a pair and each value is a `pd.DataFrame` containing the blocks built for that pair.

In [4]:
data_manager = DataManager(asset_pairs, symbols, year=2024, month=11, day=1, aggregation_level=5)
blocks = data_manager.block_constructor(block_size=2, overlapping=False)
blocks_btc = blocks['BTCUSDT']
blocks_btc.head(3)

[DataManager] Loading csv file: data/raw_data\BTCUSDT-trades-2024-11-01.csv
[DataManager] Loading csv file: data/raw_data\ETHUSDT-trades-2024-11-01.csv
[DataManager] Loading csv file: data/raw_data\SOLUSDT-trades-2024-11-01.csv


,0,1
0,1,0
1,1,0
2,0,1
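
To make the roles of `block_size` and `overlapping` concrete, here is a minimal sketch of cutting a symbol sequence into blocks. The function `build_blocks` is hypothetical and the sequence is made up; the package's own handling of incomplete trailing blocks may differ.

```python
def build_blocks(sequence, block_size, overlapping=False):
    """Cut a symbol sequence into blocks of `block_size` symbols (hypothetical helper)."""
    step = 1 if overlapping else block_size
    last_start = len(sequence) - block_size
    return [tuple(sequence[i:i + block_size]) for i in range(0, last_start + 1, step)]

sequence = [1, 0, 1, 0, 0, 1, 1]
print(build_blocks(sequence, block_size=2))
# [(1, 0), (1, 0), (0, 1)]  (the trailing symbol is dropped)
print(build_blocks(sequence, block_size=2, overlapping=True))
# [(1, 0), (0, 1), (1, 0), (0, 0), (0, 1), (1, 1)]
```

Overlapping blocks give many more observations from the same sequence, at the cost of making consecutive blocks dependent on each other.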


Now that the blocks are built, one can compute their relative frequencies by creating an instance of the class `RandomnessAnalysis()` and calling the method `compute_blocks_frequencies()`, which returns a `pd.DataFrame` with the blocks as index and the absolute and relative frequencies as columns.

In [None]:
s = 2
analyser = RandomnessAnalysis(blocks_df=blocks_btc, s=s)
frequencies_df = analyser.compute_blocks_frequencies()
frequencies_df

,block,absolute frequency,relative frequency
0,0,134163,0.442213
1,1,15021,0.049511
2,2,14980,0.049375
3,3,139226,0.458901
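
The counting step can be sketched in a few lines. The helper `block_frequencies` is hypothetical, and labelling each block by its base-s integer code (so that the four blocks of size 2 become 0 to 3) is an assumption about how the `block` column above is built.

```python
from collections import Counter

def block_frequencies(blocks, s):
    """Absolute and relative frequency of each block, keyed by its base-s code."""
    def encode(block):
        code = 0
        for symbol in block:
            code = code * s + symbol
        return code
    counts = Counter(encode(block) for block in blocks)
    total = sum(counts.values())
    return {code: (n, n / total) for code, n in sorted(counts.items())}

blocks = [(0, 0), (0, 1), (1, 1), (0, 0)]
print(block_frequencies(blocks, s=2))
# {0: (2, 0.5), 1: (1, 0.25), 3: (1, 0.25)}
```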


To better visualize the frequency distribution, one can apply the function `plot_block_frequencies()` to the dataframe.

In [6]:
plot_block_frequencies(frequencies_df)

The class `RandomnessAnalysis()` also has two other methods:
- `entropy_bias_test()` to run a predictability hypothesis test based on the Entropy Bias
- `KL_divergence_test()` to run a predictability hypothesis test based on the KL Divergence

Both methods return the test results as a `pd.DataFrame`.

In [7]:
test_entropy = analyser.entropy_bias_test()
test_entropy

,Entropy Bias test
Bias,259478.925051
Quantile 90%,6.251389
Quantile 95%,7.814728
Quantile 99%,11.344867
P-value,0.0
Mean,3
Hypothesis 1,True


In [6]:
test_divergence = analyser.KL_divergence_test()
test_divergence

,NP Statistic test
NP Statistic,224748.343879
Quantile 90%,2.705543
Quantile 95%,3.841459
Quantile 99%,6.634897
P-value,0.0
Mean,1
Hypothesis 1,True
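
The exact statistics are defined inside the package and are not reproduced here. As a generic illustration of this kind of test, the sketch below computes a likelihood-ratio (G) statistic of the block counts against the uniform distribution, G = 2N(ln M - H), where H is the empirical entropy in nats and M the number of possible blocks; under uniformity G is approximately chi-square with M - 1 degrees of freedom. The function name is hypothetical, the counts are the absolute frequencies shown earlier, and the threshold is the 99% quantile from the Entropy Bias table above.

```python
import math

def g_statistic_vs_uniform(counts):
    """Likelihood-ratio statistic 2*N*(ln M - H) for counts over M categories."""
    n = sum(counts)
    m = len(counts)
    entropy = -sum((c / n) * math.log(c / n) for c in counts if c > 0)
    return 2 * n * (math.log(m) - entropy)

counts = [134163, 15021, 14980, 139226]   # absolute frequencies of the 4 blocks
quantile_99 = 11.344867                   # chi-square, 3 degrees of freedom
g = g_statistic_vs_uniform(counts)
print(g > quantile_99)  # True: uniformity is rejected at the 1% level
```

This statistic is not claimed to equal either value reported by the package; it only shows how an entropy deficit turns into a test decision.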


To perform several predictability tests over a range of parameters (block size or aggregation level), one can use the class `MultiTester()` and its methods `test_by_block_size()` and `test_by_aggregation_level()`. The output is a `pd.DataFrame` giving, for each block size or aggregation level, the test statistic, the quantiles and the mean of the theoretical distribution.

In [8]:
btc_multi_tester = MultiTester(asset='BTCUSDT', symbols=symbols, overlapping=False)
df_test_block = btc_multi_tester.test_by_block_size(
    test='Entropy Bias',
    max_block_size=15,
    year=2024,
    month=11,
    day=5,
    aggregation_level=5,
)
df_test_block

Block size,Test statistic,Quantile 99,Quantile 95,Quantile 90,Mean
1,1473.12751,6.634897,3.841459,2.705543,1
2,259478.925051,11.344867,7.814728,6.251389,3
3,370554.077159,18.475307,14.06714,12.017037,7
4,430657.512803,30.577914,24.99579,22.30713,15
5,466766.477413,52.191395,44.985343,41.421736,31
6,491823.758053,92.010024,82.528727,77.745385,63
7,509449.581121,166.98739,154.301516,147.804813,127
8,522619.136222,310.457388,293.247835,284.335908,255
9,533171.923821,588.297794,564.696133,552.373933,511
10,542251.62152,1131.158739,1098.520782,1081.379444,1023
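
The `Mean` column of the table above equals 2^k - 1 for block size k, that is, the number of possible blocks minus one for a two-symbol alphabet. This is consistent with a chi-square reference distribution, whose mean is its number of degrees of freedom. A quick check under that reading:

```python
s = 2  # alphabet size used in this notebook
means = [s**k - 1 for k in range(1, 11)]
print(means)
# [1, 3, 7, 15, 31, 63, 127, 255, 511, 1023]
```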


In [9]:
plot_test(x_values=df_test_block.index, 
          y1_values=df_test_block['Test statistic'].values,
          y2_values=df_test_block['Quantile 99'].values,
          test='Entropy Bias',
          x_label='Block size',
          pair='BTCUSDT')

In [10]:
btc_multi_tester = MultiTester(asset='BTCUSDT', symbols=symbols, overlapping=True)

In [11]:
df_test_agg = btc_multi_tester.test_by_aggregation_level(
    test='NP Statistic',
    max_aggregation_level=10,
    year=2024,
    month=11,
    day=5,
    block_size=2,
)
df_test_agg

KeyboardInterrupt: 

In [None]:
plot_test(x_values=df_test_agg.index, 
          y1_values=df_test_agg['Test statistic'].values,
          y2_values=df_test_agg['Quantile 99'].values,
          test='NP Statistic',
          x_label='Aggregation level',
          pair='BTCUSDT')

It is also possible to perform multiple hypothesis tests by block size and by aggregation level and to display the test results in a 3D plot with the method `plot_3D_test_result()`.

In [18]:
import importlib
import utils.MultiTester
importlib.reload(utils.MultiTester)
from utils.MultiTester import MultiTester

In [None]:
btc_multi_tester = MultiTester(asset='BTCUSDT', symbols=symbols, overlapping=False)
btc_multi_tester.plot_3D_test_result(asset='BTCUSDT',
                                     test='Entropy Bias',
                                     max_block_size=15,
                                     year=2024,
                                     month=11,
                                     max_aggregation_level=50)

To generate artificial transaction data and plot the fraction of predictable days by aggregation level and by time lag, the class `DataGenerator()` has three methods, one per model:
- `lambda_model()` generates the data according to the λ-model
- `OD_model()` generates the data according to the Order Driven model
- `TS_model()` generates the data according to the TS model

In each method, the argument `plots` specifies whether to draw both charts, the fraction of predictable days by aggregation level and by time lag (`plots=(True,True)`), or only one of them.

In [None]:
data_gen = DataGenerator()
data_gen.lambda_model(test='NP Statistic',
                      overlapping=True,
                      max_aggregation_level=5,
                      max_time_lag=5,
                      n_days=2,
                      plots=(False,True))
data_gen.OD_model(test='NP Statistic',
                  overlapping=True,
                  max_aggregation_level=5,
                  max_time_lag=5,
                  n_days=2,
                  plots=(False,True))
data_gen.TS_model(test='NP Statistic',
                  overlapping=True,
                  max_aggregation_level=10,
                  max_time_lag=5,
                  n_days=2,
                  plots=(False,True))

Day 1/2


KeyboardInterrupt: 
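
The charts produced by these methods report a fraction of predictable days. Assuming a day counts as predictable when its test statistic exceeds the chosen quantile (an assumption, since the package's own criterion is not shown here), that fraction can be sketched as follows; `fraction_predictable` is a hypothetical helper and the statistics are made up.

```python
def fraction_predictable(daily_statistics, quantile):
    """Share of days whose test statistic exceeds the rejection threshold."""
    if not daily_statistics:
        raise ValueError("at least one day is required")
    rejected = sum(1 for stat in daily_statistics if stat > quantile)
    return rejected / len(daily_statistics)

# Made-up statistics for 4 days, compared with the 99% quantile for 1 degree of freedom.
print(fraction_predictable([0.8, 12.4, 7.1, 3.0], quantile=6.634897))  # 0.5
```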

To observe on a single chart the fraction of predictable days by aggregation level or by time lag for the data generated by the three models, one can use the method `plot_all()` and specify the desired chart in the `what` argument.

In [4]:
data_gen.plot_all(what='aggregation level')
data_gen.plot_all(what='time lag')

Day 1/80
Day 2/80


KeyboardInterrupt: 

To study the characteristics of efficient and inefficient days for a given pair, month, year and aggregation level, one can use the class `PredictableDayAnalysis`, creating an instance from the name of the pair and the `data_manager` instance related to the pair. One can then call the method `analyze_days()` with the `pd.DataFrame` of the blocks associated with the pair.

In [None]:
analysis = PredictableDayAnalysis(pair="BTCUSDT",data_manager=data_manager)
analysis.analyze_days(blocks_btc)
analysis.efficient_df

To get the properties of a given set of pairs for a given day or month, one can use the function `get_assets_properties()` with the following inputs:
- the list of pairs;
- the cardinality of the chosen alphabet;
- the year, the month and, optionally, the day.

A data manager must already have been instantiated for the given pairs (or at least their preprocessed data must be present in the corresponding folder).

In [5]:
data_manager = DataManager(asset_pairs, symbols, year=2024, month=11, day=5, aggregation_level=1)

In [6]:
df_prop = get_assets_properties(asset_pairs, s=2, year=2024, month=11, day=5)

In [8]:
df_prop

,Mean price,Price standard deviation,Mean return,Return standard deviation,Volume,Mean volume,Standard deviation of volume,Daily number of transactions
BTCUSDT,69227.471775,720.381637,2.402404e-08,1.9e-05,10339.08,0.011197,0.074842,923406.0
ETHUSDT,2435.963466,20.563697,1.912779e-08,3.1e-05,83625.22,0.158403,0.722906,527927.0
BNBUSDT,562.771773,4.988655,1.7379e-07,4.2e-05,56409.72,0.511101,2.575696,110369.0
XRPUSDT,0.510427,0.003333,7.873785e-07,0.000198,24806650.0,946.349254,3404.033293,26213.0
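
For reference, the columns of this table can be computed from a list of trade prices and traded quantities along the following lines. This is a sketch under assumptions, not the package's code: `asset_properties` is a hypothetical helper, returns are taken as log returns between consecutive trades, standard deviations are sample standard deviations, and the input data is made up.

```python
import math
import statistics

def asset_properties(prices, quantities):
    """Summary statistics for one day of trades (hypothetical helper)."""
    returns = [math.log(b / a) for a, b in zip(prices, prices[1:])]
    return {
        "Mean price": statistics.fmean(prices),
        "Price standard deviation": statistics.stdev(prices),
        "Mean return": statistics.fmean(returns),
        "Return standard deviation": statistics.stdev(returns),
        "Volume": sum(quantities),
        "Mean volume": statistics.fmean(quantities),
        "Standard deviation of volume": statistics.stdev(quantities),
        "Daily number of transactions": len(prices),
    }

props = asset_properties(prices=[100.0, 101.0, 100.5, 102.0],
                         quantities=[0.5, 1.0, 0.25, 0.25])
print(props["Volume"], props["Daily number of transactions"])  # 2.0 4
```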


In [5]:
data_manager = DataManager(asset_pairs, symbols, year=2024, month=11, day=5, aggregation_level=5)

In [9]:
import importlib
import utils.Analysis
importlib.reload(utils.Analysis)
from utils.Analysis import localization_predictable_intervals

In [None]:
df = localization_predictable_intervals(data_manager, "BTCUSDT", test='NP Statistic')
df.head(10)

Rank,Timestamp Start,Timestamp End,Test Stat,Quantile 99%,P-value,Hypothesis
1,2024-11-05 01:07:19.539,2024-11-05 01:08:35.880000,765.099422,362.956911,0.0,True
2,2024-11-05 15:12:38.976,2024-11-05 15:13:05.003000,763.348155,362.956911,0.0,True
3,2024-11-05 08:43:59.902,2024-11-05 08:45:36.364000,755.555759,362.956911,0.0,True
4,2024-11-05 15:14:57.148,2024-11-05 15:15:25.603000,742.731022,362.956911,0.0,True
5,2024-11-05 15:49:40.196,2024-11-05 15:50:35.508000,741.195695,362.956911,0.0,True
6,2024-11-05 14:03:38.931,2024-11-05 14:05:26.077000,740.881631,362.956911,0.0,True
7,2024-11-05 02:03:36.286,2024-11-05 02:04:49.218000,736.230903,362.956911,0.0,True
8,2024-11-05 10:18:20.865,2024-11-05 10:21:21.811000,731.954847,362.956911,0.0,True
9,2024-11-05 11:23:33.880,2024-11-05 11:25:55.515000,731.082091,362.956911,0.0,True
10,2024-11-05 16:45:59.939,2024-11-05 16:46:19.265000,731.051935,362.956911,0.0,True


In [None]:
intervals_analysis(pairs=asset_pairs,
                   symbols=symbols,
                   max_aggregation_level=50,
                   year=2024,
                   month=[8,9,10,11,12])

[SYSTEM] Processing 2024-8 with aggregation level 1...
[SYSTEM] Processing BTCUSDT...
[SYSTEM] Data already available for BTCUSDT (BTCUSDT-trades-2024-08.csv)
[SYSTEM] Processing ETHUSDT...
[SYSTEM] Data already available for ETHUSDT (ETHUSDT-trades-2024-08.csv)
[SYSTEM] Processing SOLUSDT...
[SYSTEM] Data already available for SOLUSDT (SOLUSDT-trades-2024-08.csv)
[SYSTEM] Processing BNBUSDT...
[SYSTEM] Data already available for BNBUSDT (BNBUSDT-trades-2024-08.csv)
[SYSTEM] Processing AVAXUSDT...
[SYSTEM] Data already available for AVAXUSDT (AVAXUSDT-trades-2024-08.csv)
[SYSTEM] Processing UNIUSDT...
[SYSTEM] Data already available for UNIUSDT (UNIUSDT-trades-2024-08.csv)
[SYSTEM] Processing LINKUSDT...
[SYSTEM] Data already available for LINKUSDT (LINKUSDT-trades-2024-08.csv)
[SYSTEM] Processing AXSUSDT...
[SYSTEM] Data already available for AXSUSDT (AXSUSDT-trades-2024-08.csv)
[SYSTEM] Processing RENDERUSDT...
[SYSTEM] Data already available for RENDERUSDT (RENDERUSDT-trades-2024-08

In [None]:
intervals_analysis(pairs=asset_pairs,
                   symbols=symbols,
                   max_aggregation_level=50,
                   year=2024,
                   month=[8,9,10,11,12],
                   test='NP Statistic')