# Propdesk Data Pipeline - Seasonality and IO-bound processes
### Example: Calculate seasonality for a pair in a given exchange, check if everything is ok and set up a periodic job to keep it updated


### Most of the examples in notebook 1 also apply to computing and retrieving seasonality. Check it for reference.
#### This notebook highlights some details and differences

In [7]:
%load_ext autoreload
%autoreload 2
from propdesk_tardis.tardis_transfero import tardis_transfero as tardis

exchange = 'binance'
items_to_search = ['ada', 'brl']

all_datasets = tardis.get_all_exchange_datasets(exchange)
print([i for i in all_datasets.keys() if all([s in i for s in items_to_search])])

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
['adabrl']


In [8]:
pair = 'adabrl'

### Let's say we want to calculate it from 2021-11-20 to 2021-12-15. 
## Note that the date range is half-open, **[date_from, date_to)**, so this will effectively calculate data until ***2021-12-14 23:59:59***

In [9]:
date_from = '2021-11-20'
date_to = '2021-12-15'
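To make the half-open convention concrete, here is a minimal sketch that lists the days covered by `[date_from, date_to)`. The helper `days_in_half_open_range` is hypothetical and uses only the standard library; it is not part of `propdesk_tardis`.

```python
from datetime import date, timedelta


def days_in_half_open_range(date_from, date_to):
    """Return ISO dates in [date_from, date_to): date_to itself is excluded."""
    start = date.fromisoformat(date_from)
    end = date.fromisoformat(date_to)
    n_days = (end - start).days
    return [(start + timedelta(days=i)).isoformat() for i in range(n_days)]


days = days_in_half_open_range('2021-11-20', '2021-12-15')
print(len(days), days[0], days[-1])  # 25 2021-11-20 2021-12-14
```

The last day in the list is 2021-12-14, which matches the note above: `date_to` is never processed.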

Then, we can define the parameters of the seasonality estimation process

In [10]:
params_dict = {
    'exchange': exchange,
    'pair': pair,
    'start_date': date_from,
    'end_date': date_to,
    'lookback_days':30,
    'resampling_rule': '60S'
}

We can simply run this job to have it computed. First, as an example, let's check that the script is deployed to Databricks

In [11]:
from propdesk_azure_services.azure_databricks import list_databricks_src_files
deployed_files = list_databricks_src_files()
print(deployed_files)

['spark_schedule_handler.py', 'spark_seasonality.py', 'spark_volatility_zma.py']


In [12]:
job_name = f'{pair}_{exchange}_seasonality'
script_to_run = 'spark_seasonality.py'
job_type = 'io_intensive'

from propdesk_azure_services.azure_databricks import single_run_job

# function commented because it would trigger a run and return the data structure printed below
# single_run_job(job_name, script_to_run, params_dict, job_type=job_type)


{'run_id': 220169,
 'job_id': 8138,
 'run_page_url': 'https://adb-3928083337264192.12.azuredatabricks.net/?o=3928083337264192#job/8138/run/1'}

# ⚠️ ATTENTION - job_type = 'io_intensive'

Seasonality is an **I/O-bound** job: downloading data dominates the time to complete it. We download a large amount of data (namely, `lookback_days=30`) and do only light work on it (we downsample the data to 60S ticks).
That means it makes sense to parallelize the calculations (using Spark), but not to split the job into per-day tasks distributed across the Spark nodes. Splitting would create one task run for a particular day ***N*** and another for day ***N+1***.
The task for day ***N*** downloads data starting from day ***N-30***, and the task for day ***N+1*** downloads data starting from day ***N+1-30***, so the two downloads overlap almost entirely. This redundancy overloads the provider (e.g., Tardis) with requests and needlessly loads the cluster.

Therefore, we can schedule a single task job (which will download all data, once, with no overlap) by limiting `max_tasks_in_job=1`. This is done automatically when passing `job_type='io_intensive'`. Also, `task_len = 31` limits each task to calculate up to 31 days.

Just pass the correct job type and everything will be fine :)
# ⚠️ -----------------------------------------
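The cost of the overlap can be estimated with a short, self-contained sketch. It assumes that estimating day ***N*** requires the 30 days before it, i.e. `[N-30, N)`; the exact window used by `spark_seasonality.py` may differ, and the helper names below are hypothetical.

```python
from datetime import date, timedelta


def lookback_window(day, lookback_days):
    """Days assumed to be downloaded to estimate `day`: [day - lookback_days, day)."""
    return {day - timedelta(days=k) for k in range(1, lookback_days + 1)}


def count_downloads(first_day, n_days, lookback_days):
    """Compare day-downloads for one-task-per-day versus one task for the whole range."""
    targets = [first_day + timedelta(days=i) for i in range(n_days)]
    windows = [lookback_window(d, lookback_days) for d in targets]
    per_day_tasks = sum(len(w) for w in windows)  # every task downloads its own window
    single_task = len(set().union(*windows))      # each distinct day is downloaded once
    return per_day_tasks, single_task


per_day, single = count_downloads(date(2021, 11, 20), n_days=25, lookback_days=30)
print(per_day, single)  # 750 54
```

Under this assumption, the 25-day range used in this notebook needs 750 day-downloads when split into per-day tasks, against 54 for a single task.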

### Connecting to storage

In [12]:
from propdesk_estimators.exchange_storage import ExchangeStorage
binance = ExchangeStorage(exchange) # -- ExchangeStorage('binance')

Check for datasets that were already computed

In [20]:
query_dict = {
    'dataset_type': 'seasonality',
    'pair': pair,
    'date_from': date_from,
    'date_to': date_to,
    'lookback_days': 30,
    'resampling_rule': '60s'
}
print(query_dict)

{'dataset_type': 'seasonality', 'pair': 'adabrl', 'date_from': '2021-11-20', 'date_to': '2021-12-15', 'lookback_days': 30, 'resampling_rule': '60s'}


In [21]:
print ('Already computed datasets in query')
binance.list_datasets_by_params(query_dict)

Already computed datasets in query


['adabrl/seasonality/2021/11/adabrl_seasonality_2021-11-20',
 'adabrl/seasonality/2021/11/adabrl_seasonality_2021-11-21',
 'adabrl/seasonality/2021/11/adabrl_seasonality_2021-11-22',
 'adabrl/seasonality/2021/11/adabrl_seasonality_2021-11-23',
 'adabrl/seasonality/2021/11/adabrl_seasonality_2021-11-24',
 'adabrl/seasonality/2021/11/adabrl_seasonality_2021-11-25',
 'adabrl/seasonality/2021/11/adabrl_seasonality_2021-11-26',
 'adabrl/seasonality/2021/11/adabrl_seasonality_2021-11-27',
 'adabrl/seasonality/2021/11/adabrl_seasonality_2021-11-28',
 'adabrl/seasonality/2021/11/adabrl_seasonality_2021-11-29',
 'adabrl/seasonality/2021/11/adabrl_seasonality_2021-11-30',
 'adabrl/seasonality/2021/12/adabrl_seasonality_2021-12-01',
 'adabrl/seasonality/2021/12/adabrl_seasonality_2021-12-02',
 'adabrl/seasonality/2021/12/adabrl_seasonality_2021-12-03',
 'adabrl/seasonality/2021/12/adabrl_seasonality_2021-12-04',
 'adabrl/seasonality/2021/12/adabrl_seasonality_2021-12-05',
 'adabrl/seasonality/202

## We could have used the `amend` method to find only the days that are missing and generate a dict to process them
(check notebook 1 on volatility)
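As a library-independent illustration of the idea (this is not the `amend` implementation), the sketch below rebuilds the dataset names expected for a date range, following the naming pattern visible in the listing above, and reports which ones are absent. Both helper functions are hypothetical.

```python
from datetime import date, timedelta


def expected_dataset_names(pair, dataset_type, date_from, date_to):
    """Build names like 'adabrl/seasonality/2021/11/adabrl_seasonality_2021-11-20'
    for every day in [date_from, date_to)."""
    start = date.fromisoformat(date_from)
    end = date.fromisoformat(date_to)
    names = []
    for i in range((end - start).days):
        d = start + timedelta(days=i)
        names.append(f"{pair}/{dataset_type}/{d:%Y}/{d:%m}/{pair}_{dataset_type}_{d.isoformat()}")
    return names


def missing_datasets(expected, existing):
    """Return the expected names that do not appear in the existing listing."""
    existing = set(existing)
    return [name for name in expected if name not in existing]


expected = expected_dataset_names('adabrl', 'seasonality', '2021-11-20', '2021-11-23')
# Toy listing with one day removed, standing in for a real storage listing
existing = [expected[0], expected[2]]
print(missing_datasets(expected, existing))
```

In practice, `existing` would be the list returned by `binance.list_datasets_by_params(query_dict)`.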

### With the module `propdesk_estimators.exchange_storage`, we can interact with those datasets. Let's straight up get a dataframe with the data

In [22]:
from propdesk_estimators.exchange_storage import get_dataframe_by_params

# pass keep_local=True to keep the raw files so they are not downloaded again if needed; keep_local=False will delete the files
adabrl_seas_df = get_dataframe_by_params(exchange_str='binance', params_dict=query_dict, keep_local=True)
adabrl_seas_df

files saved to: /tmp/tmp48wt0d10


Unnamed: 0,datetime,seasonality_estimation
0,2021-11-20 00:00:00,283.65
1,2021-11-20 00:01:00,679.25
2,2021-11-20 00:02:00,451.20
3,2021-11-20 00:03:00,500.50
4,2021-11-20 00:04:00,351.95
...,...,...
35995,2021-12-14 23:55:00,187.55
35996,2021-12-14 23:56:00,345.20
35997,2021-12-14 23:57:00,130.15
35998,2021-12-14 23:58:00,433.60
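One quick way to check that everything is ok is to look for gaps in the time grid. A full 25-day range at 60-second resolution should contain 25 × 1440 = 36000 rows. The sketch below is a hypothetical standard-library helper that reports timestamps missing from a regular grid. It assumes timestamps formatted as in the `datetime` column above, and is shown on toy data rather than the real dataframe.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"


def find_missing_timestamps(timestamps, start, end, step_seconds=60):
    """Return the grid points in [start, end) that are absent from `timestamps`."""
    seen = {datetime.strptime(t, FMT) for t in timestamps}
    step = timedelta(seconds=step_seconds)
    current = datetime.strptime(start, FMT)
    stop = datetime.strptime(end, FMT)
    missing = []
    while current < stop:
        if current not in seen:
            missing.append(current.strftime(FMT))
        current += step
    return missing


# Toy sample with the 00:02 tick missing
sample = ["2021-11-20 00:00:00", "2021-11-20 00:01:00", "2021-11-20 00:03:00"]
gaps = find_missing_timestamps(sample, "2021-11-20 00:00:00", "2021-11-20 00:04:00")
print(gaps)  # ['2021-11-20 00:02:00']
```

For the real data, the `datetime` column (converted to strings in this format) would be passed as `timestamps`, with `start` and `end` set to midnight of `date_from` and `date_to`.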


# Success :)

## Creating a daily job
Now, after using and checking the data, we can create a periodic (weekly/daily) job to keep this dataset updated. We'll set it up to run every day at 04:00 AM Sao Paulo time, when the cluster is idle.

## Ideally, **check Databricks UI to see if there are jobs scheduled for that time and day to avoid overloading the cluster**

In [1]:
from propdesk_azure_services.azure_databricks import create_periodic_job

exchange = 'binance'
# daily job
job_name = f'daily_seasonality_{pair}_{exchange}'
script_to_run = 'spark_seasonality.py'

##############
# make sure to have a schedule that is compatible with the period
# you can validate it in https://www.freeformatter.com/cron-expression-generator-quartz.html
# this one reads 'every day at 4 AM'
# default timezone is AMERICAS-SAO PAULO
job_schedule = "0 0 4 * * ?"
#############

period = 'daily'
# no need for start_date and end_date in daily jobs 
job_params = {
    'exchange': exchange,
    'pair': pair,
    'lookback_days':30,
    'resampling_rule': '60S'
}


create_periodic_job(job_name_str=job_name,
                    filename_str=script_to_run,
                    params_dict=job_params,
                    period=period,
                    cron_expression_str=job_schedule)

{'job': {'job_id': 10089}, 'schedule_job': {'job_id': 10169}}
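The schedule string is a Quartz cron expression, whose fields are, in order: seconds, minutes, hours, day of month, month, day of week, plus an optional year. The sketch below is a hypothetical helper (not part of `propdesk_azure_services`) that only labels the fields so the schedule can be read at a glance; it does not validate the values.

```python
QUARTZ_FIELDS = ("seconds", "minutes", "hours", "day_of_month", "month", "day_of_week")


def describe_quartz_cron(expression):
    """Split a Quartz cron expression into named fields (no value validation)."""
    parts = expression.split()
    if len(parts) not in (6, 7):
        raise ValueError(f"expected 6 or 7 fields, got {len(parts)}")
    fields = dict(zip(QUARTZ_FIELDS, parts))
    if len(parts) == 7:
        fields["year"] = parts[6]  # optional seventh field
    return fields


fields = describe_quartz_cron("0 0 4 * * ?")
print(fields["hours"], fields["minutes"], fields["seconds"])  # 4 0 0
```

Hours use a 24-hour clock, so `4` means 04:00; a 4 PM schedule would use `16` in the hours field.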

## That's it. **Check the Databricks UI to make sure everything is ok**

### Have fun, move fast, break things, buy btc (or dcr or algorand) ⚡.
#### -- Propdesk Transfero