# Automation

## TLDR

- The automation feature lets you add further processing steps after the default update steps: downloading new zip files from the SEC, transforming them to Parquet format, and indexing them in the SQLite DB. When a new zip file is detected at the SEC, these three steps are executed automatically, followed by the additional steps that you define yourself.
- There are two hook methods you can define and use to implement additional logic. Both of them are activated by defining them in the configuration file.
- The simpler of the two hook methods just receives the `Configuration` object and is called after all update steps have been executed, that is, after the default update steps (downloading zip files, transforming to Parquet, indexing) and any additional user-defined update steps.
- The more complex one receives a `Configuration` object and has to return a list of instances derived from `AbstractProcess`. There are some basic implementations of `AbstractProcess` that can be used to filter, concat, and standardize the data. These implementations are also used in the example implementation described in this notebook.
- Depending on what you do in your additional steps, you have to keep an eye on memory usage.

## Defining a postupdatehook function

If you define a postupdatehook function in the configuration file, this function will be called after all update steps have been executed.

It will be called regardless of whether the previous steps actually did something. For instance, if no new zip file was detected for download, it will be called anyway, but not more than once every 24 hours (the usual period in which the framework checks for updates).

Since the hook method is called even if there were no updates, it is your responsibility to check whether something actually changed. Otherwise, any time-consuming logic you implemented would be executed once every 24 hours.
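
One way to implement such a check is to remember which quarters were already handled and to return early when nothing new has arrived. The following is only a minimal sketch under stated assumptions: the helper names `find_new_quarters` and `mark_processed`, the state file name, and the `parquet_dir` attribute are made up for illustration, and the layout `<parquet_dir>/quarter/<name>.zip` is assumed rather than taken from the framework. The hook is written without type annotations so that the sketch runs without secfsdstools installed.

```python
import json
from pathlib import Path
from typing import List


def find_new_quarters(parquet_dir: str, state_file: str) -> List[str]:
    """Return the quarter folders that are not yet listed in the state file."""
    quarter_root = Path(parquet_dir) / "quarter"  # assumed layout, see text above
    if not quarter_root.is_dir():
        return []
    current = sorted(p.name for p in quarter_root.iterdir() if p.is_dir())

    state_path = Path(state_file)
    seen = set(json.loads(state_path.read_text())) if state_path.exists() else set()
    return [name for name in current if name not in seen]


def mark_processed(state_file: str, quarters: List[str]) -> None:
    """Add the given quarter folders to the state file."""
    state_path = Path(state_file)
    seen = set(json.loads(state_path.read_text())) if state_path.exists() else set()
    state_path.write_text(json.dumps(sorted(seen | set(quarters))))


def my_checked_postupdatehook(configuration) -> None:
    # `parquet_dir` is assumed to be an attribute of Configuration;
    # the state file name is made up for this example.
    state_file = str(Path(configuration.parquet_dir) / "my_hook_state.json")
    new_quarters = find_new_quarters(configuration.parquet_dir, state_file)
    if not new_quarters:
        return  # nothing changed since the last run, skip the expensive work

    print(f"processing new quarters: {new_quarters}")  # your own logic goes here
    mark_processed(state_file, new_quarters)
```

The state file is written only after the work is done, so a run that fails halfway is retried on the next call.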

The postupdatehook function needs a `Configuration` parameter and does not return anything. It can have any name you like.

<pre>
# it is ok to import the Configuration class
from secfsdstools.a_config.configmodel import Configuration 
from secfsdstools.c_index.indexdataaccess import ParquetDBIndexingAccessor
from secfsdstools... import ...

def my_postupdatehook_function(configuration: Configuration):
    
    # you can use the configuration for instance to instantiate access to the SQLite db
    index_db = ParquetDBIndexingAccessor(db_dir=configuration.db_dir)
    ...
    
</pre>

To activate it, just define it in the DEFAULT section of the configuration file:

<pre>
[DEFAULT]
downloaddirectory = C:/data/sec/automated/dld
dbdirectory = C:/data/sec/automated/db
parquetdirectory = C:/data/sec/automated/parquet
useragentemail = your.email@goeshere.com
autoupdate = True
keepzipfiles = False
postupdatehook = mypackage.mymodule.my_postupdatehook_function
</pre>

## Defining a postupdateprocesses function

If you define a postupdateprocesses function, it has to return a list of instances of `AbstractProcess`. These instances are executed after the default steps (download, transform to Parquet, and indexing) have been executed.

Here too, every process will be called once every 24 hours, so every process implementation has to check for itself whether something changed.

The postupdateprocesses function must accept a `Configuration` parameter and return a list of `AbstractProcess` instances.

*Note: There are some basic implementations of the `AbstractProcess` class within the `secfsdstools.g_pipelines` package that provide logic to filter, to concat bags, and to standardize joined bags. 
Please have a look at the following section, which shows an example of how these basic implementations can be used.*

<pre>
from typing import List

# it is ok to import the Configuration and AbstractProcess classes
from secfsdstools.a_config.configmodel import Configuration
from secfsdstools.c_automation.task_framework import AbstractProcess


def my_postupdateprocesses_function(configuration: Configuration) -> List[AbstractProcess]:
    # do your secfsdstools imports here
    from secfsdstools... import ...
    
    processes: List[AbstractProcess] = []
    ...
    
    return processes
    
</pre>

To activate it, add the appropriate configuration in the DEFAULT section of the configuration file:

<pre>
[DEFAULT]
downloaddirectory = C:/data/sec/automated/dld
dbdirectory = C:/data/sec/automated/db
parquetdirectory = C:/data/sec/automated/parquet
useragentemail = your.email@goeshere.com
autoupdate = True
keepzipfiles = False
postupdateprocesses = mypackage.mymodule.my_postupdateprocesses_function
</pre>

## A working example of the postupdateprocesses function

The package `secfsdstools.x_examples.automation` provides a default implementation of a postupdateprocesses function: `define_extra_processes`.

You can use this function directly by adding it to your configuration file together with some additional configuration parameters used by it: 
<pre>
[DEFAULT]
...
postupdateprocesses=secfsdstools.x_examples.automation.automation.define_extra_processes

[Filter]
filtered_dir_by_stmt_joined = C:/data/sec/automated/_1_filtered_by_stmt_joined

[Concat]
concat_dir_by_stmt_joined = C:/data/sec/automated/_2_concat_by_stmt_joined

[Standardizer]
standardized_dir = C:/data/sec/automated/_3_standardized

; [SingleBag]
; singlebag_dir = C:/data/sec/automated/_4_single_bag
</pre>

The function will add 3 additional steps and a fourth optional step. The optional step is only executed if the needed parameter `singlebag_dir` is defined. 
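
To check which of these output directories a given configuration actually defines, the file can be parsed with Python's standard `configparser`. The sketch below is not how `define_extra_processes` reads its settings (that is not shown here). The helper name `configured_step_dirs` is made up for illustration, and the sample text repeats the configuration from above. Lines starting with `;` are treated as comments, so the `[SingleBag]` section is absent until it is uncommented.

```python
from configparser import ConfigParser
from typing import Dict

SAMPLE_CONFIG = """
[DEFAULT]
postupdateprocesses = secfsdstools.x_examples.automation.automation.define_extra_processes

[Filter]
filtered_dir_by_stmt_joined = C:/data/sec/automated/_1_filtered_by_stmt_joined

[Concat]
concat_dir_by_stmt_joined = C:/data/sec/automated/_2_concat_by_stmt_joined

[Standardizer]
standardized_dir = C:/data/sec/automated/_3_standardized

; [SingleBag]
; singlebag_dir = C:/data/sec/automated/_4_single_bag
"""

# (section, option) pairs used by the example configuration above
STEP_OPTIONS = [
    ("Filter", "filtered_dir_by_stmt_joined"),
    ("Concat", "concat_dir_by_stmt_joined"),
    ("Standardizer", "standardized_dir"),
    ("SingleBag", "singlebag_dir"),
]


def configured_step_dirs(config_text: str) -> Dict[str, str]:
    """Return the step directories that are defined in the configuration text."""
    parser = ConfigParser()
    parser.read_string(config_text)
    return {
        option: parser.get(section, option)
        for section, option in STEP_OPTIONS
        # has_option returns False when the section itself is missing
        if parser.has_option(section, option)
    }


if __name__ == "__main__":
    dirs = configured_step_dirs(SAMPLE_CONFIG)
    print("optional single-bag step configured:", "singlebag_dir" in dirs)
```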

These steps add the following processing:

The first step creates a joined bag for every zip file, filtered for 10-K and 10-Q reports only.
It also applies the filters `ReportPeriodRawFilter`, `MainCoregRawFilter`, `USDOnlyRawFilter`, `OfficialTagsOnlyRawFilter`. 
Furthermore, the data is split by statement (stmt).
The filtered joined bag is stored under the path that is defined as `filtered_dir_by_stmt_joined` in the configuration file.
The resulting directory structure will look like this:


    <filtered_dir_by_stmt_joined>
        quarter
            2009q2.zip
                BS
                CF
                CI
                CP
                EQ
                IS
            ...
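
To see which quarters and statements the first step has produced so far, the directory tree above can be walked with `pathlib`. This is a small sketch that relies only on the layout shown above. The helper name `list_statements_per_quarter` is made up for illustration and is not part of secfsdstools.

```python
from pathlib import Path
from typing import Dict, List


def list_statements_per_quarter(filtered_dir: str) -> Dict[str, List[str]]:
    """Map every quarter folder (e.g. '2009q2.zip') to its statement folders."""
    quarter_root = Path(filtered_dir) / "quarter"
    result: Dict[str, List[str]] = {}
    for quarter_dir in sorted(quarter_root.iterdir()):
        if quarter_dir.is_dir():
            # statement folders such as BS, CF, CI, CP, EQ, IS
            result[quarter_dir.name] = sorted(
                p.name for p in quarter_dir.iterdir() if p.is_dir()
            )
    return result
```

Pass the path that is configured as `filtered_dir_by_stmt_joined`. The result is a dictionary keyed by quarter folder name.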

The second step creates a single joined bag for every statement (balance sheet, income statement,
cash flow, cover page, ...) that contains the data from all zip files, that is, from all the
available quarters. These bags are stored under the path defined as `concat_dir_by_stmt_joined`.
The resulting directory structure will look like this:

    <concat_dir_by_stmt_joined>
        BS
        CF
        CI
        CP
        EQ
        IS    


The third step standardizes the data for balance sheet, income statement, and cash flow and stores
the standardized bags under the path that is defined as `standardized_dir`.
The resulting structure will look like this:

    <standardized_dir>
        BS
        CF
        IS    
    

The fourth step is optional and is only executed if the configuration file contains an entry
for `singlebag_dir`. If it does, it will create a single joined bag by concatenating all the bags
created in the second step, so basically a single bag that contains all the filtered data from
all the available zip files, that is, from all quarters. This step needs more memory than the others, so it
might not run on every system.
The resulting directory structure will look like this:

    <singlebag_dir>
        all


Hint: the data can be loaded directly with `JoinedDataBag.load`, or with `StandardizedBag.load` for the standardized data.





explaining the implementation ..
explaining how the state is saved

reference to the implementations of AbstractProcess