## A working example of the postupdateprocesses function: `secfsdstools.x_examples.automation.memory_optimized_daily_automation.define_extra_processes` (introduced in 2.4.2)

<span style="color: #FF8C00;">==========================================================</span>

**If you find this tool useful, a sponsorship would be greatly appreciated!**

**https://github.com/sponsors/HansjoergW**

How to get in touch

* Found a bug: https://github.com/HansjoergW/sec-fincancial-statement-data-set/issues
* Have a remark: https://github.com/HansjoergW/sec-fincancial-statement-data-set/discussions/categories/general
* Have an idea: https://github.com/HansjoergW/sec-fincancial-statement-data-set/discussions/categories/ideas
* Have a question: https://github.com/HansjoergW/sec-fincancial-statement-data-set/discussions/categories/q-a
* Have something to show: https://github.com/HansjoergW/sec-fincancial-statement-data-set/discussions/categories/show-and-tell

<span style="color: #FF8C00;">==========================================================</span>

### What this pipeline creates

The pipeline creates the following bags:

- from the quarterly zip files
  - a single joined bag per statement (BS, IS, CF, ..) that contains the data from all available quarters.
  - single standardized bags for each of BS, IS, CF which contain data from all the available quarters.
  - a single joined bag containing all the data from all statements from all available quarters.
- from daily processing of data not yet available in the quarter zip files
  - a single joined bag per statement (BS, IS, CF, ..).
  - single standardized bags for each of BS, IS, CF.
  - a single joined bag containing all the data from all statements.
- combined quarterly and daily processed data
  - a single joined bag per statement (BS, IS, CF, ..).
  - single standardized bags for each of BS, IS, CF.
  - a single joined bag containing all the data from all statements.
  
Moreover, newly filed reports are added every day.

This version has a low memory footprint and should run without any problems on a machine with 16 GB of memory.


### How to use the example


You can use this function directly by adding it to your configuration file together with some additional configuration parameters used by it: 
<pre>
[DEFAULT]
...
postupdateprocesses=secfsdstools.x_examples.automation.memory_optimized_daily_automation.define_extra_processes
dailyprocessing = True

# configuration for quarterly data / daily data
[Filter]
filtered_quarterly_joined_by_stmt_dir = C:/data/sec/automated/_1_by_quarter/_1_filtered_joined_by_stmt
filtered_daily_joined_by_stmt_dir = C:/data/sec/automated/_1_by_day/_1_filtered_joined_by_stmt
parallelize = True

[Standardizer]
standardized_quarterly_by_stmt_dir = C:/data/sec/automated/_1_by_quarter/_2_standardized_by_stmt
standardized_daily_by_stmt_dir = C:/data/sec/automated/_1_by_day/_2_standardized_by_stmt

[Concat]
concat_quarterly_joined_by_stmt_dir = C:/data/sec/automated/_2_all_quarter/_1_joined_by_stmt
concat_daily_joined_by_stmt_dir = C:/data/sec/automated/_2_all_day/_1_joined_by_stmt

concat_quarterly_joined_all_dir = C:/data/sec/automated/_2_all_quarter/_2_joined
concat_daily_joined_all_dir = C:/data/sec/automated/_2_all_day/_2_joined

concat_quarterly_standardized_by_stmt_dir = C:/data/sec/automated/_2_all_quarter/_3_standardized_by_stmt
concat_daily_standardized_by_stmt_dir = C:/data/sec/automated/_2_all_day/_3_standardized_by_stmt

concat_all_joined_by_stmt_dir = C:/data/sec/automated/_3_all/_1_joined_by_stmt
concat_all_joined_dir = C:/data/sec/automated/_3_all/_2_joined
concat_all_standardized_by_stmt_dir = C:/data/sec/automated/_3_all/_3_standardized_by_stmt
</pre>
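
Since the implementation reads all of these directory options without a fallback, it can help to check the configuration file before the first run. The following is a minimal sketch that uses only Python's standard `configparser`; the helper name `find_missing_options` and the deliberately incomplete inline sample are illustrative assumptions and not part of secfsdstools.

```python
from configparser import ConfigParser
from typing import Dict, List

# Options expected by the example, grouped by section (taken from the configuration above).
REQUIRED_OPTIONS: Dict[str, List[str]] = {
    "Filter": [
        "filtered_quarterly_joined_by_stmt_dir",
        "filtered_daily_joined_by_stmt_dir",
    ],
    "Standardizer": [
        "standardized_quarterly_by_stmt_dir",
        "standardized_daily_by_stmt_dir",
    ],
    "Concat": [
        "concat_quarterly_joined_by_stmt_dir",
        "concat_daily_joined_by_stmt_dir",
        "concat_quarterly_joined_all_dir",
        "concat_daily_joined_all_dir",
        "concat_quarterly_standardized_by_stmt_dir",
        "concat_daily_standardized_by_stmt_dir",
        "concat_all_joined_by_stmt_dir",
        "concat_all_joined_dir",
        "concat_all_standardized_by_stmt_dir",
    ],
}


def find_missing_options(parser: ConfigParser) -> List[str]:
    """Return 'Section.option' names that are expected but not configured."""
    missing = []
    for section, options in REQUIRED_OPTIONS.items():
        for option in options:
            # has_option also returns False if the whole section is absent
            if not parser.has_option(section, option):
                missing.append(f"{section}.{option}")
    return missing


# Hypothetical, deliberately incomplete sample used only for demonstration.
sample = """
[Filter]
filtered_quarterly_joined_by_stmt_dir = C:/data/sec/automated/_1_by_quarter/_1_filtered_joined_by_stmt
parallelize = True

[Concat]
concat_all_joined_dir = C:/data/sec/automated/_3_all/_2_joined
"""

parser = ConfigParser()
parser.read_string(sample)
missing_options = find_missing_options(parser)
print(f"{len(missing_options)} options missing, first one: {missing_options[0]}")
```

To check a real file, `parser.read(<path to your configuration file>)` can be used instead of `read_string`.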

The function adds additional steps to process the data from the quarterly zip files, to process the data of filed reports
that are not yet in the quarterly zip files, and finally to combine all the data.

#### Processing the data in the quarterly zip files
The first step creates a joined bag for every zip file. The bag is filtered for 10-K and 10-Q reports only,
and the filters `ReportPeriodRawFilter`, `MainCoregRawFilter`, `USDOnlyRawFilter`, and `OfficialTagsOnlyRawFilter` are applied as well.
Furthermore, the data is split by statement (stmt). If you set `parallelize = False`, the step uses less memory in the initial run
but is a little slower. This only makes a difference during the first run, when all available past quarters
are processed.
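
One detail to keep in mind when reading such a flag in your own code: `configparser` stores every value as text, so `get` returns the string `"False"`, which is truthy in Python, whereas `getboolean` converts it into a real boolean. The sketch below uses only the standard library and a made-up inline configuration; it is independent of secfsdstools.

```python
from configparser import ConfigParser

parser = ConfigParser()
# Made-up inline configuration for demonstration purposes.
parser.read_string(
    """
[Filter]
parallelize = False
"""
)

# get() returns the raw text, getboolean() converts it
as_text = parser.get(section="Filter", option="parallelize")
as_bool = parser.getboolean(section="Filter", option="parallelize", fallback=True)

# A non-empty string is always truthy, so negating it never yields True.
serial_from_text = not as_text
# The converted boolean behaves as intended.
serial_from_bool = not as_bool

print(repr(as_text), serial_from_text)
print(repr(as_bool), serial_from_bool)
```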

The filtered joined bag is stored under the path that is defined under `filtered_quarterly_joined_by_stmt_dir` in the configuration file.
The resulting directory structure will look like this:


    <filtered_quarterly_joined_by_stmt_dir>
        quarter
            2009q2.zip
                BS
                CF
                CI
                CP
                EQ
                IS
            ...

The second step uses the results of the first step and creates standardized bags for every quarter.
The results are stored under the path that is defined under `standardized_quarterly_by_stmt_dir` and the structure will look like this:

    <standardized_quarterly_by_stmt_dir>
        2009q2.zip
            BS
            CF
            IS
        2009q3.zip
            BS
            CF
            IS
        ...

The third step concatenates, per statement, all available data from the first step.
So, you will have one bag with all BS information for all quarters, one for CF, and so on.
The results are stored under the path that is defined under `concat_quarterly_joined_by_stmt_dir` and the structure will look like this:

    <concat_quarterly_joined_by_stmt_dir>
        BS
        CF
        CI
        CP
        EQ
        IS
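
The per-statement selection can be pictured as a glob over the quarter folders: a pattern such as `*/BS` matches the `BS` subfolder of every quarter. The sketch below builds a small throwaway directory tree and applies that pattern with `pathlib`. Treating the pattern as a plain glob is an assumption made for illustration, and the folders are created only for this demonstration.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Recreate a tiny version of the "<...>/quarter/<zip name>/<stmt>" layout.
    root = Path(tmp) / "quarter"
    for quarter in ("2009q2.zip", "2009q3.zip"):
        for stmt in ("BS", "CF", "IS"):
            (root / quarter / stmt).mkdir(parents=True)

    # Select the BS subfolder of every quarter, relative to the root directory.
    bs_folders = sorted(p.relative_to(root).as_posix() for p in root.glob("*/BS"))

print(bs_folders)
```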

The fourth step concatenates the results from the third step into a single bag. 
So, you will have all data from all quarters in one bag. Especially when using predicate pushdown, you will still get
reasonable load performance.

The resulting bag is stored under the path that is defined under `concat_quarterly_joined_all_dir`.


The fifth step concatenates the standardized bags (per statement). You will get a single standardized bag for each of
BS, CF, and IS, containing all the data from all quarters.

The results are stored under the path that is defined under `concat_quarterly_standardized_by_stmt_dir` and the structure will look like this:

    <concat_quarterly_standardized_by_stmt_dir>
        BS
        CF
        IS
        all


#### Daily processing of the data not yet in the quarterly zip files
The same steps are executed for the filed reports that are not yet contained in the available quarterly zip files.

- The results of the filtering step are stored at `filtered_daily_joined_by_stmt_dir`.
- The results of the standardizing step are stored at `standardized_daily_by_stmt_dir`.
- The bags that were joined per statement are stored at `concat_daily_joined_by_stmt_dir`.
- The bag containing all joined data is stored at `concat_daily_joined_all_dir`.
- The standardized bags for BS, IS, and CF are stored at `concat_daily_standardized_by_stmt_dir`.

#### Combining quarterly and daily data
Finally, bags containing both the quarterly and the daily data are created in the following three steps.

First, bags containing joined per statement data are stored under `concat_all_joined_by_stmt_dir`.
Second, one single bag containing the joined data is stored under `concat_all_joined_dir`.
Third, standardized bags for BS, IS, and CF are stored under `concat_all_standardized_by_stmt_dir`.

**Hint: These bags can now be loaded directly with the load method of JoinedDataBag or StandardizedBag, respectively.**


### How the example is implemented

Let us have a look at the implementation of the function `define_extra_processes`:


In [None]:
def define_extra_processes(configuration: Configuration) -> List[AbstractProcess]:
    filtered_quarterly_joined_by_stmt_dir = configuration.get_parser().get(
        section="Filter", option="filtered_quarterly_joined_by_stmt_dir"
    )
    filtered_daily_joined_by_stmt_dir = configuration.get_parser().get(
        section="Filter", option="filtered_daily_joined_by_stmt_dir"
    )

    filter_parallelize = configuration.get_parser().get(section="Filter", option="parallelize", fallback=True)

    standardized_quarterly_by_stmt_dir = configuration.get_parser().get(
        section="Standardizer", option="standardized_quarterly_by_stmt_dir"
    )
    standardized_daily_by_stmt_dir = configuration.get_parser().get(
        section="Standardizer", option="standardized_daily_by_stmt_dir"
    )

    concat_quarterly_joined_by_stmt_dir = configuration.get_parser().get(
        section="Concat", option="concat_quarterly_joined_by_stmt_dir"
    )
    concat_daily_joined_by_stmt_dir = configuration.get_parser().get(
        section="Concat", option="concat_daily_joined_by_stmt_dir"
    )

    concat_quarterly_joined_all_dir = configuration.get_parser().get(
        section="Concat", option="concat_quarterly_joined_all_dir"
    )
    concat_daily_joined_all_dir = configuration.get_parser().get(section="Concat", option="concat_daily_joined_all_dir")

    concat_quarterly_standardized_by_stmt_dir = configuration.get_parser().get(
        section="Concat", option="concat_quarterly_standardized_by_stmt_dir"
    )
    concat_daily_standardized_by_stmt_dir = configuration.get_parser().get(
        section="Concat", option="concat_daily_standardized_by_stmt_dir"
    )

    concat_all_joined_by_stmt_dir = configuration.get_parser().get(
        section="Concat", option="concat_all_joined_by_stmt_dir"
    )

    concat_all_joined_dir = configuration.get_parser().get(section="Concat", option="concat_all_joined_dir")

    concat_all_standardized_by_stmt_dir = configuration.get_parser().get(
        section="Concat", option="concat_all_standardized_by_stmt_dir"
    )

    processes: List[AbstractProcess] = []

    # QUARTERLY DATA Processing    
    # processes the data from the quarterly zip files from the SEC
    processes.append(LoggingProcess(title="Post Update Processes For Quarterly Data Started", lines=[]))
    
    processes.append(
        # 1. Filter, join, and save by stmt
        FilterProcess(
            db_dir=configuration.db_dir,
            target_dir=filtered_quarterly_joined_by_stmt_dir,
            bag_type="joined",
            save_by_stmt=True,
            execute_serial=not filter_parallelize,
            file_type="quarter",
        )
    )

    processes.append(
        # 2. Standardize the data for every quarter
        StandardizeProcess(
            root_dir=f"{filtered_quarterly_joined_by_stmt_dir}/quarter", target_dir=standardized_quarterly_by_stmt_dir
        ),
    )

    processes.extend(
        [
            # 3. building datasets with all entries by stmt
            ConcatByNewSubfoldersProcess(
                root_dir=f"{filtered_quarterly_joined_by_stmt_dir}/quarter",
                target_dir=f"{concat_quarterly_joined_by_stmt_dir}/BS",
                pathfilter="*/BS",
            ),
            ConcatByNewSubfoldersProcess(
                root_dir=f"{filtered_quarterly_joined_by_stmt_dir}/quarter",
                target_dir=f"{concat_quarterly_joined_by_stmt_dir}/CF",
                pathfilter="*/CF",
            ),
            ConcatByNewSubfoldersProcess(
                root_dir=f"{filtered_quarterly_joined_by_stmt_dir}/quarter",
                target_dir=f"{concat_quarterly_joined_by_stmt_dir}/CI",
                pathfilter="*/CI",
            ),
            ConcatByNewSubfoldersProcess(
                root_dir=f"{filtered_quarterly_joined_by_stmt_dir}/quarter",
                target_dir=f"{concat_quarterly_joined_by_stmt_dir}/CP",
                pathfilter="*/CP",
            ),
            ConcatByNewSubfoldersProcess(
                root_dir=f"{filtered_quarterly_joined_by_stmt_dir}/quarter",
                target_dir=f"{concat_quarterly_joined_by_stmt_dir}/EQ",
                pathfilter="*/EQ",
            ),
            ConcatByNewSubfoldersProcess(
                root_dir=f"{filtered_quarterly_joined_by_stmt_dir}/quarter",
                target_dir=f"{concat_quarterly_joined_by_stmt_dir}/IS",
                pathfilter="*/IS",
            ),
        ]
    )

    # 4. create a single joined bag with all the data filtered and joined
    processes.append(
        ConcatByChangedTimestampProcess(
            root_dir=concat_quarterly_joined_by_stmt_dir,
            target_dir=concat_quarterly_joined_all_dir,
        )
    )

    # 5. concate the standardized bags together by stmt (BS, IS, CF).
    processes.extend(
        [
            ConcatByNewSubfoldersProcess(
                root_dir=standardized_quarterly_by_stmt_dir,
                target_dir=f"{concat_quarterly_standardized_by_stmt_dir}/BS",
                pathfilter="*/BS",
                in_memory=True,  # Standardized Bag only work with in_memory
            ),
            ConcatByNewSubfoldersProcess(
                root_dir=standardized_quarterly_by_stmt_dir,
                target_dir=f"{concat_quarterly_standardized_by_stmt_dir}/CF",
                pathfilter="*/CF",
                in_memory=True,  # Standardized Bag only work with in_memory
            ),
            ConcatByNewSubfoldersProcess(
                root_dir=standardized_quarterly_by_stmt_dir,
                target_dir=f"{concat_quarterly_standardized_by_stmt_dir}/IS",
                pathfilter="*/IS",
                in_memory=True,  # Standardized Bag only work with in_memory
            ),
        ]
    )

    # DAILY DATA Processing
    processes.append(LoggingProcess(title="Post Update Processes For Daily Data Started", lines=[]))

    # clean daily data covered now by quarterly data
    processes.append(
        ClearDailyDataProcess(
            db_dir=configuration.db_dir,
            filtered_daily_joined_by_stmt_dir=filtered_daily_joined_by_stmt_dir,
            standardized_daily_by_stmt_dir=standardized_daily_by_stmt_dir,
        )
    )

    # 1. Filter, join, and save by stmt
    processes.append(
        FilterProcess(
            db_dir=configuration.db_dir,
            target_dir=filtered_daily_joined_by_stmt_dir,
            bag_type="joined",
            save_by_stmt=True,
            execute_serial=not filter_parallelize,
            file_type="daily",
        )
    )

    processes.append(
        # 2. Standardize the data for daily data
        StandardizeProcess(
            root_dir=f"{filtered_daily_joined_by_stmt_dir}/daily", target_dir=standardized_daily_by_stmt_dir
        ),
    )

    processes.extend(
        [
            # 3. building datasets with all entries by stmt for daily data
            ConcatByChangedTimestampProcess(
                root_dir=f"{filtered_daily_joined_by_stmt_dir}/daily",
                target_dir=f"{concat_daily_joined_by_stmt_dir}/BS",
                pathfilter="*/BS",
            ),
            ConcatByChangedTimestampProcess(
                root_dir=f"{filtered_daily_joined_by_stmt_dir}/daily",
                target_dir=f"{concat_daily_joined_by_stmt_dir}/CF",
                pathfilter="*/CF",
            ),
            ConcatByChangedTimestampProcess(
                root_dir=f"{filtered_daily_joined_by_stmt_dir}/daily",
                target_dir=f"{concat_daily_joined_by_stmt_dir}/CI",
                pathfilter="*/CI",
            ),
            ConcatByChangedTimestampProcess(
                root_dir=f"{filtered_daily_joined_by_stmt_dir}/daily",
                target_dir=f"{concat_daily_joined_by_stmt_dir}/CP",
                pathfilter="*/CP",
            ),
            ConcatByChangedTimestampProcess(
                root_dir=f"{filtered_daily_joined_by_stmt_dir}/daily",
                target_dir=f"{concat_daily_joined_by_stmt_dir}/EQ",
                pathfilter="*/EQ",
            ),
            ConcatByChangedTimestampProcess(
                root_dir=f"{filtered_daily_joined_by_stmt_dir}/daily",
                target_dir=f"{concat_daily_joined_by_stmt_dir}/IS",
                pathfilter="*/IS",
            ),
        ]
    )

    # 4. create a single joined bag with all the data filtered and joined for daily data
    processes.append(
        ConcatByChangedTimestampProcess(
            root_dir=concat_daily_joined_by_stmt_dir,
            target_dir=concat_daily_joined_all_dir,
        )
    )

    # 5. concate the standardized bags together by stmt (BS, IS, CF) for daily data.
    processes.extend(
        [
            ConcatByNewSubfoldersProcess(
                root_dir=standardized_daily_by_stmt_dir,
                target_dir=f"{concat_daily_standardized_by_stmt_dir}/BS",
                pathfilter="*/BS",
                in_memory=True,  # Standardized Bag only work with in_memory
            ),
            ConcatByNewSubfoldersProcess(
                root_dir=standardized_daily_by_stmt_dir,
                target_dir=f"{concat_daily_standardized_by_stmt_dir}/CF",
                pathfilter="*/CF",
                in_memory=True,  # Standardized Bag only work with in_memory
            ),
            ConcatByNewSubfoldersProcess(
                root_dir=standardized_daily_by_stmt_dir,
                target_dir=f"{concat_daily_standardized_by_stmt_dir}/IS",
                pathfilter="*/IS",
                in_memory=True,  # Standardized Bag only work with in_memory
            ),
        ]
    )

    # Concat daily and quarter together
    processes.append(
        LoggingProcess(title="Post Update Processes To Combine Quarterly And Daily Data Started", lines=[])
    )

    # 1. concat joined_by_statement
    processes.extend(
        [
            ConcatMultiRootByChangedTimestampProcess(
                root_dirs=[concat_quarterly_joined_by_stmt_dir, concat_daily_joined_by_stmt_dir],
                target_dir=f"{concat_all_joined_by_stmt_dir}/BS",
                pathfilter="BS",
            ),
            ConcatMultiRootByChangedTimestampProcess(
                root_dirs=[concat_quarterly_joined_by_stmt_dir, concat_daily_joined_by_stmt_dir],
                target_dir=f"{concat_all_joined_by_stmt_dir}/CF",
                pathfilter="CF",
            ),
            ConcatMultiRootByChangedTimestampProcess(
                root_dirs=[concat_quarterly_joined_by_stmt_dir, concat_daily_joined_by_stmt_dir],
                target_dir=f"{concat_all_joined_by_stmt_dir}/CI",
                pathfilter="CI",
            ),
            ConcatMultiRootByChangedTimestampProcess(
                root_dirs=[concat_quarterly_joined_by_stmt_dir, concat_daily_joined_by_stmt_dir],
                target_dir=f"{concat_all_joined_by_stmt_dir}/CP",
                pathfilter="CP",
            ),
            ConcatMultiRootByChangedTimestampProcess(
                root_dirs=[concat_quarterly_joined_by_stmt_dir, concat_daily_joined_by_stmt_dir],
                target_dir=f"{concat_all_joined_by_stmt_dir}/EQ",
                pathfilter="EQ",
            ),
            ConcatMultiRootByChangedTimestampProcess(
                root_dirs=[concat_quarterly_joined_by_stmt_dir, concat_daily_joined_by_stmt_dir],
                target_dir=f"{concat_all_joined_by_stmt_dir}/IS",
                pathfilter="IS",
            ),
        ]
    )

    # 2. concat joined
    processes.append(
        ConcatMultiRootByChangedTimestampProcess(
            root_dirs=[concat_daily_joined_all_dir, concat_quarterly_joined_all_dir],
            pathfilter="",
            target_dir=concat_all_joined_dir,
        )
    )

    # 3. concat standardized by statement
    processes.extend(
        [
            ConcatMultiRootByChangedTimestampProcess(
                root_dirs=[concat_daily_standardized_by_stmt_dir, concat_quarterly_standardized_by_stmt_dir],
                target_dir=f"{concat_all_standardized_by_stmt_dir}/BS",
                pathfilter="BS",
                in_memory=True,  # Standardized Bag only work with in_memory
            ),
            ConcatMultiRootByChangedTimestampProcess(
                root_dirs=[concat_daily_standardized_by_stmt_dir, concat_quarterly_standardized_by_stmt_dir],
                target_dir=f"{concat_all_standardized_by_stmt_dir}/CF",
                pathfilter="CF",
                in_memory=True,  # Standardized Bag only work with in_memory
            ),
            ConcatMultiRootByChangedTimestampProcess(
                root_dirs=[concat_daily_standardized_by_stmt_dir, concat_quarterly_standardized_by_stmt_dir],
                target_dir=f"{concat_all_standardized_by_stmt_dir}/IS",
                pathfilter="IS",
                in_memory=True,  # Standardized Bag only work with in_memory
            ),
        ]
    )
    return processes
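
The per-statement entries in the lists above differ only in the statement abbreviation, so such a list could also be produced in a loop. The following sketch uses a stand-in dataclass `ConcatSpec` (hypothetical, not a secfsdstools class) and made-up directory names to show the idea without requiring the library; in real code the actual process class would be instantiated instead.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ConcatSpec:
    """Stand-in for a concat process; holds only the three varying arguments."""

    root_dir: str
    target_dir: str
    pathfilter: str


# Statements that appear in the joined bags of the example.
JOINED_STMTS = ["BS", "CF", "CI", "CP", "EQ", "IS"]


def concat_specs_by_stmt(root_dir: str, target_base_dir: str, stmts: List[str]) -> List[ConcatSpec]:
    """Create one spec per statement, mirroring the repeated blocks in the example."""
    return [
        ConcatSpec(
            root_dir=root_dir,
            target_dir=f"{target_base_dir}/{stmt}",
            pathfilter=f"*/{stmt}",
        )
        for stmt in stmts
    ]


# Made-up directory names, used only for demonstration.
specs = concat_specs_by_stmt(
    root_dir="filtered/quarter",
    target_base_dir="concat_by_stmt",
    stmts=JOINED_STMTS,
)
for spec in specs:
    print(spec)
```

Writing the entries out explicitly, as the example does, is more verbose but keeps every step visible at a glance; the loop is merely a more compact alternative.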

### How the created bags can be used

The most important bags are the ones that contain all the available data and are updated every day:
    
Joined and filtered bags by stmt (statement):
- **[concat_all_joined_by_stmt_dir]/BS** <br/> Contains a joined bag with all the BS datapoints that are available on the SEC
- **[concat_all_joined_by_stmt_dir]/CF** <br/> Contains a joined bag with all the CF datapoints that are available on the SEC
- **[concat_all_joined_by_stmt_dir]/CI** <br/> Contains a joined bag with all the CI datapoints that are available on the SEC
- **[concat_all_joined_by_stmt_dir]/CP** <br/> Contains a joined bag with all the CP datapoints that are available on the SEC
- **[concat_all_joined_by_stmt_dir]/EQ** <br/> Contains a joined bag with all the EQ datapoints that are available on the SEC
- **[concat_all_joined_by_stmt_dir]/IS** <br/> Contains a joined bag with all the IS datapoints that are available on the SEC

Joined and filtered single bag:
- **[concat_all_joined_dir]** <br/> Contains a single joined bag with all datapoints that are available on the SEC

Standardized bags by stmt (statement):
- **[concat_all_standardized_by_stmt_dir]/BS** <br/> Contains a standardized bag from all the BS datapoints that are available on the SEC
- **[concat_all_standardized_by_stmt_dir]/CF** <br/> Contains a standardized bag from all the CF datapoints that are available on the SEC
- **[concat_all_standardized_by_stmt_dir]/IS** <br/> Contains a standardized bag from all the IS datapoints that are available on the SEC


First, let us load the configuration, so that we can get the paths to the bags directly from the configuration file:

In [1]:
import os
from secfsdstools.a_config.configmodel import Configuration
from secfsdstools.a_config.configmgt import ConfigurationManager, SECFSDSTOOLS_ENV_VAR_NAME

# set the path to your config file containing the configuration shown above in the SECFSDSTOOLS_CFG env variable, if it is not in your user home
#os.environ[SECFSDSTOOLS_ENV_VAR_NAME] = "..." 

configuration = ConfigurationManager.read_config_file()

2025-03-05 16:13:53,106 [INFO] configmgt  reading configuration from C:\Users\hansj\.secfsdstools.cfg




Next, if you want to analyze just **BS data**, you can simply load the appropriate bag:

In [11]:
from secfsdstools.d_container.databagmodel import JoinedDataBag

concat_all_joined_by_stmt_dir = configuration.config_parser.get(section="Concat", option="concat_all_joined_by_stmt_dir")

all_bs_joined_bag = JoinedDataBag.load(target_path=f"{concat_all_joined_by_stmt_dir}/BS") # loading all the available BS data
print(all_bs_joined_bag.pre_num_df.shape)

(19657047, 17)


With the single joined bag and predicate pushdown, you can also easily load a single report by its adsh (even a recent one). This still performs reasonably well, even though the file is about 1.3 GB (as of Q1 2025).

In [3]:
from secfsdstools.d_container.databagmodel import JoinedDataBag

concat_all_joined_all_dir = configuration.config_parser.get(section="Concat", option="concat_all_joined_dir")

apple_10k_2022_adsh = "0000320193-22-000108"
a_single_report = JoinedDataBag.load(target_path=f"{concat_all_joined_all_dir}", adshs_filter=[apple_10k_2022_adsh]) # loading a single report by its adsh

print(a_single_report.sub_df.shape)
print(a_single_report.pre_num_df.shape)

2025-03-05 16:14:32,022 [INFO] databagmodel  apply sub_df filter: [('adsh', 'in', ['0000320193-22-000108'])]
2025-03-05 16:14:32,536 [INFO] databagmodel  apply pre_num_df filter: ["('adsh', 'in', ['0000320193-22-000108'])"]


(1, 36)
(179, 17)


While this is a little bit slower than using the `SingleReportCollector`,

<pre>
from secfsdstools.e_collector.reportcollecting import SingleReportCollector

apple_10k_2022_adsh = "0000320193-22-000108"

collector: SingleReportCollector = SingleReportCollector.get_report_by_adsh(adsh=apple_10k_2022_adsh)
a_single_report = collector.collect().join()
</pre>

it still performs reasonably, and you use a bag that is already filtered according to your needs.

Moreover, since you have a single bag with all the data, you can also use it to load data for different companies and multiple years. Let's say we want to read the data of all 10-K reports from Microsoft, Alphabet, and Amazon:

In [6]:
ciks = [789019, 1652044, 1018724]  # Microsoft, Alphabet, Amazon

all_assorted_10Ks = JoinedDataBag.load(target_path=f"{concat_all_joined_all_dir}", forms_filter=["10-K"], ciks_filter=ciks) # loading all 10-Ks for Microsoft, Alphabet, and Amazon

print(all_assorted_10Ks.sub_df.shape)
print(all_assorted_10Ks.pre_num_df.shape)

2025-03-05 16:18:49,404 [INFO] databagmodel  apply sub_df filter: [('cik', 'in', [789019, 1652044, 1018724]), ('form', 'in', ['10-K'])]
2025-03-05 16:18:49,604 [INFO] databagmodel  apply pre_num_df filter: ["('adsh', 'in', ['0001193125-10-016098', '0001193125-10-171791', '0001193125-11-016253', '0001193125-...)"]


(39, 36)
(8098, 17)


This query also performs quite well thanks to predicate pushdown, considering that the file is about 1.3 GB in size.

## Conclusion

This example pipeline enables you to concatenate all the data from all quarters into a single bag, and it does so in a memory-efficient way.

Moreover, using predicate pushdown in the load methods, you can easily retrieve the data from single reports, or also from different companies. 

In addition, you also have the standardized data for BS, IS, and CF, so that you can compare the data between different years and/or companies.

Last but not least, the bags are updated automatically, as soon as new data is available on the SEC's website.
