# Bulk Data - Memory Efficiency

This notebook gives some ideas on how to keep your memory footprint as low as possible. This is especially important when you work with larger datasets, for instance a dataset that contains the data of all quarters.

<span style="color: #FF8C00;">==========================================================</span>

**If you find this tool useful, a sponsorship would be greatly appreciated!**

**https://github.com/sponsors/HansjoergW**

How to get in touch

* Found a bug: https://github.com/HansjoergW/sec-fincancial-statement-data-set/issues
* Have a remark: https://github.com/HansjoergW/sec-fincancial-statement-data-set/discussions/categories/general
* Have an idea: https://github.com/HansjoergW/sec-fincancial-statement-data-set/discussions/categories/ideas
* Have a question: https://github.com/HansjoergW/sec-fincancial-statement-data-set/discussions/categories/q-a
* Have something to show: https://github.com/HansjoergW/sec-fincancial-statement-data-set/discussions/categories/show-and-tell

<span style="color: #FF8C00;">==========================================================</span>

## Use Predicate Pushdown - Apply Filters During Loading

One big advantage of Parquet files is that filters can be applied directly while loading, so that less data has to be read into memory. This is called Predicate Pushdown.
For instance, the `read_parquet` function of pandas provides this feature through the `filters` parameter (with the pyarrow engine).

The SECFSDSTools also uses this feature.

### Use Predicate Pushdown in Collectors

The collectors `CompanyReportCollector`, `MultiReportCollector`, and `ZipCollector` provide the following optional filter parameters: `forms_filter`, `stmt_filter`, and `tag_filter`. Defining these filters ensures that less data is read, so the data is loaded faster and less memory is consumed.


In [6]:
from typing import List
from secfsdstools.e_collector.companycollecting import CompanyReportCollector

# Apple, Microsoft, Alphabet, Nvidia, Amazon, AMD, Intel
ciks_to_load: List[int] = [320193, 789019, 1652044, 1045810, 1018724, 2488, 50863]

# The filters are applied while loading: only 10-K reports and only BS, IS, and CF entries are read.
bag = CompanyReportCollector.get_company_collector(
    ciks=ciks_to_load,
    forms_filter=['10-K'],
    stmt_filter=['BS', 'IS', 'CF'],
).collect()

2025-02-17 17:05:31,554 [INFO] configmgt  reading configuration from C:\Users\hansj\.secfsdstools.cfg
2025-02-17 17:05:32,235 [INFO] parallelexecution      items to process: 46
2025-02-17 17:05:50,111 [INFO] parallelexecution      commited chunk: 0


### Use Predicate Pushdown in Load Methods

Since version 2.1, Predicate Pushdown is also supported in the `load` methods of `RawDataBag` and `JoinedDataBag`. This is especially useful if you created a single bag containing all prefiltered data (as described in 06_bulk_data_processing_deep_dive or in 08_00_automation_basics).

**Some numbers**

Reading the `_4_single_bag/all` bag (created as described in 08_00_automation_basics) and then filtering for all "BS" entries
<pre>
    file_path = Path("c:/data/sec/automated/_4_single_bag/all")

    joined = JoinedDataBag.load(str(file_path))
    joined_filtered = joined[StmtJoinedFilter(stmts=['BS'])]
</pre>

took about **45 seconds** and needed about **20 GB** of memory.

Doing the same with Predicate Pushdown in the `load` method
<pre>
    file_path = Path("c:/data/sec/automated/_4_single_bag/all")

    joined = JoinedDataBag.load(target_path=str(file_path), stmt_filter=['BS'])
</pre>

took about **15 seconds** and consumed **7 GB** of memory.

### Write Your Own Predicate Pushdown Load Methods

If you have special filter requirements, consider writing your own `load` function that uses Predicate Pushdown.

## Concat by Folders Instead of Concat by Loaded Bags

Version 2.1 introduced the possibility to concatenate bags directly at the folder/file level. When you use these features, the parquet files are concatenated directly on the file system (thanks to pyarrow) instead of being loaded and combined with the pandas `concat` function. This leads to very low memory consumption and makes it possible to concatenate large amounts of data into a single file. Combined with Predicate Pushdown, this lets you work with a huge dataset even on limited hardware resources.

In detail, the framework provides the following features:

**`from secfsdstools.a_utils.fileutils import concat_parquet_files`**

Concatenates sub.txt, num.txt, pre.txt, and pre_num.txt parquet files. Note: the name of the defined output file determines which kind of file it is (sub.txt, num.txt, ...).

In [1]:
from secfsdstools.a_utils.fileutils import concat_parquet_files

sub_txt_to_concat = ['a/sub.txt.parquet', 'b/sub.txt.parquet', 'c/sub.txt.parquet']
output_file = 'sub.txt.parquet'

concat_parquet_files(input_files=sub_txt_to_concat, output_file=output_file)

Skipping empty file: a/sub.txt.parquet
Skipping empty file: b/sub.txt.parquet
Skipping empty file: c/sub.txt.parquet


**`RawDataBag.concat_filebased` and `JoinedDataBag.concat_filebased`**

The `concat_filebased` method provided by `RawDataBag` and `JoinedDataBag` takes three input parameters:

* `paths_to_concat`: a list of the input folders (each containing the data of a `RawDataBag` or of a `JoinedDataBag`, respectively)
* `target_path`: the folder to which the result is written
* `drop_duplicates_sub_df`: a flag that indicates whether the sub_df must be checked for duplicates

The duplicate check matters, for instance, if you separated the data for balance sheets, income statements, and cash flow statements into different bags and now want to concatenate them. All of these bags would contain the same sub data, so you need to make sure that you do not get duplicated entries in sub.

In [2]:
from pathlib import Path
from secfsdstools.d_container.databagmodel import RawDataBag

raw_databag_folders = [Path('bag1'), Path('bag2'), Path('bag3')]
output_folder = Path('out')

RawDataBag.concat_filebased(paths_to_concat=raw_databag_folders, target_path=output_folder, drop_duplicates_sub_df=False)

Skipping empty file: bag1\pre.txt.parquet
Skipping empty file: bag2\pre.txt.parquet
Skipping empty file: bag3\pre.txt.parquet
Skipping empty file: bag1\num.txt.parquet
Skipping empty file: bag2\num.txt.parquet
Skipping empty file: bag3\num.txt.parquet
Skipping empty file: bag1\sub.txt.parquet
Skipping empty file: bag2\sub.txt.parquet
Skipping empty file: bag3\sub.txt.parquet
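Why the duplicate check on sub can be necessary is illustrated by the following sketch with plain pandas. It does not use the file-based implementation of the library, and the two small DataFrames are made-up stand-ins for the sub data of two bags that were split by statement type but cover the same report.

```python
import pandas as pd

# Two hypothetical bags (e.g. one for BS, one for IS) that cover the same report
# and therefore carry an identical sub entry.
sub_bs = pd.DataFrame({"adsh": ["a-1"], "cik": [320193], "form": ["10-K"]})
sub_is = pd.DataFrame({"adsh": ["a-1"], "cik": [320193], "form": ["10-K"]})

# Concatenating without a check keeps the report twice.
naive = pd.concat([sub_bs, sub_is], ignore_index=True)

# Dropping duplicates leaves one entry per report.
deduplicated = naive.drop_duplicates(ignore_index=True)

print(len(naive), len(deduplicated))
```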


**Automation Processes `ConcatByNewSubfoldersProcess` and `ConcatByChangedTimestampProcess`**

If you use *automation* as described in 08_00_automation_basics, you very likely use `ConcatByNewSubfoldersProcess` and `ConcatByChangedTimestampProcess`. With version 2.1, those classes also use the `concat_filebased` methods and therefore have a significantly lower memory footprint than in the previous version.