In [1]:
import pandas as pd
# ensure that all columns are shown and that column content is not cut
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.width', 1000)
pd.set_option('display.max_rows', 500) # ensure that all rows are shown

# Bulk Data Processing
The main advantage of this library is that all the data is downloaded to your computer, which makes it possible to analyze the datasets as a whole, for instance if you want to implement your own screener.

On the file system alone, the data files take up more than 2 GB. Since the Parquet format is storage-optimized, loading all the data into memory would need significantly more memory than a standard computer or laptop provides.

Hence it is important to filter the data during the loading process.

So in this notebook, we will create different datasets for the balance sheet datapoints, the cashflow datapoints, and the income statement datapoints.
These datasets will be stored in their own directories, so that they can be easily loaded afterwards. Moreover, we will store the raw version (where the num_df and the pre_df are not joined) and the joined version, where num_df and pre_df are joined.

This notebook will show two approaches. The first one is loading all the data in parallel, which you can do if you have enough resources in your computer. The second is doing it sequentially, which is slower, but needs less memory.

We will also apply different filters:

* load only 10-K and 10-Q reports (this filter is applied during loading)
* `ReportPeriodRawFilter`: since we are only interested in datapoints that belong to the period of the report
* `MainCoregRawFilter`: since we don't want to see datapoints of a subsidiary
* `OfficialTagsOnlyRawFilter`: since we want to be able to compare the content and therefore don't want to read tags that are not in the standard SEC XBRL definition
* `USDOnlyRawFilter`: since we are not interested in money datapoints that are not in USD

## Basics
First, we will define some basics that are used by both approaches.
We will start with the imports.

In [3]:
import os
from secfsdstools.d_container.databagmodel import RawDataBag, JoinedDataBag
from secfsdstools.e_collector.zipcollecting import ZipCollector

The following list defines which statements we want to load.

In [6]:
statements_to_load = ["BS", "CF", "IS"]

Next, we define a filter function that sets up the whole filter chain. As mentioned in the 04_collector_deep_dive.ipynb notebook, we have to place the imports inside the function itself if we want to use it in Jupyter together with parallelization.

In [7]:
def postloadfilter(databag: RawDataBag) -> RawDataBag:
    from secfsdstools.e_filter.rawfiltering import ReportPeriodRawFilter, MainCoregRawFilter, OfficialTagsOnlyRawFilter, USDOnlyRawFilter

    return databag[ReportPeriodRawFilter()][MainCoregRawFilter()][OfficialTagsOnlyRawFilter()][USDOnlyRawFilter()]
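
The bracket chaining above works because indexing a databag with a filter returns a new, filtered databag, which can then be indexed again. The following is a minimal, standard-library-only sketch of that pattern. `ToyBag`, `KeepUnit`, and `KeepEmptyCoreg` are hypothetical names made up for illustration; this is not the secfsdstools implementation.

```python
from dataclasses import dataclass
from typing import Dict, List

Row = Dict[str, str]


@dataclass(frozen=True)
class ToyBag:
    """Hypothetical stand-in for a databag: just a list of rows."""
    rows: List[Row]

    def __getitem__(self, bagfilter) -> "ToyBag":
        # Indexing with a filter object applies it and returns a new bag,
        # which is what makes bag[f1][f2][f3] chainable.
        return bagfilter.filter(self)


class KeepUnit:
    """Hypothetical filter: keep rows whose 'uom' equals the given unit."""

    def __init__(self, unit: str) -> None:
        self.unit = unit

    def filter(self, bag: ToyBag) -> ToyBag:
        return ToyBag([r for r in bag.rows if r["uom"] == self.unit])


class KeepEmptyCoreg:
    """Hypothetical filter: keep rows that have no co-registrant set."""

    def filter(self, bag: ToyBag) -> ToyBag:
        return ToyBag([r for r in bag.rows if r["coreg"] == ""])


bag = ToyBag([
    {"tag": "Assets", "uom": "USD", "coreg": ""},
    {"tag": "Assets", "uom": "EUR", "coreg": ""},
    {"tag": "Assets", "uom": "USD", "coreg": "SubsidiaryCo"},
])

# Each bracket returns a new bag, so the filters are applied left to right.
filtered = bag[KeepUnit("USD")][KeepEmptyCoreg()]
print(len(filtered.rows))
```

Because every step returns a new object, the original bag stays untouched and the order of the brackets is the order in which the filters run.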

Next is a simple function that takes a raw databag and creates the joined databag. Both the raw databag and the joined databag are then stored in their own folders.


In [8]:
def save_databag(databag: RawDataBag, financial_statement: str, base_path: str) -> JoinedDataBag:
    target_path_raw = os.path.join(base_path, financial_statement, 'raw')
    print(f"store rawdatabag under {target_path_raw}")
    os.makedirs(target_path_raw, exist_ok=True)
    databag.save(target_path_raw)
    
    target_path_joined = os.path.join(base_path, financial_statement, 'joined')
    os.makedirs(target_path_joined, exist_ok=True)
    print("create joined databag")
    joined_databag = databag.join()
    print(f"store joineddatabag under {target_path_joined}")
    joined_databag.save(target_path_joined)
    return joined_databag
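
The folder layout that `save_databag` produces can be tried out without any data. The sketch below only creates the directories, inside a temporary folder; the temporary base path is an assumption for illustration, and no databag is saved.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as base_path:
    created = []
    for financial_statement in ["BS", "CF", "IS"]:
        for variant in ["raw", "joined"]:
            target = os.path.join(base_path, financial_statement, variant)
            # exist_ok=True makes a re-run safe: no error if the folder exists.
            os.makedirs(target, exist_ok=True)
            os.makedirs(target, exist_ok=True)
            created.append(os.path.relpath(target, base_path))

print(sorted(created))
```

Note that `os.path.join` uses the separator of the platform it runs on, which is why the log output further down shows a backslash in the paths on Windows.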

## Parallel Data Loading
As stated above, we want to load all available 10-K and 10-Q reports. Therefore, we can use the `ZipCollector`, which provides an option to load data from all available zip files.

Moreover, the implementation of the zip loader uses all your cores to load data from disk into memory, so you don't have to implement the parallelization yourself. There are 50+ zip files to load, so with 4 cores, 4 files are loaded at a time.
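
To make the scheduling concrete: with a pool of N workers, at most N files are in flight at once, and a new file is picked up whenever a worker becomes free. A minimal sketch using only the standard library follows. `load_one_zip` and the file names are hypothetical placeholders, the thread pool with 4 workers is an assumption chosen for illustration, and this is not how the library itself is implemented.

```python
from concurrent.futures import ThreadPoolExecutor
import math

# Hypothetical file names; the log output further down reports 58 items.
zip_names = [f"file_{i:02d}.zip" for i in range(58)]
workers = 4  # assumption: one worker per core on a 4-core machine


def load_one_zip(name: str) -> int:
    """Hypothetical loader: returns a fake row count instead of real data."""
    return len(name)


with ThreadPoolExecutor(max_workers=workers) as pool:
    # map() keeps the input order of the results, even though up to
    # `workers` calls run at the same time.
    row_counts = list(pool.map(load_one_zip, zip_names))

# If every file took equally long, this would be the number of "rounds".
rounds = math.ceil(len(zip_names) / workers)
print(len(row_counts), rounds)
```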

Also, the `ZipCollector` provides parameters for filtering the report type (10-K and 10-Q) and the financial statement type (Balance Sheet, Cash Flow, or Income Statement). These filters are applied directly during loading, since the data is stored in Parquet format. This already significantly reduces the amount of data that is loaded into memory.

Moreover, it also provides the `post_load_filter` parameter, which we can use to apply the other filters defined in the `postloadfilter` function.

In [9]:
def load_all_financial_statements_parallel(financial_statement: str) -> RawDataBag:
    """ 
    financial_statement: either "BS", "CF", or "IS"
    """

    collector: ZipCollector = ZipCollector.get_all_zips(forms_filter=["10-K", "10-Q"],
                                                        stmt_filter=[financial_statement],
                                                        post_load_filter=postloadfilter)
    return collector.collect()

Everything is wrapped together in a simple loop that iterates over the statements we want to load. Depending on your logging configuration, you can see which files are being processed in the output of the terminal where you started Jupyter.

This process will take several minutes. On my laptop, the execution time was approximately 16 minutes (32 GB RAM / 4 cores).

In [12]:
for statement_to_load in statements_to_load:
    rawdatabag = load_all_financial_statements_parallel(financial_statement=statement_to_load)
    save_databag(databag=rawdatabag, financial_statement=statement_to_load, base_path="./set/parallel/")

2023-12-05 06:30:59,104 [INFO] configmgt  reading configuration from C:\Users\hansj\.secfsdstools.cfg
2023-12-05 06:30:59,170 [INFO] updateprocess  Check if new report zip files are available...
2023-12-05 06:30:59,249 [INFO] updateprocess  check if there are new files to download from sec.gov ...
2023-12-05 06:31:00,295 [INFO] updateprocess  start to transform to parquet format ...
2023-12-05 06:31:00,311 [INFO] updateprocess  start to index parquet files ...
2023-12-05 06:31:00,375 [INFO] parallelexecution      items to process: 58


No rapid-api-key is set: 
If you are interested in daily updates, please have a look at https://rapidapi.com/hansjoerg.wingeier/api/daily-sec-financial-statement-dataset


2023-12-05 06:35:21,228 [INFO] parallelexecution      commited chunk: 0


store rawdatabag under ./set/parallel/BS\raw
create joined databag
store joineddatabag under ./set/parallel/BS\joined


2023-12-05 06:36:54,241 [INFO] configmgt  reading configuration from C:\Users\hansj\.secfsdstools.cfg
2023-12-05 06:36:54,288 [INFO] parallelexecution      items to process: 58
2023-12-05 06:41:06,636 [INFO] parallelexecution      commited chunk: 0


store rawdatabag under ./set/parallel/CF\raw
create joined databag
store joineddatabag under ./set/parallel/CF\joined


2023-12-05 06:42:38,230 [INFO] configmgt  reading configuration from C:\Users\hansj\.secfsdstools.cfg
2023-12-05 06:42:38,269 [INFO] parallelexecution      items to process: 58
2023-12-05 06:47:02,662 [INFO] parallelexecution      commited chunk: 0


store rawdatabag under ./set/parallel/IS\raw
create joined databag
store joineddatabag under ./set/parallel/IS\joined


After processing, you have the following structure and sizes (with data up to 2023 Q3):
<pre>
- set/parallel
  - BS
    - raw     : 715 MB
    - joined  : 266 MB
  - CF
    - raw     : 700 MB
    - joined  : 246 MB
  - IS
    - raw     : 636 MB
    - joined  : 217 MB
</pre>
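
The sizes above can be summed up quickly to see how much smaller the joined variant is. This is plain arithmetic on the numbers listed; nothing is measured here.

```python
# Sizes in MB, copied from the listing above.
sizes_mb = {
    "BS": {"raw": 715, "joined": 266},
    "CF": {"raw": 700, "joined": 246},
    "IS": {"raw": 636, "joined": 217},
}

total_raw = sum(s["raw"] for s in sizes_mb.values())
total_joined = sum(s["joined"] for s in sizes_mb.values())

for stmt, s in sizes_mb.items():
    print(f"{stmt}: joined is {s['joined'] / s['raw']:.0%} of raw")

print(f"total raw: {total_raw} MB, total joined: {total_joined} MB")
```

In total, the joined datasets take up 729 MB versus 2051 MB for the raw ones, a bit more than a third.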

In [None]:
# Question: have the post filters really been applied?

In [13]:
# load BS joined data
joinedBS = JoinedDataBag.load("./set/parallel/BS/joined")


In [15]:
joinedBS.sub_df.form.unique()

array(['10-Q', '10-K'], dtype=object)

In [17]:
joinedBS.pre_num_df.columns

Index(['adsh', 'tag', 'version', 'coreg', 'ddate', 'qtrs', 'uom', 'value', 'footnote', 'report', 'line', 'stmt', 'inpth', 'rfile', 'plabel', 'negating'], dtype='object')

In [22]:
joinedBS.pre_num_df.version.unique()

array(['us-gaap/2018', 'us-gaap/2019', 'srt/2019', 'us-gaap/2021',
       'srt/2021', 'us-gaap/2022', 'srt/2022', 'us-gaap-sup/2022q3',
       'srt-sup/2022q3', 'us-gaap/2023', 'us-gaap/2014', 'us-gaap/2013',
       'dei/2013', 'us-gaap/2015', 'us-gaap/2016', 'us-gaap/2017',
       'us-gaap/2012', 'dei/2014', 'us-gaap/2020', 'srt/2020',
       'us-gaap/2009', 'us-gaap/2011', 'dei/2011', 'us-gaap/2008',
       'dei/2012', 'www.sec.gov/20220930', 'srt/2023', 'dei/2019',
       'www.sec.gov/20230630'], dtype=object)
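
One way to read the `version` output with respect to `OfficialTagsOnlyRawFilter`: in the SEC's financial statement data sets, a custom tag carries the accession number of the filing that defined it in the `version` column, while a standard tag carries a taxonomy identifier such as `us-gaap/2022`. The sketch below checks a few of the values shown above against the shape of an accession number. The regular expression and the helper name `looks_like_adsh` are assumptions for illustration, and the sample accession number is made up.

```python
import re

# Accession numbers (adsh) have the form 10 digits - 2 digits - 6 digits.
ADSH_PATTERN = re.compile(r"^\d{10}-\d{2}-\d{6}$")


def looks_like_adsh(version: str) -> bool:
    """Return True if the version value has the shape of an accession number."""
    return ADSH_PATTERN.match(version) is not None


# A few values copied from the output above.
versions = ["us-gaap/2018", "srt/2019", "dei/2013",
            "us-gaap-sup/2022q3", "www.sec.gov/20220930"]

# Versions that look like they belong to custom (non-official) tags.
custom = [v for v in versions if looks_like_adsh(v)]
print(custom)

# Made-up accession number, only to show what a custom-tag version looks like.
print(looks_like_adsh("0000123456-23-000001"))
```

On the real data, the same check could be run over `joinedBS.pre_num_df.version.unique()`; an empty result would suggest that no custom tags remain.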