In [1]:
import pandas as pd
# ensure that all columns are shown and that column content is not cut
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.width', 1000)
pd.set_option('display.max_rows', 500) # show up to 500 rows

# Bulk Data Processing Deep Dive
The main advantage of this library is that all the data is downloaded to your computer, which makes it easy to analyze all of it at once, for instance if you want to implement your own screener.

On disk alone, the data files take up more than 2 GB. Because the Parquet format is storage-optimized, the same data takes up considerably more space in memory, more than a standard computer or laptop usually provides.

Hence it is important to filter the data during the loading process, so that you only load the data into memory that is really needed.
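The idea of filtering while loading can be illustrated with a small, library-independent sketch. It streams rows from an in-memory CSV and keeps only the wanted report types, so the discarded rows are never collected. The sample data and the `load_filtered` helper are hypothetical and only serve the illustration; the library itself works on Parquet files, not on this code.

```python
import csv
import io

# Hypothetical miniature table; the real data sets are far larger.
SAMPLE_CSV = """adsh,form,name
0001,10-K,Alpha Corp
0002,8-K,Beta Inc
0003,10-Q,Gamma Ltd
0004,S-1,Delta LLC
"""

def load_filtered(source, wanted_forms):
    """Yield only the rows whose form is wanted, one row at a time."""
    for row in csv.DictReader(source):
        if row["form"] in wanted_forms:
            yield row

# Only the matching rows are collected into the list.
rows = list(load_filtered(io.StringIO(SAMPLE_CSV), {"10-K", "10-Q"}))
print([row["adsh"] for row in rows])
```

Loading everything first and filtering afterwards would give the same result, but the peak memory use would be that of the full table.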

## Prepare Datasets
In the first part of this notebook, we will create different datasets for all the balance sheet datapoints, the cashflow datapoints, and the income statement datapoints.
These datasets will be stored in their own directories, so that they can be easily loaded afterwards. Moreover, we will store the raw version (where the num_df and the pre_df are not joined) and the joined version, where num_df and pre_df are joined. Depending on what you want to do/analyze, you can use either one.

**Note:** The code that is explained here is also available in the module `bulk_loading`, which is inside the `u_usecase` package.

This notebook shows two approaches. The first one loads all the data in parallel, which you can do if your computer has enough resources. The second one does it sequentially, which is slower but needs less memory. You will typically create these datasets once and extend them when new quarterly zip files arrive, or recreate them once every quarter, so it doesn't really matter whether the process takes 15 minutes or an hour.

We will also apply different filters:

* load only 10-K and 10-Q reports (this filter is applied during loading)
* `ReportPeriodRawFilter`: since we are only interested in datapoints that belong to the period of the report
* `MainCoregRawFilter`: since we don't want to see datapoints of a subsidiary
* `OfficialTagsOnlyRawFilter`: since we want to be able to compare the content and therefore don't want to read tags that are not in the standard SEC XBRL definition
* `USDOnlyRawFilter`: since we are not interested in money datapoints that are not in USD

### Basics
First, we define some basics that are used by both approaches.

In [2]:
import os
from secfsdstools.d_container.databagmodel import RawDataBag, JoinedDataBag
from secfsdstools.e_collector.zipcollecting import ZipCollector

The following list defines which statements we want to load.

In [3]:
statements_to_load = ["BS", "CF", "IS"]

Next, we define a filter function that defines the whole chain. As mentioned in the 04_collector_deep_dive.ipynb notebook, we have to place the imports inside the function itself if we want to use it in Jupyter together with parallelization.

In [4]:
def postloadfilter(databag: RawDataBag) -> RawDataBag:
    from secfsdstools.e_filter.rawfiltering import ReportPeriodRawFilter, MainCoregRawFilter, OfficialTagsOnlyRawFilter, USDOnlyRawFilter

    return databag[ReportPeriodRawFilter()][MainCoregRawFilter()][OfficialTagsOnlyRawFilter()][USDOnlyRawFilter()]
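The chained `databag[Filter()][Filter()]` syntax works because indexing a databag with a filter returns a new databag, which can be indexed again. The following toy sketch shows that pattern in isolation. `ToyBag`, `usd_only`, and `no_coreg` are hypothetical stand-ins, not the secfsdstools classes, and the conditions they check (`uom` equals USD, `coreg` is empty) are assumptions about what comparable filters might do.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, str]

@dataclass
class ToyBag:
    """Hypothetical stand-in for a databag; not the secfsdstools class."""
    rows: List[Row]

    def __getitem__(self, rule: Callable[[List[Row]], List[Row]]) -> "ToyBag":
        # Indexing with a filter returns a new, filtered bag,
        # so further filters can be chained onto the result.
        return ToyBag(rule(self.rows))

def usd_only(rows: List[Row]) -> List[Row]:
    # assumed condition: keep rows whose unit of measure is USD
    return [r for r in rows if r["uom"] == "USD"]

def no_coreg(rows: List[Row]) -> List[Row]:
    # assumed condition: keep rows that have no co-registrant entry
    return [r for r in rows if r["coreg"] == ""]

bag = ToyBag([
    {"tag": "Assets", "uom": "USD", "coreg": ""},
    {"tag": "Assets", "uom": "EUR", "coreg": ""},
    {"tag": "Assets", "uom": "USD", "coreg": "SubsidiaryA"},
])
filtered = bag[usd_only][no_coreg]
print(len(filtered.rows))
```

Each step leaves the original bag untouched, which is why the chain can be written as a single expression.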

Next is a simple function that takes a raw databag and creates the joined databag. Both the raw databag and the joined databag are then stored in a specific folder.


In [5]:
def save_databag(databag: RawDataBag, financial_statement: str, base_path: str) -> JoinedDataBag:
    target_path_raw = os.path.join(base_path, financial_statement, 'raw')
    print(f"store rawdatabag under {target_path_raw}")
    os.makedirs(target_path_raw, exist_ok=True)
    databag.save(target_path_raw)
    
    target_path_joined = os.path.join(base_path, financial_statement, 'joined')
    os.makedirs(target_path_joined, exist_ok=True)
    print("create joined databag")
    joined_databag = databag.join()
    
    print(f"store joineddatabag under {target_path_joined}")
    joined_databag.save(target_path_joined)
    return joined_databag
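`save_databag` builds its target directories with `os.path.join`, so the separator depends on the operating system. The outputs later in this notebook were produced on Windows, which explains paths with mixed separators such as `./set/parallel/BS\raw`. The sketch below reproduces both variants on any platform by calling the two underlying standard-library modules directly.

```python
import ntpath
import posixpath

parts = ("./set/parallel/", "BS", "raw")

# os.path is ntpath on Windows and posixpath on Linux/macOS.
windows_style = ntpath.join(*parts)
posix_style = posixpath.join(*parts)

print(windows_style)
print(posix_style)
```

Both forms point to the same folder layout; only the printed separator differs.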

### Parallel Data Loading
As stated above, we want to load all available 10-K and 10-Q reports. Therefore, we can use the `ZipCollector`, which provides an option to load data from all available zip files.

Moreover, the implementation of the zip loader uses all your cores to load data from your disk into memory, so you don't have to implement the parallelization yourself. There are 50+ zip files that have to be loaded, so if you have 4 cores, you will load 4 at a time.

The `ZipCollector` also provides parameters for filtering the report type (10-K and 10-Q) and the financial statement type (Balance Sheet, Cash Flow, or Income Statement). These filters are applied directly during loading, since the data is stored in Parquet format. This already significantly reduces the amount of data that is loaded into memory.

Finally, it provides the `post_load_filter` parameter, which we can use to apply the other filters defined in the `postloadfilter` function.
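The "a few files at a time" behaviour described above can be sketched with the standard library. This is not the library's implementation: a `ThreadPoolExecutor` is used here as a simple stand-in for whatever mechanism the `ZipCollector` uses internally, and `fake_load` is a hypothetical placeholder that does no real I/O.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def fake_load(zip_name: str) -> str:
    """Placeholder for reading one quarterly zip file; it only echoes the name."""
    return f"loaded {zip_name}"

zip_names = [f"{year}q{quarter}.zip" for year in (2022, 2023) for quarter in (1, 2, 3, 4)]

# At most `workers` files are in flight at once; map() keeps the input order.
workers = min(4, os.cpu_count() or 1)
with ThreadPoolExecutor(max_workers=workers) as pool:
    results = list(pool.map(fake_load, zip_names))

print(results[0], "...", results[-1])
```

With 58 files and 4 workers, such a pool would go through roughly 15 rounds of loading.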

In [6]:
def load_all_financial_statements_parallel(financial_statement: str) -> RawDataBag:
    """ 
    financial_statement: either "BS", "CF", or "IS"
    """

    collector: ZipCollector = ZipCollector.get_all_zips(forms_filter=["10-K", "10-Q"],
                                                        stmt_filter=[financial_statement],
                                                        post_load_filter=postloadfilter)
    return collector.collect()

We loop over the statements that we want to load and collect their datapoints into a specific dataset.

This process will take several minutes. On my laptop (32 GB RAM, 4/8 cores), the execution time was approximately 16 minutes.

In [12]:
for statement_to_load in statements_to_load:
    rawdatabag = load_all_financial_statements_parallel(financial_statement=statement_to_load)
    save_databag(databag=rawdatabag, financial_statement=statement_to_load, base_path="./set/parallel/")

2023-12-05 06:30:59,104 [INFO] configmgt  reading configuration from C:\Users\hansj\.secfsdstools.cfg
2023-12-05 06:30:59,170 [INFO] updateprocess  Check if new report zip files are available...
2023-12-05 06:30:59,249 [INFO] updateprocess  check if there are new files to download from sec.gov ...
2023-12-05 06:31:00,295 [INFO] updateprocess  start to transform to parquet format ...
2023-12-05 06:31:00,311 [INFO] updateprocess  start to index parquet files ...
2023-12-05 06:31:00,375 [INFO] parallelexecution      items to process: 58


No rapid-api-key is set: 
If you are interested in daily updates, please have a look at https://rapidapi.com/hansjoerg.wingeier/api/daily-sec-financial-statement-dataset


2023-12-05 06:35:21,228 [INFO] parallelexecution      commited chunk: 0


store rawdatabag under ./set/parallel/BS\raw
create joined databag
store joineddatabag under ./set/parallel/BS\joined


2023-12-05 06:36:54,241 [INFO] configmgt  reading configuration from C:\Users\hansj\.secfsdstools.cfg
2023-12-05 06:36:54,288 [INFO] parallelexecution      items to process: 58
2023-12-05 06:41:06,636 [INFO] parallelexecution      commited chunk: 0


store rawdatabag under ./set/parallel/CF\raw
create joined databag
store joineddatabag under ./set/parallel/CF\joined


2023-12-05 06:42:38,230 [INFO] configmgt  reading configuration from C:\Users\hansj\.secfsdstools.cfg
2023-12-05 06:42:38,269 [INFO] parallelexecution      items to process: 58
2023-12-05 06:47:02,662 [INFO] parallelexecution      commited chunk: 0


store rawdatabag under ./set/parallel/IS\raw
create joined databag
store joineddatabag under ./set/parallel/IS\joined


After processing, you have the following structure and sizes (with data up to 2023 Q3):
<pre>
- set/parallel
  - BS
    - raw     : 715 MB
    - joined  : 266 MB
  - CF
    - raw     : 700 MB
    - joined  : 246 MB
  - IS
    - raw     : 636 MB
    - joined  : 217 MB
</pre>

The joined databags in particular are small enough to be loaded easily, and loading them takes just a few seconds.
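To put the sizes listed above into proportion, the following short calculation compares the joined and raw variants. The numbers are copied from the listing; nothing else is assumed.

```python
# Sizes in MB, copied from the listing above (data up to 2023 Q3).
sizes_mb = {
    "BS": {"raw": 715, "joined": 266},
    "CF": {"raw": 700, "joined": 246},
    "IS": {"raw": 636, "joined": 217},
}

total_raw = sum(s["raw"] for s in sizes_mb.values())
total_joined = sum(s["joined"] for s in sizes_mb.values())

for stmt, s in sizes_mb.items():
    print(f"{stmt}: joined is {s['joined'] / s['raw']:.0%} of raw")
print(f"total raw: {total_raw} MB, total joined: {total_joined} MB")
```

In each case the joined variant is a bit more than a third of the size of the raw one.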

In [10]:
#load BS joined data
joinedBS = JoinedDataBag.load("./set/parallel/BS/joined")
print("loaded BS databag: ", joinedBS.pre_num_df.shape)
joinedCF = JoinedDataBag.load("./set/parallel/CF/joined")
print("loaded CF databag: ", joinedCF.pre_num_df.shape)
joinedIS = JoinedDataBag.load("./set/parallel/IS/joined")
print("loaded IS databag: ", joinedIS.pre_num_df.shape)

loaded BS databag:  (10430891, 16)
loaded CF databag:  (9468009, 16)
loaded IS databag:  (9512425, 16)


### Serial Data Loading
As mentioned above, parallel loading requires a certain minimum of resources on your laptop/computer. With a serial process, however, you can still create the databags for all balance sheet, cash flow, and income statements. Of course, we need more code, and we will also save intermediate results on disk.

The first thing we need is a list of all available zip files. We can simply copy the code from `ZipCollector.get_all_zips()`.

In [6]:
from typing import List
from secfsdstools.a_config.configmgt import ConfigurationManager
from secfsdstools.c_index.indexdataaccess import ParquetDBIndexingAccessor

def read_all_zip_names() -> List[str]:
    configuration = ConfigurationManager.read_config_file()
    dbaccessor = ParquetDBIndexingAccessor(db_dir=configuration.db_dir)

    # exclude 2009q1.zip, since this is empty and causes an error when it is read with a filter
    filenames = [x.fileName for x in dbaccessor.read_all_indexfileprocessing() if not x.fullPath.endswith("2009q1.zip")]
    return filenames

In [9]:
all_zip_names = read_all_zip_names()
print(len(all_zip_names))
print(all_zip_names)

2023-12-10 12:40:16,684 [INFO] configmgt  reading configuration from C:\Users\hansj\.secfsdstools.cfg


58
['2019q4.zip', '2023q1.zip', '2014q3.zip', '2016q3.zip', '2018q1.zip', '2013q4.zip', '2015q2.zip', '2021q3.zip', '2010q3.zip', '2011q4.zip', '2012q2.zip', '2010q4.zip', '2016q1.zip', '2021q1.zip', '2011q2.zip', '2009q2.zip', '2022q1.zip', '2012q4.zip', '2010q1.zip', '2015q1.zip', '2022q3.zip', '2018q2.zip', '2019q3.zip', '2020q2.zip', '2022q4.zip', '2017q2.zip', '2012q3.zip', '2011q1.zip', '2017q4.zip', '2010q2.zip', '2018q3.zip', '2021q4.zip', '2019q2.zip', '2013q1.zip', '2015q4.zip', '2009q3.zip', '2016q2.zip', '2013q3.zip', '2016q4.zip', '2017q3.zip', '2018q4.zip', '2023q2.zip', '2014q4.zip', '2011q3.zip', '2020q3.zip', '2014q2.zip', '2020q1.zip', '2012q1.zip', '2014q1.zip', '2019q1.zip', '2015q3.zip', '2017q1.zip', '2020q4.zip', '2013q2.zip', '2021q2.zip', '2022q2.zip', '2009q4.zip', '2023q3.zip']
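The names come back in no particular order. If you prefer to process them chronologically, plain sorting is enough, because the year comes first and has a fixed width. The sketch below uses a handful of the names from the output above; `year_quarter` is a hypothetical helper that is not part of the library.

```python
import re

# A few names in the arbitrary order in which they were returned above.
names = ["2019q4.zip", "2023q1.zip", "2014q3.zip", "2009q2.zip", "2023q3.zip"]

# Four-digit year followed by the quarter: string order is chronological order.
ordered = sorted(names)

def year_quarter(name: str) -> tuple:
    """Split a name such as '2019q4.zip' into (2019, 4)."""
    match = re.fullmatch(r"(\d{4})q([1-4])\.zip", name)
    if match is None:
        raise ValueError(f"unexpected file name: {name}")
    return int(match.group(1)), int(match.group(2))

print(ordered)
print(year_quarter(ordered[-1]))
```

The order does not affect the final datasets; it only makes the progress output easier to follow.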


**Prepare the temporary dataset**

Next, prepare the data for every single zip file. For every zip file, we collect the datapoints for BS, CF, and IS and apply the filters defined above. The following function takes care of that.

In [17]:
def build_tmp_set(financial_statement: str, file_names: List[str], target_path: str = "set/tmp/"):
    """ This function reads the data in sequence from the provided list of zip file names. It filters according to the 
        defined financial_statement and stores the data in specific subfolders.
        
        the folder structure will look like
        <target_path>/<file_name>/<financial_statement>/raw
        <target_path>/<file_name>/<financial_statement>/joined                                       
        """
    
    for file_name in file_names:
        collector = ZipCollector.get_zip_by_name(name=file_name,
                                 forms_filter=["10-K", "10-Q"],
                                 stmt_filter=[financial_statement],
                                 post_load_filter=postloadfilter)

        rawdatabag = collector.collect()

        base_path = os.path.join(target_path, file_name)
        # saving the raw databag, joining and saving the joined databag
        save_databag(databag=rawdatabag, financial_statement=financial_statement, base_path=base_path)
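Since the temporary set is organized per zip file, it can be extended when a new quarterly file arrives instead of being rebuilt from scratch. A minimal sketch of that idea follows: it lists the zip names for which no `joined` folder exists yet. `pending_files` is a hypothetical helper, not part of the library, and it only assumes the folder structure described in the docstring of `build_tmp_set`.

```python
import os
import tempfile
from typing import List

def pending_files(file_names: List[str], financial_statement: str, target_path: str) -> List[str]:
    """Return the zip names that have no joined folder under target_path yet (hypothetical helper)."""
    return [
        name for name in file_names
        if not os.path.isdir(os.path.join(target_path, name, financial_statement, "joined"))
    ]

# Demonstration in a throwaway directory: pretend 2023q2.zip was already processed.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "2023q2.zip", "BS", "joined"))
    todo = pending_files(["2023q2.zip", "2023q3.zip"], "BS", tmp)

print(todo)
```

The resulting list could be passed as `file_names` to `build_tmp_set`. Note that an interrupted run can leave an empty folder behind, so the existence check is only a heuristic.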

We call the function for every statement (BS, CF, and IS).
As a reference, running all three cells took about 12 minutes on my laptop (32 GB RAM, 4/8 cores).

In [None]:
build_tmp_set(financial_statement="BS", file_names=all_zip_names, target_path="set/tmp/")

In [None]:
build_tmp_set(financial_statement="CF", file_names=all_zip_names, target_path="set/tmp/")

In [None]:
build_tmp_set(financial_statement="IS", file_names=all_zip_names, target_path="set/tmp/")

We now have subfolders for BS, CF, and IS for every quarterly zip file, each containing the corresponding datapoints.

**Create the rawdatabags**

In [11]:
from glob import glob

def create_rawdatabag(financial_statement: str, target_path: str):
    """ Concatenates the raw databags of all quarters under ./set/tmp/ and stores the result. """
    # one "raw" folder per quarterly zip file, as written by build_tmp_set
    raw_files = glob(f"./set/tmp/*/{financial_statement}/raw/", recursive=True)
    raw_databags = [RawDataBag.load(file) for file in raw_files]
    raw_databag = RawDataBag.concat(raw_databags)
    target_path_raw = os.path.join(target_path, financial_statement, 'raw')
    print(f"store rawdatabag under {target_path_raw}")
    os.makedirs(target_path_raw, exist_ok=True)
    raw_databag.save(target_path_raw)
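The glob pattern in `create_rawdatabag` relies on `*` matching exactly one directory level, namely the name of the zip file. The sketch below rebuilds a miniature version of the temporary folder structure in a throwaway directory and shows which folders such a pattern picks up. The folder names are examples only.

```python
import os
import tempfile
from glob import glob

with tempfile.TemporaryDirectory() as tmp:
    # Recreate the layout <tmp>/<file_name>/<financial_statement>/raw for two quarters.
    for zip_name in ("2023q2.zip", "2023q3.zip"):
        for stmt in ("BS", "CF"):
            os.makedirs(os.path.join(tmp, zip_name, stmt, "raw"))

    # '*' matches exactly one directory level, here the zip file name.
    matches = sorted(glob(os.path.join(tmp, "*", "BS", "raw")))
    found = [os.path.relpath(m, tmp).replace(os.sep, "/") for m in matches]

print(found)
```

`glob` returns its matches in arbitrary order, hence the `sorted` call. The `recursive=True` argument only changes the behaviour of patterns that contain `**`, so it has no effect on a pattern like this one.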

Next, concatenate the raw datasets together. Again, as a reference it took about 5 minutes to create all three rawdatabags.

In [12]:
create_rawdatabag(financial_statement="BS", target_path="set/serial/")

store rawdatabag under set/serial/BS\raw


In [13]:
create_rawdatabag(financial_statement="CF", target_path="set/serial/")

store rawdatabag under set/serial/CF\raw


In [14]:
create_rawdatabag(financial_statement="IS", target_path="set/serial/")

store rawdatabag under set/serial/IS\raw


**Create the joined databags**

In [22]:
from glob import glob

def create_joineddatabag(financial_statement: str, target_path: str):
    """ Concatenates the joined databags of all quarters under ./set/tmp/ and stores the result. """
    # one "joined" folder per quarterly zip file, as written by build_tmp_set
    joined_files = glob(f"./set/tmp/*/{financial_statement}/joined/", recursive=True)
    joined_databags = [JoinedDataBag.load(file) for file in joined_files]
    joined_databag = JoinedDataBag.concat(joined_databags)
    target_path_joined = os.path.join(target_path, financial_statement, 'joined')
    print(f"store joineddatabag under {target_path_joined}")
    os.makedirs(target_path_joined, exist_ok=True)
    joined_databag.save(target_path_joined)

Finally, create the joined databags. Creating all three datasets took about 90 seconds.

In [24]:
create_joineddatabag(financial_statement="BS", target_path="set/serial/")

store joineddatabag under set/serial/BS\joined


In [25]:
create_joineddatabag(financial_statement="CF", target_path="set/serial/")

store joineddatabag under set/serial/CF\joined


In [26]:
create_joineddatabag(financial_statement="IS", target_path="set/serial/")

store joineddatabag under set/serial/IS\joined


Now we can read back all three prepared joined datasets. This only takes a few seconds.

In [27]:
#load BS joined data
joinedBS = JoinedDataBag.load("./set/serial/BS/joined")
print("loaded BS databag: ", joinedBS.pre_num_df.shape)
joinedCF = JoinedDataBag.load("./set/serial/CF/joined")
print("loaded CF databag: ", joinedCF.pre_num_df.shape)
joinedIS = JoinedDataBag.load("./set/serial/IS/joined")
print("loaded IS databag: ", joinedIS.pre_num_df.shape)

loaded BS databag:  (10430891, 16)
loaded CF databag:  (9468009, 16)
loaded IS databag:  (9512425, 16)
