In [1]:
import pandas as pd
# ensure that all columns are shown and that column content is not cut
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.width',1000)
pd.set_option('display.max_rows', 500) # ensure that all rows are shown

# Collector Deep Dive
This notebook dives a little deeper into using the collector classes.

## Basics

All the `Collector` classes have their own factory method(s) which instantiate the class. Most of these factory methods
also provide parameters to filter the data directly while it is being loaded from the parquet files.
These are
* the `forms_filter` <br> lets you select which report type should be loaded (e.g. "10-K" or "10-Q").<br>
  Note: the forms filter affects all dataframes (sub, pre, num).
* the `stmt_filter` <br> defines the statements that should be loaded (e.g., "BS" if only "Balance Sheet" data should be loaded) <br>
  Note: the stmt filter only affects the pre dataframe.
* the `tag_filter` <br> defines the tags, that should be loaded (e.g., "Assets" if only the "Assets" tag should be loaded) <br>
  Note: the tag filter affects the pre and num dataframes.

It is also possible to apply filters for these attributes after the data is loaded. However, since the `Collector` classes
apply these filters directly during the load process from the parquet files (which means that less data is read from
the disk and the memory footprint is reduced), filtering at load time is generally more efficient.
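
The scope of the three filters can be illustrated with a small sketch. It uses plain Python records with made-up values instead of dataframes, and `apply_filters` and `sizes` are hypothetical helpers written for this illustration; they are not part of the library and only mimic the behavior described in the notes above.

```python
# Toy records standing in for the sub, pre and num dataframes (made-up values).
sub = [
    {"adsh": "a-1", "form": "10-K"},
    {"adsh": "a-2", "form": "10-Q"},
]
pre = [
    {"adsh": "a-1", "stmt": "BS", "tag": "Assets"},
    {"adsh": "a-1", "stmt": "IS", "tag": "Revenues"},
    {"adsh": "a-2", "stmt": "BS", "tag": "Assets"},
]
num = [
    {"adsh": "a-1", "tag": "Assets", "value": 100.0},
    {"adsh": "a-1", "tag": "Revenues", "value": 40.0},
    {"adsh": "a-2", "tag": "Assets", "value": 110.0},
]


def apply_filters(sub, pre, num, forms_filter=None, stmt_filter=None, tag_filter=None):
    """Hypothetical helper that mimics the scope of the three filters."""
    if forms_filter is not None:
        # forms_filter restricts the reports, hence sub, pre and num
        sub = [r for r in sub if r["form"] in forms_filter]
        adshs = {r["adsh"] for r in sub}
        pre = [r for r in pre if r["adsh"] in adshs]
        num = [r for r in num if r["adsh"] in adshs]
    if stmt_filter is not None:
        # stmt_filter: only pre knows which statement a tag belongs to
        pre = [r for r in pre if r["stmt"] in stmt_filter]
    if tag_filter is not None:
        # tag_filter: both pre and num carry the tag
        pre = [r for r in pre if r["tag"] in tag_filter]
        num = [r for r in num if r["tag"] in tag_filter]
    return sub, pre, num


def sizes(frames):
    """Row counts of (sub, pre, num)."""
    return tuple(len(f) for f in frames)


print(sizes(apply_filters(sub, pre, num, forms_filter=["10-K"])))  # (1, 2, 2)
print(sizes(apply_filters(sub, pre, num, stmt_filter=["BS"])))     # (2, 2, 3)
print(sizes(apply_filters(sub, pre, num, tag_filter=["Assets"])))  # (2, 2, 2)
```

Note how only the `forms_filter` changes the number of entries in `sub`, and how the `stmt_filter` leaves `num` untouched.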

All `Collector` classes have a `collect` method which loads the data from the parquet files and returns an instance
of `RawDataBag`. The `RawDataBag` instance contains a pandas dataframe for each of the `sub` (submission) data, the
`pre` (presentation) data, and the `num` (numeric values) data.

## `SingleReportCollector`
As the name suggests, this `Collector` returns the data of a single report. It is instantiated by providing the `adsh` of the desired report as parameter of the `get_report_by_adsh` factory method,
or by passing an instance of `IndexReport` to the `get_report_by_indexreport` factory method. (As a reminder: instances of `IndexReport` are returned by the `CompanyIndexReader` class.)

Reading a single report: **Apple's 10-K from 2022**

In [2]:
from secfsdstools.e_collector.reportcollecting import SingleReportCollector

apple_10k_2022_adsh = "0000320193-22-000108"

collector: SingleReportCollector = SingleReportCollector.get_report_by_adsh(adsh=apple_10k_2022_adsh)
rawdatabag = collector.collect()

# as expected, there is just one entry in the submission dataframe
print("sub", rawdatabag.sub_df.shape)

# just print the size of the pre and num dataframes
print("pre", rawdatabag.pre_df.shape)
print("num", rawdatabag.num_df.shape)

# joining the pre and num dataframes
joineddatabag = rawdatabag.join()
print("pre_num", joineddatabag.pre_num_df.shape)

2023-10-21 15:58:06,187 [INFO] configmgt  reading configuration from C:\Users\hansj\.secfsdstools.cfg


sub (1, 36)
pre (185, 10)
num (503, 9)
pre_num (262, 16)


As mentioned above, we can also apply filters directly to reduce the amount of data that is loaded.

First, let's only load data for the **Balance Sheet**.

In [3]:
from secfsdstools.e_collector.reportcollecting import SingleReportCollector

collector: SingleReportCollector = SingleReportCollector.get_report_by_adsh(adsh=apple_10k_2022_adsh, stmt_filter=['BS'])
rawdatabag = collector.collect()

# as expected, there is just one entry in the submission dataframe
print("sub", rawdatabag.sub_df.shape)

# just print the size of the pre and num dataframes
print("pre", rawdatabag.pre_df.shape)
print("num", rawdatabag.num_df.shape)

# joining the pre and num dataframes
joineddatabag = rawdatabag.join()
print("pre_num", joineddatabag.pre_num_df.shape)

2023-10-21 15:58:06,906 [INFO] configmgt  reading configuration from C:\Users\hansj\.secfsdstools.cfg


sub (1, 36)
pre (39, 10)
num (503, 9)
pre_num (66, 16)


As mentioned above, the `stmt_filter` only applies to the `pre_df`, since only the `pre_df` has information about the statement a tag belongs to. The `num_df` therefore still contains all 503 rows, but the joined dataframe is significantly smaller.
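
Why the joined result shrinks although `num` is unfiltered can be sketched with an inner join on shared key columns. The sketch assumes the key columns are `adsh`, `tag` and `version`; that assumption is consistent with the shapes printed above (10 + 9 - 3 = 16 columns), but it is not taken from the library's implementation. `inner_join` is a hypothetical helper and the records are made up.

```python
from collections import defaultdict

KEYS = ("adsh", "tag", "version")  # assumed join keys


def inner_join(pre_rows, num_rows, keys=KEYS):
    """Inner join of two lists of dicts on the given key columns."""
    num_by_key = defaultdict(list)
    for n in num_rows:
        num_by_key[tuple(n[k] for k in keys)].append(n)
    joined = []
    for p in pre_rows:
        # only num rows with a matching pre row survive the join
        for n in num_by_key.get(tuple(p[k] for k in keys), []):
            joined.append({**p, **n})
    return joined


pre_rows = [  # already reduced to the balance sheet
    {"adsh": "a-1", "tag": "Assets", "version": "us-gaap/2022", "stmt": "BS"},
]
num_rows = [  # unfiltered
    {"adsh": "a-1", "tag": "Assets", "version": "us-gaap/2022", "ddate": 20210930, "value": 1.0},
    {"adsh": "a-1", "tag": "Assets", "version": "us-gaap/2022", "ddate": 20220930, "value": 2.0},
    {"adsh": "a-1", "tag": "Revenues", "version": "us-gaap/2022", "ddate": 20220930, "value": 3.0},
]

joined = inner_join(pre_rows, num_rows)
print(len(joined))  # 2
```

The 'Revenues' row has no partner in the reduced `pre_rows` and is dropped, while the single 'Assets' entry in `pre_rows` matches two `num` rows (one per date).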

Next, let's be even more restrictive and load just the **'Assets'** tag.

In [4]:
from secfsdstools.e_collector.reportcollecting import SingleReportCollector

collector: SingleReportCollector = SingleReportCollector.get_report_by_adsh(adsh=apple_10k_2022_adsh, tag_filter=['Assets'])
rawdatabag = collector.collect()

# as expected, there is just one entry in the submission dataframe
print("sub", rawdatabag.sub_df.shape)

# just print the size of the pre and num dataframes
print("pre", rawdatabag.pre_df.shape)
print("num", rawdatabag.num_df.shape)

# joining the pre and num dataframes
joineddatabag = rawdatabag.join()
print("pre_num", joineddatabag.pre_num_df.shape)

2023-10-21 15:58:07,474 [INFO] configmgt  reading configuration from C:\Users\hansj\.secfsdstools.cfg


sub (1, 36)
pre (1, 10)
num (2, 9)
pre_num (2, 16)


Now, as expected, there is only one 'Assets' tag in the pre dataframe. The reason we have two entries in the num dataframe is that there is a value for the year 2022 and a value for the previous year 2021, as can be seen in the `ddate` column:

In [5]:
rawdatabag.num_df

| | adsh | tag | version | coreg | ddate | qtrs | uom | value | footnote |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0000320193-22-000108 | Assets | us-gaap/2022 | | 20210930 | 0 | USD | 351002000000.0 | |
| 1 | 0000320193-22-000108 | Assets | us-gaap/2022 | | 20220930 | 0 | USD | 352755000000.0 | |
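
As a small worked example, the two values above can be compared directly. The sketch assumes that `ddate` is stored as an integer in `YYYYMMDD` form, as the output suggests; `parse_ddate` is a hypothetical helper, and the values are copied from the output above.

```python
from datetime import datetime

# ddate -> value, copied from the num dataframe shown above
assets = {20210930: 351_002_000_000.0, 20220930: 352_755_000_000.0}


def parse_ddate(ddate: int) -> datetime:
    """Convert an integer like 20220930 into a datetime (assumes YYYYMMDD)."""
    return datetime.strptime(str(ddate), "%Y%m%d")


# sort by date so that the previous year comes first
(prev_date, prev_value), (cur_date, cur_value) = sorted(assets.items())
change = cur_value - prev_value

print(parse_ddate(prev_date).date(), "->", parse_ddate(cur_date).date())
print(f"{change:,.0f} USD ({change / prev_value:.2%})")  # 1,753,000,000 USD (0.50%)
```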


## `MultiReportCollector`
Contrary to the `SingleReportCollector`, this `Collector` can collect data from several
reports. Moreover, the data of the reports is loaded in parallel, which especially improves the performance if the
reports are from different quarters (i.e., are stored in different zip files). The class provides the factory methods
`get_reports_by_adshs` and `get_reports_by_indexreports`. The first takes a list of adsh strings, the second a list
of `IndexReport` instances.

Reading two reports: **Apple's 10-K from 2022 and 2012**

In [6]:
from secfsdstools.e_collector.multireportcollecting import MultiReportCollector
apple_10k_2022_adsh = "0000320193-22-000108"
apple_10k_2012_adsh = "0001193125-12-444068"

# load all the data of Apple's 10-K reports for the years
# 2022 and 2012 (no filter is applied here)
collector: MultiReportCollector = \
    MultiReportCollector.get_reports_by_adshs(adshs=[apple_10k_2022_adsh,
                                                     apple_10k_2012_adsh])
rawdatabag = collector.collect()
# as expected, there are just two entries in the submission dataframe
print("sub", rawdatabag.sub_df.shape)

# just print the size of the pre and num dataframes
print("pre", rawdatabag.pre_df.shape)
print("num", rawdatabag.num_df.shape)

# joining the pre and num dataframes
joineddatabag = rawdatabag.join()
print("pre_num", joineddatabag.pre_num_df.shape)

2023-10-21 15:58:11,742 [INFO] configmgt  reading configuration from C:\Users\hansj\.secfsdstools.cfg
2023-10-21 15:58:11,752 [INFO] parallelexecution      items to process: 2
2023-10-21 15:58:14,545 [INFO] parallelexecution      commited chunk: 0


sub (2, 36)
pre (323, 10)
num (1124, 9)
pre_num (562, 16)


Again, using the filter parameters reduces the amount of data that is loaded. Let us load the tags **'Assets' and 'Liabilities'.**

In [7]:
from secfsdstools.e_collector.multireportcollecting import MultiReportCollector
apple_10k_2022_adsh = "0000320193-22-000108"
apple_10k_2012_adsh = "0001193125-12-444068"

# load only the 'Assets' and 'Liabilities' tags that are present in
# Apple's 10-K reports for the years 2022 and 2012
collector: MultiReportCollector = \
    MultiReportCollector.get_reports_by_adshs(adshs=[apple_10k_2022_adsh, apple_10k_2012_adsh],
                                              tag_filter=['Assets', 'Liabilities'])
rawdatabag = collector.collect()
# as expected, there are just two entries in the submission dataframe
print("sub", rawdatabag.sub_df.shape)

# just print the size of the pre and num dataframes
print("pre", rawdatabag.pre_df.shape)
print("num", rawdatabag.num_df.shape)

# joining the pre and num dataframes
joineddatabag = rawdatabag.join()
print("pre_num", joineddatabag.pre_num_df.shape)

2023-10-21 15:58:14,632 [INFO] configmgt  reading configuration from C:\Users\hansj\.secfsdstools.cfg
2023-10-21 15:58:14,650 [INFO] parallelexecution      items to process: 2
2023-10-21 15:58:16,439 [INFO] parallelexecution      commited chunk: 0


sub (2, 36)
pre (4, 10)
num (8, 9)
pre_num (8, 16)


In [8]:
rawdatabag.num_df

| | adsh | tag | version | coreg | ddate | qtrs | uom | value | footnote |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0000320193-22-000108 | Assets | us-gaap/2022 | | 20210930 | 0 | USD | 351002000000.0 | |
| 1 | 0000320193-22-000108 | Assets | us-gaap/2022 | | 20220930 | 0 | USD | 352755000000.0 | |
| 2 | 0000320193-22-000108 | Liabilities | us-gaap/2022 | | 20220930 | 0 | USD | 302083000000.0 | |
| 3 | 0000320193-22-000108 | Liabilities | us-gaap/2022 | | 20210930 | 0 | USD | 287912000000.0 | |
| 4 | 0001193125-12-444068 | Assets | us-gaap/2012 | | 20110930 | 0 | USD | 116371000000.0 | |
| 5 | 0001193125-12-444068 | Assets | us-gaap/2012 | | 20120930 | 0 | USD | 176064000000.0 | |
| 6 | 0001193125-12-444068 | Liabilities | us-gaap/2012 | | 20110930 | 0 | USD | 39756000000.0 | |
| 7 | 0001193125-12-444068 | Liabilities | us-gaap/2012 | | 20120930 | 0 | USD | 57854000000.0 | |


As expected, the data now contains the values for 'Assets' and 'Liabilities' for the year 2012 and the previous year 2011, as well as for the year 2022 and the previous year 2021.
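
To make this easier to check, the long-format rows can be regrouped so that each report date has its 'Assets' and 'Liabilities' values side by side. This is only a sketch with plain Python structures instead of a dataframe; the rows are copied from the output above, and `wide` is just an illustrative name.

```python
from collections import defaultdict

# (adsh, tag, ddate, value) copied from the num dataframe shown above
rows = [
    ("0000320193-22-000108", "Assets", 20210930, 351002000000.0),
    ("0000320193-22-000108", "Assets", 20220930, 352755000000.0),
    ("0000320193-22-000108", "Liabilities", 20220930, 302083000000.0),
    ("0000320193-22-000108", "Liabilities", 20210930, 287912000000.0),
    ("0001193125-12-444068", "Assets", 20110930, 116371000000.0),
    ("0001193125-12-444068", "Assets", 20120930, 176064000000.0),
    ("0001193125-12-444068", "Liabilities", 20110930, 39756000000.0),
    ("0001193125-12-444068", "Liabilities", 20120930, 57854000000.0),
]

# one entry per (report, date), holding a tag -> value mapping
wide = defaultdict(dict)
for adsh, tag, ddate, value in rows:
    wide[(adsh, ddate)][tag] = value

print(len(wide))  # 4
for (adsh, ddate), tags in sorted(wide.items(), key=lambda item: item[0][1]):
    print(adsh, ddate, f"{tags['Assets']:,.0f}", f"{tags['Liabilities']:,.0f}")
```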

## `CompanyReportCollector`

This class returns reports for one or more companies. The factory method `get_company_collector` provides the parameter `ciks` which takes a list of cik numbers.

Let us read the data for **all reports of Apple and Microsoft.**

In [9]:
from secfsdstools.e_collector.companycollecting import CompanyReportCollector

apple_cik = 320193
microsoft_cik = 789019
collector = CompanyReportCollector.get_company_collector(ciks=[apple_cik, microsoft_cik])

rawdatabag = collector.collect()

print("sub", rawdatabag.sub_df.shape)

# just print the size of the pre and num dataframes
print("pre", rawdatabag.pre_df.shape)
print("num", rawdatabag.num_df.shape)

# joining the pre and num dataframes
joineddatabag = rawdatabag.join()
print("pre_num", joineddatabag.pre_num_df.shape)

2023-10-21 15:58:16,712 [INFO] configmgt  reading configuration from C:\Users\hansj\.secfsdstools.cfg
2023-10-21 15:58:16,938 [INFO] parallelexecution      items to process: 190
2023-10-21 15:59:15,239 [INFO] parallelexecution      commited chunk: 0


sub (190, 36)
pre (17437, 10)
num (51279, 9)
pre_num (32656, 16)


As you can see, this takes a few dozen seconds (just under a minute in the run above), even though the data from all zip files was loaded in parallel.

But maybe we just want to have a look at the **'Assets' of all 10-K reports.**

In [10]:
from secfsdstools.e_collector.companycollecting import CompanyReportCollector

collector = CompanyReportCollector.get_company_collector(ciks=[apple_cik, microsoft_cik],
                                                        tag_filter=['Assets'],
                                                        forms_filter=['10-K'])

rawdatabag = collector.collect()

print("sub", rawdatabag.sub_df.shape)

# just print the size of the pre and num dataframes
print("pre", rawdatabag.pre_df.shape)
print("num", rawdatabag.num_df.shape)

# joining the pre and num dataframes
joineddatabag = rawdatabag.join()
print("pre_num", joineddatabag.pre_num_df.shape)

2023-10-21 15:59:15,996 [INFO] configmgt  reading configuration from C:\Users\hansj\.secfsdstools.cfg
2023-10-21 15:59:16,211 [INFO] parallelexecution      items to process: 27
2023-10-21 15:59:23,858 [INFO] parallelexecution      commited chunk: 0


sub (27, 36)
pre (27, 10)
num (56, 9)
pre_num (56, 16)


## `ZipCollector`

This `Collector` collects the data of one or more zip files (or rather, of the folders that contain the parquet
files of these zip files). Since each of the original zip files contains the data for one quarter, the names you provide
in the `get_zip_by_name` or `get_zip_by_names` factory methods reflect the quarter whose data you want to load, e.g. `2022q1.zip`.
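
Since the names follow the pattern `<year>q<quarter>.zip`, a list of names for a whole year can be generated instead of typed out. `quarter_zip_names` is a hypothetical helper for illustration and not part of the library.

```python
def quarter_zip_names(year: int) -> list[str]:
    """Hypothetical helper: zip file names for the four quarters of a year."""
    return [f"{year}q{quarter}.zip" for quarter in range(1, 5)]


print(quarter_zip_names(2022))
# ['2022q1.zip', '2022q2.zip', '2022q3.zip', '2022q4.zip']
```

Such a list could then be passed as the `names` parameter of `get_zip_by_names`.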
  
There are several factory methods that provide this functionality. First, let us load the data for the zip file **2022q1.zip**.

In [2]:
from secfsdstools.e_collector.zipcollecting import ZipCollector

collector: ZipCollector = ZipCollector.get_zip_by_name(name="2022q1.zip")

rawdatabag = collector.collect()

print("sub", rawdatabag.sub_df.shape)

# just print the size of the pre and num dataframes
print("pre", rawdatabag.pre_df.shape)
print("num", rawdatabag.num_df.shape)

# joining the pre and num dataframes
joineddatabag = rawdatabag.join()
print("pre_num", joineddatabag.pre_num_df.shape)

2023-10-23 20:27:46,320 [INFO] configmgt  reading configuration from C:\Users\hansj\.secfsdstools.cfg
2023-10-23 20:27:46,427 [INFO] parallelexecution      items to process: 1
2023-10-23 20:27:46,468 [INFO] zipcollecting  processing C:\Users\hansj\secfsdstools\data\parquet\quarter\2022q1.zip
2023-10-23 20:27:48,912 [INFO] parallelexecution      commited chunk: 0


sub (23657, 36)
pre (1372969, 10)
num (3277301, 9)
pre_num (2033785, 16)


As you may notice this is quite a significant amount of data that was loaded, just for one single quarter. 

Next, we are going to load data for **all the quarters of 2022, but only the Balance Sheet of the 10-K reports.**

In [4]:
from secfsdstools.e_collector.zipcollecting import ZipCollector

collector: ZipCollector = ZipCollector.get_zip_by_names(names=["2022q1.zip", "2022q2.zip", "2022q3.zip", "2022q4.zip"],
                                                        forms_filter=["10-K"],
                                                        stmt_filter=["BS"],)

rawdatabag = collector.collect()

print("sub", rawdatabag.sub_df.shape)

# just print the size of the pre and num dataframes
print("pre", rawdatabag.pre_df.shape)
print("num", rawdatabag.num_df.shape)

# joining the pre and num dataframes
joineddatabag = rawdatabag.join()
print("pre_num", joineddatabag.pre_num_df.shape)

2023-10-23 20:29:52,453 [INFO] configmgt  reading configuration from C:\Users\hansj\.secfsdstools.cfg
2023-10-23 20:29:52,485 [INFO] parallelexecution      items to process: 4
2023-10-23 20:30:27,282 [INFO] parallelexecution      commited chunk: 0


sub (6532, 36)
pre (306749, 10)
num (3021616, 9)
pre_num (482592, 16)


But we can even be a little more bold and read data from **all zip files at once**. We will read **10-K and 10-Q reports, but read only the Assets tag.** This will take some time.

Note: Use with caution, since this can fill up your memory if you don't provide a tag_filter.

In [3]:
from secfsdstools.e_collector.zipcollecting import ZipCollector

collector: ZipCollector = ZipCollector.get_all_zips(forms_filter=["10-K", "10-Q"],
                                                    tag_filter=["Assets"])

rawdatabag = collector.collect()

print("sub", rawdatabag.sub_df.shape)

# just print the size of the pre and num dataframes
print("pre", rawdatabag.pre_df.shape)
print("num", rawdatabag.num_df.shape)

# joining the pre and num dataframes
joineddatabag = rawdatabag.join()
print("pre_num", joineddatabag.pre_num_df.shape)

2023-10-23 20:27:54,270 [INFO] configmgt  reading configuration from C:\Users\hansj\.secfsdstools.cfg
2023-10-23 20:27:54,292 [INFO] parallelexecution      items to process: 57
2023-10-23 20:28:26,050 [INFO] parallelexecution      commited chunk: 0


sub (316161, 36)
pre (315679, 10)
num (759441, 9)
pre_num (772627, 16)


### `post_load_filter`
The ZipCollector factory methods have an additional filter parameter: `post_load_filter`.

Loading the data from multiple zip files is done in parallel processes. The number of parallel processes depends on the number of cores the system has. First, a `RawDataBag` instance is created for every zip file that has to be loaded. Once all the zip files are loaded, these `RawDataBag` instances are concatenated into a single `RawDataBag` instance. After that, additional filters can be applied with the `filter` method of the `RawDataBag`.

Especially when loading data from all zip files at once, it makes sense to be able to apply an additional filter directly after the data of a single zip file has been loaded, since this reduces the memory footprint significantly.

This is what the `post_load_filter` is for. It simply takes a function as parameter that receives a `RawDataBag` as parameter and returns a `RawDataBag` as result. Hence, it can also be defined as a lambda function.
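
The mechanism can be sketched as follows. This is a simplified illustration and not the library's implementation: a "bag" is just a list of records, `load_zip`, `collect` and `only_assets` are hypothetical names, the data is made up, and threads are used instead of processes to keep the sketch short.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional

Bag = List[dict]  # stand-in for a RawDataBag: just a list of records


def load_zip(name: str) -> Bag:
    """Hypothetical loader: returns made-up records for one quarter."""
    return [{"zip": name, "tag": tag} for tag in ("Assets", "Liabilities", "Revenues")]


def collect(names: List[str],
            post_load_filter: Optional[Callable[[Bag], Bag]] = None) -> Bag:
    def load_one(name: str) -> Bag:
        bag = load_zip(name)
        # the filter runs per zip file, before anything is concatenated
        return post_load_filter(bag) if post_load_filter else bag

    with ThreadPoolExecutor() as executor:
        bags = list(executor.map(load_one, names))
    # concatenate the per-zip results into a single bag
    return [record for bag in bags for record in bag]


def only_assets(bag: Bag) -> Bag:
    return [record for record in bag if record["tag"] == "Assets"]


names = ["2022q1.zip", "2022q2.zip"]
print(len(collect(names)))                                # 6
print(len(collect(names, post_load_filter=only_assets)))  # 2
```

Because the filter is applied to each per-zip result, the unfiltered data of all zip files never has to be held in memory at the same time.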

The following example will load the data for all available 10-K and 10-Q balance sheets. If we did that
without a `post_load_filter`, it would very likely result in an out-of-memory exception before we get the chance to reduce the data. Therefore, we add a `post_load_filter` that

* keeps only the datapoints of the current report period (by using `ReportPeriodRawFilter`)
* removes datapoints of subsidiaries (by using `MainCoregRawFilter`)
* creates a real copy of the reduced dataframes so that the dataframes containing all the data can be garbage collected (by using the `copy_bag()` method)

The `post_load_filter` is just a function that receives a `RawDataBag` and has to return a `RawDataBag`. It can be either defined as a function or
directly as a lambda expression.

```
    # as function
    def postloadfilter(databag: RawDataBag) -> RawDataBag:
        return databag[ReportPeriodRawFilter()][MainCoregRawFilter()]

    # as lambda
    post_filter = lambda x: x[ReportPeriodRawFilter()][MainCoregRawFilter()]
```

**Attention 1:** while either defining a function or using a lambda worked perfectly when running the script directly from the command line or in an IDE, it didn't work within Jupyter. In Jupyter, I had to use a function and, moreover, the needed imports had to be included in the function itself:
```
def postloadfilter(databag: RawDataBag) -> RawDataBag:
    from secfsdstools.e_filter.rawfiltering import ReportPeriodRawFilter, MainCoregRawFilter
    return databag[ReportPeriodRawFilter()][MainCoregRawFilter()]
```

**Attention 2:** running this code took a few minutes on my 4-core/32GB laptop (around 4 minutes). Moreover, it still needed about 18GB (!) of free memory when launched.

In [None]:
import os

from secfsdstools.e_collector.zipcollecting import ZipCollector
from secfsdstools.d_container.databagmodel import RawDataBag
from secfsdstools.e_filter.rawfiltering import ReportPeriodRawFilter, MainCoregRawFilter

target_path = "bs_10k_10q_all"
os.makedirs(target_path, exist_ok=True)

def postloadfilter(databag: RawDataBag) -> RawDataBag:
    from secfsdstools.e_filter.rawfiltering import ReportPeriodRawFilter, MainCoregRawFilter
    return databag[ReportPeriodRawFilter()][MainCoregRawFilter()]

collector: ZipCollector = ZipCollector.get_all_zips(forms_filter=["10-K", "10-Q"],
                                                    stmt_filter=["BS"],
                                                    post_load_filter=postloadfilter)

rawdatabag = collector.collect()

print("sub", rawdatabag.sub_df.shape)

# just print the size of the pre and num dataframes
print("pre", rawdatabag.pre_df.shape)
print("num", rawdatabag.num_df.shape)

# joining the pre and num dataframes
joineddatabag = rawdatabag.join()
print("pre_num", joineddatabag.pre_num_df.shape)

# of course, saving the databag is a good idea here
# but remember, the path has to exist and has to be empty
joineddatabag.save(target_path=target_path)

sub (316161, 36)
pre (14085879, 10)
num (51855742, 9)
pre_num (10723938, 16)