In [None]:
import pandas as pd
# ensure that all columns are shown and that column content is not cut
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.width', 1000)
pd.set_option('display.max_rows', 500) # ensure that all rows are shown

# Filter Deep Dive
This notebook introduces all available filters.

## Basics

A few basic points:

* All filters are implemented for both the `RawDataBag` and the `JoinedDataBag`. Depending on which databag type a filter is implemented for, its name ends in either `RawFilter` or `JoinedFilter`.
* Filters do not copy the dataframes. They apply the filter to the existing dataframes rather than working on copies.
* Every filter has a `filter()` method that takes a databag as its parameter and returns a new databag (again, the dataframes are not copied into the new databag instance). The databag itself also has a `filter()` method, so both of the following calls work:
```
a_filter = USDOnlyRawFilter()
a_rawdatabag: RawDataBag = ...

# use the filter() method of the filter..
new_databag = a_filter.filter(a_rawdatabag)

# or use the filter() method of the databag
new_databag = a_rawdatabag.filter(a_filter)
```
* Calls to the `filter()` method of the databag can be chained as follows:
```
filter1 = USDOnlyRawFilter()
filter2 = OfficialTagsOnlyRawFilter()
a_rawdatabag: RawDataBag = ...

new_databag = a_rawdatabag.filter(filter1).filter(filter2)

```
* The index operator (`[]`) of the databag class is forwarded to the `filter()` method, so the previous call can also be written as follows:
```
new_databag = a_rawdatabag[filter1][filter2]
```
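
To make the mechanics of the points above more concrete, here is a minimal, library-independent sketch of how a databag could forward both `filter()` and the index operator to a filter object. The names `ToyDataBag`, `UomToyFilter`, and `TagToyFilter` are hypothetical and are not part of secfsdstools; the sketch uses plain lists of dicts instead of dataframes and is not meant to reflect the actual implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyDataBag:
    """Hypothetical stand-in for a databag; holds plain rows instead of dataframes."""
    rows: List[dict]

    def filter(self, a_filter) -> "ToyDataBag":
        # delegate the work to the filter, which returns a new databag
        return a_filter.filter(self)

    def __getitem__(self, a_filter) -> "ToyDataBag":
        # the index operator simply forwards to filter()
        return self.filter(a_filter)


class UomToyFilter:
    """Hypothetical filter that keeps only rows with a given unit of measure."""

    def __init__(self, uom: str):
        self.uom = uom

    def filter(self, databag: ToyDataBag) -> ToyDataBag:
        return ToyDataBag(rows=[r for r in databag.rows if r["uom"] == self.uom])


class TagToyFilter:
    """Hypothetical filter that keeps only rows whose tag is in a given list."""

    def __init__(self, tags: List[str]):
        self.tags = tags

    def filter(self, databag: ToyDataBag) -> ToyDataBag:
        return ToyDataBag(rows=[r for r in databag.rows if r["tag"] in self.tags])


toy_bag = ToyDataBag(rows=[
    {"tag": "Assets", "uom": "USD"},
    {"tag": "Assets", "uom": "EUR"},
    {"tag": "Liabilities", "uom": "USD"},
])
filter1 = UomToyFilter(uom="USD")
filter2 = TagToyFilter(tags=["Assets"])

# all three spellings produce the same result
via_filter = filter2.filter(filter1.filter(toy_bag))
via_chain = toy_bag.filter(filter1).filter(filter2)
via_index = toy_bag[filter1][filter2]
print(via_index.rows)
```

Because `filter()` always returns a databag, each call can be followed directly by the next one, which is what makes both the chained and the indexed form possible.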

## Load Demo Databag

In [8]:
from secfsdstools.e_collector.zipcollecting import ZipCollector

databag = ZipCollector.get_zip_by_name('2022q4.zip').collect()

print("sub: ", databag.sub_df.shape)
print("pre: ", databag.pre_df.shape)
print("num: ", databag.num_df.shape)

2023-11-29 06:55:39,372 [INFO] configmgt  reading configuration from C:\Users\hansj\.secfsdstools.cfg
2023-11-29 06:55:39,403 [INFO] parallelexecution      items to process: 1
2023-11-29 06:55:39,437 [INFO] zipcollecting  processing C:\Users\hansj\secfsdstools\data\parquet\quarter\2022q4.zip
2023-11-29 06:55:40,810 [INFO] parallelexecution      commited chunk: 0


sub:  (23943, 36)
pre:  (1315392, 10)
num:  (2701629, 9)


## `AdshRawFilter`

This filter lets you select the data of specific reports by their adsh (accession) number. Pass the list of adsh numbers you are interested in to the constructor of the filter.

It operates on all dataframes (sub, pre, and num).

In [7]:
from secfsdstools.e_filter.rawfiltering import AdshRawFilter

apple_10k_2022_adsh = "0000320193-22-000108"
adsh_filter = AdshRawFilter(adshs=[apple_10k_2022_adsh])

filtered_databag = databag[adsh_filter]

print("sub: ", filtered_databag.sub_df.shape)
print("pre: ", filtered_databag.pre_df.shape)
print("num: ", filtered_databag.num_df.shape)

sub:  (1, 36)
pre:  (185, 10)
num:  (503, 9)
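
Conceptually, selecting reports by adsh amounts to keeping the rows whose `adsh` column matches one of the requested values. The following sketch shows that idea with plain pandas on small toy dataframes. The helper `select_by_adsh`, the second adsh value, and the numeric values are made up for illustration; this is an assumption about the general technique, not the library's actual implementation.

```python
import pandas as pd

apple_adsh = "0000320193-22-000108"
other_adsh = "0000000000-22-000001"  # made-up adsh for illustration

# toy stand-ins for the sub and num dataframes (values are made up)
toy_sub_df = pd.DataFrame({
    "adsh": [apple_adsh, other_adsh],
    "form": ["10-K", "10-Q"],
})
toy_num_df = pd.DataFrame({
    "adsh": [apple_adsh, apple_adsh, other_adsh],
    "tag": ["Assets", "Liabilities", "Assets"],
    "value": [1.0, 2.0, 3.0],
})


def select_by_adsh(df: pd.DataFrame, adshs: list) -> pd.DataFrame:
    """Hypothetical helper: keep only the rows whose adsh is in the given list."""
    return df[df["adsh"].isin(adshs)]


filtered_sub = select_by_adsh(toy_sub_df, [apple_adsh])
filtered_num = select_by_adsh(toy_num_df, [apple_adsh])

print("sub: ", filtered_sub.shape)
print("num: ", filtered_num.shape)
```

The same selection would apply to the pre dataframe, since it also carries an `adsh` column.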


In [None]:
# TODO: implement your own filter