In [9]:
import pandas as pd
# ensure that all columns are shown and that column content is not cut
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.width', 1000)
pd.set_option('display.max_rows', 500) # show up to 500 rows before truncating

# Financial Statements Standardizer

## Goal

Even when adhering to the US-GAAP standard, financial statements among different companies or even across different years of the same company are often not directly comparable.

Let's examine the balance sheet to illustrate a couple of problems that can arise:

- There are over 3000 different tags that could be used in a balance sheet, even though a balance sheet typically only has about 30-40 rows.
- Some tags have a similar meaning; for instance, the position "Total assets" can be tagged with "Assets," but sometimes, the tag "AssetsNet" is also used.
- Sometimes not all major positions are presented. For instance, you would normally expect Liabilities, LiabilitiesCurrent, and LiabilitiesNoncurrent to appear in the balance sheet. However, some reports list only Liabilities, LiabilitiesCurrent, and the detailed positions of LiabilitiesNoncurrent, with no total position for LiabilitiesNoncurrent. Sometimes even the total position for Liabilities is missing.

The Standardizer processes the data and produces comparable statements that contain the main positions of a certain financial statement. For example, the balance sheet standardizer produces reports with values for Assets, AssetsCurrent, AssetsNoncurrent, Liabilities, LiabilitiesCurrent, LiabilitiesNoncurrent, Equity, as well as a few other positions that are not always present.

To achieve this, the standardizer uses a **simple rule framework** that lets you define rules acting on the data. In the context of the balance sheet, a few rules include:
- If there is an AssetsNet tag but no Assets tag, copy the value from AssetsNet to Assets.
- If two of the tags Assets, AssetsCurrent, AssetsNoncurrent are present, calculate the missing one by applying the formula Assets = AssetsCurrent + AssetsNoncurrent.
- If the LiabilitiesNoncurrent tag is missing, sum up any existing detail tags of LiabilitiesNoncurrent and store the sum in the LiabilitiesNoncurrent tag.
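
The three example rules can be sketched with plain pandas. This is only an illustration of the idea, not the rule framework of the library: the helper names (`copy_if_missing`, `complete_sum`, `sum_details`), the sample values, and the choice of `LongTermDebtNoncurrent` and `OtherLiabilitiesNoncurrent` as detail tags are assumptions made for this example. Each helper returns a boolean mask of the rows it changed, which is the kind of information that can be logged.

```python
import numpy as np
import pandas as pd


def copy_if_missing(df, source, target):
    """Copy `source` into `target` where `target` is missing; return the changed rows."""
    mask = df[target].isna() & df[source].notna()
    df.loc[mask, target] = df.loc[mask, source]
    return mask


def complete_sum(df, total, part_a, part_b):
    """For total = part_a + part_b, calculate the single value that is missing."""
    m_total = df[total].isna() & df[part_a].notna() & df[part_b].notna()
    df.loc[m_total, total] = df.loc[m_total, part_a] + df.loc[m_total, part_b]
    m_a = df[part_a].isna() & df[total].notna() & df[part_b].notna()
    df.loc[m_a, part_a] = df.loc[m_a, total] - df.loc[m_a, part_b]
    m_b = df[part_b].isna() & df[total].notna() & df[part_a].notna()
    df.loc[m_b, part_b] = df.loc[m_b, total] - df.loc[m_b, part_a]
    return m_total | m_a | m_b


def sum_details(df, target, details):
    """Fill a missing `target` with the sum of whichever detail tags are present."""
    mask = df[target].isna() & df[details].notna().any(axis=1)
    df.loc[mask, target] = df.loc[mask, details].sum(axis=1)
    return mask


# Hypothetical data: one row per report, one column per tag.
bs = pd.DataFrame({
    "Assets": [np.nan, 100.0, np.nan],
    "AssetsNet": [50.0, np.nan, np.nan],
    "AssetsCurrent": [20.0, 60.0, 30.0],
    "AssetsNoncurrent": [np.nan, np.nan, 70.0],
    "LiabilitiesNoncurrent": [np.nan, 10.0, np.nan],
    "LongTermDebtNoncurrent": [5.0, np.nan, np.nan],
    "OtherLiabilitiesNoncurrent": [2.0, np.nan, np.nan],
})

# Apply the rules in order and record which rows each rule touched.
log = pd.DataFrame(index=bs.index)
log["copy_AssetsNet"] = copy_if_missing(bs, "AssetsNet", "Assets")
log["complete_Assets"] = complete_sum(bs, "Assets", "AssetsCurrent", "AssetsNoncurrent")
log["sum_LiabilitiesNoncurrent"] = sum_details(
    bs, "LiabilitiesNoncurrent", ["LongTermDebtNoncurrent", "OtherLiabilitiesNoncurrent"]
)
print(bs[["Assets", "AssetsCurrent", "AssetsNoncurrent", "LiabilitiesNoncurrent"]])
print(log)
```

Note that the order matters: in the first row, AssetsNoncurrent can only be calculated because the copy rule has already filled Assets.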

Since calculations are involved, which could be incorrect or problematic under certain circumstances, **any action is logged**: whenever a rule is applied to a report/financial statement, this is recorded. With that information, a user can trace how many rules, and which ones, were applied to which tags of a particular report.

As mentioned, applying rules can lead to incorrect results or interpretations under certain circumstances. Moreover, the input data itself could be incorrect, or essential information could be missing from the dataset altogether. Therefore, **validation rules** can be defined; they are applied at the end of processing. For the balance sheet, examples of validation rules are:
- Assets = AssetsCurrent + AssetsNoncurrent
- Liabilities = LiabilitiesCurrent + LiabilitiesNoncurrent
- Assets = Liabilities + Equity

These checks are applied to every financial statement, and each result is reported as a relative error and an error category (category 0 = exact match, category 1 = less than 1%, category 5 = less than 5%, category 10 = less than 10%, category 100 = greater than 10%). For instance, if you want to use the data to train an ML model, you might want to select only statements for which all checks are below category 5.
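
A validation check of this kind can be sketched as follows. The function `validate_sum` and the sample values are hypothetical. The exact error formula (absolute difference divided by the absolute total) and the handling of the category boundaries are assumptions as well, so the library's own results may differ in edge cases.

```python
import numpy as np
import pandas as pd


def validate_sum(df, name, total, parts):
    """Check total = sum(parts) and report a relative error plus an error category."""
    # min_count makes the expected value NaN if any part is missing
    expected = df[parts].sum(axis=1, min_count=len(parts))
    diff = (df[total] - expected).abs()
    # an exact match counts as error 0, even if the total itself is 0
    error = (diff / df[total].abs()).where(diff != 0, 0.0)

    # the first matching condition wins; anything else ends up in category 100
    conditions = [error.isna(), error == 0, error < 0.01, error < 0.05, error < 0.10]
    categories = [np.nan, 0, 1, 5, 10]

    result = pd.DataFrame(index=df.index)
    result[f"{name}_error"] = error
    result[f"{name}_cat"] = np.select(conditions, categories, default=100)
    return result


# Hypothetical statements; the last one has no Assets value, so it cannot be checked.
sample = pd.DataFrame({
    "Assets": [100.0, 100.0, 100.0, 100.0, 100.0, np.nan],
    "AssetsCurrent": [40.0, 40.0, 40.0, 40.0, 40.0, 40.0],
    "AssetsNoncurrent": [60.0, 59.5, 57.0, 52.0, 40.0, 60.0],
})
checks = validate_sum(sample, "AssetsCheck", "Assets", ["AssetsCurrent", "AssetsNoncurrent"])
print(checks)
```

A statement whose check cannot be evaluated gets a missing category in this sketch, rather than being reported as a match.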


## Main Process

The main process comprises four key steps:

1. **Preprocessing**
    1. Removal of unused tags: Based on predefined rules, all tags not utilized by these rules are eliminated.
    1. Deduplication: Occasionally, values for certain tags or even entire sets of tags in financial statements may be duplicated. These redundant entries need to be removed.
    1. Inversion of negated values: The sign of values marked as negated is inverted.
    1. Table pivoting: Currently, each tag and its corresponding value have their own row in the dataset. The goal is to transform this structure so that each tag has its own column.
    1. Filtering for main statements: Some financial reports contain multiple tables attributed to a specific financial statement. This step aims to retain only the main statement.
    1. Application of preprocess rules: These rules are designed to rectify errors in the data. For example, there may be reports where the tags for Assets and AssetsNoncurrent are interchanged, causing the value of Assets to be tagged as AssetsNoncurrent and vice versa. Preprocess rules help correct such errors.
    1. Preparation of log dataframes.

2. **Main Processing**<br> This step applies the main rules, following the order in which they are defined. The entire rule tree can be executed multiple times, as applying a rule at the end of the tree could calculate a previously absent tag that can then be used to calculate another value in the next iteration.

3. **Postprocessing**<br> Postprocess rules are applied in this step. Their primary purpose is to refine the results, such as setting values to zero.

4. **Finalizing**<br> Validation rules are applied, and summary logs are generated.
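
The following sketch illustrates two of the ideas above on a tiny, made-up dataset: the preprocessing steps that remove unused tags, deduplicate, invert negated values, and pivot the table, and the repeated execution of the main rules. The column names, the tag `SomeUnusedTag`, the two rule functions, and the stop condition (stop as soon as a pass calculates nothing new) are assumptions for illustration. The library may control the number of iterations differently.

```python
import pandas as pd

# Hypothetical long-format input: one row per report (adsh) and tag.
long_df = pd.DataFrame({
    "adsh": ["r1", "r1", "r1", "r1", "r1"],
    "tag": ["AssetsNet", "AssetsNet", "AssetsCurrent", "TreasuryStockValue", "SomeUnusedTag"],
    "value": [100.0, 100.0, 40.0, 5.0, 1.0],
    "negating": [0, 0, 0, 1, 0],
})
USED_TAGS = ["Assets", "AssetsNet", "AssetsCurrent", "AssetsNoncurrent", "TreasuryStockValue"]


def preprocess(df, used_tags):
    df = df[df["tag"].isin(used_tags)]           # removal of unused tags
    df = df.drop_duplicates().copy()             # deduplication
    df.loc[df["negating"] == 1, "value"] *= -1   # inversion of negated values
    wide = df.pivot(index="adsh", columns="tag", values="value")  # one column per tag
    return wide.reindex(columns=used_tags)       # add empty columns for absent tags


def derive_noncurrent(wide):
    """AssetsNoncurrent = Assets - AssetsCurrent, where AssetsNoncurrent is missing."""
    mask = wide["AssetsNoncurrent"].isna()
    wide.loc[mask, "AssetsNoncurrent"] = wide.loc[mask, "Assets"] - wide.loc[mask, "AssetsCurrent"]


def copy_assets_net(wide):
    """Copy AssetsNet to Assets, where Assets is missing."""
    mask = wide["Assets"].isna()
    wide.loc[mask, "Assets"] = wide.loc[mask, "AssetsNet"]


def run_rules(wide, rules, max_iterations=5):
    """Apply all rules in order, repeating until a pass fills no further values."""
    passes = 0
    for _ in range(max_iterations):
        missing_before = int(wide.isna().sum().sum())
        for rule in rules:
            rule(wide)
        passes += 1
        if int(wide.isna().sum().sum()) == missing_before:
            break  # nothing new was calculated, another pass would not help
    return passes


wide = preprocess(long_df, USED_TAGS)
passes = run_rules(wide, [derive_noncurrent, copy_assets_net])
print(wide)
print("passes:", passes)
```

In the first pass, `derive_noncurrent` cannot do anything because Assets is still missing. Only after `copy_assets_net` has run at the end of that pass can the second pass calculate AssetsNoncurrent, which is why the rules are executed more than once.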


## Preparing the Example

To explain the details of the standardizer, we apply the balance sheet standardizer to the reports of the year 2022.

In [6]:
# first, create a collector which collects all reports for 2022
from secfsdstools.d_container.databagmodel import JoinedDataBag
from secfsdstools.e_collector.zipcollecting import ZipCollector
from secfsdstools.u_usecases.bulk_loading import default_postloadfilter

collector = ZipCollector.get_zip_by_names(names=["2022q1.zip", "2022q2.zip", "2022q3.zip", "2022q4.zip"], 
                                          forms_filter=["10-K", "10-Q"],                                        
                                          stmt_filter=["BS"], post_load_filter=default_postloadfilter)

joined_bag: JoinedDataBag = collector.collect().join()
print("number of loaded reports: ", len(joined_bag.sub_df))

2024-01-11 06:42:25,242 [INFO] configmgt  reading configuration from C:\Users\hansj\.secfsdstools.cfg
2024-01-11 06:42:25,262 [INFO] parallelexecution      items to process: 4
2024-01-11 06:42:51,966 [INFO] parallelexecution      commited chunk: 0


number of loaded reports:  26357


**Note:** <br> We only load the 10-K and 10-Q reports and filter directly for balance sheet information. We also apply the `default_postloadfilter`, which includes the ReportPeriodRawFilter, MainCoregRawFilter, OfficialTagsOnlyRawFilter, and USDOnlyRawFilter. You should definitely apply the ReportPeriodRawFilter and the USDOnlyRawFilter. The standardizer should also work without the MainCoregRawFilter, in which case it would standardize the statements of subsidiaries as well; however, I have not tested this.

## Using the Standardizer

The standardizer implements the presenter interface, so you can pass it as a parameter to the `present` method of the `JoinedDataBag` instance. Alternatively, you can call the `process` method of the standardizer directly and provide the `pre_num_df` of the `JoinedDataBag` as input.

In [8]:
from secfsdstools.f_standardize.bs_standardize import BalanceSheetStandardizer

standardizer = BalanceSheetStandardizer()
standardized_bs_df = joined_bag.present(standardizer)

In [10]:
standardized_bs_df[:10]

tag,adsh,coreg,report,ddate,uom,Assets,AssetsCurrent,AssetsNoncurrent,Liabilities,LiabilitiesCurrent,LiabilitiesNoncurrent,HolderEquity,TemporaryEquity,RedeemableEquity,Equity,LiabilitiesAndEquity,Cash,RetainedEarnings,AdditionalPaidInCapital,TreasuryStockValue,AssetsCheck_error,AssetsCheck_cat,LiabilitiesCheck_error,LiabilitiesCheck_cat,EquityCheck_error,EquityCheck_cat,AssetsLiaEquCheck_error,AssetsLiaEquCheck_cat
0,0000002178-22-000033,,3,20211231,USD,374703000.0,273210000.0,101493000.0,214317000.0,186011000.0,28306000.0,160386000.0,0.0,0.0,160386000.0,374703000.0,97825000.0,143040000.0,16913000.0,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0000002178-22-000046,,2,20220331,USD,470117000.0,370972000.0,99145000.0,304596000.0,276979000.0,27617000.0,165521000.0,0.0,0.0,165521000.0,470117000.0,99295000.0,148066000.0,17020000.0,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0000002178-22-000066,,2,20220630,USD,504623000.0,408006000.0,96617000.0,337171000.0,310972000.0,26199000.0,167452000.0,0.0,0.0,167452000.0,504623000.0,67728000.0,149475000.0,17541000.0,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0000002178-22-000089,,2,20220930,USD,462123000.0,326647000.0,135476000.0,292872000.0,245231000.0,47641000.0,169251000.0,0.0,0.0,169251000.0,462123000.0,86510000.0,150595000.0,18218000.0,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0000002488-22-000016,,5,20211231,USD,12419000000.0,8583000000.0,3836000000.0,4922000000.0,4240000000.0,682000000.0,7497000000.0,0.0,0.0,7497000000.0,12419000000.0,2535000000.0,-1451000000.0,,-2130000000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0000002488-22-000078,,4,20220331,USD,66915000000.0,13369000000.0,53546000000.0,11582000000.0,5581000000.0,6001000000.0,55333000000.0,0.0,0.0,55333000000.0,66915000000.0,4740000000.0,-665000000.0,,-941000000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0000002488-22-000123,,4,20220630,USD,67502000000.0,13462000000.0,54040000000.0,12333000000.0,5523000000.0,6810000000.0,55169000000.0,0.0,0.0,55169000000.0,67502000000.0,4964000000.0,-218000000.0,,-1893000000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0000002488-22-000170,,4,20220930,USD,67811000000.0,14420000000.0,53391000000.0,13269000000.0,6691000000.0,6578000000.0,54542000000.0,0.0,0.0,54542000000.0,67811000000.0,3398000000.0,-152000000.0,,-2815000000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0000002969-22-000010,,5,20211231,USD,27125300000.0,6483500000.0,20641800000.0,12749700000.0,2630100000.0,10119600000.0,14375600000.0,0.0,0.0,14375600000.0,27125300000.0,2953700000.0,15905200000.0,,-1989200000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0000002969-22-000026,,5,20220331,USD,27449700000.0,6249400000.0,21200300000.0,12939000000.0,3205100000.0,9733900000.0,14510700000.0,0.0,0.0,14510700000.0,27449700000.0,2348700000.0,16075900000.0,,-1985400000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
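
The `*_cat` columns in the result make it easy to keep only statements that pass all validation checks up to a chosen category. The following is a minimal sketch on a small, made-up frame that reuses the column names from the output above. The helper `select_reliable` is hypothetical, and treating the threshold as inclusive is an assumption.

```python
import numpy as np
import pandas as pd


def select_reliable(df, max_category=5):
    """Keep rows whose validation categories are all <= max_category."""
    cat_cols = [col for col in df.columns if col.endswith("_cat")]
    # a missing category (check could not be evaluated) compares as False and is dropped
    passed = (df[cat_cols] <= max_category).all(axis=1)
    return df[passed]


# Hypothetical results for four reports.
sample = pd.DataFrame({
    "adsh": ["a", "b", "c", "d"],
    "AssetsCheck_cat": [0.0, 1.0, 0.0, 0.0],
    "LiabilitiesCheck_cat": [0.0, 5.0, 10.0, np.nan],
    "EquityCheck_cat": [0.0, 0.0, 0.0, 0.0],
    "AssetsLiaEquCheck_cat": [0.0, 0.0, 0.0, 0.0],
})

print(select_reliable(sample)["adsh"].tolist())
print(select_reliable(sample, max_category=1)["adsh"].tolist())
```

With the default threshold, reports in categories 0, 1, and 5 are kept; pass `max_category=1` to exclude category 5 as well.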


content
- goal
- problems / examples
  - missing Liabilities and LiabilitiesNoncurrent tags
- what does the standardizer do
  - pre steps
    - deduplication
    - correct errors
  - main rules
    - iterated
  - post rules
  - validation
- loading and saving information

- Example BalanceSheet
- input
  - filtered data
  - only statement, only one currency, only main company, only standardized tags
- what kind of logs are produced, what do they show
- what you should consider if using standardized information
  - check logs / check validation summary
- limitations
  - wrong data, missing tags in data
  - non-standardized tags

- Empirical approach, definition is not taken into account