In [1]:
import pandas as pd
# ensure that all columns are shown and that column content is not cut
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.width',1000)
pd.set_option('display.max_rows', 500) # ensure that all rows are shown

# Financial Statements Standardizer

## Goal

Even when adhering to the US-GAAP standard, financial statements among different companies or even across different years of the same company are often not directly comparable.

Let's examine the balance sheet to illustrate a couple of problems that can arise:

- There are over 3000 different tags that could be used in a balance sheet, even though a balance sheet typically only has about 30-40 rows.
- Some tags have a similar meaning; for instance, the position "Total assets" can be tagged with "Assets," but sometimes, the tag "AssetsNet" is also used.
- Sometimes not all major positions are presented. For instance, you would normally expect Liabilities, LiabilitiesCurrent, and LiabilitiesNoncurrent to appear in the balance sheet. However, some reports list only Liabilities, LiabilitiesCurrent, and the detailed positions of LiabilitiesNoncurrent, but no total position for LiabilitiesNoncurrent. Sometimes, even the total position for Liabilities is missing.

The Standardizer processes the data and produces comparable statements that contain the main positions of a certain financial statement. For example, the balance sheet standardizer produces reports with values for Assets, AssetsCurrent, AssetsNoncurrent, Liabilities, LiabilitiesCurrent, LiabilitiesNoncurrent, Equity, as well as a few other positions that are not always present.

To achieve this, the standardizer uses a **simple rule framework** that lets you define rules acting on the data. In the context of the balance sheet, a few rules include:
- If there is an AssetsNet tag but no Assets tag, copy the value from AssetsNet to Assets.
- If two of the tags Assets, AssetsCurrent, AssetsNoncurrent are present, calculate the missing one by applying the formula Assets = AssetsCurrent + AssetsNoncurrent.
- If the LiabilitiesNoncurrent tag is missing, sum up any existing detail tags of LiabilitiesNoncurrent and store the sum in the LiabilitiesNoncurrent tag.
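
The three rules above can be sketched with plain pandas. This is only an illustration on a made-up, already pivoted dataframe. The helper functions `copy_tag`, `complete_sum`, and `sum_details`, as well as the choice of `LongTermDebt` and `OtherLiabilitiesNoncurrent` as detail tags, are assumptions for this example and not the rule classes of the library.

```python
import numpy as np
import pandas as pd

# Toy pivoted data: one row per report, one column per tag (values are made up).
df = pd.DataFrame({
    "Assets": [np.nan, 100.0, np.nan],
    "AssetsNet": [50.0, np.nan, np.nan],
    "AssetsCurrent": [20.0, 60.0, 30.0],
    "AssetsNoncurrent": [np.nan, np.nan, 70.0],
    "LiabilitiesNoncurrent": [np.nan, 10.0, np.nan],
    "LongTermDebt": [5.0, 8.0, np.nan],
    "OtherLiabilitiesNoncurrent": [2.0, 2.0, np.nan],
})

def copy_tag(df, source, target):
    """Copy source into target where target is missing and source is present."""
    mask = df[target].isna() & df[source].notna()
    df.loc[mask, target] = df.loc[mask, source]
    return mask  # boolean log: rows where the rule was applied

def complete_sum(df, total, part_a, part_b):
    """Fill the single missing member of total = part_a + part_b."""
    m_total = df[total].isna() & df[part_a].notna() & df[part_b].notna()
    m_a = df[part_a].isna() & df[total].notna() & df[part_b].notna()
    m_b = df[part_b].isna() & df[total].notna() & df[part_a].notna()
    df.loc[m_total, total] = df.loc[m_total, part_a] + df.loc[m_total, part_b]
    df.loc[m_a, part_a] = df.loc[m_a, total] - df.loc[m_a, part_b]
    df.loc[m_b, part_b] = df.loc[m_b, total] - df.loc[m_b, part_a]
    return m_total | m_a | m_b

def sum_details(df, target, details):
    """Sum the available detail tags into target where target is missing."""
    mask = df[target].isna() & df[details].notna().any(axis=1)
    df.loc[mask, target] = df.loc[mask, details].sum(axis=1)
    return mask

# Apply the rules in order and keep a log of where each one was applied.
log = pd.DataFrame({
    "copy_assets": copy_tag(df, "AssetsNet", "Assets"),
    "complete_assets": complete_sum(df, "Assets", "AssetsCurrent", "AssetsNoncurrent"),
    "sum_liabilities_noncurrent": sum_details(
        df, "LiabilitiesNoncurrent", ["LongTermDebt", "OtherLiabilitiesNoncurrent"]),
})
print(df[["Assets", "AssetsCurrent", "AssetsNoncurrent", "LiabilitiesNoncurrent"]])
print(log)
```

Each helper returns the boolean mask of the rows it changed, which mirrors the idea that every applied rule is logged per report.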

Since calculations are involved, which under certain circumstances could be incorrect or problematic, **any action is logged**. Therefore, if a specific rule was applied for a certain report/financial statement, it is logged. With that information, a user can trace how many rules and which rules were applied to which tags of a particular report.

As mentioned, applying certain rules could, under certain circumstances, lead to incorrect results or interpretations. Moreover, the input data could be incorrect, or essential information could be missing from the dataset altogether. Therefore, **validation rules** can be defined and are applied at the end of processing the data. In the case of the balance sheet, a few examples of validation rules are Assets = AssetsCurrent + AssetsNoncurrent, Liabilities = LiabilitiesCurrent + LiabilitiesNoncurrent, Assets = Liabilities + Equity. These validation checks are applied to every financial statement, and the results are presented with a relative error and a categorized error (category 0 = exact match, category 1 = less than 1% off, category 5 = less than 5% off, category 10 = less than 10% off, category 100 = greater than 10% off). For instance, if you want to use the data to train an ML model, you might want to include only data for reports where all validation rules have a category of 5 or less.
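
To make the categories concrete, here is a minimal sketch of how a relative error could be mapped to them. The function `categorize`, the choice to divide by the reported `Assets` value, and the handling of the exact 10% boundary are assumptions for illustration; the library may compute these details differently.

```python
import numpy as np
import pandas as pd

def categorize(rel_error):
    """Map a relative error (0.01 means 1%) to the categories 0, 1, 5, 10, 100."""
    if pd.isna(rel_error):
        return np.nan   # the rule could not be evaluated
    if rel_error == 0:
        return 0        # exact match
    if rel_error < 0.01:
        return 1
    if rel_error < 0.05:
        return 5
    if rel_error < 0.10:
        return 10
    return 100          # assumption: exactly 10% already counts as category 100

# Toy data (values are made up).
df = pd.DataFrame({
    "Assets": [100.0, 100.0, 100.0, 100.0],
    "AssetsCurrent": [60.0, 60.0, 60.0, 20.0],
    "AssetsNoncurrent": [40.0, 39.5, 37.0, 40.0],
})

# Validation rule: Assets = AssetsCurrent + AssetsNoncurrent
expected = df["AssetsCurrent"] + df["AssetsNoncurrent"]
df["AssetsCheck_error"] = ((df["Assets"] - expected) / df["Assets"]).abs()
df["AssetsCheck_cat"] = df["AssetsCheck_error"].apply(categorize)
print(df)
```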

**Disclaimer** <br> **USE AT YOUR OWN RISK.** <br> As mentioned before, the applied rules could be wrong and the input data could be incorrect. Always check the official company filings if you want to make investment decisions!

## Main Process

The main process comprises four key steps:

1. **Preprocessing**
    1. Removal of unused tags: Based on predefined rules, all tags not utilized by these rules are eliminated.
    1. Apply PrePivotRules, for instance deduplication: Occasionally, values for certain tags or even entire sets of tags in financial statements may be duplicated. These redundant entries need to be removed before the data can be pivoted.
    1. Inversion of negated values: The sign of values marked as negated is inverted.
    1. Table pivoting: Currently, each tag and its corresponding value have their own row in the dataset. The goal is to transform this structure so that each tag has its own column.
    1. Filtering for main statements: Some financial reports contain multiple tables attributed to a specific financial statement. This step aims to retain only the main statement.
    1. Application of preprocess rules: These rules are designed to rectify errors in the data. For example, there may be reports where the tags for Assets and AssetsNoncurrent are interchanged, causing the value of Assets to be tagged as AssetsNoncurrent and vice versa. Preprocess rules help correct such errors.
    1. Preparation of log dataframes.

2. **Main Processing**<br> This step applies the main rules, following the order in which they are defined. The entire rule tree can be executed multiple times, as applying a rule at the end of the tree could calculate a previously absent tag that can then be used to calculate another value in the next iteration.

3. **Postprocessing**<br> Postprocess rules are applied in this step. Their primary purpose is to refine the results, such as setting values to zero.

4. **Finalizing**<br> Validation rules are applied, and summary logs are generated.
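
The first preprocessing steps (deduplication, inversion of negated values, and pivoting) can be illustrated on a small, made-up dataset. The column names are simplified and the code is only a sketch of the idea, not the implementation used by the standardizer.

```python
import pandas as pd

# Toy long-format data: one row per tag and value (values are made up).
long_df = pd.DataFrame({
    "adsh": ["r1", "r1", "r1", "r1", "r2", "r2"],
    "tag": ["Assets", "Assets", "AssetsCurrent", "TreasuryStockValue",
            "Assets", "AssetsCurrent"],
    "value": [100.0, 100.0, 60.0, 5.0, 80.0, 30.0],
    "negating": [0, 0, 0, 1, 0, 0],
})

# 1. remove exact duplicates so that every (adsh, tag) pair is unique
deduped = long_df.drop_duplicates().copy()

# 2. invert the sign of values that are flagged as negated
deduped.loc[deduped["negating"] == 1, "value"] *= -1

# 3. pivot: one row per report, one column per tag
wide_df = deduped.pivot(index="adsh", columns="tag", values="value")
print(wide_df)
```

Without the deduplication step, `pivot` would raise an error for report `r1`, because the combination of `adsh` and `tag` would not be unique. This is why duplicates have to be removed before the data can be pivoted.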


## Preparing the Example

In order to explain the details of the standardizer, we apply the balance sheet standardizer on the reports of the year 2022.

In [2]:
# first, create a collector which collects all reports for 2022
from secfsdstools.d_container.databagmodel import JoinedDataBag
from secfsdstools.e_collector.zipcollecting import ZipCollector
from secfsdstools.u_usecases.bulk_loading import default_postloadfilter

collector = ZipCollector.get_zip_by_names(names=["2022q1.zip", "2022q2.zip", "2022q3.zip", "2022q4.zip"], 
                                          forms_filter=["10-K", "10-Q"],                                        
                                          stmt_filter=["BS"], post_load_filter=default_postloadfilter)

joined_bag: JoinedDataBag = collector.collect().join()
print("number of loaded reports: ", len(joined_bag.sub_df))

2024-03-27 06:47:22,603 [INFO] configmgt  reading configuration from C:\Users\hansj\.secfsdstools.cfg
2024-03-27 06:47:22,641 [INFO] parallelexecution      items to process: 4
2024-03-27 06:47:50,182 [INFO] parallelexecution      commited chunk: 0


number of loaded reports:  26357


**Note:** <br> We only load the 10-K and 10-Q reports and filter directly for balance sheet information. We also apply the default_postloadfilter, which includes ReportPeriodRawFilter, MainCoregRawFilter, OfficialTagsOnlyRawFilter, and USDOnlyRawFilter. You should definitely apply the ReportPeriodRawFilter and the USDOnlyRawFilter. The standardizer should also work without the MainCoregRawFilter, in which case the statements of subsidiaries are standardized as well.

## Using the standardizer

The standardizer implements the presenter interface, so you can pass it as a parameter to the `present` method of a `JoinedDataBag` instance. Alternatively, you can call the `process` method of the standardizer and provide the `pre_num_df` of the `JoinedDataBag` as input.

However, there is a slight difference between those two methods. When you use the `present` method, then the following attributes from the sub_df are joined to the standardized result: 
* cik (company identifier)
* name (company name)
* form (10-K or 10-Q)
* fye (fiscal year ending)
* fy (fiscal year)
* fp (fiscal period)

This makes it easier to identify the entries for one company and is therefore the recommended way.

In [3]:
from secfsdstools.f_standardize.bs_standardize import BalanceSheetStandardizer

standardizer = BalanceSheetStandardizer()
standardized_bs_df = joined_bag.present(standardizer)



In [4]:
standardized_bs_df[:10]

Unnamed: 0,adsh,cik,name,form,fye,fy,fp,date,coreg,report,ddate,uom,qtrs,Assets,AssetsCurrent,AssetsNoncurrent,Liabilities,LiabilitiesCurrent,LiabilitiesNoncurrent,HolderEquity,TemporaryEquity,RedeemableEquity,Equity,LiabilitiesAndEquity,Cash,RetainedEarnings,AdditionalPaidInCapital,TreasuryStockValue,AssetsCheck_error,AssetsCheck_cat,LiabilitiesCheck_error,LiabilitiesCheck_cat,EquityCheck_error,EquityCheck_cat,AssetsLiaEquCheck_error,AssetsLiaEquCheck_cat
12985,0001663577-22-000212,1554906,CROWN BAUS CAPITAL CORP.,10-K,430,2015.0,FY,2015-04-30,,2,20150430,USD,0,0.0,0.0,0.0,253680.0,253680.0,0.0,-253680.0,0.0,0.0,-253680.0,0.0,,-45278680.0,44885000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12986,0001663577-22-000214,1554906,CROWN BAUS CAPITAL CORP.,10-K,430,2016.0,FY,2016-04-30,,2,20160430,USD,0,7.0,7.0,0.0,379980.0,379980.0,0.0,-379973.0,0.0,0.0,-379973.0,7.0,7.0,-45404973.0,44885000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6949,0001493152-22-015967,1442853,"INDO GLOBAL EXCHANGE(S) PTE, LTD.",10-K,731,2016.0,FY,2016-07-31,,2,20160731,USD,0,0.0,0.0,0.0,486515.0,486515.0,0.0,-486515.0,0.0,0.0,-486515.0,0.0,,-7591643.0,6024427.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2998,0001477932-22-001114,1374881,KINGFISH HOLDING CORP,10-K,930,2016.0,FY,2016-09-30,,2,20160930,USD,0,21308.0,21308.0,0.0,230324.0,210324.0,20000.0,-209016.0,0.0,0.0,-209016.0,21308.0,21308.0,-4579323.0,4378213.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6950,0001493152-22-015996,1442853,"INDO GLOBAL EXCHANGE(S) PTE, LTD.",10-Q,731,2017.0,Q1,2016-10-31,,2,20161031,USD,0,0.0,0.0,0.0,486515.0,486515.0,0.0,-486515.0,0.0,0.0,-486515.0,0.0,0.0,-7591643.0,6024427.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
24546,0001099910-22-000217,1427644,"TELCO CUBA, INC.",10-K,1130,2016.0,FY,2016-11-30,,2,20161130,USD,0,41602.0,26864.0,14738.0,5381900.0,5381900.0,0.0,-5340298.0,0.0,0.0,-5340298.0,41602.0,21414.0,-6114193.0,558926.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
18590,0001096906-22-001760,1387998,SNOOGOO CORP.,10-K,1231,2016.0,FY,2016-12-31,,2,20161231,USD,0,10400.0,10400.0,0.0,506175.0,506175.0,0.0,-495775.0,0.0,0.0,-495775.0,10400.0,,-6390780.0,5704690.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6951,0001493152-22-015999,1442853,"INDO GLOBAL EXCHANGE(S) PTE, LTD.",10-Q,731,2017.0,Q2,2017-01-31,,2,20170131,USD,0,0.0,0.0,0.0,486506.0,486506.0,0.0,-486506.0,0.0,0.0,-486506.0,0.0,0.0,-7591643.0,6024427.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12987,0001663577-22-000215,1554906,CROWN BAUS CAPITAL CORP.,10-K,430,2017.0,FY,2017-04-30,,2,20170430,USD,0,6300000.0,6300000.0,0.0,500912.0,500912.0,0.0,5799088.0,0.0,0.0,5799088.0,6300000.0,,-46575912.0,44885000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6931,0001493152-22-016043,1442853,"INDO GLOBAL EXCHANGE(S) PTE, LTD.",10-Q,731,2017.0,Q3,2017-04-30,,2,20170430,USD,0,0.0,0.0,0.0,486515.0,486515.0,0.0,-486515.0,0.0,0.0,-486515.0,0.0,0.0,-7591643.0,6024427.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


As you can see, the `present` method (like the `process` method) returns the standardized dataframe. The "index" columns are **adsh**, **coreg**, **report**, **ddate**, and **uom**. Since we applied the USDOnlyRawFilter and the MainCoregRawFilter, we could drop the coreg and the uom columns.
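
Dropping these two columns is a one-liner in pandas. The following sketch uses a made-up dataframe with the same index columns and first checks that the columns really carry no information; the check itself is a suggestion, not something the library requires.

```python
import pandas as pd

# Toy dataframe with the same index columns as the standardized result (values are made up).
toy_df = pd.DataFrame({
    "adsh": ["r1", "r2"],
    "coreg": [None, None],
    "report": [2, 2],
    "ddate": [20211231, 20221231],
    "uom": ["USD", "USD"],
    "Assets": [100.0, 120.0],
})

# only drop the columns if coreg is always empty and uom has a single value
if toy_df["coreg"].isna().all() and toy_df["uom"].nunique() == 1:
    toy_df = toy_df.drop(columns=["coreg", "uom"])
print(toy_df.columns.tolist())
```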

After the value columns, you find the results of the applied validation rules in the `..Check_error` and `..Check_cat` columns. I will explain them later.

As I mentioned before, different logs are created as well, which can be accessed directly in the instance of the Standardizer.

The standardizer also has a bag object which contains the result of the standardization and all the logs. You can get this bag by calling the method `get_standardize_bag`.

In [5]:
from secfsdstools.f_standardize.standardizing import StandardizedBag
result_bag: StandardizedBag = standardizer.get_standardize_bag()

The `StandardizedBag` class also has a save and load method, so you can store the results and analyse them later. (**Note**: you have to create the directory before saving)

In [7]:
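import os
# the target directory has to exist before saving
os.makedirs('./bs_standardizer_results', exist_ok=True)
# save the results
result_bag.save('./bs_standardizer_results')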

In [8]:
from secfsdstools.f_standardize.standardizing import StandardizedBag
# load the results
result_bag: StandardizedBag = StandardizedBag.load('./bs_standardizer_results')





Inside the bag, we find the following objects (as mentioned, these datasets are also directly available in the Standardizer instance):

- **result_df** <br> The dataframe that contains the result of the standardization process and also the validation results for every entry.
- **process_description_df** <br> This dataframe gives a textual description of all rules and also shows their unique id.
- **applied_prepivot_rules_log_df** <br> This dataframe contains the logs of the applied pre-pivot rules, i.e., the rules that were applied before the dataset was pivoted. As mentioned above, this includes the lines that were duplicated.
- **applied_rules_log_df** <br> For every column in the result_df, this dataframe records which rules were applied to which entry.
- **applied_rules_sum_s** <br> This series shows how often a rule was applied in total.
- **stats_df** <br> This shows the total and the percentage of null values for columns for every processing step. You also see the improvement between the steps.
- **validation_overview_df** <br> This dataframe gives a summary of the validation results and therefore indicates how successful the standardization process was.

## Analyzing the logs

### process_description_df

This dataframe shows all defined rules with their id and a textual description. The order in which they are shown is also the order in which they are applied.

In [9]:
result_bag.process_description_df

Unnamed: 0,part,type,ruleclass,identifier,description
0,PREPIVOT,Group,,PREPIVOT_BS_PREPIV,
1,PREPIVOT,Rule,PrePivotDeduplicate,PREPIVOT_BS_PREPIV_#1_DeDup,"Deduplicates the dataframe based on the columns ['adsh', 'coreg', 'report', 'ddate', 'uom', 'qtrs', 'tag', 'version', 'value']"
2,PRE,Group,,PRE_BS_PRE,
3,PRE,Rule,PreSumUpCorrection,PRE_BS_PRE_#1_Assets/AssetsNoncurrent,"Swaps the values between the tag 'Assets' and 'AssetsNoncurrent' if the following equation is True ""'AssetsNoncurrent' = 'Assets' + 'AssetsCurrent"" and 'AssetsCurrent' > 0"
4,PRE,Rule,PreSumUpCorrection,PRE_BS_PRE_#2_Assets/AssetsCurrent,"Swaps the values between the tag 'Assets' and 'AssetsCurrent' if the following equation is True ""'AssetsCurrent' = 'Assets' + 'AssetsNoncurrent"" and 'AssetsNoncurrent' > 0"
5,MAIN,Group,,MAIN_BS,
6,MAIN,Group,,MAIN_BS_#1_BR,
7,MAIN,Rule,CopyTagRule,MAIN_BS_#1_BR_#1_Assets,Copies the values from AssetsNet to Assets if AssetsNet is not null and Assets is nan
8,MAIN,Rule,CopyTagRule,MAIN_BS_#1_BR_#2_Cash,Copies the values from CashAndCashEquivalentsAtCarryingValue to Cash if CashAndCashEquivalentsAtCarryingValue is not null and Cash is nan
9,MAIN,Rule,CopyTagRule,MAIN_BS_#1_BR_#3_LiabilitiesAndEquity,Copies the values from LiabilitiesAndStockholdersEquity to LiabilitiesAndEquity if LiabilitiesAndStockholdersEquity is not null and LiabilitiesAndEquity is nan


**part** is either PREPIVOT, PRE, MAIN, POST, or VALID. **ruleclass** tells you the class name of the applied rule. The **identifier** is a unique id that is used as the column name in the `applied_rules_log_df`, or in the `applied_prepivot_rules_log_df` for prepivot rules. Finally, **description** gives you a textual description of what the rule does, which tags it uses and changes, and the condition under which it is applied.

### applied_prepivot_rules_log_df

In [10]:
result_bag.applied_prepivot_rules_log_df[:10]

Unnamed: 0,adsh,coreg,report,ddate,uom,qtrs,tag,version,id
2421,0000109563-22-000025,,4,20211231,USD,0,AssetsCurrent,us-gaap/2020,PREPIVOT_BS_PREPIV_#1_DeDup
13329,0001500435-22-000016,,3,20211231,USD,0,AccruedIncomeTaxesNoncurrent,us-gaap/2021,PREPIVOT_BS_PREPIV_#1_DeDup
13419,0001567892-22-000007,,4,20211231,USD,0,AccruedIncomeTaxesNoncurrent,us-gaap/2021,PREPIVOT_BS_PREPIV_#1_DeDup
20782,0001567892-22-000007,,4,20211231,USD,0,AdditionalPaidInCapital,us-gaap/2021,PREPIVOT_BS_PREPIV_#1_DeDup
28520,0000215466-22-000019,,4,20211231,USD,0,Assets,us-gaap/2021,PREPIVOT_BS_PREPIV_#1_DeDup
29513,0001437749-22-000287,,2,20211031,USD,0,Assets,us-gaap/2021,PREPIVOT_BS_PREPIV_#1_DeDup
29966,0000950170-22-002959,,2,20211231,USD,0,Assets,us-gaap/2021,PREPIVOT_BS_PREPIV_#1_DeDup
31686,0001564590-22-011685,,2,20220131,USD,0,Assets,us-gaap/2021,PREPIVOT_BS_PREPIV_#1_DeDup
31757,0001567892-22-000007,,4,20211231,USD,0,Assets,us-gaap/2021,PREPIVOT_BS_PREPIV_#1_DeDup
31786,0001628280-22-003530,,7,20211231,USD,0,Assets,us-gaap/2021,PREPIVOT_BS_PREPIV_#1_DeDup


For the balance sheet standardizer, deduplication is the only prepivot rule that is applied. The log therefore shows the entries in the original `pre_num_df` that were duplicated and had to be removed.

### applied_rules_log_df

In [11]:
result_bag.applied_rules_log_df[:10]

tag,adsh,coreg,report,ddate,uom,qtrs,PRE_BS_PRE_#1_Assets/AssetsNoncurrent,PRE_BS_PRE_#2_Assets/AssetsCurrent,MAIN_1_BS_#1_BR_#1_Assets,MAIN_1_BS_#1_BR_#2_Cash,MAIN_1_BS_#1_BR_#3_LiabilitiesAndEquity,MAIN_1_BS_#1_BR_#4_RetainedEarnings,MAIN_1_BS_#2_EQ_#1_HolderEquity,MAIN_1_BS_#2_EQ_#2_HolderEquity,MAIN_1_BS_#2_EQ_#3_HolderEquity,MAIN_1_BS_#2_EQ_#4_TemporaryEquity,MAIN_1_BS_#2_EQ_#5_RedeemableEquity,MAIN_1_BS_#2_EQ_#6_Equity,MAIN_1_BS_#3_SC_#1_Assets,MAIN_1_BS_#3_SC_#2_AssetsCurrent,MAIN_1_BS_#3_SC_#3_AssetsNoncurrent,MAIN_1_BS_#3_SC_#4_Liabilities,MAIN_1_BS_#3_SC_#5_LiabilitiesCurrent,MAIN_1_BS_#3_SC_#6_LiabilitiesNoncurrent,MAIN_1_BS_#3_SC_#7_Assets,MAIN_1_BS_#3_SC_#8_Liabilities,MAIN_1_BS_#3_SC_#9_Equity,MAIN_1_BS_#3_SC_#10_LiabilitiesAndEquity,MAIN_1_BS_#3_SC_#11_Liabilities,MAIN_1_BS_#3_SC_#12_Equity,MAIN_1_BS_#4_SU_#1_Cash,MAIN_1_BS_#4_SU_#2_RetainedEarnings,MAIN_1_BS_#4_SU_#3_LongTermDebt,MAIN_1_BS_#4_SU_#4_LiabilitiesNoncurrent,MAIN_1_BS_#5_SetSum_#1_Assets/AssetsNoncurrent,MAIN_1_BS_#5_SetSum_#2_Assets/AssetsCurrent,MAIN_1_BS_#5_SetSum_#3_Liabilities/LiabilitiesNoncurrent,MAIN_1_BS_#5_SetSum_#4_Liabilities/LiabilitiesCurrent,MAIN_2_BS_#1_BR_#1_Assets,MAIN_2_BS_#1_BR_#2_Cash,MAIN_2_BS_#1_BR_#3_LiabilitiesAndEquity,MAIN_2_BS_#1_BR_#4_RetainedEarnings,MAIN_2_BS_#2_EQ_#1_HolderEquity,MAIN_2_BS_#2_EQ_#2_HolderEquity,MAIN_2_BS_#2_EQ_#3_HolderEquity,MAIN_2_BS_#2_EQ_#4_TemporaryEquity,MAIN_2_BS_#2_EQ_#5_RedeemableEquity,MAIN_2_BS_#2_EQ_#6_Equity,MAIN_2_BS_#3_SC_#1_Assets,MAIN_2_BS_#3_SC_#2_AssetsCurrent,MAIN_2_BS_#3_SC_#3_AssetsNoncurrent,MAIN_2_BS_#3_SC_#4_Liabilities,MAIN_2_BS_#3_SC_#5_LiabilitiesCurrent,MAIN_2_BS_#3_SC_#6_LiabilitiesNoncurrent,MAIN_2_BS_#3_SC_#7_Assets,MAIN_2_BS_#3_SC_#8_Liabilities,MAIN_2_BS_#3_SC_#9_Equity,MAIN_2_BS_#3_SC_#10_LiabilitiesAndEquity,MAIN_2_BS_#3_SC_#11_Liabilities,MAIN_2_BS_#3_SC_#12_Equity,MAIN_2_BS_#4_SU_#1_Cash,MAIN_2_BS_#4_SU_#2_RetainedEarnings,MAIN_2_BS_#4_SU_#3_LongTermDebt,MAIN_2_BS_#4_SU_#4_LiabilitiesNoncurrent,MAIN_2_BS_#5_SetSum_#1_Assets/AssetsNoncurrent,MAIN_2_BS_#5_SetSum_#2_Assets/AssetsCurrent,MAIN_2_BS_#5_SetSum_#3_Liabilities/LiabilitiesNoncurrent,MAIN_2_BS_#5_SetSum_#4_Liabilities/LiabilitiesCurrent,MAIN_3_BS_#1_BR_#1_Assets,MAIN_3_BS_#1_BR_#2_Cash,MAIN_3_BS_#1_BR_#3_LiabilitiesAndEquity,MAIN_3_BS_#1_BR_#4_RetainedEarnings,MAIN_3_BS_#2_EQ_#1_HolderEquity,MAIN_3_BS_#2_EQ_#2_HolderEquity,MAIN_3_BS_#2_EQ_#3_HolderEquity,MAIN_3_BS_#2_EQ_#4_TemporaryEquity,MAIN_3_BS_#2_EQ_#5_RedeemableEquity,MAIN_3_BS_#2_EQ_#6_Equity,MAIN_3_BS_#3_SC_#1_Assets,MAIN_3_BS_#3_SC_#2_AssetsCurrent,MAIN_3_BS_#3_SC_#3_AssetsNoncurrent,MAIN_3_BS_#3_SC_#4_Liabilities,MAIN_3_BS_#3_SC_#5_LiabilitiesCurrent,MAIN_3_BS_#3_SC_#6_LiabilitiesNoncurrent,MAIN_3_BS_#3_SC_#7_Assets,MAIN_3_BS_#3_SC_#8_Liabilities,MAIN_3_BS_#3_SC_#9_Equity,MAIN_3_BS_#3_SC_#10_LiabilitiesAndEquity,MAIN_3_BS_#3_SC_#11_Liabilities,MAIN_3_BS_#3_SC_#12_Equity,MAIN_3_BS_#4_SU_#1_Cash,MAIN_3_BS_#4_SU_#2_RetainedEarnings,MAIN_3_BS_#4_SU_#3_LongTermDebt,MAIN_3_BS_#4_SU_#4_LiabilitiesNoncurrent,MAIN_3_BS_#5_SetSum_#1_Assets/AssetsNoncurrent,MAIN_3_BS_#5_SetSum_#2_Assets/AssetsCurrent,MAIN_3_BS_#5_SetSum_#3_Liabilities/LiabilitiesNoncurrent,MAIN_3_BS_#5_SetSum_#4_Liabilities/LiabilitiesCurrent,POST_BS_POST_#1_AssetsCurrent/AssetsNoncurrent,POST_BS_POST_#2_LiabilitiesCurrent/LiabilitiesNoncurrent,POST_BS_POST_#3_Assets/AssetsCurrent/AssetsNoncurrent,POST_BS_POST_#4_Liabilities/LiabilitiesCurrent/LiabilitiesNoncurrent,POST_BS_POST_#5_TemporaryEquity,POST_BS_POST_#6_RedeemableEquity,POST_BS_POST_#7_AdditionalPaidInCapital,POST_BS_POST_#8_TreasuryStockValue
0,0000002178-22-000033,,3,20211231,USD,0,False,False,False,True,True,True,False,False,True,False,False,True,False,False,True,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True,False,True
1,0000002178-22-000046,,2,20220331,USD,0,False,False,False,True,True,True,False,False,True,False,False,True,False,False,True,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True,False,True
2,0000002178-22-000066,,2,20220630,USD,0,False,False,False,True,True,True,False,False,True,False,False,True,False,False,True,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True,False,True
3,0000002178-22-000089,,2,20220930,USD,0,False,False,False,True,True,True,False,False,True,False,False,True,False,False,True,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True,False,True
4,0000002488-22-000016,,5,20211231,USD,0,False,False,False,True,True,True,False,False,True,False,False,True,False,False,True,False,False,False,False,True,False,False,False,False,False,False,True,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True,True,False
5,0000002488-22-000078,,4,20220331,USD,0,False,False,False,True,True,True,False,False,True,False,False,True,False,False,True,False,False,False,False,True,False,False,False,False,False,False,True,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True,True,False
6,0000002488-22-000123,,4,20220630,USD,0,False,False,False,True,True,True,False,False,True,False,False,True,False,False,True,False,False,False,False,True,False,False,False,False,False,False,True,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True,True,False
7,0000002488-22-000170,,4,20220930,USD,0,False,False,False,True,True,True,False,False,True,False,False,True,False,False,True,False,False,False,False,True,False,False,False,False,False,False,True,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True,True,False
8,0000002969-22-000010,,5,20211231,USD,0,False,False,False,True,True,True,True,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True,True,False
9,0000002969-22-000026,,5,20220331,USD,0,False,False,False,True,True,True,True,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True,True,False


Obviously, this table has the same index columns as the `result_df`. Furthermore, there is a column for every rule that could be applied. Since the main rules are applied multiple times (3 iterations by default), you find multiple rule ids for the MAIN rules. **MAIN_1** indicates which MAIN rules were applied in the first iteration, **MAIN_2** in the second iteration, and **MAIN_3** in the third.

If you analyze the statements of a certain company, you should check how many rules were applied (at least the MAIN rules), since this indicates how much "assumption" might be in the standardized data.

As an example, let us have a look at the 10-K of Apple in 2022, whose adsh is "0000320193-22-000108".

First, let's have a look at the standardized entry for this report:

In [14]:
apple_10k_2022 = "0000320193-22-000108"
result_bag.result_df[result_bag.result_df.adsh==apple_10k_2022]

Unnamed: 0,adsh,cik,name,form,fye,fy,fp,date,coreg,report,ddate,uom,qtrs,Assets,AssetsCurrent,AssetsNoncurrent,Liabilities,LiabilitiesCurrent,LiabilitiesNoncurrent,HolderEquity,TemporaryEquity,RedeemableEquity,Equity,LiabilitiesAndEquity,Cash,RetainedEarnings,AdditionalPaidInCapital,TreasuryStockValue,AssetsCheck_error,AssetsCheck_cat,LiabilitiesCheck_error,LiabilitiesCheck_cat,EquityCheck_error,EquityCheck_cat,AssetsLiaEquCheck_error,AssetsLiaEquCheck_cat
25474,0000320193-22-000108,320193,APPLE INC,10-K,930,2022.0,FY,2022-09-30,,5,20220930,USD,0,352755000000.0,135405000000.0,217350000000.0,302083000000.0,153982000000.0,148101000000.0,50672000000.0,0.0,0.0,50672000000.0,352755000000.0,23646000000.0,-3068000000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Next, we filter the applied rules on this entry.

In [15]:
apple_10k_2022 = "0000320193-22-000108"
apple_10k_2022_applied_rules_log_df = result_bag.applied_rules_log_df[result_bag.applied_rules_log_df.adsh==apple_10k_2022]

# just filter for the applied MAIN rules
main_rule_cols = apple_10k_2022_applied_rules_log_df.columns[apple_10k_2022_applied_rules_log_df.columns.str.contains('MAIN')]
main_rule_df = apple_10k_2022_applied_rules_log_df[main_rule_cols]

# get the applied rules, by using the True and False values of main_rule_df.iloc[0] as a mask on the columns index
main_rule_df.columns[main_rule_df.iloc[0]].tolist()

['MAIN_1_BS_#1_BR_#2_Cash',
 'MAIN_1_BS_#1_BR_#3_LiabilitiesAndEquity',
 'MAIN_1_BS_#1_BR_#4_RetainedEarnings',
 'MAIN_1_BS_#2_EQ_#3_HolderEquity',
 'MAIN_1_BS_#2_EQ_#6_Equity',
 'MAIN_1_BS_#4_SU_#3_LongTermDebt']

As you can see, all six applied rules were applied in the first iteration. Let's check them one by one (please refer to the `process_description_df` for details):
- **MAIN_1_BS_#1_BR_#2_Cash** <br> This is just a "renaming" (rule class is CopyTagRule) of CashAndCashEquivalentsAtCarryingValue, so applying this rule involves no "assumption".
- **MAIN_1_BS_#1_BR_#3_LiabilitiesAndEquity** <br> This is also just a "renaming" of LiabilitiesAndStockholdersEquity, so it is not problematic.
- **MAIN_1_BS_#1_BR_#4_RetainedEarnings** <br> Again, just a "renaming" of RetainedEarningsAccumulatedDeficit.
- **MAIN_1_BS_#2_EQ_#3_HolderEquity** <br> Here too, just a "renaming" of StockholdersEquity.
- **MAIN_1_BS_#2_EQ_#6_Equity** <br> This rule sums up HolderEquity, TemporaryEquity, and RedeemableEquity to Equity. In the case of Apple's 10-K of 2022, there is just HolderEquity, so this is not problematic.
- **MAIN_1_BS_#4_SU_#3_LongTermDebt** <br> This rule sums up the available values in the columns ['LongTermDebtNoncurrent', 'LongTermDebtAndCapitalLeaseObligations'] into the column 'LongTermDebt'. LongTermDebt is only used to calculate LiabilitiesNoncurrent in the rule MAIN_BS_#4_SU_#4_LiabilitiesNoncurrent and is not present in the final `result_df`. Since the rule MAIN_BS_#4_SU_#4_LiabilitiesNoncurrent hasn't been applied, MAIN_1_BS_#4_SU_#3_LongTermDebt doesn't have any effect.

As we can see, the applied rules are mainly renaming rules. Since all validation columns also show 0.0, which indicates an exact match with the expected values, we can "trust" the values for this statement from Apple.
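
The same kind of check can be done for all reports at once by counting the `True` values in the MAIN columns. The following sketch uses a small, made-up log with shortened rule ids in the layout of `applied_rules_log_df`; the variable names are only for illustration.

```python
import pandas as pd

# Toy log in the layout of applied_rules_log_df (values are made up).
log_df = pd.DataFrame({
    "adsh": ["r1", "r2", "r3"],
    "PRE_BS_PRE_#1": [False, False, True],
    "MAIN_1_BS_#1_BR_#2_Cash": [True, True, False],
    "MAIN_1_BS_#3_SC_#3_AssetsNoncurrent": [False, True, True],
    "MAIN_2_BS_#3_SC_#7_Assets": [False, False, True],
    "POST_BS_POST_#5_TemporaryEquity": [True, True, True],
})

main_cols = [c for c in log_df.columns if c.startswith("MAIN_")]

# number of MAIN rules that were applied per report
applied_per_report = log_df.set_index("adsh")[main_cols].sum(axis=1)

# how often each MAIN rule was applied over all reports
applied_per_rule = log_df[main_cols].sum()

print(applied_per_report)
print(applied_per_rule)
```

Reports with a high count are candidates for a closer look, since more of their values were derived by rules rather than taken directly from the filing.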

### applied_rules_sum_s

This log gives an overview about how often a rule was applied, giving an indication about how important a rule is. For instance, if you write your own rule, it might be useful to check how often it gets applied.

In [16]:
result_bag.applied_rules_sum_s[:20]

0
tag                                             0
PRE_BS_PRE_#1_Assets/AssetsNoncurrent           0
PRE_BS_PRE_#2_Assets/AssetsCurrent              0
MAIN_1_BS_#1_BR_#1_Assets                       2
MAIN_1_BS_#1_BR_#2_Cash                     20166
MAIN_1_BS_#1_BR_#3_LiabilitiesAndEquity     25873
MAIN_1_BS_#1_BR_#4_RetainedEarnings         24809
MAIN_1_BS_#2_EQ_#1_HolderEquity              6647
MAIN_1_BS_#2_EQ_#2_HolderEquity               353
MAIN_1_BS_#2_EQ_#3_HolderEquity             18741
MAIN_1_BS_#2_EQ_#4_TemporaryEquity           2643
MAIN_1_BS_#2_EQ_#5_RedeemableEquity           984
MAIN_1_BS_#2_EQ_#6_Equity                   25799
MAIN_1_BS_#3_SC_#1_Assets                       2
MAIN_1_BS_#3_SC_#2_AssetsCurrent                3
MAIN_1_BS_#3_SC_#3_AssetsNoncurrent         19620
MAIN_1_BS_#3_SC_#4_Liabilities                994
MAIN_1_BS_#3_SC_#5_LiabilitiesCurrent           3
MAIN_1_BS_#3_SC_#6_LiabilitiesNoncurrent    15917
MAIN_1_BS_#3_SC_#7_Assets                     14
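If you want to rank the rules by how often they were applied, or spot rules that were never applied, plain pandas operations are enough. The following sketch uses a toy `pd.Series` filled with a few counts copied from the output above; whether `applied_rules_sum_s` can be used in exactly this way depends on its actual type, which is an assumption here.

```python
import pandas as pd

# toy stand-in for the log (assumption: rule names as index, counts as values)
applied = pd.Series({
    "PRE_BS_PRE_#1_Assets/AssetsNoncurrent": 0,
    "MAIN_1_BS_#1_BR_#1_Assets": 2,
    "MAIN_1_BS_#1_BR_#2_Cash": 20166,
    "MAIN_1_BS_#1_BR_#3_LiabilitiesAndEquity": 25873,
    "MAIN_1_BS_#3_SC_#2_AssetsCurrent": 3,
})

# most frequently applied rules first
most_used = applied.sort_values(ascending=False)

# rules that were never applied
never_used = applied[applied == 0].index.tolist()

print(most_used.head(3))
print(never_used)
```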

### validation_overview_df

The validation overview counts the validation categories of the validation columns in the `result_df`. The validation categories are:
- **Category 0** <br> The validation was an exact match
- **Category 1** <br> The error was less than 1% of the expected value.
- **Category 5** <br> The error was less than 5% of the expected value.
- **Category 10** <br> The error was less than 10% of the expected value.
- **Category 100** <br> The error was above 10% of the expected value.
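The mapping from a relative error to one of these categories can be sketched as follows. This is only an illustration of the thresholds listed above, not the standardizer's code. The function name `to_category` is hypothetical, and treating an error of exactly 10% as category 100 is an assumption.

```python
import numpy as np
import pandas as pd

def to_category(rel_error: pd.Series) -> pd.Series:
    """Hypothetical helper: map relative errors (0.01 == 1%) to categories."""
    err = rel_error.abs()
    # conditions are checked in order; the first one that matches wins
    conditions = [err == 0, err < 0.01, err < 0.05, err < 0.10]
    categories = [0, 1, 5, 10]
    # everything else (including NaN in this sketch) ends up in category 100
    cats = np.select(conditions, categories, default=100)
    return pd.Series(cats, index=rel_error.index)

example = to_category(pd.Series([0.0, 0.004, -0.03, 0.07, 0.5]))
print(example.tolist())
```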

The results are shown as total count and in percent of all available rows in the dataset.

If you want to use the data for ML, you might want to consider only rows in which every validation category is smaller than 10, or even smaller than 5.

In [17]:
result_bag.validation_overview_df

category,AssetsCheck_cat,LiabilitiesCheck_cat,EquityCheck_cat,AssetsLiaEquCheck_cat,AssetsCheck_cat_pct,LiabilitiesCheck_cat_pct,EquityCheck_cat_pct,AssetsLiaEquCheck_cat_pct
0,26179.0,25366,24265,24262,99.86,96.76,92.56,92.55
1,13.0,103,388,392,0.05,0.39,1.48,1.5
5,,161,209,210,,0.61,0.8,0.8
10,3.0,161,132,133,0.01,0.61,0.5,0.51
100,20.0,424,1190,1194,0.08,1.62,4.54,4.55
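Filtering rows by validation category, as suggested for ML use, could look like the sketch below. It assumes that the `result_df` contains category columns named like the ones in the overview (`AssetsCheck_cat`, `LiabilitiesCheck_cat`, ...). The toy dataframe stands in for the real `result_df`, so verify the column names before reusing it.

```python
import pandas as pd

# toy stand-in for result_df (assumed column names, made-up values)
toy_result = pd.DataFrame({
    "Assets": [100.0, 200.0, 300.0],
    "AssetsCheck_cat": [0, 1, 100],
    "LiabilitiesCheck_cat": [0, 5, 0],
    "EquityCheck_cat": [0, 5, 0],
    "AssetsLiaEquCheck_cat": [0, 1, 0],
})

# all validation category columns
cat_cols = [c for c in toy_result.columns if c.endswith("_cat")]

# keep only rows in which every category is smaller than the threshold
clean_10 = toy_result[(toy_result[cat_cols] < 10).all(axis=1)]
clean_5 = toy_result[(toy_result[cat_cols] < 5).all(axis=1)]

print(len(clean_10), len(clean_5))
```

The stricter threshold drops the second row as well, since two of its checks fall into category 5.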


### stats_df

The goal of the standardizer is to have a dataset in which all rows have meaningful values for all columns/tags.
This dataframe indicates how much every step/iteration contributes to this goal by counting the NaN values of the tags that are shown in the final dataframe.

In [18]:
result_bag.stats_df

tag,pre,pre_rel,MAIN_1,MAIN_1_rel,MAIN_1_gain,MAIN_2,MAIN_2_rel,MAIN_2_gain,MAIN_3,MAIN_3_rel,MAIN_3_gain,POST,POST_rel,POST_gain
Assets,271,0.010338,73,0.002785,0.007553,26,0.000992,0.001793,26,0.000992,0.0,0,0.0,0.000992
AssetsCurrent,4894,0.186687,4891,0.186573,0.000114,4891,0.186573,0.0,4891,0.186573,0.0,0,0.0,0.186573
AssetsNoncurrent,24590,0.938013,4918,0.187603,0.75041,4891,0.186573,0.00103,4891,0.186573,0.0,0,0.0,0.186573
Liabilities,3377,0.128819,66,0.002518,0.126302,29,0.001106,0.001411,29,0.001106,0.0,0,0.0,0.001106
LiabilitiesCurrent,4912,0.187374,4903,0.18703,0.000343,3535,0.134846,0.052184,3535,0.134846,0.0,0,0.0,0.134846
LiabilitiesNoncurrent,23119,0.8819,4170,0.159069,0.72283,3535,0.134846,0.024223,3535,0.134846,0.0,0,0.0,0.134846
HolderEquity,26215,1.0,474,0.018081,0.981919,474,0.018081,0.0,474,0.018081,0.0,474,0.018081,0.0
TemporaryEquity,26215,1.0,23572,0.89918,0.10082,23572,0.89918,0.0,23572,0.89918,0.0,0,0.0,0.89918
RedeemableEquity,26215,1.0,25231,0.962464,0.037536,25231,0.962464,0.0,25231,0.962464,0.0,0,0.0,0.962464
Equity,26215,1.0,63,0.002403,0.997597,24,0.000916,0.001488,24,0.000916,0.0,24,0.000916,0.0


For instance, let's have a look at the Assets tag. After the preprocessing step, we count 271 NaN values in over 26,000 rows, which is about 1 percent.

After applying the main rule set in the first iteration (MAIN_1), only 73 entries have a NaN value. This is about 0.3 percent of the more than 26,000 rows, so the gain, i.e. the improvement, was about 0.76 percentage points. After the second iteration (MAIN_2), only 26 entries in the Assets column had a NaN value, which led to another gain of 0.18 percentage points. After applying the main rules a third time, no additional gain was achieved.

In fact, if you look at the MAIN_3_gain column, you see that no improvement was possible for any tag after applying the main rules a third time. So for this dataset, applying the main rules only twice would have been enough.

The log gives an overview of how "complete" the dataset is. It is helpful when developing the ruleset since it shows where most of the missing values are.
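The numbers in such a table can be reproduced with a few lines of pandas. The sketch below is not the standardizer's implementation. It assumes that you have a snapshot of the data after each step; the function `nan_stats` and the toy snapshots are hypothetical. The gain is computed as the relative NaN share of the previous step minus that of the current step, which matches the table (for Assets: 0.010338 - 0.002785 = 0.007553).

```python
import numpy as np
import pandas as pd

def nan_stats(snapshots: dict[str, pd.DataFrame], tags: list[str]) -> pd.DataFrame:
    """Hypothetical helper: count NaN values per tag after every step."""
    stats = pd.DataFrame(index=tags)
    prev_rel = None
    for step, df in snapshots.items():
        nan_count = df[tags].isna().sum()
        rel = nan_count / len(df)
        stats[step] = nan_count
        stats[f"{step}_rel"] = rel
        if prev_rel is not None:
            # improvement compared to the previous step
            stats[f"{step}_gain"] = prev_rel - rel
        prev_rel = rel
    return stats

# toy snapshots: state of the data after the "pre" step and after "MAIN_1"
pre = pd.DataFrame({"Assets": [1.0, np.nan, np.nan, 4.0],
                    "Equity": [np.nan, np.nan, np.nan, np.nan]})
main_1 = pd.DataFrame({"Assets": [1.0, 2.0, np.nan, 4.0],
                       "Equity": [1.0, 2.0, np.nan, 4.0]})

stats = nan_stats({"pre": pre, "MAIN_1": main_1}, ["Assets", "Equity"])
print(stats)
```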