In [1]:
import pandas as pd
# ensure that all columns are shown and that column content is not cut
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.width',1000)
pd.set_option('display.max_rows', 500) # ensure that all rows are shown

# `BalanceSheetStandardizer`

In `07_00_standardizer_basics.ipynb`, we looked at the basic principles of the standardizer. Now we are going to explore the details of the `BalanceSheetStandardizer`.

## Main Goal
The main goal of the `BalanceSheetStandardizer` is to provide a consolidated, standardized view that contains the main positions of a balance sheet.

The current implementation tries to find/calculate the values for the following positions:

- Assets
    - AssetsCurrent
        - Cash
    - AssetsNoncurrent
- Liabilities
    - LiabilitiesCurrent
    - LiabilitiesNoncurrent
- Equity
    - HolderEquity (mainly StockholderEquity or PartnerCapital)
        - RetainedEarnings
        - AdditionalPaidInCapital
        - TreasuryStockValue
    - TemporaryEquity
    - RedeemableEquity
- LiabilitiesAndEquity


## Prepare the dataset

As input, we are going to use the dataset that was created in `06_bulk_data_processing_deep_dive.ipynb`. That dataset contains all available data for balance sheets. On my machine, the path to this dataset is either `set/parallel/BS/joined` or `set/serial/BS/joined`, depending on whether it was produced with the faster parallel or the slower serial processing approach.

The data is already filtered for 10-K and 10-Q reports. Moreover, the following filters were applied as well: `ReportPeriodRawFilter`, `MainCoregRawFilter`, `OfficialTagsOnlyRawFilter`, `USDOnlyRawFilter`. The dataset is already joined, so we can use it directly with the `BalanceSheetStandardizer`.

Of course, if you prefer another dataset, for instance all data of a few companies, feel free to do so.

In [2]:
from secfsdstools.d_container.databagmodel import JoinedDataBag
from secfsdstools.f_standardize.bs_standardize import BalanceSheetStandardizer

all_bs_joinedbag:JoinedDataBag = JoinedDataBag.load(target_path="set/parallel/BS/joined")
bs_standardizer = BalanceSheetStandardizer()

# standardize the data
all_bs_joinedbag.present(bs_standardizer)



tag,adsh,coreg,report,ddate,uom,Assets,AssetsCurrent,AssetsNoncurrent,Liabilities,LiabilitiesCurrent,LiabilitiesNoncurrent,HolderEquity,TemporaryEquity,RedeemableEquity,Equity,LiabilitiesAndEquity,Cash,RetainedEarnings,AdditionalPaidInCapital,TreasuryStockValue,AssetsCheck_error,AssetsCheck_cat,LiabilitiesCheck_error,LiabilitiesCheck_cat,EquityCheck_error,EquityCheck_cat,AssetsLiaEquCheck_error,AssetsLiaEquCheck_cat
0,0000002178-11-000032,,3,20110630,USD,3.593880e+08,2.865200e+08,7.286800e+07,2.600610e+08,2.512960e+08,8.765000e+06,9.932700e+07,0.0,0.0,9.932700e+07,3.593880e+08,27939000.0,87212000.0,11693000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0000002178-11-000049,,3,20110930,USD,3.535870e+08,2.756010e+08,7.798600e+07,2.452340e+08,2.311140e+08,1.412000e+07,1.083530e+08,0.0,0.0,1.083530e+08,3.535870e+08,49560000.0,96238000.0,11693000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0000002178-12-000010,,2,20111231,USD,3.788400e+08,3.049650e+08,7.387500e+07,2.681580e+08,2.560940e+08,1.206400e+07,1.106820e+08,0.0,0.0,1.106820e+08,3.788400e+08,37066000.0,98567000.0,11693000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0000002178-12-000026,,3,20120331,USD,4.125340e+08,3.269200e+08,8.561400e+07,2.952770e+08,2.828150e+08,1.246200e+07,1.172570e+08,0.0,0.0,1.172570e+08,4.125340e+08,35989000.0,105142000.0,11693000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0000002178-12-000037,,4,20120630,USD,3.440350e+08,2.497470e+08,9.428800e+07,2.213920e+08,2.083550e+08,1.303700e+07,1.226430e+08,0.0,0.0,1.226430e+08,3.440350e+08,32213000.0,110528000.0,11693000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
319587,0001971213-23-000019,,2,20230630,USD,6.201000e+09,1.506000e+09,4.695000e+09,5.560000e+09,6.240000e+08,4.936000e+09,6.410000e+08,0.0,0.0,6.410000e+08,6.201000e+09,728000000.0,184000000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
319588,0001973047-23-000016,,2,20230731,USD,3.366300e+04,6.460000e+02,3.301700e+04,4.148700e+04,4.148700e+04,0.000000e+00,-7.824000e+03,0.0,0.0,-7.824000e+03,3.366300e+04,646.0,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
319589,0001974138-23-000008,,4,20230630,USD,5.736000e+09,1.676000e+09,4.060000e+09,2.415000e+09,1.795000e+09,6.200000e+08,3.321000e+09,0.0,0.0,3.321000e+09,5.736000e+09,337000000.0,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
319590,0001974793-23-000004,,2,20230630,USD,1.405810e+08,1.405810e+08,0.000000e+00,3.796100e+07,8.460000e+05,3.711500e+07,1.026200e+08,0.0,0.0,1.026200e+08,1.405810e+08,89275000.0,320000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
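The first row of this output can be used to illustrate the relationships that the validation columns (`AssetsCheck`, `LiabilitiesCheck`, `EquityCheck`, `AssetsLiaEquCheck`) appear to test. The following is a minimal sketch in plain Python with the values copied from row 0 above. The exact formulas behind the checks are an assumption derived from the column names and the position hierarchy; they are not taken from the library code.

```python
# Values of row 0 (adsh 0000002178-11-000032), copied from the table above.
row = {
    "Assets": 359_388_000,
    "AssetsCurrent": 286_520_000,
    "AssetsNoncurrent": 72_868_000,
    "Liabilities": 260_061_000,
    "LiabilitiesCurrent": 251_296_000,
    "LiabilitiesNoncurrent": 8_765_000,
    "HolderEquity": 99_327_000,
    "TemporaryEquity": 0,
    "RedeemableEquity": 0,
    "Equity": 99_327_000,
    "LiabilitiesAndEquity": 359_388_000,
}

# Assumed identities, inferred from the hierarchy of positions (not from the library source).
checks = {
    "AssetsCheck": row["Assets"] == row["AssetsCurrent"] + row["AssetsNoncurrent"],
    "LiabilitiesCheck": row["Liabilities"]
    == row["LiabilitiesCurrent"] + row["LiabilitiesNoncurrent"],
    "EquityCheck": row["Equity"]
    == row["HolderEquity"] + row["TemporaryEquity"] + row["RedeemableEquity"],
    "AssetsLiaEquCheck": row["Assets"] == row["Liabilities"] + row["Equity"],
}

for name, ok in checks.items():
    print(f"{name}: {'ok' if ok else 'mismatch'}")
```

All four identities hold for this row, which is consistent with the `*_error` and `*_cat` columns being 0.0 for it.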


First, we save the results, including all the logs, so that we can reuse them later without having to process the data again.<br>
**Note:** you need to create the target directory before storing the data.

In [6]:
import os
target_path = "standardized/BS"
os.makedirs(target_path, exist_ok=True)

bs_standardizer.get_standardize_bag().save(target_path)

## Load the dataset
Once the data has been processed and saved, you can load it directly.

In [2]:
from secfsdstools.f_standardize.standardizing import StandardizedBag

bs_standardizer_result_bag = StandardizedBag.load("standardized/BS")





## Overview

Before we dive into what the `BalanceSheetStandardizer` does in detail, let's get a first impression of the produced data. First, let us see how many rows we have.

In [3]:
len(bs_standardizer_result_bag.result_df)

319592

Next, it is a good idea to look at the `validation_overview_df`. This table gives an idea of the "quality" of the dataset, based on a summary of the results of the applied validation rules.

In [4]:
bs_standardizer_result_bag.validation_overview_df

Unnamed: 0,AssetsCheck_cat,LiabilitiesCheck_cat,EquityCheck_cat,AssetsLiaEquCheck_cat,AssetsCheck_cat_pct,LiabilitiesCheck_cat_pct,EquityCheck_cat_pct,AssetsLiaEquCheck_cat_pct
0,314299,303072,298779,298544,98.34,94.83,93.49,93.41
1,329,2083,5985,6108,0.1,0.65,1.87,1.91
5,619,3377,3410,3447,0.19,1.06,1.07,1.08
10,484,2707,1392,1413,0.15,0.85,0.44,0.44
100,3861,8353,9160,9381,1.21,2.61,2.87,2.94


This looks quite good: for every check, about 95% or more of the rows fall into the first two categories. As a reminder, category 0 means an exact match, and category 1 means that the value is less than 1 percent off the expected value (see notebook `07_00_standardizer_basics.ipynb` for details).
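To make the categories more tangible, here is a small sketch of how a deviation could be mapped to a category. Only categories 0 and 1 are defined in the text above. Reading 5, 10, and 100 as "less than 5%", "less than 10%", and "everything else" is an assumption based on the row labels of the table, and so is the relative-error formula. The function `categorize` is hypothetical and not part of the library.

```python
def categorize(actual: float, expected: float) -> int:
    """Map the relative deviation of `actual` from `expected` to a category."""
    if actual == expected:
        return 0  # exact match
    if expected == 0:
        return 100  # assumption: a relative error cannot be computed
    rel_error = abs(actual - expected) / abs(expected)
    for limit_pct in (1, 5, 10):  # assumed category boundaries in percent
        if rel_error < limit_pct / 100:
            return limit_pct
    return 100


for actual in (100.0, 100.5, 103.0, 108.0, 150.0):
    print(actual, "->", categorize(actual, expected=100.0))
```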

Next, let's see how often each rule was applied. This gives an idea of how much "calculation" had to be done in order to create a standardized dataset. We can do this by looking at the `applied_rules_sum_s` pandas Series object.

In [7]:
bs_standardizer_result_bag.applied_rules_sum_s

0
tag                                                                          0
PRE_BS_PRE_#1_Assets/AssetsNoncurrent                                      123
PRE_BS_PRE_#2_Assets/AssetsCurrent                                           0
MAIN_1_BS_#1_BR_#1_Assets                                                  239
MAIN_1_BS_#1_BR_#2_Cash                                                 253222
MAIN_1_BS_#1_BR_#3_LiabilitiesAndEquity                                 314512
MAIN_1_BS_#1_BR_#4_RetainedEarnings                                     282340
MAIN_1_BS_#2_EQ_#1_HolderEquity                                          85609
MAIN_1_BS_#2_EQ_#2_HolderEquity                                           9449
MAIN_1_BS_#2_EQ_#3_HolderEquity                                         217674
MAIN_1_BS_#2_EQ_#4_TemporaryEquity                                       10724
MAIN_1_BS_#2_EQ_#5_RedeemableEquity                                       8500
MAIN_1_BS_#2_EQ_#6_Equity                         
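The raw counts are easier to interpret as a share of all rows. The sketch below uses plain Python and copies the counts that are visible in the output above, together with the row count of 319592 from the overview. It only covers the rules shown above, and the variable names are chosen for illustration.

```python
TOTAL_ROWS = 319_592  # len(result_df), see the overview above

# Counts copied from the visible part of applied_rules_sum_s.
applied = {
    "PRE_BS_PRE_#1_Assets/AssetsNoncurrent": 123,
    "PRE_BS_PRE_#2_Assets/AssetsCurrent": 0,
    "MAIN_1_BS_#1_BR_#1_Assets": 239,
    "MAIN_1_BS_#1_BR_#2_Cash": 253_222,
    "MAIN_1_BS_#1_BR_#3_LiabilitiesAndEquity": 314_512,
    "MAIN_1_BS_#1_BR_#4_RetainedEarnings": 282_340,
    "MAIN_1_BS_#2_EQ_#1_HolderEquity": 85_609,
    "MAIN_1_BS_#2_EQ_#2_HolderEquity": 9_449,
    "MAIN_1_BS_#2_EQ_#3_HolderEquity": 217_674,
    "MAIN_1_BS_#2_EQ_#4_TemporaryEquity": 10_724,
    "MAIN_1_BS_#2_EQ_#5_RedeemableEquity": 8_500,
}

# Share of all rows in which each rule was applied.
shares = {rule: count / TOTAL_ROWS for rule, count in applied.items()}

for rule, share in shares.items():
    print(f"{rule:<45} {share:7.2%}")
```

With the loaded bag at hand, the same ratio can be obtained by dividing `applied_rules_sum_s` by the number of rows in `result_df`.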

## Applied Rules
To be able to assess the content of `applied_rules_sum_s`, we need to understand the rules that are applied. The simplest way to do this is to print their descriptions:

In [8]:
bs_standardizer_result_bag.process_description_df

Unnamed: 0,part,type,ruleclass,identifier,description
0,PRE,Group,,PRE_BS_PRE,
1,PRE,Rule,PreSumUpCorrection,PRE_BS_PRE_#1_Assets/AssetsNoncurrent,"Swaps the values between the tag 'Assets' and 'AssetsNoncurrent' if the following equation is True ""'AssetsNoncurrent' = 'Assets' + 'AssetsCurrent"" and 'AssetsCurrent' > 0"
2,PRE,Rule,PreSumUpCorrection,PRE_BS_PRE_#2_Assets/AssetsCurrent,"Swaps the values between the tag 'Assets' and 'AssetsCurrent' if the following equation is True ""'AssetsCurrent' = 'Assets' + 'AssetsNoncurrent"" and 'AssetsNoncurrent' > 0"
3,MAIN,Group,,MAIN_BS,
4,MAIN,Group,,MAIN_BS_#1_BR,
5,MAIN,Rule,CopyTagRule,MAIN_BS_#1_BR_#1_Assets,Copies the values from AssetsNet to Assets if AssetsNet is not null and Assets is nan
6,MAIN,Rule,CopyTagRule,MAIN_BS_#1_BR_#2_Cash,Copies the values from CashAndCashEquivalentsAtCarryingValue to Cash if CashAndCashEquivalentsAtCarryingValue is not null and Cash is nan
7,MAIN,Rule,CopyTagRule,MAIN_BS_#1_BR_#3_LiabilitiesAndEquity,Copies the values from LiabilitiesAndStockholdersEquity to LiabilitiesAndEquity if LiabilitiesAndStockholdersEquity is not null and LiabilitiesAndEquity is nan
8,MAIN,Rule,CopyTagRule,MAIN_BS_#1_BR_#4_RetainedEarnings,Copies the values from RetainedEarningsAccumulatedDeficit to RetainedEarnings if RetainedEarningsAccumulatedDeficit is not null and RetainedEarnings is nan
9,MAIN,Group,,MAIN_BS_#2_EQ,
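The descriptions of the `PreSumUpCorrection` and `CopyTagRule` rules can be sketched for a single report as follows. This is a simplified illustration on plain dictionaries, with `None` standing in for a missing value. The library itself operates on whole dataframes, and the helper names `swap_if_summed_wrong` and `copy_tag` are hypothetical.

```python
def swap_if_summed_wrong(row, total, part, other):
    """Swap row[total] and row[part] if part == total + other and other > 0."""
    t, p, o = row.get(total), row.get(part), row.get(other)
    if None not in (t, p, o) and o > 0 and p == t + o:
        row = {**row, total: p, part: t}
    return row


def copy_tag(row, source, target):
    """Copy row[source] to row[target] if source is set and target is missing."""
    if row.get(source) is not None and row.get(target) is None:
        row = {**row, target: row[source]}
    return row


# Made-up report in which Assets and AssetsNoncurrent were swapped: 100 = 70 + 30.
swapped = swap_if_summed_wrong(
    {"Assets": 70, "AssetsCurrent": 30, "AssetsNoncurrent": 100},
    total="Assets", part="AssetsNoncurrent", other="AssetsCurrent",
)
print(swapped)

# Made-up report that uses AssetsNet instead of Assets.
copied = copy_tag({"AssetsNet": 500, "Assets": None}, source="AssetsNet", target="Assets")
print(copied)
```

After the swap, the corrected row satisfies `Assets = AssetsCurrent + AssetsNoncurrent` again.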


Let's discuss a few of the rules in detail:
- **PRE_BS_PRE_#1_Assets/AssetsNoncurrent**<br> is a preprocess correction rule. There are actually about 120 reports in which the tags for Assets and AssetsNoncurrent were swapped.
- **MAIN_BS_#1_BR_#1_Assets**<br> Most of the reports use the Assets tag. However, there are about 240 reports that use the AssetsNet tag. If this is the case, the value is copied to the Assets column.
- **MAIN_BS_#1_BR_#2_Cash, MAIN_BS_#1_BR_#3_LiabilitiesAndEquity, MAIN_BS_#1_BR_#4_RetainedEarnings** <br> These are mainly "renaming" rules that provide a shorter name.
- **MAIN_BS_#2_EQ_#1_HolderEquity, MAIN_BS_#2_EQ_#2_HolderEquity, MAIN_BS_#2_EQ_#3_HolderEquity** <br> These rules ensure that the precedence is respected among the tags that can contain the stockholders' equity or the partners' capital. There are mainly three different tags that have to be considered: StockholdersEquityIncludingPortionAttributableToNoncontrollingInterest, StockholdersEquity, and PartnerCapital. Generally, a report contains either PartnerCapital or some kind of stockholders' equity. Furthermore, StockholdersEquityIncludingPortionAttributableToNoncontrollingInterest and StockholdersEquity can appear together. If they do, StockholdersEquityIncludingPortionAttributableToNoncontrollingInterest has precedence over StockholdersEquity, since StockholdersEquity is a child tag of StockholdersEquityIncludingPortionAttributableToNoncontrollingInterest. As you can see in the `applied_rules_sum_s` data, about two thirds of the entries have only StockholdersEquity present, about one quarter have StockholdersEquityIncludingPortionAttributableToNoncontrollingInterest present, and a few thousand have PartnerCapital set.
- **MAIN_BS_#2_EQ_#4_TemporaryEquity, MAIN_BS_#2_EQ_#5_RedeemableEquity**<br> Sometimes, Equity does not only include HolderEquity, but also TemporaryEquity and/or RedeemableEquity. For both of them, several tags can define values that belong to these categories. So these two rules sum up all possible values for TemporaryEquity and RedeemableEquity.
- **MAIN_BS_#2_EQ_#6_Equity** <br>
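The precedence among the HolderEquity tags described above can be sketched as a simple "first available tag wins" lookup. This is an illustration on a plain dictionary, not the library's implementation. The tag names are spelled as in the text above, the placement of PartnerCapital at the end of the list is an assumption (it generally does not appear together with the other two tags), and the function `holder_equity` is hypothetical.

```python
from typing import Optional

# Tags in order of precedence. The parent tag wins over its child StockholdersEquity.
# Assumption: PartnerCapital is checked last.
HOLDER_EQUITY_PRECEDENCE = (
    "StockholdersEquityIncludingPortionAttributableToNoncontrollingInterest",
    "StockholdersEquity",
    "PartnerCapital",
)


def holder_equity(report: dict) -> Optional[float]:
    """Return the value of the first tag in the precedence list that is set."""
    for tag in HOLDER_EQUITY_PRECEDENCE:
        value = report.get(tag)
        if value is not None:
            return value
    return None


# Made-up report in which parent and child tag appear together.
both = {
    "StockholdersEquityIncludingPortionAttributableToNoncontrollingInterest": 120.0,
    "StockholdersEquity": 100.0,
}
print(holder_equity(both))
print(holder_equity({"StockholdersEquity": 100.0}))
print(holder_equity({"PartnerCapital": 42.0}))
```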

- Overview of the tags that are calculated and how they relate to each other
- Description of the rules that are applied
- Results: how good is the dataset


In [1]:
import pandas as pd
# ensure that all columns are shown and that column content is not cut
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.width',1000)

In [3]:
from secfsdstools.f_standardize.bs_standardize import BalanceSheetStandardizer

bs_standardizer = BalanceSheetStandardizer()

In [4]:
bs_standardizer.present(all_bs_joinedbag)

tag,adsh,coreg,report,ddate,uom,Assets,AssetsCurrent,AssetsNoncurrent,Liabilities,LiabilitiesCurrent,LiabilitiesNoncurrent,HolderEquity,TemporaryEquity,RedeemableEquity,Equity,LiabilitiesAndEquity,Cash,RetainedEarnings,AdditionalPaidInCapital,TreasuryStockValue,AssetsCheck_error,AssetsCheck_cat,LiabilitiesCheck_error,LiabilitiesCheck_cat,EquityCheck_error,EquityCheck_cat,AssetsLiaEquCheck_error,AssetsLiaEquCheck_cat
0,0000002178-11-000032,,3,20110630,USD,3.593880e+08,2.865200e+08,7.286800e+07,2.600610e+08,2.512960e+08,8.765000e+06,9.932700e+07,0.0,0.0,9.932700e+07,3.593880e+08,27939000.0,87212000.0,11693000.0,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0000002178-11-000049,,3,20110930,USD,3.535870e+08,2.756010e+08,7.798600e+07,2.452340e+08,2.311140e+08,1.412000e+07,1.083530e+08,0.0,0.0,1.083530e+08,3.535870e+08,49560000.0,96238000.0,11693000.0,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0000002178-12-000010,,2,20111231,USD,3.788400e+08,3.049650e+08,7.387500e+07,2.681580e+08,2.560940e+08,1.206400e+07,1.106820e+08,0.0,0.0,1.106820e+08,3.788400e+08,37066000.0,98567000.0,11693000.0,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0000002178-12-000026,,3,20120331,USD,4.125340e+08,3.269200e+08,8.561400e+07,2.952770e+08,2.828150e+08,1.246200e+07,1.172570e+08,0.0,0.0,1.172570e+08,4.125340e+08,35989000.0,105142000.0,11693000.0,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0000002178-12-000037,,4,20120630,USD,3.440350e+08,2.497470e+08,9.428800e+07,2.213920e+08,2.083550e+08,1.303700e+07,1.226430e+08,0.0,0.0,1.226430e+08,3.440350e+08,32213000.0,110528000.0,11693000.0,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
319587,0001971213-23-000019,,2,20230630,USD,6.201000e+09,1.506000e+09,4.695000e+09,5.560000e+09,6.240000e+08,4.936000e+09,6.410000e+08,0.0,0.0,6.410000e+08,6.201000e+09,728000000.0,184000000.0,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
319588,0001973047-23-000016,,2,20230731,USD,3.366300e+04,6.460000e+02,3.301700e+04,4.148700e+04,4.148700e+04,0.000000e+00,-7.824000e+03,0.0,0.0,-7.824000e+03,3.366300e+04,646.0,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
319589,0001974138-23-000008,,4,20230630,USD,5.736000e+09,1.676000e+09,4.060000e+09,2.415000e+09,1.795000e+09,6.200000e+08,3.321000e+09,0.0,0.0,3.321000e+09,5.736000e+09,337000000.0,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
319590,0001974793-23-000004,,2,20230630,USD,1.405810e+08,1.405810e+08,0.000000e+00,3.796100e+07,8.460000e+05,3.711500e+07,1.026200e+08,0.0,0.0,1.026200e+08,1.405810e+08,89275000.0,320000.0,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [5]:
std_bag = bs_standardizer.get_standardize_bag()

In [6]:
std_bag_target="std_bs_result_6"

In [8]:
std_bag.save(target_path=std_bag_target)

In [9]:
from secfsdstools.f_standardize.standardizing import StandardizedBag
loaded: StandardizedBag = StandardizedBag.load(target_path=std_bag_target) 



