# Financial Statements Standardizer

## Goal

Even when adhering to the US-GAAP standard, financial statements among different companies or even across different years of the same company are often not directly comparable.

Let's examine the balance sheet to illustrate a couple of problems that can arise:

- There are over 3000 different tags that could be used in a balance sheet, even though a balance sheet typically only has about 30-40 rows.
- Some tags have a similar meaning; for instance, the position "Total assets" can be tagged with "Assets," but sometimes, the tag "AssetsNet" is also used.
- Sometimes not all major positions are presented. For instance, you would normally expect Liabilities, LiabilitiesCurrent, and LiabilitiesNoncurrent to appear in the balance sheet. However, in some reports, only Liabilities and the detailed positions of LiabilitiesNoncurrent are listed, with no total position for LiabilitiesNoncurrent. Sometimes even the total position for Liabilities is missing.

The Standardizer processes the data and produces comparable statements that contain the main positions of a certain financial statement. For example, the balance sheet standardizer produces reports with values for Assets, AssetsCurrent, AssetsNoncurrent, Liabilities, LiabilitiesCurrent, LiabilitiesNoncurrent, Equity, as well as a few other positions that are not always present.

To achieve this, the standardizer uses a **simple rule framework** that lets you define rules acting on the data. In the context of the balance sheet, a few rules include:
- If there is an AssetsNet tag but no Assets tag, copy the value from AssetsNet to Assets.
- If two of the tags Assets, AssetsCurrent, AssetsNoncurrent are present, calculate the missing one by applying the formula Assets = AssetsCurrent + AssetsNoncurrent.
- If the LiabilitiesNoncurrent tag is missing, sum up any existing detail tags of LiabilitiesNoncurrent and store the sum in the LiabilitiesNoncurrent tag.
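
The three rules above can be sketched with pandas. This is only an illustration of the idea, not the standardizer's actual rule framework: the function names are hypothetical, the data is made up, and the choice of `LongTermDebtNoncurrent` and `OtherLiabilitiesNoncurrent` as detail tags of LiabilitiesNoncurrent is an assumption. The sketch assumes a table with one row per report and one column per tag, where a missing tag is `NaN`.

```python
import numpy as np
import pandas as pd

def copy_if_missing(df, target, source):
    """Copy `source` into `target` wherever `target` is missing."""
    mask = df[target].isna() & df[source].notna()
    df.loc[mask, target] = df.loc[mask, source]
    return mask  # rows in which the rule was applied

def complete_sum(df, total, part_a, part_b):
    """For total = part_a + part_b, fill whichever single value is missing."""
    m_total = df[total].isna() & df[part_a].notna() & df[part_b].notna()
    m_a = df[part_a].isna() & df[total].notna() & df[part_b].notna()
    m_b = df[part_b].isna() & df[total].notna() & df[part_a].notna()
    df.loc[m_total, total] = df.loc[m_total, part_a] + df.loc[m_total, part_b]
    df.loc[m_a, part_a] = df.loc[m_a, total] - df.loc[m_a, part_b]
    df.loc[m_b, part_b] = df.loc[m_b, total] - df.loc[m_b, part_a]
    return m_total | m_a | m_b

def sum_details_if_missing(df, target, details):
    """Store the sum of the existing detail tags in a missing `target`."""
    has_details = df[details].notna().any(axis=1)
    mask = df[target].isna() & has_details
    # sum(axis=1) skips NaN, so partially reported details are still summed
    df.loc[mask, target] = df.loc[mask, details].sum(axis=1)
    return mask

# Made-up example data: one row per report, one column per tag.
df = pd.DataFrame({
    "Assets": [np.nan, np.nan, 100.0],
    "AssetsNet": [500.0, np.nan, np.nan],
    "AssetsCurrent": [200.0, 30.0, 40.0],
    "AssetsNoncurrent": [np.nan, 70.0, np.nan],
    "LiabilitiesNoncurrent": [np.nan, np.nan, 25.0],
    "LongTermDebtNoncurrent": [50.0, np.nan, 20.0],
    "OtherLiabilitiesNoncurrent": [10.0, np.nan, 5.0],
})

copy_if_missing(df, "Assets", "AssetsNet")
complete_sum(df, "Assets", "AssetsCurrent", "AssetsNoncurrent")
sum_details_if_missing(
    df, "LiabilitiesNoncurrent",
    ["LongTermDebtNoncurrent", "OtherLiabilitiesNoncurrent"],
)
print(df[["Assets", "AssetsCurrent", "AssetsNoncurrent", "LiabilitiesNoncurrent"]])
```

Each function returns a boolean mask of the rows it changed, which is one possible way to keep track of where a rule was applied.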

Because these calculations can be incorrect or problematic under certain circumstances, **every action is logged**: whenever a rule is applied to a report or financial statement, that application is recorded. With this information, a user can trace how many rules, and which ones, were applied to which tags of a particular report.
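
One way to picture such a log is a boolean table with one row per report and one column per applied rule. The following is a minimal sketch under that assumption; the helper `apply_and_log`, the rule id `copy_AssetsNet`, the `adsh` index name, and the layout of the log are all hypothetical and may differ from what the standardizer actually produces.

```python
import numpy as np
import pandas as pd

# Made-up data: two reports, indexed by a report identifier.
df = pd.DataFrame(
    {"Assets": [np.nan, 80.0], "AssetsNet": [120.0, 80.0]},
    index=pd.Index(["report-1", "report-2"], name="adsh"),
)
log = pd.DataFrame(index=df.index)

def apply_and_log(df, log, rule_id, target, mask, values):
    """Write `values` into `target` for the masked rows and record the mask."""
    df.loc[mask, target] = values[mask]
    log[f"{target}/{rule_id}"] = mask  # True where the rule changed the tag

# Rule: copy AssetsNet to Assets if Assets is missing.
mask = df["Assets"].isna() & df["AssetsNet"].notna()
apply_and_log(df, log, "copy_AssetsNet", "Assets", mask, df["AssetsNet"])

# How many rules were applied to each report?
applied_per_report = log.sum(axis=1)
print(log)
print(applied_per_report)
```

With a layout like this, the columns that are `True` for a given report show which rules touched which tags, and the row sum gives the number of applied rules.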

Applying rules can, under certain circumstances, lead to incorrect results or interpretations. Moreover, the input data itself could be incorrect, or essential information could be missing from the dataset altogether. Therefore, **validation rules** can be defined; they are applied at the end of processing. In the case of the balance sheet, a few examples of validation rules are Assets = AssetsCurrent + AssetsNoncurrent, Liabilities = LiabilitiesCurrent + LiabilitiesNoncurrent, and Assets = Liabilities + Equity. These validation checks are applied to every financial statement, and the results are presented as a relative error and an error category (category 0 = exact match, category 1 = less than 1%, category 5 = less than 5%, category 10 = less than 10%, category 100 = greater than 10%). For instance, if you want to use the data to train an ML model, you might want to select only statements for which all checks are below category 5.
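
A validation check of this kind could look roughly like the sketch below. The function name `validate_sum`, the output column names, and the choice of the total as the denominator of the relative error are assumptions made for illustration; an error of exactly 10% is placed in category 100 here, which the description above leaves open.

```python
import numpy as np
import pandas as pd

def validate_sum(df, total, part_a, part_b):
    """Check total = part_a + part_b and categorize the relative error."""
    expected = df[part_a] + df[part_b]
    # Assumption: the error is measured relative to the reported total.
    rel_error = ((df[total] - expected) / df[total]).abs().to_numpy()
    conditions = [
        rel_error == 0,
        rel_error < 0.01,
        rel_error < 0.05,
        rel_error < 0.10,
        rel_error >= 0.10,
    ]
    # Rows with a missing value match no condition and stay NaN.
    category = np.select(conditions, [0, 1, 5, 10, 100], default=np.nan)
    return pd.DataFrame({"error": rel_error, "cat": category}, index=df.index)

# Made-up example data.
df = pd.DataFrame({
    "Assets": [100.0, 100.0, 100.0, 100.0, 100.0],
    "AssetsCurrent": [60.0, 60.0, 60.0, 60.0, 60.0],
    "AssetsNoncurrent": [40.0, 39.5, 37.0, 32.0, 20.0],
})

result = validate_sum(df, "Assets", "AssetsCurrent", "AssetsNoncurrent")

# Keep only statements whose check is below category 5.
usable = df[result["cat"] < 5]
print(result)
print(usable)
```

In this example the five rows fall into categories 0, 1, 5, 10, and 100, so only the first two rows pass the "below category 5" filter.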


## Main Process

The main process has four main steps:

1. Preprocessing
    1. Remove unused tags: based on the defined rules, all tags that are not used by any rule are removed.
    1. Deduplication: sometimes the values for certain tags, or even for all tags of a financial statement, appear more than once in the data. These duplicate entries have to be removed.
    1. Invert negated values: the sign of values that are marked as negated is flipped.
    1. Pivot the table: up to this point, every tag and its value occupy their own row in the dataset. What we want instead is one column per tag.
    1. Filter for main statement: sometimes a financial report has more than one table attributed to a certain financial statement. For instance, there may be the main balance sheet statement and a somewhat reduced one in the same report. This step tries to keep just the main statement.
    1. Apply the preprocess rules: the idea of preprocess rules is to correct errors in the data. For instance, reports exist where the tags for Assets and AssetsNoncurrent are mixed up, meaning the value of Assets is tagged as AssetsNoncurrent and vice versa. Preprocess rules can fix such errors.
    1. Preparation of log dataframes
2. Mainprocessing <br> This step applies the main rules in the order in which they are defined. The whole rule tree can also be applied more than once, since a rule at the end of the tree could calculate a tag that was not present before and that can then be used to calculate another value.
3. Postprocessing <br> in the 
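
Some of the preprocessing steps and the repeated application of the main rules can be sketched with pandas as follows. This is a simplified illustration, not the actual implementation: the column names (`adsh`, `tag`, `value`, `negating`), the set `used_tags`, the helper names, and the data are assumptions, and the main-statement filter and the preprocess rules are left out.

```python
import pandas as pd

# Made-up long-format input: one row per report (adsh) and tag.
raw = pd.DataFrame({
    "adsh": ["r1", "r1", "r1", "r1", "r2", "r2", "r2"],
    "tag": ["AssetsNet", "AssetsNet", "AssetsCurrent", "Goodwill",
            "Assets", "AssetsNoncurrent", "TreasuryStockValue"],
    "value": [100.0, 100.0, 60.0, 7.0, 80.0, 50.0, 5.0],
    "negating": [0, 0, 0, 0, 0, 0, 1],
})
# Hypothetical set of tags that the defined rules refer to.
used_tags = {"Assets", "AssetsNet", "AssetsCurrent",
             "AssetsNoncurrent", "TreasuryStockValue"}

# Preprocessing: drop unused tags, deduplicate, invert negated values, pivot.
filtered = raw[raw["tag"].isin(used_tags)]
deduped = filtered.drop_duplicates()
signed = deduped.assign(
    value=deduped["value"].where(deduped["negating"] != 1, -deduped["value"])
)
wide = signed.pivot(index="adsh", columns="tag", values="value")
wide = wide.reindex(columns=sorted(used_tags))  # one column per used tag

def complete_assets(df):
    """Assets = AssetsCurrent + AssetsNoncurrent: fill the one missing value."""
    t, a, b = "Assets", "AssetsCurrent", "AssetsNoncurrent"
    m_t = df[t].isna() & df[a].notna() & df[b].notna()
    m_a = df[a].isna() & df[t].notna() & df[b].notna()
    m_b = df[b].isna() & df[t].notna() & df[a].notna()
    df.loc[m_t, t] = df.loc[m_t, a] + df.loc[m_t, b]
    df.loc[m_a, a] = df.loc[m_a, t] - df.loc[m_a, b]
    df.loc[m_b, b] = df.loc[m_b, t] - df.loc[m_b, a]
    return m_t | m_a | m_b

def copy_assets_net(df):
    """Copy AssetsNet to Assets if Assets is missing."""
    mask = df["Assets"].isna() & df["AssetsNet"].notna()
    df.loc[mask, "Assets"] = df.loc[mask, "AssetsNet"]
    return mask

def run_rules(df, rules, max_passes=5):
    """Apply the rules in order, repeating until a pass changes nothing."""
    passes_with_changes = 0
    for _ in range(max_passes):
        changed = False
        for rule in rules:
            changed |= bool(rule(df).any())
        if not changed:
            break
        passes_with_changes += 1
    return passes_with_changes

passes = run_rules(wide, [complete_assets, copy_assets_net])
print(wide)
print(passes)
```

For report `r1`, the first pass can only copy AssetsNet to Assets, because the sum rule runs before the copy rule; the second pass then derives AssetsNoncurrent from the newly available Assets value. This is the reason for applying the rule tree more than once.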
 

    

content
- goal
- problems / examples
  - missing Liabilities and LiabilitiesNoncurrent tags
- what does the standardizer do
  - pre steps
    - deduplication
    - correct errors
  - main rules
    - iterated
  - post rule
  - validation
- loading and saving information
- example BalanceSheet
  - input
    - filtered data
    - only statement, only one currency, only main company, only standardized tags
- what kind of logs are produced, what do they show
- what you should consider if using standardized information
  - check logs / check validation summary
- limitations
  - wrong data, missing tags in data
  - non-standardized tags
- empirical approach, definition is not taken into account