# Tutorial - Evaluate DNB's additional rules

This notebook contains a tutorial for evaluating DNB's additional rules for the following Solvency II reports:
- Annual Reporting Solo (ARS); and
- Quarterly Reporting Solo (QRS)

Besides the necessary preparation, the tutorial consists of 6 steps:
1. Read possible datapoints
2. Read data
3. Clean data
4. Read additional rules
5. Evaluate rules
6. Save results

## 0. Preparation

### Import packages

In [None]:
import pandas as pd  # dataframes
import numpy as np  # mathematical functions, arrays and matrices
from os.path import join, isfile  # some os dependent functionality
import data_patterns  # evaluation of patterns
import regex as re  # regular expressions
from pprint import pprint  # pretty print
import logging

### Variables

In [None]:
# ENTRYPOINT: 'ARS' for 'Annual Reporting Solo' or 'QRS' for 'Quarterly Reporting Solo'
# INSTANCE: Name of the report you want to evaluate the additional rules for

ENTRYPOINT = 'ARS'  
INSTANCE = 'ars_260_instance'  # Test instances: ars_260_instance or qrs_260_instance

In [None]:
# DATAPOINTS_PATH: path to the excel-file containing all possible datapoints (simplified taxonomy)
# RULES_PATH: path to the excel-file with the additional rules
# INSTANCES_DATA_PATH: path to the source data
# RESULTS_PATH: path to the results

DATAPOINTS_PATH = join('..', 'data', 'datapoints')
RULES_PATH = join('..', 'solvency2-rules')
INSTANCES_DATA_PATH = join('..', 'data', 'instances', INSTANCE)
RESULTS_PATH = join('..', 'results') 

In [None]:
# We log to rules.log in the data/instances path

logging.basicConfig(filename=join(INSTANCES_DATA_PATH, 'rules.log'), level=logging.INFO,
                    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')

## 1. Read possible datapoints

The data/datapoints directory contains one file each for ARS and QRS, listing all possible datapoints (simplified taxonomy).  
We will use this information to add all unreported datapoints to the imported data.

In [None]:
df_datapoints = pd.read_csv(join(DATAPOINTS_PATH, ENTRYPOINT.upper() + '.csv'), sep=";").fillna("")  # load file to dataframe
df_datapoints.head()

## 2. Read data

We distinguish 2 types of tables: 
- With a closed-axis, e.g. the balance sheet: an entity reports only 1 balance sheet per period
- With an open-axis, e.g. the list of assets: an entity reports several 'rows of data' in the relevant table
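The difference between the two table types can be sketched with toy data. The entity, period, asset identifiers and values are hypothetical, and the column names only mimic the `table,row,column` naming used in this tutorial; this is an illustration, not the actual layout of the pickle files.

```python
import pandas as pd

# Closed-axis: one row per (entity, period); every datapoint is a column.
closed = pd.DataFrame(
    {"S.02.01.02.01,R0030,C0010": [100.0], "S.02.01.02.01,R0040,C0010": [25.0]},
    index=pd.MultiIndex.from_tuples(
        [("entity_a", "2019-12-31")], names=["entity", "period"]),
)

# Open-axis: several rows per (entity, period), told apart by an extra index level.
open_axis = pd.DataFrame(
    {"S.06.02.01.01,C0100": [10.0, 20.0, 30.0]},
    index=pd.MultiIndex.from_tuples(
        [("entity_a", "2019-12-31", "ASSET_1"),
         ("entity_a", "2019-12-31", "ASSET_2"),
         ("entity_a", "2019-12-31", "ASSET_3")],
        names=["entity", "period", "asset_id"]),
)

# The index levels other than entity and period make each open-axis row unique.
extra_levels = [name for name in open_axis.index.names
                if name not in ("entity", "period")]
print(len(closed), len(open_axis), extra_levels)
```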

### General information

First we gather some general information:
- A list of all possible reported tables
- A list of all reported tables
- A list of all tables that have not been reported

In [None]:
tables_complete_set = df_datapoints.tabelcode.sort_values().unique().tolist()
tables_reported = [table for table in tables_complete_set if isfile(join(INSTANCES_DATA_PATH, table + '.pickle'))]
tables_not_reported = [table for table in tables_complete_set if table not in tables_reported]

### Closed-axis

Besides all separate tables, the 'Tutorial Convert XBRL-instance to CSV, HTML and pickles' also outputs a large dataframe with the data from all closed-axis tables combined.  
We use this dataframe for evaluating the patterns on closed-axis tables.

In [None]:
df_closed_axis = pd.read_pickle(join(INSTANCES_DATA_PATH, INSTANCE + '.pickle'))
tables_closed_axis = sorted(list(set(x[:13] for x in df_closed_axis.columns)))
df_closed_axis.head()

### Open-axis

For open-axis tables we create a dictionary with all data per table.  
Later we will evaluate the additional rules on each separate table in this dictionary.

In [None]:
dict_open_axis = {}
tables_open_axis = [table for table in tables_reported if table not in tables_closed_axis]

for table in tables_open_axis:
    df = pd.read_pickle(join(INSTANCES_DATA_PATH, table + '.pickle'))
    
    # Identify which columns within the open-axis table make a table row unique (index-columns):
    index_columns_open_axis = [col for col in list(df.index.names) if col not in ['entity','period']]
    
    # Duplicate index-columns to data columns:
    df.reset_index(level=index_columns_open_axis, inplace=True)
    for i in range(len(index_columns_open_axis)):
        df['index_col_' + str(i)] = df[index_columns_open_axis[i]].astype(str)
        df.set_index(['index_col_' + str(i)], append=True, inplace=True)
        
    dict_open_axis[table] = df 

print("Open-axis tables:")
print(list(dict_open_axis.keys()))

## 3. Clean data

We have to make 2 modifications to the data:
1. Add unreported datapoints  
so rules (partly) pointing to unreported datapoints can still be evaluated
2. Change string values to uppercase  
because the additional rules are defined using capital letters for textual comparisons 

In [None]:
all_datapoints = [x.replace(',,',',') for x in 
                  list(df_datapoints['tabelcode'] + ',' + df_datapoints['rij'] + ',' + df_datapoints['kolom'])]
all_datapoints_closed = [x for x in all_datapoints if x.split(",")[0] in tables_closed_axis]
all_datapoints_open = [x for x in all_datapoints if x.split(",")[0] in tables_open_axis]
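To make the naming convention concrete, here is a minimal sketch of how a datapoint name is assembled from a table code, row code and column code. The triples are hypothetical; the empty row code in the second one stands in for a datapoint without a row code, which is where the double comma would otherwise appear.

```python
# Hypothetical (tabelcode, rij, kolom) triples, for illustration only.
rows = [
    ("S.02.01.02.01", "R0030", "C0010"),
    ("S.06.02.01.01", "", "C0040"),  # no row code
]

# Join the parts and collapse the double comma left behind by an empty row code.
names = [",".join(parts).replace(",,", ",") for parts in rows]

# The table code is always the part before the first comma.
table_codes = [name.split(",")[0] for name in names]
print(names)
print(table_codes)
```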

### Closed-axis tables

In [None]:
# add unreported datapoints to the dataframe with data from closed-axis tables:
for col in [column for column in all_datapoints_closed if column not in list(df_closed_axis.columns)]:
    df_closed_axis[col] = np.nan
df_closed_axis.fillna(0, inplace = True)

# string values to uppercase
df_closed_axis = df_closed_axis.applymap(lambda s: s.upper() if isinstance(s, str) else s)
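The two cleaning steps can be tried on a toy dataframe. This is a sketch under assumptions: the column names and values are hypothetical, only the newly added columns are filled with 0 (the notebook fills every missing value), and `Series.map` is used per column because `DataFrame.applymap` is deprecated in newer pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical reported data: one numeric and one textual datapoint.
df = pd.DataFrame({"T1,R0010,C0010": [5.0], "T1,R0020,C0010": ["eur"]})
expected_columns = ["T1,R0010,C0010", "T1,R0020,C0010", "T1,R0030,C0010"]

# 1. Add unreported datapoints as NaN, then fill them with 0.
missing = [col for col in expected_columns if col not in df.columns]
for col in missing:
    df[col] = np.nan
df[missing] = df[missing].fillna(0)

# 2. Uppercase string values and leave all other types untouched.
df = df.apply(
    lambda column: column.map(lambda s: s.upper() if isinstance(s, str) else s))
print(df)
```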

### Open-axis tables

In [None]:
for table in list(dict_open_axis):
    all_datapoints_table = [x for x in all_datapoints_open if x.split(",")[0] == table]
    for col in [column for column in all_datapoints_table if column not in list(dict_open_axis[table].columns)]:
        dict_open_axis[table][col] = np.nan
    dict_open_axis[table].fillna(0, inplace=True)

    dict_open_axis[table] = dict_open_axis[table].applymap(lambda s: s.upper() if isinstance(s, str) else s)

## 4. Read additional rules

DNB's additional validation rules are published as an Excel file on the DNB statistics website.  
We included the Excel file in the project under data/downloaded files.

The rules are already converted to a syntax Python can interpret, using the notebook: 'Convert DNBs Additional Validation Rules to Patterns'.  
In the next line of code we read these converted rules (patterns).

In [None]:
df_patterns = pd.read_excel(join(RULES_PATH, ENTRYPOINT.lower() + '_patterns_additional_rules.xlsx'), engine='openpyxl').fillna("").set_index('index')

## 5. Evaluate rules

### Closed-axis tables

To be able to evaluate the rules for closed-axis tables, we need to filter out:
- patterns for open-axis tables; and
- patterns pointing to tables that are not reported.

In [None]:
df_patterns_closed_axis = df_patterns.copy()
df_patterns_closed_axis = df_patterns_closed_axis[df_patterns_closed_axis['pandas ex'].apply(
    lambda expr: not any(table in expr for table in tables_not_reported) 
    and not any(table in expr for table in tables_open_axis))]
df_patterns_closed_axis.head()
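The filter above is a plain substring test on each pattern expression. A minimal sketch on hypothetical table lists and expressions (none of them taken from the actual rule set):

```python
# Hypothetical table lists and pattern expressions, for illustration only.
tables_not_reported = ["S.05.01.02.01"]
tables_open_axis = ["S.06.02.01.01"]
expressions = [
    'df["S.02.01.02.01,R0030,C0010"] >= 0',  # closed-axis and reported: keep
    'df["S.05.01.02.01,R0110,C0010"] >= 0',  # table not reported: drop
    'df["S.06.02.01.01,C0100"] > 0',         # open-axis table: drop
]


def is_closed_axis_pattern(expr):
    """True if expr mentions neither an unreported nor an open-axis table."""
    excluded = tables_not_reported + tables_open_axis
    return not any(table in expr for table in excluded)


kept = [expr for expr in expressions if is_closed_axis_pattern(expr)]
print(kept)
```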

We now have:
- the data for closed-axis tables in a dataframe;
- the patterns for closed-axis tables in a dataframe.

To evaluate the patterns we need to create a 'PatternMiner' (part of the data_patterns package), and run the analyze function.

In [None]:
miner = data_patterns.PatternMiner(df_patterns=df_patterns_closed_axis)
df_results_closed_axis = miner.analyze(df_closed_axis)
df_results_closed_axis.head()

### Open-axis tables

First we find the patterns defined for open-axis tables:

In [None]:
df_patterns_open_axis = df_patterns.copy()
df_patterns_open_axis = df_patterns_open_axis[df_patterns_open_axis['pandas ex'].apply(
    lambda expr: any(table in expr for table in tables_open_axis))]

Patterns involving multiple open-axis tables are not yet supported, so we keep only the patterns that reference exactly one table code:

In [None]:
df_patterns_open_axis = df_patterns_open_axis[df_patterns_open_axis['pandas ex'].apply(
    lambda expr: len(set(re.findall(r'S\.\d\d\.\d\d\.\d\d\.\d\d|T\d[A-Z]?', expr)))) == 1]
df_patterns_open_axis.head()
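How the table codes are counted can be checked on two hypothetical expressions. This sketch uses the standard-library `re` module (the notebook imports the third-party `regex` package under the same name) and a variant of the pattern written as a raw string with the dots escaped, so that they match only literal dots.

```python
import re  # standard library; behaves the same as `regex` for this pattern

TABLE_CODE = re.compile(r"S\.\d\d\.\d\d\.\d\d\.\d\d|T\d[A-Z]?")


def tables_in(expr):
    """Return the set of distinct table codes mentioned in a pattern expression."""
    return set(TABLE_CODE.findall(expr))


# Hypothetical expressions, for illustration only.
single = 'df["S.06.02.01.01,C0100"] > 0'
multiple = 'df["S.06.02.01.01,C0040"] == df["S.06.02.01.02,C0040"]'

print(tables_in(single))         # one table: the pattern is kept
print(len(tables_in(multiple)))  # two tables: the pattern is dropped
```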

Next we loop through the open-axis tables and evaluate the corresponding patterns on the data:

In [None]:
output_open_axis = {}  # dictionary with input and results per table
for table in tables_open_axis:  # loop through open-axis tables
    if df_patterns_open_axis['pandas ex'].apply(lambda expr: table in expr).sum() > 0:  # check if there are patterns
        info = {}
        info['data'] = dict_open_axis[table]  # select data
        info['patterns'] = df_patterns_open_axis[df_patterns_open_axis['pandas ex'].apply(
            lambda expr: table in expr)]  # select patterns
        miner = data_patterns.PatternMiner(df_patterns=info['patterns'])
        info['results'] = miner.analyze(info['data'])  # evaluate patterns
        output_open_axis[table] = info

Print the results for the first table (if there are rules for tables with an open axis):

In [None]:
if len(output_open_axis.keys()) > 0:
    display(output_open_axis[list(output_open_axis.keys())[0]]['results'].head())

## 6. Save results

### Combine results for closed- and open-axis tables

To output the results in a single file, we combine the results for closed-axis and open-axis tables.

In [None]:
# Function to transform results for open-axis tables, so it can be appended to results for closed-axis tables
# The 'extra' index columns are converted to data columns
def transform_results_open_axis(df):
    if df.index.nlevels > 2:
        reset_index_levels = list(range(2, df.index.nlevels))
        df = df.reset_index(level=reset_index_levels)
        rename_columns={}
        for x in reset_index_levels:
            rename_columns['level_' + str(x)] = 'id_column_' + str(x - 1)
        df.rename(columns=rename_columns, inplace=True)
    return df
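The effect of this transformation can be seen on a toy result set. The sketch assumes the extra index levels are unnamed, in which case pandas labels them `level_<n>` when they are reset; the data below is hypothetical.

```python
import pandas as pd


def transform_results_open_axis(df):
    """Same logic as the function above, repeated so the sketch runs on its own."""
    if df.index.nlevels > 2:
        reset_index_levels = list(range(2, df.index.nlevels))
        df = df.reset_index(level=reset_index_levels)
        rename_columns = {'level_' + str(x): 'id_column_' + str(x - 1)
                          for x in reset_index_levels}
        df = df.rename(columns=rename_columns)
    return df


# Hypothetical results with three unnamed index levels: entity, period, row id.
index = pd.MultiIndex.from_tuples([("entity_a", "2019-12-31", "ASSET_1"),
                                   ("entity_a", "2019-12-31", "ASSET_2")])
toy_results = pd.DataFrame({"result_type": [True, False]}, index=index)

flat = transform_results_open_axis(toy_results)
print(flat.index.nlevels, list(flat.columns))
```

After the call, only the first two index levels remain in the index and the row identifier has become an ordinary column.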

In [None]:
df_results = df_results_closed_axis.copy()  # results for closed axis tables
for table in list(output_open_axis.keys()):  # for all open axis tables with rules -> append and sort results
    # pd.concat also works on pandas >= 2.0, where DataFrame.append no longer exists
    df_results = pd.concat(
        [transform_results_open_axis(output_open_axis[table]['results']), df_results],
        sort=False).sort_values(by=['pattern_id']).sort_index()

Change column order so the dataframe starts with the identifying columns:

In [None]:
list_col_order = []
for i in range(1, len([col for col in list(df_results.columns) if col[:10] == 'id_column_']) + 1):
    list_col_order.append('id_column_' + str(i))
list_col_order.extend(col for col in list(df_results.columns) if col not in list_col_order)
df_results = df_results[list_col_order]
df_results.head()
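The combine-and-reorder steps can be sketched on two toy result frames. The column values are hypothetical, and `pd.concat` is used for the combination; the last line also shows how the exceptions would be selected.

```python
import pandas as pd

# Hypothetical results: closed-axis results have no identifying column.
closed = pd.DataFrame({"pattern_id": [1], "result_type": [True]})
open_axis = pd.DataFrame(
    {"id_column_1": ["ASSET_1"], "pattern_id": [2], "result_type": [False]})

# Combine; columns missing in one frame are filled with NaN.
combined = pd.concat([closed, open_axis], sort=False)

# Move the identifying columns to the front, keeping the others in place.
id_columns = [col for col in combined.columns if col.startswith("id_column_")]
other_columns = [col for col in combined.columns if col not in id_columns]
combined = combined[id_columns + other_columns]

# Exceptions are the rows where the rule was not satisfied.
exceptions = combined[combined["result_type"] == False]
print(list(combined.columns), len(exceptions))
```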

### Save results

The dataframe df_results contains all output of the evaluation of the validation rules. 

In [None]:
# To save all results use df_results
# To save all exceptions use df_results[df_results['result_type']==False]
# To save all confirmations use df_results[df_results['result_type']==True]

# Here we save only the exceptions to the validation rules
df_results[df_results['result_type']==False].to_excel(join(RESULTS_PATH, "results.xlsx"))

### Example of an error in the report

In [None]:
# Get the pandas code of one pattern (index 4) and evaluate it on the closed-axis data
s = df_patterns.loc[4, 'pandas ex'].replace('df', 'df_closed_axis')
print('Pattern:', s)
display(eval(s)[re.findall(r'S\.\d\d\.\d\d\.\d\d\.\d\d,R\d\d\d\d,C\d\d\d\d|T\d[A-Z]?,R\d\d\d,C\d\d\d', s)])