# Processing of multiple levels of groupby for custom function in Pandas

The main goal of this project is to apply a custom function to lists of grouped values.

Grouping is done on multiple variables (levels). Some of them stay fixed for every calculation (static levels), and some are swapped in one at a time during execution (dynamic levels).

As an example, we use a function that calculates a market concentration measure, the Herfindahl-Hirschman Index (https://www.investopedia.com/terms/h/hhi.asp), from the face amount of held assets, grouped by business day, country of residence, issuer code and rating value. This is only an example: you can use this script to build lists of grouped values and pass them into any custom or existing function.

The script uses data stored in the XLSX file InputAndConfiguration.xlsx in the data folder.

Source data are stored in two sheets:

-	InputData sheet - contains the basic data used in the calculation. These are atomic data that will be grouped in the function
-	Configuration sheet - contains information on how to aggregate the data and which variable is going to be aggregated. For this example we sum the variable by the selected groups.

The Configuration sheet has two columns, Variables and VariableType.

The Variables column lists the variables used in the calculation.

The VariableType column defines the role each variable plays in the aggregation.

In the current example, BusinessDay and CountryOfResidence are fixed as static variables by setting their VariableType to StaticLevel.

IssuerCode and RatingValue are used for partial grouping: the script iterates over them one at a time to produce the lists of values passed into the custom function. Their VariableType is IterableLevel.

NominalValue is the aggregation variable, so its VariableType must be Variable.
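To make the layout of the Configuration sheet concrete, here is a minimal sketch that builds the same table in memory instead of reading it from the XLSX file. The column names and type labels are the ones the script filters on; building the frame by hand is only an illustration, not how the script loads its data.

```python
import pandas as pd

# In-memory stand-in for the Configuration sheet (illustration only).
configuration = pd.DataFrame({
    "Variables": ["BusinessDay", "CountryOfResidence", "IssuerCode",
                  "RatingValue", "NominalValue"],
    "VariableType": ["StaticLevel", "StaticLevel", "IterableLevel",
                     "IterableLevel", "Variable"],
})

# Select the variable names that belong to each role.
static_level = configuration.loc[
    configuration["VariableType"] == "StaticLevel", "Variables"].tolist()
dynamic_level = configuration.loc[
    configuration["VariableType"] == "IterableLevel", "Variables"].tolist()

print(static_level)
print(dynamic_level)
```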

The following code creates an output file containing the HHI calculated for each business day and country of residence, separately by issuer code and by rating value.


In [1]:
import pandas as pd

Importing data sheet and configuration data from XLSX file.

In [2]:
base_data = pd.read_excel('./data/InputAndConfiguration.xlsx', sheet_name = 'InputData')

configuration = pd.read_excel('./data/InputAndConfiguration.xlsx', sheet_name = 'Configuration')

Creating lists of the variables in the static and dynamic levels so they can be used in the for loop. The aggregation variable is taken from the row whose VariableType is Variable.

The script raises an error if the parameters are not correct:

-	In the current mode the program expects exactly one variable to be used for aggregation and raises an error otherwise,

-	We must enter at least one static and one dynamic level,

-	The current script supports at most 4 static levels. If you need more, you can extend the for loop by adding new elif statements (this is mentioned again when that part of the program is reached).

For creating lists from Pandas dataframes we use the .values.tolist() function.


In [3]:
static_level = configuration[configuration['VariableType'] == 'StaticLevel']['Variables'].values.tolist()

dynamic_level = configuration[configuration['VariableType'] == 'IterableLevel']['Variables'].values.tolist()

# Collect the aggregation variable(s) once and reuse the list below.
aggregation_variables = configuration[configuration['VariableType'] == 'Variable']['Variables'].values.tolist()

if len(aggregation_variables) != 1:
    raise Exception('Script is implemented for exactly one summing variable. Check the Configuration sheet: exactly one row must have VariableType set to Variable.')

aggregation_variable = aggregation_variables[0]

if len(static_level) == 0:
    raise Exception('You have to enter at least one level that will not be changed in summing in calculation')
    
if len(dynamic_level) == 0:
    raise Exception('You have to enter at least one or more levels that will be changed in calculation')

if len(static_level) > 4:
    raise Exception('Current function is implemented only for 4 levels of static variables. You can extend this to more levels in script.')

Defining hhi_index function.

In [4]:
def hhi_index(input_variables):
    """
    Function takes list of numbers as input parameter input_variables.
    input_variables - List of values aggregated by group using the groupby function.
    
    Function returns value of HHI index calculated according to https://www.investopedia.com/terms/h/hhi.asp
    """
    s1 = sum(input_variables)
    out = [pow(g/s1*100,2) for g in input_variables]
    return round(sum(out),0)
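As a quick sanity check of the formula, the sketch below repeats the function in compact form so that it runs on its own, and applies it to a few made-up lists of holdings. Each value is converted to a percentage share of the total, the shares are squared, and the squares are summed.

```python
def hhi_index(input_variables):
    """Sum of squared percentage shares, rounded to a whole number."""
    total = sum(input_variables)
    return round(sum(pow(g / total * 100, 2) for g in input_variables), 0)

# Made-up inputs for illustration.
monopoly = hhi_index([100])               # one holder with a 100% share
even_split = hhi_index([25, 25, 25, 25])  # four equal 25% shares
mixed = hhi_index([50, 30, 20])           # shares of 50%, 30% and 20%

print(monopoly, even_split, mixed)  # 10000.0 2500.0 3800.0
```

Note that the function divides by the group total, so a group whose values sum to zero raises ZeroDivisionError.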

Converting all grouping variables to strings. This step is not necessary, but sometimes it is easier to work with strings than with other datatypes.

In [5]:
for strdata in static_level+dynamic_level:
    base_data[strdata] = base_data[strdata].astype(str)

Assigning two lists: one to store the grouping variables for aggregation and one to hold the results of the calculation.

loop_list -        list of variables used for grouping
output_variables - list that will contain the calculated HHI values together with the grouping levels

In [6]:
loop_list = []
output_variables = []

This part aggregates the static and dynamic levels using the groupby function in a for loop. The loop goes through the dynamic level list and, on each pass, sums the aggregation variable over the static levels plus one dynamic level.

The inner loop is selected by the if/elif statements. These conditional statements check the number of static variables and use it to build the list of index levels for looping through the multi-index. This solution is based on the approach described in https://stackoverflow.com/questions/34139121/how-to-iterate-over-multiindex-levels-in-pandas/34139354

The trick is to regroup on one level fewer than the number of variables used in the first groupby call. Each resulting group is then the list of summed values for one combination of static levels, which is passed into the hhi_index function.
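The two-step grouping can be seen on a small made-up table. This is a minimal sketch with one static level (BusinessDay) and one dynamic level (IssuerCode); the data values are invented for illustration.

```python
import pandas as pd

# Made-up data: one static level and one dynamic level.
toy = pd.DataFrame({
    "BusinessDay": ["2020-01-01"] * 3 + ["2020-01-02"] * 2,
    "IssuerCode": ["A", "A", "B", "A", "B"],
    "NominalValue": [10, 20, 70, 50, 50],
})

# Step 1: sum over the static level plus one dynamic level.
summed = toy.groupby(["BusinessDay", "IssuerCode"])["NominalValue"].sum()

# Step 2: regroup on the static level only (one level fewer).
# Each group now holds the per-issuer totals for one business day.
groups = {day: values.tolist() for day, values in summed.groupby(level=0)}

print(groups)  # {'2020-01-01': [30, 70], '2020-01-02': [50, 50]}
```

Each list in `groups` is exactly what the custom function receives for one business day.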

If you need more than 4 static levels, add more elif statements with the additional index levels, and raise the limit in the len(static_level) > 4 check above. Example:

    elif len(static_level) == 5:
        for idx, sel in looping_groupby.groupby(level = [0,1,2,3,4]):
            output_variables.append([idx[0], idx[1], idx[2], idx[3], idx[4], looping_group, hhi_index(list(sel))])


In [7]:
for looping_group in dynamic_level:
    loop_list = static_level.copy()
    loop_list.append(looping_group)

    looping_groupby = base_data.groupby(loop_list)[aggregation_variable].sum()
    
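    if len(static_level) == 1:
        # Group on a scalar level so that idx is the plain key value.
        # With level = [0], some pandas versions return a scalar key,
        # and idx[0] would then be only its first character.
        for idx, sel in looping_groupby.groupby(level = 0):
            output_variables.append([idx, looping_group, hhi_index(list(sel))])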
        
    elif len(static_level) == 2:
        for idx, sel in looping_groupby.groupby(level = [0,1]):
            output_variables.append([idx[0], idx[1], looping_group, hhi_index(list(sel))])
        
    elif len(static_level) == 3:
        for idx, sel in looping_groupby.groupby(level = [0,1,2]):
            output_variables.append([idx[0], idx[1], idx[2], looping_group, hhi_index(list(sel))])
            
    elif len(static_level) == 4:
        for idx, sel in looping_groupby.groupby(level = [0,1,2,3]):
            output_variables.append([idx[0], idx[1], idx[2], idx[3], looping_group, hhi_index(list(sel))])
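As an alternative to one branch per number of static levels, the level list can be built from len(static_level). The following is a sketch of that idea, not the script's own implementation; the toy data and the compact copy of hhi_index are included only to make it runnable on its own.

```python
import pandas as pd

def hhi_index(values):
    total = sum(values)
    return round(sum((v / total * 100) ** 2 for v in values), 0)

# Made-up inputs standing in for the InputData and Configuration sheets.
base_data = pd.DataFrame({
    "BusinessDay": ["2020-01-01"] * 4,
    "CountryOfResidence": ["DE", "DE", "FR", "FR"],
    "IssuerCode": ["A", "B", "A", "A"],
    "NominalValue": [50, 50, 30, 70],
})
static_level = ["BusinessDay", "CountryOfResidence"]
dynamic_level = ["IssuerCode"]
aggregation_variable = "NominalValue"

output_variables = []
levels = list(range(len(static_level)))  # [0, 1, ...] for any number of static levels

for looping_group in dynamic_level:
    summed = base_data.groupby(static_level + [looping_group])[aggregation_variable].sum()
    for idx, sel in summed.groupby(level=levels):
        # Normalise the key: a single level may give a scalar instead of a tuple.
        keys = idx if isinstance(idx, tuple) else (idx,)
        output_variables.append([*keys, looping_group, hhi_index(sel.tolist())])

print(output_variables)
```

With this form the limit of 4 static levels no longer applies, at the cost of being a little less explicit than the branch-per-level version.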

Create column names for output dataframe.

In [8]:
column_names = static_level + ['Variable level'] + ['Calculated Function']

Create dataframe from list of values calculated in loop and assign column names from list column_names.

In [9]:
output_dataframe = pd.DataFrame(output_variables, columns = column_names)

Output dataframe to XLSX file.

In [10]:
output_dataframe.to_excel('./data/Output.xlsx', index = False)