Author: Justin Braun

Date: 20221119

Purpose: Generate input data for the experiments based on ini config files. The notebook loops over all ini files in the config directory ('../conf/archetypes' by default, see 'data_generator()') and writes one csv file per ini file, containing the sliced data and the combinations data. The combinations data consists of one copy of the training/synthetic data for each possible combination of values in 'variable_list', with rows that violate a business rule removed. 'data_generator()' in the final cell calls all other functions.

Note that some combinations of values for the variables of interest may violate business rules. These combinations can be removed from the data by specifying them in 'excluded_combinations' in the ini file.

Finally, in the Data Generator cell, look for "CHECK" comments: set the file paths there and specify whether you have access to the real training data, which is not publicly available.
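
For orientation, a config file with the sections read by 'read_config()' might look like the sketch below. The section and key names (META, FILEPATHS, VARIABLES, user, date, name, real, synth, variable_list, excluded_combinations) are taken from the code in this notebook; the experiment name and the particular variable values are invented for illustration and are not one of the actual config files.

```python
import ast
import configparser

# Hypothetical ini content: section/key names follow read_config(), values are invented.
EXAMPLE_INI = """
[META]
user = jb
date = 20221119
name = example_experiment

[FILEPATHS]
real = ../data/00_hidden/td_numeric.csv
synth = ../data/01_raw/synth_data.csv

[VARIABLES]
variable_list = [[{"relatie_kind_heeft_kinderen": ["0", "1"]}], [{"relatie_kind_huidige_aantal": ["0", "2"]}]]
excluded_combinations = [{"relatie_kind_heeft_kinderen": "1", "relatie_kind_huidige_aantal": "0"}, {"relatie_kind_heeft_kinderen": "0", "relatie_kind_huidige_aantal": "2"}]
"""

config = configparser.ConfigParser(allow_no_value=True)
config.read_string(EXAMPLE_INI)

# Both entries are stored as strings and evaluated to Python literals.
variable_list = ast.literal_eval(config["VARIABLES"]["variable_list"])
excluded_combinations = ast.literal_eval(config["VARIABLES"]["excluded_combinations"])

print(variable_list[0])
print(excluded_combinations)
```

Each inner list of 'variable_list' is one feature; the two excluded combinations in this sketch remove the copies where "has children" and "number of children" contradict each other.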

In [15]:
# Import packages
import pandas as pd
import configparser
import ast
import os
import random
import copy

## Load Config File

In [16]:
# Load config files
#
# @param conf_path: file path to an ini file, which contains configuration data for a single experiment
#
# @return conf: a dictionary mirroring the config file at conf_path. The destination file path is assembled within the function
def read_config(conf_path): 
    #read config file
    config = configparser.ConfigParser(allow_no_value=True)
    config.read(conf_path)
    
    conf = {} #set up conf dictionary
    
    #load meta data
    meta = config['META']
    conf['user'] = meta['user']
    conf['date'] = meta['date']
    conf['name'] = meta['name']
    
    #generate destination file path (where the output csv will be saved)
    conf['dest_filename'] = '../data/03_experiment_input/'+conf['date']+'_'+conf['user']+'_'+conf['name']+'.csv'
    print('Destination Filename: ' + conf['dest_filename'])
    
    #generate input filepaths for real and synthetic data
    filepaths = config['FILEPATHS']
    conf['real_fp'] = filepaths['real']
    conf['synth_fp'] = filepaths['synth']
    print('Real Source FP: '+ conf['real_fp'])
    print('Synth Source FP: '+ conf['synth_fp'])
    
    #store variable list as list of lists of dictionaries
    variables = config['VARIABLES']
    variable_list = variables['variable_list']
    variable_list = variable_list.replace('“', '"')
    variable_list = variable_list.replace('”', '"')
    conf['variable_list'] = ast.literal_eval(variable_list) #evaluate string to list of lists of dictionaries
    print('Variable list:')
    print(conf['variable_list'])
    
    excluded_combinations = variables['excluded_combinations']
    conf['excluded_combinations'] = ast.literal_eval(excluded_combinations)
    print('Excluded combinations:')
    print(conf['excluded_combinations'])
    return conf


## Load Data

In [17]:
# Load data
#
# @param fp: file path to the training data
#
# @return df: pandas dataframe of the training data
def load_data(fp):
    #CHECK: if you are running this code on a Windows machine, you may have to include the argument "encoding = 'latin'"
    df = pd.read_csv(fp) 
    return df


## Check User Inputs

In [18]:
# Check User Inputs
#
# @param td: pandas dataframe for the training data
# @param variable_list: list of lists of dictionaries containing all the variables
#
# Purpose: checks that the user inputs actually correspond to columns in td, replaces the values of variables
# specified as 'ALL' with all unique values of that variable, and converts all values to numeric
def check_user_inputs(td, variable_list):
    col_names = list(td.columns.values) #all column names
    for nested_list in variable_list:
        for dic in nested_list:
            #print(dic)
            var = list(dic.keys())[0] #var name in variable_list
            assert_message = var + ' is not a column name.' #warning message if variable name is not contained in td
            
            #assert that user inputs actually correspond to variables in td
            assert var in col_names, assert_message 
            
            #if variable values are specified as 'ALL', change to all unique values for this variable
            if (dic[var] == ['ALL']):
                dic[var] = list(td[var].unique())
            dic[var] = list(map(pd.to_numeric, dic[var]))
    print('All chosen variables correspond to columns in the dataset')


## Slice Data

In [19]:
# Slice Data
#
# @param df: pandas dataframe to be sliced
# @param variable_list: list of lists of dictionaries containing the variable values that df rows must match to be kept
# @param dt: string specifying which data type this particular df is, e.g., 'real' or 'synth'
#
# @return df_copy: pandas dataframe sliced according to the values specified in variable_list
def slice_data(df, variable_list, dt):
    df_copy = copy.deepcopy(df) #make a copy of the original df
    
    for nested_list in variable_list:
        for dic in nested_list:
            var = list(dic.keys())[0]
            df_copy = df_copy.loc[df_copy[var].isin(dic[var])] #subset df_copy by values for each variable
    
    df_copy['data_type'] = dt #set data_type column
    print(dt + ' copied, shape: ' + str(df_copy.shape))
    return df_copy


## Business Rules

In [20]:
# Zero One Hot Encoded Is Valid
#
# @param column_names: list of column names
# @param data: pandas data frame
#
# @return bool_list: returns boolean list, where True elements correspond to rows
# in accordance with the business rule and False elements correspond to rows which violate the business rule.
#
# Purpose: A row is valid when at most one of the columns in column_names is coded as one (more precisely, when the
# row sum over column_names is <= 1)
def zero_one_hot_encoding_is_valid(column_names, data):
    temp=data.loc[:, column_names]
    is_zero_or_one_hot_encoded = (temp.sum(axis=1) <= 1)
    return is_zero_or_one_hot_encoded

#test
data = [[0,0],[0,1],[1,0],[1,1],[1,2],[2,1],[2,2]]
test_df = pd.DataFrame(data, columns=['refcol', 'col1'])
test_fn = zero_one_hot_encoding_is_valid(['col1', 'refcol'], test_df)
print(test_fn)

0     True
1     True
2     True
3    False
4    False
5    False
6    False
dtype: bool


In [21]:
# Zero or GE Is Valid
#
# @param column_names: list of column names
# @param data: pandas data frame
# @param ref_column: name of a column in data
#
# @return bool_list: returns boolean list, where True elements correspond to rows
# in accordance with the business rule and False elements correspond to rows which violate the business rule.
#
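# Purpose: A row is valid when ref_column is greater than or equal to every column in column_names (this includes
# the case where all of them are zero). For instance, with ref_column = 'relatie_kind_huidige_aantal' and
# column_names = ['relatie_kind_heeft_kinderen']: when 'relatie_kind_heeft_kinderen' = 1, then
# 'relatie_kind_huidige_aantal' >= 1. Note that the check as written does not force
# 'relatie_kind_huidige_aantal' to be 0 when 'relatie_kind_heeft_kinderen' = 0.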
def zero_or_ge_is_valid(column_names, data, ref_column):
    ret_list = pd.Series(True, index=data.index) #start with all rows marked as valid
    for col in column_names:
        temp_list = (data[ref_column] >= data[col])
        ret_list = temp_list & ret_list
    return ret_list

#test
data = [[0,0],[0,1],[1,0],[1,1],[1,2],[2,1],[2,2]]
test_df = pd.DataFrame(data, columns=['refcol', 'col1'])
test_fn = zero_or_ge_is_valid(['col1'], test_df, 'refcol')
print(test_fn)

0     True
1    False
2     True
3     True
4    False
5     True
6     True
dtype: bool
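
The loop above can also be written without an explicit accumulator. The following is a minimal sketch, assuming the intended rule is exactly what the code checks (ref_column >= every column in column_names); the name `zero_or_ge_is_valid_vectorized` is hypothetical and not part of the original notebook.

```python
import pandas as pd

def zero_or_ge_is_valid_vectorized(column_names, data, ref_column):
    # Compare every listed column against ref_column row by row (col <= ref),
    # then require the comparison to hold for all listed columns.
    return data[column_names].le(data[ref_column], axis=0).all(axis=1)

data = [[0, 0], [0, 1], [1, 0], [1, 1], [1, 2], [2, 1], [2, 2]]
test_df = pd.DataFrame(data, columns=['refcol', 'col1'])
result = zero_or_ge_is_valid_vectorized(['col1'], test_df, 'refcol')
print(result.tolist())
# [True, False, True, True, False, True, True]
```

On the test data this gives the same True/False pattern as the loop version.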


In [22]:
# Zero Must Match
#
# @param column_names: list of column names
# @param data: pandas data frame
# @param ref_column: name of a column in data
#
# @return bool_list: returns boolean list, where True elements correspond to rows
# in accordance with the business rule and False elements correspond to rows which violate the business rule.
#
# Purpose: A row is valid if, whenever ref_column is zero, all columns in column_names are also zero.
# If ref_column is not zero, other columns can vary. For instance, if an individual has 'ontheffing_hist_ind' = 0,
# all other 'ontheffing_hist*' also have to be zero. If 'ontheffing_hist_ind' = 1, the other 'ontheffing_hist*' can vary.
def zero_must_match_is_valid(column_names, data, ref_column):
    ret_list = pd.Series(True, index=data.index) #start with all rows marked as valid
    for col in column_names:
        temp_list = (((data[ref_column] == 0) & (data[col] == 0)) | (data[ref_column] > 0))
        ret_list = temp_list & ret_list
    return ret_list

#test function
data = [[0,0],[0,1],[1,0],[1,1],[1,2],[2,1],[2,2]]
test_df = pd.DataFrame(data, columns=['refcol', 'col1'])
test_fn = zero_must_match_is_valid(['col1'], test_df, 'refcol')
print(test_fn)

0     True
1    False
2     True
3     True
4     True
5     True
6     True
dtype: bool


## Check Business Rules Violations

In [23]:
# Check Business Rules Violations
#
# @param comb_list: list of pandas dataframes, to be checked for business rules violations
#
# @return comb_list: list of pandas dataframes from which every row has been removed that violates
# any business rule in at least one of the dataframes in comb_list
def check_bus_rules_violations(comb_list):
    
    #initialize bool_list, which is used to subset the dataframes in comb_list
    df = comb_list[0]
    bool_list = pd.Series(True, index=df.index) #initialize to all true
    
    #for each dataframe, check if any of the business rules is violated
    for index in range(len(comb_list)):
        df = comb_list[index] #extract dataframe
        
        district_vars = ['adres_recentste_wijk_charlois', 'adres_recentste_wijk_delfshaven', 'adres_recentste_wijk_feijenoord',
                   'adres_recentste_wijk_ijsselmonde', 'adres_recentste_wijk_kralingen_c', 'adres_recentste_wijk_noord',
                   'adres_recentste_wijk_other', 'adres_recentste_wijk_prins_alexa', 'adres_recentste_wijk_stadscentru']
        bool_list = (zero_one_hot_encoding_is_valid(district_vars, df) & bool_list)
        #print('1: ', bool_list.value_counts())
                
        neighborhood_vars = ['adres_recentste_buurt_groot_ijsselmonde', 'adres_recentste_buurt_nieuwe_westen', 'adres_recentste_buurt_other',
                   'adres_recentste_buurt_oude_noorden', 'adres_recentste_buurt_vreewijk']
        bool_list = (zero_one_hot_encoding_is_valid(neighborhood_vars, df) & bool_list)
        #print('2: ', bool_list.value_counts())
            
        bool_list = (zero_one_hot_encoding_is_valid(['adres_recentste_plaats_other','adres_recentste_plaats_rotterdam'], df) & bool_list)
        #print('3: ', bool_list.value_counts())        
        
        district_neighborhood_plaats = district_vars + neighborhood_vars + ['adres_recentste_plaats_rotterdam']
        district_neighborhood_plaats = list(set(district_neighborhood_plaats) - set(['adres_recentste_wijk_delfshaven', 'adres_recentste_wijk_other', 'adres_recentste_buurt_other']))
        bool_list = (zero_or_ge_is_valid(district_neighborhood_plaats, df, 'adres_recentst_onderdeel_rdam') & bool_list)
        #print('4: ', bool_list.value_counts())      
        
        district_neighborhood_plaats.remove('adres_recentste_plaats_rotterdam')
        district_neighborhood = district_neighborhood_plaats
        bool_list = (zero_or_ge_is_valid(district_neighborhood, df, 'adres_recentste_plaats_rotterdam') & bool_list)
        #print('5: ', bool_list.value_counts())        
        
        district_neighborhood_matches = {'adres_recentste_wijk_noord':'adres_recentste_buurt_oude_noorden',
                                'adres_recentste_wijk_feijenoord':'adres_recentste_buurt_vreewijk',
                                'adres_recentste_wijk_ijsselmonde':'adres_recentste_buurt_groot_ijsselmonde',
                                'adres_recentste_wijk_delfshaven':'adres_recentste_buurt_nieuwe_westen'}
        for key, value in district_neighborhood_matches.items():
            bool_list = (zero_or_ge_is_valid([value], df, key) & bool_list)
            #print('5: ', bool_list.value_counts())
                        
        bool_list = (zero_must_match_is_valid(['adres_recentste_wijk_other'], df, 'adres_recentste_buurt_other') & bool_list)
        #print('6: ', bool_list.value_counts())        
        
        reading_vars = ['persoonlijke_eigenschappen_nl_lezen3', 'persoonlijke_eigenschappen_nl_lezen4']
        bool_list = (zero_one_hot_encoding_is_valid(reading_vars, df) & bool_list)
        #print('7: ', bool_list.value_counts())
        
        writing_vars = ['persoonlijke_eigenschappen_nl_schrijven0', 'persoonlijke_eigenschappen_nl_schrijven1', 'persoonlijke_eigenschappen_nl_schrijven2',
                'persoonlijke_eigenschappen_nl_schrijven3', 'persoonlijke_eigenschappen_nl_schrijvenfalse']
        bool_list = (zero_one_hot_encoding_is_valid(writing_vars, df) & bool_list)
        #print('8: ', bool_list.value_counts())
            
        speaking_vars = ['persoonlijke_eigenschappen_nl_spreken1', 'persoonlijke_eigenschappen_nl_spreken2',
           'persoonlijke_eigenschappen_nl_spreken3']
        bool_list = (zero_one_hot_encoding_is_valid(speaking_vars, df) & bool_list)
        #print('9: ', bool_list.value_counts())        
        
        bool_list = (zero_or_ge_is_valid(['afspraak_laatstejaar_aantal_woorden'], df, 'afspraak_aantal_woorden') & bool_list)
        #print('10: ', bool_list.value_counts())        
        
        bool_list = (zero_or_ge_is_valid(['afspraak_laatstejaar_resultaat_ingevuld_uniek'], df, 'afspraak_resultaat_ingevuld_uniek') & bool_list)
        #print('11: ', bool_list.value_counts())
        
        bool_list = (zero_or_ge_is_valid(['beschikbaarheid_huidig_afwijkend_wegens_medische_omstandigheden'], df, 'beschikbaarheid_huidig_bekend') & bool_list)
        #print('12: ', bool_list.value_counts())        
        
        bool_list = (zero_or_ge_is_valid(['beschikbaarheid_recent_afwijkend_wegens_medische_omstandigheden', 'beschikbaarheid_huidig_afwijkend_wegens_medische_omstandigheden'], df, 'beschikbaarheid_aantal_historie_afwijkend_wegens_medische_omstandigheden') & bool_list)
        #print('13: ', bool_list.value_counts())        
        
        bool_list = (zero_or_ge_is_valid(['beschikbaarheid_recent_afwijkend_wegens_sociaal_maatschappelijke_situatie'], df, 'beschikbaarheid_aantal_historie_afwijkend_wegens_sociaal_maatschappelijke_situatie') & bool_list)
        #print('14: ', bool_list.value_counts())        
        
        bool_list = (zero_one_hot_encoding_is_valid(['beschikbaarheid_recent_afwijkend_wegens_medische_omstandigheden', 
                                                             'beschikbaarheid_recent_afwijkend_wegens_sociaal_maatschappelijke_situatie'], df) & bool_list)
        #print('15: ', bool_list.value_counts())        
        
        contacten_matches = {'contacten_onderwerp__arbeids_motivatie':'contacten_onderwerp_boolean__arbeids_motivatie',
                 'contacten_onderwerp__pre__intake':'contacten_onderwerp_boolean__pre__intake',
                 'contacten_onderwerp__werk_intake':'contacten_onderwerp_boolean__werk_intake',
                 'contacten_onderwerp_beoordelen_taaleis':'contacten_onderwerp_boolean_beoordelen_taaleis',
                 'contacten_onderwerp_contact_derden':'contacten_onderwerp_boolean_contact_derden',
                 'contacten_onderwerp_contact_met_aanbieder':'contacten_onderwerp_boolean_contact_met_aanbieder',
                 'contacten_onderwerp_diagnosegesprek':'contacten_onderwerp_boolean_diagnosegesprek',
                 'contacten_onderwerp_documenten__innemen_':'contacten_onderwerp_boolean_documenten__innemen_',
                 'contacten_onderwerp_documenttype__cv_':'contacten_onderwerp_boolean_documenttype__cv_',
                 'contacten_onderwerp_documenttype__overeenkomst_':'contacten_onderwerp_boolean_documenttype__overeenkomst_',
                 'contacten_onderwerp_financiële_situatie':'contacten_onderwerp_boolean_financiële_situatie',
                 'contacten_onderwerp_groepsbijeenkomst':'contacten_onderwerp_boolean_groepsbijeenkomst',
                 'contacten_onderwerp_inkomen':'contacten_onderwerp_boolean_inkomen',
                 'contacten_onderwerp_maatregel_overweging':'contacten_onderwerp_boolean_maatregel_overweging',
                 'contacten_onderwerp_matching':'contacten_onderwerp_boolean_matching',
                 'contacten_onderwerp_mutatie':'contacten_onderwerp_boolean_mutatie',
                 'contacten_onderwerp_no_show':'contacten_onderwerp_boolean_no_show',
                 'contacten_onderwerp_overige':'contacten_onderwerp_boolean_overige',
                 'contacten_onderwerp_overleg_met_inkomen':'contacten_onderwerp_boolean_overleg_met_inkomen',
                 'contacten_onderwerp_scholing':'contacten_onderwerp_boolean_scholing',
                 'contacten_onderwerp_terugbelverzoek':'contacten_onderwerp_boolean_terugbelverzoek',
                 'contacten_onderwerp_traject':'contacten_onderwerp_boolean_traject',
                 'contacten_onderwerp_uitnodiging':'contacten_onderwerp_boolean_uitnodiging',
                 'contacten_onderwerp_ziek__of_afmelding':'contacten_onderwerp_boolean_ziek__of_afmelding',
                 'contacten_onderwerp_zorg':'contacten_onderwerp_boolean_zorg'}
        for key, value in contacten_matches.items():
            bool_list = (zero_or_ge_is_valid([value], df, key) & bool_list)
            #print('16: ', bool_list.value_counts())           
               
        bool_list = (zero_or_ge_is_valid(['relatie_kind_heeft_kinderen'], df, 'relatie_kind_huidige_aantal') & bool_list)
        #print('17: ', bool_list.value_counts())
        
        bool_list = (zero_or_ge_is_valid(['relatie_partner_huidige_partner___partner__gehuwd_'], df, 'relatie_partner_aantal_partner___partner__gehuwd_') & bool_list)
        #print('18: ', bool_list.value_counts())        
        
        bool_list = (zero_or_ge_is_valid(['pla_ondertekeningen_actueel'], df, 'pla_ondertekeningen_historie') & bool_list)
        #print('19: ', bool_list.value_counts())        
        
        ontheffing_vars = ['ontheffing_reden_hist_medische_gronden','ontheffing_reden_hist_other', 
                   'ontheffing_reden_hist_sociale_gronden',
                   'ontheffing_reden_hist_tijdelijke_ontheffing_arbeidsverpl__en_tegenprestatie',
                   'ontheffing_reden_hist_tijdelijke_ontheffing_arbeidsverplichtingen',
                   'ontheffing_reden_hist_vanwege_uw_sociaal_maatschappelijke_situatie',
                   'ontheffing_dagen_hist_vanwege_uw_medische_omstandigheden', 
                   'ontheffing_dagen_hist_mean']
        bool_list = (zero_must_match_is_valid(ontheffing_vars, df, 'ontheffing_hist_ind') & bool_list)
        #print('20: ', bool_list.value_counts())        
        
        typering_vars = ['typering_indicatie_geheime_gegevens', 'typering_other',
                 'typering_transport__logistiek___tuinbouw', 'typering_zorg__schoonmaak___welzijn',
                 'typering_aantal', 'typering_ind', 'typering_hist_inburgeringsbehoeftig', 
                 'typering_hist_sector_zorg', 'typering_dagen_som']
        bool_list = (zero_must_match_is_valid(typering_vars, df, 'typering_hist_ind') & bool_list)
        #print('21: ', bool_list.value_counts())        
        
        bool_list = (zero_or_ge_is_valid(['typering_hist_ind'], df, 'typering_hist_aantal') & bool_list)
        #print('22: ', bool_list.value_counts())
    #print how many rows satisfy (True) or violate (False) the business rules
    print('Number of rows matching business rules:\n', str(bool_list.value_counts()))
    
    #exclude rows which are in violation of any business rule from all dataframes in comb_list
    for index in range(len(comb_list)):
        df = comb_list[index]
        df = df[bool_list.values]
        comb_list[index] = df
        
    return comb_list
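
Note that the mask is accumulated over all dataframes before any filtering happens, so a row that violates a rule in one copy is dropped from every copy. A toy sketch of that behaviour, using a single made-up rule (`a <= b`) and hypothetical column names rather than the business rules above:

```python
import pandas as pd

def filter_with_shared_mask(copies, rule):
    # Start with every row marked valid, then AND in the rule result of each copy.
    mask = pd.Series(True, index=copies[0].index)
    for df in copies:
        mask = rule(df) & mask
    # Apply the same positional mask to all copies.
    return [df[mask.values] for df in copies]

base = pd.DataFrame({'a': [0, 1, 2], 'b': [1, 1, 2]})
copy_low = base.assign(b=0)   # copy where b is forced to 0
copy_high = base.assign(b=2)  # copy where b is forced to 2

kept = filter_with_shared_mask([copy_low, copy_high], lambda df: df['a'] <= df['b'])
print([len(df) for df in kept])  # [1, 1]
```

All three rows are valid in `copy_high`, yet only the first row survives in both copies, because the other two rows violate the rule in `copy_low`. This is why copies that make every row invalid should be removed beforehand via 'excluded_combinations'.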

## Generate Combinations Data

In [24]:
# Check excluded combinations
#
# @param comb_list: list of pandas data frames
# @param excluded_combinations: list of dictionaries specifying which combinations are not allowed
#
# @return temp_list: list of pandas data frames from which data frames that match one of the exclusion combinations have been removed
#
# Purpose: Certain variable value combinations for the list of combinations data frames can violate business rules.
# For instance, if we want to look at the number of children, the copy where number of children = 0 will violate business rules
# for observations where has children = 1. Conversely, setting number of children = 1 or greater will violate business
# rules for cases where has children = 0. Thus all observations will be in violation of business rules for some copy C_i.
# To account for this, this function removes copies C_i from the list of copies, when they match an exclusion restriction.
# These restrictions should be specified by the user to handle cases where every row would violate
# some business rule.
# Note that this function only runs on 'comb_list', i.e., a list of dataframes where for the variables in var_list
# all rows have the same value. This means that we don't need to check whether every single row matches an 
# exclusion combination, but it suffices to check the first row.
def check_excluded_combinations(comb_list, excluded_combinations):
    print('Length comb_list before exclusion: ' + str(len(comb_list)))
    temp_list = []
    for df in comb_list: #iterate over dataframes
        include = True
        for dic in excluded_combinations: #iterate over dictionaries specifying exclusion restrictions
            exclude = True
            for key in dic:
                if (df[key].values[0] != int(dic[key])): #case: an exclusion restriction is violated, i.e., the df doesn't need to be removed
                    exclude = False
                    break
            if exclude:
                include = False
                break
        if include: #case: no exclusion restriction has been violated, the df can stay
            temp_list.append(df)
    print('Length of comb_list after exclusion: ' + str(len(temp_list)))
    return temp_list
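
As a small illustration of the first-row check described above, the sketch below uses a condensed re-implementation on toy data. The helper names `matches_exclusion` and `drop_excluded` and the column names `has_children`, `n_children` and `x` are hypothetical and only serve the example.

```python
import pandas as pd

def matches_exclusion(df, exclusion):
    # All rows of a combination copy share the same values for the listed
    # variables, so inspecting the first row is enough.
    return all(df[key].values[0] == int(value) for key, value in exclusion.items())

def drop_excluded(comb_list, excluded_combinations):
    # Keep only the copies that match none of the exclusion dictionaries.
    return [df for df in comb_list
            if not any(matches_exclusion(df, exc) for exc in excluded_combinations)]

base = pd.DataFrame({'x': [10, 20]})
copies = [base.assign(has_children=h, n_children=n)
          for h in (0, 1) for n in (0, 2)]

excluded = [{'has_children': '1', 'n_children': '0'},
            {'has_children': '0', 'n_children': '2'}]

kept = drop_excluded(copies, excluded)
kept_pairs = [(int(df['has_children'].values[0]), int(df['n_children'].values[0]))
              for df in kept]
print(kept_pairs)  # [(0, 0), (1, 2)]
```

Of the four copies, the two contradictory ones are removed, and only the consistent combinations remain.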

In [25]:
# Concat combinations
#
# @param new_data: dataframe to be copied for each combination of variable values
# @param variable_list: list of list of dictionaries specifying each variable and corresponding values
# @param excluded_combinations: list of dictionaries specifying which combinations are not allowed
#
# @return new_data: concatenated copies of input new_data, one copy for every possible combination of variable values
def concat_combinations(new_data, variable_list, excluded_combinations):
    comb_list = [new_data] #put new_data into a list
    
    #each nested list corresponds to a single 'feature'. This can either be a single variable or multiple One Hot Encoded vars
    #iterate over nested lists
    for nested_list in variable_list:
        #One-hot encoded case
        if len(nested_list) > 1:
            
            #extract all variable names which are OHE
            OHE_vars = []
            for dic in nested_list:
                OHE_vars.append(list(dic.keys())[0])
            #set all OHE vars to zero
            for df in comb_list:
                df.loc[:, OHE_vars] = 0
                
            #for each OHE var create a copy of comb_list and set the var to 1
            comb_list_temp = []
            for cur_var in OHE_vars:
                temp = copy.deepcopy(comb_list)
                for df in temp:
                    df[cur_var] = 1
                comb_list_temp = comb_list_temp + temp
            
            #set comb_list equal to all the newly created copies
            comb_list = comb_list_temp
                
        #Single variable case
        else:
            dic = nested_list[0]
            var = list(dic.keys())[0] #extract var name
            vals = dic[var] #get values for the variable
            
            #if a variable has more than 20 unique values, take a random sample of those values.
            if len(vals) > 20:
                random.seed(1)
                vals = random.sample(vals, 20)
                
            #for each value, create a copy of comb_list and set var equal to value
            comb_list_temp = []
            for value in vals:
                temp = copy.deepcopy(comb_list)
                for df in temp:
                    df[var] = value
                comb_list_temp = comb_list_temp + temp
                
            #set comb_list equal to all the newly created copies
            comb_list = comb_list_temp
    
    #exclude prohibited combinations of specified features
    comb_list = check_excluded_combinations(comb_list, excluded_combinations)
    
    #exclude cases which violate business rules
    comb_list = check_bus_rules_violations(comb_list)
    
    #set new_data equal to all the possible combinations
    new_data = pd.concat(comb_list)
    
    return new_data
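
The number of copies created before any exclusion is the product of the number of options per feature: a one-hot encoded group contributes one option per variable in the group, and a single variable contributes one option per value (capped at 20 by the sampling above). A sketch that computes this count; `count_combinations` is a hypothetical helper and the variable names are invented:

```python
from math import prod

def count_combinations(variable_list, max_values=20):
    options = []
    for nested_list in variable_list:
        if len(nested_list) > 1:
            # One-hot encoded group: one copy per variable that is set to 1.
            options.append(len(nested_list))
        else:
            # Single variable: one copy per value, capped at max_values.
            (values,) = nested_list[0].values()
            options.append(min(len(values), max_values))
    return prod(options)

variable_list = [
    [{'var_a': [0, 1]}],
    [{'ohe_1': [0, 1]}, {'ohe_2': [0, 1]}, {'ohe_3': [0, 1]}],
    [{'var_b': list(range(50))}],
]
n_copies = count_combinations(variable_list)
print(n_copies)  # 2 * 3 * 20 = 120
```

Since every copy holds all rows of the input dataframe, this count is a quick way to estimate the size of the combinations data before running the generator.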

In [26]:
# Combine Varlist
#
# @param df: dataframe to be copied for each combination of variable values in variable_list
# @param variable_list: list of list of dictionaries specifying each variable and corresponding values
# @param excluded_combinations: list of dictionaries specifying which combinations are not allowed
# @param data_type: string specifying which data type this particular df is, e.g., 'real_comb' or 'synth_comb'
#
# @return df: concatenated copies of input df, one copy for every possible combination of variable values
def combine_varlist(df, variable_list, excluded_combinations, data_type):
    df = concat_combinations(df.copy(), variable_list, excluded_combinations)
    df['data_type'] = data_type #specify data_type
    print(data_type + ' shape: ' + str(df.shape))
    return df


## Save Data

In [27]:
# Save Data
#
# @param data_files: list of pandas dataframes, to be combined and saved
# @param dest_filename: filename where data_files are to be saved
def save_data(data_files, dest_filename):
    data_exp = pd.concat(data_files) #concatenate data_files
    print('Final data shape: ' + str(data_exp.shape)) #print out shape of concatenated dataframe
    data_exp.to_csv(dest_filename, index = False) #save
    print('Data has been saved to ' + dest_filename) #print save message

## Data Generator

In [28]:
#CHECK: set flag depending on whether you have access to the real training data
training_access = False

# Data Generator
#
# Purpose: wrapper function; once all the other functions have been loaded, you only need to run this function to
# generate data for each config file in the config directory ('../conf/archetypes' by default, see CHECK below)
def data_generator():
    directory_in_str = '../conf/archetypes' #CHECK: file path of ini files
    directory = os.fsencode(directory_in_str)
    
    
    #iterate over all ini config files
    for file in os.listdir(directory):
        filename = directory_in_str + "/" + os.fsdecode(file)
        
        #only load ini files
        if not filename.endswith('.ini'):
            continue
        print('Reading: ' + filename)
        
        #read ini file
        conf = read_config(filename)
        
        #load synthetic and real data
        if training_access:
            real = load_data(conf['real_fp'])
        synth = load_data(conf['synth_fp'])
        
        #check user inputs
        check_user_inputs(synth, conf['variable_list'])
        
        #generate copy of real data for simple statistical parity test
        if training_access:
            real_exp = slice_data(real, conf['variable_list'], 'real') 
        synth_exp = slice_data(synth, conf['variable_list'], 'synth')
        
        #generate data for conditional statistical parity test
        if training_access:
            real_comb = combine_varlist(real, conf['variable_list'], conf['excluded_combinations'], 'real_conditional')
            if real_comb.empty:
                raise Exception('Real Combinations Dataframe is empty! You might want to specify excluded combinations in the config file...')
        synth_comb = combine_varlist(synth, conf['variable_list'], conf['excluded_combinations'], 'synth_conditional')
        if synth_comb.empty:
            raise Exception('Synth Combinations Dataframe is empty! You might want to specify excluded combinations in the config file...')
        
        if training_access:
            save_data([real_exp, real_comb, synth_exp, synth_comb], conf['dest_filename'])
        else:
            save_data([synth_exp, synth_comb], conf['dest_filename'])
        print()

data_generator()

Reading: ../conf/archetypes/arch_combined_max.ini
Destination Filename: ../data/03_experiment_input/20221119_jb_arch_combined_max.csv
Real Source FP: ../data/00_hidden/td_numeric.csv
Synth Source FP: ../data/01_raw/synth_data.csv
Variable list:
[[{'persoon_geslacht_vrouw': ['1']}], [{'relatie_partner_totaal_dagen_partner': ['720']}], [{'relatie_kind_huidige_aantal': ['2']}], [{'relatie_kind_heeft_kinderen': ['1']}], [{'relatie_kind_leeftijd_verschil_ouder_eerste_kind': ['20']}], [{'relatie_kind_basisschool_kind': ['2']}], [{'relatie_overig_historie_vorm__andere_inwonende': ['3']}], [{'persoonlijke_eigenschappen_taaleis_voldaan': ['0']}], [{'persoonlijke_eigenschappen_dagen_sinds_taaleis': ['0']}], [{'persoonlijke_eigenschappen_uitstroom_verw_vlgs_km': ['1']}], [{'adres_recentste_wijk_delfshaven': ['1']}], [{'adres_recentste_wijk_stadscentru': ['0']}], [{'adres_recentste_wijk_charlois': ['0']}], [{'adres_recentste_wijk_feijenoord': ['0']}], [{'adres_recentste_wijk_ijsselmonde': ['0']}],

All chosen variables correspondond to columns in the dataset
synth copied, shape: (251, 316)
Length comb_list before exclusion: 16
Length of comb_list after exclusion: 2
Number of rows matching business rules:
 True     12508
False      137
dtype: int64
synth_conditional shape: (25016, 316)
Final data shape: (25267, 316)
Data has been saved to ../data/03_experiment_input/20221119_jb_parent.csv

Reading: ../conf/archetypes/arch_migrant_worker_neighborhood.ini
Destination Filename: ../data/03_experiment_input/20221119_jb_migrant_worker_neighborhood.csv
Real Source FP: ../data/00_hidden/td_numeric.csv
Synth Source FP: ../data/01_raw/synth_data.csv
Variable list:
[[{'relatie_overig_historie_vorm__andere_inwonende': ['0', '3']}], [{'adres_recentste_wijk_delfshaven': ['ALL']}, {'adres_recentste_wijk_stadscentru': ['ALL']}], [{'adres_recentste_wijk_charlois': ['0']}], [{'adres_recentste_wijk_feijenoord': ['0']}], [{'adres_recentste_wijk_ijsselmonde': ['0']}], [{'adres_recentste_wijk_kralingen

All chosen variables correspondond to columns in the dataset
synth copied, shape: (0, 316)
Length comb_list before exclusion: 8
Length of comb_list after exclusion: 4
Number of rows matching business rules:
 True    12645
dtype: int64
synth_conditional shape: (50580, 316)
Final data shape: (50580, 316)
Data has been saved to ../data/03_experiment_input/20221119_jb_migrant_worker_language.csv

Reading: ../conf/archetypes/arch_migrant_worker_comment_no_comment.ini
Destination Filename: ../data/03_experiment_input/20221119_jb_migrant_worker_comment_no_comment.csv
Real Source FP: ../data/00_hidden/td_numeric.csv
Synth Source FP: ../data/01_raw/synth_data.csv
Variable list:
[[{'relatie_overig_historie_vorm__andere_inwonende': ['3']}], [{'persoonlijke_eigenschappen_taaleis_voldaan': ['0']}], [{'persoonlijke_eigenschappen_dagen_sinds_taaleis': ['0']}], [{'contacten_onderwerp_boolean_beoordelen_taaleis': ['0']}], [{'afspraak_verzenden_beschikking_i_v_m__niet_voldoen_aan_wet_taaleis': ['0']}], 