<!-- Simon-Style -->
<p style="font-size:19px; text-align:left; margin-top:    15px;"><i>German Association of Actuaries (DAV) — Working Group "Explainable Artificial Intelligence"</i></p>
<p style="font-size:25px; text-align:left; margin-bottom: 15px"><b>Use Case SOA GLTD Experience Study:<br>
Data initialisation
</b></p>
<p style="font-size:19px; text-align:left; margin-bottom: 15px">Guido Grützner (<a href="mailto:guido.gruetzner@quantakt.com">guido.gruetzner@quantakt.com</a>)</p>


This notebook reads the SOA GLTD experience study data in the original zipped format (zipped csv) as found in "https://cdn-files.soa.org/2019-group-ltd-exp-studies/2009-2013-gltd-consolidated-database.zip" and transforms the file into several smaller files in feather format.

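The directory that contains your local copy of the zip file has to be assigned to the variable `datadir` in line 8 of the first code block below. Include the trailing slash, because file names are appended to `datadir` by plain string concatenation.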

The number of resulting files after the split is determined by the variable `anzblock` defined below: the script writes `anzblock` regular files plus one additional file. The current number is sufficiently large, and the resulting files sufficiently small, that one(!) of those files can be read on a PC with 8 GB of RAM. The script itself should run on an 8 GB RAM laptop in roughly twenty minutes. It only needs to run once. The total size of all .feather files is roughly 2 GB.

Each of the resulting split files contains a random sample (without replacement) of the total database. The split files are disjoint, and the union of all split files is the total database.

In [1]:
import os
import numpy as np
import pandas as pd
pd.options.mode.copy_on_write = True

# Adjust according to your local setup
# LOCATION OF GLTD DATA
datadir = "d:/tmp/GLTD data/"
fn_in = "2009-2013-gltd-consolidated-database.zip"


incols = ['Study_ID', 'Elimination_Period', 'Calendar_Year', 'Calendar_Month',
          'Duration_Month', 'Age_at_Disability', 'Diagnosis_Category',
          'OwnOccToAnyTransition', 'Gender', 'Attained_Age',
          'Mental_and_Nervous_Period', 'M_N_Limit_Transition',
          'Gross_Indexed_Benefit_Amount', 'Industry', 'Indexed_Monthly_Salary',
          'Taxability_of_Benefits', 'Integration_with_STD', 'Case_Size',
          'Residence_State', 'COLA_Indicator', 'Benefit_Max_Limit_Proxy',
          'Replacement_Ratio', 'Original_Social_Security_Award_Status',
          'Updated_Social_Security_Award_Status', 'Exposures',
          'Actual_Recoveries', 'Actual_Deaths', 'Settlement_Counts',
          'Max_Out_Counts', 'Limits_Count']

# Please, don't change the random seed when working in a team
# You will select base data different from everyone else!
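seed = 47110815
# pass the seed explicitly; SeedSequence() without arguments draws fresh
# entropy from the OS, which would give a different split on every run
sq = np.random.SeedSequence(seed)
print('seed = {}'.format(sq.entropy))
rng = np.random.default_rng(seed=sq)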

seed = 47110815
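
The comment in the cell above asks team members to keep the seed fixed. The following minimal sketch illustrates why: two generators built from the same seed produce the same permutation, so everyone obtains the same split. The helper `permuted_ids` and the toy size `n=10` are assumptions made for illustration only and are not part of the notebook.

```python
import numpy as np

seed = 47110815  # the seed value printed by the notebook


def permuted_ids(seed, n=10):
    """Return a permutation of 0..n-1 from a generator seeded with `seed`."""
    rng = np.random.default_rng(np.random.SeedSequence(seed))
    return rng.permutation(n)


first = permuted_ids(seed)
second = permuted_ids(seed)

# Identical seeds give identical permutations
same = bool(np.array_equal(first, second))
print(same)
```

A `SeedSequence` created without arguments draws fresh entropy from the operating system, so a generator built from it would generally not reproduce an earlier run.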


Groups are created by `Study_ID`. Each `Study_ID` is assigned to a group at random, but all records (= lines of the csv) with the same `Study_ID` are guaranteed to end up in the same group.
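
The grouping idea can be sketched on toy data. This is a simplified illustration, not the notebook's exact scheme: the toy IDs, the seed `12345`, `n_groups` and the round-robin assignment via modulo are all assumptions, whereas the notebook uses contiguous blocks and a separate remainder block.

```python
import numpy as np
import pandas as pd

# Toy records: several lines per (hypothetical) Study_ID
records = pd.DataFrame({
    "Study_ID": [11, 11, 12, 13, 13, 13, 14, 15, 15, 16],
    "Exposures": np.arange(10, dtype=float),
})

rng = np.random.default_rng(12345)  # arbitrary seed for this sketch
ids = records["Study_ID"].unique()
shuffled = rng.permutation(ids)

n_groups = 3
# Position-based assignment: the k-th shuffled ID goes to group k % n_groups
id2grp = pd.DataFrame({
    "Study_ID": shuffled,
    "grp": np.arange(shuffled.size) % n_groups,
})

labelled = records.merge(id2grp, on="Study_ID", how="left")

# Every Study_ID maps to exactly one group
groups_per_id = labelled.groupby("Study_ID")["grp"].nunique()
one_group_per_id = bool((groups_per_id == 1).all())

# The groups are disjoint and together cover all records
parts = [part for _, part in labelled.groupby("grp")]
covers_all = sum(len(part) for part in parts) == len(records)
print(one_group_per_id, covers_all)
```

Because the mapping is built from plain arrays, IDs and groups are paired by position, which keeps the random order of `shuffled` intact.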

In [2]:
# Determine groups
rawtbl = pd.read_table(datadir + fn_in, usecols=["Study_ID"])

In [3]:
id_uq = pd.Series(rawtbl["Study_ID"].unique())
n = id_uq.size
# apply a random permutation to the sequence of IDs
id_uq = id_uq.sample(n=n, random_state=rng, replace=False)
# number of regular blocks; one extra block is added below,
# so anzblock + 1 files are written in total
anzblock = 4  # should be OK for 8GB RAM
# unique IDs per block
nid_bk = np.floor(n * 0.9 / anzblock).astype(int)
# Rest is UHM data which should be kept separate
nid_uhm = (n - anzblock * nid_bk).astype(int)
# names of output files; the extension includes the "."
fn_ext = ".gz"
nm_out = ["gltd09_13_pt" + str(i) + fn_ext for i in range(anzblock)]
nm_out.append("uhmgltd09_13" + fn_ext)

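# create groups mapping
tt = np.array([np.repeat(igrp, nid_bk) for igrp in range(anzblock)]).ravel()
grp = np.concatenate((tt, np.repeat(anzblock, nid_uhm)))
# pair IDs and groups by position; passing two Series would align them on
# their index labels and thereby undo the random permutation of id_uq
id2grp = pd.DataFrame({"id_uq": id_uq.to_numpy(), "grp": grp})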

In [4]:
with pd.read_table(datadir + fn_in, usecols=incols, engine="c",
                   chunksize=200000) as reader:
    for chunk in reader:
        tt = chunk.merge(id2grp, left_on='Study_ID', right_on='id_uq')
        tt.drop(["id_uq"], axis=1, inplace=True)
        grouped = tt.groupby(["grp"])

        for fnum, group in grouped:
            fn = datadir + nm_out[fnum[0]]
            flg_header = not os.path.isfile(fn)
            group.drop(["grp"], axis=1, inplace=True)
            group.to_csv(fn, sep="\t", header=flg_header, mode="a",
                         index=False, index_label=False)
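
The cell above appends every chunk to a `.gz` file. The sketch below, using only the standard library, illustrates the underlying mechanism: each append adds a new gzip member, and reading the file back yields the concatenation of all members. The file name `demo.gz` and the toy rows are assumptions for illustration.

```python
import gzip
import os
import tempfile

tmpdir = tempfile.mkdtemp()
fn = os.path.join(tmpdir, "demo.gz")  # hypothetical file name

# The first chunk carries the header line, later chunks only data lines
chunks = [["a\tb", "1\t2"], ["3\t4"], ["5\t6"]]
for lines in chunks:
    # mode "at" appends a new gzip member in text mode
    with gzip.open(fn, "at", encoding="utf-8") as fh:
        for line in lines:
            fh.write(line + "\n")

# Reading decompresses all members in sequence
with gzip.open(fn, "rt", encoding="utf-8") as fh:
    rows = fh.read().splitlines()
print(rows)
```

Since the cell opens its output files in append mode and writes the header only when a file does not exist yet, output files left over from an earlier run should probably be deleted before the cell is executed again; otherwise the records would be written twice.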

In [5]:
nm_feather = [nm[:-len(fn_ext) + 1] + "feather" for nm in nm_out]
for ifile in range(len(nm_out)):
    fn_csv = datadir + nm_out[ifile]
    fn_feather = datadir + nm_feather[ifile]
    if os.path.isfile(fn_csv):
        print(ifile)
        tt = pd.read_csv(fn_csv, sep="\t")
        tt.to_feather(fn_feather)

0
1
2
3
4


# Trust is good, control is better!

These cells only need to be run for validation purposes, once after changes have been made to the code above. Note that you need a machine with sufficient RAM ($\geq$ 32 GB) to run these validations.

In [6]:
# # repeat assignments to make this run stand-alone
# import numpy as np
# import pandas as pd
# pd.options.mode.copy_on_write = True

# datadir = "d:/tmp/GLTD data/"
# fn_in = "2009-2013-gltd-consolidated-database.zip"
# rawtbl = pd.read_table(datadir + fn_in)

# anzblock = 4
# fn_ext = ".gz"
# nm_out = ["gltd09_13_pt" + str(i) + fn_ext for i in range(anzblock)]
# nm_out.append("uhmgltd09_13" + fn_ext)
# nm_feather = [nm[:-len(fn_ext) + 1] + "feather" for nm in nm_out]
# tt = [pd.read_feather(datadir + ifile) for ifile in nm_feather]
# df_in = pd.concat(tt, axis=0)

In [7]:
# iduq_raw = pd.unique(rawtbl["Study_ID"])
# iduq_df = pd.unique(df_in["Study_ID"])
# # compare as sets: the order of first appearance may differ after the split
# set(iduq_raw) == set(iduq_df)

In [8]:
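# # get records per Study_ID
# rawrecperid = rawtbl["Study_ID"].value_counts()
# dfrecperid = df_in["Study_ID"].value_counts()
# # sort by Study_ID so that both Series carry the same labels in the same order
# rawrecperid.sort_index().equals(dfrecperid.sort_index())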

In [9]:
# abs(rawtbl.Exposures.sum() - df_in.Exposures.sum()) < 1e-7
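
The three checks can be tried on toy data without loading the full database. The following is a sketch under assumptions: the frames `raw` and `rebuilt` and their values are invented, and `rebuilt` merely mimics the reordering of rows that splitting and concatenating introduces.

```python
import pandas as pd

raw = pd.DataFrame({
    "Study_ID": [1, 1, 2, 3, 3, 3],
    "Exposures": [0.5, 0.25, 1.0, 0.125, 0.5, 0.25],
})
# Same records in a different row order, as after split and concatenation
rebuilt = raw.sample(frac=1.0, random_state=0).reset_index(drop=True)

# 1. Same set of Study_IDs, independent of order
same_ids = set(raw["Study_ID"]) == set(rebuilt["Study_ID"])

# 2. Same number of records per Study_ID, compared after sorting by ID
raw_counts = raw["Study_ID"].value_counts().sort_index()
rebuilt_counts = rebuilt["Study_ID"].value_counts().sort_index()
same_counts = raw_counts.equals(rebuilt_counts)

# 3. Same total exposure up to floating point tolerance
same_exposure = bool(
    abs(raw["Exposures"].sum() - rebuilt["Exposures"].sum()) < 1e-7
)
print(same_ids, same_counts, same_exposure)
```

Comparing sets and index-sorted counts keeps the checks independent of the row order, which an element-wise comparison of `pd.unique` results is not.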