# Data preprocessing

The data for this project comes from the back-office system of de Volksbank Financial Markets, partly supplemented with additional data from Bloomberg.
The data for this study consists of the following datasets:

- Bond data
- Bond prices
- Government Yield curves
- Inflation data

All data has been extracted from the source systems and stored in CSV files. This notebook walks through the data preparation. All steps described here can also be run automatically by calling the make_data routine, as shown at the end of this notebook.

In [4]:
%load_ext autoreload
%autoreload 2

import sys
import pandas as pd

sys.path.insert(0, "..") 
from src.data import make_dataset, join_data
from src.features import build_features

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


# Bond data

In [8]:

# Get bond data, drop unneeded columns, convert formats and strip trailing blanks
df_bonds = make_dataset.get_bond_data()

2022-01-27 21:47:14.273 | INFO     | src.data.make_dataset:get_bond_data:34 - Load bond data
2022-01-27 21:47:14.274 | INFO     | src.data.make_dataset:read_csv:23 - Loading data from ..\data\raw\bonds.csv


Impute missing values for the issue rating.
Where the issue rating is missing, it is filled with the most frequent issue rating for the same issuer.

Where the CFI code is missing, it is filled with the code for 'unknown': DXXXXX.

Dates with the value 1899-12-30 are default values from the source system. These are removed.
Where the first coupon date is missing (e.g. for zero-coupon bonds), it is filled with the issue date.
This makes it easy to compute the bond's term to maturity.
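
The imputation rules above can be sketched in pandas as follows. This is a minimal illustration and not the actual body of `make_dataset.impute_bonds`: the function name `impute_bonds_sketch` and the sample rows are hypothetical, the column names are taken from the `df_bonds.info()` output, the default date is assumed to be treated as a missing value rather than a reason to drop the row, and the term is assumed to run from the first coupon date to the maturity date, in days.

```python
import pandas as pd

# Hypothetical sample rows; column names follow the df_bonds.info() output.
df = pd.DataFrame({
    "issuer_name": ["A", "A", "A", "B", "B"],
    "issue_rating": ["AAA", "AAA", None, "AA", None],
    "cfi_code": ["DBFTFB", None, "DBFTFB", None, "DBZTFB"],
    "issue_dt": pd.to_datetime(
        ["2010-01-15", "2011-03-01", "2012-06-30", "2015-02-10", "2016-09-01"]),
    "first_coupon_date": pd.to_datetime(
        ["2011-01-15", "1899-12-30", "2013-06-30", "2016-02-10", None]),
    "mature_dt": pd.to_datetime(
        ["2020-01-15", "2021-03-01", "2022-06-30", "2025-02-10", "2026-09-01"]),
})


def impute_bonds_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the imputation rules described in the text to a copy of df."""
    out = df.copy()

    # 1. Missing issue rating -> most frequent rating of the same issuer.
    rated = out.dropna(subset=["issue_rating"])
    mode_by_issuer = rated.groupby("issuer_name")["issue_rating"].agg(
        lambda s: s.mode().iloc[0])
    out["issue_rating"] = out["issue_rating"].fillna(
        out["issuer_name"].map(mode_by_issuer))

    # 2. Missing CFI code -> the 'unknown' code.
    out["cfi_code"] = out["cfi_code"].fillna("DXXXXX")

    # 3. The source system's default date is treated as missing (assumption).
    default_date = pd.Timestamp("1899-12-30")
    for col in ["issue_dt", "first_coupon_date", "mature_dt"]:
        out[col] = out[col].mask(out[col] == default_date)

    # 4. Missing first coupon date -> issue date.
    out["first_coupon_date"] = out["first_coupon_date"].fillna(out["issue_dt"])

    # 5. Term in days from first coupon date to maturity (assumed definition).
    out["bond_duration"] = (out["mature_dt"] - out["first_coupon_date"]).dt.days
    return out


result = impute_bonds_sketch(df)
```

Issuers without a single known rating stay missing in this sketch, because they never appear in the per-issuer lookup.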

In [9]:
df_bonds = make_dataset.impute_bonds(df_bonds)

2022-01-27 21:47:19.816 | INFO     | src.data.make_dataset:impute_bonds:73 - Impute bond data


In [11]:
df_bonds.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 226 entries, 0 to 229
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   ccy                226 non-null    string        
 1   country            226 non-null    string        
 2   bond_ext_name      226 non-null    string        
 3   issue_dt           226 non-null    datetime64[ns]
 4   first_coupon_date  226 non-null    datetime64[ns]
 5   mature_dt          226 non-null    datetime64[ns]
 6   isin               226 non-null    string        
 7   issuer_name        226 non-null    string        
 8   coupon_frq         226 non-null    string        
 9   coupon             226 non-null    float64       
 10  tot_issue          226 non-null    float64       
 11  cfi_code           226 non-null    string        
 12  issue_rating       226 non-null    string        
 13  bond_duration      226 non-null    int64         
 14  issue     

In [12]:
make_dataset.save_pkl('bonds', df_bonds)

2022-01-27 21:47:34.784 | INFO     | src.data.make_dataset:save_pkl:322 - Save preprocessed bonds data


# Bond prices

In [13]:
df_price = make_dataset.get_price()

2022-01-27 21:47:36.809 | INFO     | src.data.make_dataset:get_price:107 - Load bond price data
2022-01-27 21:47:36.810 | INFO     | src.data.make_dataset:read_csv:23 - Loading data from ..\data\raw\price.csv


In [14]:
df_price = make_dataset.impute_price(df_price)

2022-01-27 21:47:38.698 | INFO     | src.data.make_dataset:impute_price:131 - Impute bond price


In [15]:
df_price.head()

Unnamed: 0,reference_identifier,ccy,rate_dt,mid,lastday
0,DE0001135143,EUR,2010-12-17,136.76,2010-12-31
1,NL0000102275,EUR,2010-12-17,103.39,2010-12-31
2,DE0001135424,EUR,2010-12-17,95.453,2010-12-31
3,NL0009446418,EUR,2010-12-17,102.69,2010-12-31
4,NL0000102234,EUR,2010-12-17,106.22,2010-12-31


In [16]:
df_price.describe()

Unnamed: 0,mid
count,223950.0
mean,110.80075
std,15.315898
min,84.429
25%,102.137
50%,106.278
75%,112.225
max,195.749


In [17]:
df_price.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 223950 entries, 0 to 235985
Data columns (total 5 columns):
 #   Column                Non-Null Count   Dtype         
---  ------                --------------   -----         
 0   reference_identifier  223950 non-null  string        
 1   ccy                   223950 non-null  string        
 2   rate_dt               223950 non-null  datetime64[ns]
 3   mid                   223950 non-null  float64       
 4   lastday               223950 non-null  object        
dtypes: datetime64[ns](1), float64(1), object(1), string(2)
memory usage: 10.3+ MB


In [18]:
make_dataset.save_pkl('price', df_price)

2022-01-27 21:47:47.932 | INFO     | src.data.make_dataset:save_pkl:322 - Save preprocessed price data


# Government Yield curves

In [19]:
df_yield = make_dataset.get_yield()

2022-01-27 21:47:51.860 | INFO     | src.data.make_dataset:get_yield:149 - Load goverment yield curve data
2022-01-27 21:47:51.862 | INFO     | src.data.make_dataset:read_csv:23 - Loading data from ..\data\raw\yield.csv


In [20]:
df_yield = make_dataset.impute_yield(df_yield)

2022-01-27 21:47:53.842 | INFO     | src.data.make_dataset:impute_yield:180 - Impute yield curve


In [21]:
df_yield.tail()

Unnamed: 0,country,rate_dt,timeband,ratename,ccy,actual_dt,datedays,bid,offer,int_basis,time,mid
117793,Spain,2022-01-07,5 YEARS,GOV Yield Curve ES BB,EUR,2027-01-07,1830,0.026,0.013,ANNUAL,1826 days,0.0195
117795,Spain,2022-01-07,6 YEARS,GOV Yield Curve ES BB,EUR,2028-01-07,2195,0.097,0.079,ANNUAL,2191 days,0.088
117796,Spain,2022-01-07,7 YEARS,GOV Yield Curve ES BB,EUR,2029-01-07,2561,0.175,0.158,ANNUAL,2557 days,0.1665
117797,Spain,2022-01-07,8 YEARS,GOV Yield Curve ES BB,EUR,2030-01-07,2926,0.332,0.319,ANNUAL,2922 days,0.3255
117799,Spain,2022-01-07,9 YEARS,GOV Yield Curve ES BB,EUR,2031-01-07,3293,0.489,0.477,ANNUAL,3287 days,0.483


In [22]:
df_yield.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 90506 entries, 1 to 117799
Data columns (total 12 columns):
 #   Column     Non-Null Count  Dtype          
---  ------     --------------  -----          
 0   country    90506 non-null  string         
 1   rate_dt    90506 non-null  datetime64[ns] 
 2   timeband   90506 non-null  string         
 3   ratename   90506 non-null  string         
 4   ccy        90506 non-null  string         
 5   actual_dt  90506 non-null  datetime64[ns] 
 6   datedays   90506 non-null  int64          
 7   bid        90506 non-null  float64        
 8   offer      90506 non-null  float64        
 9   int_basis  90506 non-null  string         
 10  time       90506 non-null  timedelta64[ns]
 11  mid        90506 non-null  float64        
dtypes: datetime64[ns](2), float64(3), int64(1), string(5), timedelta64[ns](1)
memory usage: 9.0 MB


In [23]:
make_dataset.save_pkl('yield', df_yield)

2022-01-27 21:47:59.887 | INFO     | src.data.make_dataset:save_pkl:322 - Save preprocessed yield data


# Inflation data


In [24]:
df_inflation = make_dataset.get_inflation()  

2022-01-27 21:48:04.092 | INFO     | src.data.make_dataset:get_inflation:207 - Load goverment yield curve data
2022-01-27 21:48:04.093 | INFO     | src.data.make_dataset:read_csv:23 - Loading data from ..\data\raw\DE Inflation.csv
2022-01-27 21:48:04.111 | INFO     | src.data.make_dataset:read_csv:23 - Loading data from ..\data\raw\FR Inflation.csv
2022-01-27 21:48:04.128 | INFO     | src.data.make_dataset:read_csv:23 - Loading data from ..\data\raw\ES Inflation.csv
2022-01-27 21:48:04.147 | INFO     | src.data.make_dataset:read_csv:23 - Loading data from ..\data\raw\IT Inflation.csv
2022-01-27 21:48:04.166 | INFO     | src.data.make_dataset:read_csv:23 - Loading data from ..\data\raw\US Inflation.csv


In [25]:
df_inflation = make_dataset.impute_inflation(df_inflation)

2022-01-27 21:48:20.253 | INFO     | src.data.make_dataset:impute_inflation:235 - Impute inflation curve


In [26]:
df_inflation.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 215007 entries, 0 to 216873
Data columns (total 7 columns):
 #   Column     Non-Null Count   Dtype          
---  ------     --------------   -----          
 0   country    215007 non-null  string         
 1   rate_dt    215007 non-null  datetime64[ns] 
 2   timeband   215007 non-null  string         
 3   inflation  215007 non-null  float64        
 4   ratename   215007 non-null  string         
 5   actual_dt  215007 non-null  datetime64[ns] 
 6   time       215007 non-null  timedelta64[ns]
dtypes: datetime64[ns](2), float64(1), string(3), timedelta64[ns](1)
memory usage: 13.1 MB


In [27]:
df_inflation.head()

Unnamed: 0,country,rate_dt,timeband,inflation,ratename,actual_dt,time
0,Germany,2021-12-23,1 YEAR,3.28625,Inflation,2022-12-23,365 days
1,Germany,2021-12-22,1 YEAR,3.33875,Inflation,2022-12-22,365 days
2,Germany,2021-12-21,1 YEAR,3.15625,Inflation,2022-12-21,365 days
3,Germany,2021-12-20,1 YEAR,3.01375,Inflation,2022-12-20,365 days
4,Germany,2021-12-17,1 YEAR,2.89875,Inflation,2022-12-17,365 days


In [28]:
make_dataset.save_pkl('inflation', df_inflation)

2022-01-27 21:48:21.172 | INFO     | src.data.make_dataset:save_pkl:322 - Save preprocessed inflation data


In [29]:
df_bp = join_data.join_price(df_bonds, df_price)
df_bp = build_features.add_duration(df_bp)
make_dataset.save_pkl('bp', df_bp)


2022-01-27 21:48:30.823 | INFO     | src.features.build_features:add_duration:13 - Add remaining duration...
2022-01-27 21:48:30.837 | INFO     | src.data.make_dataset:save_pkl:322 - Save preprocessed bp data


To add the term spread, we need to join the bond data with the government yield curve data.

In [30]:
df_bpy = join_data.join_yield(df_bp, df_yield)
df_bpy = build_features.add_term_spread(df_bpy)
df_bpy = build_features.add_bid_offer_spread(df_bpy)
make_dataset.save_pkl('bpy', df_bpy)

2022-01-27 21:48:48.221 | INFO     | src.features.build_features:add_term_spread:64 - Add term spread...
2022-01-27 21:48:48.226 | INFO     | src.features.build_features:add_bid_offer_spread:75 - Add bid offer spread...
2022-01-27 21:48:48.510 | INFO     | src.data.make_dataset:save_pkl:322 - Save preprocessed bpy data
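
The join-then-derive pattern used for the term spread can be sketched as below. This is an illustration under stated assumptions, not the code behind `join_data.join_yield` or `build_features.add_term_spread`: the notebook does not define the term spread, so the sketch assumes the common definition of a long yield minus a short yield per country and date. The function name `term_spread_sketch`, the join keys `country` and `rate_dt`, the timeband labels `2 YEARS` and `10 YEARS`, and all sample values are hypothetical.

```python
import pandas as pd

# Hypothetical bond/price rows, one per bond and price date.
df_bp = pd.DataFrame({
    "reference_identifier": ["XX0000000001", "XX0000000001"],
    "country": ["Spain", "Spain"],
    "rate_dt": pd.to_datetime(["2022-01-06", "2022-01-07"]),
    "mid": [101.2, 101.0],
})

# Hypothetical yield curve rows, one per country, date and timeband.
df_yield = pd.DataFrame({
    "country": ["Spain"] * 4,
    "rate_dt": pd.to_datetime(
        ["2022-01-06", "2022-01-06", "2022-01-07", "2022-01-07"]),
    "timeband": ["2 YEARS", "10 YEARS", "2 YEARS", "10 YEARS"],
    "mid": [-0.50, 0.60, -0.48, 0.65],
})


def term_spread_sketch(df_yield: pd.DataFrame,
                       short: str = "2 YEARS",
                       long: str = "10 YEARS") -> pd.DataFrame:
    """Return one row per (country, rate_dt) with long minus short yield."""
    # One column per timeband, one row per country and date.
    curve = df_yield.pivot_table(
        index=["country", "rate_dt"], columns="timeband", values="mid")
    spread = (curve[long] - curve[short]).rename("term_spread")
    return spread.reset_index()


spread = term_spread_sketch(df_yield)

# A left join keeps every bond/price row, even without a matching curve.
df_bpy_sketch = df_bp.merge(spread, on=["country", "rate_dt"], how="left")
```

Computing the spread once per country and date before merging keeps the joined frame at one row per bond and date, instead of one row per bond, date and timeband.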


All of the steps above can be run with a single make_data call.

(This takes about 1 min 25 s.)

In [31]:
make_dataset.make_data()

2022-01-27 21:49:04.472 | INFO     | src.data.make_dataset:get_bond_data:34 - Load bond data
2022-01-27 21:49:04.473 | INFO     | src.data.make_dataset:read_csv:23 - Loading data from ..\data\raw\bonds.csv
2022-01-27 21:49:04.496 | INFO     | src.data.make_dataset:impute_bonds:73 - Impute bond data
2022-01-27 21:49:04.508 | INFO     | src.data.make_dataset:save_pkl:322 - Save preprocessed bonds data
2022-01-27 21:49:04.564 | INFO     | src.data.make_dataset:get_price:107 - Load bond price data
2022-01-27 21:49:04.565 | INFO     | src.data.make_dataset:read_csv:23 - Loading data from ..\data\raw\price.csv
2022-01-27 21:49:04.942 | INFO     | src.data.make_dataset:impute_price:131 - Impute bond price
2022-01-27 21:49:05.154 | INFO     | src.data.make_dataset:save_pkl:322 - Save preprocessed price data
2022-01-27 21:49:09.193 | INFO     | src.data.make_dataset:get_yield:149 - Load goverment yield curve data
2022-01-27 21:49:09.194 | INFO     | src.data.make_dataset:read_csv:23 - Loading d