# Data preprocessing

The data for this project comes from the Financial Markets back-office system, partly supplemented with additional data from Bloomberg.
The data consists of the following elements:

- Bond data
- Bond prices
- Government Yield curves
- Inflation data

All data has been extracted and stored in CSV files. This workbook walks through the data preparation. All steps described here can also be run automatically with the shell command 'Make Data'.

In [3]:
%load_ext autoreload
%autoreload 2

import sys
import pandas as pd

sys.path.insert(0, "..") 
from src.data import make_dataset
from src.features import build_features

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


# Bond data

In [4]:

# Get bond data, drop unneeded columns, convert formats and strip trailing blanks
df_bonds = make_dataset.get_bond_data()

2022-01-16 14:21:40.668 | INFO     | src.data.make_dataset:get_bond_data:42 - Load bond data
2022-01-16 14:21:40.670 | INFO     | src.data.make_dataset:read_csv:27 - Loading data from ..\data\raw\bonds.csv


In [5]:
df_bonds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 230 entries, 0 to 229
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   ccy                230 non-null    string        
 1   country            230 non-null    string        
 2   bond_ext_name      230 non-null    object        
 3   issue_dt           230 non-null    datetime64[ns]
 4   first_coupon_date  207 non-null    datetime64[ns]
 5   mature_dt          230 non-null    datetime64[ns]
 6   isin               230 non-null    string        
 7   issuer_name        230 non-null    string        
 8   coupon_frq         230 non-null    string        
 9   coupon             230 non-null    float64       
 10  tot_issue          230 non-null    float64       
 11  cfi_code           223 non-null    string        
 12  issue_rating       226 non-null    string        
dtypes: datetime64[ns](3), float64(2), object(1), string(7)
memory usa

Impute missing values for issue rating.
Where the issue rating is missing, it is filled with the issuer's most frequent issue rating.

Where the CFI code is missing, it is filled with the code for 'unknown' = DXXXXX.

Dates with the value 1899-12-30 are default values from the source system. These are removed.
Where the first coupon date is missing (among others for zero coupon bonds), it is filled with the issue date.
The reason is that this lets us compute the term of the bond without much effort.
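
The imputation rules above can be illustrated with a small pandas sketch. This is not the code of `make_dataset.impute_bonds`; the function name `impute_bonds_sketch`, the constants, and the toy data are hypothetical. It also assumes that "removing" a 1899-12-30 default date means setting the value to missing rather than dropping the row.

```python
import pandas as pd

# Assumptions for illustration only: names and toy data are hypothetical.
SOURCE_DEFAULT_DATE = pd.Timestamp("1899-12-30")  # default date of the source system
UNKNOWN_CFI = "DXXXXX"  # CFI code for 'unknown'


def impute_bonds_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the four imputation rules to a copy of the bond data."""
    df = df.copy()

    # 1. Fill a missing issue rating with the issuer's most frequent rating.
    def fill_with_mode(ratings: pd.Series) -> pd.Series:
        mode = ratings.mode()  # ignores missing values
        return ratings.fillna(mode.iloc[0]) if not mode.empty else ratings

    df["issue_rating"] = df.groupby("issuer_name")["issue_rating"].transform(fill_with_mode)

    # 2. Fill a missing CFI code with the 'unknown' code.
    df["cfi_code"] = df["cfi_code"].fillna(UNKNOWN_CFI)

    # 3. Treat source-system default dates as missing (assumed interpretation).
    for col in ["issue_dt", "first_coupon_date", "mature_dt"]:
        df[col] = df[col].mask(df[col] == SOURCE_DEFAULT_DATE)

    # 4. Fill a missing first coupon date with the issue date.
    df["first_coupon_date"] = df["first_coupon_date"].fillna(df["issue_dt"])
    return df


example = pd.DataFrame({
    "issuer_name": ["A", "A", "A", "B"],
    "issue_rating": ["AAA", "AAA", None, "BBB"],
    "cfi_code": ["DBFTFB", None, "DBFTFB", None],
    "issue_dt": pd.to_datetime(["2015-01-15", "2016-03-01", "2017-06-30", "2018-09-10"]),
    "first_coupon_date": pd.to_datetime(["2016-01-15", "1899-12-30", None, "2019-09-10"]),
    "mature_dt": pd.to_datetime(["2025-01-15", "2026-03-01", "2027-06-30", "2028-09-10"]),
})
result = impute_bonds_sketch(example)
```

Rule 3 runs before rule 4 on purpose: a first coupon date that held the default value becomes missing first and is then filled with the issue date.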

In [6]:
df_bonds = make_dataset.impute_bonds(df_bonds)


2022-01-16 14:21:44.953 | INFO     | src.data.make_dataset:impute_bonds:81 - Impute bond data


In [7]:
make_dataset.save_pkl('bonds', df_bonds)

2022-01-16 14:21:47.004 | INFO     | src.data.make_dataset:save_pkl:364 - Save preprocessed bonds data


# Bond prices

In [26]:
df_price = make_dataset.get_price()

2022-01-16 20:06:47.906 | INFO     | src.data.make_dataset:get_price:116 - Load bond price data
2022-01-16 20:06:47.918 | INFO     | src.data.make_dataset:read_csv:27 - Loading data from ..\data\raw\price.csv


In [28]:
df_price = make_dataset.impute_price(df_price)

2022-01-16 20:09:13.199 | INFO     | src.data.make_dataset:impute_price:139 - Impute bond price


Unnamed: 0,reference_identifier,ccy,rate_dt,mid
0,DE0001135143,EUR,2010-12-17,136.760
1,NL0000102275,EUR,2010-12-17,103.390
2,DE0001135424,EUR,2010-12-17,95.453
3,NL0009446418,EUR,2010-12-17,102.690
4,NL0000102234,EUR,2010-12-17,106.220
...,...,...,...,...
235980,NL0015614579,EUR,2022-01-07,89.449
235981,SI0002104196,EUR,2022-01-07,97.525
235983,XS1756338551,EUR,2022-01-07,100.810
235984,XS2262263622,EUR,2022-01-07,95.945


In [21]:
df_price

Unnamed: 0,reference_identifier,ccy,rate_dt,mid,lastday
0,DE0001135143,EUR,2010-12-17,136.760,2010-12-31
1,NL0000102275,EUR,2010-12-17,103.390,2010-12-31
2,DE0001135424,EUR,2010-12-17,95.453,2010-12-31
3,NL0009446418,EUR,2010-12-17,102.690,2010-12-31
4,NL0000102234,EUR,2010-12-17,106.220,2010-12-31
...,...,...,...,...,...
235980,NL0015614579,EUR,2022-01-07,89.449,2022-12-31
235981,SI0002104196,EUR,2022-01-07,97.525,2022-12-31
235983,XS1756338551,EUR,2022-01-07,100.810,2022-12-31
235984,XS2262263622,EUR,2022-01-07,95.945,2022-12-31


In [None]:
df_price.head()

In [None]:
df_price.describe()

In [None]:
df_price.info()

In [None]:
make_dataset.save_pkl('price', df_price)

# Government Yield curves

In [None]:
df_yield = make_dataset.get_yield()

In [None]:
df_yield.info()

In [None]:
df_yield = make_dataset.impute_yield(df_yield)

In [None]:
df_yield.tail()

In [None]:
df_yield.info()

In [None]:
make_dataset.save_pkl('yield', df_yield)

# Inflation data


In [None]:
df_inflation = make_dataset.get_inflation()  

In [None]:
df_inflation = make_dataset.impute_inflation(df_inflation)


df_inflation.info()

In [None]:
df_inflation.head()

In [None]:
make_dataset.save_pkl('inflation', df_inflation)

In [None]:
df_bp = make_dataset.join_price(df_bonds, df_price)
df_bp = build_features.add_duration(df_bp)

To add the term spread we have to join the bond data with the government yield. This leaves us with less data.
The question is whether this is necessary (or whether the model can figure it out by itself).

The more data we join, the more data we lose.
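
One way to quantify how much data a join costs is to merge with `indicator=True` and count the rows that find no match. The sketch below uses made-up toy frames. The join keys `country` and `rate_dt`, the column `yield_10y`, and the helper `join_report` are assumptions for illustration, not the actual logic of `make_dataset.join_yield`.

```python
import pandas as pd

# Toy data, hypothetical: three bond prices, yield data for only two countries.
bond_prices = pd.DataFrame({
    "isin": ["BOND_A", "BOND_B", "BOND_C"],
    "country": ["DE", "NL", "SI"],
    "rate_dt": pd.to_datetime(["2021-01-04"] * 3),
    "mid": [101.0, 99.5, 98.0],
})
yields = pd.DataFrame({
    "country": ["DE", "NL"],
    "rate_dt": pd.to_datetime(["2021-01-04"] * 2),
    "yield_10y": [-0.5, -0.4],
})


def join_report(left: pd.DataFrame, right: pd.DataFrame, on: list[str]):
    """Return the matched rows plus the number of left rows without a match."""
    merged = left.merge(right, on=on, how="left", indicator=True)
    matched = merged["_merge"] == "both"
    kept = merged.loc[matched].drop(columns="_merge")
    return kept, int((~matched).sum())


kept, n_lost = join_report(bond_prices, yields, on=["country", "rate_dt"])
print(f"{len(kept)} of {len(bond_prices)} rows kept, {n_lost} lost in the join")
```

Running a report like this after each join shows which step removes the most rows, which helps decide whether a feature such as the term spread is worth the lost data.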

In [None]:
df_bpy = make_dataset.join_yield(df_bp, df_yield)
df_bpy = build_features.add_term_spread(df_bpy)
df_bpy = build_features.add_bid_offer_spread(df_bpy)
df_bpy.info()

In [None]:
df_tf = make_dataset.build_simple_input(df_bonds, df_price)


All in one make statement...

(this takes about 1 min 20 sec)

In [25]:
make_dataset.make_data()

2022-01-16 14:42:40.071 | INFO     | src.data.make_dataset:get_bond_data:42 - Load bond data
2022-01-16 14:42:40.072 | INFO     | src.data.make_dataset:read_csv:27 - Loading data from ..\data\raw\bonds.csv
2022-01-16 14:42:40.100 | INFO     | src.data.make_dataset:impute_bonds:81 - Impute bond data
2022-01-16 14:42:40.106 | INFO     | src.data.make_dataset:save_pkl:369 - Save preprocessed bonds data
2022-01-16 14:42:40.127 | INFO     | src.data.make_dataset:get_price:116 - Load bond price data
2022-01-16 14:42:40.128 | INFO     | src.data.make_dataset:read_csv:27 - Loading data from ..\data\raw\price.csv
2022-01-16 14:42:40.540 | INFO     | src.data.make_dataset:impute_price:139 - Impute bond price
2022-01-16 14:42:40.766 | INFO     | src.data.make_dataset:save_pkl:369 - Save preprocessed price data
2022-01-16 14:42:44.924 | INFO     | src.data.make_dataset:get_yield:154 - Load goverment yield curve data
2022-01-16 14:42:44.926 | INFO     | src.data.make_dataset:read_csv:27 - Loading d