# Data preprocessing

The data for this project comes from the Financial Markets back-office system, partly supplemented with additional data from Bloomberg.
The data consists of the following elements:

- Bond data
- Bond prices
- Government Yield curves
- Inflation data

All data has been extracted and stored in CSV files. This notebook walks through the data preparation. All steps described here can also be run automatically with the shell command 'Make Data'.

In [1]:
%load_ext autoreload
%autoreload 2

import sys
import pandas as pd

sys.path.insert(0, "..") 
from src.data import make_dataset
from src.features import build_features

# Bond data

In [50]:

# Get bond data, drop unneeded columns, convert formats and strip trailing blanks
df_bonds = make_dataset.get_bond_data()

2022-01-10 21:27:19.301 | INFO     | src.data.make_dataset:get_bond_data:42 - Load bond data
2022-01-10 21:27:19.302 | INFO     | src.data.make_dataset:read_csv:27 - Loading data from ..\data\raw\bonds.csv


In [53]:
df_bonds.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 226 entries, 0 to 229
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype          
---  ------             --------------  -----          
 0   ccy                226 non-null    string         
 1   country            226 non-null    string         
 2   issue_dt           226 non-null    datetime64[ns] 
 3   first_coupon_date  226 non-null    datetime64[ns] 
 4   mature_dt          226 non-null    datetime64[ns] 
 5   isin               226 non-null    string         
 6   issuer_name        226 non-null    string         
 7   coupon_frq         226 non-null    string         
 8   coupon             226 non-null    float64        
 9   tot_issue          226 non-null    float64        
 10  cfi_code           226 non-null    string         
 11  issue_rating       226 non-null    string         
 12  bond_duration      226 non-null    timedelta64[ns]
dtypes: datetime64[ns](3), float64(2), string(7), timed

Imputing missing values for issue rating.
Where the issue rating is missing, it is filled in with the most common issue rating for that issuer.

Where the CFI code is missing, it is filled in with the code for 'unknown' = DXXXXX.
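
The two imputation rules above can be sketched in pandas as follows. This is a minimal illustration on made-up sample rows, not the actual body of `make_dataset.impute_bonds`; the column names follow the `df_bonds.info()` output, and the helper `most_common` is hypothetical.

```python
import pandas as pd

# Made-up sample rows; column names follow df_bonds.info()
bonds = pd.DataFrame(
    {
        "issuer_name": ["A", "A", "A", "B", "B"],
        "issue_rating": ["AA", "AA", None, "BBB", None],
        "cfi_code": ["DBFTFB", None, "DBFTFB", None, "DBZTFB"],
    }
)


def most_common(series: pd.Series):
    """Hypothetical helper: most frequent non-null value, or NA if there is none.

    On ties, Series.mode() returns the values sorted, so the first one wins.
    """
    modes = series.mode(dropna=True)
    return modes.iloc[0] if not modes.empty else pd.NA


# One most-common rating per issuer, then mapped back onto every row
rating_per_issuer = bonds.groupby("issuer_name")["issue_rating"].agg(most_common)
bonds["issue_rating"] = bonds["issue_rating"].fillna(
    bonds["issuer_name"].map(rating_per_issuer)
)

# Missing CFI codes get the 'unknown' code
bonds["cfi_code"] = bonds["cfi_code"].fillna("DXXXXX")

print(bonds)
```

An issuer with no rating on any of its bonds would still have a missing rating after this step, so such cases need a separate fallback.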

Dates with the value 1899-12-30 are default values from the source system. These are removed.
Where the first coupon date is missing (e.g. for zero-coupon bonds), it is filled in with the issue date.
This lets us calculate the term of the bond without much effort.
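
A minimal sketch of the date handling described above, on made-up sample rows. It assumes that "removing" a default date means setting it to missing (`NaT`), and that the term is computed as `mature_dt` minus `first_coupon_date`; both are assumptions, since the actual logic lives in `make_dataset.impute_bonds` and is not shown here.

```python
import pandas as pd

# Default date written by the source system
SENTINEL = pd.Timestamp("1899-12-30")

# Made-up sample rows; column names follow df_bonds.info()
bonds = pd.DataFrame(
    {
        "issue_dt": pd.to_datetime(["2015-01-15", "2018-06-01", "2020-03-10"]),
        "first_coupon_date": pd.to_datetime(["2016-01-15", "1899-12-30", None]),
        "mature_dt": pd.to_datetime(["2025-01-15", "2028-06-01", "2030-03-10"]),
    }
)

# Assumption: default dates are turned into missing values (NaT)
for col in ["issue_dt", "first_coupon_date", "mature_dt"]:
    bonds[col] = bonds[col].mask(bonds[col] == SENTINEL)

# Missing first coupon dates fall back to the issue date
bonds["first_coupon_date"] = bonds["first_coupon_date"].fillna(bonds["issue_dt"])

# Assumption: term of the bond as a timedelta, measured from the first coupon date
bonds["bond_duration"] = bonds["mature_dt"] - bonds["first_coupon_date"]

print(bonds)
```

With the fallback in place the subtraction never hits a missing first coupon date, which is what makes the term cheap to compute for zero-coupon bonds as well.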

In [52]:
df_bonds = make_dataset.impute_bonds(df_bonds)


2022-01-10 21:27:29.214 | INFO     | src.data.make_dataset:impute_bonds:81 - Impute bond data


In [54]:
make_dataset.save_pkl('bonds', df_bonds)

2022-01-10 21:27:47.341 | INFO     | src.data.make_dataset:save_pkl:364 - Save preprocessed bonds data


In [55]:
df_bonds.count()

ccy                  226
country              226
issue_dt             226
first_coupon_date    226
mature_dt            226
isin                 226
issuer_name          226
coupon_frq           226
coupon               226
tot_issue            226
cfi_code             226
issue_rating         226
bond_duration        226
dtype: int64

# Bond prices

In [29]:
df_price = make_dataset.get_price()

2022-01-10 21:13:35.830 | INFO     | src.data.make_dataset:get_price:113 - Load bond price data
2022-01-10 21:13:35.831 | INFO     | src.data.make_dataset:read_csv:27 - Loading data from ..\data\raw\price.csv


In [30]:
df_price = make_dataset.impute_price(df_price)

2022-01-10 21:13:38.655 | INFO     | src.data.make_dataset:impute_price:136 - Impute bond price


In [None]:
df_price.head()

In [None]:
df_price.describe()

In [None]:
df_price.info()

In [None]:
make_dataset.save_pkl('price', df_price)

# Government Yield curves

In [None]:
df_yield = make_dataset.get_yield()

In [None]:
df_yield.info()

In [None]:
df_yield = make_dataset.impute_yield(df_yield)

In [None]:
df_yield.tail()

In [None]:
df_yield.info()

In [None]:
make_dataset.save_pkl('yield', df_yield)

# Inflation data


In [None]:
df_inflation = make_dataset.get_inflation()  

In [None]:
df_inflation = make_dataset.impute_inflation(df_inflation)


df_inflation.info()

In [None]:
df_inflation.head()

In [None]:
make_dataset.save_pkl('inflation', df_inflation)

In [None]:
df_bp = make_dataset.join_price(df_bonds, df_price)
df_bp = build_features.add_duration(df_bp)

To add the term spread we need to join the bond data with the government yield. This leaves us with less data.
The question is whether this is necessary (or whether the model can figure it out by itself).

The more data we join, the more data we lose.
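
The data loss from a join can be measured before committing to it. The sketch below uses `DataFrame.merge` with `indicator=True` on made-up frames; the column names `reference_date`, `country`, `bond_yield` and `gov_yield` are hypothetical and do not necessarily match what `make_dataset.join_yield` uses.

```python
import pandas as pd

# Made-up price rows: one date and one country have no government yield
prices = pd.DataFrame(
    {
        "reference_date": pd.to_datetime(
            ["2021-01-04", "2021-01-05", "2021-01-06", "2021-01-04"]
        ),
        "country": ["NL", "NL", "NL", "XX"],
        "bond_yield": [0.10, 0.12, 0.11, 1.50],
    }
)
gov = pd.DataFrame(
    {
        "reference_date": pd.to_datetime(["2021-01-04", "2021-01-05"]),
        "country": ["NL", "NL"],
        "gov_yield": [-0.50, -0.48],
    }
)

# A left join with an indicator column shows which price rows found a match
merged = prices.merge(
    gov, on=["reference_date", "country"], how="left", indicator=True
)
lost = int((merged["_merge"] == "left_only").sum())

# Keeping only matched rows is equivalent to an inner join
joined = merged[merged["_merge"] == "both"].drop(columns="_merge")

print(f"{lost} of {len(prices)} price rows have no matching government yield")
```

Counting the `left_only` rows per join gives a concrete basis for deciding whether a feature such as the term spread is worth the rows it costs.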

In [None]:
df_bpy = make_dataset.join_yield(df_bp, df_yield)
df_bpy = build_features.add_term_spread(df_bpy)
df_bpy = build_features.add_bid_offer_spread(df_bpy)
df_bpy.info()

In [None]:
df_tf = make_dataset.build_simple_input(df_bonds, df_price)


All in one make statement...

(this takes about 1 min 20 sec)

In [2]:
make_dataset.make_data()

2022-01-15 11:50:55.540 | INFO     | src.data.make_dataset:get_bond_data:42 - Load bond data
2022-01-15 11:50:55.541 | INFO     | src.data.make_dataset:read_csv:27 - Loading data from ..\data\raw\bonds.csv
2022-01-15 11:50:55.580 | INFO     | src.data.make_dataset:impute_bonds:81 - Impute bond data
2022-01-15 11:50:55.587 | INFO     | src.data.make_dataset:save_pkl:364 - Save preprocessed bonds data
2022-01-15 11:50:55.611 | INFO     | src.data.make_dataset:get_price:114 - Load bond price data
2022-01-15 11:50:55.612 | INFO     | src.data.make_dataset:read_csv:27 - Loading data from ..\data\raw\price.csv
2022-01-15 11:50:56.016 | INFO     | src.data.make_dataset:impute_price:137 - Impute bond price
2022-01-15 11:50:56.041 | INFO     | src.data.make_dataset:save_pkl:364 - Save preprocessed price data
2022-01-15 11:50:59.385 | INFO     | src.data.make_dataset:get_yield:150 - Load goverment yield curve data
2022-01-15 11:50:59.387 | INFO     | src.data.make_dataset:read_csv:27 - Loading d