# Data preprocessing

The data for this project comes from the Financial Markets back office system, partly supplemented with extra data from Bloomberg.
The data consists of the following elements:

- Bond data
- Bond prices
- Government Yield curves
- Inflation data

All data has been extracted and stored in CSV files. In this notebook we walk through the data preparation. All steps described here can also be run automatically with the shell command 'Make Data'.

In [1]:
%load_ext autoreload
%autoreload 2

import sys
import pandas as pd

sys.path.insert(0, "..") 
from src.data import make_dataset
from src.features import build_features

# Bond data

In [2]:

# Get bond data, drop unneeded columns, convert formats and strip trailing blanks
df_bonds = make_dataset.get_bond_data()

2022-01-10 19:33:48.584 | INFO     | src.data.make_dataset:get_bond_data:42 - Load bond data
2022-01-10 19:33:48.586 | INFO     | src.data.make_dataset:read_csv:26 - Loading data from ..\data\raw\bonds.csv


In [3]:
df_bonds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 230 entries, 0 to 229
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   ccy                230 non-null    string        
 1   country            230 non-null    string        
 2   issue_dt           230 non-null    datetime64[ns]
 3   first_coupon_date  207 non-null    datetime64[ns]
 4   mature_dt          230 non-null    datetime64[ns]
 5   isin               230 non-null    string        
 6   issuer_name        230 non-null    string        
 7   coupon_frq         230 non-null    string        
 8   coupon             230 non-null    float64       
 9   tot_issue          230 non-null    float64       
 10  cfi_code           223 non-null    string        
 11  issue_rating       226 non-null    string        
dtypes: datetime64[ns](3), float64(2), string(7)
memory usage: 21.7 KB


In [4]:
df_bonds.head()

Unnamed: 0,ccy,country,issue_dt,first_coupon_date,mature_dt,isin,issuer_name,coupon_frq,coupon,tot_issue,cfi_code,issue_rating
0,EUR,Netherlands,2009-02-13,2009-07-15,2019-07-15,NL0009086115,STAAT DER NEDERLANDEN,ANNUAL,4.0,5000.0,DBFTFN,AAA
1,NLG,Austria,1994-02-28,1995-02-28,2024-02-28,NL0000133924,AUSTRIA,ANNUAL,6.25,1000.0,DBFTXB,AA+
2,EUR,Netherlands,2012-03-09,2013-01-15,2033-01-15,NL0010071189,STAAT DER NEDERLANDEN,ANNUAL,2.5,4160.0,DBFXXN,AAA
3,USD,United States,2009-05-15,2009-11-15,2019-05-15,US912828KQ20,UNITED STATES TREASURY,SEMI ANNUAL,3.125,64411.0,,
4,USD,United States,2010-02-15,2010-08-15,2020-02-15,US912828MP29,UNITED STATES TREASURY,SEMI ANNUAL,3.625,0.0,,


Imputing missing values for issue rating.
Where the issue rating is missing, it is filled in with the most common issue rating for that issuer.

Where the CFI code is missing, it is filled in with the 'unknown' code, DXXXXX.
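
The two rules above can be sketched in pandas as follows. This is a minimal illustration on toy data, not the actual `impute_bonds` implementation: only the column names `issuer_name`, `issue_rating` and `cfi_code` come from the bond data, while the rows and the `mode_by_issuer` helper are made up.

```python
import pandas as pd

# Toy data; the values are illustrative only.
df = pd.DataFrame({
    "issuer_name": ["A", "A", "A", "B", "B"],
    "issue_rating": ["AAA", "AAA", None, None, None],
    "cfi_code": ["DBFTFN", None, "DBFXXN", None, "DBFTXB"],
})

# Most common known rating per issuer (hypothetical helper series).
# Rows without a rating are dropped first, so every remaining group
# has at least one value and mode() is never empty.
mode_by_issuer = (
    df.dropna(subset=["issue_rating"])
      .groupby("issuer_name")["issue_rating"]
      .agg(lambda s: s.mode().iloc[0])
)

# Fill missing ratings with the issuer's most common rating.
df["issue_rating"] = df["issue_rating"].fillna(df["issuer_name"].map(mode_by_issuer))

# Fill missing CFI codes with the 'unknown' code.
df["cfi_code"] = df["cfi_code"].fillna("DXXXXX")

print(df)
```

In this sketch an issuer without any rated bond (issuer `B`) keeps its missing ratings. Something similar may explain why `issue_rating` still has 226 non-null values after imputation, but that is an assumption.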

Dates with the value 1899-12-30 are default values from the source system. These are removed.
Where the first coupon date is missing (for zero coupon bonds, among others), it is filled in with the issue date.
This lets us calculate the term of the bond without much effort.
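
A minimal sketch of the date handling, again on toy data and not taken from `impute_bonds`. It assumes that "removed" means the default date is set to missing rather than the row being dropped, and it uses maturity date minus first coupon date as one possible definition of the term. The `term` column and the `SENTINEL` constant are hypothetical names; the real `bond_duration` column may be defined differently.

```python
import pandas as pd

SENTINEL = pd.Timestamp("1899-12-30")  # default date of the source system

# Toy data; the values are illustrative only.
df = pd.DataFrame({
    "issue_dt": pd.to_datetime(["2009-02-13", "2015-06-01"]),
    "first_coupon_date": pd.to_datetime(["1899-12-30", None]),
    "mature_dt": pd.to_datetime(["2019-07-15", "2020-06-01"]),
})

# Assumption: treat the default date as missing (NaT).
for col in ["issue_dt", "first_coupon_date", "mature_dt"]:
    df[col] = df[col].mask(df[col] == SENTINEL)

# Fall back to the issue date where the first coupon date is missing.
df["first_coupon_date"] = df["first_coupon_date"].fillna(df["issue_dt"])

# Assumed definition of the term: maturity minus first coupon date.
df["term"] = df["mature_dt"] - df["first_coupon_date"]

print(df)
```

Because every bond now has a first coupon date, the subtraction yields a value for every row, including zero coupon bonds.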

In [5]:
df_bonds = make_dataset.impute_bonds(df_bonds)

2022-01-10 19:34:00.912 | INFO     | src.data.make_dataset:impute_bonds:78 - Impute bond data


In [6]:
make_dataset.save_pkl('bonds', df_bonds)

2022-01-10 19:34:02.626 | INFO     | src.data.make_dataset:save_pkl:359 - Save preprocessed bonds data


In [7]:
df_bonds.count()

ccy                  230
country              230
issue_dt             230
first_coupon_date    230
mature_dt            230
isin                 230
issuer_name          230
coupon_frq           230
coupon               230
tot_issue            230
cfi_code             230
issue_rating         226
bond_duration        230
dtype: int64

# Bond prices

In [8]:
df_price = make_dataset.get_price()

2022-01-10 19:34:09.664 | INFO     | src.data.make_dataset:get_price:112 - Load bond price data
2022-01-10 19:34:09.665 | INFO     | src.data.make_dataset:read_csv:26 - Loading data from ..\data\raw\price.csv


In [9]:
df_price = make_dataset.impute_price(df_price)

2022-01-10 19:34:11.458 | INFO     | src.data.make_dataset:impute_price:142 - Impute bond price


In [10]:
df_price.head()

Unnamed: 0,reference_identifier,ccy,rate_dt,mid
0,DE0001135143,EUR,2010-12-17,136.76
1,NL0000102275,EUR,2010-12-17,103.39
2,DE0001135424,EUR,2010-12-17,95.453
3,NL0009446418,EUR,2010-12-17,102.69
4,NL0000102234,EUR,2010-12-17,106.22


In [11]:
df_price.describe()

Unnamed: 0,mid
count,224586.0
mean,110.794414
std,15.3118
min,84.429
25%,102.136
50%,106.272
75%,112.214
max,195.749


In [12]:
df_price.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 224586 entries, 0 to 235985
Data columns (total 4 columns):
 #   Column                Non-Null Count   Dtype         
---  ------                --------------   -----         
 0   reference_identifier  224586 non-null  string        
 1   ccy                   224586 non-null  string        
 2   rate_dt               224586 non-null  datetime64[ns]
 3   mid                   224586 non-null  float64       
dtypes: datetime64[ns](1), float64(1), string(2)
memory usage: 8.6 MB


In [13]:
make_dataset.save_pkl('price', df_price)

2022-01-10 19:34:28.522 | INFO     | src.data.make_dataset:save_pkl:359 - Save preprocessed price data


# Government Yield curves

In [14]:
df_yield = make_dataset.get_yield()

2022-01-10 19:34:36.514 | INFO     | src.data.make_dataset:get_yield:150 - Load goverment yield curve data
2022-01-10 19:34:36.515 | INFO     | src.data.make_dataset:read_csv:26 - Loading data from ..\data\raw\yield.csv


In [15]:
df_yield = make_dataset.impute_yield(df_yield)

2022-01-10 19:34:37.971 | INFO     | src.data.make_dataset:impute_yield:181 - Impute yield curve


In [16]:
df_yield.tail()

Unnamed: 0,country,rate_dt,timeband,ratename,ccy,actual_dt,datedays,bid,offer,int_basis,time
116809,Spain,2021-12-23,5 YEARS,GOV Yield Curve ES BB,EUR,2026-12-23,1831,-0.062,-0.081,ANNUAL,1826 days
116811,Spain,2021-12-23,6 YEARS,GOV Yield Curve ES BB,EUR,2027-12-23,2195,-0.015,-0.03,ANNUAL,2191 days
116812,Spain,2021-12-23,7 YEARS,GOV Yield Curve ES BB,EUR,2028-12-23,2561,0.071,0.057,ANNUAL,2557 days
116813,Spain,2021-12-23,8 YEARS,GOV Yield Curve ES BB,EUR,2029-12-23,2926,0.24,0.229,ANNUAL,2922 days
116815,Spain,2021-12-23,9 YEARS,GOV Yield Curve ES BB,EUR,2030-12-23,3291,0.367,0.357,ANNUAL,3287 days


In [17]:
df_yield.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 89750 entries, 1 to 116815
Data columns (total 11 columns):
 #   Column     Non-Null Count  Dtype          
---  ------     --------------  -----          
 0   country    89750 non-null  object         
 1   rate_dt    89750 non-null  datetime64[ns] 
 2   timeband   89750 non-null  object         
 3   ratename   89750 non-null  string         
 4   ccy        89750 non-null  string         
 5   actual_dt  89750 non-null  datetime64[ns] 
 6   datedays   89750 non-null  int64          
 7   bid        89750 non-null  float64        
 8   offer      89750 non-null  float64        
 9   int_basis  89750 non-null  string         
 10  time       89750 non-null  timedelta64[ns]
dtypes: datetime64[ns](2), float64(2), int64(1), object(2), string(3), timedelta64[ns](1)
memory usage: 8.2+ MB


In [18]:
make_dataset.save_pkl('yield', df_yield)

2022-01-10 19:34:43.363 | INFO     | src.data.make_dataset:save_pkl:359 - Save preprocessed yield data


# Inflation data


In [19]:
df_inflation = make_dataset.get_inflation()  

2022-01-10 19:34:48.850 | INFO     | src.data.make_dataset:get_inflation:201 - Load goverment yield curve data
2022-01-10 19:34:48.852 | INFO     | src.data.make_dataset:read_csv:26 - Loading data from ..\data\raw\DE Inflation.csv
2022-01-10 19:34:48.868 | INFO     | src.data.make_dataset:read_csv:26 - Loading data from ..\data\raw\FR Inflation.csv
2022-01-10 19:34:48.886 | INFO     | src.data.make_dataset:read_csv:26 - Loading data from ..\data\raw\ES Inflation.csv
2022-01-10 19:34:48.905 | INFO     | src.data.make_dataset:read_csv:26 - Loading data from ..\data\raw\IT Inflation.csv
2022-01-10 19:34:48.926 | INFO     | src.data.make_dataset:read_csv:26 - Loading data from ..\data\raw\US Inflation.csv


In [20]:
df_inflation = make_dataset.impute_inflation(df_inflation)


df_inflation.info()

2022-01-10 19:35:33.596 | INFO     | src.data.make_dataset:impute_inflation:229 - Impute inflation curve


<class 'pandas.core.frame.DataFrame'>
Int64Index: 215007 entries, 0 to 216873
Data columns (total 7 columns):
 #   Column     Non-Null Count   Dtype          
---  ------     --------------   -----          
 0   country    215007 non-null  string         
 1   rate_dt    215007 non-null  datetime64[ns] 
 2   timeband   215007 non-null  string         
 3   inflation  215007 non-null  float64        
 4   ratename   215007 non-null  string         
 5   actual_dt  215007 non-null  datetime64[ns] 
 6   time       215007 non-null  timedelta64[ns]
dtypes: datetime64[ns](2), float64(1), string(3), timedelta64[ns](1)
memory usage: 13.1 MB


In [21]:
df_inflation.head()

Unnamed: 0,country,rate_dt,timeband,inflation,ratename,actual_dt,time
0,Germany,2021-12-23,1 YEAR,3.28625,Inflation,2022-12-23,365 days
1,Germany,2021-12-22,1 YEAR,3.33875,Inflation,2022-12-22,365 days
2,Germany,2021-12-21,1 YEAR,3.15625,Inflation,2022-12-21,365 days
3,Germany,2021-12-20,1 YEAR,3.01375,Inflation,2022-12-20,365 days
4,Germany,2021-12-17,1 YEAR,2.89875,Inflation,2022-12-17,365 days


In [22]:
make_dataset.save_pkl('inflation', df_inflation)

2022-01-10 19:35:40.553 | INFO     | src.data.make_dataset:save_pkl:359 - Save preprocessed inflation data


In [23]:
df_bp = make_dataset.join_price(df_bonds, df_price)
df_bp = build_features.add_duration(df_bp)

2022-01-10 19:36:01.854 | INFO     | src.features.build_features:add_duration:13 - Add remaining duration...


To add the term spread, we have to join the bond data with the government yield data. This leaves us with less data.
The question is whether this is necessary (or whether the model can figure it out by itself).

The more data we join, the more data we lose.
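
One way to see how much data a join would drop is a left merge with `indicator=True`, which marks for every row whether a match was found. The sketch below uses toy frames; the join keys `country` and `rate_dt` are an assumption and may differ from what `join_yield` actually uses.

```python
import pandas as pd

# Toy frames; the values are illustrative only.
bond_prices = pd.DataFrame({
    "country": ["Netherlands", "Netherlands", "Austria"],
    "rate_dt": pd.to_datetime(["2010-12-17", "2010-12-20", "2010-12-17"]),
    "mid": [103.39, 103.50, 136.76],
})
yields = pd.DataFrame({
    "country": ["Netherlands"],
    "rate_dt": pd.to_datetime(["2010-12-17"]),
    "bid": [2.9],
})

# A left merge keeps every price row; '_merge' records whether a yield was found.
merged = bond_prices.merge(
    yields, on=["country", "rate_dt"], how="left", indicator=True
)
matched = merged["_merge"] == "both"
print(f"{int(matched.sum())} of {len(bond_prices)} price rows have a yield match")

# Equivalent to an inner join: keep only the matched rows.
inner = merged.loc[matched].drop(columns="_merge")
```

Checking the matched fraction before switching to an inner join makes the data loss explicit for each additional source.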

In [None]:
df_bpy = make_dataset.join_yield(df_bp, df_yield)
df_bpy = build_features.add_term_spread(df_bpy)
df_bpy = build_features.add_bid_offer_spread(df_bpy)
df_bpy.info()

In [None]:
df_tf = make_dataset.build_simple_input(df_bonds, df_price)


All in one make statement...

In [34]:
make_dataset.make_data()

2022-01-10 20:06:02.006 | INFO     | src.data.make_dataset:get_bond_data:42 - Load bond data
2022-01-10 20:06:02.007 | INFO     | src.data.make_dataset:read_csv:27 - Loading data from ..\data\raw\bonds.csv
2022-01-10 20:06:02.031 | INFO     | src.data.make_dataset:impute_bonds:80 - Impute bond data
2022-01-10 20:06:02.039 | INFO     | src.data.make_dataset:save_pkl:357 - Save preprocessed bonds data
2022-01-10 20:06:02.062 | INFO     | src.data.make_dataset:get_price:114 - Load bond price data
2022-01-10 20:06:02.063 | INFO     | src.data.make_dataset:read_csv:27 - Loading data from ..\data\raw\price.csv
2022-01-10 20:06:02.470 | INFO     | src.data.make_dataset:impute_price:138 - Impute bond price
2022-01-10 20:06:02.492 | INFO     | src.data.make_dataset:save_pkl:357 - Save preprocessed price data
2022-01-10 20:06:06.203 | INFO     | src.data.make_dataset:get_yield:151 - Load goverment yield curve data
2022-01-10 20:06:06.203 | INFO     | src.data.make_dataset:read_csv:27 - Loading d

MemoryError: Unable to allocate 404. MiB for an array with shape (236, 224586) and data type float64
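
The size in the error message can be checked by hand: a float64 value takes 8 bytes, so an array of shape (236, 224586) needs 236 × 224586 × 8 bytes. A quick check in Python:

```python
rows, cols = 236, 224586   # shape from the error message
itemsize = 8               # bytes per float64 value

size_bytes = rows * cols * itemsize
size_mib = size_bytes / 2**20

print(f"{size_bytes} bytes = {size_mib:.1f} MiB")
```

The second dimension, 224586, equals the number of rows that `df_price.info()` reports, so the failing step may be building a wide array with one column per price row. That is only an inference from the shape; the traceback shown here does not confirm it.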