# Data preprocessing

The data for this project comes from the back office system of Financial Markets, partly supplemented with extra data from Bloomberg.
The data consists of the following elements:

- Bond data
- Bond prices
- Government Yield curves
- Inflation data

All data has been extracted and stored in CSV files. This notebook walks through the data preparation. All steps described here can also be run automatically with the shell command 'Make Data'.

In [30]:
%load_ext autoreload
%autoreload 2

import sys
import pandas as pd

sys.path.insert(0, "..") 
from src.data import make_dataset

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


# Bond data

In [31]:

# Get bond data, drop unneeded columns, convert formats and strip trailing blanks
df_bonds = make_dataset.get_bond_data()

2021-12-30 16:59:14.317 | INFO     | src.data.make_dataset:get_bond_data:34 - Load bond data
2021-12-30 16:59:14.318 | INFO     | src.data.make_dataset:read_csv:19 - Loading data from ..\data\raw\bonds.csv


In [32]:
df_bonds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 225 entries, 0 to 224
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   ccy                225 non-null    string        
 1   country            225 non-null    string        
 2   issue_dt           225 non-null    datetime64[ns]
 3   first_coupon_date  202 non-null    datetime64[ns]
 4   mature_dt          225 non-null    datetime64[ns]
 5   isin               225 non-null    string        
 6   issuer_name        225 non-null    string        
 7   coupon_frq         225 non-null    string        
 8   coupon             225 non-null    float64       
 9   tot_issue          225 non-null    float64       
 10  cfi_code           218 non-null    string        
 11  issue_rating       220 non-null    string        
dtypes: datetime64[ns](3), float64(2), string(7)
memory usage: 21.2 KB


In [33]:
df_bonds.head()

Unnamed: 0,ccy,country,issue_dt,first_coupon_date,mature_dt,isin,issuer_name,coupon_frq,coupon,tot_issue,cfi_code,issue_rating
0,EUR,Netherlands,2009-02-13,2009-07-15,2019-07-15,NL0009086115,STAAT DER NEDERLANDEN,ANNUAL,4.0,5000000000.0,DBFTFN,AAA
1,NLG,Austria,1994-02-28,1995-02-28,2024-02-28,NL0000133924,AUSTRIA,ANNUAL,6.25,1000000000.0,DBFTXB,AA+
2,EUR,Netherlands,2012-03-09,2013-01-15,2033-01-15,NL0010071189,STAAT DER NEDERLANDEN,ANNUAL,2.5,4160000000.0,DBFXXN,AAA
3,USD,United States,2009-05-15,2009-11-15,2019-05-15,US912828KQ20,UNITED STATES TREASURY,SEMI ANNUAL,3.125,64411000000.0,,
4,USD,United States,2010-02-15,2010-08-15,2020-02-15,US912828MP29,UNITED STATES TREASURY,SEMI ANNUAL,3.625,0.0,,


Imputing missing values for issue rating.
Where the issue rating is missing, it is filled with the most frequent issue rating for the issuer.

Where the CFI code is missing, it is filled with the code for 'unknown' = DXXXXX.

Dates with the value 1899-12-30 are default values from the source system. These are removed.
Where the first coupon date is missing (among others for zero coupon bonds), it is filled with the issue date.
The reason is that this lets us calculate the term of the bond without much effort.

In [34]:
df_bonds = make_dataset.impute_bonds(df_bonds)

2021-12-30 16:59:21.232 | INFO     | src.data.make_dataset:impute_bonds:67 - Impute bond data
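
The imputation rules described above can be illustrated with a minimal pandas sketch. This is not the code of `make_dataset.impute_bonds`, whose internals are not shown in this notebook. The function name `impute_bonds_sketch` and the toy values are hypothetical, and "removed" is interpreted here as setting the default date to missing. The `bond_duration` column is computed as `mature_dt - issue_dt`, which is consistent with the values shown in the merged output later in this notebook.

```python
import pandas as pd

# Toy frame using the column names of df_bonds; all values are made up.
bonds = pd.DataFrame(
    {
        "issuer_name": ["A", "A", "A", "B"],
        "issue_dt": pd.to_datetime(["2010-01-15", "2011-01-15", "2012-01-15", "2013-03-01"]),
        "first_coupon_date": pd.to_datetime(["2011-01-15", "1899-12-30", None, "2014-03-01"]),
        "mature_dt": pd.to_datetime(["2020-01-15", "2021-01-15", "2022-01-15", "2023-03-01"]),
        "cfi_code": ["DBFTFN", None, "DBFTFN", None],
        "issue_rating": ["AAA", "AAA", None, "AA"],
    }
)


def impute_bonds_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical re-implementation of the rules described in the text."""
    df = df.copy()

    # Rule 1: fill a missing rating with the issuer's most frequent rating.
    def fill_with_mode(ratings: pd.Series) -> pd.Series:
        mode = ratings.mode()  # ignores missing values
        return ratings.fillna(mode.iloc[0]) if not mode.empty else ratings

    df["issue_rating"] = df.groupby("issuer_name")["issue_rating"].transform(fill_with_mode)

    # Rule 2: a missing CFI code becomes the 'unknown' code.
    df["cfi_code"] = df["cfi_code"].fillna("DXXXXX")

    # Rule 3: the source system default date is treated as missing (assumption).
    default_date = pd.Timestamp("1899-12-30")
    for col in ["issue_dt", "first_coupon_date", "mature_dt"]:
        df[col] = df[col].mask(df[col] == default_date)

    # Rule 4: a missing first coupon date falls back to the issue date.
    df["first_coupon_date"] = df["first_coupon_date"].fillna(df["issue_dt"])

    # Term of the bond as a timedelta.
    df["bond_duration"] = df["mature_dt"] - df["issue_dt"]
    return df


bonds_imputed = impute_bonds_sketch(bonds)
print(bonds_imputed[["issuer_name", "issue_rating", "cfi_code", "first_coupon_date", "bond_duration"]])
```

An issuer without any known rating keeps its missing values in this sketch, because there is no mode to fall back on.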


In [35]:
make_dataset.save_pkl('bonds', df_bonds)

2021-12-30 16:59:22.977 | INFO     | src.data.make_dataset:save_pkl:279 - Save preprocessed bonds data


# Bond prices

In [36]:
df_price = make_dataset.get_price()

2021-12-30 16:59:24.476 | INFO     | src.data.make_dataset:get_price:104 - Load bond price data
2021-12-30 16:59:24.476 | INFO     | src.data.make_dataset:read_csv:19 - Loading data from ..\data\raw\price.csv


In [37]:
df_price = make_dataset.impute_price(df_price)

2021-12-30 16:59:26.055 | INFO     | src.data.make_dataset:impute_price:129 - Impute bond price


In [38]:
df_price.head()

Unnamed: 0,reference_identifier,ccy,rate_dt,mid
0,BE0000332412,EUR,2014-01-20,100.719
1,BE0000332412,EUR,2014-01-24,101.005
2,BE0000332412,EUR,2014-01-28,100.953
3,BE0000332412,EUR,2014-01-22,100.359
4,BE0000332412,EUR,2014-01-21,100.601


In [39]:
df_price.describe()

Unnamed: 0,mid
count,209404.0
mean,111.289448
std,15.522596
min,84.429
25%,102.286
50%,106.73
75%,112.926
max,195.749


In [40]:
df_price.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 209404 entries, 0 to 220236
Data columns (total 4 columns):
 #   Column                Non-Null Count   Dtype         
---  ------                --------------   -----         
 0   reference_identifier  209404 non-null  string        
 1   ccy                   209404 non-null  string        
 2   rate_dt               209404 non-null  datetime64[ns]
 3   mid                   209404 non-null  float64       
dtypes: datetime64[ns](1), float64(1), string(2)
memory usage: 8.0 MB


In [41]:
make_dataset.save_pkl('price', df_price)

2021-12-30 16:59:32.611 | INFO     | src.data.make_dataset:save_pkl:279 - Save preprocessed price data


# Government Yield curves

In [42]:
df_yield = make_dataset.get_yield()

2021-12-30 16:59:34.196 | INFO     | src.data.make_dataset:get_yield:137 - Load goverment yield curve data
2021-12-30 16:59:34.197 | INFO     | src.data.make_dataset:read_csv:19 - Loading data from ..\data\raw\yield.csv


In [43]:
df_yield = make_dataset.impute_yield(df_yield)

2021-12-30 16:59:35.997 | INFO     | src.data.make_dataset:impute_yield:162 - Impute yield curve


In [44]:
df_yield.head()

Unnamed: 0,ratename,ccy,rate_dt,timeband,actual_dt,datedays,bid,offer,int_basis,country
0,GOV Yield Curve DE BB,EUR,2017-02-07,1 YEAR,2018-02-09,367,-0.875,-0.804,ANNUAL,Germany
1,GOV Yield Curve DE BB,EUR,2017-02-07,2 YEARS,2019-02-11,734,-0.799,-0.773,ANNUAL,Germany
2,GOV Yield Curve DE BB,EUR,2017-02-07,3 YEARS,2020-02-10,1098,-0.769,-0.754,ANNUAL,Germany
3,GOV Yield Curve DE BB,EUR,2017-02-07,4 YEARS,2021-02-09,1463,-0.62,-0.602,ANNUAL,Germany
4,GOV Yield Curve DE BB,EUR,2017-02-07,5 YEARS,2022-02-09,1828,-0.414,-0.404,ANNUAL,Germany


In [45]:
df_yield.describe()

Unnamed: 0,datedays,bid,offer
count,89824.0,89824.0,89824.0
mean,3389.145162,0.284556,0.274044
std,2877.661657,0.921327,0.92168
min,367.0,-1.038,-1.051
25%,1463.0,-0.429,-0.443
50%,2560.0,0.016,0.009
75%,3657.0,0.756,0.75
max,10968.0,4.128,4.109


# Inflation data


In [46]:
df_inflation = make_dataset.get_inflation()  

2021-12-30 16:59:41.066 | INFO     | src.data.make_dataset:get_inflation:174 - Load goverment yield curve data
2021-12-30 16:59:41.068 | INFO     | src.data.make_dataset:read_csv:19 - Loading data from ..\data\raw\DE Inflation.csv
2021-12-30 16:59:41.085 | INFO     | src.data.make_dataset:read_csv:19 - Loading data from ..\data\raw\FR Inflation.csv
2021-12-30 16:59:41.103 | INFO     | src.data.make_dataset:read_csv:19 - Loading data from ..\data\raw\ES Inflation.csv
2021-12-30 16:59:41.122 | INFO     | src.data.make_dataset:read_csv:19 - Loading data from ..\data\raw\IT Inflation.csv
2021-12-30 16:59:41.140 | INFO     | src.data.make_dataset:read_csv:19 - Loading data from ..\data\raw\US Inflation.csv


In [50]:
df_inflation = make_dataset.impute_inflation(df_inflation)

df_inflation.info()

2021-12-30 17:00:14.086 | INFO     | src.data.make_dataset:impute_inflation:201 - Impute inflation curve


<class 'pandas.core.frame.DataFrame'>
Int64Index: 215007 entries, 0 to 216873
Data columns (total 5 columns):
 #   Column     Non-Null Count   Dtype         
---  ------     --------------   -----         
 0   country    215007 non-null  string        
 1   rate_dt    215007 non-null  datetime64[ns]
 2   timeband   215007 non-null  string        
 3   inflation  215007 non-null  float64       
 4   ratename   215007 non-null  string        
dtypes: datetime64[ns](1), float64(1), string(3)
memory usage: 9.8 MB


In [51]:
df_price.merge(df_bonds, left_on='reference_identifier', right_on='isin', how='left')

Unnamed: 0,reference_identifier,ccy_x,rate_dt,mid,ccy_y,country,issue_dt,first_coupon_date,mature_dt,isin,issuer_name,coupon_frq,coupon,tot_issue,cfi_code,issue_rating,bond_duration
0,BE0000332412,EUR,2014-01-20,100.719,EUR,Belgium,2014-01-21,2014-06-22,2024-06-22,BE0000332412,BELGIE BRU,ANNUAL,2.60,5.000000e+09,DBFTFR,AA-,3805 days
1,BE0000332412,EUR,2014-01-24,101.005,EUR,Belgium,2014-01-21,2014-06-22,2024-06-22,BE0000332412,BELGIE BRU,ANNUAL,2.60,5.000000e+09,DBFTFR,AA-,3805 days
2,BE0000332412,EUR,2014-01-28,100.953,EUR,Belgium,2014-01-21,2014-06-22,2024-06-22,BE0000332412,BELGIE BRU,ANNUAL,2.60,5.000000e+09,DBFTFR,AA-,3805 days
3,BE0000332412,EUR,2014-01-22,100.359,EUR,Belgium,2014-01-21,2014-06-22,2024-06-22,BE0000332412,BELGIE BRU,ANNUAL,2.60,5.000000e+09,DBFTFR,AA-,3805 days
4,BE0000332412,EUR,2014-01-21,100.601,EUR,Belgium,2014-01-21,2014-06-22,2024-06-22,BE0000332412,BELGIE BRU,ANNUAL,2.60,5.000000e+09,DBFTFR,AA-,3805 days
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
209399,FR0011486067,EUR,2021-08-31,104.237,EUR,France,2013-05-07,2013-05-25,2023-05-25,FR0011486067,REPUBLIC FRANCE,ANNUAL,1.75,4.015000e+09,DBFTFN,AA,3670 days
209400,FR0011486067,EUR,2021-09-02,104.231,EUR,France,2013-05-07,2013-05-25,2023-05-25,FR0011486067,REPUBLIC FRANCE,ANNUAL,1.75,4.015000e+09,DBFTFN,AA,3670 days
209401,FR0011486067,EUR,2021-10-14,103.889,EUR,France,2013-05-07,2013-05-25,2023-05-25,FR0011486067,REPUBLIC FRANCE,ANNUAL,1.75,4.015000e+09,DBFTFN,AA,3670 days
209402,FR0011486067,EUR,2021-10-15,103.882,EUR,France,2013-05-07,2013-05-25,2023-05-25,FR0011486067,REPUBLIC FRANCE,ANNUAL,1.75,4.015000e+09,DBFTFN,AA,3670 days
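
The left merge above keeps every price row and attaches the bond attributes by ISIN. Because both frames carry a `ccy` column, pandas adds the `_x` and `_y` suffixes seen in the output. The sketch below shows, on made-up toy data, how such a merge could be checked: `validate` guards against duplicate ISINs on the bond side, `indicator` flags prices without a matching bond, and `suffixes` gives the overlapping columns more descriptive names. The identifiers `X1`, `X2` and `X9` and the suffix names are assumptions for illustration only.

```python
import pandas as pd

# Toy data with the key columns of df_price and df_bonds; values are made up.
price = pd.DataFrame(
    {
        "reference_identifier": ["X1", "X1", "X2", "X9"],
        "ccy": ["EUR", "EUR", "EUR", "EUR"],
        "rate_dt": pd.to_datetime(["2014-01-20", "2014-01-21", "2014-01-20", "2014-01-20"]),
        "mid": [100.7, 101.0, 104.2, 99.5],
    }
)
bonds = pd.DataFrame(
    {
        "isin": ["X1", "X2"],
        "ccy": ["EUR", "EUR"],
        "country": ["Belgium", "France"],
    }
)

merged = price.merge(
    bonds,
    left_on="reference_identifier",
    right_on="isin",
    how="left",
    suffixes=("_price", "_bond"),  # instead of the default _x / _y
    validate="many_to_one",        # raises if an ISIN occurs twice in bonds
    indicator=True,                # adds a _merge column with the match status
)

# Price rows for which no bond was found.
unmatched = merged.loc[merged["_merge"] == "left_only", "reference_identifier"].unique()
print(unmatched)
```

With `how="left"` the number of rows stays equal to the number of price rows, as long as the ISINs in the bond frame are unique.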


In [52]:
make_dataset.join_full(df_bonds, df_price, df_yield, df_inflation)


country,rate_dt,reference_identifier,ccy,rate_dt,mid,country,issue_dt,first_coupon_date,mature_dt,isin,issuer_name,...,20 YEARS,25 YEARS,3 YEARS,30 YEARS,4 YEARS,5 YEARS,6 YEARS,7 YEARS,8 YEARS,9 YEARS
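
The timeband columns in the output above (`20 YEARS`, `25 YEARS`, `3 YEARS`, ...) suggest that the curve data is pivoted from long to wide format before it is joined to the prices. How `join_full` does this is not shown in this notebook, so the following is only a minimal sketch under that assumption, using made-up toy data and the `bid` column as the curve value.

```python
import pandas as pd

# Long-format curve, shaped like df_yield; values are made up.
curve = pd.DataFrame(
    {
        "country": ["Germany"] * 4,
        "rate_dt": pd.to_datetime(["2017-02-07", "2017-02-07", "2017-02-08", "2017-02-08"]),
        "timeband": ["1 YEAR", "2 YEARS", "1 YEAR", "2 YEARS"],
        "bid": [-0.875, -0.799, -0.870, -0.795],
    }
)

# One row per (country, rate_dt), one column per timeband.
wide = curve.pivot(index=["country", "rate_dt"], columns="timeband", values="bid")

# Toy price rows that already carry the country of the bond.
prices = pd.DataFrame(
    {
        "country": ["Germany", "Germany"],
        "rate_dt": pd.to_datetime(["2017-02-07", "2017-02-08"]),
        "mid": [101.2, 101.3],
    }
)

# Attach the curve points of the same country and date to each price.
joined = prices.merge(wide.reset_index(), on=["country", "rate_dt"], how="left")
print(joined)
```

`pivot` sorts the new column labels as strings, which would explain why `20 YEARS` appears before `3 YEARS` in the output above.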


In [None]:
df_inflation.describe()

All of the steps above in a single call:

In [None]:
make_dataset.make_data()