<p>Importing the packages:</p>

In [1]:
import pandas as pd
import glob
import requests
import json
from time import sleep
import extractdata as ed

# Data Preparation
## Read Data from several CSV files

<p> Loading the files on CSPP corporate bond holdings: </p>

In [2]:
df = ed.pd_read_csv('data/*holdings_*.csv', 'iso8859-1')
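
<p>The helper <code>ed.pd_read_csv</code> is not shown in this notebook. As a rough sketch of what such a helper might do, the hypothetical function <code>read_csv_glob</code> below reads every file matching a glob pattern and stacks the rows. The function name and its behaviour are assumptions; in particular, how the real helper derives the <code>MONTH</code> column is not covered here.</p>

```python
import glob

import pandas as pd


def read_csv_glob(pattern, encoding="utf-8"):
    """Hypothetical stand-in for ed.pd_read_csv: read every file matching
    `pattern` and stack the rows, keeping the union of all columns."""
    paths = sorted(glob.glob(pattern))
    if not paths:
        raise FileNotFoundError(f"no files match {pattern!r}")
    frames = [pd.read_csv(path, encoding=encoding) for path in paths]
    # Columns that are missing from a given file are filled with NaN,
    # so differently named columns show up as partially empty.
    return pd.concat(frames, ignore_index=True, sort=False)
```

<p>Stacking files whose headers differ is one plausible reason for the many null values reported in the next cell.</p>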

In [3]:
# check for the null values
df.isna().sum()

NCB                2069
ISIN_CODE         21215
ISSUER_NAME_      56389
MATURITY_DATE_    56389
COUPON_RATE_      56389
MONTH                 0
ISSUER_NAME       25409
MATURITY_DATE     25409
COUPON_RATE       25409
Unnamed: 0        59371
ISIN              40226
ISSUER            40226
MATURITY DATE     40226
COUPON RATE       40226
dtype: int64

<p> Column names differ slightly between reports generated in different months. Filling in the empty cells to gather all relevant data into one column: </p>

In [4]:
# fill in empty cells (assign the result back: inplace=True on a column
# selection is not guaranteed to update df in newer pandas versions)
df['NCB'] = df['NCB'].fillna(df['Unnamed: 0'])
df['ISIN'] = df['ISIN'].fillna(df['ISIN_CODE'])
df['ISSUER'] = df['ISSUER'].fillna(df['ISSUER_NAME_'])
df['ISSUER'] = df['ISSUER'].fillna(df['ISSUER_NAME'])
df['MATURITY DATE'] = df['MATURITY DATE'].fillna(df['MATURITY_DATE_'])
df['MATURITY DATE'] = df['MATURITY DATE'].fillna(df['MATURITY_DATE'])
df['COUPON RATE'] = df['COUPON RATE'].fillna(df['COUPON_RATE_'])
df['COUPON RATE'] = df['COUPON RATE'].fillna(df['COUPON_RATE'])
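
<p>The fill-in step above takes, for each row, the first non-null value among several differently named columns. A minimal sketch of that idea on toy data follows; the helper name <code>coalesce</code> and the sample values are made up for illustration.</p>

```python
import pandas as pd


def coalesce(frame, columns):
    """Return, row by row, the first non-null value among `columns`."""
    result = frame[columns[0]]
    for name in columns[1:]:
        # combine_first keeps existing values and fills only the gaps.
        result = result.combine_first(frame[name])
    return result


# Toy data: the third row has no value in either column and stays null.
toy = pd.DataFrame({
    "ISIN": ["XS1", None, None],
    "ISIN_CODE": [None, "FR2", None],
})
toy["ISIN"] = coalesce(toy, ["ISIN", "ISIN_CODE"])
print(toy["ISIN"])
```

<p>Rows that remain null after coalescing are the ones dropped by the <code>notna()</code> filter in the following cell.</p>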

In [5]:
# choose necessary columns and rows
df = df[df['ISIN'].notna()][['MONTH', 'NCB', 'ISIN', 'ISSUER', 'MATURITY DATE', 'COUPON RATE']]

In [6]:
# check for the null values
df.isna().sum()

MONTH            0
NCB              0
ISIN             0
ISSUER           0
MATURITY DATE    0
COUPON RATE      0
dtype: int64

In [7]:
df.head()

Unnamed: 0,MONTH,NCB,ISIN,ISSUER,MATURITY DATE,COUPON RATE
0,2017/06,IT,XS1088274169,2i Rete Gas S.p.A.,16/07/2019,1.75
1,2017/06,IT,XS1088274672,2i Rete Gas S.p.A.,16/07/2024,3.0
2,2017/06,IT,XS1144492532,2i Rete Gas S.p.A.,02/01/2020,1.125
3,2017/06,IT,XS1571982468,2i Rete Gas S.p.A.,28/08/2026,1.75
4,2017/06,IT,XS0859920406,A2A S.p.A.,28/11/2019,4.5


## Data Extraction via API

<p>We use several PermID APIs to assign industrial sectors to companies. Access token for the API requests:</p>

In [8]:
token = 'r3S0DwnAKTqYq9jgJIs0XI04YBDWjPVJ'
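
<p>A token written directly into the notebook is visible to anyone who can read the file. One possible alternative, sketched here under the assumption that the token is stored in an environment variable, is to read it at run time. The variable name <code>PERMID_TOKEN</code> and the helper <code>load_token</code> are hypothetical.</p>

```python
import os


def load_token(var_name="PERMID_TOKEN"):
    """Read the access token from an environment variable.

    The variable name PERMID_TOKEN is an assumption made for this sketch,
    not a PermID convention.
    """
    token = os.environ.get(var_name)
    if not token:
        raise RuntimeError(f"set the {var_name} environment variable first")
    return token
```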

<p> Creating a template for the Record Matching API: </p>

In [9]:
template = pd.DataFrame({'Company': df['ISSUER']})
template['Value0'] = template['Company'].str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8').\
                                    str.strip().str.lower().str.replace('.', '', regex=False).str.replace('/', ' ', regex=False).\
                                    str.replace('-', ' ', regex=False).str.replace('(', '', regex=False).str.replace(')', '', regex=False).\
                                    str.split(',', n=1).str[0]
template['Value2'] = template['Value0'].str.split().str[:2].str.join(' ')
template['Value1'] = template['Value0'].str.split().str[:1].str.join(' ')
template['Name'] =  template[['Value1','Value2','Value0']].apply(lambda x: '|'.join(x), axis = 1)
df['KEY NAME'] = template['Name']
template = template['Name']
template.head()

0    2i|2i rete|2i rete gas spa
1    2i|2i rete|2i rete gas spa
2    2i|2i rete|2i rete gas spa
3    2i|2i rete|2i rete gas spa
4           a2a|a2a spa|a2a spa
Name: Name, dtype: object
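
<p>To make the key construction easier to follow, here is a plain-Python sketch of the same steps for a single issuer name. The function name <code>make_key</code> is introduced only for illustration; the notebook itself uses the vectorised pandas version above.</p>

```python
import unicodedata


def make_key(issuer):
    """Build the 'first word|first two words|full name' key for one issuer."""
    # Drop accents and any other non-ASCII characters.
    text = unicodedata.normalize("NFKD", issuer)
    text = text.encode("ascii", errors="ignore").decode("ascii")
    text = text.strip().lower()
    for old, new in ((".", ""), ("/", " "), ("-", " "), ("(", ""), (")", "")):
        text = text.replace(old, new)
    # Keep only the part before the first comma,
    # e.g. "Buzzi Unicem S.p.A., Casale Monferrato".
    full = text.split(",", 1)[0]
    words = full.split()
    return "|".join([" ".join(words[:1]), " ".join(words[:2]), full])


print(make_key("2i Rete Gas S.p.A."))  # 2i|2i rete|2i rete gas spa
```

<p>Sending three variants of the name gives the matching service progressively more specific candidates to work with.</p>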

In [10]:
template = template.drop_duplicates().to_csv(index=False)

<p>Finding the companies' PermIDs by their names using the Record Matching API:</p>

In [11]:
match_results = ed.record_matching(token, template = template)

Processed: 442
Matched: 
  Total 404
  Excellent 357
  Good 21
  Possible 26
Unmatched: 38


<p>Sometimes the request is processed successfully but returns no matches because of an unexpected server error. In that case the request is sent again and a message is printed.</p>
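
<p>The resend behaviour described above can be sketched as a small retry loop. This is only an illustration: <code>call_with_retry</code>, <code>request</code> and <code>has_matches</code> are hypothetical names, and <code>ed.record_matching</code> may implement its retry differently.</p>

```python
from time import sleep


def call_with_retry(request, has_matches, attempts=3, pause=1.0):
    """Call `request()` until `has_matches(result)` is true or attempts run out.

    Both callables are placeholders supplied by the caller.
    """
    result = None
    for attempt in range(1, attempts + 1):
        result = request()
        if has_matches(result):
            return result
        if attempt < attempts:
            print(f"Attempt {attempt} returned no matches, sending the request again")
            sleep(pause)
    # All attempts used up: hand back the last (empty) result.
    return result
```

<p>Capping the number of attempts avoids an endless loop when the input genuinely has no matches.</p>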

<p> There are 38 unmatched results out of 442.</p>

In [12]:
# output of the record matching result
output = pd.DataFrame([d for d in match_results['outputContentResponse']])
output.head()

Unnamed: 0,ProcessingStatus,Match OpenPermID,Match OrgName,Match Score,Match Level,Match Ordinal,Original Row Number,Input_Name
0,OK,https://permid.org/1-5000936840,2I Rete Gas SpA,92%,Excellent,1,2,2i|2i rete|2i rete gas spa
1,OK,https://permid.org/1-5000005309,A2A SpA,92%,Excellent,1,3,a2a|a2a spa|a2a spa
2,OK,https://permid.org/1-5000066931,ABB Finance BV,92%,Excellent,1,4,abb|abb finance|abb finance bv
3,OK,https://permid.org/1-4295889666,Abertis Infraestructuras SA,92%,Excellent,1,5,abertis|abertis infraestructuras|abertis infra...
4,OK,https://permid.org/1-4295875677,Acea SpA,92%,Excellent,1,6,acea|acea spa|acea spa


In [13]:
# companies that did not have a match
noMatch = pd.DataFrame(output[output['Match Level'] == 'No Match']['Input_Name'])
noMatch

Unnamed: 0,Input_Name
12,airbus|airbus group|airbus group finance bv
30,autostr|autostr bresvervicpad|autostr bresverv...
47,caterpillar|caterpillar intl|caterpillar intl ...
49,ciba|ciba spc|ciba spc chem fin lxbg sa
65,delhaize|delhaize group|delhaize group sa
73,deutsche|deutsche wohnen|deutsche wohnen ag
75,eon|eon intl|eon intl finance bv
76,eon|eon se|eon se
78,eandis|eandis cvba|eandis cvba
86,elia|elia system|elia system operator nv


<p><b> Note: to be continued. A possible alternative for the unmatched companies: use the Entity Search API. </b></p>

<p>Using the Entity Lookup API to get the data on industry for each company and joining it to the main dataframe:</p>

In [14]:
# companies that were matched
companies = pd.DataFrame(output[output['Match Level'] != 'No Match'])

In [15]:
IDs = companies['Match OpenPermID']

In [16]:
sectors = []
for permID in IDs:
    lookup_company = ed.entity_lookup(token, permID)
    if 'hasPrimaryBusinessSector' in lookup_company:
        sectorID = lookup_company['hasPrimaryBusinessSector']
        lookup_sector = ed.entity_lookup(token, sectorID)
        sectors.append(pd.DataFrame({'Match OpenPermID' : [permID],
                                     'Sector PermID' : [sectorID],
                                     'Sector' : [lookup_sector['prefLabel']],
                                     'Sector Description' : lookup_sector['rdfs:comment']}))
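
<p>The loop above looks up the sector again for every company, even when many companies share the same sector. A possible variation, sketched here with a hypothetical <code>sector_rows</code> helper, caches each sector after its first lookup. It assumes that the lookup returns a dictionary, as the cell above suggests; the <code>lookup</code> argument stands in for a call such as <code>ed.entity_lookup(token, permID)</code>.</p>

```python
def sector_rows(perm_ids, lookup):
    """Collect one row per company, querying each sector only once.

    `lookup` is any callable mapping a PermID to a dict of its properties.
    """
    sector_cache = {}
    rows = []
    for perm_id in perm_ids:
        company = lookup(perm_id)
        sector_id = company.get("hasPrimaryBusinessSector")
        if sector_id is None:
            continue  # no primary business sector recorded for this company
        if sector_id not in sector_cache:
            sector_cache[sector_id] = lookup(sector_id)
        sector = sector_cache[sector_id]
        rows.append({
            "Match OpenPermID": perm_id,
            "Sector PermID": sector_id,
            "Sector": sector.get("prefLabel"),
            "Sector Description": sector.get("rdfs:comment"),
        })
    return rows
```

<p>The returned list of dictionaries can be passed straight to <code>pd.DataFrame</code>, which would replace the per-company one-row frames and the later <code>pd.concat</code>.</p>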

In [17]:
sectors = pd.concat(sectors,ignore_index=True).drop_duplicates()

In [18]:
companies = companies.merge(sectors, how = 'left', on = 'Match OpenPermID')[['Input_Name','Sector']]

In [19]:
df = df.merge(companies, left_on='KEY NAME', right_on='Input_Name', how='left').drop(columns=['KEY NAME','Input_Name'])
df.sample()

Unnamed: 0,MONTH,NCB,ISIN,ISSUER,MATURITY DATE,COUPON RATE,Sector
52218,2020/12,DE,XS1548436473,BMW Finance N.V.,12/07/2024,0.75,


In [20]:
df.isna().sum()

MONTH                0
NCB                  0
ISIN                 0
ISSUER               0
MATURITY DATE        0
COUPON RATE          0
Sector           10153
dtype: int64
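
<p>The 10153 rows without a sector come from issuers that were either unmatched or matched to a company with no primary business sector. A small sketch for finding which issuers account for those rows is given below; <code>missing_sector_report</code> and the toy data are hypothetical, while the column names follow the dataframe above.</p>

```python
import pandas as pd


def missing_sector_report(frame):
    """Return the share of rows without a sector and the issuers affected."""
    missing = frame["Sector"].isna()
    share = missing.mean()
    issuers = (frame.loc[missing, "ISSUER"]
               .value_counts()
               .rename_axis("ISSUER")
               .reset_index(name="rows_without_sector"))
    return share, issuers


# Toy data standing in for the merged dataframe.
toy = pd.DataFrame({
    "ISSUER": ["A", "A", "B", "C"],
    "Sector": [None, None, "Utilities", None],
})
share, issuers = missing_sector_report(toy)
print(share)
print(issuers)
```

<p>Sorting issuers by the number of affected rows shows where a manual mapping or an additional API lookup would recover the most data.</p>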

In [21]:
df.sample(10)

Unnamed: 0,MONTH,NCB,ISIN,ISSUER,MATURITY DATE,COUPON RATE,Sector
44146,2020/07,BE,XS2187525949,Alliander N.V.,10/06/2030,0.375,Utilities
6979,2017/12,FR,XS1408317433,Orange S.A.,12/05/2025,1.0,Telecommunications Services
58530,2021/04,DE,DE000A19B8E2,Vonovia Finance B.V.,25/01/2027,1.75,Real Estate
37732,2020/02,FR,FR0013266830,Legrand S.A.,06/07/2024,0.75,Industrial Goods
41731,2020/05,FR,FR0010961581,Electricité de France (E.D.F.),12/11/2040,4.5,Investment Holding Companies
17189,2018/09,FR,XS1569845404,Unibail-Rodamco SE,22/02/2028,1.5,Real Estate
45266,2020/07,IT,XS1578294081,Italgas S.P.A.,14/03/2024,1.125,Utilities
41125,2020/05,BE,XS2156598281,Akzo Nobel N.V.,14/04/2030,1.625,Chemicals
56342,2021/02,IT,XS1401125346,"Buzzi Unicem S.p.A., Casale Monferrato",28/04/2023,2.125,Mineral Resources
22545,2019/02,DE,XS0753143709,Deutsche Bahn Finance B.V.,08/03/2024,3.0,Banking & Investment Services


In [22]:
df['Sector'].unique()

array(['Utilities', 'Industrial Goods', 'Transportation', 'Insurance',
       'Industrial & Commercial Services', nan, 'Chemicals',
       'Real Estate', 'Banking & Investment Services', 'Food & Beverages',
       'Energy - Fossil Fuels', 'Technology Equipment',
       'Pharmaceuticals & Medical Research', 'Cyclical Consumer Services',
       'Software & IT Services', 'Telecommunications Services',
       'Cyclical Consumer Products', 'Automobiles & Auto Parts',
       'Financial Technology (Fintech) & Infrastructure',
       'Investment Holding Companies', 'Mineral Resources', 'Retailers',
       'Consumer Goods Conglomerates', 'Collective Investments',
       'Healthcare Services & Equipment', 'Food & Drug Retailing',
       'Applied Resources'], dtype=object)

<p><b> Note: I think Business Sectors are ok to use for the analysis. For making the visualisations, check "Decarbonising is easy": https://neweconomics.org/uploads/files/Decarbonising-is-easy.pdf.</b></p>