Importing the packages:

In [1]:
import pandas as pd
import boto3
import io
from io import StringIO
import requests
import json
from time import sleep
import permidapi as api
import s3data

# Data Preparation
## Read Data from several CSV files

To access the data, configure your Boto3 credentials (AWS Access Key ID, AWS Secret Access Key, default region name) by running `aws configure` in the Anaconda Prompt.

In [2]:
bucket = 's3groupmorocco'

Loading files on CSPP corporate bonds holdings with filenames starting with "CSPPholdings_":

In [3]:
df1 = s3data.read_multiple_csv(bucket, 'data/CSPPholdings_', 'iso8859-1')

Loading files on CSPP corporate bonds holdings with filenames starting with "CSPP_PEPP_corporate_bond_holdings_":

In [4]:
df2 = s3data.read_multiple_csv(bucket, 'data/CSPP_PEPP_corporate_bond_holdings_', 'iso8859-1')
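
The `s3data` helper module is not shown in this notebook. As a rough illustration only, a function like `read_multiple_csv` could be built on the Boto3 S3 client calls `list_objects_v2` and `get_object`. The name `read_multiple_csv_sketch` and the explicit `s3_client` argument are assumptions made for this sketch; the actual helper may behave differently, and this version does not add any extra columns such as `MONTH`.

```python
import io

import pandas as pd


def read_multiple_csv_sketch(s3_client, bucket, prefix, encoding='utf-8'):
    """Read every CSV object whose key starts with `prefix` into one dataframe.

    Hypothetical stand-in for s3data.read_multiple_csv, not its real code.
    """
    # list_objects_v2 returns at most 1000 keys per call; a paginator would be
    # needed for larger listings.
    listing = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix)
    frames = []
    for obj in listing.get('Contents', []):
        body = s3_client.get_object(Bucket=bucket, Key=obj['Key'])['Body'].read()
        frames.append(pd.read_csv(io.BytesIO(body), encoding=encoding))
    # Stack the monthly files on top of each other with a fresh index.
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
```

With real credentials this might be called as `read_multiple_csv_sketch(boto3.client('s3'), bucket, 'data/CSPPholdings_', 'iso8859-1')`.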

Combining the two dataframes:

In [5]:
df = pd.concat([df1, df2], ignore_index=True)

In [6]:
# check for the null values
df.isna().sum()

NCB                2069
ISIN_CODE         21215
ISSUER_NAME_      56389
MATURITY_DATE_    56389
COUPON_RATE_      56389
MONTH                 0
ISSUER_NAME       25409
MATURITY_DATE     25409
COUPON_RATE       25409
Unnamed: 0        59371
ISIN              40226
ISSUER            40226
MATURITY DATE     40226
COUPON RATE       40226
dtype: int64

Column names differ slightly between reports generated in different months. Fill in the empty cells so that all relevant data ends up in one column:

In [7]:
# fill in empty cells
# assign the result back: calling fillna(inplace=True) on a selected column
# is deprecated in recent pandas and may not update df
df['NCB'] = df['NCB'].fillna(df['Unnamed: 0'])
df['ISIN'] = df['ISIN'].fillna(df['ISIN_CODE'])
df['ISSUER'] = df['ISSUER'].fillna(df['ISSUER_NAME_'])
df['ISSUER'] = df['ISSUER'].fillna(df['ISSUER_NAME'])
df['MATURITY DATE'] = df['MATURITY DATE'].fillna(df['MATURITY_DATE_'])
df['MATURITY DATE'] = df['MATURITY DATE'].fillna(df['MATURITY_DATE'])
df['COUPON RATE'] = df['COUPON RATE'].fillna(df['COUPON_RATE_'])
df['COUPON RATE'] = df['COUPON RATE'].fillna(df['COUPON_RATE'])
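
The column filling above can be seen more easily on a toy example. The sketch below uses made-up values and a hypothetical helper, `coalesce`, that takes the first non-null value across a target column and its alternatives; it is an illustration, not part of the notebook's pipeline.

```python
import pandas as pd


def coalesce(frame, target, alternatives):
    """Return `target` with gaps filled from each alternative column in turn."""
    result = frame[target]
    for name in alternatives:
        result = result.fillna(frame[name])
    return result


# Made-up rows: each report variant stores the ISIN under a different name.
toy = pd.DataFrame({
    'ISIN': [None, 'XS0000000002', None],
    'ISIN_CODE': ['XS0000000001', None, None],
})
toy['ISIN'] = coalesce(toy, 'ISIN', ['ISIN_CODE'])
print(toy['ISIN'])
```

Rows where every variant is empty stay null, which is why the next cell filters on `df['ISIN'].notna()`.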

In [8]:
# choose necessary columns and rows
df = df[df['ISIN'].notna()][['MONTH', 'NCB', 'ISIN', 'ISSUER', 'MATURITY DATE', 'COUPON RATE']]

In [9]:
# check for the null values
df.isna().sum()

MONTH            0
NCB              0
ISIN             0
ISSUER           0
MATURITY DATE    0
COUPON RATE      0
dtype: int64

In [10]:
df.head()

Unnamed: 0,MONTH,NCB,ISIN,ISSUER,MATURITY DATE,COUPON RATE
0,2017/06,IT,XS1088274169,2i Rete Gas S.p.A.,16/07/2019,1.75
1,2017/06,IT,XS1088274672,2i Rete Gas S.p.A.,16/07/2024,3.0
2,2017/06,IT,XS1144492532,2i Rete Gas S.p.A.,02/01/2020,1.125
3,2017/06,IT,XS1571982468,2i Rete Gas S.p.A.,28/08/2026,1.75
4,2017/06,IT,XS0859920406,A2A S.p.A.,28/11/2019,4.5


## Data Extraction via API

We use several PermID APIs to assign Industry Groups and Business Sectors to companies. Access token for API requests:

In [11]:
token = 'r3S0DwnAKTqYq9jgJIs0XI04YBDWjPVJ'

Creating a template for Record Matching API. Instructions on the format of the template can be found here: https://permid.org/match

In [12]:
template = pd.DataFrame({'Company': df['ISSUER']})
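# regex=False treats '.', '(' and ')' as literal characters; n=1 is passed by
# keyword because newer pandas no longer accepts it positionally
template['Value0'] = (template['Company'].str.normalize('NFKD')
                      .str.encode('ascii', errors='ignore').str.decode('utf-8')
                      .str.strip().str.lower()
                      .str.replace('.', '', regex=False).str.replace('/', ' ', regex=False)
                      .str.replace('-', ' ', regex=False).str.replace('(', '', regex=False)
                      .str.replace(')', '', regex=False).str.split(',', n=1).str[0])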
template['Value2'] = template['Value0'].str.split().str[:2].str.join(' ')
template['Value1'] = template['Value0'].str.split().str[:1].str.join(' ')
template['Name'] =  template[['Value1','Value2','Value0']].apply(lambda x: '|'.join(x), axis = 1)
df['KEY NAME'] = template['Name']
template = template['Name']
template.sample(5)

19679            deutsche|deutsche borse|deutsche borse ag
58915                               snam|snam spa|snam spa
2977     anheuser|anheuser busch|anheuser busch inbev s...
55363         dassault|dassault systemes|dassault systemes
26580    ferrovie|ferrovie dello|ferrovie dello stato i...
Name: Name, dtype: object
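
To make the key format explicit, the same normalisation can be written for a single issuer name in plain Python. The function name `build_match_key` is hypothetical and the second input is an illustrative spelling; the steps mirror the pandas chain in the cell above.

```python
import unicodedata


def build_match_key(issuer):
    """Build 'first word|first two words|full name' from one issuer name."""
    # Strip accents: decompose characters, then drop anything non-ASCII.
    text = unicodedata.normalize('NFKD', issuer)
    text = text.encode('ascii', errors='ignore').decode('utf-8')
    text = text.strip().lower()
    for old, new in (('.', ''), ('/', ' '), ('-', ' '), ('(', ''), (')', '')):
        text = text.replace(old, new)
    # Keep only the part before the first comma.
    full = text.split(',', 1)[0]
    words = full.split()
    return '|'.join([' '.join(words[:1]), ' '.join(words[:2]), full])


print(build_match_key('2i Rete Gas S.p.A.'))      # 2i|2i rete|2i rete gas spa
print(build_match_key('Deutsche B\u00f6rse AG'))  # deutsche|deutsche borse|deutsche borse ag
```

Sending three progressively longer variants of the name gives the matching service several chances to find the company.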

In [13]:
template = template.drop_duplicates().to_csv(index=False)

Find companies' PermIDs by their names using Record Matching API:

In [14]:
match_results = api.record_matching(token, template = template)

Processed: 442
Matched: 
  Total 404
  Excellent 357
  Good 21
  Possible 26
Unmatched: 38


Sometimes the request is processed successfully but returns no matches because of an unexpected server error. In that case the request is sent again and a message is printed.

There are 38 unmatched results out of 442.
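
The resend behaviour lives inside the `permidapi` module, which is not shown here. A generic sketch of such a retry loop, under the assumption that an empty response can be detected by a caller-supplied check, might look like this. The names `call_with_retry`, `request` and `is_empty` are hypothetical.

```python
from time import sleep


def call_with_retry(request, is_empty, attempts=3, delay=1.0):
    """Call `request()` until `is_empty(result)` is False or attempts run out."""
    result = None
    for attempt in range(1, attempts + 1):
        result = request()
        if not is_empty(result):
            return result
        if attempt < attempts:
            # Tell the user that the empty response triggered another try.
            print(f'Attempt {attempt} returned no matches, retrying...')
            sleep(delay)
    # All attempts were empty: hand back the last response unchanged.
    return result
```

One possible use, assuming the response keeps its rows under `outputContentResponse` as in the next cell, would be `call_with_retry(lambda: api.record_matching(token, template=template), lambda r: not r.get('outputContentResponse'))`.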

In [15]:
# output of the record matching result
output = pd.DataFrame([d for d in match_results['outputContentResponse']])
output.head()

Unnamed: 0,ProcessingStatus,Match OpenPermID,Match OrgName,Match Score,Match Level,Match Ordinal,Original Row Number,Input_Name
0,OK,https://permid.org/1-5000936840,2I Rete Gas SpA,92%,Excellent,1,2,2i|2i rete|2i rete gas spa
1,OK,https://permid.org/1-5000005309,A2A SpA,92%,Excellent,1,3,a2a|a2a spa|a2a spa
2,OK,https://permid.org/1-5000066931,ABB Finance BV,92%,Excellent,1,4,abb|abb finance|abb finance bv
3,OK,https://permid.org/1-4295889666,Abertis Infraestructuras SA,92%,Excellent,1,5,abertis|abertis infraestructuras|abertis infra...
4,OK,https://permid.org/1-4295875677,Acea SpA,92%,Excellent,1,6,acea|acea spa|acea spa


In [16]:
# companies that did not have a match
noMatch = pd.DataFrame(output[output['Match Level'] == 'No Match']['Input_Name'])
noMatch

Unnamed: 0,Input_Name
12,airbus|airbus group|airbus group finance bv
30,autostr|autostr bresvervicpad|autostr bresverv...
47,caterpillar|caterpillar intl|caterpillar intl ...
49,ciba|ciba spc|ciba spc chem fin lxbg sa
65,delhaize|delhaize group|delhaize group sa
73,deutsche|deutsche wohnen|deutsche wohnen ag
75,eon|eon intl|eon intl finance bv
76,eon|eon se|eon se
78,eandis|eandis cvba|eandis cvba
86,elia|elia system|elia system operator nv


<b>Note: to be continued. Another possible solution is to use the Entity Search API.</b>

Next we have to get the data on industry for each company and join it to the main dataframe. <br> We choose the data on companies that were matched with their PermIDs:

In [17]:
# companies that were matched
companies = pd.DataFrame(output[output['Match Level'] != 'No Match'])

A list of PermIDs for all companies:

In [18]:
IDs = companies['Match OpenPermID']

We look for Industry Group and Business Sector of each company by its PermID using Entity Lookup API. <br>
First, we request information about the company. If Industry Group key is available in the response, we request information about Industry Group and Business Sector using their PermIDs. <br> 
<b>Note</b>: for one Industry Group the label value was returned as a list (['Freight&Logistics Services', 'Freight & Logistics Services']), so an additional type check is added.

In [19]:
sectors = []
for permID in IDs:
    lookup_company = api.entity_lookup(token, permID)
    if 'hasPrimaryIndustryGroup' in lookup_company:
        industryID = lookup_company['hasPrimaryIndustryGroup']
        sectorID = lookup_company['hasPrimaryBusinessSector']
        lookup_industry = api.entity_lookup(token, industryID)
        lookup_sector = api.entity_lookup(token, sectorID)
        # prefLabel is usually a string but can be a list of label variants
        label = lookup_industry['prefLabel']
        industry = label[0] if isinstance(label, list) else label
        sectors.append(pd.DataFrame({'Match OpenPermID' : [permID],
                                     'Industry PermID' : [industryID],
                                     'Industry Group' : [industry],
                                     'Industry Description' : [lookup_industry['rdfs:comment']],
                                     'Sector PermID' : [sectorID],
                                     'Business Sector' : [lookup_sector['prefLabel']],
                                     'Sector Description' : [lookup_sector['rdfs:comment']]}))
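
Many companies share the same Industry Group and Business Sector, so the loop above may request the same PermID repeatedly. One possible refinement, sketched here with hypothetical helper names, is to cache lookups in a dictionary and to factor out the list check on labels.

```python
def first_label(value):
    """Return the first element when the API returns a list of labels."""
    return value[0] if isinstance(value, list) else value


def make_cached_lookup(lookup):
    """Wrap `lookup(perm_id)` so that each PermID is requested only once."""
    cache = {}

    def cached(perm_id):
        if perm_id not in cache:
            cache[perm_id] = lookup(perm_id)
        return cache[perm_id]

    return cached
```

In this notebook the wrapper could be created as `cached = make_cached_lookup(lambda perm_id: api.entity_lookup(token, perm_id))` and used in place of the direct calls for `industryID` and `sectorID`.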

In [20]:
sectors = pd.concat(sectors,ignore_index=True).drop_duplicates()
sectors.head()

Unnamed: 0,Match OpenPermID,Industry PermID,Industry Group,Industry Description,Sector PermID,Business Sector,Sector Description
0,https://permid.org/1-5000936840,https://permid.org/1-4294952817,Natural Gas Utilities,Producers and distributors of natural gas.,https://permid.org/1-4294952820,Utilities,"Producers and distributors of electricity, nat..."
1,https://permid.org/1-5000005309,https://permid.org/1-4294952819,Electric Utilities & IPPs,Generators and distributors of electric power....,https://permid.org/1-4294952820,Utilities,"Producers and distributors of electricity, nat..."
2,https://permid.org/1-5000066931,https://permid.org/1-4294952765,"Machinery, Tools, Heavy Vehicles, Trains & Ships","Manufacturers of industrial, construction, agr...",https://permid.org/1-4294952766,Industrial Goods,Manufacturers of aerospace & defense equipment...
3,https://permid.org/1-4295889666,https://permid.org/1-4294952750,Transport Infrastructure,"Owners and operators of highways, rail tracks,...",https://permid.org/1-4294952945,Transportation,"Transporters of freight and passengers by air,..."
4,https://permid.org/1-4295875677,https://permid.org/1-4294952813,Multiline Utilities,"Producers and distributors of electric power, ...",https://permid.org/1-4294952820,Utilities,"Producers and distributors of electricity, nat..."


Merge the sector data into the companies dataframe and keep only the necessary columns:

In [21]:
companies = companies.merge(sectors, how = 'left', on = 'Match OpenPermID')[['Input_Name','Industry Group','Business Sector']]
companies.head()

Unnamed: 0,Input_Name,Industry Group,Business Sector
0,2i|2i rete|2i rete gas spa,Natural Gas Utilities,Utilities
1,a2a|a2a spa|a2a spa,Electric Utilities & IPPs,Utilities
2,abb|abb finance|abb finance bv,"Machinery, Tools, Heavy Vehicles, Trains & Ships",Industrial Goods
3,abertis|abertis infraestructuras|abertis infra...,Transport Infrastructure,Transportation
4,acea|acea spa|acea spa,Multiline Utilities,Utilities
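
A left merge silently multiplies rows if the right-hand table contains duplicate keys. As an optional safeguard, pandas can check key uniqueness with `validate` and report where each row came from with `indicator`. The sketch below uses made-up identifiers rather than real PermIDs.

```python
import pandas as pd

# Made-up keys: 'id-3' has no sector information.
left = pd.DataFrame({'Match OpenPermID': ['id-1', 'id-2', 'id-3']})
right = pd.DataFrame({'Match OpenPermID': ['id-1', 'id-2'],
                      'Business Sector': ['Utilities', 'Transportation']})

# validate='many_to_one' raises MergeError if `right` repeats a key;
# indicator=True adds a '_merge' column ('both' or 'left_only').
merged = left.merge(right, how='left', on='Match OpenPermID',
                    validate='many_to_one', indicator=True)
print(merged)
```

Rows marked `left_only` are the companies for which no sector was found.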


Add data about companies' Industry Groups and Business Sectors to the main dataframe by the template Input Name:

In [22]:
df = df.merge(companies, left_on='KEY NAME', right_on='Input_Name', how='left').drop(columns=['KEY NAME','Input_Name'])
df.head()

Unnamed: 0,MONTH,NCB,ISIN,ISSUER,MATURITY DATE,COUPON RATE,Industry Group,Business Sector
0,2017/06,IT,XS1088274169,2i Rete Gas S.p.A.,16/07/2019,1.75,Natural Gas Utilities,Utilities
1,2017/06,IT,XS1088274672,2i Rete Gas S.p.A.,16/07/2024,3.0,Natural Gas Utilities,Utilities
2,2017/06,IT,XS1144492532,2i Rete Gas S.p.A.,02/01/2020,1.125,Natural Gas Utilities,Utilities
3,2017/06,IT,XS1571982468,2i Rete Gas S.p.A.,28/08/2026,1.75,Natural Gas Utilities,Utilities
4,2017/06,IT,XS0859920406,A2A S.p.A.,28/11/2019,4.5,Electric Utilities & IPPs,Utilities


In [23]:
df.isna().sum()

MONTH                  0
NCB                    0
ISIN                   0
ISSUER                 0
MATURITY DATE          0
COUPON RATE            0
Industry Group     10153
Business Sector    10153
dtype: int64
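
The remaining nulls correspond to issuers that were not matched or whose company record carried no Industry Group. A short sketch on made-up data shows one way to quantify the gap and list the issuers behind it; the variable names are illustrative.

```python
import pandas as pd

# Made-up holdings: issuers 'B' and 'C' have no sector assigned.
toy = pd.DataFrame({
    'ISSUER': ['A', 'A', 'B', 'C'],
    'Business Sector': ['Utilities', 'Utilities', None, None],
})

missing = toy[toy['Business Sector'].isna()]
share = len(missing) / len(toy)                # fraction of rows without a sector
by_issuer = missing['ISSUER'].value_counts()   # rows affected per issuer

print(f'{share:.0%} of rows have no sector')
print(by_issuer)
```

Applied to `df`, the `by_issuer` table would show which issuer names contribute most to the unclassified rows.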

Write the resulting dataframe to the "CSPP_bonds_with_sectors.csv" file in the S3 bucket:

In [24]:
s3data.df_to_csv(df, bucket, 'data/CSPP_bonds_with_sectors.csv')
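
As with reading, the implementation of `s3data.df_to_csv` is not shown. A minimal sketch of how such a helper might upload a dataframe with the Boto3 S3 client call `put_object` follows. The function name `df_to_csv_sketch`, the explicit `s3_client` argument and the choice to omit the index are assumptions.

```python
from io import StringIO

import pandas as pd


def df_to_csv_sketch(s3_client, frame: pd.DataFrame, bucket: str, key: str) -> None:
    """Serialise `frame` to CSV in memory and upload it as one S3 object.

    Hypothetical stand-in for s3data.df_to_csv, not its real code.
    """
    buffer = StringIO()
    frame.to_csv(buffer, index=False)  # assumption: the index is not written
    s3_client.put_object(Bucket=bucket, Key=key,
                         Body=buffer.getvalue().encode('utf-8'))
```

With real credentials this might be called as `df_to_csv_sketch(boto3.client('s3'), df, bucket, 'data/CSPP_bonds_with_sectors.csv')`.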