<p>Importing the packages:</p>

In [1]:
import pandas as pd
import glob
import requests
import json
from time import sleep
import extractdata as ed

# Data Preparation
## Read Data from several CSV files

<p> Loading files on CSPP corporate bonds holdings: </p>

In [2]:
df = ed.pd_read_csv('data/*holdings_*.csv', 'iso8859-1')
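
<p> The helper <code>ed.pd_read_csv</code> comes from the local <code>extractdata</code> module, which is not shown here. A minimal sketch of what such a helper might do, assuming it simply concatenates every file matching the pattern. The name <code>read_csv_glob</code> is hypothetical, and any extra logic in the real helper (for example, how the MONTH column is filled) is not reproduced: </p>

```python
import glob

import pandas as pd


def read_csv_glob(pattern, encoding):
    """Hypothetical stand-in for ed.pd_read_csv: read all files matching pattern."""
    paths = sorted(glob.glob(pattern))
    if not paths:
        raise FileNotFoundError(f'no files match {pattern!r}')
    # Columns that are missing from a given file are filled with NaN by concat
    frames = [pd.read_csv(path, encoding=encoding) for path in paths]
    return pd.concat(frames, ignore_index=True, sort=False)
```

<p> Because the monthly reports use different column names, a concatenation like this would leave many cells empty, which is consistent with the null counts in the next cell. </p>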

In [3]:
# check for the null values
df.isna().sum()

NCB                2069
ISIN_CODE         21215
ISSUER_NAME_      56389
MATURITY_DATE_    56389
COUPON_RATE_      56389
MONTH                 0
ISSUER_NAME       25409
MATURITY_DATE     25409
COUPON_RATE       25409
Unnamed: 0        59371
ISIN              40226
ISSUER            40226
MATURITY DATE     40226
COUPON RATE       40226
dtype: int64

<p> Column names differ slightly between reports generated in different months. Filling in the empty cells to gather all relevant data into one column: </p>

In [4]:
# fill in empty cells; assign the result back, because fillna(inplace=True)
# on a column selection may not update df in newer pandas versions
df['NCB'] = df['NCB'].fillna(df['Unnamed: 0'])
df['ISIN'] = df['ISIN'].fillna(df['ISIN_CODE'])
df['ISSUER'] = df['ISSUER'].fillna(df['ISSUER_NAME_'])
df['ISSUER'] = df['ISSUER'].fillna(df['ISSUER_NAME'])
df['MATURITY DATE'] = df['MATURITY DATE'].fillna(df['MATURITY_DATE_'])
df['MATURITY DATE'] = df['MATURITY DATE'].fillna(df['MATURITY_DATE'])
df['COUPON RATE'] = df['COUPON RATE'].fillna(df['COUPON_RATE_'])
df['COUPON RATE'] = df['COUPON RATE'].fillna(df['COUPON_RATE'])
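
<p> The same column-by-column fill can be written as a small helper. This is only a sketch: the function name <code>coalesce</code> and the demo frame are assumptions made for illustration, not part of the notebook's data: </p>

```python
import pandas as pd


def coalesce(frame, columns):
    """Return the first non-null value per row, scanning columns left to right."""
    result = frame[columns[0]]
    for name in columns[1:]:
        # combine_first keeps existing values and fills only the gaps
        result = result.combine_first(frame[name])
    return result


# Made-up demo rows that mimic the differing column names
demo = pd.DataFrame({'ISIN': ['A', None, None],
                     'ISIN_CODE': [None, 'B', None]})
demo['ISIN'] = coalesce(demo, ['ISIN', 'ISIN_CODE'])
```

<p> Rows that are empty in every listed column stay null, so the later filter on <code>ISIN</code> is still needed. </p>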

In [5]:
# choose necessary columns and rows
df = df[df['ISIN'].notna()][['MONTH', 'NCB', 'ISIN', 'ISSUER', 'MATURITY DATE', 'COUPON RATE']]

In [6]:
# check for the null values
df.isna().sum()

MONTH            0
NCB              0
ISIN             0
ISSUER           0
MATURITY DATE    0
COUPON RATE      0
dtype: int64

In [7]:
df.head()

Unnamed: 0,MONTH,NCB,ISIN,ISSUER,MATURITY DATE,COUPON RATE
0,2017/06,IT,XS1088274169,2i Rete Gas S.p.A.,16/07/2019,1.75
1,2017/06,IT,XS1088274672,2i Rete Gas S.p.A.,16/07/2024,3.0
2,2017/06,IT,XS1144492532,2i Rete Gas S.p.A.,02/01/2020,1.125
3,2017/06,IT,XS1571982468,2i Rete Gas S.p.A.,28/08/2026,1.75
4,2017/06,IT,XS0859920406,A2A S.p.A.,28/11/2019,4.5


## Data Extraction via API

<p>We use several PermID APIs to assign industrial sectors to companies. Access token for API requests:</p>

In [8]:
token = 'r3S0DwnAKTqYq9jgJIs0XI04YBDWjPVJ'
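
<p>A token written into the notebook is shared with everyone who can read the file. One possible alternative, sketched here under the assumption that the token is stored in an environment variable (the variable name <code>PERMID_TOKEN</code> and the function <code>load_token</code> are hypothetical):</p>

```python
import os


def load_token(env_var='PERMID_TOKEN'):
    """Read the API token from an environment variable (the name is an assumption)."""
    value = os.environ.get(env_var)
    if not value:
        raise RuntimeError(f'set the {env_var} environment variable first')
    return value
```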

<p> Creating a template for Record Matching API: </p>

In [9]:
template = pd.DataFrame({'Name': df['ISSUER']})
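# strip accents and punctuation; regex=False treats '.', '(' and ')' as literal characters
template['Name'] = (template['Name'].str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8')
                                    .str.strip().str.replace('.', '', regex=False).str.replace('(', '', regex=False)
                                    .str.replace(')', '', regex=False).str.replace('-', ' ', regex=False)
                                    .str.replace('/', ' ', regex=False).str.lower().str.split(',', n=1).str[0])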
df['KEY NAME'] = template['Name']
template.head()

Unnamed: 0,Name
0,2i rete gas spa
1,2i rete gas spa
2,2i rete gas spa
3,2i rete gas spa
4,a2a spa
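
<p>To make the individual normalisation steps easier to follow, here is a plain-Python sketch of the same chain applied to a single string. The function name <code>normalize_name</code> is hypothetical and is not used elsewhere in the notebook:</p>

```python
import unicodedata


def normalize_name(name):
    """Plain-Python version of the pandas chain above, for a single string."""
    # Decompose accented characters, then drop anything outside ASCII
    text = unicodedata.normalize('NFKD', name).encode('ascii', errors='ignore').decode('utf-8')
    text = text.strip()
    # Remove dots and parentheses entirely
    for char in '.()':
        text = text.replace(char, '')
    # Turn hyphens and slashes into spaces
    for char in '-/':
        text = text.replace(char, ' ')
    # Lower-case and keep only the part before the first comma
    return text.lower().split(',', 1)[0]


print(normalize_name('2i Rete Gas S.p.A.'))  # 2i rete gas spa
```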


In [10]:
template = template.drop_duplicates().to_csv(index=False)

<p>Find companies' PermIDs by their names using Record Matching API:</p>

In [11]:
match_results = ed.record_matching(token, template = template)

Processed: 442
Matched: 
  Total 394
  Excellent 337
  Good 26
  Possible 31
Unmatched: 48


<p>Sometimes a request is processed successfully but returns no matches because of an unexpected server error. In that case the request is sent again and a message is shown.</p>
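
<p>The retry logic lives inside <code>ed.record_matching</code>, which is not shown here. A generic sketch of the idea, assuming the caller supplies a function that sends the request and a function that decides whether a response is empty (<code>call_with_retry</code>, <code>request_fn</code> and <code>is_empty</code> are hypothetical names):</p>

```python
from time import sleep


def call_with_retry(request_fn, is_empty, attempts=3, delay=1.0):
    """Call request_fn until is_empty(result) is False or attempts run out."""
    result = request_fn()
    for attempt in range(1, attempts):
        if not is_empty(result):
            break
        print(f'Empty response, retrying ({attempt}/{attempts - 1})')
        sleep(delay)
        result = request_fn()
    return result
```

<p>The last response is returned even if it is still empty, so the caller has to decide what to do after the final attempt.</p>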

<p> There are 48 unmatched results out of 442. First, we try to create a new template and apply further transformations to match the rest of the companies. Another possible solution is to use the Entity Search API. </p>

In [12]:
output = match_results['outputContentResponse']

# create dataframe with permIDs for matched companies
companies = pd.DataFrame([d for d in output if d['Match Level'] != 'No Match'])

# create dataframe with companies that did not have a match
noMatch = pd.DataFrame([d for d in output if d['Match Level'] == 'No Match'])['Input_Name']

In [13]:
noMatch

0                        air liquide sa etexplpgcl
1                          airbus group finance bv
2                        autostr bresvervicpad spa
3                     caterpillar intl finance ltd
4                        ciba spc chem fin lxbg sa
5                     compagnie fin ind autoroutes
6                                   delhaize group
7                                delhaize group sa
8                      deutsche telekom intl finbv
9                               deutsche wohnen ag
10                    distribuidora intl de alimsa
11                             eon intl finance bv
12                                          eon se
13                                     eandis cvba
14                         elia system operator nv
15                      elia system operator sa nv
16                               gas natural cm sa
17                              gie engie alliance
18                      heidelbergcement finlux sa
19                       holcim

In [14]:
new_template = pd.DataFrame({'Name': noMatch})
new_template['Name'] = new_template['Name'].str.split().str[:2]
new_template.sample(5)

Unnamed: 0,Name
23,"[o2, telefonica]"
16,"[gas, natural]"
18,"[heidelbergcement, finlux]"
30,"[sncf, mobilites]"
0,"[air, liquide]"


<p><b> Note: to be continued. I will try to match using the first two words of each name. </b></p>
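
<p>The split above yields lists of words, while the first template held plain strings. A possible next step, sketched on three names taken from the unmatched list, is to join the first two words back into one string before building the CSV. The variable names here are assumptions:</p>

```python
import pandas as pd

# Sample taken from the unmatched names listed above
no_match = pd.Series(['air liquide sa etexplpgcl', 'delhaize group', 'delhaize group sa'])

# Keep the first two words and join them back into one string
short_names = no_match.str.split().str[:2].str.join(' ')

# Deduplicate and serialise in the same shape as the first template
new_template_csv = pd.DataFrame({'Name': short_names}).drop_duplicates().to_csv(index=False)
```

<p>Shortened names are less specific, so any matches obtained this way would probably need a manual check.</p>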

<p>Using the Entity Lookup API to get the data on industry for each company and joining it to the main dataframe:</p>

In [15]:
IDs = companies['Match OpenPermID']

In [16]:
sectors = []
for permID in IDs:
    lookup_company = ed.entity_lookup(token, permID)
    if 'hasPrimaryBusinessSector' in lookup_company:
        sectorID = lookup_company['hasPrimaryBusinessSector']
        lookup_sector = ed.entity_lookup(token, sectorID)
        sectors.append(pd.DataFrame({'Match OpenPermID': [permID],
                                     'Sector PermID': [sectorID],
                                     'Sector': [lookup_sector['prefLabel']],
                                     'Sector Description': lookup_sector['rdfs:comment']}))
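
<p>Many companies share the same sector, so the loop above may request the same sector record repeatedly. A small sketch of how repeated lookups could be avoided with a dictionary cache, assuming the lookup result for a given ID does not change between calls. The helper <code>lookup_with_cache</code> and the stub <code>fake_lookup</code> are hypothetical:</p>

```python
def lookup_with_cache(lookup_fn, entity_id, cache):
    """Return lookup_fn(entity_id), reusing an earlier result when available."""
    if entity_id not in cache:
        cache[entity_id] = lookup_fn(entity_id)
    return cache[entity_id]


# Stub standing in for an API call; it records how often it is invoked
requested = []

def fake_lookup(entity_id):
    requested.append(entity_id)
    return {'prefLabel': f'label for {entity_id}'}


sector_cache = {}
for sector_id in ['sector-a', 'sector-b', 'sector-a']:
    record = lookup_with_cache(fake_lookup, sector_id, sector_cache)
```

<p>In the notebook, the stub would be replaced by a call such as <code>lambda i: ed.entity_lookup(token, i)</code>.</p>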

In [17]:
sectors = pd.concat(sectors).drop_duplicates()

In [18]:
companies = companies.merge(sectors, how = 'left', on = 'Match OpenPermID')
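
<p>With a left join, companies without a sector keep their row and get empty sector columns. A short sketch on made-up placeholder IDs shows how such rows could be listed, and how <code>validate='m:1'</code> can guard against duplicate IDs in the sector table:</p>

```python
import pandas as pd

# Placeholder IDs and sectors, for illustration only
companies_demo = pd.DataFrame({'Match OpenPermID': ['id-1', 'id-2', 'id-3']})
sectors_demo = pd.DataFrame({'Match OpenPermID': ['id-1', 'id-2'],
                             'Sector': ['Utilities', 'Chemicals']})

# validate='m:1' raises if an ID appears more than once in sectors_demo
merged = companies_demo.merge(sectors_demo, how='left', on='Match OpenPermID', validate='m:1')

# Companies without a sector keep their row and get NaN in 'Sector'
missing = merged[merged['Sector'].isna()]
```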

In [21]:
companies.head()

Unnamed: 0,ProcessingStatus,Match OpenPermID,Match OrgName,Match Score,Match Level,Match Ordinal,Original Row Number,Input_Name,Sector PermID,Sector,Sector Description
0,OK,https://permid.org/1-5000936840,2I Rete Gas SpA,92%,Excellent,1,2,2i rete gas spa,https://permid.org/1-4294952820,Utilities,"Producers and distributors of electricity, nat..."
1,OK,https://permid.org/1-5000005309,A2A SpA,92%,Excellent,1,3,a2a spa,https://permid.org/1-4294952820,Utilities,"Producers and distributors of electricity, nat..."
2,OK,https://permid.org/1-5000066931,ABB Finance BV,92%,Excellent,1,4,abb finance bv,https://permid.org/1-4294952766,Industrial Goods,Manufacturers of aerospace & defense equipment...
3,OK,https://permid.org/1-4295889666,Abertis Infraestructuras SA,92%,Excellent,1,5,abertis infraestructuras sa,https://permid.org/1-4294952945,Transportation,"Transporters of freight and passengers by air,..."
4,OK,https://permid.org/1-4295875677,Acea SpA,92%,Excellent,1,6,acea spa,https://permid.org/1-4294952820,Utilities,"Producers and distributors of electricity, nat..."


In [20]:
print(sectors['Sector'].unique())

['Utilities' 'Industrial Goods' 'Transportation' 'Insurance'
 'Industrial & Commercial Services' 'Chemicals' 'Real Estate'
 'Banking & Investment Services' 'Food & Beverages'
 'Energy - Fossil Fuels' 'Technology Equipment'
 'Pharmaceuticals & Medical Research' 'Cyclical Consumer Services'
 'Software & IT Services' 'Telecommunications Services'
 'Cyclical Consumer Products' 'Automobiles & Auto Parts'
 'Financial Technology (Fintech) & Infrastructure'
 'Investment Holding Companies' 'Mineral Resources' 'Retailers'
 'Consumer Goods Conglomerates' 'Collective Investments'
 'Healthcare Services & Equipment' 'Food & Drug Retailing'
 'Applied Resources']


<p><b> Note: I think Business Sectors are ok to use for the analysis. </b></p>