<h1><center>  Single Names' ESG Scoring - Nucleus APIs Use Cases</center></h1>


<h1><center>  SumUp Analytics, Proprietary & Confidential</center></h1>
<h1><center>  Disclaimers and Terms of Service available at www.sumup.ai</center></h1>



 


## Objective: 
-	Develop an ESG scoring methodology using content published by corporations


## Data:
-	A chosen list of corporations, for instance within the same industry sector, or with similar market capitalization
 - 	Company reports such as SEC filings
 - 	Press releases
 - 	Earnings call transcripts



## Nucleus APIs used:
-	Dataset creation API
 - 	*api_instance.post_upload_file(file, dataset)*
 - 	*nucleus_helper.import_files(api_instance, dataset, file_iters, processes=1)*

        nucleus_helper.import_files leverages api_instance.post_upload_file with parallel execution to speed up dataset creation


-	Topic Modeling API
 - 	*api_instance.post_topic_api(payload)*


-	Topic Sentiment API
 - 	*api_instance.post_topic_sentiment_api(payload)*


-	DocInfo API
 - 	*api_instance.post_doc_info(payload)*


-	DatasetInfo API
 - 	*api_instance.post_dataset_info(payload)*


## Approach:

### 1.	Dataset Preparation
-	Create a Nucleus dataset containing all relevant documents over a chosen historical period
-   Leverage the corporate filings' feed provided by Nucleus
-   Focus on the Information Technology sector for this example
    

In [None]:
import csv
import json
import re
import numpy as np
import datetime
import nucleus_api.api.nucleus_api as nucleus_helper
import nucleus_api
from nucleus_api.rest import ApiException

configuration = nucleus_api.Configuration()
configuration.host = 'UPDATE-WITH-API-SERVER-HOSTNAME'
configuration.api_key['x-api-key'] = 'UPDATE-WITH-API-KEY'

# Create API instance
api_instance = nucleus_api.NucleusApi(nucleus_api.ApiClient(configuration))

In [None]:
dataset = "Corporate_docs" 
period_start = "2015-01-01" 
period_end= "2019-06-01"
tickers = ['AAPL','MSFT','INTC','CSCO','MA','ORCL','IBM','CRM','PYPL','ACN US','ADBE','TXN','NVDA','INTU','QCOM','ADSK','CTSH','XLNX','HPQ','SPLK','TEL US','HPE','FISV','AMD','LRCX','MCHP','DXC','NOW','SYMC','ON','CDW','AKAM','FIS','NTAP','MXIM','DELL','ADS','VRSN','JNPR','LDOS','ANET','TER','GPN','TSS','IT','GDDY','CTXS','FTNT','DATA','ZBRA','WU','TYL','PAYC','CGNX','DOX']

payload = nucleus_api.EdgarQuery(destination_dataset=dataset,
                                tickers=tickers, 
                                filing_types=["10-K", "10-K/A", "10-Q", "10-Q/A", "8-K", "8-K/A"], 
                                sections=[],
                                period_start=period_start,
                                period_end=period_end)

api_response = api_instance.post_create_dataset_from_sec_filings(payload)

**You can subsequently work on specific time periods within your dataset directly in the APIs, as illustrated below**

### 2. Define ESG queries to focus the content analysis

In [None]:
query_E = "Biodiversity OR Carbon OR Cleantech OR Clean OR Climate OR Coal OR Conservation OR Ecosystem OR Emission OR Energy OR Fuel OR Green OR Land OR Natural OR Pollution OR (Raw AND materials) OR Renewable OR Resources OR Sustainability OR Sustainable OR Toxic OR Waste OR Water"

query_S = "Accident OR Adult entertainment OR Alcohol OR Anti-personnel OR Behavior OR Charity OR Child Labor OR Community OR Controversial OR Controversy OR Discrimination OR Gambling OR Health OR (Human AND capital) OR (Human AND rights) OR Inclusion OR Injury OR Labor OR Munitions OR Opposition OR Pay OR Philanthropic OR Quality OR Responsible"

query_G = "Advocacy OR Bribery OR Compensation OR Competitive OR Corruption OR Data breach OR Divestment OR Fraud OR Global Compact OR GRI OR Global Reporting Initiative OR Independent OR Justice OR Stability OR Stewardship OR Transparency"

### 3.	Rank companies on each ESG subject
- Identify and Extract key topics at a given point in time on each of the 3 ESG subjects


- Measure the sentiment on each topic to classify all key topics into ‘good’ and ‘bad’ topics


- Determine the exposure of each company to each topic


- Aggregate the exposures of a given company across key topics based on the ‘good’ or ‘bad’ nature of the topics, to derive a ranking of the companies
 - The top company is the one with the most exposure to good topics and/or the least exposure to bad topics
 
 
- Further down, we discuss how to refine this analysis by leveraging the different parameters available to the user
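
The aggregation steps above can be sketched on synthetic numbers, without calling the API. Everything in this block is an illustrative assumption: the tickers, topic strengths, sentiments and per-document exposures are made up, and `score_companies` is a hypothetical helper name that is not part of Nucleus.

```python
# Hypothetical, self-contained sketch of the scoring logic (no API calls).
# All names and numbers below are made up for illustration.

def score_companies(companies, topics):
    """Return one score per company: the mean over topics of
    strength * sentiment * (sum of that company's document exposures)."""
    scores = []
    for company in companies:
        per_topic = []
        for topic in topics:
            # Sum the exposures of the documents published by this company
            exposure = sum(
                exp
                for source, exp in zip(topic["doc_sources"], topic["doc_exposures"])
                if source == company
            )
            per_topic.append(topic["strength"] * topic["sentiment"] * exposure)
        scores.append(sum(per_topic) / len(per_topic))
    return scores

companies = ["AAA", "BBB"]
topics = [
    # A 'good' topic (positive sentiment) with documents from both companies
    {"strength": 0.6, "sentiment": 0.5,
     "doc_sources": ["AAA", "AAA", "BBB"], "doc_exposures": [0.2, 0.3, 0.5]},
    # A 'bad' topic (negative sentiment) discussed by BBB only
    {"strength": 0.4, "sentiment": -0.5,
     "doc_sources": ["BBB"], "doc_exposures": [1.0]},
]

scores = score_companies(companies, topics)
# Highest score first: most exposure to good topics, least to bad ones
ranking = sorted(zip(companies, scores), key=lambda pair: pair[1], reverse=True)
print(ranking)
```

In this toy example both companies have the same exposure to the 'good' topic, but only BBB is exposed to the 'bad' one, so AAA ranks first.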




In [None]:
# Determine which companies are associated with the documents contributing to the topics
company_list = tickers


print('-------- Get topic sentiment and exposure per firm ----------------')

payload = nucleus_api.TopicSentimentModel(dataset='Corporate_docs',          
                                query=query_E,                   
                                num_topics=8,
                                num_keywords=8,
                                period_start="2015-01-01",
                                period_end="2015-03-01")
try:
    api_response = api_instance.post_topic_sentiment_api(payload)    
    api_ok = True
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])
    api_ok = False

if api_ok:
    company_rankings = np.zeros([len(company_list), len(api_response.result)])
    for i, res in enumerate(api_response.result):
        print('Topic', i, 'sentiment:')
        print('    Keywords:', res.keywords)

        # Aggregate all document exposures within a topic into a company exposure, using the dataset metadata
        payload = nucleus_api.DocInfo(dataset='Corporate_docs', doc_ids = res.doc_ids)
        try:
            api_response1 = api_instance.post_doc_info(payload)
            api_ok = True
        except ApiException as e:
            api_error = json.loads(e.body)
            print('ERROR:', api_error['message'])
            api_ok = False

        if api_ok:
            company_sources = [] # This list might be shorter than the whole dataset because not all companies necessarily contribute to a given topic
            for res1 in api_response1.result:        
                company_sources.append(re.split(r'\s', res1.attribute['filename'])[0]) 

            company_contributions = np.zeros([len(company_list), 1])
            for j in range(len(company_list)):
                for k in range(len(company_sources)):
                    if company_sources[k] == company_list[j]:
                        company_contributions[j] += json.loads(res.doc_topic_exposures[0])[k]

            company_rankings[:, i] = [x[0] for x in  float(res.strength) * float(res.sentiment) * company_contributions[:]]    

            print('---------------')


    # Average the ranking of companies across topics into the final ESG score on the subject (E, S, G) currently analyzed
    ESG_score = np.mean(company_rankings, axis=1)

-	Repeat the above tasks for each date in the historical period to get the complete history of your ESG scores

-   Change the query used: query_E, query_S, query_G to get scores per company on each of the 3 sustainability pillars

In [None]:
print('------------ Retrieve all companies found in the dataset ----------')

company_list = tickers


print('--------------- Retrieve the time range of the dataset -------------')

payload = nucleus_api.DatasetInfo(dataset='Corporate_docs', query='')
api_response = api_instance.post_dataset_info(payload)

first_date = datetime.datetime.fromtimestamp(float(api_response.result.time_range[0]))
last_date = datetime.datetime.fromtimestamp(float(api_response.result.time_range[1]))
delta = last_date - first_date

# Now loop through time and at each date, compute the ranking of companies
T = 180 # The look-back period in days

ESG_score = []
for i in range(max(delta.days - T + 1, 0)):  # stop once the window reaches the last date
    # First and last date of the T-day lookback window;
    # shifting the window by one day per iteration gives a daily indicator
    start_date = first_date + datetime.timedelta(days=i)
    start_date_str = start_date.strftime("%Y-%m-%d 00:00:00")

    end_date = start_date + datetime.timedelta(days=T)
    end_date_str = end_date.strftime("%Y-%m-%d 00:00:00")

    payload = nucleus_api.TopicSentimentModel(dataset="Corporate_docs",      
                                query=query_E,                   
                                num_topics=8,
                                num_keywords=8,
                                period_start=start_date_str,
                                period_end=end_date_str)
    api_response = api_instance.post_topic_sentiment_api(payload)

    company_rankings = np.zeros([len(company_list), len(api_response.result)])
    for l, res in enumerate(api_response.result):
        # Aggregate all document exposures within a topic into a company exposure, using the dataset metadata
        payload = nucleus_api.DocInfo(dataset='Corporate_docs', doc_ids=res.doc_ids)
        api_response1 = api_instance.post_doc_info(payload)

        company_sources = [] # This list will be much shorter than the whole dataset because not all documents contribute to a given topic
        for res1 in api_response1.result:        
            company_sources.append(re.split(r'\s', res1.attribute['filename'])[0]) 

        company_contributions = np.zeros([len(company_list), 1])
        for j in range(len(company_list)):
            for k in range(len(company_sources)):
                if company_sources[k] == company_list[j]:
                    company_contributions[j] += json.loads(res.doc_topic_exposures[0])[k]

        company_rankings[:, l] = [x[0] for x in  float(res.strength) * float(res.sentiment) * company_contributions[:]]     

    # Average the ranking of companies across topics into the final ESG score on the subject (E, S, G) currently analyzed
    ESG_score.append(np.mean(company_rankings, axis=1))

### 3.	Results Interpretation
-	Plot the time series of companies' ESG score
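
Before plotting, the list of per-date score vectors has to be reshaped into one series per company. A minimal sketch, assuming `ESG_score` holds one vector per date in the same order as the tickers; the tickers, dates and values here are made-up placeholders:

```python
# Hypothetical sketch: reshape per-date score vectors into one series per ticker.
# Tickers, dates and scores are made-up placeholders.
import datetime

tickers = ["AAA", "BBB"]
dates = [datetime.date(2015, 7, 1) + datetime.timedelta(days=d) for d in range(3)]
ESG_score = [[0.10, -0.02], [0.12, -0.01], [0.09, 0.03]]  # one vector per date

# zip(*rows) transposes date-major rows into company-major columns
series = {ticker: list(values) for ticker, values in zip(tickers, zip(*ESG_score))}

for ticker, values in series.items():
    for date, value in zip(dates, values):
        print(ticker, date.isoformat(), value)
```

Each entry of `series` can then be handed to a plotting library together with `dates`, for example with matplotlib's `plt.plot(dates, values, label=ticker)`.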

### 4.	Fine Tuning

#### a.	Tailoring the topics
-	Tailor your single-name screen by excluding topics you consider not impactful. To do so, pass the custom_stop_words parameter in the payload of the Topic Sentiment API


-	Identify and Extract key topics on the corpus, for each of the 3 subjects (E, S, G) and print the corresponding keywords



In [None]:
print('------------- Get list of topics from dataset --------------')

payload = nucleus_api.Topics(dataset='Corporate_docs',                       
                                query=query_E,                       
                                num_topics=20, 
                                num_keywords=8,
                                period_start="2015-01-01",
                                period_end="2019-06-01")
try:
    api_response = api_instance.post_topic_api(payload)        
    api_ok = True
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])
    api_ok = False

if api_ok:    
    for i, res in enumerate(api_response.result.topics):
        print('Topic', i, ' keywords: ', res.keywords)    
        print('---------------')

You can then tailor the scoring analysis by creating a custom_stop_words variable. Initialize the variable as follows, for instance, and pass it in the payload of the main code of section 3 (Rank companies on each ESG subject):

In [None]:
custom_stop_words = ["call","report"] # str | List of stop words. (optional)

#### b.	Exploring the impact of the type of documents, the lookback period, the number of topics being extracted
**num_topics**: You can compute the companies' ESG score over a different breadth of topics by changing the variable num_topics in the payload of the main code of section 3 (Rank companies on each ESG subject). A larger value provides more breadth in establishing scores, while a smaller value provides a shallower measure. If num_topics is too large, some very marginal topics may add a lot of noise to the company ESG scores.

**T**: You can compute the companies' ESG score with different speeds of propagation by changing the variable T (lookback) in the main code of section 3 (Rank companies on each ESG subject). A larger value provides a slowly changing ESG score, while a smaller value leads to a very responsive score. If T is too small, too few documents may be used, which may make the scores noisy. If T is too large, the ESG scores won’t reflect important new information quickly enough.
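
The effect of T on the amount of data behind each score can be illustrated with a small sketch. The filing dates below are made up (one document every 30 days), and `docs_in_window` is a hypothetical helper, not a Nucleus function:

```python
# Hypothetical sketch: how the lookback T changes the number of documents per window.
import datetime

def docs_in_window(doc_dates, end_date, lookback_days):
    """Count documents dated within [end_date - lookback_days, end_date]."""
    start_date = end_date - datetime.timedelta(days=lookback_days)
    return sum(start_date <= d <= end_date for d in doc_dates)

# Made-up filing dates: one document every 30 days starting 2015-01-01
doc_dates = [datetime.date(2015, 1, 1) + datetime.timedelta(days=30 * n) for n in range(12)]
end_date = datetime.date(2015, 12, 1)

counts = {T: docs_in_window(doc_dates, end_date, T) for T in (30, 90, 180)}
print(counts)
```

With these synthetic dates, a 30-day lookback sees a single document while a 180-day lookback sees six, which is why very short lookbacks tend to produce noisy scores.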

**Document types**: You can investigate how the companies' ESG score changes when it is measured using only one type of document among the different kinds of company filings, by leveraging the metadata selector provided during the construction of the dataset. Rerun the main code of section 3 (Rank companies on each ESG subject) on a subset of the whole corpus. Create a variable metadata_selection and pass it in the payload:

In [None]:
metadata_selection = {"category": "Report"}   # str | json object of {\"metadata_field\":[\"selected_values\"]} (optional)

### 5.	Next Steps
-	Possible extension: repeat the above tasks for different industry sectors

-	Possible extension: transform the raw ESG scores into a normalized metric

        Score(Company i) = ( Rank(Company i) – Average(Ranks, [Companies]) ) / Std(Ranks, [Companies])
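
A minimal sketch of this normalization, using only the Python standard library. The formula does not say whether the population or the sample standard deviation is meant; the population version (`statistics.pstdev`) is assumed here. The tickers and raw values are made up, and `normalize` is a hypothetical helper name:

```python
# Hypothetical sketch of the normalization above (z-score across companies).
import statistics

def normalize(ranks):
    """Return (rank - mean) / std for each rank; all zeros if every rank is equal."""
    mean = statistics.mean(ranks)
    std = statistics.pstdev(ranks)  # assumption: population standard deviation
    if std == 0:
        return [0.0 for _ in ranks]
    return [(r - mean) / std for r in ranks]

# Made-up raw scores for three placeholder tickers
raw = {"AAA": 0.075, "BBB": -0.025, "CCC": 0.010}
normalized = dict(zip(raw, normalize(list(raw.values()))))
print(normalized)
```

After normalization the scores have zero mean and unit standard deviation across companies, which makes them comparable between dates and between the E, S and G pillars.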

Copyright (c) 2019 SumUp Analytics, Inc. All Rights Reserved.

NOTICE: All information contained herein is, and remains the property of SumUp Analytics Inc. and its suppliers, if any. The intellectual and technical concepts contained herein are proprietary to SumUp Analytics Inc. and its suppliers and may be covered by U.S. and Foreign Patents, patents in process, and are protected by trade secret or copyright law.

Dissemination of this information or reproduction of this material is strictly forbidden unless prior written permission is obtained from SumUp Analytics Inc.