<h1><center>  Contrast Analysis - Nucleus APIs Use Cases</center></h1>


<h1><center>  SumUp Analytics, Proprietary & Confidential</center></h1>


<h1><center>  Disclaimers and Terms of Service available at www.sumup.ai</center></h1>


 


## Objective: 
-	Develop a pipeline to customize and fine-tune contrast analysis of datasets
  - Extraction of a contrasted topic
  - Contrasted Summarization
  - Classification of documents into 2 predefined categories

**SumUp contrast analysis works on the premise of two distinct categories of documents within a corpus, defined by the user based on metadata or content**

## Data:
-	Any collection of documents, ideally from the same field, possibly with further refinement in terms of categorization such as document type

    **The Nucleus Datafeed can be leveraged for all content from major Central Banks and SEC filings**


## Nucleus APIs:
-	Dataset creation API
 - 	*api_instance.post_upload_file(file, dataset)*
 - 	*nucleus_helper.import_files(api_instance, dataset, file_iters, processes=1)*

        nucleus_helper.import_files leverages api_instance.post_upload_file with parallel execution to speed up dataset creation


-	Topic Modeling API
 - 	*api_instance.post_topic_api(payload)*


-	Contrasted Topic Modeling API
 - 	*api_instance.post_topic_contrast_api(payload)*
 
 
-	Document Contrasted Summary API
 - 	*api_instance.post_document_contrast_summary_api(payload)*


-	Documents Classification API
 - 	*api_instance.post_doc_classify_api(payload)*
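
The note above says the helper wraps the single-file upload call with parallel execution. The sketch below illustrates that general pattern only. `upload_one` is a hypothetical stand-in for the real upload call so that the example runs without a server, and threads are used for simplicity; how the helper parallelizes internally is not specified in this notebook.

```python
from concurrent.futures import ThreadPoolExecutor


def upload_one(file_dict):
    """Hypothetical stand-in for a single-file upload call.

    A real version would call the upload endpoint; this one only echoes
    the filename so the sketch can run offline.
    """
    return {'filename': file_dict['filename'], 'status': 'uploaded'}


def upload_in_parallel(file_iter, processes=1):
    """Apply upload_one to every item using up to `processes` workers."""
    with ThreadPoolExecutor(max_workers=processes) as pool:
        # map() returns results in the same order as the inputs
        return list(pool.map(upload_one, file_iter))


files = [{'filename': 'a.pdf'}, {'filename': 'b.pdf'}, {'filename': 'c.pdf'}]
results = upload_in_parallel(files, processes=2)
for r in results:
    print(r['filename'], r['status'])
```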

## Approach:

### 1.	Dataset Preparation
-	Create a Nucleus dataset containing all relevant documents

    

In [None]:
import csv
import json
import os
import nucleus_api.api.nucleus_api as nucleus_helper
import nucleus_api
from nucleus_api.rest import ApiException

configuration = nucleus_api.Configuration()
configuration.host = 'UPDATE-WITH-API-SERVER-HOSTNAME'
configuration.api_key['x-api-key'] = 'UPDATE-WITH-API-KEY'

# Create API instance
api_instance = nucleus_api.NucleusApi(nucleus_api.ApiClient(configuration))

In [None]:
print('---- Case 1: you are using your own corpus, coming from a local folder ----')
folder = 'Sellside_research'         
dataset = 'Sellside_research'  # str | Destination dataset where the file will be inserted.

# build file iterable from a folder recursively. 
# Each item in the iterable is in the format below:
# {'filename': filename,   # filename to be uploaded. REQUIRED
#  'metadata': {           # metadata for the file. Optional
#      'key1': val1,       # keys can have arbitrary names as long as the names only
#      'key2': val2        # contain alphanumeric characters (0-9|a-z|A-Z) and underscores (_)
#   } 
# }
file_iter = []
for root, dirs, files in os.walk(folder):
    for file in files:
        # To upload only certain file types, wrap the lines below in a check such as
        # `if file.endswith('.pdf'):` (.txt .doc .docx .rtf .html .csv are also supported)
        file_dict = {'filename': os.path.join(root, file),
                     'metadata': {'company': 'Apple',
                                  'research_analyst': 'MS',
                                  'date': '2019-01-01'}}
        file_iter.append(file_dict)

file_props = nucleus_helper.upload_files(api_instance, dataset, file_iter, processes=4)
for fp in file_props:
    print(fp.filename, '(', fp.size, 'bytes) has been added to dataset', dataset)

    
    
print('---- Case 2: you are using an embedded datafeed ----')
dataset = 'sumup/central_banks_chinese'  # embedded datafeed in Nucleus
metadata_selection = {'bank': 'people_bank_of_china', 'document_category': ('speech', 'press release')}
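
The comments in Case 1 state that metadata keys may contain only alphanumeric characters and underscores. A minimal sketch of a pre-upload check follows; `invalid_metadata_keys` is a hypothetical helper, not part of the Nucleus SDK, and the sample dictionaries are made up.

```python
import re

# Keys must consist solely of 0-9, a-z, A-Z and underscore
_VALID_KEY = re.compile(r'[0-9A-Za-z_]+')


def invalid_metadata_keys(file_dict):
    """Return the metadata keys that contain disallowed characters."""
    metadata = file_dict.get('metadata', {})
    return [key for key in metadata if not _VALID_KEY.fullmatch(key)]


good = {'filename': 'r1.pdf',
        'metadata': {'company': 'Apple', 'research_analyst': 'MS'}}
bad = {'filename': 'r2.pdf',
       'metadata': {'research-analyst': 'MS', 'date': '2019-01-01'}}

print(invalid_metadata_keys(good))
print(invalid_metadata_keys(bad))
```

Running such a check over `file_iter` before uploading makes it easier to spot a badly named key early.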

### 2. Contrasted Topic Modeling

-     In this example, one category consists of documents produced by research analysts at Morgan Stanley. The second category comprises all other research reports.
-     We extract one topic that separates those two categories

In [None]:
metadata_selection_contrast = {"research_analyst": "MS"} # dict | The metadata selection defining the two categories of documents to contrast and summarize against each other

query = '' # str | Dataset-language-specific fulltext query, using mysql MATCH boolean query format (optional)
custom_stop_words = ["morgan stanley"] # List of stop words. (optional)
excluded_docs = '' # str | List of document IDs that should be excluded from the analysis. Example, ["docid1", "docid2", ..., "docidN"]  (optional)
syntax_variables = False # bool | Specifies whether to take into account syntax aspects of each category of documents to help with contrasting them (optional) (default to False)
num_keywords = 20 # integer | Number of keywords for the contrasted topic that is extracted from the dataset. (optional) (default to 50)
remove_redundancies = False # bool | If True, this option removes quasi-duplicates from the analysis and retains only one copy of each. A quasi-duplicate would have the same NLP representation, but not necessarily the exact same text. (optional) (default False)

payload = nucleus_api.TopicContrastModel(dataset='Sellside_research', 
                                        metadata_selection_contrast=metadata_selection_contrast,
                                        custom_stop_words=custom_stop_words,
                                        period_start='2018-01-01',
                                        period_end='2019-01-01')
try:
    api_response = api_instance.post_topic_contrast_api(payload)
    api_ok = True
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])
    api_ok = False

if api_ok:   
    print('Contrasted Topic')
    print('    Keywords:', api_response.result.keywords)
    print('    Keywords Weight:', api_response.result.keywords_weight)
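
The cell above prints the keywords and their weights as two separate collections. A small sketch of pairing and ranking them is given below, assuming both are parallel lists of equal length (the notebook does not show the exact response format). The values are made up for illustration and are not real API output.

```python
def rank_keywords(keywords, weights):
    """Pair each keyword with its weight, sorted from heaviest to lightest."""
    if len(keywords) != len(weights):
        raise ValueError('keywords and weights must have the same length')
    # sorted() is stable, so keywords with equal weights keep their input order
    return sorted(zip(keywords, weights), key=lambda kw: kw[1], reverse=True)


# Made-up values for illustration only
keywords = ['economy', 'price target', 'projected revenue']
weights = [0.25, 0.5, 0.25]

ranked = rank_keywords(keywords, weights)
for keyword, weight in ranked:
    print(f'{keyword:<20} {weight:.2f}')
```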

### 3.	Document Contrasted Summarization

-   With the same dataset, we aim to find summary sentences that distinguish the two categories of documents


-	Use the following input parameters to control the size of the summary and filter out sentences that are too short or too long.
    - `summary_length`
    - `context_amount` (the number of sentences around each key summary sentence)
    - `short_sentence_length`
    - `long_sentence_length`
    

-	Set the following parameters to adjust or refine the focus and content of the contrasted summary
    - `custom_stop_words` (list of custom stopwords)
    - `syntax_variables` (including / excluding syntax variables)
    - `num_keywords` (controlling the breadth of the contrasted summary)
    - `remove_redundancies` (removing redundancies)


-	Further down, we discuss how to construct a customized stopwords list to refine document contrasted summaries

In [None]:
print('---------------- Get doc contrasted summaries ------------------------')
metadata_selection_contrast = {"research_analyst": "MS"} # dict | The metadata selection defining the two categories of documents to contrast and summarize against each other

query = '' # str | Dataset-language-specific fulltext query, using mysql MATCH boolean query format (optional)
custom_stop_words = ["morgan stanley"] # List of stop words. (optional)
summary_length = 6 # int | The maximum number of bullet points a user wants to see in the contrasted summary. (optional) (default to 6)
context_amount = 0 # int | The number of sentences surrounding key summary sentences in the documents that they come from. (optional) (default to 0)
short_sentence_length = 0 # int | The sentence length below which a sentence is excluded from summarization (optional) (default to 4)
long_sentence_length = 40 # int | The sentence length beyond which a sentence is excluded from summarization (optional) (default to 40)
excluded_docs = '' # str | List of document IDs that should be excluded from the analysis. Example, ["docid1", "docid2", ..., "docidN"]  (optional)
syntax_variables = True # bool | Specifies whether to take into account syntax aspects of each category of documents to help with contrasting them (optional) (default to False)
num_keywords = 20 # integer | Number of keywords for the contrasted topic that is extracted from the dataset and used in the summary. (optional) (default to 50)
remove_redundancies = False # bool | If True, this option removes quasi-duplicates from the analysis and retains only one copy of each. A quasi-duplicate would have the same NLP representation, but not necessarily the exact same text. (optional) (default False)

payload = nucleus_api.DocumentContrastSummaryModel(dataset="Sellside_research", 
                                                    metadata_selection_contrast=metadata_selection_contrast,
                                                    custom_stop_words=custom_stop_words,
                                                    period_start='2018-01-01',
                                                    period_end='2019-01-01')
try:
    api_response = api_instance.post_document_contrast_summary_api(payload)
    api_ok = True
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])
    api_ok = False

if api_ok:   
    print('Summary for', [x for x in  metadata_selection_contrast.values()])
    for sent in api_response.result.class_1_content.sentences:
        print('    *', sent)
    print('======')
    for sent in api_response.result.class_2_content.sentences:
        print('    *', sent)   

### 4. Documents Classification

This task requires 3 steps:
-     Extract a contrast topic on a labeled dataset
-     Train the classifier with the contrast topic by providing a labeled dataset. In this step, you can adjust the weight of each keyword from the contrasted topic, remove certain keywords, and even compare the contrasted topic produced by step 1 against topics of your own choosing
-     Test the classifier on a test set

-     In the example below, we assume that the contrasted topic has already been obtained. The structure of `fixed_topics` is the same as the one returned by the Contrasted Topic Modeling API
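
Step 2 mentions adjusting keyword weights and removing keywords from the contrasted topic. One possible way to do that is sketched below. `build_fixed_topics` is a hypothetical helper, the input values are made up, and rescaling the remaining weights so they sum to 1 is a choice made here to mirror the example values in this notebook, not a documented requirement.

```python
def build_fixed_topics(keywords, weights, drop=()):
    """Build a fixed_topics-style dict without the keywords listed in `drop`.

    The remaining weights are rescaled so that they sum to 1.
    """
    kept = [(k, w) for k, w in zip(keywords, weights) if k not in drop]
    total = sum(w for _, w in kept)
    if total <= 0:
        raise ValueError('no positive weight left after dropping keywords')
    return {'keywords': [k for k, _ in kept],
            'weights': [w / total for _, w in kept]}


# Made-up contrasted topic for illustration only
example_fixed_topics = build_fixed_topics(
    ['price target', 'projected revenue', 'economy', 'disclaimer'],
    [0.4, 0.2, 0.2, 0.2],
    drop={'disclaimer'})
print(example_fixed_topics)
```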

In [None]:
fixed_topics = {"keywords": ["price target", "projected revenue", "economy"], "weights": [0.5, 0.25, 0.25]} # dict | The contrasting topic used to separate the two categories of documents. Weights optional
metadata_selection_contrast = {"research_analyst": "MS"} # dict | The metadata selection defining the two categories of documents that a document can be classified into

query = '' # str | Dataset-language-specific fulltext query, using mysql MATCH boolean query format (optional)
custom_stop_words = ["morgan stanley"] # List of stop words. (optional)
excluded_docs = '' # str | List of document IDs that should be excluded from the analysis. Example, ["docid1", "docid2", ..., "docidN"]  (optional)
syntax_variables = True # bool | If True, the classifier will include syntax-related variables on top of content variables (optional) (default to False)
threshold = 0 # float | Threshold value for a document's exposure to the contrasted topic, above which the document is assigned to class 1 specified through metadata_selection_contrast. (optional) (default to 0)
remove_redundancies = False # bool | If True, this option removes quasi-duplicates from the analysis and retains only one copy of each. A quasi-duplicate would have the same NLP representation, but not necessarily the exact same text. (optional) (default False)

payload = nucleus_api.DocClassifyModel(dataset="Sellside_research",
                                        fixed_topics=fixed_topics,
                                        metadata_selection_contrast=metadata_selection_contrast,
                                        custom_stop_words=custom_stop_words,
                                        validation_phase=True,
                                        period_start='2018-01-01',
                                        period_end='2019-01-01')
try:
    api_response = api_instance.post_doc_classify_api(payload)
    api_ok = True
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])
    api_ok = False

if api_ok:   
    print('Detailed Results')
    print('    Docids:', api_response.result.detailed_results.docids)
    print('    Exposure:', api_response.result.detailed_results.exposures)
    print('    Estimated Category:', api_response.result.detailed_results.estimated_class)
    print('    Actual Category:', api_response.result.detailed_results.true_class)
    print('\n')

    print('Perf Metrics')
    print('    Accuracy:', api_response.result.perf_metrics.hit_rate)
    print('    Recall:', api_response.result.perf_metrics.recall)
    print('    Precision:', api_response.result.perf_metrics.precision)
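
For readers who want to see how accuracy, precision and recall relate to the estimated and actual categories printed above, here is a minimal sketch that computes them from two lists of labels. It shows the standard definitions and is not necessarily how Nucleus computes `perf_metrics`. The labels `'MS'` and `'other'` are made up, since the notebook does not show the label format.

```python
def classification_metrics(estimated, actual, positive):
    """Compute accuracy, precision and recall for one positive class label."""
    if len(estimated) != len(actual):
        raise ValueError('estimated and actual must have the same length')
    pairs = list(zip(estimated, actual))
    tp = sum(1 for e, a in pairs if e == positive and a == positive)
    fp = sum(1 for e, a in pairs if e == positive and a != positive)
    fn = sum(1 for e, a in pairs if e != positive and a == positive)
    correct = sum(1 for e, a in pairs if e == a)
    return {
        'accuracy': correct / len(pairs) if pairs else 0.0,
        'precision': tp / (tp + fp) if tp + fp else 0.0,
        'recall': tp / (tp + fn) if tp + fn else 0.0,
    }


# Made-up labels for illustration only
estimated = ['MS', 'MS', 'other', 'other', 'MS']
actual = ['MS', 'other', 'other', 'MS', 'MS']
metrics = classification_metrics(estimated, actual, positive='MS')
print(metrics)
```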

We can then move to the testing phase by setting `validation_phase=False` and pointing the payload at a later period:

In [None]:
fixed_topics = {"keywords": ["price target", "projected revenue", "economy"], "weights": [0.5, 0.25, 0.25]} # dict | The contrasting topic used to separate the two categories of documents
metadata_selection_contrast = {"research_analyst": "MS"} # dict | The metadata selection defining the two categories of documents that a document can be classified into

payload = nucleus_api.DocClassifyModel(dataset="Sellside_research",
                                        fixed_topics=fixed_topics,
                                        metadata_selection_contrast=metadata_selection_contrast,
                                        custom_stop_words=custom_stop_words,
                                        validation_phase=False,
                                        period_start='2019-01-02',
                                        period_end='2019-06-01')
try:
    api_response = api_instance.post_doc_classify_api(payload)
    api_ok = True
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])
    api_ok = False

if api_ok:   
    print('Detailed Results')
    print('    Docids:', api_response.result.detailed_results.docids)
    print('    Exposure:', api_response.result.detailed_results.exposures)
    print('    Estimated Category:', api_response.result.detailed_results.estimated_class)
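
The `threshold` parameter described earlier assigns a document to class 1 when its exposure to the contrasted topic is above the threshold. The sketch below reproduces that rule on a list of exposures so you can see how different thresholds would shift the split. The labels `'class_1'` and `'class_2'`, the exposure values, and the treatment of ties (an exposure equal to the threshold goes to class 2) are assumptions made for illustration.

```python
def assign_classes(exposures, threshold=0.0):
    """Label each exposure 'class_1' if strictly above threshold, else 'class_2'."""
    return ['class_1' if e > threshold else 'class_2' for e in exposures]


# Made-up exposures for illustration only
exposures = [0.30, -0.10, 0.00, 0.12]

for threshold in (0.0, 0.1, 0.2):
    labels = assign_classes(exposures, threshold)
    print(threshold, labels.count('class_1'), 'document(s) in class 1')
```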

### 5.	Fine-tuning

#### a.	Excluding certain content from the contrast analysis

-   Exclude irrelevant keywords / topics to tailor your contrast analysis by using the `custom_stop_words` parameter in the Topic Contrast API


-	Extract keywords from topics within your corpus and print the keywords of these topics. You could do the same when extracting contrasting topics



In [None]:
print('------------- Get list of topics from dataset --------------')

payload = nucleus_api.Topics(dataset='Sellside_research',                         
                            query='',                       
                            num_topics=8, 
                            num_keywords=8,
                            metadata_selection=metadata_selection_contrast)
try:
    api_response = api_instance.post_topic_api(payload)        
    api_ok = True
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])
    api_ok = False

if api_ok:       
    for i, res in enumerate(api_response.result.topics):
        print('Topic', i, ' keywords: ', res.keywords)    
        print('---------------')

Using your domain expertise or client / advisor input, you can determine if specific topics or keywords are not differentiated enough to contribute to contrast analysis. 

You can then tailor the contrast analysis by creating a `custom_stop_words` variable that contains those words. As shown below, initialize the variable and pass it in the payload of the code in section 2:

In [1]:
custom_stop_words = ["disclaimer", "disclosure"] # list | List of stop words. (optional)
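
Domain judgment remains the main guide when choosing stop words. As a complementary heuristic, keywords that recur across many extracted topics may be too generic to help with contrast. The sketch below shortlists such keywords, assuming each topic's keywords are available as a list of strings (the exact type of `res.keywords` is not shown in this notebook). `stop_word_candidates` is a hypothetical helper and the topics are made up.

```python
from collections import Counter


def stop_word_candidates(topic_keywords, min_topics=2):
    """Return keywords that appear in at least `min_topics` topics."""
    counts = Counter()
    for keywords in topic_keywords:
        counts.update(set(keywords))  # count each keyword once per topic
    return sorted(k for k, n in counts.items() if n >= min_topics)


# Made-up topics for illustration only
topics = [
    ['disclaimer', 'earnings', 'guidance'],
    ['disclaimer', 'rates', 'disclosure'],
    ['disclosure', 'disclaimer', 'margins'],
]
candidates = stop_word_candidates(topics, min_topics=2)
print(candidates)
```

The resulting shortlist is only a starting point; review it before passing it as `custom_stop_words`.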

#### b. Focusing the contrasted summary on specific subjects potentially discussed in your corpus
**query**: You can refine the contrast analysis by leveraging the query variable of the Doc Contrasted Summary API.

Rerun the Contrast Analysis APIs with a specific query or queries. Create a `query` variable and pass it in the payload:

In [None]:
query = '(earnings OR cash flows)' # str | Fulltext query, using mysql MATCH boolean query format. Example: "(word1 OR word2) AND (word3 OR word4)" (optional)
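
When queries grow beyond a couple of terms, it can help to assemble the string programmatically. The sketch below follows the `"(word1 OR word2) AND (word3 OR word4)"` shape given in the comment above. `build_query` is a hypothetical helper, and it does no escaping or quoting of terms.

```python
def build_query(groups):
    """Join terms with OR inside each group, and groups with AND."""
    parts = []
    for group in groups:
        terms = [t.strip() for t in group if t.strip()]
        if terms:  # skip groups that end up empty
            parts.append('(' + ' OR '.join(terms) + ')')
    return ' AND '.join(parts)


example_query = build_query([['earnings', 'cash flows'], ['guidance', 'outlook']])
print(example_query)
```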

#### c. Specifying the metadata_selection_contrast for your contrasted topic

-     Contrasting documents from two different entities

    on your own data, e.g. sell-side research: 

In [None]:
metadata_selection_contrast = {"research_analyst": ["MS", "JPM"]}

    on SumUp data feed, e.g. Central Banks:

In [None]:
metadata_selection_contrast = {"bank": ["federal_reserve", "ECB"]}

-     Contrasting different documents from a given entity

    on SumUp data feed, e.g. Central Banks: 

In [None]:
metadata_selection_contrast = {"document_category": ["speech", "press release"]}

-     Contrasting documents that contain different keywords

    on your own data, or on SumUp data feed: 

In [None]:
metadata_selection_contrast = {"content": "fundamentals"}

Copyright (c) 2019 SumUp Analytics, Inc. All Rights Reserved.

NOTICE: All information contained herein is, and remains the property of SumUp Analytics Inc. and its suppliers, if any. The intellectual and technical concepts contained herein are proprietary to SumUp Analytics Inc. and its suppliers and may be covered by U.S. and Foreign Patents, patents in process, and are protected by trade secret or copyright law.

Dissemination of this information or reproduction of this material is strictly forbidden unless prior written permission is obtained from SumUp Analytics Inc.