<h1><center>  Topics Transfer Learning - Nucleus APIs Use Cases</center></h1>


<h1><center>  SumUp Analytics, Proprietary & Confidential</center></h1>


<h1><center>  Disclaimers and Terms of Service available at www.sumup.ai</center></h1>

 


## Objective: 
-	Extract topics from a reference dataset and measure their key metrics (strength, sentiment, consensus, exposures) on a validation dataset


## Data:
-	Any two datasets, whether they are time-ordered or chosen through another methodology

    **The Nucleus Datafeed can be leveraged for all content from major Central Banks**


## Nucleus APIs used:
-	Dataset creation API
 - 	*api_instance.post_upload_file(file, dataset)*
 - 	*nucleus_helper.import_files(api_instance, dataset, file_iters, processes=1)*

        nucleus_helper.import_files leverages api_instance.post_upload_file with parallel execution to speed up dataset creation


-	Topic Modeling API
 - 	*api_instance.post_topic_api(payload)*


-	Topic Transfer API
 - 	*api_instance.post_topic_transfer_api(payload)*


-	Topic Sentiment Transfer API
 - 	*api_instance.post_topic_sentiment_transfer_api(payload)*


-	Topic Consensus Transfer API
 - 	*api_instance.post_topic_consensus_transfer_api(payload)*


## Approach:

### 1.	Dataset Preparation
-	Create a Nucleus dataset containing all relevant documents over a chosen historical period

    

In [None]:
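# Leverage your own corpus
# Assumes `api_instance` and `nucleus_helper` are already available from your Nucleus SDK setup
import os

print('---- Case 1: you are using your own corpus, coming from a local folder ----')
folder = 'News_feed'
dataset = 'News_feed'  # str | Destination dataset where the file will be inserted.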

# build file iterable from a folder recursively. 
# Each item in the iterable is in the format below:
# {'filename': filename,   # filename to be uploaded. REQUIRED
#  'metadata': {           # metadata for the file. Optional
#      'key1': val1,       # keys can have arbitrary names as long as the names only
#      'key2': val2        # contain alphanumeric characters (0-9|a-z|A-Z) and underscores (_)
#   } 
# }
file_iter = []
for root, dirs, files in os.walk(folder):
    for file in files:
        #if Path(file).suffix == '.pdf': # .txt .doc .docx .rtf .html .csv also supported
        file_dict = {'filename': os.path.join(root, file),
                     'metadata': {'source': 'Tech Crunch',
                                  'author': 'Sarah Moore',
                                  'category': 'Media',
                                  'date': '2019-01-01'}}
        file_iter.append(file_dict)

file_props = nucleus_helper.upload_files(api_instance, dataset, file_iter, processes=4)
for fp in file_props:
    print(fp.filename, '(', fp.size, 'bytes) has been added to dataset', dataset)

    
# Leverage a Nucleus embedded feed    
print('---- Case 2: you are using an embedded datafeed ----')
dataset = 'sumup/rss_feed_finance'# embedded datafeeds in Nucleus.


-	For a given date in that period, retain only a subset of the documents published during a chosen lookback period

**This can be done directly in the APIs that perform content analysis, through their period selectors (see below)**
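
As a minimal sketch of the lookback selection, assuming the period selectors accept `YYYY-MM-DD` strings as in the examples of this notebook, the window boundaries can be derived from an as-of date. The helper name `lookback_window` is hypothetical and not part of the Nucleus SDK.

```python
from datetime import date, timedelta


def lookback_window(as_of, lookback_days):
    """Return (period_start, period_end) as ISO date strings.

    The window ends on `as_of` and starts `lookback_days` days earlier.
    """
    start = as_of - timedelta(days=lookback_days)
    return start.isoformat(), as_of.isoformat()


# Hypothetical example: a 3-day lookback ending on 2018-08-15
period_start, period_end = lookback_window(date(2018, 8, 15), 3)
print(period_start, period_end)
```

The two strings can then be supplied as `period_start` and `period_end` (or as the `period_0_*` and `period_1_*` selectors) in the payloads shown further down.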



### 2.	Transfer Learning
-	Identify and extract key topics from a reference dataset


-	Measure the strength, sentiment, and consensus of each topic on the validation dataset


-	Measure the exposure and sentiment contribution of each document in the validation dataset to each topic


-	Further down, we discuss how to refine the transfer learning by leveraging the different parameters available to the user



In [None]:
print('------------------- Get topic transfer -----------------------')

payload = nucleus_api.TopicTransferModel(dataset0='News_feed', 
                                         dataset1="test_feed",
                                        query='', 
                                        custom_stop_words='', 
                                        num_topics=8, 
                                        num_keywords=8,
                                        metadata_selection='')
api_response = api_instance.post_topic_transfer_api(payload)

doc_ids_t1 = api_response.result.doc_ids_t1
topics = api_response.result.topics
for i,res in enumerate(topics):
    print('Topic', i, 'exposure within validation dataset:')
    print('    Keywords:', res.keywords)
    print('    Strength:', res.strength)
    print('    Document IDs:', doc_ids_t1)
    print('    Exposure per Doc in Validation Dataset:', res.doc_topic_exposures_t1)
    print('---------------')
    
print('-------------------------------------------------------------')
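
One way to read the per-document exposures is to find, for each validation document, the topic it is most exposed to. The sketch below works on plain Python lists and assumes that each topic's exposure list is aligned with the list of document IDs and holds numeric (or numeric-string) values; this layout is an assumption, and `dominant_topic_per_doc` is a hypothetical helper, not a Nucleus function.

```python
def dominant_topic_per_doc(doc_ids, exposures_by_topic):
    """Map each document ID to the index of its highest-exposure topic.

    exposures_by_topic[k][j] is the exposure of document doc_ids[j] to topic k.
    Ties resolve to the lowest topic index.
    """
    dominant = {}
    for j, doc_id in enumerate(doc_ids):
        per_topic = [float(exposures[j]) for exposures in exposures_by_topic]
        dominant[doc_id] = max(range(len(per_topic)), key=per_topic.__getitem__)
    return dominant


# Illustrative values only, not actual API output
doc_ids = ['10', '11', '12']
exposures_by_topic = [
    [0.7, 0.1, 0.4],  # topic 0
    [0.3, 0.9, 0.6],  # topic 1
]
print(dominant_topic_per_doc(doc_ids, exposures_by_topic))
```

With a real response, `exposures_by_topic` would be built from the `doc_topic_exposures_t1` field of each topic, provided the alignment assumption holds.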

-	Repeat the above task for Topic Sentiment transfer and Topic Consensus transfer, depending on which aspect of the analysis you would like to transfer from the reference to the validation dataset

In [None]:
print('------------------- Get topic sentiment transfer -----------------------')

payload = nucleus_api.TopicSentimentTransferModel(dataset0='News_feed', 
                                        query='', 
                                        custom_stop_words='', 
                                        num_topics=8, 
                                        num_keywords=8,
                                        period_0_start='2018-08-12',
                                        period_0_end='2018-08-15',
                                        period_1_start='2018-08-16',
                                        period_1_end='2018-08-19',
                                        metadata_selection='')
api_response = api_instance.post_topic_sentiment_transfer_api(payload)

topics = api_response.result
for i,res in enumerate(topics):
    print('Topic', i, 'exposure within validation dataset:')
    print('    Keywords:', res.keywords)
    print('    Strength:', res.strength)
    print('    Sentiment:', res.sentiment)
    print('    Document IDs:', res.doc_ids_t1)
    print('    Sentiment per Doc in Validation Dataset:', res.doc_sentiments_t1)
    print('---------------')
    
print('-------------------------------------------------------------')

In [None]:
print('------------------- Get topic consensus transfer -----------------------')

payload = nucleus_api.TopicConsensusTransferModel(dataset0='News_feed', 
                                        query='', 
                                        custom_stop_words='', 
                                        num_topics=8, 
                                        num_keywords=8,
                                        period_0_start='2018-08-12',
                                        period_0_end='2018-08-15',
                                        period_1_start='2018-08-16',
                                        period_1_end='2018-08-19',
                                        metadata_selection='')
api_response = api_instance.post_topic_consensus_transfer_api(payload)

topics = api_response.result
for i,res in enumerate(topics):
    print('Topic', i, 'consensus within validation dataset:')
    print('    Keywords:', res.keywords)
    print('    Consensus:', res.consensus)
    print('---------------')
    
print('-------------------------------------------------------------')

## 3.	Results Interpretation
-	Compare the metrics between the reference and the validation dataset, or use the metrics measured on the validation dataset to generate production signals
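
A minimal sketch of such a comparison, assuming you have collected one strength value per topic for the reference and for the validation dataset into plain dictionaries. The topic labels and numbers below are illustrative only, and `strength_shift` is a hypothetical helper.

```python
def strength_shift(reference, validation):
    """Return validation minus reference strength for topics present in both."""
    return {topic: validation[topic] - reference[topic]
            for topic in reference if topic in validation}


# Illustrative numbers only, not actual API output
reference_strength = {'inflation': 0.5, 'growth': 0.25, 'trade': 0.25}
validation_strength = {'inflation': 0.25, 'growth': 0.75}

shifts = strength_shift(reference_strength, validation_strength)

# List the topics whose strength moved the most first
for topic, delta in sorted(shifts.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f'{topic}: {delta:+.2f}')
```

Topics missing from the validation side (here `trade`) are skipped, not treated as zero, since a missing topic and a zero-strength topic may mean different things.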

## 4.	Fine Tuning

### a.	Tailoring the topics
-	You can tailor the transfer learning by excluding topics that you consider not impactful. To do so, pass the custom_stop_words parameter to any of the Topic * Transfer APIs


-	To decide what to exclude, identify and extract the key topics from the reference documents and print their keywords



In [None]:
print('------------- Get list of topics from dataset --------------')

payload = nucleus_api.Topics(dataset='News_feed',
                            query='',
                            num_topics=8,
                            num_keywords=8,
                            metadata_selection='',
                            period_start='2018-08-12',
                            period_end='2018-08-15')
api_response = api_instance.post_topic_api(payload)        
    
for i, res in enumerate(api_response.result.topics):
    print('Topic', i, ' keywords: ', res.keywords)    
    print('---------------')

You can then tailor the transfer learning by defining a custom_stop_words variable, for instance as follows, and passing it in the payload of the main code of section 2:

In [None]:
custom_stop_words = ["conference", "interview"]  # list of str | Stop words. (optional)
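
If several unwanted topics show up, the stop-word list can be assembled from their keywords. This is a sketch under the assumption that each topic's keywords are available as a list of strings; `stop_words_from_topics` is a hypothetical helper, and the keywords are made up for illustration. Excluding every keyword of a topic can be too aggressive, so the result is best treated as a starting point to review by hand.

```python
def stop_words_from_topics(topic_keywords, unwanted_topics):
    """Collect the keywords of the unwanted topics, in order, without duplicates."""
    stop_words = []
    for i in unwanted_topics:
        for keyword in topic_keywords[i]:
            if keyword not in stop_words:
                stop_words.append(keyword)
    return stop_words


# Illustrative keywords only
topic_keywords = [
    ['inflation', 'rates'],                # topic 0: keep
    ['conference', 'interview', 'press'],  # topic 1: exclude
    ['interview', 'podcast'],              # topic 2: exclude
]
custom_stop_words = stop_words_from_topics(topic_keywords, unwanted_topics=[1, 2])
print(custom_stop_words)
```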

### b.	Focusing the transfer learning on certain subjects
To focus the transfer learning on specific subjects, for instance policy and macroeconomic subjects, replace the query variable in the main code of section 2 with:

In [1]:
query = '(inflation OR growth OR unemployment OR stability OR regulation)'  # str | Full-text query, using MySQL MATCH boolean query format. Example: "(word1 OR word2) AND (word3 OR word4)" (optional)
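
For longer term lists, the query string can be assembled programmatically. The sketch below only reproduces the `(word1 OR word2) AND (word3 OR word4)` pattern quoted above; `build_query` is a hypothetical helper, and it does not escape or validate the terms.

```python
def build_query(*groups):
    """Join the terms of each group with OR, then join the groups with AND."""
    clauses = ['(' + ' OR '.join(terms) + ')' for terms in groups if terms]
    return ' AND '.join(clauses)


query = build_query(['inflation', 'growth', 'unemployment', 'stability', 'regulation'])
print(query)

# Two groups: a document must match at least one term from each group
print(build_query(['word1', 'word2'], ['word3', 'word4']))
```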

### c.	Alternative specifications for the validation dataset
**Validation dataset**: two approaches are possible.

1) The reference and validation datasets are time-ordered. In that case, append the documents of the validation dataset to the reference dataset, and use the time selectors (period_0_start, period_0_end, period_1_start, period_1_end) to define which time period is the reference and which is the validation.

2) The reference and validation datasets are not necessarily time-ordered. In that case, pass two different datasets to the Topic Transfer APIs: dataset0 is the reference dataset and dataset1 is the validation dataset.

Note that Topic Transfer may not return any result if the topics extracted from the reference dataset are not present in the validation dataset.
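
The two specifications differ only in which payload arguments are set. As a sketch, the helpers below build the corresponding keyword arguments as plain dictionaries, using the parameter names that appear in the payloads of section 2. The helper names are hypothetical, and the date check relies on the dates being `YYYY-MM-DD` strings.

```python
def time_split_kwargs(dataset, ref_start, ref_end, val_start, val_end):
    """Approach 1: one dataset, split into reference and validation periods."""
    # ISO dates (YYYY-MM-DD) compare correctly as strings
    if ref_start > ref_end or val_start > val_end:
        raise ValueError('each period must start on or before its end date')
    return {'dataset0': dataset,
            'period_0_start': ref_start, 'period_0_end': ref_end,
            'period_1_start': val_start, 'period_1_end': val_end}


def two_dataset_kwargs(reference, validation):
    """Approach 2: separate reference and validation datasets."""
    return {'dataset0': reference, 'dataset1': validation}


kwargs_1 = time_split_kwargs('News_feed', '2018-08-12', '2018-08-15',
                             '2018-08-16', '2018-08-19')
kwargs_2 = two_dataset_kwargs('News_feed', 'test_feed')
print(kwargs_1)
print(kwargs_2)
```

Either dictionary can then be unpacked into a payload together with the other arguments, for example `nucleus_api.TopicSentimentTransferModel(query='', num_topics=8, num_keywords=8, **kwargs_1)`.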

### d. Specifying topics exogenously

You can impose the topics to be transferred to your validation dataset. These fixed topics can be chosen through whichever approach you prefer. To pass them to any Transfer Learning API, use the optional fixed_topics input argument in the payload.

In [None]:
# Pick ONE of the examples below: each assignment overwrites the previous one

# Example 1: in English, with weights that you choose
fixed_topics = [{"keywords":["inflation expectations", "forward rates", "board projections"], "weights":[0.7, 0.2, 0.1]}]

# Example 2: in English, without weights. Equal weights will then be used
fixed_topics = [{"keywords":["inflation expectations", "forward rates", "board projections"]}]

# Example 3: in Chinese (if your dataset is in Chinese), without weights
fixed_topics = [{"keywords":["操作", "流动性", "基点", "元", "点", "央行", "进一步", "投资"]},
                {"keywords":["认为", "价格", "数据", "调查", "全国", "统计", "金融市场", "要求"]}]


payload = nucleus_api.TopicTransferModel(dataset0='News_feed', 
                                        dataset1="test_feed",
                                        fixed_topics=fixed_topics,
                                        query='', 
                                        custom_stop_words='', 
                                        num_topics=8, 
                                        num_keywords=8,
                                        metadata_selection='')
api_response = api_instance.post_topic_transfer_api(payload)

doc_ids_t1 = api_response.result.doc_ids_t1
topics = api_response.result.topics
for i,res in enumerate(topics):
    print('Topic', i, 'exposure within validation dataset:')
    print('    Keywords:', res.keywords)
    print('    Strength:', res.strength)
    print('    Document IDs:', doc_ids_t1)
    print('    Exposure per Doc in Validation Dataset:', res.doc_topic_exposures_t1)
    print('---------------')
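
Before sending fixed topics, it can help to check their structure locally. The sketch below verifies that weights, when given, match the number of keywords, and fills in equal weights otherwise. That equal weights mean `1 / number of keywords` is an assumption made for illustration; the notebook only states that equal weights are used. `with_explicit_weights` is a hypothetical helper.

```python
def with_explicit_weights(fixed_topics):
    """Return a copy of fixed_topics in which every topic has a weights list."""
    completed = []
    for topic in fixed_topics:
        keywords = list(topic['keywords'])
        if not keywords:
            raise ValueError('each fixed topic needs at least one keyword')
        weights = topic.get('weights')
        if weights is None:
            # Assumption: equal weights are 1 / number of keywords
            weights = [1.0 / len(keywords)] * len(keywords)
        elif len(weights) != len(keywords):
            raise ValueError('weights and keywords must have the same length')
        completed.append({'keywords': keywords, 'weights': list(weights)})
    return completed


example_topics = [
    {'keywords': ['inflation expectations', 'forward rates', 'board projections'],
     'weights': [0.7, 0.2, 0.1]},
    {'keywords': ['growth', 'employment', 'wages', 'output']},
]
for topic in with_explicit_weights(example_topics):
    print(topic)
```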

Copyright (c) 2019 SumUp Analytics, Inc. All Rights Reserved.

NOTICE: All information contained herein is, and remains the property of SumUp Analytics Inc. and its suppliers, if any. The intellectual and technical concepts contained herein are proprietary to SumUp Analytics Inc. and its suppliers and may be covered by U.S. and Foreign Patents, patents in process, and are protected by trade secret or copyright law.

Dissemination of this information or reproduction of this material is strictly forbidden unless prior written permission is obtained from SumUp Analytics Inc.