<h1><center>  Nucleus APIs Use Case - Stock Screening</center></h1>


<h1><center>  SumUp Analytics, Proprietary & Confidential</center></h1>
<h1><center>  Disclaimers and Terms of Service available at www.sumup.ai</center></h1>




## Objective:
-	Rank and screen single-name stocks using the disclosures published by listed companies


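## Data:
-	For a selected list of companies, for example companies in the same sector or with similar market capitalization:
 - 	Investor reports
 - 	Press releases
 - 	Earnings call transcripts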



## Nucleus APIs used:
-	Dataset creation API
 - 	*api_instance.post_upload_file(file, dataset)*
 - 	*nucleus_helper.import_files(api_instance, dataset, file_iters, processes=1)*

        nucleus_helper.import_files leverages api_instance.post_upload_file with parallel execution to speed up dataset creation


-	Topic Modeling API
 - 	*api_instance.post_topic_api(payload)*


-	Topic Sentiment API
 - 	*api_instance.post_topic_sentiment_api(payload)*


-	DocInfo API
 - 	*api_instance.post_doc_info(payload)*


-	DatasetInfo API
 - 	*api_instance.post_dataset_info(payload)*


## Approach:

### 1.	Dataset preparation
-	Create a Nucleus dataset containing all relevant documents for the selected historical period

    

In [None]:
import os
from pathlib import Path

# nucleus_helper and a configured api_instance are expected to exist already in the session
print('--------- Append all files from local folder to dataset in parallel -----------')
folder = 'Corporate_documents'
dataset = 'Corporate_docs'  # str | Destination dataset where the file will be inserted.

# Build the file iterable from a folder, recursively.
# Each item in the iterable is in the format below:
# {'filename': filename,   # filename to be uploaded. REQUIRED
#  'metadata': {           # metadata for the file. Optional
#      'key1': val1,       # keys can have arbitrary names as long as the names only
#      'key2': val2        # contain alphanumeric (0-9|a-z|A-Z) and underscore (_)
#   }
# }
file_iter = []
for root, dirs, files in os.walk(folder):
    for file in files:
        if Path(file).suffix == '.pdf':  # .txt .doc .docx .rtf .html .csv also supported
            # The metadata values below are placeholders; set them per document
            file_dict = {'filename': os.path.join(root, file),
                         'metadata': {'ticker': 'CHU',
                                      'company': 'China Unicom',
                                      'category': 'Report',
                                      'date': '2019-01-01'}}
            file_iter.append(file_dict)

file_props = nucleus_helper.upload_files(api_instance, dataset, file_iter, processes=4)
for fp in file_props:
    print(fp.filename, '(', fp.size, 'bytes) has been added to dataset', dataset)

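-	For any date within the historical period, select only the subset of documents that falls inside a chosen look-back period

**This step can be done directly in the content-analysis API calls, as shown in the code of section 2**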
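The look-back window can be expressed as a pair of date strings in the same `"YYYY-MM-DD 00:00:00"` format used for `period_start` and `period_end` in this notebook. The sketch below is only an illustration; `lookback_window` is a hypothetical helper, not part of the Nucleus SDK.

```python
import datetime

def lookback_window(as_of, lookback_days):
    """Return (period_start, period_end) strings for a window ending at `as_of`.

    Hypothetical helper: the string format mirrors the period_start / period_end
    values used elsewhere in this notebook.
    """
    start = as_of - datetime.timedelta(days=lookback_days)
    fmt = "%Y-%m-%d 00:00:00"  # %m is the month; %M would be minutes
    return start.strftime(fmt), as_of.strftime(fmt)

# A 61-day look-back ending on 2019-01-01
period_start, period_end = lookback_window(datetime.datetime(2019, 1, 1), 61)
print(period_start, '->', period_end)
```

The two strings can then be supplied as `period_start` and `period_end` in the analysis payloads.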



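### 2.	Sentiment and topic contribution: screening analysis
- At a given point in time, identify and extract the key topics in the subset of documents available at that time


- Quantify the sentiment of each topic and split the key topics into "positive" and "negative" topics


- Determine how strongly each company is associated with each key topic


- Combine the above into an aggregate sentiment contribution per company across all key topics, which gives the company ranking
 - The top-ranked companies are those most associated with positive topics and/or least associated with negative topics
 
 
- We then discuss how the different parameters available to the user can be used to refine this analysis.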
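The aggregation described in the list above can be tried out without any API call. The sketch below uses made-up toy data in place of API responses: the tickers, the dictionary field names, and all numbers are assumptions chosen for illustration only.

```python
import numpy as np

companies = np.array(["AAA", "BBB", "CCC"])  # hypothetical tickers

# Hypothetical topics: each has a strength, a sentiment, and per-document
# scores together with the ticker of the company behind each document.
topics = [
    {"strength": 0.6, "sentiment": 0.5,
     "doc_tickers": ["AAA", "AAA", "BBB"], "doc_scores": [0.4, 0.2, 0.4]},
    {"strength": 0.4, "sentiment": -0.5,
     "doc_tickers": ["BBB", "CCC"], "doc_scores": [0.5, 0.5]},
]

def company_scores(companies, topics):
    """Average, over topics, of strength * sentiment * company exposure."""
    rankings = np.zeros((len(companies), len(topics)))
    for t, topic in enumerate(topics):
        tickers = np.array(topic["doc_tickers"])
        scores = np.array(topic["doc_scores"], dtype=float)
        for c, company in enumerate(companies):
            # Company exposure = sum of the scores of its documents in this topic
            exposure = scores[tickers == company].sum()
            rankings[c, t] = topic["strength"] * topic["sentiment"] * exposure
    return rankings.mean(axis=1)

scores = company_scores(companies, topics)
order = companies[np.argsort(-scores)]  # best-ranked company first
print(dict(zip(companies, scores.round(3))), list(order))
```

In this toy example the company tied only to the positive topic ranks first and the one tied only to the negative topic ranks last.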




In [None]:
# Determine which companies are associated with the documents contributing to the topics
import numpy as np

payload = nucleus_api.DocInfo(dataset='Corporate_docs')
api_response = api_instance.post_doc_info(payload)

company_sources = []
for res in api_response.result:
    company_sources.append(res.attribute['ticker'])

company_list = np.unique(company_sources)


print('-------- Get topic sentiment and exposure per firm ----------------')

payload = nucleus_api.TopicSentimentModel(dataset='Corporate_docs',
                                query='',
                                num_topics=20,
                                num_keywords=8,
                                period_start="2018-11-01 00:00:00",
                                period_end="2019-01-01 00:00:00")
api_response = api_instance.post_topic_sentiment_api(payload)

# One row per company, one column per topic
company_rankings = np.zeros([len(company_list), len(api_response.result)])
for i, res in enumerate(api_response.result):
    print('Topic', i, 'sentiment:')
    print('    Keywords:', res.topic)

    # Aggregate all document exposures within a topic into a company exposure, using the dataset metadata
    payload = nucleus_api.DocInfo(dataset='Corporate_docs', doc_ids=res.doc_id)
    api_response1 = api_instance.post_doc_info(payload)

    company_sources = []  # This list will be much shorter than the whole dataset because not all documents contribute to a given topic
    for res1 in api_response1.result:
        company_sources.append(res1.attribute['ticker'])

    # 1-D array so it can be assigned directly to a column of company_rankings
    company_contributions = np.zeros(len(company_list))
    for j in range(len(company_list)):
        for k in range(len(company_sources)):
            if company_sources[k] == company_list[j]:
                # float() guards against numeric fields being returned as strings
                company_contributions[j] += float(res.doc_score[k])

    company_rankings[:, i] = float(res.strength) * float(res.sentiment) * company_contributions

    print('---------------')


# Average the per-topic company scores into the final stock screen
Corporate_screen = np.mean(company_rankings, axis=1)

In [None]:
import datetime
import numpy as np

print('------------ Retrieve all companies found in the dataset ----------')

payload = nucleus_api.DocInfo(dataset='Corporate_docs')
api_response = api_instance.post_doc_info(payload)

company_sources = []
for res in api_response.result:
    company_sources.append(res.attribute['ticker'])

company_list = np.unique(company_sources)


print('--------------- Retrieve the time range of the dataset -------------')

payload = nucleus_api.DatasetInfo(dataset='Corporate_docs', query='')
api_response = api_instance.post_dataset_info(payload)

first_date = datetime.datetime.fromtimestamp(float(api_response.result.time_range[0]))
last_date = datetime.datetime.fromtimestamp(float(api_response.result.time_range[1]))
delta = last_date - first_date

# Now loop through time and at each date, compute the ranking of companies
T = 90  # The look-back period in days

Corporate_screen = []
# The first date that has a full look-back window behind it
end_date = first_date + datetime.timedelta(days=T)
for i in range(max(delta.days - T + 1, 0)):
    # First and last date used for the look-back period of T days
    start_date = end_date - datetime.timedelta(days=T)
    start_date_str = start_date.strftime("%Y-%m-%d 00:00:00")
    end_date_str = end_date.strftime("%Y-%m-%d 00:00:00")

    payload = nucleus_api.TopicSentimentModel(dataset="Corporate_docs",
                                query='',
                                num_topics=20,
                                num_keywords=8,
                                period_start=start_date_str,
                                period_end=end_date_str)
    api_response = api_instance.post_topic_sentiment_api(payload)

    # One row per company, one column per topic
    company_rankings = np.zeros([len(company_list), len(api_response.result)])
    for l, res in enumerate(api_response.result):
        # Aggregate all document exposures within a topic into a company exposure, using the dataset metadata
        payload = nucleus_api.DocInfo(dataset='Corporate_docs', doc_ids=res.doc_id)
        api_response1 = api_instance.post_doc_info(payload)

        company_sources = []  # This list will be much shorter than the whole dataset because not all documents contribute to a given topic
        for res1 in api_response1.result:
            company_sources.append(res1.attribute['ticker'])

        # 1-D array so it can be assigned directly to a column of company_rankings
        company_contributions = np.zeros(len(company_list))
        for j in range(len(company_list)):
            for k in range(len(company_sources)):
                if company_sources[k] == company_list[j]:
                    # float() guards against numeric fields being returned as strings
                    company_contributions[j] += float(res.doc_score[k])

        company_rankings[:, l] = float(res.strength) * float(res.sentiment) * company_contributions

    # Average the per-topic company scores into the stock screen for this date
    Corporate_screen.append(np.mean(company_rankings, axis=1))

    # We want a daily indicator: move the window forward by one day
    end_date = end_date + datetime.timedelta(days=1)

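-	Repeat the task above for each date in the historical period to obtain the full history of the single-name screen

## 3.	Results interpretation
-	Compare the time series of company rankings with risk-adjusted equity returns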
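One simple way to run this comparison is a lagged correlation between one company's screen score and its subsequent risk-adjusted return. The sketch below is illustrative only: `lagged_correlation` is a hypothetical helper, the data is synthetic, and the risk-adjusted returns are assumed to be computed elsewhere.

```python
import numpy as np

def lagged_correlation(signal, returns, lag):
    """Pearson correlation between signal[t] and returns[t + lag], for lag >= 0."""
    signal = np.asarray(signal, dtype=float)
    returns = np.asarray(returns, dtype=float)
    if lag < 0:
        raise ValueError("lag must be non-negative")
    if lag > 0:
        # Align each signal value with the return observed `lag` days later
        signal, returns = signal[:-lag], returns[lag:]
    return np.corrcoef(signal, returns)[0, 1]

# Synthetic example: returns echo the signal two days later, plus noise
rng = np.random.default_rng(0)
signal = rng.normal(size=250)
returns = np.roll(signal, 2) + 0.1 * rng.normal(size=250)

for lag in (0, 1, 2, 5):
    print('lag', lag, 'correlation', round(lagged_correlation(signal, returns, lag), 3))
```

With real data, `signal` would be the daily score of one company taken from `Corporate_screen`, and scanning several lags shows how quickly prices respond to the information.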

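## 4.	Fine-tuning

### a.	Adjusting the topic scope
-	Tailor the screen by excluding topics you consider not impactful. This is done through the user-defined custom_stop_words parameter in the input of the Topic Sentiment API.


-	Identify and extract the key topics in the document subset and print their keywords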



In [None]:
print('------------- Get list of topics from dataset --------------')

payload = nucleus_api.Topics(dataset='Corporate_docs',                       
                                query='',                       
                                num_topics=20, 
                                num_keywords=8,
                                period_start="2018-11-01 00:00:00",
                                period_end="2019-01-01 00:00:00")
api_response = api_instance.post_topic_api(payload)        
    
for i, res in enumerate(api_response.result):
    print('Topic', i, ' keywords: ', res.topic)    
    print('---------------')

You can then tailor the screen by creating a custom_stop_words variable, for example:

In [None]:
custom_stop_words = ["call","report"] # str | List of stop words. (optional)

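### b.	Focusing on specific topics
If you decide to focus the screen on specific themes, such as "financial health" and "corporate actions", simply replace the query variable in the main code of section 2 with the following: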

In [None]:
query = '("earnings" OR "debt" OR "competition" OR "lawsuit" OR "restructuring")' # str | Fulltext query, using mysql MATCH boolean query format. Example, (\"word1\" OR \"word2\") AND (\"word3\" OR \"word4\") (optional)

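### c.	Exploring the impact of document type, look-back period, and number of topics extracted
**num_topics**: Changing the variable num_topics in the main code of section 2 changes the number of topics, that is, the breadth of the screen. A larger value gives more breadth when building the ranking; a smaller value narrows it. If num_topics is too large, some very marginal topics may add a lot of noise to the company ranking.

**T**: Changing the variable T in the main code of section 2 changes the look-back period. A larger T produces a slowly changing ranking; a smaller T produces a rapidly changing one. If T is too small, the document subset may become too small, which can make the ranking noisy. If T is too large, the ranking will not reflect important new information quickly enough.

**Document types**: Restrict the analysis to a single type of document (company reports, press releases, or earnings call transcripts) and compare how the company ranking changes. Since the metadata of the dataset is known, simply create a metadata selection variable and pass it into the main code of section 2: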

In [None]:
metadata_selection = {"category": "Report"}   # str | json object of {"metadata_field": ["selected_values"]} (optional); the value must match the metadata set at upload
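The adjustments in sections a to c can be collected in one place before building the payload. The sketch below is a hypothetical helper, not part of the Nucleus SDK, and it assumes that `TopicSentimentModel` accepts `custom_stop_words`, `query` and `metadata_selection` as keyword arguments under those names.

```python
def build_screen_params(period_start, period_end, dataset="Corporate_docs",
                        query="", num_topics=20, num_keywords=8,
                        custom_stop_words=None, metadata_selection=None):
    """Collect the user-tunable screening parameters into a dict of keyword arguments.

    Hypothetical helper: the key names are assumed to match the keyword
    arguments of the Topic Sentiment payload.
    """
    params = {
        "dataset": dataset,
        "query": query,
        "num_topics": num_topics,
        "num_keywords": num_keywords,
        "period_start": period_start,
        "period_end": period_end,
    }
    # Optional adjustments are only included when the user sets them
    if custom_stop_words:
        params["custom_stop_words"] = list(custom_stop_words)
    if metadata_selection:
        params["metadata_selection"] = dict(metadata_selection)
    return params

params = build_screen_params("2018-11-01 00:00:00", "2019-01-01 00:00:00",
                             query='("earnings" OR "debt")',
                             custom_stop_words=["call", "report"])
print(params)
# Assumed usage, not executed here:
# payload = nucleus_api.TopicSentimentModel(**params)
```

Keeping the parameters in a single dict makes it easier to rerun the screen with different settings and compare the resulting rankings.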

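## 5.	Next steps
-	Possible extension: repeat the task above for different sectors
 - This gives you a broader and more flexible screen, letting you rank and screen stocks within each sector
 - If you pool companies from all sectors into one dataset and tag each with its sector, you can also rank and screen along the sector dimension


-	Run a correlation analysis between the time series of company rankings and risk-adjusted equity returns
 - Different horizons of price impact can be studied: 1 day, 7 days, a few weeks, or even longer-lasting effects
 - Different time lags of price impact can be studied: 2 to 3 days before the market starts adjusting to the new information, a week before, or even longer intervals
 - You can also relate the results to dividend yield or price-to-earnings ratio. Do companies with a higher dividend yield show a different price impact than companies with a lower dividend yield?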



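-	Possible extension: apply simple transformations to the screening indicator (the ranking)
 - For example, define the following score to scale and smooth the ranking

        Score(Company i) = ( Rank(Company i) - Average(Ranks, [Companies]) ) / Std(Ranks, [Companies])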
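The score formula above is a cross-sectional standardization and can be sketched in a few lines of NumPy. This is a minimal illustration: `standardize` is a hypothetical helper, and the use of the population standard deviation (NumPy's default) is an assumption, since the formula does not say which one is meant.

```python
import numpy as np

def standardize(ranks):
    """Subtract the cross-sectional mean and divide by the cross-sectional std."""
    ranks = np.asarray(ranks, dtype=float)
    std = ranks.std()  # population standard deviation (ddof=0), an assumption
    if std == 0:
        # All companies share the same value: no dispersion to scale by
        return np.zeros_like(ranks)
    return (ranks - ranks.mean()) / std

# Hypothetical screen values for three companies on one date
scores = standardize([0.09, 0.01, -0.05])
print(scores)
```

Applied date by date, this makes values comparable over time even when the overall level of sentiment shifts.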

Copyright (c) 2019 SumUp Analytics, Inc. All Rights Reserved.

NOTICE: All information contained herein is, and remains the property of SumUp Analytics Inc. and its suppliers, if any. The intellectual and technical concepts contained herein are proprietary to SumUp Analytics Inc. and its suppliers and may be covered by U.S. and Foreign Patents, patents in process, and are protected by trade secret or copyright law.

Dissemination of this information or reproduction of this material is strictly forbidden unless prior written permission is obtained from SumUp Analytics Inc.