<h1><center>  Entity Tagging - Nucleus APIs Use Cases</center></h1>


<h1><center>  SumUp Analytics, Proprietary & Confidential</center></h1>


<h1><center>  Disclaimers and Terms of Service available at www.sumup.ai</center></h1>


 


## Objective: 
-	Tag each document of a given dataset with the entities found in it, given a predefined list of entities of interest, and store those tags as metadata


## Data:
-	Any collection of documents

-   A list of entities that you are interested in detecting and tagging in the dataset


## Nucleus APIs used:
-	Dataset creation API
 - 	*api_instance.post_upload_file(file, dataset)*
 - 	*nucleus_helper.import_files(api_instance, dataset, file_iters, processes=1)*

        nucleus_helper.import_files leverages api_instance.post_upload_file with parallel execution to speed up dataset creation


-	Dataset Tagging API
 - 	*api_instance.post_dataset_tagging(payload)*


-	Document Info API
 - 	*api_instance.post_doc_info(payload)*


## Approach:

### 1.	Dataset Preparation
-	Create a Nucleus dataset containing all relevant documents

    

In [None]:
import os

# Assumes nucleus_api, nucleus_helper and api_instance were set up in an earlier cell.
print('---- Upload documents from a local folder into a new Nucleus dataset ----')
folder = 'Corporate_documents'
dataset = 'Corporate_docs'  # str | Destination dataset where the file will be inserted.

# Build the file iterable from a folder, recursively.
# Each item in the iterable has the format below:
# {'filename': filename,   # filename to be uploaded. REQUIRED
#  'metadata': {           # metadata for the file. Optional
#      'key1': val1,       # keys can have arbitrary names as long as the names only
#      'key2': val2        # contain alphanumeric (0-9|a-z|A-Z) and underscore (_)
#   }
# }
file_iter = []
for root, dirs, files in os.walk(folder):
    for file in files:
        #if Path(file).suffix == '.pdf': # .txt .doc .docx .rtf .html .csv also supported
        file_dict = {'filename': os.path.join(root, file),
                     'metadata': {'category': 'News'}}  # Tickers are not known yet for each news item; we tag them below
        file_iter.append(file_dict)

file_props = nucleus_helper.upload_files(api_instance, dataset, file_iter, processes=4)
for fp in file_props:
    print(fp.filename, '(', fp.size, 'bytes) has been added to dataset', dataset)
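
The constraint on metadata keys noted in the comments above (alphanumerics and underscores only) can be checked locally before anything is uploaded. The following is a minimal sketch, not part of the Nucleus SDK: `build_file_iter` and its `suffixes` parameter are hypothetical names introduced here for illustration.

```python
import os
import re
from pathlib import Path

# Keys may only contain 0-9, a-z, A-Z and underscore (per the comments above).
_VALID_KEY = re.compile(r'[0-9A-Za-z_]+')


def build_file_iter(folder, metadata=None, suffixes=None):
    """Hypothetical helper: walk `folder` and build the list of file dicts.

    suffixes: optional set of lowercase extensions to keep, e.g. {'.pdf', '.txt'}.
    """
    metadata = metadata or {}
    bad_keys = [k for k in metadata if not _VALID_KEY.fullmatch(k)]
    if bad_keys:
        raise ValueError('Invalid metadata keys: {}'.format(bad_keys))

    file_iter = []
    for root, _dirs, files in os.walk(folder):
        for name in sorted(files):
            if suffixes and Path(name).suffix.lower() not in suffixes:
                continue  # skip file types we were not asked to keep
            file_iter.append({'filename': os.path.join(root, name),
                              'metadata': dict(metadata)})
    return file_iter
```

With such a helper, the loop above would reduce to `file_iter = build_file_iter(folder, {'category': 'News'}, suffixes={'.pdf'})`, and an invalid key would fail before any upload starts.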

### 2.	Dataset Tagging
-	Define a list of company tickers (or any other entity relevant to you)


-	Loop through this list and use the Dataset Tagging API to determine which documents contain a given ticker


-	Further down, we discuss how to expand the list of synonyms for each entity to refine the tagging



In [None]:
print('---------------- Tag dataset ------------------------')

payload = nucleus_api.DatasetTaggingModel(dataset='Corporate_docs', 
                                    query='AAPL OR Apple', 
                                    metadata_selection='', 
                                    time_period='')
api_response = api_instance.post_dataset_tagging(payload)

print('Information about dataset', dataset)
print('    Entity Tagged:', api_response.result.entity_tagged)
print('    Docids tagged with Entity:', api_response.result.docids)

Now we create our list of entities and loop through it:

In [None]:
from collections import defaultdict

entities = [['AAPL', 'Apple'], ['GOOG', 'Google', 'Alphabet']]

docs_tagged = []
entities_tagged = []
for names in entities:
    query = " OR ".join(names)
    payload = nucleus_api.DatasetTaggingModel(dataset='Corporate_docs', 
                                    query=query, 
                                    metadata_selection='', 
                                    time_period='')
    api_response = api_instance.post_dataset_tagging(payload)

    # Same response fields as in the single-entity example above
    for docid in api_response.result.docids:
        docs_tagged.append(docid)
        entities_tagged.append(names[0])  # Retain the first name of an entity as its label

# Regroup the tagged entities per document, so that we have a unique list of documents
# and all the entities tagged in each of them.

# This table will be used to generate an updated dataset with tickers provided as metadata,
# so what we really care about are filenames rather than docids.
d = defaultdict(list)
for i, entity in enumerate(entities_tagged):
    payload = nucleus_api.DocInfo(
        dataset='Corporate_docs', 
        doc_ids=docs_tagged[i],
        metadata_selection='')
    api_response = api_instance.post_doc_info(payload)
    key = api_response.result[0].attribute['filename']
    d[key].append(entity)
d = dict(d)
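
The regrouping step can be exercised without calling the API. Below is a minimal sketch using made-up docids and filenames purely for illustration; `group_entities_by_file` is a hypothetical helper, and `filename_of` stands in for the Document Info lookup. It also caches lookups, so each docid would be resolved only once.

```python
from collections import defaultdict


def group_entities_by_file(docs_tagged, entities_tagged, filename_of):
    """Map each filename to the sorted, de-duplicated entities tagged in it.

    filename_of: callable docid -> filename (a stand-in for post_doc_info).
    """
    grouped = defaultdict(set)
    cache = {}  # resolve each docid only once
    for docid, entity in zip(docs_tagged, entities_tagged):
        if docid not in cache:
            cache[docid] = filename_of(docid)
        grouped[cache[docid]].add(entity)
    return {name: sorted(ents) for name, ents in grouped.items()}


# Illustrative data only (not real API output).
fake_index = {'1': 'news/a.txt', '2': 'news/b.txt'}
tags = group_entities_by_file(['1', '2', '1'], ['AAPL', 'AAPL', 'GOOG'],
                              fake_index.__getitem__)
print(tags)
```

Using a set per file means a document matched twice for the same entity still carries that entity only once in its metadata.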

Using these tags, we can construct a second dataset enriched with this extra metadata, which is convenient for applications such as signal research and compliance analytics.

We can use the filename to match the raw documents against the documents that have been tagged.

In [None]:
dataset = 'Corporate_docs_2'  # str | Destination dataset where the file will be inserted.

file_iter = []
for root, dirs, files in os.walk(folder):
    for file in files:
        #if Path(file).suffix == '.pdf': # .txt .doc .docx .rtf .html .csv also supported
        
        # We know the filename of the file currently being injected, so we can match it
        # against the table of tagged documents
        filename = os.path.join(root, file)
        tickers = d.get(filename, [])  # d is a plain dict: .get avoids a KeyError for untagged files
        if tickers:  # Only build the new dataset with the documents that have tagged entities
            file_dict = {'filename': filename,
                         'metadata': {'companies': tickers,
                                      'category': 'News'}}
            
            file_iter.append(file_dict)

file_props = nucleus_helper.upload_files(api_instance, dataset, file_iter, processes=4)
for fp in file_props:
    print(fp.filename, '(', fp.size, 'bytes) has been added to dataset', dataset)
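
Matching on filenames only works if both sides spell the path the same way. The sketch below normalises paths before comparing them; it assumes the filename stored in the dataset is the local path used at upload time, which should be checked against your own data. `select_tagged_files` is a hypothetical helper, not part of the Nucleus SDK.

```python
import os


def select_tagged_files(folder, tags, extra_metadata=None):
    """Hypothetical helper: keep only the files under `folder` that have tags.

    tags: dict mapping filename -> list of entity labels.
    """
    # Normalise keys so that './docs/a.txt' and 'docs/a.txt' compare equal.
    normalised = {os.path.normpath(k): v for k, v in tags.items()}
    selected = []
    for root, _dirs, files in os.walk(folder):
        for name in sorted(files):
            path = os.path.normpath(os.path.join(root, name))
            entities = normalised.get(path, [])
            if not entities:
                continue  # skip documents with no tagged entity
            metadata = dict(extra_metadata or {})
            metadata['companies'] = list(entities)
            selected.append({'filename': path, 'metadata': metadata})
    return selected
```

A call such as `select_tagged_files(folder, d, {'category': 'News'})` would produce the same kind of list as the loop above.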

### 3.	Fine Tuning

#### a.	Expanding the list of synonyms for a given entity
**query**: You can refine the dataset tagging by expanding each entry of your list of tickers (or other relevant entities) with as many alternative names as you want.

You can also create a conservative superset list of tickers once, save it, and reuse it for every dataset you want to tag.

Finally, you can do the same with foreign companies. For instance, you could define an entry of your list as ['Nintendo', 'NTDOY', '任天堂株式会社'].

Loop through all distinct entries of that expanded list, pass each one to the query argument in the main code of section 2, and rerun that code:



In [None]:
from collections import defaultdict

entities = [['AAPL', 'Apple', 'iPhone'], ['GOOG', 'Google', 'Alphabet', 'Android'], ['NTDOY', 'Nintendo', '任天堂株式会社']]

docs_tagged = []
entities_tagged = []
for names in entities:
    query = " OR ".join(names)
    payload = nucleus_api.DatasetTaggingModel(dataset='Corporate_docs', 
                                    query=query, 
                                    metadata_selection='', 
                                    time_period='')
    api_response = api_instance.post_dataset_tagging(payload)

    # Same response fields as in the single-entity example of section 2
    for docid in api_response.result.docids:
        docs_tagged.append(docid)
        entities_tagged.append(names[0])  # Retain the first name of an entity as its label

# Regroup the tagged entities per document, so that we have a unique list of documents
# and all the entities tagged in each of them.

# This table will be used to generate an updated dataset with tickers provided as metadata,
# so what we really care about are filenames rather than docids.
d = defaultdict(list)
for i, entity in enumerate(entities_tagged):
    payload = nucleus_api.DocInfo(
        dataset='Corporate_docs', 
        doc_ids=docs_tagged[i],
        metadata_selection='')
    api_response = api_instance.post_doc_info(payload)
    key = api_response.result[0].attribute['filename']
    d[key].append(entity)
d = dict(d)
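
The superset list described in this section can be saved once and reloaded for each dataset to tag. Below is a minimal sketch using JSON; the helper names and any file name you pass in are illustrative assumptions, not part of the Nucleus SDK.

```python
import json


def save_entities(entities, path):
    """Write the entity list to disk as UTF-8 JSON (keeps non-ASCII names readable)."""
    with open(path, 'w', encoding='utf-8') as fh:
        json.dump(entities, fh, ensure_ascii=False, indent=2)


def load_entities(path):
    """Read the entity list back; each entry is a list of alternative names."""
    with open(path, encoding='utf-8') as fh:
        return json.load(fh)


def to_queries(entities):
    """Build one OR-query per entity, as in the tagging loop above."""
    return [' OR '.join(names) for names in entities]
```

Writing with `ensure_ascii=False` keeps names such as 任天堂株式会社 legible in the saved file, which makes the list easier to review and extend by hand.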

Copyright (c) 2019 SumUp Analytics, Inc. All Rights Reserved.

NOTICE: All information contained herein is, and remains the property of SumUp Analytics Inc. and its suppliers, if any. The intellectual and technical concepts contained herein are proprietary to SumUp Analytics Inc. and its suppliers and may be covered by U.S. and Foreign Patents, patents in process, and are protected by trade secret or copyright law.

Dissemination of this information or reproduction of this material is strictly forbidden unless prior written permission is obtained from SumUp Analytics Inc.