<h1><center>  Constructing a Sentiment Dictionary - Nucleus APIs Use Cases</center></h1>


<h1><center>  SumUp Analytics, Proprietary & Confidential</center></h1>


<h1><center>  Disclaimers and Terms of Service available at www.sumup.ai</center></h1>


 


## Objective: 
-	Develop a pipeline to create a custom sentiment dictionary with contrast analysis
  - Facilitate data labeling for sentiment modeling
  - Define a programmatic approach to creating sentiment dictionaries on a corpus of the user's choice

**SumUp contrast analysis works on the premise of two distinct categories of documents within a corpus, defined by the user based on metadata or content**

## Data:
-	Any collection of documents of which at least a subset is labeled with sentiment categories such as POSITIVE / NEUTRAL / NEGATIVE


## Nucleus APIs:
-	Dataset creation API
 - 	*api_instance.post_upload_file(file, dataset)*
 - 	*nucleus_helper.upload_files(api_instance, dataset, file_iters, processes=1)*

        nucleus_helper.upload_files calls api_instance.post_upload_file in parallel to speed up dataset creation


-	Topic Modeling API
 - 	*api_instance.post_topic_api(payload)*


-	Contrasted Topic Modeling API
 - 	*api_instance.post_topic_contrast_api(payload)*
 

-	Documents Classification API
 - 	*api_instance.post_doc_classify_api(payload)*

## Approach:

### 1.	Training, Validation, Testing Dataset Preparation
-	Create a Nucleus dataset containing all relevant documents


-   The input documents must be labeled
    - Either these documents are stored in a CSV or JSON and one column of the data corresponds to the sentiment label
    - Or you need to specify the sentiment label as an extra metadata field when you construct your dataset in Nucleus
    

In [None]:
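import os

# Assumes nucleus_api, nucleus_helper and a configured api_instance are already available in this session
print('---- Train / Validate / Test dataset ----')
folder = 'Sellside_research'
dataset = 'Sellside_research'  # str | Destination dataset where the file will be inserted.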

# build file iterable from a folder recursively. 
# Each item in the iterable is in the format below:
# {'filename': filename,   # filename to be uploaded. REQUIRED
#  'metadata': {           # metadata for the file. Optional
#      'key1': val1,       # keys can have arbitrary names as long as the names only
#      'key2': val2        # contain alphanumeric (0-9|a-z|A-Z) and underscore (_)
#   } 
# }

# If the documents are not in a CSV or JSON, then you must specify sentiment labels in the file_iter object
# as an extra metadata field.

# If you are reading from a file where the sentiment label is already provided, 
# no need to pass the 'metadata' in the file_dict

file_iter = []
for root, dirs, files in os.walk(folder):
    for file in files:
        file_dict = {'filename': os.path.join(root, file),
                     'metadata': {'sentiment': 'positive' # Here build some logic to decide how to assign POS / NEU / NEG
                                }}
        file_iter.append(file_dict)

file_props = nucleus_helper.upload_files(api_instance, dataset, file_iter, processes=4)
for fp in file_props:
    print(fp.filename, '(', fp.size, 'bytes) has been added to dataset', dataset)
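The upload cell above leaves the label assignment as a placeholder. One possible way to fill it in, sketched here under the assumption that the labels live in a separate two-column CSV with hypothetical headers `filename` and `sentiment`, is to load that CSV into a lookup table and build the file iterable from it. The helper names `load_label_map` and `build_labeled_file_iter` are illustrative and not part of the Nucleus SDK.

```python
import csv
import os


def load_label_map(csv_path):
    """Read a CSV with 'filename' and 'sentiment' columns into a dict (hypothetical layout)."""
    with open(csv_path, newline='', encoding='utf-8') as handle:
        reader = csv.DictReader(handle)
        return {row['filename']: row['sentiment'].strip().lower() for row in reader}


def build_labeled_file_iter(folder, label_map,
                            allowed=('positive', 'neutral', 'negative')):
    """Build the file iterable, skipping files with a missing or unexpected label."""
    file_iter = []
    for root, dirs, files in os.walk(folder):
        for name in sorted(files):
            label = label_map.get(name)
            if label not in allowed:
                continue  # leave such files for the unlabeled dataset
            file_iter.append({'filename': os.path.join(root, name),
                              'metadata': {'sentiment': label}})
    return file_iter


# Example usage (paths are placeholders):
# label_map = load_label_map('labels.csv')
# file_iter = build_labeled_file_iter('Sellside_research', label_map)
```

The resulting `file_iter` has the same shape as the one built in the cell above, so it can be passed to the upload helper unchanged.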

### 2.	Unlabeled Dataset Preparation
-	Create a Nucleus dataset containing all relevant documents


-   The input documents are not labeled    

In [None]:
print('---- Dataset to label ----')
folder = 'Sellside_research_unlabeled'         
dataset = 'Sellside_research_unlabeled'# str | Destination dataset where the file will be inserted.

# build file iterable from a folder recursively. 
# Each item in the iterable is in the format below:
# {'filename': filename,   # filename to be uploaded. REQUIRED }

file_iter = []
for root, dirs, files in os.walk(folder):
    for file in files:
        file_dict = {'filename': os.path.join(root, file)}
        file_iter.append(file_dict)

file_props = nucleus_helper.upload_files(api_instance, dataset, file_iter, processes=4)
for fp in file_props:
    print(fp.filename, '(', fp.size, 'bytes) has been added to dataset', dataset)

### 3. Accelerating the Labeling of Data

- In this example, one category of documents has positive sentiment and the other has negative sentiment
- We divide the dataset into training, validation, and testing sets based on creation date (in the cells of this section: 2017 for training, 2018 for validation, the first half of 2019 for testing)
- We extract the contrast topic that separates the two categories on the training set with TopicContrastModel
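The date-based split can be written down explicitly so the three periods are easy to audit. The sketch below mirrors the `period_start` / `period_end` values used in this section and treats each period as half-open (start included, end excluded). That boundary convention is an assumption made for illustration; how the API treats a document dated exactly on `period_end` is not stated here. `SPLITS` and `assign_split` are hypothetical names.

```python
from datetime import date

# Half-open periods [start, end), mirroring the dates used in this notebook
SPLITS = {
    'train': (date(2017, 1, 1), date(2018, 1, 1)),
    'validate': (date(2018, 1, 1), date(2019, 1, 1)),
    'test': (date(2019, 1, 1), date(2019, 7, 1)),
}


def assign_split(created, splits=SPLITS):
    """Return the name of the split a creation date falls into, or None."""
    for name, (start, end) in splits.items():
        if start <= created < end:
            return name
    return None


print(assign_split(date(2018, 3, 15)))
```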

In [None]:
metadata_selection_contrast = {"sentiment": ["positive", "negative"]} # dict | Specifies the two categories of documents to contrast and summarize against each other

query = '' # str | Dataset-language-specific fulltext query, using mysql MATCH boolean query format (optional)
custom_stop_words = [""] # List of stop words. (optional)
excluded_docs = '' # str | List of document IDs that should be excluded from the analysis. Example, ["docid1", "docid2", ..., "docidN"]  (optional)
syntax_variables = False # bool | Specifies whether to take into account syntax aspects of each category of documents to help with contrasting them (optional) (default to False)
num_keywords = 20 # integer | Number of keywords for the contrasted topic that is extracted from the dataset. (optional) (default to 50)
remove_redundancies = False # bool | If True, this option removes quasi-duplicates from the analysis. A quasi-duplicate would have the same NLP representation, but not necessarily the exact same text. (optional) (default False)

payload = nucleus_api.TopicContrastModel(dataset='Sellside_research', 
                                        metadata_selection_contrast=metadata_selection_contrast,
                                        custom_stop_words=custom_stop_words,
                                        num_keywords=num_keywords, # Pass the value set above; otherwise the default applies
                                        period_start='2017-01-01',
                                        period_end='2018-01-01')
api_response = api_instance.post_topic_contrast_api(payload)

print('Contrasted Topic')
print('    Keywords:', api_response.result.keywords)
print('    Keywords Weight:', api_response.result.keywords_weight)

- Measure how well this contrasting topic labels sentiment on the validation dataset
- Section 5 below details how to fine-tune the sentiment labeler

In [None]:
fixed_topics = {"keywords": api_response.result.keywords, "weights": api_response.result.keywords_weight} # dict | The contrasting topic used to separate the two categories of documents. Weights optional
metadata_selection_contrast = {"sentiment": ["positive", "negative"]} # dict | Specifies the two categories of documents to contrast and summarize against each other

query = '' # str | Dataset-language-specific fulltext query, using mysql MATCH boolean query format (optional)
custom_stop_words = [""] # List of stop words. (optional)
excluded_docs = '' # str | List of document IDs that should be excluded from the analysis. Example, ["docid1", "docid2", ..., "docidN"]  (optional)
syntax_variables = False # bool | If True, the classifier will include syntax-related variables on top of content variables (optional) (default to False)
threshold = 0 # float | Threshold value for a document exposure to the contrasting topic, above which the document is assigned to class 1 specified through metadata_selection_contrast. (optional) (default to 0)
remove_redundancies = False # bool | If True, this option removes quasi-duplicates from the analysis. A quasi-duplicate would have the same NLP representation, but not necessarily the exact same text. (optional) (default False)


payload = nucleus_api.DocClassifyModel(dataset="Sellside_research",
                                        fixed_topics=fixed_topics,
                                        metadata_selection_contrast=metadata_selection_contrast,
                                        custom_stop_words=custom_stop_words,
                                        validation_phase=True, # This argument tells the API that data is labeled, so produces perf metrics
                                        period_start='2018-01-01',
                                        period_end='2019-01-01')
api_response = api_instance.post_doc_classify_api(payload)

print('Detailed Results')
print('    Docids:', api_response.result.detailed_results.docids)
print('    Exposure:', api_response.result.detailed_results.exposures)
print('    Estimated Category:', api_response.result.detailed_results.estimated_class)
print('    Actual Category:', api_response.result.detailed_results.true_class)
print('\n')

print('Perf Metrics')
print('    Accuracy:', api_response.result.perf_metrics.hit_rate)
print('    Recall:', api_response.result.perf_metrics.recall)
print('    Precision:', api_response.result.perf_metrics.precision)
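To sanity-check the reported metrics, the same quantities can be recomputed locally from the estimated and actual categories. This is a minimal sketch using the standard definitions of accuracy, precision, and recall for one chosen category. Whether Nucleus computes `hit_rate`, `precision`, and `recall` in exactly this way is an assumption, and `binary_metrics` is a hypothetical helper name.

```python
def binary_metrics(estimated, actual, positive_label):
    """Accuracy, precision, and recall with respect to positive_label."""
    if len(estimated) != len(actual):
        raise ValueError('estimated and actual must have the same length')
    if not actual:
        raise ValueError('no documents to evaluate')
    tp = fp = fn = correct = 0
    for est, act in zip(estimated, actual):
        if est == act:
            correct += 1
        if est == positive_label and act == positive_label:
            tp += 1
        elif est == positive_label:
            fp += 1
        elif act == positive_label:
            fn += 1
    return {
        'accuracy': correct / len(actual),
        # None marks a metric that is undefined for this sample
        'precision': tp / (tp + fp) if (tp + fp) else None,
        'recall': tp / (tp + fn) if (tp + fn) else None,
    }


# Illustrative labels, not real output
estimated = ['positive', 'positive', 'negative', 'negative']
actual = ['positive', 'negative', 'negative', 'positive']
print(binary_metrics(estimated, actual, positive_label='positive'))
```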

- Once you are satisfied with your labeling model, you can apply it to the test data

In [None]:
# Reuse the fixed_topics built from the contrast-topic response in the validation step.
# api_response now holds a classification result, which does not carry keywords or keywords_weight.
metadata_selection_contrast = {"sentiment": ["positive", "negative"]} # dict | Specifies the two categories of documents to contrast and summarize against each other

query = '' # str | Dataset-language-specific fulltext query, using mysql MATCH boolean query format (optional)
custom_stop_words = [""] # List of stop words. (optional)
excluded_docs = '' # str | List of document IDs that should be excluded from the analysis. Example, ["docid1", "docid2", ..., "docidN"]  (optional)
syntax_variables = False # bool | If True, the classifier will include syntax-related variables on top of content variables (optional) (default to False)
threshold = 0 # float | Threshold value for a document exposure to the contrasting topic, above which the document is assigned to class 1 specified through metadata_selection_contrast. (optional) (default to 0)
remove_redundancies = False # bool | If True, this option removes quasi-duplicates from the analysis. A quasi-duplicate would have the same NLP representation, but not necessarily the exact same text. (optional) (default False)


payload = nucleus_api.DocClassifyModel(dataset="Sellside_research",
                                        fixed_topics=fixed_topics,
                                        metadata_selection_contrast=metadata_selection_contrast,
                                        custom_stop_words=custom_stop_words,
                                        validation_phase=True, # This argument tells the API that data is labeled, so produces perf metrics
                                        period_start='2019-01-01',
                                        period_end='2019-07-01')
api_response = api_instance.post_doc_classify_api(payload)

print('Detailed Results')
print('    Docids:', api_response.result.detailed_results.docids)
print('    Exposure:', api_response.result.detailed_results.exposures)
print('    Estimated Category:', api_response.result.detailed_results.estimated_class)
print('    Actual Category:', api_response.result.detailed_results.true_class)
print('\n')

print('Perf Metrics')
print('    Accuracy:', api_response.result.perf_metrics.hit_rate)
print('    Recall:', api_response.result.perf_metrics.recall)
print('    Precision:', api_response.result.perf_metrics.precision)

- You are now ready to run the above model on the unlabeled dataset
- You will retrieve 'estimated_class' for each document, which completes your dataset labeling
- You can repeat this process for any pair of sentiment labels, and cross-validate the sentiment labels of the unlabeled dataset
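Repeating the process for every pair of labels produces several pairwise answers per document. One common way to reconcile them, sketched here as an assumption rather than a method prescribed by Nucleus, is one-vs-one majority voting: each pairwise run casts a vote and the label with the most votes wins. `LABELS`, `label_pairs`, and `majority_label` are hypothetical names.

```python
from collections import Counter
from itertools import combinations

LABELS = ['positive', 'neutral', 'negative']

# Every unordered pair of labels; each pair corresponds to one contrast run
label_pairs = list(combinations(LABELS, 2))


def majority_label(pairwise_votes):
    """Combine pairwise results for one document.

    pairwise_votes maps a (label_a, label_b) pair to the label that pair's
    classifier assigned. Returns None when there is no clear winner.
    """
    if not pairwise_votes:
        return None
    ranked = Counter(pairwise_votes.values()).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie: flag the document for manual review
    return ranked[0][0]


# Illustrative votes for a single document
votes = {
    ('positive', 'neutral'): 'positive',
    ('positive', 'negative'): 'positive',
    ('neutral', 'negative'): 'neutral',
}
print(label_pairs)
print(majority_label(votes))
```

Documents that end in a tie are good candidates for manual labeling, since the pairwise models disagree on them.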

In [None]:
# Reuse the fixed_topics built from the contrast-topic response in the validation step.
# api_response now holds a classification result, which does not carry keywords or keywords_weight.
metadata_selection_contrast = {"sentiment": ["positive", "negative"]} # dict | Specifies the two categories of documents to contrast and summarize against each other

query = '' # str | Dataset-language-specific fulltext query, using mysql MATCH boolean query format (optional)
custom_stop_words = [""] # List of stop words. (optional)
excluded_docs = '' # str | List of document IDs that should be excluded from the analysis. Example, ["docid1", "docid2", ..., "docidN"]  (optional)
syntax_variables = False # bool | If True, the classifier will include syntax-related variables on top of content variables (optional) (default to False)
threshold = 0 # float | Threshold value for a document exposure to the contrasting topic, above which the document is assigned to class 1 specified through metadata_selection_contrast. (optional) (default to 0)
remove_redundancies = False # bool | If True, this option removes quasi-duplicates from the analysis. A quasi-duplicate would have the same NLP representation, but not necessarily the exact same text. (optional) (default False)


payload = nucleus_api.DocClassifyModel(dataset="Sellside_research_unlabeled",
                                        fixed_topics=fixed_topics,
                                        metadata_selection_contrast=metadata_selection_contrast,
                                        custom_stop_words=custom_stop_words,
                                        validation_phase=False)
api_response = api_instance.post_doc_classify_api(payload)

print('Detailed Results')
print('    Docids:', api_response.result.detailed_results.docids)
print('    Estimated Category:', api_response.result.detailed_results.estimated_class)
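The two printed lists can be turned into a lookup table that is easier to reuse when building the combined dataset. The sketch below assumes `docids` and `estimated_class` are parallel lists, as the printout above suggests. The helper names and the sample values are illustrative only.

```python
from collections import Counter


def labels_by_docid(docids, estimated_classes):
    """Map each document ID to its estimated sentiment category."""
    if len(docids) != len(estimated_classes):
        raise ValueError('docids and estimated_classes must have the same length')
    return dict(zip(docids, estimated_classes))


def class_counts(label_map):
    """Count how many documents were assigned to each category."""
    return Counter(label_map.values())


# Made-up values standing in for the API response fields
docids = ['101', '102', '103']
estimated_class = ['positive', 'negative', 'positive']

label_map = labels_by_docid(docids, estimated_class)
print(label_map)
print(class_counts(label_map))
```

A strongly unbalanced count is worth a second look before the labels are fed into the dictionary-building step.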

### 4. Generating a Sentiment Dictionary

- Use the whole dataset: train/validate/test data + the data that was labeled in the previous step
- Generate topics that best contrast any two labels from above
- You can repeat this process for any pair of sentiment labels, and cross-validate the sentiment labels of each word

In [None]:
print('---- Complete dataset ----')
folder = 'Sellside_research_combined'         
dataset = 'Sellside_research_combined'# str | Destination dataset where the file will be inserted.

# build file iterable from a folder recursively. 
# Each item in the iterable is in the format below:
# {'filename': filename,   # filename to be uploaded. REQUIRED
#  'metadata': {           # metadata for the file. Optional
#      'key1': val1,       # keys can have arbitrary names as long as the names only
#      'key2': val2        # contain alphanumeric (0-9|a-z|A-Z) and underscore (_)
#   } 
# }

# If the documents are not in a CSV or JSON, then you must specify sentiment labels in the file_iter object
# as an extra metadata field.

# If you are reading from a file where the sentiment label is already provided, 
# no need to pass the 'metadata' in the file_dict

file_iter = []
for root, dirs, files in os.walk(folder):
    for file in files:
        file_dict = {'filename': os.path.join(root, file),
                     'metadata': {'sentiment': 'positive' # Here pass in the labels obtained in the previous step
                                }}
        file_iter.append(file_dict)

file_props = nucleus_helper.upload_files(api_instance, dataset, file_iter, processes=4)
for fp in file_props:
    print(fp.filename, '(', fp.size, 'bytes) has been added to dataset', dataset)

In [None]:
metadata_selection_contrast = {"sentiment": ["positive", "negative"]} # dict | Specifies the two categories of documents to contrast and summarize against each other

query = '' # str | Dataset-language-specific fulltext query, using mysql MATCH boolean query format (optional)
custom_stop_words = [""] # List of stop words. (optional)
excluded_docs = '' # str | List of document IDs that should be excluded from the analysis. Example, ["docid1", "docid2", ..., "docidN"]  (optional)
syntax_variables = False # bool | Specifies whether to take into account syntax aspects of each category of documents to help with contrasting them (optional) (default to False)
num_keywords = 20 # integer | Number of keywords for the contrasted topic that is extracted from the dataset. (optional) (default to 50)
remove_redundancies = False # bool | If True, this option removes quasi-duplicates from the analysis. A quasi-duplicate would have the same NLP representation, but not necessarily the exact same text. (optional) (default False)

payload = nucleus_api.TopicContrastModel(dataset='Sellside_research_combined', 
                                        metadata_selection_contrast=metadata_selection_contrast,
                                        num_keywords=num_keywords) # Pass the value set above; otherwise the default applies
api_response = api_instance.post_topic_contrast_api(payload)

print('Contrasted Topic')
print('    Keywords:', api_response.result.keywords)
print('    Keywords Weight:', api_response.result.keywords_weight)
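The keywords and weights printed above are the raw material of the sentiment dictionary. A minimal sketch of pairing them into a word-to-weight mapping follows, assuming the two fields are parallel sequences and that weights may arrive as numbers or numeric strings. How a weight relates to one category or the other is not specified here, so the sketch only orders words by weight. `build_sentiment_dictionary` is a hypothetical helper.

```python
def build_sentiment_dictionary(keywords, weights):
    """Pair each keyword with its weight, ordered from largest to smallest weight."""
    if len(keywords) != len(weights):
        raise ValueError('keywords and weights must have the same length')
    pairs = [(keyword, float(weight)) for keyword, weight in zip(keywords, weights)]
    pairs.sort(key=lambda item: item[1], reverse=True)
    return dict(pairs)  # dicts keep insertion order, so the ranking is preserved


# Made-up values standing in for api_response.result.keywords / keywords_weight
sentiment_dictionary = build_sentiment_dictionary(
    ['upgrade', 'downgrade', 'outperform'],
    ['0.2', '0.5', 0.3],
)
print(sentiment_dictionary)
```

Running this once per pair of labels gives one dictionary per contrast, which can then be compared word by word.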

### 5.	Fine Tuning

#### a.	Excluding certain content from the contrast analysis

-   Exclude irrelevant keywords / topics to tailor your contrast analysis by using the `custom_stop_words` parameter in the Contrast Analysis API


-	Extract the main topics from documents within your corpus with the Topic Modeling API and print the keywords of these topics

In [None]:
print('------------- Get list of topics from dataset --------------')

payload = nucleus_api.Topics(dataset='Sellside_research',                         
                            query='',                       
                            num_topics=8, 
                            num_keywords=8,
                            metadata_selection=metadata_selection_contrast)
api_response = api_instance.post_topic_api(payload)        
    
for i, res in enumerate(api_response.result.topics):
    print('Topic', i, ' keywords: ', res.keywords)    
    print('---------------')

Using your domain expertise or client / advisor input, you can determine if specific topics or keywords are not differentiated enough to contribute to contrast analysis. 

You can then tailor the contrast analysis by creating a `custom_stop_words` variable that contains those words. As shown below, initialize the variable and pass it in the payload of the main code of section 3: 

In [1]:
custom_stop_words = ["disclaimer", "disclosure"] # list | List of stop words. (optional)
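Domain expertise remains the main guide for choosing stop words, but a simple frequency check can suggest candidates. The sketch below flags keywords that show up in a large share of the extracted topics, on the assumption that such words do little to differentiate content. It expects each topic's keywords as a list of strings; if the API returns a delimited string instead, split it first. `stop_word_candidates` and the `min_share` cutoff are hypothetical.

```python
from collections import Counter


def stop_word_candidates(topic_keywords, min_share=0.5):
    """Return keywords that occur in at least min_share of the topics."""
    if not topic_keywords:
        return []
    counts = Counter()
    for keywords in topic_keywords:
        counts.update(set(keywords))  # count each word once per topic
    cutoff = min_share * len(topic_keywords)
    return sorted(word for word, n in counts.items() if n >= cutoff)


# Made-up topics for illustration
topics = [
    ['disclaimer', 'growth'],
    ['disclaimer', 'margin'],
    ['rates', 'disclosure'],
    ['disclaimer', 'disclosure'],
]
print(stop_word_candidates(topics))
```

The output is a list of suggestions to review, not a list to apply blindly.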

#### b. Specifying the metadata_selection_contrast for your contrasted topic

- Contrasting documents from two different entities

In [None]:
metadata_selection_contrast = {"research_analyst": ["MS", "JPM"]}

- Contrasting documents that contain different keywords

In [None]:
metadata_selection_contrast = {"content": "fundamentals"}

#### c. Fine-tuning the contrasting topic

**num_keywords**: The larger num_keywords, the more words are retained in the contrasting topic. Each additional word contributes less to separating the two sentiment categories you work with

**syntax_variables**: If True, certain part-of-speech features are automatically included in the contrasting topic model. This may help when authors have vastly different writing styles, which is frequent in social media data and news and less likely in institutional publications

**threshold**: The minimum exposure to the contrasting topic that a document must have to be assigned to the first category you defined. A perfect model would have a threshold of 0, the default value. In validation, you may find that a different value yields better performance metrics, particularly when the training and validation samples are small or when generic words appear as keywords in the contrasting topic
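The effect of the threshold can be explored offline with the exposures and true categories returned during validation. The sketch below applies the rule described above (exposure above the threshold means the first category) over a few candidate values and keeps the one with the highest accuracy. The helper names, the candidate grid, and the sample numbers are assumptions for illustration.

```python
def classify(exposures, threshold, class_1, class_2):
    """Assign class_1 when the exposure is above the threshold, else class_2."""
    return [class_1 if float(e) > threshold else class_2 for e in exposures]


def best_threshold(exposures, actual, class_1, class_2, candidates):
    """Return (threshold, accuracy) for the candidate with the highest accuracy."""
    if not actual or len(exposures) != len(actual):
        raise ValueError('exposures and actual must be non-empty and of equal length')
    best = None
    for threshold in candidates:
        predicted = classify(exposures, threshold, class_1, class_2)
        accuracy = sum(p == a for p, a in zip(predicted, actual)) / len(actual)
        if best is None or accuracy > best[1]:
            best = (threshold, accuracy)  # the first candidate wins ties
    return best


# Made-up validation values
exposures = [0.4, 0.1, 0.3, -0.2]
actual = ['positive', 'negative', 'positive', 'negative']
print(best_threshold(exposures, actual, 'positive', 'negative', candidates=[0, 0.2]))
```

A threshold tuned this way should be chosen on the validation period only and then confirmed on the test period, otherwise the reported metrics will be optimistic.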

Copyright (c) 2019 SumUp Analytics, Inc. All Rights Reserved.

NOTICE: All information contained herein is, and remains the property of SumUp Analytics Inc. and its suppliers, if any. The intellectual and technical concepts contained herein are proprietary to SumUp Analytics Inc. and its suppliers and may be covered by U.S. and Foreign Patents, patents in process, and are protected by trade secret or copyright law.

Dissemination of this information or reproduction of this material is strictly forbidden unless prior written permission is obtained from SumUp Analytics Inc.