<h1><center>  Document Summarization - Nucleus APIs Use Cases</center></h1>


<h1><center>  SumUp Analytics, Proprietary & Confidential</center></h1>


<h1><center>  Disclaimers and Terms of Service available at www.sumup.ai</center></h1>


 


## Objective: 
-	Develop a pipeline to customize and fine-tune document summaries


## Data:
-	Any collection of documents, ideally from the same field and optionally categorized further, for example by document type

    **The Nucleus Datafeed can be leveraged for all content from major Central Banks**


## Nucleus APIs used:
-	Dataset creation API
 - 	*api_instance.post_upload_file(file, dataset)*
 - 	*nucleus_helper.import_files(api_instance, dataset, file_iters, processes=1)*

        nucleus_helper.import_files leverages api_instance.post_upload_file with parallel execution to speed up the dataset creation


-	Topic Modeling API
 - 	*api_instance.post_topic_api(payload)*


-	Document Summary API
 - 	*api_instance.post_doc_summary_api(payload)*
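
To illustrate the idea behind the parallel import helper listed above, here is a minimal sketch of fanning a per-file upload call out over a pool of workers. It is not the actual `nucleus_helper` implementation: `fake_upload` is a hypothetical stand-in for `api_instance.post_upload_file`, `upload_in_parallel` is an illustrative name, and a standard-library thread pool is assumed in place of whatever mechanism the helper uses internally.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_upload(file_dict, dataset):
    """Hypothetical stand-in for api_instance.post_upload_file(file, dataset)."""
    return {'filename': file_dict['filename'], 'dataset': dataset}


def upload_in_parallel(upload_fn, dataset, file_iter, workers=4):
    """Apply upload_fn to every item of file_iter using a pool of worker threads.

    Results are returned in the same order as the input items.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda f: upload_fn(f, dataset), file_iter))


files = [{'filename': 'a.pdf'}, {'filename': 'b.pdf'}, {'filename': 'c.pdf'}]
results = upload_in_parallel(fake_upload, 'Corporate_docs', files, workers=2)
for r in results:
    print(r['filename'], 'has been added to dataset', r['dataset'])
```

Because each upload is independent of the others, the work splits cleanly across workers, which is why a parallel helper can shorten dataset creation for large corpora.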



## Approach:

### 1.	Dataset Preparation
-	Create a Nucleus dataset containing all relevant documents
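
Before uploading, it can help to build the file list with a small reusable function that also checks the metadata keys. The sketch below is one possible way to do this: `check_metadata_keys` and `build_file_iter` are hypothetical helper names, not part of the Nucleus SDK, and the case-insensitive suffix match is an assumption made for convenience.

```python
import os
import re
from pathlib import Path

# Metadata keys may only contain alphanumeric characters and underscores.
VALID_KEY = re.compile(r'[0-9A-Za-z_]+')


def check_metadata_keys(metadata):
    """Raise ValueError if any metadata key contains a disallowed character."""
    bad = [key for key in metadata if not VALID_KEY.fullmatch(key)]
    if bad:
        raise ValueError('Invalid metadata keys: {}'.format(bad))


def build_file_iter(folder, metadata, suffixes=('.pdf',)):
    """Recursively collect files under folder into the upload format.

    Each item is {'filename': path, 'metadata': {...}}. Suffixes are
    compared in lower case (an assumption of this sketch).
    """
    check_metadata_keys(metadata)
    file_iter = []
    for root, _dirs, files in os.walk(folder):
        for name in sorted(files):
            if Path(name).suffix.lower() in suffixes:
                file_iter.append({'filename': os.path.join(root, name),
                                  'metadata': dict(metadata)})
    return file_iter


file_iter = build_file_iter('Corporate_documents',
                            {'company': 'Apple',
                             'category': 'Press Release',
                             'date': '2019-01-01'})
print(len(file_iter), 'files found')
```

The resulting list has the same shape as the file iterable passed to the upload helper, so it can be handed over unchanged.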

    

In [None]:
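import os
from pathlib import Path

# Assumes nucleus_api and nucleus_helper are imported and api_instance is already configured.
print('---- Case 1: you are using your own corpus, coming from a local folder ----')
folder = 'Corporate_documents'
dataset = 'Corporate_docs'  # str | Destination dataset where the file will be inserted.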

# build file iterable from a folder recursively. 
# Each item in the iterable is in the format below:
# {'filename': filename,   # filename to be uploaded. REQUIRED
#  'metadata': {           # metadata for the file. Optional
#      'key1': val1,       # keys can have arbitrary names as long as the names only
#      'key2': val2        # contain alphanumeric (0-9|a-z|A-Z) and underscore (_)
#   } 
# }
file_iter = []
for root, dirs, files in os.walk(folder):
    for file in files:
        if Path(file).suffix == '.pdf': # .txt .doc .docx .rtf .html .csv also supported
            file_dict = {'filename': os.path.join(root, file),
                         'metadata': {'company': 'Apple',
                                      'category': 'Press Release',
                                      'date': '2019-01-01'}}
            file_iter.append(file_dict)

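# Upload the files in parallel (4 processes).
# Note: the API list above refers to this helper as import_files;
# use the name exposed by your installed nucleus_helper.
file_props = nucleus_helper.upload_files(api_instance, dataset, file_iter, processes=4)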
for fp in file_props:
    print(fp.filename, '(', fp.size, 'bytes) has been added to dataset', dataset)

    
    
print('---- Case 2: you are using an embedded datafeed ----')
dataset = 'sumup/central_banks_chinese'  # embedded datafeed in Nucleus
metadata_selection = {'bank': 'people_bank_of_china',
                      'document_category': ('speech', 'press release', 'publication')}


### 2.	Document Summarization
-	Most input parameters to document summarization control the size of the summary and whether sentences considered too short or too long are included, so they mainly affect the user experience


-	The one parameter that lets you fine-tune which type of content is considered relevant for summarization is the list of custom stop words 


-	Further down, we discuss how to construct a customized stop word list to refine document summaries
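
Since several of these parameters interact, one option is to gather them in a single place and sanity-check them before building the payload. The sketch below assumes the parameter names and defaults documented in the code cell of this section; `build_summary_kwargs` is a hypothetical helper, and the validation rules are assumptions of this sketch rather than constraints enforced by the API.

```python
def build_summary_kwargs(dataset, doc_title, custom_stop_words=None,
                         summary_length=6, context_amount=0,
                         short_sentence_length=4, long_sentence_length=40):
    """Return keyword arguments for a document summary payload after basic checks."""
    # The checks below are assumptions of this sketch, not documented API rules.
    if summary_length < 1:
        raise ValueError('summary_length must be at least 1')
    if context_amount < 0:
        raise ValueError('context_amount cannot be negative')
    if short_sentence_length >= long_sentence_length:
        raise ValueError('short_sentence_length must be below long_sentence_length')
    return {
        'dataset': dataset,
        'doc_title': doc_title,
        'custom_stop_words': list(custom_stop_words or []),
        'summary_length': summary_length,
        'context_amount': context_amount,
        'short_sentence_length': short_sentence_length,
        'long_sentence_length': long_sentence_length,
    }


kwargs = build_summary_kwargs('Corporate_docs', 'my_title',
                              custom_stop_words=['decree', 'motion'])
print(kwargs)
# With the SDK available, the payload could then be built as:
# payload = nucleus_api.DocumentSummaryModel(**kwargs)
```

Keeping the arguments in a dictionary makes it easy to rerun the summary with a different stop word list while leaving the length settings untouched.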



In [None]:
print('---------------- Get doc summaries ------------------------')
# These are all possible input arguments to the summarization API
custom_stop_words = ["decree","motion"] # List of stop words. (optional)
summary_length = 6 # int | The maximum number of bullet points a user wants to see in the document summary. (optional) (default to 6)
context_amount = 0 # int | The number of sentences surrounding key summary sentences in the documents that they come from. (optional) (default to 0)
short_sentence_length = 0 # int | The sentence length below which a sentence is excluded from summarization (optional) (default to 4)
long_sentence_length = 40 # int | The sentence length beyond which a sentence is excluded from summarization (optional) (default to 40)

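# Pass every argument defined above so that the settings actually take effect
payload = nucleus_api.DocumentSummaryModel(dataset='Corporate_docs', 
                                        doc_title='my_title', 
                                        custom_stop_words=custom_stop_words,
                                        summary_length=summary_length,
                                        context_amount=context_amount,
                                        short_sentence_length=short_sentence_length,
                                        long_sentence_length=long_sentence_length)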
api_response = api_instance.post_doc_summary_api(payload)

print('Summary for', api_response.result.doc_title)
for sent in api_response.result.summary.sentences:
    print('    *', sent)


### 3.	Fine Tuning

#### a.	Extracting topics found across documents of your corpus
-	See whether your summaries can be tailored by excluding certain topics that carry no useful information for your end user or your application. This is achieved with the custom_stop_words parameter of the Doc Summary API


-	Identify and extract key topics from a subset of documents within your corpus that share similar attributes (the document type, for instance), and print the keywords of these topics



In [None]:
print('------------- Get list of topics from dataset --------------')

# metadata_selection must reference metadata fields that exist in the chosen dataset
payload = nucleus_api.Topics(dataset='Corporate_docs',
                            query='',
                            num_topics=8,
                            num_keywords=8,
                            metadata_selection=metadata_selection)
api_response = api_instance.post_topic_api(payload)        
    
for i, res in enumerate(api_response.result.topics):
    print('Topic', i, ' keywords: ', res.keywords)    
    print('---------------')

Using your domain expertise, client input, or advisor input, you can determine whether some of those topics or keywords are not distinctive enough to contribute to document summaries. 
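
One way to shortlist candidates for that review is to look for keywords that recur across many topics, since a word present everywhere rarely distinguishes one topic from another. The sketch below is only a heuristic, not part of the Nucleus API: `shortlist_stop_words` is a hypothetical helper, the keyword lists are made-up sample data, and it assumes each topic's keywords are available as a list of strings (convert them first if the API returns another form).

```python
from collections import Counter


def shortlist_stop_words(topic_keywords, min_topics=2):
    """Return keywords that appear in at least min_topics topics, most frequent first."""
    counts = Counter()
    for keywords in topic_keywords:
        # Count each keyword once per topic, ignoring case.
        counts.update({kw.lower() for kw in keywords})
    # Sort by descending topic count, then alphabetically for stable output.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [kw for kw, n in ranked if n >= min_topics]


# Hypothetical keyword lists, standing in for the keywords of each topic.
topic_keywords = [
    ['decree', 'motion', 'revenue'],
    ['decree', 'earnings', 'motion'],
    ['decree', 'guidance', 'outlook'],
]
candidates = shortlist_stop_words(topic_keywords)
print(candidates)  # ['decree', 'motion']
```

The shortlist is a starting point only; the final decision on which words to exclude still rests on domain judgment.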

You can then tailor the document summaries by creating a custom_stop_words variable that contains those words. For instance, initialize the variable as follows and pass it in the payload of the code in section 2: 

In [1]:
custom_stop_words = ["decree","motion"] # str | List of stop words. (optional)

#### b.	Isolating specific subsets of documents within your corpus
**Document types**: You can refine the extraction of topics and isolation of non-information-bearing topics by leveraging the metadata selector provided during the construction of the dataset, to get any level of granularity you are interested in. 

Rerun the topic extraction code from section 3.a on a subset of the whole corpus. Create a variable metadata_selection and pass it in the payload:
 

In [None]:
# If you created a dataset where one metadata field is the category of the document,
# and one possible value for this category is 'speech',
# you could focus the topic analysis and the creation of a customized stop word list on all speech documents
# within your corpus, and reuse that list later on in production
metadata_selection = {"document_category": "speech"}   # str | json object of {"metadata_field":["selected_values"]} (optional)

#### c. Creating custom stopword lists on certain themes within your corpus
**query**: You can refine the extraction of topics and isolation of non-information-bearing topics by leveraging the query variable of the Topic API.
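
When a theme is described by several alternative terms, the query string can be assembled programmatically instead of typed by hand. The sketch below only builds strings in the `(word1 OR word2) AND (word3 OR word4)` shape; `any_of` and `all_of` are hypothetical helper names, and how the API interprets multi-word terms is not covered here.

```python
def any_of(terms):
    """Join terms with OR and wrap them in parentheses."""
    return '(' + ' OR '.join(terms) + ')'


def all_of(groups):
    """Join OR-groups with AND, so that every group must be matched."""
    return ' AND '.join(any_of(group) for group in groups)


query = any_of(['veto rights', 'jury decision', 'verdict'])
print(query)  # (veto rights OR jury decision OR verdict)

combined = all_of([['word1', 'word2'], ['word3', 'word4']])
print(combined)  # (word1 OR word2) AND (word3 OR word4)
```

Building queries this way keeps the term lists for each theme in plain Python lists, which are easier to review and extend than long query strings.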

Rerun the topic extraction code from section 3.a on the content from your corpus that mentions a specific theme. Create a variable query and pass it in the payload:

In [None]:
query = '(veto rights OR jury decision OR verdict)' # str | Full-text query, using MySQL MATCH boolean query format. Example: "(word1 OR word2) AND (word3 OR word4)" (optional)

Copyright (c) 2018 SumUp Analytics, Inc. All Rights Reserved.

NOTICE: All information contained herein is, and remains the property of SumUp Analytics Inc. and its suppliers, if any. The intellectual and technical concepts contained herein are proprietary to SumUp Analytics Inc. and its suppliers and may be covered by U.S. and Foreign Patents, patents in process, and are protected by trade secret or copyright law.

Dissemination of this information or reproduction of this material is strictly forbidden unless prior written permission is obtained from SumUp Analytics Inc.