<h1><center>  News Tracking & Analysis - Nucleus APIs Use Cases</center></h1>


<h1><center>  SumUp Analytics, Proprietary & Confidential</center></h1>


<h1><center>  Disclaimers and Terms of Service available at www.sumup.ai</center></h1> 

 


## Objective: 
-	Develop a workflow to identify and track selected topics in news and social media, and to analyze those topics in terms of key contributors and key takeaways


## Data:
-	A collection of News Media RSS
-   Social media feeds

    **The Nucleus Datafeed can be leveraged for content from 200 News Media RSS**


## Nucleus APIs used:
-	Dataset creation API
 - 	*api_instance.post_upload_file(file, dataset)*
 - 	*nucleus_helper.import_files(api_instance, dataset, file_iters, processes=1)*

        nucleus_helper.import_files leverages api_instance.post_upload_file with parallel execution to speed up dataset creation


-	Topic Modeling API
 - 	*api_instance.post_topic_api(payload)*


-	Topic Historical Analysis API
 - 	*api_instance.post_topic_historical_analysis_api(payload)*


-	Author Connectivity API
 - 	*api_instance.post_author_connectivity_api(payload)*
 

-	Topic Summary API
 -  *api_instance.post_topic_summary_api(payload)*


## Approach:

### 1.	Dataset Preparation
-	Create a Nucleus dataset containing all relevant documents over a chosen historical period

    

In [None]:
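import os

# api_instance, nucleus_api and nucleus_helper are assumed to be configured in an earlier setup cell

# Leverage your own corpus
print('---- Case 1: you are using your own corpus, coming from a local folder ----')
folder = 'Twitter_feed'
dataset = 'Twitter_feed'  # str | Destination dataset where the file will be inserted.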

# build file iterable from a folder recursively. 
# Each item in the iterable is in the format below:
# {'filename': filename,   # filename to be uploaded. REQUIRED
#  'metadata': {           # metadata for the file. Optional
#      'key1': val1,       # keys can have arbitrary names as long as the names only
#      'key2': val2        # contain alphanumeric characters (0-9|a-z|A-Z) and underscores (_)
#   } 
# }
file_iter = []
for root, dirs, files in os.walk(folder):
    for file in files:
        #if Path(file).suffix == '.pdf': # .txt .doc .docx .rtf .html .csv also supported
        file_dict = {'filename': os.path.join(root, file),
                     'metadata': {'source': 'Tech Crunch',
                                  'author': 'Sarah Moore',
                                  'category': 'Media',
                                  'date': '2019-01-01'}}
        file_iter.append(file_dict)

file_props = nucleus_helper.upload_files(api_instance, dataset, file_iter, processes=4)
for fp in file_props:
    print(fp.filename, '(', fp.size, 'bytes) has been added to dataset', dataset)

    
# Leverage a Nucleus embedded feed    
print('---- Case 2: you are using an embedded datafeed ----')
dataset = 'sumup/rss_feed_ai'# embedded datafeeds in Nucleus.
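
The comment in the upload cell states that metadata keys may contain only alphanumeric characters and underscores. A minimal sketch of a pre-upload check for that rule follows. The helper name `invalid_metadata_keys` and the sample items are hypothetical and not part of the Nucleus SDK; the check only inspects the `file_iter` structure described above.

```python
import re

# Rule taken from the comment in the upload cell: metadata keys may contain
# only alphanumeric characters (0-9, a-z, A-Z) and underscores.
_VALID_KEY = re.compile(r'[0-9A-Za-z_]+')


def invalid_metadata_keys(file_iter):
    """Return a sorted list of metadata keys that break the naming rule.

    Hypothetical helper for illustration; it is not part of nucleus_helper.
    """
    bad = set()
    for item in file_iter:
        # 'metadata' is optional, so fall back to an empty dict
        for key in item.get('metadata', {}):
            if not _VALID_KEY.fullmatch(key):
                bad.add(key)
    return sorted(bad)


# Illustrative items only; 'pub-date' contains a hyphen and is flagged
sample = [
    {'filename': 'a.txt', 'metadata': {'source': 'Tech Crunch', 'pub-date': '2019-01-01'}},
    {'filename': 'b.txt'},
]
print(invalid_metadata_keys(sample))
```

Running such a check before the upload call makes it possible to fix offending keys locally rather than discovering them after a partial upload.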

-	For a given date in that period, retain only the subset of documents published during a chosen lookback period

**This can be done directly in the APIs that perform content analysis, through their period_start and period_end parameters (see below)**
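
As a rough illustration of the lookback idea, the sketch below derives a pair of period strings from a reference date and a lookback length in days. The helper name `lookback_window` is hypothetical; the only thing taken from this notebook is the `YYYY-MM-DD HH:MM:SS` layout used by the period_start and period_end values in the cells that follow.

```python
from datetime import datetime, timedelta

# Same layout as the period strings used in this notebook
TIME_FORMAT = '%Y-%m-%d %H:%M:%S'


def lookback_window(as_of, lookback_days):
    """Return (period_start, period_end) strings for a window ending at as_of.

    Hypothetical helper for illustration; it is not part of the Nucleus SDK.
    """
    end = datetime.strptime(as_of, TIME_FORMAT)
    start = end - timedelta(days=lookback_days)
    return start.strftime(TIME_FORMAT), end.strftime(TIME_FORMAT)


period_start, period_end = lookback_window('2019-01-01 00:00:00', 31)
print(period_start, '->', period_end)
```

The two strings can then be supplied wherever a payload in this notebook takes period_start and period_end.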



### 2.	Content Analysis & Tracking
-	Identify and Extract key topics over a recent period of your corpus 


-	Track how those key topics have evolved up until now: relevance, perception 

In [None]:
import numpy as np

print('------------ Get topics + historical analysis ----------------')

payload = nucleus_api.TopicHistoryModel(dataset='Twitter_feed', 
                                    update_period='d', 
                                    query='',
                                    num_topics=20, 
                                    num_keywords=8,
                                    inc_step=1,
                                    period_start="2018-12-01 00:00:00",
                                    period_end="2019-01-01 00:00:00")
api_response = api_instance.post_topic_historical_analysis_api(payload)

print('Plotting historical metrics data...')
historical_metrics = []
for res in api_response.result:
    # construct a list of historical metrics dictionaries for charting
    historical_metrics.append({
        'topic'    : res.topic,
        'time_stamps' : np.array(res.time_stamps),
        'strength' : np.array(res.strengths, dtype=np.float32),
        'consensus': np.array(res.consensuses, dtype=np.float32), 
        'sentiment': np.array(res.sentiments, dtype=np.float32)})

selected_topics = range(len(historical_metrics)) 
nucleus_helper.topic_charts_historical(historical_metrics, selected_topics, True)
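
Besides charting, one simple way to spot topics that stand out is to rank them by how much their strength changed over the period. The sketch below is one possible heuristic, not a Nucleus feature: the function `rank_by_strength_change` is hypothetical, it assumes dictionaries shaped like the `historical_metrics` entries built above, and the demo numbers are made up.

```python
def rank_by_strength_change(historical_metrics):
    """Sort topics by last-minus-first strength, largest increase first.

    Hypothetical helper: expects dicts with 'topic' and 'strength' entries,
    as in the historical_metrics list built in the cell above.
    """
    changes = []
    for metrics in historical_metrics:
        strength = metrics['strength']
        if len(strength) < 2:
            continue  # a change needs at least two observations
        change = float(strength[-1]) - float(strength[0])
        changes.append((metrics['topic'], change))
    return sorted(changes, key=lambda pair: pair[1], reverse=True)


# Made-up numbers for illustration only; real values come from the API response
demo = [
    {'topic': 'topic A', 'strength': [0.10, 0.12, 0.30]},
    {'topic': 'topic B', 'strength': [0.40, 0.35, 0.25]},
    {'topic': 'topic C', 'strength': [0.20]},
]
for topic, change in rank_by_strength_change(demo):
    print(f'{topic}: {change:+.2f}')
```

The same ranking could be applied to the consensus or sentiment series by changing the dictionary key.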


-   For a few topics standing out as relevant to you, pull out key recent takeaways (summaries, best sources)

In [None]:
import datetime

print('------------- Get the summaries of recent topics in your feed --------------')

payload = nucleus_api.TopicSummaryModel(dataset='Twitter_feed',
                            query='',
                            num_topics=20,
                            num_keywords=8,
                            period_start="2018-12-31 00:00:00",
                            period_end="2019-01-01 00:00:00")
api_response = api_instance.post_topic_summary_api(payload)

# enumerate provides the topic index i used in the printout
for i, res in enumerate(api_response.result):
    print('Topic', i, 'summary:')
    print('    Keywords:', res.topic)
    for doc in res.summary:
        print(doc)
        print('    Document ID:', doc.sourceid)
        print('        Title:', doc.title)
        print('        Sentences:', doc.sentences)
        print('        Author:', doc.attribute['author'])
        print('        Time:', datetime.datetime.fromtimestamp(float(doc.attribute['time'])))


-   You can drill down on influencers or interesting emerging contributors with the author connectivity analysis, which identifies who is most similar to a given person based on the topics they participate in and the nature of their participation

In [None]:
print('----------------- Get author connectivity -------------------')

payload = nucleus_api.AuthorConnection(dataset='Twitter_feed', 
                                        target_author='Yann LeCun', 
                                        query='',
                                        period_start="2018-12-31 00:00:00",
                                        period_end="2019-01-01 00:00:00")
api_response = api_instance.post_author_connectivity_api(payload)    

print('Mainstream connections:')
for mc in api_response.result.mainstream_connections:
    print('    Topic:', mc.keywords)
    print('    Authors:', " ".join(str(x) for x in mc.authors))
    
print('Niche connections:')
for nc in api_response.result.niche_connections:
    print('    Topic:', nc.keywords)
    print('    Authors:', " ".join(str(x) for x in nc.authors))  

-	Further down, we discuss how to refine the content analysis by leveraging the different parameters available to the user

### 3.	Fine Tuning

#### a.	Tailoring the topics
-	Check whether the content analysis would benefit from excluding topics that are not differentiating. To do so, pass the custom_stop_words parameter to the Topic Historical Analysis API


-	Identify and Extract key topics on your corpus and print their keywords



In [None]:
print('------------- Get the recent topics in your feed --------------')

payload = nucleus_api.Topics(dataset='Twitter_feed',                         
                            query='',                       
                            num_topics=20, 
                            num_keywords=8,
                            period_start="2018-12-31 00:00:00",
                            period_end="2019-01-01 00:00:00")
api_response = api_instance.post_topic_api(payload)        
    
for i, res in enumerate(api_response.result.topics):
    print('Topic', i, ' keywords: ', res.keywords)    
    print('---------------')

You can then tailor the content analysis by creating a custom_stop_words variable. Initialize the variable as follows, for instance, and pass it in the payload of the main code of section 2: 

In [None]:
custom_stop_words = ["supervised learning", "training"] # str | List of stop words. (optional)
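
One way to shortlist stop-word candidates is to look for keywords that recur across many of the extracted topics, since those rarely differentiate one topic from another. The sketch below illustrates that heuristic. The function `stop_word_candidates` is hypothetical, and it assumes each topic's keywords arrive as a single semicolon-separated string, which should be checked against the actual API response; the demo strings are made up.

```python
from collections import Counter


def stop_word_candidates(topic_keywords, min_topics=3, sep=';'):
    """Return keywords present in at least min_topics topics, most frequent first.

    Hypothetical helper. Assumes each element of topic_keywords is one
    separator-delimited string of keywords for a single topic.
    """
    counts = Counter()
    for keywords in topic_keywords:
        # Count each keyword at most once per topic
        unique = {kw.strip().lower() for kw in keywords.split(sep) if kw.strip()}
        counts.update(unique)
    frequent = [kw for kw, n in counts.items() if n >= min_topics]
    # Sort by descending topic count, then alphabetically for a stable order
    return sorted(frequent, key=lambda kw: (-counts[kw], kw))


# Made-up keyword strings for illustration only
demo_topics = [
    'training;deep learning;gpu',
    'training;language model;text',
    'training;robotics;gpu',
]
print(stop_word_candidates(demo_topics, min_topics=2))
```

The resulting list is only a starting point; a keyword that recurs across topics may still be meaningful in your domain, so review it before assigning it to custom_stop_words.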

#### b.	Focusing the content analysis on certain subjects
To focus the content analysis on particular subjects, for instance deep learning, substitute the query variable in the main code of section 2 with:

In [None]:
query = '(deep-learning OR LSTM OR RNN OR Neural network)' # str | Fulltext query, using mysql MATCH boolean query format. Example: "(word1 OR word2) AND (word3 OR word4)" (optional)
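
When the list of search terms grows, assembling the query string by hand becomes error-prone. The sketch below builds a string in the "(word1 OR word2) AND (word3 OR word4)" shape quoted in the comment above. The helpers `or_group` and `build_query` are hypothetical, and how the service parses multi-word terms such as "Neural network" is not covered here.

```python
def or_group(terms):
    """Join terms into one parenthesised OR group."""
    return '(' + ' OR '.join(terms) + ')'


def build_query(*groups):
    """AND together one OR group per non-empty list of terms.

    Hypothetical helper for illustration; it only formats a string.
    """
    return ' AND '.join(or_group(group) for group in groups if group)


query = build_query(['deep-learning', 'LSTM', 'RNN', 'Neural network'])
print(query)

# Two groups reproduce the example format from the comment above
print(build_query(['word1', 'word2'], ['word3', 'word4']))
```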

#### c.	Exploring the impact of the type of documents, the lookback period, the number of topics being extracted
**period_start / period_end**: You can perform content analysis over any lookback period you want, with granularities ranging from intraday to monthly. Depending on your objectives, this flexibility lets you slice the data over the most relevant time horizon.

**Document types**: You can investigate how topics and their evolution over time change depending on the types of sources contributing to your content. Whether the distinction is between sources in different languages, or between contributors from academia, the private sector and independent individuals, you can slice the corpus along it as long as that information is available, thanks to the metadata provided when the dataset was constructed. To do so, rerun the main code of section 2 on a subset of the whole corpus: create a variable metadata_selection and pass it in the payload. This works with your own documents or the Central Bank feed; the News Media RSS feed has no metadata that can be selected:


In [None]:
metadata_selection = {"category": "Academia"}   # str | json object of {"metadata_field":["selected_values"]} (optional)
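
The fine-tuning options of this section can be combined in a single payload. The sketch below gathers them into a dictionary of keyword arguments without calling the SDK. The function `tuned_payload_kwargs` is hypothetical; the parameter names come from this notebook, but whether a given payload model accepts all of them is an assumption to verify against the SDK reference.

```python
def tuned_payload_kwargs(dataset, period_start, period_end, query='',
                         custom_stop_words=None, metadata_selection=None):
    """Collect the fine-tuning options into keyword arguments for a payload model.

    Hypothetical helper for illustration; it builds a plain dict only.
    """
    kwargs = {
        'dataset': dataset,
        'query': query,
        'period_start': period_start,
        'period_end': period_end,
    }
    # Optional settings are left out unless they are actually set
    if custom_stop_words:
        kwargs['custom_stop_words'] = list(custom_stop_words)
    if metadata_selection:
        kwargs['metadata_selection'] = dict(metadata_selection)
    return kwargs


kwargs = tuned_payload_kwargs(
    dataset='Twitter_feed',
    period_start='2018-12-01 00:00:00',
    period_end='2019-01-01 00:00:00',
    query='(deep-learning OR LSTM OR RNN OR Neural network)',
    custom_stop_words=['supervised learning', 'training'],
    metadata_selection={'category': 'Academia'},
)
print(kwargs)

# In the notebook this could then be unpacked into a payload, for example
# (assuming the model accepts these keyword arguments):
# payload = nucleus_api.TopicHistoryModel(update_period='d', num_topics=20,
#                                         num_keywords=8, inc_step=1, **kwargs)
```

Keeping the options in one place makes it easier to rerun the section 2 analysis with and without each setting and compare the resulting topics.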

### 4.	Next Steps
-	Possible extension: build an internal BI report pulling out top topics and key take-aways


-	Possible extension: compare internal discussions from your domain-expert teams with external discussions to support competitive landscape intelligence


Copyright (c) 2019 SumUp Analytics, Inc. All Rights Reserved.

NOTICE: All information contained herein is, and remains the property of SumUp Analytics Inc. and its suppliers, if any. The intellectual and technical concepts contained herein are proprietary to SumUp Analytics Inc. and its suppliers and may be covered by U.S. and Foreign Patents, patents in process, and are protected by trade secret or copyright law.

Dissemination of this information or reproduction of this material is strictly forbidden unless prior written permission is obtained from SumUp Analytics Inc.