<h1><center>  Low Quality Content Detection - Nucleus APIs Use Cases</center></h1>


<h1><center>  SumUp Analytics, Proprietary & Confidential</center></h1>


<h1><center>  Disclaimers and Terms of Service available at www.sumup.ai</center></h1>


 


## Objective: 
-	Develop a pipeline to detect low-quality content in social media feeds or gaming chatrooms

**In its current version, SumUp contrast analysis works by comparing two categories against each other, where the user defines the two categories.**

## Data:
-	A labeled corpus of posts from a social media or gaming platform
 -     You can have multiple labels in your corpus, but the algorithms handle two labels at a time when learning and predicting
 -     Illustrative labels for low-quality content detection: **"Violence", "Drugs", "Pornographic", "Religiously Sensitive", "Politically Sensitive", "Scam", "Clickbait", "Fake", "All clear"**
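
Because only two labels are handled at a time, a corpus with several labels can be processed as a series of pairwise runs. The sketch below builds one `metadata_selection` dictionary per pair by contrasting each label against "All clear". The helper name `one_vs_baseline` and the one-versus-baseline scheme are illustrative assumptions, not part of the Nucleus API.

```python
# Illustrative labels taken from the list above.
labels = ["Violence", "Drugs", "Pornographic", "Religiously Sensitive",
          "Politically Sensitive", "Scam", "Clickbait", "Fake", "All clear"]


def one_vs_baseline(labels, baseline="All clear"):
    """Build one metadata_selection dict per (label, baseline) pair."""
    return [{"label": [label, baseline]} for label in labels if label != baseline]


selections = one_vs_baseline(labels)
for selection in selections:
    print(selection)
```

Each dictionary has the same shape as the `metadata_selection` used in section 2, so the contrast and classification steps could be repeated once per pair.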



## Nucleus APIs used:
-	Dataset creation API
 - 	*api_instance.post_upload_file(file, dataset)*
 - 	*nucleus_helper.import_files(api_instance, dataset, file_iters, processes=1)*

        nucleus_helper.import_files leverages api_instance.post_upload_file with parallel execution to speed up the dataset creation. The example in section 1 uses nucleus_helper.upload_jsons to upload the rows of a csv file.


-	Topic Modeling API
 - 	*api_instance.post_topic_api(payload)*


-	Contrasted Topic Modeling API
 - 	*api_instance.post_topic_contrast_api(payload)*


-	Documents Classification API
 - 	*api_instance.post_doc_classify_api(payload)*

## Approach:

### 1.	Dataset Preparation
-	Create a Nucleus dataset containing all relevant documents


-   We assume that the data is stored in a csv file. Similar code could be written to ingest data from a database table. The API imposes some requirements on the names of the data and metadata fields used to create a dataset


    - Illustrative template for the data uploaded: ["author", "label", "time", "content", "title"]
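
As a minimal sketch of the database variant mentioned above, the snippet below reads rows from a SQL table and yields one dictionary per row, keyed by the template field names, which is the same shape `csv.DictReader` produces. The table name `posts`, the helper `iter_posts`, and the in-memory SQLite database are hypothetical. Whether `nucleus_helper.upload_jsons` accepts such a generator in place of the csv reader is an assumption to verify against the SDK.

```python
import sqlite3

FIELDS = ["author", "label", "time", "content", "title"]


def iter_posts(connection, table="posts"):
    """Yield one dict per row, keyed by the template field names."""
    # `table` is a trusted, hard-coded name here; never build SQL from user input.
    columns = ", ".join(f'"{field}"' for field in FIELDS)
    cursor = connection.execute(f'SELECT {columns} FROM "{table}"')
    for row in cursor:
        yield dict(zip(FIELDS, row))


# In-memory database with a hypothetical `posts` table, for demonstration only.
connection = sqlite3.connect(":memory:")
connection.execute(
    'CREATE TABLE posts ("author" TEXT, "label" TEXT, "time" TEXT, "content" TEXT, "title" TEXT)'
)
connection.execute(
    "INSERT INTO posts VALUES (?, ?, ?, ?, ?)",
    ("@player1", "All clear", "2018-06-01", "Good game everyone", "Match recap"),
)
rows = list(iter_posts(connection))
print(rows)
```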

    

In [1]:
import csv
import json
import nucleus_api.api.nucleus_api as nucleus_helper
import nucleus_api
from nucleus_api.rest import ApiException

configuration = nucleus_api.Configuration()
configuration.host = 'UPDATE-WITH-API-SERVER-HOSTNAME'
configuration.api_key['x-api-key'] = 'UPDATE-WITH-API-KEY'

# Create API instance
api_instance = nucleus_api.NucleusApi(nucleus_api.ApiClient(configuration))

In [None]:
csv_file = 'Social_media_feed.csv'
dataset = 'Social_media_feed'  # str | Destination dataset where the file will be inserted.

with open(csv_file, encoding='utf-8-sig') as csvfile:
    reader = csv.DictReader(csvfile)
    json_props = nucleus_helper.upload_jsons(api_instance, dataset, reader, processes=4)
    
    total_size = 0
    total_jsons = 0
    for jp in json_props:
        total_size += jp.size
        total_jsons += 1
        
    print(total_jsons, 'JSON records (', total_size, 'bytes) appended to', dataset)


### 2. Contrasted Topic Modeling

-     In this example, the first category contains documents tagged "Violence". The second category contains documents tagged "All clear".
-     We extract one topic that separates those two categories

In [2]:
metadata_selection = {"label": ["Violence", "All clear"]} # dict | The metadata selection defining the two categories of documents to contrast and summarize against each other

query = '' # str | Dataset-language-specific fulltext query, using mysql MATCH boolean query format (optional)
custom_stop_words = [""] # List of stop words. (optional)
excluded_docs = '' # str | List of document IDs that should be excluded from the analysis. Example, ["docid1", "docid2", ..., "docidN"]  (optional)
syntax_variables = True # bool | Specifies whether to take into account syntax aspects of each category of documents to help with contrasting them (optional) (default to False)
compression = 0.002 # float | Parameter controlling the breadth of the contrasted topic. Contained between 0 and 1, the smaller it is, the more contrasting terms will be captured, with decreasing weight. (optional) (default to 0.000002)
remove_redundancies = True # bool | If True, this option removes quasi-duplicates from the analysis. A quasi-duplicate would have the same NLP representation, but not necessarily the exact same text. (optional) (default True)

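# Only the dataset, metadata_selection and period are passed here; the optional variables defined above keep their API defaults.
payload = nucleus_api.TopicContrastModel(dataset='Social_media_feed',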
                                        metadata_selection=metadata_selection,
                                        period_start='2018-01-01',
                                        period_end='2019-01-01')
try:
    api_response = api_instance.post_topic_contrast_api(payload)
    api_ok = True
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])
    api_ok = False

if api_ok:
    print('Contrasted Topic')
    print('    Keywords:', api_response.result.keywords)
    print('    Keywords Weight:', api_response.result.keywords_weight)

    print('In-Sample Perf Metrics')
    print('    Accuracy:', api_response.result.perf_metrics.hit_rate)
    print('    Recall:', api_response.result.perf_metrics.recall)
    print('    Precision:', api_response.result.perf_metrics.precision)

Contrasted Topic
    Keywords: ['trump foreign', 'social media', 'rasmussen_poll realdonaldtrump', 'claims heard', 'donald trump', 'trump campaign', 'omarosa legitimate', 'strzok fbi', 'fake dossier', 'judgejeanine bob', 'director brennan', 'bruce ohr', 'christopher steele', 'fbi criminally', 'unfortunate situation', 'hillary clinton', 'fox news', 'lou dobbs', 'mark levin', 'concerned comey', 'department justice', 'brennan stain', 'department believe', 'collusion obstruction', 'pushback governor', 'heard heard', 'frame donald', 'gregg jarrett', 'cuomo resign', 'nelly time', 'gps fake', 'foreign policy', 'media totally', 'realdonaldtrump approval', 'presidential lowlife', 'governor andrew', 'resign ratings', 'situation decided', 'criminally investigated', 'america standing', 'boosting america', 'policy boosting', 'administration happen', 'trump administration', 'loudly trump', 'speaking loudly', 'republicanconservative voices', 'discriminating republicanconservative', 'totally discrimin

### 3. Documents Classification

This task requires 3 steps:
-     First, extract a contrasted topic on a labeled dataset
-     Second, train the document classifier by providing a labeled dataset. In this step, you can adjust the weight of each keyword from the contrasted topic, remove certain keywords, and even compare the contrasted topic produced by step 1 against topics of your own choosing
-     Third, test the classifier

In the example below, we assume that the contrasted topic has already been obtained. The structure of 'fixed_topics' is exactly what the Contrasted Topic API returns.
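
The keyword adjustments described in the second step can be sketched as plain dictionary manipulation before the topic is sent to the classifier. The helper `adjust_topic` below is hypothetical, it assumes the `{"keywords": [...], "weights": [...]}` structure used in this notebook, and rescaling the remaining weights to sum to 1 is an assumption of this sketch rather than a documented API requirement.

```python
def adjust_topic(fixed_topics, drop=(), reweight=None):
    """Return a copy of fixed_topics with some keywords removed or re-weighted.

    The remaining weights are rescaled to sum to 1 (an assumption of this sketch).
    """
    reweight = reweight or {}
    kept = [(keyword, reweight.get(keyword, weight))
            for keyword, weight in zip(fixed_topics["keywords"], fixed_topics["weights"])
            if keyword not in drop]
    total = sum(weight for _, weight in kept)
    if total <= 0:
        raise ValueError("No keyword with a positive weight is left in the topic")
    return {"keywords": [keyword for keyword, _ in kept],
            "weights": [weight / total for _, weight in kept]}


# Toy topic with made-up keywords and weights, for illustration only.
toy_topic = {"keywords": ["kill", "fight", "weather"], "weights": [0.5, 0.3, 0.2]}
adjusted = adjust_topic(toy_topic, drop={"weather"}, reweight={"fight": 0.5})
print(adjusted)
```

The original dictionary is left untouched, so the adjusted and unadjusted topics can be compared during validation.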

In [3]:
# Here we re-use the contrasted topic from section 2
fixed_topics = {"keywords": api_response.result.keywords, "weights": api_response.result.keywords_weight} # dict | The contrasting topic used to separate the two categories of documents. Weights optional
print(len(api_response.result.keywords), len(api_response.result.keywords_weight))
metadata_selection = {"label": ["Violence", "All clear"]} # dict | The metadata selection defining the two categories of documents that a document can be classified into

query = '' # str | Dataset-language-specific fulltext query, using mysql MATCH boolean query format (optional)
custom_stop_words = [""] # List of stop words. (optional)
excluded_docs = '' # str | List of document IDs that should be excluded from the analysis. Example, ["docid1", "docid2", ..., "docidN"]  (optional)
syntax_variables = True # bool | If True, the classifier will include syntax-related variables on top of content variables (optional) (default to False)
threshold = 0 # float | Threshold value for a document exposure to the contrasted topic, above which the document is assigned to class 1 specified through metadata_selection. (optional) (default to 0)
remove_redundancies = True # bool | If True, this option removes quasi-duplicates from the analysis. A quasi-duplicate would have the same NLP representation, but not necessarily the exact same text. (optional) (default True)


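# Only the arguments below are passed; the other optional variables defined above keep their API defaults.
payload = nucleus_api.DocClassifyModel(dataset="Social_media_feed",
                                        fixed_topics=fixed_topics,
                                        metadata_selection=metadata_selection,
                                        validation_phase=True,
                                        period_start='2018-01-01',
                                        period_end='2019-01-01')
# Alternative validation window: period_start='2019-01-01', period_end='2019-03-01'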

try:
    api_response = api_instance.post_doc_classify_api(payload)
    api_ok = True
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])
    api_ok = False

if api_ok:
    print('Detailed Results')
    print('    Docids:', api_response.result.detailed_results.docids)
    print('    Exposure:', api_response.result.detailed_results.exposures)
    print('    Estimated Category:', api_response.result.detailed_results.estimated_class)
    print('    Actual Category:', api_response.result.detailed_results.true_class)
    print('\n')

    print('Out-Sample Perf Metrics')
    print('    Accuracy:', api_response.result.perf_metrics.hit_rate)
    print('    Recall:', api_response.result.perf_metrics.recall)
    print('    Precision:', api_response.result.perf_metrics.precision)

163 163
Detailed Results
    Docids: [372746459070796601, 656244823936517128, 776902852041351634, 950604085993420810, 1292265014981711161, 1380411530707030282, 1620156333107313580, 1854520462215508183, 2205902445999073018, 2365960778917245307, 2373450842905457495, 2383865888350638791, 2554924790797026542, 2952292854093486503, 3325720912382988533, 3397215194896514820, 3499421997204683102, 3545423942726121399, 3683627708016583172, 4555868983588618437, 4625946039318940221, 4746121785136787662, 4767189974744133712, 4825367511331474696, 5217366909427623007, 5566900818722282521, 5620968974223273808, 5821020073909755150, 5864841412738683134, 6173618630202756293, 6303783743713708484, 6468365417517605478, 7014079786619530089, 7180359259391996839, 7242230233701612989, 7290029718334628379, 7887407208809957066, 7967605045913198983, 8047817457772465264, 8073561612845847316, 8192928964490616283, 8991483632660067955, 9035906359710233744, 9384092744660032334, 9785400758777816854, 10006474250568936611,
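
To make the reported metrics concrete, the sketch below recomputes accuracy, recall and precision from a list of true classes and a list of estimated classes, using the standard two-class definitions. The helper `binary_metrics` and the toy label lists are hypothetical. Whether the API computes `hit_rate`, `recall` and `precision` in exactly this way, and whether classes are returned as label strings, are assumptions.

```python
def binary_metrics(true_class, estimated_class, positive):
    """Accuracy, recall and precision for a two-class problem."""
    pairs = list(zip(true_class, estimated_class))
    true_pos = sum(t == positive and e == positive for t, e in pairs)
    false_pos = sum(t != positive and e == positive for t, e in pairs)
    false_neg = sum(t == positive and e != positive for t, e in pairs)
    correct = sum(t == e for t, e in pairs)
    return {
        "accuracy": correct / len(pairs),
        "recall": true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0,
        "precision": true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0,
    }


# Toy data, for illustration only.
true_class = ["Violence", "Violence", "All clear", "All clear", "Violence"]
estimated_class = ["Violence", "All clear", "All clear", "Violence", "Violence"]
metrics = binary_metrics(true_class, estimated_class, positive="Violence")
print(metrics)
```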

We can then move to the testing phase by setting validation_phase to False:

In [None]:
payload = nucleus_api.DocClassifyModel(dataset="Social_media_feed",
                                        fixed_topics=fixed_topics,
                                        metadata_selection=metadata_selection,
                                        custom_stop_words=custom_stop_words,
                                        validation_phase=False,
                                        period_start='2019-03-01',
                                        period_end='2019-06-01')

try:
    api_response = api_instance.post_doc_classify_api(payload)
    api_ok = True
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])
    api_ok = False

if api_ok:
    print('Detailed Results')
    print('    Docids:', api_response.result.detailed_results.docids)
    print('    Exposure:', api_response.result.detailed_results.exposures)
    print('    Estimated Category:', api_response.result.detailed_results.estimated_class)
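
The role of the `threshold` parameter can be illustrated offline on a list of exposures. The sketch below assigns the first label when the exposure is strictly above the threshold and the second label otherwise. The helper `classify`, the toy exposures, and the reading of "class 1" as the first label in `metadata_selection` are assumptions.

```python
def classify(exposures, threshold=0.0, labels=("Violence", "All clear")):
    """Assign labels[0] when the exposure is above the threshold, labels[1] otherwise."""
    return [labels[0] if exposure > threshold else labels[1] for exposure in exposures]


# Toy exposures, for illustration only.
exposures = [0.8, -0.1, 0.05, 0.0]
for threshold in (0.0, 0.1):
    print(threshold, classify(exposures, threshold))
```

Raising the threshold moves borderline documents out of the first class, which typically trades recall for precision on that class.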

### 4.	Fine Tuning

#### a. Specifying the metadata_selection for your contrasted topic

-     Contrasting documents that contain different keywords 

    This can be useful to detect certain expressions that come up frequently in specific contexts

In [None]:
metadata_selection = {"content": "kill hate torture"}

-     Contrasting documents that come from different authors

    This can be useful to detect multiple accounts that belong to the same person

In [None]:
metadata_selection = {"author": "@suspicious_author"}

#### b.	Reducing noise in your low-quality content detection
-	Check whether your content classification can be tailored by excluding topics that are not information-bearing for your end user or your application. This is achieved by passing the custom_stop_words parameter as input to the Contrasted Topic and Document Classify APIs


-	Identify and extract key topics from documents within your corpus and print the keywords of these topics


In [None]:
print('------------- Get list of topics from dataset --------------')

payload = nucleus_api.Topics(dataset='Social_media_feed',                         
                            query='',                       
                            num_topics=20, 
                            num_keywords=8,
                            metadata_selection=metadata_selection)

try:
    api_response = api_instance.post_topic_api(payload)
    api_ok = True
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])
    api_ok = False

if api_ok:
    for i, res in enumerate(api_response.result.topics):
        print('Topic', i, ' keywords: ', res.keywords)    
        print('---------------')

Using your domain expertise, client input, or advisor input, you can determine whether some of those topics or keywords are not differentiated enough to contribute to low-quality content detection.

You can then tailor the low-quality content detection by creating a custom_stop_words variable that contains those words. For instance, initialize the variable as follows and pass it in the payload of the main code of section 2:

In [None]:
custom_stop_words = ["tough dude", "bad boy"]  # list[str] | List of stop words. (optional)
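
One hypothetical way to shortlist candidates for `custom_stop_words` is to look for keywords that recur across many topics, since such keywords may not be very discriminating. The sketch below assumes each topic's keywords are available as a list of strings, which may differ from the format returned by the Topic Modeling API. The helper `frequent_keywords` and the toy topics are illustrative, and the heuristic is meant to support domain judgment, not replace it.

```python
from collections import Counter


def frequent_keywords(topics, min_topics=2):
    """Return the keywords that appear in at least `min_topics` topics, sorted."""
    counts = Counter(keyword for keywords in topics for keyword in set(keywords))
    return sorted(keyword for keyword, n in counts.items() if n >= min_topics)


# Toy topics, for illustration only.
topics = [["tough dude", "bad boy", "ranked match"],
          ["bad boy", "loot box", "tough dude"],
          ["server lag", "bad boy"]]
candidates = frequent_keywords(topics)
print(candidates)  # ['bad boy', 'tough dude']
```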

#### c. Focusing the content detection on specific subjects potentially discussed in your corpus
**query**: You can refine the content detection by leveraging the query variable of the Contrasted Topic and Document Classify APIs.

Rerun either of these two APIs on the content from your corpus that mentions a specific theme. Create a query variable and pass it in the payload:

In [None]:
query = '(LOL OR league of legends OR WOW OR world of warcraft)' # str | Fulltext query, using mysql MATCH boolean query format. Example: "(word1 OR word2) AND (word3 OR word4)" (optional)
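
When the list of themes grows, the query string can be assembled programmatically. The sketch below joins terms with OR inside each group and combines groups with AND, following the "(word1 OR word2) AND (word3 OR word4)" pattern from the comment above. The helper `build_query` is hypothetical, and the sketch does not escape or quote terms, so how the API treats multi-word terms is left as an open assumption.

```python
def build_query(*groups):
    """Join terms with OR inside each group and combine the groups with AND."""
    return " AND ".join("(" + " OR ".join(group) + ")" for group in groups if group)


query = build_query(["LOL", "league of legends", "WOW", "world of warcraft"])
print(query)

# Two groups: documents must match at least one term from each group.
combined = build_query(["LOL", "WOW"], ["cheat", "scam"])
print(combined)
```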

Copyright (c) 2019 SumUp Analytics, Inc. All Rights Reserved.

NOTICE: All information contained herein is, and remains the property of SumUp Analytics Inc. and its suppliers, if any. The intellectual and technical concepts contained herein are proprietary to SumUp Analytics Inc. and its suppliers and may be covered by U.S. and Foreign Patents, patents in process, and are protected by trade secret or copyright law.

Dissemination of this information or reproduction of this material is strictly forbidden unless prior written permission is obtained from SumUp Analytics Inc.