<h1><center>  Training a Content Classifier - Nucleus APIs Use Cases</center></h1>


<h1><center>  SumUp Analytics, Proprietary & Confidential</center></h1>


<h1><center>  Disclaimers and Terms of Service available at www.sumup.ai</center></h1>


 


## Objective: 
-	Develop an automated content classification model for social-media or gaming chatrooms

**In its current version, SumUp contrast analysis compares two categories against each other; the user defines what the two categories are.**

## Data:
-	A labeled corpus of posts from a social media or gaming platform
 -     You can have multiple labels in your corpus, but the algorithms handle two labels at a time when learning and predicting
 
 
 -     Illustrative labels for low-quality content detection: **"Violence", "Drugs", "Pornographic", "Religiously Sensitive", "Politically Sensitive", "Scam", "Clickbait", "Fake", "All clear"**
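
Because only two labels are handled at a time, a multi-label corpus has to be split into binary tasks. The sketch below is one possible way to enumerate them, not part of the Nucleus SDK: the helper names are hypothetical, the metadata field name `label` is assumed from the data template used in this notebook, and the `{field: [label_1, label_2]}` shape mirrors the `metadata_selection` dictionaries built in the training code.

```python
from itertools import combinations

# Illustrative labels copied from the list above.
LABELS = ["Violence", "Drugs", "Pornographic", "Religiously Sensitive",
          "Politically Sensitive", "Scam", "Clickbait", "Fake", "All clear"]


def one_vs_reference(labels, reference="All clear", field="label"):
    """Hypothetical helper: one two-label selection per label, each against the reference label."""
    return [{field: [label, reference]} for label in labels if label != reference]


def all_pairs(labels, field="label"):
    """Hypothetical helper: every unordered pair of labels as a two-label selection."""
    return [{field: [a, b]} for a, b in combinations(labels, 2)]


selections = one_vs_reference(LABELS)
print(len(selections), "binary tasks, for example:", selections[0])
```

Contrasting each label against "All clear" keeps the number of tasks linear in the number of labels; contrasting all pairs grows quadratically and is only worth it if labels need to be told apart from each other.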



## Nucleus APIs used:
-	Dataset creation API
 - 	*api_instance.post_upload_file(file, dataset)*
 - 	*nucleus_helper.import_files(api_instance, dataset, file_iters, processes=1)*

        nucleus_helper.import_files leverages api_instance.post_upload_file with parallel execution to speed up dataset creation


-	Contrasted Topic Modeling API
 - 	*api_instance.post_topic_contrast_api(payload)*


-	Documents Classification API
 - 	*api_instance.post_doc_classify_api(payload)*

## Approach:

### 1.	Dataset Preparation
-	Create a Nucleus dataset containing all relevant documents


-   We assume that the data is stored in a CSV file. Similar code could be written to ingest from a database table. There are some requirements on the names of the data and metadata fields passed to the API to create a dataset


    - Illustrative template for the data uploaded: ["author", "label", "time", "content", "title"]
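
Before uploading, it can help to check that the CSV header matches the expected template. The following is a minimal sketch using only the standard library; `missing_fields` is a hypothetical helper, and the expected field list is simply the illustrative template above, not an authoritative statement of what the API requires.

```python
import csv
import io

# Field names taken from the illustrative template above.
EXPECTED_FIELDS = ["author", "label", "time", "content", "title"]


def missing_fields(csv_stream, expected=EXPECTED_FIELDS):
    """Return the expected column names that are absent from the CSV header."""
    reader = csv.DictReader(csv_stream)
    present = set(reader.fieldnames or [])  # fieldnames is None for an empty file
    return [name for name in expected if name not in present]


# Small in-memory sample that lacks the "title" column.
sample = io.StringIO("author,label,time,content\nalice,Scam,2019-01-01,Win a prize now\n")
print("Missing columns:", missing_fields(sample))
```

With a real file, the same function can be called on the handle returned by `open(csv_file, encoding='utf-8-sig')` before the upload step.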

    

In [None]:
import csv
import json
import nucleus_api.api.nucleus_api as nucleus_helper
import nucleus_api
from nucleus_api.rest import ApiException

configuration = nucleus_api.Configuration()
configuration.host = 'UPDATE-WITH-API-SERVER-HOSTNAME'
configuration.api_key['x-api-key'] = 'UPDATE-WITH-API-KEY'

# Create API instance
api_instance = nucleus_api.NucleusApi(nucleus_api.ApiClient(configuration))

In [None]:
csv_file = 'social-media.csv'
dataset = 'social-media'  # str | Destination dataset where the file will be inserted.

with open(csv_file, encoding='utf-8-sig') as csvfile:
    reader = csv.DictReader(csvfile)
    json_props = nucleus_helper.upload_jsons(api_instance, dataset, reader, processes=4)
    
    total_size = 0
    total_jsons = 0
    for jp in json_props:
        total_size += jp.size
        total_jsons += 1
        
    print(total_jsons, 'JSON records (', total_size, 'bytes) appended to', dataset)


### 2. Classifier's Training and Validation

-     In this example, we rely on the OLID dataset, which can be obtained from the authors: https://scholar.harvard.edu/malmasi/olid.


-     The objectionable-content tasks are handled one at a time


-     For each task, a contrasting topic is extracted from the training set and evaluated on the validation set, in order to determine an optimal length/profile for that contrasting topic


-     The optimal configuration is finally applied to a test set to derive out-of-sample performance metrics
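
The train / validate / test procedure described above amounts to a one-dimensional grid search. The sketch below shows only the selection logic, under stated assumptions: `select_best` and `toy_validation_f1` are hypothetical names, and the toy scoring function stands in for the pair of Nucleus calls (topic contrast on the training set, document classification on the validation set) that would return a validation F1 for a given `num_keywords`.

```python
def select_best(grid, score_fn):
    """Return (best_value, best_score) over the grid.

    Candidates whose scoring raises one of the listed errors are skipped,
    so a single failure does not discard the best value found so far.
    """
    best_value, best_score = None, float("-inf")
    for value in grid:
        try:
            score = score_fn(value)
        except (AttributeError, IndexError, ZeroDivisionError):
            continue
        if score > best_score:
            best_value, best_score = value, score
    return best_value, best_score


def toy_validation_f1(num_keywords):
    """Made-up stand-in for a validation F1; peaks at 400 and fails at 700."""
    if num_keywords == 700:
        raise ZeroDivisionError("simulated failure for one grid value")
    return 1.0 - abs(num_keywords - 400) / 1000.0


grid = range(100, 1001, 100)
best_k, best_f1 = select_best(grid, toy_validation_f1)
print("Selected num_keywords:", best_k, "with validation F1:", best_f1)
```

Only the selected value is then re-fitted on the training set and scored once on the held-out test set, so the test metrics are not used for tuning.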

In [None]:
import numpy as np

syntax_variables = False # fixed
custom_stop_words = []

threshold_grid = np.linspace(100, 1000, 50) # Range of values for the hyper-parameter 'num_keywords'
perf_grid = []
granular_results = {"Accuracy":[], "Recall":[], "Precision":[], "F1":[], "Document_set":[]}

for j in range(3):
    if j == 0:
        test_set = "OLID_test_a"
        my_values = ['subtask_a', "OFF", "NOT"] # offensive and ok
    elif j == 1:
        test_set = "OLID_test_b"
        my_values = ['subtask_b', "UNT", "TIN"] # untargeted and targeted offense
    elif j == 2:
        test_set = "OLID_test_c"
        my_values = ['subtask_c', "GRP", "OTH"] # group vs. other; subtask_c also has IND, but only two labels are contrasted at a time

    metadata_selection = {my_values[0]: [my_values[1], my_values[2]]} 

    # parameter sensitivity: select the num_keywords value that maximizes F1 on the validation set
    optimal_num_keywords = 0.
    running_best = 0.
    for k in range(len(threshold_grid)):
        num_keywords = threshold_grid[k]
        try:
            payload = nucleus_api.TopicContrastModel(dataset='OLID_train', 
                                                    metadata_selection=metadata_selection,
                                                    num_keywords=num_keywords,
                                                    syntax_variables=syntax_variables,
                                                    custom_stop_words=custom_stop_words,
                                                    remove_redundancies=False)
            api_response = api_instance.post_topic_contrast_api(payload)

            fixed_topics = {'weights': api_response.result.keywords_weight, 'keywords': api_response.result.keywords}
            classifier_config = {'coefs': api_response.result.classifier_config.coef_[0], 'intercept': api_response.result.classifier_config.intercept_[0], 'keywords': api_response.result.keywords}

            payload = nucleus_api.DocClassifyModel(dataset="OLID_validate",
                                                    fixed_topics=fixed_topics,
                                                    classifier_config=classifier_config,
                                                    metadata_selection=metadata_selection,
                                                    validation_phase=True,
                                                    syntax_variables=syntax_variables,
                                                    custom_stop_words=custom_stop_words,
                                                    remove_redundancies=False)
            api_response1 = api_instance.post_doc_classify_api(payload)

            if api_response1.result.perf_metrics.f1 > running_best:
                optimal_num_keywords = num_keywords
                running_best = api_response1.result.perf_metrics.f1
        except (AttributeError, IndexError, ZeroDivisionError) as e:
            # Skip this grid value on failure, keeping the best configuration found so far
            continue

    if optimal_num_keywords > 0:
        # for the num_keywords value that maximized validation F1, test classification performance on a separate sample
        try:
            payload = nucleus_api.TopicContrastModel(dataset='OLID_train', 
                                                    metadata_selection=metadata_selection,
                                                    num_keywords=optimal_num_keywords,
                                                    syntax_variables=syntax_variables,
                                                    custom_stop_words=custom_stop_words,
                                                    remove_redundancies=False)
            api_response = api_instance.post_topic_contrast_api(payload)

            fixed_topics = {'weights': api_response.result.keywords_weight, 'keywords': api_response.result.keywords}
            classifier_config = {'coefs': api_response.result.classifier_config.coef_[0], 'intercept': api_response.result.classifier_config.intercept_[0], 'keywords': api_response.result.keywords}

            payload = nucleus_api.DocClassifyModel(dataset=test_set,
                                                    fixed_topics=fixed_topics,
                                                    classifier_config=classifier_config,
                                                    metadata_selection=metadata_selection,
                                                    validation_phase=True,
                                                    syntax_variables=syntax_variables,
                                                    custom_stop_words=custom_stop_words,
                                                    remove_redundancies=False)
            api_response1 = api_instance.post_doc_classify_api(payload)
            granular_results['Accuracy'].append(api_response1.result.perf_metrics.accuracy)
            granular_results['Recall'].append(api_response1.result.perf_metrics.recall)
            granular_results['Precision'].append(api_response1.result.perf_metrics.precision) 
            granular_results['F1'].append(api_response1.result.perf_metrics.f1) 
        except (AttributeError, IndexError, ZeroDivisionError) as e:
            granular_results['Accuracy'].append(np.nan)
            granular_results['Recall'].append(np.nan)
            granular_results['Precision'].append(np.nan)  
            granular_results['F1'].append(np.nan)  
    else:
        granular_results['Accuracy'].append(np.nan)
        granular_results['Recall'].append(np.nan)
        granular_results['Precision'].append(np.nan)
        granular_results['F1'].append(np.nan)
    granular_results['Document_set'].append([my_values[1], my_values[2]])
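
Once the loop has run, the per-task metrics can be printed in a readable form. The sketch below is one possible way to do that, assuming a dictionary shaped like `granular_results`; `format_results` is a hypothetical helper, and the numbers in `example` are made up purely for illustration and are not results from the OLID dataset.

```python
import math


def format_results(results):
    """Return one text line per task from a dict shaped like granular_results."""
    metrics = ["Accuracy", "Recall", "Precision", "F1"]
    lines = []
    for i, labels in enumerate(results["Document_set"]):
        cells = []
        for metric in metrics:
            value = results[metric][i]
            # NaN marks a task for which no usable configuration was found.
            cells.append(f"{metric}=n/a" if math.isnan(value) else f"{metric}={value:.3f}")
        lines.append(" vs ".join(labels) + ": " + ", ".join(cells))
    return lines


# Made-up numbers, for illustration only.
example = {
    "Accuracy": [0.8, float("nan")],
    "Recall": [0.7, float("nan")],
    "Precision": [0.75, float("nan")],
    "F1": [0.724, float("nan")],
    "Document_set": [["OFF", "NOT"], ["UNT", "TIN"]],
}
for line in format_results(example):
    print(line)
```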

### 3.	Fine Tuning

#### a.	Reducing noise in your low-quality content detection
-	You can tailor your content classification by excluding topics that carry no useful information for your end user or your application. This is done by passing the custom_stop_words parameter to the Topic Contrast and Document Classify APIs


-	Identify and extract key topics from the objectionable documents within your corpus and print the keywords of these topics

In [None]:
print('------------- Get list of topics from dataset --------------')

metadata_selection = {"subtask_a": "OFF"}
payload = nucleus_api.Topics(dataset=dataset,                         
                            query='',                       
                            num_topics=20, 
                            num_keywords=8,
                            metadata_selection=metadata_selection)

try:
    api_response = api_instance.post_topic_api(payload)
    api_ok = True
except ApiException as e:
    api_error = json.loads(e.body)
    print('ERROR:', api_error['message'])
    api_ok = False

if api_ok:
    for i, res in enumerate(api_response.result.topics):
        print('Topic', i, ' keywords: ', res.keywords)    
        print('---------------')

Using your domain expertise / client input / advisor input, you can determine whether some of those topics or keywords are not distinctive enough to contribute to low-quality content detection. 

You can then tailor the low-quality content detection by creating a custom_stop_words variable that contains those words. For instance, initialize the variable as follows and pass it in the payloads of the code in section 2: 

In [None]:
custom_stop_words = ["tough dude","bad boy"] # str | List of stop words. (optional)
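
When many topics have been reviewed, the stop-word list can be assembled programmatically rather than typed by hand. The following is a minimal sketch under assumptions: `build_custom_stop_words` is a hypothetical helper, and each topic's keywords are assumed to be available as a list of strings, which may differ from the exact format returned by the Topics API.

```python
def build_custom_stop_words(topic_keywords, flagged):
    """Collect flagged keywords that occur in the reviewed topics.

    Matching is case-insensitive; the result is lowercased, de-duplicated,
    and kept in order of first appearance.
    """
    flagged_lower = {phrase.strip().lower() for phrase in flagged}
    seen, result = set(), []
    for keywords in topic_keywords:
        for keyword in keywords:
            key = keyword.strip().lower()
            if key in flagged_lower and key not in seen:
                seen.add(key)
                result.append(key)
    return result


# Hypothetical reviewed topics and the phrases a reviewer flagged as uninformative.
topics = [["tough dude", "game", "lag"], ["Bad Boy", "tough dude", "server"]]
custom_stop_words = build_custom_stop_words(topics, flagged=["tough dude", "bad boy"])
print(custom_stop_words)
```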

#### b. Focusing the content detection on specific subjects potentially discussed in your corpus
**query**: You can refine the content detection by leveraging the query variable of the Contrasted Topic and Document Classify APIs.

This can be especially useful when content monitors are flagging objectionable content that typically surrounds specific events not captured by the existing process. One could try to understand ex-post whether such content has characteristic patterns in how users write about it.

Create a query variable and pass it in the payloads of the code in section 2:

In [None]:
query = '(christchurch)' # str | Fulltext query, using mysql MATCH boolean query format. Example: "(word1 OR word2) AND (word3 OR word4)" (optional)
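
For queries with several alternatives, the boolean string can be built from word groups instead of being written by hand. The sketch below follows the `"(word1 OR word2) AND (word3 OR word4)"` pattern given in the comment above; `build_query` is a hypothetical helper, not part of the Nucleus SDK, and it does no escaping of special characters.

```python
def build_query(groups):
    """Join the words of each group with OR, and the groups with AND."""
    clauses = []
    for words in groups:
        cleaned = [word.strip() for word in words if word.strip()]
        if cleaned:  # ignore empty groups
            clauses.append("(" + " OR ".join(cleaned) + ")")
    return " AND ".join(clauses)


query = build_query([["christchurch"]])
print(query)
print(build_query([["word1", "word2"], ["word3", "word4"]]))
```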

Copyright (c) 2019 SumUp Analytics, Inc. All Rights Reserved.

NOTICE: All information contained herein is, and remains the property of SumUp Analytics Inc. and its suppliers, if any. The intellectual and technical concepts contained herein are proprietary to SumUp Analytics Inc. and its suppliers and may be covered by U.S. and Foreign Patents, patents in process, and are protected by trade secret or copyright law.

Dissemination of this information or reproduction of this material is strictly forbidden unless prior written permission is obtained from SumUp Analytics Inc.