# Cloud Constable Content-Based Threat Detection
______
### Stephen Camera-Murray, Himani Garg, Vijay Thangella
## Wikipedia Personal Attacks corpus
(https://figshare.com/articles/Wikipedia_Detox_Data/4054689)

The corpus contains 115,864 verbatims, of which 13,590 are labelled aggressive and 102,274 are not.

Aggressive Speech                                      |  Normal Speech
:-----------------------------------------------------:|:------------------------------------------------------:
<img src="thumbsdown.png" alt="Aggressive" style="width: 200px;"/> | <img src="thumbsup.png" alt="Normal" style="width: 200px;"/>

### Step 4 - Operationalize the Model
____
In this step, we'll build a function that accepts the email contents as a string and: 1) cleanses the text, 2) vectorizes the words, and 3) returns an "aggressive" probability score.

**Note**: A lot of preprocessing goes into producing a single score, including unpickling our model. To do this in real time, we'd likely need a good amount of tuning or, more likely, a way to keep the vectorizer and model "warm" between requests.

**Update**: We succeeded in doing this in OpenWhisk. To keep the model warm, we might need to set up a timed trigger that calls the service every 5-10 minutes. Processing time averaged less than 100ms for short phrases.
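
One way to keep the artifacts warm inside a long-lived process is to load them once and reuse them across calls, rather than unpickling on every request. The following is a minimal sketch, assuming the artifact is a plain pickle file. `load_artifact` is a hypothetical helper that is not part of the original notebook, and the demo pickles a small stand-in word list instead of the real dictionary. For the model itself, the joblib loader could probably be wrapped the same way.

```python
import os
import pickle
import tempfile
from functools import lru_cache


@lru_cache(maxsize=None)
def load_artifact(path):
    """Unpickle `path` once; repeat calls with the same path return the cached object."""
    with open(path, "rb") as fp:
        return pickle.load(fp)


# Demo with a stand-in artifact (hypothetical word list, not the real dictionary.pkl)
demo_path = os.path.join(tempfile.mkdtemp(), "demo_dictionary.pkl")
with open(demo_path, "wb") as fp:
    pickle.dump(["kick", "weather", "news"], fp)

first = load_artifact(demo_path)   # reads from disk
second = load_artifact(demo_path)  # served from the in-process cache
print(first is second)
```

Caching like this only helps while the process stays alive, so in a serverless setting a timed trigger would likely still be needed to keep the container from going cold.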

#### Import required libraries

In [9]:
#import libraries
import numpy as np
import pandas as pd
import re, pickle
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23; use the standalone package

#### Text scoring function
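This function loads the pickled dictionary and model, cleanses and vectorizes the text, and returns the probability that the text is aggressive.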

In [10]:
# score_text ( text ) - function to score text
#   parameters:
#     text - the entire text of the email
#   returns:
#     aggressive_score  - the probability score that the text is aggressive
#
#   example: score = score_text ( text )
#
def score_text ( text ):
    
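    # unpickle our dictionary (the vectorizer vocabulary) and model
    with open ( "dictionary.pkl", "rb" ) as fp:
        vocabulary = pickle.load ( fp )  # avoid naming this `dict`, which shadows the built-in

    model = joblib.load ( 'ThreatClassificationModel.pkl' )

    # cleanse our text: replace runs of non-letters with a space, then lowercase
    # (may be unnecessary, since the vectorizer also lowercases and tokenizes)
    clean_text = re.sub('[^a-zA-Z]+',' ', text).lower()

    # set up the vectorizer with our fixed vocabulary; nothing is learned here,
    # so transform is sufficient
    vectorizer = CountVectorizer(stop_words='english', vocabulary=vocabulary)
    verbatimsVec = vectorizer.transform([clean_text])

    # clean up our features set into a tidy dataframe and score
    wordCounts = pd.DataFrame(verbatimsVec.toarray(), columns=vocabulary) # convert vectors to a dataframe
    # probability of the aggressive class = 1 - probability of the first (non-aggressive) class
    aggressive_score = 1.0 - model.predict_proba ( wordCounts ) [:,0] [0]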
    
    return aggressive_score
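
To make the cleansing and vectorizing steps concrete, here is a simplified, standard-library-only sketch of what they do to one input. The four-word `vocabulary` is made up for illustration, and the whitespace split is only a stand-in for `CountVectorizer`, which additionally drops stop words and single-character tokens.

```python
import re
from collections import Counter

# Hypothetical four-word vocabulary; the real one comes from dictionary.pkl
vocabulary = ["gonna", "kick", "weather", "lovely"]


def cleanse(text):
    # Same regex as score_text: runs of non-letters become one space, then lowercase
    return re.sub('[^a-zA-Z]+', ' ', text).lower()


def count_vector(text, vocabulary):
    # Count how often each vocabulary word occurs; other words are ignored
    counts = Counter(cleanse(text).split())
    return [counts[word] for word in vocabulary]


print(cleanse("I'm gonna kick your ass"))                   # i m gonna kick your ass
print(count_vector("I'm gonna kick your ass", vocabulary))  # [1, 1, 0, 0]
```

Each position in the resulting list corresponds to one vocabulary word, which is the shape of input the classifier scores.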

#### Example run
This test run simulates real-time prediction with our model.
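
A rough way to check per-call latency is to time repeated calls and average them. The sketch below assumes a hypothetical helper, `mean_latency_ms`, and uses a placeholder scorer so that it runs without the pickled model; passing `score_text` instead would measure the real function, including the cost of unpickling on every call.

```python
import time


def mean_latency_ms(score_fn, text, repeats=20):
    # Average wall-clock time per call, in milliseconds
    start = time.perf_counter()
    for _ in range(repeats):
        score_fn(text)
    return (time.perf_counter() - start) * 1000.0 / repeats


def stand_in_scorer(text):
    # Placeholder so the sketch runs without the pickled model; swap in score_text
    return 0.5


latency = mean_latency_ms(stand_in_scorer, "Lovely weather today, isn't it?")
print("mean latency: %.4f ms" % latency)
```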

##### Aggressive speech example

In [15]:
# define our example text
text = "I'm gonna kick your ass"

# get our score
aggressive_score = score_text ( text )

print ( "The predicted probability that this text is aggressive is", ( aggressive_score * 100.0 ), "%" )

The predicted probability that this text is aggressive is 80.0311436744 %


##### Normal speech example

In [13]:
# define our example text
text = "Lovely weather today, isn't it?"

# get our score
aggressive_score = score_text ( text )

print ( "The predicted probability that this text is aggressive is", ( aggressive_score * 100.0 ), "%" )

The predicted probability that this text is aggressive is 5.9431554097 %


##### Normal speech fail
As is typical in NLP, context matters. A model built on word counts ignores context, so a single word that is often used aggressively can push an innocent sentence toward a high score.

In [14]:
# define our example text
text = "Did you see the news about Dick Cheney?"

# get our score
aggressive_score = score_text ( text )

print ( "The predicted probability that this text is aggressive is", ( aggressive_score * 100.0 ), "%" )

The predicted probability that this text is aggressive is 67.9201767817 %
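
The false positive follows from the representation: word counts record which words occur, not how they are used. A minimal sketch, using a hypothetical `bag_of_words` helper rather than the real vectorizer, shows that reordering a sentence leaves its counts unchanged and that a name is counted like any other occurrence of the same token.

```python
import re
from collections import Counter


def bag_of_words(text):
    # Lowercase, keep letters only, and count tokens; word order is discarded
    return Counter(re.sub('[^a-zA-Z]+', ' ', text).lower().split())


a = bag_of_words("Did you see the news about Dick Cheney?")
b = bag_of_words("About Dick Cheney: did you see the news?")

print(a == b)     # True: same counts despite different word order
print(a["dick"])  # 1: the name is counted as the plain token "dick"
```

A model that sees only these counts has no way to tell a proper name from an insult, which is consistent with the high score above.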
