# Classify toxic comments using AWS Comprehend

AWS Comprehend is a service that uses natural language processing (NLP) to extract information from text (comments, documents, ...). There are two ways to use it:
* Using the native models: a simple and powerful way to extract entities (people, dates, proper names, ...), key phrases, language, sentiment and syntax. These models are pre-trained on a large amount of data.
* Using a custom classifier: create a model from labeled documents stored in S3, then run asynchronous jobs to classify new documents.

More about Comprehend here: https://docs.aws.amazon.com/comprehend/latest/dg/what-is.html

Here we show how to format files so that AWS Comprehend can read them, then explain how to use the service via the AWS console. A further step would be to replace the console actions with programs using boto3.

In [60]:
import pandas as pd

In [61]:
toxicComments = pd.read_csv('../0-data/toxic-comments/train_preprocessed.csv')

In [62]:
toxicComments.shape

(159571, 10)

In [63]:
toxicComments.head(2)

Unnamed: 0,comment_text,id,identity_hate,insult,obscene,set,severe_toxic,threat,toxic,toxicity
0,explanation why the edits made under my userna...,0000997932d777bf,0.0,0.0,0.0,train,0.0,0.0,0.0,0.0
1,d aww he matches this background colour i m s...,000103f0d9cfb60f,0.0,0.0,0.0,train,0.0,0.0,0.0,0.0


In [64]:
# Label columns in priority order: the first non-zero column gives the row's single label.
LABEL_COLUMNS = ['identity_hate', 'insult', 'obscene', 'severe_toxic', 'threat', 'toxic', 'toxicity']

def getRowToxicityLabel(row):
    """Collapse the per-category flags of a row into one upper-case label (CLEAN if none is set)."""
    for column in LABEL_COLUMNS:
        if row[column]:
            return column.upper()
    return "CLEAN"

In [65]:
labeledComments = pd.DataFrame({
    "label": toxicComments.apply(getRowToxicityLabel, axis=1),
    "text": toxicComments.comment_text
})

In [66]:
labeledComments.head(10)

Unnamed: 0,label,text
0,CLEAN,explanation why the edits made under my userna...
1,CLEAN,d aww he matches this background colour i m s...
2,CLEAN,hey man i m really not trying to edit war it...
3,CLEAN,more i can t make any real suggestions on im...
4,CLEAN,you sir are my hero any chance you remember...
5,CLEAN,congratulations from me as well use the tool...
6,INSULT,cock suck before you piss around on my work
7,CLEAN,your vandalism to the matt shirvington article...
8,CLEAN,sorry if the word nonsense was offensive to ...
9,CLEAN,alignment on this subject and which are contra...


In [67]:
labeledComments.to_csv('./train_set.csv', encoding='utf-8', index=False, header=False)
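
The file written above follows the layout Comprehend expects for single-label training data: one document per line, the label in the first column, the text in the second, and no header row. A minimal sketch of that layout using only the standard library (the two rows are made-up examples, not taken from the dataset):

```python
import csv
import io

# Made-up rows: (label, text). The label comes first, the text second.
rows = [
    ("CLEAN", "thanks, that fixed the article"),
    ("INSULT", 'you are an idiot, "sir"'),
]

# Write without a header; fields containing commas or quotes get quoted.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)
print(buffer.getvalue())
# CLEAN,"thanks, that fixed the article"
# INSULT,"you are an idiot, ""sir"""

# Reading the text back yields the same two columns per line.
parsed = list(csv.reader(io.StringIO(buffer.getvalue())))
```

Quoting matters here because comments often contain commas; `to_csv` applies the same minimal quoting by default.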

In [68]:
testComments = pd.DataFrame(data=[
    "hello, I'd like to know where I could find tutorials about Machine Learning",
    "You bitch have no clue of what you say'in"
])
testComments.to_csv('./test_set.csv', encoding='utf-8', index=False, header=False)

## Next steps

To use Comprehend to classify further comments:
* Upload train_set.csv to S3
* Create a custom classifier using train_set.csv as training data (using aws-cli or the online console)
* Upload test_set.csv to S3
* Create a job using test_set.csv as input
* Run the job
* After the job has been processed, the results will be available on S3
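
The console steps above could be scripted with boto3. The following is a minimal sketch rather than a tested deployment: the bucket name, IAM role ARN, classifier name and S3 keys are hypothetical placeholders, and the calls that actually reach AWS are isolated in `run`, which needs valid credentials.

```python
# Hypothetical placeholders: replace them with your own bucket and IAM role.
BUCKET = "my-comprehend-bucket"
ROLE_ARN = "arn:aws:iam::123456789012:role/ComprehendS3Access"


def classifier_request(name, bucket=BUCKET, role_arn=ROLE_ARN):
    """Build the arguments for comprehend.create_document_classifier."""
    return {
        "DocumentClassifierName": name,
        "DataAccessRoleArn": role_arn,
        "InputDataConfig": {"S3Uri": f"s3://{bucket}/train_set.csv"},
        "LanguageCode": "en",
    }


def job_request(name, classifier_arn, bucket=BUCKET, role_arn=ROLE_ARN):
    """Build the arguments for comprehend.start_document_classification_job."""
    return {
        "JobName": name,
        "DocumentClassifierArn": classifier_arn,
        "DataAccessRoleArn": role_arn,
        "InputDataConfig": {
            "S3Uri": f"s3://{bucket}/test_set.csv",
            "InputFormat": "ONE_DOC_PER_LINE",  # one comment per line
        },
        "OutputDataConfig": {"S3Uri": f"s3://{bucket}/output/"},
    }


def run():
    """Upload both files and start training; returns the classifier ARN."""
    import boto3  # imported here so the request builders work without boto3

    s3 = boto3.client("s3")
    comprehend = boto3.client("comprehend")
    s3.upload_file("./train_set.csv", BUCKET, "train_set.csv")
    s3.upload_file("./test_set.csv", BUCKET, "test_set.csv")
    response = comprehend.create_document_classifier(**classifier_request("toxic-comments"))
    # Training is asynchronous: wait until the classifier status is TRAINED
    # before calling start_document_classification_job(**job_request(...)).
    return response["DocumentClassifierArn"]
```

Keeping the request dictionaries in separate functions makes it possible to inspect them before anything is sent to AWS.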

Note that AWS Comprehend custom classifiers are currently asynchronous: you cannot get predictions with a simple request. Instead, you have to create and run a job, then read the result file from S3.
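
Because the job is asynchronous, a script has to poll for its completion before reading the output. One possible sketch, assuming a boto3 Comprehend client is passed in; the helper name `wait_for_job` and its default intervals are choices made for this example:

```python
import time

# Statuses after which the job will not change any more.
FINAL_STATUSES = {"COMPLETED", "FAILED", "STOPPED"}


def wait_for_job(comprehend, job_id, poll_seconds=30, max_polls=120):
    """Poll a classification job until it reaches a final status, then return that status."""
    for _ in range(max_polls):
        response = comprehend.describe_document_classification_job(JobId=job_id)
        status = response["DocumentClassificationJobProperties"]["JobStatus"]
        if status in FINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} still running after {max_polls} polls")
```

It would be called as `wait_for_job(boto3.client("comprehend"), job_id)`, where `job_id` is the `JobId` returned when the job was started.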

Ref: https://docs.aws.amazon.com/comprehend/latest/dg/what-is.html
> "Training and calling custom comprehend models are both async (batch) operations."

Result for the test_set.csv predictions:
```
~/Sources/datascience/datakit/4-documents-classification$ cat ~/Downloads/predictions.jsonl
{"File": "comments-tests.csv", "Line": "1", "Classes": [{"Name": "CLEAN", "Score": 0.9999}, {"Name": "TOXIC", "Score": 0.0001}, {"Name": "OBSCENE", "Score": 0.0}]}
{"File": "comments-tests.csv", "Line": "2", "Classes": [{"Name": "INSULT", "Score": 0.5994}, {"Name": "TOXIC", "Score": 0.3135}, {"Name": "IDENTITY_HATE", "Score": 0.0585}]}
```
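
Each line of the result file is a JSON record listing candidate classes with their scores. A short sketch that keeps only the best class per line; the function name `top_predictions` is introduced here for illustration, and the sample records are the two shown above:

```python
import json


def top_predictions(lines):
    """Return (line number, best class name, score) for each prediction record."""
    results = []
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines
        record = json.loads(raw)
        best = max(record["Classes"], key=lambda c: c["Score"])
        results.append((int(record["Line"]), best["Name"], best["Score"]))
    return results


sample = [
    '{"File": "comments-tests.csv", "Line": "1", "Classes": [{"Name": "CLEAN", "Score": 0.9999}, {"Name": "TOXIC", "Score": 0.0001}, {"Name": "OBSCENE", "Score": 0.0}]}',
    '{"File": "comments-tests.csv", "Line": "2", "Classes": [{"Name": "INSULT", "Score": 0.5994}, {"Name": "TOXIC", "Score": 0.3135}, {"Name": "IDENTITY_HATE", "Score": 0.0585}]}',
]

for line_number, label, score in top_predictions(sample):
    print(line_number, label, score)
# 1 CLEAN 0.9999
# 2 INSULT 0.5994
```

An open file handle on the downloaded `predictions.jsonl` can be passed in place of `sample`, since the function only iterates over lines.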

### Re-train/re-run the model
To run a job on new values:
* Replace test_set.csv on S3
* Create a job with the same parameters as before
* Run the job and read the results after processing

To re-train the model:
* Replace train_set.csv with new values (labeling toxic comments should be easier with the help of the results of the classifier's jobs)
* Copy the previous classifier
* Train the newly created classifier
* Create a new job using the same parameters as before, except for the model choice
* Delete the previous job and classifier

### Considerations about this Notebook

According to the AWS documentation (https://docs.aws.amazon.com/comprehend/latest/dg/how-document-classification-training.html), it doesn't seem to be possible to specify multiple labels per text document. That explains why we get a good accuracy but a low precision and recall: we train the classifier with only one label per document when a document might have several.
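
The information lost by keeping one label per document can be illustrated on toy data. In this sketch the rows are invented for the example, and the priority list mirrors the column order used by `getRowToxicityLabel`:

```python
# Same priority order as getRowToxicityLabel: the first set flag wins.
PRIORITY = ["identity_hate", "insult", "obscene", "severe_toxic", "threat", "toxic", "toxicity"]

# Toy rows (made up for illustration): each maps a category to its 0/1 flag.
rows = [
    {"insult": 1, "obscene": 1, "toxic": 1},
    {"toxic": 1},
    {"obscene": 1, "toxic": 1},
    {},
]


def single_label(row):
    """Keep only the highest-priority flag, as the training file does."""
    for column in PRIORITY:
        if row.get(column):
            return column.upper()
    return "CLEAN"


labels = [single_label(row) for row in rows]
truly_toxic = sum(1 for row in rows if row.get("toxic"))
labeled_toxic = labels.count("TOXIC")

print(labels)
# ['INSULT', 'TOXIC', 'OBSCENE', 'CLEAN']
print(f"rows flagged toxic: {truly_toxic}, rows labeled TOXIC: {labeled_toxic}")
# rows flagged toxic: 3, rows labeled TOXIC: 1
```

Three rows carry the toxic flag, but only one reaches the classifier as TOXIC, so a model trained on these labels is likely to miss the category when it co-occurs with a higher-priority one.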

See this as an example only.