

![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/CLASSIFICATION_MULTILABEL_TOXIC.ipynb)




# **Detect toxic content in comments**

## 1. Colab setup

In [None]:
! apt-get update -qq
! apt-get install -y openjdk-8-jdk-headless -qq > /dev/null

! pip install --ignore-installed -q pyspark==2.4.4
! pip install --ignore-installed -q spark-nlp

In [None]:
import os
import json
os.environ['JAVA_HOME'] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ['PATH'] = os.environ['JAVA_HOME'] + "/bin:" + os.environ['PATH']

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score

from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

import sparknlp
from sparknlp.annotator import *
from sparknlp.base import *
from sparknlp.pretrained import PretrainedPipeline

# Start Spark session
spark = sparknlp.start(gpu=True)

## 2. Pipeline creation and training

Build the training pipeline: a document assembler, a tokenizer, Universal Sentence Encoder embeddings, and a multi-label classifier.

In [None]:
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Defined for reference; not included in the pipeline stages below
sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

embeddingsSentence = UniversalSentenceEncoder.pretrained(lang='en') \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

classifierdl = MultiClassifierDLApproach() \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("class") \
    .setLabelColumn("labels") \
    .setMaxEpochs(10) \
    .setLr(1e-3) \
    .setThreshold(0.7) \
    .setValidationSplit(0.2) \
    .setOutputLogsPath('./') \
    .setEnableOutputLogs(True)

pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    embeddingsSentence,
    classifierdl
])

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]
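
`setThreshold(0.7)` sets the minimum score a label needs before the classifier emits it, so one comment can receive several labels or none at all. The idea can be sketched in plain Python. The `apply_threshold` helper and the `scores` values are hypothetical illustrations, not part of the Spark NLP API, and the inclusive comparison at the boundary is an assumption.

```python
def apply_threshold(scores, threshold=0.7):
    """Return the labels whose score reaches the threshold, sorted by name."""
    # Assumption: a score exactly equal to the threshold is accepted
    return sorted(label for label, score in scores.items() if score >= threshold)

# Hypothetical per-label scores for a single comment
scores = {"toxic": 0.93, "obscene": 0.71, "insult": 0.64, "threat": 0.02}

print(apply_threshold(scores))       # labels kept at the notebook's threshold
print(apply_threshold(scores, 0.5))  # a lower threshold keeps more labels
```

Raising the threshold trades recall for precision: fewer labels are emitted, but those that remain have higher scores.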


Download the training and test datasets and load them as Spark DataFrames.

In [None]:
! curl -O 'https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/classifier-dl/toxic_comments/toxic_train.snappy.parquet'
! curl -O 'https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/classifier-dl/toxic_comments/toxic_test.snappy.parquet'
trainDataset = spark.read.parquet("toxic_train.snappy.parquet")
testDataset = spark.read.parquet("toxic_test.snappy.parquet")



In [None]:
res = trainDataset.toPandas()

In [None]:
res

Unnamed: 0,id,text,labels
0,00024b59235015f3,Virgin\nMy only warning? You'll block me? Well...,"[toxic, obscene, insult]"
1,0002bcb3da6cb337,COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK,"[toxic, severe_toxic, obscene, insult]"
2,000521f420b7ac15,Words can't describe how annoying I find you W...,[toxic]
3,0005c987bdfc9d4b,Hey... what is it..\n@ | talk .\nWhat is it......,[toxic]
4,000e5ac5aa216bac,"""\n\n Cut the Shit \n\nWill you please cut the...","[toxic, obscene, insult]"
...,...,...,...
14615,fff631d42c6abb63,you marked my edit as vandilisim when i was st...,"[toxic, obscene, insult]"
14616,fffb8bea1d5e4d3b,"yes, yes, thank you. good to know, but who dec...","[toxic, obscene, insult]"
14617,fffd0ce82f58251e,[to any of those fucking admins],"[toxic, obscene]"
14618,fffdc608b84c9b27,That last link you gave me does not make sense...,[toxic]


In [None]:
all_labs = []
for r in res['labels'].values:
    all_labs.extend(r)
set(all_labs)

{'identity_hate', 'insult', 'obscene', 'severe_toxic', 'threat', 'toxic'}
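
The label set alone does not show how unevenly the labels are distributed. As a minimal sketch, label frequencies can be counted with `collections.Counter`. The `count_labels` helper is hypothetical, and `sample_labels` is a made-up stand-in for `res['labels'].values`, so the printed counts are not statistics of the real dataset.

```python
from collections import Counter

def count_labels(label_lists):
    """Count how often each label occurs across a sequence of label lists."""
    counts = Counter()
    for labels in label_lists:
        counts.update(labels)
    return counts

# Hypothetical stand-in for res['labels'].values
sample_labels = [
    ["toxic", "obscene", "insult"],
    ["toxic", "severe_toxic", "obscene", "insult"],
    ["toxic"],
]

label_counts = count_labels(sample_labels)
for label, n in label_counts.most_common():
    print(f"{label}: {n}")
```

In the notebook, passing `res['labels'].values` instead of `sample_labels` would give the counts for the training set.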

Train the pipeline model on the training dataset.

In [None]:
pipelineModel = pipeline.fit(trainDataset)

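Example training log (from a run with 127,657 training examples and 7 classes, so its counts differ from the dataset shown above):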


```
Training started - epochs: 10 - learning_rate: 0.001 - batch_size: 64 - training_examples: 127657 - classes: 7
Epoch 0/10 - 46.00s - loss: 0.079907365 - acc: 0.9723665 - val_loss: 0.07093345 - val_acc: 0.97352755 - val_f1: 0.91585624 - val_tpr: 0.9008827 - batches: 1995
Epoch 1/10 - 28.26s - loss: 0.06964213 - acc: 0.9747156 - val_loss: 0.069492534 - val_acc: 0.9739863 - val_f1: 0.91734964 - val_tpr: 0.9029063 - batches: 1995
Epoch 2/10 - 27.99s - loss: 0.06808146 - acc: 0.9752242 - val_loss: 0.06841504 - val_acc: 0.97444665 - val_f1: 0.9187962 - val_tpr: 0.9039833 - batches: 1995
Epoch 3/10 - 27.94s - loss: 0.066884466 - acc: 0.9757066 - val_loss: 0.06769186 - val_acc: 0.9746787 - val_f1: 0.9195223 - val_tpr: 0.90438014 - batches: 1995
Epoch 4/10 - 28.04s - loss: 0.06587076 - acc: 0.9761073 - val_loss: 0.067252316 - val_acc: 0.9749198 - val_f1: 0.92028916 - val_tpr: 0.9049118 - batches: 1995
Epoch 5/10 - 28.05s - loss: 0.06501821 - acc: 0.97637606 - val_loss: 0.06700255 - val_acc: 0.9750182 - val_f1: 0.9205762 - val_tpr: 0.90496385 - batches: 1995
Epoch 6/10 - 27.92s - loss: 0.064287946 - acc: 0.9765889 - val_loss: 0.06686394 - val_acc: 0.9750045 - val_f1: 0.9205367 - val_tpr: 0.904986 - batches: 1995
Epoch 7/10 - 28.19s - loss: 0.063645855 - acc: 0.97682655 - val_loss: 0.0667824 - val_acc: 0.97498673 - val_f1: 0.92048156 - val_tpr: 0.90499973 - batches: 1995
Epoch 8/10 - 28.53s - loss: 0.06306508 - acc: 0.97704613 - val_loss: 0.06673989 - val_acc: 0.974991 - val_f1: 0.9205486 - val_tpr: 0.9056079 - batches: 1995
Epoch 9/10 - 28.61s - loss: 0.0625258 - acc: 0.97720623 - val_loss: 0.06672904 - val_acc: 0.9750411 - val_f1: 0.9207759 - val_tpr: 0.90621996 - batches: 1995
```
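
Because `setEnableOutputLogs(True)` writes lines in this format, the per-epoch metrics can be pulled out with a small parser, for example to find the epoch with the best validation F1. The sketch below uses only the standard library. `parse_epoch_line` is a hypothetical helper, and the sample lines are copied from the log above.

```python
import re

# Matches "key: value" pairs such as "val_f1: 0.9207759"
METRIC_RE = re.compile(r"(\w+): ([0-9.]+)")

def parse_epoch_line(line):
    """Parse one 'Epoch N/M - ...' line into a dict of floats, or return None."""
    match = re.match(r"Epoch (\d+)/(\d+)", line)
    if match is None:
        return None  # not an epoch line, e.g. the "Training started" header
    metrics = {key: float(value) for key, value in METRIC_RE.findall(line)}
    metrics["epoch"] = int(match.group(1))
    return metrics

log_lines = [
    "Training started - epochs: 10 - learning_rate: 0.001 - batch_size: 64 - training_examples: 127657 - classes: 7",
    "Epoch 0/10 - 46.00s - loss: 0.079907365 - acc: 0.9723665 - val_loss: 0.07093345 - val_acc: 0.97352755 - val_f1: 0.91585624 - val_tpr: 0.9008827 - batches: 1995",
    "Epoch 5/10 - 28.05s - loss: 0.06501821 - acc: 0.97637606 - val_loss: 0.06700255 - val_acc: 0.9750182 - val_f1: 0.9205762 - val_tpr: 0.90496385 - batches: 1995",
    "Epoch 9/10 - 28.61s - loss: 0.0625258 - acc: 0.97720623 - val_loss: 0.06672904 - val_acc: 0.9750411 - val_f1: 0.9207759 - val_tpr: 0.90621996 - batches: 1995",
]

# Keep only the lines that parsed as epochs
epochs = [m for m in map(parse_epoch_line, log_lines) if m is not None]
best = max(epochs, key=lambda m: m["val_f1"])
print(f"best epoch: {best['epoch']} (val_f1={best['val_f1']})")
```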



## 3. Testing and examples

Run the model on the test dataset to evaluate performance and generate examples.

In [None]:
test_res = pipelineModel.transform(testDataset)

Display the test dataset after classification.

In [None]:
test_res.show()

+----------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|              id|                text|              labels|            document|               token| sentence_embeddings|               class|
+----------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|0007e25b2121310b|Bye!   Don't look...|             [toxic]|[[document, 0, 56...|[[token, 0, 2, By...|[[sentence_embedd...|[[category, 0, 56...|
|001956c382006abd|I'm Sorry   I'm s...|             [toxic]|[[document, 0, 31...|[[token, 0, 2, I'...|[[sentence_embedd...|[[category, 0, 31...|
|00c1b6962307c80e|Well you are ridi...|             [toxic]|[[document, 0, 95...|[[token, 0, 3, We...|[[sentence_embedd...|[[category, 0, 95...|
|0109d5a4788850f7|Thank you for you...|[toxic, obscene, ...|[[document, 0, 71...|[[token, 0, 4, Th...|[[sentence_embedd...|[[categ
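
The raw table does not say how well the predictions match the gold labels. One common way to summarize a multi-label classifier is micro-averaged precision, recall, and F1 over the label sets. The sketch below is plain Python under stated assumptions: `micro_prf` is a hypothetical helper, and `gold` and `predicted` are made-up examples. In the notebook they would presumably come from the `labels` column and the `result` field of the `class` annotations.

```python
def micro_prf(gold, predicted):
    """Micro-averaged precision, recall and F1 over parallel lists of label lists."""
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        g, p = set(g), set(p)
        tp += len(g & p)  # labels predicted and correct
        fp += len(p - g)  # labels predicted but not in the gold set
        fn += len(g - p)  # gold labels the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical gold and predicted labels for three comments
gold = [["toxic"], ["toxic", "obscene", "insult"], ["toxic"]]
predicted = [["toxic"], ["toxic", "obscene"], []]

precision, recall, f1 = micro_prf(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Micro-averaging pools all label decisions, so frequent labels such as `toxic` dominate the score. Per-label metrics would be needed to see how rare labels such as `threat` fare.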

Write sample inputs and outputs from the test dataset to files.

In [None]:
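# Create directories for the sample files
! mkdir -p inputs
! mkdir -p outputs

result = test_res.toPandas()
# Export rows 1 through 10 (row 0 is skipped)
for i in range(1, 11):
    # Text of the first document annotation for this row
    text = result['document'].iloc[i][0].result
    with open(f'inputs/{i}.txt', 'w') as f:
        # First line: a 96-character preview with newlines removed; then the full text
        f.write(text[:96].replace('\n', '') + " ...\n" + text)
    # Classifier annotations for the same row, serialized as JSON
    result[['class']].iloc[i].to_json(f'outputs/{i}.json')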
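
Each input file therefore starts with a one-line preview followed by the full comment. The layout can be checked in isolation with the sketch below. `make_preview` and `format_input_file` are hypothetical helpers that mirror the `f.write` expression, and `sample` is a made-up comment.

```python
def make_preview(text, width=96):
    """First `width` characters with newlines removed, followed by ' ...'."""
    return text[:width].replace("\n", "") + " ..."

def format_input_file(text):
    """Preview line followed by the full, unmodified text."""
    return make_preview(text) + "\n" + text

# Made-up comment containing line breaks
sample = "First line\n\nSecond line of a made-up comment."

print(format_input_file(sample))
```

Newlines are dropped rather than replaced with spaces, so words on either side of a line break run together in the preview.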