![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/training/english/classification/MultiClassifierDL_Train_and_Evaluate.ipynb)

# Multi-label Text Classification of Toxic Comments using MultiClassifierDL

In [None]:
# Only run this cell when you are using Spark NLP on Google Colab
!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash

Let's download the toxic comments datasets for training and testing:

In [None]:
!curl -O 'https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/classifier-dl/toxic_comments/toxic_train.snappy.parquet'

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2702k  100 2702k    0     0  1699k      0  0:00:01  0:00:01 --:--:-- 1699k


In [None]:
!curl -O 'https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/classifier-dl/toxic_comments/toxic_test.snappy.parquet'

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  289k  100  289k    0     0   249k      0  0:00:01  0:00:01 --:--:--  249k


In this notebook we want to follow the training logs on the fly, so we start the session with `real_time_output=True`:

In [None]:
import sparknlp

spark = sparknlp.start(real_time_output=True)
print("Spark NLP version")
sparknlp.version()


Spark NLP version


'4.3.1'

23/02/20 17:43:36 WARN Utils: Your hostname, duc-manjaro resolves to a loopback address: 127.0.1.1; using 192.168.0.34 instead (on interface enp3s0)
23/02/20 17:43:36 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


Let's read our toxic comments datasets:

In [None]:
train_dataset = spark.read.parquet("toxic_train.snappy.parquet").repartition(120)
toxic_test_dataset = spark.read.parquet("toxic_test.snappy.parquet").repartition(10)

In [None]:
train_dataset.show(2)

+----------------+--------------------+-------+
|              id|                text| labels|
+----------------+--------------------+-------+
|e63f1cc4b0b9959f|EAT SHIT HORSE FA...|[toxic]|
|ed58abb40640f983|PN News\nYou mean...|[toxic]|
+----------------+--------------------+-------+
only showing top 2 rows



As you can see, the comments contain many newline characters, which we can clean up with `DocumentAssembler`.

In [None]:
print(train_dataset.cache().count())
print(toxic_test_dataset.cache().count())

14620
1605


## Evaluation

Let's evaluate our MultiClassifierDL model during training on a test dataset that the model has never seen, then save it and load it into a new pipeline. To do this, we first need to prepare a test dataset Parquet file, as shown below:

In [None]:
from pyspark.ml import Pipeline

from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *

In [None]:
# Let's use shrink to remove new lines in the comments
document = DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")\
  .setCleanupMode("shrink")

# Here we use the state-of-the-art Universal Sentence Encoder model from TF Hub
embeddings = UniversalSentenceEncoder.pretrained() \
  .setInputCols(["document"])\
  .setOutputCol("sentence_embeddings")

pipeline = Pipeline(stages = [document, embeddings])

test_dataset = pipeline.fit(toxic_test_dataset).transform(toxic_test_dataset)  

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
Download done! Loading the resource.
[OK!]
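
The `shrink` cleanup mode used in the cell above is meant to remove newlines and collapse repeated whitespace. As a rough illustration only, the pure-Python sketch below approximates that effect with a regular expression. The helper name `shrink_whitespace` and the sample string are made up for this example, and the exact rules Spark NLP applies may differ.

```python
import re


def shrink_whitespace(text: str) -> str:
    """Approximate the 'shrink' cleanup: collapse whitespace runs to one space.

    Hypothetical helper for illustration; not Spark NLP's implementation.
    """
    # \s+ matches any run of newlines, tabs, and spaces.
    return re.sub(r"\s+", " ", text).strip()


# Made-up sample comment containing newlines, a tab, and repeated spaces.
sample = "first line\n\nsecond   line\tend"
print(shrink_whitespace(sample))
```

Running this on a few rows of the raw `text` column is a quick way to see why cleaning up newlines matters before the comments reach the sentence encoder.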


In [None]:
test_dataset.show(2)

+----------------+--------------------+----------------+--------------------+--------------------+
|              id|                text|          labels|            document| sentence_embeddings|
+----------------+--------------------+----------------+--------------------+--------------------+
|47d256dea1223d39|Vegan \n\nWhat in...|         [toxic]|[{document, 0, 78...|[{sentence_embedd...|
|5e0dea75de819976|Fight Club! F**k ...|[toxic, obscene]|[{document, 0, 29...|[{sentence_embedd...|
+----------------+--------------------+----------------+--------------------+--------------------+
only showing top 2 rows



Now that our test dataset has the required embeddings, we save it as Parquet and use it while training our MultiClassifierDL model.

In [None]:
test_dataset.write.parquet("./toxic_test.parquet")

Now let's train the model, using a validation split and the test dataset above for evaluation:
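
Before training, it can help to estimate how the rows will be divided. The sketch below is plain arithmetic on the row count printed earlier (14620) and on the validation split (0.1) and batch size (128) set in the training cell. How the library rounds the split is an assumption here; the results can be compared with the `training_examples`, `validation examples`, and `batches` values in the training log.

```python
import math

total_rows = 14620        # row count printed for train_dataset earlier
validation_split = 0.1    # value passed to setValidationSplit
batch_size = 128          # value passed to setBatchSize

# Assumption: the validation share is rounded to the nearest whole row.
validation_rows = round(total_rows * validation_split)
training_rows = total_rows - validation_rows

# The last batch of an epoch may be smaller, hence the ceiling.
batches_per_epoch = math.ceil(training_rows / batch_size)

print(training_rows, validation_rows, batches_per_epoch)
```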

In [None]:
# We will use MultiClassifierDL built by using Bidirectional GRU and CNNs inside TensorFlow that supports up to 100 classes
# We will use only 5 Epochs but feel free to increase it on your own dataset
multiClassifier = MultiClassifierDLApproach()\
  .setInputCols(["sentence_embeddings"])\
  .setOutputCol("category")\
  .setLabelColumn("labels")\
  .setBatchSize(128)\
  .setMaxEpochs(5)\
  .setLr(1e-3)\
  .setThreshold(0.5)\
  .setShufflePerEpoch(False)\
  .setEnableOutputLogs(True)\
  .setValidationSplit(0.1)\
  .setEvaluationLogExtended(True)\
  .setTestDataset("./toxic_test.parquet")

pipeline = Pipeline(
    stages = [
        document,
        embeddings,
        multiClassifier
    ])
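
To illustrate what the threshold of 0.5 means in a multi-label setting, here is a minimal pure-Python sketch. It assumes each label gets an independent score between 0 and 1 and that labels scoring at or above the threshold are kept. The function name and the scores are hypothetical and are not taken from the model or from Spark NLP internals.

```python
from typing import Dict, List


def labels_above_threshold(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Keep every label whose score reaches the threshold (illustrative only)."""
    return [label for label, score in scores.items() if score >= threshold]


# Hypothetical per-label scores for a single comment; not real model output.
example_scores = {"toxic": 0.97, "obscene": 0.81, "insult": 0.42, "threat": 0.03}

print(labels_above_threshold(example_scores))       # ['toxic', 'obscene']
print(labels_above_threshold(example_scores, 0.9))  # ['toxic']
```

Unlike single-label classification, any number of labels (including none) can pass the threshold, so raising it trades recall for precision.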

In [None]:
pipelineModel = pipeline.fit(train_dataset)

Training started - epochs: 5 - learning_rate: 0.001 - batch_size: 128 - training_examples: 13158 - classes: 6
Epoch 1/5 - 4.34s - loss: 0.38046357 - acc: 0.848714 - batches: 103
Quality on validation dataset (10.0%), validation examples = 1462 
time to finish evaluation: 2.05s
label           tp	 fp	 fn	 prec	 rec	 f1
toxic           1385	 77	 0	 0.94733244	 1.0	 0.97295403
threat          0	 0	 47	 0.0	 0.0	 0.0
obscene         545	 141	 216	 0.79446065	 0.7161629	 0.75328267
insult          456	 173	 244	 0.72496027	 0.6514286	 0.6862303
severe_toxic    28	 21	 101	 0.5714286	 0.21705426	 0.31460676
identity_hate   24	 7	 101	 0.7741935	 0.192	 0.30769232
tp: 2438 fp: 419 fn: 709 labels: 6
Macro-average	 prec: 0.63539594, rec: 0.46277428, f1: 0.5355179
Micro-average	 prec: 0.85334265, recall: 0.77470607, f1: 0.81212527
Quality on test dataset: 
time to finish evaluation: 0.08s
label           tp	 fp	 fn	 prec	 rec	 f1
toxic           1504	 101	 0	 0.9370716	 1.0	 0.9675137
threat    
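
To make the log easier to read, the following sketch recomputes the averaged metrics from the per-label `tp`, `fp`, and `fn` counts reported for the validation split in epoch 1. It is plain Python, independent of Spark NLP. The helper names are made up for this example, and the way macro F1 is derived (from macro precision and macro recall) is an assumption inferred from the logged numbers rather than from the library's source.

```python
from typing import Dict, Tuple

# (tp, fp, fn) per label, copied from the epoch 1 validation log above.
counts: Dict[str, Tuple[int, int, int]] = {
    "toxic": (1385, 77, 0),
    "threat": (0, 0, 47),
    "obscene": (545, 141, 216),
    "insult": (456, 173, 244),
    "severe_toxic": (28, 21, 101),
    "identity_hate": (24, 7, 101),
}


def safe_div(numerator: float, denominator: float) -> float:
    """Return 0.0 instead of raising when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return safe_div(2 * precision * recall, precision + recall)


precision = {label: safe_div(tp, tp + fp) for label, (tp, fp, fn) in counts.items()}
recall = {label: safe_div(tp, tp + fn) for label, (tp, fp, fn) in counts.items()}

# Macro average: mean of per-label scores, so rare labels weigh as much as frequent ones.
macro_prec = sum(precision.values()) / len(counts)
macro_rec = sum(recall.values()) / len(counts)
# Assumption inferred from the log: macro F1 is computed from macro precision and recall.
macro_f1 = f1_score(macro_prec, macro_rec)

# Micro average: pool the counts over all labels before computing the scores.
total_tp = sum(tp for tp, _, _ in counts.values())
total_fp = sum(fp for _, fp, _ in counts.values())
total_fn = sum(fn for _, _, fn in counts.values())
micro_prec = safe_div(total_tp, total_tp + total_fp)
micro_rec = safe_div(total_tp, total_tp + total_fn)
micro_f1 = f1_score(micro_prec, micro_rec)

print(f"macro prec={macro_prec:.4f} rec={macro_rec:.4f} f1={macro_f1:.4f}")
print(f"micro prec={micro_prec:.4f} rec={micro_rec:.4f} f1={micro_f1:.4f}")
```

The gap between the macro and micro averages comes from the rare labels: `threat` has no true positives at this point, which pulls the macro scores down, while the frequent `toxic` label dominates the micro scores.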

Let's save our trained multi-label classifier model to be loaded in our prediction pipeline:

In [None]:
pipelineModel.stages[-1].write().overwrite().save('tmp_multi_classifierDL_model')

## Load the Saved Model into a New Pipeline

In [None]:
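# Use the same "shrink" cleanup mode as in training so inference sees the same kind of text
document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")\
    .setCleanupMode("shrink")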

use = UniversalSentenceEncoder.pretrained() \
 .setInputCols(["document"])\
 .setOutputCol("sentence_embeddings")

multiClassifier = MultiClassifierDLModel.load("tmp_multi_classifierDL_model") \
  .setInputCols(["sentence_embeddings"])\
  .setOutputCol("category")\
  .setThreshold(0.5)

pipeline = Pipeline(
    stages = [
        document,
        use,
        multiClassifier
    ])

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]
