![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)
# Spark NLP
## Multi-label Text Classification
### E2E Challenge
#### By using MultiClassifierDL

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/training/english/classification/MultiClassifierDL_train_multi_label_E2E_challenge_classifier.ipynb)

In [1]:
# This is only needed to set up PySpark and Spark NLP on Colab
!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash

openjdk version "1.8.0_265"
OpenJDK Runtime Environment (build 1.8.0_265-8u265-b01-0ubuntu2~18.04-b01)
OpenJDK 64-Bit Server VM (build 25.265-b01, mixed mode)


Let's download the E2E Challenge dataset for training and testing:

In [2]:
!curl -O 'https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/classifier-dl/e2e_challenge/e2e_train.snappy.parquet'

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1922k  100 1922k    0     0  2000k      0 --:--:-- --:--:-- --:--:-- 1998k


In [2]:
import sparknlp

spark=sparknlp.start()
print("Spark NLP version")
sparknlp.version()

Spark NLP version


'2.6.0'

Let's read the E2E Challenge dataset and split it into training and test sets:

In [3]:
trainDataset, testDataset = spark.read.parquet("/content/e2e_train.snappy.parquet")\
  .randomSplit([0.9, 0.1], seed = 12345)  

In [4]:
trainDataset.show(2)

+--------------------+--------------------+
|                 ref|              labels|
+--------------------+--------------------+
|'Bibimbap House' ...|[name[Bibimbap Ho...|
|'Browns Cambridge...|[name[Browns Camb...|
+--------------------+--------------------+
only showing top 2 rows



As you can see, each row has a `ref` column with the text and a `labels` column with the array of labels. Any new lines in the text can be handled by `DocumentAssembler`.

In [5]:
print(trainDataset.cache().count())
print(testDataset.cache().count())

37792
4269
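
The two counts above are worth a quick sanity check, because `randomSplit` weights are only approximate targets. The short sketch below uses the row counts printed above and the `setValidationSplit(0.1)` value from the training cell. The rounding step is an assumption made for illustration, not a statement about how Spark NLP computes its validation split.

```python
# Row counts printed by the cell above.
train_rows, test_rows = 37792, 4269

total_rows = train_rows + test_rows
train_share = train_rows / total_rows
print(f"total rows: {total_rows}, train share: {train_share:.4f}")

# With a 0.1 validation split, roughly 90% of the training rows
# remain for the optimizer (assumption: simple rounding).
validation_split = 0.1
expected_training_examples = round(train_rows * (1 - validation_split))
print("expected training examples:", expected_training_examples)
```

The result is consistent with the `training_examples: 34013` value reported in the training log further down.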


In [6]:
from pyspark.ml import Pipeline

from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *

In [7]:
# The actual text is in a column named ref
document = DocumentAssembler()\
  .setInputCol("ref")\
  .setOutputCol("document")

# Here we use the state-of-the-art Universal Sentence Encoder model from TF Hub
embeddings = UniversalSentenceEncoder.pretrained() \
  .setInputCols(["document"])\
  .setOutputCol("sentence_embeddings")

# We will use MultiClassifierDL built by using Bidirectional GRU and CNNs inside TensorFlow that supports up to 100 classes
# We will use only 5 Epochs but feel free to increase it on your own dataset
multiClassifier = MultiClassifierDLApproach()\
  .setInputCols("sentence_embeddings")\
  .setOutputCol("category")\
  .setLabelColumn("labels")\
  .setBatchSize(128)\
  .setMaxEpochs(5)\
  .setLr(1e-3)\
  .setThreshold(0.5)\
  .setShufflePerEpoch(False)\
  .setEnableOutputLogs(True)\
  .setValidationSplit(0.1)

pipeline = Pipeline(
    stages = [
        document,
        embeddings,
        multiClassifier
    ])

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]


In [8]:
pipelineModel = pipeline.fit(trainDataset)

In [None]:
!ls -l ~/annotator_logs/

In [19]:
!cat ~/annotator_logs/MultiClassifierDLApproach_b80de1f04776.log

Training started - epochs: 5 - learning_rate: 0.001 - batch_size: 128 - training_examples: 34013 - classes: 79
Epoch 0/5 - 18.96s - loss: 0.22942108 - acc: 0.9338577 - val_loss: 0.17501871 - val_acc: 0.9417629 - val_f1: 0.3146024 - val_tpr: 0.19535509 - batches: 266
Epoch 1/5 - 10.60s - loss: 0.14757492 - acc: 0.953353 - val_loss: 0.12445798 - val_acc: 0.9562459 - val_f1: 0.57075405 - val_tpr: 0.4252112 - batches: 266
Epoch 2/5 - 10.46s - loss: 0.112007715 - acc: 0.96444803 - val_loss: 0.1024009 - val_acc: 0.9635221 - val_f1: 0.667721 - val_tpr: 0.5356968 - batches: 266
Epoch 3/5 - 10.66s - loss: 0.09598791 - acc: 0.96988803 - val_loss: 0.09133494 - val_acc: 0.9674665 - val_f1: 0.71459305 - val_tpr: 0.5951355 - batches: 266
Epoch 4/5 - 10.39s - loss: 0.08701118 - acc: 0.9730473 - val_loss: 0.08419453 - val_acc: 0.96987855 - val_f1: 0.74224013 - val_tpr: 0.63378865 - batches: 266
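
To compare epochs programmatically rather than by eye, the log lines can be parsed into dictionaries. This is a minimal sketch using only the standard library; `parse_epoch_line` is a hypothetical helper, and the two sample lines are copied from the log above.

```python
import re

# Two lines copied from the training log above.
log_lines = [
    "Epoch 0/5 - 18.96s - loss: 0.22942108 - acc: 0.9338577 - val_loss: 0.17501871 - val_acc: 0.9417629 - val_f1: 0.3146024 - val_tpr: 0.19535509 - batches: 266",
    "Epoch 4/5 - 10.39s - loss: 0.08701118 - acc: 0.9730473 - val_loss: 0.08419453 - val_acc: 0.96987855 - val_f1: 0.74224013 - val_tpr: 0.63378865 - batches: 266",
]

# Matches "name: number" pairs such as "val_f1: 0.3146024".
METRIC_RE = re.compile(r"(\w+): ([0-9.]+)")
EPOCH_RE = re.compile(r"Epoch (\d+)/")


def parse_epoch_line(line):
    """Return a dict with the epoch index and every metric found on the line."""
    metrics = {name: float(value) for name, value in METRIC_RE.findall(line)}
    metrics["epoch"] = int(EPOCH_RE.match(line).group(1))
    return metrics


history = [parse_epoch_line(line) for line in log_lines]
for row in history:
    print(row["epoch"], row["val_loss"], row["val_f1"])
```

Applied to the full log file, the same helper gives a list that can be plotted or checked for the epoch with the lowest `val_loss`.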


Let's save our trained multi-label classifier model to be loaded in our prediction pipeline:

In [10]:
pipelineModel.stages[-1].write().overwrite().save('/content/tmp_multi_classifierDL_model')

## Load the saved model into a prediction pipeline

In [11]:
document = DocumentAssembler()\
    .setInputCol("ref")\
    .setOutputCol("document")

use = UniversalSentenceEncoder.pretrained() \
 .setInputCols(["document"])\
 .setOutputCol("sentence_embeddings")

multiClassifier = MultiClassifierDLModel.load("/content/tmp_multi_classifierDL_model") \
  .setInputCols(["sentence_embeddings"])\
  .setOutputCol("category")\
  .setThreshold(0.5)

pipeline = Pipeline(
    stages = [
        document,
        use,
        multiClassifier
    ])

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]
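
The `setThreshold(0.5)` call above controls which labels are emitted for each row. As a rough illustration of the idea, the sketch below keeps every label whose score reaches the threshold. The scores are made up, `labels_above_threshold` is a hypothetical helper, and whether the library's comparison is inclusive is an assumption; this is not Spark NLP's internal code.

```python
def labels_above_threshold(scores, threshold=0.5):
    """Return the labels whose score is at least `threshold`, sorted by name."""
    return sorted(label for label, score in scores.items() if score >= threshold)


# Hypothetical per-label scores for a single row.
scores = {
    "name[The Eagle]": 0.91,
    "eatType[pub]": 0.62,
    "food[Indian]": 0.48,
    "priceRange[cheap]": 0.07,
}

print(labels_above_threshold(scores, threshold=0.5))
print(labels_above_threshold(scores, threshold=0.4))
```

Lowering the threshold returns more labels per row, which generally raises recall at the cost of precision.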


Let's now use our test dataset to evaluate our model:

In [12]:
# let's see our labels (fit once and reuse the result):
classes = pipeline.fit(testDataset).stages[2].getClasses()
print(classes)
print(len(classes))

['name[Bibimbap House]', 'name[Wildwood]', 'name[Cotto]', 'name[Clowns]', 'near[Burger King]', 'name[The Dumpling Tree]', 'name[The Vaults]', 'near[Crowne Plaza Hotel]', 'name[The Golden Palace]', 'name[The Rice Boat]', 'customer rating[high]', 'near[Avalon]', 'name[Alimentum]', 'near[The Bakers]', 'name[The Waterman]', 'near[Ranch]', 'name[The Olive Grove]', 'name[The Eagle]', 'name[The Wrestlers]', 'eatType[restaurant]', 'near[All Bar One]', 'customer rating[low]', 'near[Café Sicilia]', 'near[Yippee Noodle Bar]', 'food[Indian]', 'eatType[pub]', 'name[Green Man]', 'name[Strada]', 'near[Café Adriatic]', 'eatType[coffee shop]', 'name[Loch Fyne]', 'customer rating[5 out of 5]', 'near[Express by Holiday Inn]', 'food[French]', 'name[The Mill]', 'food[Japanese]', 'name[Travellers Rest Beefeater]', 'name[The Plough]', 'name[Cocum]', 'near[The Six Bells]', 'name[The Phoenix]', 'priceRange[cheap]', 'name[Midsummer House]', 'near[Rainbow Vegetarian Café]', 'near[The Rice Boat]', 'customer ratin
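
Each label in the list above follows an `attribute[value]` pattern. A small sketch of splitting labels into their two parts and grouping the values by attribute is given below; `split_label` is a hypothetical helper, and the sample labels are copied from the output above.

```python
import re
from collections import defaultdict

# A few labels copied from the class list above.
labels = [
    "name[Bibimbap House]",
    "customer rating[high]",
    "eatType[restaurant]",
    "near[Burger King]",
    "food[Indian]",
    "eatType[pub]",
]

# attribute name, then the value inside one pair of square brackets.
LABEL_RE = re.compile(r"^(?P<attribute>[^\[\]]+)\[(?P<value>[^\[\]]+)\]$")


def split_label(label):
    """Split 'attribute[value]' into an (attribute, value) tuple."""
    match = LABEL_RE.match(label)
    if match is None:
        raise ValueError(f"unexpected label format: {label!r}")
    return match.group("attribute"), match.group("value")


values_by_attribute = defaultdict(list)
for label in labels:
    attribute, value = split_label(label)
    values_by_attribute[attribute].append(value)

for attribute, values in values_by_attribute.items():
    print(attribute, values)
```

Grouping this way makes it easier to see how many distinct values each attribute contributes to the label set.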

In [13]:
preds = pipeline.fit(testDataset).transform(testDataset)


In [14]:
preds.select('labels', 'ref', 'category.result').show(2)

+--------------------+--------------------+--------------------+
|              labels|                 ref|              result|
+--------------------+--------------------+--------------------+
|[name[Strada], ea...|'Strada' is a pub...|[name[Alimentum],...|
|[name[The Eagle],...|'The Eagle' is lo...|[name[The Eagle],...|
+--------------------+--------------------+--------------------+
only showing top 2 rows



In [18]:
preds_df = preds.select('labels', 'category.result').toPandas()

from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score

mlb = MultiLabelBinarizer()

# Fit the binarizer once so that y_true and y_pred share the same label columns
y_true = mlb.fit_transform(preds_df['labels'])
y_pred = mlb.transform(preds_df['result'])

print("Classification report: \n", (classification_report(y_true, y_pred)))
print("F1 micro averaging:",(f1_score(y_true, y_pred, average='micro')))
print("ROC: ",(roc_auc_score(y_true, y_pred, average="micro")))

Classification report: 
               precision    recall  f1-score   support

           0       0.84      0.86      0.85       795
           1       0.89      0.82      0.86      1724
           2       0.72      0.07      0.12       415
           3       0.68      0.13      0.21       377
           4       0.68      0.21      0.32       504
           5       0.72      0.40      0.51       557
           6       0.65      0.09      0.16       437
           7       0.74      0.28      0.41       541
           8       0.99      0.96      0.98      1000
           9       0.94      0.91      0.92       701
          10       0.86      0.52      0.65       329
          11       0.84      0.52      0.64       908
          12       0.81      0.81      0.81      1784
          13       0.95      0.91      0.93       294
          14       0.92      0.56      0.70       410
          15       0.95      0.77      0.85       566
          16       0.89      0.76      0.82       581
  

  _warn_prf(average, modifier, msg_start, len(result))
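
Micro-averaged F1 pools true positives, false positives, and false negatives over all labels before computing a single score. The pure-Python sketch below shows the calculation on two made-up rows. `binarize` and `micro_f1` are hypothetical helpers written for illustration, not scikit-learn's implementation.

```python
def binarize(label_sets, classes):
    """Encode each label set as a 0/1 row over a shared, fixed class order."""
    index = {label: i for i, label in enumerate(classes)}
    rows = []
    for labels in label_sets:
        row = [0] * len(classes)
        for label in labels:
            if label in index:
                row[index[label]] = 1
        rows.append(row)
    return rows


def micro_f1(y_true, y_pred):
    """Pool TP, FP and FN over every cell, then compute F1 once."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Made-up examples in the style of the dataset's labels.
true_sets = [["name[Strada]", "eatType[pub]"], ["name[The Eagle]", "food[French]"]]
pred_sets = [["name[Alimentum]", "eatType[pub]"], ["name[The Eagle]"]]

# One shared vocabulary keeps the columns of both matrices aligned.
classes = sorted({label for labels in true_sets + pred_sets for label in labels})
score = micro_f1(binarize(true_sets, classes), binarize(pred_sets, classes))
print(round(score, 4))
```

Both matrices are built from the same `classes` list, so column `i` refers to the same label in the true and predicted rows.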


In [22]:
preds.select("category.metadata").show(10)

+--------------------+
|            metadata|
+--------------------+
|[[name[Alimentum]...|
|[[name[Alimentum]...|
|[[name[Alimentum]...|
|[[name[Alimentum]...|
|[[name[Alimentum]...|
|[[name[Alimentum]...|
|[[name[Alimentum]...|
|[[name[Alimentum]...|
|[[name[Alimentum]...|
|[[name[Alimentum]...|
+--------------------+
only showing top 10 rows



In [17]:
preds.select("category.metadata").printSchema()

root
 |-- metadata: array (nullable = true)
 |    |-- element: map (containsNull = true)
 |    |    |-- key: string
 |    |    |-- value: string (valueContainsNull = true)
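
The schema shows that each metadata entry is a map from string keys to string values, and the values may be null. Assuming the keys are label names and the values are scores encoded as strings (the output above is truncated, so this is an assumption), a collected entry could be ranked as sketched below. The sample map and the `top_scores` helper are hypothetical.

```python
def top_scores(metadata, k=2):
    """Return the k highest-scoring (label, score) pairs from a string map."""
    scored = []
    for key, value in metadata.items():
        try:
            scored.append((key, float(value)))
        except (TypeError, ValueError):
            # Skip null or non-numeric values.
            continue
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical metadata entry for one row, with made-up scores.
metadata = {
    "name[Alimentum]": "0.83",
    "eatType[pub]": "0.12",
    "food[Indian]": "0.57",
    "sentence": None,
}

print(top_scores(metadata))
```

Skipping values that cannot be parsed keeps the helper tolerant of any non-score keys in the map.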

