# Training Multilabel Classification Models with Legal NLP


![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/04.3.Training_Legal_Multilabel_Classifier.ipynb)

In this notebook, you will learn how to use Spark NLP and Legal NLP to train multilabel classification models.

Let's dive in!

# Colab Setup

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs

In [None]:
from johnsnowlabs import nlp, legal
# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM
# Restart your notebook afterwards for the changes to take effect
nlp.install(force_browser=True)

## Start Spark Session

In [3]:
from johnsnowlabs import nlp, legal 
# Automatically load license data and start a session with all jars user has access to
spark = nlp.start()

👌 Launched cpu optimized session with: 🚀Spark-NLP==4.2.4, 💊Spark-Healthcare==4.2.4, running on ⚡ PySpark==3.1.2


## Introduction

Spark NLP provides two annotators for text classification; this notebook uses the first:

- `MultiClassifierDL`: multilabel classification (can predict more than one class for each text) using a bidirectional GRU with convolution architecture built with TensorFlow, supporting up to 100 classes. Its inputs are sentence embeddings such as `UniversalSentenceEncoder`, `BertSentenceEmbeddings`, or `SentenceEmbeddings`.
- `ClassifierDL`: binary and multiclass classification (exactly one class for each text, up to 100 classes) using a deep neural network built with TensorFlow on top of sentence embeddings such as the Universal Sentence Encoder.
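
To make the difference between the two tasks concrete, here is a minimal plain-Python sketch (not Spark NLP code) of how multilabel targets can be represented as multi-hot vectors. The class list and the helper name `to_multi_hot` are illustrative assumptions; the label names are taken from the dataset used below.

```python
# Illustrative class vocabulary (a small subset of the labels in this dataset).
CLASSES = ["amendments", "assignments", "notices", "waivers"]

def to_multi_hot(labels, classes=CLASSES):
    """Return a 0/1 vector with one slot per class; several slots may be 1."""
    label_set = set(labels)
    return [1 if c in label_set else 0 for c in classes]

# A multiclass example carries exactly one label ...
single = to_multi_hot(["notices"])
# ... while a multilabel example may carry several at once.
multi = to_multi_hot(["waivers", "amendments"])

print(single)  # [0, 0, 1, 0]
print(multi)   # [1, 0, 0, 1]
```

A multiclass model would pick exactly one slot, whereas a multilabel model scores every slot independently.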

## Loading the data

In [4]:
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings_JSL/Legal/data/finance_data.csv

In [5]:
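import ast
import pandas as pd
df = pd.read_csv('./finance_data.csv')
# Labels are stored as stringified lists; literal_eval parses them without executing arbitrary code
df['label'] = df['label'].apply(ast.literal_eval)
print(f"Shape of the full dataset: {df.shape}")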

Shape of the full dataset: (27527, 2)
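
The `label` column stores each list of labels as a string, which is why it has to be parsed after loading. Below is a minimal sketch of parsing such a value with the standard library's `ast.literal_eval`, which only accepts Python literals. The sample string is made up for illustration and is not read from the CSV.

```python
import ast

# Hypothetical raw cell value, as it might appear in the CSV's label column.
raw_label = "['waivers', 'amendments']"

parsed = ast.literal_eval(raw_label)
print(parsed, type(parsed).__name__)

# literal_eval refuses anything that is not a plain literal, such as a function call.
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except (ValueError, SyntaxError):
    rejected = True
print("expression rejected:", rejected)
```

This matters mostly when the CSV comes from a source you do not fully control.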


> We will use a sample of this dataset to speed up training, since the goal here is only to illustrate the process. Use the full dataset if you want to experiment with it and achieve more realistic results.
>
> The sample contains only 1,000 observations; please keep in mind that this will hurt the accuracy and generalization capabilities of the model. Because the dataset is now small, we use about 90% of it to train the model and the remaining 10% for testing.

In [6]:
data = spark.createDataFrame(df)

# If you have a single dataset, split it; otherwise, load the test dataset the same way as the training data.
train, test = data.limit(1000).randomSplit([0.9, 0.1], seed=42)
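
Note that the weights passed to `randomSplit` act as sampling probabilities rather than exact fractions, so the resulting sizes are only approximately 90/10 (the training log further down reports 860 training examples, not 900). The plain-Python sketch below imitates that behaviour with one independent random draw per row. It illustrates the idea only and is not Spark's actual sampler; `bernoulli_split` is a made-up helper name.

```python
import random

def bernoulli_split(rows, train_weight=0.9, seed=42):
    """Assign each row to train with probability train_weight, else to test."""
    rng = random.Random(seed)
    train_part, test_part = [], []
    for row in rows:
        (train_part if rng.random() < train_weight else test_part).append(row)
    return train_part, test_part

train_rows, test_rows = bernoulli_split(range(1000))
# The two parts always cover the data, but the ratio is only roughly 90/10.
print(len(train_rows), len(test_rows))
```

Fixing the seed makes the split reproducible, but it does not make the sizes exact.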

In [7]:
train.show(truncate=50)

+--------------------------------------------------+-----------------------------------+
|                                         provision|                              label|
+--------------------------------------------------+-----------------------------------+
|(a) No failure or delay of the Administrative A...|              [waivers, amendments]|
|(a) Seller, the Agent, each Managing Agent, eac...|                      [assignments]|
|(a) To induce the other parties hereto to enter...|      [representations, warranties]|
|(a)  The provisions of this Agreement shall be ...|              [assigns, successors]|
|(a) All of the representations and warranties m...|      [representations, warranties]|
|(a) THIS AGREEMENT AND ANY CLAIM, CONTROVERSY, ...|[governing laws, entire agreements]|
|(a) This Agreement may be executed by one or mo...|                     [counterparts]|
|All Bank Expenses (including reasonable attorne...|                         [expenses]|
|All agreements, cove

In [8]:
from pyspark.sql.functions import col

test.groupBy("label").count().orderBy(col("count").desc()).show()


+--------------------+-----+
|               label|count|
+--------------------+-----+
| [entire agreements]|   18|
|    [governing laws]|   17|
|      [severability]|   17|
|           [notices]|   17|
|          [survival]|   12|
|      [counterparts]|   10|
|      [terminations]|    9|
|        [amendments]|    8|
|[assigns, success...|    7|
|          [expenses]|    6|
|       [assignments]|    5|
|           [waivers]|    5|
|[waivers, amendme...|    3|
|[amendments, enti...|    2|
|[representations,...|    1|
|        [successors]|    1|
|        [warranties]|    1|
|   [representations]|    1|
+--------------------+-----+
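
The table above counts label *combinations*, so `[waivers, amendments]` and `[waivers]` appear as separate rows. To see how often each individual label occurs, the lists can be flattened first. Here is a minimal sketch in plain Python (in Spark, `explode` from `pyspark.sql.functions` would play a similar role); the `rows` sample is illustrative and not the actual test split.

```python
from collections import Counter

# A few hand-written label lists in the same shape as the `label` column.
rows = [
    ["waivers", "amendments"],
    ["waivers"],
    ["amendments"],
    ["assigns", "successors"],
    ["successors"],
]

# Flatten the lists so that every label occurrence is counted once.
label_counts = Counter(label for labels in rows for label in labels)

for label, count in label_counts.most_common():
    print(f"{label}: {count}")
```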



## With Universal Sentence Encoder

In [9]:
document_assembler = (
    nlp.DocumentAssembler()
    .setInputCol("provision")
    .setOutputCol("document")
    .setCleanupMode("shrink")
)

embeddings = (
    nlp.UniversalSentenceEncoder.pretrained()
    .setInputCols("document")
    .setOutputCol("sentence_embeddings")
)

classifierdl = (
    nlp.MultiClassifierDLApproach()
    .setInputCols(["sentence_embeddings"])
    .setOutputCol("class")
    .setLabelColumn("label")
    .setMaxEpochs(20)
    .setLr(0.001)
    .setRandomSeed(42)
    .setEnableOutputLogs(True)
    .setOutputLogsPath("multilabel_use_logs")
    .setBatchSize(8)
)

clf_pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, classifierdl])

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]


Since this model can take a long time to train, we train it on the reduced sample created above so that training does not run for hours.

> Please note that this reduction can greatly impact the performance of the model.

In [10]:
%%time
clf_pipelineModel = clf_pipeline.fit(train)

CPU times: user 663 ms, sys: 80.9 ms, total: 744 ms
Wall time: 1min 41s


In [11]:
import os
log_file_name = os.listdir("multilabel_use_logs")[0]

with open("multilabel_use_logs/"+log_file_name, "r") as log_file :
    print(log_file.read())

Training started - epochs: 20 - learning_rate: 0.001 - batch_size: 8 - training_examples: 860 - classes: 15
Epoch 0/20 - 6.11s - loss: 0.3046494 - acc: 0.92453307 - batches: 108
Epoch 1/20 - 2.59s - loss: 0.1938937 - acc: 0.9450936 - batches: 108
Epoch 2/20 - 3.35s - loss: 0.14427304 - acc: 0.95981264 - batches: 108
Epoch 3/20 - 2.55s - loss: 0.11918045 - acc: 0.9693143 - batches: 108
Epoch 4/20 - 2.82s - loss: 0.10459153 - acc: 0.97313035 - batches: 108
Epoch 5/20 - 2.65s - loss: 0.09456005 - acc: 0.9764794 - batches: 108
Epoch 6/20 - 2.43s - loss: 0.08702293 - acc: 0.978816 - batches: 108
Epoch 7/20 - 1.96s - loss: 0.08120587 - acc: 0.9813083 - batches: 108
Epoch 8/20 - 1.98s - loss: 0.07661503 - acc: 0.98239857 - batches: 108
Epoch 9/20 - 2.28s - loss: 0.072891876 - acc: 0.9845014 - batches: 108
Epoch 10/20 - 2.10s - loss: 0.06978773 - acc: 0.9856695 - batches: 108
Epoch 11/20 - 1.97s - loss: 0.06713975 - acc: 0.9867597 - batches: 108
Epoch 12/20 - 2.59s - loss: 0.06483908 - acc: 0.

In [12]:
preds = clf_pipelineModel.transform(test)

In [13]:
preds_df = preds.select('label','provision',"class.result").toPandas()
preds_df.head()

| | label | provision | result |
|---|---|---|---|
| 0 | [assigns, successors] | (a) The provisions of this Agreement shall be ... | [successors] |
| 1 | [waivers] | (a) Any provision of this Agreement may be wai... | [waivers, amendments] |
| 2 | [waivers, amendments] | (a) This Agreement may be amended, supplemente... | [waivers] |
| 3 | [survival] | All agreements, representations and warranties... | [survival] |
| 4 | [survival] | All covenants, agreements, representations and... | [survival] |


In [14]:
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score

mlb = MultiLabelBinarizer()

y_true = mlb.fit_transform(preds_df['label'])
y_pred = mlb.transform(preds_df['result'])


print("Classification report: \n", (classification_report(y_true, y_pred)))
print("F1 micro averaging:",(f1_score(y_true, y_pred, average='micro')))
print("ROC: ",(roc_auc_score(y_true, y_pred, average="micro")))


Classification report: 
               precision    recall  f1-score   support

           0       0.82      0.69      0.75        13
           1       0.00      0.00      0.00         5
           2       1.00      0.71      0.83         7
           3       1.00      1.00      1.00        10
           4       1.00      1.00      1.00        20
           5       0.86      1.00      0.92         6
           6       1.00      1.00      1.00        17
           7       1.00      0.88      0.94        17
           8       0.33      0.50      0.40         2
           9       1.00      1.00      1.00        17
          10       1.00      1.00      1.00         8
          11       0.82      0.75      0.78        12
          12       1.00      0.44      0.62         9
          13       0.88      0.88      0.88         8
          14       0.33      0.50      0.40         2

   micro avg       0.93      0.84      0.88       153
   macro avg       0.80      0.76      0.77       153
w
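
Micro-averaged F1 pools true positives, false positives and false negatives across all labels before computing the score. Below is a minimal sketch in plain Python that uses the five rows shown in `preds_df.head()` above as input. The helper `micro_f1` is an illustrative name, and the value it prints describes these five rows only, not the full test set.

```python
def micro_f1(true_labels, pred_labels):
    """Micro-averaged F1 over parallel lists of label collections."""
    tp = fp = fn = 0
    for truth, pred in zip(true_labels, pred_labels):
        truth, pred = set(truth), set(pred)
        tp += len(truth & pred)   # labels predicted and correct
        fp += len(pred - truth)   # labels predicted but wrong
        fn += len(truth - pred)   # labels missed
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

y_true_rows = [["assigns", "successors"], ["waivers"], ["waivers", "amendments"], ["survival"], ["survival"]]
y_pred_rows = [["successors"], ["waivers", "amendments"], ["waivers"], ["survival"], ["survival"]]

score = micro_f1(y_true_rows, y_pred_rows)
print(round(score, 3))  # 0.769
```

Here 5 labels are correct, 1 is spurious and 2 are missed, giving 10/13.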

## With RoBERTa Embeddings

**Please restart your runtime before running this section to avoid out-of-memory errors, then load the dataset again.**

There are no dedicated Legal sentence embeddings, but we can use the Legal RoBERTa token embeddings and average them into a single vector per document.
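
Averaging here means taking the element-wise mean of the token vectors of a document. The plain-Python sketch below shows the idea on toy vectors. The helper `average_pool` and the numbers are illustrative assumptions, not the internals of the `SentenceEmbeddings` annotator.

```python
def average_pool(token_vectors):
    """Element-wise mean of equally sized token vectors."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / len(token_vectors) for i in range(dim)]

# Three toy 4-dimensional "token embeddings" (real RoBERTa vectors are much wider).
tokens = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 0.0, 2.0, 0.0],
    [2.0, 3.0, 2.0, 2.0],
]
sentence_vector = average_pool(tokens)
print(sentence_vector)  # [2.0, 1.0, 2.0, 2.0]
```

The result has the same width as a single token vector, which is what the classifier expects as its input.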

In [3]:
from johnsnowlabs import nlp, legal 
# Automatically load license data and start a session with all jars user has access to
spark = nlp.start()

📋 Loading license number 0 from /root/.johnsnowlabs/licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json
👌 Launched cpu optimized session with: 🚀Spark-NLP==4.2.4, 💊Spark-Healthcare==4.2.4, running on ⚡ PySpark==3.1.2


In [4]:
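import ast
import pandas as pd
df = pd.read_csv("finance_data.csv")
# Labels are stored as stringified lists; literal_eval parses them without executing arbitrary code
df['label'] = df['label'].apply(ast.literal_eval)
print(f"Shape of the full dataset: {df.shape}")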

Shape of the full dataset: (27527, 2)


In [5]:
data = spark.createDataFrame(df)

# If you have a single dataset, split it; otherwise, load the test dataset the same way as the training data.

train, test = data.limit(500).randomSplit([0.9, 0.1], seed=42)

In [6]:
embeddings = (
    nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en")
    .setInputCols(["document", "token"])
    .setOutputCol("embeddings")
    .setMaxSentenceLength(512)
)

roberta_embeddings_legal_roberta_base download started this may take some time.
Approximate size to download 447.2 MB
[OK!]


This model takes longer to train, so we limit the number of epochs to `3`.

In [7]:
document_assembler = (
    nlp.DocumentAssembler().setInputCol("provision").setOutputCol("document")
)

tokenizer = nlp.Tokenizer().setInputCols(["document"]).setOutputCol("token")

embeddingsSentence = (
    nlp.SentenceEmbeddings()
    .setInputCols(["document", "embeddings"])
    .setOutputCol("sentence_embeddings")
    .setPoolingStrategy("AVERAGE")
)

classifierdl = (
    nlp.MultiClassifierDLApproach()
    .setInputCols(["sentence_embeddings"])
    .setOutputCol("class")
    .setLabelColumn("label")
    .setMaxEpochs(3)
    .setLr(0.001)
    .setRandomSeed(42)
    .setEnableOutputLogs(True)
    .setOutputLogsPath("multilabel_roberta_logs")
    .setBatchSize(4)
)


clf_pipeline = nlp.Pipeline(
    stages=[
        document_assembler,
        tokenizer,
        embeddings,
        embeddingsSentence,
        classifierdl,
    ]
)

In [8]:
%%time
clf_pipelineModel = clf_pipeline.fit(train)

CPU times: user 5.92 s, sys: 682 ms, total: 6.6 s
Wall time: 17min 12s


In [9]:
import os
log_file_name = os.listdir("multilabel_roberta_logs")[0]

with open("multilabel_roberta_logs/"+log_file_name, "r") as log_file :
    print(log_file.read())

Training started - epochs: 3 - learning_rate: 0.001 - batch_size: 4 - training_examples: 461 - classes: 15
Epoch 0/3 - 5.87s - loss: 0.24340223 - acc: 0.9349273 - batches: 116
Epoch 1/3 - 2.05s - loss: 0.11221627 - acc: 0.9701446 - batches: 116
Epoch 2/3 - 1.94s - loss: 0.07115805 - acc: 0.9828982 - batches: 116
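
The log is plain text, so the per-epoch metrics can be pulled out with a regular expression, for example to plot them or to sanity-check the number of batches. This is a minimal sketch, assuming the log format shown above; the pattern is tailored to those lines and may need adjusting for other Spark NLP versions.

```python
import math
import re

log_text = """Training started - epochs: 3 - learning_rate: 0.001 - batch_size: 4 - training_examples: 461 - classes: 15
Epoch 0/3 - 5.87s - loss: 0.24340223 - acc: 0.9349273 - batches: 116
Epoch 1/3 - 2.05s - loss: 0.11221627 - acc: 0.9701446 - batches: 116
Epoch 2/3 - 1.94s - loss: 0.07115805 - acc: 0.9828982 - batches: 116
"""

# One match per epoch line: epoch index, loss, accuracy, number of batches.
epoch_pattern = re.compile(
    r"Epoch (\d+)/\d+ - [\d.]+s - loss: ([\d.]+) - acc: ([\d.]+) - batches: (\d+)"
)
history = [
    {"epoch": int(e), "loss": float(l), "acc": float(a), "batches": int(b)}
    for e, l, a, b in epoch_pattern.findall(log_text)
]

# The header line tells us how many batches to expect per epoch.
header = re.search(r"batch_size: (\d+) - training_examples: (\d+)", log_text)
batch_size, n_examples = int(header.group(1)), int(header.group(2))
expected_batches = math.ceil(n_examples / batch_size)

for row in history:
    print(row)
print("expected batches per epoch:", expected_batches)
```

With 461 examples and a batch size of 4, the expected 116 batches match the value reported on every epoch line.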



In [10]:
preds = clf_pipelineModel.transform(test)

In [11]:
preds_df = preds.select('provision','label',"class.result").toPandas()

In [12]:
preds_df.head()

| | provision | label | result |
|---|---|---|---|
| 0 | All agreements, statements, representations an... | [survival] | [] |
| 1 | All covenants of the Company contained in this... | [survival] | [] |
| 2 | All representations, warranties, covenants and... | [survival] | [] |
| 3 | Any notice required or permitted by this Agree... | [notices] | [notices] |
| 4 | Each Canadian Loan Party acknowledges receipt ... | [waivers] | [] |
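
Several of the predictions above are empty lists. In a multilabel classifier each class gets its own score, and only classes whose score clears a threshold are returned, so an empty result means that no class was confident enough. A plain-Python sketch of that decision rule follows. The scores are invented, and the 0.5 cut-off reflects my understanding of the annotator's default `threshold` parameter, which is worth verifying against the Spark NLP documentation.

```python
def select_labels(scores, threshold=0.5):
    """Keep every label whose score reaches the threshold (possibly none)."""
    return [label for label, score in scores.items() if score >= threshold]

# Invented per-class scores for two provisions.
confident = {"notices": 0.91, "waivers": 0.07, "survival": 0.02}
uncertain = {"notices": 0.12, "waivers": 0.31, "survival": 0.44}

print(select_labels(confident))       # ['notices']
print(select_labels(uncertain))       # []
print(select_labels(uncertain, 0.3))  # ['waivers', 'survival']
```

With only a few hundred training rows and three epochs, many scores may stay below the cut-off, which would be consistent with the low recall in the report below.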


In [14]:
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score

mlb = MultiLabelBinarizer()

y_true = mlb.fit_transform(preds_df['label'])
y_pred = mlb.transform(preds_df['result'])

print("Classification report: \n", (classification_report(y_true, y_pred)))
print("F1 micro averaging:",(f1_score(y_true, y_pred, average='micro')))
print("ROC: ",(roc_auc_score(y_true, y_pred, average="micro")))


Classification report: 
               precision    recall  f1-score   support

           0       1.00      0.67      0.80         6
           1       0.25      1.00      0.40         1
           2       0.75      1.00      0.86         3
           3       1.00      0.67      0.80         6
           4       1.00      0.86      0.92         7
           5       0.00      0.00      0.00         1
           6       1.00      1.00      1.00         2
           7       1.00      1.00      1.00         1
           8       0.50      0.50      0.50         2
           9       1.00      0.67      0.80         3
          10       1.00      1.00      1.00         4
          11       0.00      0.00      0.00         3
          12       0.00      0.00      0.00         2
          13       0.00      0.00      0.00         2
          14       1.00      0.33      0.50         3

   micro avg       0.85      0.63      0.72        46
   macro avg       0.63      0.58      0.57        46
w

## Saving & loading back the trained model

In [15]:
clf_pipelineModel.stages

[DocumentAssembler_5aa961429664,
 REGEX_TOKENIZER_4df74a000c64,
 ROBERTA_EMBEDDINGS_b915dff90901,
 SentenceEmbeddings_2d572104dd61,
 MultiClassifierDLModel_d13ebede845b]

In [16]:
clf_pipelineModel.stages[-1].write().overwrite().save('MultilabelClfRoBerta')

In [17]:
# Load the saved multilabel classifier model back
MultilabelClfModel = nlp.MultiClassifierDLModel.load('MultilabelClfRoBerta')

In [18]:
ld_pipeline = nlp.Pipeline(
    stages=[
        document_assembler,
        tokenizer,
        embeddings,
        embeddingsSentence,
        MultilabelClfModel,
    ]
)
ld_pipeline_model = ld_pipeline.fit(spark.createDataFrame([[""]]).toDF("provision"))

In [19]:
# Apply the reloaded pipeline to the test data
ld_preds = ld_pipeline_model.transform(test)

In [20]:
ld_preds_df = ld_preds.select('provision','label',"class.result").toPandas()

In [21]:
ld_preds_df.head(10)

| | provision | label | result |
|---|---|---|---|
| 0 | All agreements, statements, representations an... | [survival] | [] |
| 1 | All covenants of the Company contained in this... | [survival] | [] |
| 2 | All representations, warranties, covenants and... | [survival] | [] |
| 3 | Any notice required or permitted by this Agree... | [notices] | [notices] |
| 4 | Each Canadian Loan Party acknowledges receipt ... | [waivers] | [] |
| 5 | Except as otherwise provided herein or in any ... | [waivers] | [] |
| 6 | Franchisee acknowledges that the Foodservice D... | [amendments] | [] |
| 7 | Guarantor represents and warrants to Lender th... | [warranties] | [representations] |
| 8 | If any provision of this Plan or any Award is,... | [severability] | [] |
| 9 | No amendment, modification, termination or can... | [amendments] | [amendments] |


## Zip Models for Models Hub Upload/Download

In [22]:
# cd into saved dir and zip
! cd /content/MultilabelClfRoBerta ; zip -r /content/MultilabelClfRoBerta.zip *

  adding: fields/ (stored 0%)
  adding: fields/datasetParams/ (stored 0%)
  adding: fields/datasetParams/_SUCCESS (stored 0%)
  adding: fields/datasetParams/.part-00000.crc (stored 0%)
  adding: fields/datasetParams/._SUCCESS.crc (stored 0%)
  adding: fields/datasetParams/.part-00001.crc (stored 0%)
  adding: fields/datasetParams/part-00001 (deflated 34%)
  adding: fields/datasetParams/part-00000 (deflated 27%)
  adding: metadata/ (stored 0%)
  adding: metadata/_SUCCESS (stored 0%)
  adding: metadata/.part-00000.crc (stored 0%)
  adding: metadata/._SUCCESS.crc (stored 0%)
  adding: metadata/part-00000 (deflated 42%)
  adding: multiclassifierdl_tensorflow (deflated 85%)
