# Training Multilabel Classification Models with Legal NLP


![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/04.4.Training_Legal_Multilabel_Classifier.ipynb)

In this notebook, you will learn how to use Spark NLP and Legal NLP to train multilabel classification models.

Let's dive in!

# Colab Setup

# Installation

In [None]:
! pip install -q johnsnowlabs

## Automatic Installation
Using my.johnsnowlabs.com SSO

In [None]:
from johnsnowlabs import nlp, legal

# nlp.install(force_browser=True)

## Manual downloading
If you are not registered at my.johnsnowlabs.com, received your license by email, or are using Safari, you may need to upload the license manually.

- Go to my.johnsnowlabs.com
- Download your license
- Upload it using the following command

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

- Install it

In [None]:
nlp.install()

# Starting

In [None]:
spark = nlp.start()

## Introduction

Spark NLP provides two annotators for text classification tasks; this notebook trains the first one:

- `MultiClassifierDL`: `Multilabel Classification` (can predict more than one class for each text) using a bidirectional GRU with convolution architecture, built with TensorFlow, that supports up to 100 classes. The inputs are sentence embeddings such as `UniversalSentenceEncoder`, `BertSentenceEmbeddings` or `SentenceEmbeddings`.
- `ClassifierDL`: `Binary Classification` and `Multiclass Classification` (up to 100 classes) using a deep neural network built with TensorFlow, with Universal Sentence Encoder embeddings as input.
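
To make the difference between the two settings concrete, the sketch below turns the same per-class scores into a multiclass and a multilabel prediction. The class names, scores and the 0.5 threshold are made up for illustration; they are not output from either annotator, and the helper functions are hypothetical.

```python
# Illustrative only: scores, class names and threshold are assumptions.
classes = ["governing laws", "notices", "survival"]
scores = [0.81, 0.07, 0.64]  # hypothetical per-class scores for one provision


def multiclass_prediction(scores, classes):
    """Multiclass: exactly one label, the highest-scoring class."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return classes[best]


def multilabel_prediction(scores, classes, threshold=0.5):
    """Multilabel: every class whose score clears the threshold (possibly none)."""
    return [c for c, s in zip(classes, scores) if s >= threshold]


print(multiclass_prediction(scores, classes))
print(multilabel_prediction(scores, classes))
```

Note that a multilabel prediction can be an empty list when no class clears the threshold, which is how rows with `[]` in a prediction table arise.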

## Loading the data

In [None]:
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/legal-nlp/data/finance_data.csv

## With RoBERTa Embeddings

We do not have any Legal-specific sentence embeddings, but we can take the Legal RoBERTa token embeddings and average them into one vector per text.
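
Averaging token embeddings means taking the element-wise mean of all token vectors in a text. A minimal plain-Python sketch of that pooling step follows; the helper name and the tiny 4-dimensional vectors are hypothetical, and real RoBERTa vectors are much larger.

```python
def average_pool(token_vectors):
    """Average token vectors element-wise into one fixed-size vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


# Three hypothetical 4-dimensional token embeddings.
tokens = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 0.0, 2.0, 0.0],
    [2.0, 3.0, 2.0, 2.0],
]
print(average_pool(tokens))  # [2.0, 1.0, 2.0, 2.0]
```

The result has the same dimensionality regardless of how many tokens the text contains, which is what lets a classifier consume texts of different lengths.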

In [None]:
from johnsnowlabs import nlp, legal 
# Automatically load license data and start a session with all jars user has access to
spark = nlp.start()

Spark Session already created, some configs may not take.
👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_7163 (2).json


In [None]:
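import ast
import pandas as pd

df = pd.read_csv("finance_data.csv")
# Labels are stored as stringified Python lists; parse them without executing arbitrary code.
df['label'] = df['label'].apply(ast.literal_eval)
print(f"Shape of the full dataset: {df.shape}")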

Shape of the full dataset: (27527, 2)
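
Because CSV cells are plain strings, the `label` column has to be parsed back into Python lists before training. A small sketch of that parsing step with `ast.literal_eval`, which accepts only literals and rejects anything executable; the two sample cells are hypothetical and assume the lists are stored as Python list literals.

```python
import ast

# Hypothetical CSV cells holding stringified label lists.
raw_labels = ["['governing laws', 'entire agreements']", "['survival']"]

parsed = [ast.literal_eval(cell) for cell in raw_labels]
print(parsed)

# Anything that is not a plain literal is rejected instead of being executed.
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True
print("rejected non-literal input:", rejected)
```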


In [None]:
data = spark.createDataFrame(df)

# If you have a single dataset, split it; otherwise load the test set the same way as the training data.

train, test = data.limit(500).randomSplit([0.7, 0.3], seed=42)
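
The weights passed to `randomSplit` are targets rather than exact sizes, so a 0.7/0.3 split of 500 rows will not necessarily yield 350 and 150 rows. The plain-Python sketch below illustrates the idea of assigning each row by an independent seeded random draw; it is not Spark's implementation, and the function name is hypothetical.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to a split by an independent random draw (approximate sizes)."""
    total = sum(weights)
    # Cumulative boundaries in [0, 1], e.g. [0.7, 1.0] for weights [0.7, 0.3].
    bounds, running = [], 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last split also catches any floating-point rounding at the top end.
            if draw < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


train_rows, test_rows = random_split(list(range(500)), [0.7, 0.3], seed=42)
print(len(train_rows), len(test_rows))
```

Fixing the seed makes the split reproducible, while the split sizes still only approximate the requested proportions.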

In [None]:
embeddings = (
    nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en")
    .setInputCols(["document", "token"])
    .setOutputCol("embeddings")
    .setMaxSentenceLength(512)
)

roberta_embeddings_legal_roberta_base download started this may take some time.
Approximate size to download 447.2 MB
[OK!]


This model takes longer to train, so we limit the number of epochs to `5`.

In [None]:
document_assembler = (
    nlp.DocumentAssembler().setInputCol("provision").setOutputCol("document")
)

tokenizer = nlp.Tokenizer().setInputCols(["document"]).setOutputCol("token")

embeddingsSentence = (
    nlp.SentenceEmbeddings()
    .setInputCols(["document", "embeddings"])
    .setOutputCol("sentence_embeddings")
    .setPoolingStrategy("AVERAGE")
)

classifier_dl = (
    nlp.MultiClassifierDLApproach()
    .setInputCols(["sentence_embeddings"])
    .setOutputCol("class")
    .setLabelColumn("label")
    .setMaxEpochs(5)
    .setLr(0.001)
    .setRandomSeed(42)
    .setEnableOutputLogs(True)
    .setOutputLogsPath("multilabel_roberta_logs")
    .setBatchSize(8)
)

clf_pipeline = nlp.Pipeline(
    stages=[
        document_assembler,
        tokenizer,
        embeddings,
        embeddingsSentence,
        classifier_dl,
    ]
)

In [None]:
%%time
clf_pipelineModel = clf_pipeline.fit(train)

CPU times: user 2.84 s, sys: 400 ms, total: 3.24 s
Wall time: 9min 39s


In [None]:
import os

log_dir = "multilabel_roberta_logs"
# Read the first log file found in the output-logs directory.
log_file_name = os.listdir(log_dir)[0]

with open(os.path.join(log_dir, log_file_name), "r") as log_file:
    print(log_file.read())

Training started - epochs: 5 - learning_rate: 0.001 - batch_size: 8 - training_examples: 374 - classes: 15
Epoch 0/5 - 2.69s - loss: 0.29948545 - acc: 0.9309782 - batches: 47
Epoch 1/5 - 0.54s - loss: 0.18318962 - acc: 0.9618359 - batches: 47
Epoch 2/5 - 0.56s - loss: 0.12013305 - acc: 0.9798308 - batches: 47
Epoch 3/5 - 0.53s - loss: 0.08942111 - acc: 0.9921496 - batches: 47
Epoch 4/5 - 0.54s - loss: 0.070660174 - acc: 0.9977053 - batches: 47
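
The `batches: 47` figure in the log follows from the number of training examples and the batch size. A quick check of that arithmetic, assuming the final partial batch is kept (which is consistent with the logged count):

```python
import math

training_examples = 374  # from the training log
batch_size = 8           # value passed to setBatchSize

batches_per_epoch = math.ceil(training_examples / batch_size)
last_batch_size = training_examples - (batches_per_epoch - 1) * batch_size
print(batches_per_epoch, last_batch_size)  # 47 6
```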



In [None]:
preds = clf_pipelineModel.transform(test)

In [None]:
preds_df = preds.select('provision','label',"class.result").toPandas()

In [None]:
preds_df.head()

| | provision | label | result |
|---|---|---|---|
| 0 | (a) THIS AGREEMENT AND ANY CLAIM, CONTROVERSY,... | [governing laws, entire agreements] | [governing laws, entire agreements] |
| 1 | All agreements, statements, representations an... | [survival] | [] |
| 2 | All covenants of the Company contained in this... | [survival] | [] |
| 3 | All covenants, agreements, representations and... | [survival] | [] |
| 4 | All indemnities set forth herein including, wi... | [survival] | [survival] |


In [None]:
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import classification_report, f1_score, roc_auc_score

# Binarize the label lists into multi-hot indicator matrices.
mlb = MultiLabelBinarizer()

y_true = mlb.fit_transform(preds_df['label'])
y_pred = mlb.transform(preds_df['result'])

print("Classification report: \n", classification_report(y_true, y_pred))
print("F1 micro averaging:", f1_score(y_true, y_pred, average='micro'))
print("ROC AUC (micro):", roc_auc_score(y_true, y_pred, average="micro"))


Classification report: 
               precision    recall  f1-score   support

           0       0.86      0.50      0.63        12
           1       0.25      0.17      0.20         6
           2       0.88      0.78      0.82         9
           3       1.00      0.86      0.92        14
           4       1.00      0.76      0.87        17
           5       1.00      0.60      0.75         5
           6       1.00      0.93      0.97        15
           7       1.00      0.88      0.93        16
           8       1.00      0.50      0.67         4
           9       0.92      1.00      0.96        12
          10       0.82      0.75      0.78        12
          11       0.33      0.17      0.22         6
          12       1.00      0.40      0.57         5
          13       1.00      0.57      0.73         7
          14       0.50      0.25      0.33         4

   micro avg       0.90      0.70      0.79       144
   macro avg       0.84      0.61      0.69       144
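
The micro and macro averages in the report differ because they aggregate in a different order. The sketch below recomputes both F1 averages by hand for only the five rows displayed by `preds_df.head()`, so the numbers illustrate the calculation and are not expected to match the full report. The helper functions are hypothetical, not part of scikit-learn.

```python
# The five rows displayed by preds_df.head(): (true labels, predicted labels).
rows = [
    (["governing laws", "entire agreements"], ["governing laws", "entire agreements"]),
    (["survival"], []),
    (["survival"], []),
    (["survival"], []),
    (["survival"], ["survival"]),
]


def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def label_counts(rows):
    """Per-label (tp, fp, fn) counts over all rows."""
    labels = sorted({label for true, pred in rows for label in true + pred})
    counts = {}
    for label in labels:
        tp = sum(label in t and label in p for t, p in rows)
        fp = sum(label not in t and label in p for t, p in rows)
        fn = sum(label in t and label not in p for t, p in rows)
        counts[label] = (tp, fp, fn)
    return counts


counts = label_counts(rows)
# Micro: pool the counts across labels, then compute a single F1.
micro_f1 = f1(*(sum(c[i] for c in counts.values()) for i in range(3)))
# Macro: compute F1 per label, then take the unweighted mean.
macro_f1 = sum(f1(*c) for c in counts.values()) / len(counts)
print(f"micro F1 = {micro_f1:.3f}, macro F1 = {macro_f1:.3f}")
```

On these rows the frequently missed `survival` label pulls the micro average down more than the macro average, because micro averaging weights every label occurrence equally while macro averaging weights every label equally.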

## Saving & loading back the trained model

In [None]:
clf_pipelineModel.stages

[DocumentAssembler_1d2e84f9329b,
 REGEX_TOKENIZER_6535f7ba66e8,
 ROBERTA_EMBEDDINGS_b915dff90901,
 SentenceEmbeddings_c91854a61841,
 MultiClassifierDLModel_e68f1838e916]

In [None]:
clf_pipelineModel.stages[-1].write().overwrite().save('MultilabelClfRoBerta')

In [None]:
# Load back the saved multilabel classifier model
MultilabelClfModel = nlp.MultiClassifierDLModel.load('MultilabelClfRoBerta')

In [None]:
ld_pipeline = nlp.Pipeline(
    stages=[
        document_assembler,
        tokenizer,
        embeddings,
        embeddingsSentence,
        MultilabelClfModel,
    ]
)
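# Every stage is already trained or pretrained, so fitting on an empty DataFrame only assembles the PipelineModel.
ld_pipeline_model = ld_pipeline.fit(spark.createDataFrame([[""]]).toDF("provision"))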

In [None]:
# Apply the reloaded pipeline to the test data
ld_preds = ld_pipeline_model.transform(test)

In [None]:
ld_preds_df = ld_preds.select('provision','label',"class.result").toPandas()

In [None]:
ld_preds_df.head(10)

| | provision | label | result |
|---|---|---|---|
| 0 | (a) THIS AGREEMENT AND ANY CLAIM, CONTROVERSY,... | [governing laws, entire agreements] | [governing laws, entire agreements] |
| 1 | All agreements, statements, representations an... | [survival] | [] |
| 2 | All covenants of the Company contained in this... | [survival] | [] |
| 3 | All covenants, agreements, representations and... | [survival] | [] |
| 4 | All indemnities set forth herein including, wi... | [survival] | [survival] |
| 5 | All issues and questions concerning the constr... | [governing laws] | [governing laws] |
| 6 | All notices and communications that are requir... | [notices] | [notices] |
| 7 | All notices and other communications hereunder... | [notices] | [notices] |
| 8 | All notices and other communications provided ... | [notices] | [notices] |
| 9 | All notices and other communications required ... | [notices] | [notices] |


## Zip Models for Models Hub Upload/Download

In [None]:
# cd into saved dir and zip
! cd /content/MultilabelClfRoBerta ; zip -r /content/MultilabelClfRoBerta.zip *

  adding: fields/ (stored 0%)
  adding: fields/datasetParams/ (stored 0%)
  adding: fields/datasetParams/_SUCCESS (stored 0%)
  adding: fields/datasetParams/.part-00000.crc (stored 0%)
  adding: fields/datasetParams/part-00001 (deflated 34%)
  adding: fields/datasetParams/part-00000 (deflated 27%)
  adding: fields/datasetParams/.part-00001.crc (stored 0%)
  adding: fields/datasetParams/._SUCCESS.crc (stored 0%)
  adding: metadata/ (stored 0%)
  adding: metadata/_SUCCESS (stored 0%)
  adding: metadata/.part-00000.crc (stored 0%)
  adding: metadata/part-00000 (deflated 41%)
  adding: metadata/._SUCCESS.crc (stored 0%)
  adding: multiclassifierdl_tensorflow (deflated 85%)
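
Outside Colab, or where the `zip` command is not available, the same archive can be produced from Python. The sketch below uses the standard library's `shutil.make_archive`; the `zip_model_dir` helper is hypothetical, and the demo runs on a throwaway directory that only mimics part of the saved model layout.

```python
import os
import shutil
import tempfile
import zipfile


def zip_model_dir(model_dir, zip_base):
    """Zip the contents of model_dir into zip_base + '.zip' and return the archive path."""
    return shutil.make_archive(zip_base, "zip", root_dir=model_dir)


# Demo on a temporary directory standing in for the saved model folder.
with tempfile.TemporaryDirectory() as tmp:
    model_dir = os.path.join(tmp, "MultilabelClfRoBerta")
    os.makedirs(os.path.join(model_dir, "metadata"))
    with open(os.path.join(model_dir, "metadata", "part-00000"), "w") as fh:
        fh.write("{}")

    archive = zip_model_dir(model_dir, os.path.join(tmp, "MultilabelClfRoBerta"))
    with zipfile.ZipFile(archive) as zf:
        names = zf.namelist()
    print(names)
```

As with the shell command, paths inside the archive are relative to the model directory, so the zip unpacks directly into `fields/`, `metadata/` and the TensorFlow file rather than into a nested folder.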
