# Training Multiclass Classification Models with Legal NLP


![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/04.2.Training_Legal_Multiclass_Classifier.ipynb)

In this notebook, you will learn how to use Spark NLP and Legal NLP to train multiclass classification models.

Let's dive in!

## Installation

In [None]:
! pip install -q johnsnowlabs

### Automatic Installation
Using [my.johnsnowlabs.com](https://my.johnsnowlabs.com/) SSO

In [3]:
from johnsnowlabs import nlp, legal, viz

# nlp.install(force_browser=True)

### Manual downloading
If you are not registered at my.johnsnowlabs.com, received your license by email, or are using Safari, you may need to upload the license manually.

- Go to [my.johnsnowlabs.com](https://my.johnsnowlabs.com/)
- Download your license
- Upload it using the following command

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

- Install it

In [5]:
nlp.install()

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_7187 (2).json
📋 Stored John Snow Labs License in /root/.johnsnowlabs/licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json
👷 Setting up  John Snow Labs home in /root/.johnsnowlabs, this might take a few minutes.
Downloading 🐍+🚀 Python Library spark_nlp-4.2.8-py2.py3-none-any.whl
Downloading 🐍+💊 Python Library spark_nlp_jsl-4.2.8-py3-none-any.whl
Downloading 🫘+🚀 Java Library spark-nlp-assembly-4.2.8.jar
Downloading 🫘+💊 Java Library spark-nlp-jsl-4.2.8.jar
🙆 JSL Home setup in /root/.johnsnowlabs
👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_7187 (2).json
Installing /root/.johnsnowlabs/py_installs/spark_nlp_jsl-4.2.8-py3-none-any.whl to /usr/bin/python3
Installed 1 products:
💊 Spark-Healthcare==4.2.8 installed! ✅ Heal the planet with NLP! 


### Start Spark Session

In [6]:
spark = nlp.start()

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_7187 (2).json
👌 Launched cpu optimized session with with: 🚀Spark-NLP==4.2.8, 💊Spark-Healthcare==4.2.8, running on ⚡ PySpark==3.1.2


In [7]:
from pyspark.sql import DataFrame
import pyspark.sql.functions as F
import pyspark.sql.types as T
import pyspark.sql as SQL
from pyspark import keyword_only

## Introduction

For text classification tasks, two annotators are available:

- `MultiClassifierDL`: multilabel classification (can predict more than one class for each text) using a Bidirectional GRU with convolution architecture built with TensorFlow that supports up to 100 classes. The inputs are sentence embeddings such as the state-of-the-art `UniversalSentenceEncoder`, `BertSentenceEmbeddings`, or `SentenceEmbeddings`.
- `ClassifierDL`: binary and multiclass classification (up to 100 classes) using a deep learning model (DNN) built with TensorFlow. It takes sentence embeddings as input, such as those from the state-of-the-art Universal Sentence Encoder.

This notebook uses the `ClassifierDLApproach` annotator, which trains a multiclass model: each prediction is exactly one category out of the predefined set of categories present in the training data.
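
The difference between the two prediction modes can be shown with a small, library-free sketch. It is only a conceptual illustration and not how the annotators are implemented; the scores, the `threshold` value, and both function names are hypothetical.

```python
# Hypothetical per-class scores for one text; the numbers are made up.
scores = {"business": 0.55, "risk_factors": 0.40, "financial_statements": 0.05}


def predict_multiclass(scores):
    """Multiclass: return the single highest-scoring class."""
    return max(scores, key=scores.get)


def predict_multilabel(scores, threshold=0.3):
    """Multilabel: return every class whose score reaches the threshold."""
    return sorted(label for label, score in scores.items() if score >= threshold)


print(predict_multiclass(scores))   # one label
print(predict_multilabel(scores))   # zero, one, or several labels
```

A multiclass model always commits to one category, while a multilabel model may return several categories or none at all.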

## Loading the data

In [None]:
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/finance-nlp/data/finance_clf_data.csv

In [None]:
import pandas as pd
df = pd.read_csv('finance_clf_data.csv', encoding="utf8")
print(f"Shape of the full dataset: {df.shape}")

Shape of the full dataset: (6128, 3)


In [None]:
df.head()

Unnamed: 0,text,label,len
0,Presently we do not believe any U S or State r...,business,402
1,\nnetwork outages or performance degradation ...,risk_factors,496
2,Available Information\nOur reports filed with ...,business,356
3,\n 42 530\n \n \n \n \n \n 42 530\nTotal liab...,financial_statements,359
4,8\nTable of Contents\ndevelopment employee eng...,business,582


In [None]:
df['label'].value_counts()

risk_factors               1926
financial_statements       1888
business                    970
financial_conditions        346
form_10k_summary            240
executives_compensation     155
controls_procedures         138
equity                      111
market_risk                 100
executives                   73
legal_proceedings            51
properties                   48
security_ownership           46
exhibits                     36
Name: label, dtype: int64

Since deep learning models can take some time to train, we will limit our dataset to a smaller number of observations. This is enough to illustrate how to use Spark NLP and Legal NLP annotators and pipelines to train the model without a long wait.

Please note that the quality and quantity of the training data strongly affect the trained model, and the results we obtain here are for illustration purposes only. To obtain a more realistic model, please consider using the full dataset or adding extra observations from different sources.

In [None]:
from sklearn.model_selection import train_test_split

# The top 3 categories by number of observations
filter_classes = ["risk_factors", "financial_statements", "business"]

# Take a random sample of 1000 observations
# (no random_state is set here, so the sample differs between runs)
df = df.loc[df.label.isin(filter_classes)].sample(1000)

# Stratified split into train and test datasets
train_data, test_data = train_test_split(
    df, train_size=0.9, stratify=df.label, random_state=42
)

# Send to spark
train = spark.createDataFrame(train_data)
test = spark.createDataFrame(test_data)
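
To make the effect of `stratify` concrete, here is a minimal standard-library sketch of a stratified split. It is a simplified illustration and not scikit-learn's implementation; the `stratified_split` helper and the toy labels are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, train_size=0.9, seed=42):
    """Return (train_idx, test_idx), splitting each label group separately
    so that both parts keep roughly the same class proportions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_size)
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels with a 40/40/20 class mix (made-up data).
labels = ["financial_statements"] * 40 + ["risk_factors"] * 40 + ["business"] * 20
train_idx, test_idx = stratified_split(labels)

print(Counter(labels[i] for i in train_idx))
print(Counter(labels[i] for i in test_idx))
```

Each class is split on its own, so a rare class cannot end up entirely in one of the two parts.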

In [None]:
from pyspark.sql.functions import col

train.groupBy("label").count().orderBy(col("count").desc()).show()

+--------------------+-----+
|               label|count|
+--------------------+-----+
|financial_statements|  372|
|        risk_factors|  354|
|            business|  174|
+--------------------+-----+



In [None]:
from pyspark.sql.functions import col

test.groupBy("label").count().orderBy(col("count").desc()).show()

+--------------------+-----+
|               label|count|
+--------------------+-----+
|financial_statements|   41|
|        risk_factors|   39|
|            business|   20|
+--------------------+-----+



## With Universal Sentence Encoder

In [None]:
document_assembler = (
    nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
)

embeddings = (
    nlp.UniversalSentenceEncoder.pretrained()
    .setInputCols("document")
    .setOutputCol("sentence_embeddings")
)

classifierdl = (
    legal.ClassifierDLApproach()
    .setInputCols(["sentence_embeddings"])
    .setOutputCol("class")
    .setLabelColumn("label")
    .setMaxEpochs(30)
    .setEnableOutputLogs(True)
    .setOutputLogsPath("multiclass_use_logs")
    .setLr(0.001)
    .setBatchSize(4)
    .setDropout(0.15)
)

clf_pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, classifierdl])

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]


In [None]:
%%time
clf_pipelineModel = clf_pipeline.fit(train)

CPU times: user 473 ms, sys: 66.3 ms, total: 539 ms
Wall time: 1min 11s


In [None]:
import os
log_file_name = os.listdir("multiclass_use_logs")[0]

with open("multiclass_use_logs/"+log_file_name, "r") as log_file :
    print(log_file.read())

Training started - epochs: 30 - learning_rate: 0.001 - batch_size: 4 - training_examples: 900 - classes: 3
Epoch 0/30 - 2.09s - loss: 189.27339 - acc: 0.71444446 - batches: 225
Epoch 1/30 - 1.71s - loss: 158.02815 - acc: 0.8611111 - batches: 225
Epoch 2/30 - 1.68s - loss: 153.44913 - acc: 0.8933333 - batches: 225
Epoch 3/30 - 1.66s - loss: 151.10738 - acc: 0.9033333 - batches: 225
Epoch 4/30 - 1.68s - loss: 149.2621 - acc: 0.9066667 - batches: 225
Epoch 5/30 - 1.93s - loss: 147.8518 - acc: 0.9111111 - batches: 225
Epoch 6/30 - 1.75s - loss: 146.85054 - acc: 0.91333336 - batches: 225
Epoch 7/30 - 2.04s - loss: 146.17546 - acc: 0.9166667 - batches: 225
Epoch 8/30 - 2.45s - loss: 145.68738 - acc: 0.91888887 - batches: 225
Epoch 9/30 - 1.79s - loss: 145.30891 - acc: 0.9222222 - batches: 225
Epoch 10/30 - 1.67s - loss: 145.0165 - acc: 0.9266667 - batches: 225
Epoch 11/30 - 1.67s - loss: 144.76372 - acc: 0.92777777 - batches: 225
Epoch 12/30 - 1.67s - loss: 144.54723 - acc: 0.92777777 - batc
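
The per-epoch lines in this log follow a regular pattern, so they can be turned into numbers for plotting or for checking convergence. The sketch below assumes the line format shown above; the regular expression and the `parse_training_log` helper are hypothetical, and lines that do not match the pattern are skipped.

```python
import re

# Two lines copied from the log above; a real run would read the whole file.
sample_log = """Epoch 0/30 - 2.09s - loss: 189.27339 - acc: 0.71444446 - batches: 225
Epoch 1/30 - 1.71s - loss: 158.02815 - acc: 0.8611111 - batches: 225"""

# Assumed line format: "Epoch <n>/<total> - <secs>s - loss: <x> - acc: <y> ..."
EPOCH_RE = re.compile(
    r"Epoch (\d+)/(\d+) - ([\d.]+)s - loss: ([\d.]+) - acc: ([\d.]+)"
)


def parse_training_log(text):
    """Extract epoch number, duration, loss and accuracy from each epoch line."""
    rows = []
    for line in text.splitlines():
        match = EPOCH_RE.match(line)
        if match:
            epoch, _total, seconds, loss, acc = match.groups()
            rows.append(
                {
                    "epoch": int(epoch),
                    "seconds": float(seconds),
                    "loss": float(loss),
                    "acc": float(acc),
                }
            )
    return rows


history = parse_training_log(sample_log)
for row in history:
    print(row)
```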

In [None]:
preds = clf_pipelineModel.transform(test)

In [None]:
preds_df = preds.select("label", "text", "class.result").toPandas()
preds_df.head()

Unnamed: 0,label,text,result
0,financial_statements,Remaining performance obligations represent co...,[financial_statements]
1,risk_factors,In addition under Sections 382 and 383 of the ...,[risk_factors]
2,business,In March 2020 the World Health Organization de...,[business]
3,financial_statements,96\nTable of Contents\n \nNew accounting prono...,[financial_statements]
4,financial_statements,If the Company concludes that it is not more l...,[financial_statements]


In [None]:
# `class.result` is an array because Spark NLP annotators can return several annotations per row.
# Each document yields a single prediction here, so keep the first (and only) item.
preds_df['result'] = preds_df['result'].apply(lambda x : x[0])

In [None]:
# We are going to use sklearn to evaluate the results on the test dataset
from sklearn.metrics import classification_report

print(classification_report(preds_df["label"], preds_df["result"]))

                      precision    recall  f1-score   support

            business       0.78      0.70      0.74        20
financial_statements       0.85      0.95      0.90        41
        risk_factors       0.83      0.77      0.80        39

            accuracy                           0.83       100
           macro avg       0.82      0.81      0.81       100
        weighted avg       0.83      0.83      0.83       100
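
For readers who want to see what the per-class columns of the report mean, the sketch below computes precision, recall, F1 and support by hand. It is a simplified illustration and not scikit-learn's code; the `per_class_metrics` helper and the toy labels are hypothetical.

```python
def per_class_metrics(y_true, y_pred):
    """Compute precision, recall, F1 and support for every label."""
    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)  # correct hits
        fp = sum(t != label and p == label for t, p in pairs)  # false alarms
        fn = sum(t == label and p != label for t, p in pairs)  # misses
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        metrics[label] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": tp + fn,
        }
    return metrics


# Toy example (made-up labels): one "business" text is misread as "risk_factors".
y_true = ["business", "business", "risk_factors", "risk_factors"]
y_pred = ["business", "risk_factors", "risk_factors", "risk_factors"]

report = per_class_metrics(y_true, y_pred)
for label, values in report.items():
    print(label, values)
```

In this toy case `business` has perfect precision but only 0.5 recall: every `business` prediction is right, yet half of the true `business` texts are missed.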



## With RoBERTa Embeddings


We do not have Legal Sentence Embeddings yet, but we can use the Legal RoBERTa token embeddings and average them into a single sentence embedding.
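
Average pooling itself is simple: the token vectors of a document are averaged element by element into one fixed-size vector. The sketch below shows the idea on plain Python lists; it is not the `SentenceEmbeddings` implementation, and the `average_pool` helper and the three-dimensional toy vectors are hypothetical.

```python
def average_pool(token_vectors):
    """Average token vectors element-wise into one fixed-size vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


# Three made-up token vectors of dimension 3 (real embeddings are much larger).
token_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 1.0],
]

sentence_vector = average_pool(token_vectors)
print(sentence_vector)  # [2.0, 2.0, 1.0]
```

The result has the same dimension regardless of how many tokens the text contains, which is what the classifier needs as input.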

In [None]:
embeddings = (
    nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en")
    .setInputCols(["document", "token"])
    .setOutputCol("embeddings")
    .setMaxSentenceLength(512)
)

roberta_embeddings_legal_roberta_base download started this may take some time.
Approximate size to download 447.2 MB
[OK!]


In [None]:
document_assembler = (
    nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
)

tokenizer = nlp.Tokenizer().setInputCols(["document"]).setOutputCol("token")

embeddingsSentence = (
    nlp.SentenceEmbeddings()
    .setInputCols(["document", "embeddings"])
    .setOutputCol("sentence_embeddings")
    .setPoolingStrategy("AVERAGE")
)

classifierdl = (
    legal.ClassifierDLApproach()
    .setInputCols(["sentence_embeddings"])
    .setOutputCol("class")
    .setLabelColumn("label")
    .setMaxEpochs(3)
    .setLr(0.001)
    .setEnableOutputLogs(True)
    .setOutputLogsPath("multiclass_roberta_logs")
    .setBatchSize(4)
    .setDropout(0.15)
)

clf_pipeline = nlp.Pipeline(
    stages=[document_assembler, tokenizer, embeddings, embeddingsSentence, classifierdl]
)

In [None]:
%%time
clf_pipelineModel = clf_pipeline.fit(train)

CPU times: user 9.71 s, sys: 1.47 s, total: 11.2 s
Wall time: 28min 4s


In [None]:
import os
log_files = os.listdir("multiclass_roberta_logs")

with open("multiclass_roberta_logs/"+log_files[0], "r") as log_file :
    print(log_file.read())

Training started - epochs: 3 - learning_rate: 0.001 - batch_size: 4 - training_examples: 900 - classes: 3
Epoch 0/3 - 2.78s - loss: 179.17516 - acc: 0.74666667 - batches: 225
Epoch 1/3 - 2.34s - loss: 148.01608 - acc: 0.8977778 - batches: 225
Epoch 2/3 - 1.70s - loss: 142.09373 - acc: 0.9111111 - batches: 225



In [None]:
preds = clf_pipelineModel.transform(test)

In [None]:
preds_df = preds.select("label", "text", "class.result").toPandas()

In [None]:
preds_df.head()

Unnamed: 0,label,text,result
0,financial_statements,Remaining performance obligations represent co...,[financial_statements]
1,risk_factors,In addition under Sections 382 and 383 of the ...,[risk_factors]
2,business,In March 2020 the World Health Organization de...,[risk_factors]
3,financial_statements,96\nTable of Contents\n \nNew accounting prono...,[financial_statements]
4,financial_statements,If the Company concludes that it is not more l...,[financial_statements]


In [None]:
# Keep the first (and only) predicted label from each result array
preds_df['result'] = preds_df['result'].apply(lambda x : x[0])

from sklearn.metrics import classification_report

print(classification_report(preds_df['label'], preds_df['result']))


                      precision    recall  f1-score   support

            business       1.00      0.70      0.82        20
financial_statements       0.90      0.93      0.92        41
        risk_factors       0.84      0.95      0.89        39

            accuracy                           0.89       100
           macro avg       0.92      0.86      0.88       100
        weighted avg       0.90      0.89      0.89       100



## Save the model and zip it for Models Hub upload/download

In [None]:
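# Save only the trained classifier (the last stage of the fitted pipeline)
clf_pipelineModel.stages[-1].write().overwrite().save('ClfBert')

# cd into the saved directory and zip its contents (zip runs only if cd succeeds)
! cd /content/ClfBert && zip -r /content/ClfBert.zip *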

  adding: classifierdl_tensorflow (deflated 58%)
  adding: fields/ (stored 0%)
  adding: fields/datasetParams/ (stored 0%)
  adding: fields/datasetParams/_SUCCESS (stored 0%)
  adding: fields/datasetParams/.part-00000.crc (stored 0%)
  adding: fields/datasetParams/part-00001 (deflated 30%)
  adding: fields/datasetParams/part-00000 (deflated 26%)
  adding: fields/datasetParams/.part-00001.crc (stored 0%)
  adding: fields/datasetParams/._SUCCESS.crc (stored 0%)
  adding: metadata/ (stored 0%)
  adding: metadata/_SUCCESS (stored 0%)
  adding: metadata/.part-00000.crc (stored 0%)
  adding: metadata/part-00000 (deflated 40%)
  adding: metadata/._SUCCESS.crc (stored 0%)
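
Outside Colab, or where the `zip` command is not available, the same archive can be produced from Python with the standard library. This is a minimal sketch: the `zip_model_dir` helper is hypothetical, and the demo runs on a throwaway directory that only stands in for a saved model folder.

```python
import os
import shutil
import tempfile
import zipfile


def zip_model_dir(model_dir, archive_base):
    """Zip the contents of model_dir into archive_base + '.zip'; return the path."""
    return shutil.make_archive(archive_base, "zip", root_dir=model_dir)


# Demo on a temporary directory that mimics a saved model layout.
with tempfile.TemporaryDirectory() as tmp:
    model_dir = os.path.join(tmp, "ClfBert")
    os.makedirs(os.path.join(model_dir, "metadata"))
    with open(os.path.join(model_dir, "metadata", "part-00000"), "w") as handle:
        handle.write("{}")

    archive_path = zip_model_dir(model_dir, os.path.join(tmp, "ClfBert"))
    with zipfile.ZipFile(archive_path) as archive:
        archived_names = archive.namelist()

print(archived_names)
```

As with the shell command above, paths inside the archive are relative to the model directory, so the archive unpacks directly into the model's files.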
