# Training Classification Models with Legal NLP


![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/4.1.Training_Legal_Classifiers.ipynb)

In this notebook, you will learn how to use Spark NLP and Legal NLP to train custom classification models.

Let's dive in!

# Colab Setup

In [1]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs


In [2]:
from johnsnowlabs import nlp, legal
# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM
# Make sure to restart your notebook afterwards for the changes to take effect
nlp.install(force_browser=True)

Downloading license...
Licenses extracted successfully
📋 Stored John Snow Labs License in /root/.johnsnowlabs/licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json
👷 Setting up  John Snow Labs home in /root/.johnsnowlabs, this might take a few minutes.
Downloading 🐍+🚀 Python Library spark_nlp-4.2.4-py2.py3-none-any.whl
Downloading 🐍+💊 Python Library spark_nlp_jsl-4.2.4-py3-none-any.whl
Downloading 🫘+🚀 Java Library spark-nlp-assembly-4.2.4.jar
Downloading 🫘+💊 Java Library spark-nlp-jsl-4.2.4.jar
🙆 JSL Home setup in /root/.johnsnowlabs
Installing /root/.johnsnowlabs/py_installs/spark_nlp_jsl-4.2.4-py3-none-any.whl to /usr/bin/python3
Running: /usr/bin/python3 -m pip install /root/.johnsnowlabs/py_installs/spark_nlp_jsl-4.2.4-py3-none-any.whl
Installed 1 products:
💊 Spark-Healthcare==4.2.4 installed! ✅ Heal the planet with NLP! 


## Start Spark Session

In [3]:
from johnsnowlabs import nlp, legal 
# Automatically load license data and start a session with all jars user has access to
spark = nlp.start()

👌 Launched cpu optimized session with with: 🚀Spark-NLP==4.2.4, 💊Spark-Healthcare==4.2.4, running on ⚡ PySpark==3.1.2


## Introduction

For the text classification tasks, we will use two annotators:

- `MultiClassifierDL`: performs multilabel classification (it can predict more than one class for each text) using a bidirectional GRU with convolution architecture built with TensorFlow. It supports up to 100 classes. The inputs are sentence embeddings, such as those produced by `UniversalSentenceEncoder`, `BertSentenceEmbeddings`, or `SentenceEmbeddings`.
- `ClassifierDL`: a deep learning model (DNN) built with TensorFlow that supports binary and multiclass classification (up to 100 classes). It takes sentence embeddings as input, such as those produced by the Universal Sentence Encoder.
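
The difference between the two annotators comes down to how the target is encoded. The following is a minimal sketch in plain Python (no Spark required); the label vocabulary and the helpers `encode_multiclass` and `encode_multilabel` are hypothetical and only illustrate the idea, they are not part of Spark NLP.

```python
# Hypothetical label vocabulary, for illustration only.
classes = ["amendments", "notices", "survival"]


def encode_multiclass(label, classes):
    """Multiclass: exactly one class per text, encoded as a one-hot vector."""
    return [1 if c == label else 0 for c in classes]


def encode_multilabel(labels, classes):
    """Multilabel: any number of classes per text, encoded as a multi-hot vector."""
    wanted = set(labels)
    return [1 if c in wanted else 0 for c in classes]


one_hot = encode_multiclass("notices", classes)
multi_hot = encode_multilabel(["amendments", "survival"], classes)
print(one_hot)    # [0, 1, 0]
print(multi_hot)  # [1, 0, 1]
```

A multilabel target may contain several ones, or none at all, whereas a multiclass target always contains exactly one.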

# Multilabel classifier training

## Loading the data

In [4]:
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings_JSL/Legal/data/finance_data.csv

In [5]:
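import ast

import pandas as pd

df = pd.read_csv('./finance_data.csv')
# The labels are stored as stringified Python lists; literal_eval parses them without executing arbitrary code
df['label'] = df['label'].apply(ast.literal_eval)
print(f"Shape of the full dataset: {df.shape}")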

Shape of the full dataset: (27527, 2)


> We will use a sample of this dataset to make the training process faster, since the goal here is only to illustrate how training works. Use the full dataset if you want to experiment and achieve more realistic results.
>
> The sample contains only 500 observations; keep in mind that this limits the accuracy and generalization capabilities of the model. The split in the next cell passes the weights `[0.9, 0.2]` to `randomSplit`; Spark normalizes weights that do not sum to 1, so roughly 82% of the sample is used for training and 18% for testing.

In [6]:
data = spark.createDataFrame(df)

# If you have a single dataset, split it; otherwise, load the test dataset the same way as the training data.
# randomSplit normalizes its weights, so [0.9, 0.2] yields roughly an 82%/18% split.
train, test = data.limit(500).randomSplit([0.9, 0.2], seed=42)
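
The weights passed to `randomSplit` above do not sum to 1, and PySpark normalizes them in that case. The sketch below reproduces only that arithmetic in plain Python; `normalize_weights` is a hypothetical helper written for illustration, not a Spark function.

```python
def normalize_weights(weights):
    """Rescale non-negative weights so that they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


fractions = normalize_weights([0.9, 0.2])
print([round(f, 3) for f in fractions])  # [0.818, 0.182]
```

Because each row is assigned at random, the actual row counts only approximate these fractions. Passing `[0.9, 0.1]` instead would target a 90%/10% split.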

In [7]:
train.show(truncate=50)

+--------------------------------------------------+---------------------+
|                                         provision|                label|
+--------------------------------------------------+---------------------+
|(a) Seller, the Agent, each Managing Agent, eac...|        [assignments]|
|(a)  The provisions of this Agreement shall be ...|[assigns, successors]|
|(a) This Agreement may be executed by one or mo...|       [counterparts]|
|All Bank Expenses (including reasonable attorne...|           [expenses]|
|All agreements, representations and warranties ...|           [survival]|
|All communications hereunder will be in writing...|            [notices]|
|All covenants, agreements, representations and ...|           [survival]|
|All covenants, agreements, representations and ...|           [survival]|
|All demands, notices and communications hereund...|            [notices]|
|All headings and subdivisions of this Agreement...|       [severability]|
|All issues and questions

In [8]:
from pyspark.sql.functions import col

test.groupBy("label").count().orderBy(col("count").desc()).show()


+--------------------+-----+
|               label|count|
+--------------------+-----+
|      [counterparts]|    9|
| [entire agreements]|    8|
|        [amendments]|    7|
|           [notices]|    7|
|      [severability]|    5|
|          [survival]|    4|
|           [waivers]|    4|
|[assigns, success...|    4|
|[representations,...|    3|
|      [terminations]|    3|
|       [assignments]|    3|
|          [expenses]|    2|
|    [governing laws]|    2|
|[governing laws, ...|    2|
|        [successors]|    2|
|   [representations]|    1|
|        [warranties]|    1|
|[amendments, enti...|    1|
+--------------------+-----+



## With Universal Sentence Encoder

In [10]:
document_assembler = (
    nlp.DocumentAssembler()
    .setInputCol("provision")
    .setOutputCol("document")
    .setCleanupMode("shrink")
)

embeddings = (
    nlp.UniversalSentenceEncoder.pretrained()
    .setInputCols("document")
    .setOutputCol("sentence_embeddings")
)

classifierdl = (
    nlp.MultiClassifierDLApproach()
    .setInputCols(["sentence_embeddings"])
    .setOutputCol("class")
    .setLabelColumn("label")
    .setMaxEpochs(30)
    .setLr(0.001)
    .setRandomSeed(42)
    .setEnableOutputLogs(True)
    .setOutputLogsPath("multilabel_use_logs")
    .setBatchSize(8)
)

clf_pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, classifierdl])

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]


Because this model can take a long time to train, we limit the size of the training data to avoid having it train for hours.

> Please note that this reduction can greatly impact the performance of the model.

In [11]:
%%time
clf_pipelineModel = clf_pipeline.fit(train)

CPU times: user 337 ms, sys: 42.6 ms, total: 380 ms
Wall time: 1min 1s


In [12]:
import os
log_file_name = os.listdir("multilabel_use_logs")[0]

with open("multilabel_use_logs/"+log_file_name, "r") as log_file :
    print(log_file.read())

Training started - epochs: 30 - learning_rate: 0.001 - batch_size: 8 - training_examples: 432 - classes: 15
Epoch 0/30 - 3.28s - loss: 0.34755498 - acc: 0.91280854 - batches: 54
Epoch 1/30 - 0.73s - loss: 0.24667196 - acc: 0.9234566 - batches: 54
Epoch 2/30 - 0.72s - loss: 0.2096713 - acc: 0.93163574 - batches: 54
Epoch 3/30 - 0.70s - loss: 0.17545508 - acc: 0.94074094 - batches: 54
Epoch 4/30 - 0.70s - loss: 0.15081559 - acc: 0.9478394 - batches: 54
Epoch 5/30 - 0.69s - loss: 0.13488992 - acc: 0.9533951 - batches: 54
Epoch 6/30 - 0.68s - loss: 0.12369589 - acc: 0.9557098 - batches: 54
Epoch 7/30 - 0.68s - loss: 0.114798985 - acc: 0.9583333 - batches: 54
Epoch 8/30 - 0.67s - loss: 0.10730778 - acc: 0.9598766 - batches: 54
Epoch 9/30 - 0.68s - loss: 0.10088395 - acc: 0.96311724 - batches: 54
Epoch 10/30 - 0.68s - loss: 0.09533796 - acc: 0.96543205 - batches: 54
Epoch 11/30 - 0.68s - loss: 0.09051009 - acc: 0.9674382 - batches: 54
Epoch 12/30 - 0.71s - loss: 0.0862858 - acc: 0.9688271 - 
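
The log is plain text, so it can be turned into structured records for plotting or comparison. The following is a minimal sketch that assumes the line format shown above; the regular expression and the `history` structure are illustrative choices, not something Spark NLP provides.

```python
import math
import re

# Three lines copied from the training log above.
log_text = """Training started - epochs: 30 - learning_rate: 0.001 - batch_size: 8 - training_examples: 432 - classes: 15
Epoch 0/30 - 3.28s - loss: 0.34755498 - acc: 0.91280854 - batches: 54
Epoch 1/30 - 0.73s - loss: 0.24667196 - acc: 0.9234566 - batches: 54"""

epoch_pattern = re.compile(
    r"Epoch (\d+)/(\d+) - ([\d.]+)s - loss: ([\d.]+) - acc: ([\d.]+) - batches: (\d+)"
)

# One dictionary per epoch line; the header line does not match and is skipped.
history = [
    {
        "epoch": int(m.group(1)),
        "seconds": float(m.group(3)),
        "loss": float(m.group(4)),
        "acc": float(m.group(5)),
        "batches": int(m.group(6)),
    }
    for m in epoch_pattern.finditer(log_text)
]

for row in history:
    print(row)

# The batch count follows from the header: ceil(training_examples / batch_size).
expected_batches = math.ceil(432 / 8)
print(expected_batches)  # 54
```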

In [13]:
preds = clf_pipelineModel.transform(test)

In [14]:
preds_df = preds.select('label','provision',"class.result").toPandas()
preds_df.head()

| | label | provision | result |
|---|---|---|---|
| 0 | [governing laws, entire agreements] | (a) THIS AGREEMENT AND ANY CLAIM, CONTROVERSY,... | [governing laws, entire agreements] |
| 1 | [survival] | All agreements, statements, representations an... | [warranties] |
| 2 | [survival] | All covenants of the Company contained in this... | [] |
| 3 | [survival] | All indemnities set forth herein including, wi... | [survival] |
| 4 | [notices] | All notices and other communications hereunder... | [notices] |
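
Row 2 above has an empty prediction. A multilabel classifier can return no label at all when none of its per-class scores is high enough. The sketch below assumes independent per-class scores and a decision threshold of 0.5; the scores are made up and `select_labels` is a hypothetical helper, so treat it as an illustration of the idea rather than the model's actual internals.

```python
# Hypothetical per-class scores for one provision (not real model output).
scores = {"survival": 0.41, "warranties": 0.38, "notices": 0.03}


def select_labels(scores, threshold=0.5):
    """Keep every class whose score reaches the threshold; may return an empty list."""
    return [label for label, score in scores.items() if score >= threshold]


print(select_labels(scores))                 # []
print(select_labels(scores, threshold=0.4))  # ['survival']
```

Lowering the threshold trades precision for recall: fewer texts are left without a label, but more incorrect labels are accepted.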


In [15]:
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score

mlb = MultiLabelBinarizer()

y_true = mlb.fit_transform(preds_df['label'])
y_pred = mlb.transform(preds_df['result'])


print("Classification report: \n", (classification_report(y_true, y_pred)))
print("F1 micro averaging:",(f1_score(y_true, y_pred, average='micro')))
print("ROC: ",(roc_auc_score(y_true, y_pred, average="micro")))


Classification report: 
               precision    recall  f1-score   support

           0       0.80      0.50      0.62         8
           1       0.00      0.00      0.00         3
           2       0.67      1.00      0.80         4
           3       0.89      0.89      0.89         9
           4       0.91      0.91      0.91        11
           5       1.00      0.50      0.67         2
           6       0.80      1.00      0.89         4
           7       1.00      0.86      0.92         7
           8       1.00      1.00      1.00         4
           9       1.00      1.00      1.00         5
          10       0.75      1.00      0.86         6
          11       0.67      0.50      0.57         4
          12       0.50      0.33      0.40         3
          13       0.67      0.50      0.57         4
          14       0.50      0.75      0.60         4

   micro avg       0.80      0.77      0.78        78
   macro avg       0.74      0.72      0.71        78
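
To make the micro-averaged F1 score concrete, the sketch below computes it by hand from lists of labels, using only the five rows displayed in the prediction preview for this model. `micro_f1` is a hypothetical helper; the figure it prints covers these five rows only and is not comparable to the report above, which covers the whole test set.

```python
def micro_f1(true_sets, pred_sets):
    """Micro-averaged F1: pool true positives, false positives and false negatives over all labels."""
    tp = fp = fn = 0
    for true, pred in zip(true_sets, pred_sets):
        true, pred = set(true), set(pred)
        tp += len(true & pred)   # labels predicted and correct
        fp += len(pred - true)   # labels predicted but wrong
        fn += len(true - pred)   # labels missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# The five rows displayed in the prediction preview.
y_true_rows = [["governing laws", "entire agreements"], ["survival"], ["survival"], ["survival"], ["notices"]]
y_pred_rows = [["governing laws", "entire agreements"], ["warranties"], [], ["survival"], ["notices"]]

score = micro_f1(y_true_rows, y_pred_rows)
print(round(score, 3))  # 0.727
```

Here there are 4 true positives, 1 false positive, and 2 false negatives, which gives a precision of 0.8, a recall of about 0.667, and an F1 of 8/11.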

## With RoBERTa Embeddings

We do not have any Legal-specific sentence embeddings, but we can use the Legal RoBERTa token embeddings and average them into sentence embeddings.
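
Averaging means taking the element-wise mean of all token vectors in a document. The sketch below shows the operation in plain Python on tiny made-up vectors; `average_pool` is a hypothetical helper that illustrates the arithmetic, not the Spark NLP implementation.

```python
def average_pool(token_vectors):
    """Average token vectors dimension by dimension into one sentence vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    count = len(token_vectors)
    return [sum(dim) / count for dim in zip(*token_vectors)]


# Three hypothetical 4-dimensional token embeddings (real embeddings are much larger).
tokens = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 0.0, 2.0, 0.0],
    [2.0, 3.0, 2.0, 2.0],
]
sentence_vector = average_pool(tokens)
print(sentence_vector)  # [2.0, 1.0, 2.0, 2.0]
```

The result has the same dimensionality as a single token vector regardless of how many tokens the document contains, which is what lets the classifier accept texts of any length.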

In [16]:
embeddings = (
    nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en")
    .setInputCols(["document", "token"])
    .setOutputCol("embeddings")
    .setMaxSentenceLength(512)
)

roberta_embeddings_legal_roberta_base download started this may take some time.
Approximate size to download 447.2 MB
[OK!]


This model takes longer to train, so we limit the number of epochs to `5`.

In [17]:
document_assembler = (
    nlp.DocumentAssembler().setInputCol("provision").setOutputCol("document")
)

tokenizer = nlp.Tokenizer().setInputCols(["document"]).setOutputCol("token")

embeddingsSentence = (
    nlp.SentenceEmbeddings()
    .setInputCols(["document", "embeddings"])
    .setOutputCol("sentence_embeddings")
    .setPoolingStrategy("AVERAGE")
)

classifierdl = (
    nlp.MultiClassifierDLApproach()
    .setInputCols(["sentence_embeddings"])
    .setOutputCol("class")
    .setLabelColumn("label")
    .setMaxEpochs(5)
    .setLr(0.001)
    .setRandomSeed(42)
    .setEnableOutputLogs(True)
    .setOutputLogsPath("multilabel_roberta_logs")
    .setBatchSize(8)
)


clf_pipeline = nlp.Pipeline(
    stages=[
        document_assembler,
        tokenizer,
        embeddings,
        embeddingsSentence,
        classifierdl,
    ]
)

In [18]:
%%time
clf_pipelineModel = clf_pipeline.fit(train)

CPU times: user 3.65 s, sys: 461 ms, total: 4.11 s
Wall time: 10min 3s


In [19]:
preds = clf_pipelineModel.transform(test)

In [20]:
preds_df = preds.select('provision','label',"class.result").toPandas()

In [21]:
preds_df.head()

| | provision | label | result |
|---|---|---|---|
| 0 | (a) THIS AGREEMENT AND ANY CLAIM, CONTROVERSY,... | [governing laws, entire agreements] | [governing laws, entire agreements] |
| 1 | All agreements, statements, representations an... | [survival] | [] |
| 2 | All covenants of the Company contained in this... | [survival] | [] |
| 3 | All indemnities set forth herein including, wi... | [survival] | [survival] |
| 4 | All notices and other communications hereunder... | [notices] | [notices] |


In [22]:
import os
log_file_name = os.listdir("multilabel_roberta_logs")[0]

with open("multilabel_roberta_logs/"+log_file_name, "r") as log_file :
    print(log_file.read())

Training started - epochs: 5 - learning_rate: 0.001 - batch_size: 8 - training_examples: 432 - classes: 15
Epoch 0/5 - 3.25s - loss: 0.28196928 - acc: 0.91358006 - batches: 54
Epoch 1/5 - 0.71s - loss: 0.16264302 - acc: 0.9490741 - batches: 54
Epoch 2/5 - 0.71s - loss: 0.1064876 - acc: 0.9643518 - batches: 54
Epoch 3/5 - 0.68s - loss: 0.0797876 - acc: 0.97268516 - batches: 54
Epoch 4/5 - 0.69s - loss: 0.06431785 - acc: 0.9768517 - batches: 54



In [23]:
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score

mlb = MultiLabelBinarizer()

y_true = mlb.fit_transform(preds_df['label'])
y_pred = mlb.transform(preds_df['result'])


print("Classification report: \n", (classification_report(y_true, y_pred)))
print("F1 micro averaging:",(f1_score(y_true, y_pred, average='micro')))
print("ROC: ",(roc_auc_score(y_true, y_pred, average="micro")))


Classification report: 
               precision    recall  f1-score   support

           0       1.00      0.62      0.77         8
           1       0.00      0.00      0.00         3
           2       0.75      0.75      0.75         4
           3       1.00      0.89      0.94         9
           4       1.00      0.91      0.95        11
           5       1.00      0.50      0.67         2
           6       1.00      1.00      1.00         4
           7       0.88      1.00      0.93         7
           8       1.00      0.50      0.67         4
           9       1.00      1.00      1.00         5
          10       1.00      0.67      0.80         6
          11       0.33      0.25      0.29         4
          12       0.00      0.00      0.00         3
          13       1.00      0.50      0.67         4
          14       0.50      0.25      0.33         4

   micro avg       0.91      0.68      0.78        78
   macro avg       0.76      0.59      0.65        78

### Saving & loading back the trained model

In [24]:
clf_pipelineModel.stages

[DocumentAssembler_1ec143588b9d,
 REGEX_TOKENIZER_b78a842f9d91,
 ROBERTA_EMBEDDINGS_b915dff90901,
 SentenceEmbeddings_6ffc2b49f70a,
 MultiClassifierDLModel_4bfad35629ae]

In [25]:
clf_pipelineModel.stages[-1].write().overwrite().save('MultilabelClfRoBerta')

In [26]:
# Load the saved multilabel classifier model back
MultilabelClfModel = nlp.MultiClassifierDLModel.load('MultilabelClfRoBerta')

In [27]:
ld_pipeline = nlp.Pipeline(
    stages=[
        document_assembler,
        tokenizer,
        embeddings,
        embeddingsSentence,
        MultilabelClfModel,
    ]
)
ld_pipeline_model = ld_pipeline.fit(spark.createDataFrame([[""]]).toDF("provision"))

In [28]:
# Apply Model Transform to testData
ld_preds = ld_pipeline_model.transform(test)

In [29]:
ld_preds_df = ld_preds.select('provision','label',"class.result").toPandas()

In [30]:
ld_preds_df.head(10)

| | provision | label | result |
|---|---|---|---|
| 0 | (a) THIS AGREEMENT AND ANY CLAIM, CONTROVERSY,... | [governing laws, entire agreements] | [governing laws, entire agreements] |
| 1 | All agreements, statements, representations an... | [survival] | [] |
| 2 | All covenants of the Company contained in this... | [survival] | [] |
| 3 | All indemnities set forth herein including, wi... | [survival] | [survival] |
| 4 | All notices and other communications hereunder... | [notices] | [notices] |
| 5 | All notices and other communications required ... | [notices] | [notices] |
| 6 | All notices, requests and other communications... | [notices] | [notices] |
| 7 | All representations and warranties made by the... | [representations] | [representations, warranties] |
| 8 | All representations, warranties, covenants and... | [survival] | [] |
| 9 | Any amendment to the Plan shall be deemed to b... | [amendments] | [] |


# Multiclass classifier training

The `ClassifierDLApproach` annotator trains a multiclass model, where each prediction is exactly one category out of a predefined set of categories present in the training data.
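
Choosing exactly one category can be pictured as taking the highest-scoring class. The following is a minimal sketch in plain Python, assuming the model produces one raw score per class; the scores are made up, and the notebook does not specify the actual output layer of `ClassifierDL`, so treat the softmax step as an assumption.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


classes = ["business", "financial_statements", "risk_factors"]
logits = [0.2, 2.5, 1.1]  # hypothetical raw scores for one text

probs = softmax(logits)
predicted = classes[probs.index(max(probs))]
print(predicted)  # financial_statements
```

Unlike the multilabel case, there is no threshold here: one class always wins, so the prediction is never empty.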

## Loading the data

In [38]:
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings_JSL/Finance/data/finance_clf_data.csv

In [39]:
import pandas as pd
df = pd.read_csv('finance_clf_data.csv', encoding="utf8")
print(f"Shape of the full dataset: {df.shape}")

Shape of the full dataset: (6128, 3)


In [40]:
df.head()

| | text | label | len |
|---|---|---|---|
| 0 | Presently we do not believe any U S or State r... | business | 402 |
| 1 | \nnetwork outages or performance degradation ... | risk_factors | 496 |
| 2 | Available Information\nOur reports filed with ... | business | 356 |
| 3 | \n 42 530\n \n \n \n \n \n 42 530\nTotal liab... | financial_statements | 359 |
| 4 | 8\nTable of Contents\ndevelopment employee eng... | business | 582 |


In [41]:
df['label'].value_counts()

risk_factors               1926
financial_statements       1888
business                    970
financial_conditions        346
form_10k_summary            240
executives_compensation     155
controls_procedures         138
equity                      111
market_risk                 100
executives                   73
legal_proceedings            51
properties                   48
security_ownership           46
exhibits                     36
Name: label, dtype: int64

Because deep learning models can take some time to train, we limit the dataset to a smaller number of observations. This is enough to illustrate how to use the Spark NLP and Legal NLP annotators and pipelines to train a model without a long wait.

Please note that the quality and quantity of the training data strongly affect the trained model, and the results obtained here are for illustration purposes only. To obtain a more realistic model, please consider using the full dataset or adding extra observations from different sources.

In [42]:
from sklearn.model_selection import train_test_split

# Keep only the top 3 categories (by number of observations)
filter_classes = ["risk_factors", "financial_statements", "business"]

# Draw a random sample of 1000 observations (seeded so the sample is reproducible)
df = df.loc[df.label.isin(filter_classes)].sample(1000, random_state=42)

# Stratify split for train and test datasets
train_data, test_data = train_test_split(
    df, train_size=0.9, stratify=df.label, random_state=42
)

# Send to spark
train = spark.createDataFrame(train_data)
test = spark.createDataFrame(test_data)

In [43]:
from pyspark.sql.functions import col

train.groupBy("label").count().orderBy(col("count").desc()).show()

+--------------------+-----+
|               label|count|
+--------------------+-----+
|financial_statements|  366|
|        risk_factors|  356|
|            business|  178|
+--------------------+-----+



In [44]:
from pyspark.sql.functions import col

test.groupBy("label").count().orderBy(col("count").desc()).show()

+--------------------+-----+
|               label|count|
+--------------------+-----+
|financial_statements|   41|
|        risk_factors|   39|
|            business|   20|
+--------------------+-----+
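
A stratified split is meant to keep the class proportions nearly identical in both subsets. The sketch below checks this using the counts from the two tables above; `proportions` is a hypothetical helper introduced for illustration.

```python
# Counts copied from the two tables above.
train_counts = {"financial_statements": 366, "risk_factors": 356, "business": 178}
test_counts = {"financial_statements": 41, "risk_factors": 39, "business": 20}


def proportions(counts):
    """Convert raw class counts into fractions of the total."""
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


train_share = proportions(train_counts)
test_share = proportions(test_counts)
for label in train_counts:
    print(f"{label}: train={train_share[label]:.3f} test={test_share[label]:.3f}")
```

Each class differs by less than one percentage point between the training and test sets, which is the behavior expected from `stratify=df.label`.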



## With Universal Sentence Encoder

In [45]:
document_assembler = (
    nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
)

embeddings = (
    nlp.UniversalSentenceEncoder.pretrained()
    .setInputCols("document")
    .setOutputCol("sentence_embeddings")
)

classifierdl = (
    legal.ClassifierDLApproach()
    .setInputCols(["sentence_embeddings"])
    .setOutputCol("class")
    .setLabelColumn("label")
    .setMaxEpochs(30)
    .setEnableOutputLogs(True)
    .setOutputLogsPath("multiclass_use_logs")
    .setLr(0.001)
    .setBatchSize(4)
    .setDropout(0.15)
)

clf_pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, classifierdl])

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]


In [46]:
%%time
clf_pipelineModel = clf_pipeline.fit(train)

CPU times: user 365 ms, sys: 43 ms, total: 408 ms
Wall time: 50.9 s


In [47]:
import os
log_file_name = os.listdir("multiclass_use_logs")[0]

with open("multiclass_use_logs/"+log_file_name, "r") as log_file :
    print(log_file.read())

Training started - epochs: 30 - learning_rate: 0.001 - batch_size: 4 - training_examples: 900 - classes: 3
Epoch 0/30 - 1.66s - loss: 189.3935 - acc: 0.70111114 - batches: 225
Epoch 1/30 - 1.27s - loss: 157.496 - acc: 0.84555554 - batches: 225
Epoch 2/30 - 1.24s - loss: 151.73296 - acc: 0.8711111 - batches: 225
Epoch 3/30 - 1.31s - loss: 149.28844 - acc: 0.8811111 - batches: 225
Epoch 4/30 - 1.30s - loss: 147.5462 - acc: 0.8933333 - batches: 225
Epoch 5/30 - 1.27s - loss: 146.10075 - acc: 0.9011111 - batches: 225
Epoch 6/30 - 1.25s - loss: 144.94044 - acc: 0.9088889 - batches: 225
Epoch 7/30 - 1.19s - loss: 144.06827 - acc: 0.91333336 - batches: 225
Epoch 8/30 - 1.27s - loss: 143.42535 - acc: 0.91333336 - batches: 225
Epoch 9/30 - 1.28s - loss: 142.91154 - acc: 0.9166667 - batches: 225
Epoch 10/30 - 1.24s - loss: 142.4443 - acc: 0.92333335 - batches: 225
Epoch 11/30 - 1.18s - loss: 141.97804 - acc: 0.9266667 - batches: 225
Epoch 12/30 - 1.19s - loss: 141.51053 - acc: 0.9288889 - batche

In [48]:
preds = clf_pipelineModel.transform(test)

In [49]:
preds_df = preds.select("label", "text", "class.result").toPandas()
preds_df.head()

| | label | text | result |
|---|---|---|---|
| 0 | financial_statements | 559 \n211 \n697 \nNet loss\n \n 190 890 \n \n ... | [financial_statements] |
| 1 | financial_statements | \nTotal operating lease liabilities\n \n \n5 ... | [financial_statements] |
| 2 | business | \nVeeva Vault Study Startup\n helps life scie... | [business] |
| 3 | business | \n Simulation Platform provides large scale v... | [business] |
| 4 | financial_statements | \n5 796 \nThereafter\n15 794 \n \n15 794 \nTo... | [financial_statements] |


In [50]:
# The result column holds an array because Spark NLP can return one annotation per sentence.
# Each text is a single document here, so take the first item of each array.
preds_df['result'] = preds_df['result'].apply(lambda x : x[0])

In [51]:
# We use sklearn to evaluate the results on the test dataset
from sklearn.metrics import classification_report

print(classification_report(preds_df["label"], preds_df["result"]))

                      precision    recall  f1-score   support

            business       0.64      0.70      0.67        20
financial_statements       0.93      0.93      0.93        41
        risk_factors       0.84      0.79      0.82        39

            accuracy                           0.83       100
           macro avg       0.80      0.81      0.80       100
        weighted avg       0.83      0.83      0.83       100



### Saving & loading back the trained model

In [52]:
clf_pipelineModel.stages

[DocumentAssembler_9a17ea1645b8,
 UNIVERSAL_SENTENCE_ENCODER_4de71669b7ec,
 LegalClassifierDLModel_68cf14f784dd]

In [53]:
clf_pipelineModel.stages[-1].write().overwrite().save('Clf_Use')

In [54]:
# Load the saved classifier model back
ClfModel = legal.ClassifierDLModel.load('Clf_Use')

In [55]:
ld_pipeline = nlp.Pipeline(stages=[document_assembler, embeddings,ClfModel])
ld_pipeline_model = ld_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))

In [56]:
# Apply Model Transform to testData
ld_preds = ld_pipeline_model.transform(test)

In [57]:
ld_preds_df = ld_preds.select('text','label',"class.result").toPandas()

In [58]:
ld_preds_df.head()

| | text | label | result |
|---|---|---|---|
| 0 | 559 \n211 \n697 \nNet loss\n \n 190 890 \n \n ... | financial_statements | [financial_statements] |
| 1 | \nTotal operating lease liabilities\n \n \n5 ... | financial_statements | [financial_statements] |
| 2 | \nVeeva Vault Study Startup\n helps life scie... | business | [business] |
| 3 | \n Simulation Platform provides large scale v... | business | [business] |
| 4 | \n5 796 \nThereafter\n15 794 \n \n15 794 \nTo... | financial_statements | [financial_statements] |


## With RoBERTa Embeddings

We do not have Legal sentence embeddings yet, but we can use the Legal RoBERTa token embeddings and average them.

In [59]:
embeddings = (
    nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en")
    .setInputCols(["document", "token"])
    .setOutputCol("embeddings")
    .setMaxSentenceLength(512)
)

roberta_embeddings_legal_roberta_base download started this may take some time.
Approximate size to download 447.2 MB
[OK!]


In [60]:
document_assembler = (
    nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
)

tokenizer = nlp.Tokenizer().setInputCols(["document"]).setOutputCol("token")

embeddingsSentence = (
    nlp.SentenceEmbeddings()
    .setInputCols(["document", "embeddings"])
    .setOutputCol("sentence_embeddings")
    .setPoolingStrategy("AVERAGE")
)

classifierdl = (
    legal.ClassifierDLApproach()
    .setInputCols(["sentence_embeddings"])
    .setOutputCol("class")
    .setLabelColumn("label")
    .setMaxEpochs(5)
    .setLr(0.001)
    .setEnableOutputLogs(True)
    .setOutputLogsPath("multiclass_roberta_logs")
    .setBatchSize(4)
    .setDropout(0.15)
)

clf_pipeline = nlp.Pipeline(
    stages=[document_assembler, tokenizer, embeddings, embeddingsSentence, classifierdl]
)

In [62]:
%%time
clf_pipelineModel = clf_pipeline.fit(train)

CPU times: user 8.45 s, sys: 980 ms, total: 9.43 s
Wall time: 24min 3s


In [63]:
preds = clf_pipelineModel.transform(test)

In [64]:
preds_df = preds.select("label", "text", "class.result").toPandas()

In [65]:
preds_df.head()

| | label | text | result |
|---|---|---|---|
| 0 | financial_statements | 559 \n211 \n697 \nNet loss\n \n 190 890 \n \n ... | [financial_statements] |
| 1 | financial_statements | \nTotal operating lease liabilities\n \n \n5 ... | [financial_statements] |
| 2 | business | \nVeeva Vault Study Startup\n helps life scie... | [business] |
| 3 | business | \n Simulation Platform provides large scale v... | [business] |
| 4 | financial_statements | \n5 796 \nThereafter\n15 794 \n \n15 794 \nTo... | [financial_statements] |


In [66]:
log_files = os.listdir("multiclass_roberta_logs")

with open("multiclass_roberta_logs/"+log_files[0], "r") as log_file :
    print(log_file.read())

Training started - epochs: 5 - learning_rate: 0.001 - batch_size: 4 - training_examples: 900 - classes: 3
Epoch 0/5 - 83.26s - loss: 182.05006 - acc: 0.7188889 - batches: 225
Epoch 1/5 - 89.49s - loss: 146.9867 - acc: 0.89111114 - batches: 225
Epoch 2/5 - 79.53s - loss: 142.58612 - acc: 0.9077778 - batches: 225
Epoch 3/5 - 80.60s - loss: 140.33891 - acc: 0.91888887 - batches: 225
Epoch 4/5 - 67.85s - loss: 139.06071 - acc: 0.92444444 - batches: 225
Training started - epochs: 5 - learning_rate: 0.001 - batch_size: 4 - training_examples: 900 - classes: 3
Epoch 0/5 - 1.62s - loss: 196.78406 - acc: 0.6666667 - batches: 225
Epoch 1/5 - 1.24s - loss: 149.72516 - acc: 0.88555557 - batches: 225
Epoch 2/5 - 1.22s - loss: 142.92361 - acc: 0.9022222 - batches: 225
Epoch 3/5 - 1.25s - loss: 140.43 - acc: 0.91333336 - batches: 225
Epoch 4/5 - 1.23s - loss: 139.03525 - acc: 0.9177778 - batches: 225



In [67]:
# Take the first item of each result array (one prediction per document)
preds_df['result'] = preds_df['result'].apply(lambda x : x[0])

from sklearn.metrics import classification_report

print (classification_report(preds_df['label'], preds_df['result']))


                      precision    recall  f1-score   support

            business       0.81      0.85      0.83        20
financial_statements       0.90      0.93      0.92        41
        risk_factors       0.89      0.85      0.87        39

            accuracy                           0.88       100
           macro avg       0.87      0.87      0.87       100
        weighted avg       0.88      0.88      0.88       100



# Save the model and zip it for Models Hub upload/download

In [68]:
# Save a Spark NLP model
clf_pipelineModel.stages[-1].write().overwrite().save('ClfBert')

# cd into saved dir and zip
! cd /content/ClfBert ; zip -r /content/ClfBert.zip *

  adding: classifierdl_tensorflow (deflated 58%)
  adding: fields/ (stored 0%)
  adding: fields/datasetParams/ (stored 0%)
  adding: fields/datasetParams/.part-00002.crc (stored 0%)
  adding: fields/datasetParams/part-00003 (deflated 30%)
  adding: fields/datasetParams/.part-00000.crc (stored 0%)
  adding: fields/datasetParams/_SUCCESS (stored 0%)
  adding: fields/datasetParams/.part-00003.crc (stored 0%)
  adding: fields/datasetParams/part-00002 (deflated 27%)
  adding: fields/datasetParams/._SUCCESS.crc (stored 0%)
  adding: fields/datasetParams/part-00001 (deflated 27%)
  adding: fields/datasetParams/.part-00001.crc (stored 0%)
  adding: fields/datasetParams/part-00000 (deflated 26%)
  adding: metadata/ (stored 0%)
  adding: metadata/.part-00000.crc (stored 0%)
  adding: metadata/_SUCCESS (stored 0%)
  adding: metadata/._SUCCESS.crc (stored 0%)
  adding: metadata/part-00000 (deflated 40%)
