# Training Binary Classification Models with Legal NLP


![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/04.1.Training_Legal_Binary_Classifier.ipynb)

In this notebook, you will learn how to use Spark NLP and Legal NLP to train binary classification models.

Let's dive in!

# Colab Setup

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs

In [None]:
from johnsnowlabs import nlp, legal
# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM.
# Make sure to restart your notebook afterwards for the changes to take effect.
nlp.install(force_browser=True)

Installed 1 products:
💊 Spark-Healthcare==4.2.4 installed! ✅ Heal the planet with NLP! 


## Start Spark Session

In [None]:
from johnsnowlabs import nlp, legal 
# Automatically load the license data and start a session with all JARs the user has access to
spark = nlp.start()

👌 Launched cpu optimized session with with: 🚀Spark-NLP==4.2.4, 💊Spark-Healthcare==4.2.4, running on ⚡ PySpark==3.1.2


## Introduction

Spark NLP offers two annotators for text classification tasks:

- `MultiClassifierDL`: `Multilabel Classification` (can predict more than one class for each text) using a Bidirectional GRU with a convolutional architecture built with TensorFlow, supporting up to 100 classes. The inputs are Sentence Embeddings such as the state-of-the-art UniversalSentenceEncoder, BertSentenceEmbeddings, or SentenceEmbeddings.
- `ClassifierDL`: a deep learning model (DNN) built with TensorFlow that supports `Binary Classification` and `Multiclass Classification` (up to 100 classes). It takes Sentence Embeddings as input, such as the state-of-the-art Universal Sentence Encoder.

This notebook trains a binary classifier, so we use `ClassifierDL` (through `legal.ClassifierDLApproach`).
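
To make the difference between single-label (binary or multiclass) and multilabel output concrete, here is a minimal, library-independent sketch. It does not reproduce the internals of either annotator: the raw scores, the 0.5 threshold, and the label names in the multilabel call are hypothetical and chosen only for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(score):
    """Map one raw score to an independent probability."""
    return 1.0 / (1.0 + math.exp(-score))

def predict_single_label(labels, scores):
    """Binary/multiclass: return exactly one label, the most probable one."""
    probs = softmax(scores)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best]

def predict_multi_label(labels, scores, threshold=0.5):
    """Multilabel: return every label whose own probability passes the threshold."""
    return [lab for lab, s in zip(labels, scores) if sigmoid(s) >= threshold]

# Single-label prediction over the two classes used in this notebook
print(predict_single_label(["ti-allowance", "other"], [2.1, -0.3]))

# Multilabel prediction over hypothetical clause labels
print(predict_multi_label(["lease", "ti-allowance", "indemnity"], [2.0, 0.4, -1.5]))
```

The single-label function always returns one class, while the multilabel function may return none, one, or several.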

Since this model can take a long time to train, we will reduce the size of the training data to avoid training for hours.

> Please note that this reduction can greatly impact the performance of the model.

## Loading the data

Since deep learning models can take some time to train, we limit our dataset to a small number of observations. This is enough to illustrate how to use Spark NLP and Legal NLP annotators and pipelines to train the model without a long wait.

Please note that the quality and quantity of the training data strongly affect the trained model, and the results we obtain here are for illustration purposes only. To obtain a more realistic model, please consider using the full dataset or adding extra observations from different sources.

Here we will use a very small sample. We will train a sample model that classifies whether a clause in a legal document is a `ti-allowance` (tenant improvement allowance) clause or `other`.

In [None]:
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings_JSL/Legal/data/legal_clf.csv

In [None]:
import pandas as pd
df = pd.read_csv('legal_clf.csv', encoding="utf8")
print(f"Shape of the full dataset: {df.shape}")

Shape of the full dataset: (254, 2)


In [None]:
df.head()

Unnamed: 0,text,category
0,meal-allowance Subject to the terms and provis...,ti-allowance
1,construction-of-the-tenant-improvements Tenant...,ti-allowance
2,"Tenant Improvements Lessor, at Lessor’s cost, ...",ti-allowance
3,Provided there shall not be existing a default...,ti-allowance
4,Landlord shall provide Tenant a tenant improve...,ti-allowance


In [None]:
df['category'].value_counts()

other           135
ti-allowance    119
Name: category, dtype: int64
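
If you start from a larger dataset and want to cut it down while keeping the classes roughly balanced, one option is to cap the number of rows per class. The sketch below does this in plain Python under stated assumptions: `sample_per_class` is a hypothetical helper, the rows are placeholder tuples shaped like the `(text, category)` columns above, and the cap of 100 is arbitrary.

```python
import random
from collections import Counter, defaultdict

def sample_per_class(rows, label_index, max_per_class, seed=0):
    """Keep at most `max_per_class` rows for each label, chosen reproducibly."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_index]].append(row)
    rng = random.Random(seed)
    sampled = []
    for label in sorted(by_label):
        group = by_label[label]
        k = min(max_per_class, len(group))
        sampled.extend(rng.sample(group, k))
    return sampled

# Placeholder rows with the same class counts as the dataset above
rows = [(f"clause {i}", "other") for i in range(135)]
rows += [(f"clause {i}", "ti-allowance") for i in range(119)]

subset = sample_per_class(rows, label_index=1, max_per_class=100)
print(Counter(label for _, label in subset))
```

Fixing the seed keeps the subset reproducible between runs, which makes training results easier to compare.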

## With Bert Sentence Embeddings

In [None]:
documentAssembler = nlp.DocumentAssembler() \
     .setInputCol("text") \
     .setOutputCol("document")
  
embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

docClassifier = legal.ClassifierDLApproach()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("class") \
    .setLabelColumn("category") \
    .setBatchSize(64) \
    .setMaxEpochs(32) \
    .setEnableOutputLogs(True)\
    .setOutputLogsPath("binary_bert_logs")\
    .setLr(0.002)\
    .setRandomSeed(0)\
    .setDropout(0.2)

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    embeddings,
    docClassifier
    ])

sent_bert_base_cased download started this may take some time.
Approximate size to download 389.1 MB
[OK!]


In [None]:
spark_df = spark.createDataFrame(df)

In [None]:
%%time
# splitting dataset into train and test set
train, test = spark_df.randomSplit([0.90, 0.10], seed = 0)

clf_model = nlpPipeline.fit(train)


CPU times: user 1.52 s, sys: 193 ms, total: 1.72 s
Wall time: 3min 43s
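
Note that the 90/10 weights are targets rather than exact sizes: the training log reports 232 training examples and the evaluation covers 22 test rows, out of 254 rows in total. The pure-Python sketch below illustrates the general idea of assigning each row to a split by an independent random draw. It is a conceptual illustration under that assumption, not Spark's actual implementation, and `random_split` is a hypothetical helper.

```python
import random

def random_split(rows, weights, seed=0):
    """Assign each row to one split by an independent random draw."""
    total = float(sum(weights))
    # Cumulative boundaries, e.g. about [0.9, 1.0] for weights [0.90, 0.10]
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last split catches any draw left over by float rounding
            if draw < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

train_rows, test_rows = random_split(list(range(254)), [0.90, 0.10], seed=0)
# The sizes are close to, but not necessarily exactly, 90% and 10%
print(len(train_rows), len(test_rows))
```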


In [None]:
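import os

# The logs folder can contain other entries, so keep only .log files and read the newest one
log_dir = "binary_bert_logs"
log_files = [f for f in os.listdir(log_dir) if f.endswith(".log")]
log_file_name = max(log_files, key=lambda f: os.path.getmtime(os.path.join(log_dir, f)))

with open(os.path.join(log_dir, log_file_name), "r") as log_file:
    print(log_file.read())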

Training started - epochs: 32 - learning_rate: 0.002 - batch_size: 64 - training_examples: 232 - classes: 2
Epoch 0/32 - 0.32s - loss: 3.2437663 - acc: 0.61145836 - batches: 4
Epoch 1/32 - 0.07s - loss: 2.268199 - acc: 1.0 - batches: 4
Epoch 2/32 - 0.07s - loss: 1.4406782 - acc: 1.0 - batches: 4
Epoch 3/32 - 0.08s - loss: 1.3018677 - acc: 1.0 - batches: 4
Epoch 4/32 - 0.07s - loss: 1.2646831 - acc: 1.0 - batches: 4
Epoch 5/32 - 0.07s - loss: 1.2572342 - acc: 1.0 - batches: 4
Epoch 6/32 - 0.08s - loss: 1.2554077 - acc: 1.0 - batches: 4
Epoch 7/32 - 0.07s - loss: 1.2545748 - acc: 1.0 - batches: 4
Epoch 8/32 - 0.11s - loss: 1.2542142 - acc: 1.0 - batches: 4
Epoch 9/32 - 0.08s - loss: 1.2540405 - acc: 1.0 - batches: 4
Epoch 10/32 - 0.07s - loss: 1.2539155 - acc: 1.0 - batches: 4
Epoch 11/32 - 0.07s - loss: 1.2538443 - acc: 1.0 - batches: 4
Epoch 12/32 - 0.08s - loss: 1.2537851 - acc: 1.0 - batches: 4
Epoch 13/32 - 0.07s - loss: 1.2536979 - acc: 1.0 - batches: 4
Epoch 14/32 - 0.07s - loss: 
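
To inspect the training curve programmatically instead of reading the raw text, the epoch lines can be parsed into records. This is a minimal sketch that assumes the line format shown in the log above; `parse_training_log` is a hypothetical helper, and incomplete lines are skipped rather than guessed at.

```python
import re

# Matches lines such as:
# Epoch 0/32 - 0.32s - loss: 3.2437663 - acc: 0.61145836 - batches: 4
EPOCH_LINE = re.compile(
    r"Epoch (\d+)/(\d+) - ([\d.]+)s - loss: (\S+) - acc: (\S+) - batches: (\d+)"
)

def parse_training_log(text):
    """Return one dict per complete epoch line; other lines are ignored."""
    records = []
    for line in text.splitlines():
        match = EPOCH_LINE.search(line)
        if match is None:
            continue
        epoch, total, seconds, loss, acc, batches = match.groups()
        records.append({
            "epoch": int(epoch),
            "total_epochs": int(total),
            "seconds": float(seconds),
            "loss": float(loss),
            "acc": float(acc),
            "batches": int(batches),
        })
    return records

sample_log = """Training started - epochs: 32 - learning_rate: 0.002 - batch_size: 64 - training_examples: 232 - classes: 2
Epoch 0/32 - 0.32s - loss: 3.2437663 - acc: 0.61145836 - batches: 4
Epoch 1/32 - 0.07s - loss: 2.268199 - acc: 1.0 - batches: 4
Epoch 14/32 - 0.07s - loss: """

for record in parse_training_log(sample_log):
    print(record["epoch"], record["loss"], record["acc"])
```

The resulting list of dicts can be passed to `pd.DataFrame(...)` for plotting or further analysis.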

In [None]:
preds = clf_model.transform(test)

preds_df = preds.select('category','text',"class.result").toPandas()

preds_df.head()

Unnamed: 0,category,text,result
0,ti-allowance,Disbursement of Tenant Improvement Allowance N...,[ti-allowance]
1,ti-allowance,"In consideration of the foregoing, the parties...",[ti-allowance]
2,ti-allowance,Landlord shall make available for use by Tenan...,[ti-allowance]
3,ti-allowance,Landlord shall provide Tenant the Tenant Impro...,[ti-allowance]
4,ti-allowance,"Landlord, at Landlord’s sole cost and expense,...",[ti-allowance]


In [None]:
# The result column holds an array, since a Spark NLP annotator can return several annotations per row.
# Each document has a single prediction here, so we take the first item of each array.
preds_df['result'] = preds_df['result'].apply(lambda x : x[0])
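
The expression `x[0]` raises an `IndexError` if a row ever comes back with an empty result array. A slightly more defensive variant is sketched below; `first_or_none` is a hypothetical helper name, and the sample values are placeholders.

```python
def first_or_none(values):
    """Return the first element of a list-like result, or None if it is empty or missing."""
    # len() is used instead of truthiness so array-like inputs also work
    if values is None or len(values) == 0:
        return None
    return values[0]

results = [["ti-allowance"], ["other"], []]
print([first_or_none(r) for r in results])
```

It can be used in the same way as the lambda, for example `preds_df['result'].apply(first_or_none)`.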

In [None]:
# We are going to use sklearn to evaluate the results on the test dataset
from sklearn.metrics import classification_report

print(classification_report(preds_df["category"], preds_df["result"]))

              precision    recall  f1-score   support

       other       1.00      1.00      1.00         9
ti-allowance       1.00      1.00      1.00        13

    accuracy                           1.00        22
   macro avg       1.00      1.00      1.00        22
weighted avg       1.00      1.00      1.00        22



## Saving & loading back the trained model

In [None]:
clf_model.stages

[DocumentAssembler_8a0703d495a8,
 BERT_SENTENCE_EMBEDDINGS_68370801062d,
 LegalClassifierDLModel_da57327f33e1]

In [None]:
clf_model.stages[-1].write().overwrite().save('Clf_Use')

In [None]:
# Load back the saved classifier model
ClfModel = legal.ClassifierDLModel.load('Clf_Use')

In [None]:
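# Every stage is already trained, so fitting on an empty DataFrame only assembles the PipelineModel
ld_pipeline = nlp.Pipeline(stages=[documentAssembler, embeddings, ClfModel])
ld_pipeline_model = ld_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))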

In [None]:
# Apply Model Transform to testData
ld_preds = ld_pipeline_model.transform(test)

In [None]:
ld_preds_df = ld_preds.select('text','category',"class.result").toPandas()

In [None]:
ld_preds_df.head()

Unnamed: 0,text,category,result
0,Disbursement of Tenant Improvement Allowance N...,ti-allowance,[ti-allowance]
1,"In consideration of the foregoing, the parties...",ti-allowance,[ti-allowance]
2,Landlord shall make available for use by Tenan...,ti-allowance,[ti-allowance]
3,Landlord shall provide Tenant the Tenant Impro...,ti-allowance,[ti-allowance]
4,"Landlord, at Landlord’s sole cost and expense,...",ti-allowance,[ti-allowance]


## With RoBerta Embeddings

We do not have Legal Sentence Embeddings yet, but we can use the Legal RoBerta word embeddings and average them into sentence embeddings.
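
Averaging here means taking the element-wise mean of the token vectors in a document. The sketch below shows the arithmetic in plain Python on tiny made-up vectors. It illustrates the idea only and is not the annotator's implementation; `average_pool` is a hypothetical helper.

```python
def average_pool(token_vectors):
    """Average equal-length token vectors element-wise into one vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    if any(len(vec) != dim for vec in token_vectors):
        raise ValueError("all token vectors must have the same length")
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]

# Two made-up 3-dimensional token vectors
tokens = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
print(average_pool(tokens))  # [2.0, 3.0, 4.0]
```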

In [None]:
embeddings = (
    nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en")
    .setInputCols(["document", "token"])
    .setOutputCol("embeddings")
    .setMaxSentenceLength(512)
)

roberta_embeddings_legal_roberta_base download started this may take some time.
Approximate size to download 447.2 MB
[OK!]


In [None]:
document_assembler = (
    nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
)

tokenizer = nlp.Tokenizer().setInputCols(["document"]).setOutputCol("token")

embeddingsSentence = (
    nlp.SentenceEmbeddings()
    .setInputCols(["document", "embeddings"])
    .setOutputCol("sentence_embeddings")
    .setPoolingStrategy("AVERAGE")
)

classifierdl = (
    legal.ClassifierDLApproach()
    .setInputCols(["sentence_embeddings"])
    .setOutputCol("class")
    .setLabelColumn("category")
    .setMaxEpochs(3)
    .setLr(0.001)
    .setEnableOutputLogs(True)
    .setOutputLogsPath("binary_roberta_logs")
    .setBatchSize(4)
    .setDropout(0.2)
)

clf_pipeline = nlp.Pipeline(
    stages=[document_assembler, tokenizer, embeddings, embeddingsSentence, classifierdl]
)

In [None]:
%%time
clf_pipelineModel = clf_pipeline.fit(train)

CPU times: user 5.26 s, sys: 623 ms, total: 5.89 s
Wall time: 12min 45s


In [None]:
# The logs were written to the folder set with setOutputLogsPath above
log_files = os.listdir("binary_roberta_logs")

with open("binary_roberta_logs/"+log_files[0], "r") as log_file :
    print(log_file.read())

Training started - epochs: 3 - learning_rate: 0.001 - batch_size: 4 - training_examples: 232 - classes: 2
Epoch 0/3 - 1.59s - loss: 28.548275 - acc: 0.8189655 - batches: 58
Epoch 1/3 - 1.32s - loss: 19.857958 - acc: 0.9741379 - batches: 58
Epoch 2/3 - 1.09s - loss: 18.892944 - acc: 0.99568963 - batches: 58
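
The `batches` value in both training logs matches the number of training examples divided by the batch size, rounded up. This is an observation from the two logs in this notebook rather than documented behavior, and `batches_per_epoch` is a hypothetical helper.

```python
import math

def batches_per_epoch(training_examples, batch_size):
    """Number of batches needed to cover every example once."""
    return math.ceil(training_examples / batch_size)

print(batches_per_epoch(232, 64))  # 4, as in the Bert run (batch size 64)
print(batches_per_epoch(232, 4))   # 58, as in the RoBerta run (batch size 4)
```

A smaller batch size therefore means more weight updates per epoch, which is one reason the RoBerta run gets by with only 3 epochs.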



In [None]:
preds = clf_pipelineModel.transform(test)

In [None]:
preds_df = preds.select("category", "text", "class.result").toPandas()

In [None]:
preds_df.head()

Unnamed: 0,category,text,result
0,ti-allowance,Disbursement of Tenant Improvement Allowance N...,[ti-allowance]
1,ti-allowance,"In consideration of the foregoing, the parties...",[ti-allowance]
2,ti-allowance,Landlord shall make available for use by Tenan...,[ti-allowance]
3,ti-allowance,Landlord shall provide Tenant the Tenant Impro...,[ti-allowance]
4,ti-allowance,"Landlord, at Landlord’s sole cost and expense,...",[ti-allowance]


In [None]:
# Each result is an array with a single prediction, so we take its first item
preds_df['result'] = preds_df['result'].apply(lambda x : x[0])

from sklearn.metrics import classification_report

print(classification_report(preds_df['category'], preds_df['result']))


              precision    recall  f1-score   support

       other       1.00      0.89      0.94         9
ti-allowance       0.93      1.00      0.96        13

    accuracy                           0.95        22
   macro avg       0.96      0.94      0.95        22
weighted avg       0.96      0.95      0.95        22
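
The report above is consistent with a single misclassification on the 22 test rows: 8 of the 9 `other` clauses predicted correctly, 1 predicted as `ti-allowance`, and all 13 `ti-allowance` clauses correct. This breakdown is inferred from the reported support, precision, and recall, not read from the predictions themselves. The sketch below recomputes the metrics from those assumed counts.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Counts inferred from the report (an assumption, see the text above)
other = precision_recall_f1(tp=8, fp=0, fn=1)
ti_allowance = precision_recall_f1(tp=13, fp=1, fn=0)
accuracy = (8 + 13) / 22

for name, (p, r, f) in [("other", other), ("ti-allowance", ti_allowance)]:
    print(f"{name:>12}  precision={p:.2f}  recall={r:.2f}  f1={f:.2f}")
print(f"accuracy={accuracy:.2f}")
```

With only 22 test rows, one error moves accuracy by about 4.5 percentage points, so the gap between the two models should not be over-interpreted.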



# Save the Model and Zip It for Models Hub Upload/Download

In [None]:
# Save a Spark NLP model
clf_pipelineModel.stages[-1].write().overwrite().save('Clf_RoBerta')

# cd into the saved model directory and zip its contents
! cd /content/Clf_RoBerta ; zip -r /content/Clf_RoBerta.zip *

/bin/bash: line 0: cd: /content/ClfBert: No such file or directory
  adding: clf_logs/ (stored 0%)
  adding: clf_logs/.ipynb_checkpoints/ (stored 0%)
  adding: clf_logs/LegalClassifierDLApproach_7b468e291f4b.log (deflated 82%)
  adding: clf_logs/LegalClassifierDLApproach_66f298998d7d.log (deflated 87%)
  adding: Clf_RoBerta/ (stored 0%)
  adding: Clf_RoBerta/classifierdl_tensorflow (deflated 58%)
  adding: Clf_RoBerta/.classifierdl_tensorflow.crc (deflated 42%)
  adding: Clf_RoBerta/metadata/ (stored 0%)
  adding: Clf_RoBerta/metadata/_SUCCESS (stored 0%)
  adding: Clf_RoBerta/metadata/._SUCCESS.crc (stored 0%)
  adding: Clf_RoBerta/metadata/.part-00000.crc (stored 0%)
  adding: Clf_RoBerta/metadata/part-00000 (deflated 40%)
  adding: Clf_RoBerta/fields/ (stored 0%)
  adding: Clf_RoBerta/fields/datasetParams/ (stored 0%)
  adding: Clf_RoBerta/fields/datasetParams/.part-00001.crc (stored 0%)
  adding: Clf_RoBerta/fields/datasetParams/_SUCCESS (stored 0%)
  adding: Clf_RoBerta/fields/dat
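
As an alternative to the shell command, the saved model directory can be zipped from Python with the standard library. This is a minimal sketch: `zip_model_dir` is a hypothetical helper, and the directory name in the usage note assumes the `save('Clf_RoBerta')` call above.

```python
import os
import shutil

def zip_model_dir(model_dir, archive_base=None):
    """Zip the contents of `model_dir` and return the path of the created .zip file."""
    if not os.path.isdir(model_dir):
        # Fail loudly instead of zipping whatever the current directory contains
        raise FileNotFoundError(f"model directory not found: {model_dir}")
    if archive_base is None:
        # Default: place <model_dir>.zip next to the model directory
        archive_base = os.path.abspath(model_dir).rstrip("/\\")
    return shutil.make_archive(archive_base, "zip", root_dir=model_dir)
```

Calling `zip_model_dir("Clf_RoBerta")` would write `Clf_RoBerta.zip` next to the model folder, with the model files at the top level of the archive. A missing directory raises an error rather than letting the zip run from the wrong working directory.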