
![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)





[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/finance-nlp/04.3.Training_Financial_Multilabel_Classifier.ipynb)

# Train Domain-specific Multilabel classifiers

In this notebook, you will learn how to use Spark NLP and Finance NLP to train custom multilabel classification models.

## Installation

First, you need to set up the environment to use the licensed package. If you are not running in Google Colab, please check the documentation [here](https://nlp.johnsnowlabs.com/docs/en/licensed_install).

In [None]:
! pip install -q johnsnowlabs

### Automatic Installation
Using [my.johnsnowlabs.com](https://my.johnsnowlabs.com/) SSO

In [None]:
from johnsnowlabs import nlp, finance

# nlp.install(force_browser=True)

### Manual downloading
If you are not registered at my.johnsnowlabs.com, received your license by email, or are using Safari, you may need to install the license manually.

- Go to [my.johnsnowlabs.com](https://my.johnsnowlabs.com/)
- Download your license
- Upload it using the following command

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

- Install it

In [None]:
nlp.install()

### Start Spark Session

In [None]:
from johnsnowlabs import nlp, finance
# Automatically load license data and start a session with all jars user has access to
spark = nlp.start()

In [None]:
from pyspark.sql import DataFrame
import pyspark.sql.functions as F
import pyspark.sql.types as T
import pyspark.sql as SQL
from pyspark import keyword_only

## Introduction

Although John Snow Labs provides many pretrained models that cover different applications in the financial domain, some problems remain specific to individual companies or practitioners. For such cases, it is possible to train a new custom model using Finance NLP annotators:

- `ClassifierDLApproach`: Trains a multiclass model (predicts exactly one class out of a predefined set of classes), which includes binary classification
- `MultiClassifierDLApproach`: Trains a multilabel model (can assign several classes to each document)
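
The difference between the two annotators comes down to how per-class scores are turned into predictions. The sketch below is a plain-Python illustration, not Spark NLP code: the scores are made up, and the 0.5 cut-off is an assumption chosen for the example.

```python
# Hypothetical per-class scores for one document (made up for illustration).
scores = {"survival": 0.81, "warranties": 0.64, "notices": 0.07}


def predict_multiclass(scores):
    """Multiclass: exactly one label, the highest-scoring class."""
    return max(scores, key=scores.get)


def predict_multilabel(scores, threshold=0.5):
    """Multilabel: every class whose score reaches the threshold (possibly none)."""
    return sorted(label for label, score in scores.items() if score >= threshold)


print(predict_multiclass(scores))        # survival
print(predict_multilabel(scores))        # ['survival', 'warranties']
print(predict_multilabel(scores, 0.9))   # []
```

An empty list is a valid multilabel outcome, which is why a multilabel model can return `[]` for a document where no class is scored highly enough.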

## Training Multilabel Models with `MultiClassifierDLApproach`

The input to `MultiClassifierDLApproach` is sentence embeddings, such as the state-of-the-art [UniversalSentenceEncoder](https://nlp.johnsnowlabs.com/docs/en/transformers#universalsentenceencoder), [BertSentenceEmbeddings](https://nlp.johnsnowlabs.com/docs/en/transformers#bertsentenceembeddings) or [SentenceEmbeddings](https://nlp.johnsnowlabs.com/docs/en/annotators#sentenceembeddings).

To train a custom model, you need labeled data with at least the columns

```
| TEXT | LABELS (list) |
```
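
As a rough illustration of that layout, the snippet below parses a tiny in-memory CSV whose label column stores Python-style lists as strings. The column names `provision` and `label` mirror the dataset used in this notebook, but the two rows are shortened placeholders. `ast.literal_eval` is used as a safer stand-in for `eval`, since it only accepts Python literals.

```python
import ast
import csv
import io

# Two placeholder rows; the label column holds a stringified Python list.
raw_csv = '''provision,label
"All notices hereunder shall be in writing.","['notices']"
"This Agreement binds successors and assigns.","['assigns', 'successors']"
'''

rows = []
for record in csv.DictReader(io.StringIO(raw_csv)):
    # literal_eval only accepts Python literals, so arbitrary code cannot run.
    record["label"] = ast.literal_eval(record["label"])
    rows.append(record)

for row in rows:
    print(row["label"], "<-", row["provision"])
```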

In [None]:
from johnsnowlabs import nlp, finance
# Automatically load license data and start a session with all jars user has access to
spark = nlp.start()

👌 Launched cpu optimized session with with: 🚀Spark-NLP==4.2.4, 💊Spark-Healthcare==4.2.4, running on ⚡ PySpark==3.1.2


### Loading the data

In [None]:
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/finance-nlp/data/finance_data.csv

In [None]:
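import ast
import pandas as pd
df = pd.read_csv('./finance_data.csv')
# Labels are stored as stringified lists; literal_eval parses them without executing code
df['label'] = df['label'].apply(ast.literal_eval)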
print(f"Shape of the full dataset: {df.shape}")

Shape of the full dataset: (27527, 2)


> We will use a sample from this dataset to make the training process faster, since the goal here is only to illustrate the workflow. Use the full dataset if you want to experiment and obtain more realistic results.
>
> The sample contains only 500 observations, which limits the accuracy and generalization of the resulting model. Because the sample is small, we use 90% of it for training and the remaining 10% for testing.
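
Note that a 90/10 random split of 500 rows does not necessarily produce exactly 450 and 50 rows. Spark's `randomSplit` samples each row independently, so the split sizes are only approximately proportional to the weights. The following plain-Python sketch imitates that per-row behaviour; it illustrates the idea and is not Spark's actual sampler, and the helper name `bernoulli_split` is made up.

```python
import random


def bernoulli_split(rows, train_fraction=0.9, seed=42):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


rows = list(range(500))  # stand-in for the 500 sampled provisions
train_rows, test_rows = bernoulli_split(rows)

# Sizes are close to 450/50 but usually not exact.
print(len(train_rows), len(test_rows))
```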

In [None]:
data = spark.createDataFrame(df)

# If you only have a single dataset, split it; otherwise load the test dataset the same way as the training data.
train, test = data.limit(500).randomSplit([0.9, 0.1], seed=42)

In [None]:
train.show(truncate=50)

+--------------------------------------------------+-----------------------------------+
|                                         provision|                              label|
+--------------------------------------------------+-----------------------------------+
|(a) Seller, the Agent, each Managing Agent, eac...|                      [assignments]|
|(a)  The provisions of this Agreement shall be ...|              [assigns, successors]|
|(a) THIS AGREEMENT AND ANY CLAIM, CONTROVERSY, ...|[governing laws, entire agreements]|
|(a) This Agreement may be executed by one or mo...|                     [counterparts]|
|All Bank Expenses (including reasonable attorne...|                         [expenses]|
|All agreements, representations and warranties ...|                         [survival]|
|All communications hereunder will be in writing...|                          [notices]|
|All covenants, agreements, representations and ...|                         [survival]|
|All covenants, agree

In [None]:
from pyspark.sql.functions import col

test.groupBy("label").count().orderBy(col("count").desc()).show()

+--------------------+-----+
|               label|count|
+--------------------+-----+
|      [counterparts]|    6|
|        [amendments]|    5|
| [entire agreements]|    5|
|      [severability]|    3|
|          [survival]|    3|
|[assigns, success...|    3|
|           [waivers]|    2|
|      [terminations]|    2|
|[representations,...|    2|
|           [notices]|    1|
|        [warranties]|    1|
|       [assignments]|    1|
|    [governing laws]|    1|
|[governing laws, ...|    1|
|          [expenses]|    1|
|        [successors]|    1|
|[amendments, enti...|    1|
+--------------------+-----+



### Train With Universal Encoder

Universal Sentence Encoder is a state-of-the-art architecture for creating vector representations of text. We use an existing pretrained model for the embeddings, so only the classifier needs to be trained (training the embeddings as well would also be possible).

The pretrained model was trained and optimized for greater-than-word-length text, such as sentences, phrases or short paragraphs. It is trained on a variety of data sources and tasks, with the aim of accommodating a wide range of natural language understanding tasks. The input is variable-length English text and the output is a 512-dimensional vector.

In [None]:
document_assembler = (
    nlp.DocumentAssembler()
    .setInputCol("provision")
    .setOutputCol("document")
    .setCleanupMode("shrink")
)

embeddings = (
    nlp.UniversalSentenceEncoder.pretrained()
    .setInputCols("document")
    .setOutputCol("sentence_embeddings")
)

classifierdl = (
    nlp.MultiClassifierDLApproach()
    .setInputCols(["sentence_embeddings"])
    .setOutputCol("class")
    .setLabelColumn("label")
    .setMaxEpochs(30)
    .setEnableOutputLogs(True)
    .setOutputLogsPath("multilabel_use")
    .setLr(0.001)
    .setBatchSize(4)
)

clf_pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, classifierdl])

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]


In [None]:
%%time
clf_pipelineModel = clf_pipeline.fit(train)

CPU times: user 567 ms, sys: 64.5 ms, total: 631 ms
Wall time: 1min 44s


In [None]:
import os
log_file_name = os.listdir("multilabel_use")[0]

with open("multilabel_use/"+log_file_name, "r") as log_file :
    print(log_file.read())

Training started - epochs: 30 - learning_rate: 0.001 - batch_size: 4 - training_examples: 461 - classes: 15
Epoch 0/30 - 4.21s - loss: 0.30427498 - acc: 0.925797 - batches: 116
Epoch 1/30 - 1.86s - loss: 0.20737702 - acc: 0.94260854 - batches: 116
Epoch 2/30 - 1.84s - loss: 0.15418288 - acc: 0.9563765 - batches: 116
Epoch 3/30 - 1.86s - loss: 0.12557246 - acc: 0.9627533 - batches: 116
Epoch 4/30 - 1.89s - loss: 0.1082967 - acc: 0.9679704 - batches: 116
Epoch 5/30 - 1.86s - loss: 0.09582114 - acc: 0.97304296 - batches: 116
Epoch 6/30 - 3.04s - loss: 0.08615956 - acc: 0.97782564 - batches: 116
Epoch 7/30 - 2.52s - loss: 0.07847145 - acc: 0.9794198 - batches: 116
Epoch 8/30 - 1.79s - loss: 0.07229716 - acc: 0.98217356 - batches: 116
Epoch 9/30 - 1.84s - loss: 0.06728321 - acc: 0.98449224 - batches: 116
Epoch 10/30 - 1.78s - loss: 0.06313381 - acc: 0.9860863 - batches: 116
Epoch 11/30 - 1.87s - loss: 0.059632715 - acc: 0.9878254 - batches: 116
Epoch 12/30 - 1.81s - loss: 0.056623098 - acc:
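
To inspect the learning curve rather than read the raw log, the epoch lines can be parsed into numbers. This small sketch assumes the log keeps the `Epoch i/n - ...s - loss: ... - acc: ... - batches: ...` layout shown above; the function name `parse_epochs` is ours, and incomplete lines are skipped.

```python
import re

EPOCH_LINE = re.compile(
    r"Epoch (?P<epoch>\d+)/(?P<total>\d+) - (?P<seconds>[\d.]+)s"
    r" - loss: (?P<loss>[\d.eE+-]+) - acc: (?P<acc>[\d.eE+-]+)"
)


def parse_epochs(log_text):
    """Return one dict per complete epoch line; other lines are skipped."""
    epochs = []
    for line in log_text.splitlines():
        match = EPOCH_LINE.match(line.strip())
        if match:
            epochs.append({
                "epoch": int(match["epoch"]),
                "seconds": float(match["seconds"]),
                "loss": float(match["loss"]),
                "acc": float(match["acc"]),
            })
    return epochs


sample_log = """Training started - epochs: 30 - learning_rate: 0.001 - batch_size: 4 - training_examples: 461 - classes: 15
Epoch 0/30 - 4.21s - loss: 0.30427498 - acc: 0.925797 - batches: 116
Epoch 1/30 - 1.86s - loss: 0.20737702 - acc: 0.94260854 - batches: 116
"""

for entry in parse_epochs(sample_log):
    print(entry)
```

In the notebook, the same function could be applied to the text read from the file in `multilabel_use`.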

#### Test the trained model

In [None]:
preds = clf_pipelineModel.transform(test)

In [None]:
preds_df = preds.select("label", "provision", "class.result").toPandas()
preds_df.head()

| | label | provision | result |
|---|---|---|---|
| 0 | [survival] | All agreements, statements, representations an... | [representations, warranties] |
| 1 | [survival] | All covenants of the Company contained in this... | [representations, warranties, terminations] |
| 2 | [survival] | All representations, warranties, covenants and... | [survival] |
| 3 | [notices] | Any notice required or permitted by this Agree... | [notices] |
| 4 | [waivers] | Each Canadian Loan Party acknowledges receipt ... | [] |


To compare predictions with the ground truth, we use the `MultiLabelBinarizer` class from scikit-learn. It converts each list of labels into a binary indicator row (one column per class), which is the format that `classification_report` and the other scikit-learn metrics expect for multilabel data.
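
To make the binarization step concrete, here is a plain-Python sketch of the idea: build a sorted vocabulary of labels from the ground truth, then turn each label list into a 0/1 row. It is an illustration only and not scikit-learn's implementation; the helper names are hypothetical, and predicted labels that never occur in the ground truth are simply dropped in this sketch.

```python
def fit_classes(label_lists):
    """Collect the sorted set of labels seen in the ground truth."""
    return sorted({label for labels in label_lists for label in labels})


def binarize(label_lists, classes):
    """Encode each label list as a multi-hot row aligned with `classes`."""
    index = {label: position for position, label in enumerate(classes)}
    matrix = []
    for labels in label_lists:
        row = [0] * len(classes)
        for label in labels:
            if label in index:  # labels unseen during fit are ignored
                row[index[label]] = 1
        matrix.append(row)
    return matrix


y_true_labels = [["survival"], ["notices"], ["waivers"]]
y_pred_labels = [["representations", "warranties"], ["notices"], []]

classes = fit_classes(y_true_labels)
print(classes)                            # ['notices', 'survival', 'waivers']
print(binarize(y_true_labels, classes))   # [[0, 1, 0], [1, 0, 0], [0, 0, 1]]
print(binarize(y_pred_labels, classes))   # [[0, 0, 0], [1, 0, 0], [0, 0, 0]]
```

Each column corresponds to a position in the label vocabulary, which is why a report computed on such a matrix refers to classes by index unless class names are supplied.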

In [None]:
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score

mlb = MultiLabelBinarizer()

y_true = mlb.fit_transform(preds_df['label'])
y_pred = mlb.transform(preds_df['result'])


print("Classification report: \n", (classification_report(y_true, y_pred)))
print("F1 micro averaging:",(f1_score(y_true, y_pred, average='micro')))
print("ROC: ",(roc_auc_score(y_true, y_pred, average="micro")))


Classification report: 
               precision    recall  f1-score   support

           0       0.60      0.50      0.55         6
           1       0.00      0.00      0.00         1
           2       0.75      1.00      0.86         3
           3       0.83      0.83      0.83         6
           4       1.00      0.86      0.92         7
           5       1.00      1.00      1.00         1
           6       1.00      1.00      1.00         2
           7       1.00      1.00      1.00         1
           8       0.40      1.00      0.57         2
           9       1.00      1.00      1.00         3
          10       0.80      1.00      0.89         4
          11       0.50      0.33      0.40         3
          12       0.00      0.00      0.00         2
          13       0.00      0.00      0.00         2
          14       0.40      0.67      0.50         3

   micro avg       0.70      0.72      0.71        46
   macro avg       0.62      0.68      0.63        46
w
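
The report contains both micro and macro averages. As a rough sketch of how they differ, the code below computes both F1 variants by hand on a tiny made-up multi-hot matrix: micro averaging pools true positives, false positives and false negatives over all classes, while macro averaging computes F1 per class and then takes the unweighted mean. The data and function names are illustrative, and setting undefined scores to 0 is an assumption of this sketch.

```python
def counts(y_true, y_pred, column):
    """True positives, false positives, false negatives for one class column."""
    tp = sum(t[column] == 1 and p[column] == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t[column] == 0 and p[column] == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t[column] == 1 and p[column] == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, fn


def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def micro_macro_f1(y_true, y_pred):
    per_class = [counts(y_true, y_pred, c) for c in range(len(y_true[0]))]
    micro = f1(*(sum(values) for values in zip(*per_class)))
    macro = sum(f1(*c) for c in per_class) / len(per_class)
    return micro, macro


# Made-up example: 4 documents, 3 classes.
y_true = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [1, 0, 0], [0, 0, 0], [0, 1, 1]]

micro, macro = micro_macro_f1(y_true, y_pred)
print(f"micro F1 = {micro:.2f}, macro F1 = {macro:.2f}")  # micro F1 = 0.75, macro F1 = 0.67
```

With few test examples per class, as in this notebook, a single missed class pulls the macro average down much more than the micro average.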

### Train with Bert Embeddings

**Restart your runtime before running this section to avoid out-of-memory errors, then start the session and load the dataset again.**

In [None]:
from johnsnowlabs import nlp, finance
# Automatically load license data and start a session with all jars user has access to
spark = nlp.start()

📋 Loading license number 0 from /root/.johnsnowlabs/licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json
👌 Launched cpu optimized session with with: 🚀Spark-NLP==4.2.4, 💊Spark-Healthcare==4.2.4, running on ⚡ PySpark==3.1.2


In [None]:
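import ast
import pandas as pd
df = pd.read_csv('./finance_data.csv')
df['label'] = df['label'].apply(ast.literal_eval)
print(f"Shape of the full dataset: {df.shape}")
# After a restart, train and test no longer exist: recreate the same sample and split
data = spark.createDataFrame(df)
train, test = data.limit(500).randomSplit([0.9, 0.1], seed=42)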

Shape of the full dataset: (27527, 2)


We do not have any specific Financial Sentence Embeddings, but we can use Financial Bert Embeddings (token-level) and average them into one vector per document.
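
Averaging here means taking the element-wise mean of all token vectors in a document, which yields one fixed-size vector regardless of document length. A minimal plain-Python sketch with made-up 3-dimensional vectors (real BERT embeddings are much larger); the function name `average_pool` is ours:

```python
def average_pool(token_vectors):
    """Element-wise mean of equally sized token vectors."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    size = len(token_vectors[0])
    return [
        sum(vector[i] for vector in token_vectors) / len(token_vectors)
        for i in range(size)
    ]


# Made-up embeddings for a three-token document.
tokens = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 1.0],
]

print(average_pool(tokens))  # [2.0, 2.0, 1.0]
```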

In [None]:
embeddings = (
    nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en")
    .setInputCols(["document", "token"])
    .setOutputCol("embeddings")
)

bert_embeddings_sec_bert_base download started this may take some time.
Approximate size to download 390.4 MB
[OK!]


In [None]:
document_assembler = (
    nlp.DocumentAssembler().setInputCol("provision").setOutputCol("document")
)

tokenizer = nlp.Tokenizer().setInputCols(["document"]).setOutputCol("token")

embeddingsSentence = (
    nlp.SentenceEmbeddings()
    .setInputCols(["document", "embeddings"])
    .setOutputCol("sentence_embeddings")
    .setPoolingStrategy("AVERAGE")
)

classifierdl = (
    nlp.MultiClassifierDLApproach()
    .setInputCols(["sentence_embeddings"])
    .setOutputCol("class")
    .setLabelColumn("label")
    .setMaxEpochs(8)
    .setEnableOutputLogs(True)
    .setOutputLogsPath("multilabel_bert")
    .setLr(0.001)
    .setBatchSize(4)
)

clf_pipeline = nlp.Pipeline(
    stages=[document_assembler, tokenizer, embeddings, embeddingsSentence, classifierdl]
)

In [None]:
%%time
clf_pipelineModel = clf_pipeline.fit(train)

CPU times: user 2.59 s, sys: 311 ms, total: 2.9 s
Wall time: 7min 41s


#### Testing the trained model

In [None]:
preds = clf_pipelineModel.transform(test)

In [None]:
preds_df = preds.select("provision", "label", "class.result").toPandas()
preds_df.head()

| | provision | label | result |
|---|---|---|---|
| 0 | All agreements, statements, representations an... | [survival] | [representations, warranties] |
| 1 | All covenants of the Company contained in this... | [survival] | [survival] |
| 2 | All representations, warranties, covenants and... | [survival] | [warranties] |
| 3 | Any notice required or permitted by this Agree... | [notices] | [notices] |
| 4 | Each Canadian Loan Party acknowledges receipt ... | [waivers] | [] |


In [None]:
import os
log_file_name = os.listdir("multilabel_bert")[0]

with open("multilabel_bert/"+log_file_name, "r") as log_file :
    print(log_file.read())

Training started - epochs: 8 - learning_rate: 0.001 - batch_size: 4 - training_examples: 461 - classes: 15
Epoch 0/8 - 4.73s - loss: 0.21118948 - acc: 0.94202876 - batches: 116
Epoch 1/8 - 2.09s - loss: 0.08812641 - acc: 0.9775358 - batches: 116
Epoch 2/8 - 2.01s - loss: 0.056855213 - acc: 0.9868111 - batches: 116
Epoch 3/8 - 2.05s - loss: 0.042333648 - acc: 0.9924633 - batches: 116
Epoch 4/8 - 1.99s - loss: 0.033270992 - acc: 0.9960865 - batches: 116
Epoch 5/8 - 2.03s - loss: 0.027073074 - acc: 1.0002896 - batches: 116
Epoch 6/8 - 2.01s - loss: 0.022839691 - acc: 1.0023185 - batches: 116
Epoch 7/8 - 2.02s - loss: 0.019849315 - acc: 1.0027533 - batches: 116



In [None]:
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score

mlb = MultiLabelBinarizer()

y_true = mlb.fit_transform(preds_df['label'])
y_pred = mlb.transform(preds_df['result'])

print("Classification report: \n", (classification_report(y_true, y_pred)))
print("F1 micro averaging:",(f1_score(y_true, y_pred, average='micro')))
print("ROC: ",(roc_auc_score(y_true, y_pred, average="micro")))


Classification report: 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00         6
           1       0.00      0.00      0.00         1
           2       0.75      1.00      0.86         3
           3       1.00      0.83      0.91         6
           4       1.00      0.86      0.92         7
           5       1.00      1.00      1.00         1
           6       1.00      1.00      1.00         2
           7       1.00      1.00      1.00         1
           8       0.40      1.00      0.57         2
           9       0.75      1.00      0.86         3
          10       0.80      1.00      0.89         4
          11       1.00      0.33      0.50         3
          12       1.00      0.50      0.67         2
          13       0.50      0.50      0.50         2
          14       0.50      1.00      0.67         3

   micro avg       0.80      0.85      0.82        46
   macro avg       0.78      0.80      0.76        46
w

### Saving & loading back the trained model

In [None]:
clf_pipelineModel.stages

[DocumentAssembler_6e4b83bd8e34,
 REGEX_TOKENIZER_a51273a0ac5e,
 BERT_EMBEDDINGS_29ce72cd673e,
 SentenceEmbeddings_d7e0188b9ffa,
 MultiClassifierDLModel_c72989be3944]

In [None]:
# Save a Spark NLP model
clf_pipelineModel.stages[-1].write().overwrite().save('MultilabelClfBert')

In [None]:
# Load back the saved multilabel classifier model
MultilabelClfModel = nlp.MultiClassifierDLModel.load('MultilabelClfBert')

In [None]:
ld_pipeline = nlp.Pipeline(
    stages=[
        document_assembler,
        tokenizer,
        embeddings,
        embeddingsSentence,
        MultilabelClfModel,
    ]
)
ld_pipeline_model = ld_pipeline.fit(spark.createDataFrame([[""]]).toDF("provision"))

In [None]:
# Apply the reloaded pipeline to the test data
ld_preds = ld_pipeline_model.transform(test)

In [None]:
ld_preds_df = ld_preds.select("provision", "label", "class.result").toPandas()

In [None]:
ld_preds_df.head(10)

| | provision | label | result |
|---|---|---|---|
| 0 | All agreements, statements, representations an... | [survival] | [representations, warranties] |
| 1 | All covenants of the Company contained in this... | [survival] | [survival] |
| 2 | All representations, warranties, covenants and... | [survival] | [warranties] |
| 3 | Any notice required or permitted by this Agree... | [notices] | [notices] |
| 4 | Each Canadian Loan Party acknowledges receipt ... | [waivers] | [] |
| 5 | Except as otherwise provided herein or in any ... | [waivers] | [waivers] |
| 6 | Franchisee acknowledges that the Foodservice D... | [amendments] | [amendments] |
| 7 | Guarantor represents and warrants to Lender th... | [warranties] | [representations, warranties] |
| 8 | If any provision of this Plan or any Award is,... | [severability] | [severability] |
| 9 | No amendment, modification, termination or can... | [amendments] | [amendments] |
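
A quick sanity check on these ten rows is the exact-match ratio, i.e. the share of documents whose predicted label set equals the ground truth. The sketch below hard-codes the label and result columns from the rows displayed above; it is only an illustration on ten rows, not an evaluation of the model.

```python
# (label, result) pairs copied from the ten rows displayed above.
rows = [
    (["survival"], ["representations", "warranties"]),
    (["survival"], ["survival"]),
    (["survival"], ["warranties"]),
    (["notices"], ["notices"]),
    (["waivers"], []),
    (["waivers"], ["waivers"]),
    (["amendments"], ["amendments"]),
    (["warranties"], ["representations", "warranties"]),
    (["severability"], ["severability"]),
    (["amendments"], ["amendments"]),
]

# Compare as sets so that label order does not matter.
exact_matches = sum(set(label) == set(result) for label, result in rows)
print(f"{exact_matches}/{len(rows)} rows match exactly")  # 6/10 rows match exactly
```

Exact match is a strict criterion: row 7 predicts the correct label plus one extra and still counts as a miss.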


### Save model and Zip it for Modelshub Upload/Downloads

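Zipping the saved model directory produces a single file that is convenient to upload to or download from the [Models Hub](https://nlp.johnsnowlabs.com/models).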

In [None]:
# cd into saved dir and zip
! cd /content/MultilabelClfBert ; zip -r /content/MultilabelClfBert.zip *

  adding: fields/ (stored 0%)
  adding: fields/datasetParams/ (stored 0%)
  adding: fields/datasetParams/.part-00001.crc (stored 0%)
  adding: fields/datasetParams/_SUCCESS (stored 0%)
  adding: fields/datasetParams/._SUCCESS.crc (stored 0%)
  adding: fields/datasetParams/part-00001 (deflated 34%)
  adding: fields/datasetParams/.part-00000.crc (stored 0%)
  adding: fields/datasetParams/part-00000 (deflated 27%)
  adding: metadata/ (stored 0%)
  adding: metadata/_SUCCESS (stored 0%)
  adding: metadata/._SUCCESS.crc (stored 0%)
  adding: metadata/.part-00000.crc (stored 0%)
  adding: metadata/part-00000 (deflated 41%)
  adding: multiclassifierdl_tensorflow (deflated 85%)
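
Outside Colab, where the `zip` command or the `/content` path may not be available, a comparable archive can be produced from Python. This sketch uses `shutil.make_archive` from the standard library; the helper name `zip_model_dir` is ours, and it is demonstrated on a throwaway directory rather than the real model folder.

```python
import os
import shutil
import tempfile
import zipfile


def zip_model_dir(model_dir, zip_path_without_ext):
    """Zip the contents of model_dir so its files sit at the archive root."""
    return shutil.make_archive(zip_path_without_ext, "zip", root_dir=model_dir)


# Demo on a temporary stand-in for a saved model directory.
work_dir = tempfile.mkdtemp()
model_dir = os.path.join(work_dir, "MultilabelClfBert")
os.makedirs(os.path.join(model_dir, "metadata"))
with open(os.path.join(model_dir, "metadata", "part-00000"), "w") as handle:
    handle.write("{}")

archive = zip_model_dir(model_dir, os.path.join(work_dir, "MultilabelClfBert"))
with zipfile.ZipFile(archive) as zipped:
    print(zipped.namelist())
```

For the model saved earlier, the call would be along the lines of `zip_model_dir("MultilabelClfBert", "MultilabelClfBert")`, which writes `MultilabelClfBert.zip` to the current directory.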
