
![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)





[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Finance/04.1.Training_Financial_Binary_Classifier.ipynb)

# Train Domain-specific Binary Classifiers

In this notebook, you will learn how to use Spark NLP and Finance NLP to train custom binary classification models.

Here we will train a sample model to classify whether a clause in a finance document is `work_experience` or `other`.

## Colab Setup

First, you need to set up the environment to be able to use the licensed package. If you are not running in Google Colab, please check the documentation [here](https://nlp.johnsnowlabs.com/docs/en/licensed_install).

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs 

In [None]:
from johnsnowlabs import nlp
# Log in to your John Snow Labs account to get your license keys
nlp.install(force_browser=True)

## Introduction

Although John Snow Labs provides many pretrained models that cover different applications in the financial domain, there are still problems that are specific to companies or practitioners. For such cases, it is possible to train a new custom model using Finance NLP annotators:

- `ClassifierDLApproach`: Trains a multiclass model (predicts one class out of a predefined set of classes), which also covers binary classification
- `MultiClassifierDLApproach`: Trains a multilabel model (predicts one or more classes for each document)
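
The difference between the two prediction modes can be sketched in plain Python. This is only an illustration of the idea, not how the annotators are implemented; the function names, the scores, and the 0.5 threshold are all assumptions made for the example.

```python
from typing import Dict, List


def predict_multiclass(scores: Dict[str, float]) -> str:
    """Multiclass (and binary): return the single highest-scoring class."""
    return max(scores, key=scores.get)


def predict_multilabel(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Multilabel: return every class whose score reaches the threshold."""
    return sorted(label for label, score in scores.items() if score >= threshold)


# Hypothetical scores for one clause (illustrative values, not model output)
scores = {"work_experience": 0.81, "other": 0.19}
single = predict_multiclass(scores)

# Hypothetical multilabel scores: several labels may apply to the same text
topic_scores = {"acquisition": 0.72, "litigation": 0.64, "dividends": 0.08}
several = predict_multilabel(topic_scores)

print(single)
print(several)
```

A binary classifier is simply the multiclass case with two classes, which is why `ClassifierDLApproach` is used in this notebook.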

## Training Binary Models with `ClassifierDLApproach`

The `ClassifierDLApproach` annotator trains multiclass or binary models, where the prediction is one category out of a predefined set of categories that are present in the training data.

The input is sentence embeddings, such as the state-of-the-art [UniversalSentenceEncoder](https://nlp.johnsnowlabs.com/docs/en/transformers#universalsentenceencoder), [BertSentenceEmbeddings](https://nlp.johnsnowlabs.com/docs/en/transformers#bertsentenceembeddings) or [SentenceEmbeddings](https://nlp.johnsnowlabs.com/docs/en/annotators#sentenceembeddings).

To train a custom model, you need labeled data with at least the following columns:

```
| TEXT | LABELS (list) |
```
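
As a rough sketch of that layout, the snippet below checks that a CSV has `text` and `label` columns and counts the rows per label. It assumes one string label per row, as in the binary dataset used later in this notebook; the two rows and the `check_rows` helper are made up for illustration.

```python
import csv
import io
from collections import Counter

# Two illustrative rows in the text/label layout (made-up examples)
raw = io.StringIO(
    "text,label\n"
    '"Ms. Doe joined the company in 2015 as Chief Financial Officer.",work_experience\n'
    '"The company opened a new data center in 2020.",other\n'
)


def check_rows(handle):
    """Return (rows, label counts), keeping only rows with non-empty text and label."""
    reader = csv.DictReader(handle)
    missing = {"text", "label"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"Missing required columns: {sorted(missing)}")
    rows = [row for row in reader if row["text"].strip() and row["label"].strip()]
    return rows, Counter(row["label"] for row in rows)


rows, label_counts = check_rows(raw)
print(len(rows), dict(label_counts))
```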

In [3]:
from johnsnowlabs import nlp, finance
# Automatically load license data and start a session with all jars user has access to
spark = nlp.start()

👌 Launched cpu optimized session with with: 🚀Spark-NLP==4.2.4, 💊Spark-Healthcare==4.2.4, running on ⚡ PySpark==3.1.2


### Loading the data

In [None]:
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings_JSL/Finance/data/finance_binary_clf.csv

In [21]:
import pandas as pd
df = pd.read_csv('./finance_binary_clf.csv')
print(f"Shape of the full dataset: {df.shape}")

Shape of the full dataset: (313, 2)


In [22]:
df.head()

Unnamed: 0,text,label
0,Health LLC a healthcare data management solut...,work_experience
1,judgment including the involvement of tax pro...,work_experience
2,public information Accordingly investors shou...,work_experience
3,Over the next few years PCC acquired other com...,work_experience
4,s team with the Company since December 2018\na...,work_experience


In [23]:
df.tail()

Unnamed: 0,text,label
308,\nWe use open source software in our offerings...,other
309,\nOn February 18 2020 through our wholly owned...,other
310,sor company VASCO Corp entered the data securi...,other
311,a London based provider of insurance tax comp...,other
312,The Bank J. Safra Sarasin Ltd (previously name...,other


In [24]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 313 entries, 0 to 312
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    313 non-null    object
 1   label   313 non-null    object
dtypes: object(2)
memory usage: 5.0+ KB


In [25]:
df.value_counts("label")

label
other              175
work_experience    138
dtype: int64

> We use a sample of the dataset to make the training process faster (the goal is to illustrate how to perform it). Use the full dataset if you want to experiment with it and achieve more realistic results.
>
> The sample has only 313 observations, so please keep in mind that this will impact the accuracy and generalization capabilities of the model. Since the dataset is small, we use 90% of it to train the model and the other 10% for testing.
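
Spark's `randomSplit` assigns each row to a split at random, so the resulting sizes and class proportions are only approximately 90/10. If you need exact per-class proportions, a stratified split is one alternative. The sketch below is plain Python and is not what this notebook does; the `stratified_split` helper and the placeholder rows are assumptions, and only the class counts (175 and 138) come from the dataset above.

```python
import random
from collections import Counter, defaultdict


def stratified_split(rows, label_key="label", train_fraction=0.9, seed=42):
    """Split rows so that each label keeps roughly the same train/test ratio."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        cut = int(len(group) * train_fraction)  # floor keeps at least 10% for testing
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test


# Placeholder rows that mirror only the class counts of the dataset
rows = [{"text": f"clause {i}", "label": "other"} for i in range(175)]
rows += [{"text": f"bio {i}", "label": "work_experience"} for i in range(138)]

train_rows, test_rows = stratified_split(rows)
test_counts = Counter(row["label"] for row in test_rows)
print(len(train_rows), len(test_rows), dict(test_counts))
```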

In [26]:
data = spark.createDataFrame(df)

# If you have a single dataset, split it; otherwise, load the test dataset the same way you load the training data.
train, test = data.randomSplit([0.9, 0.1], seed=42)

In [27]:
train.show(truncate=50)

+--------------------------------------------------+---------------+
|                                              text|          label|
+--------------------------------------------------+---------------+
|
 
Chief Executive Officer and Director
Jeffrey...|work_experience|
|
In 1996 we expanded our computer security busi...|          other|
|
In January 2020 we acquired 100 of the outstan...|          other|
|
In fiscal 2019 we acquired 100 of the equity o...|          other|
|
Lior
Kohavi joined Cyren in June 2013 as Chief...|work_experience|
|
On October 15 2018 we acquired tCell io Inc tC...|          other|
|
The fragmented nature of our market provides a...|          other|
|
have constructed our own Bitcoin mining facili...|          other|
| 000 Mr Lowrey
was also eligible for a cash and...|work_experience|
| 13a 15 f of the Exchange Act Under the supervi...|work_experience|
| Health LLC a healthcare data management soluti...|work_experience|
| Prior to his appointment as Corp

In [28]:
from pyspark.sql.functions import col

test.groupBy("label").count().orderBy(col("count").desc()).show()

+---------------+-----+
|          label|count|
+---------------+-----+
|          other|   16|
|work_experience|    8|
+---------------+-----+



### Train With Universal Sentence Encoder

Universal Sentence Encoder is a state-of-the-art architecture to create vector representations of text. We use a pretrained model for the embeddings, so only the classifier needs to be trained (training both the embeddings and the classifier is also possible).

The pretrained model was trained and optimized for greater-than-word length text, such as sentences, phrases or short paragraphs. It is trained on a variety of data sources and a variety of tasks with the aim of dynamically accommodating a wide variety of natural language understanding tasks. The input is variable-length English text and the output is a 512-dimensional vector.

In [37]:
document_assembler = (
    nlp.DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")
    .setCleanupMode("shrink")
)

embeddings = (
    nlp.UniversalSentenceEncoder.pretrained()
    .setInputCols("document")
    .setOutputCol("sentence_embeddings")
)

classifierdl = (
    finance.ClassifierDLApproach()
    .setInputCols(["sentence_embeddings"])
    .setOutputCol("class")
    .setLabelColumn("label")
    .setMaxEpochs(30)
    .setEnableOutputLogs(True)
    .setOutputLogsPath("binary_use")
    .setLr(0.001)
    .setBatchSize(4)
)

clf_pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, classifierdl])

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]


In [38]:
%%time
clf_pipelineModel = clf_pipeline.fit(train)

CPU times: user 289 ms, sys: 47.2 ms, total: 337 ms
Wall time: 39.3 s


In [39]:
import os

# The training log is written to the folder set with setOutputLogsPath
log_file_name = os.listdir("binary_use")[0]

with open(os.path.join("binary_use", log_file_name), "r") as log_file:
    print(log_file.read())

Training started - epochs: 30 - learning_rate: 0.001 - batch_size: 4 - training_examples: 289 - classes: 2
Epoch 0/30 - 0.96s - loss: 33.215763 - acc: 0.8854167 - batches: 73
Epoch 1/30 - 0.71s - loss: 26.753807 - acc: 0.9652778 - batches: 73
Epoch 2/30 - 0.76s - loss: 25.876415 - acc: 0.9618056 - batches: 73
Epoch 3/30 - 0.59s - loss: 25.574816 - acc: 0.9756944 - batches: 73
Epoch 4/30 - 0.60s - loss: 25.297764 - acc: 0.9756944 - batches: 73
Epoch 5/30 - 0.56s - loss: 25.040972 - acc: 0.9861111 - batches: 73
Epoch 6/30 - 0.61s - loss: 24.821703 - acc: 0.9895833 - batches: 73
Epoch 7/30 - 0.76s - loss: 24.645504 - acc: 0.9895833 - batches: 73
Epoch 8/30 - 0.61s - loss: 24.500004 - acc: 0.9895833 - batches: 73
Epoch 9/30 - 0.68s - loss: 24.37931 - acc: 0.9930556 - batches: 73
Epoch 10/30 - 0.64s - loss: 24.279428 - acc: 0.9965278 - batches: 73
Epoch 11/30 - 0.64s - loss: 24.196024 - acc: 0.9965278 - batches: 73
Epoch 12/30 - 0.66s - loss: 24.125896 - acc: 1.0 - batches: 73
Epoch 13/30 -
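
To inspect the training curve programmatically, the log can be parsed with a regular expression. This is a minimal sketch that assumes the line format shown in the output above; the `parse_training_log` helper is hypothetical, and the sample lines are copied from that output.

```python
import math
import re

# Pattern for lines such as:
# Epoch 0/30 - 0.96s - loss: 33.215763 - acc: 0.8854167 - batches: 73
EPOCH_LINE = re.compile(
    r"Epoch (?P<epoch>\d+)/(?P<total>\d+) - (?P<seconds>[\d.]+)s - "
    r"loss: (?P<loss>[\d.]+) - acc: (?P<acc>[\d.]+) - batches: (?P<batches>\d+)"
)


def parse_training_log(text):
    """Return one dict per complete epoch line found in the log text."""
    return [
        {
            "epoch": int(match["epoch"]),
            "loss": float(match["loss"]),
            "acc": float(match["acc"]),
            "batches": int(match["batches"]),
        }
        for match in EPOCH_LINE.finditer(text)
    ]


sample_log = """Training started - epochs: 30 - learning_rate: 0.001 - batch_size: 4 - training_examples: 289 - classes: 2
Epoch 0/30 - 0.96s - loss: 33.215763 - acc: 0.8854167 - batches: 73
Epoch 1/30 - 0.71s - loss: 26.753807 - acc: 0.9652778 - batches: 73
Epoch 12/30 - 0.66s - loss: 24.125896 - acc: 1.0 - batches: 73
"""

history = parse_training_log(sample_log)

# 289 training examples with a batch size of 4 need ceil(289 / 4) batches per epoch
batches_per_epoch = math.ceil(289 / 4)
print(history[-1], batches_per_epoch)
```

The batch count follows from the training settings: the last batch of each epoch holds the single leftover example.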

### Test the trained model

In [40]:
preds = clf_pipelineModel.transform(test)

In [42]:
preds_df = preds.select("label", "text", "class.result").toPandas()
preds_df.head()

Unnamed: 0,label,text,result
0,work_experience,\nPursuant to the requirements of the Securiti...,[work_experience]
1,other,\nWe plan to continue to grow our business by ...,[other]
2,work_experience,Goren is retained\nby the Company through an A...,[work_experience]
3,work_experience,"In December 2005, Ducati returned to Italian o...",[work_experience]
4,work_experience,Weaver was Senior Vice President and Deputy Ge...,[work_experience]


In [43]:
# The result is an array because Spark NLP annotators can return several annotations per row.
# Here each document yields a single prediction, so take the first item of each array
preds_df['result'] = preds_df['result'].apply(lambda x : x[0])

In [44]:
# We are going to use sklearn to evaluate the results on the test dataset
from sklearn.metrics import classification_report

print(classification_report(preds_df['label'], preds_df['result']))

                 precision    recall  f1-score   support

          other       1.00      0.88      0.93        16
work_experience       0.80      1.00      0.89         8

       accuracy                           0.92        24
      macro avg       0.90      0.94      0.91        24
   weighted avg       0.93      0.92      0.92        24
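
To see where these numbers come from, the per-class metrics can be recomputed by hand. The sketch below uses a hypothetical `per_class_metrics` helper. The label lists are an assumption inferred from the report (14 of the 16 `other` rows predicted correctly, all 8 `work_experience` rows predicted correctly); they are consistent with the rounded figures but were not read from the notebook's predictions.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Compute precision, recall, F1 and support for every label."""
    pairs = Counter(zip(y_true, y_pred))
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = pairs[(label, label)]
        predicted = sum(n for (_, p), n in pairs.items() if p == label)
        actual = sum(n for (t, _), n in pairs.items() if t == label)
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": actual,
        }
    return report


# Assumed predictions, inferred from the report rather than taken from the model
y_true = ["other"] * 16 + ["work_experience"] * 8
y_pred = ["other"] * 14 + ["work_experience"] * 2 + ["work_experience"] * 8

metrics = per_class_metrics(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

for label, values in metrics.items():
    print(label, {name: round(value, 2) for name, value in values.items()})
print("accuracy", round(accuracy, 2))
```

With only 24 test rows, two misclassified clauses move accuracy by about 8 points, so these scores should be read as indicative only.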



### Saving & loading back the trained model

In [45]:
clf_pipelineModel.stages

[DocumentAssembler_ee7810ffce98,
 UNIVERSAL_SENTENCE_ENCODER_4de71669b7ec,
 FinanceClassifierDLModel_8c17ef673d46]

In [46]:
clf_pipelineModel.stages[-1].write().overwrite().save('Clf_Use')

In [47]:
# Load the saved classifier model back
ClfModel = finance.ClassifierDLModel.load('Clf_Use')

In [48]:
# All stages are already fitted, so fitting on an empty DataFrame only assembles the PipelineModel
ld_pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, ClfModel])
ld_pipeline_model = ld_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

In [49]:
# Apply the loaded pipeline to the test data
ld_preds = ld_pipeline_model.transform(test)

In [50]:
ld_preds_df = ld_preds.select("text", "label", "class.result").toPandas()

In [51]:
ld_preds_df.head()

Unnamed: 0,text,label,result
0,\nPursuant to the requirements of the Securiti...,work_experience,[work_experience]
1,\nWe plan to continue to grow our business by ...,other,[other]
2,Goren is retained\nby the Company through an A...,work_experience,[work_experience]
3,"In December 2005, Ducati returned to Italian o...",work_experience,[work_experience]
4,Weaver was Senior Vice President and Deputy Ge...,work_experience,[work_experience]


### Train with Bert Embeddings

We do not have financial sentence embeddings yet, but we can use the financial word embeddings and then average them. Since this model takes a long time to train, we will train for only one epoch.
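
Averaging word embeddings means taking the element-wise mean of the token vectors of a document. The following is a minimal plain-Python sketch of that idea, not the `SentenceEmbeddings` implementation; the `average_pool` helper and the 3-dimensional toy vectors are assumptions (real BERT vectors are much larger).

```python
from typing import List


def average_pool(token_vectors: List[List[float]]) -> List[float]:
    """Element-wise mean of equally sized token vectors."""
    if not token_vectors:
        raise ValueError("At least one token vector is required")
    dim = len(token_vectors[0])
    if any(len(vector) != dim for vector in token_vectors):
        raise ValueError("All token vectors must have the same dimension")
    count = len(token_vectors)
    return [sum(vector[i] for vector in token_vectors) / count for i in range(dim)]


# Toy embeddings for a three-token text (illustrative values only)
token_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 4.0],
]

sentence_vector = average_pool(token_vectors)
print(sentence_vector)
```

The result has the same dimension as a single token vector, regardless of how many tokens the text contains, which is what lets the classifier take a fixed-size input.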

In [52]:
embeddings = (
    nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en")
    .setInputCols(["document", "token"])
    .setOutputCol("embeddings")
)

bert_embeddings_sec_bert_base download started this may take some time.
Approximate size to download 390.4 MB
[OK!]


In [53]:
document_assembler = (
    nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
)

tokenizer = nlp.Tokenizer().setInputCols(["document"]).setOutputCol("token")

embeddingsSentence = (
    nlp.SentenceEmbeddings()
    .setInputCols(["document", "embeddings"])
    .setOutputCol("sentence_embeddings")
    .setPoolingStrategy("AVERAGE")
)

classifierdl = (
    finance.ClassifierDLApproach()
    .setInputCols(["sentence_embeddings"])
    .setOutputCol("class")
    .setLabelColumn("label")
    .setMaxEpochs(1)
    .setLr(0.001)
    .setEnableOutputLogs(True)
    .setOutputLogsPath("binary_bert")
    .setBatchSize(4)
    .setDropout(0.15)
)

clf_pipeline = nlp.Pipeline(
    stages=[document_assembler, tokenizer, embeddings, embeddingsSentence, classifierdl]
)

In [54]:
%%time
clf_pipelineModel = clf_pipeline.fit(train)

CPU times: user 2.08 s, sys: 265 ms, total: 2.35 s
Wall time: 4min 58s


In [55]:
preds = clf_pipelineModel.transform(test)

In [56]:
preds_df = preds.select("label", "text", "class.result").toPandas()

In [57]:
preds_df.head()

Unnamed: 0,label,text,result
0,work_experience,\nPursuant to the requirements of the Securiti...,[work_experience]
1,other,\nWe plan to continue to grow our business by ...,[other]
2,work_experience,Goren is retained\nby the Company through an A...,[other]
3,work_experience,"In December 2005, Ducati returned to Italian o...",[other]
4,work_experience,Weaver was Senior Vice President and Deputy Ge...,[work_experience]


In [58]:
log_files = os.listdir("binary_bert")

with open("binary_bert/"+log_files[0], "r") as log_file :
    print(log_file.read())

Training started - epochs: 1 - learning_rate: 0.001 - batch_size: 4 - training_examples: 289 - classes: 2
Epoch 0/1 - 0.81s - loss: 39.71207 - acc: 0.7534722 - batches: 73



In [59]:
# Each document yields a single prediction, so take the first item of each result array
preds_df['result'] = preds_df['result'].apply(lambda x : x[0])

from sklearn.metrics import classification_report

print(classification_report(preds_df['label'], preds_df['result']))


                 precision    recall  f1-score   support

          other       0.88      0.94      0.91        16
work_experience       0.86      0.75      0.80         8

       accuracy                           0.88        24
      macro avg       0.87      0.84      0.85        24
   weighted avg       0.87      0.88      0.87        24



### Save the model and zip it for Models Hub upload/download

In [60]:
# Save a Spark NLP model
clf_pipelineModel.stages[-1].write().overwrite().save('ClfBert')

# cd into saved dir and zip
! cd /content/ClfBert ; zip -r /content/ClfBert.zip *

  adding: classifierdl_tensorflow (deflated 58%)
  adding: fields/ (stored 0%)
  adding: fields/datasetParams/ (stored 0%)
  adding: fields/datasetParams/.part-00001.crc (stored 0%)
  adding: fields/datasetParams/_SUCCESS (stored 0%)
  adding: fields/datasetParams/._SUCCESS.crc (stored 0%)
  adding: fields/datasetParams/part-00001 (deflated 30%)
  adding: fields/datasetParams/.part-00000.crc (stored 0%)
  adding: fields/datasetParams/part-00000 (deflated 26%)
  adding: metadata/ (stored 0%)
  adding: metadata/_SUCCESS (stored 0%)
  adding: metadata/._SUCCESS.crc (stored 0%)
  adding: metadata/.part-00000.crc (stored 0%)
  adding: metadata/part-00000 (deflated 40%)
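
Outside Colab, where the `/content` path and the `zip` command may not be available, the same archive can be built from Python. This is a sketch using the standard library; the `zip_model_dir` helper is hypothetical, and the demonstration runs on a stand-in directory rather than a real saved model.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def zip_model_dir(model_dir, archive_base):
    """Zip the contents of model_dir into <archive_base>.zip and return its path."""
    return shutil.make_archive(str(archive_base), "zip", root_dir=str(model_dir))


# Stand-in for a saved model folder (a real one is created by .save('ClfBert'))
workdir = Path(tempfile.mkdtemp())
fake_model = workdir / "ClfBert"
(fake_model / "metadata").mkdir(parents=True)
(fake_model / "metadata" / "part-00000").write_text("{}")

archive_path = zip_model_dir(fake_model, workdir / "ClfBert")

with zipfile.ZipFile(archive_path) as archive:
    archive_names = archive.namelist()

print(archive_path)
print(archive_names)
```

As with the shell command above, the archive holds the folder's contents at its top level rather than the folder itself.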
