
![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)





[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Finance/04.2.Training_Financial_Multiclass_Classifier.ipynb)

# Train Domain-specific Multiclass Classifiers

In this notebook, you will learn how to use Spark NLP and Finance NLP to train custom multiclass classification models.

## Colab Setup

First, set up the environment so that you can use the licensed package. If you are not running in Google Colab, please check the documentation [here](https://nlp.johnsnowlabs.com/docs/en/licensed_install).

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs 

In [None]:
from johnsnowlabs import nlp
# Log in to your John Snow Labs account to retrieve your license keys
nlp.install(force_browser=True)

## Start Spark Session

In [3]:
from johnsnowlabs import nlp, finance
# Automatically load license data and start a session with all jars user has access to
spark = nlp.start()

👌 Launched cpu optimized session with with: 🚀Spark-NLP==4.2.4, 💊Spark-Healthcare==4.2.4, running on ⚡ PySpark==3.1.2


## Introduction

Although John Snow Labs provides many pretrained models that cover different applications in the financial domain, some problems are specific to a company or practitioner. For such cases, you can train a custom model using Finance NLP annotators:

- `ClassifierDLApproach`: Trains a multiclass model (predicts exactly one class out of a predefined set of classes), which also covers binary classification
- `MultiClassifierDLApproach`: Trains a multilabel model (predicts one or more classes for each document)

The input to both annotators is sentence embeddings, such as the state-of-the-art [UniversalSentenceEncoder](https://nlp.johnsnowlabs.com/docs/en/transformers#universalsentenceencoder), [BertSentenceEmbeddings](https://nlp.johnsnowlabs.com/docs/en/transformers#bertsentenceembeddings) or [SentenceEmbeddings](https://nlp.johnsnowlabs.com/docs/en/annotators#sentenceembeddings).

To train a custom model, you need labeled data with at least the following columns:

```
| TEXT | LABELS (list) |
```
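As a rough illustration of that layout, the sketch below builds a few rows in plain Python. The example texts and the column names `label` / `labels` are assumptions made for this illustration; the label values are taken from the dataset used later in this notebook. A multiclass row carries a single label string, while a multilabel row carries a list of labels.

```python
# Hypothetical rows; the texts are invented for illustration only.
multiclass_rows = [
    {"text": "We operate retail stores across the country.", "label": "business"},
    {"text": "Network outages could harm our results.", "label": "risk_factors"},
]

multilabel_rows = [
    {
        "text": "Interest rate changes could reduce our income.",
        "labels": ["risk_factors", "market_risk"],
    },
    {"text": "We lease our headquarters building.", "labels": ["properties"]},
]


def is_multiclass_row(row):
    """A multiclass row has exactly one non-empty label string."""
    return isinstance(row["label"], str) and bool(row["label"])


def is_multilabel_row(row):
    """A multilabel row has a non-empty list of label strings."""
    labels = row["labels"]
    return (
        isinstance(labels, list)
        and len(labels) > 0
        and all(isinstance(item, str) for item in labels)
    )


print(all(is_multiclass_row(r) for r in multiclass_rows))
print(all(is_multilabel_row(r) for r in multilabel_rows))
```

A quick shape check like this can catch a label column that was accidentally loaded as a string instead of a list before training starts.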

## Multiclass Classifier Training


The `ClassifierDLApproach` annotator trains a multiclass model, where each prediction is exactly one category out of the predefined set of categories present in the training data.

## Loading the data

In [4]:
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings_JSL/Finance/data/finance_clf_data.csv

In [10]:
import pandas as pd
df = pd.read_csv('finance_clf_data.csv')
print(f"Shape of the full dataset: {df.shape}")

Shape of the full dataset: (6128, 3)


In [11]:
df.head()

Unnamed: 0,text,label,len
0,Presently we do not believe any U S or State r...,business,402
1,\nnetwork outages or performance degradation ...,risk_factors,496
2,Available Information\nOur reports filed with ...,business,356
3,\n 42 530\n \n \n \n \n \n 42 530\nTotal liab...,financial_statements,359
4,8\nTable of Contents\ndevelopment employee eng...,business,582


In [12]:
df['label'].value_counts()

risk_factors               1926
financial_statements       1888
business                    970
financial_conditions        346
form_10k_summary            240
executives_compensation     155
controls_procedures         138
equity                      111
market_risk                 100
executives                   73
legal_proceedings            51
properties                   48
security_ownership           46
exhibits                     36
Name: label, dtype: int64

Since deep learning models can take some time to train, we will limit the dataset to a smaller number of observations. This is enough to illustrate how to use Spark NLP and Finance NLP annotators and pipelines to train a model without a long wait.

Please note that both the quality and the quantity of the training data strongly affect the trained model, and the results obtained here are for illustration purposes only. To obtain a more realistic model, please consider using the full dataset or adding extra observations from different sources.
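To see how much of the data the largest classes account for, the counts printed by `value_counts()` above can be summarized in plain Python. This is only a sketch that copies those counts by hand; in the notebook itself you would work from `df['label'].value_counts()` directly.

```python
# Label counts copied from the value_counts() output above.
label_counts = {
    "risk_factors": 1926,
    "financial_statements": 1888,
    "business": 970,
    "financial_conditions": 346,
    "form_10k_summary": 240,
    "executives_compensation": 155,
    "controls_procedures": 138,
    "equity": 111,
    "market_risk": 100,
    "executives": 73,
    "legal_proceedings": 51,
    "properties": 48,
    "security_ownership": 46,
    "exhibits": 36,
}

total = sum(label_counts.values())

# The three most frequent classes and their combined share of the dataset.
top3 = sorted(label_counts, key=label_counts.get, reverse=True)[:3]
top3_share = sum(label_counts[name] for name in top3) / total

print(total)
print(top3)
print(f"{top3_share:.1%}")
```

The dataset is clearly imbalanced: the three largest classes make up most of the rows, while the smallest classes have only a few dozen examples each.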

In [13]:
from sklearn.model_selection import train_test_split

# The top 3 categories (number of observations) 
filter_classes = [
    "risk_factors",
    "financial_statements",
    "business"
]

# We make a random sample with 500 observations
df = df.loc[df.label.isin(filter_classes)].sample(500)

# Stratified split into train and test datasets (class proportions are preserved)
train_data, test_data = train_test_split(
    df, train_size=0.9, stratify=df.label, random_state=42
)

# Convert the pandas DataFrames to Spark DataFrames
train = spark.createDataFrame(train_data) 
test = spark.createDataFrame(test_data)
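To make the effect of `stratify` concrete, here is a minimal standard-library sketch of a stratified split. It is not scikit-learn's implementation, only an illustration of the idea: each class is shuffled and cut separately, so both parts keep roughly the same class proportions. The helper name `stratified_split` is hypothetical, and the class sizes are the per-class totals implied by the train and test counts shown in this notebook's run.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, train_size=0.9, seed=42):
    """Return (train_idx, test_idx), splitting every class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_size)  # per-class cut point
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)


# Class sizes of the 500-row sample (train count + test count per class).
labels = (
    ["risk_factors"] * 214
    + ["financial_statements"] * 187
    + ["business"] * 99
)

train_idx, test_idx = stratified_split(labels)
train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)

print(train_counts)
print(test_counts)
```

Without stratification, a small test set of 50 rows could by chance contain very few examples of the smallest class, which would make its per-class metrics unreliable.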

In [14]:
from pyspark.sql.functions import col

train.groupBy("label").count().orderBy(col("count").desc()).show()

+--------------------+-----+
|               label|count|
+--------------------+-----+
|        risk_factors|  193|
|financial_statements|  168|
|            business|   89|
+--------------------+-----+



In [15]:
from pyspark.sql.functions import col

test.groupBy("label").count().orderBy(col("count").desc()).show()

+--------------------+-----+
|               label|count|
+--------------------+-----+
|        risk_factors|   21|
|financial_statements|   19|
|            business|   10|
+--------------------+-----+



## Train with Universal Sentence Encoder

In [16]:
document_assembler = (
    nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
)

embeddings = (
    nlp.UniversalSentenceEncoder.pretrained()
    .setInputCols("document")
    .setOutputCol("sentence_embeddings")
)

classifierdl = (
    finance.ClassifierDLApproach()
    .setInputCols(["sentence_embeddings"])
    .setOutputCol("class")
    .setLabelColumn("label")
    .setMaxEpochs(30)
    .setEnableOutputLogs(True)
    .setOutputLogsPath("multiclass_use")
    .setLr(0.001)
    .setBatchSize(4)
    .setDropout(0.15)
)

clf_pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, classifierdl])

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]


In [17]:
%%time
clf_pipelineModel = clf_pipeline.fit(train)

CPU times: user 327 ms, sys: 51.9 ms, total: 379 ms
Wall time: 47.4 s


In [18]:
import os
log_file_name = os.listdir("multiclass_use")[0]

with open("multiclass_use/"+log_file_name, "r") as log_file :
    print(log_file.read())

Training started - epochs: 30 - learning_rate: 0.001 - batch_size: 4 - training_examples: 450 - classes: 3
Epoch 0/30 - 1.34s - loss: 100.55365 - acc: 0.62723213 - batches: 113
Epoch 1/30 - 0.96s - loss: 88.57661 - acc: 0.74776787 - batches: 113
Epoch 2/30 - 0.93s - loss: 81.75809 - acc: 0.84375 - batches: 113
Epoch 3/30 - 0.95s - loss: 78.55571 - acc: 0.859375 - batches: 113
Epoch 4/30 - 1.00s - loss: 76.94736 - acc: 0.875 - batches: 113
Epoch 5/30 - 1.00s - loss: 75.97311 - acc: 0.8839286 - batches: 113
Epoch 6/30 - 0.96s - loss: 75.219635 - acc: 0.89285713 - batches: 113
Epoch 7/30 - 0.91s - loss: 74.57784 - acc: 0.90178573 - batches: 113
Epoch 8/30 - 1.26s - loss: 74.00501 - acc: 0.90401787 - batches: 113
Epoch 9/30 - 1.19s - loss: 73.50686 - acc: 0.9129464 - batches: 113
Epoch 10/30 - 0.94s - loss: 73.07396 - acc: 0.9151786 - batches: 113
Epoch 11/30 - 1.02s - loss: 72.6979 - acc: 0.92410713 - batches: 113
Epoch 12/30 - 0.96s - loss: 72.364716 - acc: 0.9285714 - batches: 113
Epoch
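If you want to inspect the training curve rather than read the raw log, the epoch lines can be parsed with a regular expression. The sketch below assumes the log format printed above and uses the first three epoch lines as sample input; the helper name `parse_training_log` is hypothetical. Lines that do not match the pattern, such as the header or a truncated line, are skipped.

```python
import re

# Matches lines like:
# Epoch 0/30 - 1.34s - loss: 100.55365 - acc: 0.62723213 - batches: 113
EPOCH_LINE = re.compile(
    r"Epoch (\d+)/(\d+) - ([\d.]+)s - loss: ([\d.]+) - acc: ([\d.]+) - batches: (\d+)"
)


def parse_training_log(text):
    """Return one dict per epoch line found in the log text."""
    history = []
    for line in text.splitlines():
        match = EPOCH_LINE.match(line.strip())
        if match is None:
            continue  # header, blank, or truncated line
        epoch, _total, seconds, loss, acc, _batches = match.groups()
        history.append(
            {
                "epoch": int(epoch),
                "seconds": float(seconds),
                "loss": float(loss),
                "acc": float(acc),
            }
        )
    return history


sample_log = """Training started - epochs: 30 - learning_rate: 0.001 - batch_size: 4 - training_examples: 450 - classes: 3
Epoch 0/30 - 1.34s - loss: 100.55365 - acc: 0.62723213 - batches: 113
Epoch 1/30 - 0.96s - loss: 88.57661 - acc: 0.74776787 - batches: 113
Epoch 2/30 - 0.93s - loss: 81.75809 - acc: 0.84375 - batches: 113
Epoch
"""

history = parse_training_log(sample_log)
for row in history:
    print(row["epoch"], row["loss"], row["acc"])
```

The resulting list of dictionaries can be passed to `pd.DataFrame(history)` for plotting loss and accuracy per epoch.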

In [19]:
preds = clf_pipelineModel.transform(test)

In [20]:
preds_df = preds.select("label", "text", "class.result").toPandas()
preds_df.head()

Unnamed: 0,label,text,result
0,risk_factors,Item 1A Risk Factors\nThe following are certai...,[risk_factors]
1,financial_statements,\n24 529\n \n14 870\n \nDeferred revenue\n \n...,[financial_statements]
2,financial_statements,\n 60 \n \n 1 646 \n \n 56 \n \n \n \n \n \n ...,[financial_statements]
3,financial_statements,Note 11 Income Taxes\nLoss before the income t...,[financial_statements]
4,risk_factors,\ninability to maintain relationships with cu...,[risk_factors]


In [21]:
# The result is an array because Spark NLP can return one annotation per sentence.
# Each document here has a single prediction, so we keep the first item of the array.
preds_df['result'] = preds_df['result'].apply(lambda x: x[0])

In [22]:
# We are going to use sklearn to evaluate the results on the test dataset
from sklearn.metrics import classification_report

print(classification_report(preds_df['label'], preds_df['result']))

                      precision    recall  f1-score   support

            business       0.82      0.90      0.86        10
financial_statements       0.95      0.95      0.95        19
        risk_factors       0.95      0.90      0.93        21

            accuracy                           0.92        50
           macro avg       0.91      0.92      0.91        50
        weighted avg       0.92      0.92      0.92        50
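The two summary rows of the report differ only in how the per-class scores are combined: the macro average treats every class equally, while the weighted average weights each class by its support. The sketch below recomputes both from the rounded per-class values printed above. Because it starts from rounded numbers, it is only an approximation of what scikit-learn computes from the raw predictions, although here the results agree to two decimals.

```python
# Per-class scores copied from the classification report above.
report = {
    "business": {"precision": 0.82, "recall": 0.90, "f1": 0.86, "support": 10},
    "financial_statements": {"precision": 0.95, "recall": 0.95, "f1": 0.95, "support": 19},
    "risk_factors": {"precision": 0.95, "recall": 0.90, "f1": 0.93, "support": 21},
}


def macro_avg(scores, metric):
    """Unweighted mean over classes."""
    return sum(row[metric] for row in scores.values()) / len(scores)


def weighted_avg(scores, metric):
    """Mean over classes, weighted by the number of test examples per class."""
    total = sum(row["support"] for row in scores.values())
    return sum(row[metric] * row["support"] for row in scores.values()) / total


for metric in ("precision", "recall", "f1"):
    print(metric, round(macro_avg(report, metric), 2), round(weighted_avg(report, metric), 2))
```

On an imbalanced test set, the macro average is the more sensitive of the two to poor performance on a small class such as `business`.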



### Saving & loading back the trained model

In [23]:
clf_pipelineModel.stages

[DocumentAssembler_c11dc7c5b43f,
 UNIVERSAL_SENTENCE_ENCODER_4de71669b7ec,
 FinanceClassifierDLModel_7c86a83d5a9c]

In [24]:
clf_pipelineModel.stages[-1].write().overwrite().save('Clf_Use')

In [25]:
# Load the saved classifier model back
ClfModel = finance.ClassifierDLModel.load('Clf_Use')

In [26]:
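# All stages are already trained, so fitting on a dummy one-row DataFrame
# only assembles them into a PipelineModel; no training takes place.
ld_pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, ClfModel])
ld_pipeline_model = ld_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))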

In [27]:
# Apply Model Transform to testData
ld_preds = ld_pipeline_model.transform(test)

In [28]:
ld_preds_df = ld_preds.select("text", "label", "class.result").toPandas()

In [29]:
ld_preds_df.head()

Unnamed: 0,text,label,result
0,Item 1A Risk Factors\nThe following are certai...,risk_factors,[risk_factors]
1,\n24 529\n \n14 870\n \nDeferred revenue\n \n...,financial_statements,[financial_statements]
2,\n 60 \n \n 1 646 \n \n 56 \n \n \n \n \n \n ...,financial_statements,[financial_statements]
3,Note 11 Income Taxes\nLoss before the income t...,financial_statements,[financial_statements]
4,\ninability to maintain relationships with cu...,risk_factors,[risk_factors]


## Train with BERT Embeddings

We do not have financial sentence embeddings yet, but we can use the financial word embeddings and average them to obtain one vector per document. Since this model takes a long time to train, we will train for only one epoch.
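Conceptually, averaging word embeddings means taking the element-wise mean of all token vectors in a document. The sketch below shows this with toy 3-dimensional vectors in plain Python. It is only an illustration of the pooling idea, not the `SentenceEmbeddings` implementation, and the helper name `average_pool` is hypothetical; real BERT vectors have hundreds of dimensions.

```python
def average_pool(token_vectors):
    """Element-wise mean of equally sized token vectors."""
    if not token_vectors:
        raise ValueError("at least one token vector is required")
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


# Toy embeddings for a three-token document.
token_vectors = [
    [1.0, 2.0, 3.0],
    [3.0, 4.0, 5.0],
    [5.0, 6.0, 7.0],
]

sentence_vector = average_pool(token_vectors)
print(sentence_vector)  # [3.0, 4.0, 5.0]
```

The output vector has the same dimension as the word embeddings regardless of document length, which is what lets a fixed-size classifier consume documents of any length.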

In [30]:
embeddings = (
    nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en")
    .setInputCols(["document", "token"])
    .setOutputCol("embeddings")
)

bert_embeddings_sec_bert_base download started this may take some time.
Approximate size to download 390.4 MB
[OK!]


In [31]:
document_assembler = (
    nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
)

tokenizer = nlp.Tokenizer().setInputCols(["document"]).setOutputCol("token")

embeddingsSentence = (
    nlp.SentenceEmbeddings()
    .setInputCols(["document", "embeddings"])
    .setOutputCol("sentence_embeddings")
    .setPoolingStrategy("AVERAGE")
)

classifierdl = (
    finance.ClassifierDLApproach()
    .setInputCols(["sentence_embeddings"])
    .setOutputCol("class")
    .setLabelColumn("label")
    .setMaxEpochs(1)
    .setLr(0.001)
    .setEnableOutputLogs(True)
    .setOutputLogsPath("multiclass_bert")
    .setBatchSize(4)
    .setDropout(0.15)
)

clf_pipeline = nlp.Pipeline(
    stages=[document_assembler, tokenizer, embeddings, embeddingsSentence, classifierdl]
)

In [32]:
%%time
clf_pipelineModel = clf_pipeline.fit(train)

CPU times: user 2.77 s, sys: 340 ms, total: 3.11 s
Wall time: 7min 26s


In [33]:
preds = clf_pipelineModel.transform(test)

In [34]:
preds_df = preds.select("label", "text", "class.result").toPandas()

In [35]:
preds_df.head()

Unnamed: 0,label,text,result
0,risk_factors,Item 1A Risk Factors\nThe following are certai...,[risk_factors]
1,financial_statements,\n24 529\n \n14 870\n \nDeferred revenue\n \n...,[financial_statements]
2,financial_statements,\n 60 \n \n 1 646 \n \n 56 \n \n \n \n \n \n ...,[financial_statements]
3,financial_statements,Note 11 Income Taxes\nLoss before the income t...,[financial_statements]
4,risk_factors,\ninability to maintain relationships with cu...,[risk_factors]


In [36]:
log_files = os.listdir("multiclass_bert")

with open("multiclass_bert/"+log_files[0], "r") as log_file :
    print(log_file.read())

Training started - epochs: 1 - learning_rate: 0.001 - batch_size: 4 - training_examples: 450 - classes: 3
Epoch 0/1 - 1.14s - loss: 99.01693 - acc: 0.6875 - batches: 113
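Note that `os.listdir(...)[0]` returns an arbitrary entry, so if the log directory contains files from several training runs it may not show the latest one. A possible alternative, sketched below with a hypothetical helper `latest_log_file`, is to pick the most recently modified file. The demo uses a temporary directory with two invented file names so that it runs without any training logs.

```python
import os
import tempfile


def latest_log_file(log_dir):
    """Return the path of the most recently modified file in log_dir."""
    paths = [os.path.join(log_dir, name) for name in os.listdir(log_dir)]
    files = [path for path in paths if os.path.isfile(path)]
    if not files:
        raise FileNotFoundError(f"no log files found in {log_dir!r}")
    return max(files, key=os.path.getmtime)


# Demo: two dummy files with explicit modification times.
with tempfile.TemporaryDirectory() as demo_dir:
    for name, mtime in [("old_run.log", 1_000_000_000), ("new_run.log", 2_000_000_000)]:
        path = os.path.join(demo_dir, name)
        with open(path, "w") as handle:
            handle.write(name)
        os.utime(path, (mtime, mtime))  # set access and modification time
    latest_name = os.path.basename(latest_log_file(demo_dir))

print(latest_name)  # new_run.log
```

In the notebook, `latest_log_file("multiclass_bert")` could replace the `log_files[0]` lookup.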



In [37]:
# Each document has a single prediction, so we keep the first item of the result array
preds_df['result'] = preds_df['result'].apply(lambda x: x[0])

from sklearn.metrics import classification_report

print(classification_report(preds_df['label'], preds_df['result']))


                      precision    recall  f1-score   support

            business       0.00      0.00      0.00        10
financial_statements       0.86      1.00      0.93        19
        risk_factors       0.71      0.95      0.82        21

            accuracy                           0.78        50
           macro avg       0.53      0.65      0.58        50
        weighted avg       0.63      0.78      0.70        50



### Save model and Zip it for Modelshub Upload/Downloads

In [None]:
# Save a Spark NLP model
clf_pipelineModel.stages[-1].write().overwrite().save('MultiClfBert')

# cd into saved dir and zip
! cd /content/MultiClfBert ; zip -r /content/MultiClfBert.zip *

  adding: classifierdl_tensorflow (deflated 57%)
  adding: fields/ (stored 0%)
  adding: fields/datasetParams/ (stored 0%)
  adding: fields/datasetParams/.part-00002.crc (stored 0%)
  adding: fields/datasetParams/part-00003 (deflated 30%)
  adding: fields/datasetParams/.part-00000.crc (stored 0%)
  adding: fields/datasetParams/_SUCCESS (stored 0%)
  adding: fields/datasetParams/.part-00003.crc (stored 0%)
  adding: fields/datasetParams/part-00002 (deflated 27%)
  adding: fields/datasetParams/._SUCCESS.crc (stored 0%)
  adding: fields/datasetParams/part-00001 (deflated 27%)
  adding: fields/datasetParams/.part-00001.crc (stored 0%)
  adding: fields/datasetParams/part-00000 (deflated 27%)
  adding: metadata/ (stored 0%)
  adding: metadata/.part-00000.crc (stored 0%)
  adding: metadata/_SUCCESS (stored 0%)
  adding: metadata/._SUCCESS.crc (stored 0%)
  adding: metadata/part-00000 (deflated 40%)
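Outside Colab, where the `/content` path and the `zip` command may not be available, the same archive can be produced from Python with the standard library. The sketch below is an assumption about how you might do this, with a hypothetical helper `zip_model_dir`; the demo builds a tiny stand-in directory instead of a real saved model. One difference from the shell command: `shutil.make_archive` also includes hidden files at the top level of the directory, which the shell pattern `*` would skip.

```python
import os
import shutil
import tempfile
import zipfile


def zip_model_dir(model_dir, zip_base):
    """Zip the contents of model_dir into <zip_base>.zip and return its path."""
    return shutil.make_archive(zip_base, "zip", root_dir=model_dir)


# Demo with a stand-in for a saved model directory.
with tempfile.TemporaryDirectory() as work_dir:
    model_dir = os.path.join(work_dir, "MultiClfBert")
    os.makedirs(os.path.join(model_dir, "metadata"))
    with open(os.path.join(model_dir, "metadata", "part-00000"), "w") as handle:
        handle.write("{}")

    zip_path = zip_model_dir(model_dir, os.path.join(work_dir, "MultiClfBert"))
    with zipfile.ZipFile(zip_path) as archive:
        archived = sorted(archive.namelist())

print(os.path.basename(zip_path))
print(archived)
```

As with `cd MultiClfBert ; zip -r ... *`, the entries are stored relative to the model directory, so `metadata/part-00000` sits at the top level of the archive rather than under `MultiClfBert/`.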
