
![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)





[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Finance/4.1.Training_Financial_Classifiers.ipynb)

# Train Domain-specific Multiclass and Multilabel classifiers

In this notebook, you will learn how to use Spark NLP and Finance NLP to train custom multiclass and multilabel classification models.

## Colab Setup

First, you need to set up the environment so that you can use the licensed package. If you are not running in Google Colab, see the [installation documentation](https://nlp.johnsnowlabs.com/docs/en/licensed_install).

In [1]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs 


In [2]:
from johnsnowlabs import nlp
# Log in to your John Snow Labs account to retrieve your license keys
nlp.install(force_browser=True)

Downloading license...
Licenses extracted successfully
📋 Stored John Snow Labs License in /root/.johnsnowlabs/licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json
👷 Setting up  John Snow Labs home in /root/.johnsnowlabs, this might take a few minutes.
Downloading 🐍+🚀 Python Library spark_nlp-4.2.4-py2.py3-none-any.whl
Downloading 🐍+💊 Python Library spark_nlp_jsl-4.2.4-py3-none-any.whl
Downloading 🫘+🚀 Java Library spark-nlp-assembly-4.2.4.jar
Downloading 🫘+💊 Java Library spark-nlp-jsl-4.2.4.jar
🙆 JSL Home setup in /root/.johnsnowlabs
Installing /root/.johnsnowlabs/py_installs/spark_nlp_jsl-4.2.4-py3-none-any.whl to /usr/bin/python3
Running: /usr/bin/python3 -m pip install /root/.johnsnowlabs/py_installs/spark_nlp_jsl-4.2.4-py3-none-any.whl
Installed 1 products:
💊 Spark-Healthcare==4.2.4 installed! ✅ Heal the planet with NLP! 


## Introduction

Although John Snow Labs provides many pretrained models that cover different applications in the financial domain, some problems are specific to a company or practitioner. In such cases, you can train a new custom model using the Finance NLP annotators:

- `ClassifierDLApproach`: Trains a multiclass model (predicts exactly one class out of a predefined set of classes)
- `MultiClassifierDLApproach`: Trains a multilabel model (predicts one or more classes for each document)
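
The difference between the two settings shows up in how the training target is encoded. The following is a minimal sketch in plain Python, not the annotators' internal representation; the class names are a hypothetical subset chosen only for illustration.

```python
# Hypothetical label set, chosen only for illustration.
CLASSES = ["amendments", "notices", "survival", "waivers"]


def one_hot(label, classes=CLASSES):
    """Multiclass target: exactly one position is set."""
    if label not in classes:
        raise ValueError(f"unknown class: {label}")
    return [1 if c == label else 0 for c in classes]


def multi_hot(labels, classes=CLASSES):
    """Multilabel target: zero or more positions can be set."""
    wanted = set(labels)
    return [1 if c in wanted else 0 for c in classes]


multiclass_target = one_hot("notices")
multilabel_target = multi_hot(["amendments", "waivers"])

print(multiclass_target)  # [0, 1, 0, 0]
print(multilabel_target)  # [1, 0, 0, 1]
```

A multiclass target always sums to one, while a multilabel target can contain any number of ones, including none.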

## Training Multilabel models with `MultiClassifierDLApproach`

The input to `MultiClassifierDLApproach` is a column of sentence embeddings, produced for example by [UniversalSentenceEncoder](https://nlp.johnsnowlabs.com/docs/en/transformers#universalsentenceencoder), [BertSentenceEmbeddings](https://nlp.johnsnowlabs.com/docs/en/transformers#bertsentenceembeddings) or [SentenceEmbeddings](https://nlp.johnsnowlabs.com/docs/en/annotators#sentenceembeddings).

To train a custom model, you need labeled data with at least the following columns:

```
| TEXT | LABELS (list) |
```
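
When a list-valued column is stored in a CSV file, each list comes back as a string and has to be parsed before training. The sketch below shows one way to do that with `ast.literal_eval`, which only accepts Python literals and therefore does not execute arbitrary code the way `eval` can. The raw cell values and the `parse_labels` helper are hypothetical and exist only for this illustration.

```python
import ast

# Hypothetical raw cell values, as they might look after reading a CSV file.
raw_labels = ["['governing laws', 'entire agreements']", "['notices']", "[]"]


def parse_labels(cell):
    """Turn a stringified Python list into a real list of strings."""
    value = ast.literal_eval(cell)
    if not isinstance(value, list):
        raise ValueError(f"expected a list, got {type(value).__name__}")
    return [str(item) for item in value]


parsed = [parse_labels(cell) for cell in raw_labels]
print(parsed)
```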

In [3]:
from johnsnowlabs import nlp, finance
# Automatically load license data and start a session with all jars user has access to
spark = nlp.start()

👌 Launched cpu optimized session with with: 🚀Spark-NLP==4.2.4, 💊Spark-Healthcare==4.2.4, running on ⚡ PySpark==3.1.2


### Loading the data

In [4]:
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings_JSL/Finance/data/finance_data.csv

In [5]:
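import ast
import pandas as pd
df = pd.read_csv('./finance_data.csv')
# Labels are stored as stringified lists; ast.literal_eval parses them without running arbitrary code
df['label'] = df['label'].apply(ast.literal_eval)
print(f"Shape of the full dataset: {df.shape}")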

Shape of the full dataset: (27527, 2)

> We will use a sample of this dataset to make training faster, since the goal here is to illustrate the process. Use the full dataset if you want to experiment with it and obtain more realistic results.
>
> The sample has only 500 observations, which will limit the accuracy and generalization capabilities of the model. Because the sample is small, we use 90% of it to train the model and the remaining 10% for testing.
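
A weighted random split of this kind does not produce exact 450/50 partitions: the training log further down reports 461 training examples. The sketch below illustrates why, under the assumption that each row is assigned to a split by an independent weighted draw. It is a plain-Python illustration, not Spark's actual implementation, and the `random_split` helper is hypothetical.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to a split by an independent weighted draw."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative boundaries in [0, 1], e.g. [0.9, 1.0] for weights [0.9, 0.1].
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last split also catches any floating-point leftover.
            if draw < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


rows = list(range(500))
train_rows, test_rows = random_split(rows, [0.9, 0.1], seed=42)
print(len(train_rows), len(test_rows))
```

The split sizes vary around the requested proportions, and fixing the seed makes the assignment repeatable between runs.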

In [6]:
data = spark.createDataFrame(df)

# If you have a single dataset, split it; otherwise, load the test dataset the same way as the training data.
train, test = data.limit(500).randomSplit([0.9, 0.1], seed=42)

In [7]:
train.show(truncate=50)

+--------------------------------------------------+-----------------------------------+
|                                         provision|                              label|
+--------------------------------------------------+-----------------------------------+
|(a) Seller, the Agent, each Managing Agent, eac...|                      [assignments]|
|(a)  The provisions of this Agreement shall be ...|              [assigns, successors]|
|(a) THIS AGREEMENT AND ANY CLAIM, CONTROVERSY, ...|[governing laws, entire agreements]|
|(a) This Agreement may be executed by one or mo...|                     [counterparts]|
|All Bank Expenses (including reasonable attorne...|                         [expenses]|
|All agreements, representations and warranties ...|                         [survival]|
|All communications hereunder will be in writing...|                          [notices]|
|All covenants, agreements, representations and ...|                         [survival]|
|All covenants, agree

In [8]:
from pyspark.sql.functions import col

test.groupBy("label").count().orderBy(col("count").desc()).show()

+--------------------+-----+
|               label|count|
+--------------------+-----+
|      [counterparts]|    6|
|        [amendments]|    5|
| [entire agreements]|    5|
|      [severability]|    3|
|          [survival]|    3|
|[assigns, success...|    3|
|           [waivers]|    2|
|      [terminations]|    2|
|[representations,...|    2|
|           [notices]|    1|
|        [warranties]|    1|
|       [assignments]|    1|
|    [governing laws]|    1|
|[governing laws, ...|    1|
|          [expenses]|    1|
|        [successors]|    1|
|[amendments, enti...|    1|
+--------------------+-----+



### Train With Universal Encoder

The Universal Sentence Encoder is a state-of-the-art architecture for creating vector representations of text. We use a pretrained encoder here, so only the classifier needs to be trained (training the embeddings as well is also possible).

The pretrained model was trained and optimized for greater-than-word-length text, such as sentences, phrases or short paragraphs. It was trained on a variety of data sources and tasks, with the aim of accommodating a wide range of natural language understanding tasks. The input is variable-length English text and the output is a 512-dimensional vector.

In [9]:
document_assembler = (
    nlp.DocumentAssembler()
    .setInputCol("provision")
    .setOutputCol("document")
    .setCleanupMode("shrink")
)

embeddings = (
    nlp.UniversalSentenceEncoder.pretrained()
    .setInputCols("document")
    .setOutputCol("sentence_embeddings")
)

classifierdl = (
    nlp.MultiClassifierDLApproach()
    .setInputCols(["sentence_embeddings"])
    .setOutputCol("class")
    .setLabelColumn("label")
    .setMaxEpochs(30)
    .setEnableOutputLogs(True)
    .setOutputLogsPath("multilabel_use")
    .setLr(0.001)
    .setBatchSize(4)
)

clf_pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, classifierdl])

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]


In [10]:
%%time
clf_pipelineModel = clf_pipeline.fit(train)

CPU times: user 440 ms, sys: 60.4 ms, total: 501 ms
Wall time: 1min 19s


In [11]:
import os
log_file_name = os.listdir("multilabel_use")[0]

with open("multilabel_use/"+log_file_name, "r") as log_file :
    print(log_file.read())

Training started - epochs: 30 - learning_rate: 0.001 - batch_size: 4 - training_examples: 461 - classes: 15
Epoch 0/30 - 4.11s - loss: 0.30427498 - acc: 0.925797 - batches: 116
Epoch 1/30 - 1.44s - loss: 0.20737702 - acc: 0.94260854 - batches: 116
Epoch 2/30 - 1.41s - loss: 0.15418288 - acc: 0.9563765 - batches: 116
Epoch 3/30 - 1.42s - loss: 0.12557246 - acc: 0.9627533 - batches: 116
Epoch 4/30 - 1.39s - loss: 0.1082967 - acc: 0.9679704 - batches: 116
Epoch 5/30 - 1.41s - loss: 0.09582114 - acc: 0.97304296 - batches: 116
Epoch 6/30 - 1.44s - loss: 0.08615956 - acc: 0.97782564 - batches: 116
Epoch 7/30 - 1.40s - loss: 0.07847145 - acc: 0.9794198 - batches: 116
Epoch 8/30 - 1.40s - loss: 0.07229716 - acc: 0.98217356 - batches: 116
Epoch 9/30 - 1.38s - loss: 0.06728321 - acc: 0.98449224 - batches: 116
Epoch 10/30 - 1.38s - loss: 0.06313381 - acc: 0.9860863 - batches: 116
Epoch 11/30 - 1.39s - loss: 0.059632715 - acc: 0.9878254 - batches: 116
Epoch 12/30 - 1.39s - loss: 0.056623098 - acc:

#### Test the trained model

In [12]:
preds = clf_pipelineModel.transform(test)

In [13]:
preds_df = preds.select("label", "provision", "class.result").toPandas()
preds_df.head()

| | label | provision | result |
|---|---|---|---|
| 0 | [survival] | All agreements, statements, representations an... | [representations, warranties] |
| 1 | [survival] | All covenants of the Company contained in this... | [representations, warranties, terminations] |
| 2 | [survival] | All representations, warranties, covenants and... | [survival] |
| 3 | [notices] | Any notice required or permitted by this Agree... | [notices] |
| 4 | [waivers] | Each Canadian Loan Party acknowledges receipt ... | [] |


To compare predictions with the ground truth, we use the `MultiLabelBinarizer` class from the scikit-learn package. It converts each list of labels into a binary indicator vector with one column per class, which is the format that the classification report and the other scikit-learn metrics expect.

In [15]:
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score

mlb = MultiLabelBinarizer()

y_true = mlb.fit_transform(preds_df['label'])
y_pred = mlb.transform(preds_df['result'])


print("Classification report: \n", (classification_report(y_true, y_pred)))
print("F1 micro averaging:",(f1_score(y_true, y_pred, average='micro')))
print("ROC: ",(roc_auc_score(y_true, y_pred, average="micro")))


Classification report: 
               precision    recall  f1-score   support

           0       0.60      0.50      0.55         6
           1       0.00      0.00      0.00         1
           2       0.75      1.00      0.86         3
           3       0.83      0.83      0.83         6
           4       1.00      0.86      0.92         7
           5       1.00      1.00      1.00         1
           6       1.00      1.00      1.00         2
           7       1.00      1.00      1.00         1
           8       0.40      1.00      0.57         2
           9       1.00      1.00      1.00         3
          10       0.80      1.00      0.89         4
          11       0.50      0.33      0.40         3
          12       0.00      0.00      0.00         2
          13       0.00      0.00      0.00         2
          14       0.40      0.67      0.50         3

   micro avg       0.70      0.72      0.71        46
   macro avg       0.62      0.68      0.63        46
w

### Train with Bert Embeddings

We do not have any finance-specific sentence embeddings, but we can use the financial BERT word embeddings and average them into one vector per document.
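
Averaging word embeddings means taking the element-wise mean of the token vectors, which yields one fixed-size vector regardless of the text length. The sketch below illustrates that idea in plain Python with toy three-dimensional vectors; the `average_pool` helper is hypothetical and is not how the annotator is implemented.

```python
def average_pool(token_vectors):
    """Average token vectors element-wise into one fixed-size vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    if any(len(vec) != dim for vec in token_vectors):
        raise ValueError("all token vectors must have the same dimension")
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


# Toy 3-dimensional vectors for a 4-token text (real BERT vectors are much larger).
tokens = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [0.0, 4.0, 2.0],
    [0.0, 2.0, 0.0],
]
sentence_vector = average_pool(tokens)
print(sentence_vector)  # [1.0, 2.0, 1.0]
```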

In [16]:
embeddings = (
    nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en")
    .setInputCols(["document", "token"])
    .setOutputCol("embeddings")
)

bert_embeddings_sec_bert_base download started this may take some time.
Approximate size to download 390.4 MB
[OK!]


In [17]:
document_assembler = (
    nlp.DocumentAssembler().setInputCol("provision").setOutputCol("document")
)

tokenizer = nlp.Tokenizer().setInputCols(["document"]).setOutputCol("token")

embeddingsSentence = (
    nlp.SentenceEmbeddings()
    .setInputCols(["document", "embeddings"])
    .setOutputCol("sentence_embeddings")
    .setPoolingStrategy("AVERAGE")
)

classifierdl = (
    nlp.MultiClassifierDLApproach()
    .setInputCols(["sentence_embeddings"])
    .setOutputCol("class")
    .setLabelColumn("label")
    .setMaxEpochs(8)
    .setEnableOutputLogs(True)
    .setOutputLogsPath("multilabel_bert")
    .setLr(0.001)
    .setBatchSize(4)
)

clf_pipeline = nlp.Pipeline(
    stages=[document_assembler, tokenizer, embeddings, embeddingsSentence, classifierdl]
)

In [18]:
%%time
clf_pipelineModel = clf_pipeline.fit(train)

CPU times: user 1.8 s, sys: 246 ms, total: 2.04 s
Wall time: 4min 52s


#### Testing the trained model

In [19]:
preds = clf_pipelineModel.transform(test)

In [20]:
preds_df = preds.select("provision", "label", "class.result").toPandas()
preds_df.head()

| | provision | label | result |
|---|---|---|---|
| 0 | All agreements, statements, representations an... | [survival] | [representations, warranties] |
| 1 | All covenants of the Company contained in this... | [survival] | [survival] |
| 2 | All representations, warranties, covenants and... | [survival] | [warranties] |
| 3 | Any notice required or permitted by this Agree... | [notices] | [notices] |
| 4 | Each Canadian Loan Party acknowledges receipt ... | [waivers] | [] |


In [21]:
import os
log_file_name = os.listdir("multilabel_bert")[0]

with open("multilabel_bert/"+log_file_name, "r") as log_file :
    print(log_file.read())

Training started - epochs: 8 - learning_rate: 0.001 - batch_size: 4 - training_examples: 461 - classes: 15
Epoch 0/8 - 4.16s - loss: 0.21118948 - acc: 0.94202876 - batches: 116
Epoch 1/8 - 1.48s - loss: 0.08812641 - acc: 0.9775358 - batches: 116
Epoch 2/8 - 1.42s - loss: 0.056855213 - acc: 0.9868111 - batches: 116
Epoch 3/8 - 1.46s - loss: 0.042333648 - acc: 0.9924633 - batches: 116
Epoch 4/8 - 1.45s - loss: 0.033270992 - acc: 0.9960865 - batches: 116
Epoch 5/8 - 1.44s - loss: 0.027073074 - acc: 1.0002896 - batches: 116
Epoch 6/8 - 1.43s - loss: 0.022839691 - acc: 1.0023185 - batches: 116
Epoch 7/8 - 1.42s - loss: 0.019849315 - acc: 1.0027533 - batches: 116
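
The log lines have a regular shape, so they can be parsed to track the loss per epoch or to sanity-check the batch count. The sketch below is written against the lines printed above; the regular expressions are an assumption about the format, not a documented contract, and the `parse_log` helper is hypothetical.

```python
import math
import re

# A few lines copied from the training log printed above.
LOG = """\
Training started - epochs: 8 - learning_rate: 0.001 - batch_size: 4 - training_examples: 461 - classes: 15
Epoch 0/8 - 4.16s - loss: 0.21118948 - acc: 0.94202876 - batches: 116
Epoch 1/8 - 1.48s - loss: 0.08812641 - acc: 0.9775358 - batches: 116
Epoch 7/8 - 1.42s - loss: 0.019849315 - acc: 1.0027533 - batches: 116
"""

HEADER = re.compile(r"batch_size: (\d+) - training_examples: (\d+)")
EPOCH = re.compile(
    r"Epoch (\d+)/\d+ - [\d.]+s - loss: ([\d.eE+-]+) - acc: ([\d.eE+-]+) - batches: (\d+)"
)


def parse_log(text):
    """Return (batch_size, training_examples, list of per-epoch dicts)."""
    header = HEADER.search(text)
    if header is None:
        raise ValueError("header line not found")
    epochs = [
        {
            "epoch": int(m.group(1)),
            "loss": float(m.group(2)),
            "acc": float(m.group(3)),
            "batches": int(m.group(4)),
        }
        for m in EPOCH.finditer(text)
    ]
    return int(header.group(1)), int(header.group(2)), epochs


batch_size, n_examples, epochs = parse_log(LOG)
expected_batches = math.ceil(n_examples / batch_size)
print(expected_batches, [e["loss"] for e in epochs])
```

With 461 examples and a batch size of 4, the expected batch count is 116, which matches the log. Note that the reported `acc` rises slightly above 1.0 in the last epochs of this log, so it is safer to read it as a rough training signal and rely on the test-set metrics for evaluation.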



In [23]:
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score

mlb = MultiLabelBinarizer()

y_true = mlb.fit_transform(preds_df['label'])
y_pred = mlb.transform(preds_df['result'])


print("Classification report: \n", (classification_report(y_true, y_pred)))
print("F1 micro averaging:",(f1_score(y_true, y_pred, average='micro')))
print("ROC: ",(roc_auc_score(y_true, y_pred, average="micro")))


Classification report: 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00         6
           1       0.00      0.00      0.00         1
           2       0.75      1.00      0.86         3
           3       1.00      0.83      0.91         6
           4       1.00      0.86      0.92         7
           5       1.00      1.00      1.00         1
           6       1.00      1.00      1.00         2
           7       1.00      1.00      1.00         1
           8       0.40      1.00      0.57         2
           9       0.75      1.00      0.86         3
          10       0.80      1.00      0.89         4
          11       1.00      0.33      0.50         3
          12       1.00      0.50      0.67         2
          13       0.50      0.50      0.50         2
          14       0.50      1.00      0.67         3

   micro avg       0.80      0.85      0.82        46
   macro avg       0.78      0.80      0.76        46
w

### Saving & loading back the trained model

In [24]:
clf_pipelineModel.stages

[DocumentAssembler_b8a980d99733,
 REGEX_TOKENIZER_b37ee9fd64c2,
 BERT_EMBEDDINGS_29ce72cd673e,
 SentenceEmbeddings_0e27ddfbf438,
 MultiClassifierDLModel_9df08ed59a41]

In [25]:
clf_pipelineModel.stages[-1].write().overwrite().save('MultilabelClfBert')

In [26]:
# Load the saved multilabel classifier model
MultilabelClfModel = nlp.MultiClassifierDLModel.load('MultilabelClfBert')

In [27]:
ld_pipeline = nlp.Pipeline(
    stages=[
        document_assembler,
        tokenizer,
        embeddings,
        embeddingsSentence,
        MultilabelClfModel,
    ]
)
ld_pipeline_model = ld_pipeline.fit(spark.createDataFrame([[""]]).toDF("provision"))

In [28]:
# Apply Model Transform to testData
ld_preds = ld_pipeline_model.transform(test)

In [29]:
ld_preds_df = ld_preds.select("provision", "label", "class.result").toPandas()

In [30]:
ld_preds_df.head(10)

| | provision | label | result |
|---|---|---|---|
| 0 | All agreements, statements, representations an... | [survival] | [representations, warranties] |
| 1 | All covenants of the Company contained in this... | [survival] | [survival] |
| 2 | All representations, warranties, covenants and... | [survival] | [warranties] |
| 3 | Any notice required or permitted by this Agree... | [notices] | [notices] |
| 4 | Each Canadian Loan Party acknowledges receipt ... | [waivers] | [] |
| 5 | Except as otherwise provided herein or in any ... | [waivers] | [waivers] |
| 6 | Franchisee acknowledges that the Foodservice D... | [amendments] | [amendments] |
| 7 | Guarantor represents and warrants to Lender th... | [warranties] | [representations, warranties] |
| 8 | If any provision of this Plan or any Award is,... | [severability] | [severability] |
| 9 | No amendment, modification, termination or can... | [amendments] | [amendments] |


# Multiclass classifier training


The `ClassifierDLApproach` annotator trains a multiclass model, where each prediction is exactly one category out of a predefined set of categories present in the training data.

## Loading the data

In [31]:
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings_JSL/Finance/data/finance_clf_data.csv

In [32]:
import pandas as pd
df = pd.read_csv('finance_clf_data.csv')
print(f"Shape of the full dataset: {df.shape}")

Shape of the full dataset: (6128, 3)


In [33]:
df.head()

| | text | label | len |
|---|---|---|---|
| 0 | Presently we do not believe any U S or State r... | business | 402 |
| 1 | \nnetwork outages or performance degradation ... | risk_factors | 496 |
| 2 | Available Information\nOur reports filed with ... | business | 356 |
| 3 | \n 42 530\n \n \n \n \n \n 42 530\nTotal liab... | financial_statements | 359 |
| 4 | 8\nTable of Contents\ndevelopment employee eng... | business | 582 |


In [34]:
df['label'].value_counts()

risk_factors               1926
financial_statements       1888
business                    970
financial_conditions        346
form_10k_summary            240
executives_compensation     155
controls_procedures         138
equity                      111
market_risk                 100
executives                   73
legal_proceedings            51
properties                   48
security_ownership           46
exhibits                     36
Name: label, dtype: int64

Since deep learning models can take some time to train, we limit the dataset to a smaller number of observations. This is enough to illustrate how to use the Spark NLP and Finance NLP annotators and pipelines to train the model without a long wait.

Please note that the quality and quantity of the training data strongly affect the trained model, and the results obtained here are for illustration purposes only. To obtain a more realistic model, please consider using the full dataset or adding extra observations from different sources.
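
The cell below uses a stratified split so that the train and test sets keep the same class proportions. The following plain-Python sketch shows the idea on hypothetical class counts; it is not scikit-learn's algorithm, and the `stratified_split` helper exists only for this illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction, seed):
    """Return (train_idx, test_idx) so each class keeps roughly the same share."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set.
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Hypothetical class counts that mimic an imbalanced 1000-row sample.
labels = ["risk_factors"] * 400 + ["financial_statements"] * 400 + ["business"] * 200
train_idx, test_idx = stratified_split(labels, test_fraction=0.1, seed=42)
test_counts = Counter(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), dict(test_counts))
```

Without stratification, a small test set can under-represent the minority class purely by chance, which makes the per-class metrics less reliable.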

In [35]:
from sklearn.model_selection import train_test_split

# The three categories with the most observations
filter_classes = [
    "risk_factors",
    "financial_statements",
    "business"
]

# Draw a random sample of 1000 observations (seeded so the sample is reproducible)
df = df.loc[df.label.isin(filter_classes)].sample(1000, random_state=42)

# Stratify split for train and test datasets
train_data, test_data = train_test_split(
    df, train_size=0.9, stratify=df.label, random_state=42
)

# Send to spark
train = spark.createDataFrame(train_data) 
test = spark.createDataFrame(test_data)

In [36]:
from pyspark.sql.functions import col

train.groupBy("label").count().orderBy(col("count").desc()).show()

+--------------------+-----+
|               label|count|
+--------------------+-----+
|        risk_factors|  362|
|financial_statements|  358|
|            business|  180|
+--------------------+-----+



In [37]:
from pyspark.sql.functions import col

test.groupBy("label").count().orderBy(col("count").desc()).show()

+--------------------+-----+
|               label|count|
+--------------------+-----+
|        risk_factors|   40|
|financial_statements|   40|
|            business|   20|
+--------------------+-----+



### Train with Universal Encoder

In [38]:
document_assembler = (
    nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
)

embeddings = (
    nlp.UniversalSentenceEncoder.pretrained()
    .setInputCols("document")
    .setOutputCol("sentence_embeddings")
)

classifierdl = (
    finance.ClassifierDLApproach()
    .setInputCols(["sentence_embeddings"])
    .setOutputCol("class")
    .setLabelColumn("label")
    .setMaxEpochs(30)
    .setEnableOutputLogs(True)
    .setOutputLogsPath("multiclass_use")
    .setLr(0.001)
    .setBatchSize(4)
    .setDropout(0.15)
)

clf_pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, classifierdl])

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]


In [39]:
%%time
clf_pipelineModel = clf_pipeline.fit(train)

CPU times: user 333 ms, sys: 46 ms, total: 379 ms
Wall time: 50.7 s


In [40]:
import os
log_file_name = os.listdir("multiclass_use")[0]

with open("multiclass_use/"+log_file_name, "r") as log_file :
    print(log_file.read())

Training started - epochs: 30 - learning_rate: 0.001 - batch_size: 4 - training_examples: 900 - classes: 3
Epoch 0/30 - 1.64s - loss: 187.45738 - acc: 0.69 - batches: 225
Epoch 1/30 - 1.39s - loss: 154.93361 - acc: 0.8611111 - batches: 225
Epoch 2/30 - 1.37s - loss: 148.88728 - acc: 0.8844444 - batches: 225
Epoch 3/30 - 1.36s - loss: 146.69815 - acc: 0.8933333 - batches: 225
Epoch 4/30 - 1.43s - loss: 145.29247 - acc: 0.9011111 - batches: 225
Epoch 5/30 - 1.38s - loss: 144.27583 - acc: 0.9088889 - batches: 225
Epoch 6/30 - 1.35s - loss: 143.54022 - acc: 0.9111111 - batches: 225
Epoch 7/30 - 1.35s - loss: 143.01407 - acc: 0.91555554 - batches: 225
Epoch 8/30 - 1.33s - loss: 142.59346 - acc: 0.9177778 - batches: 225
Epoch 9/30 - 1.38s - loss: 142.21786 - acc: 0.91888887 - batches: 225
Epoch 10/30 - 1.37s - loss: 141.87605 - acc: 0.92 - batches: 225
Epoch 11/30 - 1.35s - loss: 141.57596 - acc: 0.9222222 - batches: 225
Epoch 12/30 - 1.35s - loss: 141.33492 - acc: 0.92444444 - batches: 225


In [41]:
preds = clf_pipelineModel.transform(test)

In [42]:
preds_df = preds.select("label", "text", "class.result").toPandas()
preds_df.head()

| | label | text | result |
|---|---|---|---|
| 0 | financial_statements | \n\nASSETS\n \n \n \n \n \n \n \n \n\nCurrent... | [financial_statements] |
| 1 | financial_statements | \n108\nTable of Contents\nEnvestnet Inc \nNot... | [financial_statements] |
| 2 | business | Qumu s implementations can range in size from ... | [business] |
| 3 | business | Growing Existing Markets\nSherpa s goals for g... | [business] |
| 4 | risk_factors | Any of these events or other currently unfores... | [risk_factors] |


In [43]:
# The result is an array because Spark NLP can return one annotation per sentence.
# Each document yields a single prediction here, so keep the first item of the array.
preds_df['result'] = preds_df['result'].apply(lambda x : x[0])

In [44]:
# We use scikit-learn to evaluate the results on the test dataset
from sklearn.metrics import classification_report

print (classification_report(preds_df['label'], preds_df['result']))

                      precision    recall  f1-score   support

            business       1.00      0.75      0.86        20
financial_statements       0.95      0.93      0.94        40
        risk_factors       0.83      0.95      0.88        40

            accuracy                           0.90       100
           macro avg       0.92      0.88      0.89       100
        weighted avg       0.91      0.90      0.90       100
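
The summary rows of the report follow directly from the per-class counts. The sketch below recomputes them in plain Python. The counts are an assumption: they were inferred so that they reproduce the rounded figures in the report above and were not read from the model output.

```python
# (true positives, predicted count, support) per class.
# Assumption: inferred to match the rounded report, not taken from model output.
counts = {
    "business": (15, 15, 20),
    "financial_statements": (37, 39, 40),
    "risk_factors": (38, 46, 40),
}


def prf(tp, predicted, support):
    """Precision, recall and F1 for one class."""
    precision = tp / predicted if predicted else 0.0
    recall = tp / support if support else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


rows = {name: prf(*c) for name, c in counts.items()}
total = sum(c[2] for c in counts.values())

# Accuracy: share of all documents whose single predicted class is correct.
accuracy = sum(c[0] for c in counts.values()) / total
# Macro average: plain mean over classes, so every class counts equally.
macro_f1 = sum(r[2] for r in rows.values()) / len(rows)
# Weighted average: mean over classes weighted by support.
weighted_f1 = sum(rows[n][2] * counts[n][2] for n in counts) / total

for name, (p, r, f) in rows.items():
    print(f"{name:>22} {p:.2f} {r:.2f} {f:.2f}")
print(f"accuracy {accuracy:.2f}  macro F1 {macro_f1:.2f}  weighted F1 {weighted_f1:.2f}")
```

The macro average is lower than the weighted average here because `business`, the smallest class, has the lowest F1 and counts as much as the larger classes in the macro mean.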



### Saving & loading back the trained model

In [45]:
clf_pipelineModel.stages

[DocumentAssembler_8ac2e16e3c46,
 UNIVERSAL_SENTENCE_ENCODER_4de71669b7ec,
 FinanceClassifierDLModel_1ac6b46a0592]

In [46]:
clf_pipelineModel.stages[-1].write().overwrite().save('Clf_Use')

In [47]:
# Load the saved classifier model
ClfModel = finance.ClassifierDLModel.load('Clf_Use')

In [48]:
ld_pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, ClfModel])
ld_pipeline_model = ld_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

In [49]:
# Apply Model Transform to testData
ld_preds = ld_pipeline_model.transform(test)

In [50]:
ld_preds_df = ld_preds.select("text", "label", "class.result").toPandas()

In [51]:
ld_preds_df.head()

| | text | label | result |
|---|---|---|---|
| 0 | \n\nASSETS\n \n \n \n \n \n \n \n \n\nCurrent... | financial_statements | [financial_statements] |
| 1 | \n108\nTable of Contents\nEnvestnet Inc \nNot... | financial_statements | [financial_statements] |
| 2 | Qumu s implementations can range in size from ... | business | [business] |
| 3 | Growing Existing Markets\nSherpa s goals for g... | business | [business] |
| 4 | Any of these events or other currently unfores... | risk_factors | [risk_factors] |


### Train with Bert Embeddings

We do not have financial sentence embeddings yet, but we can use the financial word embeddings and average them. Since this model takes a long time to train, we will train for only one epoch.

In [52]:
embeddings = (
    nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en")
    .setInputCols(["document", "token"])
    .setOutputCol("embeddings")
)

bert_embeddings_sec_bert_base download started this may take some time.
Approximate size to download 390.4 MB
[OK!]


In [53]:
document_assembler = (
    nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
)

tokenizer = nlp.Tokenizer().setInputCols(["document"]).setOutputCol("token")

embeddingsSentence = (
    nlp.SentenceEmbeddings()
    .setInputCols(["document", "embeddings"])
    .setOutputCol("sentence_embeddings")
    .setPoolingStrategy("AVERAGE")
)

classifierdl = (
    finance.ClassifierDLApproach()
    .setInputCols(["sentence_embeddings"])
    .setOutputCol("class")
    .setLabelColumn("label")
    .setMaxEpochs(1)
    .setLr(0.001)
    .setEnableOutputLogs(True)
    .setOutputLogsPath("multiclass_bert")
    .setBatchSize(4)
    .setDropout(0.15)
)

clf_pipeline = nlp.Pipeline(
    stages=[document_assembler, tokenizer, embeddings, embeddingsSentence, classifierdl]
)

In [54]:
%%time
clf_pipelineModel = clf_pipeline.fit(train)

CPU times: user 2.48 s, sys: 282 ms, total: 2.77 s
Wall time: 6min 38s


In [55]:
preds = clf_pipelineModel.transform(test)

In [56]:
preds_df = preds.select("label", "text", "class.result").toPandas()

In [57]:
preds_df.head()

| | label | text | result |
|---|---|---|---|
| 0 | financial_statements | \n\nASSETS\n \n \n \n \n \n \n \n \n\nCurrent... | [financial_statements] |
| 1 | financial_statements | \n108\nTable of Contents\nEnvestnet Inc \nNot... | [financial_statements] |
| 2 | business | Qumu s implementations can range in size from ... | [business] |
| 3 | business | Growing Existing Markets\nSherpa s goals for g... | [business] |
| 4 | risk_factors | Any of these events or other currently unfores... | [risk_factors] |


In [58]:
log_files = os.listdir("multiclass_bert")

with open("multiclass_bert/"+log_files[0], "r") as log_file :
    print(log_file.read())

Training started - epochs: 1 - learning_rate: 0.001 - batch_size: 4 - training_examples: 900 - classes: 3
Epoch 0/1 - 1.87s - loss: 169.53094 - acc: 0.81 - batches: 225



In [59]:
# Let's explode the array and get the item(s) inside of result column out
preds_df['result'] = preds_df['result'].apply(lambda x : x[0])

from sklearn.metrics import classification_report

print (classification_report(preds_df['label'], preds_df['result']))


                      precision    recall  f1-score   support

            business       1.00      0.80      0.89        20
financial_statements       0.94      0.82      0.88        40
        risk_factors       0.80      0.97      0.88        40

            accuracy                           0.88       100
           macro avg       0.91      0.87      0.88       100
        weighted avg       0.90      0.88      0.88       100



# Save model and Zip it for Modelshub Upload/Downloads

In [60]:
# Save a Spark NLP model
clf_pipelineModel.stages[-1].write().overwrite().save('ClfBert')

# cd into saved dir and zip
! cd ClfBert ; zip -r ClfBert.zip *

  adding: classifierdl_tensorflow (deflated 57%)
  adding: fields/ (stored 0%)
  adding: fields/datasetParams/ (stored 0%)
  adding: fields/datasetParams/.part-00002.crc (stored 0%)
  adding: fields/datasetParams/part-00003 (deflated 30%)
  adding: fields/datasetParams/.part-00000.crc (stored 0%)
  adding: fields/datasetParams/_SUCCESS (stored 0%)
  adding: fields/datasetParams/.part-00003.crc (stored 0%)
  adding: fields/datasetParams/part-00002 (deflated 27%)
  adding: fields/datasetParams/._SUCCESS.crc (stored 0%)
  adding: fields/datasetParams/part-00001 (deflated 27%)
  adding: fields/datasetParams/.part-00001.crc (stored 0%)
  adding: fields/datasetParams/part-00000 (deflated 27%)
  adding: metadata/ (stored 0%)
  adding: metadata/.part-00000.crc (stored 0%)
  adding: metadata/_SUCCESS (stored 0%)
  adding: metadata/._SUCCESS.crc (stored 0%)
  adding: metadata/part-00000 (deflated 40%)
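
If the `zip` command is not available, the same archive layout can be produced from Python. The sketch below uses `shutil.make_archive` from the standard library and demonstrates it on a throwaway directory that stands in for a saved model folder. The `zip_model_dir` helper and the archive name are hypothetical; writing the archive outside the model folder avoids zipping the archive into itself.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def zip_model_dir(model_dir, archive_base):
    """Zip the contents of model_dir so they sit at the archive root."""
    model_dir = Path(model_dir)
    if not model_dir.is_dir():
        raise FileNotFoundError(f"{model_dir} is not a directory")
    # make_archive appends ".zip" to archive_base and returns the final path.
    return Path(shutil.make_archive(str(archive_base), "zip", root_dir=str(model_dir)))


# Demo on a temporary directory that stands in for a saved model folder.
with tempfile.TemporaryDirectory() as tmp:
    fake_model = Path(tmp) / "ClfBert"
    (fake_model / "metadata").mkdir(parents=True)
    (fake_model / "metadata" / "part-00000").write_text("{}")
    archive = zip_model_dir(fake_model, Path(tmp) / "ClfBert_archive")
    with zipfile.ZipFile(archive) as zf:
        names = sorted(zf.namelist())
    print(archive.name, names)
```

For a real model, the call would be along the lines of `zip_model_dir("ClfBert", "ClfBert_archive")`, run from the directory that contains the saved model.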
