
![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)





[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/finance-nlp/04.2.Training_Financial_Multiclass_Classifier.ipynb)

# Train Domain-specific Multiclass Classifiers

In this notebook, you will learn how to use Spark NLP and Finance NLP to train custom multiclass classification models.

# Installation

In [None]:
! pip install -q johnsnowlabs

## Automatic Installation
Using my.johnsnowlabs.com SSO

In [None]:
from johnsnowlabs import nlp, finance

# nlp.install(force_browser=True)

## Manual downloading
If you are not registered on my.johnsnowlabs.com, you received a license via email, or you are using Safari, you may need to install the license manually.

- Go to my.johnsnowlabs.com
- Download your license
- Upload it using the following command

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

- Install it

In [None]:
nlp.install()

# Starting

In [None]:
spark = nlp.start()

## Introduction

Although John Snow Labs provides many pretrained models that cover different applications in the financial domain, there are still problems that are specific to individual companies or practitioners. For such cases, it is possible to train a new custom model using Finance NLP annotators:

- `ClassifierDLApproach`: Trains a multiclass model (predicts exactly one class out of a predefined set of classes), which also covers binary classification
- `MultiClassifierDLApproach`: Trains a multilabel model (predicts one or more classes for each document)

The input to these annotators is sentence embeddings, such as the state-of-the-art [UniversalSentenceEncoder](https://nlp.johnsnowlabs.com/docs/en/transformers#universalsentenceencoder), [BertSentenceEmbeddings](https://nlp.johnsnowlabs.com/docs/en/transformers#bertsentenceembeddings) or [SentenceEmbeddings](https://nlp.johnsnowlabs.com/docs/en/annotators#sentenceembeddings).

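To train a custom model, you need labeled data with at least the following columns (a single label per row for multiclass training, a list of labels for multilabel training):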

```
| TEXT | LABELS (list) |
```
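
To make the multiclass/multilabel distinction concrete, here is a minimal plain-Python sketch. The scores are made-up illustrative values, not model output, and the helper names and the 0.5 threshold are assumptions chosen for this example only.

```python
# Hypothetical class scores for one document (illustrative values only).
multiclass_scores = {"business": 0.15, "risk_factors": 0.70, "financial_statements": 0.15}

# Hypothetical independent per-label scores, as a multilabel model might
# produce them; they do not need to sum to 1.
multilabel_scores = {"business": 0.62, "risk_factors": 0.81, "financial_statements": 0.08}


def predict_multiclass(scores):
    """Multiclass: return the single highest-scoring label."""
    return max(scores, key=scores.get)


def predict_multilabel(scores, threshold=0.5):
    """Multilabel: return every label whose score reaches the threshold."""
    return sorted(label for label, score in scores.items() if score >= threshold)


multiclass_pred = predict_multiclass(multiclass_scores)
multilabel_pred = predict_multilabel(multilabel_scores)

print(multiclass_pred)   # one label
print(multilabel_pred)   # zero, one, or several labels
```

This is why the label column holds one value per row in the multiclass case and a list in the multilabel case.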

## Multiclass Classifier Training


The `ClassifierDLApproach` annotator trains a multiclass model, where each prediction is one category out of a predefined set of categories that are present in the training data.

## Loading the data

In [None]:
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/finance-nlp/data/finance_clf_data.csv

In [None]:
import pandas as pd
df = pd.read_csv('finance_clf_data.csv')
print(f"Shape of the full dataset: {df.shape}")

Shape of the full dataset: (6128, 3)


In [None]:
df.head()

|   | text | label | len |
|---|------|-------|-----|
| 0 | Presently we do not believe any U S or State r... | business | 402 |
| 1 | \nnetwork outages or performance degradation ... | risk_factors | 496 |
| 2 | Available Information\nOur reports filed with ... | business | 356 |
| 3 | \n 42 530\n \n \n \n \n \n 42 530\nTotal liab... | financial_statements | 359 |
| 4 | 8\nTable of Contents\ndevelopment employee eng... | business | 582 |


In [None]:
df['label'].value_counts()

risk_factors               1926
financial_statements       1888
business                    970
financial_conditions        346
form_10k_summary            240
executives_compensation     155
controls_procedures         138
equity                      111
market_risk                 100
executives                   73
legal_proceedings            51
properties                   48
security_ownership           46
exhibits                     36
Name: label, dtype: int64

Deep learning models can take some time to train, so we limit the dataset to a smaller number of observations. This is enough to illustrate how to use Spark NLP and Finance NLP annotators and pipelines to train a model without a long wait.

Please note that both the quality and the quantity of the training data strongly affect the trained model, and the results obtained here are for illustration purposes only. To obtain a more realistic model, please consider using the full dataset or adding extra observations from different sources.

In [None]:
from sklearn.model_selection import train_test_split

# The top 3 categories by number of observations
filter_classes = [
    "risk_factors",
    "financial_statements",
    "business"
]

# We make a random sample with 500 observations
df = df.loc[df.label.isin(filter_classes)].sample(500)

# Stratified split into train and test datasets (keeps class proportions similar)
train_data, test_data = train_test_split(
    df, train_size=0.9, stratify=df.label, random_state=42
)

# Send to spark
train = spark.createDataFrame(train_data) 
test = spark.createDataFrame(test_data)

In [None]:
from pyspark.sql.functions import col

train.groupBy("label").count().orderBy(col("count").desc()).show()

+--------------------+-----+
|               label|count|
+--------------------+-----+
|        risk_factors|  187|
|financial_statements|  185|
|            business|   78|
+--------------------+-----+



In [None]:
from pyspark.sql.functions import col

test.groupBy("label").count().orderBy(col("count").desc()).show()

+--------------------+-----+
|               label|count|
+--------------------+-----+
|        risk_factors|   21|
|financial_statements|   20|
|            business|    9|
+--------------------+-----+



## Train with Universal Sentence Encoder

In [None]:
document_assembler = (
    nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
)

embeddings = (
    nlp.UniversalSentenceEncoder.pretrained()
    .setInputCols("document")
    .setOutputCol("sentence_embeddings")
)

classifierdl = (
    finance.ClassifierDLApproach()
    .setInputCols(["sentence_embeddings"])
    .setOutputCol("class")
    .setLabelColumn("label")
    .setMaxEpochs(30)
    .setEnableOutputLogs(True)
    .setOutputLogsPath("multiclass_use")
    .setLr(0.001)
    .setBatchSize(4)
    .setDropout(0.15)
)

clf_pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, classifierdl])

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]


In [None]:
%%time
clf_pipelineModel = clf_pipeline.fit(train)

CPU times: user 424 ms, sys: 85.4 ms, total: 509 ms
Wall time: 55.5 s


In [None]:
import os
log_file_name = os.listdir("multiclass_use")[0]

with open("multiclass_use/"+log_file_name, "r") as log_file :
    print(log_file.read())

Training started - epochs: 30 - learning_rate: 0.001 - batch_size: 4 - training_examples: 450 - classes: 3
Epoch 0/30 - 1.86s - loss: 98.795044 - acc: 0.68526787 - batches: 113
Epoch 1/30 - 1.51s - loss: 85.520164 - acc: 0.76339287 - batches: 113
Epoch 2/30 - 1.00s - loss: 83.265274 - acc: 0.83258927 - batches: 113
Epoch 3/30 - 1.07s - loss: 81.60619 - acc: 0.86383927 - batches: 113
Epoch 4/30 - 1.02s - loss: 80.566414 - acc: 0.87946427 - batches: 113
Epoch 5/30 - 1.04s - loss: 79.766624 - acc: 0.88616073 - batches: 113
Epoch 6/30 - 1.04s - loss: 78.98129 - acc: 0.89508927 - batches: 113
Epoch 7/30 - 1.09s - loss: 78.20374 - acc: 0.8995536 - batches: 113
Epoch 8/30 - 1.10s - loss: 77.51029 - acc: 0.90625 - batches: 113
Epoch 9/30 - 1.05s - loss: 76.92082 - acc: 0.90625 - batches: 113
Epoch 10/30 - 0.99s - loss: 76.39791 - acc: 0.9129464 - batches: 113
Epoch 11/30 - 1.46s - loss: 75.92094 - acc: 0.91741073 - batches: 113
Epoch 12/30 - 1.14s - loss: 75.47811 - acc: 0.91741073 - batches: 
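
The log format above is regular enough to parse into a per-epoch history, for example to plot loss and accuracy. The following is a minimal sketch that assumes the line layout shown in this log; the regular expression and the helper `parse_epochs` are written for this example, and only the first few log lines are reproduced as sample input.

```python
import math
import re

# First lines of the training log shown above, used as sample input.
sample_log = """Training started - epochs: 30 - learning_rate: 0.001 - batch_size: 4 - training_examples: 450 - classes: 3
Epoch 0/30 - 1.86s - loss: 98.795044 - acc: 0.68526787 - batches: 113
Epoch 1/30 - 1.51s - loss: 85.520164 - acc: 0.76339287 - batches: 113
Epoch 2/30 - 1.00s - loss: 83.265274 - acc: 0.83258927 - batches: 113
"""

# Assumed layout: "Epoch <n>/<total> - <seconds>s - loss: <x> - acc: <y> ..."
EPOCH_RE = re.compile(r"Epoch (\d+)/(\d+) - ([\d.]+)s - loss: ([\d.]+) - acc: ([\d.]+)")


def parse_epochs(log_text):
    """Return one dict per epoch line found in the log text."""
    return [
        {
            "epoch": int(m.group(1)),
            "seconds": float(m.group(3)),
            "loss": float(m.group(4)),
            "acc": float(m.group(5)),
        }
        for m in EPOCH_RE.finditer(log_text)
    ]


history = parse_epochs(sample_log)
for row in history:
    print(row)

# 450 training examples with batch size 4 give ceil(450 / 4) batches per epoch.
expected_batches = math.ceil(450 / 4)
print(expected_batches)
```

The batch count reported in the log matches this arithmetic: 450 examples do not divide evenly by 4, so the last batch is a partial one.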

In [None]:
preds = clf_pipelineModel.transform(test)

In [None]:
preds_df = preds.select("label", "text", "class.result").toPandas()
preds_df.head()

|   | label | text | result |
|---|-------|------|--------|
| 0 | risk_factors | Since our customers use our solutions for impo... | [risk_factors] |
| 1 | financial_statements | S X dated May 21 2020 and has concluded that t... | [financial_statements] |
| 2 | financial_statements | Fair Value Measurements\nThe Company measures ... | [financial_statements] |
| 3 | financial_statements | The Company follows authoritative guidance rel... | [financial_statements] |
| 4 | risk_factors | If any of these suppliers manufacturers or par... | [risk_factors] |


In [None]:
# The result is an array since in Spark NLP you can have multiple sentences.
# Let's explode the array and get the item(s) inside of result column out
preds_df['result'] = preds_df['result'].apply(lambda x : x[0])
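
Indexing with `x[0]` raises an `IndexError` if a row ever comes back with an empty result array. A slightly more defensive variant is sketched below; the helper name `first_or_none` is hypothetical, and whether empty results can actually occur depends on your data, so treat this as an optional safeguard.

```python
def first_or_none(result):
    """Return the first element of a result array, or None if it is missing or empty."""
    if result is None or len(result) == 0:
        return None
    return result[0]


# Illustrative inputs mimicking the 'result' column values.
print(first_or_none(["risk_factors"]))
print(first_or_none([]))
```

It can be used in place of the lambda: `preds_df['result'].apply(first_or_none)`.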

In [None]:
# We are going to use sklearn to evaluate the results on the test dataset
from sklearn.metrics import classification_report

print(classification_report(preds_df['label'], preds_df['result']))

                      precision    recall  f1-score   support

            business       0.78      0.78      0.78         9
financial_statements       0.79      0.95      0.86        20
        risk_factors       1.00      0.81      0.89        21

            accuracy                           0.86        50
           macro avg       0.86      0.85      0.85        50
        weighted avg       0.88      0.86      0.86        50



### Saving & loading back the trained model

In [None]:
clf_pipelineModel.stages

[DocumentAssembler_59dd8e111571,
 UNIVERSAL_SENTENCE_ENCODER_4de71669b7ec,
 FinanceClassifierDLModel_a529f7cb368a]

In [None]:
clf_pipelineModel.stages[-1].write().overwrite().save('Clf_Use')

In [None]:
# Load back the saved classifier model
ClfModel = finance.ClassifierDLModel.load('Clf_Use')

In [None]:
ld_pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, ClfModel])
ld_pipeline_model = ld_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

In [None]:
# Apply Model Transform to testData
ld_preds = ld_pipeline_model.transform(test)

In [None]:
ld_preds_df = ld_preds.select("text", "label", "class.result").toPandas()

In [None]:
ld_preds_df.head()

|   | text | label | result |
|---|------|-------|--------|
| 0 | Since our customers use our solutions for impo... | risk_factors | [risk_factors] |
| 1 | S X dated May 21 2020 and has concluded that t... | financial_statements | [financial_statements] |
| 2 | Fair Value Measurements\nThe Company measures ... | financial_statements | [financial_statements] |
| 3 | The Company follows authoritative guidance rel... | financial_statements | [financial_statements] |
| 4 | If any of these suppliers manufacturers or par... | risk_factors | [risk_factors] |


## Train with BERT Embeddings

We do not have Financial Sentence Embeddings yet, but we can use the Financial Word Embeddings and then average them. Since this model takes a long time to train, we will train for only one epoch.
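
Averaging word embeddings means taking the element-wise mean of all token vectors in a document to get one fixed-size vector. The sketch below shows the idea in plain Python with tiny made-up 3-dimensional vectors; real BERT vectors have many more dimensions, and the function `average_pool` is an illustration, not the library's implementation.

```python
def average_pool(token_vectors):
    """Element-wise mean of a list of equal-length token vectors."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]


# Three hypothetical token embeddings for a three-token document.
token_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 1.0],
]

sentence_vector = average_pool(token_vectors)
print(sentence_vector)  # [2.0, 2.0, 1.0]
```

The result has the same dimensionality as a single token vector regardless of document length, which is what lets a classifier consume it as a sentence embedding.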

In [None]:
embeddings = (
    nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en")
    .setInputCols(["document", "token"])
    .setOutputCol("embeddings")
)

bert_embeddings_sec_bert_base download started this may take some time.
Approximate size to download 390.4 MB
[OK!]


In [None]:
document_assembler = (
    nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
)

tokenizer = nlp.Tokenizer().setInputCols(["document"]).setOutputCol("token")

embeddingsSentence = (
    nlp.SentenceEmbeddings()
    .setInputCols(["document", "embeddings"])
    .setOutputCol("sentence_embeddings")
    .setPoolingStrategy("AVERAGE")
)

classifierdl = (
    finance.ClassifierDLApproach()
    .setInputCols(["sentence_embeddings"])
    .setOutputCol("class")
    .setLabelColumn("label")
    .setMaxEpochs(1)
    .setLr(0.001)
    .setEnableOutputLogs(True)
    .setOutputLogsPath("multiclass_bert")
    .setBatchSize(4)
    .setDropout(0.15)
)

clf_pipeline = nlp.Pipeline(
    stages=[document_assembler, tokenizer, embeddings, embeddingsSentence, classifierdl]
)

In [None]:
%%time
clf_pipelineModel = clf_pipeline.fit(train)

CPU times: user 3.4 s, sys: 451 ms, total: 3.85 s
Wall time: 7min 27s


In [None]:
preds = clf_pipelineModel.transform(test)

In [None]:
preds_df = preds.select("label", "text", "class.result").toPandas()

In [None]:
preds_df.head()

|   | label | text | result |
|---|-------|------|--------|
| 0 | risk_factors | Since our customers use our solutions for impo... | [risk_factors] |
| 1 | financial_statements | S X dated May 21 2020 and has concluded that t... | [financial_statements] |
| 2 | financial_statements | Fair Value Measurements\nThe Company measures ... | [financial_statements] |
| 3 | financial_statements | The Company follows authoritative guidance rel... | [financial_statements] |
| 4 | risk_factors | If any of these suppliers manufacturers or par... | [risk_factors] |


In [None]:
log_files = os.listdir("multiclass_bert")

with open("multiclass_bert/"+log_files[0], "r") as log_file :
    print(log_file.read())

Training started - epochs: 1 - learning_rate: 0.001 - batch_size: 4 - training_examples: 450 - classes: 3
Epoch 0/1 - 1.12s - loss: 93.70362 - acc: 0.72321427 - batches: 113



In [None]:
# Let's explode the array and get the item(s) inside of result column out
preds_df['result'] = preds_df['result'].apply(lambda x : x[0])

from sklearn.metrics import classification_report

print(classification_report(preds_df['label'], preds_df['result']))


                      precision    recall  f1-score   support

            business       0.00      0.00      0.00         9
financial_statements       0.87      1.00      0.93        20
        risk_factors       0.74      0.95      0.83        21

            accuracy                           0.80        50
           macro avg       0.54      0.65      0.59        50
        weighted avg       0.66      0.80      0.72        50
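
The gap between the `macro avg` and `weighted avg` rows comes from how each one treats the `business` class, which scores 0.00 here. The sketch below recomputes both averages from the rounded per-class values in the report above, so results agree with the report only up to rounding; the helper names are introduced for this example.

```python
# Per-class values copied from the classification report above (rounded).
report = {
    "business": {"precision": 0.00, "recall": 0.00, "support": 9},
    "financial_statements": {"precision": 0.87, "recall": 1.00, "support": 20},
    "risk_factors": {"precision": 0.74, "recall": 0.95, "support": 21},
}


def macro_avg(report, metric):
    """Unweighted mean over classes: every class counts equally."""
    return sum(row[metric] for row in report.values()) / len(report)


def weighted_avg(report, metric):
    """Mean over classes weighted by support (number of true examples)."""
    total = sum(row["support"] for row in report.values())
    return sum(row[metric] * row["support"] for row in report.values()) / total


for metric in ("precision", "recall"):
    print(
        f"{metric}: macro={macro_avg(report, metric):.2f} "
        f"weighted={weighted_avg(report, metric):.2f}"
    )
```

Because `business` has only 9 of the 50 test examples, its zero scores pull the macro average down much more than the weighted one. With a single training epoch and this class imbalance, the numbers are consistent with the model never predicting `business` at all.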



### Save the model and zip it for Models Hub upload/download

In [None]:
# Save a Spark NLP model
clf_pipelineModel.stages[-1].write().overwrite().save('MultiClfBert')

# cd into saved dir and zip
! cd /content/MultiClfBert ; zip -r /content/MultiClfBert.zip *
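
Outside Colab, where `/content` and the `zip` command may not be available, the same archive can be built with Python's standard library. This is a minimal sketch: `zip_model_dir` is a hypothetical helper, and the demo runs on a throwaway directory standing in for a saved model rather than on a real one.

```python
import os
import shutil
import tempfile
import zipfile


def zip_model_dir(model_dir, zip_base):
    """Zip the contents of model_dir into '<zip_base>.zip' and return the archive path."""
    # root_dir makes paths inside the archive relative to model_dir,
    # similar to running `zip -r` from inside the directory.
    return shutil.make_archive(zip_base, "zip", root_dir=model_dir)


# Self-contained demo: a fake model directory with one metadata file.
with tempfile.TemporaryDirectory() as tmp:
    fake_model = os.path.join(tmp, "MultiClfBert")
    os.makedirs(os.path.join(fake_model, "metadata"))
    with open(os.path.join(fake_model, "metadata", "part-00000"), "w") as f:
        f.write("{}")

    archive = zip_model_dir(fake_model, os.path.join(tmp, "MultiClfBert"))
    archive_name = os.path.basename(archive)
    with zipfile.ZipFile(archive) as zf:
        archive_names = zf.namelist()

print(archive_name)
print(archive_names)
```

For the model saved above, the call would be `zip_model_dir('MultiClfBert', 'MultiClfBert')`, which writes `MultiClfBert.zip` next to the model directory.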