
![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)





[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Finance/13.Training_Financial_Classifiers.ipynb)

# Train Domain-specific Multiclass and Multilabel classifiers

## Colab Setup

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install johnsnowlabs 

In [None]:
from google.colab import files
print('Please upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import * 
# After uploading your license, run this to install all licensed Python wheels and pre-download the Jars for the Spark Session JVM
# Make sure to restart your notebook afterwards for the changes to take effect
jsl.install()

👌 Detected license file /content/4.2.0.spark_nlp_for_healthcare.json
🚨 Outdated Medical Secrets in license file. Version=4.2.0 but should be Version=0.1.14
📋 Stored John Snow Labs License in /root/.johnsnowlabs/licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json
👷 Setting up if John Snow Labs home exists in /root/.johnsnowlabs this might take a few minutes.
Downloading 🐍+🚀 Python Library spark_nlp-4.2.0-py2.py3-none-any.whl
Downloading 🐍+💊 Python Library internal_with_finleg-4.0.0rc1-py3-none-any.whl
Downloading 🐍+🕶 Python Library spark_ocr-4.1.0-py3-none-any.whl
Downloading 🫘+🚀 Java Library spark-nlp-assembly-4.2.0.jar
Downloading 🫘+💊 Java Library spark-nlp-jsl-assembly-4.2.0fl1.jar
Downloading 🫘+🕶 Java Library spark-ocr-assembly-4.1.0.jar
🙆 JSL Home setup in /root/.johnsnowlabs
👌 Detected license file /content/4.2.0.spark_nlp_for_healthcare.json
Installing /root/.johnsnowlabs/py_installs/internal_with_finleg-4.0.0rc1-py3-none-any.whl to /usr/bin/python3
Running: /usr/bin/py

## Start Spark Session

In [None]:
from johnsnowlabs import * 
# Automatically load license data and start a session with all jars user has access to
spark = jsl.start()

# Multilabel classifier training

## Loading the data

In [None]:
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Finance/data/finance_data.csv

In [None]:
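import ast
import pandas as pd
df = pd.read_csv('./finance_data.csv')
# Labels are stored as stringified lists; literal_eval parses them without executing arbitrary code
df['label'] = df['label'].apply(ast.literal_eval)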

In [None]:
data = spark.createDataFrame(df)

# If you have a single dataset, split it; otherwise, load the test dataset the same way you loaded the training data.
train, test = data.randomSplit([0.8, 0.2], seed = 123)

In [None]:
train.show(truncate=50)

+--------------------------------------------------+-----------------------------+
|                                         provision|                        label|
+--------------------------------------------------+-----------------------------+
|(a) Consultant or Company may terminate this Pr...|               [terminations]|
|(a) Each of Borrower and Guarantor, as applicab...|[representations, warranties]|
|(a) No amendment or waiver of any provision of ...|                 [amendments]|
|(a) No failure on the part of any Person to exe...|        [waivers, amendments]|
|(a) No failure or delay by any Agent or any Len...|        [waivers, amendments]|
|(a) No failure or delay by the Administrative A...|        [waivers, amendments]|
|(a) No failure or delay by the Administrative A...|        [waivers, amendments]|
|(a) No failure or delay by the Administrative A...|        [waivers, amendments]|
|(a) No failure or delay by the Agent or any Len...|        [waivers, amendments]|
|(a)

In [None]:
from pyspark.sql.functions import col

test.groupBy("label") \
    .count() \
    .orderBy(col("count").desc()) \
    .show()

+--------------------+-----+
|               label|count|
+--------------------+-----+
|    [governing laws]|  751|
|      [counterparts]|  580|
|           [notices]|  574|
| [entire agreements]|  571|
|      [severability]|  504|
|          [survival]|  327|
|[assigns, success...|  294|
|        [amendments]|  265|
|[waivers, amendme...|  229|
|      [terminations]|  227|
|          [expenses]|  227|
|           [waivers]|  206|
|[representations,...|  203|
|       [assignments]|  174|
|   [representations]|   88|
|[amendments, enti...|   60|
|        [successors]|   50|
|[amendments, term...|   35|
|        [warranties]|   24|
|[governing laws, ...|   13|
+--------------------+-----+
only showing top 20 rows
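
The table above counts label combinations, so a label such as `amendments` is spread across several rows. To see how often each individual label occurs, one option is to explode the lists first. This is a minimal pandas sketch on made-up rows (`df_labels` and its contents are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical multilabel rows shaped like the `label` column above
df_labels = pd.DataFrame({
    "label": [
        ["waivers", "amendments"],
        ["amendments"],
        ["notices"],
        ["waivers", "amendments"],
    ],
})

# explode() emits one row per individual label, so value_counts()
# counts labels rather than label combinations
label_counts = df_labels["label"].explode().value_counts()
print(label_counts)
```

The same idea should carry over to the real data by running it on `df['label']` after the labels have been parsed into lists.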



## With Universal Encoder

In [None]:
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("provision") \
    .setOutputCol("document") \
    .setCleanupMode("shrink")

embeddings = nlp.UniversalSentenceEncoder.pretrained() \
      .setInputCols("document") \
      .setOutputCol("sentence_embeddings")

classsifierdl = nlp.MultiClassifierDLApproach()\
      .setInputCols(["sentence_embeddings"])\
      .setOutputCol("class")\
      .setLabelColumn("label")\
      .setMaxEpochs(30)\
      .setEnableOutputLogs(True)

clf_pipeline = Pipeline(
    stages = [
        document_assembler,
        embeddings,
        classsifierdl
    ])

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]


In [None]:
%%time
clf_pipelineModel = clf_pipeline.fit(train)

In [None]:
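import os

# os.listdir() returns files in arbitrary order, so pick the most recently modified log
log_dir = "/root/annotator_logs"
log_file_name = max(os.listdir(log_dir), key=lambda f: os.path.getmtime(os.path.join(log_dir, f)))

with open(os.path.join(log_dir, log_file_name), "r") as log_file:
    print(log_file.read())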

Training started - epochs: 30 - learning_rate: 0.001 - batch_size: 64 - training_examples: 22042 - classes: 15
Epoch 0/30 - 6.52s - loss: 0.16814168 - batches: 345
Epoch 1/30 - 4.47s - loss: 0.07261445 - batches: 345
Epoch 2/30 - 4.48s - loss: 0.06150661 - batches: 345
Epoch 3/30 - 4.51s - loss: 0.05632347 - batches: 345
Epoch 4/30 - 4.69s - loss: 0.05331255 - batches: 345
Epoch 5/30 - 4.56s - loss: 0.05126492 - batches: 345
Epoch 6/30 - 5.10s - loss: 0.04972897 - batches: 345
Epoch 7/30 - 4.49s - loss: 0.048504002 - batches: 345
Epoch 8/30 - 4.36s - loss: 0.04748668 - batches: 345
Epoch 9/30 - 4.44s - loss: 0.046618596 - batches: 345
Epoch 10/30 - 4.47s - loss: 0.045863405 - batches: 345
Epoch 11/30 - 4.53s - loss: 0.04519599 - batches: 345
Epoch 12/30 - 4.51s - loss: 0.044598866 - batches: 345
Epoch 13/30 - 4.46s - loss: 0.044059444 - batches: 345
Epoch 14/30 - 4.43s - loss: 0.043568727 - batches: 345
Epoch 15/30 - 4.47s - loss: 0.043119825 - batches: 345
Epoch 16/30 - 4.40s - loss: 

In [None]:
preds = clf_pipelineModel.transform(test)

In [None]:
preds_df = preds.select('label','provision',"class.result").toPandas()
preds_df.head()

| | label | provision | result |
|---|---|---|---|
| 0 | [waivers, terminations] | (a) Effective as of the Effective Date, the Ho... | [representations] |
| 1 | [waivers, amendments] | (a) No failure or delay by the Administrative ... | [waivers] |
| 2 | [waivers, amendments] | (a) No failure or delay on the part of any par... | [waivers] |
| 3 | [assignments] | (a) Seller, the Agent, each Managing Agent, ea... | [successors, assignments] |
| 4 | [assigns, successors] | (a) The provisions of this Agreement shall be ... | [successors, assigns] |


In [None]:
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score

mlb = MultiLabelBinarizer()

y_true = mlb.fit_transform(preds_df['label'])
# Reuse the binarizer fitted on the gold labels so both matrices share the same columns
y_pred = mlb.transform(preds_df['result'])


print("Classification report: \n", (classification_report(y_true, y_pred)))
print("F1 micro averaging:",(f1_score(y_true, y_pred, average='micro')))
print("ROC: ",(roc_auc_score(y_true, y_pred, average="micro")))


Classification report: 
               precision    recall  f1-score   support

           0       0.88      0.77      0.82       618
           1       0.73      0.52      0.61       198
           2       0.79      0.75      0.77       302
           3       0.99      0.98      0.99       587
           4       0.98      0.94      0.96       675
           5       0.98      0.92      0.95       228
           6       0.98      0.98      0.98       784
           7       0.98      0.96      0.97       574
           8       0.92      0.79      0.85       291
           9       0.99      0.94      0.96       531
          10       0.84      0.85      0.84       361
          11       0.96      0.91      0.94       329
          12       0.89      0.73      0.80       272
          13       0.90      0.75      0.82       460
          14       0.83      0.80      0.82       227

   micro avg       0.93      0.87      0.90      6437
   macro avg       0.91      0.84      0.87      6437
w
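
The numeric rows in the report above are column indices of the binarized label matrix, which makes them hard to map back to label names. The sketch below, on made-up labels, fits the binarizer once on the gold labels, reuses it for the predictions so both matrices share the same columns, and passes `mlb.classes_` as `target_names`. The variable names `y_true_labels` and `y_pred_labels` are hypothetical stand-ins for `preds_df['label']` and `preds_df['result']`.

```python
from sklearn.metrics import classification_report, f1_score
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical gold labels and predictions (not taken from the notebook's data)
y_true_labels = [["waivers", "amendments"], ["assignments"], ["notices"]]
y_pred_labels = [["waivers"], ["assignments"], ["notices"]]

mlb = MultiLabelBinarizer()
y_true = mlb.fit_transform(y_true_labels)  # learns the label vocabulary
y_pred = mlb.transform(y_pred_labels)      # reuses the same column order

micro_f1 = f1_score(y_true, y_pred, average="micro")

print(mlb.classes_)
print(classification_report(y_true, y_pred, target_names=mlb.classes_, zero_division=0))
print("F1 micro averaging:", round(micro_f1, 3))
```

In this toy example the predictions never contain `amendments`, so fitting a second binarizer on them would produce a matrix with fewer columns than `y_true`.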

## With Bert Embeddings

We do not have any specific Financial Sentence Embeddings, but we can use Financial Bert Embeddings and then average them.
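
To make "average them" concrete, here is a small NumPy sketch of average pooling: the token vectors of one document are averaged into a single vector. The numbers and the 3-dimensional size are made up for illustration, and this only mirrors the idea behind the `AVERAGE` pooling strategy used below, not its actual implementation.

```python
import numpy as np

# Hypothetical token embeddings for one document: 4 tokens, 3 dimensions each
token_embeddings = np.array([
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [0.0, 4.0, 2.0],
    [0.0, 2.0, 4.0],
])

# Average pooling: the mean over the token axis gives one vector per document
sentence_embedding = token_embeddings.mean(axis=0)
print(sentence_embedding)
```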

In [None]:
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
          .setInputCols(["document", "token"]) \
          .setOutputCol("embeddings")

bert_embeddings_sec_bert_base download started this may take some time.
Approximate size to download 390.4 MB
[OK!]


In [None]:
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("provision") \
    .setOutputCol("document")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

embeddingsSentence = nlp.SentenceEmbeddings() \
    .setInputCols(["document", "embeddings"]) \
    .setOutputCol("sentence_embeddings") \
    .setPoolingStrategy("AVERAGE")

classsifierdl = nlp.MultiClassifierDLApproach() \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("class") \
    .setLabelColumn("label")\
    .setMaxEpochs(8)\
    .setEnableOutputLogs(True)

clf_pipeline = Pipeline(
    stages = [
        document_assembler,
        tokenizer,
        embeddings,
        embeddingsSentence,
        classsifierdl
    ])

In [None]:
%%time
clf_pipelineModel = clf_pipeline.fit(train)

CPU times: user 35 s, sys: 4.29 s, total: 39.2 s
Wall time: 1h 56min 42s


In [None]:
preds = clf_pipelineModel.transform(test)

In [None]:
preds_df = preds.select('provision','label',"class.result").toPandas()

In [None]:
preds_df.head()

| | provision | label | result |
|---|---|---|---|
| 0 | (a) Effective as of the Effective Date, the Ho... | [waivers, terminations] | [waivers] |
| 1 | (a) No failure or delay by the Administrative ... | [waivers, amendments] | [waivers, amendments] |
| 2 | (a) No failure or delay on the part of any par... | [waivers, amendments] | [waivers] |
| 3 | (a) Seller, the Agent, each Managing Agent, ea... | [assignments] | [assignments] |
| 4 | (a) The provisions of this Agreement shall be ... | [assigns, successors] | [successors, assigns] |


In [None]:
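import os

# os.listdir() returns files in arbitrary order, so pick the most recently modified log
log_dir = "/root/annotator_logs"
log_file_name = max(os.listdir(log_dir), key=lambda f: os.path.getmtime(os.path.join(log_dir, f)))

with open(os.path.join(log_dir, log_file_name), "r") as log_file:
    print(log_file.read())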

In [None]:
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score

mlb = MultiLabelBinarizer()

y_true = mlb.fit_transform(preds_df['label'])
# Reuse the binarizer fitted on the gold labels so both matrices share the same columns
y_pred = mlb.transform(preds_df['result'])


print("Classification report: \n", (classification_report(y_true, y_pred)))
print("F1 micro averaging:",(f1_score(y_true, y_pred, average='micro')))
print("ROC: ",(roc_auc_score(y_true, y_pred, average="micro")))


Classification report: 
               precision    recall  f1-score   support

           0       0.90      0.90      0.90       618
           1       0.82      0.66      0.73       198
           2       0.88      0.68      0.77       302
           3       0.99      0.99      0.99       587
           4       0.99      0.96      0.97       675
           5       1.00      0.97      0.98       228
           6       0.98      0.98      0.98       784
           7       1.00      0.97      0.98       574
           8       0.95      0.93      0.94       291
           9       0.99      0.97      0.98       531
          10       0.90      0.85      0.87       361
          11       0.96      0.93      0.95       329
          12       0.93      0.86      0.89       272
          13       0.96      0.75      0.84       460
          14       0.88      0.88      0.88       227

   micro avg       0.95      0.91      0.93      6437
   macro avg       0.94      0.89      0.91      6437
w

### Saving & loading back the trained model

In [None]:
clf_pipelineModel.stages

[DocumentAssembler_0befb157d5b5,
 REGEX_TOKENIZER_845e9cea52a1,
 BERT_EMBEDDINGS_29ce72cd673e,
 SentenceEmbeddings_c13729f7bf05,
 MultiClassifierDLModel_fb74a81172fa]

In [None]:
clf_pipelineModel.stages[-1].write().overwrite().save('MultilabelClfBert')

In [None]:
# Load back the saved Multilabel Classifier Model
MultilabelClfModel = nlp.MultiClassifierDLModel.load('MultilabelClfBert')

In [None]:
ld_pipeline = Pipeline(stages=[document_assembler, tokenizer, embeddings, embeddingsSentence, MultilabelClfModel])
ld_pipeline_model = ld_pipeline.fit(spark.createDataFrame([['']]).toDF("provision"))

In [None]:
# Apply Model Transform to testData
ld_preds = ld_pipeline_model.transform(test)

In [None]:
ld_preds_df = ld_preds.select('provision','label',"class.result").toPandas()

In [None]:
ld_preds_df.head(10)

| | provision | label | result |
|---|---|---|---|
| 0 | (a) Effective as of the Effective Date, the Ho... | [waivers, terminations] | [waivers] |
| 1 | (a) No failure or delay by the Administrative ... | [waivers, amendments] | [waivers, amendments] |
| 2 | (a) No failure or delay on the part of any par... | [waivers, amendments] | [waivers] |
| 3 | (a) Seller, the Agent, each Managing Agent, ea... | [assignments] | [assignments] |
| 4 | (a) The provisions of this Agreement shall be ... | [assigns, successors] | [successors, assigns] |
| 5 | (a) No failure or delay of the Administrative... | [waivers, amendments] | [waivers, amendments] |
| 6 | (a) All of the representations and warranties ... | [representations, warranties] | [warranties, representations] |
| 7 | (a) Any Lender may at any time assign to one o... | [assignments] | [assignments] |
| 8 | (a) Each of the Borrower and the Parent hereby... | [representations, warranties] | [warranties, representations] |
| 9 | (a) Except as otherwise expressly provided her... | [notices] | [notices] |


# Multiclass classifier training


## Loading the data

In [None]:
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Finance/data/finance_clf_data.csv

In [None]:
import pandas as pd
df = pd.read_csv('./finance_clf_data.csv')

In [None]:
df.head()

| | text | label | len |
|---|---|---|---|
| 0 | \nOperating\nLeases\n \nOn\nJanuary 1 2010 th... | financial_statements | 465 |
| 1 | the Exercise Price and is exercisable for fiv... | financial_statements | 406 |
| 2 | Income Taxes\n69\nTable of Contents\nWe accoun... | financial_statements | 843 |
| 3 | Invoice2go\n has not been required to maintain... | risk_factors | 474 |
| 4 | A\nB\nC\nPlan Category\nNumber of Securitiesto... | equity | 358 |


In [None]:
df['label'].value_counts()

risk_factors               3831
financial_statements       3726
business                   2002
financial_conditions        702
form_10k_summary            491
executives_compensation     304
controls_procedures         277
equity                      223
market_risk                 204
executives                  161
legal_proceedings            94
security_ownership           84
properties                   81
exhibits                     77
Name: label, dtype: int64

In [None]:
data = spark.createDataFrame(df)

train, test = data.randomSplit([0.8, 0.2], seed = 100)

In [None]:
from pyspark.sql.functions import col

train.groupBy("label") \
    .count() \
    .orderBy(col("count").desc()) \
    .show()

+--------------------+-----+
|               label|count|
+--------------------+-----+
|        risk_factors| 3071|
|financial_statements| 2983|
|            business| 1582|
|financial_conditions|  597|
|    form_10k_summary|  385|
| controls_procedures|  226|
|executives_compen...|  224|
|              equity|  174|
|         market_risk|  158|
|          executives|  122|
|   legal_proceedings|   72|
|  security_ownership|   70|
|            exhibits|   62|
|          properties|   57|
+--------------------+-----+



In [None]:
from pyspark.sql.functions import col

test.groupBy("label") \
    .count() \
    .orderBy(col("count").desc()) \
    .show()

+--------------------+-----+
|               label|count|
+--------------------+-----+
|        risk_factors|  760|
|financial_statements|  743|
|            business|  420|
|    form_10k_summary|  106|
|financial_conditions|  105|
|executives_compen...|   80|
| controls_procedures|   51|
|              equity|   49|
|         market_risk|   46|
|          executives|   39|
|          properties|   24|
|   legal_proceedings|   22|
|            exhibits|   15|
|  security_ownership|   14|
+--------------------+-----+



## With Universal Encoder

In [None]:
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document") 

embeddings = nlp.UniversalSentenceEncoder.pretrained() \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

classsifierdl = finance.ClassifierDLApproach()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("class")\
    .setLabelColumn("label")\
    .setMaxEpochs(30)\
    .setEnableOutputLogs(True)

clf_pipeline = Pipeline(
    stages = [
        document_assembler,
        embeddings,
        classsifierdl
    ])

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]


In [None]:
%%time
clf_pipelineModel = clf_pipeline.fit(train)

CPU times: user 237 ms, sys: 35.8 ms, total: 273 ms
Wall time: 43.3 s


In [None]:
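import os

# os.listdir() returns files in arbitrary order, so pick the most recently modified log
log_dir = "/root/annotator_logs"
log_file_name = max(os.listdir(log_dir), key=lambda f: os.path.getmtime(os.path.join(log_dir, f)))

with open(os.path.join(log_dir, log_file_name), "r") as log_file:
    print(log_file.read())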

In [None]:
preds = clf_pipelineModel.transform(test)

In [None]:
preds_df = preds.select('label','text',"class.result").toPandas()
preds_df.head()

| | label | text | result |
|---|---|---|---|
| 0 | financial_statements | \n\n\n \n \n 7\n million when the Company meet... | [financial_statements] |
| 1 | financial_statements | \n\n \n\n\nLevel 3 Inputs that are generally u... | [financial_statements] |
| 2 | risk_factors | \n\n \n\n\nOur products are complex and have a... | [risk_factors] |
| 3 | risk_factors | \n\n \n\n\nii Customer Support Revenue\n\n\n \... | [financial_statements] |
| 4 | form_10k_summary | \n \n\n\n \n \n\n2020\n\n \n \n\n2019\n\n \n\n... | [financial_statements] |


In [None]:
# The result is an array because Spark NLP can return one prediction per sentence.
# Unpack it and keep the first item as the predicted label
preds_df['result'] = preds_df['result'].apply(lambda x : x[0])
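
The `x[0]` lookup above raises an `IndexError` if any row comes back with an empty array. Whether that can happen with this pipeline is an assumption on our side, but a slightly more defensive version is cheap. A sketch on a made-up DataFrame (`preds` and its rows are hypothetical):

```python
import pandas as pd

# Hypothetical predictions shaped like the `class.result` column after toPandas()
preds = pd.DataFrame({
    "label": ["risk_factors", "equity", "business"],
    "result": [["risk_factors"], ["financial_statements"], []],
})

# Take the first prediction when present; fall back to None for empty arrays
preds["result"] = preds["result"].apply(lambda r: r[0] if len(r) > 0 else None)
print(preds)
```

Rows left as `None` would then need to be dropped or filled before calling `classification_report`.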

In [None]:
# We are going to use sklearn to evaluate the results on the test dataset
from sklearn.metrics import classification_report

print (classification_report(preds_df['label'], preds_df['result']))

                         precision    recall  f1-score   support

               business       0.70      0.80      0.75       420
    controls_procedures       0.00      0.00      0.00        51
                 equity       0.00      0.00      0.00        49
             executives       0.00      0.00      0.00        39
executives_compensation       0.00      0.00      0.00        80
               exhibits       0.00      0.00      0.00        15
   financial_conditions       0.00      0.00      0.00       105
   financial_statements       0.63      0.93      0.75       743
       form_10k_summary       0.00      0.00      0.00       106
      legal_proceedings       0.00      0.00      0.00        22
            market_risk       0.00      0.00      0.00        46
             properties       0.00      0.00      0.00        24
           risk_factors       0.75      0.88      0.81       760
     security_ownership       0.00      0.00      0.00        14

               accuracy

### Saving & loading back the trained model

In [None]:
clf_pipelineModel.stages

[DocumentAssembler_036488880821,
 UNIVERSAL_SENTENCE_ENCODER_4de71669b7ec,
 FinanceClassifierDLModel_7adb6be4e193]

In [None]:
clf_pipelineModel.stages[-1].write().overwrite().save('Clf_Use')

In [None]:
# Load back the saved Classifier Model
ClfModel = finance.ClassifierDLModel.load('Clf_Use')

In [None]:
ld_pipeline = Pipeline(stages=[document_assembler, embeddings,ClfModel])
ld_pipeline_model = ld_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))

In [None]:
# Apply Model Transform to testData
ld_preds = ld_pipeline_model.transform(test)

In [None]:
ld_preds_df = ld_preds.select('text','label',"class.result").toPandas()

In [None]:
ld_preds_df.head()

| | text | label | result |
|---|---|---|---|
| 0 | \n\n\n \n \n 7\n million when the Company meet... | financial_statements | [financial_statements] |
| 1 | \n\n \n\n\nLevel 3 Inputs that are generally u... | financial_statements | [financial_statements] |
| 2 | \n\n \n\n\nOur products are complex and have a... | risk_factors | [risk_factors] |
| 3 | \n\n \n\n\nii Customer Support Revenue\n\n\n \... | risk_factors | [financial_statements] |
| 4 | \n \n\n\n \n \n\n2020\n\n \n \n\n2019\n\n \n\n... | form_10k_summary | [financial_statements] |


## With Bert Embeddings

We do not have Financial Sentence Embeddings yet, but we can use the Financial Word Embeddings and then average them.

In [None]:
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

bert_embeddings_sec_bert_base download started this may take some time.
Approximate size to download 390.4 MB
[OK!]


In [None]:
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

embeddingsSentence = nlp.SentenceEmbeddings() \
    .setInputCols(["document", "embeddings"]) \
    .setOutputCol("sentence_embeddings") \
    .setPoolingStrategy("AVERAGE")

classsifierdl = finance.ClassifierDLApproach() \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("class") \
    .setLabelColumn("label")\
    .setMaxEpochs(8)\
    .setLr(0.001)\
    .setEnableOutputLogs(True)

clf_pipeline = Pipeline(
    stages = [
        document_assembler,
        tokenizer,
        embeddings,
        embeddingsSentence,
        classsifierdl
    ])

In [None]:
%%time
clf_pipelineModel = clf_pipeline.fit(train)

CPU times: user 13 s, sys: 1.32 s, total: 14.3 s
Wall time: 50min 20s


In [None]:
preds = clf_pipelineModel.transform(test)

In [None]:
preds_df = preds.select('label','text',"class.result").toPandas()

In [None]:
preds_df.head()

| | label | text | result |
|---|---|---|---|
| 0 | financial_statements | \n\n\n \n \n 7\n million when the Company meet... | [financial_statements] |
| 1 | financial_statements | \n\n \n\n\nLevel 3 Inputs that are generally u... | [financial_statements] |
| 2 | risk_factors | \n\n \n\n\nOur products are complex and have a... | [risk_factors] |
| 3 | risk_factors | \n\n \n\n\nii Customer Support Revenue\n\n\n \... | [financial_statements] |
| 4 | form_10k_summary | \n \n\n\n \n \n\n2020\n\n \n \n\n2019\n\n \n\n... | [financial_statements] |


In [None]:
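import os

# os.listdir() returns files in arbitrary order, so pick the most recently modified log
log_dir = "/root/annotator_logs"
log_file_name = max(os.listdir(log_dir), key=lambda f: os.path.getmtime(os.path.join(log_dir, f)))

with open(os.path.join(log_dir, log_file_name), "r") as log_file:
    print(log_file.read())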

In [None]:
# Unpack the result array and keep the first item as the predicted label
preds_df['result'] = preds_df['result'].apply(lambda x : x[0])

from sklearn.metrics import classification_report

print (classification_report(preds_df['label'], preds_df['result']))


                         precision    recall  f1-score   support

               business       0.00      0.00      0.00       420
    controls_procedures       0.00      0.00      0.00        51
                 equity       0.00      0.00      0.00        49
             executives       0.00      0.00      0.00        39
executives_compensation       0.00      0.00      0.00        80
               exhibits       0.00      0.00      0.00        15
   financial_conditions       0.00      0.00      0.00       105
   financial_statements       0.55      0.99      0.70       743
       form_10k_summary       0.00      0.00      0.00       106
      legal_proceedings       0.00      0.00      0.00        22
            market_risk       0.00      0.00      0.00        46
             properties       0.00      0.00      0.00        24
           risk_factors       0.65      0.96      0.77       760
     security_ownership       0.00      0.00      0.00        14

               accuracy

# Save model and Zip it for Modelshub Upload/Downloads

In [None]:
# Save a Spark NLP model
clf_pipelineModel.stages[-1].write().overwrite().save('ClfBert')

# cd into saved dir and zip
! cd /content/ClfBert ; zip -r /content/ClfBert.zip *

  adding: classifierdl_tensorflow (deflated 56%)
  adding: fields/ (stored 0%)
  adding: fields/datasetParams/ (stored 0%)
  adding: fields/datasetParams/part-00002 (deflated 27%)
  adding: fields/datasetParams/.part-00002.crc (stored 0%)
  adding: fields/datasetParams/._SUCCESS.crc (stored 0%)
  adding: fields/datasetParams/part-00003 (deflated 32%)
  adding: fields/datasetParams/_SUCCESS (stored 0%)
  adding: fields/datasetParams/part-00000 (deflated 26%)
  adding: fields/datasetParams/.part-00003.crc (stored 0%)
  adding: fields/datasetParams/.part-00001.crc (stored 0%)
  adding: fields/datasetParams/.part-00000.crc (stored 0%)
  adding: fields/datasetParams/part-00001 (deflated 27%)
  adding: metadata/ (stored 0%)
  adding: metadata/._SUCCESS.crc (stored 0%)
  adding: metadata/_SUCCESS (stored 0%)
  adding: metadata/part-00000 (deflated 40%)
  adding: metadata/.part-00000.crc (stored 0%)
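
If you prefer to stay in Python rather than shelling out to `zip`, the standard library can build the same kind of archive. The sketch below uses a temporary stand-in directory (the file it creates is a dummy, not a real model); with the actual model you would presumably point `root_dir` at `/content/ClfBert`. Unlike the `*` glob above, `shutil.make_archive` also picks up hidden files at the top level of the directory.

```python
import os
import shutil
import tempfile
import zipfile

# Stand-in for a saved model directory (hypothetical contents)
work_dir = tempfile.mkdtemp()
model_dir = os.path.join(work_dir, "ClfBert")
os.makedirs(os.path.join(model_dir, "metadata"))
with open(os.path.join(model_dir, "metadata", "part-00000"), "w") as f:
    f.write("{}")

# Archive the directory's contents so they sit at the root of the zip file;
# the archive is written next to the model directory, not inside it
zip_path = shutil.make_archive(
    os.path.join(work_dir, "ClfBert"), "zip", root_dir=model_dir
)

# List what ended up in the archive
with zipfile.ZipFile(zip_path) as zf:
    archived = zf.namelist()

print(zip_path)
print(archived)
```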
