
![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)





[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Finance/07.1.Training_Financial_Assertion.ipynb)

# Training Finance Assertion


# Colab Setup

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install johnsnowlabs

In [None]:
from johnsnowlabs import nlp, finance
# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM.
# Restart your notebook afterwards for the changes to take effect.
nlp.install(force_browser=True)

## Start Spark Session

In [3]:
# Automatically load license data and start a session with all JARs the user has access to
spark = nlp.start()

👌 Launched cpu optimized session with with: 🚀Spark-NLP==4.2.4, 💊Spark-Healthcare==4.2.4, running on ⚡ PySpark==3.1.2


In [4]:
spark

# Data Prep

In [5]:
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings_JSL/Finance/data/assertion_df.csv

In [6]:
import pandas as pd

training_df = pd.read_csv('./assertion_df.csv')

In [7]:
training_data = spark.createDataFrame(training_df)
training_data.show()

+--------------------+--------------------+--------+-----+---+
|                text|              target|   label|start|end|
+--------------------+--------------------+--------+-----+---+
|CEC ENTERTAINMENT...|CEC ENTERTAINMENT...|negative|    0|  2|
|CEC ENTERTAINMENT...|       GEO GROUP INC|positive|    6|  8|
|BRAVE ASSET MANAG...|Mondelez Internat...|positive|    6| 10|
|BRAVE ASSET MANAG...|BRAVE ASSET MANAG...|positive|    0|  3|
|Compound Natural ...|Compound Natural ...|negative|    0|  4|
|Compound Natural ...|AMERICAN ELECTRIC...|positive|    9| 13|
|Marijuana Co of A...|PVM International...|positive|   10| 14|
|Marijuana Co of A...|Marijuana Co of A...|positive|    0|  6|
|NORTEK INC is not...|          NORTEK INC|negative|    0|  1|
|NORTEK INC is not...|EN2GO INTERNATION...|positive|    6|  8|
|QUALCOMM INC/DE i...| CANNAPOWDER , INC .|positive|    8| 11|
|QUALCOMM INC/DE i...|     QUALCOMM INC/DE|positive|    0|  1|
|TransDigm Group I...| TransDigm Group INC|negative|   

In [8]:
training_data.printSchema()

root
 |-- text: string (nullable = true)
 |-- target: string (nullable = true)
 |-- label: string (nullable = true)
 |-- start: long (nullable = true)
 |-- end: long (nullable = true)



In [9]:
%time training_data.count()

CPU times: user 8.25 ms, sys: 6.46 ms, total: 14.7 ms
Wall time: 750 ms


98

In [10]:
(train_data, test_data) = training_data.randomSplit([0.7, 0.3], seed = 100)
print("Training Dataset Count: " + str(train_data.count()))
print("Test Dataset Count: " + str(test_data.count()))

Training Dataset Count: 71
Test Dataset Count: 27
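
Each sentence in this dataset appears once per target company, so a row-level `randomSplit` can place the two rows of one sentence on opposite sides of the split. The following is a minimal sketch of a group-aware alternative in plain Python, assuming rows are dicts with a `text` key. The helper name `split_by_text` and the sample rows are hypothetical and not part of the notebook.

```python
import random
from collections import defaultdict


def split_by_text(rows, train_fraction=0.7, seed=100):
    """Split rows so that all rows sharing the same `text` land on the same side."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["text"]].append(row)

    # Shuffle the distinct sentences, not the individual rows.
    texts = sorted(groups)
    random.Random(seed).shuffle(texts)

    n_train = round(len(texts) * train_fraction)
    train = [row for text in texts[:n_train] for row in groups[text]]
    test = [row for text in texts[n_train:] for row in groups[text]]
    return train, test


# Hypothetical rows: two targets per sentence, mirroring the layout of assertion_df.csv.
rows = [
    {"text": f"sentence {i}", "target": target, "label": label}
    for i in range(10)
    for target, label in (("company A", "negative"), ("company B", "positive"))
]

train_rows, test_rows = split_by_text(rows)
shared = {r["text"] for r in train_rows} & {r["text"] for r in test_rows}
print(len(train_rows), len(test_rows), len(shared))
```

With a Spark DataFrame, the same idea could probably be expressed by splitting the distinct `text` values and joining back, though that variant is not shown or tested here.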


In [11]:
train_data.show()

+--------------------+--------------------+--------+-----+---+
|                text|              target|   label|start|end|
+--------------------+--------------------+--------+-----+---+
|CEC ENTERTAINMENT...|CEC ENTERTAINMENT...|negative|    0|  2|
|CEC ENTERTAINMENT...|       GEO GROUP INC|positive|    6|  8|
|BRAVE ASSET MANAG...|BRAVE ASSET MANAG...|positive|    0|  3|
|BRAVE ASSET MANAG...|Mondelez Internat...|positive|    6| 10|
|Compound Natural ...|AMERICAN ELECTRIC...|positive|    9| 13|
|NORTEK INC is not...|EN2GO INTERNATION...|positive|    6|  8|
|NORTEK INC is not...|          NORTEK INC|negative|    0|  1|
|QUALCOMM INC/DE i...| CANNAPOWDER , INC .|positive|    8| 11|
|QUALCOMM INC/DE i...|     QUALCOMM INC/DE|positive|    0|  1|
|TransDigm Group I...|         ABIOMED INC|positive|    9| 10|
|TransDigm Group I...| TransDigm Group INC|negative|    0|  2|
|Nexeo Solutions ,...|ARCA biopharma , ...|positive|    8| 12|
|Nexeo Solutions ,...|Nexeo Solutions ,...|negative|   

In [12]:
test_data.show()

+--------------------+--------------------+--------+-----+---+
|                text|              target|   label|start|end|
+--------------------+--------------------+--------+-----+---+
|Compound Natural ...|Compound Natural ...|negative|    0|  4|
|Marijuana Co of A...|Marijuana Co of A...|positive|    0|  6|
|Marijuana Co of A...|PVM International...|positive|   10| 14|
|Fundrise Income e...|Fundrise Income e...|negative|    0|  5|
|Fundrise Income e...|MIDDLETON & CO IN...|positive|    9| 12|
|Angie's List , In...|Angie's List , Inc .|negative|    0|  4|
|Angie's List , In...|        RC-1 , Inc .|positive|   11| 14|
|ALEXANDRIA REAL E...|            CDEX INC|positive|    9| 10|
|ATMI INC is eligi...|NEAH POWER SYSTEM...|positive|    5| 10|
|Artificial Intell...| APA OPTICS INC /MN/|positive|   10| 13|
|Mountain Capital ...|            COSI INC|positive|   11| 12|
|Mountain Capital ...|Mountain Capital ...|negative|    0|  5|
|QUAINT OAK BANCOR...|QUAINT OAK BANCOR...|positive|   

# Using BERT Embeddings

In [13]:
bert_embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en") \
  .setInputCols("document", "token") \
  .setOutputCol("embeddings")\
  .setMaxSentenceLength(512)

bert_embeddings_sec_bert_base download started this may take some time.
Approximate size to download 390.4 MB
[OK!]


In [14]:
document = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

chunk = nlp.Doc2Chunk()\
    .setInputCols("document")\
    .setOutputCol("doc_chunk")\
    .setChunkCol("target")\
    .setStartCol("start")\
    .setStartColByTokenIndex(True)\
    .setFailOnMissing(False)\
    .setLowerCase(False)

token = nlp.Tokenizer()\
    .setInputCols(['document'])\
    .setOutputCol('token')
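
`setStartColByTokenIndex(True)` tells `Doc2Chunk` that `start` is a token index rather than a character offset. Judging from the sample rows (for example `CANNAPOWDER , INC .` with `start=8` and `end=11`), `end` appears to be inclusive. Here is a rough sketch of that convention, assuming whitespace tokenization. The sentence and the helper `chunk_from_token_span` are made up for illustration.

```python
def chunk_from_token_span(text, start, end):
    """Return the chunk covered by token indices start..end (both inclusive)."""
    tokens = text.split()
    return " ".join(tokens[start:end + 1])


# Hypothetical sentence; tokens are separated by single spaces, as in the CSV.
text = "ACME CORP is not a subsidiary of GLOBEX INC"

print(chunk_from_token_span(text, 0, 1))  # ACME CORP
print(chunk_from_token_span(text, 7, 8))  # GLOBEX INC
```

A check like this can be run over the CSV to confirm that `target` matches the tokens selected by `start` and `end` before training.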


We run the train and test data through the pipeline and save both in Parquet format. `AssertionDLApproach()` reads the test set from disk via `setTestDataset()`.

In [15]:
assertion_pipeline = nlp.Pipeline(
    stages = [
    document,
    chunk,
    token,
    bert_embeddings])

assertion_test_data = assertion_pipeline.fit(test_data).transform(test_data)

In [16]:
assertion_test_data.columns

['text',
 'target',
 'label',
 'start',
 'end',
 'document',
 'doc_chunk',
 'token',
 'embeddings']

In [17]:
assertion_test_data.write.mode('overwrite').parquet('test_data.parquet')

In [18]:
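# Note: this uses the full `training_data` (98 rows), which also contains the rows held out in `test_data`.
# Pass `train_data` instead to keep the evaluation set unseen during training.
assertion_train_data = assertion_pipeline.fit(training_data).transform(training_data)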
assertion_train_data.write.mode('overwrite').parquet('train_data.parquet')

In [19]:
assertion_train_data.columns

['text',
 'target',
 'label',
 'start',
 'end',
 'document',
 'doc_chunk',
 'token',
 'embeddings']

## Graph setup

In [20]:
! pip install -q tensorflow==2.7.0
! pip install -q tensorflow-addons

We will use the `TFGraphBuilder` annotator, which creates the TensorFlow graph as part of the model training pipeline.

`TFGraphBuilder` inspects the data and creates the proper graph if a suitable version of TensorFlow (<= 2.7) is available. The graph is stored in the defined folder and loaded by the approach.
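
Because graph creation depends on the installed TensorFlow version, it can help to check the version before building. Below is a small sketch of such a check on version strings. The helper `tf_version_supported` and the hard-coded `(2, 7)` ceiling are illustrative assumptions based on the sentence above, not a Spark NLP API.

```python
def tf_version_supported(version, max_supported=(2, 7)):
    """Return True if the major.minor part of `version` is <= max_supported."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) <= max_supported


print(tf_version_supported("2.7.0"))   # True
print(tf_version_supported("2.11.0"))  # False
print(tf_version_supported("1.15.5"))  # True
```

In the notebook, the argument would be `tensorflow.__version__`. Comparing integer tuples avoids the string-comparison trap where `"2.11"` sorts before `"2.7"`.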

In [22]:
graph_folder = "./tf_graphs"

In [23]:
assertion_graph_builder =  finance.TFGraphBuilder()\
    .setModelName("assertion_dl")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label")\
    .setGraphFolder(graph_folder)\
    .setGraphFile("assertion_graph.pb")\
    .setMaxSequenceLength(1200)\
    .setHiddenUnitsNumber(25)

**Setting the Scope Window (Target Area) Dynamically in Assertion Status Detection Models**


This parameter lets you train Assertion Status models to focus on a specific context window when resolving the status of a NER chunk. The window has the format `[X,Y]`, where `X` is the number of tokens to consider to the left of the chunk and `Y` is the maximum number of tokens to consider to the right. Here is what different windows mean:


*   By default, the window is `[-1,-1]`, which means that the Assertion Status model looks at all of the tokens in the sentence/document (up to the maximum number of tokens set in `setMaxSentLen()`).
*   `[0,0]` means “don’t pay attention to any token except the ner_chunk”, which effectively ignores all context when resolving the assertion.
*   `[9,15]` is what empirically seems to be the best baseline: we look at up to 9 tokens to the left and 15 to the right of the NER chunk to understand the context and resolve the status.
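
The window semantics described above can be sketched in plain Python. This only illustrates which tokens fall inside a given window; it is not the library's internal implementation. The helper name `tokens_in_scope`, the sample sentence, and the choice to treat each negative side independently as "no limit" are assumptions.

```python
def tokens_in_scope(tokens, chunk_start, chunk_end, scope_window=(-1, -1)):
    """Select the tokens a [left, right] scope window would keep around a chunk.

    chunk_start and chunk_end are inclusive token indices.
    A negative side means "no limit" on that side.
    """
    left, right = scope_window
    first = 0 if left < 0 else max(0, chunk_start - left)
    last = len(tokens) - 1 if right < 0 else min(len(tokens) - 1, chunk_end + right)
    return tokens[first:last + 1]


# Hypothetical sentence; the chunk "GLOBEX INC" covers token indices 7..8.
tokens = "ACME CORP is not a subsidiary of GLOBEX INC".split()

print(tokens_in_scope(tokens, 7, 8, (-1, -1)))  # whole sentence
print(tokens_in_scope(tokens, 7, 8, (0, 0)))    # only the chunk
print(tokens_in_scope(tokens, 7, 8, (3, 0)))    # three tokens of left context
```

In this made-up sentence the negation cue `not` sits four tokens to the left of the chunk, so a left window of 3 would cut it off. This is the kind of effect that window tuning is meant to expose.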


See the [Scope Window Tuning Assertion Status Detection notebook](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/2.1.Scope_window_tuning_assertion_status_detection.ipynb), which illustrates the effect of different windows and how to fine-tune your AssertionDLModels to get the most out of them.

In our case, the best scope window is around `[10,10]`. Note that the cell below sets `scope_window = [50, 50]`.

In [24]:
scope_window = [50, 50]

assertionStatus = finance.AssertionDLApproach()\
    .setLabelCol("label")\
    .setInputCols("document", "doc_chunk", "embeddings")\
    .setOutputCol("assertion")\
    .setBatchSize(128)\
    .setLearningRate(0.001)\
    .setEpochs(2)\
    .setStartCol("start")\
    .setEndCol("end")\
    .setMaxSentLen(1200)\
    .setEnableOutputLogs(True)\
    .setOutputLogsPath('training_logs/')\
    .setGraphFolder(graph_folder)\
    .setGraphFile(f"{graph_folder}/assertion_graph.pb")\
    .setTestDataset(path="test_data.parquet", read_as='SPARK', options={'format': 'parquet'})\
    .setScopeWindow(scope_window)
    #.setValidationSplit(0.2)\    
    #.setDropout(0.1)\    

In [25]:
clinical_assertion_pipeline = nlp.Pipeline(
    stages = [
    #document,
    #chunk,
    #token,
    #embeddings,
    assertion_graph_builder,
    assertionStatus])

In [26]:
training_data.printSchema()

root
 |-- text: string (nullable = true)
 |-- target: string (nullable = true)
 |-- label: string (nullable = true)
 |-- start: long (nullable = true)
 |-- end: long (nullable = true)



In [27]:
assertion_train_data = spark.read.parquet('train_data.parquet')

In [28]:
assertion_train_data.groupBy('label').count().show()

+--------+-----+
|   label|count|
+--------+-----+
|positive|   71|
|negative|   27|
+--------+-----+



In [29]:
%%time
assertion_model = clinical_assertion_pipeline.fit(assertion_train_data)

TF Graph Builder configuration:
Model name: assertion_dl
Graph folder: ./tf_graphs
Graph file name: assertion_graph.pb
Build params: {'n_classes': 2, 'feat_size': 768, 'max_seq_len': 1200, 'n_hidden': 25}


Instructions for updating:
non-resource variables are not supported in the long term


Device mapping: no known devices.


Instructions for updating:
Please use `keras.layers.Bidirectional(keras.layers.RNN(cell))`, which is equivalent to this API
Instructions for updating:
Please use `keras.layers.RNN(cell)`, which is equivalent to this API
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor


Device mapping: no known devices.
assertion_dl graph exported to ./tf_graphs/assertion_graph.pb
CPU times: user 9.23 s, sys: 4.52 s, total: 13.8 s
Wall time: 32.1 s


Checking the results saved in the log file

In [30]:
import os

log_files = os.listdir("./training_logs")
log_files

['AssertionDLApproach_00411fdbdc10.log']

In [31]:
with open("./training_logs/"+log_files[0]) as log_file:
    print(log_file.read())

Name of the selected graph: ./tf_graphs/assertion_graph.pb
Training started, trainExamples: 98


Epoch: 0 started, learning rate: 0.001, dataset size: 98
Done, 9.570665722 total training loss: 0.80261874, avg training loss: 0.80261874, batches: 1
Quality on test dataset: 
time to finish evaluation: 2.13s
Total test loss: 0.7050	Avg test loss: 0.7050
label	 tp	 fp	 fn	 prec	 rec	 f1
positive	 0	 0	 18	 0.0	 0.0	 0.0
negative	 9	 18	 0	 0.33333334	 1.0	 0.5
tp: 9 fp: 18 fn: 18 labels: 2
Macro-average	 prec: 0.16666667, rec: 0.5, f1: 0.25
Micro-average	 prec: 0.33333334, rec: 0.33333334, f1: 0.33333334


Epoch: 1 started, learning rate: 9.5E-4, dataset size: 98
Done, 7.344641001 total training loss: 0.70753396, avg training loss: 0.70753396, batches: 1
Quality on test dataset: 
time to finish evaluation: 1.77s
Total test loss: 0.6590	Avg test loss: 0.6590
label	 tp	 fp	 fn	 prec	 rec	 f1
positive	 18	 9	 0	 0.6666667	 1.0	 0.8
negative	 0	 0	 9	 0.0	 0.0	 0.0
tp: 18 fp: 9 fn: 9 labels: 2
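
The per-label and averaged scores in the log follow from the raw `tp`, `fp`, and `fn` counts. The sketch below recomputes them for epoch 0 using the standard definitions. The helper `prf` is hypothetical, and the exact way the library computes its averages is assumed rather than confirmed, although the results agree with the printed values.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts; 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Per-label (tp, fp, fn) counts copied from the epoch 0 block of the log above.
counts = {"positive": (0, 0, 18), "negative": (9, 18, 0)}

per_label = {label: prf(*c) for label, c in counts.items()}

# Macro-average: unweighted mean of the per-label scores.
macro = tuple(
    sum(scores[i] for scores in per_label.values()) / len(per_label)
    for i in range(3)
)

# Micro-average: pool the counts first, then compute the scores once.
micro = prf(*(sum(c[i] for c in counts.values()) for i in range(3)))

print("macro", [round(x, 4) for x in macro])
print("micro", [round(x, 4) for x in micro])
```

In both epochs one label has zero true positives, meaning the model predicts a single class for every test row; it simply flips from `negative` in epoch 0 to `positive` in epoch 1.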


In [32]:
assertion_test_data = spark.read.parquet('test_data.parquet')

In [33]:
preds = assertion_model.transform(assertion_test_data).select('label','assertion.result')

preds.show()

+--------+----------+
|   label|    result|
+--------+----------+
|negative|[positive]|
|positive|[positive]|
|positive|[positive]|
|negative|[positive]|
|positive|[positive]|
|negative|[positive]|
|positive|[positive]|
|positive|[positive]|
|positive|[positive]|
|positive|[positive]|
|negative|[positive]|
|negative|[positive]|
|positive|[positive]|
|positive|[positive]|
|negative|[positive]|
|positive|[positive]|
|negative|[positive]|
|positive|[positive]|
|positive|[positive]|
|negative|[positive]|
+--------+----------+
only showing top 20 rows



In [34]:
preds_df = preds.toPandas()

In [35]:
preds_df["result"] = preds_df["result"].apply(lambda x: x[0] if len(x) else pd.NA)
preds_df.dropna(inplace=True)

preds_df

      label    result
0  negative  positive
1  positive  positive
2  positive  positive
3  negative  positive
4  positive  positive
5  negative  positive
6  positive  positive
7  positive  positive
8  positive  positive
9  positive  positive


In [38]:
# We are going to use sklearn to evaluate the results on the test dataset
from sklearn.metrics import classification_report

print(classification_report(preds_df['label'], preds_df['result']))

              precision    recall  f1-score   support

    negative       0.00      0.00      0.00         9
    positive       0.67      1.00      0.80        18

    accuracy                           0.67        27
   macro avg       0.33      0.50      0.40        27
weighted avg       0.44      0.67      0.53        27
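
The report shows a recall of 1.00 for `positive` and 0.00 for `negative`, meaning the model predicts `positive` for every test row. A quick way to put the 0.67 accuracy in context is to compare it with a majority-class baseline. This is a minimal sketch using only the support counts from the report; the variable names are illustrative.

```python
from collections import Counter

# Test-set label counts taken from the support column of the report above.
y_true = ["negative"] * 9 + ["positive"] * 18

# A baseline that always predicts the most frequent label.
majority_label, majority_count = Counter(y_true).most_common(1)[0]
y_pred = [majority_label] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(majority_label, round(accuracy, 2))
```

The baseline reaches the same accuracy as the trained model, which suggests that the two-epoch run has not yet learned to separate the classes.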



### Saving the trained model

In [39]:
assertion_model.stages

[TFGraphBuilderModel_2ba34934c1c2, FINANCE-ASSERTION_DL_a84ead075c86]

In [40]:
# Save a Spark NLP model
assertion_model.stages[-1].write().overwrite().save('Assertion')

# cd into saved dir and zip
! cd /content/Assertion ; zip -r /content/Assertion.zip *

  adding: fields/ (stored 0%)
  adding: fields/datasetParams/ (stored 0%)
  adding: fields/datasetParams/part-00024 (deflated 27%)
  adding: fields/datasetParams/part-00007 (deflated 26%)
  adding: fields/datasetParams/.part-00016.crc (stored 0%)
  adding: fields/datasetParams/.part-00039.crc (deflated 44%)
  adding: fields/datasetParams/.part-00026.crc (stored 0%)
  adding: fields/datasetParams/.part-00020.crc (stored 0%)
  adding: fields/datasetParams/.part-00032.crc (stored 0%)
  adding: fields/datasetParams/.part-00002.crc (stored 0%)
  adding: fields/datasetParams/part-00015 (deflated 26%)
  adding: fields/datasetParams/part-00017 (deflated 27%)
  adding: fields/datasetParams/part-00036 (deflated 27%)
  adding: fields/datasetParams/.part-00000.crc (stored 0%)
  adding: fields/datasetParams/.part-00036.crc (stored 0%)
  adding: fields/datasetParams/.part-00013.crc (stored 0%)
  adding: fields/datasetParams/part-00016 (deflated 27%)
  adding: fields/datasetParams/part-00034 (deflate