
![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)





[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Finance/15.Training_Finance_Assertion.ipynb)

# Training Finance Assertion


# Colab Setup

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install johnsnowlabs

In [None]:
from google.colab import files
print('Please upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [3]:
from johnsnowlabs import * 
# After uploading your license, run this to install all licensed Python wheels and pre-download the Jars for the Spark Session JVM.
# Make sure to restart your notebook afterwards for the changes to take effect.
jsl.install()

👌 Detected license file /content/4.2.0.spark_nlp_for_healthcare-2.json
📋 Stored John Snow Labs License in /root/.johnsnowlabs/licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json
👷 Setting up if John Snow Labs home exists in /root/.johnsnowlabs this might take a few minutes.
Downloading 🐍+🚀 Python Library spark_nlp-4.2.0-py2.py3-none-any.whl
Downloading 🐍+💊 Python Library spark_nlp_jsl-4.2.0-py3-none-any.whl
Downloading 🐍+🕶 Python Library spark_ocr-4.1.0-py3-none-any.whl
Downloading 🫘+🚀 Java Library spark-nlp-assembly-4.2.0.jar
Downloading 🫘+💊 Java Library spark-nlp-jsl-4.2.0.jar
Downloading 🫘+🕶 Java Library spark-ocr-assembly-4.1.0.jar
🙆 JSL Home setup in /root/.johnsnowlabs
👌 Detected license file /content/4.2.0.spark_nlp_for_healthcare-2.json
Installing /root/.johnsnowlabs/py_installs/spark_nlp_jsl-4.2.0-py3-none-any.whl to /usr/bin/python3
Running: /usr/bin/python3 -m pip install /root/.johnsnowlabs/py_installs/spark_nlp_jsl-4.2.0-py3-none-any.whl
👌 Detected license file /

## Start Spark Session

In [None]:
# Automatically load the license data and start a session with all jars the user has access to
spark = jsl.start()

DEBUG START!
👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_5278 (1).json
👌 Launched cpu-Optimized JVM SparkSession with Jars for: 🚀Spark-NLP==4.2.0, 💊Spark-Healthcare==4.0.0rc1, 🕶Spark-OCR==4.1.0, running on ⚡ PySpark==3.1.2


In [None]:
spark

# Data Prep

In [None]:
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Finance/data/assertion_fin.csv

In [None]:
import pandas as pd

training_df = pd.read_csv('/content/assertion_fin.csv')

In [None]:
training_data = spark.createDataFrame(training_df)
training_data.show()

+-------+--------------------+---------+-------+--------------------+------+---------------+
|task_id|            sentence|tkn_start|tkn_end|               chunk|entity|assertion_label|
+-------+--------------------+---------+-------+--------------------+------+---------------+
|      1|The Swedish East ...|        1|      4|Swedish East Indi...|   ORG|           PAST|
|      1|The Swedish East ...|        6|      8|Svenska Ostindisk...| ALIAS|           PAST|
|      1|The Swedish East ...|       10|     10|                SOIC| ALIAS|           PAST|
|      1|The Swedish East ...|       14|     14|          Gothenburg|   LOC|           PAST|
|      1|The Swedish East ...|       15|     15|              Sweden|   LOC|           PAST|
|      1|The Swedish East ...|       17|     17|                1731|  DATE|           PAST|
|      1|The Swedish East ...|       25|     25|               China|   LOC|           PAST|
|      1|The Swedish East ...|       28|     29|            Far East| 

In [None]:
training_data.printSchema()

root
 |-- task_id: long (nullable = true)
 |-- sentence: string (nullable = true)
 |-- tkn_start: long (nullable = true)
 |-- tkn_end: long (nullable = true)
 |-- chunk: string (nullable = true)
 |-- entity: string (nullable = true)
 |-- assertion_label: string (nullable = true)



In [None]:
%time training_data.count()

CPU times: user 13.3 ms, sys: 1.27 ms, total: 14.5 ms
Wall time: 798 ms


8050

In [None]:
(train_data, test_data) = training_data.randomSplit([0.9, 0.1], seed = 100)
# Count the training split (train_data), not the full dataset (training_data)
print("Training Dataset Count: " + str(train_data.count()))
print("Test Dataset Count: " + str(test_data.count()))

Training Dataset Count: 7202
Test Dataset Count: 848
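
The test split above holds 848 of the 8050 rows rather than exactly 805, because `randomSplit` assigns each row to a split by independent random sampling, so the weights are only met approximately. The following is a minimal pure-Python sketch of that idea. It is not Spark's implementation; `random_split` is a hypothetical helper, and its row counts will not match the Spark output.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row independently to one split, with probability
    proportional to that split's weight (illustrative, not Spark code)."""
    total = float(sum(weights))
    # Cumulative upper bounds, e.g. [0.9, 1.0] for weights [0.9, 0.1].
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for index, bound in enumerate(bounds):
            # The last split also catches any floating-point remainder.
            if draw < bound or index == len(bounds) - 1:
                splits[index].append(row)
                break
    return splits


# 8050 stand-in row ids, matching the size of the dataset above.
train_rows, test_rows = random_split(range(8050), [0.9, 0.1], seed=100)
print(len(train_rows), len(test_rows))
```

Every row lands in exactly one split, so the two counts always add up to the full dataset size, while the individual sizes vary around the requested 90/10 proportions.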


In [None]:
train_data.show()

+-------+--------------------+---------+-------+--------------------+------+---------------+
|task_id|            sentence|tkn_start|tkn_end|               chunk|entity|assertion_label|
+-------+--------------------+---------+-------+--------------------+------+---------------+
|      1|2.5 tonnes) and t...|       11|     11|          Gothenburg|   LOC|           PAST|
|      1|2.5 tonnes) and t...|       34|     34|              Sweden|   LOC|           PAST|
|      1|= Early attempts ...|        9|     11|  Swedish East India|   ORG|           PAST|
|      1|= Early attempts ...|       20|     21|    Willem Usselincx|   PER|           PAST|
|      1|= Sweden after th...|        1|      1|              Sweden|   LOC|           PAST|
|      1|= The Royal chart...|        9|     12|Henrik König & Co...|   ORG|           PAST|
|      1|= The Royal chart...|       39|     42|   Cape of Good Hope|   LOC|           PAST|
|      1|= The Royal chart...|       46|     46|               Japan| 

# Using BERT Embeddings

In [None]:
bert_embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en") \
  .setInputCols("document", "token") \
  .setOutputCol("embeddings")\
  .setMaxSentenceLength(512)

bert_embeddings_sec_bert_base download started this may take some time.
Approximate size to download 390.4 MB
[OK!]


In [None]:
document = nlp.DocumentAssembler()\
    .setInputCol("sentence")\
    .setOutputCol("document")

chunk = nlp.Doc2Chunk()\
    .setInputCols("document")\
    .setOutputCol("doc_chunk")\
    .setChunkCol("chunk")\
    .setStartCol("tkn_start")\
    .setStartColByTokenIndex(True)\
    .setFailOnMissing(False)\
    .setLowerCase(False)

token = nlp.Tokenizer()\
    .setInputCols(['document'])\
    .setOutputCol('token')


We run the test data through the pipeline and save it in Parquet format so that `AssertionDLApproach()` can read it through `setTestDataset()` during training.

In [None]:
assertion_pipeline = Pipeline(
    stages = [
    document,
    chunk,
    token,
    bert_embeddings])

assertion_test_data = assertion_pipeline.fit(test_data).transform(test_data)

In [None]:
assertion_test_data.columns

['task_id',
 'sentence',
 'tkn_start',
 'tkn_end',
 'chunk',
 'entity',
 'assertion_label',
 'document',
 'doc_chunk',
 'token',
 'embeddings']

In [None]:
assertion_test_data.write.mode('overwrite').parquet('test_data.parquet')

In [None]:
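# Embed the full dataset (training_data, which still contains the test rows) and save it for training.
# Use train_data here instead to keep the test rows out of training.
assertion_train_data = assertion_pipeline.fit(training_data).transform(training_data)
assertion_train_data.write.mode('overwrite').parquet('train_data.parquet')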

In [None]:
assertion_train_data.columns

['task_id',
 'sentence',
 'tkn_start',
 'tkn_end',
 'chunk',
 'entity',
 'assertion_label',
 'document',
 'doc_chunk',
 'token',
 'embeddings']

## Graph setup

In [None]:
!pip install -q tensorflow==2.7.0
!pip install -q tensorflow-addons

We will use the `TFGraphBuilder` annotator, which creates the graph as part of the model training pipeline.

`TFGraphBuilder` inspects the data and creates the proper graph if a suitable version of TensorFlow (<= 2.7) is available. The graph is stored in the defined folder and loaded by the approach.
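
Before building the graph, it can help to confirm that the installed TensorFlow meets the version constraint mentioned above. The sketch below uses only the standard library; the helper names are hypothetical and not part of the johnsnowlabs API, and treating every 2.7.x patch release as "<= 2.7" is an assumption.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Upper bound taken from the text above; 2.7.x patch releases are assumed to qualify.
MAX_SUPPORTED = (2, 7)


def parse_major_minor(version_string):
    """Extract (major, minor) from a version string such as '2.7.0'."""
    match = re.match(r"(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version_string!r}")
    return int(match.group(1)), int(match.group(2))


def is_supported_tf(version_string, max_supported=MAX_SUPPORTED):
    """Return True when the version is at or below the supported bound."""
    return parse_major_minor(version_string) <= max_supported


def installed_tf_version():
    """Return the installed TensorFlow version string, or None if absent."""
    try:
        return version("tensorflow")
    except PackageNotFoundError:
        return None


tf_version = installed_tf_version()
if tf_version is None:
    print("TensorFlow is not installed")
elif is_supported_tf(tf_version):
    print(f"TensorFlow {tf_version} is within the supported range")
else:
    print(f"TensorFlow {tf_version} is newer than {MAX_SUPPORTED[0]}.{MAX_SUPPORTED[1]}")
```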

In [None]:
graph_folder = "./tf_graphs"

In [None]:
assertion_graph_builder = finance.TFGraphBuilder()\
    .setModelName("assertion_dl")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("assertion_label")\
    .setGraphFolder(graph_folder)\
    .setGraphFile("assertion_graph.pb")\
    .setMaxSequenceLength(1200)\
    .setHiddenUnitsNumber(25)

**Setting the Scope Window (Target Area) Dynamically in Assertion Status Detection Models**


This parameter allows you to train Assertion Status models to focus on a specific context window when resolving the status of a NER chunk. The window has the format `[X,Y]`, where `X` is the number of tokens to consider to the left of the chunk and `Y` is the maximum number of tokens to consider to the right. Here is what different windows mean:


*   By default, the window is `[-1,-1]`, which means that the model looks at all of the tokens in the sentence/document (up to the maximum number of tokens set in `setMaxSentLen()`).
*   `[0,0]` means "don't pay attention to any token except the ner_chunk", that is, no context is considered when resolving the assertion status.
*   `[9,15]` is what empirically seems to be the best baseline: the model looks at up to 9 tokens to the left and 15 to the right of the NER chunk to understand the context and resolve the status.


See the [Scope Window Tuning Assertion Status Detection notebook](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/2.1.Scope_window_tuning_assertion_status_detection.ipynb), which illustrates the effect of different windows and how to fine-tune your AssertionDLModels to get the best out of them.

In our case, the best scope window is around `[10,10]`.
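
To make the window semantics concrete, here is a minimal pure-Python sketch of which tokens a given `[X,Y]` window would keep around a chunk. It only illustrates the behaviour described above and is not the library's internal implementation: `apply_scope_window` is a hypothetical helper, the example sentence is invented, and treating a negative value on one side as "unbounded on that side" is an assumption.

```python
def apply_scope_window(tokens, chunk_start, chunk_end, window):
    """Return the tokens inside the scope window around a chunk.

    chunk_start and chunk_end are 0-based, inclusive token indices,
    like tkn_start and tkn_end in the training data.
    """
    left, right = window
    # Assumption: a negative value means "no limit" on that side.
    low = 0 if left < 0 else max(0, chunk_start - left)
    high = len(tokens) if right < 0 else min(len(tokens), chunk_end + 1 + right)
    return tokens[low:high]


# Invented example sentence; the chunk "Acme Corp" covers tokens 5..6.
tokens = "The company expects to acquire Acme Corp in the next fiscal year".split()

print(apply_scope_window(tokens, 5, 6, [-1, -1]))  # whole sentence
print(apply_scope_window(tokens, 5, 6, [0, 0]))    # only the chunk itself
print(apply_scope_window(tokens, 5, 6, [2, 3]))    # 2 tokens left, 3 tokens right
```

A window wider than the sentence, such as `[9,15]` on this 12-token example, simply returns the whole sentence, because the bounds are clipped to the sentence length.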

In [None]:
scope_window = [50, 50]

assertionStatus = finance.AssertionDLApproach()\
    .setLabelCol("assertion_label")\
    .setInputCols("document", "doc_chunk", "embeddings")\
    .setOutputCol("assertion")\
    .setBatchSize(128)\
    .setLearningRate(0.001)\
    .setEpochs(2)\
    .setStartCol("tkn_start")\
    .setEndCol("tkn_end")\
    .setMaxSentLen(1200)\
    .setEnableOutputLogs(True)\
    .setOutputLogsPath('training_logs/')\
    .setGraphFolder(graph_folder)\
    .setGraphFile(f"{graph_folder}/assertion_graph.pb")\
    .setTestDataset(path="test_data.parquet", read_as='SPARK', options={'format': 'parquet'})\
    .setScopeWindow(scope_window)
    #.setValidationSplit(0.2)
    #.setDropout(0.1)

In [None]:
clinical_assertion_pipeline = Pipeline(
    stages = [
    #document,
    #chunk,
    #token,
    #embeddings,
    assertion_graph_builder,
    assertionStatus])

In [None]:
training_data.printSchema()

root
 |-- task_id: long (nullable = true)
 |-- sentence: string (nullable = true)
 |-- tkn_start: long (nullable = true)
 |-- tkn_end: long (nullable = true)
 |-- chunk: string (nullable = true)
 |-- entity: string (nullable = true)
 |-- assertion_label: string (nullable = true)



In [None]:
assertion_train_data = spark.read.parquet('train_data.parquet')

In [None]:
%%time
assertion_model = clinical_assertion_pipeline.fit(assertion_train_data)

TF Graph Builder configuration:
Model name: assertion_dl
Graph folder: ./tf_graphs
Graph file name: assertion_graph.pb
Build params: {'n_classes': 4, 'feat_size': 768, 'max_seq_len': 1200, 'n_hidden': 25}


Instructions for updating:
non-resource variables are not supported in the long term


Device mapping: no known devices.


Instructions for updating:
Please use `keras.layers.Bidirectional(keras.layers.RNN(cell))`, which is equivalent to this API
Instructions for updating:
Please use `keras.layers.RNN(cell)`, which is equivalent to this API
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor


Device mapping: no known devices.
assertion_dl graph exported to ./tf_graphs/assertion_graph.pb
CPU times: user 32.1 s, sys: 2.83 s, total: 34.9 s
Wall time: 25min 7s


Check the results saved in the log file:

In [None]:
import os

log_files = os.listdir("/content/training_logs")
log_files

['AssertionDLApproach_d4d63cd5f46f.log']

In [None]:
with open("/content/training_logs/"+log_files[0]) as log_file:
    print(log_file.read())

Name of the selected graph: ./tf_graphs/assertion_graph.pb
Training started, trainExamples: 8050


Epoch: 0 started, learning rate: 0.001, dataset size: 8050
Done, 654.076256498 total training loss: 52.332344, avg training loss: 0.83067214, batches: 63
Quality on test dataset: 
time to finish evaluation: 60.89s
Total test loss: 2.3790	Avg test loss: 0.3399
label	 tp	 fp	 fn	 prec	 rec	 f1
PRESENT	 186	 22	 45	 0.8942308	 0.8051948	 0.8473804
POSSIBLE	 167	 21	 25	 0.88829786	 0.8697917	 0.8789474
FUTURE	 106	 13	 17	 0.8907563	 0.86178863	 0.87603307
PAST	 285	 48	 17	 0.8558559	 0.9437086	 0.89763784
tp: 744 fp: 104 fn: 104 labels: 4
Macro-average	 prec: 0.88228524, rec: 0.87012094, f1: 0.8761609
Micro-average	 prec: 0.8773585, rec: 0.8773585, f1: 0.8773585


Epoch: 1 started, learning rate: 9.5E-4, dataset size: 8050
Done, 703.401722604 total training loss: 21.033928, avg training loss: 0.33387187, batches: 63
Quality on test dataset: 
time to finish evaluation: 65.40s
Total test los
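
The per-label and averaged figures in the log can be reproduced from the `tp`/`fp`/`fn` counts. The sketch below is plain Python rather than library code, and the function names are hypothetical. The macro F1 formula used here (harmonic mean of macro precision and macro recall) is inferred from the Epoch 0 numbers above, not taken from the library documentation.

```python
# (tp, fp, fn) per label, copied from the Epoch 0 block of the training log.
EPOCH0_COUNTS = {
    "PRESENT": (186, 22, 45),
    "POSSIBLE": (167, 21, 25),
    "FUTURE": (106, 13, 17),
    "PAST": (285, 48, 17),
}


def prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def macro_average(counts):
    """Average per-label precision and recall, then combine them into F1."""
    per_label = [prf(*c) for c in counts.values()]
    precision = sum(p for p, _, _ in per_label) / len(per_label)
    recall = sum(r for _, r, _ in per_label) / len(per_label)
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def micro_average(counts):
    """Pool the counts over all labels before computing the metrics."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return prf(tp, fp, fn)


for label, c in EPOCH0_COUNTS.items():
    print(label, ["%.4f" % value for value in prf(*c)])
print("Macro-average", ["%.4f" % value for value in macro_average(EPOCH0_COUNTS)])
print("Micro-average", ["%.4f" % value for value in micro_average(EPOCH0_COUNTS)])
```

Because every example has exactly one gold label and one predicted label, each misclassification counts once as a false positive and once as a false negative, which is why micro precision, recall and F1 all equal 744/848 in the log.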

In [None]:
assertion_test_data = spark.read.parquet('test_data.parquet')

In [None]:
preds = assertion_model.transform(assertion_test_data).select('assertion_label','assertion.result')

preds.show()

+---------------+------+
|assertion_label|result|
+---------------+------+
|           PAST|[PAST]|
|           PAST|[PAST]|
|           PAST|[PAST]|
|        PRESENT|[PAST]|
|           PAST|[PAST]|
|           PAST|[PAST]|
|           PAST|[PAST]|
|           PAST|[PAST]|
|           PAST|[PAST]|
|           PAST|[PAST]|
|           PAST|[PAST]|
|           PAST|[PAST]|
|           PAST|[PAST]|
|           PAST|[PAST]|
|           PAST|[PAST]|
|           PAST|[PAST]|
|           PAST|[PAST]|
|           PAST|[PAST]|
|           PAST|[PAST]|
|           PAST|[PAST]|
+---------------+------+
only showing top 20 rows



In [None]:
preds_df = preds.toPandas()

In [None]:
preds_df["result"] = preds_df["result"].apply(lambda x: x[0] if len(x) else pd.NA)
preds_df.dropna(inplace=True)

preds_df

    assertion_label   result
0              PAST     PAST
1              PAST     PAST
2              PAST     PAST
3           PRESENT     PAST
4              PAST     PAST
..              ...      ...
843         PRESENT  PRESENT
844         PRESENT  PRESENT
845         PRESENT  PRESENT
846         PRESENT  PRESENT


In [None]:
# We are going to use sklearn to evaluate the results on the test dataset
from sklearn.metrics import classification_report

print(classification_report(preds_df['assertion_label'], preds_df['result']))

              precision    recall  f1-score   support

      FUTURE       0.92      0.90      0.91       123
        PAST       0.91      0.94      0.93       301
    POSSIBLE       0.93      0.94      0.94       192
     PRESENT       0.93      0.88      0.91       231

    accuracy                           0.92       847
   macro avg       0.92      0.92      0.92       847
weighted avg       0.92      0.92      0.92       847



### Saving the trained model

In [None]:
assertion_model.stages

[TFGraphBuilderModel_1156e243cb5d, FINANCE-ASSERTION_DL_bc10b40a958e]

In [None]:
# Save a Spark NLP model
assertion_model.stages[-1].write().overwrite().save('Assertion')

# cd into saved dir and zip
! cd /content/Assertion ; zip -r /content/Assertion.zip *

  adding: fields/ (stored 0%)
  adding: fields/datasetParams/ (stored 0%)
  adding: fields/datasetParams/.part-00038.crc (stored 0%)
  adding: fields/datasetParams/part-00028 (deflated 27%)
  adding: fields/datasetParams/.part-00032.crc (stored 0%)
  adding: fields/datasetParams/part-00002 (deflated 27%)
  adding: fields/datasetParams/_SUCCESS (stored 0%)
  adding: fields/datasetParams/.part-00036.crc (stored 0%)
  adding: fields/datasetParams/part-00018 (deflated 26%)
  adding: fields/datasetParams/.part-00015.crc (stored 0%)
  adding: fields/datasetParams/part-00026 (deflated 27%)
  adding: fields/datasetParams/part-00011 (deflated 27%)
  adding: fields/datasetParams/.part-00029.crc (stored 0%)
  adding: fields/datasetParams/._SUCCESS.crc (stored 0%)
  adding: fields/datasetParams/part-00039 (deflated 95%)
  adding: fields/datasetParams/part-00036 (deflated 27%)
  adding: fields/datasetParams/.part-00030.crc (stored 0%)
  adding: fields/datasetParams/part-00007 (deflated 27%)
  addin
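
As an alternative to the shell `zip` call, the saved model folder can also be archived from Python with the standard library. This is a minimal sketch under stated assumptions: `zip_model_dir` is a hypothetical helper, and unlike `zip -r ... *` it also includes hidden files at the top level of the folder.

```python
import shutil
from pathlib import Path


def zip_model_dir(model_dir, archive_base):
    """Zip the contents of model_dir into '<archive_base>.zip'.

    Entries are stored relative to model_dir, so the archive root holds
    the model's own subfolders rather than the model folder itself.
    """
    model_dir = Path(model_dir)
    if not model_dir.is_dir():
        raise FileNotFoundError(f"Model folder not found: {model_dir}")
    # make_archive appends the '.zip' extension and returns the final path.
    return shutil.make_archive(str(archive_base), "zip", root_dir=str(model_dir))


# Usage in this notebook (paths as used in the cell above):
# zip_model_dir("/content/Assertion", "/content/Assertion")
```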