
![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)





[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Finance/07.1.Training_Financial_Assertion.ipynb)

# Training Finance Assertion


# Colab Setup

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install johnsnowlabs

In [None]:
from johnsnowlabs import nlp, finance
# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM.
# Restart your notebook afterwards for the changes to take effect.
nlp.install(force_browser=True)

## Start Spark Session

In [None]:
# Automatically load license data and start a session with all jars user has access to
spark = nlp.start()

In [None]:
spark

# Data Prep

In [None]:
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings_JSL/Finance/data/assertion_fin.csv

In [None]:
import pandas as pd

training_df = pd.read_csv('/content/assertion_fin.csv')

In [None]:
training_data = spark.createDataFrame(training_df)
training_data.show()

+-------+--------------------+---------+-------+--------------------+------+---------------+
|task_id|            sentence|tkn_start|tkn_end|               chunk|entity|assertion_label|
+-------+--------------------+---------+-------+--------------------+------+---------------+
|      1|The Swedish East ...|        1|      4|Swedish East Indi...|   ORG|           PAST|
|      1|The Swedish East ...|        6|      8|Svenska Ostindisk...| ALIAS|           PAST|
|      1|The Swedish East ...|       10|     10|                SOIC| ALIAS|           PAST|
|      1|The Swedish East ...|       14|     14|          Gothenburg|   LOC|           PAST|
|      1|The Swedish East ...|       15|     15|              Sweden|   LOC|           PAST|
|      1|The Swedish East ...|       17|     17|                1731|  DATE|           PAST|
|      1|The Swedish East ...|       25|     25|               China|   LOC|           PAST|
|      1|The Swedish East ...|       28|     29|            Far East| 

In [None]:
training_data.printSchema()

root
 |-- task_id: long (nullable = true)
 |-- sentence: string (nullable = true)
 |-- tkn_start: long (nullable = true)
 |-- tkn_end: long (nullable = true)
 |-- chunk: string (nullable = true)
 |-- entity: string (nullable = true)
 |-- assertion_label: string (nullable = true)
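
The `tkn_start` and `tkn_end` columns locate each chunk by token position inside `sentence`. Judging from the rows shown above, the indices appear to be 0-based and inclusive on both ends. The sketch below illustrates that reading with plain whitespace tokenization, which is an assumption; the example sentence and the helper name `chunk_from_token_span` are made up for illustration and do not come from `assertion_fin.csv`.

```python
def chunk_from_token_span(sentence: str, tkn_start: int, tkn_end: int) -> str:
    """Join the tokens from tkn_start to tkn_end (both inclusive, 0-based)."""
    tokens = sentence.split()  # assumption: whitespace tokenization
    return " ".join(tokens[tkn_start:tkn_end + 1])


# Hypothetical sentence, not a row of the training file
example_sentence = "The Swedish East India Company was founded in Gothenburg"

print(chunk_from_token_span(example_sentence, 1, 4))  # Swedish East India Company
print(chunk_from_token_span(example_sentence, 8, 8))  # Gothenburg
```

A quick check like this on a few rows can help catch off-by-one errors in the index columns before training.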



In [None]:
%time training_data.count()

CPU times: user 8.84 ms, sys: 2.02 ms, total: 10.9 ms
Wall time: 1.31 s


8050

In [None]:
(train_data, test_data) = training_data.randomSplit([0.9, 0.1], seed = 100)
print("Training Dataset Count: " + str(train_data.count()))
print("Test Dataset Count: " + str(test_data.count()))

Training Dataset Count: 7264
Test Dataset Count: 786
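
The test split holds 786 of the 8050 rows rather than exactly 10%, because `randomSplit` samples rows according to the weights instead of cutting at a fixed index. The following is a plain-Python analogue of that idea, not Spark's implementation; the helper name `bernoulli_split` is hypothetical. It shows that the two parts are approximate in size but always add up to the full row count.

```python
import random


def bernoulli_split(rows, train_weight, seed):
    """Assign each row to train with probability train_weight, else to test."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_weight else test).append(row)
    return train, test


rows = list(range(8050))  # stand-in for the 8050 rows of the dataset
train_rows, test_rows = bernoulli_split(rows, 0.9, seed=100)

# The individual sizes depend on the seed; their sum does not
print(len(train_rows), len(test_rows), len(train_rows) + len(test_rows))
```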


In [None]:
train_data.show()

+-------+--------------------+---------+-------+--------------------+------+---------------+
|task_id|            sentence|tkn_start|tkn_end|               chunk|entity|assertion_label|
+-------+--------------------+---------+-------+--------------------+------+---------------+
|      1|"Stockholms-varve...|        6|      6|           Stockholm|   LOC|           PAST|
|      1|"The funny busine...|        5|      8|Swedish East Indi...|   ORG|           PAST|
|      1|             (1998).|        0|      0|                1998|  DATE|           PAST|
|      1|2.5 tonnes) and t...|       34|     34|              Sweden|   LOC|           PAST|
|      1|37. Gothenburg: R...|        2|      7|Royal Society of ...|   ORG|           PAST|
|      1|= Decline and fal...|       11|     11|                1806|  DATE|           PAST|
|      1|= Early attempts ...|        9|     11|  Swedish East India|   ORG|           PAST|
|      1|= Early attempts ...|       19|     19|            merchant| 

# Using BERT Embeddings

In [None]:
bert_embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en") \
  .setInputCols("document", "token") \
  .setOutputCol("embeddings")\
  .setMaxSentenceLength(512)

bert_embeddings_sec_bert_base download started this may take some time.
Approximate size to download 390.4 MB
[OK!]


In [None]:
document = nlp.DocumentAssembler()\
    .setInputCol("sentence")\
    .setOutputCol("document")

chunk = nlp.Doc2Chunk()\
    .setInputCols("document")\
    .setOutputCol("doc_chunk")\
    .setChunkCol("chunk")\
    .setStartCol("tkn_start")\
    .setStartColByTokenIndex(True)\
    .setFailOnMissing(False)\
    .setLowerCase(False)

token = nlp.Tokenizer()\
    .setInputCols(['document'])\
    .setOutputCol('token')


We run the test split through the pipeline and save it in Parquet format so that `AssertionDLApproach()` can read it through `setTestDataset()`.

In [None]:
assertion_pipeline = nlp.Pipeline(
    stages = [
    document,
    chunk,
    token,
    bert_embeddings])

assertion_test_data = assertion_pipeline.fit(test_data).transform(test_data)

In [None]:
assertion_test_data.columns

['task_id',
 'sentence',
 'tkn_start',
 'tkn_end',
 'chunk',
 'entity',
 'assertion_label',
 'document',
 'doc_chunk',
 'token',
 'embeddings']

In [None]:
assertion_test_data.write.mode('overwrite').parquet('test_data.parquet')

In [None]:
# Fit and transform only the training split, so the test rows stay out of training
assertion_train_data = assertion_pipeline.fit(train_data).transform(train_data)
assertion_train_data.write.mode('overwrite').parquet('train_data.parquet')

In [None]:
assertion_train_data.columns

['task_id',
 'sentence',
 'tkn_start',
 'tkn_end',
 'chunk',
 'entity',
 'assertion_label',
 'document',
 'doc_chunk',
 'token',
 'embeddings']

## Graph setup

In [None]:
!pip install -q tensorflow==2.7.0
!pip install -q tensorflow-addons

We will use the `TFGraphBuilder` annotator, which creates the TensorFlow graph as a stage of the model training pipeline.

`TFGraphBuilder` inspects the data and creates the proper graph if a suitable version of TensorFlow (<= 2.7) is available. The graph is stored in the defined folder and loaded by the approach.
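
Since the graph builder depends on the TensorFlow version, it can be useful to check the installed version before training. The sketch below is one possible check using only the standard library; the helper names `version_tuple` and `is_supported` are hypothetical, and the `(2, 7)` limit is taken from the sentence above rather than from the library itself.

```python
from importlib import metadata
from itertools import takewhile


def version_tuple(version: str) -> tuple:
    """Turn '2.7.0' into (2, 7), keeping only the leading digits of each part."""
    parts = []
    for piece in version.split(".")[:2]:
        digits = "".join(takewhile(str.isdigit, piece))
        parts.append(int(digits) if digits else 0)
    return tuple(parts)


def is_supported(version: str, max_supported=(2, 7)) -> bool:
    """True if major.minor is at most max_supported."""
    return version_tuple(version) <= max_supported


try:
    installed = metadata.version("tensorflow")
except metadata.PackageNotFoundError:
    installed = None

if installed is None:
    print("tensorflow is not installed")
else:
    print(installed, "supported" if is_supported(installed) else "too new")
```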

In [None]:
from johnsnowlabs import nlp, finance

In [None]:
graph_folder= "./tf_graphs"

In [None]:
assertion_graph_builder =  finance.TFGraphBuilder()\
    .setModelName("assertion_dl")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("assertion_label")\
    .setGraphFolder(graph_folder)\
    .setGraphFile("assertion_graph.pb")\
    .setMaxSequenceLength(1200)\
    .setHiddenUnitsNumber(25)

**Setting the Scope Window (Target Area) Dynamically in Assertion Status Detection Models**


The scope window parameter (set with `setScopeWindow()`) lets you train the Assertion Status Models to focus on a specific context window when resolving the status of a NER chunk. The window has the format `[X,Y]`, where `X` is the maximum number of tokens to consider to the left of the chunk and `Y` is the maximum number of tokens to consider to the right. Let’s take a look at what different windows mean:


*   By default, the window is `[-1,-1]`, which means that the Assertion Status model looks at all of the tokens in the sentence/document (up to the maximum number of tokens set in `setMaxSentLen()`).
*   `[0,0]` means “don’t pay attention to any token except the ner_chunk”, so no context is considered when resolving the assertion.
*   `[9,15]` is what empirically seems to be the best baseline: we look at up to 9 tokens to the left and 15 to the right of the NER chunk to understand the context and resolve the status.
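
The window semantics described above can be sketched in plain Python. This only illustrates the description, under the assumption that `-1` means "no limit" on that side; it is not the library's internal implementation, and the helper name `tokens_in_scope` and the example sentence are made up.

```python
def tokens_in_scope(tokens, chunk_start, chunk_end, scope_window):
    """Return the tokens visible for a chunk spanning chunk_start..chunk_end (inclusive)."""
    left, right = scope_window
    # Assumption: a negative value means the whole side is visible
    lo = 0 if left < 0 else max(0, chunk_start - left)
    hi = len(tokens) if right < 0 else min(len(tokens), chunk_end + 1 + right)
    return tokens[lo:hi]


# Hypothetical sentence; the chunk is "Gothenburg" at token index 5
tokens = "The company was founded in Gothenburg in 1731 to trade with China".split()

print(tokens_in_scope(tokens, 5, 5, [-1, -1]))  # the whole sentence
print(tokens_in_scope(tokens, 5, 5, [0, 0]))    # ['Gothenburg']
print(tokens_in_scope(tokens, 5, 5, [2, 1]))    # ['founded', 'in', 'Gothenburg', 'in']
```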


See the [Scope Window Tuning Assertion Status Detection notebook](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/2.1.Scope_window_tuning_assertion_status_detection.ipynb), which illustrates the effect of different windows and shows how to fine-tune your AssertionDLModels to get the best out of them.

In our case, the best scope window is around `[10,10]`. Note that the next cell sets `scope_window = [50, 50]`, a wider window.

In [None]:
scope_window = [50, 50]

assertionStatus = finance.AssertionDLApproach()\
    .setLabelCol("assertion_label")\
    .setInputCols("document", "doc_chunk", "embeddings")\
    .setOutputCol("assertion")\
    .setBatchSize(128)\
    .setLearningRate(0.001)\
    .setEpochs(2)\
    .setStartCol("tkn_start")\
    .setEndCol("tkn_end")\
    .setMaxSentLen(1200)\
    .setEnableOutputLogs(True)\
    .setOutputLogsPath('training_logs/')\
    .setGraphFolder(graph_folder)\
    .setGraphFile(f"{graph_folder}/assertion_graph.pb")\
    .setTestDataset(path="test_data.parquet", read_as='SPARK', options={'format': 'parquet'})\
    .setScopeWindow(scope_window)
    #.setValidationSplit(0.2)\    
    #.setDropout(0.1)\    

In [None]:
clinical_assertion_pipeline = nlp.Pipeline(
    stages = [
    #document,
    #chunk,
    #token,
    #embeddings,
    assertion_graph_builder,
    assertionStatus])

In [None]:
training_data.printSchema()

root
 |-- task_id: long (nullable = true)
 |-- sentence: string (nullable = true)
 |-- tkn_start: long (nullable = true)
 |-- tkn_end: long (nullable = true)
 |-- chunk: string (nullable = true)
 |-- entity: string (nullable = true)
 |-- assertion_label: string (nullable = true)



In [None]:
assertion_train_data = spark.read.parquet('train_data.parquet')

In [None]:
%%time
assertion_model = clinical_assertion_pipeline.fit(assertion_train_data)

Checking the results saved in the log file

In [None]:
import os

log_files = os.listdir("/content/training_logs")
log_files

In [None]:
with open("/content/training_logs/"+log_files[0]) as log_file:
    print(log_file.read())
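
`os.listdir` returns entries in arbitrary order, so `log_files[0]` is not guaranteed to be the log of the latest run when the folder holds several files. A possible alternative is to pick the most recently modified file. The sketch below assumes that modification time identifies the latest run; the helper name `newest_log` and the demo file names are hypothetical.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def newest_log(log_dir) -> Optional[Path]:
    """Return the most recently modified file in log_dir, or None if there is none."""
    directory = Path(log_dir)
    if not directory.is_dir():
        return None
    files = [p for p in directory.iterdir() if p.is_file()]
    return max(files, key=lambda p: p.stat().st_mtime, default=None)


# Demo on a throwaway folder with made-up file names and timestamps
with tempfile.TemporaryDirectory() as tmp:
    for name, mtime in [("run_a.log", 1_000), ("run_b.log", 2_000)]:
        path = Path(tmp, name)
        path.write_text(f"log for {name}\n")
        os.utime(path, (mtime, mtime))
    demo_latest = newest_log(tmp).name

print(demo_latest)  # run_b.log
```

In the notebook, the same helper could be pointed at the folder passed to `setOutputLogsPath`, for example `newest_log("/content/training_logs")`.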

In [None]:
assertion_test_data = spark.read.parquet('test_data.parquet')

In [None]:
preds = assertion_model.transform(assertion_test_data).select('assertion_label','assertion.result')

preds.show()

In [None]:
preds_df = preds.toPandas()

In [None]:
preds_df["result"] = preds_df["result"].apply(lambda x: x[0] if len(x) else pd.NA)
preds_df.dropna(inplace=True)

preds_df

In [None]:
!pip install scikit-learn

In [None]:
# We are going to use sklearn to evaluate the results on the test dataset
from sklearn.metrics import classification_report

print(classification_report(preds_df['assertion_label'], preds_df['result']))
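
To make the report easier to read, here is a minimal sketch of how per-label precision, recall, and F1 are derived from the true and predicted labels. It is a hand-rolled illustration and not sklearn's code; the helper name `per_label_metrics` is hypothetical, and the labels other than `PAST` in the toy data are assumed for the example.

```python
from collections import Counter


def per_label_metrics(y_true, y_pred):
    """Compute precision, recall, F1, and support for every label."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label was wrong
            fn[truth] += 1  # true label was missed
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        support = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / support if support else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        report[label] = {"precision": precision, "recall": recall,
                         "f1": f1, "support": support}
    return report


# Toy labels for illustration only, not model output
toy_true = ["PAST", "PAST", "PRESENT", "FUTURE"]
toy_pred = ["PAST", "PRESENT", "PRESENT", "FUTURE"]

toy_report = per_label_metrics(toy_true, toy_pred)
for label, scores in toy_report.items():
    print(label, scores)
```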

### Saving the trained model

In [None]:
assertion_model.stages

In [None]:
# Save a Spark NLP model
assertion_model.stages[-1].write().overwrite().save('Assertion')

# cd into saved dir and zip
! cd /content/Assertion ; zip -r /content/Assertion.zip *
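
Outside Colab, the shell command above may not be available. A possible pure-Python alternative is `shutil.make_archive`, sketched below on a throwaway folder; the helper name `zip_model_dir` and the demo file layout are hypothetical. One difference to keep in mind: the shell glob `*` skips hidden files at the top level of the folder, while `make_archive` includes them.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def zip_model_dir(model_dir, archive_base) -> str:
    """Zip the contents of model_dir into <archive_base>.zip and return its path."""
    return shutil.make_archive(str(archive_base), "zip", root_dir=str(model_dir))


# Demo with a made-up model folder layout
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp, "Assertion")
    (model_dir / "metadata").mkdir(parents=True)
    (model_dir / "metadata" / "part-00000").write_text("{}")

    archive_path = zip_model_dir(model_dir, Path(tmp, "Assertion"))
    archive_name = Path(archive_path).name
    with zipfile.ZipFile(archive_path) as archive:
        archive_names = archive.namelist()

print(archive_name)
print(archive_names)
```

For the model saved above, the call would be `zip_model_dir("/content/Assertion", "/content/Assertion")`.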