
![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)





[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/finance-nlp/07.1.Training_Financial_Assertion.ipynb)

#🔎 Training Finance Assertion Status

📜Let's have a look at what it takes to train your custom AssertionDL model for `negation`.

- First, make sure you have an **NER model** which retrieves those entities for you. In our case, we will use `finner_orgs_prods_alias`, which requires `bert_embeddings_sec_bert_base` embeddings.
- Second, check the embeddings the NER model is using and *reuse* them for the Assertion Model, so that you don't calculate embeddings twice.


#🎬 Installation

In [None]:
! pip install -q johnsnowlabs

##🔗 Automatic Installation
Using [my.johnsnowlabs.com](https://my.johnsnowlabs.com/) SSO

In [None]:
from johnsnowlabs import nlp, finance

# nlp.install(force_browser=True)

##🔗 Manual downloading
If you are not registered in my.johnsnowlabs.com, you received a license via email, or you are using Safari, you may need to upload the license manually.

- Go to [my.johnsnowlabs.com](https://my.johnsnowlabs.com/)
- Download your license
- Upload it using the following command

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

- Install it

In [None]:
nlp.install()

##📌 Start Spark Session

In [None]:
from johnsnowlabs import nlp, finance
# Automatically load license data and start a session with all jars user has access to
spark = nlp.start()

In [None]:
from pyspark.sql import DataFrame
import pyspark.sql.functions as F
import pyspark.sql.types as T
import pyspark.sql as SQL
from pyspark import keyword_only

#🚀 Data Prep 

In [None]:
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/finance-nlp/data/assertion_df.csv

In [None]:
import pandas as pd

training_df = pd.read_csv('./assertion_df.csv')

training_df

Unnamed: 0,text,target,label,start,end
0,CEC ENTERTAINMENT INC is not purchasing GEO GR...,CEC ENTERTAINMENT INC,negative,0,2
1,CEC ENTERTAINMENT INC is not purchasing GEO GR...,GEO GROUP INC,positive,6,8
2,BRAVE ASSET MANAGEMENT INC is paying Mondelez ...,"Mondelez International , Inc .",positive,6,10
3,BRAVE ASSET MANAGEMENT INC is paying Mondelez ...,BRAVE ASSET MANAGEMENT INC,positive,0,3
4,Compound Natural Foods Inc . is not investing ...,Compound Natural Foods Inc .,negative,0,4
...,...,...,...,...,...
93,"Cboe EDGA Exchange , Inc . is not providing Bl...","BlueStar Financial Group , Inc .",positive,9,14
94,VSOURCE INC is hiring URSTADT BIDDLE PROPERTIE...,URSTADT BIDDLE PROPERTIES INC,positive,4,7
95,VSOURCE INC is hiring URSTADT BIDDLE PROPERTIE...,VSOURCE INC,positive,0,1
96,Emergent BioSolutions Inc . is not providing C...,Emergent BioSolutions Inc .,negative,0,3


📜 The columns of the dataset are:
- `text`: your text examples;
- `target`: your NER chunk, extracted using `finner_orgs_prods_alias` in our case;
- `label`: the assertion label. In our example, we have two labels: `positive` and `negative`.
- `start`: the first token number of the chunk. You can get this information from the `begin` column in your NER model metadata.
- `end`: the last token number of the chunk. You can get this information from the `end` column in your NER model metadata.
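
As a rough illustration of how `start` and `end` relate to `text` and `target`, the sketch below locates a chunk by token position. It assumes whitespace tokenization (the texts in this dataset appear to be pre-tokenized with spaces). `find_token_span` and the example sentence are hypothetical and are not part of the library.

```python
from typing import Optional, Tuple


def find_token_span(text: str, target: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) token indices of `target` inside `text`, or None.

    Assumes whitespace tokenization; `end` is the index of the last token (inclusive).
    """
    tokens = text.split()
    target_tokens = target.split()
    n = len(target_tokens)
    if n == 0:
        return None
    # Slide a window of n tokens over the text and compare it with the target
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == target_tokens:
            return i, i + n - 1
    return None


# Hypothetical sentence in the same style as the training data
example = "ACME INC is not purchasing GLOBEX CORP"
print(find_token_span(example, "ACME INC"))     # (0, 1)
print(find_token_span(example, "GLOBEX CORP"))  # (5, 6)
```

In practice these indices would come from your NER output rather than from string matching, which can pick the wrong occurrence when a chunk appears more than once.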

###🏃‍♀️ Dataframe creation: training and test splits

In [None]:
# Create Spark Dataframe
training_data = spark.createDataFrame(training_df)
training_data.show()

+--------------------+--------------------+--------+-----+---+
|                text|              target|   label|start|end|
+--------------------+--------------------+--------+-----+---+
|CEC ENTERTAINMENT...|CEC ENTERTAINMENT...|negative|    0|  2|
|CEC ENTERTAINMENT...|       GEO GROUP INC|positive|    6|  8|
|BRAVE ASSET MANAG...|Mondelez Internat...|positive|    6| 10|
|BRAVE ASSET MANAG...|BRAVE ASSET MANAG...|positive|    0|  3|
|Compound Natural ...|Compound Natural ...|negative|    0|  4|
|Compound Natural ...|AMERICAN ELECTRIC...|positive|    9| 13|
|Marijuana Co of A...|PVM International...|positive|   10| 14|
|Marijuana Co of A...|Marijuana Co of A...|positive|    0|  6|
|NORTEK INC is not...|          NORTEK INC|negative|    0|  1|
|NORTEK INC is not...|EN2GO INTERNATION...|positive|    6|  8|
|QUALCOMM INC/DE i...| CANNAPOWDER , INC .|positive|    8| 11|
|QUALCOMM INC/DE i...|     QUALCOMM INC/DE|positive|    0|  1|
|TransDigm Group I...| TransDigm Group INC|negative|   

In [None]:
training_data.printSchema()

root
 |-- text: string (nullable = true)
 |-- target: string (nullable = true)
 |-- label: string (nullable = true)
 |-- start: long (nullable = true)
 |-- end: long (nullable = true)



In [None]:
%time 
training_data.count()

CPU times: user 3 µs, sys: 1 µs, total: 4 µs
Wall time: 7.15 µs


98

In [None]:
(train_data, test_data) = training_data.randomSplit([0.7, 0.3], seed = 100)
print("Training Dataset Count: " + str(train_data.count()))
print("Test Dataset Count: " + str(test_data.count()))

Training Dataset Count: 69
Test Dataset Count: 29


In [None]:
train_data.show()

+--------------------+--------------------+--------+-----+---+
|                text|              target|   label|start|end|
+--------------------+--------------------+--------+-----+---+
|3AM TECHNOLOGIES ...|3AM TECHNOLOGIES INC|negative|    0|  2|
|3AM TECHNOLOGIES ...|NATURAL ALTERNATI...|positive|    6|  9|
|ALEXANDRIA REAL E...|ALEXANDRIA REAL E...|negative|    0|  4|
|ATMI INC is eligi...|            ATMI INC|positive|    0|  1|
|ATMI INC is eligi...|NEAH POWER SYSTEM...|positive|    5| 10|
|Angie's List , In...|Angie's List , Inc .|negative|    0|  4|
|Angie's List , In...|        RC-1 , Inc .|positive|   11| 14|
|Artificial Intell...| APA OPTICS INC /MN/|positive|   10| 13|
|Artificial Intell...|Artificial Intell...|negative|    0|  5|
|CEC ENTERTAINMENT...|CEC ENTERTAINMENT...|negative|    0|  2|
|CEC ENTERTAINMENT...|       GEO GROUP INC|positive|    6|  8|
|DELTA APPAREL , I...| DELTA APPAREL , INC|positive|    0|  3|
|DELTA APPAREL , I...|Long-Term Stock E...|positive|   

In [None]:
test_data.show()

+--------------------+--------------------+--------+-----+---+
|                text|              target|   label|start|end|
+--------------------+--------------------+--------+-----+---+
|ALEXANDRIA REAL E...|            CDEX INC|positive|    9| 10|
|BRAVE ASSET MANAG...|BRAVE ASSET MANAG...|positive|    0|  3|
|BRAVE ASSET MANAG...|Mondelez Internat...|positive|    6| 10|
|Compound Natural ...|AMERICAN ELECTRIC...|positive|    9| 13|
|Compound Natural ...|Compound Natural ...|negative|    0|  4|
|Fundrise Income e...|MIDDLETON & CO IN...|positive|    9| 12|
|GHP Investment Ad...|GHP Investment Ad...|positive|    0|  5|
|MGP INGREDIENTS I...| MGP INGREDIENTS INC|negative|    0|  2|
|Mountain Capital ...|Mountain Capital ...|negative|    0|  5|
|Palo Alto Network...|Palo Alto Network...|negative|    0|  3|
|QUAINT OAK BANCOR...|QUAINT OAK BANCOR...|positive|    0|  3|
|QUALCOMM INC/DE i...| CANNAPOWDER , INC .|positive|    8| 11|
|QUALCOMM INC/DE i...|     QUALCOMM INC/DE|positive|   

###🔎 Using Bert Embeddings

The embeddings are calculated on your `text` column using `bert_embeddings_sec_bert_base`.

In [None]:
bert_embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en") \
  .setInputCols("document", "token") \
  .setOutputCol("embeddings")\
  .setMaxSentenceLength(512)

bert_embeddings_sec_bert_base download started this may take some time.
Approximate size to download 390.4 MB
[OK!]


In [None]:
document = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

chunk = nlp.Doc2Chunk()\
    .setInputCols("document")\
    .setOutputCol("doc_chunk")\
    .setChunkCol("target")\
    .setStartCol("start")\
    .setStartColByTokenIndex(True)\
    .setFailOnMissing(False)\
    .setLowerCase(False)

token = nlp.Tokenizer()\
    .setInputCols(['document'])\
    .setOutputCol('token')


We save the test data in parquet format to use in `AssertionDLApproach()`. 

In [None]:
assertion_pipeline = nlp.Pipeline(
    stages = [
    document,
    chunk,
    token,
    bert_embeddings])

assertion_test_data = assertion_pipeline.fit(test_data).transform(test_data)

assertion_test_data.write.mode('overwrite').parquet('test_data.parquet')

In [None]:
assertion_test_data.columns

['text',
 'target',
 'label',
 'start',
 'end',
 'document',
 'doc_chunk',
 'token',
 'embeddings']

In [None]:
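# Note: this transforms the full `training_data` (98 rows), which also contains the test rows;
# pass `train_data` instead to keep the train and test splits disjoint.
assertion_train_data = assertion_pipeline.fit(training_data).transform(training_data)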

assertion_train_data.write.mode('overwrite').parquet('train_data.parquet')

In [None]:
assertion_train_data.columns

['text',
 'target',
 'label',
 'start',
 'end',
 'document',
 'doc_chunk',
 'token',
 'embeddings']

##🔎 Graph setup

In [None]:
! pip install -q tensorflow==2.7.0
! pip install -q tensorflow-addons


We will use the `TFGraphBuilder` annotator, which creates the graph as part of the model training pipeline.

`TFGraphBuilder` inspects the data and creates the proper graph if a suitable version of TensorFlow (<= 2.7) is available. The graph is stored in the defined folder and loaded by the approach.
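
The training output further down reports the build parameters `{'n_classes': 2, 'feat_size': 768, 'max_seq_len': 1200, 'n_hidden': 25}`. The sketch below shows, in plain Python, one plausible reading of where such values come from: the number of distinct labels, the embedding vector size, and the two values set explicitly on the builder. `infer_build_params` is a hypothetical helper for illustration only, not the `TFGraphBuilder` implementation.

```python
def infer_build_params(labels, embedding_dim, max_seq_len, n_hidden):
    """Hypothetical helper: assemble graph build parameters from data and settings."""
    return {
        "n_classes": len(set(labels)),  # distinct assertion labels in the data
        "feat_size": embedding_dim,     # size of each token embedding vector
        "max_seq_len": max_seq_len,     # value passed to setMaxSequenceLength
        "n_hidden": n_hidden,           # value passed to setHiddenUnitsNumber
    }


params = infer_build_params(
    labels=["positive", "negative", "positive"],
    embedding_dim=768,  # feat_size reported in the training output
    max_seq_len=1200,
    n_hidden=25,
)
print(params)
```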

In [None]:
graph_folder= "./tf_graphs"

In [None]:
assertion_graph_builder =  finance.TFGraphBuilder()\
    .setModelName("assertion_dl")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label")\
    .setGraphFolder(graph_folder)\
    .setGraphFile("assertion_graph.pb")\
    .setMaxSequenceLength(1200)\
    .setHiddenUnitsNumber(25)

📜**Setting the Scope Window (Target Area) Dynamically in Assertion Status Detection Models**


The scope window, set with `setScopeWindow()`, lets you train Assertion Status models to focus on a specific context window when resolving the status of an NER chunk. The window has the format `[X,Y]`, where `X` is the maximum number of tokens to consider to the left of the chunk and `Y` is the maximum number of tokens to consider to the right. Let's take a look at what different windows mean:


*   By default, the window is `[-1,-1]`, which means that the Assertion Status model looks at all of the tokens in the sentence/document (up to the maximum number of tokens set in `setMaxSentLen()`).
*   `[0,0]` means “don’t pay attention to any token except the ner_chunk”, which amounts to not considering any context when resolving the assertion.
*   `[9,15]` is what empirically seems to be the best baseline, meaning that we look at up to 9 tokens on the left and 15 on the right of the NER chunk to understand the context and resolve the status.


Check this [Scope Window Tuning Assertion Status Detection notebook](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/2.1.Scope_window_tuning_assertion_status_detection.ipynb), which illustrates the effect of different windows and how to fine-tune your AssertionDLModels to get the most out of them.

In our case, the best scope window is around `[10,10]`.
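
To make the window semantics concrete, here is a minimal plain-Python sketch of which tokens a given `[X,Y]` window would cover around a chunk. It only illustrates the description above and is not how the library implements it. `scope_window_tokens` and the example sentence are hypothetical, and treating `-1` as "no limit" on each side independently is an assumption.

```python
def scope_window_tokens(tokens, start, end, window):
    """Return the tokens that fall inside the scope window around a chunk.

    `start`/`end` are inclusive token indices of the chunk; `window` is [left, right].
    A value of -1 on a side is treated as "no limit" on that side (assumption).
    """
    left, right = window
    lo = 0 if left < 0 else max(0, start - left)
    hi = len(tokens) if right < 0 else min(len(tokens), end + 1 + right)
    return tokens[lo:hi]


# Hypothetical sentence; the chunk "GLOBEX CORP" spans tokens 5..6
tokens = "ACME INC is not purchasing GLOBEX CORP".split()
print(scope_window_tokens(tokens, 5, 6, [-1, -1]))  # whole sentence
print(scope_window_tokens(tokens, 5, 6, [0, 0]))    # only the chunk
print(scope_window_tokens(tokens, 5, 6, [2, 0]))    # two tokens of left context
```

With `[2, 0]` the window already reaches the word `not`, which is the kind of cue a negation model needs; with `[0, 0]` that cue is out of scope.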

In [None]:
scope_window = [50, 50]

assertionStatus = finance.AssertionDLApproach()\
    .setLabelCol("label")\
    .setInputCols("document", "doc_chunk", "embeddings")\
    .setOutputCol("assertion")\
    .setBatchSize(128)\
    .setLearningRate(0.001)\
    .setEpochs(2)\
    .setStartCol("start")\
    .setEndCol("end")\
    .setMaxSentLen(1200)\
    .setEnableOutputLogs(True)\
    .setOutputLogsPath('training_logs/')\
    .setGraphFolder(graph_folder)\
    .setGraphFile(f"{graph_folder}/assertion_graph.pb")\
    .setTestDataset(path="test_data.parquet", read_as='SPARK', options={'format': 'parquet'})\
    .setScopeWindow(scope_window)
    #.setValidationSplit(0.2)\    
    #.setDropout(0.1)\    

In [None]:
assertion_pipeline = nlp.Pipeline(
    stages = [
    assertion_graph_builder,
    assertionStatus])

In [None]:
training_data.printSchema()

root
 |-- text: string (nullable = true)
 |-- target: string (nullable = true)
 |-- label: string (nullable = true)
 |-- start: long (nullable = true)
 |-- end: long (nullable = true)



In [None]:
assertion_train_data = spark.read.parquet('train_data.parquet')

In [None]:
assertion_train_data.groupBy('label').count().show()

+--------+-----+
|   label|count|
+--------+-----+
|positive|   71|
|negative|   27|
+--------+-----+



In [None]:
%%time
assertion_model = assertion_pipeline.fit(assertion_train_data)

TF Graph Builder configuration:
Model name: assertion_dl
Graph folder: ./tf_graphs
Graph file name: assertion_graph.pb
Build params: {'n_classes': 2, 'feat_size': 768, 'max_seq_len': 1200, 'n_hidden': 25}


Instructions for updating:
non-resource variables are not supported in the long term


Device mapping: no known devices.


Instructions for updating:
Please use `keras.layers.Bidirectional(keras.layers.RNN(cell))`, which is equivalent to this API
Instructions for updating:
Please use `keras.layers.RNN(cell)`, which is equivalent to this API
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor


Device mapping: no known devices.
assertion_dl graph exported to ./tf_graphs/assertion_graph.pb
CPU times: user 11.1 s, sys: 691 ms, total: 11.8 s
Wall time: 55.5 s


Checking the results saved in the log file

In [None]:
import os

log_files = os.listdir("./training_logs")
log_files

['AssertionDLApproach_1288a0afc47a.log']

In [None]:
with open("./training_logs/"+log_files[0]) as log_file:
    print(log_file.read())

Name of the selected graph: ./tf_graphs/assertion_graph.pb
Training started, trainExamples: 98


Epoch: 0 started, learning rate: 0.001, dataset size: 98
Done, 21.154691869 total training loss: 2.5376515, avg training loss: 2.5376515, batches: 1
Quality on test dataset: 
time to finish evaluation: 2.22s
Total test loss: 2.1600	Avg test loss: 2.1600
label	 tp	 fp	 fn	 prec	 rec	 f1
negative	 10	 19	 0	 0.3448276	 1.0	 0.5128205
positive	 0	 0	 19	 0.0	 0.0	 0.0
tp: 10 fp: 19 fn: 19 labels: 2
Macro-average	 prec: 0.1724138, rec: 0.5, f1: 0.25641024
Micro-average	 prec: 0.3448276, rec: 0.3448276, f1: 0.3448276


Epoch: 1 started, learning rate: 9.5E-4, dataset size: 98
Done, 6.591055361 total training loss: 2.3828928, avg training loss: 2.3828928, batches: 1
Quality on test dataset: 
time to finish evaluation: 1.69s
Total test loss: 2.0273	Avg test loss: 2.0273
label	 tp	 fp	 fn	 prec	 rec	 f1
negative	 10	 19	 0	 0.3448276	 1.0	 0.5128205
positive	 0	 0	 19	 0.0	 0.0	 0.0
tp: 10 fp: 19 f
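
The per-label and averaged scores in the log can be reproduced from the `tp`/`fp`/`fn` counts. The sketch below uses the standard definitions of precision, recall and F1. Whether the library computes macro F1 as the mean of per-label F1 is an assumption, although it yields the logged value here.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts (0.0 when a denominator is zero)."""
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1


# (tp, fp, fn) counts copied from the epoch 0 log
counts = {"negative": (10, 19, 0), "positive": (0, 0, 19)}
per_label = {label: prf(*c) for label, c in counts.items()}

# Macro-average: unweighted mean of the per-label scores
macro = tuple(sum(m[i] for m in per_label.values()) / len(per_label) for i in range(3))

# Micro-average: pool the counts first, then compute the scores once
micro = prf(*(sum(c[i] for c in counts.values()) for i in range(3)))

for label, (p, r, f) in per_label.items():
    print(f"{label}: prec={p:.4f} rec={r:.4f} f1={f:.4f}")
print("macro: prec=%.4f rec=%.4f f1=%.4f" % macro)
print("micro: prec=%.4f rec=%.4f f1=%.4f" % micro)
```

The counts show that every test example was predicted as `negative`, which is why recall is 1.0 for `negative` and 0.0 for `positive`.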

In [None]:
assertion_test_data = spark.read.parquet('test_data.parquet')

In [None]:
preds = assertion_model.transform(assertion_test_data).select('label','assertion.result')

preds.show()

+--------+----------+
|   label|    result|
+--------+----------+
|positive|[negative]|
|positive|[negative]|
|positive|[negative]|
|positive|[negative]|
|negative|[negative]|
|positive|[negative]|
|positive|[negative]|
|negative|[negative]|
|negative|[negative]|
|negative|[negative]|
|positive|[negative]|
|positive|[negative]|
|positive|[negative]|
|negative|[negative]|
|positive|[negative]|
|positive|[negative]|
|negative|[negative]|
|positive|[negative]|
|positive|[negative]|
|positive|[negative]|
+--------+----------+
only showing top 20 rows



In [None]:
preds_df = preds.toPandas()

In [None]:
preds_df["result"] = preds_df["result"].apply(lambda x: x[0] if len(x) else pd.NA)
preds_df.dropna(inplace=True)

preds_df

Unnamed: 0,label,result
0,positive,negative
1,positive,negative
2,positive,negative
3,positive,negative
4,negative,negative
5,positive,negative
6,positive,negative
7,negative,negative
8,negative,negative
9,negative,negative


In [None]:
# We are going to use sklearn to evaluate the results on the test dataset
from sklearn.metrics import classification_report

print(classification_report(preds_df['label'], preds_df['result']))

              precision    recall  f1-score   support

    negative       0.34      1.00      0.51        10
    positive       0.00      0.00      0.00        19

    accuracy                           0.34        29
   macro avg       0.17      0.50      0.26        29
weighted avg       0.12      0.34      0.18        29



###✔️ Saving the trained model

In [None]:
assertion_model.stages

[TFGraphBuilderModel_a389b6a16cae, FINANCE-ASSERTION_DL_13ea29236849]

In [None]:
# Save a Spark NLP model
assertion_model.stages[-1].write().overwrite().save('Assertion')

# cd into saved dir and zip
! cd /content/Assertion ; zip -r /content/Assertion.zip *

  adding: fields/ (stored 0%)
  adding: fields/datasetParams/ (stored 0%)
  adding: fields/datasetParams/_SUCCESS (stored 0%)
  adding: fields/datasetParams/.part-00000.crc (stored 0%)
  adding: fields/datasetParams/part-00001 (deflated 95%)
  adding: fields/datasetParams/part-00000 (deflated 27%)
  adding: fields/datasetParams/.part-00001.crc (deflated 44%)
  adding: fields/datasetParams/._SUCCESS.crc (stored 0%)
  adding: metadata/ (stored 0%)
  adding: metadata/_SUCCESS (stored 0%)
  adding: metadata/.part-00000.crc (stored 0%)
  adding: metadata/part-00000 (deflated 38%)
  adding: metadata/._SUCCESS.crc (stored 0%)
  adding: tensorflow (deflated 39%)


###✔️ Testing the model

The model was trained on very little data, since it was created as a playground for the certification trainings to run quickly. So don't expect high performance (for that, you have a pretrained version used earlier in this notebook).

In [None]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

text_splitter = finance.TextSplitter() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner = finance.NerModel.pretrained("finner_orgs_prods_alias","en","finance/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = nlp.NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

finassertion = finance.AssertionDLModel.load("Assertion")\
    .setInputCols(["sentence", "ner_chunk", "embeddings"])\
    .setOutputCol("finlabel")

pipe = nlp.Pipeline(stages = [ document_assembler, text_splitter, tokenizer, embeddings, ner, ner_converter, finassertion])

bert_embeddings_sec_bert_base download started this may take some time.
Approximate size to download 390.4 MB
[OK!]
finner_orgs_prods_alias download started this may take some time.
[OK!]


In [None]:
text = "Gradio INC will enter into a joint agreement with Hugging Face, Inc."

In [None]:
sdf = spark.createDataFrame([[text]]).toDF("text")
res = pipe.fit(sdf).transform(sdf)

In [None]:
import pyspark.sql.functions as F
res.select(F.explode(F.arrays_zip(res.ner_chunk.result, 
                                  res.finlabel.result)).alias("cols"))\
                  .select(F.expr("cols['0']").alias("ner_chunk"),
                          F.expr("cols['1']").alias("assertion")).show(200, truncate=100)

+-----------------+---------+
|        ner_chunk|assertion|
+-----------------+---------+
|       Gradio INC| negative|
|Hugging Face, Inc| negative|
+-----------------+---------+
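
The `arrays_zip` plus `explode` combination in the cell above pairs each NER chunk with the assertion at the same position and emits one row per pair. A plain-Python analogue of that reshaping, using the values from the output, might look like this (a sketch of the idea only, it does not use Spark):

```python
# Values copied from the output: one row holding two parallel arrays
row = {
    "ner_chunk": ["Gradio INC", "Hugging Face, Inc"],
    "finlabel": ["negative", "negative"],
}

# arrays_zip pairs elements by position; explode then emits one output row per pair
pairs = [
    {"ner_chunk": chunk, "assertion": label}
    for chunk, label in zip(row["ner_chunk"], row["finlabel"])
]

for pair in pairs:
    print(pair)
```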

