![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# Financial Relation Extraction (RE) and Zero-shot Relation Extraction

## Setup

In [2]:
%pip install -q tensorflow==2.7.0
%pip install -q tensorflow-addons

In [4]:
from johnsnowlabs import *

import json
import os

import numpy as np
import pandas as pd

print("Spark NLP Version :", sparknlp.version())

spark = start_spark()

Spark NLP Version : 4.2.1
📋 Loading license number 0 from /home/ubuntu/.johnsnowlabs/licenses/license_number_0_for_.json


22/10/19 11:48:20 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/10/19 11:48:21 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


👌 Launched cpu-Optimized JVM SparkSession with Jars for: 🚀Spark-NLP==4.2.1, 💊Spark-Healthcare==4.2.0, 🕶Spark-OCR==4.1.0, running on ⚡ PySpark==3.1.2


## Extract Acquisition and Subsidiary Relationships

This section demonstrates how to use Spark NLP to extract relations between the following entity pairs:

- **DATE-ORG**
- **DATE-ALIAS**
- **DATE-PRODUCT**
- **ORG-ORG**

The model retrieves acquisition and subsidiary relationships between organizations, including when an acquisition took place (**was_acquired**) and who made it (**was_acquired_by**). Subsidiaries are tagged with the relationship **is_subsidiary_of**.

In [6]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")
        
tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

ner_model = finance.NerModel.pretrained("finner_orgs_prods_alias","en","finance/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner_org")

ner_converter = nlp.NerConverter()\
    .setInputCols(["sentence","token","ner_org"])\
    .setOutputCol("ner_chunk_org")

token_classifier = nlp.DeBertaForTokenClassification.pretrained("deberta_v3_base_token_classifier_ontonotes", "en")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("ner_date")\
    .setCaseSensitive(True)\
    .setMaxSentenceLength(512) 

ner_converter_date = nlp.NerConverter()\
    .setInputCols(["sentence","token","ner_date"])\
    .setOutputCol("ner_chunk_date")\
    .setWhiteList(["DATE"])

chunk_merger = finance.ChunkMergeApproach()\
    .setInputCols("ner_chunk_org", "ner_chunk_date")\
    .setOutputCol('ner_chunk')

re_model = finance.RelationExtractionDLModel.pretrained("finre_acquisitions_subsidiaries_md", "en", "finance/models")\
    .setPredictionThreshold(0.3)\
    .setInputCols(["ner_chunk", "document"])\
    .setOutputCol("relations")

pipeline = Pipeline(stages=[
        document_assembler,
        sentence_detector,
        tokenizer,
        embeddings,
        ner_model,
        ner_converter,
        token_classifier,
        ner_converter_date,
        chunk_merger,
        re_model
        ])
empty_df = spark.createDataFrame([[""]]).toDF("text")

# Fit on an empty DataFrame; use a separate name so the RE stage variable is not overwritten.
model = pipeline.fit(empty_df)

light_model = LightPipeline(model)


Download done! Loading the resource.
[OK!]
finre_acquisitions_subsidiaries_md download started this may take some time.
Approximate size to download 383.6 MB
Download done! Loading the resource.
[OK!]


In [7]:
ner_model.getClasses()

['O', 'B-ORG', 'I-ORG', 'B-ALIAS', 'I-ALIAS', 'I-PRODUCT', 'B-PRODUCT']

### Create Generic Function to Show Relations in Dataframe

In [8]:
import pandas as pd

def get_relations_df(results, col='relations'):
    """Flatten the relation annotations of fullAnnotate results into a pandas DataFrame."""
    rel_pairs = []
    for result in results:
        for rel in result[col]:
            rel_pairs.append((
                rel.result,
                rel.metadata['entity1'],
                rel.metadata['entity1_begin'],
                rel.metadata['entity1_end'],
                rel.metadata['chunk1'],
                rel.metadata['entity2'],
                rel.metadata['entity2_begin'],
                rel.metadata['entity2_end'],
                rel.metadata['chunk2'],
                rel.metadata['confidence']
            ))
    columns = ['relation', 'entity1', 'entity1_begin', 'entity1_end', 'chunk1',
               'entity2', 'entity2_begin', 'entity2_end', 'chunk2', 'confidence']
    rel_df = pd.DataFrame(rel_pairs, columns=columns)
    return rel_df
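
The `confidence` value is copied straight from the annotation metadata, where Spark NLP stores values as strings, so it is usually safer to cast it before comparing. The sketch below shows one way to do that. It uses plain dictionaries as stand-ins for the rows of `rel_df`, with values taken from outputs shown later in this notebook. The helper name `filter_by_confidence` and the 0.9 threshold are illustrative assumptions, not part of the library.

```python
def filter_by_confidence(rows, threshold=0.9):
    """Keep relation rows whose confidence, cast to float, meets the threshold."""
    kept = []
    for row in rows:
        # Metadata values are assumed to be strings, so cast before comparing.
        if float(row["confidence"]) >= threshold:
            kept.append(row)
    return kept


# Stand-in rows shaped like the records built by get_relations_df.
rows = [
    {"relation": "was_acquired_by", "chunk1": "WhatsApp",
     "chunk2": "Facebook", "confidence": "0.97506624"},
    {"relation": "has_alias", "chunk1": "Hitachi Capital America Corp.",
     "chunk2": "Western Alliance Bank", "confidence": "0.88593066"},
]

for row in filter_by_confidence(rows):
    print(row["relation"], row["chunk1"], row["chunk2"])
```

On the real DataFrame the same idea could be written as `rel_df[rel_df["confidence"].astype(float) >= 0.9]`.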

### Getting Result with Light Pipeline

LightPipelines are Spark NLP-specific pipelines, equivalent to Spark ML Pipelines but meant for smaller amounts of data. They are useful when working with small datasets, debugging results, or running either training or prediction from an API that serves one-off requests.
A LightPipeline is a fitted Spark ML pipeline converted to run on a single machine in a multi-threaded fashion, which makes it more than 10x faster on small amounts of data (small is relative, but roughly 50k sentences is a good maximum). To use one, plug in a trained (fitted) pipeline and annotate plain text. There is no need to convert the input text to a DataFrame, even though the underlying pipeline expects a DataFrame as input. This is handy when you need predictions for a few lines of text from a trained model.
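
As a rough illustration of the plain-text workflow, the sketch below passes a list of strings to `fullAnnotate` and pairs each input with its relation triples. The helper `relations_per_text` is hypothetical, and `FakeLightPipeline` is only a stand-in so the example runs without a Spark session. With a real fitted pipeline you would pass a `LightPipeline` instance instead.

```python
from types import SimpleNamespace


def relations_per_text(light_model, texts, col="relations"):
    """Annotate a list of strings and pair each with its (label, chunk1, chunk2) triples."""
    results = light_model.fullAnnotate(texts)  # one result dict per input string
    paired = []
    for text, res in zip(texts, results):
        triples = [(r.result, r.metadata["chunk1"], r.metadata["chunk2"])
                   for r in res.get(col, [])]
        paired.append((text, triples))
    return paired


class FakeLightPipeline:
    """Hypothetical stand-in that mimics the shape of fullAnnotate output."""

    def fullAnnotate(self, texts):
        rel = SimpleNamespace(result="was_acquired_by",
                              metadata={"chunk1": "WhatsApp", "chunk2": "Facebook"})
        return [{"relations": [rel] if "acquired" in t else []} for t in texts]


demo = relations_per_text(FakeLightPipeline(),
                          ["WhatsApp was acquired by Facebook.", "Nothing to see here."])
print(demo)
```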

In [9]:
sample_text = """WhatsApp was acquired by Facebook for $19 billion in 2014. Now, WhatsApp is a subsidiary of Meta."""

In [10]:
result = light_model.fullAnnotate(sample_text)

In [11]:
rel_df = get_relations_df(result)

rel_df

| | relation | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | was_acquired_by | ORG | 0 | 7 | WhatsApp | ORG | 25 | 32 | Facebook | 0.97506624 |
| 1 | was_acquired | ORG | 0 | 7 | WhatsApp | DATE | 53 | 56 | 2014 | 0.99661773 |
| 2 | was_acquired | ORG | 25 | 32 | Facebook | DATE | 53 | 56 | 2014 | 0.99855167 |
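
The rows above can be combined into one record per acquired company. The sketch below is a minimal example under two assumptions: in a `was_acquired_by` row, `chunk1` is the acquired company and `chunk2` the acquirer (which matches the sample sentence), and a `was_acquired` date is only kept for organizations that appear as an acquisition target. The function name `summarize_acquisitions` is illustrative.

```python
def summarize_acquisitions(rows):
    """Build {target: {"acquirer": ..., "date": ...}} from relation rows."""
    targets = {}
    # First pass: record who acquired whom.
    for r in rows:
        if r["relation"] == "was_acquired_by":
            targets[r["chunk1"]] = {"acquirer": r["chunk2"], "date": None}
    # Second pass: attach dates only to known acquisition targets.
    for r in rows:
        if r["relation"] == "was_acquired" and r["chunk1"] in targets:
            targets[r["chunk1"]]["date"] = r["chunk2"]
    return targets


# Rows copied from the output above (only the fields the sketch needs).
rows = [
    {"relation": "was_acquired_by", "chunk1": "WhatsApp", "chunk2": "Facebook"},
    {"relation": "was_acquired", "chunk1": "WhatsApp", "chunk2": "2014"},
    {"relation": "was_acquired", "chunk1": "Facebook", "chunk2": "2014"},
]

facts = summarize_acquisitions(rows)
print(facts)
```

Dropping the date for Facebook is a design choice of this sketch: the model links both organizations to the date, but only WhatsApp is the acquired party in the sentence.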


### Visualization of Extracted Relations

We use the **RelationExtractionVisualizer** class of the **spark-nlp-display** library to visualize the extracted relations between the entities.

In [12]:
# from sparknlp_display import RelationExtractionVisualizer

re_vis = viz.RelationExtractionVisualizer()

re_vis.display(result = result[0],
               relation_col = "relations",
               document_col = "document",
               show_relations=True
               )

## Relation extraction between ORGS, PRODUCTS and their ALIASES

This model shows relations between ORG (Companies), PRODUCT (Products) and their ALIAS in financial documents.

The extracted relationships are as follows:

- **ORG-ALIAS**
- **PRODUCT-ALIAS**

In [13]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")
        
tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")
    
ner_model = finance.NerModel.pretrained("finner_orgs_prods_alias", "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")

pos = nlp.PerceptronModel.pretrained("pos_anc", 'en')\
    .setInputCols("sentence", "token")\
    .setOutputCol("pos")
    
dependency_parser = nlp.DependencyParserModel.pretrained("dependency_conllu", "en")\
    .setInputCols(["sentence", "pos", "token"])\
    .setOutputCol("dependencies")

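# Keep only chunk pairs of the listed entity types; each pair is its own list element.
# Note: the relation model below reads "ner_chunk", so this filtered output is only
# used if its input columns are switched to "re_ner_chunks".
re_ner_chunk_filter = finance.RENerChunksFilter()\
    .setInputCols(["ner_chunk", "dependencies"])\
    .setOutputCol("re_ner_chunks")\
    .setRelationPairs(["ORG-ALIAS", "PRODUCT-ALIAS"])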

re_model = finance.RelationExtractionDLModel.pretrained("finre_org_prod_alias", "en", "finance/models")\
    .setPredictionThreshold(0.3)\
    .setInputCols(["ner_chunk", "sentence"])\
    .setOutputCol("relations")

nlpPipeline = Pipeline(stages=[
        document_assembler,
        sentence_detector,
        tokenizer,
        embeddings,
        ner_model,
        ner_converter,
        pos,
        dependency_parser,
        re_ner_chunk_filter,
        re_model
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

# Fit on an empty DataFrame; use a separate name so the RE stage variable is not overwritten.
model = nlpPipeline.fit(empty_data)

light_model = LightPipeline(model)

sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
bert_embeddings_sec_bert_base download started this may take some time.
Approximate size to download 390.4 MB
[OK!]
finner_orgs_prods_alias download started this may take some time.
[OK!]
pos_anc download started this may take some time.
Approximate size to download 3.9 MB
Download done! Loading the resource.
[OK!]
dependency_conllu download started this may take some time.
Approximate size to download 16.7 MB
Download done! Loading the resource.
[OK!]
finre_org_prod_alias download started this may take some time.
Approximate size to download 387.3 MB
Download done! Loading the resource.
[OK!]


### Getting Result with Light Pipeline

In [14]:
sample_text = """On March 12, 2020 we closed a Loan and Security Agreement with Hitachi Capital America Corp. ("Hitachi") the terms of which are described in this report which replaced our credit facility with Western Alliance Bank."""

result = light_model.fullAnnotate(sample_text)

rel_df = get_relations_df(result)

rel_df[rel_df["relation"] != "no_rel"]

| | relation | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | has_alias | ORG | 63 | 91 | Hitachi Capital America Corp. | ALIAS | 95 | 101 | Hitachi | 0.9983972 |
| 1 | has_alias | ORG | 63 | 91 | Hitachi Capital America Corp. | ORG | 193 | 213 | Western Alliance Bank | 0.88593066 |
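
The second row links two ORG chunks, which falls outside the ORG-ALIAS and PRODUCT-ALIAS pairs this section targets. One possible way to drop such rows after extraction is to filter on the entity types of each pair. This is a sketch with a hypothetical helper, `keep_allowed_pairs`, working on plain dictionaries rather than on the library's own filtering stage.

```python
# Entity-type pairs this section is interested in.
ALLOWED_PAIRS = {("ORG", "ALIAS"), ("PRODUCT", "ALIAS")}


def keep_allowed_pairs(rows, allowed=ALLOWED_PAIRS):
    """Keep only rows whose (entity1, entity2) types form an allowed pair."""
    return [r for r in rows if (r["entity1"], r["entity2"]) in allowed]


# Rows copied from the output above (only the fields the sketch needs).
rows = [
    {"relation": "has_alias", "entity1": "ORG",
     "chunk1": "Hitachi Capital America Corp.",
     "entity2": "ALIAS", "chunk2": "Hitachi"},
    {"relation": "has_alias", "entity1": "ORG",
     "chunk1": "Hitachi Capital America Corp.",
     "entity2": "ORG", "chunk2": "Western Alliance Bank"},
]

filtered = keep_allowed_pairs(rows)
for r in filtered:
    print(r["chunk1"], "->", r["chunk2"])
```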


In [15]:
pd.DataFrame([(x.result, x.metadata["entity"]) for x in result[0]["ner_chunk"]], columns=["text", "ner"])

| | text | ner |
|---|---|---|
| 0 | Hitachi Capital America Corp. | ORG |
| 1 | Hitachi | ALIAS |
| 2 | Western Alliance Bank | ORG |


### Visualization of Extracted Relations

In [16]:
# from sparknlp_display import RelationExtractionVisualizer

re_vis = viz.RelationExtractionVisualizer()

re_vis.display(result = result[0],
               relation_col = "relations",
               document_col = "document",
               exclude_relations = ["no_rel"],
               show_relations=True
               )

## Zero Shot Relation Extraction to Extract Relations Between Financial Entities

This is a zero-shot relation extraction model: it requires no training data, only a few example statements for each relation type you are looking for.

**Keep the syntax of the relation definitions intact: each relation name maps to a list of template statements in which the entity labels appear in curly braces, for example `{PROFIT_INCREASE} to {AMOUNT}`.**
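
To make the template syntax concrete, the sketch below fills the placeholders of one template with chunk text, producing the kind of statement a zero-shot model could then score against the sentence. It illustrates the idea under that assumption and is not the library's internal code. `fill_template` is a hypothetical helper.

```python
import re

# Matches placeholders such as {AMOUNT} or {PROFIT_INCREASE}.
PLACEHOLDER = re.compile(r"\{([A-Z_]+)\}")


def fill_template(template, chunks_by_label):
    """Replace each {LABEL} with the chunk text for that label.

    Returns None when the template needs a label that has no chunk.
    """
    labels = PLACEHOLDER.findall(template)
    if any(label not in chunks_by_label for label in labels):
        return None
    return PLACEHOLDER.sub(lambda m: chunks_by_label[m.group(1)], template)


statement = fill_template(
    "{EXPENSE_DECREASE} decreased by {AMOUNT}",
    {"EXPENSE_DECREASE": "travel costs", "AMOUNT": "0.4 million"},
)
print(statement)  # travel costs decreased by 0.4 million
```

A template whose labels are not all present in a sentence simply does not apply there, which is why the placeholder names must match the NER labels exactly.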

In [17]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")
        
tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

ner_model = finance.NerModel.pretrained("finner_financial_small", "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

re_model = finance.ZeroShotRelationExtractionModel.pretrained("finre_zero_shot", "en", "finance/models")\
    .setInputCols(["ner_chunk", "sentence"]) \
    .setOutputCol("relations")\
    .setMultiLabel(False)

re_model.setRelationalCategories({
    "profit_decline_by": ["{PROFIT_DECLINE} decreased by {AMOUNT} from", "{PROFIT_DECLINE} decreased by {AMOUNT} to"],
    "profit_decline_by_per": ["{PROFIT_DECLINE} decreased by a {PERCENTAGE} from", "{PROFIT_DECLINE} decreased by a {PERCENTAGE} to"],
    "profit_decline_from": ["{PROFIT_DECLINE} decreased from {AMOUNT}", "{PROFIT_DECLINE} decreased from {AMOUNT} for the year"],
    "profit_decline_from_per": ["{PROFIT_DECLINE} decreased from {PERCENTAGE} to", "{PROFIT_DECLINE} decreased from {PERCENTAGE} to a total of"],
    "profit_decline_to": ["{PROFIT_DECLINE} to {AMOUNT}"],
    "profit_increase_from": ["{PROFIT_INCREASE} from {AMOUNT}"],
    "profit_increase_to": ["{PROFIT_INCREASE} to {AMOUNT}"],    
    "expense_decrease_by": ["{EXPENSE_DECREASE} decreased by {AMOUNT}"],
    "expense_decrease_by_per": ["{EXPENSE_DECREASE} decreased by a {PERCENTAGE}"],
    "expense_decrease_from": ["{EXPENSE_DECREASE} decreased from {AMOUNT}"],    
    "expense_decrease_to": ["{EXPENSE_DECREASE} for a total of {AMOUNT} for the fiscal year"],    
    "has_date": ["{AMOUNT} for the fiscal year ended {FISCAL_YEAR}", "{PERCENTAGE} for the fiscal year ended {FISCAL_YEAR}"]
})

pipeline = Pipeline(stages =[
                document_assembler,  
                sentence_detector,
                tokenizer, 
                embeddings,
                ner_model,
                ner_converter,
                re_model
               ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = pipeline.fit(empty_data)

light_model = LightPipeline(model)

Download done! Loading the resource.
[OK!]


In [18]:
ner_model.getClasses()

['O',
 'B-PERCENTAGE',
 'B-FISCAL_YEAR',
 'I-FISCAL_YEAR',
 'B-PROFIT_INCREASE',
 'I-EXPENSE_INCREASE',
 'B-CURRENCY',
 'B-EXPENSE_INCREASE',
 'B-EXPENSE_DECREASE',
 'I-AMOUNT',
 'I-DATE',
 'I-PROFIT_INCREASE',
 'B-AMOUNT',
 'I-PROFIT_DECLINE',
 'I-CURRENCY',
 'I-EXPENSE',
 'B-DATE',
 'I-PERCENTAGE',
 'B-EXPENSE',
 'B-PROFIT_DECLINE',
 'I-PROFIT',
 'B-PROFIT',
 'I-EXPENSE_DECREASE']
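
Since the placeholders passed to `setRelationalCategories` have to match these labels, a quick check before fitting can catch typos. The sketch below is one way to do it in plain Python. The function `unknown_placeholders` and the deliberately misspelled `typo_example` entry are illustrative assumptions, and the label list is a shortened copy of the output above.

```python
import re


def unknown_placeholders(categories, ner_classes):
    """Return {relation: [placeholder labels missing from the NER label set]}."""
    # Strip the B-/I- prefixes to get the bare entity labels.
    labels = {c.split("-", 1)[1] for c in ner_classes if "-" in c}
    problems = {}
    for relation, templates in categories.items():
        used = set()
        for template in templates:
            used.update(re.findall(r"\{([^{}]+)\}", template))
        missing = sorted(used - labels)
        if missing:
            problems[relation] = missing
    return problems


ner_classes = ["O", "B-AMOUNT", "I-AMOUNT", "B-PROFIT_INCREASE",
               "I-PROFIT_INCREASE", "B-FISCAL_YEAR", "I-FISCAL_YEAR"]

categories = {
    "profit_increase_to": ["{PROFIT_INCREASE} to {AMOUNT}"],
    "has_date": ["{AMOUNT} for the fiscal year ended {FISCAL_YEAR}"],
    "typo_example": ["{PROFIT_INCRESE} from {AMOUNT}"],  # deliberately misspelled
}

print(unknown_placeholders(categories, ner_classes))
```

In the notebook itself, `ner_classes` could come from `ner_model.getClasses()` and `categories` from the dictionary passed to the relation model.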

In [19]:
sample_text = """License fees revenue decreased 40 %, or $ 0.5 million to $ 0.7 million for the year ended December 31, 2020 compared to $ 1.2 million for the year ended December 31, 2019. Services revenue increased 4 %, or $ 1.1 million, to $ 25.6 million for the year ended December 31, 2020 from $ 24.5 million for the year ended December 31, 2019. Costs of revenue, excluding depreciation and amortization increased by $ 0.1 million, or 2 %, to $ 8.8 million for the year ended December 31, 2020 from $ 8.7 million for the year ended December 31, 2019.  Also, a decrease in travel costs of $ 0.4 million due to travel restrictions caused by the global pandemic. As a percentage of revenue, cost of revenue, excluding depreciation and amortization was 34 % for each of the years ended December 31, 2020 and 2019. Sales and marketing expenses decreased 20 %, or $ 1.5 million, to $ 6.0 million for the year ended December 31, 2020 from $ 7.5 million for the year ended December 31, 2019"""

data = spark.createDataFrame([[sample_text]]).toDF("text")

result = model.transform(data)


**NER output**

In [20]:
result.selectExpr("explode(ner_chunk) as ner").show(truncate=False)



+-------------------------------------------------------------------------------------------------------------------------+
|ner                                                                                                                      |
+-------------------------------------------------------------------------------------------------------------------------+
|{chunk, 0, 19, License fees revenue, {entity -> PROFIT_DECLINE, sentence -> 0, chunk -> 0, confidence -> 0.56216663}, []}|
|{chunk, 31, 32, 40, {entity -> PERCENTAGE, sentence -> 0, chunk -> 1, confidence -> 1.0}, []}                            |
|{chunk, 40, 40, $, {entity -> CURRENCY, sentence -> 0, chunk -> 2, confidence -> 1.0}, []}                               |
|{chunk, 42, 52, 0.5 million, {entity -> AMOUNT, sentence -> 0, chunk -> 3, confidence -> 1.0}, []}                       |
|{chunk, 57, 57, $, {entity -> CURRENCY, sentence -> 0, chunk -> 4, confidence -> 1.0}, []}                               |
|{chunk,

                                                                                

**Relations output**

In [21]:
result.selectExpr("explode(relations) as relation").show(truncate=False)



+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|relation                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
+---------------------------------------------------------------------------------------

                                                                                

### Getting Result with Light Pipeline

In [22]:
result = light_model.fullAnnotate(sample_text)

rel_df = get_relations_df(result)

rel_df[rel_df["relation"] != "no_rel"]

| | relation | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | has_date | AMOUNT | 227 | 238 | 25.6 million | FISCAL_YEAR | 316 | 332 | December 31, 2019 | 0.8744758 |
| 1 | has_date | PERCENTAGE | 31 | 32 | 40 | FISCAL_YEAR | 153 | 169 | December 31, 2019 | 0.78890353 |
| 2 | expense_decrease_from | EXPENSE_DECREASE | 799 | 826 | Sales and marketing expenses | AMOUNT | 923 | 933 | 7.5 million | 0.9770538 |
| 3 | has_date | AMOUNT | 59 | 69 | 0.7 million | FISCAL_YEAR | 90 | 106 | December 31, 2020 | 0.6718775 |
| 4 | profit_increase_to | PROFIT_INCREASE | 172 | 187 | Services revenue | AMOUNT | 227 | 238 | 25.6 million | 0.9674029 |
| 5 | has_date | PERCENTAGE | 31 | 32 | 40 | FISCAL_YEAR | 90 | 106 | December 31, 2020 | 0.7780036 |
| 6 | has_date | PERCENTAGE | 838 | 839 | 20 | FISCAL_YEAR | 898 | 914 | December 31, 2020 | 0.85455513 |
| 7 | expense_decrease_by | EXPENSE_DECREASE | 561 | 572 | travel costs | AMOUNT | 579 | 589 | 0.4 million | 0.9946776 |
| 8 | has_date | AMOUNT | 42 | 52 | 0.5 million | FISCAL_YEAR | 153 | 169 | December 31, 2019 | 0.7756693 |
| 9 | profit_increase_from | PROFIT_INCREASE | 172 | 187 | Services revenue | AMOUNT | 209 | 219 | 1.1 million | 0.96610945 |
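
The output mixes two kinds of rows: `has_date` links a value to a fiscal year, while the other relations link a financial metric to an amount. A small sketch of grouping the metric rows by their subject chunk is shown below. The helper `group_by_subject` is hypothetical and works on plain dictionaries copied from a few of the rows above.

```python
from collections import defaultdict


def group_by_subject(rows, skip=("has_date",)):
    """Group (relation, chunk2) pairs by chunk1, ignoring the skipped relations."""
    grouped = defaultdict(list)
    for r in rows:
        if r["relation"] in skip:
            continue
        grouped[r["chunk1"]].append((r["relation"], r["chunk2"]))
    return dict(grouped)


# A subset of the rows above (only the fields the sketch needs).
rows = [
    {"relation": "expense_decrease_from", "chunk1": "Sales and marketing expenses", "chunk2": "7.5 million"},
    {"relation": "profit_increase_to", "chunk1": "Services revenue", "chunk2": "25.6 million"},
    {"relation": "has_date", "chunk1": "0.7 million", "chunk2": "December 31, 2020"},
    {"relation": "expense_decrease_by", "chunk1": "travel costs", "chunk2": "0.4 million"},
    {"relation": "profit_increase_from", "chunk1": "Services revenue", "chunk2": "1.1 million"},
]

summary = group_by_subject(rows)
for subject, relations in summary.items():
    print(subject, relations)
```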


### Visualization of Extracted Relations

In [23]:
# from sparknlp_display import RelationExtractionVisualizer

re_vis = viz.RelationExtractionVisualizer()

re_vis.display(result = result[0],
               relation_col = "relations",
               document_col = "document",
               exclude_relations = ["no_rel"],
               show_relations=True
               )