![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

In [0]:
from johnsnowlabs import nlp, finance, viz

##🔎  Pretrained Relation Extraction Models for Finance

Here is the list of pretrained Relation Extraction models:

📜**Relation Extraction Models**

|index|model|
|-----:|:-----|
| 1| [Financial Relation Extraction on Earning Calls (Small)](https://nlp.johnsnowlabs.com/2022/11/28/finre_earning_calls_sm_en.html)  | 
| 2| [Financial Relation Extraction on 10K filings (Small)](https://nlp.johnsnowlabs.com/2022/11/07/finre_financial_small_en.html)  | 
| 3| [Financial Relation Extraction (Tickers)](https://nlp.johnsnowlabs.com/2022/10/15/finre_has_ticker_en.html)  |
| 4| [Financial Relation Extraction (Acquisitions / Subsidiaries)](https://nlp.johnsnowlabs.com/2022/11/08/finre_acquisitions_subsidiaries_md_en.html)  | 
| 5| [Financial Relation Extraction (Work Experience, Medium)](https://nlp.johnsnowlabs.com/2022/11/08/finre_work_experience_md_en.html)  |
| 6| [Financial Relation Extraction (Work Experience, Small)](https://nlp.johnsnowlabs.com/2022/09/28/finre_work_experience_en.html)  | 
| 7| [Financial Relation Extraction (Alias)](https://nlp.johnsnowlabs.com/2022/08/17/finre_org_prod_alias_en_3_2.html)  |
| 8| [Financial Zero-shot Relation Extraction](https://nlp.johnsnowlabs.com/2022/08/22/finre_zero_shot_en_3_2.html)  |

**These components are common to all the pipelines we will use.**

In [0]:
def get_generic_base_pipeline():
  """Common components used in all pipelines"""
  document_assembler = nlp.DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("document")

  text_splitter = finance.TextSplitter()\
      .setInputCols(["document"])\
      .setOutputCol("sentence")
  
  tokenizer = nlp.Tokenizer()\
      .setInputCols(["sentence"])\
      .setOutputCol("token")

  embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
      .setInputCols(["sentence", "token"])\
      .setOutputCol("embeddings")

  base_pipeline = nlp.Pipeline(stages=[
      document_assembler,
      text_splitter,
      tokenizer,
      embeddings
  ])

  return base_pipeline
    
generic_base_pipeline = get_generic_base_pipeline()

In [0]:
# Text Classifier
def get_text_classification_pipeline(model):
  """This pipeline allows you to use different classification models to understand if an input text is of a specific class or is something else.
  It will be used to check where the first summary page of SEC10K is, where the sections of Acquisitions and Subsidiaries are, or where in the document
  the management roles and experiences are mentioned"""
  document_assembler = nlp.DocumentAssembler() \
       .setInputCol("text") \
       .setOutputCol("document")

  embeddings = nlp.UniversalSentenceEncoder.pretrained() \
      .setInputCols("document") \
      .setOutputCol("sentence_embeddings")

  classifier = nlp.ClassifierDLModel.pretrained(model, "en", "finance/models")\
      .setInputCols(["sentence_embeddings"])\
      .setOutputCol("category")

  nlpPipeline = nlp.Pipeline(stages=[
      document_assembler, 
      embeddings,
      classifier])
  
  return nlpPipeline

In [0]:
import pandas as pd

def get_relations_df(results, col='relations'):
  """Returns a pandas DataFrame with the relations extracted by Spark NLP"""
  rel_pairs = []
  # Each relation annotation carries its label in `result` and the entity details in `metadata`
  for rel in results[0][col]:
    rel_pairs.append((
        rel.result,
        rel.metadata['entity1'],
        rel.metadata['entity1_begin'],
        rel.metadata['entity1_end'],
        rel.metadata['chunk1'],
        rel.metadata['entity2'],
        rel.metadata['entity2_begin'],
        rel.metadata['entity2_end'],
        rel.metadata['chunk2'],
        rel.metadata['confidence']
    ))

  rel_df = pd.DataFrame(rel_pairs, columns=['relation','entity1','entity1_begin','entity1_end','chunk1','entity2','entity2_begin','entity2_end','chunk2', 'confidence'])

  return rel_df

##🔎  Zero-shot Relation Extraction to Extract Relations Between Financial Entities

Suppose we now want to extract relations between `PROFIT_DECLINE`, `PROFIT_INCREASE`, `EXPENSE_DECREASE`, `AMOUNT` and `PERCENTAGE` entities, but we don't have a model trained to do that.

That's where Zero-shot RE comes into play. You can use Zero-shot RE **without training data** and **without training a dedicated relation extraction model**: you only describe the relations you want to extract.


📜At John Snow Labs, we have developed our own annotators based on **Natural Language Inference (NLI)**, not only to carry out Question Answering, but also to use QA to:
- Retrieve Entities, also known as Zero-shot NER;
- Retrieve Relations, also known as Zero-shot Relation Extraction.

##🔎  A variation of NLI for Zero-shot Relation Extraction
Like Zero-shot NER, Zero-shot RE works with `H` (hypotheses) and `P` (premises): a relation is returned as a positive result only if the `H` is `entailed` by the `P`.

📜In this case, what we do is:
- We take a prompt in the form of {ENT_1} [some_text] {ENT_2}
- ENT_1 is filled with entities from a previous NER
- ENT_2 is filled the same way.
- We ask the ZeroShotRE model whether, given the whole text as the premise, the hypothesis {ENT_1} [some_text] {ENT_2} is entailed.

For example, `ENT_1` is `REVENUE`. `ENT_2` is `PERCENTAGE`. `[some_text]` is `decreased`.

Given the premise `License fees revenue decreased 40 %`, the filled-in prompt is `entailed`, so the relation is returned as a positive result.

However, `License fees revenue increased 40 %` would not return an entailment, so the relation will not be triggered.
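The prompt-filling step can be pictured with a few lines of plain Python. This is only a sketch of the idea, not the library's internal implementation: `fill_template` and the `chunks` dictionary are hypothetical names, and the entity labels and chunks are taken from the example used in this notebook. Deciding whether each generated hypothesis is entailed is left to the NLI model and is not shown here.

```python
import re
from itertools import product

def fill_template(template, chunks_by_label):
    """Expand one '{LABEL_1} some_text {LABEL_2}' template into candidate hypotheses."""
    # Slot names in the order they appear in the template
    labels = re.findall(r"\{(\w+)\}", template)
    # One list of candidate chunks per slot; a label with no chunks yields no hypotheses
    slots = [chunks_by_label.get(label, []) for label in labels]
    hypotheses = []
    for combo in product(*slots):
        text = template
        for label, chunk in zip(labels, combo):
            text = text.replace("{" + label + "}", chunk, 1)
        hypotheses.append(text)
    return hypotheses

# Hypothetical NER output, grouped by entity label
chunks = {
    "PROFIT_DECLINE": ["License fees revenue"],
    "PERCENTAGE": ["40"],
    "AMOUNT": ["0.5 million", "0.7 million", "1.2 million"],
}

print(fill_template("{PROFIT_DECLINE} decrease {PERCENTAGE}", chunks))
print(fill_template("{PROFIT_DECLINE} decrease to {AMOUNT}", chunks))
```

Each template produces one hypothesis per combination of matching entity chunks, which is why a single sentence with several amounts can yield several candidate relations.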

##📌 Example

First, we use the `finner_financial_small` model to extract `PROFIT_DECLINE`, `PROFIT_INCREASE`, `EXPENSE_DECREASE`, `AMOUNT` and `PERCENTAGE` entities. After that, we define relations between these entities, paying attention to the syntax.

📜For example, given the text P, `License fees revenue decreased 40 %, or $ 0.5 million to $ 0.7 million for the year ended December 31, 2020 compared to $ 1.2 million for the year ended December 31, 2019.` we:

- Generate Hypotheses H with the tokens of the text
  - License fees revenue increase 40: `contradiction`
  - License fees revenue decrease 40: `entailment`
  - License fees revenue decrease 1.2 million: `entailment`
  - License fees revenue decrease 0.7 million: `entailment`
  - License fees revenue decrease 0.5 million: `entailment`

- We check every H against P to see whether it is `entailed`. If so, we return it as a relation between the entities.
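The filtering step can be illustrated with the hypotheses listed above. In this sketch the labels are copied by hand from that list; in a real run they would come from the NLI model, and the variable names are purely illustrative.

```python
# (hypothesis, label) pairs copied from the worked example above.
# In a real pipeline the label is predicted by the NLI model, not hard-coded.
hypotheses = [
    ("License fees revenue increase 40", "contradiction"),
    ("License fees revenue decrease 40", "entailment"),
    ("License fees revenue decrease 1.2 million", "entailment"),
    ("License fees revenue decrease 0.7 million", "entailment"),
    ("License fees revenue decrease 0.5 million", "entailment"),
]

# Only entailed hypotheses become relations; everything else is discarded
kept = [h for h, label in hypotheses if label == "entailment"]
rejected = [h for h, label in hypotheses if label != "entailment"]

print("relations:", kept)
print("discarded:", rejected)
```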


🚀**!!! Make sure you keep the proper syntax of the relations you want to extract !!!**

In [0]:
ner_model = finance.NerModel.pretrained("finner_financial_small", "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

re_model = finance.ZeroShotRelationExtractionModel.pretrained("finre_zero_shot", "en", "finance/models")\
    .setInputCols(["ner_chunk", "sentence"]) \
    .setOutputCol("relations")\
    .setMultiLabel(False)

re_model.setRelationalCategories({
    "DECREASE": ["{PROFIT_DECLINE} decrease to {AMOUNT}", "{PROFIT_DECLINE} decrease {PERCENTAGE}", "{EXPENSE_DECREASE} decrease {AMOUNT}", "{EXPENSE_DECREASE} decrease {PERCENTAGE}"],
    "INCREASE": ["{PROFIT_INCREASE} increase to {AMOUNT}", "{PROFIT_INCREASE} increase {PERCENTAGE}"],
})

pipeline = nlp.Pipeline(stages =[
                generic_base_pipeline,
                ner_model,
                ner_converter,
                re_model
               ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = pipeline.fit(empty_data)

light_model = nlp.LightPipeline(model)

In [0]:
sample_text = """License fees revenue decreased 40 %, or $ 0.5 million to $ 0.7 million for the year ended December 31, 2020 compared to $ 1.2 million for the year ended December 31, 2019. Services revenue increased 4 %, or $ 1.1 million, to $ 25.6 million for the year ended December 31, 2020 from $ 24.5 million for the year ended December 31, 2019. Sales and marketing expenses decreased 20 %, or $ 1.5 million, to $ 6.0 million for the year ended December 31, 2020 from $ 7.5 million for the year ended December 31, 2019."""

sample_text


In [0]:
data = spark.createDataFrame([[sample_text]]).toDF("text")

result = model.transform(data)

###🖨️ Get Results

**NER output**

In [0]:
result.selectExpr("explode(ner_chunk) as ner").show(50, truncate=False)

**Relations output**

In [0]:
light_result = light_model.fullAnnotate(sample_text)

rel_df = get_relations_df(light_result)

rel_df[rel_df["relation"] != "no_rel"]

|index|relation|entity1|entity1_begin|entity1_end|chunk1|entity2|entity2_begin|entity2_end|chunk2|confidence|
|-----:|:-----|:-----|-----:|-----:|:-----|:-----|-----:|-----:|:-----|-----:|
| 0| INCREASE| PROFIT_INCREASE| 172| 187| Services revenue| AMOUNT| 227| 238| 25.6 million| 0.97882193|
| 1| DECREASE| EXPENSE_DECREASE| 335| 362| Sales and marketing expenses| PERCENTAGE| 374| 375| 20| 0.9928862|
| 2| DECREASE| EXPENSE_DECREASE| 335| 362| Sales and marketing expenses| AMOUNT| 385| 395| 1.5 million| 0.9894141|
| 3| INCREASE| PROFIT_INCREASE| 172| 187| Services revenue| AMOUNT| 209| 219| 1.1 million| 0.9759809|
| 4| DECREASE| EXPENSE_DECREASE| 335| 362| Sales and marketing expenses| AMOUNT| 459| 469| 7.5 million| 0.9819981|
| 5| DECREASE| EXPENSE_DECREASE| 335| 362| Sales and marketing expenses| AMOUNT| 403| 413| 6.0 million| 0.9839978|
| 6| INCREASE| PROFIT_INCREASE| 172| 187| Services revenue| AMOUNT| 284| 295| 24.5 million| 0.9391607|
| 7| DECREASE| PROFIT_DECLINE| 0| 19| License fees revenue| PERCENTAGE| 31| 32| 40| 0.9931541|
| 8| DECREASE| PROFIT_DECLINE| 0| 19| License fees revenue| AMOUNT| 122| 132| 1.2 million| 0.7389867|
| 9| DECREASE| PROFIT_DECLINE| 0| 19| License fees revenue| AMOUNT| 59| 69| 0.7 million| 0.9894014|
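Confidence scores vary across the rows above, so it can be useful to keep only the relations above a threshold. The sketch below uses plain Python on a few rows copied from the output. It assumes the confidence values arrive as strings (Spark NLP annotation metadata values are typically strings), so they are converted with `float` before comparing. `filter_by_confidence` and the `0.9` threshold are illustrative choices, not part of the library.

```python
# A few rows copied from the relations output above; confidence kept as text
rows = [
    {"relation": "INCREASE", "chunk1": "Services revenue", "chunk2": "24.5 million", "confidence": "0.9391607"},
    {"relation": "DECREASE", "chunk1": "License fees revenue", "chunk2": "40", "confidence": "0.9931541"},
    {"relation": "DECREASE", "chunk1": "License fees revenue", "chunk2": "1.2 million", "confidence": "0.7389867"},
    {"relation": "DECREASE", "chunk1": "License fees revenue", "chunk2": "0.7 million", "confidence": "0.9894014"},
]

def filter_by_confidence(rows, threshold):
    """Keep rows whose confidence, parsed as a float, is at least `threshold`."""
    return [row for row in rows if float(row["confidence"]) >= threshold]

confident = filter_by_confidence(rows, 0.9)
for row in confident:
    print(row["relation"], "|", row["chunk1"], "->", row["chunk2"], "|", row["confidence"])
```

With the pandas DataFrame from `get_relations_df`, the same idea would be `rel_df[rel_df["confidence"].astype(float) >= 0.9]`.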


In [0]:
# relations output
result.selectExpr("explode(relations) as relation").show(truncate=False)
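Judging by the relations table above, the `entity*_begin` and `entity*_end` values are character offsets into the input text, with the end offset inclusive. The sketch below checks that reading against the first sentence of `sample_text`; `chunk_at` is a hypothetical helper and the offsets are copied from the table.

```python
# Start of the sample text used in this notebook
text = "License fees revenue decreased 40 %, or $ 0.5 million to $ 0.7 million for the year ended December 31, 2020"

def chunk_at(text, begin, end):
    """Return the substring for inclusive character offsets [begin, end]."""
    return text[begin:end + 1]

# Offsets copied from the relations table: (begin, end, expected chunk)
for begin, end, expected in [(0, 19, "License fees revenue"), (31, 32, "40"), (59, 69, "0.7 million")]:
    print(begin, end, repr(chunk_at(text, begin, end)), chunk_at(text, begin, end) == expected)
```

Because the end offset is inclusive, a plain Python slice needs `end + 1` to recover the full chunk.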

###🚀 Visualize Results

In [0]:
# from sparknlp_display import RelationExtractionVisualizer

re_vis = viz.RelationExtractionVisualizer()

vis = re_vis.display(result = light_result[0],
               relation_col = "relations",
               document_col = "document",
               exclude_relations = ["no_rel"],
               show_relations=True,
               return_html=True
               )

displayHTML(vis)