![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

In [0]:
from johnsnowlabs import nlp, finance, viz

#🔎 Financial Relation Extraction (RE)

Financial relation extraction is the process of automatically extracting structured information from unstructured text about finance and economics. It relies on natural language processing (NLP) techniques: named entity recognition finds the entities in the text, and relation extraction determines how those entities are connected.

Some examples of financial relation extraction include extracting information about companies and their financial performance, such as revenue, profits, and debt, as well as information about financial markets and economic indicators, such as stock prices and exchange rates.

##✔  Pretrained Relation Extraction Models for Finance

📚Here is the list of pretrained Relation Extraction models:

**Relation Extraction Models**

|index|model|
|-----:|:-----|
| 1| [Financial Relation Extraction on Earning Calls (Small)](https://nlp.johnsnowlabs.com/2022/11/28/finre_earning_calls_sm_en.html)  | 
| 2| [Financial Relation Extraction on 10K filings (Small)](https://nlp.johnsnowlabs.com/2022/11/07/finre_financial_small_en.html)  | 
| 3| [Financial Relation Extraction (Tickers)](https://nlp.johnsnowlabs.com/2022/10/15/finre_has_ticker_en.html)  |
| 4| [Financial Relation Extraction (Acquisitions / Subsidiaries)](https://nlp.johnsnowlabs.com/2022/11/08/finre_acquisitions_subsidiaries_md_en.html)  | 
| 5| [Financial Relation Extraction (Work Experience, Medium)](https://nlp.johnsnowlabs.com/2022/11/08/finre_work_experience_md_en.html)  |
| 6| [Financial Relation Extraction (Work Experience, Small)](https://nlp.johnsnowlabs.com/2022/09/28/finre_work_experience_en.html)  | 
| 7| [Financial Relation Extraction (Alias)](https://nlp.johnsnowlabs.com/2022/08/17/finre_org_prod_alias_en_3_2.html)  |
| 8| [Financial Zero-shot Relation Extraction](https://nlp.johnsnowlabs.com/2022/08/22/finre_zero_shot_en_3_2.html)  |

##✔ Common Components
📚This pipeline will:
1.   Split Text into Sentences
2.   Split Sentences into Words
3.   Use Financial Text Embeddings, trained on SEC documents, to obtain numerical semantic representation of words

**These components are common for all the pipelines we will use.**

In [0]:
def get_generic_base_pipeline():
  """Common components used in all pipelines"""
  document_assembler = nlp.DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("document")

  text_splitter = finance.TextSplitter()\
      .setInputCols(["document"])\
      .setOutputCol("sentence")
  
  tokenizer = nlp.Tokenizer()\
      .setInputCols(["sentence"])\
      .setOutputCol("token")

  embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
      .setInputCols(["sentence", "token"])\
      .setOutputCol("embeddings")

  base_pipeline = nlp.Pipeline(stages=[
      document_assembler,
      text_splitter,
      tokenizer,
      embeddings
  ])

  return base_pipeline
    
generic_base_pipeline = get_generic_base_pipeline()

In [0]:
# Text Classifier
def get_text_classification_pipeline(model):
  """This pipeline allows you to use different classification models to understand if an input text is of a specific class or is something else.
  It will be used to check where the first summary page of SEC10K is, where the sections of Acquisitions and Subsidiaries are, or where in the document
  the management roles and experiences are mentioned"""
  document_assembler = nlp.DocumentAssembler() \
       .setInputCol("text") \
       .setOutputCol("document")

  embeddings = nlp.UniversalSentenceEncoder.pretrained() \
      .setInputCols("document") \
      .setOutputCol("sentence_embeddings")

  classifier = nlp.ClassifierDLModel.pretrained(model, "en", "finance/models")\
      .setInputCols(["sentence_embeddings"])\
      .setOutputCol("category")

  nlpPipeline = nlp.Pipeline(stages=[
      document_assembler, 
      embeddings,
      classifier])
  
  return nlpPipeline

In [0]:
import pandas as pd

def get_relations_df(results, col='relations'):
  """Returns a pandas DataFrame with the relations extracted by Spark NLP"""
  rel_pairs = []
  # Each relation annotation carries both entities and the confidence in its metadata
  for rel in results[0][col]:
    rel_pairs.append((
        rel.result,
        rel.metadata['entity1'],
        rel.metadata['entity1_begin'],
        rel.metadata['entity1_end'],
        rel.metadata['chunk1'],
        rel.metadata['entity2'],
        rel.metadata['entity2_begin'],
        rel.metadata['entity2_end'],
        rel.metadata['chunk2'],
        rel.metadata['confidence']
    ))

  rel_df = pd.DataFrame(rel_pairs, columns=[
      'relation', 'entity1', 'entity1_begin', 'entity1_end', 'chunk1',
      'entity2', 'entity2_begin', 'entity2_end', 'chunk2', 'confidence'])

  return rel_df

##✔ NER and Relation Extraction
On its own, NER only extracts isolated entities. By combining an NER model with a Relation Extraction annotator trained on the same entity types, you can also determine whether those entities are related to each other.

Let's suppose we want to extract information about **Acquisitions** and **Subsidiaries**. If we don't know where that information is in the document, we can use Text Classifiers to find it.

##✔ Using Text Classification to find Relevant Parts of the Document: Acquisitions and Subsidiaries
To find the pages of an SEC 10-K filing that describe acquisitions, we have a specific model called `"finclf_acquisitions_item"`.

Let's send some pages and check which one(s) contain that information. In a real case, you could send all the pages to the model, but to save time we will only use a subset here.

###📌 Sample Texts from Cadence Design Systems
Examples are taken from publicly available information about Cadence in the SEC's EDGAR database [here](https://www.sec.gov/Archives/edgar/data/813672/000081367222000012/cdns-20220101.htm) and from [Wikipedia](https://en.wikipedia.org/wiki/Cadence_Design_Systems).

In [0]:
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/finance-nlp/data/cdns-20220101.html.txt

dbutils.fs.cp("file:/databricks/driver/cdns-20220101.html.txt", "dbfs:/") 

In [0]:
with open('/dbfs/cdns-20220101.html.txt', 'r') as f:
  cadence_sec10k = f.read()
print(cadence_sec10k[:100])

In [0]:
pages = [x for x in cadence_sec10k.split("Table of Contents") if x.strip() != '']
print(pages[0])
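
The split above treats every occurrence of the string "Table of Contents" as a page boundary and drops fragments that are empty. A minimal sketch of the same idea on a made-up string (`sample_filing` and its contents are hypothetical, not taken from the Cadence filing):

```python
# Toy illustration of splitting a filing on a repeated page marker.
# `sample_filing` is a made-up string, not real 10-K content.
PAGE_MARKER = "Table of Contents"

sample_filing = (
    "Table of Contents"
    "Cover page text. "
    "Table of Contents"
    "   "
    "Table of Contents"
    "Item 1. Business overview."
)

# Split on the marker and drop fragments that are empty or whitespace-only.
toy_pages = [chunk for chunk in sample_filing.split(PAGE_MARKER) if chunk.strip() != ""]

for index, page in enumerate(toy_pages):
    print(index, repr(page))
```

Because empty fragments are dropped, the list indices used later (such as `pages[67]`) count only non-empty pages.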

In [0]:
# Some examples
candidates = [[pages[0]], [pages[1]], [pages[35]], [pages[67]]] 

In [0]:
classification_pipeline = get_text_classification_pipeline('finclf_acquisitions_item')

df = spark.createDataFrame(candidates).toDF("text")

model = classification_pipeline.fit(df)

result = model.transform(df)

In [0]:
result.select('category.result').show()
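
Once the predicted labels are collected, they can be mapped back to page numbers. The sketch below is plain Python and assumes one list of labels per page, which is the shape of `category.result`. The label names `"acquisitions"` and `"other"`, as well as the predictions themselves, are hypothetical; check the model card for the labels the classifier actually emits.

```python
# Page indices sent to the classifier, in the same order as `candidates`.
candidate_page_ids = [0, 1, 35, 67]

# Hypothetical predictions, one list of labels per page.
# With Spark these could come from something like:
#   [row.result for row in result.select("category.result").collect()]
predicted_labels = [["other"], ["other"], ["other"], ["acquisitions"]]

def pages_with_label(page_ids, labels_per_page, target_label):
    """Return the page ids whose predicted labels include target_label."""
    return [
        page_id
        for page_id, labels in zip(page_ids, labels_per_page)
        if target_label in labels
    ]

relevant_pages = pages_with_label(candidate_page_ids, predicted_labels, "acquisitions")
print(relevant_pages)
```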

###📌 Acquisitions, Subsidiaries and Former Names
📚Let's use some NER models to obtain information about Organizations and Dates, and understand if:
- An ORG was acquired by another ORG
- An ORG is a subsidiary of another ORG
- An ORG name is an alias / abbreviation / acronym / etc of another ORG

We will use the detected `pages[67]` as input.

In [0]:
ner_model_date = finance.NerModel.pretrained("finner_sec_dates", "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner_dates")

ner_converter_date = nlp.NerConverter()\
    .setInputCols(["sentence","token","ner_dates"])\
    .setOutputCol("ner_chunk_date")

ner_model_org= finance.NerModel.pretrained("finner_orgs_prods_alias", "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner_orgs")

ner_converter_org = nlp.NerConverter()\
    .setInputCols(["sentence","token","ner_orgs"])\
    .setOutputCol("ner_chunk_org")

chunk_merger = finance.ChunkMergeApproach()\
    .setInputCols('ner_chunk_org', "ner_chunk_date")\
    .setOutputCol('ner_chunk')

pos = nlp.PerceptronModel.pretrained()\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("pos")

dependency_parser = nlp.DependencyParserModel.pretrained("dependency_conllu", "en")\
    .setInputCols(["sentence", "pos", "token"])\
    .setOutputCol("dependencies")

re_filter = finance.RENerChunksFilter()\
    .setInputCols(["ner_chunk", "dependencies"])\
    .setOutputCol("re_ner_chunk")\
    .setRelationPairs(["ORG-ORG", "ORG-DATE"])\
    .setMaxSyntacticDistance(10)

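reDL = finance.RelationExtractionDLModel.pretrained('finre_acquisitions_subsidiaries_md', 'en', 'finance/models')\
    .setInputCols(["re_ner_chunk", "sentence"])\
    .setOutputCol("relations_acq")\
    .setPredictionThreshold(0.1)

# The merger below reads a "relations_alias" column, so an alias RE stage must produce it.
# Assumption: the model name is inferred from the "Financial Relation Extraction (Alias)"
# model card listed above, and the threshold mirrors the acquisitions stage.
reDL_alias = finance.RelationExtractionDLModel.pretrained('finre_org_prod_alias', 'en', 'finance/models')\
    .setInputCols(["re_ner_chunk", "sentence"])\
    .setOutputCol("relations_alias")\
    .setPredictionThreshold(0.1)

annotation_merger = finance.AnnotationMerger()\
    .setInputCols("relations_acq", "relations_alias")\
    .setOutputCol("relations")

nlpPipeline = nlp.Pipeline(stages=[
        generic_base_pipeline,
        ner_model_date,
        ner_converter_date,
        ner_model_org,
        ner_converter_org,
        chunk_merger,
        pos,
        dependency_parser,
        re_filter,
        reDL,
        reDL_alias,
        annotation_merger])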

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

light_model = nlp.LightPipeline(model)
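
The `setRelationPairs(["ORG-ORG", "ORG-DATE"])` setting limits which entity pairs are passed on to the relation extraction model. The sketch below illustrates that idea in plain Python. It is not how `RENerChunksFilter` is implemented: the chunks are hypothetical, matching labels in either order is an assumption, and the syntactic-distance limit is not modelled.

```python
from itertools import combinations

# Hypothetical NER chunks as (text, label); real chunks come from the NER converters.
chunks = [
    ("Cadence", "ORG"),
    ("AWR Corporation", "ORG"),
    ("fiscal 2020", "DATE"),
    ("design software", "PRODUCT"),
]

# Same label pairs as passed to setRelationPairs in the pipeline.
allowed_pairs = ["ORG-ORG", "ORG-DATE"]

def candidate_pairs(chunks, allowed_pairs):
    """Keep chunk pairs whose labels match an allowed pair, in either order."""
    allowed = {frozenset(pair.split("-")) for pair in allowed_pairs}
    return [
        (first, second)
        for first, second in combinations(chunks, 2)
        if frozenset((first[1], second[1])) in allowed
    ]

kept_pairs = candidate_pairs(chunks, allowed_pairs)
for first, second in kept_pairs:
    print(f"{first[0]} ({first[1]}) <-> {second[0]} ({second[1]})")
```

Restricting the candidate pairs this way keeps the RE model from scoring combinations it was never trained on, such as pairs involving a product.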

In [0]:
sample_text = pages[67].replace("“", "\"").replace("”", "\"")

In [0]:
result = light_model.fullAnnotate(sample_text)

rel_df = get_relations_df(result)

rel_df

|index|relation|entity1|entity1_begin|entity1_end|chunk1|entity2|entity2_begin|entity2_end|chunk2|confidence|
|-----:|:-----|:-----|-----:|-----:|:-----|:-----|-----:|-----:|:-----|-----:|
| 0| has_acquisition_date| ORG| 440| 446| Cadence| DATE| 427| 437| fiscal 2020| 0.99945384|
| 1| has_acquisition_date| ORG| 490| 504| AWR Corporation| DATE| 427| 437| fiscal 2020| 0.99891853|
| 2| was_acquired_by| ORG| 490| 504| AWR Corporation| ORG| 440| 446| Cadence| 0.99111485|
| 3| was_acquired_by| ORG| 518| 540| Integrand Software, Inc| ORG| 440| 446| Cadence| 0.99635243|
| 4| was_acquired_by| ORG| 518| 540| Integrand Software, Inc| ORG| 490| 504| AWR Corporation| 0.94192755|
| 5| other| ORG| 1210| 1212| AWR| ORG| 1218| 1226| Integrand| 0.9999858|
| 6| other| ORG| 1229| 1235| Cadence| DATE| 1358| 1367| nine years| 0.996561|
| 7| other| ORG| 1905| 1907| AWR| ORG| 1913| 1921| Integrand| 0.9999651|
| 8| has_acquisition_date| ORG| 1955| 1961| Cadence| DATE| 2007| 2017| fiscal 2020| 0.99776745|
| 9| other| DATE| 2219| 2229| fiscal 2021| ORG| 2322| 2330| Cadence’s| 0.99219704|


In [0]:
rel_df = rel_df[(rel_df["relation"] != "other") & (rel_df["relation"] != "no_rel")]

rel_df

|index|relation|entity1|entity1_begin|entity1_end|chunk1|entity2|entity2_begin|entity2_end|chunk2|confidence|
|-----:|:-----|:-----|-----:|-----:|:-----|:-----|-----:|-----:|:-----|-----:|
| 0| has_acquisition_date| ORG| 440| 446| Cadence| DATE| 427| 437| fiscal 2020| 0.99945384|
| 1| has_acquisition_date| ORG| 490| 504| AWR Corporation| DATE| 427| 437| fiscal 2020| 0.99891853|
| 2| was_acquired_by| ORG| 490| 504| AWR Corporation| ORG| 440| 446| Cadence| 0.99111485|
| 3| was_acquired_by| ORG| 518| 540| Integrand Software, Inc| ORG| 440| 446| Cadence| 0.99635243|
| 4| was_acquired_by| ORG| 518| 540| Integrand Software, Inc| ORG| 490| 504| AWR Corporation| 0.94192755|
| 8| has_acquisition_date| ORG| 1955| 1961| Cadence| DATE| 2007| 2017| fiscal 2020| 0.99776745|
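
Besides dropping the `other` and `no_rel` labels, the remaining relations can also be filtered by confidence. A minimal sketch with pandas, using a few rows copied from the output above. It assumes the confidence values arrive as strings (as annotation metadata usually does) and uses a hypothetical cut-off of 0.95:

```python
import pandas as pd

# A few rows copied from the output above; confidence is kept as a string,
# on the assumption that annotation metadata values arrive as strings.
sample_rel_df = pd.DataFrame(
    [
        ("has_acquisition_date", "Cadence", "fiscal 2020", "0.99945384"),
        ("was_acquired_by", "AWR Corporation", "Cadence", "0.99111485"),
        ("was_acquired_by", "Integrand Software, Inc", "AWR Corporation", "0.94192755"),
    ],
    columns=["relation", "chunk1", "chunk2", "confidence"],
)

MIN_CONFIDENCE = 0.95  # hypothetical cut-off, tune per use case

# Convert to float first so the comparison is numeric rather than string-based.
sample_rel_df["confidence"] = sample_rel_df["confidence"].astype(float)
confident_df = sample_rel_df[sample_rel_df["confidence"] >= MIN_CONFIDENCE]

print(confident_df)
```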


###📌 Visualize Results

In [0]:
# `viz` (imported at the top of the notebook) already exposes the visualizer
re_vis = viz.RelationExtractionVisualizer()

vis = re_vis.display(
    result=result[0],
    relation_col="relations",
    document_col="document",
    exclude_relations=["other", "no_rel"],
    show_relations=True,
    return_html=True)
displayHTML(vis)