![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/finance-nlp/06.1.Additional_Relation_Extraction_Examples.ipynb)

#🎬 Installation

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs 

##🔗 Automatic Installation
Using my.johnsnowlabs.com SSO

In [None]:
from johnsnowlabs import nlp, finance, viz
# nlp.install(force_browser=True)

##🔗 Manual downloading
If you are not registered at my.johnsnowlabs.com, received your license by email, or are using Safari, you may need to upload the license manually.

- Go to my.johnsnowlabs.com
- Download your license
- Upload it using the following command

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

- Install it

In [None]:
nlp.install()

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_7187 (2).json
📋 Stored John Snow Labs License in /root/.johnsnowlabs/licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json
👷 Setting up  John Snow Labs home in /root/.johnsnowlabs, this might take a few minutes.
Downloading 🐍+🚀 Python Library spark_nlp-4.2.8-py2.py3-none-any.whl
Downloading 🐍+💊 Python Library spark_nlp_jsl-4.2.8-py3-none-any.whl
Downloading 🫘+🚀 Java Library spark-nlp-assembly-4.2.8.jar
Downloading 🫘+💊 Java Library spark-nlp-jsl-4.2.8.jar
🙆 JSL Home setup in /root/.johnsnowlabs
👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_7187 (2).json
Installing /root/.johnsnowlabs/py_installs/spark_nlp_jsl-4.2.8-py3-none-any.whl to /usr/bin/python3
Installed 1 products:
💊 Spark-Healthcare==4.2.8 installed! ✅ Heal the planet with NLP! 


#📌 Start Spark Session

In [None]:
spark = nlp.start()

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_7187 (2).json
👌 Launched cpu optimized session with with: 🚀Spark-NLP==4.2.8, 💊Spark-Healthcare==4.2.8, running on ⚡ PySpark==3.1.2


#🔎 Financial Relation Extraction (RE)

Financial relation extraction is the process of automatically turning unstructured text about finance and economics into structured information. It builds on natural language processing (NLP) techniques: named entity recognition finds the entities, and relation extraction determines how they are connected.

Some examples of financial relation extraction include extracting information about companies and their financial performance, such as revenue, profits, and debt, as well as information about financial markets and economic indicators, such as stock prices and exchange rates.
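
As a rough illustration of what "structured information" means here, the sketch below stores extracted relations as small records. The `Relation` class and its field names are hypothetical and are not part of the John Snow Labs API; the sentence and the two relations mirror the Cadence example analysed later in this notebook.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """Hypothetical container for one extracted relation (illustration only)."""
    relation: str  # relation label, e.g. "has_role"
    entity1: str   # entity type of the first chunk
    chunk1: str    # text of the first chunk
    entity2: str   # entity type of the second chunk
    chunk2: str    # text of the second chunk

    def as_triple(self):
        """Return the relation as a (subject, relation, object) tuple."""
        return (self.chunk1, self.relation, self.chunk2)


# Unstructured input: a plain sentence.
unstructured = "Anirudh Devgan assumed the role of President of Cadence."

# Structured output: records that can be filtered, joined, or stored in a table.
extracted = [
    Relation("has_role", "PERSON", "Anirudh Devgan", "ROLE", "President"),
    Relation("has_role_in_company", "ROLE", "President", "ORG", "Cadence"),
]

triples = [r.as_triple() for r in extracted]
print(triples)
```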

##✔️ Pretrained Relation Extraction Models for Finance

Here is the list of pretrained Relation Extraction models:

**Relation Extraction Models**

|index|model|
|-----:|:-----|
| 1| [Financial Relation Extraction on Earning Calls (Small)](https://nlp.johnsnowlabs.com/2022/11/28/finre_earning_calls_sm_en.html)  | 
| 2| [Financial Relation Extraction on 10K filings (Small)](https://nlp.johnsnowlabs.com/2022/11/07/finre_financial_small_en.html)  | 
| 3| [Financial Relation Extraction (Tickers)](https://nlp.johnsnowlabs.com/2022/10/15/finre_has_ticker_en.html)  |
| 4| [Financial Relation Extraction (Acquisitions / Subsidiaries)](https://nlp.johnsnowlabs.com/2022/11/08/finre_acquisitions_subsidiaries_md_en.html)  | 
| 5| [Financial Relation Extraction (Work Experience, Medium)](https://nlp.johnsnowlabs.com/2022/11/08/finre_work_experience_md_en.html)  |
| 6| [Financial Relation Extraction (Work Experience, Small)](https://nlp.johnsnowlabs.com/2022/09/28/finre_work_experience_en.html)  | 
| 7| [Financial Relation Extraction (Alias)](https://nlp.johnsnowlabs.com/2022/08/17/finre_org_prod_alias_en_3_2.html)  |
| 8| [Financial Zero-shot Relation Extraction](https://nlp.johnsnowlabs.com/2022/08/22/finre_zero_shot_en_3_2.html)  |




##✔️ Common Components
This pipeline will:
1.   Split Text into Sentences
2.   Split Sentences into Words
3.   Use Financial Text Embeddings, trained on SEC documents, to obtain numerical semantic representation of words

**These components are common for all the pipelines we will use.**
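
To make the first two steps concrete, here is a naive, standard-library approximation of sentence splitting and tokenization. It is only a sketch: the regular expressions are assumptions chosen for illustration, the toy text is made up, and the real `TextSplitter` and `Tokenizer` annotators are more robust. The embedding step is omitted because it needs the pretrained model.

```python
import re

# Made-up toy text, used only to show the shape of each step's output.
text = "Revenue grew in 2021. Costs stayed flat."

# Step 1: naive sentence split after ., ! or ? followed by whitespace.
sentences = re.split(r"(?<=[.!?])\s+", text.strip())

# Step 2: naive tokenization into word characters or single punctuation marks.
tokens = [re.findall(r"\w+|[^\w\s]", sentence) for sentence in sentences]

for sentence, sentence_tokens in zip(sentences, tokens):
    print(sentence, "->", sentence_tokens)
```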

In [None]:
def get_generic_base_pipeline():
  """Common components used in all pipelines"""
  document_assembler = nlp.DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("document")

  text_splitter = finance.TextSplitter()\
      .setInputCols(["document"])\
      .setOutputCol("sentence")
  
  tokenizer = nlp.Tokenizer()\
      .setInputCols(["sentence"])\
      .setOutputCol("token")

  embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
      .setInputCols(["sentence", "token"])\
      .setOutputCol("embeddings")

  base_pipeline = nlp.Pipeline(stages=[
      document_assembler,
      text_splitter,
      tokenizer,
      embeddings
  ])

  return base_pipeline
    
generic_base_pipeline = get_generic_base_pipeline()

bert_embeddings_sec_bert_base download started this may take some time.
Approximate size to download 390.4 MB
[OK!]


In [None]:
# Text Classifier
def get_text_classification_pipeline(model):
  """This pipeline allows you to use different classification models to understand if an input text is of a specific class or is something else.
  It will be used to check where the first summary page of SEC10K is, where the sections of Acquisitions and Subsidiaries are, or where in the document
  the management roles and experiences are mentioned"""
  document_assembler = nlp.DocumentAssembler() \
       .setInputCol("text") \
       .setOutputCol("document")

  embeddings = nlp.UniversalSentenceEncoder.pretrained() \
      .setInputCols("document") \
      .setOutputCol("sentence_embeddings")

  classifier = nlp.ClassifierDLModel.pretrained(model, "en", "finance/models")\
      .setInputCols(["sentence_embeddings"])\
      .setOutputCol("category")

  nlpPipeline = nlp.Pipeline(stages=[
      document_assembler, 
      embeddings,
      classifier])
  
  return nlpPipeline

In [None]:
import pandas as pd

def get_relations_df(results, col='relations'):
  """Shows a Dataframe with the relations extracted by Spark NLP"""
  rel_pairs=[]
  for rel in results[0][col]:
      rel_pairs.append((
        rel.result, 
        rel.metadata['entity1'], 
        rel.metadata['entity1_begin'],
        rel.metadata['entity1_end'],
        rel.metadata['chunk1'], 
        rel.metadata['entity2'],
        rel.metadata['entity2_begin'],
        rel.metadata['entity2_end'],
        rel.metadata['chunk2'], 
        rel.metadata['confidence']
    ))

  rel_df = pd.DataFrame(rel_pairs, columns=['relation','entity1','entity1_begin','entity1_end','chunk1','entity2','entity2_begin','entity2_end','chunk2', 'confidence'])

  return rel_df

##🔎 NER and Relation Extraction
On its own, NER only extracts isolated entities. By combining an NER model with a Relation Extraction annotator trained on the same entity types, you can also determine whether those entities are related to each other.

Let's suppose we want to extract information about **Management Roles** and **Experiences**. If we don't know where that information is in the document, we can use Text Classifiers to find it.

##🚀 People's Information
Let's also extract people's names along with their current roles and past experience at other companies (including the dates).

###✔️ Sample Texts from Cadence Design Systems
Examples are taken from publicly available information about Cadence in the SEC's EDGAR database [here](https://www.sec.gov/Archives/edgar/data/813672/000081367222000012/cdns-20220101.htm) and [Wikipedia](https://en.wikipedia.org/wiki/Cadence_Design_Systems).

In [None]:
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/finance-nlp/data/cdns-20220101.html.txt

In [None]:
with open('cdns-20220101.html.txt', 'r') as f:
  cadence_sec10k = f.read()
print(cadence_sec10k[:100])

Table of Contents
UNITED STATES SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
__________


In [None]:
pages = [x for x in cadence_sec10k.split("Table of Contents") if x.strip() != '']
print(pages[0])


UNITED STATES SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
_____________________________________ 
FORM 10-K 
_____________________________________  
(Mark One)
☒
ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
For the fiscal year ended January 1, 2022 
OR
☐
TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
For the transition period from _________ to_________.

Commission file number 000-15867 
_____________________________________
 
CADENCE DESIGN SYSTEMS, INC. 
(Exact name of registrant as specified in its charter)
____________________________________ 
Delaware
 
00-0000000
(State or Other Jurisdiction ofIncorporation or Organization)
 
(I.R.S. EmployerIdentification No.)
2655 Seely Avenue, Building 5,
San Jose,
California
 
95134
(Address of Principal Executive Offices)
 
(Zip Code)
(408)
-943-1234 
(Registrant’s Telephone Number, including Area Code) 
Securities registered pursuant to Section 1
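
The page split used in the cell above relies on the "Table of Contents" marker that repeats throughout the text export. A toy version, using a made-up three-page string, shows why the empty-string filter is needed:

```python
# Made-up toy document. The marker also appears at the very start,
# so a plain split() yields an empty first element.
toy_doc = (
    "Table of Contents\nCover page\n"
    "Table of Contents\nBusiness overview\n"
    "Table of Contents\nManagement"
)

raw_parts = toy_doc.split("Table of Contents")

# Same filter as in the notebook: drop chunks that contain only whitespace.
toy_pages = [part for part in raw_parts if part.strip() != ""]

print(len(raw_parts), len(toy_pages))
print([page.strip() for page in toy_pages])
```

Note that the resulting indices are positions in the filtered list, not printed page numbers from the filing.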

###✔️ Using Text Classification to find Relevant Parts of the Document: About Management and their work experience
To find the parts of the SEC 10-K that describe management roles and work experience, we have a specific model called `"finclf_work_experience_item"`.

Let's send some pages and check which ones contain that information. In a real case you would send every page to the model; to save time, we use only a subset here.

In [None]:
candidates = [[pages[4]], [pages[84]], [pages[85]], [pages[86]], [pages[87]]]

In [None]:
classification_pipeline = get_text_classification_pipeline('finclf_work_experience_item')

df = spark.createDataFrame(candidates).toDF("text")

model = classification_pipeline.fit(df)

result = model.transform(df)

result.select('category.result').show()

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]
finclf_work_experience_item download started this may take some time.
Approximate size to download 21.2 MB
[OK!]
+-----------------+
|           result|
+-----------------+
|          [other]|
|          [other]|
|          [other]|
|[work_experience]|
|          [other]|
+-----------------+
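
The output rows follow the order of `candidates`, so each label can be tied back to its page index. A minimal sketch, with the labels copied from the output above (in the notebook they could be collected from the `category.result` column instead); the helper name `pages_with_label` is hypothetical:

```python
# Page indices sent to the classifier, in the same order as `candidates`.
page_indices = [4, 84, 85, 86, 87]

# Labels copied from the classifier output, one list per row.
predicted = [["other"], ["other"], ["other"], ["work_experience"], ["other"]]


def pages_with_label(indices, labels, target):
    """Return the page indices whose predicted labels include `target`."""
    return [idx for idx, row in zip(indices, labels) if target in row]


relevant_pages = pages_with_label(page_indices, predicted, "work_experience")
print(relevant_pages)
```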



**We found Work Experience information in page 86. However, page 4 also hides one very relevant sentence.**
Yet the model returned `other` for page 4. Why?

In [None]:
pages[4]

'\nThe IC and SoC design excellence requires technologies for custom IC, digital IC design and signoff, and functional verification, and leverages pre-built semiconductor IP. These tools, IP and associated services are specifically designed to meet the growing requirements of engineers designing increasingly complex chips across analog, digital and mixed-signal domains, and perform the associated verification tasks, including validation of low-level software running on the silicon model, thereby enabling design teams to manage complexity and verification throughput without commensurately increasing the team size or extending the project schedule, while reducing technical risks.\nThe second layer of our strategy centers around system innovation. It includes tools and services used for system design of the packages that encapsulate the ICs and the PCBs, system simulation which includes electromagnetic, electro-thermal and other multi-physics analysis necessary as part of optimizing the f

Exploring the page, we see that it contains a lot of text about other topics. Sometimes it is necessary to work at a finer level of detail.

Let's see what happens if we use `paragraphs` instead of `pages`.

In [None]:
paragraphs = [x for x in pages[4].split('\n') if x.strip() != '']

paragraphs

['The IC and SoC design excellence requires technologies for custom IC, digital IC design and signoff, and functional verification, and leverages pre-built semiconductor IP. These tools, IP and associated services are specifically designed to meet the growing requirements of engineers designing increasingly complex chips across analog, digital and mixed-signal domains, and perform the associated verification tasks, including validation of low-level software running on the silicon model, thereby enabling design teams to manage complexity and verification throughput without commensurately increasing the team size or extending the project schedule, while reducing technical risks.',
 'The second layer of our strategy centers around system innovation. It includes tools and services used for system design of the packages that encapsulate the ICs and the PCBs, system simulation which includes electromagnetic, electro-thermal and other multi-physics analysis necessary as part of optimizing the

In [None]:
candidates = [[x] for x in paragraphs]

candidates

[['The IC and SoC design excellence requires technologies for custom IC, digital IC design and signoff, and functional verification, and leverages pre-built semiconductor IP. These tools, IP and associated services are specifically designed to meet the growing requirements of engineers designing increasingly complex chips across analog, digital and mixed-signal domains, and perform the associated verification tasks, including validation of low-level software running on the silicon model, thereby enabling design teams to manage complexity and verification throughput without commensurately increasing the team size or extending the project schedule, while reducing technical risks.'],
 ['The second layer of our strategy centers around system innovation. It includes tools and services used for system design of the packages that encapsulate the ICs and the PCBs, system simulation which includes electromagnetic, electro-thermal and other multi-physics analysis necessary as part of optimizing 

In [None]:
classification_pipeline = get_text_classification_pipeline('finclf_work_experience_item')

df = spark.createDataFrame(candidates).toDF("text")

model = classification_pipeline.fit(df)

result = model.transform(df)

result.select('category.result').show()

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]
finclf_work_experience_item download started this may take some time.
Approximate size to download 21.2 MB
[OK!]
+-----------------+
|           result|
+-----------------+
|          [other]|
|          [other]|
|          [other]|
|          [other]|
|          [other]|
|          [other]|
|          [other]|
|          [other]|
|[work_experience]|
|[work_experience]|
|          [other]|
+-----------------+



**Here we are: if we split the text into smaller units (paragraphs, lines), we can find more information than at page level!**

This is because the information captured in embeddings gets diluted as the text gets longer. In addition, some models limit the input length (512 tokens in BERT).
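
One way to build intuition for this dilution is a toy model in which a text embedding is simply the average of its word vectors. This is a deliberately simplified assumption: it is not how the Universal Sentence Encoder or BERT compute embeddings, and the two-dimensional vectors and word counts below are made up.

```python
import math

# Made-up 2-D "word vectors": first axis = work-experience signal,
# second axis = any other topic.
RELEVANT = (1.0, 0.0)
UNRELATED = (0.0, 1.0)


def mean_vector(vectors):
    """Average a list of equal-length vectors component by component."""
    n = len(vectors)
    return tuple(sum(v[i] for v in vectors) / n for i in range(len(vectors[0])))


def cosine(a, b):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


# The same 5 relevant words, surrounded by little or a lot of unrelated text.
paragraph = [RELEVANT] * 5 + [UNRELATED] * 1
page = [RELEVANT] * 5 + [UNRELATED] * 50

paragraph_sim = cosine(mean_vector(paragraph), RELEVANT)
page_sim = cosine(mean_vector(page), RELEVANT)

print(round(paragraph_sim, 3), round(page_sim, 3))
```

In this toy setting the paragraph-level vector stays close to the relevant direction, while the page-level vector is dominated by the unrelated text.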

In [None]:
ner_model_role = finance.NerModel.pretrained("finner_org_per_role_date", "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner_role")

ner_converter_role = nlp.NerConverter()\
    .setInputCols(["sentence","token","ner_role"])\
    .setOutputCol("ner_chunk_role")

pos = nlp.PerceptronModel.pretrained()\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("pos")

dependency_parser = nlp.DependencyParserModel.pretrained("dependency_conllu", "en")\
    .setInputCols(["sentence", "pos", "token"])\
    .setOutputCol("dependencies")

re_ner_chunk_filter_role = finance.RENerChunksFilter()\
    .setInputCols(["ner_chunk_role", "dependencies"])\
    .setOutputCol("re_ner_chunk_role")\
    .setRelationPairs(["PERSON-ROLE", "ORG-ROLE", "DATE-ROLE"])

re_model_exp = finance.RelationExtractionDLModel.pretrained("finre_work_experience_md", "en", "finance/models")\
    .setInputCols(["re_ner_chunk_role", "sentence"])\
    .setOutputCol("relations")

nlpPipeline = nlp.Pipeline(stages=[
    generic_base_pipeline,
    ner_model_role,
    ner_converter_role,
    pos,
    dependency_parser,
    re_ner_chunk_filter_role,
    re_model_exp,
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

light_model = nlp.LightPipeline(model)


finner_org_per_role_date download started this may take some time.
[OK!]
pos_anc download started this may take some time.
Approximate size to download 3.9 MB
[OK!]
dependency_conllu download started this may take some time.
Approximate size to download 16.7 MB
[OK!]
finre_work_experience_md download started this may take some time.
[OK!]
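
Conceptually, the relation pairs set on the chunk filter restrict which entity combinations are passed on to the relation extraction model. The plain-Python sketch below illustrates only that idea. It is not the `RENerChunksFilter` implementation, which also takes the dependency parse as input; the helper name `candidate_pairs` is hypothetical, and accepting a pair in either order is an assumption made for this sketch.

```python
from itertools import combinations

# Same pair labels as passed to setRelationPairs in the pipeline.
ALLOWED_PAIRS = {"PERSON-ROLE", "ORG-ROLE", "DATE-ROLE"}

# (entity label, chunk text) for some entities from the Cadence sample sentence.
chunks = [
    ("DATE", "December 15, 2021"),
    ("PERSON", "Anirudh Devgan"),
    ("ROLE", "President"),
    ("ORG", "Cadence"),
]


def candidate_pairs(chunks, allowed):
    """Keep only chunk pairs whose entity labels form an allowed pair."""
    pairs = []
    for (label_a, text_a), (label_b, text_b) in combinations(chunks, 2):
        # Assumption for this sketch: the pair is accepted in either order.
        if f"{label_a}-{label_b}" in allowed or f"{label_b}-{label_a}" in allowed:
            pairs.append((text_a, text_b))
    return pairs


pairs = candidate_pairs(chunks, ALLOWED_PAIRS)
print(pairs)
```

Pairs such as DATE and PERSON, or PERSON and ORG, are dropped in this sketch because they are not in the allowed set.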


###🖨️ Get Results

In [None]:
sample_text = candidates[9]

sample_text

['On December 15, 2021, Anirudh Devgan assumed the role of President and Chief Executive Officer of Cadence, replacing Lip-Bu Tan. Prior to his role as Chief Executive Officer, Dr. Devgan served as President of Cadence. Concurrently, Mr. Tan transitioned to the role of Executive Chair.']

In [None]:
result = light_model.fullAnnotate(sample_text)

rel_df = get_relations_df(result)

rel_df[rel_df["relation"] != "other"]

|index|relation|entity1|entity1_begin|entity1_end|chunk1|entity2|entity2_begin|entity2_end|chunk2|confidence|
|-----:|:-----|:-----|-----:|-----:|:-----|:-----|-----:|-----:|:-----|-----:|
| 0| has_role_from| DATE| 3| 19| December 15, 2021| ROLE| 57| 65| President| 0.95321363|
| 1| has_role_from| DATE| 3| 19| December 15, 2021| ROLE| 71| 93| Chief Executive Officer| 0.91833746|
| 2| has_role| PERSON| 22| 35| Anirudh Devgan| ROLE| 57| 65| President| 0.9993814|
| 3| has_role| PERSON| 22| 35| Anirudh Devgan| ROLE| 71| 93| Chief Executive Officer| 0.9889985|
| 4| has_role_in_company| ROLE| 57| 65| President| ORG| 98| 104| Cadence| 0.9997348|
| 5| has_role_in_company| ROLE| 71| 93| Chief Executive Officer| ORG| 98| 104| Cadence| 0.99845624|
| 6| has_role| ROLE| 150| 172| Chief Executive Officer| PERSON| 175| 184| Dr. Devgan| 0.8526861|
| 7| has_role_in_company| ROLE| 150| 172| Chief Executive Officer| ORG| 209| 215| Cadence| 0.9976404|
| 8| has_role| PERSON| 175| 184| Dr. Devgan| ROLE| 196| 204| President| 0.99899226|
| 9| has_role_in_company| ROLE| 196| 204| President| ORG| 209| 215| Cadence| 0.9997203|
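
Once the relations are in tabular form, they can be post-processed with ordinary Python. As a sketch, the snippet below keeps the `has_role` rows above a confidence threshold and groups the roles by person. The rows are copied from the results above; the threshold of 0.9 and the helper name `roles_by_person` are arbitrary choices for illustration.

```python
from collections import defaultdict

# (relation, entity1, chunk1, entity2, chunk2, confidence),
# copied from the has_role rows of the results above.
rows = [
    ("has_role", "PERSON", "Anirudh Devgan", "ROLE", "President", 0.9993814),
    ("has_role", "PERSON", "Anirudh Devgan", "ROLE", "Chief Executive Officer", 0.9889985),
    ("has_role", "ROLE", "Chief Executive Officer", "PERSON", "Dr. Devgan", 0.8526861),
    ("has_role", "PERSON", "Dr. Devgan", "ROLE", "President", 0.99899226),
]


def roles_by_person(rows, min_confidence=0.9):
    """Group role chunks by person, skipping low-confidence relations."""
    grouped = defaultdict(list)
    for relation, ent1, chunk1, ent2, chunk2, confidence in rows:
        if relation != "has_role" or confidence < min_confidence:
            continue
        # The PERSON chunk can appear on either side of the relation.
        person, role = (chunk1, chunk2) if ent1 == "PERSON" else (chunk2, chunk1)
        grouped[person].append(role)
    return dict(grouped)


roles = roles_by_person(rows)
print(roles)
```

Here "Anirudh Devgan" and "Dr. Devgan" remain separate keys because they are different text chunks; merging them would need an extra normalization step that this sketch does not attempt.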


###🚀 Visualize Results

In [None]:
# `viz` was imported from johnsnowlabs in the installation section.
re_vis = viz.RelationExtractionVisualizer()

re_vis.display(
    result=result[0],
    relation_col="relations",
    document_col="document",
    exclude_relations=["other"],
    show_relations=True,
)