# Financial Solution Accelerator: Drawing a Company Ecosystem Graph and Analyzing it with Embeddings

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Finance/31.Solution_Company_Ecosystem_Graph_Embeddings.ipynb)



This accelerator will help you process Financial Annual Reports (10K filings) or even Wikipedia data about companies, using John Snow Labs Finance NLP **Named Entity Recognition, Relation Extraction and Assertion Status**, to extract the following information about companies:
- Information about the Company itself (`Trading Symbol`, `State`, `Address`, Contact Information) and other names the Company is known by (`alias`, `former name`).
- People (usually management and C-level) working in that company and their past experiences, including roles and companies
- `Acquisitions` events, including the acquisition dates. `Subsidiaries` mentioned.
- Other Companies mentioned in the report as `competitors`: we will also run a "Competitor check", to understand if another company is just in the ecosystem / supply chain of the company or it is really a competitor
- Temporality (`past`, `present`, `future`) and Certainty (`possible`) of events described, including `Forward-looking statements`.

John Snow Labs also provides offline modules for querying the Edgar database, which are updated quarterly: **Entity Linking** resolves an organization name to its official name, and **Chunk Mappers** map a normalized name to its Edgar Database record. We will use them to retrieve the `official name of a company`, `former names`, `dates when names were changed`, etc.

The final aim of this accelerator is to help you analyze company information...

<img src="https://github.com/JohnSnowLabs/spark-nlp-workshop/raw/master/tutorials/Certification_Trainings_JSL/Finance/data/im1.png" alt="drawing" width="600"/>

... create a graph...

<img src="https://github.com/JohnSnowLabs/spark-nlp-workshop/raw/master/tutorials/Certification_Trainings_JSL/Finance/data/img10.png" alt="drawing" width="800"/>

# Installation


In [None]:
!pip install johnsnowlabs

In [None]:
from google.colab import files
print('Please upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, finance
nlp.install()

# Starting a session

In [None]:
spark = nlp.start()

# Imports

In [None]:
import os
import sys
import time
import json
import functools 
import numpy as np
import pandas as pd
from tqdm import tqdm
from scipy import spatial

### Auxiliary Visualization functions 
We will use [NetworkX](https://networkx.org/) to store the graph and [Plotly](https://plotly.com/) to visualize it.

These functions will:
- Use Plotly to visualize a NetworkX graph
- Display relations in a dataframe
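
The edge helper defined in the next cell draws all edges as a single Plotly line trace. Here is a minimal sketch of that idea in plain Python, using a hypothetical `edge_segments` function and made-up coordinates: each edge contributes its two endpoints followed by `None`, which Plotly treats as a break in the line, and the midpoint is kept so the relation label can be placed on the edge.

```python
def edge_segments(pos, edges):
    """Flatten edges into x/y lists separated by None, plus label midpoints.

    `pos` maps node -> (x, y); `edges` is an iterable of (node_a, node_b).
    Both the function name and the sample data are illustrative only.
    """
    xs, ys, midpoints = [], [], []
    for a, b in edges:
        (x0, y0), (x1, y1) = pos[a], pos[b]
        xs += [x0, x1, None]   # None ends the current line segment
        ys += [y0, y1, None]
        midpoints.append(((x0 + x1) / 2, (y0 + y1) / 2))
    return xs, ys, midpoints


pos = {"A": (0.0, 0.0), "B": (1.0, 0.0), "C": (0.0, 2.0)}
xs, ys, midpoints = edge_segments(pos, [("A", "B"), ("A", "C")])
print(xs)
print(ys)
print(midpoints)
```

Keeping all edges in one trace means a single `go.Scatter` call regardless of how many relations the graph ends up holding.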

In [None]:
!pip install networkx plotly

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
import networkx as nx
G = nx.Graph()

In [None]:
import plotly.graph_objects as go
import random

def get_nodes_from_graph(graph, pos, node_color):
  """Extracts the nodes from a NetworkX graph as a Plotly Scatter trace"""
  node_x = []
  node_y = []
  texts = []
  hovers = []
  for node in graph.nodes():
    entity = graph.nodes[node]['attr_dict']['entity']
    x, y = pos[node]
    node_x.append(x)
    node_y.append(y)
    texts.append(node)
    hovers.append(entity)

  node_trace = go.Scatter(
    x=node_x, y=node_y, text=texts, hovertext=hovers,
    mode='markers+text',
    hoverinfo='text',
    marker=dict(
        color=node_color,
        size=40,
        line_width=2))
  
  return node_trace


def get_edges_from_graph(graph, pos, edge_color):
  """Extracts the edges from a NetworkX graph as Plotly Scatter traces (lines and labels)"""
  edge_x = []
  edge_y = []
  hovers = []
  xtext = []
  ytext = []
  for edge in graph.edges():
    relation = graph.edges[edge]['attr_dict']['relation']
    x0, y0 = pos[edge[0]]
    x1, y1 = pos[edge[1]]
    edge_x.append(x0)
    edge_x.append(x1)
    edge_x.append(None)
    edge_y.append(y0)
    edge_y.append(y1)
    edge_y.append(None)
    hovers.append(relation)
    xtext.append((x0+x1)/2)
    ytext.append((y0+y1)/2)

  edge_trace = go.Scatter(
    x=edge_x, y=edge_y,
    line=dict(width=2, color=edge_color),
    mode='lines')
  
  labels_trace = go.Scatter(x=xtext,y= ytext, mode='text',
                              textfont = {'color': edge_color},
                              marker_size=0.5,
                              text=hovers,
                              textposition='top center',
                              hovertemplate='weight: %{text}<extra></extra>')
  return edge_trace, labels_trace


def show_graph_in_plotly(graph, node_color='white', edge_color='grey'):
  """Shows Plotly graph in Databricks"""
  pos = nx.spring_layout(graph)
  node_trace = get_nodes_from_graph(graph, pos, node_color)
  edge_trace, labels_trace = get_edges_from_graph(graph, pos, edge_color)
  fig = go.Figure(data=[edge_trace, node_trace, labels_trace],
               layout=go.Layout(
                  # `titlefont` is deprecated in recent Plotly versions; set the font via `title`
                  title=dict(text='Company Ecosystem', font=dict(size=16)),
                  showlegend=False,
                  width=1600,
                  height=1000,
                  xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
                  yaxis=dict(showgrid=False, zeroline=False, showticklabels=False))
                  )
  fig.update_traces(marker=dict(size=12,
                              line=dict(width=2,
                                        color='DarkSlateGrey')),
                  selector=dict(mode='markers')) 
  fig.show()

In [None]:
import pandas as pd

def get_relations_df (results, col='relations'):
  """Shows a Dataframe with the relations extracted by Spark NLP"""
  rel_pairs=[]
  for rel in results[0][col]:
      rel_pairs.append((
        rel.result, 
        rel.metadata['entity1'], 
        rel.metadata['entity1_begin'],
        rel.metadata['entity1_end'],
        rel.metadata['chunk1'], 
        rel.metadata['entity2'],
        rel.metadata['entity2_begin'],
        rel.metadata['entity2_end'],
        rel.metadata['chunk2'], 
        rel.metadata['confidence']
    ))

  rel_df = pd.DataFrame(rel_pairs, columns=['relation','entity1','entity1_begin','entity1_end','chunk1','entity2','entity2_begin','entity2_end','chunk2', 'confidence'])

  return rel_df

# Common Components
This pipeline will:
1) Split Text into Sentences
2) Split Sentences into Words
3) Use Financial Text Embeddings, trained on SEC documents, to obtain numerical semantic representation of words

These components are common for all the pipelines we will use.

In [None]:
def get_generic_base_pipeline():
  """Common components used in all pipelines"""
  document_assembler = nlp.DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("document")

  sentence_detector = nlp.SentenceDetector()\
      .setInputCols(["document"])\
      .setOutputCol("sentence")
  
  tokenizer = nlp.Tokenizer()\
      .setInputCols(["sentence"])\
      .setOutputCol("token")

  embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
      .setInputCols(["sentence", "token"])\
      .setOutputCol("embeddings")

  base_pipeline = nlp.Pipeline(stages=[
      document_assembler,
      sentence_detector,
      tokenizer,
      embeddings
  ])

  return base_pipeline
    
generic_base_pipeline = get_generic_base_pipeline()

bert_embeddings_sec_bert_base download started this may take some time.
Approximate size to download 390.4 MB
[OK!]


In [None]:
# Text Classifier
def get_text_classification_pipeline(model):
  """This pipeline allows you to use different classification models to understand if an input text is of a specific class or is something else.
  It will be used to check where the first summary page of SEC10K is, where the sections of Acquisitions and Subsidiaries are, or where in the document
  the management roles and experiences are mentioned"""
  documentAssembler = nlp.DocumentAssembler() \
       .setInputCol("text") \
       .setOutputCol("document")

  useEmbeddings = nlp.UniversalSentenceEncoder.pretrained() \
      .setInputCols("document") \
      .setOutputCol("sentence_embeddings")

  docClassifier = finance.ClassifierDLModel.pretrained(model, "en", "finance/models")\
      .setInputCols(["sentence_embeddings"])\
      .setOutputCol("category")

  nlpPipeline = nlp.Pipeline(stages=[
      documentAssembler, 
      useEmbeddings,
      docClassifier])
  
  return nlpPipeline

# Sample Texts from Cadence Design Systems
Examples taken from publicly available information about Cadence in SEC's Edgar database [here](https://www.sec.gov/Archives/edgar/data/813672/000081367222000012/cdns-20220101.htm) and [Wikipedia](https://en.wikipedia.org/wiki/Cadence_Design_Systems)

In [None]:
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings_JSL/Finance/data/cdns-20220101.html.txt

--2022-12-01 14:45:43--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings_JSL/Finance/data/cdns-20220101.html.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 347392 (339K) [text/plain]
Saving to: ‘cdns-20220101.html.txt’


2022-12-01 14:45:43 (9.48 MB/s) - ‘cdns-20220101.html.txt’ saved [347392/347392]



In [None]:
with open('cdns-20220101.html.txt', 'r') as f:
  cadence_sec10k = f.read()
print(cadence_sec10k[:100])

Table of Contents
UNITED STATES SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
__________


In [None]:
pages = [x for x in cadence_sec10k.split("Table of Contents") if x.strip() != '']
print(pages[0])


UNITED STATES SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
_____________________________________ 
FORM 10-K 
_____________________________________  
(Mark One)
☒
ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
For the fiscal year ended January 1, 2022 
OR
☐
TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
For the transition period from _________ to_________.

Commission file number 000-15867 
_____________________________________
 
CADENCE DESIGN SYSTEMS, INC. 
(Exact name of registrant as specified in its charter)
____________________________________ 
Delaware
 
00-0000000
(State or Other Jurisdiction ofIncorporation or Organization)
 
(I.R.S. EmployerIdentification No.)
2655 Seely Avenue, Building 5,
San Jose,
California
 
95134
(Address of Principal Executive Offices)
 
(Zip Code)
(408)
-943-1234 
(Registrant’s Telephone Number, including Area Code) 
Securities registered pursuant to Section 1

## Using Text Classification to find Relevant Parts of the Document: 10K Summary
In this case, we know page 0 is always the page with summary information about the company. However, let's suppose we don't know it. We can use Page Classification.

To check the SEC 10K Summary page, we have a specific model called `"finclf_form_10k_summary_item"`

In [None]:
from johnsnowlabs import finance

In [None]:
classification_pipeline = get_text_classification_pipeline('finclf_form_10k_summary_item')
df = spark.createDataFrame([[pages[0]]]).toDF("text")
model = classification_pipeline.fit(df)
result = model.transform(df)

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]
finclf_form_10k_summary_item download started this may take some time.
[OK!]


In [None]:
result.select('category.result').show()

+------------------+
|            result|
+------------------+
|[form_10k_summary]|
+------------------+



Confirmed, page 0 is where the 10K summary is!

## NER: Named Entity Recognition on 10K Summary
NER is the main component for information extraction: it detects and labels the entities mentioned in a text.

This time we will use a model trained to extract many entities from 10K summaries.

In [None]:
summary_sample_text = pages[0]

In [None]:
ner_model_sec10k = finance.NerModel.pretrained("finner_sec_10k_summary", "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner_summary")

ner_converter_sec10k = nlp.NerConverter()\
    .setInputCols(["sentence","token","ner_summary"])\
    .setOutputCol("ner_chunk_sec10k")

summary_pipeline = nlp.Pipeline(stages=[
    generic_base_pipeline,
    ner_model_sec10k,
    ner_converter_sec10k
])

finner_sec_10k_summary download started this may take some time.
[OK!]


Let's visualize the entities with Spark NLP Visualizer

In [None]:
from sparknlp_display import NerVisualizer

ner_vis = nlp.viz.NerVisualizer()

empty_data = spark.createDataFrame([[""]]).toDF("text")

summary_model = summary_pipeline.fit(empty_data)

light_summary_model = nlp.LightPipeline(summary_model)

summary_results = light_summary_model.fullAnnotate(summary_sample_text)

In [None]:
summary_results

[{'ner_summary': [Annotation(named_entity, 1, 6, O, {'word': 'UNITED', 'confidence': '0.9819', 'sentence': '0'}),
   Annotation(named_entity, 8, 13, O, {'word': 'STATES', 'confidence': '0.9875', 'sentence': '0'}),
   Annotation(named_entity, 15, 24, O, {'word': 'SECURITIES', 'confidence': '0.9997', 'sentence': '0'}),
   Annotation(named_entity, 26, 28, O, {'word': 'AND', 'confidence': '0.9754', 'sentence': '0'}),
   Annotation(named_entity, 30, 37, O, {'word': 'EXCHANGE', 'confidence': '0.9886', 'sentence': '0'}),
   Annotation(named_entity, 39, 48, O, {'word': 'COMMISSION', 'confidence': '0.9853', 'sentence': '0'}),
   Annotation(named_entity, 50, 59, O, {'word': 'Washington', 'confidence': '0.9683', 'sentence': '0'}),
   Annotation(named_entity, 60, 60, O, {'word': ',', 'confidence': '0.9622', 'sentence': '0'}),
   Annotation(named_entity, 62, 64, O, {'word': 'D.C', 'confidence': '0.8551', 'sentence': '0'}),
   Annotation(named_entity, 65, 65, O, {'word': '.', 'confidence': '0.9926',

### Visualize Results

In [None]:
for r in summary_results:
    displayHTML(ner_vis.display(r, label_col = "ner_chunk_sec10k", document_col = "document", return_html=True))

## First, let's extract the Organization from NER results

We create a new graph

In [None]:
G.clear()
G.nodes()

NodeView(())

We extract the organization (entity 'ORG' in the NER results)

In [None]:
ORG = next(filter(lambda x: x.metadata['entity']=='ORG', summary_results[0]['ner_chunk_sec10k'])).result
ORG

'CADENCE DESIGN SYSTEMS, INC'

We add our first node to the graph

In [None]:
# Add our main organization as the first node (its position is computed later by the layout)
G.add_node(ORG, attr_dict={'entity': 'ORG'})

In [None]:
show_graph_in_plotly(G)

Then, let's add all the summary information from SEC 10K filings (1st page) to that organization.

We can create nodes and add a relation to Cadence directly, since we know it's information of that company.

In [None]:
for i, r in enumerate(summary_results[0]['ner_chunk_sec10k']):
  text = r.result
  entity = r.metadata['entity']
  
  if entity == 'ORG':
    continue #Already added
  G.add_node(text, attr_dict={'entity': entity})
  G.add_edge(ORG, text, attr_dict={'relation': 'has_' + entity.lower()})  

In [None]:
show_graph_in_plotly(G)

## Normalizing the company name to query John Snow Labs datasources for more information about Cadence

Sometimes, texts refer to a company by a non-official, abbreviated name. For example, we can find `Cadence`, `Cadence Inc`, `Cadence, Inc`, or many other variations, whereas the official name of the company is `CADENCE DESIGN SYSTEMS INC`, as registered in SEC Edgar.

Normalizing a company name is super important for data quality purposes. It will help us:
- Standardize the data, improving the quality;
- Carry out additional verifications;
- Join different databases or extract data from external sources;
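
To see why a dedicated resolver helps, consider a naive baseline. This is only an illustration, not the John Snow Labs approach, and `naive_normalize` is a hypothetical helper: it upper-cases the name and strips punctuation. That is enough for the punctuation variant, but it cannot recover the official name from an abbreviation.

```python
import re


def naive_normalize(name):
    """Upper-case a company name and replace punctuation with spaces (illustrative baseline)."""
    cleaned = re.sub(r"[^\w\s]", " ", name.upper())
    return " ".join(cleaned.split())  # collapse repeated whitespace


official = "CADENCE DESIGN SYSTEMS INC"
for variant in ["CADENCE DESIGN SYSTEMS, INC", "Cadence, Inc", "Cadence"]:
    normalized = naive_normalize(variant)
    print(f"{variant!r} -> {normalized!r} (matches official: {normalized == official})")
```

Only the first variant matches after this kind of string clean-up, which is why the next cell relies on embeddings and an entity resolver instead.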

In [None]:
documentAssembler = nlp.DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("ner_chunk")

embeddings = nlp.UniversalSentenceEncoder.pretrained("tfhub_use", "en") \
      .setInputCols("ner_chunk") \
      .setOutputCol("sentence_embeddings")
    
resolver = finance.SentenceEntityResolverModel.pretrained("finel_edgar_company_name", "en", "finance/models")\
      .setInputCols(["sentence_embeddings"]) \
      .setOutputCol("normalization")\
      .setDistanceFunction("EUCLIDEAN")

pipelineModel = nlp.PipelineModel(
      stages = [
          documentAssembler,
          embeddings,
          resolver])

lp = nlp.LightPipeline(pipelineModel)

normalized_org = lp.fullAnnotate(ORG)
normalized_org

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]
finel_edgar_company_name download started this may take some time.
[OK!]


[{'ner_chunk': [Annotation(document, 0, 26, CADENCE DESIGN SYSTEMS, INC, {})],
  'sentence_embeddings': [Annotation(sentence_embeddings, 0, 26, CADENCE DESIGN SYSTEMS, INC, {'sentence': '0', 'token': 'CADENCE DESIGN SYSTEMS, INC', 'pieceId': '-1', 'isWordStart': 'true'})],
  'normalization': [Annotation(entity, 0, 26, CADENCE DESIGN SYSTEMS INC, {'all_k_results': 'CADENCE DESIGN SYSTEMS INC:::DESIGN WITHIN REACH INC:::AVICI SYSTEMS INC:::HLM DESIGN INC:::NanoWatt Design Inc:::DELTEK SYSTEMS INC:::EPILOG IMAGING SYSTEMS INC', 'all_k_distances': '0.0000:::0.6361:::0.6418:::0.6574:::0.6664:::0.6743:::0.6766', 'confidence': '0.2436', 'all_k_cosine_distances': '0.0000:::0.2023:::0.2060:::0.2161:::0.2220:::0.2273:::0.2289', 'all_k_resolutions': 'CADENCE DESIGN SYSTEMS INC:::DESIGN WITHIN REACH INC:::AVICI SYSTEMS INC:::HLM DESIGN INC:::NanoWatt Design Inc:::DELTEK SYSTEMS INC:::EPILOG IMAGING SYSTEMS INC', 'target_text': 'CADENCE DESIGN SYSTEMS, INC', 'all_k_aux_labels': '770148231:::9433143

In [None]:
NORM_ORG = normalized_org[0]['normalization'][0].result
NORM_ORG

'CADENCE DESIGN SYSTEMS INC'
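
The resolver metadata printed above packs the ranked candidates into `:::`-separated strings. The sketch below unpacks them with a hypothetical `unpack_candidates` helper, using the first three values copied from that (truncated) output. It also prints half the squared Euclidean distance next to the cosine distance: for these numbers the two agree, which is what one would expect if the embeddings are unit-normalized. That is an observation about this output, not documented behavior.

```python
# Values copied from the first three candidates in the resolver output above.
metadata = {
    "all_k_results": "CADENCE DESIGN SYSTEMS INC:::DESIGN WITHIN REACH INC:::AVICI SYSTEMS INC",
    "all_k_distances": "0.0000:::0.6361:::0.6418",
    "all_k_cosine_distances": "0.0000:::0.2023:::0.2060",
}


def unpack_candidates(meta, sep=":::"):
    """Return a list of (name, euclidean_distance, cosine_distance) tuples."""
    names = meta["all_k_results"].split(sep)
    euclidean = [float(d) for d in meta["all_k_distances"].split(sep)]
    cosine = [float(d) for d in meta["all_k_cosine_distances"].split(sep)]
    return list(zip(names, euclidean, cosine))


ranked = unpack_candidates(metadata)
for name, euclidean, cosine in ranked:
    print(f"{name:28s} euclidean={euclidean:.4f} cosine={cosine:.4f} "
          f"euclidean^2/2={euclidean ** 2 / 2:.4f}")
```

Having the full candidate list at hand makes it easy to reject a resolution when the best distance is too large or the runner-up is nearly as close.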

### NORMALIZED NAME
In Edgar, the company's official name is different! We need to obtain it before we can augment our data with external information from EDGAR.

- Incorrect: `CADENCE DESIGN SYSTEMS, INC`
- Correct (Official): `CADENCE DESIGN SYSTEMS INC`

In [None]:
G.add_node(NORM_ORG, attr_dict={'entity': 'ORG'})
G.add_edge(ORG, NORM_ORG, attr_dict={'relation': 'has_official_name'})  

## DATA AUGMENTATION WITH CHUNK MAPPER

Once we have the normalized name of the company, we can use `John Snow Labs Chunk Mappers`. These are pretrained data sources, which are updated frequently and can be queried inside Spark NLP without sending any API call to any server.

In this case, we will use Edgar Database (`finmapper_edgar_companyname`)

In [None]:
documentAssembler = nlp.DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")

chunkAssembler = nlp.Doc2Chunk() \
    .setInputCols("document") \
    .setOutputCol("chunk") \
    .setIsArray(False)

CM = finance.ChunkMapperModel()\
      .pretrained("finmapper_edgar_companyname", "en", "finance/models")\
      .setInputCols(["chunk"])\
      .setOutputCol("mappings")
      
cm_pipeline = nlp.Pipeline(stages=[documentAssembler, chunkAssembler, CM])
fit_cm_pipeline = cm_pipeline.fit(empty_data)

df = spark.createDataFrame([[NORM_ORG]]).toDF("text")
r = fit_cm_pipeline.transform(df).collect()

This is the information we got from that Chunk Mapper about Cadence.

In [None]:
mappings = r[0]['mappings']
for mapping in mappings:
  text = mapping.result
  relation = mapping.metadata['relation']
  print(f"{ORG} - has_{relation} - {text}")
    
  G.add_node(text, attr_dict={'entity': relation})
  G.add_edge(ORG, text, attr_dict={'relation': 'has_' + relation.lower()})  

In [None]:
show_graph_in_plotly(G)

## NER and Relation Extraction
On its own, NER only extracts isolated entities. By combining NER models with Relation Extraction annotators trained for those entities, you can determine whether the entities are related to each other.

Let's suppose we want to extract information about Acquisitions and Subsidiaries. If we don't know where that information is in the document, we can again use our Text Classifiers to find it.

## Using Text Classification to find Relevant Parts of the Document: Acquisitions and Subsidiaries
To find the pages about Acquisitions and Subsidiaries, we have a specific model called `"finclf_acquisitions_item"`

Let's send some pages and check which one(s) contain that information. In a real case, you could send all the pages to the model; here, to save time, we will only send a subset.

In [None]:
candidates = [[pages[0]], [pages[1]], [pages[35]], [pages[50]], [pages[67]]] # Some examples

In [None]:
classification_pipeline = get_text_classification_pipeline('finclf_acquisitions_item')
df = spark.createDataFrame(candidates).toDF("text")
model = classification_pipeline.fit(df)
result = model.transform(df)

In [None]:
result.select('category.result').show()

### Acquisitions, Subsidiaries and Former Names
Let's use some NER models to obtain information about Organizations and Dates, and understand if:
- An ORG was acquired by another ORG
- An ORG is a subsidiary of another ORG
- An ORG name is an alias / abbreviation / acronym / etc of another ORG

We will use the detected `pages[67]` as input

In [None]:
ner_model_date = finance.NerModel.pretrained("finner_sec_dates", "en", "finance/models")\
        .setInputCols(["sentence", "token", "embeddings"])\
        .setOutputCol("ner_dates")

ner_converter_date = nlp.NerConverter()\
        .setInputCols(["sentence","token","ner_dates"])\
        .setOutputCol("ner_chunk_date")

ner_model_org= finance.NerModel.pretrained("finner_orgs_prods_alias", "en", "finance/models")\
        .setInputCols(["sentence", "token", "embeddings"])\
        .setOutputCol("ner_orgs")

ner_converter_org = nlp.NerConverter()\
        .setInputCols(["sentence","token","ner_orgs"])\
        .setOutputCol("ner_chunk_org")\
        .setWhiteList(['ORG', 'PRODUCT', 'ALIAS'])

chunk_merger = finance.ChunkMergeApproach()\
        .setInputCols('ner_chunk_org', "ner_chunk_date")\
        .setOutputCol('ner_chunk')

pos = nlp.PerceptronModel.pretrained()\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("pos")

dependency_parser = nlp.DependencyParserModel().pretrained("dependency_conllu", "en")\
    .setInputCols(["sentence", "pos", "token"])\
    .setOutputCol("dependencies")

re_filter = finance.RENerChunksFilter()\
    .setInputCols(["ner_chunk", "dependencies"])\
    .setOutputCol("re_ner_chunk")\
    .setRelationPairs(["ORG-ORG", "ORG-DATE", "ORG-ROLE", "ROLE-DATE"])\
    .setMaxSyntacticDistance(10)

re_filter_alias = finance.RENerChunksFilter()\
    .setInputCols(["ner_chunk", "dependencies"])\
    .setOutputCol("re_ner_chunk_alias")\
    .setRelationPairs(["ORG-ALIAS"])\
    .setMaxSyntacticDistance(5)

reDL = finance.RelationExtractionDLModel().pretrained('finre_acquisitions_subsidiaries_md', 'en', 'finance/models')\
    .setInputCols(["re_ner_chunk", "sentence"])\
    .setOutputCol("relations_acq")\
    .setPredictionThreshold(0.1)

reDL_alias = finance.RelationExtractionDLModel()\
    .pretrained("finre_org_prod_alias", "en", "finance/models")\
    .setPredictionThreshold(0.8)\
    .setInputCols(["re_ner_chunk_alias", "sentence"])\
    .setOutputCol("relations_alias")

annotation_merger = finance.AnnotationMerger()\
    .setInputCols("relations_acq", "relations_alias")\
    .setOutputCol("relations")

nlpPipeline = nlp.Pipeline(stages=[
        generic_base_pipeline,
        ner_model_date,
        ner_converter_date,
        ner_model_org,
        ner_converter_org,
        chunk_merger,
        pos,
        dependency_parser,
        re_filter,
        re_filter_alias,
        reDL,
        reDL_alias,
        annotation_merger])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

light_model = nlp.LightPipeline(model)

In [None]:
sample_text = pages[67].replace("“", "\"").replace("”", "\"")

In [None]:
result = light_model.fullAnnotate(sample_text)
rel_df = get_relations_df(result)

### Visualize Results

In [None]:
from sparknlp_display import RelationExtractionVisualizer

re_vis = nlp.viz.RelationExtractionVisualizer()
displayHTML(re_vis.display(result = result[0], relation_col = "relations", document_col = "document", exclude_relations = ["other", "no_rel"], return_html=True, show_relations=True))

### Inserting Nodes (Tags) and Relations into a Graph
Now, with entities and Relations connecting them, we can start populating the Graph of the company.

In [None]:
for t in rel_df.itertuples():
  relation = t.relation
  
  if relation in ['other', 'no_rel']:
    continue
  
  entity1 = t.entity1
  chunk1 = t.chunk1
  entity2 = t.entity2
  chunk2 = t.chunk2
  G.add_node(chunk1,  attr_dict={'entity': entity1})
  G.add_node(chunk2,  attr_dict={'entity': entity2})
  
  G.add_edge(ORG, chunk1, attr_dict={'relation': 'mentions_' + entity1.lower()})  
  G.add_edge(ORG, chunk2, attr_dict={'relation': 'mentions_' + entity2.lower()})  
  
  G.add_edge(chunk1, chunk2, attr_dict={'relation': relation.lower()})  
  

In [None]:
show_graph_in_plotly(G)

## People's Information
Let's also extract people's names with their current roles and past experiences in other companies (including the dates).

## Using Text Classification to find Relevant Parts of the Document: About Management and their work experience
To find the pages about management and their work experience, we have a specific model called `"finclf_work_experience_item"`

Let's send some pages and check which one(s) contain that information. In a real case, you could send all the pages to the model; here, to save time, we will only send a subset.

In [None]:
candidates = [[pages[4]], [pages[84]], [pages[85]], [pages[86]], [pages[87]]]

In [None]:
classification_pipeline = get_text_classification_pipeline('finclf_work_experience_item')
df = spark.createDataFrame(candidates).toDF("text")
model = classification_pipeline.fit(df)
result = model.transform(df)
result.select('category.result').show()

**We have some Work Experience on page 86. There is also one very relevant sentence hidden in page 4.**
However, the model returned `other` for that page. Why?

In [None]:
pages[4]

Exploring the page, we can see that it also contains a lot of text about other topics. Sometimes it is necessary to work at a finer granularity.

Let's see what happens if we use `paragraphs` instead of `pages`.

In [None]:
paragraphs = [x for x in pages[4].split('\n') if x.strip() != '']

In [None]:
paragraphs

In [None]:
candidates = [[x] for x in paragraphs]

In [None]:
classification_pipeline = get_text_classification_pipeline('finclf_work_experience_item')
df = spark.createDataFrame(candidates).toDF("text")
model = classification_pipeline.fit(df)
result = model.transform(df)
result.select('category.result').show()

**Here we are: if we split the text into smaller units (paragraphs, lines), we can find more information than at page level!**

This is because the information in embeddings gets diluted as the text gets longer. Also, there are input length restrictions (512 tokens in BERT).
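
When classifying at paragraph level over many pages, it helps to remember where each paragraph came from. Below is a minimal sketch in plain Python, using a hypothetical `split_into_paragraphs` helper and made-up toy pages: it keeps the page and paragraph index next to each text so that a positive classification can be traced back to its source.

```python
def split_into_paragraphs(pages):
    """Return (page_index, paragraph_index, text) rows, skipping blank lines."""
    rows = []
    for page_idx, page in enumerate(pages):
        non_empty = [line.strip() for line in page.split("\n") if line.strip()]
        for par_idx, text in enumerate(non_empty):
            rows.append((page_idx, par_idx, text))
    return rows


# Toy input, invented for illustration; in the notebook this would be `pages`.
toy_pages = [
    "Item 1. Business\n\nWe design software.",
    "Executive Officers\nJane Doe has served as CEO since 2020.",
]
rows = split_into_paragraphs(toy_pages)
for row in rows:
    print(row)

# Assumption: rows like these could then be loaded with
# spark.createDataFrame(rows).toDF("page", "paragraph", "text")
# and passed through the same classification pipeline.
```

Carrying the indices along costs nothing and avoids having to search the document again for the paragraph the classifier flagged.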

In [None]:
ner_model_role = finance.NerModel.pretrained("finner_org_per_role_date", "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner_role")

ner_converter_role = nlp.NerConverter()\
    .setInputCols(["sentence","token","ner_role"])\
    .setOutputCol("ner_chunk_role")

pos = nlp.PerceptronModel.pretrained()\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("pos")

dependency_parser = nlp.DependencyParserModel().pretrained("dependency_conllu", "en")\
    .setInputCols(["sentence", "pos", "token"])\
    .setOutputCol("dependencies")

re_ner_chunk_filter_role = finance.RENerChunksFilter()\
    .setInputCols(["ner_chunk_role", "dependencies"])\
    .setOutputCol("re_ner_chunk_role")\
    .setRelationPairs(["PERSON-ROLE", "ORG-ROLE", "DATE-ROLE"])

re_model_exp = finance.RelationExtractionDLModel.pretrained("finre_work_experience_md", "en", "finance/models")\
    .setInputCols(["re_ner_chunk_role", "sentence"])\
    .setOutputCol("relations")

nlpPipeline = nlp.Pipeline(stages=[
    generic_base_pipeline,
    ner_model_role,
    ner_converter_role,
    pos,
    dependency_parser,
    re_ner_chunk_filter_role,
    re_model_exp,
])


model = nlpPipeline.fit(empty_data)
light_model = nlp.LightPipeline(model)

## Get Results

In [None]:
sample_text = candidates[9]
sample_text

In [None]:
result = light_model.fullAnnotate(sample_text)
rel_df = get_relations_df(result)
rel_df[rel_df["relation"] != "other"]

| | relation | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | has_role_from | DATE | 3 | 19 | December 15, 2021 | ROLE | 57 | 65 | President | 0.82543 |
| 1 | has_role_from | DATE | 3 | 19 | December 15, 2021 | ROLE | 71 | 93 | Chief Executive Officer | 0.96655935 |
| 2 | has_role | PERSON | 22 | 35 | Anirudh Devgan | ROLE | 57 | 65 | President | 0.99948406 |
| 3 | has_role | PERSON | 22 | 35 | Anirudh Devgan | ROLE | 71 | 93 | Chief Executive Officer | 0.99911696 |
| 4 | has_role_in_company | ROLE | 57 | 65 | President | ORG | 98 | 104 | Cadence | 0.9898419 |
| 5 | has_role_in_company | ROLE | 71 | 93 | Chief Executive Officer | ORG | 98 | 104 | Cadence | 0.9961314 |
| 6 | has_role | ROLE | 150 | 172 | Chief Executive Officer | PERSON | 175 | 184 | Dr. Devgan | 0.98781955 |
| 7 | has_role_in_company | ROLE | 150 | 172 | Chief Executive Officer | ORG | 209 | 215 | Cadence | 0.9821989 |
| 8 | has_role | PERSON | 175 | 184 | Dr. Devgan | ROLE | 196 | 204 | President | 0.99987555 |
| 9 | has_role_in_company | ROLE | 196 | 204 | President | ORG | 209 | 215 | Cadence | 0.9999733 |
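
Before adding these relations to the graph, you may want to drop low-confidence rows and merge repeated triples. The sketch below uses a hypothetical `to_triples` helper on a few rows copied from the output above. The confidence values are written as strings because they come from annotation metadata, so they are cast to `float` first. The 0.9 threshold is an arbitrary assumption, not a recommended value.

```python
# (relation, chunk1, chunk2, confidence) rows copied from the output above.
rows = [
    ("has_role_from", "December 15, 2021", "President", "0.82543"),
    ("has_role_from", "December 15, 2021", "Chief Executive Officer", "0.96655935"),
    ("has_role", "Anirudh Devgan", "President", "0.99948406"),
    ("has_role_in_company", "President", "Cadence", "0.9898419"),
    ("has_role_in_company", "President", "Cadence", "0.9999733"),
]


def to_triples(rows, min_confidence=0.9, skip=("other", "no_rel")):
    """Map (chunk1, relation, chunk2) -> best confidence, filtering weak rows."""
    best = {}
    for relation, chunk1, chunk2, confidence in rows:
        score = float(confidence)  # metadata values are strings
        if relation in skip or score < min_confidence:
            continue
        key = (chunk1, relation, chunk2)
        best[key] = max(score, best.get(key, 0.0))
    return best


triples = to_triples(rows)
for (subject, relation, obj), score in triples.items():
    print(f"{subject} -[{relation}]-> {obj} ({score:.4f})")
```

Each remaining key maps directly onto one `G.add_edge(chunk1, chunk2, ...)` call, so the graph gets one edge per distinct relation.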


## Visualize Results

In [None]:
from sparknlp_display import RelationExtractionVisualizer

re_vis = nlp.viz.RelationExtractionVisualizer()
displayHTML(re_vis.display(result = result[0], relation_col = "relations", document_col = "document", exclude_relations = ["other"], return_html=True, show_relations=True))

## Adding to graph

In [None]:
for t in rel_df.itertuples():
  relation = t.relation
  if relation == 'other':
    continue
  entity1 = t.entity1
  chunk1 = t.chunk1
  entity2 = t.entity2
  chunk2 = t.chunk2
  G.add_node(chunk1,  attr_dict={'entity': entity1})
  G.add_node(chunk2,  attr_dict={'entity': entity2})
  
  G.add_edge(ORG, chunk1, attr_dict={'relation': 'mentions_' + entity1.lower()})  
  G.add_edge(ORG, chunk2, attr_dict={'relation': 'mentions_' + entity2.lower()})  
  
  G.add_edge(chunk1, chunk2, attr_dict={'relation': relation.lower()})  
  

In [None]:
show_graph_in_plotly(G)

# Understanding the context of mentioned companies to identify COMPETITORS
Many companies may be mentioned in the report. Most of them are just organizations in Cadence's ecosystem; others may be competitors.

We can analyze the surrounding context of the extracted `ORG` to check if they are competitors or not.

In [None]:
ner = finance.NerModel.pretrained("finner_orgs_prods_alias", "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")\
    .setWhiteList(['ORG', 'PRODUCT'])

assertion = finance.AssertionDLModel.pretrained("finassertion_competitors", "en", "finance/models")\
    .setInputCols(["sentence", "ner_chunk", "embeddings"])\
    .setOutputCol("assertion")

nlpPipeline = nlp.Pipeline(stages=[
    generic_base_pipeline,
    ner,
    ner_converter,
    assertion
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

light_model = nlp.LightPipeline(model)

### Get Results

In [None]:
sample_text = ["""In the rapidly evolving market, certain elements of our application compete with Microsoft, Google, InFocus, Bluescape, Mersive, Barco, Nureva and Prysm. But, Oracle  and IBM are out of our league."""]

chunks=[]
entities=[]
status=[]


light_result = light_model.fullAnnotate(sample_text)[0]

for n,m in zip(light_result['ner_chunk'],light_result['assertion']):
    chunks.append(n.result)
    entities.append(n.metadata['entity']) 
    status.append(m.result)

df = pd.DataFrame({'chunks':chunks, 'entities':entities, 'assertion':status})

In [None]:
df

| | chunks | entities | assertion |
|---|---|---|---|
| 0 | Microsoft | ORG | COMPETITOR |
| 1 | Google | ORG | COMPETITOR |
| 2 | InFocus | ORG | COMPETITOR |
| 3 | Bluescape | ORG | COMPETITOR |
| 4 | Mersive | ORG | COMPETITOR |
| 5 | Barco | ORG | COMPETITOR |
| 6 | Nureva | ORG | COMPETITOR |
| 7 | Prysm | ORG | COMPETITOR |
| 8 | Oracle | ORG | NO_COMPETITOR |
| 9 | IBM | ORG | NO_COMPETITOR |


### Visualize Assertion Result

In [None]:
vis = nlp.viz.AssertionVisualizer()

vis.set_label_colors({'COMPETITOR':'#008080', 'NO_COMPETITOR':'#800080'})
    
light_result = light_model.fullAnnotate(sample_text)[0]

displayHTML(vis.display(light_result, 'ner_chunk', 'assertion', return_html=True))


### Adding it to the graph

In [None]:
for t in df.itertuples():
  chunks = t.chunks
  entities = t.entities
  assertion = t.assertion

  G.add_node(chunks,  attr_dict={'entity': entities})
  
  G.add_edge(ORG, chunks, attr_dict={'relation': 'is_' + assertion.lower()})
  

In [None]:
show_graph_in_plotly(G)
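
Once the assertion labels are stored as edge relations, the competitors can be read back with a simple filter. The snippet below is a minimal sketch that uses plain tuples instead of the notebook's `G` and `df`; the value `"Cadence"` is only a stand-in for the `ORG` variable, and the rows are copied from the result table above.

```python
# Rows copied from the assertion result table: (chunk, entity, assertion).
rows = [
    ("Microsoft", "ORG", "COMPETITOR"),
    ("Google", "ORG", "COMPETITOR"),
    ("InFocus", "ORG", "COMPETITOR"),
    ("Bluescape", "ORG", "COMPETITOR"),
    ("Mersive", "ORG", "COMPETITOR"),
    ("Barco", "ORG", "COMPETITOR"),
    ("Nureva", "ORG", "COMPETITOR"),
    ("Prysm", "ORG", "COMPETITOR"),
    ("Oracle", "ORG", "NO_COMPETITOR"),
    ("IBM", "ORG", "NO_COMPETITOR"),
]

org = "Cadence"  # assumption: stand-in for the notebook's ORG variable

# Build (source, target, relation) triples with the same label scheme as above.
edges = [(org, chunk, "is_" + assertion.lower()) for chunk, _, assertion in rows]

# Match the relation exactly: a substring test on "competitor" would also
# match "is_no_competitor".
competitors = [target for _, target, relation in edges if relation == "is_competitor"]

print(competitors)
```

The exact comparison matters here because both labels contain the word "competitor".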

# Annex 1: Detecting Temporality and Certainty in Affirmations

In [None]:
ner_model_role = finance.NerModel.pretrained("finner_org_per_role_date", "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner_role")

ner_converter_role = nlp.NerConverter()\
    .setInputCols(["sentence","token","ner_role"])\
    .setOutputCol("ner_chunk")

assertion = finance.AssertionDLModel.pretrained("finassertion_time", "en", "finance/models")\
    .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
    .setOutputCol("assertion")\
    .setMaxSentLen(1200)

assertion_pipeline = nlp.Pipeline(stages=[
    generic_base_pipeline,
    ner_model_role,
    ner_converter_role,
    assertion
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

assertion_model = assertion_pipeline.fit(empty_data)

light_model_assertion = nlp.LightPipeline(assertion_model)

### Get Result

In [None]:
sample_text = ["""Joseph Costello was the CEO of the company since founded in 1988 until 1997. He was followed by Lip-Bu Tan for the 2009–2021 period. Currently, Anirudh Devgan is the CEO since 2021""",
              
              """In 2007, Cadence was rumored to be in talks with Kohlberg Kravis Roberts and Blackstone Group regarding a possible sale of the company.""",
              """In 2008, Cadence withdrew a $1.6 billion offer to purchase rival Mentor Graphics.""",
              
               """ The Cadence Giving Foundation will also support critical needs in areas such as diversity, equity and inclusion, environmental sustainability and STEM education.""",
              """This stand-alone, non-profit foundation will partner with other charitable initiatives to support critical needs in areas such as diversity, equity and inclusion, environmental sustainability and science, technology, engineering, and mathematics (“STEM”) education""",
              
              """Cadence employees could purchase common stock at a price equal to 85% of the lower of the fair market value at the beginning or the end of the applicable offering period"""]

chunks=[]               
entities=[]
status=[]

light_results = light_model_assertion.fullAnnotate(sample_text)

for light_result in light_results:
  for n,m in zip(light_result['ner_chunk'], light_result['assertion']):
      chunks.append(n.result)
      entities.append(n.metadata['entity']) 
      status.append(m.result)

df = pd.DataFrame({'chunks':chunks, 'entities':entities, 'assertion':status})

In [None]:
df

| | chunks | entities | assertion |
|---|---|---|---|
| 0 | Joseph Costello | PERSON | PAST |
| 1 | CEO | ROLE | PAST |
| 2 | 1988 | DATE | PAST |
| 3 | 1997 | DATE | PAST |
| 4 | 2009–2021 | DATE | PAST |
| 5 | Anirudh Devgan | PERSON | PRESENT |
| 6 | CEO | ROLE | PRESENT |
| 7 | 2021 | DATE | PRESENT |
| 8 | 2007 | DATE | PAST |
| 9 | Cadence | ORG | PAST |


### Visualize Assertion Result

In [None]:
vis = nlp.viz.AssertionVisualizer()

for light_result in light_results:
  displayHTML(vis.display(light_result, 'ner_chunk', 'assertion', return_html=True))
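
One possible use of these labels is to separate current facts from historical ones before they are added to a graph. The snippet below is a minimal sketch under that assumption; it works on plain tuples copied from the first rows of the table above rather than on the notebook's `df`.

```python
from collections import defaultdict

# Rows copied from the assertion result table: (chunk, entity, assertion).
rows = [
    ("Joseph Costello", "PERSON", "PAST"),
    ("CEO", "ROLE", "PAST"),
    ("1988", "DATE", "PAST"),
    ("1997", "DATE", "PAST"),
    ("2009–2021", "DATE", "PAST"),
    ("Anirudh Devgan", "PERSON", "PRESENT"),
    ("CEO", "ROLE", "PRESENT"),
    ("2021", "DATE", "PRESENT"),
    ("2007", "DATE", "PAST"),
    ("Cadence", "ORG", "PAST"),
]

# Group the extracted chunks by their temporality label.
by_status = defaultdict(list)
for chunk, entity, assertion in rows:
    by_status[assertion].append((chunk, entity))

# Keep only the people mentioned in a present-tense context.
current_people = [chunk for chunk, entity in by_status["PRESENT"] if entity == "PERSON"]

print(current_people)
```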


# About Graph Embeddings
We now have an example graph of a company. We can continue processing SEC and Wikidata information and populating the graph to get a better understanding of the company’s ecosystem.

After we are happy with the information contained in it, we can use Graph Embeddings to:

- Obtain a numerical representation of the company’s ecosystem.
- Compare company graphs and check for similarity between companies, for example for competition analysis.
- Compare specific nodes or specific edges and check for similarity between them, for example for new link prediction.
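
Similarity between embeddings is usually measured with cosine similarity, which is also the measure behind the `most_similar` calls used later in this section. The snippet below is a minimal sketch of the idea; the three-dimensional vectors and the names `company_a`, `company_b` and `company_c` are invented for illustration and are not output of any model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, made up for this example.
embeddings = {
    "company_a": [1.0, 0.0, 1.0],
    "company_b": [1.0, 0.1, 0.9],
    "company_c": [-1.0, 1.0, 0.0],
}

def most_similar(name, vectors):
    """Rank every other key by cosine similarity to `name`, best first."""
    ranked = [
        (other, cosine_similarity(vectors[name], vec))
        for other, vec in vectors.items()
        if other != name
    ]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)

ranking = most_similar("company_a", embeddings)
print(ranking)
```

Here `company_b` ranks first because its vector points in almost the same direction as that of `company_a`.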


In [None]:
!pip install node2vec

## Node Embeddings

In [None]:
# pip install node2vec
from node2vec import Node2Vec

# Generate walks
node2vec = Node2Vec(G, dimensions=20, walk_length=16, num_walks=100)
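
To give an intuition of what "generating walks" means, the snippet below is a minimal sketch of uniform random walks over a toy adjacency dictionary. It assumes the return and in-out parameters of node2vec are left at 1, in which case the biased walk reduces to a uniform one on an unweighted graph. The toy graph and the helper names `random_walk` and `generate_walks` are illustrative; this is not the library's implementation.

```python
import random

# Toy undirected graph as an adjacency dict (illustrative, not the notebook's G).
toy_graph = {
    "Cadence": ["AWR", "Mentor Graphics", "Anirudh Devgan"],
    "AWR": ["Cadence", "AWR Corporation"],
    "AWR Corporation": ["AWR"],
    "Mentor Graphics": ["Cadence"],
    "Anirudh Devgan": ["Cadence"],
}

def random_walk(graph, start, walk_length, rng):
    """Walk from `start`, picking a uniformly random neighbor at each step."""
    walk = [start]
    while len(walk) < walk_length:
        neighbors = graph[walk[-1]]
        if not neighbors:
            break  # dead end: stop early
        walk.append(rng.choice(neighbors))
    return walk

def generate_walks(graph, num_walks, walk_length, seed=0):
    """Start `num_walks` walks from every node, visiting nodes in shuffled order."""
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        nodes = list(graph)
        rng.shuffle(nodes)
        for node in nodes:
            walks.append(random_walk(graph, node, walk_length, rng))
    return walks

walks = generate_walks(toy_graph, num_walks=2, walk_length=5)
print(walks[0])
```

Each walk is then treated like a sentence of node names, which is what allows a word-embedding model to learn one vector per node.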

In [None]:
# Learn embeddings 
model = node2vec.fit(window=10, min_count=1)

In [None]:
model.wv.get_vector('AWR')

In [None]:
for node, _ in model.wv.most_similar('AWR'):
    print(node)

## Edge Embeddings

In [None]:
from node2vec.edges import HadamardEmbedder
edges_embs = HadamardEmbedder(keyed_vectors=model.wv)

In [None]:
edges_embs[('AWR', 'AWR Corporation')]

In [None]:
edges_kv = edges_embs.as_keyed_vectors()
edges_kv.most_similar(str(('AWR', 'AWR Corporation')))
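
The Hadamard embedder represents an edge as the element-wise product of the vectors of its two nodes. The snippet below is a minimal sketch of that operation in plain Python; the three-dimensional node vectors are invented for illustration, whereas the notebook's model uses 20 dimensions.

```python
def hadamard(u, v):
    """Element-wise product of two equal-length vectors."""
    return [a * b for a, b in zip(u, v)]

# Hypothetical node vectors, made up for this example.
node_vectors = {
    "AWR": [0.5, -1.0, 2.0],
    "AWR Corporation": [2.0, 0.5, 0.25],
}

edge_vector = hadamard(node_vectors["AWR"], node_vectors["AWR Corporation"])
print(edge_vector)  # [1.0, -0.5, 0.5]
```

Because the product is symmetric, the edge vector does not depend on the order of the two nodes, which suits an undirected graph.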

## Want to go further and analyze subgraph embeddings, for example to compare the subgraphs of different companies?
Check `https://github.com/benedekrozemberczki/graph2vec`, or get in touch with us at support@johnsnowlabs.com or in the #finance channel of our [Slack](https://www.johnsnowlabs.com/slack-redirect/).