# Financial Solution Accelerator: Drawing a Company Ecosystem Graph and Analyzing it with Embeddings
This accelerator will help you process Financial Annual Reports (10K filings) or even Wikipedia data about companies, using John Snow Labs Finance NLP **Named Entity Recognition, Relation Extraction and Assertion Status**, to extract the following information about companies:
- Information about the Company itself (`Trading Symbol`, `State`, `Address`, Contact Information) and other names the Company is known by (`alias`, `former name`).
- Other Companies mentioned in the report as `competitors`: we will also run a "Competitor check" to understand whether another company is simply part of the company's ecosystem / supply chain or is really a competitor.
- People (usually management and C-level) working in that company and their past experiences, including roles and companies
- `Acquisitions` events, including the acquisition dates. `Subsidiaries` mentioned.
- Temporality (`past`, `present`, `future`) and Certainty (`possible`) of events described, including `Forward-looking statements`.

John Snow Labs also provides offline modules for querying the EDGAR database (**Entity Linking** to resolve an organization name to its official name, and **Chunk Mappers** to map a normalized name to its EDGAR record), which are updated quarterly. We will use them to retrieve the `official name of a company`, its `former names`, the `dates when names were changed`, etc.

The final aim of this accelerator is to help you analyze company information...

<img src="https://github.com/JohnSnowLabs/spark-nlp-workshop/raw/master/tutorials/Certification_Trainings_JSL/Finance/data/im1.png" alt="drawing" width="600"/>

... create a graph...

<img src="https://github.com/JohnSnowLabs/spark-nlp-workshop/raw/master/tutorials/Certification_Trainings_JSL/Finance/data/img6.png" alt="drawing" width="400"/>

... and even run Graph Embeddings on top of the graph you extract (for example, to infer new relations to the green nodes given the grey ones in the picture):

<img src="https://github.com/JohnSnowLabs/spark-nlp-workshop/raw/master/tutorials/Certification_Trainings_JSL/Finance/data/im4.png" alt="drawing" width="400"/>

# Install Spark NLP

In [None]:
!pip install johnsnowlabs

In [None]:
from johnsnowlabs import *
jsl.install(json_license_path=[your_path_to_json_license])

# Starting a session

In [None]:
# Keep a handle to the Spark session: later cells call spark.createDataFrame(...)
spark = jsl.start(json_license_path=[your_path_to_json_license])

# Imports

In [None]:
import os
import sys
import time
import json
import functools 
import numpy as np
import pandas as pd
from tqdm import tqdm
from scipy import spatial

### NetworkX and Plotly aux functions 
We will use [NetworkX](https://networkx.org/) to store the graph and [Plotly](https://plotly.com/) to visualize it.

In [None]:
!pip install networkx plotly

In [None]:
import networkx as nx
G = nx.Graph()

In [None]:
import plotly.graph_objects as go
import random

def get_nodes_from_graph(graph, pos, node_color):
  node_x = []
  node_y = []
  texts = []
  hovers = []
  for node in graph.nodes():
    entity = graph.nodes[node]['attr_dict']['entity']
    x, y = pos[node]
    node_x.append(x)
    node_y.append(y)
    texts.append(node)
    hovers.append(entity)

  node_trace = go.Scatter(
    x=node_x, y=node_y, text=texts, hovertext=hovers,
    mode='markers+text',
    hoverinfo='text',
    marker=dict(
        color=node_color,
        size=40,
        line_width=2))
  
  return node_trace


def get_edges_from_graph(graph, pos, edge_color):
  edge_x = []
  edge_y = []
  hovers = []
  xtext = []
  ytext = []
  for edge in graph.edges():
    relation = graph.edges[edge]['attr_dict']['relation']
    x0, y0 = pos[edge[0]]
    x1, y1 = pos[edge[1]]
    edge_x.append(x0)
    edge_x.append(x1)
    edge_x.append(None)
    edge_y.append(y0)
    edge_y.append(y1)
    edge_y.append(None)
    hovers.append(relation)
    xtext.append((x0+x1)/2)
    ytext.append((y0+y1)/2)

  edge_trace = go.Scatter(
    x=edge_x, y=edge_y,
    line=dict(width=2, color=edge_color),
    mode='lines')
  
  labels_trace = go.Scatter(x=xtext,y= ytext, mode='text',
                              textfont = {'color': edge_color},
                              marker_size=0.5,
                              text=hovers,
                              textposition='top center',
                              hovertemplate='weight: %{text}<extra></extra>')
  return edge_trace, labels_trace


def show_graph_in_plotly(graph, node_color='white', edge_color='grey'):
  pos = nx.spring_layout(graph)
  node_trace = get_nodes_from_graph(graph, pos, node_color)
  edge_trace, labels_trace = get_edges_from_graph(graph, pos, edge_color)
  fig = go.Figure(data=[edge_trace, node_trace, labels_trace],
               layout=go.Layout(
                  title='Company Ecosystem',
                  title_font_size=16,
                  showlegend=False,
                  width=1600,
                  height=1000,
                  xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
                  yaxis=dict(showgrid=False, zeroline=False, showticklabels=False))
                  )
  fig.update_traces(marker=dict(size=12,
                              line=dict(width=2,
                                        color='DarkSlateGrey')),
                  selector=dict(mode='markers')) 
  fig.show()

# Relation Extraction Pipeline

### Create Generic Base Pipeline
This pipeline will:
1) Split Text into Sentences
2) Split Sentences into Words
3) Use Financial Text Embeddings to obtain a numerical semantic representation of words
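
To make the first two steps concrete, here is a minimal, dependency-free sketch of what sentence splitting and tokenization produce. It uses plain regular expressions as a stand-in: `split_sentences` and `tokenize` are hypothetical helpers written for illustration, not the logic of `SentenceDetector` or `Tokenizer`, which handle abbreviations and many other cases this sketch ignores.

```python
import re

def split_sentences(text):
    # Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    # A token is either a run of word characters or a single punctuation mark.
    return re.findall(r"\w+|[^\w\s]", sentence)

text = "Cadence acquired AWR Corporation. Cadence also acquired Integrand Software Inc."
sentences = split_sentences(text)
tokens = [tokenize(s) for s in sentences]

for sentence, sentence_tokens in zip(sentences, tokens):
    print(sentence, "->", sentence_tokens)
```

Each token is what the embeddings stage later turns into a vector, one vector per token in its sentence context.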

In [None]:
def get_generic_base_pipeline():
    document_assembler = nlp.DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")

    sentence_detector = nlp.SentenceDetector()\
        .setInputCols(["document"])\
        .setOutputCol("sentence")

    tokenizer = nlp.Tokenizer()\
        .setInputCols(["sentence"])\
        .setOutputCol("token")

    embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
        .setInputCols(["sentence", "token"])\
        .setOutputCol("embeddings")
    
    base_pipeline = Pipeline(stages=[
        document_assembler,
        sentence_detector,
        tokenizer,
        embeddings
    ])
    
    return base_pipeline
    
generic_base_pipeline = get_generic_base_pipeline()

In [None]:
import pandas as pd

def get_relations_df (results, col='relations'):
    rel_pairs=[]
    for rel in results[0][col]:
        rel_pairs.append((
          rel.result, 
          rel.metadata['entity1'], 
          rel.metadata['entity1_begin'],
          rel.metadata['entity1_end'],
          rel.metadata['chunk1'], 
          rel.metadata['entity2'],
          rel.metadata['entity2_begin'],
          rel.metadata['entity2_end'],
          rel.metadata['chunk2'], 
          rel.metadata['confidence']
      ))

    rel_df = pd.DataFrame(rel_pairs, columns=['relation','entity1','entity1_begin','entity1_end','chunk1','entity2','entity2_begin','entity2_end','chunk2', 'confidence'])

    return rel_df

# Sample Texts from Cadence Design Systems
Examples taken from publicly available information about Cadence in SEC's Edgar database [here](https://www.sec.gov/Archives/edgar/data/813672/000081367222000012/cdns-20220101.htm) and [Wikipedia](https://en.wikipedia.org/wiki/Cadence_Design_Systems)

## NER: Named Entity Recognition
NER is the main component for information extraction: it detects the entities mentioned in a text and assigns each one a label.

In [None]:
summary_sample_text = ["""
 
UNITED STATES SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
_____________________________________ 
FORM 10-K
_____________________________________  
(Mark One) 
☒	ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
 
For the fiscal year ended January 1, 2022
OR 
☐	TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
 
For the transition period from _________ to_________.

Commission file number 000-15867
_____________________________________ 
cdns-20220101_g1.jpg
CADENCE DESIGN SYSTEMS, INC.
(Exact name of registrant as specified in its charter)
____________________________________  
Delaware	 	00-0000000
(State or Other Jurisdiction of
Incorporation or Organization)	 	(I.R.S. Employer
Identification No.)
2655 Seely Avenue, Building 5,	San Jose,	California	 	95134
(Address of Principal Executive Offices)	 	(Zip Code)
 
(408)-943-1234
(Registrant’s Telephone Number, including Area Code)
Securities registered pursuant to Section 12(b) of the Act: 
Title of Each Class	Trading Symbol(s)	Names of Each Exchange on which Registered
Common Stock, $0.01 par value per share	CDNS	Nasdaq Global Select Market
 
Securities registered pursuant to Section 12(g) of the Act:
None"""]
            

In [None]:
ner_model_sec10k = finance.NerModel.pretrained("finner_sec_10k_summary", "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner_summary")

ner_converter_sec10k = nlp.NerConverter()\
    .setInputCols(["sentence","token","ner_summary"])\
    .setOutputCol("ner_chunk_sec10k")

summary_pipeline = Pipeline(stages=[
    generic_base_pipeline,
    ner_model_sec10k,
    ner_converter_sec10k
])

In [None]:
from sparknlp_display import NerVisualizer

ner_vis = viz.NerVisualizer()

empty_data = spark.createDataFrame([[""]]).toDF("text")

summary_model = summary_pipeline.fit(empty_data)

light_summary_model = LightPipeline(summary_model)

summary_results = light_summary_model.fullAnnotate(summary_sample_text)

### Visualize Results

In [None]:
for r in summary_results:
    displayHTML(ner_vis.display(r, label_col = "ner_chunk_sec10k", document_col = "document", return_html=True))

## First, let's extract the Organization from NER results

In [None]:
G.clear()
G.nodes()

In [None]:
ORG = next(filter(lambda x: x.metadata['entity']=='ORG', summary_results[0]['ner_chunk_sec10k'])).result
ORG

In [None]:
# Add our main Organization as the first node of the graph
G.add_node(ORG, attr_dict={'entity': 'ORG'})

### Let's create a node for the Company we are processing

In [None]:
show_graph_in_plotly(G)

Then, let's add all the summary information from SEC 10K filings (1st page) to that organization.

We can create the nodes and link them to Cadence directly, since we know this information belongs to that company.

In [None]:
for i, r in enumerate(summary_results[0]['ner_chunk_sec10k']):
  text = r.result
  entity = r.metadata['entity']
  
  if entity == 'ORG':
    continue #Already added
  G.add_node(text, attr_dict={'entity': entity})
  G.add_edge(ORG, text, attr_dict={'relation': 'has_' + entity.lower()})

In [None]:
show_graph_in_plotly(G)

## Normalizing the company name to query John Snow Labs datasources for more information about Cadence

In [None]:
documentAssembler = nlp.DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("ner_chunk")

embeddings = nlp.UniversalSentenceEncoder.pretrained("tfhub_use", "en") \
      .setInputCols("ner_chunk") \
      .setOutputCol("sentence_embeddings")
    
resolver = finance.SentenceEntityResolverModel.pretrained("finel_edgar_company_name", "en", "finance/models")\
      .setInputCols(["ner_chunk", "sentence_embeddings"]) \
      .setOutputCol("normalization")\
      .setDistanceFunction("EUCLIDEAN")

pipelineModel = PipelineModel(
      stages = [
          documentAssembler,
          embeddings,
          resolver])

lp = LightPipeline(pipelineModel)

normalized_org = lp.fullAnnotate(ORG)
normalized_org

In [None]:
NORM_ORG = normalized_org[0]['normalization'][0].result
NORM_ORG

### NORMALIZED NAME
In EDGAR, the company's official name is different! We need to retrieve it before we can augment our graph with external information from EDGAR.

- Incorrect: `CADENCE DESIGN SYSTEMS, INC`
- Correct (Official): `CADENCE DESIGN SYSTEM INC`
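
The following sketch illustrates why this normalization step matters: an exact lookup with the name as written in the filing fails, while a nearest-match search recovers the official name. It is only an analogy. The resolver above compares sentence embeddings with a Euclidean distance, whereas this sketch swaps in `difflib` string similarity, and `official_names` is a hypothetical three-entry index (only the Cadence entry comes from this notebook).

```python
import difflib

# Hypothetical mini-index of official names; the real model covers EDGAR.
official_names = [
    "CADENCE DESIGN SYSTEM INC",
    "CADENCE BANK",
    "SYNOPSYS INC",
]

def normalize(name):
    # Uppercase and drop punctuation so surface differences do not matter.
    cleaned = "".join(ch for ch in name.upper() if ch.isalnum() or ch.isspace())
    return " ".join(cleaned.split())

def closest_official_name(raw_name, candidates):
    # difflib stands in for the embedding-based distance used by the resolver.
    query = normalize(raw_name)
    scored = [(difflib.SequenceMatcher(None, query, c).ratio(), c) for c in candidates]
    return max(scored)[1]

raw_name = "CADENCE DESIGN SYSTEMS, INC."
print(raw_name in official_names)                  # exact lookup fails
match = closest_official_name(raw_name, official_names)
print(match)
```

Whatever the similarity measure, the point is the same: downstream lookups are keyed by the official name, so the raw mention has to be mapped to it first.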

In [None]:
G.add_node(NORM_ORG, attr_dict={'entity': 'ORG'})
G.add_edge(ORG, NORM_ORG, attr_dict={'relation': 'has_official_name'})

## DATA AUGMENTATION WITH CHUNK MAPPER

In [None]:
documentAssembler = nlp.DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")

chunkAssembler = nlp.Doc2Chunk() \
    .setInputCols("document") \
    .setOutputCol("chunk") \
    .setIsArray(False)

CM = finance.ChunkMapperModel()\
      .pretrained("finmapper_edgar_companyname", "en", "finance/models")\
      .setInputCols(["chunk"])\
      .setOutputCol("mappings")
      
cm_pipeline = Pipeline(stages=[documentAssembler, chunkAssembler, CM])
fit_cm_pipeline = cm_pipeline.fit(empty_data)

df = spark.createDataFrame([[NORM_ORG]]).toDF("text")
r = fit_cm_pipeline.transform(df).collect()

In [None]:
mappings = r[0]['mappings']
for mapping in mappings:
  text = mapping.result
  relation = mapping.metadata['relation']
  print(f"{ORG} - has_{relation} - {text}")
    
  G.add_node(text, attr_dict={'entity': relation})
  G.add_edge(ORG, text, attr_dict={'relation': 'has_' + relation.lower()})

In [None]:
show_graph_in_plotly(G)

## NER and Relation Extraction
On its own, NER only extracts isolated entities. By combining NER models with Relation Extraction annotators trained on the same entity types, you can also determine whether two entities are related to each other and how.
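
Before a relation model classifies anything, the pipeline below narrows the entity pairs it will look at through the `setRelationPairs` setting (for example `"ORG-ALIAS"`). The sketch below shows that idea in plain Python under simplifying assumptions: `candidate_pairs` is a hypothetical helper, the entity list is hand-written, pairs are matched in text order only, and the syntactic-distance limit is ignored. It is not the library's implementation.

```python
from itertools import combinations

# Hand-written entities as (text, label) pairs, listed in text order.
chunks = [
    ("Cadence", "ORG"),
    ("AWR Corporation", "ORG"),
    ("AWR", "ALIAS"),
    ("January 15, 2020", "DATE"),
]

def candidate_pairs(chunks, allowed_pairs):
    # Keep only pairs whose labels match an allowed "LABEL1-LABEL2" pattern.
    allowed = {tuple(p.split("-")) for p in allowed_pairs}
    pairs = []
    for (text1, label1), (text2, label2) in combinations(chunks, 2):
        if (label1, label2) in allowed:
            pairs.append((text1, text2))
    return pairs

alias_pairs = candidate_pairs(chunks, ["ORG-ALIAS"])
print(alias_pairs)
```

Both `Cadence` and `AWR Corporation` are paired with `AWR` here. Deciding which pair really expresses an alias is the job of the relation extraction model, which may still answer `no_rel`.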

### Companies Information

In [None]:
ner_model_dates = finance.NerModel.pretrained("finner_org_per_role_date", "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner_date")

ner_converter_dates = nlp.NerConverter()\
    .setInputCols(["sentence","token","ner_date"])\
    .setOutputCol("ner_chunk_date")\
    .setWhiteList(["DATE"])

ner_model_alias = finance.NerModel.pretrained("finner_orgs_prods_alias","en","finance/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner_alias")

ner_converter_alias = nlp.NerConverter()\
    .setInputCols(["sentence","token","ner_alias"])\
    .setOutputCol("ner_chunk_alias")

chunk_merger = finance.ChunkMergeApproach()\
    .setInputCols("ner_chunk_alias", "ner_chunk_date")\
    .setOutputCol('ner_chunk')\
    .setMergeOverlapping(True)

pos = PerceptronModel.pretrained()\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("pos")

dependency_parser = DependencyParserModel().pretrained("dependency_conllu", "en")\
    .setInputCols(["sentence", "pos", "token"])\
    .setOutputCol("dependencies")

re_ner_chunk_filter_alias = finance.RENerChunksFilter()\
    .setInputCols(["ner_chunk", "dependencies"])\
    .setOutputCol("re_ner_chunk_alias")\
    .setRelationPairs(["ORG-ALIAS"])\
    .setMaxSyntacticDistance(5)

re_ner_chunk_filter_acq = finance.RENerChunksFilter()\
    .setInputCols(["ner_chunk", "dependencies"])\
    .setOutputCol("re_ner_chunk_acq")\
    .setRelationPairs(["DATE-ORG", "ORG-ORG"])\
    .setMaxSyntacticDistance(5)

re_model_acq = finance.RelationExtractionDLModel.pretrained("finre_acquisitions_subsidiaries", "en", "finance/models")\
    .setInputCols(["re_ner_chunk_acq", "sentence"])\
    .setOutputCol("relations_acq")\
    .setPredictionThreshold(0.5)

re_model_alias = finance.RelationExtractionDLModel().pretrained("finre_org_prod_alias", "en", "finance/models")\
    .setPredictionThreshold(0.5)\
    .setInputCols(["re_ner_chunk_alias", "sentence"])\
    .setOutputCol("relations_alias")

annotation_merger = finance.AnnotationMerger()\
    .setInputCols("relations_alias", "relations_acq")\
    .setInputType("ner_chunk")\
    .setOutputCol("relations")

nlpPipeline = Pipeline(stages=[
    generic_base_pipeline,
    ner_model_dates,
    ner_converter_dates,
    ner_model_alias,
    ner_converter_alias,
    chunk_merger,
    pos,
    dependency_parser,
    re_ner_chunk_filter_alias,
    re_ner_chunk_filter_acq,
    re_model_acq,
    re_model_alias,
    annotation_merger
])


model = nlpPipeline.fit(empty_data)
light_model = LightPipeline(model)

In [None]:
sample_text = [
  """Cadence acquired all of the outstanding equity of AWR Corporation ("AWR") on January 15, 2020. On February 6, 2020, Cadence also acquired all of the outstanding equity of Integrand Software Inc.""",
               
"""Cadence Design Systems was founded on 1988 by the merger of Solomon Design Automation and SDA"""]
            

In [None]:
rel_df = pd.DataFrame()

for i in range(len(sample_text)):
    result = light_model.fullAnnotate(sample_text[i])
    rel_df = pd.concat([rel_df,get_relations_df(result)],axis = 0,ignore_index=True)

rel_df[rel_df["relation"] != "no_rel"]

| | relation | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | has_alias | ORG | 50 | 64 | AWR Corporation | ALIAS | 68 | 70 | AWR | 0.9415686 |
| 1 | was_acquired_by | ORG | 0 | 6 | Cadence | ORG | 50 | 64 | AWR Corporation | 0.6954847 |
| 2 | was_acquired | ORG | 50 | 64 | AWR Corporation | DATE | 77 | 92 | January 15, 2020 | 0.94789404 |
| 3 | was_acquired | DATE | 98 | 113 | February 6, 2020 | ORG | 116 | 122 | Cadence | 0.91787183 |
| 4 | was_acquired | DATE | 98 | 113 | February 6, 2020 | ORG | 171 | 192 | Integrand Software Inc | 0.8110341 |
| 5 | was_acquired_by | ORG | 116 | 122 | Cadence | ORG | 171 | 192 | Integrand Software Inc | 0.8877772 |
| 6 | was_acquired | ORG | 0 | 21 | Cadence Design Systems | DATE | 38 | 41 | 1988 | 0.9816823 |
| 7 | is_subsidiary_of | ORG | 0 | 21 | Cadence Design Systems | ORG | 60 | 84 | Solomon Design Automation | 0.9912629 |
| 8 | is_subsidiary_of | ORG | 0 | 21 | Cadence Design Systems | ORG | 90 | 92 | SDA | 0.94412464 |
| 9 | was_acquired | DATE | 38 | 41 | 1988 | ORG | 60 | 84 | Solomon Design Automation | 0.6182229 |


### Visualize Results

In [None]:
from sparknlp_display import RelationExtractionVisualizer

re_vis = viz.RelationExtractionVisualizer()

for i in range(len(sample_text)):
    result = light_model.fullAnnotate(sample_text[i])
    displayHTML(re_vis.display(result = result[0], relation_col = "relations", document_col = "document", exclude_relations = ["no_rel"], return_html=True, show_relations=True))

### Inserting Nodes (Tags) and Relations into a Graph
Now, with entities and Relations connecting them, we can start populating the Graph of the company.
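
Before inserting relations into the graph, it can help to drop `no_rel` predictions and, optionally, low-confidence ones. The sketch below shows one way to turn relation rows into `(source, relation, target)` triples. The first two rows are copied from the relations table above; the third `no_rel` row, the `to_triples` helper and the `0.9` threshold are hypothetical and only there to illustrate the filtering.

```python
# Two rows from the relations table above, plus one hypothetical "no_rel" row.
rows = [
    {"relation": "has_alias", "chunk1": "AWR Corporation", "chunk2": "AWR", "confidence": 0.9415686},
    {"relation": "was_acquired_by", "chunk1": "Cadence", "chunk2": "AWR Corporation", "confidence": 0.6954847},
    {"relation": "no_rel", "chunk1": "Cadence", "chunk2": "AWR", "confidence": 0.99},
]

def to_triples(rows, min_confidence=0.0):
    triples = []
    for row in rows:
        if row["relation"] == "no_rel":
            continue  # the model found no relation for this pair
        # float() guards against confidences that arrive as strings
        if float(row["confidence"]) < min_confidence:
            continue
        triples.append((row["chunk1"], row["relation"], row["chunk2"]))
    return triples

all_triples = to_triples(rows)
confident_triples = to_triples(rows, min_confidence=0.9)
print(all_triples)
print(confident_triples)
```

Each triple maps directly onto one `G.add_edge(source, target, attr_dict={'relation': relation})` call. The right threshold depends on how much noise you can tolerate in the graph.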

In [None]:
for t in rel_df.itertuples():
  relation = t.relation
  entity1 = t.entity1
  chunk1 = t.chunk1
  entity2 = t.entity2
  chunk2 = t.chunk2
  G.add_node(chunk1,  attr_dict={'entity': entity1})
  G.add_node(chunk2,  attr_dict={'entity': entity2})
  
  G.add_edge(ORG, chunk1, attr_dict={'relation': 'mentions_' + entity1.lower()})  
  G.add_edge(ORG, chunk2, attr_dict={'relation': 'mentions_' + entity2.lower()})  
  
  G.add_edge(chunk1, chunk2, attr_dict={'relation': relation.lower()})  
  

In [None]:
show_graph_in_plotly(G)

## People's Information

In [None]:
ner_model_role = finance.NerModel.pretrained("finner_org_per_role_date", "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner_role")

ner_converter_role = nlp.NerConverter()\
    .setInputCols(["sentence","token","ner_role"])\
    .setOutputCol("ner_chunk_role")

pos = PerceptronModel.pretrained()\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("pos")

dependency_parser = DependencyParserModel().pretrained("dependency_conllu", "en")\
    .setInputCols(["sentence", "pos", "token"])\
    .setOutputCol("dependencies")

re_ner_chunk_filter_role = finance.RENerChunksFilter()\
    .setInputCols(["ner_chunk_role", "dependencies"])\
    .setOutputCol("re_ner_chunk_role")\
    .setRelationPairs(["PERSON-ROLE", "ORG-ROLE", "DATE-ROLE"])\
    .setMaxSyntacticDistance(5)

re_model_exp = finance.RelationExtractionDLModel.pretrained("finre_work_experience", "en", "finance/models")\
    .setInputCols(["re_ner_chunk_role", "sentence"])\
    .setOutputCol("relations")\
    .setPredictionThreshold(0.5)

nlpPipeline = Pipeline(stages=[
    generic_base_pipeline,
    ner_model_role,
    ner_converter_role,
    pos,
    dependency_parser,
    re_ner_chunk_filter_role,
    re_model_exp,
])


model = nlpPipeline.fit(empty_data)
light_model = LightPipeline(model)

## Get Results

In [None]:
sample_text = ["""Joseph Costello was the CEO of the company since founded in 1988 until 1997. Currently, Anirudh Devgan is the CEO since 2021"""]
            

In [None]:
rel_df = pd.DataFrame()

for i in range(len(sample_text)):
    result = light_model.fullAnnotate(sample_text[i])
    rel_df = pd.concat([rel_df,get_relations_df(result)],axis = 0,ignore_index=True)

rel_df[rel_df["relation"] != "no_rel"]

| | relation | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | has_role | PERSON | 0 | 14 | Joseph Costello | ROLE | 24 | 26 | CEO | 0.9403703 |
| 1 | has_role_from | ROLE | 24 | 26 | CEO | DATE | 60 | 63 | 1988 | 0.98894715 |
| 2 | had_role_until | ROLE | 24 | 26 | CEO | DATE | 71 | 74 | 1997 | 0.9704771 |
| 3 | has_role | PERSON | 88 | 101 | Anirudh Devgan | ROLE | 110 | 112 | CEO | 0.96097374 |
| 4 | has_role_from | ROLE | 110 | 112 | CEO | DATE | 120 | 123 | 2021 | 0.98356366 |


## Visualize Results

In [None]:
from sparknlp_display import RelationExtractionVisualizer

re_vis = viz.RelationExtractionVisualizer()

for i in range(len(sample_text)):
    result = light_model.fullAnnotate(sample_text[i])
    displayHTML(re_vis.display(result = result[0], relation_col = "relations", document_col = "document", exclude_relations = ["no_rel"], return_html=True, show_relations=True))

## Adding to graph

In [None]:
for t in rel_df.itertuples():
  relation = t.relation
  entity1 = t.entity1
  chunk1 = t.chunk1
  entity2 = t.entity2
  chunk2 = t.chunk2
  G.add_node(chunk1,  attr_dict={'entity': entity1})
  G.add_node(chunk2,  attr_dict={'entity': entity2})
  
  G.add_edge(ORG, chunk1, attr_dict={'relation': 'mentions_' + entity1.lower()})  
  G.add_edge(ORG, chunk2, attr_dict={'relation': 'mentions_' + entity2.lower()})  
  
  G.add_edge(chunk1, chunk2, attr_dict={'relation': relation.lower()})  
  

In [None]:
show_graph_in_plotly(G)

# Understanding the context of mentioned companies to identify COMPETITORS
Many companies may be mentioned in the report. Most of them are simply organizations in Cadence's ecosystem; others may be competitors.

We can analyze the surrounding context of the extracted `ORG` to check if they are competitors or not.

In [None]:
ner = finance.NerModel.pretrained("finner_orgs_prods_alias", "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")\
    .setWhiteList(['ORG', 'PRODUCT'])

assertion = finance.AssertionDLModel.pretrained("finassertion_competitors", "en", "finance/models")\
    .setInputCols(["sentence", "ner_chunk", "embeddings"])\
    .setOutputCol("assertion")

nlpPipeline = Pipeline(stages=[
    generic_base_pipeline,
    ner,
    ner_converter,
    assertion
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

light_model = LightPipeline(model)

### Get Results

In [None]:
sample_text = ["""In the rapidly evolving market, certain elements of our application compete with Microsoft, Google, InFocus, Bluescape, Mersive, Barco, Nureva and Prysm. But, Oracle  and IBM are out of our league."""]

chunks=[]
entities=[]
status=[]


light_result = light_model.fullAnnotate(sample_text)[0]

for n,m in zip(light_result['ner_chunk'],light_result['assertion']):
    chunks.append(n.result)
    entities.append(n.metadata['entity']) 
    status.append(m.result)

df = pd.DataFrame({'chunks':chunks, 'entities':entities, 'assertion':status})

In [None]:
df

| | chunks | entities | assertion |
|---|---|---|---|
| 0 | Microsoft | ORG | COMPETITOR |
| 1 | Google | ORG | COMPETITOR |
| 2 | InFocus | ORG | COMPETITOR |
| 3 | Bluescape | ORG | COMPETITOR |
| 4 | Mersive | ORG | COMPETITOR |
| 5 | Barco | ORG | COMPETITOR |
| 6 | Nureva | ORG | COMPETITOR |
| 7 | Prysm | ORG | COMPETITOR |
| 8 | Oracle | ORG | NO_COMPETITOR |
| 9 | IBM | ORG | NO_COMPETITOR |
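
Once each organization carries an assertion label, separating competitors from the rest of the ecosystem is a simple grouping step. The sketch below does it with the standard library only. The `(chunk, assertion)` pairs are copied from the table above, and `by_status` is a name introduced here for illustration.

```python
from collections import defaultdict

# (chunk, assertion) pairs copied from the table above.
results = [
    ("Microsoft", "COMPETITOR"), ("Google", "COMPETITOR"),
    ("InFocus", "COMPETITOR"), ("Bluescape", "COMPETITOR"),
    ("Mersive", "COMPETITOR"), ("Barco", "COMPETITOR"),
    ("Nureva", "COMPETITOR"), ("Prysm", "COMPETITOR"),
    ("Oracle", "NO_COMPETITOR"), ("IBM", "NO_COMPETITOR"),
]

# Group organization names by their assertion label.
by_status = defaultdict(list)
for chunk, status in results:
    by_status[status].append(chunk)

competitors = by_status["COMPETITOR"]
print(len(competitors), "competitors:", competitors)
print("others:", by_status["NO_COMPETITOR"])
```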


### Visualize Assertion Result

In [None]:
vis = viz.AssertionVisualizer()

vis.set_label_colors({'COMPETITOR':'#008080', 'NO_COMPETITOR':'#800080'})
    
light_result = light_model.fullAnnotate(sample_text)[0]

displayHTML(vis.display(light_result, 'ner_chunk', 'assertion', return_html=True))


### Adding it to the graph

In [None]:
for t in df.itertuples():
  chunks = t.chunks
  entities = t.entities
  assertion = t.assertion

  G.add_node(chunks,  attr_dict={'entity': entities})
  
  G.add_edge(ORG, chunks, attr_dict={'relation': 'is_' + assertion.lower()})
  

In [None]:
show_graph_in_plotly(G)

# Annex 1: Detecting Temporality and Certainty in Affirmations

In [None]:
ner_model_role = finance.NerModel.pretrained("finner_org_per_role_date", "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner_role")

ner_converter_role = nlp.NerConverter()\
    .setInputCols(["sentence","token","ner_role"])\
    .setOutputCol("ner_chunk")

assertion = finance.AssertionDLModel.pretrained("finassertion_time", "en", "finance/models")\
    .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
    .setOutputCol("assertion")\
    .setMaxSentLen(1200)

assertion_pipeline = Pipeline(stages=[
    generic_base_pipeline,
    ner_model_role,
    ner_converter_role,
    assertion
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

assertion_model = assertion_pipeline.fit(empty_data)

light_model_assertion = LightPipeline(assertion_model)

### Get Result

In [None]:
sample_text = ["""Joseph Costello was the CEO of the company since founded in 1988 until 1997. He was followed by Lip-Bu Tan for the 2009–2021 period. Currently, Anirudh Devgan is the CEO since 2021""",
              
              """In 2007, Cadence was rumored to be in talks with Kohlberg Kravis Roberts and Blackstone Group regarding a possible sale of the company.""",
              """In 2008, Cadence withdrew a $1.6 billion offer to purchase rival Mentor Graphics.""",
              
               """ The Cadence Giving Foundation will also support critical needs in areas such as diversity, equity and inclusion, environmental sustainability and STEM education.""",
              """This stand-alone, non-profit foundation will partner with other charitable initiatives to support critical needs in areas such as diversity, equity and inclusion, environmental sustainability and science, technology, engineering, and mathematics (“STEM”) education""",
              
              """Cadence employees could purchase common stock at a price equal to 85% of the lower of the fair market value at the beginning or the end of the applicable offering period"""]

chunks=[]               
entities=[]
status=[]

light_results = light_model_assertion.fullAnnotate(sample_text)

for light_result in light_results:
  for n,m in zip(light_result['ner_chunk'], light_result['assertion']):
      chunks.append(n.result)
      entities.append(n.metadata['entity']) 
      status.append(m.result)

df = pd.DataFrame({'chunks':chunks, 'entities':entities, 'assertion':status})

In [None]:
df

| | chunks | entities | assertion |
|---|---|---|---|
| 0 | Joseph Costello | PERSON | PAST |
| 1 | CEO | ROLE | PAST |
| 2 | 1988 | DATE | PAST |
| 3 | 1997 | DATE | PAST |
| 4 | 2009–2021 | DATE | PAST |
| 5 | Anirudh Devgan | PERSON | PRESENT |
| 6 | CEO | ROLE | PRESENT |
| 7 | 2021 | DATE | PRESENT |
| 8 | 2007 | DATE | PAST |
| 9 | Cadence | ORG | PAST |


### Visualize Assertion Result

In [None]:
vis = viz.AssertionVisualizer()

for light_result in light_results:
  displayHTML(vis.display(light_result, 'ner_chunk', 'assertion', return_html=True))


# Annex 2: Graph Embeddings to find Node / Edge / Subgraph Similarity
Once you have populated the graph with several companies, you can generate Node, Edge or Subgraph Embeddings to check for node similarity and to spot potentially missing links.
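
Node2Vec learns node embeddings from random walks over the graph, which are then treated like sentences by a skip-gram model. The sketch below shows only the walk-generation idea, in a simplified form: it samples neighbors uniformly, whereas Node2Vec biases the walks with its return and in-out parameters. The toy `graph` dictionary, the `random_walk` helper and the seed are assumptions made for illustration.

```python
import random

# Hypothetical toy graph as an undirected adjacency dictionary.
graph = {
    "Cadence": ["AWR Corporation", "Integrand Software Inc"],
    "AWR Corporation": ["Cadence", "AWR"],
    "AWR": ["AWR Corporation"],
    "Integrand Software Inc": ["Cadence"],
}

def random_walk(graph, start, walk_length, rng):
    # Start at `start` and repeatedly hop to a uniformly chosen neighbor.
    walk = [start]
    while len(walk) < walk_length:
        neighbors = graph[walk[-1]]
        if not neighbors:
            break  # isolated node: nothing to walk to
        walk.append(rng.choice(neighbors))
    return walk

rng = random.Random(42)  # fixed seed so the walks are reproducible
walks = [random_walk(graph, node, walk_length=5, rng=rng)
         for node in graph for _ in range(3)]

for walk in walks[:3]:
    print(" -> ".join(walk))
```

Nodes that often appear close together in these walks end up with similar vectors, which is why embeddings can hint at links the extraction step missed.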

In [None]:
!pip install node2vec

## Node Embeddings

In [None]:
# pip install node2vec
from node2vec import Node2Vec

# Generate walks
node2vec = Node2Vec(G, dimensions=20, walk_length=16, num_walks=100)

In [None]:
# Learn embeddings 
model = node2vec.fit(window=10, min_count=1)

In [None]:
model.wv.get_vector('AWR')

In [None]:
for node, _ in model.wv.most_similar('AWR'):
    print(node)
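
`most_similar` ranks the other nodes by cosine similarity to the query vector. As a minimal sketch of that computation, the code below implements cosine similarity with the standard library and ranks a few nodes. The 3-dimensional vectors are made up for illustration and are not output of the model trained above.

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical node vectors (real ones would come from model.wv).
vectors = {
    "AWR": [1.0, 0.0, 1.0],
    "AWR Corporation": [1.0, 0.1, 0.9],
    "1988": [-1.0, 1.0, 0.0],
}

def most_similar(name, vectors):
    # Score every other node against the query and sort best-first.
    query = vectors[name]
    scores = [(cosine_similarity(query, vec), other)
              for other, vec in vectors.items() if other != name]
    return sorted(scores, reverse=True)

ranking = most_similar("AWR", vectors)
for score, node in ranking:
    print(f"{node}: {score:.3f}")
```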

## Edge Embeddings

In [None]:
from node2vec.edges import HadamardEmbedder
edges_embs = HadamardEmbedder(keyed_vectors=model.wv)

In [None]:
edges_embs[('AWR', 'AWR Corporation')]

In [None]:
edges_kv = edges_embs.as_keyed_vectors()
edges_kv.most_similar(str(('AWR', 'AWR Corporation')))
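
The Hadamard embedder represents an edge as the element-wise product of the vectors of its two endpoint nodes. A minimal sketch with made-up 3-dimensional vectors (the `node_vectors` values are hypothetical, not output of the model above):

```python
def hadamard(u, v):
    # Element-wise product of two vectors of the same length.
    return [a * b for a, b in zip(u, v)]

# Hypothetical node vectors (real ones would come from model.wv).
node_vectors = {
    "AWR": [0.5, -1.0, 2.0],
    "AWR Corporation": [2.0, 3.0, 0.5],
}

edge_vector = hadamard(node_vectors["AWR"], node_vectors["AWR Corporation"])
print(edge_vector)
```

Because the product is symmetric, the edge vector is the same regardless of the order of the two nodes, which suits an undirected graph like the one built here.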

## Want to go further and analyse Subgraph Embeddings to compare, for example, the subgraphs of different Companies?
Check `https://github.com/benedekrozemberczki/graph2vec` or get in touch with us at support@johnsnowlabs.com or in our [Slack](https://www.johnsnowlabs.com/slack-redirect/), #finance channel.