You may find this series of notebooks at https://github.com/databricks-industry-solutions/jsl-financial-nlp

In [0]:
%pip install johnsnowlabs==4.2.3 networkx==2.5 decorator==5.0.9 plotly==5.1.0 

# Entity Extraction
Let's extract the entities that we expect to find in our document, based on the previous steps and on our knowledge of 10K and 10Q filings.

In [0]:
from johnsnowlabs import nlp, finance, viz

import os
import sys
import time
import json
import functools 
import numpy as np
import pandas as pd
from tqdm import tqdm
from scipy import spatial

### Auxiliary Visualization functions 
We will use [NetworkX](https://networkx.org/) to store the graph and [Plotly](https://plotly.com/) to visualize it.

These functions will:
- Use Plotly to visualize a NetworkX graph
- Display relations in a dataframe

In [0]:
%run "./aux_visualization_functions"

In [0]:
G = nx.Graph()

# Auxiliary Pipeline functions
To keep the notebooks clean, two common pipelines that are used throughout this series are stored in a separate file:
- **a generic pipeline**: `DocumentAssembler`, `SentenceDetector`, `Tokenizer` and `Financial Embeddings`;
- **a text classification pipeline**: `DocumentAssembler`, `Sentence Embeddings (Universal Sentence Embeddings)` and `ClassifierDL (Text Classification)`.

In [0]:
%run "./aux_pipeline_functions"

In [0]:
generic_base_pipeline = get_generic_base_pipeline()

# Let's start
We read back the 90 pages of our document from the pickle file.

In [0]:
import pickle
with open('/databricks/driver/cadence_pages.pickle', 'rb') as f:
  pages = pickle.load(f)

In [0]:
print(pages[0])

## NER: Named Entity Recognition on 10K Summary
NER is the main component for information extraction: it identifies and labels the entities mentioned in a text.

This time we will use a model trained to extract many entities from 10K summaries.

In [0]:
summary_sample_text = pages[0]

In [0]:
ner_model_sec10k = finance.NerModel.pretrained("finner_sec_10k_summary", "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner_summary")

ner_converter_sec10k = nlp.NerConverterInternal()\
    .setInputCols(["sentence","token","ner_summary"])\
    .setOutputCol("ner_chunk_sec10k")

summary_pipeline = nlp.Pipeline(stages=[
    generic_base_pipeline,
    ner_model_sec10k,
    ner_converter_sec10k
])

In [0]:
from johnsnowlabs.nlp import LightPipeline

ner_vis = viz.NerVisualizer()

empty_data = spark.createDataFrame([[""]]).toDF("text")

summary_model = summary_pipeline.fit(empty_data)

light_summary_model = LightPipeline(summary_model)

summary_results = light_summary_model.fullAnnotate(summary_sample_text)

In [0]:
summary_results

### Visualize Results

In [0]:
for r in summary_results:
    displayHTML(ner_vis.display(r, label_col = "ner_chunk_sec10k", document_col = "document", return_html=True))

## First, let's extract the Organization from NER results

We start from an empty graph

In [0]:
G.clear()
G.nodes()

In [0]:
ORG = next(filter(lambda x: x.metadata['entity']=='ORG', summary_results[0]['ner_chunk_sec10k'])).result
ORG
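
The `next(filter(...))` call above raises `StopIteration` when no chunk is tagged `ORG`. A more defensive lookup could look like the minimal sketch below. The `SimpleNamespace` objects are stand-ins for the annotations returned by `fullAnnotate` (only the `.result` and `.metadata['entity']` fields used above are assumed), `first_entity` is a hypothetical helper name, and every label other than `ORG` is a placeholder.

```python
from types import SimpleNamespace

def first_entity(chunks, label):
    """Return the text of the first chunk tagged `label`, or None if absent."""
    return next((c.result for c in chunks if c.metadata['entity'] == label), None)

# Stand-in chunks (illustrative values, not real model output)
demo_chunks = [
    SimpleNamespace(result='example-state', metadata={'entity': 'STATE'}),
    SimpleNamespace(result='Example Corp', metadata={'entity': 'ORG'}),
]

org = first_entity(demo_chunks, 'ORG')          # first chunk tagged ORG
missing = first_entity(demo_chunks, 'TICKER')   # no such chunk: None instead of an exception
```

Returning `None` lets the notebook stop with a clear message, rather than failing inside the iterator, when a page contains no organization.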

We add our first node to the graph

In [0]:
# Add our main organization as the first (central) node of the graph
G.add_node(ORG, attr_dict={'entity': 'ORG'})

In [0]:
show_graph_in_plotly(G)

Then, let's add all the summary information from SEC 10K filings (1st page) to that organization.

We can create the nodes and link them to Cadence directly, since we know that all of this information belongs to that company.

In [0]:
for i, r in enumerate(summary_results[0]['ner_chunk_sec10k']):
  text = r.result
  entity = r.metadata['entity']
  
  if entity == 'ORG':
    continue #Already added
  G.add_node(text, attr_dict={'entity': entity})
  G.add_edge(ORG, text, attr_dict={'relation': 'has_' + entity.lower()})  
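
The loop above derives each edge label from the entity type: a chunk tagged `X` becomes a node linked to the organization by a `has_x` relation. The sketch below applies the same naming rule to stand-in data, without NetworkX; `build_relations` and the sample tuples are hypothetical and only illustrate the rule.

```python
def build_relations(org, chunks):
    """Map (text, entity) pairs to (org, relation, text) triples, skipping ORG."""
    triples = []
    for text, entity in chunks:
        if entity == 'ORG':
            continue  # the organization itself is the source node
        triples.append((org, 'has_' + entity.lower(), text))
    return triples

# Illustrative (text, entity) pairs, not real model output
demo = [('Example Corp', 'ORG'), ('EXMP', 'TICKER'), ('example-state', 'STATE')]
relations = build_relations('Example Corp', demo)

for source, relation, target in relations:
    print(source, relation, target)
```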

In [0]:
show_graph_in_plotly(G)

In [0]:
import pickle

# save graph object to file (the context manager closes the file handle)
with open('/databricks/driver/cadence.pickle', 'wb') as f:
  pickle.dump(G, f)
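
A later notebook can restore the saved object with `pickle.load`. The sketch below shows the round trip, with a plain dictionary standing in for the graph and a temporary file standing in for the driver path; both are assumptions made so the snippet runs outside Databricks.

```python
import os
import pickle
import tempfile

# Stand-in for the NetworkX graph: any picklable object round-trips the same way
graph_stub = {
    'nodes': ['Example Corp', 'EXMP'],
    'edges': [('Example Corp', 'EXMP')],
}

# Temporary location used only for this illustration
path = os.path.join(tempfile.mkdtemp(), 'graph_demo.pickle')

with open(path, 'wb') as f:
    pickle.dump(graph_stub, f)

with open(path, 'rb') as f:
    restored = pickle.load(f)
```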

# Now you can proceed to 04 Normalization and Data Augmentation!