You may find this series of notebooks at https://github.com/databricks-industry-solutions/jsl-financial-nlp

In [0]:
%pip install johnsnowlabs==4.2.3 networkx==2.5 decorator==5.0.9 plotly==5.1.0 

## Normalizing the company name to query John Snow Labs datasources for more information about Cadence

Normalizing a company name is important for data quality. It helps us:
- Standardize the data, improving its quality;
- Carry out additional verifications;
- Join different databases or extract data from external sources.

## Let's resume building the graph G, loading it from disk where the previous step saved it

In [0]:
from johnsnowlabs import nlp, finance, viz
import pickle

In [0]:
%run "./aux_visualization_functions"

In [0]:
# load graph object from file (the context manager closes the file handle afterwards)
with open('/databricks/driver/cadence.pickle', 'rb') as f:
    G = pickle.load(f)

Texts often refer to a company by an unofficial, abbreviated name. For example, we can find `Cadence`, `Cadence Inc`, `Cadence, Inc`, or many other variations, whereas the official name of the company is `CADENCE DESIGN SYSTEMS INC`, as registered in SEC Edgar.
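
To make the problem concrete, here is a minimal sketch of the naive alternative: a hand-written alias table. The table `ALIASES` and the helpers `simple_key` and `lookup_official_name` are hypothetical names introduced only for this illustration. The approach handles the variations listed above, but returns nothing for any spelling that was not anticipated, which is the gap the resolver in the next section is meant to close.

```python
# Hypothetical alias table: cleaned surface form -> official SEC Edgar name.
ALIASES = {
    "cadence": "CADENCE DESIGN SYSTEMS INC",
    "cadence inc": "CADENCE DESIGN SYSTEMS INC",
    "cadence design systems inc": "CADENCE DESIGN SYSTEMS INC",
}


def simple_key(name: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name)
    return " ".join(cleaned.lower().split())


def lookup_official_name(name: str):
    """Return the official name, or None when the variation is not in the table."""
    return ALIASES.get(simple_key(name))


for variation in ["Cadence", "Cadence Inc", "Cadence, Inc", "Cadence Design"]:
    print(f"{variation!r} -> {lookup_official_name(variation)}")
```

The last variation is deliberately absent from the table, so the lookup returns `None` for it.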

# Entity Resolution
To normalize names, or to map permutations and variations of a string to a unique name or code, we use Financial NLP `EntityResolvers`.

In [0]:
from johnsnowlabs.nlp import LightPipeline

document_assembler = nlp.DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("ner_chunk")

embeddings = nlp.UniversalSentenceEncoder.pretrained("tfhub_use", "en") \
      .setInputCols("ner_chunk") \
      .setOutputCol("sentence_embeddings")
    
resolver = finance.SentenceEntityResolverModel.pretrained("finel_edgar_company_name", "en", "finance/models")\
      .setInputCols(["sentence_embeddings"]) \
      .setOutputCol("normalization")\
      .setDistanceFunction("EUCLIDEAN")

pipeline = nlp.Pipeline(
      stages = [
          document_assembler,
          embeddings,
          resolver])
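
Conceptually, the pipeline above embeds the input string and picks the closest known company name under the configured distance (`EUCLIDEAN`). The sketch below illustrates that idea only. It is not the John Snow Labs implementation: character-trigram counts stand in for the Universal Sentence Encoder embeddings, the candidate list is tiny, and every candidate except Cadence is a made-up name. `trigram_vector`, `euclidean`, and `resolve` are hypothetical helpers.

```python
import math
from collections import Counter


def trigram_vector(text: str) -> Counter:
    """Toy 'embedding': character-trigram counts (stand-in for a real sentence encoder)."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def euclidean(a: Counter, b: Counter) -> float:
    """Euclidean distance between two sparse count vectors."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in keys))


def resolve(query: str, candidates: list) -> str:
    """Return the candidate whose toy vector is closest to the query's vector."""
    q = trigram_vector(query)
    return min(candidates, key=lambda c: euclidean(q, trigram_vector(c)))


# Assumption: only the first entry comes from this notebook; the others are invented.
CANDIDATES = [
    "CADENCE DESIGN SYSTEMS INC",
    "EXAMPLE HOLDINGS CORP",
    "SAMPLE ENERGY LLC",
]

print(resolve("Cadence Inc", CANDIDATES))
```

Unlike an alias table, a nearest-neighbour lookup always returns some candidate, so in practice it is worth inspecting the distance before trusting a match.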

The unnormalized company name is the first node of the graph.

In [0]:
ORG = list(G.nodes())[0]
ORG

Let's see its official name.

In [0]:
empty_data = spark.createDataFrame([[""]]).toDF("text")
pipelineModel = pipeline.fit(empty_data)

lp = LightPipeline(pipelineModel)

normalized_org = lp.fullAnnotate(ORG)
normalized_org

In [0]:
NORM_ORG = normalized_org[0]['normalization'][0].result
NORM_ORG
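
The indexing above assumes the resolver returned at least one annotation. A slightly more defensive way to pull out the first result could look like the sketch below. `FakeAnnotation` is a hypothetical stand-in that mimics only the `.result` and `.metadata` attributes used in this notebook, and `first_result` is a helper introduced here for illustration; neither is part of the library.

```python
from dataclasses import dataclass, field


@dataclass
class FakeAnnotation:
    """Hypothetical stand-in exposing only the attributes this notebook reads."""
    result: str
    metadata: dict = field(default_factory=dict)


def first_result(annotated, column):
    """Return the first `.result` under `column`, or None if nothing was produced."""
    if not annotated:
        return None
    annotations = annotated[0].get(column, [])
    return annotations[0].result if annotations else None


# Same shape as the output used above: a list with one dict per input text.
fake_output = [{"normalization": [FakeAnnotation("CADENCE DESIGN SYSTEMS INC")]}]

print(first_result(fake_output, "normalization"))
print(first_result([{"normalization": []}], "normalization"))
```

With the real output, the call would be `first_result(normalized_org, 'normalization')`.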

It turns out to be `CADENCE DESIGN SYSTEMS INC`. We got our first insight using a pretrained Spark NLP data source, in this case an `EntityResolver` for company name normalization.

But Finance NLP has much more than that!

## Data Augmentation with Chunk Mapper

Once we have the normalized name of the company, we can use `Finance NLP Chunk Mappers`. These are pretrained data sources, which are updated frequently and can be queried inside Spark NLP without sending any API call to any server.

In this case, we will use the Edgar database (`finmapper_edgar_companyname`).

In [0]:
documentAssembler = nlp.DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")

CM = finance.ChunkMapperModel.pretrained("finmapper_edgar_companyname", "en", "finance/models")\
      .setInputCols(["document"])\
      .setOutputCol("mappings")
      
cm_pipeline = nlp.Pipeline(stages=[documentAssembler, CM])
fit_cm_pipeline = cm_pipeline.fit(empty_data)

cm_lp = LightPipeline(fit_cm_pipeline)

mapping = cm_lp.fullAnnotate(NORM_ORG)[0]

In [0]:
mapping

In [0]:
# Add every mapped relation as a node linked to the company node
for relation in mapping['mappings']:
  text = relation.result
  relation_name = relation.metadata['relation']
  print(f"{ORG} - has_{relation_name} - {text}")
  # NetworkX stores these keyword arguments as an attribute literally named 'attr_dict'
  G.add_node(text, attr_dict={'entity': relation_name})
  G.add_edge(ORG, text, attr_dict={'relation': 'has_' + relation_name.lower()})

In [0]:
show_graph_in_plotly(G)

In [0]:
import pickle

# save graph object to file (the context manager flushes and closes the file)
with open('/databricks/driver/cadence.pickle', 'wb') as f:
    pickle.dump(G, f)