![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Finance/10.1.Data_Augmentation_with_ChunkMappers_Edgar.ipynb)

# Financial Data Augmentation with Chunk Mappers

**This notebook is the continuation of [10.0.Data_Augmentation_with_ChunkMappers.ipynb](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Finance/10.0.Data_Augmentation_with_ChunkMappers.ipynb)**

# Installation

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs 

  Building wheel for databricks-cli (setup.py) ... done


In [None]:
from johnsnowlabs import nlp, finance, viz
nlp.install(force_browser=True)

# Start Spark Session

In [None]:
spark = nlp.start()

👌 Launched cpu optimized session with with: 🚀Spark-NLP==4.2.4, 💊Spark-Healthcare==4.2.4, running on ⚡ PySpark==3.1.2


Alright! CADENCE DESIGN SYSTEMS, INC has been detected as an organization. 

Now, let's augment `CADENCE DESIGN SYSTEMS, INC` with more information about the company, since the SEC 10-K form contains no further details we can use.

But before __augmenting__, there is one important step we need to carry out: `Company Name Normalization`.

## Step 3: Company Name Normalization

Let's suppose we want to manually get information about CADENCE DESIGN SYSTEMS, INC.

Since it's a public US company, we can go to [SEC Edgar's database](https://www.sec.gov/edgar/searchedgar/companysearch) and look for it.


Unfortunately, `CADENCE DESIGN SYSTEMS, INC` is not the official name of the company, which means no entry for `CADENCE DESIGN SYSTEMS, INC` is available. That's where __Company Name Normalization__ comes in handy.

**Company Name Normalization** is the process of obtaining the name of the company used by data providers, usually the "official" name of the company.

Different data providers may store different versions of the same name, varying only in punctuation. For example, for Meta:
- Meta Platforms, Inc.
- Meta Platforms Inc.
- Meta Platforms, Inc
- etc

So, `Company Name Normalization` must take into account the database or data source provider we want to extract data from. The data providers currently available are:
- SEC Edgar
- Crunchbase until 2015
- Wikidata (in progress)
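
To see why the punctuation variants listed above matter, here is a minimal sketch that collapses the Meta variants with a naive rule (uppercase, strip punctuation). The `naive_normalize` helper is hypothetical and is not how the pretrained resolver used below works; it only illustrates that plain string equality fails across variants, while even a crude canonical form groups them.

```python
import string

def naive_normalize(name: str) -> str:
    """Uppercase, drop ASCII punctuation, and collapse whitespace (illustrative only)."""
    no_punct = name.translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.upper().split())

# The three Meta spellings listed above are all different strings...
variants = ["Meta Platforms, Inc.", "Meta Platforms Inc.", "Meta Platforms, Inc"]

# ...but they share a single canonical form under the naive rule.
canonical = {naive_normalize(v) for v in variants}
print(canonical)
```

A rule like this cannot know which exact spelling a given provider registered, which is why the name is resolved against the provider's own records instead.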

Let's normalize `CADENCE DESIGN SYSTEMS, INC` to the official name in _SEC Edgar_.

In [None]:
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

use_embeddings = nlp.UniversalSentenceEncoder.pretrained()\
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")
    
resolver = finance.SentenceEntityResolverModel.pretrained("finel_edgar_company_name", "en", "finance/models")\
      .setInputCols(["sentence_embeddings"]) \
      .setOutputCol("resolution")\
      .setDistanceFunction("EUCLIDEAN")

pipelineModel = nlp.PipelineModel(
      stages = [
          document_assembler,
          use_embeddings,
          resolver])

lp_res = nlp.LightPipeline(pipelineModel)

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]
finel_edgar_company_name download started this may take some time.
[OK!]


In [None]:
ORG = ['CADENCE DESIGN SYSTEMS, INC', 'Cadence Design Systems, Inc']

ORG

['CADENCE DESIGN SYSTEMS, INC', 'Cadence Design Systems, Inc']

In [None]:
el_res = lp_res.annotate(ORG)
el_res

[{'document': ['CADENCE DESIGN SYSTEMS, INC'],
  'sentence_embeddings': ['CADENCE DESIGN SYSTEMS, INC'],
  'resolution': ['CADENCE DESIGN SYSTEMS INC']},
 {'document': ['Cadence Design Systems, Inc'],
  'sentence_embeddings': ['Cadence Design Systems, Inc'],
  'resolution': ['CADENCE DESIGN SYSTEMS INC']}]

In [None]:
NORM_ORG = el_res[0]['resolution'][0]

NORM_ORG

'CADENCE DESIGN SYSTEMS INC'

Here is our normalized name: `CADENCE DESIGN SYSTEMS INC`.
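
When several mentions are resolved at once, it can help to keep a lookup from each original mention to its normalized name. A minimal sketch, assuming the `annotate` result has the shape shown in the output above; the `el_res_example` literal is copied from that output so the snippet runs without Spark, and `build_name_lookup` is a hypothetical helper. In the notebook you would pass `el_res` itself.

```python
# Copied from the annotate() output above so this runs without a Spark session.
el_res_example = [
    {"document": ["CADENCE DESIGN SYSTEMS, INC"],
     "sentence_embeddings": ["CADENCE DESIGN SYSTEMS, INC"],
     "resolution": ["CADENCE DESIGN SYSTEMS INC"]},
    {"document": ["Cadence Design Systems, Inc"],
     "sentence_embeddings": ["Cadence Design Systems, Inc"],
     "resolution": ["CADENCE DESIGN SYSTEMS INC"]},
]

def build_name_lookup(results):
    """Map each input mention to its first resolution, skipping empty results."""
    lookup = {}
    for item in results:
        if item.get("document") and item.get("resolution"):
            lookup[item["document"][0]] = item["resolution"][0]
    return lookup

name_lookup = build_name_lookup(el_res_example)
print(name_lookup)
```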

Now, let's see which information is available in the Edgar database for `CADENCE DESIGN SYSTEMS INC`.





## Step 4: Data Augmentation with Chunk Mappers


Once we have the normalized name of the company, we can use `John Snow Labs Chunk Mappers`. These are pretrained data sources, which are updated frequently and can be queried inside Spark NLP without sending any API call to any server.

In this case, we will use the Edgar database (`finmapper_edgar_companyname`).

The component which carries out __Data Augmentation__ is called `ChunkMapper`.

Its name comes from the way it works: it takes a _NER chunk_ and maps it to an external data source.

As a result, you will get a JSON with a dictionary of additional fields and their values. 
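
Conceptually, the result resembles a keyed lookup that returns one value per relation. The toy sketch below uses a hypothetical in-memory table (`TOY_TABLE`) and a hypothetical `toy_map` function with exact-match keys. It is not the model's implementation, and the sample values are taken from the output shown later in this notebook.

```python
from typing import Dict

# Hypothetical stand-in for the pretrained data source: key -> {relation: value}.
TOY_TABLE: Dict[str, Dict[str, str]] = {
    "CADENCE DESIGN SYSTEMS INC": {
        "name": "CADENCE DESIGN SYSTEMS INC",
        "sic_code": "7372",
        "state_incorporation": "DE",
    }
}

def toy_map(chunk: str) -> Dict[str, str]:
    """Return relation -> value for an exactly matching key, else an empty dict."""
    return dict(TOY_TABLE.get(chunk, {}))

print(toy_map("CADENCE DESIGN SYSTEMS INC"))   # normalized name: fields found
print(toy_map("CADENCE DESIGN SYSTEMS, INC"))  # un-normalized name: no match in this toy table
```

In this toy version the un-normalized name finds nothing, which shows why normalization is done first.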

Let's take a look at how it works.

In [None]:
chunk_assembler = nlp.Doc2Chunk()\
    .setInputCols("document") \
    .setOutputCol("chunk") \
    .setIsArray(False)

CM = finance.ChunkMapperModel.pretrained("finmapper_edgar_companyname", "en", "finance/models")\
    .setInputCols(["chunk"])\
    .setOutputCol("mappings")

cm_pipeline = nlp.Pipeline(stages=[document_assembler, chunk_assembler, CM])

empty_data = spark.createDataFrame([[""]]).toDF("text")

fit_cm_pipeline = cm_pipeline.fit(empty_data)

finmapper_edgar_companyname download started this may take some time.
[OK!]


In [None]:
# LightPipeline does not support Doc2Chunk, so we use the regular transform() here
df = spark.createDataFrame([[NORM_ORG]]).toDF("text")

df.show(truncate = False)

+--------------------------+
|text                      |
+--------------------------+
|CADENCE DESIGN SYSTEMS INC|
+--------------------------+



In [None]:
res = fit_cm_pipeline.transform(df)
res.show()

+--------------------+--------------------+--------------------+--------------------+
|                text|            document|               chunk|            mappings|
+--------------------+--------------------+--------------------+--------------------+
|CADENCE DESIGN SY...|[{document, 0, 25...|[{chunk, 0, 25, C...|[{labeled_depende...|
+--------------------+--------------------+--------------------+--------------------+



In [None]:
r = res.collect()
r

[Row(text='CADENCE DESIGN SYSTEMS INC', document=[Row(annotatorType='document', begin=0, end=25, result='CADENCE DESIGN SYSTEMS INC', metadata={'sentence': '0'}, embeddings=[])], chunk=[Row(annotatorType='chunk', begin=0, end=25, result='CADENCE DESIGN SYSTEMS INC', metadata={'sentence': '0', 'chunk': '0'}, embeddings=[])], mappings=[Row(annotatorType='labeled_dependency', begin=0, end=25, result='CADENCE DESIGN SYSTEMS INC', metadata={'sentence': '0', 'ops': '0.0', 'distance': '0.0', 'all_relations': '', 'chunk': '0', '__trained__': 'CADENCE DESIGN SYSTEMS INC', '__distance_function__': 'levenshtein', '__relation_name__': 'name', 'entity': 'CADENCE DESIGN SYSTEMS INC', 'relation': 'name'}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=25, result='SERVICES-PREPACKAGED SOFTWARE [7372]', metadata={'sentence': '0', 'ops': '0.0', 'distance': '0.0', 'all_relations': '', 'chunk': '0', '__trained__': 'CADENCE DESIGN SYSTEMS INC', '__distance_function__': 'levenshtein', 

In [None]:
json_dict = dict()
for n in r[0]['mappings']:
    json_dict[n.metadata['relation']] = str(n.result)

In [None]:
import json
print(json.dumps(json_dict, indent=4, sort_keys=True))

{
    "business_city": "SAN JOSE",
    "business_phone": "4089431234",
    "business_state": "CA",
    "business_street": "2655 SEELY AVENUE BLDG 5",
    "business_zip": "95134",
    "company_id": "813672",
    "date": "2017-02-10",
    "fiscal_year_end": "1228",
    "former_name": "ECAD INC /DE/",
    "former_name_date": "19880609",
    "irs_number": "770148231",
    "name": "CADENCE DESIGN SYSTEMS INC",
    "sic": "SERVICES-PREPACKAGED SOFTWARE [7372]",
    "sic_code": "7372",
    "state_incorporation": "DE",
    "state_location": "CA"
}


Yes, here it is. We obtained additional information about `CADENCE DESIGN SYSTEMS INC` using only the company name.
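
The mapped values are plain strings, so downstream code usually needs to convert them. A minimal sketch, assuming `date` is ISO formatted, `former_name_date` is `YYYYMMDD`, and `fiscal_year_end` is `MMDD`, as the values above suggest (these formats are inferred from this one record, not from documentation). The `record` literal copies a few fields from the JSON output so the snippet runs on its own, and `parse_record` is a hypothetical helper.

```python
from datetime import date, datetime

# A few fields copied from the JSON output above.
record = {
    "date": "2017-02-10",
    "fiscal_year_end": "1228",
    "former_name_date": "19880609",
    "sic_code": "7372",
}

def parse_record(raw):
    """Convert selected string fields to typed values (field formats are assumptions)."""
    fye = raw["fiscal_year_end"]
    return {
        "date": date.fromisoformat(raw["date"]),
        "former_name_date": datetime.strptime(raw["former_name_date"], "%Y%m%d").date(),
        # Assumed MMDD: returned as a (month, day) tuple because no year is given.
        "fiscal_year_end": (int(fye[:2]), int(fye[2:])),
        "sic_code": int(raw["sic_code"]),
    }

parsed = parse_record(record)
print(parsed)
```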