![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Legal/6.Entity_Resolution_EDGAR.ipynb)

# Installation

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs 

In [None]:
from johnsnowlabs import nlp, legal, viz
nlp.install(force_browser=True)

## Start Spark Session

In [3]:
# Automatically load license data and start a session with all jars the user has access to
spark = nlp.start()

👌 Launched cpu optimized session with: 🚀Spark-NLP==4.2.4, 💊Spark-Healthcare==4.2.4, running on ⚡ PySpark==3.1.2


# Legal Entity Resolution


Entity resolution is an important task in natural language processing and information extraction because it enables more accurate analysis and understanding of legal texts. For example, in a news article about a company's stock, correctly identifying and disambiguating the company's name is crucial for tracking the stock's performance.

A common NLP use case in financial and legal applications is identifying the legal entities mentioned in a text. One such entity is `Company Name`. NER can extract these mentions, but in real financial and legal use cases the company name as it appears in the text is often not directly usable. We usually need the **_official_** name of the company (for example `Amazon.com INC` instead of `Amazon`), as registered in Edgar, Crunchbase and Nasdaq. The pre-trained sentence entity resolver models shown below, with examples, serve this purpose.

## Pretrained Entity Resolution Models for Legal

Here is the list of pretrained Entity Resolution models:

|index|model|
|-----:|:-----|
| 1| [Company Name Normalization Using Edgar Database](https://nlp.johnsnowlabs.com/2022/08/30/legel_edgar_company_name_en.html)  |
| 2| [Company Names Normalization Using Crunchbase](https://nlp.johnsnowlabs.com/2022/08/09/legel_crunchbase_companynames_en_3_2.html)  | 
| 3| [Company Name to IRS (Edgar database)](https://nlp.johnsnowlabs.com/2022/08/30/finel_edgar_company_name_en.html)  |


## Common Components


Besides the code in the `result` field, the resolver output provides additional metadata about the matching process:

- target_text -> Text to resolve
- resolved_text -> Best match text
- confidence -> Relative confidence for the top match (distance to probability)
- confidence_ratio -> Relative confidence for the top match. TopMatchConfidence / SecondMatchConfidence
- alternative_codes -> List of other plausible codes (in the KNN neighborhood)
- all_k_resolutions -> All codes descriptions
- all_k_results -> All resolved codes for metrics calculation purposes
- sentence -> SentenceId
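
The `confidence` and `confidence_ratio` fields can be illustrated with a small sketch. It assumes that "distance to probability" means a softmax over the negated distances of the k candidates. That is an assumption, not something this notebook confirms, although it does reproduce the `confidence` of `0.1624` reported for `Contact Gold` later in this notebook. The helper names are hypothetical.

```python
import math

def distances_to_confidences(distances):
    """Softmax over negated distances: closer candidates get higher scores."""
    weights = [math.exp(-d) for d in distances]
    total = sum(weights)
    return [w / total for w in weights]

def confidence_ratio(confidences):
    """TopMatchConfidence / SecondMatchConfidence."""
    ranked = sorted(confidences, reverse=True)
    return ranked[0] / ranked[1]

# 'all_k_distances' as returned for "Contact Gold" (a ':::'-separated string)
all_k_distances = ("0.0000:::0.7118:::0.7182:::0.7397:::0.7641:::0.7658:::"
                   "0.7695:::0.7705:::0.7709:::0.7732:::0.7759:::0.7773")
distances = [float(d) for d in all_k_distances.split(":::")]

confidences = distances_to_confidences(distances)
top_confidence = confidences[0]
ratio = confidence_ratio(confidences)
print(f"confidence={top_confidence:.4f}, confidence_ratio={ratio:.2f}")
```

A ratio close to 1 would mean the top two candidates are almost equally plausible, so the match deserves a manual check.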

We will use the following generic function to get the resolved codes and their candidate resolutions:

In [5]:
import pandas as pd
pd.set_option('display.max_colwidth', 0)

def get_codes(lp, text, vocab='company_name', hcc=False):
    """Return LightPipeline resolution results as a pandas DataFrame."""

    full_light_result = lp.fullAnnotate(text)

    chunks = []
    codes = []
    begin = []
    end = []
    resolutions = []
    all_codes = []
    all_cosines = []
    all_k_aux_labels = []

    for annotations in full_light_result:

        # Pair each chunk with the resolver annotation stored under `vocab`
        for chunk, code in zip(annotations['ner_chunk'], annotations[vocab]):
            begin.append(chunk.begin)
            end.append(chunk.end)
            chunks.append(chunk.result)
            codes.append(code.result)
            # Candidate lists are serialized as ':::'-separated strings
            all_codes.append(code.metadata['all_k_results'].split(':::'))
            resolutions.append(code.metadata['all_k_resolutions'].split(':::'))
            all_cosines.append(code.metadata['all_k_cosine_distances'].split(':::'))
            if hcc:
                # Auxiliary labels may be missing from the metadata
                try:
                    all_k_aux_labels.append(code.metadata['all_k_aux_labels'].split(':::'))
                except KeyError:
                    all_k_aux_labels.append([])
            else:
                all_k_aux_labels.append([])

    # Note: the 'all_distances' column holds the cosine distances
    df = pd.DataFrame({'chunks': chunks, 'begin': begin, 'end': end, 'code': codes, 'all_codes': all_codes,
                       'resolutions': resolutions, 'all_k_aux_labels': all_k_aux_labels, 'all_distances': all_cosines})

    return df

## Company Name Normalization using Edgar

**Normalizing the company name to query John Snow Labs data sources for more information about Cadence.**

Sometimes, texts refer to companies by a non-official, abbreviated name. For example, we can find `Cadence`, `Cadence Inc`, `Cadence, Inc`, or many other variations, whereas the official name of the company is `CADENCE DESIGN SYSTEMS INC`, as registered in SEC Edgar.

[Edgar's Public Database](https://www.sec.gov/edgar/searchedgar/companysearch)
- EDGAR, the Electronic Data Gathering, Analysis, and Retrieval system, is the primary system for companies and others submitting documents under the Securities Act of 1933, the Securities Exchange Act of 1934, the Trust Indenture Act of 1939, and the Investment Company Act of 1940. 

- Access to EDGAR’s public database is free, allowing you to research, for example, a public company’s financial information and operations by reviewing the filings the company makes with the SEC (U.S. Securities and Exchange Commission).

Normalizing a company name is very important for data quality. It helps us:
- Standardize the data, improving its quality;
- Carry out additional verifications;
- Join different databases or extract data from external sources.

`Company Name Normalization` is the process of obtaining the name of the company used by data providers, usually the "official" name of the company.

Sometimes, some data providers may have different versions of the name with different punctuation. For example, for Meta:
- Meta Platforms, Inc.
- Meta Platforms Inc.
- Meta Platforms, Inc
- etc
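
Punctuation-only variants like the ones above can be collapsed with plain string cleanup, as the sketch below shows. `simple_key` is a hypothetical helper, not part of any John Snow Labs library. The same cleanup does not map an abbreviated mention such as `Cadence` to `CADENCE DESIGN SYSTEMS INC`, which is why an embedding-based resolver is used in this notebook.

```python
import string

def simple_key(name):
    """Casefold, drop punctuation and collapse whitespace."""
    no_punct = name.translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.casefold().split())

variants = ["Meta Platforms, Inc.", "Meta Platforms Inc.", "Meta Platforms, Inc"]
keys = {simple_key(v) for v in variants}
print(keys)  # all three variants share a single key

# An abbreviated mention still yields a different key than the official name
same = simple_key("Cadence") == simple_key("CADENCE DESIGN SYSTEMS INC")
print(same)
```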

So, we must carry out `Company Normalization` with the target database / data source provider in mind. The data providers we have are:
- SEC Edgar
- Crunchbase until 2015
- Wikidata (in progress)

Here we will normalize company names and find the IRS code of each company using Edgar's database.

### Sample Text



In [6]:
sample_text = """Contact Gold is a gold exploration company focused on leveraging its properties, people, technology and capital to make district scale gold discoveries in Nevada."""


### Using NER model to Find Company Names

NER is the main component used to carry out information extraction and extract entities from texts.

This time we will use a model trained to extract many entities from legal texts.

In [7]:
documentAssembler = nlp.DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")
        
sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
        .setInputCols(["document"])\
        .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
        .setInputCols(["sentence"])\
        .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
        .setInputCols(["sentence", "token"]) \
        .setOutputCol("embeddings")

ner_model = legal.NerModel.pretrained("legner_orgs_prods_alias","en","legal/models")\
        .setInputCols(["sentence", "token", "embeddings"])\
        .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
        .setInputCols(["sentence","token","ner"])\
        .setOutputCol("ner_chunk")\
        .setWhiteList(["ORG"])

nlp_pipeline = nlp.Pipeline(stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        embeddings,
        ner_model,
        ner_converter])

empty_df = spark.createDataFrame([[""]]).toDF("text")

model = nlp_pipeline.fit(empty_df)



sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
bert_embeddings_sec_bert_base download started this may take some time.
Approximate size to download 390.4 MB
[OK!]
legner_orgs_prods_alias download started this may take some time.
[OK!]
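
To make the role of `NerConverter` concrete, the sketch below merges token-level IOB tags into chunks and keeps only whitelisted labels, which is conceptually what the converter stage with `setWhiteList(["ORG"])` does. It is a plain-Python illustration, not the Spark NLP implementation. The tags are assumed for the example and `iob_to_chunks` is a hypothetical helper.

```python
def iob_to_chunks(text, tokens, tags, whitelist=None):
    """Merge B-/I- tagged tokens into (chunk, label, begin, end) tuples.

    `end` is an inclusive character offset.
    """
    chunks = []
    current = None  # [label, begin, end] of the chunk being built
    cursor = 0
    for token, tag in zip(tokens, tags):
        begin = text.index(token, cursor)
        end = begin + len(token) - 1
        cursor = end + 1
        label = tag[2:] if tag != "O" else None
        continues = (tag.startswith("I-") and current is not None
                     and current[0] == label)
        if continues:
            current[2] = end          # extend the open chunk
            continue
        if current is not None:
            chunks.append(current)    # close the open chunk
        current = [label, begin, end] if label else None
    if current is not None:
        chunks.append(current)
    return [(text[b:e + 1], label, b, e) for label, b, e in chunks
            if whitelist is None or label in whitelist]

text = "Contact Gold is a gold exploration company"
tokens = text.split()
tags = ["B-ORG", "I-ORG", "O", "O", "O", "O", "O"]  # assumed tags, for illustration
org_chunks = iob_to_chunks(text, tokens, tags, whitelist={"ORG"})
print(org_chunks)
```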


In [9]:
df = spark.createDataFrame([[sample_text]]).toDF("text")

result = model.transform(df).cache()

In [10]:
import pyspark.sql.functions as F

result = result.select(F.explode(F.arrays_zip(result.ner_chunk.result, result.ner_chunk.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label"),
              F.expr("cols['1']['confidence']").alias("confidence"))
      
result.show()

+------------+---------+----------+
|       chunk|ner_label|confidence|
+------------+---------+----------+
|Contact Gold|      ORG|0.91964996|
+------------+---------+----------+
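
The `F.explode(F.arrays_zip(...))` pattern above pairs the i-th chunk text with the i-th metadata map and emits one row per pair. A rough pure-Python equivalent, using the values from the output above, looks like this. It is only an analogy for how the Spark expression reshapes the data, and the metadata map is reduced to the two keys used here.

```python
# One input row: parallel arrays, as in result.ner_chunk.result / .metadata
row = {
    "result": ["Contact Gold"],
    "metadata": [{"entity": "ORG", "confidence": "0.91964996"}],
}

# arrays_zip pairs elements by position; explode emits one row per pair
exploded = [
    {"chunk": chunk, "ner_label": meta["entity"], "confidence": meta["confidence"]}
    for chunk, meta in zip(row["result"], row["metadata"])
]
print(exploded)
```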



In [11]:
res = result.toPandas()

res

|index|chunk|ner_label|confidence|
|-----:|:-----|:-----|-----:|
| 0| Contact Gold| ORG| 0.91964996|


In [12]:
ORG = list(res["chunk"])
ORG

['Contact Gold']

### Get Normalized Company Name

In [13]:
documentAssembler = nlp.DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("ner_chunk")

embeddings = nlp.UniversalSentenceEncoder.pretrained("tfhub_use", "en") \
      .setInputCols("ner_chunk") \
      .setOutputCol("sentence_embeddings")
    
resolver = legal.SentenceEntityResolverModel.pretrained("legel_edgar_company_name", "en", "legal/models") \
      .setInputCols(["sentence_embeddings"]) \
      .setOutputCol("normalization")\
      .setDistanceFunction("EUCLIDEAN")

pipelineModel = nlp.PipelineModel(
      stages = [
          documentAssembler,
          embeddings,
          resolver])

lp = nlp.LightPipeline(pipelineModel)

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]
legel_edgar_company_name download started this may take some time.
[OK!]
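
Conceptually, the resolver embeds the input chunk and looks up the closest entries among the embedded company names (the "KNN neighborhood" mentioned earlier). The toy sketch below does this with Euclidean distance over hand-made 2-D vectors. The vectors and the `resolve` helper are hypothetical and stand in for the real sentence embeddings and model.

```python
import math

def resolve(query_vec, index, k=2):
    """Return the k (name, distance) pairs closest to query_vec."""
    scored = [(name, math.dist(query_vec, vec)) for name, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]

# Hypothetical 2-D "embeddings"; real sentence embeddings have many more dimensions
index = {
    "Contact Gold Corp.": (1.0, 0.0),
    "ISHARES GOLD TRUST": (0.6, 0.8),
    "Minatura Gold": (0.0, 1.0),
}
matches = resolve((1.0, 0.0), index, k=2)
best_name, best_distance = matches[0]
print(matches)
```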


In [14]:
normalized_org = lp.fullAnnotate(ORG)

normalized_org

[{'ner_chunk': [Annotation(document, 0, 11, Contact Gold, {})],
  'sentence_embeddings': [Annotation(sentence_embeddings, 0, 11, Contact Gold, {'sentence': '0', 'token': 'Contact Gold', 'pieceId': '-1', 'isWordStart': 'true'})],
  'normalization': [Annotation(entity, 0, 11, Contact Gold Corp., {'all_k_results': 'Contact Gold Corp.:::ISHARES GOLD TRUST:::Minatura Gold:::Mexus Gold US:::BESRA GOLD INC.:::ALAMOS GOLD INC:::JOSHUA GOLD RESOURCES INC:::MIDEX GOLD CORP.:::Gold Mark Stephen:::Guskin Gold Corp.:::CMX GOLD & SILVER CORP.:::Permal Gold Ltd.', 'all_k_distances': '0.0000:::0.7118:::0.7182:::0.7397:::0.7641:::0.7658:::0.7695:::0.7705:::0.7709:::0.7732:::0.7759:::0.7773', 'confidence': '0.1624', 'all_k_cosine_distances': '0.0000:::0.2533:::0.2579:::0.2736:::0.2919:::0.2933:::0.2961:::0.2968:::0.2971:::0.2989:::0.3010:::0.3021', 'all_k_resolutions': 'Contact Gold Corp.:::ISHARES GOLD TRUST:::Minatura Gold:::Mexus Gold US:::BESRA GOLD INC.:::ALAMOS GOLD INC:::JOSHUA GOLD RESOURCES INC

In [15]:
NORM_ORG = normalized_org[0]['normalization'][0].result
NORM_ORG

'Contact Gold Corp.'

In [16]:
get_codes(lp, ORG, vocab = "normalization")

|index|chunks|begin|end|code|all_codes|resolutions|all_k_aux_labels|all_distances|
|-----:|:-----|-----:|-----:|:-----|:-----|:-----|:-----|:-----|
| 0| Contact Gold| 0| 11| Contact Gold Corp.| [Contact Gold Corp., ISHARES GOLD TRUST, Minatura Gold, Mexus Gold US, BESRA GOLD INC., ALAMOS GOLD INC, JOSHUA GOLD RESOURCES INC, MIDEX GOLD CORP., Gold Mark Stephen, Guskin Gold Corp., CMX GOLD & SILVER CORP., Permal Gold Ltd.]| [Contact Gold Corp., ISHARES GOLD TRUST, Minatura Gold, Mexus Gold US, BESRA GOLD INC., ALAMOS GOLD INC, JOSHUA GOLD RESOURCES INC, MIDEX GOLD CORP., Gold Mark Stephen, Guskin Gold Corp., CMX GOLD & SILVER CORP., Permal Gold Ltd.]| []| [0.0000, 0.2533, 0.2579, 0.2736, 0.2919, 0.2933, 0.2961, 0.2968, 0.2971, 0.2989, 0.3010, 0.3021]|
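
The resolver was configured with `EUCLIDEAN`, yet the metadata also carries `all_k_cosine_distances`, and it is the cosine values that `get_codes` places in the `all_distances` column. If the embeddings are unit-length (an assumption, not stated in this notebook), the two are related by cosine distance = Euclidean² / 2. The sketch below checks that relation against the values printed above.

```python
# Values copied from 'all_k_distances' and 'all_k_cosine_distances' above
euclidean = [0.0000, 0.7118, 0.7182, 0.7397, 0.7641, 0.7658,
             0.7695, 0.7705, 0.7709, 0.7732, 0.7759, 0.7773]
cosine = [0.0000, 0.2533, 0.2579, 0.2736, 0.2919, 0.2933,
          0.2961, 0.2968, 0.2971, 0.2989, 0.3010, 0.3021]

# For unit vectors u, v: ||u - v||^2 = 2 * (1 - cos(u, v)) = 2 * cosine_distance
derived = [d ** 2 / 2 for d in euclidean]
max_gap = max(abs(a - b) for a, b in zip(derived, cosine))
print(f"largest gap: {max_gap:.5f}")
```

The remaining gap is at the level expected from the four-decimal rounding of the printed values, so both columns rank the candidates identically.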


### Normalized Name
In Edgar, the official company name is different! We need to obtain it before we can augment the company with external information from EDGAR.

- Incorrect: `Contact Gold`
- Correct (Official): `Contact Gold Corp.`