# Legal Entity Resolution

![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Legal/6.EntityResolution.ipynb)

**Colab Setup**

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install johnsnowlabs 

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import * 
# After uploading your license, run this to install all licensed Python wheels and pre-download the Jars for the Spark Session JVM
# Make sure to restart your notebook afterwards for the changes to take effect
jsl.install()

👌 Detected license file /content/4.1.0.spark_nlp_for_healthcare.json
🚨 Outdated Medical Secrets in license file. Version=4.1.0 but should be Version=0.1.14
📋 Stored John Snow Labs License in /root/.johnsnowlabs/licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json
👷 Setting up if John Snow Labs home exists in /root/.johnsnowlabs this might take a few minutes.
Downloading 🐍+🚀 Python Library spark_nlp-4.1.0-py2.py3-none-any.whl
Downloading 🐍+💊 Python Library internal_with_finleg-0.1.14-py3-none-any.whl
Downloading 🐍+🕶 Python Library spark_ocr-4.1.0-py3-none-any.whl
Downloading 🫘+🚀 Java Library spark-nlp-assembly-4.1.0.jar
Downloading 🫘+💊 Java Library spark-nlp-jsl-assembly-4.1.0.jar
Downloading 🫘+🕶 Java Library spark-ocr-assembly-4.1.0.jar
🙆 JSL Home setup in /root/.johnsnowlabs
👌 Detected license file /content/4.1.0.spark_nlp_for_healthcare.json
Installing /root/.johnsnowlabs/py_installs/spark_ocr-4.1.0-py3-none-any.whl to /usr/bin/python3
Running: /usr/bin/python3 -m pip instal

In [None]:
from johnsnowlabs import * 
# Automatically load license data and start a session with all jars user has access to
spark = jsl.start()

🚨 Your Spark-OCR is outdated, installed==4.0.0a1 but latest version==4.1.0
You can run jsl.install() to update Spark-OCR
👌 Detected license file /content/4.1.0.spark_nlp_for_healthcare.json
🚨 Outdated Medical Secrets in license file. Version=4.1.0 but should be Version=0.1.14
👌 Launched cpu-Optimized JVM with SparkSession with Jars for: 🚀Spark-NLP==4.1.0, 💊Spark-Healthcare==4.0.0a1, 🕶Spark-OCR==4.1.0, running on ⚡ PySpark==3.1.2


In [None]:
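import pandas as pd
import warnings
from pyspark.sql import SparkSession
warnings.filterwarnings('ignore')
# Optional: start the session with custom parameters instead of using jsl.start() above.
# PUBLIC_VERSION, JSL_VERSION and SECRET must be defined before calling this function.
def start(SECRET):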
    builder = SparkSession.builder \
        .appName("Spark NLP Licensed") \
        .master("local[*]") \
        .config("spark.driver.memory", "16G") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
        .config("spark.kryoserializer.buffer.max", "2000M") \
        .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:"+PUBLIC_VERSION) \
        .config("spark.jars", "https://pypi.johnsnowlabs.com/"+SECRET+"/spark-nlp-jsl-"+JSL_VERSION+".jar")
      
    return builder.getOrCreate()

#spark = start(SECRET)


# Sentence Entity Resolver Models

A common NLP use case in financial and legal applications is identifying the legal entities mentioned in a text, for example a `Company Name`. NER can extract these chunks, but in real financial and legal use cases the company name as it appears in the text is often not directly usable. We usually need the _official_ name of the company (for example `Amazon.com INC`, as registered in EDGAR, instead of `Amazon`). The pre-trained sentence entity resolver models shown in the examples below serve this purpose.

Besides returning the resolved code in the `result` field, the resolver provides additional metadata about the matching process:

- target_text -> Text to resolve
- resolved_text -> Best match text
- confidence -> Relative confidence for the top match (distance to probability)
- confidence_ratio -> Ratio of the top match's confidence to the second match's (TopMatchConfidence / SecondMatchConfidence)
- alternative_codes -> List of other plausible codes (in the KNN neighborhood)
- all_k_resolutions -> Descriptions of all candidate codes
- all_k_results -> All resolved codes for metrics calculation purposes
- sentence -> SentenceId
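
As a rough illustration of how these fields can be consumed, the sketch below parses a hand-written metadata dictionary. It assumes that metadata values arrive as strings and that list-like fields are joined with `:::` (the separator used by the helper function in this notebook). The function name `parse_resolver_metadata` and the `confidence` value are made up for illustration.

```python
def parse_resolver_metadata(metadata, sep=":::"):
    """Split the ':::'-joined candidate lists and cast the numeric fields."""
    return {
        "target_text": metadata["target_text"],
        "resolved_text": metadata["resolved_text"],
        "confidence": float(metadata["confidence"]),
        # Pair every candidate code with its description, best match first.
        "candidates": list(zip(
            metadata["all_k_results"].split(sep),
            metadata["all_k_resolutions"].split(sep),
        )),
    }

# Hypothetical metadata written by hand for illustration; the confidence is invented.
example_metadata = {
    "target_text": "Legal Research Center inc",
    "resolved_text": "LEGAL RESEARCH CENTER INC",
    "confidence": "0.9",
    "all_k_results": "411680384:::0:::870482806",
    "all_k_resolutions": "LEGAL RESEARCH CENTER INC:::Vector Research LLC:::MATRIXX INITIATIVES INC",
}

parsed = parse_resolver_metadata(example_metadata)
print(parsed["candidates"][0])
```

Keeping codes and descriptions paired makes it easier to inspect the alternatives when the top match looks doubtful.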

### Helper Function
We will use the following generic function to get the resolved codes and their candidate resolutions.

In [None]:
# Returns the LightPipeline (lp) resolution results as a pandas DataFrame

pd.set_option('display.max_colwidth', 0)


def get_codes(lp, text, vocab='company_name', hcc=False):
    # `vocab` is the name of the resolver's output column; `hcc` toggles reading auxiliary labels
    full_light_result = lp.fullAnnotate(text)

    chunks = []
    codes = []
    begin = []
    end = []
    resolutions=[]
    all_distances =[]
    all_codes=[]
    all_cosines = []
    all_k_aux_labels=[]

    for chunk, code in zip(full_light_result[0]['ner_chunk'], full_light_result[0][vocab]):
            
        begin.append(chunk.begin)
        end.append(chunk.end)
        chunks.append(chunk.result)
        codes.append(code.result) 
        all_codes.append(code.metadata['all_k_results'].split(':::'))
        resolutions.append(code.metadata['all_k_resolutions'].split(':::'))
        all_distances.append(code.metadata['all_k_distances'].split(':::'))
        all_cosines.append(code.metadata['all_k_cosine_distances'].split(':::'))
        if hcc:
            try:
                all_k_aux_labels.append(code.metadata['all_k_aux_labels'].split(':::'))
            except KeyError:  # this resolver exposes no auxiliary labels
                all_k_aux_labels.append([])
        else:
            all_k_aux_labels.append([])

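    # Note: the 'all_distances' column is filled with the cosine distances (all_cosines);
    # the values from 'all_k_distances' are collected above but not included here.
    df = pd.DataFrame({'chunks':chunks, 'begin': begin, 'end':end, 'code':codes, 'all_codes':all_codes, 
                       'resolutions':resolutions, 'all_k_aux_labels':all_k_aux_labels,'all_distances':all_cosines})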
    
    return df
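
The DataFrame returned by `get_codes` stores the candidates as lists inside single cells, which can be hard to read. The sketch below flattens such rows into one entry per candidate. It is only an illustration: `candidates_long` is a hypothetical helper, and the input is a hand-built stand-in for `get_codes(...).to_dict("records")`, with values abridged from an output shown later in this notebook.

```python
def candidates_long(records):
    """Flatten rows (e.g. from `df.to_dict("records")`) into one dict per candidate."""
    rows = []
    for record in records:
        triples = zip(record["all_codes"], record["resolutions"], record["all_distances"])
        for rank, (code, name, distance) in enumerate(triples, start=1):
            rows.append({
                "chunk": record["chunks"],
                "rank": rank,                 # 1 = best match
                "code": code,
                "resolution": name,
                "distance": float(distance),  # distances are stored as strings
            })
    return rows

# Hand-built stand-in for one row of the helper's output (first three candidates only).
records = [{
    "chunks": "Legal Research Center inc",
    "all_codes": ["LEGAL RESEARCH CENTER INC", "Vector Research LLC", "MATRIXX INITIATIVES INC"],
    "resolutions": ["LEGAL RESEARCH CENTER INC", "Vector Research LLC", "MATRIXX INITIATIVES INC"],
    "all_distances": ["0.0000", "0.2192", "0.2386"],
}]

long_rows = candidates_long(records)
for row in long_rows:
    print(row)
```

The flattened rows can be passed to `pd.DataFrame(long_rows)` if a tabular view is preferred.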

## Sentence Entity Resolver (EDGAR)
[EDGAR's Public Database](https://www.sec.gov/edgar/searchedgar/companysearch)

![image.png](attachment:635a0e2c-5d63-4a2b-be1a-f84aaf49e190.png)

- EDGAR, the Electronic Data Gathering, Analysis, and Retrieval system, is the primary system for companies and others submitting documents under the Securities Act of 1933, the Securities Exchange Act of 1934, the Trust Indenture Act of 1939, and the Investment Company Act of 1940. 

- Access to EDGAR's public database is free, allowing you to research, for example, a public company's financial information and operations by reviewing the filings the company makes with the SEC (U.S. Securities and Exchange Commission).

Here we will normalize company names and find the IRS number of each company using the EDGAR database.

### Company Name Normalization

`Company Name Normalization` is the process of obtaining the name of the company used by data providers, usually the "official" name of the company.

Different data providers may store different versions of the same name, for instance with different punctuation. For example, for Meta:
- Meta Platforms, Inc.
- Meta Platforms Inc.
- Meta Platforms, Inc
- etc
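
To make the problem concrete, the sketch below applies a naive string clean-up to the variants above. This is not what the resolver models do; it is a simple baseline shown for contrast, and `naive_key` is a hypothetical helper. It merges variants that differ only in punctuation, but it cannot map a short name to the official one.

```python
import re

def naive_key(name):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", name.lower())
    return " ".join(cleaned.split())

variants = ["Meta Platforms, Inc.", "Meta Platforms Inc.", "Meta Platforms, Inc"]
keys = {naive_key(v) for v in variants}
print(keys)  # {'meta platforms inc'}

# Punctuation clean-up alone does not link a short name to the registered one.
print(naive_key("Amazon") == naive_key("Amazon.com INC"))  # False
```

Cases like the second one are where an embedding-based resolver trained on the provider's own list of names becomes useful.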

It is therefore essential to carry out `Company Normalization` with the target database or data provider in mind. The data providers available are:
- SEC Edgar
- Crunchbase until 2015
- Wikidata (in progress)

In [None]:
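# The whole input text is treated as the chunk to resolve, so the document
# column is named "ner_chunk" (the name the helper function above expects).
documentAssembler = nlp.DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("ner_chunk")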

embeddings = nlp.UniversalSentenceEncoder.pretrained("tfhub_use", "en") \
      .setInputCols("ner_chunk") \
      .setOutputCol("sentence_embeddings")
    
resolver = legal.SentenceEntityResolverModel.pretrained("legel_edgar_company_name", "en", "legal/models") \
      .setInputCols(["ner_chunk", "sentence_embeddings"]) \
      .setOutputCol("normalized_name")\
      .setDistanceFunction("EUCLIDEAN")

pipelineModel = PipelineModel(
      stages = [
          documentAssembler,
          embeddings,
          resolver])

lp = LightPipeline(pipelineModel)

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]
legel_edgar_company_name download started this may take some time.
[OK!]
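
Conceptually, the resolver stage compares the embedding of the input chunk with the embeddings of the reference names and returns the closest ones. The sketch below is a toy illustration of such a nearest-neighbour lookup with Euclidean distance. It is not the library's implementation: the three-dimensional vectors are invented, and `top_k` is a hypothetical helper.

```python
import math

def top_k(query_vec, index, k=3):
    """Return the k index entries closest to query_vec by Euclidean distance."""
    scored = [(math.dist(query_vec, vec), name) for name, vec in index.items()]
    return sorted(scored)[:k]

# Invented 3-dimensional "embeddings"; real sentence embeddings have many more dimensions.
toy_index = {
    "PRE PAID LEGAL SERVICES INC": (0.9, 0.1, 0.0),
    "LEGAL RESEARCH CENTER INC": (0.1, 0.9, 0.0),
    "CYBER LAW REPORTER INC": (0.0, 0.1, 0.9),
}
query = (0.8, 0.2, 0.0)  # pretend embedding of "Pre Paid Legal Services"

for distance, name in top_k(query, toy_index, k=2):
    print(f"{distance:.4f}  {name}")
```

The ordered list of neighbours corresponds to the candidate lists that appear in the outputs of the following cells.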


In [None]:
text = 'Pre Paid Legal Services'

%time get_codes (lp, text, vocab='normalized_name')

CPU times: user 20.7 ms, sys: 2.62 ms, total: 23.3 ms
Wall time: 1.25 s


|  | chunks | begin | end | code | all_codes | resolutions | all_k_aux_labels | all_distances |
|---|---|---|---|---|---|---|---|---|
| 0 | Pre Paid Legal Services | 0 | 22 | PRE PAID LEGAL SERVICES INC | [PRE PAID LEGAL SERVICES INC, AMERICAN PREPAID LEGAL SERVICES, VIRTU FINANCIAL BD LLC, Virtu Financial BD LLC, HJ Umbaugh Associates Certified Public Accountants LLP, Commerce Nursing Homes LLC, IRUNURUN LLC, Camas Associates LLC, Court Document Services Inc, EZC Medical LLC, Virtu Financial LLC, CST Services LLC, Tontine Associates LLC, JENNISON ASSOCIATES LLC, VTL Associates LLC, Emancipation Management LLC, Hotel Internet Services LLC, Frasca Associates LLC, BCIP T Associates III LLC, MEDIACOM LLC, BBR PARTNERS LLC, Watauga Associates LLC, JCRA FINANCIAL LLC, LifePoint Billing Services LLC] | [PRE PAID LEGAL SERVICES INC, AMERICAN PREPAID LEGAL SERVICES, VIRTU FINANCIAL BD LLC, Virtu Financial BD LLC, HJ Umbaugh Associates Certified Public Accountants LLP, Commerce Nursing Homes LLC, IRUNURUN LLC, Camas Associates LLC, Court Document Services Inc, EZC Medical LLC, Virtu Financial LLC, CST Services LLC, Tontine Associates LLC, JENNISON ASSOCIATES LLC, VTL Associates LLC, Emancipation Management LLC, Hotel Internet Services LLC, Frasca Associates LLC, BCIP T Associates III LLC, MEDIACOM LLC, BBR PARTNERS LLC, Watauga Associates LLC, JCRA FINANCIAL LLC, LifePoint Billing Services LLC] | [] | [0.0553, 0.2114, 0.2679, 0.2679, 0.2806, 0.2823, 0.2881, 0.2881, 0.2883, 0.2892, 0.2895, 0.2895, 0.2897, 0.2919, 0.2932, 0.2932, 0.2949, 0.2952, 0.2963, 0.2966, 0.2980, 0.2980, 0.2981, 0.2986] |


In [None]:
text = 'Legal Research Center inc'

%time get_codes (lp, text, vocab='normalized_name')

CPU times: user 8.09 ms, sys: 2.71 ms, total: 10.8 ms
Wall time: 615 ms


|  | chunks | begin | end | code | all_codes | resolutions | all_k_aux_labels | all_distances |
|---|---|---|---|---|---|---|---|---|
| 0 | Legal Research Center inc | 0 | 24 | LEGAL RESEARCH CENTER INC | [LEGAL RESEARCH CENTER INC, Vector Research LLC, MATRIXX INITIATIVES INC, SYMIC BIOMEDICAL INC, Alliqua BioMedical Inc, EXPERIENTIAL AGENCY INC, PREMIER BIOMEDICAL INC] | [LEGAL RESEARCH CENTER INC, Vector Research LLC, MATRIXX INITIATIVES INC, SYMIC BIOMEDICAL INC, Alliqua BioMedical Inc, EXPERIENTIAL AGENCY INC, PREMIER BIOMEDICAL INC] | [] | [0.0000, 0.2192, 0.2386, 0.2387, 0.2393, 0.2424, 0.2440] |


In [None]:
text = 'Cyber Law Reporter'

%time get_codes (lp, text, vocab='normalized_name')

CPU times: user 10.2 ms, sys: 2.89 ms, total: 13.1 ms
Wall time: 627 ms


|  | chunks | begin | end | code | all_codes | resolutions | all_k_aux_labels | all_distances |
|---|---|---|---|---|---|---|---|---|
| 0 | Cyber Law Reporter | 0 | 17 | CYBER LAW REPORTER INC | [CYBER LAW REPORTER INC, Cyber Informatix Inc, CETERA ADVISOR NETWORKS LLC, Cetera Advisor Networks LLC, COUNSEL COMMUNICATIONS LLC, AirTouch Communications Inc, GEO Corrections Detention LLC] | [CYBER LAW REPORTER INC, Cyber Informatix Inc, CETERA ADVISOR NETWORKS LLC, Cetera Advisor Networks LLC, COUNSEL COMMUNICATIONS LLC, AirTouch Communications Inc, GEO Corrections Detention LLC] | [] | [0.0548, 0.2975, 0.3243, 0.3243, 0.3343, 0.3365, 0.3397] |
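
In the three examples above the best candidate is much closer than the runner-up (for instance 0.0553 versus 0.2114). One possible way to turn this into an acceptance rule is sketched below. The helper `accept_top_match` and both thresholds are assumptions chosen for illustration, not values recommended by the library.

```python
def accept_top_match(distances, max_distance=0.10, min_gap=0.10):
    """Accept the best candidate only if it is close and clearly ahead of the runner-up.

    `distances` is the ascending list from the `all_distances` column (strings or floats).
    Both thresholds are illustrative values, not tuned recommendations.
    """
    values = [float(d) for d in distances]
    if not values:
        return False
    if values[0] > max_distance:
        return False
    # A single candidate has no runner-up to compare against.
    return len(values) == 1 or (values[1] - values[0]) >= min_gap

# Distances taken from the first example above (first three candidates).
print(accept_top_match(["0.0553", "0.2114", "0.2679"]))  # True

# Hypothetical case where even the best candidate is far away.
print(accept_top_match(["0.2576", "0.2630"]))  # False
```

Thresholds of this kind would need to be checked against labelled examples for each resolver model before being relied on.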


### Find Company IRS Number

An employer identification number (EIN) is a nine-digit number assigned by the `IRS`. It's used to identify the tax accounts of employers and certain others who have no employees. The IRS uses the number to identify taxpayers who are required to file various business tax returns. EINs are used by employers, sole proprietors, corporations, partnerships, non-profit associations, trusts, estates of decedents, government agencies, certain individuals, and other business entities.
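
EINs are conventionally written as two digits, a hyphen, and seven digits (XX-XXXXXXX). The sketch below formats a resolved code that way and rejects anything that is not nine digits. `format_ein` is a hypothetical helper. Treating the `0` entries that appear among the candidate codes in the results below as "no usable EIN" is an assumption, since the notebook does not say what they mean.

```python
import re

def format_ein(code):
    """Return a nine-digit code as XX-XXXXXXX, or None if it is not nine digits."""
    digits = str(code).strip()
    if not re.fullmatch(r"\d{9}", digits):
        return None
    return f"{digits[:2]}-{digits[2:]}"

print(format_ein("208058531"))  # 20-8058531
print(format_ein("0"))          # None (assumed to mean: no usable EIN)
```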

![image.png](attachment:01d5797e-94ae-4d7a-acb6-52cfa88ca194.png)


In [None]:
documentAssembler = nlp.DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("ner_chunk")

embeddings = nlp.UniversalSentenceEncoder.pretrained("tfhub_use", "en") \
      .setInputCols("ner_chunk") \
      .setOutputCol("sentence_embeddings")
    
resolver = legal.SentenceEntityResolverModel.pretrained("legel_edgar_irs", "en", "legal/models") \
      .setInputCols(["ner_chunk", "sentence_embeddings"]) \
      .setOutputCol("irs_code")\
      .setDistanceFunction("EUCLIDEAN")

pipelineModel = PipelineModel(
      stages = [
          documentAssembler,
          embeddings,
          resolver])

lp = LightPipeline(pipelineModel)

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]
legel_edgar_irs download started this may take some time.
[OK!]


In [None]:
text = 'LEGAL GENERAL INVESTMENT MANAGEMENT AMERICA INC'

%time get_codes (lp, text, vocab='irs_code')

CPU times: user 8.21 ms, sys: 2.72 ms, total: 10.9 ms
Wall time: 685 ms


|  | chunks | begin | end | code | all_codes | resolutions | all_k_aux_labels | all_distances |
|---|---|---|---|---|---|---|---|---|
| 0 | LEGAL GENERAL INVESTMENT MANAGEMENT AMERICA INC | 0 | 46 | 208058531 | [208058531, 0, 440640487, 133008848] | [Legal General Investment Management America Inc, Legal General Investment Management America, AMERICAN CENTURY INVESTMENT MANAGEMENT INC, AMERICAN CAPITAL MANAGEMENT INC] | [] | [0.0000, 0.0403, 0.1420, 0.1569] |


In [None]:
text = 'Justice Delawere Holdco inc'

%time get_codes (lp, text, vocab='irs_code')

CPU times: user 9.5 ms, sys: 4.13 ms, total: 13.6 ms
Wall time: 623 ms


|  | chunks | begin | end | code | all_codes | resolutions | all_k_aux_labels | all_distances |
|---|---|---|---|---|---|---|---|---|
| 0 | Justice Delawere Holdco inc | 0 | 26 | 455011014 | [455011014, 0, 954695021, 231726661, 261327790, 352567439, 521951797] | [Justice Delaware Holdco Inc, ChowNow Inc, PeopleSupport Inc, JUDGE GROUP INC, ABVIVA INC, MOVEIX INC, PATAPSCO BANCORP INC] | [] | [0.1576, 0.2576, 0.2630, 0.2637, 0.2648, 0.2690, 0.2738] |


In [None]:
text = 'Legal Research Center inc'

%time get_codes (lp, text, vocab='irs_code')

CPU times: user 12.3 ms, sys: 1.21 ms, total: 13.6 ms
Wall time: 619 ms


|  | chunks | begin | end | code | all_codes | resolutions | all_k_aux_labels | all_distances |
|---|---|---|---|---|---|---|---|---|
| 0 | Legal Research Center inc | 0 | 24 | 411680384 | [411680384, 0, 870482806, 582349413, 880471263] | [LEGAL RESEARCH CENTER INC, Vector Research LLC, MATRIXX INITIATIVES INC, Alliqua BioMedical Inc, EXPERIENTIAL AGENCY INC] | [] | [0.0000, 0.2192, 0.2386, 0.2393, 0.2424] |


## Sentence Entity Resolver (CRUNCHBASE)
[Crunchbase Homepage](https://www.crunchbase.com/)

![image.png](attachment:eb4df463-fe67-48d7-a8cc-83a993234fc7.png)

- Crunchbase is a platform for finding business information about private and public companies. Originally built to track startups, the Crunchbase website contains information on public and private companies on a global scale.

Here we will normalize company names using the Crunchbase database.

In [None]:
documentAssembler = nlp.DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("ner_chunk")

embeddings = nlp.UniversalSentenceEncoder.pretrained("tfhub_use", "en") \
      .setInputCols("ner_chunk") \
      .setOutputCol("sentence_embeddings")
    
resolver = legal.SentenceEntityResolverModel.pretrained("legel_crunchbase_companynames", "en", "legal/models") \
      .setInputCols(["ner_chunk", "sentence_embeddings"]) \
      .setOutputCol("name")\
      .setDistanceFunction("EUCLIDEAN")

pipelineModel = PipelineModel(
      stages = [
          documentAssembler,
          embeddings,
          resolver])

lp = LightPipeline(pipelineModel)

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]
legel_crunchbase_companynames download started this may take some time.
[OK!]


In [None]:
text = 'Legalcrunch'

%time get_codes (lp, text, vocab='name')

CPU times: user 10.8 ms, sys: 605 µs, total: 11.4 ms
Wall time: 223 ms


|  | chunks | begin | end | code | all_codes | resolutions | all_k_aux_labels | all_distances |
|---|---|---|---|---|---|---|---|---|
| 0 | Legalcrunch | 0 | 10 | LegalCrunch, Inc. | [LegalCrunch, Inc., Pitzi, Adisn, XChanger Companies, Terviu, Brazzlebox, AnySource Media, ikaSystems, Teikhos Tech, ProPlan] | [LegalCrunch, Inc., Pitzi, Adisn, XChanger Companies, Terviu, Brazzlebox, AnySource Media, ikaSystems, Teikhos Tech, ProPlan] | [] | [0.0000, 0.0373, 0.0391, 0.0411, 0.0435, 0.0441, 0.0452, 0.0454, 0.0455, 0.0460] |


In [None]:
text = 'Shwrm'

%time get_codes (lp, text, vocab='name')

CPU times: user 12.5 ms, sys: 3.74 ms, total: 16.2 ms
Wall time: 234 ms


|  | chunks | begin | end | code | all_codes | resolutions | all_k_aux_labels | all_distances |
|---|---|---|---|---|---|---|---|---|
| 0 | Shwrm | 0 | 4 | Shwrüm | [Shwrüm, Xervmon Inc, ADVANCED CREDIT TECHNOLOGIES, Quickcomm Software Solutions, citysocializer, SurgiQuest, ShoutNow, Reset Therapeutics, MoneyReef, TopiVert, Brevity, Phoenix Health and Safety, Learnpedia Edutech Solutions, Stumpedia] | [Shwrüm, Xervmon Inc, ADVANCED CREDIT TECHNOLOGIES, Quickcomm Software Solutions, citysocializer, SurgiQuest, ShoutNow, Reset Therapeutics, MoneyReef, TopiVert, Brevity, Phoenix Health and Safety, Learnpedia Edutech Solutions, Stumpedia] | [] | [0.0000, 0.0436, 0.0448, 0.0471, 0.0488, 0.0488, 0.0491, 0.0497, 0.0497, 0.0500, 0.0500, 0.0504, 0.0507, 0.0507] |


In [None]:
text = 'Waywire'

%time get_codes (lp, text, vocab='name')

CPU times: user 9.3 ms, sys: 1.21 ms, total: 10.5 ms
Wall time: 166 ms


|  | chunks | begin | end | code | all_codes | resolutions | all_k_aux_labels | all_distances |
|---|---|---|---|---|---|---|---|---|
| 0 | Waywire | 0 | 6 | #waywire | [#waywire, Limonetik, Totally Interactive Weather, ThoughtFocus, 2345.com, WebNotes, Synovex, relocality, Grab Media] | [#waywire, Limonetik, Totally Interactive Weather, ThoughtFocus, 2345.com, WebNotes, Synovex, relocality, Grab Media] | [] | [0.0000, 0.0431, 0.0434, 0.0441, 0.0443, 0.0445, 0.0452, 0.0458, 0.0459] |
