![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/finance-nlp/09.1.Entity_Resolution_Edgar_unique_IDs_Tickers.ipynb)

#🔎 Financial Entity Resolution

**In this notebook, we continue from where we left off in the `7.Entity_Resolution` notebook.**

#🎬 Installation

In [None]:
! pip install -q johnsnowlabs

##🔗 Automatic Installation
Using my.johnsnowlabs.com SSO

In [None]:
from johnsnowlabs import nlp, finance

# nlp.install(force_browser=True)

##🔗 Manual downloading
If you are not registered at my.johnsnowlabs.com, you received your license by email, or you are using Safari, you may need to upload the license manually.

- Go to my.johnsnowlabs.com
- Download your license
- Upload it using the following command

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

- Install it

In [None]:
nlp.install()

#📌 Starting

In [None]:
spark = nlp.start()

#🔎 Sentence Entity Resolver Models

Entity resolution is an important task in natural language processing and information extraction because it enables more accurate analysis of financial texts. For example, in a news article discussing the performance of a company's stock, correctly identifying and disambiguating the company's name is crucial for tracking that stock's performance.

A common NLP use case in financial and legal applications is detecting the financial entities present in a text. One of those entities is `Company Name`. NER can extract these chunks, but in real financial and legal use cases the company name as it appears in the text is usually not directly usable. Often we need the _official_ name of the company as registered in Edgar, Crunchbase or Nasdaq (for example, `Amazon.com INC` instead of `Amazon`). The pretrained sentence entity resolver models shown below, with examples, serve this purpose.

##📜 Retrieving official / unique IDs

Besides mapping non-normalized strings to a normalized version (for example, the company name as it appears in a registry), Entity Resolution can also map them to unique IDs, such as the IRS number in the Edgar database.

Let's take a look at how we do it.

#🔎 Pretrained Entity Resolution Models for Finance

Here is the list of pretrained Entity Resolution models:


|index|model|
|-----:|:-----|
| 1| [Company Name Normalization Using Nasdaq](https://nlp.johnsnowlabs.com/2022/10/22/finel_nasdaq_data_company_name_en.html)  |
| 2| [Company Name Normalization Using Edgar Database](https://nlp.johnsnowlabs.com/2022/08/30/finel_edgar_company_name_en.html)  |
| 3| [Company Names Normalization Using Crunchbase](https://nlp.johnsnowlabs.com/2022/09/28/finre_work_experience_en.html)  | 
| 4| [Company Name to Ticker Using Nasdaq](https://nlp.johnsnowlabs.com/2022/10/22/finel_nasdaq_data_ticker_en.html)  | 
| 5| [Company Name to IRS Number Using Edgar Database](https://nlp.johnsnowlabs.com/2022/08/30/finel_edgar_irs_en.html)  |
| 6| [Resolve Tickers to Company Names Using Nasdaq](https://nlp.johnsnowlabs.com/2022/09/09/finel_tickers2names_en.html)  |
| 7| [Resolve Company Names to Tickers Using Nasdaq](https://nlp.johnsnowlabs.com/2022/09/08/finel_names2tickers_en.html)  | 

##🔎 Common Components

In [None]:
def get_generic_base_pipeline():
  """Common components used in all NER pipelines"""
  document_assembler = nlp.DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("document")

  text_splitter = finance.TextSplitter()\
      .setInputCols(["document"])\
      .setOutputCol("sentence")
  
  tokenizer = nlp.Tokenizer()\
      .setInputCols(["sentence"])\
      .setOutputCol("token")

  embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
      .setInputCols(["sentence", "token"])\
      .setOutputCol("embeddings")

  base_pipeline = nlp.Pipeline(stages=[
      document_assembler,
      text_splitter,
      tokenizer,
      embeddings
  ])

  return base_pipeline
    
generic_base_pipeline = get_generic_base_pipeline()

bert_embeddings_sec_bert_base download started this may take some time.
Approximate size to download 390.4 MB
[OK!]


📜Besides the code in the `result` field, the resolver provides more metadata about the matching process:

- target_text -> Text to resolve
- resolved_text -> Best match text
- confidence -> Relative confidence for the top match (distance to probability)
- confidence_ratio -> Relative confidence for the top match. TopMatchConfidence / SecondMatchConfidence
- alternative_codes -> List of other plausible codes (in the KNN neighborhood)
- all_k_resolutions -> All codes descriptions
- all_k_results -> All resolved codes for metrics calculation purposes
- sentence -> SentenceId

We will use the following generic function to get the resolved codes:

In [None]:
import pandas as pd
pd.set_option('display.max_colwidth', 0)

def get_codes (lp, text, vocab='company_name', hcc=False):

    """Returns LightPipeline resolution results"""
    
    full_light_result = lp.fullAnnotate(text)

    chunks = []
    codes = []
    begin = []
    end = []
    resolutions=[]
    all_distances =[]
    all_codes=[]
    all_cosines = []
    all_k_aux_labels=[]

    for i in range(len(full_light_result)):

      for chunk, code in zip(full_light_result[i]['ner_chunk'], full_light_result[i][vocab]):   
          begin.append(chunk.begin)
          end.append(chunk.end)
          chunks.append(chunk.result)
          codes.append(code.result) 
          all_codes.append(code.metadata['all_k_results'].split(':::'))
          resolutions.append(code.metadata['all_k_resolutions'].split(':::'))
          all_distances.append(code.metadata['all_k_distances'].split(':::'))
          all_cosines.append(code.metadata['all_k_cosine_distances'].split(':::'))
          if hcc:
              try:
                  all_k_aux_labels.append(code.metadata['all_k_aux_labels'].split(':::'))
              except KeyError:
                  # Not every resolver returns auxiliary labels
                  all_k_aux_labels.append([])
          else:
              all_k_aux_labels.append([])

    # Note: the 'all_distances' column is filled with the cosine distances (all_cosines)
    df = pd.DataFrame({'chunks':chunks, 'begin': begin, 'end':end, 'code':codes, 'all_codes':all_codes, 
                       'resolutions':resolutions, 'all_k_aux_labels':all_k_aux_labels,'all_distances':all_cosines})
    
    return df

###✅ Normalized Name
In Edgar, the official company name is different! We need to obtain it before we can augment the data with external information from EDGAR.

- Incorrect: `CADENCE DESIGN SYSTEMS, INC` , `Cadence Design Systems, Inc`
- Correct (Official): `CADENCE DESIGN SYSTEMS INC`
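
To make the difference concrete, here is a naive sketch that maps both incorrect variants above to the official form by uppercasing and removing punctuation. It is only an illustration: `naive_normalize` is a hypothetical helper, it is not what the resolver models do, and it cannot recover official names that differ by more than case and punctuation.

```python
import re


def naive_normalize(name):
    """Uppercase, drop punctuation and collapse whitespace (illustrative heuristic only)."""
    no_punct = re.sub(r"[^\w\s]", "", name)
    return re.sub(r"\s+", " ", no_punct).strip().upper()


for raw in ["CADENCE DESIGN SYSTEMS, INC", "Cadence Design Systems, Inc"]:
    print(raw, "->", naive_normalize(raw))
```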

In [None]:
# 'CADENCE DESIGN SYSTEMS, INC'
NORM_ORG = 'CADENCE DESIGN SYSTEMS INC'

NORM_ORG

'CADENCE DESIGN SYSTEMS INC'

###✅ Find Company IRS Number

An employer identification number (EIN) is a nine-digit number assigned by the `IRS`. It's used to identify the tax accounts of employers and certain others who have no employees. The IRS uses the number to identify taxpayers who are required to file various business tax returns. EINs are used by employers, sole proprietors, corporations, partnerships, non-profit associations, trusts, estates of decedents, government agencies, certain individuals, and other business entities.
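
EINs are conventionally written as `XX-XXXXXXX`, while the resolver in this notebook returns the code as a plain digit string. The sketch below checks that a string has exactly nine digits and renders it in the hyphenated form. `format_ein` is a hypothetical helper; rejecting shorter codes instead of zero-padding them is an assumption made for this example.

```python
import re


def format_ein(code):
    """Render a nine-digit EIN string as XX-XXXXXXX; raise ValueError otherwise."""
    digits = code.strip()
    if not re.fullmatch(r"[0-9]{9}", digits):
        raise ValueError(f"expected nine digits, got {code!r}")
    return f"{digits[:2]}-{digits[2:]}"


print(format_ein("770148231"))
```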




In [None]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("ner_chunk")

embeddings = nlp.UniversalSentenceEncoder.pretrained()\
    .setInputCols("ner_chunk")\
    .setOutputCol("sentence_embeddings")

resolver = finance.SentenceEntityResolverModel.pretrained("finel_edgar_irs", "en", "finance/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("irs_code")\
    .setDistanceFunction("EUCLIDEAN")

pipeline = nlp.Pipeline(
      stages = [
          document_assembler,
          embeddings,
          resolver])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = pipeline.fit(empty_data)

lp = nlp.LightPipeline(model)

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]
finel_edgar_irs download started this may take some time.
[OK!]


In [None]:
%time get_codes(lp, NORM_ORG, vocab='irs_code')

CPU times: user 3 µs, sys: 1e+03 ns, total: 4 µs
Wall time: 8.11 µs


| |chunks|begin|end|code|all_codes|resolutions|all_k_aux_labels|all_distances|
|-----:|:-----|-----:|-----:|:-----|:-----|:-----|:-----|:-----|
| 0| CADENCE DESIGN SYSTEMS INC| 0| 25| 770148231| [770148231, 943314374, 20493372, 562018819, 0, 541252625, 272957582]| [770148231, 943314374, 20493372, 562018819, 0, 541252625, 272957582]| []| [0.0000, 0.2023, 0.2060, 0.2161, 0.2220, 0.2273, 0.2289]|
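
In this result the top candidate has a distance of 0.0000 and the runner-up 0.2023. A rough way to quantify how unambiguous a match is would be the gap between the two smallest distances, sketched below with the values from this output. `top_match_margin` is a hypothetical helper, and any cut-off applied to the margin would be an assumption to tune on your own data.

```python
def top_match_margin(distances):
    """Gap between the best and second-best distance (larger means less ambiguous)."""
    ordered = sorted(distances)
    if len(ordered) < 2:
        raise ValueError("need at least two candidates")
    return ordered[1] - ordered[0]


# Distances copied from the IRS resolution output in this notebook.
irs_distances = [0.0000, 0.2023, 0.2060, 0.2161, 0.2220, 0.2273, 0.2289]
margin = top_match_margin(irs_distances)
print(f"margin = {margin:.4f}")
```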


###✅ Find Ticker Symbol using Company Name

A `Ticker Symbol` is a short string of letters that represents a stock traded on the stock market. It is usually a unique combination of two, three or four letters that makes it easy for investors to identify, buy and sell that stock on the exchange. Symbols with four or more letters generally denote securities traded on the American Stock Exchange and NASDAQ.

Here we will find the ticker symbol of `CADENCE DESIGN SYSTEMS INC` on Nasdaq.


In [None]:
resolver = finance.SentenceEntityResolverModel.pretrained("finel_names2tickers", "en", "finance/models") \
      .setInputCols(["sentence_embeddings"]) \
      .setOutputCol("name")\
      .setDistanceFunction("EUCLIDEAN")

pipelineModel = nlp.PipelineModel(
      stages = [
          document_assembler,
          embeddings,
          resolver])

lp = nlp.LightPipeline(pipelineModel)

finel_names2tickers download started this may take some time.
[OK!]


In [None]:
%time get_codes(lp, NORM_ORG, vocab='name')

CPU times: user 2 µs, sys: 1e+03 ns, total: 3 µs
Wall time: 7.63 µs


| |chunks|begin|end|code|all_codes|resolutions|all_k_aux_labels|all_distances|
|-----:|:-----|-----:|-----:|:-----|:-----|:-----|:-----|:-----|
| 0| CADENCE DESIGN SYSTEMS INC| 0| 25| CDNS| [CDNS, UAVS, LLL, LQDA, CVLT, AXNX, GVP]| [CDNS, UAVS, LLL, LQDA, CVLT, AXNX, GVP]| []| [0.0000, 0.2629, 0.2659, 0.2676, 0.2696, 0.2751, 0.2795]|


###✅ Find Company Name using Ticker Symbol

We can also find the company name from the ticker symbol. For this, we take the ticker from the NER model result (see the `7.Entity_Resolution` notebook) and then resolve it to a company name with the `finel_tickers2names` model.

In [None]:
TICKER = 'CDNS'
TICKER

'CDNS'

In [None]:
resolver = finance.SentenceEntityResolverModel.pretrained("finel_tickers2names", "en", "finance/models") \
      .setInputCols(["sentence_embeddings"]) \
      .setOutputCol("name")\
      .setDistanceFunction("EUCLIDEAN")

pipeline = nlp.Pipeline(stages = [
          document_assembler,
          embeddings,
          resolver])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = pipeline.fit(empty_data)

lp = nlp.LightPipeline(model)

finel_tickers2names download started this may take some time.
[OK!]


In [None]:
%time get_codes(lp, TICKER, vocab='name')

CPU times: user 3 µs, sys: 1 µs, total: 4 µs
Wall time: 7.63 µs


| |chunks|begin|end|code|all_codes|resolutions|all_k_aux_labels|all_distances|
|-----:|:-----|-----:|-----:|:-----|:-----|:-----|:-----|:-----|
| 0| CDNS| 0| 3| Cadence Design Systems Inc.| [Cadence Design Systems Inc., Madison Covered Call & Equity Strategy Fund, INVESCO MORTGAGE CAPITAL INC, SiteOne Landscape Supply Inc., Ituran Location and Control Ltd.]| [CADENCE DESIGN SYSTEMS INC., Madison Covered Call & Equity Strategy Fund, Invesco Mortgage Capital Inc, SiteOne Landscape Supply INC., ITURAN LOCATION AND CONTROL LTD.]| []| [0.0000, 0.2412, 0.2625, 0.2760, 0.3043]|
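
With both directions resolved, a simple sanity check is to confirm that the name returned for the ticker matches the name we started from. The sketch below compares the two strings while ignoring case and a trailing period, using only values that appear in the outputs of this notebook. `same_company` is a hypothetical helper, and its comparison rule is an assumption that fits this example, not a general matching strategy.

```python
def same_company(name_a, name_b):
    """Compare two company names, ignoring case and a trailing period."""
    def clean(name):
        return name.strip().rstrip(".").casefold()
    return clean(name_a) == clean(name_b)


# Values taken from the outputs in this notebook.
edgar_name = "CADENCE DESIGN SYSTEMS INC"    # input to finel_names2tickers
ticker = "CDNS"                               # its top result
nasdaq_name = "Cadence Design Systems Inc."   # top result of finel_tickers2names for CDNS

consistent = same_company(edgar_name, nasdaq_name)
print(f"{edgar_name} -> {ticker} -> {nasdaq_name}: round trip consistent = {consistent}")
```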
