![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

#🔎 Financial Entity Resolution

In [0]:
from johnsnowlabs import *

#🔎 Sentence Entity Resolver Models

📜Entity resolution is an important task in natural language processing and information extraction because it enables more accurate analysis of financial texts. For example, in a news article about a company's stock performance, correctly identifying and disambiguating the company's name is essential for tracking that stock.

📜A common NLP use case in financial and legal applications is identifying the financial entities present in a text. One such entity is `Company Name`. NER can extract these chunks, but in real financial and legal use cases the company name as it appears in the text is often not directly usable. Sometimes we need the _official_ name of the company (`Amazon.com INC` instead of `Amazon`), as registered in Edgar, Crunchbase and Nasdaq. The pretrained sentence entity resolver models listed below serve this purpose, and the examples that follow show how to use them.

##✅ Pretrained Entity Resolution Models for Finance

Here is the list of pretrained Entity Resolution models:


|index|model|
|-----:|:-----|
| 1| [Company Name Normalization Using Nasdaq](https://nlp.johnsnowlabs.com/2022/10/22/finel_nasdaq_data_company_name_en.html)  |
| 2| [Company Name Normalization Using Edgar Database](https://nlp.johnsnowlabs.com/2022/08/30/finel_edgar_company_name_en.html)  |
| 3| [Company Names Normalization Using Crunchbase](https://nlp.johnsnowlabs.com/2022/09/28/finre_work_experience_en.html)  | 
| 4| [Company Name to Ticker Using Nasdaq](https://nlp.johnsnowlabs.com/2022/10/22/finel_nasdaq_data_ticker_en.html)  | 
| 5| [Company Name to IRS Number Using Edgar Database](https://nlp.johnsnowlabs.com/2022/08/30/finel_edgar_irs_en.html)  |
| 6| [Resolve Tickers to Company Names Using Nasdaq](https://nlp.johnsnowlabs.com/2022/09/09/finel_tickers2names_en.html)  |
| 7| [Resolve Company Names to Tickers Using Nasdaq](https://nlp.johnsnowlabs.com/2022/09/08/finel_names2tickers_en.html)  |

##✅ Common Components
📜This pipeline will:
1.   Split Text into Sentences
2.   Split Sentences into Words
3.   Use Financial Text Embeddings, trained on SEC documents, to obtain numerical semantic representations of words

**These components are common for all the pipelines we will use.**

In [0]:
def get_generic_base_pipeline():
  """Common components used in all NER pipelines"""
  document_assembler = nlp.DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("document")

  text_splitter = finance.TextSplitter()\
      .setInputCols(["document"])\
      .setOutputCol("sentence")
  
  tokenizer = nlp.Tokenizer()\
      .setInputCols(["sentence"])\
      .setOutputCol("token")

  embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
      .setInputCols(["sentence", "token"])\
      .setOutputCol("embeddings")

  base_pipeline = nlp.Pipeline(stages=[
      document_assembler,
      text_splitter,
      tokenizer,
      embeddings
  ])

  return base_pipeline
    
generic_base_pipeline = get_generic_base_pipeline()

📜Besides returning the code in the `result` field, each resolver annotation provides additional metadata about the matching process:

- target_text -> Text to resolve
- resolved_text -> Best match text
- confidence -> Relative confidence for the top match (distance to probability)
- confidence_ratio -> Relative confidence for the top match. TopMatchConfidence / SecondMatchConfidence
- alternative_codes -> List of other plausible codes (in the KNN neighborhood)
- all_k_resolutions -> Descriptions of all candidate codes
- all_k_results -> All resolved codes for metrics calculation purposes
- sentence -> SentenceId
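
The list-valued fields arrive as single strings joined by `:::`, which is how the `get_codes` helper in this notebook splits them. The following is a minimal sketch of unpacking them, not the library's own code: `sample_metadata` is a hand-built dictionary standing in for a real annotation's `metadata`, `unpack_candidates` is a hypothetical helper name, and the names and distances are copied from the Cadence output shown later in this notebook. It assumes candidates are listed best match first, as in that output.

```python
# Hand-built stand-in for annotation.metadata (not produced by a real pipeline).
# Names and distances are copied from the Cadence example output in this notebook.
sample_metadata = {
    "all_k_results": "CADENCE DESIGN SYSTEMS INC:::DESIGN WITHIN REACH INC:::AVICI SYSTEMS INC",
    "all_k_resolutions": "CADENCE DESIGN SYSTEMS INC:::DESIGN WITHIN REACH INC:::AVICI SYSTEMS INC",
    "all_k_cosine_distances": "0.0000:::0.2023:::0.2060",
}


def unpack_candidates(metadata, sep=":::"):
    """Turn the ':::'-joined metadata strings into (code, description, distance) tuples."""
    codes = metadata["all_k_results"].split(sep)
    descriptions = metadata["all_k_resolutions"].split(sep)
    distances = [float(d) for d in metadata["all_k_cosine_distances"].split(sep)]
    return list(zip(codes, descriptions, distances))


candidates = unpack_candidates(sample_metadata)

# Assumption: the first candidate is the best match (smallest distance).
best_code, best_description, best_distance = candidates[0]
print(best_code, best_distance)
```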

We will use the following generic function to collect the resolved codes and their candidates into a DataFrame:

In [0]:
import pandas as pd
pd.set_option('display.max_colwidth', 0)

def get_codes (lp, text, vocab='company_name', hcc=False):

    """Returns LightPipeline resolution results"""
    
    full_light_result = lp.fullAnnotate(text)

    chunks = []
    codes = []
    begin = []
    end = []
    resolutions=[]
    all_distances =[]
    all_codes=[]
    all_cosines = []
    all_k_aux_labels=[]

    for i in range(len(full_light_result)):

      for chunk, code in zip(full_light_result[i]['ner_chunk'], full_light_result[i][vocab]):   
          begin.append(chunk.begin)
          end.append(chunk.end)
          chunks.append(chunk.result)
          codes.append(code.result) 
          all_codes.append(code.metadata['all_k_results'].split(':::'))
          resolutions.append(code.metadata['all_k_resolutions'].split(':::'))
          all_distances.append(code.metadata['all_k_distances'].split(':::'))
          all_cosines.append(code.metadata['all_k_cosine_distances'].split(':::'))
          if hcc:
              try:
                  all_k_aux_labels.append(code.metadata['all_k_aux_labels'].split(':::'))
              except KeyError:
                  # This resolver does not expose auxiliary labels
                  all_k_aux_labels.append([])
          else:
              all_k_aux_labels.append([])

    # Note: the 'all_distances' column is filled with the cosine distances (all_cosines)
    df = pd.DataFrame({'chunks':chunks, 'begin': begin, 'end':end, 'code':codes, 'all_codes':all_codes, 
                       'resolutions':resolutions, 'all_k_aux_labels':all_k_aux_labels,'all_distances':all_cosines})
    
    return df

##✅ Sample Texts from Cadence Design Systems
Examples are taken from publicly available information about Cadence in the SEC's Edgar database ([here](https://www.sec.gov/Archives/edgar/data/813672/000081367222000012/cdns-20220101.htm)) and on [Wikipedia](https://en.wikipedia.org/wiki/Cadence_Design_Systems).

In [0]:
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/finance-nlp/data/cdns-20220101.html.txt

In [0]:
with open('cdns-20220101.html.txt', 'r') as f:
  cadence_sec10k = f.read()
print(cadence_sec10k[:500])

In [0]:
pages = [x for x in cadence_sec10k.split("Table of Contents") if x.strip() != '']
print(pages[0])
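
To make the splitting step concrete, here is a small sketch on a toy string. `toy_filing` is invented for illustration and is not taken from the Cadence filing. The marker text and the empty-page filter mirror the cell above.

```python
# Toy document: the marker appears three times, and one "page" is only whitespace.
toy_filing = (
    "Table of Contents\nCover page text\n"
    "Table of Contents\n  \n"
    "Table of ContentsItem 1. Business"
)

# Split on the marker and drop pages that are empty or whitespace-only.
toy_pages = [page for page in toy_filing.split("Table of Contents") if page.strip() != ""]

print(len(toy_pages))
```

The text before the first marker and the whitespace-only page are both discarded, so only pages with real content remain.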

##✅ Using Text Classification to find Relevant Parts of the Document: 10K Summary
Here we already know that page 0 holds the summary information about the company. If we did not, we could find it with page classification.

To detect the SEC 10K Summary page, we use a dedicated model called `finclf_form_10k_summary_item`.

In [0]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

embeddings = nlp.UniversalSentenceEncoder.pretrained()\
    .setInputCols("document")\
    .setOutputCol("sentence_embeddings")
    
classifier = finance.ClassifierDLModel.pretrained("finclf_form_10k_summary_item", "en", "finance/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("category")

clf_pipeline = nlp.Pipeline(stages=[
    document_assembler, 
    embeddings,
    classifier])

df = spark.createDataFrame([[pages[0]]]).toDF("text")

model = clf_pipeline.fit(df)

result = model.transform(df)

In [0]:
result.select('category.result').show()

##✅ NER: Named Entity Recognition on 10K Summary
NER is the main component for information extraction: it extracts entities from text.

This time we will use a model trained to extract many entities from 10K summaries.

In [0]:
ner_model_sec10k = finance.NerModel.pretrained("finner_sec_10k_summary", "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner_summary")

ner_converter_sec10k = nlp.NerConverter()\
    .setInputCols(["sentence","token","ner_summary"])\
    .setOutputCol("ner_chunk_sec10k")

summary_pipeline = nlp.Pipeline(stages=[
    generic_base_pipeline,
    ner_model_sec10k,
    ner_converter_sec10k
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

summary_model = summary_pipeline.fit(empty_data)


In [0]:
summary_sample_text = pages[0]

light_summary_model = nlp.LightPipeline(summary_model)

summary_results = light_summary_model.fullAnnotate(summary_sample_text)

chunks = []
entities = []
begin = []
end = []

for n in summary_results[0]['ner_chunk_sec10k']:
        
    begin.append(n.begin)
    end.append(n.end)
    chunks.append(n.result)
    entities.append(n.metadata['entity']) 
    
df = pd.DataFrame({'chunks':chunks, 'begin': begin, 'end':end, 'entities':entities})

df.head(20)

|index|chunks|begin|end|entities|
|-----:|:-----|-----:|-----:|:-----|
| 0| January 1, 2022|287|301|FISCAL_YEAR|
| 1| 000-15867|476|484|CFN|
| 2| CADENCE DESIGN SYSTEMS, INC|527|553|ORG|
| 3| Delaware|650|657|STATE|
| 4| 00-0000000|661|670|IRS|
| 5| 2655 Seely Avenue, Building 5,\nSan Jose,\nCalifornia|772|822|ADDRESS|
| 6| (408)\n-943-1234|886|900|PHONE|
| 7| Common Stock|1098|1109|TITLE_CLASS|
| 8| $0.01|1112|1116|TITLE_CLASS_VALUE|
| 9| CDNS|1138|1141|TICKER|


In [0]:
ORG = list(df[df["entities"] == "ORG"]["chunks"])
ORG
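
The same filtering can be done for every label at once. The following is a sketch under simple assumptions: `ner_rows` is a hand-copied subset of the (chunk, entity) pairs from the NER output table above, and `group_chunks_by_entity` is a hypothetical helper name, not part of the library.

```python
from collections import defaultdict

# (chunk, entity) pairs copied by hand from the NER output table above.
ner_rows = [
    ("January 1, 2022", "FISCAL_YEAR"),
    ("000-15867", "CFN"),
    ("CADENCE DESIGN SYSTEMS, INC", "ORG"),
    ("Delaware", "STATE"),
    ("CDNS", "TICKER"),
]


def group_chunks_by_entity(rows):
    """Collect chunk texts under their entity label, keeping the original order."""
    grouped = defaultdict(list)
    for chunk, entity in rows:
        grouped[entity].append(chunk)
    return dict(grouped)


chunks_by_entity = group_chunks_by_entity(ner_rows)
print(chunks_by_entity["ORG"])
```

A dictionary keyed by label makes it easy to pass, for example, all `ORG` chunks to a name resolver and all `TICKER` chunks to a ticker resolver.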

##✅ Company Name Normalization using Edgar

**Normalizing the company name to query John Snow Labs data sources for more information about Cadence.**

📜Sometimes, companies in texts use a non-official, abbreviated name. For example, we can find `Cadence`, `Cadence Inc`, `Cadence, Inc`, or many other variations, where the official name of the company is `CADENCE DESIGN SYSTEMS INC`, as registered in SEC Edgar.

[Edgar's Public Database](https://www.sec.gov/edgar/searchedgar/companysearch)
- EDGAR, the Electronic Data Gathering, Analysis, and Retrieval system, is the primary system for companies and others submitting documents under the Securities Act of 1933, the Securities Exchange Act of 1934, the Trust Indenture Act of 1939, and the Investment Company Act of 1940. 

- Access to EDGAR’s public database is free, allowing you to research, for example, a public company’s financial information and operations by reviewing the filings the company makes with the SEC. (U.S. Securities and Exchange Commission)

Normalizing a company name is super important for data quality purposes. It will help us:
- Standardize the data, improving the quality;
- Carry out additional verifications;
- Join different databases or extract from external sources;
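
The join benefit can be illustrated with a small sketch. Everything here is a stand-in: `official_name` plays the role of the resolver output (its two name pairs are the ones shown in this notebook), `filing_facts` and `market_facts` are hypothetical sources that spell the company differently, and `join_on_official_name` is a hypothetical helper.

```python
# Stand-in for resolver output: raw mention -> official Edgar name.
official_name = {
    "CADENCE DESIGN SYSTEMS, INC": "CADENCE DESIGN SYSTEMS INC",
    "Cadence Design Systems, Inc": "CADENCE DESIGN SYSTEMS INC",
}

# Two hypothetical sources that spell the same company differently.
filing_facts = [{"company": "Cadence Design Systems, Inc", "state": "Delaware"}]
market_facts = [{"company": "CADENCE DESIGN SYSTEMS, INC", "ticker": "CDNS"}]


def join_on_official_name(left, right, mapping):
    """Inner-join two lists of records on the normalized company name."""
    right_by_name = {mapping.get(r["company"], r["company"]): r for r in right}
    joined = []
    for row in left:
        key = mapping.get(row["company"], row["company"])
        match = right_by_name.get(key)
        if match is not None:
            joined.append({**match, **row, "company": key})
    return joined


joined_rows = join_on_official_name(filing_facts, market_facts, official_name)
unjoined_rows = join_on_official_name(filing_facts, market_facts, {})  # no normalization
print(joined_rows, unjoined_rows)
```

Without the mapping the two spellings never match and the join comes back empty. With it, both records land on the same official name.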

In [0]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("ner_chunk")

embeddings = nlp.UniversalSentenceEncoder.pretrained()\
    .setInputCols("ner_chunk")\
    .setOutputCol("sentence_embeddings")

resolver = finance.SentenceEntityResolverModel.pretrained("finel_edgar_company_name", "en", "finance/models")\
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("normalization")\
    .setDistanceFunction("EUCLIDEAN")

pipeline = nlp.Pipeline(stages = [
          document_assembler,
          embeddings,
          resolver])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = pipeline.fit(empty_data)

lp = nlp.LightPipeline(model)
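
Conceptually, the resolver compares the embedding of the input chunk with the embeddings of known names and returns the nearest ones (the metadata list earlier mentions a KNN neighborhood). The sketch below shows only that idea with plain Python. It is not the library's implementation: the 3-dimensional vectors in `registry` and `query` are invented, and `euclidean` and `resolve` are hypothetical helper names. The company names come from this notebook's output.

```python
import math

# Toy 3-dimensional "embeddings", invented for illustration.
# Real sentence embeddings have hundreds of dimensions.
registry = {
    "CADENCE DESIGN SYSTEMS INC": [0.9, 0.1, 0.3],
    "DESIGN WITHIN REACH INC": [0.5, 0.6, 0.2],
    "AVICI SYSTEMS INC": [0.2, 0.2, 0.9],
}


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def resolve(query_vector, registry, k=2):
    """Return the k registry names closest to the query, nearest first."""
    ranked = sorted(registry.items(), key=lambda item: euclidean(query_vector, item[1]))
    return [(name, euclidean(query_vector, vector)) for name, vector in ranked[:k]]


# Invented embedding of a mention that sits close to the Cadence vector.
query = [0.88, 0.12, 0.31]
top_matches = resolve(query, registry)
print(top_matches[0][0])
```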


In [0]:
normalized_org = lp.fullAnnotate(ORG)

normalized_org

In [0]:
# 'CADENCE DESIGN SYSTEMS, INC'
NORM_ORG = normalized_org[0]['normalization'][0].result

NORM_ORG

In [0]:
# 'Cadence Design Systems, Inc'
NORM_ORG = normalized_org[1]['normalization'][0].result

NORM_ORG

In [0]:
get_codes(lp, ORG, vocab = "normalization")

|index|chunks|begin|end|code|all_codes|resolutions|all_k_aux_labels|all_distances|
|-----:|:-----|-----:|-----:|:-----|:-----|:-----|:-----|:-----|
| 0| CADENCE DESIGN SYSTEMS, INC|0|26|CADENCE DESIGN SYSTEMS INC|[CADENCE DESIGN SYSTEMS INC, DESIGN WITHIN REACH INC, AVICI SYSTEMS INC, HLM DESIGN INC, NanoWatt Design Inc, DELTEK SYSTEMS INC, EPILOG IMAGING SYSTEMS INC]|[CADENCE DESIGN SYSTEMS INC, DESIGN WITHIN REACH INC, AVICI SYSTEMS INC, HLM DESIGN INC, NanoWatt Design Inc, DELTEK SYSTEMS INC, EPILOG IMAGING SYSTEMS INC]|[]|[0.0000, 0.2023, 0.2060, 0.2161, 0.2220, 0.2273, 0.2289]|
| 1| Cadence Design Systems, Inc|0|26|CADENCE DESIGN SYSTEMS INC|[CADENCE DESIGN SYSTEMS INC, DESIGN WITHIN REACH INC, AVICI SYSTEMS INC, HLM DESIGN INC, NanoWatt Design Inc, DELTEK SYSTEMS INC, EPILOG IMAGING SYSTEMS INC]|[CADENCE DESIGN SYSTEMS INC, DESIGN WITHIN REACH INC, AVICI SYSTEMS INC, HLM DESIGN INC, NanoWatt Design Inc, DELTEK SYSTEMS INC, EPILOG IMAGING SYSTEMS INC]|[]|[0.0000, 0.2023, 0.2060, 0.2161, 0.2220, 0.2273, 0.2289]|
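
The distance list can also be used to decide whether to trust the top match. The following is one possible sketch, not a rule prescribed by the library: `is_confident_match` is a hypothetical helper, and the `max_distance` and `min_gap` thresholds are arbitrary values chosen for illustration that would need tuning on labelled data. The distances are copied from the first row of the output above.

```python
# Distances copied from the first row of the output above (best match first).
distances = [0.0000, 0.2023, 0.2060, 0.2161, 0.2220, 0.2273, 0.2289]


def is_confident_match(distances, max_distance=0.1, min_gap=0.1):
    """Accept the top match only if it is close and clearly ahead of the runner-up.

    max_distance and min_gap are hypothetical thresholds, not library defaults.
    """
    if not distances:
        return False
    best = distances[0]
    if best > max_distance:
        return False
    if len(distances) == 1:
        return True
    return (distances[1] - best) >= min_gap


accepted = is_confident_match(distances)
print(accepted)
```

Here the top candidate is at distance 0.0 and the runner-up is 0.2023 away, so the match passes both checks. A mention whose best candidate sat near the others would be flagged for review instead.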


### Normalized Name
📜In Edgar, the company's official name is different! We need to obtain it before we can augment the record with external information from EDGAR.

- Incorrect: `CADENCE DESIGN SYSTEMS, INC` , `Cadence Design Systems, Inc`
- Correct (Official): `CADENCE DESIGN SYSTEMS INC`

🚀**You will find more solutions in the `7.1.Entitiy_Resolution` notebook, including:**
 - Company IRS number
 - Ticker symbol using company name
 - Company name using ticker symbol