![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# Legal Data Augmentation with Chunk Mappers

In [0]:
from johnsnowlabs import *

# About Data Augmentation

__Data Augmentation__ is the process of enriching an extracted data point with information from external sources.

For example, suppose we are working with a document that mentions the company _Apple_. We can extract that entity using NER.

But we can do much more than that! Public companies have a lot of information published about them on the Internet. With Legal NLP, we can look up legal and financial information about those companies in offline mode!

With __Chunk Mappers__, we can use external sources, such as _SEC Edgar, Nasdaq_ or even _Wikidata_, to enrich `Apple` with much more information, allowing us to make better decisions.
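
Conceptually, a chunk mapper behaves like a lookup table keyed by the entity text. The sketch below is plain Python, not the Legal NLP API: the `COMPANY_FACTS` table, its field names and its placeholder values are hypothetical and only illustrate what "enriching an entity" means.

```python
# Hypothetical lookup table standing in for an external source such as SEC Edgar.
# Field names and values are placeholders, not real Edgar data.
COMPANY_FACTS = {
    "APPLE INC": {"ticker": "<ticker>", "state": "<state>"},
}

def augment(entity_text, table=COMPANY_FACTS):
    """Return the entity plus any extra fields found for it in the table."""
    extra = table.get(entity_text, {})
    return {"entity": entity_text, **extra}

print(augment("APPLE INC"))   # entity enriched with the extra fields
print(augment("UNKNOWN CO"))  # no entry in the table: only the entity itself
```

Note that the lookup only succeeds when the key matches exactly, which is the problem Step 2 deals with.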

Let's see how to do it.

# Step 1: Named Entity Recognition

Let's suppose we get this news item by scraping the Internet, or from Wikipedia.

In [0]:
text = """APPLE, INC. became the first publicly traded U.S. company to be valued at over $1 trillion in August 2018, then $2 trillion in August 2020, and most recently $3 trillion in January 2022. """

text

First, we use an NER model to extract the company names from the text.

In [0]:
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")
    
sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner_model = legal.NerModel.pretrained("legner_orgs_prods_alias", "en", "legal/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")
        
ner_converter = nlp.NerConverter()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")

nlpPipeline = nlp.Pipeline(stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        embeddings,
        ner_model,
        ner_converter,
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

## We use LightPipelines to get the result

In [0]:
lp_ner = nlp.LightPipeline(model)

ner_result = lp_ner.fullAnnotate(text)

In [0]:
import pandas as pd

chunks = []
entities = []
begin = []
end = []

for n in ner_result[0]['ner_chunk']:
        
    begin.append(n.begin)
    end.append(n.end)
    chunks.append(n.result)
    entities.append(n.metadata['entity']) 
    
df = pd.DataFrame({'chunks':chunks, 'begin': begin, 'end':end, 'entities':entities})

df.head(20)

| | chunks | begin | end | entities |
|---|---|---|---|---|
| 0 | APPLE, INC. | 0 | 10 | ORG |


Alright! The company name has been detected as an organization.
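
The `begin` and `end` values in the table are character offsets into the input text, and `end` is inclusive (`APPLE, INC.` is 11 characters long and spans 0 to 10). A minimal check in plain Python, using only the first part of the sample text:

```python
text = "APPLE, INC. became the first publicly traded U.S. company to be valued at over $1 trillion in August 2018"
begin, end = 0, 10  # offsets reported for the ORG chunk

# `end` is inclusive, so the Python slice needs end + 1
chunk = text[begin:end + 1]
print(chunk)
```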

# Step 2: Mapping NER to External Data Using Chunk Mappers (offline)

Very often, the names of the organizations we find in texts are not their official names. For example, if we try to find `APPLE, INC` in SEC Edgar, we won't find it as it is.

Every data provider may have a different version of the official name. Most of them will include organization types such as `Inc`, `Corp`, etc., but others may use `Inc.`, `Incorporated`, etc.

**ChunkMappers** work with exact matches by default, so before we can map the ORG detected by NER with Chunk Mappers, we need to either:
1. Enable Fuzzy Matching in Chunk Mappers (Step 2a); or
2. Normalize the company name with Entity Resolvers (Step 2b).
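
To see why exact matching is brittle, here is a naive rule-based normalizer in plain Python. It is only a stand-in for illustration and is not what Entity Resolvers do internally; the `SUFFIXES` table and the `normalize_name` helper are hypothetical.

```python
import re

# Hypothetical suffix table: maps spelling variants to one canonical form.
SUFFIXES = {"inc": "INC", "incorporated": "INC", "corp": "CORP", "corporation": "CORP"}

def normalize_name(name):
    """Uppercase, drop punctuation, and canonicalize a trailing legal suffix."""
    tokens = re.sub(r"[^\w\s]", "", name).split()
    if tokens and tokens[-1].lower() in SUFFIXES:
        tokens[-1] = SUFFIXES[tokens[-1].lower()]
    return " ".join(t.upper() for t in tokens)

variants = ["APPLE, INC.", "Apple Inc", "Apple Incorporated"]

print("APPLE, INC." == "APPLE INC")           # exact comparison of two variants
print({normalize_name(v) for v in variants})  # distinct keys left after normalization
```

Hand-written rules like these break quickly on real data, which is why the notebook relies on fuzzy matching or a trained resolver instead.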

![my_test_image](files/Screenshot_2023_02_24_152350.png)

Let's suppose we want to manually get information about these companies.

Since it's a public US company, we can go to [SEC Edgar's database](https://www.sec.gov/edgar/searchedgar/companysearch) and look for it.

There is no `APPLE, INC.` in Edgar, but there is a variation of it (`APPLE INC`). Let's check the two methods we can apply to map one to the other.

## Step 2a: ChunkMappers Fuzzy Matching

Let's get several variations of Apple to see how we can use Fuzzy Matching to get the official name of Apple in Edgar.

In [0]:
ORG = df['chunks'].tolist()

ORG

In [0]:
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Possible distance metrics: ['levenshtein', 'longest-common-subsequence', 'cosine']
CM = legal.ChunkMapperModel().pretrained("legmapper_edgar_companyname", "en", "legal/models")\
    .setInputCols(["document"])\
    .setOutputCol("mappings")\
    .setEnableFuzzyMatching(True)\
    .setEnableCharFingerprintMatching(False)\
    .setFuzzyMatchingDistances(['cosine'])\
    .setFuzzyMatchingDistanceThresholds([5])

cm_pipeline = nlp.Pipeline(stages=[document_assembler, CM])

empty_data = spark.createDataFrame([[""]]).toDF("text")

fit_cm_pipeline = cm_pipeline.fit(empty_data)

lp = nlp.LightPipeline(fit_cm_pipeline)

res = lp.fullAnnotate(ORG)

In [0]:
for r in res:
    # `mapping` avoids shadowing Python's built-in `map`
    for mapping in r['mappings']:
        print(mapping)
    print('\n')

**We have been able to successfully retrieve the information in Edgar using different variations of the Company Name `Apple`!**
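
To build some intuition for the distance metrics listed in the code comment, the sketch below implements the classic Levenshtein edit distance in plain Python. It is a textbook version for illustration, not the library's internal implementation, and it does not show how Legal NLP scales its thresholds for each metric.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# Two deletions (the comma and the final period) turn one name into the other
print(levenshtein("APPLE, INC.", "APPLE INC"))  # → 2
```

A fuzzy matcher accepts candidates whose distance falls within a chosen threshold, so small punctuation differences like these no longer prevent a match.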

## Step 2b: Using Entity Resolvers for Company Names Normalization

`Company Name Normalization` is the process of obtaining the name of the company used by data providers, usually the **"official"** name of the company.

Different data providers may write the same name with different punctuation. For example, for Meta:
- Meta Platforms, Inc.
- Meta Platforms Inc.
- Meta Platforms, Inc
- etc

So, `Company Normalization` must take into account the database or data provider we want to extract data from. The data providers we have are:
- SEC Edgar
- Wikidata
- etc.

Let's normalize `APPLE, INC.` to its official name in _SEC Edgar_.

In [0]:
embeddings = nlp.UniversalSentenceEncoder.pretrained("tfhub_use", "en") \
      .setInputCols("document") \
      .setOutputCol("sentence_embeddings")
    
resolver = legal.SentenceEntityResolverModel.pretrained("legel_edgar_company_name", "en", "legal/models")\
      .setInputCols(["sentence_embeddings"]) \
      .setOutputCol("resolution")\
      .setDistanceFunction("EUCLIDEAN")

pipelineModel = nlp.PipelineModel(
      stages = [
          documentAssembler,
          embeddings,
          resolver])

lp_res = nlp.LightPipeline(pipelineModel)

In [0]:
ORG = df['chunks'].tolist()

ORG

In [0]:
el_res = lp_res.annotate(ORG)

el_res

Here is our normalized name for:
- Apple: `APPLE INC`.
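
The resolver above picks the closest official name in an embedding space, using the Euclidean distance. The toy sketch below mimics that idea in plain Python, with one important swap: instead of Universal Sentence Encoder embeddings it uses character-bigram counts as vectors. The `official_names` list and the helper functions are hypothetical.

```python
import math
from collections import Counter

def bigram_vector(name):
    """Toy 'embedding': counts of character bigrams in the lowercased name."""
    s = name.lower()
    return Counter(s[i:i + 2] for i in range(len(s) - 1))

def euclidean(u, v):
    """Euclidean distance between two sparse count vectors."""
    keys = set(u) | set(v)
    return math.sqrt(sum((u[k] - v[k]) ** 2 for k in keys))

def resolve(mention, candidates):
    """Return the candidate whose vector is closest to the mention's vector."""
    m = bigram_vector(mention)
    return min(candidates, key=lambda c: euclidean(m, bigram_vector(c)))

# Hypothetical candidate list standing in for the official names of a data source
official_names = ["APPLE INC", "META PLATFORMS INC", "MICROSOFT CORP"]

print(resolve("APPLE, INC.", official_names))
```

Unlike fuzzy matching, this approach always returns the nearest candidate, so in practice it is worth inspecting the distance before trusting the result.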

Now, let's see what information is available in the Edgar database for `APPLE INC`.

In [0]:
NORM_ORG = el_res[0]["resolution"]

NORM_ORG

## And now, we do an exact match with the normalized version

In [0]:
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Fuzzy matching is not enabled here: the normalized name is matched exactly
CM = legal.ChunkMapperModel().pretrained("legmapper_edgar_companyname", "en", "legal/models")\
    .setInputCols(["document"])\
    .setOutputCol("mappings")

cm_pipeline = nlp.Pipeline(stages=[document_assembler, CM])

empty_data = spark.createDataFrame([[""]]).toDF("text")

fit_cm_pipeline = cm_pipeline.fit(empty_data)

lp = nlp.LightPipeline(fit_cm_pipeline)

res = lp.fullAnnotate(NORM_ORG)

In [0]:
for r in res:
    # `mapping` avoids shadowing Python's built-in `map`
    for mapping in r['mappings']:
        print(mapping)

# Train Your Own ChunkMapper Model

Here, we will train a ChunkMapper model with 1000 samples.

### Load Dataset

In [0]:
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/legal-nlp/data/sample_openedgar.json

dbutils.fs.cp("file:/databricks/driver/sample_openedgar.json", "dbfs:/") 

In [0]:
import json
with open('sample_openedgar.json', 'r') as f:
    company_json = json.load(f)

In [0]:
company_json['mappings'][1]

### Check a sample company

In [0]:
for x in company_json['mappings']:
    if 'StepOne Personal Health, Inc.' in x['key']:
        print(x)

### Check all keys

In [0]:
all_rels = [x['key'] for x in company_json['mappings'][0]['relations']]

In [0]:
all_rels
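
The cells above rely on a dictionary shaped as `mappings` → entries with a `key` and a list of `relations`, each with its own `key`. The miniature example below reproduces that shape in plain Python. The `values` field name, the relation names and all values are assumptions and placeholders, not content from `sample_openedgar.json`.

```python
import json

# Hypothetical miniature dictionary: mappings -> [ {key, relations -> [ {key, values} ]} ]
sample = json.loads("""
{
  "mappings": [
    {
      "key": "Example Company, Inc.",
      "relations": [
        {"key": "ticker", "values": ["<ticker>"]},
        {"key": "state",  "values": ["<state>"]}
      ]
    }
  ]
}
""")

# Same extraction as `all_rels`: relation names are taken from the first entry only
rels = [r["key"] for r in sample["mappings"][0]["relations"]]
print(rels)

# Sanity check: list entries whose relation names differ from the first entry's
mismatched = [m["key"] for m in sample["mappings"]
              if {r["key"] for r in m["relations"]} != set(rels)]
print(mismatched)
```

Because the relation names come from the first entry only, a check like `mismatched` can help spot entries that would silently lose relations during training.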

### Create ChunkMapperApproach

In [0]:
chunkerMapper = legal.ChunkMapperApproach()\
      .setInputCols(["ner_chunk"])\
      .setOutputCol("mappings")\
      .setDictionary("dbfs:/sample_openedgar.json")\
      .setRels(all_rels)

In [0]:
empty_dataset = spark.createDataFrame([[""]]).toDF("text")

In [0]:
fit_CM = chunkerMapper.fit(empty_dataset)

In [0]:
# Save model
fit_CM.write().overwrite().save('/dbfs/openedgar_2000_2022_company_mapper')

### Let's test our ChunkMapper model

In [0]:
text = ["""StepOne Personal Health, Inc. is an American solar cell and engineered wafer manufacturer."""]

In [0]:
# We get the company name from the sample text

ner_result = lp_ner.fullAnnotate(text)

ner_result

In [0]:
ORG = ner_result[0]["ner_chunk"][0].result

ORG

In [0]:
# We normalize the company name

el_res = lp_res.annotate(ORG)

el_res

In [0]:
NORM_ORG = el_res["resolution"]

NORM_ORG

### Let's load our ChunkMapper model

In [0]:
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

chunkAssembler = nlp.Doc2Chunk() \
    .setInputCols("document") \
    .setOutputCol("chunk") \
    .setIsArray(False)

CM = legal.ChunkMapperModel().load("/dbfs/openedgar_2000_2022_company_mapper")\
      .setInputCols(["chunk"])\
      .setOutputCol("mappings")

cm_pipeline = nlp.Pipeline(stages=[documentAssembler, 
                                   chunkAssembler, 
                                   CM])

fit_cm_pipeline = cm_pipeline.fit(empty_data)

In [0]:
# LightPipelines don't support Doc2Chunk, so we use the regular transform here

df = spark.createDataFrame([NORM_ORG]).toDF("text")

df.show()

In [0]:
res = fit_cm_pipeline.transform(df)

res.show()

In [0]:
res.select("mappings.result").show(truncate=False)

In [0]:
r = res.select("mappings").collect()
r
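
The collected rows can be flattened into one record per company. The sketch below uses plain dicts that mimic the `result` and `metadata` fields of the annotations. It assumes each mapping carries its relation name under a `relation` metadata key (inspect the output of `r` to confirm); the relation names and values are placeholders.

```python
# Mock of collected annotations: plain dicts mimicking `result` and `metadata`.
# The `relation` metadata key and all values are assumptions / placeholders.
mock_mappings = [
    {"result": "<ticker>", "metadata": {"relation": "ticker"}},
    {"result": "<state>", "metadata": {"relation": "state"}},
]

def to_record(mappings):
    """Collapse a list of mapping annotations into {relation: value}."""
    return {m["metadata"]["relation"]: m["result"] for m in mappings}

record = to_record(mock_mappings)
print(record)
```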