![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/10.0.Data_Augmentation_with_ChunkMappers.ipynb)

# Legal Data Augmentation with Chunk Mappers

# Installation

In [None]:
! pip install -q johnsnowlabs

## Automatic Installation
Using my.johnsnowlabs.com SSO

In [None]:
from johnsnowlabs import nlp, legal

# nlp.install(force_browser=True)

## Manual downloading
If you are not registered on my.johnsnowlabs.com, received your license via e-mail, or are using Safari, you may need to upload the license manually.

- Go to my.johnsnowlabs.com
- Download your license
- Upload it using the following command

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

Please Upload your John Snow Labs License using the button below


Saving spark_nlp_for_healthcare_spark_ocr_7163 (2).json to spark_nlp_for_healthcare_spark_ocr_7163 (2).json


- Install it

In [None]:
nlp.install()

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_7163 (2).json
📋 Stored John Snow Labs License in /root/.johnsnowlabs/licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json
👷 Setting up  John Snow Labs home in /root/.johnsnowlabs, this might take a few minutes.
Downloading 🐍+🚀 Python Library spark_nlp-4.2.4-py2.py3-none-any.whl
Downloading 🐍+💊 Python Library spark_nlp_jsl-4.2.4-py3-none-any.whl
Downloading 🫘+🚀 Java Library spark-nlp-assembly-4.2.4.jar
Downloading 🫘+💊 Java Library spark-nlp-jsl-4.2.4.jar
🙆 JSL Home setup in /root/.johnsnowlabs
👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_7163 (2).json
Installing /root/.johnsnowlabs/py_installs/spark_nlp_jsl-4.2.4-py3-none-any.whl to /usr/bin/python3
Running: /usr/bin/python3 -m pip install /root/.johnsnowlabs/py_installs/spark_nlp_jsl-4.2.4-py3-none-any.whl
Installed 1 products:
💊 Spark-Healthcare==4.2.4 installed! ✅ Heal the planet with NLP! 


# Starting

In [None]:
spark = nlp.start()

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_7163 (2).json
👌 Launched cpu optimized session with: 🚀Spark-NLP==4.2.4, 💊Spark-Healthcare==4.2.4, running on ⚡ PySpark==3.1.2


# About Data Augmentation

__Data Augmentation__ is the process of enriching an extracted data point with information from external sources.

For example, let's suppose I work with a document which mentions the company _Apple_. We could be talking about stock prices, or some legal litigations, or just a commercial agreement with a provider, among others.

In the document, we can use NER to extract `Apple` as an entity of type Organization.

With __Data Augmentation__, we can then use external sources such as _SEC Edgar, Crunchbase, Nasdaq_ or even _Wikipedia_ to enrich `Apple` with much more information, allowing us to make better decisions.

Let's see how to do it.
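Before bringing in the Spark NLP components, the idea can be sketched in plain Python. This is only an illustration: `EXTERNAL_SOURCE`, `augment` and the `normalized_name` field are hypothetical names, and the two field values are copied from the Edgar output shown later in this notebook.

```python
# Toy stand-in for an external source such as SEC Edgar (hypothetical structure).
EXTERNAL_SOURCE = {
    "APPLE INC": {"business_city": "CUPERTINO", "sic_code": "3571"},
}

def augment(entity, source):
    """Return a copy of the extracted entity enriched with fields found in `source`."""
    extra = source.get(entity["normalized_name"], {})
    # Unknown names simply come back unchanged.
    return {**entity, **extra}

extracted = {"chunk": "Apple", "label": "ORG", "normalized_name": "APPLE INC"}
augmented = augment(extracted, EXTERNAL_SOURCE)
print(augmented)
```

The lookup only works once the extracted text has been mapped to the key used by the external source, which is why a normalization step sits between extraction and augmentation.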

# Step 1: Named Entity Recognition

Let's suppose we got this news item by scraping the Internet, or from Wikipedia.

In [None]:
text = """Apple became the first publicly traded U.S. company to be valued at over $1 trillion in August 2018, then $2 trillion in August 2020, and most recently $3 trillion in January 2022. """

text

'Apple became the first publicly traded U.S. company to be valued at over $1 trillion in August 2018, then $2 trillion in August 2020, and most recently $3 trillion in January 2022. '

First, we use an NER model to extract the company names from the text.

In [None]:
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")
    
sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner_model = legal.NerModel.pretrained("legner_orgs_prods_alias", "en", "legal/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")
        
ner_converter = nlp.NerConverter()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")

nlpPipeline = nlp.Pipeline(stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        embeddings,
        ner_model,
        ner_converter,
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
bert_embeddings_sec_bert_base download started this may take some time.
Approximate size to download 390.4 MB
[OK!]
legner_orgs_prods_alias download started this may take some time.
[OK!]


## We use LightPipelines to get the result

In [None]:
lp_ner = nlp.LightPipeline(model)

ner_result = lp_ner.fullAnnotate(text)

ner_result

[{'document': [Annotation(document, 0, 180, Apple became the first publicly traded U.S. company to be valued at over $1 trillion in August 2018, then $2 trillion in August 2020, and most recently $3 trillion in January 2022. , {})],
  'ner_chunk': [Annotation(chunk, 0, 4, Apple, {'entity': 'ORG', 'sentence': '0', 'chunk': '0', 'confidence': '0.9911'})],
  'token': [Annotation(token, 0, 4, Apple, {'sentence': '0'}),
   Annotation(token, 6, 11, became, {'sentence': '0'}),
   Annotation(token, 13, 15, the, {'sentence': '0'}),
   Annotation(token, 17, 21, first, {'sentence': '0'}),
   Annotation(token, 23, 30, publicly, {'sentence': '0'}),
   Annotation(token, 32, 37, traded, {'sentence': '0'}),
   Annotation(token, 39, 41, U.S, {'sentence': '0'}),
   Annotation(token, 42, 42, ., {'sentence': '0'}),
   Annotation(token, 44, 50, company, {'sentence': '0'}),
   Annotation(token, 52, 53, to, {'sentence': '0'}),
   Annotation(token, 55, 56, be, {'sentence': '0'}),
   Annotation(token, 58, 63, 

In [None]:
import pandas as pd

chunks = []
entities = []
begin = []
end = []

for n in ner_result[0]['ner_chunk']:
        
    begin.append(n.begin)
    end.append(n.end)
    chunks.append(n.result)
    entities.append(n.metadata['entity']) 
    
df = pd.DataFrame({'chunks':chunks, 'begin': begin, 'end':end, 'entities':entities})

df.head(20)

|   | chunks | begin | end | entities |
|---|--------|-------|-----|----------|
| 0 | Apple  | 0     | 4   | ORG      |


Alright! The company name has been detected as an organization.
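The chunk-to-table loop can also be written as a small reusable helper. The sketch below is an assumption-laden variant, not part of the library: `FakeAnnotation` is a hypothetical stand-in that only mimics the `begin`, `end`, `result` and `metadata` attributes used in the loop, and the `min_confidence` filter is an addition that the original code does not have.

```python
from collections import namedtuple

# Hypothetical stand-in exposing the attributes read from each NER chunk.
FakeAnnotation = namedtuple("FakeAnnotation", ["begin", "end", "result", "metadata"])

def chunks_to_rows(annotations, min_confidence=0.0):
    """Turn chunk annotations into a list of row dicts, dropping low-confidence chunks."""
    rows = []
    for a in annotations:
        # Confidence is stored as a string in the metadata dictionary.
        confidence = float(a.metadata.get("confidence", 1.0))
        if confidence >= min_confidence:
            rows.append({
                "chunks": a.result,
                "begin": a.begin,
                "end": a.end,
                "entities": a.metadata["entity"],
                "confidence": confidence,
            })
    return rows

sample = [FakeAnnotation(0, 4, "Apple", {"entity": "ORG", "confidence": "0.9911"})]
rows = chunks_to_rows(sample, min_confidence=0.9)
print(rows)
```

The resulting list of dictionaries can be passed directly to `pd.DataFrame(rows)`.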

But before __augmenting__, there is a very important step we need to carry out: `Company Name Normalization`

# Step 2: Company Names Normalization

Let's suppose we want to manually get information about these companies.

Since it's a public US company, we can go to [SEC Edgar's database](https://www.sec.gov/edgar/searchedgar/companysearch) and look for it.

Unfortunately, `Apple` is not the official name of the company, which means no entry for `Apple` is available. That's where __Company Names Normalization__ comes in handy.

`Company Name Normalization` is the process of obtaining the name of the company used by data providers, usually the **"official"** name of the company.

Data providers sometimes store the same name with different punctuation. For example, for Meta:
- Meta Platforms, Inc.
- Meta Platforms Inc.
- Meta Platforms, Inc
- etc
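Punctuation-only variants like these can be collapsed with simple string cleanup. The sketch below uses a hypothetical helper, `canonical_key`, and is not how the resolver used in this notebook works: that model relies on sentence embeddings, which is what lets it map `Apple` to `APPLE INC`, something string cleanup alone cannot do.

```python
import re

def canonical_key(name):
    """Uppercase the name, drop punctuation, and collapse repeated whitespace."""
    no_punct = re.sub(r"[^\w\s]", "", name.upper())
    return " ".join(no_punct.split())

variants = ["Meta Platforms, Inc.", "Meta Platforms Inc.", "Meta Platforms, Inc"]
keys = {canonical_key(v) for v in variants}
print(keys)  # {'META PLATFORMS INC'}
```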

So it is essential to perform `Company Name Normalization` against the specific database or data provider we want to extract data from. The data providers currently available are:
- SEC Edgar
- Crunchbase until 2015
- Wikidata (in progress)

Let's normalize `Apple` to the official name in _SEC Edgar_.

In [None]:
embeddings = nlp.UniversalSentenceEncoder.pretrained("tfhub_use", "en") \
      .setInputCols("document") \
      .setOutputCol("sentence_embeddings")
    
resolver = legal.SentenceEntityResolverModel.pretrained("legel_edgar_company_name", "en", "legal/models")\
      .setInputCols(["sentence_embeddings"]) \
      .setOutputCol("resolution")\
      .setDistanceFunction("EUCLIDEAN")

pipelineModel = nlp.PipelineModel(
      stages = [
          documentAssembler,
          embeddings,
          resolver])

lp_res = nlp.LightPipeline(pipelineModel)

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]
legel_edgar_company_name download started this may take some time.
[OK!]


In [None]:
ORG = list(df["chunks"])

ORG

['Apple']

In [None]:
el_res = lp_res.annotate(ORG)

el_res

[{'document': ['Apple'],
  'sentence_embeddings': ['Apple'],
  'resolution': ['APPLE INC']}]

Here is our normalized name for:
- Apple: `APPLE INC`.
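The resolver above is configured with a Euclidean distance function. Conceptually, this kind of resolution can be pictured as a nearest-neighbour search in embedding space. The following is a toy sketch under strong assumptions: the three-dimensional vectors are made up for illustration, `resolve` is a hypothetical function, and the actual model internals may differ.

```python
import math

# Made-up 3-dimensional vectors; real sentence embeddings have hundreds of dimensions.
CANDIDATES = {
    "APPLE INC": (0.9, 0.1, 0.0),
    "Meta Platforms, Inc.": (0.1, 0.9, 0.2),
    "Rayton Solar Inc.": (0.0, 0.2, 0.9),
}

def resolve(query_vector, candidates):
    """Return the candidate name whose vector is closest in Euclidean distance."""
    return min(candidates, key=lambda name: math.dist(query_vector, candidates[name]))

# Pretend this is the embedding of the mention "Apple".
query = (0.8, 0.2, 0.1)
best = resolve(query, CANDIDATES)
print(best)  # APPLE INC
```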

Now, let's see what information is available in the Edgar database for `APPLE INC`.

In [None]:
NORM_ORG = el_res[0]["resolution"]

NORM_ORG

['APPLE INC']

# Step 3: Data Augmentation with Chunk Mappers

The component which carries out __Data Augmentation__ is called `ChunkMapper`.

Its name comes from the way it works: it takes an _NER chunk_ and maps it to an external data source.

As a result, you will get a JSON with a dictionary of additional fields and their values. 

Let's take a look at how it works.

In [None]:
chunkAssembler = nlp.Doc2Chunk() \
    .setInputCols("document") \
    .setOutputCol("chunk") \
    .setIsArray(False)

CM = legal.ChunkMapperModel.pretrained("legmapper_edgar_companyname", "en", "legal/models")\
      .setInputCols(["chunk"])\
      .setOutputCol("mappings")

cm_pipeline = nlp.Pipeline(stages=[documentAssembler, chunkAssembler, CM])

fit_cm_pipeline = cm_pipeline.fit(empty_data)

legmapper_edgar_companyname download started this may take some time.
[OK!]


In [None]:
# LightPipelines don't support Doc2Chunk, so we use the regular transform() here

df = spark.createDataFrame([NORM_ORG]).toDF("text")

df.show()

+---------+
|     text|
+---------+
|APPLE INC|
+---------+



In [None]:
res = fit_cm_pipeline.transform(df)

res.show()

+---------+--------------------+--------------------+--------------------+
|     text|            document|               chunk|            mappings|
+---------+--------------------+--------------------+--------------------+
|APPLE INC|[{document, 0, 8,...|[{chunk, 0, 8, AP...|[{labeled_depende...|
+---------+--------------------+--------------------+--------------------+



In [None]:
res.select("mappings.result").show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                           |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[APPLE INC, ELECTRONIC COMPUTERS [3571], 3571, 942404110, 930, CA, CA, ONE INFINITE LOOP, CUPERTINO, CA, 95014, (408) 996-1010, APPLE COMPUTER INC, 19970808, 2017-02-01, 320193]|
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+



In [None]:
r = res.collect()
r

[Row(text='APPLE INC', document=[Row(annotatorType='document', begin=0, end=8, result='APPLE INC', metadata={'sentence': '0'}, embeddings=[])], chunk=[Row(annotatorType='chunk', begin=0, end=8, result='APPLE INC', metadata={'sentence': '0', 'chunk': '0'}, embeddings=[])], mappings=[Row(annotatorType='labeled_dependency', begin=0, end=8, result='APPLE INC', metadata={'sentence': '0', 'ops': '0.0', 'distance': '0.0', 'all_relations': '', 'chunk': '0', '__trained__': 'APPLE INC', '__distance_function__': 'levenshtein', '__relation_name__': 'name', 'entity': 'APPLE INC', 'relation': 'name'}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=8, result='ELECTRONIC COMPUTERS [3571]', metadata={'sentence': '0', 'ops': '0.0', 'distance': '0.0', 'all_relations': '', 'chunk': '0', '__trained__': 'APPLE INC', '__distance_function__': 'levenshtein', '__relation_name__': 'sic', 'entity': 'APPLE INC', 'relation': 'sic'}, embeddings=[]), Row(annotatorType='labeled_dependency', begin

In [None]:
json_dict = dict()
for n in r[0]['mappings']:
    json_dict[n.metadata['relation']] = str(n.result)

In [None]:
import json
print(json.dumps(json_dict, indent=4, sort_keys=True))

{
    "business_city": "CUPERTINO",
    "business_phone": "(408) 996-1010",
    "business_state": "CA",
    "business_street": "ONE INFINITE LOOP",
    "business_zip": "95014",
    "company_id": "320193",
    "date": "2017-02-01",
    "fiscal_year_end": "930",
    "former_name": "APPLE COMPUTER INC",
    "former_name_date": "19970808",
    "irs_number": "942404110",
    "name": "APPLE INC",
    "sic": "ELECTRONIC COMPUTERS [3571]",
    "sic_code": "3571",
    "state_incorporation": "CA",
    "state_location": "CA"
}


Yes, here it is. We get additional information about `APPLE INC` using only the company name.
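One detail of the dictionary-building loop: assigning to `json_dict[relation]` keeps only the last value seen for each relation. If a relation were ever returned more than once, a grouping variant would preserve every value. The sketch below is defensive and hypothetical (`collapse_mappings` is not a library function), and it works on plain `(relation, result)` tuples so it can run without Spark.

```python
from collections import defaultdict

def collapse_mappings(pairs):
    """Group (relation, result) pairs; keep a scalar for single values, a list otherwise."""
    grouped = defaultdict(list)
    for relation, result in pairs:
        grouped[relation].append(str(result))
    return {rel: vals[0] if len(vals) == 1 else vals for rel, vals in grouped.items()}

# Example pairs; the second 'date' entry is invented to show the grouping behaviour.
pairs = [
    ("name", "APPLE INC"),
    ("sic_code", "3571"),
    ("date", "2017-02-01"),
    ("date", "2016-10-26"),
]
collapsed = collapse_mappings(pairs)
print(collapsed)
```

In the notebook, the pairs could be built with `[(n.metadata['relation'], n.result) for n in r[0]['mappings']]`.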

# Train Your Own ChunkMapper Model

Here, we will train a ChunkMapper model with 1,000 samples.

### Load Dataset

In [None]:
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings_JSL/Legal/data/sample_openedgar.json

In [None]:
import json
with open('sample_openedgar.json', 'r') as f:
    company_json = json.load(f)

In [None]:
company_json['mappings'][0]

{'key': 'Rayton Solar Inc.',
 'relations': [{'key': 'name', 'values': ['Rayton Solar Inc.']},
  {'key': 'sic', 'values': ['SEMICONDUCTORS & RELATED DEVICES [3674]']},
  {'key': 'sic_code', 'values': [3674]},
  {'key': 'irs_number', 'values': [0]},
  {'key': 'fiscal_year_end', 'values': [1231]},
  {'key': 'state_location', 'values': ['CA']},
  {'key': 'state_incorporation', 'values': ['DE']},
  {'key': 'business_street', 'values': ['920 COLORADO AVE.']},
  {'key': 'business_city', 'values': ['SANTA MONICA']},
  {'key': 'business_state', 'values': ['CA']},
  {'key': 'business_zip', 'values': ['90401']},
  {'key': 'business_phone', 'values': ['(661) 259-4786']},
  {'key': 'former_name', 'values': ['']},
  {'key': 'former_name_date', 'values': ['']},
  {'key': 'date',
   'values': ['2017-01-10',
    '2017-01-20',
    '2017-01-06',
    '2017-05-15',
    '2017-09-28',
    '2016-11-29',
    '2016-12-20',
    '2016-12-22',
    '2022-09-21',
    '2019-06-27',
    '2018-03-22',
    '2018-04-30',
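The entry above follows a nested layout: a top-level `mappings` list, where each item has a `key` and a list of `relations`, each with its own `key` and `values`. A dictionary in the same shape can be assembled from flat records, as in this minimal sketch. `build_dictionary` is a hypothetical helper, and the layout is inferred only from the sample shown here.

```python
import json

def build_dictionary(records):
    """Convert {company_name: {relation: value_or_list}} into the nested mappings layout."""
    mappings = []
    for name, fields in records.items():
        # The sample file stores the company name as its own 'name' relation.
        relations = [{"key": "name", "values": [name]}]
        for relation, value in fields.items():
            values = value if isinstance(value, list) else [value]
            relations.append({"key": relation, "values": values})
        mappings.append({"key": name, "relations": relations})
    return {"mappings": mappings}

records = {
    "Rayton Solar Inc.": {
        "sic_code": 3674,
        "state_location": "CA",
        "date": ["2017-01-10", "2017-01-20"],
    }
}
dictionary = build_dictionary(records)
print(json.dumps(dictionary, indent=2))
```

Writing the result with `json.dump` produces a file that can be inspected the same way as `sample_openedgar.json`.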

### Check a sample company

In [None]:
for x in company_json['mappings']:
    if 'Rayton Solar Inc.' in x['key']:
        print(x)

{'key': 'Rayton Solar Inc.', 'relations': [{'key': 'name', 'values': ['Rayton Solar Inc.']}, {'key': 'sic', 'values': ['SEMICONDUCTORS & RELATED DEVICES [3674]']}, {'key': 'sic_code', 'values': [3674]}, {'key': 'irs_number', 'values': [0]}, {'key': 'fiscal_year_end', 'values': [1231]}, {'key': 'state_location', 'values': ['CA']}, {'key': 'state_incorporation', 'values': ['DE']}, {'key': 'business_street', 'values': ['920 COLORADO AVE.']}, {'key': 'business_city', 'values': ['SANTA MONICA']}, {'key': 'business_state', 'values': ['CA']}, {'key': 'business_zip', 'values': ['90401']}, {'key': 'business_phone', 'values': ['(661) 259-4786']}, {'key': 'former_name', 'values': ['']}, {'key': 'former_name_date', 'values': ['']}, {'key': 'date', 'values': ['2017-01-10', '2017-01-20', '2017-01-06', '2017-05-15', '2017-09-28', '2016-11-29', '2016-12-20', '2016-12-22', '2022-09-21', '2019-06-27', '2018-03-22', '2018-04-30', '2018-12-10', '2021-09-22', '2020-06-08', '2020-09-28']}, {'key': 'company_

### Check all keys

In [None]:
all_rels = [x['key'] for x in company_json['mappings'][0]['relations']]

In [None]:
all_rels

['name',
 'sic',
 'sic_code',
 'irs_number',
 'fiscal_year_end',
 'state_location',
 'state_incorporation',
 'business_street',
 'business_city',
 'business_state',
 'business_zip',
 'business_phone',
 'former_name',
 'former_name_date',
 'date',
 'company_id']
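The relation list is taken from the first entry only, so it is worth checking that every other entry exposes the same keys before training. A minimal sketch, assuming the nested layout shown above; `find_inconsistent` is a hypothetical helper, and the second toy entry is invented to trigger the check.

```python
def relation_keys(entry):
    """Return the relation names of one dictionary entry."""
    return [rel["key"] for rel in entry["relations"]]

def find_inconsistent(mappings, expected):
    """Return the keys of entries whose relation names differ from `expected`."""
    return [e["key"] for e in mappings if set(relation_keys(e)) != set(expected)]

toy_mappings = [
    {"key": "Rayton Solar Inc.", "relations": [
        {"key": "name", "values": ["Rayton Solar Inc."]},
        {"key": "sic_code", "values": [3674]},
    ]},
    # Invented entry that lacks 'sic_code'.
    {"key": "Example Co", "relations": [
        {"key": "name", "values": ["Example Co"]},
    ]},
]
bad = find_inconsistent(toy_mappings, expected=["name", "sic_code"])
print(bad)  # ['Example Co']
```

With the real data, the call would be `find_inconsistent(company_json['mappings'], all_rels)`.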

### Create ChunkMapperApproach

In [None]:
chunkerMapper = legal.ChunkMapperApproach()\
      .setInputCols(["ner_chunk"])\
      .setOutputCol("mappings")\
      .setDictionary("sample_openedgar.json")\
      .setRels(all_rels)

In [None]:
empty_dataset = spark.createDataFrame([[""]]).toDF("text")

In [None]:
fit_CM = chunkerMapper.fit(empty_dataset)

In [None]:
# Save model
fit_CM.write().overwrite().save('openedgar_2000_2022_company_mapper')

### Let's test our ChunkMapper model

In [None]:
text = ["""Rayton Solar is an American solar cell and engineered wafer manufacturer."""]

In [None]:
# Extract the company name from the sample text

ner_result = lp_ner.fullAnnotate(text)

ner_result

[{'document': [Annotation(document, 0, 72, Rayton Solar is an American solar cell and engineered wafer manufacturer., {})],
  'ner_chunk': [Annotation(chunk, 0, 11, Rayton Solar, {'entity': 'ORG', 'sentence': '0', 'chunk': '0', 'confidence': '0.86965'})],
  'token': [Annotation(token, 0, 5, Rayton, {'sentence': '0'}),
   Annotation(token, 7, 11, Solar, {'sentence': '0'}),
   Annotation(token, 13, 14, is, {'sentence': '0'}),
   Annotation(token, 16, 17, an, {'sentence': '0'}),
   Annotation(token, 19, 26, American, {'sentence': '0'}),
   Annotation(token, 28, 32, solar, {'sentence': '0'}),
   Annotation(token, 34, 37, cell, {'sentence': '0'}),
   Annotation(token, 39, 41, and, {'sentence': '0'}),
   Annotation(token, 43, 52, engineered, {'sentence': '0'}),
   Annotation(token, 54, 58, wafer, {'sentence': '0'}),
   Annotation(token, 60, 71, manufacturer, {'sentence': '0'}),
   Annotation(token, 72, 72, ., {'sentence': '0'})],
  'ner': [Annotation(named_entity, 0, 5, B-ORG, {'word': 'Rayt

In [None]:
ORG = ner_result[0]["ner_chunk"][0].result

ORG

'Rayton Solar'

In [None]:
# Normalize the company name

el_res = lp_res.annotate(ORG)

el_res

{'document': ['Rayton Solar'],
 'sentence_embeddings': ['Rayton Solar'],
 'resolution': ['Rayton Solar Inc.']}

In [None]:
NORM_ORG = el_res["resolution"]

NORM_ORG

['Rayton Solar Inc.']

### Let's load our ChunkMapper model

In [None]:
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

chunkAssembler = nlp.Doc2Chunk() \
    .setInputCols("document") \
    .setOutputCol("chunk") \
    .setIsArray(False)

CM = legal.ChunkMapperModel.load("openedgar_2000_2022_company_mapper")\
      .setInputCols(["chunk"])\
      .setOutputCol("mappings")

cm_pipeline = nlp.Pipeline(stages=[documentAssembler, 
                                   chunkAssembler, 
                                   CM])

fit_cm_pipeline = cm_pipeline.fit(empty_data)

In [None]:
# LightPipelines don't support Doc2Chunk, so we use the regular transform() here

df = spark.createDataFrame([NORM_ORG]).toDF("text")

df.show()

+-----------------+
|             text|
+-----------------+
|Rayton Solar Inc.|
+-----------------+



In [None]:
res = fit_cm_pipeline.transform(df)

res.show()

+-----------------+--------------------+--------------------+--------------------+
|             text|            document|               chunk|            mappings|
+-----------------+--------------------+--------------------+--------------------+
|Rayton Solar Inc.|[{document, 0, 16...|[{chunk, 0, 16, R...|[{labeled_depende...|
+-----------------+--------------------+--------------------+--------------------+



In [None]:
res.select("mappings.result").show(truncate=False)

+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                  |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[Rayton Solar Inc., SEMICONDUCTORS & RELATED DEVICES [3674], 3674, 0, 1231, CA, DE, 920 COLORADO AVE., SANTA MONICA, CA, 90401, (661) 259-4786, , , 2017-01-10, 1654124]|
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+



In [None]:
r = res.select("mappings").collect()
r

[Row(mappings=[Row(annotatorType='labeled_dependency', begin=0, end=16, result='Rayton Solar Inc.', metadata={'sentence': '0', 'ops': '0.0', 'distance': '-2.220446049250313E-16', 'all_relations': '', 'chunk': '0', '__trained__': 'Rayton Solar Inc.', '__distance_function__': 'cosine', '__relation_name__': 'name', 'entity': 'Rayton Solar Inc.', 'relation': 'name'}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=16, result='SEMICONDUCTORS & RELATED DEVICES [3674]', metadata={'sentence': '0', 'ops': '0.0', 'distance': '-2.220446049250313E-16', 'all_relations': '', 'chunk': '0', '__trained__': 'Rayton Solar Inc.', '__distance_function__': 'cosine', '__relation_name__': 'sic', 'entity': 'Rayton Solar Inc.', 'relation': 'sic'}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=16, result='3674', metadata={'sentence': '0', 'ops': '0.0', 'distance': '-2.220446049250313E-16', 'all_relations': '', 'chunk': '0', '__trained__': 'Rayton Solar Inc.', '__distan

In [None]:
json_dict = dict()
for n in r[0]['mappings']:
    json_dict[n.metadata['relation']] = str(n.result)

In [None]:
import json
print(json.dumps(json_dict, indent=4, sort_keys=True))

{
    "business_city": "SANTA MONICA",
    "business_phone": "(661) 259-4786",
    "business_state": "CA",
    "business_street": "920 COLORADO AVE.",
    "business_zip": "90401",
    "company_id": "1654124",
    "date": "2017-01-10",
    "fiscal_year_end": "1231",
    "former_name": "",
    "former_name_date": "",
    "irs_number": "0",
    "name": "Rayton Solar Inc.",
    "sic": "SEMICONDUCTORS & RELATED DEVICES [3674]",
    "sic_code": "3674",
    "state_incorporation": "DE",
    "state_location": "CA"
}
