
# Legal Data Augmentation with Chunk Mappers

## Setup

In [None]:
from johnsnowlabs import *

import pandas as pd
import json
import os

spark = start_spark()

# About Data Augmentation

__Data Augmentation__ is the process of enriching an extracted data point with information from external sources.

For example, suppose we are working with a document that mentions the company _Amazon_. The document could be about stock prices, litigation, or a commercial agreement with a provider, among other things.

In the document, we can extract `Amazon` using NER as an Organization, but that's all the information available about `Amazon` in that document.

With __Data Augmentation__, we can use external sources, such as _SEC Edgar, Crunchbase, Nasdaq_ or even _Wikipedia_, to enrich `Amazon` with much more information, allowing us to make better decisions.

Let's see how to do it.

# Step 1: Named Entity Recognition

Let's suppose we get this news item from scraping the Internet, or from Twitter.

In [2]:
text = "We have entered into a definitive merger agreement with Amazon."

We use NER to extract the company name, in this case, Amazon.

In [None]:
documentAssembler = nlp.DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")
        
sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
        .setInputCols(["document"])\
        .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
        .setInputCols(["sentence"])\
        .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner_model = legal.NerModel.pretrained("legner_orgs_prods_alias", "en", "legal/models")\
        .setInputCols(["sentence", "token", "embeddings"])\
        .setOutputCol("ner")
        
ner_converter = nlp.NerConverter()\
        .setInputCols(["sentence","token","ner"])\
        .setOutputCol("ner_chunk")

nlp_pipeline = nlp.Pipeline(stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        embeddings,
        ner_model,
        ner_converter,
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlp_pipeline.fit(empty_data)

## We use LightPipelines to get the result

In [None]:
lp_ner = nlp.LightPipeline(model)

ner_result = lp_ner.annotate(text)

In [7]:
ner_result

{'document': ['We have entered into a definitive merger agreement with Amazon.'],
 'ner_chunk': ['Amazon'],
 'token': ['We',
  'have',
  'entered',
  'into',
  'a',
  'definitive',
  'merger',
  'agreement',
  'with',
  'Amazon',
  '.'],
 'ner': ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'O'],
 'embeddings': ['We',
  'have',
  'entered',
  'into',
  'a',
  'definitive',
  'merger',
  'agreement',
  'with',
  'Amazon',
  '.'],
 'sentence': ['We have entered into a definitive merger agreement with Amazon.']}
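
The `ner` and `token` lists in the output above are aligned position by position, and the chunk `Amazon` comes from the single `B-ORG` tag. As a rough, plain-Python illustration of that grouping (a simplified sketch, not the actual `NerConverter` implementation; the helper name `bio_to_chunks` is hypothetical):

```python
def bio_to_chunks(tokens, tags):
    """Group tokens into (text, label) chunks using B-/I-/O tags."""
    chunks = []
    current_tokens, current_label = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new chunk; close any open one first
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # An I- tag with the same label continues the open chunk
            current_tokens.append(token)
        else:
            # "O" (or a mismatched I- tag) closes the open chunk, if any
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        chunks.append((" ".join(current_tokens), current_label))
    return chunks


tokens = ["We", "have", "entered", "into", "a", "definitive",
          "merger", "agreement", "with", "Amazon", "."]
tags = ["O", "O", "O", "O", "O", "O", "O", "O", "O", "B-ORG", "O"]

chunks = bio_to_chunks(tokens, tags)
print(chunks)  # [('Amazon', 'ORG')]
```

A multi-token company name would be tagged `B-ORG` followed by one or more `I-ORG` tags and would come out as a single chunk.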

Alright! Amazon has been detected as an organization. 

Now, let's augment `Amazon` with more information about the company, given that there are no more details in the tweet I can use.

But before __augmenting__, there is a very important step we need to carry out: `Company Name Normalization`

# Step 2: Company Name Normalization

Let's suppose we want to manually get information about Amazon.

Since it's a public US company, we can go to [SEC Edgar's database](https://www.sec.gov/edgar/searchedgar/companysearch) and look for it.

Unfortunately, `Amazon` is not the official name of the company, which means no entry for `Amazon` is available. That's where __Company Name Normalization__ comes in handy.

`Company Name Normalization` is the process of obtaining the name of the company used by data providers, usually the "official" name of the company.

Different data providers may also store the same name with different punctuation. For example, for Meta:
- Meta Platforms, Inc.
- Meta Platforms Inc.
- Meta Platforms, Inc
- etc
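
To see why these variants are a problem for exact lookups, here is a minimal sketch that collapses punctuation differences into one comparison key. The helper `name_key` is hypothetical and only performs string cleanup. It is not how the resolver used below works, and it would not map `Amazon` to an official name.

```python
import re


def name_key(name):
    """Uppercase, drop punctuation, and collapse whitespace."""
    no_punct = re.sub(r"[^\w\s]", "", name)
    return " ".join(no_punct.upper().split())


variants = ["Meta Platforms, Inc.", "Meta Platforms Inc.", "Meta Platforms, Inc"]
keys = {name_key(v) for v in variants}
print(keys)  # {'META PLATFORMS INC'}
```

All three spellings share one key, but the key still has to match whatever spelling a given provider uses, which is why normalization is done per data source.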

So `Company Name Normalization` must always target the specific database or data source we want to extract data from. The data providers we have are:
- SEC Edgar
- Crunchbase until 2015
- Wikidata (in progress)

Let's normalize `Amazon` to the official name in _SEC Edgar_.

In [None]:
embeddings = nlp.UniversalSentenceEncoder.pretrained("tfhub_use", "en") \
      .setInputCols("document") \
      .setOutputCol("sentence_embeddings")
    
resolver = legal.SentenceEntityResolverModel.pretrained("legel_edgar_company_name", "en", "legal/models")\
      .setInputCols(["sentence_embeddings"]) \
      .setOutputCol("resolution")\
      .setDistanceFunction("EUCLIDEAN")

pipeline = nlp.PipelineModel(stages = [
          documentAssembler,
          embeddings,
          resolver])

lp_res = nlp.LightPipeline(pipeline)

In [10]:
ner_result['ner_chunk']

['Amazon']

In [None]:
el_res = lp_res.annotate(ner_result['ner_chunk'])

In [13]:
el_res

[{'document': ['Amazon'],
  'sentence_embeddings': ['Amazon'],
  'resolution': ['AMAZON COM INC']}]

Here is our normalized name for Amazon: `AMAZON COM INC`.
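
The resolver above is configured with `setDistanceFunction("EUCLIDEAN")`. Conceptually, this kind of resolution can be thought of as a nearest-neighbour search over embeddings. The sketch below shows only that idea in plain Python. The three-dimensional vectors and the second candidate name are made up for illustration and have nothing to do with the real model's embeddings or index.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def resolve(query_vec, candidates):
    """Return the candidate name whose vector is closest to query_vec."""
    return min(candidates, key=lambda name: euclidean(query_vec, candidates[name]))


# Made-up toy vectors, for illustration only
candidates = {
    "AMAZON COM INC": [0.9, 0.1, 0.0],
    "META PLATFORMS INC": [0.1, 0.9, 0.2],  # hypothetical second entry
}
query = [0.8, 0.2, 0.1]  # hypothetical embedding of the chunk "Amazon"

best = resolve(query, candidates)
print(best)  # AMAZON COM INC
```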

Now, let's see what information is available in the Edgar database for `AMAZON COM INC`.

# Steps 1 and 2 in the same pipeline

In [None]:
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner_model = legal.NerModel.pretrained("legner_orgs_prods_alias", "en", "legal/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")

chunk2doc = nlp.Chunk2Doc()\
    .setInputCols("ner_chunk")\
    .setOutputCol("ner_chunk_doc")

sentence_embeddings = nlp.UniversalSentenceEncoder.pretrained("tfhub_use", "en") \
  .setInputCols("ner_chunk_doc") \
  .setOutputCol("sentence_embeddings")

resolver = legal.SentenceEntityResolverModel.pretrained("legel_edgar_company_name", "en", "legal/models")\
  .setInputCols(["sentence_embeddings"]) \
  .setOutputCol("resolution")\
  .setDistanceFunction("EUCLIDEAN")

nlp_pipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter,
    chunk2doc,
    sentence_embeddings,
    resolver
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlp_pipeline.fit(empty_data)

In [None]:
lp_model = nlp.LightPipeline(model)

el_res = lp_model.annotate(text)

In [16]:
el_res

{'document': ['We have entered into a definitive merger agreement with Amazon.'],
 'ner_chunk': ['Amazon'],
 'sentence_embeddings': ['Amazon'],
 'resolution': ['AMAZON COM INC'],
 'token': ['We',
  'have',
  'entered',
  'into',
  'a',
  'definitive',
  'merger',
  'agreement',
  'with',
  'Amazon',
  '.'],
 'ner': ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'O'],
 'embeddings': ['We',
  'have',
  'entered',
  'into',
  'a',
  'definitive',
  'merger',
  'agreement',
  'with',
  'Amazon',
  '.'],
 'ner_chunk_doc': ['Amazon'],
 'sentence': ['We have entered into a definitive merger agreement with Amazon.']}
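
With both steps in one pipeline, each detected mention can be paired with its normalized name. The sketch below copies the two relevant keys from the output above into a plain dictionary (named `el_res_sample` here to avoid clashing with the notebook variable) and assumes the `ner_chunk` and `resolution` lists line up one to one, as they do in this output.

```python
# Relevant keys copied from the LightPipeline output shown above
el_res_sample = {
    "ner_chunk": ["Amazon"],
    "resolution": ["AMAZON COM INC"],
}

# Assumption: one resolution per chunk, in the same order
pairs = list(zip(el_res_sample["ner_chunk"], el_res_sample["resolution"]))

for mention, official in pairs:
    print(f"{mention} -> {official}")  # Amazon -> AMAZON COM INC
```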

# Step 3: Data Augmentation with Chunk Mappers

The component which carries out __Data Augmentation__ is called `ChunkMapper`.

Its name comes from the way it works: it takes an _NER chunk_ and maps it to an external data source.

As a result, you will get a JSON with a dictionary of additional fields and their values. 
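
As a mental model only, the mapping can be pictured as a keyed lookup from the chunk text to a record of fields. The sketch below is not how `ChunkMapperModel` is implemented. `SAMPLE_SOURCE` and `map_chunk` are hypothetical names, and the two field values are taken from the EDGAR output shown at the end of this notebook.

```python
# Tiny stand-in for an external source (values from this notebook's EDGAR output)
SAMPLE_SOURCE = {
    "AMAZON COM INC": {"sic_code": "5961", "state_location": "WA"},
}


def map_chunk(chunk_text, source):
    """Return the fields stored for chunk_text, or an empty dict if the key is unknown."""
    return source.get(chunk_text, {})


print(map_chunk("AMAZON COM INC", SAMPLE_SOURCE))  # {'sic_code': '5961', 'state_location': 'WA'}
print(map_chunk("Amazon", SAMPLE_SOURCE))          # {}
```

The second call returns nothing because the key does not match, which is why the normalization in Step 2 has to come first.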

Let's take a look at how it works.

In [None]:
chunkAssembler = nlp.Doc2Chunk() \
    .setInputCols("document") \
    .setOutputCol("chunk") \
    .setIsArray(False)

CM = legal.ChunkMapperModel.pretrained("legmapper_edgar_companyname", "en", "legal/models")\
    .setInputCols(["chunk"])\
    .setOutputCol("mappings")

cm_pipeline = nlp.Pipeline(stages=[documentAssembler, chunkAssembler, CM])

fit_cm_pipeline = cm_pipeline.fit(empty_data)

In [18]:
# LightPipelines don't support Doc2Chunk, so we use the regular transform() here

df = spark.createDataFrame([el_res['resolution']]).toDF("text")
df.show()

+--------------+
|          text|
+--------------+
|AMAZON COM INC|
+--------------+
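
The cell above wraps `el_res['resolution']` in a list, which yields one row because there is exactly one name. If a text produced several resolved names, that same expression would presumably give a single row with several columns instead of one row per name. A small sketch of the alternative row layout follows. The helper `to_rows` and the second company name are hypothetical, and the Spark call is left as a comment because it needs a running session.

```python
def to_rows(names):
    """Wrap each name in its own single-column row."""
    return [[name] for name in names]


rows = to_rows(["AMAZON COM INC", "META PLATFORMS INC"])
print(rows)  # [['AMAZON COM INC'], ['META PLATFORMS INC']]

# With a running Spark session, these rows could then be passed on as:
# df = spark.createDataFrame(rows).toDF("text")
```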



In [19]:
res = fit_cm_pipeline.transform(df)
res.show()

+--------------+--------------------+--------------------+--------------------+
|          text|            document|               chunk|            mappings|
+--------------+--------------------+--------------------+--------------------+
|AMAZON COM INC|[{document, 0, 13...|[{chunk, 0, 13, A...|[{labeled_depende...|
+--------------+--------------------+--------------------+--------------------+



In [20]:
r = res.collect()
r

[Row(text='AMAZON COM INC', document=[Row(annotatorType='document', begin=0, end=13, result='AMAZON COM INC', metadata={'sentence': '0'}, embeddings=[])], chunk=[Row(annotatorType='chunk', begin=0, end=13, result='AMAZON COM INC', metadata={'sentence': '0', 'chunk': '0'}, embeddings=[])], mappings=[Row(annotatorType='labeled_dependency', begin=0, end=13, result='AMAZON COM INC', metadata={'sentence': '0', 'chunk': '0', 'entity': 'AMAZON COM INC', 'relation': 'name', 'all_relations': ''}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=13, result='RETAIL-CATALOG & MAIL-ORDER HOUSES [5961]', metadata={'sentence': '0', 'chunk': '0', 'entity': 'AMAZON COM INC', 'relation': 'sic', 'all_relations': '[5961'}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=13, result='5961', metadata={'sentence': '0', 'chunk': '0', 'entity': 'AMAZON COM INC', 'relation': 'sic_code', 'all_relations': '0'}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0

In [21]:
json_dict = dict()
for n in r[0]['mappings']:
    json_dict[n.metadata['relation']] = str(n.result)

In [22]:
print(json.dumps(json_dict, indent=4, sort_keys=True))

{
    "business_city": "SEATTLE",
    "business_phone": "2062661000",
    "business_state": "WA",
    "business_street": "410 TERRY AVENUE NORTH",
    "business_zip": "98109",
    "company_id": "1018724",
    "date": "2017-02-10",
    "fiscal_year_end": "1231",
    "former_name": "ABX Holdings, Inc.",
    "former_name_date": "20080102",
    "irs_number": "911646860",
    "name": "AMAZON COM INC",
    "sic": "RETAIL-CATALOG & MAIL-ORDER HOUSES [5961]",
    "sic_code": "5961",
    "state_incorporation": "DE",
    "state_location": "WA"
}
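
Note that the loop building `json_dict` keeps one value per relation, so if the mapper ever returned two rows with the same `relation`, the later one would overwrite the earlier one. A hedged variant that collects values into lists is sketched below. `Mapping` is a hypothetical stand-in for the Spark `Row` objects in `r[0]['mappings']`, filled with the first three rows visible in the collected output above.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for the Spark Row objects in r[0]['mappings']
Mapping = namedtuple("Mapping", ["result", "metadata"])

mappings = [
    Mapping("AMAZON COM INC", {"relation": "name"}),
    Mapping("RETAIL-CATALOG & MAIL-ORDER HOUSES [5961]", {"relation": "sic"}),
    Mapping("5961", {"relation": "sic_code"}),
]


def collect_relations(rows):
    """Group results by relation, keeping every value instead of only the last."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row.metadata["relation"]].append(str(row.result))
    return dict(grouped)


grouped = collect_relations(mappings)
print(grouped["sic_code"])  # ['5961']
```

In the notebook itself, the same function could be applied to `r[0]['mappings']` directly, since those rows expose `result` and `metadata` in the same way.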
