![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# Legal Coreference Resolution

In [0]:
from johnsnowlabs import * 


![IMAGE.PNG](/files/FINLEG/4.png)

Coreference Resolution is the task of finding all expressions that refer to the same entity in a text.

This is very important in both legal and financial texts, where the official name of a company is mentioned at the beginning of the document but aliases of the company are used later on instead.

Let's take a look at some examples and how to solve them using Coreference Resolution.

`'Armstrong Hardwood Flooring Company is a Tennessee corporation (known also as "Company"). The Company own certain Copyrights and Know-How which may be used in the Arizona Field, and in connection with the transactions contemplated by the Stock Purchase Agreement, Arizona desires to obtain a license from the Company Entities to use such Intellectual Property on the terms and subject to the conditions set forth herein.'`

In the previous text, `Company` in the second sentence refers to `Armstrong Hardwood Flooring Company`.

There are three ways we can accomplish coreference resolution:
1. With a specific `SpanBertCorefModel` annotator;
2. With NER and Relation Extraction;
3. With Question Answering.

# 1. SpanBertCoref

The `SpanBertCorefModel` annotator performs coreference resolution with BERT and SpanBERT models. It is based on the paper [BERT for Coreference Resolution: Baselines and Analysis](https://arxiv.org/abs/1908.09091).

Spark NLP includes `SpanBertCorefModel` as an implementation of this SpanBERT-based coreference resolution model.

In [0]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentences")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentences"])\
    .setOutputCol("tokens")

corefResolution = nlp.SpanBertCorefModel()\
    .pretrained("spanbert_base_coref")\
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("corefs")

pipeline = nlp.Pipeline(stages=[document_assembler, sentence_detector, tokenizer, corefResolution])

### Who is "the Company" in this example?

In [0]:
example1 = 'Armstrong Hardwood Flooring Company is a Tennessee corporation (known also as "Company"). The Company own certain Copyrights and Know-How which may be used to the conditions set forth herein.'

data = spark.createDataFrame([[example1]]).toDF("text")

model = pipeline.fit(data)

model.transform(data).selectExpr("explode(corefs) AS coref").selectExpr("coref.result as token", "coref.metadata").show(truncate=False)

### What is "this Agreement" in the example? And "it"?

In [0]:
example2 = 'This INTELLECTUAL PROPERTY AGREEMENT (this "Agreement") is dated as of December 31, 2018 (the "Effective Date").It was entered into by and between Armstrong Flooring (the "Seller") and AHF Holding (the "Buyer"). Seller and Buyer have entered into that certain Stock Purchase Agreement, dated November 14, 2018 (the "Stock Purchase Agreement")'

data = spark.createDataFrame([[example2]]).toDF("text")

model = pipeline.fit(data)

model.transform(data).selectExpr("explode(corefs) AS coref").selectExpr("coref.result as token", "coref.metadata").show(truncate=False)

### Which date are we talking about?

In [0]:
example3 = 'This Agreement is dated as of December 31, 2018 (the "Effective Date"). Seller and Buyer should sign it before the ending of that date.'

data = spark.createDataFrame([[example3]]).toDF("text")

model = pipeline.fit(data)

model.transform(data).selectExpr("explode(corefs) AS coref").selectExpr("coref.result as token", "coref.metadata").show(truncate=False)
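Each `show` call above prints one row per mention. The following is a minimal sketch of how such rows could be grouped into clusters once collected to the driver. It assumes that each row's metadata carries a `head` key holding the antecedent text, with the value `ROOT` when the mention is itself the antecedent; verify this against the output of your Spark NLP version. The helper name `group_mentions_by_head` and the sample rows are hypothetical and hand-written for illustration; they are not real model output.

```python
from collections import defaultdict


def group_mentions_by_head(rows):
    """Group coreference mentions under the antecedent they point to.

    `rows` is a list of (mention_text, metadata) pairs.
    Assumption: metadata["head"] holds the antecedent text, or "ROOT"
    when the mention is itself the antecedent.
    """
    clusters = defaultdict(list)
    for mention, metadata in rows:
        head = metadata.get("head", "ROOT")
        if head == "ROOT":
            # Register the antecedent even if nothing refers to it yet
            clusters.setdefault(mention, [])
        else:
            clusters[head].append(mention)
    return dict(clusters)


# Hand-written sample rows (illustrative only, NOT real model output)
sample_rows = [
    ("Armstrong Hardwood Flooring Company", {"head": "ROOT"}),
    ("The Company", {"head": "Armstrong Hardwood Flooring Company"}),
]

print(group_mentions_by_head(sample_rows))
```

Grouping by antecedent makes it easier to see at a glance which aliases were attached to which name, and which mentions were left unresolved.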

In practice, however, legal texts are often much longer and more complex than these examples.

Let's take a look at the following example:

### Who is Seller? Who is Buyer?

FAIL: We are unable to retrieve that information using SpanBertCoref.

In [0]:
example4 = 'This INTELLECTUAL PROPERTY AGREEMENT is entered into by and between Armstrong Flooring, Inc. ("Seller") and AHF Holding, Inc. ("Buyer"). "Seller" and "Buyer" have entered into that certain Stock Purchase Agreement, dated November 14, 2018 (the "Stock Purchase Agreement")'

data = spark.createDataFrame([[example4]]).toDF("text")

model = pipeline.fit(data)

print("\x1b[31m")
model.transform(data).selectExpr("explode(corefs) AS coref").selectExpr("coref.result as token", "coref.metadata").show(truncate=False)
print("\x1b[0m")

Another disadvantage of this method is that you need to send all the text at once to resolve the coreferences. If you miss the original lines where the concepts are defined, you will lose the reference.

As an alternative, we can use NER and Relation Extraction, as shown in the next section.

# 2. NER and Relation Extraction

Models Hub (Spark NLP for Legal) offers several models that can detect aliases or secondary names in financial and legal documents.

We are going to use this NER model:
`https://nlp.johnsnowlabs.com/2022/08/12/legre_contract_doc_parties_en_3_2.html`

After extracting the aliases, we will check which names they are referring to. To do this, we will use Relation Extraction. For this example, we will use this model:

`https://nlp.johnsnowlabs.com/2022/08/17/legre_org_prod_alias_en_3_2.html`

Let's see them in action.

In [0]:
documentAssembler = nlp.DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")

tokenizer = nlp.Tokenizer()\
        .setInputCols(["document"])\
        .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
        .setInputCols(["document", "token"]) \
        .setOutputCol("embeddings")

ner_model = legal.NerModel.pretrained("legner_orgs_prods_alias", "en", "legal/models")\
        .setInputCols(["document", "token", "embeddings"])\
        .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
        .setInputCols(["document","token","ner"])\
        .setOutputCol("ner_chunk")

reDL = legal.RelationExtractionDLModel()\
    .pretrained("legre_org_prod_alias", "en", "legal/models")\
    .setPredictionThreshold(0.99)\
    .setInputCols(["ner_chunk", "document"])\
    .setOutputCol("relations")

nlpPipeline = nlp.Pipeline(stages=[
        documentAssembler,
        tokenizer,
        embeddings,
        ner_model,
        ner_converter,
        reDL])

In [0]:
example4 = 'This INTELLECTUAL PROPERTY AGREEMENT is entered into by and between Armstrong Flooring, Inc. (the "Seller") and AHF Holding, Inc., a Delaware Corporation (the "Buyer").'

data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(data)

In [0]:
lmodel = nlp.LightPipeline(model)
res = lmodel.fullAnnotate(example4)

In [0]:
aliases = dict()
for r in res:
  for rel in r['relations']:
    if rel.result != 'no_rel':
      aliases.setdefault(rel.metadata['chunk2'], []).append(rel.metadata['chunk1'])
      print(f"{rel.metadata['chunk1']} - {rel.result} - {rel.metadata['chunk2']} (confidence: {rel.metadata['confidence']})")

In [0]:
aliases

Once that is done, you can process the rest of the document, detecting entities such as Seller, Buyer, etc. with either NER or ContextualParsers, and disambiguate them using the results of the previous model.

In [0]:
example5 = '"Seller" and "Buyer" have entered into that certain Stock Purchase Agreement, dated November 14, 2018 (the "Stock Purchase Agreement")'

nlpPipeline = nlp.Pipeline(stages=[
        documentAssembler,
        tokenizer,
        embeddings,
        ner_model,
        ner_converter])

data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(data)

In [0]:
lmodel = nlp.LightPipeline(model)
res = lmodel.fullAnnotate(example5)

In [0]:
for r in res:
  for ner_chunk in r['ner_chunk']:
    # Skip chunks that were not stored as aliases in the previous step
    names = aliases.get(ner_chunk.result)
    if names:
      print(f"{ner_chunk.result} is {names[0]}")

The big advantage of this method is that you don't need to process the whole text to know the coreferences. You can first detect the aliases, store them, and then resolve the coreferences with NER and RE.
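Looking up a chunk directly in `aliases` only works when the chunk text matches a stored key exactly. Below is a minimal sketch of a more tolerant lookup, assuming that differences in case and surrounding quotes should be ignored. The helper names (`normalize_alias`, `build_alias_index`, `resolve`) and the sample dictionary are hypothetical and are not part of Spark NLP.

```python
def normalize_alias(text):
    """Strip whitespace and surrounding quotes, then lowercase."""
    return text.strip().strip('"\'“”').strip().lower()


def build_alias_index(aliases):
    """Map each normalized alias to the first name stored for it.

    `aliases` has the same shape as the dictionary built above:
    alias text -> list of names it was related to.
    """
    index = {}
    for alias, names in aliases.items():
        if names:
            index[normalize_alias(alias)] = names[0]
    return index


def resolve(mention, index):
    """Return the name behind a mention, or None if it is unknown."""
    return index.get(normalize_alias(mention))


# Hypothetical stored aliases (illustrative values, not real model output)
aliases_example = {
    '"Seller"': ["Armstrong Flooring, Inc."],
    "Buyer": ["AHF Holding, Inc."],
}
index = build_alias_index(aliases_example)

print(resolve("Seller", index))
print(resolve('"buyer"', index))
print(resolve("Licensor", index))
```

Returning `None` for unknown mentions lets the caller decide whether to skip them or flag them for review, instead of failing on the first chunk that was never defined as an alias.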

# 3. Question Answering

This is the third option for retrieving coreferences. You can detect the aliases and ask questions on the fly about what they refer to.

Let's see an example.

In [0]:
context = 'This INTELLECTUAL PROPERTY AGREEMENT is entered into by and between Armstrong Flooring, Inc. ("Seller") and AHF Holding, Inc. ("Buyer").'.lower()
question1 = 'Which company is the Buyer'.lower()
question2 = 'Which company is the Seller'.lower()

In [0]:
document_assembler = nlp.MultiDocumentAssembler()\
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = nlp.BertForQuestionAnswering.pretrained("legqa_bert_large","en", "legal/models")\
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(False)

pipeline = nlp.Pipeline().setStages([
    document_assembler,
    spanClassifier
])

In [0]:
qa = [[question1, context], [question2, context]]

In [0]:
example = spark.createDataFrame(qa).toDF("question", "context")

result = pipeline.fit(example).transform(example)

result.select('answer.result').show(truncate=False)
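The two questions above were written by hand. When aliases have already been detected, the question rows could be generated instead. The following is a minimal sketch, assuming the same lowercased "which company is the ..." wording used above; the helper name `build_alias_questions` and the shortened sample context are hypothetical.

```python
def build_alias_questions(alias_names, context):
    """Build [question, context] rows, one per alias.

    Both fields are lowercased to mirror the hand-written questions above.
    """
    template = "which company is the {alias}"
    lowered_context = context.lower()
    return [
        [template.format(alias=alias.lower()), lowered_context]
        for alias in alias_names
    ]


# Shortened sample context (illustrative)
context_example = (
    'This Agreement is entered into by and between Armstrong Flooring, Inc. '
    '("Seller") and AHF Holding, Inc. ("Buyer").'
)
qa_rows = build_alias_questions(["Buyer", "Seller"], context_example)

for question, _ in qa_rows:
    print(question)
```

The resulting list has the same shape as `qa`, so it could be passed to `spark.createDataFrame(...).toDF("question", "context")` in the same way.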