# Legal Relation Extraction (RE) and Zero-shot Relation Extraction

![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Legal/5.Legal_RE_ZeroShotRE.ipynb)

## Colab Setup

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install johnsnowlabs 

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

Saving latest_3_1_x_spark_nlp_for_healthcare_spark_ocr_5112.json to latest_3_1_x_spark_nlp_for_healthcare_spark_ocr_5112.json


In [None]:
from johnsnowlabs import * 
# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM
# Make sure to restart your notebook afterwards for the changes to take effect
jsl.install()

👌 Detected license file /content/latest_3_1_x_spark_nlp_for_healthcare_spark_ocr_5112.json
📋 Stored John Snow Labs License in /root/.johnsnowlabs/licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json
👷 Setting up John Snow Labs home in /home/ckl/.johnsnowlabs this might take a few minutes.
Downloading 🐍+🚀 Python Library Spark-NLP-4.1.0-wheel-for-spark-3.x.x.whl
Downloading 🐍+💊 Python Library hc
Downloading 🐍+🕶 Python Library Spark-OCR-4.0.1-wheel-for-spark-3.x.x.whl
Downloading 🫘+🚀 Java Library Spark-NLP-4.1.0-cpu-for-spark-3.x.x.jar
Downloading 🫘+💊 Java Library hc
Downloading 🫘+🕶 Java Library Spark-OCR-4.0.1-cpu-for-spark-3.x.x.jar
🙆 JSL Home setup in /root/.johnsnowlabs
Running "/usr/bin/python3 -m pip install https://pypi.johnsnowlabs.com/[LIBRARY_SECRET]spark-ocr/spark_ocr-4.0.1-py3-none-any.whl --force-reinstall"
Running "/usr/bin/python3 -m pip install https://pypi.johnsnowlabs.com/[LIBRARY_SECRET]spark-nlp-internal/spark_nlp_internal-4.1.0-py3-none-any.whl --force-reinst

## Start Spark Session

In [None]:
from johnsnowlabs import * 
# Automatically load license data and start a session with all JARs the user has access to
spark = jsl.start()

👌 Detected license file /content/latest_3_1_x_spark_nlp_for_healthcare_spark_ocr_5112.json
📋 Stored new John Snow Labs License in /root/.johnsnowlabs/licenses/license_number_2_for_Spark-Healthcare_Spark-OCR.json
👌 Launched SparkSession with Jars for: 🚀Spark-NLP, 💊Spark-Healthcare, 🕶Spark-OCR


In [None]:
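import pandas as pd
import warnings
from pyspark.sql import SparkSession

warnings.filterwarnings('ignore')

# Optional: start the session manually with custom parameters instead of jsl.start() above.
# SECRET, PUBLIC_VERSION and JSL_VERSION must be defined beforehand (they are not set in this notebook).
def start(SECRET):
    builder = SparkSession.builder \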
        .appName("Spark NLP Licensed") \
        .master("local[*]") \
        .config("spark.driver.memory", "16G") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
        .config("spark.kryoserializer.buffer.max", "2000M") \
        .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:"+PUBLIC_VERSION) \
        .config("spark.jars", "https://pypi.johnsnowlabs.com/"+SECRET+"/spark-nlp-jsl-"+JSL_VERSION+".jar")
      
    return builder.getOrCreate()

#spark = start(SECRET)


## Extract Relations Between Parties in an Agreement

This is a Legal Relation Extraction model that runs after the NER model and extracts relations between Parties, Document Types, Effective Dates and Aliases.

As output, you get the relations linking the different concepts together, if such a relation exists. The list of relations is:

- **dated_as**: A document has an effective date
- **has_alias**: The alias by which a party is referred to throughout the document
- **has_collective_alias**: An alias held by several parties at the same time
- **signed_by**: Between a party and the document they signed
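
The entity types that each relation connects can be read off the model output for the sample agreement used later in this notebook. As a minimal sketch, the snippet below encodes those pairs in plain Python and uses them to pick candidate chunk pairs. `RELATION_PAIRS` and `is_candidate` are hypothetical names for illustration, not part of Spark NLP, and the pairing is inferred from that one sample output rather than from the model's documentation.

```python
# Hypothetical mapping, inferred from the sample output in this notebook:
# relation name -> (entity1 label, entity2 label)
RELATION_PAIRS = {
    "dated_as": ("DOC", "EFFDATE"),
    "signed_by": ("DOC", "PARTY"),
    "has_alias": ("PARTY", "ALIAS"),
    "has_collective_alias": ("ALIAS", "ALIAS"),
}

ALLOWED = set(RELATION_PAIRS.values())


def is_candidate(label1, label2, allowed=ALLOWED):
    """Return True if the ordered label pair could hold one of the relations."""
    return (label1, label2) in allowed


# (chunk text, NER label) pairs, as an NER step might produce them
chunks = [
    ("INTELLECTUAL PROPERTY AGREEMENT", "DOC"),
    ("December 31, 2018", "EFFDATE"),
    ("Seller", "ALIAS"),
]

# Keep only ordered pairs of distinct chunks whose labels are allowed
candidates = [
    (a[0], b[0])
    for a in chunks
    for b in chunks
    if a is not b and is_candidate(a[1], b[1])
]
print(candidates)
```

Restricting candidates this way is the same idea as the `setRelationPairs` call in the optional filter shown in the pipeline cell: fewer pairs to score, and no relations between label types that cannot be related.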

In [None]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")
        
tokenizer = nlp.Tokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")

embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en") \
    .setInputCols("sentence", "token") \
    .setOutputCol("embeddings")\
    .setMaxSentenceLength(512)

ner_model = legal.NerModel.pretrained("legner_contract_doc_parties", "en", "legal/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")

"""
ONLY NEEDED IF YOU WANT TO FILTER RELATION PAIRS OR LIMIT SYNTACTIC DISTANCE
(column names below match the "token" and "ner_chunk" columns defined in this pipeline)
pos_tagger = PerceptronModel()\
    .pretrained("pos_clinical", "en", "clinical/models") \
    .setInputCols(["document", "token"])\
    .setOutputCol("pos_tags")

dependency_parser = DependencyParserModel() \
    .pretrained("dependency_conllu", "en") \
    .setInputCols(["document", "pos_tags", "token"]) \
    .setOutputCol("dependencies")

Set a filter on pairs of named entities which will be treated as relation candidates
re_filter = RENerChunksFilter()\
    .setInputCols(["ner_chunk", "dependencies"])\
    .setOutputCol("re_ner_chunks")\
    .setMaxSyntacticDistance(7)\
    .setRelationPairs(['PARTY-ALIAS', 'DOC-PARTY', 'DOC-EFFDATE'])
"""
re_model = legal.RelationExtractionDLModel.pretrained("legre_contract_doc_parties", "en", "legal/models")\
    .setPredictionThreshold(0.5)\
    .setInputCols(["ner_chunk", "sentence"])\
    .setOutputCol("relations")

pipeline = Pipeline(stages=[
        document_assembler,
        sentence_detector,
        tokenizer,
        embeddings,
        ner_model,
        ner_converter,
        re_model
        ])
empty_df = spark.createDataFrame([[""]]).toDF("text")

model = pipeline.fit(empty_df)

light_model = LightPipeline(model)


sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
roberta_embeddings_legal_roberta_base download started this may take some time.
Approximate size to download 447.2 MB
[OK!]
legner_contract_doc_parties download started this may take some time.
[OK!]
legre_contract_doc_parties download started this may take some time.
[OK!]


In [None]:
ner_model.getClasses()

['O',
 'I-DOC',
 'B-EFFDATE',
 'B-ALIAS',
 'I-ALIAS',
 'B-PARTY',
 'I-EFFDATE',
 'I-PARTY',
 'B-DOC']
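
The classes follow the usual B-/I-/O tagging scheme: `B-` opens a chunk, `I-` continues it, and `O` marks tokens outside any entity. The sketch below shows, in plain Python, roughly how such tags are merged into chunks, which is the job `NerConverter` performs in the pipeline. `bio_to_chunks` is a hypothetical helper, and the tokens and tags are hand-written for illustration, not model output.

```python
def bio_to_chunks(tokens, tags):
    """Group tokens into (text, label) chunks following the B-/I-/O scheme."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        prefix, _, name = tag.partition("-")
        if prefix == "B" or (prefix == "I" and name != label):
            # A new chunk starts: flush the one that was open, if any
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], name
        elif prefix == "I":
            # Continuation of the currently open chunk
            current.append(token)
        else:
            # "O" (or any tag without a B-/I- prefix) closes the open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks


tokens = ["This", "INTELLECTUAL", "PROPERTY", "AGREEMENT", "dated",
          "December", "31", ",", "2018"]
tags = ["O", "B-DOC", "I-DOC", "I-DOC", "O",
        "B-EFFDATE", "I-EFFDATE", "I-EFFDATE", "I-EFFDATE"]

print(bio_to_chunks(tokens, tags))
```

This sketch joins tokens with single spaces, so punctuation ends up space-separated; the real converter works from character offsets in the original text instead.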

### Create Generic Function to Show Relations in Dataframe

In [None]:
def get_relations_df (results, col='relations'):
    rel_pairs=[]
    for i in range(len(results)):
        for rel in results[i][col]:
            rel_pairs.append((
              rel.result, 
              rel.metadata['entity1'], 
              rel.metadata['entity1_begin'],
              rel.metadata['entity1_end'],
              rel.metadata['chunk1'], 
              rel.metadata['entity2'],
              rel.metadata['entity2_begin'],
              rel.metadata['entity2_end'],
              rel.metadata['chunk2'], 
              rel.metadata['confidence']
          ))
    rel_df = pd.DataFrame(rel_pairs, columns=['relation','entity1','entity1_begin','entity1_end','chunk1','entity2','entity2_begin','entity2_end','chunk2', 'confidence'])
    return rel_df
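
Annotation metadata values in Spark NLP are strings, so the `confidence` field has to be converted before it can be compared numerically. The following sketch filters relations by a minimum confidence. It uses `SimpleNamespace` objects as stand-ins for real annotations so that it runs without a Spark session; `filter_by_confidence` and the 0.9 cut-off are illustrative assumptions, and the sample values are taken from the output shown later in this notebook.

```python
from types import SimpleNamespace


def filter_by_confidence(relations, min_confidence=0.9):
    """Keep relations other than 'no_rel' whose confidence reaches the cut-off."""
    return [
        rel for rel in relations
        if rel.result != "no_rel"
        and float(rel.metadata["confidence"]) >= min_confidence  # stored as str
    ]


# Stand-ins for annotation objects: only the fields used above are mocked
mock_relations = [
    SimpleNamespace(result="dated_as", metadata={"confidence": "0.98433614"}),
    SimpleNamespace(result="signed_by", metadata={"confidence": "0.6040471"}),
    SimpleNamespace(result="has_alias", metadata={"confidence": "0.96357507"}),
    SimpleNamespace(result="no_rel", metadata={"confidence": "0.99"}),
]

kept = filter_by_confidence(mock_relations)
print([rel.result for rel in kept])
```

With real results, the same function could be applied to `results[i]['relations']` before building the DataFrame.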

### Getting Result with Light Pipeline

LightPipelines are Spark NLP-specific pipelines, equivalent to a Spark ML Pipeline but meant for smaller amounts of data. They are useful when working with small datasets, debugging results, or running training or prediction from an API that serves one-off requests.
A LightPipeline converts a fitted Spark ML pipeline into a single-machine, multi-threaded task, which can be more than 10x faster for smaller amounts of data (small is relative, but roughly 50k sentences is a good maximum). To use one, we plug in a trained (fitted) pipeline and annotate plain text directly. There is no need to convert the input text to a DataFrame first, even though the underlying pipeline expects a DataFrame as input. This is useful for getting predictions on a few lines of text from a trained model.

In [None]:
sample_text = """This INTELLECTUAL PROPERTY AGREEMENT (this "Agreement"), dated as of December 31, 2018 (the "Effective Date") is entered into by and between Armstrong Flooring, Inc., a Delaware corporation ("Seller") and AFI Licensing LLC, a Delaware limited liability company ("Licensing" and together with Seller, "Arizona") and AHF Holding, Inc. (formerly known as Tarzan HoldCo, Inc.), a Delaware corporation ("Buyer") and Armstrong Hardwood Flooring Company, a Tennessee corporation (the "Company" and together with Buyer the "Buyer Entities") (each of Arizona on the one hand and the Buyer Entities on the other hand, a "Party" and collectively, the "Parties")."""

result = light_model.fullAnnotate(sample_text)

In [None]:
rel_df = get_relations_df(result)

rel_df[rel_df["relation"] != "no_rel"]

| | relation | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | dated_as | DOC | 5 | 35 | INTELLECTUAL PROPERTY AGREEMENT | EFFDATE | 69 | 85 | December 31, 2018 | 0.98433614 |
| 1 | signed_by | DOC | 5 | 35 | INTELLECTUAL PROPERTY AGREEMENT | PARTY | 141 | 163 | Armstrong Flooring, Inc | 0.6040471 |
| 27 | has_alias | PARTY | 141 | 163 | Armstrong Flooring, Inc | ALIAS | 192 | 197 | Seller | 0.96357507 |
| 50 | has_alias | PARTY | 205 | 221 | AFI Licensing LLC | ALIAS | 263 | 271 | Licensing | 0.95466775 |
| 81 | has_alias | PARTY | 315 | 330 | AHF Holding, Inc | ALIAS | 611 | 615 | Party | 0.5387175 |
| 82 | has_alias | PARTY | 315 | 330 | AHF Holding, Inc | ALIAS | 641 | 647 | Parties | 0.5387175 |
| 87 | has_collective_alias | ALIAS | 399 | 403 | Buyer | ALIAS | 611 | 615 | Party | 0.5539446 |
| 88 | has_collective_alias | ALIAS | 399 | 403 | Buyer | ALIAS | 641 | 647 | Parties | 0.5539445 |
| 89 | has_alias | PARTY | 411 | 445 | Armstrong Hardwood Flooring Company | ALIAS | 478 | 484 | Company | 0.9210608 |
| 92 | has_alias | PARTY | 411 | 445 | Armstrong Hardwood Flooring Company | ALIAS | 611 | 615 | Party | 0.5812397 |
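
The `entity*_begin` and `entity*_end` columns are character offsets into the input text, and the end offset is inclusive. A short check against the opening of the sample text illustrates this; `span` is a hypothetical helper, and only the first part of the sample text is reproduced here.

```python
# Opening fragment of the sample text used above
text = ('This INTELLECTUAL PROPERTY AGREEMENT (this "Agreement"), '
        'dated as of December 31, 2018 (the "Effective Date")')


def span(text, begin, end):
    """Return the substring for inclusive character offsets [begin, end]."""
    return text[begin:end + 1]  # +1 because Python slices exclude the end


doc_chunk = span(text, 5, 35)    # offsets of the DOC chunk in the table
date_chunk = span(text, 69, 85)  # offsets of the EFFDATE chunk in the table
print(doc_chunk)
print(date_chunk)
```

Keeping the inclusive convention in mind avoids off-by-one errors when highlighting chunks or joining relations back to the source text.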


In [None]:
pd.DataFrame([(x.result, x.metadata["entity"]) for x in result[0]["ner_chunk"]], columns=["text", "ner"])

| | text | ner |
|---|---|---|
| 0 | INTELLECTUAL PROPERTY AGREEMENT | DOC |
| 1 | December 31, 2018 | EFFDATE |
| 2 | Armstrong Flooring, Inc | PARTY |
| 3 | Seller | ALIAS |
| 4 | AFI Licensing LLC | PARTY |
| 5 | Licensing | ALIAS |
| 6 | Seller | ALIAS |
| 7 | AHF Holding, Inc | PARTY |
| 8 | Buyer | ALIAS |
| 9 | Armstrong Hardwood Flooring Company | PARTY |


### Visualization of Extracted Relations

We use the **RelationExtractionVisualizer** class of the **spark-nlp-display** library to visualize the extracted relations between the entities.

In [None]:
# from sparknlp_display import RelationExtractionVisualizer

re_vis = viz.RelationExtractionVisualizer()

re_vis.display(result = result[0],
           relation_col = "relations",
           document_col = "document",
           exclude_relations = ["no_rel"],
           show_relations=True
           )

## Relation Extraction Model to Infer Relations Between Elements in WHEREAS Clauses

This is a Relation Extraction model to infer relations between elements in **WHEREAS** clauses, more specifically the **SUBJECT**, the **ACTION** and the **OBJECT**. There are two relations possible: **has_subject** and **has_object**.

In [None]:
ner_model = legal.NerModel.pretrained("legner_whereas", "en", "legal/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")

re_model = legal.RelationExtractionDLModel.pretrained("legre_whereas", "en", "legal/models")\
    .setPredictionThreshold(0.5)\
    .setInputCols(["ner_chunk", "sentence"])\
    .setOutputCol("relations")

pipeline = Pipeline(stages=[
        document_assembler,
        sentence_detector,
        tokenizer,
        embeddings,
        ner_model,
        ner_converter,
        re_model
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = pipeline.fit(empty_data)

light_model = LightPipeline(model)


legner_whereas download started this may take some time.
[OK!]
legre_whereas download started this may take some time.
[OK!]


In [None]:
ner_model.getClasses()

['O',
 'B-WHEREAS_SUBJECT',
 'I-WHEREAS_OBJECT',
 'B-WHEREAS_ACTION',
 'I-WHEREAS_SUBJECT',
 'B-WHEREAS_OBJECT',
 'I-WHEREAS_ACTION']

### Getting Result with Light Pipeline

In [None]:
sample_text = """WHEREAS VerticalNet owns and operates a series of online communities ( as defined below ) that are accessible via the world wide web, each of which is designed to be an online gathering place for businesses of a certain type or within a certain industry"""

result = light_model.fullAnnotate(sample_text)

rel_df = get_relations_df(result)

rel_df[rel_df["relation"] != "no_rel"]

| | relation | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | has_subject | WHEREAS_SUBJECT | 8 | 18 | VerticalNet | WHEREAS_ACTION | 29 | 36 | operates | 0.99839705 |
| 1 | has_subject | WHEREAS_SUBJECT | 8 | 18 | VerticalNet | WHEREAS_OBJECT | 38 | 67 | a series of online communities | 0.98838055 |
| 2 | has_object | WHEREAS_ACTION | 29 | 36 | operates | WHEREAS_OBJECT | 38 | 67 | a series of online communities | 0.8244948 |
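
The `has_subject` and `has_object` rows can be combined into (subject, action, object) triples. The sketch below does this in plain Python over the rows from the output above; `to_triples` is a hypothetical helper, and it assumes that each action chunk text is unique within the input, which would not hold for longer documents.

```python
# (relation, entity1, chunk1, entity2, chunk2), copied from the output above
rows = [
    ("has_subject", "WHEREAS_SUBJECT", "VerticalNet",
     "WHEREAS_ACTION", "operates"),
    ("has_subject", "WHEREAS_SUBJECT", "VerticalNet",
     "WHEREAS_OBJECT", "a series of online communities"),
    ("has_object", "WHEREAS_ACTION", "operates",
     "WHEREAS_OBJECT", "a series of online communities"),
]


def to_triples(rows):
    """Join subject->action and action->object relations on the action text."""
    subjects, objects = {}, {}
    for relation, entity1, chunk1, entity2, chunk2 in rows:
        if relation == "has_subject" and entity2 == "WHEREAS_ACTION":
            subjects[chunk2] = chunk1  # action -> subject
        elif relation == "has_object" and entity1 == "WHEREAS_ACTION":
            objects[chunk1] = chunk2   # action -> object
    # Only actions that have both a subject and an object form a triple
    return [(subjects[a], a, objects[a]) for a in subjects if a in objects]


triples = to_triples(rows)
print(triples)
```

For real documents, joining on the action's begin/end offsets instead of its text would avoid collisions between repeated verbs.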


In [None]:
pd.DataFrame([(x.result, x.metadata["entity"]) for x in result[0]["ner_chunk"]], columns=["text", "ner"])

| | text | ner |
|---|---|---|
| 0 | VerticalNet | WHEREAS_SUBJECT |
| 1 | operates | WHEREAS_ACTION |
| 2 | a series of online communities | WHEREAS_OBJECT |


### Visualization of Extracted Relations

In [None]:
# from sparknlp_display import RelationExtractionVisualizer

re_vis = viz.RelationExtractionVisualizer()

re_vis.display(result = result[0],
           relation_col = "relations",
           document_col = "document",
           exclude_relations = ["no_rel"],
           show_relations=True
           )

## Zero Shot Relation Extraction to Extract Relations Between Legal Entities

This is a Zero-shot Relation Extraction model, meaning that it does not require any training data: a few example templates of the relation types you are looking for are enough to produce a result.

**!!!Make sure you keep the proper syntax of the relations you want to extract!!!**
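
To see why the syntax matters, it helps to look at how a template with `{LABEL}` placeholders turns into a concrete sentence once the placeholders are replaced by extracted chunks. The sketch below does this with plain `str.format`. It only illustrates the placeholder mechanics and is not the model's internal implementation; `render_hypotheses` and the chunk texts are hypothetical.

```python
categories = {
    "should_provide": [
        "{OBLIGATION_SUBJECT} will provide {OBLIGATION}",
        "{OBLIGATION_SUBJECT} should provide {OBLIGATION}",
    ],
    "commits_with": [
        "{OBLIGATION_SUBJECT} to {OBLIGATION_INDIRECT_OBJECT}",
    ],
    "commits_to": ["{OBLIGATION_SUBJECT} commits to {OBLIGATION}"],
}


def render_hypotheses(categories, chunks):
    """Fill each template with chunk texts; skip templates missing a label."""
    rendered = {}
    for relation, templates in categories.items():
        sentences = []
        for template in templates:
            try:
                sentences.append(template.format(**chunks))
            except KeyError:
                # The template needs a label that was not extracted
                continue
        if sentences:
            rendered[relation] = sentences
    return rendered


# Hypothetical chunks, keyed by NER label
chunks = {
    "OBLIGATION_SUBJECT": "The Supplier",
    "OBLIGATION": "all the necessary components",
}

hypotheses = render_hypotheses(categories, chunks)
for relation, sentences in hypotheses.items():
    print(relation, sentences)
```

A placeholder that does not exactly match an entity label (wrong case, missing brace, typo) can never be filled, so the corresponding relation would silently never be produced.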

In [None]:
tokenClassifier = legal.BertForTokenClassification.pretrained("legner_obligations","en", "legal/models")\
    .setInputCols("document", "token")\
    .setOutputCol("ner")\
    .setCaseSensitive(True)

ner_converter = nlp.NerConverter()\
    .setInputCols(["document", "token", "ner"])\
    .setOutputCol("ner_chunk")

re_model = legal.ZeroShotRelationExtractionModel.pretrained("legre_zero_shot", "en", "legal/models")\
    .setInputCols(["ner_chunk", "document"]) \
    .setOutputCol("relations")

re_model.setRelationalCategories({
    "should_provide": ["{OBLIGATION_SUBJECT} will provide {OBLIGATION}", "{OBLIGATION_SUBJECT} should provide {OBLIGATION}"],
    "commits_with": ["{OBLIGATION_SUBJECT} to {OBLIGATION_INDIRECT_OBJECT}", "{OBLIGATION_SUBJECT} with {OBLIGATION_INDIRECT_OBJECT}"],
    "commits_to": ["{OBLIGATION_SUBJECT} commits to {OBLIGATION}"],
    "agree_to": ["{OBLIGATION_SUBJECT} agrees to {OBLIGATION}"],
})

pipeline = Pipeline(stages = [
                document_assembler,  
                tokenizer,
                tokenClassifier, 
                ner_converter,
                re_model
               ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = pipeline.fit(empty_data)

light_model = LightPipeline(model)

legner_obligations download started this may take some time.
[OK!]
legre_zero_shot download started this may take some time.
[OK!]


Py4JJavaError: ignored

In [None]:
tokenClassifier.getClasses()

['B-OBLIGATION_ACTION',
 'I-OBLIGATION_INDIRECT_OBJECT',
 'I-OBLIGATION',
 'B-OBLIGATION_INDIRECT_OBJECT',
 'PAD',
 'I-OBLIGATION_SUBJECT',
 'I-OBLIGATION_ACTION',
 'O',
 'B-OBLIGATION_SUBJECT',
 'B-OBLIGATION']
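
Because the placeholders in the relation templates have to match these entity labels, a quick sanity check before running the pipeline can catch typos. The following is a minimal sketch in plain Python: `unknown_placeholders` is a hypothetical helper, and the class list is copied from the output above.

```python
from string import Formatter

classes = [
    "B-OBLIGATION_ACTION", "I-OBLIGATION_INDIRECT_OBJECT", "I-OBLIGATION",
    "B-OBLIGATION_INDIRECT_OBJECT", "PAD", "I-OBLIGATION_SUBJECT",
    "I-OBLIGATION_ACTION", "O", "B-OBLIGATION_SUBJECT", "B-OBLIGATION",
]

# Strip the B-/I- prefixes; "O" and "PAD" carry no entity label
labels = {c.split("-", 1)[1] for c in classes if "-" in c}


def unknown_placeholders(categories, labels):
    """Return {relation: {placeholders that are not known entity labels}}."""
    problems = {}
    for relation, templates in categories.items():
        for template in templates:
            # Formatter.parse yields (literal, field_name, spec, conversion)
            fields = {field for _, field, _, _ in Formatter().parse(template) if field}
            bad = fields - labels
            if bad:
                problems.setdefault(relation, set()).update(bad)
    return problems


good = {"agree_to": ["{OBLIGATION_SUBJECT} agrees to {OBLIGATION}"]}
typo = {"agree_to": ["{OBLIGATION_SUBJ} agrees to {OBLIGATION}"]}

print(unknown_placeholders(good, labels))
print(unknown_placeholders(typo, labels))
```

In the notebook, `labels` could be built from `tokenClassifier.getClasses()` and the check run on the dictionary passed to `setRelationalCategories`.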

### Getting Result with Light Pipeline

In [None]:
sample_texts = [
    """NVIDIA agrees to provide an one-year supply of hardware components""",
    """The Supplier should provide the Buyer with all the necessary components""",
    """Fox grants to Licensee exclusive right and license""",
    """The parties have agreed on the conditions of this agreement""",
    """Provider commits to provide all required technical documentation which may be necessary."""
]

result = light_model.fullAnnotate(sample_texts)

rel_df = get_relations_df(result)

rel_df[rel_df["relation"] != "no_rel"]

| | relation | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | has_subject | WHEREAS_SUBJECT | 0 | 5 | NVIDIA | WHEREAS_ACTION | 7 | 23 | agrees to provide | 0.5906779 |
| 1 | has_subject | WHEREAS_SUBJECT | 0 | 10 | The parties | WHEREAS_ACTION | 12 | 25 | have agreed on | 0.78109324 |
| 2 | has_subject | WHEREAS_SUBJECT | 0 | 10 | The parties | WHEREAS_OBJECT | 27 | 58 | the conditions of this agreement | 0.64438367 |
| 3 | has_object | WHEREAS_ACTION | 12 | 25 | have agreed on | WHEREAS_OBJECT | 27 | 58 | the conditions of this agreement | 0.9708592 |

Note that the labels in this output (`WHEREAS_*`, `has_subject`, `has_object`) are those of the WHEREAS pipeline from the previous section, not the zero-shot categories defined above. This suggests that the zero-shot pipeline cell did not complete (see the `Py4JJavaError`) and that `light_model` still referred to the earlier model when this cell ran.


### Visualization of Extracted Relations

In [None]:
# from sparknlp_display import RelationExtractionVisualizer

re_vis = viz.RelationExtractionVisualizer()

for i in range(len(sample_texts)):

    re_vis.display(result = result[i],
               relation_col = "relations",
               document_col = "document",
               exclude_relations = ["no_rel"],
               show_relations=True
               )