![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/80.2.Legal_Contract_Understanding_NDA.ipynb)

# 🎬 Installation

In [None]:
! pip install -q johnsnowlabs

## 🔗 Automatic Installation


In [None]:
from johnsnowlabs import nlp, finance, legal

nlp.install(refresh_install=True, visual=True, force_browser = True)

## 🔗 Manual downloading
If you are not registered at my.johnsnowlabs.com, received your license by email, or are using Safari, you may need to upload the license manually.

- Go to my.johnsnowlabs.com
- Download your license
- Upload it using the following command

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

- Install it

In [None]:
nlp.install(force_browser=True)

# 📌 Starting

In [None]:
spark = nlp.start()

## 🔎 **Legal Contract Understanding NDA**


✍Explanation:

 A Non-Disclosure Agreement (NDA) is a legal contract designed to protect confidential information shared between parties involved in a business or professional relationship. The purpose of an NDA is to establish a legal obligation that restricts the receiving party from disclosing or using the confidential information for any purpose other than what is defined in the agreement.

 The `legclf_nda_agreements` model focuses specifically on Non-Disclosure Agreements. It is trained on a corpus of NDA documents to recognize the clauses, terms and language commonly used in these agreements.

- The `legclf_nda_agreements` model is a document classifier built on BERT sentence embeddings. It predicts whether a document belongs to the class `nda` or not (binary classification).




### Get sample text

In [None]:
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/legal-nlp/data/non_disclosure_agreement.txt

In [5]:
with open('non_disclosure_agreement.txt', 'r') as f:
  text = f.read()
  text = text[170:3384]
  print(text)

                        Non-Disclosure Agreement

This Agreement is made between CELERITY SYSTEMS, INC., ("CELERITY") and UNISYS
Corporation - Federal Systems Division. ("UNISYS") and concerns the handling,
safeguarding and exchange of sensitive, proprietary or confidential information
(collectively "Proprietary Information"). It is agreed and understood that any
exchange of such Proprietary Information is expressly for the sole purpose of
CELERITY and UNISYS evaluating a possible business arrangement involving an
existing UNISYS contract with a federal government agency.

1.   For the purposes hereof, Proprietary Information shall be any documents,
     materials or information that (i) is in tangible form and clearly marked as
     being proprietary, confidential or sensitive and (ii) is received by one
     party to this Agreement ("Recipient") from the other party to this
     Agreement ("Transmitter").

2.   The Recipient agrees to safeguard, protect and handle all Proprietary
   

In [None]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")
  
embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\
    .setInputCols("document")\
    .setOutputCol("sentence_embeddings")
    
doc_classifier = legal.ClassifierDLModel.pretrained("legclf_nda_agreements", "en", "legal/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("category")
    
nlpPipeline = nlp.Pipeline(stages=[
    document_assembler, 
    embeddings,
    doc_classifier])

df = spark.createDataFrame([[text]]).toDF("text")

model = nlpPipeline.fit(df)
result = model.transform(df)

In [8]:
result.select("category.result").show(truncate=False)

+------+
|result|
+------+
|[nda] |
+------+



## 🔎 **Multilabel Classification of NDA Clauses (paragraph, medium)**

✍Explanation:

This model is a version of `legmulticlf_mnda_sections_other` (sentence, medium) that expects a larger-than-sentence context, ideally 2 to 5 sentences or a small paragraph.

It should be run on the sentences of the NDA clauses and returns a series of 1..N labels for each of them. The clause types this model detects in NDA / MNDA agreements are:

- Parties to the Agreement - Names of the Parties Clause
- Identification of What Information Is Confidential - Definition of Confidential Information Clause
- Use of Confidential Information: Permitted Use Clause and Obligations of the Recipient
- Time Frame of the Agreement - Termination Clause
- Return of Confidential Information Clause
- Remedies for Breaches of Agreement - Remedies Clause
- Non-Solicitation Clause
- Dispute Resolution Clause
- Exceptions Clause
- Non-competition clause
- Other: None of the above (equivalent to `[]`)

📚 Predicted Entities
`APPLIC_LAW`, `ASSIGNMENT`, `DEF_OF_CONF_INFO`, `DISPUTE_RESOL`, `EXCEPTIONS`, `NAMES_OF_PARTIES`, `NON_COMP`, `NON_SOLIC`, `PREAMBLE`, `REMEDIES`, `REQ_DISCL`, `RETURN_OF_CONF_INFO`, `TERMINATION`, `USE_OF_CONF_INFO`, `OTHER`
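
To make the "1..N labels" idea concrete, here is a minimal sketch of how a multilabel classifier can turn per-label scores into a label set. The scores and the 0.5 threshold are illustrative assumptions, not output or settings of the model; an empty result corresponds to `[]`.

```python
# Hypothetical per-label scores for one clause (illustrative values only).
scores = {
    "NAMES_OF_PARTIES": 0.08,
    "USE_OF_CONF_INFO": 0.91,
    "RETURN_OF_CONF_INFO": 0.67,
    "EXCEPTIONS": 0.12,
}


def select_labels(label_scores, threshold=0.5):
    """Keep every label whose score reaches the (assumed) threshold."""
    return sorted(label for label, score in label_scores.items() if score >= threshold)


labels = select_labels(scores)
print(labels)
```

Unlike a multiclass classifier, which picks exactly one label, this selection can return several labels or none at all.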

In [None]:
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

text_splitter = legal.TextSplitter() \
    .setInputCols(["document"]) \
    .setOutputCol("sentences") \
    .setExplodeSentences(True)

embeddings = nlp.UniversalSentenceEncoder.pretrained()\
    .setInputCols("document")\
    .setOutputCol("sentence_embeddings")

paragraph_classifier = nlp.MultiClassifierDLModel.pretrained("legmulticlf_mnda_sections_paragraph_other", "en", "legal/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("class")


sentence_pipeline = nlp.Pipeline(stages=[document_assembler, 
                                         text_splitter, 
                                         embeddings, 
                                         paragraph_classifier])

prediction_pipeline = nlp.Pipeline(stages=[document_assembler, 
                                           embeddings, 
                                           paragraph_classifier])

df = spark.createDataFrame([[text]]).toDF("text")

sentence_model = sentence_pipeline.fit(df)
prediction_model = prediction_pipeline.fit(df)

sentence_lp = nlp.LightPipeline(sentence_model)
prediction_lp = nlp.LightPipeline(prediction_model)

In [10]:
import pandas as pd

# Collect (sentence, predicted classes) pairs in a list
class_texts = []

# Split the text into sentences
res = sentence_lp.fullAnnotate(text)
sentences = [x.result for x in res[0]['sentences']]

for i, s in enumerate(sentences):
    # Classify each sentence together with its previous and next sentence for context
    prev_sentence = "" if i == 0 else sentences[i-1]
    next_sentence = "" if i >= len(sentences)-1 else sentences[i+1]
    chunk = " ".join([prev_sentence, s, next_sentence]).strip()
    annotation = prediction_lp.annotate(chunk)
    predicted_class = annotation['class']
    class_texts.append((s, predicted_class))

# Build the DataFrame: one row per sentence, with its list of predicted classes
class_texts_df = pd.DataFrame(class_texts, columns=['text', 'class'])
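
The windowing in the loop above can be tried in isolation, without Spark. This is a minimal sketch with toy sentences; `context_windows` is a hypothetical helper name, not part of the library.

```python
def context_windows(sentences):
    """Pair each sentence with a chunk made of its previous, current and next sentence."""
    windows = []
    for i, sentence in enumerate(sentences):
        prev_sentence = sentences[i - 1] if i > 0 else ""
        next_sentence = sentences[i + 1] if i < len(sentences) - 1 else ""
        # Skip empty neighbours so the first and last chunks carry no stray spaces
        chunk = " ".join(part for part in (prev_sentence, sentence, next_sentence) if part)
        windows.append((sentence, chunk))
    return windows


toy_sentences = ["A.", "B.", "C."]
windows = context_windows(toy_sentences)
for sentence, chunk in windows:
    print(repr(sentence), "->", repr(chunk))
```

Each sentence keeps its own row, but the classifier sees up to three sentences, which matches the paragraph-sized context the model expects.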

In [12]:
class_texts_df["text"][0]

'Non-Disclosure Agreement\n\nThis Agreement is made between CELERITY SYSTEMS, INC., ("CELERITY") and UNISYS\nCorporation - Federal Systems Division.'

In [13]:
class_texts_df = class_texts_df.explode('class').reset_index(drop=True)

In [14]:
class_texts_df

Unnamed: 0,text,class
0,Non-Disclosure Agreement\n\nThis Agreement is ...,NAMES_OF_PARTIES
1,"(""UNISYS"") and concerns the handling,\nsafegua...",NAMES_OF_PARTIES
2,It is agreed and understood that any\nexchange...,DEF_OF_CONF_INFO
3,"1. For the purposes hereof, Proprietary Info...",DEF_OF_CONF_INFO
4,(i) is in tangible form and clearly marked as\...,
5,(ii) is received by one\n party to this Ag...,
6,"2. The Recipient agrees to safeguard, protec...",
7,Without the written consent of the Transmitter...,RETURN_OF_CONF_INFO
8,"(i) not to disclose,\n disseminate, reprod...",USE_OF_CONF_INFO
9,"(i) not to disclose,\n disseminate, reprod...",RETURN_OF_CONF_INFO


In [15]:
# Drop rows without a predicted class (empty prediction lists become NaN after explode)
class_texts_df = class_texts_df.dropna(subset=['class'])

# Print the filtered DataFrame
class_texts_df

Unnamed: 0,text,class
0,Non-Disclosure Agreement\n\nThis Agreement is ...,NAMES_OF_PARTIES
1,"(""UNISYS"") and concerns the handling,\nsafegua...",NAMES_OF_PARTIES
2,It is agreed and understood that any\nexchange...,DEF_OF_CONF_INFO
3,"1. For the purposes hereof, Proprietary Info...",DEF_OF_CONF_INFO
7,Without the written consent of the Transmitter...,RETURN_OF_CONF_INFO
8,"(i) not to disclose,\n disseminate, reprod...",USE_OF_CONF_INFO
9,"(i) not to disclose,\n disseminate, reprod...",RETURN_OF_CONF_INFO
18,(ii) is known to the Recipient prior to disclo...,EXCEPTIONS
19,(iii) is\n lawfully disclosed to Recipient...,RETURN_OF_CONF_INFO
21,This represents the entire agreement of the pa...,ASSIGNMENT
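
The explode and drop steps can be mimicked in plain Python to see what happens to multi-label and empty predictions. This is a sketch with made-up rows; `None` stands in for the `NaN` that pandas produces for an empty list.

```python
# Toy (text, labels) rows; the texts and labels are made up for illustration.
rows = [
    ("clause a", ["NAMES_OF_PARTIES"]),
    ("clause b", []),
    ("clause c", ["USE_OF_CONF_INFO", "RETURN_OF_CONF_INFO"]),
]


def explode(rows):
    """One output row per label; an empty label list yields a single row with None."""
    exploded = []
    for text, labels in rows:
        if not labels:
            exploded.append((text, None))
        for label in labels:
            exploded.append((text, label))
    return exploded


exploded = explode(rows)
# Equivalent of dropna(subset=['class']): keep only rows that have a label
kept = [(text, label) for text, label in exploded if label is not None]
print(exploded)
print(kept)
```

This is why rows 8 and 9 in the table above share the same text: that sentence received two labels and was duplicated, one row per label.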


## 📌 **`NAMES_OF_PARTIES` CLAUSE-SPECIFIC NER**

### Creating a dataframe with the texts of the `NAMES_OF_PARTIES` class

In [16]:
names_of_the_parties_df = class_texts_df[class_texts_df['class']=='NAMES_OF_PARTIES']
names_of_the_parties_df

Unnamed: 0,text,class
0,Non-Disclosure Agreement\n\nThis Agreement is ...,NAMES_OF_PARTIES
1,"(""UNISYS"") and concerns the handling,\nsafegua...",NAMES_OF_PARTIES
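
The same filter is repeated for each clause type in the rest of this notebook. As a sketch, the routing from clause label to clause-specific NER model can be written as a lookup table. The model names are the ones used in this notebook; `group_by_model` and the toy clauses are hypothetical.

```python
# Clause label -> clause-specific NER model name, as used in this notebook.
NER_MODEL_BY_CLASS = {
    "NAMES_OF_PARTIES": "legner_nda_names_of_parties",
    "DEF_OF_CONF_INFO": "legner_nda_def_conf_info",
    "RETURN_OF_CONF_INFO": "legner_nda_return_of_conf_info",
    "USE_OF_CONF_INFO": "legner_nda_confidential_information_restricted",
    "ASSIGNMENT": "legner_nda_assigment",
    "EXCEPTIONS": "legner_nda_exceptions",
}


def group_by_model(classified_clauses):
    """Group clause texts by the NER model that should process them.

    Labels without an entry in the table (for example OTHER) are skipped.
    """
    groups = {}
    for text, label in classified_clauses:
        model_name = NER_MODEL_BY_CLASS.get(label)
        if model_name is not None:
            groups.setdefault(model_name, []).append(text)
    return groups


# Toy input; the texts are placeholders, not real clause text.
groups = group_by_model([
    ("clause a", "NAMES_OF_PARTIES"),
    ("clause b", "EXCEPTIONS"),
    ("clause c", "NAMES_OF_PARTIES"),
    ("clause d", "OTHER"),
])
print(groups)
```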


In [17]:
names_of_the_parties_df["text"][0]

'Non-Disclosure Agreement\n\nThis Agreement is made between CELERITY SYSTEMS, INC., ("CELERITY") and UNISYS\nCorporation - Federal Systems Division.'

In [None]:
document_assembler = nlp.DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")
        
sentence_detector = legal.TextSplitter() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence") \
    .setExplodeSentences(True)

tokenizer = nlp.Tokenizer()\
        .setInputCols(["sentence"])\
        .setOutputCol("token")

embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
        .setInputCols(["sentence", "token"]) \
        .setOutputCol("embeddings")\
        .setMaxSentenceLength(512)\
        .setCaseSensitive(True)

# nda_names_of_parties
ner_model_parties = legal.NerModel.pretrained("legner_nda_names_of_parties", "en", "legal/models")\
        .setInputCols(["sentence", "token", "embeddings"])\
        .setOutputCol("ner_parties")

ner_converter_parties = nlp.NerConverter()\
        .setInputCols(["sentence", "token", "ner_parties"])\
        .setOutputCol("ner_chunk_parties")


nlpPipeline = nlp.Pipeline(stages=[
        document_assembler,
        sentence_detector,
        tokenizer,
        embeddings,
        ner_model_parties,
        ner_converter_parties, 
        ])

names_of_the_parties_sdf = spark.createDataFrame(names_of_the_parties_df)

model = nlpPipeline.fit(names_of_the_parties_sdf)
result = model.transform(names_of_the_parties_sdf)


In [19]:
from pyspark.sql import functions as F

result_df = result.select(F.explode('ner_chunk_parties').alias("cols")) \
                  .select(F.expr("cols.begin").alias("begin"),
                          F.expr("cols.end").alias("end"),
                          F.expr("cols.result").alias("chunk"),
                          F.expr("cols.metadata.entity").alias("entity"))

result_df.show(50, truncate=100)

+-----+---+---------------------+------+
|begin|end|                chunk|entity|
+-----+---+---------------------+------+
|   57| 77|CELERITY SYSTEMS, INC| PARTY|
|   83| 90|             CELERITY| ALIAS|
+-----+---+---------------------+------+
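
The `begin` and `end` columns can be checked against the input text. The sketch below assumes they are character offsets into the row's text with an inclusive `end`, which is consistent with the output above; `chunk_at` is a hypothetical helper.

```python
text = ('Non-Disclosure Agreement\n\nThis Agreement is made between '
        'CELERITY SYSTEMS, INC., ("CELERITY") and UNISYS\n'
        'Corporation - Federal Systems Division.')


def chunk_at(source, begin, end):
    """Return the span for inclusive character offsets [begin, end]."""
    return source[begin:end + 1]


party = chunk_at(text, 57, 77)
alias = chunk_at(text, 83, 90)
print(party)
print(alias)
```

Because `end` is inclusive, the Python slice needs `end + 1`.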



### 🧮 **NER Visualizer**
To save the visualization as HTML, pass the `save_path` parameter to the `display` function.

In [20]:
# Importing the necessary functions
from pyspark.sql.functions import col

result = model.transform(names_of_the_parties_sdf)

result.select(F.explode(F.arrays_zip('ner_chunk_parties.result', 'ner_chunk_parties.metadata')).alias("cols")) \
      .select(F.expr("cols['0']").alias("ner_chunk"),
              F.expr("cols['1']['entity']").alias("label")).show(300,truncate = False)

print("\n\n")

from sparknlp_display import NerVisualizer

visualiser = NerVisualizer()
# Collect once instead of re-running the Spark job on every iteration
for row in result.collect():
  visualiser.display(result = row, label_col = 'ner_chunk_parties', document_col = 'document')
  print("\n\n")


+---------------------+-----+
|ner_chunk            |label|
+---------------------+-----+
|CELERITY SYSTEMS, INC|PARTY|
|CELERITY             |ALIAS|
+---------------------+-----+
















## 📌 **`DEF_OF_CONF_INFO` CLAUSE-SPECIFIC NER**

In [21]:
def_of_conf_info_df = class_texts_df[class_texts_df['class']=='DEF_OF_CONF_INFO']
def_of_conf_info_df

Unnamed: 0,text,class
2,It is agreed and understood that any\nexchange...,DEF_OF_CONF_INFO
3,"1. For the purposes hereof, Proprietary Info...",DEF_OF_CONF_INFO


In [22]:
def_of_conf_info_df["text"][2]

'It is agreed and understood that any\nexchange of such Proprietary Information is expressly for the sole purpose of\nCELERITY and UNISYS evaluating a possible business arrangement involving an\nexisting UNISYS contract with a federal government agency.'

In [23]:
# nda_conf_info
ner_model_conf = legal.NerModel.pretrained("legner_nda_def_conf_info", "en", "legal/models")\
        .setInputCols(["sentence", "token", "embeddings"])\
        .setOutputCol("ner_conf")

ner_converter_conf = nlp.NerConverter()\
        .setInputCols(["sentence", "token", "ner_conf"])\
        .setOutputCol("ner_chunk_conf")

nlpPipeline = nlp.Pipeline(stages=[
        document_assembler,
        sentence_detector,
        tokenizer,
        embeddings,
        ner_model_conf,
        ner_converter_conf, 
        ])

def_of_conf_info_sdf = spark.createDataFrame(def_of_conf_info_df)

model = nlpPipeline.fit(def_of_conf_info_sdf)
result = model.transform(def_of_conf_info_sdf)


from pyspark.sql import functions as F

result_df = result.select(F.explode('ner_chunk_conf').alias("cols")) \
                  .select(F.expr("cols.begin").alias("begin"),
                          F.expr("cols.end").alias("end"),
                          F.expr("cols.result").alias("chunk"),
                          F.expr("cols.metadata.entity").alias("entity"))

result_df.show(50, truncate=100)


legner_nda_def_conf_info download started this may take some time.
[OK!]
+-----+---+-----+------+
|begin|end|chunk|entity|
+-----+---+-----+------+
+-----+---+-----+------+



### 🧮 **NER Visualizer**
To save the visualization as HTML, pass the `save_path` parameter to the `display` function.

In [24]:
# Importing the necessary functions
from pyspark.sql.functions import col

result = model.transform(def_of_conf_info_sdf)

result.select(F.explode(F.arrays_zip('ner_chunk_conf.result', 'ner_chunk_conf.metadata')).alias("cols")) \
      .select(F.expr("cols['0']").alias("ner_chunk"),
              F.expr("cols['1']['entity']").alias("label")).show(300,truncate = False)

print("\n\n")

from sparknlp_display import NerVisualizer

visualiser = NerVisualizer()
# Collect once instead of re-running the Spark job on every iteration
for row in result.collect():
  visualiser.display(result = row, label_col = 'ner_chunk_conf', document_col = 'document')
  print("\n\n")


+---------+-----+
|ner_chunk|label|
+---------+-----+
+---------+-----+
















## 📌 **`RETURN_OF_CONF_INFO` CLAUSE-SPECIFIC NER**

In [25]:
return_of_conf_info_df = class_texts_df[class_texts_df['class']=='RETURN_OF_CONF_INFO']
return_of_conf_info_df

Unnamed: 0,text,class
7,Without the written consent of the Transmitter...,RETURN_OF_CONF_INFO
9,"(i) not to disclose,\n disseminate, reprod...",RETURN_OF_CONF_INFO
19,(iii) is\n lawfully disclosed to Recipient...,RETURN_OF_CONF_INFO


In [None]:
# nda_return_of_conf_info
ner_model_return = legal.NerModel.pretrained("legner_nda_return_of_conf_info", "en", "legal/models")\
        .setInputCols(["sentence", "token", "embeddings"])\
        .setOutputCol("ner_return")

ner_converter_return = nlp.NerConverter()\
        .setInputCols(["sentence", "token", "ner_return"])\
        .setOutputCol("ner_chunk_return")

nlpPipeline = nlp.Pipeline(stages=[
        document_assembler,
        sentence_detector,
        tokenizer,
        embeddings,
        ner_model_return,
        ner_converter_return, 
        ])

return_of_conf_info_sdf = spark.createDataFrame(return_of_conf_info_df)

model = nlpPipeline.fit(return_of_conf_info_sdf)
result = model.transform(return_of_conf_info_sdf)


### 🧮 **NER Visualizer**
To save the visualization as HTML, pass the `save_path` parameter to the `display` function.

In [27]:
# Importing the necessary functions
from pyspark.sql.functions import col

result = model.transform(return_of_conf_info_sdf)

result.select(F.explode(F.arrays_zip('ner_chunk_return.result', 'ner_chunk_return.metadata')).alias("cols")) \
      .select(F.expr("cols['0']").alias("ner_chunk"),
              F.expr("cols['1']['entity']").alias("label")).show(300,truncate = False)

print("\n\n")

from sparknlp_display import NerVisualizer

visualiser = NerVisualizer()
# Collect once instead of re-running the Spark job on every iteration
for row in result.collect():
  visualiser.display(result = row, label_col = 'ner_chunk_return', document_col = 'document')
  print("\n\n--------------")

+------------+-------------+
|ner_chunk   |label        |
+------------+-------------+
|governmental|LEGAL_PURPOSE|
|lawfully    |LEGAL_PURPOSE|
+------------+-------------+








--------------




--------------




--------------


## 📌 **`USE_OF_CONF_INFO` CLAUSE-SPECIFIC NER**

In [28]:
use_of_conf_info_df = class_texts_df[class_texts_df['class']=='USE_OF_CONF_INFO']
use_of_conf_info_df

Unnamed: 0,text,class
8,"(i) not to disclose,\n disseminate, reprod...",USE_OF_CONF_INFO


In [None]:
# nda_use_of_conf_info
ner_model_use = legal.NerModel.pretrained("legner_nda_confidential_information_restricted", "en", "legal/models")\
        .setInputCols(["sentence", "token", "embeddings"])\
        .setOutputCol("ner_use")

ner_converter_use = nlp.NerConverter()\
        .setInputCols(["sentence", "token", "ner_use"])\
        .setOutputCol("ner_chunk_use")

nlpPipeline = nlp.Pipeline(stages=[
        document_assembler,
        sentence_detector,
        tokenizer,
        embeddings,
        ner_model_use,
        ner_converter_use, 
        ])

use_of_conf_info_sdf = spark.createDataFrame(use_of_conf_info_df)

model = nlpPipeline.fit(use_of_conf_info_sdf)
result = model.transform(use_of_conf_info_sdf)



### 🧮 **NER Visualizer**
To save the visualization as HTML, pass the `save_path` parameter to the `display` function.

In [30]:
# Importing the necessary functions
from pyspark.sql.functions import col

result = model.transform(use_of_conf_info_sdf)

result.select(F.explode(F.arrays_zip('ner_chunk_use.result', 'ner_chunk_use.metadata')).alias("cols")) \
      .select(F.expr("cols['0']").alias("ner_chunk"),
              F.expr("cols['1']['entity']").alias("label")).show(300,truncate = False)

print("\n\n")

from sparknlp_display import NerVisualizer

visualiser = NerVisualizer()
# Collect once instead of re-running the Spark job on every iteration
for row in result.collect():
  visualiser.display(result = row, label_col = 'ner_chunk_use', document_col = 'document')
  print("\n\n")

+-----------+---------------------+
|ner_chunk  |label                |
+-----------+---------------------+
|disclose   |RESTRICTED_ACTION    |
|disseminate|RESTRICTED_ACTION    |
|reproduce  |RESTRICTED_ACTION    |
|distribute |RESTRICTED_ACTION    |
|information|RESTRICTED_OBJECT    |
|third party|RESTRICTED_IND_OBJECT|
+-----------+---------------------+











## 📌 **`ASSIGNMENT` CLAUSE-SPECIFIC NER**

In [32]:
assign_df = class_texts_df[class_texts_df['class']=='ASSIGNMENT']
assign_df

Unnamed: 0,text,class
21,This represents the entire agreement of the pa...,ASSIGNMENT


In [None]:
# nda_assigment
ner_model_assigment = legal.NerModel.pretrained("legner_nda_assigment", "en", "legal/models")\
        .setInputCols(["sentence", "token", "embeddings"])\
        .setOutputCol("ner_assigment")

ner_converter_assigment = nlp.NerConverter()\
        .setInputCols(["sentence", "token", "ner_assigment"])\
        .setOutputCol("ner_chunk_assigment")

nlpPipeline = nlp.Pipeline(stages=[
        document_assembler,
        sentence_detector,
        tokenizer,
        embeddings,
        ner_model_assigment,
        ner_converter_assigment, 
        ])

assign_sdf = spark.createDataFrame(assign_df)

model = nlpPipeline.fit(assign_sdf)
result = model.transform(assign_sdf)

### 🧮 **NER Visualizer**
To save the visualization as HTML, pass the `save_path` parameter to the `display` function.

In [34]:
# Importing the necessary functions
from pyspark.sql.functions import col

result = model.transform(assign_sdf)

result.select(F.explode(F.arrays_zip('ner_chunk_assigment.result', 'ner_chunk_assigment.metadata')).alias("cols")) \
      .select(F.expr("cols['0']").alias("ner_chunk"),
              F.expr("cols['1']['entity']").alias("label")).show(300,truncate = False)

print("\n\n")

from sparknlp_display import NerVisualizer

visualiser = NerVisualizer()
# Collect once instead of re-running the Spark job on every iteration
for row in result.collect():
  visualiser.display(result = row, label_col = 'ner_chunk_assigment', document_col = 'document')
  print("\n\n")

+---------+-----+
|ner_chunk|label|
+---------+-----+
+---------+-----+











## 📌 **`EXCEPTIONS` CLAUSE-SPECIFIC NER**

In [35]:
exceptions_df = class_texts_df[class_texts_df['class']=='EXCEPTIONS']
exceptions_df

Unnamed: 0,text,class
18,(ii) is known to the Recipient prior to disclo...,EXCEPTIONS


In [None]:
# nda_exceptions
ner_model_exceptions = legal.NerModel.pretrained("legner_nda_exceptions", "en", "legal/models")\
        .setInputCols(["sentence", "token", "embeddings"])\
        .setOutputCol("ner_exceptions")

ner_converter_exceptions = nlp.NerConverter()\
        .setInputCols(["sentence", "token", "ner_exceptions"])\
        .setOutputCol("ner_chunk_exceptions")

nlpPipeline = nlp.Pipeline(stages=[
        document_assembler,
        sentence_detector,
        tokenizer,
        embeddings,
        ner_model_exceptions,
        ner_converter_exceptions, 
        ])

exceptions_sdf = spark.createDataFrame(exceptions_df)

model = nlpPipeline.fit(exceptions_sdf)
result = model.transform(exceptions_sdf)

### 🧮 **NER Visualizer**
To save the visualization as HTML, pass the `save_path` parameter to the `display` function.

In [37]:
# Importing the necessary functions
from pyspark.sql.functions import col

result = model.transform(exceptions_sdf)

result.select(F.explode(F.arrays_zip('ner_chunk_exceptions.result', 'ner_chunk_exceptions.metadata')).alias("cols")) \
      .select(F.expr("cols['0']").alias("ner_chunk"),
              F.expr("cols['1']['entity']").alias("label")).show(300,truncate = False)

print("\n\n")

from sparknlp_display import NerVisualizer

visualiser = NerVisualizer()
# Collect once instead of re-running the Spark job on every iteration
for row in result.collect():
  visualiser.display(result = row, label_col = 'ner_chunk_exceptions', document_col = 'document')
  print("\n\n")

+---------+----------------+
|ner_chunk|label           |
+---------+----------------+
|known    |EXCLUDED_INFO   |
|prior to |EXCLUSION_GROUND|
+---------+----------------+











## 📌 **Legal Pretrained Pipeline**

In [None]:
legal_pipeline = nlp.PretrainedPipeline("legpipe_ner_contract_doc_parties_alias_former", "en", "legal/models")

result = legal_pipeline.annotate(text)

In [39]:
result = legal_pipeline.fullAnnotate(text)[0]

### 🧮 **NER Visualizer**
To save the visualization as HTML, pass the `save_path` parameter to the `display` function.

In [40]:
from johnsnowlabs import viz
ner_viz = viz.NerVisualizer()
ner_viz.display(result, label_col='ner_chunk')

## 📌 **QUESTION ANSWERING**

In [6]:
with open('non_disclosure_agreement.txt', 'r') as f:
  text = f.read()
  text = text[220:3384]
  print(text)

This Agreement is made between CELERITY SYSTEMS, INC., ("CELERITY") and UNISYS
Corporation - Federal Systems Division. ("UNISYS") and concerns the handling,
safeguarding and exchange of sensitive, proprietary or confidential information
(collectively "Proprietary Information"). It is agreed and understood that any
exchange of such Proprietary Information is expressly for the sole purpose of
CELERITY and UNISYS evaluating a possible business arrangement involving an
existing UNISYS contract with a federal government agency.

1.   For the purposes hereof, Proprietary Information shall be any documents,
     materials or information that (i) is in tangible form and clearly marked as
     being proprietary, confidential or sensitive and (ii) is received by one
     party to this Agreement ("Recipient") from the other party to this
     Agreement ("Transmitter").

2.   The Recipient agrees to safeguard, protect and handle all Proprietary
     Information made available under this Agreement 

In [25]:
# Split the text into paragraphs
paragraphs = text.split('\n\n')

# Remove empty paragraphs
paragraphs = [p for p in paragraphs if p.strip() != ""]

# Trim leading/trailing whitespaces from each paragraph
paragraphs = [p.strip() for p in paragraphs]

# Print the resulting list of paragraphs
print(paragraphs)


['This Agreement is made between CELERITY SYSTEMS, INC., ("CELERITY") and UNISYS\nCorporation - Federal Systems Division. ("UNISYS") and concerns the handling,\nsafeguarding and exchange of sensitive, proprietary or confidential information\n(collectively "Proprietary Information"). It is agreed and understood that any\nexchange of such Proprietary Information is expressly for the sole purpose of\nCELERITY and UNISYS evaluating a possible business arrangement involving an\nexisting UNISYS contract with a federal government agency.', '1.   For the purposes hereof, Proprietary Information shall be any documents,\n     materials or information that (i) is in tangible form and clearly marked as\n     being proprietary, confidential or sensitive and (ii) is received by one\n     party to this Agreement ("Recipient") from the other party to this\n     Agreement ("Transmitter").', "2.   The Recipient agrees to safeguard, protect and handle all Proprietary\n     Information made available unde
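
Splitting on `'\n\n'` works for this file. For documents where the separating line contains spaces or tabs, a regex-based variant may be more robust. This is a sketch of an alternative, not what the notebook runs; `split_paragraphs` is a hypothetical helper.

```python
import re


def split_paragraphs(raw_text):
    """Split on blank lines (including whitespace-only lines) and drop empty pieces."""
    pieces = re.split(r"\n\s*\n", raw_text)
    return [piece.strip() for piece in pieces if piece.strip()]


sample = "First paragraph,\nstill first.\n \nSecond paragraph.\n\n\nThird."
paragraphs_demo = split_paragraphs(sample)
print(paragraphs_demo)
```

Single line breaks inside a paragraph are kept, so wrapped lines stay together.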

In [None]:
documentAssembler = nlp.MultiDocumentAssembler()\
        .setInputCols(["question", "context"])\
        .setOutputCols(["document_question", "document_context"])

spanClassifier = nlp.BertForQuestionAnswering.pretrained("legqa_bert_large","en", "legal/models") \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(True)

pipeline = nlp.Pipeline().setStages([
    documentAssembler,
    spanClassifier
])

In [27]:
new_text = [paragraphs[0]]

questions = {0: ["Between whom was this agreement made?", 
                 "What is the scope of this Agreement?",
                 "What does the Agreement concern?",
                 "What is the purpose of exchanging the Proprietary Information between CELERITY and UNISYS?",
                 "What is the understanding between CELERITY and UNISYS regarding the exchange of Proprietary Information?"]}

# Create a list of dictionaries for each row in the DataFrame
data = [{"context": text, "question": question} for text, question_list in zip(new_text, questions.values()) for question in question_list]

# Create a DataFrame from the list of dictionaries
df = spark.createDataFrame(data)

# Display the resulting DataFrame
df.show(truncate=90)


+------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------+
|                                                                                   context|                                                                                  question|
+------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------+
|This Agreement is made between CELERITY SYSTEMS, INC., ("CELERITY") and UNISYS
Corporat...|                                                     Between whom was this agreement made?|
|This Agreement is made between CELERITY SYSTEMS, INC., ("CELERITY") and UNISYS
Corporat...|                                                      What is the scope of this Agreement?|
|This Agreement is made between CELERITY SYSTEMS, INC., ("CELERITY") and UNISYS
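
The question/context rows are one dictionary per question, each carrying the paragraph it refers to. As a sketch, the same rows can be built by looking up paragraphs through the dictionary keys, which helps once questions exist for more than one paragraph. `build_qa_rows` and the toy data are hypothetical.

```python
def build_qa_rows(paragraphs, questions_by_index):
    """Create one {"context", "question"} row per question, looked up by paragraph index."""
    rows = []
    for index, question_list in questions_by_index.items():
        for question in question_list:
            rows.append({"context": paragraphs[index], "question": question})
    return rows


toy_paragraphs = ["Paragraph zero.", "Paragraph one."]
toy_questions = {1: ["Who?", "What?"], 0: ["Why?"]}
qa_rows = build_qa_rows(toy_paragraphs, toy_questions)
print(qa_rows)
```

A list of dictionaries like this can be passed to `spark.createDataFrame`, as the cell above does.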


In [28]:
result = pipeline.fit(df).transform(df)

result.select('answer.result').show(truncate=False)

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                   |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[between CELERITY SYSTEMS , INC . , ( " CELERITY " ) and UNISYS Corporation - Federal Systems Division .]                                                                                                                |
|[concerns the handling , safeguarding and exchange of sensitive , proprietary or confidential information]             
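
The answers come back with a space around every token, so punctuation is detached from the words. A small post-processing step can tidy this for display. This is a heuristic sketch, not part of the library; `tidy_answer` is a hypothetical helper and it does not handle quotation marks.

```python
import re


def tidy_answer(answer):
    """Remove spaces before closing punctuation and after opening brackets."""
    answer = re.sub(r"\s+([,.;:!?)])", r"\1", answer)  # "handling ," -> "handling,"
    answer = re.sub(r"([(])\s+", r"\1", answer)        # "( x" -> "(x"
    return answer


raw = ("concerns the handling , safeguarding and exchange of sensitive , "
       "proprietary or confidential information")
tidied = tidy_answer(raw)
print(tidied)
```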

## 📌 **SUMMARIZATION**

In [None]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("documents")

flant5 = legal.Summarizer.pretrained('legsum_flant5_legal_augmented','en','legal/models')\
    .setInputCols(["documents"])\
    .setOutputCol("summary")\
    .setMaxTextLength(512)\
    .setMaxNewTokens(512)

pipeline = nlp.Pipeline(stages=[document_assembler, 
                                flant5])

# Placeholder DataFrame: every stage is pretrained, so fit() only needs the schema
data = spark.createDataFrame([[1, " "]]).toDF('id', 'text')

In [12]:
model = pipeline.fit(data)

In [13]:
light_model = nlp.LightPipeline(model)
light_result = light_model.annotate(paragraphs)

In [14]:
light_result

[{'documents': ['This Agreement is made between CELERITY SYSTEMS, INC., ("CELERITY") and UNISYS\nCorporation - Federal Systems Division. ("UNISYS") and concerns the handling,\nsafeguarding and exchange of sensitive, proprietary or confidential information\n(collectively "Proprietary Information"). It is agreed and understood that any\nexchange of such Proprietary Information is expressly for the sole purpose of\nCELERITY and UNISYS evaluating a possible business arrangement involving an\nexisting UNISYS contract with a federal government agency.'],
  'summary': ['This agreement is between CELERITY Systems, Inc. and UNISYS Corporation - Federal Systems Division. It outlines the handling, safeguarding, and exchange of sensitive, proprietary, or confidential information. It is intended to evaluate a possible business arrangement involving an existing UNISYS contract with a federal government agency.']},
 {'documents': ['1.   For the purposes hereof, Proprietary Information shall be any do

In [15]:
combined_text = '\n'.join([d['summary'][0] for d in light_result])
print(combined_text)

This agreement is between CELERITY Systems, Inc. and UNISYS Corporation - Federal Systems Division. It outlines the handling, safeguarding, and exchange of sensitive, proprietary, or confidential information. It is intended to evaluate a possible business arrangement involving an existing UNISYS contract with a federal government agency.
This legal agreement states that any documents, materials, or information that are in tangible form and clearly marked as being proprietary, confidential, or sensitive must be received from the other party to the agreement.
This legal agreement states that the recipient agrees to safeguard, protect, and handle all proprietary information made available under the agreement with the same degree of care and procedures as the recipient applies to its own confidential or proprietary information. The recipient also agrees not to disclose, disseminate, reproduce, or otherwise distribute such information and materials to any third party, except a governmental 

In [22]:
light_model = nlp.LightPipeline(model)
light_result = light_model.annotate(combined_text)

In [24]:
light_result['summary']

["This legal agreement is between CELERITY Systems, Inc. and UNISYS Corporation - Federal Systems Division. It states that any documents, materials, or information that are in tangible form and clearly marked as being proprietary, confidential, or sensitive must be received from the other party to the agreement. The recipient agrees to safeguard, protect, and handle all proprietary information made available under the agreement with the same degree of care and procedures as the recipient applies to its own confidential or proprietary information. The recipient also agrees not to disclose, disseminate, reproduce, or otherwise distribute such information and materials to any third party, except a governmental judicial authority's issued order, and not to use such information for any other purposes beyond that expressly set forth in this agreement. The recipient may terminate the agreement at any time by providing written notice to the receiving party. The agreement is the entire agreemen
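
The flow above summarizes each paragraph, joins the partial summaries, and summarizes the joined text once more. The sketch below shows that two-pass structure with a stand-in summarizer so it runs without Spark. `summarize_in_two_passes` and `first_sentence` are hypothetical and do not imitate the model's output quality.

```python
import re


def summarize_in_two_passes(paragraphs, summarize):
    """Summarize each paragraph, join the partial summaries, then summarize the result."""
    partial_summaries = [summarize(paragraph) for paragraph in paragraphs]
    combined = "\n".join(partial_summaries)
    return combined, summarize(combined)


def first_sentence(text):
    """Stand-in summarizer (hypothetical): keep only the first sentence."""
    return re.split(r"(?<=\.)\s+", text.strip(), maxsplit=1)[0]


toy_paragraphs = [
    "Party A and Party B agree. They sign today.",
    "Information stays secret. Exceptions apply.",
]
combined, final = summarize_in_two_passes(toy_paragraphs, first_sentence)
print(combined)
print(final)
```

Working paragraph by paragraph keeps each input within the summarizer's length limit (512 in the pipeline above), at the cost of a second pass over the joined summaries.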