![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/80.1.Legal_Contract_Understanding_NDA.ipynb)

# 🎬 Installation

In [1]:
! pip install -q johnsnowlabs

## 🔗 Automatic Installation


In [None]:
from johnsnowlabs import nlp, legal

# nlp.install(force_browser = True)

## 🔗 Manual downloading
If you are not registered at my.johnsnowlabs.com, you received your license by email, or you are using Safari, you may need to install the license manually.

- Go to my.johnsnowlabs.com
- Download your license
- Upload it using the following command

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

- Install it

In [None]:
nlp.install()

# 📌 Starting

In [None]:
spark = nlp.start()

# 🔎 **Non-Disclosure Agreement Contract Understanding**


✍Explanation:

 A Non-Disclosure Agreement (NDA) is a legal contract designed to protect confidential information shared between parties involved in a business or professional relationship. The purpose of an NDA is to establish a legal obligation that restricts the receiving party from disclosing or using the confidential information for any purpose other than what is defined in the agreement.

 The `legclf_nda_agreements` model focuses specifically on analyzing NDAs. It is trained on a corpus of NDA documents to capture the clauses, terms, and language commonly used in these agreements.

- The `legclf_nda_agreements` model is a document classifier built on BERT sentence embeddings. It performs binary classification: the document either belongs to the class `nda` or it does not.




## Get sample text

In [None]:
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/legal-nlp/data/nda_2.txt

In [5]:
# Read the sample NDA (the file contains curly quotes, so read it as UTF-8)
with open('nda_2.txt', 'r', encoding='utf-8') as f:
  text = f.read()
print(text)

This Non-Disclosure Agreement (“Agreement”) is made and entered into on this 2nd day of January, 2023, by and between PlusUltra Inc., a Delaware corporation having its principal offices at 111 Way White, Austin, NH 229484 (“PlusUltra”) and John Snow Labs Inc., a Delaware corporation, located at , , ,  (the “Company”).

For the purpose of evaluating a possible relationship between the parties (the "Purpose"), the parties may disclose to each other in oral and written form or in other medium, certain Confidential Information, as defined below, relating to their respective pipelines, products and businesses.

As a precondition to any such disclosure, each party hereby agrees to keep all such Confidential Information of the other party in strict confidence under the terms set out in this Agreement. 
For purposes of this Agreement, a party receiving Confidential Information hereunder shall be referred to as the “Receiving Party” and a party disclosing Confidential Information hereunder shal

## Document classification

In [9]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")
  
embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")\
    .setInputCols("document")\
    .setOutputCol("sentence_embeddings")
    
doc_classifier = legal.ClassifierDLModel.pretrained("legclf_nda_agreements", "en", "legal/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("category")
    
nlpPipeline = nlp.Pipeline(stages=[
    document_assembler, 
    embeddings,
    doc_classifier])

df = spark.createDataFrame([[text]]).toDF("text")

model = nlpPipeline.fit(df)
result = model.transform(df)

sent_bert_base_cased download started this may take some time.
Approximate size to download 389.1 MB
[OK!]
legclf_nda_agreements download started this may take some time.
[OK!]


In [10]:
result.select("category.result").show(truncate=False)

+------+
|result|
+------+
|[nda] |
+------+



# 🔎 **Multilabel Classification of NDA Clauses (paragraph, medium)**

✍Explanation:

This model is a version of `legmulticlf_mnda_sections_other` (sentence, medium) that expects a bigger-than-sentence context, ideally 2 to 5 sentences or a small paragraph, so that it has more context to work with.

It should be run on paragraphs (or small groups of sentences) of the NDA clauses and returns labels for each of them. The possible clause types detected by this model in NDA / MNDA agreements are:

- Parties to the Agreement - Names of the Parties Clause
- Identification of What Information Is Confidential - Definition of Confidential Information Clause
- Use of Confidential Information: Permitted Use Clause and Obligations of the Recipient
- Time Frame of the Agreement - Termination Clause
- Return of Confidential Information Clause
- Remedies for Breaches of Agreement - Remedies Clause
- Non-Solicitation Clause
- Dispute Resolution Clause
- Exceptions Clause
- Non-Competition Clause
- Other: None of the above (equivalent to an empty prediction, `[]`)

📚 Predicted Entities
`APPLIC_LAW`, `ASSIGNMENT`, `DEF_OF_CONF_INFO`, `DISPUTE_RESOL`, `EXCEPTIONS`, `NAMES_OF_PARTIES`, `NON_COMP`, `NON_SOLIC`, `PREAMBLE`, `REMEDIES`, `REQ_DISCL`, `RETURN_OF_CONF_INFO`, `TERMINATION`, `USE_OF_CONF_INFO`, `OTHER`
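
Because the classifier is multilabel, a paragraph can come back with several labels or with none, and an empty prediction is treated as `OTHER`. A minimal plain-Python sketch of that convention follows; the helper name `explode_labels` is hypothetical and not part of the library.

```python
def explode_labels(rows, default="OTHER"):
    """Turn (text, [labels]) pairs into one (text, label) pair per label.

    An empty label list becomes a single row tagged with `default`.
    """
    exploded = []
    for text, labels in rows:
        for label in labels or [default]:
            exploded.append((text, label))
    return exploded


# Illustrative predictions: one paragraph with two labels, one with none
predictions = [
    ("For the purpose of evaluating ...", ["DEF_OF_CONF_INFO", "PREAMBLE"]),
    ("For purposes of this Agreement, 'Affiliate' ...", []),
]
for text, label in explode_labels(predictions):
    print(label, "|", text)
```

This mirrors what the pandas `explode` plus `fillna("OTHER")` step does later in the notebook.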

In [None]:
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

text_splitter = legal.TextSplitter() \
    .setInputCols(["document"]) \
    .setOutputCol("sentences") \
    .setExplodeSentences(True)

embeddings = nlp.UniversalSentenceEncoder.pretrained()\
    .setInputCols("document")\
    .setOutputCol("sentence_embeddings")

paragraph_classifier = nlp.MultiClassifierDLModel.pretrained("legmulticlf_mnda_sections_paragraph_other", "en", "legal/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("class")


sentence_pipeline = nlp.Pipeline(stages=[document_assembler, 
                                         text_splitter, 
                                         embeddings, 
                                         paragraph_classifier])

prediction_pipeline = nlp.Pipeline(stages=[document_assembler, 
                                           embeddings, 
                                           paragraph_classifier])

df = spark.createDataFrame([[text]]).toDF("text")

sentence_model = sentence_pipeline.fit(df)
prediction_model = prediction_pipeline.fit(df)

sentence_lp = nlp.LightPipeline(sentence_model)
prediction_lp = nlp.LightPipeline(prediction_model)

## To classify sections, we split the text by paragraph (`\n\n`).
Another option is to split the text into sentences and accumulate them into groups of a fixed size (ideally 3).
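
The sentence-accumulation alternative can be sketched in plain Python. This is only an illustration: the regex splitter below is a naive stand-in for a real sentence splitter such as the `legal.TextSplitter` stage defined above, and it will mis-split abbreviations that are followed by a space. The names `split_sentences` and `accumulate` are hypothetical.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def accumulate(sentences, size=3):
    """Join consecutive sentences into chunks of at most `size` sentences."""
    return [" ".join(sentences[i:i + size]) for i in range(0, len(sentences), size)]


sample = "One. Two. Three. Four. Five."
chunks = accumulate(split_sentences(sample), size=3)
print(chunks)
```

Each resulting chunk could then be passed to `prediction_lp.annotate` in the same way as a paragraph.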

In [12]:
import pandas as pd

# Collect (paragraph, predicted labels) pairs
class_texts = []

paragraphs = text.split("\n\n")

for paragraph in paragraphs:
    annotation = prediction_lp.annotate(paragraph)
    predicted_class = annotation['class']
    class_texts.append((paragraph, predicted_class))

# Build a DataFrame with one row per paragraph
class_texts_df = pd.DataFrame(class_texts, columns=['text', 'class'])

In [13]:
class_texts_df["text"][0]

'This Non-Disclosure Agreement (“Agreement”) is made and entered into on this 2nd day of January, 2023, by and between PlusUltra Inc., a Delaware corporation having its principal offices at 111 Way White, Austin, NH 229484 (“PlusUltra”) and John Snow Labs Inc., a Delaware corporation, located at , , ,  (the “Company”).'

In [14]:
class_texts_df = class_texts_df.explode('class').reset_index(drop=True).fillna("OTHER")

In [15]:
class_texts_df

|  | text | class |
|---|---|---|
| 0 | This Non-Disclosure Agreement (“Agreement”) is... | NAMES_OF_PARTIES |
| 1 | For the purpose of evaluating a possible relat... | DEF_OF_CONF_INFO |
| 2 | For the purpose of evaluating a possible relat... | PREAMBLE |
| 3 | As a precondition to any such disclosure, each... | PREAMBLE |
| 4 | Definition of Confidential Information. | DEF_OF_CONF_INFO |
| 5 | For the purpose of this Agreement, “Confidenti... | DEF_OF_CONF_INFO |
| 6 | Any analyses, compilations, comparisons studie... | DEF_OF_CONF_INFO |
| 7 | For purposes of this Agreement, “Affiliate” of... | OTHER |
| 8 | Exclusions from Definition.\nThe term Confiden... | EXCEPTIONS |
| 9 | Restrictions on Disclosure of Confidential Inf... | PREAMBLE |


In [16]:
# Filter out rows labeled OTHER (paragraphs with no predicted class)
class_texts_df = class_texts_df[class_texts_df['class']!='OTHER']

# Display the filtered DataFrame
class_texts_df

|  | text | class |
|---|---|---|
| 0 | This Non-Disclosure Agreement (“Agreement”) is... | NAMES_OF_PARTIES |
| 1 | For the purpose of evaluating a possible relat... | DEF_OF_CONF_INFO |
| 2 | For the purpose of evaluating a possible relat... | PREAMBLE |
| 3 | As a precondition to any such disclosure, each... | PREAMBLE |
| 4 | Definition of Confidential Information. | DEF_OF_CONF_INFO |
| 5 | For the purpose of this Agreement, “Confidenti... | DEF_OF_CONF_INFO |
| 6 | Any analyses, compilations, comparisons studie... | DEF_OF_CONF_INFO |
| 8 | Exclusions from Definition.\nThe term Confiden... | EXCEPTIONS |
| 9 | Restrictions on Disclosure of Confidential Inf... | PREAMBLE |
| 10 | The Receiving Party shall ensure that its Repr... | USE_OF_CONF_INFO |
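
The sections that follow pair each predicted clause label with a clause-specific pretrained NER resource. The pairing used in this notebook can be summarized as a lookup table; the model names are taken from the cells below, while the dictionary `CLAUSE_NER_MODELS` and the helper `ner_model_for` are hypothetical names introduced for illustration.

```python
# Clause label -> pretrained NER resource used for it later in this notebook
CLAUSE_NER_MODELS = {
    "NAMES_OF_PARTIES": "legpipe_ner_contract_doc_parties_alias_former",  # pretrained pipeline
    "DEF_OF_CONF_INFO": "legner_nda_def_conf_info",
    "RETURN_OF_CONF_INFO": "legner_nda_return_of_conf_info",
    "USE_OF_CONF_INFO": "legner_nda_confidential_information_restricted",
    "EXCEPTIONS": "legner_nda_exceptions",
}


def ner_model_for(label):
    """Return the model name for a clause label, or None if none is used here."""
    return CLAUSE_NER_MODELS.get(label)


for label in ["DEF_OF_CONF_INFO", "PREAMBLE"]:
    print(label, "->", ner_model_for(label))
```

Labels without an entry, such as `PREAMBLE`, are simply not processed by a clause-specific model in this notebook.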


## 📌 **`NAMES_OF_PARTIES` CLAUSE-SPECIFIC NER**

### Creating a dataframe with the texts of the `NAMES_OF_PARTIES` class

In [17]:
names_of_the_parties_df = class_texts_df[class_texts_df['class']=='NAMES_OF_PARTIES']
names_of_the_parties_df

|  | text | class |
|---|---|---|
| 0 | This Non-Disclosure Agreement (“Agreement”) is... | NAMES_OF_PARTIES |


In [18]:
names_of_the_parties_df["text"][0]

'This Non-Disclosure Agreement (“Agreement”) is made and entered into on this 2nd day of January, 2023, by and between PlusUltra Inc., a Delaware corporation having its principal offices at 111 Way White, Austin, NH 229484 (“PlusUltra”) and John Snow Labs Inc., a Delaware corporation, located at , , ,  (the “Company”).'

### **We use a pretrained pipeline instead of assembling all the components manually**

In [23]:
legal_pipeline = nlp.PretrainedPipeline("legpipe_ner_contract_doc_parties_alias_former", "en", "legal/models")

legpipe_ner_contract_doc_parties_alias_former download started this may take some time.
Approx size to download 891.1 MB
[OK!]


In [None]:
result = legal_pipeline.fullAnnotate(names_of_the_parties_df.loc[0, 'text'])

### 🧮 **NER Visualizer**
To save the visualization as HTML, pass the `save_path` parameter to the `display` function.

In [31]:
from sparknlp_display import NerVisualizer

visualiser = NerVisualizer()
visualiser.display(result = result[0],label_col = 'merged_ner_chunks', document_col = 'document')
print("\n\n")






## 📌 **`DEF_OF_CONF_INFO` CLAUSE-SPECIFIC NER**

In [32]:
def_of_conf_info_df = class_texts_df[class_texts_df['class']=='DEF_OF_CONF_INFO']
def_of_conf_info_df

|  | text | class |
|---|---|---|
| 1 | For the purpose of evaluating a possible relat... | DEF_OF_CONF_INFO |
| 4 | Definition of Confidential Information. | DEF_OF_CONF_INFO |
| 5 | For the purpose of this Agreement, “Confidenti... | DEF_OF_CONF_INFO |
| 6 | Any analyses, compilations, comparisons studie... | DEF_OF_CONF_INFO |
| 30 | Company acknowledges that the provisions of U.... | DEF_OF_CONF_INFO |


In [34]:
def_of_conf_info_df["text"][5]

'For the purpose of this Agreement, “Confidential Information”  shall mean any proprietary information relating to a Disclosing Party, its business or assets or that of any of its customers, Affiliates (defined below), subcontractors or other persons that is not generally known to the public, whether of a technical, business or other nature, including but not limited to inventions, know-how, trade-secrets, methods, techniques, processes, designs, drawings, diagrams, formulae and analysis, clinical data, and any business information, including but not limited to price lists, customers lists, costs analyses, reports, surveys and market information and data), whether communicated in tangible or intangible form.'

In [45]:
# nda_conf_info
document_assembler = nlp.DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")

text_splitter = legal.TextSplitter() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

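tokenizer = nlp.Tokenizer()\
        .setInputCols(["sentence"])\
        .setOutputCol("token")

# The NER models below read an "embeddings" column of token embeddings.
# Assumption: the legner_nda_* models use legal RoBERTa embeddings;
# check each model card for the embeddings it was trained with.
embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en")\
        .setInputCols(["sentence", "token"])\
        .setOutputCol("embeddings")\
        .setMaxSentenceLength(512)\
        .setCaseSensitive(True)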

ner_model_conf = legal.NerModel.pretrained("legner_nda_def_conf_info", "en", "legal/models")\
        .setInputCols(["sentence", "token", "embeddings"])\
        .setOutputCol("ner_conf")

ner_converter_conf = nlp.NerConverter()\
        .setInputCols(["sentence", "token", "ner_conf"])\
        .setOutputCol("ner_chunk_conf")

nlpPipeline = nlp.Pipeline(stages=[
        document_assembler,
        text_splitter,
        tokenizer,
        embeddings,
        ner_model_conf,
        ner_converter_conf, 
        ])

def_of_conf_info_sdf = spark.createDataFrame(def_of_conf_info_df)

model = nlpPipeline.fit(def_of_conf_info_sdf)
result = model.transform(def_of_conf_info_sdf)


from pyspark.sql import functions as F

result_df = result.select(F.explode('ner_chunk_conf').alias("cols")) \
                  .select(F.expr("cols.begin").alias("begin"),
                          F.expr("cols.end").alias("end"),
                          F.expr("cols.result").alias("chunk"),
                          F.expr("cols.metadata.entity").alias("entity"))

result_df.show(50, truncate=100)


legner_nda_def_conf_info download started this may take some time.
[OK!]
+-----+---+-------------------+--------------+
|begin|end|              chunk|        entity|
+-----+---+-------------------+--------------+
|  133|136|               oral|CONF_INFO_FORM|
|  142|148|            written|CONF_INFO_FORM|
|   27| 37|        Information|CONF_INFO_TYPE|
|  372|381|         inventions|CONF_INFO_TYPE|
|  384|391|           know-how|CONF_INFO_TYPE|
|  394|406|      trade-secrets|CONF_INFO_TYPE|
|  409|415|            methods|CONF_INFO_TYPE|
|  418|427|         techniques|CONF_INFO_TYPE|
|  430|438|          processes|CONF_INFO_TYPE|
|  441|447|            designs|CONF_INFO_TYPE|
|  450|457|           drawings|CONF_INFO_TYPE|
|  460|467|           diagrams|CONF_INFO_TYPE|
|  470|477|           formulae|CONF_INFO_TYPE|
|  483|490|           analysis|CONF_INFO_TYPE|
|  493|505|      clinical data|CONF_INFO_TYPE|
|  567|577|        price lists|CONF_INFO_TYPE|
|  580|594|    customers lists|CON
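
The `begin` and `end` columns are character offsets, and `end` appears to be inclusive: `oral` spans 133 to 136, which is four characters. A small sketch of slicing a chunk back out of a string under that convention follows; the helper `slice_chunk` and the sample sentence are illustrative only.

```python
def slice_chunk(text, begin, end):
    """Return text[begin..end], where `end` is an inclusive character offset."""
    return text[begin:end + 1]


sentence = "the parties may disclose to each other in oral and written form"
begin = sentence.index("oral")
end = begin + len("oral") - 1  # inclusive end, as in the table above
print(begin, end, slice_chunk(sentence, begin, end))
```

Forgetting the `+ 1` would drop the last character of every chunk.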

### 🧮 **NER Visualizer**
To save the visualization as HTML, pass the `save_path` parameter to the `display` function.

In [46]:
# Spark SQL functions used below (F.explode, F.arrays_zip, F.expr)
from pyspark.sql import functions as F

result = model.transform(def_of_conf_info_sdf)

result.select(F.explode(F.arrays_zip('ner_chunk_conf.result', 'ner_chunk_conf.metadata')).alias("cols")) \
      .select(F.expr("cols['0']").alias("ner_chunk"),
              F.expr("cols['1']['entity']").alias("label")).show(300,truncate = False)

print("\n\n")

from sparknlp_display import NerVisualizer

visualiser = NerVisualizer()
# Collect the rows once, then render each one
for row in result.collect():
  visualiser.display(result = row, label_col = 'ner_chunk_conf', document_col = 'document')
  print("\n\n")


+-------------------+--------------+
|ner_chunk          |label         |
+-------------------+--------------+
|oral               |CONF_INFO_FORM|
|written            |CONF_INFO_FORM|
|Information        |CONF_INFO_TYPE|
|inventions         |CONF_INFO_TYPE|
|know-how           |CONF_INFO_TYPE|
|trade-secrets      |CONF_INFO_TYPE|
|methods            |CONF_INFO_TYPE|
|techniques         |CONF_INFO_TYPE|
|processes          |CONF_INFO_TYPE|
|designs            |CONF_INFO_TYPE|
|drawings           |CONF_INFO_TYPE|
|diagrams           |CONF_INFO_TYPE|
|formulae           |CONF_INFO_TYPE|
|analysis           |CONF_INFO_TYPE|
|clinical data      |CONF_INFO_TYPE|
|price lists        |CONF_INFO_TYPE|
|customers lists    |CONF_INFO_TYPE|
|costs analyses     |CONF_INFO_TYPE|
|reports            |CONF_INFO_TYPE|
|surveys            |CONF_INFO_TYPE|
|data               |CONF_INFO_TYPE|
|analyses           |CONF_INFO_TYPE|
|compilations       |CONF_INFO_TYPE|
|comparisons studies|CONF_INFO_TYPE|
|

## 📌 **`RETURN_OF_CONF_INFO` CLAUSE-SPECIFIC NER**

In [47]:
return_of_conf_info_df = class_texts_df[class_texts_df['class']=='RETURN_OF_CONF_INFO']
return_of_conf_info_df

|  | text | class |
|---|---|---|
| 16 | All Confidential Information (including tangib... | RETURN_OF_CONF_INFO |
| 17 | Any portion of Derivative Materials, whether c... | RETURN_OF_CONF_INFO |
| 19 | Notwithstanding the foregoing, the Receiving P... | RETURN_OF_CONF_INFO |


In [49]:
# nda_return_of_conf_info
ner_model_return = legal.NerModel.pretrained("legner_nda_return_of_conf_info", "en", "legal/models")\
        .setInputCols(["sentence", "token", "embeddings"])\
        .setOutputCol("ner_return")

ner_converter_return = nlp.NerConverter()\
        .setInputCols(["sentence", "token", "ner_return"])\
        .setOutputCol("ner_chunk_return")

nlpPipeline = nlp.Pipeline(stages=[
        document_assembler,
        text_splitter,
        tokenizer,
        embeddings,
        ner_model_return,
        ner_converter_return, 
        ])

return_of_conf_info_sdf = spark.createDataFrame(return_of_conf_info_df)

model = nlpPipeline.fit(return_of_conf_info_sdf)
result = model.transform(return_of_conf_info_sdf)


legner_nda_return_of_conf_info download started this may take some time.
[OK!]


### 🧮 **NER Visualizer**
To save the visualization as HTML, pass the `save_path` parameter to the `display` function.

In [50]:
# Spark SQL functions used below (F.explode, F.arrays_zip, F.expr)
from pyspark.sql import functions as F

result = model.transform(return_of_conf_info_sdf)

result.select(F.explode(F.arrays_zip('ner_chunk_return.result', 'ner_chunk_return.metadata')).alias("cols")) \
      .select(F.expr("cols['0']").alias("ner_chunk"),
              F.expr("cols['1']['entity']").alias("label")).show(300,truncate = False)

print("\n\n")

from sparknlp_display import NerVisualizer

visualiser = NerVisualizer()
# Collect the rows once, then render each one
for row in result.collect():
  visualiser.display(result = row, label_col = 'ner_chunk_return', document_col = 'document')
  print("\n\n--------------")

+----------+----------------+
|ner_chunk |label           |
+----------+----------------+
|law       |LEGAL_PURPOSE   |
|rule      |LEGAL_PURPOSE   |
|regulation|LEGAL_PURPOSE   |
|archiving |ARCHIVAL_PURPOSE|
|back-up   |ARCHIVAL_PURPOSE|
+----------+----------------+








--------------




--------------




--------------


## 📌 **`USE_OF_CONF_INFO` CLAUSE-SPECIFIC NER**

In [51]:
use_of_conf_info_df = class_texts_df[class_texts_df['class']=='USE_OF_CONF_INFO']
use_of_conf_info_df

|  | text | class |
|---|---|---|
| 10 | The Receiving Party shall ensure that its Repr... | USE_OF_CONF_INFO |
| 12 | For the purpose hereof, the Receiving Party it... | USE_OF_CONF_INFO |
| 18 | Without limiting the foregoing, the Receiving ... | USE_OF_CONF_INFO |
| 24 | Scope of Relations.\nNeither this Agreement no... | USE_OF_CONF_INFO |
| 33 | IN WITNESS WHEREOF, the Parties have executed ... | USE_OF_CONF_INFO |


In [52]:
# nda_use_of_conf_info
ner_model_use = legal.NerModel.pretrained("legner_nda_confidential_information_restricted", "en", "legal/models")\
        .setInputCols(["sentence", "token", "embeddings"])\
        .setOutputCol("ner_use")

ner_converter_use = nlp.NerConverter()\
        .setInputCols(["sentence", "token", "ner_use"])\
        .setOutputCol("ner_chunk_use")

nlpPipeline = nlp.Pipeline(stages=[
        document_assembler,
        text_splitter,
        tokenizer,
        embeddings,
        ner_model_use,
        ner_converter_use, 
        ])

use_of_conf_info_sdf = spark.createDataFrame(use_of_conf_info_df)

model = nlpPipeline.fit(use_of_conf_info_sdf)
result = model.transform(use_of_conf_info_sdf)



legner_nda_confidential_information_restricted download started this may take some time.
[OK!]


### 🧮 **NER Visualizer**
To save the visualization as HTML, pass the `save_path` parameter to the `display` function.

In [53]:
# Importing the necessary functions
from pyspark.sql.functions import col

result = model.transform(use_of_conf_info_sdf)

result.select(F.explode(F.arrays_zip('ner_chunk_use.result', 'ner_chunk_use.metadata')).alias("cols")) \
      .select(F.expr("cols['0']").alias("ner_chunk"),
              F.expr("cols['1']['entity']").alias("label")).show(300,truncate = False)

print("\n\n")

from sparknlp_display import NerVisualizer

visualiser = NerVisualizer()
# Collect the rows once, then render each one
for row in result.collect():
  visualiser.display(result = row, label_col = 'ner_chunk_use', document_col = 'document')
  print("\n\n")

+---------------+------------------+
|ner_chunk      |label             |
+---------------+------------------+
|Receiving Party|RESTRICTED_SUBJECT|
|Receiving Party|RESTRICTED_SUBJECT|
|Receiving      |RESTRICTED_SUBJECT|
|Representatives|RESTRICTED_SUBJECT|
|Receiving Party|RESTRICTED_SUBJECT|
|Receiving      |RESTRICTED_SUBJECT|
+---------------+------------------+


## 📌 **`EXCEPTIONS` CLAUSE-SPECIFIC NER**

In [58]:
exceptions_df = class_texts_df[class_texts_df['class']=='EXCEPTIONS']
exceptions_df

|  | text | class |
|---|---|---|
| 8 | Exclusions from Definition.\nThe term Confiden... | EXCEPTIONS |
| 31 | Company acknowledges that the provisions of U.... | EXCEPTIONS |


In [59]:
# nda_exceptions
ner_model_exceptions = legal.NerModel.pretrained("legner_nda_exceptions", "en", "legal/models")\
        .setInputCols(["sentence", "token", "embeddings"])\
        .setOutputCol("ner_exceptions")

ner_converter_exceptions = nlp.NerConverter()\
        .setInputCols(["sentence", "token", "ner_exceptions"])\
        .setOutputCol("ner_chunk_exceptions")

nlpPipeline = nlp.Pipeline(stages=[
        document_assembler,
        text_splitter,
        tokenizer,
        embeddings,
        ner_model_exceptions,
        ner_converter_exceptions, 
        ])

exceptions_sdf = spark.createDataFrame(exceptions_df)

model = nlpPipeline.fit(exceptions_sdf)
result = model.transform(exceptions_sdf)

legner_nda_exceptions download started this may take some time.
[OK!]


### 🧮 **NER Visualizer**
To save the visualization as HTML, pass the `save_path` parameter to the `display` function.

In [60]:
# Importing the necessary functions
from pyspark.sql.functions import col

result = model.transform(exceptions_sdf)

result.select(F.explode(F.arrays_zip('ner_chunk_exceptions.result', 'ner_chunk_exceptions.metadata')).alias("cols")) \
      .select(F.expr("cols['0']").alias("ner_chunk"),
              F.expr("cols['1']['entity']").alias("label")).show(300,truncate = False)

print("\n\n")

from sparknlp_display import NerVisualizer

visualiser = NerVisualizer()
# Collect the rows once, then render each one
for row in result.collect():
  visualiser.display(result = row, label_col = 'ner_chunk_exceptions', document_col = 'document')
  print("\n\n")

+-----------------------+----------------+
|ner_chunk              |label           |
+-----------------------+----------------+
|known                  |EXCLUDED_INFO   |
|prior to               |EXCLUSION_GROUND|
|public domain          |EXCLUDED_INFO   |
|through no wrongful act|EXCLUSION_GROUND|
|rightfully             |EXCLUDED_INFO   |
|received               |EXCLUDED_INFO   |
|without restriction    |EXCLUSION_GROUND|
|developed              |EXCLUDED_INFO   |
|without the use        |EXCLUSION_GROUND|
|provisions             |EXCLUDED_INFO   |
|promulgated            |EXCLUDED_INFO   |
+-----------------------+----------------+

## 📌 **QUESTION ANSWERING**

In [6]:
# Use only the first paragraph of the NDA as the context
with open('nda_2.txt', 'r', encoding='utf-8') as f:
  text = f.read().split('\n\n')[0]
print(text)

This Non-Disclosure Agreement (“Agreement”) is made and entered into on this 2nd day of January, 2023, by and between PlusUltra Inc., a Delaware corporation having its principal offices at 111 Way White, Austin, NH 229484 (“PlusUltra”) and John Snow Labs Inc., a Delaware corporation, located at , , ,  (the “Company”).


In [16]:
documentAssembler = nlp.MultiDocumentAssembler()\
        .setInputCols(["question", "context"])\
        .setOutputCols(["document_question", "document_context"])

leg_qa = legal.QuestionAnswering.pretrained("legqa_flant5_finetuned","en","legal/models")\
    .setInputCols(["document_question", "document_context"])\
    .setCustomPrompt("question: {question} context: {context}")\
    .setMaxNewTokens(50)\
    .setOutputCol("answer")

pipeline = nlp.Pipeline().setStages([
        documentAssembler,
        leg_qa
])

legqa_flant5_finetuned download started this may take some time.
[OK!]
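
The string passed to `setCustomPrompt` contains `{question}` and `{context}` placeholders. The following plain-Python sketch only illustrates how such a template could expand into the text the model receives. It is an assumption about the behavior, not the library's implementation, and `build_prompt` is a hypothetical helper.

```python
PROMPT_TEMPLATE = "question: {question} context: {context}"


def build_prompt(question, context, template=PROMPT_TEMPLATE):
    """Fill the placeholders of the template with one question/context pair."""
    return template.format(question=question, context=context)


print(build_prompt(
    question="In which state is PlusUltra Inc. based?",
    context="This Non-Disclosure Agreement is made by and between PlusUltra Inc. ...",
))
```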


In [23]:
new_text = [text]

questions = {0: ["Between whom was this agreement made?", 
                 "When did this agreement enter into effect ?",
                 "In which state is PlusUltra Inc. based?",
                 "What is the address of PlusUltra?"]}

# Create a list of dictionaries for each row in the DataFrame
data = [{"context": text, "question": question} for text, question_list in zip(new_text, questions.values()) for question in question_list]

# Create a DataFrame from the list of dictionaries
df = spark.createDataFrame(data)

# Display the resulting DataFrame
df.show(truncate=90)


+------------------------------------------------------------------------------------------+-------------------------------------------+
|                                                                                   context|                                   question|
+------------------------------------------------------------------------------------------+-------------------------------------------+
|This Non-Disclosure Agreement (“Agreement”) is made and entered into on this 2nd day of...|      Between whom was this agreement made?|
|This Non-Disclosure Agreement (“Agreement”) is made and entered into on this 2nd day of...|When did this agreement enter into effect ?|
|This Non-Disclosure Agreement (“Agreement”) is made and entered into on this 2nd day of...|    In which state is PlusUltra Inc. based?|
|This Non-Disclosure Agreement (“Agreement”) is made and entered into on this 2nd day of...|          What is the address of PlusUltra?|
+----------------------------------------

In [24]:
result = pipeline.fit(df).transform(df)

result.select('answer.result').show(truncate=False)

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                 |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[This agreement was made between PlusUltra Inc., a Delaware corporation having its principal offices at 111 Way White, Austin, NH 229484 and John Snow Labs Inc., a Delaware corporation, located at ,]|
|[This agreement entered into effect on the 2nd day of January, 2023.  . . . . . . ]                                                                                                            

## 📌 **SUMMARIZATION**

In [29]:
with open('nda_2.txt', 'r', encoding='utf-8') as f:
  text = f.read()

paragraphs = text.split('\n\n')
paragraphs[:5]

['This Non-Disclosure Agreement (“Agreement”) is made and entered into on this 2nd day of January, 2023, by and between PlusUltra Inc., a Delaware corporation having its principal offices at 111 Way White, Austin, NH 229484 (“PlusUltra”) and John Snow Labs Inc., a Delaware corporation, located at , , ,  (the “Company”).',
 'For the purpose of evaluating a possible relationship between the parties (the "Purpose"), the parties may disclose to each other in oral and written form or in other medium, certain Confidential Information, as defined below, relating to their respective pipelines, products and businesses.',
 'As a precondition to any such disclosure, each party hereby agrees to keep all such Confidential Information of the other party in strict confidence under the terms set out in this Agreement. \nFor purposes of this Agreement, a party receiving Confidential Information hereunder shall be referred to as the “Receiving Party” and a party disclosing Confidential Information hereu

In [25]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("documents")

flant5 = legal.Summarizer.pretrained('legsum_flant5_legal_augmented','en','legal/models')\
    .setInputCols(["documents"])\
    .setOutputCol("summary")\
    .setMaxTextLength(512)\
    .setMaxNewTokens(512)

pipeline = nlp.Pipeline(stages=[document_assembler, 
                                flant5])

# Placeholder DataFrame: every stage is pretrained, so fit() only needs the input schema
data = spark.createDataFrame([[1, " "]]).toDF('id', 'text')

legsum_flant5_legal_augmented download started this may take some time.
[OK!]


In [26]:
model = pipeline.fit(data)

In [30]:
light_model = nlp.LightPipeline(model)
light_result = light_model.annotate(paragraphs)

In [31]:
light_result

[{'documents': ['This Non-Disclosure Agreement (“Agreement”) is made and entered into on this 2nd day of January, 2023, by and between PlusUltra Inc., a Delaware corporation having its principal offices at 111 Way White, Austin, NH 229484 (“PlusUltra”) and John Snow Labs Inc., a Delaware corporation, located at , , ,  (the “Company”).'],
  'summary': ['This Non-Disclosure Agreement is between PlusUltra Inc. and John Snow Labs Inc., a Delaware corporation located at 111 Way White, Austin, NH 229484. It is a non-disclosure agreement between the two companies.']},
 {'documents': ['For the purpose of evaluating a possible relationship between the parties (the "Purpose"), the parties may disclose to each other in oral and written form or in other medium, certain Confidential Information, as defined below, relating to their respective pipelines, products and businesses.'],
  'summary': ['This legal agreement states that the parties may disclose certain confidential information to each other 

In [32]:
combined_text = '\n'.join([d['summary'][0] for d in light_result])
print(combined_text)

This Non-Disclosure Agreement is between PlusUltra Inc. and John Snow Labs Inc., a Delaware corporation located at 111 Way White, Austin, NH 229484. It is a non-disclosure agreement between the two companies.
This legal agreement states that the parties may disclose certain confidential information to each other in oral and written form or in other medium, relating to their respective pipelines, products and businesses.
This agreement states that each party agrees to keep all confidential information of the other party in strict confidence under the terms set out in the agreement. For purposes of the agreement, a party receiving and disclosing confidential information will be referred to as the “Receiving Party” and the disclosing party will be referred to as the “Disclosing Party”.
This legal agreement defines the definition of confidential information.
This legal agreement defines "Confidential Information" as any proprietary information relating to a Disclosure Party, its business o

In [33]:
light_model = nlp.LightPipeline(model)
light_result = light_model.annotate(combined_text)

In [34]:
light_result['summary']

["This legal agreement is between PlusUltra Inc. and John Snow Labs Inc. It states that the parties may disclose certain confidential information to each other in oral and written form or in other medium, relating to their respective pipelines, products and businesses. It also states that the Receiving Party must keep all confidential information of the other party in strict confidence under the terms set out in the agreement. The Receiving Party must not use any confidential information other than as necessary specifically for the purpose, nor disclose any confidential information to any person other than its directors, officers, employees, agents or consultants who have a bona fide need to know such confidential information for the purpose. The Receiving Party must ensure that its Representatives maintain the confidentiality of any Disclosing Party's Confidential Information received by such Representatives and must be absolutely responsible for any breach of such obligations by its 
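
The cells in this section follow a two-pass pattern: summarize each paragraph, join the partial summaries, then summarize the joined text. A minimal sketch of that control flow follows, with a trivial stand-in function in place of the model. The names `summarize_in_two_passes` and `first_sentence` are hypothetical, and skipping blank paragraphs is an addition that the notebook cells do not make.

```python
def summarize_in_two_passes(paragraphs, summarize):
    """Summarize each paragraph, join the results, then summarize the joined text."""
    partial = [summarize(p) for p in paragraphs if p.strip()]  # skip blank paragraphs
    combined = "\n".join(partial)
    return summarize(combined)


def first_sentence(text):
    """Stand-in summarizer: keep only the text before the first '. ' separator."""
    return text.split(". ")[0].rstrip(".") + "."


paragraphs = [
    "Each party keeps information secret. Details follow.",
    "The agreement ends after two years. Renewal is possible.",
]
print(summarize_in_two_passes(paragraphs, first_sentence))
```

In the notebook, the role of `summarize` is played by `light_model.annotate(...)` followed by reading the `summary` field.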