![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/05.2.Clause_based_NER.ipynb)

#🚀 Legal NLP

In this notebook, you will learn how to use Spark NLP and Legal NLP to identify relevant entities in legal texts using our state-of-the-art Named-Entity Recognition (NER) models and the recent Zero-Shot models.

We will cover the full analysis cycle: reading a document in PDF format, extracting its text contents, classifying its sections, and applying NER models to specific sections.

Let's dive in!

##📜 Introduction

###🔎 Classification models

📚For the text classification tasks, we will use two annotators:

- `ClassifierDL`: uses the state-of-the-art Universal Sentence Encoder as input for text classification. The classifier itself is a deep neural network (DNN) built with TensorFlow that supports `Binary Classification` and `Multiclass Classification` (up to 100 classes).
- `MultiClassifierDL`: `Multilabel Classification` (can predict more than one class for each text) using a Bidirectional GRU with Convolution architecture built with TensorFlow that supports up to 100 classes. The inputs are Sentence Embeddings such as state-of-the-art UniversalSentenceEncoder, BertSentenceEmbeddings or SentenceEmbeddings.

In Legal NLP, since the number of classes can be very high (over 250) and the texts could belong to more than one topic at the same time (multilabel problem), we pretrained several binary classifiers (yes / no) for many clause types in legal documents that can be used independently.

You can select the topics you are interested in (for example, loans and fiscal-year clauses) and create a pipeline with both classifiers to detect those types of clauses in your paragraphs.

As a reminder, since the models are independent and the task is multilabel, you may sometimes get positive results for more than one class (e.g., a paragraph that talks about loans and the fiscal year at the same time).
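To make the multilabel behaviour concrete, here is a minimal plain-Python sketch (no Spark required) of how the outputs of several independent binary classifiers could be merged for one paragraph. The column names and the `positive_labels` helper are hypothetical and used only for illustration; the one assumption taken from the models table below is that each binary classifier answers either its clause label or `other`.

```python
def positive_labels(predictions):
    """Return the clause labels for which a binary classifier fired.

    `predictions` maps a (hypothetical) output column name to the label the
    classifier produced: either its clause name or "other".
    """
    return sorted(label for label in predictions.values() if label != "other")


# Hypothetical outputs of three independent binary classifiers for one paragraph.
paragraph_predictions = {
    "is_loans": "loans",
    "is_fiscal_year": "fiscal-year",
    "is_acceleration": "other",
}

labels = positive_labels(paragraph_predictions)
print(labels)
```

Because each classifier decides on its own, the merged list can be empty, contain one label, or contain several.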

As an alternative, we also have `MultiClassifierDL`, which predicts many clause types with a single model. The choice between binary classifiers and the multilabel model depends on the document types, and you should run experiments to verify the accuracy of the models on texts that differ substantially from the training data (CUAD dataset, SEC sample documents, etc.).

📚Example Classification models:

| title                                                     | language   | predicted_entities                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             | compatible_editions            |
|:----------------------------------------------------------|:-----------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-------------------------------|
| Human Rights Articles Classification                      | en         | ['Artículo 1. Obligación de Respetar los Derechos', 'Artículo 2. Deber de Adoptar Disposiciones de Derecho Interno', 'Artículo 3. Derecho al Reconocimiento de la Personalidad Jurídica', 'Artículo 4. Derecho a la Vida', 'Artículo 5. Derecho a la Integridad Personal', 'Artículo 6. Prohibición de la Esclavitud y Servidumbre', 'Artículo 7. Derecho a la Libertad Personal', 'Artículo 8. Garantías Judiciales', 'Artículo 9. Principio de Legalidad y de Retroactividad', 'Artículo 11. Protección de la Honra y de la Dignidad', 'Artículo 12. Libertad de Conciencia y de Religión', 'Artículo 13. Libertad de Pensamiento y de Expresión', 'Artículo 14. Derecho de Rectificación o Respuesta', 'Artículo 15. Derecho de Reunión', 'Artículo 16. Libertad de Asociación', 'Artículo 17. Protección a la Familia', 'Artículo 18. Derecho al Nombre', 'Artículo 19. Derechos del Niño', 'Artículo 20. Derecho a la Nacionalidad', 'Artículo 22. Derecho de Circulación y de Residencia', 'Artículo 23. Derechos Políticos', 'Artículo 24. Igualdad ante la Ley', 'Artículo 25. Protección Judicial', 'Artículo 26. Desarrollo Progresivo', 'Artículo 27. Suspensión de Garantías', 'Artículo 28. Cláusula Federal', 'Artículo 21. Derecho a la Propiedad Privada', 'Artículo 29. Normas de Interpretación', 'Artículo 30. Alcance de las Restricciones', 'Artículo 63.1 Reparaciones'] | ['Legal NLP 1.0', 'Legal NLP'] |
| Legal Absence of certain changes Clause Binary Classifier | en         | ['other', 'absence-of-certain-changes']                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        | ['Legal NLP 1.0', 'Legal NLP'] |
| Legal Acceleration Clause Binary Classifier               | en         | ['other', 'acceleration']                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      | ['Legal NLP 1.0', 'Legal NLP'] |
| Legal Access Clause Binary Classifier                     | en         | ['other', 'access']                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            | ['Legal NLP 1.0', 'Legal NLP'] |
| Legal Accounting terms Clause Binary Classifier           | en         | ['other', 'accounting-terms']                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  | ['Legal NLP 1.0', 'Legal NLP'] |
| Legal Adjustments Clause Binary Classifier                | en         | ['other', 'adjustments']                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       | ['Legal NLP 1.0', 'Legal NLP'] |
| Legal Agreements Clause Binary Classifier                 | en         | ['other', 'agreements']                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        | ['Legal NLP 1.0', 'Legal NLP'] |
| Legal Amendments Clause Binary Classifier                 | en         | ['other', 'amendments']                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        | ['Legal NLP 1.0', 'Legal NLP'] |
| Legal Application of proceeds Clause Binary Classifier    | en         | ['other', 'application-of-proceeds']                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           | ['Legal NLP 1.0', 'Legal NLP'] |
| Conventions Classification                                | es         | ['Convención sobre la Eliminación de todas las formas de Discriminación contra la Mujer', 'Convención sobre los Derechos de las Personas con Discapacidad', 'Convención Internacional Sobre la Eliminación de Todas las Formas de Discriminación Racial', 'Convención Internacional sobre la Protección de los Derechos de todos los Trabajadores Migratorios y de sus Familias', 'Convención de los Derechos del Niño', 'Pacto Internacional de Derechos Civiles y Políticos']                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                | ['Legal NLP 1.0', 'Legal NLP'] |


For the complete list, check the [NLP Models Hub](https://nlp.johnsnowlabs.com/models?edition=Legal+NLP&type=model&task=Text+Classification).

###🔎 NER models

Named-Entity Recognition (NER) is the capability to automatically identify relevant entities in the text. For example, person names, company names, public companies trading code, quantities, etc. There are many ways to implement NER, but nowadays the most efficient one is to use models based on deep learning.

The deep neural network architecture for NER models in Spark NLP is the BiLSTM-CNN-Char framework, a slightly modified version of the architecture proposed by Jason PC Chiu and Eric Nichols ([Named Entity Recognition with Bidirectional LSTM-CNNs](https://arxiv.org/abs/1511.08308)). It automatically detects word-level and character-level features using a hybrid bidirectional LSTM and CNN architecture, eliminating the need for most feature engineering steps. This model is implemented in our `NerDL`/`NerModel` annotators, which we will experiment with in this section.

At John Snow Labs, we are proud to have a library of state-of-the-art, pretrained, out-of-the-box NLP models. Our newer package, Legal NLP, is no different: we currently support more than 580 models fine-tuned for the legal domain. For NER specifically, we currently have more than 40 models that can identify entities for different business needs.

📚Example NER models:

| title                                               | language   | predicted_entities                                                                                                                            | compatible_editions            |
|:----------------------------------------------------|:-----------|:----------------------------------------------------------------------------------------------------------------------------------------------|:-------------------------------|
| NER on Legal Texts (CUAD, Silver corpus)            | en         | ['PERSON', 'LAW', 'PARTY', 'EFFDATE', 'LOC', 'DATE', 'DOC', 'ORDINAL', 'ROLE', 'PERCENT', 'ORG']                                              | ['Legal NLP 1.0', 'Legal NLP'] |
| Generic Deidentification NER                        | en         | ['AGE', 'CITY', 'COUNTRY', 'DATE', 'EMAIL', 'FAX', 'LOCATION-OTHER', 'ORG', 'PERSON', 'PHONE', 'PROFESSION', 'STATE', 'STREET', 'URL', 'ZIP'] | ['Legal NLP 1.0', 'Legal NLP'] |
| Legal NER - License / Permission Clauses (Bert, sm) | en         | ['PERMISSION', 'PERMISSION_SUBJECT', 'PERMISSION_OBJECT', 'PERMISSION_INDIRECT_OBJECT']                                                       | ['Legal NLP 1.0', 'Legal NLP'] |
| Legal NER (Headers / Subheaders)                    | en         | ['HEADER', 'SUBHEADER']                                                                                                                       | ['Legal NLP 1.0', 'Legal NLP'] |
| Legal NER - Whereas Clauses (sm)                    | en         | ['WHEREAS_SUBJECT', 'WHEREAS_OBJECT', 'WHEREAS_ACTION']                                                                                       | ['Legal NLP 1.0', 'Legal NLP'] |
| Legal NER (Parties, Dates, Document Type - sm)      | en         | ['PARTY', 'EFFDATE', 'DOC', 'ALIAS']                                                                                                          | ['Legal NLP 1.0', 'Legal NLP'] |
| Legal NER (Headers / Subheaders)                    | en         | ['SIGNING_TITLE', 'SIGNING_PERSON', 'PARTY']                                                                                                  | ['Legal NLP 1.0', 'Legal NLP'] |
| Legal ORG, PRODUCT and ALIAS NER (small)            | en         | ['ORG', 'PROD', 'ALIAS']                                                                                                                      | ['Legal NLP 1.0', 'Legal NLP'] |
| Legal NER Obligations on Agreements                 | en         | ['OBLIGATION_SUBJECT', 'OBLIGATION_ACTION', 'OBLIGATION', 'OBLIGATION_INDIRECT_OBJECT']                                                       | ['Legal NLP 1.0', 'Legal NLP'] |
| Legal Zero-shot NER                                 | en         | []                                                                                                                                            | ['Legal NLP 1.0', 'Legal NLP'] |


For the complete list, check the [NLP Models Hub](https://nlp.johnsnowlabs.com/models?edition=Legal+NLP&type=model&task=Named+Entity+Recognition).

##🎬 Installation

In [None]:
! pip install -q johnsnowlabs

###🔗 Automatic Installation
Using [my.johnsnowlabs.com](https://my.johnsnowlabs.com/) SSO

In [2]:
from johnsnowlabs import nlp, legal

# nlp.install(force_browser=True)

###🔗 Manual downloading
If you are not registered in my.johnsnowlabs.com, you received a license via email, or you are using Safari, you may need to install the license manually.

- Go to [my.johnsnowlabs.com](https://my.johnsnowlabs.com/)
- Download your license
- Upload it using the following command

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

- Install it

In [None]:
nlp.install()

###📌 Start Spark Session

In [5]:
from johnsnowlabs import nlp, legal, viz
# Automatically load license data and start a session with all jars user has access to
spark = nlp.start()

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_7187 (3).json
👌 Launched cpu optimized session with with: 🚀Spark-NLP==4.3.0, 💊Spark-Healthcare==4.3.0, running on ⚡ PySpark==3.1.2


In [6]:
from pyspark.sql import DataFrame
import pyspark.sql.functions as F
import pyspark.sql.types as T
import pyspark.sql as SQL
from pyspark import keyword_only

##🚨 Application: Identify Entities in a Credit Agreement Document

First, we download an example agreement document, which we will use throughout this notebook to illustrate real-world usage of our models.

In [None]:
! wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/legal-nlp/data/credit_agreement.txt

In [9]:
# Read the whole agreement into memory; the context manager closes the file for us
with open("credit_agreement.txt", "r", encoding="utf8") as f:
    credit_agreement = f.read()

# First page - note the  "-i-" at the end
print(credit_agreement[:4650])



                                                                    Exhibit 10.1

                                                                  EXECUTION COPY

                    $225,000,000.00 REVOLVING CREDIT FACILITY

                                CREDIT AGREEMENT

                                  by and among

                             P.H. GLATFELTER COMPANY

                                       and

                    Certain of its Subsidiaries, as Borrowers

                                       and

                       THE BANKS PARTY HERETO, as Lenders

                                       and

             PNC BANK, NATIONAL ASSOCIATION, as Administrative Agent

                                      with

           PNC CAPITAL MARKETS LLC and CITIZENS BANK OF PENNSYLVANIA,

                  as Joint Lead Arrangers and Joint Bookrunners

                                       and

               CITIZENS BANK OF PENNSYLVANIA, as Syndication Agent

   

###✔️ Splitting the document by pages

Sometimes, pages have patterns that tell you how to split them. In our case, the page number is present at the bottom of each page.

📚Always look for such signals when trying to detect page boundaries. Patterns you can usually find at the bottom of a page:
- Bottom placeholders
- Name of people
- Name of the document
- Other footer information
- etc.

In [10]:
document_assembler = (
    nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
)

text_splitter = (
    legal.TextSplitter()
    .setInputCols(["document"])
    .setOutputCol("pages")
    # Raw strings keep the regex backslashes intact and avoid invalid-escape warnings
    .setCustomBounds([r"\n+\s*[0-9]+\s*\n+", r"[-][iv]+[-]"])
    .setUseCustomBoundsOnly(True)
    .setExplodeSentences(True)
)

page_splitting_pipeline = nlp.Pipeline(stages=[document_assembler, text_splitter])

📜**Explanation:**

- `.setCustomBounds(...)` takes an array of regular expressions that tell the annotator where to split the document. The first pattern, `\n+\s*[0-9]+\s*\n+`, matches a page number standing alone between line breaks. The second, `[-][iv]+[-]`, matches the roman-numeral page markers enclosed in dashes (`-i-`, `-ii-`, etc.). Because it only uses the letters `i` and `v`, it covers markers up to `-viii-`, which we checked manually is enough for this document; more roman-numeral letters can be added if needed.
- `.setUseCustomBoundsOnly(True)` makes TextSplitter split only on our custom bounds and ignore its default regular expressions ('\n', ...).
- `.setExplodeSentences(True)` creates one new row in the dataframe per split, instead of an array containing the splits.
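To see what the two custom bounds match without starting Spark, here is a minimal sketch that applies the same patterns with Python's `re` module to a toy text. It is only an approximation: TextSplitter runs on the JVM, whose regex engine is not identical to Python's, and the `split_pages` helper and the sample text are hypothetical.

```python
import re

# The same two patterns passed to setCustomBounds, written as Python raw strings.
PAGE_BOUNDS = [r"\n+\s*[0-9]+\s*\n+", r"[-][iv]+[-]"]


def split_pages(text, bounds=PAGE_BOUNDS):
    """Split `text` wherever any of the boundary patterns matches."""
    combined = "|".join(f"(?:{pattern})" for pattern in bounds)
    # Drop chunks that are empty or whitespace-only after splitting.
    return [chunk.strip() for chunk in re.split(combined, text) if chunk.strip()]


# Toy document: two roman-numeral markers followed by two numbered pages.
sample = (
    "COVER PAGE\n-i-\nTABLE OF CONTENTS\n-ii-\n"
    "CREDIT AGREEMENT\n\n 1 \n\nDEFINITIONS\n\n2\n\nLOANS"
)

toy_pages = split_pages(sample)
print(toy_pages)
```

In this sketch, widening the character class to `[-][ivx]+[-]` would also catch markers such as `-ix-` and `-x-`.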


In [11]:
sdf = spark.createDataFrame([[ credit_agreement ]]).toDF("text")

page_splitter_model = page_splitting_pipeline.fit(sdf)

In [12]:
%%time

# transform() runs inference with the fitted pipeline
res = page_splitter_model.transform(sdf)

# Spark is lazy: the work is only executed when an action such as show() or collect() is called
res.select('pages').show()

+--------------------+
|               pages|
+--------------------+
|[{document, 70, 4...|
|[{document, 4678,...|
|[{document, 8160,...|
|[{document, 11801...|
|[{document, 15298...|
|[{document, 18938...|
|[{document, 22461...|
|[{document, 25295...|
|[{document, 26399...|
|[{document, 31821...|
|[{document, 34560...|
|[{document, 37458...|
|[{document, 40734...|
|[{document, 43945...|
|[{document, 46511...|
|[{document, 49420...|
|[{document, 53065...|
|[{document, 56539...|
|[{document, 59694...|
|[{document, 62708...|
+--------------------+
only showing top 20 rows

CPU times: user 101 ms, sys: 6.2 ms, total: 107 ms
Wall time: 11.9 s
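Each row above starts with `{document, 70, 4...`: the annotator type followed by the character offsets of the chunk in the original text. As a rough illustration of how such offsets relate to the text, the sketch below uses a simplified stand-in class (hypothetical, not the real Spark NLP class) and assumes, as Spark NLP annotations do, that `end` is the index of the last character (inclusive).

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    """Simplified stand-in for a Spark NLP annotation (illustration only)."""
    annotator_type: str
    begin: int  # index of the first character of the chunk
    end: int    # index of the last character of the chunk (inclusive)
    result: str


def annotate_span(text, begin, end):
    """Build an annotation whose result is text[begin..end], end inclusive."""
    return Annotation("document", begin, end, text[begin:end + 1])


text = "COVER\n-i-\nCONTENTS"
ann = annotate_span(text, 10, 17)
print(ann)
```

With inclusive offsets, the length of the chunk is `end - begin + 1`.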


In [13]:
# Checking the first page
print(res.select('pages.result').take(1)[0].result[0])

Exhibit 10.1

                                                                  EXECUTION COPY

                    $225,000,000.00 REVOLVING CREDIT FACILITY

                                CREDIT AGREEMENT

                                  by and among

                             P.H. GLATFELTER COMPANY

                                       and

                    Certain of its Subsidiaries, as Borrowers

                                       and

                       THE BANKS PARTY HERETO, as Lenders

                                       and

             PNC BANK, NATIONAL ASSOCIATION, as Administrative Agent

                                      with

           PNC CAPITAL MARKETS LLC and CITIZENS BANK OF PENNSYLVANIA,

                  as Joint Lead Arrangers and Joint Bookrunners

                                       and

               CITIZENS BANK OF PENNSYLVANIA, as Syndication Agent

                           Dated as of April 29, 2010



                

Let's keep the pages in a new data frame.

In [14]:
pages = res.select(F.expr("pages.result[0] as page"))
pages.show()

+--------------------+
|                page|
+--------------------+
|Exhibit 10.1

   ...|
|TABLE OF CONTENTS...|
|TABLE OF CONTENTS...|
|TABLE OF CONTENTS...|
|TABLE OF CONTENTS...|
|TABLE OF CONTENTS...|
|TABLE OF CONTENTS...|
|LIST OF SCHEDULES...|
|CREDIT AGREEMENT
...|
|AUGMENTING LENDER...|
|BUSINESS DAY shal...|
|COMPLIANCE CERTIF...|
|immediately prece...|
|DECLINED SHARE sh...|
|or directives iss...|
|EURO-RATE shall m...|
|Administrative Ag...|
|EURO-RATE OPTION ...|
|rate as quoted by...|
|XXXXXXX NOTE shal...|
+--------------------+
only showing top 20 rows



In [15]:
pages.limit(1).show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

###✔️ Classifying each page

We will use a few pretrained binary classifier models to try to identify clause types on each page. We will also replace every run of whitespace in the pages' text, including `\n` line breaks, with a single space to avoid extra tokenization (we keep the cleaned text in the column `page_clean`).

In [16]:
pages = pages.withColumn("page_clean", F.regexp_replace("page", r"\s+", " "))
pages.limit(1).select("page_clean").show(truncate=150)

+------------------------------------------------------------------------------------------------------------------------------------------------------+
|                                                                                                                                            page_clean|
+------------------------------------------------------------------------------------------------------------------------------------------------------+
|Exhibit 10.1 EXECUTION COPY $225,000,000.00 REVOLVING CREDIT FACILITY CREDIT AGREEMENT by and among P.H. GLATFELTER COMPANY and Certain of its Subs...|
+------------------------------------------------------------------------------------------------------------------------------------------------------+
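
To see what the `\s+` pattern does outside Spark, here is a minimal sketch using Python's `re` module. The `sample_page` string and the `collapse_whitespace` helper are illustrative assumptions, not part of the notebook. Spark's `regexp_replace` uses Java regular expressions, but for ASCII whitespace such as spaces, tabs, and line breaks the behavior should be the same.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace (spaces, tabs, line breaks) with one space."""
    return re.sub(r"\s+", " ", text)


# Hypothetical snippet laid out like the first page of the agreement.
sample_page = "Exhibit 10.1\n\n   EXECUTION COPY\n\n      CREDIT AGREEMENT"
print(collapse_whitespace(sample_page))
```

Collapsing whitespace this way keeps each page on a single line, which is why the `page_clean` preview above reads as one continuous string.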



In [17]:
document_assembler = (
    nlp.DocumentAssembler().setInputCol("page_clean").setOutputCol("document")
)

embeddings = (
    nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en")
    .setInputCols("document")
    .setOutputCol("sentence_embeddings")
)

whereas_classifier = (
    legal.ClassifierDLModel.pretrained(
        "legclf_cuad_whereas_clause", "en", "legal/models"
    )
    .setInputCols(["sentence_embeddings"])
    .setOutputCol("is_whereas")
)

warranty_classifier = (
    legal.ClassifierDLModel.pretrained(
        "legclf_cuad_warranty_clause", "en", "legal/models"
    )
    .setInputCols(["sentence_embeddings"])
    .setOutputCol("is_warranty")
)

termination_classifier = (
    legal.ClassifierDLModel.pretrained("legclf_termination_md", "en", "legal/models")
    .setInputCols(["sentence_embeddings"])
    .setOutputCol("is_termination")
)


pipeline = nlp.Pipeline(
    stages=[
        document_assembler,
        embeddings,
        whereas_classifier,
        warranty_classifier,
        termination_classifier,
    ]
)

model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

sent_bert_base_cased download started this may take some time.
Approximate size to download 389.1 MB
[OK!]
legclf_cuad_whereas_clause download started this may take some time.
[OK!]
legclf_cuad_warranty_clause download started this may take some time.
[OK!]
legclf_termination_md download started this may take some time.
[OK!]


In [18]:
result = model.transform(pages)
result.select(
    F.expr("is_whereas.result as whereas"),
    F.expr("is_warranty.result as warranty"),
    F.expr("is_termination.result as termination"),
).show()

+---------+----------+-----------+
|  whereas|  warranty|termination|
+---------+----------+-----------+
|[whereas]|   [other]|    [other]|
|  [other]|   [other]|    [other]|
|  [other]|   [other]|    [other]|
|  [other]|[warranty]|    [other]|
|  [other]|   [other]|    [other]|
|  [other]|   [other]|    [other]|
|  [other]|   [other]|    [other]|
|  [other]|   [other]|    [other]|
|[whereas]|   [other]|    [other]|
|  [other]|   [other]|    [other]|
|  [other]|   [other]|    [other]|
|  [other]|   [other]|    [other]|
|  [other]|   [other]|    [other]|
|  [other]|   [other]|    [other]|
|  [other]|   [other]|    [other]|
|  [other]|   [other]|    [other]|
|  [other]|   [other]|    [other]|
|  [other]|   [other]|    [other]|
|  [other]|   [other]|    [other]|
|  [other]|   [other]|    [other]|
+---------+----------+-----------+
only showing top 20 rows
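
Because the three classifiers are independent, a page can be flagged by none, one, or several of them. The sketch below shows one way to collapse the per-classifier outputs into a set of clause types per page, using plain Python on hand-written rows. The `positive_labels` helper and the `rows` list are assumptions for illustration: the first four rows mirror the table above, and the last one is a made-up page with two positives.

```python
from collections import Counter


def positive_labels(row: dict) -> set:
    """Return the clause types a page was flagged with, ignoring 'other'."""
    return {label for labels in row.values() for label in labels if label != "other"}


# Hand-written stand-ins for collected rows, keyed by classifier output column.
rows = [
    {"is_whereas": ["whereas"], "is_warranty": ["other"], "is_termination": ["other"]},
    {"is_whereas": ["other"], "is_warranty": ["other"], "is_termination": ["other"]},
    {"is_whereas": ["other"], "is_warranty": ["other"], "is_termination": ["other"]},
    {"is_whereas": ["other"], "is_warranty": ["warranty"], "is_termination": ["other"]},
    {"is_whereas": ["whereas"], "is_warranty": ["warranty"], "is_termination": ["other"]},
]

per_page = [positive_labels(row) for row in rows]
counts = Counter(label for labels in per_page for label in labels)

print(per_page)
print(counts)
```

An empty set means no classifier fired for that page, and a set with several entries is the multilabel case.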



How many whereas clauses on pages❓

In [19]:
result.select(F.expr("is_whereas.result[0] as whereas")).filter(
    "whereas != 'other'"
).count()

3

In [20]:
result.filter("is_whereas.result[0] == 'whereas'").select(
    "page_clean", F.expr("is_whereas.result[0] as whereas")
).show(3, truncate=80)

+--------------------------------------------------------------------------------+-------+
|                                                                      page_clean|whereas|
+--------------------------------------------------------------------------------+-------+
|Exhibit 10.1 EXECUTION COPY $225,000,000.00 REVOLVING CREDIT FACILITY CREDIT ...|whereas|
|CREDIT AGREEMENT THIS CREDIT AGREEMENT is dated as of April 29, 2010, and is ...|whereas|
|ANY KIND ARISING OUT OF OR RELATED TO THIS AGREEMENT OR ANY OTHER LOAN DOCUME...|whereas|
+--------------------------------------------------------------------------------+-------+



How many warranty clauses on pages❓

In [21]:
result.select(F.expr("is_warranty.result[0] as warranty")).filter(
    "warranty != 'other'"
).count()

9

In [22]:
result.filter("is_warranty.result[0] == 'warranty'").select(
    "page_clean", F.expr("is_warranty.result[0] as warranty")
).show(9, truncate=80)

+--------------------------------------------------------------------------------+--------+
|                                                                      page_clean|warranty|
+--------------------------------------------------------------------------------+--------+
|TABLE OF CONTENTS Section Page ------- ---- 5.1.14 Patents, Trademarks, Copyr...|warranty|
|Document as a whole and not to any particular provision of this Agreement or ...|warranty|
|(v) the lack of power or authority of any signer of (or any defect in or forg...|warranty|
|such taxes, fees, assessments and other charges are being contested in good f...|warranty|
|5.1.16 COMPLIANCE WITH LAWS. The Loan Parties and their Subsidiaries are in c...|warranty|
|(ii) To the best of the Loan Parties' knowledge, each Multiemployer Plan and ...|warranty|
|9.18 NO RELIANCE ON ADMINISTRATIVE AGENT'S CUSTOMER IDENTIFICATION PROGRAM. E...|warranty|
|ANY KIND ARISING OUT OF OR RELATED TO THIS AGREEMENT OR ANY OTHER LOAN DOCUME..

How many termination clauses on pages❓

In [23]:
result.select(F.expr("is_termination.result[0] as termination")).filter(
    "termination != 'other'"
).count()

7

In [24]:
result.filter("is_termination.result[0] == 'termination'").select(
    "page_clean", F.expr("is_termination.result[0] as termination")
).show(7, truncate=80)

+--------------------------------------------------------------------------------+-----------+
|                                                                      page_clean|termination|
+--------------------------------------------------------------------------------+-----------+
|Fees shall be payable quarterly in arrears on the first day of each July, Oct...|termination|
|the calculation of Equivalent Amounts which thereafter are actually in effect...|termination|
|certificate to the other Lenders and the Borrowers. Upon such date as shall b...|termination|
|normal banking procedures each Lender could purchase the Original Currency wi...|termination|
|7.1.12 GERMAN AND ENGLISH BORROWERS. On or before the Closing Date, and such ...|termination|
|(H) the Loan Parties shall deliver to the Administrative Agent at least five ...|termination|
|8.1.4 BREACH OF OTHER COVENANTS. Any of the Loan Parties shall default in the...|termination|
+-------------------------------------------------

###✔️ Identifying Entities in Whereas Clauses

Now that we have found the clauses on each page, we select an NER model to identify the entities present in the `whereas` clauses. Our pretrained model can identify the following entities:

- `WHEREAS_SUBJECT`
- `WHEREAS_OBJECT`
- `WHEREAS_ACTION`

We will filter the whereas clauses and extract them as raw text, so we can build a new pipeline from scratch. 

> Note: The model was trained with `Roberta Embeddings` instead of `Bert`.

In addition to these entities, we will use another model to identify persons, organizations, locations, and dates. We can use the `ChunkMergeApproach` to merge the chunks of the two NER models into a unified field containing all the entities.

In [25]:
document_assembler = (
    nlp.DocumentAssembler().setInputCol("page_clean").setOutputCol("document")
)

text_splitter = (
    legal.TextSplitter().setInputCols(["document"]).setOutputCol("page_sentence")
)

tokenizer = nlp.Tokenizer().setInputCols(["page_sentence"]).setOutputCol("token")

embeddings = (
    nlp.WordEmbeddingsModel.pretrained("w2v_cc_300d", "en")
    .setInputCols(["page_sentence", "token"])
    .setOutputCol("embeddings")
)

roberta_embeddings = (
    nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en")
    .setInputCols(["page_sentence", "token"])
    .setOutputCol("roberta")
    .setMaxSentenceLength(512)
)

ner_whereas = (
    legal.NerModel.pretrained("legner_whereas_md", "en", "legal/models")
    .setInputCols(["page_sentence", "token", "roberta"])
    .setOutputCol("ner_whereas")
)

ner_converter_whereas = (
    legal.NerConverterInternal()
    .setInputCols(["page_sentence", "token", "ner_whereas"])
    .setOutputCol("ner_chunk_whereas")
)

ner_generic = (
    legal.NerModel.pretrained("legner_cuad_silver", "en", "legal/models")
    .setInputCols(["page_sentence", "token", "embeddings"])
    .setOutputCol("ner_generic")
)

ner_converter_generic = (
    legal.NerConverterInternal()
    .setInputCols(["page_sentence", "token", "ner_generic"])
    .setOutputCol("ner_chunk_generic")
    .setGreedyMode(True)
)

chunk_merge = (
    legal.ChunkMergeApproach()
    .setInputCols("ner_chunk_whereas", "ner_chunk_generic")
    .setOutputCol("merged_chunk")
)

ner_pipeline = nlp.Pipeline(
    stages=[
        document_assembler,
        text_splitter,
        tokenizer,
        embeddings,
        roberta_embeddings,
        ner_whereas,
        ner_converter_whereas,
        ner_generic,
        ner_converter_generic,
        chunk_merge
    ]
)
ner_model = ner_pipeline.fit(spark.createDataFrame([[""]]).toDF("page_clean"))

w2v_cc_300d download started this may take some time.
Approximate size to download 1.2 GB
[OK!]
roberta_embeddings_legal_roberta_base download started this may take some time.
Approximate size to download 447.2 MB
[OK!]
legner_whereas_md download started this may take some time.
[OK!]
legner_cuad_silver download started this may take some time.
[OK!]
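
To make the idea of merging two chunk columns concrete, here is a simplified sketch in plain Python. It does not reproduce the rules of `ChunkMergeApproach`; it assumes one possible policy (on overlap keep the longer span, with ties going to the first list). The `Chunk` type, the `merge_chunks` helper, the example sentence, and the entity labels on the generic chunks are all hypothetical.

```python
from typing import List, NamedTuple


class Chunk(NamedTuple):
    begin: int  # inclusive character offset
    end: int    # inclusive character offset
    text: str
    entity: str


def overlaps(a: Chunk, b: Chunk) -> bool:
    """True when the two character spans share at least one position."""
    return a.begin <= b.end and b.begin <= a.end


def merge_chunks(first: List[Chunk], second: List[Chunk]) -> List[Chunk]:
    """Merge two chunk lists; on overlap keep the longer span (ties: `first` wins)."""
    kept: List[Chunk] = []
    # Longest-first ordering; sorted() is stable, so `first` wins ties.
    for cand in sorted(first + second, key=lambda c: c.begin - c.end):
        if not any(overlaps(cand, k) for k in kept):
            kept.append(cand)
    return sorted(kept, key=lambda c: c.begin)


# Made-up sentence and chunks, for illustration only.
sentence = "WHEREAS, the Lenders are willing to provide such credit as of April 29, 2010"


def chunk(phrase: str, entity: str) -> Chunk:
    begin = sentence.index(phrase)
    return Chunk(begin, begin + len(phrase) - 1, phrase, entity)


whereas_chunks = [
    chunk("the Lenders", "WHEREAS_SUBJECT"),
    chunk("are willing to provide", "WHEREAS_ACTION"),
    chunk("such credit", "WHEREAS_OBJECT"),
]
generic_chunks = [chunk("Lenders", "PARTY"), chunk("April 29, 2010", "DATE")]

merged = merge_chunks(whereas_chunks, generic_chunks)
for c in merged:
    print(c.begin, c.end, c.text, c.entity)
```

Under this assumed policy the shorter `Lenders` chunk is dropped in favor of `the Lenders`, while the non-overlapping date is kept alongside the whereas entities.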


In [26]:
example_clauses = result.filter("is_whereas.result[0] == 'whereas'").select(
    "page_clean"
)
example_clauses.show(1)

+--------------------+
|          page_clean|
+--------------------+
|Exhibit 10.1 EXEC...|
+--------------------+
only showing top 1 row



In [27]:
ner_results = ner_model.transform(example_clauses)

In [28]:
ner_results.select(
    F.explode(F.arrays_zip(ner_results.merged_chunk.result, ner_results.merged_chunk.metadata)).alias(
        "cols"
    )
).select(
    F.expr("cols['0']").alias("chunk"),
    F.expr("cols['1']['entity']").alias("ner_label"),
    F.expr("cols['1']['confidence']").alias("confidence"),
).show(300, truncate=False)


+-----------------------------------------------------------------------+---------------+----------+
|chunk                                                                  |ner_label      |confidence|
+-----------------------------------------------------------------------+---------------+----------+
|EXECUTION COPY                                                         |DOC            |0.99565   |
|REVOLVING CREDIT FACILITY CREDIT AGREEMENT                             |DOC            |0.8406199 |
|GLATFELTER COMPANY                                                     |PARTY          |0.52175   |
|PNC BANK, NATIONAL ASSOCIATION, as Administrative Agent                |ORG            |0.87597775|
|PNC CAPITAL MARKETS LLC                                                |ORG            |0.631425  |
|CITIZENS BANK OF PENNSYLVANIA                                          |ORG            |0.6161    |
|Joint Lead Arrangers and Joint Bookrunners                             |ORG            |0.

Using the visualization package:

In [29]:
ner_visualizer = viz.NerVisualizer()

In [30]:
results_collected = ner_results.collect()

In [31]:
ner_visualizer.display(
    results_collected[1], label_col="merged_chunk", document_col="document"
)

###✔️ Using LightPipeline

[LightPipelines](https://nlp.johnsnowlabs.com/docs/en/concepts#using-spark-nlps-lightpipeline) are Spark NLP specific Pipelines, equivalent to Spark ML Pipelines, but meant to deal with smaller amounts of data. They're useful when working with small datasets, debugging results, or running either training or prediction from an API that serves one-off requests.

Spark NLP LightPipelines are Spark ML pipelines converted to run on a single machine, but in a multi-threaded fashion, **becoming more than 10x faster** for smaller amounts of data (small is relative, but 50k sentences is roughly a good maximum). To use them, we simply plug in a trained (fitted) pipeline and then annotate plain text. We don't even need to convert the input text to a DataFrame in order to feed it into a pipeline that accepts a DataFrame as input in the first place. This is quite useful when you need predictions for a few lines of text from a trained ML model.

For more details:
[https://medium.com/spark-nlp/spark-nlp-101-lightpipeline-a544e93f20f1](https://medium.com/spark-nlp/spark-nlp-101-lightpipeline-a544e93f20f1)

In [32]:
light_model = nlp.LightPipeline(ner_model)

You can pass a string or a list of strings to the method [.annotate()](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/base/light_pipeline/index.html#sparknlp.base.light_pipeline.LightPipeline.annotate) to get the results. To get more metadata in the result, use the method [.fullAnnotate()](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/base/light_pipeline/index.html#sparknlp.base.light_pipeline.LightPipeline.fullAnnotate) instead. The result is a `list` if a `list` is given, or a `dict` if a string is given.

To extract the results from the object, you just need to parse the dictionary.
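
Since `.annotate()` returns a `dict` for a single string and a `list` of dicts for a list of strings, downstream code is often simpler if the shape is normalized first. The following is a minimal sketch of that idea in plain Python. The `as_list` helper is hypothetical, and the dictionaries are hand-written stand-ins shaped like `.annotate()` output, not real pipeline results.

```python
from typing import Dict, List, Union

AnnotateResult = Dict[str, List[str]]


def as_list(
    result: Union[AnnotateResult, List[AnnotateResult]]
) -> List[AnnotateResult]:
    """Wrap a single-document result in a list so callers can always iterate."""
    return result if isinstance(result, list) else [result]


# Abbreviated, hand-written examples of the two possible shapes.
single = {"merged_chunk": ["CREDIT AGREEMENT", "April 29, 2010"]}
batch = [single, {"merged_chunk": ["GLATFELTER COMPANY"]}]

for doc in as_list(single) + as_list(batch):
    print(doc["merged_chunk"])
```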

In [33]:
text = results_collected[1].page_clean

In [34]:
lp_results = light_model.annotate(text)
lp_results.keys()

dict_keys(['ner_whereas', 'document', 'merged_chunk', 'ner_chunk_generic', 'token', 'page_sentence', 'embeddings', 'ner_chunk_whereas', 'roberta', 'ner_generic'])

In [35]:
# List with all the chunks
lp_results["merged_chunk"]

['CREDIT AGREEMENT',
 'CREDIT AGREEMENT',
 'April 29, 2010',
 'GLATFELTER COMPANY',
 'Pennsylvania',
 'the "COMPANY")',
 'BORROWER',
 'PNC BANK, NATIONAL ASSOCIATION,',
 'Agreement',
 'PNC CAPITAL MARKETS LLC',
 'CITIZENS BANK OF PENNSYLVANIA',
 'CITIZENS BANK OF PENNSYLVANIA',
 'the Borrowers',
 'have requested',
 'the Lenders to provide a revolving credit facility',
 'proceeds of the',
 'credit facility',
 'WHEREAS',
 'the Lenders',
 'are willing to provide',
 'such credit',
 'the parties',
 'DEFINITIONS 1.1 CERTAIN DEFINITIONS',
 'Agreement',
 '2006',
 '2006',
 'May 1, 2016',
 'the Loan Parties',
 '2010',
 '2010',
 'May 1, 2016',
 'the Loan Parties',
 'RECEIVABLE FACILITY',
 'Company',
 'the Receivables Entity',
 'a Permitted Accounts Receivable Program',
 'ADMINISTRATIVE AGENT',
 'Section 9.15',
 "AGENT'S LETTER",
 'Section 9.15',
 'AFFILIATE',
 'Person',
 'Person',
 'contract',
 'AGREEMENT',
 'Credit Agreement',
 'Executive Order No. 13224',
 'the USA Patriot',
 'Act',
 'Act',
 "t

We can see that `.annotate()` didn't return the labels in the `merged_chunk` item. How can we obtain them? By using `.fullAnnotate()` instead. This method always returns a list.

In [36]:
lp_results_full = light_model.fullAnnotate(text)
lp_results_full[0].keys()

dict_keys(['ner_whereas', 'document', 'merged_chunk', 'ner_chunk_generic', 'token', 'page_sentence', 'embeddings', 'ner_chunk_whereas', 'roberta', 'ner_generic'])

In [37]:
lp_results_full[0]["merged_chunk"]

[Annotation(chunk, 0, 15, CREDIT AGREEMENT, {'entity': 'DOC', 'confidence': '0.99310005', 'ner_source': 'ner_chunk_generic', 'chunk': '0', 'sentence': '0'}, []),
 Annotation(chunk, 22, 37, CREDIT AGREEMENT, {'entity': 'DOC', 'confidence': '0.9901', 'ner_source': 'ner_chunk_generic', 'chunk': '1', 'sentence': '0'}, []),
 Annotation(chunk, 54, 67, April 29, 2010, {'entity': 'DATE', 'confidence': '0.8848001', 'ner_source': 'ner_chunk_generic', 'chunk': '2', 'sentence': '0'}, []),
 Annotation(chunk, 100, 117, GLATFELTER COMPANY, {'entity': 'ORG', 'confidence': '0.6135', 'ner_source': 'ner_chunk_generic', 'chunk': '3', 'sentence': '1'}, []),
 Annotation(chunk, 122, 133, Pennsylvania, {'entity': 'LOC', 'confidence': '0.7951', 'ner_source': 'ner_chunk_generic', 'chunk': '4', 'sentence': '1'}, []),
 Annotation(chunk, 149, 162, the "COMPANY"), {'entity': 'ORG', 'confidence': '0.78772503', 'ner_source': 'ner_chunk_generic', 'chunk': '5', 'sentence': '1'}, []),
 Annotation(chunk, 246, 253, BORROW

Now we can see all the metadata in the annotation objects. Let's get the results in a tabular form.

In [38]:
results_tabular = []
for res in lp_results_full[0]["merged_chunk"]:
    results_tabular.append(
        (
            res.begin,
            res.end,
            res.result,
            res.metadata["entity"],
            res.metadata["confidence"],
        )
    )

import pandas as pd

pd.DataFrame(results_tabular, columns=["begin", "end", "chunk", "entity", "confidence"])


|   | begin | end | chunk | entity | confidence |
|---|-------|-----|-------|--------|------------|
| 0 | 0 | 15 | CREDIT AGREEMENT | DOC | 0.99310005 |
| 1 | 22 | 37 | CREDIT AGREEMENT | DOC | 0.9901 |
| 2 | 54 | 67 | April 29, 2010 | DATE | 0.8848001 |
| 3 | 100 | 117 | GLATFELTER COMPANY | ORG | 0.6135 |
| 4 | 122 | 133 | Pennsylvania | LOC | 0.7951 |
| 5 | 149 | 162 | the "COMPANY") | ORG | 0.78772503 |
| 6 | 246 | 253 | BORROWER | ROLE | 0.9892 |
| 7 | 379 | 409 | PNC BANK, NATIONAL ASSOCIATION, | ORG | 0.7839833 |
| 8 | 463 | 471 | Agreement | DOC | 0.9553 |
| 9 | 618 | 640 | PNC CAPITAL MARKETS LLC | PARTY | 0.8001 |
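
One detail worth noting: the values in an annotation's metadata are strings (for example `'0.9901'`), so the confidence has to be cast to a number before it can be compared. The sketch below groups chunks by entity and keeps only the confident ones. The rows are copied from the table above, while the `confident_chunks` helper and the `0.8` threshold are arbitrary assumptions for illustration.

```python
from collections import defaultdict

# (begin, end, chunk, entity, confidence) tuples copied from the table above;
# confidence stays a string because annotation metadata stores strings.
rows = [
    (0, 15, "CREDIT AGREEMENT", "DOC", "0.99310005"),
    (22, 37, "CREDIT AGREEMENT", "DOC", "0.9901"),
    (54, 67, "April 29, 2010", "DATE", "0.8848001"),
    (100, 117, "GLATFELTER COMPANY", "ORG", "0.6135"),
    (122, 133, "Pennsylvania", "LOC", "0.7951"),
    (246, 253, "BORROWER", "ROLE", "0.9892"),
]


def confident_chunks(rows, threshold=0.8):
    """Group chunk texts by entity, keeping only rows at or above the threshold."""
    grouped = defaultdict(list)
    for _begin, _end, chunk, entity, confidence in rows:
        if float(confidence) >= threshold:  # cast the string before comparing
            grouped[entity].append(chunk)
    return dict(grouped)


print(confident_chunks(rows))
```

With this threshold the `ORG` and `LOC` rows are dropped, since their confidences fall below 0.8.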


##📌 Multilabel classification

In this section we will use the `MultiClassifierDL` annotator to identify more than one class in a text.

In [39]:
document_assembler = (
    nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
)
embeddings = (
    nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_uncased_legal", "en")
    .setInputCols("document")
    .setOutputCol("sentence_embeddings")
)

multiClassifier = (
    nlp.MultiClassifierDLModel.pretrained("legmulticlf_edgar", "en", "legal/models")
    .setInputCols(["sentence_embeddings"])
    .setOutputCol("class")
)

clf_pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, multiClassifier])


light_pipeline = nlp.LightPipeline(
    clf_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
)

sent_bert_base_uncased_legal download started this may take some time.
Approximate size to download 390.8 MB
[OK!]
legmulticlf_edgar download started this may take some time.
Approximate size to download 13.3 MB
[OK!]


We will experiment on simpler texts to showcase its capabilities.

In [40]:
result = light_pipeline.annotate(
    """No failure or delay by the Administrative Agent or any Lender in exercising any right or power hereunder shall operate as a waiver thereof, nor shall any single or partial exercise of any such right or power, or any abandonment or discontinuance of steps to enforce such a right or power, preclude any other or further exercise thereof or the exercise of any other right or power. The rights and remedies of the Administrative Agent and the Lenders hereunder are cumulative and are not exclusive of any rights or remedies that they would otherwise have. No waiver of any provision of this Agreement or consent to any departure by the Borrower therefrom shall in any event be effective unless the same shall be permitted by paragraph (b) of this Section, and then such waiver or consent shall be effective only in the specific instance and for the purpose for which given. Without limiting the generality of the foregoing, the making of a Loan shall not be construed as a waiver of any Default, regardless of whether the Administrative Agent or any Lender may have had notice or knowledge of such Default at the time.
    """.lower()
)

result["class"]

['waivers']

In [41]:
result = light_pipeline.annotate(
    """The provisions of this Agreement shall be binding upon and inure to the benefit of the parties hereto and their respective successors and assigns permitted hereby (including any Affiliate of the Issuing Bank that issues any Letter of Credit), except that (i) the Borrower may not assign or otherwise transfer any of its rights or obligations hereunder without the prior written consent of each Lender (and any attempted assignment or transfer by the Borrower without such consent shall be null and void) and (ii) no Lender may assign or otherwise transfer its rights or obligations hereunder except in accordance with this Section. Nothing in this Agreement, expressed or implied, shall be construed to confer upon any Person (other than the parties hereto, their respective successors and assigns permitted hereby (including any Affiliate of the Issuing Bank that issues any Letter of Credit), Participants (to the extent provided in paragraph (c) of this Section) and, to the extent expressly contemplated hereby, the Related Parties of each of the Administrative Agent, the Issuing Bank and the Lenders) any legal or equitable right, remedy or claim under or by reason of this Agreement.
    """.lower()
)

result["class"]

[]

In [42]:
result = light_pipeline.annotate(
    """After the effectiveness of this Amendment, the representations and warranties of the Borrower set forth in the Credit Agreement and in the other Loan Documents are true and correct in all material respects on and as of the date hereof, with the same force and effect as if made on and as of such date, except to the extent that such representations and warranties (i) specifically refer to an earlier date, in which case they shall be true and correct in all material respects as of such earlier date (except to the extent of changes in facts or circumstances that have been disclosed to the Lenders and do not constitute an Event of Default or a Potential Default under the Credit Agreement or any other Loan Document), and (ii) are already qualified by materiality, in which case they shall be true and correct in all respects, and except that for purposes of this Section 4.1 , the representations and warranties contained in Section 7.6 of the Credit Agreement shall be deemed to refer to the most recent financial statements furnished pursuant to Section 8.1(a) of the Credit Agreement.
    """.lower()
)

result["class"] 

['warranties', 'representations']

In [43]:
result = light_pipeline.annotate("""All notices and other communications provided for in this Agreement and the other Loan Documents shall be in writing and may (subject to paragraph (b) below) be telecopied (faxed), mailed by certified mail return receipt requested, or delivered by hand or overnight courier service to the intended recipient at the addresses specified below or at such other address as shall be designated by any party listed below in a notice to the other parties listed below given in accordance with this Section.""".lower())

result["class"]

['notices']
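
The results above include one label, two labels, and an empty list. That variety is expected in multilabel classification: each class is scored independently, and only the classes whose score passes a threshold are returned. Here is a minimal sketch of that selection step in plain Python. The `select_labels` helper, the 0.5 cut-off, and the scores are assumptions made up for illustration; they are not outputs of `legmulticlf_edgar`.

```python
from typing import Dict, List


def select_labels(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Keep every label whose independent score reaches the threshold."""
    return [label for label, score in scores.items() if score >= threshold]


# Made-up scores for illustration only.
two_labels = {"warranties": 0.91, "representations": 0.84, "notices": 0.03}
no_label = {"warranties": 0.12, "representations": 0.20, "notices": 0.31}

print(select_labels(two_labels))
print(select_labels(no_label))
```

When no score reaches the threshold, the selection is empty, which is how a text can end up with no class at all.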

##📌 Identify obligations using Transformers (Bert) models

In this section, we will illustrate how to use a different NER annotator that is based on the Transformer architecture.

The `BertForTokenClassification` annotator can load Bert models with a token classification head on top (a linear layer on top of the hidden-states output), e.g. for Named-Entity Recognition (NER) tasks.

Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. To see which models are compatible and how to import them see [Import Transformers into Spark NLP 🚀](https://github.com/JohnSnowLabs/spark-nlp/discussions/5669).

Using these models is very similar to the `NerModel` we used before. We adjust the pipeline by adding the `BertForTokenClassification` step instead of `NerModel`, and we don't need to add the `Embeddings` step as it is already part of the new annotator.

Then, the pipeline is just:

In [7]:
def bert_pipeline(model_name="legner_obligations", language="en"):
    documentAssembler = (
        nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
    )

    tokenizer = nlp.Tokenizer().setInputCols("document").setOutputCol("token")

    tokenClassifier = (
        legal.BertForTokenClassification.pretrained(
            model_name, language, "legal/models"
        )
        .setInputCols(["token", "document"])
        .setOutputCol("label")
        .setCaseSensitive(True)
    )

    ner_converter = (
        nlp.NerConverter()
        .setInputCols(["document", "token", "label"])
        .setOutputCol("ner_chunk")
    )

    pipeline = nlp.Pipeline(
        stages=[documentAssembler, tokenizer, tokenClassifier, ner_converter]
    )

    empty_data = spark.createDataFrame([[""]]).toDF("text")

    model = pipeline.fit(empty_data)
    return model

For Legal NLP we currently have Bert models for English, German, Arabic, and (Brazilian) Portuguese available, but we are constantly adding new models with every release. The model we will use here is the `legner_obligations`, which identifies obligations (what the different parties commit to do) in agreement documents.

This model extracts the subject (who commits to doing something), the action (the verb: will provide, shall sign…) and the object (what the subject will provide, what the subject shall sign, etc.). Also, if the recipient of the obligation is a third party (a subject will provide to the Company X …), then that third party (Company X) will be extracted as an indirect object.

The model was trained on in-house annotated documents from the CUAD dataset.



In [8]:
bert_model = bert_pipeline("legner_obligations", "en")
bert_model.stages

legner_obligations download started this may take some time.
[OK!]


[DocumentAssembler_95ad50798e30,
 REGEX_TOKENIZER_67928cc401e1,
 BERT_FOR_TOKEN_CLASSIFICATION_d6615a22d5c2,
 NerConverter_a3663db7abd4]

In [9]:
bert_model.stages[-2].getClasses()

['B-OBLIGATION_ACTION',
 'I-OBLIGATION_INDIRECT_OBJECT',
 'I-OBLIGATION',
 'B-OBLIGATION_INDIRECT_OBJECT',
 'PAD',
 'I-OBLIGATION_SUBJECT',
 'I-OBLIGATION_ACTION',
 'O',
 'B-OBLIGATION_SUBJECT',
 'B-OBLIGATION']

In [10]:
import pyspark.sql.functions as F

text = """The Buyer shall use such materials and supplies only in accordance with the present agreement"""

res = bert_model.transform(spark.createDataFrame([[text]]).toDF("text"))

result_df = res.select(
    F.explode(
        F.arrays_zip(res.token.result, res.label.result, res.label.metadata)
    ).alias("cols")
).select(
    F.expr("cols['0']").alias("token"),
    F.expr("cols['1']").alias("label"),
    F.expr("cols['2']['confidence']").alias("confidence"),
)

result_df.show(truncate=100)

+----------+--------------------+----------+
|     token|               label|confidence|
+----------+--------------------+----------+
|       The|                   O|0.71809256|
|     Buyer|B-OBLIGATION_SUBJECT|0.86514723|
|     shall| B-OBLIGATION_ACTION|0.99315745|
|       use| I-OBLIGATION_ACTION| 0.9729679|
|      such|        B-OBLIGATION| 0.7499739|
| materials|        I-OBLIGATION| 0.9127689|
|       and|        I-OBLIGATION|0.88955635|
|  supplies|        I-OBLIGATION| 0.9182221|
|      only|        I-OBLIGATION|0.82361615|
|        in|        I-OBLIGATION| 0.8662357|
|accordance|        I-OBLIGATION| 0.9251934|
|      with|        I-OBLIGATION| 0.8835488|
|       the|        I-OBLIGATION|0.53246284|
|   present|        I-OBLIGATION| 0.8670555|
| agreement|        I-OBLIGATION| 0.8018013|
+----------+--------------------+----------+
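
The token-level labels follow the BIO scheme: `B-` opens an entity, `I-` continues it, and `O` marks tokens outside any entity. The sketch below shows the idea behind turning these tags into chunks, using the tokens and labels from the table above. It is a simplified illustration in plain Python, not the implementation of `NerConverter`; the `group_bio` helper is hypothetical and joins tokens with single spaces.

```python
from typing import List, Tuple


def group_bio(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Join B-/I- tagged tokens into (chunk_text, entity) pairs."""
    chunks: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_entity = None
    for token, label in zip(tokens, labels):
        prefix, _, entity = label.partition("-")
        if prefix == "I" and entity == current_entity:
            current_tokens.append(token)  # continue the open chunk
            continue
        if current_entity is not None:
            chunks.append((" ".join(current_tokens), current_entity))  # close it
        if prefix in ("B", "I"):
            # Open a new chunk (a stray I- tag is treated leniently as a start).
            current_tokens, current_entity = [token], entity
        else:
            current_tokens, current_entity = [], None  # "O" or "PAD"
    if current_entity is not None:
        chunks.append((" ".join(current_tokens), current_entity))
    return chunks


# Tokens and labels copied from the table above.
tokens = [
    "The", "Buyer", "shall", "use", "such", "materials", "and", "supplies",
    "only", "in", "accordance", "with", "the", "present", "agreement",
]
labels = (
    ["O", "B-OBLIGATION_SUBJECT", "B-OBLIGATION_ACTION", "I-OBLIGATION_ACTION"]
    + ["B-OBLIGATION"]
    + ["I-OBLIGATION"] * 10
)

for text, entity in group_bio(tokens, labels):
    print(f"{entity}: {text}")
```

Read this way, the sentence yields a subject (`Buyer`), an action (`shall use`), and the obligation itself (the rest of the sentence).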

