![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/16.0.Legal_Text_Generation.ipynb)

# **Legal Text Generation**

The Legal Text Generator uses the basic Flan-T5 model to perform various tasks related to legal text abstraction. With these models, a user can provide a prompt and context and instruct the system to perform a legal-specific task. Flan-T5 is an enhanced version of the original T5 model, designed to produce higher-quality and more coherent text. It is trained on a large dataset of diverse texts and can generate high-quality summaries of articles, documents, and other text-based inputs.


Available models can be found at the [Models Hub](https://nlp.johnsnowlabs.com/models?task=Text+Generation&edition=Legal+NLP).


# Colab Setup

To run this yourself, you will need to upload your license keys to the notebook. Run the cell below to do that. Alternatively, you can open the file explorer on the left side of the screen and upload `license_keys.json` to the folder that opens.
Otherwise, you can look at the example outputs at the bottom of the notebook.

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()
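
Before installing, it can help to confirm that the uploaded file is readable JSON. The following is a minimal sketch, not part of the johnsnowlabs library: the helper name `summarize_license_file` is hypothetical, and it assumes only that the license is a JSON object. It returns the key names and never prints the secret values.

```python
import json
from pathlib import Path


def summarize_license_file(path):
    """Return the sorted key names of a JSON license file without exposing values."""
    license_path = Path(path)
    if not license_path.is_file():
        raise FileNotFoundError(f"No license file found at {license_path}")
    with license_path.open(encoding="utf-8") as handle:
        data = json.load(handle)
    # Assumption: the license is stored as a single JSON object (key/value pairs).
    if not isinstance(data, dict):
        raise ValueError("Expected the license file to contain a JSON object")
    return sorted(data)
```

In the notebook, this could be called with the name of the file you uploaded, for example `summarize_license_file("license_keys.json")`.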

In [None]:
from johnsnowlabs import nlp, legal
# After uploading your license, run this to install all licensed Python wheels and pre-download the jars for the Spark session JVM
nlp.install()

In [None]:
from johnsnowlabs import nlp, legal
import pandas as pd

# Automatically load license data and start a session with all jars user has access to
spark = nlp.start()

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_7162 (7).json
👌 Launched cpu optimized session with with: 🚀Spark-NLP==4.4.1, 💊Spark-Healthcare==4.4.3, running on ⚡ PySpark==3.1.2


In [None]:
from pyspark.sql import DataFrame
import pyspark.sql.functions as F
import pyspark.sql.types as T
import pyspark.sql as SQL
from pyspark import keyword_only

# Text Generation Models

<div align="center">

| **Index** | **Text Generator Models**        |
|---------------|----------------------|
| 1        |  [leggen_flant5_finetuned](https://nlp.johnsnowlabs.com/2023/04/29/leggen_flant5_finetuned_en.html)     |
| 2      | [leggen_flant5_base](https://nlp.johnsnowlabs.com/2023/04/21/leggen_flant5_base_en.html)    |



</div>

## **leggen_flant5_base**

The `leggen_flant5_base` model has been fine-tuned on FLAN-T5 using legal texts. FLAN-T5 is a state-of-the-art language model developed by Google AI that uses the T5 architecture for text generation tasks.

In [None]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("prompt")

flant5 = legal.TextGenerator.pretrained("leggen_flant5_base","en","legal/models")\
    .setInputCols(["prompt"])\
    .setOutputCol("generated_text")\
    .setMaxNewTokens(200)\
    .setTopK(3)\
    .setRandomSeed(42)\
    .setNoRepeatNgramSize(3)\
    .setStopAtEos(True)
  
pipeline = nlp.Pipeline(stages=[document_assembler, flant5])

model = pipeline.fit(spark.createDataFrame([["", ""]]).toDF("id", "text"))


leggen_flant5_base download started this may take some time.
[OK!]
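
The generator above is configured with `setTopK(3)` and `setRandomSeed(42)`. As a rough illustration of what top-k sampling means in general (this is not the library's internal implementation), the sketch below keeps the `k` highest-scoring candidates, renormalizes them with a softmax, and draws one with a seeded random generator. The function name `sample_top_k` and the toy scores are hypothetical.

```python
import math
import random


def sample_top_k(logits, k, rng):
    """Sample one index from the k highest-scoring entries of `logits`."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Keep the indices of the k largest scores; everything else is excluded.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the kept scores only (subtract the max for numerical stability).
    peak = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - peak) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


# Toy scores for a five-token vocabulary (illustrative values only).
toy_logits = [0.1, 2.0, -1.0, 1.5, 0.3]
rng = random.Random(42)  # a fixed seed makes the draws repeatable
draws = [sample_top_k(toy_logits, 3, rng) for _ in range(10)]
print(draws)  # only indices 1, 3 and 4 can appear
```

A smaller `k` restricts generation to the most likely tokens, while a larger `k` allows more varied output. Fixing the seed makes repeated runs comparable.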


In [None]:
data = spark.createDataFrame([[1, "Explain loan clauses."],
                              [2, "This exhibit has been redacted and is the subject of a confidential treatment request. Redacted material is marked with [* * *] and has been filed separately with the securities and exchange commission."],
                              [3, "Certificate of common stock (incorporated by reference to exhibit 4"]]).toDF("id", "text")

result = model.transform(data)

result.select("id", "text", "generated_text.result").show(truncate=False)

+---+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------+
|id |text                                                                                                                                                                                                      |result                                                                 |
+---+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------+
|1  |Explain loan clauses.                                                                                                                                   

## **leggen_flant5_finetuned**

This text generation model has been fine-tuned on FLAN-T5 using legal texts.


In [None]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("prompt")

flant5 = legal.TextGenerator.pretrained("leggen_flant5_finetuned","en","legal/models")\
    .setInputCols(["prompt"])\
    .setOutputCol("generated_text")\
    .setMaxNewTokens(200)\
    .setTopK(3)\
    .setRandomSeed(42)\
    .setNoRepeatNgramSize(3)\
    .setStopAtEos(True)
 
pipeline = nlp.Pipeline(stages=[document_assembler, flant5])

model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

leggen_flant5_finetuned download started this may take some time.
[OK!]
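
The configuration above also sets `setNoRepeatNgramSize(3)`. As a hedged illustration of the general idea behind no-repeat n-gram constraints (not the library's actual code), the sketch below computes which next tokens would recreate an n-gram that has already been generated. The helper name `banned_next_tokens` and the whitespace tokenization are assumptions made for the example.

```python
def banned_next_tokens(tokens, ngram_size):
    """Return the tokens that would complete an n-gram already present in `tokens`."""
    if ngram_size <= 0 or len(tokens) < ngram_size - 1:
        return set()
    # The last (n - 1) tokens form the prefix of the n-gram about to be completed.
    prefix = tuple(tokens[len(tokens) - ngram_size + 1:])
    banned = set()
    for start in range(len(tokens) - ngram_size + 1):
        ngram = tuple(tokens[start:start + ngram_size])
        # If an earlier n-gram starts with the same prefix, its last token is banned.
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


generated = "the loan is due the loan".split()
print(banned_next_tokens(generated, 3))  # {'is'}
```

With a size of 3, the sequence "the loan is" may appear only once, which discourages the looping phrases that small generative models sometimes produce.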


In [None]:
data = spark.createDataFrame([[1, "Explain loan clauses."],
                              [2, "This exhibit has been redacted and is the subject of a confidential treatment request. Redacted material is marked with [* * *] and has been filed separately with the securities and exchange commission."],
                              [3, "Certificate of common stock (incorporated by reference to exhibit 4"]]).toDF("id", "text")

result = model.transform(data)

result.select("id", "text", "generated_text.result").show(truncate=False)

+---+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

### **Using LightPipeline**

In [None]:
text = ["Explain loan clauses.",
        "This exhibit has been redacted and is the subject of a confidential treatment request. Redacted material is marked with [* * *] and has been filed separately with the securities and exchange commission.",
        "Certificate of common stock (incorporated by reference to exhibit 4"]

light_model = nlp.LightPipeline(model)

all_result = []

for t in text:
  light_result = light_model.annotate(t)
  all_result.append(light_result)

all_result

[{'prompt': ['Explain loan clauses.'],
  'generated_text': ['The loan clauses in the agreement should include the terms of the loan, the terms and conditions of the agreement, and the terms that the loan is due. The loan should also include the amount of the interest paid on the loan and the amount due to the loan. The lenders should also provide a detailed explanation of the terms, conditions, and terms of each loan. This should include any additional fees or costs associated with the loan or the loan itself. The lender should also ensure that the terms are clear and concise.']},
 {'prompt': ['This exhibit has been redacted and is the subject of a confidential treatment request. Redacted material is marked with [* * *] and has been filed separately with the securities and exchange commission.'],
  'generated_text': ['The redacted material is confidential and will not be disclosed to any third party without the prior written consent of the parties. The parties agree to use their best e

In [None]:
import textwrap

for result in all_result:
  # Wrap long lines at 120 characters so the prompt and the generated text are easy to read
  prompt_text = textwrap.fill(result['prompt'][0], width=120)
  generated = textwrap.fill(result['generated_text'][0], width=120)

  print("➤ Input: \n{}".format(prompt_text))
  print("\n")
  print("➤ Output: \n{}".format(generated))
  print("\n")

➤ Input: 
Explain loan clauses.


➤ Output: 
The loan clauses in the agreement should include the terms of the loan, the terms and conditions of the agreement, and
the terms that the loan is due. The loan should also include the amount of the interest paid on the loan and the amount
due to the loan. The lenders should also provide a detailed explanation of the terms, conditions, and terms of each
loan. This should include any additional fees or costs associated with the loan or the loan itself. The lender should
also ensure that the terms are clear and concise.


➤ Input: 
This exhibit has been redacted and is the subject of a confidential treatment request. Redacted material is marked with
[* * *] and has been filed separately with the securities and exchange commission.


➤ Output: 
The redacted material is confidential and will not be disclosed to any third party without the prior written consent of
the parties. The parties agree to use their best efforts to protect the confidential

Our fine-tuned model returns better results. You can change the model parameters, such as `setMaxNewTokens`, `setTopK`, and `setNoRepeatNgramSize`, to get the most relevant result for your prompt.
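
One possible way to experiment with different settings is to keep them in a dictionary and apply them through the setter methods used in this notebook. The helper `apply_generation_params` and the `StubGenerator` class below are hypothetical. The stub only stands in for the `flant5` annotator so that the sketch runs without a Spark session.

```python
def apply_generation_params(generator, params):
    """Call generator.set<Name>(value) for each entry in `params`; return the generator."""
    for name, value in params.items():
        setter = getattr(generator, "set" + name, None)
        if not callable(setter):
            raise AttributeError(f"Generator has no setter named set{name}")
        setter(value)
    return generator


class StubGenerator:
    """Stand-in for the notebook's `flant5` annotator (illustration only)."""

    def __init__(self):
        self.settings = {}

    def setMaxNewTokens(self, value):
        self.settings["MaxNewTokens"] = value
        return self

    def setTopK(self, value):
        self.settings["TopK"] = value
        return self

    def setNoRepeatNgramSize(self, value):
        self.settings["NoRepeatNgramSize"] = value
        return self


# Example values only; suitable settings depend on the prompt and the model.
trial = {"MaxNewTokens": 100, "TopK": 5, "NoRepeatNgramSize": 2}
stub = apply_generation_params(StubGenerator(), trial)
print(stub.settings)
```

In the notebook, the same call could be made on `flant5` before the pipeline is built and fitted, so that each set of parameters gets its own fitted model to compare.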