![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/14.0.Date_Normalizer.ipynb)

# Legal Date Normalizer

## Colab Setup

In [1]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs 

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, legal

# After uploading your license, run this to install all licensed Python wheels and pre-download the jars for the Spark session JVM
nlp.install()

In [None]:
from johnsnowlabs import nlp, legal

# Automatically load license data and start a session with all jars user has access to
spark = nlp.start()

In [5]:
from pyspark.sql.types import StructType, IntegerType, StringType
import pyspark.sql.functions as F

## **Date Normalizer**

`DateNormalizer` is an annotator that identifies dates in chunk annotations and transforms them to a normalized date in the format `YYYY/MM/DD`.
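To make the target format concrete, here is a minimal pure-Python sketch of what normalizing a date string to `YYYY/MM/DD` means. It is not how `DateNormalizer` is implemented. The helper name `to_yyyy_mm_dd` and the list of input patterns are assumptions chosen for illustration, and the sketch relies only on the standard `datetime` module with English month names.

```python
from datetime import datetime

# Hypothetical list of input patterns, for illustration only.
# The real annotator handles its own (larger) set of formats.
PATTERNS = ["%B %d, %Y", "%b %d, %Y", "%d%b%Y", "%m/%d/%Y", "%d.%m.%Y"]

def to_yyyy_mm_dd(text):
    """Return the date as YYYY/MM/DD, or None if no pattern matches."""
    for pattern in PATTERNS:
        try:
            return datetime.strptime(text, pattern).strftime("%Y/%m/%d")
        except ValueError:
            # This pattern did not match; try the next one.
            continue
    return None

for raw in ["June 1, 2023", "Jan 15, 2024", "30Sep2023", "August 31, 2023"]:
    print(raw, "->", to_yyyy_mm_dd(raw))
```

Trying patterns in order is the simplest possible design. It is enough to show the mapping from several surface forms to one canonical form, which is the job the annotator performs on NER chunks.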



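We first build a pipeline that extracts `DATE` chunks with a legal NER model and normalizes them, and then run it on texts containing dates in different formats: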

In [6]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")
    
sentence_detector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en") \
    .setInputCols("sentence", "token") \
    .setOutputCol("embeddings")\
    .setMaxSentenceLength(512)\
    .setCaseSensitive(True)

ner_model = legal.NerModel.pretrained('legner_deid', 'en', 'legal/models')\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = legal.NerConverterInternal()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("date_chunk")\
    .setWhiteList(["DATE"])

date_normalizer = legal.DateNormalizer()\
    .setInputCols('date_chunk')\
    .setOutputCol('normalized_date')

nlpPipeline = nlp.Pipeline(stages=[
        document_assembler,
        sentence_detector,
        tokenizer,
        embeddings,
        ner_model,
        ner_converter,
        date_normalizer])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

roberta_embeddings_legal_roberta_base download started this may take some time.
Approximate size to download 447.2 MB
[OK!]
legner_deid download started this may take some time.
[OK!]


In [7]:
dates = [
"""The contract between Party A and Party B shall commence on June 1, 2023, and shall remain in effect for a period of five years""",

"""The plaintiff must file a response to the defendant's motion no later than Jan 15, 2024, in accordance with the rules of civil procedure.""",

"""The deadline for submitting all required documentation for the scholarship application is 30Sep2023.""",

"""The parties agree to engage in mediation within 30 days of the occurrence of any dispute, starting from the date of the written notice of dispute, which shall be sent no later than August 31, 2023.""",

"""On 01/2023, the parties entered into a legally binding agreement, as evidenced by their signatures on the document."""
]

df_dates = spark.createDataFrame(dates,StringType()).toDF('text')

df_dates.show(truncate=100)

+----------------------------------------------------------------------------------------------------+
|                                                                                                text|
+----------------------------------------------------------------------------------------------------+
|The contract between Party A and Party B shall commence on June 1, 2023, and shall remain in effe...|
|The plaintiff must file a response to the defendant's motion no later than Jan 15, 2024, in accor...|
|The deadline for submitting all required documentation for the scholarship application is 30Sep2023.|
|The parties agree to engage in mediation within 30 days of the occurrence of any dispute, startin...|
|On 01/2023, the parties entered into a legally binding agreement, as evidenced by their signature...|
+----------------------------------------------------------------------------------------------------+



In [8]:
result = model.transform(df_dates)

We are going to show how the date is normalized.

In [9]:
result_df = result.select("text",F.explode(F.arrays_zip(result.date_chunk.result, result.normalized_date.result)).alias("cols")) \
                  .select("text",F.expr("cols['0']").alias("date_chunk"),
                                 F.expr("cols['1']").alias("normalized_date"))
                  
result_df.show(truncate=150)

+------------------------------------------------------------------------------------------------------------------------------------------------------+---------------+---------------+
|                                                                                                                                                  text|     date_chunk|normalized_date|
+------------------------------------------------------------------------------------------------------------------------------------------------------+---------------+---------------+
|                        The contract between Party A and Party B shall commence on June 1, 2023, and shall remain in effect for a period of five years|   June 1, 2023|     2023/06/01|
|             The plaintiff must file a response to the defendant's motion no later than Jan 15, 2024, in accordance with the rules of civil procedure.|   Jan 15, 2024|     2024/01/15|
|                                                  The deadline for submitt

### Replacer

In [10]:
date_normalizer = legal.DateNormalizer()\
    .setInputCols('date_chunk')\
    .setOutputCol('normalized_date')

replacer = legal.Replacer()\
    .setInputCols(["normalized_date","sentence"])\
    .setOutputCol("replaced_document")

nlpPipeline = nlp.Pipeline(stages=[
      document_assembler, 
      sentence_detector,
      tokenizer,
      embeddings,
      ner_model,
      ner_converter,
      date_normalizer,
      replacer])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

In [11]:
result = model.transform(df_dates)

result_df = result.select("text",F.explode(F.arrays_zip(result.date_chunk.result, 
                                                        result.normalized_date.result,
                                                        result.replaced_document.result)).alias("cols")) \
                  .select("text",F.expr("cols['1']").alias("normalized_date"),
                                 F.expr("cols['2']").alias("replaced_document"))
                  
result_df.show(truncate=False)

+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|text                                                                                                                                                                                                 |normalized_date|replaced_document                                                                                                                                                                               |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

### Date Format

With the `setOutputDateformat` option of `DateNormalizer`, the output can be switched to the `us` format (`MM/DD/YYYY`) or the `eu` format (`DD/MM/YYYY`).
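As a quick illustration of the difference between the two layouts, the sketch below formats the same date both ways with the standard library. The `OUTPUT_FORMATS` mapping and the `format_date` helper are hypothetical names used only for this example; they are not part of the `DateNormalizer` API.

```python
from datetime import date

# Hypothetical mapping from the option value to a strftime pattern.
OUTPUT_FORMATS = {"us": "%m/%d/%Y", "eu": "%d/%m/%Y"}

def format_date(d, style):
    """Format a date as MM/DD/YYYY ('us') or DD/MM/YYYY ('eu')."""
    return d.strftime(OUTPUT_FORMATS[style])

d = date(2023, 6, 1)
print(format_date(d, "us"))  # 06/01/2023
print(format_date(d, "eu"))  # 01/06/2023
```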

In [12]:
date_normalizer_us = legal.DateNormalizer()\
    .setInputCols('date_chunk')\
    .setOutputCol('normalized_date_us')\
    .setOutputDateformat('us')

date_normalizer_eu = legal.DateNormalizer()\
    .setInputCols('date_chunk')\
    .setOutputCol('normalized_date_eu')\
    .setOutputDateformat('eu')

nlpPipeline = nlp.Pipeline(stages=[
      document_assembler, 
      sentence_detector,
      tokenizer,
      embeddings,
      ner_model,
      ner_converter,
      date_normalizer_us,
      date_normalizer_eu
      ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

In [13]:
result = model.transform(df_dates)

In [14]:
result_df = result.select("text",F.explode(F.arrays_zip(result.date_chunk.result, 
                                                        result.normalized_date_us.result,
                                                        result.normalized_date_eu.result)).alias("cols")) \
                  .select("text",F.expr("cols['0']").alias("date_chunk"),
                                 F.expr("cols['1']").alias("normalized_date_us"),
                                 F.expr("cols['2']").alias("normalized_date_eu"))
                  
result_df.show(truncate=False)

+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------+------------------+------------------+
|text                                                                                                                                                                                                 |date_chunk     |normalized_date_us|normalized_date_eu|
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------+------------------+------------------+
|The contract between Party A and Party B shall commence on June 1, 2023, and shall remain in effect for a period of five years                                                                       |June 1, 2023   |06/01/2023        |01/0

### Default Replacement

If the day, month, or year is missing from a date, the following default values are used. Each default can be overridden with the corresponding setter:

- `setDefaultReplacementDay`: default value is 15
- `setDefaultReplacementMonth`: default value is 6 (June)
- `setDefaultReplacementYear`: default value is 2020
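The fill-in rule can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the annotator's code: `fill_missing` is a hypothetical helper, parsing is left out entirely, and the defaults (day 15, month 6, year 2020) are the ones listed above.

```python
def fill_missing(day=None, month=None, year=None,
                 default_day=15, default_month=6, default_year=2020):
    """Fill absent date components with defaults and return YYYY/MM/DD."""
    day = default_day if day is None else day
    month = default_month if month is None else month
    year = default_year if year is None else year
    return f"{year:04d}/{month:02d}/{day:02d}"

print(fill_missing(month=11, year=2018))  # 2018/11/15 (day missing)
print(fill_missing(day=5, month=1))       # 2020/01/05 (year missing)
print(fill_missing(year=2022))            # 2022/06/15 (day and month missing)
```

Passing different `default_*` arguments plays the role of `setDefaultReplacementDay`, `setDefaultReplacementMonth`, and `setDefaultReplacementYear` in this sketch.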


In [15]:
date_normalizer_us = legal.DateNormalizer()\
    .setInputCols('date_chunk')\
    .setOutputCol('normalized_date_us')\
    .setOutputDateformat('us')\
    .setDefaultReplacementDay(2)\
    .setDefaultReplacementMonth(3)\
    .setDefaultReplacementYear(2024)

nlpPipeline = nlp.Pipeline(stages=[
      document_assembler, 
      sentence_detector,
      tokenizer,
      embeddings,
      ner_model,
      ner_converter,
      date_normalizer_us
      ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

In [16]:
dates = [
'08/02',
'11/2018',
'03/2021',
'05 Jan',
'01/05',
'2022'
]

df_dates = spark.createDataFrame(dates,StringType()).toDF('text')

In [17]:
result = model.transform(df_dates)

In [18]:
result_df = result.select("text",F.explode(F.arrays_zip(result.date_chunk.result, 
                                                        result.normalized_date_us.result)).alias("cols")) \
                  .select("text",F.expr("cols['0']").alias("date_chunk"),
                                 F.expr("cols['1']").alias("normalized_date_us"))
                  
result_df.show(truncate=False)

+-------+----------+------------------+
|text   |date_chunk|normalized_date_us|
+-------+----------+------------------+
|08/02  |08/02     |08/02/2024        |
|11/2018|11/2018   |11/02/2018        |
|03/2021|03/2021   |03/02/2021        |
|05 Jan |05 Jan    |01/05/2024        |
|01/05  |01/05     |01/05/2024        |
|2022   |2022      |03/02/2022        |
+-------+----------+------------------+



### Only date chunk


We are going to create date chunks in different formats:

In [19]:
dates = [
'08/02/2018',
'11/2018',
'11/01/2018',
'12Mar2021',
'Jan 30, 2018',
'13.04.1999', 
'3April 2020',
'03/2021',
'05 Jan',
'01/05',
'2022'
]

In [20]:
df_dates = spark.createDataFrame(dates,StringType()).toDF('ner_chunk')

We are going to transform that text into documents with Spark NLP.

In [21]:
document_assembler = nlp.DocumentAssembler().setInputCol('ner_chunk').setOutputCol('document')
documents_DF = document_assembler.transform(df_dates)

After that, we are going to transform those documents into chunks.

In [22]:
chunks_df = nlp.map_annotations_col(documents_DF.select("document","ner_chunk"),
                    lambda x: [nlp.Annotation('chunk', a.begin, a.end, a.result, a.metadata, a.embeddings) for a in x], "document",
                    "chunk_date", "chunk")
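The lambda in the cell above only relabels each annotation: offsets, text, metadata, and embeddings are copied unchanged, and the type becomes `chunk`. The sketch below shows the same idea without Spark. `Ann` is a hypothetical stand-in for an annotation record, and its field names are assumptions made for this example.

```python
from collections import namedtuple

# Hypothetical stand-in for an annotation record (not the Spark NLP class).
Ann = namedtuple(
    "Ann", ["annotator_type", "begin", "end", "result", "metadata", "embeddings"]
)

def as_chunks(annotations):
    """Copy each annotation, changing only its type to 'chunk'."""
    return [
        Ann("chunk", a.begin, a.end, a.result, a.metadata, a.embeddings)
        for a in annotations
    ]

doc = [Ann("document", 0, 9, "08/02/2018", {"sentence": "0"}, [])]
print(as_chunks(doc))
```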

In [23]:
chunks_df.select('chunk_date').show(truncate=False)

+---------------------------------------------------+
|chunk_date                                         |
+---------------------------------------------------+
|[{chunk, 0, 9, 08/02/2018, {sentence -> 0}, []}]   |
|[{chunk, 0, 6, 11/2018, {sentence -> 0}, []}]      |
|[{chunk, 0, 9, 11/01/2018, {sentence -> 0}, []}]   |
|[{chunk, 0, 8, 12Mar2021, {sentence -> 0}, []}]    |
|[{chunk, 0, 11, Jan 30, 2018, {sentence -> 0}, []}]|
|[{chunk, 0, 9, 13.04.1999, {sentence -> 0}, []}]   |
|[{chunk, 0, 10, 3April 2020, {sentence -> 0}, []}] |
|[{chunk, 0, 6, 03/2021, {sentence -> 0}, []}]      |
|[{chunk, 0, 5, 05 Jan, {sentence -> 0}, []}]       |
|[{chunk, 0, 4, 01/05, {sentence -> 0}, []}]        |
|[{chunk, 0, 3, 2022, {sentence -> 0}, []}]         |
+---------------------------------------------------+



Now we are going to normalize those chunks using the DateNormalizer.

In [24]:
date_normalizer = legal.DateNormalizer().setInputCols('chunk_date').setOutputCol('date')

In [25]:
date_normalized_df = date_normalizer.transform(chunks_df)

We are going to show how the date is normalized.

In [26]:
dateNormalizedClean = date_normalized_df.selectExpr("ner_chunk","date.result as date_result","date.metadata as metadata")

dateNormalizedClean.withColumn("date_result", dateNormalizedClean["date_result"].getItem(0))\
                   .withColumn("metadata", dateNormalizedClean["metadata"].getItem(0)['normalized']).show(truncate=False)

+------------+-----------+--------+
|ner_chunk   |date_result|metadata|
+------------+-----------+--------+
|08/02/2018  |2018/08/02 |true    |
|11/2018     |2018/11/15 |true    |
|11/01/2018  |2018/11/01 |true    |
|12Mar2021   |2021/03/12 |true    |
|Jan 30, 2018|2018/01/30 |true    |
|13.04.1999  |1999/04/13 |true    |
|3April 2020 |2020/04/03 |true    |
|03/2021     |2021/03/15 |true    |
|05 Jan      |2020/01/05 |true    |
|01/05       |2020/01/05 |true    |
|2022        |2022/06/15 |true    |
+------------+-----------+--------+



## Relative Date

We can configure `anchorDateYear`, `anchorDateMonth`, and `anchorDateDay` to set the reference date used for relative dates.

In [27]:
rel_dates = [
'next monday',
'today',
'next week'
]

rel_dates_df = spark.createDataFrame(rel_dates,StringType()).toDF('ner_chunk')

In [28]:
rel_documents_DF = document_assembler.transform(rel_dates_df)

rel_chunks_df = nlp.map_annotations_col(rel_documents_DF.select("document","ner_chunk"),
                    lambda x: [nlp.Annotation('chunk', a.begin, a.end, a.result, a.metadata, a.embeddings) for a in x], "document",
                    "chunk_date", "chunk")

In [29]:
rel_chunks_df.select('chunk_date').show(truncate=False)

+--------------------------------------------------+
|chunk_date                                        |
+--------------------------------------------------+
|[{chunk, 0, 10, next monday, {sentence -> 0}, []}]|
|[{chunk, 0, 4, today, {sentence -> 0}, []}]       |
|[{chunk, 0, 8, next week, {sentence -> 0}, []}]   |
+--------------------------------------------------+



In the following example we use 2021/02/16 as the anchor date. To do so, we set `anchorDateYear` to 2021, `anchorDateMonth` to 2, and `anchorDateDay` to 16:

In [30]:
rel_date_normalizer = legal.DateNormalizer()\
                        .setInputCols('chunk_date')\
                        .setOutputCol('date')\
                        .setAnchorDateDay(16)\
                        .setAnchorDateMonth(2)\
                        .setAnchorDateYear(2021)

In [31]:
rel_date_normalized_df = rel_date_normalizer.transform(rel_chunks_df)
relDateNormalizedClean = rel_date_normalized_df.selectExpr("ner_chunk","date.result as date_result","date.metadata as metadata")
relDateNormalizedClean.withColumn("date_result", relDateNormalizedClean["date_result"].getItem(0))\
                      .withColumn("metadata", relDateNormalizedClean["metadata"].getItem(0)['normalized']).show(truncate=False)

+-----------+-----------+--------+
|ner_chunk  |date_result|metadata|
+-----------+-----------+--------+
|next monday|2021/02/22 |true    |
|today      |2021/02/16 |true    |
|next week  |2021/02/23 |true    |
+-----------+-----------+--------+



As you can see, relative dates such as `next monday`, `today`, and `next week` are resolved using `2021/02/16` as the reference date.
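The arithmetic behind these three results can be reproduced with the standard library. The sketch below is only an illustration: `resolve_relative` is a hypothetical helper that covers just the three expressions used here, and it makes no claim about how `DateNormalizer` resolves relative dates internally.

```python
from datetime import date, timedelta

def resolve_relative(expression, anchor):
    """Resolve a relative expression against an anchor date (YYYY/MM/DD)."""
    expression = expression.strip().lower()
    if expression == "today":
        resolved = anchor
    elif expression == "next week":
        resolved = anchor + timedelta(weeks=1)
    elif expression == "next monday":
        # weekday(): Monday is 0. If the anchor is a Monday, jump a full week.
        days_ahead = (0 - anchor.weekday()) % 7 or 7
        resolved = anchor + timedelta(days=days_ahead)
    else:
        raise ValueError(f"unsupported expression: {expression!r}")
    return resolved.strftime("%Y/%m/%d")

anchor = date(2021, 2, 16)  # a Tuesday
for text in ["next monday", "today", "next week"]:
    print(text, "->", resolve_relative(text, anchor))
```

With the anchor set to 2021/02/16, the sketch gives 2021/02/22, 2021/02/16, and 2021/02/23, matching the table above.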
