![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/25.Date_Normalizer.ipynb)

# 25. Date Normalizer

## Colab Setup

In [None]:
import json
import os

# Load the license keys and expose them as variables and environment variables
with open('license.json') as f:
    license_keys = json.load(f)

locals().update(license_keys)
os.environ.update(license_keys)

In [None]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.1.2 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

In [3]:
import json
import os
import pandas as pd

from pyspark.ml import Pipeline, PipelineModel
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

import sparknlp_jsl
import sparknlp

from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp.base import *
from sparknlp.util import *
from sparknlp.pretrained import ResourceDownloader


spark = sparknlp_jsl.start(license_keys['SECRET'])

print ("Spark NLP Version :", sparknlp.version())
print ("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark

Spark NLP Version : 3.4.2
Spark NLP_JSL Version : 3.5.0


## Date Normalizer

`DateNormalizer` is an annotator that identifies dates in chunk annotations and rewrites them in the normalized format `YYYY/MM/DD`.



We start by creating date chunks in several different formats:

In [4]:
dates = [
'08/02/2018',
'11/2018',
'11/01/2018',
'12Mar2021',
'Jan 30, 2018',
'13.04.1999', 
'3April 2020',
]



In [5]:
from pyspark.sql.types import StringType
df_dates = spark.createDataFrame(dates,StringType()).toDF('ner_chunk')

Next, we convert the text into Spark NLP documents.

In [6]:
document_assembler = DocumentAssembler().setInputCol('ner_chunk').setOutputCol('document')
documents_DF = document_assembler.transform(df_dates)

After that, we convert the documents into chunks by re-typing each `document` annotation as a `chunk`.

In [7]:
from sparknlp.functions import map_annotations_col

chunks_df = map_annotations_col(documents_DF.select("document","ner_chunk"),
                    lambda x: [Annotation('chunk', a.begin, a.end, a.result, a.metadata, a.embeddings) for a in x], "document",
                    "chunk_date", "chunk")

In [8]:
chunks_df.select('chunk_date').show(truncate=False)

+---------------------------------------------------+
|chunk_date                                         |
+---------------------------------------------------+
|[{chunk, 0, 9, 08/02/2018, {sentence -> 0}, []}]   |
|[{chunk, 0, 6, 11/2018, {sentence -> 0}, []}]      |
|[{chunk, 0, 9, 11/01/2018, {sentence -> 0}, []}]   |
|[{chunk, 0, 8, 12Mar2021, {sentence -> 0}, []}]    |
|[{chunk, 0, 11, Jan 30, 2018, {sentence -> 0}, []}]|
|[{chunk, 0, 9, 13.04.1999, {sentence -> 0}, []}]   |
|[{chunk, 0, 10, 3April 2020, {sentence -> 0}, []}] |
+---------------------------------------------------+
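
To make the fields in the output above easier to read, here is a minimal sketch of the document-to-chunk step in plain Python. `SimpleAnnotation` and `document_to_chunk` are hypothetical stand-ins introduced only for illustration (they are not part of Spark NLP); the sketch assumes, as the output above suggests, that `begin` and `end` are inclusive character offsets covering the whole string.

```python
from collections import namedtuple

# Hypothetical stand-in for a Spark NLP annotation, so the sketch runs without Spark.
SimpleAnnotation = namedtuple(
    "SimpleAnnotation", "annotator_type begin end result metadata"
)

def document_to_chunk(text):
    """Build a whole-string document annotation and re-type it as a chunk."""
    # Offsets are inclusive, so a 10-character string spans 0..9.
    document = SimpleAnnotation("document", 0, len(text) - 1, text, {"sentence": "0"})
    # Only the annotator type changes; offsets, text, and metadata are kept.
    return document._replace(annotator_type="chunk")

chunk = document_to_chunk("08/02/2018")
print(chunk.annotator_type, chunk.begin, chunk.end, chunk.result)
```

This mirrors what the `lambda` passed to `map_annotations_col` does: every field is copied and only the annotation type is replaced.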



Now we normalize the chunks with `DateNormalizer`.

In [9]:
date_normalizer = DateNormalizer().setInputCols('chunk_date').setOutputCol('date')

In [10]:
date_normalized_df = date_normalizer.transform(chunks_df)

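Let's look at the normalized dates. When a component is missing from the input, its placeholder is kept, so `11/2018` becomes `2018/11/DD`. The `normalized` metadata value is `true` for every row here.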

In [11]:
dateNormalizedClean = date_normalized_df.selectExpr("ner_chunk","date.result as dateresult","date.metadata as metadata")

dateNormalizedClean.withColumn("dateresult", dateNormalizedClean["dateresult"].getItem(0))\
                   .withColumn("metadata", dateNormalizedClean["metadata"].getItem(0)['normalized']).show(truncate=False)

+------------+----------+--------+
|ner_chunk   |dateresult|metadata|
+------------+----------+--------+
|08/02/2018  |2018/08/02|true    |
|11/2018     |2018/11/DD|true    |
|11/01/2018  |2018/11/01|true    |
|12Mar2021   |2021/03/12|true    |
|Jan 30, 2018|2018/01/30|true    |
|13.04.1999  |1999/04/13|true    |
|3April 2020 |2020/04/03|true    |
+------------+----------+--------+
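
For intuition only, the mapping in the table above can be reproduced for these seven inputs with a few `datetime.strptime` patterns. This is a sketch, not how `DateNormalizer` is implemented: the format lists, the `normalize_date` helper, and the choice to return the input unchanged with `False` when nothing matches are all assumptions made for illustration.

```python
from datetime import datetime

# Hypothetical pattern lists chosen to cover the seven sample inputs only.
# Month names (%b, %B) are matched using the process locale (English by default).
FULL_FORMATS = ["%m/%d/%Y", "%d.%m.%Y", "%d%b%Y", "%b %d, %Y", "%d%B %Y"]
PARTIAL_FORMATS = ["%m/%Y"]  # no day in the input

def normalize_date(text):
    """Return (normalized_text, was_normalized) for a single date string."""
    for fmt in FULL_FORMATS:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue  # pattern did not match, try the next one
        return parsed.strftime("%Y/%m/%d"), True
    for fmt in PARTIAL_FORMATS:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue
        # Keep the DD placeholder when the day is unknown.
        return parsed.strftime("%Y/%m") + "/DD", True
    # Assumption of this sketch: unparsed input is returned as-is.
    return text, False

for sample in ["08/02/2018", "11/2018", "12Mar2021", "3April 2020"]:
    print(sample, "->", normalize_date(sample))
```

Note that the sketch, like the table above, reads `08/02/2018` as month/day/year, while the dotted form `13.04.1999` is read as day.month.year.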



## Relative Date

For relative dates, we can set a reference date with `anchorDateYear`, `anchorDateMonth`, and `anchorDateDay`.

In [12]:
rel_dates = [
'next monday',
'today',
'next week'
]

rel_dates_df = spark.createDataFrame(rel_dates,StringType()).toDF('ner_chunk')

In [13]:
rel_documents_DF = document_assembler.transform(rel_dates_df)

rel_chunks_df = map_annotations_col(rel_documents_DF.select("document","ner_chunk"),
                    lambda x: [Annotation('chunk', a.begin, a.end, a.result, a.metadata, a.embeddings) for a in x], "document",
                    "chunk_date", "chunk")

In [14]:
rel_chunks_df.select('chunk_date').show(truncate=False)

+--------------------------------------------------+
|chunk_date                                        |
+--------------------------------------------------+
|[{chunk, 0, 10, next monday, {sentence -> 0}, []}]|
|[{chunk, 0, 4, today, {sentence -> 0}, []}]       |
|[{chunk, 0, 8, next week, {sentence -> 0}, []}]   |
+--------------------------------------------------+



In the following example we use 2021/02/16 as the reference date. To do that, we set `anchorDateYear` to 2021, `anchorDateMonth` to 2, and `anchorDateDay` to 16:

In [15]:
rel_date_normalizer = DateNormalizer().setInputCols('chunk_date').setOutputCol('date')\
    .setAnchorDateDay(16)\
    .setAnchorDateMonth(2)\
    .setAnchorDateYear(2021)

In [16]:
rel_date_normalized_df = rel_date_normalizer.transform(rel_chunks_df)
relDateNormalizedClean = rel_date_normalized_df.selectExpr("ner_chunk","date.result as dateresult","date.metadata as metadata")
relDateNormalizedClean.withColumn("dateresult", relDateNormalizedClean["dateresult"].getItem(0))\
                      .withColumn("metadata", relDateNormalizedClean["metadata"].getItem(0)['normalized']).show(truncate=False)

+-----------+----------+--------+
|ner_chunk  |dateresult|metadata|
+-----------+----------+--------+
|next monday|2021/02/22|true    |
|today      |2021/02/16|true    |
|next week  |2021/02/23|true    |
+-----------+----------+--------+



As you can see, relative dates such as `next monday`, `today`, and `next week` are resolved using `2021/02/16` as the reference date.
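
The arithmetic behind these three results can be checked with the standard library. The sketch below is an illustration under stated assumptions, not the annotator's implementation: `resolve_relative` is a hypothetical helper, it handles only the three phrases used here, and it assumes "next monday" means the first Monday strictly after the anchor date.

```python
from datetime import date, timedelta

def resolve_relative(text, anchor):
    """Resolve a small set of relative phrases against an anchor date."""
    phrase = text.strip().lower()
    if phrase == "today":
        resolved = anchor
    elif phrase == "next week":
        resolved = anchor + timedelta(weeks=1)
    elif phrase == "next monday":
        # date.weekday() returns 0 for Monday; `or 7` skips to the following
        # week when the anchor itself is a Monday (an assumption of this sketch).
        days_ahead = (0 - anchor.weekday()) % 7 or 7
        resolved = anchor + timedelta(days=days_ahead)
    else:
        raise ValueError(f"unsupported phrase: {text!r}")
    return resolved.strftime("%Y/%m/%d")

anchor = date(2021, 2, 16)  # same anchor as the example above (a Tuesday)
for phrase in ["next monday", "today", "next week"]:
    print(phrase, "->", resolve_relative(phrase, anchor))
```

With the anchor set to 2021/02/16, the sketch yields the same three dates as the output table above.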
