![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/25.Date_Normalizer.ipynb)

In [None]:
import os

jsl_secret = os.getenv('SECRET')

import sparknlp
sparknlp_version = sparknlp.version()
import sparknlp_jsl
jsl_version = sparknlp_jsl.version()

print (jsl_secret)

In [None]:

import json
import os
from pyspark.ml import Pipeline, PipelineModel
from pyspark.sql import SparkSession

from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp.base import *
import sparknlp_jsl
import sparknlp

from sparknlp.util import *
from sparknlp.pretrained import ResourceDownloader
from pyspark.sql import functions as F

import pandas as pd
      


In [None]:
spark = sparknlp_jsl.start(jsl_secret)

In [None]:
spark

# **Date Normalizer**

`DateNormalizer` is an annotator that identifies dates in chunk annotations and converts them to the normalized format YYYY/MM/DD.



We are going to create a list of date chunks in different formats:

In [None]:
dates = [
'08/02/2018',
'11/2018',
'11/01/2018',
'12Mar2021',
'Jan 30, 2018',
'13.04.1999', 
'3April 2020',
'next monday',
'today',
'next week'
]



In [None]:
from pyspark.sql.types import StringType
df_dates = spark.createDataFrame(dates,StringType()).toDF('ner_chunk')

Next, we transform that text into Spark NLP documents.

In [None]:
document_assembler = DocumentAssembler().setInputCol('ner_chunk').setOutputCol('document')
documents_DF = document_assembler.transform(df_dates)

After that, we convert those documents to chunks.

In [None]:
from sparknlp.functions import map_annotations_col

chunks_df = map_annotations_col(documents_DF.select("document","ner_chunk"),
                    lambda x: [Annotation('chunk', a.begin, a.end, a.result, a.metadata, a.embeddings) for a in x], "document",
                    "chunk_date", "chunk")

In [None]:
chunks_df.select('chunk_date').show(truncate=False)

+---------------------------------------------------+
|chunk_date                                         |
+---------------------------------------------------+
|[{chunk, 0, 9, 08/02/2018, {sentence -> 0}, []}]   |
|[{chunk, 0, 6, 11/2018, {sentence -> 0}, []}]      |
|[{chunk, 0, 9, 11/01/2018, {sentence -> 0}, []}]   |
|[{chunk, 0, 8, 12Mar2021, {sentence -> 0}, []}]    |
|[{chunk, 0, 11, Jan 30, 2018, {sentence -> 0}, []}]|
|[{chunk, 0, 9, 13.04.1999, {sentence -> 0}, []}]   |
|[{chunk, 0, 10, 3April 2020, {sentence -> 0}, []}] |
|[{chunk, 0, 10, next monday, {sentence -> 0}, []}] |
|[{chunk, 0, 4, today, {sentence -> 0}, []}]        |
|[{chunk, 0, 8, next week, {sentence -> 0}, []}]    |
+---------------------------------------------------+



Now we normalize those chunks using the `DateNormalizer`.

In [None]:
date_normalizer = DateNormalizer().setInputCols('chunk_date').setOutputCol('date')


In [None]:
date_normaliced_df = date_normalizer.transform(chunks_df)

Let's look at how each date is normalized.

In [None]:
dateNormalizedClean = date_normaliced_df.selectExpr("ner_chunk","date.result as dateresult","date.metadata as metadata")

dateNormalizedClean.withColumn("dateresult", dateNormalizedClean["dateresult"]
                               .getItem(0)).withColumn("metadata", dateNormalizedClean["metadata"]
                                                       .getItem(0)['normalized']).show(truncate=False)

+------------+----------+--------+
|ner_chunk   |dateresult|metadata|
+------------+----------+--------+
|08/02/2018  |2018/08/02|true    |
|11/2018     |2018/11/DD|true    |
|11/01/2018  |2018/11/01|true    |
|12Mar2021   |2021/03/12|true    |
|Jan 30, 2018|2018/01/30|true    |
|13.04.1999  |1999/04/13|true    |
|3April 2020 |2020/04/03|true    |
|next monday |2021/06/19|true    |
|today       |2021/06/13|true    |
|next week   |2021/06/20|true    |
+------------+----------+--------+
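
To make the mapping in the table above more concrete, here is a minimal pure-Python sketch of the same idea using only the standard library. It is not how `DateNormalizer` is implemented; the `FORMATS` list and the `normalize_date` helper are hypothetical names chosen for illustration, and the patterns cover only the absolute sample strings used in this notebook (month names are assumed to be English).

```python
from datetime import datetime

# Hypothetical (input pattern, output template) pairs, chosen only to
# cover the absolute sample strings from this notebook.
FORMATS = [
    ("%m/%d/%Y", "%Y/%m/%d"),   # 08/02/2018
    ("%m/%Y", "%Y/%m/DD"),      # 11/2018 (no day given, keep a DD placeholder)
    ("%d%b%Y", "%Y/%m/%d"),     # 12Mar2021
    ("%b %d, %Y", "%Y/%m/%d"),  # Jan 30, 2018
    ("%d.%m.%Y", "%Y/%m/%d"),   # 13.04.1999
    ("%d%B %Y", "%Y/%m/%d"),    # 3April 2020
]

def normalize_date(text):
    """Return the date as YYYY/MM/DD, or None if no pattern matches."""
    for in_fmt, out_fmt in FORMATS:
        try:
            parsed = datetime.strptime(text.strip(), in_fmt)
        except ValueError:
            continue  # this pattern does not fit, try the next one
        return parsed.strftime(out_fmt)
    return None

for raw in ["08/02/2018", "11/2018", "12Mar2021", "Jan 30, 2018"]:
    print(raw, "->", normalize_date(raw))
```

Relative expressions such as `today` match none of these patterns, so the sketch returns `None` for them; resolving them requires a reference date.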



We can configure `anchorDateYear`, `anchorDateMonth` and `anchorDateDay` to set the reference date used for relative dates.

In the following example we use 2021/02/27 as the anchor date. To do that, we set `anchorDateYear` to 2021, `anchorDateMonth` to 2 and `anchorDateDay` to 27, as shown in the following configuration.

In [None]:
date_normalizer = DateNormalizer().setInputCols('chunk_date').setOutputCol('date')\
            .setAnchorDateDay(27)\
            .setAnchorDateMonth(2)\
            .setAnchorDateYear(2021)

In [None]:
date_normaliced_df = date_normalizer.transform(chunks_df)
dateNormalizedClean = date_normaliced_df.selectExpr("ner_chunk","date.result as dateresult","date.metadata as metadata")
dateNormalizedClean.withColumn("dateresult", dateNormalizedClean["dateresult"]
                               .getItem(0)).withColumn("metadata", dateNormalizedClean["metadata"]
                                                       .getItem(0)['normalized']).show(truncate=False)


+------------+----------+--------+
|ner_chunk   |dateresult|metadata|
+------------+----------+--------+
|08/02/2018  |2018/08/02|true    |
|11/2018     |2018/11/DD|true    |
|11/01/2018  |2018/11/01|true    |
|12Mar2021   |2021/03/12|true    |
|Jan 30, 2018|2018/01/30|true    |
|13.04.1999  |1999/04/13|true    |
|3April 2020 |2020/04/03|true    |
|next monday |2021/02/29|true    |
|today       |2021/02/27|true    |
|next week   |2021/03/03|true    |
+------------+----------+--------+



As you can see, relative dates like `next monday`, `today` and `next week` take `2021/02/27` as the reference date.
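
The idea of an anchor date can also be sketched with plain `datetime` arithmetic. The `resolve_relative` helper below is hypothetical and handles only the three relative phrases from this notebook. It uses standard calendar arithmetic (including month rollover), which is an assumption of this sketch rather than a description of the annotator, so its results are not guaranteed to match the table above.

```python
from datetime import date, timedelta

def resolve_relative(text, anchor):
    """Resolve a relative phrase against `anchor`; return YYYY/MM/DD or None."""
    text = text.strip().lower()
    if text == "today":
        resolved = anchor
    elif text == "next week":
        resolved = anchor + timedelta(days=7)
    elif text == "next monday":
        # Monday is weekday 0; if the anchor is already a Monday, jump 7 days.
        days_ahead = (0 - anchor.weekday()) % 7 or 7
        resolved = anchor + timedelta(days=days_ahead)
    else:
        return None  # phrase not covered by this sketch
    return resolved.strftime("%Y/%m/%d")

# Same anchor as in the configuration above: 2021/02/27.
anchor = date(2021, 2, 27)
for phrase in ["today", "next week", "next monday"]:
    print(phrase, "->", resolve_relative(phrase, anchor))
```

Changing `anchor` shifts every relative result while leaving absolute dates untouched, which is the same role the `anchorDate*` parameters play in the pipeline.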
