![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# **DateNormalizer**

This notebook covers the parameters and usage of `DateNormalizer`. This annotator transforms date mentions to a common standard format, which is YYYY/MM/DD by default.

**📖 Learning Objectives:**

1. Understand how the annotator helps when data comes from different sources, sometimes from different countries, that represent dates in different formats.

2. Become comfortable using the different parameters of the annotator.


**🔗 Helpful Links:**

- Documentation : [DateNormalizer](https://nlp.johnsnowlabs.com/docs/en/licensed_annotators#datenormalizer)

- Python Docs : [DateNormalizer](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/normalizer/date_normalizer/index.html)

- Scala Docs : [DateNormalizer](https://nlp.johnsnowlabs.com/licensed/api/com/johnsnowlabs/nlp/annotators/normalizer/DateNormalizer.html)

- For extended examples of usage, see the [Spark NLP Workshop repository](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/tutorials/Certification_Trainings/Healthcare).

## **🎬 Colab Setup**

In [None]:
!pip install -q johnsnowlabs

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp

nlp.install()

In [4]:
from johnsnowlabs import nlp, medical
import pyspark.sql.functions as F
import pandas as pd

spark = nlp.start()

👌 Detected license file /content/4.4.2.spark_nlp_for_healthcare.json
👌 Launched cpu optimized session with with: 🚀Spark-NLP==4.4.1, 💊Spark-Healthcare==4.4.2, running on ⚡ PySpark==3.1.2


## **🖨️ Input/Output Annotation Types**

- Input: `CHUNK`

- Output: `CHUNK`

## **🔎 Parameters**


- `AnchorDateYear`: (Int) Sets the anchor year used to resolve relative dates such as "the day after tomorrow". If not set, the current year is used.

- `AnchorDateMonth`: (Int) Sets the anchor month used to resolve relative dates such as "the day after tomorrow". If not set, the current month is used.

- `AnchorDateDay`: (Int) Sets the anchor day of the month used to resolve relative dates such as "the day after tomorrow". If not set, the current day is used.

- `OutputDateformat`: (String) Selects the output date format. If not set, dates are written as YYYY/MM/DD. Options are `eu` (DD/MM/YYYY) and `us` (MM/DD/YYYY).

- `DefaultReplacementDay`: (Int) Defines which value to use for creating the Day Value when original Date-Entity has no Day Information. Defaults to 15.

- `DefaultReplacementMonth`: (Int) Defines which value to use for creating the Month Value when original Date-Entity has no Month Information. Defaults to 6.

- `DefaultReplacementYear`: (Int) Defines which value to use for creating the Year Value when original Date-Entity has no Year Information. Defaults to 2020.

### `setAnchorDateYear()`



Sets the anchor year used to resolve relative dates such as "the day after tomorrow" (default: -1). If it is not set, the current year is used. Example: 2021.

In [5]:
dates = [
    "08/02/2018",
    "11/2018",
    "11/01/2018",
    "12Mar2021",
    "Jan 30, 2018",
    "13.04.1999",
    "3April 2020",
    "next monday",
    "today",
    "next week",
    '08/02',
    '11/2018',
    '03/2021',
    '05 Jan',
    '01/05',
    '2022'
]

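# Import StringType explicitly rather than relying on pyspark.sql.functions re-exporting it
from pyspark.sql.types import StringType

df = spark.createDataFrame(dates, StringType()).toDF("original_date")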

In [6]:
df.show()

+-------------+
|original_date|
+-------------+
|   08/02/2018|
|      11/2018|
|   11/01/2018|
|    12Mar2021|
| Jan 30, 2018|
|   13.04.1999|
|  3April 2020|
|  next monday|
|        today|
|    next week|
|        08/02|
|      11/2018|
|      03/2021|
|       05 Jan|
|        01/05|
|         2022|
+-------------+



In [7]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("original_date")\
    .setOutputCol("document")

doc2chunk = nlp.Doc2Chunk()\
    .setInputCols("document")\
    .setOutputCol("date_chunk")

date_normalizer = medical.DateNormalizer()\
    .setInputCols("date_chunk")\
    .setOutputCol("date")\
    .setAnchorDateYear(2000)
    
pipeline = nlp.Pipeline(stages=[document_assembler, doc2chunk, date_normalizer])

In [8]:
result = pipeline.fit(df).transform(df)

result.selectExpr(
    "original_date",
    "date.result as normalized_date"  
).show()

+-------------+---------------+
|original_date|normalized_date|
+-------------+---------------+
|   08/02/2018|   [2018/08/02]|
|      11/2018|   [2018/11/15]|
|   11/01/2018|   [2018/11/01]|
|    12Mar2021|   [2021/03/12]|
| Jan 30, 2018|   [2018/01/30]|
|   13.04.1999|   [1999/04/13]|
|  3April 2020|   [2020/04/03]|
|  next monday|   [2000/06/05]|
|        today|   [2000/06/02]|
|    next week|   [2000/06/09]|
|        08/02|   [2020/08/02]|
|      11/2018|   [2018/11/15]|
|      03/2021|   [2021/03/15]|
|       05 Jan|   [2020/01/05]|
|        01/05|   [2020/01/05]|
|         2022|   [2022/06/15]|
+-------------+---------------+
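
The relative rows in the table above (`next monday`, `today`, `next week`) can be reproduced with a small standard-library sketch. This is not the annotator's implementation; it only illustrates the behaviour visible in the output. It assumes the cell was run on 2023-06-02 (the date that `today` resolves to in the later cells without an anchor), that "next week" means seven days after the anchor, and that "next monday" means the first Monday strictly after the anchor. The helper names `build_anchor` and `resolve_relative` are hypothetical.

```python
from datetime import date, timedelta

UNSET = -1  # the notebook states -1 as the default of the anchor setters


def build_anchor(today, year=UNSET, month=UNSET, day=UNSET):
    """Override only the anchor components that were explicitly set."""
    return date(
        today.year if year == UNSET else year,
        today.month if month == UNSET else month,
        today.day if day == UNSET else day,
    )


def resolve_relative(expression, anchor):
    """Resolve the three relative expressions used in this notebook."""
    if expression == "today":
        return anchor
    if expression == "next week":
        return anchor + timedelta(days=7)
    if expression == "next monday":
        # Days until the first Monday strictly after the anchor (Monday == 0).
        days_ahead = (0 - anchor.weekday() - 1) % 7 + 1
        return anchor + timedelta(days=days_ahead)
    raise ValueError(f"unsupported expression: {expression!r}")


run_date = date(2023, 6, 2)  # assumed execution date of the notebook
anchor = build_anchor(run_date, year=2000)  # mirrors setAnchorDateYear(2000)
for expression in ("next monday", "today", "next week"):
    resolved = resolve_relative(expression, anchor)
    print(expression, "->", resolved.strftime("%Y/%m/%d"))
```

Because only the year is overridden, the month and day still come from the run date, which is why the relative rows land in June 2000 while the absolute dates are unaffected.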



### `setAnchorDateMonth()`

Sets the anchor month used to resolve relative dates such as "the day after tomorrow" (default: -1). If it is not set, the current month is used. Month values start from 1, so 1 stands for January.

In [9]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("original_date")\
    .setOutputCol("document")

doc2chunk = nlp.Doc2Chunk()\
    .setInputCols("document")\
    .setOutputCol("date_chunk")

date_normalizer = medical.DateNormalizer()\
    .setInputCols("date_chunk")\
    .setOutputCol("date")\
    .setAnchorDateYear(2000)\
    .setAnchorDateMonth(3)
    
pipeline = nlp.Pipeline(stages=[document_assembler, doc2chunk, date_normalizer])

In [10]:
result = pipeline.fit(df).transform(df)

result.selectExpr(
    "original_date",
    "date.result as normalized_date"  
).show()

+-------------+---------------+
|original_date|normalized_date|
+-------------+---------------+
|   08/02/2018|   [2018/08/02]|
|      11/2018|   [2018/11/15]|
|   11/01/2018|   [2018/11/01]|
|    12Mar2021|   [2021/03/12]|
| Jan 30, 2018|   [2018/01/30]|
|   13.04.1999|   [1999/04/13]|
|  3April 2020|   [2020/04/03]|
|  next monday|   [2000/03/06]|
|        today|   [2000/03/02]|
|    next week|   [2000/03/09]|
|        08/02|   [2020/08/02]|
|      11/2018|   [2018/11/15]|
|      03/2021|   [2021/03/15]|
|       05 Jan|   [2020/01/05]|
|        01/05|   [2020/01/05]|
|         2022|   [2022/06/15]|
+-------------+---------------+



### `setAnchorDateDay()`

Sets the anchor day used to resolve relative dates such as "the day after tomorrow" (default: -1). If it is not set, the current day is used. The first day of the month has value 1.

In [11]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("original_date")\
    .setOutputCol("document")

doc2chunk = nlp.Doc2Chunk()\
    .setInputCols("document")\
    .setOutputCol("date_chunk")

date_normalizer = medical.DateNormalizer()\
    .setInputCols("date_chunk")\
    .setOutputCol("date")\
    .setAnchorDateYear(2000)\
    .setAnchorDateMonth(3)\
    .setAnchorDateDay(15)
    
pipeline = nlp.Pipeline(stages=[document_assembler, doc2chunk, date_normalizer])

In [12]:
result = pipeline.fit(df).transform(df)

result.selectExpr(
    "original_date",
    "date.result as normalized_date"  
).show()

+-------------+---------------+
|original_date|normalized_date|
+-------------+---------------+
|   08/02/2018|   [2018/08/02]|
|      11/2018|   [2018/11/15]|
|   11/01/2018|   [2018/11/01]|
|    12Mar2021|   [2021/03/12]|
| Jan 30, 2018|   [2018/01/30]|
|   13.04.1999|   [1999/04/13]|
|  3April 2020|   [2020/04/03]|
|  next monday|   [2000/03/20]|
|        today|   [2000/03/15]|
|    next week|   [2000/03/22]|
|        08/02|   [2020/08/02]|
|      11/2018|   [2018/11/15]|
|      03/2021|   [2021/03/15]|
|       05 Jan|   [2020/01/05]|
|        01/05|   [2020/01/05]|
|         2022|   [2022/06/15]|
+-------------+---------------+



### `setOutputDateformat()`

Selects the output date format. If it is not set, dates are written as YYYY/MM/DD. Options are `eu`, which produces DD/MM/YYYY as in the example below, and `us`, which produces MM/DD/YYYY.

In [13]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("original_date")\
    .setOutputCol("document")

doc2chunk = nlp.Doc2Chunk()\
    .setInputCols("document")\
    .setOutputCol("date_chunk")

date_normalizer = medical.DateNormalizer()\
    .setInputCols("date_chunk")\
    .setOutputCol("date")\
    .setOutputDateformat('eu')
    
pipeline = nlp.Pipeline(stages=[document_assembler, doc2chunk, date_normalizer])

In [14]:
result = pipeline.fit(df).transform(df)

result.selectExpr(
    "original_date",
    "date.result as normalized_date"  
).show()

+-------------+---------------+
|original_date|normalized_date|
+-------------+---------------+
|   08/02/2018|   [02/08/2018]|
|      11/2018|   [15/11/2018]|
|   11/01/2018|   [01/11/2018]|
|    12Mar2021|   [12/03/2021]|
| Jan 30, 2018|   [30/01/2018]|
|   13.04.1999|   [13/04/1999]|
|  3April 2020|   [03/04/2020]|
|  next monday|   [05/06/2023]|
|        today|   [02/06/2023]|
|    next week|   [09/06/2023]|
|        08/02|   [02/08/2020]|
|      11/2018|   [15/11/2018]|
|      03/2021|   [15/03/2021]|
|       05 Jan|   [05/01/2020]|
|        01/05|   [05/01/2020]|
|         2022|   [15/06/2022]|
+-------------+---------------+
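
One way to picture the effect of `setOutputDateformat` is as a choice between date patterns. The mapping below is a sketch using Python's `strftime`, not the annotator's code. The default pattern and the `eu` pattern are taken from the outputs in this notebook, while the `us` pattern (MM/DD/YYYY) is an assumption based on the library documentation. `OUTPUT_PATTERNS` and `format_date` are hypothetical names.

```python
from datetime import date

# Keys mirror the values passed to setOutputDateformat; None stands for "not set".
OUTPUT_PATTERNS = {
    None: "%Y/%m/%d",  # default seen in the earlier cells: YYYY/MM/DD
    "eu": "%d/%m/%Y",  # seen in the cell above: DD/MM/YYYY
    "us": "%m/%d/%Y",  # assumption: MM/DD/YYYY
}


def format_date(value, output_format=None):
    """Render a date with the pattern selected by output_format."""
    return value.strftime(OUTPUT_PATTERNS[output_format])


parsed = date(2018, 1, 30)  # "Jan 30, 2018" from the sample data
print(format_date(parsed))        # 2018/01/30
print(format_date(parsed, "eu"))  # 30/01/2018
```

The format only changes how the normalized date is written; the parsed year, month, and day are the same in both outputs.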



### `setDefaultReplacementYear()`

Defines which value to use for creating the Year Value when original Date-Entity has no Year Information. Defaults to 2020.

In [15]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("original_date")\
    .setOutputCol("document")

doc2chunk = nlp.Doc2Chunk()\
    .setInputCols("document")\
    .setOutputCol("date_chunk")

date_normalizer = medical.DateNormalizer()\
    .setInputCols("date_chunk")\
    .setOutputCol("date")\
    .setOutputDateformat('eu')\
    .setDefaultReplacementYear(2020)
    
pipeline = nlp.Pipeline(stages=[document_assembler, doc2chunk, date_normalizer])

In [16]:
result = pipeline.fit(df).transform(df)

result.selectExpr(
    "original_date",
    "date.result as normalized_date"  
).show()

+-------------+---------------+
|original_date|normalized_date|
+-------------+---------------+
|   08/02/2018|   [02/08/2018]|
|      11/2018|   [15/11/2018]|
|   11/01/2018|   [01/11/2018]|
|    12Mar2021|   [12/03/2021]|
| Jan 30, 2018|   [30/01/2018]|
|   13.04.1999|   [13/04/1999]|
|  3April 2020|   [03/04/2020]|
|  next monday|   [05/06/2023]|
|        today|   [02/06/2023]|
|    next week|   [09/06/2023]|
|        08/02|   [02/08/2020]|
|      11/2018|   [15/11/2018]|
|      03/2021|   [15/03/2021]|
|       05 Jan|   [05/01/2020]|
|        01/05|   [05/01/2020]|
|         2022|   [15/06/2022]|
+-------------+---------------+



### `setDefaultReplacementMonth()`


Defines which value to use for creating the Month Value when original Date-Entity has no Month Information. Defaults to 6.

In [17]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("original_date")\
    .setOutputCol("document")

doc2chunk = nlp.Doc2Chunk()\
    .setInputCols("document")\
    .setOutputCol("date_chunk")

date_normalizer = medical.DateNormalizer()\
    .setInputCols("date_chunk")\
    .setOutputCol("date")\
    .setOutputDateformat('eu')\
    .setDefaultReplacementYear(2020)\
    .setDefaultReplacementMonth(3)
    
pipeline = nlp.Pipeline(stages=[document_assembler, doc2chunk, date_normalizer])

In [18]:
result = pipeline.fit(df).transform(df)

result.selectExpr(
    "original_date",
    "date.result as normalized_date"  
).show()

+-------------+---------------+
|original_date|normalized_date|
+-------------+---------------+
|   08/02/2018|   [02/08/2018]|
|      11/2018|   [15/11/2018]|
|   11/01/2018|   [01/11/2018]|
|    12Mar2021|   [12/03/2021]|
| Jan 30, 2018|   [30/01/2018]|
|   13.04.1999|   [13/04/1999]|
|  3April 2020|   [03/04/2020]|
|  next monday|   [05/06/2023]|
|        today|   [02/06/2023]|
|    next week|   [09/06/2023]|
|        08/02|   [02/08/2020]|
|      11/2018|   [15/11/2018]|
|      03/2021|   [15/03/2021]|
|       05 Jan|   [05/01/2020]|
|        01/05|   [05/01/2020]|
|         2022|   [15/03/2022]|
+-------------+---------------+



### `setDefaultReplacementDay()`

Defines which value to use for creating the Day Value when original Date-Entity has no Day Information. Defaults to 15.

In [19]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("original_date")\
    .setOutputCol("document")

doc2chunk = nlp.Doc2Chunk()\
    .setInputCols("document")\
    .setOutputCol("date_chunk")

date_normalizer = medical.DateNormalizer()\
    .setInputCols("date_chunk")\
    .setOutputCol("date")\
    .setOutputDateformat('eu')\
    .setDefaultReplacementYear(2020)\
    .setDefaultReplacementMonth(3)\
    .setDefaultReplacementDay(2)
    
pipeline = nlp.Pipeline(stages=[document_assembler, doc2chunk, date_normalizer])

In [20]:
result = pipeline.fit(df).transform(df)

result.selectExpr(
    "original_date",
    "date.result as normalized_date"  
).show()

+-------------+---------------+
|original_date|normalized_date|
+-------------+---------------+
|   08/02/2018|   [02/08/2018]|
|      11/2018|   [02/11/2018]|
|   11/01/2018|   [01/11/2018]|
|    12Mar2021|   [12/03/2021]|
| Jan 30, 2018|   [30/01/2018]|
|   13.04.1999|   [13/04/1999]|
|  3April 2020|   [03/04/2020]|
|  next monday|   [05/06/2023]|
|        today|   [02/06/2023]|
|    next week|   [09/06/2023]|
|        08/02|   [02/08/2020]|
|      11/2018|   [02/11/2018]|
|      03/2021|   [02/03/2021]|
|       05 Jan|   [05/01/2020]|
|        01/05|   [05/01/2020]|
|         2022|   [02/03/2022]|
+-------------+---------------+
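
The three `DefaultReplacement*` settings can be read as fallbacks for whichever components a date mention leaves out. The following sketch mimics that rule with hypothetical names (`Defaults`, `fill_missing`). It assumes the mention has already been parsed into optional year, month, and day values, which is a simplification and not how the annotator exposes its internals.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Defaults:
    # Default values as stated in the Parameters section.
    year: int = 2020
    month: int = 6
    day: int = 15


def fill_missing(year: Optional[int], month: Optional[int], day: Optional[int],
                 defaults: Defaults = Defaults()) -> date:
    """Replace every missing component with its configured default."""
    return date(
        defaults.year if year is None else year,
        defaults.month if month is None else month,
        defaults.day if day is None else day,
    )


custom = Defaults(year=2020, month=3, day=2)  # values set in the cell above
print(fill_missing(2022, None, None, custom).strftime("%d/%m/%Y"))  # 02/03/2022
print(fill_missing(2018, 11, None, custom).strftime("%d/%m/%Y"))    # 02/11/2018
print(fill_missing(None, 8, 2, custom).strftime("%d/%m/%Y"))        # 02/08/2020
```

These three calls correspond to the rows `2022`, `11/2018`, and `08/02` in the table above: only the missing components take the replacement values, and components present in the mention are kept.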

