![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# **DateNormalizer**

This notebook will cover the different parameters and usages of `DateNormalizer`. This annotator normalizes Date chunks into a chosen format.


**📖 Learning Objectives:**

1. Understand why date normalization is useful when data comes from different sources, sometimes from different countries, that represent dates in different formats.

2. Become comfortable using the different parameters of the annotator.


**🔗 Helpful Links:**

- Documentation : [DateNormalizer](https://nlp.johnsnowlabs.com/docs/en/licensed_annotators#datenormalizer)

- Python Docs : [DateNormalizer](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/normalizer/date_normalizer/index.html)

- Scala Docs : [DateNormalizer](https://nlp.johnsnowlabs.com/licensed/api/com/johnsnowlabs/nlp/annotators/normalizer/DateNormalizer.html)

- For extended examples of usage, see the [Spark NLP Workshop repository](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/healthcare-nlp).

## **🎬 Colab Setup**

In [1]:
!pip install -q johnsnowlabs


In [2]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

Please Upload your John Snow Labs License using the button below


Saving spark_nlp_for_healthcare_spark_ocr_7139.json to spark_nlp_for_healthcare_spark_ocr_7139.json


In [3]:
from johnsnowlabs import nlp

nlp.install()

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_7139.json
🚨 Outdated OCR Secrets in license file. Version=5.1.0 but should be Version=5.0.2
📋 Stored John Snow Labs License in /root/.johnsnowlabs/licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json
👷 Setting up  John Snow Labs home in /root/.johnsnowlabs, this might take a few minutes.
Downloading 🐍+🚀 Python Library spark_nlp-5.1.4-py2.py3-none-any.whl
Downloading 🐍+💊 Python Library spark_nlp_jsl-5.1.3-py3-none-any.whl
Downloading 🫘+🚀 Java Library spark-nlp-assembly-5.1.4.jar
Downloading 🫘+💊 Java Library spark-nlp-jsl-5.1.3.jar
🙆 JSL Home setup in /root/.johnsnowlabs
👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_7139.json
Installing /root/.johnsnowlabs/py_installs/spark_nlp_jsl-5.1.3-py3-none-any.whl to /usr/bin/python3
Installed 1 products:
💊 Spark-Healthcare==5.1.3 installed! ✅ Heal the planet with NLP! 


In [4]:
from johnsnowlabs import nlp, medical
import pyspark.sql.functions as F
import pandas as pd

spark = nlp.start()

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_7139.json
👌 Launched cpu optimized session with with: 🚀Spark-NLP==5.1.4, 💊Spark-Healthcare==5.1.3, running on ⚡ PySpark==3.1.2


## **🖨️ Input/Output Annotation Types**

- Input: `CHUNK`

- Output: `CHUNK`

## **🔎 Parameters**


- `anchorDateYear`: (Int) Sets the anchor year used to resolve relative dates such as "the day after tomorrow". If not set, the current year is used.

- `anchorDateMonth`: (Int) Sets the anchor month used to resolve relative dates such as "the day after tomorrow". If not set, the current month is used.

- `anchorDateDay`: (Int) Sets the anchor day of the month used to resolve relative dates such as "the day after tomorrow". If not set, the current day is used.

- `outputDateformat`: (String) Selects the output format. If not set, dates are formatted as `YYYY/MM/DD`. Options are:
  - `eu`: Format the dates as `DD/MM/YYYY`
  - `us`: Format the dates as `MM/DD/YYYY`

- `defaultReplacementDay`: (Int) Defines which value to use for creating the Day Value when original Date-Entity has no Day Information. Defaults to 15.

- `defaultReplacementMonth`: (Int) Defines which value to use for creating the Month Value when original Date-Entity has no Month Information. Defaults to 6.

- `defaultReplacementYear`: (Int) Defines which value to use for creating the Year Value when original Date-Entity has no Year Information. Defaults to 2020.
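
To make the replacement defaults concrete, here is a minimal pure-Python sketch of the fill-in behaviour described in the list above. It is not the annotator's implementation: the helper name `fill_missing_parts` and its signature are hypothetical, and only the default values (2020, 6, 15) and the `YYYY/MM/DD` layout come from the parameter descriptions.

```python
from datetime import date
from typing import Optional

# Default values taken from the parameter descriptions above.
DEFAULT_YEAR, DEFAULT_MONTH, DEFAULT_DAY = 2020, 6, 15


def fill_missing_parts(year: Optional[int] = None,
                       month: Optional[int] = None,
                       day: Optional[int] = None) -> str:
    """Hypothetical helper: replace absent date parts with the defaults."""
    full = date(
        year if year is not None else DEFAULT_YEAR,
        month if month is not None else DEFAULT_MONTH,
        day if day is not None else DEFAULT_DAY,
    )
    # Default output layout when no output format is selected.
    return full.strftime("%Y/%m/%d")


print(fill_missing_parts(year=2022))             # 2022/06/15
print(fill_missing_parts(month=8, day=2))        # 2020/08/02
print(fill_missing_parts(year=2018, month=11))   # 2018/11/15
```

The three calls mirror the inputs `2022`, `08/02`, and `11/2018` used later in this notebook.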

### `setAnchorDateYear()`



Sets the anchor year used to resolve relative dates such as "the day after tomorrow" (default: -1). If it is not set, the current year is used. Example: 2000

In [5]:
dates = [
    "08/02/2018",
    "11/2018",
    "11/01/2018",
    "12Mar2021",
    "Jan 30, 2018",
    "13.04.1999",
    "3April 2020",
    "next monday",
    "today",
    "next week",
    '08/02',
    '11/2018',
    '03/2021',
    '05 Jan',
    '01/05',
    '2022'
]

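# StringType is defined in pyspark.sql.types, so import it from there
from pyspark.sql.types import StringType

df = spark.createDataFrame(dates, StringType()).toDF("original_date")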

In [6]:
df.show()

+-------------+
|original_date|
+-------------+
|   08/02/2018|
|      11/2018|
|   11/01/2018|
|    12Mar2021|
| Jan 30, 2018|
|   13.04.1999|
|  3April 2020|
|  next monday|
|        today|
|    next week|
|        08/02|
|      11/2018|
|      03/2021|
|       05 Jan|
|        01/05|
|         2022|
+-------------+



In [9]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("original_date")\
    .setOutputCol("document")

doc2chunk = nlp.Doc2Chunk()\
    .setInputCols("document")\
    .setOutputCol("date_chunk")

date_normalizer = medical.DateNormalizer()\
    .setInputCols("date_chunk")\
    .setOutputCol("date")\
    .setAnchorDateYear(2000)

pipeline = nlp.Pipeline(stages=[document_assembler, doc2chunk, date_normalizer])

In [10]:
result = pipeline.fit(df).transform(df)

result.selectExpr(
    "original_date",
    "date.result as normalized_date"
).show()

+-------------+---------------+
|original_date|normalized_date|
+-------------+---------------+
|   08/02/2018|   [2018/08/02]|
|      11/2018|   [2018/11/15]|
|   11/01/2018|   [2018/11/01]|
|    12Mar2021|   [2021/03/12]|
| Jan 30, 2018|   [2018/01/30]|
|   13.04.1999|   [1999/04/13]|
|  3April 2020|   [2020/04/03]|
|  next monday|   [2000/11/27]|
|        today|   [2000/11/21]|
|    next week|   [2000/11/28]|
|        08/02|   [2020/08/02]|
|      11/2018|   [2018/11/15]|
|      03/2021|   [2021/03/15]|
|       05 Jan|   [2020/01/05]|
|        01/05|   [2020/01/05]|
|         2022|   [2022/06/15]|
+-------------+---------------+



### `setAnchorDateMonth()`

Sets the anchor month used to resolve relative dates such as "the day after tomorrow". If not set, the current month is used. Month values start from 1, so 1 stands for January.

In [11]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("original_date")\
    .setOutputCol("document")

doc2chunk = nlp.Doc2Chunk()\
    .setInputCols("document")\
    .setOutputCol("date_chunk")

date_normalizer = medical.DateNormalizer()\
    .setInputCols("date_chunk")\
    .setOutputCol("date")\
    .setAnchorDateYear(2000)\
    .setAnchorDateMonth(3)

pipeline = nlp.Pipeline(stages=[document_assembler, doc2chunk, date_normalizer])

In [12]:
result = pipeline.fit(df).transform(df)

result.selectExpr(
    "original_date",
    "date.result as normalized_date"
).show()

+-------------+---------------+
|original_date|normalized_date|
+-------------+---------------+
|   08/02/2018|   [2018/08/02]|
|      11/2018|   [2018/11/15]|
|   11/01/2018|   [2018/11/01]|
|    12Mar2021|   [2021/03/12]|
| Jan 30, 2018|   [2018/01/30]|
|   13.04.1999|   [1999/04/13]|
|  3April 2020|   [2020/04/03]|
|  next monday|   [2000/03/27]|
|        today|   [2000/03/21]|
|    next week|   [2000/03/28]|
|        08/02|   [2020/08/02]|
|      11/2018|   [2018/11/15]|
|      03/2021|   [2021/03/15]|
|       05 Jan|   [2020/01/05]|
|        01/05|   [2020/01/05]|
|         2022|   [2022/06/15]|
+-------------+---------------+



### `setAnchorDateDay()`

Sets the anchor day used to resolve relative dates such as "the day after tomorrow". If not set, the current day is used. The first day of the month has value 1.

In [13]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("original_date")\
    .setOutputCol("document")

doc2chunk = nlp.Doc2Chunk()\
    .setInputCols("document")\
    .setOutputCol("date_chunk")

date_normalizer = medical.DateNormalizer()\
    .setInputCols("date_chunk")\
    .setOutputCol("date")\
    .setAnchorDateYear(2000)\
    .setAnchorDateMonth(3)\
    .setAnchorDateDay(15)

pipeline = nlp.Pipeline(stages=[document_assembler, doc2chunk, date_normalizer])

In [14]:
result = pipeline.fit(df).transform(df)

result.selectExpr(
    "original_date",
    "date.result as normalized_date"
).show()

+-------------+---------------+
|original_date|normalized_date|
+-------------+---------------+
|   08/02/2018|   [2018/08/02]|
|      11/2018|   [2018/11/15]|
|   11/01/2018|   [2018/11/01]|
|    12Mar2021|   [2021/03/12]|
| Jan 30, 2018|   [2018/01/30]|
|   13.04.1999|   [1999/04/13]|
|  3April 2020|   [2020/04/03]|
|  next monday|   [2000/03/20]|
|        today|   [2000/03/15]|
|    next week|   [2000/03/22]|
|        08/02|   [2020/08/02]|
|      11/2018|   [2018/11/15]|
|      03/2021|   [2021/03/15]|
|       05 Jan|   [2020/01/05]|
|        01/05|   [2020/01/05]|
|         2022|   [2022/06/15]|
+-------------+---------------+
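
The relative-date rows in the three anchor examples are consistent with plain calendar arithmetic from the anchor date. The sketch below reproduces them with the standard library only. It is an illustration, not the annotator's code: `build_anchor` and `resolve_relative` are hypothetical names, unset parts are modelled as `None`, and the rule that "next monday" always moves forward by 1 to 7 days is an assumption. The run date of 2023-11-21 is inferred from the `today` rows in the outputs of this notebook.

```python
from datetime import date, timedelta
from typing import Optional


def build_anchor(today: date,
                 year: Optional[int] = None,
                 month: Optional[int] = None,
                 day: Optional[int] = None) -> date:
    """Use each anchor part if given, else the matching part of `today`."""
    return date(
        year if year is not None else today.year,
        month if month is not None else today.month,
        day if day is not None else today.day,
    )


def resolve_relative(expression: str, anchor: date) -> date:
    """Resolve the three relative expressions used in this notebook."""
    if expression == "today":
        return anchor
    if expression == "next week":
        return anchor + timedelta(days=7)
    if expression == "next monday":
        # Monday is weekday 0; assumed to move forward by 1 to 7 days.
        days_ahead = (0 - anchor.weekday()) % 7 or 7
        return anchor + timedelta(days=days_ahead)
    raise ValueError(f"unsupported expression: {expression}")


run_date = date(2023, 11, 21)  # inferred from the notebook outputs
anchor = build_anchor(run_date, year=2000, month=3, day=15)
for expr in ("next monday", "today", "next week"):
    print(expr, resolve_relative(expr, anchor).strftime("%Y/%m/%d"))
# next monday 2000/03/20
# today 2000/03/15
# next week 2000/03/22
```

Passing only `year=2000` to `build_anchor` yields an anchor of 2000/11/21, which matches the `setAnchorDateYear()` output earlier in this notebook.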



### `setOutputDateformat()`

Selects the output format. If not set, dates are formatted as `YYYY/MM/DD`. Options are:
  - `eu`: Format the dates as `DD/MM/YYYY`
  - `us`: Format the dates as `MM/DD/YYYY`
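
For a quick side-by-side view of the three layouts, the sketch below formats one date with the standard library. The mapping from option name to `strftime` pattern is an assumption made for illustration (the names `OUTPUT_PATTERNS` and `format_date` are hypothetical); it is not how the annotator is implemented.

```python
from datetime import date
from typing import Optional

# Hypothetical mapping from the option value to a strftime pattern.
OUTPUT_PATTERNS = {
    None: "%Y/%m/%d",  # option not set: YYYY/MM/DD
    "eu": "%d/%m/%Y",  # DD/MM/YYYY
    "us": "%m/%d/%Y",  # MM/DD/YYYY
}


def format_date(d: date, output_format: Optional[str] = None) -> str:
    """Format `d` using the layout tied to the given option."""
    return d.strftime(OUTPUT_PATTERNS[output_format])


sample = date(2018, 8, 2)
for option in (None, "eu", "us"):
    print(option, format_date(sample, option))
# None 2018/08/02
# eu 02/08/2018
# us 08/02/2018
```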

In [15]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("original_date")\
    .setOutputCol("document")

doc2chunk = nlp.Doc2Chunk()\
    .setInputCols("document")\
    .setOutputCol("date_chunk")

date_normalizer = medical.DateNormalizer()\
    .setInputCols("date_chunk")\
    .setOutputCol("date")\
    .setOutputDateformat('eu')

pipeline = nlp.Pipeline(stages=[document_assembler, doc2chunk, date_normalizer])

In [16]:
result = pipeline.fit(df).transform(df)

result.selectExpr(
    "original_date",
    "date.result as normalized_date"
).show()

+-------------+---------------+
|original_date|normalized_date|
+-------------+---------------+
|   08/02/2018|   [02/08/2018]|
|      11/2018|   [15/11/2018]|
|   11/01/2018|   [01/11/2018]|
|    12Mar2021|   [12/03/2021]|
| Jan 30, 2018|   [30/01/2018]|
|   13.04.1999|   [13/04/1999]|
|  3April 2020|   [03/04/2020]|
|  next monday|   [27/11/2023]|
|        today|   [21/11/2023]|
|    next week|   [28/11/2023]|
|        08/02|   [02/08/2020]|
|      11/2018|   [15/11/2018]|
|      03/2021|   [15/03/2021]|
|       05 Jan|   [05/01/2020]|
|        01/05|   [05/01/2020]|
|         2022|   [15/06/2022]|
+-------------+---------------+



Now setting it to `us`:

In [18]:
date_normalizer = medical.DateNormalizer()\
    .setInputCols("date_chunk")\
    .setOutputCol("date")\
    .setOutputDateformat('us')

pipeline = nlp.Pipeline(stages=[document_assembler, doc2chunk, date_normalizer])
result = pipeline.fit(df).transform(df)

result.selectExpr(
    "original_date",
    "date.result as normalized_date"
).show()

+-------------+---------------+
|original_date|normalized_date|
+-------------+---------------+
|   08/02/2018|   [08/02/2018]|
|      11/2018|   [11/15/2018]|
|   11/01/2018|   [11/01/2018]|
|    12Mar2021|   [03/12/2021]|
| Jan 30, 2018|   [01/30/2018]|
|   13.04.1999|   [04/13/1999]|
|  3April 2020|   [04/03/2020]|
|  next monday|   [11/27/2023]|
|        today|   [11/21/2023]|
|    next week|   [11/28/2023]|
|        08/02|   [08/02/2020]|
|      11/2018|   [11/15/2018]|
|      03/2021|   [03/15/2021]|
|       05 Jan|   [01/05/2020]|
|        01/05|   [01/05/2020]|
|         2022|   [06/15/2022]|
+-------------+---------------+



### `setDefaultReplacementYear()`

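Defines which value to use for creating the Year Value when original Date-Entity has no Year Information. Defaults to 2020. The example below passes 2020, which matches the default, so its output is identical to the `eu` run above.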

In [19]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("original_date")\
    .setOutputCol("document")

doc2chunk = nlp.Doc2Chunk()\
    .setInputCols("document")\
    .setOutputCol("date_chunk")

date_normalizer = medical.DateNormalizer()\
    .setInputCols("date_chunk")\
    .setOutputCol("date")\
    .setOutputDateformat('eu')\
    .setDefaultReplacementYear(2020)

pipeline = nlp.Pipeline(stages=[document_assembler, doc2chunk, date_normalizer])

In [20]:
result = pipeline.fit(df).transform(df)

result.selectExpr(
    "original_date",
    "date.result as normalized_date"
).show()

+-------------+---------------+
|original_date|normalized_date|
+-------------+---------------+
|   08/02/2018|   [02/08/2018]|
|      11/2018|   [15/11/2018]|
|   11/01/2018|   [01/11/2018]|
|    12Mar2021|   [12/03/2021]|
| Jan 30, 2018|   [30/01/2018]|
|   13.04.1999|   [13/04/1999]|
|  3April 2020|   [03/04/2020]|
|  next monday|   [27/11/2023]|
|        today|   [21/11/2023]|
|    next week|   [28/11/2023]|
|        08/02|   [02/08/2020]|
|      11/2018|   [15/11/2018]|
|      03/2021|   [15/03/2021]|
|       05 Jan|   [05/01/2020]|
|        01/05|   [05/01/2020]|
|         2022|   [15/06/2022]|
+-------------+---------------+



### `setDefaultReplacementMonth()`


Defines which value to use for creating the Month Value when original Date-Entity has no Month Information. Defaults to 6.

In [21]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("original_date")\
    .setOutputCol("document")

doc2chunk = nlp.Doc2Chunk()\
    .setInputCols("document")\
    .setOutputCol("date_chunk")

date_normalizer = medical.DateNormalizer()\
    .setInputCols("date_chunk")\
    .setOutputCol("date")\
    .setOutputDateformat('eu')\
    .setDefaultReplacementYear(2020)\
    .setDefaultReplacementMonth(3)

pipeline = nlp.Pipeline(stages=[document_assembler, doc2chunk, date_normalizer])

In [22]:
result = pipeline.fit(df).transform(df)

result.selectExpr(
    "original_date",
    "date.result as normalized_date"
).show()

+-------------+---------------+
|original_date|normalized_date|
+-------------+---------------+
|   08/02/2018|   [02/08/2018]|
|      11/2018|   [15/11/2018]|
|   11/01/2018|   [01/11/2018]|
|    12Mar2021|   [12/03/2021]|
| Jan 30, 2018|   [30/01/2018]|
|   13.04.1999|   [13/04/1999]|
|  3April 2020|   [03/04/2020]|
|  next monday|   [27/11/2023]|
|        today|   [21/11/2023]|
|    next week|   [28/11/2023]|
|        08/02|   [02/08/2020]|
|      11/2018|   [15/11/2018]|
|      03/2021|   [15/03/2021]|
|       05 Jan|   [05/01/2020]|
|        01/05|   [05/01/2020]|
|         2022|   [15/03/2022]|
+-------------+---------------+



### `setDefaultReplacementDay()`

Defines which value to use for creating the Day Value when original Date-Entity has no Day Information. Defaults to 15.

In [23]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("original_date")\
    .setOutputCol("document")

doc2chunk = nlp.Doc2Chunk()\
    .setInputCols("document")\
    .setOutputCol("date_chunk")

date_normalizer = medical.DateNormalizer()\
    .setInputCols("date_chunk")\
    .setOutputCol("date")\
    .setOutputDateformat('eu')\
    .setDefaultReplacementYear(2020)\
    .setDefaultReplacementMonth(3)\
    .setDefaultReplacementDay(2)

pipeline = nlp.Pipeline(stages=[document_assembler, doc2chunk, date_normalizer])

In [24]:
result = pipeline.fit(df).transform(df)

result.selectExpr(
    "original_date",
    "date.result as normalized_date"
).show()

+-------------+---------------+
|original_date|normalized_date|
+-------------+---------------+
|   08/02/2018|   [02/08/2018]|
|      11/2018|   [02/11/2018]|
|   11/01/2018|   [01/11/2018]|
|    12Mar2021|   [12/03/2021]|
| Jan 30, 2018|   [30/01/2018]|
|   13.04.1999|   [13/04/1999]|
|  3April 2020|   [03/04/2020]|
|  next monday|   [27/11/2023]|
|        today|   [21/11/2023]|
|    next week|   [28/11/2023]|
|        08/02|   [02/08/2020]|
|      11/2018|   [02/11/2018]|
|      03/2021|   [02/03/2021]|
|       05 Jan|   [05/01/2020]|
|        01/05|   [05/01/2020]|
|         2022|   [02/03/2022]|
+-------------+---------------+

