![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# Date Normalizer

## **Setup**

In [2]:
import json
import os

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml import Pipeline,PipelineModel

import warnings
warnings.filterwarnings('ignore')
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' 

from johnsnowlabs import nlp, medical

# Start (or reuse) the Spark session through the johnsnowlabs library
spark = nlp.start()
spark.sparkContext.setLogLevel("ERROR")

spark

Spark Session already created, some configs may not take.


## **Date Normalizer**

`DateNormalizer` is an annotator that identifies dates in chunk annotations and transforms them to a normalized date in the format `YYYY/MM/DD`.
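To make the idea of normalization concrete, here is a minimal plain-Python sketch that maps a few date spellings to `YYYY/MM/DD`. It is only an illustration: the `KNOWN_FORMATS` list and the `to_normalized` helper are hypothetical names, and this is not how the annotator is implemented. The annotator handles many more patterns than the four assumed here.

```python
from datetime import datetime
from typing import Optional

# Hypothetical, deliberately short list of input patterns (an assumption
# made for this sketch; the real annotator recognizes far more).
KNOWN_FORMATS = ["%m/%d/%Y", "%d.%m.%Y", "%d%b%Y", "%b %d, %Y"]

def to_normalized(chunk: str) -> Optional[str]:
    """Return the chunk as YYYY/MM/DD, or None if no pattern matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(chunk, fmt).strftime("%Y/%m/%d")
        except ValueError:
            # This pattern does not fit the chunk; try the next one.
            continue
    return None

for chunk in ["08/02/2018", "13.04.1999", "12Mar2021", "Jan 30, 2018"]:
    print(chunk, "->", to_normalized(chunk))
```

Month-name patterns such as `%b` depend on the active locale, so the sketch assumes an English locale.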



We are going to create date chunks in different formats:

In [3]:
from pyspark.sql.types import StructType, IntegerType, StringType

In [4]:
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Sentence Detector annotator, processes various sentences per line
sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP
tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Clinical word embeddings trained on PubMED dataset
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

# NER model trained on n2c2 (de-identification and Heart Disease Risk Factors Challenge) datasets
clinical_ner = medical.NerModel.pretrained("ner_deid_generic_augmented", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = medical.NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("date_chunk")\
    .setWhiteList(["DATE"])

date_normalizer = medical.DateNormalizer()\
    .setInputCols('date_chunk')\
    .setOutputCol('normalized_date')

nlpPipeline = nlp.Pipeline(stages=[
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      word_embeddings,
      clinical_ner,
      ner_converter,
      date_normalizer])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
Download done! Loading the resource.
[OK!]
ner_deid_generic_augmented download started this may take some time.
Approximate size to download 13.8 MB
Download done! Loading the resource.
[OK!]


In [5]:
dates = [
'She has a history of pericarditis and pericardectomy on 08/02/2018 and developed a cough with right-sided chest pain.' ,
'She has been receiving gemcitabine and she receives three cycles of this with the last one being given on 11/2018. ',
'She was last seen in the clinic on 11/01/2018by Dr. Y.',
'Chris Brown was discharged on 12Mar2021',
'Last INR was on Tuesday, Jan 30, 2018, and her INR was 2.3. 2. Amiodarone 100 mg p.o. daily. ',
'We reviewed the pathology obtained from the pericardectomy on 13.04.1999, which was diagnostic of mesothelioma', 
'A review of her CT scan on 3 April2020 prior to her pericardectomy, already shows bilateral plural effusions. ',
]

df_dates = spark.createDataFrame(dates,StringType()).toDF('text')

df_dates.show(truncate=100)

+----------------------------------------------------------------------------------------------------+
|                                                                                                text|
+----------------------------------------------------------------------------------------------------+
|She has a history of pericarditis and pericardectomy on 08/02/2018 and developed a cough with rig...|
|She has been receiving gemcitabine and she receives three cycles of this with the last one being ...|
|                                              She was last seen in the clinic on 11/01/2018by Dr. Y.|
|                                                             Chris Brown was discharged on 12Mar2021|
|       Last INR was on Tuesday, Jan 30, 2018, and her INR was 2.3. 2. Amiodarone 100 mg p.o. daily. |
|We reviewed the pathology obtained from the pericardectomy on 13.04.1999, which was diagnostic of...|
|A review of her CT scan on 3 April2020 prior to her pericardectomy, alre

In [7]:
result = model.transform(df_dates)

We are going to show how the date is normalized.

In [8]:
import pyspark.sql.functions as F

result_df = result.select("text",F.explode(F.arrays_zip(result.date_chunk.result, 
                                                        result.normalized_date.result)).alias("cols")) \
                  .select("text",F.expr("cols['0']").alias("date_chunk"),
                                 F.expr("cols['1']").alias("normalized_date"))
                  
result_df.show(truncate=False)

                                                                                

+---------------------------------------------------------------------------------------------------------------------+---------------------+---------------+
|text                                                                                                                 |date_chunk           |normalized_date|
+---------------------------------------------------------------------------------------------------------------------+---------------------+---------------+
|She has a history of pericarditis and pericardectomy on 08/02/2018 and developed a cough with right-sided chest pain.|08/02/2018           |2018/08/02     |
|She has been receiving gemcitabine and she receives three cycles of this with the last one being given on 11/2018.   |11/2018              |2018/11/15     |
|She was last seen in the clinic on 11/01/2018by Dr. Y.                                                               |11/01/2018by         |2018/11/01     |
|Chris Brown was discharged on 12Mar2021            

### Replacer

In [9]:
date_normalizer = medical.DateNormalizer()\
    .setInputCols('date_chunk')\
    .setOutputCol('normalized_date')

replacer = medical.Replacer()\
    .setInputCols(["normalized_date","document"])\
    .setOutputCol("replaced_document")

nlpPipeline = nlp.Pipeline(stages=[
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      word_embeddings,
      clinical_ner,
      ner_converter,
      date_normalizer,
      replacer])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

In [10]:
result = model.transform(df_dates)

result_df = result.select("text",F.explode(F.arrays_zip(result.date_chunk.result, 
                                                        result.normalized_date.result,
                                                        result.replaced_document.result)).alias("cols")) \
                  .select("text",F.expr("cols['1']").alias("normalized_date"),
                                 F.expr("cols['2']").alias("replaced_document"))
                  
result_df.show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------+---------------+----------------------------------------------------------------------------------------------------------------------+
|text                                                                                                                 |normalized_date|replaced_document                                                                                                     |
+---------------------------------------------------------------------------------------------------------------------+---------------+----------------------------------------------------------------------------------------------------------------------+
|She has a history of pericarditis and pericardectomy on 08/02/2018 and developed a cough with right-sided chest pain.|2018/08/02     |She has a history of pericarditis and pericardectomy on 2018/08/02 and developed a cough with right-

### Date Format

The `setOutputDateformat` parameter of `DateNormalizer` lets you customize the output format: `us` for `MM/DD/YYYY` or `eu` for `DD/MM/YYYY`.
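As a rough illustration of the two layouts named above, the following plain-Python sketch renders one date both ways with `strftime`. The `OUTPUT_PATTERNS` mapping and the `render` helper are hypothetical names introduced for this sketch. They follow the format descriptions in the text and are not the annotator's internal logic.

```python
from datetime import date

# Assumed mapping from the option name to a strftime pattern, based only
# on the format descriptions above (not taken from the library).
OUTPUT_PATTERNS = {"us": "%m/%d/%Y", "eu": "%d/%m/%Y"}

def render(d: date, output_format: str) -> str:
    """Format a date according to the chosen layout name."""
    return d.strftime(OUTPUT_PATTERNS[output_format])

d = date(2018, 8, 2)
print("us:", render(d, "us"))
print("eu:", render(d, "eu"))
```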

In [11]:
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Sentence Detector annotator, processes various sentences per line
sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP
tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Clinical word embeddings trained on PubMED dataset
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

# NER model trained on n2c2 (de-identification and Heart Disease Risk Factors Challenge) datasets
clinical_ner = medical.NerModel.pretrained("ner_deid_generic_augmented", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = medical.NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("date_chunk")\
    .setWhiteList(["DATE"])

date_normalizer_us = medical.DateNormalizer()\
    .setInputCols('date_chunk')\
    .setOutputCol('normalized_date_us')\
    .setOutputDateformat('us')

date_normalizer_eu = medical.DateNormalizer()\
    .setInputCols('date_chunk')\
    .setOutputCol('normalized_date_eu')\
    .setOutputDateformat('eu')

nlpPipeline = nlp.Pipeline(stages=[
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      word_embeddings,
      clinical_ner,
      ner_converter,
      date_normalizer_us,
      date_normalizer_eu
      ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_deid_generic_augmented download started this may take some time.
[OK!]


In [12]:
result = model.transform(df_dates)

In [13]:
result_df = result.select("text",F.explode(F.arrays_zip(result.date_chunk.result, 
                                                        result.normalized_date_us.result,
                                                        result.normalized_date_eu.result)).alias("cols")) \
                  .select("text",F.expr("cols['0']").alias("date_chunk"),
                                 F.expr("cols['1']").alias("normalized_date_us"),
                                 F.expr("cols['2']").alias("normalized_date_eu"))
                  
result_df.show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------+---------------------+------------------+------------------+
|text                                                                                                                 |date_chunk           |normalized_date_us|normalized_date_eu|
+---------------------------------------------------------------------------------------------------------------------+---------------------+------------------+------------------+
|She has a history of pericarditis and pericardectomy on 08/02/2018 and developed a cough with right-sided chest pain.|08/02/2018           |08/02/2018        |2018/08/02        |
|She has been receiving gemcitabine and she receives three cycles of this with the last one being given on 11/2018.   |11/2018              |11/15/2018        |2018/11/15        |
|She was last seen in the clinic on 11/01/2018by Dr. Y.                                             

### Default Replacement

If the day, month, or year is missing from a date, the normalizer fills it in with a default value. Each default can be overridden with the corresponding setter:

- `setDefaultReplacementDay`: default value is 15
- `setDefaultReplacementMonth`: default value is 6 (June, which appears as `06` in the normalized output)
- `setDefaultReplacementYear`: default value is 2020
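The fill-in rule can be sketched in plain Python as follows. The `fill_missing` helper is a hypothetical name, and the sketch assumes the date components have already been extracted. It only shows how defaults complete a partial date and is not the annotator's own code.

```python
from typing import Optional

def fill_missing(year: Optional[int] = None,
                 month: Optional[int] = None,
                 day: Optional[int] = None,
                 default_day: int = 15,
                 default_month: int = 6,
                 default_year: int = 2020) -> str:
    """Complete a partial date with defaults and return it as YYYY/MM/DD."""
    y = year if year is not None else default_year
    m = month if month is not None else default_month
    d = day if day is not None else default_day
    return f"{y:04d}/{m:02d}/{d:02d}"

print(fill_missing(year=2018, month=11))  # day is missing
print(fill_missing(month=1, day=5))       # year is missing
print(fill_missing(year=2022))            # month and day are missing
```

Passing other values for `default_day`, `default_month`, and `default_year` mirrors what the three setters do on the annotator.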


In [14]:
date_normalizer_us = medical.DateNormalizer()\
    .setInputCols('date_chunk')\
    .setOutputCol('normalized_date_us')\
    .setOutputDateformat('us')\
    .setDefaultReplacementDay(2)\
    .setDefaultReplacementMonth(3)\
    .setDefaultReplacementYear(2024)

nlpPipeline = nlp.Pipeline(stages=[
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      word_embeddings,
      clinical_ner,
      ner_converter,
      date_normalizer_us
      ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

In [15]:
dates = [
'08/02',
'11/2018',
'03/2021',
'05 Jan',
'01/05',
'2022'
]

from pyspark.sql.types import StringType
df_dates = spark.createDataFrame(dates,StringType()).toDF('text')

In [16]:
result = model.transform(df_dates)

In [17]:
result_df = result.select("text",F.explode(F.arrays_zip(result.date_chunk.result, 
                                                        result.normalized_date_us.result)).alias("cols")) \
                  .select("text",F.expr("cols['0']").alias("date_chunk"),
                                 F.expr("cols['1']").alias("normalized_date_us"))
                  
result_df.show(truncate=False)

+-------+----------+------------------+
|text   |date_chunk|normalized_date_us|
+-------+----------+------------------+
|08/02  |08/02     |08/02/2024        |
|11/2018|11/2018   |11/02/2018        |
|03/2021|03/2021   |03/02/2021        |
|05 Jan |05 Jan    |01/05/2024        |
|01/05  |01/05     |01/05/2024        |
|2022   |2022      |03/02/2022        |
+-------+----------+------------------+



### Only date chunk

We are going to create date chunks in different formats:

In [18]:
dates = [
'08/02/2018',
'11/2018',
'11/01/2018',
'12Mar2021',
'Jan 30, 2018',
'13.04.1999', 
'3April 2020',
'03/2021',
'05 Jan',
'01/05',
'2022'
]



In [19]:
df_dates = spark.createDataFrame(dates,StringType()).toDF('ner_chunk')

Next, we convert that text to documents with Spark NLP.

In [20]:
document_assembler = nlp.DocumentAssembler().setInputCol('ner_chunk').setOutputCol('document')
documents_DF = document_assembler.transform(df_dates)

After that, we convert those documents to chunks.

In [21]:
chunks_df = nlp.map_annotations_col(documents_DF.select("document","ner_chunk"),
                    lambda x: [nlp.Annotation('chunk', a.begin, a.end, a.result, a.metadata, a.embeddings) for a in x], "document",
                    "chunk_date", "chunk")

In [22]:
chunks_df.select('chunk_date').show(truncate=False)



+---------------------------------------------------+
|chunk_date                                         |
+---------------------------------------------------+
|[{chunk, 0, 9, 08/02/2018, {sentence -> 0}, []}]   |
|[{chunk, 0, 6, 11/2018, {sentence -> 0}, []}]      |
|[{chunk, 0, 9, 11/01/2018, {sentence -> 0}, []}]   |
|[{chunk, 0, 8, 12Mar2021, {sentence -> 0}, []}]    |
|[{chunk, 0, 11, Jan 30, 2018, {sentence -> 0}, []}]|
|[{chunk, 0, 9, 13.04.1999, {sentence -> 0}, []}]   |
|[{chunk, 0, 10, 3April 2020, {sentence -> 0}, []}] |
|[{chunk, 0, 6, 03/2021, {sentence -> 0}, []}]      |
|[{chunk, 0, 5, 05 Jan, {sentence -> 0}, []}]       |
|[{chunk, 0, 4, 01/05, {sentence -> 0}, []}]        |
|[{chunk, 0, 3, 2022, {sentence -> 0}, []}]         |
+---------------------------------------------------+



                                                                                

Now we are going to normalize those chunks using the DateNormalizer.

In [23]:
date_normalizer = medical.DateNormalizer().setInputCols('chunk_date').setOutputCol('date')

In [24]:
date_normalized_df = date_normalizer.transform(chunks_df)

We are going to show how the date is normalized.

In [25]:
dateNormalizedClean = date_normalized_df.selectExpr("ner_chunk","date.result as dateresult","date.metadata as metadata")

dateNormalizedClean.withColumn("dateresult", dateNormalizedClean["dateresult"]
                               .getItem(0)).withColumn("metadata", dateNormalizedClean["metadata"]
                                                       .getItem(0)['normalized']).show(truncate=False)

+------------+----------+--------+
|ner_chunk   |dateresult|metadata|
+------------+----------+--------+
|08/02/2018  |2018/08/02|true    |
|11/2018     |2018/11/15|true    |
|11/01/2018  |2018/11/01|true    |
|12Mar2021   |2021/03/12|true    |
|Jan 30, 2018|2018/01/30|true    |
|13.04.1999  |1999/04/13|true    |
|3April 2020 |2020/04/03|true    |
|03/2021     |2021/03/15|true    |
|05 Jan      |2020/01/05|true    |
|01/05       |2020/01/05|true    |
|2022        |2022/06/15|true    |
+------------+----------+--------+



## **Relative Date**

We can configure `anchorDateYear`, `anchorDateMonth`, and `anchorDateDay` to set the reference date used to resolve relative dates.

In [26]:
rel_dates = [
'next monday',
'today',
'next week'
]

rel_dates_df = spark.createDataFrame(rel_dates,StringType()).toDF('ner_chunk')

In [27]:
rel_documents_DF = document_assembler.transform(rel_dates_df)

rel_chunks_df = nlp.map_annotations_col(rel_documents_DF.select("document","ner_chunk"),
                    lambda x: [nlp.Annotation('chunk', a.begin, a.end, a.result, a.metadata, a.embeddings) for a in x], "document",
                    "chunk_date", "chunk")

In [28]:
rel_chunks_df.select('chunk_date').show(truncate=False)



+--------------------------------------------------+
|chunk_date                                        |
+--------------------------------------------------+
|[{chunk, 0, 10, next monday, {sentence -> 0}, []}]|
|[{chunk, 0, 4, today, {sentence -> 0}, []}]       |
|[{chunk, 0, 8, next week, {sentence -> 0}, []}]   |
+--------------------------------------------------+



                                                                                

In the following example we use 2021/02/16 as the reference date. To do that, we set `anchorDateYear` to 2021, `anchorDateMonth` to 2, and `anchorDateDay` to 16:

In [29]:
rel_date_normalizer = medical.DateNormalizer()\
                        .setInputCols('chunk_date')\
                        .setOutputCol('date')\
                        .setAnchorDateDay(16)\
                        .setAnchorDateMonth(2)\
                        .setAnchorDateYear(2021)

In [30]:
rel_date_normalized_df = rel_date_normalizer.transform(rel_chunks_df)
relDateNormalizedClean = rel_date_normalized_df.selectExpr("ner_chunk","date.result as dateresult","date.metadata as metadata")
relDateNormalizedClean.withColumn("dateresult", relDateNormalizedClean["dateresult"].getItem(0))\
                      .withColumn("metadata", relDateNormalizedClean["metadata"].getItem(0)['normalized']).show(truncate=False)

+-----------+----------+--------+
|ner_chunk  |dateresult|metadata|
+-----------+----------+--------+
|next monday|2021/02/22|true    |
|today      |2021/02/16|true    |
|next week  |2021/02/23|true    |
+-----------+----------+--------+



As you can see, relative dates such as `next monday`, `today`, and `next week` are resolved against the reference date `2021/02/16`.
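The anchor-date arithmetic can be reproduced with a short plain-Python sketch. The `resolve_relative` helper is a hypothetical name, and it supports only the three expressions used in this notebook. The rules it applies (seven days for `next week`, the first Monday strictly after the anchor for `next monday`) are assumptions chosen to match the output above, not a description of the annotator's internals.

```python
from datetime import date, timedelta

def resolve_relative(expr: str, anchor: date) -> date:
    """Resolve a small set of relative date expressions against an anchor."""
    expr = expr.lower().strip()
    if expr == "today":
        return anchor
    if expr == "next week":
        return anchor + timedelta(days=7)
    if expr == "next monday":
        # Days until the following Monday (weekday 0); use 7 rather than 0
        # when the anchor itself is a Monday.
        days_ahead = (0 - anchor.weekday()) % 7 or 7
        return anchor + timedelta(days=days_ahead)
    raise ValueError(f"unsupported expression: {expr!r}")

anchor = date(2021, 2, 16)  # same anchor as in the example above
for expr in ["next monday", "today", "next week"]:
    print(expr, "->", resolve_relative(expr, anchor).strftime("%Y/%m/%d"))
```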
