

![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/Spark_NLP_Udemy_MOOC/Open_Source/03.01.DateMatcher_MultiDateMatcher.ipynb)

# **DateMatcher and MultiDateMatcher**

This notebook shows how to use the Spark NLP `DateMatcher` and `MultiDateMatcher` annotators. It explains the differences between the two and describes each of their parameters with examples.

**📖 Learning Objectives:**

With this `DateMatcher` and `MultiDateMatcher` Notebook, you will be able to:
1. Know the differences between `DateMatcher` and `MultiDateMatcher`,
2. Extract dates from text,
3. Deal with relative dates,
4. Change input/output date formats,
5. Set a default day for dates that do not specify one,
6. Extract dates in different languages.

**🔗 Helpful Links:**

- Documentation : [DateMatcher](https://nlp.johnsnowlabs.com/docs/en/annotators#datematcher), [MultiDateMatcher](https://nlp.johnsnowlabs.com/docs/en/annotators#multidatematcher)

- Python Doc :  [DateMatcher](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/annotator/matcher/date_matcher/index.html#module-sparknlp.annotator.matcher.date_matcher), [MultiDateMatcher](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/annotator/matcher/multi_date_matcher/index.html)


- Scala Doc :  [DateMatcher](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/annotators/DateMatcher.html), [MultiDateMatcher](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/annotators/MultiDateMatcher.html)


- For extended examples of usage, see the [Spark NLP Workshop](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/2.Text_Preprocessing_with_SparkNLP_Annotators_Transformers.ipynb).

## **📜 Background**


`DateMatcher` and `MultiDateMatcher` extract *exact* and *normalized dates* from text, including relative date-time phrases, and convert them to a *provided date format*. `DateMatcher` extracts only one date per input document, while `MultiDateMatcher` can extract multiple dates.

Here are some examples of date entities that `DateMatcher` and `MultiDateMatcher` can match:

>` "1978-01-28", "1984/04/02,1/02/1980", "2/28/79", "The 31st of April in the year 2008", "Fri, 21 Nov 1997", "Jan 21, '97", "Sun", "Nov 21", "jan 1st", "next thursday", "last wednesday", "today", "tomorrow", "yesterday", "next week", "next month", "next year", "day after", "the day before", "0600h", "06:00 hours", "6pm", "5:30 a.m.", "at 5", "12:59", "23:59", "1988/11/23 6pm", "next week at 7.30", "5 am tomorrow"`


For example, `"The 31st of April in the year 2008"` will be converted into `2008/04/31`.









## **🎬 Colab Setup**

In [None]:
# Install PySpark and Spark NLP
! pip install -q pyspark==3.3.1 spark-nlp==4.3.0

In [None]:
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import DateMatcher, MultiDateMatcher
from pyspark.sql.types import StringType
from pyspark.ml import Pipeline

spark = sparknlp.start()
spark

## **🖨️ Inputs/Output Annotation Types:**

- Input Annotation types: `DOCUMENT`

- Output Annotation type: `DATE`


## **🔎 Parameters**


These annotators accept the following parameters:

- `inputFormats` (StringArrayParam) : Date Matcher regex patterns.

- `outputFormat` (String) : Output format of parsed date. (Default: "yyyy/MM/dd")

- `anchorDateYear` (Int) : Add an anchor year for the relative dates. (Default: -1, which means current year)

- `anchorDateMonth` (Int) : Add an anchor month for the relative dates. (Default: -1, which means current month)

- `anchorDateDay` (Int) : Add an anchor day for the relative dates. (Default: -1, which means current day)

- `defaultDayWhenMissing` (Int) : Which day to set when it is missing from parsed input. (Default: 1)

- `readMonthFirst` (Boolean) : Whether to interpret dates as "MM/DD/YYYY" instead of "DD/MM/YYYY". (Default: True)

- `sourceLanguage` (String) : Source language for explicit translation (Default: "en")
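
The `readMonthFirst` parameter matters because a numeric date such as `02/03/1966` can be read two ways. The sketch below uses only Python's standard `datetime` module to show the two readings; it illustrates the ambiguity and is not how Spark NLP parses dates internally.

```python
from datetime import datetime

ambiguous = "02/03/1966"

# Month-first reading, analogous to readMonthFirst=True (the default).
month_first = datetime.strptime(ambiguous, "%m/%d/%Y").date()

# Day-first reading, analogous to readMonthFirst=False.
day_first = datetime.strptime(ambiguous, "%d/%m/%Y").date()

print(month_first)  # 1966-02-03 (3 February)
print(day_first)    # 1966-03-02 (2 March)
```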

## Comparing DateMatcher and MultiDateMatcher

The pipeline below demonstrates the difference between the `DateMatcher` and `MultiDateMatcher` annotators.

In [None]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

date = DateMatcher() \
    .setInputCols("document") \
    .setOutputCol("date") \
    .setOutputFormat("yyyy/MM/dd")

multiDate = MultiDateMatcher() \
    .setInputCols("document") \
    .setOutputCol("multi_date") \
    .setOutputFormat("MM/dd/yy")


pipeline = Pipeline().setStages([
    documentAssembler,
    date,
    multiDate
])

text_list = ["See you on next monday.",  "She was born on 02/03/1966.", "The project started yesterday and will finish next year.", 
             "She will graduate by July 2023.", "She will visit doctor tomorrow and next month again."]

spark_df = spark.createDataFrame(text_list, StringType()).toDF("text")

In [None]:
result = pipeline.fit(spark_df).transform(spark_df)
result.selectExpr("text","date.result as date", "multi_date.result as multi_date").show(truncate=False)

+--------------------------------------------------------+------------+--------------------+
|text                                                    |date        |multi_date          |
+--------------------------------------------------------+------------+--------------------+
|See you on next monday.                                 |[2023/02/20]|[02/20/23]          |
|She was born on 02/03/1966.                             |[1966/02/03]|[02/03/66]          |
|The project started yesterday and will finish next year.|[2024/02/18]|[02/18/24, 02/17/23]|
|She will graduate by July 2023.                         |[2023/07/01]|[07/01/23]          |
|She will visit doctor tomorrow and next month again.    |[2023/03/18]|[03/18/23, 02/19/23]|
+--------------------------------------------------------+------------+--------------------+



As the results above show, `DateMatcher` returns only one date per input document, while `MultiDateMatcher` can return multiple dates.

The two matchers were also given different output formats in the pipeline, so their dates are formatted differently.

In [None]:
result.select("date","multi_date").show(truncate=False)

+-------------------------------------------------+----------------------------------------------------------------------------------------------+
|date                                             |multi_date                                                                                    |
+-------------------------------------------------+----------------------------------------------------------------------------------------------+
|[{date, 11, 18, 2023/02/20, {sentence -> 0}, []}]|[{date, 11, 18, 02/20/23, {sentence -> 0}, []}]                                               |
|[{date, 16, 25, 1966/02/03, {sentence -> 0}, []}]|[{date, 16, 25, 02/03/66, {sentence -> 0}, []}]                                               |
|[{date, 46, 54, 2024/02/18, {sentence -> 0}, []}]|[{date, 46, 54, 02/18/24, {sentence -> 0}, []}, {date, 20, 28, 02/17/23, {sentence -> 0}, []}]|
|[{date, 21, 29, 2023/07/01, {sentence -> 0}, []}]|[{date, 21, 29, 07/01/23, {sentence -> 0}, []}]                    

## Relative Dates

`DateMatcher` and `MultiDateMatcher` resolve relative dates (such as "yesterday" or "next month") into actual dates. To do so, they need a reference point, the anchor date, from which the actual date is calculated. Set it with `setAnchorDateDay()`, `setAnchorDateMonth()`, and `setAnchorDateYear()`.

If an anchor date parameter is not set, the current day, month, or year is used as its default value.

In [None]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

multiDate = MultiDateMatcher() \
    .setInputCols("document") \
    .setOutputCol("multi_date") \
    .setOutputFormat("MM/dd/yyyy")\
    .setAnchorDateYear(2001)\
    .setAnchorDateMonth(1)\
    .setAnchorDateDay(17)

multiDate_no_day = MultiDateMatcher() \
    .setInputCols("document") \
    .setOutputCol("multi_date_no_day") \
    .setOutputFormat("MM/dd/yyyy")\
    .setAnchorDateYear(2001)\
    .setAnchorDateMonth(1)

# Only the stages whose output is inspected below are needed
pipeline = Pipeline().setStages([
    documentAssembler,
    multiDate,
    multiDate_no_day
])

text_list = ["See you on next monday.",  "She was born on 02/03/1966.", "The project started yesterday and will finish next year.", 
             "She will graduate by July 2023.", "She will visit doctor tomorrow and next month again."]

spark_df = spark.createDataFrame(text_list, StringType()).toDF("text")

# Fit and transform only after the input DataFrame has been created
result = pipeline.fit(spark_df).transform(spark_df)
result.selectExpr("text", "multi_date.result as date", "multi_date_no_day.result as date_no_day_anchor").show(truncate=False)


+--------------------------------------------------------+------------------------+------------------------+
|text                                                    |date                    |date_no_day_anchor      |
+--------------------------------------------------------+------------------------+------------------------+
|See you on next monday.                                 |[01/22/2001]            |[01/22/2001]            |
|She was born on 02/03/1966.                             |[02/03/1966]            |[02/03/1966]            |
|The project started yesterday and will finish next year.|[01/17/2002, 01/16/2001]|[01/18/2002, 01/17/2001]|
|She will graduate by July 2023.                         |[07/01/2023]            |[07/01/2023]            |
|She will visit doctor tomorrow and next month again.    |[02/17/2001, 01/18/2001]|[02/18/2001, 01/19/2001]|
+--------------------------------------------------------+------------------------+------------------------+



In the `date` column, relative dates are calculated from the anchor date `01/17/2001`. In the `date_no_day_anchor` column, `anchorDateDay` is not set, so it defaults to the current day of the month at run time (the 18th when this output was produced). This is why the relative dates in that column are one day later than those in the `date` column.
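
The arithmetic behind the `date` column can be reproduced with Python's standard library. The helper `resolve_relative` below is a hypothetical toy function written for this illustration; it is not part of Spark NLP, covers only the phrases from the example sentences, and does not handle edge cases such as a 31st day rolling into a shorter month.

```python
from datetime import date, timedelta

def resolve_relative(phrase: str, anchor: date) -> date:
    """Resolve a few relative phrases against an anchor date (toy helper)."""
    if phrase == "yesterday":
        return anchor - timedelta(days=1)
    if phrase == "tomorrow":
        return anchor + timedelta(days=1)
    if phrase == "next year":
        return anchor.replace(year=anchor.year + 1)
    if phrase == "next month":
        # Roll over to January of the following year when the anchor is in December.
        year = anchor.year + (1 if anchor.month == 12 else 0)
        month = anchor.month % 12 + 1
        return anchor.replace(year=year, month=month)
    if phrase == "next monday":
        # weekday(): Monday is 0. Always move forward by 1 to 7 days.
        days_ahead = (0 - anchor.weekday() - 1) % 7 + 1
        return anchor + timedelta(days=days_ahead)
    raise ValueError(f"unsupported phrase: {phrase}")

# Same anchor as the one set with setAnchorDateYear/Month/Day above
anchor = date(2001, 1, 17)
for phrase in ["yesterday", "tomorrow", "next year", "next month", "next monday"]:
    print(phrase, "->", resolve_relative(phrase, anchor).strftime("%m/%d/%Y"))
```

With the anchor `01/17/2001`, these values match the `date` column in the table above.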

## Date Formats

Input and output date formats can be set with `setInputFormats`, `setOutputFormat`, and `setReadMonthFirst`. For more information on date formatting strings, see the [SimpleDateFormat documentation](https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html).
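
For readers who know Python's `strftime` better than Java's `SimpleDateFormat`, the sketch below pairs a few Java-style patterns of the kind used in this section with their closest `strftime` equivalents. The mapping is an approximation written for this illustration; Spark NLP itself expects the Java-style pattern strings, and the month and weekday names printed here assume Python's default (English) locale.

```python
from datetime import date

# Java-style pattern -> closest strftime pattern (illustrative mapping only)
java_to_strftime = {
    "yyyy/MM/dd": "%Y/%m/%d",
    "MM/dd/yy": "%m/%d/%y",
    "MM/dd/yyyy": "%m/%d/%Y",
    "MMMM dd, yyyy": "%B %d, %Y",       # full month name
    "EEEE, MM/dd/yyyy": "%A, %m/%d/%Y",  # full weekday name
}

sample = date(1966, 2, 3)
for java_pattern, py_pattern in java_to_strftime.items():
    print(f"{java_pattern!r:22} -> {sample.strftime(py_pattern)}")
```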

In [None]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

multiDate_1 = MultiDateMatcher() \
    .setInputCols("document") \
    .setOutputCol("multi_date_1") \
    .setOutputFormat("MM/dd/yy")

multiDate_2 = MultiDateMatcher() \
    .setInputCols("document") \
    .setOutputCol("multi_date_2") \
    .setOutputFormat("MMMM dd, yyyy")

multiDate_3 = MultiDateMatcher() \
    .setInputCols("document") \
    .setOutputCol("multi_date_3") \
    .setInputFormats(["dd/MM/yyyy"])\
    .setOutputFormat("EEEE, MM/dd/yyyy")


# Only the three matchers defined in this cell are needed
pipeline = Pipeline().setStages([
    documentAssembler,
    multiDate_1, multiDate_2, multiDate_3
])

text_list = ["See you on 1st December 2004.",  "She was born on 02/03/1966.", "The project started on yesterday and will finish next year.", 
             "She will graduate by July 2023.", "She will visit doctor tomorrow and next month again."]

spark_df = spark.createDataFrame(text_list, StringType()).toDF("text")

result = pipeline.fit(spark_df).transform(spark_df)
result.selectExpr("text", "multi_date_1.result as date_1", "multi_date_2.result as date_2", "multi_date_3.result as date_3").show(truncate=False)

+-----------------------------------------------------------+--------------------+--------------------------------------+-----------------------+
|text                                                       |date_1              |date_2                                |date_3                 |
+-----------------------------------------------------------+--------------------+--------------------------------------+-----------------------+
|See you on 1st December 2004.                              |[12/01/04]          |[December 01, 2004]                   |[]                     |
|She was born on 02/03/1966.                                |[02/03/66]          |[February 03, 1966]                   |[Wednesday, 03/02/1966]|
|The project started on yesterday and will finish next year.|[02/18/24, 02/17/23]|[February 18, 2024, February 17, 2023]|[]                     |
|She will graduate by July 2023.                            |[07/01/23]          |[July 01, 2023]                       |[] 

## Missing Days

Sometimes a date expression does not specify a day, for example "She will graduate by July 2023". In this case, you can set a default day with `setDefaultDayWhenMissing`. If it is not set, the default value is `1`.

In [None]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

multiDate = MultiDateMatcher() \
    .setInputCols("document") \
    .setOutputCol("date")

multiDate_missing_day_set = MultiDateMatcher() \
    .setInputCols("document") \
    .setOutputCol("date_missing_day_set") \
    .setDefaultDayWhenMissing(15)


pipeline = Pipeline().setStages([
    documentAssembler,
    multiDate,
    multiDate_missing_day_set
])

text_list = ["See you on December 2004.",  "She was born on 02/03/1966.", "The project started on yesterday and will finish next year.", 
             "She will graduate by July 2023.", "She will visit doctor tomorrow and next month again."]

spark_df = spark.createDataFrame(text_list, StringType()).toDF("text")

result = pipeline.fit(spark_df).transform(spark_df)
result.selectExpr("text", "date.result as date", "date_missing_day_set.result as date_missing_day_set").show(truncate=False)

+-----------------------------------------------------------+------------------------+------------------------+
|text                                                       |date                    |date_missing_day_set    |
+-----------------------------------------------------------+------------------------+------------------------+
|See you on December 2004.                                  |[2004/12/01]            |[2004/12/15]            |
|She was born on 02/03/1966.                                |[1966/02/03]            |[1966/02/03]            |
|The project started on yesterday and will finish next year.|[2024/02/18, 2023/02/17]|[2024/02/18, 2023/02/17]|
|She will graduate by July 2023.                            |[2023/07/01]            |[2023/07/15]            |
|She will visit doctor tomorrow and next month again.       |[2023/03/18, 2023/02/19]|[2023/03/18, 2023/02/19]|
+-----------------------------------------------------------+------------------------+------------------------+

As the results above show, the missing days in rows 1 and 4 are set to `15` in the `date_missing_day_set` column but to `1` in the `date` column.
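
The same idea can be sketched in plain Python. The helper `parse_month_year` below is hypothetical and exists only for this illustration; it mimics the effect of `setDefaultDayWhenMissing` for "Month YYYY" strings and is not how Spark NLP implements the parameter.

```python
from datetime import datetime

def parse_month_year(text: str, default_day: int = 1):
    """Parse 'Month YYYY' and fill in the missing day (toy helper)."""
    # strptime sets the day to 1 when the format contains no day field
    parsed = datetime.strptime(text, "%B %Y").date()
    return parsed.replace(day=default_day)

print(parse_month_year("July 2023"))                      # 2023-07-01
print(parse_month_year("July 2023", default_day=15))      # 2023-07-15
print(parse_month_year("December 2004", default_day=15))  # 2004-12-15
```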

## Other Languages

Date matchers can also be used with languages other than English. Set the language with `setSourceLanguage`; its default value is `"en"` (English).

In [None]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

multiDate = MultiDateMatcher() \
    .setInputCols("document") \
    .setOutputCol("multi_date") \
    .setOutputFormat("yyyy/MM/dd")\
    .setSourceLanguage("de")


pipeline = Pipeline().setStages([
    documentAssembler,
    multiDate
])

spark_df = spark.createDataFrame([["Das letzte zahlungsdatum dieser rechnung ist der 4. mai 1998."], ["Wir haben morgen eine prüfung."]]).toDF("text")

result = pipeline.fit(spark_df).transform(spark_df)
result.selectExpr("text", "multi_date.result as date").show(truncate=False)

+-------------------------------------------------------------+------------+
|text                                                         |date        |
+-------------------------------------------------------------+------------+
|Das letzte zahlungsdatum dieser rechnung ist der 4. mai 1998.|[1998/05/04]|
|Wir haben morgen eine prüfung.                               |[2023/02/19]|
+-------------------------------------------------------------+------------+



In the German example above, the first row contains an explicit date, while the second contains a relative date ("morgen" means "tomorrow"). Both are returned in the requested output format.

You can find the supported languages [here](https://github.com/JohnSnowLabs/spark-nlp/blob/281c0af227f3ccc1b973ac4b89ccae3aa89a9ae3/src/main/resources/date-matcher/supported_languages.txt).