# DateMatcher multi-language

#### This annotator allows you to specify a source language that will be used to identify temporal keywords and extract dates.

In [1]:
# This is only needed to set up PySpark and Spark NLP on Colab
!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash

--2022-12-23 12:23:48--  http://setup.johnsnowlabs.com/colab.sh

In [2]:
# Import Spark NLP
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.pretrained import PretrainedPipeline
import sparknlp

# Start Spark Session with Spark NLP
# The start() function has two parameters: gpu and spark23
# sparknlp.start(gpu=True) will start the session with GPU support
# sparknlp.start(spark23=True) is when you have Apache Spark 2.3.x installed
spark = sparknlp.start()

In [3]:
spark

In [4]:
sparknlp.version()

'4.2.6'

# Italian examples

### Let's import some sentences from news articles in which relative dates are present.

In [5]:
it_articles = [
  ("Così il ct azzurro Roberto Mancini, poco prima di entrare al Quirinale dove l'Italia campione d'Europa sta per essere accolta dal Presidente della Repubblica Sergio Mattarella oggi.",),
  ("I giocatori della nazionale italiana campione d'Europa sono stati ricevuti al Quirinale il 13 Luglio 2021 per un incontro con il presidente della Repubblica, Sergio Mattarella.",),
  ("Il presidente della Repubblica Sergio Mattarella ha ricevuto ieri, alle ore 17.00 al Quirinale, la Nazionale italiana di calcio vincitrice del Campionato Europeo UEFA Euro 2020 e Matteo Berrettini, finalista al Torneo di Wimbledon.",)
]

### Let's fill a DataFrame with the text column

In [6]:
articles_cols = ["text"]

df = spark.createDataFrame(data=it_articles, schema=articles_cols)

df.printSchema()
df.show()

root
 |-- text: string (nullable = true)

+--------------------+
|                text|
+--------------------+
|Così il ct azzurr...|
|I giocatori della...|
|Il presidente del...|
+--------------------+



### Now, let's create a simple pipeline to apply the DateMatcher, specifying the source language

In [8]:
document_assembler = DocumentAssembler() \
            .setInputCol("text") \
            .setOutputCol("document")

date_matcher = DateMatcher() \
            .setInputCols(['document']) \
            .setOutputCol("date") \
            .setOutputFormat("MM/dd/yyyy") \
            .setSourceLanguage("it")
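
The `"MM/dd/yyyy"` string passed to `setOutputFormat` looks like a Java-style date pattern (two-digit month, two-digit day, four-digit year). As a rough sketch, and assuming that `"%m/%d/%Y"` is the matching Python `strftime` format, this is what the pattern should produce for the explicit date in the second article (13 Luglio 2021). The names `PY_FORMAT`, `explicit_date`, `formatted`, and `parsed` are only illustrative and are not part of Spark NLP.

```python
from datetime import date, datetime

# Assumption: the Java-style pattern "MM/dd/yyyy" corresponds to "%m/%d/%Y" in Python.
PY_FORMAT = "%m/%d/%Y"

# The explicit date mentioned in the second article: 13 Luglio 2021.
explicit_date = date(2021, 7, 13)

# Format the date the way the configured output format is expected to render it.
formatted = explicit_date.strftime(PY_FORMAT)
print(formatted)

# Parse a formatted string back into a date object, e.g. for downstream filtering.
parsed = datetime.strptime(formatted, PY_FORMAT).date()
print(parsed.isoformat())
```

The same `strptime` call can be used later on the `result` strings returned by the annotator if Python date objects are needed.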

### Let's transform the DataFrame content to extract the dates

In [9]:
assembled = document_assembler.transform(df)
date_matcher.transform(assembled).select('date').show(10, False)

+---------------------------------------------------+
|date                                               |
+---------------------------------------------------+
|[{date, 175, 183, 12/23/2022, {sentence -> 0}, []}]|
|[{date, 91, 102, 07/13/2021, {sentence -> 0}, []}] |
|[{date, 61, 69, 12/22/2022, {sentence -> 0}, []}]  |
+---------------------------------------------------+
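
The first and third rows come from the relative keywords "oggi" (today) and "ieri" (yesterday), which appear to be resolved against the date on which the notebook was run (2022-12-23), while the second row comes from the explicit date in the text. The sketch below only illustrates that idea in plain Python. The lookup table `RELATIVE_OFFSETS_IT` and the helper `resolve_relative` are hypothetical and do not reflect how DateMatcher is implemented internally.

```python
from datetime import date, timedelta

# Hypothetical lookup table for illustration only; this is not DateMatcher's vocabulary.
RELATIVE_OFFSETS_IT = {"oggi": 0, "ieri": -1, "domani": 1}


def resolve_relative(keyword, reference):
    """Resolve an Italian relative keyword against a reference date (MM/dd/yyyy)."""
    offset = RELATIVE_OFFSETS_IT[keyword.lower()]
    return (reference + timedelta(days=offset)).strftime("%m/%d/%Y")


# Assumption: the reference date is the notebook's run date, 2022-12-23.
run_date = date(2022, 12, 23)

print(resolve_relative("oggi", run_date))  # 12/23/2022
print(resolve_relative("ieri", run_date))  # 12/22/2022
```

Because relative keywords depend on the reference date, re-running the notebook on a different day is expected to change the first and third rows, while the second row stays the same.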

