

![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/DATE_MATCHER.ipynb)




# **Spark NLP Date Matcher**

### Spark NLP documentation and instructions:
https://nlp.johnsnowlabs.com/docs/en/quickstart

### You can find details about Spark NLP annotators here:
https://nlp.johnsnowlabs.com/docs/en/annotators

### You can find details about Spark NLP models here:
https://nlp.johnsnowlabs.com/models


## 1. Colab Setup

In [1]:
!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash
# To pin versions, run the script with flags instead:
#   -p sets the PySpark version
#   -s sets the Spark NLP version
# !bash colab.sh -p 3.1.1 -s 3.0.1
# Without flags, both default to the latest release.

openjdk version "11.0.10" 2021-01-19
OpenJDK Runtime Environment (build 11.0.10+9-Ubuntu-0ubuntu1.18.04)
OpenJDK 64-Bit Server VM (build 11.0.10+9-Ubuntu-0ubuntu1.18.04, mixed mode, sharing)
setup Colab for PySpark 3.1.1 and Spark NLP 3.0.0
  Building wheel for pyspark (setup.py) ... done


## 2. Start the Spark session

Import the dependencies and start the Spark session.

In [2]:
import json
import pandas as pd
import numpy as np

from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from sparknlp.annotator import *
from sparknlp.base import *
import sparknlp
from sparknlp.pretrained import PretrainedPipeline

spark = sparknlp.start()

## 3. Build Pipeline

In [3]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetector()\
    .setInputCols("document")\
    .setOutputCol("sentence")

date_matcher = DateMatcher()\
    .setInputCols("sentence")\
    .setOutputCol("date")\
    .setDateFormat("yyyy/MM/dd")

pipeline1 = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    date_matcher,
])

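# None of these stages needs training data, so fitting on an empty
# DataFrame is enough to obtain a PipelineModel.
empty_df = spark.createDataFrame([['']]).toDF("text")

date_pp = pipeline1.fit(empty_df)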
date_model = LightPipeline(date_pp)
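
The `"yyyy/MM/dd"` pattern passed to `setDateFormat` uses Java-style date letters (`yyyy` year, `MM` month, `dd` day). As a rough plain-Python analogy, the sketch below shows the same layout with the standard library's `strftime`. The helper name `format_like_pipeline` is hypothetical and exists only for illustration; it is not part of Spark NLP.

```python
from datetime import date


def format_like_pipeline(d: date) -> str:
    """Hypothetical helper: format a date in the yyyy/MM/dd layout."""
    # "%Y/%m/%d" is the strftime counterpart of the "yyyy/MM/dd" pattern:
    # four-digit year, zero-padded month, zero-padded day.
    return d.strftime("%Y/%m/%d")


example = format_like_pipeline(date(2021, 4, 4))
print(example)
```

Changing the pattern given to `setDateFormat` changes how matched dates are rendered in the `date` output column.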

## 4. Run & Visualize

In [4]:
input_list = [
    """David visited the restaurant yesterday with his family. 
He also visited and the day before, but at that time he was alone.
David again visited today with his colleagues.
He and his friends really liked the food and hoped to visit again tomorrow.""",]

In [5]:

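# fullAnnotate returns one result dict per input text; take the first.
tres = date_model.fullAnnotate(input_list)[0]
for dte in tres['date']:
    # Each date annotation records the index of the sentence it came from.
    sent = tres['sentence'][int(dte.metadata['sentence'])]
    # The end offset is inclusive, hence the +1 in the slice.
    print(f'text/chunk {sent.result[dte.begin:dte.end+1]} | mapped_date: {dte.result}')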

text/chunk yesterday | mapped_date: 2021/04/03
text/chunk day before | mapped_date: 2021/04/03
text/chunk today | mapped_date: 2021/04/04
text/chunk tomorrow | mapped_date: 2021/04/05
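
The output above resolves relative expressions against the day the notebook was run, which the `today` row implies was 2021/04/04. Note that `day before` resolved to the same date as `yesterday` in this run. The sketch below reproduces the `yesterday`, `today`, and `tomorrow` rows with plain `datetime` arithmetic. The offset table and the `resolve_relative` function are assumptions made for illustration, not how `DateMatcher` is implemented.

```python
from datetime import date, timedelta

# Assumed day offsets for three relative expressions from the output above.
# Illustrative only; this is not Spark NLP's internal logic.
RELATIVE_OFFSETS = {
    "yesterday": -1,
    "today": 0,
    "tomorrow": 1,
}


def resolve_relative(expression: str, anchor: date) -> str:
    """Map a relative expression to an absolute date in yyyy/MM/dd layout."""
    resolved = anchor + timedelta(days=RELATIVE_OFFSETS[expression])
    return resolved.strftime("%Y/%m/%d")


# Anchor date taken from the "today" row of the recorded output.
anchor = date(2021, 4, 4)
resolved = {expr: resolve_relative(expr, anchor) for expr in RELATIVE_OFFSETS}
for expr, value in resolved.items():
    print(f"{expr} -> {value}")
```

Because the mapping depends on the anchor date, rerunning the notebook on a different day yields different absolute dates for the same text.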
