# Date Time Research - Infoland Text-to-Case
_By Eli Brouwers_

In this notebook, we will be researching a solution to date time parsing using machine learning.

## Problem
In our prototype, dates and times will be extracted textually. These need to be converted to actual date time objects. This is a non-trivial problem, as dates and times can be written in many different ways. For example, if today is the 17th of May, the 18th of May could be said as: 
- Tomorrow
- The day after today
- The 18th of May
- 18th of May
- 18 May
- 18-05

And so on. Additionally, the answers we extract might have residual words that are not dates or times. 

## Library: SUTime
We will be researching the SUTime library by the Stanford Natural Language Processing Group (https://nlp.stanford.edu/software/sutime.html). This library is a Java library that can be used to parse dates and times from text. It is based on the TimeML standard (http://timeml.org/). 
The NLP part of this library uses entity recognition to find dates and times in text. It then uses a rule-based system to parse these dates and times.

We will try a bunch of different examples to see how well this library works.

### Imports

In [2]:
from sutime import SUTime
import pandas as pd
import datetime
import re

### Example data
We created a list of ways to express a date and/or time. We will use this list to test the SUTime library.

In [12]:
current_date = datetime.datetime.now()
current_date = current_date.replace(microsecond=0)
validation_data = [
    {'text': 'I will be there tomorrow', 'expected_datetime': (current_date + datetime.timedelta(days=1)).date()},
    {'text': 'I will be there today', 'expected_datetime': current_date.date()},
    {'text': 'I will be there yesterday', 'expected_datetime': (current_date - datetime.timedelta(days=1)).date()},
    {'text': 'I will be there next week', 'expected_datetime': (current_date + datetime.timedelta(days=7)).date()},
    {'text': 'I will be there last week', 'expected_datetime': (current_date - datetime.timedelta(days=7)).date()},
    # Month and year offsets use calendar arithmetic; fixed week counts drift near month and year boundaries.
    {'text': 'I will be there next month', 'expected_datetime': (current_date.replace(day=1) + datetime.timedelta(days=32)).date().replace(day=1)},
    {'text': 'I will be there last month', 'expected_datetime': (current_date.replace(day=1) - datetime.timedelta(days=1)).date().replace(day=1)},
    {'text': 'I will be there next year', 'expected_datetime': current_date.date().replace(year=current_date.year + 1, month=1, day=1)},
    {'text': 'I will be there last year', 'expected_datetime': current_date.date().replace(year=current_date.year - 1, month=1, day=1)},
    {'text': 'I will be there the day after tomorrow', 'expected_datetime': (current_date + datetime.timedelta(days=2)).date()},
    {'text': 'I will be there the day before yesterday', 'expected_datetime': (current_date - datetime.timedelta(days=2)).date()},
    {'text': 'I will be there the day before today', 'expected_datetime': (current_date - datetime.timedelta(days=1)).date()},
    {'text': 'I will be there the day after today', 'expected_datetime': (current_date + datetime.timedelta(days=1)).date()},
    {'text': 'I will be there the week after the next', 'expected_datetime': (current_date + datetime.timedelta(days=14)).date()},
    {'text': 'I will be there the week before the last', 'expected_datetime':  (current_date - datetime.timedelta(days=14)).date()},
    # Counted from the first of this month, 62 days ahead always lands in the month after next and 32 days back in the month before last.
    {'text': 'I will be there the month after the next', 'expected_datetime': (current_date.replace(day=1) + datetime.timedelta(days=62)).date().replace(day=1)},
    {'text': 'I will be there the month before the last', 'expected_datetime': (current_date.replace(day=1) - datetime.timedelta(days=32)).date().replace(day=1)},
    {'text': 'I will be there the year after the next', 'expected_datetime': current_date.date().replace(year=current_date.year + 2, month=1, day=1)},
    {'text': 'I will be there the year before the last', 'expected_datetime': current_date.date().replace(year=current_date.year - 2, month=1, day=1)},
    {'text': 'I will be there at 10:00', 'expected_datetime': current_date.replace(hour=10, minute=0, second=0, microsecond=0)},
    {'text': 'I will be there at ten o\'clock am', 'expected_datetime': current_date.replace(hour=10, minute=0, second=0, microsecond=0)},
    {'text': 'I will be there in 10 minutes', 'expected_datetime': current_date + datetime.timedelta(minutes=10)},
    {'text': 'I will be there in 10 hours', 'expected_datetime': current_date + datetime.timedelta(hours=10)},
    {'text': 'I will be there in ten minutes', 'expected_datetime': current_date + datetime.timedelta(minutes=10)},
    {'text': 'I will be there in ten hours', 'expected_datetime': current_date + datetime.timedelta(hours=10)},
    {'text': 'I will be there in 10 minutes and 10 hours', 'expected_datetime': current_date + datetime.timedelta(hours=10, minutes=10)},
    {'text': 'I was there 10 minutes ago', 'expected_datetime': current_date - datetime.timedelta(minutes=10)},
    {'text': 'I was there 10 hours ago', 'expected_datetime': current_date - datetime.timedelta(hours=10)},
    {'text': 'I will be there this morning', 'expected_datetime': current_date.replace(hour=8, minute=0, second=0, microsecond=0)},
    {'text': 'I will be there this afternoon', 'expected_datetime': current_date.replace(hour=12, minute=0, second=0, microsecond=0)},
    {'text': 'I will be there this evening', 'expected_datetime': current_date.replace(hour=18, minute=0, second=0, microsecond=0)},
    {'text': 'I will be there this night', 'expected_datetime': current_date.replace(hour=22, minute=0, second=0, microsecond=0)},
    {'text': 'I will be there this coming Friday', 'expected_datetime': (current_date + (datetime.timedelta(days=(4-current_date.weekday()) % 7))).replace(hour=0, minute=0, second=0, microsecond=0)},
    {'text': 'I will be there next Friday', 'expected_datetime': (current_date + (datetime.timedelta(days=(4-current_date.weekday()) % 7))).replace(hour=0, minute=0, second=0, microsecond=0)},
    {'text': 'I will be there next week Friday', 'expected_datetime': (current_date + (datetime.timedelta(days=(4-current_date.weekday()) % 7, weeks=1))).replace(hour=0, minute=0, second=0, microsecond=0)},
    {'text': 'I will be there last Friday', 'expected_datetime': (current_date - (datetime.timedelta(days=(current_date.weekday()-4) % 7))).replace(hour=0, minute=0, second=0, microsecond=0)},
    {'text': 'I will be there next Monday at ten o\'clock am', 'expected_datetime': (current_date + (datetime.timedelta(days=(7-current_date.weekday()) % 7))).replace(hour=10, minute=0, second=0, microsecond=0)},
]

validation_data = pd.DataFrame(validation_data)
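
The expected values for month offsets depend on calendar arithmetic, because months differ in length. As a minimal sketch, a small helper could make that arithmetic explicit. The name `first_of_month` is hypothetical and not part of the notebook, and only the standard library is assumed.

```python
import datetime


def first_of_month(day, months=0):
    """Return the first day of the month that lies `months` months away from `day`."""
    # Count months since year 0 so the offset can cross year boundaries in either direction.
    total_months = day.year * 12 + (day.month - 1) + months
    year, month_index = divmod(total_months, 12)
    return datetime.date(year, month_index + 1, 1)


today = datetime.date(2023, 5, 31)
print(first_of_month(today, 1))    # first day of next month
print(first_of_month(today, -1))   # first day of last month
```

The same helper also covers "the month after the next" (`months=2`) and "the month before the last" (`months=-2`).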

In [5]:
sutime = SUTime()

In [13]:
final_data = pd.DataFrame(columns=['text', 'expected_datetime', 'result_parsed_datetime', 'result_string', 'result_value'])
for index, row in validation_data.iterrows():
    result = sutime.parse(row['text'])
    
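    # Only the first expression SUTime finds is used; any later matches are ignored.
    date = result[0]['value']
    # Dispatch on the format of the returned value (e.g. yyyy-mm-dd, yyyy-Www, PTxxM).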
    if re.match(r'^\d{4}-\d{2}-\d{2}$', date): # yyyy-mm-dd
        parsed_date = datetime.datetime.strptime(date, '%Y-%m-%d')
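    elif re.match(r'^\d{4}-W\d{2}$', date): # yyyy-Www (ISO week)
        # %G, %V and %u are the ISO 8601 year, week and weekday; keep today's weekday within that week.
        parsed_date = datetime.datetime.strptime(f'{date}-{current_date.isoweekday()}', '%G-W%V-%u')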
    elif re.match(r'^\d{4}-\d{2}$', date): # yyyy-mm
        parsed_date = datetime.datetime.strptime(date + '-01', "%Y-%m-%d")
    elif re.match(r'^\d{4}$', date): # yyyy 
        parsed_date = datetime.datetime.strptime(date + '-01-01', "%Y-%m-%d")
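    elif re.match(r'^\d{2}:\d{2}$', date): # hh:mm
        # A bare time carries no date, so anchor it to today instead of strptime's default of 1900-01-01.
        parsed_date = datetime.datetime.combine(current_date.date(), datetime.datetime.strptime(date, '%H:%M').time())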
    elif re.match(r'^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}$', date): # yyyy-mm-ddThh:mm
        parsed_date = datetime.datetime.strptime(date, '%Y-%m-%dT%H:%M')
    elif re.match(r'^PT\d{2}H\d{2}M$', date): # PTxxHxxM
        parsed_date = current_date + datetime.timedelta(hours=int(date[2:4]), minutes=int(date[5:7]))
    elif re.match(r'^PT\d{2}H$', date): # PTxxH
        parsed_date = current_date + datetime.timedelta(hours=int(date[2:4]))
    elif re.match(r'^PT\d{2}M$', date): # PTxxM
        parsed_date =  current_date + datetime.timedelta(minutes=int(date[2:4]))
    elif re.match(r'^\d{4}-\d{2}-\d{2}TMO$', date): # yyyy-mm-ddTMO
        parsed_date = datetime.datetime.strptime(date, '%Y-%m-%dTMO').replace(hour=8, minute=0, second=0, microsecond=0)
    elif re.match(r'^\d{4}-\d{2}-\d{2}TAF$', date): # yyyy-mm-ddTAF
        parsed_date = datetime.datetime.strptime(date, '%Y-%m-%dTAF').replace(hour=12, minute=0, second=0, microsecond=0)
    elif re.match(r'^\d{4}-\d{2}-\d{2}TEV$', date): # yyyy-mm-ddTEV
        parsed_date = datetime.datetime.strptime(date, '%Y-%m-%dTEV').replace(hour=18, minute=0, second=0, microsecond=0)
    elif re.match(r'^\d{4}-\d{2}-\d{2}TNI$', date): # yyyy-mm-ddTNI
        parsed_date = datetime.datetime.strptime(date, '%Y-%m-%dTNI').replace(hour=22, minute=0, second=0, microsecond=0)
    else:
        raise Exception(f'Unknown date format: {result} for text: {row["text"]}')
    
    final_data = pd.concat([final_data, pd.DataFrame([{
        'text': row['text'],
        'expected_datetime': row['expected_datetime'],
        'result_parsed_datetime': parsed_date,
        'result_string': result[0]['text'],
        'result_value': result[0]['value'],
    }])])
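
The format handling in the loop above could also live in a standalone function, which makes it testable without starting SUTime. The following is a sketch under assumptions, not the notebook's implementation: `resolve_value` and `PART_OF_DAY_HOURS` are hypothetical names, the part-of-day hours mirror the values chosen in this notebook, and durations are accepted with any number of digits.

```python
import datetime
import re

# Hours assumed for the part-of-day codes, mirroring the values used in this notebook.
PART_OF_DAY_HOURS = {'MO': 8, 'AF': 12, 'EV': 18, 'NI': 22}


def resolve_value(value, now):
    """Convert a SUTime-style value string into a datetime (hypothetical helper)."""
    # ISO week such as 2023-W20: keep the weekday of `now` within that week.
    if re.fullmatch(r'\d{4}-W\d{2}', value):
        return datetime.datetime.strptime(f'{value}-{now.isoweekday()}', '%G-W%V-%u')

    # Date plus part of day such as 2023-05-17TEV.
    match = re.fullmatch(r'(\d{4}-\d{2}-\d{2})T(MO|AF|EV|NI)', value)
    if match:
        day = datetime.datetime.strptime(match.group(1), '%Y-%m-%d')
        return day.replace(hour=PART_OF_DAY_HOURS[match.group(2)])

    # Duration such as PT10M, PT10H or PT10H10M, applied relative to `now`.
    match = re.fullmatch(r'PT(?:(\d+)H)?(?:(\d+)M)?', value)
    if match and (match.group(1) or match.group(2)):
        hours = int(match.group(1) or 0)
        minutes = int(match.group(2) or 0)
        return now + datetime.timedelta(hours=hours, minutes=minutes)

    # Bare time such as 10:00: anchor it to the date of `now`.
    if re.fullmatch(r'\d{2}:\d{2}', value):
        time_of_day = datetime.datetime.strptime(value, '%H:%M').time()
        return datetime.datetime.combine(now.date(), time_of_day)

    # Remaining calendar formats, from most to least specific.
    for fmt in ('%Y-%m-%dT%H:%M', '%Y-%m-%d', '%Y-%m', '%Y'):
        try:
            return datetime.datetime.strptime(value, fmt)
        except ValueError:
            continue

    raise ValueError(f'Unknown value format: {value!r}')
```

Each branch returns early, so supporting another format means adding one block rather than extending a long `elif` chain.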
    

With all examples run through the parser, we compare the parsed results with the expected values.

In [14]:
not_matching = final_data[final_data['expected_datetime'] != final_data['result_parsed_datetime']]
print(f'Not matching: {len(not_matching)}')
accuracy = 1 - len(not_matching) / len(final_data)
print(f'Accuracy: {accuracy}')


Not matching: 10
Accuracy: 0.7297297297297297


Measuring the accuracy, we can see that the parser got about 73% of the examples right. Let's investigate the ones that were wrong.

In [15]:
not_matching

| | text | expected_datetime | result_parsed_datetime | result_string | result_value |
|---|---|---|---|---|---|
| 0 | I will be there the week after the next | 2023-05-31 | 2023-05-17 00:00:00 | the week | 2023-W20 |
| 0 | I will be there the week before the previous | 2023-05-03 | 2023-05-17 00:00:00 | the week before | 2023-W20 |
| 0 | I will be there the month after the next | 2023-07-01 | 2023-05-01 00:00:00 | the month | 2023-05 |
| 0 | I will be there the month before the last | 2023-03-01 | 2023-05-01 00:00:00 | the month before | 2023-05 |
| 0 | I will be there the year after the next | 2025-01-01 | 2023-01-01 00:00:00 | the year | 2023 |
| 0 | I will be there the year before the last | 2021-01-01 | 2023-01-01 00:00:00 | the year before | 2023 |
| 0 | I will be there in 10 minutes and 10 hours | 2023-05-18 02:08:05 | 2023-05-17 16:08:05 | 10 minutes | PT10M |
| 0 | I was there 10 minutes ago | 2023-05-17 15:48:05 | 2023-05-16 00:00:00 | 10 minutes ago | 2023-05-16 |
| 0 | I was there 10 hours ago | 2023-05-17 05:58:05 | 2023-05-16 00:00:00 | 10 hours ago | 2023-05-16 |
| 0 | I will be there next Friday | 2023-05-19 00:00:00 | 2023-05-26 00:00:00 | next Friday | 2023-05-26 |


The mismatches fall into four groups.

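The first six examples all follow the pattern "the x after/before the next/last". In each case SUTime matched only a fragment such as "the week" or "the month before" and resolved it to the current week, month or year. Most of these are uncommon edge cases, and common variants such as "the day after tomorrow" are parsed correctly.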

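The next example, "in 10 minutes and 10 hours", fails because of our code rather than the library: the loop only reads the first match (`result[0]`), which is "10 minutes" (`PT10M`), and that part is parsed correctly.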
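
One way to cover expressions that produce several matches would be to add up every duration instead of reading only the first result. The sketch below assumes each result is a dict with a `value` key, as in the loop above. The second entry in `sample_results` is hypothetical: this notebook only inspected the first match, so whether SUTime returns a separate match for "10 hours" is not verified here.

```python
import datetime
import re

# Matches ISO 8601 style durations made of hours and/or minutes, e.g. PT10M or PT10H10M.
DURATION_PATTERN = re.compile(r'PT(?:(\d+)H)?(?:(\d+)M)?')


def total_duration(results):
    """Sum the hour and minute durations found in a list of result dicts."""
    total = datetime.timedelta()
    for item in results:
        match = DURATION_PATTERN.fullmatch(item['value'])
        # Skip values that are not durations (or the bare prefix 'PT').
        if match and (match.group(1) or match.group(2)):
            total += datetime.timedelta(
                hours=int(match.group(1) or 0),
                minutes=int(match.group(2) or 0),
            )
    return total


# Hypothetical parser output for "in 10 minutes and 10 hours".
sample_results = [
    {'text': '10 minutes', 'value': 'PT10M'},
    {'text': '10 hours', 'value': 'PT10H'},
]
now = datetime.datetime(2023, 5, 17, 15, 58, 5)
print(now + total_duration(sample_results))
```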

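The next two examples are times in the past of the form "x hours/minutes ago". SUTime resolved both to the date 2023-05-16, the previous day, instead of a time earlier on 2023-05-17. These expressions are more common, so this is a real limitation.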
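
A possible workaround is to catch "x minutes/hours ago" with a small regular expression before the text reaches the parser. This is only a sketch and not part of SUTime: `resolve_ago` is a hypothetical helper, and it handles digits only, so "ten minutes ago" would still need separate treatment.

```python
import datetime
import re

# Matches e.g. "10 minutes ago" or "1 hour ago"; spelled-out numbers are not covered.
AGO_PATTERN = re.compile(r'(\d+)\s+(minute|hour)s?\s+ago', re.IGNORECASE)


def resolve_ago(text, now):
    """Return the datetime described by "<n> minutes/hours ago", or None if absent."""
    match = AGO_PATTERN.search(text)
    if match is None:
        return None
    amount = int(match.group(1))
    unit = match.group(2).lower() + 's'  # 'minutes' or 'hours', both timedelta keywords
    return now - datetime.timedelta(**{unit: amount})


now = datetime.datetime(2023, 5, 17, 15, 58, 5)
print(resolve_ago('I was there 10 minutes ago', now))
print(resolve_ago('I was there 10 hours ago', now))
```

Returning `None` when nothing matches lets the caller fall back to the regular parser for all other inputs.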

The next example is a fairly ambiguous sentence. "next Friday" could mean upcoming Friday, or the Friday after that. The parser chose the latter, but the former was expected. This could still be counted as correct.
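
The two readings of "next Friday" can be made explicit in code. The sketch below uses hypothetical helper names and takes 2023-05-17, the run date implied by the results above, as the reference day. Note one assumption: `upcoming_weekday` never returns the reference day itself, whereas the expression used for the expected value in the example data yields today when today is a Friday.

```python
import datetime


def upcoming_weekday(today, weekday):
    """Return the first date strictly after `today` that falls on `weekday` (Monday is 0)."""
    days_ahead = (weekday - today.weekday() - 1) % 7 + 1
    return today + datetime.timedelta(days=days_ahead)


def weekday_of_next_week(today, weekday):
    """Return `weekday` (Monday is 0) in the calendar week after the one containing `today`."""
    monday_of_next_week = (
        today - datetime.timedelta(days=today.weekday()) + datetime.timedelta(weeks=1)
    )
    return monday_of_next_week + datetime.timedelta(days=weekday)


FRIDAY = 4
today = datetime.date(2023, 5, 17)  # a Wednesday
print(upcoming_weekday(today, FRIDAY))      # 2023-05-19, the expected value
print(weekday_of_next_week(today, FRIDAY))  # 2023-05-26, the value SUTime returned
```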

### Conclusion
The SUTime library got 27 of the 37 examples right, about 73%. The incorrect examples were mostly uncommon edge cases. One limitation of the library is that it did not parse "x hours/minutes ago" correctly, which is a fairly common way to express a time in the past.

However, we can conclude that the SUTime library is a good solution for our problem. It worked well enough on our test examples. Next, we will have to implement it and see how well it works in practice.