### Date Representations

Here we provide the logic for parsing all the dates and times from a given Chinese Wikipedia page.

- Dates are to be formatted as ISO 8601
- A single page might contain multiple 'date and time' representation styles
- Representations may be incomplete

Examples

- 民國 105 年 10 月 10 日 (Taiwan datetime representation for 2016-10-10)
- 2016-10-10
- Oct. 10, 2016
- 2016 年 10 月 (Chinese date representation for Oct. 2016)
- 10 月 10 日 (Chinese date representation for Oct. 10th)
- 同年 10 月 (Chinese date representation for 'In the same year, .... In October, ....')
- 本月 10 日 (Chinese date representation for 'In this month, .... On 10th, ....')
- 平成 28 年 10 月 10 日 (Japanese datetime representation for 2016-10-10), etc.
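The era-based examples above reduce to simple arithmetic: a 民國 (Republic of China) year maps to the Gregorian year by adding 1911, and a 平成 (Heisei) year by adding 1988. The following is a minimal sketch of that conversion together with ISO 8601 formatting of incomplete dates. The names `ERA_OFFSETS` and `to_iso8601` are hypothetical helpers introduced here for illustration; they are not part of the notebook.

```python
# Offsets from an era year to the Gregorian year (hypothetical helper table).
ERA_OFFSETS = {
    "民國": 1911,  # Republic of China calendar: year 1 is 1912
    "平成": 1988,  # Japanese Heisei era: year 1 is 1989
}


def to_iso8601(year, month=None, day=None, era=None):
    """Format a possibly incomplete date as an ISO 8601 string.

    If `era` is given, `year` is treated as an era year and converted
    to the Gregorian calendar first.
    """
    if era is not None:
        year += ERA_OFFSETS[era]
    parts = [f"{year:04d}"]
    if month is not None:
        parts.append(f"{month:02d}")
        # A day is only meaningful when the month is known.
        if day is not None:
            parts.append(f"{day:02d}")
    return "-".join(parts)


print(to_iso8601(105, 10, 10, era="民國"))  # 2016-10-10
print(to_iso8601(28, 10, 10, era="平成"))   # 2016-10-10
print(to_iso8601(2016, 10))                # 2016-10
```

Dates with no year at all (such as 10 月 10 日) are deliberately not handled here, since they need a year from the surrounding context before they can be formatted.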

## Entry

How can the accuracy of parsing 'date and time' representations be improved for different representation styles?

## Advanced 

How can we parse 'date and time' representations that do not follow a standard representation style, e.g., 'Christmas Eve' for Dec. 24th or '雙十節' for Oct. 10th (a Taiwanese festival)?
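One possible starting point for this question is a lookup table from festival names to fixed month/day pairs. The sketch below assumes such a table; `FESTIVALS` and `find_festival_dates` are hypothetical names, and the year has to be supplied by the caller because the festival name alone does not carry it.

```python
# Hypothetical lookup table: festival name (lowercase) -> (month, day).
FESTIVALS = {
    "christmas eve": (12, 24),
    "雙十節": (10, 10),
}


def find_festival_dates(text, year):
    """Return (festival name, ISO 8601 date) pairs for names found in text."""
    lowered = text.lower()  # case-insensitive match for English names
    return [
        (name, f"{year:04d}-{month:02d}-{day:02d}")
        for name, (month, day) in FESTIVALS.items()
        if name in lowered
    ]


print(find_festival_dates("We met on Christmas Eve.", 2016))
print(find_festival_dates("雙十節放假一天", 2016))
```

A fixed table only works for festivals tied to a Gregorian date. Festivals that follow the lunar calendar fall on a different Gregorian date each year and would need a calendar conversion instead.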

## Challenge 

How can we derive the correct 'date and time' from the context of a corpus, e.g., 'On May 20th, Ing-Wen Tsai took her oath as the president of Taiwan. Next day, ...', in which 'Next day' is May 21st?
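A rough sketch of the idea: keep the most recently parsed absolute date as an anchor and resolve relative expressions as day offsets from it. The table `RELATIVE_OFFSETS` and the function `resolve_relative` are hypothetical, and the year 2016 for the anchor is an assumption, since the sentence above gives only the month and day.

```python
from datetime import date, timedelta

# Hypothetical table of relative expressions and their offsets in days.
RELATIVE_OFFSETS = {
    "next day": 1,
    "previous day": -1,
    "翌日": 1,  # Chinese for 'the next day'
}


def resolve_relative(expression, anchor):
    """Resolve a relative expression against the last absolute date seen."""
    offset = RELATIVE_OFFSETS[expression.lower()]
    return anchor + timedelta(days=offset)


# Anchor taken from 'On May 20th, ...'; the year is assumed to be 2016.
anchor = date(2016, 5, 20)
print(resolve_relative("Next day", anchor).isoformat())  # 2016-05-21
```

Using `datetime.date` arithmetic rather than incrementing the day number means month and year boundaries are handled correctly.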

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import re
import requests
from bs4 import BeautifulSoup

In [3]:
html = requests.get('http://en.wikipedia.org/wiki/Aquamole_Pot').text
soup = BeautifulSoup(html, 'html.parser')

In [5]:
# print(soup.prettify())

### Init and compile regex

In [17]:
# Asian dates. Matches examples a, d, e, f (items of the Examples list, lettered in order)
asian_dates = r"(?P<roc>\D{2})?(?P<year>\d{1,3}\D{1}|[0-9]{4}\D{1})?(?P<month>0?[1-9]月|1[0-2]月)(?P<day>3[01]日|[12][0-9]日|0?[1-9]日)?"

# ISO date. Matches example: b
iso_dates = r"(?P<year>\d{4})\D?(?P<month>0[1-9]|1[0-2])\D?(?P<day>[12]\d|0[1-9]|3[01])"

# Month text representation. Matches example: c
text_month = r"(?P<month>[a-zA-Z]{3,9})(?:\D{1,2})(?P<day>[12]\d|0?[1-9]|3[01])(?:\D{1,5})(?P<year>\d{4})"

# Chinese representation. Current year and month, given day. Matches example: g
# Need to do a cleanup after for month validation
given_day = r"(?P<month>本月)(?P<day>[12]\d\D{1}|0?[1-9]\D{1}|3[01]\D{1})"

cmp_asian_dates = re.compile(asian_dates)
cmp_iso_dates = re.compile(iso_dates)
cmp_text_month = re.compile(text_month, re.IGNORECASE)
cmp_given_day = re.compile(given_day, re.IGNORECASE)

In [18]:
s = "民國105年10月10日"
x = cmp_asian_dates.search(s)
x.group('roc')

'民國'
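Note that the named groups keep their trailing marker characters (for example, `year` is `'105年'`, not `'105'`). The sketch below shows one way the groups might be turned into an ISO 8601 string: strip the non-digits, apply an era offset based on the `roc` group, and fall back to a caller-supplied year when the text has none (as with 同年 10 月). The function `asian_match_to_iso`, the `context_year` parameter, and the `ERA_OFFSETS` table are hypothetical additions, not part of the notebook.

```python
import re

# Same pattern as in the cell above, repeated so this sketch is self-contained.
asian_dates = r"(?P<roc>\D{2})?(?P<year>\d{1,3}\D{1}|[0-9]{4}\D{1})?(?P<month>0?[1-9]月|1[0-2]月)(?P<day>3[01]日|[12][0-9]日|0?[1-9]日)?"
cmp_asian_dates = re.compile(asian_dates)

# Hypothetical era table: prefix captured by the 'roc' group -> year offset.
ERA_OFFSETS = {"民國": 1911, "平成": 1988}


def asian_match_to_iso(text, context_year=None):
    """Return the first Asian-style date in text as ISO 8601, or None."""
    m = cmp_asian_dates.search(text)
    if m is None:
        return None

    def digits(name):
        # Groups include the marker (年, 月, 日); keep only the digits.
        value = m.group(name)
        return int(re.sub(r"\D", "", value)) if value else None

    year, month, day = digits("year"), digits("month"), digits("day")
    if year is not None:
        year += ERA_OFFSETS.get(m.group("roc") or "", 0)
    else:
        year = context_year  # e.g. carried over from an earlier date
    if year is None:
        return None  # cannot build an ISO date without a year
    iso = f"{year:04d}-{month:02d}"
    if day is not None:
        iso += f"-{day:02d}"
    return iso


print(asian_match_to_iso("民國105年10月10日"))            # 2016-10-10
print(asian_match_to_iso("2016年10月"))                   # 2016-10
print(asian_match_to_iso("同年10月", context_year=2016))  # 2016-10
```

The input strings here contain no spaces between digits and markers, because the pattern as written does not allow whitespace inside a date.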