# Date Representations

### Logic for parsing all the date and times from a given Chinese Wikipedia page

- Dates are to be formatted to ISO8601
- A single page might contain multiple 'date and time' representation styles
- Representations may be incomplete

#### Examples

- a: 民國105年10月10日 (Taiwan datetime representation for 2016-10-10)
- b: 2016-10-10
- c: Oct. 10, 2016
- d: 2016年10月 (Chinese date representation for Oct. 2016)
- e: 10月10日 (Chinese date representation for Oct. 10th)
- f: 同年10月 (Chinese date representation for 'In the same year, .... In October, ....')
- g: 本月10日 (Chinese date representation for 'In this month, .... On 10th, ....')
- h: 平成28年10月10日等等 (Japanese datetime representation for 2016-10-10)

#### Entry

How can the accuracy of parsing 'date and time' representations be improved for different?

#### Advanced 

How can we parse 'date and time' representations that are not 
represented by standard/normal representation style, e.g., Christmas Eve 
for Dec. 24th or '雙十節' for Oct. 10th (a Taiwanese Festival).

#### Challenge 

How can we derive correct the 'date and time' from the contexts of a 
corpus, e.g., 'On May 20th, Ing-Wen Tsai took her oath as the president of 
Taiwan. Next day, ...' in which the 'Next day' is May 21st.

### ISO 8601 Formats [Output will use dashes] 

- YYYY-MM-DD	or	YYYYMMDD
- YYYY-MM	(but not YYYYMM)
- --MM-DD	or	--MMDD

### Calendar system conversions

- [Minguo Calendar](https://en.wikipedia.org/wiki/Minguo_calendar) 1 = 1912
- [Heisei period](https://en.wikipedia.org/wiki/Heisei_period) 1 = 1989

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import re
import urllib.request
from bs4 import BeautifulSoup

## Init helper functions and page urls

In [3]:
page_url = "https://zh.wikipedia.org/wiki/%E8%87%BA%E7%81%A3"
page_html = urllib.request.urlopen(page_url).read()
soup = BeautifulSoup(page_html, 'html.parser')
texts = soup.findAll(text=True)

In [4]:
def visible(element):
    """ Accepts an html page and removes all tags, scripts, css and newlines
    Used as filter to return visible page text.
    """
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    elif re.match('<!--.*-->', str(element)):
        return False
    elif re.match(r"[\s\r\n]+",str(element)): 
        return False
    return True

In [5]:
def convert_calendar_system(year, )

In [6]:
visible_texts = filter(visible, texts)
page_text = " ".join(list(visible_texts))

### Init and compile regex

In [7]:
# Asian dates. Matches examples: a,d,e,f,h
asian_dates = r"\b(?:\w*?)?(?P<calendar>民國|民国|中華民國|中华民国|平成|同年)?(?P<year>\d{1,3}\w{1}|[0-9]{4}\w{1})?(?P<month>0?[1-9]月|1[0-2]月)(?:\w*?)?(?P<day>3[01]日|[12][0-9]日|0?[1-9]日)?\b"

# ISO date. Matches example: b
iso_dates = r"\b(?P<year>\d{4})\D?(?P<month>0[1-9]|1[0-2])\D?(?P<day>[12]\d|0[1-9]|3[01])\b"

# Month text representation. Matches example: c
# Need to do a cleanup after for month validation
text_month = r"\b(?P<month>[a-zA-z]{3,9})(?:\D{1,2})(?P<day>[12]\d|0?[1-9]|3[01])(?:\D{1,5})(?P<year>\d{4})\b"

# Chinese representation. Current year and month, given day. Matches example: g
given_day = r"\b(?:\w*?)?(?P<month>本月)(?P<day>[12]\d\D{1}|0?[1-9]\D{1}|3[01]\D{1})\b"

cmp_asian_dates = re.compile(asian_dates, re.UNICODE)
cmp_iso_dates = re.compile(iso_dates, re.UNICODE)
cmp_text_month = re.compile(text_month, re.UNICODE | re.IGNORECASE)
cmp_given_day = re.compile(given_day, re.UNICODE | re.IGNORECASE)

# \b(?:\w*?)?(?P<calendar>[^0-9\s]{0,4})?(?P<year>\d{1,3}\w{1}|[0-9]{4}\w{1})?(?P<month>0?[1-9]月|1[0-2]月)(?:\w*?)?(?P<day>3[01]日|[12][0-9]日|0?[1-9]日)?\b

In [10]:
asian_date_list = list(cmp_asian_dates.finditer(page_text))

In [17]:
asian_date_list[1].group("calendar")

'最近一次為1945年'

In [12]:
asian_date_list

[<_sre.SRE_Match object; span=(491, 499), match='2017年7月底'>,
 <_sre.SRE_Match object; span=(1617, 1631), match='最近一次為1945年10月後'>,
 <_sre.SRE_Match object; span=(6832, 6851), match='1645年4月荷蘭人召開南部的地方會議'>,
 <_sre.SRE_Match object; span=(8062, 8080), match='1662年2月荷蘭人接受條件開城投降'>,
 <_sre.SRE_Match object; span=(8271, 8281), match='於1662年2月1日'>,
 <_sre.SRE_Match object; span=(8286, 8292), match='12月13日'>,
 <_sre.SRE_Match object; span=(8390, 8403), match='鄭成功於同年6月23日病逝'>,
 <_sre.SRE_Match object; span=(8851, 8858), match='1684年4月'>,
 <_sre.SRE_Match object; span=(9612, 9633), match='自1841年9月起英國艦隊數度出現臺灣外海'>,
 <_sre.SRE_Match object; span=(9764, 9785), match='清廷在日軍出兵臺灣後的1874年5月27日'>,
 <_sre.SRE_Match object; span=(10280, 10283), match='12月'>,
 <_sre.SRE_Match object; span=(10318, 10328), match='1894年7月25日'>,
 <_sre.SRE_Match object; span=(10329, 10339), match='1895年4月17日'>,
 <_sre.SRE_Match object; span=(10444, 10464), match='日軍為接收臺灣於1895年5月29日登陸'>,
 <_sre.SRE_Match object; span=(10529, 10539)

In [None]:
# matches = re.finditer(asian_dates, page_text)

# for matchNum, match in enumerate(matches):
#     matchNum = matchNum + 1
    
#     print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    
#     for groupNum in range(0, len(match.groups())):
#         groupNum = groupNum + 1
        
#         print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))