# Date Representations

### Logic for parsing all the dates and times from a given Chinese Wikipedia page

- Dates are to be formatted as ISO 8601
- A single page might contain multiple 'date and time' representation styles
- Representations may be incomplete

#### Examples

- a: 民國105年10月10日 (Taiwan datetime representation for 2016-10-10)
- b: 2016-10-10
- c: Oct. 10, 2016
- d: 2016年10月 (Chinese date representation for Oct. 2016)
- e: 10月10日 (Chinese date representation for Oct. 10th)
- f: 同年10月 (Chinese date representation for 'October of the same year', where the year is given earlier in the text)
- g: 本月10日 (Chinese date representation for 'the 10th of this month')
- h: 平成28年10月10日 (Japanese era date representation for 2016-10-10), and so on
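
Examples f and g only carry meaning relative to a date mentioned earlier in the text. One possible way to resolve them, sketched below, is to walk the matches in document order and carry forward the most recent year and month. The function name `resolve_relative` and the tuple layout are assumptions made for this sketch and are not part of the notebook.

```python
def resolve_relative(parts):
    """Fill in relative markers by carrying earlier context forward.

    parts: list of (marker, year, month, day) tuples in document order.
    marker is "同年" (same year), "本月" (this month) or None; unknown
    fields are None. This layout is hypothetical.
    """
    last_year, last_month = None, None
    resolved = []
    for marker, year, month, day in parts:
        if marker == "同年":
            year = last_year                      # reuse the last known year
        elif marker == "本月":
            year, month = last_year, last_month   # reuse year and month
        # Remember the context for any later relative marker
        if year is not None:
            last_year = year
        if month is not None:
            last_month = month
        resolved.append((year, month, day))
    return resolved

dates = [
    (None, 2016, 10, 10),      # 2016年10月10日
    ("同年", None, 12, None),   # 同年12月
    ("本月", None, None, 25),   # 本月25日
]
print(resolve_relative(dates))
# [(2016, 10, 10), (2016, 12, None), (2016, 12, 25)]
```

A marker that appears before any explicit date stays unresolved (its fields remain `None`), which keeps the sketch from guessing.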

#### Entry

How can parsing accuracy be improved across the different 'date and time' representation styles?

#### Advanced 

How can we parse 'date and time' representations that do not follow a 
standard representation style, e.g., Christmas Eve for Dec. 24th or 
'雙十節' for Oct. 10th (a Taiwanese festival)?

#### Challenge 

How can we derive the correct 'date and time' from the context of a 
corpus, e.g., 'On May 20th, Ing-Wen Tsai took her oath as the president of 
Taiwan. Next day, ...', in which 'Next day' is May 21st?
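
One way to approach the 'Next day' example is to keep the most recently parsed full date as an anchor and apply a day offset to it. The sketch below assumes a small hand-written lookup table; `RELATIVE_OFFSETS` and `resolve_relative_day` are hypothetical names, and the year 2016 is supplied only to make the example concrete.

```python
import datetime

# Hypothetical lookup of relative expressions to day offsets
RELATIVE_OFFSETS = {
    "next day": 1,
    "the following day": 1,
    "previous day": -1,
    "翌日": 1,     # "the next day"
    "次日": 1,     # "the next day"
    "前一天": -1,  # "the day before"
}

def resolve_relative_day(anchor, expression):
    """Return the date `expression` refers to, relative to `anchor`.

    Returns None when the expression is not in the lookup table.
    """
    offset = RELATIVE_OFFSETS.get(expression.strip().lower())
    if offset is None:
        return None
    return anchor + datetime.timedelta(days=offset)

anchor = datetime.date(2016, 5, 20)
print(resolve_relative_day(anchor, "Next day").isoformat())  # 2016-05-21
```

Using `datetime.timedelta` rather than incrementing the day number keeps month and year boundaries correct, e.g. the day after May 31st.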

### ISO 8601 Formats [Output will use dashes] 

- YYYY-MM-DD or YYYYMMDD
- YYYY-MM (but not YYYYMM)
- --MM-DD or --MMDD

### Calendar system conversions

- [Minguo Calendar](https://en.wikipedia.org/wiki/Minguo_calendar): year 1 = 1912
- [Heisei period](https://en.wikipedia.org/wiki/Heisei_period): year 1 = 1989
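
Because year 1 of each era maps to the first Gregorian year of that era, the conversion is an addition of the Gregorian year just before the era began. A minimal worked sketch, where `ERA_OFFSETS` and `era_to_gregorian` are illustrative names rather than the notebook's own helpers:

```python
# Gregorian year immediately before each era began, so that era year 1
# maps to the era's first Gregorian year (values from the list above)
ERA_OFFSETS = {
    "民國": 1911,  # Minguo 1 = 1912
    "平成": 1988,  # Heisei 1 = 1989
}

def era_to_gregorian(era, era_year):
    """Convert an era year (>= 1) to a Gregorian year, or None if unknown."""
    if era not in ERA_OFFSETS or era_year < 1:
        return None
    return ERA_OFFSETS[era] + era_year

print(era_to_gregorian("民國", 105))  # 2016
print(era_to_gregorian("平成", 28))   # 2016
```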

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import re
import datetime
import urllib.request
from bs4 import BeautifulSoup

## Entry Level

How can parsing accuracy be improved across the different 'date and time' representation styles?

### Init helper functions and page urls

In [3]:
def visible(element):
    """ Decide whether a text node belongs to the visible page text.
    Used as a filter to drop scripts, css, comments and whitespace-only nodes.
    Args:
        element (bs4.element): Text node from the parsed page
    Return
        (bool): True if the node should be kept
    """
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    elif re.match('<!--.*-->', str(element)):
        return False
    elif not str(element).strip():
        # Whitespace-only node
        return False
    return True

# Japanese era name -> offset added to the era year (平成1 = 1989)
japanese_era_lookup = {"平成":1988}

minguo_set = set(["民國", "民国","中華民國","中华民国"])
chinese_same_year_set = set(["同年"])

def convert_calendar_system(year, calendar_system):
    """ Accepts a year and calendar system and converts to the
    Gregorian calendar system.
    Args:
        year (int or str): Numerical year value
        calendar_system (str): Specifier for calendar system
    Return
        gregorian_year (int): None if the year is missing or invalid,
            or the calendar system is unknown.
    """

    if calendar_system in chinese_same_year_set:
        # NOTE: "同年" refers back to a year mentioned earlier in the text;
        # the current year is only a placeholder here.
        return datetime.date.today().year

    if year is None:
        return None

    # Regex groups arrive as strings, so coerce before comparing
    year = int(year)
    if year < 1:
        return None

    gregorian_year = None
    if calendar_system in minguo_set:
        gregorian_year = 1911 + year
    elif calendar_system in japanese_era_lookup:
        gregorian_year = japanese_era_lookup[calendar_system] + year

    return gregorian_year

In [4]:
page_url = "https://zh.wikipedia.org/wiki/%E8%87%BA%E7%81%A3"
# page_url = "https://zh.wikipedia.org/wiki/%E6%B0%91%E5%9C%8B%E7%B4%80%E5%B9%B4"
page_html = urllib.request.urlopen(page_url).read()
soup = BeautifulSoup(page_html, 'html.parser')
texts = soup.find_all(string=True)

In [5]:
visible_texts = filter(visible, texts)
page_text = " ".join(list(visible_texts))

### Init and compile regex

In [6]:
# Asian dates. Matches examples: a,d,e,f,h
asian_dates = r"\b(?:\w*?)?(?P<calendar>民國|民国|中華民國|中华民国|平成|同年)?(?P<year>\d{1,3}\w{1}|[0-9]{4}\w{1})?(?P<month>0?[1-9]月|1[0-2]月)(?P<day>3[01]日|[12][0-9]日|0?[1-9]日)?(?:\w*?)?\b"

# ISO date. Matches example: b
iso_dates = r"\b(?P<year>\d{4})\-?(?P<month>0[1-9]|1[0-2])\-?(?P<day>[12]\d|0[1-9]|3[01])\b"

# Month text representation. Matches example: c
text_month = r"\b(?P<month>[a-zA-Z]{3,9})(?:\D{1,2})(?P<day>[12]\d|0?[1-9]|3[01])?(?:\D{0,5})(?P<year>\d{4})?\b"

# Need to do a cleanup after for month validation
text_month_cleanup = r"(?P<month>Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)(?:.)?\s+(?P<day>\d{1,2})(?:th)?(?:,)?\s+(?P<year>\d{4})"

# Chinese representation. Current year and month, given day. Matches example: g
given_day = r"\b(?:\w*?)?(?P<month>本月)(?P<day>[12]\d\D{1}|0?[1-9]\D{1}|3[01]\D{1})\b"

cmp_asian_dates = re.compile(asian_dates, re.UNICODE)
cmp_iso_dates = re.compile(iso_dates, re.UNICODE)
cmp_text_month = re.compile(text_month, re.UNICODE | re.IGNORECASE)
cmp_given_day = re.compile(given_day, re.UNICODE | re.IGNORECASE)
cmp_text_month_cleanup = re.compile(text_month_cleanup, re.IGNORECASE)

In [7]:
asian_date_list = list(cmp_asian_dates.finditer(page_text))
iso_date_list = list(cmp_iso_dates.finditer(page_text))
text_month_list = list(cmp_text_month.finditer(page_text))
given_day_list = list(cmp_given_day.finditer(page_text))
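
To see what the named groups capture without fetching a page, the idea can be tried on the example strings from the introduction. The pattern below is a simplified stand-in written for this sketch (it is not the notebook's `asian_dates` expression): the month is required, while the calendar prefix, year and day are optional.

```python
import re

# Simplified pattern for illustration only
simple_asian_date = re.compile(
    r"(?P<calendar>民國|平成|同年)?"
    r"(?:(?P<year>\d{1,4})年)?"
    r"(?P<month>1[0-2]|0?[1-9])月"
    r"(?:(?P<day>3[01]|[12]\d|0?[1-9])日)?"
)

samples = ["民國105年10月10日", "2016年10月", "10月10日", "同年10月", "平成28年10月10日"]
for text in samples:
    match = simple_asian_date.search(text)
    print(text, "->", match.groupdict())
```

Keeping the suffix characters 年, 月 and 日 outside the named groups means the captured values are plain digit strings and need no trimming afterwards.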

### Parse and format dates

In [8]:
def prep_origin_date(match_item):
    """Prep the original date string
    Return:
        origin_date (str)
    """

    # groupdict() only holds the groups the pattern defines, so patterns
    # without e.g. a "calendar" group need no try/except handling
    groups = match_item.groupdict()
    parts = [groups.get(name) for name in ("calendar", "year", "month", "day")]
    return "".join(part for part in parts if part)
    
def prep_iso8601(year=None, month=None, day=None):
    """ Accept and format year, month and day args.
    Args:
        year, month, day (int or str): Date parts, None when unknown
    Return
        result (str): Formatted iso8601 string. Missing leading parts use
            the truncated forms --MM, --MM-DD and ---DD.
    """

    # Format with leading zeros
    month = str(int(month)).zfill(2) if month else None
    day = str(int(day)).zfill(2) if day else None

    # A year without a month can only be reported as YYYY
    if year and not month:
        return str(year)

    # A missing year is written as "--", a missing month as "-"
    result = str(year) + "-" if year else "--"
    result += month if month else "-"
    if day:
        result += "-" + day if month else day
    return result

def format_iso_dates(iso_date_list):
    """ Format and return iso_dates
    Args:
        iso_date_list (list): List of iso date matches
    Return
        result (list): List of formatted ISO8601 date strings.
    """

    result = []
    for item in iso_date_list:
        # Rebuild from the named groups so that YYYYMMDD and partially
        # dashed forms are all normalised to YYYY-MM-DD
        result.append("-".join(item.group("year", "month", "day")))
    return result

def format_asian_dates(asian_date_list):
    """ Format and return asian_dates
    Args:
        asian_date_list (list): List of asian date matches
    Return
        result (list): List of formatted ISO8601 date strings.
        origin_date_list (list): Original date strings, in the same order.
    """
    
    result = []
    origin_date_list = []
    for item in asian_date_list:
        calendar = item.group("calendar")
        year = item.group("year")
        month = item.group("month")
        day = item.group("day")
        
        # Remove chinese character suffix
        if year:
            year = year[:-1]
        if month:
            month = month[:-1]
        if day:
            day = day[:-1]
        
        # Era years (Japanese eras, Minguo) are converted to Gregorian years.
        # The regex yields strings, so the year is cast to int first.
        if year and (calendar in japanese_era_lookup or calendar in minguo_set):
            year = convert_calendar_system(int(year), calendar)
        
        formatted_date = prep_iso8601(year=year, month=month, day=day)
        result.append(formatted_date)
        
        origin_date = prep_origin_date(item)
        origin_date_list.append(origin_date)
    
    return result, origin_date_list

def format_text_month(text_month_list):
    """ Format and return written month dates
    Args:
        text_month_list (list): List of text month dates
    Return
        result (list): List of formatted ISO8601 date strings.
    """
    
    result = []
    origin_date_list = []
    for item in text_month_list:
        parsed = cmp_text_month_cleanup.search(str(item.group()))
        if parsed:
            year = parsed.group("year") if parsed.group("year") else None
            # Full month names need %B, three-letter abbreviations need %b
            month_format = "%B" if len(parsed.group("month")) > 3 else "%b"
            month = datetime.datetime.strptime(parsed.group("month"), month_format).month
            day = parsed.group("day") if parsed.group("day") else None
            
            formatted_date = prep_iso8601(year=year, month=month, day=day)
            result.append(formatted_date)
            origin_date_list.append(item)
    
    return result, origin_date_list

def format_given_day(given_day_list):
    """ Format and return given day dates
    Args:
        given_day_list (list): List of given day matches
    Return
        result (list): List of formatted ISO8601 date strings.
        origin_date_list (list): Original date strings, in the same order.
    """
    
    result = []
    origin_date_list = []
    for item in given_day_list:
        # The day group includes its one-character suffix (e.g. 日), so drop it
        day = item.group("day")[:-1]
        formatted_date = prep_iso8601(year=None, month=None, day=day)
        result.append(formatted_date)
        
        origin_date = prep_origin_date(item)
        origin_date_list.append(origin_date)
    
    return result, origin_date_list 
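
The regular expressions above only check digit ranges, so an impossible date such as February 30th would still be matched and formatted. One possible accuracy improvement is to validate the parsed parts with `datetime.date` before formatting. The helper name `is_valid_date` is hypothetical, and substituting the leap year 2000 for a missing year is an assumption made so that `--02-29` is accepted.

```python
import datetime

def is_valid_date(year=None, month=None, day=None):
    """Return True if the known parts could form a real calendar date.

    A missing year is replaced by the leap year 2000 and a missing day
    by 1, so partial dates can still be checked.
    """
    if month is None:
        # Only a day is known: accept anything from 1 to 31
        return day is None or 1 <= int(day) <= 31
    try:
        datetime.date(int(year) if year else 2000,
                      int(month),
                      int(day) if day else 1)
    except ValueError:
        return False
    return True

print(is_valid_date(2016, 2, 29))   # True, 2016 is a leap year
print(is_valid_date(2017, 2, 29))   # False
print(is_valid_date(None, 2, 30))   # False
```

String arguments such as the regex group values work as well, since each part is passed through `int()`.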

### Entry results

#### Asian formatted dates

In [9]:
res1_iso, res1_orig = format_asian_dates(asian_date_list)
for iso, orig in zip(res1_iso, res1_orig):
    print("ISO8601: " + iso, "| Original: " + orig)

ISO8601: 2017-07 | Original: 2017年7月
ISO8601: 1945-10 | Original: 1945年10月
ISO8601: 1645-04 | Original: 1645年4月
ISO8601: 1662-02 | Original: 1662年2月
ISO8601: 1662-02-01 | Original: 1662年2月1日
ISO8601: --12-13 | Original: 12月13日
ISO8601: --06-23 | Original: 同年6月23日
ISO8601: 1684-04 | Original: 1684年4月
ISO8601: 1841-09 | Original: 1841年9月
ISO8601: 1874-05-27 | Original: 1874年5月27日
ISO8601: --12 | Original: 12月
ISO8601: 1894-07-25 | Original: 1894年7月25日
ISO8601: 1895-04-17 | Original: 1895年4月17日
ISO8601: 1895-05-29 | Original: 1895年5月29日
ISO8601: 1895-05-29 | Original: 1895年5月29日
ISO8601: --11-18 | Original: 11月18日
ISO8601: 1895-06-17 | Original: 1895年6月17日
ISO8601: 1945-09-02 | Original: 1945年9月2日
ISO8601: --06-26 | Original: 6月26日
ISO8601: 1894-07-25 | Original: 1894年7月25日
ISO8601: 1895-04-17 | Original: 1895年4月17日
ISO8601: 1895-04-17 | Original: 1895年4月17日
ISO8601: --05-29 | Original: 5月29日
ISO8601: --06-14 | Original: 6月14日
ISO8601: --10-21 | Original: 10月21日
ISO8601: 1930-04-10 | Orig

#### ISO8601 dates

In [10]:
res2_iso = format_iso_dates(iso_date_list)
for iso in res2_iso:
    print("ISO8601: " + iso)

ISO8601: 2011-03-24
ISO8601: 2004-07-12
ISO8601: 2011-03-24
ISO8601: 2007-07-23
ISO8601: 2012-06-14
ISO8601: 2011-03-24
ISO8601: 2013-12-13
ISO8601: 2017-05-26
ISO8601: 1997-02-26
ISO8601: 2017-05-26
ISO8601: 2010-12-13
ISO8601: 2017-05-26
ISO8601: 2015-01-27
ISO8601: 2014-04-10
ISO8601: 2014-02-22
ISO8601: 2017-05-26
ISO8601: 2017-05-26
ISO8601: 2017-05-26
ISO8601: 2017-05-26
ISO8601: 2017-05-26
ISO8601: 2017-05-26
ISO8601: 2017-05-26
ISO8601: 2013-11-02
ISO8601: 2014-07-09
ISO8601: 2014-11-05
ISO8601: 2017-05-26
ISO8601: 2011-11-15
ISO8601: 2012-10-10
ISO8601: 2017-05-26
ISO8601: 2012-03-29
ISO8601: 2013-10-31
ISO8601: 2014-08-13
ISO8601: 2010-12-24
ISO8601: 2010-12-27
ISO8601: 2005-11-13
ISO8601: 2017-05-21
ISO8601: 2017-05-26
ISO8601: 2017-05-26
ISO8601: 2011-04-11
ISO8601: 2006-02-07
ISO8601: 2017-05-26
ISO8601: 2017-05-26
ISO8601: 2014-09-02
ISO8601: 2014-08-29
ISO8601: 2014-09-02
ISO8601: 2014-09-23
ISO8601: 2017-05-26
ISO8601: 2007-09-12
ISO8601: 2017-05-26
ISO8601: 2017-05-26


#### Written text month

In [11]:
res3_iso, res3_orig = format_text_month(text_month_list)
for iso, orig in zip(res3_iso, res3_orig):
    print("ISO8601: " + iso, "| Original: " + orig.group())

ISO8601: 2006-02-07 | Original: February 7, 2006


#### Formatted given day

In [12]:
res4_iso, res4_orig = format_given_day(given_day_list)
for iso, orig in zip(res4_iso, res4_orig):
    print("ISO8601: " + iso, "| Original: " + orig)

## Advanced Level

How can we parse 'date and time' representations that do not follow a standard representation style, e.g., Christmas Eve for Dec. 24th or '雙十節' for Oct. 10th (a Taiwanese festival)?

Possible solutions: ConceptNet http://conceptnet.io/
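
Before reaching for an external knowledge base, festivals that fall on a fixed Gregorian date can be handled with a plain lookup table. The sketch below is only an illustration under that assumption: `FESTIVAL_DATES` and `find_festivals` are hypothetical names, and festivals tied to the lunar calendar would need a calendar conversion that is not shown here.

```python
# Hypothetical lookup table for fixed-date festivals only
FESTIVAL_DATES = {
    "christmas eve": "--12-24",
    "christmas": "--12-25",
    "雙十節": "--10-10",   # Double Ten Day, traditional characters
    "双十节": "--10-10",   # Double Ten Day, simplified characters
}

def find_festivals(text):
    """Return (name, iso_date) pairs for every known festival in text."""
    remaining = text.lower()
    found = []
    # Try longer names first so "christmas eve" is not also reported
    # as "christmas"
    for name in sorted(FESTIVAL_DATES, key=len, reverse=True):
        if name in remaining:
            found.append((name, FESTIVAL_DATES[name]))
            remaining = remaining.replace(name, " ")
    return found

print(find_festivals("蔡英文雙十節我爱北京天安门黃正聰"))
# [('雙十節', '--10-10')]
print(find_festivals("They met on Christmas Eve."))
# [('christmas eve', '--12-24')]
```

Substring matching avoids any dependence on word segmentation, which matters because a segmenter may split a festival name into several tokens.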

In [13]:
import spacy
import en_core_web_sm

In [14]:
nlp = en_core_web_sm.load()

In [15]:
from spacy.zh import Chinese
zh_nlp = Chinese()

In [16]:
zh_doc = zh_nlp(u'民國105年10月10日')

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\JHEREZ~1\AppData\Local\Temp\jieba.cache
Loading model cost 1.042 seconds.
Prefix dict has been built succesfully.


In [17]:
for a in zh_doc:
    print("1 " + str(a))

1 民
1 國
1 105
1 年
1 10
1 月
1 10
1 日


In [18]:
import jieba
import jieba.analyse
import jieba.posseg as pseg

In [37]:
# zh_sent = "我爱北京天安门"
zh_sent = "蔡英文雙十節我爱北京天安门黃正聰"

In [38]:
words = pseg.cut(zh_sent)
for w in words:
    print('%s %s' % (w.word, w.flag))

蔡 nr
英文 nz
雙十節 m
我 r
爱 v
北京 ns
天安门 ns
黃正聰 nr


In [21]:
import pynlpir

In [22]:
pynlpir.open()

In [39]:
pynlpir.segment(zh_sent, pos_names='all')

[('蔡', 'noun:personal name:Chinese surname'),
 ('英文', 'noun:other proper noun'),
 ('雙', 'noun'),
 ('十', 'numeral'),
 ('節', 'noun'),
 ('我', 'pronoun:personal pronoun'),
 ('爱', 'verb'),
 ('北京', 'noun:toponym'),
 ('天安门', 'noun:toponym'),
 ('黃', 'noun'),
 ('正', 'distinguishing word'),
 ('聰', 'noun')]