# Date Representations

### Logic for parsing all the date and times from a given Chinese Wikipedia page

- Dates are to be formatted to ISO8601
- A single page might contain multiple 'date and time' representation styles
- Representations may be incomplete

#### Examples

- a: 民國105年10月10日 (Taiwan datetime representation for 2016-10-10)
- b: 2016-10-10
- c: Oct. 10, 2016
- d: 2016年10月 (Chinese date representation for Oct. 2016)
- e: 10月10日 (Chinese date representation for Oct. 10th)
- f: 同年10月 (Chinese date representation for 'In the same year, .... In October, ....')
- g: 本月10日 (Chinese date representation for 'In this month, .... On 10th, ....')
- h: 平成28年10月10日, etc. (Japanese datetime representation for 2016-10-10)

#### Entry

How can the accuracy of parsing 'date and time' representations be improved for different 'date and time' representations?

#### Advanced 

How can we parse 'date and time' representations that do not follow 
a standard style, e.g., Christmas Eve for Dec. 24th or '雙十節' for 
Oct. 10th (a Taiwanese festival)?

#### Challenge 

How can we derive the correct 'date and time' from the context of a 
corpus, e.g., 'On May 20th, Ing-Wen Tsai took her oath as the president of 
Taiwan. Next day, ...' in which 'Next day' is May 21st?
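
One way to approach the challenge question is sketched here, assuming the anchor date has already been parsed and that each relative phrase maps to a fixed day offset. The `RELATIVE_OFFSETS` table and the `resolve_relative` helper are hypothetical names for illustration and are not part of this notebook; the year 2016 is assumed for the example sentence.

```python
import datetime

# Hypothetical table: relative phrase -> offset in days from the anchor date
RELATIVE_OFFSETS = {
    "next day": 1,
    "the day before": -1,
    "翌日": 1,     # "next day"
    "前一天": -1,  # "the previous day"
}

def resolve_relative(anchor, phrase):
    """Return the ISO 8601 date for `phrase` relative to `anchor`,
    or None when the phrase is not in the table."""
    offset = RELATIVE_OFFSETS.get(phrase.strip().lower())
    if offset is None:
        return None
    return (anchor + datetime.timedelta(days=offset)).isoformat()

# 'On May 20th, ... Next day, ...' with the year assumed to be 2016
anchor = datetime.date(2016, 5, 20)
print(resolve_relative(anchor, "Next day"))
```

A real corpus would need a much larger phrase table and a rule for choosing which earlier date acts as the anchor; this sketch only covers the arithmetic.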

### ISO 8601 Formats [Output will use dashes] 

- YYYY-MM-DD or YYYYMMDD
- YYYY-MM (but not YYYYMM)
- --MM-DD or --MMDD
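
The three dashed output shapes listed above can be produced with zero-padded format specifiers. This is a minimal sketch; `format_partial_date` is a hypothetical helper (not a function from this notebook) and assumes integer inputs.

```python
def format_partial_date(year=None, month=None, day=None):
    """Format the three dashed shapes; return None for other combinations."""
    if year is not None and month is not None and day is not None:
        return f"{year:04d}-{month:02d}-{day:02d}"
    if year is not None and month is not None:
        return f"{year:04d}-{month:02d}"
    if year is None and month is not None and day is not None:
        return f"--{month:02d}-{day:02d}"
    return None

print(format_partial_date(2016, 10, 10))
print(format_partial_date(2016, 10))
print(format_partial_date(month=10, day=10))
```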

### Calendar system conversions

- [Minguo Calendar](https://en.wikipedia.org/wiki/Minguo_calendar) 1 = 1912
- [Heisei period](https://en.wikipedia.org/wiki/Heisei_period) 1 = 1989
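
Because era year 1 corresponds to the Gregorian year shown, the conversion is the first Gregorian year plus the era year minus one. A small sketch of that arithmetic follows; `ERA_FIRST_YEAR` and `era_to_gregorian` are illustrative names only.

```python
# First Gregorian year of each era, taken from the list above
ERA_FIRST_YEAR = {"民國": 1912, "平成": 1989}

def era_to_gregorian(era, era_year):
    """Convert an era year to a Gregorian year (era years start at 1)."""
    if era_year < 1:
        raise ValueError("era years start at 1")
    return ERA_FIRST_YEAR[era] + era_year - 1

print(era_to_gregorian("民國", 105))
print(era_to_gregorian("平成", 28))
```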

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import re
import datetime
import wikipedia
import wptools

## Entry Level

How can the accuracy of parsing 'date and time' representations be improved for different 'date and time' representations?

### Init helper functions and page urls

Here I try to cover as many cases as possible. Several lookup objects are used throughout to handle the calendar systems that are covered, namely the Minguo and Heisei period calendars. Dates of the form 同年 ("the same year") are also covered.

Note that I only consider dates that appear in the visible text of the Wikipedia page, i.e., what you would see in a web browser.

In [3]:
# Store japanese era as char:year_offset lookup
japanese_era_lookup = {"平成":1988}

minguo_set = set(["民國", "民国","中華民國","中华民国"])
chinese_same_year_set = set(["同年"])

def convert_calendar_system(year, calendar_system):
    """ Accepts a year and calendar system and converts to the
    Gregorian calendar system.
    Args:
        year (int): Numerical year value
        calendar_system (str): Specifier for calendar system
    Return
        gregorian_year (int)
    """

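    # Regex groups arrive as strings, so coerce before comparing
    if year is None:
        return None
    year = int(year)
    if year < 1:
        return None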

    if calendar_system in minguo_set:
        return 1911 + year
    if calendar_system in japanese_era_lookup:
        return japanese_era_lookup[calendar_system] + year
    if calendar_system in chinese_same_year_set:
        return datetime.date.today().year
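
`convert_calendar_system` maps 同年 to the current year. An alternative reading, sketched here as an assumption rather than as what the notebook does, resolves 同年 to the most recent explicit year seen earlier in the text. The pattern `YEAR_OR_SAME` and the helper `resolve_same_year` are hypothetical and deliberately simplified.

```python
import re

# Simplified, hypothetical pattern: an explicit 4-digit year or 同年, then a month
YEAR_OR_SAME = re.compile(
    r"(?:(?P<year>\d{4})年|(?P<same>同年))(?P<month>1[0-2]|0?[1-9])月"
)

def resolve_same_year(text):
    """Yield 'YYYY-MM' strings, resolving 同年 to the last explicit year seen."""
    last_year = None
    for m in YEAR_OR_SAME.finditer(text):
        if m.group("year"):
            last_year = int(m.group("year"))
        if last_year is None:
            continue  # 同年 with no earlier year to refer to; skip it
        yield f"{last_year:04d}-{int(m.group('month')):02d}"

text = "2016年5月就職，同年10月發表演說。"
print(list(resolve_same_year(text)))
```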

In [4]:
# Set wikipedia api language
lang = "zh"
wikipedia.set_lang(lang)

# Here I declare the title of the zh wikipedia page, formatted as https://zh.wikipedia.org/wiki/<page_title>
page_title = "民國紀年"
# page_title = "臺灣"

page_text = wikipedia.page(page_title).content

### Init and compile regex

I define several regex objects for readability and ease of use. Named groups are used to extract the date attributes as required.

In [5]:
# Asian dates. Matches examples: a,d,e,f,h
asian_dates = r"\b(?:\w*?)?(?P<calendar>民國|民国|中華民國|中华民国|平成|同年)?(?P<year>\d{1,3}\w{1}|[0-9]{4}\w{1})?(?P<month>0?[1-9]月|1[0-2]月)(?P<day>3[01]日|[12][0-9]日|0?[1-9]日)?(?:\w*?)?\b"

# ISO date. Matches example: b
iso_dates = r"\b(?P<year>\d{4})\-?(?P<month>0[1-9]|1[0-2])\-?(?P<day>[12]\d|0[1-9]|3[01])\b"

# Month text representation. Matches example: c
text_month = r"\b(?P<month>[a-zA-Z]{3,9})(?:\D{1,2})(?P<day>[12]\d|0?[1-9]|3[01])?(?:\D{0,5})(?P<year>\d{4})?\b"

# Need to do a cleanup after for month validation
text_month_cleanup = r"\b(?P<month>Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)(?:.)?\s(?P<day>\d{1,2})?(?:st|nd|rd|th)?(?:,)?(?:\s)?(?P<year>\d{4})?\b"

# Chinese representation. Current year and month, given day. Matches example: g
given_day = r"\b(?:\w*?)?(?P<month>本月)(?P<day>[12]\d\D{1}|0?[1-9]\D{1}|3[01]\D{1})\b"

# Alternate text month where the day is listed first eg: 31 October
alt_text_month = r"\b(?P<day>[12]\d|0?[1-9]|3[01])(?:\D{1,3})?(?:\s)?(?P<month>[a-zA-Z]{3,9})(?:\D{0,2})?(?:\D{0,5})(?P<year>\d{4})?\b"
alt_text_month_cleanup = r"\b(?P<day>\d{1,2})?(?:st|nd|rd|th)?(?:,)?(?:\s)(?P<month>Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)(?:.)?(?:\s)?(?P<year>\d{4})?\b"

cmp_asian_dates = re.compile(asian_dates, re.UNICODE)
cmp_iso_dates = re.compile(iso_dates, re.UNICODE)
cmp_text_month = re.compile(text_month, re.UNICODE | re.IGNORECASE)
cmp_given_day = re.compile(given_day, re.UNICODE | re.IGNORECASE)
cmp_text_month_cleanup = re.compile(text_month_cleanup, re.IGNORECASE)
cmp_alt_text_month = re.compile(alt_text_month, re.UNICODE | re.IGNORECASE)
cmp_alt_text_month_cleanup = re.compile(alt_text_month_cleanup, re.UNICODE | re.IGNORECASE)
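
As an illustration of how named groups expose the date attributes, here is a simplified stand-in for the `asian_dates` pattern. It is an assumption made for readability, not the notebook's regex: it keeps the 年/月/日 suffixes outside the groups and omits the word boundaries and the simplified-character calendar names.

```python
import re

# Simplified pattern: optional calendar prefix, optional year, month, optional day
simple_asian = re.compile(
    r"(?P<calendar>民國|平成|同年)?"
    r"(?:(?P<year>\d{1,4})年)?"
    r"(?P<month>1[0-2]|0?[1-9])月"
    r"(?:(?P<day>3[01]|[12]\d|0?[1-9])日)?"
)

samples = ["民國105年10月10日", "2016年10月", "10月10日", "同年10月", "平成28年10月10日"]
for sample in samples:
    m = simple_asian.search(sample)
    # groupdict() returns None for optional groups that did not take part
    print(sample, "->", m.groupdict())
```

Groups that do not participate in a match come back as `None`, which is what lets incomplete representations such as examples d, e and f be formatted as partial dates.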

In [6]:
asian_date_list = list(cmp_asian_dates.finditer(page_text))
iso_date_list = list(cmp_iso_dates.finditer(page_text))
text_month_list = list(cmp_text_month.finditer(page_text))
given_day_list = list(cmp_given_day.finditer(page_text))
alt_text_month_list = list(cmp_alt_text_month.finditer(page_text))

### Parse and format dates

The functions below prepare the parsed dates for output in ISO 8601 format.

In [7]:
def prep_origin_date(match_object):
    """ Prep the original date string
    Args:
        match_object (regex match object)
    Return:
        origin_date (str)
    """
    
    origin_date = ""
    # Not every pattern defines all four groups, so read them through
    # groupdict() instead of catching the IndexError raised by group()
    groups = match_object.groupdict()
    for name in ("calendar", "year", "month", "day"):
        if groups.get(name):
            origin_date += groups[name]
    
    return origin_date
    
def prep_iso8601(year=None, month=None, day=None):
    """ Accept and format year, month and day args.
    Args:
        year, month, day (int)
    Return
        result (str): Formatted iso8601 string.
    """
    
    # Format with leading zeros
    if month is not None:
        month = "".join(["0",str(month)]) if month and int(month) < 10 else str(month)
    if day is not None:
        day = "".join(["0",str(day)]) if day and len(day) == 1 and int(day) < 10 else str(day)

    result = ""
    result = result + str(year) + "-" if year else "--"
    if month and day:
        result = result + str(month) + "-"
    elif month:
        result = result + str(month)
    elif month is None:
        result = result + "--"
    result = result + str(day) if day else result
    
    return result

def get_date_attrs(match_object):
    """ Parse the year, month and day attributes
    from a given match object.
    Args:
        match_object: (regex match object)
    Return
        year, month, day (int)
    """
    
    calendar = match_object.group("calendar")
    year = match_object.group("year")
    month = match_object.group("month")
    day = match_object.group("day")

    # Remove chinese character suffix
    if year:
        year = year[:-1]
    if month:
        month = month[:-1]
    if day:
        day = day[:-1]

    if calendar in japanese_era_lookup:
        year = convert_calendar_system(year, calendar)

    if calendar in minguo_set:
        year = convert_calendar_system(year, calendar)
    
    return year, month, day

def format_iso_dates(iso_date_list):
    """ Format and return iso_dates
    Args:
        iso_date_list (list): List of iso dates
    Return
        result (list): List of formatted ISO8601 date strings.
    """

    result = []
    for item in iso_date_list:
        # Rebuild from the named groups so compact (YYYYMMDD) and partially
        # dashed matches are normalised the same way
        result.append("-".join([item.group("year"), item.group("month"), item.group("day")]))
    
    return result

def format_asian_dates(asian_date_list):
    """ Format and return asian_dates
    Args:
        asian_date_list (list): List of asian dates
    Return
        result (list): List of formatted ISO8601 date strings.
    """
    
    result = []
    origin_date_list = []
    for item in asian_date_list:
        year, month, day = get_date_attrs(item)
        formatted_date = prep_iso8601(year=year, month=month, day=day)
        result.append(formatted_date)
        
        origin_date = prep_origin_date(item)
        origin_date_list.append(origin_date)
    
    return result, origin_date_list

def format_text_month(text_month_list, alt=False):
    """ Format and return written month dates
    Args:
        text_month_list (list): List of text month dates
    Return
        result (list): List of formatted ISO8601 date strings.
    """
    
    result = []
    origin_date_list = []
    for item in text_month_list:
        if not alt:
            parsed = cmp_text_month_cleanup.search(str(item.group()))
        else:
            parsed = cmp_alt_text_month_cleanup.search(str(item.group()))
        if parsed:
            year = parsed.group("year") if parsed.group("year") else None
            month = datetime.datetime.strptime(parsed.group("month"),"%B").month if len(parsed.group("month")) > 3 else datetime.datetime.strptime(parsed.group("month"),"%b").month
            day = parsed.group("day") if parsed.group("day") else None
            
            formatted_date = prep_iso8601(year=year, month=month, day=day)
            result.append(formatted_date)
            origin_date_list.append(item)
    
    return result, origin_date_list

def format_given_day(given_day_list):
    """ Format and return given day dates
    Args:
        given_day_list (list): List of given day dates
    Return
        result (list): List of formatted ISO8601 date strings.
    """
    
    result = []
    origin_date_list = []
    for item in given_day_list:
        day = item.group("day")
        if day:
            day = day[:-1]
        formatted_date = prep_iso8601(year=None, month=None, day=day)
        result.append(formatted_date)
        
        origin_date = prep_origin_date(item)
        origin_date_list.append(origin_date)
    
    return result, origin_date_list 

## Entry Level Results

#### Asian formatted dates

In [8]:
res1_iso, res1_orig = format_asian_dates(asian_date_list)
for idx, item in enumerate(res1_iso):
    print("ISO8601: " + res1_iso[idx], "| Original: " + res1_orig[idx])

ISO8601: 4609-11-13 | Original: 4609年11月13日
ISO8601: 1912-01-01 | Original: 1912年1月1日
ISO8601: 2005-09-06 | Original: 2005年9月6日
ISO8601: 1997-07 | Original: 1997年7月
ISO8601: 1912-01-01 | Original: 1912年1月1日
ISO8601: 1949-10-01 | Original: 1949年10月1日
ISO8601: 1912-01-01 | Original: 1912年1月1日
ISO8601: --09-30 | Original: 9月30日
ISO8601: --10-01 | Original: 10月1日
ISO8601: 1950-03-27 | Original: 1950年3月27日
ISO8601: --05-01 | Original: 5月1日
ISO8601: 1955-02-25 | Original: 1955年2月25日
ISO8601: --01-01 | Original: 1月1日
ISO8601: 1909-01-22 | Original: 1909年1月22日
ISO8601: 1912-02-12 | Original: 1912年2月12日
ISO8601: 1917-07-01 | Original: 1917年7月1日
ISO8601: 1917-07-12 | Original: 1917年7月12日
ISO8601: 1916-01-01 | Original: 1916年1月1日
ISO8601: 1916-03-22 | Original: 1916年3月22日
ISO8601: 1911-12-16 | Original: 1911年12月16日
ISO8601: 1915-06-09 | Original: 1915年6月9日
ISO8601: 1921-02-21 | Original: 1921年2月21日
ISO8601: 1924-11-26 | Original: 1924年11月26日
ISO8601: 1868-01-25 | Original: 1868年1月25日
ISO8601: 191

#### ISO8601 dates

In [9]:
res2_iso = format_iso_dates(iso_date_list)
for idx, item in enumerate(res2_iso):
    print("ISO8601: " + res2_iso[idx])

ISO8601: 2013-01-01


#### Written text month

In [10]:
res3_iso, res3_orig = format_text_month(text_month_list)
for idx, item in enumerate(res3_iso):
    print("ISO8601: " + res3_iso[idx], "| Original: " + res3_orig[idx].group())

In [11]:
# This is just a static test since these dates don't often appear
test_text_month = "Oct. 1st, 2016, October 15th, Jan 07 2017, FEB 1995, J59 2038, March 10th"
test_text_month = list(cmp_text_month.finditer(test_text_month))
res3_test_iso, res3_test_orig = format_text_month(test_text_month)

for idx, item in enumerate(res3_test_iso):
    print("ISO8601: " + res3_test_iso[idx], "| Original: " + res3_test_orig[idx].group())

ISO8601: 2016-10-01 | Original: Oct. 1st, 2016
ISO8601: --10-15 | Original: October 15th, 
ISO8601: 2017-01-07 | Original: Jan 07 2017
ISO8601: 1995-02 | Original: FEB 1995
ISO8601: --03-10 | Original: March 10th


#### Formatted given day

In [12]:
res4_iso, res4_orig = format_given_day(given_day_list)
for idx, item in enumerate(res4_iso):
    print("ISO8601: " + res4_iso[idx], "| Original: " + res4_orig[idx])

In [13]:
# This is just a static test since these dates don't often appear
test_given_day = "本月10日"
test_given_day = list(cmp_given_day.finditer(test_given_day))
res4_test_iso, res4_test_orig = format_given_day(test_given_day)

for idx, item in enumerate(res4_test_iso):
    print("ISO8601: " + res4_test_iso[idx], "| Original: " + res4_test_orig[idx])

ISO8601: ----10 | Original: 本月10日


#### Alternate written text month

In [14]:
res5_iso, res5_orig = format_text_month(alt_text_month_list, alt=True)
for idx, item in enumerate(res5_iso):
    print("ISO8601: " + res5_iso[idx], "| Original: " + res5_orig[idx].group())

In [15]:
# This is just a static test since these dates don't often appear
test_alt_text_month = "1st October"
test_alt_text_month = list(cmp_alt_text_month.finditer(test_alt_text_month))
res5_test_iso, res5_test_orig = format_text_month(test_alt_text_month, alt=True)

for idx, item in enumerate(res5_test_iso):
    print("ISO8601: " + res5_test_iso[idx], "| Original: " + res5_test_orig[idx].group())

ISO8601: --10-01 | Original: 1st October


## Advanced Level

How can we parse 'date and time' representations that do not follow a standard style, e.g., Christmas Eve for Dec. 24th or '雙十節' for Oct. 10th (a Taiwanese festival)?

### Init helper functions and objects

Note that I primarily consider national holidays. Secondary holidays are returned if identified.

This one was a bit tricky. There are too many holidays across countries to code them all manually, and some holidays are unofficial. First, I use [`jieba`](https://github.com/fxsjy/jieba) to segment the text. Next, I define a mapping of official public holidays for Taiwan and the USA and check the tokens against it. I also use a calendar library, [`workalendar`](https://github.com/novafloss/workalendar), to get the dates of the public holidays in the current year. This solves half the problem.
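
The mapping step can be pictured as a two-stage lookup: surface form to canonical label, then label to date. The sketch below uses small hypothetical tables (`ALIASES`, `FIXED_DATES`) with hard-coded month and day values, so it only works for holidays that fall on the same solar date every year; it is not the notebook's implementation.

```python
import datetime

# Hypothetical alias table: surface form -> canonical label
ALIASES = {
    "雙十節": "National Day",
    "國慶日": "National Day",
    "christmas eve": "Christmas Eve",
    "xmas eve": "Christmas Eve",
}

# Hypothetical (month, day) table for holidays with a fixed solar date
FIXED_DATES = {"National Day": (10, 10), "Christmas Eve": (12, 24)}

def lookup_fixed_holiday(token, year):
    """Return (label, ISO 8601 date) for a known alias, else None."""
    label = ALIASES.get(token.lower())
    if label is None:
        return None
    month, day = FIXED_DATES[label]
    return label, datetime.date(year, month, day).isoformat()

print(lookup_fixed_holiday("雙十節", 2016))
print(lookup_fixed_holiday("Christmas Eve", 2016))
```

Lunar-calendar holidays such as 端午節 move from year to year, which is why a per-year calendar source is needed rather than a fixed table like this one.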

To get holidays that aren't covered, I thought it best to rely on Wikipedia. The tokens from the page are restricted to the top 500 (an arbitrary cutoff), ranked by their tf-idf score. Jieba provides a dictionary and part-of-speech tags, which are used to reduce the search space for the Chinese tokens. Tokens of two or more characters are kept for the mapping lookup, and of those only tokens longer than two characters that are tagged as time words are looked up on Wikipedia. I also retain tokens tagged as English.
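
The filtering step can be sketched without running a tagger by using hypothetical `(word, flag)` pairs in the shape a part-of-speech tagger produces, where `"t"` marks time words and `"eng"` marks English tokens. The default threshold of three characters mirrors the `len(tag.word) > 2` check in `get_relative_dates`; `candidate_tokens` itself is an illustrative name.

```python
# Hypothetical tagger output: (word, part-of-speech flag)
tagged = [
    ("端午節", "t"), ("端午", "t"), ("龍舟", "n"),
    ("halloween", "eng"), ("2017", "m"), ("中秋節", "t"),
]

def candidate_tokens(pairs, min_len=3, keep_flags=("t", "eng")):
    """Keep non-numeric tokens of at least `min_len` characters whose
    part-of-speech flag is in `keep_flags`."""
    return [
        word for word, flag in pairs
        if len(word) >= min_len and not word.isnumeric() and flag in keep_flags
    ]

print(candidate_tokens(tagged))
```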

For any Chinese or English token not found in our defined calendar mappings, I retrieve its Wikipedia page (if it exists) with a wrapper for the [`wikipedia api`](https://github.com/goldsmith/Wikipedia) and extract its infobox with [`wptools`](https://github.com/siznax/wptools). The infobox is the box at the right of a page that offers summary information about it. For most holidays this infobox stores a date attribute that gives the calendar date of the holiday. In this way, I can rely on the information that already exists in Wikipedia to tell us whether a token is a holiday.
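
To make the infobox step concrete without a network call, the sketch below parses a "Month day" string from a hypothetical infobox dictionary. The `sample` contents, the `MONTH_DAY` pattern and `infobox_date_to_iso` are assumptions for illustration, and month-name parsing assumes an English locale.

```python
import datetime
import re

MONTH_DAY = re.compile(r"(?P<month>[A-Za-z]{3,9})\.?\s+(?P<day>\d{1,2})")

def infobox_date_to_iso(infobox, year):
    """Parse a 'Month day' string from infobox['date'] into ISO 8601, or None."""
    raw = (infobox or {}).get("date")
    if not raw:
        return None
    m = MONTH_DAY.search(raw)
    if not m:
        return None
    for fmt in ("%B", "%b"):  # full month name first, then abbreviation
        try:
            month = datetime.datetime.strptime(m.group("month"), fmt).month
            break
        except ValueError:
            continue
    else:
        return None  # the word before the day was not a month name
    try:
        return datetime.date(year, month, int(m.group("day"))).isoformat()
    except ValueError:
        return None  # e.g. February 30

# Hypothetical infobox contents, shaped like a parsed infobox dict
sample = {"holiday_name": "Halloween", "date": "October 31"}
print(infobox_date_to_iso(sample, 2017))
```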


In [16]:
import jieba.analyse
import jieba.posseg as pseg
from workalendar.asia import Taiwan
from workalendar.usa.core import UnitedStates

zh_cal = Taiwan()
us_cal = UnitedStates()

In [17]:
taiwan_public_holidays_map = {
                                "元旦":"New year", "農曆除夕":"Chinese New Year's eve", "農曆新年":"Chinese New Year",
                                "春节":"Chinese New Year","228和平紀念日":"228 Peace Memorial Day", 
                                "和平紀念日":"228 Peace Memorial Day","婦女節":"Combination of Women's Day and Children's Day",
                                "兒童節合倂":"Combination of Women's Day and Children's Day","清明节":"Qingming Festival",
                                "淸明節":"Qingming Festival","端午節":"Dragon Boat Festival","端午节":"Dragon Boat Festival",
                                "中秋節":"Mid-Autumn Festival","中秋国庆":"Mid-Autumn Festival",
                                "雙十節":"National Day/Double Tenth Day","國慶日":"National Day/Double Tenth Day"
                             }

us_public_holidays_map = {
                            "christmas":"Christmas Day","christmas day":"Christmas Day","xmas":"Christmas Day",
                            "new year":"New year","new year's":"New year","new years":"New year",
                            "mlk day": "Birthday of Martin Luther King, Jr.","martin luther king jr day":"Birthday of Martin Luther King, Jr.",
                            "washington's birthday":"Washington's Birthday","washingtons day":"Washington's Birthday",
                            "memorial day":"Memorial Day","independence day":"Independence Day","labor day":"Labor Day",
                            "columbus day":"Columbus Day","veterans day":"Veterans Day","thanksgiving":"Thanksgiving Day",
                            "thanksgiving day":"Thanksgiving Day"
                         }

taiwan_public_holidays = dict(zh_cal.holidays())
us_public_holidays = dict(us_cal.holidays())

def get_public_holiday(holiday_name):
    """ Lookup the date representation for a given holiday string
    Args:
        holiday_name (str): Holiday string (Chinese, or lowercase English)
    
    Return
        result (dict): datetime.date:str, empty if the name is not mapped
    """
    
    # Default to an empty dict so unmapped names do not raise
    result = {}
    if holiday_name in taiwan_public_holidays_map:
        result = {k:v for k,v in taiwan_public_holidays.items() if v == taiwan_public_holidays_map[holiday_name]}
    
    elif holiday_name in us_public_holidays_map:
        result = {k:v for k,v in us_public_holidays.items() if v == us_public_holidays_map[holiday_name]}
    return result

In [18]:
def get_relative_dates(page_text):
    """ Attempt to parse relative dates from a given page.
    
    The logic is as follows, first check for characters >=2 in length
    in order to reduce the search space. Some holidays are represented 
    by two characters so check for any string in the results 
    that exists in our defined holiday mapping. 
    
    As a secondary step, attempt to check the wikipedia api for
    any string of 3 or more characters in order to identify additional
    dates.
    
    Args:
        page_text (str): Wiki text string.
    """
    
    # topK refers to the number of tags to return, based on tf-idf
    tags = jieba.analyse.extract_tags(page_text, topK=500)
    filtered_tags = [tag for tag in tags if not tag.isnumeric() and len(tag) >= 2]
    filtered_tag_set = set(filtered_tags)
    
    # Need to format the tags as a string for the secondary pass
    filtered_string = ", ".join(filtered_tags)
    
    fixed_holidays_zh = [get_public_holiday(tag) for tag in filtered_tag_set if tag in taiwan_public_holidays_map]
    fixed_holidays_us = [get_public_holiday(tag) for tag in filtered_tag_set if tag in us_public_holidays_map]

    # Skip lookups that came back empty (label not in this year's calendar)
    result = {str(list(item.values())[0]):str(list(item.keys())[0]) for item in fixed_holidays_zh if item}
    for us_holiday in fixed_holidays_us:
        if us_holiday:
            result[str(list(us_holiday.values())[0])] = str(list(us_holiday.keys())[0])
    
    secondary_tags = set(pseg.cut(filtered_string))
    for tag in secondary_tags:
        # Flag refers to the part of speech and t refers to time words
        if len(tag.word) > 2 and tag.word not in taiwan_public_holidays_map and tag.word not in us_public_holidays_map and tag.flag in ["t", "eng"]:
            if tag.flag == "t":
                res = parse_wiki_info_box(tag.word, "zh", "asian_date")
            elif tag.flag == "eng":
                res = parse_wiki_info_box(tag.word, "en", "text_month")
            if res:
                result[str(list(res.keys())[0])] = str(list(res.values())[0])
    return result

def parse_wiki_info_box(query, lang, date_format):
    """ Parses the date from the info box of a given wikipedia page.
    Accepts a search query and checks for page existence, returning a parsed
    date field from the info box of said page.
    
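    Args:
        query (str): Search term
        lang (str): Language to set the wikipedia API to.
        date_format (str): Either "text_month" or "asian_date"
    Return
        dict mapping the page label to an ISO8601 date string, or None
    """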
    
    wikipedia.set_lang(lang)
    search_results = wikipedia.search(query)
    for entry in search_results:
        curr_result = wptools.page(entry, silent=True, lang=lang).get()
        # The infobox may be missing, so check it before the key lookup
        if curr_result.infobox and "date" in curr_result.infobox:
            top_result = curr_result
            break
    try:
        if date_format == "text_month":
            
            info_box_date = top_result.infobox["date"]
            date_label = top_result.label
            
            regex_results = list(cmp_text_month.finditer(info_box_date))
            if regex_results[0].group("day") is None:
                regex_results = list(cmp_alt_text_month.finditer(info_box_date))
                alt = True
            
            if regex_results:
                for item in regex_results:
                    parsed = cmp_text_month_cleanup.search(str(item.group()))
                    if not parsed:
                        parsed = cmp_alt_text_month_cleanup.search(str(item.group()))
                
                if parsed:
                    year = parsed.group("year") if parsed.group("year") else datetime.date.today().year
                    month = datetime.datetime.strptime(parsed.group("month"),"%B").month if len(parsed.group("month")) > 3 else datetime.datetime.strptime(parsed.group("month"),"%b").month
                    day = parsed.group("day") if parsed.group("day") else None
                    formatted_date = prep_iso8601(year=year, month=month, day=day)
        
        elif date_format == "asian_date":
            
            info_box_date = top_result.infobox["date"]
            date_label = top_result.label
            
            regex_results = list(cmp_asian_dates.finditer(info_box_date))
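            if regex_results:
                # Use the first date matched in the infobox text
                year, month, day = get_date_attrs(regex_results[0])
                if year is None:
                    year = datetime.date.today().year
                formatted_date = prep_iso8601(year=year, month=month, day=day)
        # Kept inside the try so a missing page or unparsed date returns None
        return {date_label:formatted_date}
    except Exception:
        return None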

## Advanced Level Results

In [19]:
wikipedia.set_lang(lang)
page_text2 = wikipedia.page("端午節_(華人)").content

In [20]:
get_relative_dates(page_text2)

Building prefix dict from the default dictionary ...
Dumping model to file cache /tmp/jieba.cache
Loading model cost 1.191 seconds.
Prefix dict has been built succesfully.


{'Dragon Boat Festival': '2017-05-30'}

In [21]:
get_relative_dates("christmas")

{'Christmas Day': '2017-12-25'}

In [22]:
get_relative_dates("halloween")

{'Halloween': '2017-10-31'}

In [23]:
get_relative_dates("元旦")

{'New year': '2017-01-01'}

In [24]:
get_relative_dates("中華民國國慶日")

{'National Day/Double Tenth Day': '2017-10-10'}