# Date Representations

### Logic for parsing all the dates and times from a given Chinese Wikipedia page

- Dates are to be formatted as ISO 8601
- A single page might contain multiple 'date and time' representation styles
- Representations may be incomplete

#### Examples

- a: 民國105年10月10日 (Taiwan datetime representation for 2016-10-10)
- b: 2016-10-10
- c: Oct. 10, 2016
- d: 2016年10月 (Chinese date representation for Oct. 2016)
- e: 10月10日 (Chinese date representation for Oct. 10th)
- f: 同年10月 (Chinese date representation for 'October of the same year')
- g: 本月10日 (Chinese date representation for 'the 10th of this month')
- h: 平成28年10月10日 (Japanese datetime representation for 2016-10-10), etc.
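Examples f and g do not state all of their own fields, so one plausible approach is to inherit the missing fields from the most recent date that did state them. The sketch below assumes that dates arrive in reading order as `(year, month, day)` tuples, with `None` for missing fields and the literal markers `"同年"` and `"本月"` left in place. `fill_from_context` is a hypothetical helper and not part of the code in this notebook.

```python
def fill_from_context(dates):
    """Replace the markers "同年" (same year) and "本月" (this month)
    with the year and month of the most recent date that stated them."""
    last_year, last_month = None, None
    resolved = []
    for year, month, day in dates:
        if year == "同年":
            year = last_year
        if month == "本月":
            # "this month" is assumed to imply the same year as well
            year, month = last_year, last_month
        # Remember whatever is now known for later mentions
        if year is not None:
            last_year = year
        if month is not None:
            last_month = month
        resolved.append((year, month, day))
    return resolved

mentions = [(2016, 5, 20), ("同年", 10, None), (None, "本月", 10)]
print(fill_from_context(mentions))
# [(2016, 5, 20), (2016, 10, None), (2016, 10, 10)]
```

Dates with no marker and no year, such as example e, are left untouched here, because the text alone does not say whether they share the year of the preceding date.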

#### Entry

How can parsing accuracy be improved across the different 'date and time' representation styles?

#### Advanced 

How can we parse 'date and time' representations that do not 
follow a standard representation style, e.g., Christmas Eve 
for Dec. 24th or '雙十節' for Oct. 10th (a Taiwanese festival)?
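For holidays that fall on the same Gregorian date every year, a plain lookup table is one simple option. The sketch below uses only the two examples named above; `FIXED_HOLIDAYS` and `holiday_to_iso` are illustrative names, not part of this notebook. Holidays tied to the lunar calendar cannot be handled this way, because their Gregorian date changes from year to year.

```python
# (month, day) for holidays with a fixed Gregorian date.
FIXED_HOLIDAYS = {
    "Christmas Eve": (12, 24),
    "雙十節": (10, 10),
}

def holiday_to_iso(name, year=None):
    """Return YYYY-MM-DD if a year is given, else --MM-DD; None if unknown."""
    if name not in FIXED_HOLIDAYS:
        return None
    month, day = FIXED_HOLIDAYS[name]
    # Without a year, ISO 8601 uses a leading "--" in place of "YYYY-"
    prefix = "{:04d}".format(year) if year is not None else "-"
    return "{}-{:02d}-{:02d}".format(prefix, month, day)

print(holiday_to_iso("雙十節"))               # --10-10
print(holiday_to_iso("Christmas Eve", 2016))  # 2016-12-24
```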

#### Challenge 

How can we derive the correct 'date and time' from the context of a 
corpus, e.g., 'On May 20th, Ing-Wen Tsai took her oath as the president of 
Taiwan. Next day, ...', in which 'Next day' is May 21st?
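One way to approach this is to keep the most recently resolved date as an anchor and apply an offset for relative phrases. A minimal sketch, assuming the anchor has already been parsed; `RELATIVE_OFFSETS` and `resolve_relative` are hypothetical names, and the year 2016 is supplied here as an assumption because the example sentence does not state it.

```python
import datetime

# Day offsets for a few relative phrases (illustrative, not exhaustive).
RELATIVE_OFFSETS = {"next day": 1, "the day before": -1, "次日": 1, "翌日": 1}

def resolve_relative(anchor, phrase):
    """Shift the anchor date by the offset registered for phrase."""
    offset = RELATIVE_OFFSETS.get(phrase.strip().lower())
    if offset is None:
        return None
    return anchor + datetime.timedelta(days=offset)

# Anchor taken from the example sentence; the year is an assumption.
anchor = datetime.date(2016, 5, 20)
print(resolve_relative(anchor, "Next day").isoformat())  # 2016-05-21
```

Using `datetime.timedelta` rather than incrementing the day number keeps month and year rollovers correct.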

### ISO 8601 Formats [Output will use dashes] 

- YYYY-MM-DD or YYYYMMDD
- YYYY-MM (but not YYYYMM)
- --MM-DD or --MMDD
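To check that a parser's output really is one of the dashed forms above, a small validator can help. The sketch below is one assumed way of doing this; `DASHED_ISO` and `is_dashed_iso` are hypothetical names that do not appear elsewhere in this notebook.

```python
import re

MONTH = r"(?:0[1-9]|1[0-2])"
DAY = r"(?:0[1-9]|[12]\d|3[01])"

# Accepts only the dashed forms: YYYY-MM-DD, YYYY-MM and --MM-DD.
# It checks the shape only, not whether the day exists in that month.
DASHED_ISO = re.compile(
    r"\d{4}-" + MONTH + r"(?:-" + DAY + r")?"
    r"|--" + MONTH + r"-" + DAY
)

def is_dashed_iso(text):
    """Return True if text is one of the three dashed representations."""
    return DASHED_ISO.fullmatch(text) is not None

for candidate in ["2016-10-10", "2016-10", "--10-10", "20161010", "2016-13"]:
    print(candidate, is_dashed_iso(candidate))
```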

### Calendar system conversions

- [Minguo Calendar](https://en.wikipedia.org/wiki/Minguo_calendar): year 1 = 1912
- [Heisei period](https://en.wikipedia.org/wiki/Heisei_period): year 1 = 1989
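Because year 1 of each era corresponds to the Gregorian year listed above, the Gregorian year is the era's first year plus the era year, minus one. A minimal sketch of that arithmetic, where `ERA_FIRST_YEAR` and `era_to_gregorian` are illustrative names rather than part of the notebook's code:

```python
# Gregorian year in which each era's year 1 falls (values from the list above).
ERA_FIRST_YEAR = {"民國": 1912, "平成": 1989}

def era_to_gregorian(era, era_year):
    """Convert an era year to a Gregorian year; None for unknown eras."""
    if era not in ERA_FIRST_YEAR or era_year < 1:
        return None
    return ERA_FIRST_YEAR[era] + era_year - 1

print(era_to_gregorian("民國", 105))  # 2016
print(era_to_gregorian("平成", 28))   # 2016
```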

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import re
import datetime
import wikipedia
import wptools

## Entry Level

How can parsing accuracy be improved across the different 'date and time' representation styles?

### Init helper functions and page urls

In [3]:
# Store Japanese era as era_name:year_offset lookup (offset = first Gregorian year of the era - 1)
japanese_era_lookup = {"平成":1988}

minguo_set = set(["民國", "民国","中華民國","中华民国"])
chinese_same_year_set = set(["同年"])

def convert_calendar_system(year, calendar_system):
    """ Accepts a year and calendar system and converts to the
    Gregorian calendar system.
    Args:
        year (int): Numerical year value
        calendar_system (str): Specifier for calendar system
    Return
        gregorian_year (int)
    """

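    # Regex groups arrive as strings, so coerce before comparing
    try:
        year = int(year)
    except (TypeError, ValueError):
        return None

    if year < 1:
        return None

    # Stays None when the calendar system is not recognised
    gregorian_year = None
    if calendar_system in minguo_set:
        gregorian_year = 1911 + year
    elif calendar_system in japanese_era_lookup:
        gregorian_year = japanese_era_lookup[calendar_system] + year
    elif calendar_system in chinese_same_year_set:
        # NOTE: "同年" (same year) is resolved to the current year here,
        # not to the year of the preceding date in the text
        gregorian_year = datetime.date.today().year
    
    return gregorian_year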

In [4]:
# Set wikipedia api language
lang = "zh"
wikipedia.set_lang(lang)

# Here we declare the title of the zh wikipedia page, formatted as https://zh.wikipedia.org/wiki/<page_title>
page_title = "民國紀年"
# page_title = "臺灣"

page_text = wikipedia.page(page_title).content

### Init and compile regex

In [5]:
# Asian dates. Matches examples: a,d,e,f,h
asian_dates = r"\b(?:\w*?)?(?P<calendar>民國|民国|中華民國|中华民国|平成|同年)?(?P<year>\d{1,3}\w{1}|[0-9]{4}\w{1})?(?P<month>0?[1-9]月|1[0-2]月)(?P<day>3[01]日|[12][0-9]日|0?[1-9]日)?(?:\w*?)?\b"

# ISO date. Matches example: b
iso_dates = r"\b(?P<year>\d{4})\-?(?P<month>0[1-9]|1[0-2])\-?(?P<day>[12]\d|0[1-9]|3[01])\b"

# Month text representation. Matches example: c
text_month = r"\b(?P<month>[a-zA-Z]{3,9})(?:\D{1,2})(?P<day>[12]\d|0?[1-9]|3[01])?(?:\D{0,5})(?P<year>\d{4})?\b"

# Need to do a cleanup after for month validation
text_month_cleanup = r"\b(?P<month>Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)(?:.)?\s+(?P<day>\d{1,2})(?:th)?(?:,)?(?:\s)?(?P<year>\d{4})?\b"

# Chinese representation. Current year and month, given day. Matches example: g
given_day = r"\b(?:\w*?)?(?P<month>本月)(?P<day>[12]\d\D{1}|0?[1-9]\D{1}|3[01]\D{1})\b"

cmp_asian_dates = re.compile(asian_dates, re.UNICODE)
cmp_iso_dates = re.compile(iso_dates, re.UNICODE)
cmp_text_month = re.compile(text_month, re.UNICODE | re.IGNORECASE)
cmp_given_day = re.compile(given_day, re.UNICODE | re.IGNORECASE)
cmp_text_month_cleanup = re.compile(text_month_cleanup, re.IGNORECASE)

In [6]:
asian_date_list = list(cmp_asian_dates.finditer(page_text))
iso_date_list = list(cmp_iso_dates.finditer(page_text))
text_month_list = list(cmp_text_month.finditer(page_text))
given_day_list = list(cmp_given_day.finditer(page_text))

### Parse and format dates

In [7]:
def prep_origin_date(match_item):
    """ Prep the original date string
    Return:
        origin_date (str)
    """
    
    # groupdict() only holds the named groups this pattern defines, so
    # .get() returns None for groups that are missing or did not match
    groups = match_item.groupdict()
    parts = [groups.get(name) for name in ("calendar", "year", "month", "day")]
    origin_date = "".join(part for part in parts if part)
    
    return origin_date
    
def prep_iso8601(year=None, month=None, day=None):
    """ Accept and format year, month and day args.
    Args:
        year, month, day (int)
    Return
        result (str): Formatted iso8601 string.
    """
    
    # Format with leading zeros; int() avoids double padding inputs like "09"
    if month is not None:
        month = "{:02d}".format(int(month))
    if day is not None:
        day = "{:02d}".format(int(day))

    result = (str(year) + "-") if year else "--"
    if month and day:
        result = result + str(month) + "-"
    elif month:
        result = result + str(month)
    elif month is None:
        result = result + "--"
    result = result + str(day) if day else result
    return result

def format_iso_dates(iso_date_list):
    """ Format and return iso_dates
    Args:
        iso_date_list (list): List of iso dates
    Return
        result (list): List of formatted ISO8601 date strings.
    """

    result = []
    for item in iso_date_list:
        # Rebuild from the named groups so that compact (YYYYMMDD) and
        # partially dashed matches are all normalised to YYYY-MM-DD
        result.append("-".join(item.group("year", "month", "day")))
    return result

def format_asian_dates(asian_date_list):
    """ Format and return asian_dates
    Args:
        asian_date_list (list): List of asian dates
    Return
        result (list): List of formatted ISO8601 date strings.
    """
    
    result = []
    origin_date_list = []
    for item in asian_date_list:
        calendar = item.group("calendar")
        year = item.group("year")
        month = item.group("month")
        day = item.group("day")
        
        # Remove chinese character suffix
        if year:
            year = year[:-1]
        if month:
            month = month[:-1]
        if day:
            day = day[:-1]
        
        if calendar in japanese_era_lookup:
            year = convert_calendar_system(year, calendar)
        
        if calendar in minguo_set:
            year = convert_calendar_system(year, calendar)
        
        formatted_date = prep_iso8601(year=year, month=month, day=day)
        result.append(formatted_date)
        
        origin_date = prep_origin_date(item)
        origin_date_list.append(origin_date)
    
    return result, origin_date_list

def format_text_month(text_month_list):
    """ Format and return written month dates
    Args:
        text_month_list (list): List of text month dates
    Return
        result (list): List of formatted ISO8601 date strings.
    """
    
    result = []
    origin_date_list = []
    for item in text_month_list:
        parsed = cmp_text_month_cleanup.search(str(item.group()))
        if parsed:
            year = parsed.group("year") if parsed.group("year") else None
            month = datetime.datetime.strptime(parsed.group("month"),"%B").month if len(parsed.group("month")) > 3 else datetime.datetime.strptime(parsed.group("month"),"%b").month
            day = parsed.group("day") if parsed.group("day") else None
            
            formatted_date = prep_iso8601(year=year, month=month, day=day)
            result.append(formatted_date)
            origin_date_list.append(item)
    
    return result, origin_date_list

def format_given_day(given_day_list):
    """ Format and return given day dates
    Args:
        given_day_list (list): List of given day dates
    Return
        result (list): List of formatted ISO8601 date strings.
    """
    
    result = []
    origin_date_list = []
    for item in given_day_list:
        # The day group includes its trailing character (e.g. "10日"), so strip it
        day = item.group("day")[:-1]
        formatted_date = prep_iso8601(year=None, month=None, day=day)
        result.append(formatted_date)
        
        origin_date = prep_origin_date(item)
        origin_date_list.append(origin_date)
    
    return result, origin_date_list 

### Entry results

#### Asian formatted dates

In [8]:
res1_iso, res1_orig = format_asian_dates(asian_date_list)
for iso, orig in zip(res1_iso, res1_orig):
    print("ISO8601: " + iso, "| Original: " + orig)

ISO8601: 4609-11-13 | Original: 4609年11月13日
ISO8601: 1912-01-01 | Original: 1912年1月1日
ISO8601: 2005-09-06 | Original: 2005年9月6日
ISO8601: 1997-07 | Original: 1997年7月
ISO8601: 1912-01-01 | Original: 1912年1月1日
ISO8601: 1949-10-01 | Original: 1949年10月1日
ISO8601: 1912-01-01 | Original: 1912年1月1日
ISO8601: --09-30 | Original: 9月30日
ISO8601: --10-01 | Original: 10月1日
ISO8601: 1950-03-27 | Original: 1950年3月27日
ISO8601: --05-01 | Original: 5月1日
ISO8601: 1955-02-25 | Original: 1955年2月25日
ISO8601: --01-01 | Original: 1月1日
ISO8601: 1909-01-22 | Original: 1909年1月22日
ISO8601: 1912-02-12 | Original: 1912年2月12日
ISO8601: 1917-07-01 | Original: 1917年7月1日
ISO8601: 1917-07-12 | Original: 1917年7月12日
ISO8601: 1916-01-01 | Original: 1916年1月1日
ISO8601: 1916-03-22 | Original: 1916年3月22日
ISO8601: 1911-12-16 | Original: 1911年12月16日
ISO8601: 1915-06-09 | Original: 1915年6月9日
ISO8601: 1921-02-21 | Original: 1921年2月21日
ISO8601: 1924-11-26 | Original: 1924年11月26日
ISO8601: 1868-01-25 | Original: 1868年1月25日
ISO8601: 191

#### ISO8601 dates

In [9]:
res2_iso = format_iso_dates(iso_date_list)
for iso in res2_iso:
    print("ISO8601: " + iso)

ISO8601: 2013-01-01


#### Written text month

In [10]:
res3_iso, res3_orig = format_text_month(text_month_list)
for idx, item in enumerate(res3_iso):
    print("ISO8601: " + res3_iso[idx], "| Original: " + res3_orig[idx].group())

#### Formatted given day

In [11]:
res4_iso, res4_orig = format_given_day(given_day_list)
for idx, item in enumerate(res4_iso):
    print("ISO8601: " + res4_iso[idx], "| Original: " + res4_orig[idx])

## Advanced Level

How can we parse 'date and time' representations that do not follow a standard representation style, e.g., Christmas Eve for Dec. 24th or '雙十節' for Oct. 10th (a Taiwanese festival)?

### Init helper functions and objects

Note that we primarily consider national holidays. Secondary holidays are returned if identified.

In [12]:
import jieba
import jieba.analyse
import jieba.posseg as pseg
from workalendar.asia import Taiwan
from workalendar.usa.core import UnitedStates

zh_cal = Taiwan()
us_cal = UnitedStates()

In [13]:
taiwan_public_holidays_map = {"元旦":"New year", "農曆除夕":"Chinese New Year's eve", "農曆新年":"Chinese New Year",
                            "春节":"Chinese New Year","228和平紀念日":"228 Peace Memorial Day", 
                            "和平紀念日":"228 Peace Memorial Day","婦女節":"Combination of Women's Day and Children's Day",
                            "兒童節合倂":"Combination of Women's Day and Children's Day","清明节":"Qingming Festival",
                            "淸明節":"Qingming Festival","端午節":"Dragon Boat Festival","端午节":"Dragon Boat Festival",
                            "中秋節":"Mid-Autumn Festival","中秋国庆":"Mid-Autumn Festival",
                            "雙十節":"National Day/Double Tenth Day","國慶日":"National Day/Double Tenth Day"}

taiwan_public_holidays = dict(zh_cal.holidays())
us_public_holidays = dict(us_cal.holidays())

def get_zh_holiday(holiday_name_zh):
    """ Lookup the date representation for a given Chinese holiday string
    Args:
        holiday_name_zh (str): Chinese holiday string
    
    Return
        result (dict): datetime.date:str
    """
    
    # Default to an empty dict so unknown names do not raise UnboundLocalError
    result = {}
    if holiday_name_zh in taiwan_public_holidays_map:
        result = {k:v for k,v in taiwan_public_holidays.items() if v == taiwan_public_holidays_map[holiday_name_zh]}
    return result

In [14]:
def get_relative_dates(page_text):
    """ Attempt to parse relative dates from a given page.
    
    The logic is as follows: we first keep only tags that are at least
    two characters long in order to reduce the search space. Some holidays
    are represented by two characters, so we check every remaining tag
    against our defined holiday mapping.
    
    As a secondary step, we query the wikipedia api for any time word or
    English word of 3 or more characters in order to identify additional
    dates.
    
    Args:
        page_text (str): Wiki text string.
    
    Return
        result (dict): holiday or page name:ISO8601 date string
    """
    
    # topK refers to the number of tags to return, based on tf-idf
    tags = jieba.analyse.extract_tags(page_text, topK=500)
    filtered_tags = [tag for tag in tags if not tag.isnumeric() and len(tag) >= 2]
    filtered_tag_set = set(filtered_tags)
    
    # We need to format the tags as a string for the secondary pass
    filtered_string = ", ".join(filtered_tags)
    
    fixed_holidays = [get_zh_holiday(tag) for tag in filtered_tag_set if tag in taiwan_public_holidays_map]
    # Skip empty lookups (no holiday with that label in the calendar for this year)
    result = {str(list(item.values())[0]):str(list(item.keys())[0]) for item in fixed_holidays if item}
    
    secondary_tags = set(pseg.cut(filtered_string))
    for tag in secondary_tags:
        # Flag refers to the part of speech and t refers to time words
        if len(tag.word) > 2 and tag.word not in taiwan_public_holidays_map and tag.flag in ["t", "eng"]:
            if tag.flag == "t":
                res = parse_wiki_info_box(tag.word, "zh", "asian_date")
            elif tag.flag == "eng":
                res = parse_wiki_info_box(tag.word, "en", "text_month")
            if res:
                result[str(list(res.keys())[0])] = str(list(res.values())[0])
    return result

def parse_wiki_info_box(query, lang, date_format):
    """ Parses the date from the info box of a given wikipedia page.
    Accepts a search query and checks for page existence, returning a parsed
    date field from the info box of said page.
    
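    Args:
        query (str): Search term
        lang (str): Language to set the wikipedia API to.
        date_format (str): Regex family applied to the info box date,
            either "text_month" or "asian_date".
    
    Return
        result (dict or None): page label:ISO8601 date string, or None
            if no date could be parsed.
    """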
    
    wikipedia.set_lang(lang)
    search_results = wikipedia.search(query)
    if not search_results:
        # Nothing to look up
        return None
    top_result = wptools.page(search_results[0], silent=True, lang=lang).get()
    
    # Remains None unless a date is parsed from the info box below
    formatted_date = None
    
    try:
        info_box_date = top_result.infobox["date"]
        date_label = top_result.label
        
        if date_format == "text_month":
            
            regex_results = list(cmp_text_month.finditer(info_box_date))
            if regex_results:
                # The last candidate match decides the result
                for item in regex_results:
                    parsed = cmp_text_month_cleanup.search(str(item.group()))
                
                if parsed:
                    year = parsed.group("year") if parsed.group("year") else datetime.date.today().year
                    month = datetime.datetime.strptime(parsed.group("month"),"%B").month if len(parsed.group("month")) > 3 else datetime.datetime.strptime(parsed.group("month"),"%b").month
                    day = parsed.group("day") if parsed.group("day") else None
                    formatted_date = prep_iso8601(year=year, month=month, day=day)
        
        elif date_format == "asian_date":
            
            regex_results = list(cmp_asian_dates.finditer(info_box_date))
            if regex_results:
                calendar = regex_results[0].group("calendar")
                year = regex_results[0].group("year")
                month = regex_results[0].group("month")
                day = regex_results[0].group("day")

                # Remove chinese character suffix
                if year:
                    year = year[:-1]
                if month:
                    month = month[:-1]
                if day:
                    day = day[:-1]

                if calendar in japanese_era_lookup or calendar in minguo_set:
                    year = convert_calendar_system(year, calendar)
                
                if year is None:
                    year = datetime.date.today().year
                formatted_date = prep_iso8601(year=year, month=month, day=day)
    except Exception:
        # Missing info box, missing "date" field or an unparsable value
        return None
    
    # No regex matched the info box date
    if formatted_date is None:
        return None
    return {date_label:formatted_date}

#### Advanced Results

In [15]:
wikipedia.set_lang(lang)
page_text2 = wikipedia.page("端午節_(華人)").content

In [16]:
get_relative_dates(page_text2)

Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 1.140 seconds.
Prefix dict has been built succesfully.


{'Dragon Boat Festival': '2017-05-30'}

In [17]:
get_relative_dates("christmas")

{'Christmas': '2017-12-25'}

In [29]:
get_relative_dates("中華民國國慶日")

{'National Day/Double Tenth Day': '2017-10-10'}

In [18]:
import spacy
import en_core_web_sm
nlp = en_core_web_sm.load()