# Wikipedia - this day in history <small>(step 1 - get day data and save it to database)</small>
---
**Goal:** get events, births, deaths, and holidays_and_observances for each day of the year (using Wikipedia API)  

**Notes:**  
- this notebook is for the first step of this project. It consists of:
    - using the Wikipedia API to get the events for each day of the year
    - parsing the Wikipedia data, cleaning it up, and getting it in the right format needed for the SQLite database created for this project
- the notebook for the second step is [TADS_wikipedia_tdih_main_step_02_get_link_data_07feb21](https://github.com/Bianca-Aguglia/TADS_wikipedia_this_day_in_history_build_dataset/tree/master/notebooks)
- the notebook for the third step is [TADS_wikipedia_tdih_main_step_03_get_image_data_10feb21](https://github.com/Bianca-Aguglia/TADS_wikipedia_this_day_in_history_build_dataset/tree/master/notebooks)
- to get data for a specific day, run get_and_process_day_data(day_name). It returns a dictionary with day data.

In [1]:
import os
import requests
import re
import sqlite3
import config
import datetime as dt
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup

In [2]:
pd.options.display.max_rows = None
pd.options.display.max_columns = None

In [3]:
EVENT_TYPES = ['Events', 'Births', 'Deaths', 'Holidays_and_observances']

URL = config.WIKIPEDIA_URL
HEADERS = config.HEADERS
PARAMS = {'format': 'json',
         'action': 'parse'}

DATABASE_FILE = config.DATABASE_FILE
DOE = config.DOE

**Notes about the data:**
- processing the data
    - I decided to process the data for the entire page
    - the alternative is a multi-step process:
        1. get the data for the page
        2. find the section number for each type of event
        3. find possible subsections for each section (see January_1 page)
        4. make separate API calls for each section or subsection
        5. process the data for each section or subsection
    - processing the data for the whole page reduces the number of API calls by a factor of 4 or more (since there are four types of events, and some types might have subsections)
- most days have data in the following format:
    - for 'Events', 'Births', 'Deaths':
        - { year_link } - { link_to_event or link_to_person } some text
    - for 'Holidays_and_observances':
        - { holiday_link } some text
- edge cases I found:
    - 'Events' data is separated into sections based on different calendars (see January_1 page)
    - 'Deaths' data has nested lists (see February_29 page)
    - 'Holidays_and_observances' has more than one level deep nested lists (see January_1 page)
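
To make the trade-off in the processing notes concrete, here is a minimal sketch that only builds the request URLs for the two approaches (no request is sent). `API_URL`, the two helper names, and the section numbers are assumptions for illustration; the notebook itself reads its URL from `config.WIKIPEDIA_URL`. The `page`, `prop`, and `section` parameters belong to the MediaWiki `action=parse` module.

```python
from urllib.parse import urlencode

# assumed endpoint, for illustration only (the notebook uses config.WIKIPEDIA_URL)
API_URL = 'https://en.wikipedia.org/w/api.php'
BASE_PARAMS = {'format': 'json', 'action': 'parse'}


def whole_page_urls(day_name):
    """One call: parse the whole page and split it into sections locally."""
    return [f"{API_URL}?{urlencode({**BASE_PARAMS, 'page': day_name})}"]


def per_section_urls(day_name, section_numbers):
    """One call to list the sections, then one call per section or subsection."""
    urls = [f"{API_URL}?{urlencode({**BASE_PARAMS, 'page': day_name, 'prop': 'sections'})}"]
    for number in section_numbers:
        section_params = {**BASE_PARAMS, 'page': day_name, 'section': number}
        urls.append(f"{API_URL}?{urlencode(section_params)}")
    return urls


# hypothetical section numbers for the four event types
print(len(whole_page_urls('January_1')))                 # 1
print(len(per_section_urls('January_1', [1, 2, 3, 4])))  # 5
```

With subsections (as on the January_1 page) the per-section count grows further, which is why the whole page is processed in one call.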

In [1]:
%%html
<div style='box-sizing: border-box; margin: 0; padding: 0; display: flex; justify-content: space-between'>
    <img src='../../images/img_ss_wikipedia_edge_case_01_for_jan_1_19jan21.jpg' width='300px'/>
    <img src='../../images/img_ss_wikipedia_edge_case_02_for_jan_1_19jan21.jpg' width='300px'/>
    <img src='../../images/img_ss_wikipedia_edge_case_01_for_feb_29_19jan21.jpg' width='300px'/>
</div>                       

### Process flow: <small>(with checkmarks for steps done in this notebook)</small>
- [x] get day data
    - [x] get json day data from wikipedia API
    - [x] extract data for 'events', 'births', 'deaths'
    - [x] extract data for 'holidays_and_observances'
    - [x] save data to wikipedia_tdih.db
- get link data
- get image data

In [4]:
def day_data_main(day_name = 'January_1', url = URL, event_types = EVENT_TYPES, database_file = DATABASE_FILE, 
                  doe = DOE, print_status = True):
    """
    Main function for requesting the data for a specific day, processing it, and saving it to database
    wikipedia_tdih.db.
    It uses two helper functions:
        - get_and_process_day_data
        - add_day_data_to_db
        
    Params:
        day_name: the name of the day to be processed
                  (needs to be a string in the format Month_dd (eg. 'January_1'))
        url: url for the wikipedia API
        event_types: list of events to request and process data for
                     (event_types common to all days are:
                         - Events
                         - Births
                         - Deaths
                         - Holidays_and_observances)
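        database_file: database file to save the data to
        doe: date of entry logged with each database update
        print_status: if True, print the day name, the request status code, and progress messages
        
    Returns:
        day_data_dict: dictionary with the processed day data (see get_and_process_day_data)
        update_results: list of dictionaries with the status of each table update (see add_day_data_to_db)
    """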
    # get day data from Wikipedia API and process it as needed for the database
    day_data_dict = get_and_process_day_data(day_name, url, event_types)
    
    if print_status:
        print(day_name)
        print('--------')
        print('status_code: ',day_data_dict['status_code'])
        print('adding data to db')
    
    # add day_data to database
    update_results = add_day_data_to_db(day_data_dict, database_file, doe)
    
    if print_status:
        print('data added\n')

    return day_data_dict, update_results    
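
Since the goal is to collect data for each day of the year, the day names have to be generated in the `Month_d` format that `day_data_main` expects. The following is a minimal sketch of one way to do that; `all_day_names` is a hypothetical helper (it is not part of the notebook), and it assumes the default English month names from `calendar.month_name`.

```python
import calendar


def all_day_names(leap_year = 2020):
    """Build 'Month_d' names for every day of the year, including February_29."""
    day_names = []
    for month in range(1, 13):
        # monthrange returns (weekday_of_first_day, number_of_days_in_month)
        days_in_month = calendar.monthrange(leap_year, month)[1]
        for day in range(1, days_in_month + 1):
            # month_name is locale dependent; English names are assumed here
            day_names.append(f'{calendar.month_name[month]}_{day}')
    return day_names


day_names = all_day_names()
print(len(day_names), day_names[0], day_names[-1])   # 366 January_1 December_31

# hypothetical driver loop (needs the notebook's functions, config, and network access):
# for day_name in day_names:
#     day_data_main(day_name)
```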

### Helper functions for day_data_main

#### Section 1 - get and process day data

In [5]:
def get_and_process_day_data(day_name, url = URL, event_types = EVENT_TYPES):
    """
    Main function for requesting and processing the data for a specific day.
    It uses several helper functions to break down the process into simple steps:
        - get_day_data
        - process_day_soup
    
    Params:
        day_name: the name of the day to be processed
                  (needs to be a string in the format Month_dd (eg. 'January_1'))
        url: url for the wikipedia API
        event_types: list of events to request and process data for
                     (event_types common to all days are:
                         - Events
                         - Births
                         - Deaths
                         - Holidays_and_observances)
        
    Returns:
        day_data_dict: a dictionary with day_data
                       eg. {'day': 1,
                            'month': 1,
                            'status_code': status_code,
                            'day_soup': day_soup,
                            'events': events_list_items,
                            'births': births_list_items,
                            'deaths': deaths_list_items,
                            'holidays_and_observances': holidays_list_items,
                            'no_event_for': no_event_for_list}
    """
    # request day_data
    status_code, day_soup = get_day_data(day_name = day_name, url = url)
    
    # check that status_code is OK
    if status_code == requests.codes.ok:
        # call process_day_soup to process day_soup and get the starting day_data_dict
        day_data_dict = process_day_soup(day_soup, event_types)
    else:
        day_data_dict = {'events': np.nan,
                         'births': np.nan,
                         'deaths': np.nan,
                         'holidays_and_observances': np.nan,
                         'no_event_for': np.nan}
        
    # add 'day', 'month', 'status_code', and 'day_soup' data to the day_data_dict  
    # (adding '_2020' to day_name to avoid ValueError caused by 'February_29')
    day_data_dict['day'] = dt.datetime.strptime(day_name + '_2020', '%B_%d_%Y').day
    day_data_dict['month'] = dt.datetime.strptime(day_name + '_2020', '%B_%d_%Y').month
    day_data_dict['status_code'] = status_code
    day_data_dict['day_soup'] = day_soup
            
    return day_data_dict
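
The `'_2020'` suffix used for the `day` and `month` fields deserves a short illustration. This is a minimal sketch of the same idea in isolation; `day_and_month` is a hypothetical helper name, and the `ValueError` branch reflects the behavior of current Python versions, where a missing year defaults to 1900 (not a leap year).

```python
import datetime as dt


def day_and_month(day_name, leap_year = 2020):
    """Return (day, month) for a 'Month_d' name, parsing it with a known leap year."""
    parsed = dt.datetime.strptime(f'{day_name}_{leap_year}', '%B_%d_%Y')
    return parsed.day, parsed.month


print(day_and_month('February_29'))   # (29, 2)

# without a year, strptime falls back to 1900, so February 29 is rejected
try:
    dt.datetime.strptime('February_29', '%B_%d')
except ValueError as error:
    print('ValueError:', error)
```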

In [6]:
def get_day_data(day_name = 'January_1', url = URL, headers=HEADERS, params = PARAMS):
    """
    Use requests to get day_data from wikipedia API.
    
    Params:
        day_name: needs to be in the format Month_dd (eg. 'January_1')
        url: first part of the Wikipedia API url for each day of the year
        headers: dictionary of headers data such as user_agent, etc
        params: dictionary of params used in requests.get, such as 'format', 'action', etc
        
    Returns:
        status_code: status code of the request
        day_text_soup: BeautifulSoup object of the response text
        
    """
    # copy params so the shared PARAMS default is not mutated between calls
    params = {**params, 'page': day_name}
    url = f'{url}{day_name}'
    result = requests.get(url, headers=headers, params=params)
    
    if result.status_code == requests.codes.ok:
        # result is json file with one key ('parse') whose value is a nested dictionary
        # data is in result['parse']['text']['*']
        result_text = result.json()['parse']['text']['*']
        day_text_soup = BeautifulSoup(result_text, 'html.parser')
    else:
        day_text_soup = ''
        
    return result.status_code, day_text_soup
    

In [7]:
def process_day_soup(day_soup, event_types = EVENT_TYPES):
    """
    Take the raw data for a specific day, clean it up, and format it as needed for the database.
    Use two helper functions to break the process into simple steps:
        - extract_and_process_holiday_data
        - extract_and_process_event_data
    
    Params:
        day_soup: BeautifulSoup object representing the raw data for a specific day
        event_types: list of event_types for which to extract event items 
        
    Returns:
        day_data_dict: dictionary with day_data
                       e.g. {'events': events_list_items,
                             'births': births_list_items,
                             'deaths': deaths_list_items,
                             'holidays_and_observances': holidays_list_items,
                             'no_event_for': no_event_for_list}
    """
    # initialize day_data_dict
    day_data_dict = {}
    
    # keep track of event types that don't have event items
    # this list is expected to stay empty
    no_event_for = []
    day_data_dict['no_event_for'] = no_event_for
    
    for event_type in event_types:
        # find section_span
        # events are in a <ul> element preceded by an <h2> with two <span> children
        # the first <span> child has an id equal to the event_type (eg. <span class="mw-headline" id="Events">)
        section_span = day_soup.find(id = event_type)
        # if there are no events for the event_type
        if not section_span:
            # add the event_type for which there are no events to the list no_event_for
            no_event_for.append(event_type)
            day_data_dict['no_event_for'] = no_event_for
            # assign np.nan to day_data_dict[event_type]
            day_data_dict[event_type.lower()] = np.nan
            continue

        # if there are events for event_type
        section_h2 = section_span.find_parent()

        # event_type 'holiday' items have different data format than 'events', 'deaths', 'births'
        # need to process differently (they don't have year data)
        if 'holiday' in event_type.lower():
            day_data_dict[event_type.lower()] = extract_and_process_holiday_data(section_h2)

        else:
            day_data_dict[event_type.lower()] = extract_and_process_event_data(section_h2)
            
    return day_data_dict

**Sample html for events section on wikipedia page.**  
  
![sample_section_html](../../images/img_ss_wikipedia_html_explore_events_section_14feb21.jpg)   

In [8]:
def extract_and_process_holiday_data(section_h2, event_category = np.nan):
    """
    Process holiday data.
    
    Params:
        section_h2: a BeautifulSoup object (<h2> element) marking the start of event_type 'holiday'
        event_category: indicates if event is 'Christian feast day' or not
        
    Returns:
        holiday_list: list of dictionaries (each holiday item is a dictionary)  
                      eg. {'year': np.nan, # holidays don't have years
                           'event_description': ,
                           'event_links_list': ,
                           'event_links_text': ,
                           'event_first_link': ,
                           'event_category': ,
                           'event_contents_list': (optional)
                          }
    
    """
    holiday_list = []
    
    # section_h2 has one or more <ul> siblings
    sibling = section_h2.next_sibling
    
    # another <h2> sibling indicates the current section is over and another section starts
    # stop there, or when there are no more siblings (sibling is None at the end of the document)
    while sibling is not None and sibling.name != 'h2':
        
        # if sibling is empty string or a new line character ('\n'), continue to next sibling
        if not str(sibling).strip():
            sibling = sibling.next_sibling
            continue
    
        # check sibling is indeed ul_element and loop through its <li> children (they contain holiday data)
        if sibling.name == 'ul':
            # loop through the <li> items
            for child in sibling.children:
                
                # check if child is empty string or new line character; continue to next child if yes
                if not str(child).strip():
                    continue
                    
                # check if child has sub lists
                if child.find('ul'):
                    # check if child is list of 'Christian feast day' elements
                    if is_christian_feast_day(child):
                        # change event_category
                        # don't add child data to holiday_list (even though it's an <li> item, it is just a category name)
                        event_category = 'christian feast day'

                    else:
                        # add child to holiday_list
                        child_data_dict = extract_single_li_event_data(child,
                                                                       get_li_element_contents = False,
                                                                       event_category = event_category)
                        holiday_list.append(child_data_dict)

                    # get the data for sub_items and add it to holiday_list
                    sub_items_data_list = get_sub_items_data(child, event_category = event_category)
                    holiday_list.extend(sub_items_data_list)

                    # re-set event_category to np.nan
                    event_category = np.nan
                
                else:
                    # child does not have nested lists
                    # add child data to holiday_list
                    child_data_dict = extract_single_li_event_data(child,
                                                                   get_li_element_contents = False,
                                                                   event_category = event_category)
                    holiday_list.append(child_data_dict)
                    
        # advance on every iteration; otherwise a non-empty sibling that is not a <ul> would loop forever
        sibling = sibling.next_sibling
                
    return holiday_list

In [9]:
def extract_and_process_event_data(section_h2):
    """
    Extract and process event_data for event of type 'Events', 'Births', or 'Deaths'
    
    Params:
        section_h2: a BeautifulSoup object (<h2> element) marking the start of event_type 'Events', 'Births', or 'Deaths'
        
    Note:
        event_category is not a parameter; it is set inside the function when events are grouped by calendar
        (eg. Pre-Julian calendar, Julian calendar, Gregorian calendar, etc)
        (see wikipedia page for January_1)
        
    Returns:
        event_list: list of dictionaries (each event item is a dictionary called child_data_dict) 
                    eg. child_data_dict = {'year': ,
                                           'event_description': ,
                                           'event_links_list': ,
                                           'event_links_text': ,
                                           'event_first_link': ,
                                           'event_category': ,
                                           'event_contents_list': (optional)
                          }
    """
    event_list = []
    
    # only a small number of events is grouped in sub_lists based on the type of calendar used for their date
    # the event_category is changed if section_h2 has <h3> elements (they contain calendar data)
    event_category = np.nan
    
    # section_h2 usually has just one <ul> sibling (except days like January_1, which is divided in 
    # three parts based on the type of calendar used)
    sibling = section_h2.next_sibling
    
    while sibling is not None and sibling.name != 'h2':
        
        # if sibling is empty string or a new line character ('\n'), continue to next sibling
        if not str(sibling).strip():
            sibling = sibling.next_sibling
            continue
            
        # if sibling is an <h3> element, it contains the name of the calendar under which several events are grouped
        # (eg. Julian Calendar, Gregorian Calendar, etc)
        # (see wikipedia page for January_1)
        if sibling.name == 'h3':
            event_category = sibling.text.replace('[edit]', '') # '[edit]' is included in text of <h3> elements
            sibling = sibling.next_sibling
            continue
            
        if sibling.name == 'ul': # this is the list that has the <li> events
            # each <li> should hold one event and have no sub_lists
            # in some instances (see February_29), the year is in <li> element and the actual events are 
            # sub_items of the year (according to the Wikipedia page, this needs to be fixed)
            for child in sibling.children:
                
                # if child is empty string or a new line character ('\n') continue to next child
                if not str(child).strip():
                    continue
                    
                # check if child has sub_lists
                if child.find_all('li'):
                    # the child.text is the year under which several events are grouped
                    # use regex to eliminate '-' or empty spaces
                    year, bc_ad, bc_ad_note = get_year(child.text)
                    for li in child.find_all('li'):
                        child_data_dict = extract_single_li_event_data(li, event_category = event_category, year = year)
                        child_data_dict['bc_ad'] = bc_ad
                        child_data_dict['bc_ad_note'] = 'nested year ' + bc_ad_note
                        event_list.append(child_data_dict)
                    continue

                    
                child_data_dict = extract_single_li_event_data(child, event_category = event_category)
                # find the year (using the first or second element in child.contents)
                # most of the time year data is in the first element
                # exception: style data is in the first element (see January_1, years 193, 404, and 417)
                # slicing avoids an IndexError when the <li> has a single content element
                year_element = [str(content) for content in child.contents[:2]]
                year, bc_ad, bc_ad_note = get_year(year_element[0])
                if (not year or year == '0') and len(year_element) > 1:
                    year, bc_ad, bc_ad_note = get_year(year_element[1], bc_ad_note = 'content_2_')
                # add year data to child_data_dict
                child_data_dict['year'] = year
                child_data_dict['bc_ad'] = bc_ad
                child_data_dict['bc_ad_note'] = bc_ad_note
                event_list.append(child_data_dict)
                    
        sibling = sibling.next_sibling
        
    return event_list

In [10]:
def get_year(year_element, bc_ad_note = ''):
    """
    Get the year from the year_element of an event
    
    Params:
        year_element: the first (or second) item in the contents of an <li> element for an event 
                      (passed in from extract_and_process_event_data)
        bc_ad_note: prefix for the returned note; defaults to ''
                    it is 'content_2_' when the year is searched for in the second element of an <li> event element
                    (helps identify exception <li> elements; see wikipedia page for January_1, where
                     years 193, 404, and 417 have style data in their first element)
        
    Returns:
        year: the year as a string of 1 to 4 digits ('' if no year is found)
        bc_ad: 'bc' or 'ad' ('' if no year is found)
        bc_ad_note: indicates if bc_ad info was included in wikipedia link, or it is assumed to be 'ad'
    """
    year = ''
    bc_ad = ''

    # pattern: wiki/ad_100 or wiki/ad100 (same for bc)
    year_pattern_1 = re.compile(r'wiki/(bc|ad)_{0,1}(\d{1,4})', re.IGNORECASE)
    # pattern: wiki/100_ad or wiki/100ad (same for bc)
    year_pattern_2 = re.compile(r'wiki/(\d{1,4})_{0,1}(bc|ad)', re.IGNORECASE)
    # pattern: wiki/100
    year_pattern_3 = re.compile(r'wiki/(\d{1,4})')
    # pattern: 1919 -  (note: if one year has several events, only the first year occurrence has link to wikipedia year page)
    year_pattern_4 = re.compile(r'\b(\d{1,4})\b')
    
    result = year_pattern_1.search(year_element)
    if result:  
        bc_ad = result.groups()[0].lower()
        year = result.groups()[1]
        bc_ad_note += 'included'
        return year, bc_ad, bc_ad_note

    result = year_pattern_2.search(year_element)
    if result:  
        bc_ad = result.groups()[1].lower()
        year = result.groups()[0]
        bc_ad_note += 'included'
        return year, bc_ad, bc_ad_note
    
    result = year_pattern_3.search(year_element)
    if result:  
        bc_ad = 'ad'
        year = result.groups()[0]
        bc_ad_note += 'assumed'
        return year, bc_ad, bc_ad_note
    
    result = year_pattern_4.search(year_element)
    if result:  
        bc_ad = 'ad'
        year = result.groups()[0]
        bc_ad_note += 'assumed'
        return year, bc_ad, bc_ad_note

    return year, bc_ad, bc_ad_note
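
The order of the four patterns matters: the era-specific ones have to be tried before the bare-number ones. Here is a condensed sketch of the same cascade, run on made-up `<li>` fragments. `sketch_get_year` and the sample strings are illustrative assumptions, not part of the notebook, and the sketch leaves out the `bc_ad_note` prefix handling.

```python
import re

# the same four patterns as get_year, tried in order (most specific first)
YEAR_PATTERNS = [
    (re.compile(r'wiki/(?P<era>bc|ad)_?(?P<year>\d{1,4})', re.IGNORECASE), 'included'),
    (re.compile(r'wiki/(?P<year>\d{1,4})_?(?P<era>bc|ad)', re.IGNORECASE), 'included'),
    (re.compile(r'wiki/(?P<year>\d{1,4})'), 'assumed'),
    (re.compile(r'\b(?P<year>\d{1,4})\b'), 'assumed'),
]


def sketch_get_year(year_element):
    """Return (year, bc_ad, note) from the first pattern that matches."""
    for pattern, note in YEAR_PATTERNS:
        match = pattern.search(year_element)
        if match:
            # patterns without an era group fall back to 'ad'
            era = match.groupdict().get('era') or 'ad'
            return match.group('year'), era.lower(), note
    return '', '', ''


# made-up fragments in the shape of Wikipedia year links
print(sketch_get_year('<a href="/wiki/AD_193" title="AD 193">193</a>'))   # ('193', 'ad', 'included')
print(sketch_get_year('<a href="/wiki/45_BC" title="45 BC">45 BC</a>'))   # ('45', 'bc', 'included')
print(sketch_get_year('<a href="/wiki/1801" title="1801">1801</a>'))      # ('1801', 'ad', 'assumed')
print(sketch_get_year('1919 – '))                                         # ('1919', 'ad', 'assumed')
```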

In [11]:
def get_first_link(element_links_list):
    """
    Get the first non-year link (i.e. it doesn't link to the year of the event, but to something in the event)
    
    Params:
        element_links_list: list of links in the element (extracted with function extract_links_data)

    Returns:
        first_link
    """
    # match patterns like: wiki/AD_100, wiki/AD100, wiki/100, wiki/100_AD, wiki/100AD (for AD or BC)
    # in most cases these are links to wikipedia page for a specific year
    event_year = re.compile(r'(wiki/ad_*\d{1,4})|(wiki/bc_*\d{1,4})|(wiki/\d{1,4})|(wiki/\d{1,4}_*ad)|(wiki/\d{1,4}_*bc)', re.IGNORECASE)
    # return the first link that is not a year link ('' if every link is a year link)
    for link in element_links_list:
        if not event_year.search(link):
            return link
    return ''
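
One limitation of the year-link pattern is worth seeing on an example: because it is not anchored, any article whose title starts with digits is treated as a year link. The sketch below compares the notebook's pattern with a stricter, anchored alternative. `STRICT_YEAR`, `first_non_year_link`, and the sample hrefs are assumptions for illustration, not the notebook's code.

```python
import re

# the notebook's pattern: unanchored, so 'wiki/2001:_A_Space_Odyssey' counts as a year link
LOOSE_YEAR = re.compile(
    r'(wiki/ad_*\d{1,4})|(wiki/bc_*\d{1,4})|(wiki/\d{1,4})|(wiki/\d{1,4}_*ad)|(wiki/\d{1,4}_*bc)',
    re.IGNORECASE)

# hypothetical stricter pattern: the whole page title must be a year (with optional AD/BC)
STRICT_YEAR = re.compile(r'/wiki/(?:(?:ad|bc)_?\d{1,4}|\d{1,4}(?:_?(?:ad|bc))?)$', re.IGNORECASE)


def first_non_year_link(links, year_pattern):
    """Return the first link that does not match year_pattern ('' if none)."""
    for link in links:
        if not year_pattern.search(link):
            return link
    return ''


# made-up hrefs for a single event item
links = ['/wiki/1968', '/wiki/2001:_A_Space_Odyssey', '/wiki/Stanley_Kubrick']
print(first_non_year_link(links, LOOSE_YEAR))    # /wiki/Stanley_Kubrick
print(first_non_year_link(links, STRICT_YEAR))   # /wiki/2001:_A_Space_Odyssey
```

Whether the stricter pattern is worth adopting depends on how often such titles appear in the data; the sketch only shows where the two patterns differ.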
    

In [37]:
def extract_links_data(li_element, exclude_reference_links = True):
    """
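    Get element_links_list, element_links_text, and first_link for li_element.
    
    Params:
        li_element: <li> element to extract data for
        exclude_reference_links: if True, skip links inside <sup> tags (they link to references)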
        
    Returns:
        element_links_list: a list of links in the element (exclude links in <sup> tags - they link to references)
        element_links_text: list of text of link elements (can be used to determine subject of sentence )
        first_link: the first link that is not a link to a year
    """
    element_links_list = []
    element_links_text = []
    first_link = ''
    
    for element in li_element.find_all('a'):
        # skip links inside <sup> tags (reference links) when requested
        if exclude_reference_links and element.parent.name == 'sup':
            continue
        # append the element's href and text to the two lists
        element_links_list.append(element['href'])
        element_links_text.append(element.text.strip())
            
    # get the first link that is not a year link
    first_link = get_first_link(element_links_list)
    
    return element_links_list, element_links_text, first_link

In [13]:
def extract_single_li_event_data(li_element, get_li_element_contents = False, 
                                 event_category = np.nan, year = np.nan):
    """
    Extract data for a single <li> element representing an event (regardless of the event_type, or whether the 
    <li> element is a sub_item of a different <li> element, etc)
    
    Params:
        li_element: <li> element (BeautifulSoup object) to extract data from
        get_li_element_contents: True only if I need to get li_element contents to troubleshoot missing li_element data
                                 (otherwise seems unnecessary - waste of memory)
        event_category: used to mark 'Christian feast day' holidays or events grouped by calendar type (see January_1)
        year: applies to 'Events', 'Births', 'Deaths', not 'Holidays_and_observances'
              defaults to np.nan
              if needed, it is passed from extract_and_process_event_data, or assigned within that function
        
    Returns:
        li_element_dict: dictionary of fields common to all event elements (regardless of event_type, etc)
                         eg. {'year': ,
                              'event_description': ,
                              'event_links_list': ,
                              'event_links_text': ,
                              'event_first_link': ,
                              'event_category': ,
                              'event_contents_list': (optional)
                              }
    """
    li_element_dict = {}

    # get li_element data and build li_element_dict
    li_element_dict['year'] = year
    li_element_dict['event_description'] = get_event_description(li_element.text)
    
    try:        
        (li_element_dict['event_links_list'], 
         li_element_dict['event_links_text'],
         li_element_dict['event_first_link']) = extract_links_data(li_element)
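    except Exception:
        # expected when li_element is a NavigableString (plain text), which has no links to extract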
        li_element_dict['event_links_list'] = 'NavigableString'
        li_element_dict['event_links_text'] = 'NavigableString'
        li_element_dict['event_first_link'] = 'NavigableString'
    
    li_element_dict['event_category'] = event_category
    
    
    # seems unnecessary, but get li_element contents in case I need to troubleshoot missing li_element data
    if get_li_element_contents:
        li_element_dict['event_contents_list'] = li_element.contents
        
    return li_element_dict

In [14]:
# changed from get_sub_items to get_sub_items_data
def get_sub_items_data(child, event_category = np.nan):
    """
    Get data of sub_items of child event element.
    Loops through sub_items, uses function extract_single_li_event_data to extract sub_item data.
    
    Params:
        child: <li> element (BeautifulSoup object) whose nested <li> sub_items are extracted
        event_category: indicates if item is a 'Christian feast day' event type
        
    Returns:
        sub_items_data_list: list of dictionaries
                             each dictionary has fields common to all event elements
                             eg. {'year': np.nan,
                                  'event_description': ,
                                  'event_links_list': ,
                                  'event_links_text': ,
                                  'event_first_link': ,
                                  'event_category': ,
                                  'event_contents_list': (optional)
                                  }
    """
    sub_items_data_list = []
    
    for li in child.find_all('li'):
        sub_item_dict = extract_single_li_event_data(li, event_category = event_category)
        sub_items_data_list.append(sub_item_dict)
            
    return sub_items_data_list

    

In [15]:
def is_christian_feast_day(ul_element):
    """
    Check if an element is a list of 'Christian feast day' holidays
    
    Params:
        ul_element: element to check (called with the <li> item that holds the nested <ul> list)
        
    Returns:
        True or False
    """
    # get the first line of the ul_element
    first_line = ul_element.text.split('\n')[0].lower()
    
    # the first line is the category name (it contains 'christian', 'feast', and 'day')
    return all(word in first_line for word in ('christian', 'feast', 'day'))
    

In [16]:
def get_event_description(li_element_text):
    """
    Get the description of an event using the event's li_element_text.
    Eliminate the year data from the event description (it is already captured in a field called 'year')
    
    Params:
        li_element_text: a string containing the text of the li_element
        
    Returns:
        li_element_description: the li_element description stripped of year data
    """
    # strip li_element_text of any white spaces
    li_element_text = li_element_text.strip()
    
    # compile year_pattern (year is at the beginning of li_element_text)
    # \W character is for matching - and –
    year_pattern = re.compile(r'^\d{1,4}\s{0,2}\W\s{0,2}')
    
    # if year_pattern in li_element_text, replace with ''
    li_element_description = year_pattern.sub('', li_element_text)
    
    return li_element_description
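
A quick check of the year-stripping pattern on a few made-up item texts shows both what it handles and one case it does not: for a `45 BC – ...` item, only the digits and the following space are removed, so `BC –` stays in the description. `ERA_AWARE_PREFIX` below is a hypothetical variant that also drops the era marker; it is a sketch, not the notebook's pattern.

```python
import re

# the pattern used in get_event_description
YEAR_PREFIX = re.compile(r'^\d{1,4}\s{0,2}\W\s{0,2}')

# hypothetical variant that also removes an era marker after the year
ERA_AWARE_PREFIX = re.compile(r'^\d{1,4}\s{0,2}(?:(?:BC|AD|BCE|CE)\s{0,2})?\W\s{0,2}')

# made-up item texts in the shape of Wikipedia list items
samples = ['1801 – The legislative union of Great Britain and Ireland is completed.',
           '1919 - A year followed by a plain hyphen.',
           '45 BC – The Julian calendar takes effect.']

for text in samples:
    print(YEAR_PREFIX.sub('', text))
# The legislative union of Great Britain and Ireland is completed.
# A year followed by a plain hyphen.
# BC – The Julian calendar takes effect.

print(ERA_AWARE_PREFIX.sub('', samples[2]))   # The Julian calendar takes effect.
```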

#### Section 2 - add day data to database
Tables affected:
- wiki_day_data_log
- wiki_link
- wiki_event
- wiki_link_usage

In [17]:
def add_day_data_to_db(day_data_dict, database_file = DATABASE_FILE, doe = DOE):
    """
    Main function for adding a day's data to wikipedia_tdih.db.
    It takes the day data retrieved and cleaned up with get_and_process_day_data and uses it to update
    the following tables:
        - wiki_day_data_log
        - wiki_link (if link not already in wiki_link)
        - wiki_event        
        - wiki_link_usage
    
    Params:
        day_data_dict: dictionary with day data (created with get_and_process_day_data)
        database_file: database to save data to
        doe: date of entry (defaults to current day)
        
    Returns:
        update_results: a list of dictionaries with the status of each table update
                        e.g. {'doe': doe,
                              'wiki_table_name': 'wiki_event',
                              'update_status': 'update_complete',
                              'update_note': 'events'}
    """
    update_results = []
    
    # make smaller dictionaries based on which table data belongs in
    # data for the wiki_day_data_log
    day_data = {key: day_data_dict[key] for key in ['status_code', 'day_soup', 'no_event_for']}

    # data for wiki_event (and, if needed, wiki_link)
    event_data = {key: day_data_dict.get(key, []) for key in ['events', 'births', 'deaths',
                                                            'holidays_and_observances']}

    
    # get day and month for the data being logged
    day = day_data_dict['day']
    month = day_data_dict['month']
    
    with sqlite3.connect(database_file) as conn:
        c = conn.cursor()
        
        # log data into wiki_day_data_log
        update_wiki_day_data_log = update_table_wiki_day_data_log(day_data, day, month, c, doe)
        update_results.append(update_wiki_day_data_log)
        
        # log data into wiki_event (and, if needed, in wiki_link)
        for event_type in event_data:
            update_wiki_event = update_table_wiki_event(event_data[event_type], event_type, day, month, c, doe)
            update_results.append(update_wiki_event)
            
        # update table wiki_log
        update_table_wiki_log(update_results, c, doe)
        
    return update_results

In [18]:
def update_table_wiki_day_data_log(day_data, day, month, database_cursor, doe = DOE):
    """
    Add to table wiki_day_data_log the data received from Wikipedia API. 
    This data is minimally processed and is used as restore point in case subsequent CRUD operations
    involving cleaned-up and processed data fail.
    
    Params:
        day_data: dictionary with minimally processed data retrieved from Wikipedia for a specific day of the year
                  e.g. {'status_code': status_code_of_the_API_request,
                        'day_soup': BeautifulSoup_object_representing_day_data,
                        'no_event_for': list_of_event_types_without_events}
        day: day of the events retrieved from Wikipedia
        month: month of the events retrieved from Wikipedia
        database_cursor: cursor object for wikipedia_tdih.db
        doe: date of entry
    
    Returns:
        update_status_dict: dictionary with status of wiki_day_data_log update
                            e.g. {'doe': doe,
                                  'wiki_table_name': 'wiki_day_data_log',
                                  'update_status': update_status,
                                  'update_note': np.nan}  # this is relevant to other tables (e.g. wiki_event)
    """
    c = database_cursor
    
    # data to be logged into wiki_day_data_log
    status_code = day_data['status_code']
    day_soup = str(day_data['day_soup']).encode('utf-8') # convert the BeautifulSoup object to UTF-8 bytes (stored as a BLOB)
    no_event_for = str(day_data['no_event_for'])
    
    # organize the data into a tuple
    data = (doe, day, month, status_code, day_soup, no_event_for)
   
    
    try:
        c.execute('INSERT INTO wiki_day_data_log VALUES (null, ?, ?, ?, ?, ?, ?)', data)
        update_status = 'update_complete'
    except Exception as e:
        update_status = repr(e)
        
    update_status_dict = {'doe': doe,
                          'wiki_table_name': 'wiki_day_data_log',
                          'update_status': update_status,
                          'update_note': np.nan}
        
    return update_status_dict
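
**Restoring the logged page (sketch):** since wiki_day_data_log is meant as a restore point, the example below shows the round trip the function relies on: a string is encoded to UTF-8 bytes, stored as a BLOB, and decoded back unchanged. The column names and types in the `CREATE TABLE` statement are assumptions made for this illustration (the real schema may differ), and `page_html` stands in for `str(day_data['day_soup'])`.

```python
import sqlite3

# Hypothetical, simplified schema: the real wiki_day_data_log columns may differ.
conn = sqlite3.connect(':memory:')
conn.execute('''CREATE TABLE wiki_day_data_log (
                    id INTEGER PRIMARY KEY,
                    doe TEXT, day INTEGER, month TEXT,
                    status_code INTEGER, day_soup BLOB, no_event_for TEXT)''')

page_html = '<ul><li>Hōnen, Japanese monk</li></ul>'  # stands in for str(day_soup)
blob = page_html.encode('utf-8')                      # str -> bytes, stored as a BLOB

with conn:
    conn.execute('INSERT INTO wiki_day_data_log VALUES (null, ?, ?, ?, ?, ?, ?)',
                 ('2021-02-16', 29, 'February', 200, blob, str([])))

stored = conn.execute('SELECT day_soup FROM wiki_day_data_log').fetchone()[0]
restored_html = stored.decode('utf-8')  # bytes -> str, ready to be parsed again
print(restored_html == page_html)
conn.close()
```

The restored string can then be parsed again, for example with `BeautifulSoup(restored_html, 'html.parser')`.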

In [19]:
def update_table_wiki_event(event_data, event_type, day, month, database_cursor, doe = DOE):
    """
    Add data received from Wikipedia API into the wiki_event table.
    (This data has been cleaned-up and formatted to match the specifications of wikipedia_tdih.db database.)
    
    Params:
        event_data: list of dictionaries with data for each event
                    (e.g. {'year': year,
                           'bc_ad': bc_ad,
                           'bc_ad_note': bc_ad_note,
                           'event_description': event_description,
                           'event_category': event_category, # e.g. 'Christian feast day'
                           'event_first_link': event_first_link,
                           'event_links_list': list_of_links,
                           'event_links_text': list_of_text_of_links})
        event_type: event type being processed
                    (i.e. - events
                          - births
                          - deaths
                          - holidays_and_observances)
        day: day of the event
        month: month of the event
        database_cursor: cursor object for wikipedia_tdih.db
        doe: date of entry
       
    Returns:
        update_table_status: a dictionary with the update status for the event type
                             e.g. {'doe': date_of_entry,
                                   'wiki_table_name': 'wiki_event',
                                   'update_status': 'update_complete',
                                   'update_note': event_type}  # or f'{event_type}: no_data' if event_data is empty
    """
    c = database_cursor
    
    for event in event_data:
        # get the data that needs to be logged
        year = event['year']
        bc_ad = event.get('bc_ad', np.nan) # event_type 'holidays_and_observances' has no bc_ad data
        bc_ad_note = event.get('bc_ad_note', np.nan)
        event_description = event['event_description']
        event_category = event['event_category']
        event_first_link_id = get_wiki_link_id(event['event_first_link'], c, doe)
        event_links_list = event['event_links_list']
        event_links_text = event['event_links_text']
        
        # organize the data into a tuple
        data = (doe, day, month, year, bc_ad, bc_ad_note, event_type, event_description,
                event_category, event_first_link_id, str(event_links_list), str(event_links_text))
        
        # insert data into wiki_event
        c.execute('INSERT INTO wiki_event VALUES (null, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, null)', data)
        
        # update table wiki_link_usage
        event_id = c.lastrowid
        update_table_wiki_link_usage(event_links_list, event['event_first_link'], event_id, c, doe)
        
    update_note = event_type if event_data else f'{event_type}: no_data'
    
    update_table_status = {'doe': doe,
                           'wiki_table_name': 'wiki_event',
                           'update_status': 'update_complete',
                           'update_note': update_note}
    
    return update_table_status

In [35]:
def update_table_wiki_link_usage(event_links_list, event_first_link, event_id, database_cursor, doe = DOE):
    """
    Add link usage info to wiki_link_usage.
    
    Params:
        event_links_list: list of wiki links found in the description of an event
        event_first_link: the first link in the event description (that is not a year_link)
        event_id: id of the wiki_event row that the links belong to
        database_cursor: cursor object for wikipedia_tdih.db
        doe: date of entry (defaults to current day)
        
    Returns:
        None
        
    """
    c = database_cursor
    
    for link in event_links_list:
        # initialize is_first_link
        is_first_link = 0
        # check that link is not the year link (found at the beginning of most events)
        if not is_year_link(link):
            if link == event_first_link:
                is_first_link = 1
            link_id = get_wiki_link_id(link, c, doe)
            
            # data for the wiki_link_usage table
            data = (doe, link_id, event_id, is_first_link)
            
            # update wiki_link_usage table
            c.execute('INSERT INTO wiki_link_usage VALUES (null, ?, ?, ?, ?)', data)

In [21]:
def is_year_link(wiki_link):
    """
    Check if wiki_link is a year link (in most cases the year links are not relevant to the description 
    of an event. They just link to the wikipedia page for a specific year.)
    
    Params:
        wiki_link: link found in the description of an event
        
    Returns:
        True if wiki_link is a year link, False otherwise
    """
    # match patterns like: wiki/AD_100, wiki/AD100, wiki/100, wiki/100_AD, wiki/100AD (for AD or BC)
    event_year = re.compile(r'(wiki/ad_*\d{1,4})|(wiki/bc_*\d{1,4})|(wiki/\d{1,4})|(wiki/\d{1,4}_*ad)|(wiki/\d{1,4}_*bc)', re.IGNORECASE)
    
    return bool(event_year.search(wiki_link))
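
**Checking the year-link pattern (sketch):** the example below runs the pattern from `is_year_link` against a few sample links. Because the pattern is searched without an end anchor, any title that starts with one to four digits also counts as a year link (e.g. `/wiki/1896_Summer_Olympics`). Whether that is intended is not stated in this notebook, so `strict_year` is only a hypothetical stricter variant, not the project's pattern.

```python
import re

# Pattern copied from is_year_link.
event_year = re.compile(
    r'(wiki/ad_*\d{1,4})|(wiki/bc_*\d{1,4})|(wiki/\d{1,4})|(wiki/\d{1,4}_*ad)|(wiki/\d{1,4}_*bc)',
    re.IGNORECASE)

# Hypothetical stricter variant: the whole link must be a year, nothing may follow it.
strict_year = re.compile(
    r'/?wiki/(?:(?:ad|bc)_*\d{1,4}|\d{1,4}(?:_*(?:ad|bc))?)',
    re.IGNORECASE)

links = ['/wiki/1504', '/wiki/AD_100', '/wiki/100_BC',
         '/wiki/Christopher_Columbus', '/wiki/1896_Summer_Olympics']

# map each link to (result of original pattern, result of strict variant)
results = {link: (bool(event_year.search(link)), bool(strict_year.fullmatch(link)))
           for link in links}

for link, (original, strict) in results.items():
    print(f'{link:30} original={original!s:5} strict={strict}')
```

The two patterns agree on the first four links and differ only on the last one, which the original pattern treats as a year link.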

In [38]:
def get_wiki_link_id(event_link, database_cursor, doe = DOE):
    """
    Get the id of event_link from wiki_link table.
    (If event_link is not in wiki_link, add event_link to wiki_link and return the id of the 
    newly added link)
    
    Params:
        event_link: string representing a wiki_link (e.g. 'wiki/Albert_Einstein')
        database_cursor: cursor object for the wikipedia_tdih.db
        doe: date of entry
          
    Returns:
        link_id    
    """
    c = database_cursor
    
    c.execute('SELECT link_id FROM wiki_link WHERE link_url = ?', (event_link, ))
    row = c.fetchone()  # None if event_link is not in wiki_link yet
    
    if row is not None:
        link_id = row[0]
    else:
        # event_link is new: add it to wiki_link and use the id of the new row
        c.execute('INSERT INTO wiki_link VALUES (null,?,?,null)', (doe, event_link))
        link_id = c.lastrowid
        
    return link_id
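
**Alternative get-or-create (sketch):** if `link_url` carried a UNIQUE constraint, the lookup could be written as `INSERT OR IGNORE` followed by a `SELECT`. This is a different technique from the one used in `get_wiki_link_id`, shown only for comparison. The schema below is an assumption: only `link_id` and `link_url` are named in the notebook's query, while `link_info` and the UNIQUE constraint are hypothetical, as is the helper name `get_or_create_link_id`.

```python
import sqlite3

def get_or_create_link_id(link_url, cursor, doe):
    """Return the id of link_url, inserting it first if it is not stored yet."""
    # the insert is silently skipped when link_url already exists (UNIQUE constraint)
    cursor.execute('INSERT OR IGNORE INTO wiki_link VALUES (null, ?, ?, null)',
                   (doe, link_url))
    cursor.execute('SELECT link_id FROM wiki_link WHERE link_url = ?', (link_url,))
    return cursor.fetchone()[0]

conn = sqlite3.connect(':memory:')
# Hypothetical schema for illustration only.
conn.execute('''CREATE TABLE wiki_link (
                    link_id INTEGER PRIMARY KEY,
                    doe TEXT,
                    link_url TEXT UNIQUE,
                    link_info TEXT)''')
c = conn.cursor()

first_id = get_or_create_link_id('/wiki/Abel_Tasman', c, '2021-02-16')
same_id = get_or_create_link_id('/wiki/Abel_Tasman', c, '2021-02-16')
other_id = get_or_create_link_id('/wiki/Jay_Treaty', c, '2021-02-16')

link_count = c.execute('SELECT COUNT(*) FROM wiki_link').fetchone()[0]
print(first_id == same_id, other_id != first_id, link_count)
conn.close()
```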

In [23]:
def update_table_wiki_log(update_results, database_cursor, doe = DOE):
    """
    Update table wiki_log with the results of updating wiki_day_data_log and wiki_event.
    
    Params:
        update_results: list of dictionaries with update status
                        e.g. [{'doe': '2021-02-15',
                               'wiki_table_name': 'wiki_day_data_log',
                               'update_status': 'update_complete',
                               'update_note': nan}]
        database_cursor: cursor object for wikipedia_tdih.db
        doe: date of entry (defaults to current day)
        
    Returns:
        None
    """
    c = database_cursor
    
    for update in update_results:
        data = (update['doe'], 
                update['wiki_table_name'],
                update['update_status'],
                update['update_note'])
        
        c.execute('INSERT INTO wiki_log VALUES (null,?,?,?,?)', data)    

## Test day_data_main

In [24]:
day_data_dict, update_results = day_data_main('February_29')

February_29
--------
status_code:  200
adding data to db
data added



In [25]:
update_results

[{'doe': '2021-02-16',
  'wiki_table_name': 'wiki_day_data_log',
  'update_status': 'update_complete',
  'update_note': nan},
 {'doe': '2021-02-16',
  'wiki_table_name': 'wiki_event',
  'update_status': 'update_complete',
  'update_note': 'events'},
 {'doe': '2021-02-16',
  'wiki_table_name': 'wiki_event',
  'update_status': 'update_complete',
  'update_note': 'births'},
 {'doe': '2021-02-16',
  'wiki_table_name': 'wiki_event',
  'update_status': 'update_complete',
  'update_note': 'deaths'},
 {'doe': '2021-02-16',
  'wiki_table_name': 'wiki_event',
  'update_status': 'update_complete',
  'update_note': 'holidays_and_observances'}]
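
**Checking update_results for failures (sketch):** `update_table_wiki_day_data_log` stores `repr(e)` in `update_status` when its insert fails, so any status other than `'update_complete'` marks a failed update. The helper `failed_updates` and the sample error string below are made up for this illustration.

```python
def failed_updates(update_results):
    """Return the status dictionaries whose update did not complete."""
    return [u for u in update_results if u['update_status'] != 'update_complete']

# Sample data for illustration; the error string is invented.
sample_results = [
    {'doe': '2021-02-16', 'wiki_table_name': 'wiki_day_data_log',
     'update_status': "OperationalError('database is locked')", 'update_note': None},
    {'doe': '2021-02-16', 'wiki_table_name': 'wiki_event',
     'update_status': 'update_complete', 'update_note': 'events'},
]

failed = failed_updates(sample_results)
for update in failed:
    print(update['wiki_table_name'], update['update_status'])
```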

#### Keeping the cells below for future reference (if needed)

In [17]:
d1 = day_data_main('February_29')
# d1

In [18]:
d1.keys()

dict_keys(['no_event_for', 'events', 'births', 'deaths', 'holidays_and_observances', 'day', 'month', 'status_code', 'day_soup'])

In [19]:
dfev = pd.DataFrame(d1['events'])
dfev

| | year | event_description | event_links_list | event_first_link | event_category | bc_ad | bc_ad_note |
|---|---|---|---|---|---|---|---|
| 0 | 1504 | Christopher Columbus uses his knowledge of a l... | [/wiki/1504, /wiki/Christopher_Columbus, /wiki... | /wiki/Christopher_Columbus | | ad | assumed |
| 1 | 1644 | Abel Tasman's second Pacific voyage begins. | [/wiki/1644, /wiki/Abel_Tasman] | /wiki/Abel_Tasman | | ad | assumed |
| 2 | 1704 | Queen Anne's War: French forces and Native Ame... | [/wiki/1704, /wiki/Queen_Anne%27s_War, /wiki/N... | /wiki/Queen_Anne%27s_War | | ad | assumed |
| 3 | 1712 | February 29 is followed by February 30 in Swed... | [/wiki/1712, /wiki/February_30, /wiki/Swedish_... | /wiki/February_30 | | ad | assumed |
| 4 | 1720 | Ulrika Eleonora, Queen of Sweden abdicates in ... | [/wiki/1720, /wiki/Ulrika_Eleonora,_Queen_of_S... | /wiki/Ulrika_Eleonora,_Queen_of_Sweden | | ad | assumed |
| 5 | 1752 | King Alaungpaya founds Konbaung Dynasty, the l... | [/wiki/1752, /wiki/Alaungpaya, /wiki/Konbaung_... | /wiki/Alaungpaya | | ad | assumed |
| 6 | 1768 | Polish nobles form the Bar Confederation. | [/wiki/1768, /wiki/Bar_Confederation] | /wiki/Bar_Confederation | | ad | assumed |
| 7 | 1796 | The Jay Treaty between the United States and G... | [/wiki/1796, /wiki/Jay_Treaty] | /wiki/Jay_Treaty | | ad | assumed |
| 8 | 1864 | American Civil War: Kilpatrick–Dahlgren Raid f... | [/wiki/1864, /wiki/American_Civil_War, /wiki/K... | /wiki/American_Civil_War | | ad | assumed |
| 9 | 1892 | St. Petersburg, Florida is incorporated. | [/wiki/1892, /wiki/St._Petersburg,_Florida] | /wiki/St._Petersburg,_Florida | | ad | assumed |


In [20]:
ddea = pd.DataFrame(d1['deaths'])
ddea

| | year | event_description | event_links_list | event_first_link | event_category | bc_ad | bc_ad_note |
|---|---|---|---|---|---|---|---|
| 0 | 468 | Pope Hilarius | [/wiki/468, /wiki/Pope_Hilarius] | /wiki/Pope_Hilarius | | ad | assumed |
| 1 | 992 | Oswald of Worcester, Anglo-Saxon archbishop an... | [/wiki/992, /wiki/Oswald_of_Worcester] | /wiki/Oswald_of_Worcester | | ad | assumed |
| 2 | 1212 | Hōnen, Japanese monk, founded Jōdo-shū (b. 1133) | [/wiki/1212, /wiki/H%C5%8Dnen, /wiki/J%C5%8Ddo... | /wiki/H%C5%8Dnen | | ad | assumed |
| 3 | 1460 | Albert III, Duke of Bavaria-Munich (b. 1401) | [/wiki/1460, /wiki/Albert_III,_Duke_of_Bavaria] | /wiki/Albert_III,_Duke_of_Bavaria | | ad | assumed |
| 4 | 1528 | Patrick Hamilton, Scottish Protestant reformer... | [/wiki/1528, /wiki/Patrick_Hamilton_(martyr)] | /wiki/Patrick_Hamilton_(martyr) | | ad | assumed |
| 5 | 1592 | Alessandro Striggio, Italian composer and dipl... | [/wiki/1592, /wiki/Alessandro_Striggio] | /wiki/Alessandro_Striggio | | ad | assumed |
| 6 | 1600 | Caspar Hennenberger, German pastor, historian ... | [/wiki/1600, /wiki/Caspar_Hennenberger] | /wiki/Caspar_Hennenberger | | ad | assumed |
| 7 | 1604 | John Whitgift, English archbishop and academic... | [/wiki/1604, /wiki/John_Whitgift] | /wiki/John_Whitgift | | ad | assumed |
| 8 | 1740 | Pietro Ottoboni, Italian cardinal (b. 1667) | [/wiki/1740, /wiki/Pietro_Ottoboni_(cardinal)] | /wiki/Pietro_Ottoboni_(cardinal) | | ad | assumed |
| 9 | 1744 | John Theophilus Desaguliers, French-English ph... | [/wiki/1744, /wiki/John_Theophilus_Desaguliers] | /wiki/John_Theophilus_Desaguliers | | ad | assumed |


In [21]:
dfhol = pd.DataFrame(d1['holidays_and_observances'])
dfhol

| | year | event_description | event_links_list | event_first_link | event_category |
|---|---|---|---|---|---|
| 0 | | Auguste Chapdelaine (one of the Martyr Saints ... | [/wiki/Auguste_Chapdelaine, /wiki/Martyr_Saint... | /wiki/Auguste_Chapdelaine | christian feast day |
| 1 | | Oswald of Worcester (in leap year only) | [/wiki/Oswald_of_Worcester] | /wiki/Oswald_of_Worcester | christian feast day |
| 2 | | Saint John Cassian | [/wiki/John_Cassian] | /wiki/John_Cassian | christian feast day |
| 3 | | February 29 in the Orthodox church | [/wiki/February_29_(Eastern_Orthodox_liturgics)] | /wiki/February_29_(Eastern_Orthodox_liturgics) | christian feast day |
| 4 | | The fourth day of Ayyám-i-Há (Baháʼí Faith) (o... | [/wiki/Ayy%C3%A1m-i-H%C3%A1, /wiki/Bah%C3%A1%C... | /wiki/Ayy%C3%A1m-i-H%C3%A1 | |
| 5 | | Rare Disease Day (in leap years; celebrated in... | [/wiki/Rare_Disease_Day] | /wiki/Rare_Disease_Day | |
| 6 | | Bachelor's Day (Ireland, United Kingdom) | [/wiki/Bachelor%27s_Day_(tradition), /wiki/Rep... | /wiki/Bachelor%27s_Day_(tradition) | |
