# Wikipedia - this day in history <small>(step 2)</small>
---
**Goal:** create a dataset of this-day-in-history events  
  
**Context:** I need a starting dataset of world history events (name, short description, long description, image, and location). I couldn't find a public dataset, so I'm building one from Wikipedia's "this day in history" data.

**Notes about this notebook:**  
- this notebook is for the second step of this project. 
- the notebook for the first step is [TADS_wikipedia_tdih_main_api_step_01_02_get_data_29jan21](http://localhost:8888/notebooks/temp_for_offline/bianca_aguglia/projects_wip/TADS_wikipedia_this_day_in_history/TADS_wikipedia_tdih_main_api_step_01_02_get_data_29jan21.ipynb)
- the first step consisted of:
    - using the Wikipedia API to get the events for each day of the year
    - parsing the Wikipedia data, cleaning it up, and getting it in the right format needed for the SQLite database created for this project
- step two is:
    - take the data from step one and rank each item based on page views, page length, and links to page
    - get image (and licence details) for each item in the data from step one
- to get data for a specific day run day_data_main(day_name). It returns a dictionary with day data.
  

### Process flow: <small>(with details and checkmarks for steps done in this notebook)</small>
- get day data
    - save data in wikipedia_tdih.db
- get link data
    - [ ] query wikipedia_tdih.db for:
        - [x] links that are new (i.e. in wiki_link but not in wiki_get_link_data_log)
        - [x] links that have not been updated since a specified_date
    - [ ] get link data from Wikipedia API and extract the following:
        - page_size
        - incoming_links
        - coordinates
        - page_score
        - first_paragraphs
        - image_url
        - image_file
        - wikibase_shortdesc
        - wikibase_item
        - wiki_desc
        - page_views
    - [ ] add link data to db

In [None]:
import sqlite3
import requests
import time
import config
import datetime as dt
from bs4 import BeautifulSoup  # used in extract_link_data to strip HTML from extracts

In [None]:
DATABASE_FILE = 'wikipedia_tdih.db'
DOE = dt.date.today().strftime('%Y-%m-%d')
URL = 'https://en.wikipedia.org/w/api.php'
HEADERS = config.HEADERS
PARAMS = {'titles': '', # placeholder for page
          'action': 'query',
          'format': 'json',
          'prop': 'cirrusbuilddoc|cirruscompsuggestbuilddoc|extracts|pageimages|pageprops|pageterms|pageviews',
          'piprop': 'original',
          'exintro': True # gives the first paragraphs of a page (before detail sections)
         }
# explanation of values for 'prop'
# cirrusbuilddoc for text_bytes (aka page_size), incoming_link, coordinates (if any)
# cirruscompsuggestbuilddoc for score
# extracts (+ exintro) for first_paragraph(s) (regex to clean up) (might be better to use ['cirrusbuilddoc']['text_source'])
# pageimages for pageimage (with piprop='original') for full size image (but doesn't return image file anymore)
# pageprops for page_image_free, wikibase_short_description, wikibase_item
# pageterms for description (similar but not identical to wikibase_short_description)
# pageviews for pageviews :-)

In [None]:
def link_data_main(database_file = DATABASE_FILE, date = '', batch_size = 5, doe = DOE):
    """
    Main function for getting and processing the link data for a group of links in wikipedia_tdih.db.
    It uses several helper functions to break down the process into simple steps:
        - get_links_to_update_from_db
        - get_link_data
        - extract_link_data
        - add_link_data_to_db
    
    Params:
        database_file: path to the SQLite database file
        date: '' to process only new links; otherwise a '%Y-%m-%d' date passed to get_links_to_update_from_db
        batch_size: no. of links to request data for in a single call to wikipedia API (to reduce no. of API calls)
        doe: date string stored with each record (defaults to today's date)
        
    Returns:
        None if every batch was processed, otherwise a short status message (str)
    """
    with sqlite3.connect(database_file) as connection:
        cursor = connection.cursor()
    
        # select from wikipedia_tdih.db the links we need data for
        links_list = get_links_to_update_from_db(cursor, date)
    
        # exit if no links need to have data retrieved or updated
        if not links_list:
            return 'no link data needs to be retrieved / updated'
        
        # keep track of failed requests (stop once 5 requests have failed)
        failed_requests = 0
    
        # process links in batches of size batch_size
        for i in range(0, len(links_list), batch_size):
            link_batch = links_list[i:i + batch_size]
            
            # get link data for links in batch
            resp = get_link_data(link_batch, url = URL, params = PARAMS, headers = HEADERS)
            
            if resp.status_code != requests.codes.ok:
                # log the response status_code for every link in the failed batch
                for link_id, _ in link_batch:
                    update_table_wiki_link_data_log(link_id, resp.status_code, cursor, doe)
                # commit so the connection opened in add_link_data_to_db is not blocked by pending writes
                connection.commit()
                failed_requests += 1
                if failed_requests >= 5:
                    return f'failed requests: {failed_requests}'
                continue

            # extract link_data and add it to db, one page at a time
            for page_dict in extract_link_data(resp):
                add_link_data_to_db(page_dict, db = database_file, doe = doe)
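
The batching and failure-counting logic can be tried out without touching the API or the database. The sketch below is only an illustration: `make_batches`, `process_batches`, and the stand-in `fetch` callable are hypothetical names that are not part of this notebook, and the status codes are made up.

```python
def make_batches(items, batch_size):
    """Split items into consecutive lists of at most batch_size elements."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def process_batches(batches, fetch, max_failures=5):
    """
    Call fetch(batch) for each batch; fetch returns an HTTP-like status code.
    Failed batches are skipped, and processing stops once max_failures is reached.
    Returns (processed_batches, failed_requests).
    """
    processed, failed = [], 0
    for batch in batches:
        status = fetch(batch)
        if status != 200:
            failed += 1
            if failed >= max_failures:
                break
            continue
        processed.append(batch)
    return processed, failed


# made-up (link_id, link_url) tuples, shaped like the rows returned from wiki_link
links = [(link_id, f'Page_{link_id}') for link_id in range(1, 13)]
batches = make_batches(links, 5)

# stand-in for the API call: pretend the batch containing link_id 6 fails
def fake_fetch(batch):
    return 500 if any(link_id == 6 for link_id, _ in batch) else 200

processed, failed = process_batches(batches, fake_fetch)
```

With 12 links and a batch size of 5, the last batch is shorter than the others, which is why slicing is convenient here: it never raises when the end index runs past the list.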

In [None]:
def get_links_to_update_from_db(database_cursor, date = ''):
    """
    Select from wikipedia_tdih.db the links for which wikipedia data is needed.
    
    There are two cases in which data is needed for a specific link:
        1. the link has just been added to the wikipedia_tdih.db and link data has not been requested from 
           wikipedia API yet
        2. the existing data for the link needs to be updated
    
    Params:
        database_cursor: cursor for an open connection to wikipedia_tdih.db
        date: default of '' indicates that only links without data should be selected
              if date is given, select the links that have data but data has not been updated since specified date
              if date is given, format should be '%Y-%m-%d' (e.g. '2021-01-02' for January 2nd, 2021)
              
    Returns:
        links_list: list of (link_id, link_url) tuples for which wikipedia data is needed
    """
    
    c = database_cursor
    
    # if no date is given
    if not date:
        # select the links that are in wiki_link but not in wiki_get_link_data_log
        # (these are links which have just been added to wiki_link)
        links_list = c.execute('''SELECT link_id, link_url FROM wiki_link WHERE link_id NOT IN (
                                    SELECT link_id FROM wiki_get_link_data_log) ''').fetchall()
    
    # if date is given
    else:
        # select the links whose most recent log entry is older than date
        # ('%Y-%m-%d' strings sort chronologically, so a plain string comparison works)
        links_list = c.execute('''SELECT link_id, link_url FROM wiki_link WHERE link_id IN (
                                    SELECT link_id FROM wiki_get_link_data_log
                                    GROUP BY link_id HAVING MAX(doe) < ?)''', (date,)).fetchall()
      
    return links_list
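
The two selections described in the docstring (links with no log entry, and links whose most recent log entry is older than a cutoff date) can be checked against a throwaway in-memory database. This is a minimal sketch: the two-column table definitions below are an assumption made for the example, and the real tables in wikipedia_tdih.db may well have more columns. The rows and dates are made up.

```python
import sqlite3

# in-memory database with an assumed, simplified schema
conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute('CREATE TABLE wiki_link (link_id INTEGER PRIMARY KEY, link_url TEXT)')
c.execute('CREATE TABLE wiki_get_link_data_log (link_id INTEGER, doe TEXT)')

c.executemany('INSERT INTO wiki_link VALUES (?, ?)',
              [(1, 'Albert_Einstein'), (2, 'Marie_Curie'), (3, 'Ada_Lovelace')])
# link 1 was logged twice, link 2 once, link 3 never
c.executemany('INSERT INTO wiki_get_link_data_log VALUES (?, ?)',
              [(1, '2021-01-05'), (1, '2021-01-10'), (2, '2021-01-20')])

# case 1: links that have no entry in the log yet
new_links = c.execute('''SELECT link_id, link_url FROM wiki_link WHERE link_id NOT IN (
                            SELECT link_id FROM wiki_get_link_data_log)''').fetchall()

# case 2: links whose latest log entry is older than the cutoff date
cutoff = '2021-01-15'
stale_links = c.execute('''SELECT link_id, link_url FROM wiki_link WHERE link_id IN (
                              SELECT link_id FROM wiki_get_link_data_log
                              GROUP BY link_id HAVING MAX(doe) < ?)''', (cutoff,)).fetchall()
conn.close()
```

Grouping by `link_id` and comparing `MAX(doe)` matters when a link has several log rows: a link counts as stale only if its newest entry predates the cutoff.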

In [None]:
def get_link_data(link_batch, url = URL, params = PARAMS, headers = HEADERS):
    """
    Get link data from Wikipedia API. Links are processed in batches (usually of size 5) to reduce the number
    of API calls.
    
    Params:
        link_batch: list of (link_id, link_url) tuples to get data for (usually a batch of size 5)
        url: url for wikipedia API
        params: params to use in API call
        headers: headers to send with the request
        
    Returns:
        resp: the requests.Response returned by the API call
    """
    # join the link urls in link_batch with '|' (the API's separator for multiple titles)
    titles = '|'.join(wiki_link[1] for wiki_link in link_batch)
    
    # copy params so the shared PARAMS dict is not modified between calls
    request_params = {**params, 'titles': titles}
    
    # request data
    resp = requests.get(url = url, headers = headers, params = request_params)
    
    return resp
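
To see what a batch turns into before it is sent, the request parameters can be built and encoded with the standard library alone. This is a rough sketch, not how `requests` is used above: `build_params` and `BASE_PARAMS` are hypothetical names for this example, the `prop` list is shortened, and `requests` does its own encoding when it sends the real call.

```python
from urllib.parse import urlencode, parse_qs

# shortened stand-in for the PARAMS dict (assumption: same keys, fewer props)
BASE_PARAMS = {'action': 'query',
               'format': 'json',
               'prop': 'extracts|pageimages|pageprops',
               'piprop': 'original',
               'exintro': True}


def build_params(link_batch, base_params):
    """Return a new params dict with the batch's link urls joined by '|' under 'titles'."""
    titles = '|'.join(link_url for _, link_url in link_batch)
    # build a new dict so base_params is left untouched
    return {**base_params, 'titles': titles}


batch = [(1, 'Albert_Einstein'), (2, 'Marie_Curie')]
params = build_params(batch, BASE_PARAMS)

# encode the params as a query string, then decode it again to inspect the values
query_string = urlencode(params)
decoded = parse_qs(query_string)
```

Building a new dict for each batch keeps the shared defaults clean, so a title list from one call cannot leak into the next.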

In [None]:
def extract_link_data(response):
    """
    Extract link data
    
    Params:
        response: response data received from Wikipedia API
        
    Returns:
        page_dicts: list of dictionaries with page data
    
    """
    resp = response.json()
    status_code = response.status_code
    page_dicts = []
    
    # get the wiki_links from the response (to match them to wiki_link in database)
    # 'normalized' is only present when Wikipedia changed at least one of the requested titles
    wiki_links_list = resp['query'].get('normalized', [])
    wiki_links_dict = {link['to']: link['from'] for link in wiki_links_list}
    
    for page_id, page_dict in resp['query']['pages'].items():
        # wiki_link is normalized by Wikipedia (e.g. 'Albert_Einstein' normalized to 'Albert Einstein')
        # extract it and find its original value (titles that were not normalized are kept as they are)
        wiki_link_normalized = page_dict['title']
        wiki_link = wiki_links_dict.get(wiki_link_normalized, wiki_link_normalized)
        
        # not every page has every property, so missing values default to None
        cirrus_doc = page_dict.get('cirrusbuilddoc', {})
        score_doc = page_dict.get('cirruscompsuggestbuilddoc', {}).get(f'{page_id}t', {})
        page_props = page_dict.get('pageprops', {})
        
        # build up the data dict
        data_dict = {'wiki_link': wiki_link,
                     'status_code': status_code,
                     'page_size': cirrus_doc.get('text_bytes'),
                     'incoming_links': cirrus_doc.get('incoming_links'),
                     'coordinates': cirrus_doc.get('coordinates'),
                     'page_score': score_doc.get('score_explanation', {}).get('value'),
                     'first_paragraphs': BeautifulSoup(page_dict.get('extract', ''), 'html.parser').text.strip(),
                     'image_url': page_dict.get('original', {}).get('source'),
                     'image_file': page_props.get('page_image_free'),
                     'wikibase_shortdesc': page_props.get('wikibase-shortdesc'),
                     'wikibase_item': page_props.get('wikibase_item'),
                     'wiki_desc': page_dict.get('terms', {}).get('description'),
                     'page_views': page_dict.get('pageviews')}
        page_dicts.append(data_dict)
        
    return page_dicts
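
The title matching and the handling of missing fields can be exercised on a hand-written response. The sketch below makes several assumptions: `sample_response` is made up and far smaller than a real API reply, `summarize_pages` is a hypothetical helper that is not part of this notebook, and summing the daily page views is just one possible way to reduce them to a single number.

```python
# made-up response: one normalized title with some properties, one page with none
sample_response = {
    'query': {
        'normalized': [{'from': 'Albert_Einstein', 'to': 'Albert Einstein'}],
        'pages': {
            '736': {'pageid': 736,
                    'title': 'Albert Einstein',
                    'pageprops': {'wikibase_item': 'Q937'},
                    # assumed shape: one entry per day, with None for days without data
                    'pageviews': {'2021-01-27': 100, '2021-01-28': None, '2021-01-29': 120}},
            '20408': {'pageid': 20408, 'title': 'Marie Curie'},
        }
    }
}


def summarize_pages(resp):
    """Return one small dict per page, tolerating missing keys."""
    # map normalized titles back to the titles that were requested
    title_map = {n['to']: n['from'] for n in resp['query'].get('normalized', [])}
    rows = []
    for page_id, page in resp['query']['pages'].items():
        title = page['title']
        views = page.get('pageviews') or {}
        rows.append({
            # fall back to the returned title when no normalization took place
            'wiki_link': title_map.get(title, title),
            'wikibase_item': page.get('pageprops', {}).get('wikibase_item'),
            # skip days with no data instead of failing on None
            'total_views': sum(v for v in views.values() if v is not None),
        })
    return rows


rows = summarize_pages(sample_response)
```

The second page shows why the lookups use `.get`: a page with no `pageprops` or `pageviews` still produces a row, with `None` or `0` in place of the missing values.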

In [None]:
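def add_link_data_to_db(page_data_dict, db = DATABASE_FILE, doe = DOE):
    """
    Add wiki_link data to wikipedia_tdih.db
    
    Params:
        page_data_dict: dictionary with data for one page (one item of the list returned by extract_link_data)
        db: path to the SQLite database file
        doe: date string stored with each record (defaults to today's date)
    """
    # connect to wikipedia_tdih.db
    with sqlite3.connect(db) as conn:
        c = conn.cursor()
    
        # get link_id from db (query parameters must be a sequence, hence the one-item tuple)
        c.execute('SELECT link_id, link_url FROM wiki_link WHERE link_url = ?', (page_data_dict['wiki_link'],))
        row = c.fetchone()
        
        # nothing to update if the link is not in wiki_link
        if row is None:
            return
        link_id = row[0]
    
        # update tables
        update_table_wiki_link_data_log(link_id, page_data_dict['status_code'], c, doe)
        update_table_wiki_link_size(link_id, page_data_dict['page_size'], c, doe)
        update_table_wiki_link_page_views(link_id, page_data_dict['page_views'], c, doe)
        update_table_wiki_link_info(link_id, page_data_dict, c, doe)
        # the two image-table updates are not filled in yet (no arguments are passed)
        update_table_wiki_image()
        update_table_wiki_image_usage()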