# Wikipedia - this day in history <small>(step 2 - get link data)</small>
---
**Goal:** get link data (i.e. page data) for links in wikipedia_tdih.db that have no data in the database or have outdated data.

**Notes:**  
- this notebook is for step two out of three of this project. 
- the notebook for the first step is [TADS_wikipedia_tdih_main_step_01_get_day_data_30jan21](https://github.com/Bianca-Aguglia/TADS_wikipedia_this_day_in_history_build_dataset/tree/master/notebooks)
- the notebook for the third step is [TADS_wikipedia_tdih_main_step_03_get_image_data_10feb21](https://github.com/Bianca-Aguglia/TADS_wikipedia_this_day_in_history_build_dataset/tree/master/notebooks) 

### Process flow: <small>(with checkmarks for steps done in this notebook)</small>
- get day data
    - get day data from wikipedia API, process it, and save it to wikipedia_tdih.db
- get link data
    - [x] query wikipedia_tdih.db for:
        - [x] links that are new (i.e. in wiki_link but not in wiki_link_data_log)
        - [x] links that have not been updated since a specified date
    - [x] get link data from wikipedia API and extract the following:
        - [x] page_size
        - [x] incoming_links
        - [x] coordinates
        - [x] page_score
        - [x] first_paragraphs
        - [x] image_url
        - [x] image_file
        - [x] wikibase_shortdesc
        - [x] wikibase_item
        - [x] wiki_desc
        - [x] page_views
    - [x] update wikipedia_tdih.db tables
        - [x] wiki_link_data_log
        - [x] wiki_link_info
        - [x] wiki_link_page_views
        - [x] wiki_image <small>*</small>
        - [x] wiki_image_usage <small>*</small>
- get image data
    - get image data from wikipedia API, process it, and save it to wikipedia_tdih.db  
  
<small>* wiki_image and wiki_image_usage are updated in this step (step two) only with data that shows relationship between wiki_link and wiki_image. The image data from Wikipedia is retrieved, processed, and saved to the database in step three.</small>

In [1]:
import sqlite3
import requests
import time
import random
import config
import datetime as dt
import numpy as np
from bs4 import BeautifulSoup

In [2]:
DATABASE_FILE = config.DATABASE_FILE
DOE = config.DOE
URL = config.WIKIPEDIA_URL
HEADERS = config.HEADERS
PARAMS = {'titles': '', # placeholder for page
          'action': 'query',
          'format': 'json',
          'prop': 'cirrusbuilddoc|cirruscompsuggestbuilddoc|extracts|pageimages|pageprops|pageterms|pageviews',
          'piprop': 'original',
          'exintro': True # gives the first paragraphs of a page (before detail sections)
         }
# explanation of values for 'prop'
# cirrusbuilddoc for text_bytes (aka page_size), incoming_links, coordinates (if any)
# cirruscompsuggestbuilddoc for score
# extracts (+ exintro) for first_paragraph(s) (regex to clean up) (might be better to use ['cirrusbuilddoc']['text_source'])
# pageimages for pageimage (with piprop='original') for full size image (but doesn't return image file anymore)
# pageprops for page_image_free, wikibase_short_description, wikibase_item
# pageterms for description (similar but not identical to wikibase_short_description)
# pageviews for pageviews :-)

In [34]:
def link_data_main(database_file = DATABASE_FILE, date = '', batch_size = 5, doe = DOE, print_progress = 0,
                   sleep_range = (5,10), url = URL, params = PARAMS, headers = HEADERS):
    """
    Main function for getting and processing the link data for a group of links in wikipedia_tdih.db.
    It uses several helper functions to break down the process into simple steps:
        - get_links_to_update_from_db
        - get_link_data
        - extract_link_data
        - add_link_data_to_db
    
    Params:
        database_file: database file to save the data to
        date: default of '' indicates that only links without data should be selected
              if date is given, select the links that have data but data has not been updated since specified date
              if date is given, format should be '%Y-%m-%d' (e.g. '2021-01-02' for January 2nd, 2021)
        batch_size: no. of links to request data for in a single call to wikipedia API (to reduce no. of API calls)
        doe: date of entry (defaults to current day)
        print_progress: if different than 0, prints batch number being processed (every print_progress no. of batches)
        sleep_range: tuple of integers indicating the lower_limit and upper_limit of how long to wait between
                     API calls
        url: url to use for request to wikipedia API
        params: params to use in request to wikipedia API
        headers: headers to use in request to wikipedia API
        
    Returns:
        links_list: list of links we needed Wikipedia data for
        failed_request: number of requests to Wikipedia API whose status was not requests.codes.ok (stops at 5)
        missing_pages: list of pages no data was received for
        update_results: a list of dictionaries with the status of each table update
                        e.g. {'doe': doe,
                              'wiki_table_name': 'wiki_table_name',
                              'update_status': 'update_complete',
                              'update_note': np.nan}
    """
    # keep track of pages no data was received for and of failed requests
    missing_pages = []
    failed_request = 0
    
    # initialize update_results
    update_results = []
    
    with sqlite3.connect(database_file) as connection:
        cursor = connection.cursor()
    
        # select from wikipedia_tdih.db the links we need data for
        links_list = get_links_to_update_from_db(cursor, date)
        
        # if print_progress
        if print_progress:
            print(f'links to update: {len(links_list)}')
        links_list = links_list[:11] # keep this here only while testing
    
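        # exit if no links need to have data retrieved or updated
        # (return the same four values as the normal exit so callers can always unpack the result)
        if not links_list:
            print('no link data needs to be retrieved / updated')
            return links_list, failed_request, missing_pages, update_results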
        
    
        # process links in batches of size batch_size
        for i in range(0, len(links_list), batch_size):
            # slice links_list into batches
            link_batch = links_list[slice(i, i + batch_size)]
            
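            # if print_progress is given, print the batch number every print_progress batches
            batch_number = i // batch_size + 1
            if print_progress and batch_number % print_progress == 0:
                n_batches = (len(links_list) + batch_size - 1) // batch_size
                print(f'processing batch {batch_number} of {n_batches}')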
                       
            # get link data for links in batch
            resp = get_link_data(link_batch, url = url, params = params, headers = headers)
            
            # extract link_data (if resp.status_code == requests.codes.ok)
            if resp.status_code == requests.codes.ok:
                page_dicts, missing_batch_pages = extract_link_data(resp, link_batch)
                
                # if missing_batch_pages list is not empty, update missing_pages
                missing_pages.extend(missing_batch_pages)

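            else:
                # update wiki_link_data_log with response status_code for every link in the batch
                for link_id, _ in link_batch:
                    update_results.append(update_table_wiki_link_data_log(link_id, resp.status_code, cursor, doe))
                # commit now so the separate connection opened in add_link_data_to_db
                # is not blocked by an open write transaction
                connection.commit()
                failed_request += 1
                if failed_request == 5:
                    print('failed requests limit reached.')
                    return links_list, failed_request, missing_pages, update_results
                continue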

            # add link_data to db
            for page_dict in page_dicts:
                update_results.extend(add_link_data_to_db(page_dict, database_file, doe))
                
            # sleep time before the next API call
            time.sleep(round(random.uniform(sleep_range[0], sleep_range[1]), 2))
            
    return links_list, failed_request, missing_pages, update_results
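
The batch slicing used in link_data_main can be checked on its own, without calling the API. The sketch below is only an illustration: `make_batches` is a hypothetical helper that is not part of this notebook, and the links are made-up `(link_id, link_title)` tuples.

```python
def make_batches(links, batch_size=5):
    """Split links into consecutive batches of at most batch_size items (hypothetical helper)."""
    return [links[i:i + batch_size] for i in range(0, len(links), batch_size)]

# made-up (link_id, link_title) tuples; 11 links, like the truncated test list
links = [(n, f'title_{n}') for n in range(1, 12)]
batches = make_batches(links, batch_size=5)

# the last batch holds whatever is left over
print([len(batch) for batch in batches])  # [5, 5, 1]
```

With 11 links and a batch size of 5, the last request carries a single title, which matches the three groups of page ids in the output at the end of this notebook.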

In [7]:
def get_links_to_update_from_db(database_cursor, date = ''):
    """
    Select from wikipedia_tdih.db the links for which wikipedia data is needed.
    
    There are two cases in which data is needed for a specific link:
        1. the link has just been added to the wikipedia_tdih.db and link data has not been requested from 
           wikipedia API yet
        2. the existing data for the link needs to be updated
    
    Params:
        database_cursor: cursor for the connection to wikipedia_tdih.db
        date: default of '' indicates that only links without data should be selected
              if date is given, select the links that have data but data has not been updated since specified date
              if date is given, format should be '%Y-%m-%d' (e.g. '2021-01-02' for January 2nd, 2021)
              
    Returns:
        links_list: list of (link_id, link_title) tuples for which wikipedia data is needed
    """
    
    cursor = database_cursor
    
    # if no date is given
    if not date:
        # select the links that are in wiki_link but not in wiki_link_data_log
        # (these are links which have just been added to wiki_link)
        links_list = cursor.execute('''SELECT link_id, link_title FROM wiki_link WHERE link_id NOT IN (
                                       SELECT link_id FROM wiki_link_data_log) ''').fetchall()
    
    # if date is given
    else:
        # select the links whose most recent log entry is older than date
        # (assumes doe is stored as a '%Y-%m-%d' string, so string comparison follows date order)
        links_list = cursor.execute('''SELECT link_id, link_title FROM wiki_link WHERE link_id IN (
                                       SELECT link_id FROM wiki_link_data_log
                                       GROUP BY link_id HAVING MAX(doe) < ?)''', (date,)).fetchall()
      
    return links_list
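
The two selection cases can be tried against a small in-memory database. This is a minimal sketch under stated assumptions: the schema below is hypothetical (the real tables in wikipedia_tdih.db have more columns), the rows are made up, and "not updated since date" is read here as "the most recent log entry is older than date".

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()

# hypothetical minimal schema, for illustration only
cur.execute('CREATE TABLE wiki_link (link_id INTEGER PRIMARY KEY, link_title TEXT)')
cur.execute('''CREATE TABLE wiki_link_data_log (id INTEGER PRIMARY KEY, doe TEXT,
                                                link_id INTEGER, status_code INTEGER)''')
cur.executemany('INSERT INTO wiki_link VALUES (?,?)',
                [(1, 'Tandil'), (2, 'Jay Treaty'), (3, 'Alaungpaya')])
cur.executemany('INSERT INTO wiki_link_data_log VALUES (null,?,?,?)',
                [('2021-01-05', 1, 200), ('2021-02-01', 2, 200)])

# case 1: links that have no entry in the log yet
new_links = cur.execute('''SELECT link_id, link_title FROM wiki_link WHERE link_id NOT IN (
                           SELECT link_id FROM wiki_link_data_log)''').fetchall()

# case 2: links whose latest log entry is older than the given date
stale_links = cur.execute('''SELECT link_id, link_title FROM wiki_link WHERE link_id IN (
                             SELECT link_id FROM wiki_link_data_log
                             GROUP BY link_id HAVING MAX(doe) < ?)''', ('2021-01-15',)).fetchall()

print(new_links)    # [(3, 'Alaungpaya')]
print(stale_links)  # [(1, 'Tandil')]
conn.close()
```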

In [8]:
def get_link_data(link_batch, url = URL, params = PARAMS, headers = HEADERS):
    """
    Get link data from Wikipedia API. Links are processed in batches (usually of size 5) to reduce the number
    of API calls.
    
    Params:
        link_batch: list of tuples representing links to get data for (usually a batch of size 5)
                    each tuple is of the form (link_id, link_title)
        url: url for wikipedia API
        params: params to use in API call
        headers: headers to use in API call
        
    Returns:
        resp: the response object
    """
    # join the titles for the links in the list link_batch and assign them to request params
    # wikipedia uses these titles as identifiers for its pages
    titles = '|'.join(wiki_link[1] for wiki_link in link_batch)

    # work on a copy so the shared PARAMS dictionary is not modified between calls
    params = {**params, 'titles': titles}
    
    # request data
    resp = requests.get(url = url, headers = headers, params = params)
    
    return resp
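
How the titles end up in the request can be seen without making a network call. In the sketch below, `build_params` is a hypothetical helper (not part of this notebook) and the base parameters are a shortened stand-in for PARAMS.

```python
def build_params(link_batch, base_params):
    """Return a copy of base_params with 'titles' set to the pipe-joined titles (hypothetical helper)."""
    titles = '|'.join(link_title for _, link_title in link_batch)
    return {**base_params, 'titles': titles}

# shortened stand-in for PARAMS and a made-up batch of (link_id, link_title) tuples
base = {'titles': '', 'action': 'query', 'format': 'json'}
batch = [(1, 'Tandil'), (2, 'Jay Treaty')]

request_params = build_params(batch, base)
print(request_params['titles'])  # Tandil|Jay Treaty
print(repr(base['titles']))      # '' (the base dictionary is left unchanged)
```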

In [26]:
def extract_link_data(response, link_batch):
    """
    Extract link data
    
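    Params:
        response: response data received from Wikipedia API
        link_batch: list of tuples of the form (link_id, link_title) for the links data was requested for
        
    Returns:
        page_dicts: list of dictionaries with page data
        missing_pages: list of requested titles no page data was received for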
    
    """
    # the needed response data is in response.json()['query']['pages']
    resp = response.json()['query']['pages']
    status_code = response.status_code
    page_dicts = []
    missing_pages = []

    print(resp.keys())

    for page_id in resp:
        print(resp[page_id]['title'])
    
    for page_id in resp:
        
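        # if page_id is negative number => page doesn't exist (possibly, title is wrong)
        # for each page that doesn't exist, the page_id decreases by one
        if int(page_id) < 0:
            # if page_id == -1, compare titles requested vs. titles received, and return missing ones
            # this only has to be done once, that's why we ignore page_ids less than -1
            # note: titles the API normalizes (e.g. underscores replaced by spaces) also show up as missing
            if int(page_id) == -1:
                requested_titles = {link_title for _, link_title in link_batch}
                received_titles = {resp[pid]['title'] for pid in resp if int(pid) > 0}
                missing_pages = list(requested_titles - received_titles)
            continue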
                    
        # for pages with valid data        
        page_dict = resp[page_id]
        
        # TO-DO cover for redirects
        page_redirect = list(page_dict['cirruscompsuggestbuilddoc'].keys())[0]
        
        # TO-DO function for getting needed values
        # (a missing key means the page has no such value, so fall back to np.nan)
        try: image_url = page_dict['original']['source']
        except KeyError: image_url = np.nan
            
        try: image_file = page_dict['pageprops']['page_image_free']
        except KeyError: image_file = np.nan
            
        try: wikibase_shortdesc = page_dict['pageprops']['wikibase-shortdesc']
        except KeyError: wikibase_shortdesc = np.nan
            
        try: wikibase_item = page_dict['pageprops']['wikibase_item']
        except KeyError: wikibase_item = np.nan
            
        try: wiki_desc = page_dict['terms']['description']
        except KeyError: wiki_desc = np.nan
        
        # build up the data dict
        data_dict = {'link_title': page_dict['title'], # change this so links saved to db are stripped of '/wiki/'?
                     'status_code': status_code,
                     'created_at': page_dict['cirrusbuilddoc']['create_timestamp'],
                     'page_size': page_dict['cirrusbuilddoc']['text_bytes'],
                     'incoming_links': page_dict['cirrusbuilddoc']['incoming_links'],
                     'coordinates': page_dict['cirrusbuilddoc'].get('coordinates', np.nan),
                     'page_score': page_dict['cirruscompsuggestbuilddoc'][page_redirect]['score_explanation']['value'],
                     'first_paragraphs': BeautifulSoup(page_dict['extract'], 'html.parser').text.strip(),
                     'image_url': image_url,
                     'image_file': image_file,
                     'wikibase_shortdesc': wikibase_shortdesc,
                     'wikibase_item': wikibase_item,
                     'wiki_desc': wiki_desc,
                     'page_views': page_dict['pageviews']}
        page_dicts.append(data_dict)
        
    return page_dicts, missing_pages
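
The "function for getting needed values" mentioned in the TO-DO could look something like the sketch below. `get_nested` is a hypothetical helper, not something this notebook defines, and the page data is made up. It uses `math.nan` as the default to stay dependency-free; in this notebook the default would be `np.nan`, which is the same float value.

```python
import math

def get_nested(data, keys, default=math.nan):
    """Follow keys through nested dictionaries; return default if any key is missing (hypothetical helper)."""
    current = data
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current

# made-up page data with the same nesting as the values extracted above
page_dict = {'original': {'source': 'https://example.org/image.jpg'},
             'pageprops': {'wikibase_item': 'Q1'}}

image_url = get_nested(page_dict, ['original', 'source'])
image_file = get_nested(page_dict, ['pageprops', 'page_image_free'])  # key is absent
print(image_url, image_file)  # https://example.org/image.jpg nan
```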

In [32]:
def add_link_data_to_db(page_data_dict, database_file = DATABASE_FILE, doe = DOE):
    """
    Add wiki_link data to wikipedia_tdih.db
    
    Params:
        page_data_dict: dictionary with data about wiki_link
                        e.g. page_data_dict = {'link_title': ,
                                               'status_code': ,
                                               'created_at': ,
                                               'page_size': ,
                                               'incoming_links': ,
                                               'coordinates': ,
                                               'page_score': ,
                                               'first_paragraphs': ,
                                               'image_url': ,
                                               'image_file': ,
                                               'wikibase_shortdesc': ,
                                               'wikibase_item': ,
                                               'wiki_desc': ,
                                               'page_views': }
        database_file: database_file to save data to
        doe: date of entry (defaults to current day)
        
    Returns:
        update_results: a list of dictionaries with the status of each table update
                        e.g. {'doe': doe,
                              'wiki_table_name': 'wiki_link_info',
                              'update_status': 'update_complete',
                              'update_note': np.nan}
        
    """
    update_results = []
    no_image_for = []
    # connect to wikipedia_tdih.db
    with sqlite3.connect(database_file) as conn:
        cursor = conn.cursor()
    
        # get link_id from db
        cursor.execute('SELECT link_id, link_title FROM wiki_link WHERE link_title = ?', (page_data_dict['link_title'],))
        link_id = cursor.fetchone()[0]
    
        # update wiki_link tables and append results to update_results
        update_results.append(update_table_wiki_link_data_log(link_id, page_data_dict['status_code'], cursor, doe))
        update_results.append(update_table_wiki_link_page_views(link_id, page_data_dict['page_views'], cursor, doe))
        update_results.append(update_table_wiki_link_info(link_id, page_data_dict, cursor, doe))
        
        # get the image_url and image_file
        image_url = page_data_dict['image_url']
        image_file = page_data_dict['image_file']
        # if there's no image for the page, update list no_image_for 
        if not np.iterable(image_url) or not np.iterable(image_file):
            no_image_for = [link_id]
        # else, update image tables
        else:
            image_id, image_update = update_table_wiki_image(image_url, image_file, cursor, doe)
            update_results.append(image_update)
            update_results.append(update_table_wiki_image_usage(image_id, link_id, cursor, doe))
            
    return update_results

In [13]:
def update_table_wiki_link_data_log(link_id, status_code, cursor, doe = DOE):
    """
    Update table wiki_link_data_log
    
    Params:
        link_id: id of link to be updated in wiki_link_data_log
        status_code: requests.status_code from Wikipedia API
        cursor: database_cursor
        doe: date of entry (defaults to current day)
    
    Returns:
        update_status_dict: dictionary with status of wiki_link_data_log update
                            e.g. {'doe': doe,
                                  'wiki_table_name': 'wiki_link_data_log',
                                  'update_status': update_status,
                                  'update_note': np.nan}  # this is relevant to other tables (e.g. wiki_event)
    """
    
    data = (doe, link_id, status_code)
    
    try:
        cursor.execute('INSERT INTO wiki_link_data_log VALUES (null,?,?,?)', data)
        update_status = 'update_complete'
    
    except Exception as e:
        update_status = repr(e)
        
    update_status_dict = {'doe': doe,
                          'wiki_table_name': 'wiki_link_data_log',
                          'update_status': update_status,
                          'update_note': np.nan}
        
    return update_status_dict

In [39]:
def update_table_wiki_link_page_views(link_id, page_views, cursor, doe = DOE):
    """
    Update table wiki_link_page_views
    
    Params:
        link_id: id of wiki_link to update page_views for
        page_views: dictionary of page view data to update (date -> number of views)
        cursor: database cursor
        doe: date of entry (defaults to current day)
        
    Returns:
        update_status_dict: dictionary with status of wiki_link_page_views update
                            e.g. {'doe': doe,
                                  'wiki_table_name': 'wiki_link_page_views',
                                  'update_status': update_status,
                                  'update_note': link_id}
    """
    # build up data list
    data = [(doe, link_id, date, views) for date, views in page_views.items()]
    
    try:
        cursor.executemany('INSERT INTO wiki_link_page_views VALUES (null,?,?,?,?)', data)
        update_status = 'update_complete'
    
    except Exception as e:
        update_status = repr(e)
        
    update_status_dict = {'doe': doe,
                          'wiki_table_name': 'wiki_link_page_views',
                          'update_status': update_status,
                          'update_note': link_id}
        
    return update_status_dict

In [51]:
def update_table_wiki_link_info(link_id, page_dict, cursor, doe = DOE):
    """
    Update table wiki_link_info
    
    Params:
        link_id: id of wiki_link to update link info for
        page_dict: dictionary with data retrieved from Wikipedia API
                        eg. page_dict = {'link_title': link_title,
                                         'status_code': status_code,
                                         'created_at': created_at,
                                         'page_size': page_size,
                                         'incoming_links': incoming_links,
                                         'coordinates': coordinates,
                                         'page_score': page_score,
                                         'first_paragraphs': first_paragraphs,
                                         'image_url': image_url,
                                         'image_file': image_file,
                                         'wikibase_shortdesc': wikibase_short_desc,
                                         'wikibase_item': wikibase_item,
                                         'wiki_desc': wiki_desc,
                                         'page_views': page_views}
        cursor: database cursor
        doe: date of entry
        
    Returns:
        update_status_dict: dictionary with status of wiki_link_info update
                            e.g. {'doe': doe,
                                  'wiki_table_name': 'wiki_link_info',
                                  'update_status': update_status,
                                  'update_note': np.nan}  # this is relevant to other tables (e.g. wiki_event)
    """
    # format created_at from string like '2008-03-01T03:13:01Z' to '2008-03-01'
    created_at = page_dict['created_at'].lower().replace('z','')
    created_at = dt.datetime.fromisoformat(created_at).strftime('%Y-%m-%d')
    
    # build up data tuple
    data = (doe, 
            link_id, 
            created_at,
            int(page_dict['page_size']),
            int(page_dict['incoming_links']), 
            str(page_dict['coordinates']), 
            int(page_dict['page_score']),
            page_dict['first_paragraphs'], 
            page_dict['wikibase_shortdesc'],
            page_dict['wikibase_item'],
            str(page_dict['wiki_desc']))
    
    try:
#         print('wiki_link_info')
#         for d in data:
#             print(type(d), ': ', d)
#         print('---')
        # insert data into wiki_link_info
        cursor.execute('INSERT INTO wiki_link_info VALUES (null,?,?,?,?,?,?,?,?,?,?,?)', data)
        update_status = 'update_complete'
        
    except Exception as e:
        update_status = repr(e)
        
    update_status_dict = {'doe': doe,
                          'wiki_table_name': 'wiki_link_info',
                          'update_status': update_status,
                          'update_note': np.nan}
        
    return update_status_dict        
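
The `created_at` conversion can also be written with an explicit format string, which fails loudly if a timestamp has an unexpected shape. This is only a sketch: `to_date_string` is a hypothetical helper, and it assumes every timestamp looks like the example in the comment above ('2008-03-01T03:13:01Z').

```python
import datetime as dt

def to_date_string(timestamp):
    """Convert a timestamp like '2008-03-01T03:13:01Z' to '2008-03-01' (hypothetical helper)."""
    # strptime raises ValueError if the timestamp does not match the format
    parsed = dt.datetime.strptime(timestamp, '%Y-%m-%dT%H:%M:%SZ')
    return parsed.strftime('%Y-%m-%d')

print(to_date_string('2008-03-01T03:13:01Z'))  # 2008-03-01
```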

In [41]:
def update_table_wiki_image(image_url, image_file, cursor, doe = DOE):
    """
    Update table wiki_image if image is not already in the table.
    
    Params:
        image_url: wikipedia url for image used for wiki_link
        image_file: wikipedia file name for image used for wiki_link
        cursor: database cursor
        doe: date of entry (defaults to current day)
        
    Returns:
        image_id: id of the image in wiki_image (np.nan if the update failed)
        update_status_dict: dictionary with status of wiki_image update
                            e.g. {'doe': doe,
                                  'wiki_table_name': 'wiki_image',
                                  'update_status': update_status,
                                  'update_note': np.nan}  # this is relevant to other tables (e.g. wiki_event)
    """
    try:
        # insert image_url if not already in wiki_image    
        cursor.execute('INSERT OR IGNORE INTO wiki_image VALUES (null,?,?,?)', (doe, image_url, image_file))

        # get the image_id (to update wiki_image_usage)
        cursor.execute('SELECT image_id FROM wiki_image WHERE image_url = ?', (image_url,))
        image_id = cursor.fetchone()[0]
        
        update_status = 'update_complete'
        
    except Exception as e:
        update_status = repr(e)
        image_id = np.nan
        
    update_status_dict = {'doe': doe,
                          'wiki_table_name': 'wiki_image',
                          'update_status': update_status,
                          'update_note': np.nan}
    
    # return image_id
    return image_id, update_status_dict
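
`INSERT OR IGNORE` only skips a row when the insert would violate a constraint, so "insert if not already in the table" depends on how wiki_image is defined. The sketch below assumes a UNIQUE constraint on `image_url`; that schema is hypothetical and the row is made up.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()

# hypothetical schema: the UNIQUE constraint on image_url is an assumption
cur.execute('''CREATE TABLE wiki_image (image_id INTEGER PRIMARY KEY, doe TEXT,
                                        image_url TEXT UNIQUE, image_file TEXT)''')

row = ('2021-02-01', 'https://example.org/image.jpg', 'image.jpg')
cur.execute('INSERT OR IGNORE INTO wiki_image VALUES (null,?,?,?)', row)
cur.execute('INSERT OR IGNORE INTO wiki_image VALUES (null,?,?,?)', row)  # ignored: same image_url

n_rows = cur.execute('SELECT COUNT(*) FROM wiki_image').fetchone()[0]
image_id = cur.execute('SELECT image_id FROM wiki_image WHERE image_url = ?',
                       (row[1],)).fetchone()[0]
print(n_rows, image_id)  # 1 1
conn.close()
```

Without such a constraint, the second insert would succeed and the table would hold two rows for the same image.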

In [18]:
def update_table_wiki_image_usage(image_id, link_id, cursor, doe = DOE):
    """
    Update table wiki_image_usage.
    
    Params:
        image_id: id of image used by wiki_link (primary key in wiki_image)
        link_id: id of wiki_link (primary key in wiki_link)
        cursor: database cursor
        doe: date of entry (defaults to current day)
        
    Returns:
        update_status_dict: dictionary with status of wiki_image_usage update
                            e.g. {'doe': doe,
                                  'wiki_table_name': 'wiki_image_usage',
                                  'update_status': update_status,
                                  'update_note': np.nan}  # this is relevant to other tables (e.g. wiki_event)
    """
    try:
        # insert data into wiki_image_usage
        cursor.execute('INSERT INTO wiki_image_usage VALUES (null,?,?,?)', (doe, image_id, link_id))
        update_status = 'update_complete'
    
    except Exception as e:
        update_status = repr(e)
        
    update_status_dict = {'doe': doe,
                          'wiki_table_name': 'wiki_image_usage',
                          'update_status': update_status,
                          'update_note': np.nan}
    
    return update_status_dict

In [35]:
links_list, failed_request, missing_pages, update_results = link_data_main(database_file = DATABASE_FILE, 
                                                                            date = '', batch_size = 5, 
                                                                            doe = DOE, print_progress = 1,
                                                                            sleep_range = (5,10), url = URL, 
                                                                            params = PARAMS, headers = HEADERS)

links to update: 290
dict_keys(['1417207', '863', '345887', '193600', '55694068'])
Alaungpaya
American Civil War
Bar Confederation
Jay Treaty
Konbaung Dynasty
dict_keys(['49596931', '35019234', '53274', '84504', '3434750'])
Kilpatrick–Dahlgren Raid
Piedra Movediza
Richmond, Virginia
St. Petersburg, Florida
United States
dict_keys(['918969'])
Tandil
