# Wikipedia - this day in history <small>(step 3 - get image data)</small>
---
**Goal:** get image data for images in wikipedia_tdih.db (if no data already in database).

**Notes about this notebook:**  
- this notebook is for the third step of this project. 
- the notebook for the first step is [TADS_wikipedia_tdih_main_step_01_get_day_data_30jan21](https://github.com/Bianca-Aguglia/TADS_wikipedia_this_day_in_history_build_dataset/tree/master/notebooks)
- the notebook for the second step is [TADS_wikipedia_tdih_main_step_02_get_link_data_07feb21](https://github.com/Bianca-Aguglia/TADS_wikipedia_this_day_in_history_build_dataset/tree/master/notebooks)

### Process flow summary: <small>(with details and checkmarks for steps done in this notebook)</small>
1. get day data
    - get day data from wikipedia API, process it, and save it to wikipedia_tdih.db
2. get link data
    - get link data from wikipedia API, process it, and save it to wikipedia_tdih.db
3. get image data
    - [ ] query wikipedia_tdih.db for images that are in wiki_image but not in wiki_image_data_log
    - [ ] get image data from wikipedia API and extract the following:
        - [ ] user (used for wiki_credit)
        - [ ] image_url ?
        - [ ] copyright license
        - [ ] license name
        - [ ] license description
        - [ ] license usage rights

In [1]:
import requests
import time
import random
import sqlite3
import config
import numpy as np
import datetime as dt
from bs4 import BeautifulSoup

In [2]:
DATABASE_FILE = config.DATABASE_FILE
DOE = config.DOE
URL = config.WIKIPEDIA_URL
HEADERS = config.HEADERS
PARAMS = {'titles': '', # placeholder for page
          'action': 'query',
          'format': 'json',
          'prop': 'imageinfo',
          'iiprop': 'user|url|extmetadata', 
         }

# explanation of values for 'iiprop'
# user for wikipedia_user
# url for wikipedia_url for full picture
# extmetadata for license and attribution data
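
A minimal sketch of how a parameter dictionary like PARAMS turns into the query string of a single API request. It uses only the standard library and makes no network call. `EXAMPLE_URL` is an assumption (config.WIKIPEDIA_URL is not shown in this notebook), and the two file names are taken from the test output further down.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# assumption: the standard English Wikipedia API endpoint, for illustration only
# (the notebook reads the real value from config.WIKIPEDIA_URL)
EXAMPLE_URL = 'https://en.wikipedia.org/w/api.php'

example_params = {'titles': 'File:Flag_of_Haiti.svg|File:Flag_of_Morocco.svg',
                  'action': 'query',
                  'format': 'json',
                  'prop': 'imageinfo',
                  'iiprop': 'user|url|extmetadata'}

# build the full request url; '|' and ':' are percent-encoded by urlencode
full_url = f'{EXAMPLE_URL}?{urlencode(example_params)}'

# decode the query string again to check that nothing was lost
decoded = parse_qs(urlsplit(full_url).query)

print(full_url)
print(decoded['iiprop'])
```

The pipe character separates multiple values in both 'titles' and 'iiprop', which is what allows one request to cover a whole batch of images.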

In [3]:
def image_data_main(database_file = DATABASE_FILE, date = '', batch_size = 5, doe = DOE, print_progress = 0,
                    sleep_range = (5,10), url = URL, params = PARAMS, headers = HEADERS):
    """
    Main function for getting and processing the image data for a group of images in wikipedia_tdih.db.
    It uses several helper functions to break down the process into simple steps:
        - get_images_to_update_from_db
        - get_image_data
        - extract_image_data
        - add_image_data_to_db
    
    Params:
        database_file: database_file to read from, update, and save data to
        date: default of '' indicates that only images without data should be selected
              if date is given, select the images that have data but data has not been updated since specified date
              if date is given, format should be '%Y-%m-%d' (e.g. '2021-01-02' for January 2nd, 2021)
        batch_size: no. of images to request data for in a single call to wikipedia API (to reduce no. of API calls)
        doe: date of entry (defaults to current day)
        print_progress: if different than 0, prints batch number being processed (every print_progress no. of batches)
        sleep_range: tuple of integers indicating the lower_limit and upper_limit of how long to wait between
                     API calls
        url: url to use for request to wikipedia API
        params: params to use in request to wikipedia API
        headers: headers to use in request to wikipedia API
        
    Returns:
        image_list: list of images for which Wikipedia data is needed
        failed_request: how many requests to Wikipedia API didn't have requests.codes.ok (break at 5)
        missing_image_data: list of images no data was received for
        update_results: a list of dictionaries with the status of each table update
                        e.g. {'doe': doe,
                              'wiki_table_name': 'wiki_table_name',
                              'update_status': 'update_complete',
                              'update_note': np.nan}
    """
    # keep track of images no data was received for and of failed requests
    missing_image_data = []
    failed_request = 0
    resp_list = []
    
    # initialize update_results
    update_results = []
    
    with sqlite3.connect(database_file) as connection:
        cursor = connection.cursor()
    
        # select from wikipedia_tdih.db the images we need data for
        image_list = get_images_to_update_from_db(cursor, date = date)
        
        # if print_progress
        if print_progress:
            print(f'images to update: {len(image_list)}')
    
        # exit if no images need to have data retrieved or updated
        if not image_list:
            print('no image data needs to be retrieved / updated')
            return ((np.nan,) * 5)
        
        image_list = image_list[:7] # this is just for testing purposes
    
        # process images in batches of size batch_size
        for i in range(0, len(image_list), batch_size):
            image_batch = image_list[slice(i, i + batch_size)]
            
            # if print_progress is given
            if print_progress:
                if (i+1) % print_progress == 0:
                    print(f'processing batch {i+1} of {len(image_list)}')
            
            # get image data for images in batch
            resp = get_image_data(image_batch, url = url, params = params, headers = headers)
            resp_list.append(resp.json()) # this is just for testing
            
            # extract image_data (if resp.status_code == requests.codes.ok)
            if resp.status_code == requests.codes.ok:
                image_dicts, missing_batch_data = extract_image_data(resp)                
                            
                # if missing_batch_data list is not empty, update missing_image_data
                missing_image_data.extend(missing_batch_data)

            else:
                # for each image, update wiki_image_data_log with response status_code
                for image in image_batch:
                    update_results.append(update_table_wiki_image_data_log(image[0], # image[0] is the image_id
                                                                           resp.status_code, 
                                                                           cursor, 
                                                                           doe))
                
                failed_request += 1
                if failed_request == 5:
                    print('failed requests limit reached.')
                    return image_list, failed_request, missing_image_data, update_results, resp_list
                continue

            # add image_data to db and update update_results list
            for image_dict in image_dicts:
                update_results.extend(add_image_data_to_db(image_dict, cursor, doe))
                
            # sleep time before the next API call
            time.sleep(round(random.uniform(sleep_range[0], sleep_range[1]), 2))
            
        cursor.close()
                
    return image_list, failed_request, missing_image_data, update_results, resp_list
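
The batching step in image_data_main can be sketched in isolation as follows. `make_batches` is a hypothetical helper written for this illustration (the notebook slices inside the loop instead), and the seven image tuples are copied from the test output below.

```python
def make_batches(items, batch_size):
    """Split items into consecutive lists of at most batch_size elements."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

# image tuples of the form (image_id, image_file), as in image_list
example_images = [(31, 'Evstafiev-sarajevo-building-burns.jpg'),
                  (29, 'Flag_of_Haiti.svg'),
                  (14, 'Flag_of_Morocco.svg'),
                  (2, 'Flag_of_South_Carolina.svg'),
                  (18, 'Flag_of_Vietnam.svg'),
                  (1, 'Flag_of_the_Czech_Republic.svg'),
                  (16, 'Gordie_Howe_Chex_card.jpg')]

batches = make_batches(example_images, batch_size = 5)
batch_sizes = [len(batch) for batch in batches]

# index of the first image of each batch (the values the loop variable i takes)
start_indexes = list(range(0, len(example_images), 5))

print(batch_sizes)
print(start_indexes)
```

With 7 images and batch_size = 5 there are two batches (5 images, then 2). The loop variable in image_data_main is the index of the first image in each batch (0, then 5), which is why the progress messages in the test run read 'batch 1 of 7' and 'batch 6 of 7'.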

In [19]:
def get_images_to_update_from_db(database_cursor, date = ''):
    """
    Select from wikipedia_tdih.db the images for which data is needed. 
    
    There are two cases in which data is needed for a specific image:
        1. the image has just been added to the wikipedia_tdih.db and image data has not been requested from 
           wikipedia API yet
        2. the existing data for the image needs to be updated
    
    Params:
        database_cursor: database cursor
        date: default of '' indicates that only images without data should be selected
              if date is given, select the images that have data but data has not been updated since specified date
              if date is given, format should be '%Y-%m-%d' (e.g. '2021-01-02' for January 2nd, 2021)
              the date is compared with the doe values in wiki_image_data_log
        
    Returns:
        image_list: list of images for which wikipedia data is needed.
    """
    cursor = database_cursor
    
    # if no date is given
    if not date:
    
        image_list = cursor.execute('''SELECT image_id, image_file FROM wiki_image WHERE image_id NOT IN (
                                        SELECT image_id FROM wiki_image_info )''').fetchall()
    
    # if date is given
    else:
        image_list = cursor.execute('''SELECT image_id, image_file FROM wiki_image WHERE image_id IN (
                                       SELECT image_id FROM wiki_image_data_log WHERE doe > ?)''', (date,)).fetchall()
      
    return image_list
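
A minimal sketch of the "no date given" query, run against an in-memory database. The two-table schema here is an assumption reduced to the columns the query needs; the real tables in wikipedia_tdih.db have more columns.

```python
import sqlite3

connection = sqlite3.connect(':memory:')
cursor = connection.cursor()

# assumption: simplified versions of wiki_image and wiki_image_info
cursor.execute('CREATE TABLE wiki_image (image_id INTEGER PRIMARY KEY, image_file TEXT)')
cursor.execute('CREATE TABLE wiki_image_info (image_info_id INTEGER PRIMARY KEY, image_id INTEGER)')

cursor.executemany('INSERT INTO wiki_image VALUES (?,?)',
                   [(1, 'Flag_of_the_Czech_Republic.svg'),
                    (2, 'Flag_of_South_Carolina.svg'),
                    (14, 'Flag_of_Morocco.svg')])

# only image 2 already has data
cursor.execute('INSERT INTO wiki_image_info VALUES (null, ?)', (2,))

# same NOT IN subquery as above, with ORDER BY added to make the result order explicit
images_without_data = cursor.execute('''SELECT image_id, image_file FROM wiki_image
                                        WHERE image_id NOT IN (SELECT image_id FROM wiki_image_info)
                                        ORDER BY image_id''').fetchall()
connection.close()

print(images_without_data)
```

The result is a list of (image_id, image_file) tuples, which is the form get_image_data expects for each image in a batch.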

In [5]:
def get_image_data(image_batch, url = URL, params = PARAMS, headers = HEADERS):
    """
    Get image data from Wikipedia API. Images are processed in batches (usually of size 5) to reduce the number
    of API calls.
    
    Params:
        image_batch: list of tuples representing images to get data for (usually a batch of size 5)
                     each tuple is of the form (image_id, image_file)
        url: url for wikipedia API
        params: params to use in API call
        headers: headers to use in API call
        
    Returns:
        resp: requests response object
    """
    # join the image_files for images in image_batch
    titles = '|'.join(f'File:{wiki_image[1]}' for wiki_image in image_batch)
    
    # copy params before setting titles, so the shared default PARAMS dict is not modified
    params = {**params, 'titles': titles}
    
    # request data
    resp = requests.get(url = url, headers = headers, params = params)
    
    return resp
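
A short sketch of how the 'titles' value is assembled for a batch, and of one way to set it without changing a dictionary that is shared between calls. `base_params` is a cut-down stand-in for PARAMS, used here only for illustration.

```python
# image tuples of the form (image_id, image_file)
image_batch = [(29, 'Flag_of_Haiti.svg'), (14, 'Flag_of_Morocco.svg')]

# cut-down stand-in for PARAMS (illustration only)
base_params = {'titles': '', 'action': 'query'}

# the API expects file titles prefixed with 'File:' and separated by '|'
titles = '|'.join(f'File:{image_file}' for _, image_file in image_batch)

# build a new dictionary for this request; base_params is left unchanged
request_params = {**base_params, 'titles': titles}

print(request_params['titles'])
print(base_params['titles'] == '')
```

Building a new dictionary per request keeps the placeholder in the shared defaults empty, so one batch's titles cannot leak into a later call.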

In [6]:
def extract_image_data(response):
    """
    Extract image data for a batch of images.
    
    Params:
        response: requests response from Wikipedia API
        
    Returns:
        image_dict_list: list of dictionaries with data for each image in the batch.
                         e.g. see image_dict below
        missing_image_data: list of titles of images that have no imageinfo in the response
    """
    resp = response.json()
    image_dict_list = [] 
    missing_image_data = []
    
    # wikipedia normalizes file names (e.g. 'File:François_Ier_Louvre.jpg' to 'File:François Ier Louvre.jpg')
    # keep track of original vs. normalized file names (to match to file name in wikipedia_tdih.db)
    # 'normalized' is missing from the response if no file name needed to be normalized
    file_names = resp['query'].get('normalized', [])
    file_names_dict = {file['to']:file['from'] for file in file_names}
    
    # fields that, if available, are dictionaries from which data in ['value'] needs to be extracted
    fields_with_value = ['image_credit','image_description','image_license_name', 'image_usage_terms', 
                         'image_attrib_required', 'image_copyright', 'image_restriction', 'image_license']
    
    # image data is dictionary resp['query']['pages'] where each key has the data for one image
    for image in resp['query']['pages'].values():
        
        # if image doesn't have imageinfo, add to missing_image_data
        if 'imageinfo' not in image.keys():
            missing_image_data.append(image['title'])
            continue
            
        # else, build up image_dict
        image_dict = {'title' : file_names_dict.get(image['title'], image['title']).replace('File:',''), # falls back to image['title'] if title was not normalized
                      'title_normalized': image['title'].replace('File:',''),
                      'image_repository' : image['imagerepository'],
                      'user' : image['imageinfo'][0]['user'],
                      'image_url' : image['imageinfo'][0]['url'],
                      'image_date' : image['imageinfo'][0]['extmetadata']['DateTime']['value'],
                      'image_credit' : image['imageinfo'][0]['extmetadata'].get('Credit', np.nan),
                      'image_description' : image['imageinfo'][0]['extmetadata'].get('ImageDescription', np.nan),
                      'image_license_name' : image['imageinfo'][0]['extmetadata'].get('LicenseShortName', np.nan),
                      'image_usage_terms' : image['imageinfo'][0]['extmetadata'].get('UsageTerms', np.nan),
                      'image_attrib_required' : image['imageinfo'][0]['extmetadata'].get('AttributionRequired', np.nan),
                      'image_copyright' : image['imageinfo'][0]['extmetadata'].get('Copyrighted', np.nan),
                      'image_restriction' : image['imageinfo'][0]['extmetadata'].get('Restrictions', np.nan),
                      'image_license' : image['imageinfo'][0]['extmetadata'].get('License', np.nan)}
        
        # if available, add data from 'value'
        for field in fields_with_value:
            if isinstance(image_dict[field], dict):
                image_dict[field] = image_dict[field]['value']

        # if image has image_description it is usually html data 
        # extract text for image_description
        if isinstance(image_dict['image_description'], str):
            image_dict['image_description'] = BeautifulSoup(image_dict['image_description'], 'html.parser').text
            
        # standardize the data in image_attrib_required and image_copyright (sample values are 'False', 'false', 'True', etc)
        # missing values are np.nan (not strings) and are left as they are
        for str_data in ['image_attrib_required', 'image_copyright']:
            if isinstance(image_dict[str_data], str):
                image_dict[str_data] = image_dict[str_data].strip().lower()
    
        image_dict_list.append(image_dict)
        
    return image_dict_list, missing_image_data
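
The core of the extraction step, sketched on a small hand-written sample. The sample is not real API output. It only mimics the shape of the JSON that extract_image_data reads: a 'normalized' list, a 'pages' dictionary, and extmetadata fields that keep their data under 'value'. 'ExampleUser' and the example.org url are placeholders.

```python
# hand-written sample (not real API output): one image with imageinfo, one without
sample = {'query': {'normalized': [{'from': 'File:Flag_of_Haiti.svg',
                                    'to': 'File:Flag of Haiti.svg'}],
                    'pages': {'-1': {'title': 'File:Flag of Haiti.svg',
                                     'imagerepository': 'shared',
                                     'imageinfo': [{'user': 'ExampleUser',
                                                    'url': 'https://example.org/Flag_of_Haiti.svg',
                                                    'extmetadata': {'LicenseShortName': {'value': 'Public domain'}}}]},
                              '-2': {'title': 'File:1941hattie.jpg',
                                     'imagerepository': ''}}}}

# map normalized titles back to the titles that were requested
normalized = sample['query'].get('normalized', [])
requested_title = {item['to']: item['from'] for item in normalized}

license_by_file = {}
missing = []

for page in sample['query']['pages'].values():
    # pages without imageinfo are collected as missing
    if 'imageinfo' not in page:
        missing.append(page['title'])
        continue

    extmetadata = page['imageinfo'][0]['extmetadata']
    file_name = requested_title.get(page['title'], page['title']).replace('File:', '')

    # each extmetadata field is a dictionary; the data is under 'value'
    license_by_file[file_name] = extmetadata.get('LicenseShortName', {}).get('value')

print(license_by_file)
print(missing)
```

The file name recovered through the 'normalized' mapping keeps its underscores, so it can be matched against image_file in wikipedia_tdih.db.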

In [7]:
# TO-DO
# for update_results: is it better to only log failures
def add_image_data_to_db(image_dict, cursor, doe = DOE):
    """
    Add image data to wikipedia_tdih.db
    
    Params:
        image_dict: dictionary of data to be added to the database
        cursor: database cursor
        doe: date of entry (defaults to current day)
        
    Returns:
        update_results: a list of dictionaries with the status of each table update
                        e.g. {'doe': doe,
                              'wiki_table_name': 'wiki_event',
                              'update_status': 'update_complete',
                              'update_note': 'events'}
        
    """
    update_results = []
    

    # get license_id (update table wiki_license first, if license_id not present)
    license_data = (image_dict['image_license_name'], image_dict['image_usage_terms'],
                    image_dict['image_attrib_required'], image_dict['image_copyright'])
    license_id, update_result = update_table_wiki_license(license_data, cursor, doe)
    # if wiki_license was updated, update_result is a non-empty dict
    # TO-DO: update_result is also non-empty if update_table_wiki_license failed => need to handle that
    if update_result:
        update_results.append(update_result)

    # get user_id (update table wiki_user first, if user_id not present) 
    user_id, update_result = update_table_wiki_user(image_dict['user'], cursor, doe)
    # TO-DO: action for when update_result is not empty => failure to update wiki_user => user_id = ''
    if update_result:
        update_results.append(update_result)

    # get image_id
    cursor.execute('SELECT image_id FROM wiki_image WHERE image_url = ?', (image_dict['image_url'],))
    image_id = cursor.fetchone()[0]

    # update table wiki_image_info
    image_data = (doe, image_id, license_id, user_id,
                  image_dict['image_repository'], image_dict['image_date'], image_dict['image_credit'],
                  image_dict['image_description'])
    update_results.append(update_table_wiki_image_info(image_data, cursor, doe))
    
    return update_results    

In [8]:
#
def update_table_wiki_image_data_log(image_id, status_code, cursor, doe = DOE):
    """
    Update table wiki_image_data_log
    
    Params:
        image_id: id of image to be updated in wiki_image_data_log
        status_code: requests.status_code from Wikipedia API
        cursor: database_cursor
        doe: date of entry (defaults to current day)
    
    Returns:
        update_status_dict: dictionary with status of wiki_image_data_log update
                            e.g. {'doe': doe,
                                  'wiki_table_name': 'wiki_image_data_log',
                                  'update_status': update_status,
                                  'update_note': np.nan}  # this is relevant to other tables (e.g. wiki_event)
    """
    
    data = (doe, image_id, status_code)
    
    try:
        cursor.execute('INSERT INTO wiki_image_data_log VALUES (null,?,?,?)', data)
        update_status = 'update_complete'
    
    except Exception as e:
        update_status = repr(e)
        
    update_status_dict = {'doe': doe,
                          'wiki_table_name': 'wiki_image_data_log',
                          'update_status': update_status,
                          'update_note': np.nan}
        
    return update_status_dict

In [9]:
def update_table_wiki_image_info(image_data, cursor, doe = DOE):
    """
    Update table wiki_image_info.
    
    Params:
        image_data: tuple with data for image
        cursor: database cursor
        doe: date of entry (defaults to current day)
        
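    Returns:
        update_status_dict: dictionary with status of wiki_image_info update
                            this is an empty dictionary if the insert was successful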
    """
    update_status_dict = {}
    
    try:
        cursor.execute('INSERT INTO wiki_image_info VALUES (null,?,?,?,?,?,?,?,?)', image_data)

    except Exception as e:
        update_status = repr(e)
        update_status_dict = {'doe': doe,
                              'wiki_table_name': 'wiki_image_info',
                              'update_status': update_status,
                              'update_note': image_data[1]} # image_data[1] is the image_id
        
    return update_status_dict        

In [10]:
# TO-DO
# update_status for when user already in wiki_user
# is it better to only log failures?
def update_table_wiki_user(user_name, cursor, doe = DOE):
    """
    Update (if needed) table wiki_user
    
    Params:
        user_name: wiki_user_name
        cursor: database cursor
        doe: date of entry (defaults to current day)
        
    Returns:
        user_id
        update_status_dict: dictionary with status of wiki_user update
                            e.g. {'doe': doe,
                                  'wiki_table_name': 'wiki_user',
                                  'update_status': update_status,
                                  'update_note': user_name} 
                            this is an empty dictionary if user was already in wiki_user
    """
    update_status_dict = {}
    user_id = ''
    
    try:
        # add user to wiki_user (if not already in database)
        cursor.execute('INSERT or IGNORE INTO wiki_user VALUES (null,?,?)', (doe, user_name))

        # get user_id
        cursor.execute('SELECT user_id FROM wiki_user WHERE user_name = ?', (user_name,))
        user_id = cursor.fetchone()[0]

    except Exception as e:
        update_status = repr(e)
        update_status_dict = {'doe': doe,
                              'wiki_table_name': 'wiki_user',
                              'update_status': update_status,
                              'update_note': user_name}
        
    return user_id, update_status_dict
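
The "insert if missing, then look up the id" pattern used for wiki_user can be sketched against an in-memory database. The schema below is an assumption: INSERT OR IGNORE only skips duplicates when a constraint such as UNIQUE on user_name exists, and the real table definition is not shown in this notebook. `get_or_create_user_id` and 'ExampleUser' are hypothetical names for this illustration.

```python
import sqlite3

def get_or_create_user_id(cursor, doe, user_name):
    """Insert user_name if it is not in wiki_user yet, then return its user_id."""
    cursor.execute('INSERT OR IGNORE INTO wiki_user VALUES (null,?,?)', (doe, user_name))
    cursor.execute('SELECT user_id FROM wiki_user WHERE user_name = ?', (user_name,))
    return cursor.fetchone()[0]

connection = sqlite3.connect(':memory:')
cursor = connection.cursor()

# assumption: user_name is declared UNIQUE, so a repeated insert is ignored
cursor.execute('''CREATE TABLE wiki_user (user_id INTEGER PRIMARY KEY,
                                          doe TEXT,
                                          user_name TEXT UNIQUE)''')

first_id = get_or_create_user_id(cursor, '2021-02-14', 'ExampleUser')
second_id = get_or_create_user_id(cursor, '2021-02-15', 'ExampleUser')
row_count = cursor.execute('SELECT COUNT(*) FROM wiki_user').fetchone()[0]
connection.close()

print(first_id, second_id, row_count)
```

Both calls return the same id and the table holds a single row, so the doe of the first insert is the one that is kept.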

In [11]:
def update_table_wiki_license(license_data, cursor, doe = DOE):
    """
    Update (if needed) table wiki_license
    
    Params:
        license_data: tuple of the form (license_name, usage_terms, attrib_required, copyright)
        cursor: database cursor
        doe: date of entry (defaults to current day)
        
    Returns:
        license_id: license_id
        update_status_dict: dictionary with status of wiki_license update
                            e.g. {'doe': doe,
                                  'wiki_table_name': 'wiki_license',
                                  'update_status': update_status,
                                  'update_note': license_data} 
                            this is an empty dictionary if license was already in wiki_license
    """
    update_status_dict = {}
    license_id = ''
    
    cursor.execute('''SELECT license_id FROM wiki_license WHERE 
                        license_name = ? AND
                        license_description = ? AND
                        attrib_required = ? AND
                        copyright = ?''', license_data)
    
    # if the license is already in wiki_license, return its license_id
    # (update_status_dict stays an empty dictionary)
    row = cursor.fetchone()
    if row is not None:
        return row[0], update_status_dict
    
    # else, insert the new license type into wiki_license
    try:
        license_data = (doe, *license_data)
        cursor.execute('INSERT INTO wiki_license VALUES (null, ?,?,?,?,?)', license_data)
        license_id = cursor.lastrowid
        update_status = 'update_complete'
    
    except Exception as e:
        update_status = repr(e)

    update_status_dict = {'doe': doe,
                          'wiki_table_name': 'wiki_license',
                          'update_status': update_status,
                          'update_note': license_data}
    
    return license_id, update_status_dict

### Testing

In [20]:
il, fr, mid, ur, resp = image_data_main(database_file = DATABASE_FILE, date = '', batch_size = 5, 
                                  doe = DOE, print_progress = 1, sleep_range = (5,10),
                                  url = URL, params = PARAMS, headers = HEADERS)

images to update: 25
processing batch 1 of 7
in wiki_license
in wiki_user
in wiki_image_info
in wiki_license
in wiki_user
in wiki_image_info
in wiki_license
in wiki_user
in wiki_image_info
in wiki_license
in wiki_user
in wiki_image_info
in wiki_license
in wiki_user
in wiki_image_info
processing batch 6 of 7
in wiki_license
in wiki_user
in wiki_image_info
in wiki_license
in wiki_user
in wiki_image_info


In [21]:
il

[(31, 'Evstafiev-sarajevo-building-burns.jpg'),
 (29, 'Flag_of_Haiti.svg'),
 (14, 'Flag_of_Morocco.svg'),
 (2, 'Flag_of_South_Carolina.svg'),
 (18, 'Flag_of_Vietnam.svg'),
 (1, 'Flag_of_the_Czech_Republic.svg'),
 (16, 'Gordie_Howe_Chex_card.jpg')]

In [20]:
# image_list, failed_request, missing_image_data, update_results, resp_list
il, fr, mid, ur, resp = image_data_main(database_file = DATABASE_FILE, date = '', batch_size = 5, 
                                  doe = DOE, print_progress = 0, sleep_range = (5,10),
                                  url = URL, params = PARAMS, headers = HEADERS)

In [21]:
il

[(6, '1941hattie.jpg'),
 (11, 'A_Finnish_Maxim_M-32_machine_gun_nest_during_the_Winter_War.jpg'),
 (24, 'ApartheidSignEnglishAfrikaans.jpg'),
 (21, 'Archbishop-Tutu-medium.jpg'),
 (8, 'Berkeley-downtown-Bay-bridge-SF-in-back-from-Lab.jpg'),
 (28, 'Boeing_737-222,_Braniff_(American_Airlines)_AN0203004.jpg')]

In [22]:
fr

0

In [23]:
mid

['1941hattie.jpg',
 'A Finnish Maxim M-32 machine gun nest during the Winter War.jpg',
 'ApartheidSignEnglishAfrikaans.jpg',
 'Archbishop-Tutu-medium.jpg',
 'Berkeley-downtown-Bay-bridge-SF-in-back-from-Lab.jpg',
 'Boeing 737-222, Braniff (American Airlines) AN0203004.jpg']

In [24]:
ur

[]