# Wikipedia - this day in history <small>(step 2)</small>
---
**Goal:** create a dataset of this-day-in-history events  
  
**Context:** I need a starting dataset of world history events (name, short description, long description, image, and location). I couldn't find a public dataset, so I'm building one from Wikipedia's this-day-in-history data.

**Notes about this notebook:**  
- this notebook is for the second step of this project. 
- the notebook for the first step is [TADS_wikipedia_tdih_main_api_step_01_02_get_data_29jan21](http://localhost:8888/notebooks/temp_for_offline/bianca_aguglia/projects_wip/TADS_wikipedia_this_day_in_history/TADS_wikipedia_tdih_main_api_step_01_02_get_data_29jan21.ipynb)
- the first step consisted of:
    - using the Wikipedia api to get the events for each day of the year
    - parsing the Wikipedia data, cleaning it up, and getting it in the right format needed for the SQLite database created for this project
- step two is:
    - take the data from step one and rank each item based on page views, page length, and links to page
    - get image (and licence details) for each item in the data from step one
- to get data for a specific day, run `day_data_main(day_name)`; it returns a dictionary with the day's data.
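
The notes above say items are ranked on page views, page length, and links to page, but not how the three signals are combined. One possible approach, shown here only as a rough sketch, is to min-max normalize each signal and take a weighted sum. The function names, dictionary keys, weights, and sample numbers below are all hypothetical and not taken from the project.

```python
def min_max(values):
    """Scale a list of numbers to the 0-1 range (all zeros if they are equal)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def rank_items(items, weights=(0.5, 0.25, 0.25)):
    """Return item titles ordered from highest to lowest combined score.

    weights are (page_views, page_length, links_to_page); the values are
    illustrative assumptions, not the project's actual weighting.
    """
    views = min_max([item['page_views'] for item in items])
    lengths = min_max([item['page_length'] for item in items])
    links = min_max([item['links_to_page'] for item in items])

    scored = []
    for item, v, l, k in zip(items, views, lengths, links):
        score = weights[0] * v + weights[1] * l + weights[2] * k
        scored.append((score, item['title']))

    # highest score first
    return [title for score, title in sorted(scored, reverse=True)]


# made-up sample data
items = [
    {'title': 'A', 'page_views': 1000, 'page_length': 50000, 'links_to_page': 300},
    {'title': 'B', 'page_views': 100, 'page_length': 80000, 'links_to_page': 100},
    {'title': 'C', 'page_views': 10, 'page_length': 20000, 'links_to_page': 20},
]
ranked = rank_items(items)
```

Normalizing first keeps a large-valued signal such as page length from dominating the score purely because of its scale.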
  

### Process flow: <small>(with checkmarks for steps done in this notebook)</small>
- get day data
    - save data in wikipedia_tdih.db
- get link data
    - [ ] query wikipedia_tdih.db for:
        - [x] links that are new (i.e. in wiki_link but not in wiki_get_link_data_log)
        - [x] links that have not been updated since a specified date
    - 

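The two link selections listed above can be tried out on a throwaway in-memory database. This is a minimal sketch: the two-column schema, the sample URLs, and the assumption that `doe` stores the log entry date as `'%Y-%m-%d'` text are illustrative guesses, and the real tables in wikipedia_tdih.db likely have more columns.

```python
import sqlite3

# hypothetical minimal schema: only the columns the two selections need
connection = sqlite3.connect(':memory:')
c = connection.cursor()
c.execute('CREATE TABLE wiki_link (link_id INTEGER PRIMARY KEY, link_url TEXT)')
c.execute('CREATE TABLE wiki_get_link_data_log (link_id INTEGER, doe TEXT)')

c.executemany('INSERT INTO wiki_link VALUES (?, ?)',
              [(1, 'https://en.wikipedia.org/wiki/Apollo_11'),
               (2, 'https://en.wikipedia.org/wiki/Magna_Carta'),
               (3, 'https://en.wikipedia.org/wiki/Rosetta_Stone')])

# link 1 was last logged in 2020, link 2 in 2021, link 3 has no log entry
c.executemany('INSERT INTO wiki_get_link_data_log VALUES (?, ?)',
              [(1, '2020-06-15'), (2, '2021-01-20')])

# new links: in wiki_link but not in wiki_get_link_data_log
new_links = c.execute('''SELECT link_id FROM wiki_link WHERE link_id NOT IN (
                             SELECT link_id FROM wiki_get_link_data_log)
                         ORDER BY link_id''').fetchall()

# stale links: most recent log entry is older than the specified date
# (ISO-formatted date strings sort in date order, so text comparison works)
stale_links = c.execute('''SELECT link_id FROM wiki_link WHERE link_id IN (
                               SELECT link_id FROM wiki_get_link_data_log
                               GROUP BY link_id HAVING MAX(doe) < ?)
                           ORDER BY link_id''', ('2021-01-02',)).fetchall()

connection.close()
```

With this sample data, `new_links` holds only link 3 and `stale_links` holds only link 1.
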
In [None]:
import sqlite3
import config
import datetime as dt

In [None]:
DATABASE_FILE = 'wikipedia_tdih.db'
HEADERS = config.HEADERS

In [None]:
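def get_link_data_main(database_file = DATABASE_FILE, date = ''):
    """
    Get link data for a group of links in wikipedia_tdih.db.
    
    Params:
        database_file: path to the SQLite database file (default: DATABASE_FILE)
        date: passed on to get_links_to_update_from_db
              default of '' selects only the links without data
              if date is given, format should be '%Y-%m-%d' (e.g. '2021-01-02')
        
    Returns:
        None
    """
    # use the database_file argument (not the DATABASE_FILE global) so the parameter takes effect
    with sqlite3.connect(database_file) as connection:
        c = connection.cursor()
        
        # query while the connection block is active
        links_list = get_links_to_update_from_db(c, date)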

In [None]:
def get_links_to_update_from_db(database_cursor, date = ''):
    """
    Select from wikipedia_tdih.db the links for which wikipedia data is needed.
    
    There are two cases in which data is needed for a specific link:
        1. the link has just been added to the wikipedia_tdih.db and link data has not been requested from 
           wikipedia API yet
        2. the existing data for the link needs to be updated
    
    Params:
        database_cursor: cursor for an open connection to wikipedia_tdih.db
        date: default of '' indicates that only links without data should be selected
              if date is given, select the links whose data has not been updated since the specified date
              if date is given, format should be '%Y-%m-%d' (e.g. '2021-01-02' for January 2nd, 2021)
              
    Returns:
        links_list: list of (link_id, link_url) tuples for the links for which Wikipedia data is needed
    """
    
    c = database_cursor
    
    # if no date is given
    if not date:
        # select the links that are in wiki_link but not in wiki_get_link_data_log
        # (these are links which have just been added to wiki_link)
        links_list = c.execute('''SELECT link_id, link_url FROM wiki_link WHERE link_id NOT IN (
                                    SELECT link_id FROM wiki_get_link_data_log) ''').fetchall()
    
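    # if date is given
    else:
        # select the links whose most recent log entry is older than the given date
        # (assumes doe holds the entry date as '%Y-%m-%d' text, so text comparison follows date order)
        links_list = c.execute('''SELECT link_id, link_url FROM wiki_link WHERE link_id IN (
                                    SELECT link_id FROM wiki_get_link_data_log
                                    GROUP BY link_id HAVING MAX(doe) < ?)''', (date,)).fetchall()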
      
    return links_list