# Brazil Hemeroteca Scraper
## By: Robert C. Fritzen (Ph.D. Student, Dept. Geographic & Atmospheric Sciences) for Dr. Hanley's Research
### Contact: rfritzen1@niu.edu

### Project Synopsis
The purpose of this Python notebook is to retrieve, from the Brazil Hemeroteca, links to articles that (1) appear in a specific periodical, (2) fall within a time range, and (3) contain a search term. In other words, it automates the search that the Hemeroteca itself provides.

This notebook accesses the Hemeroteca and, for each matching article, builds a hyperlink back to the page with an embedded argument that highlights the matching search term. All of the links are then saved to an Excel spreadsheet.
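
As a rough illustration of what such a link looks like, here is a minimal sketch based on the URL pattern that appears in the functions further down (`http://memoria.bn.br/DocReader/<id>/<page>` plus a `?pesq=` argument). The function name `build_article_link` is hypothetical and is not part of the script; the example IDs are taken from the debug output shown later in this notebook.

```python
def build_article_link(bib_id, page, term=None):
    """Return a DocReader link, optionally with a highlight argument.

    bib_id and page correspond to the hidden form values the script reads
    from the results page; term is the search term to highlight.
    """
    link = "http://memoria.bn.br/DocReader/" + str(bib_id) + "/" + str(page)
    if term:
        # The "?pesq=" argument asks the reader page to highlight the term
        link += "?pesq=" + term
    return link


print(build_article_link("287130", "1122", "Milho"))
```

The term is appended as-is, just as the script does. Terms containing spaces or accented characters may need URL quoting, which this sketch does not attempt.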

Within the spreadsheet, each column holds all of the matches for one particular search term. When you have multiple search terms, additional columns at the end collect the links matched by more than one term (i.e., 3 terms produce two extra columns, one for links matching two terms and one for links matching all three). The sheets along the bottom are the individual years.

**Special Note**: I'm going to assume you have at least a "basic" understanding of Python. I wrote most of this code to be very "user friendly" and to try to "minimize" any of the work you'll need to do to get things up and running. I spent many many long hours on Stack Overflow getting some of these things working, and I highly recommend using that as a resource if something goes wrong. 

**Special Note #2**: This code is not 100% perfect. I've coded this up with as much as I could in the short time span I had available to get this finished. There are things that may break from time to time, but will just "magically" work another time. Most of these problems are on Hemeroteca's end of things, and other times Chrome will break randomly on you. There's not much to be said about it. I put lots of debug outputs in the text to at least give you an idea of where things go wrong.

### Requirements
This Python code uses a few packages, all of which can be installed from a standard Anaconda2 distribution (https://www.anaconda.com/download/) with standard "conda install x" statements.

For example, to install numpy, you would run 'conda install numpy' on the Anaconda command line.

These are the required packages:

* numpy 
* pandas
* selenium (See Notes Below)
* time / copy (part of the Python standard library, so no separate install is needed)
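
If you are unsure whether everything in the list above is installed, a quick check like the following can help. This is only a sketch: the helper name `missing_packages` is made up for illustration and is not part of the script.

```python
import importlib

# Package names from the requirements list above
REQUIRED_PACKAGES = ["numpy", "pandas", "selenium", "time", "copy"]


def missing_packages(names):
    """Return the names from the list that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            # The package is not installed in the current environment
            missing.append(name)
    return missing
```

Calling `missing_packages(REQUIRED_PACKAGES)` returns an empty list when everything is available; anything it does return is a candidate for `conda install`.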

**Selenium**

Selenium is a cross-platform, cross-browser toolkit for automating tasks on webpages. This highly popular library lets you automate just about anything you can do on a webpage (clicking buttons, selecting options, etc.), which is why it is used for this package.

Installing Selenium is a bit different from most Python packages, however. In order to use the embedded library calls, you need to install the **selenium driver**, specifically the **google chrome** driver. This requires a Google Chrome installation on your machine. Once Chrome is installed, you can download the driver from here: https://sites.google.com/a/chromium.org/chromedriver/

Place the chrome driver executable file in the **SAME DIRECTORY** as this python notebook, otherwise you'll get errors when trying to run the codeblocks below.
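
To confirm the driver is where the notebook expects it, something like the sketch below can be run from the notebook's folder. The helper `find_chromedriver` is hypothetical, and the two file names it looks for (`chromedriver.exe` on Windows, `chromedriver` elsewhere) are assumptions about how the downloaded executable is named.

```python
import os


def find_chromedriver(directory="."):
    """Return the path of the driver executable in `directory`, or None."""
    # Assumed file names: Windows builds end in .exe, others do not
    for name in ("chromedriver.exe", "chromedriver"):
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate):
            return candidate
    return None
```

If `find_chromedriver()` returns `None`, the driver is not in the current directory and `webdriver.Chrome()` is likely to fail when the scraper starts.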

### Getting Started

Once you've got all of your dependencies installed, let's get started!

Run this first code block (Click inside of it, and press Shift + Enter) to load up the libraries!

In [1]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select
from selenium.common.exceptions import StaleElementReferenceException
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import TimeoutException
import time
import numpy as np
import pandas as pd
import copy

### Variables

This code block is where **most** of your work will be needed. These six parameters control **most** of the program. You will likely edit the first three variables only the first time you use the code, and then never touch them again.

The first variable (saveDir) defines where on your computer you would like files to be saved. Make sure this points to a folder that exists!
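
If you would rather not create the folder by hand, a small helper along these lines could do it before the script runs. This is a sketch only; `ensure_directory` is a hypothetical name, and the script itself does not create the folder for you.

```python
import os


def ensure_directory(path):
    """Create the folder (and any missing parents) if it does not exist yet."""
    if not os.path.isdir(path):
        os.makedirs(path)
    return path
```

For example, `ensure_directory(saveDir)` could be called right after `saveDir` is defined in the variables block.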

The next variable (outFile) defines the name of the Excel spreadsheet that will be created by the code. Just edit the string portion of that variable.

The third variable (tmpSaveFile) is the text file where the contents of the **lookup dictionary** are saved. This was mainly a late addition I made to the code to shave off a load of debugging hours. I left it in just in case something goes wrong, so you can either easily figure things out yourself or, if you're able to get in touch with me, make my job a bit easier.

The second block of variables below controls the search parameters. Dr. Hanley is aware of how this page works, so just make sure you follow the naming conventions **TO THE LETTER** (just because there is a seemingly random space between letters and a colon (:) doesn't mean you can remove it).

**Reference**: If at any point, something doesn't work from the naming convention, browse to the hemeroteca page (http://bndigital.bn.gov.br/hemeroteca-digital/), and click the dropdown boxes yourself to verify the naming.

The **search_journal** parameter is the name of the periodical (First parameter on hemeroteca) that you would like to search. This can take up to 40 seconds to populate when the page loads, so give it some time to load. The name of this parameter needs to be **COMPLETELY IDENTICAL**, including any extra symbols.
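
The reason the name has to match exactly is that the script looks up the dropdown entry by comparing its full text in an XPath expression. The sketch below mirrors the `xpath_journal` pattern used in `performHemerotecaLookup`; the helper name `journal_xpath` is made up for illustration.

```python
def journal_xpath(journal_name):
    """Build the XPath that selects the dropdown entry with this exact text."""
    # Same pattern as xpath_journal in performHemerotecaLookup
    return "//li[text()='" + journal_name + "']"


exact = journal_xpath("A Assembleia Legislativa Provincial do Espirito Santo (ES)")
missing_space = journal_xpath("A Assembleia Legislativa Provincial do Espirito Santo(ES)")

# A single missing space yields a different expression, so the entry is never found
print(exact == missing_space)
```

Because the comparison is on the whole text, even a one-character difference means the wait for the dropdown entry times out. A name containing an apostrophe would also break this simple string-built expression.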

The **search_timeRanges** parameter is a list which defines the time ranges you want to search. Each entry uses the format "YYYY - YYYY" (with the spaces around the dash). If Dr. Hanley is only asking for a single time range, the list still has one entry: ["YYYY - YYYY"].
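
To catch a badly formatted range before the browser even opens, a check like the following could be used. It is a sketch under the assumption that every range follows the "YYYY - YYYY" form; the names `parse_time_range` and `years_in_range` are hypothetical and are stricter than the script's own `time_range_to_tuple`, which simply slices the string.

```python
import re

# Four digits, space, dash, space, four digits
RANGE_PATTERN = re.compile(r"^(\d{4}) - (\d{4})$")


def parse_time_range(text):
    """Return (start_year, end_year) or raise ValueError on a bad format."""
    match = RANGE_PATTERN.match(text)
    if match is None:
        raise ValueError("Expected 'YYYY - YYYY', got: %r" % (text,))
    start, end = int(match.group(1)), int(match.group(2))
    if start > end:
        raise ValueError("Start year is after end year: %r" % (text,))
    return start, end


def years_in_range(text):
    """Return every year covered by the range, both ends included."""
    start, end = parse_time_range(text)
    return list(range(start, end + 1))


print(parse_time_range("1860 - 1869"))
print(years_in_range("1860 - 1862"))
```

The list of years is what ends up as the individual sheets in the spreadsheet, one per year.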

The **search_terms** is a list parameter containing all of the terms you wish to search the journal for, see the above for "single" term lookups.

In [5]:
# The primary directory where you want the output files to be saved
saveDir = "D:/Robert Docs/College/NIU/PhD/Sp18_Research/Images/"
# This is the excel spreadsheet file that will be saved
outFile = saveDir + "downloads.xlsx"
# This is where a .txt file containing temporary data for debugging will be saved
tmpSaveFile = saveDir + "tmpDict.txt"

# Define a single journal on the hemeroteca page to search
search_journal = "A Assembleia Legislativa Provincial do Espirito Santo (ES)"
# Define the time ranges you want to lookup
search_timeRanges = ["1860 - 1869", "1870 - 1879"]
# Define the search terms to look for
search_terms = ["Milho", "Feijao", "Arroz"]#["Milho", "Feijao", "Arroz", "Peixe"]

## Functions
The functions below make up the entire script. Most of the functionality should be in place such that you don't need to adjust much unless something breaks beyond a simple fix. Most of these are very basic Python functions, but some advanced trickery is used for the Selenium and Pandas portions.

Almost all of these code blocks can be deciphered from various Stack Overflow articles, which I will not cite here (because there are too many).

Just be sure to run this block after the above block and then you'll be good to go!

In [3]:
savedResultDictionary = {}

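def runScript():
    # Store the raw results in the module-level variable so they remain available after the run
    global savedResultDictionary
    print("Start Hemetoreca Scraper Script.")
    savedResultDictionary = grabData()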
    resultDictionary = copy.deepcopy(savedResultDictionary)
    writeDictionary(resultDictionary)
    allWritingYears = []
    for tRange in search_timeRanges:
        timeTuple = time_range_to_tuple(tRange)
        rTime = np.arange(timeTuple[0], timeTuple[1]+1)
        allWritingYears.extend(rTime)    
    print("Begin excel writing.")
    writeOutputExcelFile(allWritingYears, search_terms, resultDictionary)
    print("All operations completed.")

def time_range_to_tuple(timeRange):
    if(timeRange == "todos"):
        return 0
    return ((int(timeRange[0:4]), int(timeRange[7:])))

def hold_until_element_changed(driver, element1_xpath, element2_xpath, old_element1_text, old_element2_text, end_time = 30):
    pause_interval = 1
    # Use wall-clock time: time.clock() measures CPU time on some platforms and was removed in Python 3.8
    end = time.time() + end_time
    while time.time() < end:
        try:
            element1 = driver.find_element_by_xpath(element1_xpath)
            element2 = driver.find_element_by_xpath(element2_xpath)
            if (element1.get_attribute('value') != old_element1_text) or (element2.get_attribute('value') != old_element2_text):
                return True
        except StaleElementReferenceException:
            return True
        except NoSuchElementException:
            raise
        time.sleep(pause_interval)
    return False  

def grabData():
    TOTAL_OPEN_TABS = 0 
    
    resultDictionary = {}
    
    browser = webdriver.Chrome()
    allWritingYears = []
    for tRange in search_timeRanges:
        timeTuple = time_range_to_tuple(tRange)
        rTime = np.arange(timeTuple[0], timeTuple[1]+1)
        allWritingYears.append(rTime)
        for term in search_terms:
            pageReady = False
            while(pageReady == False):
                pageReady = performHemerotecaLookup(browser, TOTAL_OPEN_TABS, search_journal, tRange, term)
                if(pageReady == False):
                    hemeroteca_search_page = ""
            resultsDone = False
            while(resultsDone == False):
                resultsDone = handleHemerotecaResultsPage(browser, search_journal, tRange, term, resultDictionary)
            time.sleep(1)
    #
    print("All searches completed")
    browser.quit()
    return resultDictionary
            
def performHemerotecaLookup(browser, TOTAL_OPEN_TABS, searchJournal, timeRange, searchTerm):
    print("Opening tab for Hemeroteca front page")
    url = "http://bndigital.bn.gov.br/hemeroteca-digital/"
    
    # switch_to.window replaces the deprecated switch_to_window call
    browser.switch_to.window(browser.window_handles[-1])
    TOTAL_OPEN_TABS += 1
    browser.get(url) 
   
    frame_ref = browser.find_elements_by_tag_name("iframe")[0]
    iframe = browser.switch_to.frame(frame_ref)
    journal = browser.find_element_by_id("PeriodicoCmb1_Input")

    xpath_form = "//input[@name=\'PesquisarBtn1\']"
    xpath_journal = "//li[text()=\'"+searchJournal+"\']"
    xpath_timeRange = "//input[@name=\'PeriodoCmb1\' and not(@disabled)]"
    xpath_timeSelect = "//li[text()=\'"+timeRange+"\']"
    xpath_searchTerm = "//input[@name=\'PesquisaTxt1\']"

    print("Locating Journal/Periodical")
    journal.click()
    try:
        dropDownJournal = WebDriverWait(browser, 60).until(EC.presence_of_element_located((By.XPATH, xpath_journal)))
        dropDownJournal.click()
    except TimeoutException:
        print("The journal list took too long to update, closing tab.")
        hemeroteca_search_page = ""
        TOTAL_OPEN_TABS -= 1
        if(TOTAL_OPEN_TABS != 0):
            browser.close()
        return False
        
    print("Waiting for Time Selection")
    try:
        timeRangeElement = WebDriverWait(browser, 20).until(EC.presence_of_element_located((By.XPATH, xpath_timeRange)))
        timeRangeElement.click()
        time.sleep(1)
        print("Locating Time Range")    
        dropDownTime = WebDriverWait(browser, 20).until(EC.presence_of_element_located((By.XPATH, xpath_timeSelect)))
        dropDownTime.click()
        time.sleep(1)
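    except Exception as e:
        # Catch Selenium failures (timeouts, stale elements) without swallowing a keyboard interrupt
        print("Failed... (" + type(e).__name__ + ")")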
        TOTAL_OPEN_TABS -= 1
        if(TOTAL_OPEN_TABS != 0):
            browser.close()
        return False
    
    print("Adding Search Term")

    searchTermElement = WebDriverWait(browser, 20).until(EC.presence_of_element_located((By.XPATH, xpath_searchTerm)))
    searchTermElement.clear()
    searchTermElement.send_keys(searchTerm)
    time.sleep(5)

    print("Perform search")

    submitButton = WebDriverWait(browser, 20).until(EC.presence_of_element_located((By.XPATH, xpath_form)))
    submitButton.click()
    return True
    
def handleHemerotecaResultsPage(browser, journal, timeRange, searchTerm, resultDictionary):
    # switch_to.window replaces the deprecated switch_to_window call
    browser.switch_to.window(browser.window_handles[-1])
    saveIndex = 0
    
    print("Waiting for next page to load...")
    #try:
    matches = WebDriverWait(browser, 20).until(EC.presence_of_element_located((By.XPATH, "//span[@id=\'OcorNroLbl\']")))
    print("Next page ready, found match element... counting")
    try:
        countText = matches.text
    except StaleElementReferenceException:
        print("Denied StaleElementReferenceException at match counter text.")
        time.sleep(1)
        matches = WebDriverWait(browser, 20).until(EC.presence_of_element_located((By.XPATH, "//span[@id=\'OcorNroLbl\']")))
        countText = matches.text
    startPage = int(countText[0:countText.find("/")])
    countTotal = int(countText[countText.find("/")+1:])
    print("A total of " + str(countTotal-startPage) + " matches (Starting at page "+str(startPage)+") have been found, standing by for page load.")
    for i in range(startPage, countTotal+1):               
        print("Waiting for page " + str(i) + " to load...")
        bibxpath = "//input[@name=\'HiddenBibAlias\']"
        pagexpath = "//input[@name=\'hPagFis\']"
        cyxpath = "//span[@id=\'PastaTxt\']"
        jIDElement = WebDriverWait(browser, 20).until(EC.presence_of_element_located((By.XPATH, bibxpath)))
        jPageElement = WebDriverWait(browser, 20).until(EC.presence_of_element_located((By.XPATH, pagexpath)))
        cyElement = WebDriverWait(browser, 20).until(EC.presence_of_element_located((By.XPATH, cyxpath)))
        try:
            jidtext = jIDElement.get_attribute('value')      
            jpagetext = jPageElement.get_attribute('value')
            cy = cyElement.text
        except StaleElementReferenceException:
            print("Denied StaleElementReferenceException at URL grab code.")
            time.sleep(1)
            jIDElement = WebDriverWait(browser, 20).until(EC.presence_of_element_located((By.XPATH, bibxpath)))
            jPageElement = WebDriverWait(browser, 20).until(EC.presence_of_element_located((By.XPATH, pagexpath)))  
            cyElement = WebDriverWait(browser, 20).until(EC.presence_of_element_located((By.XPATH, cyxpath)))
            jidtext = jIDElement.get_attribute('value')      
            jpagetext = jPageElement.get_attribute('value')            
            cy = cyElement.text
            
        currentY = int(cy[4:8])
        timeTuple = time_range_to_tuple(timeRange)
        if(currentY > timeTuple[1]):
            print("Reached a result after our search range: " + str(currentY) + " (End at: "+str(timeTuple[1])+"), end loop.")
            break
        fLink = "http://memoria.bn.br/DocReader/" + jidtext + "/" + jpagetext # + "?pesq=" + search_text         
        key = searchTerm + "_" + str(timeTuple[0]) + "_" + str(timeTuple[1]) + "_" + str(saveIndex)
        saveIndex += 1
        resultDictionary[key] = tuple((currentY, fLink, searchTerm))
        print("Adding element to dictionary: (" + key + "): " + str(resultDictionary[key]))

        if(i != countTotal):
            print("Moving to next page...")
            nextPageButton = WebDriverWait(browser, 20).until(EC.presence_of_element_located((By.XPATH, "//input[@id=\'OcorPosBtn\']")))
            nextPageButton.click()
            # Wait for next page to be ready
            change = hold_until_element_changed(browser, bibxpath, pagexpath, jidtext, jpagetext)
            if(change == False):
                print("Still stuck on the page, pushing the next page button again.")
                nextPageButton2 = WebDriverWait(browser, 20).until(EC.presence_of_element_located((By.XPATH, "//input[@id=\'OcorPosBtn\']")))
                nextPageButton2.click()      
                change2 = hold_until_element_changed(browser, bibxpath, pagexpath, jidtext, jpagetext)
                if(change2 == False):
                    print("Movement failed again, browser may be locked, abort.")
                    return False
    print("All elements gathered, proceeding")
    browser.close()
    return True

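def writeDictionary(resultDictionary):
    # Write one "key (year, link, term)" entry per line for later debugging
    with open(tmpSaveFile, "w") as f:
        # .items() works on both Python 2 and Python 3 (iteritems() is Python 2 only)
        for key, value in resultDictionary.items():
            f.write(str(key) + " " + str(value) + "\n")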
    
def loadDictionary():
    tmpDict = {}
    with open(tmpSaveFile, 'r') as f:
        for line in f:
            firstParen = line.find('(')
            firstComma = line.find(',', firstParen)
            firstQuot = line.find('\'', firstComma)
            secondQuot = line.find('\'', firstQuot+1)
            thirdQuot = line.find('\'', secondQuot+1)
            lastQuot = line.find('\'', thirdQuot+1)
            key = str(line[0:firstParen-1])
            year = int(line[firstParen+1:firstComma])
            link = str(line[firstQuot+1:secondQuot])
            product = str(line[thirdQuot+1:lastQuot])
            
            tmpDict[key] = ((year, link, product))
    return tmpDict

# Individual Series Dictionary:
# dict = {url="url", colName1=True, colName2=False}
def createListFromDictionary(year, resultDictionary, searchTerms):
    outList = []
    lookupDone = []
    for key, value in resultDictionary.items():
        if(year == value[0]):
            tmpLookup = value[1]           
            if not tmpLookup in lookupDone:
                tmpDict = {}
                for t in searchTerms:
                    tmpDict[t] = False                 
                lookupDone.append(tmpLookup)
                tmpDict["URL"] = tmpLookup
                for v in resultDictionary.values():
                    if(v[1] == tmpLookup):
                        tmpDict[v[2]] = True
                outList.append(tmpDict)
    return outList

def writeOutputExcelFile(timeRange, searchTerms, resultDictionary):
    yCt = 0
    dfList = []
    headers = []
    tempDict = {}
    totalTerms = len(searchTerms)
    headers.append("URL")
    for t in searchTerms:
        headers.append(t)
    for year in timeRange:
        # Write the first row
        curDF = pd.DataFrame(columns=headers)
        """
        for key, value in resultDictionary.iteritems():
            if(year == value[0]):
                # Find the search term
                link = str(value[1]) + "?pesq=" + str(value[2])         
                colName = str(value[2])
                temp_df = pd.DataFrame(columns=[colName],data=[link])
                print("Add to " + colName + " => " + link)
                curDF = curDF.append(temp_df, ignore_index=True)
        """
        outList = createListFromDictionary(year, resultDictionary, searchTerms)
        tmp_df = pd.DataFrame(columns = headers, data = outList)       
        tmp_df[headers] = tmp_df[headers].replace({True: "Yes"})
        tmp_df[headers] = tmp_df[headers].replace({False: "No"})
        # Check to see if there is only "one" link
        for index, row in tmp_df.iterrows():
            if(row.value_counts()['Yes'] == 1):
                found = row[row[headers] == 'Yes']
                dfFound = pd.DataFrame(found)
                # Write to tmp_df directly: rows yielded by iterrows() may be copies
                tmp_df.at[index, 'URL'] = row['URL'] + "?pesq=" + str(dfFound.index[0])
        # Write the table
        curDF = curDF.append(tmp_df)
        dfList.append(curDF)
    # Write the final excel file
    with pd.ExcelWriter(outFile) as writer:
        for iDf in dfList:
            # One sheet per year, named after the year
            iDf.to_excel(writer, sheet_name=str(timeRange[yCt]))
            yCt += 1

            
##########################
##########################

## Script Run Block
Simple enough, this code block executes the script and performs the scraping. Once the process is started, a secondary web window will open up. Do not touch this window, as doing so may disrupt the mouse click events. You can keep track of the process by means of the debug output that comes up below.

On average it will take ~40 seconds for the front page to load. From there it takes about 5s - 40s for each individual article to populate. The total time will depend on how fast your internet connection is, whether the server is undergoing maintenance, and how large a volume you are looking for.

Plan to have this script running in the background for some time, so it is generally a good idea to let this run and work on something else on the side.

**NOTE:** This code block will perform all tasks, which will read the hemeroteca, save the debug dictionary, and then write the output to the excel file.

In [6]:
# RUN THIS BLOCK TO RUN THE SCRIPT
runScript()

Start Hemetoreca Scraper Script.
Opening tab for Hemeroteca front page
Locating Journal/Periodical
Waiting for Time Selection
Locating Time Range
Adding Search Term
Perform search
Waiting for next page to load...
Next page ready, found match element... counting
A total of 59 matches (Starting at page 12) have been found, standing by for page load.
Waiting for page 12 to load...
Adding element to dictionary: (Milho_1860_1869_0): (1860, u'http://memoria.bn.br/DocReader/287130/1122', 'Milho')
Moving to next page...
Waiting for page 13 to load...
Adding element to dictionary: (Milho_1860_1869_1): (1860, u'http://memoria.bn.br/DocReader/287130/1123', 'Milho')
Moving to next page...
Waiting for page 14 to load...
Adding element to dictionary: (Milho_1860_1869_2): (1861, u'http://memoria.bn.br/DocReader/287130/1219', 'Milho')
Moving to next page...
Waiting for page 15 to load...
Adding element to dictionary: (Milho_1860_1869_3): (1861, u'http://memoria.bn.br/DocReader/287130/1304', 'Milho')
M

## Excel Writing

If something goes wrong with the excel writing process alone, you can use this code block to debug the process in which the writing is performed.
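
When debugging, it can also help to inspect individual lines of the temporary file by hand. Each line is written as the key, a space, and the printed tuple. The sketch below reads one such line back using `ast.literal_eval`, which is a different technique from the quote-searching approach in `loadDictionary`; the function name `parse_dictionary_line` is hypothetical, and it assumes the key itself does not contain the sequence " (".

```python
import ast


def parse_dictionary_line(line):
    """Split one 'key (year, link, term)' line into its key and tuple."""
    # Everything before the first " (" is the key; the rest is the printed tuple
    key, _, rest = line.strip().partition(" (")
    value = ast.literal_eval("(" + rest)
    return key, value


sample = "Milho_1860_1869_0 (1860, 'http://memoria.bn.br/DocReader/287130/1122', 'Milho')"
key, value = parse_dictionary_line(sample)
print(key)
print(value[0], value[1], value[2])
```

Lines written under Python 2 carry a `u` prefix on the link string; `ast.literal_eval` accepts that form as well.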

In [67]:
resultDictionary = loadDictionary()
allWritingYears = []
for tRange in search_timeRanges:
    timeTuple = time_range_to_tuple(tRange)
    rTime = np.arange(timeTuple[0], timeTuple[1]+1)
    allWritingYears.extend(rTime)  

writeOutputExcelFile(allWritingYears, search_terms, resultDictionary)