## Scraping the United Nations Digital Library

**Scraping script done on behalf of Oladoyin Okunoren @ Boston College**

By David J. Thomas

---

This notebook contains a series of scripts that scrape every item about Ebola from 2014 to 2016 in the United Nations Digital Library. It is part of the dissertation research of Oladoyin Okunoren at [Boston College](https://bc.edu).

---

## Installation

``` bash
pip install -r requirements.txt
jupyter lab
```
---

In [1]:
import time
import requests
from bs4 import BeautifulSoup

class BasePageScraper:
    """Gets HTML for a single page, parses it with BeautifulSoup, and stores results in self.data.
    Base class for child classes which target specific pages"""
    
    def __init__(self, url, total_tries=5, scrape_delay=0.5):
        """Gets url and pre-parses the html"""
        self._url = url
        self._total_tries = total_tries
        self._scrape_delay = scrape_delay
        # store souped data in object for extraction
        self.data = self.soup_page(tries_left=self._total_tries)

    def soup_page(self, tries_left=5):
        """Receives URL, requests raw html, then returns converted BeautifulSoup object."""
        # declare variable for raw html
        page_html = None
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0",
            "Accept-Encoding": "*",
            "Connection": "keep-alive"
        }
        # ensure tries_left is a valid positive integer, if not reset it to 5; check that the url is a string
        if not isinstance(tries_left, int) or tries_left < 1:
            tries_left = 5
        if not isinstance(self._url, str):
            raise Exception('URL must be a valid string')
        # enforce a time delay between each scrape for good internet citizenship
        time.sleep(self._scrape_delay)
        print('Getting', self._url)
        # attempt to get the page data
        try:
            page_html = requests.get(self._url, headers=headers).text
        # if an error occurred, retry recursively until no tries are left
        except requests.RequestException:
            print('Error getting', self._url)
            if tries_left > 1:
                print('Retrying...')
                print(tries_left - 1)
                return self.soup_page(tries_left=tries_left - 1)
            print('Retry limit reached, ABORTING parse of', self._url)
            return None
        print('Success, souping...')
        # if all went well, return new BeautifulSoup populated with the page html and parser set to html.parser
        return BeautifulSoup(page_html, 'html.parser')

print('Function defined! PROCEED')

Function defined! PROCEED
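
The retry logic in `soup_page` can also be written as a plain loop instead of recursion. The following is a minimal sketch under stated assumptions, not the notebook's own code: `fetch_with_retries` and its `fetch` parameter are hypothetical names, and `fetch` stands in for a call such as `lambda u: requests.get(u, headers=headers).text`.

```python
import time


def fetch_with_retries(url, fetch, total_tries=5, delay=0.5):
    """Call fetch(url) up to total_tries times, sleeping `delay` seconds before each try.

    Returns whatever fetch returns, or None once every attempt has failed.
    `fetch` is a hypothetical callable supplied by the caller.
    """
    for attempt in range(1, total_tries + 1):
        time.sleep(delay)  # same politeness delay that soup_page enforces
        try:
            return fetch(url)
        except Exception as error:
            print(f'Attempt {attempt} of {total_tries} failed for {url}: {error}')
    print('Retry limit reached, aborting', url)
    return None
```

A loop keeps the number of attempts explicit and avoids growing the call stack on repeated failures.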


In [4]:
class UNDLBrowseScraper(BasePageScraper):
    """Represents a browse page of the UN Digital Library. Gathers links of records returned from a search"""

    def __init__(self, url, total_tries=5, scrape_delay=0.2):
        super().__init__(url=url, total_tries=total_tries, scrape_delay=scrape_delay)

    @property
    def next_link(self):
        """Returns link to the next browse page if exists, or None if no further page"""
        link = None
        link_containers = self.data.find('span', class_='rec-navigation').find_all('a', class_='img')
        for link_container in link_containers:
            if link_container.img['alt'] == 'next':
                link = link_container['href']
        if link is None:
            return None
        return 'https://digitallibrary.un.org' + link

    @property
    def links(self):
        """Peruses the souped data and returns list of strings, each with a link"""
        return_links = []
        for article in self.data.find_all('div', class_='result-title'):
            # append link of article to list
            return_links.append('https://digitallibrary.un.org' + article.a['href'] + '?v=pdf#files')
        return return_links
    
    def gather_links(self):
        """Recursive function to gather links to all records on this page and on subsequent pages. If not on the last page,
        return the links on this page plus those returned by recursively calling another UNDLBrowseScraper object on the next page.
        If on the last page, just return the links and end the recursion"""
        # if no next_link, just return the links on the page
        if not self.next_link:
            return self.links
        else:
            return self.links + UNDLBrowseScraper(self.next_link).gather_links()

article_links = []
print('Gathering stories, this may take awhile...')
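# gather stories for 2014, 2015, 2016
# note: jrec looks like the index of the first record returned, so jrec=51 likely skips the first 50 results (jrec=1 would include them)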
article_links = UNDLBrowseScraper('https://digitallibrary.un.org/search?p=ebola&c=Resource+Type&c=UN+Bodies&rg=50&jrec=51&fct__3=2014&fct__3=2015&fct__3=2016&ln=en').gather_links()

print(str(len(article_links)) + ' links to stories gathered')
print(article_links[0:5])

Gathering stories, this may take awhile...
Getting https://digitallibrary.un.org/search?p=ebola&c=Resource+Type&c=UN+Bodies&rg=50&jrec=51&fct__3=2014&fct__3=2015&fct__3=2016&ln=en
Success, souping...
Getting https://digitallibrary.un.org/search?p=ebola&c=Resource+Type&c=UN+Bodies&rg=50&jrec=101&fct__3=2014&fct__3=2015&fct__3=2016&ln=en
Success, souping...
Getting https://digitallibrary.un.org/search?p=ebola&c=Resource+Type&c=UN+Bodies&rg=50&jrec=151&fct__3=2014&fct__3=2015&fct__3=2016&ln=en
Success, souping...
Getting https://digitallibrary.un.org/search?p=ebola&c=Resource+Type&c=UN+Bodies&rg=50&jrec=201&fct__3=2014&fct__3=2015&fct__3=2016&ln=en
Success, souping...
Getting https://digitallibrary.un.org/search?p=ebola&c=Resource+Type&c=UN+Bodies&rg=50&jrec=251&fct__3=2014&fct__3=2015&fct__3=2016&ln=en
Success, souping...
227 links to stories gathered
['https://digitallibrary.un.org/record/795883?v=pdf#files', 'https://digitallibrary.un.org/record/783809?v=pdf#files', 'https://digitallib
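
The search URL above packs several repeated query parameters into one long string. As a hedged sketch, the same URL can be assembled with the standard library. `build_search_url` is a hypothetical helper, and reading `rg` as the page size and `jrec` as the index of the first record shown is an assumption based on how the page URLs in the output advance (51, 101, 151, and so on).

```python
from urllib.parse import urlencode

BASE_URL = 'https://digitallibrary.un.org/search'


def build_search_url(first_record=1, page_size=50, years=('2014', '2015', '2016')):
    """Build the Ebola search URL; repeated keys are passed as a list of pairs."""
    params = [
        ('p', 'ebola'),
        ('c', 'Resource Type'),
        ('c', 'UN Bodies'),
        ('rg', page_size),        # assumed: number of results per page
        ('jrec', first_record),   # assumed: index of the first record shown
    ]
    params += [('fct__3', year) for year in years]
    params.append(('ln', 'en'))
    # urlencode turns spaces into '+' and keeps the repeated keys in order
    return BASE_URL + '?' + urlencode(params)


print(build_search_url(first_record=51))
```

Called with `first_record=51`, this reproduces the starting URL used in the cell above.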

In [9]:
import os
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup


class BaseSeleniumScraper:
    """Base class for all Selenium Scraper objects."""
    url = ''
    scrape_delay = 5
    options = Options()
    webdriver_path = '/usr/local/bin/chromedriver'
    service = None
    driver = None
    metadata = {}


    def __init__(self, url,  *args, **kwargs):
        # store the target url
        self.url = url
        # store scrape delay, if provided and a valid number
        if 'scrape_delay' in kwargs:
            if isinstance(kwargs['scrape_delay'], (int, float)):
                self.scrape_delay = kwargs['scrape_delay']
            else:
                raise Exception('Argument \'scrape_delay\' must be an integer or float')
        # store webdriver_path, if a string and pointing to a file
        if 'webdriver_path' in kwargs:
            if not isinstance(kwargs['webdriver_path'], str):
                raise Exception('Argument \'webdriver_path\' must be a string')
            if not os.path.isfile(os.path.abspath(kwargs['webdriver_path'])):
                raise Exception('Argument \'webdriver_path\' must point to a valid webdriver')
            self.webdriver_path = kwargs['webdriver_path']
        # uncomment the next line to run Chrome headless (without a visible window)
        # self.options.add_argument('--headless')
        self.service = ChromeService(executable_path=self.webdriver_path)
        self.driver = webdriver.Chrome(service=self.service, options=self.options)
        # make sure to shut down the driver even if an error occurs
        try:
            self.load()
            self.post_load()
        finally:
            self.shutdown()

    def load(self, *args, **kwargs):
        """Enforces a per-page scraping delay, then performs the initial page load."""
        time.sleep(self.scrape_delay)
        # fetch the page data
        print('Getting page at ', self.url)
        self.driver.get(self.url)

    def post_load(self, *args, **kwargs):
        """Runs after the initial load of page data. Placeholder, SHOULD BE OVERWRITTEN by child classes to extract data"""
        pass

    def shutdown(self, *args, **kwargs):
        self.driver.quit()

print('Class defined, PROCEED.')
    


Class defined, PROCEED.
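
The guarantee that the driver is shut down even when scraping fails can also be expressed as a context manager. This is an illustrative sketch only: `managed_driver` and `FakeDriver` are hypothetical names, and `FakeDriver` stands in for `webdriver.Chrome` so the example runs without a browser.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(make_driver):
    """Yield a driver from make_driver() and always call quit() on it afterwards."""
    driver = make_driver()
    try:
        yield driver
    finally:
        driver.quit()


class FakeDriver:
    """Hypothetical stand-in for webdriver.Chrome; only records whether quit() ran."""

    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True


driver_ref = None
try:
    with managed_driver(FakeDriver) as driver:
        driver_ref = driver
        raise RuntimeError('simulated scraping error')
except RuntimeError:
    pass

print(driver_ref.closed)  # True: quit() ran despite the error
```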


In [13]:
import requests
import fitz

class PDFScraper:
    """Separate scraper to handle fetching/converting the PDF into text. Will be used by the UNDLArticleScraper below"""
    url = ''
    scrape_delay = 5
    max_tries = 5
    headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0",
            "Accept-Encoding": "*",
            "Connection": "keep-alive"
        }
    data = None

    def __init__(self, url, *args, **kwargs):
        if type(url) != str:
            raise Exception('URL must be a valid string')
        self.url = url
        # store scrape delay, if provided and a valid number
        if 'scrape_delay' in kwargs:
            if not isinstance(kwargs['scrape_delay'], (int, float)):
                raise Exception('Argument \'scrape_delay\' must be an integer or float')
            self.scrape_delay = kwargs['scrape_delay']
        # store max tries, if provided and a valid integer
        if 'max_tries' in kwargs:
            if not isinstance(kwargs['max_tries'], int):
                raise Exception('Argument \'max_tries\' must be an integer')
            self.max_tries = kwargs['max_tries']
        self.data = self.pdf

    def get_pdf_data(self, tries_left=5):
        """Fetches the PDF and returns its raw content. If the fetch fails, retries recursively with tries_left decremented"""
        pdf_content = None
        # ensure tries_left is a valid positive integer, if not reset it to 5
        if not isinstance(tries_left, int) or tries_left < 1:
            tries_left = 5
        # enforce scrape delay
        time.sleep(self.scrape_delay)
        print('Getting PDF at ', self.url)
        # attempt to get the PDF data
        try:
            pdf_content = requests.get(self.url, headers=self.headers).content
        # if an error occurred, retry recursively until no tries are left
        except requests.RequestException:
            print('Error getting', self.url)
            if tries_left > 1:
                print('Retrying...')
                print(tries_left - 1)
                return self.get_pdf_data(tries_left=tries_left - 1)
            print('Retry limit reached, ABORTING parse of', self.url)
            return None
        return pdf_content
    
    @property
    def pdf(self):
        """Returns the content of the PDF as text, or an empty string if the PDF could not be fetched"""
        text = ''
        pdf_data = self.get_pdf_data(self.max_tries)
        if pdf_data is None:
            return text
        pdf_document = fitz.open(stream=pdf_data, filetype='pdf')
        for page in pdf_document:
            text += page.get_text()
        # collapse all whitespace (including newlines and tabs) into single spaces
        return ' '.join(text.split())

class UNDLArticleSeleniumScraper(BaseSeleniumScraper):
    """Used to scrape the metadata & pdf link from a single article page. Uses the PDF scraper to get contents of the PDF"""

    def post_load(self, *args, **kwargs):
        wait = WebDriverWait(self.driver, self.scrape_delay)
        metadata_container = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 'div.detailed-record-content'))).get_attribute('innerHTML')
        time.sleep(self.scrape_delay)
        files_container = wait.until(EC.presence_of_element_located((By.ID, 'record-files-list'))).get_attribute('innerHTML')
        metadata_soup = BeautifulSoup(metadata_container, 'html.parser')
        files_soup = BeautifulSoup(files_container, 'html.parser')
        self.metadata = self.get_metadata(metadata_soup)
        self.metadata['enPdfLink'] = self.get_en_pdf_link(files_soup)

    def get_metadata(self, souped_data, *args, **kwargs):
        """Extract metadata from page after load and return as a dictionary"""
        extracted_data = {
            'title': '',
            'authors': '',
            'description': '',
            'agenda': '',
            'resolution': '',
            'meetingRecord': '',
            'draftResolution': '',
            'note': '',
            'date': '',
            'enPdfLink': '',
            'collections': ''
        }
        container = souped_data.find('div', id='details-collapse').find_all('div', class_='metadata-row')
        for data_row in container:
            # check label of metadata for the kind of metadata in that row
            match data_row.span.get_text().strip():
                case 'Title':
                    extracted_data['title'] = data_row.find_all('span')[1].get_text()
                case 'Authors':
                    extracted_data['authors'] = data_row.find_all('span')[1].get_text()
                case 'Agenda information':
                    extracted_data['agenda'] = data_row.find_all('span')[1].get_text()
                case 'Description':
                    extracted_data['description'] = data_row.find_all('span')[1].get_text()
                case 'Resolution / Decision':
                    extracted_data['resolution'] = data_row.find_all('span')[1].get_text()
                case 'Meeting record':
                    extracted_data['meetingRecord'] = data_row.find_all('span')[1].get_text()
                case 'Draft resolution':
                    extracted_data['draftResolution'] = data_row.find_all('span')[1].get_text()
                case 'Note':
                    extracted_data['note'] = data_row.find_all('span')[1].get_text()
                case 'Vote date':
                    extracted_data['date'] = data_row.find_all('span')[1].get_text()
                case 'Date':
                    extracted_data['date'] = data_row.find_all('span')[1].get_text()
                case 'Collections':
                    extracted_data['collections'] = data_row.find_all('span')[1].get_text()
        return extracted_data
    
    def get_en_pdf_link(self, souped_data, *args, **kwargs):
        """Returns the link to the English PDF listed on the page, or an empty string if there is none"""
        link = ''
        for pdf_row in souped_data.find_all('tr')[1:]:
            if pdf_row.find_all('td')[4].get_text() == 'English':
                link =  pdf_row.find_all('td')[0].find('tindui-app-file-download-link')['url']
        return link
    
    @property
    def pdf(self):
        return PDFScraper(self.metadata['enPdfLink'], scrape_delay=self.scrape_delay).data

scraper = UNDLArticleSeleniumScraper('https://digitallibrary.un.org/record/795883?v=pdf#files')
print(scraper.pdf)


Getting page at  https://digitallibrary.un.org/record/795883?v=pdf#files
Getting PDF at  https://digitallibrary.un.org/record/795883/files/A_HRC_28_L-31_Rev-1-EN.pdf
GE.15-06370 (E) 260315 260315  Human Rights Council Twenty-eighth session Agenda item 10 Technical assistance and capacity-building Algeria (on behalf of the Group of African States), Bulgaria*, Cyprus*, Germany, Greece*, Hungary*, Italy*, Latvia, Luxembourg*, Netherlands, Slovakia*, United Kingdom of Great Britain and Northern Ireland: draft resolution 28/... Strengthening of technical cooperation and consultative services in Guinea The Human Rights Council, Guided by the Charter of the United Nations, the Universal Declaration of Human Rights and other applicable human rights instruments, Recalling General Assembly resolution 60/251 of 15 March 2006 and Human Rights Council resolutions 13/21 of 26 March 2010, 16/36 of 25 March 2011, 19/30 of 23 March 2012, 23/23 of 14 June 2013 and 25/35 of 28 March 2014, Reaff
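
The cell above scrapes a single record. To cover every link in `article_links`, one option is a loop that writes each result to disk as it goes, so a failure late in the run does not lose earlier work. This is a sketch under stated assumptions rather than the author's method: `scrape_all`, `scrape_article`, and the output filename are hypothetical, and the JSON Lines output format is a choice made here for illustration.

```python
import json


def scrape_all(links, scrape_one, out_path):
    """Scrape every link with scrape_one(link) and write one JSON object per line.

    Returns the list of links that raised an error so they can be retried later.
    """
    failed = []
    with open(out_path, 'w', encoding='utf-8') as out_file:
        for link in links:
            try:
                record = scrape_one(link)
            except Exception as error:
                print('Skipping', link, 'because of', error)
                failed.append(link)
                continue
            out_file.write(json.dumps(record, ensure_ascii=False) + '\n')
    return failed


def scrape_article(link):
    """Hypothetical adapter around the UNDLArticleSeleniumScraper class defined above."""
    scraper = UNDLArticleSeleniumScraper(link)
    record = dict(scraper.metadata)
    record['url'] = link
    record['text'] = scraper.pdf
    return record


# Example call inside the notebook (the filename is an assumption):
# failed_links = scrape_all(article_links, scrape_article, 'undl_ebola.jsonl')
```

Passing the scraping function as an argument keeps the loop easy to try out with a stub before launching a full browser-driven run.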