## Scraping the World Health Organization's Iris (Institutional Repository for Information Sharing) Database

**Scraping script done on behalf of Oladoyin Okunoren @ Boston College**

By David J. Thomas
Edited by Lester E. Carver

---

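This notebook contains a series of scripts to scrape the metadata and PDF text of every English-language document about Ebola in the World Health Organization's IRIS repository, issued from 2013 to 2016 (the date range set in the search URL used below). It is part of the dissertation research of Oladoyin Okunoren at [Boston College](https://bc.edu).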

---

## Installation

``` bash
pip install -r requirements.txt
jupyter lab
```
---

## Defining A General Object to Scrape Any Page

Before we can start scraping pages, the step below creates a general class called `BasePageScraper`. Given a URL, it fetches the page's HTML and parses it with BeautifulSoup. This class won't be used directly; it serves as a base class for the page-specific scrapers defined in the following steps, which inherit its methods and attributes. Once the base class is defined, the cell prints a success message.

In [None]:
import time
import requests
from bs4 import BeautifulSoup
import os
import fitz
from urllib.parse import urlparse, unquote
import re
import csv

class BasePageScraper:
    """Gets HTML for a single page, parses it with BeautifulSoup, and stores results in self.data.
    Base class for child classes which target specific pages"""

    def __init__(self, url, total_tries=5, scrape_delay=5):
        """Gets url and pre-parses the html"""
        self._url = url
        self._total_tries = total_tries
        self._scrape_delay = scrape_delay
        # store souped data in object for extraction
        self.data = self.soup_page(tries_left=self._total_tries)

    def soup_page(self, tries_left=5):
        """Receives URL, requests raw html, then returns converted BeautifulSoup object."""
        # declare variable for raw html
        page_html = None
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0",
            "Accept-Encoding": "*",
            "Connection": "keep-alive"
        }
        # ensure tries_left is a non-negative integer (falling back to 5); 0 is valid and means "no retries left"
        if not isinstance(tries_left, int) or tries_left < 0:
            tries_left = 5
        if not isinstance(self._url, str):
            raise TypeError('URL must be a valid string')
        # enforce a time delay between each scrape for good internet citizenship
        time.sleep(self._scrape_delay)
        print('Getting', self._url)
        # attempt to get page data
        try:
            page_html = requests.get(self._url, headers=headers).text
        # if a network error occurred, retry recursively with one fewer try left
        except requests.RequestException:
            print('Error getting', self._url)
            if tries_left > 0:
                print('Retrying...', tries_left, 'tries left')
                # soup_page takes no url argument; it always reads self._url
                return self.soup_page(tries_left=tries_left - 1)
            print('Retry limit reached, ABORTING parse of', self._url)
            return None
        print('Success, souping...')
        # if all went well, return new BeautifulSoup populated with the page html and parser set to html.parser
        return BeautifulSoup(page_html, 'html.parser')

print('Function defined! PROCEED')
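
The retry logic above is recursive. As a point of comparison, here is a minimal sketch of the same idea written as a loop. The function `fetch_with_retries` and its `fetch` parameter are hypothetical names introduced for illustration (they are not part of the notebook), and the fake fetcher stands in for a real call such as `requests.get(url).text` so the example runs without network access.

```python
import time


def fetch_with_retries(fetch, url, total_tries=5, delay=0.0):
    """Call fetch(url) up to total_tries times; return its result, or None if every attempt fails.

    `fetch` is a hypothetical stand-in for a function such as
    lambda u: requests.get(u).text
    """
    for attempt in range(1, total_tries + 1):
        # pause before each attempt, as the scraper does between requests
        time.sleep(delay)
        try:
            return fetch(url)
        except Exception as error:
            print(f'Attempt {attempt}/{total_tries} failed for {url}: {error}')
    print('Retry limit reached, giving up on', url)
    return None


# Demonstration with a fake fetcher that fails twice before succeeding
calls = {'count': 0}


def flaky_fetch(url):
    calls['count'] += 1
    if calls['count'] < 3:
        raise ConnectionError('simulated network failure')
    return '<html><body>ok</body></html>'


html = fetch_with_retries(flaky_fetch, 'https://example.org/page')
print(html)
```

A loop keeps the number of attempts explicit and avoids re-validating arguments on every retry, but either form works for a handful of tries.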

## Scraping the WHO IRIS Browse Page (Will take a while)

This step finds the link to each document about Ebola returned by the search, which covers items issued from 2013 to 2016. The scraping object defined below inherits from the parent class above and adds a few members. One gets the link from the "next" button (`.next_link`), if it exists. Another returns a list of every article link on that specific browse page (`.links`). The last, `.gather_links()`, gets all the links from the given page and then calls itself recursively on each following page (via the next-page links), so it collects every link in the result set, not just those on the first page. All of those links are gathered in a variable called `article_links`, which is used in the next step. There, we will run another scraper object on each of those pages.

In [None]:
class IRISBrowseScraper(BasePageScraper):
    """Represents a browse page of the WHO IRIS Database. Gathers links of stories returned from a search"""

    def __init__(self, url, total_tries=5, scrape_delay=5):
        super().__init__(url=url, total_tries=total_tries, scrape_delay=scrape_delay)

    @property
    def next_link(self):
        """Returns link to the next browse page if exists, or None if no further page"""
        link = self.data.find('a', class_='next-page-link')
        if not link:
            return None
        # strip any leading slash from the href so the joined URL never contains a double slash
        return 'https://iris.who.int/' + link['href'].lstrip('/')

    @property
    def links(self):
        """Peruses the souped data and returns list of strings, each with a link"""
        return_links = []
        for article in self.data.find_all('div', class_='artifact-description'):
            # append link of article to list
            return_links.append('https://iris.who.int' + article.find('h4', class_='artifact-title').find('a')['href'])
        return return_links

    def gather_links(self):
        """Recursive function to gather links to all stories on this page, and subsequent pages. If not on the last page,
        return the links on this page plus those returned by recursively calling gather_links() on a new IRISBrowseScraper
        for the next page. If on the last page, just return the links and end the recursion"""
        # if no next_link, just return the links on the page
        if not self.next_link:
            return self.links
        else:
            return self.links + IRISBrowseScraper(self.next_link).gather_links()

article_links = []
print('Gathering stories, this may take awhile...')
url = 'https://iris.who.int/discover?search-result=true&query=ebola&scope=&filtertype_0=dateIssued&filtertype_1=iso&filter_relational_operator_1=equals&filter_relational_operator_0=equals&filter_1=English&filter_0=%5B2013+TO+2016%5D&rpp=100&sort_by=score&order=desc'
article_links += IRISBrowseScraper(url).gather_links()


print(str(len(article_links)) + ' links to stories gathered')
print(article_links[0:5])
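
For very long result sets, following the next-page links in a loop avoids deep recursion. The sketch below illustrates that pattern under stated assumptions: `gather_all_links` and `get_page` are hypothetical names, and the dictionary of fake pages stands in for real browse pages so the example runs offline.

```python
def gather_all_links(first_url, get_page):
    """Follow next-page links from first_url, collecting every article link along the way.

    `get_page` is a hypothetical callable returning (links, next_url) for a URL,
    with next_url set to None on the last page.
    """
    all_links = []
    seen = set()
    url = first_url
    while url and url not in seen:
        seen.add(url)  # guard against a page whose next link points back to a visited page
        links, url = get_page(url)
        all_links.extend(links)
    return all_links


# Fake three-page result set standing in for real browse pages
FAKE_PAGES = {
    'page1': (['/a', '/b'], 'page2'),
    'page2': (['/c'], 'page3'),
    'page3': (['/d', '/e'], None),
}

collected = gather_all_links('page1', FAKE_PAGES.__getitem__)
print(collected)  # ['/a', '/b', '/c', '/d', '/e']
```

With the class defined above, `get_page` could be a small function that builds an `IRISBrowseScraper` for the URL and returns its `.links` and `.next_link`.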

## Scraping Each Article (Will take a long while)

Warning: this step will take a long time to run. First, we define another scraper object, this one representing the page of a single document on the WHO IRIS site. Once the object is defined, we use it on each URL in the `article_links` list gathered above. The scraper object makes it quick and easy to extract metadata and the link to each PDF.

We then loop through each of the `article_links`, create a scraper object for it, and append the data it extracts as a dictionary to the list `story_data`. In the next step, we will download the PDFs and extract their text.

In [None]:
class IRISArticleScraper(BasePageScraper):
    """Represents an article of the WHO. Aids extracting text and metadata"""

    def __init__(self, url, total_tries=5, scrape_delay=5):
        super().__init__(url=url, total_tries=total_tries, scrape_delay=scrape_delay)
        self.data = self.data.find('div', class_='row')

    @property
    def title(self):
        return self.data.find('span', id='citation-article-title').get_text()

    @property
    def author(self):
        outer_span = self.data.find('span', id='citation-article-authors')
        if outer_span:
            nested_span = outer_span.find('span')
            if nested_span:
                return nested_span.get_text()
            else:
                return "No author found"
        else:
            return "No author found"

    @property
    def date(self):
        raw_date = self.data.find('span', id='citation-article-date').get_text()
        cleaned_date = re.sub(r'[^\d]', '', raw_date)  # Remove non-numeric characters
        return cleaned_date

    @property
    def publication(self):
        return 'World Health Organization'

    @property
    def link(self):
        return self._url

    @property
    def pdf(self):
        outer_div = self.data.find('div', class_='item-page-field-wrapper table word-break')
        if outer_div:
            nested_div = outer_div.find('a')
            if nested_div and 'href' in nested_div.attrs:
                return 'https://iris.who.int' + nested_div['href']
        return 'No pdf found'

# loop through every link, create an IRISArticleScraper object for it, extract data, and store in story_data
story_data = []
counter = 0
for article_link in article_links:
    counter += 1
    print('Scraping article ' + str(counter) + '/' + str(len(article_links)))
    current_page_scraper = IRISArticleScraper(article_link)
    story_data.append({
        'title': current_page_scraper.title,
        'author': current_page_scraper.author,
        'date': current_page_scraper.date,
        'publication': current_page_scraper.publication,
        'link': current_page_scraper.link,
        'type': 'International',
        'pdf': current_page_scraper.pdf
    })

print(story_data[0:5])
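
The `date` property keeps only the digits of whatever text the page shows. The short sketch below shows what that cleaning does to a few made-up sample strings (the actual format of the date text on IRIS pages is not confirmed here), alongside a hypothetical `extract_year` helper that may be easier to work with when only the year matters.

```python
import re


def digits_only(raw_date):
    """Same cleaning as the `date` property: drop every non-digit character."""
    return re.sub(r'[^\d]', '', raw_date)


def extract_year(raw_date):
    """Hypothetical helper: return the first four-digit year found, or None."""
    match = re.search(r'(19|20)\d{2}', raw_date)
    return int(match.group(0)) if match else None


# Made-up sample strings, for illustration only
for sample in ['(2014)', '2015-03-27', 'March 2016', 'n.d.']:
    print(repr(sample), '->', repr(digits_only(sample)), extract_year(sample))
```

Note that a full date such as `2015-03-27` collapses to `20150327`, so downstream analysis needs to handle values of varying length.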

## Downloading and Scraping PDF Text

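The step below creates a class called `PDFScraper`, with methods for downloading the PDF linked from each article and extracting its text. The loop that follows re-scrapes each article page and rebuilds `story_data`, this time with the extracted text included; as written, it stops after the first 10 articles.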

In [None]:
class PDFScraper:
    """Handles downloading and optionally extracting text from PDF files."""

    def __init__(self, save_path='pdfs'):
        self.save_path = save_path
        if not os.path.exists(self.save_path):
            os.makedirs(self.save_path)

    def download_pdf(self, pdf_url):
        """Download PDF from the given URL and save it to the specified directory."""
        parsed_url = urlparse(pdf_url)
        pdf_filename = os.path.basename(parsed_url.path)
        pdf_filename = unquote(pdf_filename)  # Decode URL-encoded filename
        pdf_filename = os.path.join(self.save_path, pdf_filename)

        try:
            response = requests.get(pdf_url)
            response.raise_for_status()
            with open(pdf_filename, 'wb') as f:
                f.write(response.content)
            print(f'Successfully downloaded {pdf_filename}')
            return pdf_filename
        except requests.RequestException as e:
            print(f'Failed to download {pdf_url}: {e}')
            return None

    def extract_text_from_pdf(self, pdf_path):
        """Extract text from the given PDF file using PyMuPDF (fitz)."""
        try:
            document = fitz.open(pdf_path)
            text = ""
            for page_num in range(len(document)):
                page = document.load_page(page_num)
                text += page.get_text().replace('\n', ' ').replace('\r', ' ')  # Replace newlines and carriage returns with spaces
            return text
        except Exception as e:
            print(f'Failed to extract text from {pdf_path}: {e}')
            return ""

pdf_scraper = PDFScraper()
story_data = []
counter = 0

for article_link in article_links:
    if counter >= 10:  # Stop after scraping 10 articles
        break

    counter += 1
    print(f'Scraping article {counter}/{len(article_links)}')
    current_page_scraper = IRISArticleScraper(article_link)

    # Download PDF if available and add to story_data
    pdf_url = current_page_scraper.pdf
    pdf_filename = None
    extracted_text = None
    if pdf_url and pdf_url != 'No pdf found':
        pdf_filename = pdf_scraper.download_pdf(pdf_url)
        if pdf_filename:
            # Extract text from the downloaded PDF
            extracted_text = pdf_scraper.extract_text_from_pdf(pdf_filename)
            if extracted_text:
                print(f'Extracted text from {pdf_filename}:')
            else:
                print(f'Failed to extract text from {pdf_filename}')
        else:
            print(f'Failed to download PDF from {pdf_url}')
    else:
        print(f'No PDF found for {current_page_scraper.title}')

    # Collect data for story_data
    story_data.append({
        'title': current_page_scraper.title,
        'author': current_page_scraper.author,
        'date': current_page_scraper.date,
        'publication': current_page_scraper.publication,
        'link': current_page_scraper.link,
        'type': 'International',
        'pdf': pdf_filename,  # Store the filename or None if no PDF
        'pdf_link': pdf_url,  # Link to the PDF itself, or 'No pdf found'
        'text': extracted_text if extracted_text else 'No text extracted'
    })

print('Completed scraping articles.')
print(story_data[0:5])
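
`download_pdf` derives the local filename from the last segment of the URL path. The sketch below isolates that step so its behaviour is easy to check. The function name `filename_from_url`, the fallback name, and the example URL (with a placeholder handle) are all assumptions made for illustration.

```python
import posixpath
from urllib.parse import urlparse, unquote


def filename_from_url(pdf_url, fallback='download.pdf'):
    """Return a decoded filename taken from the URL path, ignoring any query string."""
    parsed = urlparse(pdf_url)
    # URL paths always use forward slashes, so posixpath is used regardless of operating system
    name = unquote(posixpath.basename(parsed.path))
    # decoding can reintroduce path separators (%2F), so neutralise them before saving
    name = name.replace('/', '_').replace('\\', '_')
    return name or fallback


# Placeholder URL, not a real IRIS item
example_url = 'https://iris.who.int/bitstream/handle/0000/00000/Ebola%20situation%20report.pdf?sequence=1'
print(filename_from_url(example_url))  # Ebola situation report.pdf
print(filename_from_url('https://iris.who.int/'))  # download.pdf
```

Because the name comes only from the URL, two different items whose files share a name would overwrite each other in the `pdfs` folder; prefixing the filename with an item identifier is one way to avoid that.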

## Saving to File (CSV)

Now we need to output the data for text analysis. In this step we will output each record as a line in a .CSV (spreadsheet) file. That file will be stored in `who_iris.csv`.

In [None]:
def clean_text(text):
    """Clean the text by replacing newlines and carriage returns with spaces."""
    if text:
        # Replace newlines and carriage returns with spaces
        text = text.replace('\n', ' ').replace('\r', ' ')
        # Double quotes are left as-is: the csv writer escapes them itself,
        # so doubling them here would corrupt the output with extra quotes
    return text

OUTPUT_CSV_FILENAME = 'who_iris.csv'
OUTPUT_CSV_FIELDNAMES = ['title', 'author', 'date', 'publication', 'link', 'type', 'pdf', 'pdf_link', 'text']

# Specify the directory where PDFs are stored (relative to the current working directory,
# matching the default save_path of PDFScraper; change this if your PDFs live elsewhere)
pdfs_directory = 'pdfs'

# Create output file path
output_filepath = os.path.join(os.path.abspath(os.getcwd()), pdfs_directory, OUTPUT_CSV_FILENAME)

# Ensure directory exists, if not, create it
output_directory = os.path.dirname(output_filepath)
os.makedirs(output_directory, exist_ok=True)

print('Writing CSV File ', output_filepath)
try:
    with open(output_filepath, 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=OUTPUT_CSV_FIELDNAMES, quoting=csv.QUOTE_MINIMAL)
        writer.writeheader()
        for story_datum in story_data:
            # Clean the text field
            story_datum['text'] = clean_text(story_datum['text'])
            writer.writerow(story_datum)
    print('Success writing CSV File!')
except IOError:
    print(f'Error writing to CSV file: {output_filepath}')
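
Python's `csv` module quotes fields and escapes embedded double quotes on its own, so text does not need to be escaped by hand before writing. The following sketch demonstrates this with an in-memory buffer and a made-up record (the title and text are invented for illustration), then reads the data back to confirm it survives the round trip.

```python
import csv
import io

# Made-up record containing commas and double quotes
rows = [{'title': 'Ebola "response" roadmap', 'text': 'Line one, with a comma and "quotes"'}]

# Write to an in-memory buffer instead of a file so the example leaves nothing on disk
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['title', 'text'], quoting=csv.QUOTE_MINIMAL)
writer.writeheader()
writer.writerows(rows)

# The writer has doubled the embedded quotes and wrapped the fields in quotes
print(buffer.getvalue())

# Reading the data back restores the original strings
buffer.seek(0)
read_back = list(csv.DictReader(buffer))
print(read_back == rows)  # True
```

The same check can be run against `who_iris.csv` by opening the file with `newline=''` and `encoding='utf-8'` and passing it to `csv.DictReader`.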