## Scraping the Sierra Leone Telegraph

**Scraping script done on behalf of Oladoyin Okunoren @ Boston College**

By David J. Thomas

---

This notebook contains a series of scripts to scrape every news story about Ebola published by the Sierra Leone Telegraph from 2014 to 2016. It is part of the dissertation research of Oladoyin Okunoren at [Boston College](https://bc.edu).

---

## Installation

``` bash
pip install -r requirements.txt
jupyter lab
```
---

## Gathering Links to Stories

This first step loads the search results URL for each of the three years in question (2014-2016). On each page, it waits until the modal popover containing the links to each story appears. However, this popover uses infinite scroll and doesn't load all the stories at once.

So, we will have to use the Selenium webdriver to run Chrome headlessly (i.e. in the background). For each year's results, we will call the `gather_browse_links()` function. It uses Selenium to scroll the page over and over until it detects that no new results have been loaded after scrolling. It does this by having Selenium execute JavaScript that calls the browser's `scrollIntoView()` method on an element at the footer of the list, which causes the page to scroll to the bottom. It then checks the height (in pixels) of the container element holding the stories. If that height has not changed since before the scroll, no new results were loaded and scrolling can stop.
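
The stopping rule described above can be sketched independently of Selenium. In this minimal sketch, `scroll_until_stable`, `get_height`, `scroll_once`, and `max_rounds` are hypothetical names that do not appear in the notebook; in the real script the two callables would wrap the `driver.execute_script()` calls. The `max_rounds` cap is an added assumption to guard against a page that never stops growing.

```python
import time


def scroll_until_stable(get_height, scroll_once, delay=0.0, max_rounds=1000):
    """Scroll repeatedly until the measured height stops changing.

    get_height and scroll_once are caller-supplied callables (hypothetical
    stand-ins for the Selenium calls). Returns the number of scrolls performed.
    """
    previous_height = get_height()
    for rounds in range(1, max_rounds + 1):
        scroll_once()
        time.sleep(delay)  # give the page time to load more results
        new_height = get_height()
        if new_height == previous_height:
            return rounds  # nothing new loaded, so scrolling can stop
        previous_height = new_height
    return max_rounds


# Simulated page: each scroll adds 100px until the height reaches 500px.
page = {"height": 200}


def fake_scroll():
    page["height"] = min(page["height"] + 100, 500)


scroll_count = scroll_until_stable(lambda: page["height"], fake_scroll)
print(scroll_count, page["height"])
```

Note that the final scroll is always a "wasted" one: it is the scroll that confirms nothing new was loaded.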

Once no more results load, we use BeautifulSoup to parse the results and extract the URL of every story, and `gather_browse_links()` returns them as a list of links.
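
Links in search markup are sometimes protocol-relative (`//host/path`) rather than absolute, which is what the script's manual `'http://'` prefix works around. As a hedged alternative, the standard library's `urllib.parse.urljoin` resolves protocol-relative, root-relative, and absolute hrefs against the page URL. The helper `absolute_links` and the sample hrefs below are illustrative assumptions, not values taken from the site.

```python
from urllib.parse import urljoin


def absolute_links(page_url, hrefs):
    """Resolve each href against page_url so every link is absolute."""
    return [urljoin(page_url, href) for href in hrefs]


# Illustrative hrefs only; the real ones come from the parsed search results.
sample_hrefs = [
    "//www.thesierraleonetelegraph.com/story-one/",          # protocol-relative
    "/story-two/",                                            # root-relative
    "https://www.thesierraleonetelegraph.com/story-three/",  # already absolute
]
resolved = absolute_links("https://www.thesierraleonetelegraph.com/?s=ebola", sample_hrefs)
for link in resolved:
    print(link)
```

Unlike the manual prefix, this approach inherits the scheme of the page URL, so the resolved links here use `https`.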

Below the function definitions, we call `gather_browse_links()` on each year's URL and aggregate the results in a variable called `story_links`, which we will use in the next step.
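
The per-year search URLs differ only in the year, so they could also be generated rather than written by hand. This is a minimal sketch; `BASE_URL` and `build_search_url` are hypothetical names, and the query parameters are simply copied from the URLs used in this notebook.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://www.thesierraleonetelegraph.com/"


def build_search_url(term, year):
    """Build the search URL for one year, matching the hand-written URLs."""
    params = {"s": term, "year_post_date": f"{year}-01-01 00:00:00"}
    # quote_via=quote encodes spaces as %20 (the default would use '+')
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


year_urls = [build_search_url("ebola", year) for year in (2014, 2015, 2016)]
for year_url in year_urls:
    print(year_url)
```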

In [None]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import time

SCRAPE_DELAY = 5

def scroll_popover(element_css_selector, driver):
    """Scrolls the page down to the first element matching the given CSS selector, if one exists"""
    # find_elements returns an empty list instead of raising when nothing matches
    elements = driver.find_elements(By.CSS_SELECTOR, element_css_selector)
    if elements:
        driver.execute_script("arguments[0].scrollIntoView(true);", elements[0])

def gather_browse_links(url):
    """Uses Selenium to load a search results page, and scroll infinitely until there are no more results... then gather and return a list of links to stories"""
    # links to be returned
    links = []
    # initialize webdriver
    webdriver_path = '/usr/local/bin/chromedriver'
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_service = ChromeService(executable_path=webdriver_path)
    driver = webdriver.Chrome(service=chrome_service, options=chrome_options)
    try:
        # open the browse page and wait for popover to load
        driver.get(url)
        wait = WebDriverWait(driver, 10)
        pop_over = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 'div.jetpack-instant-search__search-results-content')))
        # make sure popover is in view
        driver.execute_script("arguments[0].scrollIntoView(true);", pop_over)
        # keep scrolling until the height stops changing
        previous_height = driver.execute_script("return arguments[0].scrollHeight", pop_over)
        while True:
            # scroll popover to the bottom pagination element
            scroll_popover('div.jetpack-instant-search__search-results-pagination', driver)
            time.sleep(SCRAPE_DELAY)
            # see if height has changed
            new_height = driver.execute_script("return arguments[0].scrollHeight", pop_over)
            if new_height == previous_height:
                break
            previous_height = new_height
        # once no more results, get the full html of the results
        pop_over_html = pop_over.get_attribute('innerHTML')
    finally:
        # terminate browser, even if an error occurred above
        driver.quit()
    # parse with BeautifulSoup, and break into individual results
    soup = BeautifulSoup(pop_over_html, 'html.parser')
    stories = soup.find_all('li', class_='jetpack-instant-search__search-result')
    # iterate over results
    for story in stories:
        anchor = story.find('a')
        if anchor and anchor.get('href'):
            # hrefs are assumed to be protocol-relative ("//host/path"), so drop the leading "//"
            links.append('http://' + str(anchor['href'])[2:])
    return links

target_urls = [
    'https://www.thesierraleonetelegraph.com/?s=ebola&year_post_date=2014-01-01%2000%3A00%3A00',
    'https://www.thesierraleonetelegraph.com/?s=ebola&year_post_date=2015-01-01%2000%3A00%3A00',
    'https://www.thesierraleonetelegraph.com/?s=ebola&year_post_date=2016-01-01%2000%3A00%3A00'
]

story_links = []

for target_url in target_urls:
    story_links += gather_browse_links(target_url)

print((str(len(story_links))) + ' stories')

for link in story_links[0:5]:
    print(link)

## Scraping Individual Story Pages (Will Take a While)

Since individual story pages do not require scrolling or any other kind of dynamic interaction, we don't need a solution as complex as Selenium for this step. Our page-fetching function, `soup_page()`, uses the `requests` library to make a simple HTTP request to a URL and fetch the HTML, then parses it with `BeautifulSoup`, similar to the step above. `scrape_page()` uses the parsed HTML returned by `soup_page()` to extract the specific bits of desired data (text & metadata).
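
The fetch-with-retries idea can be illustrated without touching the network. Below is a minimal sketch that uses a loop rather than recursion; `fetch_with_retries` and `flaky_fetch` are hypothetical names, and `fetch` stands in for whatever function performs the HTTP request.

```python
import time


def fetch_with_retries(fetch, url, tries=5, delay=0.0):
    """Call fetch(url) up to `tries` times, pausing `delay` seconds before each try.

    Returns the fetched value, or None if every attempt raised an exception.
    """
    for attempt in range(1, tries + 1):
        time.sleep(delay)  # polite pause before every request
        try:
            return fetch(url)
        except Exception as error:
            print(f"Attempt {attempt} failed: {error}")
    return None


# Simulated fetcher that fails twice and then succeeds.
calls = {"count": 0}


def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "<html>ok</html>"


html = fetch_with_retries(flaky_fetch, "https://example.com/story/")
print(html, calls["count"])
```

A loop keeps the retry budget in one place, which makes it harder to reset the counter by accident between attempts.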

Below our function definitions, we call the `scrape_page()` function on every link in `story_links` that we gathered in the step above. Each extracted record is added to a list of dictionaries in `story_data`. We will then save that data in the steps that follow.
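
Because `scrape_page()` returns `None` when a page cannot be fetched, it can help to keep failed links separate from successful records so they can be retried or inspected later. This is a sketch under that assumption; `collect_records` and `fake_scrape` are hypothetical names, and `fake_scrape` only imitates the real scraper.

```python
def collect_records(links, scrape):
    """Apply scrape to every link, keeping successes and noting failures."""
    records, failed = [], []
    for position, link in enumerate(links, start=1):
        print(f"{position}/{len(links)}")
        record = scrape(link)
        if record is None:
            failed.append(link)
        else:
            records.append(record)
    return records, failed


def fake_scrape(link):
    """Stand-in scraper: pretends that any link containing 'bad' fails."""
    return None if "bad" in link else {"link": link, "text": "..."}


records, failed = collect_records(["a/good/", "a/bad/", "b/good/"], fake_scrape)
print(len(records), failed)
```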

In [None]:
import requests

def soup_page(url, scrape_delay=10, tries_left=5):
    """Receives URL, requests raw html, then returns converted BeautifulSoup object (or None on failure)."""
    # ensure tries_left is a positive integer, if not set to 5, check if url is a string
    if not isinstance(tries_left, int) or tries_left < 1:
        tries_left = 5
    if not isinstance(url, str):
        raise TypeError('URL must be a valid string')
    # loop rather than recurse, so the retry counter can never be reset by accident
    while tries_left > 0:
        # enforce a time delay between each request for good internet citizenship
        time.sleep(scrape_delay)
        print('Getting', url)
        try:
            response = requests.get(url, timeout=30)
            # treat HTTP error codes (404, 500, ...) as failures too
            response.raise_for_status()
        except requests.RequestException:
            # if an error occurred, use up one try and go around again
            tries_left -= 1
            print('Error getting', url)
            if tries_left > 0:
                print('Retrying...')
            continue
        print('Success, souping...')
        # if all went well, return new BeautifulSoup populated with the page html and parser set to html.parser
        return BeautifulSoup(response.text, 'html.parser')
    print('Retry limit reached, ABORTING parse of', url)
    return None

def scrape_page(url):
    """Receives a url, uses soup_page to fetch/parse it as soup.
    This function extracts specific desired data (text & metadata)
    and returns it in a dictionary"""
    extracted_data = {
        'title': '',
        'date': '',
        'author': '',
        'column': '',
        'link': url,
        'publication': 'The Sierra Leone Telegraph',
        'type': 'Local',
        'text': ''
    }
    soup = soup_page(url)
    # if page was not fetched successfully, abort and return None
    if not soup:
        return None

    def text_of(tag_name, class_name):
        """Returns the text of the first matching tag, or '' if the tag is missing"""
        tag = soup.find(tag_name, class_=class_name)
        return tag.get_text() if tag else ''

    # get specific bits of metadata
    extracted_data['title'] = text_of('h1', 'entry-title')
    extracted_data['date'] = text_of('span', 'entry-meta-date')
    extracted_data['author'] = text_of('span', 'entry-meta-author')
    extracted_data['column'] = text_of('span', 'entry-meta-categories')
    # collapse all runs of whitespace in the body text into single spaces
    extracted_data['text'] = ' '.join(text_of('div', 'entry-content').split())
    return extracted_data

# iterate each link to a story, extract data, and append to story_data
story_data = []
counter = 0
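for link in story_links:
    counter += 1
    print(str(counter) + '/' + str(len(story_links)))
    story_datum = scrape_page(link)
    # scrape_page returns None when a page could not be fetched, so skip those
    if story_datum:
        story_data.append(story_datum)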

for story_datum in story_data[0:5]:
    print(story_datum)
    
    

## Saving to File (CSV)

Now we need to output the data for text analysis. In this step we will output each record as a line in a .CSV (spreadsheet) file. That file will be stored in `output/sierra_leon_telegraph.csv`.
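
Story text often contains commas and quotation marks, which `csv.DictWriter` escapes automatically. The following sketch writes to an in-memory buffer instead of a file so the result can be inspected directly; the shortened `FIELDNAMES` list and the two sample rows are made up for illustration.

```python
import csv
import io

FIELDNAMES = ["title", "date", "text"]  # shortened, illustrative field list
rows = [
    {"title": "First story", "date": "1 May 2014", "text": "Has, a comma"},
    {"title": "Second story", "date": "2 May 2014", "text": 'Has "quotes"'},
]

# newline="" leaves line endings to the csv module, as with a real file
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back recovers the original values unchanged.
round_trip = list(csv.DictReader(io.StringIO(csv_text, newline="")))
```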

In [None]:
import os
import csv

OUTPUT_CSV_FILENAME = 'sierra_leon_telegraph.csv'
OUTPUT_CSV_FIELDNAMES = ['title', 'date', 'author', 'column', 'publication', 'link', 'type', 'text']

output_dirpath = os.path.join(os.path.abspath(os.getcwd()), 'output')
output_filepath = os.path.join(output_dirpath, OUTPUT_CSV_FILENAME)

# ensure directory exists, if not, create it
os.makedirs(output_dirpath, exist_ok=True)

print('Writing CSV File ', output_filepath)
# newline='' lets the csv module control line endings (avoids blank rows on Windows)
with open(output_filepath, 'w', encoding='utf8', newline='') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=OUTPUT_CSV_FIELDNAMES)
    writer.writeheader()
    for story_datum in story_data:
        writer.writerow(story_datum)

print('Success writing CSV File!')

## Saving to File (TXT)


Finally, some text analysis packages use folders of .txt files instead of .csv files. So, we will also output every record as a .txt file located at `output/sierra_leon_telegraph/FILENAME.txt`, where FILENAME is determined by the URL of the story.
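
Deriving a filename from a story URL can be sketched as follows. `filename_from_url` is a hypothetical helper, the sample URLs are made up, and the `index` fallback for a URL with no path is an assumption added for this example.

```python
from urllib.parse import urlparse


def filename_from_url(url):
    """Use the last non-empty path segment of the URL as the file name."""
    segments = [part for part in urlparse(url).path.split("/") if part]
    slug = segments[-1] if segments else "index"  # fallback name is an assumption
    return slug + ".txt"


print(filename_from_url("http://www.thesierraleonetelegraph.com/some-story/"))
print(filename_from_url("http://www.thesierraleonetelegraph.com/some-story"))
```

Filtering out empty segments means the result is the same whether or not the URL ends with a trailing slash.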

In [None]:
import os

OUTPUT_FOLDERNAME = 'sierra_leon_telegraph'

output_folderpath = os.path.join(os.path.abspath(os.getcwd()), 'output', OUTPUT_FOLDERNAME)

# ensure directory exists, if not, create it
if not os.path.exists(output_folderpath):
    os.makedirs(output_folderpath)

print('Writing TXT Files ', output_folderpath)
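for story_datum in story_data:
    # the filename is the last path segment of the url (assumes the url ends with a trailing slash)
    output_filename = story_datum['link'].split('/')[-2] + '.txt'
    output_filepath = os.path.join(output_folderpath, output_filename)
    # the with block closes the file even if the write fails
    with open(output_filepath, 'w', encoding='utf8') as txtfile:
        txtfile.write(story_datum['text'])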

print('Success writing TXT Files!')