## Scraping the Awoko Newspaper

**Scraping script done on behalf of Oladoyin Okunoren @ Boston College**

By David J. Thomas

---

This notebook contains a series of scripts that scrape every news story about Ebola published by the Awoko Newspaper from 2014 to 2016. It is part of the dissertation research of Oladoyin Okunoren at [Boston College](https://bc.edu).

---

## Installation

``` bash
pip install -r requirements.txt
jupyter lab
```
---

## Defining A General Object to Scrape Any Page

Before we can start scraping pages, the step below creates a general class called `BasePageScraper`. Given a URL, it fetches the page's HTML and parses it with BeautifulSoup. This class won't be used directly. Instead, it serves as a base class for the classes defined in the following steps, each aimed at a specific kind of page. Those child classes inherit the methods and attributes of their parent. Once the base class is defined, we print a success message.

In [None]:
import time
import requests
from bs4 import BeautifulSoup

class BasePageScraper:
    """Gets HTML for a single page, parses it with BeautifulSoup, and stores results in self.data.
    Base class for child classes which target specific pages"""
    
    def __init__(self, url, total_tries=5, scrape_delay=5):
        """Gets url and pre-parses the html"""
        self._url = url
        self._total_tries = total_tries
        self._scrape_delay = scrape_delay
        # store souped data in object for extraction
        self.data = self.soup_page(tries_left=self._total_tries)

    def soup_page(self, tries_left=5):
        """Receives URL, requests raw html, then returns converted BeautifulSoup object."""
        # declare variable for raw html
        page_html = None
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0",
            "Accept-Encoding": "*",
            "Connection": "keep-alive"
        }
        # ensure tries_left is a positive integer, if not set to 5, check if url is valid
        if not isinstance(tries_left, int) or tries_left < 1:
            tries_left = 5
        if not isinstance(self._url, str):
            raise Exception('URL must be a valid string')
        # enforce a time delay between each scrape for good internet citizenship
        time.sleep(self._scrape_delay)
        print('Getting', self._url)
        # attempt to get page data
        try:
            page_html = requests.get(self._url, headers=headers).text
        # if a request error occurred, retry by calling this method recursively with one fewer try
        except requests.exceptions.RequestException:
            print('Error getting', self._url)
            if tries_left > 1:
                print('Retrying...')
                print(tries_left - 1)
                return self.soup_page(tries_left=tries_left-1)
            print('Retry limit reached, ABORTING parse of', self._url)
            return None
        print('Success, souping...')
        # if all went well, return new BeautifulSoup populated with the page html and parser set to html.parser
        return BeautifulSoup(page_html, 'html.parser')

print('Class defined! PROCEED')
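
The retry behavior can also be written as a loop instead of a recursive call. The block below is a minimal sketch of that idea, not the notebook's own implementation. The names `fetch_with_retries` and `flaky_fetch` are hypothetical, and the `fetch` argument stands in for the real HTTP request so the example runs without a network connection.

```python
import time


def fetch_with_retries(fetch, url, total_tries=5, delay=0):
    """Call fetch(url) up to total_tries times; return its result, or None if every try fails.

    `fetch` is a hypothetical callable standing in for the real HTTP request.
    """
    for attempt in range(1, total_tries + 1):
        # pause before every attempt, as the scraper class does
        time.sleep(delay)
        try:
            return fetch(url)
        except Exception as error:
            print('Attempt', attempt, 'of', total_tries, 'failed:', error)
    print('Retry limit reached, ABORTING parse of', url)
    return None


# usage with a fake fetcher that fails twice before succeeding
calls = []


def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionError('simulated network failure')
    return '<html><body>ok</body></html>'


result = fetch_with_retries(flaky_fetch, 'https://example.com/', total_tries=5)
print(result)
```

A loop keeps the number of attempts in one place, which makes it harder to lose count of the remaining tries between calls.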

## Scraping the Awoko Browse Page (Will take a while)

This step finds the link to every story about Ebola from 2014 to 2016. The scraper class defined below inherits from the base class above and adds a few extras. The `.next_link` property returns the link from the "next" button, if it exists. The `.links` property returns a list of the links to every article on that browse page. Finally, the `.gather_links()` method collects all the links on the current page and then calls itself recursively on each following page (via the next-page links), so it returns every link from this page onward.

Once the class is defined, it can be used to scrape all of the Ebola stories for a given year. So, below the class definition, it is used three times: on the stories from (1) 2014, (2) 2015, and (3) 2016. All of those links are gathered in a list called `article_links`, which is used in the next step. There, we will use another scraper class on each of those pages.

In [None]:
class AwokoBrowseScraper(BasePageScraper):
    """Represents a browse page of the Awoko Newspaper. Gathers links of stories returned from a search"""

    def __init__(self, url, total_tries=5, scrape_delay=5):
        super().__init__(url=url, total_tries=total_tries, scrape_delay=scrape_delay)

    @property
    def next_link(self):
        """Returns link to the next browse page if exists, or None if no further page"""
        link = self.data.find('a', class_='next')
        if not link:
            return None
        else:
            return link['href']

    @property
    def links(self):
        """Peruses the souped data and returns list of strings, each with a link"""
        return_links = []
        for article in self.data.find_all('article', class_='jeg_post'):
            # append link of article to list
            return_links.append(article.find('h3', class_='jeg_post_title').find('a')['href'])
        return return_links
    
    def gather_links(self):
        """Recursive method to gather links to all stories on this page and all subsequent pages. If not on the last page,
        return the links on this page plus those returned by calling gather_links() on a new AwokoBrowseScraper for the next page.
        If on the last page, just return the links and end the recursion"""
        # if no next_link, just return the links on the page
        if not self.next_link:
            return self.links
        else:
            # pass this scraper's retry and delay settings along to the next page
            next_page = AwokoBrowseScraper(self.next_link, total_tries=self._total_tries, scrape_delay=self._scrape_delay)
            return self.links + next_page.gather_links()

article_links = []
print('Gathering stories, this may take awhile...')
# gather stories for 2014, 2015, 2016
article_links += AwokoBrowseScraper('https://awokonewspaper.sl/page/1/?s=ebola&year=2014').gather_links()
article_links += AwokoBrowseScraper('https://awokonewspaper.sl/page/1/?s=ebola&year=2015').gather_links()
article_links += AwokoBrowseScraper('https://awokonewspaper.sl/page/1/?s=ebola&year=2016').gather_links()

print(str(len(article_links)) + ' links to stories gathered')
print(article_links[0:5])
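
Because `.gather_links()` recurses once per browse page, a search with many result pages adds one call-stack frame per page. The sketch below shows the same traversal as a loop, under stated assumptions: `gather_all_links` and `load_page` are hypothetical names, and `load_page` is a stand-in callable that returns the article links on a page together with the URL of the next page (or `None`). The fake pages are illustrative data, not results from the live site.

```python
def gather_all_links(first_url, load_page):
    """Follow next-page links from first_url and return every article link found.

    `load_page` is a hypothetical callable: given a url, it returns a tuple of
    (list of article links on that page, url of the next page or None).
    """
    all_links = []
    seen_pages = set()
    current_url = first_url
    # stop at the last page, or if a page links back to one already visited
    while current_url and current_url not in seen_pages:
        seen_pages.add(current_url)
        page_links, next_url = load_page(current_url)
        all_links.extend(page_links)
        current_url = next_url
    return all_links


# usage with three fake browse pages in place of the live site
fake_pages = {
    'page/1': (['story-a', 'story-b'], 'page/2'),
    'page/2': (['story-c'], 'page/3'),
    'page/3': (['story-d'], None),
}
links = gather_all_links('page/1', lambda url: fake_pages[url])
print(links)
```

With the class above, `load_page` could build an `AwokoBrowseScraper` for the URL and return its `.links` and `.next_link`.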

## Scraping Each Article (Will take a long while)

Warning: this step will take a long time to run. First, we will define another scraper class, this one representing the page of an article on the Awoko Newspaper site. Once we define the class, we will then use it on each URL in `article_links` gathered above. The scraper makes it quick and easy to extract the text and various bits of metadata.

Once we have defined the class, we will loop through each of the `article_links`, create a scraper for it, and append the data it extracts as a dictionary to the list `story_data`. In the next steps, we will save the extracted data as both CSV and TXT files.

In [None]:
class AwokoArticleScraper(BasePageScraper):
    """Represents an article of the Awoko Newspaper. Aids extracting text and metadata"""

    def __init__(self, url, total_tries=5, scrape_delay=5):
        super().__init__(url=url, total_tries=total_tries, scrape_delay=scrape_delay)
        self.data = self.data.find('div', class_='jeg_inner_content')

    @property
    def title(self):
        return self.data.find('h1', class_='jeg_post_title').get_text()

    @property
    def author(self):
        return self.data.find('div', class_='jeg_meta_author').a.get_text()

    @property
    def date(self):
        return self.data.find('div', class_='jeg_meta_date').a.get_text()

    @property
    def column(self):
        return self.data.find('div', class_='jeg_meta_category').a.get_text()

    @property
    def publication(self):
        return 'Awoko Newspaper'
    
    @property
    def link(self):
        return self._url
    
    @property
    def text(self):
        return self.data.find('div', class_='entry-content').find('div', class_='content-inner').p.get_text()
    
# loop through every link, create an AwokoArticleScraper object for it, extract data, and store in story_data
story_data = []
counter = 0
for article_link in article_links:
    counter += 1
    print('Scraping article ' + str(counter) + '/' + str(len(article_links)))
    current_page_scraper = AwokoArticleScraper(article_link)
    story_data.append({
        'title': current_page_scraper.title,
        'author': current_page_scraper.author,
        'date': current_page_scraper.date,
        'column': current_page_scraper.column,
        'publication': current_page_scraper.publication,
        'link': current_page_scraper.link,
        'type': 'local',
        'text': current_page_scraper.text
    })

print(story_data[0:5])
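
Each property above chains lookups such as `.find(...).a.get_text()`. When `find()` matches nothing it returns `None`, so a single article without, say, an author element would raise an error and stop the whole loop. The sketch below shows one possible guard, assuming a blank or placeholder value is acceptable for missing fields. `safe_get` and `FakeArticle` are hypothetical names introduced only for this illustration.

```python
def safe_get(getter, default=''):
    """Return getter(), or default if the expected element is missing from the page."""
    try:
        return getter()
    except (AttributeError, TypeError):
        # find() returned None, so a chained lookup such as .a or ['href'] failed
        return default


class FakeArticle:
    """Hypothetical stand-in for a scraped article that has a title but no author element."""

    @property
    def title(self):
        return 'Sample headline'

    @property
    def author(self):
        missing_element = None  # what find() returns when nothing matches
        return missing_element.get_text()


article = FakeArticle()
record = {
    'title': safe_get(lambda: article.title),
    'author': safe_get(lambda: article.author, default='unknown'),
}
print(record)
```

In the loop above, the same wrapper could be applied to each field, for example `safe_get(lambda: current_page_scraper.author)`.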

## Saving to File (CSV)

Now we need to output the data for text analysis. In this step we will write each record as a row in a CSV (spreadsheet) file. That file will be stored in `output/awoko_newspaper.csv`.

In [None]:
import os
import csv

OUTPUT_CSV_FILENAME = 'awoko_newspaper.csv'
OUTPUT_CSV_FIELDNAMES = ['title', 'date', 'author', 'column', 'publication', 'link', 'type', 'text']

output_dirpath = os.path.join(os.path.abspath(os.getcwd()), 'output')
output_filepath = os.path.join(output_dirpath, OUTPUT_CSV_FILENAME)

# ensure directory exists, if not, create it
os.makedirs(output_dirpath, exist_ok=True)

print('Writing CSV File ', output_filepath)
# newline='' lets the csv module control line endings, as the csv documentation recommends
with open(output_filepath, 'w', encoding='utf8', newline='') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=OUTPUT_CSV_FIELDNAMES)
    writer.writeheader()
    for story_datum in story_data:
        writer.writerow(story_datum)

print('Success writing CSV File!')
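
Article text usually contains commas and line breaks, so it is worth checking that a record survives a write and read cycle. The block below is a small sketch of such a check using an in-memory buffer instead of the real output file. The sample row is made up for illustration and uses only two of the field names.

```python
import csv
import io

fieldnames = ['title', 'text']
rows = [{'title': 'Sample headline', 'text': 'First paragraph.\nSecond paragraph, with a comma.'}]

# write to an in-memory buffer; newline='' mirrors how a real file should be opened for csv
buffer = io.StringIO(newline='')
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(rows)

# read the same data back and compare it with what was written
buffer.seek(0)
read_back = list(csv.DictReader(buffer))
print(read_back == rows)
```

The same comparison could be run against `output/awoko_newspaper.csv` by opening that file with `newline=''` and comparing the rows with `story_data`.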

## Saving to File (TXT)

Finally, some text analysis packages expect folders of .txt files instead of .csv files. So, we will also write every record to its own .txt file at `output/awoko_newspaper/FILENAME.txt`, where FILENAME is taken from the last segment of the story's URL.

In [None]:
import os

OUTPUT_FOLDERNAME = 'awoko_newspaper'

output_folderpath = os.path.join(os.path.abspath(os.getcwd()), 'output', OUTPUT_FOLDERNAME)

# ensure directory exists, if not, create it
if not os.path.exists(output_folderpath):
    os.makedirs(output_folderpath)

print('Writing TXT Files ', output_folderpath)
for story_datum in story_data:
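    # name the file after the last path segment of the story url (assumes the url ends with a slash)
    output_filename = story_datum['link'].split('/')[-2] + '.txt'
    output_filepath = os.path.join(output_folderpath, output_filename)
    # the with block closes the file even if the write fails
    with open(output_filepath, 'w', encoding='utf8') as txtfile:
        txtfile.write(story_datum['text'])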

print('Success writing TXT Files!')
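
The file name above comes from `split('/')[-2]`, which picks the story slug only when the URL ends with a slash. For a URL without one, it would return the preceding segment instead. The sketch below is one possible alternative using `urllib.parse` from the standard library. `slug_from_url` is a hypothetical helper, and the story URLs are made up for illustration.

```python
from urllib.parse import urlparse


def slug_from_url(url):
    """Return the last non-empty path segment of url, for use as a file name."""
    path = urlparse(url).path
    segments = [segment for segment in path.split('/') if segment]
    # fall back to a fixed name when the url has no path at all
    return segments[-1] if segments else 'index'


# the story urls below are made up for illustration
print(slug_from_url('https://awokonewspaper.sl/sample-story/'))
print(slug_from_url('https://awokonewspaper.sl/sample-story'))
print(slug_from_url('https://awokonewspaper.sl/sample-story/?utm=1'))
```

All three calls yield the same slug, because the query string and any trailing slash are ignored.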