# Web Scraping with Selenium

Below is a pipeline for scraping webpages using the Selenium WebDriver. To run it, ensure that the `selenium` and `beautifulsoup4` Python packages and the [Chrome WebDriver](https://googlechromelabs.github.io/chrome-for-testing/) are installed.

First, we import the necessary libraries and functions.

In [5]:
from bs4 import BeautifulSoup
from selenium import webdriver
# from selenium.webdriver.chrome.options import Options

# headless background execution
options = webdriver.ChromeOptions()
options.add_argument("--headless")  # run Chrome without opening a window on screen

We begin with an example of the scraping process, using the sample incident report linked below. The webpage containing this report is *dynamically rendered*, so its HTML cannot be scraped directly with, for example, the `requests` and `BeautifulSoup` libraries alone. Instead, we render the webpage with our WebDriver and then scrape the resulting HTML.

In [3]:
url = 'https://www.shetlandtimes.co.uk/2012/07/11/shetland-catch-ordered-to-pay-back-1-5-million-of-black-fish-profits'
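
To make the point about dynamic rendering concrete, here is a minimal sketch of what a static parse sees. It assumes a small, hypothetical HTML snippet (`RAW_HTML`, not the actual page) and uses the standard library's `html.parser` as a stand-in for any parser that does not execute JavaScript. The helper class `ParagraphCollector` is illustrative only.

```python
from html.parser import HTMLParser

# Hypothetical raw HTML, as a plain HTTP client would receive it before any
# JavaScript has run. The article paragraph is only created by the script.
RAW_HTML = """
<html><body>
  <p>Loading article...</p>
  <div id="article"></div>
  <script>
    var p = document.createElement("p");
    p.textContent = "Full article text.";
    document.getElementById("article").appendChild(p);
  </script>
</body></html>
"""


class ParagraphCollector(HTMLParser):
    """Collects the text found inside <p> elements of static HTML."""

    def __init__(self):
        super().__init__()
        self.in_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        # Only text that sits inside a <p> element is recorded.
        if self.in_p:
            self.paragraphs[-1] += data


collector = ParagraphCollector()
collector.feed(RAW_HTML)
static_paragraphs = collector.paragraphs
print(static_paragraphs)
```

Only the placeholder paragraph is found, because the script is never executed. A browser driven by the WebDriver would run the script first, which is why the rendered page source is parsed instead.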

First, we initialize the WebDriver and render the desired webpage. We then parse the rendered HTML with `BeautifulSoup` and collect every paragraph element, that is, every element enclosed in `<p>` and `</p>` tags. Finally, we extract the text inside these tags.

In [7]:
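driver = webdriver.Chrome(options=options)
try:
    driver.get(url)
    # page_source is the page's HTML as currently loaded in the browser
    page_soup = BeautifulSoup(driver.page_source, 'html.parser')
finally:
    driver.quit()  # close the browser even if the page fails to load

p_list = page_soup.find_all("p")

for p in p_list:
    print(p.get_text())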

The company at the heart of the black fish scam which involved thirteen Shetland fishermen and four others from the rest of Scotland was ordered to pay back £1.5 million of its illegal profits today.
Shetland Catch was also fined £150,000 for its role in helping skippers to get round quota regulations with an elaborate and secret electronic measuring system which concealed how much fish was actually processed at its factory.
Prosecutors had claimed that helping to land mackerel and herring which it did not declare to the authorities earned the Lerwick plant more than £6 million. Using legislation more usually directed against drug barons, the Crown demanded that the money be confiscated.
Shetland Catch challenged the demand and Lord Turnbull heard a week of evidence and legal argument at the High Court in Edinburgh earlier this year before announcing his decision on Wednesday.
The ruling marks the end of almost two years of argument and negotiation since August 2010 when Shetland Catch

We now concatenate all of the text in the report and can optionally save it to a text file.

In [13]:
text = ''

for p in p_list:
    text += ' ' + p.get_text()


save_txt = False

if save_txt:
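    # explicit encoding so characters such as "£" are written consistently across platforms
    with open("sample_article.txt", "w", encoding="utf-8") as text_file: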
        text_file.write(text)
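
The concatenation above keeps whatever whitespace each paragraph contains and leaves a leading space at the start of `text`. As an optional variation, the sketch below joins paragraph strings with single spaces and drops empty paragraphs. The helper name `join_paragraphs` and the sample strings are hypothetical; in the notebook the input would be `[p.get_text() for p in p_list]`.

```python
import re


def join_paragraphs(paragraphs: list[str]) -> str:
    """Join paragraph strings into one string with single spaces between words."""
    cleaned = []
    for paragraph in paragraphs:
        # Collapse runs of whitespace (newlines, tabs, non-breaking spaces) to one space.
        collapsed = re.sub(r"\s+", " ", paragraph).strip()
        if collapsed:  # skip paragraphs that are empty after cleaning
            cleaned.append(collapsed)
    return " ".join(cleaned)


# Hypothetical paragraph strings standing in for scraped paragraph text.
sample = ["  Shetland Catch was\nfined.  ", "", "The ruling\u00a0marks the end."]
joined = join_paragraphs(sample)
print(joined)
```

Whether this normalisation is wanted depends on the downstream use; it changes spacing only, not the words themselves.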

The following function wraps the steps above and extracts all paragraph text from the webpage at the provided URL.

In [14]:
def get_url_text(url : str, save : bool = False, file : str = '') -> str:
    '''
    Uses the Selenium WebDriver to scrape all paragraph text from the webpage at the provided URL.

    url : URL of the webpage to be scraped.
    save : If True, the scraped text is also saved to a txt file.
    file : Name of the txt file, without the ".txt" extension; used only when save is True.

    Returns the webpage text as a single string.
    '''

    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        page_soup = BeautifulSoup(driver.page_source, 'html.parser')
    finally:
        driver.quit()  # close the browser even if the page fails to load

    p_list = page_soup.find_all("p")

    text = ''

    for p in p_list:
        text += ' ' + p.get_text()

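    if save:
        # explicit encoding so characters such as "£" are written consistently across platforms
        with open(f"{file}.txt", "w", encoding="utf-8") as text_file:
            text_file.write(text)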

    return text
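
Since `file` defaults to an empty string, calling the function with `save=True` and no file name writes to a file called `.txt`. One possible way to pick a name automatically is to reuse the last path segment of the URL. The helper below, `filename_from_url`, is a hypothetical addition and not part of the pipeline above; it uses only the standard library.

```python
from urllib.parse import urlparse


def filename_from_url(url: str, default: str = "article") -> str:
    """Return the last non-empty path segment of `url`, for use as a file name stem."""
    # strip() guards against stray leading or trailing spaces in the URL string
    path = urlparse(url.strip()).path
    segments = [segment for segment in path.split("/") if segment]
    # Fall back to `default` when the URL has no path (e.g. a bare domain).
    return segments[-1] if segments else default


url = "https://www.shetlandtimes.co.uk/2012/07/11/shetland-catch-ordered-to-pay-back-1-5-million-of-black-fish-profits"
stem = filename_from_url(url)
print(stem)
```

A call such as `get_url_text(url, save=True, file=filename_from_url(url))` would then save the text under the article's slug. This sketch does not sanitise characters that are invalid in file names, so URLs with unusual path segments may need extra handling.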
    

One caveat: the function above scrapes *all* text enclosed in paragraph tags on a webpage. The result should contain all of the article text, but it may also include text that is not part of the article. For our purposes this should be acceptable, since any such extra text is still likely to relate to the underreporting and misreporting of fishing catches. For use with NER, especially if the NER model has been tuned, the inclusion of extra text should not be a major issue.
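
If the extra text does become a problem, one simple heuristic is to drop very short paragraphs, on the assumption that non-article text tends to be brief. The sketch below is illustrative only: the function name `keep_long_paragraphs`, the threshold of 8 words, and the sample strings are all assumptions, and the heuristic can also discard short paragraphs that genuinely belong to the article.

```python
def keep_long_paragraphs(paragraphs: list[str], min_words: int = 8) -> list[str]:
    """Keep only paragraphs with at least `min_words` whitespace-separated words."""
    return [p for p in paragraphs if len(p.split()) >= min_words]


# Hypothetical paragraph strings: two invented short snippets around one
# shortened sentence from the sample article.
sample = [
    "Share this article",
    "Shetland Catch was also fined £150,000 for its role in helping skippers to get round quota regulations.",
    "Add comment",
]
kept = keep_long_paragraphs(sample)
print(kept)
```

The threshold would need tuning against real pages before being relied on.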