# Scraping Reuters

This notebook goes through the process of scraping articles from Reuters for a particular search term.

This was an interesting challenge because the Reuters website is awkward to scrape: its content is loaded dynamically, and the results are pulled from the database in unusual ways.

The following solution is **extremely** hacky, and probably not the best way to do anything. It will stop working immediately if Reuters ever changes minor details about their site, and there is a chance that using it will get your IP banned from Reuters (they definitely don't want you to do this).

## Disclaimer

I did this as an intellectual puzzle; I'm not at all suggesting that anyone should scrape Reuters.

## Importing libraries

In [1]:
from bs4 import BeautifulSoup  # Parse html
import requests  # Make HTTP requests
from selenium import webdriver  # Scraping browser for dealing with dynamic content
from time import sleep  # Pause every so often to avoid Reuters noticing
import pandas as pd  # Dataframes

## Creating the initial URL

In [2]:
search_terms = input("Enter search terms: ").replace(" ", "+")

url = f"https://uk.reuters.com/search/news?blob={search_terms}"

print(url)

Enter search terms: romance
https://uk.reuters.com/search/news?blob=romance
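
Replacing spaces with `+` is enough for a simple term like the one above, but characters such as `&` or `#` would change the meaning of the query string. A minimal sketch of a more defensive approach, assuming the standard library's `urllib.parse.quote_plus` is acceptable here; `build_search_url` and `BASE_URL` are hypothetical names, not part of the original notebook:

```python
from urllib.parse import quote_plus

BASE_URL = "https://uk.reuters.com/search/news"  # same endpoint as the cell above


def build_search_url(terms: str) -> str:
    """Return the search URL with the terms safely encoded (hypothetical helper)."""
    # quote_plus turns spaces into "+" and percent-encodes reserved characters
    return f"{BASE_URL}?blob={quote_plus(terms.strip())}"


print(build_search_url("romance"))
print(build_search_url("mergers & acquisitions"))
```

For a single plain word the result is identical to the URL printed above; the difference only shows up with punctuation or non-ASCII input.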


## Gathering the article links

This is the hacky bit.

We use [Selenium](https://pypi.org/project/selenium/) to automate clicking through the site to load sufficient links, then grab all the links on the page.

In [3]:
page_num = int(input("How many pages of results should be loaded? "))

How many pages of results should be loaded? 5


In [4]:
# Locator strategies for find_element / find_elements
# (Selenium 4.3 and later no longer provide the find_element_by_* helpers)

from selenium.webdriver.common.by import By

# Create an automated browser

driver = webdriver.Chrome()

# Direct the browser to the correct URL; this loads 10 results

driver.get(url)

# Click the "accept cookies" button

cookie_accept = driver.find_element(By.ID, "_evidon-banner-acceptbutton")

cookie_accept.click()

# Click "more results" up to page_num times
for i in range(page_num):
    next_button = driver.find_element(By.CLASS_NAME, "search-result-more-txt")
    
    # Break out if there are no more results
    if next_button.text == "NO MORE RESULTS":
        print("Reached end of results.")
        break
        
    # Click the "more results" button
    next_button.click()
    
    # Pause to be polite
    sleep(2)

# Grab all links on the page

links = driver.find_elements(By.TAG_NAME, 'a')

# Extract the href of each link and remove duplicates

links = list(set(link.get_attribute('href') for link in links))

# Only keep links with "article" in them somewhere (anchors without an href give None)

links = [link for link in links if link and "article" in link]
    
# Quit the Selenium driver for neatness (closes the browser and ends the session)

driver.quit()

# Check the number of links returned - ideally around (page_num + 1) * 10 (10 from the initial load plus 10 per click)

print(f"Found {len(links)} links.")

Found 61 links.
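
The count is slightly higher than a round multiple of 10, which suggests the substring test lets through a link or two that is not a search result. A stricter clean-up could look like the sketch below. It assumes article URLs always contain `/article/` in their path (true of the URLs shown later in this notebook, but not verified for the whole site); `clean_article_links` is a hypothetical helper name.

```python
from urllib.parse import urldefrag, urlsplit


def clean_article_links(hrefs):
    """Filter raw href values down to unique article URLs (hypothetical helper)."""
    seen = set()
    cleaned = []
    for href in hrefs:
        if not href:  # anchors without an href yield None or ""
            continue
        url, _fragment = urldefrag(href)  # drop any "#..." fragment
        if "/article/" not in urlsplit(url).path:  # assumption: articles live under /article/
            continue
        if url not in seen:  # de-duplicate while preserving page order
            seen.add(url)
            cleaned.append(url)
    return cleaned


sample = [
    None,
    "https://uk.reuters.com/article/idUKKCN1R111M",
    "https://uk.reuters.com/article/idUKKCN1R111M#top",
    "https://uk.reuters.com/search/news?blob=article",
]
print(clean_article_links(sample))
```

Unlike `list(set(...))`, this keeps the links in the order they appear on the page, which is usually the order of the search results.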


## Extract details from a single article link

In [5]:
def get_reuters_article_details(url):
    """
    Access an article from the Reuters site and return the article details
    
    Parameters:
        - url (str): the URL of a Reuters article

    Returns:
        - details (dict): a dictionary of article details
            - url
            - title
            - attribution
            - date
            - category
            - content
    """
    
    # Make a request for the page
    
    page = requests.get(url)
    
    # Construct a soup object from the page contents
    
    soup = BeautifulSoup(page.content, "html.parser")
    
    # Build the dictionary by pulling the text of different elements from the page
    
    details = {"url": url}
    
    # Hacky error handling: fall back to None whenever an element is missing
    
    title = soup.find("h1", class_="ArticleHeader_headline")
    details["title"] = title.text if title else None
    
    attribution = soup.find("p", class_="Attribution_content")
    details["attribution"] = attribution.text if attribution else None
    
    date = soup.find("div", class_="ArticleHeader_date")
    details["date"] = date.text if date else None

    category = soup.find("div", class_="ArticleHeader_channel")
    category_link = category.find("a") if category else None
    details["category"] = category_link.text if category_link else None
    
    # Extract the paragraphs of content (immediate <p> children of the body)
    
    body = soup.find("div", class_="StandardArticleBody_body")
    paragraphs = body.find_all("p", recursive=False) if body else []
    
    # Join the paragraph text together, one paragraph per line
    
    text = "\n".join(paragraph.text for paragraph in paragraphs)
    
    # Store the text in the details dictionary
    
    details["content"] = text
    
    # Return the details dict
    
    return details

## Extract details for all articles

Now that we have a function to extract one article's details, we can simply apply it to each article in turn.

In [6]:
# Make a holder for article details

article_holder = []

# Loop through the links and extract article details for each one

for i, link in enumerate(links, start=1):
    details = get_reuters_article_details(link)
    article_holder.append(details)
    
    # Wait to avoid detection
    sleep(1)
    
    # Output the article number 
    print(f"Articles scraped: {i}/{len(links)}", end="\r")

Articles scraped: 61/61
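
The loop above stops at the first exception, so a single timeout or dropped connection part-way through would end the run early. One possible way to make it more forgiving is sketched below. `scrape_all` is a hypothetical helper, and `fake_fetch` only stands in for `get_reuters_article_details` so that the example runs without touching the network.

```python
from time import sleep


def scrape_all(links, fetch, delay=1.0):
    """Apply fetch to every link, collecting results and failures separately."""
    results, failures = [], []
    for i, link in enumerate(links, start=1):
        try:
            results.append(fetch(link))
        except Exception as error:  # broad on purpose: keep going whatever fails
            failures.append((link, repr(error)))
        print(f"Articles scraped: {i}/{len(links)}", end="\r")
        sleep(delay)  # pause between requests, as in the loop above
    return results, failures


def fake_fetch(url):
    """Stand-in for the real scraping function; fails for one URL."""
    if "bad" in url:
        raise ValueError("simulated failure")
    return {"url": url}


results, failures = scrape_all(["a", "bad", "c"], fake_fetch, delay=0)
print(f"\n{len(results)} succeeded, {len(failures)} failed")
```

In the notebook itself this would be called as `scrape_all(links, get_reuters_article_details)`, and the failed links could then be retried or inspected by hand.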

## Store the details in a dataframe

Now that we have a list of dicts, we can transform it into a dataframe.

In [7]:
# Create the dataframe

reuters_df = pd.DataFrame(article_holder)

# View the dataframe

reuters_df.head()

| | url | title | attribution | date | category | content |
|---|---|---|---|---|---|---|
| 0 | https://uk.reuters.com/article/idUKKBN0DI0S220... | News Corp to buy Torstar's romance publisher H... | Reporting by Euan Rocha in Toronto and Ashutos... | May 2, 2014 / 12:31 PM / 6 years ago | Media Industry News | TORONTO, May 2 (Reuters) - News Corp NWSA.O sa... |
| 1 | https://uk.reuters.com/article/idUKTRE81513420... | Finding out what went wrong with failed romance | Reporting by Natasha Baker; editing by Patrici... | February 6, 2012 / 2:26 PM / 8 years ago | Technology News | SAN FRANCISCO (Reuters) - With Valentine’s Day... |
| 2 | https://uk.reuters.com/article/idUKKCN1R111M | Open that door? Netflix explores choose-your-o... | Reporting by Lisa Richwine; Editing by Darren ... | March 20, 2019 / 10:04 AM / a year ago | Business News | LOS ANGELES (Reuters) - A Netflix Inc experime... |
| 3 | https://uk.reuters.com/article/idUKL1N2150ZR | Open that door? Netflix explores choose-your-o... | Reporting by Lisa Richwine; Editing by Darren ... | March 20, 2019 / 10:04 AM / a year ago | Business News | LOS ANGELES (Reuters) - A Netflix Inc experime... |
| 4 | https://uk.reuters.com/article/idUKKCN1QA2M9 | Prada contrasts two sides of romance at Milan ... | Reporting by Claudia Cristoferi and Marie-Loui... | February 21, 2019 / 7:25 PM / a year ago | Entertainment News | MILAN (Reuters) - Italian luxury label Prada a... |
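
In the preview above, rows 2 and 3 appear to be the same story published under two different URLs, so de-duplicating on the URL alone does not catch them. A rough sketch of dropping such repeats from the list of dicts, assuming that a matching title and date is a good enough signal; `drop_repeated_stories` is a hypothetical helper and the sample records are made up for illustration:

```python
def drop_repeated_stories(articles, keys=("title", "date")):
    """Keep the first article for each combination of the given keys."""
    seen = set()
    unique = []
    for article in articles:
        signature = tuple(article.get(key) for key in keys)
        if signature not in seen:
            seen.add(signature)
            unique.append(article)
    return unique


# Illustrative records only, not real scraped data
sample = [
    {"url": "u1", "title": "Story A", "date": "March 20, 2019"},
    {"url": "u2", "title": "Story A", "date": "March 20, 2019"},
    {"url": "u3", "title": "Story B", "date": "March 20, 2019"},
]
print([article["url"] for article in drop_repeated_stories(sample)])
```

Once the data is in a dataframe, `reuters_df.drop_duplicates(subset=["title", "date"])` should have much the same effect.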


Cleaning and parsing date/attribution values is left as an exercise for the reader.
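
As a starting point for the date column, here is one possible sketch. It assumes every value follows the `Month D, YYYY / H:MM AM/PM / relative age` pattern seen in the preview, that month names are in English, and that the time zone is unknown (so the result is a naive `datetime`). `parse_reuters_date` is a hypothetical helper name.

```python
from datetime import datetime


def parse_reuters_date(raw):
    """Parse strings like 'May 2, 2014 / 12:31 PM / 6 years ago' (hypothetical helper)."""
    if not raw:
        return None
    parts = [part.strip() for part in raw.split("/")]
    if len(parts) < 2:
        return None
    try:
        # Only the date and time parts are used; the trailing
        # relative part ("6 years ago") is ignored
        return datetime.strptime(f"{parts[0]} {parts[1]}", "%B %d, %Y %I:%M %p")
    except ValueError:
        return None  # unexpected format: leave it for manual inspection


print(parse_reuters_date("May 2, 2014 / 12:31 PM / 6 years ago"))
```

Applied to the dataframe, this could look something like `reuters_df["date"].apply(parse_reuters_date)`. The attribution column needs a similar treatment, but its format varies more between articles.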