## MA5851 A3 Web Crawler: Scraping the White House News Briefings
By Anthony Lighterness, jc244913

According to the White House Copyright Policy (see https://www.whitehouse.gov/copyright/), the press briefings scraped for this assignment are not copyright protected. The policy states: "*Pursuant to federal law, government-produced materials appearing on this site are not copyright protected. The United States Government may receive and hold copyrights transferred to it by assignment, bequest, or otherwise*."

The web crawler presented here extracts specific data items (date, title, style, category, and transcript) from each White House news briefing published on the official White House website (see https://www.whitehouse.gov/news/). At the time of development, the site listed approximately 10 news briefings on each of 835 pages. The crawl therefore runs in two stages: it first collects the URL of every news briefing (6190 in total), and the web scraper then visits each URL to extract the relevant data items.

### Import Libraries

In [1]:
import sys
import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
#from selenium.common.exceptions import NoSuchElementException


### Initialise ChromeDriver and Extract URLs of Each Press Briefing

In [3]:
# Initialise ChromeDriver
driver_path = '/usr/local/bin/chromedriver'
driver = webdriver.Chrome(driver_path)
driver.get('https://www.whitehouse.gov/news/')
driver.implicitly_wait(1) # short waiting time

# Initialise empty URL list for each news briefing
briefing_list = []

# Note: the total number of pages grows as new briefings are published, so
# n_pages may need adjusting. At the time of development, 835 pages were
# available. A new reader can set n_pages to a small integer such as 2 or 3
# to see the web crawler in action.
n_pages = 5

for page in range(n_pages):
    # Find the title element of each briefing on the current page
    briefings = driver.find_elements_by_class_name('briefing-statement__title')

    # Append the URL of each briefing to the list
    for briefing in briefings:
        briefing_list.append(briefing.find_element_by_css_selector('a').get_attribute('href'))

    # Wait for the next-page button; stop if it never becomes clickable
    # (for example, on the last page)
    try:
        element = WebDriverWait(driver,5).until(EC.element_to_be_clickable((By.CLASS_NAME,'pagination__next')))
    except TimeoutException:
        break
    driver.execute_script("return arguments[0].scrollIntoView();", element)
    # Click next page
    element.click()

driver.quit()


In [15]:
len(briefing_list)

6190
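
A paginated listing may change while a crawl is running (for example, if a new briefing is published), in which case the same URL could be collected more than once. The sketch below shows one possible safeguard and is not part of the original crawler. The helper name `dedupe_preserving_order` is hypothetical, and the example URLs are placeholders rather than real briefing links.

```python
def dedupe_preserving_order(urls):
    """Return urls without duplicates, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for u in urls:
        if u not in seen:
            seen.add(u)
            unique.append(u)
    return unique

# Placeholder URLs for illustration only
example_urls = [
    'https://example.org/briefing-a/',
    'https://example.org/briefing-b/',
    'https://example.org/briefing-a/',
    'https://example.org/briefing-c/',
]

print(len(dedupe_preserving_order(example_urls)))  # 3
```

Under this assumption, the crawler could apply the helper to `briefing_list` before the scraping stage, so that no briefing is visited twice.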

### Define Function to Extract Text Data

In [20]:
# Only a missing element should trigger the "NaN" fallback
from selenium.common.exceptions import NoSuchElementException

# Extract data for each news briefing
def extract_data(briefing_transcripts):
    # Relies on the module-level `driver` (already on the briefing page) and
    # `url`, both set by the crawl loop before this function is called.
    # Each field is stored as a plain string so that every table cell holds
    # one value rather than a one-element list.

    # ---------- Extract Date ----------
    try:
        date = driver.find_element_by_xpath('//*[@id="main-content"]/div[1]/div/div/p/time').text
    except NoSuchElementException:
        date = "NaN"

    # -------- Extract Category --------
    # Some briefings lack a specific category, so if missing, input "NaN"
    try:
        category = driver.find_element_by_xpath('//*[@id="main-content"]/div[1]/div/div/div/p/a').text
    except NoSuchElementException:
        category = "NaN"

    # ------- Extract Style/Type --------
    # Some briefings lack a specific style/type, so if missing, input "NaN"
    try:
        style = driver.find_element_by_class_name('page-header__section').text
    except NoSuchElementException:
        style = "NaN"

    # ---------- Extract Title ----------
    try:
        title = driver.find_element_by_class_name('page-header__title').text
    except NoSuchElementException:
        title = "NaN"

    # -------- Extract Transcript -------
    # No fallback here: if the transcript is missing, the error propagates
    # to the crawl loop, which closes the browser and stops.
    transcript = driver.find_element_by_css_selector('div.page-content__content.editor').text

    # Append the extracted values as a new row of the table
    briefing_transcripts.loc[len(briefing_transcripts)] = [date,title,style,category,url,transcript]
    return briefing_transcripts
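
The optional fields in the function above all follow the same pattern: try a lookup and fall back to "NaN" if it fails. As a sketch of how that repetition could be factored out (an illustration, not the author's implementation), the hypothetical helper `text_or_default` below takes any zero-argument callable and returns a default when the call raises. It is demonstrated with plain Python callables so that it runs without a browser.

```python
def text_or_default(lookup, default="NaN", errors=(Exception,)):
    """Call lookup() and return its result, or default if it raises one of errors."""
    try:
        return lookup()
    except errors:
        return default

def missing_element():
    # Stand-in for a lookup that fails because the element is absent
    raise LookupError("element not found")

print(text_or_default(lambda: "Press Briefings"))  # Press Briefings
print(text_or_default(missing_element))            # NaN
```

In the scraper, the callable would wrap one of the existing lookups, for example `lambda: driver.find_element_by_class_name('page-header__title').text`, with `errors` narrowed to the Selenium exception of interest.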


### White House Press Briefing Web Crawl

In [23]:
# Measure time taken to extract text data
start = time.time()

# Ensure chromedriver path is set
driver_path = '/usr/local/bin/chromedriver'

# Initialise a data frame to receive extracted data
briefing_transcripts = pd.DataFrame(columns = ['date','title','style',
                                               'category','url','transcript'])

# Crawl through each briefing URL link
for link in briefing_list:
    url = str(link)

    # Start a fresh browser session for this briefing
    driver = webdriver.Chrome(executable_path=driver_path)

    # Extract the text data. The browser is closed whether or not this
    # succeeds, and any error propagates to stop the crawl.
    try:
        driver.get(url)
        driver.implicitly_wait(2)
        briefing_transcripts = extract_data(briefing_transcripts)
    finally:
        driver.quit()

# Output and save press briefing text data to an Excel (xlsx) file
briefing_transcripts.to_excel('white_house_news.xlsx',
                              index=False, header=True)

end = time.time()
print(end - start)


22297.178542137146
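
The printed value is the elapsed time in seconds. To make it easier to interpret, the sketch below converts it to hours, minutes and seconds, and divides it by the 6190 briefing URLs reported earlier to estimate the average time per briefing. The helper name `summarise_runtime` is hypothetical.

```python
import datetime

def summarise_runtime(elapsed_seconds, n_items):
    """Return (H:MM:SS string, average seconds per item)."""
    readable = str(datetime.timedelta(seconds=int(elapsed_seconds)))
    per_item = elapsed_seconds / n_items
    return readable, per_item

readable, per_item = summarise_runtime(22297.178542137146, 6190)
print(readable)            # 6:11:37
print(round(per_item, 2))  # 3.6
```

The crawl therefore took a little over six hours, or about 3.6 seconds per briefing on average.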


In [47]:
# Python system information
print("Python version")
print(sys.version)
print("Version info.")
print(sys.version_info)


Python version
3.7.6 (default, Jan  8 2020, 13:42:34) 
[Clang 4.0.1 (tags/RELEASE_401/final)]
Version info.
sys.version_info(major=3, minor=7, micro=6, releaselevel='final', serial=0)


In [None]:
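# Alternative set-up on Windows: start ChromeDriver with verbose logging
# written to a log file, which can help when debugging driver issues
driver = webdriver.Chrome(executable_path="D:\\chromedriver.exe",
                          service_args=["--verbose", "--log-path=D:\\qc1.log"])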
