# Extracting Information with Selenium
*Curtis Miller*

Selenium is a good choice for data extraction when a webpage's content changes after the page loads. JavaScript in particular can modify the DOM after the initial HTML arrives, so such pages need special handling.

The archive page of [Pycoder's Weekly](http://pycoders.com/), a weekly Python newsletter, is an example of a page whose content is not present in the DOM when the page first loads. As a result, if you ran BeautifulSoup on the HTML document returned by the server, you would not find the content you see when viewing the page in a browser.

We can use Selenium to get around this problem.

In [None]:
from bs4 import BeautifulSoup
from selenium import webdriver
# The following is needed for waiting support in Selenium
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

import requests

Let's demonstrate the problem by trying to get a list of links to [Pycoder's Weekly archive](http://pycoders.com/archive/) using just **requests** and BeautifulSoup.

In [None]:
session = requests.Session()
url = "http://pycoders.com/archive/"
page = session.get(url).text
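bsObj = BeautifulSoup(page, "html.parser")    # Name the parser explicitly to avoid a warning

# Get all links
links_naive = dict()
for a in bsObj.find_all("a", href=True):    # Skip anchors that have no href attribute
    links_naive[a.get_text(strip=True)] = a["href"]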

links_naive

We know there should be many more links than this.
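
To see why the naive approach falls short, here is a minimal offline sketch. The HTML string is invented for illustration (it is not the real archive page): its script tag would add a link once a browser ran it, but BeautifulSoup only parses markup and never executes JavaScript, so it finds only the link already present in the static HTML.

```python
from bs4 import BeautifulSoup

# Hypothetical page: one static link, plus a script that would inject a second link in a browser
html = """
<html><body>
<a href="/home">Home</a>
<div id="archive"></div>
<script>
  document.getElementById("archive").innerHTML = '<a href="/issues/1">Issue 1</a>';
</script>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
# The script body is kept as plain text, so the anchor inside it is never parsed as a tag
links = {a.get_text(strip=True): a["href"] for a in soup.find_all("a", href=True)}
print(links)
```

Only the "Home" link should appear in the result, which mirrors what happens with the archive page when it is fetched with **requests** alone.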

We will start a Selenium session that visits the webpage in question. We instruct the browser to wait until it detects a certain element in the DOM. Once that element has been added, the script moves forward.
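
The idea behind such a wait can be sketched in plain Python. This is only an illustration of polling with a timeout, not Selenium's actual implementation; `wait_until`, `element_present`, and the simulated "page" are hypothetical names made up for this example.

```python
import time

def wait_until(condition, timeout=5.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)

# Simulated page: the "element" only shows up on the third check
checks = {"count": 0}

def element_present():
    checks["count"] += 1
    return "display_archive" if checks["count"] >= 3 else None

found = wait_until(element_present, timeout=2.0, poll_interval=0.01)
print(found, checks["count"])
```

The caller chooses how long to wait overall and how often to check, which is the same trade-off as the timeout and polling interval passed to a Selenium wait.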

**Note: The content of this webpage may change, which could break this code.**

In [None]:
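path = "chromedriver"    # Depends on system/OS/etc.
# Recent Selenium 4 releases no longer accept executable_path; the driver path is passed through a Service object.
# On Selenium 3, use webdriver.Chrome(executable_path=path) instead.
from selenium.webdriver.chrome.service import Service
driver = webdriver.Chrome(service=Service(path))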
driver.get(url)

# Wait until the div with class "display_archive" appears.
# The wait object uses driver and waits up to 120 seconds for a condition, checking every 1 second before timing out.
wait = WebDriverWait(driver, 120, 1)
# Block until an element with class "display_archive" is present in the DOM
wait.until(EC.presence_of_element_located((By.CLASS_NAME, "display_archive")))

# At this point we could use Selenium itself to get data, but we can also grab the page source as it stands now and use
# BeautifulSoup for data extraction.
bsObj = BeautifulSoup(driver.page_source, "html.parser")
# Get all links
links_pro = dict()
for a in bsObj.find_all("a", href=True):    # Skip anchors that have no href attribute
    links_pro[a.get_text(strip=True)] = a["href"]

links_pro

In [None]:
driver.quit()    # Ends the whole session and shuts down the browser; close() only closes the current window