In this notebook, we'll take a look at a couple of examples using Selenium. We start by importing the modules we need and starting the Selenium-driven web browser.

We don't use headless mode here, as we would like to see what's going on as we execute our commands.

In [1]:
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.keys import Keys

In [38]:
driver = webdriver.Chrome()
driver.implicitly_wait(10)

# Navigating BlueCourses

For this first example, let's visit our home page and read out a list of courses, as we did before using Beautiful Soup.

In [3]:
driver.get('https://www.bluecourses.com')

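Note that Selenium provides many ways to find elements, e.g. by using CSS selectors (with more complete selector support than `select()` in Beautiful Soup). Note that attributes here should be retrieved using `get_attribute()`. The `find_element_by_*` helper methods used throughout this notebook belong to the Selenium 3 API; they were deprecated in Selenium 4 in favor of `find_element(By.CSS_SELECTOR, ...)` and similar calls.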

In [4]:
courses = driver.find_elements_by_css_selector('article.course')

In [5]:
for course in courses:
    print(course.find_element_by_css_selector('.course-title').text)
    print(course.find_element_by_tag_name('a').get_attribute('href'))

Basic Credit Risk Modeling for Basel/IFRS 9 using R/Python/SAS
https://www.bluecourses.com/courses/course-v1:bluecourses+BC1+September2019/about
Advanced Credit Risk Modeling for Basel/IFRS 9 using R/Python/SAS
https://www.bluecourses.com/courses/course-v1:bluecourses+BC2+September2019/about
Machine Learning Essentials
https://www.bluecourses.com/courses/course-v1:bluecourses+BC3+October2019/about
Fraud Analytics
https://www.bluecourses.com/courses/course-v1:bluecourses+BC4+December2019/about
Social Network Analytics
https://www.bluecourses.com/courses/course-v1:bluecourses+BC5+2020/about
Recommender Systems
https://www.bluecourses.com/courses/course-v1:bluecourses+BC7+2020_Q1/about
Customer Lifetime Value Modeling
https://www.bluecourses.com/courses/course-v1:bluecourses+BC8+2020_Q2/about
Text Analytics
https://www.bluecourses.com/courses/course-v1:bluecourses+BC6+2019_Q4/about
Web Analytics
https://www.bluecourses.com/courses/course-v1:bluecourses+BC14+2020_Q2/about
Quantum Machine L

# Filling out a simple form

For a second example, we can show how to interact with various form elements. This example illustrates how Selenium requires a more UI-driven way of working rather than thinking from an HTTP interaction perspective.

In [39]:
driver.get('http://www.webscrapingfordatascience.com/postform2/')

Textual elements can be filled in using `clear` and `send_keys`.

In [40]:
driver.find_element_by_name('name').clear()
driver.find_element_by_name('name').send_keys('Seppe')

We can also retrieve elements through XPath selectors. XPath is a relatively complex but powerful XML query language. See https://www.w3schools.com/xml/xpath_syntax.asp for a good overview of the syntax.

In [41]:
driver.find_element_by_xpath('//input[@name="gender"][@value="N"]').click()

In [42]:
driver.find_element_by_name('fries').click()
driver.find_element_by_name('salad').click()

In [43]:
Select(driver.find_element_by_name('haircolor')).select_by_value('brown')

In [44]:
driver.find_element_by_name('comments').clear()
driver.find_element_by_name('comments').send_keys(['First line', Keys.ENTER, 'Second line'])

In [45]:
driver.find_element_by_xpath('//input[@type="submit"]').click()

In [46]:
driver.find_element_by_tag_name('body').text

'Thanks for submitting your information\nHere\'s a dump of the form data that was submitted:\narray(6) {\n  ["name"]=>\n  string(5) "Seppe"\n  ["gender"]=>\n  string(1) "N"\n  ["fries"]=>\n  string(4) "like"\n  ["salad"]=>\n  string(4) "like"\n  ["haircolor"]=>\n  string(5) "brown"\n  ["comments"]=>\n  string(23) "First line\nSecond line"\n}'
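For contrast, here is what the same submission looks like from an HTTP interaction perspective: just a body of encoded key-value pairs. This is a minimal sketch using only the standard library, and it assumes the form is submitted as a regular URL-encoded POST; the field names and values are copied from the dump above.

```python
from urllib.parse import urlencode, parse_qs

# Field names and values copied from the form dump above. The dump reports
# 23 characters for "comments", which suggests the line break was sent as CRLF.
form_data = {
    'name': 'Seppe',
    'gender': 'N',
    'fries': 'like',
    'salad': 'like',
    'haircolor': 'brown',
    'comments': 'First line\r\nSecond line',
}

# Encode the fields the way a browser encodes a regular form submission
# (application/x-www-form-urlencoded).
body = urlencode(form_data)
print(body)

# Decoding the body gives back the original fields.
decoded = {key: values[0] for key, values in parse_qs(body).items()}
print(decoded == form_data)
```

With Selenium we never build this body ourselves: we click and type, and the browser produces it for us.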

Note two special properties, `innerHTML` and `outerHTML` (DOM attributes), which let you retrieve the full inner and outer HTML contents of a tag. You could still use an HTML parsing library like Beautiful Soup if you'd like to parse these further without using Selenium.

In [47]:
driver.find_element_by_tag_name('body').get_attribute('innerHTML')

'\n\n\n<h2>Thanks for submitting your information</h2>\n\n<p>Here\'s a dump of the form data that was submitted:</p>\n\n<pre>array(6) {\n  ["name"]=&gt;\n  string(5) "Seppe"\n  ["gender"]=&gt;\n  string(1) "N"\n  ["fries"]=&gt;\n  string(4) "like"\n  ["salad"]=&gt;\n  string(4) "like"\n  ["haircolor"]=&gt;\n  string(5) "brown"\n  ["comments"]=&gt;\n  string(23) "First line\nSecond line"\n}\n</pre>\n\n\n\t\n\n'
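A string like the one above can be handed to any HTML parser. The sketch below uses the standard library's `html.parser` instead of Beautiful Soup, purely to stay dependency-free; `PreTextExtractor` is a hypothetical helper and the input is a shortened copy of the output above.

```python
from html.parser import HTMLParser


class PreTextExtractor(HTMLParser):
    """Collects the text found inside <pre> tags (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &gt; back into characters.
        super().__init__(convert_charrefs=True)
        self.in_pre = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == 'pre':
            self.in_pre = True

    def handle_endtag(self, tag):
        if tag == 'pre':
            self.in_pre = False

    def handle_data(self, data):
        if self.in_pre:
            self.chunks.append(data)


# Shortened copy of the innerHTML shown above; in the notebook this string
# would come from get_attribute('innerHTML').
inner_html = (
    '<h2>Thanks for submitting your information</h2>\n'
    '<pre>array(6) {\n  ["name"]=&gt;\n  string(5) "Seppe"\n}\n</pre>'
)

parser = PreTextExtractor()
parser.feed(inner_html)
parser.close()
pre_text = ''.join(parser.chunks)
print(pre_text)
```

Note how the `&gt;` entities in the raw HTML come back as plain `>` characters once the markup is parsed.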

In [48]:
driver.find_element_by_tag_name('body').get_attribute('outerHTML')

'<body>\n\n\n<h2>Thanks for submitting your information</h2>\n\n<p>Here\'s a dump of the form data that was submitted:</p>\n\n<pre>array(6) {\n  ["name"]=&gt;\n  string(5) "Seppe"\n  ["gender"]=&gt;\n  string(1) "N"\n  ["fries"]=&gt;\n  string(4) "like"\n  ["salad"]=&gt;\n  string(4) "like"\n  ["haircolor"]=&gt;\n  string(5) "brown"\n  ["comments"]=&gt;\n  string(23) "First line\nSecond line"\n}\n</pre>\n\n\n\t\n\n</body>'

# Getting a list of McDonald's locations in New York

In [49]:
driver.get('https://www.mcdonalds.com/us/en-us/restaurant-locator.html')

In [50]:
driver.find_element_by_id('search').send_keys('New York')

In [51]:
driver.find_element_by_css_selector('button[aria-label="search"]').click()

In [52]:
driver.find_element_by_css_selector('.button-toggle button[aria-label="List View"]').click()

ElementNotInteractableException: Message: element not interactable
  (Session info: chrome=84.0.4147.105)


Since the regular click raises an exception here, we can instead trigger the click by executing JavaScript in the browser:

In [53]:
driver.execute_script(
    'arguments[0].click();', 
    driver.find_element_by_css_selector('.button-toggle button[aria-label="List View"]')
)

Next, we'll keep loading results until the 'Load More' button disappears. Normally, you'd opt for a more robust approach here using explicit waits (https://www.selenium.dev/documentation/en/webdriver/waits/). Since we have defined an implicit wait above, Selenium will keep trying to locate the button until the implicit timeout is reached, after which it throws an exception that we use to break out of the loop.

In [54]:
while True:
    try:
        driver.find_element_by_css_selector('div.rl-listview__load-more button').click()
    except Exception:  # Lookup timed out or the button can no longer be clicked
        break  # All results loaded
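The loop above leans on the implicit wait. The idea behind an explicit wait can be sketched without a browser: poll a condition until it holds or a timeout expires. The helper `wait_until` and the `element_ready` stand-in below are hypothetical and only illustrate the mechanism; they are not part of Selenium.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass.

    Hypothetical helper mirroring the idea behind an explicit wait;
    it is not part of Selenium.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f'condition not met within {timeout} seconds')
        time.sleep(poll_interval)


# Stand-in for "is the element there yet?": becomes truthy on the third poll.
attempts = []


def element_ready():
    attempts.append(1)
    return len(attempts) >= 3


print(wait_until(element_ready, timeout=2.0, poll_interval=0.01))
```

In Selenium itself, `WebDriverWait(driver, 10).until(...)` combined with the conditions in `selenium.webdriver.support.expected_conditions` plays this role, so you would normally not write the polling loop by hand.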

In [55]:
for details in driver.find_elements_by_css_selector('.rl-details'):
    print(details.text)

OPEN
160 Broadway
New York, Ny 10038


OPEN
167 Chambers St (303 Greenwich St)
New York, Ny 10013


OPEN
262 Canal St
New York, Ny 10013


OPEN
213 Madison Street
New York, Ny 10002


OPEN
114 Delancey St
New York, Ny 10002


OPEN
208 Varick St
New York, Ny 10014


OPEN
136 W 3rd St
New York, Ny 10012


OPEN
724 Broadway
New York, Ny 10003


OPEN
102 1st Ave
New York, Ny 10009


OPEN
82 Court St
Brooklyn, Ny 11201


OPEN
404 E 14th St
New York, Ny 10009


OPEN
541 6th Ave
New York, Ny 10011


OPEN
420 Fulton St
Brooklyn, Ny 11201


OPEN
39 Union Square W
New York, Ny 10003


OPEN
30 Mall Dr W
Jersey City, Nj 07310


OPEN
325 Grove St
Jersey City, Nj 07303


OPEN
395 Flatbush Ave Exten
Brooklyn, Ny 11201


OPEN
686 6th Ave
New York, Ny 10010


OPEN
26 E 23rd St
New York, Ny 10010


OPEN
336 E 23rd St
New York, Ny 10010


OPEN
197 12th St
Jersey City, Nj 07310


OPEN
234 Washington St
Hoboken, Nj 07030


OPEN
401 Park Ave S
New York, Ny 10016


OPEN
809/811 6th Ave/28th
Manhattan, Ny 100

If you follow along with the network requests in the browser, you might have noticed that the restaurant locator actually calls an internal JavaScript API. Hence, we could also try accessing this directly using Requests and see whether that works. The URL parameters suggest obvious ways to play around with this:

In [56]:
import requests

In [58]:
requests.get('https://www.mcdonalds.com/googleapps/GoogleRestaurantLocAction.do', params={
    'method': 'searchLocation',
    'latitude': 40.7127753,
    'longitude': -74.0059728,
    'radius': 30.045,
    'maxResults': 3,
    'country': 'us',
    'language': 'en-us'
}).json()

{'features': [{'geometry': {'coordinates': [-74.010086, 40.709438]},
   'properties': {'jobUrl': '',
    'longDescription': '',
    'todayHours': '04:00 - 04:00',
    'driveTodayHours': '04:00 - 04:00',
    'id': '195500284446-en-us',
    'filterType': ['WIFI',
     'GIFTCARDS',
     'MOBILEOFFERS',
     'MOBILEORDERS',
     'INDOORDININGAVAILABLE',
     'MCDELIVERY',
     'TWENTYFOURHOURS'],
    'addressLine1': '160 Broadway',
    'addressLine2': 'STAMFORD FIELD OFFICE',
    'addressLine3': 'New York',
    'addressLine4': 'USA',
    'subDivision': 'NY',
    'postcode': '10038',
    'customAddress': 'New York, NY 10038',
    'telephone': '(212) 385-2066',
    'restauranthours': {'hoursMonday': '04:00 - 04:00',
     'hoursTuesday': '04:00 - 04:00',
     'hoursWednesday': '04:00 - 04:00',
     'hoursThursday': '04:00 - 04:00',
     'hoursFriday': '04:00 - 04:00',
     'hoursSaturday': '04:00 - 04:00',
     'hoursSunday': '04:00 - 04:00'},
    'drivethruhours': {'driveHoursMonday': '04:00

Even if the website you wish to scrape does not provide an API, it's always recommended to keep an eye on your browser's developer tools networking information to see if you can spot JavaScript-driven requests to URL endpoints which return nicely structured JSON data, as is the case here.

Even though an API might not be documented, fetching the information directly from such an "internal API" is usually a clever idea, as this avoids having to deal with the HTML soup. In fact, here we get nicely structured JSON data directly!
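To illustrate how little work such structured data needs, here is a minimal sketch that flattens a response of this shape into one line per restaurant. The function name `summarize_locations` is hypothetical, and the sample is a hand-made excerpt that copies a few fields from the response shown above rather than a live API result.

```python
def summarize_locations(payload):
    """Return one 'street, city line' string per restaurant in the response."""
    summaries = []
    for feature in payload.get('features', []):
        props = feature.get('properties', {})
        street = props.get('addressLine1', '')
        city_line = props.get('customAddress', '')
        summaries.append(f'{street}, {city_line}')
    return summaries


# Hand-made sample that copies a few fields from the response shown above.
sample = {
    'features': [
        {
            'geometry': {'coordinates': [-74.010086, 40.709438]},
            'properties': {
                'addressLine1': '160 Broadway',
                'customAddress': 'New York, NY 10038',
                'telephone': '(212) 385-2066',
            },
        }
    ]
}

locations = summarize_locations(sample)
print(locations)
```

In the notebook, the same function could be applied to the result of the `.json()` call, assuming the endpoint keeps returning this structure.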