# Introduction to Dynamic Web Scraping

A few useful modules:

* **webbrowser**: Comes with Python and opens a browser to a specific page.

* **Requests**: Downloads files and web pages from the Internet.

* **Beautiful Soup**: Parses HTML, the format that web pages are written in.

* **Selenium**: Launches and controls a web browser. Selenium is able to fill in forms and simulate mouse clicks in this browser.


## Learning Objectives

* Use Browser Automation for scraping, field entry, and manipulating the browser.

* Find elements using XPath.

* Use the Browser's Dev Tools to extract XPaths.

## Viewing the Source HTML of a Web Page

Right-click (or CTRL-click on OS X) any web page in your web browser and select **View Source** or **View Page Source** to see the HTML. I highly recommend viewing the source HTML of some of your favorite sites. It’s fine if you don’t fully understand what you are seeing when you look at the source. You won’t need HTML mastery to write simple web scraping programs—after all, you won’t be writing your own websites. You just need enough knowledge to pick out data from an existing site.


## Opening Your Browser's Developer Tools

In addition to viewing a web page’s source, you can look through a page’s HTML using your browser’s developer tools. In Chrome and Internet Explorer for Windows, the developer tools are already installed, and you can press F12 to make them appear (see Figure 11-4). Pressing F12 again will make the developer tools disappear. In Chrome, you can also bring up the developer tools by selecting View▸Developer▸Developer Tools. In OS X, pressing ⌘-OPTION-I will open Chrome’s Developer Tools.

In Firefox, you can bring up the Web Developer Tools Inspector by pressing CTRL-SHIFT-C on Windows and Linux or by pressing ⌘-OPTION-C on OS X. The layout is almost identical to Chrome’s developer tools.

In Safari, open the Preferences window, and on the Advanced pane check the Show Develop menu in the menu bar option. After it has been enabled, you can bring up the developer tools by pressing ⌘-OPTION-I.

After enabling or installing the developer tools in your browser, you can right-click any part of the web page and select Inspect Element from the context menu to bring up the HTML responsible for that part of the page. This will be helpful when you begin to parse HTML for your web scraping programs.

### Don’t Use Regular Expressions to Parse HTML

Locating a specific piece of HTML in a string seems like a perfect case for regular expressions. However, I advise you against it. There are many different ways that HTML can be formatted and still be considered valid HTML, but trying to capture all these possible variations in a regular expression can be tedious and error prone. A module developed specifically for parsing HTML, such as Beautiful Soup, will be less likely to result in bugs.

You can find an extended argument for why you shouldn’t parse HTML with regular expressions at http://stackoverflow.com/a/1732454/1893164/.
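To make the point concrete, here is a minimal sketch that runs a naive regular expression and Beautiful Soup over three equivalent spellings of the same link. The sample strings, the `NAIVE` pattern, and the helper names are made up for illustration; a more careful pattern would catch more cases, but each new variation needs another patch.

```python
import re
from bs4 import BeautifulSoup

# Three equivalent ways to write the same link; all are valid HTML.
VARIANTS = [
    '<a href="/info">Info</a>',
    "<a class='nav' href='/info'>Info</a>",
    '<A\n   HREF="/info" >Info</A>',
]

# A naive pattern that assumes a lowercase tag, double quotes, and href as the first attribute.
NAIVE = re.compile(r'<a href="([^"]*)">')

def href_with_regex(html):
    """Return the href captured by the naive pattern, or None if it does not match."""
    match = NAIVE.search(html)
    return match.group(1) if match else None

def href_with_soup(html):
    """Return the href of the first <a> element found by Beautiful Soup, or None."""
    link = BeautifulSoup(html, 'html.parser').find('a')
    return link.get('href') if link else None

regex_hits = [href_with_regex(v) for v in VARIANTS]
soup_hits = [href_with_soup(v) for v in VARIANTS]

print(regex_hits)  # ['/info', None, None]
print(soup_hits)   # ['/info', '/info', '/info']
```

The parser normalizes tag case, quoting style, and attribute order before you query it, which is why the same lookup works on all three variants.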


## Creating a BeautifulSoup Object from HTML

In [1]:
import pandas as pd

import requests
from bs4 import BeautifulSoup

In [2]:
url = 'https://symptoms.webmd.com/default.htm#/info'
res = requests.get(url)
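One detail of this URL is worth checking before relying on `requests` alone: everything after `#` is a URL fragment, which the client keeps to itself rather than sending to the server. The sketch below splits the address with the standard library's `urllib.parse.urldefrag`; the names `base_url` and `fragment` are illustrative.

```python
from urllib.parse import urldefrag

url = 'https://symptoms.webmd.com/default.htm#/info'

# urldefrag separates the part that goes into the HTTP request from the fragment
base_url, fragment = urldefrag(url)

print(base_url)   # https://symptoms.webmd.com/default.htm
print(fragment)   # /info
```

Because the `#/info` part never reaches the server, it is plausible (though not verified here) that the form is drawn by JavaScript inside the browser. That would explain why the later cells drive a real browser instead of parsing the downloaded HTML.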

In [3]:
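# raise_for_status() returns None on success and raises requests.HTTPError on a 4xx/5xx response
print(res.raise_for_status())
# name the parser explicitly so Beautiful Soup does not have to guess one and emit a warning
soup = BeautifulSoup(res.text, 'html.parser')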
type(soup)

None


bs4.BeautifulSoup

In [4]:
# print(soup.prettify())
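Once the soup object exists, elements can be looked up by tag, id, or attribute. A minimal sketch follows. It runs against a small hypothetical HTML string rather than the live page, so the markup, ids, and file names in it are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page; not the site's actual markup.
HTML = """
<html><head><title>Symptom Checker</title></head>
<body>
  <div id="symptom-checker"></div>
  <script src="/app.js"></script>
</body></html>
"""

soup = BeautifulSoup(HTML, 'html.parser')

title = soup.title.string                                   # text inside <title>
container = soup.find(id='symptom-checker')                 # first element with this id
scripts = [s.get('src') for s in soup.find_all('script')]   # src of every <script>
container_text = container.get_text(strip=True)             # visible text in the container

print(title)
print(scripts)
print(repr(container_text))
```

In this made-up page the container is empty in the static HTML. When a real page looks like that, the content is being filled in by a script after loading, and a browser-driving tool such as Selenium is the usual next step.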

## Controlling the Browser with the Selenium Module

In [15]:
import time

In [5]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
# from selenium.webdriver.common.action_chains import ActionChains

In [6]:
options = Options() 

In [7]:
# tell selenium where the chromedriver executable is located
chrome_path = 'C:/Users/purem/OneDrive/Desktop/chromedriver_win32/chromedriver.exe'

In [8]:
# if you want to run Chrome in headless mode, use this:
# options.add_argument('--headless')

# other useful options:
# options.add_argument('--ignore-certificate-errors')
# options.add_argument('--incognito')

In [9]:
# set the window size
options.add_argument('--window-size=500,300')

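# initialize the driver; Selenium 4 takes the chromedriver path through a Service object
# (Selenium 3 accepted the path as the first positional argument instead)
from selenium.webdriver.chrome.service import Service
driver = webdriver.Chrome(service=Service(chrome_path),
                          options=options)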

In [10]:
driver.set_window_size(1400,1000)

In [11]:
# driver.minimize_window()
# driver.maximize_window()
# driver.get_window_position()
# driver.get_window_size()
# driver.get_window_rect()

In [12]:
driver.get(url)
time.sleep(15)
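A fixed `time.sleep(15)` waits the full 15 seconds even when the page is ready sooner, and still fails if the page takes longer. A common alternative is to poll for a condition with a timeout. The helper below is a sketch of that idea in plain Python. `wait_until` and `ready_after_three_calls` are hypothetical names, and the fake condition stands in for a real check against the browser.

```python
import time

def wait_until(condition, timeout=15.0, poll=0.5):
    """Call condition() until it returns a truthy value; raise TimeoutError after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f'condition not met within {timeout} seconds')
        time.sleep(poll)

# Fake condition for demonstration: becomes truthy on the third call.
calls = {'n': 0}

def ready_after_three_calls():
    calls['n'] += 1
    return 'ready' if calls['n'] >= 3 else None

outcome = wait_until(ready_after_three_calls, timeout=2.0, poll=0.01)
print(outcome, calls['n'])  # ready 3
```

With a live driver, the condition could be a function that looks for the age field and returns it once present. Selenium also ships `WebDriverWait` in `selenium.webdriver.support.ui`, which implements the same polling pattern and is usually the better choice in real scripts.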

In [16]:
# recent Selenium 4 releases removed the find_element_by_* helpers;
# pass a By locator to find_element / find_elements instead
from selenium.webdriver.common.by import By

element = driver.find_element(By.XPATH, '//*[@id="age"]')

In [17]:
element.click()

In [18]:
element.send_keys('50')

In [19]:
button_sex_f = driver.find_element(By.XPATH, '//*[@id="female"]')

In [20]:
button_sex_f.click()

In [21]:
button = driver.find_element(By.XPATH, '//*[@id="symptom-checker"]/div[2]/div/div/div/div/div[2]/button/div/div[1]')

In [22]:
button.click()

In [31]:
common_symptoms = driver.find_elements(By.CLASS_NAME, 'single-common-symptom')

In [32]:
common_symptom = common_symptoms[0]

In [33]:
common_symptom.click()

In [34]:
text = common_symptom.text

In [29]:
hover = driver.find_element(By.XPATH, '//*[@id="webmdHoverContent"]')

In [30]:
hover.is_enabled()

True

In [35]:
hover_button = driver.find_element(By.XPATH, '//*[@id="webmdHoverLoadedContent"]/div/div/button[2]/div/div[1]')

In [36]:
hover_button.click()

In [37]:
driver.quit()
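The locators in the cells above mix two styles of XPath: attribute lookups such as `//*[@id="age"]` and long positional paths copied from the developer tools. The sketch below contrasts the two without a browser, using the standard library's `xml.etree.ElementTree`. That module supports only a subset of XPath (hence the leading `.`) and needs well-formed markup. The `PAGE` string is a hypothetical stand-in, not the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a form; not the real page markup.
PAGE = """
<html>
  <body>
    <div id="symptom-checker">
      <div><input id="age" type="text"/></div>
      <div><button>Back</button><button>Continue</button></div>
    </div>
  </body>
</html>
"""

root = ET.fromstring(PAGE)

# Attribute-based lookup, analogous to '//*[@id="age"]'.
by_id = root.find(".//*[@id='age']")

# Position-based lookup: second div under the container, then its second button.
by_position = root.find(".//*[@id='symptom-checker']/div[2]/button[2]")

print(by_id.tag, by_id.get('type'))  # input text
print(by_position.text)              # Continue
```

Attribute-based paths tend to survive layout changes better than positional ones, since adding a single wrapper `div` shifts every index in a positional path. When the developer tools offer a long positional XPath, it is worth checking whether a nearby `id` or class gives a shorter locator.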