# Web Scraping

A few useful modules:

* **webbrowser**: Comes with Python and opens a browser to a specific page.

* **Requests**: Downloads files and web pages from the Internet.

* **Beautiful Soup**: Parses HTML, the format that web pages are written in.

* **Selenium**: Launches and controls a web browser. Selenium is able to fill in forms and simulate mouse clicks in this browser.



## Viewing the Source HTML of a Web Page

Right-click (or CTRL-click on OS X) any web page in your web browser and select **View Source** or **View Page Source** to see the HTML. I highly recommend viewing the source HTML of some of your favorite sites. It’s fine if you don’t fully understand what you are seeing when you look at the source. You won’t need HTML mastery to write simple web scraping programs—after all, you won’t be writing your own websites. You just need enough knowledge to pick out data from an existing site.


## Opening Your Browser's Developer Tools

In addition to viewing a web page’s source, you can look through a page’s HTML using your browser’s developer tools. In Chrome and Internet Explorer for Windows, the developer tools are already installed, and you can press F12 to make them appear (see Figure 11-4). Pressing F12 again will make the developer tools disappear. In Chrome, you can also bring up the developer tools by selecting View▸Developer▸Developer Tools. In OS X, pressing ⌘-OPTION-I will open Chrome’s Developer Tools.

In Firefox, you can bring up the Web Developer Tools Inspector by pressing CTRL-SHIFT-C on Windows and Linux or by pressing ⌘-OPTION-C on OS X. The layout is almost identical to Chrome’s developer tools.

In Safari, open the Preferences window, and on the Advanced pane check the Show Develop menu in the menu bar option. After it has been enabled, you can bring up the developer tools by pressing ⌘-OPTION-I.

After enabling or installing the developer tools in your browser, you can right-click any part of the web page and select Inspect Element from the context menu to bring up the HTML responsible for that part of the page. This will be helpful when you begin to parse HTML for your web scraping programs.
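
Once Inspect Element has shown which tag and class hold the data you want, that information usually translates directly into a Beautiful Soup lookup. The following is a minimal sketch on an invented HTML fragment; the `product` and `price` class names are hypothetical and stand in for whatever the inspector shows on a real page.

```python
from bs4 import BeautifulSoup

# Invented fragment standing in for what the inspector might show.
html = """
<div class="product">
  <span class="name">Widget</span>
  <span class="price">19.99</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# CSS selector: a <span class="price"> somewhere inside <div class="product">.
price = soup.select_one("div.product span.price")
print(price.get_text(strip=True))
```

`select_one` returns `None` when nothing matches, so on a real page it is worth checking the result before calling `get_text`.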

## Don’t Use Regular Expressions to Parse HTML

Locating a specific piece of HTML in a string seems like a perfect case for regular expressions. However, I advise you against it. There are many different ways that HTML can be formatted and still be considered valid HTML, but trying to capture all these possible variations in a regular expression can be tedious and error prone. A module developed specifically for parsing HTML, such as Beautiful Soup, will be less likely to result in bugs.

You can find an extended argument for why you shouldn’t parse HTML with regular expressions at http://stackoverflow.com/a/1732454/1893164/.
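
To make the point concrete, here is a small sketch comparing a naive regular expression with Beautiful Soup on three invented but equally valid spellings of the same link. The pattern and the sample strings are illustrative assumptions, not taken from any real site.

```python
import re
from bs4 import BeautifulSoup

# Three equally valid ways to write the same link (invented examples).
variants = [
    '<a href="https://example.com">Example</a>',
    "<a href='https://example.com'>Example</a>",
    '<A class="ext"\n   HREF="https://example.com">Example</A>',
]

# A naive pattern that expects a lowercase tag, href first, and double quotes.
naive = re.compile(r'<a href="([^"]+)">')

regex_hits = [bool(naive.search(v)) for v in variants]
parser_hits = [BeautifulSoup(v, "html.parser").a["href"] for v in variants]

print(regex_hits)    # only the first variant matches the pattern
print(parser_hits)   # the parser recovers the URL from all three
```

The pattern could be patched for each variation, but every patch makes it longer and harder to read, which is the tedium the paragraph above warns about.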


## Creating a BeautifulSoup Object from HTML

In [1]:
import requests
import pandas as pd  # used further down for pd.read_html and pd.concat
from bs4 import BeautifulSoup

In [2]:
url = 'https://www.imdb.com/search/title/?genres=drama&groups=top_250&sort=boxoffice_gross_us,desc'
res = requests.get(url)

In [3]:
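# raise_for_status() returns None on success and raises an HTTPError for a 4xx/5xx response
print(res.raise_for_status())
# name the parser explicitly so Beautiful Soup does not have to guess one (and warn about it)
soup = BeautifulSoup(res.text, 'html.parser')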
type(soup)

None


bs4.BeautifulSoup

In [4]:
print(soup.prettify())

<!DOCTYPE html>
<html xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="app-id=342792525, app-argument=imdb:///?src=mdot" name="apple-itunes-app"/>
  <script type="text/javascript">
   var IMDbTimer={starttime: new Date().getTime(),pt:'java'};
  </script>
  <script>
   if (typeof uet == 'function') {
      uet("bb", "LoadTitle", {wb: 1});
    }
  </script>
  <script>
   (function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);
  </script>
  <title>
   Drama,
IMDb "Top 250"
(Sorted by US Box Office Descending) - IMDb
  </title>
  <script>
   (function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);
  </script>
  <script>
   if (typeof uet == 'function') {
      uet("be", "LoadTitle", {wb: 1});
    }
  </script>
  <script>
   if (typeof uex == 'function') {
      uex(

In [6]:
table = soup.findAll('table')
table

[]
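
An empty list here means the HTML that Requests downloaded contains no `<table>` elements at all: the page may lay out its data with other tags, or it may build its tables with JavaScript after loading. The sketch below shows one way to guard against that case; the HTML and the `item` class are invented for illustration.

```python
from bs4 import BeautifulSoup

# Invented page that lays out its data with <div> elements, not <table>.
html = (
    '<div class="item"><h3>First title</h3></div>'
    '<div class="item"><h3>Second title</h3></div>'
)
soup = BeautifulSoup(html, "html.parser")

tables = soup.find_all("table")   # find_all is the current name for findAll
if not tables:
    # No tables: fall back to whatever tags the inspector shows for the data.
    titles = [h3.get_text(strip=True) for h3 in soup.select("div.item h3")]
else:
    titles = []

print(len(tables), titles)
```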

In [405]:
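# `table` here must be a single <table> Tag (for example tables[0]); calling
# findAll on the list returned by soup.findAll('table') raises an AttributeError
table.findAll('tbody')[0].findAll('tr')[0].findAll('td')[1]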

<td class="summary-data__cell" role="cell"></td>
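
The chained lookup above walks from a table to its body, then to a row, then to a cell. As a rough sketch of the same navigation on a self-contained fragment (the HTML is invented, reusing two label/value pairs that appear later in this notebook):

```python
from bs4 import BeautifulSoup

html = """
<table>
  <tbody>
    <tr><td>Exchange</td><td>NASDAQ-CM</td></tr>
    <tr><td>Sector</td><td>Health Care</td></tr>
  </tbody>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")   # a single Tag (or None), unlike find_all

# Every row as a list of cell strings.
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in table.find("tbody").find_all("tr")
]

# Same chain as the cell above: first tbody, first row, second cell.
second_cell = table.find_all("tbody")[0].find_all("tr")[0].find_all("td")[1]

print(rows)
print(second_cell.get_text())
```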

## Controlling the Browser with the Selenium Module

In [406]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options 

In [407]:
options = Options() 

In [408]:
# tell selenium where the chromedriver executable is located
chrome_path = 'C:/Users/purem/OneDrive/Desktop/chromedriver_win32/chromedriver.exe'

In [409]:
# if you want to run Chrome in headless mode, use this:
# options.add_argument('--headless')

# other useful options:
# options.add_argument('--ignore-certificate-errors')
# options.add_argument('--incognito')

In [410]:
# set the window size
options.add_argument('--window-size=500,300')

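# initialize the driver (Selenium 3 style; Selenium 4 takes the path through a
# selenium.webdriver.chrome.service.Service object instead of a positional argument)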
driver = webdriver.Chrome(chrome_path, 
                          options=options)

In [411]:
driver.set_window_size(1400,1000)

In [412]:
# driver.minimize_window()
# driver.maximize_window()
# driver.get_window_position()
# driver.get_window_size()
# driver.get_window_rect()

In [413]:
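# note: the outputs below are from a Nasdaq stock page (see driver.current_url
# further down), so `url` was evidently reassigned from the IMDb URL before this cell ran
driver.get(url)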

In [414]:
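# Selenium 3 style; Selenium 4.3+ removed the find_element_by_* helpers in favor
# of driver.find_element(By.CLASS_NAME, "summary-data")
element = driver.find_element_by_class_name("summary-data")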

In [415]:
from selenium.webdriver.common.action_chains import ActionChains

In [416]:
actions = ActionChains(driver)
actions.move_to_element(element).perform()

In [417]:
if element.is_displayed():
    # if element is displayed, try to click it
    try:
        # this only works if the element is "clickable"
        element.click()
    except Exception:  # a bare except would also swallow KeyboardInterrupt
        print("Whoops!")

Whoops!
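
The "check visibility, then try to click" pattern comes up repeatedly in this notebook, so it may be worth wrapping in a helper. The sketch below is one possible shape; `try_click` and `FakeElement` are hypothetical names, and `FakeElement` only imitates the two element methods used here so that the example runs without a browser.

```python
def try_click(element):
    """Click `element` if it is displayed; return True on success, False otherwise."""
    if not element.is_displayed():
        return False
    try:
        element.click()
    except Exception as exc:  # Selenium raises WebDriverException subclasses here
        print(f"Whoops! {type(exc).__name__}")
        return False
    return True


# Stand-in object so the sketch runs without a browser (hypothetical, not Selenium).
class FakeElement:
    def __init__(self, displayed, clickable):
        self.displayed, self.clickable = displayed, clickable

    def is_displayed(self):
        return self.displayed

    def click(self):
        if not self.clickable:
            raise RuntimeError("element not interactable")


results = [
    try_click(FakeElement(True, True)),    # visible and clickable
    try_click(FakeElement(True, False)),   # visible, but click() raises
    try_click(FakeElement(False, True)),   # not displayed, click is skipped
]
print(results)
```

Returning a boolean lets the caller decide whether a failed click matters, rather than only printing a message.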


In [419]:
page_source = driver.page_source
soup = BeautifulSoup(page_source)
tables = soup.findAll('table')
table = tables[1].prettify()
pd.read_html(table)[0]

Unnamed: 0,Label,Value
0,Exchange,NASDAQ-CM
1,Sector,Health Care
2,Industry,Major Pharmaceuticals
3,1 Year Target,$4.00
4,Today's High/Low,$2.13/$2.06
5,Share Volume,630936
6,50 Day Average Vol.,6541820
7,Previous Close,$2.12
8,52 Week High/Low,$2.97/$0.35
9,Market Cap,125127486
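
`pd.read_html` relies on an HTML parsing backend being installed. As an alternative sketch (a swapped-in technique, not what the cell above does), a table can be turned into a DataFrame by reading the cells with Beautiful Soup directly. The HTML below is invented and `table_to_frame` is a hypothetical helper name.

```python
import pandas as pd
from bs4 import BeautifulSoup

html = """
<table>
  <thead><tr><th>Label</th><th>Value</th></tr></thead>
  <tbody>
    <tr><td>Exchange</td><td>NASDAQ-CM</td></tr>
    <tr><td>Sector</td><td>Health Care</td></tr>
  </tbody>
</table>
"""


def table_to_frame(table):
    """Turn one <table> Tag into a DataFrame of strings."""
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    rows = [
        [td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in table.find_all("tr")
        if tr.find("td")   # skip the header row, which has only <th> cells
    ]
    return pd.DataFrame(rows, columns=headers)


soup = BeautifulSoup(html, "html.parser")
frame = table_to_frame(soup.find_all("table")[0])
print(frame)
```

Unlike `pd.read_html`, this keeps every value as a string, so numeric columns would still need converting.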


In [420]:
element = driver.find_element_by_class_name("short-interest")
actions = ActionChains(driver)
actions.move_to_element(element).perform()
page_source = driver.page_source
soup = BeautifulSoup(page_source)
tables = soup.findAll('table')
table = tables[3].prettify()
pd.read_html(table)[0]

Unnamed: 0,SETTLEMENT DATE,SHORT INTEREST,AVG. DAILY SHARE VOLUME,DAYS TO COVER
0,11/15/2019,5333079,15206573,1
1,10/31/2019,2874689,8686415,1
2,10/15/2019,659437,1203268,1
3,09/30/2019,369989,1188665,1
4,09/13/2019,425601,1037275,1


In [421]:
element = driver.find_element_by_class_name("forecasts")
actions = ActionChains(driver)
actions.move_to_element(element).perform()
page_source = driver.page_source
soup = BeautifulSoup(page_source)
tables = soup.findAll('table')
table = tables[5].prettify()
pd.read_html(table)[0]

Unnamed: 0,Fiscal Year End,Consensus EPS* Forecast,High EPS* Forecast,Low EPS* Forecast,Number of Estimates,Over the Last 4 Weeks Number of Revisions - Up,Over the Last 4 Weeks Number of Revisions - Down
0,Dec 2019,-0.08,-0.07,-0.09,5,0,0
1,Mar 2020,-0.08,-0.06,-0.09,3,1,0
2,Jun 2020,-0.11,-0.07,-0.14,3,1,0
3,Sep 2020,-0.13,-0.08,-0.18,3,1,0
4,Dec 2020,-0.07,0.03,-0.19,3,1,0


In [423]:
element = driver.find_element_by_xpath("//button[@data-value='yearly']")
actions = ActionChains(driver)
actions.move_to_element(element).perform()

if element.is_displayed():
    # if element is displayed, try to click it
    try:
        # this only works if the element is "clickable"
        element.click()
    except Exception:
        print("Whoops!")
        
page_source = driver.page_source
soup = BeautifulSoup(page_source)
tables = soup.findAll('table')
table = tables[5].prettify()
pd.read_html(table)[0]

Unnamed: 0,Fiscal Year End,Consensus EPS* Forecast,High EPS* Forecast,Low EPS* Forecast,Number of Estimates,Over the Last 4 Weeks Number of Revisions - Up,Over the Last 4 Weeks Number of Revisions - Down
0,Dec 2019,-0.36,-0.34,-0.39,5,0,0
1,Dec 2020,-0.5,-0.25,-0.71,5,1,0
2,Dec 2021,-0.14,0.27,-0.57,3,0,0
3,Dec 2022,0.44,0.89,-0.02,2,0,0


In [424]:
import time

SCROLL_PAUSE_TIME = 1.0 # seconds

In [425]:
# INFINITE SCROLL

# # Get scroll height
# last_height = driver.execute_script("return document.body.scrollHeight")

# while True:
#     # Scroll down to bottom
#     driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

#     # Wait to load page
#     time.sleep(SCROLL_PAUSE_TIME)

#     # Calculate new scroll height and compare with last scroll height
#     new_height = driver.execute_script("return document.body.scrollHeight")
#     if new_height == last_height:
#         break
#     last_height = new_height
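
The commented-out loop above keeps scrolling until the page height stops changing. Below is a sketch of the same idea as a function, with an added cap on the number of rounds so that a page which never stops growing cannot hang the notebook. `scroll_to_bottom`, `max_rounds`, and `FakeDriver` are hypothetical names; `FakeDriver` only imitates `execute_script` so the example runs without a browser.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=50):
    """Scroll until the page height stops growing; return the number of scrolls."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load more content
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds
        last_height = new_height
    return max_rounds


class FakeDriver:
    """Hypothetical stand-in: the page grows twice, then stops."""

    def __init__(self):
        self.heights = [1000, 2000, 3000]
        self.loaded = 0

    def execute_script(self, script):
        if script.startswith("window.scrollTo"):
            self.loaded = min(self.loaded + 1, len(self.heights) - 1)
            return None
        return self.heights[self.loaded]


scrolls = scroll_to_bottom(FakeDriver(), pause=0)
print(scrolls)
```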

In [426]:
from selenium.webdriver.common.keys import Keys 

In [427]:
element = driver.find_element_by_partial_link_text('Institutional')
actions = ActionChains(driver)
actions.move_to_element(element).perform()
actions.reset_actions()
actions.send_keys(Keys.PAGE_DOWN).perform()
actions.reset_actions()

time.sleep(SCROLL_PAUSE_TIME)
if element.is_displayed():
    # if element is displayed, try to click it
    try:
        # this only works if the element is "clickable"
        element.click()
    except Exception:
        print("Whoops!")

In [428]:
driver.current_url

'https://www.nasdaq.com/market-activity/stocks/agrx/institutional-holdings'

In [429]:
page_source = driver.page_source
soup = BeautifulSoup(page_source)
tables = soup.findAll('table')

In [430]:
name_list = list(set([t['class'][0].split('_')[0] for t in tables]))
name_list

['institutional-holdings']
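
The comprehension above takes the first class of each table and keeps the part before the first underscore. Here is a sketch of the same extraction on invented HTML; the full class names are assumptions, since only the resulting prefix appears in the output above. It also skips tables that have no `class` attribute, which would raise a `KeyError` with `t['class']`.

```python
from bs4 import BeautifulSoup

# Invented markup; the class names after the double underscore are hypothetical.
html = """
<table class="institutional-holdings__table extra"></table>
<table class="institutional-holdings__table"></table>
<table></table>
"""
soup = BeautifulSoup(html, "html.parser")

prefixes = sorted({
    t["class"][0].split("_")[0]
    for t in soup.find_all("table")
    if t.get("class")   # skip tables with no class attribute
})
print(prefixes)
```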

In [431]:
name = name_list[0]
element = driver.find_element_by_class_name(name)
actions = ActionChains(driver)
actions.move_to_element(element).perform()
page_source = driver.page_source
soup = BeautifulSoup(page_source)

In [432]:
tables = soup.findAll('table')
table = tables[0].prettify()
pd.read_html(table)[0]

Unnamed: 0,Label,Value
0,Institutional Ownership,43.27 %
1,Total Shares Outstanding (millions),59
2,Total Value of Holdings (millions),$54


In [433]:
tables = soup.findAll('table')
table = tables[1].prettify()
pd.read_html(table)[0]

Unnamed: 0,ACTIVE POSITIONS,HOLDERS,SHARES
0,Increased Positions,22,6637789
1,Decreased Positions,8,556921
2,Held Positions,12,18466156
3,Total Institutional Shares,42,25660866


In [434]:
tables = soup.findAll('table')
table = tables[2].prettify()
pd.read_html(table)[0]

Unnamed: 0,ACTIVE POSITIONS,HOLDERS,SHARES
0,New Positions,15,3494703
1,Sold Out Positions,4,495802


In [435]:
tables = soup.findAll('table')
table = tables[3].prettify()
df = pd.read_html(table)[0]
df

Unnamed: 0,OWNER NAME,DATE,SHARES HELD,CHANGE (SHARES),CHANGE (%),"VALUE (IN 1,000S)"
0,PERCEPTIVE ADVISORS LLC,09/30/2019,10726750,2300000,27.294%,"$22,741"
1,INVESTOR AB,09/30/2019,3510189,0,0%,"$7,442"
2,RENAISSANCE TECHNOLOGIES LLC,09/30/2019,2427100,172920,7.671%,"$5,145"
3,VANGUARD GROUP INC,09/30/2019,2321726,589078,33.999%,"$4,922"
4,"DEERFIELD MANAGEMENT COMPANY, L.P. (SERIES C)",09/30/2019,1591652,1591652,New,"$3,374"
5,"VIVO CAPITAL, LLC",09/30/2019,1513975,0,0%,"$3,210"
6,"EVERSEPT PARTNERS, LP",09/30/2019,909485,909485,New,"$1,928"
7,ACADIAN ASSET MANAGEMENT LLC,09/30/2019,431548,-18632,-4.139%,$915
8,"683 CAPITAL MANAGEMENT, LLC",09/30/2019,375000,375000,New,$795
9,"GEODE CAPITAL MANAGEMENT, LLC",09/30/2019,281210,0,0%,$596


In [436]:
# click button until there's no new data
dfs = [df]
df_old = pd.DataFrame()
while not df.equals(df_old):
    df_old = df
    element = driver.find_element_by_xpath("//button[@aria-label='click to go to the next page']")
    actions = ActionChains(driver)
    actions.move_to_element(element).perform()

    if element.is_displayed():
        # if element is displayed, try to click it
        try:
            # this only works if the element is "clickable"
            element.click()
        except Exception:
            print("Whoops!")

    page_source = driver.page_source
    soup = BeautifulSoup(page_source)
    tables = soup.findAll('table')
    table = tables[3].prettify()
    df = pd.read_html(table)[0]
    dfs.append(df)
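
Because the loop above checks for a repeat only at the top of the next iteration, the last page ends up in `dfs` twice. One way to avoid that is to compare before appending. The sketch below shows the idea with plain lists; `collect_pages` and `FakePager` are hypothetical names, and with DataFrames the `==` comparison would be replaced by `df.equals(...)`.

```python
def collect_pages(read_page, click_next, max_pages=100):
    """Collect pages until clicking "next" no longer changes what is read."""
    pages = [read_page()]
    for _ in range(max_pages - 1):
        click_next()
        page = read_page()
        if page == pages[-1]:   # nothing new: we were already on the last page
            break
        pages.append(page)
    return pages


# Hypothetical stand-in for a paginated table with three pages.
class FakePager:
    def __init__(self):
        self.data = [["A", "B"], ["C", "D"], ["E"]]
        self.index = 0

    def read(self):
        return self.data[self.index]

    def next(self):
        self.index = min(self.index + 1, len(self.data) - 1)


pager = FakePager()
pages = collect_pages(pager.read, pager.next)
print(len(pages))
```

On a real site a short pause after each click may also be needed so the new rows have time to load before the page is read.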

In [437]:
len(dfs)

4

In [443]:
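# the loop above appends the last page twice (it only notices the repeat after
# appending), so drop the repeated rows when combining
pd.concat(dfs, ignore_index=True).drop_duplicates().reset_index(drop=True)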

Unnamed: 0,OWNER NAME,DATE,SHARES HELD,CHANGE (SHARES),CHANGE (%),"VALUE (IN 1,000S)"
0,PERCEPTIVE ADVISORS LLC,09/30/2019,10726750,2300000,27.294%,"$22,741"
1,INVESTOR AB,09/30/2019,3510189,0,0%,"$7,442"
2,RENAISSANCE TECHNOLOGIES LLC,09/30/2019,2427100,172920,7.671%,"$5,145"
3,VANGUARD GROUP INC,09/30/2019,2321726,589078,33.999%,"$4,922"
4,"DEERFIELD MANAGEMENT COMPANY, L.P. (SERIES C)",09/30/2019,1591652,1591652,New,"$3,374"
5,"VIVO CAPITAL, LLC",09/30/2019,1513975,0,0%,"$3,210"
6,"EVERSEPT PARTNERS, LP",09/30/2019,909485,909485,New,"$1,928"
7,ACADIAN ASSET MANAGEMENT LLC,09/30/2019,431548,-18632,-4.139%,$915
8,"683 CAPITAL MANAGEMENT, LLC",09/30/2019,375000,375000,New,$795
9,"GEODE CAPITAL MANAGEMENT, LLC",09/30/2019,281210,0,0%,$596


In [444]:
driver.quit()
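
If an earlier cell raises an exception, `driver.quit()` never runs and the browser stays open. In a script, one option is to wrap the driver in a small context manager so that it is always closed. This is a sketch under that assumption; `quitting` and `FakeDriver` are hypothetical names, and `FakeDriver` stands in for a real driver so the example runs without a browser.

```python
from contextlib import contextmanager


@contextmanager
def quitting(driver):
    """Yield `driver` and always call driver.quit() afterwards."""
    try:
        yield driver
    finally:
        driver.quit()


class FakeDriver:
    """Hypothetical stand-in that records whether quit() was called."""

    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True


fake = FakeDriver()
try:
    with quitting(fake) as driver:
        raise ValueError("scraping step failed")
except ValueError:
    pass

print(fake.closed)
```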