# Chapter 12: WEB SCRAPING

- `webbrowser` - Comes with Python and opens a browser to a specific page.

- `requests` - Downloads files and web pages from the internet.

- `bs4` - Parses HTML, the format that web pages are written in.

- `selenium` - Launches and controls a web browser. The selenium module is able to fill in forms and simulate mouse clicks in this browser.

In [1]:
import webbrowser

webbrowser.open('https://inventwithpython.com/')

True

## Project: mapIt.py with the webbrowser Module

In [4]:
# ADDRESS: mapit 870 Valencia St, San Francisco, CA 94110

#! python3
# mapIt.py - Launches a map in the browser using an address from the
# command line or clipboard.

import sys
import pyperclip
import webbrowser

# Command line or clipboard.
if len(sys.argv) > 1:
    # get address from command line.
    address = " ".join(sys.argv[1:])
else:
    # get address from clipboard.
    address = pyperclip.paste()

webbrowser.open("https://google.com/maps/place/" + address)

True
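The script above concatenates the address into the URL as typed. One optional refinement, not part of the original script, is to percent-encode the address first so that spaces and commas cannot produce a malformed URL. The helper name `build_map_url` is hypothetical and used only for this sketch.

```python
from urllib.parse import quote


def build_map_url(address):
    """Return a Google Maps URL with the address percent-encoded."""
    # quote() escapes spaces, commas, and other reserved characters.
    return "https://google.com/maps/place/" + quote(address)


print(build_map_url("870 Valencia St, San Francisco, CA 94110"))
```

The resulting string can be passed to `webbrowser.open()` in place of the plain concatenation.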

### Ideas for Similar Programs

- Open all links on a page in separate browser tabs.
- Open the browser to the URL for your local weather.
- Open several social network sites that you regularly check.

In [2]:
# Open all links on a page in separate browser tabs.
links = ["google.com", "amazon.com", "telegram.org"]

for link in links:
    webbrowser.open("https://" + link)

## Downloading Files from the Web with the requests Module

The `requests` module lets you easily download files from the web without having to worry about complicated issues such as network errors, connection problems, and data compression. The `requests` module doesn’t come with Python, so you’ll have to install it first.

### Downloading a Web Page with the requests.get() Function

In [41]:
import requests

In [70]:
res = requests.get("https://automatetheboringstuff.com/files/rj.txt")
res

<Response [200]>

In [71]:
type(res)

requests.models.Response

In [72]:
res.status_code == requests.codes.ok

True

In [73]:
len(res.text)

178978

In [74]:
print(res.text[:251])

The Project Gutenberg EBook of Romeo and Juliet, by William Shakespeare

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project


### Checking for Errors

In [75]:
res = requests.get("https://inventwithpython.com/page_that_does_not_exist")
res

<Response [404]>

In [76]:
res.raise_for_status()

HTTPError: 404 Client Error: Not Found for url: https://inventwithpython.com/page_that_does_not_exist

In [77]:
res = requests.get("https://inventwithpython.com/page_that_does_not_exist")

try:
    res.raise_for_status()
except Exception as exc:
    print(f"There was a problem: {exc}")

There was a problem: 404 Client Error: Not Found for url: https://inventwithpython.com/page_that_does_not_exist


## Saving Downloaded Files to the Hard Drive

From here, you can save the web page to a file on your hard drive with the standard `open()` function and `write()` method. There are some slight differences, though. First, you must open the file in *write binary* mode by passing the string `'wb'` as the second argument to `open()`. Even if the page is in plaintext (such as the *Romeo and Juliet* text you downloaded earlier), you need to write binary data instead of text data in order to maintain the *Unicode encoding* of the text.
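The write-binary pattern can be sketched without a network connection. In this minimal example the helper `save_chunks` and the file name `save_chunks_demo.txt` are hypothetical, and a plain list of `bytes` objects stands in for `res.iter_content(100_000)`. The `with` statement is a design choice of this sketch: it closes the file even if a write fails.

```python
import os
import tempfile


def save_chunks(chunks, path):
    """Write an iterable of bytes objects to path in write-binary mode."""
    total = 0
    with open(path, 'wb') as out_file:  # closed automatically, even on error
        for chunk in chunks:
            total += out_file.write(chunk)  # write() returns the bytes written
    return total


# Stand-in for res.iter_content(100_000): any iterable of bytes works.
fake_chunks = [b'The Project Gutenberg EBook ', b'of Romeo and Juliet']
demo_path = os.path.join(tempfile.gettempdir(), 'save_chunks_demo.txt')
written = save_chunks(fake_chunks, demo_path)
print(written, 'bytes written to', demo_path)
```

With a real download, the call would look like `save_chunks(res.iter_content(100_000), 'RomeoAndJuliet.txt')`.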

In [108]:
import requests

res = requests.get("https://automatetheboringstuff.com/files/rj.txt")
res.raise_for_status()
playFile = open('RomeoAndJuliet.txt', 'wb')
for chunk in res.iter_content(100_000):
    playFile.write(chunk)

playFile.close()

In [109]:
with open("RomeoAndJuliet.txt", 'r') as f:
    rj_content = f.read()[:247]
print(rj_content)

The Project Gutenberg EBook of Romeo and Juliet, by William Shakespeare

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project


In [111]:
with open("RomeoAndJuliet.txt", 'r') as f:
    rj_content = f.read()
print(len(rj_content))
print(rj_content)

174126


In [6]:
import requests
import sys

res = requests.get("https://automatetheboringstuff.com/2e/chapter12")
res.raise_for_status()
size = sys.getsizeof(res.text)  # in-memory size of the str object; used below only as the chunk size
playFile = open('chapter12.html', 'wb')
print(f"size: {size} bytes")
for chunk in res.iter_content(size):
    playFile.write(chunk)
playFile.close()

size: 130277 bytes


In [7]:
with open("chapter12.html", 'r') as f:
    html_content = f.read()[:545]
print(html_content)

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html lang="en" xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
  <link rel="stylesheet" type="text/css" href="/automate2_website.css" />
  <meta charset="UTF-8" />
  <title>Automate the Boring Stuff with Python</title>
</head>

<body>
  <header class="top_header">
  <a href="https://automatetheboringstuff.com/">Home</a> | <a href="https://www.nostarch.com/automatestuff2">Buy Direct from Publisher</a> | 


## Parsing HTML with the bs4 Module

Beautiful Soup is a module for extracting information from an HTML page (and is much better for this purpose than regular expressions). The Beautiful Soup module’s name is `bs4` (for Beautiful Soup, version 4).

### Creating a BeautifulSoup Object from HTML

In [8]:
import requests
import bs4

res = requests.get("https://nostarch.com")
res.raise_for_status()
noStarchSoup = bs4.BeautifulSoup(res.text, 'html.parser')
print(type(noStarchSoup))

<class 'bs4.BeautifulSoup'>


In [10]:
exampleFile = open("example.html")
exampleSoup = bs4.BeautifulSoup(exampleFile, 'html.parser')
print(type(exampleSoup))

<class 'bs4.BeautifulSoup'>


The `'html.parser'` parser used here comes with Python. However, you can use the faster `'lxml'` parser if you install the third-party `lxml` module.

### Finding an Element with the select() Method

| Selector passed to the select() method | Will match . . . |
| :- | :- |
| **`soup.select('div')`** | All elements named `<div>` |
| **`soup.select('#author')`** | The element with an id attribute of author |
| **`soup.select('.notice')`** | All elements that use a CSS class attribute named notice |
| **`soup.select('div span')`** | All elements named `<span>` that are within an element named `<div>` |
| **`soup.select('div > span')`** | All elements named `<span>` that are directly within an element named `<div>`, with no other element in between |
| **`soup.select('input[name]')`** | All elements named `<input>` that have a name attribute with any value |
| **`soup.select('input[type="button"]')`** | All elements named `<input>` that have an attribute named type with value button |
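To see the selectors from the table in action, here is a small sketch run against an inline HTML string. The HTML snippet and the helper `texts` are made up for this illustration and are not the contents of `example.html`.

```python
import bs4

# Hypothetical HTML used only for this illustration.
html = """
<div id="main">
  <span>direct child</span>
  <p class="notice">Sale ends <span>today</span></p>
  <section><span id="author">Al Sweigart</span></section>
  <input name="q" type="text"/>
  <input type="button" value="Go"/>
</div>
"""

soup = bs4.BeautifulSoup(html, 'html.parser')


def texts(selector):
    """Return the text of every element matched by a CSS selector."""
    return [elem.getText() for elem in soup.select(selector)]


print(len(soup.select('div')))                   # elements named <div>
print(texts('#author'))                          # the element with id="author"
print(texts('.notice'))                          # elements with class "notice"
print(texts('div span'))                         # <span> anywhere inside a <div>
print(texts('div > span'))                       # <span> directly inside a <div>
print(len(soup.select('input[name]')))           # <input> with any name attribute
print(len(soup.select('input[type="button"]')))  # <input> with type="button"
```

Comparing the `'div span'` and `'div > span'` results shows the difference between a descendant selector and a child selector: only the first `<span>` sits directly inside the `<div>`.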

In [21]:
import bs4

exampleFile = open('example.html')
exampleSoup = bs4.BeautifulSoup(exampleFile.read(), 'html.parser')
elems = exampleSoup.select('#author')

In [22]:
print(type(elems))
elems

<class 'bs4.element.ResultSet'>


[<span id="author">Al Sweigart</span>]

In [27]:
print(type(elems[0]))
elems[0]

<class 'bs4.element.Tag'>


<span id="author">Al Sweigart</span>

In [32]:
elems[0].getText()

'Al Sweigart'

In [47]:
elems[0].attrs

{'id': 'author'}

In [48]:
pElems = exampleSoup.select('p')

In [49]:
pElems[0]

<p>Download my <strong>Python</strong> book from <a href="https://
inventwithpython.com">my website</a>.</p>

In [50]:
pElems[0].getText()

'Download my Python book from my website.'

In [51]:
pElems[1].getText()

'Learn Python the easy way!'

In [52]:
pElems[2].getText()

'By Al Sweigart'

### Getting Data from an Element’s Attributes

In [63]:
import bs4

soup = bs4.BeautifulSoup(open('example.html'), 'html.parser')
spanElem = soup.select('span')[0]
spanElem

<span id="author">Al Sweigart</span>

In [64]:
spanElem.get('id')

'author'

In [68]:
spanElem.get('some_nonexistent_attr') is None

True

In [69]:
spanElem.attrs

{'id': 'author'}

## Project: Opening All Search Results

It would be nice if I could simply type a search term on the command line and have my computer automatically open a browser with all the top search results in new tabs. Let’s write a script to do this with the search results page for the Python Package Index at https://pypi.org/. A program like this can be adapted to many other websites, although Google and DuckDuckGo often employ measures that make scraping their search results pages difficult.
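Before fetching anything, the search terms have to be turned into a query URL. A minimal sketch of that step, assuming the `q` query parameter used by the PyPI search page; the helper name `build_search_url` is hypothetical. It uses `urllib.parse.urlencode` so that spaces and characters such as `+` are escaped correctly.

```python
from urllib.parse import urlencode


def build_search_url(terms, base="https://pypi.org/search/"):
    """Join the search terms and encode them as the q query parameter."""
    query = urlencode({"q": " ".join(terms)})
    return f"{base}?{query}"


print(build_search_url(["boring", "stuff"]))
print(build_search_url(["c++", "tools"]))
```

In a command line script, the list of terms would typically be `sys.argv[1:]`.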

In [70]:
import sys
import requests
import bs4
import webbrowser

print("Searching ...")  # display text while downloading the search result page

res = requests.get("https://pypi.org/search/?q=" + " ".join(sys.argv[1:]))
res.raise_for_status()

# Retrieve top search result links.
soup = bs4.BeautifulSoup(res.text, 'lxml')
# Open a browser tab for each result.
linkElems = soup.select('.package-snippet')
numOpen = min(5, len(linkElems))
for i in range(numOpen):
    urlToOpen = 'https://pypi.org' + linkElems[i].get('href')
    print('Opening', urlToOpen)
    webbrowser.open(urlToOpen)

Searching ...
Opening https://pypi.org/project/f16774e1d64c/
Opening https://pypi.org/project/6d657461666c6f77/
Opening https://pypi.org/project/3636c788d0392f7e84453434eea18c59/
Opening https://pypi.org/project/c3d/
Opening https://pypi.org/project/d2c/


### Ideas for Similar Programs

- Open all the product pages after searching a shopping site such as Amazon.
- Open all the links to reviews for a single product.
- Open the result links to photos after performing a search on a photo site such as Flickr or Imgur.

In [None]:
import sys
import requests
import bs4
import webbrowser

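search_term = "ibm quantum"
link = "https://google.com/search?q="
classes = ["MjjYud", "ULSxyf"]

print("Searching...")
# Use the command line arguments if given, otherwise the default search term.
terms = sys.argv[1:] or search_term.split()
resp = requests.get(link + "+".join(terms))
resp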

## Project: Downloading All XKCD Comics

In [None]:
# downloadXkcd.py
import os
import sys
import requests
import bs4

url = "https://xkcd.com"  # starting url
os.makedirs('xkcd', exist_ok=True)  # store comics in ./xkcd

while not url.endswith('#'):

    # Scrape the page
    print(f"Scraping {url} ...")
    res = requests.get(url)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, 'lxml')

    # Find the URL of the comic image
    comic_elem = soup.select("#comic img")
    if comic_elem == []:
        print("Could not find comic image.")
    else:
        comic_url = 'https:' + comic_elem[0].get('src')

        # Download the image
        print(f"Downloading image {comic_url} ...")
        res = requests.get(comic_url)
        res.raise_for_status()

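        # Save the image to ./xkcd
        with open(os.path.join('xkcd', os.path.basename(comic_url)), 'wb') as image_file:
            image_file.write(res.content)

    # Get the Prev button's url (done even when no image was found,
    # so the loop always moves on to the previous comic)
    prev_link = soup.select('a[rel="prev"]')[0]
    url = 'https://xkcd.com' + prev_link.get('href')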

print('Done.')

Scraping https://xkcd.com ...
Downloading image https://imgs.xkcd.com/comics/account_problems.png ...
Scraping https://xkcd.com/2699/ ...
Downloading image https://imgs.xkcd.com/comics/feature_comparison.png ...
Scraping https://xkcd.com/2698/ ...
Downloading image https://imgs.xkcd.com/comics/bad_date.png ...
Scraping https://xkcd.com/2697/ ...
Downloading image https://imgs.xkcd.com/comics/y2k_and_2038.png ...
Scraping https://xkcd.com/2696/ ...


### Ideas for Similar Programs

- Back up an entire site by following all of its links.
- Copy all the messages off a web forum.
- Duplicate the catalog of items for sale on an online store.

## Controlling the Browser with the selenium Module

### Starting a selenium-Controlled Browser

In [4]:
from selenium import webdriver

browser = webdriver.Chrome()
type(browser)

selenium.webdriver.chrome.webdriver.WebDriver

In [5]:
browser.get('https://inventwithpython.com')

### Finding Elements on the Page

| **Method name** | **`WebElement` object/list returned** |
| :- | :- |
| **`browser.find_element(By.CLASS_NAME, name)`, `browser.find_elements(By.CLASS_NAME, name)`** | Elements that use the CSS class name |
| **`browser.find_element(By.CSS_SELECTOR, selector)`, `browser.find_elements(By.CSS_SELECTOR, selector)`** | Elements that match the CSS selector |
| **`browser.find_element(By.ID, id)`, `browser.find_elements(By.ID, id)`** | Elements with a matching id attribute value |
| **`browser.find_element(By.LINK_TEXT, text)`, `browser.find_elements(By.LINK_TEXT, text)`** | `<a>` elements that completely match the text provided |
| **`browser.find_element(By.PARTIAL_LINK_TEXT, text)`, `browser.find_elements(By.PARTIAL_LINK_TEXT, text)`** | `<a>` elements that contain the text provided |
| **`browser.find_element(By.NAME, name)`, `browser.find_elements(By.NAME, name)`** | Elements with a matching name attribute value |
| **`browser.find_element(By.TAG_NAME, name)`, `browser.find_elements(By.TAG_NAME, name)`** | Elements with a matching tag name (case-insensitive; an `<a>` element is matched by 'a' and 'A') |

---

### `WebElement` Attributes and Methods

| Attribute or method | Description |
| :- | :- |
| **`tag_name`** | The tag name, such as 'a' for an `<a>` element |
| **`get_attribute(name)`** | The value for the element’s name attribute |
| **`text`** | The text within the element, such as 'hello' in `<span>hello</span>` |
| **`clear()`** | For text field or text area elements, clears the text typed into it |
| **`is_displayed()`** | Returns True if the element is visible; otherwise returns False |
| **`is_enabled()`** | For input elements, returns True if the element is enabled; otherwise returns False |
| **`is_selected()`** | For checkbox or radio button elements, returns True if the element is selected; otherwise returns False |
| **`location`** | A dictionary with keys `'x'` and `'y'` for the position of the element in the page |

In [1]:
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

browser = webdriver.Chrome()
browser.get('https://inventwithpython.com')

try:
    web_elem = browser.find_element(By.CLASS_NAME, "cover-thumb")  # WebElement object
    print(f"Found <{web_elem.tag_name}> element with that class name!")
except NoSuchElementException:  # raised when no element matches
    print("Was not able to find an element with that name.")

browser.quit()

Found <img> element with that class name!


### Clicking the Page

In [8]:
from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Chrome()
browser.get('https://inventwithpython.com')

linkElem = browser.find_element(By.LINK_TEXT, "Read Online for Free")
linkElem.click()  # follows the "Read Online for Free" link

# browser.quit()

### Filling Out and Submitting Forms

In [4]:
from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Chrome()
browser.get("https://login.metafilter.com")
userElem = browser.find_element(By.ID, 'user_name')
userElem.send_keys("your_username_here")

passwordElem = browser.find_element(By.ID, 'user_pass')
passwordElem.send_keys('your_password_here')
passwordElem.submit()

Calling the `submit()` method on any element will have the same result as clicking the Submit button for the form that element is in. (You could have just as easily called `userElem.submit()`, and the code would have done the same thing.)

### Sending Special Keys

**Commonly Used Variables in the `selenium.webdriver.common.keys` Module**

| **Attributes** | **Meanings** |
| :- | :- |
| **`Keys.DOWN`, `Keys.UP`, `Keys.LEFT`, `Keys.RIGHT`** | The keyboard arrow keys |
| **`Keys.ENTER`, `Keys.RETURN`** | The ENTER and RETURN keys |
| **`Keys.HOME`, `Keys.END`, `Keys.PAGE_DOWN`, `Keys.PAGE_UP`** | The HOME, END, PAGEDOWN, and PAGEUP keys |
| **`Keys.ESCAPE`, `Keys.BACK_SPACE`, `Keys.DELETE`** | The ESC, BACKSPACE, and DELETE keys |
| **`Keys.F1`, `Keys.F2`, ... , `Keys.F12`** | The F1 to F12 keys at the top of the keyboard |
| **`Keys.TAB`** | The TAB key |

In [10]:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

browser = webdriver.Chrome()
browser.get("https://nostarch.com")
htmlElem = browser.find_element(By.TAG_NAME, "html")
htmlElem.send_keys(Keys.END)  # scrolls to bottom
time.sleep(1)
htmlElem.send_keys(Keys.HOME)  # scrolls to top

### Clicking Browser Buttons

- `browser.back()` - Clicks the Back button.
- `browser.forward()` - Clicks the Forward button.
- `browser.refresh()` - Clicks the Refresh/Reload button.
- `browser.quit()` - Clicks the Close Window button.

In [34]:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

browser = webdriver.Chrome()
browser.maximize_window()
browser.get("http://inventwithpython.com")
link_elem = browser.find_element(By.LINK_TEXT, "Read Online for Free")
link_elem.click()
browser.find_element(By.TAG_NAME, "html").send_keys(Keys.END)  # scrolls to bottom
time.sleep(2)
browser.back()  # click back button in browser
time.sleep(2)
browser.find_element(By.TAG_NAME, "html").send_keys(Keys.HOME)  # scrolls to top
# browser.quit()  # close browser window

## Practice Projects

### Command Line Emailer
Write a program that takes an email address and string of text on the command line and then, using selenium, logs in to your email account and sends an email of the string to the provided address. (You might want to set up a separate email account for this program.)
This would be a nice way to add a notification feature to your programs. You could also write a similar program to send messages from a Facebook or Twitter account.
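The command line handling for such a program can be sketched separately from the browser automation. This is one possible approach using the standard `argparse` module; the function name `parse_args`, the argument names, and the simple `@` check are assumptions made for this illustration, not requirements of the exercise.

```python
import argparse


def parse_args(argv):
    """Return (recipient, message) parsed from a list of command line words."""
    parser = argparse.ArgumentParser(
        description="Send a short email from the command line.")
    parser.add_argument("email_to", help="recipient address")
    parser.add_argument("message", nargs="+", help="words of the message")
    args = parser.parse_args(argv)
    # Very rough sanity check; real address validation is more involved.
    if "@" not in args.email_to:
        parser.error(f"{args.email_to!r} does not look like an email address")
    return args.email_to, " ".join(args.message)


email_to, message = parse_args(["alice@example.com", "Backup", "finished"])
print(email_to, "->", message)
```

In a script, the call would be `parse_args(sys.argv[1:])`; a missing argument then produces a usage message instead of an `IndexError`.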

In [7]:
""" Script to send email from mail.ru account to any email. """

import sys
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pyinputplus as pyip

# get email address and text message
emailTo = sys.argv[1]
message = " ".join(sys.argv[2:])
# open chrome browser
browser = webdriver.Chrome()
browser.maximize_window()
# log in to the account
browser.get("https://mail.ru")
browser.find_element(By.CSS_SELECTOR, 'button[data-testid="enter-mail-primary"]').click()  # click 'Log in' button
time.sleep(20)  # time for login to your account

unameElem = browser.find_element(By.CSS_SELECTOR, 'input[name="username"]')
login = pyip.inputEmail(prompt='Enter your login: ')
unameElem.send_keys(login)
unameElem.submit()
# enter the password
passwElem = browser.find_element(By.NAME, "password")
passwElem.click()
password = pyip.inputPassword(prompt='Enter your password: ')
passwElem.send_keys(password)
passwElem.submit()


WebDriverWait(browser, 10).until(EC.presence_of_element_located(
    (By.CSS_SELECTOR, 'div.sidebar__compose-btn-box > a'))
    ).click()

WebDriverWait(browser, 10).until(EC.presence_of_element_located(
    (By.CSS_SELECTOR, 'label div input'))
    ).send_keys(emailTo)

browser.find_element(By.CSS_SELECTOR, 'div[role="textbox"]').send_keys(message)  # put message to text field
browser.find_element(By.CSS_SELECTOR, 'button[data-test-id="send"]').click()  # click 'Send' button

browser.quit()

### Image Site Downloader
Write a program that goes to a photo-sharing site like Flickr or Imgur, searches for a category of photos, and then downloads all the resulting images. You could write a program that works with any photo site that has a search feature.

In [27]:
""" Searches command line text from yandex.com/images and Downloads the images """

import sys
import os
import requests
import bs4

search_term = " ".join(sys.argv[1:])
link = f"https://yandex.com/images/search?text={search_term}"
folder_name = "_".join(sys.argv[1:]) + "_images"
os.makedirs(folder_name, exist_ok=True)  # make the images folder

print(f"Searching '{link}'")
res = requests.get(link)
soup = bs4.BeautifulSoup(res.text, 'lxml')
img_elems = soup.select("div > a > img")
img_urls = ["http:" + img.get('src') for img in img_elems]

img_names = []
for elem in img_elems:
    alt = elem.get('alt') or ""  # the alt attribute may be missing
    name = ""
    # keep only letters, digits, and spaces so the name works as a filename
    for char in alt:
        if char.isalnum():
            name += char
        elif char.isspace():
            name += " "
    img_names.append(name)

for url, name in zip(img_urls, img_names):
    print(f"Downloading image {url} ...")
    with open(os.path.join(folder_name, f"{name}.png"), 'wb') as image_file:
        image_file.write(requests.get(url).content)

print("===== Done. =====")

### 2048
2048 is a simple game where you combine tiles by sliding them up, down, left, or right with the arrow keys. You can actually get a fairly high score by repeatedly sliding in an up, right, down, and left pattern over and over again. Write a program that will open the game at https://gabrielecirulli.github.io/2048/ and keep sending up, right, down, and left keystrokes to automatically play the game.

In [12]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By

browser = webdriver.Chrome()
browser.get("https://2048game.com")

# play the game
body_elem = browser.find_element(By.CSS_SELECTOR, "body")
for i in range(100):
    body_elem.send_keys(Keys.UP)
    body_elem.send_keys(Keys.RIGHT)
    body_elem.send_keys(Keys.DOWN)
    body_elem.send_keys(Keys.LEFT)

### Link Verification
Write a program that, given the URL of a web page, will attempt to download every linked page on the page. The program should flag any pages that have a 404 “Not Found” status code and print them out as broken links.
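One way to approach this exercise is to split it into two steps: collect the links on the page, then check each link's status code. The sketch below is only an outline under stated assumptions. It uses the standard library's `html.parser` instead of `bs4` for link extraction so that it runs without extra installs, and all names (`LinkCollector`, `extract_links`, `find_broken_links`, `requests_status`) are hypothetical. The status lookup is passed in as a function, which lets the logic be tried offline with a stand-in.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html, base_url):
    """Return unique absolute http(s) links found in html, in page order."""
    collector = LinkCollector()
    collector.feed(html)
    links = []
    for href in collector.hrefs:
        # Resolve relative links and drop any #fragment part.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if absolute.startswith(("http://", "https://")) and absolute not in links:
            links.append(absolute)
    return links


def find_broken_links(links, get_status):
    """Return the links for which get_status(link) reports 404."""
    return [link for link in links if get_status(link) == 404]


def requests_status(url):
    """Status lookup for real use; needs the third-party requests module."""
    import requests
    return requests.get(url, timeout=10).status_code


# Offline demonstration with made-up HTML and a stand-in status function.
page = ('<a href="/about">About</a> <a href="https://other.example/x">X</a> '
        '<a href="mailto:me@example.com">Mail</a> <a href="/about#team">Team</a>')
links = extract_links(page, "https://example.com/index.html")
broken = find_broken_links(links, lambda url: 404 if "other" in url else 200)
print(links)
print("Broken links:", broken)
```

For a real page, the HTML would come from `requests.get(url).text` and the status function would be `requests_status`.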