# Web Scraping

## Using the webbrowser Module

In [12]:
!pip install pyperclip

Collecting pyperclip
  Downloading https://files.pythonhosted.org/packages/2d/0f/4eda562dffd085945d57c2d9a5da745cfb5228c02bc90f2c74bbac746243/pyperclip-1.7.0.tar.gz
Installing collected packages: pyperclip
  Running setup.py install for pyperclip: started
    Running setup.py install for pyperclip: finished with status 'done'
Successfully installed pyperclip-1.7.0


In [13]:
import webbrowser
import sys
import pyperclip

In [None]:
#Opens the URL in the browser
webbrowser.open('http://inventwithpython.com/')

In [17]:
user_input = input('Input address: ')

Input address: 


## Handle the Clipboard Content and Launch the Browser

In [18]:
if len(user_input) > 0:
    # Use the address that was typed in.
    address = user_input
    print(address)
    print(type(address))
else:
    # Empty input: assume the address has been copied to the clipboard.
    address = pyperclip.paste()

webbrowser.open('https://www.google.com/maps/place/' + address)

True
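
A typed or pasted address usually contains spaces and punctuation, so it can help to percent-encode it before appending it to the URL. The following is a minimal sketch using the standard library's `urllib.parse.quote_plus()`; the helper name `build_maps_url` and the sample address are illustrative assumptions, not part of the original notebook.

```python
import urllib.parse

MAPS_PREFIX = 'https://www.google.com/maps/place/'

def build_maps_url(address):
    """Return a Google Maps URL for a free-form address (hypothetical helper)."""
    # quote_plus() turns spaces into '+' and escapes characters such as ','.
    return MAPS_PREFIX + urllib.parse.quote_plus(address.strip())

# Sample address chosen only for illustration.
print(build_maps_url('870 Valencia St, San Francisco'))
# https://www.google.com/maps/place/870+Valencia+St%2C+San+Francisco
```

The resulting string can be passed to `webbrowser.open()` in place of the plain concatenation.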

## Similar Use Cases

* Open all links on a page in separate browser tabs.
* Open the browser to the URL for your local weather.
* Open several social network sites that you regularly check.
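
As a rough sketch of the last idea, the loop below opens a list of sites one after another. The `SITES` list and the `open_sites` helper are assumptions made for illustration; `webbrowser.open_new_tab()` is the standard library call that requests a new tab where the browser supports it.

```python
import webbrowser

# Hypothetical list of sites; replace it with the pages you actually check.
SITES = [
    'https://automatetheboringstuff.com/',
    'https://nostarch.com/',
    'https://xkcd.com/',
]

def open_sites(urls, opener=webbrowser.open_new_tab):
    """Call `opener` on each URL and return the URLs that were attempted."""
    opened = []
    for url in urls:
        opener(url)
        opened.append(url)
    return opened
```

Calling `open_sites(SITES)` opens each page in turn. The `opener` parameter exists only so the loop can be tried with a stand-in function, without launching a browser.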

## Downloading Files from the Web with the requests Module

In [13]:
import requests

## Downloading a Web Page with the requests.get() Function

In [14]:
res = requests.get('https://automatetheboringstuff.com/files/rj.txt')
print(type(res))
print(res.status_code == requests.codes.ok)
print(len(res.text))
print(res.text[:250]) # prints the first 250 characters

<class 'requests.models.Response'>
True
178978
The Project Gutenberg EBook of Romeo and Juliet, by William Shakespeare

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Projec


## Checking for Errors

In [15]:
res = requests.get('http://inventwithpython.com/page_that_does_not_exist')
try:
    res.raise_for_status()
except Exception as exc:
    print('There was a problem: %s' % (exc))

There was a problem: 404 Client Error: Not Found for url: http://inventwithpython.com/page_that_does_not_exist
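
Catching `Exception` works, but `requests` also provides more specific exception classes that separate a bad HTTP status from a failed connection. The sketch below assumes a hypothetical helper `fetch_text` with an injectable `get` function; the 10-second timeout is also an arbitrary assumption.

```python
import requests

def fetch_text(url, get=requests.get):
    """Return the body of `url` as text, or None if the download failed."""
    try:
        res = get(url, timeout=10)
        res.raise_for_status()
    except requests.exceptions.HTTPError as exc:
        # The server answered, but with a 4xx or 5xx status code.
        print('Bad HTTP status: %s' % exc)
        return None
    except requests.exceptions.RequestException as exc:
        # Connection errors, timeouts, invalid URLs, and similar failures.
        print('Request failed: %s' % exc)
        return None
    return res.text
```

`HTTPError` is a subclass of `RequestException`, so the more specific handler has to come first.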


## Handling Errors

In [16]:
res = requests.get('https://automatetheboringstuff.com/files/rj.txt')
try:
    res.raise_for_status()
    print(res.text[:250])
except Exception as exc:
    print('There was a problem: %s' % (exc))

The Project Gutenberg EBook of Romeo and Juliet, by William Shakespeare

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Projec


## Saving Downloaded Files to the Hard Drive

In [17]:
res = requests.get('https://automatetheboringstuff.com/files/rj.txt')
try:
    res.raise_for_status()
    playFile = open('RomeoAndJuliet.txt', 'wb')
    for chunk in res.iter_content(100000):
        playFile.write(chunk)
    playFile.close()
    print('Done!')
except Exception as exc:
    print('There was a problem: %s' % (exc))

Done!


* The `iter_content()` method returns “chunks” of the content on each iteration through the loop. Each chunk is of the bytes data type, and you get to specify how many bytes each chunk will contain. One hundred thousand bytes is generally a good size, so pass `100000` as the argument to `iter_content()`.

* The file `RomeoAndJuliet.txt` will now exist in the current working directory. Note that while the filename on the website was `rj.txt`, the file on your hard drive has a different filename. The `requests` module simply handles downloading the contents of web pages. Once the page is downloaded, it is simply data in your program. Even if you were to lose your Internet connection after downloading the web page, all the page data would still be on your computer.

* The `write()` method returns the number of bytes written to the file. For this download, the first chunk holds 100,000 bytes and the remaining part of the file needs only 78,981 bytes.

* To review, here’s the complete process for downloading and saving a file:
    1. Call `requests.get()` to download the file.
    2. Call `open()` with `'wb'` to create a new file in write binary mode.
    3. Loop over the Response object’s `iter_content()` method.
    4. Call `write()` on each iteration to write the content to the file.
    5. Call `close()` to close the file.
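
The write loop from these steps can be sketched without a network connection by feeding it an in-memory stand-in for `res.iter_content(100000)`. The helper name `save_chunks`, the dummy data, and the temporary file name are assumptions for illustration; the `with` statement is used here so the file is closed automatically instead of calling `close()` by hand.

```python
import os
import tempfile

def save_chunks(chunks, path):
    """Write an iterable of bytes objects to `path`; return bytes written per chunk."""
    written = []
    # The with statement closes the file even if a write fails.
    with open(path, 'wb') as out_file:
        for chunk in chunks:
            written.append(out_file.write(chunk))
    return written

# Stand-in for res.iter_content(100000): 178,981 bytes split into 100,000-byte pieces.
data = b'x' * 178981
fake_chunks = [data[i:i + 100000] for i in range(0, len(data), 100000)]

path = os.path.join(tempfile.gettempdir(), 'chunk_demo.bin')
print(save_chunks(fake_chunks, path))  # [100000, 78981]
os.remove(path)
```

With a real download, `save_chunks(res.iter_content(100000), 'RomeoAndJuliet.txt')` would take the place of the explicit loop.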

## Parsing HTML with the BeautifulSoup Module

In [18]:
# The examples below use the "lxml" parser, which is a separate package.
!pip install beautifulsoup4 lxml



In [1]:
import bs4
import requests

In [2]:
# Creating a BeautifulSoup Object from HTML

#Reading from a URL
res = requests.get('http://nostarch.com')
try:
    res.raise_for_status()
    noStarchSoup = bs4.BeautifulSoup(res.text, "lxml")
    print(type(noStarchSoup))
    print(noStarchSoup)
except Exception as exc:
    print('There was a problem: %s' % (exc))

<class 'bs4.BeautifulSoup'>
<!DOCTYPE html>
<html dir="ltr" lang="en">
<head>
<link href="https://www.w3.org/1999/xhtml/vocab" rel="profile"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<link href="https://nostarch.com/sites/default/files/favicon.ico" rel="shortcut icon" type="image/vnd.microsoft.icon"/>
<meta content="Drupal 7 (http://drupal.org)" name="generator"/>
<link href="https://nostarch.com/" rel="canonical"/>
<link href="https://nostarch.com/" rel="shortlink"/>
<title>No Starch Press | "The finest in geek entertainment"</title>
<link href="https://nostarch.com/sites/default/files/css/css_lQaZfjVpwP_oGNqdtWCSpJT1EMqXdMiU84ekLLxQnc4.css" media="all" rel="stylesheet" type="text/css"/>
<link href="https://nostarch.com/sites/default/files/css/css_iJE8OMtNhvOQPbQGg8OqRmpr7AhRCfmCisQy8q7fFhk.css" media="all" rel="stylesheet" type="text/css"/>
<link href="https://nostarch.com/sites/defau

In [3]:
#Reading from an HTML file 
exampleFile = open('automate_online-materials/example.html')
exampleSoup = bs4.BeautifulSoup(exampleFile, "lxml")
print(type(exampleSoup))
print(exampleSoup)

<class 'bs4.BeautifulSoup'>
<!-- This is the example.html file. --><html><head><title>The Website Title</title></head>
<body>
<p>Download my <strong>Python</strong> book from <a href="http://inventwithpython.com">my website</a>.</p>
<p class="slogan">Learn Python the easy way!</p>
<p>By <span id="author">Al Sweigart</span></p>
</body></html>


## Finding an Element with the select() Method

In [22]:
print(exampleSoup.select('title'))
print(exampleSoup.select('#author'))
print(exampleSoup.select('p > strong'))

[<title>The Website Title</title>]
[<span id="author">Al Sweigart</span>]
[<strong>Python</strong>]


* `soup.select('div')` - All elements named `<div>`
* `soup.select('#author')` - The element with an `id` attribute of `author`
* `soup.select('.notice')` - All elements that use a `CSS` class attribute named `notice`
* `soup.select('div span')` - All elements named `<span>` that are within an element named `<div>`
* `soup.select('div > span')` - All elements named `<span>` that are directly within an element named `<div>`, with no other element in between
* `soup.select('input[name]')` - All elements named `<input>` that have a `name` attribute with any value
* `soup.select('input[type="button"]')` - All elements named `<input>` that have an attribute named `type` with value `button`
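
To see how the selector patterns listed above differ, the sketch below runs several of them against a small inline HTML fragment. The fragment is made up for illustration, and the built-in `'html.parser'` is used only so the example does not depend on `lxml`.

```python
import bs4

# Made-up HTML fragment for illustration.
HTML = """
<div id="main">
  <span>direct child</span>
  <p class="notice"><span>nested deeper</span></p>
  <input name="q" type="text">
  <input type="button" value="Go">
</div>
"""

soup = bs4.BeautifulSoup(HTML, 'html.parser')

print(len(soup.select('div span')))              # 2: any <span> inside a <div>
print(len(soup.select('div > span')))            # 1: only the direct child
print(len(soup.select('.notice')))               # 1: the <p class="notice">
print(len(soup.select('input[name]')))           # 1: only the first <input> has a name
print(len(soup.select('input[type="button"]')))  # 1: the button input
```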



In [26]:
import bs4
exampleFile = open('automate_online-materials/example.html')
exampleSoup = bs4.BeautifulSoup(exampleFile.read(), "lxml")

In [27]:
elems = exampleSoup.select('#author')
print(type(elems))
print(len(elems))
print(type(elems[0]))
print(elems[0].getText())
print(str(elems[0]))
print(elems[0].attrs)

<class 'list'>
1
<class 'bs4.element.Tag'>
Al Sweigart
<span id="author">Al Sweigart</span>
{'id': 'author'}


* You can also pull all the `<p>` elements from the BeautifulSoup object:

In [28]:
pElems = exampleSoup.select('p')
print(str(pElems[0]))
print(pElems[0].getText())
print(str(pElems[1]))
print(pElems[1].getText())
print(str(pElems[2]))
print(pElems[2].getText())

<p>Download my <strong>Python</strong> book from <a href="http://inventwithpython.com">my website</a>.</p>
Download my Python book from my website.
<p class="slogan">Learn Python the easy way!</p>
Learn Python the easy way!
<p>By <span id="author">Al Sweigart</span></p>
By Al Sweigart


## Getting Data from an Element’s Attributes

In [4]:
import bs4
soup = bs4.BeautifulSoup(open('automate_online-materials/example.html'), "lxml")
spanElem = soup.select('span')[0]
print(str(spanElem))
print(spanElem.get('id'))
print(spanElem.get('some_nonexistent_addr') is None)
print(spanElem.attrs)

<span id="author">Al Sweigart</span>
author
True
{'id': 'author'}
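
A common use of `get()` is collecting link targets. The sketch below is an illustration on a made-up HTML string (loosely based on `example.html`, with one extra `<a>` that has no `href`); it shows the optional default argument of `get()` and the `a[href]` selector, which skips elements that lack the attribute.

```python
import bs4

# Made-up HTML; the second <a> deliberately has no href attribute.
HTML = ('<p>Download my <strong>Python</strong> book from '
        '<a href="http://inventwithpython.com">my website</a>.</p>'
        '<p>By <span id="author">Al Sweigart</span> <a>no href here</a></p>')

soup = bs4.BeautifulSoup(HTML, 'html.parser')

# get() accepts a default that is returned when the attribute is absent.
for link in soup.select('a'):
    print(link.getText(), '->', link.get('href', '(no href)'))

# Selecting on the attribute skips elements that lack it.
hrefs = [a['href'] for a in soup.select('a[href]')]
print(hrefs)
```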


## Project: “I’m Feeling Lucky” Google Search

In [8]:
import webbrowser
import sys
import requests
import bs4

user_input = input('Input search keyword: ')

if len(user_input) > 0:
    try:
        # params= lets requests URL-encode the keyword.
        res = requests.get('https://www.google.com/search', params={'q': user_input})
        res.raise_for_status()
        soup = bs4.BeautifulSoup(res.text, "lxml")
        # Open each search result link in the browser.
        links = soup.select('.r a')
        for link in links:
            currLink = 'https://www.google.com' + link.get('href')
            webbrowser.open(currLink)
    except Exception as exc:
        print('There was a problem: %s' % (exc))

Input search keyword: test


## Project: Downloading All XKCD Comics

In [28]:
import os
import urllib.request

import requests
import bs4

user_input = input('Number of images you need: ')

# Store comics in ./xkcd.
os.makedirs('xkcd', exist_ok=True)

# Start at the front page, then follow the Prev button.
page_url = 'https://xkcd.com'

for x in range(int(user_input)):
    try:
        # Download the page.
        res = requests.get(page_url)
        res.raise_for_status()
        print("Downloading page: " + res.url)
        soup = bs4.BeautifulSoup(res.text, "lxml")

        # Find the URL of the comic image.
        images = soup.select('#comic img')
        if images == []:
            print("Couldn't find an image")
        else:
            image_url = 'https:' + images[0].get('src')
            print("Downloading image: " + image_url)

            # Save the image to ./xkcd.
            name = os.path.basename(image_url)
            urllib.request.urlretrieve(image_url, os.path.join('xkcd', name))

        # Get the Prev button's URL for the next iteration.
        prev_link = soup.select('a[rel="prev"]')[0]
        page_url = 'https://xkcd.com' + prev_link.get('href')

    except Exception as exc:
        print('There was a problem: %s' % (exc))


Number of images you need: 5
Downloading page: https://xkcd.com/
Downloading image: https://imgs.xkcd.com/comics/heists_and_escapes.png
Downloading page: https://xkcd.com/2144/
Downloading image: https://imgs.xkcd.com/comics/adjusting_a_chair.png
Downloading page: https://xkcd.com/2143/
Downloading image: https://imgs.xkcd.com/comics/disk_usage.png
Downloading page: https://xkcd.com/2142/
Downloading image: https://imgs.xkcd.com/comics/dangerous_fields.png
Downloading page: https://xkcd.com/2141/
Downloading image: https://imgs.xkcd.com/comics/ui_vs_ux.png
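
The script builds absolute URLs by prepending `'https:'` or `'https://xkcd.com'` to what it finds in the page. An alternative sketch is to let `urllib.parse.urljoin()` resolve the links against the page URL. The `src` and `href` values below are assumed to look like the ones the script encounters (scheme-relative and root-relative, respectively).

```python
import posixpath
from urllib.parse import urljoin, urlparse

PAGE_URL = 'https://xkcd.com/2144/'

# urljoin() resolves scheme-relative ('//host/path') and root-relative ('/path') links.
image_url = urljoin(PAGE_URL, '//imgs.xkcd.com/comics/adjusting_a_chair.png')
prev_url = urljoin(PAGE_URL, '/2143/')

# The last path component makes a reasonable local filename.
filename = posixpath.basename(urlparse(image_url).path)

print(image_url)  # https://imgs.xkcd.com/comics/adjusting_a_chair.png
print(prev_url)   # https://xkcd.com/2143/
print(filename)   # adjusting_a_chair.png
```

This also keeps working if a page happens to use full absolute URLs, because `urljoin()` returns those unchanged.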


# Controlling the Browser with the selenium Module

## Starting a Selenium-Controlled Browser

In [30]:
!pip install selenium

Collecting selenium
  Downloading https://files.pythonhosted.org/packages/80/d6/4294f0b4bce4de0abf13e17190289f9d0613b0a44e5dd6a7f5ca98459853/selenium-3.141.0-py2.py3-none-any.whl (904kB)
Installing collected packages: selenium
Successfully installed selenium-3.141.0


In [1]:
# Debug geckodriver error: https://stackoverflow.com/questions/40208051/selenium-using-python-geckodriver-executable-needs-to-be-in-path

from selenium import webdriver

browser = webdriver.Firefox()
print(type(browser))
browser.get('http://inventwithpython.com')

<class 'selenium.webdriver.firefox.webdriver.WebDriver'>


## Finding Elements on the Page

Method name | WebElement object/list returned
--- | ---
`browser.find_element_by_class_name(name)`<br>`browser.find_elements_by_class_name(name)` | Elements that use the CSS class `name`
`browser.find_element_by_css_selector(selector)`<br>`browser.find_elements_by_css_selector(selector)` | Elements that match the CSS `selector`
`browser.find_element_by_id(id)`<br>`browser.find_elements_by_id(id)` | Elements with a matching `id` attribute value
`browser.find_element_by_link_text(text)`<br>`browser.find_elements_by_link_text(text)` | `<a>` elements that completely match the `text` provided
`browser.find_element_by_partial_link_text(text)`<br>`browser.find_elements_by_partial_link_text(text)` | `<a>` elements that contain the `text` provided
`browser.find_element_by_name(name)`<br>`browser.find_elements_by_name(name)` | Elements with a matching `name` attribute value
`browser.find_element_by_tag_name(name)`<br>`browser.find_elements_by_tag_name(name)` | Elements with a matching tag `name` (case insensitive; an `<a>` element is matched by 'a' and 'A')

Attribute or method | Description
--- | ---
`tag_name` | The tag name, such as 'a' for an `<a>` element
`get_attribute(name)` | The value for the element’s `name` attribute
`text` | The text within the element, such as 'hello' in `<span>hello</span>`
`clear()` | For text field or text area elements, clears the text typed into it
`is_displayed()` | Returns `True` if the element is visible; otherwise returns `False`
`is_enabled()` | For input elements, returns `True` if the element is enabled; otherwise returns `False`
`is_selected()` | For checkbox or radio button elements, returns `True` if the element is selected; otherwise returns `False`
`location` | A dictionary with keys `'x'` and `'y'` for the position of the element in the page

In [3]:
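from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

browser = webdriver.Firefox()
browser.get('http://inventwithpython.com')
try:
    elem = browser.find_element_by_class_name('cover-thumb')
    print('Found <%s> element with that class name!' % (elem.tag_name))
except NoSuchElementException:
    # Raised by the find_element_* methods when nothing matches.
    print('Was not able to find an element with that class name.')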

Found <img> element with that class name!
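
Once an element has been found, the attributes and methods from the table above can be read off it. The sketch below gathers a few of them into a dictionary. The helper name `summarize_element` is hypothetical, and a `SimpleNamespace` stand-in with the same attribute names is used so the example runs without launching a browser.

```python
from types import SimpleNamespace

def summarize_element(elem):
    """Collect a few WebElement properties into a plain dictionary."""
    return {
        'tag': elem.tag_name,
        'text': elem.text,
        'href': elem.get_attribute('href'),
        'visible': elem.is_displayed(),
        'position': (elem.location['x'], elem.location['y']),
    }

# Stand-in object (not a real WebElement) that mimics the attributes used above.
fake_elem = SimpleNamespace(
    tag_name='a',
    text='my website',
    get_attribute=lambda name: {'href': 'http://inventwithpython.com'}.get(name),
    is_displayed=lambda: True,
    location={'x': 10, 'y': 20},
)

print(summarize_element(fake_elem))
```

In a live session, the object returned by one of the `find_element_*` methods would be passed in place of `fake_elem`.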


## Clicking the Page

In [7]:
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

browser = webdriver.Firefox()
browser.get('http://inventwithpython.com')
try:
    linkElem = browser.find_element_by_link_text('Coding with Minecraft')
    print('Found <%s> element with that link text!' % (linkElem.tag_name))
    print(type(linkElem))
    linkElem.click()
except NoSuchElementException:
    print('Was not able to find an element with that link text.')

Found <a> element with that link text!
<class 'selenium.webdriver.firefox.webelement.FirefoxWebElement'>
<class 'selenium.webdriver.firefox.webelement.FirefoxWebElement'>


## Filling Out and Submitting Forms

In [39]:
from selenium import webdriver
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

browser = webdriver.Firefox()
wait = WebDriverWait(browser, 10)
browser.get('https://kametsu.com/login')
try:
    # 'username' and 'password' are placeholders for real credentials.
    usernameElem = browser.find_element_by_id('auth')
    usernameElem.send_keys('username')
    passwordElem = browser.find_element_by_id('password')
    passwordElem.send_keys('password')
    submitElem = browser.find_element_by_id('elSignIn_submit')
    submitElem.click()
except Exception:
    print('Was not able to find an element with that id.')
finally:
    # Always close the browser, even if a lookup failed.
    browser.quit()
    
# Support: https://stackoverflow.com/questions/40002826/wait-for-page-redirect-selenium-webdriver-python

## Sending Special Keys

Attributes | Meanings
--- | ---
`Keys.DOWN, Keys.UP, Keys.LEFT, Keys.RIGHT` | The keyboard arrow keys
`Keys.ENTER, Keys.RETURN` | The ENTER and RETURN keys
`Keys.HOME, Keys.END, Keys.PAGE_DOWN, Keys.PAGE_UP` | The home, end, pagedown, and pageup keys
`Keys.ESCAPE, Keys.BACK_SPACE, Keys.DELETE` | The ESC, BACKSPACE, and DELETE keys
`Keys.F1, Keys.F2,..., Keys.F12` | The F1 to F12 keys at the top of the keyboard
`Keys.TAB` | The TAB key

In [42]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

browser = webdriver.Firefox()
browser.get('http://nostarch.com')
try:
    htmlElem = browser.find_element_by_tag_name('html')
    htmlElem.send_keys(Keys.END)     # scrolls to bottom
    #htmlElem.send_keys(Keys.HOME)    # scrolls to top
except Exception:
    print('Was not able to find an element with that tag name.')
finally:
    browser.quit()

## Clicking Browser Buttons

* `browser.back()` - Clicks the Back button.
* `browser.forward()` - Clicks the Forward button.
* `browser.refresh()` - Clicks the Refresh/Reload button.
* `browser.quit()` - Clicks the Close Window button.

More info: http://selenium-python.readthedocs.org/