# Web Scraping

## Using the webbrowser Module

In [12]:
!pip install pyperclip

Collecting pyperclip
  Downloading https://files.pythonhosted.org/packages/2d/0f/4eda562dffd085945d57c2d9a5da745cfb5228c02bc90f2c74bbac746243/pyperclip-1.7.0.tar.gz
Installing collected packages: pyperclip
  Running setup.py install for pyperclip: started
    Running setup.py install for pyperclip: finished with status 'done'
Successfully installed pyperclip-1.7.0


In [13]:
import webbrowser
import sys
import pyperclip

In [None]:
# Open the URL in the default browser
webbrowser.open('http://inventwithpython.com/')

In [17]:
user_input = input('Input address: ')

Input address: 


## Handle the Clipboard Content and Launch the Browser

In [18]:
if user_input.strip():
    # Use the address the user typed in.
    address = user_input.strip()
    print(address)
    print(type(address))
else:
    # The input was empty, so assume the address is on the clipboard.
    address = pyperclip.paste()

webbrowser.open('https://www.google.com/maps/place/' + address)

True
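
An address typed in or pasted from the clipboard often contains spaces and commas, which are safer to percent-encode before they are appended to a URL. The following is a minimal sketch, assuming the same Google Maps URL prefix used above; `build_maps_url` is a hypothetical helper name, not part of `webbrowser` or `pyperclip`.

```python
from urllib.parse import quote

MAPS_PREFIX = 'https://www.google.com/maps/place/'


def build_maps_url(address):
    """Return the maps URL for an address (hypothetical helper)."""
    # Strip stray whitespace, then percent-encode every reserved character.
    return MAPS_PREFIX + quote(address.strip(), safe='')


print(build_maps_url('870 Valencia St, San Francisco'))
```

The returned string can be passed to `webbrowser.open()` in place of the plain concatenation.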

## Similar Use Cases

* Open all links on a page in separate browser tabs.
* Open the browser to the URL for your local weather.
* Open several social network sites that you regularly check.
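
The last use case could be sketched roughly as follows. `open_sites` is a hypothetical helper name, and the `opener` parameter is an assumption added so the function can be tried without launching a real browser.

```python
import webbrowser


def open_sites(urls, opener=webbrowser.open_new_tab):
    """Open each URL with opener and return the URLs that failed to open."""
    failed = []
    for url in urls:
        # webbrowser.open_new_tab() returns False if no browser could be launched.
        if not opener(url):
            failed.append(url)
    return failed


# Example call (opens real browser tabs, so it is left commented out):
# open_sites(['http://inventwithpython.com/', 'https://automatetheboringstuff.com/'])
```

Returning the failed URLs, rather than printing them, leaves it to the caller to decide whether to retry or report them.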

## Downloading Files from the Web with the requests Module

In [21]:
import requests

## Downloading a Web Page with the requests.get() Function

In [25]:
res = requests.get('https://automatetheboringstuff.com/files/rj.txt')
print(type(res))
print(res.status_code == requests.codes.ok)
print(len(res.text))
print(res.text[:250]) # prints the first 250 characters

<class 'requests.models.Response'>
True
178978
The Project Gutenberg EBook of Romeo and Juliet, by William Shakespeare

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Projec


## Checking for Errors

In [26]:
res = requests.get('http://inventwithpython.com/page_that_does_not_exist')
try:
    res.raise_for_status()
except Exception as exc:
    print('There was a problem: %s' % (exc))

There was a problem: 404 Client Error: Not Found for url: http://inventwithpython.com/page_that_does_not_exist
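
`raise_for_status()` raises `requests.exceptions.HTTPError`, so the handler can catch that class instead of every `Exception`. The sketch below assumes a hypothetical helper named `check_response`. It builds `Response` objects by hand only so that it runs without a network connection; in real code the response comes from `requests.get()`.

```python
import requests


def check_response(res):
    """Return None if the response is OK, otherwise the error message."""
    try:
        res.raise_for_status()
    except requests.exceptions.HTTPError as exc:
        return str(exc)
    return None


# Hand-built responses, for offline illustration only.
missing = requests.Response()
missing.status_code = 404
ok = requests.Response()
ok.status_code = 200

print(check_response(missing))
print(check_response(ok))
```

Connection failures are raised by `requests.get()` itself as other exception classes, such as `requests.exceptions.ConnectionError`, so they need their own handling.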


## Handling Errors

In [31]:
res = requests.get('https://automatetheboringstuff.com/files/rj.txt')
try:
    res.raise_for_status()
    print(res.text[:250])
except Exception as exc:
    print('There was a problem: %s' % (exc))

The Project Gutenberg EBook of Romeo and Juliet, by William Shakespeare

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Projec


## Saving Downloaded Files to the Hard Drive

In [34]:
res = requests.get('https://automatetheboringstuff.com/files/rj.txt')
try:
    res.raise_for_status()
    playFile = open('RomeoAndJuliet.txt', 'wb')
    for chunk in res.iter_content(100000):
        playFile.write(chunk)
    playFile.close()
    print('Done!')
except Exception as exc:
    print('There was a problem: %s' % (exc))

Done!


* The `iter_content()` method returns “chunks” of the content on each iteration through the loop. Each chunk is a `bytes` object, and you specify how many bytes to read for each chunk. One hundred thousand bytes is generally a reasonable size, so pass `100000` as the argument to `iter_content()`.

* The file `RomeoAndJuliet.txt` will now exist in the current working directory. Note that while the filename on the website was `rj.txt`, the file on your hard drive has a different filename. The `requests` module only handles downloading the contents of web pages; once a page is downloaded, it is just data in your program. Even if you lost your Internet connection after the download, all the page data would still be on your computer.

* The `write()` method returns the number of bytes written to the file. In the previous example, the first chunk held 100,000 bytes, and the remaining part of the file needed only 78,981 bytes.

* To review, here’s the complete process for downloading and saving a file:
    1. Call `requests.get()` to download the file.
    2. Call `open()` with `wb` to create a new file in write binary mode.
    3. Loop over the Response object’s `iter_content()` method.
    4. Call `write()` on each iteration to write the content to the file.
    5. Call `close()` to close the file.
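
The steps above can be sketched as a single helper. `save_response` and `FakeResponse` are hypothetical names introduced for illustration. The helper uses a `with` statement in place of an explicit `close()` call, and the stand-in class mimics only the `iter_content()` method so the sketch runs offline; in practice you would pass the object returned by `requests.get()`.

```python
import os
import tempfile


def save_response(res, path, chunk_size=100000):
    """Write res.iter_content() chunks to path; return bytes written per chunk."""
    written = []
    # The with statement closes the file even if a write fails.
    with open(path, 'wb') as out_file:
        for chunk in res.iter_content(chunk_size):
            written.append(out_file.write(chunk))
    return written


class FakeResponse:
    """Offline stand-in exposing only iter_content() (illustration only)."""

    def __init__(self, data):
        self.data = data

    def iter_content(self, chunk_size):
        for start in range(0, len(self.data), chunk_size):
            yield self.data[start:start + chunk_size]


with tempfile.TemporaryDirectory() as tmp_dir:
    target = os.path.join(tmp_dir, 'RomeoAndJuliet.txt')
    sizes = save_response(FakeResponse(b'x' * 178981), target)
    print(sizes)  # [100000, 78981]
    print(os.path.getsize(target))  # 178981
```

The list of `write()` return values makes the chunk sizes visible, which a bare loop in a notebook cell does not show.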

## Parsing HTML with the BeautifulSoup Module

In [1]:
!pip install beautifulsoup4 lxml



In [3]:
import bs4
import requests

In [8]:
# Creating a BeautifulSoup Object from HTML

#Reading from a URL
res = requests.get('http://nostarch.com')
try:
    res.raise_for_status()
    noStarchSoup = bs4.BeautifulSoup(res.text, "lxml")
    print(type(noStarchSoup))
    print(noStarchSoup)
except Exception as exc:
    print('There was a problem: %s' % (exc))

<class 'bs4.BeautifulSoup'>
<!DOCTYPE html>
<html dir="ltr" lang="en">
<head>
<link href="https://www.w3.org/1999/xhtml/vocab" rel="profile"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<link href="https://nostarch.com/sites/default/files/favicon.ico" rel="shortcut icon" type="image/vnd.microsoft.icon"/>
<meta content="Drupal 7 (http://drupal.org)" name="generator"/>
<link href="https://nostarch.com/" rel="canonical"/>
<link href="https://nostarch.com/" rel="shortlink"/>
<title>No Starch Press | "The finest in geek entertainment"</title>
<link href="https://nostarch.com/sites/default/files/css/css_lQaZfjVpwP_oGNqdtWCSpJT1EMqXdMiU84ekLLxQnc4.css" media="all" rel="stylesheet" type="text/css"/>
<link href="https://nostarch.com/sites/default/files/css/css_iJE8OMtNhvOQPbQGg8OqRmpr7AhRCfmCisQy8q7fFhk.css" media="all" rel="stylesheet" type="text/css"/>
<link href="https://nostarch.com/sites/defau

In [12]:
# Reading from an HTML file
with open('automate_online-materials/example.html') as exampleFile:
    exampleSoup = bs4.BeautifulSoup(exampleFile, "lxml")
print(type(exampleSoup))
print(exampleSoup)

<class 'bs4.BeautifulSoup'>
<!-- This is the example.html file. --><html><head><title>The Website Title</title></head>
<body>
<p>Download my <strong>Python</strong> book from <a href="http://inventwithpython.com">my website</a>.</p>
<p class="slogan">Learn Python the easy way!</p>
<p>By <span id="author">Al Sweigart</span></p>
</body></html>


## Finding an Element with the select() Method

In [18]:
print(exampleSoup.select('title'))
print(exampleSoup.select('#author'))
print(exampleSoup.select('p > strong'))

[<title>The Website Title</title>]
[<span id="author">Al Sweigart</span>]
[<strong>Python</strong>]


* `soup.select('div')` - All elements named `<div>`
* `soup.select('#author')` - The element with an `id` attribute of `author`
* `soup.select('.notice')` - All elements that use a `CSS` class attribute named `notice`
* `soup.select('div span')` - All elements named `<span>` that are within an element named `<div>`
* `soup.select('div > span')` - All elements named `<span>` that are directly within an element named `<div>`, with no other element in between
* `soup.select('input[name]')` - All elements named `<input>` that have a `name` attribute with any value
* `soup.select('input[type="button"]')` - All elements named `<input>` that have an attribute named `type` with value `button`
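
A few of these selector patterns can be tried against the example HTML shown earlier. This is a minimal sketch: the HTML is inlined as a string, and Python's built-in `html.parser` is swapped in for `lxml` so that no extra parser package is assumed.

```python
import bs4

EXAMPLE_HTML = """<html><head><title>The Website Title</title></head>
<body>
<p>Download my <strong>Python</strong> book from <a href="http://inventwithpython.com">my website</a>.</p>
<p class="slogan">Learn Python the easy way!</p>
<p>By <span id="author">Al Sweigart</span></p>
</body></html>"""

soup = bs4.BeautifulSoup(EXAMPLE_HTML, 'html.parser')

# select() always returns a list, even for an id selector.
author = soup.select('#author')[0]
print(author.get_text())   # text inside the matched tag
print(author.attrs)        # the tag's attributes as a dictionary

print(len(soup.select('p')))                    # number of <p> elements
print(soup.select('.slogan')[0].get_text())     # match by CSS class
print(soup.select('a[href]')[0]['href'])        # match by attribute presence
print(soup.select('p > strong')[0].get_text())  # direct child match
```

Because `select()` returns an empty list when nothing matches, checking the list's length before indexing it avoids an `IndexError` on pages whose structure differs from what you expect.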

