# Web Scraping

In those rare, terrifying moments when I’m without Wi-Fi, I realize just how much of what I do on the computer is really what I do on the Internet. Out of sheer habit I’ll find myself trying to check email, read friends’ Twitter feeds, or answer the question, “Did Kurtwood Smith have any major roles before he was in the original 1987 Robocop?”

Since so much work on a computer involves going on the Internet, it’d be great if your programs could get online. Web scraping is the term for using a program to download and process content from the Web. For example, Google runs many web scraping programs to index web pages for its search engine. In this chapter, you will learn about several modules that make it easy to scrape web pages in Python.

* Requests. Downloads files and web pages from the Internet.

* Beautiful Soup. Parses HTML, the format that web pages are written in.

## Downloading Files from the Web with the requests Module

The requests module lets you easily download files from the Web without having to worry about complicated issues such as network errors, connection problems, and data compression. The requests module doesn’t come with Python, so it has to be installed separately. In this case I’ve already done that for you. If you ever want to use requests in your own code, you will have to install it yourself using pip (https://pypi.org/project/pip/).

The requests module was written because Python’s urllib2 module is too complicated to use. In fact, take a permanent marker and black out this entire paragraph. Forget I ever mentioned urllib2. If you need to download things from the Web, just use the requests module.

Let's make sure that it loads correctly:

In [None]:
import requests

If no error messages show up, then the requests module has been successfully installed.

## Downloading a Web Page with the requests.get() Function

The requests.get() function takes a string of a URL ( https://en.wikipedia.org/wiki/URL ) to download. By calling type() on requests.get()’s return value, you can see that it returns a Response object, which contains the response that the web server gave for your request. I’ll explain the Response object in more detail later, but for now, try this out:

In [None]:
import requests
res = requests.get('https://automatetheboringstuff.com/files/rj.txt')
print(type(res))
print(res.status_code == requests.codes.ok)
print(len(res.text))
print(res.text[:250])

The URL goes to a text web page for the entire play of Romeo and Juliet. You can tell that the request for this web page succeeded by checking the `status_code` attribute of the Response object. If it is equal to the value of `requests.codes.ok`, then everything went fine. (Incidentally, the status code for “OK” in the HTTP protocol is 200. You may already be familiar with the 404 status code for “Not Found.”)
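The two status codes mentioned above can be looked up without any network access. The following is a minimal sketch that assumes Python 3 and uses the standard library's `http.HTTPStatus` enumeration rather than requests itself:

```python
from http import HTTPStatus

# Look up the standard reason phrase for the two codes discussed above
for code in (200, 404):
    status = HTTPStatus(code)
    print(status.value, status.phrase)
# 200 OK
# 404 Not Found

# HTTPStatus members compare equal to plain integers
ok_code = HTTPStatus.OK.value
print(ok_code == 200)  # True
```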

If the request succeeded, the downloaded web page is stored as a string in the Response object’s `text` attribute. This attribute holds a large string of the entire play; the call to len(res.text) shows you that it is more than 178,000 characters long. Finally, calling print(res.text[:250]) displays only the first 250 characters.

## Checking for Errors

As you’ve seen, the Response object has a `status_code` attribute that can be checked against requests.codes.ok to see whether the download succeeded. A simpler way to check for success is to call the `raise_for_status()` method on the Response object. This will raise an exception if there was an error downloading the file and will do nothing if the download succeeded. Try it out:

In [None]:
import requests
res = requests.get('http://inventwithpython.com/page_that_does_not_exist')
res.raise_for_status()

The `raise_for_status()` method is a good way to ensure that a program halts if a bad download occurs. This is a good thing: You want your program to stop as soon as some unexpected error happens. If a failed download isn’t a deal breaker for your program, you can wrap the `raise_for_status()` line with try and except statements to handle this error case without crashing.

In [None]:
import requests
res = requests.get('http://inventwithpython.com/page_that_does_not_exist')
try:
    res.raise_for_status()
except Exception as exc:
    print('There was a problem: %s' % (exc))

This `raise_for_status()` method call causes the program to output the following:


`There was a problem: 404 Client Error: Not Found`

Always call `raise_for_status()` after calling `requests.get()`. You want to be sure that the download has actually worked before your program continues.
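If you would rather not catch every possible exception, one option is to catch only `requests.exceptions.HTTPError`, which is the exception `raise_for_status()` raises for a bad status code. The sketch below is only an illustration: the helper name `describe_download` is made up, and the Response objects are built by hand (an assumption made so the example runs without a network connection) instead of coming from `requests.get()`.

```python
import requests

def describe_download(res):
    """Return a short status message for a Response object without crashing."""
    try:
        res.raise_for_status()
    except requests.exceptions.HTTPError as exc:
        return 'There was a problem: %s' % exc
    return 'Download succeeded'

# Hand-built Response objects stand in for real downloads (assumption)
missing = requests.Response()
missing.status_code = 404
missing.reason = 'Not Found'
missing.url = 'http://inventwithpython.com/page_that_does_not_exist'

found = requests.Response()
found.status_code = 200

print(describe_download(missing))
print(describe_download(found))
```

Note that errors such as a lost connection are raised by `requests.get()` itself, so this narrower `except` clause does not cover them.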

## Saving Downloaded Files to the Hard Drive

From here, you can save the web page to a file on your hard drive with the standard open() function and write() method. There are some slight differences, though. First, you must open the file in write binary mode by passing the string 'wb' as the second argument to open(). Even if the page is in plaintext (such as the Romeo and Juliet text you downloaded earlier), you need to write binary data instead of text data in order to maintain the Unicode encoding of the text.
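To see why binary mode keeps the encoding intact, here is a small sketch that does not touch the network. The sample string and the temporary file name are made up for illustration; the encoded bytes play the role of the raw data a download would give you (in requests, that raw data is available as `res.content`).

```python
import os
import tempfile

# Made-up sample text containing a non-ASCII character
text = 'A café in fair Verona'
payload = text.encode('utf-8')  # bytes, like the raw body of a download

# Write the bytes exactly as they are, using write binary mode
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'wb') as out_file:
    out_file.write(payload)

# Read the bytes back and check that nothing was changed on the way
with open(path, 'rb') as in_file:
    round_trip = in_file.read()

print(round_trip == payload)               # True
print(round_trip.decode('utf-8') == text)  # True
```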

<b>Task 1:</b> Try it out. Download 'https://automatetheboringstuff.com/files/rj.txt' and save it to a file called `RomeoAndJuliet.txt`

In [None]:
import requests
url = 'https://automatetheboringstuff.com/files/rj.txt'
# Insert your code here


The file RomeoAndJuliet.txt will now exist in the current working directory. Note that while the filename on the website was rj.txt, the file on your hard drive has a different filename. The requests module simply handles downloading the contents of web pages. Once the page is downloaded, it is simply data in your program. Even if you were to lose your Internet connection after downloading the web page, all the page data would still be on your computer. Or in this case, my server.

There is another, often better way of doing this than writing the entire download to the file at once: call the `iter_content()` method on the Response object and write the data in "chunks":

In [None]:
import requests
res = requests.get('https://automatetheboringstuff.com/files/rj.txt')
res.raise_for_status()
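# The with statement closes the file once every chunk has been written
with open('RomeoAndJuliet2.txt', 'wb') as playFile:
    for chunk in res.iter_content(100000):  # chunks of up to 100,000 bytes
        playFile.write(chunk)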

Each chunk is of the bytes data type, and you get to specify how many bytes each chunk will contain. One hundred thousand bytes is generally a good size.
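The chunking idea can be tried out without downloading anything. The sketch below is a stand-in, not how requests works internally: `iter_chunks` is a hypothetical helper, and in-memory `io.BytesIO` objects take the place of the download and the output file.

```python
import io

def iter_chunks(stream, chunk_size):
    """Yield successive bytes chunks of at most chunk_size bytes."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk

data = b'x' * 250000          # pretend this is a 250,000-byte download
source = io.BytesIO(data)
destination = io.BytesIO()    # pretend this is the file opened with 'wb'

chunk_sizes = []
for chunk in iter_chunks(source, 100000):
    destination.write(chunk)
    chunk_sizes.append(len(chunk))

print(chunk_sizes)  # [100000, 100000, 50000]
```

Only one chunk is held by the loop at a time, which is why this pattern suits large files.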

## Parsing HTML with the BeautifulSoup Module

Beautiful Soup is a module for extracting information from an HTML page (and is much better for this purpose than regular expressions). The BeautifulSoup module’s name is bs4 (for Beautiful Soup, version 4). Like requests, it is not included with Python out of the box. In this case I've made sure to install it for you. If you want to perform HTML parsing in your own code, you will need to install it using `pip` (https://pypi.org/project/pip/).

For this chapter, the Beautiful Soup examples will parse (that is, analyze and identify the parts of) an HTML file on the hard drive.

### Creating a BeautifulSoup Object from HTML

The bs4.BeautifulSoup() function needs to be called with a string containing the HTML it will parse (an open file object works as well). The bs4.BeautifulSoup() function returns a BeautifulSoup object. Try it out:

In [None]:
import bs4
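# Open the HTML file and hand the file object to Beautiful Soup
with open('example.html') as example:
    soup = bs4.BeautifulSoup(example, 'html.parser')  # name the parser explicitly
type(soup)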

Once you have a BeautifulSoup object, you can use its methods to locate specific parts of an HTML document.

### Finding an Element with the select() Method

You can retrieve a web page element from a BeautifulSoup object by calling the select() method and passing a string of a CSS selector for the element you are looking for. Selectors are like regular expressions: They specify a pattern to look for, in this case, in HTML pages instead of general text strings.

Examples of selectors passed to the select() method, and what they match:

* soup.select('div')
    * All elements named `<div>`

* soup.select('#author')
    * The element with an id attribute of author

* soup.select('.notice')
    * All elements that use a CSS class attribute named notice

* soup.select('div span')
    * All elements named `<span>` that are within an element named `<div>`

* soup.select('div > span')
    * All elements named `<span>` that are directly within an element named `<div>`, with no other element in between

* soup.select('input[name]')
    * All elements named `<input>` that have a name attribute with any value

* soup.select('input[type="button"]')
    * All elements named `<input>` that have an attribute named type with value button



The various selector patterns can be combined to make sophisticated matches. For example, soup.select('p #author') will match any element that has an id attribute of author, as long as it is also inside a `<p>` element.
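The selector patterns listed above can be tried on a small HTML string. The markup below is made up for this sketch (it is not the contents of example.html), and the parser is named explicitly as `'html.parser'`, the one that ships with Python.

```python
import bs4

# Hypothetical HTML, written only to exercise the selectors above
html = """
<html><body>
<div><span>direct child</span><p><span>nested deeper</span></p></div>
<p class="notice">By <span id="author">Al Sweigart</span></p>
<input name="q" type="text">
<input type="button" value="Go">
</body></html>
"""
soup = bs4.BeautifulSoup(html, 'html.parser')

print(len(soup.select('div span')))              # 2: both spans inside the div
print(len(soup.select('div > span')))            # 1: only the direct child
print(len(soup.select('.notice')))               # 1
print(len(soup.select('input[name]')))           # 1
print(len(soup.select('input[type="button"]')))  # 1

# Combined selector: the element with id "author" inside a <p> element
author = soup.select('p #author')[0]
print(author.getText())  # Al Sweigart
print(author.attrs)      # {'id': 'author'}
```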

The select() method will return a list of Tag objects, which is how Beautiful Soup represents an HTML element. The list will contain one Tag object for every match in the BeautifulSoup object’s HTML. Tag values can be passed to the str() function to show the HTML tags they represent. Tag values also have an attrs attribute that shows all the HTML attributes of the tag as a dictionary.

In [None]:
import bs4
with open('example.html') as example:
    soup = bs4.BeautifulSoup(example.read(), 'html.parser')
elems = soup.select('#author')  # list of Tag objects with id="author"
print(type(elems))
print(len(elems))
print(type(elems[0]))
print(elems[0].getText())
print(str(elems[0]))
print(elems[0].attrs)

This code will pull the element with id="author" out of our example HTML. We use select('#author') to return a list of all the elements with id="author". We store this list of Tag objects in the variable elems, and len(elems) tells us there is one Tag object in the list; there was one match. Calling getText() on the element returns the element’s text, or inner HTML. The text of an element is the content between the opening and closing tags: in this case, 'Al Sweigart'.

Passing the element to str() returns a string with the starting and closing tags and the element’s text. Finally, attrs gives us a dictionary with the element’s attribute, 'id', and the value of the id attribute, 'author'.

You can also pull all the `<p>` elements from the BeautifulSoup object. 

In [None]:
import bs4
with open('example.html') as example:
    soup = bs4.BeautifulSoup(example.read(), 'html.parser')
pElems = soup.select('p')  # every <p> element in the document
print(str(pElems[0]))

<b>Task 2:</b> Finish the code below and print all the quotes available on the website. (You will need to open the website in a web browser and view the HTML source to figure out how to fetch the quotes.) To open the source, right-click on the page in your browser and select the option to view the page source.

In [None]:
import bs4, requests
url = "http://quotes.toscrape.com/"
res = requests.get(url)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
#Insert your code here


<b>Task 3:</b> Display the name, price, and rating of all books available on the front page of http://books.toscrape.com/

In [None]:
import bs4, requests
url = "http://books.toscrape.com/"
#Insert your code here