# Web Scraping With Python

In [1]:
from urllib.request import urlopen
html = urlopen("http://pythonscraping.com/pages/page1.html")
print(html.read())

b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n</div>\n</body>\n</html>\n'
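
The `b'...'` prefix in the output above shows that `read()` returns raw bytes rather than text. A minimal sketch of turning such bytes into a string, assuming the page is UTF-8 encoded (the `raw` literal is a shortened stand-in for the response body, not something fetched from the network):

```python
# Shortened stand-in for the bytes returned by html.read()
raw = b"<html>\n<head>\n<title>A Useful Page</title>\n</head>\n</html>\n"

# Assumption: the page is UTF-8 encoded
text = raw.decode("utf-8")

print(type(raw).__name__)    # bytes
print(type(text).__name__)   # str
print(text.splitlines()[2])  # <title>A Useful Page</title>
```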


In [2]:
#This statement looks in the request module (part of the urllib library) and imports only the urlopen function.
from urllib.request import urlopen

# urllib or urllib2?

If you’ve used the urllib2 library in Python 2.x, you might have noticed that things have changed somewhat between urllib2 and urllib. In Python 3.x, urllib2 was renamed urllib and was split into several submodules: urllib.request, urllib.parse, and urllib.error. Although function names mostly remain the same, you might want to note which functions have moved to submodules when using the new urllib.

urllib is a standard Python library (meaning you don’t have to install anything extra to run this example) and contains functions for requesting data across the web, handling cookies, and even changing metadata such as headers and your user agent. 
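
As a rough illustration of two of those submodules, the sketch below splits a URL with `urllib.parse` and attaches a custom header with `urllib.request.Request`. The user agent string `my-scraper/0.1` is a made-up value for illustration, and nothing is sent over the network.

```python
from urllib.parse import urlparse
from urllib.request import Request

url = "http://pythonscraping.com/pages/page1.html"

# urllib.parse splits a URL into its components
parts = urlparse(url)
print(parts.netloc)  # pythonscraping.com
print(parts.path)    # /pages/page1.html

# Request carries custom headers; this User-Agent value is a made-up example
req = Request(url, headers={"User-Agent": "my-scraper/0.1"})

# Request stores header names capitalized as "User-agent"
print(req.get_header("User-agent"))  # my-scraper/0.1

# Nothing has been sent yet; urlopen(req) would perform the request
```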

# An Introduction to BeautifulSoup

“Beautiful Soup, so rich and green,
Waiting in a hot tureen!
Who for such dainties would not stoop?
Soup of the evening, beautiful Soup!”

The BeautifulSoup library was named after a Lewis Carroll poem of the same name in Alice’s Adventures in Wonderland. In the story, this poem is sung by a character called the Mock Turtle (itself a pun on the popular Victorian dish Mock Turtle Soup, made not of turtle but of cow). Like its Wonderland namesake, BeautifulSoup tries to make sense of the nonsensical; it helps format and organize the messy web by fixing bad HTML and presenting us with easily traversable Python objects representing XML structures.
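
To make "fixing bad HTML" concrete, here is a small sketch that feeds deliberately sloppy markup to BeautifulSoup. The `messy` string is invented for illustration, and the example assumes Python's built-in `html.parser`; other parsers may repair the same markup slightly differently.

```python
from bs4 import BeautifulSoup

# Deliberately sloppy markup: <p> and <b> are never closed
messy = "<html><body><h1>Title</h1><p>First <b>bold text</body></html>"

# "html.parser" is the parser that ships with Python
soup = BeautifulSoup(messy, "html.parser")

print(soup.h1.get_text())  # Title
print(soup.b.get_text())   # bold text

# The serialized tree includes the closing tags the source left out
print(soup.prettify())
```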

# Installing BeautifulSoup

Once the Python package manager pip is installed, run the following:
$ pip install beautifulsoup4
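
The package is installed as `beautifulsoup4` but imported as `bs4`. One way to confirm that the install worked, sketched here with the standard library only, is to ask Python whether it can locate the module:

```python
import importlib.util

# find_spec returns None when the module cannot be located
installed = importlib.util.find_spec("bs4") is not None
print("bs4 installed:", installed)
```

If this prints `False`, the package was probably installed for a different Python interpreter than the one running the notebook.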

# Running BeautifulSoup

The most commonly used object in the BeautifulSoup library is, appropriately, the
BeautifulSoup object. 

In [3]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com/pages/page1.html")
bsObj = BeautifulSoup(html.read(), "html.parser")
print(bsObj.h1)

<h1>An Interesting Title</h1>






As in the example before, we are importing the urlopen function and calling html.read() in order to get the HTML content of the page. This HTML content is then transformed into a BeautifulSoup object, with the following structure:

In [None]:
#html → <html><head>...</head><body>...</body></html>
#head → <head><title>A Useful Page</title></head>
#title → <title>A Useful Page</title>
#body → <body><h1>An Int...</h1><div>Lorem ip...</div></body>
#h1 → <h1>An Interesting Title</h1>
#div → <div>Lorem Ipsum dolor...</div>
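
Because the tags nest this way, the same `<h1>` can be reached along several paths. The sketch below uses an inline copy of the page (abbreviated, and parsed with the built-in `html.parser` as an assumption) so that it runs without a network connection:

```python
from bs4 import BeautifulSoup

# Abbreviated inline copy of the example page
page = """<html>
<head><title>A Useful Page</title></head>
<body>
<h1>An Interesting Title</h1>
<div>Lorem ipsum dolor...</div>
</body>
</html>"""

bsObj = BeautifulSoup(page, "html.parser")

# Each of these walks a different path to the same <h1> tag
print(bsObj.html.body.h1)  # <h1>An Interesting Title</h1>
print(bsObj.body.h1)       # <h1>An Interesting Title</h1>
print(bsObj.h1)            # <h1>An Interesting Title</h1>

# Nested tags are reached the same way
print(bsObj.head.title.get_text())  # A Useful Page
```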

# Connecting Reliably

Let’s take a look at the first line of our scraper, after the import statements, and figure out how to handle any exceptions this might throw: html = urlopen("http://www.pythonscraping.com/pages/page1.html")

There are two main things that can go wrong in this line:
• The page is not found on the server (or there was some error in retrieving it)
• The server is not found

In the first situation, an HTTP error will be returned. This HTTP error may be “404 Page Not Found,” “500 Internal Server Error,” etc. In all of these cases, the urlopen function will throw the generic exception “HTTPError.”

In [None]:
#We can handle this exception in the following way:
from urllib.request import urlopen
from urllib.error import HTTPError

try:
    html = urlopen("http://www.pythonscraping.com/pages/page1.html")
except HTTPError as e:
    print(e)
    #return None, break, or do some other "Plan B"
else:
    #program continues. Note: If you return or break in the
    #exception catch, you do not need to use the "else" statement
    pass

#If an HTTP error code is returned, the program now prints the error, and does not execute the rest of the program under the else statement.

In [None]:
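#If the server is not found at all (if, say, http://www.pythonscraping.com was down, or the URL was mistyped), urlopen does not return None; it raises a URLError.
#HTTPError is a subclass of URLError, so we catch HTTPError first to tell the two cases apart:
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

try:
    html = urlopen("http://www.pythonscraping.com/pages/page1.html")
except HTTPError as e:
    print(e)
except URLError as e:
    print("The server could not be found")
else:
    pass  #program continues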
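
Both failure modes can be exercised without touching the network. In current Python 3 an unreachable server surfaces as `URLError`, while an error status from a reachable server surfaces as `HTTPError`. In the sketch below, `fetch`, its `opener` parameter, `missing_page`, and `missing_server` are hypothetical names introduced only to simulate the two errors; they are not part of urllib.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

def fetch(url, opener=urlopen):
    """Return the response, or None if the page or server is unreachable.

    The opener parameter is hypothetical; it exists only so the error
    paths can be simulated without a network connection.
    """
    try:
        return opener(url)
    except HTTPError as e:
        # The server answered, but with an error status such as 404 or 500
        print("HTTP error:", e.code)
    except URLError as e:
        # No usable answer at all: DNS failure, refused connection, etc.
        print("Server not reached:", e.reason)
    return None

# Stand-ins for urlopen that simulate the two failure modes
def missing_page(url):
    raise HTTPError(url, 404, "Not Found", None, None)

def missing_server(url):
    raise URLError("name or service not known")

print(fetch("http://example.invalid/page", opener=missing_page))    # None
print(fetch("http://example.invalid/page", opener=missing_server))  # None
```

`HTTPError` has to be caught before `URLError` because it is a subclass; with the order reversed, the `HTTPError` branch would never run.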

Of course, if the page is retrieved successfully from the server, there is still the issue of the content on the page not quite being what we expected. Every time you access a tag in a BeautifulSoup object, it’s smart to add a check to make sure the tag actually exists. If you attempt to access a tag that does not exist, BeautifulSoup will return a None object. The problem is, attempting to access a tag on a None object itself will result in an AttributeError being thrown.

In [None]:
#The following line (where nonExistentTag is a made-up tag, not the name of a real BeautifulSoup function):
print(bsObj.nonExistentTag)  # returns a None object. 

This object is perfectly reasonable to handle and check for. The trouble comes if you don’t check for it, but instead go on and try to call some other function on the None object, as illustrated in the following:

In [None]:
print(bsObj.nonExistentTag.someTag)

#which raises the exception:
#AttributeError: 'NoneType' object has no attribute 'someTag'

#So how can we guard against these two situations? The easiest way is to explicitly check for both situations:
try:
    badContent = bsObj.nonExistentTag.anotherTag
except AttributeError as e:
    print("Tag was not found")
else:
    if badContent is None:
        print("Tag was not found")
    else:
        print(badContent)
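
When several tags are chained, the same check can be folded into a small helper. The function `find_nested` below is a hypothetical convenience, not part of BeautifulSoup; it is a sketch that stops and returns `None` as soon as any tag in the chain is missing, so no `AttributeError` can occur.

```python
from bs4 import BeautifulSoup

def find_nested(tag, *names):
    """Follow a chain of tag names; return None as soon as one is missing.

    Hypothetical helper for illustration, not a BeautifulSoup function.
    """
    for name in names:
        if tag is None:
            return None
        tag = tag.find(name)
    return tag

soup = BeautifulSoup("<body><h1>An Interesting Title</h1></body>", "html.parser")

print(find_nested(soup, "body", "h1"))                # <h1>An Interesting Title</h1>
print(find_nested(soup, "nonExistentTag", "someTag"))  # None
```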

In [None]:
#This code, for example, is our same scraper written in a slightly different way:
from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup

def getTitle(url):
    try:
        html = urlopen(url)
    except (HTTPError, URLError) as e:
        #The page or the server could not be reached
        return None
    try:
        bsObj = BeautifulSoup(html.read(), "html.parser")
        title = bsObj.body.h1
    except AttributeError as e:
        #bsObj.body was None, so .h1 could not be accessed
        return None
    return title

title = getTitle("http://www.pythonscraping.com/pages/page1.html")
if title is None:
    print("Title could not be found")
else:
    print(title)