## Using `BeautifulSoup` and `urllib` 

   * `urllib` is a standard Python library that provides functions for requesting data across the web, handling cookies, and even changing metadata such as headers and your user agent. Python documentation for the library: https://docs.python.org/3/library/urllib.html
 
 
   * `BeautifulSoup` helps format and organize the messy web by fixing bad HTML and presenting us with easily traversable Python objects representing XML structures. Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
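
As a minimal sketch of the "changing headers and your user agent" point, a `urllib.request.Request` object can carry custom headers. The user-agent string below is a hypothetical value chosen only for illustration, and the request is built but not sent, so the example runs without network access:

```python
from urllib.request import Request

# Hypothetical user-agent string, chosen only for illustration.
request = Request(
    'http://pythonscraping.com/pages/page1.html',
    headers={'User-Agent': 'my-scraper/0.1'},
)

# Request stores header names capitalized, so the lookup key is 'User-agent'.
print(request.get_header('User-agent'))
```

To actually send it, pass the object to `urlopen(request)` in place of the URL string.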

### Simple example with `urllib`

In [1]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://pythonscraping.com/pages/page1.html')
print(html.read())

b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n</div>\n</body>\n</html>\n'


### Simple example with `urllib` and `BeautifulSoup` 

In [2]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page1.html')
bs = BeautifulSoup(html.read(), 'html.parser')
print(bs.h1)

<h1>An Interesting Title</h1>


## Connecting Reliably and Handling Exceptions

### HTTP and URL Errors
   * If the page is not found on the server, or the server fails while handling the request, an HTTP error is returned, such as **"404 Not Found"** or **"500 Internal Server Error"**. In these cases, `urlopen` raises the generic exception `HTTPError`, which you can catch and handle in your code.


   * If the server cannot be reached at all (the server is down, or the URL is mistyped), `urlopen` raises a `URLError`. Because the remote server is responsible for returning HTTP status codes, no `HTTPError` can be raised in this case, and the more general `URLError` must be caught instead. `HTTPError` is a subclass of `URLError`, so its `except` clause has to come first.

In [3]:
from urllib.error import HTTPError, URLError

try:
    html = urlopen('http://pythonscraping.com/pages/page1.html')
    
except HTTPError as e:
    print(e)
    # return None, break, or fall back to some other "plan B"
    
except URLError as e:
    print('The server could not be found!')
    
else:
    print('It works !!!')
    # program continues. Note: If you return or break in the
    # exception catch, you do not need to use the "else" statement

It works !!!
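
The same pattern can be wrapped in a small helper. This is only a sketch: the `fetch` function and its `opener` parameter are hypothetical names introduced here so the error handling can be exercised without network access. It also prints the `code` and `reason` attributes that `HTTPError` exposes:

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch(url, opener=urlopen):
    """Return the response body as bytes, or None if the request failed.

    `opener` is a hypothetical parameter (defaulting to urlopen) that lets
    the function be tested with a stand-in instead of a real connection.
    """
    try:
        response = opener(url)
    except HTTPError as e:
        # The server answered, but with an error status (404, 500, ...).
        print(f'HTTP error {e.code}: {e.reason}')
        return None
    except URLError as e:
        # No server could be reached at all.
        print(f'Server not reachable: {e.reason}')
        return None
    return response.read()
```

Calling `fetch('http://pythonscraping.com/pages/page1.html')` would return the raw page bytes on success and `None` on either kind of failure, so the caller only has one condition to check.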


### Content Errors (None Tag)

Even if the page is retrieved successfully, its content may not be what we expected. Every time we access a tag in a `BeautifulSoup` object, it’s smart to check that the tag actually exists: accessing a tag that does not exist returns `None`, and accessing a further tag on that `None` object raises an `AttributeError`.

Always check a tag before using it:

In [4]:
try:
    badContent = bs.nonExistingTag.anotherTag

except AttributeError as e:
    print('Tag was not found')

else:
    if badContent is None:
        print('Tag was not found')
    else:
        print(badContent)

Tag was not found
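
A minimal sketch of the same check without a `try` block, using `find`, which returns `None` when no tag matches. The `sample_html` string and the `first_tag_text` helper are hypothetical, introduced here so the example runs without network access:

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, used so the example runs offline.
sample_html = '<html><body><h1>An Interesting Title</h1></body></html>'


def first_tag_text(soup, name):
    """Return the text of the first <name> tag, or None if it is missing."""
    tag = soup.find(name)
    if tag is None:
        return None
    return tag.get_text()


soup = BeautifulSoup(sample_html, 'html.parser')
print(first_tag_text(soup, 'h1'))  # An Interesting Title
print(first_tag_text(soup, 'h2'))  # None
```

Checking each step explicitly is more verbose than catching `AttributeError`, but it makes clear which tag was missing.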




In this example, we create a function getTitle, which returns either the title of the page, or a None object if there was a problem retrieving it. When writing scrapers, it’s important to think about the overall pattern of the code in order to handle exceptions and make it readable at the same time.

In [5]:
def getTitle(url):
    try:
        html = urlopen(url)
    
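    # check possible HTTP error (the server answered with an error status)
    except HTTPError as e:
        print("HTTP ERROR !")
        print(e)
        return None

    # check possible URL error (no server could be reached)
    except URLError as e:
        print("URL ERROR !")
        print(e)
        return None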
    
    # check possible content (missing tag) error
    try:
        bs = BeautifulSoup(html.read(), 'html.parser')
        title = bs.body.h1
        
    except AttributeError as e:
        print("AttributeError ERROR !")
        print(e)        
        return None
    return title

In [6]:
title = getTitle('http://www.pythonscraping.com/pages/page1.html')
print(title)

<h1>An Interesting Title</h1>
