# Part 1: Building Scrapers

## Your first Web Scraper

For Johnny and Sep: http://www.wefeelfine.org/wefeelfine_mac.html

For code and examples by the author herself: https://github.com/REMitchell/python-scraping

What will be done?
• Retrieving HTML data from a domain name
• Parsing that data for target information
• Storing the target information
• Optionally, moving to another page to repeat the process

Let's take some web pages apart.

### Connecting

In [1]:
from urllib.request import urlopen
html = urlopen("http://pythonscraping.com/pages/page1.html") 
print(html.read())

b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n</div>\n</body>\n</html>\n'


In [2]:
from urllib.request import urlopen # https://docs.python.org/3/library/urllib.html

### An Introduction to BeautifulSoup
“Beautiful Soup, so rich and green,  
Waiting in a hot tureen!  
Who for such dainties would not stoop?  
Soup of the evening, beautiful Soup!”  

In [3]:
html2 = urlopen("http://www.thuisbezorgd.nl")
print(html2)

<http.client.HTTPResponse object at 0x1051ef438>


In [4]:
from bs4 import BeautifulSoup

In [5]:
html = urlopen("http://pythonscraping.com/pages/page1.html") 
bsObj = BeautifulSoup(html.read(), 'html5lib')
print(bsObj.h1)

<h1>An Interesting Title</h1>


### Connecting Reliably
A lot can go wrong while scraping the web, so it's wise to plan for the most common failures when requesting a page:
* The page is not found on the server
* The server is not found
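
The two failure modes above map onto two exceptions in `urllib.error`: `HTTPError` when the server answers with an error status, and `URLError` when no server can be reached at all. Below is a minimal sketch of handling both; the helper name `fetch` and the 10-second timeout are assumptions made for illustration, not part of the original notebook.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch(url):
    """Return the response body as bytes, or None if the request failed.

    `fetch` is a hypothetical helper name used only for this sketch.
    """
    try:
        # The timeout value is an assumption; pick one that suits your use.
        with urlopen(url, timeout=10) as response:
            return response.read()
    except HTTPError as e:
        # The server answered, but with an error status such as 404 or 500.
        print("HTTP error:", e.code)
    except URLError as e:
        # No usable response: bad host name, refused connection, missing file.
        print("Could not reach the resource:", e.reason)
    return None
```

`HTTPError` is a subclass of `URLError`, so it has to be caught first; otherwise the more general handler would swallow it. A caller can then test `if fetch(url) is None:` before parsing.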

In [None]:
from urllib.error import HTTPError

try:
    html = urlopen("http://www.pythonscraping.com/pages/page1.html")
except HTTPError as e:
    print(e)
    html = None
    # return None, break, or do some other "Plan B"
else:
    # Program continues. Note: if you return or break in the
    # except block, you do not need the "else" clause.
    pass

In [None]:
if html is None:
    print("URL is not found")
else:
    # Program continues
    pass

In [None]:
print(bsObj.nonExistentTag)  # a missing tag comes back as None

print(bsObj.nonExistentTag.someTag)  # raises AttributeError: None has no attribute 'someTag'

In [None]:
try:
    badContent = bsObj.nonExistingTag.anotherTag
except AttributeError as e:
    print("Tag was not found")
else:
    if badContent is None:
        print("Tag was not found")
    else:
        print(badContent)
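
The missing-tag checks above can be folded into one small function. This is a minimal sketch under stated assumptions: the name `get_title` is hypothetical, and it uses Python's built-in `"html.parser"` backend rather than the `html5lib` parser from the earlier cell, so that it runs without an extra install.

```python
from bs4 import BeautifulSoup


def get_title(html):
    """Return the text of the first <h1> inside <body>, or None if missing.

    `get_title` is a hypothetical helper name used only for this sketch.
    """
    soup = BeautifulSoup(html, "html.parser")
    try:
        title = soup.body.h1
    except AttributeError:
        # soup.body was None, so .h1 could not be looked up on it.
        return None
    # soup.body exists, but it may not contain an <h1>.
    return None if title is None else title.get_text()
```

Calling `get_title(html.read())` on the response from the earlier cell should then give either the heading text or `None`, so the caller only has one case to check.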