# Chapter 1 Your First Web Scraper

## Connecting
Web browsers do us a big favor by decoding data and rendering it as polished interfaces, but to learn web scraping we need to start at the level of the network connection.

Here is a very basic script that uses `urlopen` from the Python standard library module `urllib.request` to fetch the HTML contents of our [example website](http://pythonscraping.com/pages/page1.html).

In [1]:
from urllib.request import urlopen

http_res = urlopen("http://pythonscraping.com/pages/page1.html")
print(http_res.read())

b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n</div>\n</body>\n</html>\n'
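
The `b'...'` prefix in the output shows that `read()` returns raw bytes, not text. A minimal sketch of turning such bytes into a string follows. The helper name `decode_body` and the UTF-8 default are assumptions for illustration, and a short bytes literal stands in for the live response so the snippet runs without a network connection.

```python
def decode_body(raw: bytes, charset: str = "utf-8") -> str:
    """Decode raw response bytes into text.

    Undecodable bytes are replaced instead of raising an error.
    The UTF-8 default is an assumption, not something the page guarantees.
    """
    return raw.decode(charset, errors="replace")


# A shortened stand-in for the bytes returned by http_res.read().
raw = b"<html>\n<head>\n<title>A Useful Page</title>\n</head>\n</html>\n"

text = decode_body(raw)
print(text)
```

With a live response, the charset declared by the server (if any) is available through `http_res.headers.get_content_charset()`, which returns `None` when no charset is declared.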


## An Introduction to BeautifulSoup
Using `urllib`, we can already retrieve the HTML that the website serves. Our example page is simple enough that we could handle every tag by hand. But what about a more complex page that contains titles, panels, tables, and images? For that we need a library that parses the markup for us: `BeautifulSoup`.

Let's use the same example as above:

In [2]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

http_res = urlopen("http://pythonscraping.com/pages/page1.html")
bs = BeautifulSoup(http_res.read(), "html.parser")
print(bs)

<html>
<head>
<title>A Useful Page</title>
</head>
<body>
<h1>An Interesting Title</h1>
<div>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>
</body>
</html>



With the help of `BeautifulSoup`, the output is much easier to read. We can also extract individual parts of the HTML:

In [3]:
print(bs.title)

<title>A Useful Page</title>


In [4]:
# same as bs.html.body.h1/bs.body.h1/bs.html.h1
print(bs.h1)

<h1>An Interesting Title</h1>


In [5]:
print(bs.div)

<div>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>
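
Attributes such as `bs.title` and `bs.h1` return tag objects, which is why the printed output still includes the surrounding markup. As a small sketch, the text inside a tag can be pulled out with `get_text()`. The inline HTML string below is a shortened stand-in for the example page, so the snippet runs without a network connection.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the example page.
sample_html = (
    "<html><head><title>A Useful Page</title></head>"
    "<body><h1>An Interesting Title</h1><div>Lorem ipsum</div></body></html>"
)

bs = BeautifulSoup(sample_html, "html.parser")

tag_name = bs.h1.name            # name of the tag, without angle brackets
heading_text = bs.h1.get_text()  # text content, without the markup
page_title = bs.title.get_text()

print(tag_name)
print(heading_text)
print(page_title)
```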


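In addition to `html.parser`, the `lxml` and `html5lib` parsers are also useful. They are better at parsing "messy" or malformed HTML code. Unlike `html.parser`, both must be installed separately. Because a response body can only be read once, read it into a variable before passing it to a parser:
```
html = http_res.read()
bs = BeautifulSoup(html, "lxml")
bs = BeautifulSoup(html, "html5lib")
```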
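
Since `lxml` and `html5lib` may not be installed on every machine, one option is to pick a parser at run time. The sketch below is only an illustration: `pick_parser` is a hypothetical helper, and the preference order (`lxml`, then `html5lib`, then the built-in `html.parser`) is an assumption, not a recommendation from the library.

```python
from importlib.util import find_spec


def pick_parser(preferred=("lxml", "html5lib")):
    """Return the first installed parser name, else the built-in one.

    The preference order is an assumption made for this example.
    """
    for name in preferred:
        # find_spec returns None when the package is not installed.
        if find_spec(name) is not None:
            return name
    return "html.parser"


parser = pick_parser()
print(parser)
```

The returned name can then be passed as the second argument, as in `BeautifulSoup(html, pick_parser())`.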

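HTTP requests don't always return a successful response; sometimes an error occurs instead. For example:
- If the page is not found on the server, or the server fails while handling the request, an HTTP error status such as 404 (Not Found) or 500 (Internal Server Error) is returned, and `urlopen` raises `HTTPError`.
- If the server itself cannot be found (for example, the domain does not exist), there is no HTTP response and therefore no status code; `urlopen` raises `URLError`.

To handle these cases, we need `HTTPError` and `URLError` from `urllib.error`: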

In [6]:
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError

try:
    http_res = urlopen('https://pythonscrapingthisurldoesnotexist.com')
except HTTPError as e:
    print(e)
except URLError as e:
    print("The server could not be found!")
else:
    print("It Worked!")

The server could not be found!
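
The order of the `except` clauses matters because `HTTPError` is a subclass of `URLError`: if `URLError` were caught first, it would also swallow every `HTTPError`. The sketch below illustrates this without touching the network by constructing the exceptions by hand. The helper `describe_error`, the URL, and the error messages are made up for the example.

```python
import io
from email.message import Message
from urllib.error import HTTPError, URLError


def describe_error(exc):
    """Summarize a urllib error, checking the more specific class first."""
    if isinstance(exc, HTTPError):
        # The server answered, but with an error status code.
        return f"HTTP {exc.code}: {exc.reason}"
    if isinstance(exc, URLError):
        # No HTTP response at all, e.g. the host could not be resolved.
        return f"Connection failed: {exc.reason}"
    raise TypeError("not a urllib error")


# Hand-built exceptions standing in for real failures (hypothetical values).
not_found = HTTPError(
    "http://example.invalid/missing", 404, "Not Found", Message(), io.BytesIO(b"")
)
no_server = URLError("name resolution failed")

print(issubclass(HTTPError, URLError))
print(describe_error(not_found))
print(describe_error(no_server))
```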


The `find()` method in `BeautifulSoup` is handy. Here is an example that finds the first `<h1>` tag. When we search for a tag that doesn't exist, `find()` returns `None`.

In [7]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

http_res = urlopen("http://pythonscraping.com/pages/page1.html")
bs = BeautifulSoup(http_res.read(), "html.parser")

In [8]:
print(bs.find("h1"))

<h1>An Interesting Title</h1>


In [9]:
print(bs.find("notExistedTag"))

None
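
Because `find()` returns `None` for a missing tag, any further attribute access on that result raises `AttributeError`. A minimal sketch of guarding against this is shown below. `first_text` is a hypothetical helper, and the inline HTML is a shortened stand-in for the example page.

```python
from bs4 import BeautifulSoup

SAMPLE_HTML = "<html><body><h1>An Interesting Title</h1></body></html>"


def first_text(soup, tag_name):
    """Return the text of the first matching tag, or None if it is absent."""
    tag = soup.find(tag_name)
    if tag is None:
        return None
    return tag.get_text(strip=True)


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
found = first_text(soup, "h1")
missing = first_text(soup, "notExistedTag")

# Chaining on a missing tag fails, because None has no attribute "h1".
try:
    soup.find("notExistedTag").h1
except AttributeError:
    chained_failed = True
else:
    chained_failed = False

print(found, missing, chained_failed)
```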


Adding error handling to the previous example, we get:

In [10]:
from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup

def getTitle(url):
    try:
        http_res = urlopen(url)
    except (HTTPError, URLError):
        # HTTP error status, or the server could not be reached
        return None
    try:
        bs = BeautifulSoup(http_res.read(), 'html.parser')
        title = bs.body.h1
    except AttributeError:
        # bs.body is None when the page has no <body> tag
        return None
    return title

title = getTitle('http://www.pythonscraping.com/pages/page1.html')
if title is None:
    print('Title could not be found')
else:
    print(title)

<h1>An Interesting Title</h1>
