# *Your First Web Scraper*

## 1. Connecting

**urllib** is a standard Python library that contains functions for requesting data across the web, handling cookies, and more. <br>
*urlopen* opens a remote object across a network and reads it.


In [1]:
from urllib.request import urlopen                                    
html = urlopen("http://pythonscraping.com/pages/page1.html")
print(html.read())

b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n</div>\n</body>\n</html>\n'
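Note that *html.read()* returns bytes (hence the `b'...'` prefix in the output above), not a string. A minimal sketch of turning the response into text follows; the helper name `fetch_text` and the UTF-8 fallback are assumptions made for illustration, not part of the original example.

```python
from urllib.request import urlopen

def fetch_text(url, fallback_charset="utf-8"):
    """Fetch url and return its body as str instead of bytes."""
    with urlopen(url) as response:
        raw = response.read()  # bytes, as in the output above
        # Prefer the charset declared in the Content-Type header, if any
        charset = response.headers.get_content_charset() or fallback_charset
    # errors="replace" avoids a crash on the occasional badly encoded byte
    return raw.decode(charset, errors="replace")
```

Calling `fetch_text("http://pythonscraping.com/pages/page1.html")` should then give the same markup as a regular string, with real line breaks instead of `\n` escapes.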


## 2. An Introduction to BeautifulSoup

**BeautifulSoup** is not part of the Python standard library, so it must be installed separately. <br>
With pip available, the basic method (on Windows as well) is: <br>
*$ pip3 install beautifulsoup4* <br>

In [4]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://pythonscraping.com/pages/page1.html")
bsObj = BeautifulSoup(html.read(), "html.parser")  # name the parser explicitly to avoid a warning
print(bsObj.h1)
print(bsObj.div)

<h1>An Interesting Title</h1>
<div>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>


As in the previous example, calling *html.read()* returns the HTML content of the page. That content is then transformed into a BeautifulSoup object that mirrors the structure of the document. Note that the *h1* tag we extracted was nested two layers deep in the BeautifulSoup object (html -> body -> h1), yet each of the following calls returns the same tag:

In [7]:
print(bsObj.html.body.h1)
print(bsObj.body.h1)
print(bsObj.h1)

<h1>An Interesting Title</h1>
<h1>An Interesting Title</h1>
<h1>An Interesting Title</h1>
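
The equivalence of these lookups can also be checked without a network request by parsing an inline HTML string. This is only a minimal sketch: the names `sample_html` and `soup` are illustrative, and the `"html.parser"` argument selects Python's built-in parser.

```python
from bs4 import BeautifulSoup

sample_html = (
    "<html><head><title>A Useful Page</title></head>"
    "<body><h1>An Interesting Title</h1><div>Lorem ipsum</div></body></html>"
)
soup = BeautifulSoup(sample_html, "html.parser")

# Attribute access returns the first matching tag anywhere below the
# starting point, so the short and the fully qualified paths agree.
full_path = soup.html.body.h1
shortcut = soup.h1

print(full_path == shortcut)  # compares the two tags
print(shortcut.get_text())    # text inside the tag
print(shortcut.parent.name)   # name of the enclosing tag
```

The shortcut is convenient, but it always returns the first match, so the longer path is the safer choice when a page contains several tags with the same name.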


How to handle an HTTP error:

In [None]:
from urllib.request import urlopen
from urllib.error import HTTPError

try:
    html = urlopen("http://pythonscraping.com/pages/page1.html")
except HTTPError as e:
    print(e)
    # return None, break, or do some other 'plan B'
else:
    # program continues. Note: if you return or break in the
    # except block, you do not need the 'else' clause
    pass
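
An *HTTPError* also carries the status code and reason sent by the server, which can help in choosing a 'plan B'. The sketch below assumes a hypothetical helper `fetch_or_none`; its `opener` parameter is likewise an assumption, added only so that the error path can be exercised without a live server.

```python
from urllib.error import HTTPError
from urllib.request import urlopen

def fetch_or_none(url, opener=urlopen):
    """Return the response body as bytes, or None on an HTTP error."""
    try:
        response = opener(url)
    except HTTPError as e:
        # e.code is the numeric status (e.g. 404), e.reason the short message
        print(f"Request for {url} failed: {e.code} {e.reason}")
        return None
    return response.read()
```

Returning None keeps the caller's logic simple: a single `is None` check covers every HTTP failure.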

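If the server is not found, *urlopen* raises a *URLError* rather than returning None:

In [None]:
from urllib.error import URLError
try:
    html = urlopen("http://pythonscraping.com/pages/page1.html")
except URLError as e:
    # raised when the server cannot be reached at all
    print("The server could not be found")
else:
    pass  # program continues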
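
In Python 3, *urlopen* reports an unreachable server by raising *URLError* rather than by returning None. Because *HTTPError* is a subclass of *URLError*, the more specific exception has to be caught first. A minimal sketch, in which `classify_fetch` and its `opener` parameter are hypothetical names used only for illustration:

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

def classify_fetch(url, opener=urlopen):
    """Return 'ok', 'http error', or 'server not found' for url."""
    try:
        opener(url)
    except HTTPError:
        # The server answered, but with an error status (404, 500, ...)
        return "http error"
    except URLError:
        # No usable answer at all: bad hostname, refused connection, ...
        return "server not found"
    return "ok"
```

If the two `except` clauses were swapped, the `URLError` branch would also swallow every HTTP error and the second branch would never run.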

Add a check to make sure the tag actually exists:

In [9]:
print(bsObj.nonExistentTag)  # returns None
print(bsObj.nonExistentTag.someTag)  # raises AttributeError



None


AttributeError: 'NoneType' object has no attribute 'someTag'

Guard against both None and AttributeError:

In [10]:
try:
    badContent = bsObj.nonExistentTag.anotherTag
except AttributeError as e:
    print("Tag was not found")
else:
    if badContent is None:
        print("Tag was not found")
    else:
        print(badContent)

Tag was not found
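
The same guard can be wrapped in a small reusable helper. The function `dig` below is hypothetical (it is not part of BeautifulSoup): it follows a chain of attribute names and stops as soon as one step yields None. Since it relies only on attribute access, this sketch uses a plain stand-in object instead of a parsed page.

```python
from types import SimpleNamespace

def dig(node, *names):
    """Follow node.name1.name2... and return None if any step is missing."""
    for name in names:
        if node is None:
            return None
        # getattr with a default never raises AttributeError
        node = getattr(node, name, None)
    return node

# Stand-in for a parsed document: page.body.h1 exists, nothing else does
page = SimpleNamespace(body=SimpleNamespace(h1="An Interesting Title"))

print(dig(page, "body", "h1"))
print(dig(page, "nonExistentTag", "anotherTag"))
```

With a BeautifulSoup object, `dig(bsObj, "body", "h1")` should behave like `bsObj.body.h1`, because tag lookup there is also plain attribute access.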


### Create a function *getTitle*

In [11]:
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup

def getTitle(url):
    try:
        html = urlopen(url)
    except HTTPError as e:
        return None
    try:
        bsObj = BeautifulSoup(html.read(), "html.parser")
        title = bsObj.body.h1
    except AttributeError as e:
        return None
    return title

title = getTitle("http://pythonscraping.com/pages/page1.html")
if title is None:
    print("Title could not be found")
else:
    print(title)

<h1>An Interesting Title</h1>
