# Basics of web scraping

How do you open an HTML document in Python?

In [1]:
from urllib.request import urlopen

In [7]:
html = urlopen('http://pythonscraping.com/pages/page1.html')
print(html.read())

b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n</div>\n</body>\n</html>\n'
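The `b'...'` prefix in the output shows that `read()` returns bytes, not a string, which is why the newlines appear as literal `\n`. A minimal sketch of decoding those bytes follows. It uses a shortened, hand-written copy of the response and assumes the page is UTF-8 encoded.

```python
# Shortened stand-in for the bytes returned by html.read() (assumed UTF-8)
raw = b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n</html>\n'

# Decoding turns the bytes into a str, so newlines print as real line breaks
text = raw.decode('utf-8')
print(text)
```

With a real response, the declared charset can usually be read with `html.headers.get_content_charset()`, which returns `None` when the server does not declare one.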


We retrieved the HTML document, but the raw bytes are not very readable. This is where BeautifulSoup comes into the picture.

In [6]:
from bs4 import BeautifulSoup

In [8]:
html = urlopen('http://www.pythonscraping.com/pages/page1.html')
bs = BeautifulSoup(html.read(), 'html.parser')
print(bs.h1)

<h1>An Interesting Title</h1>


In [11]:
print(bs.head)

<head>
<title>A Useful Page</title>
</head>
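Attributes such as `bs.h1` and `bs.head` return the whole element, markup included. The sketch below pulls out only the text and shows what happens when a tag is missing. The `SAMPLE_HTML` string is an abbreviated copy of the page, inlined as an assumption so that the example runs without a network request.

```python
from bs4 import BeautifulSoup

# Abbreviated copy of page1.html, inlined so no request is needed
SAMPLE_HTML = """<html>
<head><title>A Useful Page</title></head>
<body>
<h1>An Interesting Title</h1>
<div>Lorem ipsum dolor sit amet.</div>
</body>
</html>"""

bs = BeautifulSoup(SAMPLE_HTML, 'html.parser')

title_text = bs.title.get_text()   # text inside <title>
heading_text = bs.h1.get_text()    # text inside the first <h1>
missing = bs.h2                    # there is no <h2>, so this is None

print(title_text)
print(heading_text)
print(missing)
```

Accessing a tag that does not exist returns `None` instead of raising an error, so any further attribute access on that result needs a guard.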


Now, what if we provide a URL that does not exist?

In [12]:
html = urlopen('http://alinkthatdoesnotexists.edu')

URLError: <urlopen error [Errno 8] nodename nor servname provided, or not known>

Let's handle this error so that it does not crash the program.

In [13]:
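from urllib.request import urlopen
from urllib.error import HTTPError, URLError

try:
    html = urlopen("https://pythonscrapingthisurldoesnotexist.com")
except HTTPError as e:
    # The server was reached but answered with an error status (e.g. 404, 500)
    print('The url has returned an HTTP error')
except URLError as e:
    # The server could not be reached at all (e.g. the host name does not resolve)
    print('The server could not be found')
else:
    # Runs only when no exception was raised
    print(html.read())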


The server could not be found
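The order of the `except` clauses above matters because `HTTPError` is a subclass of `URLError`. If `URLError` were listed first, it would catch HTTP errors as well. A small sketch of that relationship follows. The `describe` helper and the hand-built exception objects are hypothetical and exist only for illustration, so nothing is fetched over the network.

```python
import io
from email.message import Message
from urllib.error import HTTPError, URLError

def describe(error):
    """Return a short label for a urllib error (hypothetical helper)."""
    # Check the more specific class first, as in the except clauses
    if isinstance(error, HTTPError):
        return 'HTTP error'
    if isinstance(error, URLError):
        return 'URL error'
    return 'other error'

# Hand-built exceptions standing in for what urlopen would raise
http_error = HTTPError('http://example.invalid/', 404, 'Not Found',
                       Message(), io.BytesIO())
url_error = URLError('host not found')

print(issubclass(HTTPError, URLError))  # True
print(describe(http_error))
print(describe(url_error))
```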


## Program to get the title of an HTML document

In [4]:
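from urllib.request import urlopen
from urllib.error import URLError
from bs4 import BeautifulSoup

def get_title(url):
    """Return the first <h1> in the page body, or None if it cannot be retrieved."""
    try:
        html = urlopen(url)
    except URLError:
        # URLError also covers HTTPError, which is its subclass
        return None
    try:
        # Use the same HTML parser as earlier; 'xml' would require lxml
        bs_obj = BeautifulSoup(html.read(), 'html.parser')
        title = bs_obj.body.h1
    except AttributeError:
        # bs_obj.body is None when the document has no <body>
        return None
    return title

title = get_title("http://www.pythonscraping.com/pages/page1.html")
if title is not None:
    print(title)
else:
    print('Title was not found')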

    

<h1>An Interesting Title</h1>
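The `AttributeError` guard in `get_title` exists because a missing tag comes back as `None`, and reading an attribute of `None` raises that error. A minimal offline sketch of the same pattern follows. The `heading_text` function and the inline HTML strings are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

def heading_text(markup):
    """Return the text of the first <h1> in <body>, or None if either is missing."""
    bs = BeautifulSoup(markup, 'html.parser')
    try:
        # Raises AttributeError if bs.body or bs.body.h1 is None
        return bs.body.h1.get_text()
    except AttributeError:
        return None

with_heading = '<html><body><h1>An Interesting Title</h1></body></html>'
without_heading = '<html><body><p>No heading here</p></body></html>'
without_body = '<html><head><title>A Useful Page</title></head></html>'

print(heading_text(with_heading))
print(heading_text(without_heading))
print(heading_text(without_body))
```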
