# Web Scraping 101
In Python 3.x, the core standard-library tool for web scraping is **urllib**, a package that collects several modules for working with URLs:
 * **urllib.request** for opening and reading URLs
 * **urllib.error** containing the exceptions raised by urllib.request
 * **urllib.parse** for parsing URLs
 * **urllib.robotparser** for parsing robots.txt files
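
As a quick illustration of `urllib.parse`, the sketch below splits a URL into its parts and resolves a relative link against it. The relative link `page2.html` is a hypothetical example, not something taken from the page itself.

```python
from urllib.parse import urlparse, urljoin

# The example page used later in this notebook.
url = "http://pythonscraping.com/pages/page1.html"

# Split the URL into its components.
parts = urlparse(url)
print(parts.scheme)   # http
print(parts.netloc)   # pythonscraping.com
print(parts.path)     # /pages/page1.html

# Resolve a relative link (hypothetical href) against the page URL.
next_page = urljoin(url, "page2.html")
print(next_page)      # http://pythonscraping.com/pages/page2.html
```

`urljoin` is handy when following links found in a page, since many `href` values are relative.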

## The browser

In [None]:
from urllib.request import urlopen

Set the URL of an HTML page:

In [None]:
URL = "http://pythonscraping.com/pages/page1.html"

Retrieve the page:

In [None]:
html = urlopen(URL)
html.read()

In [None]:
with urlopen(URL) as f:
    print(f.read(300).decode('utf-8'))

## Anticipating http errors
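The web is messy, so it pays to anticipate HTTP errors and handle the resulting exceptions.  
Two common things that can go wrong:
 * The page is not found on the server (404 Not Found), or the server fails while handling the request (500 Internal Server Error); both raise `HTTPError`
 * The server cannot be reached at all (for example, an unknown host name), which raises `URLError` and carries no HTTP status code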

For a list of possible status codes, see: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status

In [None]:
import urllib.request as r

In [None]:
URL2 = 'https://www.womenwhocode.com/singapore'

In [None]:
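import urllib.error as err
try:
    html = urlopen(URL2)
except err.HTTPError as e:
    # the server responded, but with an error status (e.g. 404 or 500)
    print(e)
except err.URLError as e:
    # the server could not be reached at all
    print(e)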
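
The same pattern can be wrapped in a small reusable function. This is a minimal sketch: the name `fetch` and the 10-second default timeout are assumptions made for illustration, not part of urllib.

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

def fetch(url, timeout=10):
    """Return the response body as bytes, or None if the request fails.

    `fetch` is a hypothetical helper; the timeout default is arbitrary.
    """
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read()
    except HTTPError as e:
        # HTTPError is a subclass of URLError, so it must be caught first.
        # The server answered, but with an error status such as 404 or 500.
        print(f"HTTP error {e.code}: {e.reason}")
    except URLError as e:
        # No usable response at all (unknown host, refused connection, ...).
        print(f"Could not reach the server: {e.reason}")
    return None
```

A caller can then write `page = fetch(URL2)` and check `if page is not None:` before parsing.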

## Request headers
Some sites require basic header information, such as a browser User-Agent, before providing page data, in order to deter robots.  
A site's robots.txt file states which paths automated clients are allowed to fetch, so it is worth checking before scraping.
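
A site's robots.txt can be checked programmatically with `urllib.robotparser`. The sketch below parses a hypothetical robots.txt that is inlined as a string so it runs offline; the rules and the `example.com` URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Ask whether any user agent ("*") may fetch each (made-up) URL.
allowed = parser.can_fetch("*", "http://example.com/pages/page1.html")
blocked = parser.can_fetch("*", "http://example.com/private/data.html")
print(allowed, blocked)   # True False
```

For a live site, `parser.set_url(...)` followed by `parser.read()` downloads and parses the real file instead of an inlined string.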

In [None]:
# add header by building request
req = r.Request(URL2)
# Add User-Agent header value:
req.add_header('User-Agent', 'Mozilla/5.0')
html = r.urlopen(req)
html.read(100)
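
Headers can also be passed when the `Request` is created, and inspected before anything is sent. A minimal sketch, reusing the same URL and User-Agent value as above; no network call is made here.

```python
from urllib.request import Request

# Supply the header up front instead of calling add_header afterwards.
req = Request(
    "https://www.womenwhocode.com/singapore",
    headers={"User-Agent": "Mozilla/5.0"},
)

# Request stores header names capitalized, so the key becomes 'User-agent'.
print(req.get_header("User-agent"))   # Mozilla/5.0
print(req.header_items())             # [('User-agent', 'Mozilla/5.0')]
```

Inspecting the request this way is a cheap check that the header is really set before calling `urlopen(req)`.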

#### Use an **OpenerDirector** to automatically add a User-Agent header to every request:  

In [None]:
# add headers by building an opener
opener = r.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
opener.open(URL2).read(300)

## Alternative, simpler method
Using the requests library:

    pip install requests

In [None]:
import requests
html = requests.get(URL2)
html.text[:300]

In [None]:
html.content[:300]

#### References:
https://docs.python.org/3/library/urllib.request.html