# The urllib Module

## urllib module in Python

urllib is a collection of modules for working with URLs. urllib includes the following modules:
* urllib.request
* urllib.error
* urllib.parse
* urllib.robotparser

This chapter covers all of the above except urllib.error.

## urllib.request

In [5]:
# Using the .urlopen function to get data
import urllib.request

url = urllib.request.urlopen('https://www.google.com/')

# Print some data associated with the url
# geturl() shows the final URL, which reveals whether we were redirected
print(url.geturl())
print('\n')

# info() returns the response headers
print(url.info())
print('\n')

# getcode() returns the HTTP response status code
print(url.getcode())

https://www.google.com/


Date: Sat, 27 Nov 2021 17:59:17 GMT
Expires: -1
Cache-Control: private, max-age=0
Content-Type: text/html; charset=ISO-8859-1
P3P: CP="This is not a P3P policy! See g.co/p3phelp for more info."
Server: gws
X-XSS-Protection: 0
X-Frame-Options: SAMEORIGIN
Set-Cookie: 1P_JAR=2021-11-27-17; expires=Mon, 27-Dec-2021 17:59:17 GMT; path=/; domain=.google.com; Secure
Set-Cookie: NID=511=Bf0UpUog0OYuE4teINawJvF1X5XM-p16QfxaQHl5hsnvqFm8-kIUFq8a-wsdO3VsEnwPacKl0nazdjQPkT9gWr6v7HZYDjmsOZ6eGDDbm7Yn1gKqUDJh2evvX3UdBa1SM4wtUs3cboEjHh5vMaTYipwILHTHL09TetspGrvg4lw; expires=Sun, 29-May-2022 17:59:17 GMT; path=/; domain=.google.com; HttpOnly
Alt-Svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000,h3-Q050=":443"; ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000,quic=":443"; ma=2592000; v="46,43"
Accept-Ranges: none
Vary: Accept-Encoding
Connection: close
Transfer-Encoding: chunked




200
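
The response object returned by urlopen can also be used as a context manager, which closes it when the block exits. The following is a minimal sketch of that pattern. To keep it runnable without network access, it assumes a `data:` URL in place of a real web address; the `demo_url` name and its contents are illustrative and not part of the original example.

```python
import urllib.request

# Assumption: a data: URL stands in for a real web address so this runs offline
demo_url = 'data:text/plain;charset=utf-8,hello%20urllib'

# The with-statement closes the response automatically
with urllib.request.urlopen(demo_url) as response:
    final_url = response.geturl()
    content_type = response.info().get_content_type()
    body = response.read().decode('utf-8')

print(final_url)
print(content_type)
print(body)
```

With a real http(s) URL the same calls apply, and getcode() would additionally report the HTTP status.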


In [6]:
# Downloading a file
url = 'http://www.blog.pythonlibrary.org/wp-content/uploads/2012/06/wxDbViewer.zip'
response = urllib.request.urlopen(url)
data = response.read()

%cd 22_urllib_demos

with open('test.zip', 'wb') as fobj:
    fobj.write(data)

/Users/miesner.jacob/python-for-programmers-educative/Module 4 - Advanced Concepts in Python/22_urllib_demos
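
The download cell above reads the entire response into memory before writing it out. For larger files it may be preferable to copy the response to disk in chunks. Here is one possible sketch using shutil.copyfileobj; the `download` helper, the chunk size, and the placeholder `data:` URL are assumptions made for illustration, not part of the original notebook.

```python
import os
import shutil
import tempfile
import urllib.request


def download(url, dest_path, chunk_size=64 * 1024):
    """Stream the response body to dest_path (hypothetical helper)."""
    with urllib.request.urlopen(url) as response, open(dest_path, 'wb') as fobj:
        # copyfileobj reads and writes up to chunk_size bytes at a time
        shutil.copyfileobj(response, fobj, chunk_size)
    return os.path.getsize(dest_path)


# Assumption: a data: URL keeps the sketch offline; swap in a real http(s) URL
demo_url = 'data:application/octet-stream,hello'

with tempfile.TemporaryDirectory() as tmp_dir:
    dest = os.path.join(tmp_dir, 'test.bin')
    size = download(demo_url, dest)
    with open(dest, 'rb') as fobj:
        content = fobj.read()

print(size)
```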


When you visit a website with your browser, the browser tells the website who it is by sending a user-agent string. urllib sends its own default user-agent string (Python-urllib followed by the Python version), which some websites won’t recognize; they may behave in strange ways or not work at all. Fortunately, it’s easy to set up your own custom user-agent string.

In [8]:
# Specifying User Agent

user_agent = 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:47.0) Gecko/20100101 Firefox/47.0'
url = 'http://www.whatsmyua.com/'
headers = {'User-Agent': user_agent}
request = urllib.request.Request(url, headers=headers)
with urllib.request.urlopen(request) as response:
    with open('user_agent.html', 'wb') as out:
        out.write(response.read())
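
To check which user-agent string will be sent without contacting a server, the Request object can be inspected directly. This is a small sketch under the assumption that looking at the headers locally is enough; nothing is transmitted. The variable names `default_headers` and `sent_user_agent` are illustrative.

```python
import urllib.request

# Headers urllib adds by default when no User-Agent is supplied
default_headers = urllib.request.build_opener().addheaders
print(default_headers)

user_agent = 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:47.0) Gecko/20100101 Firefox/47.0'
request = urllib.request.Request('http://www.whatsmyua.com/',
                                 headers={'User-Agent': user_agent})

# Request stores header names via str.capitalize(), hence 'User-agent'
sent_user_agent = request.get_header('User-agent')
print(sent_user_agent)
```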

## urllib.parse

In [12]:
# Parsing a url
from urllib.parse import urlparse

result = urlparse('https://duckduckgo.com/?q=python+stubbing&t=canonical&ia=qa')

# Printing out some data we can get from urllib.parse
print("Full result: " + str(result) + '\n')
print("Netloc: " + str(result.netloc) + '\n')
print("Get url: " + str(result.geturl()) + '\n')
print("Port: " + str(result.port) + '\n')

Full result: ParseResult(scheme='https', netloc='duckduckgo.com', path='/', params='', query='q=python+stubbing&t=canonical&ia=qa', fragment='')

Netloc: duckduckgo.com

Get url: https://duckduckgo.com/?q=python+stubbing&t=canonical&ia=qa

Port: None
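
The port above is None because the URL does not name one explicitly, and the query comes back as a single raw string. As a possible follow-up, parse_qs from the same module can split that query into a dictionary. The second URL below (`localhost:8080`) is a made-up example used only to show a non-empty port.

```python
from urllib.parse import urlparse, parse_qs

result = urlparse('https://duckduckgo.com/?q=python+stubbing&t=canonical&ia=qa')

# parse_qs maps each key to a list of values and decodes '+' to a space
params = parse_qs(result.query)
print(params)

# Hypothetical URL with an explicit port
local = urlparse('http://localhost:8080/index.html')
print(local.hostname, local.port)
```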



In [14]:
# Submitting a web form query using the urlencode function
import urllib.parse
import urllib.request

# Encode data
data = urllib.parse.urlencode({'q': 'Python'})
print(data)

# Submitting query to duckduckgo and saving result
url = 'http://duckduckgo.com/html/'
full_url = url + '?' + data
response = urllib.request.urlopen(full_url)
with open('results.html', 'wb') as f:
    f.write(response.read())

q=Python
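
The cell above appends the encoded data to the URL, which produces a GET request. The sketch below shows how urlencode handles several fields and reserved characters, and how the same encoded string could instead be attached as a request body. The field names and values are hypothetical, and the Request is only constructed, never sent.

```python
import urllib.parse
import urllib.request

# Hypothetical form fields; '&' and '=' in values are percent-encoded
fields = {'q': 'python urllib', 'note': 'a&b=c'}
encoded = urllib.parse.urlencode(fields)
print(encoded)  # q=python+urllib&note=a%26b%3Dc

# doseq=True expands a sequence value into repeated keys
multi = urllib.parse.urlencode({'tag': ['x', 'y']}, doseq=True)
print(multi)  # tag=x&tag=y

# Supplying bytes as data turns the request into a POST (not sent here)
request = urllib.request.Request('http://duckduckgo.com/html/',
                                 data=encoded.encode('ascii'))
print(request.get_method())  # POST
```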


## urllib.robotparser

The robotparser module is made up of a single class, RobotFileParser. This class will answer questions about whether or not a specific user agent can fetch a URL on a site that has published a robots.txt file. The robots.txt file tells a web scraper or robot what parts of the server should not be accessed.

In [18]:
import urllib.robotparser

# Instantiate RobotFileParser class and set url
robot = urllib.robotparser.RobotFileParser()
robot.set_url('http://arstechnica.com/robots.txt')
robot.read()

# Can we crawl this page?
page = 'http://arstechnica.com/'
print(page)
print(robot.can_fetch('*', page))
print('\n')

# Can we crawl this page?
page = 'http://arstechnica.com/cgi-bin/'
print(page)
print(robot.can_fetch('*', page))

http://arstechnica.com/
True


http://arstechnica.com/cgi-bin/
False
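
RobotFileParser can also be tried out without fetching anything, by passing the lines of a robots.txt file to its parse() method. The rules below are invented for illustration (including the `ExampleBot` agent and the `example.com` host) and are not the actual Ars Technica rules.

```python
import urllib.robotparser

# Hypothetical robots.txt content, one string per line
rules = [
    'User-agent: *',
    'Disallow: /cgi-bin/',
    '',
    'User-agent: ExampleBot',
    'Disallow: /',
]

robot = urllib.robotparser.RobotFileParser()
robot.parse(rules)

# Any agent may fetch the home page but not /cgi-bin/
home_allowed = robot.can_fetch('*', 'http://example.com/')
cgi_allowed = robot.can_fetch('*', 'http://example.com/cgi-bin/search')

# ExampleBot is barred from the whole site
bot_allowed = robot.can_fetch('ExampleBot', 'http://example.com/')

print(home_allowed, cgi_allowed, bot_allowed)
```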
