## What is urllib?

- urllib is a package in Python's standard library for working with URLs and fetching data over the web (HTTP/HTTPS).

- It is split into submodules that handle different tasks:

  - **urllib.request** → open and read URLs

  - **urllib.error** → exceptions raised by urllib.request

  - **urllib.parse** → URL parsing and building

  - **urllib.robotparser** → parse robots.txt files

## 1) urllib.request -> Making requests
This is the most commonly used part of urllib.

In [None]:
# a) urlopen(): opens a URL (similar to opening a file) and returns a response object.
import urllib.request

response = urllib.request.urlopen('https://chatgpt.com/c/68aa9da5-1d08-8327-89dd-a3efba956f1b')
print(response.status)   # HTTP status code (200 = OK)
content = response.read().decode('utf-8')
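
Hard-coding `'utf-8'` works for most pages, but the response headers often state the encoding. The sketch below reads the declared charset before decoding. The helper name `read_text` is hypothetical, the UTF-8 fallback is an assumption, and a `data:` URL (which `urlopen` handles without any network access) stands in for a real page.

```python
from urllib.request import urlopen


def read_text(url: str, timeout: float = 10.0) -> str:
    """Fetch `url` and decode the body with the charset declared in its headers."""
    # The context manager closes the response when the block ends.
    with urlopen(url, timeout=timeout) as response:
        # get_content_charset() returns None when no charset is declared,
        # so fall back to UTF-8 in that case (an assumption, not a rule).
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)


# Offline demonstration: a data: URL instead of a real website.
text = read_text("data:text/plain;charset=utf-8,hello%20urllib")
print(text)
```

The same function can be pointed at an `https://` URL. The `timeout` argument keeps a slow server from blocking the program indefinitely.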


In [3]:
# b) Request() + custom headers
from urllib.request import Request, urlopen

url = 'https://google.com'
req = Request(url, headers={'User-Agent':'Mozilla/5.0'})
response = urlopen(req)   # pass the Request object so the custom headers are actually sent
print(response.read().decode('utf-8'))

<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en-IN"><head>...<title>Google</title>... (output truncated)
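
A `Request` object can also carry a body and an HTTP method, and it can be inspected before anything is sent. The sketch below builds a JSON POST request without opening a connection. The endpoint `https://www.example.com/api` and the payload are hypothetical and only illustrate the API.

```python
import json
from urllib.request import Request

# Hypothetical payload and endpoint, used only to illustrate the API.
payload = json.dumps({"name": "prashant", "age": 20}).encode("utf-8")
req = Request(
    "https://www.example.com/api",
    data=payload,  # supplying data switches the default method to POST
    headers={"User-Agent": "Mozilla/5.0", "Content-Type": "application/json"},
)

method = req.get_method()
# Request stores header names capitalized ("User-agent"), so look them up that way.
user_agent = req.get_header("User-agent")
print(method, req.full_url, user_agent)
```

Nothing is transmitted until the object is passed to `urlopen(req)`.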

In [5]:
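# c) urlretrieve() -> download a URL to a local file
# With no filename given, it saves to a temporary file and returns (path, headers).
import urllib.request
urllib.request.urlretrieve('https://google.com')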

('C:\\Users\\ASUS\\AppData\\Local\\Temp\\tmpwzx14cl6',
 <http.client.HTTPMessage at 0x24fe3869fd0>)
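
Passing a `filename` to `urlretrieve()` controls where the download lands instead of leaving it in a temporary file. The sketch below assumes a hypothetical helper `download_to` and uses a local `file://` URL as a stand-in for a remote resource so that it runs offline. Note that the Python documentation lists `urlretrieve()` as a legacy interface.

```python
import tempfile
import urllib.request
from pathlib import Path


def download_to(url: str, destination: Path) -> Path:
    """Save the resource at `url` to `destination` and return the saved path."""
    saved_path, _headers = urllib.request.urlretrieve(url, filename=str(destination))
    return Path(saved_path)


# Offline demonstration: a local file served through a file:// URL.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source.txt"
    source.write_text("hello urllib", encoding="utf-8")
    target = download_to(source.as_uri(), Path(tmp) / "copy.txt")
    downloaded_text = target.read_text(encoding="utf-8")

print(downloaded_text)
```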

## 2) urllib.error -> Handling exceptions
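- HTTPError: the server responded with an error status code (404, 500, etc.). It is a subclass of URLError, so catch it first.
- URLError: the server could not be reached (DNS failure, refused connection, etc.)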

In [None]:
import urllib.request
import urllib.error

try:
    urllib.request.urlopen('https://fjsslfjs.com')
except urllib.error.HTTPError as e:
    print('HTTPError:', e.code, e.reason)
except urllib.error.URLError as e:
    print('URL Error:', e.reason)


URL Error: [Errno 11001] getaddrinfo failed
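
The order of the `except` clauses matters because `HTTPError` inherits from `URLError`. The sketch below wraps the pattern in a hypothetical helper, `describe_fetch`, and triggers a `URLError` offline by requesting a `file://` path that is assumed not to exist.

```python
import urllib.error
import urllib.request


def describe_fetch(url: str, timeout: float = 5.0) -> str:
    """Try to open `url` and return a one-line summary of the outcome."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return f"OK: {len(response.read())} bytes"
    except urllib.error.HTTPError as e:
        # The server answered, but with an error status code.
        return f"HTTPError: {e.code} {e.reason}"
    except urllib.error.URLError as e:
        # No usable response at all (DNS failure, missing file, refused connection).
        return f"URLError: {e.reason}"


# HTTPError is a subclass of URLError, so it has to be caught first.
is_subclass = issubclass(urllib.error.HTTPError, urllib.error.URLError)

# Offline demonstration: this path is assumed not to exist.
summary = describe_fetch("file:///this/path/does/not/exist")
print(is_subclass, summary)
```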


## 3) urllib.parse -> URL parsing and building
Helps in breaking URLs into components or building query strings.

In [9]:
# a) urlparse(): split a URL into its components
from urllib.parse import urlparse

url = "https://www.example.com/path/page?name=prashant&age=20"
parsed = urlparse(url)
print(parsed.scheme)    # 'https'
print(parsed.netloc)    # 'www.example.com'
print(parsed.path)      # '/path/page' 
print(parsed.query)     # 'name=prashant&age=20'  

https
www.example.com
/path/page
name=prashant&age=20
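
The parsed result exposes more than the four attributes printed above, and it can be turned back into a URL. The sketch below shows `hostname`, `port` and `fragment`, then rebuilds and joins URLs. The port and fragment in the sample URL are made up for illustration.

```python
from urllib.parse import urljoin, urlparse, urlunparse

parsed = urlparse("https://www.example.com:8443/path/page?name=prashant&age=20#top")
host = parsed.hostname      # host name without the port
port = parsed.port          # port as an int, or None if the URL has none
fragment = parsed.fragment  # the part after '#'

# The result is a named tuple, so _replace() returns a modified copy.
stripped = urlunparse(parsed._replace(query="", fragment=""))

# Resolve a relative link against the page it appeared on.
sibling = urljoin("https://www.example.com/path/page", "other?x=1")

print(host, port, fragment)
print(stripped)
print(sibling)
```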


In [10]:
# b) parse_qs() : Turn query string into dictionary
from urllib.parse import parse_qs

query = 'name=prashant&age=20'
print(parse_qs(query))

{'name': ['prashant'], 'age': ['20']}
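
Every value comes back as a list because a key may appear more than once in a query string. The sketch below shows repeated keys, blank values, and `parse_qsl()`, which returns ordered pairs instead of a dictionary. The sample query strings are made up for illustration.

```python
from urllib.parse import parse_qs, parse_qsl

# A repeated key collects all of its values in one list.
repeated = parse_qs("tag=python&tag=web&name=prashant")

# Blank values are dropped unless keep_blank_values=True is passed.
dropped = parse_qs("name=prashant&nickname=")
kept = parse_qs("name=prashant&nickname=", keep_blank_values=True)

# parse_qsl() keeps the original order as a list of (key, value) tuples.
pairs = parse_qsl("name=prashant&age=20")

# Values are always strings, so convert them explicitly when needed.
age = int(parse_qs("name=prashant&age=20")["age"][0])

print(repeated, dropped, kept, pairs, age)
```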


In [11]:
# c) urlencode(): build a query string from a dictionary
from urllib.parse import urlencode

params = {'name':'prashant', 'age':20}
query_string = urlencode(params)
print(query_string)

name=prashant&age=20
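
`urlencode()` also escapes characters that are not safe in a URL, and with `doseq=True` it expands list values into repeated keys. The sketch below uses made-up parameters and a hypothetical search URL.

```python
from urllib.parse import parse_qs, quote, urlencode

params = {"q": "urllib tutorial", "tag": ["python", "web"], "page": 2}

# doseq=True emits one key=value pair per list element; spaces become '+'.
query_string = urlencode(params, doseq=True)

# Hypothetical base URL, used only to show how the pieces fit together.
full_url = "https://www.example.com/search?" + query_string

# quote() escapes a single path segment (spaces become %20, not '+').
path_segment = quote("my file.txt")

# parse_qs() reverses the encoding; note that every value is a string again.
round_trip = parse_qs(query_string)

print(query_string)
print(full_url)
print(path_segment)
```

Without `doseq=True`, a list value would be converted with `str()` and encoded as a single value, which is rarely what is wanted.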


## 4) urllib.robotparser -> Respecting robots.txt
- Many websites publish a robots.txt file at the site root that tells crawlers which paths they may or may not access.

In [14]:
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()

print(rp.can_fetch("*", "https://www.example.com/"))   # 👉 True/False

True
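
The rules can also be supplied directly with `parse()`, which makes it easy to see how `can_fetch()` decides without contacting a server. The robots.txt content below is made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, passed in as a list of lines.
rules = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 2",
]

rp = RobotFileParser()
rp.parse(rules)  # parse in-memory lines instead of calling read()

blocked = rp.can_fetch("*", "https://www.example.com/private/data")
allowed = rp.can_fetch("*", "https://www.example.com/public")
delay = rp.crawl_delay("*")  # seconds to wait between requests, or None

print(blocked, allowed, delay)
```

A polite crawler would check `can_fetch()` before every request and sleep for `crawl_delay()` seconds between requests when a delay is given.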
