## World's Simplest Web Browser

First the program makes a connection to port 80 on the server data.pr4e.org.

Since our program is playing the role of the “web browser”, the HTTP protocol
says we must send the GET command followed by a blank line.

Once we send that blank line, we write a loop that receives data in chunks of up to
20 bytes from the socket and prints the data out until there is no more data to read
(i.e., recv() returns an empty bytes object).

In [4]:
#socket1.py

import socket

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('data.pr4e.org', 80))

cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'.encode()

mysock.send(cmd)

while True:
    data = mysock.recv(20)
    if (len(data) < 1):
        break
    print(data.decode(),end='')

mysock.close()


HTTP/1.1 200 OK
Date: Mon, 14 Nov 2022 19:47:54 GMT
Server: Apache/2.4.18 (Ubuntu)
Last-Modified: Sat, 13 May 2017 11:22:22 GMT
ETag: "a7-54f6609245537"
Accept-Ranges: bytes
Content-Length: 167
Cache-Control: max-age=0, no-cache, no-store, must-revalidate
Pragma: no-cache
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Connection: close
Content-Type: text/plain

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief
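
The output above is one stream of bytes: a status line, the headers, a blank line, and then the document. As a minimal sketch (not part of the original program), the pieces can be separated as follows. The `raw` value is a shortened, hand-typed stand-in for the real response, and `split_response` is a hypothetical helper name.

```python
# Shortened stand-in for the response shown above (an assumption, typed by hand)
raw = (b"HTTP/1.1 200 OK\r\n"
       b"Content-Length: 167\r\n"
       b"Content-Type: text/plain\r\n"
       b"\r\n"
       b"But soft what light through yonder window breaks\n")

def split_response(raw):
    """Split raw HTTP response bytes into (status_line, headers, body)."""
    # The header block ends at the first blank line (two CRLFs in a row)
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    status_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so store them in lower case
        headers[name.strip().lower()] = value.strip()
    return status_line, headers, body

status_line, headers, body = split_response(raw)
print(status_line)
print(headers["content-type"])
print(body.decode(), end="")
```

Header values stay strings here, so a numeric field such as `content-length` needs `int()` before it is used in arithmetic.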


## Retrieving an image over HTTP

In [5]:
#urljpeg.py

import socket
import time

HOST = 'data.pr4e.org'
PORT = 80
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect((HOST, PORT))

#send string as byte literal using prefix "b"
mysock.sendall(b'GET http://data.pr4e.org/cover3.jpg HTTP/1.0\r\n\r\n')
count = 0
picture = b""

while True:
    data = mysock.recv(5120)
    if (len(data) < 1): break
    time.sleep(0.25)
    count = count + len(data)
    print(len(data), count)
    picture = picture + data

mysock.close()

# Look for the end of the header (2 CRLF)
pos = picture.find(b"\r\n\r\n")
print('Header length', pos)
print(picture[:pos].decode())

# Skip past the header and save the picture data
picture = picture[pos+4:]
fhand = open("stuff.jpg", "wb")
fhand.write(picture)
fhand.close()


5120 5120
5120 10240
5120 15360
5120 20480
5120 25600
5120 30720
5120 35840
5120 40960
5120 46080
5120 51200
5120 56320
5120 61440
5120 66560
5120 71680
5120 76800
5120 81920
5120 87040
5120 92160
5120 97280
5120 102400
5120 107520
5120 112640
5120 117760
5120 122880
5120 128000
5120 133120
5120 138240
5120 143360
5120 148480
5120 153600
5120 158720
5120 163840
5120 168960
5120 174080
5120 179200
5120 184320
5120 189440
5120 194560
5120 199680
5120 204800
5120 209920
5120 215040
5120 220160
5120 225280
5120 230400
208 230608
Header length 394
HTTP/1.1 200 OK
Date: Mon, 14 Nov 2022 19:48:47 GMT
Server: Apache/2.4.18 (Ubuntu)
Last-Modified: Mon, 15 May 2017 12:27:40 GMT
ETag: "38342-54f8f2e5b6277"
Accept-Ranges: bytes
Content-Length: 230210
Vary: Accept-Encoding
Cache-Control: max-age=0, no-cache, no-store, must-revalidate
Pragma: no-cache
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Connection: close
Content-Type: image/jpeg
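
The loop above grows `picture` by concatenation, which copies the accumulated bytes on every pass. One possible alternative, sketched here under the assumption that the whole response still fits in memory, collects the chunks in a list and joins them once at the end. `recv_all` is a hypothetical helper name, and the demonstration uses a local socket pair with made-up payload bytes instead of the real server.

```python
import socket

def recv_all(sock, bufsize=5120):
    """Read from sock until the peer closes; return all bytes received."""
    chunks = []
    while True:
        data = sock.recv(bufsize)
        if not data:              # empty bytes object: connection closed
            break
        chunks.append(data)
    return b"".join(chunks)

# Local demonstration: a connected socket pair stands in for the web server
left, right = socket.socketpair()
left.sendall(b"HTTP/1.1 200 OK\r\n\r\n" + b"\xff\xd8" * 1000)
left.close()
response = recv_all(right)
right.close()

# Same header/body split as in urljpeg.py, written with partition()
header, _, body = response.partition(b"\r\n\r\n")
print(len(header), len(body))
```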


## Retrieving web pages with urllib

Using urllib, you can treat a web page much like a file. 

You simply indicate which web page you would like to retrieve and urllib handles all of the HTTP protocol and header details.

In [8]:
#urllib1.py

import urllib.request

fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')

# The headers are still sent, but the urllib code consumes the
# headers and only returns the data to us.
for line in fhand:
    print(line.decode().strip())


But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief


In [9]:
#urlwords.py
# a program to retrieve the data for romeo.txt and
# compute the frequency of each word in the file
import urllib.request, urllib.parse, urllib.error

fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')

counts = dict()
for line in fhand:
    words = line.decode().split()
    for word in words:
        counts[word] = counts.get(word, 0) + 1
print(counts)

{'But': 1, 'soft': 1, 'what': 1, 'light': 1, 'through': 1, 'yonder': 1, 'window': 1, 'breaks': 1, 'It': 1, 'is': 3, 'the': 3, 'east': 1, 'and': 3, 'Juliet': 1, 'sun': 2, 'Arise': 1, 'fair': 1, 'kill': 1, 'envious': 1, 'moon': 1, 'Who': 1, 'already': 1, 'sick': 1, 'pale': 1, 'with': 1, 'grief': 1}
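
The same tally can also be kept with `collections.Counter` from the standard library, which is a dictionary subclass built for counting. This is a sketch of an alternative, not the approach used in urlwords.py. `count_words` is a hypothetical helper, and the two sample lines stand in for the object returned by `urlopen`.

```python
from collections import Counter

def count_words(lines):
    """Count words across an iterable of byte strings (e.g. a urlopen result)."""
    counts = Counter()
    for line in lines:
        # update() adds one to the count of every word in the list
        counts.update(line.decode().split())
    return counts

# Stand-in for the first two lines of romeo.txt
sample = [b"But soft what light through yonder window breaks\n",
          b"It is the east and Juliet is the sun\n"]
counts = count_words(sample)
print(counts["the"], counts["Juliet"])
print(counts.most_common(2))
```

Because iterating over the `urlopen` result yields byte strings one line at a time, `count_words(fhand)` should work on the live response as well.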


## Parsing HTML using regex

In [6]:
#urlregex.py

import urllib.request, urllib.parse, urllib.error
import re

#  http://www.dr-chuck.com/page1.htm

url = input('Enter - ')
html = urllib.request.urlopen(url).read()
links = re.findall(b'href="(http://.*?)"', html)
for link in links:
    print(link.decode())


Enter -  http://www.dr-chuck.com/page1.htm


http://www.dr-chuck.com/page2.htm
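
The pattern in urlregex.py only matches links that are double-quoted and begin with `http://`. The sketch below runs the same pattern on a small hand-written HTML fragment (the markup and the `example.com` link are assumptions made for illustration) to show which links are found and which are skipped.

```python
import re

# Hand-written fragment: one absolute link, one relative link,
# and one absolute link that uses single quotes
html = (b'<a href="http://www.dr-chuck.com/page2.htm">Second Page</a>\n'
        b'<a href="page1.htm">First Page</a>\n'
        b"<a href='http://example.com/'>Other</a>\n")

# Same pattern as above; .*? is non-greedy, so each match stops
# at the first closing double quote
links = re.findall(b'href="(http://.*?)"', html)
for link in links:
    print(link.decode())
```

Only the first link is printed. The relative link and the single-quoted link are missed, which is one reason a real HTML parser tends to be more dependable than a regular expression on pages you do not control.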


## Parsing HTML using BeautifulSoup

In [7]:
#urllinks.py

# To run this, you can install BeautifulSoup
# https://pypi.python.org/pypi/beautifulsoup4

# Or download the file
# http://www.py4e.com/code3/bs4.zip
# and unzip it in the same directory as this file

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

#url = input('Enter - ')
url = "http://www.dr-chuck.com/page2.htm"
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')

# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    print(tag.get('href', None))
    
tags = soup.find_all('p')
print("there are %d paragraph tags" % len(tags))
for tag in tags:
    print(tag)


page1.htm
there are 1 paragraph tags
<p>
If you like, you can switch back to the 
<a href="page1.htm">
First Page</a>.
</p>
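
The `href` printed above, `page1.htm`, is relative to the page it came from. As a small sketch, `urllib.parse.urljoin` can turn such values into full URLs that `urlopen` accepts. The `hrefs` list here is typed in by hand to mimic what `tag.get('href', None)` might return, including a `None` for an anchor without an `href`.

```python
from urllib.parse import urljoin

base = "http://www.dr-chuck.com/page2.htm"

# Hand-typed stand-ins for values returned by tag.get('href', None)
hrefs = ["page1.htm", "http://www.dr-chuck.com/page2.htm", None]

# Skip anchors without an href; urljoin leaves absolute URLs unchanged
absolute = [urljoin(base, href) for href in hrefs if href is not None]
for link in absolute:
    print(link)
```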


## Reading binary files using urllib

This program reads all of the data in at once across the network and stores it in the
variable img in the main memory of your computer, then opens the file cover3.jpg
and writes the data out to your disk.

This works as long as the file fits in the memory of your computer.

In [6]:
#curl1.py

import urllib.request, urllib.parse, urllib.error

img = urllib.request.urlopen('http://data.pr4e.org/cover3.jpg').read()
fhand = open('cover3.jpg', 'wb')
fhand.write(img)
fhand.close()


If the file is a large file, this program may crash or at least
run extremely slowly when your computer runs out of memory. In order to avoid
running out of memory, we retrieve the data in blocks (or buffers) and then write
each block to your disk before retrieving the next block. This way the program can
read any size file without using up all of the memory you have in your computer.

In [8]:
#curl2.py

import urllib.request, urllib.parse, urllib.error

img = urllib.request.urlopen('http://data.pr4e.org/cover3.jpg')
fhand = open('cover3.jpg', 'wb')
size = 0
while True:
    info = img.read(100000)
    if len(info) < 1: break
    size = size + len(info)
    fhand.write(info)

print(size, 'characters copied.')
fhand.close()


230210 characters copied.
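
The standard library also provides `shutil.copyfileobj`, which runs the same read-a-block, write-a-block loop between any two file-like objects. The sketch below uses in-memory `io.BytesIO` objects as stand-ins for the network response and the output file so that it runs without a connection. The payload size is an arbitrary assumption.

```python
import io
import shutil

# Stand-ins: src plays the role of the urlopen() result,
# dst plays the role of open('cover3.jpg', 'wb')
src = io.BytesIO(b"x" * 250000)
dst = io.BytesIO()

# Copy in blocks of at most 100000 bytes, as in curl2.py
shutil.copyfileobj(src, dst, 100000)

size = len(dst.getvalue())
print(size, 'bytes copied.')
```

With a real download, `src` would be the object returned by `urllib.request.urlopen(...)` and `dst` a file opened in `'wb'` mode. Only one block is held in memory at a time.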
