## The World's Simplest Web Browser

In [3]:
import socket

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('data.pr4e.org', 80))
cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'.encode()
mysock.send(cmd)

while True:
    data = mysock.recv(512)
    if (len(data) < 1):
        break
    print(data.decode())
mysock.close()

# Code: http://www.py4e.com/code3/socket1.py

HTTP/1.1 200 OK
Date: Sat, 05 Nov 2022 15:52:57 GMT
Server: Apache/2.4.18 (Ubuntu)
Last-Modified: Sat, 13 May 2017 11:22:22 GMT
ETag: "a7-54f6609245537"
Accept-Ranges: bytes
Content-Length: 167
Cache-Control: max-age=0, no-cache, no-store, must-revalidate
Pragma: no-cache
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Connection: close
Content-Type: text/plain

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already s
ick and pale with grief



## Retrieving an image over HTTP

In [5]:
import socket
import time

HOST = 'data.pr4e.org'
PORT = 80
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect((HOST, PORT))
mysock.sendall(b'GET http://data.pr4e.org/cover3.jpg HTTP/1.0\r\n\r\n')
count = 0
picture = b""

while True:
    data = mysock.recv(5120)
    if (len(data) < 1): break
    #time.sleep(0.25)
    count = count + len(data)
    print(len(data), count)
    picture = picture + data

mysock.close()

# Look for the end of the header (2 CRLF)
pos = picture.find(b"\r\n\r\n")
print('Header length', pos)
print(picture[:pos].decode())

# Skip past the header and save the picture data
picture = picture[pos+4:]
fhand = open("stuff.jpg", "wb")
fhand.write(picture)
fhand.close()

# Code: http://www.py4e.com/code3/urljpeg.py

1448 1448
5120 6568
5120 11688
2792 14480
5120 19600
5120 24720
5120 29840
5120 34960
5120 40080
3360 43440
1448 44888
5120 50008
5120 55128
5120 60248
5120 65368
5120 70488
5120 75608
5120 80728
5120 85848
5120 90968
5120 96088
4696 100784
1448 102232
5120 107352
5120 112472
5120 117592
5120 122712
5120 127832
5120 132952
5120 138072
5120 143192
5120 148312
3728 152040
1448 153488
1448 154936
2896 157832
1448 159280
1448 160728
5120 165848
5120 170968
5120 176088
5120 181208
5120 186328
5120 191448
5120 196568
5120 201688
5120 206808
5120 211928
5120 217048
5120 222168
3720 225888
4344 230232
376 230608
Header length 394
HTTP/1.1 200 OK
Date: Sat, 05 Nov 2022 15:56:22 GMT
Server: Apache/2.4.18 (Ubuntu)
Last-Modified: Mon, 15 May 2017 12:27:40 GMT
ETag: "38342-54f8f2e5b6277"
Accept-Ranges: bytes
Content-Length: 230210
Vary: Accept-Encoding
Cache-Control: max-age=0, no-cache, no-store, must-revalidate
Pragma: no-cache
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Connection: close
Content-Type
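
The byte counts above can be cross-checked against the headers: the final running total (230608) should equal the header length (394), plus the 4 bytes of the blank-line separator, plus the ``Content-Length`` value (230210). The following is a minimal sketch of that check. It uses a short made-up response instead of the real image, and the helper name `split_response` is hypothetical.

```python
def split_response(raw):
    """Split raw HTTP response bytes into (header dict, body bytes)."""
    head, sep, body = raw.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("no blank line found; headers are incomplete")
    lines = head.decode("iso-8859-1").split("\r\n")
    headers = {}
    for line in lines[1:]:  # lines[0] is the status line
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return headers, body


# Made-up response for illustration (not the real cover3.jpg data)
raw = (b"HTTP/1.1 200 OK\r\n"
       b"Content-Length: 5\r\n"
       b"Content-Type: text/plain\r\n"
       b"\r\n"
       b"hello")
headers, body = split_response(raw)

# The declared length should match the number of body bytes received
print(int(headers["content-length"]) == len(body))

# The same arithmetic applied to the numbers printed above
print(394 + 4 + 230210)
```

If the two sides ever disagree, the connection was probably closed before the whole body arrived.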

## Retrieving web pages with ``urllib``

In [108]:
import urllib.request

fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')
for line in fhand:
    print(line.decode().strip())
    

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief


In [110]:
import urllib.request, urllib.parse, urllib.error

fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')

counts = dict()
for line in fhand:
    words = line.decode().split()
    for word in words:
        counts[word] = counts.get(word, 0) + 1
counts

{'But': 1,
 'soft': 1,
 'what': 1,
 'light': 1,
 'through': 1,
 'yonder': 1,
 'window': 1,
 'breaks': 1,
 'It': 1,
 'is': 3,
 'the': 3,
 'east': 1,
 'and': 3,
 'Juliet': 1,
 'sun': 2,
 'Arise': 1,
 'fair': 1,
 'kill': 1,
 'envious': 1,
 'moon': 1,
 'Who': 1,
 'already': 1,
 'sick': 1,
 'pale': 1,
 'with': 1,
 'grief': 1}

## Parsing HTML using BeautifulSoup

In [11]:
# To run this, you can install BeautifulSoup
# https://pypi.python.org/pypi/beautifulsoup4

# Or download the file
# http://www.py4e.com/code3/bs4.zip
# and unzip it in the same directory as this file

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')

# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    print(tag.get('href', None))

# Code: http://www.py4e.com/code3/urllinks.py

Enter - http://www.dr-chuck.com/page1.htm
http://www.dr-chuck.com/page2.htm


In [12]:
import urllib.request, urllib.parse, urllib.error

img = urllib.request.urlopen('http://data.pr4e.org/cover3.jpg')
fhand = open('cover3.jpg', 'wb')
size = 0
while True:
    info = img.read(100000)
    if len(info) < 1: break
    size = size + len(info)
    fhand.write(info)

print(size, 'characters copied.')
fhand.close()

# Code: http://www.py4e.com/code3/curl2.py

230210 characters copied.


## Exercise 1 

Change the socket program ``socket1.py`` to prompt the user for the URL so it can read any web page. You can use ``split('/')`` to break the URL into its component parts so you can extract the host name for the socket ``connect`` call. Add error checking using ``try`` and ``except`` to handle the case where the user enters an improperly formatted or non-existent URL.
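
As a rough illustration of the ``split('/')`` hint, the sketch below pulls the host name out of a URL and uses ``try`` and ``except`` to catch input that has no host field. The function name `extract_host` is hypothetical, and the check is deliberately loose: it does not validate the scheme.

```python
def extract_host(url):
    """Return the host part of a URL, or None if no host field is present."""
    try:
        # 'http://host/path'.split('/') -> ['http:', '', 'host', 'path']
        host = url.split('/')[2]
    except IndexError:
        # Fewer than three fields: there is no '//' followed by a host
        return None
    return host or None


print('http://data.pr4e.org/romeo.txt'.split('/'))
print(extract_host('http://data.pr4e.org/romeo.txt'))
print(extract_host('not-a-url'))
```

The host returned here is what would be passed to ``mysock.connect((host, 80))``.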



In [104]:
import socket
import re

def _socket(URL):
    # Accept only URLs of the form http(s)://host.domain/path
    if not re.findall(r'https?://+.*[.]+.*/\S', URL):
        print('Wrong Formatting. Address %s does not seem to be written properly ' % URL)
        return 0
    HOST = URL.split('/')[2]
    mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        mysock.connect((HOST, 80))
    except OSError:
        # Covers host names that do not resolve and refused connections
        print('Could not connect to "%s"' % HOST)
        mysock.close()
        return 0

    cmd = 'GET ' + URL + ' HTTP/1.0\r\n\r\n'
    mysock.send(cmd.encode())

    while True:
        data = mysock.recv(512)
        if 'Not Found' in data.decode():
            print('Address "%s" not found' % URL)
            mysock.close()
            return 0
        if len(data) < 1:
            break
        print(data.decode())
    mysock.close()
    return 1

In [106]:
x = 0
fname = input('Enter URL: ')
while True:
    x =_socket(fname)
    if x == 0:
        print('An error occurred')
        fname = input('Enter a new URL:')
    else:
        break

Enter URL: htt://en.wikipedia.org/wiki/Letter_frequency
Wrong Formatting. Address htt://en.wikipedia.org/wiki/Letter_frequency does not seem to be written properly 
An error occurred
Enter a new URL:https://en.wikipedia.org/wiki/Letter_frequency
HTTP/1.1 301 TLS Redirect
Date: Sat, 05 Nov 2022 18:48:07 GMT
Server: Varnish
X-Varnish: 601250524
X-Cache: cp3064 int
X-Cache-Status: int-front
Server-Timing: cache;desc="int-front", host;desc="cp3064"
Permissions-Policy: interest-cohort=()
Set-Cookie: WMF-Last-Access=05-Nov-2022;Path=/;HttpOnly;secure;Expires=Wed, 07 Dec 2022 12:00:00 GMT
Set-Cookie: WMF-Last-Access-Global=05-Nov-2022;Path=/;Domain=.wikipedia.org;HttpOnly;secure;Expires=Wed, 07 Dec 2022 12:00:00 GMT
X-Client-IP: 94.210.165.103

Location: https://en.wikipedia.org/wiki/Letter_frequency
Content-Length: 0
Connection: close




## Exercise 2

Change your socket program so that it counts the number of characters it has received and stops displaying any text after it has shown 3000 characters. The program should still retrieve the entire document, count the total number of characters, and display that count at the end of the document.
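
One way to separate "how much to show" from "how much was received" is sketched below. It works on a list of already-decoded text chunks, so it runs without a network connection. The function name `limit_display` and the sample chunks are made up for illustration.

```python
def limit_display(chunks, limit=3000):
    """Return (text to display, total characters seen) for a series of chunks."""
    shown = []
    remaining = limit
    total = 0
    for text in chunks:
        total += len(text)           # always count everything received
        if remaining > 0:
            piece = text[:remaining]  # keep only what still fits in the limit
            shown.append(piece)
            remaining -= len(piece)
    return ''.join(shown), total


displayed, total = limit_display(['abc', 'defg', 'hi'], limit=5)
print(displayed)  # abcde
print(total)      # 9
```

In the socket program, each chunk would be the decoded result of one ``mysock.recv(512)`` call.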

In [130]:
import socket
import re

def _socket(URL):
    # Accept only URLs of the form http(s)://host.domain/path
    if not re.findall(r'https?://+.*[.]+.*/\S', URL):
        print('Wrong Formatting. Address %s does not seem to be written properly ' % URL)
        return 0
    HOST = URL.split('/')[2]
    mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    mysock.connect((HOST, 80))
    cmd = 'GET ' + URL + ' HTTP/1.0\r\n\r\n'
    mysock.send(cmd.encode())
    recv = ''
    while True:
        data = mysock.recv(512)
        if 'Not Found' in data.decode():
            print('Address "%s" not found' % URL)
            mysock.close()
            return 0
        if len(data) < 1:
            break
        recv += data.decode()
    mysock.close()
    # Show at most the first 3000 characters, then the total count
    print(recv[:3000])
    print(len(recv))

    return 1

In [131]:
_socket('http://data.pr4e.org/romeo-full.txt')

HTTP/1.1 200 OK
Date: Sat, 05 Nov 2022 19:04:36 GMT
Server: Apache/2.4.18 (Ubuntu)
Last-Modified: Sat, 13 May 2017 11:22:22 GMT
ETag: "22a0-54f6609245537"
Accept-Ranges: bytes
Content-Length: 8864
Cache-Control: max-age=0, no-cache, no-store, must-revalidate
Pragma: no-cache
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Connection: close
Content-Type: text/plain

Romeo and Juliet
Act 2, Scene 2 

SCENE II. Capulet's orchard.

Enter ROMEO

ROMEO

He jests at scars that never felt a wound.
JULIET appears above at a window

But, soft! what light through yonder window breaks?
It is the east, and Juliet is the sun.
Arise, fair sun, and kill the envious moon,
Who is already sick and pale with grief,
That thou her maid art far more fair than she:
Be not her maid, since she is envious;
Her vestal livery is but sick and green
And none but fools do wear it; cast it off.
It is my lady, O, it is my love!
O, that she knew she were!
She speaks yet she says nothing: what of that?
Her eye discou

1

## Exercise 3

Use ``urllib`` to replicate the previous exercise of (1) retrieving the document from a URL, (2) displaying up to 3000 characters, and (3) counting the overall number of characters in the document. Don't worry about the headers for this exercise, simply show the first 3000 characters of the document contents.

In [149]:
import re
import urllib.request

def _socketURLib(URL):
    # Accept only URLs of the form http(s)://host.domain/path
    if not re.findall(r'https?://+.*[.]+.*/\S', URL):
        print('Wrong Formatting. Address %s does not seem to be written properly ' % URL)
        return 0
    fhand = urllib.request.urlopen(URL)
    recv = ''
    for line in fhand:
        recv += line.decode()
    # Show at most the first 3000 characters, then the total count
    print(recv[:3000])
    print(len(recv))

    return 1

_socketURLib('http://data.pr4e.org/romeo.txt')

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief

167


1

## Exercise 4

Change the ``urllinks.py`` program to extract and count paragraph (p) tags from the retrieved HTML document and display the count of the paragraphs as the output of your program. Do not display the paragraph text, only count them. Test your program on several small web pages as well as some larger web pages.
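
To check the counting logic without fetching a page or installing anything, the sketch below counts ``<p>`` start tags in a small inline HTML string. It uses the standard library's ``html.parser`` module rather than BeautifulSoup, which is a swapped-in technique for illustration only. The class name `ParagraphCounter` and the sample HTML are made up.

```python
from html.parser import HTMLParser


class ParagraphCounter(HTMLParser):
    """Count the <p> start tags seen while parsing."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case
        if tag == 'p':
            self.count += 1


html = ('<html><body><p>One</p><p class="x">Two</p>'
        '<a href="#">link</a></body></html>')
counter = ParagraphCounter()
counter.feed(html)
print(counter.count)  # 2
```

A known count on a hand-written page like this gives a baseline before trying larger pages.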

In [159]:
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')

# Retrieve all of the paragraph tags
tags = soup('p')
print(len(tags))

# Code: http://www.py4e.com/code3/urllinks.py

Enter - https://en.wikipedia.org/wiki/The_Gold-Bug
38


## Exercise 5 

(Advanced) Change the socket program so that it only shows data after the headers and a blank line have been received. Remember that ``recv`` receives characters (newlines and all), not lines.
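
Because ``recv`` returns arbitrary chunks, the blank line that ends the headers can be split across two calls. The sketch below shows one way to handle that: buffer the data until the ``\r\n\r\n`` separator appears, then pass everything after it straight through. The generator name `body_chunks` and the sample chunks are made up for illustration.

```python
def body_chunks(chunks):
    """Yield only the bytes that follow the first blank line (CRLF CRLF)."""
    buffer = b''
    in_body = False
    for data in chunks:
        if in_body:
            yield data
            continue
        # Still in the headers: keep buffering until the separator shows up
        buffer += data
        head, sep, rest = buffer.partition(b'\r\n\r\n')
        if sep:
            in_body = True
            if rest:
                yield rest


# Made-up chunks; the separator is deliberately split across the first two
chunks = [b'HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n',
          b'\r\nBut soft ',
          b'what light']
print(b''.join(body_chunks(chunks)).decode())  # But soft what light
```

In the socket program, the chunks would come from repeated ``mysock.recv(512)`` calls instead of a list.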

In [176]:
import socket
import re

def _socket(URL):
    # Accept only URLs of the form http(s)://host.domain/path
    if not re.findall(r'https?://+.*[.]+.*/\S', URL):
        print('Wrong Formatting. Address %s does not seem to be written properly ' % URL)
        return 0
    HOST = URL.split('/')[2]
    mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    mysock.connect((HOST, 80))
    cmd = 'GET ' + URL + ' HTTP/1.0\r\n\r\n'
    mysock.send(cmd.encode())
    recv = ''
    while True:
        data = mysock.recv(512)
        if 'Not Found' in data.decode():
            print('Address "%s" not found' % URL)
            mysock.close()
            return 0
        if len(data) < 1:
            break
        recv += data.decode()
    mysock.close()
    # Keep everything after the first blank line, even if the body
    # itself contains further blank lines
    recv = recv.partition('\r\n\r\n')[2]
    print(recv[:3000])
    print(len(recv))

    return 1

In [177]:
_socket('http://data.pr4e.org/romeo.txt')

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief

167


1