# Python for Everybody
## Chapter 12 | Networked Programs - Exercises

https://www.py4e.com/html3/12-network

***

### Exercise 1: 

#### Change the socket program socket1.py to prompt the user for the URL so it can read any web page. You can use split('/') to break the URL into its component parts so you can extract the host name for the socket connect call. Add error checking using try and except to handle the condition where the user enters an improperly formatted or non-existent URL.
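
The `split('/')` hint can be sketched as a small helper before touching the socket code. This is only an illustration: the name `host_from_url` and the exact validation rules are assumptions, not part of the exercise text.

```python
def host_from_url(url):
    """Return the host name from a URL such as 'http://host/path'."""
    parts = url.split('/')
    # 'http://data.pr4e.org/romeo.txt'.split('/') gives
    # ['http:', '', 'data.pr4e.org', 'romeo.txt'], so the host is parts[2].
    if len(parts) < 3 or not parts[0].endswith(':') or parts[1] != '' or parts[2] == '':
        raise ValueError('improperly formatted URL: ' + repr(url))
    return parts[2]


print(host_from_url('http://data.pr4e.org/romeo.txt'))  # data.pr4e.org
```

Raising `ValueError` for a malformed URL lets the calling code report a formatting problem separately from a connection failure.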

#### Original socket1.py program:

In [None]:
import socket

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('data.pr4e.org', 80))
cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'.encode()
mysock.send(cmd)

while True:
    data = mysock.recv(512)
    if len(data) < 1:
        break
    print(data.decode(),end='')

mysock.close()

#### Modified Code of socket1.py:

I'm using the URL of https://www.w3.org/Protocols/ for the purposes of this exercise.

In [1]:
import socket

geturl = input("Please enter a URL: ")

try:
    getsock = geturl.split('/')

    url = 'GET ' + geturl + ' HTTP/1.0\r\n\r\n'

    mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    mysock.connect((getsock[2], 80))
    cmd = url.encode()
    mysock.send(cmd)

    while True:
        data = mysock.recv(512)
        if len(data) < 1:
            break
        print(data.decode(),end='')

    mysock.close()

except Exception:
    print('Error, improperly formatted or non-existent URL.')

Please enter a URL: https://www.w3.org/Protocols/
HTTP/1.1 200 OK
date: Thu, 25 Feb 2021 13:54:38 GMT
content-location: Overview.html
last-modified: Wed, 11 Jun 2014 14:21:46 GMT
etag: "6a1b-4fb902a09f280"
accept-ranges: bytes
content-length: 27163
cache-control: max-age=21600
expires: Thu, 25 Feb 2021 19:54:38 GMT
vary: Accept-Encoding,upgrade-insecure-requests
keep-alive: timeout=5, max=2000
content-type: text/html; charset=iso-8859-1
x-backend: www-mirrors
connection: close

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
                       "http://www.w3.org/TR/html4/loose.dtd">
<html lang="en-US">
<head>
  <title>HTTP - Hypertext Transfer Protocol Overview</title>
  <link href="/StyleSheets/activities.css" rel="stylesheet" type="text/css">
</head>

<body>
<p><a href="../"><img src="../Icons/WWW/w3c_home" alt="W3C" border="0"
height="48" width="72"></a>&nbsp;<a href="../Architecture/"><img
src="../Icons/arch" alt="Architecture Domain" border="0" height="48"
width

    reads</a>! A collection of papers from work shops, conferences etc.</li>
  <li><a href="HTTP/Performance/Overview.html#Compression">Compression and
    HTTP</a> - also check out the <a
    href="http://en.wikipedia.org/wiki/Data_compression">comprehensive
    compression overview</a></li>
  <li><a href="Classic.html">Classic HTTP Documents</a> - read how it all
    started</li>
</ul>

<h2><a name="Software" id="Software">HTTP Sample Software</a></h2>

<p>W3C offers the <a href="../Jigsaw/">Jigsaw</a> server written in Java and
the <a href="../Library/">libwww</a> client API - both released with a full
set of HTTP/1.1 functionality including caching and persistent connections.
Please see the <a href="/Status.html">W3C open source contributions</a> for
more details.</p>

<h2><a name="Talks" id="Talks">Talks and Presentations</a></h2>
<dl>
  <dt><a href="../Talks/970210HTTP/all.htm">Preliminary HTTP/1.1 Performance
  Evaluation</a> by Jim Gettys</dt>
    <dd>The <a href="HTTP/Performa

***

### Exercise 2: 

#### Change your socket program so that it counts the number of characters it has received and stops displaying any text after it has shown 3000 characters. The program should retrieve the entire document and count the total number of characters and display the count of the number of characters at the end of the document.
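
One possible reading of this exercise is to print while receiving and simply stop printing once 3000 characters have been shown, while still counting everything. The sketch below takes any iterable of byte chunks so it can be tried without a network connection. The function name `show_limited` is hypothetical, and it assumes the text is ASCII-compatible so that a chunk boundary never splits a character.

```python
def show_limited(chunks, limit=3000):
    """Print at most `limit` characters, but count everything received."""
    shown = 0
    total = 0
    for data in chunks:
        # errors='replace' keeps a stray byte from raising UnicodeDecodeError
        text = data.decode(errors='replace')
        total += len(text)
        if shown < limit:
            piece = text[:limit - shown]
            print(piece, end='')
            shown += len(piece)
    return shown, total


# Three fake "recv" results standing in for socket data
shown, total = show_limited([b'a' * 2000, b'b' * 2000, b'c' * 500])
print('\nShown:', shown, 'Total character count is:', total)
```

With a real socket, the chunks could come from `iter(lambda: mysock.recv(512), b'')`, which keeps calling `recv` until it returns an empty bytes object.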

In [2]:
# This version of the program displays the received data starting with the header data.

import socket

geturl = input("Please enter a URL: ")

count = 0
site = b""

try:
    getsock = geturl.split('/')

    url = 'GET ' + geturl + ' HTTP/1.0\r\n\r\n'

    mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    mysock.connect((getsock[2], 80))
    cmd = url.encode()
    mysock.send(cmd)
    
    while True:
        data = mysock.recv(512)
        if len(data) < 1:
            break
        count += len(data)
        site += data

    mysock.close()
    
    # Show only the first 3000 characters ([:3001] would show 3001)
    print(site[:3000].decode())
    print("\n")
    print('Total character count is:', count)

except Exception:
    print('Error, improperly formatted or non-existent URL.')

Please enter a URL: https://www.w3.org/Protocols/
HTTP/1.1 200 OK
date: Thu, 25 Feb 2021 13:54:46 GMT
content-location: Overview.html
last-modified: Wed, 11 Jun 2014 14:21:46 GMT
etag: "6a1b-4fb902a09f280"
accept-ranges: bytes
content-length: 27163
cache-control: max-age=21600
expires: Thu, 25 Feb 2021 19:54:46 GMT
vary: Accept-Encoding,upgrade-insecure-requests
keep-alive: timeout=5, max=2000
content-type: text/html; charset=iso-8859-1
x-backend: www-mirrors
connection: close

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
                       "http://www.w3.org/TR/html4/loose.dtd">
<html lang="en-US">
<head>
  <title>HTTP - Hypertext Transfer Protocol Overview</title>
  <link href="/StyleSheets/activities.css" rel="stylesheet" type="text/css">
</head>

<body>
<p><a href="../"><img src="../Icons/WWW/w3c_home" alt="W3C" border="0"
height="48" width="72"></a>&nbsp;<a href="../Architecture/"><img
src="../Icons/arch" alt="Architecture Domain" border="0" height="48"
width

In [3]:
# This version of the program skips the header data and displays only the document body.

import socket

geturl = input("Please enter a URL: ")

count = 0
site = b""

try:
    getsock = geturl.split('/')

    url = 'GET ' + geturl + ' HTTP/1.0\r\n\r\n'

    mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    mysock.connect((getsock[2], 80))
    cmd = url.encode()
    mysock.send(cmd)
    
    while True:
        data = mysock.recv(512)
        if len(data) < 1:
            break
        count += len(data)
        site += data

    mysock.close()
    
    # Look for the end of the header (2 CRLF = \r\n\r\n)
    pos = site.find(b"\r\n\r\n")
    
    # Show the first 3000 characters of the body, measured from the end of the header
    print(site[pos+4:pos+4+3000].decode())
    print("\n")
    print('Total character count is:', count)

except Exception:
    print('Error, improperly formatted or non-existent URL.')

Please enter a URL: https://www.w3.org/Protocols/
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
                       "http://www.w3.org/TR/html4/loose.dtd">
<html lang="en-US">
<head>
  <title>HTTP - Hypertext Transfer Protocol Overview</title>
  <link href="/StyleSheets/activities.css" rel="stylesheet" type="text/css">
</head>

<body>
<p><a href="../"><img src="../Icons/WWW/w3c_home" alt="W3C" border="0"
height="48" width="72"></a>&nbsp;<a href="../Architecture/"><img
src="../Icons/arch" alt="Architecture Domain" border="0" height="48"
width="212"></a> <img src="../Icons/WWW/HTTP48x" alt="HTTP" height="48"
width="48"></p>

<h1>HTTP - Hypertext Transfer Protocol</h1>

<p align="center"><a href="#News">News</a> | <a href="Activity.html">HTTP
Activity</a> | <a href="#Specs">Specs</a> | <a href="#Software">Software</a>
| <a href="#Talks">Talks</a> | <a href="#Lists">Mailing lists</a> | <a
href="#IETF">IETF</a> | <a href="HTTP/ietf-http-ext/">HTTP Extensions</a> |
<a hre

***

### Exercise 3: 
#### Use <font color="red">***urllib***</font> to replicate the previous exercise of (1) retrieving the document from a URL, (2) displaying up to 3000 characters, and (3) counting the overall number of characters in the document. Don’t worry about the headers for this exercise, simply show the first 3000 characters of the document contents.

In [4]:
import urllib.request, urllib.parse, urllib.error

geturl = input("Please enter a URL: ")

try:
    fhand = urllib.request.urlopen(geturl)

    count = 0
    site = ""

    for line in fhand:
        text = line.decode()
        site += text
        # Count every character, including whitespace and newlines
        count += len(text)

    print(site[:3000])
    print("\n")
    print('Total character count is:', count)
    
except Exception:
    print('Error, improperly formatted or non-existent URL.')

Please enter a URL: https://www.w3.org/Protocols/
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
                       "http://www.w3.org/TR/html4/loose.dtd">
<html lang="en-US">
<head>
  <title>HTTP - Hypertext Transfer Protocol Overview</title>
  <link href="/StyleSheets/activities.css" rel="stylesheet" type="text/css">
</head>

<body>
<p><a href="../"><img src="../Icons/WWW/w3c_home" alt="W3C" border="0"
height="48" width="72"></a>&nbsp;<a href="../Architecture/"><img
src="../Icons/arch" alt="Architecture Domain" border="0" height="48"
width="212"></a> <img src="../Icons/WWW/HTTP48x" alt="HTTP" height="48"
width="48"></p>

<h1>HTTP - Hypertext Transfer Protocol</h1>

<p align="center"><a href="#News">News</a> | <a href="Activity.html">HTTP
Activity</a> | <a href="#Specs">Specs</a> | <a href="#Software">Software</a>
| <a href="#Talks">Talks</a> | <a href="#Lists">Mailing lists</a> | <a
href="#IETF">IETF</a> | <a href="HTTP/ietf-http-ext/">HTTP Extensions</a> |
<a hre

***

### Exercise 4: 
#### Change the <font color="red">***urllinks.py***</font> program to extract and count paragraph (p) tags from the retrieved HTML document and display the count of the paragraphs as the output of your program. Do not display the paragraph text, only count them. Test your program on several small web pages as well as some larger web pages.
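
To check the counting logic without fetching a page, the same idea can be tried on an in-memory HTML string. This sketch swaps BeautifulSoup for the standard library's `html.parser` so that it runs with no extra packages. The class name `ParagraphCounter`, the helper `count_paragraphs`, and the sample HTML are all made up for illustration.

```python
from html.parser import HTMLParser


class ParagraphCounter(HTMLParser):
    """Count opening <p> tags as the parser walks the document."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case
        if tag == 'p':
            self.count += 1


def count_paragraphs(html):
    parser = ParagraphCounter()
    parser.feed(html)
    parser.close()
    return parser.count


sample = '<html><body><p>One</p><p align="center">Two</p><div>no</div></body></html>'
print('Total paragraphs count is:', count_paragraphs(sample))  # 2
```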

In [5]:
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')

# Retrieve all of the (p) tags
tags = soup('p')
count = len(tags)

print("\n")
print('Total paragraphs count is:', count)

Enter - https://www.w3.org/Protocols/


Total paragraphs count is: 25


***

### Exercise 5:
#### (Advanced) Change the socket program so that it only shows data after the headers and a blank line have been received. Remember that <font color="red">***recv***</font> receives characters (newlines and all), not lines.
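
Because `recv` returns arbitrary chunks, the blank line that ends the headers (`\r\n\r\n`) may be split across two calls. One way to handle that, sketched here under the assumption that the chunks arrive as an iterable of bytes, is to buffer data until the separator has been seen and pass everything after it straight through. The generator name `body_chunks` is hypothetical.

```python
def body_chunks(chunks):
    """Yield only the bytes that follow the blank line ending the headers."""
    buffer = b''
    in_body = False
    for data in chunks:
        if in_body:
            # Headers are already behind us: pass the data straight through
            yield data
            continue
        buffer += data
        pos = buffer.find(b'\r\n\r\n')
        if pos != -1:
            in_body = True
            rest = buffer[pos + 4:]
            if rest:
                yield rest


# The header/body separator is deliberately split across three chunks
fake = [b'HTTP/1.1 200 OK\r\nContent-Type: text/plain\r', b'\n\r', b'\nHello, ', b'world']
for part in body_chunks(fake):
    print(part.decode(), end='')
print()
```

Unlike slicing with `find` after the whole document has arrived, this version prints nothing at all if the separator never appears.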

In [6]:
import socket

geturl = input("Please enter a URL: ")

count = 0
site = b""

try:
    getsock = geturl.split('/')

    url = 'GET ' + geturl + ' HTTP/1.0\r\n\r\n'

    mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    mysock.connect((getsock[2], 80))
    cmd = url.encode()
    mysock.send(cmd)
    
    while True:
        data = mysock.recv(512)
        if len(data) < 1:
            break
        count += len(data)
        site += data

    mysock.close()
    
    # Look for the end of the header (2 CRLF = \r\n\r\n)
    pos = site.find(b"\r\n\r\n")
    
    print(site[pos+4:].decode())


except Exception:
    print('Error, improperly formatted or non-existent URL.')

Please enter a URL: https://www.w3.org/Protocols/
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
                       "http://www.w3.org/TR/html4/loose.dtd">
<html lang="en-US">
<head>
  <title>HTTP - Hypertext Transfer Protocol Overview</title>
  <link href="/StyleSheets/activities.css" rel="stylesheet" type="text/css">
</head>

<body>
<p><a href="../"><img src="../Icons/WWW/w3c_home" alt="W3C" border="0"
height="48" width="72"></a>&nbsp;<a href="../Architecture/"><img
src="../Icons/arch" alt="Architecture Domain" border="0" height="48"
width="212"></a> <img src="../Icons/WWW/HTTP48x" alt="HTTP" height="48"
width="48"></p>

<h1>HTTP - Hypertext Transfer Protocol</h1>

<p align="center"><a href="#News">News</a> | <a href="Activity.html">HTTP
Activity</a> | <a href="#Specs">Specs</a> | <a href="#Software">Software</a>
| <a href="#Talks">Talks</a> | <a href="#Lists">Mailing lists</a> | <a
href="#IETF">IETF</a> | <a href="HTTP/ietf-http-ext/">HTTP Extensions</a> |
<a hre

***

## Assignment 1
### Request-Response Cycle

#### Exploring the HyperText Transport Protocol

You are to retrieve the following document using the HTTP protocol in a way that you can examine the HTTP Response headers.

http://data.pr4e.org/intro-short.txt

There are two ways that you might retrieve this web page and look at the response headers:

- **Preferred**: Modify the <a href="https://www.py4e.com/code3/socket1.py?PHPSESSID=6a51929b35f8d515dd5b1c79c534fa8e">socket1.py</a> program to retrieve the above URL and print out the headers and data. Make sure to **change** the code to retrieve the above URL - the values are different for each URL.<br>
<br>
- Open the URL in a web browser with a developer console or FireBug and manually examine the headers that are returned.

Enter the header values in each of the fields below:

Last-Modified:

ETag:

Content-Length:

Cache-Control:

Content-Type:

In [1]:
# My program for the Assignment.

import socket

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('data.pr4e.org', 80))
cmd = 'GET http://data.pr4e.org/intro-short.txt HTTP/1.0\r\n\r\n'.encode()
mysock.send(cmd)

while True:
    data = mysock.recv(512)
    if len(data) < 1:
        break
    print(data.decode(),end='')

mysock.close()

HTTP/1.1 200 OK
Date: Fri, 26 Feb 2021 15:09:33 GMT
Server: Apache/2.4.18 (Ubuntu)
Last-Modified: Sat, 13 May 2017 11:22:22 GMT
ETag: "1d3-54f6609240717"
Accept-Ranges: bytes
Content-Length: 467
Cache-Control: max-age=0, no-cache, no-store, must-revalidate
Pragma: no-cache
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Connection: close
Content-Type: text/plain

Why should you learn to write programs?

Writing programs (or programming) is a very creative 
and rewarding activity.  You can write programs for 
many reasons, ranging from making your living to solving
a difficult data analysis problem to having fun to helping
someone else solve a problem.  This book assumes that 
everyone needs to know how to program, and that once 
you know how to program you will figure out what you want 
to do with your newfound skills.  


#### Answers:

Last-Modified: Sat, 13 May 2017 11:22:22 GMT

ETag: "1d3-54f6609240717"

Content-Length: 467

Cache-Control: max-age=0, no-cache, no-store, must-revalidate

Content-Type: text/plain
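
The header values can also be pulled out programmatically instead of being read off the printed output. The sketch below is one possible approach, not part of the assignment: `parse_response` is a hypothetical helper, and the raw bytes are an abbreviated copy of the response shown above.

```python
def parse_response(raw):
    """Split a raw HTTP response into status line, header dict, and body."""
    head, _, body = raw.partition(b'\r\n\r\n')
    lines = head.decode('iso-8859-1').split('\r\n')
    status_line = lines[0]
    headers = {}
    for line in lines[1:]:
        # Split on the first colon only, since values such as dates contain colons
        name, sep, value = line.partition(':')
        if sep:
            headers[name.strip().lower()] = value.strip()
    return status_line, headers, body


# Abbreviated copy of the response printed above
raw = (b'HTTP/1.1 200 OK\r\n'
       b'Last-Modified: Sat, 13 May 2017 11:22:22 GMT\r\n'
       b'ETag: "1d3-54f6609240717"\r\n'
       b'Content-Length: 467\r\n'
       b'Content-Type: text/plain\r\n'
       b'\r\n'
       b'Why should you learn to write programs?\n')

status, headers, body = parse_response(raw)
for name in ('last-modified', 'etag', 'content-length', 'content-type'):
    print(name + ':', headers[name])
```

Header names are lower-cased on the way in because HTTP header names are case-insensitive, and servers differ in how they capitalize them.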

***

## Assignment 2
### Scraping HTML Data with BeautifulSoup
In this assignment you will write a Python program similar to http://www.py4e.com/code3/urllink2.py. 

The program will use **urllib** to read the HTML from the data files below, parse the data, extract the numbers, and compute the sum of the numbers in the file.

We provide two files for this assignment. One is a sample file where we give you the sum for your testing and the other is the actual data you need to process for the assignment.

- Sample data: http://py4e-data.dr-chuck.net/comments_42.html (Sum=2553)
- Actual data: http://py4e-data.dr-chuck.net/comments_941921.html (Sum ends with 66)

You do not need to save these files to your folder since your program will read the data directly from the URL. **Note**: Each student will have a distinct data url for the assignment - so only use your own data url for analysis.

**Data Format**

The file is a table of names and comment counts. You can ignore most of the data in the file except for lines like the following:

In [None]:
<tr><td>Modu</td><td><span class="comments">90</span></td></tr>
<tr><td>Kenzie</td><td><span class="comments">88</span></td></tr>
<tr><td>Hubert</td><td><span class="comments">87</span></td></tr>

You are to find all the **span** tags in the file, pull out the number from each tag, and sum the numbers.
    
Look at the <a href="https://www.py4e.com/code3/urllink2.py?PHPSESSID=825a5179d623e96f8d4202c133cf91bb">sample code</a> provided. It shows how to find all of a certain kind of tag, loop through the tags and extract the various aspects of the tags.

In [None]:
...
# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    # Look at the parts of a tag
    print('TAG:', tag)
    print('URL:', tag.get('href', None))
    print('Contents:', tag.contents[0])
    print('Attrs:', tag.attrs)

You need to adjust this code to look for **span** tags and pull out the text content of the span tag, convert them to integers and add them up to complete the assignment.
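
The extract, convert, and sum steps can be tried offline on the three sample rows from the Data Format section. Note that this sketch swaps BeautifulSoup for a regular expression so it runs with only the standard library. The helper name `sum_comments` is made up, and the pattern assumes each span holds nothing but digits.

```python
import re

# The three sample rows from the Data Format section
rows = '''<tr><td>Modu</td><td><span class="comments">90</span></td></tr>
<tr><td>Kenzie</td><td><span class="comments">88</span></td></tr>
<tr><td>Hubert</td><td><span class="comments">87</span></td></tr>'''


def sum_comments(html):
    """Return (count, sum) of the integers found inside span tags."""
    numbers = [int(n) for n in re.findall(r'<span[^>]*>\s*(\d+)\s*</span>', html)]
    return len(numbers), sum(numbers)


count, total = sum_comments(rows)
print('Count', count)  # Count 3
print('Sum', total)    # Sum 265
```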

**Sample Execution**

In [None]:
$ python3 solution.py
Enter - http://py4e-data.dr-chuck.net/comments_42.html
Count 50
Sum 2...

**Turning in the Assignment**

Enter the sum from the actual data and your Python code below:<br>
Sum:

In [3]:
# My program for the Assignment.

from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')
html = urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, "html.parser")

tags = soup('span')

total = sum([int(i.contents[0]) for i in tags])

print(total)

Enter - http://py4e-data.dr-chuck.net/comments_941921.html
2066


**Answer:**<br>
Sum: 2066

***

## Assignment 3
### Following Links with BeautifulSoup

**Following Links in Python**

In this assignment you will write a Python program that expands on http://www.py4e.com/code3/urllinks.py. The program will use urllib to read the HTML from the data files below, extract the href= values from the anchor tags, scan for a tag that is in a particular position relative to the first name in the list, follow that link, repeat the process a number of times, and report the last name you find.

We provide two files for this assignment. One is a sample file where we give you the name for your testing and the other is the actual data you need to process for the assignment.

- Sample problem: Start at http://py4e-data.dr-chuck.net/known_by_Fikret.html<br>
Find the link at position **3** (the first name is 1). Follow that link. Repeat this process **4** times. The answer is the last name that you retrieve.<br>
Sequence of names: Fikret Montgomery Mhairade Butchi Anayah<br>
Last name in sequence: Anayah<br><br>

- Actual problem: Start at: http://py4e-data.dr-chuck.net/known_by_Shae.html<br>
Find the link at position **18** (the first name is 1). Follow that link. Repeat this process **7** times. The answer is the last name that you retrieve.<br>
Hint: The first character of the name of the last page that you will load is: A<br><br>

**Strategy**<br>
The web pages tweak the height between the links and hide the page after a few seconds to make it difficult for you to do the assignment without writing a Python program. Frankly, with a little effort and patience you can overcome these attempts to make it harder to complete the assignment by hand. But that is not the point. The point is to write a clever Python program to solve the problem.
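
The follow-the-link loop described in this assignment can be sketched without any network access by standing in a dictionary for the web pages. Everything in this example is hypothetical: the function `follow_links`, the `get_links` callback, and the page names `A` to `D` are placeholders rather than assignment data.

```python
def follow_links(start, position, count, get_links):
    """Follow the link at 1-based `position`, `count` times; return every page visited."""
    visited = [start]
    current = start
    for _ in range(count):
        links = get_links(current)
        # Positions are 1-based in the assignment, list indexes are 0-based
        current = links[position - 1]
        visited.append(current)
    return visited


# Hypothetical in-memory "pages": each key maps to the links found on that page
pages = {
    'A': ['B', 'C', 'D'],
    'B': ['A', 'C', 'D'],
    'C': ['A', 'B', 'D'],
    'D': ['A', 'B', 'C'],
}

print(follow_links('A', 3, 2, pages.get))  # ['A', 'D', 'C']
```

In a real solution, `get_links` would fetch the page with urllib and return the `href` values of its anchor tags. Keeping that step behind a callback makes the position arithmetic easy to test on its own.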

**Sample execution**

Here is a sample execution of a solution:

In [None]:
$ python3 solution.py
Enter URL: http://py4e-data.dr-chuck.net/known_by_Fikret.html
Enter count: 4
Enter position: 3
Retrieving: http://py4e-data.dr-chuck.net/known_by_Fikret.html
Retrieving: http://py4e-data.dr-chuck.net/known_by_Montgomery.html
Retrieving: http://py4e-data.dr-chuck.net/known_by_Mhairade.html
Retrieving: http://py4e-data.dr-chuck.net/known_by_Butchi.html
Retrieving: http://py4e-data.dr-chuck.net/known_by_Anayah.html

The answer to the assignment for this execution is "Anayah".

**Turning in the Assignment**

Enter the last name retrieved and your Python code below:

Name:

In [6]:
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter URL: ')
count = int(input('Enter count: '))
position = int(input('Enter position: '))

# Retrieve the start page plus `count` followed links
for i in range(count+1):
    print("Retrieving: ",url)
    html = urllib.request.urlopen(url, context=ctx).read()
    soup = BeautifulSoup(html, 'html.parser')
    tags = soup('a')
    # Collect the href value of every anchor tag on the page
    addresses = list()
    for tag in tags:
        addresses.append(tag.get('href', None))
    # Follow the link at the requested position (the first link is 1)
    url = addresses[position-1]

Enter URL: http://py4e-data.dr-chuck.net/known_by_Shae.html
Enter count: 7
Enter position: 18
Retrieving:  http://py4e-data.dr-chuck.net/known_by_Shae.html
Retrieving:  http://py4e-data.dr-chuck.net/known_by_Nabeeha.html
Retrieving:  http://py4e-data.dr-chuck.net/known_by_Cator.html
Retrieving:  http://py4e-data.dr-chuck.net/known_by_Kelsiee.html
Retrieving:  http://py4e-data.dr-chuck.net/known_by_Jeannie.html
Retrieving:  http://py4e-data.dr-chuck.net/known_by_Nidhi.html
Retrieving:  http://py4e-data.dr-chuck.net/known_by_Romi.html
Retrieving:  http://py4e-data.dr-chuck.net/known_by_Artemis.html


**Answer:**

Name: Artemis