## Exploring the HyperText Transfer Protocol

Retrieve the following document over HTTP in a way that lets you examine the HTTP response headers.

- http://data.pr4e.org/intro-short.txt

There are three ways that you might retrieve this web page and look at the response headers:

- Preferred: Modify the socket1.py program to retrieve the above URL and print out the headers and data. Make sure to change the code to retrieve the above URL - the values are different for each URL.
- Open the URL in a web browser with a developer console or FireBug and manually examine the headers that are returned.
- Use the telnet program as shown in lecture to retrieve the headers and content.

Enter the header values in each of the fields below and press "Submit". 
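Since the request has to change for each URL, here is a minimal sketch of deriving the host, port, and request bytes from a URL. The helper name `build_get_request` is hypothetical and not part of socket1.py. It sends the path plus a `Host` header rather than the full URL in the request line, on the assumption that the server accepts that form.

```python
from urllib.parse import urlsplit


def build_get_request(url):
    """Return (host, port, request_bytes) for a plain-HTTP GET of `url`."""
    parts = urlsplit(url)
    host = parts.hostname
    port = parts.port or 80          # fall back to the default HTTP port
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    # Request line, a Host header, then a blank line that ends the headers
    request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n"
    return host, port, request.encode()


host, port, request = build_get_request("http://data.pr4e.org/intro-short.txt")
print(host, port)
print(request)
```

The returned `host` and `port` can be passed to `connect()` and the bytes to `sendall()`, so retrieving a different document only requires changing the URL string.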

In [1]:
import re
import socket

# Open a TCP connection to the web server on port 80
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('data.pr4e.org', 80))
cmd = 'GET http://data.pr4e.org/intro-short.txt HTTP/1.0\r\n\r\n'.encode()
mysock.sendall(cmd)

# Collect the raw bytes first; decoding each chunk separately
# could split a multi-byte character across two recv() calls
raw = b''
while True:
    data = mysock.recv(600)
    if len(data) < 1:
        break
    raw = raw + data
mysock.close()

# Print only the response headers of interest
for line in raw.decode().splitlines():
    if re.search('^(Last-Modified|ETag|Content-Length|Cache-Control|Content-Type):', line):
        print(line)

Last-Modified: Sat, 13 May 2017 11:22:22 GMT
ETag: "1d3-54f6609240717"
Content-Length: 467
Cache-Control: max-age=0, no-cache, no-store, must-revalidate
Content-Type: text/plain
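As an alternative to filtering lines with a regular expression, the response can be split at the blank line that separates headers from data, and the headers collected into a dictionary. The sketch below is one way to do that; the function name `split_http_response` is hypothetical. The sample bytes reuse the header values printed above, while the status line and the body are placeholders assumed for illustration.

```python
def split_http_response(raw):
    """Split a raw HTTP response into (status_line, headers_dict, body_bytes)."""
    head, _, body = raw.partition(b"\r\n\r\n")   # a blank line ends the headers
    lines = head.decode("iso-8859-1").split("\r\n")
    status_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")     # split at the first colon only
        headers[name.strip().lower()] = value.strip()
    return status_line, headers, body


# Placeholder response: the status line and body are assumptions,
# the header values are the ones printed by the cell above.
sample = (
    b"HTTP/1.1 200 OK\r\n"
    b"Last-Modified: Sat, 13 May 2017 11:22:22 GMT\r\n"
    b'ETag: "1d3-54f6609240717"\r\n'
    b"Content-Length: 467\r\n"
    b"Cache-Control: max-age=0, no-cache, no-store, must-revalidate\r\n"
    b"Content-Type: text/plain\r\n"
    b"\r\n"
    b"(body omitted)"
)
status, headers, body = split_http_response(sample)
for name in ("last-modified", "etag", "content-length", "cache-control", "content-type"):
    print(name, "=", headers[name])
```

Header names are lowercased before storing because HTTP header names are case-insensitive, so lookups do not depend on how the server capitalizes them.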


## Scraping Numbers from HTML using BeautifulSoup 

In this assignment you will write a Python program similar to http://www.py4e.com/code3/urllink2.py. The program will use urllib to read the HTML from the data files below, parse the data, extract the numbers, and compute their sum.

We provide two files for this assignment. One is a sample file where we give you the sum for your testing and the other is the actual data you need to process for the assignment.

- Sample data: http://py4e-data.dr-chuck.net/comments_42.html (Sum=2553)
- Actual data: http://py4e-data.dr-chuck.net/comments_237568.html (Sum ends with 50)

You do not need to save these files to your folder since your program will read the data directly from the URL. Note: Each student will have a distinct data URL for the assignment, so use only your own data URL for analysis.

In [2]:
# To run this, you can install BeautifulSoup
# https://pypi.python.org/pypi/beautifulsoup4

# Or download the file
# http://www.py4e.com/code3/bs4.zip
# and unzip it in the same directory as this file

from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')
if url == '':
    url = 'http://py4e-data.dr-chuck.net/comments_42.html'
html = urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, "html.parser")

# Retrieve all of the span tags
tags = soup('span')
count = 0
suma = 0
for tag in tags:
    # Add the number inside each 'span' tag to the running sum
    suma = suma + int(tag.contents[0])
    count = count + 1
print(suma)
print(count)
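The same summing logic can be tried offline, without BeautifulSoup or a network connection. The sketch below swaps in the standard library's `html.parser.HTMLParser` instead of BeautifulSoup. The class name `SpanSum` and the sample markup are made up for illustration and are not taken from the assignment data files.

```python
from html.parser import HTMLParser


class SpanSum(HTMLParser):
    """Add up the integer text found inside <span> tags."""

    def __init__(self):
        super().__init__()
        self.in_span = False
        self.total = 0
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.in_span = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_span = False

    def handle_data(self, data):
        # Only count text that sits inside a span and is purely digits
        if self.in_span and data.strip().isdigit():
            self.total += int(data)
            self.count += 1


# Made-up markup for illustration only
sample_html = """
<table>
<tr><td>Alice</td><td><span class="comments">97</span></td></tr>
<tr><td>Bob</td><td><span class="comments">3</span></td></tr>
<tr><td>Carol</td><td><span class="comments">42</span></td></tr>
</table>
"""
parser = SpanSum()
parser.feed(sample_html)
print(parser.total, parser.count)
```

Checking a small hand-written page like this first makes it easier to trust the sum reported for the real data URL.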


## Following Links in Python

In this assignment you will write a Python program that expands on http://www.py4e.com/code3/urllinks.py. The program will use urllib to read the HTML from the data files below, extract the href= values from the anchor tags, and scan for a tag that is in a particular position relative to the first name in the list. It then follows that link, repeats the process a number of times, and reports the last name you find.

We provide two files for this assignment. One is a sample file where we give you the name for your testing and the other is the actual data you need to process for the assignment.

- Sample problem: Start at http://py4e-data.dr-chuck.net/known_by_Fikret.html
    Find the link at position 3 (the first name is 1). Follow that link. Repeat this process 4 times. The answer is the last name that you retrieve.
    Sequence of names: Fikret Montgomery Mhairade Butchi Anayah
    Last name in sequence: Anayah
    
- Actual problem: Start at: http://py4e-data.dr-chuck.net/known_by_Neco.html
    Find the link at position 18 (the first name is 1). Follow that link. Repeat this process 7 times. The answer is the last name that you retrieve.
    Hint: The first character of the name of the last page that you will load is: O

## Strategy

The web pages tweak the height between the links and hide the page after a few seconds to make it difficult to do the assignment without writing a Python program. With a little effort and patience you could overcome these obstacles by hand, but that is not the point. The point is to write a clever Python program to solve the problem.

In [4]:
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter URL: ')
count = int(input('Enter count: '))
pos = int(input('Enter position: ')) - 1

if url == '':
    url = 'http://py4e-data.dr-chuck.net/known_by_Neco.html'
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')

# Retrieve all of the anchor tags
tags = soup('a')
step = 1
while step <= count:
    print('Count number', step)
    print(url.split('_')[2].split('.')[0], "'s ", pos + 1, 'friend is:', tags[pos].contents[0], '\n')
    # Follow the link at the chosen position and parse the next page
    url = tags[pos].get('href', None)
    html = urllib.request.urlopen(url, context=ctx).read()
    soup = BeautifulSoup(html, 'html.parser')
    tags = soup('a')
    step = step + 1

Enter URL: 
Enter count: 7
Enter position: 18
Count number 1
Neco 's  18 friend is: Laughlan 

Count number 2
Laughlan 's  18 friend is: Arandeep 

Count number 3
Arandeep 's  18 friend is: Saoirse 

Count number 4
Saoirse 's  18 friend is: Priscillia 

Count number 5
Priscillia 's  18 friend is: Caie 

Count number 6
Caie 's  18 friend is: Tyson 

Count number 7
Tyson 's  18 friend is: Okeoghene 
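The position-following loop can also be checked without any network access by replacing the web pages with an in-memory mapping. This is only a sketch: the function name `follow_links` and the `pages` dictionary are hypothetical. In each list, only the third entry is chosen to reproduce the sample sequence from the assignment; the other entries are placeholders.

```python
def follow_links(pages, start, position, repeats):
    """Follow the link at 1-based `position` on each page, `repeats` times.

    `pages` maps a page name to the ordered list of names it links to.
    Returns the sequence of names visited, starting with `start`.
    """
    visited = [start]
    current = start
    for _ in range(repeats):
        current = pages[current][position - 1]   # convert to a 0-based index
        visited.append(current)
    return visited


# Hypothetical link lists standing in for the known_by_*.html pages
pages = {
    "Fikret": ["A", "B", "Montgomery"],
    "Montgomery": ["C", "D", "Mhairade"],
    "Mhairade": ["E", "F", "Butchi"],
    "Butchi": ["G", "H", "Anayah"],
}
sequence = follow_links(pages, "Fikret", position=3, repeats=4)
print(" ".join(sequence))
print("Last name in sequence:", sequence[-1])
```

Keeping the traversal separate from the page fetching makes the off-by-one between the 1-based position in the assignment and Python's 0-based indexing easy to verify against the sample answer.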

