# Ch 12 Networked Programs

Chapter 12 of Py4E covers networked programs, primarily using the HyperText Transfer Protocol (HTTP).

## Sockets

This chapter introduces the concept of a **socket**. Sockets will keep coming up, so they are important to understand.

In some ways, a socket is like a file handle: it provides access to the information, not the information itself. Unlike a file handle, though, a socket provides two-way communication for sending and receiving information.
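To make the two-way idea concrete, here is a minimal sketch that needs no network at all. It assumes `socket.socketpair()`, which is not used in the text; it simply hands back two sockets that are already connected to each other, so each end can both send and receive.

```python
import socket

# socketpair() returns two sockets that are already connected to each other,
# so we can try two-way communication without a server (an assumption made
# for illustration; the chapter connects to a real web server instead).
left, right = socket.socketpair()

left.sendall(b'hello')          # data flows left -> right
from_left = right.recv(1024)

right.sendall(b'hi back')       # and right -> left over the same connection
from_right = left.recv(1024)

print(from_left.decode(), '/', from_right.decode())

left.close()
right.close()
```

A file handle opened for reading only ever gives you data; here both ends of the connection send and receive.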

Here, we'll use sockets to connect to a web server and get the contents of a web page. Unlike opening a file on disk, this takes more coordination between your computer and the web server to transmit data, confirm receipt of data, etc. Later, we'll use sockets to connect to databases, where coordination and established protocols for sending and receiving data again come into play.

## Protocols

As described in the text, in order for two computers to communicate successfully, they need to follow some protocol, or established procedure for communicating. HTTP is one protocol. We looked briefly at SFTP (Secure File Transfer Protocol) earlier in the semester for transferring files from our computers to the cluster. Other protocols you are familiar with include Internet Message Access Protocol (IMAP), Post Office Protocol version 3 (POP3), and Simple Mail Transfer Protocol (SMTP), all used for email systems.

There are many protocols for different types of communication. The important thing is to establish which protocol is being used and to follow the specifications of that protocol.
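As a rough illustration of what "following the specification" means for HTTP, the sketch below builds the text of a simple GET request and picks apart the first line of a response. The function names `build_get_request` and `parse_status_line` are made up for this example, and the `Host` header is an addition of mine that the book's minimal request does not send.

```python
def build_get_request(host, path):
    """Return the bytes of a minimal HTTP/1.0 GET request (hypothetical helper)."""
    lines = [
        f'GET {path} HTTP/1.0',   # request line: method, path, protocol version
        f'Host: {host}',          # assumption: a Host header, not in the book's example
        '',                       # the blank line tells the server the request is done
        '',
    ]
    # HTTP requires lines to end in carriage return + newline (\r\n)
    return '\r\n'.join(lines).encode()


def parse_status_line(line):
    """Split a status line like 'HTTP/1.1 200 OK' into its three parts."""
    version, code, reason = line.split(' ', 2)
    return version, int(code), reason


request = build_get_request('data.pr4e.org', '/romeo.txt')
print(request)
print(parse_status_line('HTTP/1.1 200 OK'))
```

If either side strays from the agreed format, for example by leaving out the blank line, the other side will not know how to interpret what it receives.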

## 12.2 The world’s simplest web browser (p. 142)

Here's the code for socket1.py (remember these are in the code3 directory of the repository).

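```python
import socket

# Open a TCP connection to the web server on port 80 (HTTP)
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('data.pr4e.org', 80))

# Send the GET request; the blank line (\r\n\r\n) marks the end of the request
cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'.encode()
mysock.send(cmd)

# Read the response in chunks of up to 512 bytes until the server closes the connection
while True:
    data = mysock.recv(512)
    if len(data) < 1:
        break
    print(data.decode(), end='')

mysock.close()

# Code: http://www.py4e.com/code3/socket1.py
```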

```
HTTP/1.1 200 OK
Date: Wed, 06 May 2020 13:38:19 GMT
Server: Apache/2.4.18 (Ubuntu)
Last-Modified: Sat, 13 May 2017 11:22:22 GMT
ETag: "a7-54f6609245537"
Accept-Ranges: bytes
Content-Length: 167
Cache-Control: max-age=0, no-cache, no-store, must-revalidate
Pragma: no-cache
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Connection: close
Content-Type: text/plain

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief
```


Take a look at the `urljpeg.py` script, which downloads an image, and the information on the script in the text.
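An HTTP response separates its headers from its body with a blank line (`\r\n\r\n`), which is why a script that saves an image has to find that spot before writing anything to disk. Here is a small sketch of that step on a made-up response, so it runs without a network connection. The function name `split_response` and the sample bytes are my own, not from the text.

```python
def split_response(raw):
    """Split raw HTTP response bytes into (status line, headers dict, body bytes)."""
    pos = raw.find(b'\r\n\r\n')          # the blank line that ends the headers
    if pos == -1:
        raise ValueError('no blank line found; headers may be incomplete')

    head = raw[:pos].decode()
    body = raw[pos + 4:]                 # skip the 4 bytes of \r\n\r\n

    status, *header_lines = head.split('\r\n')
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(':')
        headers[name.strip().lower()] = value.strip()
    return status, headers, body


# A made-up response for illustration (not real server output)
raw = (b'HTTP/1.1 200 OK\r\n'
       b'Content-Type: text/plain\r\n'
       b'Content-Length: 49\r\n'
       b'\r\n'
       b'But soft what light through yonder window breaks\n')

status, headers, body = split_response(raw)
print(status)
print(headers['content-type'], headers['content-length'])
print(body.decode(), end='')
```

Note that the body stays as bytes. For a text file you can `decode()` it, but for an image you would write the bytes to a file opened in binary mode.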

## 12.4 Retrieving web pages with `urllib`

The `urllib` module makes getting stuff from the web a bit easier. The `socket1.py` script above can be simplified to:

```python
import urllib.request

# urlopen() returns a file-like object; iterate over it line by line
fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')
for line in fhand:
    print(line.decode().strip())

# Code: http://www.py4e.com/code3/urllib1.py
```

```
But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief
```


Notice that `urllib` sits between the script and the socket, making an open socket look just like a file handle, so we can treat the web page in much the same way as we treat a local file.

## 12.7 Parsing HTML using regular expressions (p. 149)

We could write our own scripts to parse HTML looking for information. The example here uses a regular expression to find links and build a list of the links on a page.

Run `urlregex.py` on some site, google.com for example:

```bash
[magitz@login2 code3]$ python3 urlregex.py
Enter - http://google.com
http://www.google.com/imghp?hl=en&tab=wi
http://maps.google.com/maps?hl=en&tab=wl
https://play.google.com/?hl=en&tab=w8
http://www.youtube.com/?gl=US&tab=w1
http://news.google.com/nwshp?hl=en&tab=wn
https://mail.google.com/mail/?tab=wm
https://drive.google.com/?tab=wo
https://www.google.com/intl/en/options/
http://www.google.com/history/optout?hl=en
https://accounts.google.com/ServiceLogin?hl=en&passive=true&continue=http://www.google.com/
https://plus.google.com/116899029375914044550
[magitz@login2 code3]$ 
```

Unfortunately, not all web pages follow HTML guidelines strictly, and some pages can be really hard to parse. Tags can be upper or lower case, some closing tags are optional, etc. It can quickly get quite complex.
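To see how quickly this gets fragile, here is a sketch that runs a pattern along the lines of the one in `urlregex.py` against three links that mean the same thing to a browser but are written differently. The sample HTML and the "relaxed" pattern are made up for illustration, and even the relaxed version is far from a complete HTML parser.

```python
import re

# Three equivalent ways a page might write a link (made-up sample HTML)
sample = (b'<a href="http://www.example.com/page1">One</a>\n'
          b'<A HREF="http://www.example.com/page2">Two</A>\n'
          b"<a href='http://www.example.com/page3'>Three</a>\n")

# Lower-case attribute name and double quotes only
strict = re.findall(b'href="(http[s]?://.*?)"', sample)

# Also accept upper case and single quotes; still misses unquoted or relative links
relaxed = re.findall(b'href=["\'](https?://.*?)["\']', sample, re.IGNORECASE)

print(len(strict), 'link(s) found by the strict pattern')
print(len(relaxed), 'link(s) found by the relaxed pattern')
for link in relaxed:
    print(link.decode())
```

Every new variation you run into means another tweak to the pattern, which is a good sign that it is time to reach for a real parser.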

## 12.8 Parsing HTML using BeautifulSoup

As I mentioned earlier, whatever you are trying to do, look for a module to make your life easier. If you need to parse HTML, don't start by writing your own script to do it; look at the available modules first. One is `BeautifulSoup`, available from crummy.com. Some people sure have fun naming things!


```python
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')

# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    print(tag.get('href', None))

# Code: http://www.py4e.com/code3/urllinks.py
```


```
https://www.google.com/imghp?hl=en&tab=wi
https://maps.google.com/maps?hl=en&tab=wl
https://play.google.com/?hl=en&tab=w8
https://www.youtube.com/?gl=US&tab=w1
https://news.google.com/nwshp?hl=en&tab=wn
https://mail.google.com/mail/?tab=wm
https://drive.google.com/?tab=wo
https://www.google.com/intl/en/about/products?tab=wh
http://www.google.com/history/optout?hl=en
/preferences?hl=en
https://accounts.google.com/ServiceLogin?hl=en&passive=true&continue=https://www.google.com/
/search?ie=UTF-8&q=popular+Google+Doodle+games&oi=ddle&ct=153499290&hl=en&sa=X&ved=0ahUKEwiY6NeisZ_pAhXClXIEHaVpCZMQPQgD
/advanced_search?hl=en&authuser=0
https://www.google.com/url?q=https://www.youtube.com/stayhome%3Futm_source%3Dgoogle%26utm_medium%3Dhppromo%26utm_campaign%3DHelpathomeYTUS&source=hpp&id=19017550&ct=3&usg=AFQjCNE2FZizHR5ncV3c9xnzo6f1UpGKmQ&sa=X&ved=0ahUKEwiY6NeisZ_pAhXClXIEHaVpCZMQ8IcBCAU
/intl/en/ads/
/services/
/intl/en/about.html
/intl/en/policies/privacy/
/intl/en/policies/terms/
```
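
Notice that the list mixes full URLs with relative paths such as `/preferences?hl=en`. If you wanted to follow those links, you would first need to turn them into full URLs. One possible way, sketched below, is `urljoin` from the standard library's `urllib.parse`. The base URL here is an assumption (the address typed at the prompt is not shown in the output), and the `None` entry stands in for an anchor tag with no `href`, since `tag.get('href', None)` returns `None` in that case.

```python
from urllib.parse import urljoin

# Assumption: the page was fetched from this address
base = 'https://www.google.com/'

# A few hrefs like the ones printed above, plus None for a tag with no href
hrefs = [
    'https://mail.google.com/mail/?tab=wm',
    '/preferences?hl=en',
    '/intl/en/ads/',
    None,
]

# urljoin leaves full URLs alone and resolves relative ones against the base
absolute = [urljoin(base, href) for href in hrefs if href is not None]

for link in absolute:
    print(link)
```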
