## Network Technology

### Transmission Control Protocol (TCP)

- Built on top of IP (Internet Protocol)<br>
- Assumes IP might lose some data in transit and retransmits data that seems to be lost<br>
- Handles "flow control" using a transmit window<br>
- Provides a nice reliable pipe<br>

### TCP Connections / Sockets

"In computer networking, an Internet socket or network socket is an endpoint of a bidirectional inter-process communication flow across an Internet Protocol-based computer network, such as the Internet."

### TCP Port Numbers

- A port is an application-specific or process-specific software communication endpoint<br>
- It allows multiple networked applications to coexist on the same server<br>
- There is a list of well-known TCP port numbers

### Common TCP Ports
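
As a rough illustration, a few of the well-known port assignments can be written as a small lookup table. This is only a hand-picked subset; the dictionary name `WELL_KNOWN_TCP_PORTS` and the helper `describe_port` are made up for this sketch.

```python
# A hand-picked subset of well-known TCP ports (not the full registry).
WELL_KNOWN_TCP_PORTS = {
    21: 'FTP (file transfer)',
    22: 'SSH (secure shell)',
    23: 'Telnet (remote login)',
    25: 'SMTP (mail transfer)',
    53: 'DNS (domain name lookups)',
    80: 'HTTP (web)',
    110: 'POP3 (mail retrieval)',
    143: 'IMAP (mail retrieval)',
    443: 'HTTPS (secure web)',
}

def describe_port(port):
    """Return a short description of a port, or 'unknown' if it is not in the table."""
    return WELL_KNOWN_TCP_PORTS.get(port, 'unknown')

print(describe_port(80))
print(describe_port(443))
```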

### Sockets in Python

Python has built-in support for TCP sockets. Socket programming is a way of connecting two nodes on a network so they can communicate with each other. One socket (node) listens on a particular port at an IP address, while the other socket reaches out to it to form a connection. The server creates the listener socket, while the client reaches out to the server.<br><br>
Sockets are the real backbone behind web browsing: the browser is the client and the web server is the server.
Socket programming starts by importing the socket library and creating a simple socket.

In [None]:
import socket
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

Here we made a socket instance and passed it two parameters. The first parameter is <b> AF_INET </b> and the second one is <b> SOCK_STREAM. </b> AF_INET refers to the IPv4 address family. SOCK_STREAM means the connection-oriented TCP protocol.
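
To see the listener and client roles side by side, here is a minimal sketch that runs both on the local machine. It assumes the loopback address 127.0.0.1 is available; the function name `run_echo_once` and the upper-casing behaviour are invented for the demonstration.

```python
import socket
import threading

def run_echo_once(server_sock):
    """Accept one connection, send back what it receives in upper case, then close."""
    conn, _addr = server_sock.accept()
    with conn:
        data = conn.recv(1024)
        conn.sendall(data.upper())

# Listener socket: bind to the loopback address; port 0 asks the OS for any free port.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(('127.0.0.1', 0))
server.listen(1)
port = server.getsockname()[1]

# Run the listener in a background thread so the client can run in this one.
worker = threading.Thread(target=run_echo_once, args=(server,))
worker.start()

# Client socket: reach out to the listener and exchange one short message.
client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(('127.0.0.1', port))
client.sendall(b'hello')
reply = client.recv(1024)
client.close()

worker.join()
server.close()
print(reply)
```

The server side needs `bind()`, `listen()` and `accept()`, while the client side only needs `connect()`.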

### Application Protocol

Since TCP (and Python) gives us a reliable socket, what do we want to do with the socket? What problem do we want to solve?

Application-level network protocols can also be used through higher-level Python libraries. Examples of these protocols are HTTP and FTP.

### HTTP - HyperText Transfer Protocol

- The dominant Application Layer Protocol on the Internet<br>
- Invented for the Web - to retrieve HTML, images, documents, etc.<br>
- Extended to carry data in addition to documents - RSS, Web Services, etc.<br>
- Basic concept: make a connection, request a document, retrieve the document, close the connection.

The HyperText Transfer Protocol is the set of rules that allows browsers to retrieve web documents from servers over the Internet.

### What is a Protocol?

A protocol is a set of rules that all parties follow so we can predict each other's behaviour and not bump into each other. The Internet Protocol is designed to implement a uniform system of addresses on all of the Internet-connected computers everywhere and to make it possible for packets to travel from one end of the Internet to the other. A program like the web browser should be able to connect to a host anywhere without ever knowing which maze of network devices each packet is traversing on its journey. There are various categories of Internet protocols, created to serve the needs of different types of data communication between computers on the Internet.

### Getting Data From The Server

Each time the user clicks on an anchor tag with an href= value to switch to a new page, the browser makes a connection to the web server and issues a "GET" request - to GET the content of the page at the specified URL. The server returns the HTML document to the browser which formats and displays the document to the user.
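
As a sketch of the first step, the URL in an href tells the browser which host and port to connect to and which path to request. The helper name `connection_target` is made up here, and the sketch ignores query strings for brevity.

```python
from urllib.parse import urlsplit

def connection_target(url):
    """Split a URL into the (host, port, path) a browser would use for a GET request."""
    parts = urlsplit(url)
    # Use the explicit port if the URL has one, otherwise the default for the scheme.
    port = parts.port or (443 if parts.scheme == 'https' else 80)
    path = parts.path or '/'
    return parts.hostname, port, path

print(connection_target('http://data.pr4e.org/romeo.txt'))
```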

### Internet Standards

- The standards for the Internet protocols are developed by an organization called the Internet Engineering Task Force (IETF)<br>
- Standards are called "RFCs" - "Request for Comments".

### An HTTP Request in Python

In [6]:
import socket

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('data.pr4e.org', 80))
cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'.encode()
mysock.send(cmd)

while True:
    data = mysock.recv(512)
    if len(data) < 1:
        break
    print(data.decode())

mysock.close()

HTTP/1.1 200 OK
Date: Fri, 07 Aug 2020 09:22:05 GMT
Server: Apache/2.4.18 (Ubuntu)
Last-Modified: Sat, 13 May 2017 11:22:22 GMT
ETag: "a7-54f6609245537"
Accept-Ranges: bytes
Content-Length: 167
Cache-Control: max-age=0, no-cache, no-store, must-revalidate
Pragma: no-cache
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Connection: close
Content-Type: text/plain

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already s
ick and pale with grief



Here, <b> data.pr4e.org </b> is the host and <b> 80 </b> is the port.<br>
<b> mysock.recv() </b> is the call that actually reads the data.<br>
<b> href="(.+)" </b> is a regular expression that extracts the URL from an anchor tag in a line of HTML.<br>


First the program makes a connection to port 80 on the server data.pr4e.org. Since our program is playing the role of the “web browser”, the HTTP protocol says we must send the GET command followed by a blank line. \r\n signifies an EOL (end of line), so \r\n\r\n signifies nothing between two EOL sequences. That is the equivalent of a blank line.
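
The request bytes can also be assembled line by line, which makes the blank line explicit. This is a sketch only: the function name `build_get_request` is invented, and it uses a path plus a Host header, a common form that differs slightly from the full-URL request line used in the program above.

```python
def build_get_request(host, path):
    """Build a minimal HTTP/1.0 GET request as bytes, ready for socket.send()."""
    lines = [
        'GET ' + path + ' HTTP/1.0',   # request line
        'Host: ' + host,               # names the site being requested
        '',                            # together with the next entry, this produces
        '',                            # the blank line that ends the request
    ]
    return '\r\n'.join(lines).encode()

request = build_get_request('data.pr4e.org', '/romeo.txt')
print(request)
```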

Once we send that blank line, we write a loop that receives data in chunks of up to 512 bytes from the socket and prints the data out until there is no more data to read (i.e., recv() returns an empty bytes object).
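
Printing each chunk as it arrives can split a line in the middle, as in "Who is already s" / "ick and pale with grief" in the output above. One possible alternative, sketched below, is to collect the chunks and join them before decoding. The helper name `recv_all` is made up, and a local socket pair stands in for the web server so the sketch runs without a network.

```python
import socket

def recv_all(sock, chunk_size=512):
    """Read from a socket until the peer closes it; return everything as one bytes object."""
    chunks = []
    while True:
        data = sock.recv(chunk_size)
        if len(data) < 1:      # empty bytes object: the other side closed the connection
            break
        chunks.append(data)
    return b''.join(chunks)

# Demonstrate with a connected pair of local sockets instead of a real web server.
left, right = socket.socketpair()
left.sendall(b'x' * 1200)   # more than two 512-byte chunks
left.close()                # closing signals "no more data" to the reader
payload = recv_all(right)
right.close()
print(len(payload))
```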

After the server sends us the headers, it adds a blank line to indicate the end of the headers, and then sends the actual data of the file romeo.txt.<br><br>
This example shows how to make a low-level network connection with sockets. Sockets can be used to communicate with a web server or with a mail server or many other kinds of servers. All that is needed is to find the document which describes the protocol and write the code to send and receive the data according to the protocol.<br><br>
However, since the protocol that we use most commonly is the HTTP web protocol, Python has a special library specifically designed to support the HTTP protocol for the retrieval of documents and data over the web.<br> <br>
One of the requirements for using the HTTP protocol is the need to send and receive data as bytes objects, instead of strings. In the preceding example, the encode() and decode() methods convert strings into bytes objects and back again.
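
Since the headers and the body are separated by the first blank line, a response can be split on the bytes `\r\n\r\n` and then decoded. The sketch below assumes the whole response is already in memory; `split_response` is a made-up name and the sample bytes are a shortened, hand-written response rather than real server output.

```python
def split_response(raw):
    """Split raw HTTP response bytes into (status line, header dict, body text)."""
    head, _sep, body = raw.partition(b'\r\n\r\n')   # the first blank line ends the headers
    header_lines = head.decode().split('\r\n')
    status_line = header_lines[0]
    headers = {}
    for line in header_lines[1:]:
        name, _colon, value = line.partition(':')
        headers[name.strip()] = value.strip()
    return status_line, headers, body.decode()

# Hand-written sample response for illustration.
sample = (b'HTTP/1.1 200 OK\r\n'
          b'Content-Type: text/plain\r\n'
          b'Content-Length: 5\r\n'
          b'\r\n'
          b'hello')
status, headers, body = split_response(sample)
print(status)
print(headers['Content-Type'])
print(body)
```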

## Unicode Characters and Strings

### Representing Simple Strings

- Each character is represented by a number between 0 and 255 stored in 8 bits of memory <br>
- We refer to 8 bits of memory as a "byte" of memory (e.g. my disk drive contains 3 terabytes of storage) <br>
- The ord() function tells us the numeric value of a simple ASCII character

In [7]:
print(ord("G"))

71


In [8]:
print(ord("\n"))

10


In [9]:
print(ord("h"))

104


### Multi-Byte Characters

To represent the wide range of characters used around the world, we represent characters with more than one byte.

- UTF-16 - Variable length - Two or four bytes <br>
- UTF-32 - Fixed length - Four bytes <br>
- UTF-8  - 1-4 bytes <br>
    * Upwards compatible with ASCII <br>
    * Automatic detection between ASCII and UTF-8 <br>
    * UTF-8 is recommended practice for encoding data to be exchanged between systems <br>
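
A quick way to compare the encodings is to measure how many bytes each one needs for the same character. The helper name `encoded_sizes` is made up, and the `-le` codec variants are used so that no byte-order mark is added to the counts. Characters outside the Basic Multilingual Plane, such as the emoji below, take four bytes in UTF-16.

```python
def encoded_sizes(ch):
    """Return the number of bytes needed to store one character in each encoding."""
    return {enc: len(ch.encode(enc)) for enc in ('utf-8', 'utf-16-le', 'utf-32-le')}

samples = [
    'a',            # plain ASCII letter
    '\u00e9',       # e with acute accent
    '\u20ac',       # euro sign
    '\U0001F600',   # an emoji outside the Basic Multilingual Plane
]
for ch in samples:
    print(repr(ch), encoded_sizes(ch))
```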

In [10]:
x = u'abc'
type(x)

str

In [11]:
x = b'abc'
type(x)

bytes

- In Python 3, all strings are Unicode internally<br>
- Working with string variables in Python programs and reading data from files usually "just works"<br>
- When we talk to a network resource using sockets or talk to a database we have to encode and decode data (usually to UTF-8)

### Python Strings to Bytes

- When we talk to an external resource like a network socket we send bytes, so we need to encode Python 3 strings into a given character encoding.<br>
- When we read data from an external resource, we must decode it based on the character set so it is properly represented in Python 3 as a string.
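
A small round trip shows both directions, and what can go wrong when the wrong character set is used for decoding. The sample word is arbitrary.

```python
text = 'caf\u00e9'                  # the last character (e with acute accent) is not ASCII
data = text.encode('utf-8')         # str -> bytes, ready to send over a socket
print(data)

restored = data.decode('utf-8')     # bytes -> str, using the same encoding
print(restored == text)

wrong = data.decode('latin-1')      # decoding with a different character set garbles the text
print(len(text), len(wrong))
```

The accented character takes two bytes in UTF-8, so decoding those bytes as Latin-1 produces two unrelated characters instead of one.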

## Using urllib in Python

Since HTTP is so common, Python has a library, urllib, that does all the socket work for us and makes web pages look like a file.

In [12]:
import urllib.request,urllib.parse,urllib.error
fhand = urllib.request.urlopen("http://data.pr4e.org/romeo.txt")
for line in fhand:
    print(line.decode().strip())

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief


<b> urllib.request.urlopen() </b> returns an object that behaves much like a file handle, so we can loop over it line by line.
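
The programs here import urllib.error but never use it. One possible use, sketched below, is to catch failures when a URL cannot be opened. The helper name `fetch_lines` is made up, and a `data:` URL (which carries its content inline) stands in for a web page so the sketch runs without a network connection.

```python
import urllib.request
import urllib.error

def fetch_lines(url):
    """Return the decoded lines of a resource, or an empty list if it cannot be opened."""
    try:
        with urllib.request.urlopen(url) as fhand:
            return [line.decode().strip() for line in fhand]
    except urllib.error.URLError as err:
        print('Could not retrieve', url, '-', err.reason)
        return []

# A data: URL carries its content inline (%20 is a space, %0A is a newline).
lines = fetch_lines('data:text/plain,But%20soft%0AIt%20is%20the%20east')
print(lines)

# An unsupported URL scheme raises URLError, which the helper turns into an empty list.
missing = fetch_lines('nosuchscheme://example')
print(missing)
```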

In [13]:
import urllib.request,urllib.parse,urllib.error

fhand = urllib.request.urlopen("http://data.pr4e.org/romeo.txt")

counts = dict()
for line in fhand:
    words = line.decode().split()
    for word in words:
        counts[word] = counts.get(word, 0)+1
print(counts)

{'But': 1, 'soft': 1, 'what': 1, 'light': 1, 'through': 1, 'yonder': 1, 'window': 1, 'breaks': 1, 'It': 1, 'is': 3, 'the': 3, 'east': 1, 'and': 3, 'Juliet': 1, 'sun': 2, 'Arise': 1, 'fair': 1, 'kill': 1, 'envious': 1, 'moon': 1, 'Who': 1, 'already': 1, 'sick': 1, 'pale': 1, 'with': 1, 'grief': 1}
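
The same counting can be written with `collections.Counter` from the standard library. In this sketch the helper name `count_words` is made up, and two lines of the romeo.txt output above, written as bytes, stand in for the network handle.

```python
from collections import Counter

def count_words(byte_lines):
    """Count words across an iterable of bytes lines, such as an open urlopen() handle."""
    counts = Counter()
    for line in byte_lines:
        counts.update(line.decode().split())
    return counts

# Two lines of romeo.txt as bytes, standing in for the handle returned by urlopen().
sample = [b'But soft what light through yonder window breaks\n',
          b'It is the east and Juliet is the sun\n']
counts = count_words(sample)
print(counts.most_common(2))
```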


## Parsing a Web Page

### What is Web Scraping?

- Web scraping is when a program or script pretends to be a browser and retrieves web pages, looks at those web pages, extracts information, and then looks at more web pages.<br>
- Before scraping a web site you should check that the web site allows scraping.<br>
- Search engines scrape web pages - we call this "spidering the web" or "web crawling".
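
One common place where a site states what crawlers may fetch is its robots.txt file, and the standard library module urllib.robotparser can interpret those rules. In the sketch below the rules, the user agent name `my-scraper` and the example.com URLs are all made up; a real program would download the file from the site. A site's terms of service may add restrictions that robots.txt does not express.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real program would download it from the site.
robots_lines = [
    'User-agent: *',
    'Disallow: /private/',
]

parser = RobotFileParser()
parser.parse(robots_lines)

allowed = parser.can_fetch('my-scraper', 'http://example.com/index.html')
blocked = parser.can_fetch('my-scraper', 'http://example.com/private/data.html')
print(allowed, blocked)
```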

### Why Scrape?

- Pull data - particularly social data - who links to who? <br>
- Get your own data back out of some system that has no "export capability"<br>
- Monitor a site for new information<br>
- Spider the web to make a database for a search engine

### Scraping Web Pages

- There is some controversy about web page scraping and some sites are a bit snippy about it.<br>
- Republishing copyrighted information is not allowed.<br>
- Violating terms of service is not allowed.

In [14]:
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

link = input('Enter URL: ')
cont = int(input('Enter count: '))
line = int(input('Enter position: '))

print('Retrieving: %s' % link)
for i in range(0, cont):
    html = urllib.request.urlopen(link, context=ctx).read()
    soup = BeautifulSoup(html, 'html.parser')
    # Collect every anchor tag, then follow the one at the requested 1-based position
    tags = soup('a')
    ps = 0
    for tag in tags:
        ps += 1
        if ps == line:
            link = str(tag.get('href', None))
            print('Retrieving: %s' % link)
            break

Enter URL: http://py4e-data.dr-chuck.net/comments_42.html
Enter count: 50
Enter position: 18
Retrieving: http://py4e-data.dr-chuck.net/comments_42.html


Problem: Start at: http://py4e-data.dr-chuck.net/known_by_Rogan.html

Find the link at position 18 (the first name is 1). Follow that link. Repeat this process 7 times. The answer is the last name that you retrieve.

Hint: The first character of the name of the last page that you will load is: G

In [1]:
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

# Read the starting URL, the number of links to follow, and the link position
link = input('Enter URL: ')
cont = int(input('Enter count: '))
line = int(input('Enter position: '))

print('Retrieving: %s' % link)
for i in range(0, cont):
    html = urllib.request.urlopen(link, context=ctx).read()
    soup = BeautifulSoup(html, 'html.parser')

    # Collect every anchor tag, then follow the one at the requested 1-based position
    tags = soup('a')
    ps = 0
    for tag in tags:
        ps += 1
        if ps == line:
            link = str(tag.get('href', None))
            print('Retrieving: %s' % link)
            break

Enter URL: http://py4e-data.dr-chuck.net/known_by_Rogan.html
Enter count: 7
Enter position: 18
Retrieving: http://py4e-data.dr-chuck.net/known_by_Rogan.html
Retrieving: http://py4e-data.dr-chuck.net/known_by_Nicodemus.html
Retrieving: http://py4e-data.dr-chuck.net/known_by_Luic.html
Retrieving: http://py4e-data.dr-chuck.net/known_by_Macy.html
Retrieving: http://py4e-data.dr-chuck.net/known_by_Dolci.html
Retrieving: http://py4e-data.dr-chuck.net/known_by_Alber.html
Retrieving: http://py4e-data.dr-chuck.net/known_by_Allen.html
Retrieving: http://py4e-data.dr-chuck.net/known_by_Grzegorz.html
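
The "link at position N" step can be tried without a network connection. The sketch below deliberately uses the standard library's `html.parser.HTMLParser` instead of BeautifulSoup so that it has no third-party dependency; the class `LinkCollector`, the helper `link_at_position` and the sample page are all made up. Unlike the program above, it reports explicitly when the page has fewer links than the requested position.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href value of every anchor tag, in document order."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href is not None:
                self.links.append(href)

def link_at_position(html, position):
    """Return the link at a 1-based position, or None if the page has fewer links."""
    collector = LinkCollector()
    collector.feed(html)
    if 1 <= position <= len(collector.links):
        return collector.links[position - 1]
    return None

# Made-up page with three links.
page = ('<ul><li><a href="a.html">A</a></li>'
        '<li><a href="b.html">B</a></li>'
        '<li><a href="c.html">C</a></li></ul>')
print(link_at_position(page, 2))
print(link_at_position(page, 5))
```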


### Scraping Numbers from HTML using BeautifulSoup

The program uses urllib to read the HTML from the URL entered below, parses the data, extracts the numbers inside the span tags, and computes their sum.

In [2]:
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')

# Add up the number inside every <span> tag
tags = soup('span')
total = 0
for tag in tags:
    total = total + int(tag.contents[0])
print(total)

Enter - http://py4e-data.dr-chuck.net/comments_704348.html
2713
