<h1> Chap 2: Data Processing via Internet </h1>

<h3> 2.1 Reading Web Pages Using HTTP </h3>
- The mechanics of retrieving information from web pages using the HyperText Transfer Protocol (HTTP), and of extracting data from what is retrieved

<h4> HyperText Transfer Protocol - HTTP </h4>

- HTTP is a set of rules that facilitates the transfer and exchange of files, such as web pages, over the internet


- Python has basic built-in support for network connections through the socket module. Working with sockets shows how a network connection is made and how data is retrieved in a Python program.

- A socket is like a file, except that a single socket provides a two-way connection between two programs.
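
The two-way nature of a socket can be seen without touching the network. The sketch below assumes `socket.socketpair()` from the standard library, which returns two sockets that are already connected to each other; the names `left` and `right` are illustrative and not part of the original notes.

```python
import socket

# socketpair() returns two sockets that are already connected to each other.
left, right = socket.socketpair()

# Each end can both send and receive: that is the "two-way" part.
left.sendall(b'ping')
received_by_right = right.recv(512)

right.sendall(b'pong')
received_by_left = left.recv(512)

print(received_by_right.decode(), received_by_left.decode())

left.close()
right.close()
```

A file opened for reading only flows one way; here the same object is used for both directions, which is what a client needs to send a request and then read the reply.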


In [3]:
# Retrieve a Web Page with Socket

import socket 

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('data.pr4e.org', 80))
cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'.encode()
mysock.send(cmd)

while True:
    data = mysock.recv(512)
    if len(data) < 1:
        break
    print(data.decode())

mysock.close()

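# This example illustrates how a 
# low-level network connection can be made with sockets.
# Data arrives in chunks of up to 512 bytes, so printing each chunk
# separately can break a line of text in the middle.
# Use sockets when there is a need for 
# customised communication protocols.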

HTTP/1.1 200 OK
Date: Tue, 24 Jan 2023 05:34:30 GMT
Server: Apache/2.4.18 (Ubuntu)
Last-Modified: Sat, 13 May 2017 11:22:22 GMT
ETag: "a7-54f6609245537"
Accept-Ranges: bytes
Content-Length: 167
Cache-Control: max-age=0, no-cache, no-store, must-revalidate
Pragma: no-cache
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Connection: close
Content-Type: text/plain

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already s
ick and pale with grief
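
The output above shows the response headers, a blank line, then the body, and it also shows a line of text broken at a chunk boundary. One possible way to handle both is to join the chunks first and split the response once. This is only a sketch: the `chunks` list is a shortened, hand-typed stand-in for what successive `recv` calls return, and `split_response` is a hypothetical helper name.

```python
def split_response(raw: bytes):
    """Split a raw HTTP response into (status line, headers dict, body text)."""
    # Headers and body are separated by an empty line (\r\n\r\n).
    head, _, body = raw.partition(b'\r\n\r\n')
    lines = head.decode().split('\r\n')
    status_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(':')
        headers[name.strip()] = value.strip()
    return status_line, headers, body.decode()

# Stand-in for the pieces returned by successive recv() calls (assumption).
chunks = [
    b'HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\n',
    b'But soft what light through yonder window breaks\nWho is already s',
    b'ick and pale with grief\n',
]

status_line, headers, body = split_response(b''.join(chunks))
print(status_line)
print(headers['Content-Type'])
print(body)
```

Joining before decoding also avoids decoding errors when a multi-byte character happens to straddle two chunks.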



In [10]:
# Retrieve Web Pages with urllib
# For common everyday tasks such as retrieving web pages and media files,
# we can use the Python urllib library.
# With urllib, the web page is treated like a file:
# we indicate which web page to retrieve, and
# urllib handles the rest of the HTTP protocol and header details.

import urllib.request
file = urllib.request.urlopen('http://data.pr4e.org/romeo.txt') 
for line in file:
    print(line.decode().strip())

# We only see the contents of the file in the output:
# headers are still sent by the server, but iterating over the response yields only the data 

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief


In [18]:
# Store the data in a variable mytxt, using a list comprehension

import urllib.request
file = urllib.request.urlopen('http://data.pr4e.org/romeo.txt') 
mytxt = [line.decode().strip() for line in file]
print(mytxt)
print(mytxt[0])

['But soft what light through yonder window breaks', 'It is the east and Juliet is the sun', 'Arise fair sun and kill the envious moon', 'Who is already sick and pale with grief']
But soft what light through yonder window breaks
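
Once the lines are in a list, ordinary Python tools apply to them. As a small sketch, the block below counts word frequencies with `collections.Counter`; the list is typed in from the output above so that the example runs without a network connection.

```python
from collections import Counter

# The four lines retrieved above, typed in so the example runs offline.
mytxt = [
    'But soft what light through yonder window breaks',
    'It is the east and Juliet is the sun',
    'Arise fair sun and kill the envious moon',
    'Who is already sick and pale with grief',
]

# Split every line into words and count them, ignoring case.
counts = Counter(word.lower() for line in mytxt for word in line.split())

print(counts['the'], counts['sun'])
print(len(counts), 'distinct words')
```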


<h4> Retrieve Images with urllib </h4>

In [42]:
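# Download an image and save it to the current folder

import urllib.request

# The with statements close the connection and the file automatically
with urllib.request.urlopen('https://data.pr4e.org/cover3.jpg') as response:
    with open('img.jpg', 'wb') as img:
        img.write(response.read())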

In [40]:
# Web contents from https://data.pr4e.org/

# Show the retrieved image 
import urllib.request
from PIL import Image

urllib.request.urlretrieve('https://data.pr4e.org/cover3.jpg','cover3.jpg')
imgFile = Image.open('cover3.jpg')
 
imgFile.show()

<h4> Parsing Web Pages using Regular Expressions </h4>

- We can use a Python program to scrape the web.

- Web scraping refers to writing a program that retrieves pages from the internet, then examines the data in those pages to extract relevant information.

- A simple example is to find out whether the home page of a website contains links to other web pages.

- A regular expression to match and extract the link values from a web page is href="(http://.+?)"

- The expression http://.+? is placed in parentheses to indicate that this is the part of the matched pattern we would like to extract.

- The question mark in “.+?” indicates that the match should be done in a “non-greedy” manner, i.e. it matches the smallest possible string.
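
The difference between the non-greedy and the greedy form can be checked on a small piece of HTML. The snippet below is a sketch: the `html` bytes and the example.com addresses are made up for illustration.

```python
import re

# A made-up line of HTML with two links (illustrative only).
html = b'<a href="http://example.com/a">A</a> <a href="http://example.com/b">B</a>'

# Non-greedy: .+? stops at the first closing double quote.
non_greedy = re.findall(b'href="(http://.+?)"', html)

# Greedy: .+ runs on to the last double quote on the line.
greedy = re.findall(b'href="(http://.+)"', html)

for link in non_greedy:
    print('non-greedy:', link.decode())
for link in greedy:
    print('greedy:', link.decode())
```

The non-greedy pattern returns the two links separately, while the greedy pattern returns a single string that swallows everything between the first link and the end of the second.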



In [47]:
# gives us a list of all the strings that match the regular expression, 
# returning the link text between the double quotes

import urllib.request as urllib
import re

url = 'http://www.bbc.com'
html = urllib.urlopen(url).read()
links = re.findall(b'href="(http://.+?)"', html) 
for link in links:
    print(link.decode())

http://www.bbc.com/future
http://www.bbc.com/sport
http://www.bbc.com/worklife
http://www.bbc.com/culture
http://www.bbc.com/future
http://www.bbc.co.uk/worldserviceradio/
http://www.bbc.co.uk/programmes/p00w940j
http://www.bbc.com/news/in_pictures/


<h3> Using Web Services </h3> 

- Web services are software applications made available over 
the internet. They use a collection of open protocols 
and standards to exchange data with other applications or systems.

- Web service providers develop applications that produce documents 
designed to be consumed by other programs.

- Two common formats are used in web services today:

1. XML (eXtensible Markup Language) is used for exchanging 
document-style data

2. JSON (JavaScript Object Notation) is used when 
applications exchange dictionaries and lists with each other
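
To compare the two formats side by side, the sketch below writes the same small record once as JSON and once as XML, using only the standard library. The `record` dictionary and the `person` tag name are illustrative assumptions.

```python
import json
import xml.etree.ElementTree as ET

# An illustrative record (assumption, not data from a real service).
record = {'name': 'John', 'phone': '+1 81818181'}

# JSON: the dictionary maps directly onto the format.
json_text = json.dumps(record)

# XML: each key becomes a child element of a <person> node.
person = ET.Element('person')
for key, value in record.items():
    child = ET.SubElement(person, key)
    child.text = value
xml_text = ET.tostring(person, encoding='unicode')

print(json_text)
print(xml_text)
```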


<h4>  Parsing XML </h4> 
- The primary purpose of XML is to help information systems share structured data

- XML looks like a more structured form of HTML

![image.png](attachment:e180208b-e28b-4bbc-b8f9-d979cb74689c.png)
![image.png](attachment:d48ef599-2508-4a70-a27e-e2d02f299366.png)

In [57]:
# a simple program to parse an XML 
# document and extract some data elements from it:
# An XML parser such as ElementTree allows us to extract data from XML 
# without worrying about the rules of the XML syntax.

import xml.etree.ElementTree as ET
data = '''
<person>
  <name>John</name>
  <phone type="intl">+1 81818181</phone>
  <address>64 Marine Drive</address>
   <email hide="yes"/>
</person>
'''

# fromstring() method converts the string representation 
# of the XML into a “tree” of XML nodes

tree = ET.fromstring(data)

# the find function searches the XML tree 
# and retrieves a node that matches the specified tag
print('Name:', tree.find('name').text)
print('Phone:', tree.find('phone').text)
print('Address:', tree.find('address').text)
print('Attr:', tree.find('email').get('hide'))

# Each node may have text, attributes (e.g. hide) and child nodes

Name: John
Phone: +1 81818181
Address: 64 Marine Drive
Attr: yes
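
A node that is absent, or present but empty, needs a little care, because `find` returns `None` when nothing matches. The sketch below shows one way to guard against that with `findtext` and a default value; the `fax` tag is a hypothetical example of a missing element.

```python
import xml.etree.ElementTree as ET

data = '''
<person>
  <name>John</name>
  <phone type="intl">+1 81818181</phone>
  <email hide="yes"/>
</person>
'''
tree = ET.fromstring(data)

# find() returns None when no child has the requested tag.
missing = tree.find('fax')

# findtext() can supply a default instead of failing on None.
fax = tree.findtext('fax', default='n/a')

# Attributes are read with get(); an empty element has no text.
phone_type = tree.find('phone').get('type')
email_text = tree.find('email').text

print(missing, fax, phone_type, email_text)
```

Calling `.text` directly on the result of `tree.find('fax')` would raise an `AttributeError`, which is why the default is useful.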


<h4> Looping through XML nodes </h4>

- An XML document often contains multiple nodes of the same kind

- We use a loop to process all of these nodes

In [59]:
import xml.etree.ElementTree as ET
_data = '''
<students>
 <persons>
  <person>  
      <name>John</name>
      <phone type="intl">
     +1 81818181
       </phone>
       <email hide="yes"/>
   </person>
   <person>
         <name>Peter</name>
      <phone type="intl">
         +1 91919191
       </phone>
       <email hide="yes"/>
    </person>
 </persons>
</students>
'''
people = ET.fromstring(_data)
# findall() retrieves a list of the nodes that 
# represent the person structures in the XML tree
lst = people.findall('persons/person')
# loop over each person node
for person in lst:
    print(person.find('name').text)
    print(person.find('email').get('hide'))
    print(person.find('phone').get('type'))

John
yes
intl
Peter
yes
intl
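
The same loop can collect the values instead of printing them. As a sketch, the block below builds a list of dictionaries from a shortened copy of the XML above and strips the whitespace that surrounds each phone number; the variable names `xml_text` and `records` are illustrative.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the document above; phone text keeps its extra whitespace.
xml_text = '''
<students>
 <persons>
  <person>
   <name>John</name>
   <phone type="intl">
     +1 81818181
   </phone>
  </person>
  <person>
   <name>Peter</name>
   <phone type="intl">
     +1 91919191
   </phone>
  </person>
 </persons>
</students>
'''
root = ET.fromstring(xml_text)

# One dictionary per <person>, with the phone text stripped of whitespace.
records = [
    {'name': p.findtext('name'), 'phone': p.findtext('phone').strip()}
    for p in root.findall('persons/person')
]
print(records)
```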


<h4> Parsing JSON </h4>

- JSON is constructed from nested lists and dictionaries (i.e. key-value pairs), much like their Python counterparts

- We use the built-in json library to parse the JSON and read the data

In [62]:
class_list ='''
[{
"id": 1, 
"name": "John", 
"email": "john@gmail.com"
}, 
{
"id": 2, 
"name": "Mary", 
"email": "mary@gmail.com"
}, 
{
"id": 3, 
"name": "Peter", 
"email": "peter@gmail.com"
}]
'''
import json

info = json.loads(class_list)
print(f'Length of info : {len(info)}')
for item in info:
    print(item['id'], item['name'], item['email'])

Length of info : 3
1 John john@gmail.com
2 Mary mary@gmail.com
3 Peter peter@gmail.com


- Compared to XML, JSON is simpler and carries less detail

- Because it maps directly to some combination of dictionaries and lists, it is a natural format for cooperating programs to exchange data

- JSON’s relative simplicity has made it the format of choice for much of the data exchanged between internet applications
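
The direct mapping works in both directions. The sketch below starts from a Python list of dictionaries, serialises it with `json.dumps`, and parses it back with `json.loads`; the two records are a shortened copy of the class list used above.

```python
import json

# A Python list of dictionaries (shortened copy of the class list above).
class_list = [
    {'id': 1, 'name': 'John', 'email': 'john@gmail.com'},
    {'id': 2, 'name': 'Mary', 'email': 'mary@gmail.com'},
]

# dumps() turns Python objects into a JSON string ...
text = json.dumps(class_list, indent=2)

# ... and loads() turns the string back into Python objects.
restored = json.loads(text)

print(text)
print(restored == class_list)
```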

<h4> Parsing Data from Web Services </h4> 

- The Singapore government site data.gov.sg has a number of data sets that provide information on Singapore institutions.

- One data set, for example, gives information on graduates’ salaries by university; the resource queried in the code below returns the number of rainy days per month.

- The data provided by these web resources are all in JSON format.

- We import the urllib.request and json libraries for the code (urllib.request is the Python 3 counterpart of Python 2's urllib2).

- We build a urllib.request.Request object because, to access the data, we need to pass in header information ('User-Agent': 'Mozilla/5.0') like a normal web browser.

- The website data.gov.sg rejects URL calls without header information.

- After reading the web page (into the variable html), we use json.loads to parse the data.

- The result is a dictionary with nested lists and dictionaries inside.

- By exploring the keys (using the keys() method) of the nested dictionaries, you can tease out the relevant information from the returned data.
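
The header requirement described above can be inspected without making a network call. The sketch below only builds the `Request` object and looks at it; the URL is the endpoint from these notes with the query string left off, and nothing is sent.

```python
import urllib.request

# Endpoint from these notes without its query string; no request is sent here.
url = 'https://data.gov.sg/api/action/datastore_search'

# Attach a browser-like User-Agent header to the request.
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})

print(req.get_method())     # HTTP method that would be used
print(req.host)             # host part of the URL
print(req.header_items())   # headers that would be sent
```

Note that `urllib` stores header names in capitalised form, so the header is looked up as `'User-agent'` when calling `req.get_header`.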

In [102]:
#Get the keys/headers in API data 

import urllib.request
import json

# A plain string avoids the stray newline and tab of a triple-quoted URL
url = ('https://data.gov.sg/api/action/datastore_search'
       '?resource_id=8b94f596-91fd-4545-bf9e-7a426493b674&q')

req = urllib.request.Request(url, headers={ 'User-Agent': 'Mozilla/5.0' })
html = urllib.request.urlopen(req).read()

data = json.loads(html)
print(len(data['result']['records'])) # Number of records
print(data['result']['records'][0].keys() )
for item in data:
    print(item)

100
dict_keys(['_id', 'no_of_rainy_days', 'month'])
help
success
result
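
Once the keys of a record are known, the records can be turned into something easier to work with. The sketch below assumes records with the keys printed above and with the counts held as strings; the three sample records are typed in by hand for illustration, so no network call is needed.

```python
# Hand-typed stand-in for data['result']['records'] (illustrative values).
records = [
    {'_id': 1, 'no_of_rainy_days': '10', 'month': '1982-01'},
    {'_id': 2, 'no_of_rainy_days': '5', 'month': '1982-02'},
    {'_id': 3, 'no_of_rainy_days': '11', 'month': '1982-03'},
]

# Map each month to its count, converting the string to an integer.
rainy_days = {r['month']: int(r['no_of_rainy_days']) for r in records}

# Month with the most rainy days, and the average over the sample.
wettest = max(rainy_days, key=rainy_days.get)
average = sum(rainy_days.values()) / len(rainy_days)

print(rainy_days)
print(wettest, round(average, 2))
```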


In [103]:
# Print the 'result' dictionary parsed from the web API's JSON
import urllib.request
import json

# A plain string avoids the stray newline and tab of a triple-quoted URL
url = ('https://data.gov.sg/api/action/datastore_search'
       '?resource_id=8b94f596-91fd-4545-bf9e-7a426493b674&q')

req = urllib.request.Request(url, headers={ 'User-Agent': 'Mozilla/5.0' })
html = urllib.request.urlopen(req).read()

data = json.loads(html)
print(len(data['result'])) # Number of keys in the 'result' dictionary, not the number of records
print(data['result']) 
for item in data:
    print(item)

6
{'resource_id': '8b94f596-91fd-4545-bf9e-7a426493b674', 'fields': [{'type': 'int4', 'id': '_id'}, {'type': 'text', 'id': 'month'}, {'type': 'numeric', 'id': 'no_of_rainy_days'}], 'q': '', 'records': [{'_id': 1, 'no_of_rainy_days': '10', 'month': '1982-01'}, {'_id': 2, 'no_of_rainy_days': '5', 'month': '1982-02'}, {'_id': 3, 'no_of_rainy_days': '11', 'month': '1982-03'}, {'_id': 4, 'no_of_rainy_days': '14', 'month': '1982-04'}, {'_id': 5, 'no_of_rainy_days': '10', 'month': '1982-05'}, {'_id': 6, 'no_of_rainy_days': '8', 'month': '1982-06'}, {'_id': 7, 'no_of_rainy_days': '8', 'month': '1982-07'}, {'_id': 8, 'no_of_rainy_days': '11', 'month': '1982-08'}, {'_id': 9, 'no_of_rainy_days': '9', 'month': '1982-09'}, {'_id': 10, 'no_of_rainy_days': '10', 'month': '1982-10'}, {'_id': 11, 'no_of_rainy_days': '13', 'month': '1982-11'}, {'_id': 12, 'no_of_rainy_days': '21', 'month': '1982-12'}, {'_id': 13, 'no_of_rainy_days': '18', 'month': '1983-01'}, {'_id': 14, 'no_of_rainy_days': '2', 'month'

<h3> Application Programming Interfaces </h3> 

- The general name for these application-to-application contracts is Application Programming Interfaces, or APIs

- We need to follow the published APIs strictly in order to access the services provided by those applications

- There are various approaches to building APIs for web services.
Two dominant approaches are:
   
1. Service-Oriented Architecture (SOA)
- The functionality of our application includes access to services provided by other components, which are made available by external applications through a communication protocol over a network

- The basic principles of service-oriented architecture are independent of vendors, products and technologies

![image.png](attachment:0bab289c-75dc-4deb-91e5-9e97f2cb5bac.png)

- Advantages of SOA include:

(i) only one copy of the data is maintained, by its owner (this prevents problems such as over-committing reservations)

(ii) the owners of the data can set the rules about using their data

- A disadvantage is that an SOA implementation comes with a higher upfront investment.

2. Representational State Transfer (REST) architecture

- REST is an architectural style that defines a set of constraints and properties based on the HTTP protocol

- Web services that follow the REST architectural style are known as RESTful, and they provide interoperability between computer systems on the internet

- In a RESTful web service, each web resource is associated with a URI. For example, “/students” relates to the resource for all students, while “/student/<id>” points to the resource for the student with a particular id.

- REST allows the user to remotely create, read, update and delete resources (CRUD, the four basic functions of persistent storage)
    
- When HTTP is used, the operations available are the common GET, POST, PUT and DELETE methods
    
- A REST system uses a stateless protocol and standard operations
    
- REST systems aim for fast performance, reliability, and the ability to grow, by re-using components that can be managed and updated without affecting the system as a whole, even while it is running
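
The link between CRUD operations, HTTP methods and resource URIs can be sketched as a small lookup. The mapping below follows a common convention (create with POST, update with PUT) rather than a rule stated in these notes, and the host `example.com`, the helper name `build_request` and the student id are all placeholders. The requests are only constructed, never sent.

```python
import urllib.request

BASE = 'https://example.com'   # placeholder host (assumption)

# A common convention for mapping CRUD operations onto HTTP methods.
CRUD_TO_HTTP = {
    'create': 'POST',
    'read': 'GET',
    'update': 'PUT',
    'delete': 'DELETE',
}

def build_request(operation, student_id=None):
    """Return an unsent Request for a CRUD operation on the students resource."""
    method = CRUD_TO_HTTP[operation]
    # "/students" is the whole collection, "/student/<id>" a single student.
    path = '/students' if student_id is None else f'/student/{student_id}'
    return urllib.request.Request(BASE + path, method=method)

req = build_request('update', 42)
print(req.get_method(), req.full_url)

req_all = build_request('read')
print(req_all.get_method(), req_all.full_url)
```

Because every request names its resource and its operation in full, the server does not need to remember anything between calls, which is the stateless property mentioned above.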

<h4> Using RESTful APIs </h4> 

In [113]:
# To see RESTful APIs in action, we can use JSONPlaceholder.
# We need the Python requests library to send different 
# types of HTTP requests to the server.

import requests
import json

URI = 'https://jsonplaceholder.typicode.com/posts/'
r = requests.get(URI)
print(r.text)
      
# parse the output text using the json.loads method to count the records

data = json.loads(r.text)
print('number of records', len(data)) # gets 100 records

# requests.post sends a POST request to create a 
# new resource from the dictionary in the variable data, 
# pointing to the URI in the format “/posts”:
# data={'title': 'My title', 'body': 'My post texts', 'userId': 99}
# r = requests.post('https://jsonplaceholder.typicode.com/posts/', 
# data=data)
# print(r.text)

# requests.put sends a PUT request to update the existing resource at /posts/10
data={'id':'10', 'title': 'foo', 'body': 'bar', 'userId': 1}
r = requests.put('https://jsonplaceholder.typicode.com/posts/10', data=data)
print(r.text)

[
  {
    "userId": 1,
    "id": 1,
    "title": "sunt aut facere repellat provident occaecati excepturi optio reprehenderit",
    "body": "quia et suscipit\nsuscipit recusandae consequuntur expedita et cum\nreprehenderit molestiae ut ut quas totam\nnostrum rerum est autem sunt rem eveniet architecto"
  },
  {
    "userId": 1,
    "id": 2,
    "title": "qui est esse",
    "body": "est rerum tempore vitae\nsequi sint nihil reprehenderit dolor beatae ea dolores neque\nfugiat blanditiis voluptate porro vel nihil molestiae ut reiciendis\nqui aperiam non debitis possimus qui neque nisi nulla"
  },
  {
    "userId": 1,
    "id": 3,
    "title": "ea molestias quasi exercitationem repellat qui ipsa sit aut",
    "body": "et iusto sed quo iure\nvoluptatem occaecati omnis eligendi aut ad\nvoluptatem doloribus vel accusantium quis pariatur\nmolestiae porro eius odio et labore et velit aut"
  },
  {
    "userId": 1,
    "id": 4,
    "title": "eum et est occaecati",
    "body": "ullam et saepe reic