# Data on the Web

Two formats are commonly used to exchange data on the web: XML and JSON.

### <b> XML - </b> eXtensible Markup Language. 
- It is mainly used on the web, where data with a specific structure is interpreted at run time by an XML parser.
- XML creates a tree-like structure that is easy to interpret and supports a hierarchy. A page that follows XML syntax is called an XML document.
- XML documents have sections, called elements, defined by a beginning and an ending tag. 
- A tag is a markup construct that begins with < and ends with >. The characters between the start-tag and end-tag, if there are any, are the element's content. Elements can contain markup, including other elements, which are called "child elements".
- The largest, top-level element is called the root, which contains all other elements.
- Attributes are name–value pairs that exist within a start-tag or empty-element tag. An XML attribute can only have a single value, and each attribute can appear at most once on each element.
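
The terms above (root, child elements, content, attributes, empty-element tag) can be seen in a short sketch that uses Python's standard `xml.etree.ElementTree` module. The sample document and its names (`person`, `phone`, `email`) are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for this illustration
sample = """
<person id="42">
  <name>Chuck</name>
  <phone type="intl">+1 734 303 4456</phone>
  <email hide="yes"/>
</person>
"""

root = ET.fromstring(sample)    # the root element contains all other elements

print(root.tag)                 # name of the root element
print(root.attrib)              # attributes of the root, as a dictionary

# Iterating over an element yields its child elements in document order
for child in root:
    print(child.tag, child.attrib, child.text)
```

The empty-element tag `<email hide="yes"/>` has an attribute but no content, so its `text` is `None`.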


## XML  "Elements" (or Nodes)

- Simple Element
- Complex Element

+ XML's primary purpose is to help information systems <b> share structured data </b>
+ It started as a simplified subset of the Standard Generalized Markup Language (SGML), and is designed to be relatively human-legible.

### XML Basics

- Start Tag
- End Tag
- Text Content
- Attribute
- Self Closing Tag

### White Space

- Line ends do not matter.
- White space around the text of an element is generally not significant and is discarded when the data is read.
- We indent only to be readable.
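
One caveat, shown in the sketch below: `xml.etree.ElementTree` returns element text exactly as written, including line ends and indentation, so it is the reading code that discards the white space (here with `str.strip()`). The two sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET

# The same data written compactly and with indentation (hypothetical samples)
compact = ET.fromstring('<person><name>Chuck</name></person>')
indented = ET.fromstring("""
<person>
  <name>
    Chuck
  </name>
</person>
""")

raw = indented.find('name').text
print(repr(raw))                                   # raw text keeps the white space
print(raw.strip() == compact.find('name').text)    # equal once stripped
```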

### XML Terminologies

- <b>  Tags </b> indicate the beginning and ending of elements. <br>
- <b>  Attributes </b> are keyword/value pairs on the opening tag of an XML element. <br>
- <b>  Serialize/ De-Serialize </b> To serialize is to convert data in one program into a common format that can be stored and/or transmitted between systems in a programming language-independent manner; to de-serialize is to convert that format back into the program's own data. <br>
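
A minimal sketch of a serialize/de-serialize round trip, assuming a flat record with string values; the `record` dictionary and the `person` element name are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical in-memory record
record = {'name': 'Chuck', 'city': 'Ann Arbor'}

# Serialize: build an element tree and turn it into a string
person = ET.Element('person')
for key, value in record.items():
    ET.SubElement(person, key).text = value
wire_format = ET.tostring(person, encoding='unicode')
print(wire_format)

# De-serialize: parse the string and rebuild the dictionary
restored = {child.tag: child.text for child in ET.fromstring(wire_format)}
print(restored == record)
```

The string in `wire_format` is what would be stored or sent to another system, which could read it from any language that has an XML parser.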

## XML Schema

- Describes a "contract" as to what is acceptable XML.
- XML Schema describes the structure of an XML document.
- Expressed in terms of constraints on the structure and content of documents.
- Often used to specify a "contract" between systems: "My system will only accept XML that conforms to this particular Schema."
- If a particular piece of XML meets the specification of the Schema, it is said to "validate".
- The XML Schema language is also referred to as XML Schema Definition (XSD).

- The purpose of an XML Schema is to define the legal building blocks of an XML document:<br>
    + the elements and attributes that can appear in a document
    + the number of (and order of) child elements
    + data types for elements and attributes
    + default and fixed values for elements and attributes


<b> XSD - </b> XML Schema Definition <br>
- The string data type can contain characters, line feeds, carriage returns, and tab characters.<br><br>
- The following is an example of a string declaration in a schema:
    + <xs:element name="customer" type="xs:string"/><br>
         <customer>John Smith</customer> <br><br>
- The following is an example of a date declaration in a schema:
    + <xs:element name="start" type="xs:date"/><br>
         <start>2002-09-24</start> <br><br> 
- The following is an example of a time declaration in a schema:
    + <xs:element name="start" type="xs:time"/><br>
         <start>09:00:00</start> <br><br>
- The following is an example of a decimal declaration in a schema:
    + <xs:element name="price" type="xs:decimal"/>   
         <price>999.50</price>
         <br><br>
- A duration (time interval, the xs:duration type) is specified in the form "PnYnMnDTnHnMnS", where:

    + P indicates the period (required)
    + nY indicates the number of years
    + nM indicates the number of months
    + nD indicates the number of days
    + T indicates the start of a time section (required if you are going to specify hours, minutes, or seconds)
    + nH indicates the number of hours
    + nM indicates the number of minutes
    + nS indicates the number of seconds <br><br>

- The following is an example of a duration declaration in a schema:
    + <xs:element name="period" type="xs:duration"/> <br>
         <period>P5Y</period> <br><br>
- The following is an example of an integer declaration in a schema:
    + <xs:element name="price" type="xs:integer"/> <br>
        <price>999</price> <br><br>
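
Once values of these types have been read from a document as text, a program usually converts them to native types. The sketch below shows one way to do that in Python with the values from the examples above. `parse_duration` is a hypothetical helper, not part of any library, and it is deliberately simplified: it ignores negative durations and does not check that at least one component is present.

```python
import re
from datetime import date, time
from decimal import Decimal

start_date = date.fromisoformat('2002-09-24')   # xs:date
start_time = time.fromisoformat('09:00:00')     # xs:time
price = Decimal('999.50')                       # xs:decimal
whole_price = int('999')                        # xs:integer

# Pattern for PnYnMnDTnHnMnS: every component is optional, and the
# hours/minutes/seconds components are only allowed after the T marker.
DURATION = re.compile(
    r'^P(?:(?P<years>\d+)Y)?(?:(?P<months>\d+)M)?(?:(?P<days>\d+)D)?'
    r'(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?'
    r'(?:(?P<seconds>\d+(?:\.\d+)?)S)?)?$'
)

def parse_duration(text):
    """Hypothetical helper: split a duration string into its components."""
    match = DURATION.match(text)
    if match is None:
        raise ValueError(f'not a duration: {text!r}')
    # Components that are absent are left out; values stay as strings
    return {name: value
            for name, value in match.groupdict().items()
            if value is not None}

print(start_date, start_time, price, whole_price)
print(parse_duration('P5Y'))
print(parse_duration('P15M'))    # 15 months
print(parse_duration('PT15M'))   # 15 minutes: the T changes the meaning of M
```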

### Extracting Data from XML

The program prompts for a URL, reads the XML data from that URL using urllib, parses it, extracts the comment counts, and computes the sum of those numbers.

In [1]:
import urllib.request
import xml.etree.ElementTree as ET

link = input('Enter location: ')
print('Retrieving', link)

# Download the document and decode the bytes into a string
xml_text = urllib.request.urlopen(link).read().decode()
print('Retrieved', len(xml_text), 'characters')

# Parse the XML and collect every <comment> under <comments>
data = ET.fromstring(xml_text)
tags = data.findall('comments/comment')

# Count the comments and add up their <count> values
cn = 0
sm = 0
for tag in tags:
    cn += 1
    sm += int(tag.find('count').text)

print('Count:', cn)
print('Sum:', sm)

Enter location:  http://py4e-data.dr-chuck.net/comments_704350.xml
Retrieving  http://py4e-data.dr-chuck.net/comments_704350.xml
Retrieved 4231 characters
Count: 50
Sum: 2147
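
To try the extraction step without a network connection, the same logic can be run on an inline document. The sample below is hypothetical: its shape is inferred from the `comments/comment` path and the `count` lookup used by the program, and the root element name, the person names, and the numbers are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; only the comments/comment/count nesting is taken
# from the program above, everything else is made up
sample = """
<commentinfo>
  <comments>
    <comment><name>Ann</name><count>97</count></comment>
    <comment><name>Bob</name><count>97</count></comment>
    <comment><name>Cid</name><count>90</count></comment>
  </comments>
</commentinfo>
"""

data = ET.fromstring(sample)

# findall() takes a path relative to the root element
counts = [int(tag.find('count').text)
          for tag in data.findall('comments/comment')]

print('Count:', len(counts))
print('Sum:', sum(counts))
```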


### <b> JSON - </b> JavaScript Object Notation

- Douglas Crockford "discovered" JSON: he specified and popularized it rather than inventing the notation from scratch.
- JSON is based on the object literal notation of JavaScript.
- JSON is a syntax for storing and exchanging data.
- JSON is text, written with JavaScript object notation.
- Python has a built-in package called json, which can be used to work with JSON data.
- JSON represents data as nested "lists" and "dictionaries".
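
A short sketch of what nested "lists" and "dictionaries" look like once JSON text is parsed in Python. The JSON text and its keys are hypothetical.

```python
import json

# Hypothetical JSON text: an object holding a nested object and an array
text = '''
{
  "name": "Chuck",
  "phone": {"type": "intl", "number": "+1 734 303 4456"},
  "languages": ["Python", "JavaScript"]
}
'''

info = json.loads(text)

print(type(info).__name__)                 # a JSON object becomes a dict
print(type(info['languages']).__name__)    # a JSON array becomes a list
print(info['phone']['type'])               # index into the nested dict
print(info['languages'][0])                # index into the nested list
```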

In [2]:
import json 

## Parse JSON - Convert from JSON to Python

- If you have a JSON string, you can parse it by using the json.loads() method.
- json.load() reads JSON from a file object, while json.loads() ("load string") parses JSON that is already held in a string.

In [6]:
import json

# some JSON:
x =  '{ "name":"John", "age":30, "city":"New York"}'

# parse x:
y = json.loads(x)

# the result is a Python dictionary:
print(y["age"]) 

30
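
The difference between `json.load()` and `json.loads()` can be sketched as follows. To keep the example self-contained, `io.StringIO` stands in for a file opened with `open()`; that substitution is an assumption made for the illustration.

```python
import io
import json

text = '{ "name":"John", "age":30, "city":"New York"}'

# json.loads() parses a string
from_string = json.loads(text)

# json.load() reads from a file-like object; StringIO stands in for a file
with io.StringIO(text) as handle:
    from_file = json.load(handle)

print(from_string == from_file)
```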


## Convert from Python to JSON

- If you have a Python object, you can convert it into a JSON string by using the json.dumps() method.
- json.dump() writes JSON to a file object, while json.dumps() ("dump string") returns the JSON as a string, for example for printing or for sending to another system.

In [4]:
import json

# a Python object (dict):
x = {
  "name": "John",
  "age": 30,
  "city": "New York"
}

# convert into JSON:
y = json.dumps(x)

# the result is a JSON string:
print(y) 

{"name": "John", "age": 30, "city": "New York"}
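
The same pairing applies in this direction. In the sketch below `io.StringIO` again stands in for a real file (an assumption for the sake of a self-contained example), and the `indent` argument of `json.dumps()` is used to produce a more readable string.

```python
import io
import json

x = {"name": "John", "age": 30, "city": "New York"}

# json.dumps() returns the JSON text as a string
as_string = json.dumps(x)

# json.dump() writes the JSON text to a file-like object
buffer = io.StringIO()
json.dump(x, buffer)
written = buffer.getvalue()

print(as_string == written)

# indent=2 pretty-prints the same data across several lines
print(json.dumps(x, indent=2))

# Parsing the string again gives back an equal dictionary
print(json.loads(as_string) == x)
```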
