<a href="https://colab.research.google.com/github/Sandeep0076/Miscellaneous-Data-Science-Projects/blob/main/XML%2C%20Json%20And%20Web.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **XML, JSON and Web**

## **Aim**
The aim of this exercise is to learn how XML and JSON are used to store and exchange data, how to retrieve XML and JSON data over the web with Python, and how to parse that data using a range of Python modules. It also covers fetching data from URLs and sending and retrieving data over HTTP with the Python Requests library.

### **Accessing the internet**

## **urllib**:

`urllib` is a package in the Python standard library that bundles several modules:

*   **urllib.request**: opens and reads URLs.
*   **urllib.error**: defines the exception classes for errors raised by the request module.
*   **urllib.parse**: parses URL structure.
*   **urllib.robotparser**: works with robots.txt files.
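
As a minimal offline sketch of two of these modules, the snippet below splits a URL with `urllib.parse` and checks a rule with `urllib.robotparser`. The robots.txt rules and the `example.com` URLs are made up for illustration; they are not the rules of any real site.

```python
from urllib.parse import urlparse, parse_qs
from urllib.robotparser import RobotFileParser

# Split a URL into its components
parts = urlparse("http://httpbin.org/get?key1=value1&key2=22")
print(parts.scheme, parts.netloc, parts.path)

# Query values always come back as lists of strings
query = parse_qs(parts.query)
print(query)

# Hypothetical robots.txt rules, parsed from a list instead of being fetched
robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])
blocked = robots.can_fetch("*", "http://example.com/private/page")
allowed = robots.can_fetch("*", "http://example.com/public/page")
print(blocked, allowed)
```

Note that `parse_qs` returns `"22"` as a string: URLs carry no type information, so any conversion back to numbers is up to the caller.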





http://httpbin.org/xml : a test URL that returns the following sample XML data
```

<!--   A SAMPLE set of slides   -->
<slideshow title="Sample Slide Show" date="Date of publication" author="Yours Truly">
<!--  TITLE SLIDE  -->
<slide type="all">
<title>Wake up to WonderWidgets!</title>
</slide>
<!--  OVERVIEW  -->
<slide type="all">
<title>Overview</title>
<item>
Why
<em>WonderWidgets</em>
are great
</item>
<item/>
<item>
Who
<em>buys</em>
WonderWidgets
</item>
</slide>
</slideshow>


```



### Get data

In [16]:
import urllib.request

def retrieve_data():
  url = "http://httpbin.org/xml"

  result = urllib.request.urlopen(url)
  print(f'The return code is {result.status}')

  #for returning headers
  print(f'This is the header data : {result.getheaders()}')
  #for getting data
  print(f'This is the returned data : {result.read().decode("utf-8")}')

In [17]:
retrieve_data()

The return code is 200
This is the header data : [('Date', 'Thu, 09 Sep 2021 19:05:01 GMT'), ('Content-Type', 'application/xml'), ('Content-Length', '522'), ('Connection', 'close'), ('Server', 'gunicorn/19.9.0'), ('Access-Control-Allow-Origin', '*'), ('Access-Control-Allow-Credentials', 'true')]
This is the returned data : <?xml version='1.0' encoding='us-ascii'?>

<!--  A SAMPLE set of slides  -->

<slideshow 
    title="Sample Slide Show"
    date="Date of publication"
    author="Yours Truly"
    >

    <!-- TITLE SLIDE -->
    <slide type="all">
      <title>Wake up to WonderWidgets!</title>
    </slide>

    <!-- OVERVIEW -->
    <slide type="all">
        <title>Overview</title>
        <item>Why <em>WonderWidgets</em> are great</item>
        <item/>
        <item>Who <em>buys</em> WonderWidgets</item>
    </slide>

</slideshow>


### Sending data

In [24]:
import urllib.parse
import urllib.request

def sending_data():
  # dictionary of values to send
  args = {
      "name": "Sandeep Pathania",
      "is_author": True
  }
  # httpbin echoes back whatever is POSTed to this URL
  url = "http://httpbin.org/post"
  # encode the dictionary as a URL-encoded string, then convert it to bytes
  data = urllib.parse.urlencode(args).encode()
  # passing data makes urlopen send a POST request instead of a GET
  result = urllib.request.urlopen(url, data)

  print('*'*20)
  print(result.read().decode("utf-8"))


In [25]:
sending_data()

********************
{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "is_author": "True", 
    "name": "Sandeep Pathania"
  }, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Content-Length": "36", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "Python-urllib/3.7", 
    "X-Amzn-Trace-Id": "Root=1-613a5e77-26825f7a6b996cc35e333b11"
  }, 
  "json": null, 
  "origin": "35.237.120.232", 
  "url": "http://httpbin.org/post"
}
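
To make the two encoding steps more concrete, here is a small offline sketch that shows what `urlencode` and `.encode()` produce for the same dictionary. It sends nothing over the network.

```python
from urllib.parse import urlencode

args = {"name": "Sandeep Pathania", "is_author": True}

# urlencode turns the dictionary into a query-style string;
# spaces become '+' and non-string values are converted with str()
encoded = urlencode(args)
print(encoded)

# urlopen expects the POST body as bytes, hence the extra .encode()
body = encoded.encode()
print(type(body).__name__, len(body))
```

The body is 36 bytes long, which matches the `Content-Length` header echoed back by httpbin in the output above.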



### Exception handling

In [29]:
from http import HTTPStatus
from urllib.error import HTTPError, URLError

def handle_exceptions():
  url = "http://httpbin.org/status/404"

  try:
    result = urllib.request.urlopen(url)
    print(f'Status: {result.status}')
    if result.getcode() == HTTPStatus.OK:
      print(result.read().decode("utf-8"))
  # HTTPError is a subclass of URLError, so it must be caught first
  except HTTPError as err:
    print(f'Error {err.code}')
  # URLError has no status code, only a reason (for example a DNS failure)
  except URLError as err:
    print(f'That url is not working: {err.reason}')


In [30]:
handle_exceptions()

Error 404
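
The order of the `except` clauses matters because `HTTPError` is a subclass of `URLError`. The sketch below illustrates this offline by building both exceptions by hand; `describe` is a hypothetical helper, not part of `urllib`.

```python
import io
from email.message import Message
from urllib.error import HTTPError, URLError

def describe(err):
    """Hypothetical helper: summarize a urllib error for logging."""
    # HTTPError must be checked first because it subclasses URLError
    if isinstance(err, HTTPError):
        return f"HTTP error {err.code}"
    if isinstance(err, URLError):
        return f"URL error: {err.reason}"
    return f"Unexpected error: {err}"

# Errors constructed manually so that the sketch runs without a network
http_err = HTTPError("http://httpbin.org/status/404", 404, "Not Found",
                     Message(), io.BytesIO())
url_err = URLError("Name or service not known")

print(describe(http_err))
print(describe(url_err))
print(issubclass(HTTPError, URLError))
```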


## **Requests** library

Requests decodes response bodies automatically and addresses many shortcomings of urllib.

In [44]:
import requests

def return_result():
  url = "http://httpbin.org/xml"
  # standard HTTP GET request
  result = requests.get(url)
  print(f'header = {result.headers}')
  print(f'data = {result.text}')

return_result()


header = {'Date': 'Thu, 09 Sep 2021 20:04:13 GMT', 'Content-Type': 'application/xml', 'Content-Length': '522', 'Connection': 'keep-alive', 'Server': 'gunicorn/19.9.0', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true'}
data = <?xml version='1.0' encoding='us-ascii'?>

<!--  A SAMPLE set of slides  -->

<slideshow 
    title="Sample Slide Show"
    date="Date of publication"
    author="Yours Truly"
    >

    <!-- TITLE SLIDE -->
    <slide type="all">
      <title>Wake up to WonderWidgets!</title>
    </slide>

    <!-- OVERVIEW -->
    <slide type="all">
        <title>Overview</title>
        <item>Why <em>WonderWidgets</em> are great</item>
        <item/>
        <item>Who <em>buys</em> WonderWidgets</item>
    </slide>

</slideshow>


#### Send some parameters to the url via GET request

In [42]:
def send_params():
  url = "http://httpbin.org/get"
  #send some parameters to the url via GET request
  data_values = {
                "key1":"value1",
                "key2":22
                 }
  # unlike urllib, no manual encoding is needed
  result = requests.get(url,params=data_values)
  print(f'header = {result.headers}')
  print(f'data = {result.text}')
  return

In [43]:
send_params()

header = {'Date': 'Thu, 09 Sep 2021 20:03:14 GMT', 'Content-Type': 'application/json', 'Content-Length': '370', 'Connection': 'keep-alive', 'Server': 'gunicorn/19.9.0', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true'}
data = {
  "args": {
    "key1": "value1", 
    "key2": "22"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.23.0", 
    "X-Amzn-Trace-Id": "Root=1-613a6882-0388dddf4d3d4bce4f638e56"
  }, 
  "origin": "35.237.120.232", 
  "url": "http://httpbin.org/get?key1=value1&key2=22"
}



#### POST request

In [46]:
def send_params():
  url = "http://httpbin.org/post"
  #send some parameters to the url via POST request
  data_values = {
                "key1":"value1",
                "key2":22
                 }
  # unlike urllib, no manual encoding is needed
  result = requests.post(url,data=data_values)
  print(f'header = {result.headers}')
  print(f'data = {result.text}')
  return

send_params()

header = {'Date': 'Thu, 09 Sep 2021 20:06:50 GMT', 'Content-Type': 'application/json', 'Content-Length': '501', 'Connection': 'keep-alive', 'Server': 'gunicorn/19.9.0', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true'}
data = {
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "key1": "value1", 
    "key2": "22"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Content-Length": "19", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.23.0", 
    "X-Amzn-Trace-Id": "Root=1-613a695a-57409fec6d2d798e39375352"
  }, 
  "json": null, 
  "origin": "35.237.120.232", 
  "url": "http://httpbin.org/post"
}
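
To see where the dictionary ends up in each case without contacting the server, one option is to prepare the requests instead of sending them. This is a sketch that assumes the `requests` package is installed; it only inspects the prepared request objects.

```python
import requests

url_get = "http://httpbin.org/get"
url_post = "http://httpbin.org/post"
data_values = {"key1": "value1", "key2": 22}

# prepare() builds the request without sending it
get_req = requests.Request("GET", url_get, params=data_values).prepare()
post_req = requests.Request("POST", url_post, data=data_values).prepare()

# params= is appended to the URL as a query string
print(get_req.url)

# data= is form-encoded into the request body
print(post_req.body)
print(post_req.headers["Content-Type"])
```

With `params` the values travel in the URL, while with `data` they travel in the body, which is why httpbin reports them under `args` for GET and under `form` for POST.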



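#### Error handling in the Requests library

The `timeout` argument makes Requests raise a `Timeout` exception if the server does not start sending a response within the given number of seconds.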

In [57]:
from requests import HTTPError, Timeout, RequestException

def handle_exceptions():

  try:
    url = "http://httpbin.org/delay/5"
    # the server waits 5 seconds, so the 2 second timeout is exceeded
    result = requests.get(url, timeout=2)
    result.raise_for_status()
    print(result)
  except HTTPError as err:
    print(f'Error {err}')
  except Timeout as err:
    print(f'Error {err}')
  # base class for all other Requests errors, such as connection failures
  except RequestException as err:
    print(f'That url is not working: {err}')

handle_exceptions()

Error HTTPConnectionPool(host='httpbin.org', port=80): Read timed out. (read timeout=2)
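
The example above exercises the timeout path. To illustrate the `raise_for_status()` path offline, the sketch below builds a response object by hand, which is only a testing shortcut and not how responses are normally obtained. It assumes the `requests` package is installed.

```python
import requests
from requests import HTTPError, RequestException, Timeout

# Both specific exceptions derive from RequestException
print(issubclass(HTTPError, RequestException),
      issubclass(Timeout, RequestException))

# A hand-built response, used only so that the sketch runs offline
fake = requests.Response()
fake.status_code = 404
fake.url = "http://httpbin.org/status/404"

try:
    fake.raise_for_status()  # raises HTTPError for 4xx and 5xx status codes
    outcome = "ok"
except HTTPError as err:
    outcome = f"HTTP error {err.response.status_code}"
print(outcome)
```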


#### Authentication

In [61]:
from requests.auth import HTTPBasicAuth

def auth_me():
  url = "http://httpbin.org/basic-auth/sunny/123"

  # HTTPBasicAuth("sunny", "123") is equivalent to the tuple ("sunny", "123")
  result = requests.get(url, auth=HTTPBasicAuth("sunny", "123"))
  print(result.text)
  
auth_me()

{
  "authenticated": true, 
  "user": "sunny"
}
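
For context, HTTP Basic authentication sends the user name and password joined by a colon and Base64-encoded in an `Authorization` header. The sketch below reproduces that header value with the standard library only; `basic_auth_header` is a hypothetical helper and is not how Requests is implemented internally.

```python
import base64

def basic_auth_header(user, password):
    """Hypothetical helper: build the value of a Basic Authorization header."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")

header = basic_auth_header("sunny", "123")
print(header)

# Base64 is an encoding, not encryption: anyone can reverse it
decoded = base64.b64decode(header.split()[1]).decode("utf-8")
print(decoded)
```

Because the credentials are trivially recoverable, Basic authentication is normally used over HTTPS rather than plain HTTP.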



## **JSON**

### Parsing and Serializing JSON

In [1]:
import json

In [9]:
def parse_json():

  #defining a string of JSON code
  json_str = '''{
    "name":"Sandeep",
    "is_male":true,
    "hobbies":[
                "Playing guitar",
                "Fifa21",
                "Photography",
                "Tech"
    ],
    "siblings":1}
    '''

  data = json.loads(json_str)
  print("First name is :" + data["name"])
  if data["is_male"]:
    print("Is a male")
  for items in data["hobbies"]:
    print("Hobbies:" + items)

parse_json()

First name is :Sandeep
Is a male
Hobbies:Playing guitar
Hobbies:Fifa21
Hobbies:Photography
Hobbies:Tech


In [10]:
def serialize_json():

  # define a Python dictionary to be serialized to JSON
  python_data = {
    "name":"Sandeep",
    "is_male":True, # Python's True becomes JSON's true
    "hobbies":[
                "Playing guitar",
                "Fifa21",
                "Photography",
                "Tech"
    ],
    "siblings":1}
    

  json_str = json.dumps(python_data,indent=3)#indent by 3 spaces
  print(json_str)
 

serialize_json()

{
   "name": "Sandeep",
   "is_male": true,
   "hobbies": [
      "Playing guitar",
      "Fifa21",
      "Photography",
      "Tech"
   ],
   "siblings": 1
}
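
A short round trip shows how Python values map to JSON and back. The `car` and `scores` keys are made up for this sketch to show `None` and tuple handling.

```python
import json

python_data = {"name": "Sandeep", "is_male": True, "car": None, "scores": (1, 2)}

# True -> true, None -> null, tuple -> array
json_str = json.dumps(python_data)
print(json_str)

restored = json.loads(json_str)
print(restored)

# Tuples come back as lists because JSON only has arrays
print(type(restored["scores"]).__name__)
```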


### Error handling

In [18]:
from json import JSONDecodeError

def error_handling():

  # the comma after the hobbies list is deliberately omitted to trigger an error
  json_str = '''{
    "name":"Sandeep",
    "is_male":true,
    "hobbies":[
                "Playing guitar",
                "Fifa21",
                "Photography",
                "Tech"
    ]
    "siblings":1}
    '''
  
  try:
    data = json.loads(json_str)
    print("First name is :" + data["name"])
    if data["is_male"]:
      print("Is a male")
    for items in data["hobbies"]:
      print("Hobbies:" + items)
  except JSONDecodeError as err:
    print("Oops, there is an error")
    print(err.msg)
    print('at line ' + str(err.lineno))


error_handling()

Oops, there is an error
Expecting ',' delimiter
at line 10
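
Besides `msg` and `lineno`, `JSONDecodeError` also exposes `colno` and `pos`. A minimal sketch of wrapping this in a reusable function follows; `safe_loads` is a hypothetical helper name.

```python
import json
from json import JSONDecodeError

def safe_loads(text):
    """Hypothetical helper: return (data, None) or (None, error description)."""
    try:
        return json.loads(text), None
    except JSONDecodeError as err:
        return None, f"{err.msg} (line {err.lineno}, column {err.colno})"

good, good_err = safe_loads('{"siblings": 1}')
bad, bad_err = safe_loads('{"hobbies": ["Tech"] "siblings": 1}')

print(good, good_err)
print(bad, bad_err)
```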


### Request JSON data

In [21]:
import requests

def req_json():
  url = "http://httpbin.org/json"
  result = requests.get(url)

  dataobj = result.json()
  print(json.dumps(dataobj,indent=3))

  print(list(dataobj.keys()))

req_json()

{
   "slideshow": {
      "author": "Yours Truly",
      "date": "date of publication",
      "slides": [
         {
            "title": "Wake up to WonderWidgets!",
            "type": "all"
         },
         {
            "items": [
               "Why <em>WonderWidgets</em> are great",
               "Who <em>buys</em> WonderWidgets"
            ],
            "title": "Overview",
            "type": "all"
         }
      ],
      "title": "Sample Slide Show"
   }
}
['slideshow']
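
Once parsed, the result is an ordinary nested structure of dictionaries and lists. The sketch below walks a trimmed, hard-coded copy of the payload shown above so that it runs without a network connection.

```python
import json

# A trimmed copy of the httpbin.org/json payload
payload = '''{"slideshow": {"title": "Sample Slide Show", "slides": [
    {"title": "Wake up to WonderWidgets!", "type": "all"},
    {"title": "Overview", "type": "all",
     "items": ["Why <em>WonderWidgets</em> are great",
               "Who <em>buys</em> WonderWidgets"]}
]}}'''

dataobj = json.loads(payload)
slides = dataobj["slideshow"]["slides"]

# .get() avoids a KeyError for slides that have no "items" list
item_counts = [len(slide.get("items", [])) for slide in slides]
titles = [slide["title"] for slide in slides]

print(titles)
print(item_counts)
```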


## **XML** 

*   SAX: Simple API for XML
    * Memory efficient
    * Fast and easy
    * No random access or context
    * Cannot modify the XML file
*   DOM: Document Object Model
    * Can modify the XML document
*   ElementTree API
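
The notebook only demonstrates DOM and ElementTree, so here is a minimal SAX sketch using the standard library's `xml.sax`. `TagCounter` is a hypothetical handler, and the XML is a shortened copy of the httpbin sample.

```python
import xml.sax
from collections import Counter

class TagCounter(xml.sax.ContentHandler):
    """Hypothetical handler: counts how often each element name starts."""
    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def startElement(self, name, attrs):
        # SAX calls this once per opening tag, in document order
        self.counts[name] += 1

# Shortened version of the httpbin sample document
sample = b"""<slideshow title="Sample Slide Show">
  <slide type="all"><title>Wake up to WonderWidgets!</title></slide>
  <slide type="all"><title>Overview</title><item>Why</item><item/><item>Who</item></slide>
</slideshow>"""

handler = TagCounter()
xml.sax.parseString(sample, handler)
print(dict(handler.counts))
```

The handler only sees one event at a time and never holds the whole document, which is why SAX is memory efficient but offers no random access.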



### DOM API

In [34]:
import xml.dom.minidom as dom

def return_xml():
  url = "http://httpbin.org/xml"
  result = requests.get(url)

  domtree = dom.parseString(result.text)
  rootnode = domtree.documentElement

  print(f"The root element is {rootnode.nodeName}")
  print(f"Title is {rootnode.getAttribute('title')}")
  items = domtree.getElementsByTagName('item')
  print(f"Total items are : {items.length}")

  #create new item tag
  newitem = domtree.createElement('item')
  #add some text to the item
  newitem.appendChild(domtree.createTextNode('This is new text'))
  #add item to the slide
  firstslide = domtree.getElementsByTagName('slide')[0]  # first slide (index 0)
  firstslide.appendChild(newitem)
  items = domtree.getElementsByTagName('item')
  print(f"Total items after creating new item : {items.length}")

return_xml()

The root element is slideshow
Title is Sample Slide Show
Total items are : 3
Total items after creating new item : 4
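
To check what the modified tree looks like, it can be serialized back to text with `toxml()`. The sketch below uses a small inline document instead of the httpbin response so that it runs offline.

```python
import xml.dom.minidom as dom

# Small inline document so the sketch does not need the network
domtree = dom.parseString(
    '<slideshow><slide type="all"><title>Overview</title></slide></slideshow>'
)

# Build <item>This is new text</item> and attach it to the first slide
newitem = domtree.createElement('item')
newitem.appendChild(domtree.createTextNode('This is new text'))
domtree.getElementsByTagName('slide')[0].appendChild(newitem)

# toxml() serializes the element and its children back to a string
xml_out = domtree.documentElement.toxml()
print(xml_out)
```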


### ElementTree API

In [38]:
from lxml import etree

def use_etree():
   url = "http://httpbin.org/xml"
   result = requests.get(url)

   # build a document tree using the ElementTree API
   doc = etree.fromstring(result.content)

   ## access the values
   print(doc.tag)
   print(doc.attrib['title'])
   slidecount = len(doc.findall('slide'))
   print(f'No. of slide elements: {slidecount}')

   print('*'*20)
   #iterate over tags
   for elem in doc.findall('slide'):
     print(elem.tag)
  
   #create new slide
   newslide=etree.SubElement(doc,'slide')
   newslide.text= 'this is a new slide'

   #count slide
   slidecount = len(doc.findall('slide'))
   itemcount = len(doc.findall('.//item'))

   print(f'No. of slide elements: {slidecount}')
   print(f'No. of slide items: {itemcount}')

use_etree()

slideshow
Sample Slide Show
No. of slide elements: 2
********************
slide
slide
No. of slide elements: 3
No. of slide items: 3
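
If `lxml` is not available, the standard library's `xml.etree.ElementTree` offers a similar API for this kind of task. The sketch below swaps it in for `lxml` and uses a shortened inline copy of the sample XML; it is an alternative under those assumptions, not the approach used in the cell above.

```python
import xml.etree.ElementTree as ET

sample = """<slideshow title="Sample Slide Show">
  <slide type="all"><title>Wake up to WonderWidgets!</title></slide>
  <slide type="all"><title>Overview</title><item>Why</item><item/><item>Who</item></slide>
</slideshow>"""

doc = ET.fromstring(sample)

# findall('slide') matches direct children only; './/item' searches all descendants
slide_titles = [slide.findtext('title') for slide in doc.findall('slide')]
item_count = len(doc.findall('.//item'))
print(slide_titles, item_count)

# Add a slide, then serialize the tree back to text
newslide = ET.SubElement(doc, 'slide')
newslide.text = 'this is a new slide'
slide_count = len(doc.findall('slide'))
print(slide_count)
print(ET.tostring(doc, encoding='unicode'))
```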
