# JSON & XML

## JSON Introduction

In [40]:
import pandas as pd

Pandas provides the function read_json(), which can load JSON from either a file or a URL.

In [2]:
url = "http://api.open-notify.org/astros.json"
first_json = pd.read_json(url)
first_json.head()

Unnamed: 0,people,number,message
0,"{'craft': 'ISS', 'name': 'Oleg Kononenko'}",12,success
1,"{'craft': 'ISS', 'name': 'Nikolai Chub'}",12,success
2,"{'craft': 'ISS', 'name': 'Tracy Caldwell Dyson'}",12,success
3,"{'craft': 'ISS', 'name': 'Matthew Dominick'}",12,success
4,"{'craft': 'ISS', 'name': 'Michael Barratt'}",12,success


Writing JSON is as simple as reading it and takes one line of code: instead of read_json(), call to_json() on the DataFrame with a filename. The orient parameter controls how rows and columns are laid out in the output file.

In [3]:
first_json.to_json('json_columns.json', orient="columns")
first_json.to_json('json_index.json', orient="index")
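
To see what the orient values actually change, here is a minimal sketch on a small, hypothetical DataFrame (the names and values are made up for illustration). It relies on to_json() returning the JSON text when no path is given, and parses that text back with json.loads() so the shapes are easy to compare.

```python
import json

import pandas as pd

# Hypothetical two-row frame, used only for illustration.
df = pd.DataFrame({"name": ["Ada", "Lin"], "craft": ["ISS", "ISS"]})

# Without a path, to_json() returns the JSON text instead of writing a file.
columns_json = json.loads(df.to_json(orient="columns"))
index_json = json.loads(df.to_json(orient="index"))
records_json = json.loads(df.to_json(orient="records"))

print(columns_json)  # keyed by column name, then by row label
print(index_json)    # keyed by row label, then by column name
print(records_json)  # a list with one object per row
```

The "records" layout is often the closest match to what web APIs return, which is one reason to prefer it when the file will be read by other tools.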

read_json() and to_json() work only with simple JSON. All arrays inside the document need to have the same length.
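
The sketch below reproduces this limitation without needing a file. The two JSON strings are hypothetical, and io.StringIO is used to hand them to read_json() as file-like objects. The second string has arrays of different lengths, so the call is expected to raise a ValueError.

```python
from io import StringIO

import pandas as pd

# Hypothetical documents: one with equal-length arrays, one without.
regular = '{"id": [1, 2], "language": ["JSON", "Python"]}'
ragged = '{"article": [1, 2], "blog": [1]}'

df_ok = pd.read_json(StringIO(regular))
print(df_ok.shape)

try:
    pd.read_json(StringIO(ragged))
    error = None
except ValueError as exc:
    # Arrays of different lengths cannot be aligned into columns.
    error = exc
print("ragged input failed:", error)
```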

In [None]:
df = pd.read_json("nested.json")

We can see that it doesn't work. Fortunately, there is another way: instead of a Pandas function, we can use the json package, which is part of the Python standard library.

In [4]:
import json
#load json object
with open('nested.json') as f:
    nested_json = json.load(f)
print(nested_json)
print(type(nested_json))

{'article': [{'id': '01', 'language': 'JSON', 'edition': 'first', 'author': 'Allen'}, {'id': '02', 'language': 'Python', 'edition': 'second', 'author': 'Aditya Sharma'}], 'blog': [{'name': 'Datacamp', 'URL': 'datacamp.com'}]}
<class 'dict'>


**We can use the pprint package to pretty-print dictionaries. This makes JSON responses much easier for a human to read.**
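
As a small illustration, the sketch below pretty-prints a dictionary with the same content as nested.json above. The width=60 and sort_dicts=False settings are choices made here for readability, not requirements; sort_dicts needs Python 3.8 or newer.

```python
from pprint import pformat, pprint

nested = {
    "article": [
        {"id": "01", "language": "JSON", "edition": "first", "author": "Allen"},
        {"id": "02", "language": "Python", "edition": "second", "author": "Aditya Sharma"},
    ],
    "blog": [{"name": "Datacamp", "URL": "datacamp.com"}],
}

# Print directly, wrapping lines at 60 characters and keeping key order.
pprint(nested, width=60, sort_dicts=False)

# pformat() returns the same text as a string instead of printing it.
formatted = pformat(nested, width=60, sort_dicts=False)
```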

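We will use the Pandas function json_normalize().

json_normalize() has 3 main parameters:

- data
    - the input data (a dictionary or a list of dictionaries)
- record_path
    - the path to the nested list whose elements become the rows
- meta
    - fields from the enclosing record that are kept as they are and repeated as columns on every row

Limitation
 - We can only use json_normalize() if it makes logical sense within the file. To normalize, the entire dictionary structure must be consistent, with every record following the same layout.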
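
Before adding record_path and meta, it may help to see what json_normalize() does with the data parameter alone. In this minimal sketch the records are hypothetical; nested dictionaries are flattened into columns whose names join the keys with a separator, which the sep parameter can change.

```python
import pandas as pd

# Hypothetical records with a nested dictionary but no nested list.
records = [
    {"id": 1, "name": {"first": "Ada", "last": "Lovelace"}},
    {"id": 2, "name": {"first": "Lin", "last": "Chen"}},
]

# Nested keys become dotted column names such as "name.first".
flat = pd.json_normalize(records)
print(flat.columns.tolist())

# sep changes the separator used when building the flattened column names.
flat_underscore = pd.json_normalize(records, sep="_")
print(flat_underscore.columns.tolist())
```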

In [5]:
pd.json_normalize(nested_json)

Unnamed: 0,article,blog
0,"[{'id': '01', 'language': 'JSON', 'edition': '...","[{'name': 'Datacamp', 'URL': 'datacamp.com'}]"


We are going to add the record_path parameter to json_normalize() to focus on a specific key in the file:

In [6]:
blog = pd.json_normalize(nested_json,record_path ='blog')
blog.head()

Unnamed: 0,name,URL
0,Datacamp,datacamp.com


In [7]:
article = pd.json_normalize(nested_json,record_path ='article')
article.head()

Unnamed: 0,id,language,edition,author
0,1,JSON,first,Allen
1,2,Python,second,Aditya Sharma


Additional JSON practice

In [10]:
data = [{"state": "Florida", 
        "shortname": "FL",
        "info": {"governor": "Rick Scott"},
        "counties": [{"name": "Dade", "population": 12345},
                     {"name": "Broward", "population": 40000},
                     {"name": "Palm Beach", "population": 60000}]},
       {"state": "Ohio",
        "shortname": "OH",
        "info": {"governor": "John Kasich"},
        "counties": [{"name": "Summit", "population": 1234},
                     {"name": "Cuyahoga", "population": 1337}]}]

In [15]:
# pd.json_normalize(data)
pd.json_normalize(data=data, record_path='counties', meta=['state', 'shortname', ['info', 'governor']])

Unnamed: 0,name,population,state,shortname,info.governor
0,Dade,12345,Florida,FL,Rick Scott
1,Broward,40000,Florida,FL,Rick Scott
2,Palm Beach,60000,Florida,FL,Rick Scott
3,Summit,1234,Ohio,OH,John Kasich
4,Cuyahoga,1337,Ohio,OH,John Kasich


### Interesting

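Review the pd.json_normalize() call above and note how record_path and meta together produced a well-formed Pandas DataFrame: record_path selects the list that supplies the rows, and meta adds columns taken from the parent record.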
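
One way to build intuition is to write the same transformation by hand. The loop below is only a rough plain-Python equivalent under simplifying assumptions, not how Pandas implements it, and the data is a shortened, hypothetical version of the state records.

```python
# Hypothetical, shortened version of the state data used above.
data = [
    {"state": "Florida", "info": {"governor": "Rick Scott"},
     "counties": [{"name": "Dade", "population": 12345},
                  {"name": "Broward", "population": 40000}]},
    {"state": "Ohio", "info": {"governor": "John Kasich"},
     "counties": [{"name": "Summit", "population": 1234}]},
]

rows = []
for record in data:
    # record_path='counties': each county becomes one row.
    for county in record["counties"]:
        row = dict(county)
        # meta: copy fields from the parent record onto every row.
        row["state"] = record["state"]
        row["info.governor"] = record["info"]["governor"]
        rows.append(row)

for row in rows:
    print(row)
```

Each printed dictionary corresponds to one row of the DataFrame that json_normalize() would return for the same input.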

## More JSON

Let's continue developing our AI literacy skills! We'll use AI tools to help learn about the structure of a JSON response.

How to convert a JSON response from an API to a DataFrame

In [15]:
import requests
import pandas as pd
from pprint import pprint # this will display the data in a structured, more readable manner

params = {"page": 1}
response = requests.get("https://reqres.in/api/users", params=params)

data = response.json()

In [9]:
# Run both statements and compare how the output is displayed

print(data, "\n")
pprint(data)

{'page': 1, 'per_page': 6, 'total': 12, 'total_pages': 2, 'data': [{'id': 1, 'email': 'george.bluth@reqres.in', 'first_name': 'George', 'last_name': 'Bluth', 'avatar': 'https://reqres.in/img/faces/1-image.jpg'}, {'id': 2, 'email': 'janet.weaver@reqres.in', 'first_name': 'Janet', 'last_name': 'Weaver', 'avatar': 'https://reqres.in/img/faces/2-image.jpg'}, {'id': 3, 'email': 'emma.wong@reqres.in', 'first_name': 'Emma', 'last_name': 'Wong', 'avatar': 'https://reqres.in/img/faces/3-image.jpg'}, {'id': 4, 'email': 'eve.holt@reqres.in', 'first_name': 'Eve', 'last_name': 'Holt', 'avatar': 'https://reqres.in/img/faces/4-image.jpg'}, {'id': 5, 'email': 'charles.morris@reqres.in', 'first_name': 'Charles', 'last_name': 'Morris', 'avatar': 'https://reqres.in/img/faces/5-image.jpg'}, {'id': 6, 'email': 'tracey.ramos@reqres.in', 'first_name': 'Tracey', 'last_name': 'Ramos', 'avatar': 'https://reqres.in/img/faces/6-image.jpg'}], 'support': {'url': 'https://contentcaddy.io?utm_source=reqres&utm_medium

Use ChatGPT to answer the following questions:

- When I access the JSON method of the response, what form is the data in when I'm using Python?

    - The .json() method parses the response body as JSON and returns it as native Python data structures, usually:
        - dict for the top-level object
        - list for arrays
        - strings, ints, floats, booleans, or None for other values


- What is the dictionary structure of this response? 

    - It's a dictionary with metadata, a list of user data, and a support object.


- What are some strategies for parsing JSON response data into DataFrames?

    - You typically want to extract the list under the 'data' key and convert that to a pandas DataFrame.
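
If the page-level metadata should travel with each user row, json_normalize() is one option. The sketch below uses an offline stand-in for the response (the payload values are hypothetical, shaped like the output above) so that it runs without a network call.

```python
import pandas as pd

# Offline stand-in shaped like the response above (hypothetical values).
payload = {
    "page": 1,
    "total_pages": 2,
    "data": [
        {"id": 1, "email": "a@example.com", "first_name": "Ann", "last_name": "Lee"},
        {"id": 2, "email": "b@example.com", "first_name": "Bo", "last_name": "Kim"},
    ],
    "support": {"url": "https://example.com"},
}

# Rows come from the 'data' list; 'page' and 'total_pages' are repeated on each row.
users = pd.json_normalize(payload, record_path="data", meta=["page", "total_pages"])
print(users)
```

Keeping the page number on every row can be handy when several pages are later concatenated into one DataFrame.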

In [7]:
import pandas as pd

df = pd.DataFrame(data['data'])
df

Unnamed: 0,id,email,first_name,last_name,avatar
0,1,george.bluth@reqres.in,George,Bluth,https://reqres.in/img/faces/1-image.jpg
1,2,janet.weaver@reqres.in,Janet,Weaver,https://reqres.in/img/faces/2-image.jpg
2,3,emma.wong@reqres.in,Emma,Wong,https://reqres.in/img/faces/3-image.jpg
3,4,eve.holt@reqres.in,Eve,Holt,https://reqres.in/img/faces/4-image.jpg
4,5,charles.morris@reqres.in,Charles,Morris,https://reqres.in/img/faces/5-image.jpg
5,6,tracey.ramos@reqres.in,Tracey,Ramos,https://reqres.in/img/faces/6-image.jpg


Additional bonus strategies to dissect and interpret API JSON



In [11]:
# - View JSON Hierarchy

import json
print(json.dumps(data, indent=4))  # Pretty print

{
    "page": 1,
    "per_page": 6,
    "total": 12,
    "total_pages": 2,
    "data": [
        {
            "id": 1,
            "email": "george.bluth@reqres.in",
            "first_name": "George",
            "last_name": "Bluth",
            "avatar": "https://reqres.in/img/faces/1-image.jpg"
        },
        {
            "id": 2,
            "email": "janet.weaver@reqres.in",
            "first_name": "Janet",
            "last_name": "Weaver",
            "avatar": "https://reqres.in/img/faces/2-image.jpg"
        },
        {
            "id": 3,
            "email": "emma.wong@reqres.in",
            "first_name": "Emma",
            "last_name": "Wong",
            "avatar": "https://reqres.in/img/faces/3-image.jpg"
        },
        {
            "id": 4,
            "email": "eve.holt@reqres.in",
            "first_name": "Eve",
            "last_name": "Holt",
            "avatar": "https://reqres.in/img/faces/4-image.jpg"
        },
        {
            "id": 5,
  

In [13]:
# Access individual values:

first_email = data['data'][0]['email']

first_email


'george.bluth@reqres.in'
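
Chained indexing like this raises a KeyError or IndexError as soon as one level is missing. A small helper can make exploratory access more forgiving. The dig() function below is a hypothetical helper written for this example, not part of any library, and the payload is made-up sample data.

```python
def dig(obj, *keys, default=None):
    """Follow keys/indexes into nested dicts and lists; return default if a step is missing."""
    for key in keys:
        try:
            obj = obj[key]
        except (KeyError, IndexError, TypeError):
            return default
    return obj


# Hypothetical payload with the same nesting as the API response.
payload = {"data": [{"id": 1, "email": "a@example.com"}]}

print(dig(payload, "data", 0, "email"))
print(dig(payload, "data", 5, "email", default="n/a"))
```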

Next, we ask ChatGPT for a way to manually extract the values from the JSON response above:

In [23]:
# Access the 'data' key
users = data['data']

# Extract into separate lists
ids = []
emails = []
full_names = []
avatars = []

for user in users:
    ids.append(user['id'])
    emails.append(user['email'])
    full_names.append(f"{user['first_name']} {user['last_name']}")
    avatars.append(user['avatar'])


full_names

['George Bluth',
 'Janet Weaver',
 'Emma Wong',
 'Eve Holt',
 'Charles Morris',
 'Tracey Ramos']
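
The same extraction can be written more compactly with list comprehensions. This is just an alternative style, shown on a hypothetical two-user list rather than the live response.

```python
# Hypothetical users with the same keys as the API response.
users = [
    {"id": 1, "first_name": "Ann", "last_name": "Lee"},
    {"id": 2, "first_name": "Bo", "last_name": "Kim"},
]

# One comprehension per output list replaces the explicit loop and append calls.
ids = [user["id"] for user in users]
full_names = [f"{user['first_name']} {user['last_name']}" for user in users]

print(ids)
print(full_names)
```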

# XML

In Python, the standard library's xml package handles XML; here we use its ElementTree module.

In [1]:
import xml.etree.ElementTree as ET

In [7]:
tree = ET.parse('/Users/mitchellpalmer/Projects/Lighthouse Lab Projects/JSON_XML/Data/data.xml')

In [8]:
print(type(tree))

<class 'xml.etree.ElementTree.ElementTree'>


To get the main (root) tag of the file, we can call the getroot() method.

Then check its tag, attributes, and length.

In [10]:
root = tree.getroot()
root

<Element 'data' at 0x1073690d0>

In [11]:
print(root.tag)
print(root.attrib)
print(len(root))

data
{}
3


The length of this element is 3, which means that it has 3 children. We can access these children by index, the same way as elements in a list.
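
Because an element behaves like a sequence of its children, it can also be iterated directly. The sketch below parses a hypothetical inline document with ET.fromstring(), loosely modeled on the structure of data.xml, so it runs without the file.

```python
import xml.etree.ElementTree as ET

# Hypothetical inline document, loosely modeled on the structure shown above.
xml_text = """
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
    </country>
</data>
"""
root = ET.fromstring(xml_text)

print(len(root))  # number of direct children

# Iterating over an element yields its direct children in document order.
for child in root:
    print(child.tag, child.attrib, [grandchild.tag for grandchild in child])

# list() gives an ordinary list, so the usual list operations apply.
last_country = list(root)[-1]
print(last_country.get("name"))
```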

In [12]:
# First child of the root
country1 = root[0]
# First child of the child
rank = country1[0]
# What is the tag of the grandchild
print(rank.tag)
# What is the text inside this grandchild
print(rank.text)
# What are the attributes of the last child?
print(country1[4].attrib)

rank
1
{'name': 'Switzerland', 'direction': 'W'}


In [15]:
# The same information for the third child of the root

country3 = root[2]
rank3 = country3[0]
print(rank3.tag)
print(rank3.text)
print(country3[4].attrib)

rank
68
{'name': 'Colombia', 'direction': 'E'}


In [16]:
# Find all children with the tag 'country'
for country in root.findall('country'):
    # rank is a child of the country
    rank = country.find('rank').text
    # name is an attribute of the country
    name = country.get('name')
    print(name, rank)

Liechtenstein 1
Singapore 4
Panama 68
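
Note that find() returns None when the child does not exist, so find('rank').text would raise an AttributeError for a country without a rank. A minimal sketch of a more tolerant variant uses findtext() with a default; the inline XML, including the rank-less "Atlantis" entry, is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical document in which one country has no <rank> child.
xml_text = """
<data>
    <country name="Liechtenstein"><rank>1</rank></country>
    <country name="Atlantis"></country>
</data>
"""
root = ET.fromstring(xml_text)

results = []
for country in root.findall("country"):
    # findtext() returns the child's text, or the default if the child is missing.
    rank = country.findtext("rank", default="unknown")
    results.append((country.get("name"), rank))

print(results)
```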


In [18]:
# iter() and findall() tips

for neighbor in root.iter('neighbor'):
    print(neighbor.attrib)
# Top-level elements
root.findall(".")
# All 'neighbor' grand-children of 'country' children of the top-level elements
root.findall("./country/neighbor")
# elements with name='Singapore' that have a 'year' child
root.findall(".//year/..[@name='Singapore']")
# 'year' elements that are children of elements with name='Singapore'
root.findall(".//*[@name='Singapore']/year")
# All 'neighbor' elements that are the second 'neighbor' child of their parent
root.findall(".//neighbor[2]")

{'name': 'Austria', 'direction': 'E'}
{'name': 'Switzerland', 'direction': 'W'}
{'name': 'Malaysia', 'direction': 'N'}
{'name': 'Costa Rica', 'direction': 'W'}
{'name': 'Colombia', 'direction': 'E'}


[<Element 'neighbor' at 0x107cedf30>, <Element 'neighbor' at 0x107cee2a0>]
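
The element objects printed above are hard to read, so it can help to pull out an attribute or the text from each match. The following sketch runs three of the path expressions against a hypothetical inline document and collects readable values.

```python
import xml.etree.ElementTree as ET

# Hypothetical inline document with the same tags as data.xml.
xml_text = """
<data>
    <country name="Liechtenstein">
        <year>2008</year>
        <neighbor name="Austria"/>
        <neighbor name="Switzerland"/>
    </country>
    <country name="Singapore">
        <year>2011</year>
        <neighbor name="Malaysia"/>
    </country>
</data>
"""
root = ET.fromstring(xml_text)

# All neighbor elements directly under a country element.
all_neighbors = [n.get("name") for n in root.findall("./country/neighbor")]

# The second neighbor element within each parent, if there is one.
second_neighbors = [n.get("name") for n in root.findall(".//neighbor[2]")]

# year elements whose parent has name='Singapore'.
singapore_years = [y.text for y in root.findall(".//*[@name='Singapore']/year")]

print(all_neighbors)
print(second_neighbors)
print(singapore_years)
```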

## Task

Extract the name, rank, year and gdppc from the countries and create a Pandas DataFrame.

In [42]:
# assistance used

xml_dict = {'country': [],
            'rank' :[],
            'year' : [],
            'gdppc' : []}

for country in root.findall('country'):
    name = country.attrib['name']
    xml_dict['country'].append(name)

    rank_value = country[0].text
    xml_dict['rank'].append(rank_value)

    year_value = country[1].text
    xml_dict['year'].append(year_value)

    gdppc_value = country[2].text
    xml_dict['gdppc'].append(gdppc_value)


    
xml_dict

{'country': ['Liechtenstein', 'Singapore', 'Panama'],
 'rank': ['1', '4', '68'],
 'year': ['2008', '2011', '2011'],
 'gdppc': ['141100', '59900', '13600']}

In [41]:
df = pd.DataFrame(xml_dict) 
df

Unnamed: 0,country,rank,year,gdppc
0,Liechtenstein,1,2008,141100
1,Singapore,4,2011,59900
2,Panama,68,2011,13600
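
Text extracted from XML is always a string, so rank, year and gdppc in this DataFrame are stored as text even though they look numeric. A possible follow-up step, sketched here on a DataFrame rebuilt from the values shown above, is to convert those columns with astype() before doing any arithmetic.

```python
import pandas as pd

# Same values as the table above; every field is a string because XML text is text.
df = pd.DataFrame({
    "country": ["Liechtenstein", "Singapore", "Panama"],
    "rank": ["1", "4", "68"],
    "year": ["2008", "2011", "2011"],
    "gdppc": ["141100", "59900", "13600"],
})

# Convert the numeric-looking columns to integers.
numeric_cols = ["rank", "year", "gdppc"]
df = df.astype({col: "int64" for col in numeric_cols})

print(df.dtypes)
print(df["gdppc"].mean())
```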
