# Extract from JSON and XML

You'll now get practice extracting data from JSON and XML. You'll extract the same population data from the previous exercise, except the data will be in a different format.

Both JSON and XML are common formats for storing data. XML was established before JSON, and JSON has become more popular over time. Both tend to be used for sending data via web APIs.

Sometimes, you can obtain the same data in either JSON or XML format. The World Bank indicator data is available in either form. In this exercise, you'll use the same data except one file is formatted as JSON and the other as XML.



# Extract JSON (JSON Exercise)

First, you'll practice extracting data from a JSON file. Run the cell below to print out the first line of the JSON file.

In [1]:
### 
#   Run the following cell.
#   This cell loads a function that prints the first n lines of
#   a file.
#
#   Then this function is called on the JSON file to print out
#   the first line of the population_data.json file
###
def print_lines(n, file_name):
    # the with block closes the file even if reading fails
    with open(file_name) as f:
        for i in range(n):
            print('line %i, %s'%(i,f.readline()))
            print('-----------------------------')

print_lines(1, '../data/population_data.json')

The first "line" in the file is actually the entire file. JSON is a compact way of representing data in a dictionary-like format. Luckily, pandas has a method to [read in a json file](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html) and parse the results for you. 

If you open the link with the documentation, you'll see there is an *orient* option that can handle JSON formatted in different ways:
```
'split' : dict like {index -> [index], columns -> [columns], data -> [values]}

'records' : list like [{column -> value}, ... , {column -> value}]

'index' : dict like {index -> {column -> value}}

'columns' : dict like {column -> {index -> value}}

'values' : just the values array
```

In this case, the JSON is formatted with a 'records' orientation, so you'll need to use that value in the read_json() method. You can tell that the format is 'records' by comparing the pattern in the documentation with the pattern in the JSON file.
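To make the difference between orientations concrete, here is a minimal sketch that reads the same two rows from a 'records' layout and from a 'columns' layout. The inline strings and the column names `country` and `population` are assumptions made for illustration; they are not the contents of population_data.json. The values are the 1960 figures for Aruba and Afghanistan.

```python
from io import StringIO

import pandas as pd

# 'records' layout: a list of {column -> value} dictionaries, one per row.
records_json = """
[{"country": "Aruba", "population": 54211},
 {"country": "Afghanistan", "population": 8996351}]
"""

# 'columns' layout: {column -> {index -> value}} for the same two rows.
columns_json = """
{"country": {"0": "Aruba", "1": "Afghanistan"},
 "population": {"0": 54211, "1": 8996351}}
"""

# StringIO wraps each string so read_json treats it like an open file.
df_records = pd.read_json(StringIO(records_json), orient='records')
df_columns = pd.read_json(StringIO(columns_json), orient='columns')

print(df_records)
print(df_columns)
```

Both calls should produce a data frame with the same two columns; only the shape of the input text differs, which is why comparing the file against the patterns in the documentation is enough to pick the right orient value.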

Next, read in the population_data.json file using pandas.

In [2]:
# TODO: Read in the population_data.json file using pandas's
# read_json method. Don't forget to specify the orient option.
# Store the results in df_json.

import pandas as pd
df_json = pd.read_json('../data/population_data.json', orient='records')

In [3]:
# TODO: Use the head method to see the first few rows of the resulting
# dataframe
df_json.head()

| | 1960 | 1961 | 1962 | 1963 | 1964 | 1965 | 1966 | 1967 | 1968 | 1969 | ... | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | Country Code | Country Name | Indicator Code | Indicator Name |
|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
| 0 | 54211.0 | 55438.0 | 56225.0 | 56695.0 | 57032.0 | 57360.0 | 57715.0 | 58055.0 | 58386.0 | 58726.0 | ... | 102577.0 | 103187.0 | 103795.0 | 104341.0 | 104822.0 | 105264.0 | ABW | Aruba | SP.POP.TOTL | Population, total |
| 1 | 8996351.0 | 9166764.0 | 9345868.0 | 9533954.0 | 9731361.0 | 9938414.0 | 10152331.0 | 10372630.0 | 10604346.0 | 10854428.0 | ... | 30696958.0 | 31731688.0 | 32758020.0 | 33736494.0 | 34656032.0 | 35530081.0 | AFG | Afghanistan | SP.POP.TOTL | Population, total |
| 2 | 5643182.0 | 5753024.0 | 5866061.0 | 5980417.0 | 6093321.0 | 6203299.0 | 6309770.0 | 6414995.0 | 6523791.0 | 6642632.0 | ... | 25096150.0 | 25998340.0 | 26920466.0 | 27859305.0 | 28813463.0 | 29784193.0 | AGO | Angola | SP.POP.TOTL | Population, total |
| 3 | 1608800.0 | 1659800.0 | 1711319.0 | 1762621.0 | 1814135.0 | 1864791.0 | 1914573.0 | 1965598.0 | 2022272.0 | 2081695.0 | ... | 2900401.0 | 2895092.0 | 2889104.0 | 2880703.0 | 2876101.0 | 2873457.0 | ALB | Albania | SP.POP.TOTL | Population, total |
| 4 | 13411.0 | 14375.0 | 15370.0 | 16412.0 | 17469.0 | 18549.0 | 19647.0 | 20758.0 | 21890.0 | 23058.0 | ... | 82431.0 | 80788.0 | 79223.0 | 78014.0 | 77281.0 | 76965.0 | AND | Andorra | SP.POP.TOTL | Population, total |


Notice that this population data is the same as the data from the previous exercise. The column order might have changed, but the data is otherwise the same.

# Other Ways to Read in JSON

Besides using pandas to read JSON files, you can use the json library. Run the code cell below to see an example of reading in JSON with the json library. The library parses this file into a Python list of dictionaries, one dictionary per record.

In [4]:
import json

# read in the JSON file
with open('../data/population_data.json') as f:
    json_data = json.load(f)

# print the second record in the JSON file (index 1; index 0 is Aruba)
print(json_data[1])
print('\n')

# show that each record is a dictionary
print(json_data[1]['Country Name'])
print(json_data[1]['Country Code'])

{'Country Name': 'Afghanistan', 'Country Code': 'AFG', 'Indicator Name': 'Population, total', 'Indicator Code': 'SP.POP.TOTL', '1960': 8996351.0, '1961': 9166764.0, '1962': 9345868.0, '1963': 9533954.0, '1964': 9731361.0, '1965': 9938414.0, '1966': 10152331.0, '1967': 10372630.0, '1968': 10604346.0, '1969': 10854428.0, '1970': 11126123.0, '1971': 11417825.0, '1972': 11721940.0, '1973': 12027822.0, '1974': 12321541.0, '1975': 12590286.0, '1976': 12840299.0, '1977': 13067538.0, '1978': 13237734.0, '1979': 13306695.0, '1980': 13248370.0, '1981': 13053954.0, '1982': 12749645.0, '1983': 12389269.0, '1984': 12047115.0, '1985': 11783050.0, '1986': 11601041.0, '1987': 11502761.0, '1988': 11540888.0, '1989': 11777609.0, '1990': 12249114.0, '1991': 12993657.0, '1992': 13981231.0, '1993': 15095099.0, '1994': 16172719.0, '1995': 17099541.0, '1996': 17822884.0, '1997': 18381605.0, '1998': 18863999.0, '1999': 19403676.0, '2000': 20093756.0, '2001': 20966463.0, '2002': 21979923.0, '2003': 23064851.0,
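Since the parsed result is an ordinary list of dictionaries, it can be handled with plain Python or handed to pandas. The sketch below uses a small inline string as a stand-in for the file (two records, two year columns); the variable names `raw`, `records`, `population_1960`, and `df_from_records` are hypothetical.

```python
import json

import pandas as pd

# A small stand-in for the file contents, in the same record layout.
raw = """
[{"Country Name": "Aruba", "Country Code": "ABW", "1960": 54211.0, "1961": 55438.0},
 {"Country Name": "Afghanistan", "Country Code": "AFG", "1960": 8996351.0, "1961": 9166764.0}]
"""

# json.loads parses a string; json.load does the same for an open file.
records = json.loads(raw)
print(type(records), type(records[0]))

# Ordinary list and dict operations work on the parsed data.
population_1960 = {r["Country Code"]: r["1960"] for r in records}
print(population_1960)

# A list of dictionaries can also be passed straight to the DataFrame constructor.
df_from_records = pd.DataFrame(records)
print(df_from_records)
```

The json library is a reasonable choice when you only need a few fields or want to filter records before building a data frame; pandas is more convenient when you want the whole table.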

# Extract XML

Next, you'll work with the same data, except now the data is in XML format. Run the next code cell to see what the first fifteen lines of the data file look like.

In [5]:
# Run the code cell to print out the first 15 lines of the xml file
print_lines(15, '../data/population_data.xml')

line 0, ï»¿<?xml version="1.0" encoding="utf-8"?>

-----------------------------
line 1, <Root xmlns:wb="http://www.worldbank.org">

-----------------------------
line 2,   <data>

-----------------------------
line 3,     <record>

-----------------------------
line 4,       <field name="Country or Area" key="ABW">Aruba</field>

-----------------------------
line 5,       <field name="Item" key="SP.POP.TOTL">Population, total</field>

-----------------------------
line 6,       <field name="Year">1960</field>

-----------------------------
line 7,       <field name="Value">54211</field>

-----------------------------
line 8,     </record>

-----------------------------
line 9,     <record>

-----------------------------
line 10,       <field name="Country or Area" key="ABW">Aruba</field>

-----------------------------
line 11,       <field name="Item" key="SP.POP.TOTL">Population, total</field>

-----------------------------
line 12,       <field name="Year">1961</field>

------------

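XML looks very similar to HTML: data is wrapped in tags, with values stored between an opening and closing tag or in tag attributes. XML is not as easy to navigate as JSON. Older versions of pandas cannot read XML directly (`pandas.read_xml` was only added in pandas 1.3), and even with it, files that store values in generic tags and attributes, like this one, often need manual parsing. One reason is that tag names are user defined, so every XML file might be structured differently. You can imagine why XML has fallen out of favor relative to JSON.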

### How to read and navigate XML

There is a Python library called BeautifulSoup, which makes reading in and parsing XML data easier. Here is the link to the documentation: [Beautiful Soup Documentation](https://www.crummy.com/software/BeautifulSoup/)

The find() method returns the first occurrence of an XML element. For example, find('record') returns the first record in the XML file:

```xml
<record>
  <field name="Country or Area" key="ABW">Aruba</field>
  <field name="Item" key="SP.POP.TOTL">Population, total</field>
  <field name="Year">1960</field>
  <field name="Value">54211</field>
</record>
```

The find_all() method returns all of the matching tags. So find_all('record') would return all of the elements with the `<record>` tag.

Run the code cells below to get a basic idea of how to navigate XML with BeautifulSoup. To navigate through the XML file, you search for a specific tag using the find() method or find_all() method.

Below these code cells, there is an exercise for wrangling the XML data.

In [6]:
# import the BeautifulSoup library
from bs4 import BeautifulSoup

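# open the population_data.xml file and load it into Beautiful Soup
with open('../data/population_data.xml') as fp:
    # 'lxml' selects lxml's HTML parser, which is lenient enough to read
    # this file; 'xml' would select lxml's XML parser instead
    soup = BeautifulSoup(fp, 'lxml')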


In [7]:
# output the first 5 records in the xml file
# this is an example of how to navigate the XML document with BeautifulSoup

# use the find_all method to get record tags; limit=5 stops after five matches
for record in soup.find_all('record', limit=5):
    # use the find_all method to get all fields in each record
    for field in record.find_all('field'):
        print(field['name'], ': ' , field.text)
    print()

Country or Area :  Aruba
Item :  Population, total
Year :  1960
Value :  54211

Country or Area :  Aruba
Item :  Population, total
Year :  1961
Value :  55438

Country or Area :  Aruba
Item :  Population, total
Year :  1962
Value :  56225

Country or Area :  Aruba
Item :  Population, total
Year :  1963
Value :  56695

Country or Area :  Aruba
Item :  Population, total
Year :  1964
Value :  57032
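The same navigation pattern can also be sketched with the standard library's xml.etree.ElementTree. This is an alternative to BeautifulSoup, not what this notebook uses. The inline `xml_text` string is a two-record excerpt in the same shape as population_data.xml, and `rows` is a hypothetical name for the result.

```python
import xml.etree.ElementTree as ET

# A two-record excerpt in the same shape as population_data.xml.
xml_text = """<Root>
  <data>
    <record>
      <field name="Country or Area" key="ABW">Aruba</field>
      <field name="Item" key="SP.POP.TOTL">Population, total</field>
      <field name="Year">1960</field>
      <field name="Value">54211</field>
    </record>
    <record>
      <field name="Country or Area" key="ABW">Aruba</field>
      <field name="Item" key="SP.POP.TOTL">Population, total</field>
      <field name="Year">1961</field>
      <field name="Value">55438</field>
    </record>
  </data>
</Root>"""

root = ET.fromstring(xml_text)

# iter('record') walks every <record> element, much like find_all('record').
rows = []
for record in root.iter('record'):
    # Each <field> becomes one key/value pair: name attribute -> element text.
    rows.append({field.get('name'): field.text
                 for field in record.findall('field')})

print(rows[0])
```

Each record ends up as one dictionary keyed by the `name` attribute, which is the row-per-dictionary shape that pandas can turn into a data frame.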



# XML Exercise (Challenge)

Create a data frame from the xml file. This exercise is somewhat tricky. One solution would be to convert the xml data into dictionaries and then use the dictionaries to create a data frame. 

The dataframe should have the following layout:

| Country or Area | Year | Item | Value |
|----|----|----|----|
| Aruba | 1960 | Population, total | 54211 |
| Aruba | 1961 | Population, total | 55438 |
| ... | ... | ... | ... |



In [8]:
# TODO: Create a pandas data frame from the XML data.
# HINT: You can use dictionaries to create pandas data frames.
# HINT: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.from_dict.html#pandas.DataFrame.from_dict
# HINT: You can make a dictionary for each column or for each row (See the link above for more information)
# HINT: Modify the code from the previous code cell

# one list per column; the keys match the name attribute of each field tag
data_dictionary = {'Country or Area':[], 'Year':[], 'Item':[], 'Value':[]}

# use the find_all method to get all record tags in the document
for record in soup.find_all('record'):
    # append each field's text to the list for its column
    for field in record.find_all('field'):
        data_dictionary[field['name']].append(field.text)
df = pd.DataFrame.from_dict(data_dictionary)

# reshape to one row per country and one column per year
df = df.pivot(index='Country or Area', columns='Year', values='Value')

In [9]:
df.head()

| Country or Area | 1960 | 1961 | 1962 | 1963 | 1964 | 1965 | 1966 | 1967 | 1968 | 1969 | ... | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 |
|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
| Afghanistan | 8996351 | 9166764 | 9345868 | 9533954 | 9731361 | 9938414 | 10152331 | 10372630 | 10604346 | 10854428 | ... | 27294031 | 28004331 | 28803167 | 29708599 | 30696958 | 31731688 | 32758020 | 33736494 | 34656032 | 35530081 |
| Albania | 1608800 | 1659800 | 1711319 | 1762621 | 1814135 | 1864791 | 1914573 | 1965598 | 2022272 | 2081695 | ... | 2947314 | 2927519 | 2913021 | 2905195 | 2900401 | 2895092 | 2889104 | 2880703 | 2876101 | 2873457 |
| Algeria | 11124888 | 11404859 | 11690153 | 11985136 | 12295970 | 12626952 | 12980267 | 13354197 | 13744387 | 14144438 | ... | 34860715 | 35465760 | 36117637 | 36819558 | 37565847 | 38338562 | 39113313 | 39871528 | 40606052 | 41318142 |
| American Samoa | 20013 | 20486 | 21117 | 21882 | 22698 | 23520 | 24321 | 25116 | 25885 | 26614 | ... | 57030 | 56227 | 55637 | 55320 | 55230 | 55307 | 55437 | 55537 | 55599 | 55641 |
| Andorra | 13411 | 14375 | 15370 | 16412 | 17469 | 18549 | 19647 | 20758 | 21890 | 23058 | ... | 83861 | 84462 | 84449 | 83751 | 82431 | 80788 | 79223 | 78014 | 77281 | 76965 |
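One thing to watch: values pulled out of XML with `.text` are strings, so the populations in a data frame built this way are text rather than numbers. A minimal sketch of converting them before reshaping, assuming a small hand-built frame with the same column names (`long_df` and `wide_df` are hypothetical names):

```python
import pandas as pd

# Long-format rows as XML parsing produces them: every value is a string.
long_df = pd.DataFrame({
    'Country or Area': ['Aruba', 'Aruba', 'Andorra', 'Andorra'],
    'Year': ['1960', '1961', '1960', '1961'],
    'Value': ['54211', '55438', '13411', '14375'],
})

# Convert the text to numbers before reshaping so arithmetic works later.
long_df['Value'] = pd.to_numeric(long_df['Value'])

# One row per country, one column per year.
wide_df = long_df.pivot(index='Country or Area', columns='Year', values='Value')
print(wide_df)

# Population change per country between the two years.
print(wide_df['1961'] - wide_df['1960'])
```

Without the `pd.to_numeric` step the subtraction would fail, because Python cannot subtract one string from another.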


# Conclusion

Like CSV, JSON and XML are ways to format data. If everything is formatted correctly, JSON is especially easy to work with. XML is an older standard and a bit trickier to handle.
