## JSON data parsing

In [None]:
import json
import pandas as pd

In [None]:
with open('data.json') as f:  # close the file handle once loading is done
    json_data = json.load(f)
cols = [col["fieldName"] for col in json_data["meta"]["view"]["columns"]]
json_df = pd.DataFrame(json_data["data"], columns=cols)
json_df = json_df.drop(columns=cols[:8])  # Drop the first 8 columns; they are not present in the XML.
# "state_or_nation" stays a regular column because the merge below joins on it.
json_df.shape
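
To make the layout that the cell above relies on concrete, here is a minimal sketch that uses a tiny in-memory document instead of `data.json` and plain Python instead of pandas. The sample values, the `example_measure` field, and the single metadata column are invented for illustration; only the `meta`/`view`/`columns`/`fieldName` path and the `data` list come from the code above.

```python
import json

# Hypothetical miniature of the file layout: column descriptions live under
# meta.view.columns, and each entry of "data" is one row as a plain list.
sample = json.loads("""
{
  "meta": {"view": {"columns": [
    {"fieldName": ":id"},
    {"fieldName": "state_or_nation"},
    {"fieldName": "example_measure"}
  ]}},
  "data": [
    ["row-1", "AK", "1.5"],
    ["row-2", "AL", "2.0"]
  ]
}
""")

field_names = [col["fieldName"] for col in sample["meta"]["view"]["columns"]]

# Pair each row's values with the column names, position by position.
sample_records = [dict(zip(field_names, row)) for row in sample["data"]]

# Drop the leading metadata column (only one in this toy sample).
n_meta = 1
trimmed = [{name: rec[name] for name in field_names[n_meta:]} for rec in sample_records]
print(trimmed)
```

The row lists carry no column names of their own, which is why the names have to be read from the metadata before building the dataframe.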

## XML data parsing

In [None]:
from lxml import objectify

In [None]:
parsed_xml = objectify.parse("rows.xml")  # parse() accepts a path, so no file handle is left open
root = parsed_xml.getroot()

In [None]:
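data = []

# Each inner <row> element holds one record; its children are the columns.
for elt in root.row.row:
    el_data = {}
    for child in elt.iterchildren():  # getchildren() is deprecated in lxml
        el_data[child.tag] = child.pyval
    data.append(el_data)

xml_df = pd.DataFrame(data)
# "state_or_nation" stays a regular column because the merge below joins on it.
xml_df.shape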
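
The same row-to-dict idea can be sketched with the standard library's `xml.etree.ElementTree` in place of `lxml.objectify`, which is a swapped-in parser, not the one used above. The inline XML, the `response` root tag, and the `example_measure` field are assumptions made for illustration; only the nested `row` elements and the `state_or_nation` tag come from the notebook.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of the nested <row> layout the loop above walks.
sample_xml = """
<response>
  <row>
    <row><state_or_nation>AK</state_or_nation><example_measure>1.5</example_measure></row>
    <row><state_or_nation>AL</state_or_nation><example_measure>2.0</example_measure></row>
  </row>
</response>
"""

sample_root = ET.fromstring(sample_xml)

# Each inner <row> becomes one dict: child tag -> child text.
# Unlike objectify's .pyval, ElementTree leaves every value as a string.
sample_rows = [
    {child.tag: child.text for child in row}
    for row in sample_root.findall("./row/row")
]
print(sample_rows)
```

One practical difference: `objectify` converts values to Python types through `.pyval`, while this version keeps strings, so any numeric work would need an explicit conversion.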

## Merging and comparing

In [None]:
merged = json_df.merge(xml_df, on="state_or_nation")
merged.to_csv("joined_data.csv")

In [None]:
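# Column names of each dataframe, kept for side-by-side inspection.
jcol_df = pd.DataFrame(list(json_df.columns))
xcol_df = pd.DataFrame(list(xml_df.columns))

# Print the JSON and XML values next to each other for every JSON column.
for col in json_df.columns:
    for json_val, xml_val in zip(json_df[col], xml_df[col]):
        print(json_val, xml_val)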
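
Printing every pair makes differences hard to spot, so one possible refinement is to report only the cells that disagree. The sketch below works on plain lists of dicts; `find_mismatches`, the sample rows, and the `example_measure` field are hypothetical names and made-up values, and comparing values as strings is an assumption that suits data where one source holds text and the other holds parsed numbers.

```python
def find_mismatches(left_rows, right_rows, key="state_or_nation"):
    """Return (key value, column, left value, right value) for every differing cell."""
    right_by_key = {row[key]: row for row in right_rows}
    mismatches = []
    for left in left_rows:
        right = right_by_key.get(left[key])
        if right is None:
            # Row exists on the left only; report it against the key column.
            mismatches.append((left[key], key, left[key], None))
            continue
        for col, left_val in left.items():
            right_val = right.get(col)
            # Compare as strings so "1.5" (text) and 1.5 (parsed number) match.
            if str(left_val) != str(right_val):
                mismatches.append((left[key], col, left_val, right_val))
    return mismatches


# Made-up rows for illustration only.
json_rows = [{"state_or_nation": "AK", "example_measure": "1.5"},
             {"state_or_nation": "AL", "example_measure": "2.0"}]
xml_rows = [{"state_or_nation": "AK", "example_measure": 1.5},
            {"state_or_nation": "AL", "example_measure": 2.5}]

mismatches = find_mismatches(json_rows, xml_rows)
print(mismatches)
```

Rows in this shape can be obtained from a dataframe with `json_df.to_dict("records")`. Matching on the key column, not on row position, also avoids false mismatches when the two files list the states in a different order.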

## Questions:

1. The two above datasets were linked from this datasearch description. Does the description give a good overview of the data?

        No. The description mentions state averages and other measures, but it never tells the user that this appears to be nursing home data, which is the first thing it should say.
 
2. Do the two datasets contain the same information? What is different about them?

        No, the JSON dataset had more data. In addition to the rows themselves, it contained metadata about the columns, dataset tags, publication dates, and dataset owner information.

3. Why did you decide to use the columns that you chose?

        I dropped the first 8 columns from the JSON data after loading it into a dataframe because those columns did not exist in the XML data. I then compared all values in the remaining columns, just for research purposes.
    
4. What kinds of information can you learn from the data?

        I could learn which states have the highest resident-to-nursing-home ratio. With that, I could check whether there is a correlation between the number of residents per nursing home and the number of fines. I'd hypothesize that nursing homes with many residents and few nurses would be fined more, due to understaffing and cutting corners.
        
5. Compare how much effort it took you to parse through the json vs. the xml file. What do you prefer? Why?

        Parsing and loading the XML took a lot more effort than loading the JSON and was pretty frustrating at first. I would much rather work with JSON data because it is better structured and similar to Python dictionaries. Still, XML remains a prominent data format, so learning to parse it was worth the effort.