# Exploring and Transforming JSON Schemas

## Introduction

In this lesson, you'll formalize how to explore a JSON file whose structure and schema are unknown to you. This often happens in practice when you are handed a file, or stumble upon one, with little or no documentation.

## Objectives
You will be able to:
* Explore unknown JSON schemas
* Access and manipulate data inside a JSON file
* Convert JSON to alternative data formats

## Loading the JSON file

Load the data from the file `disease_data.json`.

In [1]:
# Your code here
import json

# Use a context manager so the file is closed once the data is loaded
with open('disease_data.json') as f:
    data = json.load(f)

## Explore the first and second levels of the schema hierarchy

In [2]:
#Your code here
type(data)

dict

In [3]:
data.keys()

dict_keys(['meta', 'data'])

In [4]:
type(data['meta'])

dict

In [7]:
data['meta'].keys()

dict_keys(['view'])

In [5]:
type(data['data'])

list

In [8]:
len(data['data'])

60266

In [9]:
type(data['data'][0])

list

In [10]:
len(data['data'][0])

42
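
The cell-by-cell checks above (type, keys, length) can be wrapped in a small helper. The following is a minimal sketch and not part of the original lab: the function names `describe` and `describe_levels` are illustrative, and `sample` is a hypothetical stand-in that only mimics the top-level shape of `disease_data.json`.

```python
def describe(obj):
    """Return a one-line summary of a parsed JSON value: its type and size."""
    if isinstance(obj, dict):
        return "dict with keys {}".format(list(obj.keys()))
    if isinstance(obj, list):
        return "list of {} items".format(len(obj))
    return type(obj).__name__


def describe_levels(obj):
    """Summarize the root object and each of its direct children."""
    lines = ["root: " + describe(obj)]
    if isinstance(obj, dict):
        children = obj.items()
    elif isinstance(obj, list):
        children = enumerate(obj[:5])  # preview only the first five entries
    else:
        children = []
    for key, value in children:
        lines.append("  {}: {}".format(key, describe(value)))
    return lines


# Hypothetical stand-in with the same top-level shape as the lab's file.
sample = {"meta": {"view": {"id": "abc", "name": "demo"}},
          "data": [[1, "x"], [2, "y"]]}

for line in describe_levels(sample):
    print(line)
```

Calling `describe_levels(data)` on the loaded file should give the same information as the individual cells above in a single pass.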

## Convert to a DataFrame

Create a DataFrame from the JSON file. Be sure to retrieve the column names for the DataFrame. (Search within the 'meta' key of the master dictionary.) The DataFrame should include all 42 columns.

In [11]:
#Your code here
type(data['meta']['view'])

dict

In [12]:
data['meta']['view'].keys()

dict_keys(['id', 'name', 'attribution', 'attributionLink', 'averageRating', 'category', 'createdAt', 'description', 'displayType', 'downloadCount', 'hideFromCatalog', 'hideFromDataJson', 'indexUpdatedAt', 'licenseId', 'newBackend', 'numberOfComments', 'oid', 'provenance', 'publicationAppendEnabled', 'publicationDate', 'publicationGroup', 'publicationStage', 'rowClass', 'rowsUpdatedAt', 'rowsUpdatedBy', 'tableId', 'totalTimesRated', 'viewCount', 'viewLastModified', 'viewType', 'columns', 'grants', 'license', 'metadata', 'owner', 'query', 'rights', 'tableAuthor', 'tags', 'flags'])

In [16]:
import pandas as pd

df = pd.DataFrame(data['data'])
print(df.shape)
df.columns = [item['name'] for item in data['meta']['view']['columns']]
df.head()

(60266, 42)


| | sid | id | position | created_at | created_meta | updated_at | updated_meta | meta | YearStart | YearEnd | ... | LocationID | TopicID | QuestionID | DataValueTypeID | StratificationCategoryID1 | StratificationID1 | StratificationCategoryID2 | StratificationID2 | StratificationCategoryID3 | StratificationID3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | FF49C41F-CE8D-46C4-9164-653B1227CF6F | 1 | 1527194521 | 959778 | 1527194521 | 959778 | | 2016 | 2016 | ... | 59 | ALC | ALC2_2 | CRDPREV | OVERALL | OVR | | | | |
| 1 | 2 | F4468C3D-340A-4CD2-84A3-DF554DFF065E | 2 | 1527194521 | 959778 | 1527194521 | 959778 | | 2016 | 2016 | ... | 1 | ALC | ALC2_2 | CRDPREV | OVERALL | OVR | | | | |
| 2 | 3 | 65609156-A343-4869-B03F-2BA62E96AC19 | 3 | 1527194521 | 959778 | 1527194521 | 959778 | | 2016 | 2016 | ... | 2 | ALC | ALC2_2 | CRDPREV | OVERALL | OVR | | | | |
| 3 | 4 | 0DB09B00-EFEB-4AC0-9467-A7CBD2B57BF3 | 4 | 1527194521 | 959778 | 1527194521 | 959778 | | 2016 | 2016 | ... | 4 | ALC | ALC2_2 | CRDPREV | OVERALL | OVR | | | | |
| 4 | 5 | D98DA5BA-6FD6-40F5-A9B1-ABD45E44967B | 5 | 1527194521 | 959778 | 1527194521 | 959778 | | 2016 | 2016 | ... | 5 | ALC | ALC2_2 | CRDPREV | OVERALL | OVR | | | | |
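
A DataFrame is one target format; the same pairing of column names with rows also yields a CSV. The sketch below uses only the standard library and is not part of the original lab: `rows_to_csv` is a hypothetical helper, and `sample` is a made-up miniature that assumes the same layout as the file (column descriptions under `meta -> view -> columns`, records under `data`). It also checks that every row has as many values as there are column names, which is worth confirming before assigning names to columns.

```python
import csv
import io


def rows_to_csv(column_names, rows):
    """Write rows (lists of values) to CSV text with a header line."""
    for i, row in enumerate(rows):
        if len(row) != len(column_names):
            raise ValueError("row {} has {} values, expected {}".format(
                i, len(row), len(column_names)))
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(column_names)
    writer.writerows(rows)  # None values are written as empty fields
    return buffer.getvalue()


# Hypothetical miniature of the lab's structure (not real data).
sample = {
    "meta": {"view": {"columns": [{"name": "sid"},
                                  {"name": "YearStart"},
                                  {"name": "TopicID"}]}},
    "data": [[1, 2016, "ALC"], [2, 2016, None]],
}

names = [col["name"] for col in sample["meta"]["view"]["columns"]]
csv_text = rows_to_csv(names, sample["data"])
print(csv_text)
```

To write to disk instead, the same `csv.writer` calls can target a file opened with `open(path, 'w', newline='')`.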


## Level-Up
## Create a bar graph of states with the highest asthma rates for adults age 18+

## Level-Up!
## Create a function (or class) that returns an outline of the schema structure like this: 
<img src="images/outline.jpg" width="350">

Rules:
* Your outline should follow the numbering outline above (I, A, 1, a, i).
* Your outline should be properly indented! (Four spaces or one tab per indentation level.)
* Your function goes to at least a depth of 5 (Level-up: create a parameter so that the user can specify this)
* If an entry is a dictionary, list its keys as the subheadings
* After listing a key name (where applicable) include a space, a dash and the data type of the entry
* If an entry is a dict or list put in parentheses how many items are in the entry
* lists will not have key names for their entries (they're just indexed)
* For subheadings of a list, state their datatypes. 
* If a dictionary or list is more than 5 items long, only show the first 5 (we want to limit our previews); make an arbitrary order choice for dictionaries. (Level-up: Parallel to above; allow user to specify number of items to preview for large subheading collections.)
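
If the preview size becomes a user parameter, labels beyond the fifth entry are needed at every level. One possible way to generate them, offered as a sketch rather than the lab's intended solution, is shown below. The names `to_roman`, `to_letters`, and `outline_label` are hypothetical, and continuing past `Z` with `AA`, `AB`, ... is an assumption about how the lettering should extend.

```python
import string


def to_roman(number):
    """Convert a positive integer to an upper-case Roman numeral."""
    numerals = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
                (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
                (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]
    result = ""
    for value, symbol in numerals:
        while number >= value:
            result += symbol
            number -= value
    return result


def to_letters(number):
    """Convert 1 -> A, 26 -> Z, 27 -> AA (spreadsheet-column style)."""
    result = ""
    while number > 0:
        number, remainder = divmod(number - 1, 26)
        result = string.ascii_uppercase[remainder] + result
    return result


def outline_label(depth, index):
    """Label for the entry at zero-based `index` on zero-based `depth`.

    The five styles (I, A, 1, a, i) repeat for deeper nesting.
    """
    position = index + 1
    styles = [
        to_roman,                         # I, II, III, ...
        to_letters,                       # A, B, C, ...
        str,                              # 1, 2, 3, ...
        lambda n: to_letters(n).lower(),  # a, b, c, ...
        lambda n: to_roman(n).lower(),    # i, ii, iii, ...
    ]
    return styles[depth % 5](position)


print([outline_label(depth, 0) for depth in range(5)])
```

A lookup table of five labels per level is simpler when the preview is fixed at five items; generating labels only pays off once that limit is configurable.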

In [17]:
# Your code here
def get_header(depth, exn):
    """Return the outline label for entry number exn (0-4) at the given depth."""
    lvl_headers = [['I', 'II', 'III', 'IV', 'V'],
                   ['A', 'B', 'C', 'D', 'E'],
                   [1, 2, 3, 4, 5],
                   ['a', 'b', 'c', 'd', 'e'],
                   ['i', 'ii', 'iii', 'iv', 'v']]
    depth = depth % 5  # reuse the five label styles for deeply nested structures
    return lvl_headers[depth][exn]


def get_obj_length(obj):
    """Return '(n items)' for dicts and lists, and an empty string otherwise."""
    if isinstance(obj, (dict, list)):
        return '({} items)'.format(len(obj))
    return ""


def obj_overview(obj, cur_printout, depth, exn, name=None):
    """Append one outline line describing obj to cur_printout and return it."""
    cur_header = get_header(depth, exn)
    obj_length = get_obj_length(obj)
    if cur_printout == "":
        # The first line of the outline is always the root object
        return "{}. root - {} {}".format(cur_header, type(obj), obj_length)
    spaces = ' ' * depth * 4  # four spaces per indentation level
    return cur_printout + '\n{}{}. {}{} {}'.format(spaces, cur_header, name, type(obj), obj_length)


def print_obj_outline(obj, cur_printout="", depth=0, exn=0, max_en=5, max_depth=10, name=""):
    """Return an outline of obj as a string (the caller prints it).

    obj is the current data object within the JSON to be processed.
    Call this on the root node; the function recursively builds the tree, burrowing down through nested data.
    depth is the current depth within the recursive calls. This determines the indentation and which headers to use.
    exn is the example number within the current level; successive calls update it accordingly.
    max_en is the number of entries previewed per dict or list (capped at 5, the number of labels get_header defines).
    max_depth is the deepest level whose children are still expanded.
    name identifies the key associated with a value taken from a dictionary."""
    cur_printout = obj_overview(obj, cur_printout, depth=depth, exn=exn, name=name)
    if depth >= max_depth:
        return cur_printout
    max_en = min(max_en, 5)  # get_header only defines five labels per level
    if type(obj) == list:
        children = [("", item) for item in obj[:max_en]]  # list entries have no key names
    elif type(obj) == dict:
        children = [(key + ' ', obj[key]) for key in list(obj.keys())[:max_en]]
    else:
        children = []
    for n, (child_name, child) in enumerate(children):
        cur_printout = print_obj_outline(child, cur_printout=cur_printout, depth=depth+1, exn=n,
                                         max_en=max_en, max_depth=max_depth, name=child_name)
    return cur_printout

    
# 1st Draft....work left for demonstrating initial thought process  
# def outline_hierarchy(json_obj):
#     outer_depth = 1 #initialize depth counter
#     inner_depth = 1
#     cur_header = get_header(outer_depth-1, inner_depth-1)
#     output = "{}root - {} {}".format(cur_header, type(json_obj), get_obj_length(json_obj))
#     #Initialize a parent object
#     parent = json_obj
    
#     #Depth first search; easier for creating the string by appending
#     while outer_depth <= 5:
#         while inner_depth <=5:
#             #add a new line to our output
#             spaces = ' '*4*outer_depth #four spaces is equivalent to a tab
#             children = get_children(obj)

In [18]:
outline = print_obj_outline(data)

In [25]:
print(outline) #Your function should produce the following output for this json object (and work for all json files!)

I. root - <class 'dict'> (2 items)
    A. meta <class 'dict'> (1 items)
        1. view <class 'dict'> (40 items)
            a. id <class 'str'> 
            b. name <class 'str'> 
            c. attribution <class 'str'> 
            d. attributionLink <class 'str'> 
            e. averageRating <class 'int'> 
    B. data <class 'list'> (60266 items)
        1. <class 'list'> (42 items)
            a. <class 'int'> 
            b. <class 'str'> 
            c. <class 'int'> 
            d. <class 'int'> 
            e. <class 'str'> 
        2. <class 'list'> (42 items)
            a. <class 'int'> 
            b. <class 'str'> 
            c. <class 'int'> 
            d. <class 'int'> 
            e. <class 'str'> 
        3. <class 'list'> (42 items)
            a. <class 'int'> 
            b. <class 'str'> 
            c. <class 'int'> 
            d. <class 'int'> 
            e. <class 'str'> 
        4. <class 'list'> (42 items)
            a. <class 'int'> 
            b. <class 'str'> 
            [output truncated]

## Summary

Well done! In this lab you got some extended practice exploring the structure of JSON files and wrote a generalized recursive function for outlining a JSON file's schema!