# Exploring and Transforming JSON Schemas

## Introduction

In this lesson, you'll formalize how to explore a JSON file whose structure and schema are unknown to you. This often happens in practice when you are handed a file, or stumble upon one, with little documentation.

## Objectives
You will be able to:
* Use the `json` module to load and parse JSON documents
* Load and explore unknown JSON schemas
* Convert JSON to a pandas DataFrame

## Loading the JSON file

Load the data from the file `disease_data.json`.

In [1]:
# Your code here
import pandas as pd
import json

# Use a context manager so the file handle is closed after loading
with open('disease_data.json') as f:
    data = json.load(f)

## Explore the first and second levels of the schema hierarchy

In [2]:
#Your code here
type(data)

dict

In [3]:
data.keys()

dict_keys(['meta', 'data'])

In [4]:
type(data['data'])

list

In [5]:
len(data['data'])

60266

In [6]:
data['data'][0]

[1,
 'FF49C41F-CE8D-46C4-9164-653B1227CF6F',
 1,
 1527194521,
 '959778',
 1527194521,
 '959778',
 None,
 '2016',
 '2016',
 'US',
 'United States',
 'BRFSS',
 'Alcohol',
 'Binge drinking prevalence among adults aged >= 18 years',
 None,
 '%',
 'Crude Prevalence',
 '16.9',
 '16.9',
 '*',
 '50 States + DC: US Median',
 '16',
 '18',
 'Overall',
 'Overall',
 None,
 None,
 None,
 None,
 [None, None, None, None, None],
 None,
 '59',
 'ALC',
 'ALC2_2',
 'CRDPREV',
 'OVERALL',
 'OVR',
 None,
 None,
 None,
 None]

In [7]:
type(data['data'][0])

list

In [8]:
len(data['data'][0])

42

In [9]:
type(data['meta'])

dict

In [10]:
data['meta'].keys()

dict_keys(['view'])

In [11]:
type(data['meta']['view'])

dict

In [12]:
data['meta']['view'].keys()

dict_keys(['id', 'name', 'attribution', 'attributionLink', 'averageRating', 'category', 'createdAt', 'description', 'displayType', 'downloadCount', 'hideFromCatalog', 'hideFromDataJson', 'indexUpdatedAt', 'licenseId', 'newBackend', 'numberOfComments', 'oid', 'provenance', 'publicationAppendEnabled', 'publicationDate', 'publicationGroup', 'publicationStage', 'rowClass', 'rowsUpdatedAt', 'rowsUpdatedBy', 'tableId', 'totalTimesRated', 'viewCount', 'viewLastModified', 'viewType', 'columns', 'grants', 'license', 'metadata', 'owner', 'query', 'rights', 'tableAuthor', 'tags', 'flags'])

In [13]:
data['meta']['view']['columns'][0]

{'id': -1,
 'name': 'sid',
 'dataTypeName': 'meta_data',
 'fieldName': ':sid',
 'position': 0,
 'renderTypeName': 'meta_data',
 'format': {},
 'flags': ['hidden']}
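
The cell-by-cell checks above (`type`, `keys`, `len`, first element) can also be rolled into one small helper. The following is a minimal sketch, not part of the original lab: the `summarize` function and the tiny `sample` document are hypothetical stand-ins, and the helper assumes that lists are homogeneous, so it only inspects the first element.

```python
import json

def summarize(obj, name="root", depth=0, max_depth=2):
    """Yield one line per node with its type and size, up to max_depth levels below the root."""
    indent = "  " * depth
    if isinstance(obj, dict):
        yield f"{indent}{name}: dict with {len(obj)} key(s)"
        if depth < max_depth:
            for key, value in obj.items():
                yield from summarize(value, key, depth + 1, max_depth)
    elif isinstance(obj, list):
        yield f"{indent}{name}: list with {len(obj)} item(s)"
        # Inspect only the first element, assuming the list is homogeneous
        if obj and depth < max_depth:
            yield from summarize(obj[0], f"{name}[0]", depth + 1, max_depth)
    else:
        yield f"{indent}{name}: {type(obj).__name__}"

# Hypothetical miniature document that mimics the meta/data layout seen above
sample = json.loads(
    '{"meta": {"view": {"columns": [{"name": "sid"}]}}, "data": [[1, "a"], [2, "b"]]}'
)
lines = list(summarize(sample))
print("\n".join(lines))
```

Calling `summarize(data, max_depth=3)` on the loaded file should give a comparable outline in one pass, although the real `view` dictionary has many more keys.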

## Convert to a DataFrame

Create a DataFrame from the JSON file. Be sure to retrieve the column names for the DataFrame; they are stored under the `'meta'` key of the top-level dictionary. The DataFrame should include all 42 columns.

In [14]:
#Your code here
df_data = pd.DataFrame(data['data'])
df_data.columns = [x['name'] for x in data['meta']['view']['columns']]
df_data

,sid,id,position,created_at,created_meta,updated_at,updated_meta,meta,YearStart,YearEnd,...,LocationID,TopicID,QuestionID,DataValueTypeID,StratificationCategoryID1,StratificationID1,StratificationCategoryID2,StratificationID2,StratificationCategoryID3,StratificationID3
0,1,FF49C41F-CE8D-46C4-9164-653B1227CF6F,1,1527194521,959778,1527194521,959778,,2016,2016,...,59,ALC,ALC2_2,CRDPREV,OVERALL,OVR,,,,
1,2,F4468C3D-340A-4CD2-84A3-DF554DFF065E,2,1527194521,959778,1527194521,959778,,2016,2016,...,01,ALC,ALC2_2,CRDPREV,OVERALL,OVR,,,,
2,3,65609156-A343-4869-B03F-2BA62E96AC19,3,1527194521,959778,1527194521,959778,,2016,2016,...,02,ALC,ALC2_2,CRDPREV,OVERALL,OVR,,,,
3,4,0DB09B00-EFEB-4AC0-9467-A7CBD2B57BF3,4,1527194521,959778,1527194521,959778,,2016,2016,...,04,ALC,ALC2_2,CRDPREV,OVERALL,OVR,,,,
4,5,D98DA5BA-6FD6-40F5-A9B1-ABD45E44967B,5,1527194521,959778,1527194521,959778,,2016,2016,...,05,ALC,ALC2_2,CRDPREV,OVERALL,OVR,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
60261,519150,1B28C1DD-B25F-457E-86E4-7D1463BE82C3,519150,1527194644,959778,1527194644,959778,,2016,2016,...,72,DIS,DIS1_0,CRDPREV,RACE,ASN,,,,
60262,519704,4FF6ADF8-CC4B-4D94-A5B0-7766346A0D3E,519704,1527194644,959778,1527194644,959778,,2016,2016,...,72,OVC,OVC3_1,CRDPREV,RACE,BLK,,,,
60263,519705,02896705-4A9F-45A2-A84B-923DEA6DC6A2,519705,1527194644,959778,1527194644,959778,,2016,2016,...,72,OVC,OVC3_1,CRDPREV,RACE,AIAN,,,,
60264,519706,4DF2E74C-5043-474B-9739-98B4D8736BDB,519706,1527194644,959778,1527194644,959778,,2016,2016,...,72,OVC,OVC3_1,CRDPREV,RACE,ASN,,,,
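
The same pattern can be tried on a small, self-contained example. This is a sketch under stated assumptions: the `sample` dictionary is a hypothetical three-column miniature of the file's layout (rows under `'data'`, column descriptions under `meta -> view -> columns`), and the length check is an extra safeguard that is not part of the original cell.

```python
import pandas as pd

# Hypothetical miniature of the file's layout, for illustration only
sample = {
    "meta": {"view": {"columns": [
        {"name": "sid"},
        {"name": "LocationAbbr"},
        {"name": "Topic"},
    ]}},
    "data": [
        [1, "US", "Alcohol"],
        [2, "AL", "Alcohol"],
    ],
}

# Column names live in the metadata, one dictionary per column
columns = [col["name"] for col in sample["meta"]["view"]["columns"]]

# Fail early if any row does not line up with the column metadata
if any(len(row) != len(columns) for row in sample["data"]):
    raise ValueError("row length does not match the number of columns")

df = pd.DataFrame(sample["data"], columns=columns)
print(df)
```

Passing `columns=` to the constructor names the columns in one step instead of assigning `df.columns` afterwards; both approaches give the same frame.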


## Level-Up: Create a bar graph of states with the highest asthma rates for adults aged 18+

In [15]:
df_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60266 entries, 0 to 60265
Data columns (total 42 columns):
sid                          60266 non-null int64
id                           60266 non-null object
position                     60266 non-null int64
created_at                   60266 non-null int64
created_meta                 60266 non-null object
updated_at                   60266 non-null int64
updated_meta                 60266 non-null object
meta                         0 non-null object
YearStart                    60266 non-null object
YearEnd                      60266 non-null object
LocationAbbr                 60266 non-null object
LocationDesc                 60266 non-null object
DataSource                   60266 non-null object
Topic                        60266 non-null object
Question                     60266 non-null object
Response                     0 non-null object
DataValueUnit                60158 non-null object
DataValueType                60266 n

In [16]:
nulls = df_data.isna().sum()

In [17]:
# Columns whose null count equals the number of rows are entirely empty
drop_cols = list(nulls[nulls == len(df_data)].index)
drop_cols

['meta',
 'Response',
 'StratificationCategory2',
 'Stratification2',
 'StratificationCategory3',
 'Stratification3',
 'ResponseID',
 'StratificationCategoryID2',
 'StratificationID2',
 'StratificationCategoryID3',
 'StratificationID3']

In [18]:
df_data = df_data.drop(columns=drop_cols)
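
Counting nulls and comparing against the row count works, but pandas can also drop entirely empty columns directly with `dropna`. The sketch below uses a hypothetical three-column `toy` frame rather than the real data, and shows that partly empty columns are kept.

```python
import pandas as pd

# Hypothetical toy frame: one fully empty column, one partly empty column
toy = pd.DataFrame({
    "Topic": ["Alcohol", "Asthma"],
    "Response": [None, None],       # entirely empty, will be dropped
    "DataValueUnit": ["%", None],   # partly empty, will be kept
})

# how="all" removes a column only when every value in it is missing
cleaned = toy.dropna(axis="columns", how="all")
print(list(cleaned.columns))
```

On the lab's DataFrame, `df_data.dropna(axis="columns", how="all")` should remove the same eleven columns listed above, assuming nothing else has changed in between.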

In [20]:
df_data.head()

,sid,id,position,created_at,created_meta,updated_at,updated_meta,YearStart,YearEnd,LocationAbbr,...,HighConfidenceLimit,StratificationCategory1,Stratification1,GeoLocation,LocationID,TopicID,QuestionID,DataValueTypeID,StratificationCategoryID1,StratificationID1
0,1,FF49C41F-CE8D-46C4-9164-653B1227CF6F,1,1527194521,959778,1527194521,959778,2016,2016,US,...,18.0,Overall,Overall,"[None, None, None, None, None]",59,ALC,ALC2_2,CRDPREV,OVERALL,OVR
1,2,F4468C3D-340A-4CD2-84A3-DF554DFF065E,2,1527194521,959778,1527194521,959778,2016,2016,AL,...,14.1,Overall,Overall,"[None, 32.84057112200048, -86.63186076199969, ...",1,ALC,ALC2_2,CRDPREV,OVERALL,OVR
2,3,65609156-A343-4869-B03F-2BA62E96AC19,3,1527194521,959778,1527194521,959778,2016,2016,AK,...,20.6,Overall,Overall,"[None, 64.84507995700051, -147.72205903599973,...",2,ALC,ALC2_2,CRDPREV,OVERALL,OVR
3,4,0DB09B00-EFEB-4AC0-9467-A7CBD2B57BF3,4,1527194521,959778,1527194521,959778,2016,2016,AZ,...,16.9,Overall,Overall,"[None, 34.865970280000454, -111.76381127699972...",4,ALC,ALC2_2,CRDPREV,OVERALL,OVR
4,5,D98DA5BA-6FD6-40F5-A9B1-ABD45E44967B,5,1527194521,959778,1527194521,959778,2016,2016,AR,...,17.2,Overall,Overall,"[None, 34.74865012400045, -92.27449074299966, ...",5,ALC,ALC2_2,CRDPREV,OVERALL,OVR


In [23]:
df_data['GeoLocation'][50]

[None, '44.39319117400049', '-89.81637074199966', None, False]
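
Each `GeoLocation` entry is a five-element list, which is awkward to filter or plot. One possible way to flatten it is sketched below. It assumes that positions 1 and 2 hold latitude and longitude (the values shown above are consistent with that, but the metadata has not been checked), and the `Latitude` and `Longitude` column names are hypothetical.

```python
import pandas as pd

# Two GeoLocation values copied from the outputs above, keyed by their row labels
toy = pd.DataFrame(
    {
        "GeoLocation": [
            [None, None, None, None, None],
            [None, "44.39319117400049", "-89.81637074199966", None, False],
        ]
    },
    index=[0, 50],
)

# ASSUMPTION: element 1 is latitude and element 2 is longitude
toy["Latitude"] = pd.to_numeric(
    toy["GeoLocation"].apply(lambda geo: geo[1]), errors="coerce"
)
toy["Longitude"] = pd.to_numeric(
    toy["GeoLocation"].apply(lambda geo: geo[2]), errors="coerce"
)
print(toy[["Latitude", "Longitude"]])
```

Rows without coordinates, such as the national `US` row, end up as `NaN` in both new columns.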

In [29]:
df_data.Topic.unique()

array(['Alcohol', 'Arthritis', 'Asthma', 'Cancer', 'Diabetes',
       'Mental Health', 'Chronic Obstructive Pulmonary Disease',
       'Oral Health', 'Cardiovascular Disease', 'Immunization',
       'Chronic Kidney Disease',
       'Nutrition, Physical Activity, and Weight Status', 'Older Adults',
       'Tobacco', 'Overarching Conditions', 'Reproductive Health',
       'Disability'], dtype=object)

In [39]:
# Case-insensitive match, so both 'Asthma' and 'asthma' are found
df_data.loc[df_data['Question'].str.contains('asthma', case=False), 'Question'].unique()

array(['Current asthma prevalence among adults aged >= 18 years',
       'Asthma prevalence among women aged 18-44 years',
       'Influenza vaccination among noninstitutionalized adults aged 18-64 years with asthma',
       'Influenza vaccination among noninstitutionalized adults aged >= 65 years with asthma',
       'Pneumococcal vaccination among noninstitutionalized adults aged 18-64 years with asthma',
       'Pneumococcal vaccination among noninstitutionalized adults aged >= 65 years with asthma'],
      dtype=object)

In [41]:
df_data['LocationAbbr'].unique()

array(['US', 'AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'DC', 'FL',
       'GA', 'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME', 'MD',
       'MA', 'MI', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ', 'NM',
       'NY', 'NC', 'ND', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN',
       'TX', 'UT', 'VT', 'VA', 'WA', 'WV', 'WI', 'WY', 'GU', 'PR', 'VI'],
      dtype=object)
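
With the question text and location codes identified, the bar graph this section asks for could be built roughly as follows. This is only a sketch: the `toy` frame holds placeholder numbers invented for illustration, not real prevalence figures, and the `DataValue` column name is an assumption about where the measured value is stored. The filters on `CRDPREV` and `OVR` reuse the codes seen in the first rows of the data to keep one overall crude-prevalence row per location.

```python
import pandas as pd
import matplotlib.pyplot as plt

QUESTION = "Current asthma prevalence among adults aged >= 18 years"

# Placeholder rows: the DataValue numbers are invented for illustration only
toy = pd.DataFrame({
    "LocationAbbr": ["US", "AL", "AK", "AZ", "AL"],
    "Question": [QUESTION] * 4 + ["Asthma prevalence among women aged 18-44 years"],
    "DataValueTypeID": ["CRDPREV"] * 5,
    "StratificationID1": ["OVR"] * 5,
    "DataValue": ["1.0", "3.0", "2.0", None, "9.0"],  # ASSUMED column name
})

def top_asthma_states(df, n=10):
    """Return the n locations with the highest overall crude asthma prevalence."""
    mask = (
        (df["Question"] == QUESTION)
        & (df["DataValueTypeID"] == "CRDPREV")
        & (df["StratificationID1"] == "OVR")
        & (df["LocationAbbr"] != "US")  # exclude the national aggregate
    )
    rates = df.loc[mask, ["LocationAbbr", "DataValue"]].copy()
    # Values are stored as strings; convert them and drop missing entries
    rates["DataValue"] = pd.to_numeric(rates["DataValue"], errors="coerce")
    rates = rates.dropna(subset=["DataValue"])
    return (
        rates.set_index("LocationAbbr")["DataValue"]
        .sort_values(ascending=False)
        .head(n)
    )

top = top_asthma_states(toy, n=2)
ax = top.plot.bar()
ax.set_xlabel("State")
ax.set_ylabel("Crude prevalence (%)")
ax.set_title("Highest adult asthma prevalence")
plt.tight_layout()
```

On the real data, `top_asthma_states(df_data)` would be the natural call. If a location still appears more than once, for example because several years are present, an extra filter or aggregation step would be needed.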

## Summary

Well done! In this lab you got some extended practice exploring the structure of JSON files, converting JSON files to pandas DataFrames, and visualizing data!