In GrEx3 you are required to write a loop to "process" 26 JSON files (in a directory). Below I show how to write a loop that creates a list of dictionaries, one per JSON file. You can use this template to process the 26 JSON files: instead of creating a list of dictionaries, you create the list of DataFrames that you need to concatenate to get the DataFrame for Part 1.

In [1]:
import json
import os
import pandas as pd
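try:
    from pandas import json_normalize           # pandas 1.0 and later
except ImportError:
    from pandas.io.json import json_normalize   # older pandas versions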

In [2]:
# https://www.tutorialspoint.com/python/os_listdir.htm
# get list of files in current directory, i.e. path = '.'
file_list = os.listdir('.') 
file_list

['.DS_Store',
 '.ipynb_checkpoints',
 'company1.json',
 'company2.json',
 'Examples',
 'Examples-1.zip',
 'GrEx3 Introduction (Part I).pdf',
 'GrEx3 Part 1.html',
 'GrEx3 Part 1.ipynb']
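
The listing mixes the JSON files with unrelated entries, which is why the loop has to filter on the file extension. As an alternative sketch, the standard-library glob module can return only the matching names. The helper name list_json_files is made up for this illustration, and the sorting is added only to make the order predictable, since neither os.listdir nor glob promises any particular order.

```python
import glob
import os


def list_json_files(directory='.'):
    """Return the sorted names of the .json files in `directory`."""
    pattern = os.path.join(directory, '*.json')
    # glob returns paths that include the directory; keep only the file
    # names so the result looks like what os.listdir would give
    return sorted(os.path.basename(path) for path in glob.glob(pattern))


print(list_json_files('.'))
```

Note that the returned names are relative to the directory you passed in, so for anything other than '.' you would open os.path.join(directory, name) rather than name alone.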

In [3]:
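company_list = []
for file in file_list:
    # skip everything that is not a json file (e.g. '.DS_Store', 'Examples')
    if file.endswith('.json'):
        # The with statement below is equivalent to the following:
        # input_file = open(file)
        # jsondat = json.load(input_file)
        # input_file.close()
        # Note: open(file) works because the files are in the current directory;
        # for any other path use open(os.path.join(path, file))
        with open(file) as input_file:
            jsondat = json.load(input_file)
        # one dictionary per json file
        company_list.append(jsondat)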

Let us display each dictionary in turn.

In [4]:
company_list[0]

{u'cars': [{u'make': u'Toyota',
   u'mileage': {u'mpg(city)': 48, u'mpg(hwy)': 43},
   u'model': u'Prius C',
   u'year': 2017},
  {u'make': u'Pontiac',
   u'mileage': {u'mpg(city)': 16, u'mpg(hwy)': 26},
   u'model': u'Bonneville',
   u'year': 1997},
  {u'make': u'Maserati',
   u'mileage': {u'mpg(city)': 10, u'mpg(hwy)': 16},
   u'model': u'Spider',
   u'year': 2014}],
 u'dealer': {u'address': u'<address class="addressReset"> <span rel="v:address"> <span dir="ltr"><span class="street-address" property="v:street-address">77 Industry Way</span>, <span class="locality"><span property="v:locality">Atlanta</span>, <span property="v:region">GA</span> <span property="v:postal-code">30301-2530</span></span> </span> </span> </address>',
  u'name': u'Buy Here, Buy Now'}}

In [5]:
company_list[1]

{'cars': [{'make': 'Volkswagen',
   'mileage': {'mpg(city)': 21, 'mpg(hwy)': 31},
   'model': 'Eos',
   'year': 2017},
  {'make': 'Chrysler',
   'mileage': {'mpg(city)': 18, 'mpg(hwy)': 25},
   'model': 'Sebring',
   'year': 1997}],
 'dealer': {'address': '<address class="addressReset"> <span rel="v:address"> <span dir="ltr"><span class="street-address" property="v:street-address">123 Main Street</span>, <span class="locality"><span property="v:locality">Seattle</span>, <span property="v:region">WA</span> <span property="v:postal-code">11111-1234</span></span> </span> </span> </address>',
  'name': 'Cars R Us'}}

Let us just work with the first company here. Again, some of what I do below is what you would do in the body of the loop that reads the JSON files. First, let us get the keys.

In [6]:
jsondat = company_list[0]
jsondat.keys()

dict_keys(['dealer', 'cars'])

We save the values associated with each key separately...

In [7]:
dealer_info = jsondat['dealer']
dealer_info

{'address': '<address class="addressReset"> <span rel="v:address"> <span dir="ltr"><span class="street-address" property="v:street-address">77 Industry Way</span>, <span class="locality"><span property="v:locality">Atlanta</span>, <span property="v:region">GA</span> <span property="v:postal-code">30301-2530</span></span> </span> </span> </address>',
 'name': 'Buy Here, Buy Now'}

In [8]:
cars_info = jsondat['cars']
cars_info

[{'make': 'Toyota',
  'mileage': {'mpg(city)': 48, 'mpg(hwy)': 43},
  'model': 'Prius C',
  'year': 2017},
 {'make': 'Pontiac',
  'mileage': {'mpg(city)': 16, 'mpg(hwy)': 26},
  'model': 'Bonneville',
  'year': 1997},
 {'make': 'Maserati',
  'mileage': {'mpg(city)': 10, 'mpg(hwy)': 16},
  'model': 'Spider',
  'year': 2014}]

Now we create DataFrames out of dealer_info and cars_info.

In [9]:
cars_df = pd.DataFrame(cars_info)
cars_df

       make                            mileage       model  year
0    Toyota  {'mpg(city)': 48, 'mpg(hwy)': 43}     Prius C  2017
1   Pontiac  {'mpg(city)': 16, 'mpg(hwy)': 26}  Bonneville  1997
2  Maserati  {'mpg(city)': 10, 'mpg(hwy)': 16}      Spider  2014


Note that we have a problem here: the mileage column holds whole dictionaries instead of numbers. Observe the nested structure of each dictionary in cars_info. Fortunately, json_normalize comes to the rescue.

In [10]:
cars_df = json_normalize(cars_info)
cars_df

       make  mileage.mpg(city)  mileage.mpg(hwy)       model  year
0    Toyota                 48                43     Prius C  2017
1   Pontiac                 16                26  Bonneville  1997
2  Maserati                 10                16      Spider  2014
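
To make the flattening concrete: json_normalize turns each nested key into its own column, and the column name joins the outer and inner keys with a dot. The sketch below assumes pandas 1.0 or later, where the function is available as pd.json_normalize. The sep argument is optional and is shown only because dotted names can be awkward to work with later; the variable names flat and flat_underscore are just for illustration.

```python
import pandas as pd

cars_info = [
    {'make': 'Toyota', 'model': 'Prius C', 'year': 2017,
     'mileage': {'mpg(city)': 48, 'mpg(hwy)': 43}},
    {'make': 'Pontiac', 'model': 'Bonneville', 'year': 1997,
     'mileage': {'mpg(city)': 16, 'mpg(hwy)': 26}},
]

# Default: nested keys are joined to the parent key with '.'
flat = pd.json_normalize(cars_info)

# Optional: choose a different separator for the flattened column names
flat_underscore = pd.json_normalize(cars_info, sep='_')

print(sorted(flat.columns))
print(sorted(flat_underscore.columns))
```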


That's better. Now for dealer_info...

In [11]:
dealer_df = json_normalize(dealer_info)
dealer_df

                                             address               name
0  <address class="addressReset"> <span rel="v:ad...  Buy Here, Buy Now


Did you notice that the address in dealer_info contains HTML tags? Let's fix the problem using BeautifulSoup.

In [14]:
#https://www.crummy.com/software/BeautifulSoup/
from bs4 import BeautifulSoup   #First install beautifulsoup4 using Canopy Package Manager
soup = BeautifulSoup(dealer_info['address'], 'html.parser')  # create a BeautifulSoup object out of the address string

In [15]:
clean_address = soup.get_text().strip()    # get text and strip it of enclosing white space
dealer_info['address'] = clean_address     # save the clean address back to dealer_info
dealer_info

{'address': '77 Industry Way, Atlanta, GA 30301-2530',
 'name': 'Buy Here, Buy Now'}
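
If BeautifulSoup cannot be installed, roughly the same clean-up can be sketched with the html.parser module from the Python 3 standard library. This is a substitute technique, not what the cells above do; the class name TextExtractor and the function name strip_tags are invented for this illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text between tags, discarding the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for each run of text outside of a tag
        self.chunks.append(data)


def strip_tags(html_text):
    """Return the text of `html_text` without tags or enclosing white space."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()   # flush any text still buffered by the parser
    return ''.join(parser.chunks).strip()


raw = ('<address class="addressReset"> <span rel="v:address"> '
       '<span dir="ltr"><span class="street-address" '
       'property="v:street-address">77 Industry Way</span>, '
       '<span class="locality"><span property="v:locality">Atlanta</span>, '
       '<span property="v:region">GA</span> '
       '<span property="v:postal-code">30301-2530</span></span> </span> '
       '</span> </address>')
print(strip_tags(raw))
```

For the sample address this should give the same string as the BeautifulSoup version. BeautifulSoup remains the more forgiving choice when the HTML is messy.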

In [16]:
dealer_df = json_normalize(dealer_info)   # create a DataFrame from the new dealer_info
dealer_df

                                   address               name
0  77 Industry Way, Atlanta, GA 30301-2530  Buy Here, Buy Now
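
Putting the steps together, the body of the file loop could be wrapped in a function that returns one DataFrame per company, with the resulting list passed to pd.concat. The sketch below is one possible arrangement, not the required solution. The function name company_to_df and the column names dealer_name and dealer_address are made up, attaching the dealer fields to every car row is only an assumption about what Part 1 needs, and the addresses are assumed to be cleaned already. It also assumes pandas 1.0 or later for pd.json_normalize.

```python
import pandas as pd


def company_to_df(jsondat):
    """Build one DataFrame for a company dictionary (hypothetical helper)."""
    # One row per car, with the nested mileage dictionary flattened
    cars_df = pd.json_normalize(jsondat['cars'])
    # Repeat the dealer fields on every car row (an assumed layout)
    dealer = jsondat['dealer']
    return cars_df.assign(dealer_name=dealer['name'],
                          dealer_address=dealer['address'])


# Two small in-memory companies stand in for the dictionaries read from disk
company_list = [
    {'dealer': {'name': 'Buy Here, Buy Now',
                'address': '77 Industry Way, Atlanta, GA 30301-2530'},
     'cars': [{'make': 'Toyota', 'model': 'Prius C', 'year': 2017,
               'mileage': {'mpg(city)': 48, 'mpg(hwy)': 43}}]},
    {'dealer': {'name': 'Cars R Us',
                'address': '123 Main Street, Seattle, WA 11111-1234'},
     'cars': [{'make': 'Volkswagen', 'model': 'Eos', 'year': 2017,
               'mileage': {'mpg(city)': 21, 'mpg(hwy)': 31}},
              {'make': 'Chrysler', 'model': 'Sebring', 'year': 1997,
               'mileage': {'mpg(city)': 18, 'mpg(hwy)': 25}}]},
]

# Build the list of DataFrames, one per company, then stack them
df_list = [company_to_df(jsondat) for jsondat in company_list]
# ignore_index=True renumbers the rows 0..n-1 across all companies
all_cars = pd.concat(df_list, ignore_index=True)
print(all_cars)
```

In the real exercise the list of dictionaries would come from the file loop instead of being typed in, and the address clean-up would happen before the dealer fields are attached.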


In [17]:
# company1= {
#  'dealer':{'name':'Buy Here, Buy Now', 'address':'<address class="addressReset"> <span rel="v:address"> <span dir="ltr"><span class="street-address" property="v:street-address">77 Yesler Way</span>, <span class="locality"><span property="v:locality">Seattle</span>, <span property="v:region">WA</span> <span property="v:postal-code">98104-2530</span></span> </span> </span> </address>'},
#  'cars': [
#     {"make":"Toyota", "model": "Prius C", "year":2017, "mileage":{'mpg (city)': 48, 'mpg (hwy)': 43}},
#     {"make":"Pontiac", "model": "Bonneville", "year":1997, "mileage":{'mpg (city)': 16, 'mpg (hwy)': 26}},
#     {"make":"Maserati", "model": "Spider", "year":2014, "mileage":{'mpg (city)': 10, 'mpg (hwy)': 16}}
#     ]
#  }

In [18]:
# company2={'cars': [{'make': 'Volkswagen',
#    'mileage': {'mpg(city)': 21, 'mpg(hwy)': 31},
#    'model': 'Eos',
#    'year': 2017},
#   {'make': 'Chrysler',
#    'mileage': {'mpg(city)': 18, 'mpg(hwy)': 25},
#    'model': 'Sebring',
#    'year': 1997}],
#  'dealer': {'address': '<address class="addressReset"> <span rel="v:address"> <span dir="ltr"><span class="street-address" property="v:street-address">123 Main Street</span>, <span class="locality"><span property="v:locality">Seattle</span>, <span property="v:region">WA</span> <span property="v:postal-code">11111-1234</span></span> </span> </span> </address>',
#   'name': 'Cars R Us'}}

In [19]:
# import json
# with open('company2.json', 'w') as fp:
#     json.dump(company2, fp)