# Processing JSON: the case of Trex

For this example to work, the file `Trex_reduced.json` must be in the same folder as this notebook.

In [1]:
INFILENAME = 'Trex_reduced.json'

## Getting JSON data into Python
First we have to import module `json` -- the module that knows what to do with JSON data.

In [2]:
import json

Before module `json` can handle any data, we have to get it into memory.  
In this case, we let Python read the entire file into one very long string.  

In [3]:
# the with-statement closes the file automatically
with open(INFILENAME, 'r') as infile:
    trex_json = infile.read()

Don't print the whole string; it is far too long.  
Printing a small piece does no harm, though.

In [4]:
print trex_json[:50]

{
"data_provider":"The Paleobiology Database",
"da


Now module `json` can do the parsing. It produces a Python object that contains all data as found in (the string from) the JSON file.  
We're done with the string. To free some memory, we can delete it.

In [5]:
trex = json.loads(trex_json)
del trex_json
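
To see what the parsing step does on a small scale, here is a minimal sketch that parses a short JSON string instead of the file. The string `tiny_json` and its contents are made up for illustration; they only mimic the shape of the real file. Each `print(...)` call takes a single argument, so the code runs under Python 3 as well as Python 2.

```python
import json

# Hypothetical miniature of the real file: made-up values, same kind of structure.
tiny_json = '{"title": "demo", "records_found": 2, "elapsed_time": 0.5, "records": [{"lat": 1.5}, {"lat": 2.5}]}'

tiny = json.loads(tiny_json)

# JSON objects become dicts, arrays become lists, numbers become int or float.
print(type(tiny).__name__)                   # dict
print(type(tiny['records']).__name__)        # list
print(type(tiny['records_found']).__name__)  # int
print(type(tiny['elapsed_time']).__name__)   # float
```

The nesting in the JSON text carries over directly: `tiny['records'][1]['lat']` reaches into the list and then into the second dictionary.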

Again, don't print this object.  
Instead, let's find some information about it.

In [6]:
print 'type(trex) =', type(trex)
print 'len(trex) =', len(trex) # would not work if it were an int or float or ...

type(trex) = <type 'dict'>
len(trex) = 13


There are only ... values. Let's just print them.

In [7]:
for v in trex:
    print v

elapsed_time
data_url
data_source
parameters
title
access_time
records_returned
records_found
data_license
documentation_url
license_url
records
data_provider


Perhaps we should also inspect the types.

In [8]:
for v in trex:
    print type(v), v

<type 'unicode'> elapsed_time
<type 'unicode'> data_url
<type 'unicode'> data_source
<type 'unicode'> parameters
<type 'unicode'> title
<type 'unicode'> access_time
<type 'unicode'> records_returned
<type 'unicode'> records_found
<type 'unicode'> data_license
<type 'unicode'> documentation_url
<type 'unicode'> license_url
<type 'unicode'> records
<type 'unicode'> data_provider


Type `unicode` is a generalized type of strings. Unicode contains many more characters than possible in standard strings, e.g. Greek letters and special punctuation characters.  
In fact these Unicode strings are keys in a dictionary; a for-loop over a dictionary loops over the keys (in an order determined internally).  
Let's look at the associated values:

In [9]:
for v in trex:
    value = trex[v]
    print 'key =', v
    print 'type(value) =', type(value)
    if type(value) in [dict, list, unicode]:
        print 'len(value) =', len(value)
    else:
        print 'value =', value
    print

key = elapsed_time
type(value) = <type 'float'>
value = 0.00526

key = data_url
type(value) = <type 'unicode'>
len(value) = 121

key = data_source
type(value) = <type 'unicode'>
len(value) = 25

key = parameters
type(value) = <type 'dict'>
len(value) = 5

key = title
type(value) = <type 'unicode'>
len(value) = 17

key = access_time
type(value) = <type 'unicode'>
len(value) = 27

key = records_returned
type(value) = <type 'int'>
value = 75

key = records_found
type(value) = <type 'int'>
value = 75

key = data_license
type(value) = <type 'unicode'>
len(value) = 22

key = documentation_url
type(value) = <type 'unicode'>
len(value) = 48

key = license_url
type(value) = <type 'unicode'>
len(value) = 43

key = records
type(value) = <type 'list'>
len(value) = 75

key = data_provider
type(value) = <type 'unicode'>
len(value) = 25
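
The loop above looks up each value with `trex[v]`. As a sketch of an alternative, a dictionary's `items()` method yields each key together with its value, and `isinstance` can replace the comparison of `type(value)` against a list of types. The dictionary `sample` and the helper `describe` are made up for illustration. Note the assumption about strings: under Python 2, strings produced by `json.loads` have type `unicode`, so `unicode` would have to be added to the tuple of types there.

```python
# Made-up stand-in for trex, so the example runs without the JSON file.
sample = {'title': 'demo', 'records_found': 2, 'records': [{'lat': 1.5}, {'lat': 2.5}]}

def describe(value):
    """Return a one-line summary: the length for containers and strings, the value otherwise."""
    if isinstance(value, (dict, list, str)):
        return 'len = %d' % len(value)
    return 'value = %s' % value

# items() yields (key, value) pairs; sorted() makes the order predictable.
for key, value in sorted(sample.items()):
    print('%s: %s' % (key, describe(value)))
```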



Most entries are so-called meta-data, i.e. data describing the actual data.  
Let's look into the first and second entry:

In [10]:
print trex[0]
print trex[1]

KeyError: 0

"KeyError"? This means that we asked the dictionary for a key that it does not contain.  
Elements of a dictionary cannot be selected by position; we have to supply a key instead (a number only works if it happens to be one of the keys).

In [11]:
print trex['elapsed_time']
print trex['data_url']

0.00526
http://paleobiodb.org/data1.2/occs/list.json?datainfo&rowcount&base_name=Tyrannosaurus&taxon_reso=genus&show=genus,coords
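
Since a missing key raises `KeyError`, it can help to test for a key before using it. A minimal sketch, using a made-up dictionary `sample` in place of `trex`:

```python
# Made-up stand-in for trex.
sample = {'elapsed_time': 0.5, 'title': 'demo'}

# 'in' tests whether a key is present, without raising an error.
print('title' in sample)   # True
print(0 in sample)         # False

# get() returns a default (None unless specified) instead of raising KeyError.
missing = sample.get('records', [])
print(len(missing))        # 0

# Indexing with an absent key raises KeyError, which can be caught.
try:
    sample[0]
    found = True
except KeyError:
    found = False
print(found)               # False
```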


Find the two entries that contain further structure:

In [18]:
# fill in the dots
further_context = trex['parameters']
data = trex['records']
# i.e. the real data

In [19]:
print type(further_context)
print type(data)

<type 'dict'>
<type 'list'>


First explore `further_context`. Print the types of the values contained in it.  
Does it make sense to print those values immediately?  
If yes, do so. Otherwise delve a bit deeper.

In [20]:
# fill in this cell
for key in further_context:
    print key, type(further_context[key])


Now, we delve into the real data, in variable `data`.  
It is a structured type. First let's look again at its type and length.

In [17]:
print "type(data) =", type(data)
print "len(data) =", len(data)

type(data) = <type 'list'>
len(data) = 75


So this is a list. [If not, assign `data` again above; the correct key is `'records'`.]  
For each element of this list, print its index in the list and its type.

In [22]:
for i in range(len(data)):
    print 'index %d type: %s' % (i, type(data[i]))
    # fill in the details

index 0 type: <type 'dict'>
index 1 type: <type 'dict'>
index 2 type: <type 'dict'>
index 3 type: <type 'dict'>
index 4 type: <type 'dict'>
index 5 type: <type 'dict'>
index 6 type: <type 'dict'>
index 7 type: <type 'dict'>
index 8 type: <type 'dict'>
index 9 type: <type 'dict'>
index 10 type: <type 'dict'>
index 11 type: <type 'dict'>
index 12 type: <type 'dict'>
index 13 type: <type 'dict'>
index 14 type: <type 'dict'>
index 15 type: <type 'dict'>
index 16 type: <type 'dict'>
index 17 type: <type 'dict'>
index 18 type: <type 'dict'>
index 19 type: <type 'dict'>
index 20 type: <type 'dict'>
index 21 type: <type 'dict'>
index 22 type: <type 'dict'>
index 23 type: <type 'dict'>
index 24 type: <type 'dict'>
index 25 type: <type 'dict'>
index 26 type: <type 'dict'>
index 27 type: <type 'dict'>
index 28 type: <type 'dict'>
index 29 type: <type 'dict'>
index 30 type: <type 'dict'>
index 31 type: <type 'dict'>
index 32 type: <type 'dict'>
index 33 type: <type 'dict'>
index 34 type: <type 'dict'>
...

If you have done this correctly, all types are structured.  
So we can print the lengths of all records.  
For each record, print the index in the list and length of the record.

In [23]:
for i in range(len(data)):
    print i, "length", len(data[i])

0 length 13
1 length 13
2 length 12
3 length 12
4 length 12
5 length 13
6 length 13
7 length 16
8 length 12
9 length 12
10 length 12
11 length 12
12 length 12
13 length 12
14 length 12
15 length 12
16 length 12
17 length 12
18 length 12
19 length 12
20 length 12
21 length 12
22 length 12
23 length 12
24 length 12
25 length 12
26 length 12
27 length 12
28 length 12
29 length 12
30 length 12
31 length 15
32 length 12
33 length 12
34 length 13
35 length 13
36 length 13
37 length 13
38 length 13
39 length 12
40 length 13
41 length 13
42 length 12
43 length 12
44 length 12
45 length 13
46 length 13
47 length 12
48 length 13
49 length 13
50 length 12
51 length 16
52 length 12
53 length 12
54 length 14
55 length 12
56 length 12
57 length 12
58 length 12
59 length 12
60 length 12
61 length 12
62 length 13
63 length 12
64 length 12
65 length 12
66 length 13
67 length 12
68 length 12
69 length 13
70 length 12
71 length 12
72 length 12
73 length 18
74 length 13


Take the index of a record with maximal length.

In [25]:
# fill in the dots
len_list = []
for i in range(len(data)):
    length = len(data[i])
    len_list.append(length)
max_len = max(len_list)
print max_len
selected_index = len_list.index(max_len)
print selected_index

18
73
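
The same index can be found without building `len_list`, by letting `max` compare the indices by record length. This is a sketch of an alternative, not the notebook's own solution; `sample_data` is a made-up list standing in for `data`.

```python
# Made-up stand-in for data: a list of dicts of different sizes.
sample_data = [{'a': 1}, {'a': 1, 'b': 2, 'c': 3}, {'a': 1, 'b': 2}, {'x': 1, 'y': 2, 'z': 3}]

# max() with a key function returns the index whose record is longest.
# On a tie it returns the first such index, like len_list.index(max_len).
longest_index = max(range(len(sample_data)), key=lambda i: len(sample_data[i]))

print(longest_index)                    # 1
print(len(sample_data[longest_index]))  # 3
```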


Let's look into this record. First we define a variable `record` that refers to the record in the list at the selected index.

In [26]:
record = data[selected_index]

Now, find the keys in this record and the types of the associated values.

In [27]:
for key in record:
    print key, type(record[key])
    # fill in the details

rnk <type 'int'>
lag <type 'int'>
lat <type 'float'>
cid <type 'unicode'>
oli <type 'unicode'>
oid <type 'unicode'>
idr <type 'int'>
tdf <type 'unicode'>
eag <type 'float'>
idn <type 'unicode'>
iid <type 'unicode'>
oei <type 'unicode'>
tna <type 'unicode'>
eid <type 'unicode'>
tid <type 'unicode'>
rid <type 'unicode'>
lng <type 'float'>
gnl <type 'unicode'>


All values turn out to be numbers or (Unicode) strings.  
So let's just print the keys and values.

In [28]:
for key in record:
    print key, record[key]
    # fill in the details

rnk 5
lag 66
lat 42.821259
cid col:64144
oli Maastrichtian
oid occ:1220179
idr 3
tdf nomen dubium
eag 83.6
idn Tyrannosaurus n. sp. turpanensis
iid txn:68319
oei Campanian
tna Tyrannosaurus
eid rei:30214
tid txn:38613
rid ref:52140
lng 89.85601
gnl Tyrannosaurus


Now do the same for the first record.

In [29]:
record = data[0]
for key in record:
    print key, record[key]
    # copy details from last code cell

rnk 3
lag 66
cid col:11917
oid occ:139292
eag 72.1
lat 51.906399
oei Maastrichtian
tna Tyrannosaurus rex
eid rei:22878
tid txn:54833
rid ref:4218
lng -113.0289
gnl Tyrannosaurus


Some keys are absent here, but `lat` and `lng` are present in both.  
In fact, they are present in all records.  
They stand for latitude and longitude of the site where this Tyrannosaurus was found.
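
The claim that a key occurs in every record can be checked rather than assumed. A minimal sketch, with a made-up list `sample_data` in place of `data`:

```python
# Made-up stand-in for data; the second record lacks the 'oli' key.
sample_data = [
    {'lat': 51.9, 'lng': -113.0, 'oli': 'Maastrichtian'},
    {'lat': 42.8, 'lng': 89.9},
]

# all() is True only if the test holds for every record.
has_coords = all('lat' in rec and 'lng' in rec for rec in sample_data)
print(has_coords)       # True

# Intersecting the key sets gives the keys shared by all records.
common = set(sample_data[0])
for rec in sample_data[1:]:
    common &= set(rec)
print(sorted(common))   # ['lat', 'lng']
```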

Print a line that contains the latitude and longitude of the current record.

In [30]:
# fill in the dots
print 'latitude = %8.3f   longitude = %8.3f' % (record['lat'], record['lng'])

latitude =   51.906   longitude = -113.029


Using the structure of `data` that you investigated, now find all pairs of latitude and longitude contained in the JSON structure.  
So, make a loop over all records again, and for each record print the corresponding latitude and longitude.

In [33]:
# fill in this cell
for i in range(len(data)):
    record = data[i]
    print 'latitude = %8.3f   longitude = %8.3f' % (record['lat'], record['lng'])

latitude =   51.906   longitude = -113.029
latitude =   51.933   longitude = -113.233
latitude =   50.729   longitude = -111.526
latitude =   50.727   longitude = -111.525
latitude =   43.050   longitude = -104.483
latitude =   40.386   longitude = -104.492
latitude =   48.626   longitude =   44.059
latitude =   47.536   longitude = -107.083
latitude =   45.949   longitude = -103.962
latitude =   45.949   longitude = -103.962
latitude =   46.100   longitude = -103.300
latitude =   45.978   longitude = -103.754
latitude =   46.026   longitude = -103.767
latitude =   46.026   longitude = -103.767
latitude =   46.026   longitude = -103.767
latitude =   46.026   longitude = -103.767
latitude =   46.026   longitude = -103.767
latitude =   46.137   longitude = -103.795
latitude =   46.406   longitude = -103.940
latitude =   46.382   longitude = -103.964
latitude =   46.450   longitude = -103.023
latitude =   46.464   longitude = -104.023
latitude =   46.334   longitude = -103.898
latitude = 

Finally, put all essential parts together into one script that you can run outside this notebook as well.  
Skip all cells that were used for experimenting; keep only those pieces of code that are necessary for obtaining the list of latitudes and longitudes.

In [34]:
# fill in this cell
# file Trex_reduced.json should be in the same folder as this script
INFILENAME = 'Trex_reduced.json'

# import module json
import json

# get the JSON file into memory:
# read the entire file into one string
with open(INFILENAME, 'r') as infile:
    trex_json = infile.read()

# use module json to do the parsing;
# it produces a Python object that contains all data found in the string from the JSON file
trex = json.loads(trex_json)
del trex_json

#extract the real data
data = trex['records']

for i in range(len(data)):
    record = data[i]
    print 'latitude = %8.3f   longitude = %8.3f' % (record['lat'], record['lng'])

latitude =   51.906   longitude = -113.029
latitude =   51.933   longitude = -113.233
latitude =   50.729   longitude = -111.526
latitude =   50.727   longitude = -111.525
latitude =   43.050   longitude = -104.483
latitude =   40.386   longitude = -104.492
latitude =   48.626   longitude =   44.059
latitude =   47.536   longitude = -107.083
latitude =   45.949   longitude = -103.962
latitude =   45.949   longitude = -103.962
latitude =   46.100   longitude = -103.300
latitude =   45.978   longitude = -103.754
latitude =   46.026   longitude = -103.767
latitude =   46.026   longitude = -103.767
latitude =   46.026   longitude = -103.767
latitude =   46.026   longitude = -103.767
latitude =   46.026   longitude = -103.767
latitude =   46.137   longitude = -103.795
latitude =   46.406   longitude = -103.940
latitude =   46.382   longitude = -103.964
latitude =   46.450   longitude = -103.023
latitude =   46.464   longitude = -104.023
latitude =   46.334   longitude = -103.898
latitude = 
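
If the coordinates are needed for further processing rather than for display, the loop can collect them in a list instead of printing them. A sketch under that assumption: the function name `extract_coordinates` is hypothetical, and the inline JSON string is a two-record stand-in for the file so that the example runs without it.

```python
import json

def extract_coordinates(json_text):
    """Parse JSON text and return a list of (lat, lng) tuples from its 'records' list."""
    parsed = json.loads(json_text)
    return [(rec['lat'], rec['lng']) for rec in parsed['records']]

# Two-record input with the same layout as the real file.
example_text = '{"records": [{"lat": 51.906399, "lng": -113.0289}, {"lat": 42.821259, "lng": 89.85601}]}'

for lat, lng in extract_coordinates(example_text):
    print('latitude = %8.3f   longitude = %8.3f' % (lat, lng))
```

For the real data, the text read from `Trex_reduced.json` would be passed in place of `example_text`.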