# Week 03: Structured Data

JavaScript Object Notation (JSON): another file format to store, transfer and organize data.

## JSON

- [ansur](https://raw.githubusercontent.com/PSAM-5005-2026S-A/5005-utils/refs/heads/main/datasets/json/ansur.json) (6k)
- [wines](https://raw.githubusercontent.com/PSAM-5005-2026S-A/5005-utils/refs/heads/main/datasets/json/wines.json) (8k)
- [cities](https://raw.githubusercontent.com/PSAM-5005-2026S-A/5005-utils/refs/heads/main/datasets/json/cities.json) (9k)
- [pm2.5](https://raw.githubusercontent.com/PSAM-5005-2026S-A/5005-utils/refs/heads/main/datasets/json/pm2_5.json) (41k)

### Other Sources
- [Kaggle JSON](https://www.kaggle.com/datasets/?sort=downloadCount&fileType=json)
- [Hugging Face JSON](https://huggingface.co/datasets?format=format:json&sort=downloads)

## Setup

Let's run the following $2$ cells to download our datasets and load some helper libraries.

In [None]:
!wget -q https://raw.githubusercontent.com/PSAM-5005-2026S-A/5005-utils/refs/heads/main/datasets/json/wines.json -P ./data
!wget -q https://raw.githubusercontent.com/PSAM-5005-2026S-A/5005-utils/refs/heads/main/datasets/json/ansur.json -P ./data

In [None]:
import json

## Loading a JSON file

Loading a `json` file is similar to loading a `csv` file (and perhaps even easier).

We open the file using the `Python` function `open()`, and then use a library to read the contents of the file, knowing that it's a `json` file.

For `csv` files it was something like this:

```py
csvfile = open("./data/wines.csv")
csvreader = csv.reader(csvfile)
csvdata = list(csvreader)
```

For `json`, it's like this:

In [None]:
jsonfile = open("./data/wines.json")
data = json.load(jsonfile)

The `json.load()` function takes care of reading the contents of the file and turning it into `Python` values. We don't need to create a reader and then convert the reader to a `list` like we did with `csv`. Similar, but different.
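As a rough illustration of what "turning it into `Python` values" means, the sketch below uses `json.loads()`, which parses a string instead of a file. The JSON text is made up for this example and is not taken from `wines.json`.

```python
import json

# a small, made-up JSON text (not taken from wines.json)
sample_text = '[{"title": "Example Red", "price": 20}, {"title": "Example White", "price": 15.5}]'

# json.loads() parses JSON from a string; json.load() does the same from an open file
sample_data = json.loads(sample_text)

print(type(sample_data))        # the outer JSON array becomes a Python list
print(type(sample_data[0]))     # each JSON object becomes a Python dict
print(sample_data[1]["price"])  # JSON numbers become int or float
```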

And once we have the contents in a variable, we can look at the data using `display()`:

In [None]:
display(data)

# 🤔

This sort of, kind of, looks like a list of lists, but we have the header repeated a bunch of times.

These are dictionaries.

## Dictionaries

Together with lists, dictionaries are the other construct we use to structure data within our programs.

They're used to organize data, but in a way that makes it easy to keep data consistent.

The simplest dictionaries are collections of `key`, `value` pairs. Just like with a physical dictionary, where we can use a word to look up its definition or translation, in `Python` dictionaries we use a `key` to fetch some extra piece of data (`value`) associated with it.

Another way to think about dictionaries is that they create a separate space for us to create variables within variables.

Let's take a look.

In [None]:
# TODO: create empty dictionary
me = {}

# TODO: add items with keys
me["name"] = "Thiago"
me["zip"] = 12221
me["grade"] = 87

# TODO: create and initialize
you = {
  "name": "Lex",
  "zip": 12345,
  "grade": 99,
}

display(me)
display(you)

# TODO: read a value from a key
me["zip"]
you["zip"]

# TODO: useful for comparing things that should be alike and have similar properties
me["grade"] < you["grade"]

# TODO: can modify value associated with any given key

me["grade"] = 100

me["grrrade"] = 100

# TODO: or remove

del me["grrrade"]

me

## Objects

This is another name for dictionaries. People also use _hash map_, _hash table_ and _associative array_ to refer to objects/dictionaries. They all mean the same thing.

One reason why people use some of these other terms is that the term _dictionary_ might imply, or be closely associated with, a simple `key` $\to$ `value` type of organization, like in physical dictionaries. But, `Python` dictionaries are more sophisticated because they allow a `key` to be associated with a more complex data type like a `list` or even a `dictionary`.

So, yeah, nested dictionaries are a thing... so are lists of dictionaries and dictionaries with lists.
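
A minimal sketch of a dictionary inside a dictionary. The `address` key and its values are made up for this example and aren't part of any of our datasets.

```python
# hypothetical example: the "address" value is itself a dictionary
me = {
  "name": "Thiago",
  "address": {
    "city": "Somewhere",
    "zip": 12221,
  },
}

# index twice: first get the inner dictionary, then the value inside it
my_zip = me["address"]["zip"]
print(my_zip)

# the inner dictionary can be modified just like any other
me["address"]["zip"] = 12345
print(me["address"])
```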

In [None]:
# TODO: add list values
me = {
  "name": "Thiago",
  "grades": [100,83,75,87],
}

you = {
  "name": "Lex",
  "grades": [80,73,95,98],
}

# TODO: add average to dictionary
# TODO: first, create list of dicts
people = [me, you]

# TODO: process list and add average
for p in people:
  my_grades = p["grades"]
  p["average"] = sum(my_grades) / len(my_grades)

display(me, you)

## Datasets

We've now seen $2$ ways of storing datasets in text files. Some of the details between them are a little different, but they have a similar overall structure for organizing data in $2$ dimensions: rows and columns.

<img src="./imgs/datasets00.jpg" width="700px">

In `csv` files we have an outer list of rows that have inner lists of the column values, and in `json` files we have an outer list of rows that have inner objects with the column values.

<img src="./imgs/datasets01.jpg" width="700px">

But what we can do with one, we can do with the other.
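
To make the comparison concrete, here is a sketch of the same two rows stored both ways. The column names and values are invented for illustration.

```python
# csv-style: a list of lists, with the header as the first row
rows_as_lists = [
  ["title", "country", "price"],
  ["Example Red", "US", 20],
  ["Example White", "France", 15],
]

# json-style: a list of objects, with the column names repeated as keys
rows_as_objects = [
  {"title": "Example Red", "country": "US", "price": 20},
  {"title": "Example White", "country": "France", "price": 15},
]

# price of the first wine: by column position in one, by key in the other
price_from_list = rows_as_lists[1][2]
price_from_object = rows_as_objects[0]["price"]

print(price_from_list, price_from_object)
```

With lists we have to remember that price is column `2`; with objects we ask for `"price"` by name.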

Let's repeat some of the processing we did with the wines from last week, but this time using `json` data (a list of objects).

### Wines

Let's find the most expensive wine in the dataset and the average price of wines from California.

In [None]:
# iterate over wines and find most expensive wine

# assume it's the first wine
expensive_wine = data[0]

# for every wine on the list, compare its price property 
#   to the price of the most expensive wine we have seen so far.
# if we see a more expensive wine, update the expensive_wine result variable

for w in data:
  if w["price"] > expensive_wine["price"]:
    expensive_wine = w

expensive_wine

In [None]:
# iterate over wines and calculate average price for California wines

# we need the sum of their prices and the number of wines from California
# we can first filter the wines from California, and then sum the prices, or do both at once

# the result variables
sum_ca_prices = 0
cnt_ca_wines = 0

for w in data:
  if w["province"] == "California":
    sum_ca_prices = sum_ca_prices + w["price"]
    cnt_ca_wines = cnt_ca_wines + 1

print("Sum of Prices:", sum_ca_prices)
print("Count:", cnt_ca_wines)
print("Average:", sum_ca_prices / cnt_ca_wines)

### Exercise

Find the cheapest wine from the US and the cheapest wine not from the US.

Just like we can check for equality using `==`, we can check for inequality using `!=`.

So, wines not from the US are the ones that have `w["country"] != "US"`

In [None]:
# TODO: cheap wines

# TODO: set up some variables to hold the result we want
# TODO: iterate and use if statements to determine when/if we should update the results

# TODO: can do it using 2 separate loops or combine the logic into one loop

# start with no result: we can't assume data[0] is from the US or that data[1] isn't
cheap_usa = None
cheap_other = None

for wine in data:
  if wine["country"] == "US":
    if cheap_usa is None or wine["price"] < cheap_usa["price"]:
      cheap_usa = wine
  else:
    if cheap_other is None or wine["price"] < cheap_other["price"]:
      cheap_other = wine

display(cheap_usa, cheap_other)

## Dictionary in Dictionary

Let's open up another file and take a look at its structure.

In [None]:
ansurfile = open("./data/ansur.json")
ansurdata = json.load(ansurfile)

ansurdata[0]

In [None]:
# Look at first and last
ansurdata[0], ansurdata[-1]

In [None]:
# Lots of indexing to get to ear length
ansurdata[0]["ear"]["length"]

# 🤔

The first $6$ or so properties look normal, but the last $4$ are actually dictionaries.

These are nested dictionaries ! We can't do this in `csv` files.

We can still get to the values, we just have to do some double indexing.

In order to get foot length we first get the `foot` object, and then its `length` value.

In [None]:
# get foot length data for person at index 10

ansurdata[10]["foot"]["length"]

### Filtering and Finding

Finding specific people and filtering the dataset is still similar, the biggest change being how we index our objects to get the values we want to compare.

In [None]:
# Get person with largest foot

big_foot = ansurdata[0]

for p in ansurdata:
  if p["foot"]["length"] > big_foot["foot"]["length"]:
    big_foot = p

big_foot

### Exercise

Repeat the process and find the person with the longest ears.

In [None]:
# TODO: find person with longest ears

long_ear = ansurdata[0]

for p in ansurdata:
  if p["ear"]["length"] > long_ear["ear"]["length"]:
    long_ear = p

long_ear

In [None]:
# We'll only see "key functions" at the end of the notebook, but let's use one to check the answer above

# A key function receives a single object and returns the one value we're interested in using for comparisons
def ear_length_key(P):
  return P["ear"]["length"]

# Now we can use it in min/max/sorted functions for lists of objects

max(ansurdata, key=ear_length_key)

## Functions

Functions are a way of defining a set of sub-commands that we want to run multiple times, possibly on different data.

We've seen the `sorted()` function. We give it a list and it runs some pre-defined commands on the list before it returns a version of the list with items arranged by value. We can give the function any list and it will perform the routine that has been pre-defined in order to return a sorted list.

We can define our own functions with the `def` keyword, followed by our function name and a list of parameter variables:

```python
def add_and_square(param0, param1):
  # do some stuff
  x = param0 + param1
  y = x ** 2
  return y
```
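
Once defined, the function can be called with different arguments, as many times as needed. A short sketch that repeats the definition above so it runs on its own:

```python
def add_and_square(param0, param1):
  # add the two parameters, then square the sum
  x = param0 + param1
  y = x ** 2
  return y

# call it with different values; each call returns a new result
first = add_and_square(2, 3)     # (2 + 3) ** 2
second = add_and_square(10, -4)  # (10 - 4) ** 2

print(first, second)
```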

In [None]:
# TODO: figure out who weighs more, soldiers who are 20 and younger or over 20
# TODO: figure out who has longer ears

# TODO: first: go through people and separate into 2 lists using age > 20
# TODO: then:  extract list of values we're interested in and calculate average

# TODO: Ohhohhh: repetitive task ! Use function.
# TODO: there can be a function that calculates the average of the values in a list
# TODO: and also functions that given a key or keyS, extract the values from the key into a list
# TODO: like extract(people_list, "height") would return a list of all height values

def average(mlist):
  return sum(mlist) / len(mlist)

def get_values(mlist, key):
  vals = []
  for item in mlist:
    vals.append(item[key])
  return vals

def get_get_values(mlist, key0, key1):
  vals = []
  for item in mlist:
    vals.append(item[key0][key1])
  return vals

under21 = []
older20 = []

for p in ansurdata:
  if p["age"] > 20:
    older20.append(p)
  else:
    under21.append(p)

under21_weights = get_values(under21, "weight")
older20_weights = get_values(older20, "weight")

print("Young average weight:", average(under21_weights))
print("Older average weight:", average(older20_weights))

under21_ear_lengths = get_get_values(under21, "ear", "length")
older20_ear_lengths = get_get_values(older20, "ear", "length")

print()
print("Young ear length:", average(under21_ear_lengths))
print("Older ear length:", average(older20_ear_lengths))

## Sorting (list of) objects

We can sort lists of numbers or words like this:

In [None]:
x = [0,5,1,12,8,4,6]
y = ["elephant", "cow", "bull", "owl", "zebra", "chicken"]

print(sorted(x))
print(sorted(y))

It would be nice to do the same for lists of objects:

In [None]:
people = [
  { "name": "thiago", "grades": [89, 87, 94], "id": 1543 },
  { "name": "lex", "grades": [98, 78, 92], "id": 1311 },
]

print(sorted(people))

But this raises a `TypeError`, because the request is ambiguous: sort how ? By what parameter ?

Maybe once by name, then later by grade, or id ... 

We can sort lists of arbitrary "things" as long as we tell `Python` what to look at when comparing items. 

We do this by defining a function that extracts the item we want `Python` to use when sorting.
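
As a small sketch of the idea, here is a two-person list sorted in a few different ways. The function names `get_name`, `get_id` and `get_average_grade` are made up for this example.

```python
people = [
  { "name": "thiago", "grades": [89, 87, 94], "id": 1543 },
  { "name": "lex", "grades": [98, 78, 92], "id": 1311 },
]

# each key function receives one person and returns the value to compare
def get_name(P):
  return P["name"]

def get_id(P):
  return P["id"]

def get_average_grade(P):
  return sum(P["grades"]) / len(P["grades"])

# same list, three different orderings, depending on the key we pass
by_name = sorted(people, key=get_name)
by_id = sorted(people, key=get_id)
by_grade = sorted(people, key=get_average_grade)

print(by_name[0]["name"], by_id[0]["id"], by_grade[-1]["name"])
```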

In [None]:
# define a function that will extract the "thing" we want to compare (foot length for now)

# given a person, return their foot length
def get_foot_length(P):
  return P["foot"]["length"]

# now we can pass this to the sorted() function to use as the sorting "key"
display(sorted(ansurdata, key=get_foot_length))

We can reverse the order of the sort, and get the list from largest value to smallest value:

In [None]:
display(sorted(ansurdata, key=get_foot_length, reverse=True))

We can also use a `key` on `min()` and `max()` functions, to get the object with the largest or smallest values for that key.

In [None]:
display(min(ansurdata, key=get_foot_length))
display(max(ansurdata, key=get_foot_length))

## Questions

- How many people are in the dataset ?
- How tall are the tallest and shortest persons ?
- What are the largest and smallest values for weight ?
- How long is the longest hand ? Who has it ?
- Do older people weigh more ?
- Are older people taller ?
- Do taller people have bigger heads ?

In [None]:
# The number of people in the dataset is the number of "rows", or number of items on our list

print("Number of people", len(ansurdata))

# For tallest / shortest, we don't care about any other property, so we can separate the heights and then use min/max
height_vals = []
for p in ansurdata:
  height_vals.append(p["height"])

print()
print("tallest:", max(height_vals))
print("shortest:", min(height_vals))

# Weight is the same logic, and we can even use the get_values function we created above
weight_vals = get_values(ansurdata, "weight")

print()
print("max weight:", max(weight_vals))
print("min weight:", min(weight_vals))

# min weight looks like an error, but the error is in the dataset

In [None]:
# Longest hand with other properties. We should use a key function.
def hand_length_key(P):
  return P["hand"]["length"]

display(max(ansurdata, key=hand_length_key))
# check
print("max hand value", max(get_get_values(ansurdata, "hand", "length")))

In [None]:
# Do older people weigh more ?
# We need some other tools in order to answer this question properly and definitively, but...
# we kind of saw how to get average weights of people above/below 20, maybe we can extend this idea

# let's create a list of 10 empty lists.
# these are our age bins
weights_by_age = [[], [], [], [], [], [], [], [], [], []]


# as we go through the list, we're going to append the person's weight to one of the 10 bins.
# which bin will be determined by their age:
#   if they're younger than 10 years old it's bin 0
#   10 to 19: bin 1
#   20 to 29: bin 2, ... 
#   90 to 99: bin 9
for p in ansurdata:
  idx = int(p["age"] // 10)
  weights_by_age[idx].append(p["weight"])

# have bins of weights by age. can average
for idx,weights in enumerate(weights_by_age):
  if len(weights) > 0:
    print("average weight", idx*10, "to", idx*10+9, average(weights))
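
The binning step above relies on floor division (`//`) to turn an age into a bin index. A small sketch with made-up ages, only to show which bin each value lands in:

```python
# made-up ages, not from the ansur dataset
ages = [9, 17, 20, 29, 30, 58]

# 10 empty bins, one per decade
bins = [[], [], [], [], [], [], [], [], [], []]

for age in ages:
  idx = age // 10   # floor division: 29 // 10 is 2, 30 // 10 is 3
  bins[idx].append(age)

# look at the first 6 bins
print(bins[:6])
```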

In [None]:
# Are older people taller ?

heights_by_age = [[], [], [], [], [], [], [], [], [], []]

for p in ansurdata:
  idx = int(p["age"] // 10)
  heights_by_age[idx].append(p["height"])

for idx,heights in enumerate(heights_by_age):
  if len(heights) > 0:
    print("average height", idx*10, "to", idx*10+9, average(heights))

In [None]:
# Do older people have longer ears ?

ear_length_by_age = [[], [], [], [], [], [], [], [], [], []]

for p in ansurdata:
  idx = int(p["age"] // 10)
  ear_length_by_age[idx].append(p["ear"]["length"])

for idx,ear_lengths in enumerate(ear_length_by_age):
  if len(ear_lengths) > 0:
    print("average ear lengths", idx*10, "to", idx*10+9, average(ear_lengths))

In [None]:
# Do taller people have bigger heads ?

# Similar, but now our bins are by height.
# Since height is in inches, using 10 bins that skip by 10 inches is probably ok

head_circumference_by_height = [[], [], [], [], [], [], [], [], [], []]

for p in ansurdata:
  idx = int(p["height"] // 10)
  head_circumference_by_height[idx].append(p["head"]["circumference"])

for idx,head_circumferences in enumerate(head_circumference_by_height):
  if len(head_circumferences) > 0:
    print("average head circumference for heights", idx*10, "in to", idx*10+9, "in:", average(head_circumferences))