# Week 03: Structured Data

JavaScript Object Notation (JSON): another file format to store, transfer and organize data.

## JSON

- [ansur](https://raw.githubusercontent.com/PSAM-5005-2026S-A/5005-utils/refs/heads/main/datasets/json/ansur.json) (6k)
- [wines](https://raw.githubusercontent.com/PSAM-5005-2026S-A/5005-utils/refs/heads/main/datasets/json/wines.json) (8k)
- [cities](https://raw.githubusercontent.com/PSAM-5005-2026S-A/5005-utils/refs/heads/main/datasets/json/cities.json) (9k)
- [pm2.5](https://raw.githubusercontent.com/PSAM-5005-2026S-A/5005-utils/refs/heads/main/datasets/json/pm2_5.json) (41k)

### Other Sources
- [Kaggle JSON](https://www.kaggle.com/datasets/?sort=downloadCount&fileType=json)
- [Hugging Face JSON](https://huggingface.co/datasets?format=format:json&sort=downloads)

## Setup

Let's run the following $2$ cells to download our datasets and load some helper libraries.

In [1]:
!wget -q https://raw.githubusercontent.com/PSAM-5005-2026S-A/5005-utils/refs/heads/main/datasets/json/wines.json -P ./data
!wget -q https://raw.githubusercontent.com/PSAM-5005-2026S-A/5005-utils/refs/heads/main/datasets/json/ansur.json -P ./data

In [2]:
import json

## Loading a JSON file

Loading a `json` file is similar to loading a `csv` file (and perhaps even easier).

We open the file using the `Python` function `open()`, and then use a library to read the contents of the file, knowing that it's a `json` file.

For `csv` files it was something like this:

```py
csvfile = open("./data/wines.csv")
csvreader = csv.reader(csvfile)
csvdata = list(csvreader)
```

For `json`, it's like this:

In [3]:
jsonfile = open("./data/wines.json")
data = json.load(jsonfile)

The `json.load()` function takes care of reading the contents of the file and turning it into `Python` values. We don't need to create a reader and then convert the reader to a `list` like we did with `csv`. Similar, but different.
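One way to see what `json.load()` does is to build a tiny JSON file and read it back. The sketch below is not part of the course files: the file name `tiny_wines.json` and its contents are made up for illustration. It also uses a `with` block, which closes the file for us once we're done with it.

```python
import json
import os
import tempfile

# made-up JSON text: a list with two objects (not the real wines dataset)
json_text = '[{"winery": "A", "price": 10.5}, {"winery": "B", "price": 20.0}]'

# write the text to a temporary file so we have something to open
tmp_dir = tempfile.mkdtemp()
tmp_path = os.path.join(tmp_dir, "tiny_wines.json")
with open(tmp_path, "w") as f:
  f.write(json_text)

# open the file and let json.load() turn its contents into Python values
with open(tmp_path) as f:
  tiny_data = json.load(f)

print(type(tiny_data))        # the outer JSON array becomes a list
print(type(tiny_data[0]))     # each JSON object becomes a dict
print(tiny_data[1]["price"])  # values keep their types (here, a number)
```

The result is an ordinary `Python` list of dictionaries, so everything we know about lists still applies.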

And once we have the contents in a variable, we can look at the data using `display()`:

In [4]:
display(data)

[{'ID': 0,
  'country': 'US',
  'designation': "Martha's Vineyard",
  'points': 96,
  'price': 235.0,
  'province': 'California',
  'region': 'Napa Valley',
  'variety': 'Cabernet Sauvignon',
  'winery': 'Heitz'},
 {'ID': 1,
  'country': 'Spain',
  'designation': 'Carodorum Selección Especial Reserva',
  'points': 96,
  'price': 110.0,
  'province': 'Northern Spain',
  'region': 'Toro',
  'variety': 'Tinta de Toro',
  'winery': 'Bodega Carmen Rodríguez'},
 {'ID': 2,
  'country': 'US',
  'designation': 'Special Selected Late Harvest',
  'points': 96,
  'price': 90.0,
  'province': 'California',
  'region': 'Knights Valley',
  'variety': 'Sauvignon Blanc',
  'winery': 'Macauley'},
 {'ID': 3,
  'country': 'US',
  'designation': 'Reserve',
  'points': 96,
  'price': 65.0,
  'province': 'Oregon',
  'region': 'Willamette Valley',
  'variety': 'Pinot Noir',
  'winery': 'Ponzi'},
 {'ID': 4,
  'country': 'France',
  'designation': 'La Brûlade',
  'points': 95,
  'price': 66.0,
  'province': 'Pr

# 🤔

This sort of, kind of, looks like a list of lists, but we have the header repeated a bunch of times.

These are dictionaries.

## Dictionaries

Together with lists, dictionaries are the other construct we use to structure data within our programs.

They're used to organize data, but in a way that makes it easy to keep data consistent.

The simplest dictionaries are collections of `key`, `value` pairs. Just like with a physical dictionary, where we can use a word to look up its definition or translation, in `Python` dictionaries we use a `key` to fetch some extra piece of data (`value`) associated with it.

Another way to think about dictionaries is that they create a separate space for us to create variables within variables.

Let's take a look.

In [24]:
# create an empty dictionary (curly braces, not square brackets)
me = {}

# add items with keys
me["name"] = "Kavya"
me["zip"] = 11233
me["grade"] = 87

display(me)
# TODO: create and initialize

# TODO: read a value from a key

# TODO: useful for comparing things that should be alike and have similar properties

# TODO: can modify value associated with any given key

# TODO: or remove

{'name': 'Kavya', 'zip': 11233, 'grade': 87}
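
The basic dictionary operations (create and initialize, read, modify, remove) could look something like the sketch below. The variable name `you` and its values are made up for illustration.

```python
# create and initialize a dictionary in one step
you = {"name": "Raj", "zip": 10011, "grade": 74}

# read a value using its key
print(you["name"])

# modify the value associated with a key
you["grade"] = 80

# remove a key (and its value) from the dictionary
del you["zip"]

# check whether a key is present before using it
print("zip" in you)

print(you)
```

Note that an empty dictionary is created with `{}`. Using `[]` creates a list, and lists can only be indexed with integers, not with string keys.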

## Objects

_Object_ is another name for a dictionary. People also use _hash map_, _hash table_ and _associative array_ to refer to objects/dictionaries. For our purposes, they all mean the same thing.

One reason why people use some of these other terms is that the term _dictionary_ might imply, or be closely associated with, a simple `key` $\to$ `value` type of organization, like in physical dictionaries. But `Python` dictionaries are more sophisticated because they allow a `key` to be associated with a more complex data type like a `list` or even another `dictionary`.

So, yeah, nested dictionaries are a thing... so are lists of dictionaries and dictionaries with lists.

In [None]:
# the value for a key can be a list
me = {
  "name": "Kavya",
  "grades": [100, 89, 78, 87]
}

you = {
  "name": "Raj",
  "grades": [90, 72, 68, 74]
}

# first, create a list of dictionaries
people = [me, you]

# then, process the list: calculate each person's average and add it to their dictionary
for p in people:
  grades = p["grades"]
  p["average"] = sum(grades) / len(grades)

## Datasets

We've now seen $2$ ways of storing datasets in text files. Some of the details between them are a little different, but they have a similar overall structure for organizing data in $2$ dimensions: rows and columns.

<img src="./imgs/datasets00.jpg" width="700px">

In `csv` files we have an outer list of rows that have inner lists of the column values, and in `json` files we have an outer list of rows that have inner objects with the column values.

<img src="./imgs/datasets01.jpg" width="700px">

But what we can do with one, we can do with the other.
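
To make the comparison concrete, here is a small sketch with the same two rows organized both ways. The rows and variable names are made up for illustration and are not taken from the course datasets.

```python
# the csv way: a header row followed by lists of values
rows_as_lists = [
  ["name", "price"],
  ["Wine A", 10.0],
  ["Wine B", 25.0],
]

# the json way: one dictionary per row, with the keys repeated in each
rows_as_dicts = [
  {"name": "Wine A", "price": 10.0},
  {"name": "Wine B", "price": 25.0},
]

# with lists we index columns by position ...
print(rows_as_lists[2][1])

# ... and with dictionaries we index them by name
print(rows_as_dicts[1]["price"])

# turning the list version into the dictionary version
header = rows_as_lists[0]
converted = []
for row in rows_as_lists[1:]:
  obj = {}
  for i in range(len(header)):
    obj[header[i]] = row[i]
  converted.append(obj)

print(converted == rows_as_dicts)
```

The dictionary version repeats the column names in every row, but in exchange we don't have to remember which column sits at which position.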

Let's repeat some of the processing we did with the wines from last week, but this time using `json` data (a list of objects).

### Wines

Let's find the most expensive wine in the dataset and the average price of wines from California.

In [7]:
# iterate over wines and find most expensive wine

# assume it's the first wine
expensive_wine = data[0]

# for every wine on the list, compare its price property 
#   to the price of the most expensive wine we have seen so far.
# if we see a more expensive wine, update the expensive_wine result variable

for w in data:
  if w["price"] > expensive_wine["price"]:
    expensive_wine = w

expensive_wine

{'ID': 13318,
 'country': 'US',
 'designation': 'Roger Rose Vineyard',
 'points': 91,
 'price': 2013.0,
 'province': 'California',
 'region': 'Arroyo Seco',
 'variety': 'Chardonnay',
 'winery': 'Blair'}

In [8]:
# iterate over wines and calculate average price for California wines

# we need the sum of their prices and the number of wines from California
# we can first filter the wines from California, and then sum the prices, or do both at once

# the result variables
sum_ca_prices = 0
cnt_ca_wines = 0

for w in data:
  if w["province"] == "California":
    sum_ca_prices = sum_ca_prices + w["price"]
    cnt_ca_wines = cnt_ca_wines + 1

print(sum_ca_prices, cnt_ca_wines, sum_ca_prices / cnt_ca_wines)

115510.0 2616 44.15519877675841


### Exercise

Find the cheapest wine from the US and the cheapest wine not from the US.

Just like we can check for equality using `==`, we can check for inequality using `!=`.

So, wines not from the US are the ones that have `w["country"] != "US"`

In [29]:
# cheapest wine from the US
# (this can be done using 2 separate loops, like here, or by combining the logic into one loop)

# start with no result, since the first wine on the list might not be from the US
cheap_wine = None

# iterate and update the result whenever we see a cheaper US wine
for w in data:
  if w["country"] == "US":
    if cheap_wine is None or w["price"] < cheap_wine["price"]:
      cheap_wine = w

cheap_wine

{'ID': 1858,
 'country': 'US',
 'designation': 'Unoaked',
 'points': 83,
 'price': 4.0,
 'province': 'California',
 'region': 'California',
 'variety': 'Chardonnay',
 'winery': "Pam's Cuties"}

In [30]:
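# cheapest wine not from the US

# start with no result: data[0] is a US wine, so it can't be our first guess
cheap_wine = None

# iterate and update the result whenever we see a cheaper non-US wine
for w in data:
  if w["country"] != "US":
    if cheap_wine is None or w["price"] < cheap_wine["price"]:
      cheap_wine = w

cheap_wine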

{'ID': 5609,
 'country': 'Spain',
 'designation': 'Crianza',
 'points': 81,
 'price': 5.0,
 'province': 'Levante',
 'region': 'Utiel-Requena',
 'variety': 'Tempranillo',
 'winery': 'Viña Decana'}
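
Both searches can also be combined into a single loop. The sketch below runs on a few made-up records (`sample_wines`) that use the same keys as the wines dataset, so the values are for illustration only.

```python
# made-up sample data with the same keys as the wines dataset
sample_wines = [
  {"winery": "A", "country": "US", "price": 30.0},
  {"winery": "B", "country": "Spain", "price": 12.0},
  {"winery": "C", "country": "US", "price": 9.0},
  {"winery": "D", "country": "France", "price": 15.0},
]

# start with no results; None means "haven't seen one yet"
cheap_us = None
cheap_other = None

for w in sample_wines:
  if w["country"] == "US":
    # compare only against the cheapest US wine seen so far
    if cheap_us is None or w["price"] < cheap_us["price"]:
      cheap_us = w
  else:
    # compare only against the cheapest non-US wine seen so far
    if cheap_other is None or w["price"] < cheap_other["price"]:
      cheap_other = w

print(cheap_us["winery"], cheap_other["winery"])
```

Starting from `None` instead of `data[0]` avoids assuming anything about the first item on the list, such as which country it is from.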

## Dictionary in Dictionary

Let's open up another file and take a look at its structure.

In [10]:
ansurfile = open("./data/ansur.json")
ansurdata = json.load(ansurfile)

ansurdata[0]

{'age': 42,
 'gender': 'F',
 'height': 67,
 'weight': 158,
 'span': 1739,
 'stature': 1671,
 'ear': {'breadth': 32, 'length': 57, 'protrusion': 17},
 'foot': {'breadth': 92, 'length': 238},
 'hand': {'breadth': 75, 'length': 181, 'palm': 112},
 'head': {'breadth': 143, 'length': 197, 'height': 232, 'circumference': 553}}

# 🤔

The first $6$ properties look normal, but the last $4$ are themselves dictionaries.

These are nested dictionaries! We can't do this in `csv` files.

We can still get to the values, we just have to do some double indexing.

In order to get foot length we first get the `foot` object, and then its `length` value.

In [11]:
# get foot length data for person at index 10

ansurdata[10]["foot"]["length"]

277

### Filtering and Finding

Finding specific people and filtering the dataset is still similar, the biggest change being how we index our objects to get the values we want to compare.

In [12]:
# Get person with largest foot

big_foot = ansurdata[0]

for p in ansurdata:
  if p["foot"]["length"] > big_foot["foot"]["length"]:
    big_foot = p

big_foot

{'age': 21,
 'gender': 'M',
 'height': 77,
 'weight': 240,
 'span': 2121,
 'stature': 1944,
 'ear': {'breadth': 37, 'length': 63, 'protrusion': 24},
 'foot': {'breadth': 115, 'length': 323},
 'hand': {'breadth': 101, 'length': 239, 'palm': 137},
 'head': {'breadth': 169, 'length': 202, 'height': 275, 'circumference': 583}}

### Exercise

Repeat the process to find the person with the longest ears.

In [13]:
# TODO: find person with longest ears

## Functions

Functions are a way of defining a set of commands that we want to run multiple times, possibly on different data.

We've seen the `sorted()` function. We give it a list and it runs some pre-defined commands on the list before returning a version of the list with the items arranged by value. We can give the function any list, and it will perform the same pre-defined routine to return a sorted list.

We can define our own functions with the `def` keyword, followed by our function name and a list of parameter variables:

```python
def add_and_square(param0, param1):
  # do some stuff
  x = param0 + param1
  y = x ** 2
  return y
```
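
Once a function is defined, we run it by calling it with actual values for its parameters. A minimal sketch using the function from above, with argument values picked just for illustration:

```python
def add_and_square(param0, param1):
  # add the two parameters, then square the sum
  x = param0 + param1
  y = x ** 2
  return y

# the same commands run on different data each time we call the function
print(add_and_square(1, 2))   # (1 + 2) squared
print(add_and_square(3, 4))   # (3 + 4) squared

# the returned value can be stored in a variable and used later
result = add_and_square(10, -10)
print(result)
```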

In [14]:
# TODO: figure out who weighs more, soldiers who are 21 and younger or over 21
# TODO: figure out who has longer ears

# TODO: first: go through people and separate into 2 lists using age > 21
# TODO: then:  extract list of values we're interested in and calculate average

# TODO: Ohhohhh: repetitive task ! Use function.
# TODO: there can be a function that calculates the average of the values in a list
# TODO: and also functions that given key value or valueS, extract those into a list
# TODO: like extract(people_list, "height") would return a list of all height values
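
One possible shape for these helper functions is sketched below. The names `average` and `sample_people` are assumptions made for this example, `extract` follows the name suggested in the comments above, and the records are made up rather than taken from the ANSUR file.

```python
# return the average of the numbers in a list
def average(values):
  return sum(values) / len(values)

# given a list of dictionaries and a key, return a list with each item's value for that key
def extract(people_list, key):
  result = []
  for p in people_list:
    result.append(p[key])
  return result

# made-up sample records (not the real ANSUR data)
sample_people = [
  {"age": 19, "weight": 150},
  {"age": 21, "weight": 170},
  {"age": 30, "weight": 180},
  {"age": 45, "weight": 200},
]

# separate into 2 lists: 21 and younger, and over 21
younger = []
older = []
for p in sample_people:
  if p["age"] > 21:
    older.append(p)
  else:
    younger.append(p)

# the same two functions work for both groups, and for any other key
print(average(extract(younger, "weight")))
print(average(extract(older, "weight")))
```

Because `extract()` takes the key as a parameter, comparing a different measurement only requires changing the string we pass in.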

## Sorting (list of) objects

We can sort lists of numbers or words like this:

In [15]:
x = [0,5,1,12,8,4,6]
y = ["elephant", "cow", "bull", "owl", "zebra", "chicken"]

print(sorted(x))
print(sorted(y))

[0, 1, 4, 5, 6, 8, 12]
['bull', 'chicken', 'cow', 'elephant', 'owl', 'zebra']


It would be nice to do the same for lists of objects:

In [16]:
people = [
  { "name": "thiago", "grades": [89, 87, 94], "id": 1543 },
  { "name": "lex", "grades": [98, 78, 92], "id": 1311 },
]

print(sorted(people))

TypeError: '<' not supported between instances of 'dict' and 'dict'

But this is ambiguous: sort how? By what parameter?

Maybe once by name, then later by grade, or id ...

We can sort lists of arbitrary "things" as long as we tell `Python` what to look at when comparing items. 

We do this by defining a function that extracts the item we want `Python` to use when sorting.
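
As a small sketch of the idea, here is the `people` list sorted three different ways. The third person ("ana") and the function names `get_name`, `get_id` and `get_average_grade` are made up for this example.

```python
people = [
  { "name": "thiago", "grades": [89, 87, 94], "id": 1543 },
  { "name": "lex", "grades": [98, 78, 92], "id": 1311 },
  { "name": "ana", "grades": [70, 85, 91], "id": 1420 },  # made-up extra person
]

# each function takes one person and returns the value to sort by
def get_name(p):
  return p["name"]

def get_id(p):
  return p["id"]

def get_average_grade(p):
  return sum(p["grades"]) / len(p["grades"])

# same list, same sorted() function, different keys
by_name = sorted(people, key=get_name)
by_id = sorted(people, key=get_id)
by_grade = sorted(people, key=get_average_grade)

for p in by_id:
  print(p["id"], p["name"])
```

We pass the function itself (`key=get_id`, without parentheses), and `sorted()` calls it once for each item on the list to decide the order.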

In [None]:
# define a function that will extract the "thing" we want to compare (foot length for now)

# given a person, return their foot length
def get_foot_length(p):
  return p["foot"]["length"]

# now we can pass this to the sorted() function to use as the sorting "key"
display(sorted(ansurdata, key=get_foot_length))

We can reverse the order of the sort, and get the list from largest value to smallest value:

In [None]:
display(sorted(ansurdata, key=get_foot_length, reverse=True))

We can also pass a `key` to the `min()` and `max()` functions to get the object with the smallest or largest value for that key.

In [None]:
display(min(ansurdata, key=get_foot_length))
display(max(ansurdata, key=get_foot_length))
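
To check that `max()` with a `key` gives the same answer as the loop we wrote by hand earlier, here is a sketch on a few made-up records (`sample_people`) with the same nested structure as the ANSUR data.

```python
# made-up records with the same nested structure as the ANSUR data
sample_people = [
  {"age": 42, "foot": {"breadth": 92, "length": 238}},
  {"age": 21, "foot": {"breadth": 115, "length": 323}},
  {"age": 30, "foot": {"breadth": 100, "length": 277}},
]

# given a person, return their foot length
def get_foot_length(p):
  return p["foot"]["length"]

smallest = min(sample_people, key=get_foot_length)
largest = max(sample_people, key=get_foot_length)

# the same search written as a loop
big_foot = sample_people[0]
for p in sample_people:
  if p["foot"]["length"] > big_foot["foot"]["length"]:
    big_foot = p

print(largest == big_foot)
print(smallest["foot"]["length"], largest["foot"]["length"])
```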

## Questions

- How many people are in the dataset?
- How tall are the tallest and shortest people?
- What are the largest and smallest values for weight?
- How long is the longest hand? Who has it?
- Do older people weigh more?
- Are older people taller?
- Do taller people have bigger heads?