# From Week 02

## Lists

### Ranges

<img src="./imgs/range.jpg" width="500px" />

### List functions

These are methods that belong to every `list` object. We call them with dot notation on the list itself.

<img src="./imgs/lists00.jpg" width="500px" />

### Functions on lists

These are functions that Python gives us to work on lists.

There are functions for sorting, reversing and getting the length of a `list`:

<img src="./imgs/lists01.jpg" width="600px" />
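
As a quick sketch of those three operations, here they are on a made-up list of temperatures. The list and its values are hypothetical, and we are assuming the built-ins `len()`, `sorted()` and `reversed()` are the functions meant above.

```python
# Hypothetical sample data, not from the course materials
temps = [72, 65, 80, 58]

# len() counts the items in the list
num_temps = len(temps)

# sorted() returns a NEW list in ascending order; temps itself is unchanged
temps_sorted = sorted(temps)

# reversed() returns an iterator, so we wrap it in list() to see the items
temps_reversed = list(reversed(temps))

print(num_temps)
print(temps_sorted)
print(temps_reversed)
```

Note that `sorted()` and `reversed()` leave the original list alone, so `temps` still has its original order afterwards.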

### Slicing

Python has a built-in mechanism for getting sub-sections of a list called *slicing*.

Instead of a single index, we put two values inside the square brackets, separated by a `:`, to specify where our slice starts and ends:

<img src="./imgs/slicing.jpg" width="700px" />

One **VERY** important thing to remember is that the second index in the bracket is **NOT** included in the slice.
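
Here is a small sketch of that rule on a made-up list of letters (the list is just for illustration):

```python
# Hypothetical sample list
letters = ["a", "b", "c", "d", "e", "f"]

# start at index 0, stop BEFORE index 3
first_three = letters[0:3]

# start at index 2, stop BEFORE index 5
middle = letters[2:5]

print(first_three)  # ['a', 'b', 'c']
print(middle)       # ['c', 'd', 'e']
```

In both slices the item at the second index (`letters[3]` and `letters[5]`) is left out, so a slice `[i:j]` always has `j - i` items when both indices are inside the list.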

# Week 03

## TODO: add cell highlight

## Setup

Let's import some helper functions and libraries.

In [None]:
import random

## Objects

### Iterating over keys, values and items

[Documentation](https://docs.python.org/3/tutorial/datastructures.html#looping-techniques)

<img src="./imgs/objects.jpg" width="500px" />
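
As a sketch of the three looping styles, here they are on a small made-up object. The name `sample_info` and its contents are hypothetical and only used for this example.

```python
# Hypothetical object, just for illustration
sample_info = {"name": "Ada", "height": 65, "zip": 11201}

# .keys() gives us each key
for key in sample_info.keys():
  print(key)

# .values() gives us each value, without its key
for value in sample_info.values():
  print(value)

# .items() gives us (key, value) pairs that we can unpack in the loop
for key, value in sample_info.items():
  print(key, "->", value)

# collect the pairs in a list so we can inspect them
pairs = list(sample_info.items())
```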

In [None]:
# TODO use my_info.keys(), .values() and .items() to print contents of object

## List of objects

### Create a list of 10 objects with random heights, Brooklyn zip codes and a random id between 100 and 999.

```python
my_data = [
  {"height": [60, 70], "zip": [11200, 11250], "id": [100, 999]},
  {"height": [60, 70], "zip": [11200, 11250], "id": [100, 999]},
  {"height": [60, 70], "zip": [11200, 11250], "id": [100, 999]},
  ...
]
```

To do this, we can use a call to `range()` to create a counter, and then for each of the $10$ iterations we'll `append()` an object with three items, a `height` value between $60$ and $70$, a `zip` between $11200$ and $11250$ and an `id` between $100$ and $999$.
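
One possible way to write this, as a sketch: we are assuming `random.randint()` is the helper we want here (it includes both end points), and we use a separate list called `sample_data` so this doesn't stand in for the exercise below.

```python
import random

# Hypothetical list name, to keep it separate from my_data
sample_data = []

# range(10) is our counter: the loop body runs 10 times
for _ in range(10):
  sample_data.append({
    "height": random.randint(60, 70),     # 60 and 70 are both possible
    "zip": random.randint(11200, 11250),
    "id": random.randint(100, 999),
  })

print(len(sample_data))
print(sample_data[0])
```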

In [None]:
# TODO: create list of random objects
my_data = []

my_data

### Let's create a list of 3 random grades for each member of the list and another item with their computed average

We have to iterate over the list of objects, and for each object create a list with $3$ random grades between $70$ and $100$, and add this list to the object with the key `grades`.
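
A minimal sketch of that loop, using a tiny made-up list called `students` (hypothetical, with only an `id` in each object) and assuming whole-number grades from `random.randint()`:

```python
import random

# Hypothetical stand-in for our list of objects
students = [{"id": 101}, {"id": 102}, {"id": 103}]

for student in students:
  # build a list of 3 random grades and attach it under the key "grades"
  student["grades"] = [random.randint(70, 100) for _ in range(3)]

print(students[0])
```

Assigning to `student["grades"]` changes the object that is already in the list, so we don't need to `append()` anything new.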

In [None]:
# TODO: first, append grade list to objects

### Average

<img src="./imgs/average00.jpg" width="500px" />
<br>
<img src="./imgs/average01.jpg" width="500px" />

Now we'll iterate over the list of objects/students, and for each student calculate the average value of their grades and store that with a new key `avg`, which is the key the cells below read.
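
As a sketch, the average is just the sum of the grades divided by how many there are. The `students` list here is made up for the example, and we store the result under `avg` to match the key used in the cells that follow.

```python
# Hypothetical students with fixed grades, so the result is easy to check by hand
students = [
  {"id": 101, "grades": [70, 80, 90]},
  {"id": 102, "grades": [100, 85, 91]},
]

for student in students:
  grades = student["grades"]
  # sum of the values divided by the number of values
  student["avg"] = sum(grades) / len(grades)

print(students[0]["avg"])  # 80.0
print(students[1]["avg"])  # 92.0
```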

In [None]:
# TODO: compute and store average of grades

### Get highest and lowest average grades

First, collect all the average grades in a separate list, then use `min()`/`max()`:

In [None]:
grades = []
for obj in my_data:
  grades.append(obj["avg"])

min(grades), max(grades)

### Sort objects by average grades

We could first get all the average grades and then sort the new list:

In [None]:
grades = []
for obj in my_data:
  grades.append(obj["avg"])

by_grade = sorted(grades)

print("original:\n", grades)
print("sorted:\n", by_grade)

### But now we don't have the other associated information with each grade.

We want to sort the list while keeping the objects together.

It would be nice to be able to do something like this, just like with a `list` of numbers:

In [None]:
by_grade = sorted(my_data)
print(by_grade)

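But we can't: Python doesn't know how to decide which of two objects is "smaller", so `sorted()` raises a `TypeError` when it tries to compare them.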
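
To see what goes wrong without stopping the notebook, here is a small sketch that catches the error. The `people` list is hypothetical.

```python
# Hypothetical list of objects
people = [{"avg": 88.5}, {"avg": 91.0}]

try:
  sorted(people)
  error_message = None
except TypeError as err:
  # Python can't compare two dicts with <, so sorting fails
  error_message = str(err)

print(error_message)
```

The message says that `<` is not supported between two `dict` values, which is Python telling us it needs more information about how to order them.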

### Sorting Objects

For lists of objects we have to tell python which values to compare to determine their order.

We do this by defining a key function.

Key functions receive one argument (it can be an object, a list, a class member, anything) and they return one value that Python knows how to compare, usually a number.

<img src="./imgs/list-of-objects.jpg" width="620px" />

In [None]:
# this key function receives a student-info object with {height, zip, id, grades, avg}
# and should return just the average grade value
def gradeKey(person):
  return person["avg"]

# then we can just use it when we call sorted()
by_grade = sorted(my_data, key=gradeKey)

by_grade
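
As a variation, here is a sketch that sorts from highest to lowest by passing `reverse=True`, and writes the key function inline as a `lambda`. The `roster` list is made up, and `lambda` is not something these notes have introduced; it is just a shorter way to write a one-line function.

```python
# Hypothetical roster with known averages
roster = [
  {"id": 101, "avg": 88.0},
  {"id": 102, "avg": 93.5},
  {"id": 103, "avg": 79.0},
]

# same idea as a named key function, but highest average first
top_first = sorted(roster, key=lambda person: person["avg"], reverse=True)

top_ids = [person["id"] for person in top_first]
print(top_ids)  # [102, 101, 103]
```

The whole objects get reordered, so each `id` stays attached to its `avg`.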

In [None]:
# TODO: sort by first assignment grade

### `min()`/`max()` functions also work with a `key` argument:

In [None]:
# student with highest average grade
max_by_grade = max(my_data, key=gradeKey)

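# student with lowest score on first assignment
# gradeKey compares averages, so here we need a key that returns the first grade
def hw01Key(person):
  return person["grades"][0]

min_by_hw01 = min(my_data, key=hw01Key)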

print(max_by_grade)
print(min_by_hw01)

## Bigger Lists

## Setup

Include some helper functions and libraries.

In [None]:
!wget -q https://github.com/PSAM-5020-2025F-A/5020-utils/raw/main/src/data_utils.py

In [None]:
import matplotlib.pyplot as plt

from data_utils import object_from_json_url

### Load ANSUR 2 Dataset

The `JSON` file has a subset of the measurements found [here](https://www.openlab.psu.edu/ansur2/).

In [None]:
ANSUR_JSON_URL = "https://raw.githubusercontent.com/PSAM-5020-2025F-A/5020-utils/main/datasets/json/ansur.json"
ansur = object_from_json_url(ANSUR_JSON_URL)

# TODO: look at the data

# Answer:
#   - how many rows/records/items ?
#   - tallest height ?
#   - longest ear ?
#   - average ear length ?

### Let's look at a simpler version:

In [None]:
AHW_LIST_URL = "https://raw.githubusercontent.com/PSAM-5020-2025F-A/5020-utils/main/datasets/json/ansur_age_height_weight.json"
ahws = object_from_json_url(AHW_LIST_URL)

# TODO: look at data
# How is it organized ?

# Answer the following:
#   - how many items ?
#   - how do we access the height of a person ?

## List of Lists

Just like we can put lists inside objects, and objects inside lists, we can also put lists inside lists.

If we want to get to a particular value we have to use $2$ indices instead of using just one:
`list[i][j]`

The first index tells Python which of the sub-lists we want, and the second specifies the item on that list.

<img src="./imgs/list-of-lists00.jpg" width="700px" />

<img src="./imgs/list-of-lists01.jpg" width="700px" />

Sometimes we'll refer to the first index as the row index and the second index as the column index.

That's because if we imagine our list of lists as a 2-dimensional matrix of numbers, the first index tells Python which row we want to access and the second tells which column:

<img src="./imgs/list-of-lists02.jpg" width="700px" />

<img src="./imgs/list-of-lists03.jpg" width="700px" />
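
Here is a small sketch of row/column indexing. The `table` and its numbers are made up for the example.

```python
# Hypothetical 3x3 list of lists: each inner list is one row
table = [
  [25, 68, 150],
  [31, 72, 180],
  [42, 65, 140],
]

# one index gives us a whole row
second_row = table[1]

# two indices give us a single value: row 1, column 2
value = table[1][2]

# to get a whole column we visit every row and take the same position
middle_column = [row[1] for row in table]

print(second_row)     # [31, 72, 180]
print(value)          # 180
print(middle_column)  # [68, 72, 65]
```

Getting a row is a single lookup, but getting a column means going through every row.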

### Datasets

We'll see this kind of structure a lot.

It's very common for datasets to be organized by rows/columns, where each column specifies a different *property* (or *feature*) and each row is a different *measurement* (or *record*) of those features.

In our example above, our dataset had $3$ *features* (age, height, weight), and one *record* per person.

<img src="./imgs/datasets00.jpg" width="700px" />

### JSON

It's also common to find datasets specified in the JSON format.

Instead of just being a list of lists with values, each *record* is an object that specifies the names and values of its *features*:

<img src="./imgs/datasets01.jpg" width="700px" />

There are advantages and disadvantages to each. We'll soon look at another way to organize datasets that will make it easier to go from one type to the other if we have to.
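
In the meantime, here is a sketch of converting by hand between the two layouts with plain Python. The feature names follow the age/height/weight example above; the values and variable names are hypothetical.

```python
# Hypothetical rows/columns dataset: one inner list per record
feature_names = ["age", "height", "weight"]
rows = [
  [25, 68, 150],
  [31, 72, 180],
]

# list of lists -> list of objects: pair each value with its feature name
records = [dict(zip(feature_names, row)) for row in rows]

# list of objects -> list of lists: read the features back in a fixed order
rows_again = [[record[name] for name in feature_names] for record in records]

print(records[0])
print(rows_again)
```

The list of lists is more compact, but we have to remember which column is which; the list of objects repeats the names in every record, but each value describes itself.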

## Plots

We can use the `pyplot` module from the [matplotlib](https://matplotlib.org/stable/api/pyplot_summary.html) library to visualize our data.

In [None]:
# TODO: get heights
heights = []

plt.plot(heights, 'bo', markersize=2)
plt.show()

In [None]:
# TODO: get weights
weights = []

plt.plot(weights, 'ro', markersize=2)
plt.show()

In [None]:
# TODO: plot ages in green
ages = []

### Sorting data can give a different perspective

In [None]:
sorted_heights = sorted(heights)
plt.plot(sorted_heights, 'bo', markersize=2)
plt.show()

### Histograms

In [None]:
min_height = min(heights)
max_height = max(heights)
plt.hist(heights, bins=range(min_height, max_height + 1))
plt.grid()
plt.show()

## Correlation

Correlation is a measurement of how $2$ variables (features) are related to each other.

<img src="./imgs/correlation.jpg" width="800px" />

They can have *positive* or *direct* correlation, if an increase in one of the variables comes with an increase in the other.

They can have *negative* or *inverse* correlation if an increase in one of the variables is accompanied by a decrease in the other.

Or, there can be *weak* or *NO* correlation, if a change in one variable doesn't seem to be accompanied by a change in the other.
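
If we want a number instead of just looking at a plot, one common choice is the Pearson correlation coefficient, which goes from $-1$ (inverse) through $0$ (none) to $1$ (direct). These notes don't define it, so take this as a sketch under that assumption; the function name `pearson` and the sample lists are hypothetical.

```python
import math

def pearson(xs, ys):
  # assumes both lists have the same length and neither is constant
  n = len(xs)
  mean_x = sum(xs) / n
  mean_y = sum(ys) / n
  # how much the two variables move together
  cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
  # how much each variable moves on its own
  var_x = sum((x - mean_x) ** 2 for x in xs)
  var_y = sum((y - mean_y) ** 2 for y in ys)
  return cov / math.sqrt(var_x * var_y)

# Hypothetical data
hours = [1, 2, 3, 4, 5]
direct = [2, 4, 6, 8, 10]    # goes up when hours goes up
inverse = [10, 8, 6, 4, 2]   # goes down when hours goes up

r_direct = pearson(hours, direct)
r_inverse = pearson(hours, inverse)

print(r_direct)
print(r_inverse)
```

Values close to $1$ or $-1$ mean the points fall near a straight line; values near $0$ suggest weak or no linear relationship.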

In [None]:
# use "column" lists from above to plot scatter plot
plt.scatter(ages, heights, marker='o', alpha=0.2)
plt.xlabel("age")
plt.ylabel("height")
plt.show()

In [None]:
# TODO plot other combinations of variables
# TODO: any correlation ?

## Extra Practice

### Traversing a list of objects/dictionaries

This next cell creates a list of $1000$ objects with the following keys/parameters:

```py
{
  "id": "abc1234",
  "zip": 10001,
  "grades": [70, 81, 92, 84, 89],
  "attendance": [True, False, True, ....]
}
```

The `id` field is a string made up of $3$ letters and $4$ numbers; `zip` is a NYC area zip code; `grades` is a list of $5$ grades on a $0$ to $100$ scale (the cell below generates them as decimal values clustered around $80$); and `attendance` is a list of $15$ boolean values.

In [None]:
import random
import string

from matplotlib import pyplot as plt

In [None]:
data = []

for cnt in range(1000):
  id_let = random.choices(string.ascii_lowercase, k=3)
  id_num = random.choices(string.digits, k=4)
  id_str = "".join(id_let + id_num)
  data.append({
    "id": id_str,
    "zip": random.randint(10001, 11250),
    "grades": [random.gauss(80, 5) for g in range(5)],
    "attendance": random.choices([True, True, True, True, True, False], k=15)
  })

We can check the length of the list, the contents of its first item, and its keys, with:

In [None]:
display(len(data))
display(data[0])
display(data[0].keys())

### Plot: Grade vs Attendance

Let's see if someone's attendance has an effect on their grade.

We'll want to eventually plot average grade vs average attendance, but let's start simple.

Let's plot all of the students' first assignment grades versus their first attendance.

We have to go through the list of objects/dicts and extract the first assignment grade and first attendance into separate lists of grades and attendances.

In [None]:
grades_0 = []
attendances_0 = []

for d in data:
  # TODO: append first grade to list
  # TODO: append first attendance to list
  pass  # placeholder so the loop is valid Python until the TODOs are filled in

plt.plot(attendances_0, grades_0, 'o')
plt.show()

### Plot: Average Grade vs Average Attendance

Now let's plot average grade versus average attendance.

The logic is the same, but instead of just appending the first grade/attendance, we'll append the average grade and average attendance values.
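
Averaging a list of booleans might look odd, so here is a sketch on a single made-up record. It relies on the fact that Python counts `True` as $1$ and `False` as $0$ when we add them up.

```python
# Hypothetical record, shaped like the objects in our list
record = {
  "grades": [70, 81, 92, 84, 89],
  "attendance": [True, False, True, True],
}

# average grade: sum divided by count
avg_grade = sum(record["grades"]) / len(record["grades"])

# average attendance: sum() counts the True values, so this is
# the fraction of classes attended, between 0 and 1
avg_attendance = sum(record["attendance"]) / len(record["attendance"])

print(avg_grade)
print(avg_attendance)  # 0.75
```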

In [None]:
grades_avg = []
attendances_avg = []

for d in data:
  # TODO: append average grade to list
  # TODO: append average attendance to list
  pass  # placeholder so the loop is valid Python until the TODOs are filled in

plt.plot(attendances_avg, grades_avg, 'o')
plt.show()

plt.hist(attendances_avg, bins=8)
plt.show()