# Homework02

A set of exercises to get started with Python: lists, dictionaries, and loading and processing data.

## Goals

- Gain experience with a popular scripting language used for ML/AI projects and research
- Get familiar with Python's notation for lists and objects
- Experiment with Python's unique functionalities for processing lists and objects
- Learn to load and process datasets using Python

### Setup

Run the following 2 cells to import all necessary libraries and helpers for Homework 02.

In [26]:
!wget -q https://github.com/PSAM-5020-2025S-A/5020-utils/raw/main/src/data_utils.py

In [28]:
import matplotlib.pyplot as plt

from Homework02_utils import Tests
from data_utils import object_from_json_url

### Exercise 01:

Finding the sum of integer sequences.

Create a function ```sum_of_ints(i0, i1)``` that returns the sum of all integers from `i0` to `i1`, including both endpoints. The arguments can come in either order, so `i0` may be larger than `i1`.

For example, ```sum_of_ints(4, 32)``` should return $522$.
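
Adding the integers one by one is not the only route. The sequence is an arithmetic series, so the expected value can be double-checked in closed form as the number of terms times the average of the first and last term. The sketch below is only a sanity check; `sum_of_ints_closed_form` is a hypothetical name, not part of the homework helpers.

```python
def sum_of_ints_closed_form(i0, i1):
    # Accept the endpoints in either order
    lo, hi = min(i0, i1), max(i0, i1)
    count = hi - lo + 1  # number of integers, both endpoints included
    # count * (lo + hi) is always even, so integer division is exact
    return count * (lo + hi) // 2

print(sum_of_ints_closed_form(4, 32))  # 29 terms * (4 + 32) / 2 = 522
```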


In [29]:
## Work on exercise 01 here

def sum_of_ints(i0, i1):
  # The arguments can come in either order, so find the smaller and the larger one
  lo, hi = min(i0, i1), max(i0, i1)
  # range() excludes its stop value, so add 1 to include the larger endpoint
  return sum(range(lo, hi + 1))


### Exercise 01 testing:

Running the following cell will test the ```sum_of_ints()``` function on a small set of input pairs and report any pair whose result is incorrect.

In [30]:
# Test 01
Tests.test("01", sum_of_ints)


### Exercise 02:

Working with lists and dictionaries/objects.

Write a function `sum_objects(in_list)` that accepts a list of objects and returns a sum according to the following specifications:

Each object in the list has 2 fields, `type` and `amount`, and so will look something like this:

```python
test_list = [
  { "type": "cost", "amount": 10.00 },
  { "type": "cost", "amount": 15.99 },
  { "type": "income", "amount": 150.25 },
  { "type": "income", "amount": 243.52 },
]
```

The `sum_objects(in_list)` function should iterate through all of the items in `in_list` and sum their `amount`s using positive values when `type` is `"income"`, and negative values when `type` is `"cost"`.

For example, passing the above list to the function should return $367.78$.
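
One way to picture the rule is as a sign attached to each `type`. The sketch below is a minimal illustration under that assumption; `SIGN_BY_TYPE` and `signed_total` are hypothetical names, and the result is rounded only for display because floating-point sums can carry tiny rounding errors.

```python
import math

# Hypothetical lookup table: income adds to the total, cost subtracts from it
SIGN_BY_TYPE = {"income": 1, "cost": -1}

def signed_total(items):
    # Multiply each amount by the sign for its type, then add everything up
    return sum(SIGN_BY_TYPE[item["type"]] * item["amount"] for item in items)

example_list = [
    {"type": "cost", "amount": 10.00},
    {"type": "cost", "amount": 15.99},
    {"type": "income", "amount": 150.25},
    {"type": "income", "amount": 243.52},
]

total = signed_total(example_list)
print(round(total, 2))  # 150.25 + 243.52 - 10.00 - 15.99 = 367.78
```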

In [31]:
# Implement 02 here

def sum_objects(in_list):
  sum_obj = 0  # Running total
  for item in in_list:
    if item["type"] == "income":
      sum_obj += item["amount"]  # Income counts as a positive value
    elif item["type"] == "cost":
      sum_obj -= item["amount"]  # Cost counts as a negative value
  return sum_obj


In [32]:
# Run this cell to test if function returns 367.78

test_list = [
  { "type": "cost", "amount": 10.00 },
  { "type": "cost", "amount": 15.99 },
  { "type": "income", "amount": 150.25 },
  { "type": "income", "amount": 243.52 },
]

sum_objects(test_list)

#### Exercise 02 test:

Run the following cell to test the `sum_objects()` function.

In [33]:
# Test 02
Tests.test("02", sum_objects)

### Exercise 03:

Working with data files.

Find the names and populations of the 3 cities that are geographically closest to the world's most populous city.

# 🤔😱


#### Load Data:

Let's break this down into a few sub-problems.

First, let's load a JSON file that has information about large cities in the world.

The file at this [URL](https://raw.githubusercontent.com/PSAM-5020-2025S-A/5020-utils/main/datasets/json/cities50k.json) has a list of cities formatted like this:

```py
{
  "name": "Pittsburgh",
  "country": "US",
  "admin1": "Pennsylvania",
  "lat": 40.4406200,
  "lon": -79.9958900,
  "pop": 304391
}
```

This is just like how we loaded ANSUR data files in class:

In [34]:
# Define the location of the json file here
CITIES_FILE = "https://raw.githubusercontent.com/PSAM-5020-2025S-A/5020-utils/main/datasets/json/cities50k.json"

# Use the object_from_json_url() function to load contents from 
# the json file into a Python object called "info_cities"

info_cities = object_from_json_url(CITIES_FILE)
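
The source of `object_from_json_url()` is not shown here. Assuming it returns the parsed JSON as ordinary Python lists and dicts, the loaded data can be explored the same way as the small stand-in below, which parses a JSON string with the standard `json` module instead of downloading anything. The one-entry `sample_json` string is made up from the record format shown above.

```python
import json

# Stand-in for the downloaded file: a JSON array holding one city record
sample_json = (
    '[{"name": "Pittsburgh", "country": "US", "admin1": "Pennsylvania", '
    '"lat": 40.44062, "lon": -79.99589, "pop": 304391}]'
)

sample_cities = json.loads(sample_json)  # JSON array -> list, JSON object -> dict

first = sample_cities[0]
print(len(sample_cities))           # number of records in the list
print(first["name"], first["pop"])  # fields are read by key
print(list(first.keys()))           # every key available on a record
```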

#### Exercise 03A:

Ok. We should now have a list of objects with information about cities.

Explore the data and answer the following questions:
- How many cities are in this list?
- What's the name of the first city on the list?
- What are the latitude and longitude of the last city on the list?
- What are the populations for the largest and smallest cities?
- What's the name of the city with the largest population?


In [37]:
# Work on 03A here

# How many cities are in the list?

num_cities = len(info_cities)
print(num_cities)


# What's the name of the first city on the list?

first_city = info_cities[0]["name"]
print(first_city)


# What are the latitude and longitude of the last city on the list?

last_latitude = info_cities[-1]["lat"]
last_longitude = info_cities[-1]["lon"]
print(last_latitude, last_longitude)


# What are the populations for the largest and smallest cities?

largest_population = max(info_cities, key=lambda city: city["pop"])["pop"]
smallest_population = min(info_cities, key=lambda city: city["pop"])["pop"]

print(largest_population, smallest_population)



# What's the name of the city with the largest population?

largest_city_name = max(info_cities, key=lambda city: city["pop"])["name"]

print(largest_city_name)


8670
Abu Dhabi
-17.88333 30.7
22315474 50011
Shanghai


#### Test 03A

In [38]:
# Test 03A
answers = [num_cities, first_city, last_latitude, last_longitude, largest_population, smallest_population, largest_city_name]

Tests.test("03A", answers)

03A: All tests passed 🎉🎉🎉


#### Exercise 03B:

We have the largest city's name and population, but we need its position.

We can recycle some of the logic from above to get the whole object that contains information for the largest city.

In [52]:
# Work on 03B here

largest_city = max(info_cities, key=lambda city: city["pop"])

# Reuse the object found above instead of searching the list again
largest_city_lat = largest_city["lat"]
largest_city_lon = largest_city["lon"]

largest_city_pos = [largest_city_lat, largest_city_lon]
print(largest_city_pos)




[31.22222, 121.45806]


In [53]:
# Test 03B
Tests.test("03B", largest_city)

03B: All tests passed 🎉🎉🎉


#### Exercise 03C:

We should have all info about the largest city here.

Now, we'll iterate through the list and use each city's latitude and longitude to calculate its distance from the largest city.

Although not $100\%$ correct, it's ok to use the [2D Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance#Two_dimensions) for this.

It could be useful to define a function `distance(cityA, cityB)` that returns the distance between two cities.

In [55]:
# Work on 03C here

# Implement the helper function for calculating distances between 2 cities

def distance(cityA, cityB):
  cityAlat = cityA["lat"]
  cityAlon = cityA["lon"]
  cityBlat = cityB["lat"]
  cityBlon = cityB["lon"]
  distanceAB = ((cityAlat-cityBlat)**2 + (cityAlon-cityBlon)**2)**0.5
  return distanceAB



In [56]:
# Test 03C
Tests.test("03C", distance)

03C: All tests passed 🎉🎉🎉
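
The Euclidean shortcut treats degrees of latitude and longitude as flat coordinates, which is why it is not fully correct on a sphere. For comparison only, a great-circle distance can be sketched with the haversine formula. The name `haversine_km` and the constant `EARTH_RADIUS_KM` are illustrative assumptions, and the homework tests expect the Euclidean version above.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation

def haversine_km(cityA, cityB):
    # Convert degrees to radians before using trigonometric functions
    lat_a, lon_a = math.radians(cityA["lat"]), math.radians(cityA["lon"])
    lat_b, lon_b = math.radians(cityB["lat"]), math.radians(cityB["lon"])
    d_lat = lat_b - lat_a
    d_lon = lon_b - lon_a
    a = math.sin(d_lat / 2) ** 2 + math.cos(lat_a) * math.cos(lat_b) * math.sin(d_lon / 2) ** 2
    # Clamp to 1.0 so rounding errors never push asin() outside its domain
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))

# Coordinates copied from the values shown earlier in this notebook
shanghai = {"lat": 31.22222, "lon": 121.45806}
pittsburgh = {"lat": 40.44062, "lon": -79.99589}
print(round(haversine_km(shanghai, pittsburgh)), "km")
```

Unlike the Euclidean version, the result is in kilometers rather than degrees, so the two numbers are not directly comparable.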


#### Exercise 03D:

Ok. We implemented a function to calculate the distance between 2 cities. Let's use it now.

Iterate through the list of cities again, calculate the distance from each city to the largest city, and add that as a new feature/key to each city's entry:

```py
{
  "name": "Pittsburgh",
  "country": "US",
  "admin1": "Pennsylvania",
  "lat": 40.4406200,
  "lon": -79.9958900,
  "pop": 304391,
  "distance": 1222.32
}
```

Just make sure the key that holds the distance value is called `distance`.

In [61]:
# Work on 03D here

# Now calculate every city's distance from the largest city and
# add that info to each city's entry or save that on a new list
# with their name and pop


city_distances = []

for city in info_cities:
    dist = distance(largest_city, city)  # Compute distance
    city["distance"] = dist  # Test 03D looks for this key on each city entry
    city_distances.append({
        "name": city["name"],
        "pop": city["pop"],
        "distance_from_largest": dist
    })
print(city_distances[:3])  # Preview a few entries instead of all 8670


[{'name': 'Abu Dhabi', 'pop': 603492, 'distance_from_largest': 67.40206314269321}, {'name': 'Ajman City', 'pop': 490035, 'distance_from_largest': 66.235511831048}, {'name': 'Al Ain City', 'pop': 55091, 'distance_from_largest': 66.0726126284749}]

In [45]:
# Test 03D
Tests.test("03D", info_cities)

#### Exercise 03E:

Now, sort the array from the previous step by distance and get the name and population of the $3$ cities closest to the largest city, but not including the largest city. In other words, if you sort the list from the exercise above by ascending `distance`, the $3$ cities closest to the largest city will be in the slice `[1:4]`. The city at index $0$ is the city with the largest population, and should have a distance of $0$ from itself.

The answer should be an object whose keys are city names and whose values are populations.

Something like:

```python
closest_3 = {
  "pittsburgh": 23412,
  "liverpool": 172821,
  "oakland": 182726
}
```

We saw how to sort lists of objects in lecture.
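
As a reminder of the pattern, the steps are: sort by a key, slice off the first entry, then build a dictionary from what remains. The sketch below uses made-up entries, so the names and numbers are illustrative only and do not come from the dataset.

```python
# Made-up entries for illustration only
toy_cities = [
    {"name": "far", "pop": 300, "distance": 9.5},
    {"name": "home", "pop": 1000, "distance": 0.0},
    {"name": "near", "pop": 100, "distance": 1.2},
    {"name": "mid", "pop": 200, "distance": 4.0},
    {"name": "farthest", "pop": 400, "distance": 20.1},
]

# sorted() returns a new list ordered by the value the key function returns
by_distance = sorted(toy_cities, key=lambda city: city["distance"])

# Index 0 is the reference city itself (distance 0), so take the slice [1:4]
closest = {city["name"]: city["pop"] for city in by_distance[1:4]}
print(closest)  # {'near': 100, 'mid': 200, 'far': 300}
```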

In [65]:
# Work on 03E here

# Sort the array and get the name and population of the 3 cities closest to the largest city

# Sort city_distances by the "distance_from_largest" key
cities_by_distance = sorted(city_distances, key=lambda x: x["distance_from_largest"])

# Skip index 0 (the largest city itself) and map each city name to its population
closest_3 = {city["name"]: city["pop"] for city in cities_by_distance[1:4]}
print(closest_3)

{'Songjiang': 130218, 'Zhujiajiao': 60000, 'Kunshan': 1600000}


In [66]:
# Test 03E
Tests.test("03E", closest_3)


### Exercise 04:

Visualizing data files.


#### Loading The Data:

Let's load a JSON file that has information about houses in the Los Angeles metropolitan region of California.

The file at this [url](https://raw.githubusercontent.com/PSAM-5020-2025S-A/5020-utils/main/datasets/json/LA_housing.json) has a list of objects formatted like this:

```python
{
  "longitude": -114.310,
  "latitude": 34.190,
  "age": 15,
  "rooms": 12.234,
  "bedrooms": 3.514,
  "value": 669000
}
```

The numbers of rooms and bedrooms are not integers because some addresses have multiple units/apartments with different floor plans, and their values get averaged.

In [67]:
# Define the location of the json file
HOUSES_FILE = "https://raw.githubusercontent.com/PSAM-5020-2025S-A/5020-utils/main/datasets/json/LA_housing.json"

# Use the object_from_json_url() function to load
# the json file into a Python object called "info_houses"

info_houses = object_from_json_url(HOUSES_FILE)

#### Exercise 04A:

Explore the data and answer the following questions:
- How many instances are there in our dataset?
- What's the value of the most expensive house?
- What's the max number of bedrooms in a house?
- What's the number of bedrooms in the house with the most rooms?
- What's the number of rooms in the house with the most bedrooms?

In [None]:
# Work on 04A here

# How many instances are there in our dataset?
# This is the same as asking "how many rows" or, in this case, "how many houses"

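num_houses = len(info_houses)


# What's the value of the most expensive house?

max_value = max(house["value"] for house in info_houses)


# What's the number of bedrooms in the house with the most bedrooms?

most_bedrooms = max(house["bedrooms"] for house in info_houses)


# What's the number of bedrooms in the house with the most rooms?

bedrooms_in_most_rooms = max(info_houses, key=lambda house: house["rooms"])["bedrooms"]


# What's the number of rooms in the house with the most bedrooms?

rooms_in_most_bedrooms = max(info_houses, key=lambda house: house["bedrooms"])["rooms"]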


In [None]:
# Test 04A
answers = [num_houses, max_value, most_bedrooms, bedrooms_in_most_rooms, rooms_in_most_bedrooms]

Tests.test("04A", answers)

#### Exercise 04B:

Which of the features (`longitude`, `latitude`, `age`, `rooms` or `bedrooms`) is the best indicator of the value of a house?

We're going to use XY scatter plots to visualize house value as a function of each of these features, and see if any of them show strong correlation.

Documentation for the plotting library is here:
https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html

One thing to note is that the functions for plotting like to get lists of values and not lists of objects.

Before we plot anything, let's define a function `list_from_key(objs, key)` that returns a list with all values of feature `key` for all of the objects in `objs`.

In [None]:
# Work on 04B here

# helper function to get lists of values from specific key
def list_from_key(objs, key):
  # Collect the value stored under `key` from every object, keeping list order
  return [obj[key] for obj in objs]


In [None]:
# Test 04B
Tests.test("04B", list_from_key)

#### Exercise 04C:

Now we can actually plot some values and start looking for correlations.

Pick a feature and make a graph that shows house prices as a function of that feature.

You can also write a for loop to plot graphs for all features.

In [None]:
# Work on 04C here

# get a list with all of the price values
prices = list_from_key(info_houses, "value")

# get a list with all of the house ages (for example)
house_ages = list_from_key(info_houses, "age")


# this is the command to plot a XY scatter plot from 2 lists
# see documentation: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html

plt.scatter(house_ages, prices)
plt.xlabel("age")
plt.ylabel("value")
plt.show()
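
Scatter plots show correlation visually. If a number helps back up the visual impression, one common choice is the Pearson correlation coefficient, which ranges from $-1$ to $1$. The sketch below is a plain-Python version under that assumption; `pearson_r` is a hypothetical helper and the toy lists are made up, not taken from the housing data.

```python
import math

def pearson_r(xs, ys):
    # Covariance of the two lists divided by the product of their spreads
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up values that lie exactly on a straight line
toy_rooms = [2, 4, 6, 8]
toy_values = [100, 200, 300, 400]
print(pearson_r(toy_rooms, toy_values))  # 1.0, a perfect positive correlation
```

With the real data, something like `pearson_r(list_from_key(info_houses, "age"), list_from_key(info_houses, "value"))` would give one number per feature to compare against the plots.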

#### What are some features that correlate with price?

### <span style="color:hotpink">Interpretation</span>

<span style="color:hotpink">Answer 04C here. Double-click to edit Markdown.</span>

#### Foreshadowing:

What if we use two features at a time?

Is there a pair of features (`longitude`, `latitude`, `age`, `rooms` or `bedrooms`) that correlates to house value?

We could look at the relationship between `value`, `age` and `rooms`:

In [None]:
# get a list with all of the price values
prices = list_from_key(info_houses, "value")

# get a list with all of the values of one feature
feature_0_values = list_from_key(info_houses, "age")

# get a list with all of the values of another feature
feature_1_values = list_from_key(info_houses, "rooms")

# this is how we plot an XY scatter plot using 3 lists
# see documentation: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html

plt.scatter(feature_0_values, feature_1_values, c=prices, alpha=0.3)
plt.xlabel("age")
plt.ylabel("rooms")
plt.show()

Or, we could write a little for loop to plot all possible pairs of features:

In [None]:
# to plot all feature pairs
# get list of all features
features = info_houses[0].keys()
prices = list_from_key(info_houses, "value")

# get all pairs of features
for idx_0, feature_0 in enumerate(features):
  x = list_from_key(info_houses, feature_0)
  for idx_1, feature_1 in enumerate(features):
    y = list_from_key(info_houses, feature_1)
    # skip repeated features
    if feature_0 != "value" and feature_1 != "value" and idx_1 > idx_0:
      plt.scatter(x, y, c=prices, alpha=0.3)
      plt.xlabel(feature_0)
      plt.ylabel(feature_1)
      plt.show()
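
The nested loop above visits every ordered pair and then filters out the repeats. As an alternative sketch, `itertools.combinations` from the standard library yields each unordered pair exactly once. The feature names below are copied from the record format shown earlier, with `value` left out because it is used for color.

```python
from itertools import combinations

# Feature names from the record format, excluding "value"
feature_names = ["longitude", "latitude", "age", "rooms", "bedrooms"]

# Every unordered pair of distinct features, each one exactly once
feature_pairs = list(combinations(feature_names, 2))

print(len(feature_pairs))  # 5 features -> 10 pairs
for feature_0, feature_1 in feature_pairs[:3]:
    print(feature_0, feature_1)
```

Each `(feature_0, feature_1)` pair could then be passed to `list_from_key()` and `plt.scatter()` in the same way as in the loop above.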