# HW03B

Some exercises to get started with lists, dicts and data in Python

## Goals

- Gain experience with a popular scripting language used for ML/AI projects and research
- Get familiar with Python's notation for lists and objects
- Experiment with Python's unique functionalities for processing lists and objects
- Process a dataset using Python

These skills are useful in any data-driven project, whether it involves data processing or the use of ML models.

### Setup

Run the following cell to import the libraries and helpers needed for this homework.

In [2]:
from HW03_utils import object_from_json_path, Tests

### Working with data files

Find the names of the 3 cities that are geographically closest to the world's most populous city.

# 🤔😱


#### Load Data:

Let's break this down into a few sub-problems.

First, let's load a JSON file that has information about large cities in the world.

The file at this [URL](https://raw.githubusercontent.com/DM-GY-9103-2024F-H/9103-utils/main/datasets/json/cities50k.json) has a list of cities formatted like this:

```py
{
  "name": "Pittsburgh",
  "country": "US",
  "admin1": "Pennsylvania",
  "lat": 40.4406200,
  "lon": -79.9958900,
  "pop": 304391
}
```

This is just like how we loaded ANSUR data files in class:

In [3]:
# Define the location of the json file here
CITIES_FILE = "./data/cities50k.json"

# Use the object_from_json_path() function to load contents from
# the json file into a Python object called "info_cities"

info_cities = object_from_json_path(CITIES_FILE)
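
The implementation of `object_from_json_path` is not shown in this notebook. As a rough sketch of what loading a JSON file into a Python object can look like, the block below uses the standard `json` module. The function name `load_json` and the in-memory sample string are assumptions made for illustration and are not part of `HW03_utils`.

```python
import json

def load_json(path):
    """Hypothetical stand-in for a JSON-loading helper: parse a file into Python objects."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

# Self-contained demonstration: parse a JSON string shaped like one entry of the cities file
sample = '[{"name": "Pittsburgh", "country": "US", "pop": 304391}]'
cities = json.loads(sample)

# A JSON array becomes a Python list, and each JSON object becomes a dict
print(type(cities).__name__, type(cities[0]).__name__)
print(cities[0]["name"])
```

The point to take away is the mapping between formats: JSON arrays become lists and JSON objects become dicts, which is why `info_cities` can be indexed and looped over like any other list of dicts.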

#### Exercise A:

Ok. We should now have a list of objects with information about cities.

Explore the data and answer the following questions:
- How many cities are in this list?
- What's the name of the first city on the list?
- What are the latitude and longitude of the last city on the list?
- What are the populations for the largest and smallest cities?
- What's the name of the city with the largest population?


In [18]:
# Work on A here

# print(info_cities, "\n")

# How many cities are in the list?

num_cities = len(info_cities)
print("The number of cities in the list are:", num_cities, "\n")

# What's the name of the first city on the list?

first_city = info_cities[0]["name"]
print("The name of the first city in the list is", first_city, "\n")

# What are the latitude and longitude of the last city on the list?

last_latitude = info_cities[-1]["lat"]
last_longitude = info_cities[-1]["lon"]
print("The latitude of the last city is:", last_latitude)
print("The longitude of the last city is:", last_longitude, "\n")

# What are the populations for the largest and smallest cities?

# Sort key: takes one city dict and returns its population
def sortPopulation(city):
    return city["pop"]

# List of city dicts sorted by ascending population
populations = sorted(info_cities, key=sortPopulation)

smallest_population = populations[0]["pop"]
largest_population = populations[-1]["pop"]

print("The smallest population available:", smallest_population)
print("The largest population available:", largest_population, "\n")

# What's the name of the city with the largest population?

largest_city_name = populations[-1]["name"]
print("The name of the city with the largest population is:", largest_city_name, "\n")

The number of cities in the list are: 8670 

The name of the first city in the list is Abu Dhabi 

The latitude of the last city is: -17.88333
The longitude of the last city is: 30.7 

The smallest population available: 50011
The largest population available: 22315474 

The name of the city with the largest population is: Shanghai 



#### Test A

In [5]:
# Test A
answers = [num_cities, first_city, last_latitude, last_longitude, largest_population, smallest_population, largest_city_name]

Tests.test("A", answers)

A: All tests passed 🎉🎉🎉


#### Exercise B:

We have the largest city's name and population, but we need its location.

We can recycle some of the logic from above to get the whole object that contains information for the largest city.
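
Sorting the whole list is one way to reach the largest entry. Another option, sketched here on made-up toy data, is to let `max()` and `min()` pick the entry directly with a `key` function. The names `toy_cities` and `by_pop` are hypothetical and used only for this illustration.

```python
# Toy data for illustration only; not taken from cities50k.json
toy_cities = [
    {"name": "A", "pop": 120},
    {"name": "B", "pop": 950},
    {"name": "C", "pop": 40},
]

def by_pop(city):
    # Key function: the value used to compare two city dicts
    return city["pop"]

# max()/min() return the whole dict, not just the population value
toy_largest = max(toy_cities, key=by_pop)
toy_smallest = min(toy_cities, key=by_pop)

print(toy_largest)
print(toy_smallest["name"])
```

Because the result is the full dict, the same object also gives access to the other fields of that entry, such as its coordinates.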

In [6]:
# Work on B here

largest_city = populations[-1]


In [7]:
# Test B
Tests.test("B", largest_city)

B: All tests passed 🎉🎉🎉


#### Exercise C:

We should have all info about the largest city here.

Now, we'll iterate through the list and use each city's latitude and longitude to calculate its distance from the largest city.

Although not $100\%$ correct (latitude and longitude are angles on a sphere, not coordinates on a flat plane), it's ok to use the [2D Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance#Two_dimensions) for this.

It could be useful to define a function `distance(cityA, cityB)` that returns the distance between two cities.
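
For reference only, one common way to measure distance on a sphere is the haversine formula for great-circle distance. The sketch below is not what this exercise asks for, since the exercise uses the Euclidean version. The function name `haversine_km` and the mean Earth radius of 6371 km are assumptions introduced here.

```python
import math

def haversine_km(cityA, cityB, radius_km=6371.0):
    """Great-circle distance in km between two dicts with "lat" and "lon" in degrees."""
    lat1, lon1 = math.radians(cityA["lat"]), math.radians(cityA["lon"])
    lat2, lon2 = math.radians(cityB["lat"]), math.radians(cityB["lon"])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot before asin
    return 2 * radius_km * math.asin(min(1.0, math.sqrt(a)))

# One degree of longitude covers less ground at high latitude than at the equator
equator_km = haversine_km({"lat": 0, "lon": 0}, {"lat": 0, "lon": 1})
north_km = haversine_km({"lat": 60, "lon": 0}, {"lat": 60, "lon": 1})
print(round(equator_km, 1), round(north_km, 1))
```

The 2D Euclidean distance between each of those pairs is exactly 1 degree, while the great-circle distances differ by roughly a factor of two. That gap is the error the Euclidean shortcut accepts.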

In [8]:
# Work on C here

# Implement the helper function for calculating distances between 2 cities

import math

def distance(cityA, cityB):
  latA = cityA["lat"]
  lonA = cityA["lon"]
  latB = cityB["lat"]
  lonB = cityB["lon"]

  # Equivalent to math.sqrt((latA - latB)**2 + (lonA - lonB)**2):
  # math.hypot returns the square root of the sum of squares of its arguments.
  dist = math.hypot(latA - latB, lonA - lonB)
  return dist


In [9]:
# Test C
Tests.test("C", distance)

C: All tests passed 🎉🎉🎉


#### Exercise D:

Ok. We implemented a function to calculate the distance between 2 cities, so let's use it now.

Iterate through the list of cities again, calculate the distance from each city to the largest city, and add that as a new feature/key to each city's entry:

```py
{
  "name": "Pittsburgh",
  "country": "US",
  "admin1": "Pennsylvania",
  "lat": 40.4406200,
  "lon": -79.9958900,
  "pop": 304391,
  "distance_to_largest": 1222.32
}
```

Just make sure the `key` that holds the distance value is called `distance_to_largest`.
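
The pattern of adding a computed key to every entry can be sketched on toy data as follows. The names `toy_points`, `reference` and `distance_to_reference` are hypothetical and chosen so they do not collide with the names used in the exercise.

```python
import math

# Toy data for illustration only; coordinates are made up
toy_points = [
    {"name": "ref", "lat": 0.0, "lon": 0.0},
    {"name": "p1", "lat": 3.0, "lon": 4.0},
    {"name": "p2", "lat": 6.0, "lon": 8.0},
]
reference = toy_points[0]

for point in toy_points:
    # Assigning to a new key adds it to the dict in place
    point["distance_to_reference"] = math.hypot(
        point["lat"] - reference["lat"],
        point["lon"] - reference["lon"],
    )

print([p["distance_to_reference"] for p in toy_points])
```

The loop variable refers to the same dict objects stored in the list, so no new list needs to be built: after the loop, every entry in the original list carries the new key.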

In [10]:
# Work on D here

# TODO: calculate every city's distance from the largest city 
# and add that info to each city's entry in info_cities
for item in info_cities:
    dist = distance(item, largest_city)
    item["distance_to_largest"] = dist

info_cities[0:2]
         


[{'name': 'Abu Dhabi',
  'country': 'AE',
  'admin1': 'Abū Z̧aby',
  'lat': 24.45118,
  'lon': 54.39696,
  'pop': 603492,
  'distance_to_largest': 67.40206314269321},
 {'name': 'Ajman City',
  'country': 'AE',
  'admin1': '‘Ajmān',
  'lat': 25.40177,
  'lon': 55.47878,
  'pop': 490035,
  'distance_to_largest': 66.235511831048}]

In [11]:
# Test D
Tests.test("D", info_cities)

D: All tests passed 🎉🎉🎉


#### Exercise E:

Now, sort the list from the previous step by distance and get the names of the $3$ cities closest to the largest city, not including the largest city itself. In other words, if you sort the list from the exercise above by ascending `distance_to_largest`, the $3$ cities closest to the largest city will be in the slice `[1:4]`. The city at index $0$ is the city with the largest population, and should have a distance of $0$ from itself.

The answer should be a list with city names.

Something like:

```python
closest_3 = [ "pittsburgh", "liverpool", "oakland" ]
```

We saw how to sort lists of objects in lecture.
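
As a reminder of how sorting by a key combines with slicing, here is a minimal sketch on made-up data. The list `toy` and its values are hypothetical; only the key name `distance_to_largest` comes from the exercise.

```python
# Toy data for illustration only; distances are made up
toy = [
    {"name": "ref", "distance_to_largest": 0.0},
    {"name": "far", "distance_to_largest": 9.5},
    {"name": "near", "distance_to_largest": 1.2},
    {"name": "mid", "distance_to_largest": 4.0},
    {"name": "close", "distance_to_largest": 2.5},
]

# sorted() returns a new list in ascending key order; the original list is unchanged
ordered = sorted(toy, key=lambda c: c["distance_to_largest"])

# Index 0 is the reference itself (distance 0), so the 3 nearest are at indices 1, 2 and 3
nearest_names = [c["name"] for c in ordered[1:4]]
print(nearest_names)  # ['near', 'close', 'mid']
```

A `lambda` is used here for brevity. A named function passed as `key` works the same way.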

In [19]:
# Work on E here

# Sort the list and get the names of the 3 cities closest to the largest city

# Sort key: takes one city dict and returns its distance to the largest city
def sortLargestDistance(city):
    return city["distance_to_largest"]

cities_by_distance = sorted(info_cities, key=sortLargestDistance)

# print("cities by distance", cities_by_distance)



closest_3 = [ item["name"] for item in cities_by_distance[1:4] ]

print(closest_3)


['Songjiang', 'Zhujiajiao', 'Kunshan']


In [15]:
# Test E
Tests.test("E", closest_3)

E: All tests passed 🎉🎉🎉
