
# Parsing JSON Data - An Example with OpenStreetMap (OSM)



In this notebook, we will create a dataset of cafés in Toronto from OSM.


## Use OpenStreetMap APIs to Create a Dataset

An application programming interface (API) is a set of rules and protocols that allows different software programs to exchange data with each other.

* e.g., **Our Program** $\Longleftrightarrow$ **OSM**

The following OSM API will be used:

- **Overpass API**  
  - Retrieves detailed information about the search results within the specified area (based on the type and ID)  

Overpass is an API designed to query OpenStreetMap data.

We will use Overpass QL (the Overpass query language) for the queries.

Main idea:
* Restrict the search to an area
* Ask for OSM elements within that area
  * e.g., businesses, parks, roads, schools



Let us define the query for searching all cafés in Toronto.

In [19]:
# Node: a single point on the map
# Way: an ordered list of nodes (a line or polygon)
# Relation: a group of nodes, ways, or other relations
# We search all three element types for the tag amenity=cafe

query = """
[out:json][timeout:25];
area(id:3600324211)->.a;
 (
    node["amenity"="cafe"](area.a);
    way["amenity"="cafe"](area.a);
    relation["amenity"="cafe"](area.a);
  );
out center tags;
"""

print(query)


[out:json][timeout:25];
area(id:3600324211)->.a;
 (
    node["amenity"="cafe"](area.a);
    way["amenity"="cafe"](area.a);
    relation["amenity"="cafe"](area.a);
  );
out center tags;



In the above query, we specified that:
* The output data format is json
* The search is restricted to the City of Toronto, which has an area id of 3600324211

* The searched tag is **amenity**, which is for places like cafe, school, parking, toilets
  * use the tag **leisure** for parks, green spaces, playgrounds, sports fields ...
  * use the tag **landuse** for wooded areas (e.g., forests)
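
The same query shape can be reused for other tags. Below is a minimal sketch, assuming a hypothetical helper named `build_query` (it is not part of Overpass or of this notebook) that fills in the tag key, the tag value, and the area id:

```python
def build_query(key, value, area_id=3600324211, timeout=25):
    """Return an Overpass QL query for elements tagged key=value in one area.

    `build_query` is an illustrative helper; the default area id is the
    Toronto id used in the query above.
    """
    return f"""
[out:json][timeout:{timeout}];
area(id:{area_id})->.a;
(
  node["{key}"="{value}"](area.a);
  way["{key}"="{value}"](area.a);
  relation["{key}"="{value}"](area.a);
);
out center tags;
"""


# Example: search for parks instead of cafés
park_query = build_query("leisure", "park")
print(park_query)
```

Only the tag key and value change; the area filter and the output statement stay the same.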

**Questions**

* How can we find the area id of a place to use in OVERPASS?

* How can we find the right tag to use for a type of place (e.g., "dance studio")?
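
As background for the first question (not something shown in this notebook): Overpass derives the area id of a place from its OSM relation id by adding 3600000000. The relation id can be looked up by searching for the place on openstreetmap.org. The sketch below uses a hypothetical helper name, `relation_to_area_id`. For the second question, the OpenStreetMap wiki's Map Features page documents the commonly used tags.

```python
# Offset that Overpass adds to a relation id to form the matching area id
AREA_OFFSET_RELATION = 3_600_000_000


def relation_to_area_id(relation_id):
    """Convert an OSM relation id to the Overpass area id (illustrative helper)."""
    return AREA_OFFSET_RELATION + relation_id


# 324211 is the relation id implied by the Toronto area id used in the query
toronto_area_id = relation_to_area_id(324211)
print(toronto_area_id)
```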

To perform the query, we will use the `requests` library, which requests data from and sends data to a server.

* Use `requests.get` to request data
* Use `requests.post` to send data

Note that the same request/response pattern applies to APIs (as web services), web scraping, and even web browsing.

We will use `requests.post` to send the query to the server, which needs:

* The server URL
* The data to be sent; the query in this case
* Time out period (in seconds)

In [20]:
# Import the library

import requests

# The URL

OVERPASS = "https://overpass-api.de/api/interpreter"

# Send the query to the server and assign the response to the variable r.
# The request gives up if the server does not respond within 30 seconds.

r = requests.post(OVERPASS, data={'data': query}, timeout=30)

An HTTP request may succeed or fail. The outcome is encoded in `r.status_code`, where `r` is the response returned by the request.


* 200 → OK (success)

* 404 → Not Found

* 403 → Forbidden (e.g., bad headers or blocked)

* 500 → Internal server error
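
The meaning of a status code can also be looked up in code. The sketch below uses `HTTPStatus` from Python's standard library; the helper name `describe_status` is made up for illustration:

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short description such as '404 Not Found' for a status code."""
    status = HTTPStatus(code)
    return f"{status.value} {status.phrase}"


# Describe the four codes listed above
for code in (200, 403, 404, 500):
    print(describe_status(code))
```

With `requests`, calling `r.raise_for_status()` is another option: it raises an exception for 4xx and 5xx responses and does nothing otherwise.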

In [21]:
r
#200 is success

<Response [200]>

What's contained in the result `r`?
* `.headers` - metadata about the response (e.g., content type, server)
* `.text` - the content of the response as a string
* `.json()` - returns the content parsed from JSON, if JSON was requested

In [22]:
if r.status_code == 200:
    data = r.json()
    # You can process the data here
    print("Data received successfully.")
else:
    print(f"Error: Request failed with status code {r.status_code}")
    print(r.text) # Print the response text to see what was returned

Data received successfully.



The "elements" field of `data` contains all elements that match the search.

In [23]:
# Extract the list of elements and assign it to the variable `places`.
# data.get('elements') returns None if the key is missing (no error);
# data['elements'] would raise a KeyError instead.
places = data.get('elements')
# print(data.get('location'))  # e.g., this key does not exist, so this prints None
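
The difference between the two access styles can be seen on a small made-up dictionary (`sample` is a stand-in for `data`, not real Overpass output):

```python
# Made-up stand-in for `data`
sample = {"version": 0.6, "elements": []}

print(sample.get("elements"))      # key exists: returns the stored list
print(sample.get("location"))      # key missing: returns None, no error
print(sample.get("location", []))  # key missing: returns the supplied default

try:
    sample["location"]             # key missing: raises KeyError
except KeyError as err:
    print(f"KeyError: {err}")
```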



In [24]:
places[-1]

{'type': 'way',
 'id': 1229832500,
 'center': {'lat': 43.7773776, 'lon': -79.3445933},
 'tags': {'addr:city': 'North York',
  'addr:housenumber': '1800',
  'addr:postcode': 'M2J 5A7',
  'addr:province': 'ON',
  'addr:street': 'Sheppard Avenue East',
  'addr:unit': '2021',
  'amenity': 'cafe',
  'brand': 'machi machi',
  'brand:en': 'machi machi',
  'brand:wikidata': 'Q114828290',
  'brand:zh': '麥吉',
  'cuisine': 'bubble_tea',
  'indoor': 'room',
  'level': '1',
  'name': 'Machi machi',
  'opening_hours': 'Mo-Sa 09:30-21:30; Su 10:00-20:00',
  'ref': '2021',
  'shop': 'tea',
  'takeaway': 'yes'}}

Let's sample the first and last places in the list (`places[0]` and `places[-1]`).

Side-by-side comparison of the two sample elements:


<img src="https://github.com/zhouy185/BUS_O712/blob/main/images/place0.png?raw=true"
width="300" border="1">
<img src="https://github.com/zhouy185/BUS_O712/blob/main/images/place-1.png?raw=true"
width="400" border="1">




We note that there are three element types:
1. **Node**: A single point (latitude/longitude)
    * Example: A tree 🌳 at a park, a business marked as a single point.

2. **Way**: An ordered list of nodes (a line or polygon)
    * Example: A road, the outline of a building

3. **Relation**: A group of nodes/ways/other relations with extra rules
    * Example: A bus route (collection of roads + stops in order), a city boundary (multiple polygon pieces stitched together).


A café can be a "node" or a "way"!
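
To check how many elements of each type a result contains, one option is `collections.Counter`. The sketch below runs on a small made-up list (`sample_places`) with the same shape as `places`; the ids and coordinates are invented:

```python
from collections import Counter

# Made-up elements with the same shape as the Overpass results
sample_places = [
    {"type": "node", "id": 1, "lat": 43.65, "lon": -79.38},
    {"type": "way", "id": 2, "center": {"lat": 43.78, "lon": -79.34}},
    {"type": "node", "id": 3, "lat": 43.70, "lon": -79.40},
]

# Count how often each element type occurs
type_counts = Counter(place["type"] for place in sample_places)
print(type_counts)
```

Replacing `sample_places` with `places` would give the counts for the actual query result.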

We may construct a DataFrame from `places`.

But,
* The elements of the type "node" directly have 'lat' and 'lon' as keys
* The elements of the type "way" do not; instead:
  * they have a key named 'center', whose value contains 'lat' and 'lon'

We can use `pd.DataFrame(list_of_dict)` to directly convert a list of `dict`s to a DataFrame.

**Observation**
* For a "node" type, **lat** and **lon** are directly available fields in the dict

* For a "way" type, **lat** and **lon** are **not** directly available but are nested under the **center** field -- they are the coordinates of the centroid of the polygon shape!
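
One way to handle both cases in a single place is a small helper. This is a sketch, with `get_coords` as a hypothetical name; the node below is made up, and the way reuses the id and center of the sample element shown earlier:

```python
def get_coords(element):
    """Return (lat, lon) for an Overpass element, or (None, None) if absent."""
    # Nodes carry their coordinates directly
    if "lat" in element and "lon" in element:
        return element["lat"], element["lon"]
    # Other types: coordinates are nested under 'center' (added by `out center`)
    center = element.get("center") or {}
    return center.get("lat"), center.get("lon")


node = {"type": "node", "id": 1, "lat": 43.65, "lon": -79.38}  # made-up node
way = {
    "type": "way",
    "id": 1229832500,
    "center": {"lat": 43.7773776, "lon": -79.3445933},
}

print(get_coords(node))
print(get_coords(way))
```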

Let's address this issue by first creating an empty DataFrame with the columns of interest
* we will include `['type','id','name','lat','lon','cuisine','brand']` as the columns

An easy way is to let Gemini perform the task.

> Prompt: Create a DataFrame df_cafes based on the variable places. The columns are 'type','id','name','lat','lon','cuisine','brand'.

In [25]:
import pandas as pd

cafe_list = []
for place in places:
  place_type = place['type']
  place_id = place['id']

  # Fall back to an empty dict so that .get() works even if 'tags' is missing
  tags = place.get('tags', {})
  name = tags.get('name')
  cuisine = tags.get('cuisine')
  brand = tags.get('brand')

  if place_type == 'node':
    # Nodes carry their coordinates directly
    lat = place['lat']
    lon = place['lon']
  else:
    # Ways (and relations, if any) have their coordinates nested
    # under 'center', which `out center` adds to the output
    center = place.get('center', {})
    lat = center.get('lat')
    lon = center.get('lon')

  # One row per place, with the columns of interest
  cafe_list.append({
      'type': place_type,
      'id': place_id,
      'name': name,
      'lat': lat,
      'lon': lon,
      'cuisine': cuisine,
      'brand': brand
  })

df_cafes = pd.DataFrame(cafe_list)

In [26]:
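# Save the dataset, then preview the first rows
# (a notebook cell only displays the value of its last expression)
df_cafes.to_csv('cafes.csv', index=False)
df_cafes.head()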