<div align="center"><img src="../images/LKYCIC_Header.jpg"></div>

**Table of contents**<a id='toc0_'></a>    
- [3-01: Geocoding](#toc1_)    
  - [Geocoding](#toc1_1_)    
    - [Reverse geocoding](#toc1_1_1_)    
    - [Use Cases of Geocoding](#toc1_1_2_)    
  - [Data](#toc1_2_)    
  - [Nominatim](#toc1_3_)    
    - [Additional Information on Package Use in Python](#toc1_3_1_)    
      - [Returned data from API call](#toc1_3_1_1_)    
  - [Batch Geocoding (For Loop)](#toc1_4_)    
    - [Rounds of Modification](#toc1_4_1_)    
    - [Utilising Regular Expression to rearrange within string column](#toc1_4_2_)    
      - [Example string](#toc1_4_2_1_)    
  - [Optional steps: Caching the queried output](#toc1_5_)    
  - [Reference](#toc1_6_)    
  - [Next Section](#toc1_7_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc1_'></a>[3-01: Geocoding](#toc0_)

## <a id='toc1_1_'></a>[Geocoding](#toc0_)

Geocoding is the process of converting an address into spatial data by associating it with its geographic coordinates.

| ![geocoding_001](../images/geocoding_001.jpg)                  |
| ------------------------------------------------------------ |
| [Geocoding Service - Google for Developers](https://developers.google.com/maps/documentation/javascript/examples/geocoding-simple) |

### <a id='toc1_1_1_'></a>[Reverse geocoding](#toc0_)

Occasionally, we need the opposite transformation, from coordinates to an address. This is called reverse geocoding.

| ![geocoding_002](../images/geocoding_002.png)                  |
| ------------------------------------------------------------ |
| [Reverse Geocoding - Google for Developers](https://developers.google.com/maps/documentation/javascript/examples/geocoding-reverse) |

### <a id='toc1_1_2_'></a>[Use Cases of Geocoding](#toc0_)

Geocoding is commonly used to translate vague address information into accurate geospatial data.  

- For instance, if **postcodes** are collected from a survey and you need to analyse the survey responses, geocoding services become invaluable. 

- Map services such as Google Maps also rely on geocoding behind the scenes to convert **user-inputted addresses** into latitude and longitude coordinates.

In general, geocoding serves as a bridge between **text-based spatial information and numerical coordinate pairs**.  

**Common Geocoding tools:**

- [Google Maps Platform](https://developers.google.com/maps): Provides map APIs, including geocoding and reverse geocoding
- [Nominatim](https://nominatim.openstreetmap.org): An open-source geocoding service
- [geopy](https://github.com/geopy/geopy): A Python library for accessing geocoding services
- [Mapbox](https://mapbox.com): An alternative to Google Maps
- [Esri ArcGIS Platform or ArcGIS Pro](https://esri.com): Designed for Esri users, with usage quotas

**For Geocoding in Singapore:**

| ![onemap api](../images/onemap_services.png)                  |
| ------------------------------------------------------------ |
| OneMap Available Services |

Steps to use the OneMap geocoding services:

1. Register on the [OneMap API website](https://www.onemap.gov.sg/apidocs/)

| ![onemap api](../images/OneMap_apikey.png)                  |
| ------------------------------------------------------------ |
| The generated API key after registration |

In [None]:
import requests
from pprint import pprint

In [None]:
# Read the stored api key from the file other_files/donttrack_mapapi second line

with open("../other_files/donttrack_mapapi") as f:
    api_key = f.readlines()[1].strip()

# print(api_key)

In [None]:
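searchVal = "8 Somapah Rd, Singapore 487372"

url = "https://www.onemap.gov.sg/api/common/elastic/search"
# Pass the query as parameters so that requests URL-encodes the spaces and commas
params = {"searchVal": searchVal, "returnGeom": "Y", "getAddrDetails": "Y", "pageNum": 1}
headers = {"Authorization": api_key}

response = requests.get(url, params=params, headers=headers, timeout=10)
response.raise_for_status()  # stop early on an HTTP error

response_dict = response.json()

pprint(response_dict)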

{
  "found": 6,
  "totalNumPages": 1,
  "pageNum": 1,
  "results": [
    {
      "SEARCHVAL": "SINGAPORE UNIVERSITY OF TECHNOLOGY AND DESIGN",
      "BLK_NO": "8",
      "ROAD_NAME": "SOMAPAH ROAD",
      "BUILDING": "SINGAPORE UNIVERSITY OF TECHNOLOGY AND DESIGN",
      "ADDRESS": "8 SOMAPAH ROAD SINGAPORE UNIVERSITY OF TECHNOLOGY AND DESIGN SINGAPORE 487372",
      "POSTAL": "487372",
      "X": "42416.5920993574",
      "Y": "35815.2679505069",
      "LATITUDE": "1.3401716369901",
      "LONGITUDE": "103.962860116421"
    },
    {
      "SEARCHVAL": "SINGAPORE UNIVERSITY OF TECHNOLOGY AND DESIGN (BUILDING 1)",
      "BLK_NO": "8",
      "ROAD_NAME": "SOMAPAH ROAD",
      "BUILDING": "SINGAPORE UNIVERSITY OF TECHNOLOGY AND DESIGN (BUILDING 1)",
      "ADDRESS": "8 SOMAPAH ROAD SINGAPORE UNIVERSITY OF TECHNOLOGY AND DESIGN (BUILDING 1) SINGAPORE 487372",
      "POSTAL": "487372",
      "X": "42372.2915454404",
      "Y": "35819.9252548926",
      "LATITUDE": "1.34021377704081",
      

In [None]:
import pandas as pd
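
The OneMap response printed earlier keeps its matches in the `results` list and stores coordinates as strings. As a minimal sketch, the snippet below pulls the first match out of a response-shaped dictionary and converts its coordinates to floats. `sample_response` is a hand-copied excerpt of that output rather than a live API call, and `first_match_coordinates` is a hypothetical helper name.

```python
# Hand-copied excerpt of the OneMap search output shown above (not a live API call)
sample_response = {
    "found": 6,
    "totalNumPages": 1,
    "pageNum": 1,
    "results": [
        {
            "SEARCHVAL": "SINGAPORE UNIVERSITY OF TECHNOLOGY AND DESIGN",
            "POSTAL": "487372",
            "LATITUDE": "1.3401716369901",
            "LONGITUDE": "103.962860116421",
        }
    ],
}


def first_match_coordinates(response_dict):
    """Hypothetical helper: return (lat, lon) of the first result, or None if there is none."""
    results = response_dict.get("results", [])
    if not results:
        return None
    first = results[0]
    # The coordinates arrive as strings, so convert them before any numeric use
    return float(first["LATITUDE"]), float(first["LONGITUDE"])


print(first_match_coordinates(sample_response))
print(first_match_coordinates({"found": 0, "results": []}))
```

Checking for an empty `results` list first avoids an `IndexError` when a search finds nothing.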

## <a id='toc1_2_'></a>[Data](#toc0_)

Dengue clusters were collected from the NEA website:  

[https://www.nea.gov.sg/dengue-zika/dengue/dengue-clusters](https://www.nea.gov.sg/dengue-zika/dengue/dengue-clusters)  

Instructions on how to collect tabular data from the NEA webpage can be found in the Jupyter Notebook file: [NEA Dengue Scraper](./extra_practices/X-02_nea_dengue_scraper.ipynb).  

While web scraping is not covered in this bootcamp, you may find this resource useful.  

The path to the file is <u>../data/raw/part_iii/dengue_clusters_with_subtables_17_Jan_2025.csv</u>

In [None]:
project_path = '../data/raw/part_iii/'
df = pd.read_csv(project_path + 'dengue_clusters_with_subtables_17_Jan_2025.csv')

In [None]:
df.head()

## <a id='toc1_3_'></a>[Nominatim](#toc0_)

`Nominatim` is a search engine for OpenStreetMap (OSM) data.

- **Geocoding**: Converts addresses or place names into geographic coordinates (latitude and longitude).

- **Reverse Geocoding**: Converts geographic coordinates into human-readable addresses.

To use the Nominatim service, we use the Python library `geopy`. 

It enables easy integration of geocoding functionality, allowing us to convert addresses into coordinates and vice versa with simple Python code.

### <a id='toc1_3_1_'></a>[Additional Information on Package Use in Python](#toc0_)

**As of 6 May 2024, more than 530,000 Python packages are available.**  
(Source: [Wikipedia](https://en.wikipedia.org/wiki/Python_Package_Index#:~:text=As%20of%206%20May%202024,modules%20from%20a%20compiled%20language.))  

If we also count the Python code scattered across platforms like GitHub or GitLab, it is likely that around 90% of the functionality you need has already been written by someone.  

<div align="center">  
    <img src="../images/programmer_tailor_pixabay.png" width="500px">  
    <br>  
    <p><b>Programmer? No, Tailor!</b> Source: Pixabay</p>  
</div>  

---

There is probably no single package available for every need, and some packages may overlap in functionality. Selecting a well-maintained package can reduce potential errors.  

Most Python package source code is hosted on GitHub. [GitHub Search](https://github.com/search?q=bibliographic+analysis&type=repositories) is a useful tool to identify existing code or packages you can reuse.  

For example, if a package has not been updated for years, it is better to choose one with more stars and regular maintenance:  

<div align="center">  
    <img src="../images/github_search.png" width="500px">  
    <br>  
    <p>You can search through a sea of published code using <b><a href="https://github.com/search?q=bibliographic+analysis&type=repositories">GitHub Search</a>.</b> </p>  
</div>

In [None]:
# Uncomment to install geopy
# %pip install geopy

In [None]:
from geopy.geocoders import Nominatim
from pprint import pprint

1. Keyboard input:

Let's put `SUTD` to the test:

In [None]:
# Instantiates a new Nominatim client with a user agent string. The user agent string is required by Nominatim to identify your application.
app = Nominatim(user_agent="tutorial")

# Prompts the user to enter a location (address or place name)
your_loc = input("Enter your location: ")

# Geocodes the entered location and retrieves the raw location 
location = app.geocode(your_loc).raw 

# Pretty-prints the raw location data for better readability.
pprint(location)

```json
{'addresstype': 'peak',
 'boundingbox': ['32.2566178', '32.2567178', '77.7040811', '77.7041811'],
 'class': 'natural',
 'display_name': 'SUTD, Lahul, Lahaul and Spiti, Himachal Pradesh, India',
 'importance': 0.16000999999999999,
 'lat': '32.2566678',
 'licence': 'Data © OpenStreetMap contributors, ODbL 1.0. '
            'http://osm.org/copyright',
 'lon': '77.7041311',
 'name': 'SUTD',
 'osm_id': 8601419457,
 'osm_type': 'node',
 'place_id': 197204667,
 'place_rank': 18,
 'type': 'peak'}
 ```

You may notice that the returned location for SUTD is a place in India.  

This is because there is a high likelihood of **duplicate place names** when using **inaccurate or incomplete information** for geocoding.  

Let us provide more accurate supplemental details, including the district and country information this time: 

In [None]:
# Use a more specific query that includes the district and country
your_loc = "SUTD, Tampines, Singapore"

# Geocodes the entered location and retrieves the raw location 
location = app.geocode(your_loc).raw 

# Pretty-prints the raw location data for better readability.
pprint(location)

#### <a id='toc1_3_1_1_'></a>[Returned data from API call](#toc0_)

APIs normally return data in `JSON` format. 

A JSON object has a structure very similar to a Python dictionary. Therefore, in Python, a parsed JSON object is read as a `dict`. 

In [None]:
type(location)

Therefore, we can access the information using the following syntax:

In [None]:
location['osm_id']

In [None]:
lat, lon = float(location['lat']), float(location['lon'])
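
To see where the dictionary comes from, the sketch below parses a JSON string with Python's built-in `json` module. The `raw_json` text is a shortened, hand-typed copy of the first SUTD result shown earlier, used here only for illustration, and the variable names are arbitrary.

```python
import json

# Shortened, hand-typed copy of the Nominatim result shown earlier
raw_json = '{"name": "SUTD", "lat": "32.2566678", "lon": "77.7041311", "osm_id": 8601419457}'

record = json.loads(raw_json)  # a JSON object becomes a Python dict
print(type(record))
print(record["name"])

# The coordinates are stored as strings; convert them before doing arithmetic
peak_lat, peak_lon = float(record["lat"]), float(record["lon"])
print(peak_lat, peak_lon)

# A JSON array, by contrast, is parsed into a Python list
print(type(json.loads("[1, 2, 3]")))
```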

## <a id='toc1_4_'></a>[Batch Geocoding (For Loop)](#toc0_)

Streets with the same name may exist in other countries. 

To obtain coordinates that are as accurate as possible, we keep <u>each address as detailed as possible</u> and append "Singapore" to it.

In [None]:
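for index, row in df.iterrows():
    address = row['Location'] + ', Singapore'
    print(address)
    try:
        # geocode() returns None when nothing matches, so .raw raises and we land in except
        location = app.geocode(address).raw
        # Nominatim returns coordinates as strings; store them as floats
        df.at[index, 'Latitude'] = float(location['lat'])
        df.at[index, 'Longitude'] = float(location['lon'])
    except Exception:
        print('!!!Error!!!: ', address)
    print("---------------------------------------")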

In [None]:
# Double quotes outside, single quotes inside: nesting the same quote fails before Python 3.12
print(f"There are {df['Latitude'].notnull().sum()} geocoded addresses")
print(f"There are {df['Latitude'].isnull().sum()} ungeocoded addresses")

In [None]:
df[df['Latitude'].isnull()]
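
One reason rows end up in the list above is that `app.geocode()` returns `None` when nothing matches, so reading `.raw` from the result raises an `AttributeError` that the `except` branch reports as an error. The public Nominatim server also asks clients to send no more than one request per second. Under those assumptions, the sketch below wraps any geocoding function so that a miss comes back as `(None, None)` and consecutive calls are spaced out. `geocode_or_none` and `fake_geocode` are hypothetical names; `fake_geocode` stands in for `app.geocode` and returns rounded placeholder coordinates so the example runs offline.

```python
import time
from types import SimpleNamespace


def geocode_or_none(geocode, address, delay_seconds=1.0):
    """Call geocode(address); return (lat, lon) as floats, or (None, None) on a miss."""
    time.sleep(delay_seconds)  # pause so that requests are spaced out
    location = geocode(address)
    if location is None:  # geopy signals "no match" with None
        return None, None
    return float(location.latitude), float(location.longitude)


def fake_geocode(address):
    """Offline stand-in for app.geocode (hypothetical, placeholder coordinates)."""
    known = {"SUTD, Tampines, Singapore": SimpleNamespace(latitude=1.34, longitude=103.96)}
    return known.get(address)


# delay_seconds=0 only because no real server is contacted in this demo
print(geocode_or_none(fake_geocode, "SUTD, Tampines, Singapore", delay_seconds=0))
print(geocode_or_none(fake_geocode, "Nowhere Street, Singapore", delay_seconds=0))
```

With the real client, the call would look like `geocode_or_none(app.geocode, address)`. geopy also ships a `RateLimiter` class in `geopy.extra.rate_limiter` that can handle the pacing for you.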

### <a id='toc1_4_1_'></a>[Data Cleaning: Rounds of Modification](#toc0_)

Looking at the addresses that failed to geocode, most of them follow the format `<Road Name>(<Blk + Building Number>)`.

However, when we search for such an address in Google Maps, the commonly written form is `<Building Number> <Road Name>`.

<div align="center">  
    <img src="../images/geocoded_failure.png" width="500px">  
    <br>  
    <p>Addresses that failed geocoding, as shown in Google Maps</p>  
</div>

Therefore, we need to reorganise the address column into this format:

### <a id='toc1_4_2_'></a>[Utilising Regular Expression to rearrange within string column](#toc0_)

A **regular expression** (often abbreviated as **regex** or **regexp**) is a sequence of characters that defines a search pattern. Common uses include:

1. Pattern Matching

    Identify text patterns in strings (e.g., finding all email addresses in a document).

2. Validation

    Check if a string matches a specific format (e.g., validating phone numbers or postal codes).

3. Substitution

    Replace parts of a string based on a pattern.

4. Extraction

    Extract specific parts of a string, like dates or identifiers.

| **Pattern**        | **Description**                             | **Example**                     |
|--------------------|---------------------------------------------|---------------------------------|
| `.`                | Matches any single character except a newline | `a.c` matches `abc`, `a1c`     |
| `\d`               | Matches a digit (0–9)                       | `\d+` matches `123`, `42`      |
| `\w`               | Matches a word character (letters, digits, and underscores) | `\w+` matches `hello`, `_abc` |
| `\s`               | Matches a whitespace character              | `a\s+b` matches `a b`, `a   b` |
| `^`                | Matches the start of a string               | `^Hello` matches `Hello world` if it starts with `Hello` |
| `$`                | Matches the end of a string                 | `world$` matches `Hello world` if it ends with `world` |
| `[abc]`            | Matches any character in the brackets        | `[aeiou]` matches vowels       |
| `[^abc]`           | Matches any character not in the brackets    | `[^aeiou]` matches any character that is not a lowercase vowel, such as `b` or `7` |
| `(a\|b)`           | Matches either `a` or `b`                   | `(cat\|dog)` matches `cat` or `dog` |
| `*`                | Matches 0 or more of the preceding character | `ab*` matches `a`, `ab`, `abb` |
| `+`                | Matches 1 or more of the preceding character | `ab+` matches `ab`, `abb`      |
| `?`                | Matches 0 or 1 of the preceding character    | `ab?` matches `a`, `ab`        |
| `{n,m}`            | Matches between `n` and `m` repetitions      | `a{2,4}` matches `aa`, `aaa`, `aaaa` |

*Note:* You don't have to remember the symbols and patterns. You can ask AI bots to write the regular expression for you.
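
As a minimal sketch, the four uses listed above map onto functions in Python's built-in `re` module as follows. The sample sentence is made up for illustration, combining the block address and phone number used elsewhere in this notebook.

```python
import re

text = "Blk 537A Jurong West Avenue 1. Call 123-456-7890."

# 1. Pattern matching: find every run of digits
print(re.findall(r"\d+", text))

# 2. Validation: does the whole string consist of exactly six digits?
print(bool(re.fullmatch(r"\d{6}", "487372")))
print(bool(re.fullmatch(r"\d{6}", "48737A")))

# 3. Substitution: mask the phone number
print(re.sub(r"\d{3}-\d{3}-\d{4}", "XXX-XXX-XXXX", text))

# 4. Extraction: capture the block number that follows "Blk "
match = re.search(r"Blk (\d+\w*)", text)
print(match.group(1))
```

`re.fullmatch` is used for validation because it only succeeds when the entire string fits the pattern, whereas `re.search` accepts a match anywhere inside the string.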

In [None]:
import re

#### <a id='toc1_4_2_1_'></a>[Example string](#toc0_)

In [None]:
text = "Contact us at support@example.com or call 123-456-7890."

# Find an email address
email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
email = re.findall(email_pattern, text)
print("Email found:", email)

# Find a phone number
phone_pattern = r'\d{3}-\d{3}-\d{4}'
phone = re.findall(phone_pattern, text)
print("Phone number found:", phone)

In this case, we want to extract the building number contained in `(Blk <Building Number>)`.

For example, we want to turn `Jurong West Avenue 1(Blk 537A)` into `537A Jurong West Avenue 1`.

We can describe the task to ChatGPT (or another chatbot) and ask for the code:

<div align="center">  
    <img src="../images/ai_prompted_codesolution.png" width="500px">  
    <br>  
    <p>Ask ChatGPT to write the cleaning code.</p>  
</div>

In [None]:
example = "Jurong West Avenue 1(Blk 537A)"

# Use regular expression to extract the block number
match = re.search(r'\(Blk (\d+\w*)\)', example)
if match:
    block_number = match.group(1)  # Extracted block number
    # Remove the "(Blk 537A)" part and prepend the block number
    transformed = f"{block_number} {example.split('(')[0].strip()}"
    print(transformed)

To be able to apply it on the whole column, we need to <u>define it as a function</u>:

In [None]:
def transform_address(address):
    # Capture the digits and any trailing letters of the block number
    match = re.search(r'\(Blk (\d+\w*)\)', address)
    if match:
        block_number = match.group(1)
        return f"{block_number} {address.split('(')[0].strip()}"
    return address  # Return unchanged if no match is found

In [None]:
transform_address(example)
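
The same rearrangement can also be written as a single substitution with backreferences. The sketch below is a hypothetical variant (`transform_address_sub`), not a replacement for the function above; unlike the original, it assumes the `(Blk ...)` part sits at the very end of the string. Apart from the first sample, the addresses are made up to show the edge cases.

```python
import re


def transform_address_sub(address):
    """Hypothetical variant of transform_address using one substitution."""
    # Group 1 is the road name; group 2 is the block number inside "(Blk ...)" at the end
    return re.sub(r"^(.*?)\s*\(Blk (\d+\w*)\)$", r"\2 \1", address)


samples = [
    "Jurong West Avenue 1(Blk 537A)",  # the example used above
    "Tampines Street 11 (Blk 123)",    # made-up address with a space before the bracket
    "8 Somapah Road",                  # no "(Blk ...)" part, so it stays unchanged
]
for sample in samples:
    print(transform_address_sub(sample))
```

Because `re.sub` returns the input untouched when the pattern does not match, no separate `if match:` branch is needed.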

In [None]:
for index, row in df.iterrows():
    try:
        df.at[index, 'address_changed'] = transform_address(row['Location'])
    except:
        df.at[index, 'address_changed'] = row['Location']

In [None]:
for index, row in df.iterrows():
    # Skip if already geocoded
    if pd.isnull(row['Latitude']):
        address = row['address_changed'] + ', Singapore'
        print(address)
        try:
            location = app.geocode(address).raw
            # Nominatim returns coordinates as strings; store them as floats
            df.at[index, 'Latitude'] = float(location['lat'])
            df.at[index, 'Longitude'] = float(location['lon'])
        except Exception:
            print('!!!Error!!!: ', address)
        print("---------------------------------------")

In [None]:
print(f"There are {df['Latitude'].notnull().sum()} geocoded addresses")
print(f"There are {df['Latitude'].isnull().sum()} ungeocoded addresses")

#### Google API for Challenging Cases

For the addresses that remain, you can manually query the [Google Geocoding Service](https://developers.google.com/maps/documentation/javascript/examples/geocoding-simple) or use another geocoding API, such as the Google Maps API through geopy's [`GoogleV3` geocoder](https://geopy.readthedocs.io/en/stable/index.html?highlight=google#googlev3).

*Note: The Google Maps API incurs usage fees.*

<div align="center">  
    <img src="../images/google_geocoding_api.png" width="700px">  
    <br>  
    <p>Geocoding pricing for the Google Maps API.</p>  
</div>

The overall strategy is:

**Maintaining data quality while minimising the usage of costly APIs wherever possible.**

Save your Google Maps API key in a plain-text file, then run the following two cells:

In [None]:
# Read the API key from the first line of the file donttrack_googlemapapi
with open('../other_files/donttrack_googlemapapi') as f:
    Gmapapi_key = f.readline().strip()  # strip the trailing newline

In [None]:
from geopy import geocoders
g = geocoders.GoogleV3(api_key=Gmapapi_key)


for index, row in df.iterrows():
    # Skip if already geocoded
    if pd.isnull(row['Latitude']):
        address = row['Location'] + ', Singapore'
        print(address)
        try:
            location = g.geocode(address, timeout=10)
            pprint(location.raw)
            df.at[index, 'Latitude'] = location.latitude
            df.at[index, 'Longitude'] = location.longitude
        except:
            print('Error: ', address)

In [None]:
print(f"There are {df['Latitude'].notnull().sum()} geocoded addresses")
print(f"There are {df['Latitude'].isnull().sum()} ungeocoded addresses")

In [None]:
df[df.duplicated(subset=['Latitude', 'Longitude'], keep=False)]

In [None]:
# Find rows that share exactly the same Latitude and Longitude and clear their coordinates
duplicate_index = df[df.duplicated(subset=['Latitude', 'Longitude'], keep=False)].index
df.loc[duplicate_index, ['Latitude', 'Longitude']] = None

In [None]:
print(f"There are {df['Latitude'].notnull().sum()} geocoded addresses")
print(f"There are {df['Latitude'].isnull().sum()} ungeocoded addresses")

In [None]:
for index, row in df.iterrows():
    # Skip if already geocoded
    if pd.isnull(row['Latitude']):
        address = row['Location'] + ', Singapore'
        print(address)
        try:
            location = g.geocode(address, timeout=10)
            pprint(location.raw)
            df.at[index, 'Latitude'] = location.latitude
            df.at[index, 'Longitude'] = location.longitude
        except:
            print('Error: ', address)

In [None]:
print(f"There are {df['Latitude'].notnull().sum()} geocoded addresses")
print(f"There are {df['Latitude'].isnull().sum()} ungeocoded addresses")

In [None]:
df[df.duplicated(subset=['Latitude', 'Longitude'], keep=False)]

In [None]:
df

`Task:` Transform the DataFrame into a GeoDataFrame

In [None]:
import geopandas as gpd

In [None]:
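# Longitude is x and latitude is y; EPSG:4326 declares the coordinates as WGS84 latitude/longitude
gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df.Longitude, df.Latitude), crs="EPSG:4326")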

In [None]:
gdf.plot()

In [None]:
project_path

In [None]:
gdf.to_file('../data/processed/part_iii/dengue_clusters_with_subtables_17_Jan_2025.geojson', driver='GeoJSON')
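
`points_from_xy` takes x (longitude) first and y (latitude) second, and GeoJSON stores positions in the same longitude-first order. As a small illustration using only the standard library, the sketch below builds one GeoJSON point feature by hand. The coordinates are the SUTD values from the OneMap output earlier, and `make_point_feature` is a hypothetical helper, not part of geopandas.

```python
import json


def make_point_feature(lat, lon, properties):
    """Build a GeoJSON Feature dict for one point (hypothetical helper)."""
    return {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [lon, lat]},  # longitude first
        "properties": properties,
    }


feature = make_point_feature(1.3401716369901, 103.962860116421, {"name": "SUTD"})
print(json.dumps(feature, indent=2))
```

Swapping the two values is a common mistake; for Singapore it would place the point far from the island, so a quick `gdf.plot()` is a useful sanity check.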

*Note:*

Data cleaning is a very exhausting step in data analysis.

It is crucial to **constantly check** your intermediate data and optimise its accuracy.

Once you have acquired clean data, you can import the GeoJSON file into other GUI-based GIS software.

<div align="center">
    <img src="../images/dengue_vis.png" width="500px">
</div>

## <a id='toc1_5_'></a>[Optional steps: Caching the queried output](#toc0_)

Saving queried or calculated outputs is a common way to avoid spending resources on repeated requests.

This is not mandatory knowledge in this bootcamp, so feel free to read the following part if you are interested.

In [None]:
import pickle
import os

In [None]:
df.head()

In [None]:
# Path to the pickle file
pickle_file_path = '../data/processed/part_iii/geocode_cache.pkl'

In [None]:
# Save a single row to the cache
def save_row_to_cache(row, file_path):
    # Load existing cache if available
    if os.path.exists(file_path):
        with open(file_path, 'rb') as file:
            cache = pickle.load(file)
    else:
        cache = {}
    
    # Add or update the cache with the new row
    cache[row['Location']] = {'lat': row['Latitude'], 'lon': row['Longitude']}
    
    # Save the updated cache
    with open(file_path, 'wb') as file:
        pickle.dump(cache, file)
    print(f"Saved row for address: {row['Location']}")

# Example usage: Save each row to the cache
for _, row in df.iterrows():
    save_row_to_cache(row, pickle_file_path)


The following code shows how you can read data back from a previously saved cache:

In [None]:
# Load the cache
def load_cache(file_path):
    if os.path.exists(file_path):
        with open(file_path, 'rb') as file:
            cache = pickle.load(file)
        print(f"Cache loaded from {file_path}")
        return cache
    else:
        print(f"No cache file found at {file_path}. Returning an empty dictionary.")
        return {}

# Load the cache and print it
cache = load_cache(pickle_file_path)
print("\nCached Data:")
for address, coords in cache.items():
    print(f"Address: {address}: Lat: {coords['lat']}, Lon: {coords['lon']}")

Differences between a pickle cache and a CSV file:

| **Aspect**         | **Pickle Cache**                      | **CSV File**                      |
|---------------------|---------------------------------------|------------------------------------|
| **Format**          | Binary, Python-specific              | Text, widely compatible           |
| **Readability**     | Not human-readable                   | Human-readable                    |
| **Interoperability**| Limited to Python                    | Supported across platforms/tools  |
| **Performance**     | Fast, ideal for Python objects       | Slower, ideal for tabular data    |
| **File Size**       | Can be larger due to metadata        | Compact for simple tabular data   |
| **Use Case**        | Temporary, quick storage of objects  | Sharing or long-term storage      |

Caching calculated outputs is a common practice in big data analysis.

For example, calculating the shortest path from Point A to Point B is computationally expensive.

You do not want to repeat that calculation every time you need the path.

Instead, you can cache the result the first time and reuse it later. For Python objects, a pickle cache is also faster to read than tabular data.
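
Applied to geocoding, the same idea means checking the cache before sending a request. The sketch below is one possible way to do that under stated assumptions: `read_cache`, `cached_geocode` and `fake_geocode` are hypothetical names, the fake geocoder returns placeholder coordinates so the example runs offline, and the cache is written to a temporary folder rather than the project path used above.

```python
import os
import pickle
import tempfile


def read_cache(file_path):
    """Return the cached dict, or an empty dict if the file does not exist yet."""
    if os.path.exists(file_path):
        with open(file_path, "rb") as file:
            return pickle.load(file)
    return {}


def cached_geocode(address, geocode, cache):
    """Look the address up in the cache first; call geocode only on a miss."""
    if address not in cache:
        cache[address] = geocode(address)
    return cache[address]


calls = []  # records which addresses actually reached the fake geocoder


def fake_geocode(address):
    """Offline stand-in for a real geocoder; returns placeholder coordinates."""
    calls.append(address)
    return {"lat": 1.34, "lon": 103.96}


cache_path = os.path.join(tempfile.mkdtemp(), "geocode_cache_demo.pkl")
cache = read_cache(cache_path)

cached_geocode("SUTD, Tampines, Singapore", fake_geocode, cache)  # miss: geocoder is called
cached_geocode("SUTD, Tampines, Singapore", fake_geocode, cache)  # hit: served from the cache

with open(cache_path, "wb") as file:  # write the cache to disk once, at the end
    pickle.dump(cache, file)

print(len(calls))
print(read_cache(cache_path))
```

Keeping the cache in memory and writing it once avoids reloading and rewriting the pickle file for every row.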

## <a id='toc1_6_'></a>[Reference](#toc0_)

1. [What is geocoding and how can it help sell products (geospatialworld.net)](https://www.geospatialworld.net/blogs/what-is-geocoding-and-how-can-it-help-sell-products/)
2. [Geocoding and postal codes, points to consider (opencagedata.com)](https://opencagedata.com/guides/how-to-think-about-postcodes-and-geocoding)

## <a id='toc1_7_'></a>[Next Section](#toc0_)

Go to [3-02: Data Visualization](./3-02_datavis.ipynb)