# Batch Geocoding with the U.S. Census Geocoder (Python)

In this notebook, we geocode a **large list of addresses** by sending repeated
requests to the U.S. Census Geocoder API.

We will learn to:
- Geocode many addresses reliably (batch workflow)
- Respect rate limits (polite usage)
- Track failed matches
- Save results to a CSV for later mapping/analysis

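**Note:** Although the title mentions the U.S. Census Geocoder, the code cells below send their requests to Nominatim (OpenStreetMap) through `geopy`, so the examples also cover non-U.S. places.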


In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
# importing necessary modules
import geopandas as gpd # used for handling geospatial data
import pandas as pd # used for creating and manipulating data in tabular format
import folium # used to create interactive maps
import requests # used to send API calls, e.g., to the U.S. Census Geocoder
import numpy as np
from geopy.geocoders import Nominatim
from time import sleep
from datetime import datetime

## Batch Geocoding Workflow

Batch geocoding means converting *many* addresses into coordinates.
Because APIs can fail or rate-limit you, a robust batch workflow should include:

1. Clean inputs
2. Rate limiting (pause between requests)
3. Error handling and retries
4. Logging match success/failure
5. Saving outputs frequently
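
Steps 2 and 3 can be sketched without contacting a real geocoder. The helper below, `call_with_retries`, is a hypothetical name (it is not part of geopy); it retries a callable with exponential backoff. The `sleep` argument is injectable so the demo runs instantly, and `flaky_geocode` is a stand-in that simulates two timeouts.

```python
import time

def call_with_retries(func, *args, max_attempts=3, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep, **kwargs):
    """Call func(*args, **kwargs), retrying on the given exception types.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between attempts and
    re-raises the last exception once max_attempts is exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args, **kwargs)
        except retry_on:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


# Demo with a fake geocoder that fails twice, then succeeds.
calls = {"n": 0}

def flaky_geocode(address):
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return (41.68, -86.25)

delays = []  # record the requested pauses instead of actually sleeping
result = call_with_retries(flaky_geocode, "South Bend, IN",
                           retry_on=(TimeoutError,), sleep=delays.append)
print(result, delays)
```

For simple cases, geopy's `RateLimiter` already offers `max_retries` and `error_wait_seconds`; a hand-written wrapper like this one is mainly useful when the wait should grow between attempts.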


In [3]:
from geopy.extra.rate_limiter import RateLimiter
import time

In [4]:
# Rate-limited geocode function: one request per second by default
from geopy.geocoders import Nominatim
from geopy.exc import GeocoderTimedOut, GeocoderServiceError

geolocator = Nominatim(user_agent="my-geocoder-app")
geocode_rate_limited = RateLimiter(geolocator.geocode, min_delay_seconds=1, error_wait_seconds=2.0)

# Simple in-memory cache and optional disk save/load
import os, pickle
CACHE_FILE = "geocode_cache.pkl"

try:
    with open(CACHE_FILE, "rb") as f:
        geocode_cache = pickle.load(f)
    print("Loaded geocode cache with", len(geocode_cache), "entries")
except FileNotFoundError:
    geocode_cache = {}
    print("Starting with empty geocode cache")

def lookup_place(place_str, country_bias=None, limit=1, exactly_one=True):
    """
    Lookup place_str with caching. Optionally bias queries by country (ISO code or name).
    Returns geopy.Location or None.
    """
    key = (place_str, country_bias)
    if key in geocode_cache:
        return geocode_cache[key]
    try:
        query = place_str
        if country_bias:
            # geopy's Nominatim.geocode also accepts a country_codes argument (ISO codes only).
            # Here we append the context text to the query instead; this is a heuristic,
            # but it also works for non-country context such as a state ("TX").
            query = f"{place_str}, {country_bias}"
        location = geocode_rate_limited(query, exactly_one=exactly_one, timeout=10)
        geocode_cache[key] = location
        return location
    except (GeocoderTimedOut, GeocoderServiceError) as e:
        print("Geocoding error:", e)
        return None

def save_cache():
    with open(CACHE_FILE, "wb") as f:
        pickle.dump(geocode_cache, f)
    print("Cache saved:", CACHE_FILE)

Starting with empty geocode cache
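
The cache above keys on the raw `(place_str, country_bias)` tuple, so `"Paris"` and `" paris "` would be stored as separate entries and each would cost a request. A minimal sketch of a normalized key follows, assuming that case and extra whitespace never matter for a lookup; `normalize_key` is a hypothetical helper and is not used by the cells in this notebook.

```python
def normalize_key(place_str, context=None):
    """Return a cache key that ignores case and redundant whitespace."""
    def clean(text):
        # None stays None so that "no context" remains a distinct key
        if text is None:
            return None
        return " ".join(str(text).split()).casefold()
    return (clean(place_str), clean(context))


cache = {}
cache[normalize_key("  South   Bend ", "IN")] = (41.68, -86.25)

# A differently typed version of the same place hits the same entry.
print(normalize_key("south bend", "in") in cache)
```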


In [5]:
# Example batch: input dataframe of place strings and optional context (country/state)
places_df = pd.DataFrame({
    "id": [1,2,3,4],
    "place": ["Paris", "Paris", "Springfield", "South Bend"],
    "context": [None, "TX", "IL", "IN"]  # context biases to e.g., Paris, TX
})

# We'll iterate and populate lat/lon, using cache
out_rows = []
for _, row in places_df.iterrows():
    place = row["place"]
    ctx = row["context"]
    loc = lookup_place(place, country_bias=ctx, exactly_one=False)
    if loc is None:
        out_rows.append({"id": row["id"], "place": place, "context": ctx, "found": False})
        continue
    locs = loc if isinstance(loc, list) else [loc]
    # choose top 1 for simplicity
    chosen = locs[0]
    out_rows.append({
        "id": row["id"],
        "place": place,
        "context": ctx,
        "found": True,
        "address": chosen.address,
        "lat": chosen.latitude,
        "lon": chosen.longitude
    })
    # small sleep to be polite (RateLimiter usually handles this)
    time.sleep(0.5)

batch_out = pd.DataFrame(out_rows)
print(batch_out)
#save_cache()

   id        place context  found  \
0   1        Paris    None   True   
1   2        Paris      TX   True   
2   3  Springfield      IL   True   
3   4   South Bend      IN   True   

                                             address        lat        lon  
0  Paris, Île-de-France, France métropolitaine, F...  48.853495   2.348391  
1   Paris, Lamar County, Texas, 75460, United States  33.661796 -95.555513  
2  Springfield, Sangamon County, Illinois, United...  39.799017 -89.643957  
3  South Bend, Saint Joseph County, Indiana, Unit...  41.683381 -86.250007  
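
The loop above keeps all results in memory until the end. To cover the "track failed matches" and "save outputs frequently" goals, one option is to append each row to a CSV as soon as it is processed. The sketch below uses only the standard library; `append_result`, `read_failures`, and the file name `geocoded.csv` are assumptions for illustration, and the second demo row is a made-up failure.

```python
import csv
import os
import tempfile

FIELDS = ["id", "place", "context", "found", "address", "lat", "lon"]

def append_result(path, row):
    """Append one result row to a CSV file, writing the header only once."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        # restval fills columns that a failed row does not provide
        writer = csv.DictWriter(f, fieldnames=FIELDS, restval="")
        if is_new:
            writer.writeheader()
        writer.writerow(row)

def read_failures(path):
    """Return the rows whose 'found' column is not 'True'."""
    with open(path, newline="", encoding="utf-8") as f:
        return [r for r in csv.DictReader(f) if r["found"] != "True"]


out_path = os.path.join(tempfile.mkdtemp(), "geocoded.csv")
append_result(out_path, {
    "id": 2, "place": "Paris", "context": "TX", "found": True,
    "address": "Paris, Lamar County, Texas, 75460, United States",
    "lat": 33.661796, "lon": -95.555513,
})
# Hypothetical unmatched row, included only to show the failure path
append_result(out_path, {"id": 99, "place": "Nowhereville", "context": "ZZ", "found": False})

failed = read_failures(out_path)
print([r["place"] for r in failed])
```

Because every row is on disk as soon as it is written, an interrupted run can be resumed by skipping the ids already present in the file.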


# Handling Ambiguous Place Names in Geocoding

Not all locations are unique. Place names like **Springfield** or **Paris** exist
in many states and countries. This notebook introduces ambiguity in geocoding and
shows simple strategies to reduce errors.

We will learn:
- Why ambiguity happens
- How to use context (state/country) to improve matches
- How to compare multiple candidate matches
- A simple workflow for choosing the “best” location


## Why Ambiguity Happens

Geocoders try to match a text query to a real location. But many place names are
shared across regions.

Examples:
- Springfield (many U.S. states)
- Washington (state vs. DC vs. towns)
- Paris (France vs. U.S. cities)

To reduce ambiguity, always include context:
- city + state
- city + country
- postal code (best)
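
As a small illustration of adding context, the sketch below joins whichever parts are available into one query string. `build_query` and its parameter names are hypothetical; the part order (city, state, postal code, country) is an assumption, not a requirement of any geocoder.

```python
def build_query(city, state=None, country=None, postal_code=None):
    """Join the available address parts into one comma-separated query string."""
    parts = [city, state, postal_code, country]
    # Skip missing or blank parts so no empty ", ," segments appear
    return ", ".join(str(p).strip() for p in parts if p and str(p).strip())


print(build_query("Springfield"))
print(build_query("Springfield", state="IL", country="USA"))
print(build_query("Paris", state="TX", postal_code="75460", country="USA"))
```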

In [6]:
!pip install fuzzywuzzy

Collecting fuzzywuzzy
  Downloading fuzzywuzzy-0.18.0-py2.py3-none-any.whl.metadata (4.9 kB)
Downloading fuzzywuzzy-0.18.0-py2.py3-none-any.whl (18 kB)
Installing collected packages: fuzzywuzzy
Successfully installed fuzzywuzzy-0.18.0


In [7]:
from fuzzywuzzy import process as fuzzy_proces

In [8]:
# Example: disambiguate by simple string matching and fuzzy score
def disambiguate_candidates(candidate, n_results=5, country_bias=None):
    # Use geocode with exactly_one=False to get multiple results (the RateLimiter wrapper may return list)
    raw = lookup_place(candidate, country_bias=country_bias, exactly_one=False)
    if raw is None:
        return []
    results = raw if isinstance(raw, list) else [raw]
    choices = []
    for r in results[:n_results]:
        # create a simple score based on string similarity between candidate and the display name (address)
        score = fuzzy_proces.extractOne(candidate, [r.address or ""], score_cutoff=0)[1] if r.address else 0
        choices.append({
            "address": r.address,
            "lat": r.latitude,
            "lon": r.longitude,
            "raw": r.raw,
            "score": score
        })
    # sort by score desc
    return sorted(choices, key=lambda x: x["score"], reverse=True)

In [9]:
# Try disambiguating "Paris" with US bias by appending "USA"
cands = disambiguate_candidates("Paris", n_results=6, country_bias="USA")
for i, c in enumerate(cands, 1):
    print(i, c["address"], c["lat"], c["lon"], "score:", c["score"])

1 Paris, Lamar County, Texas, 75460, United States 33.6617962 -95.555513 score: 60
2 Paris, Bourbon County, Kentucky, 40361, United States 38.2132087 -84.2492072 score: 60
3 Paris, Edgar County, Illinois, 61944, United States 39.611146 -87.6961374 score: 60
4 Paris, Henry County, West Tennessee, Tennessee, United States 36.3019461 -88.3258578 score: 60
5 Paris, Oxford County, Maine, 04281, United States 44.2614578 -70.5009798 score: 60
6 Paris, Logan County, Arkansas, 72855, United States 35.2924747 -93.7294452 score: 60
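
All six candidates receive the same score because every address contains the bare name "Paris", so a name-only similarity cannot separate them. One possible alternative is to score on the address components that carry context. The sketch below is a hypothetical helper, `rank_by_context`, applied to three address strings copied from the output above; it counts how many context terms match a whole comma-separated component.

```python
def rank_by_context(addresses, context_terms):
    """Sort addresses by the number of context terms that match a whole
    comma-separated component. Ties keep the geocoder's original order."""
    terms = [t.strip().casefold() for t in context_terms if t and t.strip()]

    def score(address):
        components = {c.strip().casefold() for c in address.split(",")}
        return sum(term in components for term in terms)

    # sorted() is stable, so equal scores preserve the input order
    return sorted(addresses, key=score, reverse=True)


addresses = [
    "Paris, Lamar County, Texas, 75460, United States",
    "Paris, Bourbon County, Kentucky, 40361, United States",
    "Paris, Edgar County, Illinois, 61944, United States",
]
ranked = rank_by_context(addresses, ["Kentucky", "United States"])
print(ranked[0])
```

Matching whole components rather than substrings avoids accidental hits, for example a context of "Paris" matching "Monnaie de Paris".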


In [10]:
def disambiguate_candidates(place_name, context_text="", country_bias=None):
    """
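    Try to disambiguate a place using context text.
    This redefines (and replaces) the fuzzy-score disambiguate_candidates above.
    Returns a list of (location, score) tuples, sorted by score, highest first.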
    """
    # Make a combined query
    query = f"{place_name}, {context_text}" if context_text else place_name
    print(f"🔎 Querying geocoder with: {query}")

    locs = lookup_place(query, country_bias=country_bias, exactly_one=False)

    if not locs:
        print("⚠️ No candidates found.")
        return []

    results = []
    for idx, loc in enumerate(locs):
        score = 0

        # Boost if the place_name appears in the address
        if place_name.lower() in loc.address.lower():
            score += 50

        # Boost if the context string appears in the address
        if context_text and context_text.lower() in loc.address.lower():
            score += 40

        # Small bonus for being ranked higher by Nominatim
        score += max(0, 30 - idx)

        results.append((loc, score))

    # Sort highest score first
    results.sort(key=lambda x: x[1], reverse=True)
    return results


In [11]:
# Ambiguous Springfield
candidates = disambiguate_candidates("Springfield","County", country_bias="USA")
for loc, score in candidates:
    print(score, loc.address)

# Ambiguous Paris
candidates = disambiguate_candidates("Paris", "Eiffel, Bordeaux", country_bias="FR")
for loc, score in candidates:
    print(score, loc.address)

🔎 Querying geocoder with: Springfield, County
120 Springfield, Sangamon County, Illinois, United States
119 Springfield, Hampden County, Massachusetts, United States
118 Springfield, Greene County, Missouri, United States
117 Springfield, Clark County, Ohio, United States
116 Springfield, Lane County, Oregon, United States
115 Springfield, Windsor County, Vermont, United States
114 Springfield, Effingham County, Georgia, United States
113 Springfield, Washington County, Kentucky, United States
112 Springfield, Brown County, Minnesota, 56087, United States
111 Springfield, Fairfax County, Virginia, 22150, United States
🔎 Querying geocoder with: Paris, Eiffel, Bordeaux
80 Monnaie de Paris, Avenue Gustave Eiffel, Saige, Pessac, Bordeaux, Gironde, Nouvelle-Aquitaine, France métropolitaine, 33600, France
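
In the Springfield output, the scores differ only by the rank bonus (120, 119, 118, ...), so the top candidate is not meaningfully better than the runner-up. One way to turn scores into a decision is to accept the top candidate only when it leads by a clear margin and to flag everything else for manual review. The sketch below assumes `(candidate, score)` pairs like those returned above; `pick_best` and the `min_margin` threshold are hypothetical choices.

```python
def pick_best(scored, min_margin=10):
    """Return (best, status) from a list of (candidate, score) pairs.

    status is 'no_match', 'accepted', or 'needs_review'.
    """
    if not scored:
        return None, "no_match"
    ordered = sorted(scored, key=lambda pair: pair[1], reverse=True)
    best, best_score = ordered[0]
    # A single candidate, or a clear lead over the runner-up, is accepted
    if len(ordered) == 1 or best_score - ordered[1][1] >= min_margin:
        return best, "accepted"
    return best, "needs_review"


# Scores taken from the output above; plain strings stand in for Location objects
springfield = [
    ("Springfield, Sangamon County, Illinois, United States", 120),
    ("Springfield, Hampden County, Massachusetts, United States", 119),
]
print(pick_best(springfield)[1])
print(pick_best([])[1])
```

A single returned candidate is accepted automatically here, which is a simplification: one match can still be the wrong place.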
