# Bus Need Classifier: Frontend_Data_Creation_V1

This notebook experiments with APIs to figure out how to gather data to populate all columns of an entry based on the home address and school address.

First, let's try to get the coordinates of a building from its address. We're using the Census Geocoder API (US addresses only). An HTTP request is sent to the endpoint below, and a list of address matches is returned. We extract the first (best-matching) address and read its coordinates.

In [23]:
import requests

def geocode_address_census(address):
    url = "https://geocoding.geo.census.gov/geocoder/locations/onelineaddress"

    params = {
        "address": address,
        "benchmark": "Public_AR_Current",
        "format": "json"
    }

    response = requests.get(url, params=params)
    response.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error page
    matches = response.json()["result"]["addressMatches"]

    if len(matches) == 0:
        raise ValueError("Address not found.")

    coords = matches[0]["coordinates"]
    return coords["y"], coords["x"]   # (lat, lon)

# Example:
lat, lon = geocode_address_census("2571 Durant Ave, Berkeley, CA")


In [24]:
print("lat: ", lat, "long: ", lon)

lat:  37.868088845999 long:  -121.941776296797
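The response handling can be separated from the HTTP call so it can be checked offline. The sketch below assumes the response shape used in the function above (`result`, then `addressMatches`, then `coordinates`). The helper name `first_match_coords` and both sample payloads are made up for illustration and are not real API output.

```python
def first_match_coords(payload):
    """Return (lat, lon) of the first address match, or None if there is none."""
    # .get() with defaults avoids a KeyError when a level of the response is missing.
    matches = payload.get("result", {}).get("addressMatches", [])
    if not matches:
        return None
    coords = matches[0]["coordinates"]
    return coords["y"], coords["x"]  # the geocoder reports x = lon, y = lat


# Hypothetical payloads mimicking the geocoder's JSON shape (not real API output).
sample_hit = {"result": {"addressMatches": [{"coordinates": {"x": -122.25, "y": 37.87}}]}}
sample_miss = {"result": {"addressMatches": []}}

print(first_match_coords(sample_hit))
print(first_match_coords(sample_miss))
```

Returning `None` instead of raising is a design choice for batch runs, where one bad address should not stop the rest.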


Now let's try to extract information like Census Division, MSA/CMSA Status, and Urban/Rural classification from coordinates. Let's start with Census Division (CDIVMSAR). We need to use the FCC API to first convert the coordinates into the standardized FIPS code for the state.

In [25]:
def latlon_to_state_fips(lat, lon):
    url = f"https://geo.fcc.gov/api/census/area?lat={lat}&lon={lon}&format=json"
    data = requests.get(url).json()
    return data["results"][0]["state_fips"]


Next, we need to hard-code a dictionary that translates a state FIPS code into its Census Division (the mapping is static, so hard-coding is fine).

In [26]:
STATE_TO_DIVISION = {
    # New England
    "09":"New England","23":"New England","25":"New England","33":"New England","44":"New England","50":"New England",

    # Middle Atlantic
    "34":"Middle Atlantic","36":"Middle Atlantic","42":"Middle Atlantic",

    # East North Central
    "17":"East North Central","18":"East North Central","26":"East North Central","39":"East North Central","55":"East North Central",

    # West North Central
    "19":"West North Central","20":"West North Central","27":"West North Central","29":"West North Central",
    "31":"West North Central","38":"West North Central","46":"West North Central",

    # South Atlantic
    "10":"South Atlantic","11":"South Atlantic","12":"South Atlantic","13":"South Atlantic",
    "24":"South Atlantic","37":"South Atlantic","45":"South Atlantic","51":"South Atlantic","54":"South Atlantic",

    # East South Central
    "01":"East South Central","21":"East South Central","28":"East South Central","47":"East South Central",

    # West South Central
    "05":"West South Central","22":"West South Central","40":"West South Central","48":"West South Central",

    # Mountain
    "04":"Mountain","08":"Mountain","16":"Mountain","30":"Mountain",
    "32":"Mountain","35":"Mountain","49":"Mountain","56":"Mountain",

    # Pacific
    "02":"Pacific","06":"Pacific","15":"Pacific","41":"Pacific","53":"Pacific"
}


Let's test this out with an address.

In [27]:

print(STATE_TO_DIVISION[(latlon_to_state_fips(lat, lon))])

Pacific
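A direct dictionary lookup raises a `KeyError` for any state FIPS code that is not in the table, such as Puerto Rico ("72"). A minimal sketch of a more forgiving lookup follows. The helper name `lookup_division`, the `"Unknown"` default, and the two-entry `demo_mapping` are assumptions for illustration; in the notebook you would pass `STATE_TO_DIVISION` instead.

```python
def lookup_division(state_fips, mapping, default="Unknown"):
    """Look up a Census Division, tolerating ints and unpadded strings."""
    # Normalise 6 or "6" to the two-digit string "06" used as dictionary keys.
    key = str(state_fips).zfill(2)
    return mapping.get(key, default)


# Tiny subset of the full table, just for the demonstration.
demo_mapping = {"06": "Pacific", "36": "Middle Atlantic"}

print(lookup_division("06", demo_mapping))
print(lookup_division(6, demo_mapping))
print(lookup_division("72", demo_mapping))  # not in the table, falls back to the default
```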


For the other datapoints, we need more specific information than the Division. Let's get some more FIPS codes.

In [28]:
def get_fips_from_coords(lat, lon):
    """
    Given lat/lon, return:
    - state FIPS
    - county FIPS
    - tract code
    - block group code
    """
    url = f"https://geo.fcc.gov/api/census/block/find?latitude={lat}&longitude={lon}&format=json"
    response = requests.get(url).json()

    block_fips = response["Block"]["FIPS"]  # 15-digit block code

    state_fips = block_fips[:2]       # 2 digits
    county_fips = block_fips[2:5]     # 3 digits
    tract = block_fips[5:11]          # 6 digits
    block = block_fips[11:]           # 4 digits
    block_group = block[0]            # 1 digit

    return state_fips, county_fips, tract, block_group


P.S. Both functions return the state FIPS code, so either one can now be used to look up the Division.

In [32]:
print(get_fips_from_coords(lat, lon))

('06', '001', '422800', '1')
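The slicing of the 15-digit block code can be tested without calling the FCC API. This is a sketch under the same layout the function above uses (2-digit state, 3-digit county, 6-digit tract, 4-digit block whose first digit is the block group). The `BlockFips` type, the `split_block_fips` name, and the example code are illustrative, not part of the notebook.

```python
from typing import NamedTuple


class BlockFips(NamedTuple):
    state: str
    county: str
    tract: str
    block_group: str
    block: str


def split_block_fips(block_fips):
    """Split a 15-digit census block code into its components."""
    if len(block_fips) != 15 or not block_fips.isdigit():
        raise ValueError(f"Expected a 15-digit block code, got {block_fips!r}")
    return BlockFips(
        state=block_fips[:2],
        county=block_fips[2:5],
        tract=block_fips[5:11],
        block_group=block_fips[11],  # first digit of the 4-digit block
        block=block_fips[11:],
    )


# Illustrative 15-digit code, chosen to match the tuple printed above.
example = split_block_fips("060014228001000")
print(example)
```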


Next, we use the Census API with table B19013 from the ACS 5-Year dataset to get the median household income estimate for the block group that contains an address.

In [33]:
import requests

def get_blockgroup_median_income(state_fips, county_fips, tract, block_group, api_key=None):
    """
    Get median household income for a block group from ACS5 (2022).

    Parameters:
        state_fips (str): 2-digit state FIPS
        county_fips (str): 3-digit county FIPS
        tract (str): 6-digit census tract
        block_group (str): 1-digit block group number
        api_key (str, optional): Your Census API key. Default None.

    Returns:
        int: median household income estimate
    """
    base_url = "https://api.census.gov/data/2022/acs/acs5"
    var = "B19013_001E"  # median household income estimate
    url = f"{base_url}?get={var}&for=block group:{block_group}&in=state:{state_fips}+county:{county_fips}+tract:{tract}"

    if api_key:
        url += f"&key={api_key}"

    response = requests.get(url)
    response.raise_for_status()  # raise exception if bad response

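    data = response.json()
    # data[0] is the header row, data[1] holds the values
    raw_value = data[1][0]

    # Check for a missing value before converting; int(None) would raise.
    # Negative values are treated as missing, assuming the ACS convention of
    # large negative sentinels (e.g. -666666666) for unavailable estimates.
    if raw_value in [None, "", "null"] or int(raw_value) < 0:
        return "No data"  # or some default value

    return int(raw_value)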

# Example usage:
state_fips, county_fips, tract, block_group = get_fips_from_coords(lat, lon)
#api_key = "YOUR_KEY_HERE"

median_income = get_blockgroup_median_income(state_fips, county_fips, tract, block_group)
print("Block group median household income:", median_income)


Block group median household income: 22857
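The Census API returns a list of rows, with the header first and the values second. A small parser that looks the variable up by name, rather than by position, can be sketched as follows. The function name `parse_acs_estimate` and the sample rows are hypothetical. Treating negative values as missing is an assumption based on the ACS practice of using large negative sentinels such as -666666666.

```python
def parse_acs_estimate(data, variable):
    """Return the integer estimate for `variable`, or None if it is missing."""
    header, row = data[0], data[1]
    raw = row[header.index(variable)]  # find the column by name, not position
    if raw in (None, "", "null"):
        return None
    value = int(raw)
    # Assumption: negative values are sentinels for unavailable estimates.
    return value if value >= 0 else None


# Hypothetical responses shaped like the API output (header row, then values).
header = ["B19013_001E", "state", "county", "tract", "block group"]
sample_ok = [header, ["22857", "06", "001", "422800", "1"]]
sample_suppressed = [header, ["-666666666", "06", "001", "422800", "1"]]

print(parse_acs_estimate(sample_ok, "B19013_001E"))
print(parse_acs_estimate(sample_suppressed, "B19013_001E"))
```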


Now let's move on to a race estimate by block group using a very similar procedure, but with the B02001 (race) and B03003 (Hispanic origin) tables.

In [35]:


def get_race_hisp_counts(state_fips, county_fips, tract, block_group, api_key=None):
    """
    Get ACS5 race and Hispanic counts for a block group.
    """
    # ACS5 table variables
    race_vars = [
        "B02001_002E",  # White alone
        "B02001_003E",  # Black or African American alone
        "B02001_004E",  # American Indian and Alaska Native alone
        "B02001_005E",  # Asian alone
        "B02001_006E",  # Native Hawaiian and Other Pacific Islander alone
        "B02001_007E",  # Some other race alone
        "B02001_008E",  # Two or more races

    ]
    hisp_vars = [
        "B03003_002E",  # Not Hispanic
        "B03003_003E"   # Hispanic
    ]

    all_vars = race_vars + hisp_vars
    var_str = ",".join(all_vars)

    url = f"https://api.census.gov/data/2022/acs/acs5"
    params = {
        "get": var_str,
        "for": f"block group:{block_group}",
        "in": f"state:{state_fips}+county:{county_fips}+tract:{tract}",
        "key": api_key
    }

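    response = requests.get(url, params=params)
    response.raise_for_status()  # raise exception if bad response
    data = response.json()

    # First row is headers, second row is values
    counts = dict(zip(data[0], data[1]))
    # Convert string values to int. This also converts the geography columns,
    # so leading zeros are dropped (e.g. state "06" becomes 6).
    counts = {k: int(v) for k, v in counts.items()}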

    return counts

# Example usage:

#api_key = "YOUR_CENSUS_API_KEY"
state_fips, county_fips, tract, block_group = get_fips_from_coords(lat, lon)
counts = get_race_hisp_counts(state_fips, county_fips, tract, block_group)
print(counts)


{'B02001_002E': 1383, 'B02001_003E': 251, 'B02001_004E': 16, 'B02001_005E': 1565, 'B02001_006E': 20, 'B02001_007E': 367, 'B02001_008E': 490, 'B03003_002E': 3123, 'B03003_003E': 969, 'state': 6, 'county': 1, 'tract': 422800, 'block group': 1}
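Raw counts are hard to compare across block groups of different sizes, so a model feature is more likely to be a share. The sketch below turns the race counts into labelled proportions. The labels follow the ACS B02001 table definitions; the `race_shares` helper is hypothetical, and `sample_counts` copies the race values printed above.

```python
RACE_LABELS = {
    "B02001_002E": "White alone",
    "B02001_003E": "Black or African American alone",
    "B02001_004E": "American Indian and Alaska Native alone",
    "B02001_005E": "Asian alone",
    "B02001_006E": "Native Hawaiian and Other Pacific Islander alone",
    "B02001_007E": "Some other race alone",
    "B02001_008E": "Two or more races",
}


def race_shares(counts):
    """Convert race counts into shares of the block group's total."""
    total = sum(counts[var] for var in RACE_LABELS)
    if total == 0:
        # Avoid dividing by zero for unpopulated block groups.
        return {label: 0.0 for label in RACE_LABELS.values()}
    return {label: counts[var] / total for var, label in RACE_LABELS.items()}


# Race counts copied from the output above.
sample_counts = {
    "B02001_002E": 1383, "B02001_003E": 251, "B02001_004E": 16,
    "B02001_005E": 1565, "B02001_006E": 20, "B02001_007E": 367,
    "B02001_008E": 490,
}

shares = race_shares(sample_counts)
for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```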


Now let's look at MSA (Metropolitan Statistical Area) size. To do that, we map the county FIPS code to its MSA using the CBSA delineation file from the Census website.

In [48]:
import pandas as pd

url = "https://www2.census.gov/programs-surveys/metro-micro/geographies/reference-files/2023/delineation-files/list1_2023.xlsx"
# dtype=str reads every column as text; header=2 uses the third row as the header
cbsa_crosswalk = pd.read_excel(url, dtype=str, header=2)


# Inspect columns

cbsa_crosswalk.head(20)


Unnamed: 0,CBSA Code,Metropolitan Division Code,CSA Code,CBSA Title,Metropolitan/Micropolitan Statistical Area,Metropolitan Division Title,CSA Title,County/County Equivalent,State Name,FIPS State Code,FIPS County Code,Central/Outlying County
0,10100,,,"Aberdeen, SD",Micropolitan Statistical Area,,,Brown County,South Dakota,46,13,Central
1,10100,,,"Aberdeen, SD",Micropolitan Statistical Area,,,Edmunds County,South Dakota,46,45,Outlying
2,10140,,,"Aberdeen, WA",Micropolitan Statistical Area,,,Grays Harbor County,Washington,53,27,Central
3,10180,,101.0,"Abilene, TX",Metropolitan Statistical Area,,"Abilene-Sweetwater, TX",Callahan County,Texas,48,59,Outlying
4,10180,,101.0,"Abilene, TX",Metropolitan Statistical Area,,"Abilene-Sweetwater, TX",Jones County,Texas,48,253,Outlying
5,10180,,101.0,"Abilene, TX",Metropolitan Statistical Area,,"Abilene-Sweetwater, TX",Taylor County,Texas,48,441,Central
6,10220,,,"Ada, OK",Micropolitan Statistical Area,,,Pontotoc County,Oklahoma,40,123,Central
7,10300,,220.0,"Adrian, MI",Micropolitan Statistical Area,,"Detroit-Warren-Ann Arbor, MI",Lenawee County,Michigan,26,91,Central
8,10380,,364.0,"Aguadilla, PR",Metropolitan Statistical Area,,"Mayagüez-Aguadilla, PR",Aguada Municipio,Puerto Rico,72,3,Central
9,10380,,364.0,"Aguadilla, PR",Metropolitan Statistical Area,,"Mayagüez-Aguadilla, PR",Aguadilla Municipio,Puerto Rico,72,5,Central


We can use the state and county FIPS codes to determine whether the block is part of an MSA (a micropolitan area counts as "no") and what its CBSA code is (useful for the population lookup later).

In [56]:
def get_MSA_status(state_fips, county_fips):
    # Return the crosswalk row(s) matching this state + county FIPS pair
    return cbsa_crosswalk[(cbsa_crosswalk["FIPS State Code"] == state_fips) & (cbsa_crosswalk["FIPS County Code"] == county_fips)]

row = get_MSA_status("46", "013")
print(row["CBSA Code"].tolist()[0])
print(row["Metropolitan/Micropolitan Statistical Area"].tolist()[0])
# These are the only two pieces of info we need.

10100
Micropolitan Statistical Area
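The lookup compares strings exactly, so an unpadded county code such as "13" would not match "013". A minimal sketch of two helpers that guard against this and turn the area label into a yes/no flag follows. Both function names are hypothetical, and the sketch assumes the label always starts with either "Metropolitan" or "Micropolitan", as in the rows shown above.

```python
def normalize_fips(state_fips, county_fips):
    """Zero-pad state and county codes to 2 and 3 digits respectively."""
    return str(state_fips).zfill(2), str(county_fips).zfill(3)


def is_metro(area_type):
    """True for 'Metropolitan Statistical Area', False for micropolitan areas."""
    return area_type.strip().lower().startswith("metropolitan")


print(normalize_fips(46, 13))
print(is_metro("Micropolitan Statistical Area"))
print(is_metro("Metropolitan Statistical Area"))
```

Counties outside every CBSA have no row in the crosswalk at all, so a caller would still need to handle an empty lookup result before indexing into it.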


We can use the CBSA code to get the total population from ACS table B01003. We only need it when the area is in an MSA, since that is how our RF is coded.

In [59]:
def get_population_by_cbsa(cbsa_code):
    """
    Given a CBSA code, fetch total population (B01003) from ACS5 API.
    Returns the population as an integer.
    """
    base_url = "https://api.census.gov/data/2022/acs/acs5"
    params = {
        "get": "B01003_001E",  # Total population estimate
        "for": f"metropolitan statistical area/micropolitan statistical area:{cbsa_code}",
        #"key": CENSUS_API_KEY
    }

    response = requests.get(base_url, params=params)
    response.raise_for_status()  # Raise an error if the request failed

    data = response.json()
    # The first row is column names, second row is the values
    pop = int(data[1][0])
    return pop

# Example usage
cbsa_code = "10420"  # Replace with your CBSA code
population = get_population_by_cbsa(cbsa_code)
print(f"Total population for CBSA {cbsa_code}: {population}")


Total population for CBSA 10420: 700578
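Since the population is only needed for metropolitan areas, the lookup can be gated on the area type. The sketch below takes the fetch function as a parameter so the logic can be exercised without a network call. The name `msa_population`, the choice of `None` for non-MSA areas, and the `fake_fetch` stand-in are assumptions; in the notebook you would pass `get_population_by_cbsa`.

```python
def msa_population(area_type, cbsa_code, fetch_population):
    """Return the CBSA population for metropolitan areas, else None."""
    if area_type != "Metropolitan Statistical Area":
        return None  # micropolitan areas are treated as "not in an MSA"
    return fetch_population(cbsa_code)


def fake_fetch(cbsa_code):
    # Stand-in for the real API call, returning the value printed above.
    return {"10420": 700578}[cbsa_code]


print(msa_population("Metropolitan Statistical Area", "10420", fake_fetch))
print(msa_population("Micropolitan Statistical Area", "10100", fake_fetch))
```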


Only three variables left! Let's look at urban classification next. We need the TIGER/Line shapefiles to get urban classifications (at https://www2.census.gov/geo/tiger/TIGER2020/UAC/tl_2020_us_uac20.zip). For now, we classify into just urban/rural, but a stretch goal would be to extend this to urban, urban within a cluster, rural, and rural within a cluster.

In [79]:
import geopandas as gpd
import requests
import zipfile
import io
from shapely.geometry import Point

# ----------------------------------------------------
# Configuration: URL of shapefile ZIP
UAC20_URL = "https://www2.census.gov/geo/tiger/TIGER2020/UAC/tl_2020_us_uac20.zip"

# ----------------------------------------------------
# Helper to load shapefile into GeoDataFrame
def load_urban_areas_gdf(url=UAC20_URL):
    # Download zip into bytes
    r = requests.get(url)
    r.raise_for_status()
    z = zipfile.ZipFile(io.BytesIO(r.content))

    # Find the .shp file name inside the zip
    shapefile_name = [f for f in z.namelist() if f.endswith(".shp")][0]

    # Extract all files to a temporary directory on disk (the .shp needs its companion files)
    z.extractall("/tmp/tl_uac20")

    # Load with GeoPandas
    gdf = gpd.read_file(f"/tmp/tl_uac20/{shapefile_name}")
    # Ensure it's in WGS84 lat/lon
    gdf = gdf.to_crs(epsg=4326)
    return gdf

# Load once
urban_gdf = load_urban_areas_gdf()

# ----------------------------------------------------
def classify_urban(lat, lon, gdf=urban_gdf):
    """
    Returns a (status, name) tuple:
      - status: 'Urban' if the point is inside any urban polygon, 'Rural' otherwise
      - name: the urban area name if inside one, else None
    """
    pt = Point(lon, lat)
    match = gdf[gdf.contains(pt)]
    if not match.empty:
        # Inside some urban polygon
        name = match.iloc[0]["NAME20"]
        return "Urban", name
    else:
        return "Rural", None  # Not in urban area


# ----------------------------------------------------
# Example usage:
lat_example = 37.8715   # Berkeley, CA
lon_example = -122.2730
status, area_name = classify_urban(lat_example, lon_example)
print(status, area_name)

# Example for rural coordinate
lat_rural = 43.9836    # Rural Oregon
lon_rural = -120.7022
status_rural, area_name_rural = classify_urban(lat_rural, lon_rural)
print(status_rural, area_name_rural)



Urban San Francisco--Oakland, CA
Rural None
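The `gdf.contains(pt)` call is a point-in-polygon test. To illustrate the idea without downloading the shapefile, here is a ray-casting sketch in plain Python. It is not how GeoPandas or Shapely implement the test, it ignores holes and points exactly on an edge, and the rectangle is a made-up stand-in for a real urban-area polygon.

```python
def point_in_polygon(lon, lat, ring):
    """Ray casting: count how many polygon edges a ray going east crosses."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside  # each crossing flips inside/outside
    return inside


# Hypothetical rectangular "urban area" as (lon, lat) vertices.
toy_urban_area = [(-123.0, 37.0), (-122.0, 37.0), (-122.0, 38.0), (-123.0, 38.0)]

print(point_in_polygon(-122.2730, 37.8715, toy_urban_area))
print(point_in_polygon(-120.7022, 43.9836, toy_urban_area))
```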


Only two to go. We can figure out the driving distance to a school by calling the OpenRouteService API.

In [85]:

# Get a free key from https://openrouteservice.org/sign-up/
# Read the key from an environment variable instead of hard-coding it in the notebook.
# (The variable name ORS_API_KEY is our own choice.)
import os

API_KEY = os.environ["ORS_API_KEY"]

def get_driving_distance_ors(address1, address2):
    """
    Returns driving distance in miles and duration in minutes using OpenRouteService.
    """
    # First, geocode addresses using ORS
    def geocode(address):
        url = "https://api.openrouteservice.org/geocode/search"
        params = {"api_key": API_KEY, "text": address, "size": 1}
        resp = requests.get(url, params=params).json()
        if len(resp["features"]) == 0:
            raise ValueError(f"Address not found: {address}")
        coords = resp["features"][0]["geometry"]["coordinates"]  # [lon, lat]
        return coords

    start_coords = geocode(address1)
    end_coords = geocode(address2)

    # Call directions endpoint
    url = "https://api.openrouteservice.org/v2/directions/driving-car"
    headers = {"Authorization": API_KEY, "Content-Type": "application/json"}
    body = {
        "coordinates": [start_coords, end_coords]
    }
    resp = requests.post(url, json=body, headers=headers).json()

    print(resp)
    route = resp["routes"][0]["summary"]
    distance_mi = route["distance"] / 1000 * 0.621371
    duration_min = route["duration"] / 60

    return distance_mi, duration_min

# Example usage:
address1 = "7535 Northland Ave, San Ramon, CA"
address2 = "2571 Durant Ave, Berkeley, CA"

dist, dur = get_driving_distance_ors(address1, address2)
print(f"Distance: {dist:.2f} mi, Duration: {dur:.1f} min")


{'bbox': [-122.268957, 37.690686, -121.924111, 37.8667], 'routes': [{'summary': {'distance': 48797.4, 'duration': 2483.9}, 'segments': [{'distance': 48797.4, 'duration': 2483.9, 'steps': [{'distance': 1.0, 'duration': 0.2, 'type': 11, 'instruction': 'Head southeast on Northland Avenue', 'name': 'Northland Avenue', 'way_points': [0, 1]}, {'distance': 296.2, 'duration': 71.1, 'type': 1, 'instruction': 'Turn right onto May Way', 'name': 'May Way', 'way_points': [1, 12]}, {'distance': 932.9, 'duration': 67.2, 'type': 0, 'instruction': 'Turn left onto Davona Drive', 'name': 'Davona Drive', 'way_points': [12, 33]}, {'distance': 502.3, 'duration': 69.8, 'type': 1, 'instruction': 'Turn right onto Alcosta Boulevard', 'name': 'Alcosta Boulevard', 'way_points': [33, 41]}, {'distance': 133.5, 'duration': 24.9, 'type': 6, 'instruction': 'Continue straight onto Alcosta Boulevard', 'name': 'Alcosta Boulevard', 'way_points': [41, 47]}, {'distance': 2285.1, 'duration': 106.9, 'type': 1, 'instruction': 
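The unit conversion and the response check can be pulled out of the API call and tested on their own. This is a sketch assuming the response shape shown in the output above (`routes[0]["summary"]` with `distance` in meters and `duration` in seconds). The helper name `summarize_route` is hypothetical, and `sample_resp` keeps only the summary values from that output.

```python
METERS_PER_MILE = 1609.344


def summarize_route(resp):
    """Return (distance in miles, duration in minutes) from a directions response."""
    routes = resp.get("routes")
    if not routes:
        # Surface the raw response, which may carry an error message instead of a route.
        raise ValueError(f"No route in response: {resp}")
    summary = routes[0]["summary"]
    return summary["distance"] / METERS_PER_MILE, summary["duration"] / 60


# Summary values copied from the response printed above.
sample_resp = {"routes": [{"summary": {"distance": 48797.4, "duration": 2483.9}}]}

dist_mi, dur_min = summarize_route(sample_resp)
print(f"Distance: {dist_mi:.2f} mi, Duration: {dur_min:.1f} min")
```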

Our final challenge: get the local gas price in cents. For now, we are skipping this because all the APIs for it are paid.

That's all for this notebook.