## Census Batch Geocoding

See [https://geocoding.geo.census.gov/geocoder/Geocoding_Services_API.pdf](https://geocoding.geo.census.gov/geocoder/Geocoding_Services_API.pdf) for more information.

The Census batch geocoder is a good fit when you need to geocode large numbers of addresses located in the United States. We can submit a file containing up to 10,000 addresses at a time. The file must contain the following columns, in this order: Unique ID, Street address, City, State, ZIP. We do this preprocessing below and then write the CSV to file. We then upload the file in a POST request, using the `curl` utility, to get the locations of the addresses. Change the variable names to fit your file format.

Addresses often do not follow the format that geocoders expect, and should be standardized prior to processing. We perform one such preprocessing step below, turning the spelled-out number "One" into the digit "1".
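
As a slightly broader illustration of this kind of standardization, the sketch below replaces a few spelled-out house numbers at the start of a street address with digits. The function name `standardize_address` and the `NUMBER_WORDS` table are hypothetical and not part of the original workflow. Matching only whole words at the start of the address is an assumption about what you want, so that street names such as "Oneida" are left alone.

```python
import re

# Hypothetical lookup table; extend it to cover the words found in your data.
NUMBER_WORDS = {"one": "1", "two": "2", "three": "3"}

# Match a number word only at the very start of the address, as a whole word.
_LEADING_WORD = re.compile(
    r"^(" + "|".join(NUMBER_WORDS) + r")\b", flags=re.IGNORECASE
)

def standardize_address(address: str) -> str:
    """Replace a leading spelled-out house number with its digit form."""
    def to_digit(match: re.Match) -> str:
        return NUMBER_WORDS[match.group(1).lower()]
    return _LEADING_WORD.sub(to_digit, address.strip())

print(standardize_address("One Main Street"))
print(standardize_address("Oneida Avenue"))
```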

In [1]:
import pandas as pd
from tqdm import tqdm

addfile='RI_Independent_Schools.csv' #Input file with addresses
df = pd.read_csv(addfile)

ID = 'org_ID'
NAME = 'name'
ADDRESS = 'location_address1'
STATE = 'location_state'
CITY = 'location_city'
ZIP = 'location_zip'
MAX_LOCATIONS = 5

def split_dataframe(df, chunk_size=10000):
    return [df[i:i + chunk_size] for i in range(0, len(df), chunk_size)]

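def preprocess(df: pd.DataFrame) -> list[pd.DataFrame]:
    df = df.copy()  # Avoid modifying the caller's dataframe
    # Replace the standalone word "One" with "1"; \b leaves names like "Oneida" intact
    df[ADDRESS] = df[ADDRESS].str.replace(r'\bOne\b', '1', regex=True)
    # The batch geocoder accepts at most 10,000 addresses per file
    if len(df) > 10000:
        return split_dataframe(df)
    return [df]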

addfile = addfile.split('.')
df_list = preprocess(df)

batch_names = []
for i,df in enumerate(df_list):
    batch_compatible_df = df[[ID, ADDRESS, CITY, STATE, ZIP]]
    batch_df_name = f'Batch_Compatible_{addfile[0]}_{i}.{addfile[-1]}'
    batch_names.append(batch_df_name)
    batch_compatible_df.to_csv(batch_df_name, index=False)
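
To make the 10,000-address limit concrete, here is a minimal sketch of the same slicing idea applied to a plain list rather than a dataframe. `chunk_rows` is a hypothetical helper, not part of pandas or the Census API.

```python
def chunk_rows(rows: list, chunk_size: int = 10_000) -> list[list]:
    """Split rows into consecutive chunks of at most chunk_size items."""
    return [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]

# 25,000 placeholder rows should yield three batches: 10,000 + 10,000 + 5,000.
batches = chunk_rows(list(range(25_000)))
print([len(batch) for batch in batches])
```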

We specify a benchmark parameter, which is a numerical ID or name which references the version of the locator to be used. This generally corresponds to address locators based on MAF/TIGER benchmarks, otherwise known as TIGER/Line geography. At the time of writing, there were three benchmarks available:
- 4: Public Address Ranges - Current Benchmark
- 8: Public Address Ranges - ACS2024 Benchmark
- 2020: Public Address Ranges - Census 2020 Benchmark

The differences are minimal, so we will stick to the default, the current benchmark (`4`), here.

Here is a `curl` request that geocodes all of the schools in the batch file written above.

In [2]:
!curl --form addressFile=@Batch_Compatible_RI_Independent_Schools_0.csv --form benchmark=4 \
https://geocoding.geo.census.gov/geocoder/locations/addressbatch --output Batch_Compatible_RI_Independent_Schools_Matched.csv



And here is an example of using the `requests` library to geocode the same batch file.

In [3]:
import requests

# The URL for the address batch geocoding API
url = 'https://geocoding.geo.census.gov/geocoder/locations/addressbatch'

# Open the file to send in the form
with open('Batch_Compatible_RI_Independent_Schools_0.csv', 'rb') as file:
    files = {'addressFile': file}
    data = {'benchmark': '4'}
    response = requests.post(url, files=files, data=data)

# Only save the response body if the request succeeded
if response.status_code == 200:
    with open('Batch_Compatible_RI_Independent_Schools_Matched.csv', 'wb') as output_file:
        output_file.write(response.content)
    print('File uploaded and result saved successfully.')
else:
    print(f'Error: {response.status_code}')

File uploaded and result saved successfully.
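
Once the result file is saved, it still has to be read back in. The sketch below parses it with the standard `csv` module. It assumes the layout described in the Census API documentation linked above: no header row, and up to eight fields per record (ID, input address, match indicator, match type, matched address, a "longitude,latitude" pair, TIGER line ID, and side), with unmatched records carrying only the first three. The field names are labels chosen for this example, and the sample records are hand-written for illustration rather than real API output, so check both against your own result file.

```python
import csv
import io

# Labels chosen for this example; the result file itself is assumed to have no header row.
FIELDS = ["id", "input_address", "match", "match_type",
          "matched_address", "coordinates", "tiger_line_id", "side"]

def parse_batch_results(lines) -> list[dict]:
    """Turn batch geocoder result rows into dicts with float lon/lat."""
    records = []
    for row in csv.reader(lines):
        if not row:
            continue
        # Unmatched rows are shorter, so pad missing fields with None.
        padded = row + [None] * (len(FIELDS) - len(row))
        record = dict(zip(FIELDS, padded))
        coords = record.pop("coordinates")
        if coords:
            lon, lat = coords.split(",")
            record["lon"], record["lat"] = float(lon), float(lat)
        else:
            record["lon"], record["lat"] = None, None
        records.append(record)
    return records

# Hand-written sample in the assumed layout (not real API output).
sample = io.StringIO(
    '"1","695 Park Ave, Cranston, RI, 02910","Match","Exact",'
    '"695 PARK AVE, CRANSTON, RI, 02910","-71.428536,41.777548","0","L"\n'
    '"2","1 Example Rd, Nowhere, RI, 00000","No_Match"\n'
)
for record in parse_batch_results(sample):
    print(record["id"], record["match"], record["lon"], record["lat"])
```

To read the real file, pass an open file handle instead of the sample, for example `open('Batch_Compatible_RI_Independent_Schools_Matched.csv', newline='')`.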


## Census Individual Geocoding

The Census batch geocoder described above should be your go-to in most cases: it runs much faster and provides easy access to results. However, the format specified above may not be a good fit for the data attributes you have. You may also want more control over the process, or want to look at the matches for an individual address. In these cases, individual geocoding may be a better choice. Here, we will use the Census API's One Line Address feature to look up addresses built from individual rows. One Line Address lookup can match on as few as two fields: the street address and the ZIP code.

First, we define a function that queries the API with an address string built from the street address, city, state, and ZIP provided in each row. We default to the current benchmark.
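
As a small illustration of the two minimum address formats, the sketch below builds One Line Address request URLs with `urllib.parse.urlencode`, which takes care of escaping spaces and commas. `build_oneline_url` is a hypothetical helper; the endpoint and the `address`, `benchmark`, and `format` parameters are the ones used in this notebook.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

ONELINE_URL = "https://geocoding.geo.census.gov/geocoder/locations/onelineaddress"

def build_oneline_url(*parts: str, benchmark: str = "4") -> str:
    """Join the non-empty address parts with commas and build the request URL."""
    address = ", ".join(str(p).strip() for p in parts if str(p).strip())
    query = urlencode({"address": address, "benchmark": benchmark, "format": "json"})
    return f"{ONELINE_URL}?{query}"

# Minimum formats: (street, zip) or (street, city, state).
url_zip = build_oneline_url("695 Park Ave", "02910")
url_city = build_oneline_url("695 Park Ave", "Cranston", "RI")
print(url_zip)

# Decoding the query string recovers the original address.
print(parse_qs(urlsplit(url_city).query)["address"][0])
```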

In [4]:
# Accepted formats are, at minimum, (street, zip) or (street, city, state).
import traceback
import urllib.parse

import requests

lineaddr_url = 'https://geocoding.geo.census.gov/geocoder/locations/onelineaddress?'

addfile = 'RI_Independent_Schools.csv'  # Input file with addresses
df = pd.read_csv(addfile)

def get_request_census(address: str, url: str) -> list[dict]:
    # URL-encode the address so spaces and commas are safe in the query string
    address = urllib.parse.quote_plus(address)
    add_url = f'address={address}&benchmark=4&format=json'
    data_url = f'{url}{add_url}'
    response = requests.get(data_url)
    return response.json()['result']['addressMatches']

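Here, we construct the address string from the ADDRESS, CITY, STATE, and ZIP columns defined above and send a GET request for each row of the dataframe. If the API returns matches, we keep one matched address together with its x (longitude) and y (latitude) coordinates; when there are several candidates, the code sorts them by matched address string and takes the first. We append four columns to the dataframe: the number of matches, the matched address, x, and y. If there were no matches, the address and coordinates are set to None instead.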
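
The selection step can be tried without calling the API. The sketch below runs on a hand-written dictionary that mimics only the parts of the JSON response this section reads (`matchedAddress` and `coordinates` inside `result.addressMatches`); it is not real API output, and `pick_match` is a hypothetical helper. Note that the ordering is alphabetical on the matched address, not a ranking by match quality.

```python
def pick_match(address_matches: list[dict]) -> tuple:
    """Return (count, matched_address, x, y) for a list of candidate matches."""
    if not address_matches:
        return 0, None, None, None
    # Same ordering as the loop in this section: matched address, descending.
    ranked = sorted(address_matches, key=lambda m: m["matchedAddress"], reverse=True)
    best = ranked[0]
    coords = best["coordinates"]
    return len(address_matches), best["matchedAddress"], coords["x"], coords["y"]

# Hand-written stand-in for response.json() (not real API output).
payload = {"result": {"addressMatches": [
    {"matchedAddress": "695 PARK AVE, CRANSTON, RI, 02910",
     "coordinates": {"x": -71.428536, "y": 41.777548}},
]}}

print(pick_match(payload["result"]["addressMatches"]))
print(pick_match([]))
```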

In [5]:
for idx, row in tqdm(df.iterrows(), total=len(df)):
    try:
        address = f'{row[ADDRESS]},{row[CITY]},{row[STATE]},{row[ZIP]}'
        request_data = get_request_census(address, lineaddr_url)
        match_ct = len(request_data)
        if match_ct > 0:
            matches = []
            for m in request_data:
                matches.append([row[ID], row[NAME],
                                m['matchedAddress'],
                                m['coordinates']['x'],
                                m['coordinates']['y']])
            # Sort candidates by matched address string and keep the first
            matches.sort(key=lambda match: match[2], reverse=True)
            match_address, x, y = matches[0][2:5]
        else:
            # No matches: the count is 0 and the location fields stay empty
            match_address, x, y = None, None, None
        df.loc[idx, 'matches'] = match_ct
        df.loc[idx, 'match_address'] = match_address
        df.loc[idx, 'x'] = x
        df.loc[idx, 'y'] = y

    except Exception:
        # Print the error and continue with the next row
        traceback.print_exc()
df

100%|██████████| 67/67 [00:17<00:00,  3.93it/s]


Unnamed: 0,org_ID,parent_ID,code,finance_code,name,name_short_30,name_short_15,org_type_ID,org_type,active,...,listing_order,sch_sub_type_ID,sch_sub_type_name,role_sort_order,source,OverRideSortOrder,matches,match_address,x,y
0,2829,,07353,,A Childs University - Cranston,A Childs University,A Childs Univer,2,School,Y,...,,7,"Independent (PK, Elem/Sec)",5.0,RIDE_Feb 16 2023 10:41AM,9999,1.0,"695 PARK AVE, CRANSTON, RI, 02910",-71.428536,41.777548
1,2928,,17304,,A Childs University - Smithfield,A Childs University,AChilds Univers,2,School,Y,...,,7,"Independent (PK, Elem/Sec)",5.0,RIDE_Feb 16 2023 10:41AM,9999,1.0,"370 WASHINGTON HWY, SMITHFIELD, RI, 02917",-71.509318,41.923054
2,2607,,32340,,Middlebridge School,Middlebridge School,Middlebridge,2,School,Y,...,,7,"Independent (PK, Elem/Sec)",5.0,RIDE_Feb 16 2023 10:41AM,950,1.0,"333 OCEAN RD, NARRAGANSETT, RI, 02882",-71.455340,41.414584
3,3232,,27306,,Sea Rose Montessori Co-op,Sea Rose Montessori Co-op,Sea Rose Montes,2,School,Y,...,,7,"Independent (PK, Elem/Sec)",,RIDE_Feb 16 2023 10:41AM,9999,1.0,"324 W MAIN RD, PORTSMOUTH, RI, 02871",-71.259192,41.606879
4,3363,42.0,709A1,,Seekonk Christian Academy,Seekonk Christian Academy,Seekonk Christi,2,School,Y,...,,7,"Independent (PK, Elem/Sec)",,RIDE_Feb 16 2023 10:41AM,9999,1.0,"95 SAGAMORE RD, SEEKONK, MA, 02771",-71.312616,41.805387
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
62,1365,,35371,,The Stork's Nest Child Academy III,The Stork's Nest,Stork's Nest,2,School,Y,...,,7,"Independent (PK, Elem/Sec)",5.0,RIDE_Feb 16 2023 10:41AM,9999,1.0,"1100 TOLL GATE RD, WARWICK, RI, 02886",-71.498012,41.711140
63,1438,,38306,,Islamic School of RI,Islamic School of RI,Islamic School,2,School,Y,...,,7,"Independent (PK, Elem/Sec)",5.0,RIDE_Feb 16 2023 10:41AM,9999,1.0,"840 PROVIDENCE ST, WEST WARWICK, RI, 02893",-71.491486,41.724252
64,1437,,38305,,The Tides School - West Warwick,The Tides School - WW,The Tides,2,School,Y,...,,7,"Independent (PK, Elem/Sec)",12.0,RIDE_Feb 16 2023 10:41AM,9999,1.0,"222 WASHINGTON ST, WEST WARWICK, RI, 02893",-71.527903,41.701037
65,1497,,39332,,Hillside Alternative Program,Hillside Alternative Program,Hillside,2,School,Y,...,,7,"Independent (PK, Elem/Sec)",5.0,RIDE_Feb 16 2023 10:41AM,1300,1.0,"141 MAIN ST, WOONSOCKET, RI, 02895",-71.513855,42.002683


We can save our geocoded file to CSV, with the match and coordinate data appended.

In [6]:
from datetime import date

today = str(date.today())
outfile = addfile.split('.')[0] + '_MATCHED_' + today + '.csv'
df.to_csv(outfile, index=False)