# Geocoding in geopandas

Geopandas supports geocoding via a library called
[geopy](http://geopy.readthedocs.io/), which needs to be installed to use
[geopandas’ `geopandas.tools.geocode()`
function](https://geopandas.org/en/stable/docs/reference/api/geopandas.tools.geocode.html).
`geocode()` expects a `list` or `pandas.Series` of addresses (strings) and
returns a `GeoDataFrame` with resolved addresses and point geometries.

Let’s try this out.

We will geocode addresses stored in a semicolon-separated text file called
`addresses.txt`. These addresses are located in the Helsinki Region in Southern
Finland.

In [None]:
import pathlib
NOTEBOOK_PATH = pathlib.Path().resolve()
DATA_DIRECTORY = NOTEBOOK_PATH / "data"

In [None]:
import pandas
addresses = pandas.read_csv(
    DATA_DIRECTORY / "helsinki_addresses" / "addresses.txt",
    sep=";"
)

addresses.head()

We have an `id` for each row and an address in the `addr` column.


## Geocode addresses using *Nominatim*

In our example, we will use *Nominatim* as a *geocoding provider*. [*Nominatim*](https://nominatim.org/) is a library and service that uses OpenStreetMap data and is run by the OpenStreetMap Foundation. Geopandas’
[`geocode()`
function](https://geopandas.org/en/stable/docs/reference/api/geopandas.tools.geocode.html) supports it natively.

<div style="border: 1px solid #cce5ff; background-color: #e9f7fd; padding: 15px; border-radius: 5px;">

**Fair-use**

[Nominatim’s terms of use](https://operations.osmfoundation.org/policies/nominatim/) require that users of the service ensure they don’t send more frequent requests than one per second and that a custom **user-agent** string is attached to each query.

Geopandas’ implementation allows us to specify a `user_agent`, and the library also takes care of respecting Nominatim's rate limit.

Looking up an address is quite an expensive database operation. This is why the public, free-to-use Nominatim server sometimes takes slightly longer to respond. In this example, we add the parameter `timeout=10` to wait up to 10 seconds for a response.

</div>
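To make the one-request-per-second rule concrete, here is a minimal sketch of what throttling could look like. Geopandas already handles this for Nominatim, so nothing like it is needed in practice. The names `throttled_lookup()` and `fake_lookup()` are hypothetical, the addresses are placeholders, and no network request is made.

```python
import time


def throttled_lookup(queries, lookup, min_interval=1.0):
    """Call `lookup` once per query, at least `min_interval` seconds apart."""
    results = []
    last_call = None
    for query in queries:
        if last_call is not None:
            # sleep only for the part of the interval that has not yet elapsed
            remaining = min_interval - (time.monotonic() - last_call)
            if remaining > 0:
                time.sleep(remaining)
        last_call = time.monotonic()
        results.append(lookup(query))
    return results


def fake_lookup(address):
    # stand-in for a real geocoder call; it only echoes the address
    return f"resolved: {address}"


# a short interval keeps the demonstration quick; Nominatim would need 1.0
print(throttled_lookup(["Address A", "Address B"], fake_lookup, min_interval=0.1))
```

The first query is sent immediately, and every later query waits until the interval since the previous call has passed.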



In [None]:
import geopandas

geocoded_addresses = geopandas.tools.geocode(
    addresses["addr"],
    provider="nominatim",
    user_agent="autogis2023",
    timeout=10
)
geocoded_addresses.head()

Et voilà! As a result we received a `GeoDataFrame` that contains a parsed
version of our original addresses and a `geometry` column of
`shapely.geometry.Point`s that we can use, for instance, to export the data to
a geospatial data format.

However, the `id` column was discarded in the process. To combine the input
data set with our result set, we can use pandas’ [*join*
operations](https://pandas.pydata.org/docs/user_guide/merging.html).


## Join data frames

> **Note: Joining data sets using pandas**

> For a comprehensive overview of different ways of combining DataFrames and Series based on set theory, see the pandas documentation on [merge, join, and concatenate](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html).



Joining data from two or more data frames or tables is a common task in many
(spatial) data analysis workflows. As you might remember from our earlier
lessons, combining data from different tables based on a common **key** attribute
can be done easily in pandas/geopandas using the [`merge()`
function](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html).
We used this approach in [exercise 6 of the Geo-Python
course](https://geo-python-site.readthedocs.io/en/latest/lessons/L6/exercise-6.html#joining-data-from-one-dataframe-to-another).
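As a brief reminder of how a key-based merge behaves, the following sketch uses two small, made-up tables. The `visitors` table and all of its values are assumptions for illustration and are not part of the lesson data.

```python
import pandas

# hypothetical sample tables; the values are made up for illustration
addresses = pandas.DataFrame(
    {"id": [1000, 1001, 1002], "addr": ["Address A", "Address B", "Address C"]}
)
# deliberately listed in a different row order
visitors = pandas.DataFrame({"id": [1002, 1000, 1001], "visitors": [30, 10, 20]})

# rows are matched on the values in `id`, not on their position;
# validate="one_to_one" raises an error if a key occurs more than once
merged = addresses.merge(visitors, on="id", how="left", validate="one_to_one")
print(merged)
```

Because the match is made on the key values, the differing row order of the two tables does not matter.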

However, sometimes it is useful to join two data frames together based on their
**index**. The data frames have to have the **same number of records** and
**share the same index** (simply put, they should have the same order of rows).

We can use this approach here to join information from the original data
frame `addresses` to the geocoded addresses `geocoded_addresses`, row by row.
The `join()` method, by default, joins two data frames based on their index.
This works correctly for our example, as both data frames have the same index
and the same order of rows.
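Before relying on an index-based join, it can be worth checking that the two indexes really match. The sketch below uses small stand-in tables (`left` and `right` are hypothetical names with made-up values) to show the check, and to show that `join()` pairs rows by index label rather than by position.

```python
import pandas

# stand-ins for the geocoding result and the original table
left = pandas.DataFrame({"address": ["Result A", "Result B", "Result C"]})
right = pandas.DataFrame({"id": [1000, 1001, 1002]})

# both frames carry the default index 0..2, so a row-by-row join is safe
assert left.index.equals(right.index)
joined = left.join(right)
print(joined)

# join() matches index labels, not positions: even with the rows of `right`
# reversed, each id ends up next to the row with the same label
joined_reversed = left.join(right.iloc[::-1])
print(joined_reversed)
```

If one of the tables had been re-indexed after sorting or filtering, the labels would no longer describe the same records, and a key-based `merge()` would be the safer choice.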

In [None]:
geocoded_addresses_with_id = geocoded_addresses.join(addresses)
geocoded_addresses_with_id

The output of `join()` is a new `geopandas.GeoDataFrame`:

In [None]:
type(geocoded_addresses_with_id)

The new data frame has all original columns plus new columns for the `geometry`
and for a parsed `address` that can be used to spot-check the results.

> **Note**

> If you perform the join the other way around, i.e., `addresses.join(geocoded_addresses)`, the output is a `pandas.DataFrame`, not a `geopandas.GeoDataFrame`.
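A minimal sketch of this behaviour, using small stand-in tables with made-up coordinates instead of real geocoding results: the plain data frame still carries the geometry column, so it can be turned back into a `GeoDataFrame` by passing it to the `geopandas.GeoDataFrame` constructor.

```python
import geopandas
import pandas
from shapely.geometry import Point

# hypothetical stand-ins for `addresses` and `geocoded_addresses`
addresses = pandas.DataFrame({"id": [1000, 1001], "addr": ["Address A", "Address B"]})
geocoded = geopandas.GeoDataFrame(
    {"address": ["Result A", "Result B"]},
    geometry=[Point(24.93, 60.17), Point(24.95, 60.20)],
    crs="EPSG:4326",
)

# joining from the pandas side returns a plain DataFrame
plain = addresses.join(geocoded)
print(type(plain))

# the geometry column is still there, so the result can be converted back
restored = geopandas.GeoDataFrame(plain, geometry="geometry")
print(type(restored))
```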



---


It’s now easy to save the new data set as a geospatial file, for instance, in
*GeoPackage* format:

In [None]:
# delete a possibly existing file, as it causes
# trouble if sphinx is run repeatedly
(DATA_DIRECTORY / "addresses.gpkg").unlink(missing_ok=True)

In [None]:
geocoded_addresses_with_id.to_file(DATA_DIRECTORY / "addresses.gpkg")

<div style="border: 1px solid #ffeeba; background-color: #fff3cd; padding: 15px; border-radius: 5px;">

**Attention: Understanding the difference between `join` and `merge` in GeoPandas**

A `GeoDataFrame` inherits both the `join()` and the `merge()` method from pandas. They may seem similar, but they match rows in different ways.

1. **`join`**: 
   - This is primarily used for joining GeoDataFrames with a shared index. It works similarly to a SQL join based on the index of the two tables.
   - It is ideal for adding columns from one GeoDataFrame to another based on the index or a pre-aligned structure.
   
2. **`merge`**:
   - `merge` allows more flexibility by enabling joins based on specific columns, not just the index. It works similarly to `pd.merge` in pandas.
   - It is useful for attribute joins, when you want to match features based on values in specific columns rather than the index. Joins based on location (spatial joins) are a different operation, handled by `geopandas.sjoin()`.
   
### Example

```python
import geopandas as gpd

# Sample GeoDataFrames
gdf1 = gpd.GeoDataFrame({
    'ID': [1, 2, 3],
    'Name': ['Park', 'Lake', 'Forest'],
    'geometry': gpd.points_from_xy([10, 20, 30], [10, 20, 30])
})

gdf2 = gpd.GeoDataFrame({
    'ID': [1, 2, 3],
    'Area_km2': [1.5, 2.1, 3.3]
})

# Using `join` - joins based on index
joined = gdf1.set_index('ID').join(gdf2.set_index('ID'))
print("Using `join`:\n", joined)

# Using `merge` - joins based on a column
merged = gdf1.merge(gdf2, on='ID')
print("Using `merge`:\n", merged)
