# Aggregating data

Data aggregation refers to a process where we combine data into groups. When
doing spatial data aggregation, we merge the geometries together into coarser
units (based on some attribute), and can also calculate summary statistics for
these combined geometries from the original, more detailed values. For example,
suppose that we are interested in studying continents, but we only have
country-level data like the country dataset. If we aggregate the data by
continent, we would convert the country-level data into a continent-level
dataset.

In this tutorial, we will aggregate our travel time data by car travel times
(column `car_r_t`), i.e. the grid cells that have the same travel time to the
Helsinki Railway Station will be merged together.

Let’s start by loading `intersection.gpkg`, the output file of the
[previous section](overlay-analysis):

```python
import pathlib

NOTEBOOK_PATH = pathlib.Path().resolve()
DATA_DIRECTORY = NOTEBOOK_PATH / "data"
```

```python
import geopandas

intersection = geopandas.read_file(DATA_DIRECTORY / "intersection.gpkg")
```

To do the aggregation, we use the `dissolve()` method. Its `by` parameter takes
the name of the column whose values define the groups to merge:
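With the tutorial data, the call would be `dissolved = intersection.dissolve(by="car_r_t")`. The sketch below shows the same call on a made-up three-cell grid so that it runs without the data file. The names `toy_grid`, `toy_dissolved` and the `pop` column are illustrative assumptions, not part of the tutorial data.

```python
import geopandas
from shapely.geometry import box

# Hypothetical toy data: three unit squares in a row, two of which share
# the same travel time (15 minutes).
toy_grid = geopandas.GeoDataFrame(
    {"car_r_t": [15, 15, 20], "pop": [10, 20, 30]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1), box(2, 0, 3, 1)],
)

# Merge all cells that share a `car_r_t` value into one geometry per value.
toy_dissolved = toy_grid.dissolve(by="car_r_t")

print(f"Rows before: {len(toy_grid)}, rows after: {len(toy_dissolved)}")
```

The two cells with a travel time of 15 end up as a single polygon, so the toy layer shrinks from three rows to two.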

Let’s compare the number of cells in the layers before and after the
aggregation:

```python
print(f"Rows in original intersection GeoDataFrame: {len(intersection)}")
print(f"Rows in dissolved layer: {len(dissolved)}")
```

Indeed, the number of rows in our data has decreased, and the polygons were
merged together.

What actually happened here? Let's take a closer look. 

Let's see what columns we have now in our GeoDataFrame:
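In the tutorial this is presumably a look at `dissolved.columns`. As a small sketch under that assumption, the same inspection on a made-up grid (`toy_grid` and `pop` are hypothetical names) looks like this:

```python
import geopandas
from shapely.geometry import box

# Hypothetical toy data standing in for the tutorial's `intersection` layer.
toy_grid = geopandas.GeoDataFrame(
    {"car_r_t": [15, 15, 20], "pop": [10, 20, 30]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1), box(2, 0, 3, 1)],
)
toy_dissolved = toy_grid.dissolve(by="car_r_t")

# List the columns that remain after dissolving.
print(toy_dissolved.columns)
```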

As we can see, the column that we used for the aggregation (`car_r_t`) no
longer appears in the list of columns. What happened to it?

Let’s take a look at the indices of our GeoDataFrame:
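With the tutorial data this would presumably be `dissolved.index`. A minimal sketch on a made-up grid (again, `toy_grid` and `pop` are hypothetical names) shows what to expect:

```python
import geopandas
from shapely.geometry import box

# Hypothetical toy data standing in for the tutorial's `intersection` layer.
toy_grid = geopandas.GeoDataFrame(
    {"car_r_t": [15, 15, 20], "pop": [10, 20, 30]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1), box(2, 0, 3, 1)],
)
toy_dissolved = toy_grid.dissolve(by="car_r_t")

# The `by` column becomes the index; its name is kept as the index name.
print(toy_dissolved.index)
print(toy_dissolved.index.name)
```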

Aha! Now we understand where our column went: it is now used as the index of
our `dissolved` GeoDataFrame.

Now we can, for example, select only those geometries from the layer that are
exactly 15 minutes away from the Helsinki Railway Station:
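Because the travel time is now the index, a label-based lookup with `.loc` does the job; with the tutorial data this would presumably read `dissolved.loc[15]`. A sketch on made-up data (`toy_grid` and `pop` are hypothetical names):

```python
import geopandas
from shapely.geometry import box

# Hypothetical toy data standing in for the tutorial's `intersection` layer.
toy_grid = geopandas.GeoDataFrame(
    {"car_r_t": [15, 15, 20], "pop": [10, 20, 30]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1), box(2, 0, 3, 1)],
)
toy_dissolved = toy_grid.dissolve(by="car_r_t")

# Look up the row whose index label (travel time) is exactly 15.
toy_row = toy_dissolved.loc[15]

print(type(toy_row))
print(toy_row)
```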

As we can see, the result is a pandas `Series` object containing one row of
our aggregated GeoDataFrame.

Let’s also visualize those 15-minute grid cells.

First, we need to convert the selected row back to a GeoDataFrame:
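One way to do this, sketched here on made-up data, is to wrap the row in a list and pass it to the `GeoDataFrame` constructor together with the layer's CRS. The names `toy_grid`, `toy_selection` and `pop` are hypothetical, and the toy layer has no CRS, so `crs` is simply `None` here.

```python
import geopandas
from shapely.geometry import box

# Hypothetical toy data standing in for the tutorial's `intersection` layer.
toy_grid = geopandas.GeoDataFrame(
    {"car_r_t": [15, 15, 20], "pop": [10, 20, 30]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1), box(2, 0, 3, 1)],
)
toy_dissolved = toy_grid.dissolve(by="car_r_t")

# Build a one-row GeoDataFrame from the selected Series,
# carrying over the CRS of the dissolved layer.
toy_selection = geopandas.GeoDataFrame(
    [toy_dissolved.loc[15]], crs=toy_dissolved.crs
)

print(type(toy_selection))
```

Selecting with a list of labels, `toy_dissolved.loc[[15]]`, is an alternative that returns a one-row GeoDataFrame directly.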

Plot the selection on top of the entire grid:
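A minimal sketch of such an overlay, assuming matplotlib is installed and using made-up data (`toy_grid`, `toy_selection` and `pop` are hypothetical names): the full layer is drawn first, and the selection is drawn on the same axes via the `ax` parameter.

```python
import geopandas
from shapely.geometry import box

# Hypothetical toy data standing in for the tutorial's `intersection` layer.
toy_grid = geopandas.GeoDataFrame(
    {"car_r_t": [15, 15, 20], "pop": [10, 20, 30]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1), box(2, 0, 3, 1)],
)
toy_dissolved = toy_grid.dissolve(by="car_r_t")
toy_selection = toy_dissolved.loc[[15]]  # one-row GeoDataFrame

# Draw the whole layer in gray, then the selection in red on the same axes.
ax = toy_dissolved.plot(facecolor="gray")
toy_selection.plot(ax=ax, facecolor="red")
```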

Another way to visualize the travel times in the entire GeoDataFrame is to color the plot by the values of one column. To use `car_r_t`, which is now the index of the GeoDataFrame, we first need to reset the index:
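With the tutorial data this would presumably be `dissolved = dissolved.reset_index()`. The same step on made-up data (`toy_grid` and `pop` are hypothetical names) can be sketched as:

```python
import geopandas
from shapely.geometry import box

# Hypothetical toy data standing in for the tutorial's `intersection` layer.
toy_grid = geopandas.GeoDataFrame(
    {"car_r_t": [15, 15, 20], "pop": [10, 20, 30]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1), box(2, 0, 3, 1)],
)
toy_dissolved = toy_grid.dissolve(by="car_r_t")

# Move `car_r_t` from the index back into an ordinary column.
toy_dissolved = toy_dissolved.reset_index()

print(toy_dissolved.columns)
```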

As we can see, we now have our `car_r_t` as a column again, and can then plot the GeoDataFrame passing this column using the `column` parameter:
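A short sketch of such a plot, assuming matplotlib is installed and using made-up data (`toy_grid` and `pop` are hypothetical names; `legend=True` is an optional extra that adds a color bar):

```python
import geopandas
from shapely.geometry import box

# Hypothetical toy data standing in for the tutorial's `intersection` layer.
toy_grid = geopandas.GeoDataFrame(
    {"car_r_t": [15, 15, 20], "pop": [10, 20, 30]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1), box(2, 0, 3, 1)],
)
toy_dissolved = toy_grid.dissolve(by="car_r_t").reset_index()

# Color each dissolved polygon by its travel time.
ax = toy_dissolved.plot(column="car_r_t", legend=True)
```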

> **How Are Other Columns Aggregated During `dissolve`?**  
> 
> When using the `dissolve` method in GeoPandas (e.g., `dissolved = intersection.dissolve(by="car_r_t")`), here's how other columns are aggregated:
> 
> ### Default Behavior:
> - **Default Aggregation Function:** `aggfunc='first'`
>   - Keeps the **first value** from each group for columns that are not involved in the aggregation (i.e., not the `by` column).
>   - For multiple rows grouped together, only the first row's values are retained for other columns.
> 
> ### Custom Aggregation:
> You can control how other columns are aggregated using the `aggfunc` parameter:
> ```python
> dissolved = intersection.dissolve(by="car_r_t", aggfunc="sum")
> ```
> Supported aggregation functions include:
> - `"sum"`: Sum of the values in the group.
> - `"mean"`: Average of the values in the group.
> - `"min"`: Minimum value in the group.
> - `"max"`: Maximum value in the group.
> - `"first"`: First value in the group (default).
> - `"last"`: Last value in the group.
> - Custom aggregation using a `lambda` function.
> 
> ### Using Multiple Aggregations:
> To apply different aggregations to different columns, pass a dictionary that maps column names to aggregation functions as `aggfunc`:
> ```python
> # `column1`, `column2` and `column3` are placeholder column names
> dissolved = intersection.dissolve(
>     by="car_r_t",
>     aggfunc={
>         "column1": "sum",
>         "column2": "mean",
>         "column3": "max",
>     },
> )
> ```
> 
> ### Geometry Aggregation:
> - The geometries in the grouped rows are **merged (unioned)** into a single geometry for each group.
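
To make the effect of `aggfunc` concrete, here is a small sketch on made-up data. The names `toy_grid`, `pop` and `jobs` are hypothetical and only serve the illustration.

```python
import geopandas
from shapely.geometry import box

# Hypothetical toy data: two cells with travel time 15, one with 20.
toy_grid = geopandas.GeoDataFrame(
    {"car_r_t": [15, 15, 20], "pop": [10, 20, 30], "jobs": [1, 5, 2]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1), box(2, 0, 3, 1)],
)

# Default: keep the first value of each group.
toy_first = toy_grid.dissolve(by="car_r_t")

# Sum every non-geometry column within each group.
toy_sum = toy_grid.dissolve(by="car_r_t", aggfunc="sum")

# Different function per column, given as a dictionary.
toy_mixed = toy_grid.dissolve(
    by="car_r_t", aggfunc={"pop": "sum", "jobs": "max"}
)

print(toy_first[["pop", "jobs"]])
print(toy_sum[["pop", "jobs"]])
print(toy_mixed[["pop", "jobs"]])
```

For the 15-minute group, the default keeps the values of the first cell, while `"sum"` adds the two cells together. The geometries are unioned in the same way in all three cases.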