---
title: Spatiotemporal join with Geopandas
subtitle: Filter geospatial events with polygons + time intervals
date: 2025-07-16
categories: [geospatial, H3, KeplerGl, pings binning, Geolife, tutorial]
image: images/cover.png
toc: true
draft: false
colab: <a href="https://colab.research.google.com/github/SebastianoF/GeoDsBlog/blob/master/posts/gds-2025-07-16-spatiotemporal-join/index.ipynb" target="_blank"><img src="images/colab.svg"></a>
github: <a href="https://github.com/SebastianoF/GeoDsBlog/blob/master/posts/gds-2025-07-16-spatiotemporal-join/index.ipynb" target="_blank">  <img src="images/github.svg"> </a>
twitter-card:
  image: images/cover.png
---

Spatiotemporal join intro - in progress



### Outline

Goal: combine the spatial and the temporal join with geopandas.


- Datasets
  - Pings (id, lat, lon, timestamp)  (map)
  - Events (id, polygon, start_time, end_time)  (map)
  - Temporised polygons  (map)
- Joins
  - Pings in polygon  (illustration and map)
  - Events in polygons  (illustration and map)
  - Pings in events  (illustration and map)
- References and acknowledgements
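
As a minimal sketch of the last join in the outline (pings in events), a spatial join with `geopandas.sjoin` can be followed by a filter on the time interval. The toy data and the column names `ping_id` and `event_id` are assumptions made for illustration; this is one possible approach, not necessarily the one the finished post will develop.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Polygon

# Hypothetical toy pings: (id, lat, lon, timestamp)
df_pings = pd.DataFrame(
    {
        "ping_id": [0, 1, 2],
        "latitude": [0.5, 0.5, 2.0],
        "longitude": [0.5, 0.5, 2.0],
        "timestamp": pd.to_datetime(
            ["2025-01-01 10:00", "2025-01-01 13:00", "2025-01-01 10:30"]
        ),
    }
)
gdf_pings = gpd.GeoDataFrame(
    df_pings,
    geometry=gpd.points_from_xy(df_pings["longitude"], df_pings["latitude"]),
    crs="EPSG:4326",
)

# Hypothetical toy event: (id, polygon, start_time, end_time)
gdf_events = gpd.GeoDataFrame(
    {
        "event_id": ["event_a"],
        "start_time": pd.to_datetime(["2025-01-01 09:00"]),
        "end_time": pd.to_datetime(["2025-01-01 11:00"]),
    },
    geometry=[Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])],
    crs="EPSG:4326",
)

# Spatial step: keep only the pings falling within an event polygon
gdf_joined = gpd.sjoin(gdf_pings, gdf_events, how="inner", predicate="within")

# Temporal step: keep only the pings inside the event time interval (bounds included)
mask = (gdf_joined["timestamp"] >= gdf_joined["start_time"]) & (
    gdf_joined["timestamp"] <= gdf_joined["end_time"]
)
gdf_pings_in_events = gdf_joined[mask]

print(gdf_pings_in_events[["ping_id", "event_id", "timestamp"]])
```

In this toy example, ping `2` is discarded by the spatial step because it lies outside the polygon, and ping `1` is discarded by the temporal step because it falls after `end_time`.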


THE END. 

References:

- <https://stackoverflow.com/questions/63369715/filter-a-geopandas-dataframe-within-a-polygon-and-remove-from-the-dataframe-the>
- <https://www.youtube.com/watch?v=y85IKthrV-s&t=3s&ab_channel=JonathanSoma>


- @sec-python-env Python environment setup
- @sec-download-the-dataset Download geolife dataset
- @sec-load-dataset Load the dataset
- @sec-eda Exploratory data analysis (EDA)
- @sec-space-time-binning Space-time binning with H3
- @sec-dataset-profile-visualisation Dataset profile visualisation



::: {.callout-tip}
To create the `.gif` animation on a Mac, take a screen recording with software such as QuickTime Player and save it as `video.mov`. Then, following [this guide](https://gist.github.com/SheldonWangRJT/8d3f44a35c8d1386a396b9b49b43c385), run the command:

```bash
ffmpeg -i video.mov -pix_fmt rgb8 -r 10 output.gif && gifsicle -O3 output.gif -o output.gif
```

Make sure that `ffmpeg` and `gifsicle` are installed beforehand:

```bash
brew install ffmpeg
brew install gifsicle
```

:::

## Python environment setup {#sec-python-env}

Create a virtual environment and install the required libraries.

Suggested lightweight method:

```bash
virtualenv venv -p python3.11
source venv/bin/activate
pip install -r https://raw.githubusercontent.com/SebastianoF/GeoDsBlog/master/posts/gds-2024-06-20-dataset-profiling/requirements.txt
```

The requirements file is sourced directly from the repository and contains the following libraries, together with their pinned dependencies:

```text
altair==5.3.0
geopandas==1.0.1
gitpython==3.1.43
h3==3.7.7
keplergl==0.3.2
matplotlib==3.9.0
pyarrow==16.1.0
tqdm==4.66.4
```

You can look under `requirements.txt` [in the repository](https://github.com/SebastianoF/GeoDsBlog/blob/master/posts/gds-2024-06-20-dataset-profiling/requirements.txt) for the pinned dependency tree.


In [None]:
import datetime as dt
import io
import zipfile
from copy import deepcopy
from pathlib import Path

import altair as alt
import geopandas as gpd
import git
import h3
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests
from IPython.display import Image
from shapely import Polygon
from tqdm import tqdm

plt.style.use("dark_background")
alt.renderers.set_embed_options(theme="dark")

KEPLER_OUTPUT = False  # for blog visualisation: set to True if running in a Jupyter notebook

## Download the dataset {#sec-download-the-dataset}

The dataset can be downloaded manually, or by running the following commands:

In [None]:
url_download_geolife_dataset = "https://download.microsoft.com/download/F/4/8/F4894AA5-FDBC-481E-9285-D5F8C4C4F039/Geolife%20Trajectories%201.3.zip"

try:
    path_root = Path(git.Repo(Path().cwd(), search_parent_directories=True).git.rev_parse("--show-toplevel"))
except (git.exc.InvalidGitRepositoryError, ModuleNotFoundError):
    path_root = Path().cwd()

path_data_folder = path_root / "z_data"
path_data_folder.mkdir(parents=True, exist_ok=True)

path_unzipped_dataset = path_data_folder / "Geolife Trajectories 1.3"

if not path_unzipped_dataset.exists():
    r = requests.get(url_download_geolife_dataset)
    r.raise_for_status()  # stop early if the download failed
    z = zipfile.ZipFile(io.BytesIO(r.content))
    z.extractall(path_data_folder)

print("Dataset ready")

Dataset ready


## Load the dataset {#sec-load-dataset}

The folder structure of the Geolife dataset is not as linear as it could be.

- The trajectories are divided into 182 folders numbered from `000` to `181`. 
- Each folder contains a subfolder called `Trajectory` containing a series of `.plt` files.
- Some folders also contain a file called `labels.txt`, containing time intervals and transportation modes.
- There are two different datetime formats.
- Within the zip folder there is also a pdf file with the dataset description.
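
The folder layout described above can be inspected with a short `pathlib` sketch. The function name `summarise_geolife_folders` and the keys of the returned dictionary are hypothetical, introduced only for illustration. The function counts the `.plt` files of each device folder and checks whether a `labels.txt` file is present.

```python
from pathlib import Path


def summarise_geolife_folders(path_data: Path) -> dict[str, dict]:
    """Count the .plt files and check for labels.txt in each device folder."""
    summary = {}
    # Each sub-folder of path_data is expected to correspond to one device
    for device_folder in sorted(p for p in path_data.iterdir() if p.is_dir()):
        plt_files = list((device_folder / "Trajectory").glob("*.plt"))
        summary[device_folder.name] = {
            "num_plt_files": len(plt_files),
            "has_labels": (device_folder / "labels.txt").exists(),
        }
    return summary
```

With the dataset downloaded as in the previous section, it could be called as `summarise_geolife_folders(path_unzipped_dataset / "Data")`.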

To load the files, I implemented a class that parses them and enriches them with the labels when these are available. The time formats are converted to pandas datetimes[^1].

The main parser method also has a boolean flag, `parse_only_if_labels`. It is `False` by default; when set to `True`, only the devices that have a labels file are parsed.
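
To illustrate the two datetime formats, here is a minimal sketch that parses both into the same pandas datetime type with `pd.to_datetime`. The format codes are the ones used by the loader below; the sample strings are illustrative values, not read from the dataset.

```python
import pandas as pd

LABELS_FC = "%Y/%m/%d %H:%M:%S"  # format found in labels.txt
TRAJECTORY_FC = "%Y-%m-%d %H:%M:%S"  # date and time columns of the .plt files, joined by a space

# Illustrative values, not read from the dataset
se_trajectory = pd.Series(["2009-01-03 01:21:34", "2009-01-03 01:21:35"])
se_labels = pd.Series(["2009/01/03 01:21:34", "2009/01/03 01:30:00"])

# Vectorised parsing: both series end up with a datetime64 dtype
ts_trajectory = pd.to_datetime(se_trajectory, format=TRAJECTORY_FC)
ts_labels = pd.to_datetime(se_labels, format=LABELS_FC)

print(ts_trajectory.dtype, ts_labels.dtype)
```

Once both are parsed, timestamps from the trajectories and from the labels can be compared directly, which is what the label assignment relies on.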

[^1]: For a refresher about datetime manipulation, [here](https://sebastianof.github.io/GeoDsBlog/posts/gds-2024-01-24-timestamp/) is a detailed tutorial on this topic.


In [None]:
# Column names

COLS_TRAJECTORY = [
    "latitude",
    "longitude",
    "none",
    "altitude",
    "date_elapsed",
    "date",
    "time",
]

COLS_TRAJECTORY_TO_LOAD = [
    "latitude",
    "longitude",
    "altitude",
    "date",
    "time",
]

COLS_LABELS = [
    "start_time",
    "end_time",
    "transportation_mode",
]

COLS_RESULTS = [
    "entity_id",
    "latitude",  # degrees
    "longitude",  # degrees
    "altitude",  # meters (float, converted from feet)
    "timestamp",  # pandas datetime
    "transport",  # only if the labels.txt file is there
]

# Format codes

LABELS_FC = "%Y/%m/%d %H:%M:%S"
TRAJECTORY_FC = "%Y-%m-%d %H:%M:%S"


class GeoLifeDataLoader:
    def __init__(self, path_to_geolife_folder: str | Path) -> None:
        self.pfo_geolife = path_to_geolife_folder
        pfo_data = Path(self.pfo_geolife) / "Data"
        self.dir_per_subject: dict[int, Path] = {int(f.name): f for f in pfo_data.iterdir() if f.is_dir()}

    def to_pandas_per_device(
        self,
        device_number: int,
        leave_progressbar=True,
    ) -> tuple[pd.DataFrame, pd.DataFrame | None]:
        path_to_device_folder = self.dir_per_subject[device_number]
        pfo_trajectory = path_to_device_folder / "Trajectory"
        pfi_labels = path_to_device_folder / "labels.txt"
        df_labels = None
        df_trajectory = None

        list_trajectories = [plt_file for plt_file in pfo_trajectory.iterdir() if str(plt_file).endswith(".plt")]
        list_dfs = []
        for traj in tqdm(list_trajectories, leave=leave_progressbar):
            df_sourced = pd.read_csv(traj, skiprows=6, names=COLS_TRAJECTORY, usecols=COLS_TRAJECTORY_TO_LOAD)
            df_sourced["altitude"] = df_sourced["altitude"] * 0.3048  # feet to meters
            df_sourced["timestamp"] = df_sourced.apply(
                lambda x: dt.datetime.strptime(x["date"] + " " + x["time"], TRAJECTORY_FC),
                axis=1,
            )
            df_sourced = df_sourced.assign(entity_id=f"device_{device_number}", transport=None)
            df_sourced = df_sourced[COLS_RESULTS]
            list_dfs.append(df_sourced)
        df_trajectory = pd.concat(list_dfs)

        if pfi_labels.exists():
            df_labels = pd.read_csv(pfi_labels, sep="\t")
            df_labels.columns = COLS_LABELS
            df_labels["start_time"] = df_labels["start_time"].apply(lambda x: dt.datetime.strptime(x, LABELS_FC))
            df_labels["end_time"] = df_labels["end_time"].apply(lambda x: dt.datetime.strptime(x, LABELS_FC))
        if df_labels is not None:
            for _, row in tqdm(df_labels.iterrows(), leave=leave_progressbar):
                mask = (df_trajectory.timestamp > row.start_time) & (df_trajectory.timestamp <= row.end_time)
                df_trajectory.loc[mask, "transport"] = row["transportation_mode"]

        return df_trajectory, df_labels

    def to_pandas(self, parse_only_if_labels: bool = False) -> pd.DataFrame:
        list_dfs_final = []
        for idx in tqdm(self.dir_per_subject.keys()):
            df_trajectory, df_labels = self.to_pandas_per_device(idx, leave_progressbar=False)
            if parse_only_if_labels:
                if df_labels is not None:
                    list_dfs_final.append(df_trajectory)
                else:
                    pass
            else:
                list_dfs_final.append(df_trajectory)

        return pd.concat(list_dfs_final)


### Load and save to parquet 

To speed up subsequent loads, I save the dataset to parquet after parsing the `.plt` files for the first time.

This way, reloading the dataset to continue the analysis takes a few seconds instead of several minutes.

In [4]:
path_complete_parquet = path_data_folder / "GeolifeTrajectories.parquet"

if not path_complete_parquet.exists():
    # this takes about 5 minutes
    gdl = GeoLifeDataLoader(path_unzipped_dataset)
    df_geolife = gdl.to_pandas(parse_only_if_labels=True)
    df_geolife = df_geolife.reset_index(drop=True)
    df_geolife.to_parquet(path_complete_parquet)

In [5]:
# this takes 2 seconds (MacBook Air, 8GB RAM)
df_geolife = pd.read_parquet(path_complete_parquet)
print(df_geolife.shape)
df_geolife.head()

(24876977, 6)


|   | entity_id | latitude | longitude | altitude | timestamp | transport |
|---|---|---|---|---|---|---|
| 0 | device_135 | 39.974294 | 116.399741 | 149.9616 | 2009-01-03 01:21:34 | |
| 1 | device_135 | 39.974292 | 116.399592 | 149.9616 | 2009-01-03 01:21:35 | |
| 2 | device_135 | 39.974309 | 116.399523 | 149.9616 | 2009-01-03 01:21:36 | |
| 3 | device_135 | 39.97432 | 116.399588 | 149.9616 | 2009-01-03 01:21:38 | |
| 4 | device_135 | 39.974365 | 116.39973 | 149.6568 | 2009-01-03 01:21:39 | |
