# YouthMappers HOT Tasking Manager project analysis
In this notebook we demonstrate how to combine two datasets: HOT Tasking Manager project data and OSM contributions.

We want to investigate all contributions that have been made in OSM via HOT's Tasking Manager. We will furthermore filter these contributions by using the **OSM Changeset** information.

The notebook walks through these steps:
* Set the connection parameters.
* Prepare your input parameters, e.g. define HOT Tasking Manager project ID.
* Download data using DuckDB. This time we also download HOT project data.
* Filter OSM contributions using changeset attributes.
* Display both datasets on a map.

## Getting started
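Set the connection parameters. The S3 credentials are read from environment variables; `load_dotenv()` loads them from a local `.env` file if one exists.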

In [1]:
from dotenv import load_dotenv
load_dotenv()

True

In [2]:
import os

s3_user = os.environ["S3_ACCESS_KEY_ID"]  # add your user here
s3_password = os.environ["S3_SECRET_ACCESS_KEY"]  # add your password here
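
If one of the variables is missing, `os.environ[...]` raises a bare `KeyError`. A minimal sketch of a friendlier check, assuming a hypothetical helper named `require_env` (it is not part of the original notebook):

```python
import os


def require_env(name, environ=None):
    """Return the value of an environment variable or fail with a clear message.

    `require_env` is a hypothetical helper, shown for illustration only.
    """
    environ = os.environ if environ is None else environ
    value = environ.get(name)
    if not value:
        raise RuntimeError(
            f"environment variable {name} is not set; add it to your .env file"
        )
    return value
```

With this helper, the cell above could read `s3_user = require_env("S3_ACCESS_KEY_ID")`.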

Configure DuckDB.

In [None]:
!pip install duckdb==1.0.0

In [3]:
import duckdb

con = duckdb.connect(
    config={
        'threads': 32,
        'max_memory': '50GB',
        # 'enable_object_cache': True
    }
)
con.install_extension("spatial")
con.load_extension("spatial")

Set the connection parameters for the Iceberg REST catalog. (We need this for the OSM data.)

In [4]:
from pyiceberg.catalog.rest import RestCatalog

catalog = RestCatalog(
    name="default",
    **{
        "uri": "https://sotm2024.iceberg.ohsome.org",
        "s3.endpoint": "https://sotm2024.minio.heigit.org",
        "py-io-impl": "pyiceberg.io.pyarrow.PyArrowFileIO",
        "s3.access-key-id": s3_user,
        "s3.secret-access-key": s3_password,
        "s3.region": "eu-central-1"
    }
)

Set up the connection to the MinIO object storage. (We need this for the HOT Tasking Manager data.)

In [5]:
query = f"""
DROP SECRET IF EXISTS "__default_s3";
CREATE SECRET (
      TYPE S3,
      KEY_ID '{s3_user}',
      SECRET '{s3_password}',
      REGION 'eu-central-1',
      endpoint 'sotm2024.minio.heigit.org',
      use_ssl true,
      url_style 'path'
  );
"""
con.sql(query).show()

┌─────────┐
│ Success │
│ boolean │
├─────────┤
│ true    │
└─────────┘



## Prepare the input parameters for your analysis

In [6]:

# Set iceberg table
namespace = 'geo_sort'
tablename = 'contributions'
icebergtable = catalog.load_table((namespace, tablename))

hot_tm_project_ids = [
    13582,
    16369,
    16512,
    13046,
    12975,
    12761,
    12760,
    12707,
    12661,
    15996
]

teach_osm_project_ids = [
    1450,
    1457,
    1247,
    1525,
    1502,
    1470,
    1595,
    1488
]

## Get information about a HOT Tasking Manager project

In [7]:
import geopandas as gpd

def get_hot_tm_project_info(project_id):
    hot_tm_parquet_data_path = "s3a://heigit-ohsome-sotm24/data/hot_tasking_manager/**"
    
    query = f"""
    SELECT *
    FROM read_parquet('{hot_tm_parquet_data_path}') a
    WHERE project_id = {project_id};
    """
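    df = con.sql(query).df()
    if df.empty:
        raise ValueError(f"no HOT Tasking Manager project found for ID {project_id}")

    # bounding box of the project area, later accessed via bbox['xmin'], bbox['xmax'], ...
    bbox = df["bbox"].values[0]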
    
    # convert the data to geodata
    gdf = gpd.GeoDataFrame(
        df,
        geometry=gpd.GeoSeries.from_wkt(df['geometry'])
    ).set_crs('epsg:4326')

    return gdf, bbox

Display where the project is located.

## Get Teach OSM project info

In [25]:
def get_teach_osm_tm_project_info(project_id):

    url = f"https://tasking-manager-tm4-teachosm-api.hotosm.org/api/v2/projects/{project_id}/queries/aoi/"
    gdf = gpd.read_file(url)
    bbox = {
        "xmin": gdf.bounds["minx"][0],
        "xmax": gdf.bounds["maxx"][0],
        "ymin": gdf.bounds["miny"][0],
        "ymax": gdf.bounds["maxy"][0] 
    }
    return gdf, bbox
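
The bounding box built in this function is simply the minimum and maximum of the AOI's coordinates. Note that `gdf.bounds[...][0]` reads the bounds of the first feature only, which is sufficient as long as the AOI endpoint returns a single feature. A minimal sketch in plain Python of the same computation over a whole GeoJSON geometry, assuming a hypothetical helper `geojson_bbox` and a geometry dictionary with a `coordinates` member:

```python
def iter_positions(coords):
    """Yield (x, y) pairs from arbitrarily nested GeoJSON coordinate arrays."""
    if coords and isinstance(coords[0], (int, float)):
        # a single position: [x, y] or [x, y, z]
        yield coords[0], coords[1]
    else:
        for part in coords:
            yield from iter_positions(part)


def geojson_bbox(geometry):
    """Return the bounding box as a dict with the keys used in this notebook.

    `geojson_bbox` is a hypothetical helper, shown for illustration only.
    """
    xs, ys = zip(*iter_positions(geometry["coordinates"]))
    return {"xmin": min(xs), "xmax": max(xs), "ymin": min(ys), "ymax": max(ys)}
```

The resulting dictionary has the same keys (`xmin`, `xmax`, `ymin`, `ymax`) that `download_osm_data` expects.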

## Download OSM data

In [9]:
import time


def download_osm_data(bbox):
    start_time = time.time()
    osm_data = icebergtable.scan(
        row_filter=(
            "(status = 'latest' or status = 'history') "
            f"and (xmax >= {bbox['xmin']} and xmin <= {bbox['xmax']}) "
            f"and (ymax >= {bbox['ymin']} and ymin <= {bbox['ymax']}) "
        ),
        selected_fields=(
            "user_id",
            "osm_id",
            "osm_version",
            "valid_from",
            "valid_to",
            "tags",
            "tags_before",
            "changeset",
            "geometry",
        ),
    )
    
    # scan() only builds a lazy scan plan; the rows are read later, in to_duckdb()
    download_time = round(time.time() - start_time, 3)
    print(f"iceberg scan took {download_time} sec.")

    return osm_data
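
The row filter in `download_osm_data` is a standard bounding-box overlap test: a contribution is kept when its box and the project box overlap on both axes. A minimal sketch of that logic in plain Python, assuming hypothetical helpers `bbox_intersects` and `build_row_filter` (neither is part of the original notebook):

```python
def bbox_intersects(a, b):
    """True if two boxes with keys xmin, xmax, ymin, ymax overlap or touch."""
    return (
        a["xmax"] >= b["xmin"] and a["xmin"] <= b["xmax"]
        and a["ymax"] >= b["ymin"] and a["ymin"] <= b["ymax"]
    )


def build_row_filter(bbox):
    """Build a filter string equivalent to the one passed to scan() above."""
    return (
        "(status = 'latest' or status = 'history') "
        f"and (xmax >= {bbox['xmin']} and xmin <= {bbox['xmax']}) "
        f"and (ymax >= {bbox['ymin']} and ymin <= {bbox['ymax']})"
    )
```

Because the test only compares bounding boxes, it can return contributions that lie inside the project's bounding box but outside its actual outline.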

## Save as Parquet files

In [10]:
def save_as_parquet(osm_data, outfile_name):
    start_time = time.time()
    
    # read the scan result and register it in DuckDB as the table 'raw_osm_data'
    osm_data.to_duckdb('raw_osm_data', connection=con)

    query = f"""
    COPY 
    (
    SELECT * FROM raw_osm_data
    ) TO '{outfile_name}' WITH (
        FORMAT PARQUET,
        COMPRESSION ZSTD
    )
    """

    con.sql(query)
    download_time = round(time.time() - start_time, 3)
    print(f"download took {download_time} sec.")
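
Both `download_osm_data` and `save_as_parquet` repeat the same start/stop timing pattern. A sketch of how that could be factored out, assuming a hypothetical context manager named `timed` (not part of the original notebook):

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label):
    """Print how long the enclosed block took, rounded to milliseconds.

    `timed` is a hypothetical helper, shown for illustration only.
    """
    result = {}
    start_time = time.time()
    try:
        yield result
    finally:
        # runs even if the block raises, so a failed step is still timed
        result["seconds"] = round(time.time() - start_time, 3)
        print(f"{label} took {result['seconds']} sec.")
```

Usage would look like `with timed("download"): con.sql(query)`.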

## Loop through all projects

In [22]:
for project_id in hot_tm_project_ids:
    print(f"start for hot tm project: {project_id}")
    gdf, bbox = get_hot_tm_project_info(project_id)
    osm_data = download_osm_data(bbox)

    outfile = f"s3a://heigit-ohsome-sotm24/data/youthmappers/hot_tm_project_contributions_{project_id}.parquet"
    save_as_parquet(osm_data, outfile)
    
    print(f"finish for hot tm project: {project_id}")

start for hot tm project: 13582
iceberg scan took 0.108 sec.
download took 38.379 sec.
finish for hot tm project: 13582
start for hot tm project: 16369
iceberg scan took 0.007 sec.
download took 31.071 sec.
finish for hot tm project: 16369
start for hot tm project: 16512
iceberg scan took 0.007 sec.
download took 37.277 sec.
finish for hot tm project: 16512
start for hot tm project: 13046
iceberg scan took 0.007 sec.
download took 43.53 sec.
finish for hot tm project: 13046
start for hot tm project: 12975
iceberg scan took 0.095 sec.
download took 36.747 sec.
finish for hot tm project: 12975
start for hot tm project: 12761
iceberg scan took 0.007 sec.
download took 39.712 sec.
finish for hot tm project: 12761
start for hot tm project: 12760
iceberg scan took 0.008 sec.
download took 41.552 sec.
finish for hot tm project: 12760
start for hot tm project: 12707
iceberg scan took 0.01 sec.
download took 39.172 sec.
finish for hot tm project: 12707
start for hot tm project: 12661
iceberg scan took 0.007 sec.
download took 39.669 sec.
finish for hot tm project: 12661
start for hot tm project: 15996
iceberg scan took 0.093 sec.
download took 39.217 sec.
finish for hot tm project: 15996
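
If one project fails (for example, because of an unknown project ID or a network error), the loop above stops and the remaining projects are skipped. A sketch of a more forgiving loop, assuming a hypothetical helper `process_projects` that receives the per-project work as a callable:

```python
def process_projects(project_ids, process_one):
    """Run process_one for every ID and collect failures instead of stopping.

    `process_projects` is a hypothetical helper, shown for illustration only.
    """
    failed = {}
    for project_id in project_ids:
        try:
            process_one(project_id)
        except Exception as error:  # keep going and report the failure
            failed[project_id] = error
            print(f"project {project_id} failed: {error}")
    return failed
```

The returned dictionary maps each failed project ID to its exception, so those projects can be retried afterwards.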


## Teach OSM projects

In [27]:
for project_id in teach_osm_project_ids:
    print(f"start for hot tm project: {project_id}")
    gdf, bbox = get_teach_osm_tm_project_info(project_id)
    osm_data = download_osm_data(bbox)
    
    outfile = f"s3a://heigit-ohsome-sotm24/data/youthmappers/teach_osm_tm_project_contributions_{project_id}.parquet"
    save_as_parquet(osm_data, outfile)
    
    print(f"finish for teach osm tm project: {project_id}")

start for hot tm project: 1450
iceberg scan took 0.142 sec.
download took 26.949 sec.
finish for teach osm tm project: 1450
start for hot tm project: 1457
iceberg scan took 0.007 sec.
download took 24.674 sec.
finish for teach osm tm project: 1457
start for hot tm project: 1247
iceberg scan took 0.016 sec.
download took 39.18 sec.
finish for teach osm tm project: 1247
start for hot tm project: 1525
iceberg scan took 0.017 sec.
download took 38.336 sec.
finish for teach osm tm project: 1525
start for hot tm project: 1502
iceberg scan took 0.016 sec.
download took 41.367 sec.
finish for teach osm tm project: 1502
start for hot tm project: 1470
iceberg scan took 0.007 sec.
download took 38.616 sec.
finish for teach osm tm project: 1470
start for hot tm project: 1595
iceberg scan took 0.007 sec.
download took 40.785 sec.
finish for teach osm tm project: 1595
start for hot tm project: 1488
iceberg scan took 0.012 sec.
download took 42.5 sec.
finish for teach osm tm project: 1488
