- Challenge 1 - Metadata quality control
- Challenge 2 - Extend scraper with explorative search
- Project stack
- Setup / Deployment
- API Documentation
The Geoharvester is a project in development for Swisstopo. At its core, a scraper (developed by David Oesch, Swisstopo) takes a list of geoservice (WM(T)S, WFS) URLs as input and writes the getCapabilities responses into a CSV file, netting about 23k datasets. FHNW/IGEO's part is to provide a user interface and algorithms that make the data accessible. Think Google for a second: to find data that is relevant to you (i.e. matches your search query), we need to structure, group, index, match and rank datasets and return results in a sensible time. For the Geohack we use a modified version of the project, which sacrifices performance in favor of a simplified setup (see stack description below). We came up with two challenges that will help us improve scraping:
Besides our main goal of making data available and accessible through the Geoharvester, we focus on promoting quality standards and good metadata documentation. To reach this objective, we would like to target the data providers with a simple effort-reward strategy: data which is complete and well documented gets found more easily and/or ranked higher in the search results than incomplete, outdated or faulty data. Ideally, the data providers will thus strive to optimise the metadata of their services (and datasets).
Based on the current source definition, the scraper currently sources about 25000 datasets. (Note that for this challenge we use a smaller CSV file which only contains WFS service addresses - still about 6500 datasets.) As you can see from the output, the quality of the metadata varies greatly. However, such differences are difficult to grasp in detail from a table alone.
To get a better overview of the (meta)data quality we need additional aids - and this is where you come in:
- How can the services/datasets be assessed on their metadata completeness and quality? (main task)
- Which meaningful indicators and criteria can be used to judge and/or rank the data? (OGC compliance might be a good starting point)
- How can the data be sorted/grouped/aggregated based on such criteria and analysed statistically and/or visually?
While you could work with the CSV file alone, we recommend a more technical approach: hook into the pandas dataframe and transform/extend it to your needs. The dataframe gets populated once during startup (main.py / @app.on_event("startup"), see https://fastapi.tiangolo.com/advanced/events/), but you could also move the code to an endpoint to trigger it from the outside, e.g. from the frontend.
Recommendations, criteria and/or strategies on how to measure metadata quality and completeness, and these principles applied to the data in an assessment of how well it aligns with them.
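As a starting point, here is a minimal sketch of such a hook, assuming the dataframe is loaded from the scraper CSV at startup. It computes a naive completeness score (the share of non-empty metadata fields per dataset) and aggregates it per service. All column names (`TITLE`, `ABSTRACT`, `KEYWORDS`, `CONTACT`, `SERVICE`) and the CSV path are assumptions; adapt them to the columns the scraper actually writes.

```python
# Minimal sketch, not the actual Geoharvester implementation.
# Column names and the CSV path are assumptions - adapt to the real scraper output.
import pandas as pd
from fastapi import FastAPI

app = FastAPI()
df = pd.DataFrame()

CHECKED_COLUMNS = ["TITLE", "ABSTRACT", "KEYWORDS", "CONTACT"]  # assumed metadata fields


@app.on_event("startup")
def load_data() -> None:
    """Populate the dataframe once at startup, as the main project does."""
    global df
    df = pd.read_csv("geoservices.csv")  # hypothetical file name


@app.get("/quality")
def quality_report():
    """Return an aggregated completeness score per service."""
    present = [col for col in CHECKED_COLUMNS if col in df.columns]
    # Treat empty strings like missing values, then count the share of filled fields.
    filled = df[present].replace("", pd.NA).notna()
    report = df.assign(completeness=filled.mean(axis=1))
    grouped = (
        report.groupby("SERVICE", dropna=False)["completeness"]  # "SERVICE" is an assumed column
        .agg(["mean", "count"])
        .sort_values("mean", ascending=False)
    )
    return grouped.reset_index().to_dict(orient="records")
```

The same score could of course be weighted (e.g. an abstract counting more than a keyword list) or extended with OGC-compliance checks.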
While the Geoharvester provides a search interface in addition to a server for quick data retrieval, the datasets stored in the database/dataframe come from a separate scraper. Run daily by GitHub Actions, the script takes a list of service URLs, retrieves all datasets listed by the getCapabilities definition and saves the output into a CSV file, which is then ingested by the Geoharvester.
We would like to expand the scraper's ability to source geodata beyond the existing list of sources. For that we could imagine an additional script that
- searches for additional, publicly available WM(T)S / WFS services, retrieves their getCapabilities URL and saves it to a file (main task)
- checks against sources.csv whether this URL is already registered, to avoid duplication
- filters datasets on their relevance, e.g. by comparing the bounding box (BBOX property) with the dimensions of Switzerland. Only datasets with a sufficient overlap should pass the filter (see the sketch below)
- any additional features you see fit
While the service URL (and its getCapabilities response) is the main goal, you could also analyse the datasets (e.g. their quality or completeness, see challenge 1) or visualise the results. You do not need to hook into the scraper code itself (as it might be a heavy process if run on your local machine), but you could add your script to the API as a separate endpoint or into the startup routine (main.py / @app.on_event("startup"), see https://fastapi.tiangolo.com/advanced/events/), save the output to a separate dataframe and display the results in the frontend.
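A minimal sketch of the filtering part could look like the following. It checks a candidate getCapabilities URL against sources.csv and keeps only services whose bounding box sufficiently overlaps Switzerland. The `URL` column name, the 20% overlap threshold and the approximate WGS84 coordinates are assumptions; BBOX values reported in another CRS would have to be reprojected first.

```python
# Sketch of a relevance filter and duplicate check for newly discovered services.
# The sources.csv layout (a "URL" column) and the overlap threshold are assumptions.
import csv

# Approximate WGS84 bounding box of Switzerland: (min_lon, min_lat, max_lon, max_lat)
CH_BBOX = (5.96, 45.82, 10.49, 47.81)


def bbox_overlap_ratio(bbox, reference=CH_BBOX) -> float:
    """Share of `bbox` that falls inside the reference box (0.0 to 1.0)."""
    min_x, min_y, max_x, max_y = bbox
    ref_min_x, ref_min_y, ref_max_x, ref_max_y = reference
    inter_w = max(0.0, min(max_x, ref_max_x) - max(min_x, ref_min_x))
    inter_h = max(0.0, min(max_y, ref_max_y) - max(min_y, ref_min_y))
    area = (max_x - min_x) * (max_y - min_y)
    return (inter_w * inter_h) / area if area > 0 else 0.0


def is_registered(url: str, sources_path: str = "sources.csv") -> bool:
    """Check whether a getCapabilities URL is already listed in sources.csv."""
    with open(sources_path, newline="") as fh:
        return any(url.strip() == row.get("URL", "").strip() for row in csv.DictReader(fh))


def keep_candidate(url: str, bbox, min_overlap: float = 0.2) -> bool:
    """Accept a candidate service if it is new and sufficiently covers Switzerland."""
    return not is_registered(url) and bbox_overlap_ratio(bbox) >= min_overlap
```

Wrapped into a separate endpoint (or into the startup routine), the accepted candidates could then be written to their own dataframe and shown in the frontend.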
Have a script/algorithm that searches for relevant (i.e. Switzerland-focused) WM(T)S / WFS services and extends the functionality of the scraper.
Stack diagram of the main project:
The Geohack version of Geoharvester differs from this diagram:
- The backend is not containerized; Docker is not needed.
- Pandas dataframe instead of Redis database. To compensate for the lower performance of pandas compared to Redis, a row limit is set (see main.py).
- Your favorite terminal (recommendation for Windows users)
- Have node and npm installed (https://docs.npmjs.com/downloading-and-installing-node-js-and-npm)
- cd into frontend folder ("client")
- run `npm i` to install dependencies (from package.json)
- run `npm start` to start the frontend on localhost (`npm start` is defined in package.json). The process restarts automatically on code changes.
- Your favorite terminal (recommendation for Windows users) - use a second window / tab / tile for this process.
- Have a venv running and dependencies installed: cd into server/app, then run `python -m venv env && source ./env/bin/activate && pip install -r requirements.txt`
- In the terminal, cd (back) into the server folder
- Run `uvicorn app.main:app --reload` to start the API ("--reload" enables hot-reloading, so no restart is needed after code changes)
- Check `localhost:8000/docs` in your browser to verify that the backend is running
- Check that you are starting the backend from the `server` folder (not server/app).
- Is the virtual environment up and running?
- Point the VSCode Python interpreter to your venv, so that it can pick up the dependencies from the virtual environment. (See/click the bottom right corner in VSCode.)
FastAPI comes with Swagger UI preinstalled. If you have the backend running (see steps above), Swagger UI is available at http://localhost:8000/docs.
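Any endpoint you add for the challenges will show up there automatically. As a small illustration (the path and summary are placeholders, not part of the existing API):

```python
from fastapi import FastAPI

app = FastAPI()


@app.get("/quality/summary", summary="Aggregated metadata quality per service")
def quality_summary():
    """This docstring and the summary appear in Swagger UI at localhost:8000/docs."""
    return {"status": "not implemented yet"}  # placeholder response
```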