OpenStreetMap Data Quality based on the Contributions History
Working with community-built data such as OpenStreetMap forces us to take care of data quality. We have to be confident in the data we work with. Is this road geometry accurate enough? Is this street name missing?
Our first idea was to answer this question: can we assess the quality of OpenStreetMap data? (and how?)
This project is dedicated to exploring and analyzing the OpenStreetMap data history in order to classify the contributors.
There is a series of articles on Oslandia's blog which deal with this topic.
Works with Python 3
There is a `requirements.txt` file. Thus, run
pip install -r requirements.txt
from a virtual environment.
How does it work?
There are several Python files to extract and analyze the OSM history data. Two machine learning models are used to classify the changesets and the OSM contributors.
The purpose of the PCA is not to reduce the dimensionality (there are fewer than 100 features). It is to analyze the different features and understand which ones matter most.
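To illustrate the idea, here is a minimal sketch of such a feature analysis with scikit-learn; the input file name and column layout are assumptions, not files produced by the project.

```python
# Hypothetical sketch: inspect PCA components to find the most important features.
# "user-features.csv" is an assumed input: one row per contributor, one column per feature.
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

features = pd.read_csv("user-features.csv", index_col="uid")
X = StandardScaler().fit_transform(features)

pca = PCA(n_components=6)
pca.fit(X)

# Explained variance: how much information each component retains.
print(pca.explained_variance_ratio_)

# Loadings: which original features drive each component the most.
loadings = pd.DataFrame(pca.components_.T,
                        index=features.columns,
                        columns=[f"PC{i + 1}" for i in range(pca.n_components_)])
print(loadings["PC1"].abs().sort_values(ascending=False).head(10))
```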
Get some history data
You can get some history data for a specific world region
on Geofabrik. You have to download a
*.osh.pbf file. For instance, you can download the history file
from the Greater London page.
Warning: since the GDPR, Geofabrik has modified its API. You have to be logged
in to the website with your OSM contributor account to download
osh.pbf files, as OSM history files contain some private information about OSM contributors.
Organize your output data directories
Create your data directory and some subdirectories. The data processing should
be launched from the folder that contains your
data folder (or, alternatively, a symbolic link pointing to it).
mkdir -p data/output-extracts
Then, copy your freshly downloaded
*.osh.pbf file into the data/raw directory.
Note: if you want another name for your data directory, you can
specify it with the
--datarep luigi option.
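If you prefer to script this step, here is a small sketch using the Python standard library; the downloaded file name below is a placeholder, adapt it to your own region.

```python
# Minimal sketch: create the expected data layout and copy a downloaded
# history file into data/raw. The file names below are placeholders.
from pathlib import Path
import shutil

data = Path("data")  # or whatever name you pass to --datarep
(data / "raw").mkdir(parents=True, exist_ok=True)
(data / "output-extracts").mkdir(parents=True, exist_ok=True)

# The target file name (without .osh.pbf) is what you will pass to --dsname.
downloaded = Path.home() / "Downloads" / "greater-london.osh.pbf"  # placeholder
shutil.copy(downloaded, data / "raw" / "region.osh.pbf")
```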
The limits of the data pipeline
The data pipeline processing is handled by Luigi, which builds a directed acyclic dependency graph of your different processing tasks and launches them in parallel when possible.
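For readers unfamiliar with Luigi, here is a hypothetical sketch of what a task in such a pipeline looks like; the class name, file names and column names are illustrative, not the actual tasks of this project.

```python
# Hypothetical Luigi task, for illustration only.
import luigi
import pandas as pd


class ChangesetCounts(luigi.Task):
    """Count the changesets of each contributor from a user/changeset CSV file."""
    datarep = luigi.Parameter(default="data")
    dsname = luigi.Parameter(default="bordeaux-metropole")

    def requires(self):
        # Luigi builds its dependency graph from requires(); none here.
        return []

    def output(self):
        return luigi.LocalTarget(
            f"{self.datarep}/output-extracts/{self.dsname}/changeset-counts.csv")

    def run(self):
        # The column name "uid" is an assumption about the CSV layout.
        changesets = pd.read_csv(
            f"{self.datarep}/output-extracts/all-changesets-by-user.csv")
        counts = changesets.groupby("uid").size().rename("n_changesets")
        with self.output().open("w") as fobj:
            counts.to_csv(fobj)
```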
These tasks yield output files (CSV, JSON, hdf5, png). Some files needed by some tasks, such as
all-editors-by-user.csv, were built outside of this pipeline. Actually, these files come from the big
changesets-latest.osm XML file, which is difficult to include in the pipeline:
- the processing can be quite long;
- you need a large amount of RAM.
Thus, you can get these two CSV files from
osm-user-data and copy them into data/output-extracts.
See also the "I want to parse the changesets.osm file" section.
Run your first analysis
You should have the following files:
data
data/raw
data/raw/region.osh.pbf
data/output-extracts
data/output-extracts/all-changesets-by-user.csv
data/output-extracts/all-editors-by-user.csv
Then run:
luigi --local-scheduler --module analysis_tasks AutoKMeans --dsname region
or, equivalently:
python3 -m luigi --local-scheduler --module analysis_tasks AutoKMeans --dsname region
The dsname parameter means "dataset name". It must have the same name as your *.osh.pbf file (here, region for region.osh.pbf).
Note: the default value of this parameter is
bordeaux-metropole. If you do not set another value and do not have such an
.osh.pbf file on your file system, the program will crash.
Most of the time (e.g. if you get a Python import error), you have to prefix the
luigi command with the
PYTHONPATH environment variable set to the
osm-data-quality/src directory, such as:
PYTHONPATH=/path/to/osm-data-quality/src luigi --local-scheduler ...
MasterTask chooses the number of PCA components and the number of KMeans
clusters in an automatic way. If you want to set the number of clusters for
instance, you can pass the following options to the luigi command:
--module analysis_tasks KMeansFromPCA --dsname region --n-components 6 --nb-clusters 5
In this case, the PCA is carried out with 6 components, and the KMeans clustering is then run on the PCA results with 5 clusters.
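Conceptually, this chaining corresponds to something like the following scikit-learn sketch; the actual tasks work on the pipeline's intermediate files, and the input file name here is an assumption.

```python
# Conceptual sketch of the PCA -> KMeans chaining with explicit parameters.
# "user-features.csv" is an assumed input, not a real pipeline file name.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

features = pd.read_csv("user-features.csv", index_col="uid")
X = StandardScaler().fit_transform(features)

reduced = PCA(n_components=6).fit_transform(X)                       # --n-components 6
labels = KMeans(n_clusters=5, random_state=0).fit_predict(reduced)   # --nb-clusters 5

features["cluster"] = labels
print(features["cluster"].value_counts())
```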
See also the different luigi options in the official luigi documentation.
You should have a
data/output-extracts/<region> directory with several
CSV, JSON and h5 files:
- several intermediate CSV files;
- a JSON KMeans report to see the "ideal" number of clusters;
- PCA hdf5 files;
- KMeans hdf5 files;
- a few PNG images.
Open the results analysis notebook to get an insight into how to exploit the results.
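If you just want to peek at the results outside the notebook, the hdf5 files can be read back with pandas; the file name and key below are guesses and may differ from the actual output.

```python
# Quick sketch: read back an hdf5 result with pandas. The file name is an
# assumption; check the actual content of your output directory.
import pandas as pd

results = pd.read_hdf("data/output-extracts/region/kmeans.h5")
# If the file stores several objects, pass the right key,
# e.g. pd.read_hdf(path, key="...").
print(results.head())
```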
I want to parse the changesets.osm file
- Convert the file into a huge CSV file
- Group each user by editors and changesets thanks to dask
TODO: write the "how to". A rough sketch of both steps is given below.
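In the meantime, here is a heavily simplified sketch of the two steps above; the CSV file names are placeholders, and the attribute names ("uid", "created_by") follow the OSM changeset dump schema but should be checked against the real file.

```python
# Rough sketch, not the project's actual conversion script.
import csv
import xml.etree.ElementTree as ET

import dask.dataframe as dd

# Step 1: stream the huge XML file and write one CSV row per changeset.
with open("changesets-by-user.csv", "w", newline="") as fobj:
    writer = csv.writer(fobj)
    writer.writerow(["uid", "editor"])
    for _, elem in ET.iterparse("changesets-latest.osm"):
        if elem.tag == "changeset":
            editor = ""
            for tag in elem.findall("tag"):
                if tag.get("k") == "created_by":
                    editor = tag.get("v", "")
            writer.writerow([elem.get("uid", ""), editor])
            elem.clear()  # free memory as we go

# Step 2: group changesets by user and editor with dask.
df = dd.read_csv("changesets-by-user.csv", dtype={"uid": "object"})
grouped = df.groupby(["uid", "editor"]).size().compute()
grouped.to_csv("all-editors-by-user.csv")
```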