A simplest-and-stupid algorithm (v0.0.2), and some evolutions of it, to convert big OSM files into a simple CSV file for Wikidata-tag analysis.
This project has two distinct parts:

- XML parser: converts a big OSM XML file into a lean tabular format, the intermediary "raw" CSV format. There are other tools (e.g. OPL), but this one is suitable for OSM beginners, for Unix agile-pipeline lovers... and for those who do not trust anyone: for fact-checking. It is the pipeline of `step1-osmWd2csv_pre.sh` and `step2-osmWd2csv.php`.
- Final parser and SQL database: not so simple, it has some mix of libraries and medium-complexity SQL processing. It produces all the final products, including the export to the proposed data-interchange format, `.wdDump.csv`.
The target of the "OSM Wikidata-elements CSV file" is to feed PostgreSQL (or SQLite) with complete and reliable data. See e.g. the Semantic-bridge OSM-Wikidata project.
This project also defines two simple data-interchange formats for testing and benchmarking OSM-Wikidata tools.
`.wdDump.csv` format: the best way to dump, analyse or interchange OSM-Wikidata "big data". It is a standard CSV file with columns `<osm_type,osm_id,wd_ids,wd_member_ids>`.

The first field, `osm_type`, is the OSM element type, abbreviated as a letter ("n" for node, "w" for way and "r" for relation); the second is its real ID at the date of capture; the third is its Wikidata ID (a qid without the "Q" prefix), sometimes no ID, sometimes more than one; and the fourth, `wd_member_ids`, is a set of space-separated Wikidata IDs of member elements, which sometimes can be the parent element itself. The `.wdDump.csv` is the final format of the parsing process also described in this repo.

It consumes ~0.01% of the size of the Wikidata-filtered XML (`.osm`) file, and its zipped version ~0.4% of the `.osm.pbf` format (see the summary of file sizes). In CPU time to process or analyse it is also a big gain... And for data analysts it is a standard source of truth in any SQL tool. Example:
osm_type | osm_id | wd_id | geohash | wd_member_ids
---|---|---|---|---
n | 32011242 | Q49660 | u0qu3jyt |
w | 28712148 | Q1792561 | u0qu0tmz |
w | 610491098 | Q18482699 | 6vjwdr | Q18482699:2
r | 1988261 | | u1j1 | Q315548:49
r | 51701 | Q39 | u0m | Q11925:1 Q12746:1
r | 3366718 | Q386331 | 6ur | Q386331:15
The same table, in SQL, can be converted to JSON or JSONb with the following structure:
```sql
CREATE TABLE wdOsm.raw (
  osm_type char NOT NULL,  -- reduced to n/w/r
  osm_id bigint NOT NULL,
  wd_id bigint,            -- "Q" removed
  geohash bigint,          -- must be converted to base32
  member_wd_ids JSONb,     -- e.g. {"315548":49}
  -- option bigint[[key,value]] = array[array[315548,49],array[392600,2]]
  -- and bigint2d_find(a,needle)
  UNIQUE(osm_type,osm_id)
);
```
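For instance, once the final CSV is loaded into this table, typical checks become one-liners. A minimal sketch, assuming PostgreSQL and the `wdOsm.raw` table above; the queries themselves are illustrative, not part of the project:

```sql
-- Illustrative queries over wdOsm.raw (PostgreSQL JSONb operators).

-- How many elements of each OSM type carry a Wikidata ID?
SELECT osm_type, count(*) AS n
FROM wdOsm.raw
WHERE wd_id IS NOT NULL
GROUP BY osm_type;

-- Which elements contain Q315548 as a member, and how many times?
-- (member_wd_ids stores ids without the "Q", e.g. {"315548":49}.)
SELECT osm_type, osm_id, (member_wd_ids->>'315548')::int AS occurrences
FROM wdOsm.raw
WHERE member_wd_ids ? '315548';
```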
`.wdDump.raw.csv` format: a standard CSV file with columns `<osm_type,osm_id,otherIDs>`.

It is an intermediary format, with a lot of redundant lines and IDs. It is easy to interchange, or to feed into SQL for the final parsing (see the loading sketch below).

The first field, `osm_type`, is the OSM element type; the second is its ID; and `otherIDs` is a set of space-separated IDs (node IDs, way IDs or Wikidata IDs).

It consumes ~10% of the XML (`.osm`) format.
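A minimal sketch of feeding a raw file into PostgreSQL for the final parsing step; the staging-table name and the file path here are illustrative assumptions, not the project's actual scripts:

```sql
-- Hypothetical staging table for .wdDump.raw.csv rows.
CREATE TABLE wdosm_raw_input (
  osm_type char(1) NOT NULL,  -- n, w or r
  osm_id   bigint  NOT NULL,
  other_ids text              -- space-separated node/way/Wikidata IDs
);

-- Server-side load; adjust the path, and add the HEADER option
-- if your raw file carries a header line.
COPY wdosm_raw_input FROM '/tmp/LI_wdDump.raw.csv' WITH (FORMAT csv);
```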
So, the following algorithms are also reference specifications of these file formats:
- XML pre-parser, osmWd2csv_pre.sh: a simple Unix pipeline where you can easily change or debug the intermediary XML format.
- XML parser, osmWd2csv.php: the simplest-and-stupid algorithm to convert pre-processed OSM into the `*_wdDump.raw.csv` format. Good for fact checking (auditing the process) or as a starting point for adaptations. The ideal algorithm would be C or C++ doing the job of both pre-parser and parser. It would also be easy to include the "SQL processing" in the parser, because it only checks IDs and refs.
- SQL processing, osmWd_raw-transform.sql: transforms `*_wdDump.raw.csv` into `*_wdDump.csv` (a toy sketch of this kind of aggregation is shown after this list). It does not need to be a SQL algorithm, but SQL is the simplest and most popular standard way to specify and test procedures.
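To make the transform concrete, here is a toy sketch of the kind of aggregation involved, resolving each parent's member Wikidata IDs into an id-to-count JSONb map. It assumes a hypothetical `wdosm_raw_exploded` view and is not the project's actual osmWd_raw-transform.sql:

```sql
-- Toy sketch (not the project's osmWd_raw-transform.sql).
-- Assumes a hypothetical view wdosm_raw_exploded(osm_type, osm_id, member_id)
-- with one resolved member Wikidata ID per row, e.g. built by unnesting
-- the space-separated other_ids column of the staging table above.
SELECT osm_type,
       osm_id,
       jsonb_object_agg(member_id::text, n) AS member_wd_ids
FROM (
  SELECT osm_type, osm_id, member_id, count(*) AS n
  FROM wdosm_raw_exploded
  GROUP BY osm_type, osm_id, member_id
) counted
GROUP BY osm_type, osm_id;
```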
There are a lot of tools, ranging from the simplest `grep` to count tags (see also example), to complicated (but very useful) ones:

- Direct tools:
  - OPL File Format converters;
  - Osmconvert and its Writing CSV tools.
- Indirect tools:
  - OSM to GeoJSON complex conversions, used by taginfo and others;
  - `osm2pgsql` to feed PostgreSQL;
  - Osmosis, also to feed PostgreSQL.
But the problem to be solved here was to track back only one tag (in this case the `wikidata=*` tag)... and by a "dummy user" who has never used OPL, Osmconvert, etc., but knows something about XML and CSV formats. The only external OSM tool here will be Osmium, to convert from the PBF format.
The first step in OSM big-data analysis is to get the file in `.osm` (which is XML) from "the planet"... For beginners, using Linux:
- Install: `apt install osmium-tool; osmium --version` will show "osmium version 1.3.0" on UBUNTU16 and "osmium version 1.7.1 / libosmium version 2.13.1" on UBUNTU18.
- Select the region here, e.g. `wget https://download.geofabrik.de/europe/liechtenstein-latest.osm.pbf`, which is published with its MD5 sum, so compare it with `md5sum liechtenstein-latest.osm.pbf`. ... Or big data such as Europe: `wget http://download.geofabrik.de/europe-latest.osm.pbf`
- See tests and instructions in the Osmium-tool manual. The main one is `osmium fileinfo -e liechtenstein-latest.osm.pbf`, to see all details.
We need a filter (only elements with a Wikidata tag) and a conversion (PBF to OSM):

```sh
osmium tags-filter liechtenstein-latest.osm.pbf nwr/wikidata -o wikidata-liechtenstein.osm.pbf
osmium cat -f osm wikidata-liechtenstein.osm.pbf -o wikidata-liechtenstein.osm
```

So the last output is our XML working file, in this case `wikidata-liechtenstein.osm`.
To run samples in your database, the simplest way is `sh src/make.sh`. There are some examples in example.md, starting with the `.osm.pbf` files.
- `cat wikidata-liechtenstein.osm | php src/distincTags.php` is a fact-check for `osmium fileinfo -e`, and confirms the DTD of the OSM file, as well as its IDs.
- `sh src/osmWd2csv_pre.sh < wikidata-liechtenstein.osm | more` will show the pre-parsed XML.
- `sh src/osmWd2csv_pre.sh < wikidata-liechtenstein.osm | php src/osmWd2csv.php > LI_wdDump.csv` is the useful thing!
- ... for the complete process, with SQL outputting `LI_wdDump.csv`, see "example 3 - parsing"; a rough export sketch is also shown below.
Contents, data and software of this project are dedicated to