
Mímirsbrunn

Mimirsbrunn is a geocoding service built upon Elasticsearch.

It is an independent service, but Navitia uses it as its global geocoding service.

Mimirsbrunn is composed of several parts: components that manage the data import into Elasticsearch, and a web service that wraps Elasticsearch responses to return formatted responses (we use geocodejson as the response format).

Use

Data

Mimirsbrunn relies on geographical datasets to find what users are looking for. These locations belong to different data types and come from various sources.

| data type | data sources and components |
|---|---|
| Addresses | OpenAddresses (openaddresses2mimir) or BANO (bano2mimir) |
| Streets | OpenStreetMap (osm2mimir) |
| POIs | OpenStreetMap (osm2mimir) |
| Public transport stops | Navitia.io data platform (ntfs2mimir) or any GTFS data repository (stops2mimir) |
| Administrative regions | OSM (osm2mimir) or Cosmogony (cosmogony2mimir) |

Check out the documentation of each component to learn more about how to use it.

If you need to use another data source, you can also write your own data importer. See for instance Fafnir, an external component that imports POIs from another database.

Install

Manually

To build, you must first install Rust:

curl https://sh.rustup.rs -sSf | sh
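The installer puts cargo in ~/.cargo/bin; for the current shell you may need to source the environment file it generates:

source $HOME/.cargo/env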

To use the Mimirsbrunn components you will need an Elasticsearch database. The Elasticsearch version needs to be 2.x.
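If you don't have a 2.x instance at hand, one convenient way to get one is the official Docker image (this is just one option, assuming Docker is installed; any other 2.x installation works as well):

docker run -d --name mimir-es -p 9200:9200 elasticsearch:2.4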

Then build Mimirsbrunn:

cargo build --release
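The binaries (osm2mimir, bano2mimir, ntfs2mimir, bragi, ...) end up in target/release/; you can check that they were built with:

ls target/release/ | grep -E '2mimir|bragi'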

Architecture

Index architecture

Data are imported into multiple indexes with the following structure:

munin -> munin_addr -> munin_addr_dataset1 -> munin_addr_dataset1_20160101T123200
                   |-> munin_addr_dataset2 -> munin_addr_dataset2_20160101T123200
     |-> munin_admin -> munin_admin_dataset1 -> munin_admin_dataset1_20160101T123200
     |-> munin_street -> munin_street_dataset1 -> munin_street_dataset1_20160101T123200

Munin is the root index. It's an alias used by the frontend (Bragi), pointing to one index per dataset/document type. So if we have address data for France and Belgium, we will have two indexes: "addr_fr" and "addr_be". These are also aliases: each points to a dated index, so we can import data into a new index without impacting anyone, and then switch the alias to point to the new data.

This gives us the ability to update only a part of the world without any downtime.

Here is how the indexes evolve during an update (continuing the previous example, say we update addr_dataset1):

During the data update:

munin -> munin_addr -> munin_addr_dataset1 -> munin_addr_dataset1_20160101T123200
                   |-> munin_addr_dataset2 -> munin_addr_dataset2_20160101T123200
     |-> munin_admin -> munin_admin_dataset1 -> munin_admin_dataset1_20160101T123200
     |-> munin_street -> munin_street_dataset1 -> munin_street_dataset1_20160101T123200
     |-> munin_stop -> munin_stop_dataset1 -> munin_stop_dataset1_20160101T123200

munin_addr_dataset1_20160201T123200    (new index being loaded, not yet referenced by any alias)

and when the loading is finished

munin -> munin_addr -> munin_addr_dataset1 -> munin_addr_dataset1_20160201T123200
                   |-> munin_addr_dataset2 -> munin_addr_dataset2_20160101T123200
     |-> munin_admin -> munin_admin_dataset1 -> munin_admin_dataset1_20160101T123200
     |-> munin_street -> munin_street_dataset1 -> munin_street_dataset1_20160101T123200
     |-> munin_stop -> munin_stop_dataset1 -> munin_stop_dataset1_20160101T123200
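The switch at the end is an atomic alias update in Elasticsearch, which is what makes the zero-downtime swap possible. The import tools perform it themselves; purely to illustrate the mechanism, this is roughly what the swap looks like with the ES 2.x _aliases API (index names taken from the example above):

curl -XPOST 'http://localhost:9200/_aliases' -d '{
  "actions": [
    { "remove": { "alias": "munin_addr_dataset1", "index": "munin_addr_dataset1_20160101T123200" } },
    { "add":    { "alias": "munin_addr_dataset1", "index": "munin_addr_dataset1_20160201T123200" } }
  ]
}'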

There is one major drawback: datasets aren't hermetic. Since we import multiple OSM files, areas near a border will be present in multiple datasets; for now we accept these duplicates. Later we will be able to filter with a shape at import time and/or remove them in Bragi.

Components

All Mimirsbrunn components support the --help (or -h) argument to explain their use.

There are several components in Mimirsbrunn:

osm2mimir

This component imports OpenStreetMap data into Mimir.

You can get OpenStreetMap data from http://download.geofabrik.de/

e.g.:

curl -O http://download.geofabrik.de/europe/france-latest.osm.pbf

To import the data into Mimir, run:

./target/release/osm2mimir --input=france-latest.osm.pbf --level=8 --level=9 --import-way --import-admin --import-poi --dataset=france --connection-string=http://localhost:9200

--level: the OpenStreetMap administrative levels (admin_level) to import
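Once the import is finished, a quick sanity check that documents were indexed (assuming the default munin root alias described in the Architecture section):

curl 'http://localhost:9200/munin/_count?pretty'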

bano2mimir

This component imports BANO data into Mimir. It is recommended to run the BANO integration after the OSM integration, so that addresses are attached to admins.

You can get BANO data from http://bano.openstreetmap.fr/data/

e.g.:

curl -O http://bano.openstreetmap.fr/data/full.csv.gz
gunzip full.csv.gz
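If you want to glance at the data before importing it, the file is a plain CSV with one address per line:

head -n 3 full.csv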

To import the data into Mimir, run:

./target/release/bano2mimir -i full.csv --dataset=france --connection-string=http://localhost:9200/

The --connection-string argument refers to the Elasticsearch URL.

ntfs2mimir

This component imports data from NTFS files into Mimir. It is recommended to run the NTFS integration after the OSM integration, so that stops are attached to admins.

To import the data into Mimir, run:

./target/release/ntfs2mimir -i <path_to_folder_with_ntfs_file> --dataset=idf --connection-string=http://localhost:9200/

The --connection-string argument refers to the Elasticsearch URL.

The NTFS input files need to match the NTFS specification (https://github.com/CanalTP/navitia/blob/dev/documentation/ntfs/ntfs_0.6.md).
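Purely as an illustration (see the spec above for the authoritative list of files), the input is a folder of CSV-like .txt files:

ls <path_to_folder_with_ntfs_file>
# stops.txt, lines.txt, routes.txt, ... (among the other NTFS files)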

Note: previously, another component was used: stops2mimir. Though it is still available, it is now deprecated, because ntfs2mimir imports the stops along with every other file present in the NTFS.

Bragi

Bragi is the web service built around Elasticsearch. It hides the Elasticsearch complexity and returns consistently formatted responses.

Its response format follows the geocodejson-spec, a format used by other geocoding APIs such as addok (https://github.com/addok/addok) and photon (https://github.com/komoot/photon).

To run Bragi:

./target/release/bragi --connection-string=http://localhost:9200/munin

Then you can call the API (Bragi's default listening port is 4000):

curl "http://localhost:4000/autocomplete?q=rue+hector+malot"

Contribute

Integration tests

To run the tests, you need to manually build Mimir and then simply launch:

cargo test

The integration tests spawn one Elasticsearch Docker container, so you'll need a recent Docker version. Only one container is spawned, so the Elasticsearch base has to be cleaned before each test.

To write a new test:

  • write your test in a separate file in tests/
  • add a call to your test in tests/tests.rs::test_all()
  • pass a new ElasticSearchWrapper to your test method to get the right connection string for the ES base
  • the creation of this ElasticSearchWrapper automatically cleans the ES base (you can also refresh the ES base, clean it up during tests, etc.)

Geocoding tests

We use geocoder-tester to run real search queries and check the output against expected results, to prevent regressions.

Feel free to add some test cases here.

When a new pull request is submitted, it is manually tested using this repo, which loads a bunch of data into the geocoder, runs geocoder-tester, and then adds the results as a comment on the PR.