
Automatic QA for Maps search results


Maps currently lacks efficient tools to analyze how the quality of search results evolves across versions. This page describes what we could focus on to build such a tool.

What endpoints need to be monitored?

Currently, all search requests are processed by Idunn, which exposes several critical endpoints:

  • v1/autocomplete: provides 7 results to be displayed in the autosuggest of qwant.com/maps while the user is typing.
  • v1/search: provides a single result to be displayed on qwant.com/maps when the user presses enter or clicks the maps tab from qwant.com (e.g. searching "tour eiffel").
  • v1/instant_answer: similarly to search, this endpoint gives a single result for a given query, but this result is intended to be displayed as an instant answer on qwant.com (e.g. searching "tour eiffel"). This endpoint may return several places, but that actually means an intention has been detected and the results of this intention are displayed.
  • v1/get_places_bbox: returns a list of places that match a given query or category in a restrained area. In our use case the provided area is typically small (at most city-wide), which makes the ranking less critical than it is with other endpoints. For some categories, and when searching in France, the results are returned from the pagesjaunes API instead of our own geocoding.
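
As a rough illustration, a QA tool could start from an inventory of these endpoints and the number of results each is expected to return. The structure below is a sketch for such a tool, not part of Idunn:

```python
# Hypothetical inventory of the Idunn endpoints the QA tool would monitor,
# with the number of results each is expected to return (per the list above).
MONITORED_ENDPOINTS = {
    "v1/autocomplete":    {"expected_results": 7},       # suggestions while typing
    "v1/search":          {"expected_results": 1},       # single result on enter
    "v1/instant_answer":  {"expected_results": 1},       # single result or an intention
    "v1/get_places_bbox": {"expected_results": "list"},  # places within a small area
}
```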

What is the input/output format of the endpoints?

Parameters

The actual technical documentation of the endpoints can be found here. Let's highlight a few parameters:

  • q: the query typed by the user
  • lon/lat/zoom: if provided, the search will boost results close to the given location (the strength depends on the zoom level)
  • lang: user language
  • limit: must be set to 7 to be consistent with real-world behavior
  • nlu: if set to true, an intention may be returned before other regular place results (an intention being a category-like result, possibly together with a place, e.g. when searching "restaurant à paris")
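
For illustration, a test runner could build its autocomplete requests roughly as follows; the base URL is an assumption and should point to the monitored Idunn instance:

```python
import requests

IDUNN_BASE = "https://idunn.example.org"  # assumed URL of the monitored instance

params = {
    "q": "restaurant à paris",
    "lang": "fr",
    "limit": 7,       # consistent with real-world front-end behavior
    "nlu": "true",    # allow an intention to be returned
    "lon": 2.35,      # optional: boost results close to this location...
    "lat": 48.86,
    "zoom": 12,       # ...with a strength depending on the zoom level
}
response = requests.get(f"{IDUNN_BASE}/v1/autocomplete", params=params, timeout=10)
response.raise_for_status()
results = response.json()
```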

The endpoint v1/instant_answer is a bit simpler as it only takes 3 parameters: q, lang and user_country (which will be used to prioritize results in the given country, or closer to it).
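
Under the same assumptions, an instant_answer request would only carry those three parameters:

```python
import requests

IDUNN_BASE = "https://idunn.example.org"  # same assumed base URL as above

params = {"q": "tour eiffel", "lang": "fr", "user_country": "fr"}
answer = requests.get(f"{IDUNN_BASE}/v1/instant_answer", params=params, timeout=10).json()
```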

Results

Intentions (for v1/autocomplete and v1/search)

When an intention is detected, it will be either a category label (such as restaurant, food_chinese, ...) or a full-text query. The intended behavior is that these full-text queries are returned when a brand is detected (e.g. "ikea à paris" will give you "ikea").

A second, optional piece of information may be given: a place around which the search should be performed (e.g. "ikea à paris" will also give you a description of Paris).
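
As a rough sketch of how a QA tool might normalise a detected intention, assuming the response exposes fields like category, fulltext and place (these names are assumptions, not the documented schema):

```python
def describe_intention(intention: dict) -> str:
    """Summarise a detected intention (field names are assumed, not documented)."""
    # Either a category label ("restaurant", "food_chinese", ...) ...
    if intention.get("category"):
        label = intention["category"]
    # ... or a full-text query, typically when a brand was detected ("ikea").
    else:
        label = intention.get("fulltext", "")
    # Optionally, a place around which the search should be performed ("Paris").
    place = (intention.get("place") or {}).get("name")
    return f"{label} near {place}" if place else label
```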

Places

Places contain a lot of metadata, but in the context of checking result quality, the only important thing is to identify them, either through their name or their coordinates.
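
A sketch of what identifying a place could look like, assuming a test specifies an expected name and/or expected coordinates with a tolerance radius:

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points."""
    r = 6_371_000
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def place_matches(name, lat, lon, expected, max_distance_m=500):
    """True if the returned place matches the expected name or is close enough."""
    if expected.get("name") and expected["name"].lower() in name.lower():
        return True
    if "lat" in expected and "lon" in expected:
        return haversine_m(lat, lon, expected["lat"], expected["lon"]) <= max_distance_m
    return False
```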

What already exists?

Most of the existing tests come from geocoder-tester. The repository contains a large list of tests made of full-text queries and, for some of them, a search location. Test success can be constrained by the name, location and rank of the result. We often use these tests to spot global regressions in the autocomplete endpoint, but they have a few issues:

  • They are specific to the place results of the autocomplete endpoint; notably, we have no unified mechanism to test for the detection of intentions.
  • They make it hard to investigate changes: the failing tests have generally never been reviewed, and the noise introduced by the dispatch to elasticsearch nodes makes them non-deterministic (two instances running the same version of the stack may show a few thousand differences in their success/failure results).

What kind of metrics could we rely on?

Comparison with competitors

Competitors like Google or Bing also provide auto-completion services that could be used as a point of comparison for ours. There is a difficulty in some cases (especially with Google) where the auto-completion service does not have the same role as ours and won't surface POIs until the user presses "enter" to trigger an actual search.

Weighted tests

As mentioned before, current test cases have some noise in their results, which makes it difficult to identify specific regressions. One way to mitigate this issue could be to classify how critical each test is. For instance, if some day the search "tour eiffel" in Romania doesn't return the French "Tour Eiffel", that may be a very embarrassing regression, whereas if the search for some address now hits the neighboring city, it may not be as alarming.
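
A minimal sketch of such weighted tests, assuming each test case carries a criticality class (the classes and weights below are purely illustrative, not an existing geocoder-tester feature):

```python
CRITICALITY_WEIGHTS = {"landmark": 10, "city": 5, "address": 1}  # illustrative

def weighted_pass_rate(test_results):
    """test_results: iterable of (passed: bool, criticality: str) pairs."""
    total = sum(CRITICALITY_WEIGHTS[crit] for _, crit in test_results)
    passed = sum(CRITICALITY_WEIGHTS[crit] for ok, crit in test_results if ok)
    return passed / total if total else 1.0
```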

First proof-of-concept

  • Use /v1/autocomplete only. The content of "intentions" results can be ignored for now. However, the rank of place "features" should be shifted when an intention is returned (i.e. the 1st result should be considered as 2nd, etc.; see the sketch after this list).

  • Build a subset of the current "geocoder-tester" dataset (~1000 tests). Ideally this subset would be more representative than the original dataset (lower proportion of addresses, etc.).

  • Add extra parameters for tests where they are not defined explicitly: lang=fr, limit=7, nlu=true.

  • Provide to XXXXX: a sample of tests, a sample of expected output, and implementation details about result validation (distance from expected location, etc.).
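
A minimal sketch of the rank shift mentioned in the first bullet, assuming the autocomplete response exposes intentions and features lists (field names are assumptions):

```python
def ranked_features(response: dict):
    """Pair each place "feature" with its effective rank, shifted by one
    whenever an intention is returned first (field names are assumed)."""
    offset = 1 if response.get("intentions") else 0
    return [
        (position + offset, feature)
        for position, feature in enumerate(response.get("features", []), start=1)
    ]
```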