
Adds support of search_languages so that we can limit the list of lan… #615

Open
wants to merge 2 commits into master

Conversation

dixstonz3

We have a performance issue when running photon with multiple languages like so:

-languages ru,en,de,uk,es,pt,fr,zh,pl,it,tr,ja,vi,id,ms,th,hi

We don't know in advance what the client's language will be; it might be any of them.

In this case, when a new request comes in, photon builds a query that searches across all supported languages, which is too slow and unnecessary. We know, for example, that for a client from Brazil we only need to search across the pt, es and en languages, but there is currently no way to express that.

This commit adds a new "search_languages" API parameter that specifies a comma-separated list of languages that ES should search through. If the parameter is not given, photon falls back to the list of supported languages from the -languages option (the old logic applies).
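
A minimal sketch of how such a parameter could be parsed and validated on the server side (illustrative only; the class and method names here are mine, not Photon's actual internals):

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical sketch: parse the "search_languages" request parameter and
    // restrict it to the languages the server was started with (-languages).
    final class SearchLanguages {

        static List<String> parse(String param, List<String> supported) {
            if (param == null || param.isBlank()) {
                return supported;               // old behaviour: search all configured languages
            }
            List<String> result = new ArrayList<>();
            for (String lang : param.split(",")) {
                lang = lang.trim().toLowerCase();
                if (supported.contains(lang)) { // silently drop unsupported codes
                    result.add(lang);
                }
            }
            return result.isEmpty() ? supported : result;
        }
    }

With this, parse("en,it,de", supported) yields [en, it, de], and the query builder would add match clauses for those languages only.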

@dixstonz3
Author

Here is an example of how the "search_languages" parameter can be used to limit the list of languages ES will search through:

/api/?q=gdansk&limit=7&lang=en&search_languages=en,it,de

In this case, even if photon is running with the extended -languages list (ru,en,de,uk,es,pt,fr,zh,pl,it,tr,ja,vi,id,ms,th,hi), the ES query will search only through English, Italian, German and the local language.
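
As a complete call against a locally running instance (assuming Photon's default listen port of 2322):

    curl 'http://localhost:2322/api/?q=gdansk&limit=7&lang=en&search_languages=en,it,de'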

@lonvia
Collaborator

lonvia commented Nov 24, 2021

I very much agree that this is a much-needed feature.

However, I am wondering: do we really need an extra parameter for it? Would it not be sufficient to extend the lang parameter to accept a list of languages? The huge advantage would be that the lang parameter falls in line with the Accept-Language HTTP header. I was roughly thinking along these lines (a rough code sketch follows below):

  • get the language list:
    • if the lang parameter is set, parse the comma-separated list of languages from it
    • else: take the language list from the Accept-Language header
    • else: use [any] (for output in the local language and searching through all languages)
  • use all languages in the list for searching; use the first language for the output
  • special values:
    • default = output in the local language (ignored for the list of languages to search in)
    • any = use all supported languages for searching (use the local language when first in the list)

The disadvantage is that such an approach would change the meaning of lang. In particular: when a single language is used then we really only search in that language and the local language. Would that be an issue?
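
To make the proposal concrete, here is a rough sketch of the resolution order above (the names are hypothetical, not Photon's actual code):

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative sketch of the proposed resolution order; not Photon's code.
    final class RequestLanguages {

        /** lang parameter first, then Accept-Language, then the special value "any". */
        static List<String> resolve(String langParam, String acceptLanguage) {
            if (langParam != null && !langParam.isBlank()) {
                return splitList(langParam);
            }
            if (acceptLanguage != null && !acceptLanguage.isBlank()) {
                // Drop quality values ("de;q=0.8" -> "de") but keep the header order.
                List<String> langs = new ArrayList<>();
                for (String part : acceptLanguage.split(",")) {
                    langs.add(part.split(";")[0].trim());
                }
                return langs;
            }
            return List.of("any"); // output in the local language, search through all
        }

        private static List<String> splitList(String value) {
            List<String> langs = new ArrayList<>();
            for (String lang : value.split(",")) {
                langs.add(lang.trim());
            }
            return langs;
        }
    }

All languages in the returned list would be used for searching; the first entry would select the output language, with default and any handled as the special values described above.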

@lonvia
Collaborator

lonvia commented Nov 24, 2021

On the technical side: the CI error has nothing to do with your PR. I've fixed it on master. Could you please rebase at some point?

Also, the new feature will eventually need a test, but let's settle on the API first.

@dixstonz3
Author

dixstonz3 commented Nov 25, 2021

The disadvantage is that such an approach would change the meaning of lang. In particular: when a single language is used then we really only search in that language and the local language. Would that be an issue?

Yes, I think that would be an issue.

  1. It does change the meaning of the lang parameter, which should denote the output language. To me, that is unintuitive and misleading.

  2. It breaks backward compatibility for clients that run photon with multiple languages like so:

-languages es,pt,en

and use the lang parameter to specify the output language. Their searches would suddenly be restricted to a single (output) language.

Also, this would be an issue for countries that have multiple official or very widespread languages (pt/es, ru/uk and so on).

In our application, the output language is set company-wide for the company in which a user works. For example, a company might decide that its official language is Ukrainian. At the same time, the company might have Russian-speaking employees (whose second language is Ukrainian) who prefer to search for addresses in their first language, Russian, even though the results will still be shown in Ukrainian.

As far as I understand, the same issue applies to es/pt-speaking countries and fr/de/it countries. I am not sure about the Asian region.

@dixstonz3
Author

On the technical side: the CI error has nothing to do with your PR. I've fixed it on master. Could you please rebase at some point?

Done with the rebase.

@dixstonz3
Author

Also, the new feature will eventually need a test, but let's settle on the API first.

I have added some tests for the new search_languages parameter.

@lonvia
Collaborator

lonvia commented Dec 8, 2021

The lang parameter is not defined as only changing the output language; it has always also influenced the ranking of search results. Whether that is a good or a bad thing is open for discussion.

I see two fundamental problems with the PR in its current state:

  1. It introduces yet another way of specifying languages next to the lang parameter and the Accept-Language header. Given that the existing parameters already cause confusion, we need to carefully pin down the meaning of the existing ones before adding another.
  2. The PR isn't really solving the actual problem you are having, namely that the query becomes too large when too many languages are configured on the server side. It is an attempt to work around the problem by off-loading the task of making the query smaller (by restricting the languages) to the user. That usually is not a good idea. The server should always work efficiently for the majority of use cases without the user being obliged to intervene.

Sorting out the language parameters is tricky. It is not always obvious which language people are going to use for toponyms, especially in areas foreign to them. Similarly, when dealing with OSM input data it is not always clear what the language-specific tags contain. The translation is sometimes less well known than the local-language term. Nominatim has always dealt with this by searching through all data, and that seems to work reasonably well. Photon used to search only in the local and chosen languages, and that frequently caused problems. It has only very recently switched to an algorithm that searches through all languages. That is what is causing the performance problems now.

So what does that mean?

Taking user language preferences into account when ranking results is a good thing. That's what I referred to in my first comment. I'm just not convinced that not considering foreign-language terms in the first place works out well. Unfortunately that doesn't get us very far in terms of the performance issue because 2 of the 3 places where the supported languages come into play are filter terms in the query. Only the part that prefers full word terms relates to ranking exclusively.

In the end, the performance issue needs to be tackled separately by changing our index structure in a way that filter queries over many languages become a lot cheaper. That should be possible. The current multi-language support tried to build on the existing indexes to remain backwards-compatible with the database. But going forward we can alter the database structure.
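
To illustrate why the per-language clauses get expensive, here is a hypothetical sketch (it assumes per-language subfields such as name.en and greatly simplifies the real query builder):

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical illustration: with per-language subfields (e.g. "name.en"),
    // every configured language adds further match/filter clauses to the query.
    public class QueryGrowth {
        public static void main(String[] args) {
            List<String> languages = List.of("ru", "en", "de", "uk", "es", "pt", "fr", "zh",
                                             "pl", "it", "tr", "ja", "vi", "id", "ms", "th", "hi");
            List<String> fields = new ArrayList<>(List.of("name.default"));
            for (String lang : languages) {
                fields.add("name." + lang); // one more clause per language and field
            }
            // 17 languages -> 18 name fields; the same multiplication hits the
            // filter parts of the query, not just the ranking part.
            System.out.println(fields.size() + " name fields for " + languages.size() + " languages");
        }
    }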

NB: using the latest released version, you can run different instances of Photon with different language settings, to restrict the set of languages used. The instances can be run against the same database. The set of supported languages is a run-time setting. So you can have a database that indexes all languages and then one instance for the es/pt/en users, one instance for the ru/uk/en users etc.
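
For example (assuming the -listen-port option; adjust paths and ports to your setup):

    java -jar photon.jar -languages es,pt,en -listen-port 2322
    java -jar photon.jar -languages ru,uk,en -listen-port 2323

Both instances read the same index; each restricts searching to its own language set.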

@dixstonz3
Author

The PR isn't really solving the actual problem you are having, namely that the query becomes too large when too many languages are configured on the server side.

The server should always work efficiently for the majority of use cases without the user being obliged to intervene.

In the end, the performance issue needs to be tackled separately by changing our index structure in a way that filter queries over many languages become a lot cheaper. That should be possible. The current multi-language support tried to build on the existing indexes to remain backwards-compatible with the database. But going forward we can alter the database structure.

I completely agree with the above. I will be waiting for the performance fix.

It is an attempt to work around the problem by off-loading the task of making the query smaller (by restricting the languages) to the user.

No, it's not. In our case we know exactly which languages to search through, because our software already has built-in language settings that can be reused as search languages. That's why this solution will work well for us until the performance fix is released.

using the latest released version, you can run different instances of Photon with different language settings, to restrict the set of languages used. The instances can be run against the same database.

I doubt it is a good idea to run 17 instances of photon, one for each of the supported languages:

en,de,uk,es,pt,ru,fr,zh,pl,it,tr,ja,vi,id,ms,th,hi

At the very least, it seems to be a very RAM-consuming solution.
