Optimize and fix the Sphinx indexes SQL query. #486

Merged
merged 1 commit into from Nov 14, 2014

Member

jiru commented Oct 29, 2014

By taking advantage of cross-linking, we can get both direct and indirect
translations using a single query. We start by querying indirect ones,
and if an indirect one equals the original one, it means we got back to
the starting point, and in this case we keep the direct translation instead.
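
As a rough illustration of the idea, here is a minimal sketch using Python's sqlite3 module and an in-memory database. The table and column names (`sentences_translations`, `sentence_id`, `translation_id`) are the ones used elsewhere in this description; the sample data, the aliases `d` and `i`, and the `CASE` expression are assumptions made for the example, not the query from this commit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sentences_translations (sentence_id INTEGER, translation_id INTEGER)"
)
# Hypothetical cross-linked data: 1 <-> 2 and 2 <-> 3, each link stored both ways.
conn.executemany(
    "INSERT INTO sentences_translations VALUES (?, ?)",
    [(1, 2), (2, 1), (2, 3), (3, 2)],
)

# Follow two hops: d is the direct link, i is a link leaving the direct translation.
# If the second hop lands back on the starting sentence, keep the direct translation.
QUERY = """
SELECT DISTINCT
    d.sentence_id,
    CASE WHEN i.translation_id = d.sentence_id
         THEN d.translation_id
         ELSE i.translation_id
    END AS related_id
FROM sentences_translations AS d
JOIN sentences_translations AS i
  ON i.sentence_id = d.translation_id
ORDER BY d.sentence_id, related_id
"""

related = {}
for sentence_id, related_id in conn.execute(QUERY):
    related.setdefault(sentence_id, []).append(related_id)

print(related)  # {1: [2, 3], 2: [1, 3], 3: [1, 2]}
```

In this toy data set, sentence 1 gets sentence 2 (direct) and sentence 3 (indirect, through 2) from a single join, and no sentence is listed as a translation of itself.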

The only downside of this approach is that it doesn’t work when two
sentences aren’t cross-linked. There are currently only five such broken
links on the production server, visible by running the following query:

select s.sentence_id, s.translation_id
from sentences_translations as s
left join sentences_translations s2
on s.translation_id = s2.sentence_id
and s.sentence_id = s2.translation_id
where s2.sentence_id is NULL;
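
To see what this check reports, here is a small sketch that runs the same query against an in-memory SQLite database. The sample rows are made up for the example: the pair 1/2 is linked in both directions, while 3 -> 4 has no reverse row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sentences_translations (sentence_id INTEGER, translation_id INTEGER)"
)
# Hypothetical data: 1 <-> 2 is cross-linked, 3 -> 4 is missing its reverse link.
conn.executemany(
    "INSERT INTO sentences_translations VALUES (?, ?)",
    [(1, 2), (2, 1), (3, 4)],
)

# Same query as above: a link is broken when no row exists in the opposite direction.
BROKEN_LINKS = """
select s.sentence_id, s.translation_id
from sentences_translations as s
left join sentences_translations s2
on s.translation_id = s2.sentence_id
and s.sentence_id = s2.translation_id
where s2.sentence_id is NULL
"""

broken = conn.execute(BROKEN_LINKS).fetchall()
print(broken)  # [(3, 4)]
```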

A quick comparison of the data returned by this new query shows that
it is more accurate. The previous one happened to return the original
sentence language as a translation language from time to time, whereas this
query doesn’t. This problem was visible when running a search with the same
source language as the target one, and a keyword of that language.

http://tatoeba.org/sentences/search?query=fish&from=eng&to=eng

Such a query returned many English sentences without any translation,
whereas it should only return English sentences translated into English.
So this commit fixes that too.

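Some benchmarks.

Running the query that generates indexes for Turkish sentences (about 200k)
on the production server, five times before and after the change. The new
query is about 60% faster (mean run time drops from about 56 s to about 35 s)
and uses less memory.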

                              before      after

time to run the query        0:54.25    0:33.74
                             0:54.85    0:40.27
                             0:59.39    0:33.71
                             0:56.75    0:31.87
                             0:54.51    0:35.62

maximum resident set size     213 kB     182 kB
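
The "about 60% faster" figure can be checked with a few lines of arithmetic. This sketch only uses the timings from the table above, converted to seconds; the variable names are mine.

```python
# Timings copied from the benchmark table, converted from m:ss.cc to seconds.
before = [54.25, 54.85, 59.39, 56.75, 54.51]
after = [33.74, 40.27, 33.71, 31.87, 35.62]

mean_before = sum(before) / len(before)
mean_after = sum(after) / len(after)

speedup = mean_before / mean_after         # how many times faster the new query is
time_saved = 1 - mean_after / mean_before  # fraction of the run time saved

print(f"mean before: {mean_before:.2f} s, mean after: {mean_after:.2f} s")
print(f"speedup: {speedup:.2f}x, run time reduced by {time_saved:.0%}")
```

So "60% faster" holds in terms of throughput (about 1.6 times as fast), which corresponds to a run time roughly 37% shorter.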

Contributor

alanfgh commented Nov 2, 2014

> By taking advantage of cross-linking, we can get both direct and indirect translations using a single query.

Are you talking about a query whose results are visible to the end user, or one that will be processed further? Currently, whatever we display to the user makes a firm distinction between direct and indirect translations. If I were to do a query, there are situations (especially as an advanced contributor looking for sentences to link) where I might want to know indirect translations, but most of the time, I would want to know only direct translations. I would rather have to do a separate query for each one.

We should fix the five sentences with broken links. I'll write that up as a separate issue.

Member

jiru commented Nov 3, 2014

I’m talking about the query used to tell Sphinx which sentences are translated into which languages, so that one can perform a search based on that criterion. We don’t make a distinction between “indirectly translated into language X” and “directly translated into language X” while performing a search. Sphinx just returns “every sentence in language X directly or indirectly translated into language Y”, and then, for each of these sentences, we look up translations and display them. So it’s a two-step process and I was only talking about the first step. Technically, we were using a single query before too, but it was two queries joined by an SQL union, which was slower.

trang added this to the 2014-11-16 milestone Nov 13, 2014

trang added a commit that referenced this pull request Nov 14, 2014

Merge pull request #486 from Tatoeba/sql-optimiz
Optimize and fix the Sphinx indexes SQL query.

trang merged commit 1b7ca59 into master Nov 14, 2014

trang deleted the sql-optimiz branch Nov 20, 2014
