Internal error every time the home page is opened #1767

Closed
Guybrush88 opened this issue Feb 1, 2019 · 10 comments

@Guybrush88

commented Feb 1, 2019

As of today, every time I open the home page I get the internal error message, so the page is unusable. Every other page works properly, though.

@trang

Member

commented Feb 2, 2019

Something is wrong with the search engine; I'm not entirely sure what. I restarted it and ran the merge_indexes.sh script. It seems to be fine for now.

We should probably make it a priority to work on #1518 and optimize what we can along the way. Until then, I think we are more or less doomed to endure a search crash every now and then.

@jiru jiru closed this Feb 18, 2019

@Guybrush88

Author

commented Mar 13, 2019

This is still happening now.

@jiru

Member

commented Mar 13, 2019

Thanks for reporting this, Guybrush. I took care of it and opened a new issue: #1817.

@jiru

Member

commented Mar 30, 2019

I think the problem was caused by generally high load. d7ba14d already decreased the load a lot, but there may be other optimizations we can make.

@jiru jiru reopened this Apr 2, 2019

@jiru jiru self-assigned this Apr 2, 2019

@jiru

Member

commented Apr 2, 2019

It happened again today, so I'm gathering my findings here.

The slow request used to fetch random sentences suddenly jumped from 3-6 s to 12-17 s at some point during the 9:15am UTC indexation of bre, fry, fao, afr, yue, sqi, isl, hun, bel, arz, wuu, lat, swh, pol, kat, est, zsm, nob, swe, tur, por, ind, heb, nld. It further increased to 20-30 s after that reindexation finished and stayed like that for six hours. Then I started an index merge. Even though we run the merge with --rotate, for some reason the old delta indexes were still being used until Sphinx was restarted. Only after restarting Sphinx were the new delta indexes used, and the slow request time went back to 3-6 s.

So the problem is definitely related to delta/merge, but not due to the growth of delta indexes.

While the slow request is being executed, reindexation is blocked and connecting to SphinxQL blocks too.

Disk I/O remains very low.
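
For reference, a merge like the one described above is done with Sphinx's indexer tool. A minimal sketch, reusing the config path and the fin_main_index/fin_delta_index naming from the log quoted in a later comment (the exact invocation in merge_indexes.sh may differ):

```sh
# Merge the delta index into the main index and ask searchd to pick up the
# resulting files without a full restart (the --rotate behaviour discussed above).
indexer --config /etc/sphinxsearch/sphinx.conf \
        --merge fin_main_index fin_delta_index --rotate
```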

@jiru

Member

commented Apr 3, 2019

It happened again today. The query time gradually decreased after merging each delta index, from the biggest (500 documents) to the smallest. Merging delta indexes containing fewer than 100 documents did not affect the query time.

I also discovered that the eng, epo, fra, fin and heb main indexes were corrupted, so I rebuilt them. When trying to merge them with their deltas, the command failed:

Sphinx 2.2.11-id64-release (95ae9a6)
Copyright (c) 2001-2016, Andrew Aksyonoff
Copyright (c) 2008-2016, Sphinx Technologies Inc (http://sphinxsearch.com)

using config file '/etc/sphinxsearch/sphinx.conf'...
merging index 'fin_delta_index' into index 'fin_main_index'...
read 44.7 of 44.7 MB, 100.0% done
FATAL: failed to merge index 'fin_delta_index' into index 'fin_main_index': out of pool memory on loading persistent MVA values

However, increasing mem_limit and mva_updates_pool to 128 MB didn't help. Running indexer --check on them produced a segfault.
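
For reference, those two settings live in different sections of sphinx.conf; this is only a sketch of the values tried above, with the rest of the configuration omitted:

```
# sphinx.conf excerpt (sketch only)
indexer
{
    mem_limit        = 128M
}

searchd
{
    mva_updates_pool = 128M
}
```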

Since I was not able to reproduce the problem on dev by creating much bigger delta indexes, I suspect the problem comes from running UpdateAttributes() on the indexes.

jiru added a commit that referenced this issue Apr 4, 2019

Add -w option to wait for existing process
in order to make sure that index merge happens consistently,
despite a running delta index merge.

Mitigates #1767.
@jiru

Member

commented Apr 5, 2019

I installed Manticore. At first, the query time for the random sentence was very long (more than a minute), resulting in "Internal error" messages again, but I think that was because the search daemon was warming up. After warming up, the query time is about one second. I'll check that it doesn't get any higher over the next 24 hours, and if not, I'll close this issue.

A simple way to avoid this expensive query is to first select a language at random, giving languages with a higher number of sentences a proportionally higher chance of being selected (weighted selection). Then we ask Manticore to select a random sentence in that language.
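
A minimal sketch of that weighted selection, in Python; the language codes and counts are made-up placeholders, and the real counts would come from the per-language statistics:

```python
import random

# Hypothetical per-language sentence counts; in the real application these
# would come from the language statistics.
sentence_counts = {"eng": 1_200_000, "epo": 600_000, "fra": 400_000}

def pick_random_language(counts):
    """Pick a language with probability proportional to its sentence count."""
    languages = list(counts)
    weights = [counts[lang] for lang in languages]
    return random.choices(languages, weights=weights, k=1)[0]

lang = pick_random_language(sentence_counts)
# The second step would then ask Manticore for one random sentence
# restricted to `lang`.
```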

The downside is that it's not truly random, because the actual sentence query filters out unadopted and red sentences while the language weighting does not. For example, if a language accounts for 50% of the corpus and 90% of its sentences are orphans, it should only have about a 5% chance of showing up in the random sentence selection; with this not-truly-random algorithm it would have a 50% chance, because orphan sentences are not taken into account in the weighting. The bias is probably negligible, though.

jiru added a commit that referenced this issue Apr 6, 2019

Avoid updating empty delta indexes
This will speedup delta reindexation.

Refs #1767.

jiru added a commit that referenced this issue Apr 7, 2019

Get more random ids at once
in order to work around bug #1767.

This is not ideal because one visitor (or more, in race condition)
out of 500 will have a slow page response.
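
A rough sketch of the batching idea described in that commit message; the names, batch handling and the dummy query below are illustrative, not the actual implementation:

```python
import random

BATCH_SIZE = 500       # matches the "one visitor out of 500" trade-off above
_cached_ids = []       # hypothetical process-level cache of prefetched ids

def fetch_random_ids(n):
    # Stand-in for the expensive search-engine query that returns n random
    # sentence ids in a single call.
    return [random.randint(1, 8_000_000) for _ in range(n)]

def next_random_id():
    """Return one random sentence id, paying the slow query only once per batch."""
    global _cached_ids
    if not _cached_ids:
        # Only this request is slow; as noted above, two concurrent visitors
        # can race here and both refill the cache.
        _cached_ids = fetch_random_ids(BATCH_SIZE)
    return _cached_ids.pop()
```
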
@jiru

Member

commented Apr 7, 2019

I can confirm that switching to Manticore partly solved the problem, but it's still there. Whenever the query time jumps, it goes back to normal after a few minutes, so we now only have brief performance drops instead of a continuous failure. In addition, it looks like the search daemon no longer blocks when this happens, so the page will just be slow or failing for a few visitors. Changing the value of max_children seems to have made these drops less frequent, but they are still there.

About picking a random language before running the query: I figured out it's more complicated than I thought, so I did not implement it. There is an edge case: if the randomly selected language has no valid sentences (all red or orphans), the query will fail to find a sentence. And we do have such languages.

@jiru jiru closed this Apr 7, 2019

jiru added a commit that referenced this issue Apr 7, 2019

Use faster random id selection method
when no particular language is specified.

Solves #1767.
@jiru

Member

commented Apr 7, 2019

Okay, the problem is definitely solved now that I have implemented a faster algorithm for the random sentence that does not rely on the search engine. This means the random sentence will always show up regardless of the state of the search engine.
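
The commit itself isn't quoted here, so the following is only a guess at what a search-engine-free selection can look like: draw ids uniformly between the smallest and largest known sentence ids and retry until the draw hits an existing sentence (ids have gaps because sentences get deleted). Names and data are illustrative.

```python
import random

def random_sentence_id(existing_ids):
    """Pick a random sentence id using only the database side, no search engine."""
    lo, hi = min(existing_ids), max(existing_ids)
    while True:
        candidate = random.randint(lo, hi)
        if candidate in existing_ids:   # in practice: a primary-key lookup
            return candidate

# Toy usage with a hypothetical, gappy set of ids:
print(random_sentence_id({1, 2, 5, 9, 42}))
```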

@jiru jiru added the optimization label Apr 7, 2019

@trang

Member

commented Apr 7, 2019

@jiru Thanks for the upgrade to Manticore and the optimization of the random sentence selection! There's just one thing regarding reloading a random sentence (cf. my comment on the commit). But in any case, with the upgrade to Manticore I do feel Tatoeba is a bit faster now, and hopefully even faster when the optimization for the random sentence selection is deployed :)

@trang trang added this to the 2019-04-07 milestone Apr 7, 2019
