Faster search index refresh using delta index #442

Merged
jiru merged 11 commits into master from delta-index on Oct 4, 2014

jiru commented Sep 26, 2014

This allows newly inserted sentences to be quickly found, using the delta index technique described in the Sphinx documentation.

As mentioned in #233, it isn’t as complete as it could be, but it is already usable. The limitation is that we must recreate the main indexes once in a while, instead of being able to merge the delta indexes into them little by little.

We need to set up a frequent delta index refresh (every hour, possibly more often) and a less frequent main index refresh (often enough that the delta doesn’t get too big). To run the refreshes, use the appropriate CakePHP shell commands:

./cake/console/cake -app app/ sphinx_indexes update main
./cake/console/cake -app app/ sphinx_indexes update delta

These should be run as the sphinxsearch user, so use the sphinxsearch user's crontab or sudo -u sphinxsearch from root.
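
As a rough sketch of the scheduling described above, the function below prints two crontab entries, one per refresh. The install path and both schedules (delta every 30 minutes, main once a night) are assumptions for illustration only; they are not values decided in this PR.

```shell
#!/bin/sh
# Print crontab entries for the delta and main index refreshes.
# $1 is the application root (hypothetical; adjust to the real install path).
print_sphinx_crontab() {
    app_root="$1"
    cmd="cd $app_root && ./cake/console/cake -app app/ sphinx_indexes update"
    # Assumed schedule: delta every 30 minutes, main every night at 03:15.
    printf '%s\n' "*/30 * * * * $cmd delta"
    printf '%s\n' "15 3 * * * $cmd main"
}

print_sphinx_crontab /path/to/tatoeba
```

The printed lines can be pasted into the sphinxsearch user's crontab (sudo crontab -u sphinxsearch -e). The two schedules are offset so that the refreshes do not start at the same minute.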

@trang Before pushing to production, you need to increase the maximum number of open files for the Sphinx daemon. sudo sed -i '/start-stop-daemon --start/i ulimit -n 2048' /etc/init.d/sphinxsearch should do the trick. Also, don’t forget to run the database update script.
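
To see what that sed command does before touching the real init script, it can be tried on a throwaway file. The two-line "init script" and the searchd path below are made up for the demonstration, and the one-line form of the i command relies on GNU sed.

```shell
#!/bin/sh
# Dry run of the sed edit on a temporary file (GNU sed required).
tmp="$(mktemp)"
cat > "$tmp" <<'EOF'
echo "Starting sphinxsearch"
start-stop-daemon --start --exec /usr/bin/searchd
EOF

# Insert "ulimit -n 2048" on its own line before every line that
# matches "start-stop-daemon --start".
sed -i '/start-stop-daemon --start/i ulimit -n 2048' "$tmp"
cat "$tmp"
# Remove the temporary file when done: rm -f "$tmp"
```

The ulimit line ends up directly above the start-stop-daemon line. Note that running the command a second time inserts the line again, so it should be applied only once to /etc/init.d/sphinxsearch.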

jiru added some commits Sep 24, 2014

Added a shell to ease index management.
Now that we have these delta/main indexes all around, `indexer --all` is
not enough to run batch commands on the indexes, so I created
a script for the usual index merge and index update tasks.
Fixed bogus model relationship.
Sentences and Tags are already linked with a HABTM relation, so no
need to put it in $belongsTo. Also, CakePHP uses the automatic name
'TagsSentence' for the HABTM model, which misses our 's'.

The bug appeared while loading the Sentence model *only* from the
sphinx_indexes.php shell and performing a find().
Add 'last modified' into the Sphinx indexes. #233
So that we can get that value in order to know the newest sentence
of a given index.
Convert datetimes to make Sphinx happy.
See http://sphinxsearch.com/docs/archives/manual-2.0.4.html#conf-sql-attr-timestamp
Now we should be able to use the 'created' and 'modified' Sphinx attributes.
Fix delta discriminant values. #233
Update the delta discriminant value using data from the index itself
instead of our database. This way, the values in sphinx_delta stay
consistent with the actual indexes.
Handle sentences with null last modification time.
Include sentences with null last modification time in the main index,
and set the delta discriminant to zero if all the sentences have a null
last modification time.

Otherwise unmodified sentences imported from the Tanaka Corpus are known
to have a null last modification time. In addition, when importing sentences
from CSV files on a VM, all the sentences have a null last modification
time.
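
The null handling described in the last commit can be illustrated with a small sketch. It is not the code from this PR (the actual shell is written in PHP and reads from the Sphinx index); the function name, the input format (one Unix timestamp per line, with NULL or an empty line standing for a null value) and the use of awk are all assumptions.

```shell
#!/bin/sh
# Hypothetical helper: print the newest last-modified timestamp read
# from stdin, or 0 when every value is null.
delta_discriminant() {
    awk '
        # Only purely numeric lines count; "NULL" and empty lines are skipped.
        ($1 ~ /^[0-9]+$/) && ($1 + 0 > max) { max = $1 + 0 }
        # max is still unset if nothing matched; adding 0 turns it into 0.
        END { print max + 0 }
    '
}

printf '1411689600\nNULL\n1412380800\n' | delta_discriminant   # prints 1412380800
printf 'NULL\n\n' | delta_discriminant                         # prints 0
```

With a discriminant of zero, every sentence that has a last modification time would fall on the delta side, while the null ones stay in the main index, which matches the behaviour the commit message describes.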

trang added this to the 2014-10-04 milestone Sep 28, 2014

jiru commented Sep 29, 2014

Note to self: remove the text that says the search is being infrequently indexed.

trang commented Oct 3, 2014

@jiru, I'll let you do the merge when you feel ready.

It works, but I haven't tested it thoroughly. Maybe I can do a few more tests tonight.

Anyway, I have only been able to test on my Windows setup, so I cannot say whether you may run into environment config issues on prod, besides the ulimit -n 2048.

I did try to test this on my VM (provided by lool0), but I ran into memory issues when starting searchd. I can't remember if I ever managed to run Sphinx on that VM, though. I didn't investigate the problem further; I suspect the indexer didn't work properly.

jiru added a commit that referenced this pull request Oct 4, 2014

Merge pull request #442 from Tatoeba/delta-index
Faster search index refresh using delta index

jiru merged commit 205a248 into master Oct 4, 2014

jiru referenced this pull request in Tatoeba/imouto Oct 4, 2014

Increase the number of allowed open files for Sphinx #21 (Closed)

trang deleted the delta-index branch Oct 16, 2014
