Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance boost for bulk delete, and additional abstraction in utils #99

Merged

Conversation

afrancis13
Copy link

I added bulk delete functionality around a couple weeks ago, but I noticed that this could be done with significantly improved efficiency. Instead of serializing the entire document over batches, you can just specify the primary key for elasticsearch (often integers) along with the operation type ('delete') and delete in that fashion (see streaming_bulk in https://github.com/elastic/elasticsearch-py/blob/master/elasticsearch/helpers/__init__.py)
Since things were getting a little cluttered inside update, I pulled some sub-functionality out of there, put them in helper functions, and added documentation. I'll comment on one other thing below.

def filter_model_items(index_instance, model_items, model_name, start_date, end_date):
''' Filters the model items queryset based on start and end date.'''
if index_instance.updated_field is None:
logging.warning("No updated date field found for {} - not restricting with start and end date".format(model_name))
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a logger warning in Haystack instead of a hard error, and I think we should follow suit. The problem with the ValueError from before was that it was difficult to apply the update functionality across varied types of model indices. For example, if you want to apply update with start and end dates over all your model indices, you need to split the ones with update_field from the ones without in some way. This is possible for users like us, but seems a bit more cumbersome than necessary. Let me know what you think about this!

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's a valid point.

@@ -279,7 +279,10 @@ def test_time_indexing(self):
except Exception as e:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While working on this test, can we remove the exception handling here and let the testing framework catch it. Instead of an assertion fail, it's probably better for the test to raise the exception. Hence, this function would become:

def test_time_indexing(self):
    update_index(Article.objects.all(), 'Article', start_date=datetime.strftime(datetime.now(), '%Y-%m-%d %H:%M'))
    update_index(NoUpdatedField.objects.all(), 'NoUpdatedField', end_date=datetime.strftime(datetime.now(), '%Y-%m-%d'))

What do you think?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think this would be a better way of writing this, now that you mention it.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I'll fix that up soon then.

ChristopherRabotin added a commit that referenced this pull request Aug 6, 2015
Performance boost for bulk delete, and additional abstraction in utils
@ChristopherRabotin ChristopherRabotin merged commit 4df82e7 into ChristopherRabotin:master Aug 6, 2015
afrancis13 pushed a commit to afrancis13/bungiesearch that referenced this pull request Aug 6, 2015
Performance boost for bulk delete, and additional abstraction in utils
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants