New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Performance boost for bulk delete, and additional abstraction in utils #99
Performance boost for bulk delete, and additional abstraction in utils #99
Conversation
def filter_model_items(index_instance, model_items, model_name, start_date, end_date): | ||
''' Filters the model items queryset based on start and end date.''' | ||
if index_instance.updated_field is None: | ||
logging.warning("No updated date field found for {} - not restricting with start and end date".format(model_name)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a logger warning in Haystack instead of a hard error, and I think we should follow suit. The problem with the ValueError
from before was that it was difficult to apply the update functionality across varied types of model indices. For example, if you want to apply update with start and end dates over all your model indices, you need to split the ones with update_field
from the ones without in some way. This is possible for users like us, but seems a bit more cumbersome than necessary. Let me know what you think about this!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's a valid point.
@@ -279,7 +279,10 @@ def test_time_indexing(self): | |||
except Exception as e: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While working on this test, can we remove the exception handling here and let the testing framework catch it. Instead of an assertion fail, it's probably better for the test to raise the exception. Hence, this function would become:
def test_time_indexing(self):
update_index(Article.objects.all(), 'Article', start_date=datetime.strftime(datetime.now(), '%Y-%m-%d %H:%M'))
update_index(NoUpdatedField.objects.all(), 'NoUpdatedField', end_date=datetime.strftime(datetime.now(), '%Y-%m-%d'))
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think this would be a better way of writing this, now that you mention it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I'll fix that up soon then.
Performance boost for bulk delete, and additional abstraction in utils
Performance boost for bulk delete, and additional abstraction in utils
I added bulk delete functionality around a couple weeks ago, but I noticed that this could be done with significantly improved efficiency. Instead of serializing the entire document over batches, you can just specify the primary key for elasticsearch (often integers) along with the operation type ('delete') and delete in that fashion (see
streaming_bulk
in https://github.com/elastic/elasticsearch-py/blob/master/elasticsearch/helpers/__init__.py)Since things were getting a little cluttered inside update, I pulled some sub-functionality out of there, put them in helper functions, and added documentation. I'll comment on one other thing below.