New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove djorm-exp-pgfulltext #1652

Merged
merged 8 commits into from Mar 22, 2018

Conversation

Projects
None yet
2 participants
@toolness
Contributor

toolness commented Mar 22, 2018

This fixes #1645.

Implementing the fix involved some reading of djorm_pgfulltext's source code, as well as of django.contrib.postgres.search. A few things djorm-ext-pgfulltext made easy aren't supported out-of-the-box by Django's full-text-search, so I had to implement a solution.

Notes

  • manage.py update_search_field no longer delegates to a djorm_pgfulltext command. It was easier to just hard-code the auto-updating of the Contracts model, so I removed the <app-name> <model-name> arguments to the command.

  • Doing this work required examining the underlying SQL Django was sending to the database. Hacking this into settings.py was cumbersome, so I added a boolean environment variable called DEBUG_LOG_SQL that can be used to toggle it, which should make future debugging easier too.

  • Django's SearchQuery full-text search queries only translate to SQL as plainto_tsquery, whereas our search code has always been using to_tsquery. As far as I can tell, plainto_tsquery doesn't let us specify to search for words that are prefixes of labor category names. For instance, searching for eng via a plainto_tsquery will only find results like "senior eng", but not "senior engineer". So I had to implement a custom Value subclass that used to_tsquery instead.

  • There's no easy built-in way to tell Django to auto-update SearchVectorFields on model save (in contrast, djorm-ext-pgfulltext did this kind of thing automatically for us). In fact, Django's documentation tells us to just install a postgres trigger, which is kind of annoying, since we're not really creating new models very often for that to be a necessity. Anyways, I dug into how djorm-ext-pgfulltext did things, which was via a post-save signal, and replicated the strategy in our own code.

  • I made modifications to the initial migration of the Contracts model instead of adding a custom migration; this is because the initial migration included an import of djorm_pgfulltext, which I had to remove in order to decouple us from that package. And it's OK that I did this, because the underlying postgres database field remains unchanged: it's still just a tsvector under-the-hood.

@@ -236,7 +236,7 @@
),
}
LOGGING = {
LOGGING: Dict[str, Any] = {

This comment has been minimized.

@toolness

toolness Mar 22, 2018

Contributor

Had to appease mypy here: it was inferring LOGGING to be a dictionary mapping strings to objects, which are non-indexable.

Also, code climate seems to think this type annotation is the same thing as putting two statements on one line, which is super weird.

return self.search(MultiPhraseSearchQuery(queries))
def search(self, query):
return self.filter(search_index=query)

This comment has been minimized.

@toolness

toolness Mar 22, 2018

Contributor

search() was a method on the manager provided by djorm_pgfulltext, and it was easiest to just tack it on to our manager. I guess we could just remove it and inline this function's body if we think that makes things more readable, though.

name='search_index',
field=djorm_pgfulltext.fields.VectorField(),
preserve_default=True,
),

This comment has been minimized.

@toolness

toolness Mar 22, 2018

Contributor

I believe this AlterField was solely due to an upgrade of djorm_pgfulltext that made a bunch of formerly-explicit constructor arguments implicit. In other words, this AlterField doesn't actually represent any kind of change to the underlying database schema--similar to how merely changing a field's help_text often triggers a migration, even though help_text doesn't have anything to do with the underlying database field.

This comment has been minimized.

@toolness

toolness Mar 22, 2018

Contributor

Note that actually tracing the cause for these lines is non-trivial; they were first added in #170, which doesn't make any changes to the search_index field at all. Furthermore, from that time to the beginning of the project, djorm_pgfulltext's entry in requirements.txt didn't include a version number, so it was hard to tell what version was being used at what time.

The only other evidence we have is that the change to VectorField's constructor defaults was made in linuxlewis/djorm-ext-pgfulltext@db8c220, which was in the 0.9.3 release. which came out on March 31, 2015 (a few months before this migration was made).

Furthermore, the migration immediately preceding this one was made in #45 on January 9, 2015. So given that 0.9.3 shipped between the preceding migration and this one, it makes sense that this one would reflect the changes to VectorField's constructor.

Phew! So in other words, I think my hunch that this AlterField doesn't actually represent a change in the underlying db schema is correct.

This comment has been minimized.

@toolness

toolness Mar 22, 2018

Contributor

Wait, there's something I'm really confused about here... null defaulted to True prior to djorm-ext-pgfulltext 0.9.3, but it seems like that default changed to False in 0.9.3 (see the previously linked-to linuxlewis/djorm-ext-pgfulltext@db8c220 for evidence of this). So maybe the underlying schema did change in this migration... Arrgh so confusing.

This comment has been minimized.

@toolness

toolness Mar 22, 2018

Contributor

Yup, this is super duper weird. If I port the test_schema_is_what_we_expect test I added in 9c9d3d8 over to develop, it fails, because the underlying is_nullable field in postgres is NO in develop, but YES in this PR.

Yet if I change null=False in this PR's initial migration, a bunch of tests start failing with database constraint violations, claiming that the search index can't be set to NULL... ohh man this is weird.

This comment has been minimized.

@toolness

toolness Mar 22, 2018

Contributor

Oops nevermind. When I was changing null=False to this PR's initial migration, I wasn't making a corresponding null=False modification to the Contract model. Once I did that in 6be8f40 everything seems to have parity with develop, I think.

This comment has been minimized.

@toolness

toolness Mar 22, 2018

Contributor

So technically, if we really want to reflect past history with 100% fidelity, we could change the initial migration's null to True, and then re-add this AlterField in this migration, setting null to False, as that would properly reflect the (utterly non-obvious and probably unintentional) changes in this migration. However, since it's not like we actually have any instances from 2015 that will actually need to run this migration to properly have their is_nullable flipped, I don't think it's a big deal.

This comment has been minimized.

@jseppi

jseppi Mar 22, 2018

Contributor

whoa, that is some deep spelunking!

@receiver(post_save, sender=Contract)
def on_contract_save(sender, instance=None, **kwargs):
if instance:
Contract.objects.filter(pk=instance.id).update_search_index()

This comment has been minimized.

@toolness

toolness Mar 22, 2018

Contributor

According to Django's documentation on updating multiple objects at once, update() (which is called by update_search_index()) "doesn’t run any save() methods on your models, or emit the pre_save or post_save signals", so this won't erroneously cause an infinite loop.

@jseppi

jseppi approved these changes Mar 22, 2018

This looks great, and I'm very happy to get rid of djorm-exp-pgfulltext!

A few comments in my review, but none of them blocking merging this really. My biggest thing is seeing if we can remove our custom tsquery code in favor of something from the framework.

from contracts.models import Contract
class Command(UpdateSearchFieldCommand):
class Command(BaseCommand):

This comment has been minimized.

@jseppi

jseppi Mar 22, 2018

Contributor

should we maybe name this something liked update_contracts_search_index since it's now hardcoded to the Contract model?

This comment has been minimized.

@toolness

toolness Mar 22, 2018

Contributor

Hmm, I was thinking we could keep it the same, just in case we do start adding more indexed fields, we could bring back the previous behavior of the old command?

@@ -304,6 +304,19 @@
},
}
DEBUG_LOG_SQL = 'DEBUG_LOG_SQL' in os.environ

This comment has been minimized.

@jseppi

jseppi Mar 22, 2018

Contributor

👍

name='search_index',
field=djorm_pgfulltext.fields.VectorField(),
preserve_default=True,
),

This comment has been minimized.

@jseppi

jseppi Mar 22, 2018

Contributor

whoa, that is some deep spelunking!

verbose_name = 'Contracts'
def ready(self):
from . import signals # noqa

This comment has been minimized.

@jseppi

jseppi Mar 22, 2018

Contributor

Why does this need to happen in ready?

This comment has been minimized.

@jseppi

jseppi Mar 22, 2018

Contributor

Oh, the docs say that's where it should go: https://docs.djangoproject.com/en/dev/topics/signals/

if isinstance(queries, str):
queries = [queries]
queries = [
Contract.normalize_labor_category(q)
for q in queries
]
return self.search(convert_to_tsquery_union(queries), raw=True)
tsquery = convert_to_tsquery_union(queries)

This comment has been minimized.

@jseppi

jseppi Mar 22, 2018

Contributor

I wonder if this manual construction of tsqueries is necessary or if there is some better way to work with django.contrib.postgres.search so that it isn't required.

This comment has been minimized.

@jseppi

jseppi Mar 22, 2018

Contributor

ah-ha! maybe we can use SearchQuery

This comment has been minimized.

@jseppi

jseppi Mar 22, 2018

Contributor

I'm having a bit of a brain-block on how to combine the terms best, but this is along the lines of what I'm thinking:

def multi_phrase_search(self, queries):
    if isinstance(queries, str):
        queries = [queries]
    queries = [
        SearchQuery(Contract.normalize_labor_category(q))
        for q in queries
    ]

   # There has to be a better/more-pythonic way to do this combination
    combined_search_query = queries[0]
    for q in queries[1:]:
        combined_search_query |= q

    return self.search(combined_search_query)

then we could get rid of all our custom tsquery code.
Maybe best to handle in another PR though!

This comment has been minimized.

@jseppi

jseppi Mar 22, 2018

Contributor

:doh: This was your third bullet point in the PR write-up. I got too much into reading the code and totally missed that you already explored this option.

This comment has been minimized.

@jseppi

jseppi Mar 22, 2018

Contributor

I think that is what this Django feature request is about: https://code.djangoproject.com/ticket/28041

This comment has been minimized.

@toolness

toolness Mar 22, 2018

Contributor

Oh cool, didn't realize there was a feature request. That could be a fun way to contribute upstream to Django... anyways, yeah, I originally tried exactly what you suggested but ALAS.

@toolness

This comment has been minimized.

Contributor

toolness commented Mar 22, 2018

Wooo thanks for reviewing this!

@toolness toolness merged commit 89d838f into develop Mar 22, 2018

3 checks passed

ci/circleci Your tests passed on CircleCI!
Details
codeclimate 1 fixed issue
Details
codeclimate/total-coverage 96% (0.0% change)
Details

@toolness toolness deleted the remove-pgfulltext branch Mar 22, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment