
search enhancements #865

Closed
pkarman opened this issue Dec 9, 2015 · 7 comments

@pkarman (Contributor) commented Dec 9, 2015

Background

We have several stories open related to improving the experience of proposals search:

These stories boil down to: make search faster, give users more searchable fields and advanced options, and support pagination (and presumably, though not yet mentioned, sorting by any field).

There are a few aspects of our data model and code architecture that make improving search a non-trivial problem:

  • data is heavily normalized in the db, across proposals, client_data, comments, versions, and attachments.
  • authorization is similarly normalized across users and roles.
  • each new client use case means new custom column names.

The current implementation uses a big block of hard-coded SQL plus some ActiveRecord associations to build a very large, multi-JOIN query, with manual calls to Postgres full-text search functions for every column the user wants to search. It does no pagination or user-controlled sorting, and does not distinguish between client slugs (all proposals are searched, regardless of the user's client_slug value).
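To make the shape of that concrete, here is a minimal sketch of the kind of per-column predicate the current implementation hand-rolls. The column names and helper are hypothetical, illustrative only, not our actual schema:

```ruby
# Hypothetical column list; the real implementation spans proposals,
# client_data, comments, versions, and attachments via JOINs.
SEARCHABLE_COLUMNS = %w[
  proposals.public_id
  client_data.project_title
  comments.comment_text
]

# Build one full-text predicate per column and OR them together,
# which is roughly what the hard-coded SQL does today.
def full_text_where_clause(columns)
  columns
    .map { |col| "to_tsvector('english', #{col}) @@ plainto_tsquery('english', :query)" }
    .join(" OR ")
end

puts full_text_where_clause(SEARCHABLE_COLUMNS)
```

Every column added to the list adds another `to_tsvector` call per row scanned, which is why performance degrades as the searchable surface grows.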

Current search usage indicates that 80-90% of searches are simply for the public_id of a known proposal. That use case could be solved far more simply and with better performance than the current implementation supports. However, it may be that the high percentage of public_id searches is because the existing user base has figured out that the current search feature doesn't really support much more than that. Only a small minority of client_data fields are searched, comments are ignored, multiple search query terms do not behave as expected, and performance is slow. I.e., our data might indicate a narrow need because users have self-selected a cow path that fits the existing feature constraints, not their actual business needs.

Options

I see three distinct ways forward.

Continue with the current, hand-rolled SQL

  • Refactor the code to limit results to the current user's client_slug, and interrogate the relevant client_data model class to autogenerate the Postgres full-text search syntax. This would let us add new client models while writing very little custom search SQL.
  • Column weighting for ranking could be defined as a class-based hash lookup.
  • Mangle the incoming search query terms in order to correctly implement AND/OR/NOT boolean logic.
  • Postgres term highlighting is very slow, so use it sparingly if ever.
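The weighting and query-mangling steps above could be sketched as follows. The hash contents and the rewriting rules are assumptions about what we would want, not existing code:

```ruby
# Class-based weight lookup: Postgres ranks A > B > C > D.
# Column names here are hypothetical.
COLUMN_WEIGHTS = {
  "public_id"     => "A",
  "project_title" => "B",
  "comment_text"  => "C",
}

# Rewrite user-facing AND/OR/NOT into tsquery operators so the
# boolean logic behaves as users expect.
def to_tsquery_string(user_query)
  user_query.split(/\s+/).map { |term|
    case term.upcase
    when "AND" then "&"
    when "OR"  then "|"
    when "NOT" then "!"
    else term
    end
  }.join(" ").gsub("! ", "!")
end

puts to_tsquery_string("urgent AND NOT draft")  # => "urgent & !draft"
```

The mangled string would be passed to `to_tsquery()` instead of `plainto_tsquery()`, since only the former honors `&`, `|`, and `!` operators.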

Pros

  • Gives us total control over the features
  • No added dependencies

Cons

  • Re-inventing wheels already solved by the other two potential paths below
  • Performance will degrade (even further) as we add columns to the search
  • Code maintenance
  • Hard to debug errors since data remains normalized

pg_search

Rewrite the search features to use the industry standard pg_search gem. The multisearch feature should support our normalized data.

NOTE I looked at textacular as an alternative gem, but it lacks a multisearch-like feature, so our normalized data structure would be a challenge for its simpler feature set.
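As a sketch only (model and column names are illustrative, not our schema), the multisearch setup would declare each model as searchable and let pg_search maintain one flattened index table:

```ruby
class Proposal < ActiveRecord::Base
  include PgSearch
  multisearchable against: [:public_id, :status]
end

class Comment < ActiveRecord::Base
  include PgSearch
  multisearchable against: [:comment_text]
end

# Records from both models land in a single pg_search_documents table,
# searched globally with:
#   PgSearch.multisearch("some query")
```

That single denormalized table is what avoids the multi-JOIN query at search time.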

Pros

  • Uses a standard gem
  • Search stays within Postgres (no external system dependencies)
  • Compared to option (1) above, our code should be configuration and declaration, instead of generating custom SQL

Cons

  • Hard to debug errors since data remains normalized
  • Performance will degrade (even further) as we add columns to the search

Elasticsearch

Offload search entirely to an external service. Query parsing, ranking, sorting, pagination, and highlighting would all be performed by ES.
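A minimal sketch of the request body we might send, assuming a denormalized index where each document carries its client_slug and flattened client_data and comment fields (all field names here are assumptions about the mapping, not a settled design):

```ruby
# Build an Elasticsearch query body as a plain hash: full-text match
# across fields (boosting public_id), an authorization-style filter on
# client_slug, pagination, and highlighting.
def es_search_body(query:, client_slug:, page: 1, per_page: 20)
  {
    query: {
      bool: {
        must: [
          { multi_match: { query: query,
                           fields: ["public_id^3", "client_data.*", "comments.text"] } }
        ],
        filter: [
          { term: { client_slug: client_slug } }
        ]
      }
    },
    from: (page - 1) * per_page,
    size: per_page,
    highlight: { fields: { "comments.text" => {} } }
  }
end
```

Pagination and sorting fall out of `from`/`size` (plus a `sort` key) rather than requiring any SQL work on our side.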

Pros

  • Industry standard
  • Best performance available
  • Standard service offering from cloud.gov
  • Minimal code to write, mostly around serializing/de-normalizing at index time, and query munging at search time
  • Data is de-normalized, which makes debugging easier

Cons

  • External service increases app complexity
  • Small latency in search availability while data is indexed (usually measured in seconds)

Recommendation

Elasticsearch. Best performance, least amount of code.

@jessieay (Contributor) commented Dec 9, 2015

I've had luck with textacular in the past. Not sure what the "multisearch-like feature" limitation you are referring to is. More details?

I still have the textacular search code I wrote if you're interested in taking a look. Was for searching through a postgres table with a few hundred thousand records, matching against a few columns (which we weighted).

If you suggest Elasticsearch, then we should do that. I am not sure how difficult it will be to procure elasticsearch for our needs. Have other 18F projects used it in the past?

@jessieay (Contributor) commented Dec 9, 2015

Actually, my memory deceived me. I ended up using pg_search in that project with trigram, which I am guessing is what you are referring to when you said "multi-search": https://github.com/Casecommons/pg_search#configuring-multi-search

@pkarman (Contributor, Author) commented Dec 9, 2015

@jessieay By "multisearch" I meant (quotes from pg_search README):

records of many different Active Record classes can be mixed together into one global search index
across your entire application

which is I think what we would need to do in order to de-normalize our many tables into one. Otherwise, we have to use pg_search_scope, but since we need to search across tables:

It is possible to search columns on associated models. Note that if you do this, it will be impossible to
speed up searches with database indexes.
However, it is supported as a quick way to try out cross-model searching.

I.e. either we "flatten" our many tables into one (multisearch), or we suffer the same performance hit we have now.
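For contrast, the cross-model `pg_search_scope` approach quoted above would look roughly like this (illustrative names only; note the README's warning that indexes cannot speed this up):

```ruby
class Proposal < ActiveRecord::Base
  include PgSearch
  has_many :comments

  # Joins to comments at query time, so the search hits the same
  # multi-JOIN cost our current hand-rolled SQL pays.
  pg_search_scope :search_everything,
                  against: [:public_id, :status],
                  associated_against: { comments: [:comment_text] }
end
```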

Here's a good write-up of cross-model search. Note that the author wants to search one model, and then he adds on a couple of columns from a couple of associations. Our use case is more complex, ergo the performance hit (far more associations to join). http://codeloveandboards.com/blog/2014/06/20/easy-multi-table-full-text-search-whith-rails/

Since ES is essentially flattening our many tables into one external store, it's philosophically the same thing as pg_search multisearch.

@afeld (Contributor) commented Dec 9, 2015

@pkarman Great write-up. Worth noting that, with the current solution:

  • speed could be dramatically improved by creating search indexes
  • pagination should be very straightforward
  • sorting by arbitrary columns should also be relatively easy, at least as far as the query is concerned

You can see more background in #355. That being said, I don't feel super strongly about one solution over the other.

@pkarman (Contributor, Author) commented Dec 9, 2015

@afeld all those points are true if we go with Option 1. My worry is mostly maintenance. That much custom search code frightens me.

Our problem space is somewhat different from most of the use cases I see for Postgres full-text search (a couple of which are referenced as comments in the current code): lots of data (rows) in a few tables, where you only want to search a few, mostly text, columns. I call that a "deep" data scenario.

Ours is "wide" data: modest amounts of data (rows), normalized across many tables, with the added constraint of authorization, where we want to be able to search just about every column (could be dozens depending on the client).

@afeld (Contributor) commented Dec 9, 2015

👍

pkarman referenced this issue Dec 15, 2015
@pkarman (Contributor, Author) commented Feb 25, 2016

pkarman closed this Feb 25, 2016
