Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Indexing on web_search? #140

Open
tomcardoso opened this issue Dec 16, 2020 · 2 comments
Open

Indexing on web_search? #140

tomcardoso opened this issue Dec 16, 2020 · 2 comments

Comments

@tomcardoso
Copy link

I'm a bit confused as to how I can set up indexing on web search, especially if I want to use the approach we talked about in #137. To recap, what I'm doing is:

order, conditions = cases.web_search_similarities_and_conditions({
  case_number: keywords,
  short_title: keywords,
  case_events: {
    party_name: keywords,
    short_title: keywords
  }}, false)

results = Case
  .joins(:case_events)
  .where(conditions.join(' OR ')) # depends on exclusive, could be AND as well
  .group(:id) # group by ID to deduplicate
  .order(Arel.sql("max(#{order.join(' + ')})")) # use aggregation function to pick highest rank for duplicates

Which searches a number of fields for a certain keyword and returns all results that match in any of those fields.

But I'm finding the search to be very slow for my database (~1.8m case event entries). I went in and added simple keyword indexes of this sort:

class AddKeywordIndexToCaseEvents < ActiveRecord::Migration[6.0]
  disable_ddl_transaction!

  def change
    add_index :case_events, :party_name, unique: false, algorithm: :concurrently
    add_index :case_events, :short_title, unique: false, algorithm: :concurrently
  end
end

But I have a feeling I may need to do something more. Do you have any advice on what I should do? I know Textacular recommends using the :sql schema file format, but I'd prefer to keep it as a schema.rb for compatibility reasons with my production database.

@tomcardoso
Copy link
Author

tomcardoso commented Dec 18, 2020

That didn't work, so I've tried doing something like this instead:

add_index :cases, "to_tsvector('english', case_number || ' ' || short_title)", using: :gin, name: 'cases_keywords_idx', algorithm: :concurrently

But the search is still very, very slow (taking 10 seconds or longer). I've done some profiling and it's all down to the bottleneck created by the special two-table, four-column query I mentioned in my previous message. Here's an example query that lasts ~10 seconds:

SELECT "cases".* FROM "cases" INNER JOIN "case_events" ON "case_events"."case_id" = "cases"."id" WHERE (to_tsvector('english', "cases"."case_number"::text) @@ websearch_to_tsquery('english', 'police'::text) OR to_tsvector('english', "cases"."short_title"::text) @@ websearch_to_tsquery('english', 'police'::text) OR to_tsvector('english', "case_events"."party_name"::text) @@ websearch_to_tsquery('english', 'police'::text) OR to_tsvector('english', "case_events"."short_title"::text) @@ websearch_to_tsquery('english', 'police'::text)) GROUP BY "cases"."id" ORDER BY max(COALESCE(ts_rank(to_tsvector('english', "cases"."case_number"::text), websearch_to_tsquery('english', 'police'::text)), 0) + COALESCE(ts_rank(to_tsvector('english', "cases"."short_title"::text), websearch_to_tsquery('english', 'police'::text)), 0) + COALESCE(ts_rank(to_tsvector('english', "case_events"."party_name"::text), websearch_to_tsquery('english', 'police'::text)), 0) + COALESCE(ts_rank(to_tsvector('english', "case_events"."short_title"::text), websearch_to_tsquery('english', 'police'::text)), 0)) LIMIT $1 OFFSET $2;

I've tested both with and without that indexing, and it doesn't seem to really make a difference, which makes me wonder whether I need to index things some other way, or if it's not the ts_rank operations that are really slowing me down. Some testing suggested the search is fairly snappy until I get to .order(Arel.sql("max(#{order.join(' + ')})")), at which point the query grinds to a halt, so that must be an expensive operation. Any suggestions here would be much appreciated.

@simi
Copy link
Contributor

simi commented Nov 14, 2021

@tomcardoso for some reason if you join and lookup into two tables GIN indexes are not used. I have managed it is possible to check for matching JOINed table in CTE and just pass ids to final query. You need to use ANY(ARRAY(... method to keep indexes being used. Here is example fast query.

WITH related_labels AS (
  SELECT id FROM labels WHERE account_id = 80 AND to_tsvector('english_nostop', "labels"."name"::TEXT) @@ websearch_to_tsquery('english_nostop', 'Hello'::TEXT)
)
SELECT
	"products" .*,
	COALESCE(ts_rank(to_tsvector('english_nostop', "products"."title"::TEXT), websearch_to_tsquery('english_nostop', 'Hello'::TEXT)), 0) + COALESCE(ts_rank(to_tsvector('english_nostop', "products"."catalog_number"::TEXT), websearch_to_tsquery('english_nostop', 'Hello'::TEXT)), 0) + COALESCE(ts_rank(to_tsvector('english_nostop', "products"."version"::TEXT), websearch_to_tsquery('english_nostop', 'Hello'::TEXT)), 0) + COALESCE(ts_rank(to_tsvector('english_nostop', "products"."artist"::TEXT), websearch_to_tsquery('english_nostop', 'Hello'::TEXT)), 0) + COALESCE(ts_rank(to_tsvector('english_nostop', "products"."barcode"::TEXT), websearch_to_tsquery('english_nostop', 'Hello'::TEXT)), 0) + COALESCE(ts_rank(to_tsvector('english_nostop', "products"."company_id"::TEXT), websearch_to_tsquery('english_nostop', 'Hello'::TEXT)), 0) + COALESCE(ts_rank(to_tsvector('english_nostop', "labels"."name"::TEXT), websearch_to_tsquery('english_nostop', 'Hello'::TEXT)), 0) AS "rank20534463617833501"
FROM
	"products"
LEFT OUTER JOIN "labels" ON
	"labels"."id" = "products"."label_id"
WHERE
	"products"."account_id" = 80
	AND (to_tsvector('english_nostop', "products"."title"::TEXT) @@ websearch_to_tsquery('english_nostop', 'Hello'::TEXT)
	OR to_tsvector('english_nostop', "products"."version"::TEXT) @@ websearch_to_tsquery('english_nostop', 'Hello'::TEXT)
	OR to_tsvector('english_nostop', "products"."artist"::TEXT) @@ websearch_to_tsquery('english_nostop', 'Hello'::TEXT)
	OR to_tsvector('english_nostop', "products"."barcode"::TEXT) @@ websearch_to_tsquery('english_nostop', 'Hello'::TEXT)
	OR to_tsvector('english_nostop', "products"."company_id"::TEXT) @@ websearch_to_tsquery('english_nostop', 'Hello'::TEXT)
        OR label_id = ANY(ARRAY(SELECT * FROM related_labels)))
ORDER BY
	"rank20534463617833501" DESC

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants