
add trigram index to search #3719

Merged
merged 2 commits into from Jul 27, 2023

Conversation

phiresky
Collaborator

Search currently does ILIKE '%x%y%' queries on person, post, comment tables. These cause seq scans because they can't be indexed with a normal BTREE index.

This PR adds trigram indexes to all three tables. This should speed up most of those searches to < 100ms. The tradeoff is a slightly higher cost for updating/creating rows in those tables.

One limitation is that this index is suboptimal when other filters are present and important, for example searching in one community.

This migration should really run in the background with CREATE INDEX CONCURRENTLY since it will take a while but I don't think there's a way to do that from within diesel.
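As a sketch, the migration amounts to something like the following (table and column names here are assumptions based on the description above; see the actual diff for the real statements):

```sql
-- pg_trgm ships with PostgreSQL contrib and provides the gin_trgm_ops
-- operator class, which lets a GIN index serve ILIKE '%x%' predicates.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX idx_person_name_trigram     ON person  USING gin (name    gin_trgm_ops);
CREATE INDEX idx_post_name_trigram       ON post    USING gin (name    gin_trgm_ops);
CREATE INDEX idx_comment_content_trigram ON comment USING gin (content gin_trgm_ops);

-- The non-blocking variant mentioned above. It cannot run inside a
-- transaction block, which is why it's awkward to issue from a diesel
-- migration:
-- CREATE INDEX CONCURRENTLY idx_comment_content_trigram
--     ON comment USING gin (content gin_trgm_ops);
```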

@dessalines
Member

One you forgot here: the community search, which, like person search, queries the name or title:

[screenshot: Screenshot_2023-07-25-12-44-34-030_com.termux]

@ancientmarinerdev

If you have added more of a performance cost onto inserts for the benefit of search, is that definitely going to be an overall decrease in cost? Is search done more often than posting?

Should search directly hit the DB? Shouldn't something like Elasticsearch be considered so it can sit in front of, or alongside, the persistence layer? It might also open more options for caching. If some common queries can be cached, the load could be reduced, even with a very short-lived cache. The more elaborate the search, the less likely there is to be a hit, but splitting queries out, or any technology that can support that, might cut a significant load from the DB, which would help massively with scalability. Someone will inevitably try expensive queries to kill servers. Being defensive here could be important.

There is probably some investigation that could go into this, such as types of queries, replication of data, etc. Community names could probably be quite an easy cache, as looking them up is a commonly undertaken activity.

@phiresky
Collaborator Author

phiresky commented Jul 26, 2023

If you have added more of a performance cost onto inserts for the benefit of search, is that definitely going to be an overall decrease in cost? Is search done more often than posting?

I didn't benchmark it much, but I don't think it increases the cost of insert/update by more than about 1ms. Creating the index for a comments table with 700k comments takes 40s on a shitty server, so a single insert shouldn't take longer than a millisecond. It should be basically irrelevant next to all the other stuff lemmy does on post insert that takes up time. I also can't think of a way this could be exploited to cause any excess load.
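One way to sanity-check the claim on a copy of the data (query shape assumed from the PR description; the exact plan depends on table stats and planner settings) is to look at the plan before and after creating the index:

```sql
-- With the trigram GIN index in place, this should show a Bitmap Index
-- Scan on comment instead of the Seq Scan the PR description mentions:
EXPLAIN ANALYZE
SELECT id FROM comment WHERE content ILIKE '%example term%';

-- psql's \timing on a batch of INSERTs before vs. after index creation
-- gives a rough feel for the per-row write overhead.
```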

Shouldn't something like elasticsearch be considered so it can sit in front of/alongside the persistence?

I've heard similar things a few times now and I feel like there must be some common misconception. Every single site admin so far runs lemmy on a single server, including the database. PostgreSQL scales perfectly well to fill a single server of any size. We're far from its limits. Rust also scales perfectly well to fill a server of any size. Moving some stuff to a different application doesn't improve the situation any more than improving the performance with the tools we already have. Adding more random tools / different applications will take up the same memory and the same CPU time lemmy and PG is using. It would just add more overhead, more complexity, and more multiple-sources-of-truth issues.

One you forgot here: is the community search

Added, though with <10k communities it's probably irrelevant.

@dessalines
Member

Should search directly search DB? Shouldn't something like elasticsearch be considered so it can sit in front of/alongside the persistence? It might open more options for caching also. If some common queries can be cached, it could be reduced, even if it is a very short lived cache.

I've been extremely reticent to add additional data stores, because of the complication they add, in trying to sync all of them up. Redis, melisearch, elasticsearch, all of these would add a ton of complication.

Adding more random tools / different applications will take up the same memory and the same CPU time lemmy and PG is using. It would just add more overhead, more complexity, and more multiple-sources-of-truth issues.

Spot on. People sometimes think they can just throw new technologies at a problem to solve it. There's almost nothing that would make lemmy less maintainable than adding additional data stores / sources of truth.

@ancientmarinerdev

I've been extremely reticent to add additional data stores, because of the complication they add, in trying to sync all of them up. Redis, melisearch, elasticsearch, all of these would add a ton of complication.

Spot on. People sometimes think they can just throw new technologies at a problem to solve it. There's almost nothing that would make lemmy less maintainable than adding additional data stores / sources of truth.

It would add complexity, that is something I am not going to dispute, but I don't think it adds unjustifiable complexity. Most websites running at country scale will use Lucene, Solr, or more recently Elasticsearch to cover search. Taking load off the database is critical, because all extra concurrent load makes all transactions slower, until it gets to the point where it buckles. Even if it doesn't buckle, the delayed response times impact users and their experience, and eventually doubts start to build up about whether the site is good or reliable.

I suggested this because, from what I have seen, these technologies are favoured at scale. I don't know any large website that allows database searching without any form of caching. Care to give an example? I seriously fear you're letting your own view cloud what happens in practice, and you are starting to believe that what you are doing is more correct than all the alternatives that have reached a different conclusion. Maybe you're right and industry standards are wrong, but that's a mighty bold assertion and a big leap of faith.

I've heard similar things a few times now and I feel like there must be some common misconception. Every single site admin so far runs lemmy on a single server, including the database. PostgreSQL scales perfectly well to fill a single server of any size. We're far from its limits. Rust also scales perfectly well to fill a server of any size. Moving some stuff to a different application doesn't improve the situation any more than improving the performance with the tools we already have. Adding more random tools / different applications will take up the same memory and the same CPU time lemmy and PG is using. It would just add more overhead, more complexity, and more multiple-sources-of-truth issues.

I'm getting some titanic "this ship can't sink" vibes here. Most of my experience is that shiny tech that promises a lot generally has a moment where it doesn't scale as advertised, and projects are eventually spun off, late, to try to deal with it. I raised the point in the hope of starting the discussion early, but it seems like the sites will have to buckle, with no options left, before something like this gets considered.

It isn't some common misconception, though: most top websites, when accessing immutable content, will try to cache first to cut load. If a query is run more than 96 times a day, a 15-minute cache is going to reduce the amount of work, assuming the queries are evenly spread (96 per day is one every 15 minutes). You are returning a stored result rather than doing the same computation again. Yes, the data can be a little stale, but who needs search results to be that real-time? Even an hour-long cache is hardly an issue from a search perspective.

In tech, it's important to use the best tool for the job; it isn't always advisable to stick with simple stacks when the demands are greater. Over the last few weekends, there have been bad actors really testing the limits of Lemmy, and they seem quite motivated. By letting search hit the DB directly, you're putting the DB in the hands of bad actors, which is a very risky move. So far, it's not going smoothly. They're going to keep probing and poking where there are weaknesses.

I didn't even want a thumbs-up for this idea, I just wanted it to be considered. I'm quite disappointed (and a little insulted) that it was fobbed off with some weird insinuation that I was naively assuming throwing technology at something would magically fix it. I don't know Rust or Diesel, so unfortunately I couldn't contribute from a code perspective. I was hoping I could share some of my experience, but I'm getting the impression that isn't welcomed. Good luck.

@dessalines
Member

I don't know Rust or Diesel, so unfortunately I couldn't contribute from a code perspective. I was hoping I could share some of my experience, but I'm getting the impression that isn't welcomed. Good luck.

This is a PR for a Rust code addition. If you don't know Rust, then don't add to the noise here; it wastes our time. Create an issue, or comment on it elsewhere.

@Nutomic merged commit 9bfa86d into LemmyNet:main on Jul 27, 2023
1 check passed
@phiresky
Collaborator Author

phiresky commented Jul 27, 2023

I didn't even want a thumbs-up for this idea, I just wanted it to be considered

Adding a dedicated tool for search is definitely something to consider in the future. I'm not denying that those tools exist for a reason; it's just that the suggestion is pretty generic, would need a ton more research, and development would be > 100x the effort of these changes, so it doesn't really affect whether this PR is a good idea. Right now isn't really the time, since there are so many "easy" improvements that can be made without increasing the complexity of lemmy at all, or even while reducing it.

Yes, lemmy has been having tons of issues with scalability in the past weeks, but none of that was caused by the core technology or tools used. The only thing this PR does is add an index to speed up a query that's already being run on production and that's causing issues. It's easy to rip out again and e.g. replace it with PG to_tsvector() FTS search or tantivy or something else later.
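For reference, the to_tsvector() replacement mentioned here could be sketched roughly like this (column name assumed; note that full-text search matches stemmed words rather than arbitrary substrings, so the semantics differ from the current ILIKE queries):

```sql
-- Expression GIN index over the parsed document:
CREATE INDEX idx_comment_content_fts ON comment
    USING gin (to_tsvector('english', content));

-- A matching query must use the same configuration and expression
-- for the planner to use the index:
SELECT id
FROM comment
WHERE to_tsvector('english', content)
      @@ plainto_tsquery('english', 'search term');
```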

There's an open issue about it here #3713

Nutomic pushed a commit that referenced this pull request Jul 27, 2023
* add trigram index to search

* add community index
@ancientmarinerdev
Copy link

I didn't even want a thumbs-up for this idea, I just wanted it to be considered

Adding a dedicated tool for search is definitely something to consider in the future. I'm not denying that those tools exist for a reason; it's just that the suggestion is pretty generic, would need a ton more research, and development would be > 100x the effort of these changes, so it doesn't really affect whether this PR is a good idea. Right now isn't really the time, since there are so many "easy" improvements that can be made without increasing the complexity of lemmy at all, or even while reducing it.

Yes, lemmy has been having tons of issues with scalability in the past weeks, but none of that was caused by the core technology or tools used. The only thing this PR does is add an index to speed up a query that's already being run on production and that's causing issues. It's easy to rip out again and e.g. replace it with PG to_tsvector() FTS search or tantivy or something else later.

There's an open issue about it here #3713

I completely agree with that assessment; it is a good pragmatic change, and I appreciate that the focus right now needs to be on the bottlenecks and the known inefficiencies. I had raised it while the topic was text search, more as a consideration of what could come next; I didn't intend for my comment to be a blocker. Thanks for the link to the issue, and thanks for your work so far. I'm very impressed with your analysis on performance and security. You're a big asset to the project.
