
add trigram index to search #3719

Merged
merged 2 commits into from Jul 27, 2023

Conversation

phiresky
Collaborator

Search currently does ILIKE '%x%y%' queries on person, post, comment tables. These cause seq scans because they can't be indexed with a normal BTREE index.

This PR adds trigram indexes to all three tables. This should speed up most of those searches to < 100ms. The tradeoff is a slightly higher cost for updating/creating rows in those tables.

One limitation is that this index is suboptimal when other filters are present and important, for example searching in one community.

This migration should really run in the background with CREATE INDEX CONCURRENTLY since it will take a while but I don't think there's a way to do that from within diesel.
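As a sketch, the migration amounts to something like the following (table and column names here are assumptions based on the description above; see the actual diff for the real statements):

```sql
-- pg_trgm ships with PostgreSQL contrib and provides the gin_trgm_ops
-- operator class, which lets a GIN index serve ILIKE '%x%' predicates.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX idx_person_name_trigram     ON person  USING gin (name    gin_trgm_ops);
CREATE INDEX idx_post_name_trigram       ON post    USING gin (name    gin_trgm_ops);
CREATE INDEX idx_comment_content_trigram ON comment USING gin (content gin_trgm_ops);

-- The non-blocking variant mentioned above. It cannot run inside a
-- transaction block, which is why it's awkward to issue from a diesel
-- migration:
-- CREATE INDEX CONCURRENTLY idx_comment_content_trigram
--     ON comment USING gin (content gin_trgm_ops);
```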

@dessalines
Member

One you forgot here: the community search, which, like person search, queries the name or title:

[screenshot: Screenshot_2023-07-25-12-44-34-030_com.termux]

@ancientmarinerdev

If you have added more of a performance cost onto inserts for the benefit of search, is that definitely going to be an overall decrease in cost? Is search done more often than posting?

Should search directly hit the DB? Shouldn't something like Elasticsearch be considered so it can sit in front of, or alongside, the persistence layer? It might also open more options for caching. If some common queries can be cached, the load could be reduced, even with a very short-lived cache. The more elaborate the search, the less likely there is to be a hit, but splitting queries out, or any technology that can support that, might cut a significant load from the DB, which would help massively with scalability. Someone will inevitably try expensive queries to kill servers. Being defensive here could be important.

There is probably some investigation that could go into this, such as types of queries, replication of data, etc. Community names could probably be quite an easy cache, as looking them up is a commonly undertaken activity.

@phiresky
Collaborator Author

phiresky commented Jul 26, 2023

If you have added more of a performance cost onto inserts for the benefit of search, is that definitely going to be an overall decrease in cost? Is search done more often than posting?

I didn't benchmark it much, but I don't think it increases the cost of insert/update by more than about 1ms. Creating the index for a comments table with 700k comments takes 40s on a shitty server, so a single insert shouldn't take longer than a millisecond. It should be basically irrelevant next to all the other stuff lemmy does on post insert that takes up time. I also can't think of a way this could be exploited to cause any excess load.
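One way to sanity-check the claim on a copy of the data (query shape assumed from the PR description; the exact plan depends on table stats and planner settings) is to look at the plan before and after creating the index:

```sql
-- With the trigram GIN index in place, this should show a Bitmap Index
-- Scan on comment instead of the Seq Scan the PR description mentions:
EXPLAIN ANALYZE
SELECT id FROM comment WHERE content ILIKE '%example term%';

-- psql's \timing on a batch of INSERTs before vs. after index creation
-- gives a rough feel for the per-row write overhead.
```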

Shouldn't something like elasticsearch be considered so it can sit in front of/alongside the persistence?

I've heard similar things a few times now and I feel like there must be some common misconception. Every single site admin so far runs lemmy on a single server, including the database. PostgreSQL scales perfectly well to fill a single server of any size. We're far from its limits. Rust also scales perfectly well to fill a server of any size. Moving some stuff to a different application doesn't improve the situation any more than improving the performance with the tools we already have. Adding more random tools / different applications will take up the same memory and the same CPU time lemmy and PG is using. It would just add more overhead, more complexity, and more multiple-sources-of-truth issues.

One you forgot here: is the community search

Added, though with <10k communities it's probably irrelevant.

@dessalines
Member

Should search directly search DB? Shouldn't something like elasticsearch be considered so it can sit in front of/alongside the persistence? It might open more options for caching also. If some common queries can be cached, it could be reduced, even if it is a very short lived cache.

I've been extremely reticent to add additional data stores, because of the complication they add, in trying to sync all of them up. Redis, melisearch, elasticsearch, all of these would add a ton of complication.

Adding more random tools / different applications will take up the same memory and the same CPU time lemmy and PG is using. It would just add more overhead, more complexity, and more multiple-sources-of-truth issues.

Spot on. People sometimes think they can just throw new technologies at a problem to solve it. There's almost nothing that would make lemmy less maintainable than adding additional data stores / sources of truth.

@ancientmarinerdev

I've been extremely reticent to add additional data stores, because of the complication they add, in trying to sync all of them up. Redis, melisearch, elasticsearch, all of these would add a ton of complication.

Spot on. People sometimes think they can just throw new technologies at a problem to solve it. There's almost nothing that would make lemmy less maintainable than adding additional data stores / sources of truth.

It would add complexity, that is something I am not going to dispute, but I don't think it adds unjustifiable complexity. Most websites running at country scale will use Lucene, Solr, or more recently Elasticsearch to cover search. Taking load off the database is critical, because all extra concurrent load makes all transactions slower, until it gets to the point where it buckles. Even if it doesn't buckle, the delayed response times impact users and their experience, and eventually doubts start to build up about whether the site is good or reliable.

I suggested this because, from what I have seen, these technologies are favoured at scale. I don't know any large website that allows database searching without any form of caching. Care to give an example? I seriously fear you're letting your own view cloud what happens in practice, and you are starting to believe that what you are doing is more correct than all the alternatives that have reached a different conclusion. Maybe you're right and industry standards are wrong, but that's a mighty bold assertion and a big leap of faith.

I've heard similar things a few times now and I feel like there must be some common misconception. Every single site admin so far runs lemmy on a single server, including the database. PostgreSQL scales perfectly well to fill a single server of any size. We're far from its limits. Rust also scales perfectly well to fill a server of any size. Moving some stuff to a different application doesn't improve the situation any more than improving the performance with the tools we already have. Adding more random tools / different applications will take up the same memory and the same CPU time lemmy and PG is using. It would just add more overhead, more complexity, and more multiple-sources-of-truth issues.

I'm getting some titanic "this ship can't sink" vibes here. Most of my experience is that shiny tech that promises a lot generally has a moment where it doesn't scale as advertised, and projects are eventually spun off, late, to try to deal with it. I raised the point in the hope of starting the discussion early, but it seems like the sites will have to buckle, with no options left, before something like this gets considered.

It isn't some common misconception, though: most top websites, when accessing immutable content, will try to cache first to cut load. If a query is run more than 96 times a day, a 15-minute cache is going to reduce the amount of work, assuming the queries are evenly spread (96 per day is one every 15 minutes). You are returning a stored result rather than doing the same computation again. Yes, the data can be a little stale, but who needs search results to be that real-time? Even an hour-long cache is hardly an issue from a search perspective.

In tech, it's important to use the best tool for the job; it isn't always advisable to stick with simple stacks when the demands are greater. Over the last few weekends, there have been bad actors really testing the limits of Lemmy, and they seem quite motivated. By letting search hit the DB directly, you're putting the DB in the hands of bad actors, which is a very risky move. So far, it's not going smoothly. They're going to keep probing and poking where there are weaknesses.

I didn't even want a thumbs-up for this idea, I just wanted it to be considered. I'm quite disappointed (and a little insulted) that it was fobbed off with some weird insinuation that I was naively assuming throwing technology at something would magically fix it. I don't know Rust or Diesel, so unfortunately I couldn't contribute from a code perspective. I was hoping I could share some of my experience, but I'm getting the impression that isn't welcomed. Good luck.

@dessalines
Member

I don't know Rust or Diesel, so unfortunately I couldn't contribute from a code perspective. I was hoping I could share some of my experience, but I'm getting the impression that isn't welcomed. Good luck.

This is a PR for a Rust code addition. If you don't know Rust, then don't add to the noise here; it wastes our time. Create an issue, or comment on it elsewhere.

@Nutomic merged commit 9bfa86d into LemmyNet:main on Jul 27, 2023
1 check passed
@phiresky
Collaborator Author

phiresky commented Jul 27, 2023

I didn't even want a thumbs-up for this idea, I just wanted it to be considered

Adding a dedicated tool for search is definitely something to consider in the future. I'm not denying that those tools exist for a reason; it's just that the suggestion is pretty generic, would need a ton more research, and development would be > 100x the effort of these changes, so it doesn't really affect whether this PR is a good idea. Right now isn't really the time, since there are so many "easy" improvements that can be made without increasing the complexity of lemmy at all, or even while reducing it.

Yes, lemmy has been having tons of issues with scalability in the past weeks, but none of that was caused by the core technology or tools used. The only thing this PR does is add an index to speed up a query that's already being run on production and that's causing issues. It's easy to rip out again and e.g. replace it with PG to_tsvector() FTS search or tantivy or something else later.
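For reference, the to_tsvector() replacement mentioned here could be sketched roughly like this (column name assumed; note that full-text search matches stemmed words rather than arbitrary substrings, so the semantics differ from the current ILIKE queries):

```sql
-- Expression GIN index over the parsed document:
CREATE INDEX idx_comment_content_fts ON comment
    USING gin (to_tsvector('english', content));

-- A matching query must use the same configuration and expression
-- for the planner to use the index:
SELECT id
FROM comment
WHERE to_tsvector('english', content)
      @@ plainto_tsquery('english', 'search term');
```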

There's an open issue about it here #3713

Nutomic pushed a commit that referenced this pull request Jul 27, 2023
* add trigram index to search

* add community index
@ancientmarinerdev
Copy link

I didn't even want a thumbs-up for this idea, I just wanted it to be considered

Adding a dedicated tool for search is definitely something to consider in the future. I'm not denying that those tools exist for a reason; it's just that the suggestion is pretty generic, would need a ton more research, and development would be > 100x the effort of these changes, so it doesn't really affect whether this PR is a good idea. Right now isn't really the time, since there are so many "easy" improvements that can be made without increasing the complexity of lemmy at all, or even while reducing it.

Yes, lemmy has been having tons of issues with scalability in the past weeks, but none of that was caused by the core technology or tools used. The only thing this PR does is add an index to speed up a query that's already being run on production and that's causing issues. It's easy to rip out again and e.g. replace it with PG to_tsvector() FTS search or tantivy or something else later.

There's an open issue about it here #3713

I completely agree with that assessment; it is a good pragmatic change, and I appreciate that the focus right now needs to be on the bottlenecks and the known inefficiencies. I had raised it while the topic was text search, more as a consideration of what could come next; I didn't intend for my comment to be a blocker. Thanks for the link to the issue, and thanks for your work so far. I'm very impressed with your analysis on performance and security. You're a big asset to the project.
