Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generate post thumbnail/metadata in background (ref #4529) #4564

Merged
merged 5 commits into from Mar 27, 2024

Conversation

Nutomic
Copy link
Member

@Nutomic Nutomic commented Mar 25, 2024

@ticoombs mentioned that its sometimes extremely slow to generate the post thumbnail and metadata because sites are slow to respond. So Im moving this to a background task which also gets rid of some duplicate code.

I also noticed that the custom post thumbnail added in #4425 isnt working properly because its not federated. So other instances will generate a thumbnail as usual. Ive removed the custom thumbnail code for now.

@dessalines
Copy link
Member

I also noticed that the custom post thumbnail added in #4425 isnt working properly because its not federated. So other instances will generate a thumbnail as usual. Ive removed the custom thumbnail code for now.

Why is that? That field is only added via the API, the thumbnail_url field should still be used for federation.

Copy link
Member

@dessalines dessalines left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I agree with backgrounding this. This means that front ends might think that there's a bug with missing picture / metadata after a post has been created, when in reality its just waiting for it, and they'll need to keep clicking refresh to make sure it fetched everything.

Also, not sure how this might affect federation of pics and metadata.

@Nutomic
Copy link
Member Author

Nutomic commented Mar 26, 2024

Why is that? That field is only added via the API, the thumbnail_url field should still be used for federation.

Sorry youre right, Ive restored it again.

Not sure I agree with backgrounding this. This means that front ends might think that there's a bug with missing picture / metadata after a post has been created, when in reality its just waiting for it, and they'll need to keep clicking refresh to make sure it fetched everything.

The problem is mentioned here. Basically to keep up with a major instance like lemmy.world, an instance has to process around 3 activities per second. But if its blocked fetching link metadata for 2 seconds, it will fall behind and spend the next seconds catching up. Thats a problem when an instance already has trouble keeping up.

Also, not sure how this might affect federation of pics and metadata.

Right that was a problem, but Ive moved it so its done after metadata generation now.

@Nutomic
Copy link
Member Author

Nutomic commented Mar 26, 2024

Theres a weird test failure where alt_text isnt getting federated. Its only failing sometimes, and despite me not changing anything related to alt text.

..Default::default()
};
let updated_post = Post::update(&mut context.pool(), post.id, &form).await?;
if let Some(send_activity) = send_activity(updated_post) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

// so wait for it explicitly. For removed posts we cant refetch anything.
postOne = await waitForPost(beta, postOne.post, res => {
return res == null || res?.post.embed_title != null;
});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The alt_text issue might be a logical problem here. Its returning even if the res is null.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The failing test is "No image proxying if setting is disabled" which doesnt use this function. The null check is necessary because theres a removed post which cant be fetched.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hrm... gotta be somewhere else then. Otherwise all the other PRs would have alt_text issues too, because that's already in main.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to happen specifically with this PR, so be careful with merging even if CI randomly passes.

post.post_view.post,
res => res?.post.alt_text != null,
);
expect(betaPost.post).toBeDefined();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found the problem, gamma was fetching the post here before metadata was generated. This means the post url was federated as Link which doesnt include alt text. Have to subscribe to the community and wait for it to send the post out after metadata is generated. Also had to change from delta to beta because follow didnt go through from delta for some reason.

let betaPost = await waitForPost(
beta,
post.post_view.post,
res => res?.post.alt_text != null,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
res => res?.post.alt_text != null,
res => res?.post.alt_text !== null,

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test fails with !== but passes with !=. Not sure why but Ive changed it back.

api_tests/src/post.spec.ts Outdated Show resolved Hide resolved
api_tests/src/post.spec.ts Outdated Show resolved Hide resolved
crates/api_common/src/request.rs Outdated Show resolved Hide resolved
Co-authored-by: SleeplessOne1917 <28871516+SleeplessOne1917@users.noreply.github.com>
@dessalines dessalines enabled auto-merge (squash) March 27, 2024 14:47
@dessalines dessalines merged commit a4b79ca into main Mar 27, 2024
2 checks passed
@kroese
Copy link
Contributor

kroese commented Apr 5, 2024

@Nutomic @dessalines Excuse me if Im overlooking something, but why is it necessary to fetch metadata for federated posts at all?

I always assumed the metadata fetching was only done by the instance that generated the post. And since those fields (like thumbnail_url) are already federated, what is the purpose to let every receiving instance repeat that process? If the number of instances grows, that would trigger a DoS attack at the external URL by all those instances fetching the metadata at roughly the same time.

If the metadata fetching is necessary because some field is not federated, wouldnt it make much more sense to include that field in the post info instead of letting every server retrieve it from the external URL?

@Nutomic
Copy link
Member Author

Nutomic commented Apr 5, 2024

The metadata is not federated, and Im not entirely sure how that would be done over Activitypub. Looking at Mastodon they also dont federate it so it cant be that bad. Plus a malicious instance could send wrong metadata.

@kroese
Copy link
Contributor

kroese commented Apr 5, 2024

I am really surprised! I always thought my Lemmy instance was only making connections to other instances, but this means its constantly making GET requests to all kinds of non-Lemmy webservers for scraping.

I am not sure if Im too happy about that, for example Reddit may rate-limit my IP, other websites might block me for constant scraping. And it might connect to all kinds of shady or illegal websites.

Can you give one example why this metadata fetching is necessary? The thumbnail URL is already provided (which I hotlink). The only other thing seems to be that short text summary, which would be nice to have, but in no way critical.

Could there at least be an option to disable metadata scraping for federated posts? I dont mind my server connecting to other Lemmy servers, I just want it to trust the received metadata instead of crawling the whole web. That it does this for locally created posts is totally fine (and unavoidable), but for federated posts is overkill.

@Nutomic
Copy link
Member Author

Nutomic commented Apr 5, 2024

Sure you can open an issue for an option to disabling post metadata fetching.

@kroese
Copy link
Contributor

kroese commented Apr 5, 2024

Done, see #4598

@kroese
Copy link
Contributor

kroese commented Apr 23, 2024

@Nutomic I am looking into this commit, as I expected the background fetch would only apply to incoming (federated) posts. But if Im reading the code correctly, it seems it is also applied to outgoing (local) posts..

Is there any need to do that? Because this can potentially cause loads of issues (for example posts that already get federated to other instances before they were "finished", users that see their post incomplete before its finished, etc.).

It would make more sense to only apply the background fetching to incoming (federated) posts, where its effects will be much more harmless.

@dessalines
Copy link
Member

I did test it locally, and it seemed to work fine. As you can see in the code here, after its done fetching, it sends the update over federation.

@kroese
Copy link
Contributor

kroese commented Apr 24, 2024

@dessalines Okay, good to know!

But that still means that every post (except the few ones without metadata) will be federated twice. Once during creation, and once for the update. Thus essentially doubling the load on the network, with no real benefit.

@SleeplessOne1917 SleeplessOne1917 deleted the post-thumbnail-background branch April 24, 2024 10:42
@dessalines
Copy link
Member

True, its less than ideal, but we don't know another way to prevent blocking federation worker threads.

@kroese
Copy link
Contributor

kroese commented Apr 24, 2024

@dessalines But the blocking of federation worker threads is only applicable to incoming posts, for those this change was very useful, and I dont question that decision at all.

My complaint was that this change is also applied to outgoing posts. Since the outgoing posts are not in a queue at all, processing them asynchronously provides no benefit, only downsides (like doubling the network load). So this makes the federation queue even fuller.

To say it in other words: I understand why crates/apub/src/objects/post.rs was modified, and agree about that.

I just dont think it was smart that the same was done to crates/api_crud/src/post/create.rs and crates/api_crud/src/post/update.rs

@Nutomic
Copy link
Member Author

Nutomic commented Apr 24, 2024

Local posts are still only federated once, as you can see that ActivityChannel::submit_activity was moved into generate_post_link_metadata and is still only called once per API call. Doing this in the background does make sense for local posts, as it avoids long delay or timeout on the api call when metadata fetching is slow. The only benefit is that metadata and thumbnails may be missing for the first few seconds, but thats a very minor complaint.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants