Use random ID for activitypub #1101

Open
Nutomic opened this issue Aug 25, 2020 · 11 comments
Labels
area: federation support federation via activitypub enhancement New feature or request


@Nutomic
Member

Nutomic commented Aug 25, 2020

At the moment we use the database ID as the activitypub ID, which means we have to first insert the new user/community/post/comment into the database, wait for that to finish and then update it with the activitypub ID. This also means that the ap_id has to be nullable.

If we used a random ID instead (eg UUID), we could create a new object in one step, without any need for a separate update step or a nullable ID.
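A minimal sketch of the one-step creation described above, using Python's standard `sqlite3` and `uuid` modules. The table layout, the `create_post` helper, and the `lemmy.example` host are assumptions made for illustration and are not Lemmy's actual schema or code.

```python
import sqlite3
import uuid

# Hypothetical instance URL, for illustration only.
INSTANCE = "https://lemmy.example"

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE post ("
    " id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " ap_id TEXT NOT NULL UNIQUE)"  # ap_id no longer needs to be nullable
)


def create_post(name: str) -> str:
    """Insert a post whose ap_id is known before the row exists."""
    # The ID does not depend on the database-assigned row ID,
    # so a single INSERT is enough and no follow-up UPDATE is needed.
    ap_id = f"{INSTANCE}/post/{uuid.uuid4()}"
    conn.execute("INSERT INTO post (name, ap_id) VALUES (?, ?)", (name, ap_id))
    return ap_id


first = create_post("hello")
second = create_post("world")
```

With the database ID approach, the same function would need an INSERT, a read of the generated row ID, and then an UPDATE to fill in `ap_id`.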



@Nutomic Nutomic added enhancement New feature or request area: federation support federation via activitypub labels Aug 25, 2020
@Nutomic
Member Author

Nutomic commented Feb 11, 2021

If we use random IDs for all users, communities, posts and comments we might also be able to migrate actors with all data from one Lemmy instance to another. But that would certainly take a lot of work.

@chaorace

chaorace commented Jun 14, 2023

So, if I understand this correctly, the idea is two-fold:

A) Switching to a UUID will help decouple network performance from DB performance
B) Switching to a UUID will help decouple object identity from instance ownership

The first point is fairly self-explanatory. If that's all you need to achieve, the IDs don't even need to change that much:
https://lemmy.world/post/1 => https://lemmy.world/post/196544016-dead-beef-cafe-7471e767e36c

The second point is where I start to get a little confused. Even if you switch to UUIDs, it wouldn't be technically sound to naively discount the possibility of collisions from bad actors, bad entropy, or outright bad luck. If you can't actually guarantee a UUID is unique for the context in which it exists (i.e.: the Fediverse), then what's the technical advantage here?

You're still stuck disambiguating the scope of the ID by attributing it to a specific instance host, right? An ID format change like this seems off the table as far as I can imagine:
https://lemmy.world/post/1 => post/196544016-dead-beef-cafe-7471e767e36c

@ivanjermakov

ivanjermakov commented Jun 14, 2023

What if object id includes original instance within it? For example, post created on lemmy.world will get ID 12345@lemmy.world. It can be either a composite key or a single varchar key.

This way:
a) Every object is unique across instances (as long as instance domain is unique)
b) Instance itself is responsible for creating unique object IDs (some collision protection might still be implemented)

It might be a pain to migrate existing data, though.
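As a rough sketch of the single-string variant of this idea, the snippet below formats and parses IDs of the form `12345@lemmy.world`. The `ObjectId` type and `parse_object_id` function are hypothetical names chosen for this example.

```python
from typing import NamedTuple


class ObjectId(NamedTuple):
    local_id: str  # ID assigned by the originating instance
    instance: str  # domain of the originating instance

    def __str__(self) -> str:
        return f"{self.local_id}@{self.instance}"


def parse_object_id(text: str) -> ObjectId:
    """Split a qualified ID such as '12345@lemmy.world' into its two parts."""
    # Split on the last '@' so the domain is always the final component.
    local_id, sep, instance = text.rpartition("@")
    if not sep or not local_id or not instance:
        raise ValueError(f"not a qualified object ID: {text!r}")
    return ObjectId(local_id, instance)


parsed = parse_object_id("12345@lemmy.world")
```

Stored as two columns, the same pair would form the composite key mentioned above.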

@chaorace

chaorace commented Jun 14, 2023

> What if object id includes original instance within it?

That's technically what the current Fediverse ID format is already doing: https://lemmy.world/post/1

True, unifying the ID formats to be the same for Fediverse object IDs & backend IDs would allow for constructing usable URLs from the Fediverse object ID verbatim, but it's otherwise basically identical to the current format. I question if changing the backend ID syntax for that reason alone is more practical than simply having different representations for the ID in-URL vs. Fediverse objects and letting the backend figure the rest out at query-time (In-URL: https://lemmy.ml/u/chaorace@lemmy.world <=> Fediverse ID: https://lemmy.world/u/chaorace)

In any case, it's an entirely separate issue from whether or not posts/comments are given sequential/random IDs. Both methodologies would be compatible with either syntax:

  • https://lemmy.world/post/1@lemmy.ml <=> https://lemmy.ml/post/1
    • https://lemmy.world/post/1@lemmy.ml <=> post/1@lemmy.ml
  • https://lemmy.world/post/196544016-dead-beef-cafe-7471e767e36c@lemmy.ml <=> https://lemmy.ml/post/196544016-dead-beef-cafe-7471e767e36c
    • https://lemmy.world/post/196544016-dead-beef-cafe-7471e767e36c@lemmy.ml <=> post/196544016-dead-beef-cafe-7471e767e36c@lemmy.ml

@chaorace

chaorace commented Jun 14, 2023

Upon further consideration, I guess a more succinct way to put it is that switching away from sequential IDs is definitely a good idea because it will decouple network performance from DB performance. However, switching to a UUID alone will not be good enough to function as a unique identifier in inter-instance communication.

As I say above, you can separately deal with the inter-instance communication issue by translating the DB ID into more disambiguated formats, but this creates a situation where you'll be translating between 3 different syntaxes:

  • DB: {uuid}
  • Fediverse: https://{external-instance}/{type}/{uuid}
  • URL: https://{local-instance}/{type}/{uuid}@{external-instance}

In the case of a local instance handling an external originating resource, this unfortunately creates a performance issue when DB querying because the UUID itself cannot be treated as a unique index. You have to account for the instance as well and therefore must query against multiple non-unique fields.
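To make the lookup problem concrete, here is a small sketch with a bare UUID column, assuming an in-memory SQLite table. The table layout, helper names, hostnames, and UUID value are all made up for illustration. Because two instances can claim the same UUID, uniqueness can only be enforced on the `(uuid, instance)` pair, and every lookup has to filter on both columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE post ("
    " uuid TEXT NOT NULL,"
    " instance TEXT NOT NULL,"
    " name TEXT NOT NULL,"
    " UNIQUE (uuid, instance))"  # the UUID alone cannot be the unique index
)

SAME_UUID = "3f2b8c1e-5d47-4a90-9b6e-0c1d2e3f4a5b"  # made-up value
conn.executemany(
    "INSERT INTO post (uuid, instance, name) VALUES (?, ?, ?)",
    [(SAME_UUID, "lemmy.ml", "original"), (SAME_UUID, "evil.example", "collision")],
)


def fediverse_id(kind: str, uuid: str, instance: str) -> str:
    """Fediverse: https://{external-instance}/{type}/{uuid}"""
    return f"https://{instance}/{kind}/{uuid}"


def local_url(local_instance: str, kind: str, uuid: str, instance: str) -> str:
    """URL: https://{local-instance}/{type}/{uuid}@{external-instance}"""
    return f"https://{local_instance}/{kind}/{uuid}@{instance}"


def find_post(uuid: str, instance: str):
    """Look up a post; both columns are needed to identify one row."""
    row = conn.execute(
        "SELECT name FROM post WHERE uuid = ? AND instance = ?", (uuid, instance)
    ).fetchone()
    return row[0] if row else None
```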


In light of this stumbling block, I think that the scope of this issue should be expanded to simply be "Refactor Lemmy IDs". That way we can kill multiple ID-related birds with one stone by engineering a more purpose-suited format.

The most obvious thing is to combine the originally suggested UUID approach with the tweak suggested by @ivanjermakov. Doing this will lead to a globally unique ID that can be used as the canonical DB index (e.g.: {uuid}@{external-instance}). As a bonus, I think this even allows you to simplify construction of the other two syntaxes:

  • DB: {uuid}@{external-instance}
  • Fediverse: https://{external-instance}/{type}/{db-id}
  • URL: https://{local-instance}/{type}/{db-id}

I think this would simplify a lot of code-paths and also eliminate complexity stemming from there currently being two different URL patterns used in accessing local/remote resources, since the fully qualified syntax would become canonical.
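A short sketch of how the qualified format could be generated and reused verbatim, following the three patterns listed above. The function names and hostnames are hypothetical, and this only illustrates the string handling, not how Lemmy would implement it.

```python
import uuid


def new_db_id(origin_instance: str) -> str:
    """DB: {uuid}@{external-instance}"""
    return f"{uuid.uuid4()}@{origin_instance}"


def origin_of(db_id: str) -> str:
    """Recover the originating instance from a qualified DB ID."""
    return db_id.rpartition("@")[2]


def fediverse_id(kind: str, db_id: str) -> str:
    """Fediverse: https://{external-instance}/{type}/{db-id}"""
    return f"https://{origin_of(db_id)}/{kind}/{db_id}"


def local_url(local_instance: str, kind: str, db_id: str) -> str:
    """URL: https://{local-instance}/{type}/{db-id}"""
    return f"https://{local_instance}/{kind}/{db_id}"


db_id = new_db_id("lemmy.ml")
```

Both derived forms embed the DB ID unchanged, so a single unique index on that one column would be enough for lookups.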


That alone is already a big potential improvement, but why not go all the way and further unify these three formats as completely as we can? Apparently the Fediverse ID needs to be a resolvable URL, so we can't quite unify the DB & Fediverse IDs... though we can get pretty darn close:

  • DB: {type}/{uuid}@{external-instance}
  • Fediverse: https://{external-instance}/{db-id}
  • URL: https://{local-instance}/{db-id}

@ivanjermakov

I think the last step of unifying every object's id is overkill.

  1. It creates redundant data within the DB table
  2. It makes it unclear which object type is being referred to in the URL, which might be confusing to the end user

@chaorace

chaorace commented Jun 14, 2023

It probably is overkill. It would have been a different story if we could use an arbitrary value for the Fediverse ID and fully unify it with the DB ID, but the ActivityPub spec dashed my hopes with that one!

FWIW: The constructed URL would actually be identical for both implementations. After all, if the type is included in the DB ID and the DB ID is included in the URL, then the type must also be included in the URL by virtue of nesting. The difference is strictly on the construction side of things!

To illustrate how it affects the URL construction recipe:

  • If DB ID is {uuid}@{external-instance}: Instance URL + Resource Type + DB ID
  • If DB ID is {type}/{uuid}@{external-instance}: Instance URL + DB ID

Both recipes yield the same URL. It's really just an implementation convenience since the DB ID itself is technically useless if you don't also specify which resource type to be checking.
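The two recipes can be sketched as follows. The function names, the sample UUID, and the hostnames are invented for this example.

```python
def url_from_plain_id(instance_url: str, kind: str, db_id: str) -> str:
    """Recipe 1: Instance URL + Resource Type + DB ID."""
    return f"{instance_url}/{kind}/{db_id}"


def url_from_typed_id(instance_url: str, typed_db_id: str) -> str:
    """Recipe 2: Instance URL + DB ID (the type is already part of the ID)."""
    return f"{instance_url}/{typed_db_id}"


plain_id = "3f2b8c1e-5d47-4a90-9b6e-0c1d2e3f4a5b@lemmy.ml"  # {uuid}@{external-instance}
typed_id = f"post/{plain_id}"  # {type}/{uuid}@{external-instance}

url_a = url_from_plain_id("https://lemmy.world", "post", plain_id)
url_b = url_from_typed_id("https://lemmy.world", typed_id)
```

The only difference is whether the caller has to supply the resource type separately.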

@mermit

mermit commented Jun 14, 2023

I'm not completely sure about the current DB schema, but wouldn't it be easier to migrate the existing DB IDs when the type isn't integrated into the ID? I tried to look for it in the source code, but I'm not super familiar with it (or with Rust) yet. There are probably different tables for the different types, though?

@chaorace

> I'm not completely sure about the current db schema [...] I tried to look for it in the source code, and I'm not super familiar with it

FYI: Here's the DB Schema, though, yeah, I'm also unfamiliar with the stack being used here.

> wouldn't it be easier to migrate the existing DB IDs when the type isn't integrated into the ID?

I expect that the difficulty will come from the process of orderly transitioning the schema, rather than the actual code change. Just my two cents, though.

> But there probably are different tables for the different types?

Yes. It's not a shared table. The motivation behind the embedded type idea was more-or-less aesthetic -- making the index feel more like a resource locator. Regardless... not a hill worth fighting on. I'll cop to that part being a sketchy suggestion.

Fortunately, that bit wasn't really the important part of the post. What matters most is merely having a DB ID which is globally unique and able to be effortlessly used verbatim in Federated IDs/URLs (i.e.: {uuid}@{external-instance})

@mermit

mermit commented Jun 14, 2023

> FYI: Here's the DB Schema, though, yeah, I'm also unfamiliar with the stack being used here.

Ah, thank you, I missed that there was another Source folder.

> Fortunately, that bit wasn't really the important part of the post. What matters most is merely having a DB ID which is globally unique and able to be effortlessly used verbatim in Federated IDs/URLs (i.e.: {uuid}@{external-instance})

Yeah, exactly. It might also allow for something like a universal URL for sharing, making interoperability much simpler.

@mermit

mermit commented Jun 14, 2023

> I expect that the difficulty will come from the process of orderly transitioning the schema, rather than the actual code change. Just my two cents, though.

Since we have the Fediverse ID it could be as simple as updating the db with the new system when "legacy" communities, posts, users, etc are interacted with. Although it might cause problems as comments depend on posts, and other similar relationships. Perhaps an "Old ID" can be saved and checked against for those situations?
And next to that a slower background process can also work on migrating the remaining part of the DB while the instance is active.

Edit: Scrap that, if a system is added that allows looking up old IDs, there is no point in also doing it based on interaction. While it might be nice to prioritise active parts first, I don't think it is worth the extra hassle and performance hit.
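A very rough sketch of what the background migration with an old-ID lookup could look like, again using an in-memory SQLite database. The table names, column names, batch size, and the `lemmy.example` host are assumptions for illustration. It also ignores the dependent rows (e.g. comments pointing at posts) that a real migration would have to handle.

```python
import sqlite3
import uuid

INSTANCE = "lemmy.example"  # hypothetical local instance

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE post (id INTEGER PRIMARY KEY, name TEXT NOT NULL, new_id TEXT UNIQUE);
    CREATE TABLE old_post_id (old_id INTEGER PRIMARY KEY, new_id TEXT NOT NULL UNIQUE);
    """
)
conn.executemany("INSERT INTO post (name) VALUES (?)", [("a",), ("b",), ("c",)])


def migrate_batch(limit: int) -> int:
    """Assign new IDs to up to `limit` unmigrated posts; return how many were migrated."""
    rows = conn.execute(
        "SELECT id FROM post WHERE new_id IS NULL ORDER BY id LIMIT ?", (limit,)
    ).fetchall()
    for (old_id,) in rows:
        new_id = f"{uuid.uuid4()}@{INSTANCE}"
        conn.execute("UPDATE post SET new_id = ? WHERE id = ?", (new_id, old_id))
        # Remember the legacy ID so old links and references can still be resolved.
        conn.execute(
            "INSERT INTO old_post_id (old_id, new_id) VALUES (?, ?)", (old_id, new_id)
        )
    return len(rows)


def resolve_old_id(old_id: int):
    """Translate a legacy sequential ID into the new ID, if it has been migrated."""
    row = conn.execute(
        "SELECT new_id FROM old_post_id WHERE old_id = ?", (old_id,)
    ).fetchone()
    return row[0] if row else None


# Background job: keep migrating small batches until nothing is left.
while migrate_batch(2):
    pass
```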

4 participants