PkgServer synchronization (Pkg Server version of General is delayed relative to Git clone of General) #16777

Closed
johnnychen94 opened this issue Jun 23, 2020 · 61 comments

@johnnychen94
Contributor

johnnychen94 commented Jun 23, 2020

With the Pkg server enabled by default since Julia 1.5, there is a short window after a release PR is merged in General during which the new version is still unavailable from the storage server. Because users can't tell whether the Pkg server has synced the commit, this frequently breaks CI tests.

One example: as soon as ImageMorphology v0.2.6 was added to General, I retriggered the CI in JuliaImages/Images.jl#895, and CI on nightly failed because it couldn't find ImageMorphology v0.2.6. In this case, the "PR merged" notification is a lie to developers 😂

cc: @staticfloat @StefanKarpinski

@StefanKarpinski
Contributor

The Pkg client should fall back to fetching the package version directly from GitHub, no?

@johnnychen94
Contributor Author

johnnychen94 commented Jun 23, 2020

The issue here is that the Pkg client downloads the out-of-date "latest" registry directly from the Pkg server, and then finds no v0.2.6 during version resolution.

julia --color=yes --check-bounds=yes --inline=yes --project -e 'using Pkg; Pkg.test(coverage=true)'
 Installing known registries into `~/.julia`
#=#=#                                                                         

######################################################################## 100.0%
      Added registry `General` to `~/.julia/registries/General`
   Updating registry at `~/.julia/registries/General`
ERROR: Unsatisfiable requirements detected for package ImageMorphology [787d08f9]:
 ImageMorphology [787d08f9] log:
 ├─possible versions are: [0.1.0-0.1.1, 0.2.0-0.2.5] or uninstalled
 └─restricted to versions 0.2.6-0.2 by Images [916415d5] — no versions left
   └─Images [916415d5] log:
     ├─possible versions are: 0.22.2 or uninstalled
     └─Images [916415d5] is fixed to version 0.22.2

I'm not sure how frequently the storage server updates, but there's still a time gap here.

@DilumAluthge
Member

@staticfloat How frequently does the Pkg server update its copy of the General registry?

@StefanKarpinski
Contributor

It runs in a loop continuously pulling the registry and updating things. So it’s generally mostly up-to-date but not instantaneous.

@StefanKarpinski
Contributor

StefanKarpinski commented Jun 23, 2020

Depending in CI on a point version that you just published seems like kind of a corner case. You have to wait for registration to go through as well; how is waiting for it to get into the storage servers any different?

@johnnychen94
Contributor Author

johnnychen94 commented Jun 23, 2020

The difference is that I received a merged notification/email from General, retriggered the CI, and then ‼️ "no, this version is not available yet."

Yes, this is an edge case for CI only, and it would be totally fine to do the rest of the work later. It just makes the pipeline less smooth than it was; I usually retrigger the relevant CI when I see the notification.

Or should we unset PkgServer in CI?

@DilumAluthge
Member

It would be nice to get some concrete measurements on this.

If the delay from "PR merged in General" to "new version is available from the Pkg server" is 5 minutes, then I think that's no big deal.

But if the delay is e.g. 30 minutes, I think that would be annoying and should be fixed.

@johnnychen94 How long was the delay for you?

Or should we unset PkgServer in CI?

That would deprive the community of some useful telemetry statistics. I'd rather be able to keep using Pkg Server in CI.

@johnnychen94
Contributor Author

johnnychen94 commented Jun 23, 2020

I checked again after 15 minutes and it still failed; then I went away to do other work, and now it works. I'll report back if I get more data.

An incremental update usually takes about 8-15 minutes on the storage server in my LAN, though I've also observed 40 minutes on the BFSU mirror.

The current loop in gen_static.jl iterates over all packages and all versions, and most of the time is wasted untarring existing versions only to get Artifacts.toml. We could instead iterate only over the registry diffs.
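
To illustrate the "registry diffs" idea, here is a minimal Python sketch. It is not how gen_static.jl works; the `Registry` mapping and the `new_versions` helper are hypothetical names, and the registry is assumed to be already parsed into a package-to-versions dictionary.

```python
from typing import Dict, Set, Tuple

# Hypothetical in-memory view of a registry: package name -> set of version strings.
Registry = Dict[str, Set[str]]


def new_versions(old: Registry, new: Registry) -> Set[Tuple[str, str]]:
    """Return the (package, version) pairs present in `new` but not in `old`."""
    added = set()
    for pkg, versions in new.items():
        # Versions already seen in the previous snapshot are skipped entirely.
        for ver in versions - old.get(pkg, set()):
            added.add((pkg, ver))
    return added


# Illustrative snapshots taken before and after a registry update.
old_snapshot = {"ImageMorphology": {"0.2.4", "0.2.5"}, "Images": {"0.22.2"}}
new_snapshot = {"ImageMorphology": {"0.2.4", "0.2.5", "0.2.6"}, "Images": {"0.22.2"}}

# Only the newly registered version would need its tarball and Artifacts.toml processed.
to_process = new_versions(old_snapshot, new_snapshot)
```

With this approach the work per update is proportional to the number of new registrations rather than to the size of the whole registry.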

@DilumAluthge
Member

It runs in a loop continuously pulling the registry and updating things. So it’s generally mostly up-to-date but not instantaneous.

I checked again after 15 minutes and it still failed; then I went away to do other work, and now it works. I'll report back if I get more data.

Personally, I think that 15 minutes is too slow. It would be good to get this delay down to 5 minutes or less, in my opinion.

It seems like the only issue here is updating the registry, right? If you have an updated registry, but the tarball of the code is not in the Pkg Server, then you'll just fall back to downloading the tarball from Git. The issue here is that the registry itself is not up to date.

Could we have two loops in the Pkg Server, running in parallel side by side? One loop does just the registry. It repeatedly updates the registry. The other loop does everything else.

@DilumAluthge
Member

Personally, I think that 15 minutes is too slow. It would be good to get this delay down to 5 minutes or less, in my opinion.

Just to elaborate, in the pre-PkgServer days, the delay was essentially zero. As soon as a pull request was merged, you only had to wait a few seconds before updating your registry, since the registry was a Git repo, and GitHub updates Git repos within a matter of seconds.

So in my opinion, going from a delay of less than one minute to a delay of greater than or equal to fifteen minutes is a clear regression.

@fredrikekre
Member

I just tested and for me it was less than 1 minute. I don't think we can do better than that, we have to let stuff propagate through the system.

@DilumAluthge
Member

I just tested and for me it was less than 1 minute. I don't think we can do better than that, we have to let stuff propagate through the system.

1 minute is definitely fine. I would say anything less than or equal to 5 minutes is fine.

@johnnychen94
Contributor Author

johnnychen94 commented Jun 23, 2020

Could we have two loops in the Pkg Server, running in parallel side by side? One loop does just the registry. It repeatedly updates the registry. The other loop does everything else.

Personally, I think this can be a good idea in practice. But just to be clear, IIUC it's deliberately designed to update /registries only when the whole update succeeds, so that the storage server "strictly" follows the completeness requirement: any data declared by the server is available on the server. If we continuously update /registries, it slightly breaks that commitment. IMO, this commitment is conceptually great, but in practice it may be over-strict, since there's always a fallback for the Pkg client.

@DilumAluthge
Member

It's curious that for Fredrik the update took 1 minute, but for Johnny it took more than 15 minutes.

It might be helpful to collect more samples. Perhaps someone could write a script that routinely pulls the registry from the Pkg server, and pulls a list of recently merged PRs from General, and loops through the recently merged PRs (starting with the most recently merged PR) and goes backwards in time until it finds the most recent PR that is included in the registry provided by Pkg server. If we automate that process, we can collect a lot of samples and figure out how common it is for the Pkg server registry to be more than 5 minutes delayed.
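
A rough Python sketch of the measurement loop described above might look like the following. It leaves out all network access: the `MergedPR` records, the PR numbers, the commit ids, and the `served_history` set (the commits reachable from the registry the Pkg server currently serves) are all hypothetical inputs that a real script would have to fetch.

```python
from datetime import datetime, timezone
from typing import Iterable, List, NamedTuple, Optional


class MergedPR(NamedTuple):
    number: int          # PR number in General (illustrative values below)
    commit: str          # merge commit id
    merged_at: datetime  # merge time, UTC


def newest_included(prs: List[MergedPR], served_history: Iterable[str]) -> Optional[MergedPR]:
    """Walk PRs from newest to oldest; return the first whose commit the server has."""
    served = set(served_history)
    for pr in sorted(prs, key=lambda p: p.merged_at, reverse=True):
        if pr.commit in served:
            return pr
    return None


def lag_minutes(prs: List[MergedPR], served_history: Iterable[str], now: datetime) -> float:
    """Age in minutes of the oldest merged PR not yet served; 0.0 if caught up."""
    hit = newest_included(prs, served_history)
    missing = [p for p in prs if hit is None or p.merged_at > hit.merged_at]
    if not missing:
        return 0.0
    oldest_missing = min(missing, key=lambda p: p.merged_at)
    return (now - oldest_missing.merged_at).total_seconds() / 60.0


def utc(hour: int, minute: int) -> datetime:
    return datetime(2020, 6, 23, hour, minute, tzinfo=timezone.utc)


# Illustrative data: three merged PRs, of which the server has only the first.
prs = [MergedPR(1, "aaa", utc(12, 0)), MergedPR(2, "bbb", utc(12, 10)), MergedPR(3, "ccc", utc(12, 20))]
sample_lag = lag_minutes(prs, {"aaa"}, now=utc(12, 30))
```

Running this periodically and logging `sample_lag` would give the distribution of delays, including how often it exceeds 5 minutes.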

@DilumAluthge
Member

Could we have two loops in the Pkg Server, running in parallel side by side? One loop does just the registry. It repeatedly updates the registry. The other loop does everything else.

Personally, I think this can be a good idea in practice. But just to be clear, IIUC it's deliberately designed to update /registries only when the whole update succeeds, so that the storage server "strictly" follows the completeness requirement: any data declared by the server is available on the server. If we continuously update /registries, it slightly breaks that commitment.

Yeah I realize now my "two loops" suggestion breaks that promise. Maybe best to look for other solutions.

@DilumAluthge
Member

It just seems like a bad user experience that if a bug fix is registered and merged in General, now you have to wait an unknown period of time before the bug fix is accessible to you?

At the very least, it would be useful to be able to figure out how recent your registry is. Does the Pkg server expose any endpoint that would let me get e.g. the UTC timestamp corresponding to the last time when the Pkg server cloned the registry? So at least I can get a sense of how recent my Pkg server registry is.

@StefanKarpinski
Contributor

@johnnychen94, are you running a modified version of the gen_static.jl script? In the default configuration, it should only be downloading artifacts that appear in newly registered package versions, which should be fast, much closer to the 1 minute for an incremental update that @fredrikekre is seeing than the 15 minutes you're seeing.

@StefanKarpinski
Contributor

StefanKarpinski commented Jun 23, 2020

E.g. do you have get_old_package_artifacts set to true or something?

@johnnychen94
Contributor Author

johnnychen94 commented Jun 23, 2020

The 15 minutes I observed is an approximate time measured in CI, not against my local storage server.

do you have get_old_package_artifacts set to true or something?

Oops, looks like I didn't set this correctly. My local storage server and the BFSU mirror, however, run a refactored version of gen_static.jl, i.e., StorageServer.jl. That's probably why each incremental update takes more than 8 minutes.


I'll make a script as @DilumAluthge suggested and give a further report on this issue next weekend. (I was too busy with my school things to do so 😢 )

@johnnychen94
Contributor Author

johnnychen94 commented Jul 16, 2020

The following is what I've collected over the last 12 hours on us-east.storage.julialang.org (mean 32.25 mins):

registry_hash, available_time, wait_time(min)
8c9af6ba0ba9dc8605ed2334ca3ef58c06438bc7, 2020-07-16T16:57:37.829, 24.447133333333333
89daa2df5a8d0623dc3dc01f2456d29bd68cc255, 2020-07-16T18:25:25.7, 40.07833333333333
e405b2ca00df3fd3de42e1598d124487065abada, 2020-07-16T18:46:30.462, 23.57435
356e277b1e3392f7eea6097dbfab68057d17b95e, 2020-07-16T19:50:36.881, 35.81466666666667
d555acd84f406dbd22565988269ee83d67fdeb7f, 2020-07-16T21:15:58.31, 30.588483333333333
f89846fab0011a3abc1d40a00fb7e9764c60597c, 2020-07-16T21:38:32.576, 23.02625
0f9b1ff36d1c1f57144b7a469b4a2f467893a6be, 2020-07-16T22:20:50.83, 23.14715
2d91520d48691a98203fa5fdce5c974affe3705c, 2020-07-16T23:04:23.309, 40.788466666666665
ac1d300e4a9e02097170df19dc2b49d3c4caa5bc, 2020-07-16T23:26:00.964, 28.81605
fae1e99903414859c680f62eeb1d85ddcd568256, 2020-07-16T23:46:10.641, 23.010666666666665
fde3f8682be0da1cafe9a43c209859b20d2a4b94, 2020-07-17T00:30:23.78, 33.57965
342b73e1e663d9b9d96714f9cf3daff5cadad2c0, 2020-07-17T00:51:15.738, 33.9456
f76f5bf5167e444bbb451e74c3c5a9c8ceb90088, 2020-07-17T01:14:09.782, 41.27968333333333
faf9767c320fae08b7f304acea08beee0ada0731, 2020-07-17T01:35:35.17, 38.48616666666667
1e4af49160e6b2a24089d8b364a137d7a04d8ced, 2020-07-17T01:56:48.785, 40.82975
a4d4bf9c53ee8b479e20bb7e26e62eda7df621a7, 2020-07-17T02:25:55.667, 40.64445
d2c616f9bcb54db4af425872f4d0ab7366541268, 2020-07-17T02:46:42.314, 23.105216666666667
a6d1bf1d78fc91c9ce6115679f3448bf381a90ef, 2020-07-17T03:51:35.29, 36.53815
c0a56ff9bf3ec1c09f5bc7f1be69b681384b1fe6, 2020-07-17T04:55:50.671, 22.161183333333334

us-east.pkg.julialang.org is basically the same as us-east.storage.julialang.org (within 1 minute).

cn-east.pkg.julialang.org is more "severe" (mean 40.14 mins):

registry_hash, available_time, wait_time(min)
8c9af6ba0ba9dc8605ed2334ca3ef58c06438bc7, 2020-07-16T17:01:09.939, 27.982283333333335
89daa2df5a8d0623dc3dc01f2456d29bd68cc255, 2020-07-16T18:28:52.01, 43.516816666666664
e405b2ca00df3fd3de42e1598d124487065abada, 2020-07-16T18:49:55.062, 26.98435
356e277b1e3392f7eea6097dbfab68057d17b95e, 2020-07-16T19:55:16.954, 40.48255
d555acd84f406dbd22565988269ee83d67fdeb7f, 2020-07-16T21:22:07.153, 36.735866666666666
2d91520d48691a98203fa5fdce5c974affe3705c, 2020-07-16T23:29:39.576, 66.05958333333334
ac1d300e4a9e02097170df19dc2b49d3c4caa5bc, 2020-07-16T23:43:31.738, 46.32895
fae1e99903414859c680f62eeb1d85ddcd568256, 2020-07-16T23:54:47.673, 31.627866666666666
fde3f8682be0da1cafe9a43c209859b20d2a4b94, 2020-07-17T00:36:27.765, 39.64606666666667
342b73e1e663d9b9d96714f9cf3daff5cadad2c0, 2020-07-17T00:57:07.443, 39.80733333333333
f76f5bf5167e444bbb451e74c3c5a9c8ceb90088, 2020-07-17T01:18:55.514, 46.04188333333333
faf9767c320fae08b7f304acea08beee0ada0731, 2020-07-17T01:39:02.858, 41.94761666666667
1e4af49160e6b2a24089d8b364a137d7a04d8ced, 2020-07-17T02:00:13.763, 44.24603333333334
a4d4bf9c53ee8b479e20bb7e26e62eda7df621a7, 2020-07-17T02:28:09.664, 42.877716666666664
d2c616f9bcb54db4af425872f4d0ab7366541268, 2020-07-17T02:47:47.307, 24.188433333333332
a6d1bf1d78fc91c9ce6115679f3448bf381a90ef, 2020-07-17T03:51:31.154, 36.46921666666667
c0a56ff9bf3ec1c09f5bc7f1be69b681384b1fe6, 2020-07-17T04:55:46.833, 22.097216666666668

available time: the local time when I get a new hash from pkg/storage server via /registries
wait time: available_time - git commit time
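
As a small illustration of the wait-time definition above, the following Python sketch computes the difference in minutes between two ISO 8601 timestamps. The function name is hypothetical, the commit time in the example is made up for illustration, and both timestamps are assumed to be in the same time zone.

```python
from datetime import datetime


def wait_time_minutes(available_time: str, commit_time: str) -> float:
    """Minutes between the git commit time and the time the hash became available."""
    available = datetime.fromisoformat(available_time)
    committed = datetime.fromisoformat(commit_time)
    return (available - committed).total_seconds() / 60.0


# The available_time is taken from the first row above; the commit time is hypothetical.
example_wait = wait_time_minutes("2020-07-16T16:57:37.829", "2020-07-16T16:33:10.829")
```

If one timestamp is local time and the other is UTC, they would need to be converted to a common zone first, otherwise the result is off by the UTC offset.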

26 new "discontinuous" commits were recorded, but only 19 of them were picked up by the storage server.

script: https://gist.github.com/johnnychen94/98fde55fc341d0c967f8f5ef2a48956a

@staticfloat
Member

That's great Johnny, we should track this over time somehow.

@staticfloat
Member

staticfloat commented Jul 19, 2020

I've just fixed some issues with the Korean storage server; please keep track of the latency of Registry -> https://kr.storage.julialang.org over the next couple of days. I think it should be much better than in the past.

One remaining design reason why registry updates may sometimes be slow is that the storage server does not advertise the new registry hash until it has downloaded and stored all new resources; i.e., it doesn't advertise a registry until it can serve everything referenced by that registry. It's possible we may want to change that in order to expedite registry service, but I'm not 100% sure. In any case, let's see what the user experience is like with the current design, but with fewer bugs. :)
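
The "advertise only when complete" behavior described above can be sketched in Python as follows. This is a toy model, not the actual storage server (whose code is not public); the class, its fields, and the `fetch` callback are all hypothetical.

```python
from typing import Callable, List, Optional, Set


class StorageServerSketch:
    """Toy model: a registry hash is advertised only once every resource is stored."""

    def __init__(self) -> None:
        self.stored: Set[str] = set()          # resource ids held locally
        self.advertised: Optional[str] = None  # registry hash currently advertised

    def sync(self, registry_hash: str, required: List[str],
             fetch: Callable[[str], bool]) -> Optional[str]:
        """Fetch missing resources; switch to registry_hash only if all are stored."""
        for res in required:
            if res not in self.stored and fetch(res):
                self.stored.add(res)
        if all(res in self.stored for res in required):
            self.advertised = registry_hash
        # Otherwise the previously advertised (older) hash keeps being served.
        return self.advertised


server = StorageServerSketch()
first = server.sync("h1", ["a", "b"], fetch=lambda res: True)
# A slow or failing download of "big" keeps the server on the old registry hash.
stalled = server.sync("h2", ["a", "b", "big"], fetch=lambda res: res != "big")
caught_up = server.sync("h2", ["a", "b", "big"], fetch=lambda res: True)
```

The sketch shows why a single slow resource can hold back the advertised registry for everyone, which is the trade-off being weighed here.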

@johnnychen94
Contributor Author

johnnychen94 commented Jul 20, 2020

Tracking kr.storage.julialang.org looks great now!

registry_hash, available_time, wait_time(min)
a0e8965316b0cabd279a427796174f394ec8fe51, 2020-07-19T01:57:07.609, 59.66011666666667
69bde3f1b510c45ad3f6cc78c927f59f9cd0a312, 2020-07-19T02:49:48.119, 76.96865
784c3cdc16a64f605f87a63903c387647d6dbebc, 2020-07-19T15:44:43.319, 59.8053
c1b1462946f5e3eb365b498bfab10a613c8ea02c, 2020-07-19T16:34:40.143, 1.5023666666666666
6b2351fad68d1461ae86280226d92f3c2f071d2d, 2020-07-19T17:57:50.054, 0.6508833333333334
3daa85af49a9c315ed784bc84a626c5ee04d6393, 2020-07-19T18:24:17.493, 1.3748666666666667
2226550d169a9aa18cc5797655ba4890a3f67c72, 2020-07-19T18:59:04.296, 1.88825
49592182216d517575d2651aa9d9c2df9a4d2b88, 2020-07-19T20:20:04.313, 0.5052
04a9ee5adbdb1a3d40a101ec73e3a22f41da96ae, 2020-07-19T20:57:56.493, 0.8748666666666667
4bbbc9a9f3c468fb00f95ce9055640d66d31f1b8, 2020-07-20T01:34:59.707, 1.81175
9f72ff8688a1d04fa5667fe36a00910fdab79a4a, 2020-07-20T01:46:55.071, 1.5511666666666666
8d79f17d93a439ecc7a4ae9af57c3a8ea45941a7, 2020-07-20T02:58:15.375, 1.3062333333333334
b93141ba4c6bf0c2870b9a18845297a085df8ef8, 2020-07-20T04:18:33.838, 1.0305333333333333
8c56073f6bff521db16bdb7b64afd5722cb2fbe7, 2020-07-20T06:34:38.472, 1.4077
443ade04a5d52b01e2d615d2e3186f754527fa8b, 2020-07-20T08:43:06.404, 0.89005

The first three records seem like a warm-up.

Feel free to close this issue when you think it is stable.

Just curious, is there any public access to the build script?

@StefanKarpinski
Contributor

The storage server code is not public and is substantially more complex than the simple static server script. The premise, as outlined in the original Pkg design issue, is that different entities provide independent storage services, which are treated uniformly by the Pkg servers. @staticfloat and I have talked about exposing new /packages and /artifacts endpoints that list all package and artifact resources, like /registries does for the current registries, which would make it easy to mirror everything that a storage server knows about.
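
For readers who want to poll /registries themselves, here is a small Python sketch of a parser. It assumes the response body is plain text with one path per line of the form `/registry/<uuid>/<hash>`; treat that format as an assumption to verify against a live server. The function name is hypothetical and the hash in the example is taken from the measurements earlier in this thread.

```python
from typing import Dict


def parse_registries(body: str) -> Dict[str, str]:
    """Map registry UUID -> hash, assuming lines look like /registry/<uuid>/<hash>."""
    result = {}
    for line in body.splitlines():
        parts = line.strip().split("/")
        # A well-formed line splits into ["", "registry", uuid, hash].
        if len(parts) == 4 and parts[1] == "registry":
            result[parts[2]] = parts[3]
    return result


# Illustrative response body (assumed format), using the General registry's UUID.
sample_body = "/registry/23338594-aafe-5451-b93e-139f81909106/8c9af6ba0ba9dc8605ed2334ca3ef58c06438bc7\n"
registries = parse_registries(sample_body)
```

Comparing the returned hash between polls is enough to detect when a server starts serving a new registry.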

@StefanKarpinski
Contributor

Also, great to see those fast update times! That's what I had always imagined this should be like. It should continue to be like that going forward.

@DilumAluthge
Member

The storage server code is not public and is substantially more complex than the simple static server script.

Would it be possible to eventually open source the storage server code?

@StefanKarpinski
Contributor

StefanKarpinski commented Jul 21, 2020

That is not something we're planning. The storage servers are built and maintained by Julia Computing, offered as a free service to the community. A large part of their functionality is interacting with proprietary systems like GitHub and GitLab to get resources and AWS/S3 for persistence, and those capabilities are also key features of JC's JuliaTeam product offering. If anyone else wants to build and maintain a storage service, they should absolutely do so; the protocol is very simple. I do think that we should have an open source script that mirrors the Julia Computing storage servers and serves them statically. That will act as a backup in case the JC storage servers go down.

I'm realizing now that since the storage servers are JC proprietary they probably should not be called {us-east,kr}.storage.julialang.org but should instead be named {us-east,kr}.storage.juliahub.com. @staticfloat, how hard would it be to change their host names?

@staticfloat
Member

staticfloat commented Jul 21, 2020 via email

@DilumAluthge
Member

That is not something we're planning. The storage servers are built and maintained by Julia Computing, offered as a free service to the community. A large part of their functionality is interacting with proprietary systems like GitHub and GitLab to get resources and AWS/S3 for persistence, and those capabilities are also key features of JC's JuliaTeam product offering.

That makes sense to me!

@dehann

dehann commented May 24, 2021

Based on the fix @JeffFessler mentioned (thanks!), I was able to get this to work by just adding the environment variable JULIA_PKG_SERVER = "" to the CI scripts, e.g. see here:
https://github.com/JuliaRobotics/ApproxManifoldProducts.jl/blob/4a8c3acadac765191c70659e6d312d8bd38aa19c/.travis.yml#L9-L10


PS, without this fix I'm still getting the sync issue more than 36 hours later.

@briochemc
Contributor

Just experienced this too with a 5hr+ delay today.

@StefanKarpinski
Contributor

StefanKarpinski commented Jun 9, 2021

Yeah, it's a public resource and sometimes very large artifacts get submitted which take a long time to get processed, preventing updates for a while. If you want to see things immediately, you can do export JULIA_PKG_SERVER="" and skip getting things from the package server altogether.
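
For scripts that launch Julia from another process, the same workaround can be applied per invocation. The Python sketch below is one way to do that under the assumption that `julia` is on the PATH; the helper names are hypothetical, and only the `JULIA_PKG_SERVER` variable comes from this thread.

```python
import os
import subprocess
from typing import Dict, Mapping, Optional


def env_without_pkg_server(base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Copy the environment and set JULIA_PKG_SERVER to the empty string."""
    env = dict(os.environ if base is None else base)
    env["JULIA_PKG_SERVER"] = ""
    return env


def run_julia_tests(project_dir: str) -> int:
    """Run Pkg.test() in project_dir with the Pkg server skipped (needs julia on PATH)."""
    cmd = ["julia", "--project", "-e", "using Pkg; Pkg.test()"]
    return subprocess.run(cmd, cwd=project_dir, env=env_without_pkg_server()).returncode


# Illustrative base environment; the server URL is just a placeholder value.
patched = env_without_pkg_server({"PATH": "/usr/bin", "JULIA_PKG_SERVER": "https://pkg.julialang.org"})
```

Because the variable is set on a copy of the environment, other processes started from the same script keep using the Pkg server.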

@MilesCranmer

MilesCranmer commented Jun 10, 2021

The delays in the registry updates have caused ~ten users of PySR (which has SymbolicRegression.jl as its backend) to raise GitHub issues or email me, despite my pinning an issue with the JULIA_PKG_SERVER="" fix and also mentioning it in my README. Can a fix be implemented so that, if the server can't find a package's version, it automatically switches to the git-based package server? Or can the git-based server be used as the new default, so users who are new to Julia don't have to debug this?

What happens is: I usually update the Julia backend, wait for it to merge into the registry, and then update the PyPI package (Python package server). PyPI is updated instantly, and my tests pass fine because the Julia GitHub action uses the git-based registry. But the Julia default registry can sometimes take more than a day, so any user (especially one who doesn't use Julia regularly) who updates PySR in the meantime sees an error about Julia not being able to find the updated backend.

@johnnychen94
Contributor Author

johnnychen94 commented Jun 10, 2021

Can a fix be implemented so that, if the server can't find a package's version, it automatically switches to the git-based package server?

The original design of the Pkg/Storage servers is that they only advertise registry versions for which they hold the complete package and artifact data, so a fallback like this should not be implemented on the PkgServer side.

Currently, the Pkg client talks to the Pkg server to update its registry. I now think this issue could be fixed by adding a registry server that serves only the General registry, so that:

  1. The Pkg client talks to the registry server to update its registry; the registry server only builds a small amount of data, so it is almost always up to date.
  2. The Pkg client tries to download package/artifact content from the Pkg server. This will almost always miss right after a new package version is registered, because the Pkg server needs time to build the data; still, falling back to downloading from GitHub is better than "failed to find a new version".

An officially hosted registry server would also solve the trust issue with 3rd-party Pkg servers: the Pkg client queries the SHA and URL from the official registry server, downloads from a 3rd-party Pkg server, and verifies the downloaded data.
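
The download-then-verify flow proposed above could look roughly like this Python sketch. It is only an illustration of the idea, not Pkg's implementation: the `Fetcher` callables stand in for the Pkg server and the git fallback, and SHA-256 over the raw bytes is used as a simple stand-in for whatever content hash the registry actually records.

```python
import hashlib
from typing import Callable, Optional

# A fetcher takes the expected hash and returns the bytes, or None on a miss.
Fetcher = Callable[[str], Optional[bytes]]


def digest(data: bytes) -> str:
    # Stand-in for the real content hash; SHA-256 of the raw bytes keeps the sketch simple.
    return hashlib.sha256(data).hexdigest()


def download_verified(expected_hash: str, pkg_server: Fetcher, git_fallback: Fetcher) -> bytes:
    """Try the Pkg server first, then the fallback; accept only data matching the hash."""
    for source in (pkg_server, git_fallback):
        data = source(expected_hash)
        if data is not None and digest(data) == expected_hash:
            return data
    raise LookupError("no source could provide " + expected_hash)


payload = b"package tarball bytes"
trusted_hash = digest(payload)  # in the proposal, this would come from the official registry

# A Pkg server that has not built the new version yet: the fallback is used instead.
from_fallback = download_verified(trusted_hash, lambda h: None, lambda h: payload)
# An untrusted mirror returning tampered data is rejected by the hash check.
despite_tampering = download_verified(trusted_hash, lambda h: b"tampered", lambda h: payload)
```

Since the expected hash comes from a trusted registry, a third-party server can at worst cause a fallback, not deliver altered content.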

This idea was shared by @GunnarFarneback 10 days ago on Slack in #pkg-trust:

Related to JuliaLang/Pkg.jl#1876 (comment) I would be very careful about using a third party mirror of a package server. If it's not trustworthy, it could easily feed you a fake registry with updates to malevolent versions of any packages, which would be similar to a dependency confusion attack but much, much worse. Note, it couldn't feed you bad packages when instantiating a manifest or if you get the registry from a trusted source (unless it has the computational resources to create sha1 collisions at will). So, well, third party package servers are fine if you know you can trust them and should be safe if you get your registry from git, but I wouldn't promote their use in any official documentation.
[...]
Yeah, I understand that bandwidth and service reliability are issues. One potential future solution could be an option to only obtain the General registry hash from an official server and get the contents from a third party server.

@PetrKryslUCSD

I am seeing a problem that looks like it is related, with a six hour delay between the registration of the package and now when adding the package fails: https://discourse.julialang.org/t/registered-package-invisible/67533/2

@DilumAluthge
Member

DilumAluthge commented Oct 10, 2021

I made a Discourse post that summarizes this issue and provides the workaround for users that need immediate access to new packages and new versions.

https://discourse.julialang.org/t/general-registry-delays-and-a-workaround/67537?u=dilumaluthge

@staticfloat
Member

With some recent upgrades to the StorageServer, this should now be pretty much fixed. Please shout out if you experience PkgServer registry delays, as they should be eliminated now. We have added some client-side configuration that can be used to tell the PkgServer whether you would like a more bleeding-edge or a more conservative registry; see this issue for more detail: JuliaPackaging/PkgServer.jl#144
