Opportunistic import of old content #62
Comments
So far, having looked at about 15 thousand IDs, we're seeing: about 60% found, 50% error or 404.
So, given that at 20% in, we're still around ~50% not found, is it even worth continuing? That'll leave us with ~600k images to import, and probably 600,000,000 blobs. (Our first 4800 images averaged 1k blobs per.) Can we think of any other way to prioritize?
Thanks for looking into it, Ken! I can't think of any other way either. In that case, I'd be fine w/ not worrying about images that no longer exist, and simply re-generating ones that do when they're requested. (Like a true cache.)
Re-opening this: now that we've shipped, we can see what content's getting requested, and manually copy over the tiles for those things. @kenperkins's script already takes as input a list of IDs (e.g. top analytics IDs); we could easily write a script to parse the logs for more IDs. We have a small window of time where the tiles on Azure haven't been deleted yet, so maybe we should prioritize this while we still can.
I'll have to figure out how to refactor it slightly; the tooling started from a cloud-stored copy of total.txt.
@kenperkins: would you mind giving this a stab at your convenience? |
So anytime we get a content-by-url or content-by-id where …
Oh man, I didn't even think of that, but that's interesting. No, I was just thinking of going through our logs one-off and importing tiles for content that's been requested so far. Here's a little snippet I came up with on the subway, but this is only for view pages:
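A minimal sketch of what such a view-page log filter might look like (the log format, regex, and helper name here are illustrative assumptions, not the actual script; it assumes common-log-format access lines, with view pages served at bare `/<id>` paths like `http://zoom.it/JzGm`):

```python
import re

# Assumed log-line shape (illustrative):
#   1.2.3.4 - - [10/Jan/2014:00:00:00 +0000] "GET /JzGm HTTP/1.1" 200 1234
# A view page is a bare "/<id>" path, so "/JzGm.js" embeds and "/v1/..."
# API calls won't match this pattern.
VIEW_PAGE = re.compile(r'"GET /([A-Za-z0-9]+) HTTP/')

def view_page_ids(lines):
    """Return the unique IDs of view pages requested in the given log lines."""
    return {m.group(1) for m in map(VIEW_PAGE.search, lines) if m}
```

The resulting set could then be fed straight into the existing import script as its input list of IDs.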
This only gets view page IDs (not embed or API requests, and not by URL), but maybe that's good enough for now.
Hi All, Please forgive me if I'm posting in the wrong place (github n00b), but I've been poking around for a while trying to get more info about 'queued' images, and if there's a way for me to get my image back online myself. Obvs it isn't in the top 5K but nonetheless would love to see it live on! The URL in question is http://zoom.it/JzGm. Could one of you kindly point me in any directions (or tell me to hang tight if it's just that the migration will take time), etc? Thanks! And thank you so much for continuing this! Really really love the service, and as soon as the donate link is up I'll drop some $ in the bucket. :)
@manbartlett this is a fine place to mention this! The main tracking issue is #89, which links to here. |
@iangilman awesome, thank you! |
The old tile server is indeed no longer up, so we can no longer simply copy tiles. This also means images that no longer exist at the source (e.g. 404s now) can no longer be served from zoom.it. So I'll go ahead and close this now in favor of #89, but we can continue the conversation there.
@aseemk and I were debating whether we should attempt to ingest any additional content from our top ~5000 items.
The prevailing idea was that we could create a `HEAD` request for each of our 1.2m items, and if we got a 404, we could opportunistically import that content, as it's no longer available on the web. First step would be to see how many images we're talking.
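The `HEAD`-triage step described above could be sketched like this (a hypothetical helper using only the standard library, assuming plain HTTP and treating a 404 from the source URL as "no longer on the web"):

```python
from urllib.request import Request, urlopen
from urllib.error import HTTPError

def source_is_gone(url, timeout=10):
    """HEAD the original source URL; True if it now 404s.

    Illustrative helper, not the project's actual tooling: a 404 means
    the image no longer exists on the web, so it's a candidate for
    opportunistic import while our own copy still exists.
    """
    try:
        urlopen(Request(url, method="HEAD"), timeout=timeout)
        return False          # still reachable at the source
    except HTTPError as e:
        return e.code == 404  # other error codes: treat as still present
```

Run over the full 1.2m-item list, a filter like this would give a first count of how many images are actually gone.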