
Opportunistic import of old content #62

Closed
kenperkins opened this issue Oct 11, 2014 · 12 comments
@kenperkins
Member

@aseemk and I were debating whether we should attempt to ingest any additional content from our top ~5000 items.

The prevailing idea was that we could create a HEAD request for each of our 1.2m items, and if we got a 404, we could opportunistically import that content, as it's no longer available on the web.
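A minimal sketch of that check, assuming `curl` is available and a hypothetical `urls.txt` with one source URL per line (the filenames and the 200-only success criterion are assumptions, not the actual tooling):

```shell
# Hypothetical sketch: issue a HEAD request per source URL and record
# the ones that no longer resolve, as candidates for opportunistic import.
while read -r url; do
  # -I sends a HEAD request; -w prints just the status code (000 on failure).
  code=$(curl -sS -o /dev/null -I -w '%{http_code}' "$url")
  # Anything other than 200 (404, 5xx, connection failure) counts as "gone".
  [ "$code" = "200" ] || echo "$url" >> missing-urls.txt
done < urls.txt
```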

The first step would be to see how many images we're talking about.

@kenperkins
Member Author

So far, having looked at about 15,000 IDs, we're seeing roughly 50% found and 50% error or 404.

@kenperkins
Member Author

So, given that at 20% in we're still at ~50% not found, is it even worth continuing? That would leave us with ~600k images to import, and probably 600,000,000 blobs (our first 4,800 images averaged ~1,000 blobs each).

Can we think of any other way to prioritize?

@aseemk
Member

aseemk commented Oct 13, 2014

Thanks for looking into it, Ken!

I can't think of any other way. In that case, I'd be fine w/ not worrying about images that no longer exist, and simply re-generating ones that do when they're requested. (Like a true cache.)

@aseemk
Member

aseemk commented Oct 16, 2014

Re-opening this: now that we've shipped, we can see what content's getting requested, and manually copy over the tiles for those things.

@kenperkins's script already takes as input a list of IDs (e.g. top analytics IDs); we could easily write a script to parse the logs for more IDs.

We have a small window of time where the tiles on Azure haven't been deleted yet, so maybe we should prioritize this while we still can.

@aseemk aseemk reopened this Oct 16, 2014
@aseemk aseemk added this to the Zoom.it milestone Oct 16, 2014
@kenperkins
Member Author

I'll have to figure out how to refactor it slightly; the tooling started from a cloud-stored copy of total.txt.

@aseemk
Member

aseemk commented Oct 17, 2014

@kenperkins: would you mind giving this a stab at your convenience?

@kenperkins
Member Author

So any time we get a content-by-url or content-by-id request where ready: false, we import tiles?

@aseemk
Member

aseemk commented Oct 17, 2014

Oh man, I didn't even think of that, but that's interesting. No, I was just thinking of going through our logs one-off and importing tiles for content that's been requested so far.

Here's a little snippet I came up with on the subway, but this is only for view pages:

# on the server, as root, at /var/log/zoomhub
# match view-page requests, take the request-path field (11th),
# strip the leading slash, and keep IDs requested more than once:
$ egrep 'GET /\w+ ' zoomhub.snapshot02.log | cut -d' ' -f11 | cut -c2- | sort | uniq -d > logs-view-ids.txt

This only gets view page IDs (not embed or API requests, and not by URL), but maybe that's good enough for now.
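If we wanted to also catch by-ID API requests, a hedged extension could grep for an API path as well. The `/v1/content/` route and log format here are assumptions; the exact pattern would need checking against the real logs:

```shell
# Hypothetical: extract content IDs from API requests in the same log.
# -o prints only the matched portion; awk takes the last /-separated field.
egrep -o 'GET /v1/content/\w+' zoomhub.snapshot02.log \
  | awk -F/ '{print $NF}' | sort -u > logs-api-ids.txt
```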

@manbartlett

Hi All,

Please forgive me if I'm posting in the wrong place (github n00b), but I've been poking around for a while trying to get more info about 'queued' images, and whether there's a way for me to get my image back online myself. Obvs it isn't in the top 5K, but nonetheless I would love to see it live on! The URL in question is http://zoom.it/JzGm. Could one of you kindly point me in the right direction (or tell me to hang tight if the migration will just take time)? Thanks!

And thank you so much for continuing this! Really really love the service, and as soon as the donate link is up I'll drop some $ in the bucket. :)

@iangilman
Member

@manbartlett this is a fine place to mention this! The main tracking issue is #89, which links to here.

@manbartlett

@iangilman awesome, thank you!

@aseemk
Member

aseemk commented Nov 2, 2014

The old tile server is indeed no longer up, so we can no longer simply copy tiles. This also means images that no longer exist at the source (e.g. 404s now) can no longer be served from zoom.it.

So I'll go ahead and close this now in favor of #89, but we can continue the conversation there.

@aseemk aseemk closed this as completed Nov 2, 2014