
Opportunistic import of old content #62

Closed
kenperkins opened this issue Oct 11, 2014 · 12 comments
@kenperkins
Member

@aseemk and I were debating whether we should attempt to ingest any additional content from our top ~5000 items.

The prevailing idea was that we could create a HEAD request for each of our 1.2m items, and if we got a 404, we could opportunistically import that content, as it's no longer available on the web.
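A minimal sketch of that check, assuming `curl` is available and a hypothetical `urls.txt` with one source URL per line (the filenames and the 200-only success criterion are assumptions, not the actual tooling):

```shell
# Hypothetical sketch: issue a HEAD request per source URL and record
# the ones that no longer resolve, as candidates for opportunistic import.
while read -r url; do
  # -I sends a HEAD request; -w prints just the status code (000 on failure).
  code=$(curl -sS -o /dev/null -I -w '%{http_code}' "$url")
  # Anything other than 200 (404, 5xx, connection failure) counts as "gone".
  [ "$code" = "200" ] || echo "$url" >> missing-urls.txt
done < urls.txt
```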

The first step would be to see how many images we're talking about.

@kenperkins
Member Author

So far, having looked at about 15,000 IDs, we're seeing roughly 50% found and 50% error or 404.

@kenperkins
Member Author

So, given that at 20% in we're still at ~50% not found, is it even worth continuing? That would leave us with ~600k images to import, and probably 600,000,000 blobs (our first 4,800 images averaged ~1,000 blobs each).

Can we think of any other way to prioritize?

@aseemk
Member

aseemk commented Oct 13, 2014

Thanks for looking into it, Ken!

I can't think of any other way. In that case, I'd be fine w/ not worrying about images that no longer exist, and simply re-generating ones that do when they're requested. (Like a true cache.)

@aseemk
Member

aseemk commented Oct 16, 2014

Re-opening this: now that we've shipped, we can see what content's getting requested, and manually copy over the tiles for those things.

@kenperkins's script already takes as input a list of IDs (e.g. top analytics IDs); we could easily write a script to parse the logs for more IDs.

We have a small window of time where the tiles on Azure haven't been deleted yet, so maybe we should prioritize this while we still can.

@aseemk aseemk reopened this Oct 16, 2014
@aseemk aseemk added this to the Zoom.it milestone Oct 16, 2014
@kenperkins
Member Author

I'll have to figure out how to refactor it slightly; the tooling started from a cloud-stored copy of total.txt.

@aseemk
Member

aseemk commented Oct 17, 2014

@kenperkins: would you mind giving this a stab at your convenience?

@kenperkins
Member Author

So any time we get a content-by-url or content-by-id request where ready: false, we import tiles?

@aseemk
Member

aseemk commented Oct 17, 2014

Oh man, I didn't even think of that, but that's interesting. No, I was just thinking of going through our logs one-off and importing tiles for content that's been requested so far.

Here's a little snippet I came up with on the subway, but this is only for view pages:

# on the server, as root, at /var/log/zoomhub
# match view-page requests, take the request-path field (11th),
# strip the leading slash, and keep IDs requested more than once:
$ egrep 'GET /\w+ ' zoomhub.snapshot02.log | cut -d' ' -f11 | cut -c2- | sort | uniq -d > logs-view-ids.txt

This only gets view page IDs (not embed or API requests, and not by URL), but maybe that's good enough for now.
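If we wanted to also catch by-ID API requests, a hedged extension could grep for an API path as well. The `/v1/content/` route and log format here are assumptions; the exact pattern would need checking against the real logs:

```shell
# Hypothetical: extract content IDs from API requests in the same log.
# -o prints only the matched portion; awk takes the last /-separated field.
egrep -o 'GET /v1/content/\w+' zoomhub.snapshot02.log \
  | awk -F/ '{print $NF}' | sort -u > logs-api-ids.txt
```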

@manbartlett

Hi All,

Please forgive me if I'm posting in the wrong place (github n00b), but I've been poking around for a while trying to get more info about 'queued' images, and whether there's a way for me to get my image back online myself. Obvs it isn't in the top 5K, but nonetheless I would love to see it live on! The URL in question is http://zoom.it/JzGm. Could one of you kindly point me in the right direction (or tell me to hang tight if the migration will just take time)? Thanks!

And thank you so much for continuing this! Really really love the service, and as soon as the donate link is up I'll drop some $ in the bucket. :)

@iangilman
Member

@manbartlett this is a fine place to mention this! The main tracking issue is #89, which links to here.

@manbartlett

@iangilman awesome, thank you!

@aseemk
Member

aseemk commented Nov 2, 2014

The old tile server is indeed no longer up, so we can no longer simply copy tiles. This also means images that no longer exist at the source (e.g. 404s now) can no longer be served from zoom.it.

So I'll go ahead and close this now in favor of #89, but we can continue the conversation there.

@aseemk aseemk closed this as completed Nov 2, 2014