
Stop redundantly downloading media when re-exporting #395

Merged
merged 11 commits into from
Oct 23, 2020

Conversation

andrewkolos
Contributor

Closes #382

Heya, this PR is mostly meant to facilitate discussion about #382. Maybe I should keep discussion there instead, but I don't know. Anyway, this PR demonstrates a minimal (though arguably not robust) way to avoid redownloading media when re-performing an export.

Implementation. When parsing a URL, MediaUrlParser also extracts the parent directory of the file being downloaded. MediaDownloader adds this directory in parentheses to the filename when saving the file to disk. Example: if I embed bagel.png in a message and then export the channel, it might now appear locally as bagel(5QO5WVLYSEI).png. In this case, 5QO5WVLYSEI is the message ID. The idea is that the filename + message ID should uniquely identify any media, so if someone else also uploads a bagel.png after that message, that file would still get downloaded as well; it would just have a different message ID in the filename.
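To illustrate the scheme, here is a minimal sketch (not the PR's actual code; the helper name and the exact URL shape are assumptions) of how the parent directory could be folded into the local filename:

```csharp
using System;
using System.IO;

public static class MediaFileNaming
{
    // Hypothetical helper illustrating the naming scheme described above.
    // For a URL like .../attachments/<channel id>/<message id>/bagel.png,
    // the parent directory (here the message ID) is appended in parentheses.
    public static string GetLocalFileName(string url)
    {
        var segments = new Uri(url).AbsolutePath.Trim('/').Split('/');
        var fileName = segments[^1];
        var parentDir = segments[^2];

        var name = Path.GetFileNameWithoutExtension(fileName);
        var ext = Path.GetExtension(fileName);
        return $"{name}({parentDir}){ext}";
    }
}
```

With this scheme, two uploads of bagel.png on different messages get distinct local names, while re-exporting the same message reproduces the same name and can be skipped.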

Concerns.

  • Technical--fragility. This implementation creates a dependency on the URL that a piece of media is found at: any media must have a parent directory (e.g. server/guild ID or message ID), which I imagine is pretty safe to depend on. It also depends on the combination of this parent directory and the filename as a reliable means to uniquely identify any media. If this is ever not true, then we may incorrectly skip downloading a piece of media, thinking that we already downloaded it. A more stable way to implement this change might be to let the model class that represents a piece of media decide what information uniquely identifies media of that type, and also the filename it should be given when saving it locally. This is obviously more complicated than just passing a URL to MediaDownloader, so I went for the simple approach I described, and I can only hope it doesn't horribly fail in some situation I did not consider.
  • Technical--inefficacy/accessibility of this feature. This caching mechanism only works when redoing an export that's already been done with the same settings (i.e. one that would cause a "File already exists--do you wish to overwrite?" filesystem dialog). For example, if I export a channel with an After Date of 2 days ago and then do the same export with an After Date of 3 days ago, this would create two separate media directories (no change here), so we would still end up redundantly downloading the media that both exports download. I am not sure of all the settings that change the filename/media directory, but the After Date is an example. I hope I was able to explain this decently enough. I imagine that this functionality would be used mostly when re-exporting the entirety of a channel on some periodic schedule.
  • User experience / potential controversy. Adding this information to the filenames makes them more noisy. For users browsing the downloaded media after they've exported a channel, this may be annoying. bagel.png is nicer to read than bagel(5QO5WVLYSEI).png (or some variant of it). In my personal experience, most uploaded media isn't named, so I imagine the filenames of most downloaded media were already made mostly of gibberish beforehand--so this change probably wouldn't make the usual situation any worse.

I am interested in your thoughts. Let me know if you would like some clarification.

@Tyrrrz
Owner

Tyrrrz commented Oct 8, 2020

I think ultimately we won't be able to avoid collisions. For example:

  • a single message can have multiple attachments (according to Discord API) and thus can have two files with the same name
  • a user can edit their older message by uploading a new attachment with the same filename (what should even happen in this case?)
  • a completely random file may end up in the media directory for whatever reason with a filename that matches an exported media
  • parent directory is technically arbitrary and depends on the export file name, which is completely up to the user

This is a problem because the media files are detached from the actual chat log. I was thinking of somehow putting the media in the same file as the export (using formats like MHTML, PDF or encoding via base64) but that seems to lead to more complexity and issues than it's worth.

TBH, downloading media is already pretty fragile because it can skip files if the server responds with an error. I feel like making it more fragile is just going to increase the number of issues where some random thing happens simply because this whole process is non-deterministic. For example, if the user tries to use this with file skipping, they might end up with a chat log that has incorrect images, and it will be really hard to determine why. I was really against adding this functionality in the first place because of this, but there was a large demand for it.

Maybe it's possible to redesign this feature to make it more predictable, but I have no idea how. Until then, I don't think expanding on it really makes sense, unfortunately.

@Tyrrrz
Owner

Tyrrrz commented Oct 8, 2020

Like I would much rather prefer the export to nuke the media directory if it already exists and write new files instead. This way it's much more predictable as you're not dealing with mutable state (if you disregard that someone can write in that directory at the same time). But the original issue is asking for the exact opposite of that.

@andrewkolos
Contributor Author

andrewkolos commented Oct 8, 2020

a single message can have multiple attachments (according to Discord API) and thus can have two files with the same name

I was wondering about this and didn't know that this was already true. Do you have a reference? I am curious. Edit: found it. Anyway, in this case, the message ID + filename isn't enough to uniquely identify an attachment. Nonetheless, each attachment still has a unique ID, but getting it would require more changes/work.

a user can edit their older message by uploading a new attachment with the same filename (what should even happen in this case?)

I don't think it is currently possible to change an attachment on an existing message. See the API. I do agree, however, that if Discord ever changed this--it could potentially result in trouble. Again, my original schema of message ID + filename falls short here, but I also imagine the newer attachment would still have a different UID/snowflake. If we got this snowflake and put it in the filename stored locally, I imagine we could still avoid collisions. If we did do a re-export in the scenario you described, the old attachment would still be sitting in the media directory wasting space, but the newly created HTML would reference the newer attachment and still work properly.

a completely random file may end up in the media directory for whatever reason with a filename that matches an exported media

Theoretically possible--but how likely is this? Why/how would I move a file named bagel(15466E41B04).png into the media directory? I imagine that the ID being in the filename should prevent this situation from happening.

parent directory is technically arbitrary and depends on the export file name, which is completely up to the user

I don't think I understand this. I do agree that the parent directory is arbitrary, but how is it dependent on the filename? Regardless, if we could switch to a UID/snowflake naming scheme instead of a parent directory scheme, this would be addressed.

Like I would much rather prefer the export to nuke the media directory if it already exists and write new files instead. This way it's much more predictable as you're not dealing with mutable state (if you disregard that someone can write in that directory at the same time). But the original issue is asking for the exact opposite of that.

I agree that the mutability/lack of complete control on the media directory is unfavorable from a purist standpoint, but I think that with a naming scheme of filename + attachment UID we would avoid most if not all problems.

I agree with the sentiment that there is a lot of potential for fragility here, but it may be possible to get this working without introducing additional fragility (unlike my original idea/PR which totally does).

Maybe it's possible to redesign this feature to make it more predictable, but I have no idea how. Until then, I don't think expanding on it really makes sense, unfortunately.

I think it is. As I've already alluded to, what we could do is use the attachment ID/snowflake instead of the parent directory of the attachment URL (e.g. bagel(<attachment ID>).png instead of bagel(<parent directory from URL>).png). I only chose the parent-directory scheme because it was very simple to implement.

I'd love to hear some more thoughts from you. There would be some more decisions to be made if we wanted to still implement this feature using the attachment ID to uniquely identify attachments locally, but I think the idea is sound.

Of course, if you think this feature isn't worth the trouble, I understand.

@Tyrrrz
Owner

Tyrrrz commented Oct 9, 2020

Fair points. I guess bagel-{attachment-id}.png makes sense. In that case, the path map here isn't necessary, since we can just make lookups against the file system:

private readonly Dictionary<string, string> _pathMap = new Dictionary<string, string>();

However, how are we going to accommodate media that is not attachments? For example, embedded images.

I guess one way is to parse the URL but it could be a bit fragile as well. Another option is to extend ResolveMediaUrlAsync with a hintFileName parameter which the caller can optionally set, and if it's an attachment it could form the file name based on its id and filename.
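A possible shape for that hint-parameter idea (a sketch only; the real ResolveMediaUrlAsync in the project may differ, and the injected download delegate is purely for illustration):

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

public class MediaDownloader
{
    private readonly string _workingDirPath;
    private readonly Func<string, string, Task> _download; // injected stand-in for the real HTTP download

    public MediaDownloader(string workingDirPath, Func<string, string, Task> download)
    {
        _workingDirPath = workingDirPath;
        _download = download;
    }

    // The optional hint lets the caller (e.g. an attachment, which knows its ID)
    // control the local file name; other media falls back to URL-derived naming.
    public async Task<string> ResolveMediaUrlAsync(string url, string hintFileName = null)
    {
        var fileName = hintFileName ?? Path.GetFileName(new Uri(url).AbsolutePath);
        var filePath = Path.Combine(_workingDirPath, fileName);

        Directory.CreateDirectory(_workingDirPath);

        // Skip the download if a previous export already produced this file
        if (!File.Exists(filePath))
            await _download(url, filePath);

        return filePath;
    }
}
```

The existence check against the file system replaces the in-memory path map mentioned above.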

Also, maybe it's possible to rework the media exporting functionality somehow so that it's not explicit (i.e. the writer calls ResolveMediaUrlAsync) but rather happens as part of writing the output (i.e. maybe somewhere inside MessageExporter it checks the output and, if it's a URL, rewrites it to something else). It might be a stupid idea though, just throwing it out there.

@andrewkolos
Contributor Author

However, how are we going to accommodate media that is not attachments? For example, embedded images.

That is something I failed to consider. All we have is the URL in this case, which doesn't tell us much, as the image at the URL can be changed without the URL changing.
The simplest fix/compromise is to have this new "caching" functionality not affect embeds. For example, we can just store all embeds in their own directory, and every time we do a new export, we nuke the directory if it already exists and just redownload everything as we always have. Besides--and correct me if I'm wrong--for most Discord servers, embeds are 1) not as frequent as attached media and more importantly 2) are usually pretty small in terms of file sizes (usually just preview images; I don't think they can contain video/audio). Thus I think it is acceptable to always redownload embeds--as the redownload/rewrite times of normal attachments most likely comprise the vast majority of the performance waste we are looking to address.

However, we could get more complicated if we still wanted to try to speed up the embed downloading process. Quite frankly, if you are happy with the above simple compromise and are short on time, I would consider skipping forward to the next quote. Anyway, I'm thinking that we can't avoid having to download every embed file from the server, but it is still possible to skip rewriting to the disk if we can verify that we already have the same exact embed (same source URL and same content) already downloaded. I imagine the writing to disk is the most expensive part of the process in terms of time, so if we can prevent these redundant writes, we can still get a nice speedup for re-exporting ops involving embeds with images. However, it is very possible that I'm wrong here (or at least less correct) because drive-related details like buffer caching, caching strategy (e.g. write-back), embed file size relative to cache/block size, and the hardware implementation of the disk itself really muddy things.

If we go ahead and assume writing embeds to disk is significantly slower than reading them, here's an example of how we can do this without significantly slowing down the common use case (i.e. one-time exports):

When writing embed media to disk, mark it as an embed somehow. This could be another naming scheme or just having the embeds sit in their own directory within the media folder--it doesn't matter. In addition, incorporate the source URL into the filenames (similarly to how we added the ID to filenames for normal attachments). When we start the exporting process, we scan the media directory for already-downloaded embeds and keep these filenames in memory (this should take less than a millisecond on an SSD, I would imagine). Whenever we reach an embed during the exporting process, we look up the already-existing files for the source URL. If we get a match, there is a pretty decent chance that the file is the same as the one we downloaded, but we still need to verify this--we can perform a byte-by-byte comparison of the two files. If the comparison comes back positive, we move on; if negative, we rewrite the file.

This read/compare op would be a performance hit. However, if the files came from the same URL, I imagine that there is a very high chance that these files are the same. Thus there is a very high chance that the comparison will come back positive and we can then avoid the write. P.S. we could short-circuit the file compare op by checking file size first to see if going ahead with a byte-by-byte comparison makes sense, but this is just another optimization detail that isn't crucial.
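The size short-circuit plus byte-by-byte comparison could look something like this (a hypothetical helper, not code from this PR):

```csharp
using System.IO;

public static class FileComparer
{
    // Returns true if both files have identical content.
    // The length check short-circuits the common "different file" case
    // without touching the file bodies at all.
    public static bool ContentsEqual(string pathA, string pathB)
    {
        if (new FileInfo(pathA).Length != new FileInfo(pathB).Length)
            return false;

        using var streamA = File.OpenRead(pathA);
        using var streamB = File.OpenRead(pathB);

        var bufferA = new byte[81920];
        var bufferB = new byte[81920];

        while (true)
        {
            var readA = ReadFull(streamA, bufferA);
            var readB = ReadFull(streamB, bufferB);
            if (readA != readB)
                return false;
            if (readA == 0)
                return true; // both streams exhausted with no mismatch

            for (var i = 0; i < readA; i++)
                if (bufferA[i] != bufferB[i])
                    return false;
        }
    }

    // Fills the buffer as far as possible; Stream.Read may return fewer
    // bytes than requested even before end of stream.
    private static int ReadFull(Stream stream, byte[] buffer)
    {
        var total = 0;
        while (total < buffer.Length)
        {
            var read = stream.Read(buffer, total, buffer.Length - total);
            if (read == 0)
                break;
            total += read;
        }
        return total;
    }
}
```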

Also, maybe it's possible to rework the media exporting functionality somehow so that it's not explicit (i.e. the writer calls ResolveMediaUrlAsync) but rather happens as part of writing the output (i.e. maybe somewhere inside MessageExporter it checks the output and, if it's a URL, rewrites it to something else). It might be a stupid idea though, just throwing it out there.

So instead of including a media downloader in ExportContext, writers just write some special syntax (or even just a URL), and after the output is completed, the MessageExporter revises the output and deals with media then? I am thinking that each output format renders media differently (e.g. HTML actually builds elements with the local file URL as the src, whereas JSON merely includes the local file URL as text). Thus, it makes sense for the context to simply provide the local file URL and then let the writer decide how it wants to render the media given where the media is now stored locally. This is how things work now, if I'm not mistaken. I feel like I might be missing your point, and I would appreciate some more detail.

Regardless, back to the main point. In summary, we could stick to my plan (or something similar) when it comes to regular attachments, because these have IDs. When it comes to embeds that only have URLs, we can simply throw our hands up and just redownload these. We would still enjoy a performance gain when it comes to the more common, likely larger standard attachments. If we really wanted to get complicated, there are hoops we could jump through to try to speed up embed downloading for re-exports, but I doubt it's worth the complexity it would introduce to the code.

Thanks for continuing the discussion.

@Tyrrrz
Owner

Tyrrrz commented Oct 12, 2020

...for most Discord servers, embeds are 1) not as frequent as attached media and more importantly 2) are usually pretty small in terms of file sizes (usually just preview images; I don't think they can contain video/audio).

  1. it really depends on the server and the channel
  2. these are always images, I believe, yes, however they don't need to be small (for example the most common case is an open graph image and those can be rather large too)

At the end of the day, it's not really the size of the image that matters, but the server that provides it, as it may be slow or unstable, making the download slow even for insignificant files.

I imagine the writing to disk is the most expensive part of the process in terms of time

I'm pretty sure networking is much slower than write operations, even on slower HDDs. Plus, we can't really load the file into memory before saving it to the file system, as the files may be rather large. So if we already had the file on hand, it's cheaper and easier to just write it anyway. Assuming I understood you correctly.

So instead of including a media downloader in ExportContext, writers just write some special syntax (or even just a URL), and after the output is completed, the MessageExporter revises the output and deals with media then? I am thinking that each output format renders media differently (e.g. HTML actually builds elements with the local file URL as the src, whereas JSON merely includes the local file URL as text). Thus, it makes sense for the context to simply provide the local file URL and then let the writer decide how it wants to render the media given where the media is now stored locally. This is how things work now, if I'm not mistaken. I feel like I might be missing your point, and I would appreciate some more detail.

Yeah, that was pretty much it. I guess there are indeed too many difficulties associated with this approach.

Regarding the main topic: we can take a hash of the (canonical?) URL and append it to the filename. So instead of the attachment ID, we would have some random gibberish attached at the end. It will look messy, but it will mean two things:

  • Single approach for all media types
  • Hash will remain the same for the same URL across sessions

@andrewkolos
Contributor Author

Do we want to depend on the URL (or a hash of it)? The content behind the URL could have been changed between exports (I believe we discussed this in the oneclick backup thread). This probably isn't likely, but it could happen. Thus, if we just use the URL, we might fail to update media because we think we already have it when what we have locally is an outdated version. This problem is the reason I went with using the attachment ID over the URL.

@Tyrrrz
Owner

Tyrrrz commented Oct 13, 2020

If it's a Discord URL, I believe it's immutable because it includes both the entity ID and its parent ID. With URLs from other sources (e.g. embed images), it's not guaranteed, but then it's on the user's conscience. :)
I think this would work as an option in settings, instead of a per-export or default setting.

@andrewkolos
Contributor Author

andrewkolos commented Oct 13, 2020

OK, so we switch to hashing the URL and having that uniquely identify a piece of media.

GUI:
Then, we have a toggle option in the settings (off by default). The toggle is enabled only if the media toggle is checked; if the media toggle is disabled, this one is also set to unchecked. Here is the UI text I quickly came up with, though I must admit that my UX writing skill has historically left a lot to be desired.
Label: "Reuse previously downloaded media"
Tooltip: "If the media folder already exists, reuse media inside it to skip downloads."

Maybe when it gets toggled, a warning modal pops up that says something like

The source URL is used to determine where already downloaded media came from. If a piece of media came from a non-Discord URL, it may have been edited/changed since it was downloaded. If so, the downloaded copy would be outdated. In the event that this happens, the outdated version will still be used.

CLI:
How would you like the CLI to work? Right now we have --media. We could

  1. add --mediaWithCache,
  2. add -m as short for --media and then add -mc to use the media folder as a cache,
  3. add --reuseMedia that will get used along with --media,
  4. add -m, and then add -M which is the same thing but skips already downloaded stuff, or
  5. use some variant of one of the above.

I'm partial towards 2 and 4, but I don't have a lot of experience with CLIs, so I am not sure what the norm is for modified versions of existing options. What do you think?

@Tyrrrz
Owner

Tyrrrz commented Oct 14, 2020

Since the option will be in settings, where it's separate from the export dialog, it won't need to be enabled/disabled based on the media switch. The suggested label and tooltip look great. The warning could be included in the tooltip, or otherwise just left out entirely, I would say.

For CLI I would just add another option called --reuse-media. This should be the simplest.

Also, the hash could be pretty long, and we may increase the risk of hitting filename limits with it. We may consider truncating it to X characters; we would probably need to determine the lowest length we can keep it at while avoiding collisions. Given that the attachment name is also included, I think the chances are miserably low, so maybe something like 5 chars should even be enough.

@andrewkolos
Contributor Author

Since the option will be in settings, where it's separate from the export dialog, it won't need to be enabled/disabled based on the media switch.

Oh yeah, I forgot that was in the export dialog and not the settings one.

Also, the hash could be pretty long, and we may increase the risk of hitting filename limits with it. We may consider truncating it to X characters; we would probably need to determine the lowest length we can keep it at while avoiding collisions. Given that the attachment name is also included, I think the chances are miserably low, so maybe something like 5 chars should even be enough.

Good thinking about the filename length limit. I'll probably do something like you suggested. Despite keeping the hash short, it is possible that we hit the filename limit anyway, but this could be solved with additional truncation.

@andrewkolos
Contributor Author

Rereading through this conversation, I don't think I ever asked--should this feature be on or off by default? Having it on by default might mean more users take advantage of it, but this would result in a change in behavior. I'm guessing keep it off--but wanted your opinion regardless.

@Tyrrrz
Owner

Tyrrrz commented Oct 15, 2020

Rereading through this conversation, I don't think I ever asked--should this feature be on or off by default? Having it on by default might mean more users take advantage of it, but this would result in a change in behavior. I'm guessing keep it off--but wanted your opinion regardless.

I would say off by default. For a user, it would result in fewer surprises, IMO.

@andrewkolos
Contributor Author

Observation: If you export a channel multiple times with reuse=off and then turn reusing on for future exports, the program will use the original files. For example, it will choose bagel-(49593).png over bagel-(49593) (1).png even though the latter is the most up-to-date version of that image.

@Tyrrrz
Owner

Tyrrrz commented Oct 19, 2020

Observation: If you export a channel multiple times with reuse=off and then turn reusing on for future exports, the program will use the original files. For example, it will choose bagel-(49593).png over bagel-(49593) (1).png even though the latter is the most up-to-date version of that image.

Since we're using hashes and should never have duplicates, maybe we can remove the (1) stuff? It was previously used because we expected to have files with identical name, but with hashes we can guarantee that's not going to happen. Overwriting existing files is fine in this case I suppose.
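With unique hash-based names, overwriting is all that's needed; a sketch (hypothetical helper, not code from this PR):

```csharp
using System.IO;

public static class MediaWriter
{
    // With hash-based file names there is at most one valid file per source URL,
    // so the " (1)" dedup suffix can go away and we can simply overwrite:
    // FileMode.Create truncates the file if it already exists.
    public static void WriteAllBytesOverwriting(string path, byte[] content)
    {
        using var stream = new FileStream(path, FileMode.Create, FileAccess.Write);
        stream.Write(content, 0, content.Length);
    }
}
```

A second export then always leaves the latest version of each file, which also resolves the observation above about stale "(1)" copies.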

Comment on lines 63 to 74
private static string HashUrl(string url)
{
    // Knuth hash
    UInt64 hashedValue = 3074457345618258791ul;
    for (int i = 0; i < url.Length; i++)
    {
        hashedValue += url[i];
        hashedValue *= 3074457345618258799ul;
    }
    return hashedValue.ToString();
}

Owner

Any reason you chose this hash implementation? Just curious.

Contributor Author

This is a great question for which I do not have a good answer. I just Googled something like "fast string hashing C#" and found this on some SO post. I completely forgot that String.GetHashCode is a thing in C#. I imagine that would suffice.

Contributor Author

Nevermind--I just remembered that String.GetHashCode is meant for hash tables and isn't necessarily ideal for avoiding collisions. Then again, the strings would be pretty long and there wouldn't be that many of them.

Contributor Author

But, then again, I also just remembered that we are also incorporating the original filenames into the filename we save, so even if a collision happened--it probably wouldn't matter.

Contributor Author

Honestly, I am not super sure what to do. The chance of the media filename and the URL hash colliding seems absurdly low, but it's possible. My instinct is to play it safe with a cryptographic hash like SHA-256. We are hashing hundreds of thousands of strings, so I highly doubt it would result in any real performance impact.

Owner

String.GetHashCode also returns different values for the same string across sessions/runtimes, so it's not useful to us. There are other hashes like md5/sha1 which could also be useful. Pretty sure they are CPU-optimized anyway so I doubt performance is a concern. I guess the important thing here is making sure we can actually cut off everything but the first 5 characters and still retain good entropy.

Contributor Author

String.GetHashCode also returns different values for the same string across sessions/runtimes, so it's not useful to us.

Oh, that is good to know!

From a couple of different SO posts like this one, truncating MD5 (or SHA-1) seems good enough.

I didn't really have a good reason to use the prior hash method other than it being a placeholder method that I found on the internet first. MD5 is at least pretty uniformly distributed in terms of output, so it seems suitable for the purpose of hashing media URLs.
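A truncated-MD5 URL hash along those lines could look like this (a sketch under the 5-character suggestion from earlier; the helper name mirrors the snippet above but this is not the PR's final code):

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class UrlHashing
{
    // Hash the URL with MD5 and keep only a short hex prefix.
    // A collision would additionally require two files with the same
    // original filename, so a short prefix is acceptable here.
    public static string HashUrl(string url, int length = 5)
    {
        using var md5 = MD5.Create();
        var hash = md5.ComputeHash(Encoding.UTF8.GetBytes(url));
        var hex = BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
        return hex.Substring(0, length);
    }
}
```

Unlike String.GetHashCode, MD5 is stable across runs, so the same URL maps to the same filename suffix in every export session.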
@andrewkolos
Contributor Author

Ready for review again.

@Tyrrrz
Owner

Tyrrrz commented Oct 23, 2020

Looks good, thanks!

Successfully merging this pull request may close these issues.

Skip already downloaded files
2 participants