Stop redundantly downloading media when re-exporting #395
Conversation
I think ultimately we won't be able to avoid collisions. For example:
This is a problem because the media files are detached from the actual chat log. I was thinking of somehow putting the media in the same file as the export (using formats like MHTML, PDF, or encoding via base64), but that seems to lead to more complexity and issues than it's worth. TBH, downloading media is already pretty fragile because it can skip files if the server responds with an error. I feel like making it more fragile is just going to increase the number of issues where some random thing happens simply because this whole process is non-deterministic. For example, if the user tries to use this with file skipping, they might end up with a chat log that has incorrect images, and it will be really hard to determine why. I was really against adding this functionality in the first place because of this, but there was a large demand for it. Maybe it's possible to redesign this feature to make it more predictable, but I have no idea how. Until then, I don't think expanding on it really makes sense, unfortunately.
I would much rather prefer that the export nuke the media directory if it already exists and write new files instead. This way it's much more predictable, as you're not dealing with mutable state (if you disregard that someone can write to that directory at the same time). But the original issue is asking for the exact opposite of that.
I was wondering about this and didn't know that this was already true. Do you have a reference? I am curious. Edit: found it. Anyway, in this case, the message ID + filename isn't enough to uniquely identify an attachment. Nonetheless, each media file still has a unique ID, but getting it would require more changes/work.
I don't think it is currently possible to change an attachment on an existing message. See API. I do agree, however, that if Discord ever changed this--it could potentially result in trouble. Again, my original schema of message ID + filename falls short here, but I also imagine the newer attachment would still have a different UID/snowflake. If we got this snowflake and put it in the filename stored locally, I imagine we could still avoid collisions. If we did do a re-export in the scenario you described, the old attachment would still be sitting in the media directory wasting space, but the newly created HTML would reference the newer attachment and still work properly.
Theoretically possible--but how likely is this? Why/how would I move a file named
I don't think I understand this. I do agree that the parent directory is arbitrary, but how is it dependent on the filename? Regardless, if we could switch to a UID/snowflake naming scheme instead of a parent directory scheme, this would be addressed.
I agree that the mutability/lack of complete control on the media directory is unfavorable from a purist standpoint, but I think that with a naming scheme of filename + attachment UID we would avoid most if not all problems. I agree with the sentiment that there is a lot of potential for fragility here, but it may be possible to get this working without introducing additional fragility (unlike my original idea/PR which totally does).
I think it is. As I've already alluded to, what we could do is use the attachment ID/snowflake instead of the parent directory of the attachment URL. I'd love to hear some more thoughts from you. There would be some more decisions to be made if we wanted to implement this feature using the attachment ID to uniquely identify attachments locally, but I think the idea is sound. Of course, if you think this feature isn't worth the trouble, I understand.
Fair points. I guess
However, how are we going to accommodate media that is not attachments? For example, embedded images. I guess one way is to parse the URL, but that could be a bit fragile as well. Another option is to extend

Also, maybe it's possible to rework the media exporting functionality somehow so that it's not explicit (i.e. the writer calls
That is something I failed to consider. All we have is the URL in this case, which doesn't tell us much, as the image at the URL can be changed without the URL changing. However, we could get more complicated if we still wanted to try to speed up the embed downloading process. Quite frankly, if you are happy with the above simple compromise and are short on time, I would consider skipping forward to the next quote.

Anyway, I'm thinking that we can't avoid having to download every embed file from the server, but it is still possible to skip rewriting to the disk if we can verify that we already have the same exact embed (same source URL and same content) already downloaded. I imagine the writing to disk is the most expensive part of the process in terms of time, so if we can prevent these redundant writes, we can still get a nice speedup for re-exporting operations involving embeds with images. However, it is very possible that I'm wrong here (or at least less correct), because drive-related details like buffer caching, caching strategy (e.g. write-back), embed file size relative to cache/block size, and the hardware implementation of the disk itself really muddy things. Let's go ahead and assume writing embeds to disk is significantly slower than reading them.

Here's an example of how we can do this without significantly slowing down the common use case (i.e. one-time exports): when writing embed media to disk, mark it as an embed somehow. This could be another naming scheme, or just having the embeds sit in their own directory within the media folder--it doesn't matter. In addition, incorporate the source URL into the filenames (similarly to how we added the ID to filenames for normal attachments). When we start the exporting process, we scan the media directory for already-downloaded embeds and keep these filenames in memory (this should take less than a millisecond on an SSD, I would imagine).
Whenever we reach an embed during the exporting process, we look up the already existing files for the source URL. If we get a match, there is a pretty decent chance that the file is the same as the one we downloaded, but we still need to verify this--we can perform a byte-by-byte comparison of the two files. If the comparison comes back positive, we move on; if negative, we rewrite the file. This read/compare op would be a performance hit. However, if the files came from the same URL, I imagine there is a very high chance that these files are the same. Thus, there is a very high chance that the comparison will come back positive, and we can then avoid the write. P.S. we could short-circuit the file compare op by checking file size first to see if going ahead with a byte-by-byte comparison makes sense, but this is just another optimization detail that isn't crucial.
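To make the compare-then-skip idea concrete, here's a minimal sketch (the class and method names are mine, not from the actual codebase) that short-circuits on file size before falling back to a byte-by-byte comparison:

```csharp
using System;
using System.IO;

public static class EmbedDedup
{
    // Returns true when both files exist and contain exactly the same bytes.
    // Checking Length first avoids reading either file in the common
    // "different size" case.
    public static bool FilesAreIdentical(string pathA, string pathB)
    {
        var a = new FileInfo(pathA);
        var b = new FileInfo(pathB);

        if (!a.Exists || !b.Exists || a.Length != b.Length)
            return false;

        using var streamA = a.OpenRead();
        using var streamB = b.OpenRead();

        int byteA;
        while ((byteA = streamA.ReadByte()) != -1)
        {
            if (byteA != streamB.ReadByte())
                return false;
        }

        return true;
    }
}
```

If this returns true for the freshly downloaded embed and the file already on disk, the write can be skipped; otherwise the file is rewritten. (Buffered block reads would be faster than `ReadByte` in practice; this is just the idea.)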
So instead of including a media downloader in

Regardless, back to the main point. In summary, we could stick to my plan (or something similar) when it comes to regular attachments, because these have IDs. When it comes to embeds that only have URLs, we can simply throw our hands up and just redownload them. We would still enjoy a performance gain when it comes to the more common, likely larger standard attachments. If we really wanted to get complicated, there are hoops we could jump through to

Thanks for continuing the discussion.
At the end of the day, it's not really the size of the image that matters, but the server that provides it, as it may be slow or unstable, making the download slow even for insignificant files.
I'm pretty sure networking is much slower than write operations, even on slower HDDs. Plus, we can't really load the file into memory before saving it to the file system, as the files may be rather large. So if we already had the file on hand, it would be cheaper and easier to just write it anyway. Assuming I understood you correctly.
Yeah, that was pretty much it. I guess there are indeed too many difficulties associated with this approach. Regarding the main topic: we can take a hash of the (canonical?) URL and append it to the filename. So instead of the attachment ID, we would have some random gibberish attached at the end. It will look messy, but it will mean two things:
Do we want to depend on the URL (or a hash of it)? The content behind the URL could have been changed between exports (I believe we discussed this on the oneclick backup thread). This probably isn't likely, but it could happen. Thus, if we just use the URL, we might fail to update media because we think we already have it, when what we have locally is an outdated version. This problem is the reason I went with using the attachment ID over the URL.
If it's a Discord URL, I believe it's immutable because it includes both the entity ID and its parent ID. With URLs from other sources (e.g. embed images), it's not guaranteed, but then it's on the user's conscience. :)
OK, so we switch to hashing the URL and having that uniquely identify a piece of media. GUI: Maybe when it gets toggled, a warning modal pops up that says something like
CLI:
I'm partial towards 2 and 4, but I don't have a lotta experience with CLIs so I am not sure what the norm is for modified versions of existing options. What do you think?
Since the option will be in settings, where it's separate from the export dialog, it won't need to be enabled/disabled based on the media switch. The suggested label and tooltip look great. The warning could be included in the tooltip, or otherwise just not included at all, I would say.

For the CLI, I would just add another option called

Also, the hash could be pretty long, and we may increase the risk of hitting filename limits with it. We may consider truncating it to X characters; we would probably need to determine the lowest we can keep it at to avoid collisions. Given that the attachment name is also included, I think the chances are miserably low, so maybe something like 5 chars should even be enough.
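To illustrate, here's a rough sketch of what that scheme could look like. Everything here is my own illustrative choice: the `[hash]` bracket format, the helper names, the 255-character cap, and the use of MD5 (no hash algorithm has been settled on at this point in the thread; any stable hash would do):

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

public static class MediaFileNaming
{
    private const int MaxFileNameLength = 255; // typical filesystem limit

    // Stable 5-hex-char URL hash; MD5 is shown purely for concreteness.
    public static string HashUrl(string url)
    {
        using var md5 = MD5.Create();
        var digest = md5.ComputeHash(Encoding.UTF8.GetBytes(url));
        var hex = BitConverter.ToString(digest).Replace("-", "").ToLowerInvariant();
        return hex.Substring(0, 5);
    }

    // e.g. "bagel.png" -> "bagel [d41d8].png"
    public static string GetLocalFileName(string url, string originalFileName)
    {
        var name = Path.GetFileNameWithoutExtension(originalFileName);
        var ext = Path.GetExtension(originalFileName);
        var hash = HashUrl(url);

        // Additional truncation of the name part if the combined
        // filename would still exceed the limit
        var maxNameLength = MaxFileNameLength - (hash.Length + ext.Length + 3);
        if (name.Length > maxNameLength)
            name = name.Substring(0, maxNameLength);

        return $"{name} [{hash}]{ext}";
    }
}
```

The second truncation step covers the case mentioned below where a short hash still isn't enough to stay under the filename limit.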
Oh yeah, I forgot that was in the export dialog and not the settings one.
Good thinking about the filename length limit. I'll probably do something like you suggested. Despite keeping the hash short, it is possible that we hit the filename limit anyway, but this could be solved with additional truncation.
Rereading through this conversation, I don't think I ever asked: should this feature be on or off by default? Having it on by default might mean more users take advantage of it, but it would result in a change in behavior. I'm guessing keep it off--but I wanted your opinion regardless.
I would say off by default. For a user, it would result in fewer surprises, IMO.
Observation: If you export a channel multiple times with reuse=off and then turn reusing on for future exports, the program will use the original files. For example, it will choose
Since we're using hashes and should never have duplicates, maybe we can remove the (1) stuff? It was previously used because we expected to have files with identical names, but with hashes we can guarantee that's not going to happen. Overwriting existing files is fine in this case, I suppose.
```csharp
private static string HashUrl(string url)
{
    // Knuth hash
    UInt64 hashedValue = 3074457345618258791ul;
    for (int i = 0; i < url.Length; i++)
    {
        hashedValue += url[i];
        hashedValue *= 3074457345618258799ul;
    }
    return hashedValue.ToString();
}
```
Any reason you chose this hash implementation? Just curious.
This is a great question for which I do not have a good answer. I just Googled something like "fast string hashing C#" and found this on some SO post. I completely forgot that String.GetHashCode is a thing in C#. I imagine that would suffice.
Nevermind--I just remembered that String.GetHashCode is meant for hashtables and isn't necessarily ideal for avoiding collisions. Then again, the strings would be pretty long and there wouldn't be that many of them.
But, then again, I also just remembered that we are also incorporating the original filenames into the filename we save, so even if a collision happened--it probably wouldn't matter.
Honestly, I am not super sure what to do. The chance of both the media filename and the URL hash colliding seems absurdly low, but it's possible. My instinct is to play it safe with a cryptographic hash like SHA-256. We are hashing hundreds of thousands of strings at most, so I highly doubt it would result in any real performance impact.
String.GetHashCode also returns different values for the same string across sessions/runtimes, so it's not useful to us. There are other hashes like MD5/SHA-1 which could also be useful. Pretty sure they are CPU-optimized anyway, so I doubt performance is a concern. I guess the important thing here is making sure we can actually cut off everything but the first 5 characters and still retain good entropy.
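As a rough sanity check on that, assuming distinct URLs hash uniformly and we keep a 5-character hex prefix:

```
5 hex chars  ->  16^5 = 1,048,576 possible values

Expected number of colliding URL pairs (birthday bound):
    n(n-1)/2 / 16^5

n = 1,000 URLs   ->  499,500 / 1,048,576    ~ 0.5
n = 10,000 URLs  ->  49,995,000 / 1,048,576 ~ 48
```

So for large exports, two URLs sharing a 5-character prefix is actually likely; it only becomes a real problem when those two URLs also share the exact original filename, which is what keeps the practical risk low, as noted earlier in the thread.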
String.GetHashCode also returns different values for the same string across sessions/runtimes, so it's not useful to us.
Oh, that is good to know!
From a couple of different SO posts like this one, truncating MD5 (or SHA-1) seems good enough.
prevent accidentally forgetting to set it
I didn't really have a good reason to use the prior hash method other than it being a placeholder I found on the internet first. MD5 is at least pretty uniformly distributed in terms of output, so it seems suitable for the purpose of hashing media URLs.
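For reference, a minimal MD5-backed replacement for the earlier HashUrl might look like the sketch below (the wrapper class name is mine; whether and where to truncate the hex string is a separate call-site decision):

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class UrlHashing
{
    // Unlike String.GetHashCode, MD5 output is stable across sessions
    // and runtimes, so it's safe to persist in filenames.
    public static string HashUrl(string url)
    {
        using var md5 = MD5.Create();
        var digest = md5.ComputeHash(Encoding.UTF8.GetBytes(url));
        return BitConverter.ToString(digest).Replace("-", "").ToLowerInvariant();
    }
}
```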
Ready for review again.
Looks good, thanks!
Closes #382
Heya, this PR is mostly meant to facilitate discussion about #382. Maybe I should keep the discussion there instead, but I don't know. Anyway, this PR demonstrates a minimal (though arguably not robust) way to avoid redownloading media when re-performing an export.
Implementation. When parsing a URL, `MediaUrlParser` also extracts the parent directory of the file being downloaded. `MediaDownloader` adds this directory in parentheses to the filename when saving the files to disk. Example: if I embed `bagel.png` in a message and then export the channel, it might now appear locally as `bagel(5QO5WVLYSEI).png`. In this case, `5QO5WVLYSEI` is the message ID. The idea is that the filename + message ID should uniquely identify any media, so if someone else also uploads a `bagel.png` after that message, that file would still get downloaded as well; it would just have a different message ID in the filename.

Concerns. `bagel.png` is nicer to read than `bagel(5QO5WVLYSEI).png` (or some variant of it). In my personal experience, most uploaded media isn't named, so I imagine the filenames of most downloaded media were already made mostly of gibberish beforehand--so this change probably wouldn't make the usual situation any worse.

I am interested in your thoughts. Let me know if you would like some clarification.