-
-
Notifications
You must be signed in to change notification settings - Fork 337
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve sample detection using filesize and extension #2408
Improve sample detection using filesize and extension #2408
Conversation
0ece65e
to
2efe3f8
Compare
RE_IP4 = re.compile(r"inet\s+(addr:\s*)?(\d+\.\d+\.\d+\.\d+)") | ||
RE_IP6 = re.compile(r"inet6\s+(addr:\s*)?([0-9a-f:]+)", re.I) | ||
|
||
# Check if strings are defined for AM and PM | ||
HAVE_AMPM = bool(time.strftime("%p")) | ||
|
||
SAMPLE_MAX_SIZE = 100 * MEBI |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is "100M" preferred andd put it in constants?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the size of the file is not a very good indicator if its a sample. crudely you could say if theres multiple media files and one is significantly smaller.. it may be a a sample. ideally one should look at its runtime (under 90 seconds). this is what apps like sonarr/radarr and others do. (uses ffprobe)
Well test_sorting.py is kinda complicated, how to resolve?
|
@@ -68,12 +68,17 @@ | |||
RE_UNITS = re.compile(r"(\d+\.*\d*)\s*([KMGTP]?)", re.I) | |||
RE_VERSION = re.compile(r"(\d+)\.(\d+)\.(\d+)([a-zA-Z]*)(\d*)") | |||
RE_SAMPLE = re.compile(r"((^|[\W_])(sample|proof))", re.I) # something-sample or something-proof | |||
RE_SAMPLE_EXTENSIONS = re.compile( | |||
r"^(\.mkv|\.avi|\.divx|\.xvid|\.mov|\.wmv|\.mp4|\.mpg|\.mpeg|\.ogg|\.jpg|\.jpeg|\.png|\.par2)$", re.I |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
list of media files (does not include metadata files like nfo/png/jpg or things like subtitles):
https://github.com/Radarr/Radarr/blob/develop/src/NzbDrone.Core/MediaFiles/MediaFileExtensions.cs
can see this list can get pretty long..
using the name is fine until its not. having a guard to not delete the only media file, or delete if its under 100 MB is probably good enough of an improvement. realstically, samples are only generally for media files not other stuff (unless badly named). so having it only do it for mediafiles like mkv/avi/etc is fine. just needs to be somewhat exhaustive or just dont limit it. now 'proof' is the other part of this, which you will see an image like jpg/jpng/etc of a scan of a disk or something. but honestly i dont think there is any reason why those should be nuked. |
Thanks for the review, I'll take a look at how sonarr/radarr detect samples and see if we could use ffprobe-python, biggest challenge is probably how you bundle that for all the platforms. postproc would need changes because it wouldn't be able to skip since it would need a valid file to send to ffprobe. I do agree that is_samples should only affect media files. |
ideally we would just do it like we do other 3rd party utilities (unrar/par/etc) for specific binaries (mac/win) but for other platforms we expect the package manager to install it. |
I agree with @thezoggy's review. The list of extensions could probably be endless and the filesize is also a bit arbitrary.. |
I agree I don't think it's worth the effort or bloat/maintenance from adding external tool dependencies which is the only way it could be achieved accurately. It's probably quite a rare problem with such wide use of automation. Perhaps consider removing the feature? |
Fixes #2083 (mostly)
I've taken inspiration from a script I've used in NZBGet for many years DeleteSamples.py
Before the current filename check I first check against a size cut off (100 MB - reduced from inspiration because files are getting smaller) and limit it to known sample extensions - some of these were based on the existing tests, I probably wouldn't have bothered with if not for them.
I removed
test_is_sample_known_false_positives
and moved them into test_is_sample but I think the tests should be simplified to testing the bounds of each condition (size, extension and name) - which I can do but I'd rather know if this is likely to be accepted.I think there are probably still possibilities for false positive, but this will reduce the chance of them.
Perhaps I should include the suggested check if NameOfTheDownload contains sample|proof always return False?