Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve sample detection using filesize and extension #2408

Conversation

mnightingale
Copy link
Contributor

Fixes #2083 (mostly)

I've taken inspiration from a script I've used in NZBGet for many years DeleteSamples.py

Before the current filename check I first check against a size cut off (100 MB - reduced from inspiration because files are getting smaller) and limit it to known sample extensions - some of these were based on the existing tests, I probably wouldn't have bothered with if not for them.

I removed test_is_sample_known_false_positives and moved them into test_is_sample but I think the tests should be simplified to testing the bounds of each condition (size, extension and name) - which I can do but I'd rather know if this is likely to be accepted.

I think there are probably still possibilities for false positive, but this will reduce the chance of them.

Perhaps I should include the suggested check if NameOfTheDownload contains sample|proof always return False?

@mnightingale mnightingale marked this pull request as draft January 9, 2023 23:14
@mnightingale mnightingale force-pushed the feature/improve_sample_detection branch from 0ece65e to 2efe3f8 Compare January 9, 2023 23:17
RE_IP4 = re.compile(r"inet\s+(addr:\s*)?(\d+\.\d+\.\d+\.\d+)")
RE_IP6 = re.compile(r"inet6\s+(addr:\s*)?([0-9a-f:]+)", re.I)

# Check if strings are defined for AM and PM
HAVE_AMPM = bool(time.strftime("%p"))

SAMPLE_MAX_SIZE = 100 * MEBI
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is "100M" preferred andd put it in constants?

Copy link
Contributor

@thezoggy thezoggy Jan 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the size of the file is not a very good indicator if its a sample. crudely you could say if theres multiple media files and one is significantly smaller.. it may be a a sample. ideally one should look at its runtime (under 90 seconds). this is what apps like sonarr/radarr and others do. (uses ffprobe)

@mnightingale
Copy link
Contributor Author

Well test_sorting.py is kinda complicated, how to resolve?

  • Add a filesize for every file - before they just test against name
  • Give is_sample a really large default size or None so it always return False
  • Something else?

@mnightingale mnightingale marked this pull request as ready for review January 9, 2023 23:32
@@ -68,12 +68,17 @@
RE_UNITS = re.compile(r"(\d+\.*\d*)\s*([KMGTP]?)", re.I)
RE_VERSION = re.compile(r"(\d+)\.(\d+)\.(\d+)([a-zA-Z]*)(\d*)")
RE_SAMPLE = re.compile(r"((^|[\W_])(sample|proof))", re.I) # something-sample or something-proof
RE_SAMPLE_EXTENSIONS = re.compile(
r"^(\.mkv|\.avi|\.divx|\.xvid|\.mov|\.wmv|\.mp4|\.mpg|\.mpeg|\.ogg|\.jpg|\.jpeg|\.png|\.par2)$", re.I
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

list of media files (does not include metadata files like nfo/png/jpg or things like subtitles):
https://github.com/Radarr/Radarr/blob/develop/src/NzbDrone.Core/MediaFiles/MediaFileExtensions.cs

subtitles:
https://github.com/Radarr/Radarr/blob/develop/src/NzbDrone.Core/Extras/Subtitles/SubtitleFileExtensions.cs

can see this list can get pretty long..

@thezoggy
Copy link
Contributor

thezoggy commented Jan 10, 2023

using the name is fine until its not. having a guard to not delete the only media file, or delete if its under 100 MB is probably good enough of an improvement. realstically, samples are only generally for media files not other stuff (unless badly named). so having it only do it for mediafiles like mkv/avi/etc is fine. just needs to be somewhat exhaustive or just dont limit it.

now 'proof' is the other part of this, which you will see an image like jpg/jpng/etc of a scan of a disk or something. but honestly i dont think there is any reason why those should be nuked.

@mnightingale
Copy link
Contributor Author

Thanks for the review, I'll take a look at how sonarr/radarr detect samples and see if we could use ffprobe-python, biggest challenge is probably how you bundle that for all the platforms.

postproc would need changes because it wouldn't be able to skip since it would need a valid file to send to ffprobe.

I do agree that is_samples should only affect media files.

@thezoggy
Copy link
Contributor

Thanks for the review, I'll take a look at how sonarr/radarr detect samples and see if we could use ffprobe-python, biggest challenge is probably how you bundle that for all the platforms.

postproc would need changes because it wouldn't be able to skip since it would need a valid file to send to ffprobe.

I do agree that is_samples should only affect media files.

ideally we would just do it like we do other 3rd party utilities (unrar/par/etc) for specific binaries (mac/win) but for other platforms we expect the package manager to install it.

@Safihre
Copy link
Member

Safihre commented Jan 10, 2023

I agree with @thezoggy's review. The list of extensions could probably be endless and the filesize is also a bit arbitrary..
We sort of given up on making really good Sample Detection, because that would require external tools which seems too much for a small option like this. Especially in the age of Sonarr/Radarr taking care of all these things.

@mnightingale
Copy link
Contributor Author

I agree I don't think it's worth the effort or bloat/maintenance from adding external tool dependencies which is the only way it could be achieved accurately.

It's probably quite a rare problem with such wide use of automation.

Perhaps consider removing the feature?

@mnightingale mnightingale deleted the feature/improve_sample_detection branch January 10, 2023 10:03
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 19, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Shows/Movies with sample or proof in the name get ignored
3 participants