Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce duplicate calculations by caching metrics to disk (replacement for load_from_csv) #342

Open
jucor opened this issue Jul 26, 2023 · 8 comments
Labels
Milestone

Comments

@jucor
Copy link

jucor commented Jul 26, 2023

Hi!

Thanks for a great pacakage.

I notice that StatsManager.load_from_csv now has a deprectation warning. What is now the recommended way to avoid re-running the costly (20 minutes on my 1h20 x265 video) scene detection, please?

I have a stats file created with the CLI:

scenedetect --input myfile.mkv --stats adaptive-stats.csv detect-adaptive

and I would like to load that adaptative-stats.csv file into a SceneManager to do the actual scene detection. load_from_csv sounded promising :)

Thanks for any help, and a great tool!

Environment:

[PySceneDetect] PySceneDetect 0.6.2

System Info
------------------------------------------------------------
OS           macOS-13.4.1-arm64-arm-64bit
Python       3.11.4

Packages
------------------------------------------------------------
av           Not Installed
click        8.1.6
cv2          4.8.0
moviepy      Not Installed
numpy        1.25.1
platformdirs 3.9.1
scenedetect  0.6.2
tqdm         4.65.0

Tools
------------------------------------------------------------
ffmpeg       6.0
mkvmerge     v77.0 ('Elemental') 64-bit
@jucor
Copy link
Author

jucor commented Jul 26, 2023

I am seeing in the deprecation commit the notes:

Equivalent functionality with much better performance could be obtained for most use cases by seeking instead.

Is this seeking alternative something that I can already use, please (and how?), or am I better (in the current moment) to use load_from_csv ?
Thanks for any pointer!

@jucor
Copy link
Author

jucor commented Jul 26, 2023

Mm, reading AdaptiveDetector.process_frame I now understand that the stats CSV is not used to speed things up anymore anyway.
But the commit introducing online calculation (smart!) mentions the ability to skip/seek, as did the deprecation commit mentioned above.

How could I use that new skip/seek trick, based on saved CSV of the stats, please?

@jucor
Copy link
Author

jucor commented Jul 26, 2023

(and yes, I think I'm absolutely spending an hour diving into the code just to avoid waiting 20 minutes again to parse my video 😅 )
Yak Shaving

@jucor jucor changed the title Replacement for load_from_csv? Save statistics and reload them to avoid repeating the processing time? [Was: Replacement for load_from_csv?] Jul 26, 2023
@Breakthrough
Copy link
Owner

Breakthrough commented Jul 26, 2023

@jucor unfortunately I don't have a replacement planned for this at this time. thanks for bringing this up, I'll repurpose this issue to track adding replacement functionality.

The main reason for removing this feature was that it just wasn't well thought out before I introduced it - it became very difficult to scale the CSV format as a cache for detector calculations as the number of dimensions/parameters per detector rose. This led to a bunch of weird bugs and edge cases that for a long time went unnoticed, so felt it was time to rework how this caching happens.

The latest version made a lot of performance improvements to scanning - are you disabling downscaling in your case, or is most of the time processing the video spent just decoding each frame?

I do apologize for the current status quo regarding loading/caching frame metrics. Removing it was necessary to allow customizing the calculation of each frame's score from the component weights. In the long term, I want to look into having each detector itself be responsible for caching calculations, and reserve the statsfile purely for statistical/human readable analysis. That was the intended purpose, but it kind of unravelled when attempting to re-use the same format as a cache for detection calculations.

@Breakthrough
Copy link
Owner

Breakthrough commented Jul 26, 2023

Regarding seeking/skipping, I don't think that applies for your particular use case since you want to re-process the entire video. Instead it might be worth just hacking a ContentDetector to load frame scores from a specified CSV file instead of calculating them. I do apologize for this making things slower for you in the interim - if you end up making any progress on that, feel free to submit a PR for it, we did something similar for using the output of list-scenes recently by creating a pseudo-detector (SceneLoader).

You could do this pretty easily by creating a new detector that derives from ContentDetector, and override the _calculate_frame_score method to return the score for that frame loaded from disk instead of re-calculating it. This should also resolve some of the shortcomings of the existing solution (e.g. can guarantee a user doesn't override the parameters if we load the scores from disk), so I'm more than open to including this in the next release as an official workaround.

In the long run, I think you raise some great points. I want to look into having ContentDetector allow specifying a file to use as a cache, just not sure what the correct format would be. The statsfile would effectively be derived from the cache, instead of being the cache.

Thanks for filing this!

@Breakthrough Breakthrough changed the title Save statistics and reload them to avoid repeating the processing time? [Was: Replacement for load_from_csv?] Reduce duplicate calculations by caching metrics to disk (replacement for load_from_csv) Jul 26, 2023
@Breakthrough Breakthrough added this to the v0.6.3 milestone Jul 26, 2023
@jucor
Copy link
Author

jucor commented Jul 27, 2023

Please don't apologize! It's a terrific tool :) :)
Most of the time is spent processing the frames, in this example. But I've just bit the bullet and waited 20 minutes, and it was good enough for what I needed to do, so no problem on my end any more!

@Breakthrough
Copy link
Owner

Breakthrough commented Jan 29, 2024

Taking a look at this with a fresh set of eyes it's probably unwise to pursue re-adding this until a well thought out API is in place. The issue is how the current interface for a SceneDetector is defined. If this continues to become a burden for some workflows it can be re-prioritized. The v0.7 API definitely needs to take this into account, so that's where I would ideally like to target this for.

Edit: thanks for your the feedback as well on this @jucor, much appreciated.

@Breakthrough Breakthrough modified the milestones: v0.6.3, v0.7 Jan 29, 2024
@jucor
Copy link
Author

jucor commented Jan 30, 2024

Pleasure! Thanks for the hard work and for keeping thinking about it. Makes sense to prioritize as you did :)

@Breakthrough Breakthrough modified the milestones: v0.6.4, v0.7 Apr 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants