This issue was moved to a discussion.

Please add some parameters for standardizing/beautifying subtitle layout #68

Closed
GOvEy1nw opened this issue Mar 22, 2024 · 3 comments

Comments

@GOvEy1nw

Hey, I'm a Windows user, and I'm really grateful for Subgen, as it's the simplest way to get Whisper running with Bazarr on Windows without having to use Docker etc.

However, one thing I've noticed is that the subtitles aren't formatted the best, due to how Faster-Whisper operates. I've found that the standalone Faster Whisper (https://github.com/Purfview/whisper-standalone-win) has a great optional argument called --standard, which does the following:

--standard: Quick hardcoded preset to split lines in standard way. 42 chars per 2 lines with max_comma_cent=70 and --sentence are activated automatically.

--sentence: Enables splitting lines to sentences for srt and vtt subs. Every sentence starts in a new segment. By default meant to output a whole sentence per line for better translations, but not limited to that; read about the '--max_...' parameters.

This gives the subtitles a much more standardized look, one that is common across streaming services such as Netflix, BBC, etc.

Is it possible to implement these into SubGen, please?

@McCloudS
Owner

The standalone version doesn’t appear to have any source code so I can’t decipher what’s happening. We use stable-ts, but there are different ways to split the dialogue. See https://github.com/jianfch/stable-ts?tab=readme-ov-file#regrouping-words. Open to any suggestions.

@McCloudS
Owner

McCloudS commented Mar 22, 2024

I made a separate branch if you want to toy with the idea: https://github.com/McCloudS/subgen/blob/Custom-Params/subgen.py

It reads custom_regroup = os.getenv('CUSTOM_REGROUP', ''), where the value is the regroup string described in the link above. The default run on the model is cm_sp=,* /，_sg=.5_mg=.3+3_sp=.* /。/?/？
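As a rough sketch of how that variable could be consumed (not the branch's actual code): an empty value falls back to ``True``, which the pasted instructions describe as selecting the default algorithm 'da'. The helper name `resolve_regroup` is made up for illustration, and the commented-out call assumes `result` is a stable-ts `WhisperResult`.

```python
import os


def resolve_regroup(env=os.environ):
    """Return the value to hand to regroup_algo.

    Hypothetical helper: a non-empty CUSTOM_REGROUP is passed through
    unchanged; an unset or blank one becomes True (the default algorithm).
    """
    custom = env.get("CUSTOM_REGROUP", "").strip()
    return custom if custom else True


# Assumed usage, where `result` is a stable-ts WhisperResult:
# result.regroup(resolve_regroup())
```

Taking the mapping as a parameter makes the fallback easy to check without touching the real environment.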

Instructions pasted below:` Regroup (in-place) words into segments.

Parameters
----------
regroup_algo: str or bool, default 'da'
    String representation of a custom regrouping algorithm, or ``True`` to use the default algorithm 'da'.
verbose : bool, default False
    Whether to show all the methods and arguments parsed from ``regroup_algo``.
only_show : bool, default False
    Whether to show all the methods and arguments parsed from ``regroup_algo`` without running the methods.

Returns
-------
stable_whisper.result.WhisperResult
    The current instance after the changes.

Notes
-----
Syntax for string representation of custom regrouping algorithm.
    Method keys:
        sg: split_by_gap
        sp: split_by_punctuation
        sl: split_by_length
        sd: split_by_duration
        mg: merge_by_gap
        mp: merge_by_punctuation
        ms: merge_all_segments
        cm: clamp_max
        l: lock
        us: unlock_all_segments
        da: default algorithm (cm_sp=,* /，_sg=.5_mg=.3+3_sp=.* /。/?/？)
        rw: remove_word
        rs: remove_segment
        rp: remove_repetition
        rws: remove_words_by_str
        fg: fill_in_gaps
    Metacharacters:
        = separates a method key and its arguments (not used if no argument)
        _ separates method keys (after arguments if there are any)
        + separates arguments for a method key
        / separates an argument into list of strings
        * separates an item in list of strings into a nested list of strings
    Notes:
    -arguments are parsed positionally
    -if no argument is provided, the default ones will be used
    -use 1 or 0 to represent True or False
    Example 1:
        merge_by_gap(.2, 10, lock=True)
        mg=.2+10+++1
        Note: [lock] is the 5th argument, hence the 2 empty arguments in between the three + before 1
    Example 2:
        split_by_punctuation([('.', ' '), '。', '?', '？'], True)
        sp=.* /。/?/？+1
    Example 3:
        merge_all_segments().split_by_gap(.5).merge_by_gap(.15, 3)
        ms_sg=.5_mg=.15+3`
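
To make the syntax above easier to read, here is a small sketch that breaks a regroup string into method keys and positional arguments using only the metacharacter rules quoted in the instructions. It is a simplified reading of those rules, not stable-ts's own parser: `parse_regroup` is a hypothetical name, arguments are left as strings, and an empty slot becomes `None` to mean "use the default".

```python
def parse_regroup(algo: str):
    """Split a regroup string into (method_key, [positional args]) pairs.

    Simplified illustration of the documented metacharacters:
      _ separates methods, = separates key from arguments,
      + separates arguments, / makes a list, * makes a nested item.
    """
    steps = []
    for chunk in algo.split("_"):
        key, has_args, raw = chunk.partition("=")
        args = []
        if has_args:
            for arg in raw.split("+"):
                if "/" in arg or "*" in arg:
                    # List argument; items containing * become tuples.
                    items = [
                        tuple(item.split("*")) if "*" in item else item
                        for item in arg.split("/")
                    ]
                    args.append(items)
                else:
                    # Empty slot means "keep the method's default".
                    args.append(arg if arg else None)
        steps.append((key, args))
    return steps


# Example 1 from the instructions: lock lands in the 5th position.
print(parse_regroup("mg=.2+10+++1"))
# Example 3 from the instructions: three chained methods.
print(parse_regroup("ms_sg=.5_mg=.15+3"))
```

Running it on a candidate string before setting CUSTOM_REGROUP is a quick way to count the + separators and confirm which positional slot a value lands in.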

@McCloudS
Owner

I'm still toying around, but cm_sl=84_sl=42++++++1 does the double lines if the dialog exceeds a certain time. Otherwise, it will still try to find natural breaks.
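
For anyone who wants to picture the target layout (at most 84 characters per subtitle, shown as two lines of up to 42), here is a plain-text sketch using Python's `textwrap`. It only illustrates the shape of the output; it does not use word timestamps and is not how stable-ts or Faster-Whisper actually split segments. The function name and defaults are my own.

```python
import textwrap


def layout_cues(text, line_width=42, lines_per_cue=2):
    """Wrap text into lines of at most line_width characters and group
    them into cues of at most lines_per_cue lines each.

    Illustrative only: real subtitle splitting also has to respect timing.
    """
    lines = textwrap.wrap(text, width=line_width)
    return [
        "\n".join(lines[i:i + lines_per_cue])
        for i in range(0, len(lines), lines_per_cue)
    ]


sample = (
    "This gives the subtitles a much more standardized look "
    "that viewers will recognise from most streaming services."
)
for cue in layout_cues(sample):
    print(cue)
    print("--")
```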

Repository owner locked and limited conversation to collaborators Mar 24, 2024
@McCloudS McCloudS converted this issue into discussion #70 Mar 24, 2024
