
Muting is delayed #549

Open
fogoplayer opened this issue Apr 17, 2024 · 7 comments
Assignees
Labels
Audio Bug Something isn't working

Comments

@fogoplayer

🐛 Description

Audio muting often misses the profanity, muting the following word instead. The duration seems about right; the mute just starts too late. The issue is not exclusive to words that appear at the beginning of a caption.

🔁 Steps To Reproduce

I'm using default settings, except for turning on audio muting. Here are links to several affected videos. Each link is to a timestamp--the issue should be evident within the first 10 seconds:

I don't use many other video sites, so I don't know if this issue is YouTube-specific.

✔️ Expected behavior

That the muting starts when the curse word starts and ends when the curse word ends

💻 Details

  • Device: Asus Chromebook Spin CX5400
  • OS: ChromeOS 123
  • Browser: Chrome
  • Browser Version: 123
  • Affected site(s): YouTube.com

📝 Additional context

If this is too difficult to calibrate at scale, I'd take a setting that allows me to bring in the mute some milliseconds early.

@fogoplayer fogoplayer added the Bug Something isn't working label Apr 17, 2024
@richardfrost
Collaborator

Thanks for the examples @fogoplayer, I'll take a look and see what can be done as soon as I get some more time.

@richardfrost richardfrost self-assigned this Apr 19, 2024
@fogoplayer
Author

Sounds great!

I'd be happy to pitch in my own TypeScript skills, but every time I've scanned the repo for words related to audio, muting, etc., I haven't been able to find anything that looks like a relevant segment of code. I don't expect you to hold my hand through the whole process, but if you can point me in a direction I'd love to get started on a PR!

@richardfrost
Collaborator

Thank you for the offer @fogoplayer! At least for now, I've had to remove the GPL/open-source license on the audio component, so it isn't in this repo anymore. I have a pretty busy next couple of days, but after that I hope to have some time to dedicate to this. If I had to guess, it is probably a case of the captions being added out of sync with the audio, but that's just a guess. It seems like YouTube has gotten progressively worse over the last couple years with that. I'll definitely take a closer look as soon as I can though, and report findings back here.

If you wanted to do some more research, I'm curious if we can find the API that YouTube is using to request the captions, and see if it also includes more timing info. Right now the muting is pretty simple: a mutation observer watches for any nodes being added or modified and determines whether each one is part of the captions. If a caption node contains a word that needs to be filtered, it mutes until the next word/phrase gets added to the page, then repeats the process; if the next node doesn't need filtering, it unmutes. So right now the timing is entirely dependent on when the elements get added to/removed from the DOM.
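For context, the observer-driven flow described above might look roughly like this sketch. The `.ytp-caption-segment` selector, the blocklist contents, and all names here are illustrative assumptions on my part, not the actual closed-source audio component:

```javascript
// Hedged sketch of the mutation-observer flow described above. The selector,
// BLOCKLIST, and function names are illustrative assumptions.
const BLOCKLIST = new Set(["example"])

// Pure decision: does this caption text contain a blocked word?
function captionNeedsMute(text) {
  return text
    .toLowerCase()
    .split(/\s+/)
    .some((w) => BLOCKLIST.has(w.replace(/[^a-z']/g, "")))
}

if (typeof document !== "undefined") {
  const video = document.querySelector("video")
  const observer = new MutationObserver((mutations) => {
    for (const m of mutations) {
      for (const node of m.addedNodes) {
        if (!(node instanceof HTMLElement)) continue
        if (!node.closest(".ytp-caption-segment")) continue // assumed caption selector
        // Mute while a blocked word is on screen; the next clean caption unmutes.
        video.muted = captionNeedsMute(node.textContent || "")
      }
    }
  })
  observer.observe(document.body, { childList: true, subtree: true })
}
```

The key property (and the timing weakness you're describing) is visible here: the mute can only start when the caption node happens to enter the DOM, not when the word is actually spoken.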

@fogoplayer
Author

It looks like we can find the API!

For this video: https://www.youtube.com/watch?v=CWWSovO3Txc

It sent this request and got this response. (Recorded as Gists so I don't add a 9,000 line JSON file to this thread.)

Key insights:

  • The request seems to grab caption data for the entire video--it's only fired once.
  • It's fired when the user enables closed captioning, not on page load.
  • The jackpot seems to be the events member, which is a list:
    • Members of that list are objects with the following members:
      • tStartMs: number
      • dDurationMs: number
      • wWinId: number
      • segs: Object[]
    • I'm guessing the first two are the start time and duration of the string shown in the caption.
    • segs is another list of objects. Their keys are:
      • utf8: a string containing a single word
      • tOffsetMs: number
      • acAsrConf: number
My next steps are to whip together a proof-of-concept script to validate whether this JSON data can be converted into accurate timings. I also might try to write something to crawl over the file and make sure there aren't any exceptions to the grammar above.

P.S.: I know you said you're short on time, so I have no expectation of quick replies. I'll post anything I can find in this thread, and I'll take direction and updates as they come.

@fogoplayer
Author

Okay, proof-of-concept done:

// jsdoc-typed js, because there's no need to get Babel involved in this

/**
 * @param {string} captionData - the stringified JSON data from the YT API
 * @param {boolean?} verbose - controls logging
 */
function checkTimings(captionData, verbose = false) {
    /** @type {HTMLVideoElement} */
    const videoElement = document.querySelector("video")

    // YT caption JSON data includes literal control characters inside strings, which is invalid JSON.
    // I'm not sure how Google parses them, but since no curse words contain control characters, it's enough to just replace them.
    captionData = captionData.replaceAll(/"[\u0000-\u001F]"/g, '"control character"')

    const {events} = JSON.parse(captionData)

    for (const event of events) {
        const {tStartMs: startTime, segs} = event
        setTimeout(()=>{
            for (const seg of segs) {
                const {tOffsetMs: delay = 0, utf8: token} = seg
                setTimeout(()=>{
                    if (verbose) console.log(token)
                    if(token.trim() === "earbuds") {
                        videoElement.volume = 0
                    } else {
                        videoElement.volume = 1
                    }
                }, delay / videoElement.playbackRate)
            }
        }, startTime / videoElement.playbackRate)
    }

    videoElement.currentTime = 0
    videoElement.play()
}

A few key findings here:

  • The start times in the JSON file do seem to have the obvious meaning!
  • Sanitization of the JSON file may prove to be a non-trivial difficulty here. I know there are ways to access the response.json() directly, and if that method is properly escaped it might be a non-issue.

And the big one:

  • Timings were still somewhat inaccurate. They started out extremely accurate but seemed to drift as time went on, and by the three-minute mark they were noticeably lagging behind. It could be that YouTube's timestamps are off, but it seems more likely that it's an issue with my code.*
    This is where I'm likely to focus going forward. I've heard using timeouts inside a Web Worker makes them more accurate, but a solution that runs on the timeupdate event might be better.

*setTimeout is known to drift, and I wasn't trying particularly hard to optimize my code, so there may be additional delays (such as the use of a slower for loop syntax). In addition to the method above (timeouts inside timeouts), I also tried a technique where I computed the total time offset before creating each timeout, with no noticeable difference.
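On the sanitization point: if the caption response can be read via response.json() before it's ever stringified, the control-character problem may disappear. A hedged sketch of intercepting the request by wrapping fetch follows; installCaptionTap is a name I made up, and both the /api/timedtext URL fragment and the assumption that the player uses fetch (rather than XHR) are unverified:

```javascript
// Hypothetical sketch: capture the caption JSON by wrapping fetch before the
// player requests the captions. All names here are my own inventions.
function installCaptionTap(onCaptions, fetchImpl = globalThis.fetch) {
  return async function tappedFetch(input, init) {
    const url = typeof input === "string" ? input : input.url
    const response = await fetchImpl(input, init)
    if (url.includes("/api/timedtext")) {
      // clone() lets us read the body without consuming the player's copy
      response.clone().json().then(onCaptions).catch(() => {})
    }
    return response
  }
}

// Usage (in a page script; the default fetchImpl is captured before the
// assignment below takes effect, so this does not recurse):
// globalThis.fetch = installCaptionTap((data) => {
//   console.log(data.events.length, "caption events")
// })
```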

@fogoplayer
Author

New attempt:

/**
 * @typedef {{
 *   timeMs: number
 *   token: string
 * }} TokenTimestamp
 */

/**
 * @param {string} captionData - the stringified JSON data from the YT API
 * @param {string} wordToMute - the token that triggers muting
 */
function checkTimings(captionData, wordToMute) {
    ////////////////////////
    // Parse caption data //
    ////////////////////////
    // YT caption JSON data includes literal control characters inside strings, which is invalid JSON.
    // I'm not sure how Google parses them, but since no curse words contain control characters, it's enough to just replace them.
    captionData = captionData.replaceAll(/"[\u0000-\u001F]"/g, '"control character"')
    const {events} = JSON.parse(captionData)

    ////////////////////////////////////
    // Create sorted timestamps array //
    ////////////////////////////////////
    /**
     * @type {{timeMs: number, token: string}[]}
     */
    const timestamps = []

    for (const event of events) {
        const {tStartMs: startTime, segs} = event
        if(!segs) continue

        for (const seg of segs) {
            const {tOffsetMs: delay = 0, utf8: token} = seg
            timestamps.push({timeMs: startTime + delay, token})
        }
    }

    //////////////////////////////////////////////////////
    // Check current token each time the player updates //
    //////////////////////////////////////////////////////
    const video = document.querySelector("video")
    // removed due to low polling rate
    // video.ontimeupdate = 
    
    // we'd probably want to add and remove the interval on play and pause events, but good enough for now
    clearInterval(window.captionInterval)
    window.captionInterval = setInterval(() => {
        const start = performance.now()
        const {token} = binSearch(video.currentTime * 1000)
        const end = performance.now()
        console.log(token, "\t", video.currentTime * 1000, "\t", `search: ${(end - start).toFixed(2)}ms`)

        if(token.trim() === wordToMute) video.volume = 0
        else video.volume = 1
    }, 50)

    video.currentTime = 0
    video.play()

    /**
     * Binary search of timestamps array
     * @param {number} val the current time in ms
     * @returns 
     */
    function binSearch(val, start=0, end=timestamps.length) {
        if(end-start <= 1) return timestamps[start]

        const med = Math.floor((start+end)/2)
        if(timestamps[med].timeMs > val) return binSearch(val, start, med)
        else return binSearch(val, med, end)
    }
}

Rather than setting timeouts for future mutes, I check the current time of the video, do a binary search to turn that into a token, and then apply filtering if that token matches the passed-in word to block.

I found a lot of benefits to this approach. It didn't decay over time like setTimeout did. It handles changes to the playback rate and skipping around the video by default, without any extra logic. And the binary search is super fast--the performance API often reported its runtime as 0ms.

However, it still comes in too late sometimes, and at this point I think that's probably due to inaccuracy in the caption timings themselves. Increasing the polling rate to 100 Hz and decreasing the playback rate had no effect on the accuracy.

I wonder if a setting could be added to make audio censoring always come in early, kind of like how a minimum duration can be set right now?
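That setting seems straightforward to sketch on top of the timestamps array: treat the playhead as being some milliseconds further along than it really is, so every mute window shifts earlier by that amount. Here muteLeadMs and shouldMute are hypothetical names, not anything from the extension:

```javascript
// Hypothetical "mute early" setting: shift every mute window earlier by
// muteLeadMs. `timestamps` is a sorted [{ timeMs, token }] array like the
// one built in the snippet above; all names here are illustrative.
function shouldMute(timestamps, currentTimeMs, wordToMute, muteLeadMs = 150) {
  const t = currentTimeMs + muteLeadMs // look slightly ahead of the playhead
  // iterative binary search for the last token at or before time t
  let lo = 0, hi = timestamps.length - 1, match = null
  while (lo <= hi) {
    const mid = (lo + hi) >> 1
    if (timestamps[mid].timeMs <= t) { match = timestamps[mid]; lo = mid + 1 }
    else hi = mid - 1
  }
  return !!match && match.token.trim() === wordToMute
}

// Inside the polling interval this would replace the plain comparison:
// video.volume = shouldMute(timestamps, video.currentTime * 1000, wordToMute) ? 0 : 1
```

Note this shifts the whole window, so the mute also releases muteLeadMs early; whether that trade-off is acceptable probably depends on how far the caption timings lag.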

@richardfrost
Collaborator

Thank you so much for all your work on this @fogoplayer, you were very thorough! I'm sorry it's taken me longer to get back to you. I will take some time to go through it all and let you know some next steps for where we can go with it.

I also agree with your conclusion that the actual timing info may not be accurate. But the problem before was that we didn't have the timing info available at all, so there was no way to mute pre-emptively; with this information, we should be able to. It likely wouldn't need to be much extra, and I do think it could be an option we could allow.
