Audio Marker Sync
This is a piece of software to find an audio marker in an audio stream. The marker is a set of three beeps with strict timing and predefined frequency. The software is able to find the start of this marker with millisecond accurate precision. Being able to do this allows several use cases.
- It allows to synchronize audio recorded via a microphone and video streams: if a camera records a
- It allows to synchronize several cameras: if each camera records the same event and records the audio marker simultaneously the location in time of the markers can be used as a sync point.
How do I use this?
First prepare your audio files with the marker so your camera or microphone can record it. The marker has the following properties. It is a square wave with a frequency of 480.5Hz that is on for 200ms off (silent) for 100ms and on for 200ms, off for 100ms and finally on for 200ms. If you have recorded the files with markers place them into a single folder. A prepared marker can be found in the docs folder.
Install a recent JRE, download the ready build audio marker finder and choose a folder with your media files that contain audio with the marker present and in the text box below a number of timestamps should appear where the marker was found. A percentage score is also given.
It is probable that false positives are present but using the scores these can be filtered out by the user. This is to allow a larger marker/noise ratio. In the screenshot below it is probable that the marker in the first audio file is a false positive, while the second audio files contains a true positive (scores jump from +-100% to +-70%).
The test set consists of two public domain audio files originating from archive.org With the following
ffmpeg commands the marker is mixed at 60s and 120s respectively with the music. The command resamples, mixes the marker with a chosen offset and compresses the audio so it is a reasonable challenge to locate the marker.
ffmpeg -i "set/Alexander_Wheres_That_Band.ogg" -i "docs/marker.wav" -filter_complex "[1:a]adelay=60000[s2];[0:a][s2]amix=2[mixout]" -map "[mixout]" "set/60s_marker.ogg"
ffmpeg -i "set/Bucktown_5_Hot_Mittens.ogg" -i "docs/marker.wav" -filter_complex "[1:a]adelay=120000[s2];[0:a][s2]amix=2[mixout]" -map "[mixout]" "set/120s_marker.ogg"
The results (see screenshot above) show that two false positives are detected but also that the marker is found correctly at the expected locations. The threshold to detect markers is set deliberately low to ensure that all markers are found. Ideally there is a large score difference between a true and false positive (as is the case here).
You need a recent java runtime installed on your system.
The software needs ffmpeg installed on your system to decode media files. It will attempt to download ffmpeg and execute it automatically but more capable versions might be present if you install it manually. E.g. using homebrew in macOS:
brew install ffmpeg or a packet manager like
apt-get on Debian like systems:
apt-get install ffmpeg. The static ffmpeg binaries provided by zeranoe might be practical as well on e.g. windows.
Audio is downmixed to one channel. If a container format has more than one audio stream (e.g. languages in movies), then the first audio stream in the container is used automatically.
How does it work?
Is it magic? Is it witchcraft? No, it is done by calculating how present the expected frequency is in the signal for every point in time. To do this efficiently, a generalized version of the goertzel algorithm is used. Since we expect that the frequency is only very present for a small amount of the time (during the marker and other short random events) we can disregard most of the signal for further analysis.
The further analysis compares the expected timing with the actual timing of the presence of the frequency. This essentially filters out random events and enables the algorithm to find the beginning of the marker.
To do this efficiently, a sort of binary convolution is calculated using bit sets of the length of the marker. The match score is then the cardinality of the signal NOT XOR’ed with the template. The more 1’s there are in the bit string the better the match. In the following example the template represents the expected power of the frequency present in the marker. The signal is the calculated frequency power and the number of ones in the bit string represents the match score:
Template: _____ _____ ______ | |___| |___| | Signal: .. .. ... . ..... . ... ... Binary NOT XOR: 11011 111 11101 111 11111
This is MIT lisenced and uses a GPL’ed library called TarsosDSP
Joren Six at University Ghent, IPEM.