I noticed some weird behaviour when transcribing audio from a news report that contains a couple of interviews.
The model transcribes the host's speech decently, but completely skips the parts where the guests speak. Some of the missing segments are over 40 seconds long and are entirely absent from the transcript.
I am using the large-v2 model to transcribe Korean.
I have tried enabling the VAD filter and even passing an initial prompt that mentions how many speakers are in the audio, but neither fixed it.
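For reference, this is roughly what I'm running (the file path and prompt text are placeholders; the VAD parameters are the ones I experimented with):

```python
from faster_whisper import WhisperModel

# large-v2, running Korean transcription
model = WhisperModel("large-v2", device="cuda", compute_type="float16")

segments, info = model.transcribe(
    "news_report.wav",            # placeholder path
    language="ko",
    vad_filter=True,              # tried with and without this
    vad_parameters={"min_silence_duration_ms": 500},
    initial_prompt="뉴스 진행자와 두 명의 출연자가 대화합니다.",  # placeholder prompt mentioning the speakers
)

# segments is a generator; the guest portions never show up here
for seg in segments:
    print(f"[{seg.start:.1f} -> {seg.end:.1f}] {seg.text}")
```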
Any idea what could be causing this?