3. Splitting the dialogues
Now that all the dialogues in the transcript matched my audio, I was more than halfway through.
I was curious to see how the classification would turn out, so I prompted the model with a short chapter, "11. Monoco's Station", so that the output would come back quickly and be short enough to read. To my surprise, the outcome was exactly what I had hoped for: all the lines were properly evaluated, none was missing, and there were no data quality issues such as the emotion scores of a line not summing to 1.0 or the model classifying emotions I hadn't asked for.
Pretty good! Next, I tried a longer chapter, the first one, "0. The Gommage". It had about 25 minutes of dialogue, compared to the 4 minutes of Monoco's Station. And... the prompt failed. Or rather, it did not complete, because the response exceeded the maximum number of tokens for gpt4-audio, which is 16k.
So I tried a mid-length chapter, "4. Flying Waters", which is about 12 minutes long. It did complete. However, the results were of very low quality: most of the lines were classified as 0.5 emotion A and 0.5 emotion B, which looked to me like a clear sign of attention dilution caused by the model having too much information to process at once.
Since shorter chapters tended to be classified correctly and longer chapters tended to either fail or produce low-quality estimates, after some more testing with other chapters I settled on splitting longer chapters into chunks. I thought it was still important to preserve dialogue context, meaning I couldn't simply split a dialogue mid-sentence, so I needed to find a chunk duration that would work for most chapters.
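The data quality checks mentioned above (no missing lines, scores summing to 1.0, no unrequested emotions) can be sketched roughly as follows. This is only an illustration: the label set `EXPECTED_EMOTIONS`, the function name and the shape of `results` are assumptions, since the actual prompt and output format are not shown here.

```python
import math

# Hypothetical label set; the emotions actually requested in the prompt are not listed here.
EXPECTED_EMOTIONS = {"joy", "sadness", "anger", "fear"}

def validate_classification(line_ids, results, tolerance=1e-3):
    """Return a list of problems found in the model output.

    line_ids: ids of the transcript lines that were sent to the model.
    results:  dict mapping line id -> {emotion: score} (assumed output shape).
    """
    problems = []
    for line_id in line_ids:
        if line_id not in results:
            problems.append(f"line {line_id}: missing from output")
            continue
        scores = results[line_id]
        # Emotions the model returned but that were never asked for.
        unexpected = set(scores) - EXPECTED_EMOTIONS
        if unexpected:
            problems.append(f"line {line_id}: unexpected emotions {sorted(unexpected)}")
        # Scores for one line are expected to sum to 1.0.
        total = sum(scores.values())
        if not math.isclose(total, 1.0, abs_tol=tolerance):
            problems.append(f"line {line_id}: scores sum to {total:.3f}, not 1.0")
    return problems
```

An empty list means the output passed all three checks.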
To quickly sum up:
- a 4-minute chapter was classified with no errors and good data quality
- a 12-minute chapter was classified with no errors, but low data quality
- a 22-minute chapter was not classified at all, failing because it ran out of available tokens
I thus needed to find an optimal chunk duration that would fit within GPT's token limit without scattering a single scene across different chunks.
I proceeded by trial and error, as I thought it would be the most reliable way of optimizing the process. I started by taking a chunk of 4 minutes from "0. The Gommage", then increased the chunk duration by 2 minutes at each repetition, always ensuring that the end of the chunk coincided with the end of a dialogue.
Then, I calculated the variance of emotions for each test, to see how varied my output remained as the chunk duration increased. I was expecting the variance to decrease as the chunks got longer and, sure enough, that is what happened.
At this point, I stuck with 4-6 minutes as the optimal chunk duration, since it was applicable to almost any chapter and would not create too many splits even for the longer chapters.
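One possible way to compute such a variance figure is sketched below. The exact formula used for these tests is not stated, so this assumes one reasonable reading: take the population variance of each emotion's score across all lines of a chunk, then average over emotions. The function name and input shape are hypothetical.

```python
from statistics import mean, pvariance

def mean_emotion_variance(results, emotions):
    """Average, over emotions, of the variance of each emotion's score across lines.

    results:  list of {emotion: score} dicts, one per classified line (assumed shape).
    emotions: the emotion labels to include.
    """
    per_emotion = []
    for emotion in emotions:
        # Treat an emotion missing from a line as a score of 0.0.
        scores = [line.get(emotion, 0.0) for line in results]
        per_emotion.append(pvariance(scores))
    return mean(per_emotion)
```

Under this reading, an output where every line is 0.5 emotion A and 0.5 emotion B scores a variance of zero, which matches the "flat" results seen on longer chunks.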
Now that I had decided how long an optimal chunk should be, for each chapter longer than 6 minutes I needed to:
- find a timestamp that would not cut a dialogue into two chunks that made no sense on their own
- find the related line index in my transcript
- create a rule that associates a timestamp with that line index
- repeat until I had split a long audio into chunks of 4-6 minutes each
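The split points were picked by hand, but the steps above could be approximated in code along these lines. This is a sketch under assumptions: it presumes every transcript line already carries an end timestamp and a flag saying whether it closes a dialogue, neither of which is confirmed by this write-up, and the function name is made up.

```python
def choose_split_points(lines, min_len=240.0, max_len=360.0):
    """Greedily pick split points so that chunks last between min_len and max_len seconds.

    lines: ordered list of (line_index, end_time_seconds, ends_dialogue) tuples.
    Returns a list of (line_index, end_time_seconds) pairs after which to cut.
    """
    splits = []
    chunk_start = 0.0
    candidate = None  # latest dialogue boundary seen inside the current 4-6 minute window
    for line_index, end_time, ends_dialogue in lines:
        # Once the window is exceeded, cut at the last valid boundary found.
        if end_time - chunk_start > max_len and candidate is not None:
            splits.append(candidate)
            chunk_start = candidate[1]
            candidate = None
        # Only lines that close a dialogue may become split points.
        if ends_dialogue and min_len <= end_time - chunk_start <= max_len:
            candidate = (line_index, end_time)
    return splits
```

Whatever remains after the last split becomes the final chunk, which may be shorter than 4 minutes.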
This was definitely easier and quicker than the previous processing step, the editing part. I tackled it the same way, by creating a JSON ruleset that indicated where to split the transcript and where to split the audio. The splitter.py module would then read it and perform these actions. Every split was saved separately, with an incremental suffix to indicate which part of the original it was.
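To give an idea of the approach, here is a minimal sketch of a ruleset and of the transcript half of the split. The JSON schema, the field names `after_line` and `timestamp`, the chapter name and the suffix format are all made up for illustration; the real ruleset and splitter.py are not reproduced here.

```python
import json

# Hypothetical ruleset: each rule ties a transcript line index to an audio timestamp.
RULESET_JSON = """
{
  "chapter": "example_chapter",
  "splits": [
    {"after_line": 2, "timestamp": "00:05:12"},
    {"after_line": 4, "timestamp": "00:10:03"}
  ]
}
"""

def split_transcript(lines, ruleset):
    """Cut the transcript after each 0-based 'after_line' index listed in the ruleset."""
    parts, start = [], 0
    for rule in ruleset["splits"]:
        end = rule["after_line"] + 1
        parts.append(lines[start:end])
        start = end
    parts.append(lines[start:])  # whatever follows the last rule
    return parts

def part_name(chapter, part_number):
    """Incremental suffix identifying which part of the original a split is."""
    return f"{chapter}_{part_number}"
```

Keeping the line index and the timestamp in the same rule means the transcript and the audio are always cut at the same place.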
Moreover, when splitting I also converted the WAV files to MP3, since gpt4-audio performs better with MP3 and the format, being compressed, saves a lot of tokens.
I prompted the model with the first split of "4. Flying Waters" and compared it to the corresponding first lines of the previous, full-chapter output. Needless to say, the result was exactly what I expected: a better, more coherent and less "lazy" evaluation for each line.
And, for some reason, cheaper: based on OpenAI's cost breakdown, the full chapter of "Flying Waters" took about 8000 completion tokens, whereas the first split took around 2200. I decided to divide that chapter into 3 splits, so if every split cost the same number of tokens, the three together would have come to roughly 6600, a saving of about 1400 tokens going by these rounded figures. But I never ran that test, mostly because I was still experimenting and did not care about costs at that point.
At this point, I was ready to classify each chapter and its splits.
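The audio side of the split, including the WAV to MP3 conversion, could look something like this. It assumes the ffmpeg command-line tool is installed; which tool splitter.py actually uses is not stated here, and the function name and file names are hypothetical. The sketch only builds the command, it does not run it.

```python
def ffmpeg_split_command(src_wav, dst_mp3, start, end=None):
    """Build an ffmpeg command that cuts [start, end) out of a WAV and encodes it as MP3.

    start and end are timestamps such as "00:05:12"; end=None means "until the end".
    """
    cmd = ["ffmpeg", "-i", src_wav, "-ss", start]
    if end is not None:
        cmd += ["-to", end]
    # Encode the audio stream with the LAME MP3 encoder.
    cmd += ["-codec:a", "libmp3lame", dst_mp3]
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once per split, using consecutive ruleset timestamps as `start` and `end`.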