4. Emotion Classification
I knew this was going to be the fun part, but to make sure this step didn't turn into a one-shot, quick and dirty Jupyter notebook that is hard to replicate, I wanted to add something more to the process.
First of all, clarity: I want to know which chapters I have already classified, which ones I have partially classified (meaning, I only classified some splits, but not all) and which ones are missing.
Second, method: a single 200-line script that reads all the split files, sends them to the LLM, then parses and saves the output might be enough. But since this is the step where a single mistake can cost not only money but also execution time, I want control over each sub-step without having to intervene too much.
Third, reproducibility: if I want to repeat this process in 3-4 months, I don't want to have to edit a quick and dirty script, discard previous output files, or overwrite existing results.
This is part of why the classifier.py script is definitely the longest one and, at first glance, the most complicated one too. That is why I'll spend a bit more time on this topic.
As I said in the previous chapter, I have split long chapters into chunks so that prompting is efficient but, most importantly, effective. Some chapters have 2 splits, some 3, some only 1, but regardless of how many there are, the end result should be the same: a single file containing all classified lines for each chapter.
This is the reason behind the Pair and Chapter utility classes: they serve as wrappers for both audio and transcript files so that, regardless of the splits of one chapter, I can classify them all and then merge them together.
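To give an idea of the shape of these wrappers, here is a minimal sketch. It is not the actual code from classifier.py: the field names, the `classified` flag and the `status` property are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Pair:
    """One split of a chapter: an audio file plus its transcript."""
    audio: Path
    transcript: Path
    classified: bool = False  # assumed flag: True once this split has an output


@dataclass
class Chapter:
    """All the splits that belong to one chapter."""
    title: str
    pairs: list[Pair] = field(default_factory=list)

    @property
    def status(self) -> str:
        """Return 'classified', 'partial' or 'missing' for the whole chapter."""
        done = sum(pair.classified for pair in self.pairs)
        if done == 0:
            return "missing"
        return "classified" if done == len(self.pairs) else "partial"
```

With something like this, the three states I care about (classified, partially classified, missing) fall out of the per-split flags, no matter how many splits a chapter has.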
Every time I run the classifier, I certainly do not want to repeat classification for all chapters. Nor do I want to hard-code the chapters to be classified in the main() every time I run the script. I was thinking of accepting the chapters via command line arguments, but that would be cumbersome and prone to misspelling.
So I decided to create a curses command line interface that would allow the user to select the chapters and/or splits to classify during that execution.
Let's see the end result first:
- By using the arrow keys, the user can select an entire chapter (i.e. all its splits) or only some splits of a chapter
- There are shortcut keys to select or deselect all the chapters
- Before the title of the chapter, you can see how many splits you have selected / the total number of splits
- Also, a [ ] or [C] character identifies chapters that have already been classified before
After selecting the desired chapters, the UI continues with a recap of the selected chapters and asks the user for confirmation. After that, the classification begins.
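The selection logic behind a menu like this can be sketched without any drawing code. The snippet below is only an illustration: the function names and the exact row layout are assumptions, and the curses rendering and key handling are left out.

```python
def chapter_label(title: str, selected: set[int], total: int, classified: bool) -> str:
    """Build one menu row, e.g. '[C] 2/3 Some chapter' (layout is assumed)."""
    marker = "[C]" if classified else "[ ]"
    return f"{marker} {len(selected)}/{total} {title}"


def toggle_chapter(selected: set[int], total: int) -> set[int]:
    """Select every split, or clear the selection if all were already selected."""
    return set() if len(selected) == total else set(range(total))


def toggle_split(selected: set[int], index: int) -> set[int]:
    """Flip a single split in or out of the selection."""
    return selected ^ {index}
```

Keeping the state in plain sets of split indices means the curses layer only has to map key presses to these functions and redraw the labels.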
The basic requirements for prompting the model are:
- The audio dialogue, in MP3 format
- The transcript, in plain text
- The system message, in plain text
The audio dialogue will be passed as a base64 encoded string, which is easy to do.
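A minimal sketch of that encoding step, using only the standard library (the function name `encode_audio` is hypothetical):

```python
import base64
from pathlib import Path


def encode_audio(path: Path) -> str:
    """Read an MP3 file and return its bytes as a base64-encoded ASCII string."""
    return base64.b64encode(path.read_bytes()).decode("ascii")
```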
The transcript can't be passed as a CSV attachment, so I loaded it into a pandas DataFrame, concatenated the speaker with the actual line and created a row identifier that the model will use as a reference when returning the emotion estimate for that line.
An example of the transcript:
0_0 | Verso: Follow the tracks, they’ll lead us to Monoco.
0_1 | Maelle: A train. Gustave would have liked to see that.
0_2 | Lune: Is it true, before the Fracture, Lumière had trains running throughout the continent?
0_3 | Verso: Always running late, but running, yes.
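The formatting above could be produced along these lines. This sketch deliberately uses the standard csv module rather than pandas to stay dependency-free, and it rests on assumptions: the column names `speaker` and `line` are hypothetical, and I am reading the row identifier as split index plus row index.

```python
import csv
import io


def format_transcript(csv_text: str, split_index: int) -> str:
    """Turn CSV rows into 'split_row | Speaker: line' text for the prompt."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return "\n".join(
        # Assumed id scheme: <split index>_<row index within the split>
        f"{split_index}_{i} | {row['speaker']}: {row['line']}"
        for i, row in enumerate(rows)
    )
```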
The system message is where I focused my attention.
I needed to make sure that the model
- would focus on the audio and use the transcript only as a reference for its reply
- would not classify emotions I did not ask it to
- would not classify opposite emotions in a single line (example: 0.6 happiness and 0.4 sadness)
- would not classify more than 3 emotions in a single line
- always returned a JSON formatted string, nothing more and nothing less
My first version of the system message was as follows:
## TASK
Evaluate the likelihood of the emotions in the dialogue.
Consider the actor's interpretation the background music and the meaning of the words.
Only classify the following emotions: happiness, joy, surprise, determination, anger, sadness, fear
## REQUIREMENTS
- You will have the transcript of the dialogue. Use the row index as key when returning the estimate for the voice line.
- Make sure to not classify any other emotion apart from those listed.
- Your estimate should be between 0 and 1 and the total should add up to 1.
- When you reply do not add any other text. Just reply with a JSON formatted string.
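Since the prompt alone cannot guarantee these rules, a small check on the reply helps catch violations before saving anything. The sketch below is an assumption-heavy illustration: the reply shape (`{"row_id": {"emotion": score}}`), the `OPPOSITES` pairs and the tolerance are all hypothetical, and scores are assumed to be numeric.

```python
import json

ALLOWED = {"happiness", "joy", "surprise", "determination", "anger", "sadness", "fear"}
# Assumed pairs of "opposite" emotions; the text only names happiness vs sadness.
OPPOSITES = [("happiness", "sadness")]


def validate_reply(raw: str, tolerance: float = 0.01) -> list[str]:
    """Return a list of rule violations found in the model's JSON reply."""
    try:
        reply = json.loads(raw)
    except json.JSONDecodeError:
        return ["reply is not valid JSON"]
    if not isinstance(reply, dict):
        return ["reply is not a JSON object"]
    problems = []
    for row_id, scores in reply.items():
        if not isinstance(scores, dict):
            problems.append(f"{row_id}: scores are not an object")
            continue
        present = {name for name, value in scores.items() if value > 0}
        if not set(scores) <= ALLOWED:
            problems.append(f"{row_id}: unlisted emotion")
        if len(present) > 3:
            problems.append(f"{row_id}: more than 3 emotions")
        if any(a in present and b in present for a, b in OPPOSITES):
            problems.append(f"{row_id}: opposite emotions in one line")
        if abs(sum(scores.values()) - 1) > tolerance:
            problems.append(f"{row_id}: scores do not add up to 1")
    return problems
```

An empty list means the reply passed every check; anything else can be logged next to the raw output so a bad prompt is easy to spot.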
At first the results were adequate. But while the model was not making mistakes, there was something that I did not like: the lines were too "polarized". Even for colloquial phrases, interjections, and phrases without strong emotions, the model was assigning very high emotion values that were not actually present in the dialogue.
At first I thought that the model was not understanding emotions properly, which arguably may still be true. So I changed some emotions, used synonyms, and even tried recording a voice line myself in two different moods, but the results were still too polarized.
But then I noticed something: I had asked it to make sure all emotions add up to 1, for calculation simplicity. So the model was forcibly "increasing" the emotion intensity, even when that was not warranted.
That is why I introduced the neutral emotion.
This was effectively the most important change I made to the classification part, as it behaved sort of like a "thermometer" for emotions: the lower the neutral value, the higher the other emotions, and thus the stronger the line felt.
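Because the scores still add up to 1, the "thermometer" reading is simply whatever is left once neutral is taken out. Here is a tiny sketch; the `intensity` helper and the two sets of scores are made up for illustration and are not real model output.

```python
def intensity(scores: dict[str, float]) -> float:
    """Overall emotional strength of a line: everything that is not 'neutral'."""
    return 1.0 - scores.get("neutral", 0.0)


# Illustrative, invented scores for a flat line and a heated one.
calm = {"neutral": 0.9, "happiness": 0.1}
tense = {"neutral": 0.1, "anger": 0.6, "fear": 0.3}
```

With these made-up values the calm line gets a low intensity and the tense line a high one, which is the behaviour I was missing when every line had to be split among the "real" emotions only.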
Results changed immediately and they were far more accurate, albeit sometimes debatable: a line can sound a bit more (for example) happy than the model reported. But that was fine: I am sure that if I asked five people to do the same, the results would always be different, so I did not waste time trying to perfect it.
After having perfected the system message and prepared the inputs, prompting was quite simple. I would always save the result of a prompt, both parsed and raw, in the data/output/ folder. That allows me to keep track of the effect of minor changes to the emotions I chose, as well as of increasing or reducing the length of the inputs by tweaking the split duration.
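Saving both versions side by side can be sketched as follows. The file naming scheme (`<chapter>__<timestamp>`) and the `save_result` helper are assumptions for illustration, not the actual layout of data/output/.

```python
import json
from datetime import datetime
from pathlib import Path


def save_result(out_dir: Path, chapter: str, raw: str, parsed: dict) -> tuple[Path, Path]:
    """Write the raw reply and the parsed JSON under a timestamped name."""
    # The timestamp in the name keeps earlier runs instead of overwriting them.
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S_%f")
    out_dir.mkdir(parents=True, exist_ok=True)
    raw_path = out_dir / f"{chapter}__{stamp}.raw.txt"
    parsed_path = out_dir / f"{chapter}__{stamp}.json"
    raw_path.write_text(raw, encoding="utf-8")
    parsed_path.write_text(json.dumps(parsed, indent=2), encoding="utf-8")
    return raw_path, parsed_path
```

Keeping the raw reply next to the parsed one means a parsing bug never forces a new (paid) prompt: the raw text can simply be parsed again.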
While 70% of my credit was spent during testing, I have to say that I made good use of the remaining 30%, meaning that I only needed a very small number of "final" prompts.
With the script prep_for_dashboard.py I loop over the most recent emotion evaluation of every chapter, concatenate them together and finally write an output file.
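Picking the most recent evaluation per chapter can be sketched like this. It is not the code of prep_for_dashboard.py: it assumes hypothetical file names of the form `<chapter>__<timestamp>.json` with fixed-width timestamps, so that sorting by name also sorts by time, and it leaves the concatenation step out.

```python
from pathlib import Path


def latest_per_chapter(files: list[Path]) -> dict[str, Path]:
    """Keep, for each chapter, the file whose timestamp sorts last."""
    latest: dict[str, Path] = {}
    for path in sorted(files, key=lambda p: p.name):
        # Assumed naming: everything before '__' is the chapter name.
        chapter, _, _stamp = path.stem.partition("__")
        latest[chapter] = path  # later (newer) files replace earlier ones
    return latest
```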
With that final output file, I concluded the project. Or at least, the data preparation part. I have created a dashboard on it using the free version of Tableau called Tableau Public. You can check it out at this link.
Thank you for following my project through to the end. I hope you have found it interesting!