## Emotion Classification Walkthrough

### 0. Intro
In this notebook, we will look at how `gpt-audio` classifies emotions when it is prompted with the audio and the transcript of a dialogue from the game.

The actual process in `classifier.py` is more complex and involves a few more steps. Here we cover only the most important ones, since the high-level logic stays the same.

**NOTE**: we assume that data scraping and preparation have already been completed by running `main.py`. Read the next section if you want to revisit those steps, or skip ahead to section **1. Reading the files**.

### 0.1 A brief recap of the steps that are performed in `main.py`
Starting with a fresh repository, we will have the following folder structure under `data/`:

```text
+---0_data_manip_cfg/
+---audio/
|   +---2_edits
|   +---3_splits
+---csv/
|   +---1_raw
|   +---2_edits
|   +---3_splits
```

The process is as follows:
1. dialogues are scraped from the source webpage and saved in `csv/1_raw`
2. dialogues from `csv/1_raw` are edited based on the rules defined in `0_data_manip_cfg/edit_rules.json` and saved in `csv/2_edits`
3. dialogues from `csv/2_edits` are split based on the rules defined in `0_data_manip_cfg/split_rules.json` and saved in `csv/3_splits`
4. audio files in `audio/2_edits` (manually saved there as WAV) are split according to the same rules and saved in `audio/3_splits` as MP3

#### What do you mean by edit and split? And why did we do that?
**Editing** means that we strip away conversations that are not present in our audio files. This happens mainly because I did not obtain those conversation lines in my gameplay, or because the recordings I got were too noisy (mostly dialogue lines spoken during fights, which are covered by a large amount of sound effects).

**Splitting** means that we divide the edited audio into chunks, so that the GPT model has a shorter audio context to focus on. To keep a one-to-one match between the lines spoken in each audio chunk and the lines in its transcript, we must split the conversation file in the same way.

For both editing and splitting there are "rule" files in `data/0_data_manip_cfg/`.

Performing both edits and splits significantly limits:
* hallucination
* context overflow
* attention dilution
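
To make editing and splitting more concrete, here is a minimal sketch on a toy transcript. The rule dictionaries are hypothetical: the real schema of `edit_rules.json` and `split_rules.json` is not shown in this notebook, so `drop_dialogues` and `split_before_dialogue` are illustrative names only.

```python
import pandas as pd

# Toy transcript with the same index columns used in this project.
toy_df = pd.DataFrame({
    "dialogue_index": [0, 0, 1, 2, 2, 3],
    "line_index":     [0, 1, 0, 0, 1, 0],
    "line":           ["a", "b", "c", "d", "e", "f"],
})

# Hypothetical rules (NOT the real schema of the rule files).
edit_rule = {"drop_dialogues": [1]}          # dialogues missing from the audio
split_rule = {"split_before_dialogue": [2]}  # start a new chunk at dialogue 2

# Editing: remove the dialogues that are not in the recording.
edited = toy_df[~toy_df["dialogue_index"].isin(edit_rule["drop_dialogues"])]

# Splitting: the chunk number grows by one at every boundary that is passed.
boundaries = sorted(split_rule["split_before_dialogue"])
chunk_id = edited["dialogue_index"].apply(lambda d: sum(d >= b for b in boundaries))
chunks = [part.reset_index(drop=True) for _, part in edited.groupby(chunk_id)]

print([len(chunk) for chunk in chunks])
```

Each element of `chunks` would then be written to its own CSV file, next to the matching audio chunk.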

### 1. Reading the files
Let's import the required modules.

In [None]:
import openai
import json
import pandas as pd
import pathlib
import base64

Now, as an example, let's pick one chapter, preferably a short one and one without splits, so we don't overcomplicate the example.

Let's pick the chapter "**11_Monocos_Station**"

In [2]:
path = pathlib.Path().resolve()

chapter = "11_Monocos_Station"
dialogues_file_path = path/f"data/csv/3_splits/{chapter}.csv"
audio_file_path = path/f"data/audio/3_splits/{chapter}.mp3"

In [5]:
df = pd.read_csv(dialogues_file_path.as_posix())
df.sort_values(by=["chapter_index", "dialogue_index", "line_index"], inplace=True)
df.reset_index(drop=True, inplace=True)
df.head()

Unnamed: 0,chapter_index,chapter,dialogue_index,line_index,speaker,line
0,11,Monoco’s Station,0,0,Verso,"Follow the tracks, they’ll lead us to Monoco."
1,11,Monoco’s Station,0,1,Maelle,A train. Gustave would have liked to see that.
2,11,Monoco’s Station,0,2,Lune,"Is it true, before the Fracture, Lumière had t..."
3,11,Monoco’s Station,0,3,Verso,"Always running late, but running, yes."
4,11,Monoco’s Station,2,0,Verso,"If Monoco’s not here, that means– Slow as ever..."


### 2. Preparing for prompting
We need three things prior to prompting the model:
- the conversation transcript, as a single string
- the audio, as a base64 string
- a system message, to explain to the model what we need to do

Creating the transcript is quite simple:
1. we create a row identifier as a concatenation of dialogue index and line index
    - the model will use this as reference when returning the estimate of that line
2. we convert the pandas Series to list
3. we join the list with newline characters

In [9]:
df["id"] = df["dialogue_index"].astype(str) +"_"+ df["line_index"].astype(str)
df["outc"] = df["id"] + " | " + df["speaker"] + ": " + df["line"]

dialogues_text = "\n".join(df["outc"].to_list())
print(dialogues_text[:333])

0_0 | Verso: Follow the tracks, they’ll lead us to Monoco.
0_1 | Maelle: A train. Gustave would have liked to see that.
0_2 | Lune: Is it true, before the Fracture, Lumière had trains running throughout the continent?
0_3 | Verso: Always running late, but running, yes.
2_0 | Verso: If Monoco’s not here, that means– Slow as ever, mo
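
The three steps can also be wrapped in a small function, which makes them easy to reuse for other chapters. This is only a sketch: `build_transcript` is a hypothetical helper name, not a function taken from `classifier.py`, and the sample rows are toy data.

```python
import pandas as pd

def build_transcript(df: pd.DataFrame) -> str:
    """Return one 'id | speaker: line' row per voice line, joined by newlines."""
    # Row identifier: dialogue index and line index joined by an underscore.
    ids = df["dialogue_index"].astype(str) + "_" + df["line_index"].astype(str)
    rows = ids + " | " + df["speaker"] + ": " + df["line"]
    return "\n".join(rows.to_list())

# Toy data with the same columns as the chapter file.
sample = pd.DataFrame({
    "dialogue_index": [0, 0],
    "line_index": [0, 1],
    "speaker": ["Verso", "Maelle"],
    "line": ["Follow the tracks.", "A train."],
})
transcript = build_transcript(sample)
print(transcript)
```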


To encode the audio as a base64 string, we read the raw bytes of the file and encode them:

In [11]:
audio_file = audio_file_path.read_bytes()  # reads the file and closes it afterwards
audio_b64 = base64.b64encode(audio_file).decode("utf-8")
print(audio_b64[:33])

SUQzBAAAAAAAI1RTU0UAAAAPAAADTGF2Z
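
As a quick sanity check, the first characters of the output above can be decoded back to raw bytes. The prefix `SUQz` corresponds to `ID3`, the marker of an ID3v2 tag at the start of an MP3 file. The sketch below uses only the printed prefix, so it runs without the audio file.

```python
import base64

# First 8 characters of the base64 string printed above.
prefix_b64 = "SUQzBAAA"

# Every 4 base64 characters decode to 3 bytes.
prefix_bytes = base64.b64decode(prefix_b64)
print(prefix_bytes[:3])

# An MP3 file that starts with an ID3v2 tag begins with the bytes b"ID3".
starts_with_id3 = prefix_bytes.startswith(b"ID3")
print(starts_with_id3)
```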


And finally, the system message:

In [12]:
negative_emotions = ["anger", "sadness", "fear"]
positive_emotions = ["happiness", "ambitious", "surprise"]
target_emotions = negative_emotions + positive_emotions

# Prompt
system_message = f"""
## TASK
Evaluate the likelihood of the emotions in the audio dialogue.
Consider the actor's interpretation, the background music and the meaning of the words.
Only classify the following emotions:
- positive: [{', '.join(positive_emotions)}]
- negative: [{', '.join(negative_emotions)}]
- neutral: [neutral]

## REQUIREMENTS
- You will have the transcript of the dialogue. Use the row index as key when returning the estimate for the voice line.
- Your analysis must rely EXCLUSIVELY on the audio. The transcript is provided ONLY to map voice lines by their row index.
- Do NOT use the text to infer tone, emotion, or meaning.
- Make sure to not classify any other emotion apart from those listed.
- Don't mix positive and negative emotions in a single voice line.
- Your estimate should be between 0 and 1, and the total should add up to 1.
- If an emotion has a score lower than 0.1 , ignore it and add that score to the highest valued emotions.
- If an emotion is not scored, return it with a score of 0.0
- When you reply, do not add any other text. Just reply with a JSON formatted string.
"""

### 3. Prompting the model
Now that we have everything we need, we create a new OpenAI client object, passing it our API key:

In [14]:
with open(path/"data/open_ai_token.txt", "r") as key:
    # strip() removes a trailing newline that would otherwise end up in the key
    client = openai.OpenAI(api_key=key.read().strip())

In [15]:
# https://platform.openai.com/docs/api-reference/chat/create
response = client.chat.completions.create(
    model="gpt-audio",
    temperature=0.1,
    max_completion_tokens=16384,
    messages=[
        {
            "role": "system",
            "content": system_message
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": audio_b64,
                        "format": "mp3"
                    }
                },
                {
                    "type": "text",
                    "text": "TRANSCRIPT (DO NOT USE FOR CLASSIFICATION):\n" + dialogues_text
                }
            ]
        }
    ]
)

Now let's inspect our response.

First of all, if the model could not finish its reply (for example because it ran out of tokens), we must raise an error:

In [17]:
res_dict = response.to_dict()
if res_dict['choices'][0]['finish_reason'] != 'stop':
    print(json.dumps(res_dict, indent=2))
    raise ValueError("API response was not complete. Exiting...")

If no error is raised, we know the model finished its reply instead of being cut off, so we can expect one evaluation for each line.

Let's take a quick look:

In [None]:
content = response.choices[0].message.content
print(content[:50])

{
  "0_0": {
    "neutral": 0.7,
    "ambitious": 0.3,
    "happiness": 0.0,
    "surprise": 0.0,
    "anger": 0.0,
    "sadness": 0.0,
    "fear": 0.0
  },
  "0_1": {
    "sadness": 0.6,
    "neutral": 0.4,
    "happiness": 0.0,
    "ambitious": 0.0,
    "surprise": 0.0,
    "anger": 0.0,
    "fear": 0.0
  },
  "0_2": {
    "neutral": 0.8,
    "surprise": 0.2,
    "happiness": 0.0,
    "ambitious": 0.0,
    "anger": 0.0,
    "sadness": 0.0,
    "fear": 0.0
  },



As you can see, the model responded with a JSON object whose keys are the row identifiers from our transcript, and whose values are JSON objects with one key for each emotion we asked it to classify.
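
Since the prompt imposes several constraints on the scores, it can be worth checking each entry against them before using it. The sketch below is one possible check, not part of the project code: `validate_scores` is a hypothetical helper, and the tolerance of 0.01 on the total is an arbitrary choice.

```python
import math

NEGATIVE = ["anger", "sadness", "fear"]
POSITIVE = ["happiness", "ambitious", "surprise"]
ALLOWED = set(NEGATIVE + POSITIVE + ["neutral"])

def validate_scores(scores: dict) -> list:
    """Return a list of human-readable problems found for one voice line."""
    problems = []
    unknown = set(scores) - ALLOWED
    if unknown:
        problems.append(f"unexpected emotions: {sorted(unknown)}")
    if any(not 0.0 <= value <= 1.0 for value in scores.values()):
        problems.append("score outside [0, 1]")
    if not math.isclose(sum(scores.values()), 1.0, abs_tol=0.01):
        problems.append("scores do not add up to 1")
    has_positive = any(scores.get(e, 0.0) > 0 for e in POSITIVE)
    has_negative = any(scores.get(e, 0.0) > 0 for e in NEGATIVE)
    if has_positive and has_negative:
        problems.append("positive and negative emotions are mixed")
    return problems

# One valid entry (taken from the output above) and one made-up invalid entry.
good = {"neutral": 0.7, "ambitious": 0.3, "happiness": 0.0, "surprise": 0.0,
        "anger": 0.0, "sadness": 0.0, "fear": 0.0}
bad = {"neutral": 0.2, "happiness": 0.5, "anger": 0.5, "joy": 0.0}
print(validate_scores(good))
print(validate_scores(bad))
```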

Let's see this information more clearly in a DataFrame:

In [21]:
json_content = json.loads(content)
emotions_df = pd.DataFrame.from_dict(json_content, orient='index')
emotions_df.head()

Unnamed: 0,neutral,ambitious,happiness,surprise,anger,sadness,fear
0_0,0.7,0.3,0.0,0.0,0.0,0.0,0.0
0_1,0.4,0.0,0.0,0.0,0.0,0.6,0.0
0_2,0.8,0.0,0.0,0.2,0.0,0.0,0.0
0_3,1.0,0.0,0.0,0.0,0.0,0.0,0.0
2_0,0.7,0.0,0.3,0.0,0.0,0.0,0.0
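
`json.loads` raises an error if the reply is not pure JSON. The prompt asks for a bare JSON string, but as a precaution the reply could be cleaned before parsing. The sketch below assumes the model might occasionally wrap its answer in a Markdown code fence; `parse_model_json` is a hypothetical helper, not part of the project code.

```python
import json

FENCE = "`" * 3  # three backticks, the Markdown code fence marker

def parse_model_json(content: str) -> dict:
    """Parse a JSON reply, tolerating an optional Markdown code fence around it."""
    text = content.strip()
    if text.startswith(FENCE) and text.endswith(FENCE):
        # Remove the opening and closing fence markers.
        text = text[len(FENCE):-len(FENCE)]
        # Drop the rest of the opening line if it is empty or a language tag like "json".
        first_line, _, rest = text.partition("\n")
        if first_line.strip() == "" or first_line.strip().isalpha():
            text = rest
    return json.loads(text)

plain = '{"0_0": {"neutral": 1.0}}'
wrapped = FENCE + "json\n" + plain + "\n" + FENCE
print(parse_model_json(plain) == parse_model_json(wrapped))
```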


Now all we have to do is merge these evaluations back into our conversation file.

In [22]:
joined_df = pd.merge(
    df,
    emotions_df,
    how="inner",
    left_on="id",
    right_index=True  # emotions_df is indexed by the row identifier
)
joined_df.drop(["outc", "id"], axis=1, inplace=True)
joined_df.head()

Unnamed: 0,chapter_index,chapter,dialogue_index,line_index,speaker,line,neutral,ambitious,happiness,surprise,anger,sadness,fear
0,11,Monoco’s Station,0,0,Verso,"Follow the tracks, they’ll lead us to Monoco.",0.7,0.3,0.0,0.0,0.0,0.0,0.0
1,11,Monoco’s Station,0,1,Maelle,A train. Gustave would have liked to see that.,0.4,0.0,0.0,0.0,0.0,0.6,0.0
2,11,Monoco’s Station,0,2,Lune,"Is it true, before the Fracture, Lumière had t...",0.8,0.0,0.0,0.2,0.0,0.0,0.0
3,11,Monoco’s Station,0,3,Verso,"Always running late, but running, yes.",1.0,0.0,0.0,0.0,0.0,0.0,0.0
4,11,Monoco’s Station,2,0,Verso,"If Monoco’s not here, that means– Slow as ever...",0.7,0.0,0.3,0.0,0.0,0.0,0.0
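
An inner join silently drops any line that the model did not score. One possible way to detect that, sketched here on toy data, is a left join with `indicator=True`, which adds a `_merge` column telling where each row came from.

```python
import pandas as pd

# Toy transcript with three lines, and toy scores for only two of them.
lines = pd.DataFrame({"id": ["0_0", "0_1", "0_2"],
                      "speaker": ["Verso", "Maelle", "Lune"]})
scores = pd.DataFrame.from_dict(
    {"0_0": {"neutral": 0.7, "ambitious": 0.3},
     "0_1": {"neutral": 1.0, "ambitious": 0.0}},
    orient="index",
)

# A left join keeps every transcript line; "_merge" marks the unmatched ones.
checked = lines.merge(scores, how="left", left_on="id",
                      right_index=True, indicator=True)
missing_ids = checked.loc[checked["_merge"] == "left_only", "id"].tolist()
print(missing_ids)
```

If `missing_ids` is not empty, the chapter could be sent to the model again instead of being saved with gaps.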


And... that's mostly it!

In production, I would save this DataFrame to `data/output/emotions_scored/<chapter_name>/<timestamp>.csv`, but for the sake of this simple example I will skip that step.
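
For reference, a minimal sketch of such a save step could look like the following. Only the `data/output/emotions_scored/<chapter_name>/<timestamp>.csv` layout comes from the text above; the `save_scores` name, the timestamp format and the temporary directory used as root are assumptions made so that the example runs anywhere.

```python
import tempfile
from datetime import datetime
from pathlib import Path

import pandas as pd

def save_scores(scored_df: pd.DataFrame, root, chapter: str) -> Path:
    """Write the scored lines to <root>/data/output/emotions_scored/<chapter>/<timestamp>.csv."""
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")  # assumed format
    out_dir = Path(root) / "data" / "output" / "emotions_scored" / chapter
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{timestamp}.csv"
    scored_df.to_csv(out_path, index=False)
    return out_path

# Toy data standing in for the joined DataFrame.
toy = pd.DataFrame({"line": ["A train."], "neutral": [0.4], "sadness": [0.6]})
with tempfile.TemporaryDirectory() as tmp:
    saved = save_scores(toy, tmp, "11_Monocos_Station")
    reloaded = pd.read_csv(saved)
print(saved.name)
```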

### 4. Wrapping up
We have seen a very simple overview of the classification part of this project, with data processing quickly covered.

After classifying all the chapters and actually writing outputs, running `prep_for_dashboard.py` concatenates all of them and creates a final, single CSV file ready for visualization.
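
The concatenation idea can be sketched as follows. This is not the code of `prep_for_dashboard.py`, which is not shown here; the `concat_scored_csvs` name and the one-folder-per-chapter layout are assumptions, and the files are toy data written to a temporary directory.

```python
import tempfile
from pathlib import Path

import pandas as pd

def concat_scored_csvs(root) -> pd.DataFrame:
    """Read every <chapter>/<file>.csv under root and stack them into one DataFrame."""
    files = sorted(Path(root).glob("*/*.csv"))
    frames = [pd.read_csv(file) for file in files]
    return pd.concat(frames, ignore_index=True)

with tempfile.TemporaryDirectory() as tmp:
    # Create two toy chapter folders with 2 and 3 scored lines.
    for chapter, n_lines in [("chapter_a", 2), ("chapter_b", 3)]:
        chapter_dir = Path(tmp) / chapter
        chapter_dir.mkdir()
        toy = pd.DataFrame({"chapter": [chapter] * n_lines, "neutral": [1.0] * n_lines})
        toy.to_csv(chapter_dir / "run.csv", index=False)
    combined = concat_scored_csvs(tmp)

print(len(combined))
```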

Data visualization is done externally in Tableau. If you are curious, you can check out [my dashboard](https://public.tableau.com/views/ClairObscurExpedition33EmotionClassification/Dashboard) on Tableau Public for free!

Thanks for checking out my passion project!