1. Data Ingestion
This project is founded upon two major data requirements: the text transcript of the game's dialogues, and the audio lines themselves.
"If I can acquire them both, possibly at the same time, I'm in for the next step", I thought.
And that is true, only... it's not as simple as it sounds.
To gather all of the dialogues, in both formats, I had a very hacky idea. I would play the game, screen record it, then apply two functions to the video footage:
- "watch" the video, checking for the existence of static elements such as the in-game HUD (health bar etc.) and the subtitles, to determine what kind of scene it is
- if the scene contains subtitles, perform Optical Character Recognition (OCR) to transcribe the subtitles into a transcript.
... Yeah. You can see as many problems as you want in this, for example, but not only:
- how do I know which chapter I am in, based on the subtitles alone?
- how do I know which "line" of dialogue the character is at, since a line can span more than one subtitle block?
- how do I get the lines of dialogue that are not in the subtitles? For example, some dialogue lines are narrated at the bottom right and have no clear indication of the character speaking, just the character's icon
- what if the OCR does not recognize the characters correctly? Or if it picks up text from something that is not a dialogue?
Some screenshots to give a better idea:
| ![]() |
|---|
| Combat: recognizable by the HUD |
| ![]() |
| Dialogue: recognizable by the subtitles |
| ![]() |
| Open World Exploration: recognizable by the bottom-left HUD |
| ![]() |
| Side dialogues: recognizable by the bottom-right HUD |
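To make the idea concrete, here is a minimal sketch of the scene-classification rule that the four screenshots suggest. It assumes some earlier step has already detected which static elements are visible in a frame; that detection is not shown. The names `FrameElements`, `classify_scene` and `needs_ocr` are hypothetical, and so is the order in which the checks are applied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrameElements:
    """Static UI elements found in one video frame (hypothetical detector output)."""
    combat_hud: bool = False
    subtitles: bool = False
    bottom_left_hud: bool = False
    bottom_right_hud: bool = False


def classify_scene(elements: FrameElements) -> str:
    """Map detected elements to one of the four scene types from the screenshots.

    The priority order is an assumption: subtitles win over everything else,
    because they are what the OCR step needs.
    """
    if elements.subtitles:
        return "dialogue"
    if elements.bottom_right_hud:
        return "side_dialogue"
    if elements.combat_hud:
        return "combat"
    if elements.bottom_left_hud:
        return "open_world"
    return "unknown"


def needs_ocr(scene: str) -> bool:
    """Only scenes with subtitles are sent to OCR, as described above."""
    return scene == "dialogue"
```

Even this toy version shows the weak spot: a side dialogue has no subtitles, so `needs_ocr` skips it and its lines never reach the transcript.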
So... hard work, but... necessary? It's not like all the dialogues of the game are just sitting there, freely and publicly available, right? Right?
Yes, yes they are! Take a look at this awesome website: it has all I need, and in a simple, fancy style! No dynamic loading of the page, no registration, no cookies, nothing that would otherwise impede "simple, quick and traditional" web scraping!
This was definitely the quickest and easiest part of the project, thanks to the source data being already neatly organized. I could loop over the pages, one for each chapter, get the chapter name, loop over the dialogues and assign an index to each of them, then finally loop over the lines.
![]() |
|---|
| An outline of the logic |
Naturally, along with the line itself, I also kept track of which character was delivering that line. There was a minor inconvenience, in that the website also keeps track of the narrator lines (which no character speaks) and they are formatted differently, but it was only a minor edit.
My scraping script was definitely not as organized as it is today, but the major logic was already there:
- Load the base webpage
- Get the title, which is the chapter name
- Find all the paragraphs (i.e., the dialogues)
- For each dialogue, index it and find all the lines
- For each line, index it and split the speaker from the actual line
- When all dialogues have been processed, write the output for the current chapter in a CSV file
- Move on to the next page (if there is one, otherwise terminate the script)
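The steps above can be sketched with the standard library alone. This is not the project's actual script, only an illustration under stated assumptions: the chapter name sits in `<title>`, each dialogue is one `<p>`, lines inside a dialogue are separated by `<br>`, a spoken line looks like `Speaker: text`, and a line without a colon is a narrator line. The real website's markup may differ. `ChapterParser`, `split_speaker`, `chapter_rows`, `write_csv` and the sample page are all hypothetical.

```python
import csv
import io
from html.parser import HTMLParser


class ChapterParser(HTMLParser):
    """Collects the page title and the raw text lines of every <p> block."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.dialogues = []  # one list of raw lines per <p>
        self._in_title = False
        self._in_p = False
        self._buffer = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "p":
            self._in_p = True
            self.dialogues.append([])
            self._buffer = ""
        elif tag == "br" and self._in_p:
            self._flush()  # a <br> ends the current line

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "p" and self._in_p:
            self._flush()
            self._in_p = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_p:
            self._buffer += data

    def _flush(self):
        text = " ".join(self._buffer.split())  # normalize whitespace
        if text:
            self.dialogues[-1].append(text)
        self._buffer = ""


def split_speaker(raw_line):
    """Split 'Speaker: text'; lines without a colon are treated as narrator lines."""
    speaker, sep, text = raw_line.partition(":")
    if not sep:
        return "NARRATOR", raw_line.strip()
    return speaker.strip(), text.strip()


def chapter_rows(html):
    """Return (chapter, dialogue index, line index, speaker, text) tuples for one page."""
    parser = ChapterParser()
    parser.feed(html)
    parser.close()
    chapter = parser.title.strip()
    rows = []
    non_empty = [lines for lines in parser.dialogues if lines]
    for d_idx, lines in enumerate(non_empty):
        for l_idx, raw in enumerate(lines):
            speaker, text = split_speaker(raw)
            rows.append((chapter, d_idx, l_idx, speaker, text))
    return rows


def write_csv(rows, handle):
    """Write one chapter's rows to an open text file handle."""
    writer = csv.writer(handle)
    writer.writerow(["chapter", "dialogue", "line", "speaker", "text"])
    writer.writerows(rows)


# Made-up page in the assumed layout, so the sketch runs without network access.
SAMPLE_PAGE = """<html><head><title>Chapter 1</title></head><body>
<p>Alice: Hello there.<br>Bob: Hi!</p>
<p>The sun sets over the valley.<br>Alice: Time to go.</p>
</body></html>"""

if __name__ == "__main__":
    out = io.StringIO()
    write_csv(chapter_rows(SAMPLE_PAGE), out)
    print(out.getvalue())
```

Fetching each page and following the link to the next chapter are left out. In a real run the HTML string would come from an HTTP request, and the loop would stop when there is no next page.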
Scraping is still the first step of the whole pipeline and is invoked in `main.py`.
Once I had downloaded all the chapters' dialogues, I started to watch my recorded footage with the transcript at hand, to check that all the lines were in the right place. And they were, just... some were extra.
Why would there be extra lines?
Well, simply, I had missed those lines in my playthrough. And surely I wasn't going to restart the whole game to re-record them, nor was I going to look for them in other gameplays online. I just thought "well, who cares, I'll keep them regardless".
Well, keeping them all was not exactly smart: since I am going to prompt an LLM with them, what if the model does not understand that an audio line might be missing even though the text line is present? Will it know to skip over them?
I quickly sliced a short piece of footage, keeping only a few minutes of audio dialogues. I then prompted the model with both audio and text dialogues in OpenAI Playground and... yeah, as expected, the model evaluated the emotions of all lines, regardless of whether they were present in the audio or not. No prompt engineering could make the model focus only on the audio and rely on the text just as a reference.
I didn't want to keep estimates that the model made on text-only data, because I thought it would be hard even for a human to read a line and guess its emotions.
So, I needed to strip away all the lines that were not present in my gameplay. But how?




