
0. Tech rundown

Davide edited this page Dec 13, 2025 · 5 revisions

Installation

This repository uses Git LFS to store large files (WAV and MP3). For a fully functional environment, I recommend downloading them as well, simply by cloning the entire repository.

If you are not interested in running any code, you can speed up cloning (and save disk space) by skipping the large files with the following commands.

Linux:

$ GIT_LFS_SKIP_SMUDGE=1 git clone https://github.com/DavMrc/ClairObscurSentimentAnalysis
$ cd ClairObscurSentimentAnalysis

Windows:

$ set GIT_LFS_SKIP_SMUDGE=1
$ git clone https://github.com/DavMrc/ClairObscurSentimentAnalysis
$ cd ClairObscurSentimentAnalysis

If you later want to download the large files as well, install Git LFS and pull them:

$ git lfs pull

Python environment

Create a virtual environment, activate it and install the requirements.

Linux:

$ python -m venv venv
$ source venv/bin/activate
$ pip install -r requirements.txt

Windows:

$ python -m venv venv
$ venv\Scripts\activate
$ pip install -r requirements.txt

Run the data processing stack

Dialogue transcript scraping, line editing and splitting can all be run via main.py, which accepts the following command-line arguments:

usage: main.py [-h] [--no-scraper] [--no-editor] [--no-splitter] [--keep-narrator] [--keep-gibberish]

options:
  -h, --help        show this help message and exit
  --no-scraper      Do not run the Scraper
  --no-editor       Do not run the Editor
  --no-splitter     Do not run the Splitter
  --keep-narrator   Keep the narrator lines
  --keep-gibberish  Do not add a "(gibberish)" prefix to all the lines in gibberish

First, I'll focus only on the last two options:

  • --keep-narrator: keeps the narrator lines, which are present in the raw transcript and are discarded by default.
  • --keep-gibberish: does not add a "(gibberish)" prefix to the lines in gibberish. Some characters, like gestrals, grandis and faceless entities, have no intelligible voice lines and only mutter unintelligible language. To make sure these lines are identified as "unintelligible" when prompting an LLM, I have decided to prefix them by default.
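
As an illustration, the usage text above can be reproduced with Python's argparse. This is a minimal sketch reconstructed from the help output, not the code in main.py; the helper names build_parser and steps_to_run are hypothetical, and the step order (scraper, editor, splitter) is assumed from the folders each step reads and writes.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser matching the usage text (a reconstruction, not the repo's code)."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("--no-scraper", action="store_true", help="Do not run the Scraper")
    parser.add_argument("--no-editor", action="store_true", help="Do not run the Editor")
    parser.add_argument("--no-splitter", action="store_true", help="Do not run the Splitter")
    parser.add_argument("--keep-narrator", action="store_true", help="Keep the narrator lines")
    parser.add_argument(
        "--keep-gibberish",
        action="store_true",
        help='Do not add a "(gibberish)" prefix to all the lines in gibberish',
    )
    return parser


def steps_to_run(args: argparse.Namespace) -> list[str]:
    """Return the pipeline steps left enabled, in the assumed execution order."""
    steps = []
    if not args.no_scraper:
        steps.append("scraper")
    if not args.no_editor:
        steps.append("editor")
    if not args.no_splitter:
        steps.append("splitter")
    return steps
```

With this sketch, `python main.py --no-scraper --keep-narrator` would skip the scraping step, run the editor and the splitter, and keep the narrator lines.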

The following sections explain in more detail what each step does.

Scraper

scraper.py, as expected, scrapes all the lines of dialogue from this website and saves them as organized CSV files under data/csv/1_raw, overwriting any existing files.

It uses the requests and BeautifulSoup libraries, which work on the raw HTML and do not require browser automation.

Editor

editor.py is a module that deletes (and optionally inserts) manually specified lines of dialogue in the raw, scraped transcripts. The goal is to "align" the scraped transcript with the audio footage, which I recorded myself. Since, for various reasons, I was not able to record parts of some dialogues, deleting the corresponding lines from the transcript is the only way to keep the data consistent across the two formats.

This module reads a manually-created JSON file called edit_rules.json that specifies which line ranges to delete.

The structure for line deletions is as follows (the # comments are explanatory only; standard JSON does not allow comments):

{
	"source": "0_The_Gommage",  # chapter name, just as it was scraped and saved in data/csv/1_raw
	"ranges": [
		{
			"dial_s": 2,  # dialogue index start
			"line_s": 6,  # line index start
			"dial_e": 2,  # dialogue index end
			"line_e": 6   # line index end
		},
		{"dial_s": 4, "line_s": 3, "dial_e": 4, "line_e": 6},
		...
	]
}

Once it has finished, the module saves the edited transcripts into the folder data/csv/2_edits.

Additionally, in the same file, the user can provide a list of files that will be added to the edited transcripts:
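
To make the deletion rule concrete, here is a minimal sketch of how such ranges could be applied. It assumes that both ends of a range are inclusive (the first example above starts and ends on the same line, which would otherwise delete nothing) and that positions compare as (dialogue index, line index) pairs. The row keys "dial" and "line" and both function names are hypothetical, not taken from editor.py.

```python
def in_range(dial: int, line: int, rng: dict) -> bool:
    """True if (dial, line) lies inside the inclusive range described by rng."""
    start = (rng["dial_s"], rng["line_s"])
    end = (rng["dial_e"], rng["line_e"])
    # Tuples compare element by element, so ranges may span several dialogues.
    return start <= (dial, line) <= end


def delete_ranges(rows: list[dict], ranges: list[dict]) -> list[dict]:
    """Drop every row whose (dialogue, line) position falls in any range."""
    return [
        row for row in rows
        if not any(in_range(row["dial"], row["line"], rng) for rng in ranges)
    ]
```

Comparing tuples keeps the check short and handles a range that starts in one dialogue and ends in another without special cases.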

"inserts": [
    "29_A_Life_to_Paint",
    "30_A_Life_to_Love"
]

These files must be placed in the folder data/csv/2_edits/custom_inserts. If a file with the same name already exists under data/csv/2_edits, the insert will overwrite it.

NOTE: if no edits have been declared for a chapter, the editor simply copies the transcript for that chapter, unchanged, from data/csv/1_raw to data/csv/2_edits.
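
The copy and insert behaviour described above can be sketched as follows. This is only an illustration under assumptions: the function name finalize_edits, the .csv file extension and the `edited` argument (the set of chapters that had deletion rules and were already written by an earlier step) are all hypothetical.

```python
import shutil
from pathlib import Path


def finalize_edits(raw_dir: Path, edits_dir: Path, edited: set[str], inserts: list[str]) -> None:
    """Copy untouched chapters, then apply custom inserts on top."""
    edits_dir.mkdir(parents=True, exist_ok=True)
    # Chapters with no declared edits are copied as they are.
    for raw_file in sorted(raw_dir.glob("*.csv")):
        if raw_file.stem not in edited:
            shutil.copy2(raw_file, edits_dir / raw_file.name)
    # Inserts come last, so they replace any transcript with the same name.
    for name in inserts:
        src = edits_dir / "custom_inserts" / f"{name}.csv"
        shutil.copy2(src, edits_dir / src.name)
```

Applying the inserts after the plain copies is what makes them win when a file with the same name already exists.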

Splitter

splitter.py is a module that splits each chapter according to the rules defined in a manually-created JSON file called split_rules.json. For each chapter, it splits:

  • the transcript
  • the audio footage

The module reads audio files from data/audio/2_edits in WAV format and transcripts from data/csv/2_edits in CSV format, the latter previously generated by the editor module. It then loops over the rules defined in the split_rules.json file (the # comments below are explanatory only):

{
	"source": "0_The_Gommage",  # chapter name
	"ranges": [
		{
			"dial_s": 0,  # dialogue index start
			"line_s": 0,  # line index start
			"dial_e": 3,  # dialogue index end
			"line_e": 17  # line index end (-1 means last line)
		},
		...
	],
	"timestamps": [
		"04:36",  # first timestamp at which to split
		"08:05",  # second timestamp
		...
	]
}

After a chapter has been processed, the transcript splits are saved in data/csv/3_splits, and the audio splits are converted to MP3 and saved in data/audio/3_splits.

NOTE: if no splits have been declared for a chapter, the only step that is still performed is the conversion from WAV to MP3.
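
As a rough sketch of the audio side, the timestamps can be turned into segment boundaries like this. It assumes that timestamps use the "MM:SS" format shown above and that N timestamps cut the audio into N + 1 consecutive pieces; both function names are hypothetical, and the actual slicing and MP3 conversion are left out.

```python
def to_ms(timestamp: str) -> int:
    """Convert an "MM:SS" timestamp into milliseconds."""
    minutes, seconds = timestamp.split(":")
    return (int(minutes) * 60 + int(seconds)) * 1000


def segments(timestamps: list[str], total_ms: int) -> list[tuple[int, int]]:
    """Turn N cut points into N + 1 consecutive (start_ms, end_ms) segments."""
    cuts = [0] + [to_ms(t) for t in timestamps] + [total_ms]
    return list(zip(cuts[:-1], cuts[1:]))
```

For example, with the two timestamps above and a 10-minute recording, `segments(["04:36", "08:05"], 600_000)` yields three segments: 0 to 276000 ms, 276000 to 485000 ms, and 485000 to 600000 ms.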

Running classification

NOTE: To run the classification step, you need to provide your OpenAI API key to the module. By default, I have set the repo up to read the key from data/open_ai_token.txt, but you can provide it however you like to the authorize() method of the Classifier class.
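
If you prefer to load the key yourself, a small helper along these lines would do. This is a sketch, not code from the repo: load_api_key is a hypothetical name, and the fallback to the OPENAI_API_KEY environment variable is my own addition here, not something classifier.py is documented to do.

```python
import os
from pathlib import Path


def load_api_key(path: str = "data/open_ai_token.txt") -> str:
    """Return the key stored in `path`, falling back to OPENAI_API_KEY."""
    token_file = Path(path)
    if token_file.is_file():
        # Strip the trailing newline most editors add to the file.
        key = token_file.read_text(encoding="utf-8").strip()
    else:
        key = os.environ.get("OPENAI_API_KEY", "")
    if not key:
        raise RuntimeError(f"No API key found in {path} or OPENAI_API_KEY")
    return key
```

The returned string can then be handed to the authorize() method; its exact signature is not covered on this page, so check classifier.py before wiring this in.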

You can run classifier.py right after downloading all the media files in the repo (including those tracked by Git LFS) or, in general, after completing the splitting step.

This module has a curses command-line interface that lets you choose which chapters to classify:

  • By using the arrow keys, the user can select an entire chapter (i.e. all its splits) or only some splits of a chapter
  • There are shortcut keys to select or deselect all the chapters
  • Before the title of the chapter, you can see how many splits you have selected / the total number of splits
  • A [ ] or [C] marker shows whether a chapter has already been classified

After selecting the desired chapters, the UI continues with a recap of the selected chapters and asks the user for confirmation. After that, the classification begins.

For each selected chapter, the module loops over its splits as audio/transcript pairs, prompting the model with both audio and text.

The model's response is saved both raw and parsed in the data/output folder. Once all pairs of a chapter have been classified:
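
The pairing of audio and text could be done along these lines. This is a sketch under assumptions: it supposes that every split is stored as an MP3 and a CSV sharing the same file stem, and that stems start with the chapter name. The function name pair_splits and that naming scheme are hypothetical.

```python
from pathlib import Path


def pair_splits(audio_dir: Path, csv_dir: Path, chapter: str) -> list[tuple[Path, Path]]:
    """Pair each MP3 split of a chapter with the CSV that shares its file stem."""
    transcripts = {p.stem: p for p in csv_dir.glob(f"{chapter}*.csv")}
    pairs = []
    for audio in sorted(audio_dir.glob(f"{chapter}*.mp3")):
        # Skip audio files that have no matching transcript.
        if audio.stem in transcripts:
            pairs.append((audio, transcripts[audio.stem]))
    return pairs
```

Each returned pair would then be sent to the model in a single prompt.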

  • the raw API responses are saved under data/output/api_responses/<chapter>/
  • the parsed output is concatenated and saved in data/output/emotions_scored/<chapter>/

Inspecting the output of a classification with charts

The output of a classification may vary, especially while the prompt is still being engineered. I have created the module viz_output.py to analyze a single output on its own, or to compare two outputs side by side.

The module uses the Streamlit library to create a simple data application. It can be run via the command streamlit run viz_output.py

[Screenshots: single inspection mode, comparison mode]

Building the output

Once you are satisfied with the classification outputs, you can run prep_for_dashboard.py, which iterates over all classified chapters (i.e., all folders under data/output/emotions_scored/), takes the most recent file from each and concatenates them.
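
The "most recent file per chapter" logic can be sketched as below. The assumptions are mine: outputs are taken to be CSV files, "most recent" is taken to mean the latest modification time, and the added "chapter" column and both function names are hypothetical. prep_for_dashboard.py may well do this differently.

```python
import csv
from pathlib import Path
from typing import Optional


def latest_file(chapter_dir: Path) -> Optional[Path]:
    """Return the most recently modified CSV in a chapter folder, if any."""
    files = list(chapter_dir.glob("*.csv"))
    return max(files, key=lambda p: p.stat().st_mtime) if files else None


def concat_latest(scored_dir: Path) -> list[dict]:
    """Concatenate the rows of the newest CSV of every chapter folder."""
    rows = []
    for chapter_dir in sorted(p for p in scored_dir.iterdir() if p.is_dir()):
        newest = latest_file(chapter_dir)
        if newest is None:
            continue  # chapter folder exists but holds no output yet
        with newest.open(newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                # Tag each row with its chapter so rows stay traceable.
                rows.append({"chapter": chapter_dir.name, **row})
    return rows
```

Calling `concat_latest(Path("data/output/emotions_scored"))` would return one list of rows covering every classified chapter.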
