0. Tech rundown

Davide edited this page Dec 11, 2025 · 5 revisions

Installation

This repository uses Git LFS to store large files (WAV and MP3). For a fully functional environment, I recommend downloading them as well by cloning the entire repository with Git LFS installed.

If you are not interested in running any code, you can speed up cloning (and save disk space) by skipping the large files with the following commands.

Linux:

$ GIT_LFS_SKIP_SMUDGE=1 git clone https://github.com/DavMrc/ClairObscurSentimentAnalysis
$ cd ClairObscurSentimentAnalysis

Windows:

$ set GIT_LFS_SKIP_SMUDGE=1
$ git clone https://github.com/DavMrc/ClairObscurSentimentAnalysis
$ cd ClairObscurSentimentAnalysis

If you later want to download the large files as well, install Git LFS and pull them:

$ git lfs pull

Python environment

Create a virtual environment, activate it and install the requirements.

Linux:

$ python -m venv venv
$ source venv/bin/activate
$ pip install -r requirements.txt

Windows:

$ python -m venv venv
$ venv\Scripts\activate
$ pip install -r requirements.txt

Run the data processing stack

Dialogue transcript scraping, line editing and splitting can all be run via main.py, which accepts the following command-line arguments:

usage: main.py [-h] [--no-scraper] [--no-editor] [--no-splitter] [--keep-narrator] [--keep-gibberish]

options:
  -h, --help        show this help message and exit
  --no-scraper      Do not run the Scraper
  --no-editor       Do not run the Editor
  --no-splitter     Do not run the Splitter
  --keep-narrator   Keep the narrator lines
  --keep-gibberish  Do not add a "(gibberish)" prefix to all the lines in gibberish

First, I'll focus only on the last two options:

  • --keep-narrator: keeps the narrator lines, which are part of the raw transcript and are discarded by default.
  • --keep-gibberish: does not add a "(gibberish)" prefix to the lines spoken in gibberish. Some characters, like gestrals, grandis and faceless entities, do not have voice lines; they only mutter unintelligible language. So that these lines are identified as "unintelligible" when prompting an LLM, they are prefixed by default.
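For reference, the usage text above can be reproduced with a small argparse definition. This is a minimal sketch, not the actual contents of main.py: the function name build_parser is hypothetical, and only the flag names and help strings are taken from the usage output.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the options listed in the usage text.

    Hypothetical helper: main.py may define its arguments differently.
    """
    parser = argparse.ArgumentParser(prog="main.py")
    # Every option is a boolean switch that defaults to False.
    parser.add_argument("--no-scraper", action="store_true", help="Do not run the Scraper")
    parser.add_argument("--no-editor", action="store_true", help="Do not run the Editor")
    parser.add_argument("--no-splitter", action="store_true", help="Do not run the Splitter")
    parser.add_argument("--keep-narrator", action="store_true", help="Keep the narrator lines")
    parser.add_argument(
        "--keep-gibberish",
        action="store_true",
        help='Do not add a "(gibberish)" prefix to all the lines in gibberish',
    )
    return parser


if __name__ == "__main__":
    # Example: skip scraping and keep the narrator lines.
    args = build_parser().parse_args(["--no-scraper", "--keep-narrator"])
    print(args.no_scraper, args.keep_narrator, args.keep_gibberish)
```

With this layout, running main.py without any flag executes all three steps, drops the narrator lines and prefixes the gibberish lines.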

The following sections explain in more detail what each step does.

Scraper

scraper.py scrapes all the lines of dialogue from this website and saves them as organized CSV files under data/csv/1_raw, overwriting any existing files.

The scraper uses the requests and beautifulsoup modules, which work on the raw HTML and do not require browser automation.

Editor

editor.py is a module that deletes manually specified lines of dialogue from the raw, scraped transcripts (and optionally inserts custom ones). Its purpose is to align the scraped transcript with the lines of dialogue in the audio footage, which I recorded. Since, for various reasons, I was not able to record parts of some dialogues, deleting the corresponding lines from the transcript is the only way to keep the data consistent across the two formats.

This module reads a manually-created JSON file called edit_rules.json that specifies which line ranges to delete.

The structure for line deletions is as follows:

{
	"source": "0_The_Gommage",  # chapter name, just as it was scraped and saved in data/csv/1_raw
	"ranges": [
		{
			"dial_s": 2,  # dialogue index start
			"line_s": 6,  # line index start
			"dial_e": 2,  # dialogue index end
			"line_e": 6   # line index end
		},
		{"dial_s": 4, "line_s": 3, "dial_e": 4, "line_e": 6},
		...
	]
}

Once done, the module saves the edited transcripts in the folder data/csv/2_edits.
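To make the range semantics concrete, here is a minimal sketch of how a deletion rule could be applied. It rests on assumptions not confirmed by the source: each transcript row is modelled as a (dialogue index, line index, text) tuple, both ends of a range are inclusive, and a range may span several dialogues. The names in_range and delete_ranges are hypothetical and are not taken from editor.py.

```python
# Assumed row layout, for illustration only: (dialogue_index, line_index, text).
Row = tuple[int, int, str]


def in_range(dial: int, line: int, rng: dict) -> bool:
    """Return True if (dial, line) falls inside the rule, both ends inclusive."""
    start = (rng["dial_s"], rng["line_s"])
    end = (rng["dial_e"], rng["line_e"])
    # Tuples compare lexicographically: dialogue index first, then line index.
    return start <= (dial, line) <= end


def delete_ranges(rows: list[Row], ranges: list[dict]) -> list[Row]:
    """Keep only the rows that are not covered by any deletion range."""
    return [
        row for row in rows
        if not any(in_range(row[0], row[1], rng) for rng in ranges)
    ]


if __name__ == "__main__":
    rows = [(2, 5, "a"), (2, 6, "b"), (2, 7, "c"), (4, 2, "d"), (4, 3, "e"), (4, 6, "f")]
    rules = [
        {"dial_s": 2, "line_s": 6, "dial_e": 2, "line_e": 6},
        {"dial_s": 4, "line_s": 3, "dial_e": 4, "line_e": 6},
    ]
    print(delete_ranges(rows, rules))
```

Under these assumptions, the first rule of the example removes a single line (dialogue 2, line 6), while the second removes lines 3 to 6 of dialogue 4.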

Additionally, the same file can list transcript files to be added to the edited transcripts:

"inserts": [
    "29_A_Life_to_Paint",
    "30_A_Life_to_Love"
]

These files must be placed in the folder data/csv/2_edits/custom_inserts. If a file with the same name already exists under data/csv/2_edits, the insert will overwrite it.

NOTE: if no edits have been declared for a chapter, the editor simply copies the transcript for that chapter from data/csv/1_raw to data/csv/2_edits.

Splitter

splitter.py is a module that, according to the rules defined in a manually-created JSON file called split_rules.json, splits:

  • the transcripts
  • the audio footage for that chapter

The module reads audio files from data/audio/2_edits in WAV format and transcripts from data/csv/2_edits in CSV format, previously generated by the editor module. It then loops over the rules defined in the split_rules.json file:

{
	"source": "0_The_Gommage",  # chapter name
	"ranges": [
		{
			"dial_s": 0,  # dialogue index start
			"line_s": 0,  # line index start
			"dial_e": 3,  # dialogue index end
			"line_e": 17  # line index end (-1 means last line)
		},
		...
	],
	"timestamps": [
		"04:36",  # first timestamp at which to split
		"08:05",  # second timestamp
		...
	]
}

After completing a chapter, transcripts are saved in data/csv/3_splits and audio files are converted to MP3 and saved in data/audio/3_splits.

NOTE: if no splits have been declared for a chapter, the only step that is still performed is the conversion from WAV to MP3.
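As an illustration of the "timestamps" field, the sketch below turns a list of split points into audio segment boundaries. It is not taken from splitter.py and relies on assumptions: timestamps are read as "MM:SS", N timestamps produce N + 1 segments, and the last segment runs to the end of the recording (represented here by None). The names parse_timestamp and segment_bounds are hypothetical.

```python
from typing import Optional


def parse_timestamp(ts: str) -> int:
    """Convert an "MM:SS" string into a number of seconds (assumed format)."""
    minutes, seconds = ts.split(":")
    return int(minutes) * 60 + int(seconds)


def segment_bounds(timestamps: list[str]) -> list[tuple[int, Optional[int]]]:
    """Return (start, end) pairs in seconds; an end of None means "until the end"."""
    cuts = [parse_timestamp(ts) for ts in timestamps]
    starts = [0] + cuts      # every segment starts where the previous one ended
    ends = cuts + [None]     # the last segment is open-ended
    return list(zip(starts, ends))


if __name__ == "__main__":
    print(segment_bounds(["04:36", "08:05"]))
```

For the two timestamps shown in the example rule, this yields three segments: 0 to 276 seconds, 276 to 485 seconds, and 485 seconds to the end of the file.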

Running classification

WIP
