0. Tech rundown
This repository uses git lfs to check out large files (WAV and MP3). For a fully functional environment, I recommend downloading them as well, simply by cloning the entire repository.
If you are not interested in running any code, you can speed up cloning (and save disk space) by skipping the large files with the following commands.
Linux:
$ GIT_LFS_SKIP_SMUDGE=1 git clone https://github.com/DavMrc/ClairObscurSentimentAnalysis
$ cd ClairObscurSentimentAnalysis
Windows:
$ set GIT_LFS_SKIP_SMUDGE=1
$ git clone https://github.com/DavMrc/ClairObscurSentimentAnalysis
$ cd ClairObscurSentimentAnalysis
If, at a later stage, you also want to download the large files, you can install git lfs and pull them:
$ git lfs pull
Create a virtual environment, activate it and install the requirements.
Linux:
$ python -m venv venv
$ source venv/bin/activate
$ pip install -r requirements.txt
Windows:
$ python -m venv venv
$ venv\Scripts\activate
$ pip install -r requirements.txt
Dialogue transcript scraping, line editing and splitting can all be run via main.py. It accepts command-line arguments:
usage: main.py [-h] [--no-scraper] [--no-editor] [--no-splitter] [--keep-narrator] [--keep-gibberish]
options:
-h, --help show this help message and exit
--no-scraper Do not run the Scraper
--no-editor Do not run the Editor
--no-splitter Do not run the Splitter
--keep-narrator Keep the narrator lines
--keep-gibberish Do not add a "(gibberish)" prefix to all the lines in gibberish
First, I'll focus only on the latter two options:
- --keep-narrator: do not discard the narrator lines, which the raw transcript includes by default.
- --keep-gibberish: do not add a "(gibberish)" prefix to the lines spoken in gibberish. Some characters, like gestrals, grandis and faceless entities, do not have voice lines; instead, they only mutter unintelligible language. To ensure that these lines are identified as "unintelligible" when prompting an LLM, I have decided to prefix them by default.
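As an illustration, a command-line interface matching the usage text above could be declared with argparse roughly as follows. This is a minimal sketch, not the actual contents of main.py: the flag names and help strings come from the usage text, while `build_parser` and the way the flags are mapped to steps are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Flag names and help strings mirror the usage text; the function itself is hypothetical.
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("--no-scraper", action="store_true", help="Do not run the Scraper")
    parser.add_argument("--no-editor", action="store_true", help="Do not run the Editor")
    parser.add_argument("--no-splitter", action="store_true", help="Do not run the Splitter")
    parser.add_argument("--keep-narrator", action="store_true", help="Keep the narrator lines")
    parser.add_argument(
        "--keep-gibberish",
        action="store_true",
        help='Do not add a "(gibberish)" prefix to all the lines in gibberish',
    )
    return parser


# Every flag defaults to False, so all three steps run unless explicitly skipped.
args = build_parser().parse_args(["--no-scraper", "--keep-gibberish"])
skipped = {"scraper": args.no_scraper, "editor": args.no_editor, "splitter": args.no_splitter}
steps = [name for name, skip in skipped.items() if not skip]
print(steps)  # ['editor', 'splitter']
```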
Now to explain a bit more in detail what each step does.
scraper.py, as expected, scrapes all the lines of dialogue from this website and saves them as organized CSV files under data/csv/1_raw, overwriting them if existing.
This uses the requests and beautifulsoup modules, which do not require browser automation, relying only on raw HTML.
editor.py is a module that deletes (and optionally inserts) manually specified lines of dialogue from the raw, scraped dialogue transcripts. This is done in order to "align" the scraped transcript with the lines of dialogue in the audio footage, which I recorded. Since, for various reasons, I was not able to record parts of some dialogues, deleting the corresponding lines from the transcript is the only way to have consistent data across the two formats.
This module reads a manually-created JSON file called edit_rules.json that specifies which line ranges to delete.
The structure for line deletions is as follows:
{
  "source": "0_The_Gommage",  # chapter name, just as it was scraped and saved in data/csv/1_raw
  "ranges": [
    {
      "dial_s": 2,  # dialogue index start
      "line_s": 6,  # line index start
      "dial_e": 2,  # dialogue index end
      "line_e": 6   # line index end
    },
    {"dial_s": 4, "line_s": 3, "dial_e": 4, "line_e": 6},
    ...
  ]
}
After completing, the module saves the edited transcripts into the folder data/csv/2_edits.
Additionally, in the same file, the user can provide a list of files that will be added to the edited transcripts.
"inserts": [
  "29_A_Life_to_Paint",
  "30_A_Life_to_Love"
]
These files must be placed in the folder data/csv/2_edits/custom_inserts. If a file with the same name already exists under data/csv/2_edits, the insert will overwrite it.
NOTE: if no edits have been declared for a chapter, the editor will just copy-paste the transcript for that chapter from data/csv/1_raw to data/csv/2_edits
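To make the deletion ranges described above more concrete, here is a minimal sketch of how such a rule could be applied to a transcript. It is not the code of editor.py: the row keys `dialogue` and `line`, the helper names, and the assumption that both ends of a range are inclusive are all illustrative.

```python
def in_range(dialogue: int, line: int, rng: dict) -> bool:
    # Compare (dialogue, line) pairs lexicographically; both ends are assumed inclusive.
    start = (rng["dial_s"], rng["line_s"])
    end = (rng["dial_e"], rng["line_e"])
    return start <= (dialogue, line) <= end


def delete_ranges(rows: list, ranges: list) -> list:
    # Keep only the rows that fall outside every deletion range.
    return [
        row for row in rows
        if not any(in_range(row["dialogue"], row["line"], r) for r in ranges)
    ]


# Toy transcript: dialogues 2 and 4, each with lines 1 to 7 (hypothetical data).
rows = [{"dialogue": d, "line": n, "text": f"d{d} l{n}"} for d in (2, 4) for n in range(1, 8)]
rules = {
    "source": "0_The_Gommage",
    "ranges": [
        {"dial_s": 2, "line_s": 6, "dial_e": 2, "line_e": 6},
        {"dial_s": 4, "line_s": 3, "dial_e": 4, "line_e": 6},
    ],
}
kept = delete_ranges(rows, rules["ranges"])
print(len(rows), len(kept))  # 14 9
```

Treating the indices as tuples lets a single range span several dialogues without special cases.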
splitter.py is a module that, according to the rules defined in a manually-created JSON file called split_rules.json, splits:
- the transcripts
- the audio footage for that chapter
The module reads audio files from data/audio/2_edits in WAV format and transcripts from data/csv/2_edits in CSV format, as previously generated by the editor module. It then loops over the rules defined in the split_rules.json file:
{
  "source": "0_The_Gommage",  # chapter name
  "ranges": [
    {
      "dial_s": 0,   # dialogue index start
      "line_s": 0,   # line index start
      "dial_e": 3,   # dialogue index end
      "line_e": 17   # line index end (-1 means last line)
    },
    ...
  ],
  "timestamps": [
    "04:36",  # first timestamp at which to split
    "08:05",  # second timestamp
    ...
  ]
}
After completing a chapter, transcripts are saved in data/csv/3_splits and audio files are converted to MP3 and saved in data/audio/3_splits.
NOTE: if no splits have been declared for a chapter, the only step that is still performed is the conversion from WAV to MP3
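The timestamps in the rules above can be turned into audio segment boundaries with a few lines of code. The sketch below is an assumption about how this might work, not the logic of splitter.py: the function names are hypothetical, and it assumes that N timestamps produce N+1 segments, with the last one running to the end of the recording.

```python
def to_seconds(timestamp: str) -> int:
    # Convert an "MM:SS" string into a number of seconds.
    minutes, seconds = timestamp.split(":")
    return int(minutes) * 60 + int(seconds)


def segments(timestamps: list, total_seconds: int) -> list:
    # Build (start, end) pairs, in seconds, from consecutive cut points.
    cuts = [0] + [to_seconds(t) for t in timestamps] + [total_seconds]
    return list(zip(cuts[:-1], cuts[1:]))


# total_seconds=600 is a made-up recording length for the example.
print(segments(["04:36", "08:05"], total_seconds=600))
# [(0, 276), (276, 485), (485, 600)]
```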
NOTE: To run the classification step, you need to provide your OpenAI API key to the module. By default, I have set the repo up to read the key from data/open_ai_token.txt, but you can provide it however you like to the authorize() method of the Classifier class.
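If you prefer to load the key yourself, a small helper along these lines would do. This is only a sketch: `read_api_key` is a hypothetical name, and the signature of authorize() is not shown here, so how the returned string is passed to it is left open.

```python
from pathlib import Path


def read_api_key(path: str = "data/open_ai_token.txt") -> str:
    # Read the key from a plain-text file, dropping surrounding whitespace and newlines.
    key = Path(path).read_text(encoding="utf-8").strip()
    if not key:
        raise ValueError(f"{path} is empty")
    return key
```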
You can run classifier.py straight after downloading all the media in the repo (including the files tracked by git lfs) or, in general, after having completed the splitting step.
This module has a curses command-line interface that lets you choose which chapters to classify
- By using the arrow keys, the user can select an entire chapter (i.e. all its splits) or only some splits of a chapter
- There are shortcut keys to select or deselect all the chapters
- Before the title of the chapter, you can see how many splits you have selected / the total number of splits
- Also, a [ ] or [C] marker identifies chapters that have already been classified
After selecting the desired chapters, the UI continues with a recap of the selected chapters and asks the user for confirmation. After that, the classification begins.
For each selected chapter, the module loops over its splits in pairs, prompting the model with both audio and text.
The model's response is saved both raw and parsed in the data/output folder. After all pairs of a chapter have been classified:
- the raw API responses are saved under data/output/api_responses/<chapter>/
- the parsed output is concatenated and saved in data/output/emotions_scored/<chapter>/
The output of the classification may vary between runs, especially while prompt engineering. I have created the module viz_output.py to analyze a single output on its own, or to compare two outputs side by side.
The module uses the Streamlit library to create a simple data application. It can be run via the command streamlit run viz_output.py
[Screenshots: single inspection mode (left), comparison mode (right)]
Once you are satisfied with the outputs of the classification, you can run prep_for_dashboard.py, which iterates over all classified chapters (i.e., all folders under data/output/emotions_scored/), takes the most recent file from each and concatenates them together.
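The "most recent file per chapter" step can be sketched as follows. This is not the code of prep_for_dashboard.py: it assumes that the parsed outputs are CSV files, that "most recent" means the latest modification time, and that a `chapter` column is added while concatenating. All function names are hypothetical.

```python
import csv
from pathlib import Path
from typing import Optional


def latest_file(folder: Path) -> Optional[Path]:
    # Pick the CSV with the newest modification time, or None if the folder has none.
    files = [p for p in folder.glob("*.csv") if p.is_file()]
    return max(files, key=lambda p: p.stat().st_mtime, default=None)


def concat_latest(root: Path) -> list:
    # Walk the chapter folders in name order and stack the rows of each newest file.
    rows = []
    for chapter_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        newest = latest_file(chapter_dir)
        if newest is None:
            continue
        with newest.open(newline="", encoding="utf-8") as fh:
            for row in csv.DictReader(fh):
                rows.append({"chapter": chapter_dir.name, **row})
    return rows
```

It would be called as `concat_latest(Path("data/output/emotions_scored"))`. If the output file names already carry a timestamp, sorting by name instead of modification time may be more robust.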

