0. Tech rundown
This repository uses Git LFS to store large files (WAV and MP3). For a fully functional environment, I recommend downloading them as well, simply by cloning the entire repository.
If you are not interested in running any code, you can speed up cloning (and save disk space) by skipping the large files with the following commands.
Linux:
$ GIT_LFS_SKIP_SMUDGE=1 git clone https://github.com/DavMrc/ClairObscurSentimentAnalysis
$ cd ClairObscurSentimentAnalysis
Windows:
$ set GIT_LFS_SKIP_SMUDGE=1
$ git clone https://github.com/DavMrc/ClairObscurSentimentAnalysis
$ cd ClairObscurSentimentAnalysis
If, at a later stage, you also want to download the large files, you can install git lfs and pull them:
$ git lfs pull
Create a virtual environment, activate it and install the requirements.
Linux:
$ python -m venv venv
$ source venv/bin/activate
$ pip install -r requirements.txt
Windows:
$ python -m venv venv
$ venv\Scripts\activate
$ pip install -r requirements.txt
Dialogue transcript scraping, line editing and splitting can all be run via main.py, which accepts the following command-line arguments:
usage: main.py [-h] [--no-scraper] [--no-editor] [--no-splitter] [--keep-narrator] [--keep-gibberish]
options:
-h, --help show this help message and exit
--no-scraper Do not run the Scraper
--no-editor Do not run the Editor
--no-splitter Do not run the Splitter
--keep-narrator Keep the narrator lines
--keep-gibberish Do not add a "(gibberish)" prefix to all the lines in gibberish
First, I'll focus only on the latter two options:
- --keep-narrator: does not discard the narrator lines from the raw transcript, which includes them by default.
- --keep-gibberish: does not add a "(gibberish)" prefix to the lines in gibberish. Some characters, like gestrals, grandis and faceless entities, do not have intelligible voice lines; instead, they only mutter unintelligible sounds. To ensure that an LLM can identify these lines as "unintelligible" when prompted, I have decided to prefix them by default.
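The flags listed above map naturally onto argparse. The following is a minimal sketch of a parser that reproduces the documented usage line; it is not taken from main.py, and the function name build_parser is an assumption made for illustration.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented main.py usage (illustrative only)."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("--no-scraper", action="store_true",
                        help="Do not run the Scraper")
    parser.add_argument("--no-editor", action="store_true",
                        help="Do not run the Editor")
    parser.add_argument("--no-splitter", action="store_true",
                        help="Do not run the Splitter")
    parser.add_argument("--keep-narrator", action="store_true",
                        help="Keep the narrator lines")
    parser.add_argument("--keep-gibberish", action="store_true",
                        help='Do not add a "(gibberish)" prefix to all the lines in gibberish')
    return parser


# Parse an explicit argument list instead of sys.argv so the example is reproducible.
args = build_parser().parse_args(["--no-scraper", "--keep-narrator"])

# argparse turns "--no-scraper" into the attribute "no_scraper".
run_scraper = not args.no_scraper
print(run_scraper, args.keep_narrator, args.keep_gibberish)
```

Because every flag uses store_true, each step runs and each default behaviour applies unless the corresponding flag is passed.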
Now to explain a bit more in detail what each step does.
scraper.py, as expected, scrapes all the lines of dialogue from this website and saves them as organized CSV files under data/csv/1_raw, overwriting any existing files.
It uses the requests and beautifulsoup modules, which work on raw HTML and do not require browser automation.
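As a rough illustration of the requests plus beautifulsoup approach, here is a minimal sketch. The page markup (one table row per line, with a speaker cell and a text cell), the CSV column names and all function names are assumptions made for this example; the actual scraper.py may be organized quite differently.

```python
import csv
from pathlib import Path


def fetch_html(url):
    """Download a page as raw HTML; no browser automation involved."""
    import requests  # third-party, imported lazily so save_csv works without it

    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text


def parse_lines(html):
    """Extract (speaker, text) pairs from table rows (hypothetical markup)."""
    from bs4 import BeautifulSoup  # third-party, imported lazily

    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for table_row in soup.select("tr"):
        cells = [cell.get_text(strip=True) for cell in table_row.find_all("td")]
        if len(cells) >= 2:
            rows.append({"line": len(rows), "speaker": cells[0], "text": cells[1]})
    return rows


def save_csv(rows, path):
    """Write the rows to a CSV file, overwriting it if it already exists."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["line", "speaker", "text"])
        writer.writeheader()
        writer.writerows(rows)
```

Opening the file in "w" mode is what gives the overwrite-if-existing behaviour described above.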
editor.py is a module that deletes (and optionally inserts) manually specified lines of dialogue from the raw, scraped dialogue transcripts. This is done in order to "align" the scraped transcript with the audio footage that I recorded. Since, for various reasons, I was not able to record parts of some dialogues, deleting the corresponding lines from the transcript is the only way to keep the data consistent across the two formats.
This module reads a manually-created JSON file called edit_rules.json that specifies which line ranges to delete.
The structure for line deletions is as follows (the # comments are explanatory annotations; standard JSON does not allow comments):
{
  "source": "0_The_Gommage", # chapter name, just as it was scraped and saved in data/csv/1_raw
  "ranges": [
    {
      "dial_s": 2, # dialogue index start
      "line_s": 6, # line index start
      "dial_e": 2, # dialogue index end
      "line_e": 6 # line index end
    },
    {"dial_s": 4, "line_s": 3, "dial_e": 4, "line_e": 6},
    ...
  ]
}
After completing, the module saves the edited transcripts into the folder data/csv/2_edits.
Additionally, in the same file, the user can provide a list of files that will be added to the edited transcripts.
"inserts": [
  "29_A_Life_to_Paint",
  "30_A_Life_to_Love"
]
These files must be placed in the folder data/csv/2_edits/custom_inserts. If a file with the same name already exists under data/csv/2_edits, the insert will overwrite it.
NOTE: if no edits have been declared for a chapter, the editor will just copy-paste the transcript for that chapter from data/csv/1_raw to data/csv/2_edits
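The range-based deletion described above can be sketched as follows. This is only an illustration, not the code in editor.py: it assumes that each transcript row carries a dialogue index and a line index (the column names "dialogue" and "line" are hypothetical), that the line index restarts within each dialogue, and that ranges are inclusive on both ends, as the single-line range in the example rules suggests.

```python
def in_range(dial, line, rng):
    """Return True if (dial, line) falls inside the inclusive range rng."""
    start = (rng["dial_s"], rng["line_s"])
    end = (rng["dial_e"], rng["line_e"])
    # Tuples compare lexicographically: dialogue index first, then line index.
    return start <= (dial, line) <= end


def delete_ranges(rows, ranges):
    """Drop every row whose (dialogue, line) position is covered by a range."""
    return [
        row for row in rows
        if not any(in_range(int(row["dialogue"]), int(row["line"]), rng)
                   for rng in ranges)
    ]


# Ranges copied from the example rules above.
ranges = [
    {"dial_s": 2, "line_s": 6, "dial_e": 2, "line_e": 6},
    {"dial_s": 4, "line_s": 3, "dial_e": 4, "line_e": 6},
]

# Toy transcript rows; real rows would be read from data/csv/1_raw.
rows = [
    {"dialogue": "2", "line": "5"},
    {"dialogue": "2", "line": "6"},
    {"dialogue": "2", "line": "7"},
    {"dialogue": "4", "line": "2"},
    {"dialogue": "4", "line": "3"},
    {"dialogue": "4", "line": "6"},
]

kept = delete_ranges(rows, ranges)
print([(row["dialogue"], row["line"]) for row in kept])
```

With an empty list of ranges the function returns every row unchanged, which matches the copy behaviour described in the note above.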
splitter.py is a module that, according to the rules defined in a manually-created JSON file called split_rules.json, splits:
- the transcripts
- the audio footage for that chapter
The module reads audio files in WAV format from data/audio/2_edits and transcripts in CSV format from data/csv/2_edits, the latter previously generated by the editor module. It then loops over the rules defined in the split_rules.json file:
{
  "source": "0_The_Gommage", # chapter name
  "ranges": [
    {
      "dial_s": 0, # dialogue index start
      "line_s": 0, # line index start
      "dial_e": 3, # dialogue index end
      "line_e": 17 # line index end (-1 means last line)
    },
    ...
  ],
  "timestamps": [
    "04:36", # first timestamp at which to split
    "08:05", # second timestamp
    ...
  ]
}
After completing a chapter, transcripts are saved in data/csv/3_splits and audio files are converted to MP3 and saved in data/audio/3_splits.
NOTE: if no splits have been declared for a chapter, the only step that is still performed is the conversion from WAV to MP3
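To make the split rules more concrete, here is a small sketch of how the timestamps and ranges could be interpreted. It rests on two assumptions that the rules file itself does not confirm: timestamps are in "MM:SS" form and N timestamps yield N+1 consecutive audio segments, and a line_e of -1 extends a range to the last line of the end dialogue. All function names are hypothetical, and the actual audio cutting and MP3 conversion are left out.

```python
def to_seconds(stamp):
    """Convert an "MM:SS" timestamp into a number of seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + int(seconds)


def audio_segments(timestamps, total_seconds):
    """Return (start, end) pairs in seconds, cutting at every timestamp."""
    cuts = [0] + [to_seconds(stamp) for stamp in timestamps] + [total_seconds]
    return list(zip(cuts[:-1], cuts[1:]))


def in_split_range(dial, line, rng):
    """Return True if (dial, line) belongs to the inclusive split range rng."""
    # -1 is treated as "up to the last line" of the end dialogue.
    line_e = float("inf") if rng["line_e"] == -1 else rng["line_e"]
    return (rng["dial_s"], rng["line_s"]) <= (dial, line) <= (rng["dial_e"], line_e)


# A 10-minute recording split at the two timestamps from the example rules.
print(audio_segments(["04:36", "08:05"], total_seconds=600))
```

An empty timestamps list produces a single segment covering the whole recording, which is consistent with the note above that chapters without split rules are only converted.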
WIP