Dev Logs
- .csv Files can now be imported into the app
- These get converted into a .feather file that will always be used for reference until a new .csv is imported
- The top 2-4 rows need to be deleted before importing so that only the column titles remain at the top
- Filename section fields are now in order of how the actual filename is laid out.
- The Override CatID combobox now has all the appropriate CatID info, thanks to the pandas module and the .csv importing system mentioned previously
- Now shows the .wav info when a file is selected
- The speech-analysis process now scans files for non-silent regions according to a threshold (default is -30 dB), then exports those regions as chunks to the user's 'Temp' folder. Each chunk is processed with STT individually. Doing it this way opens up the possibility of creating a thread pool to divide the analysis. Temp files are deleted when done.
See the bottom left of the next picture for a path example.
- Investigations into the use of Natural Language Processing have begun. This will be a significant effort to bring to fruition
- Darker drop region
- Groupboxes that are checked now turn their text white, and grey when unchecked; previously the text remained grey. Groupboxes that are not checkable (e.g. CreatorID and SourceID) remain white.
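The non-silent scan described above can be sketched in plain Python. This is a simplified illustration of the idea only; the function name and parameters are my own, not the app's actual code, which works on real .wav data rather than a bare sample list:

```python
def find_nonsilent_regions(samples, rate, threshold_db=-30.0, min_silence_s=0.5):
    """Return (start, end) sample indices of regions louder than threshold_db.

    `samples` are floats normalized to [-1.0, 1.0]; threshold_db is relative
    to full scale, so -30 dB corresponds to an amplitude of ~0.0316.
    """
    threshold = 10 ** (threshold_db / 20)    # dBFS -> linear amplitude
    min_gap = int(min_silence_s * rate)      # silence shorter than this is ignored
    regions, start, silent_run = [], None, 0
    for i, s in enumerate(samples):
        if abs(s) >= threshold:
            if start is None:                # a new non-silent region begins
                start = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_gap:        # gap long enough: close the region
                regions.append((start, i - silent_run + 1))
                start, silent_run = None, 0
    if start is not None:                    # file ended while still non-silent
        regions.append((start, len(samples) - silent_run))
    return regions
```

Each returned (start, end) pair would then be sliced out, written to the Temp folder, and handed to STT on its own.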
It's been a few days since I've given any major updates, so I figured I should!
Some minor changes have been made since the last log. Areas changed mainly include:
- the UI (new window added)
- automated tests
- new test data
I spent a considerable amount of time trying to debug an issue where no single function wanted to read my '.wav' files. After much searching, with no results, I discovered my files were corrupted by GitHub's LFS conversion. Fun times ⏳
I spent some time looking into the speed of the Google Speech Recognition API in order to gather some metrics to work with.
I used the time.time() function to get the time before calling the GSR API, and called it again after the STT results came in. The difference between the two is the time taken for the whole file to be processed and the text to be returned.
I found the approximate time for the Google Speech to Text API to process 1 second of voice is between 0.06 and 0.17 seconds.
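The bracketing approach above can be wrapped in a small helper. Here `fake_transcribe` is a stand-in for the real GSR API call, which I'm not reproducing here:

```python
import time

def timed_call(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds), mirroring the
    time.time() bracketing described above."""
    start = time.time()
    result = fn(*args, **kwargs)
    return result, time.time() - start

def per_second_rate(elapsed, clip_length_s):
    """Normalize total processing time to time-per-second of audio."""
    return elapsed / clip_length_s

# Stand-in for the real API call, just to show the usage pattern:
def fake_transcribe(path):
    return "ambiences down by the river"

text, elapsed = timed_call(fake_transcribe, "clip.wav")
```

Dividing `elapsed` by the clip length (8 or 100 seconds in these tests) gives the per-second figures listed below.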
Two files:
- x1 8 seconds
- x1 100 seconds
File data:
- 24 bit
- 48 kHz
- 2 channels
A couple of things to consider:
1. The effect that different bit depths, sample rates, and channel counts have on the API calls was not factored into this test.
2. Internet speed was not considered. Each time the API is called, there is a small variation in the time it takes to process the speech.
For example, in a series of API calls, the following times were shown on each call, run ten times, separately, on the same 8-second audio file. Each one is the total time divided by 8 to give a rough time-per-second:
- 0.10067299008369446
- 0.1209230124950409
- 0.16493374109268188
- 0.1466636061668396
- 0.11550077795982361
- 0.13094982504844666
- 0.10052460432052612
- 0.09868890047073364
- 0.08651429414749146
- 0.08166024088859558
This shows how drastic the change in time can be. On the faster runs, we hit close to 0.086 seconds. On the longer ones, around 0.165 seconds.
This second speed test was conducted on a 100-second clip of voice that has the same phrase looped for the full length of the file. Converting the total time to time-per-second, here are the results:
Speech, looped: "ambiences down by the river birds in the background in New Westminster water lapping against the shore boats driving by"
Can a boat drive? 🤔
- 0.0908750319480896
- 0.08467741012573242
- 0.06730304718017578
- 0.06506705522537232
- 0.08816150665283203
- 0.08871432065963746
- 0.08521945714950561
- 0.06659796953201294
- 0.06711288690567016
- 0.06610565662384033
This second test oddly shows a reduction in time-per-second. After rerunning the first test, I can confirm that the shorter 8-second file takes longer per second than the 100-second one.
This helps to give a rough idea of how long it may take to process larger files. It shows that a 30-minute recording could take up to 2.09 minutes to process. That's not really much of a time-saver. If you have 20 files to label, that's an easy 40 minutes of waiting.
The GSR API does not directly allow for processing files of that length, so it doesn't matter anyway.
One thing to explore is the use of multiprocessing to divide the file into larger chunks that can then be sent to the API in parallel. With enough workers, each chunk cuts the total wall time by the number of chunks, i.e. time = total_time / n_chunks. Doing this on the 30-min file with the UCSVNT tool, setting total parallel chunks to 5, we could reduce the total time to roughly 25 seconds (125 s ÷ 5) from the original ~2 minutes.
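Since the API calls are network-bound, the thread pool mentioned in an earlier log could do the fan-out. A minimal sketch, with `transcribe_chunk` as a stand-in for the real per-chunk GSR call:

```python
from concurrent.futures import ThreadPoolExecutor

def transcribe_chunk(chunk_path):
    """Stand-in for the real per-chunk GSR API call."""
    return f"text for {chunk_path}"

def transcribe_parallel(chunk_paths, max_workers=5):
    """Send chunks to the API in parallel, preserving chunk order.

    Ideal wall time approaches total_time / n_chunks when there are at
    least as many workers as chunks, e.g. a 30-min file at ~0.07 s per
    second of audio across 5 chunks: (1800 * 0.07) / 5 ≈ 25 s.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(transcribe_chunk, chunk_paths))

results = transcribe_parallel(["chunk0.wav", "chunk1.wav", "chunk2.wav"])
```

In practice the speedup would be capped by network bandwidth and any rate limits on the API, so the ideal division is an upper bound.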
A new window was added to allow users to confirm the files they wish to process.
Users also have the option of skipping the file selection if they wish. In the future, this option will be part of a collection of persistent settings.
Came across a simple Material theme to apply to the UI.
Some more UI updates today!
I'm thinking that we could have a temp filename preview that visually shows which fields will be added/removed, which require manual input, which will use speech-to-text, and which are optional.
I've also included a section to manually override the CatID if the user wishes.
On the settings page, I've added a place to input the UCS .csv files if users wish to use a customized file.
Thanks for reading!
Lots of ideas have been pouring out in the past couple of days. From different functionality, to how we'll be handling automated unit testing, to prompting the user for third-party consent relating to the voice tech, to how we'll be attempting to speed up the analysis time, how we can handle errors, how we can create a global error correction model, a local one...AI correction? Persistent settings? A better UI design?
I took some time to look into some of the libraries that will be used in making this tool. SpeechRecognition is a Python module that has a few APIs that we can use to handle the speech-to-text bulk of the tool. There are a few things to look into before solidifying that. What are the legal implications behind sharing a user's voice with the companies that make these APIs? More to research.
There are a couple of other modules to look into as well: PyAudio for general audio manipulation, and another that can detect the silent gaps between speech.
As for the UI, I've added a few more fields today:
- A field to specify whether to only analyze the start or end of files (as a way of saving time). Up to 120 seconds.
- A field to specify an area to analyze for text, based on the user's chosen Start and End markers.
- A field for custom take markers. Not exactly sure what for yet. Maybe to split the file into multiple files based on takes?
- A customizable FXName template that can attempt to label the file's FXName based on chosen wildcards, which insert certain text if it's found in the speech.
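The wildcard idea from the last bullet could look something like this. Everything here is hypothetical — the `$LOC`/`$SUBJECT` tokens and the keyword-to-label mapping are made-up examples, not the tool's actual syntax:

```python
def fill_fxname(template, wildcards, speech_text):
    """Replace each $WILDCARD in the template with its mapped label when the
    associated keyword appears in the transcribed speech; drop it otherwise.

    `wildcards` maps token -> (keyword, label). All names here are
    illustrative, not the tool's real template format.
    """
    words = speech_text.lower()
    for token, (keyword, label) in wildcards.items():
        template = template.replace(token, label if keyword in words else "")
    # Drop empty segments left behind by unmatched wildcards.
    return "_".join(part for part in template.split("_") if part)

name = fill_fxname(
    "$LOC_$SUBJECT_take",
    {"$LOC": ("river", "River"), "$SUBJECT": ("birds", "Birds")},
    "ambiences down by the river birds in the background",
)
```

Unmatched wildcards simply disappear from the name, which keeps the template usable even when the speech doesn't mention every field.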
I've also changed the color of the Drag & Drop section.
That's all for now! More to come soon.
After my initial Slack conversation with Tim, I created this GitHub page. I immediately got to thinking about all of the different ways that we would be able to overcome some of the challenges of speech-to-text.
I made it clear that at this moment I am just looking into the feasibility of some of the tech behind this tool. I know that this tool won't be usable to all people in all cases, but I'm hoping that a lot of people can at least find some use.
I'm familiar with a bit of the speech-to-text technology, I'm familiar with Python, I've programmed UI, and I think I have a clear vision of where I'd like to take this tool.
Currently, I've worked on establishing some short-term project goals. I put a good amount of these goals into my GitHub project for tracking and have established a milestone to have a proof of concept finished by July 19th.
As for the tool itself, at the moment I have made a simple UI, with a drag-and-drop area, and some fields to enter the different category information.
I had some difficulty trying to get Python to run within PyCharm, and I've been trying to understand how to submit to GitHub since my experience is mainly with Perforce.
Excited about where we will be in the coming weeks!