Dev Logs
- .csv Files can now be imported into the app
- These get converted into a .feather file that will always be used for reference until a new .csv is imported
- The top 2-4 rows need to be deleted before importing so that only the column titles remain at the top
- Filename section fields are now in order of how the actual filename is laid out.
- The Override CatID combobox now has all the appropriate CatID info, thanks to the pandas module and the .csv importing system mentioned previously
- Now shows the .wav info when a file is selected
- The speech-analysis process now scans files for non-silent regions according to a threshold (default is -30 dB), then exports those regions as chunks to the user's 'Temp' folder. Each chunk is processed with STT individually. Doing it this way opens up the possibility of creating a thread pool to divide the analysis. Temp files are deleted when done.
See the bottom left of the next picture for a path example.
- Investigations into the use of Natural Language Processing have begun. This will be a significant effort to bring to fruition
- Darker drop region
- Groupboxes that are checked now turn their text white, and grey when unchecked; previously the text remained grey. Groupboxes that are not checkable (e.g. CreatorID and SourceID) remain white.
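The non-silent scan described above can be sketched in plain Python. This is a simplified illustration of the idea only; the function name and parameters are my own, not the app's actual code, which works on real .wav data rather than a bare sample list:

```python
def find_nonsilent_regions(samples, rate, threshold_db=-30.0, min_silence_s=0.5):
    """Return (start, end) sample indices of regions louder than threshold_db.

    `samples` are floats normalized to [-1.0, 1.0]; threshold_db is relative
    to full scale, so -30 dB corresponds to an amplitude of ~0.0316.
    """
    threshold = 10 ** (threshold_db / 20)    # dBFS -> linear amplitude
    min_gap = int(min_silence_s * rate)      # silence shorter than this is ignored
    regions, start, silent_run = [], None, 0
    for i, s in enumerate(samples):
        if abs(s) >= threshold:
            if start is None:                # a new non-silent region begins
                start = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_gap:        # gap long enough: close the region
                regions.append((start, i - silent_run + 1))
                start, silent_run = None, 0
    if start is not None:                    # file ended while still non-silent
        regions.append((start, len(samples) - silent_run))
    return regions
```

Each returned (start, end) pair would then be sliced out, written to the Temp folder, and handed to STT on its own.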
It's been a few days since I've given any major updates, so I figured I should!
Some minor changes have been made since the last log. Areas changed mainly include:
- the UI (new window added)
- automated tests
- new test data
I spent a considerable amount of time trying to debug an issue where no single function wanted to read my '.wav' files. After much searching, with no results, I discovered my files were corrupted by GitHub's LFS conversion. Fun times ⏳
I spent some time looking into the speed of the Google Speech Recognition API in order to gather some metrics to work with.
I used the time.time() function to get the time before calling the GSR API, and called it again after the STT results came in. The difference between the two is the time taken for the whole file to be processed and the text to be returned.
I found the approximate time for the Google Speech to Text API to process 1 second of voice is between 0.06 and 0.17 seconds.
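The bracketing approach above can be wrapped in a small helper. Here `fake_transcribe` is a stand-in for the real GSR API call, which I'm not reproducing here:

```python
import time

def timed_call(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds), mirroring the
    time.time() bracketing described above."""
    start = time.time()
    result = fn(*args, **kwargs)
    return result, time.time() - start

def per_second_rate(elapsed, clip_length_s):
    """Normalize total processing time to time-per-second of audio."""
    return elapsed / clip_length_s

# Stand-in for the real API call, just to show the usage pattern:
def fake_transcribe(path):
    return "ambiences down by the river"

text, elapsed = timed_call(fake_transcribe, "clip.wav")
```

Dividing `elapsed` by the clip length (8 or 100 seconds in these tests) gives the per-second figures listed below.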
Two files:
- x1 8 seconds
- x1 100 seconds
File data:
- 24 bit
- 48 kHz
- 2 channels
A couple of things to consider:
1. The effect that different bit depths, sample rates, and channel counts have on the API calls was not factored into this test.
2. Internet speed was not considered. Each time the API is called, there is a small variation in the time it takes to process the speech.
For example, in a series of API calls, the following times were shown on each call, run ten times, separately, on the same 8-second audio file. Each one is the total time divided by 8 to give a rough time-per-second:
- 0.10067299008369446
- 0.1209230124950409
- 0.16493374109268188
- 0.1466636061668396
- 0.11550077795982361
- 0.13094982504844666
- 0.10052460432052612
- 0.09868890047073364
- 0.08651429414749146
- 0.08166024088859558
This shows how drastic the change in time can be. On the faster runs, we hit close to 0.086 seconds. On the longer ones, around 0.165 seconds.
This second speed test was conducted on a 100-second clip of voice that has the same phrase looped for the full length of the file. Converting the total time to time-per-second, here are the results:
Speech, looped: "ambiences down by the river birds in the background in New Westminster water lapping against the shore boats driving by"
Can a boat drive? 🤔
- 0.0908750319480896
- 0.08467741012573242
- 0.06730304718017578
- 0.06506705522537232
- 0.08816150665283203
- 0.08871432065963746
- 0.08521945714950561
- 0.06659796953201294
- 0.06711288690567016
- 0.06610565662384033
This second test oddly shows a reduction in time-per-second. After rerunning the first test, I can confirm that the shorter 8-second file takes longer per second than the 100-second one.
This helps to give a rough idea of how long it may take to process larger files. It shows that a 30-minute recording could take up to 2.09 minutes to process. That's not really much of a time-saver. If you have 20 files to label, that's an easy 40 minutes of waiting.
The GSR API does not directly allow for processing files of that length, so it doesn't matter anyway.
One thing to explore is the use of multiprocessing to divide the file into larger chunks that can then be sent to the API in parallel. With enough workers, each chunk cuts the total wall time by the number of chunks, i.e. time = total_time / n_chunks. Doing this on the 30-min file with the UCSVNT tool, setting total parallel chunks to 5, we could reduce the total time to roughly 25 seconds (125 s ÷ 5) from the original ~2 minutes.
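Since the API calls are network-bound, the thread pool mentioned in an earlier log could do the fan-out. A minimal sketch, with `transcribe_chunk` as a stand-in for the real per-chunk GSR call:

```python
from concurrent.futures import ThreadPoolExecutor

def transcribe_chunk(chunk_path):
    """Stand-in for the real per-chunk GSR API call."""
    return f"text for {chunk_path}"

def transcribe_parallel(chunk_paths, max_workers=5):
    """Send chunks to the API in parallel, preserving chunk order.

    Ideal wall time approaches total_time / n_chunks when there are at
    least as many workers as chunks, e.g. a 30-min file at ~0.07 s per
    second of audio across 5 chunks: (1800 * 0.07) / 5 ≈ 25 s.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(transcribe_chunk, chunk_paths))

results = transcribe_parallel(["chunk0.wav", "chunk1.wav", "chunk2.wav"])
```

In practice the speedup would be capped by network bandwidth and any rate limits on the API, so the ideal division is an upper bound.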
A new window was added to allow users to confirm the files they wish to process.
Users also have the option of skipping the file selection if they wish. In the future, this option will be part of a collection of persistent settings.
Came across a simple Material theme to apply to the UI.
Some more UI updates today!
I'm thinking that we could have a temp filename preview that visually shows which fields will be added/removed, which require manual input, which will use speech-to-text, and which are optional.
I've also included a section to manually override the CatID if the user wishes.
On the settings page, I've added a place to input the UCS .csv files if users wish to use a customized file.
Thanks for reading!
Lots of ideas have been pouring out in the past couple of days. From different functionality, to how we'll be handling automated unit testing, to prompting the user for third-party consent relating to the voice tech, to how we'll be attempting to speed up the analysis time, how we can handle errors, how we can create a global error correction model, a local one...AI correction? Persistent settings? A better UI design?
I took some time to look into some of the libraries that will be used in making this tool. SpeechRecognition is a Python module that has a few APIs that we can use to handle the speech-to-text bulk of the tool. There are a few things to look into before solidifying that. What are the legal implications behind sharing a user's voice with the companies that make these APIs? More to research.
There are a couple of other modules to look into as well: PyAudio for general audio manipulation, and another that can detect the silent gaps between speech.
As for the UI, I've added a few more fields today:
- A field to specify whether to only analyze the start or end of files (as a way of saving time). Up to 120 seconds.
- A field to specify an area to analyze for text, based on the user's chosen Start and End markers.
- A field for custom take markers. Not exactly sure what for yet. Maybe to split the file into multiple files based on takes?
- A customizable FXName template that can attempt to label the file's FXName based on chosen wildcards, which insert certain text if it's found in the speech.
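The wildcard idea from the last bullet could look something like this. Everything here is hypothetical — the `$LOC`/`$SUBJECT` tokens and the keyword-to-label mapping are made-up examples, not the tool's actual syntax:

```python
def fill_fxname(template, wildcards, speech_text):
    """Replace each $WILDCARD in the template with its mapped label when the
    associated keyword appears in the transcribed speech; drop it otherwise.

    `wildcards` maps token -> (keyword, label). All names here are
    illustrative, not the tool's real template format.
    """
    words = speech_text.lower()
    for token, (keyword, label) in wildcards.items():
        template = template.replace(token, label if keyword in words else "")
    # Drop empty segments left behind by unmatched wildcards.
    return "_".join(part for part in template.split("_") if part)

name = fill_fxname(
    "$LOC_$SUBJECT_take",
    {"$LOC": ("river", "River"), "$SUBJECT": ("birds", "Birds")},
    "ambiences down by the river birds in the background",
)
```

Unmatched wildcards simply disappear from the name, which keeps the template usable even when the speech doesn't mention every field.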
I've also changed the color of the Drag & Drop section.
That's all for now! More to come soon.
After my initial Slack conversation with Tim, I created this GitHub page. I immediately got to thinking about all of the different ways that we would be able to overcome some of the challenges of speech-to-text.
I made it clear that at this moment I am just looking into the feasibility of some of the tech behind this tool. I know that this tool won't be usable to all people in all cases, but I'm hoping that a lot of people can at least find some use.
I'm familiar with a bit of the speech-to-text technology, I'm familiar with Python, I've programmed UI, and I think I have a clear vision of where I'd like to take this tool.
Currently, I've worked on establishing some short-term project goals. I put a good amount of these goals into my GitHub project for tracking and have established a milestone to have a proof of concept finished by July 19th.
As for the tool itself, at the moment I have made a simple UI, with a drag-and-drop area, and some fields to enter the different category information.
I had some difficulty trying to get Python to run within PyCharm, and I've been trying to understand how to submit to GitHub since my experience is mainly with Perforce.
Excited about where we will be in the coming weeks!