
feature/improved-gsr #150

Merged
merged 6 commits into from
Jan 20, 2022

Conversation

evamaxfield
Member

Description of Changes

While this is a general improvement, I will credit the push for this work to @ArthurSmid for noticing that our transcription in King County was quite poor. Specifically, the transcription on the land acknowledgement was atrocious.

Unlike Seattle, King County doesn't publish closed caption files for us to convert to our transcript format, and as such that instance was using Google Speech-to-Text (Google Speech Recognition, or GSR) for transcription.

Our original configuration for GSR had served us decently well, but with this push I figured it was time to look at ways to improve it.

PR Changes

  • The most basic change is to the model selection itself. We now use the enhanced ("video") model for speech-to-text. This generally costs more, but if we turn on data logging (where Google gets to keep the audio file for their own datasets), the cost is nullified and returns to our normal amount. For us, this means we basically get a free upgrade, since our data is already public. More info on the upgraded model here

  • The next, finer-detail change is a set of improvements to our speech adaptation / model adaptation. We currently provide event metadata to the model object, such as people's names, bill abstracts, and more, which definitely helps, but one of the things I have been noticing is that our transcripts fail on place names (street addresses, etc.), dollar amounts, ordinals, percentages, and more. This adds class tokens that specifically attempt to solve those problems! More info on class tokens here

  • Finally, I am simply improving the model metadata, changing the interaction type from "discussion" to "phone call." Google specifically cites that "videos of discussions" or "conference calls" should use "phone call" instead of "discussion." Basically, we should never have been using "discussion," even when meetings were in-person. "Discussion" means everyone is in the same room, recorded by the same mic -- it would be like having a meeting at a coffee shop and simply recording us talking. A rough sketch of all three configuration changes together follows below.
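
For reference, a minimal sketch of what these three changes look like with the google-cloud-speech Python client. The specific class tokens and boost value here are illustrative, not necessarily exactly what this PR ships, and data logging itself is a project-level toggle in the Cloud console rather than something set in code:

```python
from google.cloud import speech_v1p1beta1 as speech

config = speech.RecognitionConfig(
    language_code="en-US",
    # Change 1: the enhanced "video" model. Billed at the standard
    # rate once the project opts in to data logging in the console.
    use_enhanced=True,
    model="video",
    # Change 2: speech adaptation. Class tokens target the patterns
    # our transcripts were failing on (tokens shown are illustrative).
    speech_contexts=[
        speech.SpeechContext(
            phrases=[
                "$ADDRESSNUM",  # street addresses
                "$MONEY",       # dollar amounts
                "$PERCENT",     # percentages
                # ...plus the usual per-event phrases: people names,
                # bill abstracts, etc.
            ],
            boost=15.0,  # illustrative boost value
        )
    ],
    # Change 3: interaction-type metadata. Google's docs say that
    # conference calls / videos of discussions should use PHONE_CALL.
    metadata=speech.RecognitionMetadata(
        interaction_type=speech.RecognitionMetadata.InteractionType.PHONE_CALL,
    ),
)
```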

Results

I made a dev deployment for myself that I will likely use for storing experiments like this in the future. I chose a meeting from King County that had noticeably bad transcription as the baseline. Full details here: https://github.com/JacksonMaxfield/cdp-dev/tree/main/speech-recognition-config-tests

  • Baseline transcript: https://jacksonmaxfield.github.io/cdp-dev/#/events/1126b685f94d
    (note: the minutes item on that event is incorrect; I had a bug in my event details generator that overwrote the minutes of that event the next time I ran it. It is truthfully the baseline -- cdp-backend==3.0.2)

    Comments: noticeably bad transcription on the land acknowledgement and, further down, when they start getting into the discussion on bills and such, bad transcription on things like "pages X to Y." But overall it is just missing some words and has some oddities throughout.

  • Basic Upgrades: https://jacksonmaxfield.github.io/cdp-dev/#/events/6f15f3db0b19
    (note: this minutes item has the correct commit for this test; you can see how it overwrote the prior one because the minutes item name is the same)

    Comments: This includes a massive upgrade: the model, the adaptation, and the metadata interaction type were all upgraded in this test. It was hard (really, impossible) to test them independently because, apparently, certain class tokens only work with the enhanced models anyway. There are drastic improvements over the base, but there are also now weird alphanumeric sequences introduced into the transcript -- likely because I ran this test with the alphanumeric sequence class token enabled, though I didn't expect it to take over that much.

  • Same Massive Upgrades - Remove Alphanumeric Class: https://jacksonmaxfield.github.io/cdp-dev/#/events/38fa2d6e0603
    (note: this commit link is correct; I still had a bug, but at least it created a new minutes item to track 😂)

    Comments: This is, imo, the best version of the transcript. There are still problems with people's names and with numeric sequences such as "bill 2020-1038," but, even after the next test, I still think this is the best.

  • Same Massive Upgrades - Replace $YEAR and $POSTCODE with $NUMERIC_SEQUENCE: https://jacksonmaxfield.github.io/cdp-dev/#/events/7d4212911c66
    (note: yayyy i finally figured out how to keep the commits / minutes items intact)

    Comments: This is basically a test to see if we can fix the above bill reference / number problems. Unfortunately, I now see slightly more errors: anytime someone says "{number} to {number}" -- this could be "items 5 to 9" or "pages 10 to 20" or similar -- it merges them into a single sequence like "items 529" or "pages 10220". And regardless, the bill reference / number problem still isn't fixed entirely, because the TRUE bill number includes a hyphen in the middle and this doesn't capture that. So I rolled this commit back.

Summary

  • ✔️ land acknowledgements seem fixed (at least for this meeting)
  • ✔️ overall transcription seems to be improved (with no added cost thanks to data logging)
  • ✔️ better understanding of Google Speech-to-Text possibilities
  • ✔️ prototyped a nice method and dev infrastructure for these types of tests in the future

Further Changes Needed

The only change that needs to happen outside of this repo is in the cookiecutter setup processes (both the GitHub bot and the manual deployment steps): I need to make a PR that informs people they need to turn on data logging.

On the King County side, I won't be reprocessing the November-to-today data with these new additions; it's just too much money. But moving forward we should see an immediate benefit.


Also included in this PR is a very minor change making dev infrastructure management "safer": cleaning now requires a key naming which infrastructure to clean, rather than simply defaulting to the last created infrastructure.

@evamaxfield evamaxfield added enhancement New feature or request event gather pipeline A feature or bugfix relating to event processing labels Jan 19, 2022
@evamaxfield evamaxfield self-assigned this Jan 19, 2022
@evamaxfield evamaxfield changed the title feature/improved-gsp feature/improved-gsr Jan 19, 2022
@codecov

codecov bot commented Jan 19, 2022

Codecov Report

Merging #150 (7c76417) into main (7b1efc3) will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##             main     #150   +/-   ##
=======================================
  Coverage   94.82%   94.83%           
=======================================
  Files          50       50           
  Lines        2532     2534    +2     
=======================================
+ Hits         2401     2403    +2     
  Misses        131      131           
Impacted Files Coverage Δ
cdp_backend/sr_models/google_cloud_sr_model.py 98.66% <100.00%> (+0.03%) ⬆️
...kend/tests/sr_models/test_google_cloud_sr_model.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7b1efc3...7c76417. Read the comment docs.

@nniiicc

nniiicc commented Jan 19, 2022

Big if true

@nniiicc

nniiicc commented Jan 19, 2022

Yo.
Yoooo.
YOOOOOOO...

the speech adaptation + phone call configuration is a massive upgrade. This is great @JacksonMaxfield !!!! We should come up with a baseline quality metric and then work from that each time (can discuss more off thread).

Collaborator

@isaacna isaacna left a comment

This is awesome! Upgrading the transcription without cost is a huge plus.

Also, do you think it's worth adding the class tokens $OOV_CLASS_ALPHANUMERIC_SEQUENCE and $OOV_CLASS_TEMPERATURE?

It seems that legislation/bills tend to have alphanumeric codes, so it could be worth adding that class token? Also, I could see temperature coming in handy for climate-related talk.

@evamaxfield
Member Author

Also, do you think it's worth adding the class tokens $OOV_CLASS_ALPHANUMERIC_SEQUENCE and $OOV_CLASS_TEMPERATURE?

The second test I ran included ALPHANUMERIC, but it actually made certain parts worse. The temperature... eh, I don't know. I think we leave it out because it's very rare, even in a bill about climate, for anyone to be discussing exact temperatures.

@dphoria
Contributor

dphoria commented Jan 19, 2022

Very neat / awesome to actually see the changes / improvements between the results you listed in the OP. 🙌

@evamaxfield
Member Author

evamaxfield commented Jan 19, 2022

the speech adaptation + phone call configuration is a massive upgrade. This is great @JacksonMaxfield !!!! We should come up with a baseline quality metric and then work from that each time (can discuss more off thread).

@nniiicc A part of me wants to say we should simply run this upgraded model against the Seattle closed-caption-generated transcripts and do a text diff? Basically, "how close does the speech-to-text model get to mirroring the gov-created closed captions?"

Edit: errr, clarification: we could simply use the closed caption files as "ground truth" and compute word error rate
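
For what it's worth, a minimal sketch of that idea, assuming the jiwer package as the WER implementation (the file paths and normalization here are purely illustrative):

```python
import string

import jiwer  # pip install jiwer


def normalize(text: str) -> str:
    # Lowercase and strip punctuation so formatting differences
    # between caption files and GSR output don't count as errors.
    return text.lower().translate(str.maketrans("", "", string.punctuation))


# Hypothetical paths: the closed-caption-derived transcript acts as
# "ground truth"; the GSR transcript is the hypothesis under test.
with open("seattle_closed_caption_transcript.txt") as f:
    reference = normalize(f.read())
with open("seattle_gsr_transcript.txt") as f:
    hypothesis = normalize(f.read())

print(f"Word error rate: {jiwer.wer(reference, hypothesis):.2%}")
```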

Collaborator

@tohuynh tohuynh left a comment

Nice!

Using closed caption files as the ground truth to measure the accuracy of this SR model sounds like a good idea.

@evamaxfield evamaxfield merged commit 461d188 into main Jan 20, 2022
@evamaxfield evamaxfield deleted the feature/improved-gsr branch January 20, 2022 23:13
@kristopher-smith

Great stuff here @JacksonMaxfield. The King County text looks much cleaner! Those "Custom Classes" they mention under the tokens look interesting to me. We may be able to utilize those for this quirky legislation language at some point in the model.
