
feature/improved-gsr #150

Merged
merged 6 commits into from
Jan 20, 2022

Conversation

evamaxfield
Member

Description of Changes

While this is a general improvement, I will credit the push for this work to @ArthurSmid for noticing that our transcription in King County was quite poor. Specifically, the transcription on the land acknowledgement was atrocious.

Unlike Seattle, King County doesn't publish closed caption files for us to convert to our transcript format, and as such that instance was using Google Speech-to-Text (Google Speech Recognition, or GSR) for transcription.

Our original configuration for GSR had served us decently well, but with this push I figured it was time to look at ways to improve it.

PR Changes

  • The most basic change is to the model selection itself. We now use the enhanced ("video") model for speech-to-text. This generally costs more, but if we turn on data logging (where Google gets to keep the audio file for their own datasets), the cost is nullified and returns to our normal amount. For us, this means we basically get a free upgrade, since our data is already public. More info on the upgraded model here

  • The next, finer-detail change is a set of improvements to our speech adaptation / model adaptation. We currently provide event metadata to the model object, such as people's names, bill abstracts, and more, which definitely helps, but one of the things I have been noticing is that our transcripts fail on place names (street addresses, etc.), dollar amounts, ordinals, percentages, and more. This adds class tokens that specifically attempt to solve those problems! More info on class tokens here

  • Finally, I am simply improving the model metadata, changing the interaction type from "discussion" to "phone call." Google specifically cites that "videos of discussions" or "conference calls" should use "phone call" instead of "discussion." Basically, we should never have been using "discussion," even when meetings were in-person. "Discussion" means everyone is in the same room, recorded by the same mic -- it would be like having a meeting at a coffee shop and simply recording us talking. A rough sketch of all three configuration changes together follows below.
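
For reference, a minimal sketch of what these three changes look like with the google-cloud-speech Python client. The specific class tokens and boost value here are illustrative, not necessarily exactly what this PR ships, and data logging itself is a project-level toggle in the Cloud console rather than something set in code:

```python
from google.cloud import speech_v1p1beta1 as speech

config = speech.RecognitionConfig(
    language_code="en-US",
    # Change 1: the enhanced "video" model. Billed at the standard
    # rate once the project opts in to data logging in the console.
    use_enhanced=True,
    model="video",
    # Change 2: speech adaptation. Class tokens target the patterns
    # our transcripts were failing on (tokens shown are illustrative).
    speech_contexts=[
        speech.SpeechContext(
            phrases=[
                "$ADDRESSNUM",  # street addresses
                "$MONEY",       # dollar amounts
                "$PERCENT",     # percentages
                # ...plus the usual per-event phrases: people names,
                # bill abstracts, etc.
            ],
            boost=15.0,  # illustrative boost value
        )
    ],
    # Change 3: interaction-type metadata. Google's docs say that
    # conference calls / videos of discussions should use PHONE_CALL.
    metadata=speech.RecognitionMetadata(
        interaction_type=speech.RecognitionMetadata.InteractionType.PHONE_CALL,
    ),
)
```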

Results

I made a dev deployment for myself that I will likely use for storing experiments like this in the future. I chose a meeting from King County that had noticeably bad transcription as the baseline. Full details here: https://github.com/JacksonMaxfield/cdp-dev/tree/main/speech-recognition-config-tests

  • Baseline transcript: https://jacksonmaxfield.github.io/cdp-dev/#/events/1126b685f94d
    (note: the minutes item on that event is incorrect; I had a bug in my event details generator that overwrote the minutes of that event the next time I ran it. It is truthfully the baseline -- cdp-backend==3.0.2)

    Comments: noticeably bad transcription on the land acknowledgement and, further down, when they start getting into the discussion on bills and such, bad transcription on things like "pages X to Y." But overall it is just missing some words and has some oddities throughout.

  • Basic Upgrades: https://jacksonmaxfield.github.io/cdp-dev/#/events/6f15f3db0b19
    (note: this minutes item has the correct commit for this test; you can see how it overwrote the prior one because the minutes item name is the same)

    Comments: This includes a massive upgrade: the model, the adaptation, and the metadata interaction type were all upgraded in this test. It was hard (really, impossible) to test them independently because, apparently, certain class tokens only work with the enhanced models anyway. There are drastic improvements over the base, but there are also now weird alphanumeric sequences introduced into the transcript -- likely because I ran this test with the alphanumeric sequence class token enabled, though I didn't expect it to take over that much.

  • Same Massive Upgrades - Remove Alphanumeric Class: https://jacksonmaxfield.github.io/cdp-dev/#/events/38fa2d6e0603
    (note: this commit link is correct; I still had a bug, but at least it created a new minutes item to track 😂)

    Comments: This is, imo, the best version of the transcript. There are still problems with people's names and with numeric sequences such as "bill 2020-1038," but, even after the next test, I still think this is the best.

  • Same Massive Upgrades - Replace $YEAR and $POSTCODE with $NUMERIC_SEQUENCE: https://jacksonmaxfield.github.io/cdp-dev/#/events/7d4212911c66
    (note: yayyy i finally figured out how to keep the commits / minutes items intact)

    Comments: This is basically a test to see if we can fix the above bill reference / number problems. Unfortunately, I now see slightly more errors: anytime someone says "{number} to {number}" -- this could be "items 5 to 9" or "pages 10 to 20" or similar -- it merges them into a single sequence like "items 529" or "pages 10220". And regardless, the bill reference / number problem still isn't fixed entirely, because the TRUE bill number includes a hyphen in the middle and this doesn't capture that. So I rolled this commit back.

Summary

  • ✔️ land acknowledgements seem fixed (at least for this meeting)
  • ✔️ overall transcription seems to be improved (with no added cost thanks to data logging)
  • ✔️ better understanding of Google Speech-to-Text possibilities
  • ✔️ prototyped a nice method and dev infrastructure for these types of tests in the future

Further Changes Needed

The only change that needs to happen outside of this repo is in the cookiecutter setup processes (both the GitHub bot and the manual deployment steps): I need to make a PR that informs people they need to turn on data logging.

On the King County side, I won't be reprocessing the November-to-today data with these new additions; it's just too much money. But moving forward we should see an immediate benefit.


Also included in this PR is a very minor change making dev infrastructure management "safer": cleaning now requires a key naming which infrastructure to clean, rather than simply defaulting to the last created infrastructure.

@evamaxfield evamaxfield added enhancement New feature or request event gather pipeline A feature or bugfix relating to event processing labels Jan 19, 2022
@evamaxfield evamaxfield self-assigned this Jan 19, 2022
@evamaxfield evamaxfield changed the title feature/improved-gsp feature/improved-gsr Jan 19, 2022
@codecov

codecov bot commented Jan 19, 2022

Codecov Report

Merging #150 (7c76417) into main (7b1efc3) will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##             main     #150   +/-   ##
=======================================
  Coverage   94.82%   94.83%           
=======================================
  Files          50       50           
  Lines        2532     2534    +2     
=======================================
+ Hits         2401     2403    +2     
  Misses        131      131           
Impacted Files Coverage Δ
cdp_backend/sr_models/google_cloud_sr_model.py 98.66% <100.00%> (+0.03%) ⬆️
...kend/tests/sr_models/test_google_cloud_sr_model.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7b1efc3...7c76417. Read the comment docs.

@nniiicc

nniiicc commented Jan 19, 2022

Big if true

@nniiicc

nniiicc commented Jan 19, 2022

Yo.
Yoooo.
YOOOOOOO...

the speech adaptation + phone call configuration is a massive upgrade. This is great @JacksonMaxfield !!!! We should come up with a baseline quality metric and then work from that each time (can discuss more off thread).

Collaborator

@isaacna isaacna left a comment

This is awesome! Upgrading the transcription without cost is a huge plus.

Also, do you think it's worth adding the class tokens $OOV_CLASS_ALPHANUMERIC_SEQUENCE and $OOV_CLASS_TEMPERATURE?

It seems that legislation/bills tend to have alphanumeric codes, so it could be worth adding that class token? Also, I could see temperature coming in handy for climate-related talk.

@evamaxfield
Member Author

Also, do you think it's worth adding the class tokens $OOV_CLASS_ALPHANUMERIC_SEQUENCE and $OOV_CLASS_TEMPERATURE?

The second test I ran included ALPHANUMERIC, but it actually made certain parts worse. The temperature... eh, I don't know. I think we leave it out because it's very rare, even in a bill about climate, for anyone to be discussing exact temperatures.

@dphoria
Contributor

dphoria commented Jan 19, 2022

Very neat / awesome to actually see the changes / improvements between the results you listed in the OP. 🙌

@evamaxfield
Member Author

evamaxfield commented Jan 19, 2022

the speech adaptation + phone call configuration is a massive upgrade. This is great @JacksonMaxfield !!!! We should come up with a baseline quality metric and then work from that each time (can discuss more off thread).

@nniiicc A part of me wants to say we should simply run this upgraded model against the Seattle closed-caption-generated transcripts and do a text diff? Basically, "how close does the speech-to-text model get to mirroring the gov-created closed captions?"

Edit: errr, clarification: we could simply use the closed caption files as "ground truth" and compute word error rate
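
For what it's worth, a minimal sketch of that idea, assuming the jiwer package as the WER implementation (the file paths and normalization here are purely illustrative):

```python
import string

import jiwer  # pip install jiwer


def normalize(text: str) -> str:
    # Lowercase and strip punctuation so formatting differences
    # between caption files and GSR output don't count as errors.
    return text.lower().translate(str.maketrans("", "", string.punctuation))


# Hypothetical paths: the closed-caption-derived transcript acts as
# "ground truth"; the GSR transcript is the hypothesis under test.
with open("seattle_closed_caption_transcript.txt") as f:
    reference = normalize(f.read())
with open("seattle_gsr_transcript.txt") as f:
    hypothesis = normalize(f.read())

print(f"Word error rate: {jiwer.wer(reference, hypothesis):.2%}")
```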

Collaborator

@tohuynh tohuynh left a comment

Nice!

Using closed caption files as the ground truth to measure the accuracy of this SR model sounds like a good idea.

@evamaxfield evamaxfield merged commit 461d188 into main Jan 20, 2022
@evamaxfield evamaxfield deleted the feature/improved-gsr branch January 20, 2022 23:13
@kristopher-smith

Great stuff here @JacksonMaxfield. The King County text looks much cleaner! Those "Custom Classes" they mention under the tokens look interesting to me. We may be able to utilize those for this quirky legislation language at some point in the model.
