Add basic instance metadata collection to infrastructure stack #117
Comments
Not sure how you would store this, but from a researcher's perspective, querying across multiple instances, I'd want to be able to determine what the database models and transcript models (maybe even ingestion models too) look like for each instance. It would be cool if, from the metadata, we could produce something like what we currently do for our gh-pages. Either store the generated gh-page links or actually store the model templates?
Wouldn't this information be implicitly contained in "version"? i.e. whatever version the instance is on has that version's db models. I mean, the shorthand would be "install cdp-backend=={VERSION STORED IN DB}" then run "create_cdp_database_uml". Although, this is just an argument for us to move over to readthedocs instead of self-publishing our docs. readthedocs stores all versions of doc pages so we could link straight to them that way.
Maybe I'm thinking too far into the future. I was thinking of the web API in CouncilDataProject/cdp-roadmap#3; it would be nice to have an endpoint like
Totally fine with thinking for the future 🙂 -- so I guess here is my take: if we record the version and switch over to readthedocs, we can just store a link to these. Like
Yes, I think that's enough. It would be good to have templates available without needing to install cdp-backend.
I think of the cookiecutter params,
Also, this wouldn't really be infrastructure/instance-related metadata, but maybe we could keep track of the average transcript confidence across all transcript models in the database? Like every time a transcript is created we update a single value keeping track of the running average confidence, and recalculate it if transcript models are ever deleted. Could be an interesting/eye-catching stat to show for an instance.
Interesting. I like it but have a follow-up question: would you want the average to include only the highest-confidence transcript for each session, or all transcripts stored in the instance? When generating the search index that is what we do, and I believe when determining which transcript to render on the frontend we do the same (@tohuynh correct me if I am wrong there). I would argue that if we were to store an average confidence for transcripts in the instance, the more important number is that one, with session duplicates / transcripts with lower confidence for the same session filtered out of the set.

If we want that, the logic may be a bit tricky, but if we just want all transcripts it's definitely possible to make it efficient with some sort of incremental averaging. I actually wrote a nifty handler for incremental averaging for the topic segmentation problem, but it turned out I didn't need it, so it's nice that I can likely just pull that code into this repo 😂

Technically the set of highest-confidence session transcripts is contained within the set of all transcripts, so maybe I am just being pedantic, especially since we don't currently have any process in place that will generate a second (or third, etc.) transcript for the same session. I think the running list of data to store from my side is:
Also pinging @nniiicc, any ideas for metadata regarding the instance stored in the instance itself? |
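The incremental averaging mentioned a couple of comments up could look something like the following. This is only a sketch of the idea (not the actual handler written for the topic segmentation work): keep a count and a mean, update the mean in O(1) whenever a transcript is created, and apply the inverse update when one is deleted.

```python
class RunningAverage:
    """Incrementally maintained mean, so a stored confidence stat can be
    updated per transcript event without re-reading the whole collection.
    Sketch only; names and structure are illustrative."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0

    def add(self, value: float) -> float:
        # new_mean = old_mean + (value - old_mean) / new_count
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean

    def remove(self, value: float) -> float:
        # Inverse update, for when a transcript model is deleted.
        self.count -= 1
        if self.count == 0:
            self.mean = 0.0
        else:
            self.mean -= (value - self.mean) / self.count
        return self.mean
```

In practice the count and mean would live in the metadata document itself, with a Firestore transaction around the read-update-write.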
Yeah, on the front end we retrieve the transcript with the highest confidence. But the difference between the two options would be small, right? I'd say go with the second option since it's easier to calculate and it's always lower than the avg confidence of transcripts in use by search index and front end. |
Agree. Even if it's lower it should be fine. An interesting note is that if the pipeline is provided a closed caption file, we just assert that the confidence is 0.97 which is questionable. We somewhat pulled that number out of the air. |
I'd be fine with either using the highest confidence or all of them; it depends on what exactly we want the total confidence to represent.
Hm, looks like it got added here. Was there some background context behind that confidence, or was it just something like an accuracy stat that WebVTT claims?
I made that number up. I basically just looked at the closed caption produced transcripts and gave them a number I felt best represented how confident I am in their accuracy. |
Then in that case I'd be fine with making confidence optional in the transcript db model and omitting it when the captions are already provided. Unless there's a specific use for it on the frontend; I don't think it's used for anything else in the backend.
The purpose of it is to select the "best" transcript for a session. Because we can have multiple transcripts (from reprocessing, or if we ever get a better model, or if we get closed captions for a session after it has already been processed and generated one using speech-to-text), we need to select the best one for a session before indexing and for rendering on the frontend. In that way, I chose 0.97 because I thought it was high enough that it would be selected over the speech-to-text generated transcripts and low enough that if we ever got a better model they would be selected out.
Hm, in that case, thoughts on keeping it at 0.97 and only adding to the average when we use transcription without provided captions? But if that's too specific then I'm good with just not including the average confidence stat in the metadata stack.
Yea, that's fair. I think it's decently easy to simply run a query and calculate it yourself if you're interested. So maybe we just don't include it, unfortunately... (Side note: I really do mean unfortunately. I thought a lot about this over the last couple of days and I think it would be really interesting to see if there are regional differences in how confident Google's model is, i.e. Pacific Northwest cities / transcripts vs east coast cities / transcripts.)
So after a bit, I think for now the basic metadata should be:
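As a rough sketch, the single metadata document discussed above might look like the dictionary below. Only the version field is actually confirmed by this thread; the other keys are hypothetical guesses at which cookiecutter params might be worth mirroring, and all names are illustrative.

```python
# Hypothetical shape for the single instance-metadata document.
# "cdp_backend_version" reflects the "version" field discussed above;
# the remaining keys are made-up stand-ins for cookiecutter params.
instance_metadata = {
    "cdp_backend_version": "3.0.3",        # example value, not a real pin
    "municipality_name": "Example City",   # hypothetical cookiecutter param
    "infrastructure_slug": "cdp-example",  # hypothetical cookiecutter param
}
```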
Feature Description
Partially discussed in CouncilDataProject/cdp-roadmap#3, it would be nice to have a collection with a single document to pull metadata about the instance.
Use Case
Most important would be which cdp version the instance is currently on. This would help in determining which features the instance supports.
Solution
Use the Pulumi GCP Firestore document resource to create / update the same document every time the infrastructure is upgraded.
https://www.pulumi.com/docs/reference/pkg/gcp/firestore/document/
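A minimal sketch of what that resource could look like in a Pulumi Python program, assuming the pulumi-gcp provider. The collection name, document id, and field names here are hypothetical, not decided by this issue, and the fields payload follows the Firestore REST API's typed-value JSON format that the resource expects.

```python
import json

import pulumi_gcp as gcp

# Sketch only: create/update one well-known document on every `pulumi up`.
metadata_doc = gcp.firestore.Document(
    "instance-metadata",
    collection="metadata",        # hypothetical collection name
    document_id="configuration",  # hypothetical document id
    # `fields` is a JSON-encoded map of typed Firestore values.
    fields=json.dumps(
        {
            "cdp_backend_version": {"stringValue": "3.0.3"},  # example value
        }
    ),
)
```

Because Pulumi tracks the resource in its state, re-running the deployment with a new version string updates the existing document rather than creating a new one.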
@tohuynh @isaacna can you think of other metadata that would be good to store in the database itself about the instance?
version, some subset of cookiecutter params maybe? anything else?