Add basic instance metadata collection to infrastructure stack #117

Closed
evamaxfield opened this issue Oct 1, 2021 · 17 comments · Fixed by #135

@evamaxfield
Member

Feature Description

A clear and concise description of the feature you're requesting.

Partially discussed in CouncilDataProject/cdp-roadmap#3, it would be nice to have a collection with a single document to pull metadata about the instance.

Use Case

Please provide a use case to help us understand your request in context.

Most important would be which CDP version the instance is currently on. This would help determine which features the instance supports.

Solution

Please describe your ideal solution.

Use the Pulumi GCP Firestore document resource to create or update the same document every time the infrastructure is upgraded.
https://www.pulumi.com/docs/reference/pkg/gcp/firestore/document/

@tohuynh @isaacna can you think of other metadata about the instance that would be good to store in the database itself?
Version, maybe some subset of the cookiecutter params? Anything else?
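
A minimal sketch of what the document payload could look like. The Pulumi `gcp.firestore.Document` resource takes its `fields` argument as a JSON string in Firestore's typed-value format. The helper name, resource name, collection name, document ID, and field name below are all assumptions for illustration, not the project's actual schema.

```python
import json


def to_firestore_fields(metadata: dict) -> str:
    """Encode flat string metadata in Firestore's typed-value JSON format."""
    fields = {key: {"stringValue": str(value)} for key, value in metadata.items()}
    return json.dumps(fields, sort_keys=True)


fields_json = to_firestore_fields({"cdp_backend_version": "3.0.0"})

# With pulumi_gcp installed, this string could be passed roughly as:
#   gcp.firestore.Document(
#       "instance-metadata",          # hypothetical resource name
#       collection="metadata",        # hypothetical collection name
#       document_id="configuration",  # fixed ID so each upgrade overwrites it
#       fields=fields_json,
#   )
```

Using a fixed document ID is what would make every infrastructure upgrade update the same document rather than create a new one.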

@evamaxfield added the enhancement (New feature or request) label Oct 1, 2021
@evamaxfield self-assigned this Oct 1, 2021
@evamaxfield added this to Ready for Dev in v3.0 via automation Oct 1, 2021
@tohuynh
Collaborator

tohuynh commented Oct 6, 2021

Not sure how you would store this. But from a researcher's perspective querying across multiple instances I'd want to be able to determine what the database models and transcript models (maybe even ingestion models too) look like for each instance. It would be cool if from the metadata we could produce something like what we currently do for our gh-pages. Either store the generated gh-page links or actually store the model templates?

@evamaxfield
Member Author

> Not sure how you would store this. But from a researcher's perspective querying across multiple instances I'd want to be able to determine what the database models and transcript models (maybe even ingestion models too) look like for each instance. It would be cool if from the metadata we could produce something like what we currently do for our gh-pages. Either store the generated gh-page links or actually store the model templates?

Wouldn't this information be implicitly contained in "version"? I.e., whatever version the instance is on has that version's DB models. I mean, the shorthand would be "install cdp-backend=={VERSION STORED IN DB}" then run "create_cdp_database_uml".

Although, this is just an argument for us to move over to readthedocs instead of self publishing our docs. readthedocs stores all versions of doc pages so we could link straight to it that way.

@tohuynh
Collaborator

tohuynh commented Oct 6, 2021

Maybe I'm thinking too far into the future. I was thinking of the web API in CouncilDataProject/cdp-roadmap#3, it would be nice to have an endpoint like /transcript_models/seattle or /instances/seattle/transcript_model to allow users to retrieve the transcript model without having to install a cdp-backend version. Or something else that would allow users to see a template of the data they are retrieving.

@evamaxfield
Member Author

> Maybe I'm thinking too far into the future. I was thinking of the web API in CouncilDataProject/cdp-roadmap#3, it would be nice to have an endpoint like /transcript_models/seattle or /instances/seattle/transcript_model to allow users to retrieve the transcript model without having to install a cdp-backend version. Or something else that would allow users to see a template of the data they are retrieving.

Totally fine with thinking for the future 🙂 -- So here is my take: if we record the version and switch over to readthedocs, we can just store a link to these, like cdp-backend.readthedocs.org/v3.0.0/transcript_model.html or whatever the URL is. It can be constructed. Is that not enough?
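
As a rough illustration of "it can be constructed", here is a sketch that builds such a link from the stored version. The URL pattern simply mirrors the placeholder in the comment above ("or whatever the url is"), so the host, path layout, and the `docs_url` helper are assumptions, not the real docs location.

```python
def docs_url(version: str, page: str = "transcript_model.html") -> str:
    """Build a versioned docs link from the version stored in the instance."""
    # Accept both "3.0.0" and "v3.0.0" as stored version strings.
    tag = version if version.startswith("v") else f"v{version}"
    # Placeholder URL pattern taken from the discussion, not a verified address.
    return f"https://cdp-backend.readthedocs.org/{tag}/{page}"
```

A client would only need the version field from the metadata document to produce the link, so nothing beyond the version has to be stored.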

@tohuynh
Collaborator

tohuynh commented Oct 6, 2021

> Is that not enough?

Yes, I think that's enough. It would be good to have templates available without needing to install cdp-backend.

@isaacna
Collaborator

isaacna commented Oct 8, 2021

I think of the cookiecutter params, firestore_region might be the only one that makes sense to add as metadata, since the other ones are pretty easy to infer based on the identity of the instance. The region could be useful if we or the user ever want to collect/analyze metrics on billing costs or latency.

@isaacna
Collaborator

isaacna commented Oct 8, 2021

Also this wouldn't really be infrastructure/instance related metadata, but maybe we could keep track of the average transcript confidence across all transcript models in the database? Like every time a transcript is created we update a single value keeping track of the running average confidence, and recalculate it if transcript models are ever deleted.

Could be an interesting/eye-catching stat to show for an instance.

@evamaxfield
Member Author

Interesting. I like it but have a follow up question:

Would you want the average to include only the highest confidence transcript for each session, or all transcripts stored in the instance?

When generating the search index, the former is what we do, and I believe we do the same when determining which transcript to render on the frontend (@tohuynh correct me if I am wrong there). I would argue that if we were to store an average confidence for transcripts in the instance, the more important number is that one: the average with session duplicates / lower-confidence transcripts for the same session filtered out of the set.

If we want that, the logic may be a bit tricky. But if we just want all transcripts, it's definitely possible to make it efficient with some sort of incremental averaging. I actually wrote a nifty handler for incremental averaging for the topic segmentation problem, but it turned out I didn't need it, so it's nice that I can likely just pull that code into this repo 😂

Technically, the set of highest-confidence session transcripts is contained within the set of all transcripts, so maybe I am just being pedantic, especially since we don't currently have any process in place that will generate a second (or third, etc.) transcript for the same session.
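
For reference, a minimal sketch of incremental averaging over all transcripts. It is not the handler mentioned above; the `RunningAverage` class and its method names are hypothetical. It only needs the current mean and the transcript count, and it also covers the deletion case raised earlier.

```python
from dataclasses import dataclass


@dataclass
class RunningAverage:
    """Incrementally maintained mean that stores only the mean and a count."""

    mean: float = 0.0
    count: int = 0

    def add(self, value: float) -> None:
        # new mean = old mean + (value - old mean) / new count
        self.count += 1
        self.mean += (value - self.mean) / self.count

    def remove(self, value: float) -> None:
        # Reverse of add; assumes `value` was previously added.
        if self.count <= 1:
            self.mean, self.count = 0.0, 0
            return
        self.mean = (self.mean * self.count - value) / (self.count - 1)
        self.count -= 1
```

Each update is constant time and never requires re-reading the existing transcripts, which is why the count has to be stored next to the average.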


I think the running list of data to store from my side is:

  • current cdp backend version
  • municipality name
  • hosting_github_url
  • firestore_region -- note, this is a cool one because Google releases stats on each region's carbon footprint
  • average_transcript_confidence
  • n_transcripts (required for incremental averaging)

Also pinging @nniiicc, any ideas for metadata regarding the instance stored in the instance itself?

@tohuynh
Collaborator

tohuynh commented Oct 8, 2021

> Would you want the average to include only the highest confidence transcript for each session, or all transcripts stored in the instance?

> I believe when determining which transcript to render on the frontend we do the same

Yeah, on the front end we retrieve the transcript with the highest confidence. But the difference between the two options would be small, right? I'd say go with the second option since it's easier to calculate and it will usually be no higher than the average confidence of the transcripts used by the search index and front end.

@evamaxfield
Member Author

Agree. Even if it's lower it should be fine.

An interesting note is that if the pipeline is provided a closed caption file, we just assert that the confidence is 0.97, which is questionable. We somewhat pulled that number out of the air.

@isaacna
Collaborator

isaacna commented Oct 8, 2021

I'd be fine with either using the highest confidence or all of them; it depends on what exactly we want the total confidence to represent.

> we just assert that the confidence is 0.97 which is questionable. We somewhat pulled that number out of the air.

Hm looks like it got added here. Was there some background context into that confidence or was that just something like an accuracy stat that WebVTT claims?

@evamaxfield
Member Author

evamaxfield commented Oct 8, 2021

> Hm looks like it got added here. Was there some background context into that confidence or was that just something like an accuracy stat that WebVTT claims?

I made that number up. I basically just looked at the closed caption produced transcripts and gave them a number I felt best represented how confident I am in their accuracy.

@isaacna
Collaborator

isaacna commented Oct 9, 2021

> I made that number up. I basically just looked at the closed caption produced transcripts and gave them a number I felt best represented how confident I am in their accuracy.

Then in that case I'd be fine making confidence optional in the transcript db model and omitting it when the captions are already provided. Unless there's a specific use for it on the frontend, but I don't think it's used for anything else in the backend.

@evamaxfield
Member Author

> > I made that number up. I basically just looked at the closed caption produced transcripts and gave them a number I felt best represented how confident I am in their accuracy.

> Then in that case I'd be fine making confidence optional in the transcript db model and omitting it when the captions are already provided. Unless there's a specific use for it on the frontend, but I don't think it's used for anything else in the backend.

The purpose of it is to select the "best" transcript for a session. Because we can have multiple transcripts (from reprocessing, from a better model if we ever get one, or from closed captions that arrive after a session has already been processed with speech to text), we need to select the best one for a session before indexing and for rendering on the front end.

In that way, I chose 0.97 because I thought it was high enough that it would be selected over the speech-to-text generated transcripts and low enough that, if we ever got a better model, its transcripts would be selected instead.
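
To make the selection rule concrete, here is a sketch of picking the highest-confidence transcript per session. The dictionary keys (`session_id`, `confidence`, `source`) and the function name are assumptions for illustration and do not reflect the actual cdp-backend models.

```python
from typing import Dict, Iterable


def best_transcript_per_session(transcripts: Iterable[dict]) -> Dict[str, dict]:
    """Return the highest-confidence transcript for each session ID."""
    best: Dict[str, dict] = {}
    for transcript in transcripts:
        current = best.get(transcript["session_id"])
        # Keep the first transcript seen, or replace it with a more confident one.
        if current is None or transcript["confidence"] > current["confidence"]:
            best[transcript["session_id"]] = transcript
    return best


transcripts = [
    {"session_id": "s1", "confidence": 0.91, "source": "speech-to-text"},
    {"session_id": "s1", "confidence": 0.97, "source": "closed-captions"},
    {"session_id": "s2", "confidence": 0.88, "source": "speech-to-text"},
]
selected = best_transcript_per_session(transcripts)
```

With these made-up values, the caption-based transcript wins for `s1` because 0.97 exceeds 0.91, while a future model scoring above 0.97 would take over without any code change.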

@isaacna
Collaborator

isaacna commented Oct 10, 2021

Hm in that case, thoughts on keeping it at 0.97 and only adding to the average when we use transcription without provided captions? But if that's too specific then I'm good just not including the average confidence stat in the metadata stack.

@evamaxfield
Member Author

> Hm in that case, thoughts on keeping it at 0.97 and only adding to the average when we use transcription without provided captions? But if that's too specific then I'm good just not including the average confidence stat in the metadata stack.

Yea that's fair. I think it's decently easy to simply run a query and calculate it yourself if you're interested. So maybe we just don't include it unfortunately...

(Sidenote, I really do mean unfortunately. I thought a lot about this over the last couple of days and I think it would be really interesting to see if there are regional differences in how confident Google's model is, i.e. Pacific Northwest cities / transcripts vs East Coast cities / transcripts.)

@evamaxfield
Member Author

So after a bit, I think for now the basic metadata should be:

  • current cdp backend version
  • municipality name
  • hosting_github_url
  • firestore_region
  • municipality type (city council, school board, etc.) -- default to city council
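
One way the final list could be modeled, as a sketch only: the class name and snake_case field names are assumptions derived from the bullets above, with the municipality type defaulting to city council as described.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class InstanceMetadata:
    """Hypothetical shape of the single metadata document for an instance."""

    cdp_backend_version: str
    municipality_name: str
    hosting_github_url: str
    firestore_region: str
    # Defaults to city council; school boards etc. would override it.
    municipality_type: str = "city council"


example = InstanceMetadata(
    cdp_backend_version="3.0.0",
    municipality_name="Seattle",
    hosting_github_url="https://github.com/example/seattle",  # placeholder URL
    firestore_region="us-west1",  # placeholder region
)
document = asdict(example)  # plain dict, ready to be written as one document
```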

v3.0 automation moved this from Ready for Dev to Done Nov 13, 2021

3 participants