Add basic instance metadata collection to infrastructure stack #117
Comments
Not sure how you would store this, but from a researcher's perspective, querying across multiple instances, I'd want to be able to determine what the database models and transcript models (maybe even ingestion models too) look like for each instance. It would be cool if, from the metadata, we could produce something like what we currently do for our gh-pages. Either store the generated gh-page links or actually store the model templates?
Wouldn't this information be implicitly contained in "version"? i.e. whatever version the instance is on has that version's db models. I mean, the shorthand would be "install cdp-backend=={VERSION STORED IN DB}" then run "create_cdp_database_uml". Although, this is just an argument for us to move over to readthedocs instead of self-publishing our docs. readthedocs stores all versions of doc pages so we could link straight to them that way.
Maybe I'm thinking too far into the future. I was thinking of the web API in CouncilDataProject/cdp-roadmap#3; it would be nice to have an endpoint like
Totally fine with thinking for the future 🙂 -- so I guess here is my take: if we record the version and switch over to readthedocs, we can just store a link to these. Like
Yes, I think that's enough. It would be good to have templates available without needing to install cdp-backend.
I think of the cookiecutter params,
Also, this wouldn't really be infrastructure/instance-related metadata, but maybe we could keep track of the average transcript confidence across all transcript models in the database? Like every time a transcript is created we update a single value keeping track of the running average confidence, and recalculate it if transcript models are ever deleted. Could be an interesting/eye-catching stat to show for an instance.
Interesting. I like it but have a follow-up question: would you want the average to include only the highest-confidence transcript for each session, or all transcripts stored in the instance? When generating the search index that is what we do, and I believe when determining which transcript to render on the frontend we do the same (@tohuynh correct me if I am wrong there). I would argue that if we were to store an average confidence for transcripts in the instance, the more important number is that one, with session duplicates / transcripts with lower confidence for the same session filtered out of the set.

If we want that, the logic may be a bit tricky, but if we just want all transcripts it's definitely possible to make it efficient with some sort of incremental averaging. I actually wrote a nifty handler for incremental averaging for the topic segmentation problem, but it turned out I didn't need it, so it's nice that I can likely just pull that code into this repo 😂

Technically the set of highest-confidence session transcripts is contained within the set of all transcripts, so maybe I am just being pedantic, especially since we don't currently have any process in place that will generate a second (or third, etc.) transcript for the same session. I think the running list of data to store from my side is:
Also pinging @nniiicc, any ideas for metadata regarding the instance stored in the instance itself? |
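The incremental averaging mentioned a couple of comments up could look something like the following. This is only a sketch of the idea (not the actual handler written for the topic segmentation work): keep a count and a mean, update the mean in O(1) whenever a transcript is created, and apply the inverse update when one is deleted.

```python
class RunningAverage:
    """Incrementally maintained mean, so a stored confidence stat can be
    updated per transcript event without re-reading the whole collection.
    Sketch only; names and structure are illustrative."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0

    def add(self, value: float) -> float:
        # new_mean = old_mean + (value - old_mean) / new_count
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean

    def remove(self, value: float) -> float:
        # Inverse update, for when a transcript model is deleted.
        self.count -= 1
        if self.count == 0:
            self.mean = 0.0
        else:
            self.mean -= (value - self.mean) / self.count
        return self.mean
```

In practice the count and mean would live in the metadata document itself, with a Firestore transaction around the read-update-write.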
Yeah, on the front end we retrieve the transcript with the highest confidence. But the difference between the two options would be small, right? I'd say go with the second option since it's easier to calculate and it's always lower than the avg confidence of transcripts in use by search index and front end. |
Agree. Even if it's lower it should be fine. An interesting note is that if the pipeline is provided a closed caption file, we just assert that the confidence is 0.97 which is questionable. We somewhat pulled that number out of the air. |
I'd be fine with either using the highest confidence or all of them; it depends on what exactly we want the total confidence to represent.
Hm, looks like it got added here. Was there some background context behind that confidence, or was it just something like an accuracy stat that WebVTT claims?
I made that number up. I basically just looked at the closed caption produced transcripts and gave them a number I felt best represented how confident I am in their accuracy. |
Then in that case I'd be fine with making confidence optional in the transcript db model and omitting it when the captions are already provided. Unless there's a specific use for it on the frontend; I don't think it's used for anything else in the backend.
The purpose of it is to select the "best" transcript for a session. Because we can have multiple transcripts (from reprocessing, or if we ever get a better model, or if we get closed captions for a session after it has already been processed and generated one using speech-to-text), we need to select the best one for a session before indexing and for rendering on the frontend. In that way, I chose 0.97 because I thought it was high enough that it would be selected over the speech-to-text generated transcripts and low enough that if we ever got a better model they would be selected out.
Hm, in that case, thoughts on keeping it at 0.97 and only adding to the average when we use transcription without provided captions? But if that's too specific then I'm good with just not including the average confidence stat in the metadata stack.
Yea, that's fair. I think it's decently easy to simply run a query and calculate it yourself if you're interested. So maybe we just don't include it, unfortunately... (Side note: I really do mean unfortunately. I thought a lot about this over the last couple of days and I think it would be really interesting to see if there are regional differences in how confident Google's model is, i.e. Pacific Northwest cities / transcripts vs east coast cities / transcripts.)
So after a bit, I think for now the basic metadata should be:
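As a rough sketch, the single metadata document discussed above might look like the dictionary below. Only the version field is actually confirmed by this thread; the other keys are hypothetical guesses at which cookiecutter params might be worth mirroring, and all names are illustrative.

```python
# Hypothetical shape for the single instance-metadata document.
# "cdp_backend_version" reflects the "version" field discussed above;
# the remaining keys are made-up stand-ins for cookiecutter params.
instance_metadata = {
    "cdp_backend_version": "3.0.3",        # example value, not a real pin
    "municipality_name": "Example City",   # hypothetical cookiecutter param
    "infrastructure_slug": "cdp-example",  # hypothetical cookiecutter param
}
```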
Feature Description
Partially discussed in CouncilDataProject/cdp-roadmap#3, it would be nice to have a collection with a single document to pull metadata about the instance.
Use Case
Most important would be which cdp version the instance is currently on. This would help in determining which features the instance supports.
Solution
Use the Pulumi GCP Firestore document resource to create / update the same document every time the infrastructure is upgraded.
https://www.pulumi.com/docs/reference/pkg/gcp/firestore/document/
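A minimal sketch of what that resource could look like in a Pulumi Python program, assuming the pulumi-gcp provider. The collection name, document id, and field names here are hypothetical, not decided by this issue, and the fields payload follows the Firestore REST API's typed-value JSON format that the resource expects.

```python
import json

import pulumi_gcp as gcp

# Sketch only: create/update one well-known document on every `pulumi up`.
metadata_doc = gcp.firestore.Document(
    "instance-metadata",
    collection="metadata",        # hypothetical collection name
    document_id="configuration",  # hypothetical document id
    # `fields` is a JSON-encoded map of typed Firestore values.
    fields=json.dumps(
        {
            "cdp_backend_version": {"stringValue": "3.0.3"},  # example value
        }
    ),
)
```

Because Pulumi tracks the resource in its state, re-running the deployment with a new version string updates the existing document rather than creating a new one.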
@tohuynh @isaacna can you think of other metadata that would be good to store in the database itself about the instance?
version, some subset of cookiecutter params maybe? anything else?