Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add transcription range fields to database and ingestion models, add … #221

Merged

Conversation

chrisjkhan
Copy link
Contributor

Link to Relevant Issue

This pull request resolves #211

Description of Changes

-Add optional transcription range fields to database model
-Add optional transcription range fields to ingestion model
-Add validator for time duration
-Add range filter to ffmpeg audio split (similar to clip functionality)
-Update tests for valid inputs

…validator for time duration, add range filter to ffmpeg audio split, update tests
@evamaxfield evamaxfield added the enhancement New feature or request label Nov 9, 2022
Copy link
Member

@evamaxfield evamaxfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Codewise, this all looks great. Would be curious as to how you found or contributing notes and all of that. Any places for improvement?

Implementation wise... did you have a chance to test this on a dev deployment? I think this would transcribe the portion selected but the web UI / session interface may fail because the video doesn't match up with the audio... the best way to do this interestingly may be to split the video by timestamp and then do all the normal downstream processing rather than split the audio by timestamp...

Are you on the call this Thursday? If so it might be good to talk about this feature then. / How to dev-deploy

@codecov
Copy link

codecov bot commented Nov 9, 2022

Codecov Report

Merging #221 (46aff14) into main (3fb35a2) will increase coverage by 0.07%.
The diff coverage is 84.05%.

@@            Coverage Diff             @@
##             main     #221      +/-   ##
==========================================
+ Coverage   72.33%   72.40%   +0.07%     
==========================================
  Files          64       64              
  Lines        3492     3541      +49     
==========================================
+ Hits         2526     2564      +38     
- Misses        966      977      +11     
Impacted Files Coverage Δ
cdp_backend/pipeline/mock_get_events.py 100.00% <ø> (ø)
cdp_backend/utils/file_utils.py 87.26% <57.14%> (-2.64%) ⬇️
cdp_backend/pipeline/event_gather_pipeline.py 84.70% <63.63%> (-0.76%) ⬇️
cdp_backend/pipeline/ingestion_models.py 99.01% <91.66%> (-0.99%) ⬇️
cdp_backend/database/models.py 100.00% <100.00%> (ø)
cdp_backend/database/validators.py 89.04% <100.00%> (+0.98%) ⬆️
cdp_backend/tests/database/test_validators.py 100.00% <100.00%> (ø)
...ckend/tests/pipeline/test_event_gather_pipeline.py 100.00% <100.00%> (ø)
cdp_backend/tests/utils/test_file_utils.py 100.00% <100.00%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@chrisjkhan
Copy link
Contributor Author

Codewise, this all looks great. Would be curious as to how you found or contributing notes and all of that. Any places for improvement?

I made it this far! If there's anything that I'm doing that makes it appear I missed the documentation, let me know and I might have some feedback.

Implementation wise... did you have a chance to test this on a dev deployment?

No, that was the last thing I mentioned to @smai-f. Hopefully, we can get a test process going as we integrate the scraper.

I think this would transcribe the portion selected but the web UI / session interface may fail because the video doesn't match up with the audio... the best way to do this interestingly may be to split the video by timestamp and then do all the normal downstream processing rather than split the audio by timestamp...

My ideal was to do as you mentioned, and probably integrate with clip functionality. The main reason I didn't was because there seems to be a reliance on existing video hosting, and I thought that would be worth preserving.

I didn't realize how the front end functionality worked (oops!), but I think there are some easy ways to preserve both the existing video hosting and accurate timestamps. The simplest might be to send the offset to the transcribe function to alter the timestamps in the transcript:

In addition, we can update the front end so it seeks to the appropriate place on page load: https://github.com/CouncilDataProject/cdp-frontend/blob/77944572b816ed07ef0dcffd772fc19b52aea12e/src/components/Details/EventVideo/EventVideo.tsx#L133

Are you on the call this Thursday? If so it might be good to talk about this feature then. / How to dev-deploy

I should be available. Can you add me to the event?

(None, "10"),
("1", None),
("0", "0"),
("1", "0"),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add tests to see how ffmpeg handles 0 or negative duration output.

@dphoria
Copy link
Contributor

dphoria commented Nov 10, 2022

Beyond what Eva already mentioned, about the different video and audio stream lengths maybe being an issue, I think my only wish is that split_audio checks end_time > start_time in some way.

Thank you very much @chrisjkhan for this!

@chrisjkhan
Copy link
Contributor Author

Beyond what Eva already mentioned, about the different video and audio stream lengths maybe being an issue, I think my only wish is that split_audio checks end_time > start_time in some way.

We decided at the meeting to do full video clipping and host the video. A combination of reusable functionality, relative low priority of original video hosting, simplicity of downstream issues, and overall UX. I'll change the functionality and the semantics of the whole issue.

Thank you very much @chrisjkhan for this!

You're welcome

@dphoria
Copy link
Contributor

dphoria commented Nov 14, 2022

Beyond what Eva already mentioned, about the different video and audio stream lengths maybe being an issue, I think my only wish is that split_audio checks end_time > start_time in some way.

We decided at the meeting to do full video clipping and host the video. A combination of reusable functionality, relative low priority of original video hosting, simplicity of downstream issues, and overall UX. I'll change the functionality and the semantics of the whole issue.

Sounds great. Just let me know / hit the "re-request review" button whenever.

… host when limiting video to a range, updated mp4 conversion to allow a range, connected mp4 to clip functionality, updated tests and tried to make testing slightly more consistent, added Session ingestion verification to test out
@chrisjkhan
Copy link
Contributor Author

chrisjkhan commented Nov 21, 2022

I may have gone a little overboard with changes trying to slightly increase consistency across method signatures. The only thing I changed that might be controversial is making the hash calculation part of an already large task rather than keeping it separate. If you really want the hash to be performant, I would suggest modifying it to use CRC/MD5/SHA-1 and do a combination of the first x bytes/blocks and file size.

I tested negative duration pairs (end_time < start_time) causing a event processing failure, as desired. Positive duration pairs cause the file to be trimmed and CDP hosted, and considered unique/distinct from full length videos. Transcription works as expected.

@chrisjkhan chrisjkhan requested review from evamaxfield and dphoria and removed request for evamaxfield November 21, 2022 16:12
Copy link
Contributor

@dphoria dphoria left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm good with this. Thank you, Chris! I'll let Eva do the honors, though. She's a much better reviewer. (I also need to figure out how to start a review, make N comments for the one review, end the review...)

@smai-f
Copy link

smai-f commented Nov 27, 2022

Thanks everybody! Here are some manual tests using Chris' latest commit in the Montana legislature instance:

Copy link
Member

@evamaxfield evamaxfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all looks good to me. I have one remaining question before I click the approve and merge buttons.

Q: should we take the trim start time and add that to the session start time / event start times?

lets say the event starts at noon but you want to trim to 1 hour into the video. should the session and event start time say: "1pm"?

cdp_backend/pipeline/event_gather_pipeline.py Show resolved Hide resolved
cdp_backend/pipeline/event_gather_pipeline.py Outdated Show resolved Hide resolved
cdp_backend/pipeline/ingestion_models.py Show resolved Hide resolved
cdp_backend/tests/database/test_validators.py Show resolved Hide resolved
@evamaxfield
Copy link
Member

Also, thank you for such great work on this! Really nicely done and thank you @smai-f for helping test!

@evamaxfield
Copy link
Member

lets say the event starts at noon but you want to trim to 1 hour into the video. should the session and event start time say: "1pm"?

cc @BrianL3 @dphoria @smai-f this is a question i would love all of your opinions on

@chrisjkhan
Copy link
Contributor Author

lets say the event starts at noon but you want to trim to 1 hour into the video. should the session and event start time say: "1pm"?

Ah, I missed this comment. My guess is it should still be noon. There is a session_datetime, so that value would be over constrained. I chose the very general semantics of video_start_time to leave room for any possible reasons for trimming, leaving a disconnect between event/session/etc times and video times. People more familiar with usage would know better though. If we did make this kind of link in timing, it might make sense to narrow the semantics.

The only case I can imagine is a session "starting" at noon, the video recording starting at noon, but then actual activity starting at 1pm. My guess is, if this 1 hour of dead time is in the source data being scraped, session_datetime would reflect the actual start time instead of the false start time. If dead time not in the source data, it couldn't be scraped and then trimmed.

@dphoria
Copy link
Contributor

dphoria commented Dec 3, 2022

lets say the event starts at noon but you want to trim to 1 hour into the video. should the session and event start time say: "1pm"?

Ah, I missed this comment. My guess is it should still be noon. There is a session_datetime, so that value would be over constrained. I chose the very general semantics of video_start_time to leave room for any possible reasons for trimming, leaving a disconnect between event/session/etc times and video times. People more familiar with usage would know better though. If we did make this kind of link in timing, it might make sense to narrow the semantics.

The only case I can imagine is a session "starting" at noon, the video recording starting at noon, but then actual activity starting at 1pm. My guess is, if this 1 hour of dead time is in the source data being scraped, session_datetime would reflect the actual start time instead of the false start time. If dead time not in the source data, it couldn't be scraped and then trimmed.

First, great question Eva. 👏 I'm in line with Chris' thoughts; (I hope I'm understanding correctly) At the core, there are 2 fields aded to the model, the video start and end times (right?). At least to me, if I look at a model, and see that 'session start time' is noon, and 'video start time' is 1 PM, that would sound pretty natural. I may at first be slightly confused, but even without any external explanation, I think I would realize that this means, in this case, the video begins 1 hour into the session.

If I'm looking at an example of what's probably the vast majority of the models - 'session start time' == 'video start time', that again wouldn't throw me off, especially if I have at that point, seen an example like above when they are different.

@chrisjkhan
Copy link
Contributor Author

At least to me, if I look at a model, and see that 'session start time' is noon, and 'video start time' is 1 PM, that would sound pretty natural.

Maybe there is still some confusion. video_start_time is an HH:MM:SS time/duration, rather than a datetime. It's relative to the beginning of the video. Should I do something to make this more clear?

@smai-f
Copy link

smai-f commented Dec 5, 2022

@evamaxfield From my part consuming this PR in the Montana legislative scraper (and speaking the time/duration language Chris is pointing out of video_start_time, not considering it a datetime):

  • The scraped session_datetime on the page (not the video_start_time) already accounts for the offset into the video (that is, if the long meeting video starts at 1pm, and the bill discussion which is the part of the video I'm segmenting starts at 3pm, the piece of data I'm scraping on the page is 3pm for the session_datetime). So I don't need to do anything to get the correct session datetime, and it would be incorrect if the backend added the offset to it.
  • The piece of math I did have to do (which seems unusual, and like this won't be the case on most frontends) is that the audio player actually sets the 0:00 place of the video to be beginning time of the long video. So in the previous example, the video starts at 13:00:00, the part I'm scraping starts at 15:00:00, so I had to do the math of 15:00:00 - 13:00:00 = 2:00:00, to set the video_start_time to be 2:00:00.

I think the way Chris has it now makes the most sense from my narrow consuming perspective. Have the consumer decide whether or not they want to augment the session datetime to include the video start time offset or not. I can imagine reasons to support both behaviors related to the start time of the meeting aligning with a schedule written in another place.

@dphoria
Copy link
Contributor

dphoria commented Dec 6, 2022

At least to me, if I look at a model, and see that 'session start time' is noon, and 'video start time' is 1 PM, that would sound pretty natural.

Maybe there is still some confusion. video_start_time is an HH:MM:SS time/duration, rather than a datetime. It's relative to the beginning of the video. Should I do something to make this more clear?

Thank you for pointing that out. I missed it!

@evamaxfield
Copy link
Member

evamaxfield commented Dec 6, 2022

Thanks for all the discussion and clarification. Yes the only think I think should be changed prior to merge is adding the tiniest bit of documentation to the ingestion model that states the video time is relative to the session time.

I would just put it in the class docstring as we still need to find a better overall project documentation structure haha

@chrisjkhan
Copy link
Contributor Author

chrisjkhan commented Dec 6, 2022

I would just put it in the class docstring...

@evamaxfield Added. Let me know how it looks. I followed the Notes model in the rest of the file. It's a bit verbose.

Copy link
Member

@evamaxfield evamaxfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect. Verbose is more than welcome. Clicking the merge button!

@evamaxfield evamaxfield merged commit b00f378 into CouncilDataProject:main Dec 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Accept timestamps and only process a subset of a video as an event
4 participants