Add transcription range fields to database and ingestion models, add … #221

chrisjkhan · 2022-11-08T02:46:17Z

Link to Relevant Issue

This pull request resolves #211

Description of Changes

-Add optional transcription range fields to database model
-Add optional transcription range fields to ingestion model
-Add validator for time duration
-Add range filter to ffmpeg audio split (similar to clip functionality)
-Update tests for valid inputs

…validator for time duration, add range filter to ffmpeg audio split, update tests

evamaxfield

Codewise, this all looks great. Would be curious as to how you found or contributing notes and all of that. Any places for improvement?

Implementation wise... did you have a chance to test this on a dev deployment? I think this would transcribe the portion selected but the web UI / session interface may fail because the video doesn't match up with the audio... the best way to do this interestingly may be to split the video by timestamp and then do all the normal downstream processing rather than split the audio by timestamp...

Are you on the call this Thursday? If so it might be good to talk about this feature then. / How to dev-deploy

codecov · 2022-11-09T01:00:11Z

Codecov Report

Merging #221 (46aff14) into main (3fb35a2) will increase coverage by 0.07%.
The diff coverage is 84.05%.

@@            Coverage Diff             @@
##             main     #221      +/-   ##
==========================================
+ Coverage   72.33%   72.40%   +0.07%     
==========================================
  Files          64       64              
  Lines        3492     3541      +49     
==========================================
+ Hits         2526     2564      +38     
- Misses        966      977      +11

Impacted Files	Coverage Δ
cdp_backend/pipeline/mock_get_events.py	`100.00% <ø> (ø)`
cdp_backend/utils/file_utils.py	`87.26% <57.14%> (-2.64%)`	⬇️
cdp_backend/pipeline/event_gather_pipeline.py	`84.70% <63.63%> (-0.76%)`	⬇️
cdp_backend/pipeline/ingestion_models.py	`99.01% <91.66%> (-0.99%)`	⬇️
cdp_backend/database/models.py	`100.00% <100.00%> (ø)`
cdp_backend/database/validators.py	`89.04% <100.00%> (+0.98%)`	⬆️
cdp_backend/tests/database/test_validators.py	`100.00% <100.00%> (ø)`
...ckend/tests/pipeline/test_event_gather_pipeline.py	`100.00% <100.00%> (ø)`
cdp_backend/tests/utils/test_file_utils.py	`100.00% <100.00%> (ø)`

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

chrisjkhan · 2022-11-09T03:07:20Z

Codewise, this all looks great. Would be curious as to how you found or contributing notes and all of that. Any places for improvement?

I made it this far! If there's anything that I'm doing that makes it appear I missed the documentation, let me know and I might have some feedback.

Implementation wise... did you have a chance to test this on a dev deployment?

No, that was the last thing I mentioned to @smai-f. Hopefully, we can get a test process going as we integrate the scraper.

I think this would transcribe the portion selected but the web UI / session interface may fail because the video doesn't match up with the audio... the best way to do this interestingly may be to split the video by timestamp and then do all the normal downstream processing rather than split the audio by timestamp...

My ideal was to do as you mentioned, and probably integrate with clip functionality. The main reason I didn't was because there seems to be a reliance on existing video hosting, and I thought that would be worth preserving.

I didn't realize how the front end functionality worked (oops!), but I think there are some easy ways to preserve both the existing video hosting and accurate timestamps. The simplest might be to send the offset to the transcribe function to alter the timestamps in the transcript:

cdp-backend/cdp_backend/sr_models/google_cloud_sr_model.py

Line 199 in 1e5b807

start_time = (

In addition, we can update the front end so it seeks to the appropriate place on page load: https://github.com/CouncilDataProject/cdp-frontend/blob/77944572b816ed07ef0dcffd772fc19b52aea12e/src/components/Details/EventVideo/EventVideo.tsx#L133

Are you on the call this Thursday? If so it might be good to talk about this feature then. / How to dev-deploy

I should be available. Can you add me to the event?

cdp_backend/database/validators.py

cdp_backend/utils/file_utils.py

cdp_backend/tests/database/test_validators.py

chrisjkhan · 2022-11-09T09:51:27Z

cdp_backend/tests/utils/test_file_utils.py

+        (None, "10"),
+        ("1", None),
+        ("0", "0"),
+        ("1", "0"),


Add tests to see how ffmpeg handles 0 or negative duration output.

dphoria · 2022-11-10T05:16:26Z

Beyond what Eva already mentioned, about the different video and audio stream lengths maybe being an issue, I think my only wish is that split_audio checks end_time > start_time in some way.

Thank you very much @chrisjkhan for this!

…mpeg error logging

chrisjkhan · 2022-11-12T20:58:05Z

Beyond what Eva already mentioned, about the different video and audio stream lengths maybe being an issue, I think my only wish is that split_audio checks end_time > start_time in some way.

We decided at the meeting to do full video clipping and host the video. A combination of reusable functionality, relative low priority of original video hosting, simplicity of downstream issues, and overall UX. I'll change the functionality and the semantics of the whole issue.

Thank you very much @chrisjkhan for this!

You're welcome

dphoria · 2022-11-14T05:22:33Z

Beyond what Eva already mentioned, about the different video and audio stream lengths maybe being an issue, I think my only wish is that split_audio checks end_time > start_time in some way.

We decided at the meeting to do full video clipping and host the video. A combination of reusable functionality, relative low priority of original video hosting, simplicity of downstream issues, and overall UX. I'll change the functionality and the semantics of the whole issue.

Sounds great. Just let me know / hit the "re-request review" button whenever.

… host when limiting video to a range, updated mp4 conversion to allow a range, connected mp4 to clip functionality, updated tests and tried to make testing slightly more consistent, added Session ingestion verification to test out

chrisjkhan · 2022-11-21T16:11:38Z

I may have gone a little overboard with changes trying to slightly increase consistency across method signatures. The only thing I changed that might be controversial is making the hash calculation part of an already large task rather than keeping it separate. If you really want the hash to be performant, I would suggest modifying it to use CRC/MD5/SHA-1 and do a combination of the first x bytes/blocks and file size.

I tested negative duration pairs (end_time < start_time) causing a event processing failure, as desired. Positive duration pairs cause the file to be trimmed and CDP hosted, and considered unique/distinct from full length videos. Transcription works as expected.

cdp_backend/tests/utils/test_file_utils.py

cdp_backend/utils/file_utils.py

cdp_backend/pipeline/ingestion_models.py

cdp_backend/tests/utils/test_file_utils.py

…ix in cdp_will_host control structure

dphoria

I think I'm good with this. Thank you, Chris! I'll let Eva do the honors, though. She's a much better reviewer. (I also need to figure out how to start a review, make N comments for the one review, end the review...)

smai-f · 2022-11-27T04:24:03Z

Thanks everybody! Here are some manual tests using Chris' latest commit in the Montana legislature instance:

evamaxfield

This all looks good to me. I have one remaining question before I click the approve and merge buttons.

Q: should we take the trim start time and add that to the session start time / event start times?

lets say the event starts at noon but you want to trim to 1 hour into the video. should the session and event start time say: "1pm"?

cdp_backend/pipeline/event_gather_pipeline.py

cdp_backend/pipeline/ingestion_models.py

cdp_backend/tests/database/test_validators.py

evamaxfield · 2022-11-28T19:03:31Z

Also, thank you for such great work on this! Really nicely done and thank you @smai-f for helping test!

evamaxfield · 2022-11-30T17:14:26Z

lets say the event starts at noon but you want to trim to 1 hour into the video. should the session and event start time say: "1pm"?

cc @BrianL3 @dphoria @smai-f this is a question i would love all of your opinions on

chrisjkhan · 2022-11-30T18:09:56Z

lets say the event starts at noon but you want to trim to 1 hour into the video. should the session and event start time say: "1pm"?

Ah, I missed this comment. My guess is it should still be noon. There is a session_datetime, so that value would be over constrained. I chose the very general semantics of video_start_time to leave room for any possible reasons for trimming, leaving a disconnect between event/session/etc times and video times. People more familiar with usage would know better though. If we did make this kind of link in timing, it might make sense to narrow the semantics.

The only case I can imagine is a session "starting" at noon, the video recording starting at noon, but then actual activity starting at 1pm. My guess is, if this 1 hour of dead time is in the source data being scraped, session_datetime would reflect the actual start time instead of the false start time. If dead time not in the source data, it couldn't be scraped and then trimmed.

dphoria · 2022-12-03T00:53:45Z

lets say the event starts at noon but you want to trim to 1 hour into the video. should the session and event start time say: "1pm"?

Ah, I missed this comment. My guess is it should still be noon. There is a session_datetime, so that value would be over constrained. I chose the very general semantics of video_start_time to leave room for any possible reasons for trimming, leaving a disconnect between event/session/etc times and video times. People more familiar with usage would know better though. If we did make this kind of link in timing, it might make sense to narrow the semantics.

The only case I can imagine is a session "starting" at noon, the video recording starting at noon, but then actual activity starting at 1pm. My guess is, if this 1 hour of dead time is in the source data being scraped, session_datetime would reflect the actual start time instead of the false start time. If dead time not in the source data, it couldn't be scraped and then trimmed.

First, great question Eva. 👏 I'm in line with Chris' thoughts; (I hope I'm understanding correctly) At the core, there are 2 fields aded to the model, the video start and end times (right?). At least to me, if I look at a model, and see that 'session start time' is noon, and 'video start time' is 1 PM, that would sound pretty natural. I may at first be slightly confused, but even without any external explanation, I think I would realize that this means, in this case, the video begins 1 hour into the session.

If I'm looking at an example of what's probably the vast majority of the models - 'session start time' == 'video start time', that again wouldn't throw me off, especially if I have at that point, seen an example like above when they are different.

chrisjkhan · 2022-12-03T01:30:59Z

At least to me, if I look at a model, and see that 'session start time' is noon, and 'video start time' is 1 PM, that would sound pretty natural.

Maybe there is still some confusion. video_start_time is an HH:MM:SS time/duration, rather than a datetime. It's relative to the beginning of the video. Should I do something to make this more clear?

smai-f · 2022-12-05T04:47:49Z

@evamaxfield From my part consuming this PR in the Montana legislative scraper (and speaking the time/duration language Chris is pointing out of video_start_time, not considering it a datetime):

The scraped session_datetime on the page (not the video_start_time) already accounts for the offset into the video (that is, if the long meeting video starts at 1pm, and the bill discussion which is the part of the video I'm segmenting starts at 3pm, the piece of data I'm scraping on the page is 3pm for the session_datetime). So I don't need to do anything to get the correct session datetime, and it would be incorrect if the backend added the offset to it.
The piece of math I did have to do (which seems unusual, and like this won't be the case on most frontends) is that the audio player actually sets the 0:00 place of the video to be beginning time of the long video. So in the previous example, the video starts at 13:00:00, the part I'm scraping starts at 15:00:00, so I had to do the math of 15:00:00 - 13:00:00 = 2:00:00, to set the video_start_time to be 2:00:00.

I think the way Chris has it now makes the most sense from my narrow consuming perspective. Have the consumer decide whether or not they want to augment the session datetime to include the video start time offset or not. I can imagine reasons to support both behaviors related to the start time of the meeting aligning with a schedule written in another place.

dphoria · 2022-12-06T16:03:40Z

At least to me, if I look at a model, and see that 'session start time' is noon, and 'video start time' is 1 PM, that would sound pretty natural.

Maybe there is still some confusion. video_start_time is an HH:MM:SS time/duration, rather than a datetime. It's relative to the beginning of the video. Should I do something to make this more clear?

Thank you for pointing that out. I missed it!

evamaxfield · 2022-12-06T16:09:11Z

Thanks for all the discussion and clarification. Yes the only think I think should be changed prior to merge is adding the tiniest bit of documentation to the ingestion model that states the video time is relative to the session time.

I would just put it in the class docstring as we still need to find a better overall project documentation structure haha

chrisjkhan · 2022-12-06T18:58:09Z

I would just put it in the class docstring...

@evamaxfield Added. Let me know how it looks. I followed the Notes model in the rest of the file. It's a bit verbose.

evamaxfield

Perfect. Verbose is more than welcome. Clicking the merge button!

Add transcription range fields to database and ingestion models, add …

3183b17

…validator for time duration, add range filter to ffmpeg audio split, update tests

evamaxfield requested review from evamaxfield and dphoria November 9, 2022 00:55

evamaxfield added the enhancement New feature or request label Nov 9, 2022

evamaxfield reviewed Nov 9, 2022

View reviewed changes

dphoria reviewed Nov 9, 2022

View reviewed changes

cdp_backend/database/validators.py Show resolved Hide resolved

chrisjkhan commented Nov 9, 2022

View reviewed changes

cdp_backend/utils/file_utils.py Outdated Show resolved Hide resolved

Add tests for edge case transcription ranges

81bfaf5

chrisjkhan commented Nov 9, 2022

View reviewed changes

cdp_backend/tests/database/test_validators.py Show resolved Hide resolved

chrisjkhan commented Nov 9, 2022

View reviewed changes

Fix end_time=start_time typo, update tests to be more concise, add ff…

5a525cc

…mpeg error logging

chrisjkhan added 3 commits November 16, 2022 11:23

Update session hash to reflect trimmed video

88d619f

Bypass hash task

217deef

chrisjkhan requested review from evamaxfield and dphoria and removed request for evamaxfield November 21, 2022 16:12

dphoria reviewed Nov 22, 2022

View reviewed changes

cdp_backend/tests/utils/test_file_utils.py Show resolved Hide resolved

dphoria reviewed Nov 22, 2022

View reviewed changes

cdp_backend/utils/file_utils.py Outdated Show resolved Hide resolved

dphoria reviewed Nov 22, 2022

View reviewed changes

cdp_backend/pipeline/ingestion_models.py Show resolved Hide resolved

dphoria reviewed Nov 22, 2022

View reviewed changes

cdp_backend/tests/utils/test_file_utils.py Show resolved Hide resolved

Remove unnecessary logging, duration validation comments, elif typo f…

78c2019

…ix in cdp_will_host control structure

dphoria approved these changes Nov 22, 2022

View reviewed changes

evamaxfield reviewed Nov 28, 2022

View reviewed changes

Reverted function parameter doc for split audio

f6955ac

Improved documentation for video_start_time in ingestion_models

46aff14

evamaxfield approved these changes Dec 6, 2022

View reviewed changes

evamaxfield merged commit b00f378 into CouncilDataProject:main Dec 6, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add transcription range fields to database and ingestion models, add … #221

Add transcription range fields to database and ingestion models, add … #221

chrisjkhan commented Nov 8, 2022

evamaxfield left a comment

codecov bot commented Nov 9, 2022 •

edited

chrisjkhan commented Nov 9, 2022

chrisjkhan Nov 9, 2022

dphoria commented Nov 10, 2022

chrisjkhan commented Nov 12, 2022

dphoria commented Nov 14, 2022

chrisjkhan commented Nov 21, 2022 •

edited

dphoria left a comment •

edited

smai-f commented Nov 27, 2022

evamaxfield left a comment

evamaxfield commented Nov 28, 2022

evamaxfield commented Nov 30, 2022

chrisjkhan commented Nov 30, 2022

dphoria commented Dec 3, 2022 •

edited

chrisjkhan commented Dec 3, 2022

smai-f commented Dec 5, 2022 •

edited

dphoria commented Dec 6, 2022

evamaxfield commented Dec 6, 2022 •

edited

chrisjkhan commented Dec 6, 2022 •

edited

evamaxfield left a comment

Add transcription range fields to database and ingestion models, add … #221

Add transcription range fields to database and ingestion models, add … #221

Conversation

chrisjkhan commented Nov 8, 2022

Link to Relevant Issue

Description of Changes

evamaxfield left a comment

Choose a reason for hiding this comment

codecov bot commented Nov 9, 2022 • edited

Codecov Report

chrisjkhan commented Nov 9, 2022

chrisjkhan Nov 9, 2022

Choose a reason for hiding this comment

dphoria commented Nov 10, 2022

chrisjkhan commented Nov 12, 2022

dphoria commented Nov 14, 2022

chrisjkhan commented Nov 21, 2022 • edited

dphoria left a comment • edited

Choose a reason for hiding this comment

smai-f commented Nov 27, 2022

evamaxfield left a comment

Choose a reason for hiding this comment

evamaxfield commented Nov 28, 2022

evamaxfield commented Nov 30, 2022

chrisjkhan commented Nov 30, 2022

dphoria commented Dec 3, 2022 • edited

chrisjkhan commented Dec 3, 2022

smai-f commented Dec 5, 2022 • edited

dphoria commented Dec 6, 2022

evamaxfield commented Dec 6, 2022 • edited

chrisjkhan commented Dec 6, 2022 • edited

evamaxfield left a comment

Choose a reason for hiding this comment

codecov bot commented Nov 9, 2022 •

edited

chrisjkhan commented Nov 21, 2022 •

edited

dphoria left a comment •

edited

dphoria commented Dec 3, 2022 •

edited

smai-f commented Dec 5, 2022 •

edited

evamaxfield commented Dec 6, 2022 •

edited

chrisjkhan commented Dec 6, 2022 •

edited