Filter out bad caption files #213

Closed
dphoria opened this issue Sep 23, 2022 · 0 comments · Fixed by #214
Assignees
Labels
enhancement New feature or request event gather pipeline A feature or bugfix relating to event processing

Comments

@dphoria
Contributor

dphoria commented Sep 23, 2022

Feature Description

{"generator": "CDP WebVTT Conversion -- CDP v3.2.3", "confidence": 0.97, "session_datetime": "2022-09-14T10:00:00-04:00", "created_datetime": "2022-09-17T15:51:32.537666", "sentences": [{"index": 0, "confidence": 0.97, "start_time": 0.0, "end_time": 0.0, "words": [], "text": "", "speaker_index": 0, "speaker_name": null, "annotations": null}], "annotations": null}

Note the end_time of 0.0 and the empty text in the above output, which comes from processing captions for a Pittsburgh, PA event. It turns out that the event does have a caption file in the Legistar data structure, but the file is in fact empty.
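As a rough illustration of the symptom, the check below flags a transcript like the one above as empty. It is only a sketch: `is_empty_transcript` is a hypothetical helper, not part of the CDP codebase, and it assumes the transcript has already been loaded into a plain dict with the field names shown in the JSON.

```python
def is_empty_transcript(transcript: dict) -> bool:
    """Return True when no sentence carries any text or words.

    Hypothetical helper; field names are taken from the JSON above.
    """
    sentences = transcript.get("sentences") or []
    # A transcript with no sentences at all also counts as empty.
    return all(
        not (sentence.get("text") or "").strip() and not sentence.get("words")
        for sentence in sentences
    )
```

This only detects the problem after conversion; the proposal below is to reject the caption file before it gets that far.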

We want to filter these out and just attempt speech-to-text instead.

Use Case

Allow for proper transcript generation in generate_transcript() by appropriately filtering out empty caption files.

Solution

Compare the lengths (durations) of the video and the caption file. If they differ by more than some threshold, e.g. 20%, throw away the caption file.
This means the scraper can no longer just hand off the caption file URL as-is.
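The comparison could look roughly like this. It is a minimal sketch under assumptions: `captions_match_video` is a hypothetical name, both durations are assumed to be already known in seconds, and the difference is measured relative to the video duration, which the issue does not specify.

```python
def captions_match_video(
    video_seconds: float,
    caption_seconds: float,
    threshold: float = 0.2,  # the 20% example threshold from this issue
) -> bool:
    """Return True when the caption duration is within `threshold` of the video's."""
    if video_seconds <= 0:
        # Without a usable video duration there is nothing to compare against.
        return False
    relative_difference = abs(video_seconds - caption_seconds) / video_seconds
    return relative_difference <= threshold
```

An empty caption file has a duration of 0, so it differs from any real video by 100% and is rejected.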

Alternatives

  • Throw away caption files with a file size below some threshold, e.g. 100 bytes.
  • Throw away caption files shorter than ~1 minute.
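Both alternatives can be sketched with the standard library alone. The names `caption_end_seconds` and `is_too_small` are hypothetical, the caption file is assumed to be WebVTT text already downloaded into a string, and the regular expression only covers plain `start --> end` cue timing lines.

```python
import re

# Matches "start --> end" cue timings; the hours part is optional in WebVTT.
_CUE_TIMING = re.compile(
    r"(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})\s+-->\s+"
    r"(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})"
)


def caption_end_seconds(vtt_text: str) -> float:
    """Return the latest cue end time in seconds, or 0.0 if there are no cues."""
    latest = 0.0
    for match in _CUE_TIMING.finditer(vtt_text):
        hours, minutes, seconds, millis = match.group(5, 6, 7, 8)
        end = int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000
        latest = max(latest, end)
    return latest


def is_too_small(vtt_text: str, min_bytes: int = 100, min_seconds: float = 60.0) -> bool:
    """Return True when the file is under the size or duration threshold."""
    size = len(vtt_text.encode("utf-8"))
    return size < min_bytes or caption_end_seconds(vtt_text) < min_seconds
```

Either check is cheaper than comparing against the video duration, but a fixed cutoff could also reject legitimately short sessions.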
@dphoria dphoria added enhancement New feature or request event gather pipeline A feature or bugfix relating to event processing labels Sep 23, 2022
@dphoria dphoria self-assigned this Sep 23, 2022