Filter out bad caption files #115

Closed
dphoria opened this issue Sep 21, 2022 · 4 comments

@dphoria
Collaborator

dphoria commented Sep 21, 2022

Feature Description

{"generator": "CDP WebVTT Conversion -- CDP v3.2.3", "confidence": 0.97, "session_datetime": "2022-09-14T10:00:00-04:00", "created_datetime": "2022-09-17T15:51:32.537666", "sentences": [{"index": 0, "confidence": 0.97, "start_time": 0.0, "end_time": 0.0, "words": [], "text": "", "speaker_index": 0, "speaker_name": null, "annotations": null}], "annotations": null}

Note that end_time is 0.0 in the above output, which came from processing captions for a Pittsburgh, PA event. It turns out the event does have a caption file in its Legistar data structure, but the file is in fact empty.

We want to filter these out and leave Session.caption_uri = None.
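As a rough illustration, an empty result like the one above could be detected from the converted transcript itself. This is only a sketch: `is_empty_transcript` is a hypothetical helper, not part of CDP, and it assumes the transcript has already been loaded into a dict with the same keys as the JSON shown above.

```python
from typing import Any, Dict


def is_empty_transcript(transcript: Dict[str, Any]) -> bool:
    """Return True when no sentence carries any text or any words.

    Hypothetical helper; assumes the "sentences", "text", and "words"
    keys seen in the sample output above.
    """
    sentences = transcript.get("sentences") or []
    return all(
        not (sentence.get("text") or "").strip() and not sentence.get("words")
        for sentence in sentences
    )
```

A scraper or pipeline step could use a check like this to decide whether to leave Session.caption_uri as None.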

Use Case

It is better to avoid wasting resources on processing these invalid caption files.

Solution

Compare the durations of the video and the caption file. If they differ by more than some threshold, e.g. 20%, throw away the caption file.
This means the scraper can no longer just hand off the caption file URL as-is.
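The comparison itself might look like the sketch below. It assumes both durations are already known in seconds (how to obtain them is a separate problem), and `durations_mismatch` is a hypothetical name. The 20% default mirrors the example threshold above and is measured relative to the video duration, which is one possible reading of "differ by more than 20%".

```python
def durations_mismatch(
    video_seconds: float,
    caption_seconds: float,
    threshold: float = 0.2,
) -> bool:
    """Return True when the caption duration is too far from the video duration.

    The difference is measured relative to the video duration (an assumption).
    A non-positive video duration is treated as a mismatch.
    """
    if video_seconds <= 0:
        return True
    return abs(video_seconds - caption_seconds) / video_seconds > threshold
```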

Alternatives

  • Throw away caption files with file size less than some threshold, e.g. 100 bytes.
  • Throw away caption files that cover less than ~1 minute.
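
Both alternatives can be checked on the caption file alone. The following is a minimal sketch, assuming the captions are WebVTT and already downloaded as bytes; `caption_end_seconds` and `looks_invalid` are hypothetical names, and the 100 byte and 60 second defaults simply mirror the example thresholds above.

```python
import re

# End timestamp of a WebVTT cue timing line: "... --> [hh:]mm:ss.ttt"
_CUE_END = re.compile(r"-->\s*(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})")


def caption_end_seconds(vtt_text: str) -> float:
    """Return the latest cue end time in seconds, or 0.0 if there are no cues."""
    latest = 0.0
    for hours, minutes, seconds, millis in _CUE_END.findall(vtt_text):
        end = (
            int(hours or 0) * 3600
            + int(minutes) * 60
            + int(seconds)
            + int(millis) / 1000
        )
        latest = max(latest, end)
    return latest


def looks_invalid(
    vtt_bytes: bytes,
    min_bytes: int = 100,
    min_seconds: float = 60.0,
) -> bool:
    """Return True when the file is tiny or its cues end before min_seconds."""
    if len(vtt_bytes) < min_bytes:
        return True
    text = vtt_bytes.decode("utf-8", errors="replace")
    return caption_end_seconds(text) < min_seconds
```

The regular expression only reads cue timing lines, so it does not validate the rest of the file.
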
@dphoria dphoria added the enhancement New feature or request label Sep 21, 2022
@dphoria dphoria self-assigned this Sep 21, 2022
@dphoria dphoria changed the title Filter out bad transcripts Filter out bad caption files Sep 21, 2022
@dphoria
Collaborator Author

dphoria commented Sep 21, 2022

The more I think about this, the more I prefer not to compare against the associated video, mostly because I don't want to have to handle all the different video formats. I would rather just do some very simple, stupid validation on the caption file alone.

@evamaxfield
Member

I would say do this on the cdp-backend side where we already have ffmpeg stuff enabled.

Which means that something like this should work: https://stackoverflow.com/a/3844467
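
For illustration only, one common way to read a media duration from Python is to call ffprobe, which ships with ffmpeg. The sketch below is not taken from cdp-backend. The function names are hypothetical, it assumes ffprobe is on the PATH, and whether ffprobe reports a meaningful duration for a given caption file is an assumption that would need checking.

```python
import subprocess
from typing import List


def ffprobe_duration_cmd(path: str) -> List[str]:
    """Build an ffprobe command that prints only the container duration."""
    return [
        "ffprobe",
        "-v", "error",
        "-show_entries", "format=duration",
        "-of", "default=noprint_wrappers=1:nokey=1",
        path,
    ]


def media_duration_seconds(path: str) -> float:
    """Run ffprobe and parse the duration in seconds.

    Raises CalledProcessError if ffprobe fails, and ValueError if the
    output is not a number (e.g. "N/A").
    """
    result = subprocess.run(
        ffprobe_duration_cmd(path),
        capture_output=True,
        text=True,
        check=True,
    )
    return float(result.stdout.strip())
```
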

@dphoria
Collaborator Author

dphoria commented Sep 21, 2022

> I would say do this on the cdp-backend side where we already have ffmpeg stuff enabled.

Ah, good idea.

@dphoria
Collaborator Author

dphoria commented Sep 23, 2022

@dphoria dphoria closed this as completed Sep 23, 2022