Video Pipeline/Training Support#510
All the conventional wisdom I'm aware of says that judging files by their extension is a bad idea. We've done a large amount of work to make sure whatever garbage files a user might upload get forced through our sieve and come out as sanitized, safe video, and this is a step in the wrong direction.
Proposal
If we want to move to using the raw uploaded video, we should go through the process of having ffprobe verify the file (we already do this), tag that file, and then use it.
Then, the logic could look like this:
if a tagged safe original video exists:
    send it to training
else if a tagged safe transcoded video exists:
    send it to training
else:
    throw an error about no safe videos
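The fallback order above can be sketched as a small helper. This is only a minimal sketch, not code from this PR: `pick_training_video`, its two parameters, and `NoSafeVideoError` are hypothetical names, and how a video earns its "safe" tag is left to the caller.

```python
from typing import Optional


class NoSafeVideoError(Exception):
    """Raised when neither a verified original nor a verified transcode exists."""


def pick_training_video(
    safe_original: Optional[str],
    safe_transcoded: Optional[str],
) -> str:
    """Return the video to send to training, preferring the verified original."""
    if safe_original is not None:
        return safe_original
    if safe_transcoded is not None:
        return safe_transcoded
    # Neither candidate was tagged safe, so refuse to train on anything.
    raise NoSafeVideoError("no safe videos found for training")
```

Passing `None` for a missing candidate keeps the decision in one place, so the caller only has to look up the two tagged items.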
for (key, value) in metadata.items():
    metadata_list.append(f" {key}: {value}")
metadata_str = ",".join(metadata_list)
writer.writerow([f'# metadata -{metadata_str}'])
Unnecessary branching/default. Could we just make metadata a required positional arg and remove the if?
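One way the suggestion might look, assuming the row format from the diff is kept as-is. `write_metadata_row` is a hypothetical name, and the real function in the PR may take other arguments.

```python
import csv
from typing import Any, Dict


def write_metadata_row(writer: Any, metadata: Dict[str, Any]) -> None:
    """Write one '# metadata' comment row; metadata is required, so no if/default."""
    metadata_list = [f" {key}: {value}" for key, value in metadata.items()]
    metadata_str = ",".join(metadata_list)
    writer.writerow([f"# metadata -{metadata_str}"])
```

A caller with nothing to record can pass an empty dict, which still goes through the same code path. Here `writer` is expected to be an object returned by `csv.writer`.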
return groundtruth


def get_source_video(input_path: Path, folderId: str, girder_client):
This function is kinda dangerous and should be documented
from girder_worker.task import Task

# TODO: Move viame_server.constants into a shared area like viame_utils to share constants
validVideoFormats = [".mp4", ".avi", ".mov", ".mpg"]
Might as well make it a set instead of a list so that when the refactor happens, it's a drop-in replacement.
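A minimal sketch of the set version, assuming the constant keeps its current name and contents. `has_valid_video_extension` is a hypothetical helper added only to show that the membership test reads the same as with a list.

```python
import os

# Same four extensions as the list in the diff, stored as a set.
validVideoFormats = {".mp4", ".avi", ".mov", ".mpg"}


def has_valid_video_extension(file_name: str) -> bool:
    """Check the lower-cased extension against the set of known video formats."""
    extension = os.path.splitext(file_name)[1].lower()
    return extension in validVideoFormats
```

The `in` check is unchanged from the list version, so swapping in a shared set constant later would not touch the call sites.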
extension = os.path.splitext(file_name)[1].lower()
if item.get("meta", {}).get("codec") is None and extension in validVideoFormats:
    return file_name
return None
I don't like the strategy of "choose the thing that isn't tagged but has the right file extension".
It's fragile. I think the transcoded file should be the thing we actually transmit, not the original source.
If a user uploads a txt file named .mp4, this will explode. However, if we use the tagged transcoded video, it's basically bulletproof because transcoding will only succeed on a well-formed video.
Transcoding is basically a litmus test for valid video before training runs, so I think we should use the "safer" version.
I really don't think selecting by extension is a good idea. We can slightly relax our compression on transcoding if that would help.
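For comparison with the extension check, here is a rough sketch of a content-based check built on ffprobe's JSON output. It is an illustration under assumptions, not the project's existing verification code: `run_ffprobe` and `video_streams` are hypothetical names, and ffprobe is assumed to be on the PATH.

```python
import json
import subprocess
from pathlib import Path
from typing import Any, Dict, List


def run_ffprobe(path: Path) -> Dict[str, Any]:
    """Ask ffprobe for stream info as JSON; raises CalledProcessError on a non-zero exit."""
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-print_format", "json", "-show_streams", str(path)],
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)


def video_streams(probe: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Keep only the streams that ffprobe reports as video."""
    return [s for s in probe.get("streams", []) if s.get("codec_type") == "video"]
```

A file would only be tagged when `video_streams(run_ffprobe(path))` is non-empty, so the decision depends on what the file contains rather than on its name.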
Co-authored-by: Brandon Davis <git@subdavis.com>
So I removed the extension checking and now it checks for:
Added the Updated the My Testing process:
Only issue I see is if there is a super old video which doesn't have the |
Seems like it would be prudent to go ahead and tag every video we've verified with codec.
Now, codec can just plainly mean codec, instead of carrying the extra implied "BTW this is the transcoded video".
Note that you also have to explicitly query that source_video does not exist in _get_clip_meta:
item = Item().findOne(
    {
        'folderId': folder['_id'],
        'meta.codec': 'h264',
        'meta.source_video': { '$exists': False },
    }
)
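To illustrate why the extra condition matters once source videos also carry `codec`, here is the same predicate over plain dicts. This is only a sketch: `find_transcoded_item` is a hypothetical helper that mirrors the query above and does not call Girder or MongoDB.

```python
from typing import Any, Dict, Iterable, Optional


def find_transcoded_item(
    items: Iterable[Dict[str, Any]],
    codec: str = "h264",
) -> Optional[Dict[str, Any]]:
    """Return the first item with the given codec that is not tagged source_video."""
    for item in items:
        meta = item.get("meta", {})
        # Equivalent of {'meta.source_video': {'$exists': False}}: the key must be absent.
        if meta.get("codec") == codec and "source_video" not in meta:
            return item
    return None
```

Without the `source_video` check, an h264 source upload tagged with its own codec would match just as well as the transcoded output.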
manager.updateStatus(JobStatus.PUSHING_OUTPUT)
new_file = gc.uploadFileToFolder(folderId, output_path)
gc.addMetadataToItem(new_file['itemId'], {"codec": "h264"})
gc.addMetadataToItem(itemId, {"source_video": True})
- gc.addMetadataToItem(itemId, {"source_video": True})
+ gc.addMetadataToItem(
+     itemId,
+     {
+         "source_video": True,
+         "codec": videostream[0]["codec_name"],
+     },
+ )
I'm testing this myself now, but I have to download the newer gpu algorithms.
Not the best way to solve the issue but the least intrusive way for the time being.
Pipeline Running Updates:
image-sequence for training will now filter out .json files
video type will now load the folder, filter out any non-compatible video files, and then look for the video file which doesn't have the codec meta tag on it. I would like to share the validMediaFiles with the constants folder or maybe do this filtering before calling the task; I just didn't want to go with too big of a change right now with just getting video training to work.
CSV Metadata Update:
fps = None is there because of a training bug right now. Once fixed that will be removed.
Training Updates:
fps so it knows to write it.