`pretrain_AVA.model` is trained on the AVA training set and evaluated on the AVA val/test set (the full code for this setting is provided). `pretrain_TalkSet.model` is trained on our TalkSet and evaluated on the Columbia ASD dataset or other raw videos.
We design a scheme to feed variable-length data into the same mini-batch: we sort all videos by length and put videos of similar length into the same batch, then crop every video in a batch to the length of the shortest video in that batch. In this way, we train TalkNet on videos of different lengths without losing too much data.
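Below is a minimal sketch of this batching idea, not the exact training code; the sample format (a list of `(frames, label)` pairs) and the function name are assumptions for illustration.

```python
import torch

def make_length_sorted_batches(samples, batch_size):
    # `samples`: list of (frames, label), where `frames` has shape (num_frames, ...).
    # Sort by clip length so each batch holds clips of similar length.
    samples = sorted(samples, key=lambda s: s[0].shape[0])
    batches = []
    for i in range(0, len(samples), batch_size):
        chunk = samples[i:i + batch_size]
        # Crop every clip in the batch to the shortest clip in that batch.
        min_len = min(frames.shape[0] for frames, _ in chunk)
        feats = torch.stack([frames[:min_len] for frames, _ in chunk])
        labels = torch.tensor([label for _, label in chunk])
        batches.append((feats, labels))
    return batches
```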
In the ASD task, when there are multiple face tracks in a video, we consider one track at a time; the face track of interest is given in each test trial. You could also model the relationship between the faces that appear on screen at the same time; there are papers that explore this.
Do "pip install --pre torch torchvision -f https://download.pytorch.org/whl/nightly/cu111/torch_nightly.html -U", check this page.
We use Google Drive to host the pretrained models and CSV files, so make sure you can access Google Drive from your network. Errors reported while extracting the video clips can be ignored.
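If the automatic download fails, one fallback is to fetch the files manually with `gdown`; the file ID below is a placeholder, so substitute the real Google Drive ID from the download links used by this repo.

```python
import gdown

# FILE_ID is a placeholder: replace it with the actual Google Drive file ID
# taken from the download link used in this repository.
url = "https://drive.google.com/uc?id=FILE_ID"
gdown.download(url, "pretrain_TalkSet.model", quiet=False)
```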
In our experiments on the AVA validation set, with the same code and model, the best training run reaches 92.6 mAP and the worst 92.1 mAP, so it is reasonable if your result differs slightly from the reported 92.3 mAP. Batch size may also affect the result, though not by much.
The labels of the AVA test set have not been released, so you need to upload your CSV result to their evaluation system. Note that we add an extra first line to the `test_res.csv` file because we slightly modified `get_ava_active_speaker_performance.py`; you need to delete that first line before uploading. In the uploaded file, you also need to set every label to `SPEAKING_AUDIBLE`.
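Here is a minimal sketch of that preparation step, assuming the extra first line is a column header and that the label column is literally named `label`; check your own `test_res.csv` and adjust if its layout differs.

```python
import pandas as pd

# Assumes the extra first line in test_res.csv is a header row and that the
# label column is named "label"; adjust these assumptions to your actual file.
df = pd.read_csv("test_res.csv")                          # consumes the header line
df["label"] = "SPEAKING_AUDIBLE"                          # set every label for the upload file
df.to_csv("test_upload.csv", header=False, index=False)   # written without the first line
```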
There are three labels: `SPEAKING_AND_AUDIBLE`, `SPEAKING_BUT_NOT_AUDIBLE`, and `NOT_SPEAKING`, but in the final evaluation `SPEAKING_BUT_NOT_AUDIBLE` and `NOT_SPEAKING` are treated as the same class, so this is a binary classification problem.
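Conceptually, the final evaluation collapses the labels as in this illustrative mapping (not code from the evaluation script):

```python
# Only "speaking and audible" counts as the positive class; the other two
# labels are merged into the negative class.
LABEL_TO_BINARY = {
    "SPEAKING_AND_AUDIBLE": 1,
    "SPEAKING_BUT_NOT_AUDIBLE": 0,
    "NOT_SPEAKING": 0,
}
```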
Our model has 15.01M parameters. On one 22GB GPU, each training epoch takes about 15 minutes and evaluating on the val set takes about 5 minutes. Training for 25 epochs is enough to reach the best result, so at most about 7 hours in total.
We want our model to be usable for any video in the wild. The AVA dataset already provides the face bounding box for each trial, but for videos outside AVA you need to run face detection yourself before doing ASD. In our experiments, the face detection method used to build AVA is hard to reproduce, which means we can hardly obtain face crops similar to the faces detected in AVA. As a result, a model trained on AVA does not perform well on videos outside AVA when a different face detector is used. For this reason we built TalkSet, in which all faces are detected by S3FD. So for any raw video (such as the videos in the Columbia ASD dataset), we can first run S3FD face detection and then apply our TalkNet model to obtain the ASD labels.
We did not verify this code. Because we only modified LRS3 and VoxCeleb2 to build this set, we do not (and cannot) upload the set itself. `generate_TalkSet.py` is the script we used to generate this dataset, and we did not check it afterwards, so treat it as a reference only. We have already provided the data list, so you can generate the dataset based on it.
You can check the demo video `001.mp4` first and compare your output with our result `001_res.mp4` to make sure what you did is correct. If you are not satisfied with the performance, we are sorry about that (~cry). We think this model can be improved further: for very short clips (less than 1 s), small faces, and side faces, the performance is not as good.
That is mainly a limitation of the face detection model rather than of the ASD part. You can use a better face detection model such as InsightFace: only when a face is detected can the ASD model run and produce an ASD label for it.
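As an example, InsightFace's detector could be dropped in before TalkNet along the lines of the sketch below; the model pack name and the frame path are placeholders, and the detected boxes would then be used to crop the face tracks that TalkNet consumes.

```python
import cv2
from insightface.app import FaceAnalysis

# "buffalo_l" is InsightFace's default model pack; ctx_id=0 uses the first GPU
# (use ctx_id=-1 for CPU). "frame.jpg" stands in for one video frame.
app = FaceAnalysis(name="buffalo_l")
app.prepare(ctx_id=0, det_size=(640, 640))

frame = cv2.imread("frame.jpg")
for face in app.get(frame):
    x1, y1, x2, y2 = face.bbox.astype(int)
    crop = frame[y1:y2, x1:x2]   # face crop that would be fed into the ASD model
```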