Accelerating evaluation speed #4
Comments
One method is to use a DataLoader for evaluation. Say we have M utterances in the evaluation set; we sort them by utterance length. In each minibatch, we select data of similar length (all utterances in the same batch are cut to the same length) to extract the features, which is faster. I did not do that in this project, but you can see it in another project (https://github.com/TaoRuijie/TalkNet_ASD/blob/main/dataLoader.py), lines 96 to 104. My explanation can be found here (https://github.com/TaoRuijie/TalkNet_ASD/blob/main/FAQ.md): "1.2 How to figure the variable length of data during training?"
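The grouping idea above can be sketched in plain Python. This is a hypothetical helper (not from the project): `utts` is assumed to be a list of `(utterance_id, num_frames)` pairs, and each batch is cut to the shortest length it contains so it can be stacked into one tensor.

```python
# Hypothetical sketch: group evaluation utterances into minibatches of
# similar length so each batch can be cut to one common length.
def make_length_sorted_batches(utts, batch_size):
    # Sort by length so neighbours have similar durations.
    ordered = sorted(utts, key=lambda u: u[1])
    batches = []
    for i in range(0, len(ordered), batch_size):
        chunk = ordered[i:i + batch_size]
        # Cut every utterance in the batch to the shortest length in it,
        # so the batch can be stacked into a single tensor.
        cut_len = min(n for _, n in chunk)
        batches.append(([uid for uid, _ in chunk], cut_len))
    return batches
```

Because neighbouring utterances have similar lengths, little audio is discarded by the cut, while the GPU processes `batch_size` utterances per forward pass instead of one.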
I have found an alternative way of doing this, using torch.combinations.
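For illustration, a minimal sketch of the torch.combinations approach (the variable names are assumptions, not the commenter's actual code): given an `(M, D)` tensor of per-utterance embeddings, all unordered pairs can be enumerated and scored in one batched call instead of a Python loop.

```python
import torch

# Hypothetical sketch: `embeddings` stands in for an (M, D) tensor of
# per-utterance speaker embeddings.
embeddings = torch.randn(4, 192)

# All M*(M-1)/2 unordered index pairs in a single call.
pairs = torch.combinations(torch.arange(embeddings.size(0)), r=2)

# Score every pair with one batched cosine-similarity call.
scores = torch.nn.functional.cosine_similarity(
    embeddings[pairs[:, 0]], embeddings[pairs[:, 1]], dim=1)
```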
I understand. But if I understand correctly, GPU utilization stays the same: each time you can still only feed one utterance into the GPU to extract the features. Am I right?
Yes. |
Yes. May I ask how long your evaluation takes? In my case, evaluation on Vox1_O takes only a minute or so, so I did not change it to the DataLoader format, because that would add more code. For Vox1_E and Vox1_H it takes about 20 minutes, but I only run it once after all the experiments, so I did not add the DataLoader.
OK, I see. I use a custom dataset with 7 million+ comparison pairs, which takes more than half an hour in ECAPAModel.py, L82-L91, so I have to rewrite the score-calculation part of the evaluation. It may be fine to leave it unchanged if it runs fast on VoxCeleb. (Only 3k pairs? Perhaps not all possible utterance pairs are considered.)
Oh, I understand your meaning and your proposed method. I guess you mean: the most time-consuming part of your project is not extracting the speaker embedding for each utterance, but computing the scores between these embeddings. Given that, I think the method you proposed is reasonable.
Btw, I am somewhat surprised by that: 7 million+ pairs is a very large number.
Yes, exactly. I have rewritten the full evaluation in vectorized form on the GPU, and the EER of ECAPA-TDNN + ArcFace is better than I expected. So I think the issue can safely be closed, since evaluation time is not a major issue on VoxCeleb.
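A minimal sketch of what such vectorized trial scoring might look like (an assumption, not the commenter's actual code): `emb` stands in for an `(M, D)` embedding tensor, and `enrol_idx` / `test_idx` index the two sides of each trial pair, so all scores come from one batched operation.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of vectorized trial scoring, replacing a
# per-pair Python loop. With L2-normalized embeddings, the batched
# dot product below equals cosine similarity.
emb = F.normalize(torch.randn(5, 192), dim=1)
enrol_idx = torch.tensor([0, 1, 2])
test_idx = torch.tensor([3, 4, 0])

# One elementwise multiply + sum scores every trial pair at once.
scores = (emb[enrol_idx] * emb[test_idx]).sum(dim=1)
```

On a GPU the same code scores millions of pairs in a handful of kernel launches, which is where the half-hour-to-seconds speedup described above would come from.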
During evaluation, the current implementation calculates the similarity scores one by one in a for loop, which could be slow as the size of "lines" grows. Is there an elegant way of vectorizing it?