MSDD inference is too slow #7101
What do you mean by parallelism of embedding extraction when you are inferencing on a single GPU?
This is a very recent issue that we also discovered. It's not MSDD but the TitaNet embedding extractor that is taking a lot of time. I will look into it and get back soon.
Hi @nithinraok, thanks for your response.
Thanks @tango4j, yes, I also think so.
You can skip writing pkl files as well; have you tried disabling saving pickle files through the config?
Of course, but MSDD uses them, so if I disable saving the pkl files I get a FileNotFoundError.
This issue is happening only for the MSDD diarizer, not for the clustering diarizer. I suppose something related to the yaml settings is causing this. Let me get back to this soon.
I want to use MSDD diarization.
@SagyHarpazGong Sure, let us work on this. Thanks...!
@SagyHarpazGong
in the yaml config.
@tango4j I checked as well and it's still slow. I really suspect the reason for the slow inference is CPU utilization: most of the inference time the GPU utilization is at 0%, and all the file-system I/O is another reason for the slow inference.
@SagyHarpazGong If it changes and speeds up, but the improvement is less than 30%, then please let us know.
Which CUDA settings do I need to check?
@SagyHarpazGong
@tango4j not at all
@SagyHarpazGong Apart from this, I will update the NGC MSDD model checkpoint to resolve this slowdown issue.
@tango4j thanks, I'll try to share images of nvidia-smi during inference in order to show you that most of the time the GPU utilization is at 0%.
Hi all, I fixed the issue by inheriting from the ClusteringDiarizer, ClusterEmbedding, and NeuralDiarizer classes and modifying them so that, instead of saving the embeddings to pkl files and loading them back for MSDD inference, the embeddings are passed to MSDD inference directly in GPU memory without going through the file system. This is my implementation:
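A minimal sketch of that idea, with a hypothetical helper class (this is not the actual modified NeMo classes): keep the per-scale embeddings as CUDA tensors in a shared in-memory store, so the clustering and MSDD stages read them directly instead of loading pkl files.

```python
# Hypothetical sketch only: an in-memory, GPU-resident embedding store that the
# embedding-extraction stage writes to and the clustering/MSDD stages read from,
# replacing the save-to-pkl / load-from-pkl round trip through the file system.
from typing import Dict
import torch


class InMemoryEmbeddingStore:
    """Holds the extracted embeddings per scale index, entirely in GPU memory."""

    def __init__(self, device: str = "cuda"):
        self.device = device
        self._embeddings: Dict[int, torch.Tensor] = {}

    def put(self, scale_idx: int, embeddings: torch.Tensor) -> None:
        # Move (or keep) the tensor on the GPU; nothing is written to disk.
        self._embeddings[scale_idx] = embeddings.to(self.device, non_blocking=True)

    def get(self, scale_idx: int) -> torch.Tensor:
        # The consumer (e.g. the MSDD stage) receives the same GPU tensor back.
        return self._embeddings[scale_idx]
```

The point of the design is simply that producer and consumer share one Python object, so the embeddings never leave device memory between stages.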
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
This issue was closed because it has been inactive for 7 days since being marked as stale.
Looks like not many people use MSDD. It is mid-2024, NeMo inference is still super slow for MSDD, and no action has been taken on this.
@SagyHarpazGong did your implementation help reduce the CPU-GPU bottleneck and improve speed?
@prkumar112451 thanks for your comments; unfortunately we might have missed this or been busy with other work. Thank you for bringing this issue up again.
@prkumar112451 Thanks for the detailed comments. Currently, the way we improved the accuracy of NeMo diarization is by using embeddings at multiple scales, which I believe is the issue for your 20-minute audio. There are ways to improve this. First, to answer them, I would need some clarifications from your end:
@nithinraok Thanks for the quick response. The 20-minute audio is a call-center telephony conversation between a customer and an agent. I am using a combination of Whisper for transcription and then NeMo for diarization. Taking this repo as reference, we can see that there are lots of Whisper optimization techniques like flash attention, batching, etc., and I have been able to speed up Whisper a lot. But the diarization part is acting as a bottleneck. Just to be very sure, I completely removed the Whisper part and ran only NeMo's telephony-based model diar_msdd_telephonic, but its speed is 1 minute of diarization time for a 20-minute call recording. To answer your questions:
Regarding the performance bottleneck of diarization, if you can tolerate some loss in accuracy, I would suggest trying the clustering diarizer with a single scale and without the MSDD model, as shown in the config below:

```bash
MANIFEST_FILE='callhome_109.json'
python examples/speaker_tasks/diarization/clustering_diarizer/offline_diar_infer.py \
--config-path='examples/speaker_tasks/diarization/conf/inference' --config-name='diar_infer_telephonic.yaml' \
diarizer.manifest_filepath=$MANIFEST_FILE \
diarizer.out_dir='/data/sample/' \
diarizer.speaker_embeddings.model_path=${MODEL_PATH} \
diarizer.speaker_embeddings.parameters.window_length_in_sec=1.5 \
diarizer.speaker_embeddings.parameters.shift_length_in_sec=0.75 \
diarizer.vad.model_path='vad_multilingual_marblenet' \
diarizer.asr.model_path=null \
diarizer.msdd_model.model_path=null \
diarizer.oracle_vad=False \
diarizer.clustering.parameters.oracle_num_speakers=False \
batch_size=256 \
num_workers=1
```

This setting would be fast. You may note that we could switch from an external VAD to ASR-based VAD as well, so you could do ASR+SD in one go. We explained some of these settings here; please feel free to explore: https://github.com/NVIDIA/NeMo/tree/main/examples/speaker_tasks/diarization#run-speech-recognition-with-speaker-diarization. It is very important to note that a common setting might not be best for all kinds of audio samples due to varying backgrounds and noise levels, so use it accordingly. The above configuration does only clustering-based diarization with single-scale embeddings, using the VAD output from the MarbleNet VAD. Also, I am looking to put together a space with the above model and speaker diarization soon; I will keep it posted here.
We are working on improving RTF for ASR models even more; you can only expect the models to get better in terms of both speed and accuracy.
To run python examples/speaker_tasks/diarization/clustering_diarizer/offline_diar_infer.py on a Kaggle notebook, I installed the libraries mentioned in the pip installation section of the NeMo GitHub repo (https://github.com/NVIDIA/NeMo/) and then did the imports that are at the top of the offline_diar_infer.py file (https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/asr/models/clustering_diarizer.py):

```python
from omegaconf import OmegaConf
from nemo.collections.asr.models import ClusteringDiarizer
```

I'm getting this error:

```text
File /opt/conda/lib/python3.10/site-packages/datasets/filesystems/s3filesystem.py:1
File /opt/conda/lib/python3.10/site-packages/s3fs/__init__.py:1
File /opt/conda/lib/python3.10/site-packages/s3fs/core.py:29
File /opt/conda/lib/python3.10/site-packages/aiobotocore/session.py:10
File /opt/conda/lib/python3.10/site-packages/aiobotocore/client.py:10
ModuleNotFoundError: No module named 'botocore.compress'
```

Is there any restriction on which Python version we need to use? I am using 3.10.13.
It looks like
@nithinraok My implementation speeds up the embedding-extraction phase, but the clustering phase and the final (MSDD) phase are still extremely slow because there is a lot of traffic between the CPU and the GPU, and there are also heavy CPU computations like the get_argmin_mat function in offline_clustering.py.
Your comments are highly appreciated; we will soon take these into consideration and update our codebase.
@nithinraok This is the original function:
and this is my re-implementation:
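As a rough, hedged illustration of this kind of rewrite (the function below is hypothetical and not NeMo's actual get_argmin_mat or the re-implementation above): map each fine-scale segment to its nearest coarse-scale segment by comparing segment center times with batched tensor ops that stay on the GPU, instead of looping in Python on the CPU.

```python
# Hypothetical illustration only: compute, for every fine-scale segment, the index
# of the coarse-scale segment whose center time is closest, using a single batched
# distance computation on whatever device the timestamp tensors already live on.
import torch


def nearest_segment_indices(fine_ts: torch.Tensor, coarse_ts: torch.Tensor) -> torch.Tensor:
    """
    fine_ts:   (N, 2) tensor of [start, end] times for the fine-scale segments
    coarse_ts: (M, 2) tensor of [start, end] times for a coarser scale
    Returns a (N,) LongTensor of coarse-scale indices, one per fine-scale segment.
    """
    fine_centers = fine_ts.mean(dim=1)        # (N,)
    coarse_centers = coarse_ts.mean(dim=1)    # (M,)
    # Pairwise |center_i - center_j| in one shot; no Python-level loop.
    dist = (fine_centers[:, None] - coarse_centers[None, :]).abs()  # (N, M)
    return dist.argmin(dim=1)


# Usage sketch: keep the timestamps on the GPU so the mapping never touches the CPU.
# mapping = nearest_segment_indices(fine_ts.cuda(), coarse_ts.cuda())
```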
@nithinraok - I tried the clustering diarization, but the accuracy is too poor compared to neural diarization. Also, the speed of NeMo's clustering diarization is roughly similar to pyannote's neural diarization, which has better accuracy. It took 16 seconds to diarize a 6-minute audio file. I was planning to use NeMo for a production instance, but looking at its speed, I find it not reliable enough for production, especially given that it is not able to use the GPU very well and depends a lot more on the CPU.

Could you share if there is any reason the NeMo team has not taken the feedback from @SagyHarpazGong and implemented the fixes he gave almost one year back in this comment? I really liked NeMo; its neural diarization accuracy was found to be better than pyannote's, but its speed is too low for it to be scalable enough.

@SagyHarpazGong - Could you share the fixes you have added? Are you using NeMo on a production instance? And would it be possible for you to check in these changes to the clone of NeMo that you have in your repository? I tried cloning your repo, but it didn't have the changes you shared in this thread to speed up NeMo.
@prkumar112451 Regarding your question about using NeMo on a production instance: the answer is yes, NeMo (with my fixes) has been running on a production instance for at least half a year. As you already mentioned, the bottleneck is the traffic between the CPU and GPU and vice versa, but also the memory usage on both the GPU and the CPU. BTW, I found out that a lot of the inference functions are not run under torch.no_grad(), meaning each tensor in the process is roughly 2x larger (tensor data plus gradient bookkeeping), so in my repo I just added the @torch.no_grad() decorator at the top of the inference functions.
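For illustration, a minimal sketch of that decorator pattern (the class and method names below are placeholders, not NeMo's actual API):

```python
# Minimal sketch of the fix described above: decorating an inference entry point
# with torch.no_grad() so PyTorch skips autograd bookkeeping during diarization
# inference, reducing memory use and overhead. Names here are placeholders.
import torch


class DiarizationInferenceWrapper:
    def __init__(self, model: torch.nn.Module):
        self.model = model.eval()  # inference mode: no dropout, frozen batch-norm stats

    @torch.no_grad()  # no gradient tracking for anything computed inside this method
    def infer(self, features: torch.Tensor) -> torch.Tensor:
        return self.model(features)
```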
We are working on the next version of speaker diarization, which doesn't depend on the current clustering or MSDD; hence the developers who worked on MSDD probably haven't given this much attention. However, these are very valid points to add to the codebase. We love these suggestions and will apply them to improve.
Note: we worked on improving the speed of the current clustering diarizer for RIVA, with support for TensorRT; those changes are not part of NeMo, however, so the improvements can only be used when using RIVA.
I'm also trying to add a NeMo diarizer to my pipeline and can't wait for the update.
@nithinraok - RIVA is too expensive; we can't go with that. You mentioned TensorRT - are you suggesting that NeMo will be faster with TensorRT? https://github.com/NVIDIA/TensorRT If we run NeMo within a TensorRT container on a T4 GPU, will it speed up NeMo by at least 2x to 3x?
@nithinraok - Also, you mentioned working on a new version of speaker diarization that is not related to clustering or MSDD. Could you share a rough date for when we can expect that?

After looking at all the options, I found that pyannote uses GPUs much better, which means GPUs with more CUDA cores give quicker responses. I had to go with pyannote and am currently working on fine-tuning it to improve the quality of the output. But I must say, it was disappointing to find that NeMo, which felt really good at first, had so much CPU dependency and such poor GPU usage. It looks like for a production environment with a good number of recordings to transcribe, this just wouldn't work, and currently pyannote is the only good enough option.
Any updates on this please?
@prkumar112451 |
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
This issue was closed because it has been inactive for 7 days since being marked as stale.
@tango4j @nithinraok I am seeing this update on LinkedIn - is this true, have you made this 10x speed improvement in your codebase?
It's for ASR models (especially Parakeet); yes, it's part of NeMo main.
But nothing for speaker diarization yet? Is a new end-to-end speaker diarization model still expected soon?
I run the MSDD model on an Nvidia A10 (24 GB), but the inference is too slow. I looked at the code and there is a lot of traffic between the CPU and GPU and vice versa.
Most of the time the GPU utilization is at 0%.
First the data is split into short segments according to the number of scales (I have 5 scales).
After each scale's splitting, embedding extraction is applied and the embeddings are saved to a pkl file.
Then the clustering is applied and finally MSDD is applied.
Is there something that can be done in order to speed up the inference?
Is there any flag for parallelizing the embedding extraction stage?
Please help.