model: name or directory of the base model used for recognition.

vad_model: if an audio clip is longer than 60 seconds, this model splits it into shorter segments before processing.

vad_kwargs: options passed to the VAD model, for example the maximum duration of a single segment it produces.

hub (optional): ms (default) downloads models from ModelScope; use hf to download models from Hugging Face.


model(str): model name in the Model Repository, or a model path on local disk.

device(str): cuda:0 (default) runs inference on the first GPU; specify cpu to run on the CPU.

ncpu(int): 4 (default), sets the number of threads for CPU internal operations.

output_dir(str): None (default), set this to specify the output path for the results.

batch_size(int): 1 (default), the number of samples per batch during decoding.

hub(str): ms (default) to download models from ModelScope. Use hf to download models from Hugging Face.

**kwargs(dict): Any parameters found in config.yaml can be directly specified here, for instance, the maximum segmentation length in the vad model max_single_segment_time=6000 (milliseconds).
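The parameters listed above are passed to AutoModel as keyword arguments. The following is a minimal sketch of how they fit together; `build_automodel_kwargs` is a hypothetical helper written for this note, not part of FunASR, and the values are illustrative.

```python
def build_automodel_kwargs(model, device="cuda:0", hub="ms", vad_model=None,
                           max_single_segment_time=None, **extra):
    """Collect AutoModel keyword arguments (hypothetical helper, not part of FunASR)."""
    if hub not in ("ms", "hf"):
        raise ValueError(f"hub must be 'ms' or 'hf', got {hub!r}")
    kwargs = {"model": model, "device": device, "hub": hub}
    if vad_model is not None:
        kwargs["vad_model"] = vad_model
        if max_single_segment_time is not None:
            # Maximum length of one VAD segment, in milliseconds.
            kwargs["vad_kwargs"] = {"max_single_segment_time": max_single_segment_time}
    # Any other config.yaml parameter can be forwarded unchanged.
    kwargs.update(extra)
    return kwargs


kwargs = build_automodel_kwargs(
    "iic/SenseVoiceSmall",
    device="cpu",
    vad_model="fsmn-vad",
    max_single_segment_time=30000,
)
print(kwargs)
# With FunASR installed, the dictionary would be used as: AutoModel(**kwargs)
```

Keeping the arguments in a plain dictionary makes it easy to switch between GPU and CPU, or between ModelScope and Hugging Face, without editing every cell.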


In [2]:
# Loading the libraries
from datasets import load_dataset
from transformers import pipeline
from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess
import librosa
import numpy as np

from IPython.display import Audio as IPythonAudio

In [3]:
# Load the MP3 file
audio_path = r'C:\Users\Administrator\Desktop\LLM_work\SenseVoiceSmall\example\zh.mp3'
audio_array, sample_rate = librosa.load(audio_path, sr=None)

In [10]:
# Print details
print("Sample Rate:", sample_rate)
print("Audio Array:", audio_array)
print("Audio Array datatype is Array: ", isinstance(audio_array,np.ndarray) )

Sample Rate: 48000
Audio Array: [ 0.0000000e+00 -1.0005779e-14 -7.1443096e-15 ...  2.2035025e-10
  2.1887764e-10  2.0613053e-10]
Audio Array datatype is Array:  True


In [11]:
# Resample the audio to 16 kHz, the sample rate the model expects
audio_16KHz = librosa.resample(audio_array,
                               orig_sr=sample_rate,
                               target_sr=16000)
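As a sanity check on the resampling step, the expected length of the resampled array can be estimated from the two sample rates. This is a sketch that assumes the output length is the input length scaled by target_sr / orig_sr and rounded up; `expected_resampled_length` is a hypothetical helper, and the sample count below is illustrative rather than measured from zh.mp3.

```python
def expected_resampled_length(n_samples, orig_sr, target_sr):
    """Estimate the number of samples after resampling (ceiling of the scaled length)."""
    # Integer ceiling division avoids floating-point rounding issues.
    return (n_samples * target_sr + orig_sr - 1) // orig_sr


n_in = 288000  # illustrative: 6 seconds of audio at 48000 Hz
n_out = expected_resampled_length(n_in, orig_sr=48000, target_sr=16000)
print(n_out)          # 96000
print(n_out / 16000)  # 6.0 seconds, so the duration is unchanged
```

Comparing this estimate with `len(audio_16KHz)` is a quick way to confirm that the resampled clip still covers the full recording.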

In [7]:
model_dir = "FunAudioLLM/SenseVoiceSmall"

# SenseVoiceSmall 


In [31]:
# model_dir = r'C:/Users/Administrator/Desktop/LLM_work/SenseVoiceSmall'
# pretrained_model_path=  r'C:\Users\Administrator\Desktop\LLM_work\SenseVoiceSmall\model.pt'
model = AutoModel(
    model='iic/SenseVoiceSmall',
    # init_param = pretrained_model_path
)

# Transcribe the resampled audio
res = model.generate(
    input=audio_16KHz,
    cache={},
    language="auto",  # "zh", "en", "yue", "ja", "ko", "nospeech"
    use_itn=True,
    batch_size_s=60,
    merge_vad=True,
    merge_length_s=15,
)
text = rich_transcription_postprocess(res[0]["text"])
print(text)

funasr version: 1.1.6.
Check update of funasr, and it would cost few times. You may disable it by set `disable_update=True` in AutoModel
You are using the latest version of funasr-1.1.6


rtf_avg: 0.042: 100%|██████████| 1/1 [00:00<00:00,  4.16it/s]

开饭时间早上9点至下午5点。
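The text printed above has already passed through rich_transcription_postprocess. The sketch below illustrates the general idea of separating inline markers from the transcript. It assumes the raw `res[0]["text"]` string carries markers of the form `<|...|>` before the text; the `raw` value is a hypothetical example, not output captured from this notebook, and the library function may handle these markers differently. `split_tags` is a hypothetical helper.

```python
import re

# Matches markers such as <|zh|> or <|Speech|>.
TAG = re.compile(r"<\|([^|]*)\|>")


def split_tags(raw):
    """Return (tags, text) for a string containing <|...|> markers."""
    tags = TAG.findall(raw)
    text = TAG.sub("", raw).strip()
    return tags, text


# Hypothetical raw string, shown only to illustrate the assumed format.
raw = "<|zh|><|NEUTRAL|><|Speech|><|withitn|>开饭时间早上9点至下午5点。"
tags, text = split_tags(raw)
print(tags)
print(text)
```

Printing `res[0]["text"]` before post-processing shows what the model actually returned and whether this assumption holds.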





# paraformer-zh

In [34]:
model = AutoModel(
    model='paraformer-zh',  # Model name in the model repository, or a local path
    device="cuda:0",  # "cuda:0" for GPU (if CUDA is available) or "cpu" for CPU.
    hub="hf",   # "hf" for Hugging Face; "ms" (default) for ModelScope.
)

# Transcribe the resampled audio
res = model.generate(
    input=audio_16KHz,
    cache={},
    language="zh",
    use_itn=True,
    batch_size_s=60,
    merge_vad=True,
    merge_length_s=15,
)
text = rich_transcription_postprocess(res[0]["text"])
print(text)



funasr version: 1.1.6.
Check update of funasr, and it would cost few times. You may disable it by set `disable_update=True` in AutoModel
You are using the latest version of funasr-1.1.6


Fetching 10 files: 100%|██████████| 10/10 [00:00<?, ?it/s]
rtf_avg: 0.035: 100%|██████████| 1/1 [00:00<00:00,  4.95it/s]

开放时间早上九点至下午五点





# ct-punc

In [38]:
# This model restores punctuation in unpunctuated text
model = AutoModel(model="ct-punc", model_revision="v2.0.4")
res = model.generate(input="那今天的会就到这里吧 happy new year 明年见")
print(res)

funasr version: 1.1.6.
Check update of funasr, and it would cost few times. You may disable it by set `disable_update=True` in AutoModel
You are using the latest version of funasr-1.1.6


2024-09-06 12:02:22,218 - modelscope - INFO - Use user-specified model revision: v2.0.4
  src_state = torch.load(path, map_location=map_location)
rtf_avg: -0.022: 100%|██████████| 1/1 [00:00<00:00, 41.66it/s]

[{'key': 'rand_key_2yW4Acq9GFz6Y', 'text': '那今天的会就到这里吧，happy new year,明年见。', 'punc_array': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 3])}]





# fsmn-vad

In [39]:
from funasr import AutoModel

model = AutoModel(model="fsmn-vad", model_revision="v2.0.4")

res = model.generate(input=audio_16KHz)
print(res)

funasr version: 1.1.6.
Check update of funasr, and it would cost few times. You may disable it by set `disable_update=True` in AutoModel
You are using the latest version of funasr-1.1.6


2024-09-06 12:02:39,634 - modelscope - INFO - Use user-specified model revision: v2.0.4
rtf_avg: 0.006: 100%|██████████| 1/1 [00:00<00:00, 27.66it/s]

[{'key': 'rand_key_2yW4Acq9GFz6Y', 'value': [[420, 5600]]}]
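The VAD result can be used to cut the speech region out of the audio array. This sketch assumes each entry in `value` is a [start, end] pair in milliseconds, and that the audio is sampled at 16000 Hz as in the resampling cell; `segments_ms_to_samples` is a hypothetical helper, not part of FunASR.

```python
def segments_ms_to_samples(segments_ms, sample_rate=16000):
    """Convert [start_ms, end_ms] pairs to (start_sample, end_sample) tuples."""
    return [
        (start_ms * sample_rate // 1000, end_ms * sample_rate // 1000)
        for start_ms, end_ms in segments_ms
    ]


vad_value = [[420, 5600]]  # the "value" field printed above
ranges = segments_ms_to_samples(vad_value)
print(ranges)  # [(6720, 89600)]

for start, end in ranges:
    print(f"speech duration: {(end - start) / 16000:.2f} s")  # 5.18 s
    # With the array from the earlier cells: speech = audio_16KHz[start:end]
```

Under that assumption, the model found a single speech segment of about 5.18 seconds, starting 0.42 seconds into the clip.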





# Combined pipeline

In [45]:
model = AutoModel(
    model='iic/SenseVoiceSmall',
    # vad_model="fsmn-vad",
    # vad_kwargs={"max_single_segment_time": 30000},
    device="cuda:0",
    punc_model = "ct-punc"
)

# Transcribe the resampled audio
res = model.generate(
    input=audio_16KHz,
    cache={},
    language="auto",  # "zh", "en", "yue", "ja", "ko", "nospeech"
    use_itn=True,
    batch_size_s=60,
    merge_vad=True,
    merge_length_s=15,
)
text = rich_transcription_postprocess(res[0]["text"])
print(text)

funasr version: 1.1.6.
Check update of funasr, and it would cost few times. You may disable it by set `disable_update=True` in AutoModel
You are using the latest version of funasr-1.1.6


rtf_avg: 0.047: 100%|██████████| 1/1 [00:00<00:00,  3.74it/s]

开饭时间早上9点至下午5点。



