<div align="center">
    <h1>
        <a href="https://github.com/modelscope/FunASR/tree/main">Real-Time Speech Recognition with FunASR</a>
    </h1>
</div>

- Requirements: **python>=3.8 torch>=1.13 torchaudio**
- Install dependencies
  ```bash
  pip3 install -U funasr
  sudo apt-get install portaudio19-dev libportaudio2 libportaudiocpp0
  pip install soundfile sounddevice
  ```
- Model used: **[paraformer-zh-streaming](https://modelscope.cn/models/iic/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-online/files)**
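
Before loading the model, it can help to confirm that the packages listed above are actually installed. The following is a minimal sketch using only the standard library; `installed_versions` is a hypothetical helper name, and it only reports versions rather than enforcing the minimums.

```python
import sys
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    print("python", sys.version.split()[0])
    required = ["funasr", "torch", "torchaudio", "soundfile", "sounddevice"]
    for name, version in installed_versions(required).items():
        print(name, version or "NOT INSTALLED")
```

Any package reported as `NOT INSTALLED` should be installed with the commands above before continuing.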

---

### 1. Load the model

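- By default the model is downloaded from ModelScope. That automatic download can be slow; downloading the files directly from the [web page](https://modelscope.cn/models/iic/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-online/files) is much faster
- ModelScope stores downloaded models under `~/.cache/modelscope` by default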
  ```python
  model = AutoModel(model="paraformer-zh-streaming")
  ```
- You can also pass an absolute path
  ```python
  model = AutoModel(model="/home/sumi/.cache/modelscope/hub/models/iic/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-online")
  ```
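
The two options above can be combined: use the local copy when it exists and fall back to the short model name otherwise. This is only a sketch. `resolve_model` is a hypothetical helper, and the cache layout (`hub/models/<model id>`) is assumed from the absolute path shown above, so it may differ on other ModelScope versions.

```python
from pathlib import Path

# Assumed cache layout, taken from the absolute path in the example above.
DEFAULT_CACHE = Path.home() / ".cache" / "modelscope" / "hub" / "models"


def resolve_model(model_id, short_name, cache_root=DEFAULT_CACHE):
    """Return the local model directory if it exists, else the short model name."""
    local_dir = Path(cache_root) / model_id
    return str(local_dir) if local_dir.is_dir() else short_name


model_ref = resolve_model(
    "iic/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-online",
    "paraformer-zh-streaming",
)
print(model_ref)
# model = AutoModel(model=model_ref)
```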




In [5]:
from funasr import AutoModel
import logging

# Global logging setup: only show errors
logging.basicConfig(level=logging.ERROR)
logging.getLogger("funasr").setLevel(logging.ERROR)

chunk_size = [0, 10, 5]  # [0, 10, 5] -> 600 ms, [0, 8, 4] -> 480 ms
encoder_chunk_look_back = 4  # number of chunks to look back for encoder self-attention
decoder_chunk_look_back = 1  # number of encoder chunks to look back for decoder cross-attention

# model = AutoModel(model="paraformer-zh-streaming")
model = AutoModel(
    model="speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-online",
    disable_update=True,
    disable_pbar=True,
    disable_log=True,
    progress_bar=False,
    verbose=False,  # extra arguments intended to silence output
)


funasr version: 1.2.6.
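
The latency figures in the `chunk_size` comment follow from simple arithmetic: the middle entry counts units of 960 samples, which is 60 ms at 16 kHz. The sketch below reproduces the two values from that comment; the helper names are hypothetical, and the 960-sample unit is the constant used in the inference cells of this notebook.

```python
SAMPLE_RATE = 16000  # Hz, per the "16k" in the model name
UNIT_SAMPLES = 960   # samples per chunk unit (60 ms at 16 kHz)


def chunk_stride_samples(chunk_size):
    """Number of audio samples fed to the model per streaming step."""
    return chunk_size[1] * UNIT_SAMPLES


def chunk_latency_ms(chunk_size, sample_rate=SAMPLE_RATE):
    """Duration of one streaming step in milliseconds."""
    return chunk_stride_samples(chunk_size) * 1000 // sample_rate


for cs in ([0, 10, 5], [0, 8, 4]):
    print(cs, chunk_stride_samples(cs), "samples,", chunk_latency_ms(cs), "ms")
```

A smaller middle value lowers latency, since each step waits for less audio before producing output.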


### 2. Run inference
- Use the example code and example audio provided by DAMO Academy

In [6]:

import soundfile as sf
import os

wav_file = os.path.join(model.model_path, "example/asr_example.wav")
speech, sample_rate = sf.read(wav_file)
chunk_stride = chunk_size[1] * 960  # 600 ms at 16 kHz

cache = {}
# Ceiling division: number of chunks needed to cover the whole signal
total_chunk_num = (len(speech) - 1) // chunk_stride + 1
for i in range(total_chunk_num):
    speech_chunk = speech[i * chunk_stride:(i + 1) * chunk_stride]
    is_final = i == total_chunk_num - 1  # mark the last chunk so the model flushes its output
    res = model.generate(input=speech_chunk, cache=cache,
                         is_final=is_final, chunk_size=chunk_size,
                         encoder_chunk_look_back=encoder_chunk_look_back,
                         decoder_chunk_look_back=decoder_chunk_look_back,
                         disable_pbar=True, progress_bar=False, verbose=False
                         )
    # print(res)
    print(res[0]['text'])  # each call returns only the newly recognized text



欢迎大
家来
体验达
摩院推
出的语
音识
别模型
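
The chunking logic in the loop above can be isolated into a small generator, which makes the chunk count and the `is_final` flag easy to check without loading the model. This is a sketch; `iter_chunks` is a hypothetical helper, and it works on any sliceable sequence, including the NumPy array returned by `sf.read`.

```python
def iter_chunks(samples, chunk_stride):
    """Yield (chunk, is_final) pairs that cover `samples` in order."""
    n = len(samples)
    total = (n - 1) // chunk_stride + 1 if n else 0  # ceiling division
    for i in range(total):
        chunk = samples[i * chunk_stride:(i + 1) * chunk_stride]
        yield chunk, i == total - 1


# 25 samples with a stride of 10: two full chunks and a final partial one
demo = [(len(chunk), is_final) for chunk, is_final in iter_chunks(list(range(25)), 10)]
print(demo)  # [(10, False), (10, False), (5, True)]
```

Only the last chunk carries `is_final=True`, and it may be shorter than `chunk_stride`.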



### 3. Test the microphone

In [None]:
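# Test microphone recording
import sounddevice as sd
import soundfile as sf
import numpy as np
try:
    sd.default.reset()
    fs = 16000  # sample rate
    duration = 5  # record for 5 seconds

    # List the available audio devices first
    print("Available audio devices:")
    print(sd.query_devices())
    print(f"Default input device: {sd.query_devices(kind='input')}")

    print("Starting test recording, please speak...")
    recording = sd.rec(int(duration * fs), samplerate=fs, channels=1)
    sd.wait()  # wait for the recording to finish
    print("Recording finished")

    # Save the recording to a file
    sf.write('test_mic_ipynb.wav', recording, fs)
    print("Saved to test_mic_ipynb.wav; check that the file contains sound")

    # Show recording statistics
    print(f"Recording info: shape={recording.shape}, min={np.min(recording)}, max={np.max(recording)}")
finally:
    # Stop any active recording
    sd.stop()
    print("Audio stream closed")

Available audio devices: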
   0 HDA Intel PCH: ALC1220 Analog (hw:0,0), ALSA (2 in, 6 out)
   1 HDA Intel PCH: ALC1220 Digital (hw:0,1), ALSA (0 in, 2 out)
   2 HDA Intel PCH: ALC1220 Alt Analog (hw:0,2), ALSA (2 in, 0 out)
   3 HDA NVidia: HDMI 0 (hw:1,3), ALSA (0 in, 2 out)
   4 HDA NVidia: HDMI 1 (hw:1,7), ALSA (0 in, 2 out)
   5 HDA NVidia: HDMI 2 (hw:1,8), ALSA (0 in, 8 out)
   6 HDA NVidia: HDMI 3 (hw:1,9), ALSA (0 in, 8 out)
   7 HDA NVidia: HDMI 4 (hw:1,10), ALSA (0 in, 8 out)
   8 HDA NVidia: HDMI 5 (hw:1,11), ALSA (0 in, 8 out)
   9 HDA NVidia: HDMI 6 (hw:1,12), ALSA (0 in, 8 out)
  10 sysdefault, ALSA (128 in, 128 out)
  11 front, ALSA (0 in, 6 out)
  12 surround21, ALSA (0 in, 128 out)
  13 surround40, ALSA (0 in, 6 out)
  14 surround41, ALSA (0 in, 128 out)
  15 surround50, ALSA (0 in, 128 out)
  16 surround51, ALSA (0 in, 6 out)
  17 surround71, ALSA (0 in, 6 out)
  18 iec958, ALSA (0 in, 2 out)
  19 spdif, ALSA (0 in, 2 out)
  20 samplerate, ALSA (128 in, 128 out)
  21 speexrate, ALSA (128
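
Besides listening to the saved file, the recording can be checked numerically for signal. The sketch below computes peak and RMS levels in pure Python; the helper names and the `0.01` peak threshold are assumptions to tune for a given microphone, and with NumPy the same values could be computed with `np.max(np.abs(x))` and `np.sqrt(np.mean(x ** 2))`.

```python
import math


def signal_level(samples):
    """Return (peak, rms) for a flat sequence of float samples."""
    values = [float(x) for x in samples]
    if not values:
        return 0.0, 0.0
    peak = max(abs(v) for v in values)
    rms = math.sqrt(sum(v * v for v in values) / len(values))
    return peak, rms


def looks_silent(samples, peak_threshold=0.01):
    """Heuristic: treat the recording as silent if its peak is below the threshold."""
    peak, _ = signal_level(samples)
    return peak < peak_threshold


print(signal_level([0.5, -0.5]))
print(looks_silent([0.0] * 10))
```

For the recording made above, which has shape `(samples, 1)`, pass a flat view such as `recording[:, 0]`.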

### 4. Recognize microphone audio and print the results

In [8]:
import queue
import sys

import numpy as np
import sounddevice as sd

# Audio configuration
FS = 16000  # must match the sample rate the model was trained on
CHANNELS = 1
DEVICE = None  # use the default device

audio_queue = queue.Queue()
cache = {}  # streaming context cache
current_text = ""

# Samples per chunk
chunk_stride = chunk_size[1] * 960  # 600 ms

def audio_callback(indata, frames, time_info, status):
    """Audio callback: push each captured block onto the queue."""
    if status:
        print(status, file=sys.stderr)  # report overflows and other stream warnings

    # Convert to float32
    audio_data = indata.flatten().copy().astype(np.float32)

    # int16 input must be scaled to the [-1, 1) range used for float audio
    if indata.dtype == np.int16:
        audio_data = audio_data / 32768.0

    audio_queue.put(audio_data)

print("Recording started, please speak...\n")
try:
    with sd.InputStream(
        samplerate=FS,
        channels=CHANNELS,
        blocksize=chunk_stride,
        dtype='float32',
        device=DEVICE,
        callback=audio_callback
    ):
        while True:
            # Block until the next audio chunk arrives (no extra sleep is needed)
            chunk = audio_queue.get()

            ## Debug info
            # signal_level = np.max(np.abs(chunk))
            # print(f"Audio chunk: shape={chunk.shape}, level={signal_level:.6f}")

            # Run inference
            res = model.generate(input=chunk, cache=cache,
                             is_final=False, chunk_size=chunk_size,
                             encoder_chunk_look_back=encoder_chunk_look_back,
                             decoder_chunk_look_back=decoder_chunk_look_back,
                             disable_pbar=True, progress_bar=False, verbose=False
                             )

            # Each call returns only the newly recognized text, so append it
            if res and 'text' in res[0] and res[0]['text'].strip():
                current_text += res[0]['text']
                print(current_text + " " * 20, end="\r")  # pad to clear the end of the line
                sys.stdout.flush()

except KeyboardInterrupt:
    print("\n\nRecording stopped")
except Exception as e:
    print(f"An error occurred: {e}")

Recording started, please speak...



Recording stopped
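
The text handling in the loop can be separated from the audio code so it can be tested without a microphone or a model. The sketch below assumes, as the inference output in section 2 suggests, that each `model.generate` call returns a list whose first item holds only the newly recognized text under the `'text'` key; `TranscriptAccumulator` is a hypothetical helper name.

```python
class TranscriptAccumulator:
    """Collect incremental streaming results into one running transcript."""

    def __init__(self):
        self.text = ""

    def update(self, res):
        """Append new text from a generate() result; return True if the text changed."""
        if not res:
            return False
        piece = res[0].get("text", "")
        if not piece.strip():
            return False
        self.text += piece
        return True


acc = TranscriptAccumulator()
for res in ([{"text": "欢迎大"}], [{"text": ""}], [], [{"text": "家来"}]):
    if acc.update(res):
        print(acc.text)
```

Appending unconditionally, rather than comparing the new piece with the accumulated text, keeps repeated pieces intact when the speaker actually says the same thing twice.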
