# Tutorial: Automatic Speech Recognition with ReazonSpeech

In this tutorial, we perform Japanese speech recognition using ReazonSpeech v2.0.

(Note: Choose a GPU instance in 'Runtime > Change runtime type' for acceleration.)

## Set up ReazonSpeech

First, install the ReazonSpeech Python package:

In [1]:
!apt-get install libsndfile1 ffmpeg
!git clone https://github.com/Triplecq/ReazonSpeech.git
!pip install --no-warn-conflicts ReazonSpeech/pkg/k2-asr

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
libsndfile1 is already the newest version (1.0.31-2ubuntu0.1).
ffmpeg is already the newest version (7:4.4.2-0ubuntu0.22.04.1).
0 upgraded, 0 newly installed, 0 to remove and 45 not upgraded.
Cloning into 'ReazonSpeech'...
remote: Enumerating objects: 339, done.
remote: Counting objects: 100% (131/131), done.
remote: Compressing objects: 100% (64/64), done.
remote: Total 339 (delta 86), reused 93 (delta 63), pack-reused 208
Receiving objects: 100% (339/339), 258.79 KiB | 1.08 MiB/s, done.
Resolving deltas: 100% (194/194), done.
Processing ./ReazonSpeech/pkg/k2-asr
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting sherpa-onnx (from reazonspeech-k2-asr==2.0.1)
  Downloading sherpa_onnx-1.10.11-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

## Download an audio file

Next, download an example audio file:

In [2]:
!curl -O https://research.reazon.jp/_static/demo.mp3

from IPython.display import Audio, display
display(Audio("demo.mp3"))

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  164k  100  164k    0     0   683k      0 --:--:-- --:--:-- --:--:--  683k


## Perform speech recognition

Now that the setup is complete, we can perform Japanese speech recognition.

The following Python code shows how to do it:

In [3]:
from reazonspeech.k2.asr import transcribe, audio_from_path, load_model

# Download ReazonSpeech model from Hugging Face
model = load_model()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Fetching 12 files:   0%|          | 0/12 [00:00<?, ?it/s]

.gitattributes:   0%|          | 0.00/1.52k [00:00<?, ?B/s]

README.md:   0%|          | 0.00/6.19k [00:00<?, ?B/s]

test_audios/JSUT0001.wav:   0%|          | 0.00/102k [00:00<?, ?B/s]

test_audios/JSUT0002.wav:   0%|          | 0.00/157k [00:00<?, ?B/s]

decoder-epoch-99-avg-1.onnx:   0%|          | 0.00/11.8M [00:00<?, ?B/s]

joiner-epoch-99-avg-1.onnx:   0%|          | 0.00/10.7M [00:00<?, ?B/s]

joiner-epoch-99-avg-1.int8.onnx:   0%|          | 0.00/2.70M [00:00<?, ?B/s]

encoder-epoch-99-avg-1.int8.onnx:   0%|          | 0.00/155M [00:00<?, ?B/s]

test_audios/JSUT0003.wav:   0%|          | 0.00/122k [00:00<?, ?B/s]

encoder-epoch-99-avg-1.onnx:   0%|          | 0.00/592M [00:00<?, ?B/s]

tokens.txt:   0%|          | 0.00/45.8k [00:00<?, ?B/s]

decoder-epoch-99-avg-1.int8.onnx:   0%|          | 0.00/2.96M [00:00<?, ?B/s]

Repository downloaded to: /root/.cache/huggingface/hub/models--reazon-research--reazonspeech-zipformer-large/snapshots/c5e1261b1167a0f2bbc75a4da29b9222eda62b40


In [4]:
# Perform speech recognition
audio = audio_from_path("demo.mp3")
ret = transcribe(model, audio)

# Output
print("\n## Result")
print(ret.text)

audio normalized
{"text": "長野県は全国で三番目に大きな県ですお隣の山梨県の三倍以上もあります長野から飯田へ行くのにも東京へ行くのと同じ時間がかかるのを見ても面積の広いことが分かります", "timestamps": [0.00, 0.52, 0.72, 0.92, 1.32, 1.52, 1.76, 2.08, 2.32, 2.56, 2.68, 2.96, 3.04, 3.16, 3.44, 3.72, 3.84, 4.24, 4.44, 4.72, 5.16, 5.36, 5.60, 5.84, 6.16, 6.40, 6.68, 6.72, 6.92, 7.12, 7.32, 7.56, 7.64, 8.32, 8.48, 8.76, 8.80, 9.16, 9.36, 9.56, 9.76, 9.88, 10.04, 10.16, 10.32, 10.64, 10.76, 10.96, 11.20, 11.28, 11.44, 11.60, 11.88, 12.00, 12.32, 12.40, 12.60, 12.80, 12.92, 13.08, 13.24, 13.36, 13.56, 13.72, 13.84, 14.24, 14.40, 14.64, 14.88, 15.04, 15.28, 15.36, 15.48, 15.72, 15.84, 16.04, 16.68, 16.72], "tokens":["長", "野", "県", "は", "全", "国", "で", "三", "番", "目", "に", "大", "き", "な", "県", "で", "す", "お", "隣", "の", "山", "梨", "県", "の", "三", "倍", "以", "上", "も", "あ", "り", "ま", "す", "長", "野", "か", "ら", "飯", "田", "へ", "行", "く", "の", "に", "も", "東", "京", "へ", "行", "く", "の", "と", "同", "じ", "時", "間", "が", "か", "か", "る", "の", "を", "見", "て", "も", "面", "積", "の", "広", "い", "こ", "と", "が

If Japanese text appears in the last line of the output, the recognition succeeded.
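
The log output above carries `timestamps` and `tokens` lists alongside `text`. The following is a minimal sketch of how the two could be lined up, assuming they are parallel lists with one start time (in seconds) per token, which the sample output suggests but this tutorial does not state. `align_tokens` is a hypothetical helper written for illustration and is not part of ReazonSpeech; the sample values are the first four entries copied from the output above.

```python
from typing import List, Tuple


def align_tokens(tokens: List[str], timestamps: List[float]) -> List[Tuple[str, float]]:
    """Pair each token with its start time in seconds.

    Hypothetical helper (not a ReazonSpeech API). Assumes the two lists
    are parallel, i.e. timestamps[i] belongs to tokens[i].
    """
    if len(tokens) != len(timestamps):
        raise ValueError("tokens and timestamps must have the same length")
    return list(zip(tokens, timestamps))


# First four entries copied from the log output shown in this tutorial.
sample_tokens = ["長", "野", "県", "は"]
sample_timestamps = [0.00, 0.52, 0.72, 0.92]

for token, start in align_tokens(sample_tokens, sample_timestamps):
    print(f"{start:5.2f}s  {token}")
```

A token-level alignment like this could serve as a starting point for subtitles or for locating a phrase in the audio, provided the parallel-list assumption holds for your output.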