---
title: ModelScope - Open-Source Pre-trained AI models hub
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Before you begin

To follow the instructions for this Learning Path, you will need an Arm-based server running Ubuntu 22.04 LTS or later, with at least 8 cores, 16GB of RAM, and 30GB of disk storage.

## Introducing ModelScope
[ModelScope](https://github.com/modelscope/modelscope/) is an open-source platform that makes it easy to use AI models in your applications.
Arm provides optimized software and tools, such as Kleidi, to accelerate AI inference.
You can learn more about [Faster PyTorch Inference using Kleidi on Arm Neoverse](https://community.arm.com/arm-community-blogs/b/servers-and-cloud-computing-blog/posts/faster-pytorch-inference-kleidi-arm-neoverse) on the Arm community website.


## Install ModelScope and PyTorch

First, ensure your system is up-to-date and install the required tools and libraries:

```bash
sudo apt-get update -y
sudo apt-get install -y curl git wget python3 python3-pip python3-venv python-is-python3 ffmpeg
```

Create and activate a virtual environment:
```bash
python -m venv venv
source venv/bin/activate
```

In your active virtual environment, install the `modelscope` package:

```bash
pip3 install modelscope
```

Install PyTorch and related Python dependencies:
```bash
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
pip3 install numpy packaging addict datasets simplejson sortedcontainers transformers ffmpeg

```
{{% notice Note %}}
In this Learning Path you will execute models on the Arm Neoverse CPU, so you only need to install the PyTorch CPU package.
{{% /notice %}}
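
You can optionally verify that both packages import correctly before moving on. The exact version numbers you see will vary:

```bash
python3 -c "import torch, modelscope; print(torch.__version__, modelscope.__version__)"
```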

## Create an example

You can now run an example to learn how to use ModelScope for understanding Chinese semantics.

There is a fundamental difference between Chinese and English writing.
The relationship between Chinese characters and their meanings is somewhat analogous to the difference between words and phrases in English.
Some Chinese characters, like English words, have clear meanings on their own.
However, more often, Chinese characters need to be combined with other characters to express more complete meanings, just like phrases in English.
For example, “祝福” (blessing) can be broken down into “祝” (wish) and “福” (good fortune); “分享” (share) can be broken down into “分” (divide) and “享” (enjoy); “生成” (generate) is composed of “生” (produce) and “成” (become).

For a computer to understand Chinese sentences, it must apply the rules of Chinese characters, vocabulary, and grammar to accurately interpret and express meaning.

In this simple example, you will use a general-domain Chinese [word segmentation model](https://www.modelscope.cn/models/iic/nlp_structbert_word-segmentation_chinese-base) to break down Chinese sentences into individual words, making them easier for computers to analyze and understand.

Using a file editor of your choice, copy the code shown below into a file named `segmentation.py`:

```python
from modelscope.pipelines import pipeline

# Load the general-domain Chinese word segmentation model from ModelScope
word_segmentation = pipeline('word-segmentation', model='damo/nlp_structbert_word-segmentation_chinese-base')

# Sample Chinese sentence; replace it with your own text
text = '今天天气不错，适合出去游玩'
result = word_segmentation(text)
print(result)
```

Run the model inference on the sample text:

```bash
python3 segmentation.py
```

The output should look like this:
```output
2025-01-28 00:30:29,692 - modelscope - WARNING - Model revision not specified, use revision: v1.0.3
Downloading Model to directory: /home/ubuntu/.cache/modelscope/hub/damo/nlp_structbert_word-segmentation_chinese-base
```

## Installing FunASR
Install FunASR using pip:
```bash
pip3 install funasr==1.2.3
```
{{% notice Note %}}
The learning path examples use FunASR version 1.2.3. You may notice minor differences in results with other versions.
{{% /notice %}}

## Speech Recognition
FunASR offers a simple interface for performing speech recognition tasks. You can easily transcribe audio files or implement real-time speech recognition using FunASR's functionalities. In this learning path, you will learn how to leverage FunASR to implement a speech recognition application.

Let's use an English speech sample as an example to run audio transcription on. Copy the code shown below into a file named `funasr_test1.py`:

```python
from funasr import AutoModel

# Load the Paraformer ASR model from the ModelScope ("ms") hub and run it on the CPU
model = AutoModel(model="paraformer", device="cpu", hub="ms")

# Sample English .wav file hosted online (the full URL is truncated here)
res = model.generate(input="https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/Ma...")
print(f"\nResult: \n{res[0]['text']}")
```

Before you run this script, let's look at what the Python code is doing:

The imported `AutoModel()` class provides an interface to load different AI models.

* **model="paraformer":**

Specifies the model you would like to load.
In this example you will load the Paraformer model, which is an end-to-end automatic speech recognition (ASR) model designed for real-time transcription.

* **device="cpu":**

Specifies that the model runs on the CPU. It does not require a GPU.

* **hub="ms":**

Indicates that the model is sourced from the "ms" (ModelScope) hub.

The `model.generate()` function processes an audio file and generates a transcribed text output.
* **input="...":**

The input is an audio file URL, which is a .wav file containing an English audio sample.


The result contains a lot of information. To keep the example simple, you will only print the transcribed text contained in `res[0]['text']`.

In this initial test, the Paraformer model transcribes a two-second English audio clip downloaded from the internet.

Run this Python script on your Arm-based server:

```bash
python funasr_test1.py
```

The output will look like:

```output
funasr version: 1.2.3.
Check update of funasr, and it would cost few times. You may disable it by set `disable_update=True` in AutoModel
You are using the latest version of funasr-1.2.3
Result:
he tried to think how it could be
```

The output shows "he tried to think how it could be" as expected.
The transcribed test shows "he tried to think how it could be". This is the expected result for the audio sample.

Now let's try an example that uses a Chinese speech recognition model. Copy the code shown below into a file named `funasr_test2.py`:

```python
import os
from funasr import AutoModel

# Load the Chinese Paraformer ASR model from ModelScope and run it on the CPU
model = AutoModel(model="paraformer-zh", device="cpu", hub="ms")

# Chinese sample audio bundled with the model (assumed path; replace it with your own .wav file)
wav_file = os.path.join(model.model_path, "example/asr_example.wav")
res = model.generate(input=wav_file)

text_content = res[0]['text'].replace(" ","")
print(f"Result: \n{text_content}")

print(res)
```

You can see that the loaded model has been replaced with a Chinese speech recognition model that has a `-zh` suffix.

FunASR will process each sound in the audio with appropriate character recognition.

You have also modified the output format from the previous example. In addition to recognizing the Chinese characters, you will add timestamps indicating the start and end times of each character. This is used for applications like subtitle generation and sentiment analysis.
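
If you want to work with those timestamps directly, here is a minimal sketch that pairs each recognized token with its start and end time. It assumes the `paraformer-zh` result exposes a `timestamp` list of `[start, end]` millisecond pairs alongside the space-separated `text` field, and the audio file name below is a placeholder:

```python
from funasr import AutoModel

# Minimal sketch: print each recognized token with its start/end time in milliseconds.
# Assumes res[0] contains a 'timestamp' list aligned with the space-separated text.
model = AutoModel(model="paraformer-zh", device="cpu", hub="ms")
res = model.generate(input="example.wav")  # placeholder; point this at your own Chinese .wav file

tokens = res[0]["text"].split()
stamps = res[0].get("timestamp", [])
for token, (start_ms, end_ms) in zip(tokens, stamps):
    print(f"{token}: {start_ms} ms - {end_ms} ms")
```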

Run the `funasr_test2.py` script:

```bash
python3 funasr_test2.py
```

The output should look like:

```output
Downloading Model to directory: /home/ubuntu/.cache/modelscope/hub/iic/speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch
Result:
欢迎大家来到达摩社区进行体验
```

The output shows "欢迎大家来到达摩社区进行体验" as expected.

You can also observe that the spacing between the third and sixth characters is very short. This is because they are combined with other characters, as discussed in the previous section.

You can now build a speech processing pipeline. The output of the speech recognition module serves as the input for the semantic segmentation model, enabling you to validate the accuracy of the recognized results. Copy the code shown below into a file named `funasr_test3.py`:

```python
from funasr import AutoModel
from modelscope.pipelines import pipeline
import os

model = AutoModel(
model="paraformer-zh",
Expand All @@ -164,8 +175,13 @@ seg_result = word_segmentation(text_content)

print(f"Result: \n{seg_result}")
```
Run this Python script:

```bash
python3 funasr_test3.py
```

The output should look like:

```output
Downloading Model to directory: /home/ubuntu/.cache/modelscope/hub/iic/speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch
Result:
{'output': ['欢迎', '大家', '来到', '达摩', '社区', '进行', '体验']}
```

Good, the result is exactly what you are looking for.

## Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition

Let's now look at a more advanced speech recognition model, [Paraformer](https://aclanthology.org/2020.wnut-1.18/).

Paraformer is a novel architecture for automatic speech recognition (ASR) that offers both enhanced speed and accuracy compared to traditional models. Its key innovation lies in its parallel transformer design, enabling simultaneous processing of multiple parts of the input speech. This parallel processing capability leads to significantly faster inference, making Paraformer well-suited for real-time ASR applications where responsiveness is crucial.

Furthermore, Paraformer has demonstrated state-of-the-art accuracy on several benchmark datasets, showcasing its effectiveness in accurately transcribing speech. This combination of speed and accuracy makes Paraformer a promising advancement in the field of ASR, opening up new possibilities for high-performance speech recognition systems.

Paraformer has been fully integrated into FunASR. Copy the sample program shown below into a file named `paraformer.py`.

This example uses the PyTorch-optimized Paraformer model from ModelScope. The program first checks whether the test audio file has already been downloaded.

```python
rec_result = inference_pipeline(input=filename)

print(f"\nResult: \n{rec_result[0]['text']}")
```
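
For reference, here is a minimal sketch of the pattern `paraformer.py` follows: download the test clip once if it is not already present, then pass it to a ModelScope ASR pipeline. The URL, file name, and model ID below are placeholder assumptions for illustration, not the exact values used in the Learning Path script:

```python
import os
import urllib.request

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# Placeholder URL and file name for illustration only
AUDIO_URL = "https://example.com/asr_sample_zh.wav"
filename = "asr_sample_zh.wav"

# Download the test audio only if it is not already present
if not os.path.exists(filename):
    urllib.request.urlretrieve(AUDIO_URL, filename)

# Assumed model ID: a PyTorch Paraformer ASR model hosted on ModelScope
inference_pipeline = pipeline(
    task=Tasks.auto_speech_recognition,
    model="iic/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch",
    model_revision="v2.0.4")

rec_result = inference_pipeline(input=filename)
print(f"\nResult: \n{rec_result[0]['text']}")
```
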
Run the `paraformer.py` script:

```bash
python3 paraformer.py
```

The output should look like:

```output
2025-01-28 00:03:24,373 - modelscope - INFO - Use user-specified model revision: v2.0.4
Result:
飞机穿过云层眼下一片云海有时透过稀薄的云雾依稀可见南国葱绿的群山大地
```

The output shows "飞机穿过云层眼下一片云海有时透过稀薄的云雾依稀可见南国葱绿的群山大地" as expected.

## Punctuation Restoration

In the previous example, each word of the speech was correctly recognized, but the output lacked punctuation. The lack of punctuation makes it harder to understand the speaker's intended meaning.

You can add a [Punctuation Restoration model](https://aclanthology.org/2020.wnut-1.18/) responsible for punctuation as the next step in processing your audio workload.

In addition to using the Paraformer model, you will add two more ModelScope models:
- VAD ([Voice Activity Detection](https://modelscope.cn/models/iic/speech_fsmn_vad_zh-cn-16k-common-pytorch/summary)) and
- PUNC ([Punctuation Restoration](https://modelscope.cn/models/iic/punc_ct-transformer_zh-cn-common-vocab272727-pytorch/files))

You add these models by specifying the `vad_model` and `punc_model` parameters in the pipeline. This way, you can obtain punctuation that matches the semantics of the speech recognition. Copy the updated code shown below into a file named `paraformer-2.py`:

```python
import os
print(f"\nResult: \n{rec_result[0]['text']}")
```
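
The key change is the two extra model arguments on the pipeline. Here is a minimal sketch of what the updated call looks like; the VAD and PUNC model IDs match the model pages linked above, while the ASR model ID and the revision strings are assumptions for illustration:

```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

inference_pipeline = pipeline(
    task=Tasks.auto_speech_recognition,
    # Assumed ASR model ID and revisions for illustration
    model="iic/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch",
    # VAD: split the audio into speech segments before recognition
    vad_model="iic/speech_fsmn_vad_zh-cn-16k-common-pytorch",
    vad_model_revision="v2.0.4",
    # PUNC: restore punctuation in the recognized text
    punc_model="iic/punc_ct-transformer_zh-cn-common-vocab272727-pytorch",
    punc_model_revision="v2.0.4")

rec_result = inference_pipeline(input="asr_sample_zh.wav")  # your downloaded test clip
print(f"\nResult: \n{rec_result[0]['text']}")
```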

{{% notice Note %}}
vad_model_revision & punc_model_revision are not a required parameter. In most cases, it can work smoothly without specifying the version.
vad_model_revision & punc_model_revision are optional parameters. In most cases, your models should work without specifying the version.
{{% /notice %}}

Run the updated Python script:

```bash
python3 paraformer-2.py
```


The entire speech sample is correctly segmented into four parts based on semantics.

```output
rtf_avg: 0.047: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 2.45it/s]
Result:
飞机穿过云层,眼下一片云海,有时透过稀薄的云雾,依稀可见南国葱绿的群山大地。
```

Let's translate the recognized result, and you can easily see that the four segments convey different meanings.

"飞机穿过云层" means: The airplane passed through the clouds.

"眼下一片云海" means: Below is a sea of clouds.

"有时透过稀薄的云雾" means: Sometimes, through the thin mist,

"依稀可见南国葱绿的群山大地" means: the verdant mountains and land of the south are faintly visible.


## Sentiment Analysis
FunASR also supports sentiment analysis of speech, allowing you to determine the emotional tone of the spoken language.

This can be valuable for applications like customer service and social media monitoring.

You can use a mature speech emotion recognition model [emotion2vec+](https://modelscope.cn/models/iic/emotion2vec_plus_large) from ModelScope as an example.

The model will identify which of the following emotions is the closest match for the emotion expressed in the speech:
- Neutral
- Happy
- Sad
- Angry
- Unknown

Copy the code shown below into a file named `sentiment.py`:

```python
from modelscope.pipelines import pipeline
process_audio_file(
'https://utoronto.scholaris.ca/bitstreams/5ce257a3-be71-41a8-8d88-d097ca15af4e/download'
)

```
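
The `process_audio_file()` helper in `sentiment.py` is assumed to wrap a ModelScope emotion-recognition pipeline call along the lines of this minimal sketch; the keyword arguments follow the emotion2vec+ model card and may differ from the exact listing:

```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# Minimal sketch: classify the dominant emotion in a single audio file with emotion2vec+.
inference_pipeline = pipeline(
    task=Tasks.emotion_recognition,
    model="iic/emotion2vec_plus_large")

# Placeholder file name; pass a local path or a URL to a speech sample
rec_result = inference_pipeline("your_audio.wav", granularity="utterance", extract_embedding=False)
print(rec_result)
```
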
Run the `sentiment.py` script:

```bash
python3 sentiment.py
```

Without a model that understands semantics, `emotion2vec+` can still correctly recognize the speaker's emotions through changes in intonation.

The output should look like:

```output
Neutral Chinese Speech
rtf_avg: 1.444: 100%|███████████████████
Result: ['生气/angry (1.00)', '中立/neutral (0.00)', '开心/happy (0.00)', '难过/sad (0.00)', '<unk> (0.00)']
```

## Best Price-Performance for ASR on Arm Neoverse N2
Arm CPUs, with their high performance and low power consumption, provide an ideal platform for running ModelScope's AI models, especially in edge computing scenarios. Arm's comprehensive software ecosystem supports the development and deployment of ModelScope models, enabling developers to create innovative and efficient applications.
You can learn more about [Kleidi Technology Delivers Best Price-Performance for ASR on Arm Neoverse N2](https://community.arm.com/arm-community-blogs/b/servers-and-cloud-computing-blog/posts/neoverse-n2-delivers-leading-price-performance-on-asr) from the Arm community blog.

## Conclusion
ModelScope and FunASR empower developers to build robust Chinese ASR applications. By leveraging the strengths of Arm CPUs and the optimized software ecosystem, developers can create innovative and efficient solutions for various use cases. Explore the capabilities of ModelScope and FunASR, and unlock the potential of Arm technology for your next Chinese ASR project.