---
title: ModelScope - Open-Source Pre-trained AI models hub
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Before you begin

To follow the instructions for this Learning Path, you will need an Arm-based server running Ubuntu 22.04 LTS or later, with at least 8 cores, 16GB of RAM, and 30GB of disk storage.

## Introducing ModelScope
[ModelScope](https://github.com/modelscope/modelscope/) is an open-source platform that makes it easy to use AI models in your applications.
Arm provides optimized software and tools, such as Kleidi, to accelerate AI inference.
You can learn more about [Faster PyTorch Inference using Kleidi on Arm Neoverse](https://community.arm.com/arm-community-blogs/b/servers-and-cloud-computing-blog/posts/faster-pytorch-inference-kleidi-arm-neoverse) on the Arm community website.


## Install ModelScope and PyTorch

First, ensure your system is up-to-date and install the required tools and libraries:

```bash
sudo apt-get update -y
sudo apt-get install -y curl git wget python3 python3-pip python3-venv python-is-python3 ffmpeg
```

Create and activate a virtual environment:
```bash
python -m venv venv
source venv/bin/activate
```

In your active virtual environment, install the `modelscope` package:

```bash
pip3 install modelscope
```

Install PyTorch and related Python dependencies:
```bash
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
pip3 install numpy packaging addict datasets simplejson sortedcontainers transformers ffmpeg

```
{{% notice Note %}}
In this Learning Path you will execute models on the Arm Neoverse CPU, so you only need to install the PyTorch CPU package.
{{% /notice %}}
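
You can optionally verify that both packages import correctly before moving on. The exact version numbers you see will vary:

```bash
python3 -c "import torch, modelscope; print(torch.__version__, modelscope.__version__)"
```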

## Create an example

You can now run an example to learn how to use ModelScope for understanding Chinese semantics.

There is a fundamental difference between Chinese and English writing.
The relationship between Chinese characters and their meanings is somewhat analogous to the difference between words and phrases in English.
Some Chinese characters, like English words, have clear meanings on their own.
However, more often, Chinese characters need to be combined with other characters to express more complete meanings, just like phrases in English.
For example, “祝福” (blessing) can be broken down into “祝” (wish) and “福” (good fortune); “分享” (share) can be broken down into “分” (divide) and “享” (enjoy); “生成” (generate) is composed of “生” (produce) and “成” (become).

For a computer to understand Chinese sentences, it must apply the rules of Chinese characters, vocabulary, and grammar to accurately interpret and express meaning.

In this simple example, you will use a general-domain Chinese [word segmentation model](https://www.modelscope.cn/models/iic/nlp_structbert_word-segmentation_chinese-base) to break down Chinese sentences into individual words, making them easier for computers to analyze and understand.

Using a file editor of your choice, copy the code shown below into a file named `segmentation.py`:

```python
from modelscope.pipelines import pipeline

# Load the general-domain Chinese word segmentation model from ModelScope
word_segmentation = pipeline('word-segmentation', model='damo/nlp_structbert_word-segmentation_chinese-base')

# Sample Chinese sentence; replace it with your own text
text = '今天天气不错，适合出去游玩'
result = word_segmentation(text)
print(result)
```

Run the model inference on the sample text:

```bash
python3 segmentation.py
```

The output should look like this:
```output
2025-01-28 00:30:29,692 - modelscope - WARNING - Model revision not specified, use revision: v1.0.3
Downloading Model to directory: /home/ubuntu/.cache/modelscope/hub/damo/nlp_structbert_word-segmentation_chinese-base
```

## Installing FunASR
Install FunASR using pip:
```bash
pip3 install funasr==1.2.3
```
{{% notice Note %}}
The learning path examples use FunASR version 1.2.3. You may notice minor differences in results with other versions.
{{% /notice %}}

## Speech Recognition
FunASR offers a simple interface for performing speech recognition tasks. You can easily transcribe audio files or implement real-time speech recognition using FunASR's functionalities. In this learning path, you will learn how to leverage FunASR to implement a speech recognition application.

Let's use an English speech sample as an example to run audio transcription on. Copy the code shown below into a file named `funasr_test1.py`:

```python
from funasr import AutoModel

# Load the Paraformer ASR model from the ModelScope ("ms") hub and run it on the CPU
model = AutoModel(model="paraformer", device="cpu", hub="ms")

# Sample English .wav file hosted online (the full URL is truncated here)
res = model.generate(input="https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/Ma...")
print(f"\nResult: \n{res[0]['text']}")
```

Before you run this script, let's look at what the Python code is doing:

The imported `AutoModel()` class provides an interface to load different AI models.

* **model="paraformer":**

Specifies the model you would like to load.
In this example you will load the Paraformer model, which is an end-to-end automatic speech recognition (ASR) model designed for real-time transcription.

* **device="cpu":**

Specifies that the model runs on the CPU. It does not require a GPU.

* **hub="ms":**

Indicates that the model is sourced from the "ms" (ModelScope) hub.

The `model.generate()` function processes an audio file and generates a transcribed text output.
* **input="...":**

The input is an audio file URL, which is a .wav file containing an English audio sample.


The result contains a lot of information. To keep the example simple, you will only print the transcribed text contained in `res[0]['text']`.

In this initial test, the Paraformer model transcribes a two-second English audio clip downloaded from the internet.

Run this Python script on your Arm-based server:

```bash
python funasr_test1.py
```

The output will look like:

```output
funasr version: 1.2.3.
Check update of funasr, and it would cost few times. You may disable it by set `disable_update=True` in AutoModel
You are using the latest version of funasr-1.2.3
Result:
he tried to think how it could be
```

The output shows "he tried to think how it could be" as expected.
The transcribed test shows "he tried to think how it could be". This is the expected result for the audio sample.

Now let's try an example that uses a Chinese speech recognition model. Copy the code shown below into a file named `funasr_test2.py`:

```python
import os
from funasr import AutoModel

# Load the Chinese Paraformer ASR model from ModelScope and run it on the CPU
model = AutoModel(model="paraformer-zh", device="cpu", hub="ms")

# Chinese sample audio bundled with the model (assumed path; replace it with your own .wav file)
wav_file = os.path.join(model.model_path, "example/asr_example.wav")
res = model.generate(input=wav_file)

text_content = res[0]['text'].replace(" ","")
print(f"Result: \n{text_content}")

print(res)
```

You can see that the loaded model has been replaced with a Chinese speech recognition model that has a `-zh` suffix.

FunASR will process each sound in the audio with appropriate character recognition.

You have also modified the output format from the previous example. In addition to recognizing the Chinese characters, you will add timestamps indicating the start and end times of each character. This is used for applications like subtitle generation and sentiment analysis.
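
If you want to work with those timestamps directly, here is a minimal sketch that pairs each recognized token with its start and end time. It assumes the `paraformer-zh` result exposes a `timestamp` list of `[start, end]` millisecond pairs alongside the space-separated `text` field, and the audio file name below is a placeholder:

```python
from funasr import AutoModel

# Minimal sketch: print each recognized token with its start/end time in milliseconds.
# Assumes res[0] contains a 'timestamp' list aligned with the space-separated text.
model = AutoModel(model="paraformer-zh", device="cpu", hub="ms")
res = model.generate(input="example.wav")  # placeholder; point this at your own Chinese .wav file

tokens = res[0]["text"].split()
stamps = res[0].get("timestamp", [])
for token, (start_ms, end_ms) in zip(tokens, stamps):
    print(f"{token}: {start_ms} ms - {end_ms} ms")
```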

Run the `funasr_test2.py` script:

```bash
python3 funasr_test2.py
```

The output should look like:

```output
Downloading Model to directory: /home/ubuntu/.cache/modelscope/hub/iic/speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch
Result:
欢迎大家来到达摩社区进行体验
```

The output shows "欢迎大家来到达摩社区进行体验" as expected.

You can also observe that the spacing between the third and sixth characters is very short. This is because they are combined with other characters, as discussed in the previous section.

You can now build a speech processing pipeline. The output of the speech recognition module serves as the input for the semantic segmentation model, enabling you to validate the accuracy of the recognized results. Copy the code shown below into a file named `funasr_test3.py`:

```python
from funasr import AutoModel
from modelscope.pipelines import pipeline
import os

model = AutoModel(
model="paraformer-zh",
Expand All @@ -164,8 +175,13 @@ seg_result = word_segmentation(text_content)

print(f"Result: \n{seg_result}")
```
Run this Python script:

```bash
python3 funasr_test3.py
```

The output should look like:

```output
Downloading Model to directory: /home/ubuntu/.cache/modelscope/hub/iic/speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch
Result:
{'output': ['欢迎', '大家', '来到', '达摩', '社区', '进行', '体验']}
```

Good, the result is exactly what you are looking for.

## Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition

Let's now look at a more advanced speech recognition model, [Paraformer](https://aclanthology.org/2020.wnut-1.18/).

Paraformer is a novel architecture for automatic speech recognition (ASR) that offers both enhanced speed and accuracy compared to traditional models. Its key innovation lies in its parallel transformer design, enabling simultaneous processing of multiple parts of the input speech. This parallel processing capability leads to significantly faster inference, making Paraformer well-suited for real-time ASR applications where responsiveness is crucial.

Furthermore, Paraformer has demonstrated state-of-the-art accuracy on several benchmark datasets, showcasing its effectiveness in accurately transcribing speech. This combination of speed and accuracy makes Paraformer a promising advancement in the field of ASR, opening up new possibilities for high-performance speech recognition systems.

Paraformer has been fully integrated into FunASR. Copy the sample program shown below into a file named `paraformer.py`.

This example uses the PyTorch-optimized Paraformer model from ModelScope. The program first checks whether the test audio file has already been downloaded.

```python
rec_result = inference_pipeline(input=filename)

print(f"\nResult: \n{rec_result[0]['text']}")
```
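
For reference, here is a minimal sketch of the pattern `paraformer.py` follows: download the test clip once if it is not already present, then pass it to a ModelScope ASR pipeline. The URL, file name, and model ID below are placeholder assumptions for illustration, not the exact values used in the Learning Path script:

```python
import os
import urllib.request

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# Placeholder URL and file name for illustration only
AUDIO_URL = "https://example.com/asr_sample_zh.wav"
filename = "asr_sample_zh.wav"

# Download the test audio only if it is not already present
if not os.path.exists(filename):
    urllib.request.urlretrieve(AUDIO_URL, filename)

# Assumed model ID: a PyTorch Paraformer ASR model hosted on ModelScope
inference_pipeline = pipeline(
    task=Tasks.auto_speech_recognition,
    model="iic/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch",
    model_revision="v2.0.4")

rec_result = inference_pipeline(input=filename)
print(f"\nResult: \n{rec_result[0]['text']}")
```
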
Run the `paraformer.py` script:

```bash
python3 paraformer.py
```

The output should look like:

```output
2025-01-28 00:03:24,373 - modelscope - INFO - Use user-specified model revision: v2.0.4
Result:
飞机穿过云层眼下一片云海有时透过稀薄的云雾依稀可见南国葱绿的群山大地
```

The output shows "飞机穿过云层眼下一片云海有时透过稀薄的云雾依稀可见南国葱绿的群山大地" as expected.

## Punctuation Restoration

In the previous example, each word of the speech was correctly recognized, but the output lacked punctuation. The lack of punctuation makes it harder to understand the speaker's intended meaning.

You can add a [Punctuation Restoration model](https://aclanthology.org/2020.wnut-1.18/) responsible for punctuation as the next step in processing your audio workload.

In addition to using the Paraformer model, you will add two more ModelScope models:
- VAD ([Voice Activity Detection](https://modelscope.cn/models/iic/speech_fsmn_vad_zh-cn-16k-common-pytorch/summary)) and
- PUNC ([Punctuation Restoration](https://modelscope.cn/models/iic/punc_ct-transformer_zh-cn-common-vocab272727-pytorch/files))

You add these models by specifying the `vad_model` and `punc_model` parameters in the pipeline. This way, you can obtain punctuation that matches the semantics of the speech recognition. Copy the updated code shown below into a file named `paraformer-2.py`:

```python
import os
print(f"\nResult: \n{rec_result[0]['text']}")
```
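
The key change is the two extra model arguments on the pipeline. Here is a minimal sketch of what the updated call looks like; the VAD and PUNC model IDs match the model pages linked above, while the ASR model ID and the revision strings are assumptions for illustration:

```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

inference_pipeline = pipeline(
    task=Tasks.auto_speech_recognition,
    # Assumed ASR model ID and revisions for illustration
    model="iic/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch",
    # VAD: split the audio into speech segments before recognition
    vad_model="iic/speech_fsmn_vad_zh-cn-16k-common-pytorch",
    vad_model_revision="v2.0.4",
    # PUNC: restore punctuation in the recognized text
    punc_model="iic/punc_ct-transformer_zh-cn-common-vocab272727-pytorch",
    punc_model_revision="v2.0.4")

rec_result = inference_pipeline(input="asr_sample_zh.wav")  # your downloaded test clip
print(f"\nResult: \n{rec_result[0]['text']}")
```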

{{% notice Note %}}
vad_model_revision & punc_model_revision are not a required parameter. In most cases, it can work smoothly without specifying the version.
vad_model_revision & punc_model_revision are optional parameters. In most cases, your models should work without specifying the version.
{{% /notice %}}

Run the updated Python script:

```bash
python3 paraformer-2.py
```


The entire speech sample is correctly segmented into four parts based on semantics.

```output
rtf_avg: 0.047: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 2.45it/s]
Result:
飞机穿过云层,眼下一片云海,有时透过稀薄的云雾,依稀可见南国葱绿的群山大地。
```

Let's translate the recognized result, and you can easily see that the four segments convey different meanings.

"飞机穿过云层" means: The airplane passed through the clouds.

"眼下一片云海" means: Below is a sea of clouds.

"有时透过稀薄的云雾" means: Sometimes, through the thin mist,

"依稀可见南国葱绿的群山大地" means: the verdant mountains and land of the south are faintly visible.


## Sentiment Analysis
FunASR also supports sentiment analysis of speech, allowing you to determine the emotional tone of the spoken language.

This can be valuable for applications like customer service and social media monitoring.

You can use a mature speech emotion recognition model [emotion2vec+](https://modelscope.cn/models/iic/emotion2vec_plus_large) from ModelScope as an example.

The model will identify which of the following emotions is the closest match for the emotion expressed in the speech:
- Neutral
- Happy
- Sad
- Angry
- Unknown

Copy the code shown below into a file named `sentiment.py`:

```python
from modelscope.pipelines import pipeline
process_audio_file(
'https://utoronto.scholaris.ca/bitstreams/5ce257a3-be71-41a8-8d88-d097ca15af4e/download'
)

```
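
The `process_audio_file()` helper in `sentiment.py` is assumed to wrap a ModelScope emotion-recognition pipeline call along the lines of this minimal sketch; the keyword arguments follow the emotion2vec+ model card and may differ from the exact listing:

```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# Minimal sketch: classify the dominant emotion in a single audio file with emotion2vec+.
inference_pipeline = pipeline(
    task=Tasks.emotion_recognition,
    model="iic/emotion2vec_plus_large")

# Placeholder file name; pass a local path or a URL to a speech sample
rec_result = inference_pipeline("your_audio.wav", granularity="utterance", extract_embedding=False)
print(rec_result)
```
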
Run the `sentiment.py` script:

```bash
python3 sentiment.py
```

Without a model that understands semantics, `emotion2vec+` can still correctly recognize the speaker's emotions through changes in intonation.

The output should look like:

```output
Neutral Chinese Speech
rtf_avg: 1.444: 100%|███████████████████
Result: ['生气/angry (1.00)', '中立/neutral (0.00)', '开心/happy (0.00)', '难过/sad (0.00)', '<unk> (0.00)']
```

## Best Price-Performance for ASR on Arm Neoverse N2
Arm CPUs, with their high performance and low power consumption, provide an ideal platform for running ModelScope's AI models, especially in edge computing scenarios. Arm's comprehensive software ecosystem supports the development and deployment of ModelScope models, enabling developers to create innovative and efficient applications.
You can learn more about [Kleidi Technology Delivers Best Price-Performance for ASR on Arm Neoverse N2](https://community.arm.com/arm-community-blogs/b/servers-and-cloud-computing-blog/posts/neoverse-n2-delivers-leading-price-performance-on-asr) from the Arm community blog.

## Conclusion
ModelScope and FunASR empower developers to build robust Chinese ASR applications. By leveraging the strengths of Arm CPUs and the optimized software ecosystem, developers can create innovative and efficient solutions for various use cases. Explore the capabilities of ModelScope and FunASR, and unlock the potential of Arm technology for your next Chinese ASR project.