# ASR Demo

In [1]:
import IPython.display as dp
from IPython.display import HTML
html_str = '''
<video controls width="600" height="360" src="{}">animation</video>
'''.format("work/source/asr-demo-1.mp4")
dp.display(HTML(html_str))
print("Result:")

Result:


To try it out, click this link: (hub link)[](http://)

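# Introduction

## Background
Automatic speech recognition (ASR) is the task of extracting the spoken text content from an audio clip. The technology is already widely used in work and daily life, for example voice transcription on mobile phones and meeting transcription at work.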

## History
* Early period, when generative models were popular: GMM-HMM (1990s, until 2006)
* Early deep learning boom: DNN, CTC [1] (2006)
* RNNs become popular and attention is first proposed: RNN-T [2] (2013), DeepSpeech [3] (2014), DeepSpeech2 [4] (2016), LAS [5] (2016)
* From "Attention Is All You Need" [6] onward: Transformer [6] (2017), Transformer-Transducer [7] (2020), Conformer [8] (2020)


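Transformer and Conformer are currently the mainstream models in speech recognition. This tutorial therefore focuses on the Transformer, and the homework includes an exercise on the Conformer.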


# Basic pipeline of speech recognition with a Transformer
<div align=center>
<img src="work/source/transformer_asr_pipeline.png" />
</div>

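In the simplest terms, speech recognition takes two steps: first, a feature extraction module computes acoustic features from the audio; second, the speech recognition model uses those acoustic features to produce the recognition result.

The feature extraction module generally uses fbank features, which are covered in a later section.

The Transformer speech recognition model used in this course has two main parts: the Encoder and the Decoder.

The acoustic features first enter the Encoder, which produces a feature encoding. The Decoder then uses that encoding to produce the prediction.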

# Hands-on

## Stage 0 Preparation

### Install paddlespeech

In [2]:
!pip install paddlespeech

Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Processing ./work/wheel_store/sentencepiece-0.1.96-cp37-cp37m-linux_x86_64.whl
Installing collected packages: sentencepiece
  Found existing installation: sentencepiece 0.1.85
    Uninstalling sentencepiece-0.1.85:
      Successfully uninstalled sentencepiece-0.1.85
Successfully installed sentencepiece-0.1.96
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting paddlespeech
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/bc/c3/4829550a06df372d607018c75223b7710fba4ac8d2a82b966c4125d13cce/paddlespeech-0.1.0a1-py3-none-any.whl (723kB)
Collecting textgrid (from paddlespeech)
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/9f/9e/04fb27ec5ac287b203afd5b228bc7c4ec5b7d3d81c4422d57847e755b0cc/TextGrid-1.5-py3-none-any.whl
Collecting praatio~=4.1 (from paddlespeech)
  Downloading https://pypi.tuna.tsinghua.edu.cn/

### Prepare the working directory

In [3]:
%cd ./work
!mkdir -p ./workspace_asr
%cd ./workspace_asr

/home/aistudio/work
/home/aistudio/work/workspace_asr


### Download the pretrained model

In [4]:
!wget -nc https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/transformer.model.tar.gz
!tar xzvf transformer.model.tar.gz

--2021-11-30 11:22:18--  https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/transformer.model.tar.gz
Resolving paddlespeech.bj.bcebos.com (paddlespeech.bj.bcebos.com)... 182.61.200.195, 182.61.200.229, 2409:8c04:1001:1002:0:ff:b001:368a
Connecting to paddlespeech.bj.bcebos.com (paddlespeech.bj.bcebos.com)|182.61.200.195|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 123699838 (118M) [application/octet-stream]
Saving to: ‘transformer.model.tar.gz’


2021-11-30 11:22:20 (47.3 MB/s) - ‘transformer.model.tar.gz’ saved [123699838/123699838]

conf/transformer.yaml
conf/preprocess.yaml
data/mean_std.json
exp/transformer/checkpoints/avg_20.pdparams
data/lang_char/
data/lang_char/vocab.txt


In [5]:
# Copy the audio file used for prediction
%cp ../data/BAC009S0908W0355.wav ./data

### Import Python packages

In [6]:
import paddle
import warnings
warnings.filterwarnings('ignore')
from paddlespeech.s2t.exps.u2.config import get_cfg_defaults
from paddlespeech.s2t.frontend.featurizer.text_featurizer import TextFeaturizer
#from paddlespeech.s2t.io.collator import SpeechCollator
from paddlespeech.s2t.models.u2 import U2Model
from paddlespeech.s2t.utils import layer_tools

#from paddlespeech.s2t.frontend.normalizer import FeatureNormalizer
#from paddlespeech.s2t.frontend.featurizer.audio_featurizer import AudioFeaturizer
#from paddlespeech.s2t.frontend.speech import SpeechSegment

from paddlespeech.s2t.transform.spectrogram import LogMelSpectrogramKaldi
from paddlespeech.s2t.transform.cmvn import GlobalCMVN
import soundfile

### Set the paths for the pretrained model

In [7]:
config_path = "conf/transformer.yaml" 
checkpoint_path = "./exp/transformer/checkpoints/avg_20.pdparams"
decoding_method = "attention"
audio_file = "data/BAC009S0908W0355.wav"

result_file = "exp/result.rsl"

# Load the config file into a structured config object
transformer_config = get_cfg_defaults()
transformer_config.merge_from_file(config_path)
transformer_config.decoding.decoding_method = decoding_method

#transformer_config = CfgNode(new_allowed=True)
#transformer_config.merge_from_file(config_path)
print("========Config========")
print(transformer_config)

collator:
  augmentation_config: conf/preprocess.yaml
  batch_size: 64
  delta_delta: False
  dither: 1.0
  feat_dim: 80
  keep_transcription_text: False
  max_freq: None
  mean_std_filepath: 
  n_fft: None
  num_workers: 2
  random_seed: 0
  raw_wav: True
  shuffle_method: batch_shuffle
  sortagrad: True
  spectrum_type: fbank
  spm_model_prefix: 
  stride_ms: 10.0
  target_dB: -20
  target_sample_rate: 16000
  unit_type: char
  use_dB_normalization: True
  vocab_filepath: data/lang_char/vocab.txt
  window_ms: 25.0
data:
  dev_manifest: data/manifest.dev
  manifest: 
  max_input_len: 20.0
  max_output_input_ratio: 10.0
  max_output_len: 400.0
  min_input_len: 0.5
  min_output_input_ratio: 0.05
  min_output_len: 0.0
  test_manifest: data/manifest.test
  train_manifest: data/manifest.train
decoding:
  alpha: 2.5
  batch_size: 128
  beam_size: 10
  beta: 0.3
  ctc_weight: 0.5
  cutoff_prob: 1.0
  cutoff_top_n: 0
  decoding_chunk_size: -1
  decoding_method: attention
  error_rate_type: ce

## Stage 1 Feature extraction

### Fbank audio features


![Signal processing pipeline](work/source/signal_pipeline.png)
(Taken from https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/tutorial/tts/tts_tutorial.ipynb)

### Build the audio feature extractors

In [8]:
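# Build the log-mel (fbank) feature extractor
logmel_kaldi = LogMelSpectrogramKaldi(
    fs=16000,
    n_mels=80,
    n_shift=160,
    win_length=400,
    dither=True)

# Normalize the features: subtract the global mean and divide by the global standard deviation
cmvn = GlobalCMVN(
    cmvn_path="data/mean_std.json"
)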
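
The normalization step can be illustrated with a minimal NumPy sketch. This is not the `GlobalCMVN` implementation: the function name `global_cmvn`, the `eps` guard, and the toy statistics are assumptions made for illustration, whereas the notebook loads its statistics from `data/mean_std.json`.

```python
import numpy as np

def global_cmvn(features, mean, std, eps=1e-20):
    """Normalize each feature dimension with precomputed global statistics."""
    std = np.maximum(std, eps)  # guard against division by zero
    return (features - mean) / std

# Toy data and statistics (assumed values, not the real fbank statistics).
rng = np.random.default_rng(0)
raw = rng.normal(loc=5.0, scale=3.0, size=(1000, 4))  # (frames, feat_dim)
mean = raw.mean(axis=0)
std = raw.std(axis=0)

normalized = global_cmvn(raw, mean, std)
print(normalized.mean(axis=0).round(6))  # close to 0 in every dimension
print(normalized.std(axis=0).round(6))   # close to 1 in every dimension
```

After normalization every feature dimension has roughly zero mean and unit variance, which usually makes training more stable.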

### Extract the audio features

In [9]:

array, _ = soundfile.read(audio_file, dtype="int16")
array = logmel_kaldi(array, train=False)
array = cmvn(array)
audio_feature = array

print("========Feature========")

audio_len = audio_feature.shape[0]
audio_feature = paddle.to_tensor(audio_feature, dtype='float32')
print (audio_feature)
audio_len = paddle.to_tensor(audio_len)
audio_feature = paddle.unsqueeze(audio_feature, axis=0)


Tensor(shape=[848, 80], dtype=float32, place=CUDAPlace(0), stop_gradient=True,
       [[ 0.51146847,  0.25355875, -1.69958019, ..., -0.66561252,
         -0.69228119, -0.72872376],
        [ 0.06364148, -0.52314031, -0.86850190, ..., -1.15806091,
         -0.92412323, -0.80306697],
        [-0.17580508, -0.36929476, -1.90482414, ..., -0.95011121,
         -0.91865319, -0.78732079],
        ...,
        [ 1.01879728,  0.65635836, -0.79088914, ..., -0.98369741,
         -1.01682007, -0.99973106],
        [-0.49923953, -0.53835851, -0.24169487, ..., -1.13909304,
         -1.07762146, -0.82446492],
        [ 0.13295671,  0.08121970, -0.92406386, ..., -1.22621000,
         -1.33519983, -1.00139403]])


W1130 11:22:27.251442 32585 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 10.1, Runtime API Version: 10.1
W1130 11:22:27.256433 32585 device_context.cc:465] device: 0, cuDNN Version: 7.6.


## Stage 2 Acoustic model

### Structure of the Transformer speech recognition model


<div align=center>
<img src="work/source/transformer.png"/>
</div>

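Figure adapted from https://arxiv.org/pdf/1706.03762.pdf


The Transformer model consists of two main parts: the Transformer encoder and the Transformer decoder.

### Transformer Encoder

The Transformer encoder encodes the raw audio features (here, 80-dimensional fbank): its input is the fbank features and its output is the feature encoding. A Transformer Encoder consists of a subsampling module (subsampling embedding) combined with positional encoding, followed by multiple Transformer Encoder Layers.  

The subsampling module generally consists of two downsampling CNN layers. Each Transformer Encoder Layer is mainly composed of Multi-head attention and a Feed forward layer. The Multi-head attention here is self-attention, meaning that Q (query), K (key), and V (value) all come from the same input. The Feed forward layer is built from two fully connected layers and keeps the output feature dimension equal to the input feature dimension. In addition, the encoder uses residual connections, applied around both the Multi-head attention and the Feed forward layer.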
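
To get a feel for what the subsampling module does to the sequence length, here is a small sketch. It assumes each of the two CNN layers uses kernel size 3 and stride 2 without padding along the time axis. These values are not stated in this tutorial, so treat them as an assumption; the helper names are made up for illustration.

```python
def conv_out_len(length: int, kernel: int = 3, stride: int = 2) -> int:
    """Output length of an unpadded convolution along the time axis."""
    return (length - kernel) // stride + 1

def subsampled_len(num_frames: int, num_layers: int = 2) -> int:
    """Number of frames left after stacking `num_layers` downsampling convolutions."""
    for _ in range(num_layers):
        num_frames = conv_out_len(num_frames)
    return num_frames

# The example utterance in this notebook has 848 fbank frames.
print(subsampled_len(848))  # 211
```

Under these assumptions the encoder layers see roughly one quarter of the original frames, which noticeably reduces the cost of self-attention.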

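#### Multi-Head Attention mechanism
<div align=center>
<img src="work/source/Attention.png" />
</div>
> Figure adapted from https://arxiv.org/pdf/1706.03762.pdf


For self-attention, since Q, K, and V all come from the same input, the computation can be shown more clearly with the following diagram: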

<div align=center>
<img src="work/source/Attention_detail.png" />
</div>

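The computation can be divided into three steps:

Step 1:
Compute the similarity between the Q and K vectors with a dot product, then apply scaling and softmax to obtain a score between each Q and every K.

Step 2:
Multiply the scores between each Q and all K by the corresponding V vectors and sum the results to obtain the attention output vector.

Step 3:
Run steps 1 and 2 in several attention modules (heads) and concatenate their output vectors to obtain the final Multi-Head Attention output.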
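
The three steps can be sketched in NumPy as follows. This is a minimal illustration, not the `MultiHeadedAttention` class from PaddleSpeech: the projection matrices `w_q`, `w_k`, `w_v`, `w_o` follow the standard formulation in "Attention Is All You Need" and are filled with random values here, and dropout and masking are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    """q: (..., Tq, d), k and v: (..., Tk, d)."""
    d = q.shape[-1]
    scores = softmax(q @ np.swapaxes(k, -1, -2) / np.sqrt(d))  # step 1: (..., Tq, Tk)
    return scores @ v, scores                                  # step 2: weighted sum of V

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """x: (T, d_model); every projection matrix: (d_model, d_model)."""
    t, d_model = x.shape
    d_head = d_model // num_heads

    def split_heads(m):
        return m.reshape(t, num_heads, d_head).transpose(1, 0, 2)  # (H, T, d_head)

    # Self-attention: Q, K, and V are all projections of the same input x.
    q, k, v = (split_heads(x @ w) for w in (w_q, w_k, w_v))
    heads, scores = scaled_dot_product_attention(q, k, v)
    merged = heads.transpose(1, 0, 2).reshape(t, d_model)  # step 3: concatenate the heads
    return merged @ w_o, scores

rng = np.random.default_rng(0)
T, D, H = 5, 8, 2
x = rng.normal(size=(T, D))
w_q, w_k, w_v, w_o = (rng.normal(size=(D, D)) for _ in range(4))
out, scores = multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads=H)
print(out.shape, scores.shape)  # (5, 8) (2, 5, 5)
```

Each row of `scores` sums to 1, so every output vector is a weighted average of the V vectors within its head.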

#### Code for building the Transformer Encoder

The Transformer Encoder is mainly a stack of Transformer encoder layers.

```python
# Build the TransformerEncoder
class TransformerEncoder(BaseEncoder):
    def __init__(
            self,
            input_size: int,
            output_size: int=256,
            attention_heads: int=4,
            linear_units: int=2048,
            num_blocks: int=6,
            dropout_rate: float=0.1,
            positional_dropout_rate: float=0.1,
            attention_dropout_rate: float=0.0,
            input_layer: str="conv2d",
            pos_enc_layer_type: str="abs_pos",
            normalize_before: bool=True,
            concat_after: bool=False,
            static_chunk_size: int=0,
            use_dynamic_chunk: bool=False,
            global_cmvn: nn.Layer=None,
            use_dynamic_left_chunk: bool=False, ):
        
        assert check_argument_types()
        super().__init__(input_size, output_size, attention_heads, linear_units,
                         num_blocks, dropout_rate, positional_dropout_rate,
                         attention_dropout_rate, input_layer,
                         pos_enc_layer_type, normalize_before, concat_after,
                         static_chunk_size, use_dynamic_chunk, global_cmvn,
                         use_dynamic_left_chunk)
        self.encoders = nn.LayerList([
            TransformerEncoderLayer(
                size=output_size,
                self_attn=MultiHeadedAttention(attention_heads, output_size,
                                               attention_dropout_rate),
                feed_forward=PositionwiseFeedForward(output_size, linear_units,
                                                     dropout_rate),
                dropout_rate=dropout_rate,
                normalize_before=normalize_before,
                concat_after=concat_after) for _ in range(num_blocks)
        ])
        
    def forward(
            self,
            xs: paddle.Tensor,
            xs_lens: paddle.Tensor,
            decoding_chunk_size: int=0,
            num_decoding_left_chunks: int=-1,
    ) -> Tuple[paddle.Tensor, paddle.Tensor]:
        """Embed positions in tensor.
        Args:
            xs: padded input tensor (B, L, D)
            xs_lens: input lengths (B)
            decoding_chunk_size: chunk size used for dynamic-chunk decoding
                0: for training, use a random dynamic chunk size.
                <0: for decoding, use the full utterance.
                >0: for decoding, use the given decoding_chunk_size.
            num_decoding_left_chunks: number of already decoded left chunks used for decoding
                >=0: use num_decoding_left_chunks chunks
                <0: use all left chunks
        Returns:
            xs: Encoder output tensor
            masks: mask of the Encoder output tensor
        """
        masks = make_non_pad_mask(xs_lens).unsqueeze(1)  # (B, 1, L)

        if self.global_cmvn is not None:
            xs = self.global_cmvn(xs)
        xs, pos_emb, masks = self.embed(xs, masks.astype(xs.dtype), offset=0)
        masks = masks.astype(paddle.bool)
        mask_pad = masks.logical_not()
        chunk_masks = add_optional_chunk_mask(
            xs, masks, self.use_dynamic_chunk, self.use_dynamic_left_chunk,
            decoding_chunk_size, self.static_chunk_size,
            num_decoding_left_chunks)
        for layer in self.encoders:
            xs, chunk_masks, _ = layer(xs, chunk_masks, pos_emb, mask_pad)
        if self.normalize_before:
            xs = self.after_norm(xs)
            
        # We assume the mask is not changed by the encoder layers, so we simply
        # return the mask computed before them; it is used later in the Decoder.
        return xs, masks
```

```python

class TransformerEncoderLayer(nn.Layer):
    """Encoder layer module."""

    def __init__(
            self,
            size: int,
            self_attn: nn.Layer,
            feed_forward: nn.Layer,
            dropout_rate: float,
            normalize_before: bool=True,
            concat_after: bool=False, ):
        """Construct an Encoder layer.
        Args:
            size (int): Input dimension.
            self_attn (nn.Layer): Self-attention module instance,
                either `MultiHeadedAttention` or `RelPositionMultiHeadedAttention`.
            feed_forward (nn.Layer): Feed-forward module instance,
                a `PositionwiseFeedForward` instance.
            dropout_rate (float): Dropout rate.
            normalize_before (bool):
                True: apply layer norm before each sub-block.
                False: apply layer norm after each sub-block.
            concat_after (bool): Whether to concatenate the attention layer's input and output.
                True: x -> x + linear(concat(x, att(x)))
                False: x -> x + att(x)
        """
        super().__init__()
        self.self_attn = self_attn
        self.feed_forward = feed_forward
        self.norm1 = nn.LayerNorm(size, epsilon=1e-12)
        self.norm2 = nn.LayerNorm(size, epsilon=1e-12)
        self.dropout = nn.Dropout(dropout_rate)
        self.size = size
        self.normalize_before = normalize_before
        self.concat_after = concat_after
        # concat_linear is normally unused, but it is still stored when the model is saved
        self.concat_linear = nn.Linear(size + size, size)

    def forward(
            self,
            x: paddle.Tensor,
            mask: paddle.Tensor,
            pos_emb: Optional[paddle.Tensor]=None,
            mask_pad: Optional[paddle.Tensor]=None,
            output_cache: Optional[paddle.Tensor]=None,
            cnn_cache: Optional[paddle.Tensor]=None,
    ) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor]:
        """Compute encoded features.
        Args:
            x (paddle.Tensor): Input tensor (#batch, time, size).
            mask (paddle.Tensor): Mask tensor for the input (#batch, time).
            pos_emb (paddle.Tensor): Positional encoding; only present for interface compatibility with ConformerEncoderLayer.
            mask_pad (paddle.Tensor): Unused; only present for interface compatibility with ConformerEncoderLayer.
            output_cache (paddle.Tensor): Cache of previous outputs (#batch, time2, size), time2 < time in x.
            cnn_cache (paddle.Tensor): Unused; only present for interface compatibility with ConformerEncoderLayer.
        Returns:
            x: paddle.Tensor: Output tensor (#batch, time, size).
            mask: paddle.Tensor: Mask tensor (#batch, time).
            fake_cnn_cache: paddle.Tensor: Placeholder, only present for interface compatibility with Conformer (#batch, channels, time').
        """
        residual = x
        if self.normalize_before:
            x = self.norm1(x)

        if output_cache is None:
            x_q = x
        else:
            assert output_cache.shape[0] == x.shape[0]
            assert output_cache.shape[1] < x.shape[1]
            assert output_cache.shape[2] == self.size
            chunk = x.shape[1] - output_cache.shape[1]
            x_q = x[:, -chunk:, :]
            residual = residual[:, -chunk:, :]
            mask = mask[:, -chunk:, :]

        if self.concat_after:
            x_concat = paddle.concat(
                (x, self.self_attn(x_q, x, x, mask)), axis=-1)
            x = residual + self.concat_linear(x_concat)
        else:
            x = residual + self.dropout(self.self_attn(x_q, x, x, mask))
        if not self.normalize_before:
            x = self.norm1(x)

        residual = x
        if self.normalize_before:
            x = self.norm2(x)
        x = residual + self.dropout(self.feed_forward(x))
        if not self.normalize_before:
            x = self.norm2(x)

        if output_cache is not None:
            x = paddle.concat([output_cache, x], axis=1)

        fake_cnn_cache = paddle.zeros([1], dtype=x.dtype)
        return x, mask, fake_cnn_cache
```


### Transformer Decoder

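The Transformer Decoder produces the final output. Its structure is similar to the Encoder's: it also has a Multi-head attention module and a Feed forward layer. There are two main differences. First, the Decoder decodes autoregressively. Second, the Decoder inserts an additional Multi-head attention layer between the Multi-head attention and Feed forward layer modules, which attends to the feature encoding produced by the Encoder.



#### Autoregressive decoding in the Decoder 
The Decoder is autoregressive: its output at one time step becomes its input at the next time step. During the computation it also uses the Encoder output. With greedy decoding, the process looks like this:

<div align=center>
<img src="work/source/attentiondecode_process_greedy.png"/>
</div>


Greedy decoding is simple, but it can easily miss a result that is better overall. In practice we therefore use beam search; the Decoder's decoding process under beam search looks like this: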


<div align=center>
<img src="work/source/attentiondecode_process_greedy.png"/>
</div>
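
The difference between the two strategies can be seen on a toy example. The sketch below replaces the real decoder with a hypothetical lookup table `TABLE` of next-token probabilities (all values are made up), so it only illustrates the search procedure, not the PaddleSpeech implementation.

```python
import math

EOS = "</s>"

# Hypothetical next-token distributions, keyed by the prefix decoded so far.
TABLE = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"a": 0.4, "b": 0.3, EOS: 0.3},
    ("b",): {"a": 0.05, "b": 0.05, EOS: 0.9},
}

def next_log_probs(prefix):
    """Stand-in for one decoder step: log-probabilities of the next token."""
    dist = TABLE.get(tuple(prefix), {EOS: 1.0})
    return {tok: math.log(p) for tok, p in dist.items()}

def greedy_decode(max_len=5):
    prefix, score = [], 0.0
    for _ in range(max_len):
        log_probs = next_log_probs(prefix)
        tok = max(log_probs, key=log_probs.get)  # keep only the single best token
        prefix.append(tok)
        score += log_probs[tok]
        if tok == EOS:
            break
    return prefix, score

def beam_search(beam_size=2, max_len=5):
    beams = [([], 0.0)]  # (prefix, accumulated log-probability)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix and prefix[-1] == EOS:
                candidates.append((prefix, score))  # finished: carry over unchanged
                continue
            for tok, lp in next_log_probs(prefix).items():
                candidates.append((prefix + [tok], score + lp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(p[-1] == EOS for p, _ in beams):
            break
    return beams[0]

greedy_hyp, greedy_score = greedy_decode()
beam_hyp, beam_score = beam_search()
print(greedy_hyp, round(math.exp(greedy_score), 2))  # ['a', 'a', '</s>'] 0.24
print(beam_hyp, round(math.exp(beam_score), 2))      # ['b', '</s>'] 0.36
```

Greedy decoding commits to `a` because it is the best first token, while beam search keeps `b` alive and ends up with a sequence of higher overall probability.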


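#### The Decoder attends over K and V from the Encoder
<div align=center>
<img src="work/source/src_attention.png"  />
</div>

At every decoding step, the Decoder applies Multi-head attention to the feature encoding output by the Encoder.

The Decoder uses its encoding of the autoregressive results so far as Q in the attention, and the Encoder's feature encoding as K and V. This is how it uses the audio information extracted by the Encoder.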
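
A shape-level sketch of this encoder-decoder attention, using a single head and no learned projections for brevity (both simplifications are assumptions of this example; the sizes 211 and 256 are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_out):
    """decoder_states (T_dec, d) act as Q; encoder_out (T_enc, d) acts as K and V."""
    d = decoder_states.shape[-1]
    weights = softmax(decoder_states @ encoder_out.T / np.sqrt(d))  # (T_dec, T_enc)
    return weights @ encoder_out, weights

rng = np.random.default_rng(0)
encoder_out = rng.normal(size=(211, 256))   # encoded audio frames
decoder_states = rng.normal(size=(3, 256))  # encoding of the 3 tokens decoded so far
context, weights = cross_attention(decoder_states, encoder_out)
print(context.shape, weights.shape)  # (3, 256) (3, 211)
```

Each decoded token receives one context vector that is a weighted average over all encoder frames.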



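#### (Detail) Masked Multi-head Attention
Attentive readers may have noticed that one of the Decoder's multi-head attention modules is preceded by a mask. The mask is needed because during training the Decoder's input is the complete sentence, rather than the sentence prefix fed in step by step as at prediction time. To simulate prediction, the sentence has to be masked during training. For example, at T=1 every input character except the first is masked, and at T=2 every character except the first two is masked.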
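
Such a mask is simply a lower-triangular matrix. The NumPy sketch below is an illustration written for this tutorial (the helper `masked_softmax` is hypothetical); it shows that masked positions receive zero attention weight.

```python
import numpy as np

def subsequent_mask(size):
    """Lower-triangular boolean mask: position i may attend to positions 0..i."""
    return np.tril(np.ones((size, size), dtype=bool))

def masked_softmax(scores, mask):
    scores = np.where(mask, scores, -np.inf)  # masked positions get zero weight
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

mask = subsequent_mask(4)
print(mask.astype(int))
# With equal raw scores, each row spreads its weight over the visible prefix only.
weights = masked_softmax(np.zeros((4, 4)), mask)
print(weights.round(2))
```

Row 0 can only see the first character, row 1 the first two, and so on, which matches the T=1 and T=2 cases described above.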




#### Code for building the Transformer Decoder
```python
class TransformerDecoder(BatchScorerInterface, nn.Layer):
    """Base class of Transformer decoder module.
    Args:
        vocab_size: output dimension
        encoder_output_size: dimension of the attention output
        attention_heads: number of heads in multi-head attention
        linear_units: hidden dimension of the position-wise feed-forward layer
        num_blocks: number of decoder blocks
        dropout_rate: dropout rate
        self_attention_dropout_rate: dropout rate of the attention
        input_layer: type of the input layer, e.g. 'embed'
        use_output_layer: whether to use the output layer
        pos_enc_class: PositionalEncoding class
        normalize_before:
             True: apply layer norm before each sub-block.
             False: apply layer norm after each sub-block.
        concat_after (bool): whether to concatenate the attention layer's input and output
                True: x -> x + linear(concat(x, att(x)))
                False: x -> x + att(x)
    """

    def __init__(
            self,
            vocab_size: int,
            encoder_output_size: int,
            attention_heads: int=4,
            linear_units: int=2048,
            num_blocks: int=6,
            dropout_rate: float=0.1,
            positional_dropout_rate: float=0.1,
            self_attention_dropout_rate: float=0.0,
            src_attention_dropout_rate: float=0.0,
            input_layer: str="embed",
            use_output_layer: bool=True,
            normalize_before: bool=True,
            concat_after: bool=False, ):

        assert check_argument_types()
        nn.Layer.__init__(self)
        self.selfattention_layer_type = 'selfattn'
        attention_dim = encoder_output_size

        if input_layer == "embed":
            self.embed = nn.Sequential(
                nn.Embedding(vocab_size, attention_dim),
                PositionalEncoding(attention_dim, positional_dropout_rate), )
        else:
            raise ValueError(f"only 'embed' is supported: {input_layer}")

        self.normalize_before = normalize_before
        self.after_norm = nn.LayerNorm(attention_dim, epsilon=1e-12)
        self.use_output_layer = use_output_layer
        self.output_layer = nn.Linear(attention_dim, vocab_size)

        self.decoders = nn.LayerList([
            DecoderLayer(
                size=attention_dim,
                self_attn=MultiHeadedAttention(attention_heads, attention_dim,
                                               self_attention_dropout_rate),
                src_attn=MultiHeadedAttention(attention_heads, attention_dim,
                                              src_attention_dropout_rate),
                feed_forward=PositionwiseFeedForward(
                    attention_dim, linear_units, dropout_rate),
                dropout_rate=dropout_rate,
                normalize_before=normalize_before,
                concat_after=concat_after, ) for _ in range(num_blocks)
        ])
```

```python
class DecoderLayer(nn.Layer):
    """Single decoder layer module.
    Args:
        size (int): Input dimension.
        self_attn (nn.Layer): Self-attention module instance.
            A `MultiHeadedAttention` instance can be used.
        src_attn (nn.Layer): Source (encoder-decoder) attention module instance.
            A `MultiHeadedAttention` instance can be used.
        feed_forward (nn.Layer): Feed-forward layer instance.
            A `PositionwiseFeedForward` instance can be used.
        dropout_rate (float): Dropout rate.
        normalize_before:
             True: apply layer norm before each sub-block.
             False: apply layer norm after each sub-block.
        concat_after (bool): Whether to concatenate the attention layer's input and output.
                True: x -> x + linear(concat(x, att(x)))
                False: x -> x + att(x)
    """

    def __init__(
            self,
            size: int,
            self_attn: nn.Layer,
            src_attn: nn.Layer,
            feed_forward: nn.Layer,
            dropout_rate: float,
            normalize_before: bool=True,
            concat_after: bool=False, ):
        """Construct a DecoderLayer object."""
        super().__init__()
        self.size = size
        self.self_attn = self_attn
        self.src_attn = src_attn
        self.feed_forward = feed_forward
        self.norm1 = nn.LayerNorm(size, epsilon=1e-12)
        self.norm2 = nn.LayerNorm(size, epsilon=1e-12)
        self.norm3 = nn.LayerNorm(size, epsilon=1e-12)
        self.dropout = nn.Dropout(dropout_rate)
        self.normalize_before = normalize_before
        self.concat_after = concat_after
        self.concat_linear1 = nn.Linear(size + size, size)
        self.concat_linear2 = nn.Linear(size + size, size)

    def forward(
            self,
            tgt: paddle.Tensor,
            tgt_mask: paddle.Tensor,
            memory: paddle.Tensor,
            memory_mask: paddle.Tensor,
            cache: Optional[paddle.Tensor]=None
    ) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor, paddle.Tensor]:
        """Compute decoded features.
        Args:
            tgt (paddle.Tensor): Input tensor (#batch, maxlen_out, size).
            tgt_mask (paddle.Tensor): Mask for the input tensor
                (#batch, maxlen_out).
            memory (paddle.Tensor): Encoder output
                (#batch, maxlen_in, size).
            memory_mask (paddle.Tensor): Mask for the Encoder output
                (#batch, maxlen_in).
            cache (paddle.Tensor): Cached tensors.
                (#batch, maxlen_out - 1, size).
        Returns:
            x: paddle.Tensor: Output tensor (#batch, maxlen_out, size).
            tgt_mask: paddle.Tensor: Mask for the output tensor (#batch, maxlen_out).
            memory: paddle.Tensor: Encoder output, returned unchanged (#batch, maxlen_in, size).
            memory_mask: paddle.Tensor: Mask for the Encoder output (#batch, maxlen_in).
        """
        residual = tgt
        if self.normalize_before:
            tgt = self.norm1(tgt)

        if cache is None:
            tgt_q = tgt
            tgt_q_mask = tgt_mask
        else:
            # Use only the last frame as Q
            assert cache.shape == [
                tgt.shape[0],
                tgt.shape[1] - 1,
                self.size,
            ], f"{cache.shape} == {[tgt.shape[0], tgt.shape[1] - 1, self.size]}"
            tgt_q = tgt[:, -1:, :]
            residual = residual[:, -1:, :]
            tgt_q_mask = tgt_mask.cast(paddle.int64)[:, -1:, :].cast(
                paddle.bool)

        if self.concat_after:
            tgt_concat = paddle.cat(
                (tgt_q, self.self_attn(tgt_q, tgt, tgt, tgt_q_mask)), dim=-1)
            x = residual + self.concat_linear1(tgt_concat)
        else:
            x = residual + self.dropout(
                self.self_attn(tgt_q, tgt, tgt, tgt_q_mask))
        if not self.normalize_before:
            x = self.norm1(x)

        residual = x
        if self.normalize_before:
            x = self.norm2(x)
        if self.concat_after:
            x_concat = paddle.cat(
                (x, self.src_attn(x, memory, memory, memory_mask)), dim=-1)
            x = residual + self.concat_linear2(x_concat)
        else:
            x = residual + self.dropout(
                self.src_attn(x, memory, memory, memory_mask))
        if not self.normalize_before:
            x = self.norm2(x)

        residual = x
        if self.normalize_before:
            x = self.norm3(x)
        x = residual + self.dropout(self.feed_forward(x))
        if not self.normalize_before:
            x = self.norm3(x)

        if cache is not None:
            x = paddle.cat([cache, x], dim=1)

        return x, tgt_mask, memory, memory_mask
```

#### Code for the decoding process

```python
    def recognize(
            self,
            speech: paddle.Tensor,
            speech_lengths: paddle.Tensor,
            beam_size: int=10,
            decoding_chunk_size: int=-1,
            num_decoding_left_chunks: int=-1,
            simulate_streaming: bool=False, ) -> paddle.Tensor:
        """ Apply beam search on attention decoder
        Args:
            speech (paddle.Tensor): (batch, max_len, feat_dim)
            speech_lengths (paddle.Tensor): (batch, )
            beam_size (int): beam size, the k in top-k
            decoding_chunk_size (int): chunk size used for decoding
                <0: for decoding, use the full utterance
                >0: for decoding, use the given decoding_chunk_size.
                0: for training, not allowed here
            simulate_streaming (bool): whether to compute the Encoder output in a streaming fashion
        Returns:
            best_hyps: paddle.Tensor: decoding results (batch, max_result_len)
        """
        assert speech.shape[0] == speech_lengths.shape[0]
        assert decoding_chunk_size != 0
        device = speech.place
        batch_size = speech.shape[0]

        # Let's assume B = batch_size and N = beam_size
        
        # 1. Compute the Encoder output
        encoder_out, encoder_mask = self._forward_encoder(
            speech, speech_lengths, decoding_chunk_size,
            num_decoding_left_chunks,
            simulate_streaming)  # (B, maxlen, encoder_dim)
        maxlen = encoder_out.shape[1]
        encoder_dim = encoder_out.shape[2]
        running_size = batch_size * beam_size
        encoder_out = encoder_out.unsqueeze(1).repeat(1, beam_size, 1, 1).view(
            running_size, maxlen, encoder_dim)  # (B*N, maxlen, encoder_dim)
        encoder_mask = encoder_mask.unsqueeze(1).repeat(
            1, beam_size, 1, 1).view(running_size, 1,
                                     maxlen)  # (B*N, 1, max_len)

        hyps = paddle.ones(
            [running_size, 1], dtype=paddle.long).fill_(self.sos)  # (B*N, 1)
        
        # log scale score
        scores = paddle.to_tensor(
            [0.0] + [-float('inf')] * (beam_size - 1), dtype=paddle.float)
        scores = scores.to(device).repeat(batch_size).unsqueeze(1).to(
            device)  # (B*N, 1)
        end_flag = paddle.zeros_like(scores, dtype=paddle.bool)  # (B*N, 1)
        cache: Optional[List[paddle.Tensor]] = None
        
        # 2. Decode step by step with the Decoder
        for i in range(1, maxlen + 1):
            # Stop if all batch and all beam produce eos
            if end_flag.cast(paddle.int64).sum() == running_size:
                break

            # 2.1 Run one forward step of the Decoder
            hyps_mask = subsequent_mask(i).unsqueeze(0).repeat(
                running_size, 1, 1).to(device)  # (B*N, i, i)
            # logp: (B*N, vocab)
            logp, cache = self.decoder.forward_one_step(
                encoder_out, encoder_mask, hyps, hyps_mask, cache)

            # 2.2 First beam prune: select the top-k candidate tokens at the current step
            top_k_logp, top_k_index = logp.topk(beam_size)  # (B*N, N)
            top_k_logp = mask_finished_scores(top_k_logp, end_flag)
            top_k_index = mask_finished_preds(top_k_index, end_flag, self.eos)

            # 2.3 Second beam prune: combine the previous candidate hypotheses with the current top-k tokens, then keep the top-k hypotheses
            scores = scores + top_k_logp  # (B*N, N), broadcast add
            scores = scores.view(batch_size, beam_size * beam_size)  # (B, N*N)
            scores, offset_k_index = scores.topk(k=beam_size)  # (B, N)
            scores = scores.view(-1, 1)  # (B*N, 1)

            # 2.4 Compute the indices of the top-k candidate hypotheses:
            # treat top_k_index as shape (B*N*N) and offset_k_index as shape (B*N),
            # then look up offset_k_index in top_k_index
            base_k_index = paddle.arange(batch_size).view(-1, 1).repeat(
                1, beam_size)  # (B, N)
            base_k_index = base_k_index * beam_size * beam_size
            best_k_index = base_k_index.view(-1) + offset_k_index.view(
                -1)  # (B*N)

            # 2.5 Update the hypotheses
            best_k_pred = paddle.index_select(
                top_k_index.view(-1), index=best_k_index, axis=0)  # (B*N)
            best_hyps_index = best_k_index // beam_size
            last_best_k_hyps = paddle.index_select(
                hyps, index=best_hyps_index, axis=0)  # (B*N, i)
            hyps = paddle.cat(
                (last_best_k_hyps, best_k_pred.view(-1, 1)),
                dim=1)  # (B*N, i+1)

            # 2.6 Update the end flags
            end_flag = paddle.eq(hyps[:, -1], self.eos).view(-1, 1)

        # 3. Select the best hypothesis among the top-k (beam-size) results
        scores = scores.view(batch_size, beam_size)
        best_index = paddle.argmax(scores, axis=-1).long()  # (B)
        best_hyps_index = best_index + paddle.arange(
            batch_size, dtype=paddle.long) * beam_size
        best_hyps = paddle.index_select(hyps, index=best_hyps_index, axis=0)
        best_hyps = best_hyps[:, 1:]
        return best_hyps
```
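
The index arithmetic in steps 2.3 to 2.5 is easier to follow on concrete numbers. The NumPy sketch below mirrors those steps for one utterance and a beam of two; the probabilities and token ids are made up, and `np.argsort` stands in for the `topk` call used above.

```python
import numpy as np

B, N = 1, 2  # batch size and beam size
scores = np.log(np.array([[0.6], [0.4]]))      # (B*N, 1) accumulated log-probs
top_k_logp = np.log(np.array([[0.4, 0.3],      # (B*N, N) best N tokens per hypothesis
                              [0.9, 0.05]]))
top_k_index = np.array([[7, 8], [9, 7]])       # (B*N, N) their token ids (made up)

# 2.3 Second prune: combine, flatten per utterance, keep the N best of N*N candidates.
combined = (scores + top_k_logp).reshape(B, N * N)           # (B, N*N)
offset_k_index = np.argsort(-combined, axis=-1)[:, :N]       # (B, N)
new_scores = np.take_along_axis(combined, offset_k_index, axis=-1)

# 2.4 Turn per-utterance offsets into indices into the flattened (B*N*N) candidates.
base_k_index = np.arange(B).reshape(-1, 1) * N * N           # (B, 1)
best_k_index = (base_k_index + offset_k_index).reshape(-1)   # (B*N,)

# 2.5 Integer division recovers the hypothesis to extend; the lookup gives the new token.
best_hyps_index = best_k_index // N
best_k_pred = top_k_index.reshape(-1)[best_k_index]

print(best_hyps_index.tolist(), best_k_pred.tolist())  # [1, 0] [9, 7]
print(np.exp(new_scores).round(2).tolist())            # [[0.36, 0.24]]
```

The best new hypothesis extends old hypothesis 1 with token 9 (probability 0.4 × 0.9 = 0.36), even though hypothesis 0 had the higher score before this step.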

### Build the Transformer model

In [10]:
model_conf = transformer_config.model
# input_dim is the feature dimension
model_conf.input_dim = 80
# output_dim is the vocabulary size
model_conf.output_dim = 4233 
print ("model_conf", model_conf)
model = U2Model.from_config(model_conf)

model_conf cmvn_file: None
cmvn_file_type: json
decoder: transformer
decoder_conf:
  attention_heads: 4
  dropout_rate: 0.1
  linear_units: 2048
  num_blocks: 6
  positional_dropout_rate: 0.1
  self_attention_dropout_rate: 0.0
  src_attention_dropout_rate: 0.0
encoder: transformer
encoder_conf:
  attention_dropout_rate: 0.0
  attention_heads: 4
  dropout_rate: 0.1
  input_layer: conv2d
  linear_units: 2048
  normalize_before: True
  num_blocks: 12
  output_size: 256
  positional_dropout_rate: 0.1
input_dim: 80
model_conf:
  ctc_dropoutrate: 0.0
  ctc_grad_norm_type: None
  ctc_weight: 0.3
  length_normalized_loss: False
  lsm_weight: 0.1
output_dim: 4233
2021-11-30 11:22:29.848 | INFO     | paddlespeech.s2t.models.u2.u2:_init_from_config:880 - U2 Encoder type: transformer
2021-11-30 11:22:30.089 | INFO     | paddlespeech.s2t.modules.loss:__init__:41 - CTCLoss Loss reduction: sum, div-bs: True
2021-11-30 11:22:30.090 | INFO     | paddlespeech.s2t.modules.loss:__init__:42 - CTCLoss Grad 

### Load the pretrained model

In [11]:
model_dict = paddle.load(checkpoint_path)
model.set_state_dict(model_dict)

### Run prediction

In [12]:
decoding_config = transformer_config.decoding
# text_feature = collate_fn_test.text_feature
text_feature = TextFeaturizer(unit_type='char',
                            vocab_filepath=transformer_config.collator.vocab_filepath)

result_transcripts = model.decode(
            audio_feature,
            audio_len,
            text_feature=text_feature,
            decoding_method=decoding_config.decoding_method,
            lang_model_path=decoding_config.lang_model_path,
            beam_alpha=decoding_config.alpha,
            beam_beta=decoding_config.beta,
            beam_size=decoding_config.beam_size,
            cutoff_prob=decoding_config.cutoff_prob,
            cutoff_top_n=decoding_config.cutoff_top_n,
            num_processes=decoding_config.num_proc_bsearch,
            ctc_weight=decoding_config.ctc_weight,
            decoding_chunk_size=decoding_config.decoding_chunk_size,
            num_decoding_left_chunks=decoding_config.num_decoding_left_chunks,
            simulate_streaming=decoding_config.simulate_streaming)
print("Prediction result and corresponding token ids:")
print(result_transcripts)
print("Prediction result:")
print(result_transcripts[0][0])

2021-11-30 11:22:30.600 | INFO     | paddlespeech.s2t.frontend.featurizer.text_featurizer:_load_vocabulary_from_file:228 - BLANK id: 0
2021-11-30 11:22:30.601 | INFO     | paddlespeech.s2t.frontend.featurizer.text_featurizer:_load_vocabulary_from_file:229 - UNK id: 1
2021-11-30 11:22:30.602 | INFO     | paddlespeech.s2t.frontend.featurizer.text_featurizer:_load_vocabulary_from_file:230 - EOS id: 4232
2021-11-30 11:22:30.602 | INFO     | paddlespeech.s2t.frontend.featurizer.text_featurizer:_load_vocabulary_from_file:231 - SOS id: 4232
2021-11-30 11:22:30.602 | INFO     | paddlespeech.s2t.frontend.featurizer.text_featurizer:_load_vocabulary_from_file:232 - SPACE id: -1
2021-11-30 11:22:30.602 | INFO     | paddlespeech.s2t.frontend.featurizer.text_featurizer:_load_vocabulary_from_file:233 - MASKCTC id: -1
Prediction result and corresponding token ids:
(['夺得国际田联竞走世界杯男子二十公里竞走银牌'], [[826, 1237, 712, 3961, 2482, 3014, 2778, 3590, 16, 2493, 1781, 2487, 936, 70, 426, 274, 3827, 2778, 3590, 3875, 2341, 4232]])
Prediction result:
夺得国际田
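
The printed result is a tuple: a list of transcripts followed by a list of token id sequences. As a minimal sketch built only from the values printed above, the snippet below unpacks that tuple and strips the trailing EOS id (4232 according to the log). The helper `strip_eos` and the constant `EOS_ID` are illustrative names, not part of the PaddleSpeech API.

```python
# Values copied from the printed output above.
result_transcripts = (
    ['夺得国际田联竞走世界杯男子二十公里竞走银牌'],
    [[826, 1237, 712, 3961, 2482, 3014, 2778, 3590, 16, 2493, 1781, 2487,
      936, 70, 426, 274, 3827, 2778, 3590, 3875, 2341, 4232]],
)

EOS_ID = 4232  # taken from the "EOS id" log line; hypothetical constant name


def strip_eos(token_ids, eos_id=EOS_ID):
    """Drop every occurrence of the end-of-sentence id."""
    return [token_id for token_id in token_ids if token_id != eos_id]


texts, token_id_lists = result_transcripts
text = texts[0]
ids_without_eos = strip_eos(token_id_lists[0])

# With unit_type='char', each character corresponds to one token id.
print(len(text), len(ids_without_eos))  # 21 21
```

Note that the repeated word 竞走 maps to the same id pair (2778, 3590) both times it occurs, which is what a character-level vocabulary would produce.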

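# Homework
1. Install [PaddleSpeech](https://github.com/PaddlePaddle/PaddleSpeech) in development mode.  
Requirements: docker, Ubuntu 16.04, root user.  
Command: `pip install -e .`

2. Run the conformer model in example/aishell/s1 end to end, completing both training and prediction.

3. Train an ASR model on your own dataset, following the format of the examples.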

# Follow PaddleSpeech
https://github.com/PaddlePaddle/PaddleSpeech/  
Your support is our greatest motivation.

# References

[1] Graves A, Fernández S, Gomez F, et al. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks[C]//Proceedings of the 23rd international conference on Machine learning. 2006: 369-376.

[2] Graves A, Mohamed A, Hinton G. Speech recognition with deep recurrent neural networks[C]//2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2013: 6645-6649.

[3] Hannun A, Case C, Casper J, et al. Deep speech: Scaling up end-to-end speech recognition[J]. arXiv preprint arXiv:1412.5567, 2014.

[4] Amodei D, Ananthanarayanan S, Anubhai R, et al. Deep speech 2: End-to-end speech recognition in English and Mandarin[C]//International conference on machine learning. PMLR, 2016: 173-182.

[5] Chan W, Jaitly N, Le Q, et al. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition[C]//2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016: 4960-4964.

[6] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[C]//Advances in neural information processing systems. 2017: 5998-6008.

[7] Zhang Q, Lu H, Sak H, et al. Transformer transducer: A streamable speech recognition model with transformer encoders and RNN-T loss[C]//ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020: 7829-7833.

[8] Gulati A, Qin J, Chiu C C, et al. Conformer: Convolution-augmented transformer for speech recognition[J]. arXiv preprint arXiv:2005.08100, 2020.

In [13]:
## Return to the original directory
%cd ../../

/home/aistudio


# Speech recognition with DeepSpeech2

## The DeepSpeech2 speech recognition pipeline

<div align=center>
<img src="work/source/deepspeech2_pipeline.png"  />
</div>

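The figure above shows the DeepSpeech2 speech recognition pipeline. The acoustic feature extraction module typically uses linear features, that is, the spectrum obtained by transforming the audio from the time domain to the frequency domain, without applying a mel filter bank.

The DeepSpeech2 model itself has two main parts: an Encoder and a CTC Decoder.

The acoustic features first enter the Encoder, which produces a feature encoding. The CTC Decoder then uses that encoding to produce the prediction.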

## DeepSpeech2 model architecture

<div align=center>
<img src="work/source/deepspeech2_architecture.png"  />
</div>

### Encoder
The Encoder consists mainly of two subsampling CNN layers (subsampling convolution layers) followed by a stack of RNN (Recurrent Neural Network) layers.

![subsampling CNN](work/source/subsampling_cnn.png)

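The structure of the subsampling CNN is shown above. Its main purpose is to enlarge the receptive field of each step and to reduce the number of frames fed to the rest of the model.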
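
To make the frame reduction concrete, here is a rough sketch of the length arithmetic for two stacked convolutions along the time axis. The kernel size of 3, stride of 2 and absence of padding are assumptions chosen for illustration; the actual layer hyperparameters of the DeepSpeech2 model in this tutorial are not stated here, and `conv_output_length` and `subsampled_length` are hypothetical helper names.

```python
def conv_output_length(input_length, kernel_size=3, stride=2):
    """Output length of an unpadded convolution along the time axis."""
    return (input_length - kernel_size) // stride + 1


def subsampled_length(num_frames, num_layers=2):
    """Apply the same convolution length formula once per subsampling layer."""
    for _ in range(num_layers):
        num_frames = conv_output_length(num_frames)
    return num_frames


# 849 input frames shrink to roughly a quarter under these assumptions.
print(subsampled_length(849))  # 211
```

Under these assumptions each output step summarizes several input frames, so the RNN layers that follow process about four times fewer steps.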

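The stacked RNN layers capture the context of the speech, which yields more accurate information and provides a degree of semantic disambiguation. In the DeepSpeech2 model, each RNN cell is either a GRU or an LSTM.

Finally, the softmax layer maps each feature vector to a vector whose length equals the vocabulary size. This vector stores, for the current step, the probability of each character in the vocabulary.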
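
As a small illustration of that last step, the sketch below turns one step's raw scores into a probability distribution over a toy vocabulary. The three-entry vocabulary and the score values are made up for the example; the real model uses the vocabulary in `data/lang_char/vocab.txt`.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


vocab = ["<blank>", "你", "好"]   # toy vocabulary, for illustration only
logits = [0.5, 2.0, 1.0]          # made-up scores for a single step

probs = softmax(logits)
best_id = max(range(len(vocab)), key=probs.__getitem__)

# One probability per vocabulary entry; together they sum to 1.
print(len(probs), vocab[best_id])
```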

### Decoder
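The Decoder converts the probabilities output by the Encoder into the final text. Because DeepSpeech2 is trained with the CTC loss, the model uses a CTC Decoder.
A CTC Decoder comes in two main forms. The first, the CTC greedy search decoder, decodes greedily. The second, the CTC prefix beam search decoder, decodes with beam search.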

#### CTC Greedy Search
<div align=center>
<img src="work/source/CTC_greedy_search.png"  />
</div>

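CTC greedy search keeps a single candidate sequence. At each time step it appends the most probable character to that candidate. The final result is generated directly from this sequence by merging consecutive repeated characters and removing blanks.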
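
A minimal sketch of greedy CTC decoding in plain Python is shown below. It takes the arg-max at every step, then collapses consecutive repeats and drops blanks. The toy vocabulary and probabilities are invented for the example, and `ctc_greedy_search` is an illustrative function, not the PaddleSpeech implementation.

```python
def ctc_greedy_search(probs, vocab, blank_id=0):
    """Pick the best token per step, then collapse repeats and remove blanks."""
    best_path = [max(range(len(step)), key=step.__getitem__) for step in probs]

    tokens = []
    previous = None
    for token_id in best_path:
        # Keep a token only if it differs from the previous step and is not blank.
        if token_id != previous and token_id != blank_id:
            tokens.append(token_id)
        previous = token_id
    return "".join(vocab[token_id] for token_id in tokens)


vocab = ["<blank>", "你", "好"]  # toy vocabulary; index 0 is the CTC blank
probs = [
    [0.10, 0.80, 0.10],  # 你
    [0.20, 0.70, 0.10],  # 你 again, merged with the previous step
    [0.90, 0.05, 0.05],  # blank
    [0.10, 0.10, 0.80],  # 好
    [0.10, 0.10, 0.80],  # 好 again, merged with the previous step
]

print(ctc_greedy_search(probs, vocab))  # 你好
```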

#### CTC Beam Search
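CTC beam search keeps `beam size` candidate sequences and, at each time step, generates the new best `beam size` candidates. At the end, the candidate with the highest probability among them is used to generate the final result.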
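
The following is a simplified sketch of CTC prefix beam search, without a language model and using plain probabilities instead of log probabilities (fine for a toy input, but real implementations work in the log domain). For every prefix it tracks separately the probability of paths ending in blank and paths ending in a non-blank token, so that different alignments of the same text are summed. The function name and the two-step toy input are assumptions made for illustration.

```python
from collections import defaultdict


def ctc_prefix_beam_search(probs, beam_size, blank_id=0):
    """Return the most probable token sequence and its total probability."""
    # prefix -> (probability of paths ending in blank, ending in non-blank)
    beams = {(): (1.0, 0.0)}
    for step in probs:
        candidates = defaultdict(lambda: [0.0, 0.0])
        for prefix, (p_blank, p_non_blank) in beams.items():
            for token_id, p in enumerate(step):
                if token_id == blank_id:
                    # A blank leaves the prefix unchanged.
                    candidates[prefix][0] += (p_blank + p_non_blank) * p
                elif prefix and prefix[-1] == token_id:
                    # A repeat right after the same token collapses into the prefix.
                    candidates[prefix][1] += p_non_blank * p
                    # It extends the prefix only if a blank separated the two.
                    candidates[prefix + (token_id,)][1] += p_blank * p
                else:
                    candidates[prefix + (token_id,)][1] += (p_blank + p_non_blank) * p
        # Keep only the beam_size most probable prefixes.
        ranked = sorted(candidates.items(), key=lambda item: sum(item[1]), reverse=True)
        beams = {prefix: tuple(scores) for prefix, scores in ranked[:beam_size]}
    best_prefix, scores = max(beams.items(), key=lambda item: sum(item[1]))
    return list(best_prefix), sum(scores)


# Two steps, vocabulary = [blank, token 1]. Blank is the arg-max at both steps.
probs = [[0.6, 0.4], [0.6, 0.4]]
tokens, prob = ctc_prefix_beam_search(probs, beam_size=2)
print(tokens, round(prob, 2))  # [1] 0.64
```

In this toy input, picking the arg-max at every step yields two blanks, i.e. an empty transcript with probability 0.36. Beam search instead returns token 1, because the three alignments that collapse to it (token-blank, blank-token, token-token) sum to 0.64.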

# Hands-on

## Stage 0: Preparation

### Install paddlespeech

In [14]:
!pip install paddlespeech

Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple


### Prepare the working directory

In [15]:
%cd ./work
!mkdir -p ./workspace_asr_ds2
%cd ./workspace_asr_ds2

/home/aistudio/work
/home/aistudio/work/workspace_asr_ds2


### Get the pretrained model

In [16]:
!wget -nc https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/ds2.model.tar.gz
!tar xzvf ds2.model.tar.gz

--2021-11-30 11:22:36--  https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/ds2.model.tar.gz
Resolving paddlespeech.bj.bcebos.com (paddlespeech.bj.bcebos.com)... 182.61.200.229, 182.61.200.195, 2409:8c04:1001:1002:0:ff:b001:368a
Connecting to paddlespeech.bj.bcebos.com (paddlespeech.bj.bcebos.com)|182.61.200.229|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 297856913 (284M) [application/octet-stream]
Saving to: ‘ds2.model.tar.gz’


2021-11-30 11:22:40 (66.3 MB/s) - ‘ds2.model.tar.gz’ saved [297856913/297856913]

conf/deepspeech2.yaml
data/mean_std.json
exp/bw/checkpoints/avg_1.pdparams
data/lang_char/
data/lang_char/vocab.txt


In [17]:
!touch conf/augmentation.json
# Download the language model
!mkdir -p data/lm
!wget -nc https://deepspeech.bj.bcebos.com/zh_lm/zh_giga.no_cna_cmn.prune01244.klm -P data/lm
# Get the audio file used for prediction
%cp ../data/BAC009S0908W0355.wav ./data

--2021-11-30 11:22:44--  https://deepspeech.bj.bcebos.com/zh_lm/zh_giga.no_cna_cmn.prune01244.klm
Resolving deepspeech.bj.bcebos.com (deepspeech.bj.bcebos.com)... 182.61.200.229, 182.61.200.195, 2409:8c04:1001:1002:0:ff:b001:368a
Connecting to deepspeech.bj.bcebos.com (deepspeech.bj.bcebos.com)|182.61.200.229|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2953395058 (2.8G) [application/octet-stream]
Saving to: ‘data/lm/zh_giga.no_cna_cmn.prune01244.klm’


2021-11-30 11:23:14 (94.3 MB/s) - ‘data/lm/zh_giga.no_cna_cmn.prune01244.klm’ saved [2953395058/2953395058]



### Import Python packages

In [18]:
import paddle
import warnings
warnings.filterwarnings('ignore')
from paddlespeech.s2t.exps.u2.config import get_cfg_defaults
from paddlespeech.s2t.frontend.featurizer.text_featurizer import TextFeaturizer
from paddlespeech.s2t.io.collator import SpeechCollator
from paddlespeech.s2t.models.ds2 import DeepSpeech2Model
from paddlespeech.s2t.utils import layer_tools

#from yacs.config import CfgNode
from paddlespeech.s2t.frontend.featurizer.audio_featurizer import AudioFeaturizer
from paddlespeech.s2t.frontend.speech import SpeechSegment
from paddlespeech.s2t.frontend.normalizer import FeatureNormalizer

### Set the pretrained model paths

In [19]:
config_path = "conf/deepspeech2.yaml" 
checkpoint_path = "./exp/bw/checkpoints/avg_1.pdparams"
audio_file = "data/BAC009S0908W0355.wav"

result_file = "exp/result.rsl"

# Read the conf file and parse it into a structured config
ds2_config = get_cfg_defaults()
ds2_config.merge_from_file(config_path)


#ds2_config = CfgNode(new_allowed=True)
#ds2_config.merge_from_file(config_path)
print("========Config========")
print(ds2_config)

collator:
  augmentation_config: conf/augmentation.json
  batch_size: 64
  delta_delta: False
  dither: 1.0
  feat_dim: None
  keep_transcription_text: False
  max_freq: None
  mean_std_filepath: data/mean_std.json
  n_fft: None
  num_workers: 2
  random_seed: 0
  shuffle_method: batch_shuffle
  sortagrad: True
  spectrum_type: linear
  spm_model_prefix: None
  stride_ms: 10.0
  target_dB: -20
  target_sample_rate: 16000
  unit_type: char
  use_dB_normalization: True
  vocab_filepath: data/lang_char/vocab.txt
  window_ms: 20.0
data:
  dev_manifest: data/manifest.dev
  manifest: 
  max_input_len: 27.0
  max_output_input_ratio: inf
  max_output_len: inf
  min_input_len: 0.0
  min_output_input_ratio: 0.0
  min_output_len: 0.0
  test_manifest: data/manifest.test
  train_manifest: data/manifest.train
decoding:
  alpha: 1.9
  batch_size: 128
  beam_size: 300
  beta: 5.0
  ctc_weight: 0.0
  cutoff_prob: 0.99
  cutoff_top_n: 40
  decoding_chunk_size: -1
  decoding_method: ctc_beam_search
  err

### Build the audio feature extractor

In [20]:
feat_config = ds2_config.collator
audio_featurizer = AudioFeaturizer(
    spectrum_type=feat_config.spectrum_type,
    feat_dim=feat_config.feat_dim,
    delta_delta=feat_config.delta_delta,
    stride_ms=feat_config.stride_ms,
    window_ms=feat_config.window_ms,
    n_fft=feat_config.n_fft,
    max_freq=feat_config.max_freq,
    target_sample_rate=feat_config.target_sample_rate,
    use_dB_normalization=feat_config.use_dB_normalization,
    target_dB=feat_config.target_dB,
    dither=feat_config.dither)
feature_normalizer = FeatureNormalizer(feat_config.mean_std_filepath) if feat_config.mean_std_filepath else None
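
`FeatureNormalizer` is built from `mean_std_filepath`, which suggests per-dimension mean and standard deviation normalization. The sketch below shows that idea on toy data. It is an assumption about what the normalizer does, not its actual implementation: `apply_mean_std` is a hypothetical helper, and the layout of `data/mean_std.json` is not shown in this tutorial.

```python
def apply_mean_std(features, mean, std):
    """Normalize every frame dimension-wise: (value - mean) / std."""
    return [
        [(value - m) / s for value, m, s in zip(frame, mean, std)]
        for frame in features
    ]


# Toy input: 2 frames with 2 feature dimensions each (made-up numbers).
features = [[1.0, 10.0], [3.0, 30.0]]
mean = [2.0, 20.0]   # per-dimension mean, hypothetical values
std = [1.0, 10.0]    # per-dimension standard deviation, hypothetical values

print(apply_mean_std(features, mean, std))  # [[-1.0, -1.0], [1.0, 1.0]]
```

After this kind of normalization each feature dimension has a comparable scale, which is consistent with the roughly zero-centered values printed for `audio_feature` later in this section.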

### Extract audio features

In [21]:
# 'None' is only a placeholder, because no reference transcript is needed at prediction time
speech_segment = SpeechSegment.from_file(
                audio_file, "None")
audio_feature = audio_featurizer.featurize(speech_segment)
if feature_normalizer:
    audio_feature = feature_normalizer.apply(audio_feature)

#audio_feature, _ = collate_fn_test.process_utterance(audio_file=audio_file, transcript="None")
#vocab_list = collate_fn_test.vocab_list
print("========Feature========")

audio_len = audio_feature.shape[0]
audio_feature = paddle.to_tensor(audio_feature, dtype='float32')
audio_len = paddle.to_tensor(audio_len)
audio_feature = paddle.unsqueeze(audio_feature, axis=0)
print (audio_feature)
print (audio_len)

Tensor(shape=[1, 849, 161], dtype=float32, place=CUDAPlace(0), stop_gradient=True,
       [[[ 1.33949280,  1.30350459, -0.04746983, ...,  1.59237695,
           0.42001620, -1.61082637],
         [-1.21325135,  0.12596609, -0.61995435, ...,  0.71308112,
           0.58741844,  0.07226342],
         [ 1.03332150,  0.87029517, -0.45239139, ...,  0.99254310,
          -0.03513453,  1.27926254],
         ...,
         [-0.35289204,  0.27512935,  0.10877325, ...,  0.80169225,
           0.63553405,  1.57065392],
         [ 0.59620470,  0.19263259, -0.29943508, ...,  0.70254236,
           1.09836316,  1.04222035],
         [ 1.23617208,  1.17976248, -0.31874165, ..., -1.26211381,
           0.13160947,  0.56485575]]])
Tensor(shape=[1], dtype=int64, place=CUDAPlace(0), stop_gradient=True,
       [849])
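
The feature dimension of 161 in the tensor above can be reproduced with simple framing arithmetic from the config values `window_ms: 20.0`, `stride_ms: 10.0` and `target_sample_rate: 16000`. The sketch assumes that the FFT size equals the window length (the config has `n_fft: None`) and that no padding is applied; both are assumptions, and `linear_feature_shape` is a hypothetical helper.

```python
def linear_feature_shape(num_samples, sample_rate=16000,
                         window_ms=20.0, stride_ms=10.0):
    """Estimate (num_frames, num_bins) of a linear spectrogram."""
    window_size = int(sample_rate * window_ms / 1000)  # samples per frame
    stride_size = int(sample_rate * stride_ms / 1000)  # hop between frames
    num_frames = 1 + (num_samples - window_size) // stride_size
    num_bins = window_size // 2 + 1  # non-negative frequencies of a real FFT
    return num_frames, num_bins


# One second of 16 kHz audio.
print(linear_feature_shape(16000))  # (99, 161)
```

With a 10 ms stride, the 849 frames shown above correspond to an utterance of roughly 8.5 seconds.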


### Build the DeepSpeech2 model

In [23]:
model_conf = ds2_config.model
# input dim is feature size
model_conf.input_dim = 161
# output_dim is vocab size
model_conf.output_dim = 4301
model = DeepSpeech2Model.from_config(model_conf)

2021-11-30 11:24:09.548 | INFO     | paddlespeech.s2t.modules.loss:__init__:41 - CTCLoss Loss reduction: sum, div-bs: True
2021-11-30 11:24:09.549 | INFO     | paddlespeech.s2t.modules.loss:__init__:42 - CTCLoss Grad Norm Type: instance
2021-11-30 11:24:09.550 | INFO     | paddlespeech.s2t.modules.loss:__init__:73 - CTCLoss() kwargs:{'norm_by_times': True}, not support: {'norm_by_batchsize': False, 'norm_by_total_logits_len': False}


### Load the pretrained model

In [24]:
model_dict = paddle.load(checkpoint_path)
model.set_state_dict(model_dict)

### Run prediction

In [25]:
decoding_config = ds2_config.decoding
print ("decoding_config", decoding_config)
# text_feature = collate_fn_test.text_feature
text_feature = TextFeaturizer(unit_type='char',
                            vocab_filepath=ds2_config.collator.vocab_filepath)


result_transcripts = model.decode(
        audio_feature,
        audio_len,
        text_feature.vocab_list,
        decoding_method=decoding_config.decoding_method,
        lang_model_path=decoding_config.lang_model_path,
        beam_alpha=decoding_config.alpha,
        beam_beta=decoding_config.beta,
        beam_size=decoding_config.beam_size,
        cutoff_prob=decoding_config.cutoff_prob,
        cutoff_top_n=decoding_config.cutoff_top_n,
        num_processes=decoding_config.num_proc_bsearch)

print(result_transcripts)
print("Predicted transcript:")
print(result_transcripts[0])
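
The `beam_alpha` and `beam_beta` arguments control how the language model influences beam search. In the Deep Speech 2 paper [4], a candidate is scored as the CTC log probability plus `alpha` times the language model log probability plus `beta` times the word count. The sketch below applies that formula with the values from this config (`alpha: 1.9`, `beta: 5.0`). The log probabilities are made-up numbers, and how PaddleSpeech's scorer applies the formula internally is not covered here.

```python
def beam_score(log_p_ctc, log_p_lm, word_count, alpha=1.9, beta=5.0):
    """Combined score: acoustic model + weighted language model + length bonus."""
    return log_p_ctc + alpha * log_p_lm + beta * word_count


# Two hypothetical candidates with invented log probabilities.
candidates = {
    "hyp_a": beam_score(log_p_ctc=-4.0, log_p_lm=-10.0, word_count=3),
    "hyp_b": beam_score(log_p_ctc=-3.5, log_p_lm=-14.0, word_count=3),
}

best = max(candidates, key=candidates.get)
print(best)  # hyp_a
```

Here `hyp_b` has the better acoustic score, but the language model finds it much less likely, so `hyp_a` wins once the weighted language model term is added.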


decoding_config alpha: 1.9
batch_size: 128
beam_size: 300
beta: 5.0
ctc_weight: 0.0
cutoff_prob: 0.99
cutoff_top_n: 40
decoding_chunk_size: -1
decoding_method: ctc_beam_search
error_rate_type: cer
lang_model_path: data/lm/zh_giga.no_cna_cmn.prune01244.klm
num_decoding_left_chunks: -1
num_proc_bsearch: 10
simulate_streaming: False
2021-11-30 11:24:15.669 | INFO     | paddlespeech.s2t.frontend.featurizer.text_featurizer:_load_vocabulary_from_file:228 - BLANK id: 0
2021-11-30 11:24:15.670 | INFO     | paddlespeech.s2t.frontend.featurizer.text_featurizer:_load_vocabulary_from_file:229 - UNK id: 1
2021-11-30 11:24:15.670 | INFO     | paddlespeech.s2t.frontend.featurizer.text_featurizer:_load_vocabulary_from_file:230 - EOS id: 4300
2021-11-30 11:24:15.671 | INFO     | paddlespeech.s2t.frontend.featurizer.text_featurizer:_load_vocabulary_from_file:231 - SOS id: 4300
2021-11-30 11:24:15.671 | INFO     | paddlespeech.s2t.frontend.featurizer.text_featurizer:_load_vocabulary_from_file:232 - SPACE