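# Building a Speech Recognition System

## Downloading the Dataset

1. Download the dataset: [BZNSYP](https://www.data-baker.com/data/index/TNtts/)
- About 12 hours of audio
- All recordings come from a single speaker
- In this experiment, pinyin is the recognition target: the system transcribes speech into pinyin syllables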


2. Build the dataset
- Place the dataset under `dataset`. The directory layout is:

    ```
    dataset
    ├── PhoneLabeling
    │   ├── 000001.interval
    ├── ProsodyLabeling
    │   ├── 000001-010000.txt
    ├── Wave
    │   ├── 000001.wav
    ```

- Run `splitdata/split_data.py` to split the dataset. Afterwards, a new `split` directory appears under `dataset`:

    ```
    dataset
    ├── split
    │   ├── train
    │   │   ├── wav.scp
    │   │   ├── pinyin
    │   ├── dev
    │   │   ├── wav.scp
    │   │   ├── pinyin
    │   ├── test
    │   │   ├── wav.scp
    │   │   ├── pinyin
    ```
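
The split step can be pictured with a small sketch. This is not the code in `splitdata/split_data.py`: the 80/10/10 ratio, the fixed seed, and the helper names `split_ids` and `format_scp` are assumptions made for illustration. The tab-separated `<id>`, `<path relative to dataset>` line format is inferred from how the `BZNSYP` class later in this document reads `wav.scp`.

```python
import random


def split_ids(ids, dev_ratio=0.1, test_ratio=0.1, seed=0):
    """Shuffle utterance ids reproducibly and cut them into three subsets.

    The ratios and the seed are illustrative assumptions.
    """
    ids = sorted(ids)                 # start from a stable order
    random.Random(seed).shuffle(ids)  # same seed gives the same split
    n_dev = int(len(ids) * dev_ratio)
    n_test = int(len(ids) * test_ratio)
    return {
        "dev": ids[:n_dev],
        "test": ids[n_dev:n_dev + n_test],
        "train": ids[n_dev + n_test:],
    }


def format_scp(ids, wav_dir="Wave"):
    """Render one tab-separated line per utterance: id, then relative wav path."""
    return "".join(f"{utt_id}\t{wav_dir}/{utt_id}.wav\n" for utt_id in ids)


utt_ids = [f"{i:06d}" for i in range(1, 101)]  # 000001 ... 000100
subsets = split_ids(utt_ids)
print({name: len(part) for name, part in subsets.items()})
print(format_scp(subsets["train"][:2]), end="")
```

Fixing the seed keeps the three subsets identical across runs, which makes later experiments comparable.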

## Data Extraction

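The data-extraction scaffold is already in place in the `BZNSYP` class in `data/dataloader.py`. Two parts remain to be completed:
- Audio feature extraction
- Text processing

1. Audio feature extraction (the `data.dataloader.extract_audio_features` function)

    -  You can reuse the feature extraction you completed in the previous lesson
    -  The return value must be a tensor of shape (L, f), where L is the number of frames and f is the feature dimension

2. Text processing
    - Build the tokenizer: `tokenizer.tokenizer.Tokenizer`
    - The tokenizer must:
        - Map a piece of text to token ids
        - This requires completing the TODOs in the Tokenizer scaffold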
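
As an illustration of the (L, f) shape required in step 1, here is a minimal NumPy sketch of a log-power spectrogram. It is not the feature extractor from the previous lesson: the function name `log_power_spectrogram`, the 16 kHz sample rate, and the 25 ms / 10 ms framing are assumptions, and the feature dimension here is `n_fft // 2 + 1` rather than a mel-filterbank size. Wrapping the result with `torch.from_numpy` would give the tensor that `extract_audio_features` has to return.

```python
import numpy as np


def log_power_spectrogram(waveform, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Return a float32 array of shape (L, f): L frames, f frequency bins."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    if len(waveform) < frame_len:
        # Pad short clips so that at least one full frame exists
        waveform = np.pad(waveform, (0, frame_len - len(waveform)))
    num_frames = 1 + (len(waveform) - frame_len) // hop_len
    # Index matrix of shape (L, frame_len): row i selects the samples of frame i
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(num_frames)[:, None]
    frames = waveform[idx] * np.hanning(frame_len)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return np.log(power + 1e-10).astype(np.float32)


# One second of a synthetic 440 Hz tone stands in for a real recording
t = np.arange(16000) / 16000
features = log_power_spectrogram(np.sin(2 * np.pi * 440 * t))
print(features.shape)
```

The time axis comes first so that utterances of different lengths can later be padded along dimension 0.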

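### Building the tokenizer

1. Build the vocabulary
    - The vocabulary script is already written: `tokenizer/gen_vocab.py`
    - Adjust the paths in it as needed. The resulting vocab looks like this: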

        ```
        huang
        cheng
        lo
        ...
        ```

2. Complete the TODO parts of the Tokenizer
    - The `__call__` method
    - The `decode` method
    - Take care with the special tokens `<pad>`, `<unk>`, `<sos>`, `<eos>`, `<blk>`, and `" "`
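
The vocabulary step can be sketched as follows. This is not the contents of `tokenizer/gen_vocab.py`: the helper name `build_vocab` is hypothetical, and the input format (an utterance id, a tab, then space-separated pinyin) is inferred from how the `BZNSYP` class later in this document parses the `pinyin` file.

```python
def build_vocab(pinyin_lines):
    """Collect distinct pinyin syllables in order of first appearance.

    Each line is assumed to hold an utterance id, a tab, and
    space-separated syllables.
    """
    vocab = []
    seen = set()
    for line in pinyin_lines:
        parts = line.strip().split("\t", 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        for syllable in parts[1].split():
            if syllable not in seen:
                seen.add(syllable)
                vocab.append(syllable)
    return vocab


lines = ["000001\tka er pu pei\n", "000002\tjia yu cun pei\n"]
vocab = build_vocab(lines)
print("\n".join(vocab))  # one syllable per line, as in the vocab file
```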

In [1]:
from tokenizer.tokenizer import Tokenizer
pinyin_list1 = ['wo', 'men', 'cheng', 'shi', 'de', 'fu', 'su', 'you', 'lai']
pinyin_list2 = ["A", 'wo', 'men', 'cheng']
tokenizer = Tokenizer("./tokenizer/vocab.txt")

In [2]:
id_list1 = tokenizer(pinyin_list1)
print(tokenizer.decode(id_list1))

['wo', 'men', 'cheng', 'shi', 'de', 'fu', 'su', 'you', 'lai']


In [3]:
id_list2 = tokenizer(pinyin_list2)
print(tokenizer.decode(id_list2))

unk:  ['A']
['wo', 'men', 'cheng']
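
The behaviour in the cells above could come from a tokenizer along these lines. This is a sketch, not the reference solution for the TODOs: the class name `SimpleTokenizer`, taking the vocabulary as a list instead of a file path, and placing the special tokens at the lowest ids are all assumptions.

```python
class SimpleTokenizer:
    """Hypothetical stand-in for `tokenizer.tokenizer.Tokenizer`."""

    SPECIAL = ["<pad>", "<unk>", "<sos>", "<eos>", "<blk>", " "]

    def __init__(self, vocab):
        # Special tokens take the lowest ids, so <pad> is 0 (an assumption)
        self.id2token = self.SPECIAL + [t for t in vocab if t not in self.SPECIAL]
        self.token2id = {t: i for i, t in enumerate(self.id2token)}

    def __call__(self, tokens):
        """Map a list of tokens to ids; unknown tokens become <unk>."""
        unknown = [t for t in tokens if t not in self.token2id]
        if unknown:
            print("unk: ", unknown)
        unk_id = self.token2id["<unk>"]
        return [self.token2id.get(t, unk_id) for t in tokens]

    def decode(self, ids, ignore_special=True):
        """Map ids back to tokens, optionally dropping special tokens."""
        tokens = [self.id2token[i] for i in ids]
        if ignore_special:
            tokens = [t for t in tokens if t not in self.SPECIAL]
        return tokens


tok = SimpleTokenizer(["wo", "men", "cheng"])
ids = tok(["A", "wo", "men", "cheng"])
print(tok.decode(ids))
print(tok.decode(ids, ignore_special=False))
```

Mapping unknown tokens to `<unk>` instead of dropping them keeps the id sequence the same length as the input, and `decode` hides them by default along with the other special tokens.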


## Data Construction

In [4]:
from torch.utils.data import DataLoader, Dataset
from tokenizer.tokenizer import Tokenizer
import torch
import random
import os
from utils.utils import collate_with_PAD

def extract_audio_features(wav_file:str)->torch.Tensor:
    if not isinstance(wav_file, str):
        raise TypeError(f"Expected str for wav_file, got {type(wav_file).__name__}")

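    # TODO: extract the audio features and convert them to a torch.Tensor.
    # The random tensor below is only a placeholder of shape (L, 80).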
    random_number = random.randint(100, 1000)
    res = torch.randn(random_number, 80)

    if not isinstance(res, torch.Tensor):
        raise TypeError("Return value must be torch.Tensor")
    return res


class BZNSYP(Dataset):
    def __init__(self, wav_file, text_file, tokenizer):
        self.tokenizer = tokenizer
        self.wav2path = {}
        self.wav2text = {}
        self.ids = []

        with open(wav_file, "r", encoding="utf-8") as f:
            for line in f:
                parts = line.strip().split("\t", 1)
                if len(parts) == 2:
                    id = parts[0]
                    self.ids.append(id)
                    path = "./dataset/" + parts[1]
                    self.wav2path[id] = path
                else:
                    raise ValueError(f"Invalid line format: {line}")

        with open(text_file, "r", encoding="utf-8") as f:
            for line in f:
                parts = line.strip().split("\t", 1)
                if len(parts) == 2:
                    id = parts[0]
                    pinyin_list = parts[1].split(" ")
                    self.wav2text[id] = self.tokenizer(["<sos>"]+pinyin_list+["<eos>"])
                else:
                    raise ValueError(f"Invalid line format: {line}")
    
    def __len__(self):
        return len(self.wav2path)
    
    def __getitem__(self, index):
        utt_id = self.ids[index]
        path = self.wav2path[utt_id]
        text = self.wav2text[utt_id]
        # Pass the wav path (not the utterance id) to the feature extractor
        return utt_id, extract_audio_features(path), text
    

def get_dataloader(wav_file, text_file, batch_size, tokenizer, shuffle=True):
    dataset = BZNSYP(wav_file, text_file, tokenizer)
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        collate_fn=collate_with_PAD
    )
    return dataloader

In [5]:
tokenizer = Tokenizer("./tokenizer/vocab.txt")
dataloader = get_dataloader("./dataset/split/train/wav.scp", "./dataset/split/train/pinyin", 3, tokenizer, shuffle=False)
input = next(iter(dataloader))  # take only the first batch

audio_lens = input["audio_lens"]
audios = input["audios"]
texts = input["texts"]
text_lens = input["text_lens"]

In [6]:
print(input.keys())

dict_keys(['ids', 'audios', 'audio_lens', 'texts', 'text_lens'])
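
The batch keys printed above suggest a collate function along these lines. This sketch is not the code in `utils.utils.collate_with_PAD`: the name `collate_with_pad`, the `pad_id=0` default (which assumes `<pad>` has id 0), and the zero padding of audio frames are assumptions.

```python
import torch
from torch.nn.utils.rnn import pad_sequence


def collate_with_pad(batch, pad_id=0):
    """Pad variable-length (id, features, token_ids) samples into one batch."""
    ids, audios, texts = zip(*batch)
    texts = [torch.as_tensor(t, dtype=torch.long) for t in texts]
    return {
        "ids": list(ids),
        # (B, L_max, f), zero-padded along the time axis
        "audios": pad_sequence(list(audios), batch_first=True),
        "audio_lens": torch.tensor([a.size(0) for a in audios], dtype=torch.int32),
        # (B, T_max), padded with the assumed <pad> token id
        "texts": pad_sequence(texts, batch_first=True, padding_value=pad_id),
        "text_lens": torch.tensor([t.size(0) for t in texts], dtype=torch.int32),
    }


samples = [
    ("a", torch.randn(5, 80), [2, 7, 8, 3]),
    ("b", torch.randn(3, 80), [2, 9, 3]),
]
out = collate_with_pad(samples)
print(out["audios"].shape, out["texts"].tolist())
```

Returning the true lengths next to the padded tensors lets the model or the loss mask out padded positions later.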


In [7]:
# audio
print(audio_lens)
print(audios[0, :, :])

tensor([846, 270, 731], dtype=torch.int32)
tensor([[-1.9291e-01, -8.0170e-02,  5.8445e-01,  ..., -5.9170e-01,
         -8.0922e-01, -4.3831e-01],
        [ 5.7072e-01, -5.5849e-01,  1.5525e-02,  ..., -5.1530e-01,
          2.5720e+00,  1.4080e+00],
        [ 9.6679e-01,  1.1044e+00,  4.1950e-01,  ...,  1.1179e+00,
          4.5566e-01, -5.5372e-01],
        ...,
        [-2.3685e-03,  1.6951e+00,  3.2000e-01,  ...,  2.7687e+00,
         -1.6895e+00, -1.7418e-01],
        [-7.4421e-01, -3.2440e-01, -1.3456e-02,  ..., -6.7055e-01,
         -7.0159e-01, -8.1147e-01],
        [ 1.5708e-01, -1.2388e+00,  1.5396e+00,  ..., -9.1777e-02,
          2.1803e-02, -6.8506e-01]])


In [8]:
# text
texts = texts.tolist()

for text in texts:
    print(tokenizer.decode(text))
    print(tokenizer.decode(text, ignore_special=False))

['ka', 'e', 'er', 'pu', 'pei', 'wai', 'sun', 'wan', 'hua', 'ti']
['<sos>', 'ka', 'e', 'er', 'pu', 'pei', 'wai', 'sun', 'wan', 'hua', 'ti', '<eos>', '<pad>', '<pad>', '<pad>', '<pad>']
['jia', 'yu', 'cun', 'yan', 'bie', 'zai', 'yong', 'bao', 'wo']
['<sos>', 'jia', 'yu', 'cun', 'yan', 'bie', 'zai', 'yong', 'bao', 'wo', '<eos>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']
['bao', 'ma', 'pei', 'gua', 'bo', 'luo', 'an', 'diao', 'chan', 'yuan', 'zhen', 'dong', 'weng', 'ta']
['<sos>', 'bao', 'ma', 'pei', 'gua', 'bo', 'luo', 'an', 'diao', 'chan', 'yuan', 'zhen', 'dong', 'weng', 'ta', '<eos>']
