<h2><u>ISR-Transformer, Arsitektur <i>Robust</i> Untuk Transkripsi Audio Bahasa Indonesia Lebih Baik</u></h2>
<b>Dr. Rizka Wakhidatus Sholikah, S.Kom., Kevin Putra Santoso, Mohammad Idris Arif Budiman</b><br>
<i>Departemen Teknologi Informasi, Institut Teknologi Sepuluh Nopember Surabaya</i>

Dalam proyek ini, kami mencoba menerapkan arsitektur Transformer yang terinspirasi dari paper **Attention is All You Need** (oleh Google Brain) dan arsitektur **Whisper** dari OpenAI untuk melakukan Automatic Speech Recognition dalam bahasa Indonesia. Data yang digunakan dalam proyek ini adalah Common Voice (sumber: huggingface) untuk bahasa Indonesia.

<h3>Import Necessary Libraries</h3>

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchaudio
import torchaudio.functional as F_audio
import torchaudio.transforms as T
from torchinfo import summary
import seaborn as sns
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import string

from PreprocessAudio import PreprocessAudio
from model import Transformer

<h3>Read and Clean the Metadatas</h3>

Metadata ini akan digunakan untuk membuat dataset. Pertama, diperlukan metadata dengan format ```audio_file_path, transcription```. Pada proyek ini, transkripsi audio diharapkan dapat menebak kata-kata yang dieja dalam file audio dengan benar, sehingga keberadaan tanda baca atau huruf kapital dapat diabaikan. Hal ini juga bersifat menguntungkan karena dapat meningkatkan akurasi.

In [2]:
df  = pd.read_csv('./common_voice_id/dev.tsv', sep='\t')
df2 = pd.read_csv('./common_voice_id/invalidated.tsv', sep='\t')
df3 = pd.read_csv('./common_voice_id/other.tsv', sep='\t')
df4 = pd.read_csv('./common_voice_id/train.tsv', sep='\t')

df  = df[['path', 'sentence']]
df2 = df2[['path', 'sentence']]
df3 = df3[['path', 'sentence']]
df4 = df4[['path', 'sentence']]

df = pd.concat([df, df2, df3, df4], ignore_index=True)
df = df.sample(frac=1, random_state=42)
df = df.reset_index(drop=True)

In [3]:
df

Unnamed: 0,path,sentence
0,common_voice_id_26747327.mp3,"Kamu harus melakukannya, suka tidak suka."
1,common_voice_id_21699230.mp3,Saya dibonceng di belakang sepeda teman.
2,common_voice_id_25248896.mp3,Tom berkata dia dapat menunggu lama.
3,common_voice_id_25537482.mp3,Minggu lalu terus-menerus hujan.
4,common_voice_id_21195036.mp3,Saat libur musim panas tahun ini saya pergi ke...
...,...,...
40199,common_voice_id_25039385.mp3,Aku menyuruh adikku untuk membeli gula di warung.
40200,common_voice_id_25426706.mp3,perusahaan yang berkembang selalu diikuti deng...
40201,common_voice_id_26229445.mp3,Melepaskan yang melekat membawa ke Nirvana.
40202,common_voice_id_20954419.mp3,dia tidak pernah ke dokter gigi selama hidupnya


In [4]:
df_testing = df[:500]

In [5]:
def remove_strips(text):
    # Mengganti tanda hubung (-) dengan spasi dan menghapus tanda baca di awal dan akhir kata
    cleaned_text = text.replace('“', '')
    cleaned_text = cleaned_text.replace('”', '')
    cleaned_text = cleaned_text.replace('-', ' ')
    cleaned_text = cleaned_text.strip(string.punctuation)
    return cleaned_text

for i in range(len(df_testing['sentence'])):
    # print(text)
    df_testing['sentence'][i] = df_testing['sentence'][i].lower()
    df_testing['sentence'][i] = ' '.join(word.strip(string.punctuation) for word in df_testing['sentence'][i].split())
    df_testing['sentence'][i] = ' '.join(remove_strips(word) for word in df_testing['sentence'][i].split())

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_testing['sentence'][i] = df_testing['sentence'][i].lower()
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_testing['sentence'][i] = ' '.join(word.strip(string.punctuation) for word in df_testing['sentence'][i].split())
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_testing['sentence'][i] = ' '.join(remove_strips(word) for word in df_testing['sentence'][i].split())


Dipersiapkan sebuah direktori untuk menyimpan audio yang akan diolah dan nantinya akan dikelompokkan dalam data train dan validasi.

In [6]:
label_list = []
HOME = os.getcwd()
for i in range(len(df_testing)):
    os.system(f'copy \"{HOME}\\common_voice_id\\clips\\{df_testing["path"][i]}\" \"{HOME}\\audio_folder\\{df_testing["path"][i]}\"')
    label_list.append(df_testing['sentence'][i])

In [7]:
print(f'Banyaknya label: {len(label_list)}')
print('Sampel 10 label')
label_list[:10]

Banyaknya label: 500
Sampel 10 label


['kamu harus melakukannya suka tidak suka',
 'saya dibonceng di belakang sepeda teman',
 'tom berkata dia dapat menunggu lama',
 'minggu lalu terus menerus hujan',
 'saat libur musim panas tahun ini saya pergi ke laut dan mendaki gunung',
 'dia memanggil namanya',
 'saat berada di sana saya belajar bahasa inggris',
 'di mana kamu membeli buku itu',
 'sepuluh tahun adalah waktu yang lama untuk menunggu',
 'di atas meja ada vas bunga']

Dataset yang akan kita latih diharapkan memiliki format sebagai berikut.

```python
[[tensor_1], [transcription_1],
 [tensor_2], [transcription_2],
 ...
 [tensor_n], [transcription_n]]
```

Untuk itu modul ```Dataset``` oleh PyTorch dapat digunakan untuk membuat dataset ini. Modul ini dipanggil dengan syntax

```python
from torch.utils.data import Dataset
```

dengan ukuran batch (batch size) sebesar 64.

tensor_i adalah tensor yang memuat matriks MFCC dari sebuah audio. Meninjau ulang bahwa sebuah matriks MFCC memiliki ukuran ```(n_mfcc, timesteps)``` dengan ```n_mfcc=64``` dan timesteps bergantung dari audio yang memiliki durasi terpanjang (timesteps tidak sama dengan durasi audio).

transcription_i adalah tensor yang memuat transkripsi yang telah di encode menjadi angka dalam dictionary encoder yang ditentukan oleh user (biasa disebut sebagai vocabulary). tensor ini memiliki ukuran ```(1, max_len)``` dimana ```max_len``` adalah transkripsi terpanjang dari sebuah audio. Perlu diingat bahwa panjang transkripsi maksimum akan dibatasi sebesar 256.

In [8]:
alphabet = 'abcdefghijklmnopqrstuvwxyz '
max_timestamps = 2752
max_len = 256

alphabets = ['', ' '] + [chr(i + 96) for i in range(1, 27)]
char2num_dict, num2char_dict = {}, {}

for index, chars in enumerate(alphabets):
    char2num_dict[chars] = index
    num2char_dict[index] = chars

def conv_char2num(label, maxlen=max_len):
    label = label[:maxlen].lower()
    label_enc = []
    padding_len = maxlen - len(label)
    for i in label:
        label_enc.append(char2num_dict[i])
    return np.array(label_enc + [0] * padding_len)

def conv_num2char(num):
    txt = ""
    for i in num:
        if i == 0:
            break
        else:
            txt += num2char_dict[i]
    
    return txt

In [9]:
example_text = 'Saya berangkat ke sekolah di pagi hari'
print(conv_char2num(example_text))

[20  2 26  2  1  3  6 19  2 15  8 12  2 21  1 12  6  1 20  6 12 16 13  2
  9  1  5 10  1 17  2  8 10  1  9  2 19 10  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]


In [10]:
def split_data(df, test_size=0.8):
    data_size = len(df)
    df = df.sample(frac=1).reset_index(drop=True)
    split = int(test_size * data_size)
    df_train, df_valid = df[:split], df[split:]
    df_train
    
    return df_train, df_valid.reset_index(drop=True)

df_train, df_valid = split_data(df_testing)

In [11]:
# Copy train audio data to 'train' folder
label_list_train = []

for i in range(len(df_train)):
    os.system(f'move \"{HOME}\\audio_folder\\{df_train["path"][i]}\" \"{HOME}\\audio_folder\\train\\{df_train["path"][i]}\"')
    label_list_train.append(df_train['sentence'][i])

In [12]:
# Copy train audio data to 'valid' folder
label_list_valid = []

for i in range(len(df_valid)):
    os.system(f'move \"{HOME}\\audio_folder\\{df_valid["path"][i]}\" \"{HOME}\\audio_folder\\valid\\{df_valid["path"][i]}\"')
    label_list_valid.append(df_valid['sentence'][i])

In [13]:
PipelineTrain = PreprocessAudio('./audio_folder/train/', df_train, 35)
PipelineValid = PreprocessAudio('./audio_folder/valid/', df_valid, 35)

In [14]:
dataset_train, df_train_filtered = PipelineTrain.load_audio()
dataset_valid, df_valid_filtered = PipelineValid.load_audio()

Mounted audio directory at: ./audio_folder/train/


  return torch.tensor([signal])


Counter di 12
Error di file ./audio_folder/train/common_voice_id_21587706.mp3
Counter di 38
Error di file ./audio_folder/train/common_voice_id_19783809.mp3
Counter di 95
Error di file ./audio_folder/train/common_voice_id_21194463.mp3
Counter di 109
Error di file ./audio_folder/train/common_voice_id_25470004.mp3
Counter di 163
Error di file ./audio_folder/train/common_voice_id_25469455.mp3
Counter di 217
Error di file ./audio_folder/train/common_voice_id_21194346.mp3
Counter di 219
Error di file ./audio_folder/train/common_voice_id_26242833.mp3
Counter di 346
Error di file ./audio_folder/train/common_voice_id_21699467.mp3
Mounted audio directory at: ./audio_folder/valid/
Counter di 22
Error di file ./audio_folder/valid/common_voice_id_21192699.mp3
Counter di 38
Error di file ./audio_folder/valid/common_voice_id_26237570.mp3
Counter di 77
Error di file ./audio_folder/valid/common_voice_id_20847480.mp3
Counter di 95
Error di file ./audio_folder/valid/common_voice_id_35338065.mp3


In [15]:
def add_padding(mfcc_tensor, index, n_mfcc=64, max_padding=512):
    height, width = np.array(mfcc_tensor[index][0]).shape[0], np.array(mfcc_tensor[index][0]).shape[1]
    
    padded_mfcc = np.zeros([max_padding, n_mfcc])
    padded_mfcc[:height, :width] = mfcc_tensor[index][0]
    return padded_mfcc

In [16]:
train_dataset_list = []
for i in range(len(dataset_train)):
    train_dataset_list.append(add_padding(dataset_train, i, max_padding=max_timestamps).tolist())

In [17]:
valid_dataset_list = []
for i in range(len(dataset_valid)):
    valid_dataset_list.append(add_padding(dataset_valid, i, max_padding=max_timestamps).tolist())

In [18]:
train_dataset_list = torch.tensor(train_dataset_list)
valid_dataset_list = torch.tensor(valid_dataset_list)

In [19]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
train_x = train_dataset_list.to(device)
valid_x = valid_dataset_list.to(device)

In [20]:
df_train_filtered.head()

Unnamed: 0,path,sentence
0,common_voice_id_25248257.mp3,pak kimura menunjukkan foto kepada saya
1,common_voice_id_35744791.mp3,kalian bersenang senang bukan
2,common_voice_id_25407923.mp3,ibu saya lahir pada tanggal dua puluh sembilan...
3,common_voice_id_35627850.mp3,di amerika utara usaha umumnya berpegang pada ...
4,common_voice_id_33453481.mp3,tas itu besar dan berat


In [21]:
train_y, valid_y = [], []

for text in df_train_filtered['sentence']:
    train_y.append(conv_char2num(text))

for text in df_valid_filtered['sentence']:
    valid_y.append(conv_char2num(text))

In [22]:
train_y = torch.tensor(train_y)
valid_y = torch.tensor(valid_y)

In [23]:
from torch.utils.data import Dataset, DataLoader

class myDataset(Dataset):
    def __init__(self, data, transcriptions):
        self.data = data
        self.transcriptions = transcriptions
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, index):
        mfcc_matrix = self.data[index]
        transcription = self.transcriptions[index]
        return mfcc_matrix, transcription

In [24]:
my_train_dataset = myDataset(train_x, train_y)
my_valid_dataset = myDataset(valid_x, valid_y)

In [25]:
batch_size = 64
train_set = DataLoader(my_train_dataset, batch_size=batch_size, shuffle=True)
valid_set = DataLoader(my_valid_dataset, batch_size=batch_size, shuffle=True)

<h3>Initialize Model and Start Training</h3>

In [240]:
import torch
import torch.nn as nn
import torch.optim as optim

model = Transformer().to(device)
optimizer = optim.Adam(model.parameters(), lr=0.00001, betas=(0.9, 0.98), eps=1e-9)

def train_step(model, optimizer, batch):
    source = batch[0]
    target = batch[1]
    dec_input = target[:, :-1]
    dec_target = target[:, 1:]

    optimizer.zero_grad()

    preds = model([source, dec_input])

    one_hot = F.one_hot(dec_target.long(), num_classes=model.num_classes).float()
    one_hot = one_hot.permute(0, 2, 1)
    mask = dec_target != 0
    mask = mask.float()
    
    loss = F.cross_entropy(preds.transpose(1, 2), one_hot.to(device), label_smoothing=0.1)
    loss = (loss * mask.to(device)).sum() / mask.sum()

    loss.backward()
    optimizer.step()

    return {"Loss": loss.item()}

def validate():
    model.eval()
    total_loss = 0.0
    with torch.no_grad():
        for batch in valid_set:
            source = batch[0]
            target = batch[1]
            dec_input = target[:, :-1]
            dec_target = target[:, 1:]
            preds = model((source, dec_input))
            one_hot_tgt = F.one_hot(dec_target.long(), num_classes=model.num_classes).float().permute(0, 2, 1).to(device)

            mask = batch_tgt != 0
            mask = mask.float()
            loss = F.cross_entropy(preds.transpose(1, 2), one_hot_tgt,label_smoothing=0.1)
            loss = (loss * mask.to(device)).sum() / mask.sum()

            total_loss += loss.item()
    return total_loss / len(valid_set)

num_epochs = 100

for epoch in range(num_epochs):
    model.train()
    epoch_loss = 0.0
    for batch_src, batch_tgt in train_set:
        loss = train_step(model, optimizer, [batch_src, batch_tgt])
        epoch_loss += loss['Loss']

    avg_epoch_loss = epoch_loss / len(train_set)
    print(f"Epoch [{epoch+1}/{num_epochs}], Avg Loss: {avg_epoch_loss:.4f}")

    validation_loss = validate()
    print(f"Validation Loss: {validation_loss:.4f}")

Epoch [1/100], Avg Loss: 3.2326
Validation Loss: 3.0290


  return torch._native_multi_head_attention(


Epoch [2/100], Avg Loss: 2.9968
Validation Loss: 2.8091
Epoch [3/100], Avg Loss: 2.7899
Validation Loss: 2.6137
Epoch [4/100], Avg Loss: 2.6124
Validation Loss: 2.4470
Epoch [5/100], Avg Loss: 2.4621
Validation Loss: 2.3119
Epoch [6/100], Avg Loss: 2.3327
Validation Loss: 2.1944
Epoch [7/100], Avg Loss: 2.2227
Validation Loss: 2.0944
Epoch [8/100], Avg Loss: 2.1251
Validation Loss: 2.0001
Epoch [9/100], Avg Loss: 2.0394
Validation Loss: 1.9222
Epoch [10/100], Avg Loss: 1.9641
Validation Loss: 1.8594
Epoch [11/100], Avg Loss: 1.8962
Validation Loss: 1.8280
Epoch [12/100], Avg Loss: 1.8363
Validation Loss: 1.7464
Epoch [13/100], Avg Loss: 1.7823
Validation Loss: 1.7141
Epoch [14/100], Avg Loss: 1.7361
Validation Loss: 1.6613
Epoch [15/100], Avg Loss: 1.6949
Validation Loss: 1.6274
Epoch [16/100], Avg Loss: 1.6579
Validation Loss: 1.6162
Epoch [17/100], Avg Loss: 1.6231
Validation Loss: 1.5716
Epoch [18/100], Avg Loss: 1.5942
Validation Loss: 1.5671
Epoch [19/100], Avg Loss: 1.5665
Valida