<h2><u>Summary</u></h2>

In this project, we attempt to apply a Transformer architecture, inspired by the paper Attention Is All You Need (from Google Brain) and by OpenAI's Whisper architecture, to Automatic Speech Recognition for Indonesian. The data used is the Indonesian subset of Common Voice (source: Hugging Face).

<h3>Import Necessary Libraries</h3>

In [57]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchaudio
import torchaudio.functional as F_audio
import torchaudio.transforms as T
from torchinfo import summary
import seaborn as sns
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import string

from PreprocessAudio import PreprocessAudio

<h3>Read and Clean the Metadata</h3>

This metadata is used to build the dataset. First, we need metadata in the form ```audio_file_path, transcription```. In this project, the transcription only needs to get the spoken words right, so punctuation and capitalization can be ignored. Dropping them also shrinks the set of symbols the model has to predict, which can improve accuracy.

In [58]:
df  = pd.read_csv('./common_voice_id/dev.tsv', sep='\t')
df2 = pd.read_csv('./common_voice_id/invalidated.tsv', sep='\t')
df3 = pd.read_csv('./common_voice_id/other.tsv', sep='\t')
df4 = pd.read_csv('./common_voice_id/train.tsv', sep='\t')

df  = df[['path', 'sentence']]
df2 = df2[['path', 'sentence']]
df3 = df3[['path', 'sentence']]
df4 = df4[['path', 'sentence']]

df = pd.concat([df, df2, df3, df4], ignore_index=True)
df = df.sample(frac=1, random_state=42)
df = df.reset_index(drop=True)

In [59]:
df

Unnamed: 0,path,sentence
0,common_voice_id_26747327.mp3,"Kamu harus melakukannya, suka tidak suka."
1,common_voice_id_21699230.mp3,Saya dibonceng di belakang sepeda teman.
2,common_voice_id_25248896.mp3,Tom berkata dia dapat menunggu lama.
3,common_voice_id_25537482.mp3,Minggu lalu terus-menerus hujan.
4,common_voice_id_21195036.mp3,Saat libur musim panas tahun ini saya pergi ke...
...,...,...
40199,common_voice_id_25039385.mp3,Aku menyuruh adikku untuk membeli gula di warung.
40200,common_voice_id_25426706.mp3,perusahaan yang berkembang selalu diikuti deng...
40201,common_voice_id_26229445.mp3,Melepaskan yang melekat membawa ke Nirvana.
40202,common_voice_id_20954419.mp3,dia tidak pernah ke dokter gigi selama hidupnya


In [60]:
df_testing = df[:500]

In [61]:
def remove_strips(text):
    # Drop curly quotes, replace hyphens (-) with spaces, and strip punctuation from both ends of the word
    cleaned_text = text.replace('“', '')
    cleaned_text = cleaned_text.replace('”', '')
    cleaned_text = cleaned_text.replace('-', ' ')
    cleaned_text = cleaned_text.strip(string.punctuation)
    return cleaned_text

def normalize_sentence(sentence):
    # Lowercase, then strip punctuation word by word
    sentence = sentence.lower()
    sentence = ' '.join(word.strip(string.punctuation) for word in sentence.split())
    return ' '.join(remove_strips(word) for word in sentence.split())

# Work on an explicit copy and assign the whole column at once;
# chained indexing such as df_testing['sentence'][i] = ... raises SettingWithCopyWarning
df_testing = df_testing.copy()
df_testing['sentence'] = df_testing['sentence'].apply(normalize_sentence)


A directory is prepared to hold the audio files to be processed, which will later be split into training and validation sets.

In [62]:
label_list = []
HOME = os.getcwd()
for i in range(len(df_testing)):
    os.system(f'copy \"{HOME}\\common_voice_id\\clips\\{df_testing["path"][i]}\" \"{HOME}\\audio_folder\\{df_testing["path"][i]}\"')
    label_list.append(df_testing['sentence'][i])
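
The `copy` shell command used above only exists on Windows, and `os.system` does not raise when a copy fails. A portable alternative could rely on `shutil.copy2` from the standard library. The sketch below is only an illustration: `copy_clips` is a hypothetical helper, and it assumes the same `common_voice_id/clips` and `audio_folder` layout as the cell above.

```python
import shutil
from pathlib import Path

def copy_clips(filenames, src_dir, dst_dir):
    """Copy the named clips from src_dir to dst_dir; return the names that were missing."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)  # create the target folder if needed
    missing = []
    for name in filenames:
        src = src_dir / name
        if src.is_file():
            shutil.copy2(src, dst_dir / name)   # copies file content and metadata
        else:
            missing.append(name)                # record it instead of failing silently
    return missing
```

With the variables from this notebook, a call might look like `copy_clips(df_testing['path'], Path(HOME) / 'common_voice_id' / 'clips', Path(HOME) / 'audio_folder')`.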

In [63]:
print(f'Number of labels: {len(label_list)}')
print('Sample of 10 labels')
label_list[:10]

Number of labels: 500
Sample of 10 labels


['kamu harus melakukannya suka tidak suka',
 'saya dibonceng di belakang sepeda teman',
 'tom berkata dia dapat menunggu lama',
 'minggu lalu terus menerus hujan',
 'saat libur musim panas tahun ini saya pergi ke laut dan mendaki gunung',
 'dia memanggil namanya',
 'saat berada di sana saya belajar bahasa inggris',
 'di mana kamu membeli buku itu',
 'sepuluh tahun adalah waktu yang lama untuk menunggu',
 'di atas meja ada vas bunga']

The dataset we will train on is expected to have the following format.

```python
[[tensor_1, transcription_1],
 [tensor_2, transcription_2],
 ...
 [tensor_n, transcription_n]]
```

PyTorch's ```Dataset``` class can be used to build this dataset. It is imported with

```python
from torch.utils.data import Dataset
```

and the data is then served in batches of 64 (the batch size).

tensor_i is the tensor holding the MFCC matrix of one audio clip. Recall that an MFCC matrix has shape ```(n_mfcc, timesteps)``` with ```n_mfcc=64```, and that timesteps is set by the longest audio clip (timesteps is not the same as the audio duration).
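
To see why `timesteps` differs from the clip duration: with a centered short-time Fourier transform, the number of frames is `1 + n_samples // hop_length`. The sketch below works this out in plain Python. The 16 kHz sample rate and the hop length of 200 samples are assumptions for illustration only; the settings used inside `PreprocessAudio` are not shown in this notebook. `num_frames` and `pad_to` are hypothetical helper names.

```python
def num_frames(duration_s, sample_rate=16_000, hop_length=200):
    """Frames produced by a centered STFT (assumed settings, see the text above)."""
    n_samples = int(duration_s * sample_rate)
    return 1 + n_samples // hop_length

def pad_to(frames, max_frames):
    """Number of zero frames to append so the clip reaches max_frames."""
    if frames > max_frames:
        raise ValueError("clip is longer than the padded length")
    return max_frames - frames

frames = num_frames(10.0)            # a 10-second clip
print(frames, pad_to(frames, 2752))  # frames in the clip, zero frames to add
```

Every clip is padded to the same number of frames, so shorter clips simply carry more zero frames.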

transcription_i is the tensor holding the transcription, encoded as integers through a user-defined encoder dictionary (usually called the vocabulary). This tensor has shape ```(1, max_len)```, where ```max_len``` is the length of the longest transcription. Note that the maximum transcription length is capped at 256.

In [64]:
alphabet = 'abcdefghijklmnopqrstuvwxyz '
max_timestamps = 2752
max_len = 256

alphabets = ['', ' '] + [chr(i + 96) for i in range(1, 27)]
char2num_dict, num2char_dict = {}, {}

for index, chars in enumerate(alphabets):
    char2num_dict[chars] = index
    num2char_dict[index] = chars

def conv_char2num(label, maxlen=max_len):
    label = label[:maxlen].lower()
    label_enc = []
    padding_len = maxlen - len(label)
    for i in label:
        label_enc.append(char2num_dict[i])
    return np.array(label_enc + [0] * padding_len)

def conv_num2char(num):
    txt = ""
    for i in num:
        if i == 0:
            break
        else:
            txt += num2char_dict[i]
    
    return txt

In [65]:
example_text = 'Saya berangkat ke sekolah di pagi hari'
print(conv_char2num(example_text))

[20  2 26  2  1  3  6 19  2 15  8 12  2 21  1 12  6  1 20  6 12 16 13  2
  9  1  5 10  1 17  2  8 10  1  9  2 19 10  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
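
`conv_char2num` raises a `KeyError` for any character outside the vocabulary, for example digits or accented letters. One possible guard is to drop unknown characters before encoding. The sketch below rebuilds the same 28-symbol vocabulary so that it runs on its own; `encode_safe` and `decode` are hypothetical names, not part of the notebook.

```python
# Same vocabulary as above: index 0 is padding, 1 is space, 2..27 are a..z
vocab = ['', ' '] + [chr(i + 96) for i in range(1, 27)]
char2num = {ch: idx for idx, ch in enumerate(vocab)}
num2char = {idx: ch for idx, ch in enumerate(vocab)}

def encode_safe(label, maxlen=256):
    """Encode a label, skipping characters that are not in the vocabulary."""
    ids = [char2num[ch] for ch in label.lower() if ch in char2num]
    ids = ids[:maxlen]                       # truncate after filtering
    return ids + [0] * (maxlen - len(ids))   # pad with the padding id

def decode(ids):
    """Decode until the first padding id (0)."""
    chars = []
    for i in ids:
        if i == 0:
            break
        chars.append(num2char[i])
    return ''.join(chars)

encoded = encode_safe('Pukul 7 pagi!', maxlen=16)
print(encoded)
print(repr(decode(encoded)))
```

Dropping a character can leave a double space behind (here where the digit was), so collapsing whitespace during text cleaning may still be worthwhile.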


In [66]:
def split_data(df, train_size=0.8):
    # train_size is the fraction of rows that goes to the training set
    data_size = len(df)
    df = df.sample(frac=1).reset_index(drop=True)
    split = int(train_size * data_size)
    df_train, df_valid = df[:split], df[split:]

    return df_train, df_valid.reset_index(drop=True)

df_train, df_valid = split_data(df_testing)

In [67]:
# Move train audio data to the 'train' folder
label_list_train = []

for i in range(len(df_train)):
    os.system(f'move \"{HOME}\\audio_folder\\{df_train["path"][i]}\" \"{HOME}\\audio_folder\\train\\{df_train["path"][i]}\"')
    label_list_train.append(df_train['sentence'][i])

In [68]:
# Move validation audio data to the 'valid' folder
label_list_valid = []

for i in range(len(df_valid)):
    os.system(f'move \"{HOME}\\audio_folder\\{df_valid["path"][i]}\" \"{HOME}\\audio_folder\\valid\\{df_valid["path"][i]}\"')
    label_list_valid.append(df_valid['sentence'][i])

In [69]:
PipelineTrain = PreprocessAudio('./audio_folder/train/', df_train, 35)
PipelineValid = PreprocessAudio('./audio_folder/valid/', df_valid, 35)

In [70]:
dataset_train, df_train_filtered = PipelineTrain.load_audio()
dataset_valid, df_valid_filtered = PipelineValid.load_audio()

Mounted audio directory at: ./audio_folder/train/
Counter di 35
Error di file ./audio_folder/train/common_voice_id_21194463.mp3
Counter di 90
Error di file ./audio_folder/train/common_voice_id_21192699.mp3
Counter di 91
Error di file ./audio_folder/train/common_voice_id_26237570.mp3
Counter di 135
Error di file ./audio_folder/train/common_voice_id_20847480.mp3
Counter di 207
Error di file ./audio_folder/train/common_voice_id_25469455.mp3
Counter di 252
Error di file ./audio_folder/train/common_voice_id_26242833.mp3
Counter di 293
Error di file ./audio_folder/train/common_voice_id_19783809.mp3
Counter di 351
Error di file ./audio_folder/train/common_voice_id_25470004.mp3
Counter di 368
Error di file ./audio_folder/train/common_voice_id_21699467.mp3
Counter di 375
Error di file ./audio_folder/train/common_voice_id_35338065.mp3
Mounted audio directory at: ./audio_folder/valid/
Counter di 29
Error di file ./audio_folder/valid/common_voice_id_21587706.mp3
Counter di 69
Error di file ./audio

In [71]:
def add_padding(mfcc_tensor, index, n_mfcc=64, max_padding=512):
    height, width = np.array(mfcc_tensor[index][0]).shape[0], np.array(mfcc_tensor[index][0]).shape[1]
    
    padded_mfcc = np.zeros([max_padding, n_mfcc])
    padded_mfcc[:height, :width] = mfcc_tensor[index][0]
    return padded_mfcc

In [72]:
train_dataset_list = []
for i in range(len(dataset_train)):
    train_dataset_list.append(add_padding(dataset_train, i, max_padding=max_timestamps).tolist())

In [73]:
valid_dataset_list = []
for i in range(len(dataset_valid)):
    valid_dataset_list.append(add_padding(dataset_valid, i, max_padding=max_timestamps).tolist())

In [74]:
train_dataset_list = torch.tensor(train_dataset_list)
valid_dataset_list = torch.tensor(valid_dataset_list)

In [75]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
train_x = train_dataset_list.to(device)
valid_x = valid_dataset_list.to(device)

In [76]:
df_train_filtered.head()

Unnamed: 0,path,sentence
0,common_voice_id_35291086.mp3,pesawat ini sangat jelek
1,common_voice_id_20957324.mp3,tom menikahi perempuan yang lebih mudah dari d...
2,common_voice_id_35277617.mp3,untuk alasan inilah saya tidak bisa datang ber...
3,common_voice_id_25970214.mp3,istana persegi panjang mengelilingi taman tama...
4,common_voice_id_25415776.mp3,dia suka mendengarkan musik


In [77]:
train_y, valid_y = [], []

for text in df_train_filtered['sentence']:
    train_y.append(conv_char2num(text))

for text in df_valid_filtered['sentence']:
    valid_y.append(conv_char2num(text))

In [78]:
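# Stack the per-label arrays first; building a tensor from a list of arrays is slow
train_y = torch.tensor(np.array(train_y))
valid_y = torch.tensor(np.array(valid_y))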

In [79]:
from torch.utils.data import Dataset, DataLoader

class myDataset(Dataset):
    def __init__(self, data, transcriptions):
        self.data = data
        self.transcriptions = transcriptions
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, index):
        mfcc_matrix = self.data[index]
        transcription = self.transcriptions[index]
        return mfcc_matrix, transcription

In [80]:
my_train_dataset = myDataset(train_x, train_y)
my_valid_dataset = myDataset(valid_x, valid_y)

In [81]:
batch_size = 64
train_set = DataLoader(my_train_dataset, batch_size=batch_size, shuffle=True)
valid_set = DataLoader(my_valid_dataset, batch_size=batch_size, shuffle=True)

<h3>Build Transformer Model </h3>

In [524]:
class GELU(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        error_const = torch.erf(x / math.sqrt(2.0))
        x = x * 0.5 * (1.0 + error_const)
        return x

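class TokenEmbedding(nn.Module):
    def __init__(self, num_vocab=30, num_hid=64, maxlen=256):
        super().__init__()
        self.num_vocab = num_vocab
        self.num_hid = num_hid
        # Create both tables once so their weights are registered and trained
        self.emb = nn.Embedding(num_vocab, num_hid)
        self.pos_emb = nn.Embedding(maxlen, num_hid)

    def forward(self, x):
        x = x.to(self.emb.weight.device)
        maxlen = x.shape[-1]
        # Positions 0..maxlen-1; the positional table is indexed by position, not by token id
        pos = torch.arange(0, maxlen, 1, device=x.device)
        out = self.emb(x) + self.pos_emb(pos)
        return out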

class SpeechFeatureEmbedding(nn.Module):
    def __init__(self, num_hid=64, maxlen=100):
        super().__init__()
        self.conv1 = nn.Conv1d(in_channels=num_hid, out_channels=num_hid, kernel_size=11, stride=2, padding=5)
        self.conv2 = nn.Conv1d(in_channels=num_hid, out_channels=num_hid, kernel_size=11, stride=2, padding=5)
        self.conv3 = nn.Conv1d(in_channels=num_hid, out_channels=num_hid, kernel_size=11, stride=2, padding=5)
        self.admaxpool1 = nn.AdaptiveMaxPool1d(1024)
        self.admaxpool2 = nn.AdaptiveMaxPool1d(512)
        self.gelu = GELU()

    def forward(self, x):
        x = self.conv1(x.permute(0, 2, 1))
        x = self.gelu(x)
        x = self.admaxpool1(x)
        x = self.conv2(x)
        x = self.gelu(x)
        x = self.conv3(x)
        x = self.gelu(x)
        return x.permute(0, 2, 1)

class TransformerEncoder(nn.Module):
    def __init__(self, embed_dim, num_heads, feed_forward_dim, rate=0.1):
        super(TransformerEncoder, self).__init__()
        self.att = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)  # inputs are (batch, seq, embed)
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, feed_forward_dim),     
            GELU(),
            nn.Linear(feed_forward_dim, embed_dim)
        )
        self.layernorm1 = nn.LayerNorm(embed_dim, eps=1e-6)
        self.layernorm2 = nn.LayerNorm(embed_dim, eps=1e-6)
        self.dropout1 = nn.Dropout(rate)
        self.dropout2 = nn.Dropout(rate)

    def forward(self, inputs):
        attn_output = self.att(inputs, inputs, inputs)[0]
        attn_output = self.dropout1(attn_output)
        out1 = self.layernorm1(inputs + attn_output)
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output)
        out = self.layernorm2(out1 + ffn_output)
        return out

class TransformerDecoder(nn.Module):
    def __init__(self, embed_dim, num_heads, feed_forward_dim, dropout_rate=0.1):
        super().__init__()
        self.layernorm1 = nn.LayerNorm(embed_dim, eps=1e-6)
        self.layernorm2 = nn.LayerNorm(embed_dim, eps=1e-6)
        self.layernorm3 = nn.LayerNorm(embed_dim, eps=1e-6)
        self.self_att = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.enc_att = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.self_dropout = nn.Dropout(0.5)
        self.enc_dropout = nn.Dropout(0.1)
        self.ffn_dropout = nn.Dropout(0.1)
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, feed_forward_dim),
            GELU(),
            nn.Linear(feed_forward_dim, embed_dim)
        )

    def causalAttentionMask(self, size, dtype=float('-inf'), device='cpu'):
        # `dtype` is the fill value placed above the diagonal (the future positions)
        return torch.triu(torch.full((size, size), dtype, device=device), diagonal=1)

    def forward(self, enc_out, target):
        # Inputs are (batch, seq, embed), so the sequence length is dimension 1
        causal_mask = self.causalAttentionMask(size=target.shape[1], dtype=float('-inf'), device=target.device)
        target_att = self.self_att(target, target, target, attn_mask=causal_mask, is_causal=True)[0]
        target_norm = self.layernorm1(target + self.self_dropout(target_att))
        # Cross-attention: queries come from the decoder, keys and values from the encoder
        enc_out = self.enc_att(target_norm, enc_out, enc_out)[0]
        enc_out_norm = self.layernorm2(self.enc_dropout(enc_out) + target_norm)
        ffn_out = self.ffn(enc_out_norm)
        ffn_out_norm = self.layernorm3(enc_out_norm + self.ffn_dropout(ffn_out))
        return ffn_out_norm

class Transformer(nn.Module):
    def __init__(
        self,
        num_hid = 64,
        num_head = 2,
        num_feed_forward = 128,
        source_maxlen = 100,
        target_maxlen = 100,
        num_layers_enc = 4,
        num_layers_dec = 4,
        num_classes = 30
    ):
        super().__init__()
        self.num_layers_enc = num_layers_enc
        self.num_layers_dec = num_layers_dec
        self.target_maxlen = target_maxlen
        self.num_classes = num_classes

        self.enc_input = SpeechFeatureEmbedding(num_hid=num_hid, maxlen=source_maxlen)
        self.dec_input = TokenEmbedding(num_vocab=num_classes, num_hid=num_hid)

        self.encoder = nn.Sequential(
            self.enc_input,
            *[TransformerEncoder(num_hid, num_head, num_feed_forward) for _ in range(num_layers_enc)]
        )
        
        for i in range(num_layers_dec):
            self.add_module(
                f"dec_layer_{i}",
                TransformerDecoder(num_hid, num_head, num_feed_forward),
            )

        self.classifier = nn.Linear(num_hid, num_classes)

    def decoder(self, enc_out, target):
        y = self.dec_input(target)
        for i in range(self.num_layers_dec):
            dec_layer = getattr(self, f"dec_layer_{i}")
            y = dec_layer(enc_out, y)
        return y

    def forward(self, inputs):
        source = inputs[0].unsqueeze(0)
        target = inputs[1].unsqueeze(0)
        x = self.encoder(source)
        y = self.decoder(x, target)
        return self.classifier(y)
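
The mask built in `causalAttentionMask` puts `-inf` above the diagonal so that, after the softmax, position `i` gives zero weight to every position `j > i`. The sketch below reproduces that effect on a tiny score matrix in plain Python, without PyTorch, to keep it dependency-free. `causal_mask` and `masked_softmax` are illustrative names and not part of the model.

```python
import math

def causal_mask(size):
    """size x size mask: 0 on and below the diagonal, -inf above it."""
    return [[float('-inf') if j > i else 0.0 for j in range(size)]
            for i in range(size)]

def masked_softmax(scores, mask):
    """Row-wise softmax of scores + mask."""
    out = []
    for score_row, mask_row in zip(scores, mask):
        logits = [s + m for s, m in zip(score_row, mask_row)]
        peak = max(logits)                        # subtract the max for stability
        exps = [math.exp(v - peak) for v in logits]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

scores = [[0.0] * 4 for _ in range(4)]            # uniform attention scores
weights = masked_softmax(scores, causal_mask(4))
for row in weights:
    print([round(w, 2) for w in row])
```

Each row spreads its weight only over the current and earlier positions, which is what keeps the decoder from looking at characters it has not produced yet.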

In [590]:
test_dataset = my_train_dataset[:64]
x = test_dataset[0]
y = test_dataset[1]

SFE = SpeechFeatureEmbedding().to("cuda:0")
out_sfe = SFE(x)
print("Output SFE:", out_sfe.shape)
TransEnc = TransformerEncoder(64, 2, 128).to("cuda:0")
out_trans_enc = TransEnc(out_sfe)
print("Output TransEnc", out_trans_enc.shape)

TE = TokenEmbedding().to("cuda:0")
out_te = TE(y)
print("Output Token Embedding", out_te.shape)
TransDec = TransformerDecoder(64, 2, 128).to("cuda:0")
out_trans_dec = TransDec(out_trans_enc, out_te)
print("Output TransDec", out_trans_dec.shape)

ffn = nn.Linear(64, 30).to("cuda:0")
out_ffn = ffn(out_trans_dec)
print("Output NN", out_ffn.shape)

softmax = nn.functional.softmax(out_ffn, dim=-1)
print("Output Softmax", softmax.shape)

Output SFE: torch.Size([64, 256, 64])
Output TransEnc torch.Size([64, 256, 64])
Output Token Embedding torch.Size([64, 256, 64])
Output TransDec torch.Size([64, 256, 64])
Output NN torch.Size([64, 256, 30])
Output Softmax torch.Size([64, 256, 30])


In [605]:
class Decoder(nn.Module):
    def __init__(self, enc_out, layer):
        super().__init__()  # required before submodules can be registered
        self.enc_out = enc_out
        self.layer = layer
        self.dec_input = TokenEmbedding()  # embeds the target tokens
        for i in range(layer):
            self.add_module(
                f"dec_layer_{i}",  # must match the name looked up in decoder()
                TransformerDecoder(64, 2, 128)
            )
    
    def decoder(self, enc_out, target):
        y = self.dec_input(target)
        for i in range(self.layer):
            dec_layer = getattr(self, f"dec_layer_{i}")
            y = dec_layer(enc_out, y)
        return y

    def forward(self, enc_out, target):
        out = self.decoder(enc_out, target)
        return out

In [608]:
encoder = nn.Sequential(
            SFE,
            *[TransformerEncoder(64, 2, 128) for _ in range(10)]
        ).to("cuda:0")


In [596]:
out_encoder = encoder(x)

In [609]:
decoder = TransDec(out_encoder, TE(y))

In [611]:
decoder.shape

torch.Size([64, 256, 64])

In [532]:
out2 = TransformerEncoder(embed_dim=64, num_heads=2, feed_forward_dim=128).to(device)(out1)
out2.shape

torch.Size([64, 256, 64])

In [533]:
out_y = TokenEmbedding().to(device)(test_data[1])
out_y.shape

torch.Size([64, 256, 64])

In [542]:
encoder = nn.Sequential(
            SpeechFeatureEmbedding(),
            *[TransformerEncoder(64, 2, 128) for _ in range(2)]
        ).to(device)
out_enc = encoder(test_data[0])
out_enc.shape

torch.Size([64, 256, 64])

In [553]:
out_y2 = TransformerDecoder(64, 2, 128).to(device)(out_enc, out_y)
out_y2.shape

torch.Size([64, 256, 64])

In [551]:
y_pred_testing = nn.Linear(64, 30).to(device)(out_y2).shape

In [516]:
my_train_dataset[:1][0].shape

torch.Size([1, 2752, 64])

In [519]:
model = Transformer()
model.to(device)
output_test = model(my_train_dataset[0])

In [507]:
output_test.shape

torch.Size([1, 256, 30])

In [480]:
output_test.shape

torch.Size([256, 30])

In [557]:
source = test_data[0]
target = test_data[1]

dec_input = target[:, :-1]
dec_target = target[:, 1:]

# def train_step(self, batch):
#     """Processes one batch inside the training loop."""
#     source = batch["source"]
#     target = batch["target"]
#     dec_input = target[:, :-1]
#     dec_target = target[:, 1:]

#     self.optimizer.zero_grad()

#     preds = self([source, dec_input])
#     one_hot = torch.nn.functional.one_hot(dec_target, num_classes=self.num_classes)
#     mask = dec_target != 0
#     loss = self.compiled_loss(preds, one_hot.float(), reduction='none')
#     loss = (loss * mask).sum() / mask.sum()

#     loss.backward()
#     self.optimizer.step()

#     self.loss_metric.update(loss.item())

#     return {"loss": self.loss_metric.result()}


In [562]:
dec_target.shape

torch.Size([64, 255])
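
Slicing `target[:, :-1]` and `target[:, 1:]` sets up teacher forcing: at each step the decoder is fed the true characters so far and is asked to predict the next one. A tiny plain-Python illustration on a single encoded label follows; the ids use the vocabulary defined earlier and the example word is arbitrary.

```python
target = [20, 2, 26, 2, 0, 0]      # 's', 'a', 'y', 'a', then padding

dec_input = target[:-1]            # what the decoder is fed
dec_target = target[1:]            # what it must predict at each step

for step, (seen, expected) in enumerate(zip(dec_input, dec_target)):
    print(f"step {step}: input id {seen} -> expected id {expected}")
```

Both sequences are one element shorter than the label, which is why `dec_target` has 255 columns rather than 256.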

In [565]:
one_hot = torch.zeros(dec_target.size(0), dec_target.size(1), 256)
one_hot = one_hot.to(dec_target.device)  # Ensure the tensor is on the same device as dec_target
indices = dec_target.unsqueeze(2).long()  # Convert indices to long data type
one_hot.scatter_(2, indices, 1).shape

torch.Size([64, 255, 256])

In [566]:
mask = dec_target != 0

torch.Size([64, 255])
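
The mask `dec_target != 0` marks the real characters, so padding can be left out of the loss average, as in the commented-out `(loss * mask).sum() / mask.sum()` above. A minimal sketch of that masked mean in plain Python, using made-up per-token loss values and a hypothetical `masked_mean` helper:

```python
def masked_mean(losses, targets, pad_id=0):
    """Average the per-token losses over non-padding targets only."""
    mask = [1.0 if t != pad_id else 0.0 for t in targets]
    kept = sum(mask)
    if kept == 0:
        return 0.0                       # nothing but padding
    return sum(l * m for l, m in zip(losses, mask)) / kept

targets = [2, 26, 2, 0, 0]               # the last two positions are padding
losses = [0.5, 1.5, 1.0, 9.0, 9.0]       # made-up values; padding losses are large
print(masked_mean(losses, targets))
```

`nn.CrossEntropyLoss(ignore_index=0)` with the default mean reduction averages over the non-ignored targets in the same way.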

In [520]:
import torch
import torch.nn as nn
import torch.optim as optim

# Define the model, loss function, and optimizer
model = Transformer().to(device)  # Initialize the Transformer model
criterion = nn.CrossEntropyLoss(ignore_index=0)  # Cross-entropy over characters; index 0 (padding) is ignored
optimizer = optim.Adam(model.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)

def compute_loss(batch_src, batch_tgt):
    outputs = model((batch_src, batch_tgt[:, :-1]))  # Use both source and target for the model input
    # Flatten to (batch * steps, num_classes) and (batch * steps,) for the loss
    logits = outputs.contiguous().view(-1, outputs.size(-1))
    labels = batch_tgt[:, 1:].contiguous().view(-1).long().to(outputs.device)
    return criterion(logits, labels)

# Training loop
def train_step(batch_src, batch_tgt):
    optimizer.zero_grad()
    loss = compute_loss(batch_src, batch_tgt)
    loss.backward()
    optimizer.step()
    return loss.item()

def validate():
    model.eval()
    total_loss = 0.0
    with torch.no_grad():
        for batch_src, batch_tgt in valid_set:  # Validation data loader
            total_loss += compute_loss(batch_src, batch_tgt).item()
    return total_loss / len(valid_set)  # Average loss per batch

num_epochs = 10  # Set the number of training epochs

for epoch in range(num_epochs):
    model.train()
    epoch_loss = 0.0
    for batch_src, batch_tgt in train_set:  # Training data loader
        loss = train_step(batch_src, batch_tgt)
        epoch_loss += loss

    avg_epoch_loss = epoch_loss / len(train_set)
    print(f"Epoch [{epoch+1}/{num_epochs}], Avg Loss: {avg_epoch_loss:.4f}")

    validation_loss = validate()
    print(f"Validation Loss: {validation_loss:.4f}")


RuntimeError: permute(sparse_coo): number of dimensions in the tensor input does not match the length of the desired ordering of dimensions i.e. input.dim() = 4 is not equal to len(dims) = 3
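
The error is consistent with `Transformer.forward` calling `unsqueeze(0)` on an input that is already batched: a batch of shape `(64, 2752, 64)` becomes 4-dimensional, while `x.permute(0, 2, 1)` in `SpeechFeatureEmbedding` expects exactly three dimensions. The sketch below traces the shapes with plain tuples. `unsqueeze0` and `ensure_batched` are hypothetical helpers that only mimic the shape arithmetic, and guarding the `unsqueeze` this way is one possible fix, not one that has been run against this notebook.

```python
def unsqueeze0(shape):
    """Shape after tensor.unsqueeze(0): a new leading dimension of size 1."""
    return (1,) + tuple(shape)

def ensure_batched(shape, unbatched_rank):
    """Add the batch dimension only when the input is a single sample."""
    return unsqueeze0(shape) if len(shape) == unbatched_rank else tuple(shape)

single = (2752, 64)          # one MFCC matrix from the dataset
batch = (64, 2752, 64)       # one batch from the DataLoader

print(unsqueeze0(single))            # rank 3: accepted by permute(0, 2, 1)
print(unsqueeze0(batch))             # rank 4: the case reported in the error
print(ensure_batched(batch, 2))      # rank 3 again with the guard
```

In the model, the equivalent guard would check `source.dim() == 2` (and `target.dim() == 1`) before calling `unsqueeze(0)`, so that both single samples and DataLoader batches reach the encoder with three dimensions.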