# Audio Transcription

### I. Data Gathering

The dataset used for this project is the Common Voice dataset. Common Voice is a massive multi-lingual corpus of read speech by Mozilla [1]. This project used Common Voice Corpus 20.0 subset for Indonesian language.  

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

##### 1. Training Data

In [2]:
train_df = pd.read_csv("../data/cv-corpus-6.1-indonesian/train.tsv", sep = "\t")

In [3]:
train_df.head(5)

Unnamed: 0,client_id,path,sentence,up_votes,down_votes,age,gender,accent,locale,segment
0,4c81270f49ada076d376a968994e1533674531b0fae896...,common_voice_id_19192526.mp3,Kamar adik laki-laki saya lebih sempit daripad...,2,0,twenties,male,,id,
1,4c81270f49ada076d376a968994e1533674531b0fae896...,common_voice_id_19192527.mp3,Ayah akan membunuhku.,2,0,twenties,male,,id,
2,4c81270f49ada076d376a968994e1533674531b0fae896...,common_voice_id_19192528.mp3,Ini pulpen.,2,0,twenties,male,,id,
3,4c81270f49ada076d376a968994e1533674531b0fae896...,common_voice_id_19192535.mp3,Akira pandai bermain tenis.,2,0,twenties,male,,id,
4,4c81270f49ada076d376a968994e1533674531b0fae896...,common_voice_id_19192536.mp3,Dia keluar dari ruangan tanpa mengatakan sepat...,2,1,twenties,male,,id,


In [4]:
train_df = train_df[train_df["up_votes"] >= train_df["down_votes"]]
train_df["sentence"] = train_df["sentence"].str.lower()
train_df = train_df[["path", "sentence"]]

In [5]:
train_df.duplicated().sum()

0

In [6]:
train_df.isna().sum()

path        0
sentence    0
dtype: int64

##### 2. Testing Data

In [7]:
test_df = pd.read_csv("../data/cv-corpus-6.1-indonesian/test.tsv", sep = "\t")

In [8]:
test_df.head(5)

Unnamed: 0,client_id,path,sentence,up_votes,down_votes,age,gender,accent,locale,segment
0,057bf45c0c338db897f5717f744bcac8a2ac2eee990a42...,common_voice_id_22888800.mp3,Minggu depan kakak perempuan saya menikah.,2,0,,,,id,
1,0835fbbf1d609a6ed421eef134a48ff06d719121b41f3b...,common_voice_id_24015257.mp3,Berbagai bahasa daerah dan bahasa asing menjad...,2,1,,,,id,
2,0c8ac0307f35c73b09d8fc0d92e4c183e3078adee87212...,common_voice_id_24015280.mp3,apa yang bisa saya berikan kepadamu?,2,0,,,,id,
3,19285f8e012ad31cad237d53bab348ce59a5cc13684754...,common_voice_id_20425643.mp3,Inilah dunia kecil.,2,1,,,,id,
4,3502377c5fb712169a3f2fe5583906e4b3a5ecba27bf2c...,common_voice_id_22185104.mp3,nol,2,0,,,,id,Benchmark


In [9]:
test_df = test_df[test_df["up_votes"] >= test_df["down_votes"]]
train_df["sentence"] = train_df["sentence"].str.lower()
test_df = test_df[["path", "sentence"]]

In [10]:
test_df.duplicated().sum()

0

In [11]:
test_df.isna().sum()

path        0
sentence    0
dtype: int64

##### 3. Split Data

In [12]:
valid_df, test_df = train_test_split(test_df, test_size = 0.5, random_state = 42)

### II. Data Preprocessing

The data preprocessing techniques used for this projects are:
1. Normalization
2. Frame Blocking
3. Windowing
4. Fast Fourier Transform (FFT)
5. Mel Filterbank
6. Discrete Cosine Transform (DCT)

In [13]:
import numpy as np
import scipy

##### 1. Normalization

Normalization is a process of adjusting the range of a signal to a certain range.

In [14]:
def normalization(audio):
    audio = audio / np.max(np.abs(audio))
    return audio

##### 2. Frame Blocking

Frame blocking is a process of splitting speech signal into a series of frames with equal length. Usually, the frame length is 20-40 ms [2]. 

In [15]:
def frame_blocking(audio, sample_rate, FFT_SIZE, hop_size):
    audio = np.pad(audio, FFT_SIZE // 2, mode = "reflect")

    frame_len = int(np.round(sample_rate * hop_size / 1000))

    frame_num = (len(audio) - FFT_SIZE) // frame_len + 1

    frames = np.zeros((frame_num, FFT_SIZE))

    for i in range(frame_num):
        frames[i] = audio[i * frame_len : i * frame_len + FFT_SIZE]

    return frames

##### 3. Windowing

Windowing is a process to help smoothing the signal and avoiding signal discontinuity [2]. Windowing techniques are divided into:
- Rectangular Window
- Hamming Window
- Hanning Window
- Blackman Window

In [16]:
def windowing(frames, FFT_SIZE, windowing_techniques = "hann"):
    window = scipy.signal.get_window(windowing_techniques, FFT_SIZE)
    audio_window = frames * window

    return audio_window

##### 4. Fast Fourier Transform (FFT)

Fast Fourier Transform (FFT) is a process of converting signal from time domain into frequency domain [3].

In [17]:
def FFT(audio_window, FFT_SIZE):
    audio_fft = np.empty((audio_window.shape[0], 1 + FFT_SIZE // 2))

    for i in range(audio_window.shape[0]):
        audio_fft[i] = scipy.fftpack.fft(audio_window[i])[:audio_fft.shape[1]]

    return audio_fft

##### 5. Mel Filterbank

Mel filterbank is a process to approximate the non-linear human auditory system's frequency response [4].

In [18]:
def frequency_to_mel(frequency):
    return 2595 * np.log10(1 + (frequency / 700))

In [19]:
def mel_to_frequency(mel):
    return 700 * (10 ** (mel / 2595) - 1)

In [20]:
def get_filter_points(frequency_min, frequency_max, mel_filter_num, FFT_SIZE, sample_rate):
    mel_min = frequency_to_mel(frequency_min)
    mel_max = frequency_to_mel(frequency_max)
    
    mel = np.linspace(mel_min, mel_max, mel_filter_num + 2)
    
    frequency = mel_to_frequency(mel)

    return np.floor((FFT_SIZE + 1) / sample_rate * frequency).astype(int), frequency

In [21]:
def get_filters(filter_points, FFT_SIZE):
    filters = np.zeros((len(filter_points) - 2, int(FFT_SIZE) // 2 + 1))
    
    for i in range(filters.shape[0]):
        filters[i, filter_points[i]:filter_points[i + 1]] = np.linspace(0, 1, filter_points[i + 1] - filter_points[i])
        filters[i, filter_points[i + 1]:filter_points[i + 2]] = np.linspace(1, 0, filter_points[i + 2] - filter_points[i + 1])
        
    return filters        

In [22]:
def mel_filterbank(sample_rate, frequency_min, frequency_max, mel_filter_num, FFT_SIZE):
    filter_points, frequency = get_filter_points(frequency_min, frequency_max, mel_filter_num, FFT_SIZE, sample_rate)
    filters = get_filters(filter_points, FFT_SIZE)

    return filters

##### 6. Discrete Cosine Transform (DCT)

Discrete Cosine Transform (DCT) is a process to decorrelate the filterbank energies and obtain a compact representation of the spectral envelope of the logarithmically-scaled filterbank [4].

In [23]:
def DCT(filter_num, filter_len):
    basis = np.empty((filter_num, filter_len))
    
    basis[0, :] = 1 / np.sqrt(filter_len)
    
    samples = np.arange(1, 2 * filter_len, 2) * np.pi / (filter_len * 2)
    
    for i in range(1, filter_num):
        basis[i, :] = np.cos(i * samples) / np.sqrt(filter_len * 2)
        
    return basis

### III. Feature Extraction

The feature extraction technique used for this project is the Mel Frequency Cepstral Coefficient (MFCC). MFCC is one of the most commonly used feature extraction in speech recognition because it can work well on inputs with a high level of correlation [5].

In [24]:
import librosa

In [25]:
def MFCC(audio, sample_rate, FFT_SIZE = 512, hop_size = 10, mel_filter_num = 26, frequency_min = 0, frequency_max = None, num_coefficients = 13):
    if frequency_max is None:
        frequency_max = sample_rate // 2  #
    
    audio = normalization(audio)

    frames = frame_blocking(audio, sample_rate, FFT_SIZE, hop_size)

    audio_window = windowing(frames, FFT_SIZE)

    audio_fft = FFT(audio_window, FFT_SIZE)

    filters = mel_filterbank(sample_rate, frequency_min, frequency_max, mel_filter_num, FFT_SIZE)
    
    power_spectrum = np.abs(audio_fft) ** 2 
    
    mel_spectrum = np.dot(power_spectrum, filters.T)

    mel_spectrum_log = np.log(mel_spectrum + 1e-6) 

    mfcc = DCT(mel_filter_num, mel_spectrum_log.shape[1])

    return mfcc.dot(mel_spectrum_log.T).T[:, :num_coefficients] 


In [26]:
def apply_mfcc_to_df(df, sample_rate = 16000, FFT_SIZE = 512, hop_size = 10, mel_filter_num = 26, frequency_min = 0, frequency_max = None, num_coefficients = 13):
    mfccs = []

    for index, row in df.iterrows():
        audio_path = f'../data/cv-corpus-6.1-indonesian/clips/{row["path"]}'
        
        audio, sr = librosa.load(audio_path, sr = sample_rate)

        mfcc = MFCC(audio, sr, FFT_SIZE, hop_size, mel_filter_num, frequency_min, frequency_max, num_coefficients)
        
        mfccs.append(mfcc)
    
    df["mfcc"] = mfccs
    return df

In [27]:
train_df = apply_mfcc_to_df(train_df)
valid_df = apply_mfcc_to_df(valid_df)
test_df = apply_mfcc_to_df(test_df)

  audio_fft[i] = scipy.fftpack.fft(audio_window[i])[:audio_fft.shape[1]]
  audio_fft[i] = scipy.fftpack.fft(audio_window[i])[:audio_fft.shape[1]]
  audio_fft[i] = scipy.fftpack.fft(audio_window[i])[:audio_fft.shape[1]]


### IV. Modeling

The model used for this projects are:
- XLSR-53
- Whisper

In [28]:
import numpy as np
import torch

In [29]:
print("PyTorch CUDA Available:", torch.cuda.is_available())
print("PyTorch CUDA Version:", torch.version.cuda)
print("PyTorch cuDNN Enabled:", torch.backends.cudnn.enabled)

PyTorch CUDA Available: True
PyTorch CUDA Version: 11.8
PyTorch cuDNN Enabled: True


In [30]:
device = torch.device("cuda")

In [31]:
device

device(type='cuda')

In [32]:
train_df["mfcc"] = train_df["mfcc"].apply(lambda x: torch.tensor(x, dtype = torch.float32))
valid_df["mfcc"] = valid_df["mfcc"].apply(lambda x: torch.tensor(x, dtype = torch.float32))
test_df["mfcc"] = test_df["mfcc"].apply(lambda x: torch.tensor(x, dtype = torch.float32))

In [33]:
chars = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ`~!@#$%^&*()-_+=|\\]}[{'\":;/?.>,< " 
char_to_idx = {char: idx for idx, char in enumerate(chars)}
char_to_idx["<UNK>"] = len(char_to_idx)

In [34]:
idx_to_char = {idx: char for char, idx in char_to_idx.items()}

In [35]:
char_to_idx

{'a': 0,
 'b': 1,
 'c': 2,
 'd': 3,
 'e': 4,
 'f': 5,
 'g': 6,
 'h': 7,
 'i': 8,
 'j': 9,
 'k': 10,
 'l': 11,
 'm': 12,
 'n': 13,
 'o': 14,
 'p': 15,
 'q': 16,
 'r': 17,
 's': 18,
 't': 19,
 'u': 20,
 'v': 21,
 'w': 22,
 'x': 23,
 'y': 24,
 'z': 25,
 'A': 26,
 'B': 27,
 'C': 28,
 'D': 29,
 'E': 30,
 'F': 31,
 'G': 32,
 'H': 33,
 'I': 34,
 'J': 35,
 'K': 36,
 'L': 37,
 'M': 38,
 'N': 39,
 'O': 40,
 'P': 41,
 'Q': 42,
 'R': 43,
 'S': 44,
 'T': 45,
 'U': 46,
 'V': 47,
 'W': 48,
 'X': 49,
 'Y': 50,
 'Z': 51,
 '`': 52,
 '~': 53,
 '!': 54,
 '@': 55,
 '#': 56,
 '$': 57,
 '%': 58,
 '^': 59,
 '&': 60,
 '*': 61,
 '(': 62,
 ')': 63,
 '-': 64,
 '_': 65,
 '+': 66,
 '=': 67,
 '|': 68,
 '\\': 69,
 ']': 70,
 '}': 71,
 '[': 72,
 '{': 73,
 "'": 74,
 '"': 75,
 ':': 76,
 ';': 77,
 '/': 78,
 '?': 79,
 '.': 80,
 '>': 81,
 ',': 82,
 '<': 83,
 ' ': 84,
 '<UNK>': 85}

In [36]:
class Dataset(torch.utils.data.Dataset):
    def __init__(self, df, max_len=None):
        self.df = df
        self.max_len = max_len

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        mfcc = self.df.iloc[idx]["mfcc"]
        label = self.df.iloc[idx]["sentence"]
        label = "".join([char if char in char_to_idx else "<UNK>" for char in label])

        label_indices = torch.tensor([char_to_idx[char] for char in label], dtype = torch.long)
 
        mfcc_tensor = torch.tensor(mfcc, dtype = torch.float32)

        return {"mfcc": mfcc_tensor, "label": label_indices}

In [37]:
def collate_fn(batch):
    mfccs = [item["mfcc"] for item in batch]
    labels = [item["label"] for item in batch]

    mfccs_padded = torch.nn.utils.rnn.pad_sequence(mfccs, batch_first = True, padding_value = 0)
    label_lengths = torch.tensor([len(label) for label in labels], dtype = torch.long)

    labels_padded = torch.nn.utils.rnn.pad_sequence(labels, batch_first = True, padding_value = 0)

    return {"mfcc": mfccs_padded, "label": labels_padded, "label_lengths": label_lengths}

In [38]:
train_dataset = Dataset(train_df)
valid_dataset = Dataset(valid_df)
test_dataset = Dataset(test_df)

In [39]:
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size = 4, shuffle = True, collate_fn = collate_fn)
valid_loader = torch.utils.data.DataLoader(valid_dataset, batch_size = 4, shuffle = False, collate_fn = collate_fn)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size = 4, shuffle = True, collate_fn = collate_fn)

In [40]:
def train(model, train_loader, optimizer, ctc_loss):
    model.train()
    running_loss = 0.0

    for batch in train_loader:
        input_values = batch["mfcc"].to(device)
        labels = batch["label"].to(device)
        label_lengths = batch["label_lengths"].to(device)

        optimizer.zero_grad()

        output = model(input_values)

        output = output.transpose(0, 1)

        loss = ctc_loss(output, labels, input_lengths = torch.full((input_values.size(0),), output.size(0)), target_lengths = label_lengths)

        loss.backward()
        optimizer.step()

        running_loss += loss.item()

    return running_loss / len(train_loader)

In [41]:
def evaluate(model, valid_loader, ctc_loss):
    model.eval()
    running_loss = 0.0

    with torch.no_grad():
        for batch in valid_loader:
            input_values = batch["mfcc"].to(device)
            labels = batch["label"].to(device)
            label_lengths = batch["label_lengths"].to(device)

            output = model(input_values)

            output = output.transpose(0, 1)

            loss = ctc_loss(output, labels, input_lengths=torch.full((input_values.size(0),), output.size(0)), target_lengths = label_lengths)

            running_loss += loss.item()

    return running_loss / len(valid_loader)

##### 1. XLSR-53

XLSR-53 is a pretrained model built on wav2vec 2.0 thas has been trained in 53 different languages. There are four importance elements, which are Feature Encoder, Quantization Module, Context Network, and Pretraining and Contrasive Loss [6]. 

![XLSR-53 Architecture](../assets/xlsr-53.png)

Fig. 1. XLSR-53 Architecture [6]

On this project, the input layer of XLSR-53 is modified so that it can accept MFCC.

In [42]:
from transformers import Wav2Vec2ForPreTraining

  from .autonotebook import tqdm as notebook_tqdm


In [43]:
class XLSR53(torch.nn.Module):
    def __init__(self, model, n_mfcc = 13, hidden_size = 1024):
        super().__init__()
        self.wav2vec2 = model
        self.mfcc_projection = torch.nn.Linear(n_mfcc, hidden_size)

    def forward(self, input_values):
        input_values = self.mfcc_projection(input_values)
        output = self.wav2vec2.wav2vec2.encoder(input_values).last_hidden_state
        output = torch.nn.functional.log_softmax(output, dim = -1)
        return output

In [44]:
xlsr53 = XLSR53(Wav2Vec2ForPreTraining.from_pretrained("facebook/wav2vec2-large-xlsr-53"))

In [45]:
xlsr53.to(device)

XLSR53(
  (wav2vec2): Wav2Vec2ForPreTraining(
    (wav2vec2): Wav2Vec2Model(
      (feature_extractor): Wav2Vec2FeatureEncoder(
        (conv_layers): ModuleList(
          (0): Wav2Vec2LayerNormConvLayer(
            (conv): Conv1d(1, 512, kernel_size=(10,), stride=(5,))
            (layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
            (activation): GELUActivation()
          )
          (1-4): 4 x Wav2Vec2LayerNormConvLayer(
            (conv): Conv1d(512, 512, kernel_size=(3,), stride=(2,))
            (layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
            (activation): GELUActivation()
          )
          (5-6): 2 x Wav2Vec2LayerNormConvLayer(
            (conv): Conv1d(512, 512, kernel_size=(2,), stride=(2,))
            (layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
            (activation): GELUActivation()
          )
        )
      )
      (feature_projection): Wav2Vec2FeatureProjection(
        (layer_

In [46]:
xlsr53_ctc_loss = torch.nn.CTCLoss()

In [47]:
xlsr53_optimizer = torch.optim.Adam(xlsr53.parameters(), lr = 1e-4)

In [51]:
num_epochs = 20

In [52]:
torch.cuda.empty_cache()
torch.cuda.memory_summary(device = None, abbreviated = False)



In [53]:
best_val_loss = float("inf")
patience = 3
counter = 0

In [54]:
for epoch in range(num_epochs):
    train_loss = train(xlsr53, train_loader, xlsr53_optimizer, xlsr53_ctc_loss)
    val_loss = evaluate(xlsr53, valid_loader, xlsr53_ctc_loss)
    
    print(f"Epoch {epoch + 1}/{num_epochs} - Train Loss: {train_loss:.4f}, Validation Loss: {val_loss:.4f}")

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        counter = 0 

        torch.save({
            "epoch": epoch + 1,
            "model_state_dict": xlsr53.state_dict(),
            "optimizer_state_dict": xlsr53_optimizer.state_dict(),
            "loss": val_loss
        }, "../models/best_model.pth")
        
    else:
        counter += 1
        if counter >= patience:
            print("Early stopping triggered. Stopping training.")
            break

  mfcc_tensor = torch.tensor(mfcc, dtype = torch.float32)


Epoch 1/20 - Train Loss: 19.6626, Validation Loss: 12.8608
Epoch 2/20 - Train Loss: 8.5388, Validation Loss: 7.2756
Epoch 3/20 - Train Loss: 5.8260, Validation Loss: 5.5134
Epoch 4/20 - Train Loss: 4.7475, Validation Loss: 4.6750
Epoch 5/20 - Train Loss: 3.9206, Validation Loss: 4.2789
Epoch 6/20 - Train Loss: 3.3557, Validation Loss: 4.1368
Epoch 7/20 - Train Loss: 2.8991, Validation Loss: 4.1675
Epoch 8/20 - Train Loss: 2.6377, Validation Loss: 3.9332
Epoch 9/20 - Train Loss: 2.4177, Validation Loss: 4.0265
Epoch 10/20 - Train Loss: 2.2358, Validation Loss: 3.8659
Epoch 11/20 - Train Loss: 2.1321, Validation Loss: 3.8594
Epoch 12/20 - Train Loss: 1.9989, Validation Loss: 3.7178
Epoch 13/20 - Train Loss: 1.9005, Validation Loss: 3.8757
Epoch 14/20 - Train Loss: 1.8206, Validation Loss: 2.9413
Epoch 15/20 - Train Loss: 1.7607, Validation Loss: 3.1276
Epoch 16/20 - Train Loss: 1.7080, Validation Loss: 3.0639
Epoch 17/20 - Train Loss: 1.7202, Validation Loss: 2.7201
Epoch 18/20 - Train L

##### 2. Whisper

### V. Model Evaluation

### VI. Conclusion

### VII. References

[1] L. Maison and Y. Estève, “Some voices are too common: Building fair speech recognition systems using the Common Voice dataset.”

[2] M. Labied, A. Belangour, M. Banane, and A. Erraissi, “An overview of Automatic Speech Recognition Preprocessing Techniques,” in 2022 International Conference on Decision Aid Sciences and Applications (DASA), IEEE, Mar. 2022, pp. 804–809. doi: 10.1109/DASA54658.2022.9765043.

[3] R. D. Septiawan, P. R. Rayes, and N. A. Robbaniyyah, “Simulasi Penghilangan Noise pada Sinyal Suara menggunakan Metode Fast Fourier Transfrom,” Semeton Mathematics Journal, vol. 1, no. 1, pp. 1–7, Apr. 2024, doi: 10.29303/semeton.v1i1.203.

[4] M. Y. Wang, Z. Chu, C. Entzminger, Y. Ding, and Q. Zhang, “Visualization and Interpretation of Mel-Frequency Cepstral Coefficients for UAV Drone Audio Data,” in Proceedings of the 13th International Conference on Data Science, Technology and Applications, DATA 2024, SciTePress, 2024, pp. 528–534. doi: 10.5220/0012827400003756.
  
[5] W. Mustikarini, R. Hidayat, and A. Bejo, “Real-Time Indonesian Language Speech Recognition with MFCC Algorithms and Python-Based SVM,” 2019.

[6] P. Arisaputra and A. Zahra, “Indonesian Automatic Speech Recognition with XLSR-53,” Ingénierie des systèmes d information, vol. 27, no. 6, pp. 973–982, Dec. 2022, doi: 10.18280/isi.270614.
  
  