# Audio Transcription

### I. Data Gathering

The dataset used for this project is Common Voice, a massive multilingual corpus of read speech maintained by Mozilla [1]. This project uses the Indonesian subset of Common Voice Corpus 20.0.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

##### 1. Training Data

In [2]:
train_df = pd.read_csv("../data/cv-corpus-6.1-indonesian/train.tsv", sep = "\t")

In [3]:
train_df.head(5)

|  | client_id | path | sentence | up_votes | down_votes | age | gender | accent | locale | segment |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4c81270f49ada076d376a968994e1533674531b0fae896... | common_voice_id_19192526.mp3 | Kamar adik laki-laki saya lebih sempit daripad... | 2 | 0 | twenties | male |  | id |  |
| 1 | 4c81270f49ada076d376a968994e1533674531b0fae896... | common_voice_id_19192527.mp3 | Ayah akan membunuhku. | 2 | 0 | twenties | male |  | id |  |
| 2 | 4c81270f49ada076d376a968994e1533674531b0fae896... | common_voice_id_19192528.mp3 | Ini pulpen. | 2 | 0 | twenties | male |  | id |  |
| 3 | 4c81270f49ada076d376a968994e1533674531b0fae896... | common_voice_id_19192535.mp3 | Akira pandai bermain tenis. | 2 | 0 | twenties | male |  | id |  |
| 4 | 4c81270f49ada076d376a968994e1533674531b0fae896... | common_voice_id_19192536.mp3 | Dia keluar dari ruangan tanpa mengatakan sepat... | 2 | 1 | twenties | male |  | id |  |


In [4]:
train_df = train_df[train_df["up_votes"] >= train_df["down_votes"]]
train_df = train_df[["path", "sentence"]]

In [5]:
train_df.duplicated().sum()

0

In [6]:
train_df.isna().sum()

path        0
sentence    0
dtype: int64

##### 2. Testing Data

In [7]:
test_df = pd.read_csv("../data/cv-corpus-6.1-indonesian/test.tsv", sep = "\t")

In [8]:
test_df.head(5)

|  | client_id | path | sentence | up_votes | down_votes | age | gender | accent | locale | segment |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 057bf45c0c338db897f5717f744bcac8a2ac2eee990a42... | common_voice_id_22888800.mp3 | Minggu depan kakak perempuan saya menikah. | 2 | 0 |  |  |  | id |  |
| 1 | 0835fbbf1d609a6ed421eef134a48ff06d719121b41f3b... | common_voice_id_24015257.mp3 | Berbagai bahasa daerah dan bahasa asing menjad... | 2 | 1 |  |  |  | id |  |
| 2 | 0c8ac0307f35c73b09d8fc0d92e4c183e3078adee87212... | common_voice_id_24015280.mp3 | apa yang bisa saya berikan kepadamu? | 2 | 0 |  |  |  | id |  |
| 3 | 19285f8e012ad31cad237d53bab348ce59a5cc13684754... | common_voice_id_20425643.mp3 | Inilah dunia kecil. | 2 | 1 |  |  |  | id |  |
| 4 | 3502377c5fb712169a3f2fe5583906e4b3a5ecba27bf2c... | common_voice_id_22185104.mp3 | nol | 2 | 0 |  |  |  | id | Benchmark |


In [9]:
test_df = test_df[test_df["up_votes"] >= test_df["down_votes"]]
test_df = test_df[["path", "sentence"]]

In [10]:
test_df.duplicated().sum()

0

In [11]:
test_df.isna().sum()

path        0
sentence    0
dtype: int64

##### 3. Split Data

In [12]:
valid_df, test_df = train_test_split(test_df, test_size = 0.5, random_state = 42)

### II. Data Preprocessing

In [13]:
import numpy as np
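import scipy.fftpack  # imported explicitly so that scipy.fftpack.fft resolves on every SciPy version
import scipy.signal   # provides get_window for the windowing step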

The data preprocessing techniques used for this project are:
1. Normalization
2. Frame Blocking
3. Windowing
4. Fast Fourier Transform (FFT)
5. Mel Filterbank
6. Discrete Cosine Transform (DCT)

##### 1. Normalization

Normalization rescales the signal to a fixed amplitude range. Here each clip is divided by its peak absolute value, so every sample falls within [-1, 1].

In [14]:
def normalization(audio):
    audio = audio / np.max(np.abs(audio))
    return audio
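
As a quick illustration of peak normalization, the sketch below scales a toy signal and also guards against an all-zero (silent) clip, which would otherwise trigger a division by zero. The helper name `safe_normalization` and the `eps` threshold are assumptions made for this example and are not part of the project code.

```python
import numpy as np

def safe_normalization(audio, eps=1e-12):
    """Peak-normalize to [-1, 1]; return silent input unchanged (hypothetical helper)."""
    peak = np.max(np.abs(audio))
    if peak < eps:
        # An all-zero clip has no peak to scale by, so it is returned as is.
        return audio
    return audio / peak

signal = np.array([0.1, -0.5, 0.25])
normalized = safe_normalization(signal)    # largest magnitude becomes 1
silent = safe_normalization(np.zeros(4))   # stays all zeros, no NaN values
```

Whether silent clips actually occur in the corpus is not established here; the guard only shows one way to keep the function well defined if they do.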

##### 2. Frame Blocking

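Frame blocking splits the speech signal into a series of equal-length frames. A frame typically spans 20-40 ms [2]. In the function below, each frame holds `FFT_SIZE` samples and consecutive frames start `hop_size` milliseconds apart, so neighbouring frames overlap. With the defaults used later (512 samples at 16 kHz), a frame is 32 ms long.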

In [15]:
def frame_blocking(audio, sample_rate, FFT_SIZE, hop_size):
    # Reflect-pad half a frame on each side so that frames are centred on the hop positions.
    audio = np.pad(audio, FFT_SIZE // 2, mode = "reflect")

    # Step between consecutive frame starts, in samples (hop_size is given in milliseconds).
    hop_len = int(np.round(sample_rate * hop_size / 1000))

    frame_num = (len(audio) - FFT_SIZE) // hop_len + 1

    frames = np.zeros((frame_num, FFT_SIZE))

    for i in range(frame_num):
        frames[i] = audio[i * hop_len : i * hop_len + FFT_SIZE]

    return frames
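
The frame count follows directly from the padding and hop arithmetic. The sketch below works it through for one second of placeholder audio, assuming a 16 kHz sample rate, 512-sample frames and a 10 ms hop (the defaults used later in this notebook).

```python
import numpy as np

sample_rate = 16000   # Hz
FFT_SIZE = 512        # samples per frame
hop_size = 10         # milliseconds between frame starts
audio = np.zeros(sample_rate)  # one second of placeholder audio

padded_len = len(audio) + 2 * (FFT_SIZE // 2)            # reflect padding on both sides
hop_len = int(np.round(sample_rate * hop_size / 1000))   # hop expressed in samples
frame_num = (padded_len - FFT_SIZE) // hop_len + 1

frame_ms = 1000 * FFT_SIZE / sample_rate                 # frame duration in milliseconds
print(hop_len, frame_num, frame_ms)  # 160 101 32.0
```

Each frame therefore covers 32 ms while advancing by only 10 ms, so consecutive frames share roughly two thirds of their samples.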

##### 3. Windowing

Windowing tapers each frame toward its edges, which smooths the signal and avoids discontinuities at the frame boundaries [2]. Common window functions include:
- Rectangular Window
- Hamming Window
- Hann (Hanning) Window
- Blackman Window

In [16]:
def windowing(frames, FFT_SIZE, windowing_techniques = "hann"):
    window = scipy.signal.get_window(windowing_techniques, FFT_SIZE)
    audio_window = frames * window

    return audio_window
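
To see how the listed windows differ, the sketch below applies each one to a constant frame and compares the value at the frame edge with the value at the centre. The window names are the identifiers accepted by `scipy.signal.get_window` (`"boxcar"` is SciPy's name for the rectangular window); the constant frame is only an illustrative input.

```python
import numpy as np
import scipy.signal

FFT_SIZE = 512
frame = np.ones(FFT_SIZE)  # a constant frame makes the taper easy to see

edges = {}
centres = {}
for name in ["boxcar", "hamming", "hann", "blackman"]:
    window = scipy.signal.get_window(name, FFT_SIZE)
    tapered = frame * window
    edges[name] = tapered[0]                # value at the start of the frame
    centres[name] = tapered[FFT_SIZE // 2]  # value in the middle of the frame
    print(f"{name:9s} edge={edges[name]:.3f} centre={centres[name]:.3f}")
```

The rectangular window leaves the frame untouched, the Hamming window keeps a small non-zero edge value, and the Hann and Blackman windows taper to (numerically) zero at the edge.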

##### 4. Fast Fourier Transform (FFT)

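The Fast Fourier Transform (FFT) is an algorithm that converts a signal from the time domain into the frequency domain [3]. Because the input frames are real-valued, only the first `FFT_SIZE // 2 + 1` bins (the non-negative frequencies) are kept.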

In [17]:
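def FFT(audio_window, FFT_SIZE):
    # The buffer must be complex: a real-valued array would discard the imaginary part of every bin.
    audio_fft = np.empty((audio_window.shape[0], 1 + FFT_SIZE // 2), dtype = np.complex128)

    for i in range(audio_window.shape[0]):
        # Keep only the non-negative frequency bins (0 .. FFT_SIZE / 2).
        audio_fft[i] = scipy.fftpack.fft(audio_window[i])[:audio_fft.shape[1]]

    return audio_fft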
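
As a small check of what the transform produces, the sketch below analyses a synthetic 1000 Hz tone and locates its peak bin. It uses `np.fft.rfft`, which returns the one-sided spectrum of a real signal directly; this is a swapped-in alternative to slicing the full FFT, and the tone frequency is a hypothetical test value.

```python
import numpy as np

sample_rate = 16000
FFT_SIZE = 512
tone_hz = 1000  # hypothetical test tone

t = np.arange(FFT_SIZE) / sample_rate
frame = np.sin(2 * np.pi * tone_hz * t)

spectrum = np.fft.rfft(frame)      # complex values, length FFT_SIZE // 2 + 1
power = np.abs(spectrum) ** 2      # power spectrum, as used for the mel filterbank
peak_bin = int(np.argmax(power))
peak_hz = peak_bin * sample_rate / FFT_SIZE
print(spectrum.shape, peak_bin, peak_hz)  # (257,) 32 1000.0
```

Each bin spans `sample_rate / FFT_SIZE` = 31.25 Hz, so the 1000 Hz tone lands exactly on bin 32.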

##### 5. Mel Filterbank

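A mel filterbank is a set of overlapping triangular filters spaced evenly on the mel scale. Applying it to the power spectrum approximates the non-linear frequency response of the human auditory system [4].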

In [18]:
def frequency_to_mel(frequency):
    return 2595 * np.log10(1 + (frequency / 700))

In [19]:
def mel_to_frequency(mel):
    return 700 * (10 ** (mel / 2595) - 1)
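
The two conversion formulas are inverses of each other. The sketch below restates them so that it runs on its own, converts a few frequencies to mel and back, and shows that 1000 Hz maps to roughly 1000 mel. The chosen frequencies are illustrative only.

```python
import numpy as np

def frequency_to_mel(frequency):
    return 2595 * np.log10(1 + (frequency / 700))

def mel_to_frequency(mel):
    return 700 * (10 ** (mel / 2595) - 1)

frequencies = np.array([0.0, 1000.0, 4000.0, 8000.0])
mels = frequency_to_mel(frequencies)     # compressive: high frequencies are squeezed together
recovered = mel_to_frequency(mels)       # round trip back to Hz
print(np.round(mels, 1))
```

The 4000 Hz span between 4000 Hz and 8000 Hz covers fewer mels than the span between 0 Hz and 4000 Hz, which is why the filters become wider at high frequencies.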

In [20]:
def get_filter_points(frequency_min, frequency_max, mel_filter_num, FFT_SIZE, sample_rate):
    mel_min = frequency_to_mel(frequency_min)
    mel_max = frequency_to_mel(frequency_max)
    
    mel = np.linspace(mel_min, mel_max, mel_filter_num + 2)
    
    frequency = mel_to_frequency(mel)

    return np.floor((FFT_SIZE + 1) / sample_rate * frequency).astype(int), frequency

In [21]:
def get_filters(filter_points, FFT_SIZE):
    filters = np.zeros((len(filter_points) - 2, int(FFT_SIZE) // 2 + 1))
    
    for i in range(filters.shape[0]):
        filters[i, filter_points[i]:filter_points[i + 1]] = np.linspace(0, 1, filter_points[i + 1] - filter_points[i])
        filters[i, filter_points[i + 1]:filter_points[i + 2]] = np.linspace(1, 0, filter_points[i + 2] - filter_points[i + 1])
        
    return filters        

In [22]:
def mel_filterbank(sample_rate, frequency_min, frequency_max, mel_filter_num, FFT_SIZE):
    filter_points, frequency = get_filter_points(frequency_min, frequency_max, mel_filter_num, FFT_SIZE, sample_rate)
    filters = get_filters(filter_points, FFT_SIZE)

    return filters
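
To make the shape of the filterbank concrete, the sketch below condenses the helper functions above into a single function so that it runs on its own, then builds 26 filters for a 16 kHz signal with 512-point frames. The name `build_filterbank` is an assumption for this example; the parameter values mirror the defaults used later in the notebook.

```python
import numpy as np

def frequency_to_mel(frequency):
    return 2595 * np.log10(1 + frequency / 700)

def mel_to_frequency(mel):
    return 700 * (10 ** (mel / 2595) - 1)

def build_filterbank(sample_rate, frequency_min, frequency_max, mel_filter_num, FFT_SIZE):
    # Band edges: equally spaced in mel, then mapped to FFT bin indices.
    mel = np.linspace(frequency_to_mel(frequency_min), frequency_to_mel(frequency_max), mel_filter_num + 2)
    points = np.floor((FFT_SIZE + 1) / sample_rate * mel_to_frequency(mel)).astype(int)

    filters = np.zeros((mel_filter_num, FFT_SIZE // 2 + 1))
    for i in range(mel_filter_num):
        left, centre, right = points[i], points[i + 1], points[i + 2]
        filters[i, left:centre] = np.linspace(0, 1, centre - left)    # rising edge
        filters[i, centre:right] = np.linspace(1, 0, right - centre)  # falling edge
    return filters, points

filters, points = build_filterbank(16000, 0, 8000, 26, 512)
bandwidths = points[2:] - points[:-2]  # width of each triangle in FFT bins
print(filters.shape)                   # (26, 257)
```

Each row is one triangular filter over the 257 FFT bins, and the triangles widen toward higher frequencies because the band edges are equally spaced in mel rather than in Hz.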

##### 6. Discrete Cosine Transform (DCT)

The Discrete Cosine Transform (DCT) decorrelates the log filterbank energies and yields a compact representation of the spectral envelope of the logarithmically scaled filterbank output [4].

In [23]:
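def DCT(filter_num, filter_len):
    # Orthonormal DCT-II basis: row i holds the i-th cosine basis vector of length filter_len.
    basis = np.empty((filter_num, filter_len))
    
    basis[0, :] = 1 / np.sqrt(filter_len)
    
    samples = np.arange(1, 2 * filter_len, 2) * np.pi / (filter_len * 2)
    
    for i in range(1, filter_num):
        # Rows i >= 1 are scaled by sqrt(2 / filter_len) so that they have unit norm like row 0.
        basis[i, :] = np.cos(i * samples) * np.sqrt(2 / filter_len)
        
    return basis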
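
One way to sanity-check a hand-built DCT basis is to compare it with a library implementation. The sketch below builds an orthonormal DCT-II basis (rows after the first scaled by `sqrt(2 / N)`) and compares the result with `scipy.fft.dct(..., type=2, norm="ortho")`. The helper name `dct_basis` and the random placeholder energies are assumptions for this example.

```python
import numpy as np
import scipy.fft

def dct_basis(num_rows, length):
    """Orthonormal DCT-II basis (hypothetical helper for this comparison)."""
    basis = np.empty((num_rows, length))
    basis[0, :] = 1 / np.sqrt(length)
    samples = np.arange(1, 2 * length, 2) * np.pi / (2 * length)
    for i in range(1, num_rows):
        basis[i, :] = np.cos(i * samples) * np.sqrt(2 / length)
    return basis

rng = np.random.default_rng(0)
log_energies = rng.normal(size=(5, 26))  # placeholder log mel energies: 5 frames, 26 bands

basis = dct_basis(26, 26)
manual = log_energies @ basis.T
reference = scipy.fft.dct(log_energies, type=2, norm="ortho", axis=1)
max_diff = np.max(np.abs(manual - reference))
print(max_diff < 1e-10)
```

With this scaling the basis rows are mutually orthogonal and have unit length, so the transform preserves the energy of each frame.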

### III. Feature Extraction

The feature extraction technique used for this project is Mel Frequency Cepstral Coefficients (MFCC). MFCC is one of the most commonly used feature extraction techniques in speech recognition because it works well on inputs with a high level of correlation [5].

In [24]:
import librosa

In [25]:
def MFCC(audio, sample_rate, FFT_SIZE = 512, hop_size = 10, mel_filter_num = 26, frequency_min = 0, frequency_max = None, num_coefficients = 13):
    if frequency_max is None:
        frequency_max = sample_rate // 2  # Nyquist frequency
    
    audio = normalization(audio)

    frames = frame_blocking(audio, sample_rate, FFT_SIZE, hop_size)

    audio_window = windowing(frames, FFT_SIZE)

    audio_fft = FFT(audio_window, FFT_SIZE)

    filters = mel_filterbank(sample_rate, frequency_min, frequency_max, mel_filter_num, FFT_SIZE)
    
    power_spectrum = np.abs(audio_fft) ** 2 
    
    mel_spectrum = np.dot(power_spectrum, filters.T)

    mel_spectrum_log = np.log(mel_spectrum + 1e-6) 

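    # DCT basis over the mel bands; its first rows produce the cepstral coefficients.
    dct_basis = DCT(mel_filter_num, mel_spectrum_log.shape[1])

    # The result has shape (frames, num_coefficients).
    return dct_basis.dot(mel_spectrum_log.T).T[:, :num_coefficients]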


In [26]:
def apply_mfcc_to_df(df, sample_rate = 16000, FFT_SIZE = 512, hop_size = 10, mel_filter_num = 26, frequency_min = 0, frequency_max = None, num_coefficients = 13):
    mfccs = []

    for index, row in df.iterrows():
        audio_path = f'../data/cv-corpus-6.1-indonesian/clips/{row["path"]}'
        
        audio, sr = librosa.load(audio_path, sr = sample_rate)

        mfcc = MFCC(audio, sr, FFT_SIZE, hop_size, mel_filter_num, frequency_min, frequency_max, num_coefficients)
        
        mfccs.append(mfcc)
    
    df["mfcc"] = mfccs
    return df

In [27]:
train_df = apply_mfcc_to_df(train_df)
valid_df = apply_mfcc_to_df(valid_df)
test_df = apply_mfcc_to_df(test_df)


In [30]:
train_df.to_csv("../data/cv-corpus-6.1-indonesian-processed/train.csv")
valid_df.to_csv("../data/cv-corpus-6.1-indonesian-processed/valid.csv")
test_df.to_csv("../data/cv-corpus-6.1-indonesian-processed/test.csv")
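
One caveat with writing the `mfcc` column to CSV: each cell holds a NumPy array, and a text export stores an array through its string form, which NumPy abbreviates with `...` once the array is large. The sketch below illustrates the abbreviation and shows one possible alternative, saving the arrays in a compressed `.npz` archive keyed by clip name. The clip names, the array sizes and the output file name are hypothetical.

```python
import os
import tempfile
import numpy as np

# Hypothetical stand-ins for per-clip MFCC matrices (frames x 13 coefficients).
rng = np.random.default_rng(0)
mfccs = {
    "clip_a": rng.normal(size=(120, 13)),
    "clip_b": rng.normal(size=(95, 13)),
}

# The string form of a large array is abbreviated, so it cannot be parsed back in full.
as_text = str(mfccs["clip_a"])
is_abbreviated = "..." in as_text

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "train_mfcc.npz")  # hypothetical file name
    np.savez_compressed(out_path, **mfccs)          # binary storage keeps every value
    with np.load(out_path) as restored:
        round_trip_ok = all(np.array_equal(restored[key], value) for key, value in mfccs.items())

print(is_abbreviated, round_trip_ok)
```

The binary archive preserves the variable number of frames per clip, while the `path` and `sentence` columns can still be kept in a CSV file alongside it.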

### IV. Modeling

### V. Model Evaluation

### VI. Conclusion

### VII. References

[1] L. Maison and Y. Estève, “Some voices are too common: Building fair speech recognition systems using the Common Voice dataset.”

[2] M. Labied, A. Belangour, M. Banane, and A. Erraissi, “An overview of Automatic Speech Recognition Preprocessing Techniques,” in 2022 International Conference on Decision Aid Sciences and Applications (DASA), IEEE, Mar. 2022, pp. 804–809. doi: 10.1109/DASA54658.2022.9765043.

[3] R. D. Septiawan, P. R. Rayes, and N. A. Robbaniyyah, “Simulasi Penghilangan Noise pada Sinyal Suara menggunakan Metode Fast Fourier Transfrom,” Semeton Mathematics Journal, vol. 1, no. 1, pp. 1–7, Apr. 2024, doi: 10.29303/semeton.v1i1.203.

[4] M. Y. Wang, Z. Chu, C. Entzminger, Y. Ding, and Q. Zhang, “Visualization and Interpretation of Mel-Frequency Cepstral Coefficients for UAV Drone Audio Data,” in Proceedings of the 13th International Conference on Data Science, Technology and Applications, DATA 2024, SciTePress, 2024, pp. 528–534. doi: 10.5220/0012827400003756.
  

[5] W. Mustikarini, R. Hidayat, and A. Bejo, “Real-Time Indonesian Language Speech Recognition with MFCC Algorithms and Python-Based SVM,” 2019.
  
  