<h3>Preprocess the Audio</h3>

In [1]:
!ffmpeg -version

ffmpeg version 2025-06-17-git-ee1f79b0fa-essentials_build-www.gyan.dev Copyright (c) 2000-2025 the FFmpeg developers
built with gcc 15.1.0 (Rev4, Built by MSYS2 project)
configuration: --enable-gpl --enable-version3 --enable-static --disable-w32threads --disable-autodetect --enable-fontconfig --enable-iconv --enable-gnutls --enable-libxml2 --enable-gmp --enable-bzlib --enable-lzma --enable-zlib --enable-libsrt --enable-libssh --enable-libzmq --enable-avisynth --enable-sdl2 --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxvid --enable-libaom --enable-libopenjpeg --enable-libvpx --enable-mediafoundation --enable-libass --enable-libfreetype --enable-libfribidi --enable-libharfbuzz --enable-libvidstab --enable-libvmaf --enable-libzimg --enable-amf --enable-cuda-llvm --enable-cuvid --enable-dxva2 --enable-d3d11va --enable-d3d12va --enable-ffnvcodec --enable-libvpl --enable-nvdec --enable-nvenc --enable-vaapi --enable-openal --enable-libgme --enable-libopenmpt --enable-libopen

In [2]:
from pydub import AudioSegment
import os

In [None]:
input_path = r"C:/Users/HP/Desktop/fyp/data/raw_audio"
output_path = r"C:/Users/HP/Desktop/fyp/data/processed_audio"

# create the output folder if it does not already exist
os.makedirs(output_path, exist_ok=True)

In [5]:
for filename in os.listdir(input_path):
    # match extensions case-insensitively so files such as "clip.MP3" are not skipped
    if filename.lower().endswith((".mp3", ".wav")):
        # load the audio file using pydub
        audio = AudioSegment.from_file(os.path.join(input_path, filename))
        # convert to 16 kHz mono (one channel only) to reduce model complexity
        audio = audio.set_frame_rate(16000).set_channels(1)
        # swap the original extension for .wav
        new_name = os.path.splitext(filename)[0] + ".wav"
        # export the audio to the output folder
        audio.export(os.path.join(output_path, new_name), format="wav")

This step matters because Faster-Whisper works with mono audio sampled at `16kHz`, and exporting every clip as `.wav` gives the model a consistent input format. If this step is skipped, we may run into errors when training or transcribing, since these audio files will be used to fine-tune the model.
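
A quick way to confirm that the conversion worked is to read each file's header back. The sketch below uses Python's standard `wave` module; the helper names `is_16k_mono` and `find_nonconforming` are illustrative and are not part of pydub or Faster-Whisper.

```python
import os
import wave


def is_16k_mono(path, rate=16000, channels=1):
    """Return True if the WAV file has the expected sample rate and channel count."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate() == rate and wav.getnchannels() == channels


def find_nonconforming(folder):
    """Return the sorted names of .wav files in `folder` that are not 16 kHz mono."""
    return sorted(
        name
        for name in os.listdir(folder)
        if name.lower().endswith(".wav")
        and not is_16k_mono(os.path.join(folder, name))
    )
```

Calling `find_nonconforming(output_path)` after the conversion loop should return an empty list if every file was exported as expected.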

<h3>Create the Transcript File</h3>

The transcript file will be created manually in Excel to save time.

In [None]:
import pandas as pd

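# list the processed files once, sorted so the order is reproducible
audio_files = sorted(os.listdir(output_path))
# hand-typed transcripts; rows are assumed to be in the same order as audio_files
texts = pd.read_csv(r"C:/Users/HP/Desktop/fyp/data/audio_text_mapping.csv")["Text"]

data = {
    "Id": list(range(len(audio_files))),
    "Audio_Filename": audio_files,
    "Text": texts.tolist(),
}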


In [12]:
#convert data into a pandas dataframe
df = pd.DataFrame(data)
df.head()

,Id,Audio_Filename,Text
0,0,audio_sample_1.wav,Its images were used by among others Palestini...
1,1,audio_sample_10.wav,"The Argand lamp used whale oil, olive oil and ..."
2,2,audio_sample_100.wav,Alice was default reading to the point where I...
3,3,audio_sample_1000.wav,The Sankwala mountain ranges were first explor...
4,4,audio_sample_1001.wav,A slough is a wetland usually a swamp or shall...
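
The filenames above come out in lexicographic order (`audio_sample_1.wav`, `audio_sample_10.wav`, `audio_sample_100.wav`), so pairing them with transcripts by position only works if the rows of the mapping file follow that same order. If the spreadsheet rows were instead typed in numeric order (an assumption, since the notebook does not say), a natural sort key is one way to line the two up. The helper name `natural_key` below is hypothetical.

```python
import re


def natural_key(filename):
    """Split a name into text and integer chunks so that 2 sorts before 10."""
    return [int(chunk) if chunk.isdigit() else chunk.lower()
            for chunk in re.split(r"(\d+)", filename)]


names = ["audio_sample_10.wav", "audio_sample_2.wav", "audio_sample_1.wav"]
print(sorted(names))                   # lexicographic order
print(sorted(names, key=natural_key))  # numeric order
```

A more robust option would be to store the filename next to each transcript in the mapping file and join on it, which removes the dependence on ordering altogether.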


In [13]:
# save as a tab-separated (TSV) file for the Whisper fine-tuning data
df.to_csv("data.tsv", sep="\t", index=False)

<h3>Split into Train, Test and Eval Sets</h3>

In [14]:
from sklearn.model_selection import train_test_split

#load the data
df = pd.read_csv("data.tsv", sep="\t")

#split into train and temp
train, temp = train_test_split(df, test_size=0.3, random_state=40)

#split temp into eval and test sets (eval_df avoids shadowing the built-in eval)
eval_df, test = train_test_split(temp, test_size=0.5, random_state=40)

In [15]:
#save to tsv file
train.to_csv("train.tsv", sep="\t", index=False)
eval_df.to_csv("eval.tsv", sep="\t", index=False)
test.to_csv("test.tsv", sep="\t", index=False)
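
Splitting twice, first with `test_size=0.3` and then with `test_size=0.5`, works out to roughly a 70/15/15 split. The arithmetic can be sketched as follows; `split_fractions` is an illustrative helper, not a scikit-learn function.

```python
from fractions import Fraction


def split_fractions(first_test_size, second_test_size):
    """Return (train, eval, test) shares for a two-stage split.

    Sizes are passed as strings such as "0.3" so that Fraction stays exact.
    """
    held_out = Fraction(first_test_size)                # share sent to `temp`
    test_share = held_out * Fraction(second_test_size)  # second half of `temp`
    eval_share = held_out - test_share                  # remainder of `temp`
    return 1 - held_out, eval_share, test_share


train_share, eval_share, test_share = split_fractions("0.3", "0.5")
print(float(train_share), float(eval_share), float(test_share))
```

Actual row counts can differ from these shares by one, because each split has to be rounded to whole rows.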