# 📚 Import and Install Required Libraries

In this step, we install and import all the necessary Python libraries
(torch, torchaudio, pydub, etc.) that will be used for audio processing,
resampling, and voice activity detection.

### 🔗 Mount Google Drive

Mount Google Drive into the Colab environment so that audio data and processed files can be accessed and saved directly to your Drive.

In [None]:
from google.colab import drive
drive.mount("/content/drive")

In [None]:
import sys
sys.path.append("/content/drive/MyDrive/Naija-Language-Accent_ID")

### 📦 Install Required Packages

Install additional libraries needed for audio processing and voice activity detection:  
- **silero-vad** → for Voice Activity Detection (speech/silence segmentation)  
- **pydub** → for audio slicing, manipulation, and exporting  
- **soundfile** → for reading and streaming audio files efficiently

In [None]:
!pip install -q silero-vad pydub soundfile

In [None]:
# Import Libraries
import torch
import torchaudio
import torchaudio.transforms as tt
from pydub import AudioSegment
import soundfile as sf
from pathlib import Path
import numpy as np

# 📂 Define Source and Destination Paths

### 📂 Source Paths for Raw Audio Data

Here we set up the source directories in Google Drive that contain the raw `.wav` audio files for each language (Igbo, Hausa, Yoruba).  

A helper function `define_source_paths` is used to iterate through each directory and generate a list of file paths, which will be used later for preprocessing and slicing.

In [None]:
# Define the Google Drive Source Path
source_dir_igbo = Path("/content/drive/MyDrive/Naija-Language-Accent_ID/Data/raw_data/spoken_igbo")
source_dir_hausa = Path("/content/drive/MyDrive/Naija-Language-Accent_ID/Data/raw_data/spoken_hausa")
source_dir_yoruba = Path("/content/drive/MyDrive/Naija-Language-Accent_ID/Data/raw_data/spoken_yoruba")

def define_source_paths(source):
  """Return the entries of `source` as a sorted list of paths."""
  # iterdir() yields entries in arbitrary order, so sort for a reproducible order
  return sorted(source.iterdir())

igbo_source_path = define_source_paths(source_dir_igbo)
hausa_source_path = define_source_paths(source_dir_hausa)
yoruba_source_path = define_source_paths(source_dir_yoruba)

### 📂 Destination Paths for Processed Audio Data

We now specify the Google Drive directories where the processed (sliced and resampled) audio files will be saved for each language (Igbo, Hausa, Yoruba).  

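Using the same `define_source_paths` helper, we list the entries already present in each processed directory. The processing function later pairs these entries with the source files by position and writes each file's chunks into the matching entry, so every processed directory must already contain one subfolder per source file.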

In [None]:
# Define the Google Drive Destination Path
processed_dir_igbo = Path("/content/drive/MyDrive/Naija-Language-Accent_ID/Data/processed_audio/spoken_igbo")
processed_dir_hausa = Path("/content/drive/MyDrive/Naija-Language-Accent_ID/Data/processed_audio/spoken_hausa")
processed_dir_yoruba = Path("/content/drive/MyDrive/Naija-Language-Accent_ID/Data/processed_audio/spoken_yoruba")

igbo_processed_path = define_source_paths(processed_dir_igbo)
hausa_processed_path = define_source_paths(processed_dir_hausa)
yoruba_processed_path = define_source_paths(processed_dir_yoruba)
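
`define_source_paths` only lists entries that already exist, so the destination folders have to be in place before this cell runs. One way to prepare them is sketched below, assuming one output folder named after each source file's stem. The `make_output_dirs` helper and that naming scheme are hypothetical and not part of the original notebook.

```python
from pathlib import Path
import tempfile


def make_output_dirs(source_files, processed_root):
  """Create one output folder per source file and return them in matching order."""
  output_dirs = []
  for source_file in source_files:
    out_dir = Path(processed_root) / Path(source_file).stem
    out_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    output_dirs.append(out_dir)
  return output_dirs


# Demo on a temporary directory instead of Google Drive
with tempfile.TemporaryDirectory() as tmp:
  demo_sources = [Path(tmp) / "raw" / "speaker_b.wav", Path(tmp) / "raw" / "speaker_a.wav"]
  demo_dirs = make_output_dirs(demo_sources, Path(tmp) / "processed")
  demo_names = [d.name for d in demo_dirs]
  demo_all_exist = all(d.is_dir() for d in demo_dirs)

print(demo_names, demo_all_exist)
```

Because the returned list follows the order of `source_files`, pairing the two lists by index stays consistent regardless of how the folders sort on disk.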

# 🎤 Load Silero VAD Model and Utilities

We load the **Silero Voice Activity Detection (VAD)** model directly from
[torch hub](https://pytorch.org/hub/), along with its utility functions.  

- The `model` is the pretrained PyTorch VAD model.  
- `silero_utils` is a tuple of helper functions. We unpack only `get_speech_timestamps`, which will be used to detect and extract speech segments from audio files.  

In [None]:
# 1. Load Silero VAD model 🔊 and its utility functions from torch hub
print("Loading VAD model....")
model, silero_utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
                              model='silero_vad',
                              force_reload=True)

(get_speech_timestamps, _, _, _, _) = silero_utils
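
`get_speech_timestamps` returns a list of dictionaries whose `start` and `end` values are sample indices at the sampling rate passed to it. The sketch below shows how such a list can be converted to seconds. The timestamp values are made up for illustration, and `to_seconds` is a hypothetical helper rather than part of Silero VAD.

```python
SAMPLE_RATE = 16000  # rate used when calling get_speech_timestamps

# Made-up example of the structure returned by get_speech_timestamps
example_timestamps = [
  {"start": 8000, "end": 40000},
  {"start": 64000, "end": 96000},
]


def to_seconds(timestamps, sample_rate):
  """Convert sample-index timestamps to (start, end) pairs in seconds."""
  return [(ts["start"] / sample_rate, ts["end"] / sample_rate) for ts in timestamps]


segments_in_seconds = to_seconds(example_timestamps, SAMPLE_RATE)
total_speech_seconds = sum(end - start for start, end in segments_in_seconds)
print(segments_in_seconds)   # [(0.5, 2.5), (4.0, 6.0)]
print(total_speech_seconds)  # 4.0
```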

# ⚙️ Define Audio Processing Function

We create a reusable function `resample_and_processing` that performs the following steps:

1. **Read and Stream Audio**  
   - Reads each audio file in 30-second blocks so that the VAD pass does not need the whole file in memory.  
   - Converts stereo audio to mono if necessary.  

2. **Resample Audio**  
   - Resamples from the original sample rate to the `16 kHz` target rate required by Silero VAD.  

3. **Apply Voice Activity Detection (VAD)**  
   - Uses `get_speech_timestamps` to detect regions of speech within the audio.  
   - Shifts each block's timestamps by a running sample offset so they refer to positions in the full (resampled) audio file.  

4. **Slice Audio with PyDub**  
   - Loads the full original `.wav` file.  
   - Uses the speech timestamps to slice out individual speech segments.  
   - Resamples each sliced segment to `16 kHz` before saving.  

5. **Save Processed Segments**  
   - Exports all chunks as `.wav` files into the corresponding processed folder for each audio source.  

✅ This function makes it easy to process multiple large audio files automatically and store clean, uniformly formatted speech chunks for training.
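
The timestamp handling in steps 3 and 4 is plain arithmetic: block-relative sample indices are shifted by the number of resampled samples seen so far, then converted to milliseconds for pydub slicing. A minimal sketch with made-up VAD output follows. `shift_timestamps` and `samples_to_ms` are hypothetical helpers used only for illustration.

```python
TARGET_RATE = 16000
CHUNK_SECONDS = 30


def shift_timestamps(chunk_timestamps, sample_offset):
  """Return copies of per-chunk timestamps shifted onto the full-file timeline."""
  return [
    {"start": ts["start"] + sample_offset, "end": ts["end"] + sample_offset}
    for ts in chunk_timestamps
  ]


def samples_to_ms(sample_index, sample_rate):
  """Convert a sample index to whole milliseconds, the unit pydub slices in."""
  return int(sample_index / sample_rate * 1000)


# Made-up VAD output for two consecutive 30-second chunks (indices are chunk-relative)
per_chunk = [
  [{"start": 16000, "end": 48000}],
  [{"start": 0, "end": 32000}],
]

all_timestamps = []
offset = 0
for chunk_timestamps in per_chunk:
  all_timestamps.extend(shift_timestamps(chunk_timestamps, offset))
  offset += CHUNK_SECONDS * TARGET_RATE  # length of one resampled chunk

slices_ms = [
  (samples_to_ms(ts["start"], TARGET_RATE), samples_to_ms(ts["end"], TARGET_RATE))
  for ts in all_timestamps
]
print(slices_ms)  # [(1000, 3000), (30000, 32000)]
```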

In [None]:
# Function: Process and Resample Audio in Chunks from Disk

def resample_and_processing(source_dir, silero_model, timestamp_func, processed_dir):
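  """
  Detect speech in each audio file with Silero VAD and save every speech
  segment as a 16 kHz `.wav` chunk.

  source_dir: list of paths to the source audio files.
  silero_model: the loaded Silero VAD model.
  timestamp_func: the `get_speech_timestamps` utility.
  processed_dir: list of output folders, one per source file, in the same order.
  """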

  for index, source_path in enumerate(source_dir):
    TARGET_RATE = 16000 # The sample rate Silero VAD expects
    print(f"Processing and resampling audio in chunks from: {source_path}")
    file_name = source_path.name

    all_speech_timestamps = []
    current_sample_offset = 0
    chunk_size_seconds = 30

    try:
      with sf.SoundFile(source_path, 'r') as audio_file:
        ORIGINAL_RATE = audio_file.samplerate

        # Resampler object
        resampler = tt.Resample(ORIGINAL_RATE, TARGET_RATE, dtype=torch.float32)

        chunk_size_sample = chunk_size_seconds * ORIGINAL_RATE  # Chunk size based on original rate

        for block in audio_file.blocks(blocksize=chunk_size_sample, dtype='float32', fill_value=0):
          # If the audio is stereo, convert to mono by averaging channels
          if block.ndim > 1:
            block = np.mean(block, axis=1)

          # The 'block' is a numpy array. Convert to a tensor for the model.
          audio_tensor_chunk = torch.from_numpy(block)

          # Resample the chunk
          resampled_chunk = resampler(audio_tensor_chunk)

          # Get timestamps using the RESAMPLED chunk and TARGET rate
          speech_timestamps = timestamp_func(resampled_chunk, silero_model, sampling_rate=TARGET_RATE)

          # Adjust timestamps relative to the RESAMPLED audio timeline
          for ts in speech_timestamps:
            ts['start'] += current_sample_offset
            ts['end'] += current_sample_offset

          all_speech_timestamps.extend(speech_timestamps)

          # Update the offset based on the length of the RESAMPLED chunk
          current_sample_offset += len(resampled_chunk)

      print(f"✅ Found {len(all_speech_timestamps)} speech segments in {source_path}")

    except Exception as e:
      # Skip this file and continue with the next one instead of ending the session
      print(f"Error processing audio file {source_path}: {e}")
      continue

    # Slicing with pydub
    print("Slicing audio file with pydub...")
    original_audio = AudioSegment.from_wav(source_path)

    # When slicing, resample the final output to 16kHz
    # Pydub slices the original file. Set the frame rate of the exported chunk.
    for i, ts in enumerate(all_speech_timestamps):
      # Timestamps are based on the 16kHz timeline, so we convert them to milliseconds
      start_ms = int((ts['start'] / TARGET_RATE) * 1000)
      end_ms = int((ts['end'] / TARGET_RATE) * 1000)

      speech_segment = original_audio[start_ms:end_ms]

      # Resample the final sliced segment before saving
      # This ensures the saved chunk is also 16kHz
      speech_segment = speech_segment.set_frame_rate(TARGET_RATE)

      output_filename = f"{Path(file_name).stem}_chunk_{i}.wav"
      output_path = processed_dir[index] / output_filename

      speech_segment.export(output_path, format="wav")

    print(f"✅ Slicing complete! {len(all_speech_timestamps)} files saved to {processed_dir[index]}")

  print("All resampling and preprocessing has been completed and exported to the respective folders")
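
Because VAD runs on each 30-second block independently, speech that crosses a block boundary is typically reported as two segments, one ending at or near the boundary and one starting at or near it. If that split is unwanted, neighbouring segments can be merged before slicing. The sketch below assumes this is desired; `merge_close_segments` and its `max_gap_samples` parameter are hypothetical and not part of the original function.

```python
def merge_close_segments(timestamps, max_gap_samples=0):
  """Merge consecutive segments whose gap is at most max_gap_samples."""
  merged = []
  for ts in sorted(timestamps, key=lambda t: t["start"]):
    if merged and ts["start"] - merged[-1]["end"] <= max_gap_samples:
      merged[-1]["end"] = max(merged[-1]["end"], ts["end"])
    else:
      merged.append(dict(ts))  # copy so the input list is left untouched
  return merged


# Made-up timestamps: the first two segments meet at the 30-second boundary (480000 samples)
example = [
  {"start": 460000, "end": 480000},
  {"start": 480000, "end": 500000},
  {"start": 640000, "end": 660000},
]
merged_example = merge_close_segments(example)
print(merged_example)
# [{'start': 460000, 'end': 500000}, {'start': 640000, 'end': 660000}]
```

A larger `max_gap_samples` (for example a few hundred milliseconds' worth of samples at 16 kHz) would also join segments separated by short pauses, which changes the chunking beyond boundary repair.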

### ▶️ Run Preprocessing on Igbo Dataset

We now call the `resample_and_processing` function on the **Igbo audio dataset**.

In [None]:
resample_and_processing(source_dir=igbo_source_path,
                        silero_model=model,
                        timestamp_func=get_speech_timestamps,
                        processed_dir=igbo_processed_path)

### ▶️ Run Preprocessing on Hausa Dataset

We now call the `resample_and_processing` function on the **Hausa audio dataset**.

In [None]:
resample_and_processing(source_dir=hausa_source_path,
                        silero_model=model,
                        timestamp_func=get_speech_timestamps,
                        processed_dir=hausa_processed_path)

### ▶️ Run Preprocessing on Yoruba Dataset

We now call the `resample_and_processing` function on the **Yoruba audio dataset**.

In [None]:
resample_and_processing(source_dir=yoruba_source_path,
                        silero_model=model,
                        timestamp_func=get_speech_timestamps,
                        processed_dir=yoruba_processed_path)