# Automatic Speech Recognition Project - Part 4

Now that all the audio files have been converted to .wav format and mapped to their transcriptions in the corresponding .tsv files, the next step is to preprocess the data into the format required for use with the Whisper model.

Here are the preprocessing format details needed:
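- **Input Format (Features)**:
  - Audio files must be processed into fixed-size log-mel feature matrices. The preprocessing code in this notebook pads or truncates every sample to `(360, 810)`, where:
    - `360` is the size of the first axis (the mel-bin axis returned by the feature extractor, zero-padded up to 360).
    - `810` is the number of time frames, chosen from the maximum feature width observed in the data.
  - For reference, `WhisperProcessor` with its default padding returns `(80, 3000)` per sample for `openai/whisper-base` (80 mel bins by 3000 frames, i.e. 30 seconds of audio), which is the shape the pretrained model expects.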
  
- **Label Format (Transcriptions)**:
  - Labels must be formatted as sequences of integer token IDs produced by the Whisper tokenizer (subword tokens rather than whole words or phonemes), with a fixed length of `128` for consistency in batch processing.
  
- **Data Alignment**:
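  - Ensure that every audio sample is paired with exactly one transcription, so features and labels stay aligned row by row after preprocessing. The two do not need the same sequence length: Whisper is an encoder-decoder model, so in this notebook each `(360, 810)` feature matrix is paired with a label sequence of length `128`.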

- **Padding**:
  - For label sequences shorter than `128` tokens, padding is added to reach the fixed length. The code in this notebook pads labels with the tokenizer's pad token ID and pads audio features with zeros.

- **Truncation**:
  - For sequences with lengths greater than `128`, truncation should be applied to shorten the transcription labels to fit the fixed length. This ensures uniformity across all training samples and avoids errors during model processing.
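
The padding and truncation rules above can be sketched on plain Python lists. This is a minimal illustration, not the notebook's implementation: the helper name `pad_or_truncate`, the token IDs, and the pad ID `0` are all made up for the example.

```python
from typing import List


def pad_or_truncate(token_ids: List[int], target_len: int, pad_id: int) -> List[int]:
    """Return a copy of token_ids with exactly target_len entries."""
    clipped = token_ids[:target_len]  # truncation (no-op if already short enough)
    return clipped + [pad_id] * (target_len - len(clipped))  # right-padding


# Hypothetical token IDs and pad ID, for illustration only
short = pad_or_truncate([11, 22, 33], target_len=5, pad_id=0)
long = pad_or_truncate(list(range(10)), target_len=5, pad_id=0)
print(short)  # [11, 22, 33, 0, 0]
print(long)   # [0, 1, 2, 3, 4]
```

In the notebook itself the same idea is applied to tensors, with `target_len=128` and the pad ID taken from `processor.tokenizer.pad_token_id`.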


## Step 1: Installing Libraries and Dependencies

In [None]:
#libraries
!pip install transformers datasets jiwer openai-whisper torch torchvision torchaudio streamlit
!apt-get install ffmpeg
!pip install pydub



In [None]:
#Loading necessary libraries
from transformers import WhisperForConditionalGeneration, WhisperProcessor, Seq2SeqTrainer, Seq2SeqTrainingArguments
from datasets import Dataset, Audio
from jiwer import wer

In [None]:
import os

import pandas as pd  # `Audio` is already imported from `datasets` in the previous cell


In [None]:
!pip install wandb
import wandb

# Log in to W&B. Do not hard-code the API key in the notebook:
# wandb.login() reads the WANDB_API_KEY environment variable or prompts for the key.
wandb.login()




[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: W&B API key is configured. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

In [None]:
# Load Whisper model and processor
model_name = "openai/whisper-base"  # Change to "openai/whisper-large" for better accuracy
model = WhisperForConditionalGeneration.from_pretrained(model_name)
processor = WhisperProcessor.from_pretrained(model_name)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



## Step 2: Load and Preprocess the Dataset





In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# import files
train_file_path = "/content/drive/My Drive/ASR/filtered_train.tsv" #train file
test_file_path = "/content/drive/My Drive/ASR/filtered_test.tsv" #test file
dev_file_path = "/content/drive/My Drive/ASR/filtered_dev.tsv" #validation file

from io import StringIO

def preprocess_file(file_path):
    # Read raw lines from the file
    with open(file_path, 'r', encoding='utf-8') as f:
        raw_data = f.readlines()

    # Replace missing entries and ensure 12 columns per row
    processed_data = []
    for line in raw_data:
        columns = line.strip().split('\t')  # Split line into columns
        # Ensure there are 12 columns by adding empty strings for missing columns
        while len(columns) < 12:
            columns.append("")  # Add empty fields
        processed_data.append('\t'.join(columns))  # Join columns back into a line

    # Convert the cleaned data into a file-like object for pandas
    cleaned_file = StringIO("\n".join(processed_data))
    return cleaned_file


# Preprocess the files
train_cleaned = preprocess_file(train_file_path)
test_cleaned = preprocess_file(test_file_path)
dev_cleaned = preprocess_file(dev_file_path)

# Load the cleaned data into pandas DataFrames
train_df = pd.read_csv(train_cleaned, sep='\t', header = 0)

test_df = pd.read_csv(test_cleaned, sep='\t', header = 0)

dev_df = pd.read_csv(dev_cleaned, sep='\t', header = 0)

# Retain only the necessary columns
required_columns = ['path', 'sentence']
train_df = train_df[required_columns]
test_df = test_df[required_columns]
dev_df = dev_df[required_columns]

# Remove rows where the 'path' column contains the literal string 'path'
train_df = train_df[train_df['path'] != 'path']
test_df = test_df[test_df['path'] != 'path']
dev_df = dev_df[dev_df['path'] != 'path']

# Display the cleaned data
print("Train DataFrame:")
print(train_df.head())

print("Test DataFrame:")
print(test_df.head())

print("Dev DataFrame:")
print(dev_df.head())



Mounted at /content/drive
Train DataFrame:
                            path  \
0  common_voice_luo_40498543.wav   
1  common_voice_luo_40498547.wav   
2  common_voice_luo_40498548.wav   
3  common_voice_luo_40498549.wav   
4  common_voice_luo_40498558.wav   

                                            sentence  
0  bang' Haji to noluwe gi dhako aeto dichuwo ban...  
1                    mar auchiel. Kama nyasi timoree  
2             Tem ahinya iyud batiso mar pi mang’eny  
3                   Jakuo chalo ni odonjo e dala kae  
4                           Jonyuol obiro limo jatuo  
Test DataFrame:
                            path  \
0  common_voice_luo_40545070.wav   
1  common_voice_luo_40545072.wav   
2  common_voice_luo_40545074.wav   
3  common_voice_luo_40545076.wav   
4  common_voice_luo_40545276.wav   

                                            sentence  
0  Tim nonro kuom pek kata chandruok magi kale ko...  
1  Nikech ne ok odewo joma ne piem kode, ne oloyo...  
2           
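
The row-padding step in `preprocess_file` can be seen on a small in-memory example. This is a sketch under stated assumptions: the helper name `pad_tsv_rows`, the three-column layout, and the sample rows are invented for illustration. Like the notebook's function, it strips each line before splitting; `strip()` also removes trailing tabs, so a row whose last fields are empty comes back short and needs padding.

```python
from typing import List


def pad_tsv_rows(text: str, n_cols: int) -> List[List[str]]:
    """Split TSV text into rows and right-pad short rows with empty strings."""
    rows = []
    for line in text.splitlines():
        cols = line.strip().split("\t")        # strip() drops trailing tabs too
        cols += [""] * (n_cols - len(cols))    # no-op when the row is already full
        rows.append(cols)
    return rows


# Hypothetical data: the second row has an empty trailing field
sample = "path\tsentence\tage\nclip_a.wav\thello\t\nclip_b.wav\tworld\tadult\n"
rows = pad_tsv_rows(sample, n_cols=3)
for row in rows:
    print(row)
```

Rows that already have more than `n_cols` fields are left unchanged by this sketch, as they are in the notebook's version.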

In [None]:
# Update paths to include full paths
base_audio_path = "/content/drive/My Drive/clips_wav/"  # Update with audio base path
train_df['path'] = train_df['path'].apply(lambda x: base_audio_path + x)
test_df['path'] = test_df['path'].apply(lambda x: base_audio_path + x)
dev_df['path'] = dev_df['path'].apply(lambda x: base_audio_path + x)

# Display the data with new paths
print("Train DataFrame:")
print(train_df.head())

print("Test DataFrame:")
print(test_df.head())

print("Dev DataFrame:")
print(dev_df.head())


Train DataFrame:
                                                path  \
0  /content/drive/My Drive/clips_wav/common_voice...   
1  /content/drive/My Drive/clips_wav/common_voice...   
2  /content/drive/My Drive/clips_wav/common_voice...   
3  /content/drive/My Drive/clips_wav/common_voice...   
4  /content/drive/My Drive/clips_wav/common_voice...   

                                            sentence  
0  bang' Haji to noluwe gi dhako aeto dichuwo ban...  
1                    mar auchiel. Kama nyasi timoree  
2             Tem ahinya iyud batiso mar pi mang’eny  
3                   Jakuo chalo ni odonjo e dala kae  
4                           Jonyuol obiro limo jatuo  
Test DataFrame:
                                                path  \
0  /content/drive/My Drive/clips_wav/common_voice...   
1  /content/drive/My Drive/clips_wav/common_voice...   
2  /content/drive/My Drive/clips_wav/common_voice...   
3  /content/drive/My Drive/clips_wav/common_voice...   
4  /content/drive/My
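
Before decoding audio, it may be worth confirming that every prefixed path points at a real file. The sketch below uses only the standard library; the helper names `to_full_path` and `missing_files` and the file names are hypothetical, and a temporary directory stands in for the Drive folder.

```python
import os
import tempfile
from typing import Iterable, List


def to_full_path(base_dir: str, filename: str) -> str:
    """Join a base directory and a file name without worrying about slashes."""
    return os.path.join(base_dir, filename)


def missing_files(paths: Iterable[str]) -> List[str]:
    """Return the paths that do not exist on disk."""
    return [p for p in paths if not os.path.isfile(p)]


# Temporary directory with one existing clip and one missing clip
with tempfile.TemporaryDirectory() as base_dir:
    open(os.path.join(base_dir, "clip_a.wav"), "wb").close()
    paths = [to_full_path(base_dir, name) for name in ["clip_a.wav", "clip_b.wav"]]
    missing = missing_files(paths)

print([os.path.basename(p) for p in missing])  # ['clip_b.wav']
```

Applied to the notebook, `missing_files(train_df['path'])` would list any rows whose audio file was not converted or uploaded.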

In [None]:
# Convert a pandas DataFrame into a Hugging Face Dataset
def load_dataset(data):
    dataset = Dataset.from_pandas(data)
    # Decode the files in "path" as audio, resampled to 16 kHz (the rate Whisper expects)
    dataset = dataset.cast_column("path", Audio(sampling_rate=16000))
    return dataset

train_dataset = load_dataset(train_df)
test_dataset = load_dataset(test_df)
dev_dataset = load_dataset(dev_df)

#view length of datasets
print(f"Length of train_dataset: {len(train_dataset)}")
print(f"Length of test_dataset: {len(test_dataset)}")
print(f"Length of dev_dataset: {len(dev_dataset)}")

#View head of dataset
print("Head of train_dataset:", train_dataset[:5])
print("Head of test_dataset:", test_dataset[:5])
print("Head of dev_dataset:", dev_dataset[:5])


Length of train_dataset: 2498
Length of test_dataset: 734
Length of dev_dataset: 1570
Head of train_dataset: {'path': [{'path': '/content/drive/My Drive/clips_wav/common_voice_luo_40498543.wav', 'array': array([ 0.        ,  0.        ,  0.        , ..., -0.00010525,
       -0.00023456,  0.00019149]), 'sampling_rate': 16000}, {'path': '/content/drive/My Drive/clips_wav/common_voice_luo_40498547.wav', 'array': array([ 2.18278728e-11,  4.72937245e-11,  1.81898940e-11, ...,
       -3.63797881e-12, -1.54614099e-11, -9.09494702e-13]), 'sampling_rate': 16000}, {'path': '/content/drive/My Drive/clips_wav/common_voice_luo_40498548.wav', 'array': array([ 1.81898940e-12,  9.09494702e-13, -7.27595761e-12, ...,
        0.00000000e+00,  0.00000000e+00,  3.63797881e-12]), 'sampling_rate': 16000}, {'path': '/content/drive/My Drive/clips_wav/common_voice_luo_40498549.wav', 'array': array([ 1.09139364e-11, -3.31965566e-11, -9.09494702e-12, ...,
       -6.54836185e-11, -1.45519152e-11,  0.00000000e+00])

In [None]:
import torch

# Keep a reference to the audio column before preprocessing removes it.
# Note: on an Audio-cast column this decodes every file, so it can be slow and memory-heavy.
train_paths = train_dataset["path"]
test_paths = test_dataset["path"]

# Define the preprocessing function (operates on a batch of examples)
def preprocess_function(batch):
    # Extract audio arrays and sampling rates from the batch
    audio_arrays = [item["array"] for item in batch["path"]]
    sampling_rates = [item["sampling_rate"] for item in batch["path"]]

    # Process audio into log-mel spectrograms for the entire batch
    inputs = processor(
        audio_arrays,
        sampling_rate=sampling_rates[0],  # Assumes consistent sampling rate in batch
        return_tensors="pt",
        padding=True
    )

    # Initialize lists to store processed features and labels
    input_features_batch = []
    labels_batch = []

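    # Fixed dimensions for consistency.
    # The feature extractor returns (n_mels, n_frames) per sample, with n_mels = 80
    # for whisper-base, so the first axis is zero-padded from 80 up to 360 below.
    max_audio_length = 360  # Target size of the first (mel-bin) axis after padding
    max_feature_width = 810  # Target number of frames, based on the maximum observed feature width
    max_label_length = 128  # Fixed label length used in this notebook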

    for i in range(len(audio_arrays)):
        # Process each audio sample in the batch
        feature = inputs.input_features[i][:max_audio_length, :max_feature_width]  # Truncate both dimensions

        # Pad height if needed
        if feature.shape[0] < max_audio_length:
            height_padding = torch.zeros((max_audio_length - feature.shape[0], feature.shape[1]))
            feature = torch.cat((feature, height_padding), dim=0)

        # Pad width if needed
        if feature.shape[1] < max_feature_width:
            width_padding = torch.zeros((feature.shape[0], max_feature_width - feature.shape[1]))
            feature = torch.cat((feature, width_padding), dim=1)

        input_features_batch.append(feature)

        # Tokenize transcription
        labels = processor.tokenizer(
            batch["sentence"][i],
            return_tensors="pt",
            padding=False,  # Handle padding manually
            truncation=False  # Handle truncation manually
        ).input_ids[0]

        # Truncate/pad labels to consistent length
        labels = labels[:max_label_length]
        if labels.shape[0] < max_label_length:
            label_padding = torch.tensor(
                [processor.tokenizer.pad_token_id] * (max_label_length - labels.shape[0])
            )
            labels = torch.cat((labels, label_padding), dim=0)

        labels_batch.append(labels)

    # Stack features and labels into batch tensors
    batch["input_features"] = torch.stack(input_features_batch).detach()
    batch["labels"] = torch.stack(labels_batch).detach()

    return batch




# Apply batch preprocessing
train_dataset = train_dataset.map(
    preprocess_function,
    remove_columns=["path", "sentence"],
    batched=True,
    batch_size=4,  # Adjust batch size based on available memory
    num_proc=1  # Disable multiprocessing
)

test_dataset = test_dataset.map(
    preprocess_function,
    remove_columns=["path", "sentence"],
    batched=True,
    batch_size=4,
    num_proc=1
)



Map:   0%|          | 0/2498 [00:00<?, ? examples/s]

Map:   0%|          | 0/734 [00:00<?, ? examples/s]
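
The value `max_feature_width = 810` can be related to clip duration with a little arithmetic. This is a minimal sketch, assuming the Whisper feature extractor's default hop length of 160 samples at 16 kHz (one frame every 10 ms); the helper names are hypothetical.

```python
SAMPLING_RATE = 16_000  # Hz, matching Audio(sampling_rate=16000)
HOP_LENGTH = 160        # samples between consecutive frames (assumed Whisper default)


def frames_for_seconds(seconds: float) -> int:
    """Approximate number of log-mel frames for a clip of the given duration."""
    return int(seconds * SAMPLING_RATE) // HOP_LENGTH


def seconds_for_frames(frames: int) -> float:
    """Duration in seconds covered by the given number of frames."""
    return frames * HOP_LENGTH / SAMPLING_RATE


print(frames_for_seconds(30.0))  # 3000
print(seconds_for_frames(810))   # 8.1
```

Under these assumptions, 810 frames correspond to roughly 8.1 seconds of audio, while Whisper's standard 30-second window corresponds to 3000 frames.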

In [None]:
# Set format to PyTorch tensors
train_dataset.set_format(type="torch", columns=["input_features", "labels"])
test_dataset.set_format(type="torch", columns=["input_features", "labels"])

print(type(train_dataset[0]["input_features"]))
print(type(train_dataset[0]["labels"]))


<class 'torch.Tensor'>
<class 'torch.Tensor'>


In [None]:
#verify train_dataset and test_dataset
print(train_dataset)
print(test_dataset)

# Inspect the first few processed samples
print("Head of train_dataset:", train_dataset[:5])
print("Head of test_dataset:", test_dataset[:5])

Dataset({
    features: ['input_features', 'labels'],
    num_rows: 2498
})
Dataset({
    features: ['input_features', 'labels'],
    num_rows: 734
})
Head of train_dataset: {'input_features': tensor([[[-0.6821, -0.6821, -0.6821,  ..., -0.6821, -0.6821, -0.6821],
         [-0.6821, -0.6821, -0.6821,  ..., -0.6821, -0.6821, -0.6821],
         [-0.6821, -0.6821, -0.6821,  ..., -0.5034, -0.6821, -0.6821],
         ...,
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000]],

        [[-0.5947, -0.5947, -0.5947,  ..., -0.5947, -0.5947, -0.5947],
         [-0.5947, -0.5947, -0.5947,  ..., -0.5947, -0.5947, -0.5947],
         [-0.5947, -0.5947, -0.5947,  ..., -0.5947, -0.5947, -0.5947],
         ...,
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.000

In [None]:
# Check shapes of input_features and labels
for idx, sample in enumerate(train_dataset):
    print(f"Sample {idx} input_features shape: {sample['input_features'].shape}")
    print(f"Sample {idx} labels shape: {sample['labels'].shape}")

# Check shapes of input_features and labels
for idx, sample in enumerate(test_dataset):
    print(f"Sample {idx} input_features shape: {sample['input_features'].shape}")
    print(f"Sample {idx} labels shape: {sample['labels'].shape}")


Streaming output truncated to the last 5000 lines.
Sample 732 input_features shape: torch.Size([360, 810])
Sample 732 labels shape: torch.Size([128])
Sample 733 input_features shape: torch.Size([360, 810])
Sample 733 labels shape: torch.Size([128])
Sample 734 input_features shape: torch.Size([360, 810])
Sample 734 labels shape: torch.Size([128])
Sample 735 input_features shape: torch.Size([360, 810])
Sample 735 labels shape: torch.Size([128])
Sample 736 input_features shape: torch.Size([360, 810])
Sample 736 labels shape: torch.Size([128])
Sample 737 input_features shape: torch.Size([360, 810])
Sample 737 labels shape: torch.Size([128])
Sample 738 input_features shape: torch.Size([360, 810])
Sample 738 labels shape: torch.Size([128])
Sample 739 input_features shape: torch.Size([360, 810])
Sample 739 labels shape: torch.Size([128])
Sample 740 input_features shape: torch.Size([360, 810])
Sample 740 labels shape: torch.Size([128])
Sample 741 input_features shape: torch.Size(
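
Printing two lines per sample overflows the output limit, so a more compact check is to count how often each distinct shape occurs. This is a sketch with a hypothetical helper, `summarize_shapes`. It only assumes that each sample is a dict whose values expose a `.shape` attribute, so stand-in objects are used here in place of real tensors.

```python
from collections import Counter
from types import SimpleNamespace


def summarize_shapes(samples, keys=("input_features", "labels")):
    """Count how many samples have each distinct shape, per key."""
    counts = {key: Counter() for key in keys}
    for sample in samples:
        for key in keys:
            counts[key][tuple(sample[key].shape)] += 1
    return counts


# Stand-in samples: any object with a .shape attribute works, including torch tensors
fake_samples = [
    {
        "input_features": SimpleNamespace(shape=(360, 810)),
        "labels": SimpleNamespace(shape=(128,)),
    }
    for _ in range(3)
]
summary = summarize_shapes(fake_samples)
print(dict(summary["input_features"]))  # {(360, 810): 3}
print(dict(summary["labels"]))          # {(128,): 3}
```

Calling `summarize_shapes(train_dataset)` in the notebook should yield a single entry per key if every sample was padded consistently.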

In [None]:
train_dataset.save_to_disk("/content/drive/My Drive/ASR/preprocessed_train")
test_dataset.save_to_disk("/content/drive/My Drive/ASR/preprocessed_test")



Saving the dataset (0/6 shards):   0%|          | 0/2498 [00:00<?, ? examples/s]

Saving the dataset (0/2 shards):   0%|          | 0/734 [00:00<?, ? examples/s]