# Automatic Speech Recognition Project - Part 3
Due to runtime constraints, I could not convert all 22,000 audio files in the dataset from `.mp3` to `.wav` in the previous step, which left roughly 13,000 converted files (12,962, as counted in Step 2) to work with.

To keep the transcriptions consistent with the converted audio, I updated the paths in the train, test, and dev `.tsv` files, which contain the transcription data. This took two steps: mapping each `.mp3` path to its `.wav` name, then filtering the rows to keep only those whose `.wav` file was actually converted. The filtered transcription files therefore reference only audio that exists, so the dataset is ready for the training, testing, and development phases of the ASR model.
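
The mapping and filtering described above can be sketched on toy data. This is a minimal illustration rather than the notebook's exact code: the clip names and the `converted` set are made up, and only the `path` column name comes from the dataset.

```python
import pandas as pd

# Hypothetical transcription table; only the 'path' column name matches the real TSV files.
df = pd.DataFrame({
    "path": ["clip_a.mp3", "clip_b.mp3", "clip_c.mp3"],
    "sentence": ["first", "second", "third"],
})

# Hypothetical set of base names whose conversion to .wav succeeded.
converted = {"clip_a", "clip_c"}

# Map: point each row at the .wav version of its clip.
df["path"] = df["path"].str.replace(".mp3", ".wav", regex=False)

# Filter: keep only the rows whose clip was actually converted.
stems = df["path"].str.replace(".wav", "", regex=False)
filtered = df[stems.isin(converted)].reset_index(drop=True)

print(filtered)
```

The row for `clip_b` is dropped because no converted file exists for it, which is the same effect the steps below have on the real train, test, and dev tables.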

## Step 1: Import the Necessary Libraries

In [None]:
# Import libraries
import pandas as pd
import os

# Import the drive module from Google Colab
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')


MessageError: Error: credential propagation was unsuccessful

## Step 2: Convert all 'paths' to .wav format

In [None]:
# Paths to your files
audio_folder = "/content/drive/My Drive/clips_wav"

# Get the set of base names (file names without extension) of the .wav files in 'clips_wav'
audio_files = set(f.split('.')[0] for f in os.listdir(audio_folder) if f.endswith('.wav'))
print(f"Found {len(audio_files)} audio files.")

Found 12962 audio files.
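
One caveat with `f.split('.')[0]`: it keeps only the text before the first dot, so a file name with an extra dot in it would be truncated. The Common Voice names shown in this notebook contain a single dot, so the result is the same here. As a hedged alternative, `os.path.splitext` strips only the final extension. The file names below are invented for illustration.

```python
import os

# Invented file names; the second one has an extra dot in its stem.
names = ["common_voice_luo_1.wav", "speaker.01.wav", "notes.txt"]

# First-dot split: truncates "speaker.01" to "speaker".
by_split = {n.split(".")[0] for n in names if n.endswith(".wav")}

# splitext: removes only the final extension, keeping "speaker.01" intact.
by_splitext = {os.path.splitext(n)[0] for n in names if n.endswith(".wav")}

print(sorted(by_split))
print(sorted(by_splitext))
```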


In [None]:
import pandas as pd

# Paths to your TSV files
devtsv_path = "/content/drive/My Drive/ASR/dev.tsv"
traintsv_path = "/content/drive/My Drive/ASR/train.tsv"
testtsv_path = "/content/drive/My Drive/ASR/test.tsv"

# Paths to save filtered files
filtered_dev_path = "/content/drive/My Drive/ASR/filtered_dev.tsv"
filtered_train_path = "/content/drive/My Drive/ASR/filtered_train.tsv"
filtered_test_path = "/content/drive/My Drive/ASR/filtered_test.tsv"

# Function to read TSV files into DataFrames
def read_tsv_resiliently(file_path):
    try:
        # Read the TSV file, skipping bad lines and using the 'python' engine for compatibility
        df = pd.read_csv(file_path, sep='\t', on_bad_lines='skip', engine='python')
        print(f"Successfully loaded {file_path}")
        return df
    except Exception as e:
        print(f"Error reading {file_path}: {e}")
        return None

# Read the TSV files
train_df = read_tsv_resiliently(traintsv_path)
test_df = read_tsv_resiliently(testtsv_path)
dev_df = read_tsv_resiliently(devtsv_path)

# Optional: Print the first few rows of the DataFrames to verify
if train_df is not None:
    print(train_df.head())
if test_df is not None:
    print(test_df.head())
if dev_df is not None:
    print(dev_df.head())


Successfully loaded /content/drive/My Drive/ASR/train.tsv
Successfully loaded /content/drive/My Drive/ASR/test.tsv
Successfully loaded /content/drive/My Drive/ASR/dev.tsv
                                           client_id  \
0  76220259d4c614b1876412e489524ede62b700927b202d...   
1  76220259d4c614b1876412e489524ede62b700927b202d...   
2  76220259d4c614b1876412e489524ede62b700927b202d...   
3  76220259d4c614b1876412e489524ede62b700927b202d...   
4  76220259d4c614b1876412e489524ede62b700927b202d...   

                            path  \
0  common_voice_luo_40228998.mp3   
1  common_voice_luo_40228999.mp3   
2  common_voice_luo_40229003.mp3   
3  common_voice_luo_40229004.mp3   
4  common_voice_luo_40229005.mp3   

                                         sentence_id  \
0  0b6d23f21850cf38f2da23b85e5b9d243e327f4e1a987c...   
1  0093b37219cd1409b042fc7c111546d1338a34dfa32d04...   
2  11417bdb395c6a8e9095f673662428f371c1ffea41a4d7...   
3  1199d8ab1254d0772d3efa7f1f6113881809fc6c80aff5..

In [None]:
# Change the .mp3 extension to .wav in the 'path' column for all DataFrames
def replace_mp3_with_wav(df):
    if df is not None:
        # Replace '.mp3' with '.wav' in the 'path' column
        df['path'] = df['path'].str.replace('.mp3', '.wav', regex=False)
        print("Replaced .mp3 with .wav in the 'path' column.")
    return df

# Apply the function to each DataFrame
train_df = replace_mp3_with_wav(train_df)
test_df = replace_mp3_with_wav(test_df)
dev_df = replace_mp3_with_wav(dev_df)

# Optional: Print the updated first few rows of the DataFrames to verify
if train_df is not None:
    print(train_df.head())
if test_df is not None:
    print(test_df.head())
if dev_df is not None:
    print(dev_df.head())


Replaced .mp3 with .wav in the 'path' column.
Replaced .mp3 with .wav in the 'path' column.
Replaced .mp3 with .wav in the 'path' column.
                                           client_id  \
0  76220259d4c614b1876412e489524ede62b700927b202d...   
1  76220259d4c614b1876412e489524ede62b700927b202d...   
2  76220259d4c614b1876412e489524ede62b700927b202d...   
3  76220259d4c614b1876412e489524ede62b700927b202d...   
4  76220259d4c614b1876412e489524ede62b700927b202d...   

                            path  \
0  common_voice_luo_40228998.wav   
1  common_voice_luo_40228999.wav   
2  common_voice_luo_40229003.wav   
3  common_voice_luo_40229004.wav   
4  common_voice_luo_40229005.wav   

                                         sentence_id  \
0  0b6d23f21850cf38f2da23b85e5b9d243e327f4e1a987c...   
1  0093b37219cd1409b042fc7c111546d1338a34dfa32d04...   
2  11417bdb395c6a8e9095f673662428f371c1ffea41a4d7...   
3  1199d8ab1254d0772d3efa7f1f6113881809fc6c80aff5...   
4  058ef183ec6c447b759aab31e
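
Note that `str.replace('.mp3', '.wav', regex=False)` replaces every occurrence of `.mp3` in a string, not just a trailing extension. That makes no difference for the paths shown above. As a sketch of a stricter variant, an anchored regular expression changes only the suffix. The sample paths below are hypothetical.

```python
import pandas as pd

# Hypothetical paths; the second one contains ".mp3" in the middle of the name.
paths = pd.Series(["clip_1.mp3", "my.mp3.backup.mp3", "clip_2.wav"])

# Literal replacement: rewrites every ".mp3" occurrence.
literal = paths.str.replace(".mp3", ".wav", regex=False)

# Anchored replacement: rewrites ".mp3" only at the end of the string.
anchored = paths.str.replace(r"\.mp3$", ".wav", regex=True)

print(literal.tolist())
print(anchored.tolist())
```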

## Step 3: Filter for existing paths

In [None]:
import os

# Get the list of audio files (wav files) in the audio folder
audio_files = set(f.replace(".wav", "") for f in os.listdir(audio_folder) if f.endswith(".wav"))

# Define a function to compare paths and audio file names
def compare_paths_with_audio_files(df):
    # Strip the ".wav" extension and any directory prefix so the base name can be compared
    df['base_name'] = df['path'].apply(lambda x: x.replace(".wav", "").split("/")[-1])  # Extract base name
    matches = df[df['base_name'].isin(audio_files)]  # Check if base name matches any audio file name
    return matches

# Compare paths in each dataset
dev_matches = compare_paths_with_audio_files(dev_df)
train_matches = compare_paths_with_audio_files(train_df)
test_matches = compare_paths_with_audio_files(test_df)

# Print the counts of matching paths
print(f"Matching entries in dev dataset: {len(dev_matches)}")
print(f"Matching entries in train dataset: {len(train_matches)}")
print(f"Matching entries in test dataset: {len(test_matches)}")

# Print the path entry and corresponding audio name for each dataset
print("\nSample matches in dev dataset (path and audio name):")
print(dev_matches[['path', 'base_name']].head())

print("\nSample matches in train dataset (path and audio name):")
print(train_matches[['path', 'base_name']].head())

print("\nSample matches in test dataset (path and audio name):")
print(test_matches[['path', 'base_name']].head())


Matching entries in dev dataset: 1570
Matching entries in train dataset: 2498
Matching entries in test dataset: 734

Sample matches in dev dataset (path and audio name):
                            path                  base_name
0  common_voice_luo_40556704.wav  common_voice_luo_40556704
1  common_voice_luo_40556707.wav  common_voice_luo_40556707
2  common_voice_luo_40558345.wav  common_voice_luo_40558345
3  common_voice_luo_40558364.wav  common_voice_luo_40558364
4  common_voice_luo_40558389.wav  common_voice_luo_40558389

Sample matches in train dataset (path and audio name):
                              path                  base_name
762  common_voice_luo_40498543.wav  common_voice_luo_40498543
763  common_voice_luo_40498547.wav  common_voice_luo_40498547
764  common_voice_luo_40498548.wav  common_voice_luo_40498548
765  common_voice_luo_40498549.wav  common_voice_luo_40498549
767  common_voice_luo_40498558.wav  common_voice_luo_40498558

Sample matches in test dataset (path and 

In [None]:
# Save the matching entries for each dataset to new .tsv files
dev_matches.to_csv(filtered_dev_path, sep="\t", index=False)
train_matches.to_csv(filtered_train_path, sep="\t", index=False)
test_matches.to_csv(filtered_test_path, sep="\t", index=False)
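
As a final sanity check, the saved files can be read back to confirm that every remaining `path` has a matching `.wav` file on disk. The sketch below is self-contained: it uses a temporary directory with made-up clips, and the helper name `missing_clips` is hypothetical, not part of the notebook.

```python
import os
import tempfile

import pandas as pd


def missing_clips(tsv_path, audio_dir):
    """Return the 'path' values in the TSV that have no file in audio_dir."""
    df = pd.read_csv(tsv_path, sep="\t")
    on_disk = set(os.listdir(audio_dir))
    return df.loc[~df["path"].isin(on_disk), "path"].tolist()


with tempfile.TemporaryDirectory() as tmp:
    # Create two empty stand-in .wav files.
    for name in ("clip_a.wav", "clip_b.wav"):
        open(os.path.join(tmp, name), "wb").close()

    # Write a toy filtered TSV that references one clip that does not exist.
    tsv_path = os.path.join(tmp, "filtered_toy.tsv")
    pd.DataFrame({"path": ["clip_a.wav", "clip_b.wav", "clip_c.wav"]}).to_csv(
        tsv_path, sep="\t", index=False
    )

    missing = missing_clips(tsv_path, tmp)

print(missing)
```

On the real data, a call such as `missing_clips(filtered_train_path, audio_folder)` returning an empty list would suggest that the filtered file and the audio folder agree.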
