# Transcribe_audio notebook

## Purpose
---
- Uses OpenAI's whisper-large-v3 model to automatically transcribe a set of sample audio files.
- The Coqui AI XTTS fine-tuning process requires a text transcription for each audio file. Writing these transcriptions by hand for samples that lack one would be slow and difficult.
- Will also be used to annotate speech from personal audio samples.
---

## How to use
---
- Requires torch and Hugging Face's transformers library to run the whisper-large-v3 model, plus FFmpeg for audio decoding.
- Define an input directory path that contains all of your .wav audio files.
- Define an output path for a CSV file. Each audio file's name and transcription are collected during transcription and then written to this CSV, which can be used as the metadata file for the fine-tuning process.

---
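The metadata layout described above can be sketched in isolation. This is a minimal illustration, not the notebook's own code: `ljspeech_row` and `rows_to_metadata` are hypothetical helper names, and stripping the surrounding whitespace from the transcript is an assumption that the notebook itself does not make.

```python
import csv
import io
import os


def ljspeech_row(file_name, transcript):
    """Build one (id, transcript, normalized transcript) row."""
    # The id column is the file name without its extension
    stem = os.path.splitext(os.path.basename(file_name))[0]
    # Assumption: trim surrounding whitespace from the model output
    text = transcript.strip()
    # The normalized column simply duplicates the transcript
    return (stem, text, text)


def rows_to_metadata(rows):
    """Serialize rows as pipe-delimited text with no header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="|", lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


example = rows_to_metadata([ljspeech_row("chunk_0000.wav", " Hello there.")])
print(example)
```

Each line of the result has three pipe-separated fields, which matches the LJ Speech style rows built later in the notebook.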

In [1]:
'''Requires FFmpeg to be installed for the Whisper model'''
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
import csv
import os
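
Since the pipeline depends on FFmpeg, it can help to confirm the executable is reachable before loading a multi-gigabyte model. The check below is a small sketch using only the standard library; `ffmpeg_available` is a hypothetical helper name, and it assumes the executable is called `ffmpeg` and is expected on `PATH`.

```python
import shutil


def ffmpeg_available():
    """Return True if an `ffmpeg` executable is found on PATH."""
    return shutil.which("ffmpeg") is not None


if not ffmpeg_available():
    # Assumption: a missing executable means audio decoding will fail later
    print("ffmpeg not found on PATH; install it before running the pipeline.")
```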

In [3]:
'''Load in Whisper model using transformers API'''
# Set device and torch data type
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
print(device)

# Define chunk size used in seconds
chunkSize = 10

# Model identifier
model_id = "openai/whisper-large-v3"  # Was about 3 GB

# Load the model and move it to the selected device
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, 
    torch_dtype=torch_dtype, 
    low_cpu_mem_usage=False, 
    use_safetensors=True
)
model.to(device)

# Load the processor
processor = AutoProcessor.from_pretrained(model_id, language='en')

# Create the speech recognition pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=chunkSize,
    batch_size=32, 
    return_timestamps=True,
    torch_dtype=torch_dtype,
    device=device,
)

cuda:0


Device set to use cuda:0
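
The language warning that appears in the transcription log suggests that `language='en'` given to `AutoProcessor.from_pretrained` may not reach the generation step. One option, sketched here under that assumption, is to pass the language per call through the pipeline's `generate_kwargs` argument. `transcribe_english` is a hypothetical helper, and `fake_pipe` is only a stand-in so the sketch runs without downloading the model; with the real `pipe` the returned text will of course differ.

```python
def transcribe_english(asr_pipe, audio_path):
    """Call an ASR pipeline while asking Whisper to transcribe in English."""
    result = asr_pipe(
        audio_path,
        # Forwarded to the model's generate() call by the pipeline
        generate_kwargs={"language": "english", "task": "transcribe"},
    )
    return result["text"]


def fake_pipe(audio_path, generate_kwargs=None):
    """Stand-in for the real pipeline; echoes the arguments it received."""
    return {"text": f" [{generate_kwargs['language']}] {audio_path}"}


print(transcribe_english(fake_pipe, "chunks/chunk_0000.wav"))
```

With the real pipeline, the call would be `transcribe_english(pipe, audioPath)`.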


In [4]:
'''Set up paths for input and output files'''
# Define a path to an output CSV to save transcriptions
outputPath = "datasets/metadata.csv"

# Define where sample audio files are coming from
audioDir = "chunks/"

# Read in all files from chosen dir
fileList = os.listdir(audioDir)
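
`os.listdir` returns every entry in the directory in arbitrary order, so a stray non-audio file would also be sent to the pipeline. A hedged alternative is to keep only `.wav` files and sort them. `list_wav_files` is a hypothetical helper, and the temporary directory below exists only to demonstrate it.

```python
import tempfile
from pathlib import Path


def list_wav_files(audio_dir):
    """Return sorted names of .wav files directly inside audio_dir."""
    return sorted(
        p.name
        for p in Path(audio_dir).iterdir()
        if p.is_file() and p.suffix.lower() == ".wav"
    )


# Demonstration on a throwaway directory with one non-audio file
with tempfile.TemporaryDirectory() as tmp:
    for name in ("chunk_0001.wav", "chunk_0000.WAV", "notes.txt"):
        Path(tmp, name).touch()
    demo = list_wav_files(tmp)

print(demo)
```

In the notebook this would be used as `fileList = list_wav_files(audioDir)`.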


In [5]:
'''Transcribe sample files'''
# Init list to hold all samples
samples = []

# Loop through every file found in the audio directory
for fileName in fileList:
    # Build the path to the local .wav file
    audioPath = os.path.join(audioDir, fileName)
    # Message to show transcription is proceeding
    print(f"Transcribing {fileName}...")
    # Run the pipeline on the .wav file and keep only the text
    result = pipe(audioPath)["text"]
    # LJ Speech format: (file name without extension, transcript, normalized transcript)
    # No need to normalize when fine-tuning, so the transcript is duplicated in the 3rd column
    samples.append((os.path.splitext(fileName)[0], result, result))


Due to a bug fix in https://github.com/huggingface/transformers/pull/28687 transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English.This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`.


Transcribing chunk_0000.wav...


The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Transcribing chunk_0001.wav...
Transcribing chunk_0002.wav...
Transcribing chunk_0003.wav...
Transcribing chunk_0004.wav...
Transcribing chunk_0005.wav...
Transcribing chunk_0006.wav...
Transcribing chunk_0007.wav...
Transcribing chunk_0008.wav...
Transcribing chunk_0009.wav...


You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


Transcribing chunk_0010.wav...
Transcribing chunk_0011.wav...
Transcribing chunk_0012.wav...
Transcribing chunk_0013.wav...
Transcribing chunk_0014.wav...
Transcribing chunk_0015.wav...
Transcribing chunk_0016.wav...
Transcribing chunk_0017.wav...
Transcribing chunk_0018.wav...
Transcribing chunk_0019.wav...
Transcribing chunk_0020.wav...
Transcribing chunk_0021.wav...
Transcribing chunk_0022.wav...
Transcribing chunk_0023.wav...
Transcribing chunk_0024.wav...
Transcribing chunk_0025.wav...
Transcribing chunk_0026.wav...
Transcribing chunk_0027.wav...
Transcribing chunk_0028.wav...
Transcribing chunk_0029.wav...
Transcribing chunk_0030.wav...
Transcribing chunk_0031.wav...
Transcribing chunk_0032.wav...
Transcribing chunk_0033.wav...
Transcribing chunk_0034.wav...
Transcribing chunk_0035.wav...
Transcribing chunk_0036.wav...
Transcribing chunk_0037.wav...
Transcribing chunk_0038.wav...
Transcribing chunk_0039.wav...
Transcribing chunk_0040.wav...
Transcribing chunk_0041.wav...
Transcri
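
The log above includes a notice that the pipeline is being called sequentially on the GPU. One way to act on it, sketched here as an assumption and not as the notebook's method, is to hand the pipeline a list of paths in a single call, which returns one result dict per input. `transcribe_batch` is a hypothetical helper and `fake_pipe` is a stand-in so the sketch runs without the model.

```python
import os


def transcribe_batch(asr_pipe, audio_dir, file_names):
    """Transcribe many files in one pipeline call and build metadata rows."""
    paths = [os.path.join(audio_dir, name) for name in file_names]
    # A list input yields a list of result dicts in the same order
    results = asr_pipe(paths)
    return [
        (os.path.splitext(name)[0], out["text"], out["text"])
        for name, out in zip(file_names, results)
    ]


def fake_pipe(paths):
    """Stand-in for the real pipeline; returns one dict per path."""
    return [{"text": f" text for {os.path.basename(p)}"} for p in paths]


rows = transcribe_batch(fake_pipe, "chunks", ["chunk_0000.wav", "chunk_0001.wav"])
print(rows)
```

The trade-off is that per-file progress messages are lost, since all results come back together.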

In [6]:
# Write the samples list to output csv
with open(outputPath, 'w', newline='', encoding='utf-8-sig') as f:
    # create csv writer
    csvWriter = csv.writer(f, delimiter='|')
    
    # Note: No need for headers in LJ Speech format...
    
    # Write each sample to the CSV file
    for entry in samples:
        csvWriter.writerow(entry)

print("Transcriptions written to:", outputPath)

Transcriptions written to: datasets/metadata_01.csv
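
Before using the file for fine-tuning, a quick read-back can catch malformed rows. The sketch below is a hypothetical check (`check_metadata` is not part of the notebook): it reads the file with the same `utf-8-sig` encoding and `|` delimiter used when writing, and reports rows that lack three columns or have an empty transcript. Note that `utf-8-sig` writes a byte order mark, so a tool that reads the file as plain `utf-8` would see that mark at the start of the first field.

```python
import csv
import os
import tempfile


def check_metadata(path):
    """Return a list of (line_number, problem) tuples; empty means OK."""
    problems = []
    with open(path, newline="", encoding="utf-8-sig") as f:
        for line_no, row in enumerate(csv.reader(f, delimiter="|"), start=1):
            if len(row) != 3:
                problems.append((line_no, f"expected 3 columns, got {len(row)}"))
            elif not row[1].strip():
                problems.append((line_no, "empty transcript"))
    return problems


# Demonstration on a throwaway file with one good row and one empty transcript
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "metadata.csv")
    with open(demo_path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.writer(f, delimiter="|")
        writer.writerow(("chunk_0000", " Hello.", " Hello."))
        writer.writerow(("chunk_0001", "", ""))
    demo_problems = check_metadata(demo_path)

print(demo_problems)
```

In the notebook this would be called as `check_metadata(outputPath)`.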
