# Transcribe_audio notebook

## Purpose
---
- Uses OpenAI's whisper-large-v3 model to automatically transcribe a set of sample audio files.
- The Coqui AI XTTS fine-tuning process requires a text transcription for each audio file. If an audio sample does not have one, writing the text by hand would be difficult.
- Will also be used to annotate speech from personal audio samples.
---

## How to use
---
- Requires torch and Hugging Face's transformers library to run the whisper-large-v3 model. FFmpeg must also be installed.
- Define an input directory path that contains all your .wav audio files.
- Define an output path for a CSV file. The file name and transcription of each audio file are written to this CSV, which can then be used as the metadata file for the fine-tuning process.

---
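
For reference, each row of the output file follows the LJ Speech layout used later in this notebook: file name without extension, transcript, and normalised transcript, separated by `|`. The sketch below shows how one such row could be built and parsed. `make_row` and `parse_row` are hypothetical helper names for illustration, not part of the notebook, and replacing any `|` inside the transcript with a space is an assumption made here to keep the three-column layout intact.

```python
import os


def make_row(file_name: str, transcript: str) -> str:
    """Build one LJ Speech style line: id|transcript|normalised transcript."""
    sample_id = os.path.splitext(file_name)[0]
    # Assumption: a literal '|' in the text would break the column layout,
    # so it is replaced with a space.
    text = transcript.strip().replace("|", " ")
    # No separate normalisation step: the transcript is duplicated.
    return f"{sample_id}|{text}|{text}"


def parse_row(line: str) -> tuple:
    """Split one metadata line back into its three columns."""
    sample_id, text, normalised = line.rstrip("\n").split("|")
    return sample_id, text, normalised


row = make_row("clip_001.wav", " Hello there. ")
print(row)
print(parse_row(row))
```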

In [None]:
'''Requires FFmpeg to be installed for the Whisper model'''
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
import csv
import os
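
Since FFmpeg is an external dependency, it can help to confirm it is reachable before loading a large model. A minimal sketch using only the standard library; `ffmpeg_available` is a hypothetical helper name, and the check assumes the executable is called `ffmpeg` and is expected on `PATH`:

```python
import shutil


def ffmpeg_available() -> bool:
    """Return True if an `ffmpeg` executable is found on PATH."""
    return shutil.which("ffmpeg") is not None


if not ffmpeg_available():
    print("FFmpeg not found on PATH; install it before running the pipeline.")
```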

In [None]:
'''Load the Whisper model using the transformers API'''
# Set device and torch data type
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
print(device)

# Model identifier
model_id = "openai/whisper-large-v3" # Download was about 3 GB

# Load the model and move it to the selected device
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, 
    torch_dtype=torch_dtype, 
    low_cpu_mem_usage=False, 
    use_safetensors=True
)
model.to(device)

# Load the processor
processor = AutoProcessor.from_pretrained(model_id)

# Create the speech recognition pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

In [None]:
'''Set up paths for input and output files'''
# Define a path to an output CSV to save transcriptions
outputPath = "output/example_dataset.csv"

# Define where sample audio files are coming from
audioDir = "chunks/"

# Read in all .wav files from the chosen dir (sorted for a stable row order)
fileList = sorted(f for f in os.listdir(audioDir) if f.lower().endswith(".wav"))


In [None]:
'''Transcribe sample files'''
# Init list to hold all samples
samples = []

# Loop through every file found in audioDir
for fileName in fileList:
    # Build the path to the local .wav file
    audioPath = os.path.join(audioDir, fileName)
    # Msg to show transcription is proceeding
    print(f"Transcribing {fileName}...")
    # Run the pipeline on the .wav file; it returns a dict with the transcript under "text"
    text = pipe(audioPath)["text"].strip()
    # LJ Speech format (filename, transcript, normalised transcript)
    # No need to normalise when fine-tuning, so just duplicate the 2nd column
    samples.append((os.path.splitext(fileName)[0], text, text))


In [None]:
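# Write the samples list to output csv
os.makedirs(os.path.dirname(outputPath) or ".", exist_ok=True)  # make sure the output dir exists
# Plain UTF-8: 'utf-8-sig' would write a BOM in front of the first file name
with open(outputPath, 'w', newline='', encoding='utf-8') as f: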
    # create csv writer
    csvWriter = csv.writer(f, delimiter='|')
    
    # Note: No need for headers in LJ Speech format...
    
    # Write each sample to the CSV file
    for entry in samples:
        csvWriter.writerow(entry)

print("Transcriptions written to:", outputPath)
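
Before passing the file to fine-tuning, it may be worth reading it back to check that every row has three columns, a non-empty transcript, and a matching audio file. A minimal sketch under those assumptions; `check_metadata` is a hypothetical helper name, and the `.wav` extension and three-column layout are taken from the cells above:

```python
import csv
import os


def check_metadata(csv_path: str, audio_dir: str) -> list:
    """Return a list of problems found in an LJ Speech style metadata file."""
    problems = []
    # 'utf-8-sig' reads files both with and without a leading BOM
    with open(csv_path, newline="", encoding="utf-8-sig") as f:
        for line_no, row in enumerate(csv.reader(f, delimiter="|"), start=1):
            if len(row) != 3:
                problems.append(f"line {line_no}: expected 3 columns, got {len(row)}")
                continue
            sample_id, text, _ = row
            if not text.strip():
                problems.append(f"line {line_no}: empty transcript for {sample_id}")
            if not os.path.isfile(os.path.join(audio_dir, sample_id + ".wav")):
                problems.append(f"line {line_no}: no audio file for {sample_id}")
    return problems
```

Calling `check_metadata(outputPath, audioDir)` after the write step returns an empty list when every row passes these checks.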