## Import Libraries

In [1]:
import torch
import torchaudio
import pandas as pd
from datetime import timedelta
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from jiwer import wer

## Torch Device Initialization

This script selects the PyTorch device. Metal Performance Shaders (MPS) is Apple's framework for efficient GPU computation through the Metal API, and PyTorch exposes it as a backend on a MacBook. The commented-out line sets the device to MPS when it is available and falls back to the CPU otherwise. In this run that check is disabled and the device is set to the CPU explicitly.

In [2]:
# MPS check is disabled for this run; uncomment the next line to use MPS when available
# device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
device = torch.device("cpu")
print(device)

cpu


## Transcript Data Processing

Now let's load the transcript data, extract the relevant columns, and convert the timestamps from seconds to milliseconds to simplify later computation.

In [3]:
# Load transcript data
transcript_df = pd.read_csv('../data/coraal/transcript/text/ATL_se0_ag1_f_01_1.txt', delimiter="\t")

# Extract relevant columns (copy so the in-place updates below do not act on a view of transcript_df)
transcript_data = transcript_df[['StTime', 'EnTime', 'Content']].copy()

# Convert times from seconds to milliseconds for easier calculations
transcript_data.loc[:, 'StTime'] *= 1000
transcript_data.loc[:, 'EnTime'] *= 1000

transcript_data

Unnamed: 0,StTime,EnTime,Content
0,443.6,2406.8,"They talking about, don't send him to his daddy."
1,2406.8,2682.9,(pause 0.28)
2,2682.9,4953.8,You just need to go file for child support.
3,5114.2,5612.5,[/Oh man/.]
4,5148.8,5599.5,[Bye.]
...,...,...,...
1228,1855602.9,1856198.4,[Wanna] play?
1229,1855826.2,1856179.8,[Hm.]
1230,1856347.3,1857625.0,Yeah let me- let me see it.
1231,1858394.2,1859392.9,Your phone dead.


## Segmenting Transcript Data

We segment the transcript data into 30-second intervals and aggregate the content within each one. The segment length is defined in milliseconds to match the timestamps. We initialize the start time to the earliest timestamp and the end time to the latest, then step through the transcript in 30-second windows, joining the content of every utterance that overlaps the current window. An utterance that crosses a window boundary is therefore included in both neighbouring segments. The resulting segments are collected into a DataFrame.

In [4]:
# Define the segment length
segment_length_ms = 30000  # 30 seconds in milliseconds

# Initialize variables
start_time = transcript_data['StTime'].min()  # Start from the earliest timestamp
end_time = transcript_data['EnTime'].max()  # End at the latest timestamp
segments = []

while start_time < end_time:
    segment_end_time = start_time + segment_length_ms

    # Aggregate content for the current 30-second segment
    segment_content = ' '.join(
        transcript_data[(transcript_data['StTime'] < segment_end_time) &
                        (transcript_data['EnTime'] > start_time)]['Content']
    )
    
    # Append the segment
    segments.append({
        'Start Time': start_time,
        'End Time': segment_end_time,
        'Content': segment_content
    })

    # Move to the next segment
    start_time = segment_end_time

# Convert to DataFrame
segments_df = pd.DataFrame(segments)
segments_df

Unnamed: 0,Start Time,End Time,Content
0,443.6,30443.6,"They talking about, don't send him to his dadd..."
1,30443.6,60443.6,"Um, year of birth? <ts> (pause 0.19) Ninety-fi..."
2,60443.6,90443.6,"Twenty-one, Twenty-one. Okay. Let's see, any o..."
3,90443.6,120443.6,[Think he at /Grady back/.] that's a hell of a...
4,120443.6,150443.6,I've- I call her my sister. (pause 1.05) You k...
...,...,...,...
58,1740443.6,1770443.6,"craziest feeling in the world. Ah, man. Mm-hm...."
59,1770443.6,1800443.6,(pause 4.15) But you got more (pause 0.39) sin...
60,1800443.6,1830443.6,<laugh> You can have two rooms [Yeah.] <laugh>...
61,1830443.6,1860443.6,(pause 6.03) When last time you played a video...
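
The overlap test used above can be tried on a toy example. The sketch below is a plain-Python illustration, not the notebook's pandas code: the utterance tuples and the helper name `window_texts` are made up for this example. It shows that an utterance crossing the 30-second boundary ends up in both windows.

```python
# Toy utterances as (start_ms, end_ms, text); the values are illustrative only.
utterances = [
    (0, 20_000, "first"),
    (25_000, 35_000, "straddles the boundary"),
    (40_000, 55_000, "third"),
]


def window_texts(utterances, window_ms=30_000):
    """Group utterance texts into fixed windows using the same overlap test as the cell above."""
    start = min(u[0] for u in utterances)
    end = max(u[1] for u in utterances)
    windows = []
    while start < end:
        window_end = start + window_ms
        # An utterance belongs to the window if it starts before the window
        # ends and ends after the window starts.
        texts = [text for (st, en, text) in utterances if st < window_end and en > start]
        windows.append((start, window_end, " ".join(texts)))
        start = window_end
    return windows


windows = window_texts(utterances)
for window in windows:
    print(window)
```

The second utterance is printed in both windows, which mirrors the repeated "Um, year of birth?" at the end of segment 0 and the start of segment 1 in the transcription output.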


## Audio Loading and Resampling

We load the audio file with `torchaudio` and resample it to a new frequency. The model accepts audio sampled at 16 kHz, whereas our audio data is sampled at 44.1 kHz.


In [5]:
# Load the audio file and its sample rate
waveform, sample_rate = torchaudio.load("../data/coraal/audio/wav/ATL_se0_ag1_f_01_1.wav")

# Create a resampler to change the original sample rate to 16 kHz
resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)

# Apply the resampler to the waveform
waveform_resampled = resampler(waveform)
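
To get a feel for what the resampling changes, the arithmetic below compares the number of samples in one 30-second segment at the two rates mentioned above. It is a back-of-the-envelope sketch in plain Python; the constant names are chosen for this example only, and the exact output length of a real resampler may differ by a sample or so.

```python
ORIG_RATE = 44_100     # Hz, sampling rate of the source audio
TARGET_RATE = 16_000   # Hz, sampling rate expected by the model
SEGMENT_SECONDS = 30   # segment length used in this notebook

# Samples per channel in one segment before and after resampling
samples_before = ORIG_RATE * SEGMENT_SECONDS
samples_after = TARGET_RATE * SEGMENT_SECONDS

# Fraction of samples kept by the resampling step
ratio = TARGET_RATE / ORIG_RATE

print(samples_before)   # 1323000
print(samples_after)    # 480000
print(round(ratio, 3))  # 0.363
```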


## Audio Segment Extraction

### Function Explanation

The function `extract_audio_segment` extracts a segment from an audio file based on specified start and end times in milliseconds. The steps are as follows:
1. Loads the audio file using `torchaudio`.
2. Converts the start and end times from milliseconds to sample indices.
3. Extracts the audio segment corresponding to the specified time range.
4. Returns the extracted audio segment as a `torch.Tensor`.

**Parameters**
- `audio_file_path` (str): Path to the audio file.
- `start_time_ms` (int): Start time in milliseconds.
- `end_time_ms` (int): End time in milliseconds.
- `sample_rate` (int): Sample rate of the audio file. The function currently ignores this argument and uses the rate read from the file instead.

**Returns**
- `torch.Tensor`: Extracted audio segment.


In [6]:
def extract_audio_segment(audio_file_path, start_time_ms, end_time_ms, sample_rate):
    """
    Extracts a segment from an audio file based on start and end times.
    
    Parameters:
    - audio_file_path (str): Path to the audio file.
    - start_time_ms (int): Start time in milliseconds.
    - end_time_ms (int): End time in milliseconds.
    - sample_rate (int): Sample rate of the audio file (currently unused; the rate read from the file is used).
    
    Returns:
    - torch.Tensor: Extracted audio segment.
    """
    # Load the audio file
    waveform, sr = torchaudio.load(audio_file_path)
    
    # Convert milliseconds to sample indices
    start_sample = int(start_time_ms * sr / 1000)
    end_sample = int(end_time_ms * sr / 1000)
    
    # Extract the segment
    segment = waveform[:, start_sample:end_sample]
    
    return segment
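
The millisecond-to-sample conversion in step 2 can be checked by hand. The sketch below repeats the same formula in plain Python for the first segment (443.6 ms to 30443.6 ms) at 44.1 kHz. The helper name `ms_to_sample_index` is introduced here for illustration and is not part of the notebook.

```python
def ms_to_sample_index(time_ms, sample_rate):
    """Convert a time in milliseconds to a sample index, truncating toward zero."""
    return int(time_ms * sample_rate / 1000)


start_sample = ms_to_sample_index(443.6, 44_100)
end_sample = ms_to_sample_index(30_443.6, 44_100)

print(start_sample)               # 19562
print(end_sample)                 # 1342562
print(end_sample - start_sample)  # 1323000 samples, i.e. 30 s at 44.1 kHz
```

Because `int` truncates, the fractional 0.76 of a sample at each boundary is dropped, which is negligible at this sampling rate.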

## Extracting Audio Segments Based on Transcript Intervals

We now extract the audio segments that correspond to the transcript intervals and store them in a list. We iterate through each row of the `segments_df` DataFrame and extract the audio between its start and end times.


In [7]:
# Define the path to your audio file
audio_file_path = '../data/coraal/audio/wav/ATL_se0_ag1_f_01_1.wav'

# Load the sample rate of the audio file
waveform, sample_rate = torchaudio.load(audio_file_path)

# Initialize a list to store extracted audio segments
audio_segments = []

# Extract audio segments based on transcript intervals
for _, row in segments_df.iterrows():
    start_time_ms = row['Start Time']
    end_time_ms = row['End Time']
    
    # Extract audio segment
    segment = extract_audio_segment(audio_file_path, start_time_ms, end_time_ms, sample_rate)
    
    # Store the segment together with its time bounds
    audio_segments.append({
        'Start Time': start_time_ms,
        'End Time': end_time_ms,
        'Segment': segment
    })
    
# Convert to DataFrame if needed
audio_segments_df = pd.DataFrame(audio_segments)
audio_segments_df.head()

Unnamed: 0,Start Time,End Time,Segment
0,443.6,30443.6,"[[tensor(0.0029), tensor(0.0030), tensor(0.002..."
1,30443.6,60443.6,"[[tensor(0.0001), tensor(0.0005), tensor(0.000..."
2,60443.6,90443.6,"[[tensor(0.0010), tensor(0.0006), tensor(0.000..."
3,90443.6,120443.6,"[[tensor(0.0034), tensor(0.0035), tensor(0.003..."
4,120443.6,150443.6,"[[tensor(-0.0009), tensor(-0.0012), tensor(-0...."


We verify that the length of our transcript DataFrame is the same as our audio DataFrame.

In [8]:
print(len(segments_df) == len(audio_segments_df))

True


## Initializing Whisper Processor and Model

This script initializes the Whisper processor and model for the `whisper-tiny` configuration from Hugging Face's transformers library.


In [9]:
# Initialize processor and model for whisper-tiny
processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [10]:
# Move the model to the appropriate device
model = model.to(device)

## `whisper-tiny` Model Architecture Details

| Component                               | Sub-Component                                           | Details                                                                                 |
|-----------------------------------------|---------------------------------------------------------|-----------------------------------------------------------------------------------------|
| `WhisperForConditionalGeneration`       |                                                         |                                                                                         |
|                                         | `model`                                                 | `WhisperModel`                                                                          |
|                                         |                                                         |                                                                                         |
| `WhisperModel`                          | `encoder`                                               | `WhisperEncoder`                                                                        |
|                                         |                                                         |                                                                                         |
| `WhisperEncoder`                        | `conv1`                                                 | `Conv1d(80, 384, kernel_size=(3,), stride=(1,), padding=(1,))`                          |
|                                         | `conv2`                                                 | `Conv1d(384, 384, kernel_size=(3,), stride=(2,), padding=(1,))`                         |
|                                         | `embed_positions`                                       | `Embedding(1500, 384)`                                                                  |
|                                         | `layers`                                                | `ModuleList` containing 4 `WhisperEncoderLayer`                                         |
|                                         |                                                         |                                                                                         |
| `WhisperEncoderLayer`                   | `self_attn`                                             | `WhisperSdpaAttention`                                                                  |
|                                         | `self_attn.k_proj`                                      | `Linear(in_features=384, out_features=384, bias=False)`                                 |
|                                         | `self_attn.v_proj`                                      | `Linear(in_features=384, out_features=384, bias=True)`                                  |
|                                         | `self_attn.q_proj`                                      | `Linear(in_features=384, out_features=384, bias=True)`                                  |
|                                         | `self_attn.out_proj`                                    | `Linear(in_features=384, out_features=384, bias=True)`                                  |
|                                         | `self_attn_layer_norm`                                  | `LayerNorm((384,), eps=1e-05, elementwise_affine=True)`                                 |
|                                         | `activation_fn`                                         | `GELUActivation()`                                                                      |
|                                         | `fc1`                                                   | `Linear(in_features=384, out_features=1536, bias=True)`                                 |
|                                         | `fc2`                                                   | `Linear(in_features=1536, out_features=384, bias=True)`                                 |
|                                         | `final_layer_norm`                                      | `LayerNorm((384,), eps=1e-05, elementwise_affine=True)`                                 |
| `WhisperEncoder`                        | `layer_norm`                                            | `LayerNorm((384,), eps=1e-05, elementwise_affine=True)`                                 |
|                                         |                                                         |                                                                                         |
| `WhisperModel`                          | `decoder`                                               | `WhisperDecoder`                                                                        |
|                                         |                                                         |                                                                                         |
| `WhisperDecoder`                        | `embed_tokens`                                          | `Embedding(51865, 384, padding_idx=50257)`                                              |
|                                         | `embed_positions`                                       | `WhisperPositionalEmbedding(448, 384)`                                                  |
|                                         | `layers`                                                | `ModuleList` containing 4 `WhisperDecoderLayer`                                         |
|                                         |                                                         |                                                                                         |
| `WhisperDecoderLayer`                   | `self_attn`                                             | `WhisperSdpaAttention`                                                                  |
|                                         | `self_attn.k_proj`                                      | `Linear(in_features=384, out_features=384, bias=False)`                                 |
|                                         | `self_attn.v_proj`                                      | `Linear(in_features=384, out_features=384, bias=True)`                                  |
|                                         | `self_attn.q_proj`                                      | `Linear(in_features=384, out_features=384, bias=True)`                                  |
|                                         | `self_attn.out_proj`                                    | `Linear(in_features=384, out_features=384, bias=True)`                                  |
|                                         | `activation_fn`                                         | `GELUActivation()`                                                                      |
|                                         | `self_attn_layer_norm`                                  | `LayerNorm((384,), eps=1e-05, elementwise_affine=True)`                                 |
|                                         | `encoder_attn`                                          | `WhisperSdpaAttention`                                                                  |
|                                         | `encoder_attn.k_proj`                                   | `Linear(in_features=384, out_features=384, bias=False)`                                 |
|                                         | `encoder_attn.v_proj`                                   | `Linear(in_features=384, out_features=384, bias=True)`                                  |
|                                         | `encoder_attn.q_proj`                                   | `Linear(in_features=384, out_features=384, bias=True)`                                  |
|                                         | `encoder_attn.out_proj`                                 | `Linear(in_features=384, out_features=384, bias=True)`                                  |
|                                         | `encoder_attn_layer_norm`                               | `LayerNorm((384,), eps=1e-05, elementwise_affine=True)`                                 |
|                                         | `fc1`                                                   | `Linear(in_features=384, out_features=1536, bias=True)`                                 |
|                                         | `fc2`                                                   | `Linear(in_features=1536, out_features=384, bias=True)`                                 |
|                                         | `final_layer_norm`                                      | `LayerNorm((384,), eps=1e-05, elementwise_affine=True)`                                 |
| `WhisperDecoder`                        | `layer_norm`                                            | `LayerNorm((384,), eps=1e-05, elementwise_affine=True)`                                 |
|                                         |                                                         |                                                                                         |
| `WhisperForConditionalGeneration`       | `proj_out`                                              | `Linear(in_features=384, out_features=51865, bias=False)`                               |
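
As a rough illustration of what these shapes imply, the sketch below counts the parameters of a single encoder layer using only the `Linear` and `LayerNorm` sizes listed in the table. The helper names are made up for this example. The count covers `self_attn`, `fc1`, `fc2`, `self_attn_layer_norm` and `final_layer_norm`; the standalone `layer_norm` row is left out on the assumption that it is the encoder-level final norm rather than part of each layer.

```python
def linear_params(in_features, out_features, bias=True):
    """Weights plus optional bias of a Linear layer."""
    return in_features * out_features + (out_features if bias else 0)


def layer_norm_params(dim):
    """Scale and shift vectors of a LayerNorm with elementwise_affine=True."""
    return 2 * dim


d_model, d_ff = 384, 1536

attention = (
    linear_params(d_model, d_model, bias=False)  # k_proj has no bias
    + 3 * linear_params(d_model, d_model)        # q_proj, v_proj, out_proj
)
feed_forward = linear_params(d_model, d_ff) + linear_params(d_ff, d_model)  # fc1, fc2
norms = 2 * layer_norm_params(d_model)  # self_attn_layer_norm, final_layer_norm

encoder_layer_total = attention + feed_forward + norms
print(encoder_layer_total)  # 1774080
```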


## Audio Preprocessing and Transcription Functions

This script contains two functions:
1. `preprocess_audio`: Resamples an audio tensor to a target sample rate.
2. `transcribe_audio`: Transcribes the preprocessed audio tensor using the Whisper model and processor.

### Functions Explanation

#### `preprocess_audio`

This function resamples an audio tensor to a target sample rate if the original sample rate is different from the target.

**Parameters:**
- `audio_tensor` (torch.Tensor): The input audio tensor.
- `sample_rate` (int): The original sample rate of the audio.
- `target_sample_rate` (int, optional): The target sample rate for resampling. Default is 16000.

**Returns:**
- `torch.Tensor`: The resampled audio tensor.

#### `transcribe_audio`

This function transcribes the audio tensor using the Whisper model and processor.

**Parameters:**
- `audio_tensor` (torch.Tensor): The input audio tensor.
- `model` (WhisperForConditionalGeneration): The Whisper model for transcription.
- `processor` (WhisperProcessor): The Whisper processor for preprocessing the audio tensor.

**Returns:**
- `str`: The transcription of the audio.


In [11]:
def preprocess_audio(audio_tensor, sample_rate, target_sample_rate=16000):
    """Resample audio_tensor to target_sample_rate if the rates differ."""
    if sample_rate != target_sample_rate:
        resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=target_sample_rate)
        audio_tensor = resampler(audio_tensor)
    return audio_tensor

def transcribe_audio(audio_tensor, model, processor):
    """Transcribe a 16 kHz audio tensor with the Whisper model and processor."""
    # The processor takes a NumPy array on the CPU; squeeze drops the channel dimension of mono audio
    inputs = processor(audio_tensor.squeeze().cpu().numpy(), return_tensors="pt", sampling_rate=16000)
    # Move the extracted input features to the model's device
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        predicted_ids = model.generate(**inputs)
    # Decode the token IDs to text, dropping special tokens
    transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
    return transcription[0]

## Transcribing Audio Segments and Comparing Results

We want to test the pre-trained `whisper-tiny` model on our data (although it is not yet cleaned). For each segment, we preprocess the audio, transcribe it, and append the actual and predicted transcriptions to their respective lists. We also print the segment's start time, the actual transcript, and the model transcription.

**Variables**

- `true_transcriptions` (list): List to store the actual transcriptions.
- `predicted_transcriptions` (list): List to store the model-generated transcriptions.



In [12]:
# Initialize lists for metrics
true_transcriptions = []
predicted_transcriptions = []

# Process each segment and compare transcriptions
for idx in range(len(audio_segments_df)):
    audio_tensor = audio_segments_df['Segment'][idx]
    actual_transcript = segments_df['Content'][idx]
    
    # Ensure audio is in the correct format
    audio_tensor = preprocess_audio(audio_tensor, sample_rate)
    
    # Transcribe audio
    model_transcription = transcribe_audio(audio_tensor, model, processor)
    
    # Store transcriptions for metric calculation
    true_transcriptions.append(actual_transcript)
    predicted_transcriptions.append(model_transcription)
    
    # Print segment info
    print(f"Segment Start: {audio_segments_df['Start Time'][idx]}")
    print(f"Actual Transcript: {actual_transcript}")
    print(f"Model Transcription: {model_transcription}")
    print("\n")


Due to a bug fix in https://github.com/huggingface/transformers/pull/28687 transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English.This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token.As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Segment Start: 443.6
Actual Transcript: They talking about, don't send him to his daddy. (pause 0.28) You just need to go file for child support. [/Oh man/.] [Bye.] Why? (pause 0.80) Why? Okay, what's your name? /RD-NAME-2/ (pause 0.52) /RD-NAME-1/ what? (pause 0.48) /RD-NAME-3/ Okay. (pause 0.61) And, uh, (pause 0.39) are you a male or female? I'm a girl, I think. [<laugh>] [I'm just playing.] Okay. (pause 0.19) And your ethnicity? Hum, I'm supposed to say, black or non-hispanic. Okay. (pause 0.74) Um, year of birth?
Model Transcription:  I said, I don't see him to his daddy. You just need to go far for child support. Why? Why? Why? Okay, what's your name? Boy. Okay. And, uh, are your middle female? Oh, I think. How does that sound? Okay, and, uh, ethnicity. Um, I'll pass by black and not his family. Okay. Um,


Segment Start: 30443.6
Actual Transcript: Um, year of birth? <ts> (pause 0.19) Ninety-five. Okay. Nineteen-ninety five, (pause 0.33) that was a good year. (pause 0.56) Um, hom

## Word Error Rate (WER)

Word Error Rate (WER) is a metric used to evaluate the performance of speech recognition systems. It measures the accuracy of a transcribed text compared to a reference (or ground truth) transcript by quantifying the number of errors in the predicted transcription.

### WER Calculation

WER is calculated using the following formula:

\[ \text{WER} = \frac{S + D + I}{N} \]

where:
- **S** is the number of substitutions (words that are incorrect in the prediction).
- **D** is the number of deletions (words that are missing in the prediction).
- **I** is the number of insertions (words that are extra in the prediction).
- **N** is the total number of words in the reference transcript.

### Interpretation

- **WER = 0%**: Perfect match between the predicted transcription and the reference transcript.
- **WER > 0%**: Indicates some level of error; a higher WER signifies poorer performance of the transcription system. Because insertions are counted against the reference length, WER can exceed 100%.

**Example**

If the reference transcript is "Hello world" and the predicted transcript is "Hello there", the errors are:
- **Substitutions**: "world" → "there" (1 substitution)
- **Deletions**: None
- **Insertions**: None

Thus, WER = \( \frac{1}{2} = 0.5 \) or 50%, since there are 2 words in the reference transcript.

WER provides a quantitative measure to compare different speech recognition systems or models, and to assess their performance in understanding and transcribing spoken language.
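
To make the formula concrete, here is a minimal pure-Python sketch that computes WER for a single reference/hypothesis pair with a word-level edit distance. It is an illustration of the definition above, not the `jiwer` implementation, and the function name `word_error_rate` is chosen for this example. It splits on whitespace only and does no text normalization.

```python
def word_error_rate(reference, hypothesis):
    """Return (S + D + I) / N for one reference/hypothesis pair."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("Hello world", "Hello there"))  # 0.5
print(word_error_rate("hi", "hi there you"))          # 2.0
```

The first call reproduces the example above. The second shows how two insertions against a one-word reference give a WER of 200%.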


1. **Calculate WER:**
   - `wer_score = wer(true_transcriptions, predicted_transcriptions)` computes the WER using the `jiwer` library.

2. **Print WER Score:**
   - `print(f"Word Error Rate (WER): {wer_score}")` outputs the WER score to the console.


In [13]:
wer_score = wer(true_transcriptions, predicted_transcriptions)
print(f"Word Error Rate (WER): {wer_score}")

Word Error Rate (WER): 0.8459348870307775
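
The reference transcripts scored here are not yet cleaned: they still contain annotation markers such as `(pause 0.28)`, `<laugh>`, `/RD-NAME-2/` and overlap brackets, none of which the model can produce, so they may inflate the score. The sketch below shows one possible way to strip the markers visible in the printed output before scoring. The function name `strip_annotations` and the patterns are assumptions based only on those examples, not a complete treatment of the corpus conventions.

```python
import re


def strip_annotations(text):
    """Remove the annotation markers seen in the printed transcripts (assumed patterns)."""
    text = re.sub(r"\(pause [\d.]+\)", " ", text)  # timed pauses, e.g. (pause 0.28)
    text = re.sub(r"<[^>]+>", " ", text)           # non-speech tags, e.g. <laugh>, <ts>
    text = re.sub(r"/RD-[A-Z]+-\d+/", " ", text)   # redaction placeholders, e.g. /RD-NAME-2/
    text = re.sub(r"[\[\]/]", "", text)            # overlap brackets and remaining slashes
    return " ".join(text.split())                  # collapse repeated whitespace


sample = "They talking about. (pause 0.28) [/Oh man/.] [<laugh>] /RD-NAME-2/ Okay."
cleaned = strip_annotations(sample)
print(cleaned)  # They talking about. Oh man. Okay.
```

Applying such a function to each entry of `true_transcriptions` before calling `wer` would score only the spoken words. Whether redacted names or overlapped speech should be kept is a separate design choice.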
