5/10 (Fri) | Preprocess

# Forced Alignment Preparation for Dialogue Data (ASR Transcript)

## 1. Introduction

This notebook generates turn-level dialogue transcripts and splits the full audio recordings into matching segments for wav2vec forced alignment.
The preparation process consists of the following stages.

1. Get user ids
2. Load the audio file and the transcript corresponding to a user id
3. Get turn-level ids, transcripts, and start and end times iteratively
4. Convert each transcript to a lower-case string with star tokens
5. Split the audio by the start and end times
6. Save the turn-level transcript and the split audio

Before starting the process, the following code block loads the required packages and defines global variables.

In [1]:
from typing import Tuple, Generator
from pathlib import Path
import re

import pandas as pd
from pydub import AudioSegment

TRANSCRIPT_DIR = Path("/home/matsuura/Development/app/feature_extraction_api/experiment/data/WoZ_Interview/02_Rev_Transcript")
AUDIO_DIR = Path("/home/matsuura/Development/Datasets/teai-woz-2021-03/wav")
SAVE_DIR = Path("/home/matsuura/Development/app/feature_extraction_api/experiment/data/WoZ_Interview/03_FA_Audio_Transcript_Auto")

PUNCTUATIONS = [".", ",", ":", "?", "!"]

---

## 2. Define Functions

This section defines functions to complete the preparation.
The following code block defines a generator of user ids.

In [2]:
def user_id_generator() -> Generator[str, None, None]:
    for uid in range(1, 86):
        uid = str(uid).zfill(3)

        yield uid

The following code blocks define functions to load a transcript and an audio file.

In [3]:
def load_transcript(uid: str) -> pd.DataFrame:
    csv_path = TRANSCRIPT_DIR / f"{uid}.csv"
    df_transcript = pd.read_csv(csv_path, index_col=0)

    return df_transcript

In [4]:
def load_audio(uid: str) -> AudioSegment:
    if int(uid) == 19:
        uid = "086"

    if int(uid) < 56:
        audio_path = AUDIO_DIR / f"{uid}-user.mp4"
    else:
        audio_path = AUDIO_DIR / f"{uid}-user.m4a"

    audio = AudioSegment.from_file(audio_path)

    return audio
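
The file selection inside `load_audio` can be checked without touching any audio data. The sketch below mirrors only the path logic; the helper name `resolve_audio_path` and the placeholder directory `wav` are assumptions made for illustration and are not part of the notebook.

```python
from pathlib import Path


def resolve_audio_path(uid: str, audio_dir: Path) -> Path:
    """Hypothetical helper mirroring the file selection in load_audio."""
    if int(uid) == 19:
        uid = "086"  # user 019 is read from the 086 recording, as in load_audio
    # ids below 056 use the .mp4 extension, the rest use .m4a
    suffix = "mp4" if int(uid) < 56 else "m4a"
    return audio_dir / f"{uid}-user.{suffix}"


audio_dir = Path("wav")  # placeholder directory, for illustration only
names = [resolve_audio_path(uid, audio_dir).name for uid in ("001", "019", "055", "056")]
print(names)
```

Because the remapped id `086` is compared against 56 after the substitution, user 019 resolves to an `.m4a` file.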

The following code block defines a generator of turn-level ids, transcripts, and start and end times.

In [5]:
def turn_level_info_generator(
        df_transcript: pd.DataFrame
) -> Generator[Tuple[str, str, int, int], None, None]:
    user_mask = (df_transcript["speaker"] == "user")
    
    intro_mask = (df_transcript["topic"] == "intro")
    closing_mask = (df_transcript["topic"] == "closing")
    topic_mask = intro_mask | closing_mask

    mask = user_mask & (~topic_mask)

    df_transcript_masked = df_transcript[mask]
    
    for idx in df_transcript_masked.index:
        transcript = df_transcript_masked.at[idx, "transcript"]
        start_time = df_transcript_masked.at[idx, "start_time"]
        end_time = df_transcript_masked.at[idx, "end_time"]

        turn_id = str(idx).zfill(3)

        yield turn_id, transcript, start_time, end_time
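
To illustrate which rows the generator keeps, the following sketch applies the same selection to a small made-up table. The rows and time values are invented for illustration; only the column names come from the code above. The mask is written with `isin`, which should be equivalent to combining the two topic comparisons.

```python
import pandas as pd

# Made-up rows; the column names follow the ones the generator reads.
df = pd.DataFrame(
    {
        "speaker": ["system", "user", "user", "system", "user", "user"],
        "topic": ["intro", "intro", "hobby", "hobby", "hobby", "closing"],
        "transcript": ["Hello.", "Hi.", "I like chess.", "Why?", None, "Bye."],
        "start_time": [0, 900, 2000, 5200, 6000, 9000],
        "end_time": [800, 1500, 5000, 5900, 8500, 9800],
    }
)

# Same selection as turn_level_info_generator: user turns outside intro/closing.
mask = (df["speaker"] == "user") & ~df["topic"].isin(["intro", "closing"])
selected = df[mask]

# Turn ids are the zero-padded row indices of the surviving rows.
turn_ids = [str(idx).zfill(3) for idx in selected.index]

# A missing transcript shows up as None/NaN; pd.isna covers both cases.
non_empty = [
    tid for tid, text in zip(turn_ids, selected["transcript"]) if not pd.isna(text)
]
print(turn_ids, non_empty)
```

Only the user turns on the `hobby` topic survive the mask, and the turn with a missing transcript is the kind of row that the main loop later skips.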

The following code block defines a function to convert an original transcript into the form used for forced alignment.

In [6]:
def convert_fa_transcript(transcript: str) -> str:
    # 1. change inaudible tags to star tokens
    fa_transcript = transcript.replace("<inaudible>", "*")

    # 2. remove punctuation and hyphens
    for punct in PUNCTUATIONS:
        fa_transcript = fa_transcript.replace(punct, "")
    fa_transcript = fa_transcript.replace("-", "")

    # 3. change any word containing a digit to a star token
    number_pattern = r"\b\w*\d\w*\b"
    fa_transcript = re.sub(number_pattern, "*", fa_transcript)

    # 4. remove other tags
    tag_pattern = r"\<.*?\>"
    fa_transcript = re.sub(tag_pattern, " ", fa_transcript) 

    # 5. lower transcript
    fa_transcript = fa_transcript.lower()

    # 6. collapse repeated spaces and trim the ends
    while "  " in fa_transcript:
        fa_transcript = fa_transcript.replace("  ", " ")

    # strip() is safe on an empty string (e.g. a transcript made only of tags),
    # whereas indexing fa_transcript[0] would raise an IndexError
    fa_transcript = fa_transcript.strip(" ")

    return fa_transcript
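
The effect of the six steps is easiest to see on a concrete string. The sketch below restates them in compact form and runs them on a made-up utterance; the function name `to_fa_text` and the sample sentence are assumptions for illustration, not part of the notebook or its data.

```python
import re

PUNCTUATIONS = [".", ",", ":", "?", "!"]


def to_fa_text(transcript: str) -> str:
    """Hypothetical compact restatement of the six conversion steps."""
    text = transcript.replace("<inaudible>", "*")   # 1. inaudible tag -> star
    for punct in PUNCTUATIONS + ["-"]:              # 2. drop punctuation and hyphens
        text = text.replace(punct, "")
    text = re.sub(r"\b\w*\d\w*\b", "*", text)       # 3. words containing a digit -> star
    text = re.sub(r"<.*?>", " ", text)              # 4. any remaining tag -> space
    text = text.lower()                             # 5. lower-case
    return re.sub(r" +", " ", text).strip(" ")      # 6. collapse spaces, trim ends


sample = "Well, I have <inaudible> 2 cats. <laugh> It's a well-known fact!"
print(to_fa_text(sample))
# -> well i have * * cats it's a wellknown fact
```

Apostrophes are kept because they are not in `PUNCTUATIONS`, and hyphenated words are joined rather than split. In this sketch, an input that contains only a tag, such as `"<laugh>"`, reduces to an empty string.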

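The following code block defines a function to split an AudioSegment by start and end time. pydub slices AudioSegment objects in milliseconds, so the times are assumed to be given in milliseconds.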

In [7]:
def split_audio(audio: AudioSegment, start_time: int, end_time: int) -> AudioSegment:
    return audio[start_time:end_time]

The following code blocks define functions to save the transcript and audio files.

In [8]:
def save_transcript(uid: str, turn_id: str, fa_transcript: str) -> None:
    save_path = SAVE_DIR / f"{uid}_{turn_id}.txt"
    with open(save_path, "w", encoding="utf-8") as f:
        f.write(fa_transcript)

In [9]:
def save_audio(uid: str, turn_id: str, audio: AudioSegment) -> None:
    save_path = SAVE_DIR / f"{uid}_{turn_id}.wav"
    audio.export(save_path, format="wav")

---

## 3. Conduct Forced Alignment Preparation

The following code block runs the preparation.

In [10]:
for uid in user_id_generator():
    df_transcript = load_transcript(uid)
    audio = load_audio(uid)

    for tid, transcript, start_time, end_time in turn_level_info_generator(df_transcript):
        if isinstance(transcript, float):
            continue  # empty cells are read as NaN (a float); skip them

        fa_transcript = convert_fa_transcript(transcript)
        fa_audio = split_audio(audio, start_time, end_time)

        save_transcript(uid, tid, fa_transcript)
        save_audio(uid, tid, fa_audio)
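
Forced alignment needs one transcript and one audio segment per turn, so it can be useful to confirm afterwards that the saved files come in pairs. The following is a minimal sketch of such a check, assuming the `{uid}_{turn_id}.txt` / `{uid}_{turn_id}.wav` naming used by the save functions; the helper name `find_unpaired_stems` is hypothetical.

```python
from pathlib import Path
from typing import List


def find_unpaired_stems(save_dir: Path) -> List[str]:
    """Hypothetical check: stems that have a .txt or a .wav file, but not both."""
    txt_stems = {p.stem for p in save_dir.glob("*.txt")}
    wav_stems = {p.stem for p in save_dir.glob("*.wav")}
    # The symmetric difference keeps stems present on only one side.
    return sorted(txt_stems ^ wav_stems)
```

Calling `find_unpaired_stems(SAVE_DIR)` after the loop should return an empty list if every turn was saved with both files.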