5/11 (Sat) | Preprocess

# Forced Alignment Preparation for Dialogue Data (Manual Transcript)

## 1. Introduction

This notebook generates turn-level dialogue transcripts for wav2vec forced alignment.
The transcript generation procedure consists of the following stages.

1. Get user ids
2. Load the transcript text corresponding to a user id
3. Convert the transcript to a turn-wise format
4. Get turn-level ids and transcripts iteratively
5. Convert each transcript to a lower-case string with star tokens
6. Save the turn-level transcript
7. Identify unpaired .txt files and save the split .wav audio

Before starting the process, the following code block loads the required packages and defines global variables.

In [2]:
from typing import Tuple, Generator
from pathlib import Path
import re

import pandas as pd
from pydub import AudioSegment

from utils.transcript import convert_turnwise

TRANSCRIPT_DIR = Path("/home/matsuura/Development/app/feature_extraction_api/experiment/data/WoZ_Interview/01_Manual_TextGrid")
AUDIO_DIR = Path("/home/matsuura/Development/Datasets/teai-woz-2021-03/wav")
SAVE_DIR = Path("/home/matsuura/Development/app/feature_extraction_api/experiment/data/WoZ_Interview/05_FA_Audio_Transcript_Manu")

TASK = ["Arg_Oly", "Cartoon", "RtSwithoutRAA", "RtSwithRAA"]

PUNCTUATIONS = [".", ",", ":", "?", "!"]

---

## 2. Define Functions

This section defines functions to complete the preparation.
The following code block defines a generator of user ids.

In [3]:
def user_id_generator() -> Generator[str, None, None]:
    for uid in range(1, 86):
        uid = str(uid).zfill(3)

        yield uid
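As a quick check of the id format, the sketch below restates the generator and lists what it yields. The `first` and `last` parameters are illustrative additions, not part of the original function; their defaults reproduce the range used above.

```python
from typing import Generator


def user_id_generator(first: int = 1, last: int = 85) -> Generator[str, None, None]:
    """Yield zero-padded, three-digit user ids (first/last are hypothetical parameters)."""
    for uid in range(first, last + 1):
        yield str(uid).zfill(3)


ids = list(user_id_generator())
print(ids[:3], ids[-1], len(ids))  # ['001', '002', '003'] 085 85
```

The zero padding matters because the same string is later used to build file names such as `001.csv` and `001_009.txt`.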

The following code block defines a function to load a transcript.

In [4]:
def load_transcript(uid: str) -> pd.DataFrame:
    csv_path = TRANSCRIPT_DIR / f"{uid}.csv"
    df_transcript = pd.read_csv(csv_path)

    return df_transcript

The following code block defines a generator of turn-level ids and transcripts. It keeps only user turns and excludes the intro and closing topics.

In [5]:
def turn_level_info_generator(
        df_transcript: pd.DataFrame
) -> Generator[Tuple[str, str], None, None]:
    user_mask = (df_transcript["speaker"] == "user")
    
    intro_mask = (df_transcript["topic"] == "intro")
    closing_mask = (df_transcript["topic"] == "closing")
    topic_mask = intro_mask | closing_mask

    mask = user_mask & (~topic_mask)

    df_transcript_masked = df_transcript[mask]
    
    for idx in df_transcript_masked.index:
        transcript = df_transcript_masked.at[idx, "transcript"]
        turn_id = str(idx).zfill(3)

        yield turn_id, transcript
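To illustrate the masking, here is a minimal sketch on a toy DataFrame. The column names come from the function above; the rows and the topic value "Cartoon" are made up for the example, and the real turn-wise table may contain more columns.

```python
import pandas as pd

# Toy turn-wise transcript; the values are invented for illustration only.
toy = pd.DataFrame({
    "speaker": ["system", "user", "system", "user", "user"],
    "topic": ["intro", "intro", "Cartoon", "Cartoon", "closing"],
    "transcript": ["Hello.", "Hi.", "Describe it.", "A boy runs.", "Bye."],
})

# Keep user turns that are neither intro nor closing.
keep = (toy["speaker"] == "user") & ~toy["topic"].isin(["intro", "closing"])

# The turn id is the zero-padded index label of the original row.
turns = [(str(idx).zfill(3), text) for idx, text in toy.loc[keep, "transcript"].items()]
print(turns)  # [('003', 'A boy runs.')]
```

Because the turn id is taken from the row index of the unmasked table, the ids of the saved files are not consecutive.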

The following code block defines a function to convert original transcripts.

In [6]:
def convert_fa_transcript(transcript: str) -> str:
    # 1. change inaudible tags to star tokens
    fa_transcript = transcript.replace("<inaudible>", "*")

    # 2. remove punctuations
    for punct in PUNCTUATIONS:
        fa_transcript = fa_transcript.replace(punct, "")
    fa_transcript = fa_transcript.replace("-", "")

    # 3. change numbers to star tokens
    number_pattern = r"\b\w*\d\w*\b"
    fa_transcript = re.sub(number_pattern, "*", fa_transcript)

    # 4. remove other tags
    tag_pattern = r"\<.*?\>"
    fa_transcript = re.sub(tag_pattern, " ", fa_transcript) 

    # 5. lower transcript
    fa_transcript = fa_transcript.lower()

    # 6. collapse repeated spaces and trim leading/trailing spaces
    while "  " in fa_transcript:
        fa_transcript = fa_transcript.replace("  ", " ")

    # strip() is safe on an empty string, unlike indexing [0] and [-1],
    # which raises IndexError when a transcript contains only tags
    fa_transcript = fa_transcript.strip(" ")

    return fa_transcript
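The effect of the six steps is easiest to see on a concrete utterance. The sketch below is a condensed restatement of the same steps, written for illustration rather than as a replacement. The sample sentence and the `<laugh>` tag are invented, and the final trim uses `str.strip`, which also handles a transcript that becomes empty.

```python
import re

PUNCTUATIONS = [".", ",", ":", "?", "!"]


def convert_fa_transcript(transcript: str) -> str:
    text = transcript.replace("<inaudible>", "*")  # 1. inaudible -> star
    for punct in PUNCTUATIONS + ["-"]:             # 2. drop punctuation and hyphens
        text = text.replace(punct, "")
    text = re.sub(r"\b\w*\d\w*\b", "*", text)      # 3. tokens with digits -> star
    text = re.sub(r"<.*?>", " ", text)             # 4. remaining tags -> space
    text = text.lower()                            # 5. lower-case
    text = re.sub(r" {2,}", " ", text)             # 6. collapse repeated spaces
    return text.strip(" ")


sample = "Well, I went there in 2019 <laugh> and <inaudible> it was GREAT!"
print(convert_fa_transcript(sample))  # well i went there in * and * it was great
```

A transcript that consists only of a tag, such as `"<laugh>"`, reduces to an empty string under these steps.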

The following code block defines a function to save a turn-level transcript as a .txt file.

In [7]:
def save_transcript(uid: str, turn_id: str, fa_transcript: str) -> None:
    save_path = SAVE_DIR / f"{uid}_{turn_id}.txt"
    with open(save_path, "w") as f:
        f.write(fa_transcript)

---

## 3. Conduct Forced Alignment Preparation

The following code block conducts the preparation.

In [8]:
for uid in user_id_generator():
    df_transcript = load_transcript(uid)
    df_transcript = convert_turnwise(df_transcript)

    for tid, transcript in turn_level_info_generator(df_transcript):
        if isinstance(transcript, float):
            print(f"{uid}_{tid} was skipped")
            continue # skip empty transcripts

        fa_transcript = convert_fa_transcript(transcript)

        save_transcript(uid, tid, fa_transcript)
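The `isinstance(transcript, float)` test above works because pandas reads an empty CSV cell as the float `NaN`, while non-empty transcript cells are strings. A slightly more explicit version of that check could look like the sketch below. The helper name `is_missing` is hypothetical and does not appear in the notebook.

```python
import math


def is_missing(value: object) -> bool:
    """Return True for float NaN, the value pandas assigns to empty CSV cells."""
    return isinstance(value, float) and math.isnan(value)


print(is_missing(float("nan")), is_missing("okay"), is_missing(""))  # True False False
```

Note that an empty string is not treated as missing here, which matches the behavior of the original check.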

---

## 4. Follow-up Split Audio Saving

This section follows up by saving the split audio files that correspond to the saved transcripts.
The following code block defines a function to load an audio file.

In [9]:
def load_audio(uid: str) -> AudioSegment:
    if int(uid) == 19:
        uid = "086"

    if int(uid) < 56:
        audio_path = AUDIO_DIR / f"{uid}-user.mp4"
    else:
        audio_path = AUDIO_DIR / f"{uid}-user.m4a"

    audio = AudioSegment.from_file(audio_path)

    return audio
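The file-name rules in `load_audio` can be checked without touching the disk by separating path resolution from loading. The sketch below mirrors the rules above. The function name `resolve_audio_path` and the `audio_dir` parameter are hypothetical, introduced only for this illustration.

```python
from pathlib import Path


def resolve_audio_path(uid: str, audio_dir: Path) -> Path:
    """Build the audio path for a user id without reading the file."""
    if int(uid) == 19:
        uid = "086"  # user 019 is read from the file numbered 086, as in load_audio

    # ids below 56 use .mp4, the rest use .m4a
    suffix = ".mp4" if int(uid) < 56 else ".m4a"
    return audio_dir / f"{uid}-user{suffix}"


for uid in ("001", "019", "056"):
    print(uid, resolve_audio_path(uid, Path("wav")).name)
# 001 001-user.mp4
# 019 086-user.m4a
# 056 056-user.m4a
```

Because the remapping happens before the extension is chosen, user 019 resolves to an `.m4a` file.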

The following code block defines a function that looks up the start and end times of a turn from a transcript and a turn id.

In [10]:
def search_start_end_time(df_transcript: pd.DataFrame, tid: str) -> Tuple[int, int]:
    idx = int(tid)
    
    start_time = df_transcript.at[idx, "start_time"]
    end_time = df_transcript.at[idx, "end_time"]

    return start_time, end_time

The following code block detects .txt files that have no paired .wav file and saves the corresponding split audio.

In [11]:
for uid in user_id_generator():
    df_transcript = load_transcript(uid)
    df_transcript = convert_turnwise(df_transcript)
    audio = None

    for fa_txt_path in SAVE_DIR.glob(f"{uid}_*.txt"):
        fa_wav_path = SAVE_DIR / f"{fa_txt_path.stem}.wav"
        if fa_wav_path.exists():
            continue

        print(f"Save {fa_wav_path.name}...", end="\t")

        _, tid = fa_txt_path.stem.split("_")
        start_time, end_time = search_start_end_time(df_transcript, tid)

        if audio is None:
            audio = load_audio(uid)

        split_audio = audio[start_time:end_time]
        split_audio.export(fa_wav_path, format="wav")
        print("DONE!")

Save 001_009.wav...	DONE!
Save 001_011.wav...	DONE!
Save 001_013.wav...	DONE!
Save 001_015.wav...	DONE!
Save 001_017.wav...	DONE!
Save 001_019.wav...	DONE!
Save 001_021.wav...	DONE!
Save 001_023.wav...	DONE!
Save 001_025.wav...	DONE!
Save 001_027.wav...	DONE!
Save 001_029.wav...	DONE!
Save 001_031.wav...	DONE!
Save 001_033.wav...	DONE!
Save 001_035.wav...	DONE!
Save 001_037.wav...	DONE!
Save 001_039.wav...	DONE!
Save 001_041.wav...	DONE!
Save 001_043.wav...	DONE!
Save 001_045.wav...	DONE!
Save 001_047.wav...	DONE!
Save 001_049.wav...	DONE!
Save 001_051.wav...	DONE!
Save 001_053.wav...	DONE!
Save 002_009.wav...	DONE!
Save 002_011.wav...	DONE!
Save 002_013.wav...	DONE!
Save 002_015.wav...	DONE!
Save 002_017.wav...	DONE!
Save 002_019.wav...	DONE!
Save 002_021.wav...	DONE!
Save 002_023.wav...	DONE!
Save 002_025.wav...	DONE!
Save 002_027.wav...	DONE!
Save 002_029.wav...	DONE!
Save 002_031.wav...	DONE!
Save 002_033.wav...	DONE!
Save 002_035.wav...	DONE!
Save 002_037.wav...	DONE!
Save 002_039