5/10 (Fri) | Preprocess

# Forced Alignment Preparation for Monologue Data (ASR Transcript)

## 1. Introduction

This notebook generates monologue transcripts for wav2vec forced alignment.
The transcript generation process consists of the following stages.

1. Load rev-format transcripts
2. Convert transcripts to lower-case strings with star tokens
3. Save them as .txt files

Before starting the process, the following code block loads the required packages and defines global variables.

In [1]:
from typing import List, Dict
from pathlib import Path
import re

import pandas as pd

DATA_DIR = Path("/home/matsuura/Development/app/feature_extraction_api/experiment/data")

TASK = ["Arg_Oly", "Cartoon", "RtSwithoutRAA", "RtSwithRAA"]

---

## 2. Rev-Format Transcript Loading

In this stage, the rev-format transcripts, which are saved as CSV files, are loaded.
The following code block defines a function that loads all CSV transcripts of one task into a dictionary keyed by filename.

In [2]:
def load_rev_transcripts(task: str) -> Dict[str, pd.DataFrame]:
    load_dir = DATA_DIR / f"{task}/02_Rev_Transcript"
    
    rev_transcripts = {}
    for csv_path in load_dir.glob("*_long.csv"):
        filename = csv_path.stem.removesuffix("_long")
        # keep_default_na=False keeps words such as "null" or "NA" as text; only blank cells become NaN
        df_transcript = pd.read_csv(csv_path, index_col=0, na_values=["", " "], keep_default_na=False)

        rev_transcripts[filename] = df_transcript

    return rev_transcripts
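
The `na_values` and `keep_default_na` arguments matter because a transcript can contain words that pandas treats as missing values by default. The sketch below illustrates the difference on a small in-memory CSV; the CSV content and the names `csv_text`, `df_default`, and `df_strict` are made up for illustration and are not part of the dataset.

```python
import io

import pandas as pd

# Made-up CSV in the same spirit as the rev-format files (illustration only).
csv_text = "idx,text,type\n0,null,text\n1,,punct\n2,NA,text\n"

# Default parsing: "null" and "NA" are both read as missing values.
df_default = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Parsing as in load_rev_transcripts: only blank cells are read as missing values.
df_strict = pd.read_csv(
    io.StringIO(csv_text), index_col=0, na_values=["", " "], keep_default_na=False
)

print(df_default["text"].isna().tolist())  # [True, True, True]
print(df_strict["text"].isna().tolist())   # [False, True, False]
```

With the stricter settings, the spoken words "null" and "NA" survive as text, while the empty cell of the punctuation row is still NaN.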

The following code block loads the transcripts of all tasks.

In [3]:
transcripts_raw = {}
for task in TASK:
    transcripts_raw[task] = load_rev_transcripts(task)

The following code block shows a loaded transcript sample.

In [4]:
transcripts_raw["Arg_Oly"]["1001_Arg_Oly"]

Unnamed: 0,start_time,end_time,text,type,confidence
0,0.285,0.405,I,text,0.88
1,,,,punct,
2,0.405,0.725,agree,text,0.91
3,,,,punct,
4,0.875,1.165,this,text,0.95
...,...,...,...,...,...
137,,,,punct,
138,48.145,48.425,this,text,0.87
139,,,,punct,
140,48.425,48.825,statement,text,0.98


---

## 3. Transcript Conversion

In this section, I convert the rev-format transcripts, which were loaded as pandas DataFrames, to lower-case strings with star tokens.
The following code block defines a function that extracts the words of a transcript from a DataFrame, skipping punctuation rows.

In [5]:
def extract_word_list(df_transcript: pd.DataFrame) -> List[str]:
    punct_mask = (df_transcript["type"] == "punct")
    df_transcript_wo_punct = df_transcript[~punct_mask]

    word_list = df_transcript_wo_punct["text"].to_list()

    return word_list
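
As a quick illustration of the filtering step, the sketch below applies the same mask to a toy DataFrame modeled on the first rows of the sample above. The name `df_toy` is hypothetical, and only the two columns the function relies on are included.

```python
import pandas as pd

# Toy rows modeled on the sample transcript: punctuation rows carry no text.
df_toy = pd.DataFrame(
    {
        "text": ["I", None, "agree", None, "this"],
        "type": ["text", "punct", "text", "punct", "text"],
    }
)

# Same idea as extract_word_list: drop the rows whose type is "punct".
punct_mask = df_toy["type"] == "punct"
word_list = df_toy.loc[~punct_mask, "text"].to_list()

print(word_list)  # ['I', 'agree', 'this']
```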

The following code block defines a function that builds the forced-alignment string: it removes hyphens and periods, replaces numbers and `<inaudible>` tags with star tokens, removes all other tags, and lower-cases the result.

In [6]:
def convert_fa_transcript(word_list: List[str]) -> str:
    # 1. change numbers to star tokens
    word_list_conv = []
    number_pattern = r".*?\d"
    for word in word_list:
        if "-" in word:
            word_list_conv.append(word.replace("-", ""))
            continue

        if "." in word:
            word_list_conv.append(word.replace(".", ""))
            continue

        if re.match(number_pattern, word) is None:
            word_list_conv.append(word)
            continue
        
        # if the word contains a number, change each numeric part to a star token
        numbers = word.split(" ")
        for number in numbers:
            # drop commas, e.g., "October, 1998" is converted to "October *"
            number = number.replace(",", "")
            if number.isalpha():
                word_list_conv.append(number)
                continue

            word_list_conv.append("*")
    
    # 2. change inaudible tags to star tokens
    fa_transcript = " ".join(word_list_conv)
    fa_transcript = fa_transcript.replace("<inaudible>", "*")

    # 3. remove other tags
    tag_pattern = r"\<.*?\>"
    fa_transcript = re.sub(tag_pattern, " ", fa_transcript) 

    # 4. lower transcript
    fa_transcript = fa_transcript.lower()

    # 5. remove extra spaces
    while "  " in fa_transcript:
        fa_transcript = fa_transcript.replace("  ", " ")

    # strip() also handles an empty transcript, where indexing [0] would raise IndexError
    fa_transcript = fa_transcript.strip(" ")

    return fa_transcript
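
One detail of step 1 is worth noting: a word that contains a hyphen or a period is handled by the first two branches and never reaches the number check, so a token such as "20-year" would become "20year" rather than a star token. Whether that is intended is not stated in this notebook. The sketch below is a hypothetical variant, not the function used here, that strips the punctuation first and then stars every token that still contains a digit; the name `star_numbers` is an assumption made for illustration.

```python
import re
from typing import List


def star_numbers(word_list: List[str]) -> List[str]:
    """Hypothetical variant of step 1: check for digits after stripping punctuation."""
    converted = []
    for word in word_list:
        # A single rev "word" may contain spaces (e.g., "October, 1998").
        for token in word.split(" "):
            token = token.replace(",", "").replace("-", "").replace(".", "")
            if not token:
                continue
            # Any token that still contains a digit becomes a star token.
            converted.append("*" if re.search(r"\d", token) else token)
    return converted


print(star_numbers(["in", "October, 1998", "a", "20-year", "plan", "U.S."]))
# ['in', 'October', '*', 'a', '*', 'plan', 'US']
```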

The following code block converts the rev-format transcripts of all tasks.

In [7]:
transcripts_conv = {}

for task in TASK:
    transcripts_conv[task] = {}

    for filename, df_transcript in transcripts_raw[task].items():
        word_list = extract_word_list(df_transcript)
        fa_transcript = convert_fa_transcript(word_list)
        transcripts_conv[task][filename] = fa_transcript

The following code block shows a converted transcript sample.

In [8]:
transcripts_conv["Arg_Oly"]["1001_Arg_Oly"]

'i agree this statement the tokyo olympics in * will bring economic growth to japan because uh because uh many foreigners will come to japan to see the tokyo olympics in * then they can find the traditional japanese food and traditional japanese something is very good maybe mmhmm this the tokyo olympics can make foreigners like japan so i agree this statement'
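
Before saving, it can help to confirm that a converted transcript contains only characters the aligner is expected to accept. The sketch below assumes, without confirmation from this notebook, that the label set is lower-case letters, the apostrophe, and the star token, with spaces between words; `ALLOWED_CHAR` and `find_unexpected_chars` are hypothetical names.

```python
import re
from typing import Set

# Assumption: the aligner accepts only these characters (spaces separate words).
ALLOWED_CHAR = re.compile(r"[a-z'* ]")


def find_unexpected_chars(fa_transcript: str) -> Set[str]:
    """Return the characters that fall outside the assumed label set."""
    return {ch for ch in fa_transcript if ALLOWED_CHAR.fullmatch(ch) is None}


print(find_unexpected_chars("the tokyo olympics in * will bring economic growth"))  # set()
print(sorted(find_unexpected_chars("it's 5pm!")))  # ['!', '5']
```

An empty set means the transcript passes the check; any reported character points to a case the conversion function does not yet cover.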

---

## 4. Saving Converted Transcripts

This section saves each converted transcript as a .txt file in the `03_FA_Audio_Transcript_Auto` directory of its task.

In [9]:
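for task in TASK:
    save_dir = DATA_DIR / f"{task}/03_FA_Audio_Transcript_Auto"
    # create the output directory if it does not exist yet
    save_dir.mkdir(parents=True, exist_ok=True)

    for filename, fa_transcript in transcripts_conv[task].items():
        save_path = save_dir / f"{filename}.txt"

        # fix the encoding so that the files do not depend on the platform default
        with open(save_path, "w", encoding="utf-8") as f:
            f.write(fa_transcript)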
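
A simple way to confirm that the files were written correctly is to read them back and compare them with the in-memory strings. The sketch below does this in a temporary directory; `toy_transcripts` is a hypothetical stand-in for `transcripts_conv[task]`, and its filenames and contents are made up for illustration.

```python
import tempfile
from pathlib import Path

# Hypothetical stand-in for transcripts_conv[task] (made-up entries).
toy_transcripts = {
    "1001_Arg_Oly": "i agree this statement",
    "1002_Arg_Oly": "the tokyo olympics in *",
}

with tempfile.TemporaryDirectory() as tmp:
    save_dir = Path(tmp) / "03_FA_Audio_Transcript_Auto"
    save_dir.mkdir(parents=True, exist_ok=True)

    # Write one .txt file per transcript.
    for filename, fa_transcript in toy_transcripts.items():
        (save_dir / f"{filename}.txt").write_text(fa_transcript, encoding="utf-8")

    # Read every file back, keyed by filename without the extension.
    reloaded = {
        path.stem: path.read_text(encoding="utf-8")
        for path in save_dir.glob("*.txt")
    }

print(reloaded == toy_transcripts)  # True
```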