5/11 (Sat) | Annotation

# Conversion from Forced Alignment CSV to Rev Transcript JSON

## 1. Introduction

This notebook converts forced alignment (FA) CSV files into Rev transcript JSON files.
The following code block loads the required packages and defines the global variables used throughout the conversion.

In [1]:
import json
from typing import List, Dict, Generator
from pathlib import Path

import pandas as pd

DATA_DIR = Path("/home/matsuura/Development/app/feature_extraction_api/experiment/data")

TASK = ["Arg_Oly", "Cartoon", "RtSwithoutRAA", "RtSwithRAA", "WoZ_Interview"]

---

## 2. Define Functions

This section defines the functions used to convert FA CSV files to the Rev transcript JSON format.
The following code block defines a generator that yields the path of each FA CSV file for a given task.

In [2]:
def csv_path_generator(task: str) -> Generator[Path, None, None]:
    load_dir = DATA_DIR / f"{task}/04_FA_csv_Auto"

    for csv_path in load_dir.glob("*_filled.csv"):
        yield csv_path
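
`Path.glob` yields its matches in no guaranteed order, so the files may be processed in a different order from run to run. The sketch below shows how the `*_filled.csv` pattern selects files and how sorting makes the order reproducible. The helper `list_filled_csvs` and the file names are hypothetical and exist only for illustration; they are not part of the notebook.

```python
import tempfile
from pathlib import Path


def list_filled_csvs(load_dir: Path) -> list:
    """Return the *_filled.csv files in load_dir, sorted by file name."""
    return sorted(load_dir.glob("*_filled.csv"))


# Build a throwaway directory with made-up file names (illustration only).
with tempfile.TemporaryDirectory() as tmp:
    load_dir = Path(tmp)
    for name in ["b_filled.csv", "a_filled.csv", "a.csv", "notes.txt"]:
        (load_dir / name).touch()

    # Only names ending in "_filled.csv" match the pattern.
    found_names = [p.name for p in list_filled_csvs(load_dir)]

print(found_names)  # ['a_filled.csv', 'b_filled.csv']
```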

The following code block defines functions that generate the elements of a Rev transcript.

In [3]:
def generate_word_block(word: str, start_time: float, end_time: float) -> List[dict]:
    # Cells that pandas parsed as numbers are converted back to text.
    if not isinstance(word, str):
        word = str(word)

    # split() without arguments also collapses repeated spaces,
    # so no empty tokens are produced.
    words = word.split()
    if not words:
        return []

    # A single word keeps the aligned interval unchanged.
    if len(words) == 1:
        return [{
            "type": "text",
            "value": words[0],
            "ts": start_time,
            "end_ts": end_time,
            "confidence": 1.0
        }]

    # A multi-word cell shares one interval; divide it evenly among the words.
    delta = (end_time - start_time) / len(words)
    word_block_list = []
    for t, w in enumerate(words):
        word_block_list.append({
            "type": "text",
            "value": w,
            "ts": start_time + (delta * t),
            "end_ts": start_time + (delta * (t + 1)),
            "confidence": 1.0
        })
        word_block_list.append({
            "type": "punct",
            "value": " "
        })

    # Drop the trailing separator.
    return word_block_list[:-1]

def generate_rev_element(df_fa: pd.DataFrame) -> List[dict]:
    element = []
    for idx in df_fa.index:
        word = df_fa.at[idx, "word"]

        # pandas reads empty CSV cells as NaN by default,
        # so skip both NaN and empty strings.
        if pd.isna(word) or word == "":
            continue

        start_time = df_fa.at[idx, "start_time"]
        end_time = df_fa.at[idx, "end_time"]

        element += generate_word_block(word, start_time, end_time)
        # Separate consecutive words with a space.
        element.append({
            "type": "punct",
            "value": " "
        })

    # Turn the trailing separator into a final period;
    # the list is empty when the CSV contains no words.
    if element:
        element[-1]["value"] = "."

    return element
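
When one CSV cell holds several space-separated words, `generate_word_block` divides the aligned interval evenly among them. The arithmetic can be sketched in isolation as follows. The helper `split_interval` and the sample phrase are hypothetical; this is a worked example of the even split, not a replacement for the function above.

```python
def split_interval(words: list, start: float, end: float) -> list:
    """Assign each word an equal share of the interval [start, end]."""
    delta = (end - start) / len(words)
    return [
        (w, start + delta * t, start + delta * (t + 1))
        for t, w in enumerate(words)
    ]


# Made-up cell: three words aligned to the interval 1.0 s to 2.5 s.
spans = split_interval("kind of like".split(), 1.0, 2.5)
print(spans)  # [('kind', 1.0, 1.5), ('of', 1.5, 2.0), ('like', 2.0, 2.5)]
```

Each word receives 0.5 s here because (2.5 - 1.0) / 3 = 0.5. The split is an approximation: the true word boundaries inside the cell are unknown.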

The following code block defines a function that saves the elements as a Rev transcript JSON file.

In [4]:
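def save_rev_json(element: List[dict], csv_path: Path, task: str) -> None:
    # str.removesuffix requires Python 3.9 or later.
    filename = csv_path.stem.removesuffix("_filled")
    json_path = DATA_DIR / f"{task}/07_Rev_Json/{filename}.json"

    # Create the output directory if it does not exist yet.
    json_path.parent.mkdir(parents=True, exist_ok=True)

    # All elements are stored in a single monologue by speaker 0.
    rev_json = {
        "monologues": [
            {
                "speaker": 0,
                "elements": element
            }
        ]
    }

    with open(json_path, "w") as f:
        json.dump(rev_json, f, indent=4)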
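
To see what the saved structure looks like, the sketch below round-trips a small transcript through `json` and joins the element values back into plain text. The sample elements are made up and the helper `monologue_text` is hypothetical; only the `monologues` / `speaker` / `elements` layout is taken from the function above.

```python
import json

# Made-up transcript in the same layout that save_rev_json writes.
rev_json = {
    "monologues": [
        {
            "speaker": 0,
            "elements": [
                {"type": "text", "value": "hello", "ts": 0.0, "end_ts": 0.4, "confidence": 1.0},
                {"type": "punct", "value": " "},
                {"type": "text", "value": "world", "ts": 0.4, "end_ts": 0.9, "confidence": 1.0},
                {"type": "punct", "value": "."},
            ],
        }
    ]
}


def monologue_text(transcript: dict) -> str:
    """Concatenate the values of all text and punct elements."""
    return "".join(
        e["value"] for m in transcript["monologues"] for e in m["elements"]
    )


# Serialize and parse again, as reading the saved file back would do.
restored = json.loads(json.dumps(rev_json, indent=4))
text = monologue_text(restored)
print(text)  # hello world.
```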

---

## 3. Conversion of FA CSV to Rev JSON

This section converts the FA CSV files of every task in `TASK` to Rev JSON files.

In [5]:
for task in TASK:
    for csv_path in csv_path_generator(task):
        df_fa = pd.read_csv(csv_path)

        element = generate_rev_element(df_fa)
        save_rev_json(element, csv_path, task)
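
After the conversion, a quick sanity check on the timestamps can catch alignment rows that overlap or end before they start. The following is a minimal sketch under the assumption that words should appear in chronological order. The helper `find_timing_problems` and the sample data are hypothetical and are not part of the notebook.

```python
def find_timing_problems(elements: list) -> list:
    """Return (index, message) pairs for text elements with suspicious timing."""
    problems = []
    prev_end = None
    for i, e in enumerate(elements):
        # Punctuation elements carry no timestamps.
        if e["type"] != "text":
            continue
        if e["ts"] > e["end_ts"]:
            problems.append((i, "ends before it starts"))
        if prev_end is not None and e["ts"] < prev_end:
            problems.append((i, "overlaps previous word"))
        prev_end = e["end_ts"]
    return problems


# Made-up elements: the second word starts before the first one ends.
sample = [
    {"type": "text", "value": "a", "ts": 0.0, "end_ts": 0.5, "confidence": 1.0},
    {"type": "punct", "value": " "},
    {"type": "text", "value": "b", "ts": 0.4, "end_ts": 0.9, "confidence": 1.0},
    {"type": "punct", "value": "."},
]
problems = find_timing_problems(sample)
print(problems)  # [(2, 'overlaps previous word')]
```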