# Forced alignment

The purpose of this notebook is to demonstrate how you can use forced alignment to automatically generate subtitles from a plain transcript and an audio file. This method is especially useful when you already have a written transcription (without timestamps) and want to align it with the audio to create subtitles.

We’ll use the [BAS Web Service](https://clarin.phonetik.uni-muenchen.de/BASWebServices/interface) to generate `.TextGrid` files, which contain time-aligned annotations. Then, we’ll process these files to create `.srt` subtitle files.

##### **Credits:**  
The audio files used in this tutorial were kindly provided by Machteld Venken. All materials are assumed to be placed inside a directory named `content`.


## Using FA to automatically generate subtitles

### 🛠️ Installing required packages

First, install the libraries this notebook depends on by running the following code cell: click the **play button** or press **Shift+Enter**, then wait until a green check mark appears at the bottom-left corner of the cell.

**Note:** If a code cell does not execute correctly, an error message appears, typically in red. Common causes are typos in the code, files that have not been uploaded, or earlier code cells that were skipped but are required by later ones. In that case, ask Gemini for help, search for the solution online, or contact me (vladyslav.siulhin@uni.lu).

In [None]:
!pip install python-docx

### 📁 Uploading Files

Before executing the next code cell, you must upload a transcript in `.docx` format. To upload files:

- Click on the **folder icon** named **Files** on the left sidebar.  
- Click the **upload icon** named **Upload to the session storage** to add your files.  
- Once uploaded, right-click the file and choose **Copy path**.  
- Paste that path where indicated in the code cells below.
- If needed, change the number of words per subtitle line by modifying the `words_per_line` variable.

⚠️ **Note:** Files uploaded this way will be removed once the session ends. You can ignore Colab's warning about temporary files.
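
To get a feel for how `words_per_line` affects the result, here is a small standalone sketch of the chunking idea. The helper name `chunk_words` and the sample sentence are made up for illustration and are not part of the notebook's own cells.

```python
def chunk_words(text, words_per_line):
    """Split text into lines holding at most words_per_line words each."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_line])
        for i in range(0, len(words), words_per_line)
    ]

# Illustrative sentence with nine words, split into lines of four words
sample = "the quick brown fox jumps over the lazy dog"
for line in chunk_words(sample, 4):
    print(line)
```

With four words per line, the nine-word sample is split into three lines, the last of which holds the single leftover word. Smaller values give shorter subtitle lines that change more often.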


In [None]:
# Path of the uploaded transcript (must be in quotes)
input_file = '.docx'
# Path of the output file that the next code cell will create (must be in quotes).
# Use the same folder as the input file but a different file name, for example reformatted_my_file.docx
output_file = '.docx'

# Set the desired number of words per line for the subtitles
words_per_line = 10

In [None]:
from docx import Document

def reformat_text_by_word_count(text, words_per_line):
    words = text.strip().split()
    lines = []
    for i in range(0, len(words), words_per_line):
        line = ' '.join(words[i:i + words_per_line])
        lines.append(line)
    return '\n'.join(lines)

def reformat_document(file_path, output_path, words_per_line):
    doc = Document(file_path)

    for para in doc.paragraphs:
        para.text = reformat_text_by_word_count(para.text, words_per_line)

    doc.save(output_path)
    print(f"Document saved to {output_path}")

reformat_document(input_file, output_file, words_per_line)


After running the code cell above, your new `.docx` file should be available in the `content` folder. Download it to your local machine to use it in the next step.

### How to Obtain the `.TextGrid` File using BAS Web Service

Before we can generate subtitles, we need to obtain a `.TextGrid` file that contains time-aligned annotations between the transcript and the audio. This can be done with the [BAS Web Service](https://clarin.phonetik.uni-muenchen.de/BASWebServices/).
Once you are on the site, open the [Pipeline without ASR](https://clarin.phonetik.uni-muenchen.de/BASWebServices/interface/Pipeline) option.

You can follow the official tutorial here:  
🔗 [Pipeline Without ASR – BAS Tutorial](https://clarin.phonetik.uni-muenchen.de/BASWebServices/help/tutorial#PipelineWithoutASR)

Or follow my own step-by-step instructions:

1. Upload both your **audio file** and your **transcript file**. The two files must have the **same name** apart from the extension (for example, `MyInterview.mp3` and `MyInterview.docx`).
2. Choose the pipeline type:
   - For short recordings (under ~30 minutes), select `G2P → MAUS`.
   - For longer recordings, use `G2P → CHUNKER → MAUS`.
3. Choose the language. If the language is supported (like English or Russian), just select it. If you're working with **Ukrainian** or another unsupported language, select **Language-Independent (X-SAMPA)**. In that case, click **"Expert Options (click to show)"** and, for the **Imap mapping file (G2P)**, upload the file named `Ukrainian.lmao` from your `content` folder that you copied from GitHub.
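
Step 1 asks for matching file names, which is easy to get wrong if the reformatted transcript was saved under a new name such as `reformatted_my_file.docx`. The following is a minimal sketch of a check you could run before uploading. The function name `same_base_name` and the example paths are hypothetical.

```python
from pathlib import Path

def same_base_name(audio_path, transcript_path):
    """Return True if both files share the same name apart from the extension."""
    return Path(audio_path).stem == Path(transcript_path).stem

# Hypothetical paths: the first pair matches, the second does not
print(same_base_name("content/MyInterview.mp3", "content/MyInterview.docx"))
print(same_base_name("content/MyInterview.mp3", "content/reformatted_MyInterview.docx"))
```

If the check returns `False`, rename one of the files locally before uploading both to the web service.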

Before running the service, don’t forget to check the box confirming you’ve read the terms of usage. Then click **Run Web Service**.

Processing might take some time, especially for longer audio files. Once it finishes, click **Download as ZIP-File**, extract the `.TextGrid` file, and upload it to your Colab environment to continue with this tutorial.
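
If you would rather upload the downloaded ZIP archive itself to Colab, the `.TextGrid` file can also be extracted there with Python's standard `zipfile` module. This is only a sketch: the function name `extract_textgrids` and the archive name in the usage comment are assumptions, since the actual name of the downloaded archive may differ.

```python
import zipfile
from pathlib import Path

def extract_textgrids(zip_path, target_dir):
    """Extract every .TextGrid member of the archive and return the extracted paths."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            # Match the extension case-insensitively and skip all other files
            if member.lower().endswith(".textgrid"):
                extracted.append(archive.extract(member, path=target))
    return extracted

# Example usage (hypothetical archive name):
# paths = extract_textgrids("/content/BAS_result.zip", "/content")
# print(paths)
```

One of the returned paths can then be pasted into the `textgrid_file` variable in the next code cell.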


In [None]:
# Insert here the path of the uploaded .TextGrid file (must be between quotes)
textgrid_file = '.TextGrid'
# Type the name of the output file with extension .txt (must be between quotes)
output_file = '.txt'
# Set the desired number of words per subtitle line
words_per_line = 10

In [None]:
import re

def parse_textgrid_to_word_timings(file_path):
    """Return the non-empty intervals of the first tier with their start and end times."""
    segments = []
    start_time = None
    end_time = None

    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            # Only the first tier is needed, so stop when the second tier begins
            if 'item [2]:' in line:
                break

            xmin_match = re.search(r'xmin = ([0-9.]+)', line)
            xmax_match = re.search(r'xmax = ([0-9.]+)', line)
            text_match = re.search(r'text = "(.*)"', line)

            # Inside an interval the lines come in the order xmin, xmax, text,
            # so the text line closes the interval and uses the times read just before it.
            if xmin_match:
                start_time = float(xmin_match.group(1))
            elif xmax_match:
                end_time = float(xmax_match.group(1))
            elif text_match:
                text = text_match.group(1).strip()
                if text and start_time is not None and end_time is not None:
                    segments.append({
                        'text': text,
                        'start_time': start_time,
                        'end_time': end_time
                    })
                start_time = None
                end_time = None

    return segments

def compute_word_timings(segments):
    word_timings = []
    for seg in segments:
        text = seg['text']
        seg_start = seg['start_time']
        seg_end = seg['end_time']
        duration = seg_end - seg_start
        words = text.split()
        if not words:
            continue
        word_duration = duration / len(words)
        for i, word in enumerate(words):
            w_start = seg_start + i * word_duration
            w_end = seg_start + (i + 1) * word_duration
            word_timings.append((word, w_start, w_end))
    return word_timings

def group_words_to_subtitles(word_timings, words_per_line):
    subtitles = []
    for i in range(0, len(word_timings), words_per_line):
        chunk = word_timings[i:i + words_per_line]
        if chunk:
            words = [w for w, _, _ in chunk]
            subtitle_text = ' '.join(words)
            subtitle_start = chunk[0][1]
            subtitle_end = chunk[-1][2]
            subtitles.append({
                'text': subtitle_text,
                'start_time': subtitle_start,
                'end_time': subtitle_end
            })
    return subtitles

def write_subtitles(output_file, subtitles):
    with open(output_file, 'w', encoding='utf-8') as f:
        for sub in subtitles:
            f.write(f'"{sub["text"]}"\n')
            f.write(f'xmin: {sub["start_time"]}\n')
            f.write(f'xmax: {sub["end_time"]}\n')
            f.write("\n")


segments = parse_textgrid_to_word_timings(textgrid_file)
word_timings = compute_word_timings(segments)
subtitles = group_words_to_subtitles(word_timings, words_per_line)
write_subtitles(output_file, subtitles)

print(f"Processed {len(subtitles)} subtitle chunks with timecodes.")
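
For orientation, a long-format `.TextGrid` stores each interval as three lines in the order `xmin`, `xmax`, `text`. The sketch below parses a hand-written fragment in that layout. The values are illustrative and not real BAS output, and the regular expression is just one simple way to read such a fragment.

```python
import re

# Hand-written fragment in the long TextGrid layout (illustrative values)
sample = '''
        intervals [1]:
            xmin = 0.00
            xmax = 0.35
            text = ""
        intervals [2]:
            xmin = 0.35
            xmax = 0.80
            text = "hello"
        intervals [3]:
            xmin = 0.80
            xmax = 1.25
            text = "world"
'''

# One match per interval: start time, end time, label
interval = re.compile(r'xmin = ([0-9.]+)\s+xmax = ([0-9.]+)\s+text = "(.*)"')

# Keep only intervals with a non-empty label; empty labels typically mark pauses
words = [
    (text, float(xmin), float(xmax))
    for xmin, xmax, text in interval.findall(sample)
    if text.strip()
]
print(words)
```

The result is a list of `(word, start, end)` tuples for "hello" and "world", while the interval with the empty label is skipped.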


### Convert `.txt` to `.srt` Format

The `.srt` format is one of the most widely supported subtitle formats used in video players, editors, and streaming platforms. In this step, we’ll convert the intermediate subtitle file we generated earlier into a clean `.srt` format.

Run the following code block to process the text-based subtitle file and create a standard `.srt` file that includes properly formatted timestamps and numbered subtitle blocks.
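
As a point of reference, an `.srt` file is a sequence of blocks, each made of a running number, a line of the form `HH:MM:SS,mmm --> HH:MM:SS,mmm`, the subtitle text, and a blank line. The sketch below builds one such block. The function names `to_srt_timestamp` and `srt_block` and the sample values are made up for illustration.

```python
def to_srt_timestamp(seconds):
    """Format a time in seconds as HH:MM:SS,mmm."""
    # Work in whole milliseconds to avoid floating-point truncation
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02}:{minutes:02}:{secs:02},{millis:03}"

def srt_block(index, start, end, text):
    """Build one numbered subtitle block without the trailing blank line."""
    return f"{index}\n{to_srt_timestamp(start)} --> {to_srt_timestamp(end)}\n{text}\n"

print(srt_block(1, 0.35, 3.9, "hello world"))
```

Printing the block shows the number `1`, then `00:00:00,350 --> 00:00:03,900`, then the text. Note the comma (not a period) before the milliseconds.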


In [None]:
# Insert here the path of the file created on the last step (.txt)
input_filename = ".txt"
# Insert here the path of the output file that will be created after running the next code cell (must be the same as the path of the input file but with the .srt extension, for example my_file.srt)
output_filename = ".srt"

In [None]:
import re

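def seconds_to_srt_time(seconds):
    # Work in whole milliseconds to avoid floating-point truncation errors
    total_ms = round(seconds * 1000)
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millisecs = divmod(remainder, 1000)
    return f"{hours:02}:{minutes:02}:{secs:02},{millisecs:03}"

def process_text_to_srt(input_text):
    # Each block of the intermediate file holds a quoted text line,
    # followed by "xmin: <seconds>" and "xmax: <seconds>"
    pattern = re.compile(r'"(.*?)"\s*xmin:\s*(\d+(?:\.\d+)?)\s*xmax:\s*(\d+(?:\.\d+)?)')
    matches = pattern.findall(input_text)

    srt_output = []
    # SRT blocks are numbered consecutively, starting at 1
    for index, (text, xmin, xmax) in enumerate(matches, start=1):
        start_time = seconds_to_srt_time(float(xmin))
        end_time = seconds_to_srt_time(float(xmax))

        srt_output.append(f"{index}\n{start_time} --> {end_time}\n{text}\n\n")

    return "".join(srt_output)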

def read_input_file(input_filename):
    with open(input_filename, "r", encoding="utf-8") as file:
        return file.read()

def save_srt_output(output_filename, srt_data):
    with open(output_filename, "w", encoding="utf-8") as file:
        file.write(srt_data)

input_text = read_input_file(input_filename)

srt_result = process_text_to_srt(input_text)

save_srt_output(output_filename, srt_result)
