# SRT Subtitle Interpreter System

**Group 1 - Shakra**
- Bagallon, Radzie, R.
- Castro, Joselito Miguel C.
- Duldulao, Jacob O.
- Gigante, Raphael Nicolai M.

---

## Section 1: Introduction to the Problem/Task and Interpreter System

### What is an Interpreter?

An interpreter is a program that reads and executes code written in a specific language. Unlike a compiler, which translates an entire program ahead of time, an interpreter processes its input incrementally, typically one statement or line at a time.

Some examples of where interpreters are used are the following:

- Python itself uses an interpreter to run code
- Command shells (like bash) interpret terminal commands
- Web browsers interpret HTML and JavaScript
- Game engines interpret scripting languages

Interpreters are important because they make it easier to work with structured data and execute instructions without needing to understand low-level machine code.

### SRT Subtitle System

Because we like watching subbed media (mostly anime and k-drama), we chose to build an SRT subtitle interpreter. This is a system that reads, validates, and displays subtitle files. For this project, we're specifically working with the SRT (SubRip) format. It is one of the most popular subtitle formats because it's simple, text-based, and widely supported.

Real-world applications:

- Video players (VLC, Windows Media Player) parse SRT files to display timed text
- Streaming platforms (Netflix, YouTube) use subtitle interpreters for accessibility
- Video editors rely on subtitle parsers to sync captions with video
- Translation tools process subtitle files to convert dialogue between languages

Target tasks:

1. Read SRT files and break them into meaningful pieces (tokens)
2. Validate the structure to ensure files follow the correct format
3. Execute the subtitles by displaying them with proper timing
4. Translate subtitles to different languages for multilingual support

---


## Section 2: Description of the Input Language

### What is the SRT Format?

The SRT (SubRip) format is a simple text-based subtitle format created for extracting subtitles from video files. It's designed to be human-readable and easy to edit with any text editor.

### Structure of an SRT File

Every SRT file follows a pattern. Each subtitle entry has four parts:

```
1                                    ← Index number
00:00:01,000 --> 00:00:03,000        ← Timestamp (start --> end)
Hello, world!                        ← Text content (can be multiple lines)
                                     ← Blank line separator
2
00:00:04,000 --> 00:00:06,000
Welcome to our demo.

```

### Tokens Recognized by Our Interpreter

Our lexer (tokenizer) recognizes these token types:

| Token Type | Description                       | Example         |
| ---------- | --------------------------------- | --------------- |
| INDEX      | Sequential subtitle number        | `1`, `2`, `3`   |
| TIMESTAMP  | Time in format HH:MM:SS,MMM       | `00:00:01,500`  |
| ARROW      | Separator between start/end times | `-->`           |
| TEXT       | Subtitle content                  | `Hello, world!` |
| NEWLINE    | Line break                        | `\n`            |
| BLANK_LINE | Empty line (separator)            | `''`            |
| EOF        | End of file marker                | (none)          |
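The token types in the table could be represented in Python roughly as follows. This is a minimal sketch, not the project's actual source: the constant names other than `TOKEN_INDEX` and `TOKEN_TIMESTAMP` are assumed, and the `Token` fields (`type`, `value`, `line_number`) simply mirror the attributes printed in the lexer demo in Section 5.

```python
from dataclasses import dataclass

# Token type constants. Only TOKEN_INDEX and TOKEN_TIMESTAMP are named in this
# report; the remaining names are assumed for illustration.
TOKEN_INDEX = "INDEX"
TOKEN_TIMESTAMP = "TIMESTAMP"
TOKEN_ARROW = "ARROW"
TOKEN_TEXT = "TEXT"
TOKEN_NEWLINE = "NEWLINE"
TOKEN_BLANK_LINE = "BLANK_LINE"
TOKEN_EOF = "EOF"


@dataclass(frozen=True)
class Token:
    """One lexical unit plus the source line it came from."""
    type: str
    value: str
    line_number: int


token = Token(TOKEN_TIMESTAMP, "00:00:01,500", line_number=2)
print(f"{token.type} | {token.value!r} (line {token.line_number})")
```

Keeping the line number on every token is what lets later stages report where an error occurred.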

### Grammar and Syntax Rules

Valid subtitle structure:

```
subtitle_entry → INDEX NEWLINE timestamp_line NEWLINE text_lines BLANK_LINE
timestamp_line → TIMESTAMP ARROW TIMESTAMP
text_lines → TEXT NEWLINE (TEXT NEWLINE)*
```

Rules for valid statements:

1. Index numbers must be sequential (1, 2, 3, ...)
2. Timestamps must follow format: `HH:MM:SS,MMM` (e.g., `00:01:30,500`)
3. Start time must come before end time
4. Each subtitle must have at least one line of text
5. Subtitles must be separated by a blank line
6. Minutes and seconds must be ≤ 59; milliseconds must be ≤ 999
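Rules 2, 3, and 6 can be checked by converting each timestamp to milliseconds. The sketch below is one possible way to do this; the function names `to_milliseconds` and `check_order` are hypothetical, and it treats equal start and end times as invalid, which is our reading of rule 3.

```python
import re

# Rule 2: HH:MM:SS,MMM. Three digits for milliseconds already implies <= 999.
TIMESTAMP_RE = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")


def to_milliseconds(stamp: str) -> int:
    """Convert 'HH:MM:SS,MMM' to milliseconds, enforcing rules 2 and 6."""
    match = TIMESTAMP_RE.fullmatch(stamp)
    if match is None:
        raise ValueError(f"Bad timestamp format: {stamp!r}")
    hours, minutes, seconds, millis = (int(part) for part in match.groups())
    if minutes > 59 or seconds > 59:
        raise ValueError(f"Invalid time: {stamp}")
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis


def check_order(start: str, end: str) -> None:
    """Rule 3: the start time must come strictly before the end time."""
    if to_milliseconds(start) >= to_milliseconds(end):
        raise ValueError(f"Start time {start} must be before end time {end}")


print(to_milliseconds("00:01:30,500"))       # 90500
check_order("00:00:01,000", "00:00:03,000")  # passes silently
```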

### Examples: Valid vs Invalid

#### Valid Input

```
1
00:00:01,000 --> 00:00:03,000
This is a valid subtitle.

```

#### Invalid Inputs

```
00:00:01,000 --> 00:00:03,000
Missing the index number!

1
00:00:05,000 --> 00:00:02,000
End time comes before start time!

1
00:99:01,000 --> 00:00:03,000
Minutes can't be 99!

```

---


## Section 3: System Design

### Built-in Python Libraries

| Library | Purpose                | Description                                                    |
| ------- | ---------------------- | -------------------------------------------------------------- |
| `re`    | Regular expressions    | Pattern matching for timestamps and index numbers in the lexer |
| `time`  | Time-related functions | Adding delays between subtitle displays to simulate timing     |

### Third-Party Libraries

| Library           | Purpose                              | Description                                                            |
| ----------------- | ------------------------------------ | ---------------------------------------------------------------------- |
| `deep-translator` | Translation via Google Translate API | Translating subtitle text to Filipino, Korean, Chinese, Japanese, etc. |

We used `deep-translator` because it's free and doesn't require API keys. The tradeoff is that the translations tend to be literal rather than contextually aware, missing some nuance in the process.

---


## Section 4: Data Preprocessing and Cleaning

Our interpreter follows the classic three-stage pipeline used in most compiler and interpreter designs:

```mermaid
flowchart TD
    A[Input - SRT File]
    A --> B[LEXER / Tokenizer<br/><i>Breaks text into tokens</i>]
    B --> C[Tokens]
    C --> D[PARSER<br/><i>Validates structure, builds AST</i>]
    D --> E[Subtitle Entries]
    E --> F[EXECUTOR<br/><i>Displays subtitles with optional translation</i>]
    F --> G[Output]
```

### Lexer (Tokenizer)

- Reads the raw SRT file text line by line
- Identifies meaningful pieces (tokens) like numbers, timestamps, arrows, text
- Uses regex patterns to recognize timestamp format: `\d{2}:\d{2}:\d{2},\d{3}`

Example:

```
Input:  "1\n00:00:01,000 --> 00:00:03,000\nHello\n\n"
Output: [Token(INDEX, '1'), Token(NEWLINE, '\n'),
         Token(TIMESTAMP, '00:00:01,000'), Token(ARROW, '-->'),
         Token(TIMESTAMP, '00:00:03,000'), Token(NEWLINE, '\n'),
         Token(TEXT, 'Hello'), Token(NEWLINE, '\n'),
         Token(BLANK_LINE, ''), Token(BLANK_LINE, ''), Token(EOF, '')]
```
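A line-based tokenizer along these lines can be sketched in a few dozen lines of Python. This is an illustrative sketch under stated assumptions, not the code in `src/lexer.py`: the `tokenize` function, the `NamedTuple` token, and the `at_entry_start` heuristic (a digits-only line counts as an INDEX only at the start of the file or after a blank line) are all assumptions made for the example.

```python
import re
from typing import List, NamedTuple

TIMESTAMP = r"\d{2}:\d{2}:\d{2},\d{3}"
TIMESTAMP_LINE_RE = re.compile(rf"^({TIMESTAMP})\s*-->\s*({TIMESTAMP})$")
STARTS_WITH_TIMESTAMP_RE = re.compile(rf"^{TIMESTAMP}")
INDEX_RE = re.compile(r"^\d+$")


class Token(NamedTuple):
    type: str
    value: str
    line_number: int


class LexerError(Exception):
    """Raised when a line cannot be tokenized."""


def tokenize(text: str) -> List[Token]:
    tokens: List[Token] = []
    at_entry_start = True  # True at the start of the file and after a blank line
    line_number = 1
    for line_number, raw in enumerate(text.split("\n"), start=1):
        line = raw.strip()
        if line == "":
            tokens.append(Token("BLANK_LINE", "", line_number))
            at_entry_start = True
            continue
        match = TIMESTAMP_LINE_RE.match(line)
        if match:
            tokens.append(Token("TIMESTAMP", match.group(1), line_number))
            tokens.append(Token("ARROW", "-->", line_number))
            tokens.append(Token("TIMESTAMP", match.group(2), line_number))
        elif STARTS_WITH_TIMESTAMP_RE.match(line):
            # Looks like a timestamp line but the arrow or end time is wrong.
            raise LexerError(f"Line {line_number}: malformed timestamp line: {line!r}")
        elif at_entry_start and INDEX_RE.match(line):
            tokens.append(Token("INDEX", line, line_number))
        else:
            tokens.append(Token("TEXT", line, line_number))
        tokens.append(Token("NEWLINE", "\n", line_number))
        at_entry_start = False
    tokens.append(Token("EOF", "", line_number))
    return tokens


sample = "1\n00:00:01,000 --> 00:00:03,000\nHello\n\n"
for token in tokenize(sample):
    print(token.type, repr(token.value), token.line_number)
```

Because the text is split on `\n`, the trailing `\n\n` in the sample yields two empty strings, so this sketch emits a BLANK_LINE token for each of them before the EOF token.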

### Parser

- Takes the token stream from the lexer
- Checks that tokens appear in the correct order (grammar validation)
- Builds SubtitleEntry objects (Abstract Syntax Tree)
- Validates semantic rules (e.g., start time < end time)

Example:

```
Input:  [Token(INDEX, '1'), Token(NEWLINE), Token(TIMESTAMP, '00:00:01,000'), ...]
Output: [SubtitleEntry(index=1, start=00:00:01,000, end=00:00:03,000, text=['Hello'])]
```
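The entry objects produced by the parser might look like the dataclasses below. This is a sketch only: the attribute and method names `index`, `start_time`, `end_time`, `get_text()`, and `to_milliseconds()` are taken from the demo cells in Section 5, while the `Timestamp` class layout and the `duration_ms()` helper are assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Timestamp:
    hours: int
    minutes: int
    seconds: int
    milliseconds: int

    def to_milliseconds(self) -> int:
        total_seconds = (self.hours * 60 + self.minutes) * 60 + self.seconds
        return total_seconds * 1000 + self.milliseconds

    def __str__(self) -> str:
        # SRT style: comma before the milliseconds
        return f"{self.hours:02d}:{self.minutes:02d}:{self.seconds:02d},{self.milliseconds:03d}"


@dataclass
class SubtitleEntry:
    index: int
    start_time: Timestamp
    end_time: Timestamp
    text: List[str] = field(default_factory=list)  # one item per line of text

    def get_text(self) -> str:
        return "\n".join(self.text)

    def duration_ms(self) -> int:  # hypothetical convenience helper
        return self.end_time.to_milliseconds() - self.start_time.to_milliseconds()


entry = SubtitleEntry(1, Timestamp(0, 0, 1, 0), Timestamp(0, 0, 3, 0), ["Hello"])
print(entry.start_time, "-->", entry.end_time, "|", entry.get_text(), "|", entry.duration_ms(), "ms")
```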

### Executor

- Takes validated subtitle entries
- Optionally translates text using Google Translate
- Displays subtitles with simulated timing (using `time.sleep()`)
- Formats output as: `[00:00:01.000] DISPLAY: "Hello"`

Example:

```
Input:  [SubtitleEntry(1, start, end, ['Hello'])]
Output: [00:00:01.000] DISPLAY: "Hello"
        [00:00:03.000] CLEAR
```
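The display loop can be sketched as follows. To keep the example self-contained it uses plain `(start_ms, end_ms, text)` tuples instead of `SubtitleEntry` objects, and it accepts the sleep function as a parameter so the delay can be skipped in tests; both choices, and the names `format_ms` and `execute`, are assumptions for illustration.

```python
import time
from typing import Callable, Iterable, List, Tuple

Cue = Tuple[int, int, str]  # (start_ms, end_ms, text): simplified stand-in for SubtitleEntry


def format_ms(ms: int) -> str:
    """Render milliseconds as HH:MM:SS.mmm (dot separator, as in the display format)."""
    seconds, millis = divmod(ms, 1000)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"


def execute(cues: Iterable[Cue], sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Print DISPLAY/CLEAR events in order, waiting between them; return the printed lines."""
    lines: List[str] = []
    clock_ms = 0
    for start_ms, end_ms, text in cues:
        for at_ms, event in ((start_ms, f'DISPLAY: "{text}"'), (end_ms, "CLEAR")):
            sleep(max(0, at_ms - clock_ms) / 1000)  # wait until the event is due
            clock_ms = at_ms
            line = f"[{format_ms(at_ms)}] {event}"
            print(line)
            lines.append(line)
    return lines


# Pass a no-op sleep to skip the real-time delay.
execute([(1000, 3000, "Hello")], sleep=lambda _seconds: None)
```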

### Error Handling Strategy

We handle errors at each stage:

1. Lexer Errors (`LexerError`):

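   - Malformed timestamps, i.e. text that does not match the `HH:MM:SS,MMM` pattern (out-of-range values such as `25:99:99,000` still match the pattern and are rejected by the parser instead)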
   - Missing arrow in timestamp line
   - Invalid characters in timestamps

2. Parser Errors (`ParserError`):

   - Missing index numbers
   - Non-sequential indices (e.g., jumps from 1 to 3)
   - Out-of-range time values (e.g., minutes or seconds above 59)
   - Start time after end time
   - Missing text content
   - Missing blank line separator

3. Executor Errors (`ExecutorError`):
   - No subtitles to display (empty file)
   - Translation library not installed
   - Unsupported target language

We chose this three-stage pipeline design (lexer, parser, executor) because it allows each component to handle one job on its own, making the system easier to debug, test, and understand.

We chose fail-fast error handling (stopping at the first error) because it provides clear feedback, and we think it suits SRT files well: a single error usually requires fixing the entire file anyway.
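One way to organize the three error types is a small exception hierarchy that carries the line number. The sketch below is an assumption about structure, not the project's code: the shared `SRTError` base class and the `check_indices` helper are hypothetical, while the three subclass names and the message text follow this report.

```python
from typing import Iterable, Optional, Tuple


class SRTError(Exception):
    """Hypothetical base class so callers can catch any pipeline error at once."""

    def __init__(self, message: str, line_number: Optional[int] = None):
        self.line_number = line_number
        prefix = f"Line {line_number}: " if line_number is not None else ""
        super().__init__(prefix + message)


class LexerError(SRTError):
    """Raised by the lexer."""


class ParserError(SRTError):
    """Raised by the parser."""


class ExecutorError(SRTError):
    """Raised by the executor."""


def check_indices(indices: Iterable[Tuple[int, int]]) -> None:
    """Fail fast: raise on the first non-sequential index instead of collecting all problems."""
    for expected, (line_number, actual) in enumerate(indices, start=1):
        if actual != expected:
            raise ParserError(f"Expected subtitle number {expected}", line_number)


try:
    check_indices([(1, 1), (5, 3)])  # (line_number, index) pairs; the index jumps from 1 to 3
except SRTError as error:
    print(f"{type(error).__name__}: {error}")
```

With fail-fast handling, processing stops at the first `raise`, so the user sees exactly one message pointing at the first problem in the file.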

---


## Section 5: Implementation Details

### Setup: Import the Interpreter

In [1]:
# Import the main interpreter class
from src.interpreter import SRTInterpreter

# Create an interpreter instance
interpreter = SRTInterpreter()

print("Interpreter ready!")

Interpreter ready!


### Part 1: Lexer (Tokenization)

1. Reads the file line by line
2. Uses regex patterns to identify timestamps and index numbers
3. Categorizes everything as tokens (INDEX, TIMESTAMP, TEXT, etc.)
4. Keeps track of line numbers for error reporting

Regex patterns:

- Timestamp: `^\d{2}:\d{2}:\d{2},\d{3}$` (matches `00:01:30,500`)
- Index: `^\d+$` (matches a line consisting only of digits, e.g. `1` or `42`)


In [2]:
from src.lexer import Lexer

# Create a lexer instance
lexer = Lexer()

# Sample SRT text (one complete subtitle)
srt_text = """1
00:00:01,000 --> 00:00:03,000
Hello, world!

"""

# Tokenize the text
tokens = lexer.tokenize(srt_text)

# Display all tokens
print("Tokens generated by the lexer:")
print("="*50)
for i, token in enumerate(tokens, 1):
    print(f"{i}. {token.type} \t | {repr(token.value)} (line {token.line_number})")

print(f"\nTotal tokens: {len(tokens)}")

Tokens generated by the lexer:
1. INDEX 	 | '1' (line 1)
2. NEWLINE 	 | '\n' (line 1)
3. TIMESTAMP 	 | '00:00:01,000' (line 2)
4. ARROW 	 | '-->' (line 2)
5. TIMESTAMP 	 | '00:00:03,000' (line 2)
6. NEWLINE 	 | '\n' (line 2)
7. TEXT 	 | 'Hello, world!' (line 3)
8. NEWLINE 	 | '\n' (line 3)
9. BLANK_LINE 	 | '' (line 4)
10. BLANK_LINE 	 | '' (line 5)
11. EOF 	 | '' (line 5)

Total tokens: 11


Explanation:

- The lexer identified the index number `1` as an INDEX token
- Recognized the two timestamps with the correct format
- Found the arrow `-->` between timestamps
- Categorized `Hello, world!` as TEXT
- Tracked newlines and the blank line separator
- Added an EOF (end-of-file) token at the end

Error handling example: What if the timestamp is malformed?


In [3]:
from src.lexer import LexerError

# This timestamp has invalid minutes (99 > 59)
bad_srt = """1
00:99:01,000 --> 00:00:03,000
Bad timestamp!
"""

try:
    lexer_test = Lexer()
    tokens = lexer_test.tokenize(bad_srt)
except LexerError as e:
    print(f"Lexer Error: {e}")

The code above executes with no errors because the lexer only validates the format (regex pattern matching). The timestamp `00:99:01,000` still has the correct structure (two digits for each component and three for the milliseconds), so the lexer accepts it. It would fail in the parser stage, where semantic validation occurs.
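The difference between the two kinds of check can be shown in isolation. In this small sketch, `matches_format` and `in_range` are hypothetical helper names standing in for the lexer-level and parser-level checks respectively.

```python
import re

TIMESTAMP_RE = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3}$")


def matches_format(stamp: str) -> bool:
    """Lexer-level check: does the text have the HH:MM:SS,MMM shape?"""
    return TIMESTAMP_RE.match(stamp) is not None


def in_range(stamp: str) -> bool:
    """Parser-level check: are minutes and seconds at most 59?"""
    _hours, minutes, rest = stamp.split(":")
    seconds = rest.split(",")[0]
    return int(minutes) <= 59 and int(seconds) <= 59


stamp = "00:99:01,000"
print(matches_format(stamp))  # True: right shape, so a format-only check accepts it
print(in_range(stamp))        # False: 99 minutes is out of range
```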

### Part 2: Parser (Building Subtitle Entries)

1. Iterates through tokens one by one
2. Expects a specific pattern: INDEX → NEWLINE → TIMESTAMP → ARROW → TIMESTAMP → NEWLINE → TEXT → NEWLINE → BLANK_LINE (the TEXT → NEWLINE pair may repeat for multiline subtitles)
3. Validates semantic rules (e.g., start time before end time)
4. Creates `SubtitleEntry` objects (our AST nodes)

Parsing technique: Recursive descent
- Each grammar rule becomes a function
- Functions call each other to parse nested structures
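A recursive descent parser for the grammar in Section 2 can be sketched with one method per rule. This is a simplified illustration, not the contents of `src/parser.py`: tokens are plain `(type, value)` tuples, entries are returned as dictionaries, the token list is assumed to end with an EOF token, and semantic checks such as start time before end time are left out.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (type, value)


class ParserError(Exception):
    pass


class Parser:
    def __init__(self, tokens: List[Token]):
        self.tokens = tokens  # assumed to end with an ("EOF", "") token
        self.pos = 0

    def peek(self) -> str:
        return self.tokens[self.pos][0]

    def expect(self, token_type: str) -> str:
        kind, value = self.tokens[self.pos]
        if kind != token_type:
            raise ParserError(f"Expected {token_type}, found {kind}")
        self.pos += 1
        return value

    def parse(self) -> List[Dict]:
        entries: List[Dict] = []
        while True:
            while self.peek() == "BLANK_LINE":  # skip extra separators
                self.pos += 1
            if self.peek() == "EOF":
                return entries
            entries.append(self.parse_entry(expected_index=len(entries) + 1))

    def parse_entry(self, expected_index: int) -> Dict:
        # subtitle_entry → INDEX NEWLINE timestamp_line NEWLINE text_lines BLANK_LINE
        if self.peek() != "INDEX" or int(self.tokens[self.pos][1]) != expected_index:
            raise ParserError(f"Expected subtitle number {expected_index}")
        self.pos += 1
        self.expect("NEWLINE")
        start, end = self.parse_timestamp_line()
        self.expect("NEWLINE")
        text = self.parse_text_lines()
        self.expect("BLANK_LINE")
        return {"index": expected_index, "start": start, "end": end, "text": text}

    def parse_timestamp_line(self) -> Tuple[str, str]:
        # timestamp_line → TIMESTAMP ARROW TIMESTAMP
        start = self.expect("TIMESTAMP")
        self.expect("ARROW")
        end = self.expect("TIMESTAMP")
        return start, end

    def parse_text_lines(self) -> List[str]:
        # text_lines → TEXT NEWLINE (TEXT NEWLINE)*
        lines = [self.expect("TEXT")]
        self.expect("NEWLINE")
        while self.peek() == "TEXT":
            lines.append(self.expect("TEXT"))
            self.expect("NEWLINE")
        return lines


tokens = [
    ("INDEX", "1"), ("NEWLINE", "\n"),
    ("TIMESTAMP", "00:00:01,000"), ("ARROW", "-->"), ("TIMESTAMP", "00:00:03,000"), ("NEWLINE", "\n"),
    ("TEXT", "Hello"), ("NEWLINE", "\n"),
    ("BLANK_LINE", ""), ("EOF", ""),
]
print(Parser(tokens).parse())
```

Each method consumes exactly the tokens its grammar rule describes, which is why the method bodies read almost like the grammar itself.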

In [4]:
from src.parser import Parser

# Re-tokenize the sample SRT text from the lexer demo
lexer_demo = Lexer()
tokens_demo = lexer_demo.tokenize(srt_text)

# Parse the tokens
parser = Parser(tokens_demo)
entries = parser.parse()

# Display the parsed subtitle entries
print(f"Parser found {len(entries)} subtitle(s):")
print("="*50)
for entry in entries:
    print(f"\nSubtitle #{entry.index}")
    print(f"  Start time: {entry.start_time}")
    print(f"  End time:   {entry.end_time}")
    print(f"  Text:       {entry.get_text()}")
    print(f"  Duration:   {entry.end_time.to_milliseconds() - entry.start_time.to_milliseconds()}ms")

Parser found 1 subtitle(s):

Subtitle #1
  Start time: 00:00:01,000
  End time:   00:00:03,000
  Text:       Hello, world!
  Duration:   2000ms


Explanation:
- Parser consumed tokens in order
- Created a `SubtitleEntry` object with structured data
- Validated that start time (1 sec) comes before end time (3 sec)
- Stored text as a list of lines (for multiline support)

Error handling example: What if we're missing the index?

In [5]:
from src.parser import ParserError

# This SRT is missing the index number
bad_srt_no_index = """00:00:01,000 --> 00:00:03,000
Missing index!

"""

try:
    lexer_test = Lexer()
    tokens_test = lexer_test.tokenize(bad_srt_no_index)
    parser_test = Parser(tokens_test)
    entries_test = parser_test.parse()
except ParserError as e:
    print(f"Parser Error: {e}")

Parser Error: Expected subtitle number 1


### Part 3: Executor (Display and Translation)

1. Optionally translates all subtitle text using Google Translate API
2. Iterates through subtitle entries
3. Displays each subtitle with its timestamp

Translation process:
- Uses the `deep-translator` library with `GoogleTranslator`
- Translates each line of text individually
- Supports 5 languages: English, Filipino (Tagalog), Korean, Chinese, Japanese
- Falls back to original text if translation fails
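The per-line translation with fallback can be sketched independently of the network call. In the example below, `translate_lines` and `flaky_translator` are hypothetical names; the translator is passed in as a callable so the fallback logic can be demonstrated without contacting Google Translate.

```python
from typing import Callable, List


def translate_lines(lines: List[str], translate: Callable[[str], str]) -> List[str]:
    """Translate each line separately; keep the original line if translation fails."""
    result = []
    for line in lines:
        try:
            translated = translate(line)
        except Exception:  # e.g. network errors or rate limits
            translated = None
        result.append(translated or line)  # also falls back on empty or None results
    return result


# With deep-translator installed, the callable could be (not run here):
#   from deep_translator import GoogleTranslator
#   translate = GoogleTranslator(source="en", target="tl").translate


def flaky_translator(line: str) -> str:
    """Stand-in translator for demonstration: fails on one input."""
    if "world" in line:
        raise RuntimeError("simulated network failure")
    return line.upper()


print(translate_lines(["Hello, world!", "Welcome."], flaky_translator))
```

The first line fails to translate and is kept as-is, while the second is replaced by the stand-in translator's result.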


In [6]:
from src.executor import Executor

# Use the subtitle entries we parsed earlier
executor = Executor()

print("Displaying subtitles (English):")
print("="*50)
executor.execute(entries, translate_to='english')

Displaying subtitles (English):
[00:00:01.000] DISPLAY: "Hello, world!"
[00:00:03.000] CLEAR


Explanation:
- Executor displays each subtitle with its start timestamp
- Shows "CLEAR" at the end timestamp
- Uses delays to simulate real subtitle timing
- Formats timestamps nicely: `00:00:01.000`

Now let's translate to Filipino:

In [7]:
print("\nDisplaying subtitles (Filipino/Tagalog):")
print("="*50)
executor_fil = Executor()
executor_fil.execute(entries, translate_to='filipino')


Displaying subtitles (Filipino/Tagalog):

Translating to filipino...
Translation complete!                    

[00:00:01.000] DISPLAY: "Kumusta, Mundo!"
[00:00:03.000] CLEAR


Explanation:
1. Before display, executor calls `translate_subtitles()`
2. Maps language name → language code (e.g., `'filipino'` → `'tl'`)
3. Creates a `GoogleTranslator` instance with source='en', target='tl'
4. Translates each text line using the `translate()` method
5. Creates new subtitle entries with translated text
6. Displays translated subtitles

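The translation is done by Google's neural machine translation system. It generally produces grammatical output, but as noted in Section 3, the results tend to be literal and can miss context-dependent nuance.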

### Full Pipeline Demo

In [8]:
# Full pipeline demonstration
print("Interpreter Pipeline Demo")
print("="*50)

# Sample SRT with multiple subtitles
full_srt = """1
00:00:01,000 --> 00:00:03,000
Welcome to our interpreter!

2
00:00:04,000 --> 00:00:06,000
It can handle multiple subtitles.

3
00:00:07,000 --> 00:00:09,000
And translate them too!

"""

# Step 1: Lexer
print("\n[Step 1: Lexer - Tokenization]")
lexer_full = Lexer()
tokens_full = lexer_full.tokenize(full_srt)
print(f"✓ Generated {len(tokens_full)} tokens")

# Step 2: Parser
print("\n[Step 2: Parser - Validation & AST Building]")
parser_full = Parser(tokens_full)
entries_full = parser_full.parse()
print(f"✓ Parsed {len(entries_full)} valid subtitle entries")

# Step 3: Executor
print("\n[Step 3: Executor - Display]")
executor_full = Executor()
executor_full.execute(entries_full, translate_to='filipino')

Interpreter Pipeline Demo

[Step 1: Lexer - Tokenization]
✓ Generated 29 tokens

[Step 2: Parser - Validation & AST Building]
✓ Parsed 3 valid subtitle entries

[Step 3: Executor - Display]

Translating to filipino...
Translation complete!                    

[00:00:01.000] DISPLAY: "Maligayang pagdating sa aming tagasalin!"
[00:00:03.000] CLEAR
[00:00:04.000] DISPLAY: "Maaari itong hawakan ang maraming mga subtitle."
[00:00:06.000] CLEAR
[00:00:07.000] DISPLAY: "At isalin din ang mga ito!"
[00:00:09.000] CLEAR


---

## Section 6: Testing with Valid and Invalid Inputs

### Test 1: Valid Basic Subtitle File

In [9]:
print("Test 1: Valid Basic SRT File")
print("="*50)

# Show the file contents
with open('examples/valid_basic.srt', 'r') as f:
    content = f.read()
    print("File contents:")
    print(content)

# Run the interpreter
print("\nRunning interpreter:")
interpreter.run('examples/valid_basic.srt', 'english')

print("Test passed: File processed successfully")

Test 1: Valid Basic SRT File
File contents:
1
00:00:01,000 --> 00:00:03,000
Hello world!

2
00:00:04,000 --> 00:00:06,000
This is a test.



Running interpreter:
Reading file: examples/valid_basic.srt

Step 1: Tokenizing...
  Found 20 tokens

Step 2: Parsing...
  Found 2 subtitles

Step 3: Displaying subtitles

[00:00:01.000] DISPLAY: "Hello world!"
[00:00:03.000] CLEAR
[00:00:04.000] DISPLAY: "This is a test."
[00:00:06.000] CLEAR

Done!
Test passed: File processed successfully


### Test 2: Valid Multiline Subtitles

In [10]:
print("Test 2: Valid Multiline Subtitles")
print("="*50)

with open('examples/valid_multiline.srt', 'r') as f:
    content = f.read()
    print("File contents:")
    print(content)

print("\nRunning interpreter:")
interpreter.run('examples/valid_multiline.srt', 'english')

print("Test passed: Multiline text handled correctly")

Test 2: Valid Multiline Subtitles
File contents:
1
00:00:01,000 --> 00:00:04,000
This subtitle has
multiple lines
of text.



Running interpreter:
Reading file: examples/valid_multiline.srt

Step 1: Tokenizing...
  Found 15 tokens

Step 2: Parsing...
  Found 1 subtitles

Step 3: Displaying subtitles

[00:00:01.000] DISPLAY: "This subtitle has
multiple lines
of text."
[00:00:04.000] CLEAR

Done!
Test passed: Multiline text handled correctly


### Test 3: Translation to Korean

In [11]:
print("Test 3: Translation Feature (Korean)")
print("="*50)

print("Original (English):")
interpreter.run('examples/valid_basic.srt', 'english')

print("\n" + "-"*50 + "\n")

print("Translated (Korean):")
interpreter.run('examples/valid_basic.srt', 'korean')

print("Test passed: Translation works correctly")

Test 3: Translation Feature (Korean)
Original (English):
Reading file: examples/valid_basic.srt

Step 1: Tokenizing...
  Found 20 tokens

Step 2: Parsing...
  Found 2 subtitles

Step 3: Displaying subtitles

[00:00:01.000] DISPLAY: "Hello world!"
[00:00:03.000] CLEAR
[00:00:04.000] DISPLAY: "This is a test."
[00:00:06.000] CLEAR

Done!

--------------------------------------------------

Translated (Korean):
Reading file: examples/valid_basic.srt

Step 1: Tokenizing...
  Found 20 tokens

Step 2: Parsing...
  Found 2 subtitles

Step 3: Displaying subtitles
  (will translate to korean)


Translating to korean...
Translation complete!                    

[00:00:01.000] DISPLAY: "안녕하세요!"
[00:00:03.000] CLEAR
[00:00:04.000] DISPLAY: "이것은 테스트입니다."
[00:00:06.000] CLEAR

Done!
Test passed: Translation works correctly


### Test 4: Invalid Input - Missing Index

In [12]:
print("Test 4: Invalid Input - Missing Index Number")
print("="*50)

with open('examples/invalid_missing_index.srt', 'r') as f:
    content = f.read()
    print("File contents (INVALID):")
    print(content)

print("\nRunning interpreter (expecting error):")
try:
    interpreter.run('examples/invalid_missing_index.srt', 'english')
except Exception as e:
    print(f"Error caught: {e}")
    print("Test passed: Parser correctly detected missing index")

Test 4: Invalid Input - Missing Index Number
File contents (INVALID):
1
00:00:01,000 --> 00:00:03,000
First subtitle is fine.

00:00:04,000 --> 00:00:06,000
This subtitle is missing its index!



Running interpreter (expecting error):
Reading file: examples/invalid_missing_index.srt

Step 1: Tokenizing...
  Found 18 tokens

Step 2: Parsing...
Parser Error: Expected subtitle number 2


### Test 5: Invalid Input - Bad Timestamp Order

In [13]:
print("Test 5: Invalid Input - Bad Timestamp Order")
print("="*50)

with open('examples/invalid_timestamp_order.srt', 'r') as f:
    content = f.read()
    print("File contents (INVALID - end before start):")
    print(content)

print("\nRunning interpreter (expecting error):")
try:
    interpreter.run('examples/invalid_timestamp_order.srt', 'english')
except Exception as e:
    print(f"Error caught: {e}")
    print("Test passed: Parser correctly detected temporal violation")

Test 5: Invalid Input - Bad Timestamp Order
File contents (INVALID - end before start):
1
00:00:01,000 --> 00:00:03,000
First subtitle is fine.

2
00:00:08,000 --> 00:00:05,000
This subtitle has start time AFTER end time!



Running interpreter (expecting error):
Reading file: examples/invalid_timestamp_order.srt

Step 1: Tokenizing...
  Found 20 tokens

Step 2: Parsing...
Parser Error: Start time 00:00:08,000 must be before end time 00:00:05,000


### Test 6: Invalid Input - Malformed Timestamp

In [14]:
print("Test 6: Invalid Input - Malformed Timestamp")
print("="*50)

with open('examples/invalid_malformed_time.srt', 'r') as f:
    content = f.read()
    print("File contents (INVALID - bad time format):")
    print(content)

print("\nRunning interpreter (expecting error):")
try:
    interpreter.run('examples/invalid_malformed_time.srt', 'english')
except Exception as e:
    print(f"Error caught: {e}")
    print("Test passed: Parser correctly detected out-of-range timestamp")

Test 6: Invalid Input - Malformed Timestamp
File contents (INVALID - bad time format):
1
00:00:01,000 --> 00:00:03,000
First subtitle is fine.

2
00:00:04,000 --> 00:99:99,999
This subtitle has invalid timestamp ranges (minutes and seconds > 59)!



Running interpreter (expecting error):
Reading file: examples/invalid_malformed_time.srt

Step 1: Tokenizing...
  Found 20 tokens

Step 2: Parsing...
Parser Error: Bad end time: Invalid time: 00:99:99,999


### Test 7: Complex Real-World File

In [15]:
print("Test 7: Complex Real-World SRT File")
print("="*50)

# Show first few subtitles
with open('examples/valid_complex.srt', 'r') as f:
    content = f.read()
    print("File preview (first 400 characters):")
    print(content[:400])
    print("...\n")

print("Running interpreter:")
interpreter.run('examples/valid_complex.srt', 'english')

print("Test passed: Complex file with multiple subtitles processed successfully")

Test 7: Complex Real-World SRT File
File preview (first 400 characters):
1
00:00:01,000 --> 00:00:02,500
Welcome to the film festival.

2
00:00:03,500 --> 00:00:07,000
The opening ceremony begins shortly.

3
00:00:08,000 --> 00:00:18,000
Please silence your mobile phones and enjoy the experience.

4
00:00:19,000 --> 00:00:21,500
Act One

5
00:00:23,000 --> 00:00:28,000
The story begins on a quiet evening in a small coastal town.

6
00:00:29,500 --> 00:00:35,000
The pro
...

Running interpreter:
Reading file: examples/valid_complex.srt

Step 1: Tokenizing...
  Found 202 tokens

Step 2: Parsing...
  Found 20 subtitles

Step 3: Displaying subtitles

[00:00:01.000] DISPLAY: "Welcome to the film festival."
[00:00:02.500] CLEAR
[00:00:03.500] DISPLAY: "The opening ceremony begins shortly."
[00:00:07.000] CLEAR
[00:00:08.000] DISPLAY: "Please silence your mobile phones and enjoy the experience."
[00:00:18.000] CLEAR
[00:00:19.000] DISPLAY: "Act One"
[00:00:21.500] CLEAR
[00:00:23.000] DISPLAY

---

## Section 7: Extensions and Additional Features

### Multi-Language Translation

What it does:
- Translates subtitle text from English to multiple target languages
- Uses Google Translate API via the `deep-translator` library
- Maintains original timing and structure

Supported languages:
1. English (original/default)
2. Filipino (Tagalog) - `'filipino'` or `'tagalog'`
3. Korean - `'korean'`
4. Chinese (Simplified) - `'chinese'`
5. Japanese - `'japanese'`
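Mapping the language names above to translator language codes might look like the sketch below. Only the `'filipino'` to `'tl'` mapping is confirmed in this report; the other codes are assumptions based on common language codes, and the `resolve_language` function and its error message are hypothetical.

```python
LANGUAGE_CODES = {
    "english": "en",
    "filipino": "tl",   # confirmed in the executor walkthrough
    "tagalog": "tl",    # alias for Filipino
    "korean": "ko",     # assumed code
    "chinese": "zh-CN", # assumed code for Simplified Chinese
    "japanese": "ja",   # assumed code
}


class ExecutorError(Exception):
    """Raised for problems at the execution stage."""


def resolve_language(name: str) -> str:
    """Return the language code for a supported language name (case-insensitive)."""
    key = name.strip().lower()
    if key not in LANGUAGE_CODES:
        supported = ", ".join(sorted(LANGUAGE_CODES))
        raise ExecutorError(f"Unsupported target language: {name!r} (supported: {supported})")
    return LANGUAGE_CODES[key]


print(resolve_language("Tagalog"))  # tl
```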


In [16]:
print("Translation Extension Demo")
print("="*50)

# Original English
print("\n1. Original (English):")
interpreter.run('examples/valid_basic.srt', 'english')

# Filipino
print("\n2. Filipino (Tagalog):")
interpreter.run('examples/valid_basic.srt', 'filipino')

# Korean
print("\n3. Korean:")
interpreter.run('examples/valid_basic.srt', 'korean')

Translation Extension Demo

1. Original (English):
Reading file: examples/valid_basic.srt

Step 1: Tokenizing...
  Found 20 tokens

Step 2: Parsing...
  Found 2 subtitles

Step 3: Displaying subtitles

[00:00:01.000] DISPLAY: "Hello world!"
[00:00:03.000] CLEAR
[00:00:04.000] DISPLAY: "This is a test."
[00:00:06.000] CLEAR

Done!

2. Filipino (Tagalog):
Reading file: examples/valid_basic.srt

Step 1: Tokenizing...
  Found 20 tokens

Step 2: Parsing...
  Found 2 subtitles

Step 3: Displaying subtitles
  (will translate to filipino)


Translating to filipino...
Translation complete!                    

[00:00:01.000] DISPLAY: "Hello World!"
[00:00:03.000] CLEAR
[00:00:04.000] DISPLAY: "Ito ay isang pagsubok."
[00:00:06.000] CLEAR

Done!

3. Korean:
Reading file: examples/valid_basic.srt

Step 1: Tokenizing...
  Found 20 tokens

Step 2: Parsing...
  Found 2 subtitles

Step 3: Displaying subtitles
  (will translate to korean)


Translating to korean...
Translation complete!               

This extension helps people watch videos in different languages, which is useful for language learners who want to see translations side by side and content creators trying to reach wider audiences. Major streaming platforms like Netflix already use similar technology.

---


## Section 8: Insights and Conclusions

Building this SRT subtitle interpreter provided us with practical insight into how language processing actually works. The three-stage architecture (lexer, parser, executor) made the code easier to test and maintain. We spent a bit more time on error handling, learning that clear error messages with line numbers matter just as much as the core functionality. Working with SRT's simple format showed us why minimalist designs like JSON and CSV remain popular: they're easy to parse, debug, and implement across different systems.

The interpreter handles real subtitle files effectively, but we think there's still room for growth. Currently, it only displays subtitles sequentially without video synchronization or subtitle editing capabilities. Future versions could integrate with video players, support format conversion between SRT and other subtitle types, or use speech recognition APIs to auto-generate subtitles (imagine that!). Despite these limitations, this machine project gave us an understanding of how interpreters function. 

Sometimes the best way to understand a complex system is to build a simpler version yourself.

---


## Section 9: References

### Scholarly Articles and Academic Papers

Baicoianu, A., & Plajer, I. (2023). Considerations on efficient lexical analysis in the context of compiler design. Bulletin of the Transilvania University of Brasov. Series III: Mathematics and Computer Science, 159–168. https://doi.org/10.31926/but.mif.2023.3.65.2.14

- This paper provided guidelines for constructing efficient lexers (scanners). We applied their recommendations on pattern matching and tokenization to design our Lexer class. The paper's discussion on combining lexers with parsers helped us understand how the two components should interact and pass data between stages.

Jordan, W., Bejo, A., & Persada, A. G. (2019). The Development of Lexer and Parser as Parts of Compiler for GAMA32 Processor’s Instruction-set using Python. 2019 International Seminar on Research of Information Technology and Intelligent Systems (ISRITI), 450–455. https://doi.org/10.1109/ISRITI48646.2019.9034617

- This paper demonstrated practical lexer and parser implementation in Python. We learned how to structure our tokenization logic and how to organize token types as constants. Their approach to error handling in the parser stage influenced our ParserError design.

Pai T, V., Jayanthila Devi, A., & Aithal, P. S. (2020). A Systematic Literature Review of Lexical Analyzer Implementation Techniques in Compiler Design. International Journal of Applied Engineering and Management Letters, 285–301. https://doi.org/10.47992/IJAEML.2581.7000.0087

- This literature review helped us compare different approaches to lexical analysis. We chose the regex-based approach (using Python's `re` module) based on their discussion of implementation techniques and trade-offs for simple grammars like SRT.

### Online References, Tutorials, and Documentation

SubRip. (2025). In Wikipedia. https://en.wikipedia.org/w/index.php?title=SubRip&oldid=1311983151

- Primary source for understanding the SRT file format specification. We learned the exact structure of subtitle entries, timestamp format requirements, and common use cases. This defined our grammar rules and validation requirements.

SubRip Subtitle format (SRT). (2023, March 28). [Web page]. Library of Congress. https://www.loc.gov/preservation/digital/formats/fdd/fdd000569.shtml

- Official archival documentation that clarified edge cases and format details. We used this to understand why certain validation rules exist (like requiring sequential indices and proper timestamp formatting).

Lexical analysis. (2025). In Wikipedia. https://en.wikipedia.org/w/index.php?title=Lexical_analysis&oldid=1309109190

- General background on lexical analysis concepts. Helped us understand the two-stage process (scanning and evaluating) and the difference between lexemes and tokens. We applied these concepts in our Lexer class design.

Token, Patterns, and Lexemes. (n.d.). GeeksforGeeks. https://www.geeksforgeeks.org/compiler-design/token-patterns-and-lexems/

- Clarified the distinction between tokens (categories), patterns (regex), and lexemes (actual text). This helped us design our token type constants (TOKEN_INDEX, TOKEN_TIMESTAMP, etc.) and corresponding regex patterns.

deep-translator: A flexible free and unlimited python tool to translate between different languages in a simple way using multiple translators (Version 1.11.4). (n.d.). [Python; OS Independent]. Retrieved October 28, 2025, from https://github.com/nidhaloff/deep_translator

- API documentation for the translation library we used in the executor. We learned how to use the GoogleTranslator class, handle language codes, and implement error handling for translation failures.

### Artificial Intelligence (AI) Tools

1. [Perplexity.ai](https://www.perplexity.ai/search/how-should-i-structure-a-simpl-S65YcLmbTyCpIHwIUhnRNw#8)
   - Project structuring and general knowledge: Asked for advice on how to organize the interpreter into separate modules (lexer.py, parser.py, executor.py, ast.py)
   - Error handling strategies: Asked for best practices on raising and catching custom exceptions (LexerError, ParserError, ExecutorError)
   - Documentation help: Asked for suggestions on writing clear docstrings and comments

2. [Claude Code](https://www.claude.com/product/claude-code)
   - Python syntax help: Consulted for Python-specific syntax when implementing regex patterns, class structures, and exception handling
   - Code review: Requested feedback on whether our design followed standard interpreter architecture patterns

### Tools and Libraries

Python Standard Library:

- `re` (regular expressions) - Built-in module for pattern matching
- `time` - Built-in module for timing delays

Third-Party Libraries:

- `deep-translator` - Google Translate API wrapper for subtitle translation

Development Tools:

- Python 3.x
- Jupyter Notebook (for this demo)
- Git (version control)

All references were used ethically and in accordance with academic integrity guidelines. We cited sources where ideas came from, used AI as a learning aid (not a code generator), and implemented all functionality ourselves to ensure we understood the concepts.
