# Dev challenge: 2025 Spring Internship

## Problem Statement

__Challenge 1: Formatting a transcript__

__Task:__
You receive an unformatted raw string transcript of a 3-hour
meeting. Write an automated tool that processes any transcript and
outputs it split by speaker in the following format with this exact
indentation:

SPEAKER1:

        Statement.

SPEAKER2:

        Statement2.

Note: There can be up to 30 different speakers. Explain your approach.

Deliverables:
- A high-level overview explaining your approach and some pseudo-code.
- A brief explanation of the technologies/methodologies used.

__Objective__

Develop a robust Python function to transform raw, unstructured meeting transcripts into a cleanly formatted, speaker-separated document.
- Handle transcripts from 3-hour meetings
- Support up to 30 different speakers
- Maintain precise speaker order and statement attribution
- Create a flexible, efficient parsing algorithm

## Algorithm
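1) Tokenize the input transcript by splitting it on whitespace, so that each speaker label (for example "SPEAKER1:") becomes its own token
2) Iterate through tokens:
- When a token starts with "SPEAKER" and ends with ":", first add the previous speaker's statement to the output, if one exists, then start a new speaker section
- Otherwise, collect the token as part of the current speaker's statement
3) After the last token, add the final speaker's statement to the output
4) Format output with speakers in their original order, with each statement indented under its speaker label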
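
Step 1 can be checked in isolation. The snippet below is a minimal sketch on a made-up sample string (not taken from a real meeting); it shows that the tokenization isolates each speaker label as its own token, which is what step 2 relies on.

```python
# Hypothetical sample input, used only for illustration.
sample = "SPEAKER1: Hello, everyone. SPEAKER2: Hi!"

# Same tokenization as step 1: break after each "label: ", then split on whitespace.
tokens = sample.replace(": ", ":\n").split()
print(tokens)

# A token counts as a speaker label when it starts with "SPEAKER" and ends with ":".
labels = [t for t in tokens if t.startswith("SPEAKER") and t.endswith(":")]
print(labels)
```

This prints `['SPEAKER1:', 'Hello,', 'everyone.', 'SPEAKER2:', 'Hi!']` followed by `['SPEAKER1:', 'SPEAKER2:']`.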


## Pseudo-code

In [1]:
def format_transcript(raw_transcript): 
    """
    Formats a raw transcript into speaker-separated statements.
    
    Args: raw_transcript (str): The raw transcript text to be formatted.
    
    Returns: str: A formatted transcript with precise speaker separation.
    """
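    formatted_lines = []      # output lines, in transcript order
    current_speaker = None    # label of the speaker currently talking
    current_statement = []    # words collected for that speaker so far

    # Break after each "label: " and split on whitespace, so every speaker
    # label (e.g. "SPEAKER1:") becomes its own token.
    tokens = raw_transcript.replace(": ", ":\n").split()

    for token in tokens:
        if token.startswith("SPEAKER") and token.endswith(":"):
            # New speaker label: flush the previous speaker's statement first.
            if current_speaker and current_statement:
                formatted_lines.append(f"    {' '.join(current_statement)}")
                current_statement = []

            current_speaker = token
            formatted_lines.append(current_speaker)
        elif current_speaker:
            # Ordinary word; any text before the first label is ignored.
            current_statement.append(token)

    # Flush the last speaker's statement.
    if current_speaker and current_statement:
        formatted_lines.append(f"    {' '.join(current_statement)}")

    return "\n".join(formatted_lines)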
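
The problem statement shows each statement indented by eight spaces, with a blank line after every speaker label and every statement, whereas `format_transcript` emits four spaces and no blank lines. One possible way to bridge that gap is a small post-processing step. The helper name `to_required_layout` and its `indent` parameter are hypothetical, and the sketch assumes its input has exactly the shape returned above: unindented speaker labels and statements indented by four spaces.

```python
def to_required_layout(formatted, indent=" " * 8):
    """Re-indent format_transcript-style output to the problem statement's layout.

    Assumption: every statement line starts with four spaces and every other
    non-empty line is a speaker label.
    """
    blocks = []
    for line in formatted.splitlines():
        if not line.strip():
            continue  # skip blank lines so already-spaced input is handled too
        if line.startswith("    "):
            blocks.append(indent + line.strip())  # statement line
        else:
            blocks.append(line.strip())           # speaker label
    # One blank line between every label and statement.
    return "\n\n".join(blocks)


# Hand-written sample in the shape format_transcript returns.
sample = "SPEAKER1:\n    Hello, everyone.\nSPEAKER2:\n    Hi!"
print(to_required_layout(sample))
```

Keeping the layout in a separate step leaves the parsing loop untouched; folding the eight-space indent and blank lines into `format_transcript` itself would work equally well.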

## Testing

### Example 1: basic scenario

In [2]:
example_transcript_1 = "SPEAKER1: Hello, everyone. SPEAKER2: Hi!"
print(format_transcript(example_transcript_1))

SPEAKER1:
    Hello, everyone.
SPEAKER2:
    Hi!


### Example 2: multiple speakers

In [3]:
example_transcript_2 = "SPEAKER1: Welcome. SPEAKER2: Thanks. SPEAKER3: Let's begin."
print(format_transcript(example_transcript_2))

SPEAKER1:
    Welcome.
SPEAKER2:
    Thanks.
SPEAKER3:
    Let's begin.


### Example 3: a speaker who talks more than once

In [4]:
example_transcript_3 = "SPEAKER1: Initial point. SPEAKER2: Interesting. SPEAKER1: Elaborating further. SPEAKER3: Question?"
print(format_transcript(example_transcript_3))

SPEAKER1:
    Initial point.
SPEAKER2:
    Interesting.
SPEAKER1:
    Elaborating further.
SPEAKER3:
    Question?


## Summary
The algorithm makes a single pass over the transcript's whitespace-separated tokens, tracking the current speaker and collecting their statement, to turn a raw inline transcript into a readable, speaker-separated format.

__Key Advantages:__

- Handles compact, inline transcript formats
- Preserves original speaker sequence
- Tolerates variations in whitespace, such as line breaks or repeated spaces between words

__Possible Future Improvements:__

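- Add error handling for malformed transcripts; text before the first speaker label is currently dropped silently
- Implement configurable speaker detection; labels are currently limited to tokens that start with "SPEAKER" and end with ":"
- Emit the problem statement's layout (8-space indentation, blank lines between blocks) directly; the current output uses 4 spaces and no blank lines
- Support additional metadata extraction
- Create more sophisticated parsing rules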
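
As one illustration of configurable speaker detection, the sketch below replaces the `startswith`/`endswith` check with a regular expression passed in as a parameter. The function name `split_by_speaker`, the `label_pattern` parameter, and the sample names are assumptions made for this example. It uses a different technique (regex splitting) from the token loop above, so it should be read as a possible direction rather than a drop-in replacement.

```python
import re


def split_by_speaker(raw_transcript, label_pattern=r"SPEAKER\d+"):
    """Return (speaker, statement) pairs in their original order.

    label_pattern is a hypothetical parameter: a regex matching the speaker
    name without its trailing colon.
    """
    # The capturing group makes re.split keep each label in the result list.
    # A label must start the text or follow whitespace, and its colon must be
    # followed by whitespace or the end of the text.
    splitter = re.compile(rf"(?:^|\s)({label_pattern}):(?=\s|$)")
    parts = splitter.split(raw_transcript)
    # parts[0] is any text before the first label; after that, labels and
    # statements alternate.
    labels, statements = parts[1::2], parts[2::2]
    # Collapse runs of whitespace inside each statement.
    return [(label, " ".join(text.split())) for label, text in zip(labels, statements)]


pairs = split_by_speaker(
    "Alice: Welcome. Bob: Thanks. Alice: Let's begin.",
    label_pattern=r"Alice|Bob",
)
print(pairs)
```

This prints `[('Alice', 'Welcome.'), ('Bob', 'Thanks.'), ('Alice', "Let's begin.")]`. Returning pairs rather than a formatted string keeps detection separate from layout, so the same result could feed any output format.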