# Connectionist Temporal Classification (CTC)

In Automatic Speech Recognition (ASR), a fundamental challenge is the **lack of explicit alignment**
between audio frames and text tokens.

Speech signals are continuous and vary in length, while transcriptions are discrete and much shorter.
CTC provides a principled solution to this alignment problem by allowing models to learn
the mapping from audio frames to text **without requiring frame-level labels**.

This notebook introduces the theory behind CTC and implements a minimal,
fully transparent CTC training pipeline.


## Why Alignment Is Hard in ASR

Speech is inherently variable:
- phonemes have different durations
- speakers speak at different speeds
- silence can appear anywhere

Frame-level supervised classification assumes one label per input frame.
ASR violates this assumption because:
- the number of audio frames is far larger than the number of text tokens
- the alignment between frames and tokens is unknown

CTC addresses this by marginalizing over all possible valid alignments
between input frames and output tokens.


## How CTC Works

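CTC defines a many-to-one mapping from frame-level label sequences (alignments) to transcriptions:
- an alignment assigns one token, or a special blank, to every frame
- merging repeated tokens and then removing blanks yields the transcription
- many different alignments therefore map to the same transcription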
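The mapping described above can be sketched in a few lines. The function name `collapse` is a hypothetical helper, and the blank index `1` is assumed here only to match the `blank=1` setting used for the loss later in this notebook.

```python
from itertools import groupby


def collapse(alignment, blank=1):
    """Map a frame-level label sequence to its transcription.

    Step 1 merges runs of identical labels; step 2 drops blanks.
    """
    merged = [label for label, _ in groupby(alignment)]
    return [label for label in merged if label != blank]


# Three different 6-frame alignments that collapse to the same transcription
print(collapse([2, 2, 1, 3, 3, 3]))  # -> [2, 3]
print(collapse([1, 2, 1, 1, 3, 1]))  # -> [2, 3]
print(collapse([2, 3, 3, 3, 3, 3]))  # -> [2, 3]

# A blank between two identical labels keeps them as separate tokens
print(collapse([2, 1, 2, 1, 1, 1]))  # -> [2, 2]
```

The order of the two steps matters: removing blanks first would merge the two `2` tokens in the last example into one.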

During training:
- the model outputs a probability distribution over tokens (including blank) at each frame
- CTC sums probabilities over all valid alignments
- the loss is the negative log-likelihood of the correct transcription

This allows learning **without explicit alignment**.
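
In symbols, the training objective can be restated as follows. The notation is introduced here for illustration and does not appear elsewhere in this notebook: $x$ is the input, $y$ the transcription, $\pi$ a frame-level alignment of length $T$, and $\mathcal{B}$ the mapping that merges repeats and removes blanks.

$$
p(y \mid x) = \sum_{\pi \in \mathcal{B}^{-1}(y)} \prod_{t=1}^{T} p_t(\pi_t \mid x),
\qquad
\mathcal{L}_{\text{CTC}} = -\log p(y \mid x)
$$

The product over frames reflects CTC's assumption that the per-frame outputs are conditionally independent given the input.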


## Why Use CTC?

CTC is particularly suitable for:
- monotonic alignments (speech → text)
- streaming ASR
- simpler model architectures

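Compared to attention-based sequence-to-sequence models, CTC:
- needs no attention-based decoder
- predicts all frames in parallel, so inference is typically faster
- decodes simply (greedy decoding is a single pass over the frames)
- is often considered less data-hungry, although this depends on the task and model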

For this project, CTC provides clarity, interpretability,
and a strong foundation for understanding modern ASR systems.


## Dependencies and Setup

We use PyTorch’s built-in `CTCLoss`, which efficiently implements
the forward-backward algorithm required for CTC.
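
To make the dynamic programming less of a black box, here is a minimal sketch of the forward recursion only, which is enough to compute the loss value (the backward pass is needed for gradients). It is not PyTorch's implementation: the function names `ctc_forward_prob` and `ctc_brute_force_prob`, the toy probabilities, and the blank index `1` are all assumptions made for illustration. The brute-force version enumerates every alignment and serves as a cross-check.

```python
import math
from itertools import product


def ctc_forward_prob(probs, target, blank=1):
    """probs[t][k] is the probability of label k at frame t; returns p(target | input)."""
    # Extended target with blanks around every label, e.g. [2, 0] -> [b, 2, b, 0, b]
    ext = [blank]
    for label in target:
        ext += [label, blank]
    S, T = len(ext), len(probs)

    # alpha[s]: total probability of all partial alignments ending at ext[s]
    alpha = [0.0] * S
    alpha[0] = probs[0][ext[0]]
    if S > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, T):
        new = [0.0] * S
        for s in range(S):
            total = alpha[s]                      # stay on the same symbol
            if s >= 1:
                total += alpha[s - 1]             # advance by one position
            # Skipping the blank between two labels is allowed only if they differ
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][ext[s]]
        alpha = new

    # Valid alignments end on the last label or on the trailing blank
    return alpha[-1] + (alpha[-2] if S > 1 else 0.0)


def ctc_brute_force_prob(probs, target, blank=1):
    """Sum the probability of every alignment that collapses to target."""
    T, V = len(probs), len(probs[0])
    total = 0.0
    for path in product(range(V), repeat=T):
        merged = [k for i, k in enumerate(path) if i == 0 or k != path[i - 1]]
        if [k for k in merged if k != blank] == list(target):
            total += math.prod(probs[t][k] for t, k in enumerate(path))
    return total


# Toy distribution: 4 frames, 3 labels (index 1 is the blank)
probs = [
    [0.2, 0.5, 0.3],
    [0.6, 0.3, 0.1],
    [0.1, 0.2, 0.7],
    [0.3, 0.4, 0.3],
]
p_dp = ctc_forward_prob(probs, [2, 0])
p_bf = ctc_brute_force_prob(probs, [2, 0])
print("forward recursion:", p_dp)
print("brute force:      ", p_bf)
print("negative log-likelihood:", -math.log(p_dp))
```

The brute-force sum grows exponentially with the number of frames, while the recursion is linear in both the number of frames and the target length. Real implementations, including PyTorch's, work in log space for numerical stability; the probability-space version here is only practical for short toy inputs.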


In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F


## Minimal Acoustic Model for CTC

To demonstrate CTC mechanics, we define a minimal neural network that:
- accepts log-Mel features of shape (batch, time, features)
- outputs per-frame log-probabilities over tokens

This model is intentionally simple (a single linear layer applied to each frame independently)
to keep the focus on CTC behavior.


In [None]:
class SimpleCTCModel(nn.Module):
    def __init__(self, input_dim, vocab_size):
        super().__init__()
        self.linear = nn.Linear(input_dim, vocab_size)

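    def forward(self, x):
        """
        x: (batch, time, features)
        returns: (time, batch, vocab) log-probabilities
        """
        x = self.linear(x)              # per-frame logits: (batch, time, vocab)
        x = F.log_softmax(x, dim=-1)    # normalize over the vocabulary dimension
        return x.transpose(0, 1)        # nn.CTCLoss expects time-major input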


## Defining the CTC Loss

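CTC loss requires:
- log probabilities (time, batch, vocab)
- target sequences, either concatenated into one 1-D tensor or padded to (batch, max target length)
- input lengths (frames per utterance)
- target lengths (tokens per utterance)

PyTorch's `CTCLoss` defaults to blank index 0, so any other blank index must be specified explicitly.
Target sequences must not contain the blank index.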
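
As a small illustration of the concatenated target format, the sketch below packs a list of token-id sequences into the flat tensor and the length tensor. `pack_targets` is a hypothetical helper, not a PyTorch function, and the token ids are arbitrary.

```python
import torch


def pack_targets(token_id_seqs):
    """Concatenate per-utterance token ids and record each sequence length."""
    lengths = torch.tensor([len(seq) for seq in token_id_seqs], dtype=torch.long)
    flat = torch.tensor(
        [tok for seq in token_id_seqs for tok in seq], dtype=torch.long
    )
    return flat, lengths


targets, target_lengths = pack_targets([[2, 3, 4], [5, 6, 7]])
print(targets)          # 1-D tensor holding both utterances back to back
print(target_lengths)   # how many entries of `targets` belong to each utterance
```

The length tensor is what lets the loss split the flat tensor back into per-utterance targets.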


In [None]:
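ctc_loss_fn = nn.CTCLoss(
    blank=1,             # <blank> index from tokenizer
    reduction="mean",    # divide each loss by its target length, then average over the batch
    zero_infinity=True   # replace infinite losses (e.g. target longer than input) with zero
)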


## Toy Example: Verifying CTC Behavior

We construct a synthetic example to verify that:
- tensor dimensions are correct
- the loss computation runs end to end
- the loss takes explicit per-utterance input and target lengths


In [None]:
batch_size = 2
time_steps = 100
n_mels = 80
vocab_size = 30

# Fake acoustic features
inputs = torch.randn(batch_size, time_steps, n_mels)

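# Fake targets, concatenated: utterance 0 is [2, 3, 4], utterance 1 is [5, 6, 7]
# (the blank index 1 must not appear in the targets)
targets = torch.tensor([2, 3, 4, 5, 6, 7], dtype=torch.long)
target_lengths = torch.tensor([3, 3], dtype=torch.long)
# Both utterances use all frames here; real batches usually have different lengths
input_lengths = torch.tensor([time_steps, time_steps], dtype=torch.long)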

model = SimpleCTCModel(n_mels, vocab_size)
log_probs = model(inputs)

loss = ctc_loss_fn(
    log_probs,
    targets,
    input_lengths,
    target_lengths
)

print("CTC loss:", loss.item())
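

The toy example above uses the same number of frames for both utterances. A minimal sketch of the variable-length case follows, assuming random log-probabilities in place of model output and illustrative lengths of 100 and 60 frames. `reduction="none"` is used so that the per-utterance losses are visible, and the second half checks that frames beyond an utterance's input length act as padding.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size = 30
frame_counts = [100, 60]            # true number of frames per utterance
max_frames = max(frame_counts)

# Random log-probabilities standing in for model output: (time, batch, vocab)
log_probs = torch.randn(max_frames, len(frame_counts), vocab_size).log_softmax(dim=-1)

targets = torch.tensor([2, 3, 4, 5, 6], dtype=torch.long)   # [2, 3, 4] and [5, 6]
target_lengths = torch.tensor([3, 2], dtype=torch.long)
input_lengths = torch.tensor(frame_counts, dtype=torch.long)

ctc = nn.CTCLoss(blank=1, reduction="none", zero_infinity=True)
per_utterance = ctc(log_probs, targets, input_lengths, target_lengths)

# Overwrite the padded frames of the shorter utterance with different values
altered = log_probs.clone()
n_pad = max_frames - frame_counts[1]
altered[frame_counts[1]:, 1, :] = torch.randn(n_pad, vocab_size).log_softmax(dim=-1)
per_utterance_altered = ctc(altered, targets, input_lengths, target_lengths)

print("per-utterance losses:", per_utterance)
print("padding ignored:", torch.allclose(per_utterance, per_utterance_altered))
```

In a real batch, shorter utterances are padded along the time axis to the longest one, and `input_lengths` tells the loss where each utterance actually ends.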


## Decoding with CTC

During inference, the simplest decoding strategy is **greedy decoding**, applied in this order:
- take the argmax token at each frame
- collapse consecutive repeats
- remove blanks

More advanced approaches (beam search, language models)
can be layered on top, but greedy decoding is sufficient
for validating the pipeline.
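
A minimal sketch of greedy decoding is shown below. The function name `greedy_decode` is hypothetical, the blank index `1` is assumed to match the loss configuration above, and the input is expected in the same (time, batch, vocab) layout that the model returns. The example scores are hand-built so that the argmax path is known in advance; they are not normalized log-probabilities.

```python
import torch


def greedy_decode(log_probs, input_lengths, blank=1):
    """log_probs: (time, batch, vocab). Returns one list of token ids per utterance."""
    best = log_probs.argmax(dim=-1)                   # (time, batch)
    results = []
    for b, length in enumerate(input_lengths.tolist()):
        frames = best[:length, b]                     # ignore padded frames
        merged = torch.unique_consecutive(frames)     # collapse consecutive repeats
        results.append([tok for tok in merged.tolist() if tok != blank])
    return results


# Hand-built scores whose per-frame argmax is 2, 2, 1, 3, 3, 1, 3 (1 is the blank)
frame_ids = torch.tensor([[2, 2, 1, 3, 3, 1, 3]]).T   # (time=7, batch=1)
scores = torch.full((7, 1, 5), -10.0)
scores.scatter_(2, frame_ids.unsqueeze(-1), 0.0)      # winning label gets score 0

print(greedy_decode(scores, torch.tensor([7])))       # -> [[2, 3, 3]]
```

The blank between the two runs of `3` is what allows the repeated token to survive the collapse step. Slicing by `input_lengths` keeps padded frames from contributing tokens.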


## Summary

In this step, we:
- introduced the alignment problem in ASR,
- explained the intuition and theory behind CTC,
- implemented a minimal CTC-compatible model,
- checked that the loss computation runs on synthetic data.

With acoustic features, tokenized text, and CTC loss in place,
the ASR pipeline is now complete at a conceptual level.

The next step focuses on **end-to-end training and decoding**
using real audio features and real transcriptions.
