## Problem 1: Speech-to-Text (Automatic Speech Recognition)
**Goal**

Develop a system that converts spoken English into written text by processing audio signals. This notebook trains a character-level model with CTC loss and transcribes recorded audio; real-time (streaming) transcription is not covered.


### Dataset Description

- **Training Set:**  
  - **LibriSpeech train-clean-100**  
  - Contains ~100 hours of clean English speech from public domain audiobooks (LibriVox).  
  - High-quality recordings with minimal background noise.  
  - Suitable for training ASR models.  

- **Validation Set:**  
  - **LibriSpeech dev-clean**  
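  - Clean read English speech from LibriVox audiobooks, held out from training.  
  - Helps monitor model performance during training and detect overfitting.  

- **Test Set:**  
  - **LibriSpeech test-clean**  
  - Held-out clean speech used only for the final WER/CER evaluation.  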


In [1]:
!pip install torch torchaudio jiwer matplotlib


### Speech Recognition Setup

- Imports libraries for audio processing, modeling, and evaluation.
- Uses GPU if available, otherwise CPU.
- Defines character vocabulary and mappings for text-to-index conversion.


In [2]:
import torch
import torchaudio
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchaudio.datasets import LIBRISPEECH
from torchaudio.transforms import MelSpectrogram
from jiwer import wer

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
CHAR_VOCAB = ['<blank>'] + list("abcdefghijklmnopqrstuvwxyz '")
CHAR2IDX = {c: i for i, c in enumerate(CHAR_VOCAB)}
IDX2CHAR = {i: c for c, i in CHAR2IDX.items()}

### Text and Decoding Functions

- `text_to_indices(text)`: Lowercases the transcript and converts it to a list of character indices used as CTC targets. Characters outside the vocabulary are silently dropped.

- `greedy_decode(log_probs)`: Performs greedy decoding of model outputs, removes blanks and repeats, and returns final transcripts as text.


In [3]:
def text_to_indices(text):
    return [CHAR2IDX[c] for c in text.lower() if c in CHAR2IDX]

def greedy_decode(log_probs):
    best_path = torch.argmax(log_probs, dim=-1)
    transcripts = []
    for seq in best_path:
        prev = None
        tokens = []
        for idx in seq:
            idx = idx.item()
            if idx != prev and idx != 0:
                tokens.append(IDX2CHAR[idx])
            prev = idx
        transcripts.append("".join(tokens))
    return transcripts
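
To make the collapse rule concrete, here is a small torch-free sketch of the same encode and greedy-collapse logic. The frame-level path is made up for illustration, and `collapse_ctc_path` is a hypothetical helper, not part of the notebook.

```python
# Same vocabulary as the notebook: index 0 is the CTC blank.
CHAR_VOCAB = ['<blank>'] + list("abcdefghijklmnopqrstuvwxyz '")
CHAR2IDX = {c: i for i, c in enumerate(CHAR_VOCAB)}
IDX2CHAR = {i: c for c, i in CHAR2IDX.items()}

def text_to_indices(text):
    # Lowercase, then silently drop anything outside the vocabulary.
    return [CHAR2IDX[c] for c in text.lower() if c in CHAR2IDX]

def collapse_ctc_path(path, blank=0):
    # Hypothetical helper: merge consecutive repeats, then drop blanks.
    tokens, prev = [], None
    for idx in path:
        if idx != prev and idx != blank:
            tokens.append(IDX2CHAR[idx])
        prev = idx
    return "".join(tokens)

encoded = text_to_indices("DON'T STOP!")  # '!' is not in the vocabulary

# Made-up frame-level argmax path for "hello": the blank between the two
# 'l' runs is what keeps the double letter from being merged.
h, e, l, o = (CHAR2IDX[c] for c in "helo")
path = [h, h, e, 0, l, l, 0, l, o, o, 0]
decoded = collapse_ctc_path(path)

print(encoded)
print(decoded)
```

Without the blank separating the two `l` runs, the path would collapse to `helo`, which is why CTC needs a dedicated blank symbol.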

### Feature Extraction and Data Collation

- `mel_transform`: Converts audio waveforms to Mel Spectrograms (80 filter banks, 16kHz sample rate).
- `collate_fn(batch)`: Prepares batches by extracting features, converting transcripts to indices, padding sequences, and moving data to the correct device.
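
The transform parameters fix the time resolution of the features. The arithmetic below is a sketch that assumes torchaudio's default `center=True` padding, under which the frame count is `1 + num_samples // hop_length`; `num_frames` is a hypothetical helper for illustration.

```python
SAMPLE_RATE = 16000   # Hz, as passed to MelSpectrogram
WIN_LENGTH = 400      # samples per analysis window
HOP_LENGTH = 160      # samples between successive frames

def num_frames(num_samples, hop_length=HOP_LENGTH):
    # Assumes center=True (torchaudio's default): the waveform is padded
    # so that frames are centred on multiples of hop_length.
    return 1 + num_samples // hop_length

window_ms = 1000 * WIN_LENGTH / SAMPLE_RATE   # window duration in ms
hop_ms = 1000 * HOP_LENGTH / SAMPLE_RATE      # step between frames in ms
frames_10s = num_frames(10 * SAMPLE_RATE)     # frames for a 10-second clip

print(window_ms, hop_ms, frames_10s)
```

At roughly 100 frames per second, a 10-second utterance yields about 1,000 input steps, comfortably more than the number of characters CTC has to emit.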


In [None]:
mel_transform = MelSpectrogram(
    sample_rate=16000, n_fft=400, win_length=400,
    hop_length=160, n_mels=80
)

def collate_fn(batch):
    features, targets, input_lengths, target_lengths = [], [], [], []
    for waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id in batch:
        mel = mel_transform(waveform).squeeze(0).transpose(0, 1)
        target = torch.tensor(text_to_indices(transcript), dtype=torch.long)
        features.append(mel)
        targets.append(target)
        input_lengths.append(mel.size(0))
        target_lengths.append(len(target))
    features = nn.utils.rnn.pad_sequence(features, batch_first=True)
    targets = torch.cat(targets)
    return (
        features.to(DEVICE),
        targets.to(DEVICE),
        torch.tensor(input_lengths).to(DEVICE),
        torch.tensor(target_lengths).to(DEVICE)
    )
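
`collate_fn` returns the targets as one flat 1-D tensor rather than a padded batch; `CTCLoss` accepts this layout and uses `target_lengths` to find the boundaries. A torch-free sketch of that packing and its inverse follows, with `pack_targets` and `unpack_targets` as hypothetical helper names and made-up index lists:

```python
def pack_targets(sequences):
    # Concatenate per-utterance index lists and record each length,
    # mirroring torch.cat(targets) plus target_lengths in collate_fn.
    flat = [idx for seq in sequences for idx in seq]
    lengths = [len(seq) for seq in sequences]
    return flat, lengths

def unpack_targets(flat, lengths):
    # Walk the flat list with a running offset to recover each sequence.
    sequences, offset = [], 0
    for length in lengths:
        sequences.append(flat[offset:offset + length])
        offset += length
    return sequences

batch = [[8, 9], [20, 8, 5, 18, 5], [14, 15]]  # "hi", "there", "no"
flat, lengths = pack_targets(batch)

print(flat)
print(lengths)
print(unpack_targets(flat, lengths) == batch)
```

The evaluation function later in the notebook recovers the reference transcripts with the same running-offset walk.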


### Dataset Preparation

- Loads the LibriSpeech dataset for training (`train-clean-100`), validation (`dev-clean`), and testing (`test-clean`).
- Automatically downloads datasets if not already present.


In [None]:
from torchaudio.datasets import LIBRISPEECH

train_dataset = LIBRISPEECH(".", url="train-clean-100", download=True)

val_dataset = LIBRISPEECH(".", url="dev-clean", download=True)

test_dataset = LIBRISPEECH(".", url="test-clean", download=True)


100%|██████████| 5.95G/5.95G [06:50<00:00, 15.6MB/s]
100%|██████████| 322M/322M [00:16<00:00, 20.7MB/s]
100%|██████████| 331M/331M [00:17<00:00, 19.7MB/s]


### DataLoader Setup

- Creates DataLoaders for training, validation, and testing.
- Uses `collate_fn` to process batches.
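- Training loader shuffles data; validation and test loaders do not.
- `train_and_evaluate` builds its own training and validation loaders, so only `test_loader` from this cell is used later.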


In [6]:
from torch.utils.data import DataLoader

train_loader = DataLoader(train_dataset, batch_size=4, shuffle=True, collate_fn=collate_fn)
val_loader = DataLoader(val_dataset, batch_size=4, shuffle=False, collate_fn=collate_fn)
test_loader = DataLoader(test_dataset, batch_size=4, shuffle=False, collate_fn=collate_fn)


### SpeechRNNCTC Class

A simple speech recognition model using a bidirectional LSTM and linear layer, designed for use with CTC (Connectionist Temporal Classification) loss.


**Initialization**

```python
SpeechRNNCTC(input_dim=80, hidden_dim=512, output_dim=len(CHAR_VOCAB))
```

**Arguments:**

- `input_dim` (int): Size of input features per time step (default: `80`).
- `hidden_dim` (int): Number of hidden units in the LSTM layers (default: `512`).
- `output_dim` (int): Number of output classes, usually length of the character vocabulary (default: `len(CHAR_VOCAB)`).

**Components**

- `rnn`: 3-layer bidirectional LSTM for sequence modeling.
- `fc`: Linear layer mapping LSTM outputs to output classes.

**Forward Pass**

```python
output = model(x)
```

**Arguments:**

- `x` (Tensor): Input tensor of shape `(batch_size, sequence_length, input_dim)`.

**Returns:**

- Tensor of shape `(batch_size, sequence_length, output_dim)` with unnormalized class scores (logits) for each time step. Apply `log_softmax` before passing them to CTC loss.

**Notes**

- Suitable for speech recognition with CTC loss where input-output alignment is unknown.
- `CHAR_VOCAB` must be defined before the class, because `len(CHAR_VOCAB)` is evaluated as a default argument when the class is defined.


In [7]:
class SpeechRNNCTC(nn.Module):
    def __init__(self, input_dim=80, hidden_dim=512, output_dim=len(CHAR_VOCAB)):
        super().__init__()
        self.rnn = nn.LSTM(input_dim, hidden_dim, num_layers=3,
                           bidirectional=True, batch_first=True)
        self.fc = nn.Linear(hidden_dim * 2, output_dim)

    def forward(self, x):
        x, _ = self.rnn(x)
        return self.fc(x)
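
For a sense of scale, the parameter count can be worked out by hand. The sketch below assumes PyTorch's LSTM layout (four gates, with two bias vectors per layer and direction) and the 29-symbol vocabulary defined earlier; `lstm_direction_params` is a hypothetical helper.

```python
def lstm_direction_params(input_size, hidden_size):
    # One direction of one LSTM layer: input-to-hidden and hidden-to-hidden
    # weights for 4 gates, plus two bias vectors (b_ih and b_hh).
    return 4 * (input_size * hidden_size
                + hidden_size * hidden_size
                + 2 * hidden_size)

INPUT_DIM, HIDDEN_DIM, NUM_LAYERS, OUTPUT_DIM = 80, 512, 3, 29

lstm_params = 0
for layer in range(NUM_LAYERS):
    # Layers after the first consume the concatenated forward/backward outputs.
    layer_input = INPUT_DIM if layer == 0 else 2 * HIDDEN_DIM
    lstm_params += 2 * lstm_direction_params(layer_input, HIDDEN_DIM)

fc_params = 2 * HIDDEN_DIM * OUTPUT_DIM + OUTPUT_DIM  # weights + bias
total_params = lstm_params + fc_params

print(f"{total_params:,}")
```

That is roughly 15 million parameters, almost all of them in the recurrent layers.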


### Training and Evaluation Function

- `train_and_evaluate`: Trains and evaluates the model.
- Supports different optimizers (`SGD`, `Adam`, `AdamW`, `RMSprop`) and schedulers (`StepLR`, `ReduceLROnPlateau`).
- Tracks training/validation loss and epoch times.
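- Includes gradient clipping (max norm 2.0) and device handling.
- Accepts an `early_stopping` argument, but it is not used in the current implementation.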


In [None]:
import time
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torch.optim.lr_scheduler import ReduceLROnPlateau, StepLR

def train_and_evaluate(
    model_fn, train_dataset, val_dataset, loss_fn,
    optimizer_name="AdamW", batch_size=4, epochs=15,
    device="cuda", scheduler_type="step", early_stopping=None
):
    device = torch.device(device if torch.cuda.is_available() else "cpu")
    model = model_fn().to(device)

    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)
    val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False, collate_fn=collate_fn)

    optimizer = {
        "SGD": lambda: torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9),
        "Adam": lambda: torch.optim.Adam(model.parameters(), lr=0.001),
        "AdamW": lambda: torch.optim.AdamW(model.parameters(), lr=0.001, weight_decay=1e-4),
        "RMSprop": lambda: torch.optim.RMSprop(model.parameters(), lr=0.001)
    }.get(optimizer_name, None)

    if optimizer is None:
        raise ValueError("Unsupported optimizer")

    optimizer = optimizer()

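    # Build only the requested scheduler; a dict of instances would
    # construct both and attach them to the same optimizer.
    if scheduler_type == "plateau":
        scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=2)
    elif scheduler_type == "step":
        scheduler = StepLR(optimizer, step_size=5, gamma=0.1)
    else:
        scheduler = None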

    history = {
        "train_loss": [], "val_loss": [], "epoch_time": []
    }

    for epoch in range(epochs):
        start = time.time()

        model.train()
        total_train_loss = 0.0
        for inputs, labels, input_lengths, target_lengths in train_loader:
            inputs = inputs.to(device)
            labels = labels.to(device)
            input_lengths = input_lengths.to(device)
            target_lengths = target_lengths.to(device)

            optimizer.zero_grad()
            outputs = model(inputs)  
            log_probs = F.log_softmax(outputs, dim=2)

            loss = loss_fn(log_probs.transpose(0, 1), labels, input_lengths, target_lengths)
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=2.0)
            optimizer.step()
            total_train_loss += loss.item()

        model.eval()
        total_val_loss = 0.0
        with torch.no_grad():
            for inputs, labels, input_lengths, target_lengths in val_loader:
                inputs = inputs.to(device)
                labels = labels.to(device)
                input_lengths = input_lengths.to(device)
                target_lengths = target_lengths.to(device)

                outputs = model(inputs)
                log_probs = F.log_softmax(outputs, dim=2)
                val_loss = loss_fn(log_probs.transpose(0, 1), labels, input_lengths, target_lengths)
                total_val_loss += val_loss.item()

        avg_train_loss = total_train_loss / len(train_loader)
        avg_val_loss = total_val_loss / len(val_loader)

        print(f"Epoch {epoch+1:02d} | Train Loss: {avg_train_loss:.4f} | Val Loss: {avg_val_loss:.4f} | Time: {time.time() - start:.2f}s")

        history["train_loss"].append(avg_train_loss)
        history["val_loss"].append(avg_val_loss)
        history["epoch_time"].append(time.time() - start)

        if scheduler_type == "plateau":
            scheduler.step(avg_val_loss)
        elif scheduler_type == "step":
            scheduler.step()


    return model, history
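
With the default `scheduler_type="step"`, the learning rate follows a staircase. Below is a closed-form sketch of what `StepLR(step_size=5, gamma=0.1)` produces from AdamW's initial rate of 0.001; `step_lr` is a hypothetical helper, not a PyTorch function.

```python
def step_lr(base_lr, epoch, step_size=5, gamma=0.1):
    # Closed form of a StepLR schedule; epoch is 0-indexed, and the rate
    # is multiplied by gamma once every step_size completed epochs.
    return base_lr * gamma ** (epoch // step_size)

schedule = [step_lr(0.001, epoch) for epoch in range(15)]

for epoch, lr in enumerate(schedule, start=1):
    print(f"epoch {epoch:02d}: lr = {lr:.0e}")
```

Epochs 1 to 5 run at 1e-3, epochs 6 to 10 at 1e-4, and epochs 11 to 15 at 1e-5, because `scheduler.step()` is called once at the end of each epoch.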


### Model Training

- Defines CTC loss with blank index 0.
- Initializes and trains `SpeechRNNCTC` model using `train_and_evaluate`.
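- Uses the AdamW optimizer (learning rate 0.001, weight decay 1e-4) with batch size 8 and the default `StepLR` schedule (learning rate × 0.1 every 5 epochs), for 15 epochs on GPU (if available).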


In [None]:
from torch.nn import CTCLoss

loss_fn = CTCLoss(blank=0, zero_infinity=True)

model_fn = SpeechRNNCTC  # the class itself serves as a zero-argument factory
trained_model, training_history = train_and_evaluate(
    model_fn=model_fn,
    train_dataset=train_dataset,
    val_dataset=val_dataset,
    loss_fn=loss_fn,
    optimizer_name="AdamW",
    batch_size=8,
    epochs=15,
    device="cuda"
)


Epoch 01 | Train Loss: 1.3977 | Val Loss: 1.0376 | Time: 1204.93s
Epoch 02 | Train Loss: 0.7843 | Val Loss: 0.8200 | Time: 1203.31s
Epoch 03 | Train Loss: 0.6180 | Val Loss: 0.7376 | Time: 1203.32s
Epoch 04 | Train Loss: 0.5215 | Val Loss: 0.6744 | Time: 1201.83s
Epoch 05 | Train Loss: 0.4540 | Val Loss: 0.6372 | Time: 1203.81s
Epoch 06 | Train Loss: 0.3180 | Val Loss: 0.5449 | Time: 1203.08s
Epoch 07 | Train Loss: 0.2608 | Val Loss: 0.5345 | Time: 1203.76s
Epoch 08 | Train Loss: 0.2265 | Val Loss: 0.5384 | Time: 1202.42s
Epoch 09 | Train Loss: 0.1994 | Val Loss: 0.5451 | Time: 1203.68s
Epoch 10 | Train Loss: 0.1758 | Val Loss: 0.5552 | Time: 1202.76s
Epoch 11 | Train Loss: 0.1479 | Val Loss: 0.5594 | Time: 1202.54s
Epoch 12 | Train Loss: 0.1425 | Val Loss: 0.5630 | Time: 1204.59s
Epoch 13 | Train Loss: 0.1387 | Val Loss: 0.5660 | Time: 1202.15s
Epoch 14 | Train Loss: 0.1353 | Val Loss: 0.5692 | Time: 1204.84s
Epoch 15 | Train Loss: 0.1322 | Val Loss: 0.5726 | Time: 1204.30s
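
The validation loss in this log bottoms out well before the final epoch. As a sketch of what a patience-based early-stopping rule would have done with these numbers (the function's `early_stopping` argument is a natural hook for one), assuming a hypothetical `stop_epoch` helper and a patience of 3 epochs, neither of which is from the notebook:

```python
# Validation losses copied from the training log above.
val_losses = [1.0376, 0.8200, 0.7376, 0.6744, 0.6372, 0.5449, 0.5345, 0.5384,
              0.5451, 0.5552, 0.5594, 0.5630, 0.5660, 0.5692, 0.5726]

def stop_epoch(losses, patience=3):
    # Return (best_epoch, stop_epoch), both 1-indexed. Training stops once
    # `patience` consecutive epochs fail to improve on the best loss so far.
    best_loss, best_epoch, bad_epochs = float("inf"), 0, 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return best_epoch, epoch
    return best_epoch, len(losses)

best, stopped = stop_epoch(val_losses)
print(best, stopped)
```

Under these assumptions training would stop after epoch 10 and keep the epoch-7 weights, provided a checkpoint is saved at each improvement. That would save about a third of the run.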


### Plotting Training History

- `plot_training_history`: Plots training and validation loss curves over epochs.
- Helps visualize model performance during training.


In [None]:
import matplotlib.pyplot as plt

def plot_training_history(history):
    epochs = range(1, len(history["train_loss"]) + 1)

    plt.figure(figsize=(10, 5))

    plt.plot(epochs, history["train_loss"], label="Train Loss", marker='o')
    plt.plot(epochs, history["val_loss"], label="Val Loss", marker='x')

    plt.title("Training vs Validation Loss")
    plt.xlabel("Epochs")
    plt.ylabel("CTC Loss")
    plt.grid(True)
    plt.legend()
    plt.tight_layout()
    plt.show()


### Model Evaluation on Test Data

- `evaluate_model_on_test`: Evaluates model using Word Error Rate (WER) and Character Error Rate (CER).
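- Displays up to five example predictions and averages the per-utterance WER/CER. This is an unweighted mean over utterances, not a corpus-level rate weighted by utterance length.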
- Supports limiting number of batches and showing sample outputs.


In [None]:
from jiwer import wer, cer

def evaluate_model_on_test(
    model, test_loader, idx2char, device="cuda", max_batches=5, show_samples=True
):
    model.eval()
    total_wer, total_cer = 0.0, 0.0
    total_samples = 0
    samples_shown = 0

    with torch.no_grad():
        for batch_idx, (inputs, targets, input_lengths, target_lengths) in enumerate(test_loader):
            inputs = inputs.to(device)
            targets = targets.to(device)
            input_lengths = input_lengths.to(device)
            target_lengths = target_lengths.to(device)

            logits = model(inputs)
            log_probs = torch.nn.functional.log_softmax(logits, dim=2)
            pred_texts = greedy_decode(log_probs)

            true_texts = []
            idx = 0
            for length in target_lengths:
                text = "".join([idx2char[i.item()] for i in targets[idx:idx + length]])
                true_texts.append(text)
                idx += length

            for ref, hyp in zip(true_texts, pred_texts):
                total_wer += wer(ref, hyp)
                total_cer += cer(ref, hyp)
                total_samples += 1

                if show_samples and samples_shown < 5:
                    print(f"REF: {ref}")
                    print(f"HYP: {hyp}")
                    print(f"WER: {wer(ref, hyp):.2f}, CER: {cer(ref, hyp):.2f}")
                    print("-" * 60)
                    samples_shown += 1

            if max_batches is not None and (batch_idx + 1) >= max_batches:
                break

    avg_wer = total_wer / total_samples if total_samples > 0 else float("inf")
    avg_cer = total_cer / total_samples if total_samples > 0 else float("inf")
    print(f"\nAverage WER: {avg_wer:.4f}")
    print(f"Average CER: {avg_cer:.4f}")
    return avg_wer, avg_cer


### Run Final Model Evaluation

- Evaluates the trained model on test data.
- Calculates average WER and CER over the first 10 test batches (40 utterances at batch size 4), not the full test set.
- Displays example transcriptions for comparison.


In [None]:
evaluate_model_on_test(
    model=trained_model,
    test_loader=test_loader,
    idx2char=IDX2CHAR,
    device=DEVICE,
    max_batches=10
)

REF: he hoped there would be stew for dinner turnips and carrots and bruised potatoes and fat mutton pieces to be ladled out in thick peppered flour fattened sauce
HYP: he hoped there would be sto for dinner turnips and carrats and brused patatoes and fatn button pieces to be ladleld out ind thic peppered flower fatened souc
WER: 0.43, CER: 0.09
------------------------------------------------------------
REF: stuff it into you his belly counselled him
HYP: stuf id into you his belay councteled him
WER: 0.50, CER: 0.14
------------------------------------------------------------
REF: after early nightfall the yellow lamps would light up here and there the squalid quarter of the brothels
HYP: after early night fall the yenow lampse would light hap here and there the squalled quartter of the broawfals
WER: 0.44, CER: 0.12
------------------------------------------------------------
REF: hello bertie any good in your mind
HYP: hel iburty and e good in her mind
WER: 0.71, CER: 0.35
-------

(0.2720786615312261, 0.08343493770278568)

### Save the Trained Model

- Saves the model's learned parameters to a `.pth` file for future use.


In [None]:
torch.save(trained_model.state_dict(), "speech_to_text_model.pth")

### Training Results

- Total Epochs: **15**
- Final Training Loss: **0.1322**
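- Final Validation Loss: **0.5726**
- Lowest Validation Loss: **0.5345** (epoch 7). Validation loss rises in every later epoch while training loss keeps falling, which indicates overfitting.
- Average Epoch Time: ~1203 seconds (about 20 minutes), roughly 5 hours in total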


### Test Set Evaluation

**What is WER and CER?**

- **WER (Word Error Rate):**  
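  The word-level edit distance between the predicted and reference transcripts, divided by the number of words in the reference:  
  `(Substitutions + Insertions + Deletions) / Words in Reference`  
  Lower WER means better transcription accuracy. Because insertions count as errors, WER can exceed 1.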

- **CER (Character Error Rate):**  
  Similar to WER but at the character level, calculated as:  
  `(Substitutions + Insertions + Deletions) / Characters in Reference`  
  Lower CER indicates fewer character-level mistakes.
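
Both metrics are a Levenshtein edit distance divided by the reference length. Here is a minimal pure-Python sketch, independent of `jiwer` (the helper names are hypothetical), applied to the second sample pair printed in the evaluation output:

```python
def edit_distance(ref, hyp):
    # Classic dynamic-programming Levenshtein distance over two sequences.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def word_error_rate(ref, hyp):
    # Edit distance over word tokens, normalized by reference word count.
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)

def char_error_rate(ref, hyp):
    # Edit distance over characters (spaces included), normalized by
    # reference length.
    return edit_distance(list(ref), list(hyp)) / len(ref)

ref = "stuff it into you his belly counselled him"
hyp = "stuf id into you his belay councteled him"

print(f"WER: {word_error_rate(ref, hyp):.2f}, CER: {char_error_rate(ref, hyp):.2f}")
```

Four of the eight reference words are wrong, yet only 6 of 42 characters need editing, which is why WER is so much higher than CER for near-miss spellings.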


**Observed Test Results**

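- WER ranges from **0.43 to 0.71** on the four displayed examples.
- CER ranges from **0.09 to 0.35** on the same examples.
- Most errors are phonetically plausible misspellings (`sto` for `stew`, `flower` for `flour`) and word-boundary mistakes (`night fall` for `nightfall`), which is typical of greedy character-level decoding without a language model.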


**Performance Summary**

| Metric | This Model (first 10 test batches) | Rule-of-Thumb Targets |
|--------|------------|---------------------------|
| **WER** | 0.2721 | ≤ 0.15 (Excellent), ≤ 0.25 (Good) |
| **CER** | 0.0834 | ≤ 0.05 (Excellent), ≤ 0.10 (Good) |


**Conclusion**

- **CER** is within a reasonable range, showing good character-level accuracy.
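- **WER** of 0.27 falls just short of the "Good" threshold, leaving room for improvement at the word level.
- Future improvements may include stopping training near the validation-loss minimum, more training data, larger models, or better decoding techniques such as beam search with a language model.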

