# PyTorch Tutorial: Audio and Speech Processing

Audio is a time-series signal that can be processed with neural networks for tasks like speech recognition, music generation, and audio classification. This notebook covers the fundamentals of audio feature extraction and deep learning for audio.

## Learning Objectives
- Understand audio fundamentals (waveforms, sampling, frequency)
- Extract spectrograms and mel-spectrograms
- Compute MFCCs (Mel-Frequency Cepstral Coefficients)
- Build audio classifiers with CNNs
- Understand sequence models for speech

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt

# torchaudio for audio processing
try:
    import torchaudio
    import torchaudio.transforms as T
    import torchaudio.functional as AF
    TORCHAUDIO_AVAILABLE = True
except ImportError:
    print("torchaudio not available - install with: pip install torchaudio")
    TORCHAUDIO_AVAILABLE = False

torch.manual_seed(42)
np.random.seed(42)

## 1. Audio Fundamentals

Audio is a continuous pressure wave that we digitize by **sampling** at regular intervals.

Key concepts:
- **Sample rate**: How many samples per second (e.g., 16000 Hz, 44100 Hz)
- **Nyquist frequency**: Highest frequency we can capture = sample_rate / 2
- **Bit depth**: Precision of each sample (e.g., 16-bit, 32-bit float)
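
To make bit depth concrete, the sketch below rounds a sine wave to 16-bit and 8-bit integer grids and measures the rounding error. The helper name `quantize` and the 220 Hz test tone are illustrative assumptions, not part of the notebook's pipeline.

```python
import numpy as np

def quantize(x: np.ndarray, bits: int = 16) -> np.ndarray:
    """Round samples in [-1, 1] to a signed integer grid of the given bit depth, then map back to floats."""
    max_int = 2 ** (bits - 1) - 1              # 32767 for 16-bit
    ints = np.round(np.clip(x, -1.0, 1.0) * max_int)
    return ints / max_int

sample_rate = 16000
t = np.arange(sample_rate) / sample_rate       # 1 second of sample times
x = 0.5 * np.sin(2 * np.pi * 220 * t)

x16 = quantize(x, bits=16)
x8 = quantize(x, bits=8)

err16 = np.max(np.abs(x - x16))
err8 = np.max(np.abs(x - x8))
print(f"Nyquist frequency: {sample_rate / 2:.0f} Hz")
print(f"Max 16-bit error: {err16:.2e}")
print(f"Max 8-bit error:  {err8:.2e}")
```

The worst-case error is half a quantization step, so every extra bit halves it.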

In [None]:
# Let's create a synthetic audio signal
sample_rate = 16000  # 16 kHz - common for speech
duration = 1.0  # 1 second
t = np.linspace(0, duration, int(sample_rate * duration), endpoint=False)

# Create a signal with multiple frequencies (like speech)
# Fundamental frequency + harmonics
f0 = 220  # A3 note
signal = (
    0.5 * np.sin(2 * np.pi * f0 * t) +       # Fundamental
    0.25 * np.sin(2 * np.pi * 2 * f0 * t) +  # 2nd harmonic
    0.125 * np.sin(2 * np.pi * 3 * f0 * t)   # 3rd harmonic
)

# Add some amplitude envelope (like speech)
envelope = np.exp(-2 * t) * (1 - np.exp(-20 * t))
signal = signal * envelope

# Convert to tensor
waveform = torch.tensor(signal, dtype=torch.float32).unsqueeze(0)  # [1, samples]

# Plot
fig, axes = plt.subplots(2, 1, figsize=(12, 6))

# Full waveform
axes[0].plot(t, signal)
axes[0].set_xlabel('Time (s)')
axes[0].set_ylabel('Amplitude')
axes[0].set_title('Audio Waveform (1 second)')

# Zoomed view
zoom_samples = 500
axes[1].plot(t[:zoom_samples], signal[:zoom_samples])
axes[1].set_xlabel('Time (s)')
axes[1].set_ylabel('Amplitude')
axes[1].set_title(f'Zoomed View (first {zoom_samples} samples)')

plt.tight_layout()
plt.show()

print(f"Sample rate: {sample_rate} Hz")
print(f"Duration: {duration} s")
print(f"Total samples: {len(signal)}")
print(f"Nyquist frequency: {sample_rate // 2} Hz")

## 2. Spectrograms

A spectrogram shows frequency content over time. It's created using the **Short-Time Fourier Transform (STFT)**:
1. Divide signal into overlapping windows
2. Apply FFT to each window
3. Stack the results

Result: 2D representation [frequency x time]
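
The three steps above can be sketched directly in NumPy. This is a minimal version under simplifying assumptions (no padding, so only full frames are kept, and a symmetric Hann window); the function name `stft_power` is hypothetical.

```python
import numpy as np

def stft_power(x: np.ndarray, n_fft: int = 512, hop_length: int = 128) -> np.ndarray:
    """Power spectrogram via framing, Hann windowing and a real FFT. Returns [n_fft // 2 + 1, n_frames]."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop_length      # no padding: full frames only
    frames = np.stack([
        x[i * hop_length : i * hop_length + n_fft] * window
        for i in range(n_frames)
    ])                                                 # step 1: [n_frames, n_fft]
    spectrum = np.fft.rfft(frames, axis=1)             # step 2: FFT of each frame
    return (np.abs(spectrum) ** 2).T                   # step 3: stack as [freq, time]

sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
x = np.sin(2 * np.pi * 1000 * t)                       # 1 kHz test tone

spec = stft_power(x)
peak_bin = int(spec.mean(axis=1).argmax())
peak_hz = peak_bin * sample_rate / 512
print(spec.shape)          # (257, 122)
print(peak_bin, peak_hz)   # 32 1000.0
```

Each bin spans `sample_rate / n_fft` = 31.25 Hz, so the 1 kHz tone lands exactly in bin 32. `torch.stft` pads the signal by default (`center=True`), so its frame count differs from this unpadded version.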

In [None]:
# STFT parameters
n_fft = 512       # FFT window size
hop_length = 128  # Step between windows
win_length = 512  # Window length

if TORCHAUDIO_AVAILABLE:
    # Using torchaudio
    spectrogram_transform = T.Spectrogram(
        n_fft=n_fft,
        hop_length=hop_length,
        win_length=win_length,
        power=2.0,  # Power spectrogram
    )
    spectrogram = spectrogram_transform(waveform)
else:
    # Manual STFT using torch (the window must be win_length samples long)
    window = torch.hann_window(win_length)
    stft = torch.stft(waveform, n_fft=n_fft, hop_length=hop_length,
                      win_length=win_length, window=window, return_complex=True)
    spectrogram = torch.abs(stft) ** 2

print(f"Spectrogram shape: {spectrogram.shape}")
print(f"  - Channels: {spectrogram.shape[0]}")
print(f"  - Frequency bins: {spectrogram.shape[1]}")
print(f"  - Time frames: {spectrogram.shape[2]}")

# Plot
plt.figure(figsize=(12, 4))
plt.imshow(
    10 * torch.log10(spectrogram[0] + 1e-10).numpy(),
    aspect='auto',
    origin='lower',
    cmap='viridis'
)
plt.colorbar(label='Power (dB)')
plt.xlabel('Time Frame')
plt.ylabel('Frequency Bin')
plt.title('Spectrogram (Power)')
plt.show()

## 3. Mel-Spectrograms

The **mel scale** approximates human pitch perception: listeners are more sensitive to differences between low frequencies than to the same differences between high frequencies.

Mel-spectrograms:
1. Compute power spectrogram
2. Apply mel filterbank (triangular filters spaced on mel scale)
3. Take log for better dynamic range

In [None]:
# Mel scale conversion
def hz_to_mel(hz):
    return 2595 * np.log10(1 + hz / 700)

def mel_to_hz(mel):
    return 700 * (10 ** (mel / 2595) - 1)

# Visualize mel scale
hz_range = np.linspace(0, 8000, 100)
mel_range = hz_to_mel(hz_range)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

axes[0].plot(hz_range, mel_range)
axes[0].set_xlabel('Frequency (Hz)')
axes[0].set_ylabel('Mel')
axes[0].set_title('Hz to Mel Scale')
axes[0].grid(True)

# Show mel filterbank spacing
n_mels = 40
mel_points = np.linspace(hz_to_mel(0), hz_to_mel(8000), n_mels + 2)
hz_points = mel_to_hz(mel_points)

# hz_points holds n_mels + 2 band edges; the inner n_mels points are the filter centers
axes[1].bar(range(n_mels), hz_points[1:-1], width=0.8)
axes[1].set_xlabel('Mel Filter Index')
axes[1].set_ylabel('Center Frequency (Hz)')
axes[1].set_title('Mel Filterbank Center Frequencies')

plt.tight_layout()
plt.show()

print("Notice how mel filters are:")
print("- Closely spaced at low frequencies (where we're sensitive)")
print("- Widely spaced at high frequencies (where we're less sensitive)")
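
The spacing can also be put in numbers. The sketch below redefines the two conversion formulas with the standard library so it runs on its own, then compares how many Hz a 100-mel step covers at the bottom and at the top of the 0 to 8 kHz range. The 100-mel step size is an arbitrary choice for illustration.

```python
import math

def hz_to_mel(hz: float) -> float:
    return 2595 * math.log10(1 + hz / 700)

def mel_to_hz(mel: float) -> float:
    return 700 * (10 ** (mel / 2595) - 1)

# Width in Hz of a 100-mel step near 0 Hz and near 8 kHz
low_step = mel_to_hz(100) - mel_to_hz(0)
top = hz_to_mel(8000)
high_step = mel_to_hz(top) - mel_to_hz(top - 100)

print(f"1000 Hz = {hz_to_mel(1000):.1f} mel")
print(f"8000 Hz = {top:.1f} mel")
print(f"100 mel near 0 Hz spans {low_step:.1f} Hz")
print(f"100 mel near 8 kHz spans {high_step:.1f} Hz")
```

With this formula 1000 Hz maps to roughly 1000 mel, and a step of equal size in mel covers about ten times more Hz at the top of the range than at the bottom.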

In [None]:
# Compute mel-spectrogram
n_mels = 64

if TORCHAUDIO_AVAILABLE:
    mel_spectrogram_transform = T.MelSpectrogram(
        sample_rate=sample_rate,
        n_fft=n_fft,
        hop_length=hop_length,
        n_mels=n_mels,
        f_min=0,
        f_max=sample_rate // 2,
    )
    mel_spec = mel_spectrogram_transform(waveform)
else:
    # Manual mel filterbank (simplified)
    def create_mel_filterbank(n_fft, n_mels, sample_rate):
        n_freqs = n_fft // 2 + 1
        mel_min = hz_to_mel(0)
        mel_max = hz_to_mel(sample_rate / 2)
        mel_points = np.linspace(mel_min, mel_max, n_mels + 2)
        hz_points = mel_to_hz(mel_points)
        bin_points = np.floor((n_fft + 1) * hz_points / sample_rate).astype(int)
        
        filterbank = np.zeros((n_mels, n_freqs))
        for i in range(n_mels):
            for j in range(bin_points[i], bin_points[i+1]):
                filterbank[i, j] = (j - bin_points[i]) / (bin_points[i+1] - bin_points[i])
            for j in range(bin_points[i+1], bin_points[i+2]):
                filterbank[i, j] = (bin_points[i+2] - j) / (bin_points[i+2] - bin_points[i+1])
        return torch.tensor(filterbank, dtype=torch.float32)
    
    filterbank = create_mel_filterbank(n_fft, n_mels, sample_rate)
    mel_spec = torch.matmul(filterbank, spectrogram[0])
    mel_spec = mel_spec.unsqueeze(0)

# Convert to log scale (dB)
log_mel_spec = 10 * torch.log10(mel_spec + 1e-10)

print(f"Mel-spectrogram shape: {mel_spec.shape}")

# Plot
plt.figure(figsize=(12, 4))
plt.imshow(
    log_mel_spec[0].numpy(),
    aspect='auto',
    origin='lower',
    cmap='viridis'
)
plt.colorbar(label='Log Power (dB)')
plt.xlabel('Time Frame')
plt.ylabel('Mel Frequency Bin')
plt.title('Log Mel-Spectrogram')
plt.show()

## 4. MFCCs (Mel-Frequency Cepstral Coefficients)

MFCCs are a compact representation of the spectral envelope:
1. Compute log mel-spectrogram
2. Apply Discrete Cosine Transform (DCT)
3. Keep first N coefficients (typically 13-40)

MFCCs capture the "shape" of the spectrum, useful for speech recognition.
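
A small NumPy sketch shows what the DCT step does to a single frame. The two log-mel frames here are made-up values chosen for illustration, and `dct_ii_ortho` is a hypothetical helper that builds an orthonormal DCT-II basis.

```python
import numpy as np

def dct_ii_ortho(n_out: int, n_in: int) -> np.ndarray:
    """First n_out rows of the orthonormal DCT-II basis, shape [n_out, n_in]."""
    k = np.arange(n_out)[:, None]
    n = np.arange(n_in)[None, :]
    basis = np.cos(np.pi * k * (2 * n + 1) / (2 * n_in))
    basis[0] *= np.sqrt(1 / n_in)
    basis[1:] *= np.sqrt(2 / n_in)
    return basis

n_mels, n_mfcc = 64, 13
dct = dct_ii_ortho(n_mfcc, n_mels)

flat = np.full(n_mels, -20.0)                # hypothetical frame: same log energy in every band
tilted = np.linspace(-10.0, -30.0, n_mels)   # hypothetical frame: energy falls with frequency

c_flat = dct @ flat
c_tilt = dct @ tilted
print(np.round(c_flat, 3))   # only coefficient 0 is nonzero
print(np.round(c_tilt, 3))   # coefficient 1 picks up the spectral tilt
```

A flat spectrum puts everything into coefficient 0, which is why that coefficient tracks overall energy. Tilting the spectrum leaves coefficient 0 unchanged and moves most of the remaining energy into coefficient 1.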

In [None]:
n_mfcc = 13

if TORCHAUDIO_AVAILABLE:
    mfcc_transform = T.MFCC(
        sample_rate=sample_rate,
        n_mfcc=n_mfcc,
        melkwargs={
            'n_fft': n_fft,
            'hop_length': hop_length,
            'n_mels': n_mels,
        }
    )
    mfccs = mfcc_transform(waveform)
else:
    # Manual DCT (Type-II)
    def dct_matrix(n_mfcc, n_mels):
        basis = np.zeros((n_mfcc, n_mels))
        for k in range(n_mfcc):
            for n in range(n_mels):
                basis[k, n] = np.cos(np.pi * k * (2*n + 1) / (2 * n_mels))
        basis[0, :] *= np.sqrt(1 / n_mels)
        basis[1:, :] *= np.sqrt(2 / n_mels)
        return torch.tensor(basis, dtype=torch.float32)
    
    dct = dct_matrix(n_mfcc, n_mels)
    mfccs = torch.matmul(dct, log_mel_spec[0])
    mfccs = mfccs.unsqueeze(0)

print(f"MFCC shape: {mfccs.shape}")

# Plot
plt.figure(figsize=(12, 4))
plt.imshow(
    mfccs[0].numpy(),
    aspect='auto',
    origin='lower',
    cmap='coolwarm'
)
plt.colorbar(label='MFCC Value')
plt.xlabel('Time Frame')
plt.ylabel('MFCC Coefficient')
plt.title('MFCCs')
plt.show()

print("\nMFCC coefficients:")
print("- MFCC 0: Overall energy")
print("- MFCC 1-12: Spectral envelope shape")
print("- Higher coefficients: Finer spectral details")

## 5. Delta and Delta-Delta Features

Adding temporal derivatives helps capture dynamics:
- **Delta**: First derivative (velocity of change)
- **Delta-delta**: Second derivative (acceleration of change)
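
One common way to estimate these derivatives is a regression over neighbouring frames: d[t] = sum over n of n * (c[t+n] - c[t-n]), divided by 2 * sum of n^2. The NumPy sketch below applies it to a straight ramp so the result can be checked by hand. The function name `delta` and the ramp input are illustrative assumptions.

```python
import numpy as np

def delta(features: np.ndarray, width: int = 2) -> np.ndarray:
    """Regression delta along the last axis, replicating the edge frames.

    d[t] = sum_{n=1..width} n * (c[t+n] - c[t-n]) / (2 * sum_{n=1..width} n^2)
    """
    n_time = features.shape[-1]
    pad = [(0, 0)] * (features.ndim - 1) + [(width, width)]
    padded = np.pad(features, pad, mode="edge")
    denom = 2 * sum(n * n for n in range(1, width + 1))
    out = np.zeros_like(features, dtype=float)
    for n in range(1, width + 1):
        out += n * (padded[..., width + n : width + n + n_time]
                    - padded[..., width - n : width - n + n_time])
    return out / denom

ramp = 3.0 * np.arange(10, dtype=float)[None, :]   # one coefficient rising by 3 per frame
d = delta(ramp)
dd = delta(d)
print(d[0])    # frames away from the edges equal the slope, 3.0
print(dd[0])   # the middle frames are 0.0: a straight line has no acceleration
```

Frames near the edges are biased toward zero because replicated padding flattens the signal there.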

In [None]:
def compute_deltas(features, width=2):
    """
    Compute delta features using finite differences.
    features: [channels, n_features, time]
    """
    # Pad features
    padded = F.pad(features, (width, width), mode='replicate')
    
    # Compute weighted difference
    denominator = 2 * sum(i**2 for i in range(1, width + 1))
    delta = torch.zeros_like(features)
    
    for i in range(1, width + 1):
        delta += i * (padded[:, :, width + i:width + i + features.shape[2]] - 
                     padded[:, :, width - i:width - i + features.shape[2]])
    
    return delta / denominator

# Compute deltas
delta = compute_deltas(mfccs)
delta_delta = compute_deltas(delta)

# Concatenate all features
full_features = torch.cat([mfccs, delta, delta_delta], dim=1)
print(f"Full feature shape (MFCC + delta + delta-delta): {full_features.shape}")

# Plot
fig, axes = plt.subplots(3, 1, figsize=(12, 8))

for i, (feat, name) in enumerate([(mfccs, 'MFCC'), (delta, 'Delta'), (delta_delta, 'Delta-Delta')]):
    im = axes[i].imshow(feat[0].numpy(), aspect='auto', origin='lower', cmap='coolwarm')
    axes[i].set_ylabel(f'{name} Coef')
    plt.colorbar(im, ax=axes[i])

axes[-1].set_xlabel('Time Frame')
plt.suptitle('MFCC with Delta Features')
plt.tight_layout()
plt.show()

## 6. Audio Classification with CNNs

We can treat spectrograms as images and use CNNs for classification!

In [None]:
class AudioCNN(nn.Module):
    """
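    CNN for audio classification using mel-spectrograms.
    Input: [batch, 1, n_mels, time]
    n_mels is not used to size any layer: the adaptive pooling hands the
    classifier a fixed 4x4 map regardless of spectrogram height or width.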
    """
    
    def __init__(self, n_mels: int = 64, num_classes: int = 10):
        super().__init__()
        
        self.conv_layers = nn.Sequential(
            # Block 1
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2),
            
            # Block 2
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2),
            
            # Block 3
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),  # Fixed output size
        )
        
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 4 * 4, 256),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(256, num_classes),
        )
    
    def forward(self, x):
        # x: [batch, n_mels, time] or [batch, 1, n_mels, time]
        if x.dim() == 3:
            x = x.unsqueeze(1)  # Add channel dimension
        
        x = self.conv_layers(x)
        x = self.classifier(x)
        return x

# Test
model = AudioCNN(n_mels=64, num_classes=10)
test_input = torch.randn(4, 1, 64, 100)  # [batch, channel, n_mels, time]
output = model(test_input)
print(f"Input shape: {test_input.shape}")
print(f"Output shape: {output.shape}")
print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")
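
The parameter count printed by the test can be reproduced by hand, which also shows where the parameters sit. The sketch below uses plain arithmetic and assumes the default `num_classes=10` and the `[64, 100]` test input; the helper names are hypothetical.

```python
def conv2d_params(c_in: int, c_out: int, k: int) -> int:
    """Weights (c_in * c_out * k * k) plus one bias per output channel."""
    return c_in * c_out * k * k + c_out

def batchnorm_params(channels: int) -> int:
    """Learnable scale and shift; running statistics are buffers, not parameters."""
    return 2 * channels

def linear_params(n_in: int, n_out: int) -> int:
    return n_in * n_out + n_out

num_classes = 10
layers = {
    "conv1": conv2d_params(1, 32, 3),
    "bn1": batchnorm_params(32),
    "conv2": conv2d_params(32, 64, 3),
    "bn2": batchnorm_params(64),
    "conv3": conv2d_params(64, 128, 3),
    "bn3": batchnorm_params(128),
    "fc1": linear_params(128 * 4 * 4, 256),
    "fc2": linear_params(256, num_classes),
}
total = sum(layers.values())

# Each MaxPool2d(2) halves both axes (rounding down); the padding=1 convs keep the size
h, w = 64, 100
for _ in range(2):
    h, w = h // 2, w // 2

print(f"Feature map before adaptive pooling: {h} x {w}")   # 16 x 25
print(f"Total parameters: {total:,}")                       # 620,234
print(f"Share in fc1: {layers['fc1'] / total:.0%}")
```

Most of the parameters belong to the first linear layer, so the 4x4 pooled size is the main lever on model size.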

## 7. Sequence Models for Speech

For tasks like speech recognition, we need to model temporal dependencies. RNNs and Transformers work well here.
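
The feature extractors earlier in this notebook return `[channel, n_mels, time]`, while the sequence models below expect `[batch, time, features]`. The sketch below shows the axis swap on a made-up NumPy batch; for a PyTorch tensor the equivalent call is `tensor.transpose(1, 2)`. The batch size and the random values are assumptions for illustration.

```python
import numpy as np

# Hypothetical batch of 4 log-mel spectrograms, each [n_mels=64, time=126]
rng = np.random.default_rng(0)
batch_mels = rng.standard_normal((4, 64, 126)).astype(np.float32)

# Swap the last two axes so each time step carries a 64-dim feature vector
batch_seq = np.swapaxes(batch_mels, 1, 2)      # [batch, time, features]

print(batch_mels.shape, "->", batch_seq.shape)

# Frame 10 of example 0 holds the same 64 numbers in both layouts
same_frame = np.array_equal(batch_seq[0, 10], batch_mels[0, :, 10])
print(same_frame)
```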

In [None]:
class AudioRNN(nn.Module):
    """
    Bidirectional LSTM for sequence-to-sequence audio tasks.
    Input: [batch, time, features]
    """
    
    def __init__(self, input_dim: int, hidden_dim: int = 256, num_layers: int = 2, num_classes: int = 10):
        super().__init__()
        
        self.lstm = nn.LSTM(
            input_size=input_dim,
            hidden_size=hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            bidirectional=True,
            dropout=0.3 if num_layers > 1 else 0,
        )
        
        # Output layer (for frame-level predictions)
        self.fc = nn.Linear(hidden_dim * 2, num_classes)  # *2 for bidirectional
    
    def forward(self, x):
        # x: [batch, time, features]
        lstm_out, _ = self.lstm(x)  # [batch, time, hidden*2]
        output = self.fc(lstm_out)   # [batch, time, num_classes]
        return output

# Test
rnn_model = AudioRNN(input_dim=64, hidden_dim=256, num_classes=29)  # 26 letters + space + blank + unknown
test_input = torch.randn(4, 100, 64)  # [batch, time, features]
output = rnn_model(test_input)
print(f"Input shape: {test_input.shape}")
print(f"Output shape: {output.shape}")
print(f"Model parameters: {sum(p.numel() for p in rnn_model.parameters()):,}")

In [None]:
class AudioTransformer(nn.Module):
    """
    Transformer encoder for audio classification.
    Input: [batch, time, features]
    """
    
    def __init__(self, input_dim: int, d_model: int = 256, nhead: int = 8, 
                 num_layers: int = 4, num_classes: int = 10):
        super().__init__()
        
        self.input_proj = nn.Linear(input_dim, d_model)
        
        # Learned positional embedding (covers sequences of up to 1000 frames)
        self.pos_encoding = nn.Parameter(torch.randn(1, 1000, d_model) * 0.1)
        
        # Transformer encoder
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=nhead,
            dim_feedforward=d_model * 4,
            dropout=0.1,
            batch_first=True,
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        
        # Classification head
        self.classifier = nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, num_classes),
        )
    
    def forward(self, x):
        # x: [batch, time, features]
        x = self.input_proj(x)
        x = x + self.pos_encoding[:, :x.size(1), :]
        x = self.transformer(x)
        
        # Global average pooling for classification
        x = x.mean(dim=1)  # [batch, d_model]
        x = self.classifier(x)
        return x

# Test
transformer_model = AudioTransformer(input_dim=64, d_model=256, num_classes=10)
test_input = torch.randn(4, 100, 64)
output = transformer_model(test_input)
print(f"Input shape: {test_input.shape}")
print(f"Output shape: {output.shape}")
print(f"Model parameters: {sum(p.numel() for p in transformer_model.parameters()):,}")

## 8. Audio Data Augmentation

Augmentation improves model robustness:

In [None]:
class AudioAugmentation:
    """Common audio augmentation techniques."""
    
    @staticmethod
    def add_noise(waveform: torch.Tensor, noise_level: float = 0.005) -> torch.Tensor:
        """Add Gaussian noise."""
        noise = torch.randn_like(waveform) * noise_level
        return waveform + noise
    
    @staticmethod
    def time_shift(waveform: torch.Tensor, shift_ratio: float = 0.1) -> torch.Tensor:
        """Circularly shift audio by a random offset of up to +/- shift_ratio of its length (samples wrap around)."""
        shift = int(waveform.shape[-1] * shift_ratio * (2 * torch.rand(1).item() - 1))
        return torch.roll(waveform, shifts=shift, dims=-1)
    
    @staticmethod
    def time_stretch(waveform: torch.Tensor, rate: float = 1.0) -> torch.Tensor:
        """Change duration by linear resampling (simplified: pitch shifts along with speed)."""
        new_length = int(waveform.shape[-1] / rate)
        return F.interpolate(waveform.unsqueeze(0), size=new_length, mode='linear').squeeze(0)
    
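    @staticmethod
    def spec_augment(spec: torch.Tensor, freq_mask_param: int = 10, time_mask_param: int = 20) -> torch.Tensor:
        """
        SpecAugment: Mask one random frequency band and one random time band.
        spec: [channels, freq, time]
        Masked cells are set to 0, which is a neutral fill only if the input is
        roughly zero-mean; on a raw dB-scale spectrogram 0 dB is an arbitrary
        level rather than silence.
        """
        spec = spec.clone()
        _, n_freq, n_time = spec.shape

        # Frequency masking: width f in [0, freq_mask_param], start f0 in [0, n_freq - f)
        f = torch.randint(0, freq_mask_param + 1, (1,)).item()
        f0 = torch.randint(0, max(1, n_freq - f), (1,)).item()
        spec[:, f0:f0+f, :] = 0

        # Time masking: width t in [0, time_mask_param], start t0 in [0, n_time - t)
        t = torch.randint(0, time_mask_param + 1, (1,)).item()
        t0 = torch.randint(0, max(1, n_time - t), (1,)).item()
        spec[:, :, t0:t0+t] = 0

        return spec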

# Demonstrate augmentations
aug = AudioAugmentation()

fig, axes = plt.subplots(2, 2, figsize=(12, 8))

# Original
axes[0, 0].plot(waveform[0].numpy())
axes[0, 0].set_title('Original')

# With noise
noisy = aug.add_noise(waveform, noise_level=0.02)
axes[0, 1].plot(noisy[0].numpy())
axes[0, 1].set_title('With Noise')

# Time shifted
shifted = aug.time_shift(waveform, shift_ratio=0.2)
axes[1, 0].plot(shifted[0].numpy())
axes[1, 0].set_title('Time Shifted')

# SpecAugment on mel-spectrogram
augmented_spec = aug.spec_augment(log_mel_spec, freq_mask_param=15, time_mask_param=30)
axes[1, 1].imshow(augmented_spec[0].numpy(), aspect='auto', origin='lower', cmap='viridis')
axes[1, 1].set_title('SpecAugment')

plt.tight_layout()
plt.show()
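
A fixed `noise_level` gives very different results on quiet and loud recordings. One common alternative, sketched here in NumPy as an assumption rather than as part of the class above, scales the noise to hit a target signal-to-noise ratio in dB. The function name `add_noise_snr` is hypothetical.

```python
import numpy as np

def add_noise_snr(x: np.ndarray, snr_db: float, rng: np.random.Generator) -> np.ndarray:
    """Add white Gaussian noise scaled so that signal power / noise power matches snr_db."""
    signal_power = np.mean(x ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.standard_normal(x.shape) * np.sqrt(noise_power)
    return x + noise

rng = np.random.default_rng(0)
t = np.arange(16000) / 16000
clean = 0.5 * np.sin(2 * np.pi * 220 * t)
noisy = add_noise_snr(clean, snr_db=20.0, rng=rng)

# Check the result by measuring the SNR of what was actually added
measured = 10 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
print(f"Measured SNR: {measured:.1f} dB")
```

Drawing `snr_db` at random for each training example (for instance between 5 and 30 dB) exposes the model to a range of noise conditions.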

## 9. FAANG Interview Questions

### Q1: What is a mel-spectrogram and why is it used for audio?

**Answer**:

A mel-spectrogram is a spectrogram where frequencies are mapped to the **mel scale**, which approximates human hearing perception.

**Construction**:
1. Compute STFT (Short-Time Fourier Transform)
2. Apply mel filterbank (triangular filters)
3. Take log of power for better dynamic range

**Why mel scale**:
- Humans perceive pitch roughly logarithmically (close to linear below about 1 kHz)
- More sensitive to low frequency differences
- Roughly matches the frequency resolution of the cochlea

**Advantages**:
- Compact representation (fewer bins than full spectrogram)
- Perceptually meaningful
- Works well with CNNs (treat as image)

---

### Q2: What are MFCCs and why are they useful?

**Answer**:

**MFCCs** (Mel-Frequency Cepstral Coefficients) capture the spectral envelope of audio.

**Computation**:
1. Compute log mel-spectrogram
2. Apply DCT (Discrete Cosine Transform)
3. Keep first N coefficients (typically 13-40)

**What they represent**:
- MFCC 0: Overall energy
- Lower coefficients: Coarse spectral shape (formants)
- Higher coefficients: Fine spectral details

**Why useful**:
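- Compact representation (~13 features per frame)
- Approximately decorrelated features (the DCT removes most correlation between mel bands)
- Capture vocal tract shape (important for speech)
- Less sensitive to pitch than the raw spectrum (low-order coefficients drop harmonic fine structure)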

---

### Q3: Explain the Nyquist theorem and aliasing.

**Answer**:

**Nyquist theorem**: To reconstruct a signal without aliasing, the sample rate must be more than twice its highest frequency component.

**Nyquist frequency** = sample_rate / 2

**Aliasing**: When frequencies above Nyquist are present, they "fold back" and appear as lower frequencies (distortion).

**Example**:
- Sample rate: 16 kHz
- Nyquist: 8 kHz
- A 10 kHz tone would alias to 6 kHz

**Prevention**: Apply low-pass (anti-aliasing) filter before sampling.
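
The folding in the example can be computed directly. The sketch below uses only the standard library; `alias_frequency` is a hypothetical helper that folds a tone frequency into the range from 0 to the Nyquist frequency.

```python
import math

def alias_frequency(f_hz: float, sample_rate: float) -> float:
    """Apparent frequency of a tone at f_hz after sampling at sample_rate."""
    return abs(f_hz - sample_rate * round(f_hz / sample_rate))

sample_rate = 16000
print(alias_frequency(10000, sample_rate))   # 6000
print(alias_frequency(3000, sample_rate))    # 3000 (below Nyquist, unchanged)

# Sampled at 16 kHz, cosines at 10 kHz and 6 kHz produce the same samples
max_diff = max(
    abs(math.cos(2 * math.pi * 10000 * n / sample_rate)
        - math.cos(2 * math.pi * 6000 * n / sample_rate))
    for n in range(100)
)
print(max_diff < 1e-9)   # True
```

Because the two sample sequences are identical, no processing after sampling can tell the tones apart, which is why the filter has to come first.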

---

### Q4: What is SpecAugment and why does it help?

**Answer**:

**SpecAugment** is a data augmentation technique that masks random regions of spectrograms:

1. **Time masking**: Zero out random time bands
2. **Frequency masking**: Zero out random frequency bands
3. (Optional) **Time warping**: Warp spectrogram along time axis

**Why it helps**:
- Forces model to use all parts of spectrogram
- Prevents overfitting to specific patterns
- Simulates real-world conditions (noise, occlusion)
- Simple yet very effective

**Results**: Significant improvements in ASR (automatic speech recognition) without additional data.

---

### Q5: Compare CNN vs RNN vs Transformer for audio.

**Answer**:

| Aspect | CNN | RNN/LSTM | Transformer |
|--------|-----|----------|-------------|
| **Input** | 2D spectrogram | Sequence of feature frames | Sequence of feature frames |
| **Local patterns** | Excellent | Good | Via attention |
| **Long-range** | Limited by receptive field | Good (with LSTM) | Excellent |
| **Parallelization** | Excellent | Poor (sequential over time) | Excellent |
| **Memory** | Low | Moderate (linear in sequence length) | High (attention grows quadratically with sequence length) |
| **Best for** | Classification | Seq2seq (ASR) | Most modern speech and audio tasks |

**Modern approach**: Hybrid architectures
- CNN for local feature extraction
- Transformer for sequence modeling
- Examples: Wav2Vec, Whisper, Conformer
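
The memory cost of attention in the table can be estimated with simple arithmetic. The sketch below assumes this notebook's settings (16 kHz audio, `hop_length=128`, a centered STFT) and counts the pairwise attention scores one head computes per layer; `n_frames` is a hypothetical helper.

```python
def n_frames(n_samples: int, hop_length: int = 128) -> int:
    """Frame count of a centered STFT: 1 + n_samples // hop_length."""
    return 1 + n_samples // hop_length

sample_rate = 16000
for seconds in (1, 10, 30):
    frames = n_frames(seconds * sample_rate)
    print(f"{seconds:>2} s -> {frames:>5} frames, {frames * frames:>10,} attention scores per head")
```

A clip 30 times longer needs nearly 900 times more attention scores, which is one reason hybrid models downsample with convolutions first. The `AudioTransformer` above also allocates only 1000 positions, so at this hop length clips longer than about 8 seconds would need a larger table or a coarser frame rate.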

## 10. Key Takeaways

1. **Audio is a time-series** that we digitize by sampling at regular intervals
2. **Spectrograms** show frequency content over time (STFT)
3. **Mel scale** approximates human hearing - more resolution at low frequencies
4. **MFCCs** capture spectral envelope, useful for speech recognition
5. **Delta features** add temporal dynamics (velocity, acceleration)
6. **CNNs** work well on spectrograms (treat as images)
7. **RNNs/Transformers** model temporal sequences for ASR
8. **SpecAugment** is a simple but effective augmentation technique