# **CHAPTER 18: SPECIALIZED APPLICATIONS**

*AI for Every Domain*

## **Chapter Overview**

While CNNs and Transformers form the foundation, specialized domains require architectural adaptations. Time series data violates the i.i.d. assumption; recommendation systems handle sparse, high-dimensional user-item interactions; graph data has irregular structure; and tabular data remains stubbornly resistant to standard deep learning. This chapter equips you with the specialized tools for these critical industry applications.

**Estimated Time:** 50-60 hours (3-4 weeks)  
**Prerequisites:** Chapters 12-17 (CNNs, RNNs, Transformers, GNN fundamentals from Chapter 8)

---

## **18.0 Learning Objectives**

By the end of this chapter, you will be able to:
1. Build deep learning forecasting models (DeepAR, N-BEATS, Temporal Fusion Transformers) that handle multiple time series and uncertainty quantification
2. Design two-tower neural architectures for large-scale recommendation and candidate retrieval
3. Implement Graph Neural Networks (GCN, GAT, GraphSAGE) for node classification and link prediction
4. Apply deep learning to tabular data using embeddings and architectures like TabNet that rival XGBoost
5. Process audio signals using spectrograms and implement speech recognition (wav2vec) and text-to-speech systems

---

## **18.1 Time Series Forecasting**

Beyond ARIMA: Deep learning for temporal prediction with multiple covariates.

#### **18.1.1 DeepAR (Probabilistic Forecasting)**

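Amazon's DeepAR treats forecasting as a sequence-to-sequence problem, modeling the full conditional distribution $P(z_{i,t_0:T} \mid z_{i,1:t_0-1}, \mathbf{x}_{i,1:T})$, where $z_{i,t}$ is the value of series $i$ at time $t$, $t_0$ is the start of the forecast window, and $\mathbf{x}_{i,t}$ are covariates known over the whole range.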

**Key Innovations:**
- **Autoregressive Recurrent Network:** LSTM generates hidden state $h_{i,t}$, parameters of likelihood (e.g., Gaussian) depend on $h_{i,t}$.
- **Global Model:** Single model trained on thousands of related time series (e.g., sales of all products), enabling transfer learning.
- **Covariates:** Time-varying features (price, promotion) and static features (category, location).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DeepAR(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers=2, likelihood="gaussian"):
        super().__init__()
        if likelihood != "gaussian":
            raise ValueError(f"Unsupported likelihood: {likelihood}")
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.likelihood = likelihood

        # Output layers for the distribution parameters
        self.mu_proj = nn.Linear(hidden_size, 1)
        self.sigma_proj = nn.Linear(hidden_size, 1)

    def forward(self, x):
        # x: (batch, seq_len, features), assumed to hold the lagged target and covariates
        lstm_out, _ = self.lstm(x)
        mu = self.mu_proj(lstm_out)
        sigma = F.softplus(self.sigma_proj(lstm_out)) + 1e-6  # Ensure positive
        return mu, sigma

    def loss(self, x, target):
        # target: (batch, seq_len, 1)
        mu, sigma = self.forward(x)
        # Gaussian negative log-likelihood (the constant 0.5 * log(2*pi) is dropped)
        nll = torch.log(sigma) + 0.5 * ((target - mu) / sigma) ** 2
        return nll.mean()

    @torch.no_grad()
    def sample(self, x, num_samples=100):
        # Monte Carlo sampling for prediction intervals.
        # This draws one-step-ahead samples at every position given the observed inputs.
        mu, sigma = self.forward(x)
        dist = torch.distributions.Normal(mu, sigma)
        return dist.sample((num_samples,))  # (num_samples, batch, seq, 1)
```
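Multi-step forecasts need the autoregressive part of DeepAR: each sampled value is fed back as the next input. The following is a minimal sketch of that rollout under stated assumptions. `TinyDeepAR`, `step`, and `sample_paths` are hypothetical names for illustration, the likelihood is Gaussian, and future covariates are assumed known.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyDeepAR(nn.Module):
    """Hypothetical single-layer model; the input is [previous target, covariates]."""

    def __init__(self, num_covariates, hidden_size=16):
        super().__init__()
        self.lstm = nn.LSTM(1 + num_covariates, hidden_size, batch_first=True)
        self.mu_proj = nn.Linear(hidden_size, 1)
        self.sigma_proj = nn.Linear(hidden_size, 1)

    def step(self, z_prev, x_t, state=None):
        # z_prev: (batch, 1), x_t: (batch, num_covariates)
        inp = torch.cat([z_prev, x_t], dim=-1).unsqueeze(1)
        out, state = self.lstm(inp, state)
        h = out[:, -1]
        mu = self.mu_proj(h)
        sigma = F.softplus(self.sigma_proj(h)) + 1e-6
        return mu, sigma, state


@torch.no_grad()
def sample_paths(model, z_hist, x_hist, x_future, num_samples=50):
    # z_hist: (batch, T0), x_hist: (batch, T0, C), x_future: (batch, H, C)
    batch, t0 = z_hist.shape
    horizon = x_future.size(1)
    # Replicate every series so that each copy follows its own sample path
    z = z_hist.repeat_interleave(num_samples, dim=0)
    xh = x_hist.repeat_interleave(num_samples, dim=0)
    xf = x_future.repeat_interleave(num_samples, dim=0)

    # Conditioning range: feed the observed values to warm up the LSTM state
    state = None
    z_prev = z.new_zeros(batch * num_samples, 1)
    for t in range(t0):
        _, _, state = model.step(z_prev, xh[:, t], state)
        z_prev = z[:, t:t + 1]

    # Prediction range: sample, then feed the sample back as the next input
    paths = []
    for h in range(horizon):
        mu, sigma, state = model.step(z_prev, xf[:, h], state)
        z_prev = torch.normal(mu, sigma)
        paths.append(z_prev)
    return torch.cat(paths, dim=1).view(batch, num_samples, horizon)
```

Quantiles across the sample dimension, for example `torch.quantile(paths, torch.tensor([0.1, 0.5, 0.9]), dim=1)`, then give prediction intervals for each horizon step.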

#### **18.1.2 N-BEATS (Neural Basis Expansion Analysis)**

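Pure deep learning decomposition: express the forecast as a sum of basis functions.

$$y_{t+h} = \sum_{i=1}^{\text{stacks}} \sum_{j=1}^{\text{blocks}} g_{i,j}(h) \cdot f_{i,j}(y_{1:t})$$

Here $f_{i,j}$ produces the expansion coefficients of block $j$ in stack $i$, and $g_{i,j}(h)$ is that block's basis evaluated at horizon $h$. In the full architecture, each block sees the residual left by the blocks before it rather than the raw history $y_{1:t}$.

**Interpretable Version:** Separate stacks for trend (low-degree polynomial basis) and seasonality (Fourier basis).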

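```python
import math

import torch
import torch.nn as nn


class NBeatsBlock(nn.Module):
    def __init__(self, input_size, output_size, hidden_size, block_type="generic"):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(input_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
        )
        # The basis expansion below is one plausible sketch, not the reference implementation.
        # Normalized time grids for the backcast (input) and forecast (output) windows.
        grids = [torch.arange(n, dtype=torch.float32) / n for n in (input_size, output_size)]

        if block_type == "trend":
            # Polynomial basis: [1, t, t^2, ..., t^p]
            degree = 3
            bases = [torch.stack([t ** k for k in range(degree + 1)]) for t in grids]
        elif block_type == "seasonality":
            # Fourier basis: cos and sin for each harmonic
            num_harmonics = 10
            freqs = 2 * math.pi * torch.arange(1, num_harmonics + 1).unsqueeze(1)
            bases = [torch.cat([torch.cos(freqs * t), torch.sin(freqs * t)]) for t in grids]
        else:
            # Generic block: identity basis, so theta is the backcast/forecast itself
            bases = [torch.eye(len(t)) for t in grids]

        self.register_buffer("backcast_basis", bases[0])  # (n_basis, input_size)
        self.register_buffer("forecast_basis", bases[1])  # (n_basis, output_size)
        self.theta_b = nn.Linear(hidden_size, bases[0].size(0))
        self.theta_f = nn.Linear(hidden_size, bases[1].size(0))

    def forward(self, x):
        # x: (batch, input_size)
        h = self.fc(x)
        backcast = self.theta_b(h) @ self.backcast_basis  # (batch, input_size)
        forecast = self.theta_f(h) @ self.forecast_basis  # (batch, output_size)
        return backcast, forecast
```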
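The double sum in the forecast equation comes from how blocks are chained: each block subtracts its backcast from the input it received and adds its partial forecast to a running total. A minimal sketch of that residual stacking follows. `GenericBlock` and `NBeatsStack` are hypothetical names, and the blocks use plain linear heads rather than a trend or seasonality basis.

```python
import torch
import torch.nn as nn


class GenericBlock(nn.Module):
    """Hypothetical block with linear backcast and forecast heads."""

    def __init__(self, input_size, output_size, hidden_size=32):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(input_size, hidden_size), nn.ReLU())
        self.backcast_head = nn.Linear(hidden_size, input_size)
        self.forecast_head = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        h = self.fc(x)
        return self.backcast_head(h), self.forecast_head(h)


class NBeatsStack(nn.Module):
    def __init__(self, input_size, output_size, num_blocks=3):
        super().__init__()
        self.output_size = output_size
        self.blocks = nn.ModuleList(
            GenericBlock(input_size, output_size) for _ in range(num_blocks)
        )

    def forward(self, x):
        # x: (batch, input_size)
        residual = x
        forecast = x.new_zeros(x.size(0), self.output_size)
        for block in self.blocks:
            backcast, block_forecast = block(residual)
            residual = residual - backcast          # remove what this block explained
            forecast = forecast + block_forecast    # accumulate partial forecasts
        return forecast
```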

#### **18.1.3 Temporal Fusion Transformer (TFT)**

Interpretable multi-horizon forecasting combining several mechanisms:

- **Gating Mechanisms:** GLU (Gated Linear Units) to skip unused components
- **Variable Selection Networks:** Learn which covariates are relevant
- **Static Covariate Encoders:** Encode time-invariant features
- **Interpretable Multi-Head Attention:** Temporal attention showing which past time steps are important

**Use Case:** Demand forecasting where you need to explain *why* a spike is predicted (holiday? promotion?).
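The gating mechanism in the list above can be illustrated with a gated residual block: a GLU decides how much of a nonlinear transform to add on top of the skip connection. This is a simplified sketch, assuming no static context input and equal input and output widths. The class name and layer sizes are illustrative, not the reference TFT code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedResidualNetwork(nn.Module):
    def __init__(self, d_model, d_hidden, dropout=0.1):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_hidden)
        self.fc2 = nn.Linear(d_hidden, d_model)
        self.dropout = nn.Dropout(dropout)
        self.gate = nn.Linear(d_model, 2 * d_model)  # GLU halves the last dimension
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # x: (..., d_model)
        h = self.fc2(F.elu(self.fc1(x)))
        h = self.dropout(h)
        # GLU: first half of the projection times sigmoid of the second half.
        # When the gate saturates near zero, the block reduces to the skip path.
        gated = F.glu(self.gate(h), dim=-1)
        return self.norm(x + gated)
```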

---

## **18.2 Recommendation Systems**

Moving beyond matrix factorization to deep neural architectures.

#### **18.2.1 Two-Tower Architecture (Candidate Generation)**

Separates user and item encoders for efficient retrieval at scale (millions of items).

$$\text{score}(u, i) = \langle \text{UserTower}(u), \text{ItemTower}(i) \rangle$$

**Training:** Sampled softmax or batch softmax on in-batch negatives.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoTowerModel(nn.Module):
    def __init__(self, num_users, num_items, embedding_dim=64):
        super().__init__()
        self.user_tower = nn.Sequential(
            nn.Embedding(num_users, embedding_dim),
            nn.Linear(embedding_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 64)
        )
        
        self.item_tower = nn.Sequential(
            nn.Embedding(num_items, embedding_dim),
            nn.Linear(embedding_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 64)
        )
        
    def forward(self, user_ids, item_ids):
        user_emb = F.normalize(self.user_tower(user_ids), dim=1)
        item_emb = F.normalize(self.item_tower(item_ids), dim=1)
        # Dot product of unit vectors, i.e. cosine similarity
        scores = torch.sum(user_emb * item_emb, dim=1)
        return scores
    
    def get_item_embeddings(self, item_ids):
        return F.normalize(self.item_tower(item_ids), dim=1)
    
    @torch.no_grad()
    def retrieve_candidates(self, user_id, item_index, k=100):
        # user_id: single user
        # item_index: FAISS inner-product index over the normalized item embeddings
        # Normalize the query too, so retrieval scores match forward()
        user_emb = F.normalize(self.user_tower(torch.tensor([user_id])), dim=1)
        D, I = item_index.search(user_emb.cpu().numpy(), k)
        return I  # Top-k item IDs
```
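The in-batch softmax mentioned under **Training** treats every other item in the batch as a negative for a given user. A minimal sketch, assuming each row of `user_emb` is paired with the same row of `item_emb`; the `temperature` value is an illustrative choice, not a recommendation from the source.

```python
import torch
import torch.nn.functional as F


def in_batch_softmax_loss(user_emb, item_emb, temperature=0.05):
    # user_emb, item_emb: (batch, dim); row i of each forms a positive pair
    user_emb = F.normalize(user_emb, dim=1)
    item_emb = F.normalize(item_emb, dim=1)
    # logits[i, j] = similarity between user i and item j
    logits = user_emb @ item_emb.T / temperature
    # The positive for user i sits on the diagonal; all other columns are negatives
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)
```

One caveat of this scheme is that popular items appear as negatives more often than rare ones, which is why production systems often add a sampling-bias correction to the logits.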

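**Serving:**
1. Pre-compute all item embeddings and index them in FAISS (a GPU or IVF index for millions of items)
2. User query → encode user → FAISS search (typically milliseconds or less with an approximate index)
3. Re-rank the retrieved candidates (e.g., the top-1000) with a heavier ranking model such as DeepFM (Section 18.2.2)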

#### **18.2.2 DeepFM (Factorization Machines + DNN)**

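Combines low-order feature interactions (FM) with high-order interactions (DNN); both components share the same feature embeddings. The pairwise (second-order) FM term is:

$$y_{FM} = \sum_{i=1}^n \sum_{j=i+1}^n \langle v_i, v_j \rangle x_i x_j$$

where $v_i$ is the embedding of feature $i$ and $x_i$ is its value.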

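```python
import torch
import torch.nn as nn


class FactorizationMachine(nn.Module):
    """Second-order FM interaction term via the square-of-sum identity."""

    def forward(self, embeds):
        # embeds: (batch, num_fields, embed_dim)
        square_of_sum = embeds.sum(dim=1) ** 2
        sum_of_square = (embeds ** 2).sum(dim=1)
        return 0.5 * (square_of_sum - sum_of_square).sum(dim=1, keepdim=True)  # (batch, 1)


class DeepFM(nn.Module):
    def __init__(self, field_dims, embed_dim, mlp_dims):
        super().__init__()
        # One shared embedding table; each field gets its own index range via offsets
        self.embedding = nn.Embedding(sum(field_dims), embed_dim)
        offsets = torch.cumsum(torch.tensor([0] + list(field_dims[:-1])), dim=0)
        self.register_buffer("offsets", offsets)  # moves with the model across devices
        
        # FM component
        self.fm = FactorizationMachine()
        
        # Deep component
        layers = []
        input_dim = len(field_dims) * embed_dim
        for dim in mlp_dims:
            layers.extend([nn.Linear(input_dim, dim), nn.ReLU(), nn.Dropout(0.2)])
            input_dim = dim
        layers.append(nn.Linear(input_dim, 1))
        self.mlp = nn.Sequential(*layers)
        
    def forward(self, x):
        # x: (batch, num_fields) with categorical indices
        x = x + self.offsets.unsqueeze(0)
        embeds = self.embedding(x)  # (batch, num_fields, embed_dim)
        
        fm_out = self.fm(embeds)                              # (batch, 1)
        deep_out = self.mlp(embeds.view(embeds.size(0), -1))  # (batch, 1)
        
        # Note: the first-order (linear) FM term is omitted in this simplified version
        return torch.sigmoid(fm_out + deep_out)
```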
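The pairwise FM sum looks quadratic in the number of fields, but the identity $\sum_{i<j} \langle v_i, v_j \rangle = \tfrac{1}{2}\left[\left(\sum_i v_i\right)^2 - \sum_i v_i^2\right]$ (summed over embedding dimensions) makes it linear. The sketch below checks the two forms against each other, assuming one-hot categorical inputs so that each $x_i = 1$. Both function names are illustrative.

```python
import torch


def fm_pairwise_naive(embeds):
    # embeds: (batch, num_fields, embed_dim); explicit O(num_fields^2) double loop
    batch, num_fields, _ = embeds.shape
    out = embeds.new_zeros(batch)
    for i in range(num_fields):
        for j in range(i + 1, num_fields):
            out += (embeds[:, i] * embeds[:, j]).sum(dim=1)
    return out


def fm_pairwise_fast(embeds):
    # Same quantity in O(num_fields) using the square-of-sum identity
    square_of_sum = embeds.sum(dim=1) ** 2
    sum_of_square = (embeds ** 2).sum(dim=1)
    return 0.5 * (square_of_sum - sum_of_square).sum(dim=1)
```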

#### **18.2.3 Session-Based Recommendation (SASRec)**

Use Transformer to model user click sequences, predicting next item.

- **Self-attention** over past items to capture sequential patterns
- **Position embeddings** for order
- Similar to language modeling, but with item IDs as tokens
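The points above can be sketched with PyTorch's built-in Transformer encoder and a causal mask, so that position $t$ only attends to items at or before $t$. This is a simplified illustration, not the published SASRec architecture. `TinySASRec` and its hyperparameters are assumptions, and item ID 0 is reserved for padding.

```python
import torch
import torch.nn as nn


class TinySASRec(nn.Module):
    def __init__(self, num_items, max_len=50, d_model=32, num_heads=2, num_layers=2):
        super().__init__()
        self.item_emb = nn.Embedding(num_items + 1, d_model, padding_idx=0)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, num_heads, dim_feedforward=4 * d_model, dropout=0.1, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, item_seq):
        # item_seq: (batch, seq_len) of item IDs, with seq_len <= max_len
        seq_len = item_seq.size(1)
        positions = torch.arange(seq_len, device=item_seq.device)
        h = self.item_emb(item_seq) + self.pos_emb(positions)
        # True above the diagonal means "may not attend to future positions"
        causal_mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=item_seq.device), diagonal=1
        )
        h = self.encoder(h, mask=causal_mask)
        # Score every item as the next click by reusing the item embedding table
        return h @ self.item_emb.weight.T  # (batch, seq_len, num_items + 1)
```

Training would then apply a next-item loss, comparing the logits at position $t$ with the item clicked at position $t+1$.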

---

## **18.3 Graph Neural Networks (GNNs)**

Deep learning on non-Euclidean data: social networks, molecules, knowledge graphs.

#### **18.3.1 Message Passing Framework**

The fundamental GNN paradigm: nodes update their representations by aggregating messages from neighbors.

$$h_v^{(l+1)} = \text{UPDATE}^{(l)}\left(h_v^{(l)}, \text{AGGREGATE}^{(l)}\left(\{h_u^{(l)}, \forall u \in \mathcal{N}(v)\}\right)\right)$$

Where $\mathcal{N}(v)$ are neighbors of node $v$.
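As a concrete instance of this framework, the sketch below uses the mean of neighbor features as AGGREGATE and a linear layer plus ReLU as UPDATE, written in plain PyTorch. The names `mean_aggregate` and `MeanMessagePassing` are illustrative, and `edge_index` is assumed to be a `(2, E)` tensor whose columns are (source, target) pairs.

```python
import torch
import torch.nn as nn


def mean_aggregate(h, edge_index):
    # h: (N, F) node features; messages flow from src to dst
    src, dst = edge_index
    summed = torch.zeros_like(h).index_add_(0, dst, h[src])
    ones = torch.ones(dst.size(0), dtype=h.dtype)
    degree = torch.zeros(h.size(0), dtype=h.dtype).index_add_(0, dst, ones)
    # Nodes without incoming edges keep a zero message (clamp avoids division by zero)
    return summed / degree.clamp(min=1).unsqueeze(1)


class MeanMessagePassing(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.w_self = nn.Linear(in_dim, out_dim)
        self.w_neigh = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, h, edge_index):
        # UPDATE combines the node's own state with the aggregated neighbor message
        return torch.relu(self.w_self(h) + self.w_neigh(mean_aggregate(h, edge_index)))
```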

#### **18.3.2 Graph Convolutional Network (GCN)**

Spectral approach: normalized adjacency matrix multiplication.

$$H^{(l+1)} = \sigma\left(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} W^{(l)}\right)$$

Where $\tilde{A} = A + I$ (adjacency with added self-loops) and $\tilde{D}$ is the degree matrix of $\tilde{A}$.

**Implementation (PyTorch Geometric):**
```python
import torch.nn as nn
import torch.nn.functional as F
import torch_geometric.nn as gnn

class GCN(nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels):
        super().__init__()
        self.conv1 = gnn.GCNConv(in_channels, hidden_channels)
        self.conv2 = gnn.GCNConv(hidden_channels, out_channels)
        
    def forward(self, x, edge_index):
        # x: Node features (N, F)
        # edge_index: Graph connectivity (2, E) COO format
        x = self.conv1(x, edge_index).relu()
        x = F.dropout(x, training=self.training)
        x = self.conv2(x, edge_index)
        return x
```
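For intuition about what a GCN layer computes, the propagation rule can be written directly with a dense adjacency matrix. This is a small-graph sketch only; the function names are illustrative, and real implementations use sparse operations.

```python
import torch


def normalized_adjacency(adj):
    # adj: (N, N) dense adjacency matrix without self-loops
    a_tilde = adj + torch.eye(adj.size(0))   # A~ = A + I
    deg = a_tilde.sum(dim=1)                 # diagonal of D~ (always >= 1)
    d_inv_sqrt = deg.pow(-0.5)
    # D~^{-1/2} A~ D~^{-1/2}, computed by scaling rows and columns
    return d_inv_sqrt.unsqueeze(1) * a_tilde * d_inv_sqrt.unsqueeze(0)


def gcn_layer(h, adj, weight):
    # h: (N, F_in), weight: (F_in, F_out); ReLU plays the role of sigma
    return torch.relu(normalized_adjacency(adj) @ h @ weight)
```

For a two-node graph with a single edge, every entry of the normalized matrix is 0.5: each node averages its own features with its neighbor's.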

#### **18.3.3 Graph Attention Network (GAT)**

Attention over neighbors: learn which neighbors are important.

$$h_v' = \sigma\left(\sum_{u \in \mathcal{N}(v)} \alpha_{vu} W h_u\right)$$

Where $\alpha_{vu}$ are attention coefficients computed via softmax over learned attention scores.

**Multi-head attention:** $K$ independent attention heads whose outputs are concatenated (or averaged in the final layer).
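A single attention head can be sketched with a dense adjacency mask. The score for edge $(v, u)$ is $\text{LeakyReLU}(a^\top [W h_v \,\|\, W h_u])$, which splits into one term per endpoint. The class below is an illustrative dense version that assumes `adj` contains self-loops, so every row has at least one neighbor; it omits the output nonlinearity $\sigma$.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DenseGATLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        # The attention vector a is split into target-node and source-node halves
        self.a_dst = nn.Parameter(torch.randn(out_dim) * 0.1)
        self.a_src = nn.Parameter(torch.randn(out_dim) * 0.1)

    def forward(self, h, adj):
        # h: (N, in_dim); adj: (N, N) bool, adj[v, u] is True if u is a neighbor of v
        z = self.W(h)
        # e[v, u] = LeakyReLU(a_dst . z_v + a_src . z_u)
        e = F.leaky_relu((z @ self.a_dst).unsqueeze(1) + (z @ self.a_src).unsqueeze(0), 0.2)
        e = e.masked_fill(~adj, float("-inf"))   # non-neighbors get zero weight
        alpha = torch.softmax(e, dim=1)          # normalize over each node's neighbors
        return alpha @ z, alpha
```

Returning `alpha` alongside the output makes it easy to inspect which neighbors a node attends to.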

#### **18.3.4 GraphSAGE (Sample and Aggregate)**

Inductive learning: Generalize to unseen nodes/graphs.

**Sampling:** Fixed-size neighborhood sampling (computationally efficient for large graphs).

**Aggregation:** Mean, LSTM, or Pooling aggregator.

```python
import torch.nn as nn
import torch.nn.functional as F
import torch_geometric.nn as gnn


class GraphSAGE(nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels, num_layers=2):
        super().__init__()
        self.convs = nn.ModuleList()
        self.convs.append(gnn.SAGEConv(in_channels, hidden_channels))
        for _ in range(num_layers - 2):
            self.convs.append(gnn.SAGEConv(hidden_channels, hidden_channels))
        self.convs.append(gnn.SAGEConv(hidden_channels, out_channels))
        
    def forward(self, x, edge_index):
        for conv in self.convs[:-1]:
            x = conv(x, edge_index).relu()
            x = F.dropout(x, p=0.5, training=self.training)
        x = self.convs[-1](x, edge_index)
        return x
```
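The model above aggregates over whatever edges it is given; the fixed-size sampling step happens before that, when the mini-batch subgraph is built. A minimal pure-Python sketch of one-hop sampling follows. `sample_neighbors` is a hypothetical helper, and the fallback for isolated nodes (sampling the node itself) is an assumption made here for illustration.

```python
import random


def sample_neighbors(adj_list, nodes, num_samples, rng):
    """Return exactly `num_samples` neighbor IDs for every node in `nodes`."""
    sampled = {}
    for v in nodes:
        neighbors = adj_list.get(v, [])
        if not neighbors:
            # Isolated node: fall back to the node itself (assumption)
            sampled[v] = [v] * num_samples
        elif len(neighbors) >= num_samples:
            # Enough neighbors: sample without replacement
            sampled[v] = rng.sample(neighbors, num_samples)
        else:
            # Too few neighbors: sample with replacement to keep the size fixed
            sampled[v] = rng.choices(neighbors, k=num_samples)
    return sampled
```

Because every node ends up with the same number of sampled neighbors, the cost per batch is bounded regardless of how skewed the degree distribution is.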

#### **18.3.5 Knowledge Graph Embeddings (TransE, RotatE)**

Represent entities and relations in vector space for link prediction.

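**TransE:** $\mathbf{h} + \mathbf{r} \approx \mathbf{t}$ (head + relation ≈ tail); a triple is scored by the distance $\|\mathbf{h} + \mathbf{r} - \mathbf{t}\|$.

**RotatE:** Models each relation as a rotation in complex space: $\mathbf{t} = \mathbf{h} \circ \mathbf{r}$, where $\circ$ is the element-wise (Hadamard) product of complex vectors and every component of $\mathbf{r}$ has unit modulus, $|r_k| = 1$.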
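Both scoring functions fit in a few lines. The sketch below returns the negative distance, so a higher score means a more plausible triple. The function names and the choice of the L1 norm are assumptions for illustration; RotatE relations are parameterized by their phase angles.

```python
import torch


def transe_score(h, r, t, p=1):
    # h, r, t: (..., dim) real embeddings; the score is -||h + r - t||_p
    return -(h + r - t).norm(p=p, dim=-1)


def rotate_score(h, phase, t):
    # h, t: (..., dim) complex embeddings; phase: (..., dim) real rotation angles
    # torch.polar builds unit-modulus complex numbers exp(i * phase)
    r = torch.polar(torch.ones_like(phase), phase)
    return -(h * r - t).abs().sum(dim=-1)
```

Multiplying by a unit-modulus complex number changes only the angle of each component, which is what lets RotatE represent symmetric, antisymmetric, and inverse relations.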

---

## **18.4 Tabular Deep Learning**

Deep learning for structured data (tables), where XGBoost/LightGBM traditionally dominate.

#### **18.4.1 Entity Embeddings for Categoricals**

Map high-cardinality categorical variables (e.g., zip codes, product IDs) to dense vectors.

```python
import torch
import torch.nn as nn


class TabularNN(nn.Module):
    def __init__(self, categorical_dims, embedding_dims, numerical_dim, hidden_dims):
        super().__init__()
        
        # Embeddings for each categorical feature
        self.embeddings = nn.ModuleList([
            nn.Embedding(dim, emb_dim) 
            for dim, emb_dim in zip(categorical_dims, embedding_dims)
        ])
        
        total_emb_dim = sum(embedding_dims)
        input_dim = total_emb_dim + numerical_dim
        
        # MLP
        layers = []
        for hidden_dim in hidden_dims:
            layers.extend([
                nn.Linear(input_dim, hidden_dim),
                nn.BatchNorm1d(hidden_dim),
                nn.ReLU(),
                nn.Dropout(0.3)
            ])
            input_dim = hidden_dim
        layers.append(nn.Linear(input_dim, 1))
        
        self.mlp = nn.Sequential(*layers)
        
    def forward(self, x_categorical, x_numerical):
        # x_categorical: (batch, num_cat_features) with indices
        embeddings = [emb(x_categorical[:, i]) for i, emb in enumerate(self.embeddings)]
        x = torch.cat(embeddings + [x_numerical], dim=1)
        return self.mlp(x)
```

#### **18.4.2 TabNet**

Attention-based tabular learning with sequential decision steps (similar to additive models).

**Sparse Attention:** Selects which features to use at each decision step, providing interpretability (feature importance).

**Key Components:**
- **Feature Transformer:** Shared and step-specific layers
- **Attentive Transformer:** Feature selection via sparsemax
- **Mask:** Tracks which features have been used
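Sparsemax is what makes the attentive transformer's feature masks sparse: unlike softmax, it projects scores onto the probability simplex and can assign exactly zero weight. A minimal sketch of the standard sort-based algorithm, operating over the last dimension, is shown below; it is a standalone illustration, not code from a TabNet library.

```python
import torch


def sparsemax(z):
    # z: (..., n) scores; returns a sparse probability distribution over the last dim
    z_sorted, _ = torch.sort(z, dim=-1, descending=True)
    cumsum = z_sorted.cumsum(dim=-1)
    k = torch.arange(1, z.size(-1) + 1, device=z.device, dtype=z.dtype)
    # Support size: the largest k such that 1 + k * z_(k) > sum of the top-k scores
    support = (1 + k * z_sorted) > cumsum
    k_z = support.sum(dim=-1, keepdim=True)
    # Threshold tau chosen so that the clipped outputs sum to one
    tau = (cumsum.gather(-1, k_z - 1) - 1) / k_z.to(z.dtype)
    return torch.clamp(z - tau, min=0)
```

For scores `[2.0, 0.0, 0.0]`, sparsemax puts all mass on the first feature, whereas softmax would still give the other two a nonzero weight.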

#### **18.4.3 Regularization for Tabular**

Neural networks overfit tabular data easily, often by latching onto spurious correlations.

- **Mixup:** Interpolate between samples: $\tilde{x} = \lambda x_i + (1-\lambda)x_j$
- **CutMix:** For images, but variants exist for tabular
- **Dropout:** Higher rates than vision (0.5+)
- **Weight Decay:** Strong L2 regularization
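The Mixup formula above translates almost directly into code. The sketch below mixes numerical features and targets within a batch, assuming $\lambda \sim \text{Beta}(\alpha, \alpha)$ as in the common formulation. `mixup_batch` is an illustrative name, and categorical columns would need to be mixed at the embedding level instead.

```python
import torch


def mixup_batch(x, y, alpha=0.2):
    # x: (batch, num_features) numerical features; y: (batch, ...) float targets
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))          # random partner for every sample
    x_mixed = lam * x + (1 - lam) * x[perm]
    y_mixed = lam * y + (1 - lam) * y[perm]
    return x_mixed, y_mixed, lam
```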

---

## **18.5 Speech and Audio**

#### **18.5.1 Signal Processing Basics**

Audio is a 1D waveform (air pressure over time). Common time-frequency representations:

- **STFT (Short-Time Fourier Transform):** Spectrogram (time x frequency)
- **Mel-Spectrogram:** Compress frequency axis using Mel scale (human perception)
- **MFCCs:** Mel-Frequency Cepstral Coefficients (compact representation)

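```python
import torchaudio

# Load audio
waveform, sample_rate = torchaudio.load("audio.wav")

# Resample to 16 kHz so the window sizes below correspond to 25 ms / 10 ms
target_rate = 16000
if sample_rate != target_rate:
    waveform = torchaudio.functional.resample(waveform, sample_rate, target_rate)

# Mel spectrogram transform
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=target_rate,
    n_fft=400,        # 25 ms window at 16 kHz
    win_length=400,
    hop_length=160,   # 10 ms hop at 16 kHz
    n_mels=80
)

mel_spec = mel_transform(waveform)  # (channel, mel, time)
```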
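To see what the Mel scale does to the frequency axis, the conversion can be written out by hand. The sketch below uses the HTK-style formula $m = 2595 \log_{10}(1 + f/700)$, which is one common convention; the helper names are illustrative.

```python
import math


def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_mels, f_min, f_max):
    # Points equally spaced on the Mel axis, mapped back to Hz
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (n_mels - 1)
    return [mel_to_hz(m_min + i * step) for i in range(n_mels)]
```

Points that are equally spaced in Mel end up with growing gaps in Hz, so the filterbank spends more resolution on low frequencies.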

#### **18.5.2 Self-Supervised Speech (wav2vec 2.0)**

BERT for audio: Pretrain on unlabeled speech, fine-tune on small labeled data for ASR.

**Contrastive Task:** Mask spans of latent representations, contrast true quantized latent vs distractors.

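**Architecture:** CNN feature encoder → Transformer context network. A quantization module discretizes the CNN encoder outputs to provide the targets for the contrastive task during pretraining.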

```python
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# Inference: `waveform` is assumed to be a (channel, time) tensor sampled at 16 kHz
speech = waveform.mean(dim=0).numpy()  # Downmix to mono; the processor expects a 1D array
input_values = processor(speech, sampling_rate=16000, return_tensors="pt").input_values
with torch.no_grad():
    logits = model(input_values).logits  # (batch, frames, vocab)
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
```
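The final decode step relies on the CTC convention: the model emits one token per audio frame, and the transcript is obtained by merging repeated tokens and then dropping the blank symbol. A minimal sketch of that greedy collapse is below. It illustrates the idea rather than the library's internal code, and `blank_id=0` is an assumption.

```python
def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Merge consecutive repeats, then remove blanks."""
    collapsed = []
    previous = None
    for token in frame_ids:
        # Keep a token only when it starts a new run and is not the blank
        if token != previous and token != blank_id:
            collapsed.append(token)
        previous = token
    return collapsed
```

The blank is what allows genuinely repeated characters: in `[0, 1, 1, 0, 1, 2, 2, 0]` the two runs of `1` are separated by a blank, so the result is `[1, 1, 2]`.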

#### **18.5.3 Text-to-Speech (TTS)**

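**Tacotron 2:** Sequence-to-sequence model with attention that predicts Mel-spectrograms from text; a separate neural vocoder (WaveNet in the original paper, WaveGlow in NVIDIA's implementation) then generates the waveform.

**VITS:** End-to-end (no separate vocoder stage); combines a variational autoencoder with normalizing flows and adversarial training for high-quality synthesis.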

---

## **18.6 Workbook Labs**

### **Lab 1: Time Series Forecasting**
Predict electricity demand:
1. TFT implementation or DeepAR using PyTorch Forecasting
2. Multiple covariates (temperature, hour of day, day of week)
3. Quantile forecasting (10th, 50th, 90th percentiles)
4. Evaluation: MAPE, RMSE, Coverage of prediction intervals

**Deliverable:** Forecast with uncertainty bands and feature importance analysis.

### **Lab 2: Recommendation System**
MovieLens dataset:
1. Two-tower model for candidate generation
2. DeepFM for ranking (incorporating user features like age, occupation)
3. Evaluate: Recall@10, NDCG@10
4. A/B test simulation: CTR lift vs baseline (popular items)

**Deliverable:** End-to-end recommendation pipeline with FAISS retrieval.
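For the evaluation step, Recall@k and NDCG@k with binary relevance can be computed as in the sketch below. The function names are illustrative; `ranked` is the model's ordered list of item IDs for one user and `relevant` is the set of held-out items.

```python
import math


def recall_at_k(ranked, relevant, k):
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked, relevant, k):
    # DCG with binary gains: a hit at 0-based position p contributes 1 / log2(p + 2)
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, item in enumerate(ranked[:k])
        if item in relevant
    )
    # Ideal DCG: all relevant items (up to k) ranked first
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0
```

Both metrics are computed per user and then averaged over all test users.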

### **Lab 3: Graph Node Classification**
Cora citation network:
1. GCN implementation from scratch (message passing)
2. Compare with GraphSAGE (inductive) and GAT (attention)
3. Visualize embeddings with t-SNE (color by class)
4. Link prediction: Hide edges, predict missing citations

**Deliverable:** Node classifier > 80% accuracy with attention visualization.

### **Lab 4: Tabular Deep Learning**
Kaggle House Prices or similar:
1. TabularNN with embeddings vs XGBoost baseline
2. TabNet with interpretability (which features used for each prediction)
3. Ensembling: Average of XGBoost + Neural Net

**Deliverable:** Report showing neural net matches or beats XGBoost with proper tuning.

---

## **18.7 Common Pitfalls**

1. **Data Leakage in Time Series:** Building features from information that is unavailable at prediction time (e.g., a rolling mean whose window includes the target itself or later values). Always use time-based validation.

2. **Cold Start in RecSys:** New users/items have no embeddings. Solution: Content-based features, meta-learning, or average embeddings.

3. **Graph Oversmoothing:** Deep GNNs make all node embeddings similar. Solution: Residual connections, skip connections (JK-Net), or shallow networks.

4. **Categorical Cardinality:** Embedding 1M unique IDs inflates the embedding table and tends to overfit on rarely seen IDs. Solution: Hashing trick, shared embeddings, or entity resolution.

5. **Limited Receptive Field in Audio:** CNNs are biased toward local patterns; use dilated convolutions or Transformers to capture long-range dependencies in audio.
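Pitfall 1 is the easiest to demonstrate in code. The sketch below builds a rolling-mean feature that only uses past values and splits the data by time rather than at random. It assumes a pandas DataFrame already sorted by timestamp; the function and column names are illustrative.

```python
import pandas as pd


def add_lagged_rolling_mean(df, target_col, window):
    out = df.copy()
    # shift(1) drops the current row from the window, so the feature at time t
    # depends only on values from t-1 and earlier
    out[f"{target_col}_roll{window}"] = out[target_col].shift(1).rolling(window).mean()
    return out


def time_based_split(df, train_fraction=0.8):
    # No shuffling: every validation row comes strictly after every training row
    cut = int(len(df) * train_fraction)
    return df.iloc[:cut], df.iloc[cut:]
```

Without the `shift(1)`, the rolling window at time $t$ would include the target at $t$, and validation error would look better than it really is.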

---

## **18.8 Interview Questions**

**Q1:** Why might XGBoost outperform neural networks on tabular data, and when would you switch to deep learning?
*A: XGBoost naturally handles heterogeneous features (mix of categorical and numerical), is robust to outliers, requires less tuning, and excels when interactions are tree-like. Neural networks win with: (1) high-cardinality categorical variables (embeddings capture similarity), (2) unstructured inputs (text/images alongside tabular), (3) very large datasets where GBDT slows down, (4) need for end-to-end differentiability (embedding learning).*

**Q2:** Explain why GraphSAGE is considered inductive while GCN is usually described as transductive.
*A: GCN, as originally formulated, is trained transductively: it uses the normalized adjacency matrix of one fixed graph, so the full graph (including test nodes) is present during training. GraphSAGE is inductive: it learns aggregation functions over sampled neighborhoods (e.g., the mean of neighbor features) that generalize to unseen nodes and graphs. GraphSAGE can therefore embed new nodes without retraining; GCN cannot easily.*

**Q3:** How do two-tower models handle the massive scale of candidate retrieval (millions of items)?
*A: Because the score is a dot product or cosine similarity between independently computed embeddings, item tower embeddings can be pre-computed offline and indexed (FAISS, ScaNN). At serving time, only the user query needs a forward pass through the user tower, followed by an approximate nearest neighbor search in the pre-computed index (typically milliseconds or less). This decouples the heavy item computation from online serving.*

**Q4:** What is the difference between transductive and inductive link prediction in knowledge graphs?
*A: Transductive: All entities seen during training; test set contains unseen edges between known entities. Inductive: Test set contains entirely unseen entities (e.g., new drugs in drug discovery). Transductive methods (most KGE like TransE) fail at inductive settings because they learn entity-specific embeddings. Inductive methods use GNNs or textual descriptions to generalize to new entities.*

**Q5:** Why do we use Mel scale instead of linear frequency in audio processing?
*A: Human hearing is logarithmic and more sensitive to differences at lower frequencies (< 1kHz) than higher frequencies. The Mel scale approximates this non-linear perception, spacing frequencies linearly at low end and logarithmically at high end. This compresses the frequency representation to focus on perceptually relevant differences, improving model efficiency and often performance.*

---

## **18.9 Further Reading**

**Time Series:**
- "DeepAR: Probabilistic Forecasting with Autoregressive Recurrent Networks" (Salinas et al.)
- "Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting" (Lim et al.)

**Recommendation:**
- "Deep Learning for Recommender Systems" (Zhang et al., survey)
- "Self-Attentive Sequential Recommendation" (SASRec, Kang & McAuley)

**GNNs:**
- "Graph Neural Networks: A Review of Methods and Applications" (Zhou et al.)
- "How Powerful are Graph Neural Networks?" (GIN paper, Xu et al.)

**Tabular:**
- "TabNet: Attentive Interpretable Tabular Learning" (Arik & Pfister)
- "Neural Networks for Tabular Data: A Survey" (Borisov et al.)

---

## **18.10 Checkpoint Project: Multi-Modal Recommendation Engine**

Build a recommendation system combining multiple data types.

**Domain:** E-commerce (products with images, text descriptions, categorical attributes, user behavior sequences).

**Architecture:**
1. **Item Tower:**
   - Image: Pre-trained EfficientNet (frozen or fine-tuned)
   - Text: BERT embeddings of description
   - Tabular: Embeddings for category, brand, price bucket
   - Fusion: Concatenate → MLP → item embedding

2. **User Tower:**
   - Sequential: Transformer over past item IDs (SASRec style)
   - Context: Time of day, device, location
   - Fusion: MLP → user embedding

3. **Training:**
   - Batch softmax with in-batch negatives
   - Auxiliary losses: Predict category from image (multitask)

4. **Serving:**
   - FAISS index of item embeddings (IVF for 1M+ items)
   - Real-time user embedding computation
   - Re-ranking with DeepFM using rich features

**Deliverables:**
- `multimodal_recsys/` with training and serving code
- Evaluation: HitRate@10, MRR (Mean Reciprocal Rank)
- Ablation study: Show contribution of each modality (image vs text vs tabular)

**Success Criteria:**
- Cold-start handling (recommend new items with only image/description)
- Latency < 100ms for user embedding + retrieval + ranking
- Significant lift over text-only or image-only baselines

---

**End of Chapter 18**

This concludes **Phase 4: Specialized AI Domains**. You now possess expertise across computer vision, NLP, reinforcement learning, and domain-specific applications. **Phase 5: MLOps & Production Engineering** begins with Chapter 19, covering the infrastructure and practices required to deploy these models reliably at scale.

