# Implementation: Top-K Gating

**Goal**: Route each token to its top-K experts using a learned gating (router) layer.

In [None]:
import torch
import torch.nn.functional as F

# 1. Setup
num_experts = 4
dim = 16
input_token = torch.randn(1, dim)  # one token, shape (batch=1, dim)

# Router weights: a linear layer that produces one logit per expert
router = torch.nn.Linear(dim, num_experts)

# 2. Forward
logits = router(input_token)
scores = F.softmax(logits, dim=-1)

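# 3. Top-K Selection (K=2)
# Note: these are raw softmax scores, so the two selected weights generally sum to less than 1.
top_k_weights, top_k_indices = torch.topk(scores, k=2, dim=-1)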

print(f"Scores: {scores.detach().numpy()}")
print(f"Selected Experts: {top_k_indices.numpy()} with weights {top_k_weights.detach().numpy()}")

# Next step: run input_token through ONLY those 2 experts and combine
# their outputs, weighted by top_k_weights.
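
The dispatch step described in the last comment can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the `experts` list (one small linear layer per expert) and the renormalized `gate` weights are hypothetical additions that do not appear in the cell above. Real mixture-of-experts layers typically use MLP experts and batched dispatch.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)  # reproducible random weights for the demo

num_experts, dim, k = 4, 16, 2
input_token = torch.randn(1, dim)

router = torch.nn.Linear(dim, num_experts)
# Hypothetical experts: one linear layer each (assumption for illustration).
experts = torch.nn.ModuleList(
    [torch.nn.Linear(dim, dim) for _ in range(num_experts)]
)

scores = F.softmax(router(input_token), dim=-1)
top_k_weights, top_k_indices = torch.topk(scores, k, dim=-1)  # both (1, k)

# Renormalize the selected weights so they sum to 1.
# This is one common choice, assumed here; the original cell does not do it.
gate = top_k_weights / top_k_weights.sum(dim=-1, keepdim=True)

# Run the token through ONLY the selected experts and mix their outputs.
output = torch.zeros_like(input_token)
for slot in range(k):
    expert_id = top_k_indices[0, slot].item()
    output = output + gate[0, slot] * experts[expert_id](input_token)

print(f"Selected experts: {top_k_indices.tolist()}")
print(f"Output shape: {tuple(output.shape)}")
```

The Python loop keeps the logic easy to read for a single token. With many tokens, one would usually group tokens by expert and process each group in a single batched call.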

## Conclusion
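When the router is trained jointly with the experts, it can learn to send different kinds of tokens to different experts, so specialization emerges without hand-assigned roles. Labels such as "Code Expert" or "Math Expert" are illustrative only. In the demo above the router is randomly initialized and untrained, so its selections are arbitrary.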