# 04 — Activation Functions Comparison

**Goal**: Compare all activation functions available in NANO-RUST and understand
when to use each one.

**Activations tested**:

| Activation | Method | MCU Cost | Best For |
|------------|--------|----------|----------|
| ReLU | `add_relu()` | Fastest (branchless) | Hidden layers |
| Sigmoid (fixed) | `add_sigmoid()` | LUT lookup | General use |
| Sigmoid (scaled) | `add_sigmoid_scaled(m, s)` | LUT + rescale | After calibration |
| Tanh (fixed) | `add_tanh()` | LUT lookup | Centered output |
| Tanh (scaled) | `add_tanh_scaled(m, s)` | LUT + rescale | After calibration |
| Softmax | `add_softmax()` | Approximation | Output layer (probabilities) |

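**Key insight**: Fixed sigmoid/tanh divide the i8 input by a hardcoded constant (16/32),
which may not match the actual scale of the values feeding the activation. Scaled
variants let you pass the exact `(mult, shift)` pair from calibration, so the
function is evaluated at the correct input and the output tracks the float model
more closely.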
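
To make the `(mult, shift)` idea concrete, the sketch below shows one common way to approximate a float scale factor with an integer multiplier and a right shift. It assumes the convention `scale ≈ mult / 2**shift`; the helper names `approx_mult_shift` and `apply_scale` and the 15-bit multiplier width are illustrative assumptions, not the formula used inside `compute_activation_scale_params`.

```python
def approx_mult_shift(scale, mult_bits=15):
    """Return (mult, shift) with scale ≈ mult / 2**shift.

    Hypothetical helper for illustration; assumes 0 < scale < 2**mult_bits.
    """
    if scale <= 0:
        raise ValueError("scale must be positive")
    shift = 0
    # Grow the shift while the multiplier still fits in mult_bits bits.
    while round(scale * (1 << (shift + 1))) < (1 << mult_bits) and shift < 31:
        shift += 1
    mult = round(scale * (1 << shift))
    return mult, shift


def apply_scale(value, mult, shift):
    """Integer-only rescale: (value * mult) >> shift, rounded to nearest."""
    rounding = (1 << (shift - 1)) if shift > 0 else 0
    return (value * mult + rounding) >> shift


mult, shift = approx_mult_shift(0.0123)
approx = mult / (1 << shift)
print(f"mult={mult}, shift={shift}, approx scale={approx:.7f}")
print(apply_scale(1000, mult, shift))  # integer result near 1000 * 0.0123
```

A larger multiplier width gives a closer approximation at the cost of wider intermediate products, which matters on small MCUs.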

**Prerequisites**: `pip install nano-rust-py numpy torch`

In [None]:
import sys
from pathlib import Path

import numpy as np
import torch
import torch.nn as nn

sys.path.insert(0, str(Path.cwd().parent / 'scripts'))
from nano_rust_utils import (
    quantize_to_i8, quantize_weights,
    calibrate_model, compute_activation_scale_params
)
import nano_rust_py

print('✅ All imports OK')

## Step 1: ReLU — The Default Choice

`ReLU(x) = max(0, x)` costs almost nothing on an MCU: one comparison per element.
Use it for all hidden layers unless you have a specific reason not to.
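
As a reference for what the quantized layer should produce, ReLU on i8 values can be mimicked in NumPy. This is only a model of the math, assuming `add_relu()` clamps negative i8 values to zero elementwise; the helper `relu_i8` is hypothetical and says nothing about how NANO-RUST implements the operation.

```python
import numpy as np


def relu_i8(values):
    """Elementwise max(0, x) on i8 values (reference model, not library code)."""
    x = np.asarray(values, dtype=np.int8)
    return np.maximum(x, 0).astype(np.int8)


relu_ref = relu_i8([-128, -5, 0, 7, 127])
print(relu_ref.tolist())
```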

In [None]:
# ReLU test: negative values become 0, positive values pass through
torch.manual_seed(42)
model_relu = nn.Sequential(nn.Linear(8, 8), nn.ReLU())
model_relu.eval()

q_weights = quantize_weights(model_relu)
test_input = torch.randn(1, 8)
q_input, scale = quantize_to_i8(test_input.numpy().flatten())
requant = calibrate_model(model_relu, test_input, q_weights, scale)

nano = nano_rust_py.PySequentialModel([8], 1024)
m, s, b = requant['0']
nano.add_dense_with_requant(q_weights['0']['weights'].flatten().tolist(), b, m, s)
nano.add_relu()

result = nano.forward(q_input.tolist())
print(f'ReLU output: {result}')
print(f'All non-negative: {all(v >= 0 for v in result)} ✅')

## Step 2: Sigmoid — For Binary Classification

`Sigmoid(x) = 1 / (1 + e^(-x))` → output in [0, 1].

In i8: maps to approximately [0, 127] (positive half of i8 range).

Two variants:
- **Fixed** (`add_sigmoid()`): treats the i8 input as the float value `input_i8 / 16`
- **Scaled** (`add_sigmoid_scaled(m, s)`): uses the `(mult, shift)` scale obtained from calibration
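
The effect of the fixed divisor can be seen in a float reference. The sketch below assumes the fixed variant evaluates `sigmoid(input_i8 / 16)` and maps the result onto `[0, 127]`; the helper `sigmoid_i8`, the rounding mode, and the value `true_scale = 0.02` are assumptions for illustration, and the library's lookup table may differ in detail.

```python
import numpy as np


def sigmoid_i8(q, input_scale=1.0 / 16.0):
    """Float reference: sigmoid(q * input_scale) mapped to [0, 127]."""
    x = np.asarray(q, dtype=np.float64) * input_scale
    y = 1.0 / (1.0 + np.exp(-x))
    return np.round(y * 127.0).astype(np.int8)


q = np.array([-64, -16, 0, 16, 64], dtype=np.int8)
fixed = sigmoid_i8(q)                    # every i8 step is assumed to be 1/16
true_scale = 0.02                        # hypothetical calibrated scale
calibrated = sigmoid_i8(q, true_scale)   # sigmoid evaluated at the real input
print("fixed:     ", fixed.tolist())
print("calibrated:", calibrated.tolist())
```

When the true scale is smaller than 1/16, the fixed variant pushes inputs further into the saturated tails of the curve than the float model would.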

In [None]:
# Sigmoid: fixed vs scaled comparison
torch.manual_seed(42)
model_sig = nn.Sequential(nn.Linear(8, 8), nn.Sigmoid())
model_sig.eval()

q_weights = quantize_weights(model_sig)
test_input = torch.randn(1, 8)
q_input, scale = quantize_to_i8(test_input.numpy().flatten())
requant = calibrate_model(model_sig, test_input, q_weights, scale)

# Fixed sigmoid (no calibration)
nano_fixed = nano_rust_py.PySequentialModel([8], 1024)
m, s, b = requant['0']
nano_fixed.add_dense_with_requant(q_weights['0']['weights'].flatten().tolist(), b, m, s)
nano_fixed.add_sigmoid()
fixed_out = nano_fixed.forward(q_input.tolist())

# Scaled sigmoid (with calibration)
# Compute scaled params from the intermediate activation scale
sig_m, sig_s = compute_activation_scale_params(scale * q_weights['0']['weight_scale'], 16.0)
nano_scaled = nano_rust_py.PySequentialModel([8], 1024)
nano_scaled.add_dense_with_requant(q_weights['0']['weights'].flatten().tolist(), b, m, s)
nano_scaled.add_sigmoid_scaled(sig_m, sig_s)
scaled_out = nano_scaled.forward(q_input.tolist())

# PyTorch reference
with torch.no_grad():
    ref = model_sig(test_input).numpy().flatten()
q_ref, _ = quantize_to_i8(ref)

print(f'PyTorch (i8):     {q_ref.tolist()}')
print(f'Fixed sigmoid:    {fixed_out}')
print(f'Scaled sigmoid:   {scaled_out}')
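# Widen to int32 first: subtracting i8 arrays directly can wrap around
q_ref32 = q_ref.astype(np.int32)
print(f'\nFixed diff:  {np.max(np.abs(q_ref32 - np.array(fixed_out, np.int32)))}')
print(f'Scaled diff: {np.max(np.abs(q_ref32 - np.array(scaled_out, np.int32)))}')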

## Step 3: Tanh — Centered Output

`Tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))` → output in [-1, 1].

In i8: maps to [-127, 127]. Useful when you need centered (zero-mean) outputs.
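
A float reference shows why the tanh output is centered. The sketch assumes the result is mapped onto `[-127, 127]` as described above; the helper `tanh_i8` is hypothetical, and the default divisor of 32 is an assumption for illustration rather than a confirmed library constant.

```python
import numpy as np


def tanh_i8(q, divisor=32.0):
    """Float reference: tanh(q / divisor) mapped to [-127, 127].

    divisor=32 is an assumed value, not a confirmed library constant.
    """
    x = np.asarray(q, dtype=np.float64) / divisor
    return np.round(np.tanh(x) * 127.0).astype(np.int8)


q_sym = np.array([-96, -32, 0, 32, 96], dtype=np.int8)
tanh_ref = tanh_i8(q_sym)
print(tanh_ref.tolist())
# tanh is an odd function, so symmetric inputs give outputs that cancel out.
print("mean:", float(tanh_ref.mean()))
```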

In [None]:
# Reuses q_weights, q_input and the requant params (m, s, b) from the sigmoid cell
nano_tanh = nano_rust_py.PySequentialModel([8], 1024)
nano_tanh.add_dense_with_requant(
    q_weights['0']['weights'].flatten().tolist(), b, m, s
)
nano_tanh.add_tanh()
tanh_out = nano_tanh.forward(q_input.tolist())
print(f'Tanh output: {tanh_out}')
print(f'Range: [{min(tanh_out)}, {max(tanh_out)}]')
print('Centered around 0 ✅' if abs(np.mean(tanh_out)) < 64 else 'Not well centered')

## Step 4: Softmax — Multi-class Output

`add_softmax()` is a pseudo-softmax approximation for i8. The outputs are relative
class scores (higher means more likely), not true probabilities, but they are
sufficient for `argmax`.
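
The claim that relative scores are enough for `argmax` can be checked with a toy integer-only scoring function. The `pseudo_softmax_i8` helper below is a hypothetical stand-in, not the approximation `add_softmax()` uses: it halves the score for every `step` units a logit falls below the maximum, imitating the exponential decay of softmax with shifts only.

```python
import numpy as np


def pseudo_softmax_i8(logits, step=8):
    """Toy integer scoring: 127 for the max logit, halved per `step` of gap."""
    x = np.asarray(logits, dtype=np.int32)
    gap = x.max() - x                      # distance below the largest logit
    shifts = np.minimum(gap // step, 7)    # cap the shift so scores bottom out at 0
    return np.right_shift(127, shifts).astype(np.int8)


def softmax(logits):
    """Float softmax, used only as a reference."""
    x = np.asarray(logits, dtype=np.float64)
    e = np.exp(x - x.max())
    return e / e.sum()


logits = [-20, 5, 40, 12]
scores = pseudo_softmax_i8(logits)
same_class = int(np.argmax(scores)) == int(np.argmax(softmax(logits)))
print(scores.tolist())
print("same predicted class:", same_class)
```

The scores do not sum to a fixed total, so they cannot be read as probabilities, but any transform that decreases with the gap to the maximum keeps the winning class.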

In [None]:
# Reuses q_weights, q_input and the requant params (m, s, b) from the sigmoid cell
nano_sm = nano_rust_py.PySequentialModel([8], 1024)
nano_sm.add_dense_with_requant(
    q_weights['0']['weights'].flatten().tolist(), b, m, s
)
nano_sm.add_softmax()
sm_out = nano_sm.forward(q_input.tolist())
print(f'Softmax output: {sm_out}')
print(f'Predicted class: {np.argmax(sm_out)}')

## Summary

| Activation | Speed | Accuracy | Use Case |
|------------|-------|----------|----------|
| ReLU | ⚡⚡⚡ | Best | Default for hidden layers |
| Sigmoid (fixed) | ⚡⚡ | OK | Quick prototyping |
| Sigmoid (scaled) | ⚡⚡ | Best | After calibration |
| Tanh (fixed) | ⚡⚡ | OK | Centered features needed |
| Tanh (scaled) | ⚡⚡ | Best | After calibration |
| Softmax | ⚡ | N/A | Output layer only |