# BatchNorm1d


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Mitchell-Mirano/sorix/blob/develop/docs/learn/layers/02-BatchNorm1d.ipynb)
[![Open in GitHub](https://img.shields.io/badge/Open%20in-GitHub-black?logo=github)](https://github.com/Mitchell-Mirano/sorix/blob/develop/docs/learn/layers/02-BatchNorm1d.ipynb)


The **BatchNorm1d** layer implements *batch normalization*, a technique designed to stabilize and accelerate the training of deep neural networks, originally motivated as a way to reduce internal covariate shift. It normalizes intermediate activations across the batch dimension and then applies a learnable affine transformation.

## Mathematical definition

Let $\mathbf{X} \in \mathbb{R}^{N \times d}$ be an input tensor representing a batch of $N$ samples, where each sample has $d$ features. Batch normalization operates **feature-wise**, normalizing each feature independently across the batch.

During training, the batch-wise mean and variance are computed as

$$
\boldsymbol{\mu}_B
=
\frac{1}{N}
\sum_{i=1}^{N}
\mathbf{x}_i
\;\in\;
\mathbb{R}^{1 \times d},
$$

$$
\boldsymbol{\sigma}_B^2
=
\frac{1}{N}
\sum_{i=1}^{N}
(\mathbf{x}_i - \boldsymbol{\mu}_B)^2
\;\in\;
\mathbb{R}^{1 \times d},
$$

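where $\mathbf{x}_i \in \mathbb{R}^{1 \times d}$ denotes the $i$-th sample in the batch and the square is taken element-wise. Because the sum is divided by $N$ rather than $N - 1$, this is the biased variance estimate.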
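The two reductions above can be sketched in plain NumPy. This is an illustration of the formulas, not the sorix implementation; the names `mu_B` and `var_B` and the random input are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 3))  # N = 8 samples, d = 3 features

# Reduce over the batch axis (axis=0); keepdims=True preserves the (1, d) shape
mu_B = X.mean(axis=0, keepdims=True)

# Element-wise squared deviations, averaged over N (biased estimate)
var_B = ((X - mu_B) ** 2).mean(axis=0, keepdims=True)

print(mu_B.shape, var_B.shape)
```

The explicit formula agrees with `np.var` at its default `ddof=0`, which also divides by $N$.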

## Normalization step

Each input sample is normalized using the batch statistics:

$$
\widehat{\mathbf{X}}
=
\frac{\mathbf{X} - \boldsymbol{\mu}_B}
{\sqrt{\boldsymbol{\sigma}_B^2 + \varepsilon}},
\quad
\widehat{\mathbf{X}} \in \mathbb{R}^{N \times d},
$$

where $\varepsilon > 0$ is a small constant introduced for numerical stability.
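As a minimal NumPy sketch of this step (assuming $\varepsilon = 10^{-5}$, a value chosen for illustration), the normalized features end up with mean close to 0 and variance slightly below 1, since the denominator includes $\varepsilon$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = 5.0 + 2.0 * rng.standard_normal((8, 3))  # features with mean near 5, std near 2

eps = 1e-5  # assumed value for illustration
mu_B = X.mean(axis=0, keepdims=True)
var_B = X.var(axis=0, keepdims=True)  # biased variance (ddof=0)

# Feature-wise standardization; (1, d) statistics broadcast over the N rows
X_hat = (X - mu_B) / np.sqrt(var_B + eps)

print(X_hat.mean(axis=0))  # close to 0 for every feature
print(X_hat.var(axis=0))   # equals var_B / (var_B + eps), just under 1
```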

## Learnable affine transformation

To preserve the representational capacity of the network, batch normalization introduces two learnable parameters:

- Scale parameter: $\boldsymbol{\gamma} \in \mathbb{R}^{1 \times d}$
- Shift parameter: $\boldsymbol{\beta} \in \mathbb{R}^{1 \times d}$

The final output of the layer is given by

$$
\mathbf{Y}
=
\boldsymbol{\gamma} \odot \widehat{\mathbf{X}}
+
\boldsymbol{\beta},
\quad
\mathbf{Y} \in \mathbb{R}^{N \times d},
$$

where $\odot$ denotes element-wise multiplication, with $\boldsymbol{\gamma}$ and $\boldsymbol{\beta}$ broadcast across the $N$ rows so that each feature (column) receives its own scale and shift.
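The broadcasting can be illustrated with a short NumPy sketch. The values of `gamma` and `beta` below are arbitrary choices for the example, and `eps = 1e-5` is an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 3))
eps = 1e-5  # assumed value for illustration

X_hat = (X - X.mean(axis=0, keepdims=True)) / np.sqrt(X.var(axis=0, keepdims=True) + eps)

gamma = np.array([[2.0, 0.5, 1.0]])   # one scale per feature
beta = np.array([[1.0, -1.0, 0.0]])   # one shift per feature

# (1, d) parameters broadcast against the (N, d) normalized input
Y = gamma * X_hat + beta

print(Y.mean(axis=0))  # approximately beta
print(Y.std(axis=0))   # approximately |gamma|
```

After the affine step, each output feature has batch mean $\beta_j$ and batch standard deviation close to $|\gamma_j|$, which is how the layer can undo the normalization if that is what training favors.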

## Running statistics and inference mode

In addition to batch statistics, BatchNorm1d maintains *running estimates* of the mean and variance:

$$
\boldsymbol{\mu}_{\text{run}} \in \mathbb{R}^{1 \times d},
\quad
\boldsymbol{\sigma}^2_{\text{run}} \in \mathbb{R}^{1 \times d}.
$$

These statistics are updated during training using an exponential moving average:

$$
\boldsymbol{\mu}_{\text{run}}
\leftarrow
\alpha \boldsymbol{\mu}_{\text{run}}
+
(1 - \alpha)\boldsymbol{\mu}_B,
$$

$$
\boldsymbol{\sigma}^2_{\text{run}}
\leftarrow
\alpha \boldsymbol{\sigma}^2_{\text{run}}
+
(1 - \alpha)\boldsymbol{\sigma}^2_B,
$$

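where $\alpha \in (0,1)$ is the momentum parameter: values closer to 1 make the running estimates change more slowly. In this convention $\alpha$ weights the previous running value, so it corresponds to $1 - \text{momentum}$ in frameworks such as PyTorch, where the momentum weights the new batch statistic.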
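A small NumPy simulation shows how the running estimates drift toward the statistics of the data over many batches. The value `alpha = 0.9` and the initial values (zeros for the mean, ones for the variance) are assumptions for this sketch, as is the synthetic data distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.9  # assumed; weight on the previous running value
d = 3

running_mean = np.zeros((1, d))  # assumed initial values
running_var = np.ones((1, d))

for step in range(200):
    # Synthetic batch with true mean 3 and true variance 4
    batch = 3.0 + 2.0 * rng.standard_normal((64, d))
    mu_B = batch.mean(axis=0, keepdims=True)
    var_B = batch.var(axis=0, keepdims=True)

    # Exponential moving average of the batch statistics
    running_mean = alpha * running_mean + (1 - alpha) * mu_B
    running_var = alpha * running_var + (1 - alpha) * var_B

print(running_mean)  # near 3
print(running_var)   # near 4
```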

During inference (evaluation mode), normalization uses these accumulated running statistics instead of the batch statistics, so the output for a given sample is deterministic and does not depend on the other samples in the batch:

$$
\widehat{\mathbf{X}}
=
\frac{\mathbf{X} - \boldsymbol{\mu}_{\text{run}}}
{\sqrt{\boldsymbol{\sigma}^2_{\text{run}} + \varepsilon}}.
$$
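The practical consequence is that a sample is normalized the same way whether it is evaluated alone or inside a larger batch. A minimal NumPy sketch, in which the running statistics, `eps`, and the helper `normalize_eval` are illustrative assumptions:

```python
import numpy as np

eps = 1e-5  # assumed value for illustration
running_mean = np.array([[0.5, -1.0, 2.0]])  # illustrative running statistics
running_var = np.array([[4.0, 1.0, 0.25]])

def normalize_eval(X):
    """Normalize with fixed running statistics (hypothetical helper)."""
    return (X - running_mean) / np.sqrt(running_var + eps)

x = np.array([[1.0, 0.0, 2.5]])
others = np.array([[10.0, -3.0, 0.0], [7.0, 7.0, 7.0]])

alone = normalize_eval(x)
in_batch = normalize_eval(np.vstack([x, others]))[:1]

print(np.allclose(alone, in_batch))  # True
```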

## Functional view

The BatchNorm1d layer realizes the mapping

$$
\text{BatchNorm1d}:\;
\mathbb{R}^{N \times d}
\;\longrightarrow\;
\mathbb{R}^{N \times d},
$$

where normalization and affine reparameterization are applied independently to each feature across the batch.

## Parameterization and gradients

The learnable parameters $\boldsymbol{\gamma}$ and $\boldsymbol{\beta}$ are represented as `tensor` objects with `requires_grad=True`. Gradients are computed with respect to these parameters, as well as with respect to the input tensor $\mathbf{X}$, during backpropagation:

$$
\frac{\partial \mathcal{L}}{\partial \boldsymbol{\gamma}}, \quad
\frac{\partial \mathcal{L}}{\partial \boldsymbol{\beta}}, \quad
\frac{\partial \mathcal{L}}{\partial \mathbf{X}}.
$$

The running statistics are treated as buffers and do not participate in gradient computation.
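For reference, the training-mode gradients have a closed form that can be checked numerically. The sketch below is a plain NumPy illustration, not the sorix autograd code; the functions `forward` and `backward`, the test loss $\mathcal{L} = \sum \mathbf{W} \odot \mathbf{Y}$, and `eps = 1e-5` are all assumptions made for the example.

```python
import numpy as np

def forward(X, gamma, beta, eps=1e-5):
    mu = X.mean(axis=0, keepdims=True)
    var = X.var(axis=0, keepdims=True)
    inv_std = 1.0 / np.sqrt(var + eps)
    X_hat = (X - mu) * inv_std
    return gamma * X_hat + beta, (X_hat, inv_std)

def backward(G, gamma, cache):
    """G is dL/dY; returns dL/dX, dL/dgamma, dL/dbeta."""
    X_hat, inv_std = cache
    N = G.shape[0]
    d_beta = G.sum(axis=0, keepdims=True)
    d_gamma = (G * X_hat).sum(axis=0, keepdims=True)
    # The batch mean and variance also depend on X, hence the two correction terms
    d_X = (gamma * inv_std / N) * (N * G - d_beta - X_hat * d_gamma)
    return d_X, d_gamma, d_beta

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))
gamma = rng.standard_normal((1, 3))
beta = rng.standard_normal((1, 3))
W = rng.standard_normal((4, 3))  # loss = sum(W * Y), so dL/dY = W

Y, cache = forward(X, gamma, beta)
d_X, d_gamma, d_beta = backward(W, gamma, cache)

# Central finite differences for dL/dX
h = 1e-6
num_d_X = np.zeros_like(X)
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        Xp, Xm = X.copy(), X.copy()
        Xp[i, j] += h
        Xm[i, j] -= h
        Lp = np.sum(W * forward(Xp, gamma, beta)[0])
        Lm = np.sum(W * forward(Xm, gamma, beta)[0])
        num_d_X[i, j] = (Lp - Lm) / (2 * h)

print(np.max(np.abs(d_X - num_d_X)))  # small finite-difference error
```

Note that each column of `d_X` sums to zero: shifting every sample of a feature by the same constant leaves the normalized output unchanged.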

## Multi-device support

BatchNorm1d is device-aware and supports execution on CPU and GPU backends. Learnable parameters are stored as tensors on the selected device, while running statistics are maintained as NumPy or CuPy arrays and transferred consistently when changing devices via the `to(device)` method.

## Parameter interface

The trainable parameters of the layer are exposed through the `parameters()` method, which returns

$$
\{\boldsymbol{\gamma}, \boldsymbol{\beta}\}.
$$

## Statistical interpretation

From a statistical perspective, BatchNorm1d performs a feature-wise standardization of the input distribution, followed by a learned affine transformation. This can be interpreted as dynamically re-centering and re-scaling the feature space, which improves numerical conditioning and facilitates optimization in deep networks.


In [1]:
# Uncomment the next line and run this cell to install sorix
#!pip install 'sorix @ git+https://github.com/Mitchell-Mirano/sorix.git@develop'

In [2]:
from sorix import tensor
from sorix.nn import BatchNorm1d
import numpy as np

In [3]:
# number of samples and features
samples = 8
features = 3

# input tensor X ∈ ℝ^(samples × features)
X = tensor(np.random.randn(samples, features))
X


tensor([[-0.79076557, -0.09530421, -2.24122608],
        [ 0.48085172, -0.62549223, -2.1529319 ],
        [-0.13736248,  0.21993719,  0.82125192],
        [-0.33432386, -0.21491704,  0.07399757],
        [-0.3230639 ,  0.97823966,  0.14454357],
        [-0.24306372, -1.85875525,  0.32994193],
        [ 1.22507434,  0.33410779, -1.34611515],
        [-0.16913842,  0.05868427, -0.04777623]], dtype=sorix.float64)

In [4]:
bn = BatchNorm1d(features)

# γ ∈ ℝ^(1 × features), β ∈ ℝ^(1 × features)
print(bn.gamma)
print(bn.beta)


tensor([[1., 1., 1.]], requires_grad=True)
tensor([[0., 0., 0.]], requires_grad=True)


In [5]:
# forward pass (training mode)
Y = bn(X)
Y


tensor([[-1.30578473,  0.07087535, -1.52270086],
        [ 0.89556349, -0.61069615, -1.44309716],
        [-0.17465216,  0.47612697,  1.23834854],
        [-0.51561998, -0.08289027,  0.56464372],
        [-0.49612741,  1.45094598,  0.62824614],
        [-0.35763586, -2.19609021,  0.79539641],
        [ 2.18391742,  0.62289647, -0.71569245],
        [-0.22966077,  0.26883185,  0.45485567]], dtype=sorix.float64, requires_grad=True)

In [6]:
# running statistics after the forward pass
print(bn.running_mean)
print(bn.running_var)


tensor([[-0.0036474 , -0.01504375, -0.05522893]], dtype=sorix.float64)
tensor([[0.93336737, 0.96051034, 1.02302517]], dtype=sorix.float64)


In [7]:
# inference mode
bn.training = False
Y_eval = bn(X)
Y_eval


tensor([[-0.81472549, -0.0818933 , -2.16124653],
        [ 0.5014924 , -0.62286759, -2.07395204],
        [-0.13840499,  0.23976144,  0.86655703],
        [-0.34227458, -0.20393956,  0.12776335],
        [-0.33061969,  1.013491  ,  0.19751061],
        [-0.24781359, -1.88122041,  0.38080982],
        [ 1.27181782,  0.35625476, -1.27627035],
        [-0.17129544,  0.07522796,  0.00736832]], dtype=sorix.float64, requires_grad=True)
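
As a cross-check, the numbers printed by the cells above can be reproduced in plain NumPy from the formulas in this page. The constants `eps = 1e-5` and `alpha = 0.9`, and the initial running statistics (zeros for the mean, ones for the variance), are inferred from the printed outputs rather than read from the sorix source, so treat them as assumptions.

```python
import numpy as np

# X copied from the output of cell In [3]
X = np.array([
    [-0.79076557, -0.09530421, -2.24122608],
    [ 0.48085172, -0.62549223, -2.1529319 ],
    [-0.13736248,  0.21993719,  0.82125192],
    [-0.33432386, -0.21491704,  0.07399757],
    [-0.3230639 ,  0.97823966,  0.14454357],
    [-0.24306372, -1.85875525,  0.32994193],
    [ 1.22507434,  0.33410779, -1.34611515],
    [-0.16913842,  0.05868427, -0.04777623],
])

eps, alpha = 1e-5, 0.9  # inferred from the printed outputs (assumption)

mu_B = X.mean(axis=0, keepdims=True)
var_B = X.var(axis=0, keepdims=True)  # biased variance

# Training-mode output with gamma = 1 and beta = 0
Y = (X - mu_B) / np.sqrt(var_B + eps)

# One EMA update starting from running_mean = 0 and running_var = 1
running_mean = alpha * np.zeros((1, 3)) + (1 - alpha) * mu_B
running_var = alpha * np.ones((1, 3)) + (1 - alpha) * var_B

# Evaluation-mode output using the updated running statistics
Y_eval = (X - running_mean) / np.sqrt(running_var + eps)

print(running_mean)
print(running_var)
print(Y[0])
print(Y_eval[0])
```

Under these assumptions the printed values should agree with the outputs of cells In [5] to In [7] up to rounding.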