# Linear

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Mitchell-Mirano/sorix/blob/qa/docs/learn/layers/01-Linear.ipynb)
[![Open in GitHub](https://img.shields.io/badge/Open%20in-GitHub-black?logo=github)](https://github.com/Mitchell-Mirano/sorix/blob/qa/docs/learn/layers/01-Linear.ipynb)


The **Linear** layer implements an affine transformation between finite-dimensional real vector spaces and is a fundamental building block of deep learning architectures. It applies a linear map from the input feature space to the output representation space and optionally adds a bias term. The same transformation is applied independently to each sample in a batch.

## Mathematical definition

Let $\mathbf{X} \in \mathbb{R}^{N \times d}$ be an input tensor representing a batch of $N$ samples, where each sample is a vector in a $d$-dimensional feature space. The Linear layer defines the affine transformation

$$\mathbf{Y} = \mathbf{X}\mathbf{W} + \mathbf{b}$$

where the involved quantities have the following dimensions:

- $\mathbf{X} \in \mathbb{R}^{N \times d}$: input batch matrix
- $\mathbf{W} \in \mathbb{R}^{d \times m}$: weight matrix (trainable parameters)
- $\mathbf{b} \in \mathbb{R}^{1 \times m}$: bias vector associated with the output neurons
- $\mathbf{Y} \in \mathbb{R}^{N \times m}$: output tensor
- $m$: number of neurons, i.e., the dimensionality of the output space

From a dimensional analysis standpoint, the matrix product

$$
\mathbf{X}\mathbf{W} :
\mathbb{R}^{N \times d} \times \mathbb{R}^{d \times m}
\;\longrightarrow\;
\mathbb{R}^{N \times m}
$$

is well-defined. The bias term $\mathbf{b}$ is then broadcast across the $N$ rows of the resulting matrix, so that each component $b_j$ is added to every entry of the $j$-th output column. Explicitly,

$$
Y_{ij} = (\mathbf{X}\mathbf{W})_{ij} + b_j,
\quad
i = 1,\dots,N,\;
j = 1,\dots,m.
$$
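The broadcasting rule above can be checked numerically. The following is a minimal NumPy sketch, independent of sorix; the shapes and the random seed are arbitrary choices made for illustration. It compares the vectorized expression with the entry-by-entry formula.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, m = 4, 3, 2                     # batch size, input features, output neurons

X = rng.standard_normal((N, d))       # input batch
W = rng.standard_normal((d, m))       # weight matrix
b = rng.standard_normal((1, m))       # bias stored as a 1 x m row vector

# Vectorized form: b is broadcast over the N rows of X @ W.
Y = X @ W + b

# Entry-by-entry form: Y_ij = sum_k X_ik * W_kj + b_j.
Y_explicit = np.empty((N, m))
for i in range(N):
    for j in range(m):
        Y_explicit[i, j] = sum(X[i, k] * W[k, j] for k in range(d)) + b[0, j]

print(Y.shape)                        # (4, 2)
print(np.allclose(Y, Y_explicit))     # True
```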

## Interpretation as a linear mapping

At the level of individual samples, for each $i \in \{1, \dots, N\}$, the transformation can be written as

$$
\mathbf{y}_i = \mathbf{x}_i \mathbf{W} + \mathbf{b},
\quad
\mathbf{x}_i \in \mathbb{R}^{1 \times d},\;
\mathbf{y}_i \in \mathbb{R}^{1 \times m}.
$$

Thus, each component $y_{ij}$ of the output vector $\mathbf{y}_i$ is a linear combination of the input features of $\mathbf{x}_i$, with coefficients given by the $j$-th column of $\mathbf{W}$, followed by a translation in the output space determined by the bias vector $\mathbf{b}$.

## Functional view

The Linear layer realizes the mapping

$$
\text{Linear}:\;
\mathbb{R}^{N \times d}
\;\longrightarrow\;
\mathbb{R}^{N \times m},
$$

where the same affine transformation is applied independently to each sample in the batch. This operator forms the mathematical foundation upon which more complex nonlinear models are constructed when composed with activation and normalization layers.
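Sample-wise independence means that processing the batch at once, processing each row separately, or reordering the rows all yield the same per-sample outputs. A small NumPy sketch of this property follows. The helper `affine` is a hypothetical name used only here and is not part of sorix.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, m = 5, 3, 2
X = rng.standard_normal((N, d))
W = rng.standard_normal((d, m))
b = rng.standard_normal((1, m))

def affine(x):
    """Apply x @ W + b to an array whose last dimension is d."""
    return x @ W + b

# Whole batch in one call.
Y_batch = affine(X)

# One sample at a time, each as a 1 x d row, then stacked.
Y_rows = np.vstack([affine(X[i:i + 1]) for i in range(N)])

# Permuting the samples permutes the outputs in the same way.
perm = rng.permutation(N)
Y_perm = affine(X[perm])

print(np.allclose(Y_batch, Y_rows))        # True
print(np.allclose(Y_perm, Y_batch[perm]))  # True
```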

## Parameterization and gradients

The parameters $\mathbf{W}$ and $\mathbf{b}$ are represented as `tensor` objects with `requires_grad=True`, so their gradients are obtained by automatic differentiation. During backpropagation, the following gradients are computed:

$$
\frac{\partial \mathcal{L}}{\partial \mathbf{W}}, \quad
\frac{\partial \mathcal{L}}{\partial \mathbf{b}}, \quad
\frac{\partial \mathcal{L}}{\partial \mathbf{X}},
$$

where $\mathcal{L}$ denotes the global loss function of the model.
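For $\mathbf{Y} = \mathbf{X}\mathbf{W} + \mathbf{b}$, the chain rule gives the standard closed forms $\partial\mathcal{L}/\partial\mathbf{W} = \mathbf{X}^\top \, \partial\mathcal{L}/\partial\mathbf{Y}$, $\partial\mathcal{L}/\partial\mathbf{b} = \sum_i (\partial\mathcal{L}/\partial\mathbf{Y})_{i,:}$ and $\partial\mathcal{L}/\partial\mathbf{X} = (\partial\mathcal{L}/\partial\mathbf{Y})\,\mathbf{W}^\top$. The sketch below checks these formulas against central finite differences in plain NumPy. It does not use the sorix autograd engine, and the quadratic loss and the helper `numerical_grad` are assumptions chosen only for the check.

```python
import numpy as np

rng = np.random.default_rng(2)
N, d, m = 6, 4, 3
X = rng.standard_normal((N, d))
W = rng.standard_normal((d, m))
b = rng.standard_normal((1, m))

def loss():
    """Illustrative loss L = 0.5 * sum(Y**2), evaluated on the current X, W, b."""
    Y = X @ W + b
    return 0.5 * np.sum(Y ** 2)

# Analytical gradients.
Y = X @ W + b
dY = Y                                  # dL/dY for this particular loss
dW = X.T @ dY                           # shape (d, m)
db = dY.sum(axis=0, keepdims=True)      # shape (1, m): sum over the batch
dX = dY @ W.T                           # shape (N, d)

def numerical_grad(f, A, eps=1e-6):
    """Central finite differences of the scalar f() with respect to array A."""
    G = np.zeros_like(A)
    for idx in np.ndindex(A.shape):
        original = A[idx]
        A[idx] = original + eps
        f_plus = f()
        A[idx] = original - eps
        f_minus = f()
        A[idx] = original               # restore the entry
        G[idx] = (f_plus - f_minus) / (2 * eps)
    return G

print(np.allclose(dW, numerical_grad(loss, W), atol=1e-5))  # True
print(np.allclose(db, numerical_grad(loss, b), atol=1e-5))  # True
print(np.allclose(dX, numerical_grad(loss, X), atol=1e-5))  # True
```

Note that the bias gradient sums over the batch dimension because the same $\mathbf{b}$ is broadcast to every row.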

## Parameter initialization

The weight matrix $\mathbf{W}$ is initialized from a zero-mean normal distribution with a standard deviation determined by the chosen initialization scheme:

- **He initialization** (recommended for ReLU-like activations):
  $$
  \sigma = \sqrt{\frac{2}{d}}.
  $$

- **Xavier initialization** (suitable for symmetric activations such as $\tanh$):
  $$
  \sigma = \sqrt{\frac{2}{d + m}}.
  $$

Formally,
$$
W_{ij} \sim \mathcal{N}(0, \sigma^2).
$$

When present, the bias vector $\mathbf{b}$ is initialized to zero.
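The two schemes can be sketched in NumPy as follows. The function `init_weights` and its `scheme` argument are hypothetical names for illustration and do not describe how sorix selects the scheme. The layer sizes are arbitrary.

```python
import numpy as np

def init_weights(d, m, scheme="he", rng=None):
    """Return (W, b, sigma) with W ~ N(0, sigma^2) of shape (d, m) and b = 0."""
    rng = np.random.default_rng() if rng is None else rng
    if scheme == "he":
        sigma = np.sqrt(2.0 / d)
    elif scheme == "xavier":
        sigma = np.sqrt(2.0 / (d + m))
    else:
        raise ValueError(f"unknown scheme: {scheme!r}")
    W = rng.normal(loc=0.0, scale=sigma, size=(d, m))
    b = np.zeros((1, m))
    return W, b, sigma

rng = np.random.default_rng(3)
d, m = 512, 256
W_he, b_he, sigma_he = init_weights(d, m, "he", rng)
W_xa, b_xa, sigma_xa = init_weights(d, m, "xavier", rng)

print(sigma_he)                 # 0.0625, i.e. sqrt(2 / 512)
print(W_he.std())               # empirical value, close to sigma_he
print(sigma_xa, W_xa.std())     # target and empirical value for Xavier
```

With many entries, the empirical standard deviation of each matrix should lie close to its target $\sigma$.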

## Forward computation

Given an input tensor $\mathbf{X}$, the forward evaluation of the layer is performed through the matrix operation

$$
\text{Linear}(\mathbf{X}) = \mathbf{X}\mathbf{W} + \mathbf{b}.
$$

In the implementation, this computation is exposed via the `__call__` method, enabling a concise and functional syntax consistent with the rest of the framework.
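As an illustration of the `__call__` pattern, here is a minimal stand-in class written with NumPy. `TinyLinear` and its `forward` method are hypothetical names for this sketch and are not the sorix implementation.

```python
import numpy as np

class TinyLinear:
    """Illustrative stand-in for a linear layer (not the sorix class)."""

    def __init__(self, W, b=None):
        self.W = W          # shape (d, m)
        self.b = b          # shape (1, m), or None when the bias is disabled

    def forward(self, X):
        Y = X @ self.W
        if self.b is not None:
            Y = Y + self.b
        return Y

    def __call__(self, X):
        # Makes layer(X) equivalent to layer.forward(X).
        return self.forward(X)

W = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [1.0, 1.0]])
b = np.array([[0.5, -0.5]])
layer = TinyLinear(W, b)

X = np.array([[1.0, 2.0, 3.0]])
Y = layer(X)
print(Y)                    # [[4.5 6.5]]
```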

## Multi-device support

The Linear layer is device-aware. Parameters and computations may reside on either CPU or GPU, using NumPy or CuPy as the numerical backend, respectively. The `to(device)` method ensures consistent parameter transfer across devices while preserving the mathematical semantics of the transformation.
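One common way to support both backends is to select the array module at run time. The sketch below shows that pattern under stated assumptions: the helpers `get_backend` and `move` and the device strings `"cpu"` and `"gpu"` are hypothetical and may differ from what sorix uses internally. It falls back to NumPy when CuPy is not installed.

```python
import numpy as np

try:
    import cupy as cp        # optional GPU backend
except ImportError:
    cp = None

def get_backend(device):
    """Return the array module for a device string (hypothetical names)."""
    if device == "cpu":
        return np
    if device == "gpu":
        if cp is None:
            raise RuntimeError("CuPy is not installed; GPU backend unavailable")
        return cp
    raise ValueError(f"unknown device: {device!r}")

def move(array, device):
    """Copy an array to the requested device, leaving its values unchanged."""
    xp = get_backend(device)
    if xp is np:
        # cupy.asnumpy copies device arrays to host memory.
        return cp.asnumpy(array) if cp is not None else np.asarray(array)
    return xp.asarray(array)

# CPU usage; the same lines would run on the GPU with device = "gpu".
device = "cpu"
xp = get_backend(device)
W_cpu = move(np.ones((3, 2)), device)
X = xp.ones((2, 3))
Y = X @ W_cpu
print(type(W_cpu).__name__, Y.shape)   # ndarray (2, 2)
```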

## Parameter interface

The trainable parameters of the layer are exposed through the `parameters()` method, which returns the set

$$
\{\mathbf{W}, \mathbf{b}\},
$$

or only $\mathbf{W}$ when the bias term is disabled. This abstraction allows direct integration with gradient-based optimization algorithms.
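To illustrate why a flat collection of parameters is convenient, the following NumPy sketch runs plain gradient descent over a parameter list. The function `sgd_step`, the mean-squared-error loss, and the hand-written gradients are assumptions for this example. In sorix the gradients would come from automatic differentiation instead.

```python
import numpy as np

def sgd_step(params, grads, lr=0.1):
    """Update each parameter array in place: p <- p - lr * g."""
    for p, g in zip(params, grads):
        p -= lr * g

rng = np.random.default_rng(4)
N, d, m = 32, 3, 2
X = rng.standard_normal((N, d))
T = rng.standard_normal((N, m))            # arbitrary regression targets

W = rng.standard_normal((d, m)) * np.sqrt(2.0 / d)
b = np.zeros((1, m))
params = [W, b]                            # would be [W] with the bias disabled

def mse():
    return np.mean((X @ W + b - T) ** 2)

loss_before = mse()
for _ in range(50):
    dY = 2.0 * (X @ W + b - T) / (N * m)   # dL/dY for the mean squared error
    grads = [X.T @ dY, dY.sum(axis=0, keepdims=True)]
    sgd_step(params, grads, lr=0.1)
loss_after = mse()

print(loss_after < loss_before)            # True
```

The optimizer never needs to know which layer a parameter belongs to; it only iterates over the list.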

## Statistical interpretation

From a statistical perspective, the Linear layer can be interpreted as a multivariate linear regression model in which each output neuron is a linear combination of the input features plus an intercept. In this context, the $j$-th column of $\mathbf{W}$ together with the bias component $b_j$ defines a hyperplane $y_j = \mathbf{x}\mathbf{W}_{:,j} + b_j$ over the input space that approximates the relationship between the input variables and the $j$-th output variable.
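The regression view can be made concrete with ordinary least squares. This NumPy sketch generates synthetic data from known parameters and recovers them with `np.linalg.lstsq`. The noise level, sizes, and variable names are illustrative assumptions and unrelated to sorix.

```python
import numpy as np

rng = np.random.default_rng(5)
N, d, m = 200, 3, 2

# Ground-truth parameters of the affine map.
W_true = rng.standard_normal((d, m))
b_true = rng.standard_normal((1, m))

# Synthetic observations with a small amount of Gaussian noise.
X = rng.standard_normal((N, d))
Y = X @ W_true + b_true + 0.01 * rng.standard_normal((N, m))

# Append a column of ones so the intercept is estimated with the weights.
X_aug = np.hstack([X, np.ones((N, 1))])
theta, *_ = np.linalg.lstsq(X_aug, Y, rcond=None)
W_hat, b_hat = theta[:d], theta[d:]        # shapes (d, m) and (1, m)

print(np.allclose(W_hat, W_true, atol=0.01))   # True
print(np.allclose(b_hat, b_true, atol=0.01))   # True
```

Each column of `W_hat` with the matching entry of `b_hat` is the fitted hyperplane for one output variable.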


In [1]:
# Uncomment the next line and run this cell to install sorix
#!pip install 'sorix @ git+https://github.com/Mitchell-Mirano/sorix.git@qa'

In [2]:
from sorix import tensor
from sorix.nn import Linear
import numpy as np

In [3]:
# create random input data
samples = 10
features = 3
neurons = 2

# X ∈ ℝ^(samples × features)
X = tensor(np.random.randn(samples, features))
X

tensor([[-0.68245878, -1.51091108, -1.81919191],
        [-0.71944098, -0.29123973, -0.59063962],
        [-0.5165242 ,  1.26550573, -0.3165631 ],
        [-0.47727734, -0.8987275 ,  0.23155946],
        [-0.28909791, -0.80714485, -0.93510151],
        [ 0.77912457, -0.85347223, -1.664577  ],
        [ 0.63505287,  2.09173128,  0.52771808],
        [-0.50557509, -1.37366385,  0.7768541 ],
        [-0.42453686, -0.65460723, -1.43567044],
        [ 2.43236421, -0.539992  , -0.1724795 ]], dtype=sorix.float64)

In [4]:
# instantiate a Linear layer: ℝ^(samples × features) → ℝ^(samples × neurons)
linear = Linear(features, neurons)

# weight matrix W ∈ ℝ^(features × neurons)
print(linear.W)

# bias vector b ∈ ℝ^(1 × neurons)
print(linear.b)

tensor([[ 0.36082658, -0.7080932 ],
        [-1.5694878 ,  0.30932558],
        [ 0.11454388, -1.246289  ]], requires_grad=True)
tensor([[0., 0.]], requires_grad=True)


In [5]:
# forward pass:
# Y ∈ ℝ^(samples × neurons) = X @ W + b
Y = linear(X)
print(Y)


tensor([[ 1.91672996,  2.28311989],
        [ 0.12984963,  1.15545106],
        [-2.20883185,  1.1517297 ],
        [ 1.26485123, -0.22863256],
        [ 1.05537964,  1.12044447],
        [ 1.42997601,  1.25885041],
        [-2.99335591, -0.4603399 ],
        [ 2.06250762, -1.0350998 ],
        [ 0.70976662,  1.88738522],
        [ 1.70541605, -1.67441465]], dtype=sorix.float64, requires_grad=True)
