In [1]:
import torch

## Scalars

We denote:
- scalars by _ordinary_ lowercase letters (e.g., $x$, $y$, and $z$).
- the space of all continuous _real-valued_ scalars by $\mathbb{R}$.

$x \in \mathbb{R}$ is a way to say that $x$ is a real-valued scalar.

Scalars are implemented as tensors that contain a single element.

In [2]:
x = torch.tensor(3.0)
y = torch.tensor(2.0)
x + y, x * y, x / y, x ** y

(tensor(5.), tensor(6.), tensor(1.5000), tensor(9.))

## Vectors

A vector is a fixed-length array of scalars.
- Scalars are the _elements_ of a vector.
- Vectors are denoted by _bold_ lowercase letters (e.g., $\textbf{x}$, $\textbf{y}$, and $\textbf{z}$).

Vectors are implemented as _first-order_ tensors. Said tensors can have arbitrary lengths, subject to memory limitations.


When training a model, each vector represents an instance, with its components as relevant quantities. For instance, when creating a model predicting loan default risk, an applicant is represented as a vector with quantities like income, employment length, and number of previous defaults.

By default, vectors are visualized vertically. In linear algebra, subscripts begin at $1$.

$$
\begin{split}
    \mathbf{x} = 
    \begin{bmatrix}
        x_{1}  \\ 
        \vdots \\
        x_{n}
    \end{bmatrix},
\end{split}
$$

An element of a vector is referred to using a subscript. For example, $x_2$ represents the second element of $\textbf{x}$.

In [3]:
x = torch.arange(3)
x

tensor([0, 1, 2])

In Python, vector indices begin at $0$, so `x[2]` is the third element.

In [4]:
x[2]

tensor(2)

$\textbf{x} \in \mathbb{R}^n$ indicates that the vector $\textbf{x}$ contains $n$ real-valued elements. Formally, $n$ is the _dimensionality_ of the vector.

The dimensionality of a vector corresponds to the length of its tensor, which is accessible via Python's built-in `len` function.

In [5]:
len(x)

3

The `.shape` attribute indicates a tensor's length along each axis. Tensors with one axis have shapes with just one element.

In [6]:
x.shape

torch.Size([3])

To prevent confusion:
- _order_ means the number of axes.
- _dimensionality_ means the number of elements.
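
The distinction can be checked directly in code. The following is a minimal sketch; the variable names `s` and `v` are chosen here for illustration, and `.ndim` is the tensor attribute that reports the number of axes.

```python
import torch

s = torch.tensor(3.0)   # a scalar: no axes
v = torch.arange(5)     # a vector: one axis holding five elements

print(s.ndim)    # 0 -> the order of a scalar
print(v.ndim)    # 1 -> the order of a vector
print(len(v))    # 5 -> the dimensionality of the vector
print(v.shape)   # torch.Size([5])
```

Here the vector has order $1$ but dimensionality $5$; the two numbers answer different questions.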

## Matrices

Scalars are 0th-order tensors, vectors are 1st-order tensors, and matrices are 2nd-order tensors.

Matrices are denoted by _bold capital_ letters (e.g., $\textbf{X}$, $\textbf{Y}$, and $\textbf{Z}$).

$\mathbf{A} \in \mathbb{R}^{m \times n}$ indicates that the matrix $\mathbf{A}$ contains $m \times n$ real-valued scalars.
- $m$ is the number of rows.
- $n$ is the number of columns.

Matrices are represented in code as tensors with two axes.

$$
\begin{split}\mathbf{A}=
    \begin{bmatrix} 
    a_{11} & a_{12} & \cdots & a_{1n} \\ 
    a_{21} & a_{22} & \cdots & a_{2n} \\ 
    \vdots & \vdots & \ddots & \vdots \\ 
    a_{m1} & a_{m2} & \cdots & a_{mn} \\ 
    \end{bmatrix}.
\end{split}
$$

We represent a matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ by a 2nd-order tensor with shape $(m, n)$.

We subscript both the row and column indices to refer to an individual element $a_{ij}$.

We can convert any tensor with $m \cdot n$ elements into an $m \times n$ matrix using `.reshape(m, n)`.

In [7]:
A = torch.arange(6).reshape(3, 2)
A

tensor([[0, 1],
        [2, 3],
        [4, 5]])

The _transpose_ of a matrix $\mathbf{A}$ is denoted $\mathbf{A}^{\top}$. Transposing swaps rows and columns, so the transpose of an $m \times n$ matrix is an $n \times m$ matrix.

Use `.T` to transpose any matrix.

In [8]:
A.T

tensor([[0, 2, 4],
        [1, 3, 5]])

Symmetric matrices are square matrices that are equal to their own transposes: $\mathbf{A} = \mathbf{A}^{\top}$.

In [9]:
A = torch.tensor([[1, 2, 3], [2, 0, 4], [3, 4, 5]])
A == A.T

tensor([[True, True, True],
        [True, True, True],
        [True, True, True]])

Typically, matrix rows represent individual instances and columns represent distinct attributes.
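
As a small illustration of this convention, the sketch below reuses the loan-applicant example from the Vectors section. The numbers and the names `applicants`, `first_applicant`, and `incomes` are made up for illustration.

```python
import torch

# Hypothetical data: each row is one applicant, each column one attribute
# (annual income, years employed, number of previous defaults).
applicants = torch.tensor([[52000.0, 4.0, 1.0],
                           [87000.0, 9.0, 0.0]])

first_applicant = applicants[0]   # one instance: a row of the matrix
incomes = applicants[:, 0]        # one attribute across all instances: a column

print(applicants.shape)        # torch.Size([2, 3])
print(first_applicant.shape)   # torch.Size([3])
print(incomes.shape)           # torch.Size([2])
```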

## Tensors

Tensors provide a generic way of describing extensions to $n^\text{th}\text{-}$order arrays.

The term _tensor_ is used in two different contexts:
- Mathematical tensors are generalizations of vectors and matrices that represent multi-dimensional data and obey strict transformation rules under coordinate changes.
- Software tensors are just multi-dimensional arrays optimized for numerical operations; they are not required to follow those transformation rules.

We denote general tensors by capital letters with a special font face ($\textsf{X}$, $\textsf{Y}$, and $\textsf{Z}$). Their indexing follows from that of a matrix ($x_{ijk}$ and $[\mathsf{X}]_{1, 2i-1, 3}$).


One example of the use of tensors is with images. Each image is represented by a 3rd-order tensor with the following axes: height, width, and color channel (RGB). Additionally, a collection of images is represented in code with a 4th-order tensor where each image is indexed along the first axis.
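
A minimal sketch of this layout follows. The tiny image size and the batch size of 8 are arbitrary choices for illustration, and the axis order (height, width, channel) follows the description above; many PyTorch vision utilities expect a channels-first layout instead.

```python
import torch

height, width, channels = 4, 5, 3   # hypothetical tiny RGB image

# One image: a 3rd-order tensor with axes (height, width, color channel).
image = torch.zeros(height, width, channels)

# A collection of 8 such images: stack them along a new first axis.
batch = torch.stack([image for _ in range(8)])

print(image.shape)      # torch.Size([4, 5, 3])
print(batch.shape)      # torch.Size([8, 4, 5, 3])
print(batch[0].shape)   # torch.Size([4, 5, 3]) -> indexing the first axis selects one image
```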

Just as a matrix is created by reshaping a vector, higher-order tensors are formed by adding more components to the shape.

In [10]:
torch.arange(24).reshape(2, 3, 4)

tensor([[[ 0,  1,  2,  3],
         [ 4,  5,  6,  7],
         [ 8,  9, 10, 11]],

        [[12, 13, 14, 15],
         [16, 17, 18, 19],
         [20, 21, 22, 23]]])

## Basic Properties of Tensor Arithmetic

Elementwise operations produce outputs that have the same shape as their operands.

In [11]:
A = torch.arange(6, dtype=torch.float32).reshape(2, 3)
B = A.clone()  # copy of A to B by allocating new memory

In [12]:
A

tensor([[0., 1., 2.],
        [3., 4., 5.]])

In [13]:
A+B

tensor([[ 0.,  2.,  4.],
        [ 6.,  8., 10.]])

The elementwise product of two matrices is called their Hadamard product (denoted $\odot$). For two matrices $\mathbf{A}, \mathbf{B} \in \mathbb{R}^{m \times n}$, the entry of the Hadamard product in row $i$ and column $j$ is $a_{ij} b_{ij}$:

$$
\begin{split}\mathbf{A} \odot \mathbf{B} =
\begin{bmatrix}
    a_{11}  b_{11} & a_{12}  b_{12} & \dots  & a_{1n}  b_{1n} \\
    a_{21}  b_{21} & a_{22}  b_{22} & \dots  & a_{2n}  b_{2n} \\
    \vdots & \vdots & \ddots & \vdots \\
    a_{m1}  b_{m1} & a_{m2}  b_{m2} & \dots  & a_{mn}  b_{mn}
\end{bmatrix}.\end{split}
$$

In [14]:
A * B

tensor([[ 0.,  1.,  4.],
        [ 9., 16., 25.]])

Adding or multiplying a tensor by a scalar produces the same shape as the tensor. Each element of the tensor is added to (or multiplied by) the scalar.

In [15]:
a = 2
X = torch.arange(24).reshape(2, 3, 4)
X

tensor([[[ 0,  1,  2,  3],
         [ 4,  5,  6,  7],
         [ 8,  9, 10, 11]],

        [[12, 13, 14, 15],
         [16, 17, 18, 19],
         [20, 21, 22, 23]]])

In [16]:
a + X

tensor([[[ 2,  3,  4,  5],
         [ 6,  7,  8,  9],
         [10, 11, 12, 13]],

        [[14, 15, 16, 17],
         [18, 19, 20, 21],
         [22, 23, 24, 25]]])

In [17]:
(a * X).shape

torch.Size([2, 3, 4])

## Reduction

Given a vector $\textbf{x}$ of length $n$, the sum of elements is $\displaystyle \sum_{i = 1}^n x_i$. This is implemented in code with the `.sum()` function.

In [18]:
x = torch.arange(3, dtype=torch.float32)
x

tensor([0., 1., 2.])

In [19]:
x.sum()

tensor(3.)

The sum of elements of an $m \times n$ matrix $\mathbf{A}$ is $\displaystyle \sum_{i = 1}^m \sum_{j = 1}^n a_{ij}$.

In [20]:
A

tensor([[0., 1., 2.],
        [3., 4., 5.]])

In [21]:
A.shape

torch.Size([2, 3])

In [22]:
A.sum()

tensor(15.)

By default, the sum function _reduces_ a tensor along all of its axes, producing a scalar.

We can sum over a single axis by passing the `axis` keyword argument to the sum function:
- `.sum(axis=0)` reduces the row axis, summing down each column (vertical sum).
- `.sum(axis=1)` reduces the column axis, summing across each row (horizontal sum).

In [23]:
A.shape

torch.Size([2, 3])

In [24]:
A.sum(axis=0) # vertical sum

tensor([3., 5., 7.])

In [25]:
A.sum(axis=0).shape

torch.Size([3])

In [26]:
A.sum(axis=1) # horizontal sum

tensor([ 3., 12.])

In [27]:
A.sum(axis=1).shape

torch.Size([2])

Reducing a matrix along both its rows and columns is the same as summing all of its elements.

In [28]:
A.sum(axis=[0, 1]) == A.sum()  # Same as A.sum()

tensor(True)

The _mean_ (also called the _average_) can be computed in two ways: by calling `.mean()`, or by dividing the sum by the total number of elements.

In [29]:
A.mean()

tensor(2.5000)

In [30]:
A.sum() / A.numel()

tensor(2.5000)

We can also invoke the mean function along a specific axis.

In [31]:
A.mean(axis=0)

tensor([1.5000, 2.5000, 3.5000])

In [32]:
A.sum(axis=0) / A.shape[0]

tensor([1.5000, 2.5000, 3.5000])

## Non-Reduction Sum

Keeping the number of axes unchanged when reducing is useful for broadcasting. Passing `keepdims=True` keeps the reduced axis in the result with length $1$.

In [33]:
A.sum(axis=1) # doesn't keep its axes

tensor([ 3., 12.])

In [34]:
A.sum(axis=1).shape

torch.Size([2])

In [35]:
sum_A = A.sum(axis=1, keepdims=True)
sum_A # keeps its axes

tensor([[ 3.],
        [12.]])

In [36]:
sum_A.shape

torch.Size([2, 1])

A use case of keeping axes unchanged in reduction is normalizing matrix rows to sum to $1$.

In [37]:
A / sum_A

tensor([[0.0000, 0.3333, 0.6667],
        [0.2500, 0.3333, 0.4167]])

We can also do a cumulative sum of elements along an axis using the `.cumsum(axis=n)` function.

In [38]:
A.cumsum(axis=0)

tensor([[0., 1., 2.],
        [3., 5., 7.]])

## Dot Products

Given two vectors $\textbf{x}, \textbf{y} \in \mathbb{R}^{d}$, the dot product $\textbf{x}^{\top}\textbf{y}$ is the sum over the products of the elements at the same position: $\displaystyle \textbf{x}^{\top}\textbf{y} = \sum_{i=1}^d x_i y_i$.

The dot product is also referred to as the inner product $\langle \mathbf{x}, \mathbf{y} \rangle$.

A dot product is implemented using the `torch.dot(x, y)` function.

In [39]:
y = torch.ones(3, dtype=torch.float32)

In [40]:
x

tensor([0., 1., 2.])

In [41]:
y

tensor([1., 1., 1.])

In [42]:
torch.dot(x, y)

tensor(3.)

Equivalently, the dot product of two vectors can be computed by multiplying them elementwise and summing the result.

In [43]:
torch.sum(x * y)

tensor(3.)

One use case of dot products is computing a _weighted average_.

Given a vector of values $\textbf{x} \in \mathbb{R}^n$ and a vector of weights $\textbf{w} \in \mathbb{R}^n$, the weighted sum of the values in $\textbf{x}$ according to the weights $\textbf{w}$ is $\textbf{x}^\top \textbf{w}$.

If the weights are nonnegative and sum to $1$ (that is, $\sum_{i = 1}^n w_i = 1$), the dot product expresses a _weighted average_.
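
The following sketch shows a weighted average computed with `torch.dot`. The values and weights are made-up numbers chosen so that the weights are nonnegative and sum to $1$.

```python
import torch

values = torch.tensor([10.0, 20.0, 30.0])   # illustrative data
weights = torch.tensor([0.2, 0.3, 0.5])     # nonnegative, sum to 1

# Weighted average: 10 * 0.2 + 20 * 0.3 + 30 * 0.5, which is 23 up to rounding.
weighted_avg = torch.dot(values, weights)

# With equal weights 1/n, the weighted average reduces to the ordinary mean.
uniform = torch.full((3,), 1.0 / 3.0)
plain_avg = torch.dot(values, uniform)

print(weighted_avg)
print(torch.isclose(plain_avg, values.mean()))   # tensor(True)
```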

## Matrix-Vector Products

The product between an $m \times n$ matrix $\textbf{A}$ and an $n\text{-}$dimensional vector $\textbf{x}$ is a column vector of length $m$ whose $i^\text{th}$ element is the dot product $\mathbf{a}_i^{\top}\textbf{x}$, where $\mathbf{a}_i^{\top}$ denotes the $i^\text{th}$ row of $\textbf{A}$.

$$
\begin{split}\mathbf{A}\mathbf{x}
= \begin{bmatrix}
\mathbf{a}^\top_{1} \\
\mathbf{a}^\top_{2} \\
\vdots \\
\mathbf{a}^\top_m \\
\end{bmatrix}\mathbf{x}
= \begin{bmatrix}
 \mathbf{a}^\top_{1} \mathbf{x}  \\
 \mathbf{a}^\top_{2} \mathbf{x} \\
\vdots\\
 \mathbf{a}^\top_{m} \mathbf{x}\\
\end{bmatrix}.\end{split}
$$

Matrix-vector products describe the key calculation involved in computing the outputs of each layer in a neural network given the outputs from the previous layer.

To implement a matrix-vector product in code, use the `torch.mv(A, x)` function.

Note that the size of $\textbf{A}$ along axis 1 (its number of columns) must equal the length of $\textbf{x}$.

In [44]:
A.shape[1] == x.shape[0]

True

In [45]:
torch.mv(A, x)

tensor([ 5., 14.])

Python also has a _convenience_ operator `@` that is able to execute both matrix-vector and matrix-matrix products.

In [46]:
A@x

tensor([ 5., 14.])

## Matrix-Matrix Multiplication

Given two matrices $\mathbf{A} \in \mathbb{R}^{n \times k}$ and $\mathbf{B} \in \mathbb{R}^{k \times m}$, the matrix product $\mathbf{C} \in \mathbb{R}^{n \times m}$ is computed as follows.

$$
\begin{split}\mathbf{C} = \mathbf{AB} = \begin{bmatrix}
\mathbf{a}^\top_{1} \\
\mathbf{a}^\top_{2} \\
\vdots \\
\mathbf{a}^\top_n \\
\end{bmatrix}
\begin{bmatrix}
 \mathbf{b}_{1} & \mathbf{b}_{2} & \cdots & \mathbf{b}_{m} \\
\end{bmatrix}
= \begin{bmatrix}
\mathbf{a}^\top_{1} \mathbf{b}_1 & \mathbf{a}^\top_{1}\mathbf{b}_2& \cdots & \mathbf{a}^\top_{1} \mathbf{b}_m \\
 \mathbf{a}^\top_{2}\mathbf{b}_1 & \mathbf{a}^\top_{2} \mathbf{b}_2 & \cdots & \mathbf{a}^\top_{2} \mathbf{b}_m \\
 \vdots & \vdots & \ddots &\vdots\\
\mathbf{a}^\top_{n} \mathbf{b}_1 & \mathbf{a}^\top_{n}\mathbf{b}_2& \cdots& \mathbf{a}^\top_{n} \mathbf{b}_m
\end{bmatrix}.\end{split}
$$

_Matrix-matrix multiplication_ $\mathbf{AB}$ can be thought of as performing the following:
1. $m$ matrix-vector products or $m \times n$ dot products.
2. Combining the results together to form an $n \times m$ matrix.
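
The column-by-column view can be sketched in code as follows. The matrices `P` and `Q` are illustrative stand-ins for $\mathbf{A}$ and $\mathbf{B}$ (named differently so they do not clash with tensors defined in the cells), and the result is compared against `torch.mm`.

```python
import torch

# Illustrative matrices: P is n x k = 2 x 3, Q is k x m = 3 x 4.
P = torch.arange(6, dtype=torch.float32).reshape(2, 3)
Q = torch.arange(12, dtype=torch.float32).reshape(3, 4)

# Step 1: one matrix-vector product per column of Q (m = 4 of them).
columns = [torch.mv(P, Q[:, j]) for j in range(Q.shape[1])]

# Step 2: combine the results side by side into an n x m = 2 x 4 matrix.
C_by_columns = torch.stack(columns, dim=1)

print(C_by_columns.shape)                            # torch.Size([2, 4])
print(torch.allclose(C_by_columns, torch.mm(P, Q)))  # True
```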

Matrix-matrix multiplication $\mathbf{AB}$ is implemented using the `torch.mm(A, B)` function.

In [47]:
B = torch.ones(3, 4)

In [48]:
torch.mm(A, B)

tensor([[ 3.,  3.,  3.,  3.],
        [12., 12., 12., 12.]])

In [49]:
A@B

tensor([[ 3.,  3.,  3.,  3.],
        [12., 12., 12., 12.]])

The term _matrix-matrix multiplication_ is often simplified to _matrix multiplication_. It should not be confused with the _Hadamard product_.

For two $n \times n$ matrices, a Hadamard product takes quadratic time, $O(n^2)$, while a matrix-matrix product computed with the standard algorithm takes cubic time, $O(n^3)$.

## Norms

Norms are one of the most useful operators in linear algebra.

The norm of a vector measures its length or the magnitude of its components. It doesn't refer to its dimensionality.

A norm is a function $\| \cdot \|$ that maps a vector to a scalar. It satisfies the following properties.
1. Given any vector $\textbf{x}$, if we scale all elements of the vector by a scalar $\alpha \in \mathbb{R}$, the norm scales accordingly.

$$
\|\alpha \mathbf{x}\| = |\alpha| \|\mathbf{x}\|.
$$

2. For any vector $\textbf{x}$ and $\textbf{y}$, norms satisfy the triangle inequality.

$$
\|\mathbf{x} + \mathbf{y}\| \leq \|\mathbf{x}\| + \|\mathbf{y}\|.
$$

3. The norm of a vector is nonnegative and it only vanishes if the vector is zero.

$$
\|\mathbf{x}\| > 0 \textrm{ for all } \mathbf{x} \neq 0.
$$



The $l_2$ norm measures the Euclidean length of a vector. It is expressed as:

$$
\|\mathbf{x}\|_2 = \sqrt{\sum_{i=1}^n x_i^2}.
$$

Calling `torch.norm(u)` computes the $l_2$ norm of the vector `u`.

In [50]:
u = torch.tensor([3.0, -4.0])
torch.norm(u)

tensor(5.)
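
The three norm properties listed above can be spot-checked numerically for the $l_2$ norm. This is only a sanity check on a couple of hand-picked vectors (`p`, `q`, and the scalar `alpha` are arbitrary choices), not a proof.

```python
import torch

p = torch.tensor([3.0, -4.0])
q = torch.tensor([1.0, 2.0])
alpha = -2.0

# 1. Scaling: ||alpha * p|| equals |alpha| * ||p||.
scaling_holds = torch.isclose(torch.norm(alpha * p), abs(alpha) * torch.norm(p))

# 2. Triangle inequality: ||p + q|| <= ||p|| + ||q||.
triangle_holds = torch.norm(p + q) <= torch.norm(p) + torch.norm(q)

# 3. Nonnegativity: positive for a nonzero vector, zero for the zero vector.
positive_holds = torch.norm(p) > 0
zero_norm = torch.norm(torch.zeros(2))

print(scaling_holds, triangle_holds, positive_holds, zero_norm)
```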

The $l_1$ norm corresponds to the Manhattan distance. It sums the absolute values of a vector's elements. It is expressed as:

$$
\|\mathbf{x}\|_1 = \sum_{i=1}^n \left|x_i \right|.
$$

It is less sensitive to outliers compared to the $l_2$ norm.

The $l_1$ norm is implemented using a combination of `.abs()` and `.sum()`.

In [51]:
torch.abs(u).sum()

tensor(7.)

Both $l_2$ and $l_1$ are special cases of the general $l_p$ norms:

$$
\|\mathbf{x}\|_p = \left(\sum_{i=1}^n \left|x_i \right|^p \right)^{1/p}.
$$
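
The general formula can be written out directly. The helper `lp_norm` below is a hypothetical function defined here for illustration, not part of PyTorch; it assumes $p \geq 1$ and a 1st-order input tensor.

```python
import torch

def lp_norm(x: torch.Tensor, p: float) -> torch.Tensor:
    """Compute (sum_i |x_i|^p)^(1/p) for a vector x, assuming p >= 1."""
    return x.abs().pow(p).sum().pow(1.0 / p)

u = torch.tensor([3.0, -4.0])

# p = 1 recovers the l1 norm, and p = 2 recovers the l2 norm.
print(torch.isclose(lp_norm(u, 1), u.abs().sum()))   # tensor(True)
print(torch.isclose(lp_norm(u, 2), torch.norm(u)))   # tensor(True)
```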

The _Frobenius norm_ is defined as the square root of the sum of the squares of a matrix's elements:

$$
\|\mathbf{X}\|_\textrm{F} = \sqrt{\sum_{i=1}^m \sum_{j=1}^n x_{ij}^2}.
$$

The Frobenius norm behaves as if it were an $l_2$ norm of a matrix-shaped vector.

Calling `torch.norm` on a matrix computes its Frobenius norm.

In [52]:
torch.norm(torch.ones((4, 9)))

tensor(6.)

Matrices can be seen both as collections of entries and as operators that transform vectors.

For example, we can measure how much longer the matrix-vector product $\textbf{Xv}$ can be compared to $\textbf{v}$. The largest such stretch factor is the spectral norm.

While $l_2$ and $l_1$ are common vector norms, _spectral_ and _Frobenius norms_ are common matrix norms.
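
To make the two matrix norms concrete, the sketch below compares them on a small diagonal matrix. It uses `torch.linalg.matrix_norm`, which does not appear elsewhere in this notebook: `ord=2` returns the largest singular value (the spectral norm) and `ord="fro"` returns the Frobenius norm. The matrix `X` and vector `v` are arbitrary examples.

```python
import torch

X = torch.tensor([[3.0, 0.0],
                  [0.0, 1.0]])

spectral = torch.linalg.matrix_norm(X, ord=2)        # largest stretch factor: 3
frobenius = torch.linalg.matrix_norm(X, ord="fro")   # sqrt(3^2 + 1^2) = sqrt(10)

# The vector along the first axis is stretched by exactly the spectral norm.
v = torch.tensor([1.0, 0.0])
stretch = torch.norm(X @ v) / torch.norm(v)

print(spectral, frobenius, stretch)
```

For this matrix the Frobenius norm is slightly larger than the spectral norm, because it accounts for every entry rather than only the direction of greatest stretch.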

In deep learning, we often solve optimization problems such as maximizing the probability assigned to observed data or minimizing prediction errors. The distances involved in these objectives are often expressed as norms.

## Exercises