In [65]:
import torch

Scalars are implemented as tensors that contain only one element.

In [66]:
x = torch.tensor(3.0)
y = torch.tensor(2.0)

x + y, x * y, x / y, x**y

(tensor(5.), tensor(6.), tensor(1.5000), tensor(9.))

#### Basic Properties of Tensor Arithmetic

In [67]:
A = torch.arange(6, dtype=torch.float32).reshape(2, 3)
B = A.clone()  # Assign a copy of A to B by allocating new memory
A, A + B

(tensor([[0., 1., 2.],
         [3., 4., 5.]]),
 tensor([[ 0.,  2.,  4.],
         [ 6.,  8., 10.]]))

The elementwise product of two matrices is called their Hadamard product.

In [68]:
A * B

tensor([[ 0.,  1.,  4.],
        [ 9., 16., 25.]])

In [69]:
a = 2
X = torch.arange(24).reshape(2, 3, 4)
a + X, (a * X).shape

(tensor([[[ 2,  3,  4,  5],
          [ 6,  7,  8,  9],
          [10, 11, 12, 13]],
 
         [[14, 15, 16, 17],
          [18, 19, 20, 21],
          [22, 23, 24, 25]]]),
 torch.Size([2, 3, 4]))

Often, we wish to calculate the sum of a tensor’s elements.

In [70]:
a = torch.arange(3, dtype=torch.float32)
a, a.sum()

(tensor([0., 1., 2.]), tensor(3.))

In [71]:
A, A.shape, A.sum()

(tensor([[0., 1., 2.],
         [3., 4., 5.]]),
 torch.Size([2, 3]),
 tensor(15.))

In [72]:
A.shape, A.sum(axis=0)

(torch.Size([2, 3]), tensor([3., 5., 7.]))

Reducing a matrix along both rows and columns via summation is equivalent to summing up all the elements of the matrix.

In [73]:
A.sum(axis=[0, 1]) == A.sum()  # Same as A.sum()

tensor(True)
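As a small sketch of how the choice of axis affects the result, the snippet below rebuilds the same 2×3 matrix and reduces it along each axis in turn. The variable names (`col_sums`, `row_sums`, `total_in_two_steps`, `mean_all`) are illustrative only and are not part of the cells above.

```python
import torch

A = torch.arange(6, dtype=torch.float32).reshape(2, 3)

# Summing along axis 0 collapses the rows, leaving one value per column.
col_sums = A.sum(axis=0)   # shape (3,), values 3, 5, 7

# Summing along axis 1 collapses the columns, leaving one value per row.
row_sums = A.sum(axis=1)   # shape (2,), values 3, 12

# Reducing one axis and then the remaining one gives the grand total.
total_in_two_steps = A.sum(axis=0).sum()

# The mean is the total divided by the number of elements.
mean_all = A.mean()
```

Either order of reduction gives the same total, which is why `A.sum(axis=[0, 1])` and `A.sum()` agree.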

#### Non-Reduction Sum

Sometimes it can be useful to keep the number of axes unchanged when invoking the function for calculating the sum or mean. This matters when we want to use the broadcast mechanism.

In [74]:
A

tensor([[0., 1., 2.],
        [3., 4., 5.]])

In [75]:
sum_A = A.sum(axis=1, keepdims=True)
sum_A, sum_A.shape

(tensor([[ 3.],
         [12.]]),
 torch.Size([2, 1]))

For instance, since sum_A keeps its two axes after summing each row, we can divide A by sum_A with broadcasting to create a matrix where each row sums to 1.

In [76]:
(A / sum_A)

tensor([[0.0000, 0.3333, 0.6667],
        [0.2500, 0.3333, 0.4167]])
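To see why keeping the axis matters here, the following sketch compares the two variants on the same 2×3 matrix. It assumes PyTorch's usual broadcasting rule (shapes are aligned from the trailing axis) and uses illustrative names such as `row_sums_flat` and `broadcast_failed` that are not part of the cells above.

```python
import torch

A = torch.arange(6, dtype=torch.float32).reshape(2, 3)

row_sums_flat = A.sum(axis=1)                 # shape (2,): the summed axis is dropped
row_sums_kept = A.sum(axis=1, keepdims=True)  # shape (2, 1): the summed axis stays with length 1

# Shapes (2, 3) and (2,) are aligned from the right, so 3 is compared with 2
# and the division cannot be broadcast.
try:
    A / row_sums_flat
    broadcast_failed = False
except RuntimeError:
    broadcast_failed = True

# Shapes (2, 3) and (2, 1) broadcast: each row is divided by its own sum.
normalized = A / row_sums_kept
```

With a square matrix the flat version would not raise an error but would silently divide each column, rather than each row, by the sums, which is arguably worse than a failure.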

If we want to calculate the cumulative sum of elements of A along some axis, say axis=0 (row by row), we can call the cumsum function. By design, this function does not reduce the input tensor along any axis.

In [77]:
A = torch.arange(1, 13).reshape(3, 4)
A, A.cumsum(axis=0), A.cumsum(axis=1)

(tensor([[ 1,  2,  3,  4],
         [ 5,  6,  7,  8],
         [ 9, 10, 11, 12]]),
 tensor([[ 1,  2,  3,  4],
         [ 6,  8, 10, 12],
         [15, 18, 21, 24]]),
 tensor([[ 1,  3,  6, 10],
         [ 5, 11, 18, 26],
         [ 9, 19, 30, 42]]))
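The relationship between `cumsum` and `sum` can be checked with a short sketch: the cumulative sum keeps the input shape, and its last slice along the chosen axis equals the ordinary reduction along that axis. The names `running_rows` and `running_cols` are illustrative.

```python
import torch

A = torch.arange(1, 13).reshape(3, 4)

running_rows = A.cumsum(axis=0)  # accumulate down the rows
running_cols = A.cumsum(axis=1)  # accumulate across the columns

# Unlike sum, cumsum leaves the shape unchanged.
same_shape = running_rows.shape == A.shape == running_cols.shape

# The final row (or column) of the running totals is the full sum along that axis.
last_row_is_sum = torch.equal(running_rows[-1], A.sum(axis=0))
last_col_is_sum = torch.equal(running_cols[:, -1], A.sum(axis=1))
```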

#### Dot Products

In [78]:
y = torch.arange(3)
x = torch.tensor([1, 1, 1])
x, y, torch.dot(x, y)

(tensor([1, 1, 1]), tensor([0, 1, 2]), tensor(3))

Equivalently, we can calculate the dot product of two vectors by performing an elementwise multiplication followed by a sum.

In [79]:
torch.sum(x * y)

tensor(3)

#### Matrix–Vector Products

We can visualize a matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$
in terms of its row vectors:

$$\mathbf{A}=
\begin{bmatrix}
\mathbf{a}^\top_{1} \\
\mathbf{a}^\top_{2} \\
\vdots \\
\mathbf{a}^\top_m \\
\end{bmatrix},$$

where each $\mathbf{a}^\top_{i} \in \mathbb{R}^n$
is a row vector representing the $i^\textrm{th}$ row 
of the matrix $\mathbf{A}$.

**The matrix–vector product $\mathbf{A}\mathbf{x}$
is simply a column vector of length $m$,
whose $i^\textrm{th}$ element is the dot product 
$\mathbf{a}^\top_i \mathbf{x}$:**

$$
\mathbf{A}\mathbf{x}
= \begin{bmatrix}
\mathbf{a}^\top_{1} \\
\mathbf{a}^\top_{2} \\
\vdots \\
\mathbf{a}^\top_m \\
\end{bmatrix}\mathbf{x}
= \begin{bmatrix}
 \mathbf{a}^\top_{1} \mathbf{x}  \\
 \mathbf{a}^\top_{2} \mathbf{x} \\
\vdots\\
 \mathbf{a}^\top_{m} \mathbf{x}\\
\end{bmatrix}.
$$
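The definition above can be traced directly in code by taking the dot product of each row with $\mathbf{x}$. This is only a sketch for building intuition, using a small integer matrix and a vector of ones chosen for illustration; the name `row_dots` is hypothetical.

```python
import torch

A = torch.arange(6).reshape(2, 3)  # rows are [0, 1, 2] and [3, 4, 5]
x = torch.tensor([1, 1, 1])

# One dot product per row of A, stacked into a single 1D tensor of length m = 2.
row_dots = torch.stack([torch.dot(row, x) for row in A])
```

Because `x` is all ones, each entry of `row_dots` is just the sum of the corresponding row: 3 and 12.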

We can think of multiplication with a matrix
$\mathbf{A}\in \mathbb{R}^{m \times n}$
as a transformation that maps vectors
from $\mathbb{R}^{n}$ to $\mathbb{R}^{m}$.
These transformations are remarkably useful.
For example, we can represent rotations
as multiplications by certain square matrices.
Matrix–vector products also describe 
the key calculation involved in computing
the outputs of each layer in a neural network
given the outputs from the previous layer.


To express a matrix–vector product in code, we use the mv function. Note that the column dimension of A (its length along axis 1) must be the same as the dimension of x (its length). Python has a convenience operator @ that can execute both matrix–vector and matrix–matrix products (depending on its arguments). Thus we can write A@x.

**In PyTorch, torch.mv and the @ operator treat vectors as 1D tensors, not 2D column vectors, so the result is a 1D tensor.**

In [80]:
A = torch.arange(6).reshape(2, 3)
A, x, A.shape, x.shape, torch.mv(A, x), A@x

(tensor([[0, 1, 2],
         [3, 4, 5]]),
 tensor([1, 1, 1]),
 torch.Size([2, 3]),
 torch.Size([3]),
 tensor([ 3, 12]),
 tensor([ 3, 12]))
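The rotation example mentioned above can be made concrete with a minimal sketch. It assumes the standard 2D rotation matrix for a counterclockwise angle `theta`; the names `R`, `v`, and `rotated` are illustrative.

```python
import math
import torch

theta = math.pi / 2  # rotate by 90 degrees counterclockwise

# Standard 2D rotation matrix for angle theta.
R = torch.tensor([[math.cos(theta), -math.sin(theta)],
                  [math.sin(theta),  math.cos(theta)]])

v = torch.tensor([1.0, 0.0])  # unit vector along the first axis
rotated = R @ v               # approximately [0, 1], up to floating-point error
```

A rotation changes the direction of a vector but not its length, so `torch.norm(rotated)` should match `torch.norm(v)` up to rounding.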

#### Matrix–Matrix Multiplication

In [83]:
B = torch.ones(3, 4, dtype=torch.int64)
torch.mm(A, B), A@B

(tensor([[ 3,  3,  3,  3],
         [12, 12, 12, 12]]),
 tensor([[ 3,  3,  3,  3],
         [12, 12, 12, 12]]))
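One way to read a matrix–matrix product is as a batch of matrix–vector products: column $j$ of $\mathbf{A}\mathbf{B}$ is $\mathbf{A}$ applied to column $j$ of $\mathbf{B}$. The sketch below rebuilds the product this way, reusing the same shapes as the cell above; `AB_by_columns` is an illustrative name.

```python
import torch

A = torch.arange(6).reshape(2, 3)
B = torch.ones(3, 4, dtype=torch.int64)

# Apply A to each column of B, then place the results side by side as columns.
columns = [torch.mv(A, B[:, j]) for j in range(B.shape[1])]
AB_by_columns = torch.stack(columns, dim=1)  # shape (2, 4)
```

The inner dimensions must agree (here 3 and 3), and the result takes its shape from the outer ones: a 2×3 matrix times a 3×4 matrix yields a 2×4 matrix.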

#### Norms
Some of the most useful operators in linear algebra are norms. Informally, the norm of a vector tells us how big it is.

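The $\ell_2$ norm of a vector is the square root of the sum of the squares of its elements, $\|\mathbf{x}\|_2 = \sqrt{\sum_{i=1}^n x_i^2}$. The method `norm` calculates the $\ell_2$ norm.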

In [84]:
u = torch.tensor([3.0, -4.0])
torch.norm(u)

tensor(5.)

**The $\ell_1$ norm** is also common 
and the associated measure is called the Manhattan distance. 
By definition, the $\ell_1$ norm sums 
the absolute values of a vector's elements:

$$\|\mathbf{x}\|_1 = \sum_{i=1}^n \left|x_i \right|.$$

In [85]:
torch.abs(u).sum()

tensor(7.)

Both the $\ell_2$ and $\ell_1$ norms are special cases
of the more general $\ell_p$ *norms*:

$$\|\mathbf{x}\|_p = \left(\sum_{i=1}^n \left|x_i \right|^p \right)^{1/p}.$$
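The general formula can be written out in a few lines. The helper `lp_norm` below is a hypothetical function introduced only for illustration, not a PyTorch API; it is compared against `torch.norm` with its `p` argument.

```python
import torch

def lp_norm(x, p):
    """Compute the l_p norm of a 1D tensor directly from the definition."""
    return x.abs().pow(p).sum().pow(1.0 / p)

u = torch.tensor([3.0, -4.0])

l1 = lp_norm(u, 1)  # |3| + |-4| = 7
l2 = lp_norm(u, 2)  # sqrt(9 + 16) = 5
l3 = lp_norm(u, 3)  # (27 + 64) ** (1/3)
```

Setting `p=1` and `p=2` recovers the two norms computed above for the same vector `u`.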

In the case of matrices, matters are more complicated. 
After all, matrices can be viewed both as collections of individual entries 
*and* as objects that operate on vectors and transform them into other vectors. 
For instance, we can ask by how much longer 
the matrix–vector product $\mathbf{X} \mathbf{v}$ 
could be relative to $\mathbf{v}$. 
This line of thought leads to what is called the *spectral* norm. 
For now, we introduce **the *Frobenius norm*, 
which is much easier to compute** and defined as
the square root of the sum of the squares 
of a matrix's elements:

$$\|\mathbf{X}\|_\textrm{F} = \sqrt{\sum_{i=1}^m \sum_{j=1}^n x_{ij}^2}.$$

In [86]:
torch.norm(torch.ones((4, 9)))

tensor(6.)
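The result above follows from the definition: a 4×9 matrix of ones has 36 entries, each contributing 1 to the sum of squares, and the square root of 36 is 6. As a sketch, the same value can be obtained by spelling out the formula or by flattening the matrix and taking the $\ell_2$ norm of the resulting vector; the variable names are illustrative.

```python
import torch

X = torch.ones((4, 9))

# Square every entry, add them all up, and take the square root.
frob_from_definition = torch.sqrt((X ** 2).sum())

# The Frobenius norm treats the matrix as one long vector of 36 entries.
frob_as_vector_norm = torch.norm(X.reshape(-1))
```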

*While we do not want to get too far ahead of ourselves, we can already plant some intuition about why these concepts are useful. In deep learning, we are often trying to solve optimization problems: maximize the probability assigned to observed data; maximize the revenue associated with a recommender model; minimize the distance between predictions and the ground truth observations; minimize the distance between representations of photos of the same person while maximizing the distance between representations of photos of different people. These distances, which constitute the objectives of deep learning algorithms, are often expressed as norms.*
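As a rough illustration of a distance expressed as a norm, the sketch below measures how far a set of predictions is from the ground truth by taking a norm of their difference. The numbers are made up for the example and do not come from any model.

```python
import torch

# Hypothetical predictions and ground truth values, chosen only for illustration.
predictions = torch.tensor([2.5, 0.0, 2.0])
targets = torch.tensor([3.0, -0.5, 2.0])

difference = predictions - targets           # elementwise errors: -0.5, 0.5, 0.0

l2_distance = torch.norm(difference)         # Euclidean distance, sqrt(0.5)
l1_distance = torch.abs(difference).sum()    # Manhattan distance, 1.0
```

Both distances are zero exactly when the predictions match the targets, which is what makes them natural quantities to minimize.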