# Linear Algebra

In [1]:
import torch

## Scalars

Most everyday mathematics consists of manipulating numbers one at a time. Formally, we call these values **scalars**. We denote scalars by ordinary lowercase letters (e.g., x, y, and z) and the space of all (continuous) **real-valued scalars** by *R* (*the font difference doesn't come through in markdown*).

For now, remember that the expression x ∈ *R* is a formal way to say that x is a real-valued scalar. The symbol ∈ (pronounced "in") denotes membership in a set. 

In [2]:
x = torch.tensor(3.0)
y = torch.tensor(2.0)

x + y, x * y, x / y, x**y

(tensor(5.), tensor(6.), tensor(1.5000), tensor(9.))

## Vectors

For now, think of a **vector** as a **fixed-length array** of scalars. As with their code counterparts, we call these scalars the **elements** of the vector (*synonyms include entries and components*). Take a real-world example: if we were training a model to predict the risk of a loan defaulting, we might associate each applicant with a vector whose elements correspond to quantities like their income, length of employment, or number of previous defaults.

We denote vectors by bold lowercase letters (e.g., **x**, **y**, and **z**). Vectors are implemented as **1st-order tensors**. In general, such tensors can have arbitrary lengths, subject to memory limitations.

*A note of caution*. In Python, a vector's indices start at 0, also known as *zero-based indexing*, whereas in linear algebra subscripts begin at 1 (*one-based indexing*).

In [3]:
x = torch.arange(3)
x

tensor([0, 1, 2])

We can refer to an element of a vector by using a subscript. For example, x<sub>2</sub> denotes the second element of **x**. Since x<sub>2</sub> is a scalar, we don't bold it. In code, indexing is zero-based, so `x[2]` returns the element at index 2, which is the third element of the tensor. By default, we visualize vectors by stacking their elements vertically.

In [4]:
x[2]

tensor(2)
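To make the off-by-one relationship between subscripts and indices concrete, here is a minimal sketch. The vector `v` and its values are made up for illustration and are not part of the original notebook.

```python
import torch

# Hypothetical example vector; the values are chosen only for illustration.
v = torch.tensor([10, 20, 30])

first = v[0]    # mathematical v_1 (first element)
second = v[1]   # mathematical v_2 (second element)
third = v[2]    # mathematical v_3 (third element)
last = v[-1]    # negative indices count from the end

print(first, second, third, last)
```

In other words, the mathematical element with subscript *i* lives at index *i - 1* in code.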

When a vector is written as a vertical stack, x<sub>1</sub>, ..., x<sub>n</sub> are its elements. Later on, we will distinguish between such **column vectors** and **row vectors**, whose elements are stacked horizontally. To indicate that a vector contains *n* elements, we write **x** ∈ R<sup>n</sup>. Formally, we call *n* the **dimensionality** of the vector.

In code, this corresponds to the tensor's length, which we can access using Python's built-in `len` function.

In [5]:
len(x)

3

We can also access the length via the `shape` attribute. The shape is a tuple that indicates a tensor's length along each axis. Tensors with just one axis have shapes with just one element.

In [6]:
x.shape

torch.Size([3])

Oftentimes, the word "dimension" gets overloaded to mean both the number of axes and the length along a particular axis. To avoid this confusion, it's best to use **order** to refer to the number of axes and **dimensionality** exclusively to refer to the number of components. 
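The distinction between order and dimensionality can be checked directly in code. The sketch below uses two small hypothetical tensors, `v` and `M`, and relies on the `ndim` attribute for the number of axes.

```python
import torch

# Hypothetical tensors used only to contrast the two notions.
v = torch.arange(3)                  # vector: one axis, three components
M = torch.arange(6).reshape(3, 2)    # matrix: two axes

v_order = v.ndim       # number of axes of v
M_order = M.ndim       # number of axes of M
v_dim = v.shape[0]     # dimensionality (number of components) of v

print(v_order, v_dim, M_order, tuple(M.shape))
```

Here `v` has order 1 and dimensionality 3, while `M` has order 2 and lengths 3 and 2 along its two axes.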

## Matrices

Just as scalars are 0th-order tensors and vectors are 1st-order tensors, matrices are **2nd-order** tensors. We denote matrices by bold capital letters (e.g., **X**, **Y**, and **Z**), and represent them in code using tensors with two axes. 

The expression **A** ∈ R<sup>m x n</sup> indicates that a matrix **A** contains *m x n* real-valued scalars, arranged as *m rows* and *n columns*. When m = n, we say that a matrix is **square**. Visually, we can illustrate any matrix as a table.

To refer to an individual element, we subscript both the row and column indices, e.g., a<sub>ij</sub> is the value that belongs to **A**'s i<sup>th</sup> row and j<sup>th</sup> column. In code, we represent a matrix **A** ∈ R<sup>m x n</sup> by a 2nd-order tensor with shape (m, n). We can convert any tensor with *m x n* elements into an *m x n* matrix by passing the desired shape to `reshape`.

In [7]:
A = torch.arange(6).reshape(3, 2)
A

tensor([[0, 1],
        [2, 3],
        [4, 5]])

Sometimes we want to flip the axes. When we exchange a matrix's rows and columns, the result is called its **transpose**. Formally, we signify a matrix **A**'s transpose by **A**<sup>T</sup>, and if **B** = **A**<sup>T</sup>, then b<sub>ij</sub> = a<sub>ji</sub> for all i and j. Thus, the transpose of an *m x n* matrix is an *n x m* matrix.

In code, we can access a matrix's transpose via its `T` attribute.

In [8]:
A.T

tensor([[0, 2, 4],
        [1, 3, 5]])

Symmetric matrices are the subset of square matrices that are equal to their own transpose: **A** = **A**<sup>T</sup>. Matrices are useful for representing datasets. Typically rows correspond to individual records and columns correspond to distinct attributes.
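Returning to symmetric matrices, the defining property **A** = **A**<sup>T</sup> is easy to check in code. The following is a small sketch with a hypothetical matrix `S` constructed to be symmetric.

```python
import torch

# Hypothetical symmetric matrix, chosen so that S[i, j] == S[j, i].
S = torch.tensor([[1, 2, 3],
                  [2, 0, 4],
                  [3, 4, 5]])

elementwise = S == S.T               # Boolean matrix of per-entry comparisons
is_symmetric = torch.equal(S, S.T)   # single Python bool

print(elementwise)
print(is_symmetric)
```

The elementwise comparison shows which entries agree, while `torch.equal` condenses the check into a single truth value.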

## Tensors

Tensors give us a generic way of describing extensions to n<sup>th</sup>-order arrays. We call Python objects of the tensor class **tensors** specifically because they too can have arbitrary numbers of axes. While it may be confusing to use the word tensor for both the mathematical object and its implementation in code, the meaning should be clear from the context.

We denote general tensors by capital letters with a special, skinnier, font face (e.g., `X`, `Y`, and `Z`) and their indexing mechanism (e.g., *x*<sub>ijk</sub> and `[X]`<sub>1,2i - 1,3</sub>) follows naturally from that of matrices.

Tensors will become even more important when we start working with inputs like images. Each image arrives as a 3rd-order tensor with axes corresponding to the height, width, and channel. At each spatial location, the intensities of each color are stacked along the channel axis. Furthermore, a collection of images is represented in code by a 4th-order tensor, where distinct images are indexed along the first axis. Higher-order tensors are constructed, as were vectors and matrices, by growing the number of `shape` components.

In [9]:
torch.arange(24).reshape(2, 3, 4)

tensor([[[ 0,  1,  2,  3],
         [ 4,  5,  6,  7],
         [ 8,  9, 10, 11]],

        [[12, 13, 14, 15],
         [16, 17, 18, 19],
         [20, 21, 22, 23]]])
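The image example can be sketched in the same way. The sizes below are assumptions picked for illustration, and the axis order (height, width, channel) simply follows the description in the text; other layouts are also used in practice.

```python
import torch

# Assumed sizes, for illustration only.
height, width, channels = 4, 5, 3

# One image: a 3rd-order tensor with axes (height, width, channel).
image = torch.zeros(height, width, channels)

# A collection of images: stack along a new first axis to get a 4th-order tensor.
batch = torch.stack([image, image])

print(image.shape, batch.shape)
# The channel intensities at one spatial location of the first image.
print(batch[0, 1, 2].shape)
```

Indexing the first three axes of `batch` leaves a vector with one entry per channel, which matches the idea that color intensities are stacked at each spatial location.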

## Basic Properties of Tensor Arithmetic

Scalars, vectors, matrices, and higher-order tensors all share some handy properties. For example, elementwise operations produce outputs that have the same shape as their operands.

In [10]:
A = torch.arange(6, dtype=torch.float32).reshape(2, 3)
B = A.clone() # Assigns a copy of A to B by allocating new memory
A, A + B

(tensor([[0., 1., 2.],
         [3., 4., 5.]]),
 tensor([[ 0.,  2.,  4.],
         [ 6.,  8., 10.]]))

The elementwise product of two matrices is called their **Hadamard product**, denoted by ⊙ (a dot enclosed by a circle). For two matrices **A**, **B** ∈ R<sup>m x n</sup>, the entry of **A** ⊙ **B** in row i and column j is a<sub>ij</sub>b<sub>ij</sub>.

In [11]:
A * B

tensor([[ 0.,  1.,  4.],
        [ 9., 16., 25.]])

Adding or multiplying a scalar and a tensor produces a result with the same shape as the original tensor. Here, each element of the tensor is added to (or multiplied by) the scalar. 

In [12]:
a = 2
X = torch.arange(24).reshape(2, 3, 4)
a + X, (a * X).shape

(tensor([[[ 2,  3,  4,  5],
          [ 6,  7,  8,  9],
          [10, 11, 12, 13]],
 
         [[14, 15, 16, 17],
          [18, 19, 20, 21],
          [22, 23, 24, 25]]]),
 torch.Size([2, 3, 4]))

## Reduction

Often, we wish to calculate the sum of a tensor's elements. To express the sum of the elements in a vector **x** of length *n*, we write Σ<sup>n</sup><sub>i=1</sub> x<sub>i</sub>.

In [13]:
x = torch.arange(3, dtype=torch.float32)
x, x.sum()

(tensor([0., 1., 2.]), tensor(3.))

To express sums over the elements of tensors of arbitrary shape, we simply sum over all of their axes. For example, the sum of the elements of an *m x n* matrix **A** could be written Σ<sup>m</sup><sub>i=1</sub> Σ<sup>n</sup><sub>j=1</sub> a<sub>ij</sub>.

In [14]:
A.shape, A.sum()

(torch.Size([2, 3]), tensor(15.))

By default, invoking the sum function *reduces* a tensor along all of its axes, eventually producing a scalar. Python libraries also allow us to specify the axes along which the tensor should be reduced.

To sum over all elements along the rows (axis 0), we specify `axis=0` in `sum`. Since the input matrix reduces along axis 0 to generate the output vector, this axis is missing from the shape of the output.

In [15]:
A.shape, A.sum(axis=0).shape

(torch.Size([2, 3]), torch.Size([3]))

Specifying `axis=1` reduces the column dimension (axis 1) by summing the elements across all the columns, leaving one sum per row.

In [16]:
A.shape, A.sum(axis=1).shape

(torch.Size([2, 3]), torch.Size([2]))
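The cells above only display shapes, so it may help to look at the values as well. This sketch rebuilds the same 2 x 3 matrix `A` used in this section; the names `col_sums` and `row_sums` are introduced here for illustration.

```python
import torch

A = torch.arange(6, dtype=torch.float32).reshape(2, 3)

col_sums = A.sum(axis=0)  # collapses axis 0: one sum per column
row_sums = A.sum(axis=1)  # collapses axis 1: one sum per row

print(col_sums)  # tensor([3., 5., 7.])
print(row_sums)  # tensor([ 3., 12.])
```

Reducing along axis 0 adds the two rows together, while reducing along axis 1 adds the three columns together.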

Reducing a matrix along both rows and columns via summation is equivalent to summing up all the elements of the matrix.

In [17]:
A.sum(axis=[0, 1]) == A.sum() # Same as A.sum()

tensor(True)

A related quantity is the **mean**, also called the *average*. We calculate the mean by dividing the sum by the total number of elements. Because computing the mean is so common, it gets a dedicated library function that works analogously to `sum`.

In [18]:
A.mean(), A.sum() / A.numel()

(tensor(2.5000), tensor(2.5000))

Likewise, the function for calculating the mean can also reduce a tensor along specific axes. 

In [19]:
A.mean(axis=0), A.sum(axis=0) / A.shape[0]

(tensor([1.5000, 2.5000, 3.5000]), tensor([1.5000, 2.5000, 3.5000]))

## Non-Reduction Sum

Sometimes it can be useful to keep the number of axes unchanged when invoking the function for calculating the sum or mean. This matters when we want to use the broadcast mechanism.

In [20]:
sum_A = A.sum(axis=1, keepdims=True)
sum_A, sum_A.shape

(tensor([[ 3.],
         [12.]]),
 torch.Size([2, 1]))

For instance, since `sum_A` keeps its two axes after summing each row, we can divide `A` by `sum_A` with broadcasting to create a matrix where each row sums up to 1.

In [21]:
A / sum_A

tensor([[0.0000, 0.3333, 0.6667],
        [0.2500, 0.3333, 0.4167]])

Under certain conditions, even when shapes differ, we can still perform elementwise binary operations by invoking the **broadcasting mechanism**. Broadcasting works according to the following two-step procedure:

1) expand one or both arrays by copying elements along axes with length 1 so that after this transformation, the two tensors have the same shape.

2) perform an elementwise operation on the resulting arrays. 

In [22]:
a = torch.arange(3).reshape((3, 1))
b = torch.arange(2).reshape((1, 2))
a, b

(tensor([[0],
         [1],
         [2]]),
 tensor([[0, 1]]))

For example, since `a` and `b` are 3 x 1 and 1 x 2 matrices, respectively, their shapes do not match up. Broadcasting produces a larger 3 x 2 matrix by replicating matrix `a` along the columns and matrix `b` along the rows before adding them elementwise.

In [23]:
a + b

tensor([[0, 1],
        [1, 2],
        [2, 3]])
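The two-step procedure can also be carried out by hand. The sketch below uses `expand` to perform the first step explicitly; it is meant only to illustrate the idea, since PyTorch does this implicitly and `expand` returns a view rather than physically copying data.

```python
import torch

a = torch.arange(3).reshape((3, 1))
b = torch.arange(2).reshape((1, 2))

# Step 1: expand both tensors to the common shape (3, 2).
a_expanded = a.expand(3, 2)  # repeats the single column of a
b_expanded = b.expand(3, 2)  # repeats the single row of b

# Step 2: ordinary elementwise addition on same-shaped tensors.
manual = a_expanded + b_expanded

print(manual)
print(torch.equal(manual, a + b))
```

The manual result matches `a + b`, which suggests that broadcasting amounts to an implicit expansion followed by an ordinary elementwise operation.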

If we want to calculate the cumulative sum of elements of `A` along some axis, say `axis=0` (row by row), we can call the `cumsum` function. By design, this function does not reduce the input tensor along any axis.

In [25]:
A.cumsum(axis=0)

tensor([[0., 1., 2.],
        [3., 5., 7.]])

## Dot Products

So far, we have only performed elementwise operations, sums, and averages. Fortunately, this is where things start to get much more interesting. One of the most fundamental operations is the **dot product**. Given two vectors **x**, **y** ∈ R<sup>d</sup>, their dot product **x**<sup>T</sup>**y** (also known as the *inner product*, ⟨**x**, **y**⟩) is a sum over the products of the elements at the same position: **x**<sup>T</sup>**y** = Σ<sup>d</sup><sub>i=1</sub> x<sub>i</sub>y<sub>i</sub>.

In [26]:
y = torch.ones(3, dtype=torch.float32)
x, y, torch.dot(x, y)

(tensor([0., 1., 2.]), tensor([1., 1., 1.]), tensor(3.))

Equivalently, we can calculate the dot product of two vectors by performing elementwise multiplication followed by a sum.

In [27]:
torch.sum(x * y)

tensor(3.)

Dot products are useful in a wide range of contexts. For example, given some set of values, denoted by a vector **x** ∈ R<sup>n</sup>, and a set of weights, denoted by **w** ∈ R<sup>n</sup>, the weighted sum of the values in **x** according to the weights **w** could be expressed as the dot product **x**<sup>T</sup>**w**. 

When the weights are nonnegative and sum to 1, i.e., Σ<sup>n</sup><sub>i=1</sub> w<sub>i</sub> = 1, the dot product expresses a **weighted average**. After normalizing two vectors to have unit length, the dot product expresses the **cosine of the angle between them**. Later we will formally introduce the notion of **length**.
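Both interpretations can be tried out on small inputs. In the sketch below, the values, weights, and vectors are hypothetical, and the Euclidean norm (`torch.linalg.norm`) is assumed as the notion of length, which the text only introduces later.

```python
import torch

# Hypothetical values and weights; the weights are nonnegative and sum to 1.
values = torch.tensor([1.0, 2.0, 3.0])
w = torch.tensor([0.5, 0.25, 0.25])
weighted_avg = torch.dot(values, w)  # 0.5*1 + 0.25*2 + 0.25*3 = 1.75

# Cosine of the angle between two vectors:
# normalize each to unit length (Euclidean norm assumed), then take the dot product.
u = torch.tensor([1.0, 0.0])
v = torch.tensor([1.0, 1.0])
cosine = torch.dot(u / torch.linalg.norm(u), v / torch.linalg.norm(v))

print(weighted_avg)
print(cosine)  # approximately 0.7071, the cosine of 45 degrees
```

The vectors `u` and `v` are 45 degrees apart, so the normalized dot product comes out close to 0.7071.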

## Matrix-Vector Products

Now that we know how to calculate dot products, we can begin to understand the *product* between an *m x n* matrix **A** and an n-dimensional vector **x**. To start off, we view our matrix in terms of its row vectors, where each **a**<sup>T</sup><sub>i</sub> ∈ R<sup>n</sup> is a row vector representing the i<sup>th</sup> row of the matrix **A**.

The matrix-vector product **Ax** is simply a column vector of length *m*, whose i<sup>th</sup> element is the dot product **a**<sup>T</sup><sub>i</sub>**x**.
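Written out in LaTeX, the row view of **A** and the resulting product take the following form. This is a reconstruction from the definitions given in the text, using the same symbols.

```latex
\mathbf{A} =
\begin{bmatrix}
\mathbf{a}^\top_1 \\
\mathbf{a}^\top_2 \\
\vdots \\
\mathbf{a}^\top_m
\end{bmatrix},
\qquad
\mathbf{A}\mathbf{x} =
\begin{bmatrix}
\mathbf{a}^\top_1 \mathbf{x} \\
\mathbf{a}^\top_2 \mathbf{x} \\
\vdots \\
\mathbf{a}^\top_m \mathbf{x}
\end{bmatrix}.
```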

We can also think of multiplication with matrix **A** ∈ R<sup>m x n</sup> as a transformation that projects vectors from R<sup>n</sup> to R<sup>m</sup>. These transformations are remarkably useful. For example, we can represent rotations as multiplications by certain square matrices. Matrix-vector products also describe the key calculation involved in computing the outputs of each layer in a neural network given the outputs from the previous layer. 

To express a matrix-vector product in code, we use the `mv` function. Note that the column dimension of `A` (its length along axis 1) must be the same as the dimension of `x` (its length). Python has a **convenience operator `@`** that can execute both matrix-vector and matrix-matrix products (depending on its arguments). Thus we can write `A @ x`.

In [28]:
A.shape, x.shape, torch.mv(A, x), A@x

(torch.Size([2, 3]), torch.Size([3]), tensor([ 5., 14.]), tensor([ 5., 14.]))
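To connect the result back to dot products, the sketch below recomputes **Ax** one row at a time and then applies a rotation matrix to a vector. The standard 2 x 2 rotation matrix and the 90-degree angle are assumptions chosen for illustration; they do not appear in the original notebook.

```python
import math
import torch

A = torch.arange(6, dtype=torch.float32).reshape(2, 3)
x = torch.arange(3, dtype=torch.float32)

# Each entry of A @ x is the dot product of one row of A with x.
row_by_row = torch.stack([torch.dot(row, x) for row in A])
print(row_by_row, torch.allclose(row_by_row, A @ x))

# Standard 2 x 2 matrix for a counterclockwise rotation by theta (assumed example).
theta = math.pi / 2
R = torch.tensor([[math.cos(theta), -math.sin(theta)],
                  [math.sin(theta),  math.cos(theta)]])
rotated = R @ torch.tensor([1.0, 0.0])
print(rotated)  # approximately [0, 1]
```

The row-by-row computation reproduces the values 5 and 14 from the cell above, and the rotation maps the unit vector along the first axis onto the second axis, up to floating-point error.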