* To start building sophisticated models, we will need a few tools from linear algebra.
* Here we will start from basic scalar arithmetic and ramp up to matrix multiplication.

## 1.1 Scalars

* Most everyday mathematics consists of manipulating numbers one at a time.
* We call these values `scalars`.
* Scalars are implemented as tensors that contain only one element.
* Below, we assign two scalars and perform the familiar addition, multiplication, division and exponentiation operations.


In [2]:
import torch
X = torch.tensor(3.0)
Y = torch.tensor(2.0)
X + Y, X * Y, X / Y, X ** Y

(tensor(5.), tensor(6.), tensor(1.5000), tensor(9.))

## 1.2 Vectors

* Since we're dealing with machine learning, you can think of a vector as a fixed-length array of scalars.
* As with their code counterparts, we call these scalars the elements of the vector.
* When vectors represent examples from real-world datasets, their values hold some real-world significance.
* For example, if we were training a model to predict the risk of a loan defaulting, we might associate each applicant with a vector whose components correspond to quantities like their income, length of employment, or number of previous defaults.

* Vectors are implemented as $1^{st}$ order tensors.

In [3]:
x = torch.arange(3)
x

tensor([0, 1, 2])

* We can refer to an element of a vector by using a subscript. For example, $x_3$ denotes the third element of `x`.
* Note that, as in Python, tensor indices start from 0 (zero-based indexing), while linear algebra subscripts begin at 1 (one-based indexing). The third element $x_3$ is therefore `x[2]` in code.

In [4]:
x[2]

tensor(2)

* To indicate a vector contains $n$ elements we write $x \in \mathbb{R}^n$.
* We call $n$ the `dimensionality` of the vector.
* In code, this corresponds to the tensor's length, accessible via Python's built-in `len` function.

In [5]:
len(x)

3

* We can also access the length via the `shape` attribute.
* The shape is a tuple that indicates a tensor's length along each axis.
* Tensors with just one axis have shapes with just one element.

In [6]:
x.shape

torch.Size([3])

* The word `dimension` is overloaded: people sometimes use it for the number of axes and sometimes for the length along a particular axis.
* To avoid this confusion, we use `order` to refer to the number of axes and `dimensionality` exclusively to refer to the number of components.
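* A minimal plain-Python sketch of this distinction, using nested lists in place of tensors. The helper `shape_of` is hypothetical (it is not a PyTorch function) and assumes the nested list is rectangular and non-empty.

```python
def shape_of(nested):
    """Return the length along each axis of a rectangular, non-empty nested list."""
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))
        nested = nested[0]  # descend one axis
    return tuple(shape)

vector = [0, 1, 2]
matrix = [[0, 1], [2, 3], [4, 5]]

# order = number of axes
order_of_vector = len(shape_of(vector))   # 1
order_of_matrix = len(shape_of(matrix))   # 2

# dimensionality of a vector = number of components
dimensionality = shape_of(vector)[0]      # 3

print(shape_of(vector), shape_of(matrix))  # (3,) (3, 2)
```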

## 1.3 Matrices

* Just as scalars are $0^{th}$ order tensors and vectors are $1^{st}$ order tensors, matrices are $2^{nd}$ order tensors.

* We denote matrices by bold capital letters (e.g., **X**, **Y**, **Z**), and represent them in code by tensors with two axes.
* The expression $A \in \mathbb{R}^{m \times n}$ indicates that a matrix `A` contains $m \times n$ real-valued scalars, arranged as `m` rows and `n` columns.
* When `m=n`, we say that the matrix is `square`.
* To refer to a certain element in the matrix we use the subscript $a_{ij}$, showing that the value belongs to `A`'s $i^{th}$ row and $j^{th}$ column.

In [7]:
##constructing a matrix
A = torch.arange(6).reshape(3,2)
A

tensor([[0, 1],
        [2, 3],
        [4, 5]])

* We may want to flip the axes of a matrix, exchanging its rows and columns. The result is called the `transpose`.
* Formally, we signify a matrix `A`'s transpose by $A^T$.

In [8]:
## transposing matrix A
A.T

tensor([[0, 2, 4],
        [1, 3, 5]])
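* To spell out what the transpose does, here is a small plain-Python sketch on lists of rows. The function name `transpose` is an illustrative helper, not part of PyTorch.

```python
def transpose(matrix):
    """Swap rows and columns: the entry at (i, j) moves to (j, i)."""
    return [list(column) for column in zip(*matrix)]

A = [[0, 1], [2, 3], [4, 5]]   # same values as the 3 x 2 matrix above
A_T = transpose(A)
print(A_T)  # [[0, 2, 4], [1, 3, 5]]
```

* Transposing twice returns the original matrix, so `transpose(A_T) == A`.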

* Symmetric matrices are the subset of square matrices that are equal to their own transposes: $A = A^T$.

In [9]:
A = torch.tensor([[1,2,3],[2,0,4],[3,4,5]])
A == A.T

tensor([[True, True, True],
        [True, True, True],
        [True, True, True]])
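* The elementwise comparison above returns one boolean per entry. As a sketch of a single yes/no check, the hypothetical helper `is_symmetric` below tests the definition $a_{ij} = a_{ji}$ directly on a list of rows.

```python
def is_symmetric(matrix):
    """True if the matrix is square and equal to its own transpose."""
    n = len(matrix)
    if any(len(row) != n for row in matrix):
        return False  # a non-square matrix cannot be symmetric
    return all(matrix[i][j] == matrix[j][i] for i in range(n) for j in range(n))

print(is_symmetric([[1, 2, 3], [2, 0, 4], [3, 4, 5]]))  # True
print(is_symmetric([[0, 1], [2, 3], [4, 5]]))           # False
```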

## 1.4 Tensors

* Tensors give us a generic way of describing extensions to $n^{th}$ order arrays.
* Tensors become more important when working with images.
* Each image arrives as a $3^{rd}$ order tensor with axes corresponding to the height, width, and channel.
* At each spatial location, the intensities of each color (red, green, and blue) are stacked along the channel axis.

In [10]:
x = torch.arange(24).reshape(2,3,4)
x

tensor([[[ 0,  1,  2,  3],
         [ 4,  5,  6,  7],
         [ 8,  9, 10, 11]],

        [[12, 13, 14, 15],
         [16, 17, 18, 19],
         [20, 21, 22, 23]]])

## 1.5 Basic Properties of Tensor Arithmetic

* Scalars, vectors, matrices, and higher-order tensors all have some handy properties.
* For example, elementwise operations produce outputs that have the same shape as their operands.

In [11]:
A = torch.arange(6,dtype=torch.float32).reshape(2,3)
B = A.clone()  # assign a copy of A to B by allocating new memory
A, A+B

(tensor([[0., 1., 2.],
         [3., 4., 5.]]),
 tensor([[ 0.,  2.,  4.],
         [ 6.,  8., 10.]]))

* The elementwise product of two matrices is called their `Hadamard product`.
* We can spell out the entries of Hadamard product of two matrices `A, B` with:
* $A \odot B =
\begin{bmatrix}
a_{11}b_{11} & a_{12}b_{12} & \dots  & a_{1n}b_{1n} \\
a_{21}b_{21} & a_{22}b_{22} & \dots  & a_{2n}b_{2n} \\
\vdots       & \vdots       & \ddots & \vdots       \\
a_{m1}b_{m1} & a_{m2}b_{m2} & \dots  & a_{mn}b_{mn}
\end{bmatrix}$

In [12]:
A*B

tensor([[ 0.,  1.,  4.],
        [ 9., 16., 25.]])
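* The formula can be sketched in plain Python as follows. The helper `hadamard` is illustrative only and assumes both matrices are lists of rows with the same shape.

```python
def hadamard(A, B):
    """Elementwise product: entry (i, j) of the result is a_ij * b_ij."""
    return [[a * b for a, b in zip(row_a, row_b)] for row_a, row_b in zip(A, B)]

A = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]
B = [row[:] for row in A]  # independent copy of A
print(hadamard(A, B))  # [[0.0, 1.0, 4.0], [9.0, 16.0, 25.0]]
```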

* Adding or multiplying a scalar and a tensor produces a result with the same shape as the original tensor.

In [13]:
a = 2
X = torch.arange(24).reshape(2,3,4)
a + X, (a*X).shape

(tensor([[[ 2,  3,  4,  5],
          [ 6,  7,  8,  9],
          [10, 11, 12, 13]],
 
         [[14, 15, 16, 17],
          [18, 19, 20, 21],
          [22, 23, 24, 25]]]),
 torch.Size([2, 3, 4]))

## 1.6 Reduction

* At times we wish to calculate the sum of a tensor's elements.
* To express the sum of the elements in a vector `x` of length `n`, we write $\sum_{i=1}^{n} x_i$.
* There's a simple function in PyTorch which performs this:

In [14]:
x = torch.arange(3,dtype=torch.float32)
x, x.sum()

(tensor([0., 1., 2.]), tensor(3.))

* To express sums over the elements of tensors of arbitrary shape, we simply sum over all its axes.
* For instance, the sum of the elements of an $m \times n$ matrix `A` could be written as $\sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}$.

In [15]:
A.shape, A.sum()

(torch.Size([2, 3]), tensor(15.))

* Invoking the sum function `reduces` a tensor along all of its axes, eventually producing a scalar.
* PyTorch allows us to specify the axes along which the tensor should be reduced.
* To sum over all elements along the rows (axis 0), we specify `axis=0` in the sum function. Axis 0 then disappears from the shape of the output.


In [16]:
A.shape, A.sum(axis=0).shape

(torch.Size([2, 3]), torch.Size([3]))

* Specifying `axis=1` will reduce the column dimension (axis 1) by summing up elements of all the columns.

In [17]:
A.shape, A.sum(axis=1).shape

(torch.Size([2, 3]), torch.Size([2]))
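* The cells above show only the resulting shapes. As a rough plain-Python sketch of the values involved, the snippet below reduces the same 2 x 3 matrix along each axis using lists of rows. The variable names are illustrative.

```python
A = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]   # shape (2, 3)

# Reducing axis 0 collapses the rows: one sum per column, shape (3,)
sum_axis0 = [sum(column) for column in zip(*A)]

# Reducing axis 1 collapses the columns: one sum per row, shape (2,)
sum_axis1 = [sum(row) for row in A]

print(sum_axis0)  # [3.0, 5.0, 7.0]
print(sum_axis1)  # [3.0, 12.0]
```

* Summing either result again gives 15.0, the sum over all elements.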

* Reducing a matrix along both rows and columns via summation is equivalent to summing up all the elements of the matrix.

In [26]:
z = torch.arange(10).reshape(2,5)
##reducing tensor z via summation of columns and rows
z.sum(axis=[0,1]) == z.sum()

tensor(True)

* A related quantity is the mean, which we calculate by dividing the sum by the total number of elements.
* Since computing the mean is so common, it gets a dedicated library function that works analogously to sum.

In [27]:
A.mean(),A.sum() / A.numel()

(tensor(2.5000), tensor(2.5000))

* The function for calculating the mean can also reduce a tensor along specific axes.

In [28]:
A.mean(axis=0), A.sum(axis=0) / A.shape[0]

(tensor([1.5000, 2.5000, 3.5000]), tensor([1.5000, 2.5000, 3.5000]))

## 1.7 Non-Reduction Sum

* Sometimes it can be useful to keep the number of axes unchanged when invoking the function for calculating the sum or mean.
* This matters when we want to use the broadcast mechanism.

In [29]:
sum_A = A.sum(axis=1,keepdims=True)
sum_A, sum_A.shape

(tensor([[ 3.],
         [12.]]),
 torch.Size([2, 1]))

* For instance, since `sum_A` keeps its two axes after summing each row, we can divide `A` by `sum_A` with broadcasting to create a matrix where each row sums up to 1.

In [33]:
A / sum_A

tensor([[0.0000, 0.3333, 0.6667],
        [0.2500, 0.3333, 0.4167]])
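* A plain-Python sketch of why the kept axis matters. It mimics the `(2, 1)` shape with one single-element list per row and then reuses that element across the columns, which is roughly what broadcasting does here. All names are illustrative.

```python
A = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]

# Analogue of A.sum(axis=1, keepdims=True): shape (2, 1)
sum_A = [[sum(row)] for row in A]

# Analogue of broadcasting: the single entry in each row of sum_A
# is reused for every column of the matching row of A
normalized = [[a / s[0] for a in row] for row, s in zip(A, sum_A)]

# Each row of the normalized matrix sums to 1 (up to rounding)
row_totals = [sum(row) for row in normalized]
print(sum_A)       # [[3.0], [12.0]]
print(row_totals)
```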

In [34]:
## calculating cumulative sum of elements of A along
## some axis
A.cumsum(axis=0)

tensor([[0., 1., 2.],
        [3., 5., 7.]])

In [32]:
j = torch.arange(12).reshape(2,6)
j_red_sum = j.sum(axis=0)
j_non_sum = j.sum(axis=0,keepdims=True)
j, j_red_sum, j_red_sum.shape, j_non_sum, j_non_sum.shape

(tensor([[ 0,  1,  2,  3,  4,  5],
         [ 6,  7,  8,  9, 10, 11]]),
 tensor([ 6,  8, 10, 12, 14, 16]),
 torch.Size([6]),
 tensor([[ 6,  8, 10, 12, 14, 16]]),
 torch.Size([1, 6]))

## 1.8 Dot Products

* The dot product is one of the most fundamental operations in linear algebra.
* Given two vectors $x, y \in \mathbb{R}^d$, their dot product $x^{T}y$ (also known as the inner product, $\langle x, y \rangle$) is the sum over the products of the elements at the same positions: $x^{T}y = \sum_{i=1}^{d} x_i y_i$.

In [36]:
x = torch.ones(3, dtype=torch.float32)  # define x here so the cell is self-contained
y = torch.ones(3, dtype=torch.float32)
x, y, torch.dot(x, y)

(tensor([1., 1., 1.]), tensor([1., 1., 1.]), tensor(3.))

* We can calculate the dot product of two vectors by performing an elementwise multiplication followed by a sum:

In [38]:
torch.sum(x * y)

tensor(3.)
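* The same definition written out in plain Python, as a sketch. The helper `dot` is illustrative, and the example vectors are chosen here for the demonstration.

```python
def dot(x, y):
    """Sum of the products of the elements at matching positions."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(x_i * y_i for x_i, y_i in zip(x, y))

x = [0.0, 1.0, 2.0]
y = [1.0, 1.0, 1.0]
print(dot(x, y))  # 3.0
```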

## 1.9 Matrix-Matrix Multiplication


* Say that we have two matrices $A \in \mathbb{R}^{n \times k}$ and $B \in \mathbb{R}^{k \times m}$:

$A = \begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1k} \\
a_{21} & a_{22} & \cdots & a_{2k} \\
\vdots & \vdots & \ddots & \vdots \\
a_{n1} & a_{n2} & \cdots & a_{nk}
\end{bmatrix}$

$B = \begin{bmatrix}
b_{11} & b_{12} & \cdots & b_{1m} \\
b_{21} & b_{22} & \cdots & b_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
b_{k1} & b_{k2} & \cdots & b_{km}
\end{bmatrix}$

* We perform matrix-matrix multiplication using torch.matmul:

In [39]:
## matrix-matrix multiplication
b = torch.rand(3, 4)
c = torch.rand(4, 3)
d = torch.matmul(b,c)
d_i = b @ c
d, d_i

(tensor([[0.2722, 0.6157, 0.5766],
         [0.7616, 0.9073, 0.6931],
         [0.9288, 1.7989, 1.5152]]),
 tensor([[0.2722, 0.6157, 0.5766],
         [0.7616, 0.9073, 0.6931],
         [0.9288, 1.7989, 1.5152]]))
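* Because the inputs above are random, the numbers are hard to verify by hand. In the product $C = AB$, each entry $c_{ij}$ is the dot product of row $i$ of $A$ with column $j$ of $B$. The plain-Python sketch below spells this out on small fixed matrices. The helper `matmul` is illustrative and is not `torch.matmul`.

```python
def matmul(A, B):
    """Multiply an n x k matrix by a k x m matrix, both given as lists of rows."""
    k = len(B)
    if any(len(row) != k for row in A):
        raise ValueError("inner dimensions must match")
    columns_of_B = list(zip(*B))
    # c_ij = dot product of row i of A and column j of B
    return [[sum(a * b for a, b in zip(row, col)) for col in columns_of_B]
            for row in A]

A = [[1, 2], [3, 4], [5, 6]]   # 3 x 2
B = [[1, 0, 2], [0, 1, 3]]     # 2 x 3
C = matmul(A, B)               # 3 x 3
print(C)  # [[1, 2, 8], [3, 4, 18], [5, 6, 28]]
```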

## 1.10 Norms

* The norm of a vector tells us how big it is.
* For instance, the $\ell_2$ norm measures the (Euclidean) length of a vector.
* A norm is a function $\| \cdot \|$ that maps a vector to a scalar and satisfies the following three properties:
   1. Given any vector `x`, if we scale (all elements of) the vector by a scalar $\alpha \in \mathbb{R}$, its norm scales accordingly:

      $\|\alpha \mathbf{x}\| = |\alpha| \|\mathbf{x}\|$

   2. For any vectors `x` and `y`, norms satisfy the triangle inequality:

      $\|\mathbf{x} + \mathbf{y}\| \leq \|\mathbf{x}\| + \|\mathbf{y}\|$

   3. The norm of a vector is nonnegative, and it vanishes only if the vector is zero:

      $\|\mathbf{x}\| > 0 \text{ for all } \mathbf{x} \neq \mathbf{0}$
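* As a numerical sanity check (not a proof), the plain-Python sketch below tests the three properties on a couple of example vectors, using the Euclidean length as the norm. The helper `l2_norm` and the chosen values are illustrative.

```python
import math

def l2_norm(x):
    """Euclidean length: square root of the sum of squared elements."""
    return math.sqrt(sum(x_i ** 2 for x_i in x))

x = [3.0, -4.0]
y = [1.0, 2.0]
alpha = -2.0

# 1. Scaling the vector by alpha scales the norm by |alpha|
scaling_holds = math.isclose(l2_norm([alpha * x_i for x_i in x]),
                             abs(alpha) * l2_norm(x))

# 2. Triangle inequality
x_plus_y = [x_i + y_i for x_i, y_i in zip(x, y)]
triangle_holds = l2_norm(x_plus_y) <= l2_norm(x) + l2_norm(y)

# 3. Positive for a nonzero vector, zero for the zero vector
positivity_holds = l2_norm(x) > 0 and l2_norm([0.0, 0.0]) == 0.0

print(scaling_holds, triangle_holds, positivity_holds)  # True True True
```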

* Many functions are valid norms and different norms encode different notions of size.
* The Euclidean norm, which we all learned in high school when calculating the hypotenuse of a right-angled triangle, is the square root of the sum of squares of a vector's elements.
* Formally, this is called the $\ell_2$ norm and is expressed as:
$\|\mathbf{x}\|_2 = \sqrt{\sum_{i=1}^{n} x_i^2}$

In [40]:
##calculating l2 norm
u = torch.tensor([3.0,-4.0])
torch.norm(u)

tensor(5.)

* The $\ell_1$ norm measures the Manhattan distance.
* The $\ell_1$ norm sums the absolute values of a vector's elements:
$\|\mathbf{x}\|_1 = \sum_{i=1}^{n} |x_i|$.
* Compared with the $\ell_2$ norm, the $\ell_1$ norm is less sensitive to outliers. To compute it, we compose the absolute value with the sum operation.


In [41]:
torch.abs(u).sum()

tensor(7.)

* Both the $\ell_2$ and $\ell_1$ norms are special cases of the more general $\ell_p$ norms:
$\|\mathbf{x}\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}$.
* In the case of matrices, matters are more complicated. After all, matrices can be viewed both as collections of individual entries and as objects that operate on vectors and transform them into other vectors. For instance, we can ask by how much longer the matrix-vector product `Xv` could be relative to `v`. This line of thought leads to what is called the spectral norm.
* For now, we introduce the `Frobenius norm`, which is much easier to compute and is defined as the square root of the sum of the squares of a matrix's elements:

$\|A\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2}$

* Norms often appear in the objectives of optimization problems, for example when we want to maximize the probability assigned to observed data, maximize the revenue associated with a recommender model, or minimize the distance between predictions and the ground-truth observations.



In [42]:
torch.norm(torch.ones((4,9)))

tensor(6.)
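* To tie the formulas together, here is a plain-Python sketch of the $\ell_p$ norm and the Frobenius norm. The helpers `lp_norm` and `frobenius_norm` are illustrative names, not PyTorch functions. The Frobenius norm is computed as the $\ell_2$ norm of the matrix's entries laid out as one long vector, which matches its definition.

```python
def lp_norm(x, p):
    """General l_p norm: (sum of |x_i|^p)^(1/p), for p >= 1."""
    return sum(abs(x_i) ** p for x_i in x) ** (1 / p)

def frobenius_norm(matrix):
    """l_2 norm of all matrix entries, flattened into a single vector."""
    flattened = [entry for row in matrix for entry in row]
    return lp_norm(flattened, 2)

u = [3.0, -4.0]
print(lp_norm(u, 1))  # 7.0
print(lp_norm(u, 2))  # 5.0

ones = [[1.0] * 9 for _ in range(4)]   # 4 x 9 matrix of ones
print(frobenius_norm(ones))  # 6.0, the square root of 36
```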