<a href="https://colab.research.google.com/github/KSharif/Mathematrics-foundation-for-ML/blob/main/Linear_alegbra2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Norms in ML**

Norms are fundamental concepts in machine learning, used to measure the size or length of vectors. They play a crucial role in various machine learning tasks, including defining loss functions, regularization, and optimization. Here is an overview of the different types of norms and their applications in machine learning:
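
**Types of Norms**

**L0 Norm**

* Definition: Counts the number of non-zero elements in a vector.
* Application: Useful for feature selection and for measuring sparsity. It is not technically a norm, because scaling a vector by a non-zero constant leaves the count unchanged, which violates absolute homogeneity ($\lVert \alpha x \rVert = |\alpha| \lVert x \rVert$).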

**L1 Norm (Manhattan Norm)**

* Definition: The sum of the absolute values of the vector components.
* Formula: $\lVert x \rVert_1 = \sum_{i=1}^{n} |x_i|$
* Application: Often used in LASSO (Least Absolute Shrinkage and Selection Operator) for regularization, promoting sparsity in models.

**L2 Norm (Euclidean Norm)**

* Definition: The square root of the sum of the squared vector components.
* Formula: $\lVert x \rVert_2 = \sqrt{\sum_{i=1}^{n} x_i^2}$
* Application: Commonly used in ridge regression and in loss functions; both the ridge penalty and the Mean Squared Error use its squared form. Because each component is squared, it is sensitive to outliers.

**Squared L2 Norm**

* Definition: The sum of the squared vector components, without taking the square root.
* Formula: $\lVert x \rVert_2^2 = \sum_{i=1}^{n} x_i^2$
* Application: Used for computational efficiency in optimization problems, as it simplifies the derivative calculations.

**L∞ Norm (Max Norm)**

* Definition: The maximum absolute value among the vector components.
* Formula: $\lVert x \rVert_\infty = \max(|x_1|, |x_2|, \ldots, |x_n|)$
* Application: Useful in scenarios where the largest component dominates the behavior of the vector.

**Generalized Lp Norm**

* Definition: A general form, defined for $p \geq 1$, that includes the L1, L2, and L∞ norms as special cases.
* Formula: $\lVert x \rVert_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}$
* Application: Allows flexibility in defining norms based on the value of $p$. For example, $p = 1$ gives the L1 norm, $p = 2$ gives the L2 norm, and $p \to \infty$ gives the L∞ norm.
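
As a rough sketch of the general formula, the snippet below implements the $L^p$ norm directly. The helper name `lp_norm` and the choice of $p = 50$ as a "large" value are assumptions made for illustration; the example vector is the one used in the code cells later in this notebook.

```python
import numpy as np

def lp_norm(x, p):
    """Return the L^p norm of a 1-D array for p >= 1 (hypothetical helper)."""
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

x = np.array([25.0, 2.0, 5.0])

l1 = lp_norm(x, 1)          # sum of absolute values
l2 = lp_norm(x, 2)          # Euclidean length
l50 = lp_norm(x, 50)        # a large p gets close to the largest |x_i|
l_inf = np.max(np.abs(x))   # the L-infinity norm itself

print(l1, l2, l50, l_inf)
```

With a large exponent the biggest component dominates the sum, which is why `l50` ends up very close to `l_inf`.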

**Applications in Machine Learning**

**Loss Functions**

Norms are used to define loss functions, which measure the error between predicted and actual values. For instance, the Mean Squared Error (MSE) loss is the squared L2 norm of the residual vector divided by the number of samples, while the Mean Absolute Error (MAE) loss is the L1 norm of the residual vector divided by the number of samples.

**Regularization**

Norms are also used in regularization techniques to prevent overfitting by adding a penalty term to the loss function. L1 regularization (LASSO) adds the L1 norm of the coefficients, promoting sparsity. L2 regularization (ridge regression) adds the squared L2 norm of the coefficients, shrinking them toward zero without typically setting them exactly to zero.

**Optimization**

In optimization problems, norms help in defining constraints and objectives. For example, in support vector machines (SVM), minimizing the L2 norm of the weight vector maximizes the margin between classes.

Understanding and choosing the appropriate norm is crucial for the performance and efficiency of machine learning models. Each norm has its advantages and is suited for different types of problems and data characteristics.
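
To make the link between norms, losses, and penalties concrete, here is a minimal sketch. All values (`y_true`, `y_pred`, the coefficient vector `w`, and the regularization strength `lam`) are made up for illustration and do not come from a real model.

```python
import numpy as np

# Illustrative values only (not from a real model)
y_true = np.array([3.0, -1.0, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])
w = np.array([0.5, 0.0, -2.0])   # hypothetical model coefficients
lam = 0.1                        # hypothetical regularization strength

residual = y_true - y_pred

# MAE is the L1 norm of the residuals divided by the number of samples
mae = np.linalg.norm(residual, ord=1) / residual.size
# MSE is the squared L2 norm of the residuals divided by the number of samples
mse = np.linalg.norm(residual, ord=2) ** 2 / residual.size

# Penalty terms that would be added to the loss
l1_penalty = lam * np.linalg.norm(w, ord=1)        # LASSO-style
l2_penalty = lam * np.linalg.norm(w, ord=2) ** 2   # ridge-style

print(mae, mse, l1_penalty, l2_penalty)
```

The same `mae` and `mse` values can be obtained with `np.mean(np.abs(residual))` and `np.mean(residual ** 2)`; the norm-based form is used here only to show the connection.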

### $L^2$ Norm

In [3]:
import numpy as np
import pandas as pd
import tensorflow as tf
import torch

In [3]:
x = np.array([25, 2, 5])
x

array([25,  2,  5])

In [4]:
(25 ** 2 + 2 ** 2 + 5 **2) ** (1/2)

25.573423705088842

In [5]:
np.linalg.norm(x) # defaults to the L2 norm for vectors


25.573423705088842

So, if units in this 3-dimensional vector space are meters, then the vector $x$ has a length of about 25.6 m.

### $L^1$ Norm

In [6]:
x

array([25,  2,  5])

In [7]:
np.abs(25) + np.abs(2) + np.abs(5)

32
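
The manual sum above can also be computed on the whole array at once. The sketch below assumes the same vector `x` and uses the `ord` argument of `np.linalg.norm`, which selects the norm to compute.

```python
import numpy as np

x = np.array([25, 2, 5])

# Element-wise absolute values, then a sum over the vector
l1_manual = np.sum(np.abs(x))

# ord=1 asks NumPy for the L1 norm; the result is returned as a float
l1_numpy = np.linalg.norm(x, ord=1)

print(l1_manual, l1_numpy)
```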

### Squared $L^2$ Norm

In [8]:
x


array([25,  2,  5])

In [None]:
(25**2 + 2**2 + 5**2)

654

In [None]:
# We'll cover tensor multiplication in more detail soon, but to prove the point quickly:
np.dot(x, x)

654
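
One reason the squared $L^2$ norm is convenient in optimization is that its gradient with respect to $x$ is simply $2x$. The following sketch checks this numerically with central finite differences; the helper `squared_l2` and the step size `eps` are assumptions chosen for illustration.

```python
import numpy as np

def squared_l2(v):
    """Squared L2 norm, computed as the dot product of v with itself."""
    return np.dot(v, v)

x = np.array([25.0, 2.0, 5.0])

# Analytic gradient of x . x with respect to x
analytic_grad = 2 * x

# Numerical gradient: nudge one component at a time by +/- eps
eps = 1e-6
numeric_grad = np.array([
    (squared_l2(x + eps * e) - squared_l2(x - eps * e)) / (2 * eps)
    for e in np.eye(x.size)   # rows of the identity are unit vectors
])

print(analytic_grad)
print(numeric_grad)
```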

### Max Norm

In [9]:
x

array([25,  2,  5])

In [10]:
np.max([np.abs(25), np.abs(2), np.abs(5)])

25
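
As a small sketch, the max norm can also be obtained from `np.linalg.norm` by passing `ord=np.inf`. The second vector `z` is a made-up example with a negative entry, included to show that the absolute value matters.

```python
import numpy as np

x = np.array([25, 2, 5])
z = np.array([-30, 2, 5])   # hypothetical vector with a negative component

max_norm_x = np.linalg.norm(x, ord=np.inf)   # largest absolute value in x
max_norm_z = np.linalg.norm(z, ord=np.inf)   # uses |-30|, not the raw maximum

print(max_norm_x, max_norm_z)
```

Note that `np.max(z)` alone would return 5, which is why the absolute value is taken first.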

### Orthogonal Vectors

In [11]:
i = np.array([1 , 0])
i

array([1, 0])

In [12]:
j = np.array([0,1 ])
j


array([0, 1])

In [13]:
np.dot(i, j) # orthogonal vectors have a dot product of zero

0
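
Two vectors are orthogonal when their dot product is zero, and orthonormal when each also has unit length. The short check below uses the same basis vectors `i` and `j`; the variable names `is_orthogonal` and `is_orthonormal` are introduced here for illustration only.

```python
import numpy as np

i = np.array([1, 0])
j = np.array([0, 1])

# Orthogonal: the dot product between the two vectors is zero
is_orthogonal = np.dot(i, j) == 0

# Orthonormal: orthogonal, and each vector has an L2 norm of 1
is_orthonormal = bool(
    is_orthogonal
    and np.isclose(np.linalg.norm(i), 1.0)
    and np.isclose(np.linalg.norm(j), 1.0)
)

print(is_orthogonal, is_orthonormal)
```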

### Matrices (Rank 2 Tensors) in NumPy

In [14]:
X = np.array([[25 , 2] , [5 , 26] , [3 , 7]])
X

array([[25,  2],
       [ 5, 26],
       [ 3,  7]])

In [15]:
X.shape

(3, 2)

In [16]:
X.size

6

In [17]:
type(X)

numpy.ndarray

In [18]:
# Select the left column of matrix X (zero-indexed)
X[:,0]

array([25,  5,  3])

In [19]:
# Select the right column of matrix X (zero-indexed)
X[:, 1]

array([ 2, 26,  7])

In [21]:
# Select the middle row of matrix X:
X[1, :]


array([ 5, 26])

In [22]:
# Another slicing-by-index example
X[0:2, 0:2]

array([[25,  2],
       [ 5, 26]])

### Matrices in PyTorch

In [4]:
X_pt = torch.tensor([[25 , 2], [5, 26], [3, 7]])
X_pt

tensor([[25,  2],
        [ 5, 26],
        [ 3,  7]])

In [6]:
X_pt.shape # more Pythonic than TensorFlow's tf.shape()

torch.Size([3, 2])

In [7]:
X_pt[1,:] # N.B.: Python is zero-indexed; written algebra is one-indexed



tensor([ 5, 26])

### Matrices in TensorFlow

In [None]:
X_tf = tf.Variable([[25, 2], [5, 26], [3, 7]])
X_tf

<tf.Variable 'Variable:0' shape=(3, 2) dtype=int32, numpy=
array([[25,  2],
       [ 5, 26],
       [ 3,  7]], dtype=int32)>

In [None]:
tf.rank(X_tf)

<tf.Tensor: shape=(), dtype=int32, numpy=2>

In [None]:
tf.shape(X_tf)

<tf.Tensor: shape=(2,), dtype=int32, numpy=array([3, 2], dtype=int32)>

In [None]:
X_tf[1,:]

<tf.Tensor: shape=(2,), dtype=int32, numpy=array([ 5, 26], dtype=int32)>

### Higher-Rank Tensors

As an example, rank 4 tensors are common for images, where each dimension corresponds to:

1. Number of images in training batch, e.g., 32
2. Image height in pixels, e.g., 28 for [MNIST digits](http://yann.lecun.com/exdb/mnist/)
3. Image width in pixels, e.g., 28
4. Number of color channels, e.g., 3 for full-color images (RGB)

In [8]:
images_pt = torch.zeros([32, 28, 28, 3])

In [9]:
images_pt.shape # the full tensor of zeros is too large to display usefully

torch.Size([32, 28, 28, 3])

In [10]:
images_tf = tf.zeros([32, 28, 28, 3])
images_tf.shape # same dimensions as the PyTorch tensor above

TensorShape([32, 28, 28, 3])
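
The same rank-4 layout can be sketched in NumPy, which makes it easy to inspect how the four dimensions are organized. The name `images_np` is an assumption introduced for this example.

```python
import numpy as np

# Batch of 32 images, each 28 x 28 pixels with 3 color channels
images_np = np.zeros((32, 28, 28, 3))

rank = images_np.ndim               # number of dimensions (the tensor's rank)
n_elements = images_np.size         # total number of stored values
first_image = images_np[0]          # one image: height x width x channels
red_channel = images_np[0, :, :, 0] # one channel of one image: height x width

print(rank, n_elements, first_image.shape, red_channel.shape)
```

Indexing away one dimension at a time lowers the rank by one, so a single image is a rank-3 tensor and a single channel of it is a matrix.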


   

## Segment 2: Common Tensor Operations

In [12]:
x_1 = np.array([[25, 2], [5, 26], [3, 7]])
x_1

array([[25,  2],
       [ 5, 26],
       [ 3,  7]])

In [13]:
x_1.T

array([[25,  5,  3],
       [ 2, 26,  7]])

In [16]:
X_1_py = torch.tensor([[25, 2], [5, 26], [3, 7]])
X_1_py

tensor([[25,  2],
        [ 5, 26],
        [ 3,  7]])

In [21]:
# transpose in pytorch

X_1_py.T

tensor([[25,  5,  3],
        [ 2, 26,  7]])

In [20]:
X_1_tf = tf.Variable([[25, 2], [5, 26], [3, 7]])
X_1_tf

<tf.Variable 'Variable:0' shape=(3, 2) dtype=int32, numpy=
array([[25,  2],
       [ 5, 26],
       [ 3,  7]], dtype=int32)>

In [22]:
tf.transpose(X_1_tf) # transpose function in tensorflow

<tf.Tensor: shape=(2, 3), dtype=int32, numpy=
array([[25,  5,  3],
       [ 2, 26,  7]], dtype=int32)>
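
As a quick sanity check on what transposition does, the sketch below (in NumPy, with `X` as an assumed name for the same matrix) verifies that the shape is swapped, that element $(i, j)$ moves to $(j, i)$, and that transposing twice returns the original matrix.

```python
import numpy as np

X = np.array([[25, 2], [5, 26], [3, 7]])
X_T = X.T

# A 3 x 2 matrix becomes a 2 x 3 matrix
print(X.shape, X_T.shape)

# Every element X[i, j] ends up at X_T[j, i]
elements_swapped = all(
    X[i, j] == X_T[j, i]
    for i in range(X.shape[0])
    for j in range(X.shape[1])
)

# Transposing twice gives back the original matrix
round_trip_ok = np.array_equal(X_T.T, X)

print(elements_swapped, round_trip_ok)
```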

### Basic Arithmetical Properties

Adding a scalar to a tensor, or multiplying a tensor by a scalar, applies the operation to every element, and the tensor's shape is retained:

In [23]:
x_1 *2 # numpy method

array([[50,  4],
       [10, 52],
       [ 6, 14]])

In [24]:
x_1 + 2

array([[27,  4],
       [ 7, 28],
       [ 5,  9]])

In [27]:
x_1*2+2-1

array([[51,  5],
       [11, 53],
       [ 7, 15]])

In [None]:
# PyTorch method
# Python operators are overloaded; could alternatively use torch.mul() or torch.add()
X_1_py*2+2-1

tensor([[51,  5],
        [11, 53],
        [ 7, 15]])

In [28]:
torch.add(torch.mul(X_1_py, 2), 2)

tensor([[52,  6],
        [12, 54],
        [ 8, 16]])

In [29]:
X_1_tf*2+2 # Operators are likewise overloaded; could equally use tf.multiply() and tf.add()

<tf.Tensor: shape=(3, 2), dtype=int32, numpy=
array([[52,  6],
       [12, 54],
       [ 8, 16]], dtype=int32)>

In [30]:
tf.add(tf.multiply(X_1_tf, 2), 2)

<tf.Tensor: shape=(3, 2), dtype=int32, numpy=
array([[52,  6],
       [12, 54],
       [ 8, 16]], dtype=int32)>

If two tensors have the same shape, operations are often applied element-wise by default. This is **not matrix multiplication**, which we'll cover later, but is rather called the **Hadamard product** or simply the **element-wise product**.

The mathematical notation is $A \odot X$

In [31]:
x_1

array([[25,  2],
       [ 5, 26],
       [ 3,  7]])

In [34]:
A = x_1 + 2
A

array([[27,  4],
       [ 7, 28],
       [ 5,  9]])

In [35]:
A + x_1

array([[52,  6],
       [12, 54],
       [ 8, 16]])

In [36]:
A * x_1

array([[675,   8],
       [ 35, 728],
       [ 15,  63]])

PyTorch way


In [39]:
A_pt = X_1_py + 2
A_pt

tensor([[27,  4],
        [ 7, 28],
        [ 5,  9]])

In [38]:
A_pt + X_1_py

tensor([[52,  6],
        [12, 54],
        [ 8, 16]])

In [40]:
A_pt * X_1_py


tensor([[675,   8],
        [ 35, 728],
        [ 15,  63]])

TensorFlow way

In [42]:
A_tf = X_1_tf + 2
A_tf

<tf.Tensor: shape=(3, 2), dtype=int32, numpy=
array([[27,  4],
       [ 7, 28],
       [ 5,  9]], dtype=int32)>

In [43]:
A_tf + X_1_tf

<tf.Tensor: shape=(3, 2), dtype=int32, numpy=
array([[52,  6],
       [12, 54],
       [ 8, 16]], dtype=int32)>

In [44]:
A_tf * X_1_tf

<tf.Tensor: shape=(3, 2), dtype=int32, numpy=
array([[675,   8],
       [ 35, 728],
       [ 15,  63]], dtype=int32)>
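
To summarize the element-wise behavior shown in the three libraries, here is a small NumPy sketch. It assumes the same matrices as above (named `X` and `A` here) and checks that the Hadamard product keeps the input shape, does not depend on operand order, and matches `np.multiply`.

```python
import numpy as np

X = np.array([[25, 2], [5, 26], [3, 7]])
A = X + 2

# Hadamard (element-wise) product: each entry is A[i, j] * X[i, j]
hadamard = A * X

same_shape = hadamard.shape == X.shape                        # shape is retained
commutative = np.array_equal(A * X, X * A)                    # order does not matter
matches_multiply = np.array_equal(hadamard, np.multiply(A, X))

print(hadamard)
print(same_shape, commutative, matches_multiply)
```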

### Reduction

Calculating the sum across all elements of a tensor is a common operation. For example:

* For vector ***x*** of length *n*, we calculate $\sum_{i=1}^{n} x_i$
* For matrix ***X*** with *m* by *n* dimensions, we calculate $\sum_{i=1}^{m} \sum_{j=1}^{n} X_{i,j}$

NumPy way

In [45]:
x_1

array([[25,  2],
       [ 5, 26],
       [ 3,  7]])

In [46]:
x_1.sum()

68

In [49]:
x_1.sum(axis=0) # summing over all rows

array([33, 35])

In [None]:
x_1.sum(axis=1) # summing over all columns

array([27, 31, 10])

PyTorch way

In [47]:
torch.sum(X_1_py)

tensor(68)

In [50]:
torch.sum(X_1_py, 0) # summing over all rows

tensor([33, 35])

In [52]:
torch.sum(X_1_py, 1)#summing over all columns

tensor([27, 31, 10])

TensorFlow way

In [48]:
tf.reduce_sum(X_1_tf)

<tf.Tensor: shape=(), dtype=int32, numpy=68>

In [53]:
tf.reduce_sum(X_1_tf, 0) # summing over all rows

<tf.Tensor: shape=(2,), dtype=int32, numpy=array([33, 35], dtype=int32)>

In [54]:
tf.reduce_sum(X_1_tf, 1) #summing over all columns

<tf.Tensor: shape=(3,), dtype=int32, numpy=array([27, 31, 10], dtype=int32)>

Many other operations can be applied with reduction along all or a selection of axes, e.g.:

* maximum
* minimum
* mean
* product

They're fairly straightforward and used less often than summation, so you're welcome to look them up in library docs if you ever need them.
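
For reference, the reductions listed above work the same way as summation in NumPy. The sketch below assumes the same 3-by-2 matrix (named `X` here) and applies each reduction either to all elements or along one axis.

```python
import numpy as np

X = np.array([[25, 2], [5, 26], [3, 7]])

# Reductions over all elements
overall_max = X.max()     # largest entry
overall_min = X.min()     # smallest entry
overall_mean = X.mean()   # sum of entries divided by their count
overall_prod = X.prod()   # product of all entries

# Reductions along a single axis
col_max = X.max(axis=0)     # collapses rows: one value per column
row_min = X.min(axis=1)     # collapses columns: one value per row
row_mean = X.mean(axis=1)
col_prod = X.prod(axis=0)

print(overall_max, overall_min, overall_mean, overall_prod)
print(col_max, row_min, row_mean, col_prod)
```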

### The Dot Product

If we have two vectors (say, ***x*** and ***y***) with the same length *n*, we can calculate the dot product between them. This is denoted in several different ways, including the following:

* $x \cdot y$
* $x^Ty$
* $\langle x,y \rangle$

Regardless of which notation you use (I prefer the first), the calculation is the same; we calculate products in an element-wise fashion and then sum reductively across the products to a scalar value. That is, $x \cdot y = \sum_{i=1}^{n} x_i y_i$

The dot product is ubiquitous in deep learning: It is performed at every artificial neuron in a deep neural network, which may be made up of millions (or orders of magnitude more) of these neurons.

NumPy way

In [59]:
m = np.array ([25, 2, 5])
m

array([25,  2,  5])

In [60]:
n = np.array([0, 1, 2])
n

array([0, 1, 2])

In [61]:
25*0 + 2*1 + 5*2

12

In [62]:
np.dot(m,n)

12

PyTorch way

In [64]:
m_pt = torch.tensor([25, 2, 5])
m_pt

tensor([25,  2,  5])

In [65]:
n_pt = torch.tensor([0, 1, 2])
n_pt

tensor([0, 1, 2])

In [66]:
np.dot(m_pt, n_pt) # NumPy's dot() also accepts (CPU) PyTorch tensors

12

In [67]:
torch.dot(torch.tensor([25,2,5.0]), torch.tensor([0,1,2.0]))

tensor(12.)

TensorFlow way

In [68]:
m_tf = tf.Variable([25, 2, 5])
m_tf

<tf.Variable 'Variable:0' shape=(3,) dtype=int32, numpy=array([25,  2,  5], dtype=int32)>

In [69]:
n_tf = tf.Variable([0,1,2])
n_tf

<tf.Variable 'Variable:0' shape=(3,) dtype=int32, numpy=array([0, 1, 2], dtype=int32)>

In [70]:
tf.reduce_sum(tf.multiply(m_tf,n_tf))

<tf.Tensor: shape=(), dtype=int32, numpy=12>
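
The TensorFlow cell above spells out the definition: multiply element-wise, then sum. The NumPy sketch below, using the same example vectors `m` and `n`, shows that this two-step form agrees with `np.dot` and with the `@` operator, and that the dot product of a vector with itself is its squared $L^2$ norm.

```python
import numpy as np

m = np.array([25, 2, 5])
n = np.array([0, 1, 2])

# Definition: element-wise products, then a sum reduction to a scalar
dot_by_definition = np.sum(m * n)

# Equivalent built-in forms
dot_function = np.dot(m, n)
dot_operator = m @ n

# A vector dotted with itself gives its squared L2 norm
squared_l2_of_m = np.dot(m, m)

print(dot_by_definition, dot_function, dot_operator, squared_l2_of_m)
```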