# NumPy Broadcasting

> The term broadcasting describes how numpy treats arrays with 
different shapes during arithmetic operations. Subject to certain 
constraints, the smaller array is “broadcast” across the larger 
array so that they have compatible shapes. Broadcasting provides a 
means of vectorizing array operations so that looping occurs in C
instead of Python. It does this without making needless copies of 
data and usually leads to efficient algorithm implementations.

When operating on two arrays, NumPy (like PyTorch) compares their shapes element-wise. It starts with the trailing (rightmost) dimensions and works its way left. Two dimensions are compatible when:

- they are equal, or
- one of them is 1, or
- one of them is missing because the smaller array has run out of dimensions. A missing leading (outermost) dimension is treated as 1.
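The compatibility check can be sketched as a small function. This is a minimal sketch, not NumPy's actual implementation: `broadcast_shape` is a hypothetical helper written for these notes, and it assumes that a missing leading dimension is treated as 1.

```python
import numpy as np

def broadcast_shape(shape_a, shape_b):
    """Return the broadcast shape, or raise ValueError (hypothetical helper)."""
    # Left-pad the shorter shape with 1s so both have the same number of axes.
    ndim = max(len(shape_a), len(shape_b))
    a = (1,) * (ndim - len(shape_a)) + tuple(shape_a)
    b = (1,) * (ndim - len(shape_b)) + tuple(shape_b)
    result = []
    for dim_a, dim_b in zip(a, b):
        if dim_a == dim_b or dim_b == 1:
            result.append(dim_a)
        elif dim_a == 1:
            result.append(dim_b)
        else:
            raise ValueError(f"shapes {shape_a} and {shape_b} are not broadcastable")
    return tuple(result)

print(broadcast_shape((3, 3), (3,)))             # (3, 3)
print(broadcast_shape((2, 3, 2, 2), (3, 1, 1)))  # (2, 3, 2, 2)

# Cross-check against the shape NumPy itself computes.
numpy_shape = np.broadcast(np.empty((3, 3)), np.empty((3,))).shape
print(numpy_shape == broadcast_shape((3, 3), (3,)))  # True
```

Calling `broadcast_shape((3, 2, 2), (3,))` raises a `ValueError`, because the trailing dimensions 2 and 3 differ and neither is 1.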

Once every pair of dimensions has passed this check (otherwise broadcasting fails with an error), each array is stretched along its size-1 and missing dimensions to match the other. For example, to broadcast a scalar, viewed as shape (1, 1, 1), into a 2x3x3 tensor, the single value is repeated to fill a 3x3 matrix, and that matrix is repeated twice. The result is a new 2x3x3 tensor: two 3x3 matrices filled with the scalar.

If you have a tensor of shape (3, 1, 1) and broadcast it into (3, 3, 3), each of its three elements
is broadcast into a 3x3 matrix. Thus, you end up with a 3-d tensor made of three constant matrices.
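Both cases can be checked with `np.broadcast_to`. This is a small sketch with made-up values (5 and 1, 2, 3). It also looks at the strides of the result, which should be 0 along the stretched axes, since the values are reused rather than copied.

```python
import numpy as np

scalar = np.array(5)                        # shape ()
filled = np.broadcast_to(scalar, (2, 3, 3))

v = np.array([1, 2, 3]).reshape(3, 1, 1)    # shape (3, 1, 1)
cube = np.broadcast_to(v, (3, 3, 3))

print(filled.shape, cube.shape)  # (2, 3, 3) (3, 3, 3)
print(cube[1])                   # the 3x3 matrix filled with 2
print(cube.strides)              # stride 0 on the two stretched axes
```

The result of `np.broadcast_to` is a read-only view, so this is only a way to inspect what an arithmetic operation would do implicitly.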

In [21]:
import numpy as np

m = np.array([[1, 2, 3], [4,5,6], [7,8,9]])
print(m); m.shape

[[1 2 3]
 [4 5 6]
 [7 8 9]]


(3, 3)

## Most basic example

In [36]:
2*m

array([[ 2,  4,  6],
       [ 8, 10, 12],
       [14, 16, 18]])

The scalar 2 has no dimensions of its own, so it is treated as if it had shape (1, 1), which is compatible with (3, 3) because every dimension of size 1 can be stretched. The scalar is broadcast into a 3x3 matrix of 2s, and then the two matrices are multiplied element-wise.

In [31]:
c = np.array([10, 20, 30])
print(c); print(c.shape)

[10 20 30]
(3,)


## Adding a vector to a matrix

In [4]:
m + c

array([[11, 22, 33],
       [14, 25, 36],
       [17, 28, 39]])

In this case, the trailing dimensions are compatible (3 and 3), but c has no leading dimension to match the rows of m. NumPy treats the missing dimension as 1, so c is repeated across the rows to make the shapes compatible.

In [28]:
print(c); print("\n", np.broadcast_to(c, m.shape)); np.broadcast_to(c, m.shape) + m

[10 20 30]

 [[10 20 30]
 [10 20 30]
 [10 20 30]]


array([[11, 22, 33],
       [14, 25, 36],
       [17, 28, 39]])
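The implicit step can also be written out by hand. In this sketch the missing leading axis of c is added explicitly with `None`, which should give the same result as letting NumPy do it.

```python
import numpy as np

m = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
c = np.array([10, 20, 30])

row = c[None, :]                        # shape (1, 3): the missing axis made explicit
print(row.shape)                        # (1, 3)
print(np.array_equal(m + c, m + row))   # True
```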

### Column vector to a matrix

In [40]:
c[:, None].shape # i.e., indexing with None adds a new axis of size 1

(3, 1)

In [43]:
c[:, None]

array([[10],
       [20],
       [30]])

If we add this column vector to m, the shapes (3, 1) and (3, 3) are compatible: the trailing dimensions are 1 and 3, and the leading dimensions are both 3. The single column is repeated three times to fill the 3x3 shape:

In [41]:
np.broadcast_to(c[:, None], m.shape)

array([[10, 10, 10],
       [20, 20, 20],
       [30, 30, 30]])

In [42]:
np.broadcast_to(c[:, None], m.shape) + m

array([[11, 12, 13],
       [24, 25, 26],
       [37, 38, 39]])

### Over channels of pixels

Let's imagine we have an image with 3 channels (RGB):

In [62]:
x = np.array([1, 2, 3])
image = np.broadcast_to(x[:, None, None], (3, 2, 2))
image

array([[[1, 1],
        [1, 1]],

       [[2, 2],
        [2, 2]],

       [[3, 3],
        [3, 3]]])

If we want to divide the first channel by 4, the second by 6 and the third by 8:

In [64]:
dividers = np.array([4, 6, 8])
image/dividers[:, None, None]

array([[[0.25      , 0.25      ],
        [0.25      , 0.25      ]],

       [[0.33333333, 0.33333333],
        [0.33333333, 0.33333333]],

       [[0.375     , 0.375     ],
        [0.375     , 0.375     ]]])

Both tensors are compatible along every axis:

In [73]:
print(dividers[:, None, None].shape, image.shape)

(3, 1, 1) (3, 2, 2)


Thus, the two trailing dimensions of size 1 are stretched so that each of the elements (4, 6, 8) fills a 2-d tensor of shape 2x2. Then, element-wise division is applied to the two 3-d tensors.
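To convince ourselves that this is really a per-channel division, here is a sketch that compares the broadcast version with an explicit Python loop over the channels. It rebuilds the same image and dividers as above.

```python
import numpy as np

image = np.broadcast_to(np.array([1, 2, 3])[:, None, None], (3, 2, 2))
dividers = np.array([4, 6, 8])

# Explicit loop: divide channel k by dividers[k], then stack the channels back.
looped = np.stack([image[k] / dividers[k] for k in range(image.shape[0])])

broadcasted = image / dividers[:, None, None]
print(np.allclose(looped, broadcasted))  # True
```

The broadcast version gives the same numbers without the Python-level loop.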

### Many images

In [75]:
np.array([image, image]).shape

(2, 3, 2, 2)

In [70]:
np.array([image, image])/dividers[:, None, None]

array([[[[0.25      , 0.25      ],
         [0.25      , 0.25      ]],

        [[0.33333333, 0.33333333],
         [0.33333333, 0.33333333]],

        [[0.375     , 0.375     ],
         [0.375     , 0.375     ]]],


       [[[0.25      , 0.25      ],
         [0.25      , 0.25      ]],

        [[0.33333333, 0.33333333],
         [0.33333333, 0.33333333]],

        [[0.375     , 0.375     ],
         [0.375     , 0.375     ]]]])

Note that broadcasting is straightforward when the two shapes differ only in the outermost (leading) dimensions, as above: the smaller tensor is simply repeated as many times as necessary. However, this only works for leading dimensions. If the trailing dimensions do not line up, broadcasting fails:

In [74]:
image / dividers # broadcasting cannot continue: trailing dimensions 2 and 3 are incompatible

ValueError: operands could not be broadcast together with shapes (3,2,2) (3,) 
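One way around this error, assuming we are free to rearrange the data, is to put the channel axis last so that it lines up with the trailing dimension of dividers. This is a sketch of that alternative, not the only fix (adding axes with `None`, as above, works too).

```python
import numpy as np

image = np.broadcast_to(np.array([1, 2, 3])[:, None, None], (3, 2, 2))
dividers = np.array([4, 6, 8])

# Move the channel axis to the end: (3, 2, 2) -> (2, 2, 3).
channels_last = np.moveaxis(image, 0, -1)
result = channels_last / dividers   # (2, 2, 3) with (3,): trailing dimensions match
print(result.shape)                 # (2, 2, 3)

# Same numbers as the channel-first version, only with the axes reordered.
expected = image / dividers[:, None, None]
print(np.allclose(np.moveaxis(result, -1, 0), expected))  # True
```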

## Conclusion

1- Broadcasting is very simple when all the trailing dimensions are compatible and the smaller array is only missing leading dimensions:

### Examples

In [78]:
print(np.array([image, image]).shape, dividers[:, None, None].shape)

(2, 3, 2, 2) (3, 1, 1)


In [79]:
np.array([image, image])/ dividers[:, None, None]

array([[[[0.25      , 0.25      ],
         [0.25      , 0.25      ]],

        [[0.33333333, 0.33333333],
         [0.33333333, 0.33333333]],

        [[0.375     , 0.375     ],
         [0.375     , 0.375     ]]],


       [[[0.25      , 0.25      ],
         [0.25      , 0.25      ]],

        [[0.33333333, 0.33333333],
         [0.33333333, 0.33333333]],

        [[0.375     , 0.375     ],
         [0.375     , 0.375     ]]]])

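2- If any pair of aligned dimensions differs and neither of them is 1, broadcasting is impossible. Here (3, 1) is aligned with the last two dimensions of (3, 2, 2), so 3 is compared with 2: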

In [80]:
image / dividers[:, None]

ValueError: operands could not be broadcast together with shapes (3,2,2) (3,1) 

3- Once the dimensions are compatible, broadcasting could not be simpler: repeat the values along every size-1 or missing dimension as many times as necessary to fill the target shape.

In [81]:
np.broadcast_to(dividers[:, None, None], image.shape) # each divider fills a 2x2 matrix

array([[[4, 4],
        [4, 4]],

       [[6, 6],
        [6, 6]],

       [[8, 8],
        [8, 8]]])

In [83]:
np.broadcast_to(dividers[:, None, None], np.array([image, image]).shape)

array([[[[4, 4],
         [4, 4]],

        [[6, 6],
         [6, 6]],

        [[8, 8],
         [8, 8]]],


       [[[4, 4],
         [4, 4]],

        [[6, 6],
         [6, 6]],

        [[8, 8],
         [8, 8]]]])

# Regularization

I have read a lot about regularization, but the example Jeremy used is what really made me understand it: regularization is a prior centered on zero. You can change that prior by playing around with what gets passed to the loss function.

For example, he used a combination of Naive Bayes and Logistic Regression. Naive Bayes is equivalent to multiplying the bag of words (a one-hot-encoded matrix) by a prior and then taking the product of these probabilities as if the features were independent (hence Naive). With log probabilities, the product becomes a sum. That is, the matrix gets multiplied by some weights (the prior) and then summed: essentially, a linear model.
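Here is a minimal sketch of Naive Bayes written as a linear model. The notes do not give formulas, so the specifics are assumptions: the toy data is made up, the weights r are smoothed per-term log ratios between the two classes, and b is the log ratio of the class frequencies.

```python
import numpy as np

# Toy binarized bag of words: 4 documents x 3 vocabulary terms (made-up data).
x = np.array([[1, 1, 0],
              [1, 0, 0],
              [0, 1, 1],
              [0, 0, 1]])
y = np.array([1, 1, 0, 0])   # made-up class labels

# Smoothed per-term frequencies within each class (add-one smoothing).
p = (x[y == 1].sum(axis=0) + 1) / ((y == 1).sum() + 1)
q = (x[y == 0].sum(axis=0) + 1) / ((y == 0).sum() + 1)

r = np.log(p / q)                               # per-term log ratio: the "prior" weights
b = np.log((y == 1).mean() / (y == 0).mean())   # log ratio of class frequencies

scores = x @ r + b   # a sum of log ratios, i.e. the log of a product of ratios
print(scores > 0)    # predicted class 1 where the score is positive
```

The prediction is just a matrix-vector product plus a constant, which is exactly the shape of a linear model.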

What if we learned those weights from the data instead, with a logistic regression? If we regularize, we tell the algorithm that our prior is that all those weights should be zero.

What if, instead, we learned the weights, added a constant (say 0.5) and multiplied by the priors? Then the effective weights used to predict will be roughly proportional to the priors, because the learned weights tend to be zero. Thus, if we implement L2 regularization, the default model (with weights around zero) will use the priors from Naive Bayes. If you want to negate those priors, you will have to make the weights negative, but that will cost you in the loss function because of the L2 penalty on the weights.
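A small numerical sketch of that idea. The numbers in r, the penalty strength, and the helper names `effective_weights` and `l2_penalty` are all made up for illustration. Only the structure (learned weights plus a constant, times the priors, with an L2 penalty on the learned weights) comes from the description above.

```python
import numpy as np

r = np.array([1.1, 0.0, -1.1])   # hypothetical Naive Bayes priors (log ratios)
offset = 0.5                     # the constant added to the learned weights

def effective_weights(w, r, offset):
    # Weights actually used to predict: (learned weights + constant) * priors.
    return (w + offset) * r

def l2_penalty(w, strength=0.1):
    # L2 penalty on the learned weights only, not on the priors.
    return strength * np.sum(w ** 2)

w_zero = np.zeros_like(r)        # where regularization pulls the weights
w_flip = np.full_like(r, -1.0)   # weights that negate the priors

print(effective_weights(w_zero, r, offset), l2_penalty(w_zero))
print(effective_weights(w_flip, r, offset), l2_penalty(w_flip))
```

With zero weights the model predicts with 0.5 times the priors at no penalty, while flipping the sign of the priors requires negative weights and pays a non-zero penalty.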

Thus, Jeremy implemented an algorithm that brought prior information into the model and made the model pay attention to it through the use of regularization.

Very cool!

# Embeddings

> About embeddings, like word embeddings (word2vec or GloVe or whatever): people love to make them sound like this amazing new complex neural net thing, right? They're not. Embedding means making a multiplication by a one-hot-encoded matrix faster by replacing it with a simple array. Where possible, it is best to treat things as categorical variables: it is easier for a neural net to find a functional form that exploits the difference between values.

Imagine a one-hot-encoded matrix, i.e., a matrix of zeros and ones. Multiplying that matrix by a vector is equivalent to, for each row, zeroing out the elements of the vector where the row has zeros and then summing what remains.

In [104]:
x = np.array([[0, 1, 1], [1, 0, 0], [1, 1, 1]])
x

array([[0, 1, 1],
       [1, 0, 0],
       [1, 1, 1]])

In [105]:
x @ np.array([5, 3, 2])

array([ 5,  5, 10])

However, when you have a massive one-hot-encoded matrix, performing so many multiplications is inefficient. There is a workaround: store the massive matrix as a sparse matrix, i.e., store only the positions of the components that are not zero. Then index your vector with these positions:

In [111]:
np.array([5, 3, 2])[[1, 2]].sum() # first row

5

In [114]:
np.array([5, 3, 2])[[0]].sum() # second row

5

In [113]:
np.array([5, 3, 2])[[0, 1, 2]].sum() # third row

10
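The three cells above can be generalized to every row at once. This sketch uses plain NumPy with `np.flatnonzero` to collect the positions of the ones. A real sparse-matrix library would store these positions more compactly, so treat this only as an illustration of the idea.

```python
import numpy as np

x = np.array([[0, 1, 1], [1, 0, 0], [1, 1, 1]])
v = np.array([5, 3, 2])

# Keep only the column positions of the ones in each row.
positions = [np.flatnonzero(row) for row in x]

# Index and sum instead of multiplying by zeros.
sparse_result = np.array([v[idx].sum() for idx in positions])
print(sparse_result)                          # [ 5  5 10]
print(np.array_equal(sparse_result, x @ v))   # True
```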

Thus, we have found a representation of the one-hot-encoded matrix that multi-layered models can use.

## Using embeddings on Neural Networks

By doing this, we can spare our SGD models a lot of needless suffering: we do not have to use ordinal variables, which make it very difficult for them to find information gain, nor endure the computational burden of a one-hot-encoded matrix.
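As a sketch of what an embedding layer boils down to: looking up rows of a weight matrix by category index gives the same result as multiplying by the one-hot matrix. The sizes, the random weights and the example indices below are made up. In a real network the matrix would be learned by SGD.

```python
import numpy as np

rng = np.random.default_rng(0)
n_categories, emb_dim = 4, 2
embedding = rng.normal(size=(n_categories, emb_dim))  # stand-in for learned weights

idx = np.array([2, 0, 2, 3])          # a categorical column stored as integer codes
one_hot = np.eye(n_categories)[idx]   # shape (4, 4): a single 1 per row

via_matmul = one_hot @ embedding      # multiplication by the one-hot matrix
via_lookup = embedding[idx]           # plain array indexing, no multiplications
print(np.allclose(via_matmul, via_lookup))  # True
```

The lookup never builds the one-hot matrix at all, which is where the savings come from when the number of categories is large.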