# Lesson 12

As always, the course notebooks are [here](https://github.com/fastai/course22p2).

## CLIP Interrogator

The [CLIP Interrogator](https://huggingface.co/spaces/pharmapsychotic/CLIP-Interrogator) tries to find a good prompt for creating an image similar to the one you provide (or rather, a good caption for it).

* It doesn't really reproduce the image, and it can't. A function mapping images to text cannot be invertible, because it is a many-to-one mapping. (But Stable Diffusion gives an approximate inversion!)

* CLIP Interrogator uses [BLIP](https://arxiv.org/abs/2201.12086) along with a hardwired set of artists, images, styles, movements, etc. (see the [source code](https://github.com/pharmapsychotic/clip-interrogator/tree/main/clip_interrogator)). It searches through these lists, combining them somehow with the BLIP results, to find the best match by comparing the text encoding to the image encoding.


* We can use these text prompts to generate images similar in nature to the one provided.
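The "best match" step can be illustrated with a toy example. This is only a sketch of ranking candidate phrases by cosine similarity between embeddings, not CLIP Interrogator's actual algorithm. The embedding vectors and candidate phrases below are made up for illustration, and real embeddings would come from CLIP's image and text encoders.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two 1-D vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical embeddings (assumption): stand-ins for CLIP encoder outputs.
image_embedding = np.array([0.9, 0.1, 0.3])
candidate_embeddings = {
    "a photo of a dog": np.array([0.8, 0.2, 0.4]),
    "an oil painting of a castle": np.array([0.1, 0.9, 0.2]),
    "a watercolor of a boat": np.array([0.2, 0.3, 0.9]),
}

# Score every candidate text against the image and keep the closest one.
scores = {text: cosine_similarity(image_embedding, emb)
          for text, emb in candidate_embeddings.items()}
best_prompt = max(scores, key=scores.get)
print(best_prompt, round(scores[best_prompt], 3))
```

With these toy vectors, the first phrase wins because its embedding points in nearly the same direction as the image embedding.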


## Matrix multiplication continued

Let's get back to where we left off.

In [1]:
import numpy as np
import torch
from minai import mnist_load
from pathlib import Path

path_data = Path('data')
x_train, y_train, x_valid, y_valid = mnist_load.load_data(path_data)


In [2]:
# move these into tensors
x_train,y_train,x_valid,y_valid = map(torch.tensor, (x_train,y_train,x_valid,y_valid))


In [3]:

weights = torch.randn(784,10)
biases = torch.randn(10)
m1 = x_valid[:10]
m2 = weights


Naive Python version:

In [13]:
def matmul_naive(a,b):
    (ar,ac),(br,bc) = a.shape,b.shape
    c = np.zeros((ar, bc))
    for i in range(ar):
        for j in range(bc):
            for k in range(ac): c[i,j] += a[i,k] * b[k,j]
    return c
m1np = m1.numpy()
m2np = m2.numpy()
%timeit -n 4 _=matmul_naive(m1np,m2np)

19.1 ms ± 409 μs per loop (mean ± std. dev. of 7 runs, 4 loops each)


*NOTE*: the version from the video did not first convert the tensors to NumPy arrays. If you pass in tensors, it is much slower (about 500 ms!), because each element access creates a tensor object that then has to be converted before it is added into the NumPy array `c`.


The fastest version so far uses broadcasting, which gets us down to a single `for` loop.

In [7]:

# reminder of our fastest so far, using broadcasting:

def matmul(a,b):
    (ar,ac),(br,bc) = a.shape,b.shape
    c = torch.zeros(ar, bc)
    for i in range(ar):
        # c[i,j] = (a[i,:] * b[:,j]).sum()      # previous version
        c[i] = (a[i,:,None] * b).sum(dim=0)     # broadcast version
    return c
matmul(m1, m2)
%timeit -n 5 _=matmul(m1, m2)

138 μs ± 40.3 μs per loop (mean ± std. dev. of 7 runs, 5 loops each)


This is about 140 times faster than the naive version (19.1 ms vs. 138 μs).
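To see why the broadcast line yields a whole row of the result at once, it can help to look at the intermediate shapes. The sketch below uses NumPy with small made-up matrices. The indexing is the same as in the PyTorch version, except that NumPy spells the reduction argument `axis` rather than `dim`.

```python
import numpy as np

a = np.arange(6.0).reshape(2, 3)    # shape (2, 3)
b = np.arange(12.0).reshape(3, 4)   # shape (3, 4)

i = 0
col = a[i, :, None]            # shape (3, 1): row i of `a` turned into a column
products = col * b             # broadcasts to (3, 4): products[k, j] = a[i, k] * b[k, j]
row_i = products.sum(axis=0)   # sum over k -> shape (4,), i.e. row i of a @ b

print(col.shape, products.shape, row_i.shape)
print(np.allclose(row_i, (a @ b)[i]))
```

The trailing `None` adds a unit axis, so each element of row `i` is multiplied across the matching row of `b`, and summing over the first axis collapses the `k` index.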

And for the full matrix:

In [14]:
x_trainnp = x_train.numpy()
weightsnp = weights.numpy()
%time  _=matmul_naive(x_trainnp, weightsnp)

CPU times: user 1min 35s, sys: 0 ns, total: 1min 35s
Wall time: 1min 35s


In [8]:
%timeit -n 5 _=matmul(x_train, weights)

449 ms ± 7.26 ms per loop (mean ± std. dev. of 7 runs, 5 loops each)


In [15]:
(60+35)/.5

190.0

The speed-up is about 200 times.

### Einstein Summation

Einstein summation is a compact notation for combining products and sums in a general way. The key rules are:

  - Repeating letters between input arrays means that values along those axes will be multiplied together.

  - Omitting a letter from the output means that values along that axis will be summed.

  - Can be used to express many matrix manipulations:
      - Transpose: `ij -> ji` or just `ji`
      - Inner product: `i,i`
      - Trace: `ii`
      - Matrix multiplication: `ik,kj -> ij`

  - Introduced by Albert Einstein in 1916 in his paper on the General Theory of Relativity. In GR there are many cases where (mathematical) tensors need to be summed over, so he adopted the convention that repeated indices are summed over. This is now standard in physics when dealing with tensors.
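The rules and examples above can be checked on small arrays. This sketch uses NumPy's `einsum`, which accepts the same subscript strings. The arrays and the helper `einsum_matmul_loops` are made up for illustration. The helper spells out what `ik,kj->ij` means as explicit loops.

```python
import numpy as np

a = np.arange(9.0).reshape(3, 3)
b = np.arange(9.0, 18.0).reshape(3, 3)
u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

transpose = np.einsum('ij->ji', a)       # swap the two axes
inner = np.einsum('i,i', u, v)           # i repeated and omitted: multiply, then sum
trace = np.einsum('ii', a)               # same index twice on one array: the diagonal, summed
product = np.einsum('ik,kj->ij', a, b)   # k repeated and omitted: matrix multiplication

def einsum_matmul_loops(a, b):
    """What 'ik,kj->ij' means, written as explicit loops."""
    out = np.zeros((a.shape[0], b.shape[1]))
    for i in range(a.shape[0]):
        for j in range(b.shape[1]):
            for k in range(a.shape[1]):   # k is absent from the output, so it is summed
                out[i, j] += a[i, k] * b[k, j]
    return out

print(np.allclose(transpose, a.T), inner, trace)
print(np.allclose(product, a @ b), np.allclose(product, einsum_matmul_loops(a, b)))
```

The loop version is the same triple loop as the naive matrix multiplication earlier; the subscript string is just a compact way of stating which indices are kept and which are summed.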

In [5]:
# einsum version
def matmul(a,b): return torch.einsum('ik,kj->ij', a, b)
%timeit -n 50 _=matmul(x_train, weights)

9.53 ms ± 723 μs per loop (mean ± std. dev. of 7 runs, 50 loops each)


Using PyTorch's native einsum, we get roughly another 50x speed-up (449 ms down to 9.53 ms)!

### PyTorch op

We can also express this using `matmul` or `@` in PyTorch directly:

In [7]:
%timeit -n 50 _ = x_train @ weights

9.27 ms ± 642 μs per loop (mean ± std. dev. of 7 runs, 50 loops each)


18:35

If CUDA is available, we can speed this up even more!

In [29]:
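# CUDA version: move both operands to the GPU
cuda0 = torch.device('cuda:0')
weights_gpu = weights.to(cuda0)
x_train_gpu = x_train.to(cuda0)   # the full training set, as in the CPU timings above
# warm-up run so one-time CUDA initialization is not timed
_ = matmul(x_train_gpu, weights_gpu)
%timeit -n 50 _=matmul(x_train_gpu, weights_gpu)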

22.3 μs ± 806 ns per loop (mean ± std. dev. of 7 runs, 50 loops each)
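One caveat when timing on the GPU: CUDA kernels are launched asynchronously, so a timer can stop before the queued work has finished, and a figure like the one above may understate the real compute time. A minimal sketch of a timing helper that calls `torch.cuda.synchronize()` before reading the clock follows. The helper name `time_matmul` and the matrix sizes are illustrative assumptions, not part of the lesson, and the code falls back to the CPU when no GPU is present.

```python
import time
import torch

def time_matmul(a, b, n_runs=50):
    """Average seconds per `a @ b`, synchronizing so GPU work is fully counted."""
    on_gpu = a.is_cuda
    out = a @ b                              # warm-up run, not timed
    if on_gpu:
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_runs):
        out = a @ b
    if on_gpu:
        torch.cuda.synchronize()             # wait for all queued kernels to finish
    return (time.perf_counter() - start) / n_runs

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
a = torch.randn(1000, 784, device=device)    # made-up sizes for illustration
b = torch.randn(784, 10, device=device)
seconds = time_matmul(a, b)
print(f'{seconds * 1e6:.1f} microseconds per matmul on {device}')
```

Comparing the result with and without the `synchronize()` calls on a GPU machine shows how much of the measured time is launch overhead.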


And finally we get roughly another 400x from using CUDA (9.53 ms down to 22.3 μs), so this is about 4 million times faster than the naive Python version (1 min 35 s).
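As a quick check of the arithmetic, the step-by-step and overall speed-ups can be recomputed from the timings recorded above for the full training matrix. The dictionary labels are just names chosen here.

```python
# Timings recorded in the cells above, in seconds.
timings = {
    'naive loops': 95.0,       # 1 min 35 s
    'broadcasting': 449e-3,
    'einsum': 9.53e-3,
    'cuda': 22.3e-6,
}

names = list(timings)
# Speed-up of each version relative to the one before it.
step_speedups = {cur: timings[prev] / timings[cur]
                 for prev, cur in zip(names, names[1:])}
total_speedup = timings['naive loops'] / timings['cuda']

for name, ratio in step_speedups.items():
    print(f'{name}: {ratio:.0f}x faster than the previous version')
print(f'overall: {total_speedup:.2e}x')
```

These ratios come from single timing runs on one machine, so they indicate orders of magnitude, not precise factors.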