In [1]:
my = 'my'
other = 'other\'s'
print(f"I'm trying to write about {my} ideas, and most importantly {other} ideas!")

I'm trying to write about my ideas, and most importantly other's ideas!


<center><h1> DALL.E </h1></center>

I started to dig into the github repo of CLIP, Contrastive Language-Image Pretraining. It is developed together with DALL.E and I would like to share my digging journey with all of you today. So let's dive in.

In [4]:
base_url = 'https://github.com' + '/openai' + '/CLIP'
base_url

'https://github.com/openai/CLIP'

You must definitely read the first paragraph! They are happy to announce that their neural network has competitive "zero-shot" test results compared to ResNet50 on the ImageNet dataset, and most importantly they don't use any of its 1.28 million labeled images. So I decided to dig into their code and find out more about their model.

In [5]:
clip_url = base_url + '/tree/main/clip'
clip_url

'https://github.com/openai/CLIP/tree/main/clip'

The first file was a Bummer!

In [8]:
file1_url = clip_url + '/__init__.py'
file1_url

'https://github.com/openai/CLIP/tree/main/clip/__init__.py'

I'm not even sure why they use .clip, but we will figure that out later. Let's look at the second file.

In [9]:
file2_url = clip_url + '/bpe_simple_vocab_16e6.txt.gz'
file2_url

'https://github.com/openai/CLIP/tree/main/clip/bpe_simple_vocab_16e6.txt.gz'

Well, this is a gzip-compressed file, so we have to decompress it and read its content.

In [20]:
# !pip install requests
import requests
import gzip
import io

# Send a GET request to the URL
file2_raw_url = 'https://raw.githubusercontent.com/openai/CLIP/main/clip/bpe_simple_vocab_16e6.txt.gz'
response = requests.get(file2_raw_url)

# Check if the request was successful
if response.status_code == 200:
    # Decompress the gzip file and read its content
    with gzip.GzipFile(fileobj=io.BytesIO(response.content)) as f:
        content = f.read().decode('utf-8')
        print(type(content))
        print(len(content))
        print(content[:100])
else:
    print(f"Failed to retrieve the file: {response.status_code}")


<class 'str'>
3130099
"bpe_simple_vocab_16e6.txt#version: 0.2
i n
t h
a n
r e
a r
e r
th e</w>
in g</w>
o u
o n
s t
o r
e 


So if you are familiar with Byte-Pair Encoding (BPE) you will notice that every line of this text file has 2 characters. Well, that's not quite correct, because as you dig deeper you discover the line 'th e</w>'. Then you try hard to remember BPE and how it worked, but you only recall some keywords like token and subword. So each line consists of 2 tokens that get merged into one, and the more frequently a pair appears in the training text, the sooner it shows up in the file. Let's go really deep in the content of this text file:

In [23]:
print(content[-100:])

soccer saturday</w>
so zone</w>
smid t</w>
sm city
sli mey</w>
sin claire</w>
sd reader</w>
scare d



Now we are sure that our memory didn't fail us, but we should definitely dig deep into BPE sometime soon. At the end of the day, the whole reason we are here is to practice coding.
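
Since we are here to practice coding, here is a minimal sketch of how such a file could be turned into a lookup table of merge rules. The `sample` string is a hand-made stand-in for the first lines of the real file, and the names `merges` and `ranks` are my own; I am assuming the first line is a header and every other line is one pair of tokens.

```python
# Toy stand-in for the first lines of the decompressed vocabulary file.
sample = "#version: 0.2\ni n\nt h\na n\nth e</w>\nin g</w>\n"

# Skip the header line, then split every non-empty line into a pair of tokens.
merges = [tuple(line.split()) for line in sample.split("\n")[1:] if line.strip()]

# Earlier lines get a smaller rank, i.e. they would be merged first.
ranks = {pair: rank for rank, pair in enumerate(merges)}

print(merges[:3])              # [('i', 'n'), ('t', 'h'), ('a', 'n')]
print(ranks[("th", "e</w>")])  # 3
```

A tokenizer would repeatedly merge the adjacent pair with the smallest rank, which is why the order of the lines matters.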

In [25]:
f.close()
print(content[-100:])

soccer saturday</w>
so zone</w>
smid t</w>
sm city
sli mey</w>
sin claire</w>
sd reader</w>
scare d



I was curious what happens if I close the file. First of all, when you open a file using 'with' in Python, it closes the file automatically after the 'with' block ends. Second of all, the content variable is a plain string that already lives in memory, so it stays there even though the file is closed. So the last two lines of code were just pointless :/ please ignore them :)
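
If you want to convince yourself, here is a small self-contained check. It builds a tiny gzip payload in memory (the payload and the variable names `raw` and `text` are made up for this sketch) instead of downloading anything.

```python
import gzip
import io

# Tiny in-memory gzip payload, standing in for the downloaded file.
raw = gzip.compress("i n\nt h\n".encode("utf-8"))

with gzip.GzipFile(fileobj=io.BytesIO(raw)) as f:
    text = f.read().decode("utf-8")

print(f.closed)  # True: the with block already closed the file object
f.close()        # closing it a second time does nothing
print(text)      # the decoded string is still available
```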

In [26]:
file3_url = clip_url + '/clip.py'
file3_url

'https://github.com/openai/CLIP/tree/main/clip/clip.py'

After I arrived at this clip.py file, I realized it imports from other files; specifically lines 13 and 14:

In [36]:
'''
from .model import build_model
from .simple_tokenizer import SimpleTokenizer as _Tokenizer
'''

'\nfrom .model import build_model\nfrom .simple_tokenizer import SimpleTokenizer as _Tokenizer\n'

So at this point I decided not to dig deep into this file and started looking at model.py


In [27]:
file4_url = clip_url + '/model.py'
file4_url

'https://github.com/openai/CLIP/tree/main/clip/model.py'

This was a relief because it seemed that the neural network layers and models using pytorch are sitting here. Now starts the technical part. I will try to be as concise as possible and not open too many parentheses.

## 1. Model.py 

### 1.1 Bottleneck class

This is a subclass of nn.Module. You see the 'expansion' variable hardcoded to 4 right after we enter the class definition. The input arguments are:
- inplanes:
    number of input channels. Maybe we can use inchannels?
- planes: first I thought this is the number of output channels, especially by looking at the self.conv1 layer. However as you dig deep you'll see 'planes * self.expansion' in self.conv3; hence I don't wanna call it output channels.
- stride, with default value of 1. I think there is a rule that all arguments in a class or function definition that have a default value should come after the ones without default values. Let's check it out.

#### Parenthesis Open

In [28]:
def fnc(a, b=1, c):
    return a + b + c

SyntaxError: parameter without a default follows parameter with a default (2492566146.py, line 1)

Yep that's the case. Let's try it again:

In [34]:
def fnc(a, c, b=1):
    return a + b + c

print(fnc(0, 0))
print(fnc(0, 0, 3))
print(fnc(0, b=2, c=3))
print(fnc(b=2, c=3))


1
3
5


TypeError: fnc() missing 1 required positional argument: 'a'

In the 3rd print, we swapped the order of 'b' and 'c' by passing them as keyword arguments, and it worked. In the 4th print, we forgot to pass a required "positional" argument and that was a failure. This was a basic Python sidenote. If you already knew it, please disregard my parenthesis.

#### Parenthesis Close

The first line is a familiar one used to initialize the super class (nn.Module):

super().\__init\__()

If you have seen other classes where super has some input arguments we can dig into that later.

The clever reader quickly scans the next few lines of code and notices a pattern: convolution + batchnorm + relu, repeated 3 times. He (or she or they) realizes that self.conv1 is a 2-dimensional convolutional layer. The first 2 arguments, inplanes and planes, are simply the number of channels of the input to this layer (i.e. input channels, inchannels, in_channels, inplanes, or whatever you wanna call it) and the number of desired output channels, respectively. The third argument, which is 1, is the kernel size. Here we are using a $1\times 1$ kernel. Spatially this kernel covers a single pixel, but each filter still has one weight per input channel, tuned (i.e., learned) as training of the model progresses. In the case of an input of size (C, H, W), the filter multiplies the C channel values at one pixel location by its C weights and sums them up; in other words it digs deep into channel space. This is true with every kernel size. For self.conv2 with kernel size 3, the convolution at every step involves $3\times 3=9$ kernel weights per input channel being multiplied with the corresponding 9 pixel values of that channel and summing all of them up, as well as digging deep into channel space. I will add some visualization to help better understand this concept of 2D convolution. Before moving forward I just wanna mention that this digging deep into channel space means we are combining information across all channels, and every one of those `planes` Conv2d filters attempts to learn something different and hopefully useful.

The bias is set to False, which means this layer (as well as all subsequent Conv2d layers) doesn't have a bias parameter.
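
To make the "digging into channel space" idea concrete, here is a minimal sketch with made-up sizes (`inplanes = 8`, `planes = 4` and the 5x5 input are my own choices, not values from the CLIP code). It shows the weight shape of a $1\times 1$ convolution and checks that it matches a per-pixel weighted sum over channels.

```python
import torch
from torch import nn

inplanes, planes = 8, 4  # illustrative sizes, not taken from CLIP
conv1 = nn.Conv2d(inplanes, planes, 1, bias=False)

# One weight per (output filter, input channel) pair; the kernel itself is 1x1.
print(conv1.weight.shape)  # torch.Size([4, 8, 1, 1])

x = torch.randn(2, inplanes, 5, 5)  # (N, C, H, W)
out = conv1(x)
print(out.shape)  # torch.Size([2, 4, 5, 5])

# The same result as a weighted sum over the channel axis at every pixel.
w = conv1.weight.view(planes, inplanes)
manual = torch.einsum("oc,nchw->nohw", w, x)
print(torch.allclose(out, manual, atol=1e-5))
```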

So in a nutshell, this first Conv2d layer consists of planes-many Conv2d filters, each operating on our input with inplanes-many channels and producing an output with 1 channel, resulting in a final output with planes-many channels. But we didn't talk about the other dimensions. What happens to H and W? The answer in the case of a $1\times 1$ kernel is quite easy to visualize: H and W do not change. However when the kernel size is more than 1, we remember something about padding and stride. Let's dig deeper.

stride dictates how many steps we wanna jump from one convolution to the next, either row-wise or column-wise. A stride of 1 means we move one pixel at a time (no jump), and it is the default value in the [PyTorch implementation](https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html) of `Conv2d`. By looking at the PyTorch docs, we also identify the first three arguments of the Conv2d 'class' (in_channels, out_channels, and kernel_size), which do not have default values and are required when initializing this class, and note that all other arguments have a default value. The clever reader predicts that a stride of 2 can result in the H and W dimensions being halved. However one important piece of the puzzle is missing: the padding parameter.

Padding refers to adding some pixels to the left, right, top and bottom of our tensor. You might have seen $\text{padding} = \frac{\text{kernel\_size}-1}{2}$ in some code. For an odd kernel size, this pads the tensor with just the right amount of pixels so that a stride of 1 keeps H and W unchanged, and a stride > 1 divides H and W by the stride (rounded up). In the case of a $1\times 1$ kernel, Height and Width are retained, because $\text{padding} = \frac{1-1}{2}=0$ is used (the default value).
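
The interplay of kernel size, stride and padding can be sketched with the output-size formula from the PyTorch Conv2d docs, $H_{out} = \lfloor (H + 2\,\text{padding} - \text{kernel\_size}) / \text{stride} \rfloor + 1$ (with dilation left at 1). The helper name `conv_out_size` and the input size 56 are my own, purely for illustration.

```python
def conv_out_size(size, kernel_size, stride=1, padding=0):
    """Output length along one spatial axis (dilation assumed to be 1)."""
    return (size + 2 * padding - kernel_size) // stride + 1

print(conv_out_size(56, 1))                       # 56: a 1x1 kernel keeps the size
print(conv_out_size(56, 3, padding=1))            # 56: padding = (3 - 1) / 2 keeps the size
print(conv_out_size(56, 3, stride=2, padding=1))  # 28: stride 2 halves it
print(conv_out_size(56, 3))                       # 54: no padding shrinks it
```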



For the batchnorm layer we just see a single argument, which is equal to the previous layer's output channels. We have some vague memory of mean and variance being used in a batchnorm layer, and this leads us to speculate that batchnorm keeps them for every channel. That is indeed the case: the mean and variance are statistics estimated per channel, while the parameters actually learned by gradient descent are a per-channel scale and shift. Later on we will face other types of normalization like layernorm and we will dig into them.
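
A quick way to check the "per channel" speculation is to inspect a small `nn.BatchNorm2d`. The channel count 4 and the input shape are arbitrary choices for this sketch.

```python
import torch
from torch import nn

bn = nn.BatchNorm2d(4)  # 4 channels, chosen arbitrarily

# Learnable scale and shift, one value per channel.
print(bn.weight.shape, bn.bias.shape)
# Running statistics, also one value per channel (buffers, not learned by gradients).
print(bn.running_mean.shape, bn.running_var.shape)

x = torch.randn(16, 4, 8, 8) * 3 + 5  # (N, C, H, W) with non-zero mean
y = bn(x)  # in training mode the batch statistics are used

# Per-channel mean is close to 0 and variance close to 1 after normalization.
print(y.mean(dim=(0, 2, 3)))
print(y.var(dim=(0, 2, 3), unbiased=False))
```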

The self.relu is a familiar layer which operates element-wise on its input. Basically

$\text{ReLU}(x) =
\begin{cases}
x & \text{if } x > 0 \\
0 & \text{if } x \leq 0
\end{cases}$


We see the inplace=True argument but let's dig into that later. After the 2nd convolution and before the 3rd one, there is this 'average pooling' layer with the argument stride. Well, we are quite sure what average pooling does: it takes the average of the pixels inside its sliding window. But what is the dimension of that window? The answer is "stride", the only input argument. We decide to look at the [pytorch docs](https://pytorch.org/docs/stable/_modules/torch/nn/modules/pooling.html#AvgPool2d), which define the AvgPool2D class, along with many other pooling layers. The interested digger may dig deep into this file. By looking at the docstring at the beginning of the class definition, the clever digger notices the similarity between this docstring and what he (or she or they) saw in the previous url, the pytorch documentation. They are indeed the exact same thing. We can dig into this later to see how we can leverage this for our benefit, in case we write some source code and like to maintain a documentation page.

The first argument for AvgPool2D is kernel_size. The 2nd one is stride (jump factor). The careful reader realizes that the default value of stride is None but discovers these lines of code:
```python
self.kernel_size = kernel_size
self.stride = stride if (stride is not None) else kernel_size
```

He (or ...) learns a new trick: use None as the default value for stride. If we assign a value to stride, then use that value. Otherwise, if stride is None, set stride to kernel_size. Note that we cannot directly set the default value of stride to kernel_size in the definition because kernel_size is itself an input argument! Honestly it was my first time seeing this trick. Happy Digging!!
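
The trick is easy to reproduce outside PyTorch. `Pool` below is a hypothetical toy class written for this sketch, not PyTorch's AvgPool2d.

```python
class Pool:
    """Hypothetical toy class that only stores its arguments."""

    def __init__(self, kernel_size, stride=None):
        self.kernel_size = kernel_size
        # Fall back to kernel_size when the caller did not pass a stride.
        self.stride = stride if stride is not None else kernel_size


print(Pool(2).stride)            # 2: stride falls back to kernel_size
print(Pool(3, stride=1).stride)  # 1: an explicit stride wins
```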

Let's close the discussion around this AvgPool2D layer by noting that if stride > 1, we use it and it results in downsampling (reducing H, W dimension sizes), while stride=1 will not activate this layer. I started to use the fancy word downsample instead of 'reducing H, W dimensions'. So if stride=2 (which comes soon) we perform $2\times 2$ average pooling and slide this $2\times 2$ window with stride of 2, because stride is equal to kernel_size, as we just learned by digging deep into pytorch source code. This means every pixel contributes to not more than 1 average pooling operation.
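
Here is a small sketch of that behaviour on a 4x4 input whose values are easy to average by hand (the input is made up for illustration).

```python
import torch
from torch import nn

# A single 4x4 "image" holding the values 0..15, shaped (N, C, H, W).
x = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)

pool = nn.AvgPool2d(2)  # only kernel_size is given, so stride defaults to it
print(pool.stride)      # 2

# Each output value averages one non-overlapping 2x2 window.
print(pool(x))
# tensor([[[[ 2.5000,  4.5000],
#           [10.5000, 12.5000]]]])
```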

The digger may get scared by looking at if statement in line 34, but after digging deep into code (line 53, residual connection) he realizes that this if statement just manipulates (downsamples) the input tensor x to have the same dimension as out= self.bn3(self.conv3(out)) so that they can be added. The operations included in the self.downsample sequential layer resemble the self.avgpool and the following convolution and batchnorm layers, but not the relu. 

#### Bottleneck class almost over

We dug deep into basic pytorch layers like Conv2D and AvgPool2D, but it is obvious that we are not gonna repeat this for the next parts of the code and the next digging journeys.

### 1.2 AttentionPool2D class

We already encountered the AvgPool2D class, but this layer uses an attention mechanism instead of simply averaging. If we dig into attention now, we will lose track and might get buried deep inside the tunnel we dug. So I assume that you have prior knowledge of the attention mechanism and are familiar with keywords like 'key', 'value', 'query', 'positional embedding', 'embedding dimension'.

In [6]:
# !pip install torch
import torch
torch.nn.Parameter(torch.tensor([[1,2],[2,1]]))

RuntimeError: Only Tensors of floating point and complex dtype can require gradients

It turns out that nn.Parameter (a class, to be precise) requires gradients by default, and integer, unlike the floating-point and complex types, is not a valid dtype for that. So we can't do that. Gradients don't like integers but they do like floating-point numbers.

In [34]:
float_tensor_1 = torch.tensor([[1, 2], [3, 4]]).float()
float_tensor_2 = torch.tensor([[-1, -2], [-3, -4]], dtype=torch.float16)
param = torch.nn.Parameter(torch.add(float_tensor_1, float_tensor_2))
param

Parameter containing:
tensor([[0., 0.],
        [0., 0.]], requires_grad=True)

Look at the [official doc](https://pytorch.org/docs/stable/_modules/torch/nn/parameter.html#Parameter):

"A kind of Tensor that is to be considered a module parameter." Also feel free to dig into its methods: \__new__, \__deepcopy__, \__repr__, \__reduce_ex__. The easiest one for now would be \__repr__, which doesn't have any input argument and has a single line of code:

`return 'Parameter containing:\n' + super().__repr__()`
 
The other methods are a bit scary to me, but I would like to dig into \__new__ and know more about this line, especially because of the detach and requires_grad_ operations.

```python
t = data.detach().requires_grad_(requires_grad)
```

Let's continue digging in our CLIP tunnel. The input arguments to nn.Parameter are not that scary. We see spacial_dim**2 + 1 (did they mean spatial?!) and we really get confused by that '+1'. We also see embed_dim, but remembering some stuff from the attention mechanism we quickly notice that we have a 2D matrix where each row represents a high-dimensional embedded vector. We also remember that the division by $\sqrt{embed\_dim}$ is important when we take dot products between query and key vectors, to maintain unit variance and avoid any issue due to large numbers. Well, someone has dug into this [here](https://ai.stackexchange.com/questions/21237/why-does-this-multiplication-of-q-and-k-have-a-variance-of-d-k-in-scaled). Basically that question is about the inner workings of the attention block, which is hidden in our code inside the ```F.multi_head_attention_forward``` function. However the reason for seeing division by $\sqrt{embed\_dim}$ in self.positional_embedding is similar: we want to make sure that in line 71, the addition of self.positional_embedding to $x$ will not introduce large variances, and this way we keep the overall input distribution stable (last sentence from ChatGPT!).
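
To get a feel for what the division by $\sqrt{embed\_dim}$ does, here is a minimal sketch. The sizes `spatial = 7` and `embed_dim = 2048` are illustrative assumptions on my side, and the variable names are mine.

```python
import torch

torch.manual_seed(0)
spatial, embed_dim = 7, 2048  # illustrative sizes

# Same construction as described above: random normal values, scaled down.
pos = torch.randn(spatial ** 2 + 1, embed_dim) / embed_dim ** 0.5

print(pos.shape)                # torch.Size([50, 2048])
print(pos.std())                # roughly 1 / sqrt(embed_dim), about 0.022
print(pos.norm(dim=1).mean())   # each row vector has a length of roughly 1
```

So every row starts out as a vector of length close to 1, no matter how large embed_dim is.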

The next 3 linear projections of query, key, and value are learnable matrices, and their input and output sizes are both embed_dim. The last linear layer is part of the attention block, as it is fed as an input argument to ```F.multi_head_attention_forward```. The naming of the input argument ```out_proj_weight=self.c_proj.weight```, as well as the output size of ```output_dim or embed_dim```, suggests that this linear projection is applied in the last stage. We also have num_heads for the number of attention heads, which is again an input to ```F.multi_head_attention_forward```.

Digging into the ```forward``` method, we encounter ```flatten``` and ```permute```. Let's play with them a lil bit:

In [15]:
x = torch.randn(1,2,3,4,5).float()
y = x.flatten(start_dim=2)
y.shape 

torch.Size([1, 2, 60])

In [20]:
x = torch.randn(1, 2, 3, 4, 5, dtype=float)
y = x.permute(-1, -2, -3, -4, -5)
print(y.shape)
y = x.permute(1, 0)
print(y.shape)

torch.Size([5, 4, 3, 2, 1])


RuntimeError: permute(sparse_coo): number of dimensions in the tensor input does not match the length of the desired ordering of dimensions i.e. input.dim() = 5 is not equal to len(dims) = 2

In [22]:
print(torch.permute.__doc__)


permute(input, dims) -> Tensor

Returns a view of the original tensor :attr:`input` with its dimensions permuted.

Args:
    input (Tensor): the input tensor.
    dims (tuple of int): The desired ordering of dimensions

Example:
    >>> x = torch.randn(2, 3, 5)
    >>> x.size()
    torch.Size([2, 3, 5])
    >>> torch.permute(x, (2, 0, 1)).size()
    torch.Size([5, 2, 3])



I embarked on a digging journey to find the source code of ```torch.permute```. ChatGPT told me the actual implementation could be in a C++ backend (the ATen library) or be called via Python bindings. I don't know what this means. [This](https://github.com/pytorch/pytorch/tree/4404762d7dd955383acee92e6f06b48144a0742e/aten/src/ATen) is the closest I got. You can dig more into folders and files. For example the ```core``` folder has a readme file with some cute information:

In [23]:
import requests
import gzip
import io

# Send a GET request to the URL
readme_raw_url = 'https://raw.githubusercontent.com/pytorch/pytorch/4404762d7dd955383acee92e6f06b48144a0742e/aten/src/ATen/core/README.md'
response = requests.get(readme_raw_url)

# Check if the request was successful
if response.status_code == 200:
    # plain text file; read and decode the content directly
    content = response.content.decode('utf-8')
    print(content)
else:
    print(f"Failed to retrieve the file: {response.status_code}")

ATen Core
---------

ATen Core is a minimal subset of ATen which is suitable for deployment
on mobile.  Binary size of files in this folder is an important constraint.



After digging into 'native_functions.yaml' I found some traces of permute function:

In [31]:
yaml_url = 'https://raw.githubusercontent.com/pytorch/pytorch/4404762d7dd955383acee92e6f06b48144a0742e/aten/src/ATen/native/native_functions.yaml'
response = requests.get(yaml_url)

if response.status_code == 200:
    content = response.content.decode('utf-8')
    lines = content.splitlines()
    extracted_lines = lines[1609:1612]
    for line in extracted_lines:
        print(line)
else:
    print('Digging failed! Rock encountered!!')

- func: permute(Tensor(a) self, int[] dims) -> Tensor(a)
  matches_jit_signature: True
  variants: method  # This is method-only to match the previous tensor API. In the future we could make this a function too.


This is just a high-level API definition, not the actual implementation code. So let's dig back. But before that, take a look at the 2nd line and remember the word jit (just-in-time) for later diggings. On our way back to the CLIP github repo, we realized that we didn't fully dig into the [link](https://discuss.pytorch.org/t/how-is-permutation-implemented-in-pytorch-cuda/39012/5) that helped us find early traces of ```permute```. In this PyTorch forum question, the author wants to dig into the PyTorch CUDA implementation of permute. I looked at the first answer, but it turns out the second-to-last answer is the one that guides us to the [permute implementation](https://github.com/pytorch/pytorch/blob/a6170573c898a1367517d8daf8e777abaf96f752/aten/src/ATen/native/TensorShape.cpp#L367-L385), and in the last post of the discussion the main digger who started it actually thanks the person who provided this link, who turned out to be a moderator of the site. The lesson is right in front of us:

<p style="text-align: center;">Did You Dig Deep Enough?</p>

And now we are satisfied with our digging endeavour and decide to really dig back to CLIP github repo. Let's move forward from line 69 by noting the evolution of tensor shapes:
```python 
x = x.flatten(start_dim=2).permute(2, 0, 1)  # NCHW -> (HW)NC
```
x ---> N C H W

x.flatten(start_dim=2) ---> N C HW

x.flatten(start_dim=2).permute(2, 0, 1) ---> HW N C

H, W, N, C represent Height, Width, Batch size, and Channel size. 
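
The shape evolution above can be checked on a small random tensor. The sizes below are arbitrary, and the last line verifies my reading that spatial position (h, w) ends up at sequence index h * W + w.

```python
import torch

N, C, H, W = 2, 8, 3, 3  # arbitrary small sizes
x = torch.randn(N, C, H, W)

flat = x.flatten(start_dim=2)  # N C HW
seq = flat.permute(2, 0, 1)    # HW N C

print(flat.shape)  # torch.Size([2, 8, 9])
print(seq.shape)   # torch.Size([9, 2, 8])

# Pixel (h=1, w=2) of every image lands at sequence position 1 * W + 2.
print(torch.equal(seq[1 * W + 2], x[:, :, 1, 2]))
```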

Line 70 points at torch.cat, x.mean and keepdim, and we wanna dig into them a lil bit.

In [35]:
torch.cat(float_tensor_1, float_tensor_2)

TypeError: cat() received an invalid combination of arguments - got (Tensor, Tensor), but expected one of:
 * (tuple of Tensors tensors, int dim = 0, *, Tensor out = None)
 * (tuple of Tensors tensors, name dim, *, Tensor out = None)


The Python interpreter tells us there is something wrong with our input arguments. The first argument should be a tuple (or list) of tensors:

In [48]:
tensor_3 = torch.randn(6, 7)
tensor_4 = torch.randn(6, 7)
catted_tensor = torch.cat([tensor_3, tensor_4])
catted_tensor.shape

torch.Size([12, 7])

We note that concatenation was done along the rows, i.e. dimension 0, although we didn't pass a dim argument. So we suspect this is the default behaviour, and the [pytorch doc](https://pytorch.org/docs/stable/generated/torch.cat.html) agrees. Based on our previous lesson with ```torch.permute``` we decide not to dig into the source code for ```torch.cat```. A curious digger may notice torch.stack() in the docs page and decide to dig into that a little:

In [49]:
stacked_tensor = torch.stack([tensor_3, tensor_4])
# print(stacked_tensor)
print(stacked_tensor.shape)

stacked_tensor = torch.stack([tensor_3, tensor_4], dim=1)
# print(stacked_tensor)
print(stacked_tensor.shape)

torch.Size([2, 6, 7])
torch.Size([6, 2, 7])


```torch.cat``` requires tensors to be of the same size except along the dimension of concatenation. ```torch.stack``` needs them to be of the exact same shape, and it will join the tensors along a new dimension. I'm curious when and why we may use stack along dim > 0.
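
Coming back to torch.cat, x.mean and keepdim together: here is a sketch of what I assume line 70 is doing, namely putting the mean over all spatial positions in front of the sequence as one extra token. If that reading is right, it would also explain the mysterious '+1' in the positional embedding. The tensor sizes and the names `mean_token` and `with_mean` are my own.

```python
import torch

seq = torch.randn(9, 2, 8)  # (HW, N, C), e.g. a 3x3 feature map

# keepdim=True keeps the averaged axis as a size-1 axis, so shapes still line up.
mean_token = seq.mean(dim=0, keepdim=True)       # (1, N, C)
with_mean = torch.cat([mean_token, seq], dim=0)  # (HW + 1, N, C)

print(mean_token.shape)       # torch.Size([1, 2, 8])
print(with_mean.shape)        # torch.Size([10, 2, 8])
print(seq.mean(dim=0).shape)  # torch.Size([2, 8]) without keepdim
```

Without keepdim=True the mean would be 2-dimensional and torch.cat would refuse to join it with the 3-dimensional sequence.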

Let's dig into ```F.multi_head_attention_forward``` by checking its official implementation.

The curious digger may get lost in the [source code](https://github.com/pytorch/pytorch/blob/main/torch/nn/functional.py) of `torch.nn.functional`, trying to dig into [attention](https://github.com/pytorch/pytorch/blob/main/torch/nn/functional.py#L5458-L5460) section. He (or she or they) decides to dig into `_in_projection_packed`, the first function definition in this section. After glancing through the code quickly and reading the docstring, he gets confident that this function is implementing the linear projections of q, k, v.

We locate `linear` [here](https://github.com/pytorch/pytorch/blob/main/torch/nn/functional.py#L2309-L2332), which informs us that `linear` is an alias of `torch._C._nn.linear` (defined in the C++ core of PyTorch) and it basically wraps the latter in a new Python function and adds a docstring to it.
 
There is a general note here. Some of you who have a theory background may think that ```linear(x, A, b)``` is following $y = Ax + b$ notation; however pytorch prefers to implement $y = xA^T + b$ in almost all linear layers I have seen so far. It might be different for other libraries like tensorflow, something to dig into. So here is the shape guide:

* x (input, Input): `(*, in_features)` 
* A (Weight, weight, W): `(out_features, in_features)`
* b (bias, Bias, never seen B): `(out_features)`
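
The shape guide can be checked numerically with a short sketch. The sizes are arbitrary and the tensors are random; the point is only that `F.linear` matches $xA^T + b$.

```python
import torch
import torch.nn.functional as F

x = torch.randn(5, 2)   # (*, in_features)
A = torch.randn(10, 2)  # (out_features, in_features)
b = torch.randn(10)     # (out_features,)

y = F.linear(x, A, b)
print(y.shape)  # torch.Size([5, 10])

# Same numbers as multiplying by the transposed weight and adding the bias.
print(torch.allclose(y, x @ A.T + b, atol=1e-5))
```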

The experienced digger may like to add a note here: the order of in/out features in the `torch.nn.Linear` [docs](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html) can be misleading for an amateur digger:

In [50]:
linear_layer = torch.nn.Linear(2, 10)
linear_layer.weight.shape

torch.Size([10, 2])

Hence the first argument of `torch.nn.Linear` is in_features, even though the weight matrix is stored as (out_features, in_features). A curious digger may dig into its [source code](https://github.com/pytorch/pytorch/blob/main/torch/nn/modules/linear.py#L50-L128) and [find out](https://github.com/pytorch/pytorch/blob/main/torch/nn/modules/linear.py#L125-L125) that this class implements the linear layer operation by using the `torch.nn.functional.linear` function, which we just dug into. There are also other things that might be interesting, like how to initialize the weight matrix, to be dug into later:
```Python 
init.kaiming_uniform_(self.weight, a=math.sqrt(5)) 
# or
init.uniform_(self.bias, -bound, bound)
```

Moving away from linear

### Bonus [lesson](https://www.facebook.com/100063550862077/videos/pronunciation-of-dig-dug-and-dog/700819198378153/)