#### How to fill this notebook?

Please replace
```py
# YOUR CODE HERE
raise NotImplementedError()
```
or

``YOUR ANSWER HERE``

with your answers to the questions asked. \
For most tasks, you will find a cell with basic tests. They are mainly placeholders for the additional tests we will run for your evaluation. \
Feel free to add cells if you need them. Please add comments or text cells to explain your reasoning; a wrong result with a good explanation will still earn some points.

In [1]:
!pip install flopco-pytorch
import hashlib
import flopco
from flopco import FlopCo

import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")



### Task 1 - Bytes [1 point]

In [2]:
name = "" # Enter your name here # @critical : remove this line before commit to public repo
randseed_str = hashlib.sha256(name.encode('utf-8')).hexdigest()
randseed = int(randseed_str[:10], 16)
print(randseed)

977923228312


Treat the generated pseudo-random number as a number of bytes and convert it to KB and MB.

In [3]:
n_bytes = randseed
n_KB = n_bytes / 1024
n_MB = n_bytes / (1024 ** 2)

print(f"Number of bytes: {n_bytes}")
print(f"Number of KB: {n_KB}")
print(f"Number of MB: {n_MB}")

Number of bytes: 977923228312
Number of KB: 955003152.6484375
Number of MB: 932620.2662582397
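
The conversion above uses binary prefixes (1 KB = 1024 bytes). As a minimal sketch, the helper below (the name `convert_bytes` is hypothetical, not part of the notebook) shows how the same byte count reads under the binary and the decimal (1 kB = 1000 bytes) conventions:

```python
def convert_bytes(n_bytes: int, base: int = 1024) -> dict[str, float]:
    """Convert a byte count to kilo- and mega-units for the given base (1024 or 1000)."""
    return {"K": n_bytes / base, "M": n_bytes / base**2}


n_bytes = 977923228312  # value printed by the cell above

binary = convert_bytes(n_bytes)         # KiB / MiB, the convention used above
decimal = convert_bytes(n_bytes, 1000)  # kB / MB in the SI sense

print(f"binary : {binary}")
print(f"decimal: {decimal}")
```

The two conventions differ by about 2.4% at the kilo level and about 4.9% at the mega level, so it is worth stating which one is meant.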


In [4]:
assert n_KB != 0
assert n_MB != 0

### Task 2 - FLOPs and MACs [3 points]

Consider a convolutional layer, such that

- $H_{in} = W_{in} = 11$
- $k_{in} = k_{out} = 3$
- $C_{in} = C_{out} = 16$

Assume $H_{out} = W_{out} = 9$

a) How much padding is needed on both spatial dimensions $H$ and $W$? [1 point]

b) How many FLOP and MAC are performed for standard convolution? [1 point]

c) How many FLOP and MAC are performed for depth-wise convolution? [1 point]

## My answer

#### a) How much padding is needed on both spatial dimensions $H$ and $W$? [1 point]

To determine the padding needed, we use the formula for the output dimension of a convolutional layer:

$$ H_{out} = \frac{H_{in} + 2 \times \text{padding} - \text{kernel size}}{\text{stride}} + 1 $$

Given (the task does not specify the stride, so we assume $\text{stride} = 1$):
- $H_{in} = 11$
- $\text{kernel size} = 3$
- $\text{stride} = 1$
- $H_{out} = 9$

Plugging in the values:

$$ 9 = \frac{11 + 2 \times \text{padding} - 3}{1} + 1 $$

Solving for padding:

$$ 9 - 1 = 11 + 2 \times \text{padding} - 3 $$
$$ 8 = 8 + 2 \times \text{padding} $$
$$ 2 \times \text{padding} = 0 $$
$$ \text{padding} = 0 $$

So no padding is needed: the padding on both spatial dimensions $H$ and $W$ is 0.
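
The output-size formula can also be checked numerically. The sketch below (the function name `conv_output_size` is hypothetical, and dilation is assumed to be 1) evaluates it for $H_{in} = 11$, kernel size 3 and stride 1 with several padding values, so the padding that yields $H_{out} = 9$ can be read off directly:

```python
def conv_output_size(size_in: int, kernel_size: int, padding: int = 0, stride: int = 1) -> int:
    """Spatial output size of a convolution without dilation."""
    return (size_in + 2 * padding - kernel_size) // stride + 1


for padding in (0, 1, 2):
    size_out = conv_output_size(11, kernel_size=3, padding=padding, stride=1)
    print(f"padding={padding} -> output size {size_out}")
```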

#### b) How many FLOP and MAC are performed for standard convolution? [1 point]

For standard convolution, the number of FLOPs (Floating Point Operations) and MACs (Multiply-Accumulate Operations) can be calculated as follows:

- Each output element requires $C_{in} \times \text{kernel size}^2$ MACs.
- There are $H_{out} \times W_{out} \times C_{out}$ output elements.

Given:
- $C_{in} = 16$
- $\text{kernel size} = 3$
- $H_{out} = W_{out} = 9$
- $C_{out} = 16$

Number of MACs per output element:

$$ \text{MACs per output element} = 16 \times 3 \times 3 = 144 $$

Total number of MACs:

$$ \text{Total MACs} = 9 \times 9 \times 16 \times 144 = 186,624 $$

Since each MAC consists of one multiplication and one addition, the number of FLOPs is twice the number of MACs:

$$ \text{Total FLOPs} = 2 \times 186,624 = 373,248 $$
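
As a sanity check, the MAC count can be obtained by enumerating the loop nest of a direct convolution and counting one MAC per innermost iteration. This is only a counting sketch (the function name `count_conv_macs` is hypothetical, and no actual arithmetic is performed):

```python
from itertools import product


def count_conv_macs(h_out: int, w_out: int, c_in: int, c_out: int, kernel_size: int) -> int:
    """Count the MACs of a direct convolution by walking its loop nest."""
    macs = 0
    # Axes: output channel, output row, output column, input channel, kernel row, kernel column
    for _ in product(range(c_out), range(h_out), range(w_out),
                     range(c_in), range(kernel_size), range(kernel_size)):
        macs += 1  # one multiply-accumulate per innermost iteration
    return macs


macs = count_conv_macs(h_out=9, w_out=9, c_in=16, c_out=16, kernel_size=3)
flops = 2 * macs  # one multiplication and one addition per MAC
print(macs, flops)  # 186624 373248
```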

#### c) How many FLOP and MAC are performed for depth-wise convolution? [1 point]

For depth-wise convolution, each input channel is convolved with its own single filter, so there is no summation over input channels (this requires $C_{out} = C_{in}$, which holds here). The number of MACs and FLOPs can be calculated as follows:

- Each output element requires $\text{kernel size}^2$ MACs.
- There are $H_{out} \times W_{out} \times C_{out}$ output elements.

Given:
- $\text{kernel size} = 3$
- $H_{out} = W_{out} = 9$
- $C_{out} = 16$

Number of MACs per output element:

$$ \text{MACs per output element} = 3 \times 3 = 9 $$

Total number of MACs:

$$ \text{Total MACs} = 9 \times 9 \times 16 \times 9 = 11,664 $$

Since each MAC consists of one multiplication and one addition, the number of FLOPs is twice the number of MACs:

$$ \text{Total FLOPs} = 2 \times 11,664 = 23,328 $$
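
Standard and depth-wise convolution can be seen as two ends of a grouped convolution. The sketch below assumes a hypothetical helper `grouped_conv2d_macs` whose `groups` argument plays the same role as the `groups` argument of `torch.nn.Conv2d`: `groups=1` is the standard case and `groups=c_in` the depth-wise case.

```python
def grouped_conv2d_macs(h_out: int, w_out: int, c_in: int, c_out: int,
                        kernel_size: int, groups: int = 1) -> int:
    """MACs of a grouped convolution: each output sees only c_in / groups input channels."""
    if c_in % groups or c_out % groups:
        raise ValueError("c_in and c_out must both be divisible by groups")
    macs_per_output = (c_in // groups) * kernel_size**2
    return h_out * w_out * c_out * macs_per_output


standard = grouped_conv2d_macs(9, 9, 16, 16, 3, groups=1)
depthwise = grouped_conv2d_macs(9, 9, 16, 16, 3, groups=16)
print(standard, depthwise, standard // depthwise)  # 186624 11664 16
```

The ratio equals $C_{in} = 16$, which is exactly the factor removed from the per-output MAC count.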

### Task 3 [2 points]

Consider an input image $x$ of size $(c, h, w)$ and a batch size $b$.

Compute the number of FLOP for BatchNorm2d(x) (in training mode).

Note: Consider that FLOP(division operation) = 4.

In [5]:
def flops_batchnorm(img):
    b, c, h, w = img.shape

    # BatchNorm2d computes one mean and one variance per channel,
    # each over the b * h * w values of that channel.
    num_elements = b * c * h * w
    flops_div = 4  # cost of one division, as stated in the task

    # Mean: sum over all elements (about one addition each) + one division per channel
    flops_mean = num_elements + c * flops_div

    # Variance: subtraction, square and sum for every element + one division per channel
    flops_variance = 3 * num_elements + c * flops_div

    # Normalization: subtraction and division for every element.
    # Per-channel bookkeeping (sqrt(var + eps), running statistics) is neglected:
    # its cost is not specified and it only adds on the order of c operations.
    flops_normalization = num_elements * (1 + flops_div)

    # Scaling and shifting: multiplication and addition for every element
    flops_scaling_shifting = num_elements * 2

    # Total FLOPs
    total_flops = flops_mean + flops_variance + flops_normalization + flops_scaling_shifting

    return total_flops

In [6]:
b = 4
c = 3
h = 220
w = 220
img = torch.randn(b, c, h, w)

print(f"FLOPs for batch normalization: {flops_batchnorm(img)}")

assert flops_batchnorm(img) != 0

FLOPs for batch normalization: 6388824


### Task 4 [1 point]

Compute the number of FLOP for Parametric ReLU (PReLU):

$$\operatorname{PReLU}
\left(y_{i}\right)= \begin{cases}y_{i}, & \text { if } y_{i}>0 \\ a_{i} y_{i}, & \text { if } y_{i} \leq 0\end{cases}$$

Note: do not count comparisons as operations
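
The FLOP count depends on which formulation is counted. The piecewise definition above costs at most one multiplication per element (only when $y_i \leq 0$), whereas the equivalent form $\max(0, y_i) + a_i \min(0, y_i)$, which is how the PyTorch documentation writes `nn.PReLU`, costs one multiplication and one addition per element. The scalar sketch below (both function names are hypothetical) checks that the two forms agree:

```python
def prelu_piecewise(y: float, a: float) -> float:
    """Piecewise definition: at most one multiplication, only for y <= 0."""
    return y if y > 0 else a * y


def prelu_max_min(y: float, a: float) -> float:
    """max/min form: one multiplication and one addition (max and min are comparisons)."""
    return max(0.0, y) + a * min(0.0, y)


slope = 0.25
samples = [-2.0, -0.5, 0.0, 1.5]
for y in samples:
    print(y, prelu_piecewise(y, slope), prelu_max_min(y, slope))
```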

In [7]:
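def flops_prelu(x):
    b, n = x.shape
    # Counting PReLU as max(0, y_i) + a_i * min(0, y_i): one multiplication and one addition
    # per element (max and min are comparisons, which are not counted).
    # This is an upper bound: under the piecewise definition, only elements with y_i <= 0
    # cost one multiplication.
    flops = b * n * 2
    return flops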

In [8]:
b = 4
n = 128
x = torch.randn(b, n)

print(f"FLOPs for PReLU: {flops_prelu(x)}")

assert flops_prelu(x) != 0 or torch.all(x > 0)

FLOPs for PReLU: 1024


### Task 5 [2 points]

Compute the number of FLOP for the Softmax function:


$$\operatorname{Softmax}\left(x_{i}\right)=\frac{\exp \left(x_{i}\right)}{\sum_{j} \exp \left(x_{j}\right)}.$$

Note: consider that $\operatorname{FLOP}(\operatorname{exp}) = 40$.

In [9]:
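def flops_softmax(x):
    b, n = x.shape
    flops_exp = 40  # cost of one exponential, as stated in the task
    flops_div = 4   # assumption: same division cost as in Task 3
    # Per row: n exponentials, n - 1 additions for the sum, n divisions
    flops_per_row = n * flops_exp + (n - 1) + n * flops_div
    flops = b * flops_per_row
    return flops

In [10]:
# do not modify
b = 4
n = 32
x = torch.zeros(b, n)

print(f"FLOPs for softmax: {flops_softmax(x)}")

assert flops_softmax(x) != 0

FLOPs for softmax: 5756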


### Task 5 - Arithmetic intensity [6 points]

Consider the following model.\
Assuming batch size 1,
- how many FLOP does it perform? [1 point]
- how much memory is transferred? [1 point]
- is it memory or compute bound? [1 point]

*Note: notice there is no bias*

In [11]:
model = nn.Sequential(
	nn.PReLU(),
	nn.Linear(64, 32, bias=False)
).to(device)

inputs = torch.randn(1, 64).to(device)

In [13]:
# Use new variable names so that the flops_prelu function is not overwritten
n_flops_prelu = flops_prelu(inputs)
n_flops_linear = 64 * 32 * 2  # one multiplication and one addition per weight
n_flops = n_flops_prelu + n_flops_linear

# Calculate memory transfer
# Each layer reads its input and its parameters, and writes its output
bytes_per_float = 4  # float32
mem_prelu = (inputs.numel() + 1 + inputs.numel()) * bytes_per_float  # input + slope parameter + output
mem_linear = (inputs.numel() + 64 * 32 + 32) * bytes_per_float  # input + weights + output

# Total memory transfer
n_mem = mem_prelu + mem_linear

# Determine if the model is memory or compute bound
# Assumed hardware figures: memory bandwidth of 900 GiB/s and peak compute of 10 TFLOP/s
memory_bandwidth = 900 * 1024**3  # in bytes per second
compute_capability = 10 * 10**12  # in FLOP per second

# Calculate arithmetic intensity (FLOP per byte)
arithmetic_intensity = n_flops / n_mem

# Compare with the ridge point of the roofline model
if arithmetic_intensity < compute_capability / memory_bandwidth:
    bound_type = "Memory bound"
else:
    bound_type = "Compute bound"

print(f"FLOPs: {n_flops}")
print(f"Memory transfer: {n_mem} bytes")
print(f"Arithmetic intensity: {arithmetic_intensity:.2f}")
print(f"The model is {bound_type}")

FLOPs: 4224
Memory transfer: 9092 bytes
Arithmetic intensity: 0.46
The model is Memory bound
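
The comparison in the cell above is the roofline test: a kernel is memory bound when its arithmetic intensity lies below the ridge point, i.e. peak compute divided by memory bandwidth. The sketch below reuses the same assumed hardware figures (10 TFLOP/s and 900 GiB/s, which are illustrative and not measured); the function name `attainable_flops` is hypothetical.

```python
def attainable_flops(arithmetic_intensity: float, peak_flops: float, memory_bandwidth: float) -> float:
    """Roofline model: performance is capped by either bandwidth or peak compute."""
    return min(peak_flops, arithmetic_intensity * memory_bandwidth)


peak_flops = 10e12                # FLOP per second (assumed)
memory_bandwidth = 900 * 1024**3  # bytes per second (assumed)

ridge_point = peak_flops / memory_bandwidth  # FLOP per byte
print(f"ridge point: {ridge_point:.2f} FLOP/byte")

for intensity in (1.0, 10.0, 100.0):
    print(intensity, attainable_flops(intensity, peak_flops, memory_bandwidth))
```

Any workload whose intensity is below the ridge point cannot reach peak compute, no matter how well the kernel is written.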


YOUR ANSWER HERE

In [None]:
assert n_flops != 0
assert n_mem != 0

Plot the arithmetic intensity for different batch sizes. [2 points]
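
One possible way to obtain the values to plot is sketched below. It rests on stated assumptions rather than measurements: 2 FLOP per element for PReLU, float32 data, activations read and written once per layer, and parameters read once regardless of batch size. The function name and its default arguments are hypothetical.

```python
def arithmetic_intensity_for_batch(batch_size: int, in_features: int = 64,
                                   out_features: int = 32, bytes_per_float: int = 4) -> float:
    """Arithmetic intensity (FLOP per byte) of PReLU followed by a bias-free Linear layer."""
    # FLOP grow linearly with the batch size
    flops_prelu = 2 * batch_size * in_features
    flops_linear = 2 * batch_size * in_features * out_features
    # Activations: PReLU input + PReLU output + Linear input + Linear output
    activations = batch_size * (3 * in_features + out_features)
    # Parameters: one PReLU slope + the Linear weight matrix, read once
    parameters = 1 + in_features * out_features
    n_bytes = (activations + parameters) * bytes_per_float
    return (flops_prelu + flops_linear) / n_bytes


batch_sizes = [1, 2, 4, 8, 16, 32, 64]
intensities = [arithmetic_intensity_for_batch(b) for b in batch_sizes]
for b, ai in zip(batch_sizes, intensities):
    print(f"batch size {b:>2}: {ai:.2f} FLOP/byte")
```

The resulting lists can be passed to `plt.plot(batch_sizes, intensities)`. Under these assumptions the intensity rises with the batch size because the weight traffic is amortized over more samples.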

In [None]:
import matplotlib.pyplot as plt
batch_sizes = [1, 2, 4, 8, 16, 32, 64]
# YOUR CODE HERE
raise NotImplementedError()

What is the optimal batch size for your GPU? [1 point]

YOUR ANSWER HERE

#### Bonus (not mandatory) [2 points]

Answer all questions in task 5 again, but considering that the PReLU and the Linear are fused.
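
As a rough sketch of what fusion changes: the FLOP count stays the same, but the intermediate PReLU output no longer has to be written to memory and read back. The helper below (hypothetical name, float32 assumed, parameters counted once) compares the memory traffic of both variants for batch size 1:

```python
def memory_traffic_bytes(batch_size: int = 1, in_features: int = 64, out_features: int = 32,
                         fused: bool = False, bytes_per_float: int = 4) -> int:
    """Bytes moved by PReLU + bias-free Linear, with or without kernel fusion."""
    parameters = 1 + in_features * out_features          # PReLU slope + Linear weights
    transfers = batch_size * (in_features + out_features)  # model input + model output
    if not fused:
        # The PReLU output is written to memory and read back by the Linear layer
        transfers += 2 * batch_size * in_features
    return (parameters + transfers) * bytes_per_float


unfused_bytes = memory_traffic_bytes(fused=False)
fused_bytes = memory_traffic_bytes(fused=True)
print(unfused_bytes, fused_bytes, unfused_bytes - fused_bytes)
```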

### Task 6 - Memory

Consider this vision model. We will try to estimate its memory usage.

How many parameters does it have? How much memory do the parameters take up? [2 points]
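
A common way to count parameters and their memory footprint is to iterate over `model.parameters()`. The sketch below illustrates this on a small toy model rather than on the ConvNeXt model of this task; the helper name `parameter_footprint` and the toy model are hypothetical.

```python
import torch.nn as nn


def parameter_footprint(model: nn.Module) -> tuple[int, int]:
    """Return (number of parameters, bytes occupied by the parameters)."""
    n_params = sum(p.numel() for p in model.parameters())
    # element_size() gives the bytes per value for each parameter's dtype
    n_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
    return n_params, n_bytes


toy_model = nn.Sequential(nn.Linear(64, 32, bias=False), nn.Linear(32, 10))
toy_params, toy_bytes = parameter_footprint(toy_model)
print(f"{toy_params} parameters, {toy_bytes} bytes")
```

Buffers such as BatchNorm running statistics are not returned by `parameters()`; they would have to be added through `model.buffers()` if they matter for the estimate.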

In [None]:
import timm

model = timm.create_model("convnextv2_large").to(device)

In [None]:
n_params = ...
n_memory = ...
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert n_params > 0
assert n_memory > 0

Profile the forward pass. What is the peak memory usage? Deduce the memory size of the activations. [2 points]

*In notebooks, variables are kept in memory between cell executions. Be careful about that when executing a cell twice!*
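
One possible approach, sketched here on a toy model, is to read the CUDA allocator statistics before and after a forward pass. The helper name `forward_memory_profile` is hypothetical. The sketch assumes gradients are enabled (so intermediate activations stay alive) and returns `None` when no CUDA device is available.

```python
import torch
import torch.nn as nn


def forward_memory_profile(model: nn.Module, x: torch.Tensor):
    """Return (bytes allocated before, peak bytes during) one forward pass, or None without CUDA."""
    if not torch.cuda.is_available():
        return None  # the allocator statistics used below only exist for CUDA devices
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()
    before = torch.cuda.memory_allocated()  # parameters and input already on the GPU
    out = model(x)  # gradients enabled, so intermediate activations are kept
    torch.cuda.synchronize()
    peak = torch.cuda.max_memory_allocated()
    del out
    return before, peak


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
toy_model = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 10)).to(device)
toy_input = torch.randn(8, 64, device=device)

profile = forward_memory_profile(toy_model, toy_input)
if profile is not None:
    before, peak = profile
    print(f"approximate activation memory: {peak - before} bytes")
```

The difference `peak - before` is only an approximation of the activation memory, since it also includes temporary buffers allocated by the kernels.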

In [None]:
batch_size = 8
x = torch.randn(batch_size, 3, 224, 224, device = device)

# YOUR CODE HERE
raise NotImplementedError()

Say you want to train this model using the Adam optimizer.
- What is the peak memory usage with batch size 1? [1 point]
- What is the maximum batch size that will fit on your GPU? [1 point]
- Try to run a training iteration. Is the actual memory usage close to your estimates? [1 point]
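
A back-of-the-envelope estimate can be built from the fact that Adam keeps two moment estimates per parameter, so training holds roughly four copies of the parameter memory (weights, gradients, two moments) plus the activations. The sketch below uses hypothetical helper names and purely illustrative numbers, not the figures of the model above. It assumes float32 everywhere and activations that scale linearly with the batch size, and it ignores the CUDA context and allocator overhead.

```python
def adam_training_bytes(n_params: int, activation_bytes: int, bytes_per_value: int = 4) -> int:
    """Rough training footprint: weights + gradients + two Adam moments + activations."""
    weights = n_params * bytes_per_value
    gradients = n_params * bytes_per_value
    adam_state = 2 * n_params * bytes_per_value  # first and second moment estimates
    return weights + gradients + adam_state + activation_bytes


def max_batch_size(gpu_bytes: int, n_params: int, activation_bytes_per_sample: int,
                   bytes_per_value: int = 4) -> int:
    """Largest batch size whose estimated footprint fits in gpu_bytes."""
    fixed = adam_training_bytes(n_params, 0, bytes_per_value)
    return max(0, (gpu_bytes - fixed) // activation_bytes_per_sample)


# Hypothetical numbers, for illustration only
n_params = 100_000_000
activation_bytes_per_sample = 500 * 1024**2
gpu_bytes = 16 * 1024**3

print(adam_training_bytes(n_params, activation_bytes_per_sample))
print(max_batch_size(gpu_bytes, n_params, activation_bytes_per_sample))
```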

YOUR ANSWER HERE

In [None]:
# YOUR CODE HERE
raise NotImplementedError()