# **Toward AI Sustainability: Low-Level Optimization for High Impact**

**ESPCI 2025: Practical work guide**

This document presents instructions and questions regarding the practical work sessions of this course. All the materials (slides, notebooks, base codes) can be recovered from the corresponding [GitHub repository](https://github.com/Deyht/green_ai_espci).

The notebook is expected to be run in a Google Colab environment with a T4 GPU:
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Deyht/green_ai_espci/blob/main/green_ai_espci_part2.ipynb)


This notebook is **not intended as a standalone** document; the slides should be used as a reference to understand several of the explanations provided in the analysis of the results.

# Practical work 2: CNN efficiency-based optimization on GPU

This part tackles the subject of optimizing a Convolutional Neural Network model for a metric that combines model accuracy, numerical efficiency, and model size. For this exercise, we will use the ASIRRA (Animal Species Image Recognition for Restricting Access) dataset that comprises 25000 labeled images of Cats and Dogs.

The objective is to explore network structures following the guidelines from the lecture to build a classification model that maximizes the following score metric:  

\begin{equation}
    S = \max \left({0,\frac{50-E}{50-E_r}}\right) \times \left(\frac{E_r}{E}\right) \times \left( \frac{T_r}{T} \right)^{w_T} \times \left( \frac{P_r}{P} \right)^{w_P}
\end{equation}  
where $E$ is the classification top-1 error rate (one minus the global accuracy), $T$ is the compute time, and $P$ is the number of trained parameters of the model. The $r$-indexed values represent the same quantities for a simple reference model for which the score result is set to 1.0. More details about this scoring metric are given in the relevant part of the notebook. The idea is that both improving the computing time and the accuracy increase the score. In contrast, trading too much computing time for a small increase in accuracy will not improve the score.
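To make this trade-off concrete, here is a minimal Python sketch of the score. It assumes the reference values and weights given later in this guide ($E_r = 14$ %, $T_r = 105$ ms, $P_r = 1085808$, $w_T = 0.95$, $w_P = 0.05$). The function name `score` and the three example models are hypothetical and only serve as an illustration.

```python
# Reference values and weights as stated later in this guide
E_REF, T_REF, P_REF = 14.0, 105.0, 1085808
W_T, W_P = 0.95, 0.05

def score(error_rate, compute_time, nb_param):
    """Score S for an error rate in %, a compute time in ms, and a parameter count."""
    accuracy_term = max(0.0, (50.0 - error_rate) / (50.0 - E_REF))
    return (accuracy_term * (E_REF / error_rate)
            * (T_REF / compute_time) ** W_T
            * (P_REF / nb_param) ** W_P)

# Hypothetical models, for illustration only: (error rate, compute time, parameters)
examples = {
    "reference": (14.0, 105.0, 1085808),
    "twice as fast, same error": (14.0, 52.5, 1085808),
    "1 point more accurate, 3x slower": (13.0, 315.0, 1085808),
}
for name, (e, t, p) in examples.items():
    print(f"{name}: S = {score(e, t, p):.3f}")
```

With these inputs, the reference model scores exactly 1, the faster model scores well above 1, and the slightly more accurate but three times slower model falls below 1.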


This notebook was designed to run on Google Colab using a T4 GPU, which is the setup that was used to define the baseline compute time for our reference architecture. You can switch the type of Colab runtime to select a T4 GPU in the menu at the top right corner of the page (next to resource monitoring). Note that all files and progress are lost every time the Colab session is reset, so save and download the files, results, and network models regularly. With the free version of Colab, the daily computing time is limited, so consider using multiple Google accounts.




### **CIANNA installation**

This notebook uses CIANNA to build and train network architectures. CIANNA is a general-purpose artificial neural network development framework (like TensorFlow or PyTorch) designed mainly for astronomical applications. Still, it is perfectly usable for classical CNN implementations. It is very fast for small networks, and it provides several performance measurement tools that will be useful in the context of this exercise.

CIANNA is coded in C and CUDA for GPU acceleration and is controlled through a python interface. As it is not available in Colab by default, we start by installing it.

**Link to the CIANNA github repository**
https://github.com/Deyht/CIANNA

#### Query GPU allocation and properties

If nvidia-smi fails, you probably launched the Colab session without a GPU reservation.  
To change the type of reservation go to "Runtime"->"Change runtime type" and select "GPU" as your hardware accelerator.

In [None]:
%%shell

nvidia-smi

cd /content/

git clone https://github.com/NVIDIA/cuda-samples/

cd /content/cuda-samples/Samples/1_Utilities/deviceQuery/

cmake CMakeLists.txt

make SMS="50 60 70 80"

./deviceQuery | grep Capability | cut -c50- > ~/cuda_infos.txt
./deviceQuery | grep "CUDA Driver Version / Runtime Version" | cut -c57- >> ~/cuda_infos.txt

cd ~/


#### Cloning and compiling CIANNA for the allocated GPU generation

There is no guaranteed forward or backward compatibility between Nvidia GPU generations, and some capabilities are generation-specific. For these reasons, CIANNA must be provided the platform GPU generation at compile time.
The following cell will automatically update all the necessary files based on the detected GPU and compile CIANNA.
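The flag selection performed by the compilation cell can be summarized with the short Python sketch below. It only mirrors the `awk` logic of the shell code for illustration and is not part of CIANNA. The function name `cianna_cuda_flags` is hypothetical, and the CUDA version used in the example call is an arbitrary value.

```python
def cianna_cuda_flags(compute_capability, cuda_version, old_cuda_limit=11.1):
    """Mirror the awk logic of the compilation cell (illustrative only)."""
    # Compute capability "7.5" becomes the architecture value 75
    sm_val = round(10 * float(compute_capability))
    flags = [f"-arch=sm_{sm_val}"]
    # CUDA versions older than the limit get a dedicated define
    if float(cuda_version) < old_cuda_limit:
        flags.append("-D CUDA_OLD")
    # Generation-specific define, selected from the architecture value
    if sm_val >= 80:
        flags.append("-D GEN_AMPERE")
    elif sm_val >= 70:
        flags.append("-D GEN_VOLTA")
    return flags

# A T4 GPU has compute capability 7.5; "12.4" is an arbitrary example version
print(cianna_cuda_flags("7.5", "12.4"))
```

For a T4, this selects `-arch=sm_75` together with the `GEN_VOLTA` define, since the architecture value is at least 70 but below 80.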

In [None]:
%%shell

cd /content/

git clone https://github.com/Deyht/CIANNA

cd CIANNA

mult="10"
cat ~/cuda_infos.txt
comp_cap="$(sed '1!d' ~/cuda_infos.txt)"
cuda_vers="$(sed '2!d' ~/cuda_infos.txt)"

lim="11.1"
old_arg=$(awk '{if ($1 < $2) print "-D CUDA_OLD";}' <<<"${cuda_vers} ${lim}")

sm_val=$(awk '{print $1*$2}' <<<"${mult} ${comp_cap}")

gen_val=$(awk '{if ($1 >= 80) print "-D GEN_AMPERE"; else if($1 >= 70) print "-D GEN_VOLTA";}' <<<"${sm_val}")

sed -i "s/.*arch=sm.*/\\t\tcuda_arg=\"\$cuda_arg -D CUDA -D comp_CUDA -lcublas -lcudart -arch=sm_$sm_val $old_arg $gen_val\"/g" compile.cp
sed -i "s/\/cuda-[0-9][0-9].[0-9]/\/cuda-$cuda_vers/g" compile.cp
sed -i "s/\/cuda-[0-9][0-9].[0-9]/\/cuda-$cuda_vers/g" src/python_module_setup.py

./compile.cp CUDA PY_INTERF

mv src/build/lib.linux-x86_64-* src/build/lib.linux-x86_64

#### Testing CIANNA installation

**IMPORTANT NOTE**   
CIANNA is mainly used in a script fashion and was not designed to run in notebooks. Every code cell that directly invokes CIANNA functions must be run as a script to avoid possible errors.  
To do so, the cell must have the following structure:

```
%%shell

cd /content/CIANNA

python3 - <<EOF

[... your python code ...]

EOF
```

This syntax allows one to easily edit Python code in the notebook while running the cell as a script. Note that none of the notebook variables can be accessed by the cell in this context, and all variables declared in the cell only exist there.

Another drawback of this solution is that it will copy the cell content in the output in case of error. If such a cell does not run, scroll back to the beginning of the cell output to get the actual error message.


The following cell tests CIANNA installation by running an example over the very classical MNIST dataset (handwritten digit images of 28x28 pixels distributed into 10 classes). The syntax should be mostly straightforward.

You can refer to CIANNA's [WIKI page](https://github.com/Deyht/CIANNA/wiki) for a complete framework description. You can also look at the full [API documentation](https://github.com/Deyht/CIANNA/wiki/4\)-Interface-API-documentation) to add layer types that are absent from the LeNET-5 example.

In [None]:
%%shell

#Strictly equivalent to ex_script.py in the CIANNA repo

cd /content/CIANNA/examples/MNIST

python3 - <<EOF

import numpy as np
import matplotlib.pyplot as plt
import os

import sys, glob
sys.path.insert(0,glob.glob('/content/CIANNA/src/build/lib.*/')[-1])
import CIANNA as cnn

############################################################################
##              Data reading (your mileage may vary)
############################################################################

def i_ar(int_list):
	return np.array(int_list, dtype="int")

def f_ar(float_list):
	return np.array(float_list, dtype="float32")

if(not os.path.isdir("mnist_dat")):
	os.system("wget https://share.obspm.fr/s/EkYR5B2Wc2gNis3/download/mnist.tar.gz")
	os.system("tar -xvzf mnist.tar.gz")

print ("Reading inputs ... ", end = "", flush=True)

#Loading binary files
data = np.fromfile("mnist_dat/mnist_input.dat", dtype="float32")
data = np.reshape(data, (80000,28*28))
target = np.fromfile("mnist_dat/mnist_target.dat", dtype="float32")
target = np.reshape(target, (80000,10))

data_train = data[:60000,:]
data_valid = data[60000:70000,:]
data_test  = data[70000:80000,:]

target_train = target[:60000,:]
target_valid = target[60000:70000,:]
target_test  = target[70000:80000,:]

print ("Done !", flush=True)

############################################################################
##               CIANNA network construction and use
############################################################################

#Details about the functions and parameters are given in the GitHub Wiki

cnn.init(in_dim=i_ar([28,28]), in_nb_ch=1, out_dim=10,
		bias=0.1, b_size=16, comp_meth="C_CUDA", #Change to C_BLAS or C_NAIV
		dynamic_load=1, mixed_precision="FP32C_FP32A")

cnn.create_dataset("TRAIN", size=60000, input=data_train, target=target_train)
cnn.create_dataset("VALID", size=10000, input=data_valid, target=target_valid)
cnn.create_dataset("TEST", size=10000, input=data_test, target=target_test)

#Python side datasets are not required anymore, they can be released to save RAM
del (data_train, target_train, data_valid, target_valid, data_test, target_test)

#Used to load a saved network at a given iteration
load_step = 0
if(load_step > 0):
	cnn.load("net_save/net0_s%04d.dat"%(load_step), load_step)
else:
	cnn.conv(f_size=i_ar([5,5]), nb_filters=8 , padding=i_ar([2,2]), activation="RELU")
	cnn.pool(p_size=i_ar([2,2]), p_type="MAX")
	cnn.conv(f_size=i_ar([5,5]), nb_filters=16, padding=i_ar([2,2]), activation="RELU")
	cnn.pool(p_size=i_ar([2,2]), p_type="MAX")
	cnn.dense(nb_neurons=256, activation="RELU", drop_rate=0.5)
	cnn.dense(nb_neurons=128, activation="RELU", drop_rate=0.2)
	cnn.dense(nb_neurons=10, strict_size=1, activation="SMAX")

cnn.train(nb_iter=20, learning_rate=0.004, momentum=0.8, confmat=1, save_every=0)
cnn.perf_eval()

#Inference over test set and save prediction
cnn.forward(repeat=1, drop_mode="AVG_MODEL")


EOF

The output tracks the training as it progresses and periodically prints the test set average loss and a confusion matrix.  
In this example, the network should converge around 99.3% total accuracy.
In addition, at the end of the training here, the framework summarizes computing performances for each layer for inference and backpropagation. It also indicates the relative contribution of each layer in percent. In the context of this course, we are mostly interested in the forward compute time that will be used in our evaluation metric. This table will provide valuable information for optimizing your network architecture for the following dataset.

### **ASIRRA**

ASIRRA (Animal Species Image Recognition for Restricting Access) is a dataset that was originally used for CAPTCHA and HIP (Human Interactive Proofs).

The original dataset comprises 25000 images of variable resolution (averaging around 350x500) and perfectly distributed over the two classes "Cat" and "Dog". For this exercise, we provide two reduced versions in the form of padded and resized RGB images at either 128x128 or 256x256 as two binary files. This construction is necessary so the dataset can fit into the limited amount of Colab RAM. You can download one or both to test the impact of the input resolution on your network. In these files, the first 12500 images are Cats, and the next 12500 are Dogs. The last 1024 images of each class will be excluded to form our reference test dataset.
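The file layout described above can be turned into explicit index ranges, as in the sketch below. The constant and function names are hypothetical. The memory estimate only counts the raw `uint8` pixels of the 128x128 version.

```python
NB_PER_CLASS = 12500   # images per class in the binary file
NB_TEST = 1024         # last images of each class reserved for the test set

def class_ranges(class_id):
    """Return (train_range, test_range) of file indices for class 0 (Cat) or 1 (Dog)."""
    first = class_id * NB_PER_CLASS
    split = first + NB_PER_CLASS - NB_TEST
    return range(first, split), range(split, first + NB_PER_CLASS)

for class_id, name in enumerate(["Cat", "Dog"]):
    train, test = class_ranges(class_id)
    print(f"{name}: train [{train.start}, {train.stop}), test [{test.start}, {test.stop})")

# Raw size of the 128x128 RGB version stored as one byte per channel
raw_bytes = 2 * NB_PER_CLASS * 128 * 128 * 3
print(f"Raw size at 128x128: {raw_bytes / 1e9:.2f} GB")
```

Each class therefore keeps 11476 images for training, and the 256x256 version takes four times the memory of the 128x128 one.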

#### Downloading and visualizing the data

We start by downloading and visualizing the raw data. You can get one or both of the resized versions.

In [None]:
%%shell

cd /content

wget https://share.obspm.fr/s/6TBsCpAASeETH3S/download/asirra_bin_128.tar.gz
tar -xvzf asirra_bin_128.tar.gz

#wget https://share.obspm.fr/s/52nxyfn7PjzawSe/download/asirra_bin_256.tar.gz
#tar -xvzf asirra_bin_256.tar.gz

In [None]:
%cd /content/

import os
import matplotlib.pyplot as plt
import numpy as np

image_size = 128

v_width = 8; v_height = 5
nb_images = v_width*v_height

f_im_s = image_size*image_size*3

subset_cats = np.reshape(np.fromfile("asirra_bin_%d.dat"%(image_size),
  dtype="uint8", count=f_im_s*(nb_images//2)), (nb_images//2,image_size,image_size,3))

subset_dogs = np.reshape(np.fromfile("asirra_bin_%d.dat"%(image_size),
  dtype="uint8", count=f_im_s*(nb_images//2), offset=12500*f_im_s), (nb_images//2,image_size,image_size,3))

fig, ax = plt.subplots(v_height, v_width, figsize=(v_width*1.5,v_height*1.5), dpi=200, constrained_layout=True)

for i in range(0, v_width*v_height):
  c_x = i // v_width; c_y = i % v_width
  p_c = int((i)%2) #Alternate cats and dogs in display
  if(p_c == 0):
    ax[c_x,c_y].imshow(subset_cats[i//2])
  else:
    ax[c_x,c_y].imshow(subset_dogs[i//2])
  ax[c_x,c_y].axis('off')

plt.show()

#### Data handling and augmentation

In order to ease the manipulation of the dataset and hyperparameter exploration, we first provide a set of helper functions. To be accessible inside the CIANNA script cells, we need to export them into a python file. Every time you would like to change the content of these functions, you will need to re-run the cell to generate a new .py file. If loaded in an interactive cell, you will need to restart the kernel after changing this file to re-import it properly.

In [None]:
%%writefile helper.py

import numpy as np
import matplotlib.pyplot as plt
import os, sys, gc, glob, time, cv2
from threading import Thread
from PIL import Image
import albumentations as A

def data_prep(nb_images_per_iter, raw_image_size, image_size, test_mode=0):
  #Data arrays are declared as global so we can work in place to reduce RAM footprint
  global raw_images, input_data, targets, input_val, targets_val

  raw_images = np.reshape(np.fromfile("asirra_bin_%d.dat"%(raw_image_size), dtype="uint8"), (25000, raw_image_size, raw_image_size,3))

  if(test_mode == 0):
    input_data = np.zeros((nb_images_per_iter,3*image_size**2), dtype="float32") #CIANNA expects "float32" arrays
    targets = np.zeros((nb_images_per_iter,2), dtype="float32")

  input_val = np.zeros((2048,3*image_size**2), dtype="float32")
  targets_val = np.zeros((2048,2), dtype="float32")


def create_augmented_batch(A_transform):

  nb_images = np.shape(input_data)[0]

  for i in range(0,nb_images):

    l_class = np.random.randint(0,2)
    l_id = np.random.randint(0,12500 - 1024)
    #Last 1024 images of each class kept only for the val/test set

    patch = raw_images[l_class*12500+l_id]
    transformed = A_transform(image=patch)
    patch_aug = transformed['image']

    image_size = np.shape(patch_aug)[0]

    #CIANNA expects data formatted as 2D numpy arrays representing a list of flattened images (with every channel flattened after the others)
    for depth in range(0,3): #We normalize based on mean pixel value
      input_data[i,depth*image_size**2:(depth+1)*image_size**2] = (patch_aug[:,:,depth].flatten("C") - 100.0)/155.0

    targets[i,:] = 0.0
    targets[i,l_class] = 1.0

  return input_data, targets


def create_validation_set(A_transform):

  for i in range(0,2048):

    l_class = i // 1024

    patch = raw_images[(1+l_class)*(12500 - 1024) + i]
    transformed = A_transform(image=patch)
    patch_aug = transformed['image']

    image_size = np.shape(patch_aug)[0]

    for depth in range(0,3):
      input_val[i,depth*image_size**2:(depth+1)*image_size**2] = (patch_aug[:,:,depth].flatten("C") - 100.0)/155.0

    targets_val[i,:] = 0.0
    targets_val[i,l_class] = 1.0

  return input_val, targets_val


def score_fct(error_rate, compute_time, nb_param):
  ref_error_rate = 14.00
  ref_compute_time = 105.0
  nb_param_ref = 1085808

  error_weight = 1
  compute_time_weight = 0.95
  param_weight = 0.05

  return (max(0,(50.0-error_rate)/(50.0-ref_error_rate))*ref_error_rate/error_rate) * (ref_compute_time/compute_time)**(compute_time_weight) * (nb_param_ref/nb_param)**(param_weight)


def score_eval(load_epoch, compute_time, nb_param):

  #os.system("head -n 500 out.txt | grep \"Total Net. nb weights: \" | sed \"s/Total Net. nb weights: //g\" > nb_param.txt")
  #nb_param = np.loadtxt("nb_param.txt")

  raw_pred = np.fromfile("fwd_res/net0_%04d.dat"%(load_epoch), dtype="float32")
  pred = np.reshape(raw_pred, (2048,-1))

  correct = np.shape(np.where(np.argmax(pred[:,:2], axis=1) == np.argmax(targets_val[:,:], axis=1)))[1]
  error_rate = 100.0-(correct/2048)*100.0
  print("Error_rate: %.3f %%\nCompute time: %.3f ms\nNb. parameters: %d\nScore: %.3f"%(
    error_rate, compute_time, nb_param, score_fct(error_rate, compute_time, nb_param)))


def free_data_helper():
  global raw_images, input_data, targets, input_val, targets_val
  del (raw_images, input_data, targets, input_val, targets_val)
  return

We can test the helper functions with a simple example and display the produced images. The next cell illustrates how we can create an augmented batch of images for training from the raw image dataset.  

Use this example to test the effect of combining different transform operations for data augmentation. You can also test the impact of the image resolution and of the position of the resize transformation in the augmentation list.
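The flattened layout produced by the helper functions (each channel flattened after the previous one, with the `(x - 100) / 155` normalization) can be sketched in vectorized form as follows. The names `to_cianna` and `from_cianna` are hypothetical and are not part of `helper.py`. The random image only stands in for a real example.

```python
import numpy as np

def to_cianna(image):
    """Flatten an (H, W, 3) uint8 image channel after channel and normalize it."""
    planes = np.transpose(image, (2, 0, 1)).astype("float32")  # shape (3, H, W)
    return (planes.reshape(-1) - 100.0) / 155.0

def from_cianna(flat, image_size):
    """Invert to_cianna and return an (H, W, 3) uint8 image for display."""
    planes = flat.reshape(3, image_size, image_size) * 155.0 + 100.0
    # Round before casting so float errors do not shift pixel values by one
    return np.transpose(np.rint(planes), (1, 2, 0)).astype("uint8")

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)  # stand-in image

flat = to_cianna(image)
print(flat.shape, flat.dtype)
print("Round trip is lossless:", np.array_equal(from_cianna(flat, 64), image))
```

Transposing to channel-first before reshaping gives the same ordering as the per-channel loop of the helper functions, so a row of this kind can be compared directly with a row of `input_data`.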

In [None]:
from helper import *

class_text = ["cat","dog"]
raw_image_size = 128
image_size = 64

v_width = 8; v_height = 5
nb_images_per_iter = v_width*v_height

#See Albumentation documentation for a list of existing augmentations
train_transform = transform = A.Compose([
  #Affine here act more as an aspect ratio transform than a scaling variation
  #A.Affine(scale=(0.85,1.15), rotate=(-15,15), fit_output=True, interpolation=1, p=1.0),
  A.HorizontalFlip(p=0.5),
  #A.ColorJitter(brightness=(0.8,1.2), contrast=(0.8,1.1), saturation=(0.8,1.2), hue=0.1, p=1.0),
  #A.ToGray(p=0.02),
  #Image resize can be done after all other transforms to preserve as much detail as possible
  #or as the first operation so other transforms are faster
  A.Resize(image_size,image_size, interpolation=1, p=1.0)])

val_transform = A.Compose([ #Here only a resize, but val transform could be more complex (center crop, padding, etc)
  A.Resize(image_size, image_size, interpolation=2, p=1.0)])

data_prep(nb_images_per_iter, raw_image_size, image_size)
input_data, targets = create_augmented_batch(train_transform)

fig, ax = plt.subplots(v_height, v_width, figsize=(v_width*1.5,v_height*1.5), dpi=200, constrained_layout=True)
patch = np.zeros((image_size,image_size,3), dtype="uint8")

for i in range(0, v_width*v_height):
  c_x = i // v_width; c_y = i % v_width
  #Images in the augmented input_data array are directly in the CIANNA format.
  #We need to convert them back to classical RGB for display.
  for depth in range(0,3):
    patch[:,:,depth] = np.reshape(input_data[i,depth*image_size**2:(depth+1)*image_size**2]*155 + 100,(image_size,image_size))

  ax[c_x,c_y].imshow(patch)
  ax[c_x,c_y].text(4, 10, class_text[(np.argmax(targets[i]))], c="red", fontsize=10, clip_on=True)
  ax[c_x,c_y].axis('off')

plt.show()

free_data_helper()

#### Training a network

The following cell trains a very simple LeNET-5-inspired network on the ASIRRA dataset at a low resolution of 64x64 (the `image_size` value set in the cell). This network will be very fast at training and inference, but it can only reach a low classification accuracy.

As stated at the beginning of the notebook, **our objective is to search for a model that would optimize our mixed accuracy/efficiency/size metric**:

\begin{equation}
 S = \max \left({0,\frac{50-E}{50-E_r}}\right) \times \left(\frac{E_r}{E}\right) \times \left( \frac{T_r}{T} \right)^{w_T} \times \left( \frac{P_r}{P} \right)^{w_P}
\end{equation}  

where $E$ is the classification error rate (one minus the global accuracy), $T$ is the compute time, and $P$ is the number of weights in the model. The last three factors are the ratios of the reference values to those of the evaluated model, and the contribution of each one to the total score is set by the exponents $w_E$, $w_T$, and $w_P$ ($w_E$ does not appear explicitly in the equation because it is set to 1). The first factor measures the distance from the 50% error rate, at which the prediction of the model is considered random and, therefore, has no utility. It penalizes models that get too close to this unusable limit.

The reference values were obtained for a model trained with the following cell ($E_r = 14$ %, $T_r=105$ ms, and $P_r=1085808$). We set $w_E = 1$, $w_T=0.95$, and $w_P=0.05$. With these scaling factors, a model that halves the error rate while having the same compute time and number of parameters as the reference model gets a score of about 2.4: the error ratio contributes a factor of 2, and the first term adds a factor of $43/36 \approx 1.19$. Halving the compute time while preserving the error rate gives a score of $2^{0.95} \approx 1.93$. The number of parameters in the model has a smaller effect and is only here to prevent the use of highly parametric architectures.
The only limiting rules are not to use external data or pretrained models on another dataset, not to change the data splitting between training and validation/test, and to run inference on a T4 GPU on Colab to get the inference time.
Other than that, you are free to:
 * Change the input image resolution
 * Modify/expand the augmentation policy
 * Modify/expand the network backbone

To stimulate your search, we provide a shared [Google Sheet](https://docs.google.com/spreadsheets/d/1_h1eCDR_031Kw-Z2_rUlx6kcfIfAF30J-dElBuucBC0/edit?usp=sharing) to be used as a dynamic leaderboard. You can tackle this part individually, in pairs, or by forming teams. Once you have a working model, you can enter its properties/results in the leaderboard. It will remain accessible until one week after the last session of this course.

Here are a few tips to help you in your search:
 * Try to add layers in the network backbone and evaluate their impact on accuracy / compute time / model size
 * Use the displayed performance per layer to identify bottlenecks in your architecture (Remember that a layer's performance is not only dictated by its own parameters but also by the shape of its input, which depends on the configuration of the previous layers)
 * Increasing the image size is a good way to improve accuracy, but there is a diminishing return at some point. However, this will have a strong negative impact on your computing time. Try to reduce the activation maps' resolution more aggressively over a few layers when using higher input resolutions. Keep an eye on the receptive field of the convolutional part of your backbone.
 * The size of the last activation maps before your first dense layer will have a strong impact on the number of parameters in your model, but it will not necessarily correlate with its accuracy. Keep an eye on the size of your activation maps before the first dense layer. You could also try to build a fully convolutional architecture to get rid of dense layers. This also has the advantage of making your trained model "compatible" with many input image resolutions. Again, the receptive field will be very important with such architecture. Examples are available in the CIANNA repository (see the ImageNET network backbone).
 * When increasing the number of parameters in your architecture (number of layers or their size), you will be more prone to overtraining (test set error increasing). For this reason, changes to the architecture should be accompanied by changes to your image augmentation policy.
 * The batch size has a direct effect on the compute speed of your model, with larger values allowing faster computing (up to some limit). However, you are limited by the amount of GPU memory used by your model. Also, in this exercise, speed only matters at inference time. In contrast, small batch sizes are usually preferable during training to achieve better accuracy and converge in fewer training steps.
 * Using mixed precision is also an efficient way to increase computing speed, but it can reduce the accuracy (usually by only a small amount). Unless you choose to use a very deep architecture, it is very likely that you will achieve better results with FP16C_FP32A activated for both training and inference.
 * Finally, use existing knowledge available regarding architecture design. You are free to be inspired by any architecture you know or find. You will find architecture examples in the CIANNA repository, but you are free to explore various architecture designs (AlexNET, VGG, DenseNET, YOLO/darknet, etc), as long as the required layers are available in CIANNA.
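Several of these tips (activation map size, receptive field, weight count of the first dense layer) can be checked on paper before training. The sketch below traces them layer by layer for the LeNET-5-inspired backbone of the training cell at a 64x64 input. The `trace_backbone` function and its tuple-based layer description are hypothetical. It assumes stride-1 convolutions, non-overlapping pooling, and one bias per filter or neuron, so the total may differ by a few units from the count reported by CIANNA, which remains the value to use for the score.

```python
def trace_backbone(image_size, in_channels, layers):
    """Trace output size, receptive field, and weight count layer by layer.

    layers is a list of tuples (hypothetical format):
      ("conv", f_size, nb_filters, padding)  stride-1 convolution
      ("pool", p_size)                       non-overlapping pooling
      ("dense", nb_neurons)
    """
    size, channels = image_size, in_channels
    rf, jump = 1, 1        # receptive field and pixel spacing, in input pixels
    nb_inputs = None       # input count of the next dense layer
    flat_inputs = None     # flattened size seen by the first dense layer
    total = 0
    for layer in layers:
        kind = layer[0]
        if kind == "conv":
            _, f_size, nb_filters, padding = layer
            nb_weights = (f_size * f_size * channels + 1) * nb_filters
            size = size + 2 * padding - f_size + 1
            rf += (f_size - 1) * jump
            channels = nb_filters
            out = f"{size}x{size}x{channels}"
        elif kind == "pool":
            _, p_size = layer
            nb_weights = 0
            size //= p_size
            rf += (p_size - 1) * jump
            jump *= p_size
            out = f"{size}x{size}x{channels}"
        elif kind == "dense":
            _, nb_neurons = layer
            if nb_inputs is None:
                nb_inputs = flat_inputs = size * size * channels
            nb_weights = (nb_inputs + 1) * nb_neurons
            nb_inputs = nb_neurons
            out = str(nb_neurons)
        else:
            raise ValueError(f"unknown layer type: {kind}")
        total += nb_weights
        print(f"{kind:5s} out={out:9s} weights={nb_weights}")
    print(f"receptive field: {rf} px, total weights: {total}")
    return {"total_weights": total, "receptive_field": rf, "flat_inputs": flat_inputs}

result = trace_backbone(64, 3, [
    ("conv", 5, 8, 2), ("pool", 2),
    ("conv", 5, 16, 2), ("pool", 2),
    ("dense", 256), ("dense", 128), ("dense", 2),
])
```

Under these assumptions, the first dense layer receives 16x16x16 = 4096 inputs and holds more than 95% of the weights, while the convolutional part only sees a 16-pixel-wide receptive field. Both observations suggest reducing the activation map resolution further before the first dense layer.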

*Link to the [CIANNA](https://github.com/Deyht/CIANNA) repository. You can refer to CIANNA's [WIKI page](https://github.com/Deyht/CIANNA/wiki) for a complete framework description. You can also look at the full [API documentation](https://github.com/Deyht/CIANNA/wiki/4\)-Interface-API-documentation) to add layer types that are absent from the LeNET-5 example.
The saved models are available in the "net_save" directory that is automatically created when starting a network training. The default naming scheme only refers to the training iteration, so rename your saved files with descriptive information about your model to keep track of your progress. A saved model can be uploaded to a new Colab session for inference or further training.*

In [None]:
%%shell
cd /content/

python3 - <<EOF

from helper import *

sys.path.insert(0,glob.glob('/content/CIANNA/src/build/lib.*/')[-1])
import CIANNA as cnn

def i_ar(int_list):
  return np.array(int_list, dtype="int")

raw_image_size = 128
image_size = 64
nb_images_per_iter = 4096 #Will most likely need to be reduced if the image size is increased so examples can fit in RAM


#See Albumentation documentation for a list of existing augmentations
train_transform = transform = A.Compose([
  #Affine here only applies a small random translation
  A.Affine(translate_percent=(-0.1,0.1), interpolation=1, p=1.0),
  #A.ToGray(p=0.02),
  #Image resize can be done after all other transforms to preserve as much detail as possible
  #or as the first operation so other transforms are faster
  A.Resize(image_size,image_size, interpolation=1, p=1.0),
  A.HorizontalFlip(p=0.5),
  A.ColorJitter(brightness=(0.8,1.3), contrast=(0.8,1.3), saturation=(0.8,1.3), hue=0.15, p=1.0)])

val_transform = A.Compose([ #Here only a resize, but val transform could be more complex (center crop, padding, etc)
  A.Resize(image_size, image_size, interpolation=2, p=1.0)])


#This function launches data augmentation on a separate thread.
#This way we can train on the GPU and generate new augmented examples in parallel.
def data_augm():
  input_data, targets = create_augmented_batch(train_transform)
  cnn.delete_dataset("TRAIN_buf", silent=1)
  cnn.create_dataset("TRAIN_buf", nb_images_per_iter, input_data[:,:], targets[:,:], silent=1)
  return

#In case the creation of new augmented data takes too long compared to training, you can
#increase the number of training iterations performed on a single augmentation
nb_iter_per_augm = 1
if(nb_iter_per_augm > 1):
  shuffle_frequency = 1
else:
  shuffle_frequency = 0


total_iter = 400 #Should be increased with the complexity of the network and task
load_iter = 0 #Used to reload a model at a given iteration
if (len(sys.argv) > 1):
  load_iter = int(sys.argv[1])

start_iter = int(load_iter / nb_iter_per_augm)

cnn.init(in_dim=i_ar([image_size,image_size]), in_nb_ch=3, out_dim=2,
  bias=0.1, b_size=16, comp_meth='C_CUDA', dynamic_load=1,
  mixed_precision="FP16C_FP32A", adv_size=30)

data_prep(nb_images_per_iter, raw_image_size, image_size)

input_val, targets_val = create_validation_set(val_transform)
cnn.create_dataset("VALID", 2048, input_val[:,:], targets_val[:,:])
cnn.create_dataset("TEST", 2048, input_val[:,:], targets_val[:,:])
del (input_val, targets_val) #The python arrays are no longer required after import in CIANNA
gc.collect()

#Create first augmentation before parallelization
input_data, targets = create_augmented_batch(train_transform)
cnn.create_dataset("TRAIN", nb_images_per_iter, input_data[:,:], targets[:,:])

if(load_iter > 0):
  cnn.load("net_save/net0_s%04d.dat"%load_iter, load_iter, bin=1)
else:
  #Network backbone architecture
  cnn.conv(f_size=i_ar([5,5]), nb_filters=8 , padding=i_ar([2,2]), activation="RELU")
  cnn.pool(p_size=i_ar([2,2]), p_type="MAX")
  cnn.conv(f_size=i_ar([5,5]), nb_filters=16, padding=i_ar([2,2]), activation="RELU")
  cnn.pool(p_size=i_ar([2,2]), p_type="MAX")
  cnn.dense(nb_neurons=256, activation="RELU", drop_rate=0.5)
  cnn.dense(nb_neurons=128, activation="RELU", drop_rate=0.2)
  cnn.dense(nb_neurons=2, strict_size=1, activation="SMAX")


cnn.print_arch_tex("./arch/", "arch", activation=1, dropout=1)

for run_iter in range(start_iter,int(total_iter/nb_iter_per_augm)):

  t = Thread(target=data_augm)
  t.start()

  cnn.train(nb_iter=nb_iter_per_augm, learning_rate=0.004, end_learning_rate=0.0002, shuffle_every=shuffle_frequency ,\
    control_interv=20, confmat=1, momentum=0.8, lr_decay=-np.log(0.1)/total_iter, weight_decay=0.0005, save_every=20,\
    silent=0, save_bin=1, TC_scale_factor=32.0)

  if(run_iter == start_iter):
    cnn.perf_eval()

  t.join()
  cnn.swap_data_buffers("TRAIN")

EOF


#### Evaluate your model

The following cell evaluates the accuracy and computing performance of your model in inference mode. Set the number of parameters passed to the score_eval() call at the end of the cell to that of your model to obtain a correct score.

Colab usually puts the GPU into sleep mode after idling for a few seconds. Always run this cell a few times in a row to get the real execution time.
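The effect of a warm-up call and of repeated measurements can be illustrated with the generic timing sketch below. The `time_ms` helper and the `dummy_inference` workload are hypothetical stand-ins. In the evaluation cell, the timed call is `cnn.forward(...)`, and the score should still be computed from the time reported by that cell.

```python
import statistics
import time

def time_ms(fn, warmup=1, repeats=5):
    """Call fn a few times and return the timed durations in milliseconds."""
    for _ in range(warmup):
        fn()  # untimed calls that give the device time to wake up
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append((time.perf_counter() - start) * 1000.0)
    return durations

def dummy_inference():
    """Stand-in workload replacing the real forward pass in this sketch."""
    sum(i * i for i in range(100_000))

durations = time_ms(dummy_inference)
print(f"min: {min(durations):.3f} ms, median: {statistics.median(durations):.3f} ms")
```

Comparing the minimum and the median over a few runs is a simple way to spot a measurement that was taken while the GPU was still waking up.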

Edit the Google Sheet to add your result:  
https://docs.google.com/spreadsheets/d/1_h1eCDR_031Kw-Z2_rUlx6kcfIfAF30J-dElBuucBC0/edit?usp=sharing

In [None]:

%%shell

cd /content/

python3 - <<EOF

from helper import *

#Comment to access system wide install
sys.path.insert(0,glob.glob('/content/CIANNA/src/build/lib.*/')[-1])
import CIANNA as cnn

def i_ar(int_list):
  return np.array(int_list, dtype="int")

raw_image_size = 128
image_size = 64

val_transform = A.Compose([ #Here only a resize, but val transform could be more complex (center crop, padding, etc)
  A.Resize(image_size, image_size, interpolation=2, p=1.0)])

cnn.init(in_dim=i_ar([image_size,image_size]), in_nb_ch=3, out_dim=2,
  bias=0.1, b_size=256, comp_meth='C_CUDA', dynamic_load=1,
  mixed_precision="FP16C_FP32A", adv_size=30, inference_only=1)

data_prep(0, raw_image_size, image_size, test_mode=1)

#Build the 2048-image test set (last 1024 images of each class)
input_test, targets_test = create_validation_set(val_transform)
cnn.create_dataset("TEST", 2048, input_test[:,:], targets_test[:,:])

del (input_test, targets_test)
gc.collect()

load_epoch = 400
cnn.load("net_save/net0_s%04d.dat"%load_epoch, load_epoch, bin=1)

cnn.forward(repeat=1, no_error=1, saving=2, drop_mode="AVG_MODEL")

start = time.perf_counter()
cnn.forward(no_error=1, saving=2, drop_mode="AVG_MODEL")
end = time.perf_counter()

cnn.perf_eval()
cnn.print_arch_tex("./arch/", "arch", activation=1, dropout=1)

compute_time = (end-start)*1000 #in milliseconds
##### THE NUMBER OF PARAMETERS MUST BE SET MANUALLY. IT IS GIVEN JUST ABOVE THE PER-LAYER PERFORMANCE TABLE ####
score_eval(load_epoch,compute_time, 1085808)

EOF