# Optimization tips
Optimizing your code is usually good practice, but it is essential in a limited environment like Kaggle's Code Competitions.
In this notebook, I'd like to show you some tips on how you can optimize your code. Hopefully, using these you won't run out of time or memory.

Quick reminder:
> Submissions to this competition must be made through Notebooks. In order for the "Submit to Competition" button to be active after a commit, the following conditions must be met:
> - CPU Notebook <= 9 hours run-time
> - GPU Notebook <= 2 hours run-time
> - No internet access enabled
> - External data is allowed, and you are encouraged to train your model offline and use your Notebook for inference.
> - Submission file must be named "submission.csv"
>
> +1
> - Available memory: 16Gb



## Summary
- Use script instead of notebook
- Import only things you need
- Use logs instead of tqdm
- Cleanup after usage
- Load parquet files once
- Do not load data you don't need
- Check your dtypes
- Preprocess your images once
- Use CUDA for preprocessing
- Do not use Albumentations (inference)
- Only use 3 channels if you really need it
- Process in batches
- Optimized TTA


I rate the tips by effect, using medal icons (from 1 to 5)
- 🥇 minor effect
- 🥇🥇🥇🥇🥇 significant effect on memory usage/running time


## Implementation
You can find an implementation of most of these tips in this kernel:

[https://www.kaggle.com/pestipeti/fast-ensemble-5-folds-20-minutes](https://www.kaggle.com/pestipeti/fast-ensemble-5-folds-20-minutes)

## TIP \#1: 🥇 Use `script` instead of `notebook`.
The difference is not significant, but a script leaves us a bit more free memory (`17.09Gb`) than a notebook (`16.08Gb`).
Jupyter has lots of great features, but you cannot use any of them in a background run.



| Notebook  | Script  |
|:-:|:-:|
|![https://albumizr.com/ia/09b6727ee71143d44313b7fd9a23c9ca.jpg](https://albumizr.com/ia/09b6727ee71143d44313b7fd9a23c9ca.jpg) | ![https://albumizr.com/ia/ea1a157604506f4afc9ebe301871a68d.jpg](https://albumizr.com/ia/ea1a157604506f4afc9ebe301871a68d.jpg) |


If you insist on using Jupyter Notebook, then make sure you reset the namespace ([doc](https://ipython.readthedocs.io/en/stable/interactive/magics.html#magic-reset)) regularly.

## TIP \#2: 🥇 Import only what you need
Public kernels contain lots of unnecessary imports. They may not use too many resources, but why would we import, for example, `Plotly` in an inference kernel?


## TIP \#3: 🥇 Use logs instead of tqdm
Same as *Tip \#2*, using TQDM in an inference kernel is pointless.
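
If you still want to see progress in the output of a background run, plain logging is enough. Here is a minimal sketch using the standard `logging` module; `run_batches` and `log_every` are names I made up for illustration, and the prediction step is left as a placeholder.

```python
import logging
import time

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("inference")


def run_batches(num_batches, log_every=100):
    """Loop over batches and log progress every `log_every` batches.

    Returns the number of log lines written (handy for a quick check).
    """
    start = time.time()
    logged = 0
    for batch_idx in range(num_batches):
        # ... predict on the batch here (placeholder) ...
        done = batch_idx + 1
        if done % log_every == 0 or done == num_batches:
            logger.info("batch %d/%d, %.1fs elapsed",
                        done, num_batches, time.time() - start)
            logged += 1
    return logged


run_batches(250, log_every=100)
```

A few log lines per run cost practically nothing, and they end up in the kernel's log where you can read them after the commit.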

## TIP \#4: 🥇 Cleanup after usage
Remove every variable once you don't need it anymore.

In [None]:
from sys import getsizeof, getrefcount

i = 1
ch = 'c'
st = "asasdf"
int_list = [x for x in range(10)]
str_list = [str(x) for x in range(10)]

print("Size of an int: {} bytes".format(getsizeof(i)))
print("Size of a char: {} bytes".format(getsizeof(ch)))
print("Size of a string: {} bytes".format(getsizeof(st)))
print("Size of list of ints: {} bytes".format(getsizeof(int_list)))
print("Size of list of strings: {} bytes".format(getsizeof(str_list)))

In [None]:
print(int_list)

In [None]:
del i
del ch
del st
del int_list, str_list

In [None]:
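# The names no longer exist after `del`, so looking one up raises a NameError.
try:
    print("Number of references to variable `int_list`: {}".format(getrefcount(int_list)))
except NameError as err:
    print("`int_list` has been deleted: {}".format(err))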

## TIP \#5: 🥇🥇🥇🥇 Load parquet files once

In [None]:
import pyarrow.parquet as pq

In [None]:
%%timeit
parq = pq.read_pandas('/kaggle/input/bengaliai-cv19/train_image_data_0.parquet').to_pandas()

This code is more or less okay if you are using one model (one fold) only.
```
for i in range(4):
    parq = pq.read_pandas(f'/kaggle/input/bengaliai-cv19/train_image_data_{i}.parquet').to_pandas()
    # pre-process parquet file
    # ...
    # predict
    # ...
```

But if you want to ensemble many folds/models:
```
for model_idx, model in enumerate(models):
    for i in range(4):
        parq = pq.read_pandas(f'/kaggle/input/bengaliai-cv19/train_image_data_{i}.parquet').to_pandas()
        # pre-process parquet file
        # ...
        # predict
        # ...
```
This one is much worse. In this case, I loaded (and preprocessed) the same parquet files once for every model.
Based on the `timeit` above, it takes ~1 minute 30 seconds to load one parquet file.
If your code looks like my bad example above, then for a 5-fold ensemble your script wastes ~30 minutes (`4 files * 5 folds * ~1.5 minutes`) just loading the data. You can reduce this to ~5-6 minutes if you load the data once and keep it in memory.
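
The "load once" structure can be sketched as follows. This is only an illustration: `predict_ensemble`, `load_fn` and `preprocess_fn` are hypothetical names, and the toy stand-ins at the bottom replace the real parquet loading and the real models so that the snippet runs on its own.

```python
def predict_ensemble(paths, models, load_fn, preprocess_fn):
    """Load and preprocess every file once, then run each model on the cache."""
    # Each file is read from disk (and preprocessed) exactly once.
    cached = [preprocess_fn(load_fn(path)) for path in paths]

    predictions = []
    for model in models:
        # Every model/fold reuses the same in-memory data.
        predictions.append([model(data) for data in cached])
    return predictions


# Toy stand-ins so the sketch runs without the competition data.
load_calls = []


def fake_load(path):
    load_calls.append(path)
    return [1, 2, 3]


def fake_preprocess(data):
    return [x * 2 for x in data]


models = [sum, max, min, len, sorted]  # 5 hypothetical "folds"
paths = ["file_{}.parquet".format(i) for i in range(4)]
preds = predict_ensemble(paths, models, fake_load, fake_preprocess)
print("Files loaded: {}".format(len(load_calls)))
```

The number of loads depends only on the number of files, no matter how many models you add to the ensemble.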

## TIP \#6: 🥇🥇 Do not load data you don't need
Use `pyarrow.parquet.read_pandas`'s `columns` argument to exclude the `image_id`. It is a text field (takes lots of memory), and it is generated from the index: `Test_` + `index`. You can predict/submit without this column.

```
import pyarrow.parquet as pq

# This is a pandas DataFrame
parq_df = pq.read_pandas('... parquet file...', columns=[str(x) for x in range(32332)]).to_pandas()
```


## TIP \#7: 🥇🥇🥇 Check your dtypes
The fastest way to run your code is to keep all the data (test images) in memory. We have lots of samples, so we have to use a few tricks.

For storing images (pixel values), we only need the `uint8` data type. A `uint8` value takes 1 byte. Most of the preprocessing steps can be calculated using `uint8`. If a more complicated step needs another type, don't forget to convert the images back to `uint8` afterwards.

The train set has 200840 samples (the test set is similar in size). If we store all of the images in memory, it takes `200840 * 137 * 236` bytes (~6.05Gb).

In [None]:
import numpy as np

# `high` is exclusive, so 256 covers the full uint8 range (0..255)
images = np.random.randint(low=0, high=256, size=(200840, 137 * 236), dtype=np.uint8)
print("{0:.2f}Gb".format(images.nbytes / (1024*1024*1024)))

In [None]:
float_images = 255.0 * np.random.rand(1, 137 * 236)
print("Number of bytes per (float64) image: {}".format(float_images.nbytes))
print("Number of bytes per (float32) image: {}".format(float_images.astype(np.float32).nbytes))

print("Memory usage of the full dataset (float32): {0:.2f}Gb".format(float_images.astype(np.float32).nbytes * 200840 / (1024*1024*1024)))
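
Converting back to `uint8` after a step that needs floats could look like the sketch below. The contrast adjustment is just an arbitrary example of such a step; `adjust_contrast` and `factor` are illustrative names, not something from my kernel.

```python
import numpy as np


def adjust_contrast(images, factor):
    """Scale contrast around mid-gray and return the result as `uint8` again."""
    # The intermediate result is float32, so call this on a batch,
    # not on the whole dataset at once.
    scaled = (images.astype(np.float32) - 127.5) * factor + 127.5
    # Round and clip to the valid pixel range before casting,
    # otherwise out-of-range values would wrap around.
    return np.clip(np.rint(scaled), 0, 255).astype(np.uint8)


batch = np.random.randint(0, 256, size=(64, 137, 236), dtype=np.uint8)
result = adjust_contrast(batch, factor=1.5)
print(result.dtype, result.nbytes == batch.nbytes)
```

Only the small temporary batch is ever stored as float; what you keep in memory stays at 1 byte per pixel.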

## TIP \#8: 🥇🥇🥇🥇 Preprocess your images once
If you are ensembling multiple models/folds, it is crucial to preprocess the samples before you start predicting. You don't want to do the same cutting/padding/resizing steps many times.

**Note**: Avoid any preprocessing step whose result would be `float`. The entire dataset would not fit in memory (about 24Gb as `float32`, see Tip \#7). Normalization is one such step, so leave it for prediction time.
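
A minimal sketch of "preprocess once, keep it as `uint8`", using only NumPy. The invert and center-crop steps are placeholders for your own cutting/padding/resizing, and `preprocess_all` and the 128 pixel size are assumptions for the example.

```python
import numpy as np

HEIGHT, WIDTH = 137, 236  # original image size in this competition


def preprocess_all(flat_images, size=128):
    """Invert and center-crop every image once; the result stays `uint8`."""
    n = flat_images.shape[0]
    images = flat_images.reshape(n, HEIGHT, WIDTH)
    top = (HEIGHT - size) // 2
    left = (WIDTH - size) // 2
    cropped = images[:, top:top + size, left:left + size]
    # 255 - uint8 stays uint8, so no float copy is created.
    return np.ascontiguousarray(255 - cropped)


raw = np.random.randint(0, 256, size=(8, HEIGHT * WIDTH), dtype=np.uint8)
cache = preprocess_all(raw)
# Reuse `cache` for every model/fold instead of calling preprocess_all again.
print(cache.shape, cache.dtype)
```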

## TIP \#9: 🥇🥇 Use CUDA for preprocessing
You can save lots of time if you do your preprocessing on the GPU.
- Load all of the samples to memory (`uint8`)
- Move everything to CUDA
- Preprocess your images one-by-one (or in batches if possible)
- Move everything back to RAM
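
The steps above can be sketched with PyTorch like this. It is a simplified variant under a few assumptions: the images are already in a `uint8` tensor of shape `(N, H, W)`, the only preprocessing step is a bilinear resize, and the data is moved to the GPU batch by batch rather than all at once. `preprocess_on_device` is a hypothetical helper name; the code falls back to the CPU when CUDA is not available.

```python
import torch
import torch.nn.functional as F


def preprocess_on_device(images_u8, size=(128, 128), batch_size=512):
    """Resize `uint8` images (N, H, W) on the GPU if available, else on the CPU."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    out = torch.empty((images_u8.shape[0], *size), dtype=torch.uint8)
    for start in range(0, images_u8.shape[0], batch_size):
        batch = images_u8[start:start + batch_size].to(device)
        # interpolate needs a float tensor with a channel dim: (B, 1, H, W)
        resized = F.interpolate(batch.unsqueeze(1).float(), size=size,
                                mode="bilinear", align_corners=False)
        # Convert back to uint8 before moving to RAM, so the cache stays small.
        resized = resized.squeeze(1).round().clamp(0, 255).to(torch.uint8)
        out[start:start + batch_size] = resized.cpu()
    return out


images = torch.randint(0, 256, (10, 137, 236), dtype=torch.uint8)
small = preprocess_on_device(images, size=(64, 64), batch_size=4)
print(small.shape, small.dtype)
```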

## TIP \#10: 🥇🥇 Do not use Albumentations (inference)
At inference time, after preprocessing the images, usually only two things are left: normalization and conversion to a CUDA tensor. With Albumentations (or `torchvision.transforms`), this is typically done per image in `__getitem__`. The problem with this is that the sampler requests the images one-by-one to build up the next batch. It is faster if you do the final transformations in batches.

So, instead of this:
```
class MyDataset(Dataset):  # a torch.utils.data.Dataset

    ...

    def __getitem__(self, idx):
        ...
        image = self.transform(image) # albumentation or torch transforms
        return image

    ...
```

You should do this (in your prediction loop):
```
...
for batch_idx, images in enumerate(test_loader):
    
    # You can do the normalization step in your
    # model's forward method
    
    # final transforms / TTA should be here

    ...

    images = images.float().cuda()
    
    outputs = model(images)
    
    ...
    
...
```
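
Doing the normalization in the model's forward method could look like this sketch. `NormalizedModel` is a hypothetical wrapper, and the `mean`/`std` values are whatever you used during training (the `0.5` below is only a toy value); I assume `uint8` or float inputs in the `[0, 255]` range.

```python
import torch
import torch.nn as nn


class NormalizedModel(nn.Module):
    """Wraps a model and normalizes the whole batch inside `forward`."""

    def __init__(self, model, mean, std):
        super().__init__()
        self.model = model
        # Buffers move with the module when you call .cuda() / .to(device).
        self.register_buffer("mean", torch.tensor(mean, dtype=torch.float32))
        self.register_buffer("std", torch.tensor(std, dtype=torch.float32))

    def forward(self, images):
        # images: batch of shape (B, 1, H, W) with values in [0, 255]
        images = (images.float() / 255.0 - self.mean) / self.std
        return self.model(images)


# Toy check with an identity "model".
wrapped = NormalizedModel(nn.Identity(), mean=0.5, std=0.5)
batch = torch.tensor([[[[0, 255]]]], dtype=torch.uint8)
outputs = wrapped(batch)
```

This way the `DataLoader` only has to hand over raw `uint8` batches, and the float conversion happens once per batch on the device where the model lives.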

## TIP \#11: 🥇🥇 Only use 3 channels if you really need it
Three channels triple the memory footprint of the samples, which then probably won't fit in memory.

Most of the models take images with 3 channels as input, but you can easily modify that.

```
import torch
import torch.nn as nn
import torchvision


class BengaliModel(nn.Module):

    def __init__(self):
        super().__init__()

        self.backbone = torchvision.models.resnet34(pretrained=True)

        # Replace the 3-channel input layer with a 1-channel one
        old_conv1 = self.backbone.conv1
        self.backbone.conv1 = nn.Conv2d(1, 64, kernel_size=7,
            stride=2, padding=3, bias=False)

        # Here I copied only the first channel's weights, but you can use
        # average of the 3 channels as well.
        with torch.no_grad():
            self.backbone.conv1.weight = nn.Parameter(
                old_conv1.weight.data[:, 0, :, :].unsqueeze(1))

```
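
The averaging variant mentioned in the comment could be sketched like this. `to_single_channel` is a hypothetical helper, and the randomly initialized `old_conv1` is only a stand-in for a pretrained first layer so that the snippet runs without downloading weights.

```python
import torch
import torch.nn as nn


def to_single_channel(conv):
    """Return a 1-channel copy of `conv` whose weights are the channel average."""
    new_conv = nn.Conv2d(1, conv.out_channels, kernel_size=conv.kernel_size,
                         stride=conv.stride, padding=conv.padding,
                         bias=conv.bias is not None)
    with torch.no_grad():
        # (out, 3, k, k) -> (out, 1, k, k)
        new_conv.weight.copy_(conv.weight.mean(dim=1, keepdim=True))
        if conv.bias is not None:
            new_conv.bias.copy_(conv.bias)
    return new_conv


# Stand-in for a pretrained first layer (random weights here).
old_conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)
conv1 = to_single_channel(old_conv1)
features = conv1(torch.zeros(2, 1, 137, 236))
print(conv1.weight.shape, features.shape)
```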


## TIP \#12: 🥇🥇🥇 Process in batches
- Always use batches, if possible. For preprocessing, predicting, etc.
- Try to eliminate all for loops.
- In your prediction loop use the largest batch size that fits in memory
- If you ensemble models of different sizes, use different batch sizes too (you only have to re-create the dataloader with the new batch size)
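
As a small illustration of the last point, here is a sketch with a plain NumPy batch iterator instead of a dataloader. The model names and batch sizes are made up; in practice you would find the largest size that fits by trial.

```python
import numpy as np


def iter_batches(images, batch_size):
    """Yield consecutive slices of `images` (views, not copies)."""
    for start in range(0, len(images), batch_size):
        yield images[start:start + batch_size]


images = np.zeros((1000, 128, 128), dtype=np.uint8)

# Hypothetical per-model batch sizes: smaller models can take larger batches.
batch_sizes = {"small_model": 512, "large_model": 128}

num_batches = {}
for name, batch_size in batch_sizes.items():
    # Only the iterator is re-created; the cached images are shared.
    num_batches[name] = sum(1 for _ in iter_batches(images, batch_size))

print(num_batches)
```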


## TIP \#13: 🥇🥇 Optimized TTA
There are lots of "easy" samples, where all of your models (even a ResNet-18) predict confidently. In these cases, it is pointless to average two (or more) 0.98 predictions. You can set a threshold (using the validation set) and generate TTA predictions only in uncertain cases.
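
One possible way to sketch this with NumPy is shown below. All names (`predict_with_selective_tta`, `predict_fn`, `tta_fns`) and the `0.9` threshold are assumptions for the example; the toy "model" at the bottom simply returns its input as probabilities.

```python
import numpy as np


def predict_with_selective_tta(predict_fn, tta_fns, images, threshold=0.9):
    """Run TTA only on samples whose top probability is below `threshold`.

    predict_fn: maps a batch of images to class probabilities (N, C).
    tta_fns: list of functions mapping a batch to an augmented batch.
    """
    probs = predict_fn(images)
    uncertain = probs.max(axis=1) < threshold
    if uncertain.any() and tta_fns:
        subset = images[uncertain]
        # Average the plain prediction with the augmented ones.
        tta_probs = [probs[uncertain]]
        tta_probs += [predict_fn(fn(subset)) for fn in tta_fns]
        probs = probs.copy()
        probs[uncertain] = np.mean(tta_probs, axis=0)
    return probs, uncertain


# Toy example: the "images" are already probabilities, the TTA flips them.
toy_images = np.array([[0.98, 0.02], [0.6, 0.4]])
final_probs, used_tta = predict_with_selective_tta(
    predict_fn=lambda x: x,
    tta_fns=[lambda x: x[:, ::-1]],
    images=toy_images,
    threshold=0.9,
)
```

Only the second sample goes through the extra forward pass here; the confident one keeps its original prediction.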


--------------------------

**Thanks for reading.** Please vote if you find these tips useful.