<a href="https://colab.research.google.com/github/BigTMiami/AdaptOrDie/blob/main/Amazon_Domain_Pre_training_Batch_Size_and_Gradient_Accumulation_Experiment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Summary
This notebook uses a 50k-review dataset (condensed to roughly 8,000 rows) to tune the batch size and gradient accumulation settings for pre-training on an A100.

# Results
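A per-device batch size of 34 with 10 gradient accumulation steps gives 340 condensed rows per optimizer step. With roughly 6 reviews packed into each row, that is an effective batch size of about 2040 reviews, which is close to the paper's 2048. This configuration uses about 36 GB of the A100's 40 GB and is the one I will use, since it leaves some headroom for larger datasets. The run with 10 gradient accumulation steps was only slightly faster (260.73 s versus 263.37 s without accumulation), which is reasonable.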

At 192 reviews per second, expect training to take **7.2 hours for 5M reviews (20%)** of the dataset.
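The estimate above can be reproduced from the figures reported in this notebook. The sketch below assumes the 260.73-second run covered all 50,000 reviews once and that throughput stays constant on the larger dataset; the variable names are only illustrative.

```python
# Figures reported in this notebook.
reviews_in_sample = 50_000   # reviews in the tuning dataset
training_seconds = 260.73    # wall-clock time of the run with 10 accumulation steps
target_reviews = 5_000_000   # 20% of the 25M-review dataset

# Assumption: throughput on the full dataset matches the tuning run.
reviews_per_second = reviews_in_sample / training_seconds
estimated_hours = target_reviews / reviews_per_second / 3600

print(f"{reviews_per_second:.0f} reviews/second")
print(f"{estimated_hours:.1f} hours for {target_reviews:,} reviews")
```

The measured time also includes the end-of-epoch evaluation pass, so the estimate is probably slightly pessimistic for pure training.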

Measurements from the run with 10 gradient accumulation steps:
* Training time: 260.73 seconds
* Throughput: 192 reviews / second
* gradient_accumulation_steps: 10
* per_device_train_batch_size: 34
* GPU memory: 36403 MB
* effective_batch_size: 2040

# Details
* The model is not pushed to the Hub; this notebook is only an experiment.





# Setup

In [1]:
!pip install datasets
!pip install transformers[torch]


Collecting datasets
  Downloading datasets-2.18.0-py3-none-any.whl (510 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl (134 kB)
Installing collected packages: xxhash, dill, multiprocess, datasets
Successfully installed dataset

In [3]:
!pip install pynvml

Collecting pynvml
  Downloading pynvml-11.5.0-py3-none-any.whl (53 kB)
Installing collected packages: pynvml
Successfully installed pynvml-11.5.0


In [12]:
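from pynvml import nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo, nvmlInit

def print_gpu_utilization():
    """Print and return the memory in use on GPU 0 (MiB, printed as MB)."""
    nvmlInit()
    handle = nvmlDeviceGetHandleByIndex(0)  # first (and only) GPU
    info = nvmlDeviceGetMemoryInfo(handle)
    gpu_used = info.used // 1024**2  # bytes -> MiB
    print(f"GPU {gpu_used} MB")
    return gpu_used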


In [5]:
from time import time

In [10]:
print_gpu_utilization()
!nvidia-smi

GPU 448 MB
Mon Apr  8 17:36:20 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   32C    P0              43W / 400W |      2MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                         

In [11]:
from datasets import load_dataset
dataset = load_dataset("BigTMiami/amazon_25M_50_000_condensed")
print(dataset)
print_gpu_utilization()
!nvidia-smi

Downloading readme:   0%|          | 0.00/481 [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/17.6M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.83M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/8277 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/855 [00:00<?, ? examples/s]

GPU 448 MB
Mon Apr  8 17:37:35 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   31C    P0              43W / 400W |      2MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                         

In [13]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
block_size = tokenizer.model_max_length
print(f"block_size:{block_size}")
print_gpu_utilization()
!nvidia-smi

tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/481 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

block_size:512
GPU 448 MB
Mon Apr  8 17:38:35 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   31C    P0              43W / 400W |      2MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                          

In [14]:
from transformers import AutoConfig, AutoModelForMaskedLM

config = AutoConfig.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_config(config)
print_gpu_utilization()
!nvidia-smi

GPU 448 MB
Mon Apr  8 17:39:00 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   31C    P0              43W / 400W |      2MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                         

# Train

In [16]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="amazon_pretraining_tuning",
    learning_rate=0.0005, # Paper for DAPT training
    per_device_train_batch_size=34, # Trying a smaller batch size
    per_device_eval_batch_size=34, # Trying a smaller batch size
    num_train_epochs=1, # 1 pass, 12k steps, 25 million reviews
    weight_decay=0.01,
    warmup_ratio=0.06, # Paper: warmup proportion of 0.06
    adam_epsilon=1e-6, # Paper 1e-6 (huggingface default 1e-08)
    adam_beta1=0.9, # Paper: Adam weights 0.9
    adam_beta2=0.98, # Paper: Adam weights 0.98 (huggingface default  0.999)
    lr_scheduler_type="linear",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    # load_best_model_at_end=True,
    # push_to_hub=True,
)
print_gpu_utilization()
!nvidia-smi

GPU 451 MB
Mon Apr  8 17:39:47 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   32C    P0              43W / 400W |      5MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                         

In [17]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm_probability=0.15
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    data_collator=data_collator,
)
print_gpu_utilization()
!nvidia-smi



GPU 1399 MB
Mon Apr  8 17:40:04 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   32C    P0              49W / 400W |    953MiB / 40960MiB |     23%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                        

In [18]:
start_time = time()
trainer.train()
end_time = time()
print(f"Training time: {end_time - start_time:.2f} seconds")
print_gpu_utilization()
!nvidia-smi

| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | No log | 7.096955 |


Training time: 263.37 seconds
GPU 36379 MB
Mon Apr  8 17:44:49 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   42C    P0              53W / 400W |  35933MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
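The `print_gpu_utilization()` figure and the `nvidia-smi` "Memory-Usage" column disagree in every cell so far. The sketch below collects the pairs printed in the outputs above and shows that the gap is constant. The labels are mine, and the explanation in the comment is an assumption rather than something verified in this notebook.

```python
# (pynvml "used" as printed by print_gpu_utilization, nvidia-smi "Memory-Usage" in MiB)
readings = {
    "before loading data": (448, 2),
    "after TrainingArguments": (451, 5),
    "after Trainer creation": (1399, 953),
    "after first training run": (36379, 35933),
}

# Assumption: the legacy NVML memory query counts driver-reserved memory as
# "used", while nvidia-smi reports only allocated memory.
offsets = {label: nvml - smi for label, (nvml, smi) in readings.items()}

for label, offset in offsets.items():
    print(f"{label}: pynvml reads {offset} MiB higher")
```

Because the offset does not change, either tool can be used to track growth in memory use; only the absolute figures differ.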
                         

In [19]:
print(f"trainer.args.gradient_accumulation_steps:{trainer.args.gradient_accumulation_steps}")
print(f"trainer.args.per_device_train_batch_size:{trainer.args.per_device_train_batch_size}")

trainer.args.gradient_accumulation_steps:1
trainer.args.per_device_train_batch_size:34
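The "No log" entry in the Training Loss column is most likely a consequence of how few optimizer steps one epoch contains. The sketch below estimates the step counts for this run and for an accumulation setting of 10. It assumes a single GPU, no dropped last batch, the `TrainingArguments` default of `logging_steps=500`, and that optimizer steps per epoch are computed as batches floor-divided by the accumulation steps; the helper `schedule` is hypothetical, not part of `transformers`.

```python
import math

train_rows = 8277            # rows in the train split (from the dataset loading output)
per_device_batch_size = 34
warmup_ratio = 0.06
default_logging_steps = 500  # assumed TrainingArguments default

def schedule(gradient_accumulation_steps):
    """Return (batches, optimizer steps, warmup steps) for one epoch."""
    batches_per_epoch = math.ceil(train_rows / per_device_batch_size)
    # Assumption: leftover batches that do not fill an accumulation window are not counted.
    optimizer_steps = max(batches_per_epoch // gradient_accumulation_steps, 1)
    warmup_steps = math.ceil(optimizer_steps * warmup_ratio)
    return batches_per_epoch, optimizer_steps, warmup_steps

for accumulation in (1, 10):
    batches, optimizer_steps, warmup = schedule(accumulation)
    logs_loss = optimizer_steps >= default_logging_steps
    print(f"accumulation={accumulation}: {batches} batches, "
          f"{optimizer_steps} optimizer steps, {warmup} warmup steps, "
          f"training loss logged: {logs_loss}")
```

Under these assumptions neither configuration reaches 500 optimizer steps, so setting a smaller `logging_steps` would be needed to see a training loss on a dataset this small.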


In [20]:
# Re-create the model from the config so this run also starts from untrained weights
config = AutoConfig.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_config(config)

training_args.gradient_accumulation_steps = 10

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    data_collator=data_collator,
)
print_gpu_utilization()
!nvidia-smi




GPU 36379 MB
Mon Apr  8 17:49:14 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   32C    P0              50W / 400W |  35933MiB / 40960MiB |      3%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                       

In [21]:
start_time = time()
trainer.train()
training_time = time() - start_time
print(f"Training time: {training_time:.2f} seconds")
print(f"gradient_accumulation_steps:{trainer.args.gradient_accumulation_steps}")
print(f"per_device_train_batch_size:{trainer.args.per_device_train_batch_size}")
print_gpu_utilization()
!nvidia-smi

| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 0 | No log | 7.152045 |


Training time: 260.73 seconds
gradient_accumulation_steps:10
per_device_train_batch_size:34
GPU 36403 MB
Mon Apr  8 17:53:53 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   41C    P0              52W / 400W |  35957MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
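To put the two runs side by side, the sketch below compares the timing and memory figures printed in this notebook. The dictionary layout and variable names are only illustrative.

```python
# Figures printed by the two training cells in this notebook.
runs = {
    1: {"seconds": 263.37, "gpu_mb": 36379},   # gradient_accumulation_steps=1
    10: {"seconds": 260.73, "gpu_mb": 36403},  # gradient_accumulation_steps=10
}
gpu_total_mb = 40960  # A100-SXM4-40GB, from nvidia-smi

speedup_pct = (runs[1]["seconds"] - runs[10]["seconds"]) / runs[1]["seconds"] * 100
extra_mb = runs[10]["gpu_mb"] - runs[1]["gpu_mb"]
used_pct = runs[10]["gpu_mb"] / gpu_total_mb * 100

print(f"10 accumulation steps: {speedup_pct:.1f}% faster, {extra_mb} MB more memory")
print(f"Memory in use: {used_pct:.1f}% of the card")
```

One caveat: the first run's memory was still held when the second run started (the GPU already showed 36379 MB before training), so the second reading is not an independent measurement of peak usage.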

In [22]:
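# 6 is presumably the approximate number of reviews packed into each condensed row,
# so the result is expressed in reviews rather than rows.
reviews_per_row = 6
effective_batch_size = reviews_per_row * trainer.args.gradient_accumulation_steps * trainer.args.per_device_train_batch_size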
print(f"effective_batch_size:{effective_batch_size}")


effective_batch_size:2040
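The factor used in the cell above can be checked against the dataset sizes. This sketch assumes the 50,000 reviews all sit in the 8,277-row train split, which matches the row count printed when the dataset was loaded but is not stated explicitly anywhere in the notebook.

```python
reviews_in_sample = 50_000   # reviews in the tuning dataset
train_rows = 8277            # condensed rows in the train split
per_device_batch_size = 34
gradient_accumulation_steps = 10

# Assumption: every review is in the train split, so this is the packing ratio.
reviews_per_row = reviews_in_sample / train_rows

rows_per_optimizer_step = per_device_batch_size * gradient_accumulation_steps
reviews_per_optimizer_step = rows_per_optimizer_step * round(reviews_per_row)

print(f"{reviews_per_row:.2f} reviews per row")
print(f"{rows_per_optimizer_step} rows, about {reviews_per_optimizer_step} reviews per optimizer step")
```

If the packing ratio differs on the full dataset, the per-device batch size or accumulation steps would need adjusting to keep the same number of reviews per optimizer step.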


In [None]:
print("Disconnecting Session")
from google.colab import runtime
runtime.unassign()