<a href="https://colab.research.google.com/github/ShaswataJash/LargeDatasetHandling/blob/master/Incremental_min_max_calculation_for_large_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!uname -a
!python --version

Linux 5f5882c553b5 5.10.147+ #1 SMP Sat Dec 10 16:00:40 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Python 3.9.16


In [None]:
import torch
print(torch.__version__)

1.13.1+cu116


In [None]:
!df -h /dev/shm

Filesystem      Size  Used Avail Use% Mounted on
shm             5.7G     0  5.7G   0% /dev/shm


In [None]:
#https://stackoverflow.com/questions/7878707/how-to-unmount-a-busy-device
#with Python multiprocessing, data is shared between the child processes and the main process through shared memory
#for the PyTorch DataLoader, the shared memory requirement can be quite high
!sudo umount -l /dev/shm/ && sudo mount -t tmpfs -o rw,nosuid,nodev,noexec,relatime,size=9G shm /dev/shm

In [None]:
#refer: https://numpy.org/doc/stable/reference/global_state.html#madvise-hugepage-on-linux
!cat /sys/kernel/mm/transparent_hugepage/enabled
!cat /sys/kernel/mm/transparent_hugepage/defrag
!cat /sys/kernel/mm/transparent_hugepage/use_zero_page
!cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size

always [madvise] never
always defer defer+madvise [madvise] never
1
2097152


In [None]:
%env

In [None]:
#https://stackoverflow.com/questions/37890898/how-to-set-env-variable-in-jupyter-notebook
%env NUMPY_MADVISE_HUGEPAGE=1

env: NUMPY_MADVISE_HUGEPAGE=1


#Determine total available GPU memory

In [None]:
#ref: https://stackoverflow.com/questions/59567226/how-to-programmatically-determine-available-gpu-memory-with-tensorflow
import subprocess as sp
import os
def get_gpu_memory():
    command = "nvidia-smi --query-gpu=memory.free --format=csv"
    try:
        #drop the trailing empty line and the CSV header, keeping one line per GPU
        memory_free_info = sp.check_output(command.split()).decode('ascii').split('\n')[:-1][1:]
        memory_free_values = [int(x.split()[0]) for x in memory_free_info]
        return memory_free_values[0] * 1024 * 1024 # nvidia-smi reports MiB; convert the first GPU's value into bytes
    except Exception as e:
        print(e)
        return -1
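
The parsing step in `get_gpu_memory` can be checked without a GPU. The following is a minimal sketch, assuming that `nvidia-smi --query-gpu=memory.free --format=csv` prints one header line followed by one `<number> MiB` line per GPU. The helper name `parse_free_memory_bytes` and the sample text are illustrative and not part of the notebook.

```python
def parse_free_memory_bytes(csv_text):
    """Return the free memory in bytes for each GPU listed in nvidia-smi CSV output."""
    lines = [line.strip() for line in csv_text.splitlines() if line.strip()]
    values = []
    for line in lines[1:]:  # skip the header line
        mib = int(line.split()[0])  # first token is the number, second is the unit
        values.append(mib * 1024 * 1024)
    return values

# Hypothetical sample output for a machine with two GPUs.
sample = "memory.free [MiB]\n15109 MiB\n8000 MiB\n"
free_bytes = parse_free_memory_bytes(sample)
print(free_bytes)
```

Returning the whole list rather than only the first entry leaves the choice of GPU to the caller.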

#Downloading Kaggle competition files

In [None]:
!pip install kaggle==1.5.12

In [None]:
%%python

import sys
import logging
import os
import subprocess

logging.basicConfig(level=logging.DEBUG, format='%(asctime)s:%(levelname)s:%(message)s')
logger = logging.getLogger('my_logger')
#handling of kaggle interaction
try:
    os.environ["KAGGLE_CONFIG_DIR"] = '/home' #kaggle.json file should be uploaded to /home location before executing this cell
    kaggle_write_cmd = "kaggle competitions download -c open-problems-multimodal"
    kaggle_write_call = subprocess.run(kaggle_write_cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
    logger.info(kaggle_write_call.stdout)
    if kaggle_write_call.returncode != 0:
        logger.error("Error in kaggle download, errorcode=%s", kaggle_write_call.returncode)
        sys.stdout.flush()
        sys.exit("Forceful exit as kaggle download returned error")
except BaseException as err:
    logger.error("kaggle download related error", exc_info=True)
    sys.stdout.flush()
    sys.exit("Forceful exit as exception encountered while kaggle download")

In [None]:
!mkdir /content/drive/MyDrive/colab_exp_result/kaggle_data
!unzip /content/open-problems-multimodal.zip -d /content/drive/MyDrive/colab_exp_result/kaggle_data

We can mount Google Drive in Colab and copy the Kaggle competition files there. This avoids running the Kaggle download code every time the notebook starts, which can save a lot of time. Instead, each time we copy the contents from Drive directly into the local filesystem of the VM hosting the notebook.
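
The same copy step can also be written in Python, which makes it easy to detect an empty cache and fall back to the download. This is only a sketch: the helper name `copy_cached_files` is hypothetical, and it copies regular files only, like the `cp` call with a `*` glob.

```python
import shutil
from pathlib import Path

def copy_cached_files(cache_dir, local_dir):
    """Copy every regular file from cache_dir into local_dir and return the copied names."""
    cache_dir, local_dir = Path(cache_dir), Path(local_dir)
    if not cache_dir.is_dir():
        return []  # nothing cached yet; the caller should fall back to the download
    local_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(cache_dir.iterdir()):
        if src.is_file():  # sub-directories are skipped
            shutil.copy2(src, local_dir / src.name)
            copied.append(src.name)
    return copied
```

In this notebook the call would be `copy_cached_files('/content/drive/MyDrive/colab_exp_result/kaggle_data', '/mnt')`; an empty return value signals that the Kaggle download has to be run first.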

In [None]:
!nohup cp /content/drive/MyDrive/colab_exp_result/kaggle_data/* /mnt &

nohup: appending output to 'nohup.out'


In [None]:
!ls -l /mnt

total 28181076
-rw------- 1 root root  2418406934 Apr 18 03:25 evaluation_ids.csv
-rw------- 1 root root      551250 Apr 18 03:25 max_cite_inputs.txt
-rw------- 1 root root     5723550 Apr 18 03:26 max_multi_inputs.txt
-rw------- 1 root root      234920 Apr 18 03:26 metadata_cite_day_2_donor_27678.csv
-rw------- 1 root root     9770334 Apr 18 03:26 metadata.csv
-rw------- 1 root root      551250 Apr 18 03:26 min_cite_inputs.txt
-rw------- 1 root root     5723550 Apr 18 03:26 min_multi_inputs.txt
-rw------- 1 root root   843563244 Apr 18 03:26 sample_submission.csv
-rw------- 1 root root   307964530 Apr 18 03:26 test_cite_inputs_day_2_donor_27678.h5
-rw------- 1 root root  1704565845 Apr 18 03:26 test_cite_inputs.h5
-rw------- 1 root root  6473530657 Apr 18 03:27 test_multi_inputs.h5
-rw------- 1 root root  2498128492 Apr 18 03:28 train_cite_inputs.h5
-rw------- 1 root root    38539123 Apr 18 03:28 train_cite_targets.h5
-rw------- 1 root root 11334840656 Apr 18 03:30 train_multi_inputs.

#Installation of required software packages

In [None]:
!pip install h5py==3.8.0

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
#Ref: https://docs.h5py.org/en/stable/mpi.html
#check whether the parallel version of h5py is available
!h5cc -showconfig

In [None]:
!pip install hdf5plugin~=2.0

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting hdf5plugin~=2.0
  Downloading hdf5plugin-2.3.2-py2.py3-none-manylinux2014_x86_64.whl (5.4 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m5.4/5.4 MB[0m [31m14.4 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: hdf5plugin
Successfully installed hdf5plugin-2.3.2


#HDF5 handling common code

In [1]:
import h5py
import hdf5plugin #importing hdf5plugin registers the compression filters that h5py needs to decompress the data
def get_hdf5_dataset_value_key(hdf5_file, debug = 0):
    groups = []
    def node_visit(name):
        groups.append(name)
    
    hdf5_file.visit(node_visit)
    if debug>0: print(hdf5_file, groups)
    
    for g in groups:
        shape = hdf5_file[g].shape if isinstance(hdf5_file[g], h5py._hl.dataset.Dataset) else None
        if debug>0: print(g, type(hdf5_file[g]), shape)
        if (not shape is None) and (len(shape) == 2):
            return g
    
    return None

def get_hdf5_dataset_with_specific_shape(hdf5_file, size, debug = 0):
    groups = []
    def node_visit(name):
        groups.append(name)
    
    hdf5_file.visit(node_visit)
    if debug>0: print(hdf5_file, groups)
    
    for g in groups:
        shape = hdf5_file[g].shape if isinstance(hdf5_file[g], h5py._hl.dataset.Dataset) else None
        if debug>0: print(g, type(hdf5_file[g]), shape)
        if (not shape is None) and (len(shape) == 1) and (shape[0] == size):
            return g
    
    return None

def get_hdf5_info(hdf5_file):
    print('root-group file-object name:', hdf5_file.name)
    def print_keys(gr, level):
        keys = list(gr.keys())
        for k in keys:
            
            if isinstance(gr[k], h5py._hl.group.Group):
                print('->'*level, k, gr[k])
                print_keys(gr[k], level + 1)
            elif isinstance(gr[k], h5py._hl.dataset.Dataset):
                print('->'*level, k, gr[k], 'dtype=', gr[k].dtype , 'size=', gr[k].size, 'nbytes=', gr[k].nbytes, 
                      'maxshape=', gr[k].maxshape, 'chunks=', gr[k].chunks)

    print_keys(hdf5_file, 1)



In [2]:
import h5py
import hdf5plugin #importing hdf5plugin registers the compression filters that h5py needs to decompress the data
print('============= TRAIN MULTI INPUT ====================')
train_multi_input_file = h5py.File('/mnt/train_multi_inputs.h5') # HDF5 file
get_hdf5_info(train_multi_input_file)
train_multi_input_file.close()
del train_multi_input_file
print('============= TEST MULTI INPUT ====================')
test_multi_input_file = h5py.File('/mnt/test_multi_inputs.h5') # HDF5 file
get_hdf5_info(test_multi_input_file)
test_multi_input_file.close()
del test_multi_input_file

root-group file-object name: /
-> train_multi_inputs <HDF5 group "/train_multi_inputs" (4 members)>
->-> axis0 <HDF5 dataset "axis0": shape (228942,), type "|S26"> dtype= |S26 size= 228942 nbytes= 5952492 maxshape= (228942,) chunks= (2520,)
->-> axis1 <HDF5 dataset "axis1": shape (105942,), type "|S12"> dtype= |S12 size= 105942 nbytes= 1271304 maxshape= (105942,) chunks= (5461,)
->-> block0_items <HDF5 dataset "block0_items": shape (228942,), type "|S26"> dtype= |S26 size= 228942 nbytes= 5952492 maxshape= (228942,) chunks= (2520,)
->-> block0_values <HDF5 dataset "block0_values": shape (105942, 228942), type "<f4"> dtype= float32 size= 24254573364 nbytes= 97018293456 maxshape= (105942, 228942) chunks= (1, 228942)
root-group file-object name: /
-> test_multi_inputs <HDF5 group "/test_multi_inputs" (4 members)>
->-> axis0 <HDF5 dataset "axis0": shape (228942,), type "|S26"> dtype= |S26 size= 228942 nbytes= 5952492 maxshape= (228942,) chunks= (2520,)
->-> axis1 <HDF5 dataset "axis1": shap

In [3]:
import h5py
import hdf5plugin #importing hdf5plugin registers the compression filters that h5py needs to decompress the data
train_mult_input_file = h5py.File('/mnt/train_multi_inputs.h5') # HDF5 file
hdf5_input_key = get_hdf5_dataset_value_key(train_mult_input_file, debug=1)

<HDF5 file "train_multi_inputs.h5" (mode r)> ['train_multi_inputs', 'train_multi_inputs/axis0', 'train_multi_inputs/axis1', 'train_multi_inputs/block0_items', 'train_multi_inputs/block0_values']
train_multi_inputs <class 'h5py._hl.group.Group'> None
train_multi_inputs/axis0 <class 'h5py._hl.dataset.Dataset'> (228942,)
train_multi_inputs/axis1 <class 'h5py._hl.dataset.Dataset'> (105942,)
train_multi_inputs/block0_items <class 'h5py._hl.dataset.Dataset'> (228942,)
train_multi_inputs/block0_values <class 'h5py._hl.dataset.Dataset'> (105942, 228942)


In [None]:
hdf5_col_name_key = get_hdf5_dataset_with_specific_shape(train_mult_input_file, 228942, debug=1)
cols = train_mult_input_file[hdf5_col_name_key]
print(cols.shape)
from tqdm import tqdm
col_name = []
for c_id in tqdm(range(cols.shape[0])):
    col_name.append(str(cols[c_id], 'UTF-8'))

*   Ref: https://luis-sena.medium.com/sharing-big-numpy-arrays-across-python-processes-abf0dc2a0ab2 (why Ray with a shared object store is the best solution)
*   Ref: https://towardsdatascience.com/histogram-on-function-space-4a710241f026
*   Ref: https://stackoverflow.com/questions/71844846/is-there-a-faster-way-to-get-correlation-coefficents (fast correlation coefficients)



# Global min and max determination of the raw-inputs (will be used for min-max normalization of the data)
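
The idea behind the incremental computation is that the column-wise min and max of the whole matrix equal the element-wise min and max of the per-batch results, so only one batch needs to be in memory at a time. Below is a minimal sketch of that reduction on a small random array. It uses NumPy instead of torch for brevity, and the function name `running_min_max` is an assumption made for illustration.

```python
import numpy as np

def running_min_max(batches):
    """Fold per-batch column minima and maxima into global ones."""
    global_min = None
    global_max = None
    for batch in batches:
        local_min = batch.min(axis=0)  # reduce over rows: one value per column
        local_max = batch.max(axis=0)
        if global_min is None:
            global_min, global_max = local_min, local_max
        else:
            global_min = np.minimum(global_min, local_min)
            global_max = np.maximum(global_max, local_max)
    return global_min, global_max

rng = np.random.default_rng(0)
data = rng.normal(size=(10, 4)).astype(np.float32)
batches = [data[i:i + 3] for i in range(0, len(data), 3)]  # the last batch is shorter
g_min, g_max = running_min_max(batches)
```

Because min and max are associative and commutative, the batches may be processed in any order, which is what allows the work to be spread over several worker processes.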

In [4]:
%%writefile rawInputDataset.py
#read: https://medium.com/@grvsinghal/speed-up-your-python-code-using-multiprocessing-on-windows-and-jupyter-or-ipython-2714b49d6fac
#The above Medium article explains why a Dataset-related class (or any function that is called from multiprocessing)
# has to be defined in a separate Python file (module) when working in a Jupyter notebook
import os
import traceback
import numpy as np
import h5py
import torch
from torch.utils.data import Dataset

import hdf5plugin #importing hdf5plugin registers the compression filters that h5py needs to decompress the data

def get_hdf5_dataset_value_key(hdf5_file, debug = 0):
    groups = []
    def node_visit(name):
        groups.append(name)
    
    hdf5_file.visit(node_visit)
    if debug>0: print(hdf5_file, groups)
    
    for g in groups:
        shape = hdf5_file[g].shape if isinstance(hdf5_file[g], h5py._hl.dataset.Dataset) else None
        if debug>0: print(g, type(hdf5_file[g]), shape)
        if (not shape is None) and (len(shape) == 2):
            return g
    
    return None

class RawInputDataset(Dataset):

    def __init__(self, hdf5_input_path, len, print_lock, debug=0):
        self.hdf5_input_path = hdf5_input_path
        self.len = len
        self.print_lock = print_lock
        self.hdf5_per_process = {}
        self.stat_per_process = {}
        self.debug = debug

    def __getstate__(self):
        my_mod_dic = self.__dict__.copy()
        #h5py objects are not picklable, so we remove them from the object state while pickling
        my_mod_dic['hdf5_per_process'] = {}
        with self.print_lock:
            if self.debug>0: print('pid=', os.getpid(), 'pickled object state:', my_mod_dic)
        return my_mod_dic
    
    '''
    def __setstate__(self, d):
        self.hdf5_input_path = d['hdf5_input_path']
        self.len = d['len']
        self.print_lock = d['print_lock']
        self.hdf5_per_process = {}
        self.debug = d['debug']
        with self.print_lock:
            if self.debug>0: print('pid=', os. getpid(), 'unpickled object state:', self.__dict__)

    '''

    def __len__(self): return self.len

    def internal_initialize(self, batch_size=1):
        
        if os.getpid() in self.hdf5_per_process:
            return
        hdf5_input = h5py.File(self.hdf5_input_path, 'r', driver='stdio')
        hdf5_input_key = get_hdf5_dataset_value_key(hdf5_input)
        len = hdf5_input[hdf5_input_key].shape[0]
        assert len == self.len
        input = None
        if batch_size > 1:
            input = np.zeros((batch_size,hdf5_input[hdf5_input_key].shape[1]), dtype=hdf5_input[hdf5_input_key].dtype)           
        self.hdf5_per_process[os.getpid()] = (hdf5_input[hdf5_input_key], input)
        self.stat_per_process[os.getpid()] = []
        if self.debug>0:
            with self.print_lock:
                print('pid=', os. getpid(), 'internal_initialize:', hdf5_input, hdf5_input_key, hdf5_input[hdf5_input_key], flush=True)
                if not (input is None):
                    print('precreated numpy arr input:', input.shape)
    

    def __getitem__(self, row):
        try:
            self.internal_initialize()
            assert row < self.len
            (hdf5_dataset,  _) = self.hdf5_per_process[os.getpid()]
            if self.debug>0: 
                #ref: https://stackoverflow.com/questions/56364119/managed-dict-of-list-not-updated-in-multiprocessing-when-using-operator
                l = self.stat_per_process[os.getpid()]
                l.append(row)
                self.stat_per_process[os.getpid()] = l
            input = hdf5_dataset[row:row+1]
            #print('type of input=', type(input) , 'shape=', input.shape, flush=True)
            numpy_arr = np.ravel(input)
            return torch.from_numpy(numpy_arr).detach()
        except Exception as e:
            print('Exception occurred in __getitem__:', e)
            traceback.print_exc()
            
        return None

    def get_batch(self, starting_row, batch_size):
        try:
            self.internal_initialize(batch_size)
            assert starting_row < self.len
            (hdf5_dataset, input) = self.hdf5_per_process[os.getpid()]
            end_row = min(starting_row + batch_size, self.len)
            if self.debug>0:
                #ref: https://stackoverflow.com/questions/56364119/managed-dict-of-list-not-updated-in-multiprocessing-when-using-operator
                l = self.stat_per_process[os.getpid()]
                l.extend(range(starting_row, end_row))
                self.stat_per_process[os.getpid()] = l
            
            #input = hdf5_dataset[starting_row:end_row]
            
            #allocate a fresh buffer when none was pre-created (the file was first opened with batch_size=1)
            #or when the row count differs (will happen for the last batch)
            if (input is None) or (input.shape[0] != (end_row - starting_row)):
                input = np.zeros(((end_row - starting_row),hdf5_dataset.shape[1]), dtype=hdf5_dataset.dtype)
            hdf5_dataset.read_direct(input, source_sel=np.s_[starting_row:end_row,:], dest_sel=None)
            
            #print('type of input=', type(input) , 'shape=', input.shape, flush=True)
            return torch.from_numpy(input).detach()
        except Exception as e:
            print('Exception occurred in get_batch():', e)
            traceback.print_exc()
            
        return None

    def reset_consumed_record(self):
        if self.debug <= 0:
           return
        
        with self.print_lock:
            for key in self.stat_per_process.keys():
                record_consumed = self.stat_per_process[key]
                print('pid=', key, 'consumed element count = ', len(record_consumed))
                if self.debug>2: print(record_consumed)
                record_consumed.clear()    

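    def __del__(self):
        #each entry holds (h5py dataset, pre-created numpy buffer); a dataset has no close(),
        #so close the file that owns it instead
        for key in list(self.hdf5_per_process.keys()):
            (hdf5_dataset, _) = self.hdf5_per_process.pop(key)
            try:
                hdf5_dataset.file.close()
            except Exception:
                pass #the file may already be closed during interpreter shutdown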


Overwriting rawInputDataset.py


In [5]:
from tqdm import tqdm
import torch
import traceback
import psutil
import gc
import numpy as np
class OnlineMinMaxSequential:
    def __init__(self, iteration_count=-1):
        self.iteration_count = iteration_count

    def last_call_mem_req(self):
        return self.gpu_req_mem, self.ram_req_mem

    def __call__(self, my_dataset, batch_size, gc_call_per_iteration, result_save_path_max = None, result_save_path_min = None):
        starting_gpu_mem = get_gpu_memory()
        starting_ram = psutil.virtual_memory().available

        global_min = None
        global_max = None

        my_dataset.reset_consumed_record()
        for batch_count,starting_row in tqdm(enumerate(range(0, len(my_dataset), batch_size), start=1)):
            try:
                data = my_dataset.get_batch(starting_row, batch_size)
                #data = data.to(torch.device("cuda:0"))
                
                local_min = torch.min(data, dim=0)[0] #we have to find min for each col (so reduction of dim=0)
                local_max = torch.max(data, dim=0)[0] #we have to find max for each col (so reduction of dim=0)
                if not (global_min is None):
                    global_min = torch.minimum(global_min, local_min)
                    global_max = torch.maximum(global_max, local_max)
                else:
                    global_min = local_min
                    global_max = local_max
                
                my_dataset.reset_consumed_record()
                if gc_call_per_iteration:
                    while(gc.collect() > 0): pass #clean the memory as much as possible
                if (self.iteration_count > 0) and (batch_count >= self.iteration_count ):
                    break
            
            except Exception as e:
                print(e)
                traceback.print_exc()

        #save after the loop so that the files hold the global min/max, not just the first batch's values
        if not(global_max is None) and not(result_save_path_max is None): np.savetxt(result_save_path_max, global_max.numpy())
        if not(global_min is None) and not(result_save_path_min is None): np.savetxt(result_save_path_min, global_min.numpy())

        end_gpu_mem = get_gpu_memory()
        end_ram = psutil.virtual_memory().available

        self.gpu_req_mem = starting_gpu_mem - end_gpu_mem
        self.ram_req_mem = starting_ram - end_ram

        return global_max, global_min

In [6]:
from tqdm import tqdm
import torch
from torch.utils.data import DataLoader
import os
import numpy as np
import traceback
import psutil
import gc

def my_worker_init_fn(worker_id):
    worker_info = torch.utils.data.get_worker_info()
    print('pid=', os. getpid(), 'worker-info:', worker_info)
    assert worker_info.id == worker_id
    raw_input_dataset = worker_info.dataset
    print(raw_input_dataset)
    #raw_input_dataset.internal_initialize()

class OnlineMinMax:
    def __init__(self, worker_count, iteration_count=-1):
        self.workers = worker_count
        self.iteration_count = iteration_count

    def last_call_mem_req(self):
        return self.gpu_req_mem, self.ram_req_mem

    def __call__(self, my_dataset, batch_size, gc_call_per_iteration, result_save_path_max = None, result_save_path_min = None):
        starting_gpu_mem = get_gpu_memory()
        starting_ram = psutil.virtual_memory().available
        my_dataset.reset_consumed_record()
        loader = DataLoader(dataset=my_dataset,
                                batch_size=batch_size,
                                shuffle=False,
                                num_workers=self.workers,
                                #prefetch_factor is left at its default so that the call works across PyTorch versions
                                #worker_init_fn=my_worker_init_fn,
                                #multiprocessing_context='spawn',
                                #timeout=300,
                                #num_workers=1,
                                #pin_memory=True,
                                )
        
        global_min = None
        global_max = None
        for batch_count,data in tqdm(enumerate(loader, start=1)):
            try:
                #data = data.to(torch.device("cuda:0"))
                #print(data.shape)
                local_min = torch.min(data, dim=0)[0] #we have to find min for each col (so reduction of dim=0)
                local_max = torch.max(data, dim=0)[0] #we have to find max for each col (so reduction of dim=0)
                if not (global_min is None):
                    global_min = torch.minimum(global_min, local_min)
                    global_max = torch.maximum(global_max, local_max)
                else:
                    global_min = local_min
                    global_max = local_max

                if gc_call_per_iteration:
                    while(gc.collect() > 0): pass #clean the memory as much as possible
                if (self.iteration_count > 0) and (batch_count >= self.iteration_count ):
                    break
            
            except Exception as e:
                print(e)
                traceback.print_exc()

        #save after the loop so that the files hold the global min/max, not just the first batch's values
        if not(global_max is None) and not(result_save_path_max is None): np.savetxt(result_save_path_max, global_max.numpy())
        if not(global_min is None) and not(result_save_path_min is None): np.savetxt(result_save_path_min, global_min.numpy())

        end_gpu_mem = get_gpu_memory()
        end_ram = psutil.virtual_memory().available

        self.gpu_req_mem = starting_gpu_mem - end_gpu_mem
        self.ram_req_mem = starting_ram - end_ram

        del loader
        return global_max, global_min

In [7]:
#https://stackoverflow.com/questions/2918898/prevent-python-from-caching-the-imported-modules
#we have to ensure the 'rawInputDataset' module is reloaded every time (this lets us keep changing the source code of the rawInputDataset.py file)
%load_ext autoreload

In [8]:
%autoreload 2

In [None]:
!python3 -m pip install memray

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting memray
  Downloading memray-1.7.0-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.2 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.2/3.2 MB[0m [31m50.6 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: memray
Successfully installed memray-1.7.0


In [None]:
%load_ext memray

In [9]:
#%%memray_flamegraph --trace-python-allocators --follow-fork --native --leaks 
import math
from rawInputDataset import RawInputDataset
import traceback
import multiprocessing
from multiprocessing import current_process
from threading import current_thread
import inspect
import gc
import os
import h5py
from tqdm import tqdm

#https://superfastpython.com/multiprocessing-pool-initializer/#Example_of_Accessing_an_Initialized_Variable_in_a_Worker
def my_custom_init(arg):
    # declare global variable
    global custom_data
    # assign the global variable
    custom_data = arg
    # get the current process
    process = current_process()
    # get the current thread
    thread = current_thread()
    # report a message
    with custom_data.print_lock:
        print(f'Initializing worker pid={os. getpid()} process={process.name}, thread={thread.name}, with={custom_data}', flush=True)


def my_custom_min_max_finding_job(arg):
    global custom_data
    (starting_row, my_batch_size) = arg
    data = custom_data.get_batch(starting_row, my_batch_size)
    #data = data.to(torch.device("cuda:0"))
    local_min = torch.min(data, dim=0)[0] #we have to find min for each col (so reduction of dim=0)
    local_max = torch.max(data, dim=0)[0] #we have to find max for each col (so reduction of dim=0)    
    return (local_min, local_max)

def find_min_max(manager, algo_type='SEQ'):
    print_lock = manager.Lock() #https://superfastpython.com/multiprocessing-pool-mutex-lock/
    #print_lock = tqdm.get_lock()
    try:
        hdf5_input_file_path = '/mnt/train_multi_inputs.h5'
        hdf5_input = h5py.File(hdf5_input_file_path, 'r')
        hdf5_input_key = get_hdf5_dataset_value_key(hdf5_input)
        data_len = hdf5_input[hdf5_input_key].shape[0]
        first_elem = hdf5_input[hdf5_input_key][0]
        elem_size_in_bytes = first_elem.size * first_elem.itemsize
        print('first_elem.size:', first_elem.size, 'first_elem.itemsize:', first_elem.itemsize, 'elem_size_in_bytes:', elem_size_in_bytes)
        hdf5_input.close()
        del hdf5_input

        my_dataset = RawInputDataset(hdf5_input_file_path, data_len, print_lock, debug=1)
        print(my_dataset, 'len=', len(my_dataset))
        #optimal_batch_size = math.floor((1024 * 1024 * 1024) / elem_size_in_bytes) #max 1GB of numpy array 
        optimal_batch_size = math.floor((20 * 1024 * 1024) / elem_size_in_bytes) #max 20MB of numpy array 
        print('optimal_batch_size:', optimal_batch_size)

        while(gc.collect() > 0): pass #clean the memory as much as possible

        if algo_type=='SEQ':
            olmm = OnlineMinMaxSequential()
            max, min = olmm(my_dataset, optimal_batch_size, gc_call_per_iteration=True)
        elif algo_type=='TORCH_PARALLEL':
            olmm = OnlineMinMax(os.cpu_count() * 2)
            max, min = olmm(my_dataset, optimal_batch_size, gc_call_per_iteration=True)
        elif algo_type=='CUSTOM_PARALLEL':
            my_dataset.stat_per_process = manager.dict()
            work_arg = []
            for s_row in range(0, len(my_dataset), optimal_batch_size):
            #for s_row in range(0, 6 * optimal_batch_size, optimal_batch_size):
                work_arg.append((s_row, optimal_batch_size))
            print("total task required:", len(work_arg))
            with  multiprocessing.Pool(os.cpu_count(), initializer=my_custom_init, initargs=(my_dataset,)) as p:
                #For very long iterables using a large value for chunksize can make the job complete much faster than using the default value of 1 
                #(refer: https://docs.python.org/3/library/multiprocessing.html#multiprocessing.pool.Pool.imap_unordered)
                result = p.imap_unordered(my_custom_min_max_finding_job, work_arg, chunksize=1)
                min = None
                max = None
                for r in tqdm(result):
                    if not (min is None):
                        min = torch.minimum(min, r[0])
                        max = torch.maximum(max, r[1])
                    else:
                        min = r[0]
                        max = r[1]

                # close the process pool
                p.close()
                # wait for all tasks to complete
                p.join()
            
            my_dataset.reset_consumed_record()
            
        print('max.shape:', max.shape, 'min.shape:', min.shape)
        return max, min
    except Exception as e:
        print(e)
        traceback.print_exc()
    finally:
        del my_dataset

if __name__ == '__main__':
    print('MAIN pid=', os. getpid())

    with multiprocessing.Manager() as manager:
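        #keep the returned tensors: max_m and min_m are written to Drive afterwards
        max_m, min_m = find_min_max(manager, algo_type='CUSTOM_PARALLEL')
        #max_m, min_m = find_min_max(manager, algo_type='SEQ')
        #max_m, min_m = find_min_max(manager, algo_type='TORCH_PARALLEL')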
    gc.collect()
    print(gc.get_stats())

MAIN pid= 126087
first_elem.size: 228942 first_elem.itemsize: 4 elem_size_in_bytes: 915768
<rawInputDataset.RawInputDataset object at 0x7f9561fb6f70> len= 105942
optimal_batch_size: 22
total task required: 4816
Initializing worker pid=127233 process=ForkPoolWorker-2, thread=MainThread, with=<rawInputDataset.RawInputDataset object at 0x7f9561fb6f70>
Initializing worker pid=127239 process=ForkPoolWorker-3, thread=MainThread, with=<rawInputDataset.RawInputDataset object at 0x7f9561fb6f70>


0it [00:00, ?it/s]

pid= 127233 internal_initialize: <HDF5 file "train_multi_inputs.h5" (mode r)> train_multi_inputs/block0_values <HDF5 dataset "block0_values": shape (105942, 228942), type "<f4">
precreated numpy arr input: (22, 228942)
pid= 127239 internal_initialize: <HDF5 file "train_multi_inputs.h5" (mode r)> train_multi_inputs/block0_values <HDF5 dataset "block0_values": shape (105942, 228942), type "<f4">
precreated numpy arr input: (22, 228942)


4816it [02:41, 29.76it/s]


pid= 127233 consumed element count =  52954
pid= 127239 consumed element count =  52988
max.shape: torch.Size([228942]) min.shape: torch.Size([228942])
[{'collections': 583, 'collected': 3009, 'uncollectable': 0}, {'collections': 52, 'collected': 654, 'uncollectable': 0}, {'collections': 6, 'collected': 0, 'uncollectable': 0}]
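
The batch size and task count printed above follow directly from the row size and the 20 MiB buffer budget used in `find_min_max`. A short worked calculation, using only the shape and dtype reported in the output (the variable names are illustrative):

```python
import math

n_rows, n_cols, itemsize = 105942, 228942, 4   # dataset shape and float32 width
row_bytes = n_cols * itemsize                  # bytes needed for one row
budget_bytes = 20 * 1024 * 1024                # 20 MiB buffer budget
batch_size = budget_bytes // row_bytes         # whole rows that fit in the budget
n_tasks = math.ceil(n_rows / batch_size)       # the last task covers the leftover rows
last_batch_rows = n_rows - (n_tasks - 1) * batch_size
print(row_bytes, batch_size, n_tasks, last_batch_rows)  # 915768 22 4816 12
```

The shorter final batch of 12 rows is the case that `get_batch` handles by allocating a buffer of a different size.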


In [None]:
np.savetxt('/content/drive/MyDrive/colab_exp_result/kaggle_data/max_multi_inputs.txt', max_m.numpy())
np.savetxt('/content/drive/MyDrive/colab_exp_result/kaggle_data/min_multi_inputs.txt', min_m.numpy())

Instead of recomputing the min and max of the inputs, we can load them from the saved location each time. This saves the time needed to rerun the min-max finding algorithm.

In [None]:
max_multi = np.float32(np.loadtxt('/content/drive/MyDrive/colab_exp_result/kaggle_data/max_multi_inputs.txt'))
min_multi = np.float32(np.loadtxt('/content/drive/MyDrive/colab_exp_result/kaggle_data/min_multi_inputs.txt'))
print(max_multi.shape)
print(min_multi.shape)
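
Once the per-column bounds are loaded, they can be applied to any batch of rows. The sketch below shows one possible form of the min-max normalization; the function name `min_max_normalize` and the handling of constant columns (mapped to 0 instead of dividing by zero) are assumptions, since the notebook does not show this step.

```python
import numpy as np

def min_max_normalize(batch, col_min, col_max):
    """Scale each column of batch to [0, 1] using precomputed column bounds."""
    span = col_max - col_min
    # Columns whose min equals their max would divide by zero; use 1 there instead.
    safe_span = np.where(span > 0, span, 1.0).astype(batch.dtype)
    return (batch - col_min) / safe_span

col_min = np.array([0.0, 2.0, 5.0], dtype=np.float32)
col_max = np.array([4.0, 2.0, 9.0], dtype=np.float32)  # the middle column is constant
batch = np.array([[1.0, 2.0, 5.0],
                  [4.0, 2.0, 7.0]], dtype=np.float32)
normalized = min_max_normalize(batch, col_min, col_max)
print(normalized)
```

With the real data, `col_min` and `col_max` would be `min_multi` and `max_multi`, and `batch` a block of rows read from the HDF5 file.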