# Write DHS images as MMAP

To speed up loading and keep it from becoming a bottleneck during training, we save the DHS images as a memory map (MMAP). Reading from the MMAP is much faster than reading an individual .np file for each image during training. The MMAP takes up additional storage, but the speed-up makes it a worthwhile trade-off. For more information, see the [MMAP Ninja library](https://github.com/hristo-vrigazov/mmap.ninja?tab=readme-ov-file).
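As a rough illustration of the idea behind a memory map, the sketch below uses plain NumPy rather than MMAP Ninja: all images are written to one file, the file is opened without loading it into RAM, and a single image is read on demand. The array shape, the file name `images.npy`, and the random stand-in data are assumptions made for this example and do not reflect the real DHS images.

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-in data: 8 small "images" (not the real DHS image shape)
images = np.random.default_rng(0).random((8, 4, 16, 16)).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    mmap_path = os.path.join(tmp_dir, "images.npy")

    # Write all images into a single .npy file
    np.save(mmap_path, images)

    # Open the file as a read-only memory map; the image data stays on disk
    images_mmap = np.load(mmap_path, mmap_mode="r")

    # Copy one image into RAM; only that slice has to be read from disk
    first_image = np.array(images_mmap[0])

    # Release the memory map before the temporary directory is removed
    del images_mmap
```

The point of the single-file layout is that opening the dataset costs one file open instead of one per image, and each access touches only the bytes of the requested image.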

In [None]:
import mmap_ninja
from mmap_ninja import numpy as np_ninja
import numpy as np
import os
import pandas as pd
import configparser
from tqdm import tqdm  # progress bar used when iterating over the memory map

# Read config file
config = configparser.ConfigParser()
config.read('../config.ini')

DATA_DIR = config['PATHS']['DATA_DIR']

df = pd.read_csv(os.path.join(DATA_DIR, 'dhs_with_imgs.csv'))

Build the path to each DHS image from its `cluster_id`.

In [2]:
dhs_img_paths = df['cluster_id'].apply(lambda x: os.path.join(DATA_DIR, 'dhs_images', x, 'landsat.np'))
dhs_img_paths = dhs_img_paths.tolist()
print(f'Found {len(dhs_img_paths)} DHS images')

Found 68619 DHS images
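Because `np.load` raises `FileNotFoundError` for a path that does not exist, a single missing image would abort a long conversion partway through. A minimal sketch of a pre-flight check is shown below; `find_missing_paths` is a hypothetical helper introduced for this example, and the temporary files only stand in for the real image paths.

```python
import os
import tempfile


def find_missing_paths(paths):
    """Return the paths that do not point to an existing file."""
    return [path for path in paths if not os.path.isfile(path)]


# Demonstration with one existing file and one absent file (stand-ins only)
with tempfile.TemporaryDirectory() as tmp_dir:
    existing = os.path.join(tmp_dir, "landsat.np")
    open(existing, "wb").close()  # create an empty placeholder file
    absent = os.path.join(tmp_dir, "does_not_exist.np")

    missing = find_missing_paths([existing, absent])
    print(f"{len(missing)} missing file(s)")
```

In this notebook the same check could be applied to `dhs_img_paths` before writing the memory map.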


Write the images as a memory map. This takes roughly ten minutes for the full dataset (9:40 in the run recorded here).

In [None]:
# Once per project, convert the images to a memory map
mmap_ninja.np_from_generator(
    # Directory in which the memory map will be persisted
    out_dir=os.path.join(DATA_DIR, 'dhs_images_mmap'),
    sample_generator=map(np.load, dhs_img_paths),
    # Maximum number of samples to keep in memory before flushing to disk
    batch_size=1024,
    verbose=True
)

print('Memory map created successfully.')

68619it [09:40, 118.30it/s]
Memory map created successfully.


Test loading and iterating over the MMAP. Iterating through the whole dataset should now take less than a second.

In [None]:
# Open the memory map
images_mmap = np_ninja.open_existing(os.path.join(DATA_DIR, 'dhs_images_mmap'))

for i in tqdm(range(len(images_mmap))):
    img: np.ndarray = images_mmap[i]

100%|██████████| 68619/68619 [00:00<00:00, 559866.61it/s]
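The loop above only indexes the memory map. If the opened object behaves like a NumPy `memmap` (an assumption here), indexing returns a lazy view, and the pixel data is read from disk only when the values are actually used. The sketch below shows one way to copy a batch of images into RAM, as a training loop would need. `load_batch` is a hypothetical helper, and a small in-memory array stands in for `images_mmap`.

```python
import numpy as np


def load_batch(images, indices):
    """Copy the selected images from an array-like (e.g. a memory map) into RAM.

    The batch is returned in ascending index order, not in the order given.
    """
    # Sorted indices keep disk access mostly sequential
    sorted_indices = np.sort(np.asarray(indices))
    # np.stack allocates a new in-memory array holding the selected images
    return np.stack([np.asarray(images[i]) for i in sorted_indices])


# Stand-in for the memory map: 10 tiny "images" of shape (2, 3)
stand_in = np.arange(10 * 2 * 3).reshape(10, 2, 3)
batch = load_batch(stand_in, [7, 2, 5])
print(batch.shape)
```

Timing a call like this on the real memory map gives a more realistic estimate of per-batch loading cost than timing the indexing loop alone.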
