# Acuzle utils

This example requires OpenCV.

In [1]:
!pip install --quiet opencv-python

In [2]:
import acutils as au

CUDA_VISIBLE_DEVICES is not set.Valid example: export CUDA_VISIBLE_DEVICES=1,3,0
|WRN| Using CPU, cucim or cupy not available.


Functions and classes implemented by Acuzle are collected in this library.

In [3]:
au.file # about directories and files
au.handler # about handlers, high level classes to make developer lives easier
au.gpu # about GPU computation, to enable or disable it (only for pathology for now)
au.image # about images (Computer Vision)
au.multiprocess # about multiprocessing and processes
au.pathology # about pathology data for segmentation and tiling
au.sheet # about sheets (pandas.DataFrame)
au.video # about video data, can extract frames or sequences

<module 'acutils.video' from '/home/thomaspdm/.local/lib/python3.10/site-packages/acutils/video.py'>

Let's focus on the DataHandler.

This class is designed to make data preprocessing easier.

In [4]:
au.handler.DataHandler

acutils.handler.DataHandler

Before using the handler, we generate the example data.

In [5]:
from generate_data import reset_fake_data
reset_fake_data()

Initialize the handler by selecting a data directory and the number of CPUs it is allowed to use.

In [6]:
import os
DIR = os.path.abspath('fake_data')
ANIMALS_DIR = os.path.join(DIR, "animals")

In [7]:
handler = au.handler.DataHandler(ANIMALS_DIR, allowed_cpus=2)

Load png/jpg/jpeg files from the data directory.

In [8]:
handler.file_extensions = ["png", "jpg", "jpeg"]
handler.load_data_fromdatapath()

In [9]:
handler.files

array(['6.png', '2.png', 'random5text.jpeg', '9.png', '7.png', '1.png',
       '3.png', '12.png', '11.png', '10.png', '4.png', '8.png'],
      dtype='<U256')
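The extension filter can be pictured with the standard library alone. This is a minimal sketch, not the acutils implementation: `list_files_by_extension` is a hypothetical helper, and the real loader may differ (for instance in ordering or case handling).

```python
import os
import tempfile

def list_files_by_extension(directory, extensions):
    """Return the names of regular files in `directory` whose extension is listed."""
    wanted = {ext.lower().lstrip(".") for ext in extensions}
    names = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        ext = os.path.splitext(name)[1].lower().lstrip(".")
        if os.path.isfile(path) and ext in wanted:
            names.append(name)
    return names

# Tiny demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["1.png", "2.jpg", "labels.xlsx"]:
        open(os.path.join(tmp, name), "w").close()
    found = list_files_by_extension(tmp, ["png", "jpg", "jpeg"])
```

Here `labels.xlsx` is skipped because its extension is not in the list.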

Load labels from a sheet file.

In [10]:
LABELING_PATH = os.path.join(ANIMALS_DIR, 'labels.xlsx')

In [11]:
handler.load_labels_fromsheet(LABELING_PATH, idcol="id", labelcol="label")

In [12]:
handler.labels

array(['dog', 'cat', 'cat', 'cat', 'dog', 'cat', 'dog', 'dog', 'dog',
       'cat', 'dog', 'dog'], dtype='<U256')

We are going to resize those images.

First, select the new size and the new directory.

In [13]:
NEW_WIDTH = 224
NEW_HEIGHT = 224
DATASETS_DIR = os.path.join(DIR, f"sets_{NEW_WIDTH}x{NEW_HEIGHT}")

In [14]:
import shutil
if os.path.isdir(DATASETS_DIR):
    shutil.rmtree(DATASETS_DIR)
os.mkdir(DATASETS_DIR)

In the library, each function that can be used as a treatment is prefixed with "tmnt_".

An appropriate function must have "src" and "dst" arguments, which are absolute paths.

Any other arguments of this function should be passed as **kwargs.
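To make the convention concrete, here is a minimal sketch of a function with that shape. `tmnt_copy_file` and `apply_treatment` are hypothetical names that do not come from acutils, and whether the handler accepts functions defined outside the library is an assumption.

```python
import os
import shutil
import tempfile

def tmnt_copy_file(src, dst, overwrite=True):
    """Hypothetical treatment: copy `src` to `dst` (both absolute paths)."""
    if os.path.exists(dst) and not overwrite:
        return dst
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    shutil.copyfile(src, dst)
    return dst

def apply_treatment(func, src, dst, **kwargs):
    """Call a treatment with the mandatory paths, forwarding extra options."""
    return func(src=src, dst=dst, **kwargs)

# Demonstration on a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in.txt")
    dst = os.path.join(tmp, "out", "in.txt")
    with open(src, "w") as fh:
        fh.write("hello")
    apply_treatment(tmnt_copy_file, src, dst, overwrite=False)
    copied_ok = os.path.isfile(dst)
```

The extra option (`overwrite` here) plays the same role as `new_width` and `new_height` in the resize call.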

In [15]:
DATA_PATH = os.path.join(DATASETS_DIR, "data")

In [16]:
handler.files

array(['6.png', '2.png', 'random5text.jpeg', '9.png', '7.png', '1.png',
       '3.png', '12.png', '11.png', '10.png', '4.png', '8.png'],
      dtype='<U256')

In [17]:
handler.process(DATA_PATH, au.image.tmnt_resize_file, empty_dir=True, 
                new_width=NEW_WIDTH, new_height=NEW_HEIGHT)

CUDA_VISIBLE_DEVICES is not set.Valid example: export CUDA_VISIBLE_DEVICES=1,3,0
|WRN| Using CPU, cucim or cupy not available.
CUDA_VISIBLE_DEVICES is not set.Valid example: export CUDA_VISIBLE_DEVICES=1,3,0
|WRN| Using CPU, cucim or cupy not available.


In [18]:
!ls {DATASETS_DIR}

data


In [19]:
!ls {DATA_PATH}

cat  dog


In [20]:
!ls {os.path.join(DATA_PATH, handler.labels[1])}

10.png	1.png  2.png  9.png  random5text.jpeg
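The resulting layout (one subdirectory per label) can be reproduced with a short sequential loop. This is only a sketch under stated assumptions: `process_into_label_dirs` and `copy_treatment` are hypothetical, and the real handler presumably spreads the work over the allowed CPUs rather than looping in one process.

```python
import os
import shutil
import tempfile

def process_into_label_dirs(src_dir, dst_dir, files, labels, func, **kwargs):
    """Apply `func` to every file, writing the result to dst_dir/<label>/<file>."""
    written = []
    for name, label in zip(files, labels):
        label_dir = os.path.join(dst_dir, label)
        os.makedirs(label_dir, exist_ok=True)
        dst = os.path.join(label_dir, name)
        func(src=os.path.join(src_dir, name), dst=dst, **kwargs)
        written.append(dst)
    return written

def copy_treatment(src, dst):
    """Stand-in treatment that copies the file unchanged."""
    shutil.copyfile(src, dst)

with tempfile.TemporaryDirectory() as tmp:
    src_dir = os.path.join(tmp, "animals")
    dst_dir = os.path.join(tmp, "data")
    os.makedirs(src_dir)
    for name in ["1.png", "6.png"]:
        open(os.path.join(src_dir, name), "w").close()
    written = process_into_label_dirs(
        src_dir, dst_dir, ["1.png", "6.png"], ["cat", "dog"], copy_treatment
    )
    layout = sorted(os.listdir(dst_dir))
```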


Split into train and validation datasets.

It returns two dictionaries with files as keys and labels as values.

In [21]:
tdata, vdata = handler.split(train_percentage=0.70) # 70% is the default value

In [22]:
vdata

{'random5text.jpeg': 'cat',
 '10.png': 'cat',
 '4.png': 'dog',
 '7.png': 'dog',
 '11.png': 'dog'}
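One way to picture such a split is to shuffle the files of each label and cut each list at the requested percentage. This is a sketch under that assumption; the source does not show how acutils actually splits, and `split_per_label` and its `seed` parameter are hypothetical.

```python
import random

def split_per_label(files, labels, train_percentage=0.7, seed=0):
    """Split files into train/val dictionaries, label by label."""
    rng = random.Random(seed)
    by_label = {}
    for name, label in zip(files, labels):
        by_label.setdefault(label, []).append(name)
    train, val = {}, {}
    for label, names in by_label.items():
        names = list(names)
        rng.shuffle(names)
        cut = int(len(names) * train_percentage)  # rounded down
        for name in names[:cut]:
            train[name] = label
        for name in names[cut:]:
            val[name] = label
    return train, val

demo_files = ['6.png', '2.png', 'random5text.jpeg', '9.png', '7.png', '1.png',
              '3.png', '12.png', '11.png', '10.png', '4.png', '8.png']
demo_labels = ['dog', 'cat', 'cat', 'cat', 'dog', 'cat', 'dog', 'dog', 'dog',
               'cat', 'dog', 'dog']
train_demo, val_demo = split_per_label(demo_files, demo_labels)
```

With 5 cats and 7 dogs, rounding down keeps 3 cats and 4 dogs for training, which leaves 5 files for validation.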

Balance data between labels.

In [23]:
bal_tdata, bal_vdata = handler.balance_datasets(tdata, vdata)

In [24]:
bal_vdata

{'random5text.jpeg': 'cat', '10.png': 'cat', '4.png': 'dog', '11.png': 'dog'}
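The output above is consistent with undersampling, where every label is reduced to the size of the rarest one. The sketch below illustrates that idea only; `undersample` is a hypothetical helper and the library may balance differently.

```python
import random
from collections import Counter

def undersample(data, seed=0):
    """Keep as many files per label as the least represented label has."""
    rng = random.Random(seed)
    by_label = {}
    for name, label in data.items():
        by_label.setdefault(label, []).append(name)
    smallest = min(len(names) for names in by_label.values())
    balanced = {}
    for label, names in by_label.items():
        for name in rng.sample(names, smallest):
            balanced[name] = label
    return balanced

vdata_demo = {'random5text.jpeg': 'cat', '10.png': 'cat',
              '4.png': 'dog', '7.png': 'dog', '11.png': 'dog'}
balanced_demo = undersample(vdata_demo)
counts = Counter(balanced_demo.values())
```

Which dog is dropped depends on the random seed; only the per-label counts are fixed.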

We can also directly create train and validation directories containing the treated data.

In [25]:
TRAIN_PATH = os.path.join(DATASETS_DIR, "train")
VAL_PATH = os.path.join(DATASETS_DIR, "val")

In [26]:
handler.make_datasets(TRAIN_PATH, VAL_PATH, tdata, vdata, 
                        func=au.image.tmnt_resize_file, 
                        new_width=NEW_WIDTH, new_height=NEW_HEIGHT)

In [27]:
!ls {DATASETS_DIR}

data  train  val


In [28]:
TRAIN_BAL_PATH = os.path.join(DATASETS_DIR, "train_bal")
VAL_BAL_PATH = os.path.join(DATASETS_DIR, "val_bal")

In [29]:
handler.make_datasets(TRAIN_BAL_PATH, VAL_BAL_PATH, bal_tdata, bal_vdata, 
                        func=au.image.tmnt_resize_file, 
                        new_width=NEW_WIDTH, new_height=NEW_HEIGHT)

In [30]:
!ls {DATASETS_DIR}

data  train  train_bal	val  val_bal


It is also possible to load labels without a sheet file.

This assumes that the files are inside subdirectories, each named after a unique label.

In [31]:
ANIMALS_LABELED_DIR = os.path.join(DIR, "labeled_animals")

Note that when instantiating a DataHandler, you can indicate the expected extensions.

In [32]:
handler2 = au.handler.DataHandler(ANIMALS_LABELED_DIR, file_extensions=["png"])

Load data files and labels from the data directory.

In [33]:
handler2.load_labeled_data_fromdatapath()

In [34]:
handler2.files

array(['cat/2.png', 'cat/9.png', 'cat/1.png', 'cat/10.png', 'dog/6.png',
       'dog/7.png', 'dog/3.png', 'dog/12.png', 'dog/11.png', 'dog/4.png',
       'dog/8.png'], dtype='<U256')

In [35]:
handler2.labels

array(['cat', 'cat', 'cat', 'cat', 'dog', 'dog', 'dog', 'dog', 'dog',
       'dog', 'dog'], dtype='<U256')
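Deriving labels from the directory layout can be sketched as follows. `load_labeled_files` is a hypothetical stand-in written with the standard library; it mirrors the "label/file" form seen in the output above but is not the acutils code.

```python
import os
import tempfile

def load_labeled_files(directory, extensions):
    """Return ('label/file' paths, labels) for files stored in label subdirectories."""
    wanted = {ext.lower().lstrip(".") for ext in extensions}
    files, labels = [], []
    for label in sorted(os.listdir(directory)):
        label_dir = os.path.join(directory, label)
        if not os.path.isdir(label_dir):
            continue  # ignore loose files at the top level
        for name in sorted(os.listdir(label_dir)):
            ext = os.path.splitext(name)[1].lower().lstrip(".")
            if ext in wanted:
                files.append(f"{label}/{name}")
                labels.append(label)
    return files, labels

with tempfile.TemporaryDirectory() as tmp:
    for rel in ["cat/1.png", "cat/notes.txt", "dog/3.png"]:
        path = os.path.join(tmp, *rel.split("/"))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()
    labeled_files, labeled_labels = load_labeled_files(tmp, ["png"])
```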

Finally, some files may be related, so they must not appear in both the train and val sets.

For example, in pathology, the same patient might have multiple slides.

In [36]:
SLIDES_DIR = os.path.join(DIR, 'slides')

In [37]:
handler3 = au.handler.DataHandler(SLIDES_DIR)

In [38]:
handler3.load_data_fromdatapath()

In [39]:
handler3.files

array(['p7_s10.tif', 'p5_s7.tif', 'p4_s5.tif', 'p1_s2.tif', 'p1_s1.tif',
       'p6_s9.tif', 'p3_s4.tif', 'patients.csv', 'p6_s8.tif',
       'p7_s11.tif', 'p4_s6.tif', 'p2_s3.tif'], dtype='<U256')

In [40]:
handler3.load_labels_fromsheet(os.path.join(SLIDES_DIR, 'patients.csv'), 
                               idcol='slide_id', labelcol='label')

In [41]:
handler3.load_groups_fromsheet(os.path.join(SLIDES_DIR, 'patients.csv'), 
                               idcol='slide_id', groupcol='patient_id')

In [42]:
handler3.files

array(['p7_s10.tif', 'p5_s7.tif', 'p4_s5.tif', 'p1_s2.tif', 'p1_s1.tif',
       'p6_s9.tif', 'p3_s4.tif', 'p6_s8.tif', 'p7_s11.tif', 'p4_s6.tif',
       'p2_s3.tif'], dtype='<U256')

In [43]:
handler3.groups

array(['p7', 'p5', 'p4', 'p1', 'p1', 'p6', 'p3', 'p6', 'p7', 'p4', 'p2'],
      dtype='<U256')

If the "groups" attribute is defined, split uses it by default.

In [44]:
tdata, vdata = handler3.split(ignore_groups=False)

In [45]:
tdata

{'p4_s5.tif': 'neg',
 'p4_s6.tif': 'neg',
 'p2_s3.tif': 'neg',
 'p7_s10.tif': 'neg',
 'p7_s11.tif': 'neg',
 'p3_s4.tif': 'pos',
 'p1_s2.tif': 'pos',
 'p1_s1.tif': 'pos'}

In [46]:
vdata

{'p6_s9.tif': 'neg', 'p6_s8.tif': 'neg', 'p5_s7.tif': 'pos'}
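The idea behind a group-aware split is to distribute whole groups, not individual files, so every slide of a patient lands on the same side. The sketch below shows that principle only: `split_by_group` is hypothetical, it ignores labels while splitting (a simplification), and it is not the acutils implementation.

```python
import random

def split_by_group(files, labels, groups, train_percentage=0.7, seed=0):
    """Split so that all files sharing a group go to the same set."""
    rng = random.Random(seed)
    unique_groups = sorted(set(groups))
    rng.shuffle(unique_groups)
    cut = int(len(unique_groups) * train_percentage)
    train_groups = set(unique_groups[:cut])
    train, val = {}, {}
    for name, label, group in zip(files, labels, groups):
        target = train if group in train_groups else val
        target[name] = label
    return train, val

slide_files = ['p7_s10.tif', 'p5_s7.tif', 'p4_s5.tif', 'p1_s2.tif', 'p1_s1.tif',
               'p6_s9.tif', 'p3_s4.tif', 'p6_s8.tif', 'p7_s11.tif', 'p4_s6.tif',
               'p2_s3.tif']
slide_labels = ['neg', 'pos', 'neg', 'pos', 'pos', 'neg', 'pos', 'neg', 'neg',
                'neg', 'neg']
slide_groups = ['p7', 'p5', 'p4', 'p1', 'p1', 'p6', 'p3', 'p6', 'p7', 'p4', 'p2']
train_slides, val_slides = split_by_group(slide_files, slide_labels, slide_groups)
```

Whatever the seed, no patient identifier appears in both dictionaries.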