# Process illumination files and generate corresponding numpy (npz) files containing image data. It leverages multiprocessing to handle large datasets efficiently.

Using plate BR00117035 from Batch1 (2021_04_26_Batch1) as an example, download the sample images and illumination files with the commands below. The metadata can be downloaded from https://github.com/jump-cellpainting/cpg0016-jump-orf-data/tree/master/load_data_csv

In [1]:
#!aws s3 cp --no-sign-request s3://cellpainting-gallery/cpg0016-jump/source_4/images/2021_04_26_Batch1/images/BR00117035__2021-05-02T16_02_51-Measurement1/Images/ /data/pub/cell/cpg0016_source4/images/2021_04_26_Batch1/images/BR00117035__2021-05-02T16_02_51-Measurement1/Images/ --recursive --exclude "*.sqlite"
#!aws s3 cp --no-sign-request s3://cellpainting-gallery/cpg0016-jump/source_4/images/2021_04_26_Batch1/illum/BR00117035/ /data/pub/cell/cpg0016_source4/images/2021_04_26_Batch1/illum/BR00117035 --recursive --exclude "*.sqlite"
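The two downloads above can also be scripted from Python. The sketch below only assembles the `aws s3 cp` argument list with the same flags as the commented commands and prints it by default. The helper names `build_s3_copy_command` and `download` are hypothetical, and actually running the command assumes the AWS CLI is installed.

```python
import subprocess


def build_s3_copy_command(source, destination, exclude="*.sqlite"):
    """Assemble an anonymous, recursive `aws s3 cp` command as an argument list."""
    return [
        "aws", "s3", "cp",
        "--no-sign-request",  # public bucket: no AWS credentials are needed
        source, destination,
        "--recursive",
        "--exclude", exclude,
    ]


def download(source, destination, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    command = build_s3_copy_command(source, destination)
    print(" ".join(command))
    if not dry_run:
        subprocess.run(command, check=True)
    return command
```

Calling `download(src, dst)` shows what would be run; pass `dry_run=False` to start the transfer.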

In [2]:
save_path = '/data/pub/cell/cpg0016_source4/npz_data_demo' # Path where npz files are saved
illumn_path = '/data/pub/cell/cpg0016_source4/illumn_data_demo/' # Path for metadata
replace_path = '/data/pub/cell/cpg0016_source4/images/2021_04_26_Batch1/images/BR00117035__2021-05-02T16_02_51-Measurement1/Images/' # Path where files are downloaded
# example: download path: s3://cellpainting-gallery/cpg0016-jump/source_4/images/2021_04_26_Batch1/images/BR00117035__2021-05-02T16_02_51-Measurement1/Images/
# example: local path: /data/pub/cell/images/2021_04_26_Batch1/images/2021_04_26_Batch1/images/BR00117035__2021-05-02T16_02_51-Measurement1/Images/
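The load_data CSV records the directory where the images lived when the metadata was generated, so `replace_path` is presumably substituted for that directory when files are read locally. A minimal sketch of that idea follows. The column names `PathName_OrigDNA` and `FileName_OrigDNA`, the example file name, and the helper `localize_row` are assumptions for illustration, not cpDistiller's actual code.

```python
import posixpath


def localize_row(row, replace_path, channel="OrigDNA"):
    """Return the local path for one channel of one load_data row.

    The directory recorded in the CSV (PathName_<channel>) is ignored and
    replaced by replace_path; only the file name is kept.
    """
    file_name = row[f"FileName_{channel}"]
    return posixpath.join(replace_path, file_name)


# Hypothetical row; real rows contain one PathName/FileName pair per channel.
row = {
    "PathName_OrigDNA": "/home/ubuntu/bucket/projects/example/Images",
    "FileName_OrigDNA": "r01c01f01p01-ch5sk1fk1fl1.tiff",
}
print(localize_row(row, "/data/local/Images/"))
```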

Step1: Fill missing well and site combinations in the CSV file, using the data from the smallest site number in the same well.

In [3]:
import cpDistiller
cpDistiller.utils.fill_missing_combinations('/data/pub/cell/cpg0016_source4/illumn_data_demo/load_data_with_illum.csv')
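To make the filling rule concrete, here is a minimal sketch of the behaviour described in Step1, written with plain Python dictionaries. It is not the implementation of `fill_missing_combinations`; the column names `Metadata_Well` and `Metadata_Site` and the helper `fill_missing_sites` are assumptions, and the sketch can only fill sites for wells that have at least one row.

```python
from itertools import product


def fill_missing_sites(rows, well_key="Metadata_Well", site_key="Metadata_Site"):
    """Return rows covering every (well, site) pair seen in the input.

    A missing pair is filled with a copy of the row that has the smallest
    site number in the same well, relabelled with the missing site.
    """
    wells = sorted({r[well_key] for r in rows})
    sites = sorted({r[site_key] for r in rows})
    by_key = {(r[well_key], r[site_key]): r for r in rows}

    filled = []
    for well, site in product(wells, sites):
        if (well, site) in by_key:
            filled.append(dict(by_key[(well, site)]))
            continue
        # Donor: the smallest site number that exists for this well
        donor_site = min(s for s in sites if (well, s) in by_key)
        donor = dict(by_key[(well, donor_site)])
        donor[site_key] = site
        filled.append(donor)
    return filled


rows = [
    {"Metadata_Well": "A01", "Metadata_Site": 1, "FileName_OrigDNA": "a1.tiff"},
    {"Metadata_Well": "A01", "Metadata_Site": 2, "FileName_OrigDNA": "a2.tiff"},
    {"Metadata_Well": "A02", "Metadata_Site": 2, "FileName_OrigDNA": "b2.tiff"},
]
for r in fill_missing_sites(rows):
    print(r)
```

In this toy input, well A02 has no site 1, so the sketch reuses the A02 site 2 row for it.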

Step2: Process the illumination files and generate the corresponding NumPy (npz) files containing image data. Multiprocessing is used to handle large datasets efficiently. (Estimated time: ~8 h)

<div class="alert note">
<p>

**Note**


1. Lists all files in the `illumn_path`.

2. Initializes a multiprocessing pool with 5 processes.

3. Maps the `multi_multi_process` function to each illum file for concurrent processing.

4. Waits for all processes to complete before exiting.
</p>
</div>
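The four steps in the note follow a common `multiprocessing.Pool` pattern, sketched below under stated assumptions. The names `list_files`, `run_in_pool`, and `describe` are hypothetical stand-ins; in cpDistiller the mapped worker is `multi_multi_process`, whose signature is not shown here.

```python
import os
from multiprocessing import Pool


def list_files(directory):
    """Step 1: return the sorted full paths of the files directly in directory."""
    return sorted(
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name))
    )


def run_in_pool(worker, items, processes=5):
    """Steps 2-4: map worker over items in a pool and wait for completion.

    With processes <= 1 the items are handled serially, which is convenient
    for debugging.
    """
    if processes <= 1:
        return [worker(item) for item in items]
    pool = Pool(processes=processes)      # step 2: start the worker processes
    try:
        return pool.map(worker, items)    # step 3: one call per file
    finally:
        pool.close()                      # step 4: no more work will be submitted
        pool.join()                       #         wait for the workers to exit


def describe(path):
    """Hypothetical stand-in worker: report a file's size in bytes."""
    return path, os.path.getsize(path)


# Example (call from under `if __name__ == "__main__":` in a script):
#     results = run_in_pool(describe, list_files("/path/to/illum_dir"), processes=5)
```

The worker must be defined at module level so that it can be pickled and sent to the worker processes.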


In [4]:
cpDistiller.prepare_union.tiff2npz(save_path, illumn_path, replace_path, 5)  # 5 = number of worker processes

/home/ubuntu/bucket/projects/2021_04_26_Production/2021_04_26_Batch1/images/BR00117035__2021-05-02T16_02_51-Measurement1/Images
/data/pub/cell/cpg0016_source4/images/2021_04_26_Batch1/illum/BR00117035/BR00117035_IllumDNA.npy
/home/ubuntu/bucket/projects/2021_04_26_Production/2021_04_26_Batch1/images/BR00117035__2021-05-02T16_02_51-Measurement1/Images
/data/pub/cell/cpg0016_source4/images/2021_04_26_Batch1/illum/BR00117035/BR00117035_IllumDNA.npy
/home/ubuntu/bucket/projects/2021_04_26_Production/2021_04_26_Batch1/images/BR00117035__2021-05-02T16_02_51-Measurement1/Images
/data/pub/cell/cpg0016_source4/images/2021_04_26_Batch1/illum/BR00117035/BR00117035_IllumDNA.npy
/home/ubuntu/bucket/projects/2021_04_26_Production/2021_04_26_Batch1/images/BR00117035__2021-05-02T16_02_51-Measurement1/Images
/data/pub/cell/cpg0016_source4/images/2021_04_26_Batch1/illum/BR00117035/BR00117035_IllumDNA.npy
/home/ubuntu/bucket/projects/2021_04_26_Production/2021_04_26_Batch1/images/BR00117035__2021-05-02T1
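Once Step2 has finished, it can be useful to check what an npz file contains before moving on. The sketch below uses only `numpy.load` and makes no assumption about the array names that `tiff2npz` writes; the helper `summarize_npz` is hypothetical.

```python
import numpy as np


def summarize_npz(path):
    """Return {array_name: (shape, dtype)} for every array stored in an npz file."""
    summary = {}
    with np.load(path) as archive:
        for name in archive.files:
            array = archive[name]  # arrays are read lazily, one at a time
            summary[name] = (array.shape, str(array.dtype))
    return summary


# Example:
#     print(summarize_npz('/data/pub/cell/cpg0016_source4/npz_data_demo/load_data_with_il.npz'))
```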

Step3: Process the npz data from the specified paths by applying histogram normalization and smoothing, then merging the processed channels. An instance of the Mesmer application from DeepCell is used to predict and extract embeddings, which are then saved to a CSV file. (Estimated time: ~3 h)
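The preprocessing named above (normalize, smooth, merge) can be illustrated with a small NumPy sketch. This is not cpDistiller's code: histogram equalization and a 3x3 mean filter are used here as simple stand-ins, and the exact operations and channel layout used by `npz2embedding` may differ. All function names are hypothetical.

```python
import numpy as np


def equalize_histogram(image, bins=256):
    """Map intensities through the image's own CDF, giving values in [0, 1]."""
    image = np.asarray(image, dtype=np.float64)
    hist, bin_edges = np.histogram(image.ravel(), bins=bins)
    cdf = np.cumsum(hist).astype(np.float64)
    cdf /= cdf[-1]
    centres = (bin_edges[:-1] + bin_edges[1:]) / 2
    # Interpolate every pixel intensity onto the CDF sampled at the bin centres
    return np.interp(image.ravel(), centres, cdf).reshape(image.shape)


def box_smooth(image, size=3):
    """Smooth a 2-D image with a size x size mean filter (size odd, edge padding)."""
    pad = size // 2
    padded = np.pad(image, pad, mode="edge")
    out = np.zeros(image.shape, dtype=np.float64)
    for dy in range(size):
        for dx in range(size):
            out += padded[dy:dy + image.shape[0], dx:dx + image.shape[1]]
    return out / (size * size)


def merge_channels(channels):
    """Normalize and smooth each 2-D channel, then stack them as (H, W, C)."""
    processed = [box_smooth(equalize_histogram(c)) for c in channels]
    return np.stack(processed, axis=-1)


rng = np.random.default_rng(0)
channels = [rng.random((16, 16)) * 1000 for _ in range(3)]
print(merge_channels(channels).shape)
```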

In [5]:
cpDistiller.prepare_union.npz2embedding(
    dir_path='/data/pub/cell/cpg0016_source4/demo_embedding/',
    data_paths=['/data/pub/cell/cpg0016_source4/npz_data_demo/load_data_with_il.npz'],
)

2024-10-10 04:35:15,521 - INFO - file_path:/data/pub/cell/cpg0016_source4/npz_data_demo/load_data_with_il.npz
2024-10-10 05:20:32.796766: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-10-10 05:20:32.917571: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 22290 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 4090, pci bus id: 0000:31:00.0, compute capability: 8.9
2024-10-10 05:45:04.068467: I tensorflow/stream_executor/cuda/cuda_dnn.cc:368] Loaded cuDNN version 8600


batch_features: (31104, 4, 8, 8, 256)
(3456, 152, 152, 16)


2024-10-10 07:20:46.207741: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 10220470272 exceeds 10% of free system memory.


batch_features: (3456, 1296)
