Data Preprocessing error #40

MohamedOmar2020 opened this issue Oct 22, 2021 · 27 comments

MohamedOmar2020 commented Oct 22, 2021

Hi Guys, thank you for making this wonderful resource available.
I organized my slides into wsi_train, wsi_val, wsi_test using which ran fine. However I keep getting this error when I run code/

wsi_train/neg: 35359.581701MB, 201 images, overlap_factor=1.00
wsi_train/pos: 92751.49675MB, 451 images, overlap_factor=1.00

getting small crops from 201 images in wsi_train/neg with inverse overlap factor 1.00 outputting in train_folder/train/neg
Traceback (most recent call last):
  File "/home/mao4005/.conda/envs/deepslide/lib/python3.6/site-packages/PIL/", line 101, in __init__
  File "/home/mao4005/.conda/envs/deepslide/lib/python3.6/site-packages/PIL/", line 979, in _open
  File "/home/mao4005/.conda/envs/deepslide/lib/python3.6/site-packages/PIL/", line 1046, in _seek
  File "/home/mao4005/.conda/envs/deepslide/lib/python3.6/site-packages/PIL/", line 1170, in _setup
    self._compression = COMPRESSION_INFO[self.tag_v2.get(COMPRESSION, 1)]
KeyError: 33003

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "deepslide/", line 20, in <module>
  File "/athena/marchionnilab/scratch/lab_data/Mohamed/pca_outcome/deepslide/", line 155, in gen_train_patches
  File "/athena/marchionnilab/scratch/lab_data/Mohamed/pca_outcome/deepslide/", line 364, in produce_patches
    uri=image_loc if by_folder else input_folder.joinpath(image_loc))
  File "/home/mao4005/.conda/envs/deepslide/lib/python3.6/site-packages/imageio/core/", line 265, in imread
    reader = read(uri, format, "i", **kwargs)
  File "/home/mao4005/.conda/envs/deepslide/lib/python3.6/site-packages/imageio/core/", line 186, in get_reader
    return format.get_reader(request)
  File "/home/mao4005/.conda/envs/deepslide/lib/python3.6/site-packages/imageio/core/", line 170, in get_reader
    return self.Reader(self, request)
  File "/home/mao4005/.conda/envs/deepslide/lib/python3.6/site-packages/imageio/core/", line 221, in __init__
  File "/home/mao4005/.conda/envs/deepslide/lib/python3.6/site-packages/imageio/plugins/", line 125, in _open
    self._im = factory(self._fp, "")
  File "/home/mao4005/.conda/envs/deepslide/lib/python3.6/site-packages/PIL/", line 110, in __init__
    raise SyntaxError(v)
SyntaxError: 33003

These are the packages in my conda env:

@MohamedOmar2020 I think your problem is that one or more of the images is corrupted or unreadable. Try printing the name of the image that causes the crash and then reading it in a separate Python interpreter session. If the same crash occurs, then the problem is caused by that image.
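One way to apply the suggestion above is to try every file in a folder and collect the ones that fail, rather than stopping at the first crash. The following is a minimal sketch, not code from the repository: `find_unreadable_images` and `default_reader` are hypothetical helper names, and the default reader assumes `imageio.imread`, the call that appears in the traceback.

```python
from pathlib import Path
from typing import Callable, List, Tuple


def default_reader(path: Path) -> None:
    """Read one image with imageio (imported lazily, only when actually used)."""
    from imageio import imread  # same reader that appears in the traceback
    imread(uri=path)


def find_unreadable_images(folder: Path,
                           reader: Callable[[Path], object] = default_reader
                           ) -> List[Tuple[Path, str]]:
    """Try to read every file under `folder`; return (path, error) for each failure."""
    failures = []
    for path in sorted(p for p in Path(folder).rglob("*") if p.is_file()):
        try:
            reader(path)
        except Exception as exc:  # keep going so every bad file is reported
            failures.append((path, f"{type(exc).__name__}: {exc}"))
    return failures
```

Calling `find_unreadable_images(Path("wsi_train/neg"))` and printing the result would list each file the reader cannot open together with the exception it raised.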

@MohamedOmar2020 I am getting the same error that you had. By any chance, did you find a resolution to the problem? Also, are you trying to use .svs files?

@Tejussurendran @MohamedOmar2020 Would either of you be able to provide an image that allows me to reproduce this error?

@JosephDiPalma I am trying to use a .svs file, and it appears that svs files are not supported by imread(). I am not sure if you have any workarounds for this.
Also, I tried to provide an image, but GitHub said it was a file type it doesn't support.

@Tejussurendran You can fix this issue in 2 ways:

  1. Convert all the svs files to png, jpg, or another supported format.
  2. Replace lines 363-364 in code/ as follows:
slide = OpenSlide(filename=str(image_loc if by_folder else input_folder.joinpath(image_loc)))
image = np.array(slide.read_region(location=(0, 0), level=0, size=slide.dimensions).convert("RGB"))

Also, you will need to have the OpenSlide Python package installed and imported at the top of the file code/ with from openslide import OpenSlide.
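The second option can also be wrapped in a small loader that uses OpenSlide only for .svs files and keeps `imread` for everything else. This is a sketch under assumptions: `needs_openslide`, `load_image`, and `OPENSLIDE_SUFFIXES` are hypothetical names that do not exist in the repository, and only the `OpenSlide(...)` and `read_region(...)` calls are taken from the fix above.

```python
from pathlib import Path

# Extensions routed to OpenSlide. Only .svs comes from the thread;
# treating it as a set that could grow is an assumption of this sketch.
OPENSLIDE_SUFFIXES = {".svs"}


def needs_openslide(path: Path) -> bool:
    """Return True if the file extension should be opened with OpenSlide."""
    return Path(path).suffix.lower() in OPENSLIDE_SUFFIXES


def load_image(path: Path):
    """Return the image at `path` as a NumPy array."""
    import numpy as np
    if needs_openslide(path):
        from openslide import OpenSlide
        slide = OpenSlide(filename=str(path))
        try:
            # Reads the whole slide at full resolution, as in the fix above.
            region = slide.read_region(location=(0, 0), level=0, size=slide.dimensions)
            return np.array(region.convert("RGB"))
        finally:
            slide.close()
    from imageio import imread
    return np.asarray(imread(uri=path))
```

Note that reading level 0 in full keeps the entire uncompressed slide in memory, which can be many gigabytes for a large slide.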

@JosephDiPalma Thank you very much for your help!

It is currently running without an issue (hopefully!), and I will let you know the outcome of my attempt.

Thank you once again for making this platform available to use!

JosephDiPalma commented Nov 5, 2021

@Tejussurendran Can you upload the problematic svs file to this dropbox link (

Make sure not to upload any PHI or otherwise unauthorized data.

@JosephDiPalma It appears that at the processing step, the process keeps being killed after a specific svs file. No error is thrown, however. Any ideas as to why?

@JosephDiPalma I tried running the 2_processing step over the weekend, and it kept getting killed, regardless of whether it was the validation, training, or testing set.

@Tejussurendran The process is likely being killed due to insufficient RAM.

Try changing the num_workers parameter in to something smaller and see if that helps.
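A rough back-of-the-envelope estimate shows why fewer workers can help. The sketch below assumes that each worker holds one full-resolution slide as an uncompressed 8-bit RGB array at a time; that is an assumption about the pipeline, not something confirmed in this thread, and the slide dimensions in the usage note are hypothetical.

```python
def full_slide_rgb_bytes(width: int, height: int, channels: int = 3) -> int:
    """Bytes needed for one uncompressed 8-bit image of the given size."""
    return width * height * channels


def estimated_peak_gib(width: int, height: int, num_workers: int) -> float:
    """Lower-bound estimate in GiB if every worker holds one full slide at once."""
    return num_workers * full_slide_rgb_bytes(width, height) / 2**30
```

For a hypothetical 100,000 x 80,000 pixel slide, `full_slide_rgb_bytes(100_000, 80_000)` is 24,000,000,000 bytes, so eight workers would need well over 150 GiB before counting any intermediate copies. The file size on disk is much smaller because the slide is stored compressed.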

@JosephDiPalma I think I have solved that issue, thank you!

However it appears that when trying to generate the validation evaluation patches it is throwing an error.

I have attached a copy of it below

[Attached image: conda error]

@Tejussurendran Does it work for the other sets?
If it does, then one of the images in your validation set is probably corrupted.

@JosephDiPalma It works on the validation set, and I am trying it out on the training set currently. On the testing set it had the same issues as the validation evaluation set.

@JosephDiPalma I tried setting the num_workers variable to 1 and the program is still being killed. Is there a bigger underlying issue that may be causing this to happen? This is occurring with all 3 sets: validation, testing, and training.

@Tejussurendran I believe the underlying issue is that you don't have enough system memory.
How much memory do you have, and what is the size of your images?

@JosephDiPalma I am SSHing into a workstation with 256 GB of RAM. I have also uploaded one of the sample images to your dropbox link from earlier. They are only a couple hundred kilobytes.

@Tejussurendran That should be more than enough memory.
I'm not sure what the issue is now, so give me some time to test the code using the sample image.

Can you also provide the Python package names, including versions, to us for debugging?

@JosephDiPalma Thank you very much for your help! With regard to the issue, it is not consistent in where it is killed: sometimes it is after 5 images, sometimes after 1, etc.

Also for the python package names, I am not sure how to find the version number, however I have attached the packages being used in

From what I understand it should be the same packages as those provided.

import functools
import itertools
import math
import time
from multiprocessing import (Process, Queue, RawArray)
from pathlib import Path
from shutil import copyfile
from typing import (Callable, Dict, List, Tuple)
from openslide import OpenSlide
import numpy as np
from PIL import Image
from imageio import (imsave, imread)
from skimage.measure import block_reduce

@JosephDiPalma Also, I am not sure if this will help with solving the issue, but I am also trying to save the images as png.

ntomita commented Nov 9, 2021

@Tejussurendran Could you give us the output of pip list and conda list?

I checked your image in our dropbox. It is an image patch. Is this your input? This library assumes that the input to the preprocessing stage is a large slide image file, usually a few gigabytes in size.

Also, I'm sorry! I think I messed up the file I sent. I have attached the correct file in the dropbox now.

Thank you so much for your help!

@Tejussurendran Using the provided svs file, the code ran successfully for us.
Could you provide some details on the directory structure of the code and your data?

@JosephDiPalma I have one large folder containing all my work for this project. In it I have a folder for the svs files, structured with Has_Diabetes and Not_Has_Diabetes subfolders. The patches, depending on the type, are stored in independent folders as well.

I also noticed that with num_workers = 8, the job was killed due to RAM consumption. So I tried 4, and it appears that the same issue happened again. I was thinking about potentially generating the patches elsewhere and simply moving them into the appropriate folders. Would that work?

@JosephDiPalma Hi, I was wondering what the directory difference between the validation set and the validation evaluation set is, as I am trying to generate the patches externally and use them in deepslide. What is the difference between the 2 folders?

@JosephDiPalma I was wondering if there is any way to reduce the memory consumption of the preprocessing step?

@Tejussurendran You can reduce the memory consumption of the pre-processing step by reducing the num_workers setting further, or downsampling your slides prior to processing.
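For svs files opened with OpenSlide, one way to downsample is to read a lower-resolution pyramid level instead of level 0. The sketch below is an illustration under assumptions, not the repository's method: `read_downsampled` is a hypothetical helper, and it relies on the slide object providing `get_best_level_for_downsample`, `level_dimensions`, and `read_region`, as OpenSlide objects do.

```python
def read_downsampled(slide, downsample: float):
    """Read the whole slide from the pyramid level closest to `downsample`.

    `slide` is expected to behave like an openslide.OpenSlide object.
    Returns an RGB PIL image, which is smaller than the level-0 image
    whenever the slide has a lower-resolution level to read from.
    """
    level = slide.get_best_level_for_downsample(downsample)
    size = slide.level_dimensions[level]  # (width, height) at that level
    # `location` is always given in level-0 coordinates; `size` is in level coordinates.
    region = slide.read_region(location=(0, 0), level=level, size=size)
    return region.convert("RGB")
```

With `slide = OpenSlide(filename=...)`, `np.array(read_downsampled(slide, 4))` would replace the full-resolution read. If the slide stores no lower-resolution levels, level 0 is returned and nothing is saved. Patches cut from a downsampled slide cover more tissue per pixel, so the patch size may need adjusting.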

