Import two packages to help with data inspection: cv2 (OpenCV) for reading image data and glob for finding files.

The dataset consists of cell images used for detecting malaria.

In [59]:
import cv2
import glob

There are two data classes, Parasitized and Uninfected. This makes the task binary classification: every image belongs to exactly one of the two classes, never neither or both.

Data is stored in separate subdirectories in the data directory as .png images.

In [60]:
#Parasitized (Infected) and uninfected folders acting as labels for the images
!dir data

 Volume in drive C has no label.
 Volume Serial Number is AA57-056F

 Directory of C:\Users\Ben\Documents\Uni\Machine Learning KV7006\Workspace\data

06/03/2023  18:59    <DIR>          .
06/03/2023  20:20    <DIR>          ..
06/03/2023  18:44    <DIR>          Parasitized
06/03/2023  18:45    <DIR>          Uninfected
               0 File(s)              0 bytes
               4 Dir(s)  134,842,613,760 bytes free


In [21]:
#path of first image in folder
path = glob.glob("./data/Parasitized/*.png")[0]

#read image
image = cv2.imread(path)
#print shape of image
print(image.shape)

(148, 142, 3)


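The output shows that the first image has shape (148, 142, 3). OpenCV returns images as arrays of (height, width, channels), so this image is 148 pixels high and 142 pixels wide, with 3 channels. It is therefore very likely a colour image. Note that cv2.imread loads the channels in BGR order rather than RGB.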
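To make the axis order explicit, the shape tuple can be unpacked into named parts. The following is a minimal sketch; `describe_shape` is a hypothetical helper written for this notebook, not an OpenCV function.

```python
def describe_shape(shape):
    """Split a NumPy-style image shape into named parts.

    Hypothetical helper for illustration only, not part of OpenCV.
    """
    # OpenCV arrays are indexed (rows, columns, channels).
    height, width = shape[0], shape[1]
    # A 2-element shape (e.g. an image read with cv2.IMREAD_GRAYSCALE)
    # has no channel axis, so treat it as a single channel.
    channels = shape[2] if len(shape) > 2 else 1
    return {"height": height, "width": width, "channels": channels}


# Shape printed for the first Parasitized image
info = describe_shape((148, 142, 3))
print(info)
```

If the image needs to be displayed with a library that expects RGB, `cv2.cvtColor(image, cv2.COLOR_BGR2RGB)` converts the channel order.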

In [22]:
#path of second image in folder
path = glob.glob("./data/Parasitized/*.png")[1]

#read image
image = cv2.imread(path)
#print shape of image
print(image.shape)

(208, 148, 3)


The second sample has shape (208, 148, 3), so it also has 3 channels but differs in size from the first. This suggests that the images in the dataset vary in size but are likely all colour images, which a retrospective look at the images themselves corroborates.

Many models expect inputs of a fixed size, so images of different sizes make training more challenging. The dataset will therefore likely be resized to a common size before training.
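One way to reason about resizing is to work out, for a given target size, how much each image would be scaled and how much padding would be needed to keep its aspect ratio. The sketch below is only arithmetic on the shapes. The target of 128 pixels and the helper name `fit_within` are assumptions for illustration, not choices made in this notebook.

```python
def fit_within(height, width, target=128):
    """Return the aspect-preserving size and padding for a square target.

    Hypothetical helper; target=128 is an assumed value for illustration.
    """
    # Scale so that the longer side exactly matches the target.
    scale = target / max(height, width)
    new_h = max(1, round(height * scale))
    new_w = max(1, round(width * scale))
    # Remaining pixels that would have to be padded on each axis.
    pad_h = target - new_h
    pad_w = target - new_w
    return {"scale": scale, "new_size": (new_h, new_w), "padding": (pad_h, pad_w)}


# The two shapes inspected so far (height, width)
for shape in [(148, 142), (208, 148)]:
    print(shape, fit_within(*shape))
```

The actual resize could then be done with `cv2.resize(image, (new_w, new_h))`; note that `cv2.resize` takes the target size as (width, height), the reverse of the array shape order.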

In [62]:
#read all images and add to a regular python list
infected = []
uninfected = []
for file in glob.glob("./data/Parasitized/*.png"):
    infected.append(cv2.imread(file))
    
for file in glob.glob("./data/Uninfected/*.png"):
    uninfected.append(cv2.imread(file))

In [64]:
#print length of python lists (number of parasitized/infected and uninfected image samples)
print("Infected image samples: " + str(len(infected)))
print("Uninfected image samples: " + str(len(uninfected)))

Infected image samples: 13779
Uninfected image samples: 13779


The output shows that there are as many infected samples as uninfected samples (13,779 each), so the classes are balanced. The dataset is also very large considering there are only two classes.
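The balance can be expressed as a fraction per class, which also gives the accuracy a trivial "always predict one class" model would reach. This is a small sketch using the counts printed above; `class_balance` is a hypothetical helper name.

```python
def class_balance(counts):
    """Return each class's share of the total number of samples."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Sample counts taken from the output above
counts = {"Parasitized": 13779, "Uninfected": 13779}
balance = class_balance(counts)
total_samples = sum(counts.values())

print(balance)
print("Total samples:", total_samples)
# Accuracy of always predicting the most common class
print("Majority-class baseline:", max(balance.values()))
```

With both classes at 50%, plain accuracy is a reasonable first metric, since no class dominates the dataset.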

In [36]:
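#note: shape[0] is the number of rows (height) and shape[1] the number of columns (width),
#so the "X" values below track height and the "Y" values track width
Minimum_X = 10000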
Maximum_X = -1
Minimum_Y = 10000
Maximum_Y = -1

for image in infected:
    if (Minimum_X > image.shape[0]):
        Minimum_X = image.shape[0]
    if (Maximum_X < image.shape[0]):
        Maximum_X = image.shape[0]
    if (Minimum_Y > image.shape[1]):
        Minimum_Y = image.shape[1]
    if (Maximum_Y < image.shape[1]):
        Maximum_Y = image.shape[1]

print("Minimum_X: " + str(Minimum_X))
print("Maximum_X: " + str(Maximum_X))
print("Minimum_Y: " + str(Minimum_Y))
print("Maximum_Y: " + str(Maximum_Y))

Minimum_X: 40
Maximum_X: 385
Minimum_Y: 46
Maximum_Y: 394


The above finds the minimum and maximum size along both axes of the images in the infected class.

Because shape[0] is the row count and shape[1] the column count, the results show that the image height ranges from 40 to 385 pixels and the width ranges from 46 to 394 pixels.

This leaves a wide range of possible target sizes for resizing: whichever size is chosen, some images will be greatly reduced and others greatly enlarged.
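The same minimum and maximum search can be written more compactly with Python's built-in `min` and `max`. The sketch below runs on a short list of shape tuples so that it is self-contained; the third shape is made up for illustration, and `size_range` is a hypothetical helper name.

```python
def size_range(shapes):
    """Return (min, max) of height and width over a list of image shapes."""
    heights = [s[0] for s in shapes]  # shape[0] is rows (height)
    widths = [s[1] for s in shapes]   # shape[1] is columns (width)
    return {
        "height": (min(heights), max(heights)),
        "width": (min(widths), max(widths)),
    }


shapes = [
    (148, 142, 3),  # first Parasitized image
    (208, 148, 3),  # second Parasitized image
    (100, 120, 3),  # made-up shape for illustration
]
ranges = size_range(shapes)
print(ranges)
```

On the real data this could be called as `size_range([img.shape for img in infected])`, and likewise for the uninfected list.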

In [54]:
#count samples smaller than 100 pixels or larger than 250 pixels along each axis
count_Xmin = 0
count_Ymin = 0
count_Xmax = 0
count_Ymax = 0

for image in infected:
    if (image.shape[0] < 100):
        count_Xmin += 1
    if (image.shape[1] < 100):
        count_Ymin += 1
    if (image.shape[0] > 250):
        count_Xmax += 1
    if (image.shape[1] > 250):
        count_Ymax += 1

print("Count_X min: " + str(count_Xmin))
print("Count_Y min: " + str(count_Ymin))
print("Count_X max: " + str(count_Xmax))
print("Count_Y max: " + str(count_Ymax))

Count_X min: 407
Count_Y min: 407
Count_X max: 12
Count_Y max: 9


This prints the number of samples with a width or height of less than 100 pixels or greater than 250 pixels. The aim is to check that the minimum and maximum sizes found in the previous code block were not caused by outliers or anomalous samples.

The findings suggest that the data does contain many smaller samples: 407 of the 13,779 infected images (roughly 3%) are under 100 pixels on each axis, which is enough to suggest that they are not anomalous and should remain included.

Very large samples are rare, with only 12 and 9 images exceeding 250 pixels on the respective axes. Whether to keep these should be considered, but since they will likely be scaled down, they are less of a concern than a few very small samples being scaled up would be.
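Expressing the counts as percentages of the infected class makes it easier to judge how unusual each group is. This is a small sketch using the figures printed above; the dictionary keys and the helper name `as_percent` are chosen here for illustration.

```python
TOTAL_INFECTED = 13779  # number of infected samples reported earlier

# Counts taken from the output above (shape[0] is rows, shape[1] is columns)
size_counts = {
    "rows < 100": 407,
    "cols < 100": 407,
    "rows > 250": 12,
    "cols > 250": 9,
}


def as_percent(count, total):
    """Return count as a percentage of total, rounded to two decimals."""
    return round(100 * count / total, 2)


percentages = {k: as_percent(v, TOTAL_INFECTED) for k, v in size_counts.items()}
for label, pct in percentages.items():
    print(label + ": " + str(pct) + "%")
```

The small images make up about 3% of the class, while the large ones are well under 0.1%, which supports treating only the large images as potential outliers.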