# Image Processing 

The goal of this notebook is to process the Kaggle dataset so that it is ready for training models built on VGG16. Along with the basic processing, I have also done a little data augmentation using NumPy and basic linear algebra.

One significant decision I made was to omit a test set and use only a train and a dev set. This is motivated by the limited size of the dataset: it was more beneficial to put more data into development and simply live without a very accurate final evaluation of performance.

In [1]:
import numpy as np
import pandas as pd 
import tensorflow as tf
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

In [2]:
Total_data=tf.keras.utils.image_dataset_from_directory('Skin_Data/',batch_size=None
                                                       ,labels="inferred",label_mode='binary')

Found 288 files belonging to 2 classes.


In [19]:
'''plt.figure()
for image, labels in Total_data.take(1):
    print(np.shape(image))
    print(labels)
    plt.imshow(image/255)
    
plt.show()'''


We see that image_dataset_from_directory has labeled about 70% of the data as 1, but the positive class makes up only about 30% of the data, so we will need to flip these labels later.

In [4]:
items=Total_data.cardinality().numpy()
total_labels=0
for images, labels in Total_data.take(items):
    total_labels+=labels[0]
print(total_labels/items)

tf.Tensor(0.7083333, shape=(), dtype=float32)


## Train test split: Convert to Numpy

By first converting our tensors to NumPy arrays, we can use sklearn's train_test_split, which supports stratified splitting. Stratification is necessary because the dataset is small: a purely random split could leave the two sets with noticeably different class ratios. As far as I know, this is not possible using TensorFlow's own utilities.

In [5]:
image_list=[]
label_list=[]
for image, label in Total_data.take(Total_data.cardinality().numpy()):
    image_list.append(image.numpy())
    label_list.append(label.numpy())
X=np.array(image_list)
y=np.array(label_list)
y=1-y #flip the labels so that the minority (positive) class is labelled 1
print(np.shape(X),np.shape(y))

(288, 256, 256, 3) (288, 1)
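As a quick illustration of the label flip, here is a minimal sketch on made-up labels (`y_toy` is hypothetical, not taken from the dataset). Subtracting from 1 swaps the two classes elementwise, so the class fractions before and after the flip sum to 1.

```python
import numpy as np

# Hypothetical labels with the same (N, 1) shape that label_mode='binary' produces.
y_toy = np.array([[1.0], [1.0], [0.0], [1.0]], dtype=np.float32)

# Subtracting from 1 swaps the two classes elementwise: 1 -> 0 and 0 -> 1.
y_flipped = 1 - y_toy

# Fraction of samples labelled 1 before and after the flip.
print(y_toy.mean(), y_flipped.mean())
```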


Because of the limited data I'm using, it seems wise to dispense with a separate test set. This means that we will have to do without a very accurate measure of the model's performance, but this is fine for our purposes.

In [6]:
# 60-40 split reflects limited size of dataset
X_train, X_dev, y_train, y_dev = train_test_split(X, y, test_size=0.4,stratify=y, random_state=42)

In [7]:
#check the stratification
print(y_train.mean(),y_dev.mean())

0.29069767441860467 0.29310344827586204
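To make the idea of stratification concrete, here is a NumPy-only sketch of what a stratified split does conceptually: shuffle within each class, then take the same fraction of each class for the dev set. It is not sklearn's implementation, and `stratified_split` and `y_toy` are hypothetical names. The toy labels assume 84 positives out of 288, which is inferred from the class fractions printed above.

```python
import numpy as np

def stratified_split(y, dev_fraction, seed=0):
    """Return (train_idx, dev_idx) with roughly the same class ratio in both."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y).ravel()
    train_idx, dev_idx = [], []
    for cls in np.unique(y):
        # Shuffle the indices of this class only.
        idx = rng.permutation(np.flatnonzero(y == cls))
        # Each class contributes the same fraction of its samples to the dev set.
        n_dev = int(round(dev_fraction * len(idx)))
        dev_idx.extend(idx[:n_dev])
        train_idx.extend(idx[n_dev:])
    return np.array(train_idx), np.array(dev_idx)

# Toy labels: 84 positives out of 288 (an assumption, see above).
y_toy = np.zeros(288, dtype=int)
y_toy[:84] = 1

train_idx, dev_idx = stratified_split(y_toy, dev_fraction=0.4)

# The two class ratios should be close to each other and to 84/288.
print(len(train_idx), len(dev_idx))
print(y_toy[train_idx].mean(), y_toy[dev_idx].mean())
```

Because the quota is computed per class, the two ratios can differ only through rounding, which is the property the check above relies on.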


## Augment our training data 

Is it necessary to augment our data when using a pretrained model as a base? Perhaps not. The base model has already learnt to recognise transformed images, so this may not be so useful. I decided to give augmenting my data a try anyway, as it is a relatively small investment because of the size of the dataset. Even so, it will slow down training noticeably (we are effectively using a training set that is 8 times larger).
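The factor of 8 comes from three doubling steps. A rough back-of-the-envelope sketch follows, assuming 172 training images (the 60% share of the 288 files) and float32 pixels. Both are assumptions: if the arrays end up as float64, the memory figure doubles.

```python
# Assumed number of training images: the 60% share of the 288 files.
n_train = 172

# Each of the three reflection steps doubles the training set.
copies = 2 ** 3
n_augmented = n_train * copies

# One 256x256 RGB image at 4 bytes per value (float32 is an assumption).
bytes_per_image = 256 * 256 * 3 * 4
total_gb = n_augmented * bytes_per_image / 1e9

print(n_augmented, round(total_gb, 2), "GB")
```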

I'm aware that there are some PyTorch tools that can be used for this task, but I opted to simply use NumPy and some linear algebra, making use of NumPy's vectorisation.

In [8]:
#build a stack of 256x256 exchange matrices (the identity with its rows reversed),
#one per image and colour channel, used to apply reflections
def reflection_array(image_number):
    I_array=np.identity(256)
    reflect_array=np.zeros((image_number,256,256,3))
    for n in range(image_number):
        for j in range(3):
            for i in range(256):
                reflect_array[n,i,:,j]=I_array[255-i,:]
    return reflect_array

In [9]:
# multiplies each image on the right by the exchange matrix, which reverses the column order (the x coordinate),
# and concatenates the reflected copy with the original
def x_reflection(image_set,label_set):
    image_number=np.shape(image_set)[0]
    #move the channel axis to the front so that matmul acts on the last two (row, column) axes
    new_images=np.matmul(np.moveaxis(image_set[:,:,:,:],-1,0),np.moveaxis(reflection_array(image_number),-1,0))
    new_images=np.moveaxis(new_images,0,-1)
    
    total_images=np.concatenate((image_set,new_images),axis=0)  
    total_labels=np.concatenate((label_set,label_set),axis=0) #concatenate preserves row order
    
    return total_images, total_labels


In [10]:
# multiplies each image on the left by the exchange matrix, which reverses the row order (the y coordinate),
# and concatenates the reflected copy with the original
def y_reflection(image_set,label_set):
    image_number=np.shape(image_set)[0]
    #move the channel axis to the front so that matmul acts on the last two (row, column) axes
    new_images=np.matmul(np.moveaxis(reflection_array(image_number),-1,0),np.moveaxis(image_set[:,:,:,:],-1,0))
    new_images=np.moveaxis(new_images,0,-1)
    
    total_images=np.concatenate((image_set,new_images),axis=0)  
    total_labels=np.concatenate((label_set,label_set),axis=0) #concatenate preserves row order
    
    return total_images, total_labels
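The matrix products used in x_reflection and y_reflection can be checked on a small example. The sketch below uses a 4x4 toy array in place of a 256x256 image channel and compares each product with plain slicing. The names `J`, `A` and `batch` are illustrative only.

```python
import numpy as np

n = 4
# Exchange matrix: the identity with its rows reversed, as built in reflection_array.
J = np.identity(n)[::-1]
# Toy single-channel "image" with distinct entries.
A = np.arange(n * n, dtype=float).reshape(n, n)

# Right-multiplication reverses the column order (as in x_reflection).
cols_reversed = A @ J
# Left-multiplication reverses the row order (as in y_reflection).
rows_reversed = J @ A

print(np.array_equal(cols_reversed, A[:, ::-1]))
print(np.array_equal(rows_reversed, A[::-1, :]))

# The same slicing extends to a whole batch of shape (N, height, width, channels).
batch = np.arange(2 * n * n * 3, dtype=float).reshape(2, n, n, 3)
batch_cols_reversed = batch[:, :, ::-1, :]
batch_rows_reversed = batch[:, ::-1, :, :]
```

Slicing gives the same result without building the exchange matrices, so it is a cheaper alternative if memory or speed becomes a concern. The matrix form makes the underlying linear algebra explicit.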

In [11]:
# swaps the x and y coordinates (transposes each image) in a copy of the data and concatenates it with the original
#for those interested in the linear algebra, this is equivalent to a reflection in the line y=x
def x_y_flip(image_set,label_set):
    new_images=np.moveaxis(image_set,1,2)
    
    total_images=np.concatenate((image_set,new_images),axis=0)  
    total_labels=np.concatenate((label_set,label_set),axis=0)
    
    return total_images, total_labels

Using the three reflection functions we have created, we can generate all eight symmetries of the square, that is, all the linear transformations that map our square image onto itself (the identity, three rotations and four reflections). I decided not to apply any non-linear transformations or change the colour of any images, as I am unsure whether this is justified for this classification task.
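The claim that three reflections are enough can be checked on a tiny array. The sketch below uses slicing and transposition as stand-ins for the three functions above (the helper names are hypothetical). It applies them in the same doubling pattern and counts the distinct results.

```python
import numpy as np

def flip_cols(a):
    # Stand-in for x_reflection on a single image: reverse the columns.
    return a[:, ::-1]

def flip_rows(a):
    # Stand-in for y_reflection on a single image: reverse the rows.
    return a[::-1, :]

def transpose(a):
    # Stand-in for x_y_flip on a single image: swap rows and columns.
    return a.T

# All entries are distinct, so no two different symmetries give the same array.
A = np.arange(9).reshape(3, 3)

variants = [A]
for op in (flip_cols, flip_rows, transpose):
    # Keep the originals and append the transformed copies, like the concatenate step.
    variants = variants + [op(v) for v in variants]

distinct = {v.tobytes() for v in variants}
has_rotation = any(np.array_equal(v, np.rot90(A)) for v in variants)

print(len(variants), len(distinct), has_rotation)
```

The rotations never need their own function: for example, a column flip followed by a transpose is a quarter turn.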

In [12]:
Augmented_X, Augmented_y=x_reflection(X_train,y_train)
Augmented_X, Augmented_y=y_reflection(Augmented_X, Augmented_y)
Augmented_X, Augmented_y=x_y_flip(Augmented_X, Augmented_y)

In [20]:
'''fig, axs=plt.subplots(2,4)
for n in range(8):
    axs[n//4,n%4].imshow(Augmented_X[20+(n*172),:,:,:]/255.0)
    axs[n//4,n%4].axis("off") 
    axs[n//4,n%4].set_title(f'{Augmented_y[20+(n*172)]}')'''


Final step: run the Keras VGG16 preprocessing.

In [14]:
X_train_vvg16=tf.keras.applications.vgg16.preprocess_input(Augmented_X)
X_dev_vvg16=tf.keras.applications.vgg16.preprocess_input(X_dev)
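For readers wondering what this preprocessing does, my understanding of the Keras documentation is that the VGG16 preprocess_input converts images from RGB to BGR and subtracts the ImageNet mean from each channel, without any scaling. The sketch below is an approximate NumPy equivalent under that assumption. `vgg16_style_preprocess` is a hypothetical helper, not part of Keras.

```python
import numpy as np

# ImageNet channel means in BGR order (assumed to match the Keras "caffe"-style preprocessing).
IMAGENET_MEAN_BGR = np.array([103.939, 116.779, 123.68], dtype=np.float32)

def vgg16_style_preprocess(images_rgb):
    """Approximate equivalent: RGB -> BGR, then subtract the per-channel mean."""
    images_bgr = np.asarray(images_rgb, dtype=np.float32)[..., ::-1]
    # The subtraction returns a new array, so the input is left untouched.
    return images_bgr - IMAGENET_MEAN_BGR

# Toy batch of pure red images with pixel values in the 0-255 range.
batch = np.zeros((2, 4, 4, 3), dtype=np.float32)
batch[..., 0] = 255.0

out = vgg16_style_preprocess(batch)
# After the channel reversal, red sits in the last position.
print(out[0, 0, 0])
```

One practical difference: the Keras documentation notes that the real function may overwrite a compatible NumPy input in place, so it is worth passing a copy if the unprocessed array is needed afterwards.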

## Save data 

In [15]:
np.save('Skin_cancer_X_train', X_train_vvg16)
np.save('Skin_cancer_y_train', Augmented_y)

np.save('Skin_cancer_X_dev', X_dev_vvg16)
np.save('Skin_cancer_y_dev', y_dev)
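When the arrays are read back in a later notebook, note that np.save appends the .npy extension when the file name does not already have one. The sketch below shows the round trip on a toy array in a temporary directory. The file name `toy_X_dev` is hypothetical.

```python
import os
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file name, given without an extension as in the cell above.
    path = os.path.join(tmp, 'toy_X_dev')
    toy = np.arange(24, dtype=np.float32).reshape(2, 2, 2, 3)

    # np.save writes toy_X_dev.npy.
    np.save(path, toy)
    saved_files = os.listdir(tmp)

    # The extension has to be given explicitly when loading.
    restored = np.load(path + '.npy')

print(saved_files, np.array_equal(toy, restored))
```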