#### Loading and preparing the PCam data for training shallow learning models using tensorflow dataset (tfds)

In [1]:
!pip install sklearn
!pip install matplotlib
!pip install tensorflow
!pip install tensorflow_datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sklearn
  Downloading sklearn-0.0.post1.tar.gz (3.6 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: sklearn
  Building wheel for sklearn (setup.py) ... [?25l[?25hdone
  Created wheel for sklearn: filename=sklearn-0.0.post1-py3-none-any.whl size=2344 sha256=d664a5f21a496cecb71bad96c34e24a82459ef31295337237135fd6d14c12662
  Stored in directory: /root/.cache/pip/wheels/14/25/f7/1cc0956978ae479e75140219088deb7a36f60459df242b1a72
Successfully built sklearn
Installing collected packages: sklearn
Successfully installed sklearn-0.0.post1
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


Loading the required libraries

In [4]:
from sklearn import svm
from sklearn.metrics import accuracy_score
import numpy as np

import tensorflow as tf
import tensorflow_datasets as tfds
from sklearn.preprocessing import StandardScaler

Defining a function that grayscale, resize and flattens the image. This function might also become handy (for deep learning) if the original images are too large for your hardware configuration.

In [5]:
def convert_sample(image):
    image = tf.image.rgb_to_grayscale(image)
    image = tf.image.resize(image,[32,32]).numpy()
    image = image.reshape(1,-1)
    return image

You can use your google drive to store the data by "mounting" it as follows

In [6]:
from google.colab import drive
drive.mount("/content/drive", force_remount=True)

Mounted at /content/drive


Next we use the tensorflow dataset API - tfds - to load data from your mounted google drive. Note this API requite that you should have copied the entire **patch_camelyon** folder from https://syddanskuni-my.sharepoint.com/:f:/g/personal/cmd_sam_sdu_dk/EiWD2LmuxCJBp-_tfGK7aL8Bair7l5z8FU5sp5pLjlhKwg?e=FLzWno to the /content/drive/MyDrive folder on your google drive:

In [None]:
ds1,ds2,ds3 = tfds.load('patch_camelyon',
                        split=['train[:5%]','test[:2%]','validation[:2%]'],
                        data_dir = '/content/drive/MyDrive',
                        download=False,
                        batch_size=-1, # All data...no batches needed 
                        as_supervised=True, # So that we easily can transform data to numpy format
                        shuffle_files=True)

Next we can easily convert both the images and the labels to numpy format 

In [None]:
train_dataset       = tfds.as_numpy(ds1)
train_dataset_image = np.vstack(list(map(convert_sample,train_dataset[0])))
train_dataset_image_Scaled = StandardScaler(with_mean=0, with_std=1).fit_transform(train_dataset_image)
train_dataset_label = train_dataset[1].reshape(-1,)    
print(f'Shape of training data features (observations,features): {train_dataset_image_Scaled.shape}')
print(f'Shape of training data labels (observations,): {train_dataset_label.shape}')

validation_dataset  = tfds.as_numpy(ds3)
validation_dataset_image = np.vstack(list(map(convert_sample,validation_dataset[0])))
validation_dataset_image_Scaled = StandardScaler(with_mean=0, with_std=1).fit_transform(validation_dataset_image)
validation_dataset_label = validation_dataset[1].reshape(-1,) 
   
test_dataset        = tfds.as_numpy(ds2)
test_dataset_image = np.vstack(list(map(convert_sample,test_dataset[0])))
test_dataset_image_Scaled = StandardScaler(with_mean=0, with_std=1).fit_transform(test_dataset_image)
test_dataset_label = test_dataset[1].reshape(-1,)

Shape of training data features (observations,features): (13107, 1024)
Shape of training data labels (observations,): (13107,)


The data is then ready to be applied for training, validation, testing in a shallow learning model such as the SVM classifier...below just a very very simple illustration on how to construct and train a support vector machine based on the data we have prepared

In [None]:
clf = svm.SVC(kernel='rbf')
clf.fit(train_dataset_image_Scaled, train_dataset_label)
y_test_hat = clf.predict(test_dataset_image)

# Obtain accuracy by using the `accuracy_score` function
accuracy_linear = accuracy_score(y_test_hat, test_dataset_label )
# Print results
print(f'SVM achieved {round(accuracy_linear * 100, 1)}% accuracy.')

SVM achieved 51.9% accuracy.
