## Homework description

Write a data loader for image classification and segmentation tasks (the general case with n classes). Any framework is acceptable (TensorFlow or PyTorch), and it does not matter whether it runs on a PC or a cluster.

As a starting point you can use https://tensorflow.org/tutorials/images/segmentation, in particular the section on downloading the Oxford-IIIT Pets dataset (...37 category pet image dataset with roughly 200 images for each class)

Use case: we have 10 000 images in a folder and want to train an ML model. Loading them all into main memory can cause an out-of-memory (OOM) error. With our own image loader and preprocessor we can "load" an arbitrarily large amount of data, because images are read from disk during training instead of being held in memory all at once. The trade-off is time: one image may be loaded 5 times over 5 epochs, so training is slower, but we avoid the OOM error.
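
The idea can be sketched without any framework: a generator that keeps only one batch in memory at a time. This is a minimal illustration, not the loader built below. The names `iter_batches` and `load_fn` are hypothetical, and `load_fn` stands in for whatever reads and decodes a single image.

```python
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(paths: Sequence[str], batch_size: int,
                 load_fn: Callable[[str], T]) -> Iterator[List[T]]:
    """Yield batches lazily; only the current batch is materialized."""
    for start in range(0, len(paths), batch_size):
        # Files are loaded here, at iteration time, not up front.
        yield [load_fn(p) for p in paths[start:start + batch_size]]


# Stand-in loader (assumption): a real one would read and decode the file.
fake_paths = [f"img_{i}.jpg" for i in range(5)]
first_items = []
batch_sizes = []
for batch in iter_batches(fake_paths, batch_size=2, load_fn=str.upper):
    batch_sizes.append(len(batch))
    first_items.append(batch[0])
print(batch_sizes)
```

Each epoch would call `iter_batches` again, which is why the same image can be read from disk once per epoch.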


In [12]:
import os
import tensorflow as tf
import tensorflow_datasets as tfds
print(tf.__version__)

2.9.1


In [3]:
tf.data.experimental.enable_debug_mode()

Manually removed 3 .mat files

In [3]:
image_count = len(os.listdir('oxford_iiit_pet_dataset/images/'))
image_count

7390

In [4]:
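data_dir = './oxford_iiit_pet_dataset/images/*.jpg'

# create a dataset of all files matching a pattern
# list_files(shuffle=True) reshuffles on every iteration, which would let files
# move between the train and test splits, so shuffle once with a fixed order
list_ds = tf.data.Dataset.list_files(data_dir, shuffle=False)
list_ds = list_ds.shuffle(image_count, reshuffle_each_iteration=False)

val_size = int(image_count * 0.2)

train_ds = list_ds.skip(val_size)
test_ds = list_ds.take(val_size)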

The tf.data API helps to build flexible and efficient input pipelines.

Cardinality is the number of elements in a dataset.

In [6]:
print(tf.data.experimental.cardinality(train_ds).numpy())  # len(train_ds) would work as well in this case
print(tf.data.experimental.cardinality(test_ds).numpy())

5912
1478
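
The same `take`/`skip` split can be checked on a toy dataset. This is a small sketch that uses `tf.data.Dataset.range(10)` as a stand-in for the file list; the variable names are illustrative only.

```python
import tensorflow as tf

# Toy stand-in for the list of image files.
ds = tf.data.Dataset.range(10)
val_size = int(10 * 0.2)

val_part = ds.take(val_size)    # first val_size elements
train_part = ds.skip(val_size)  # everything after them

n_train = int(tf.data.experimental.cardinality(train_part).numpy())
n_val = int(tf.data.experimental.cardinality(val_part).numpy())

# With a fixed element order the two parts do not overlap.
overlap = set(train_part.as_numpy_iterator()) & set(val_part.as_numpy_iterator())
print(n_train, n_val, overlap)
```

The parts are disjoint only because the order of `ds` is the same on every iteration; a source that reshuffles on each pass would not give that guarantee.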


### Treat the dataset as an image classification problem and load it from disk (https://www.robots.ox.ac.uk/~vgg/data/pets/)

In [88]:
folder = 'oxford_iiit_pet_dataset/images'
previous_class_name = None
current_class_label = 0
# Append an integer class label to every file name: <class>_<number>_<label>.<ext>
# sorted() keeps the files of one class adjacent; os.listdir order is arbitrary
# Run this cell only once, otherwise a label is appended a second time
for i, file_path in enumerate(sorted(os.listdir(folder))):
    number_extension = file_path.split('_')[-1]
    img_number, extension = number_extension.split('.')
    class_name = file_path.replace('_' + number_extension, '')
    if class_name != previous_class_name and i != 0:
        current_class_label += 1
    os.rename(f'{folder}/{file_path}', f'{folder}/{class_name}_{img_number}_{current_class_label}.{extension}')
    previous_class_name = class_name

In [14]:
num_classes = 37

In [6]:
image_height = 250
image_width = 250

In [7]:
def process_path(file_path):
  label = get_label(file_path)
  img = tf.io.read_file(file_path)
  img = decode_img(img)
  return img, label

def get_label(file_path):
  label_extension = tf.strings.split(file_path, '_')[-1]
  label = tf.strings.split(label_extension, '.')[0]
  label = tf.strings.to_number(label, tf.int32)
  return label

def decode_img(img):
  img = tf.io.decode_jpeg(img, channels=3)
  return tf.image.resize(img, (image_height, image_width))
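
For reference, the label parsing in `get_label` can be mirrored in plain Python to see what the string operations do. This is only an illustration: `parse_label` is a hypothetical helper, and the file names and label values below are made up to match the `<class>_<number>_<label>.<ext>` pattern produced by the renaming step.

```python
def parse_label(file_name: str) -> int:
    """Extract the trailing integer label from <class>_<number>_<label>.<ext>."""
    label_extension = file_name.split('_')[-1]   # e.g. '1.jpg'
    return int(label_extension.split('.')[0])    # e.g. 1


# Hypothetical file names; the label values are examples only.
example_a = 'oxford_iiit_pet_dataset/images/american_bulldog_100_1.jpg'
example_b = 'Abyssinian_1_0.jpg'
print(parse_label(example_a), parse_label(example_b))
```

Underscores inside the class name or the directory path do not matter, because only the last `_`-separated piece is used.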

num_parallel_calls: the number of elements to process asynchronously in parallel. If not specified, elements are processed sequentially. If the value tf.data.AUTOTUNE is used, the number of parallel calls is set dynamically based on available resources.
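
A minimal sketch of a parallel `map` on a toy dataset, assuming a trivial squaring function in place of `process_path`:

```python
import tensorflow as tf

ds = tf.data.Dataset.range(5)

# The mapped function is applied to several elements concurrently;
# AUTOTUNE lets the tf.data runtime pick the degree of parallelism.
squared = ds.map(lambda x: x * x, num_parallel_calls=tf.data.AUTOTUNE)

result = [int(v) for v in squared.as_numpy_iterator()]
print(result)
```

The benefit only shows up when the mapped function is expensive, such as reading and decoding an image file.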

In [9]:
train_ds = train_ds.map(process_path, num_parallel_calls=tf.data.AUTOTUNE)
test_ds = test_ds.map(process_path, num_parallel_calls=tf.data.AUTOTUNE)

In [10]:
for image, label in train_ds.take(1):
    print(image.numpy().shape)
    print(label.numpy())

(250, 250, 3)
22


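Cache: caches a dataset, either in memory or on local storage. This saves some operations (like file opening and data reading) from being executed during each epoch. Note that cache() without a file name keeps the cached elements in memory, so for a dataset that does not fit in RAM a cache file on local storage should be used instead.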
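
The effect of caching can be observed by counting how often the mapped function runs. This is a rough sketch: `expensive` is a hypothetical stand-in for file reading, and `tf.py_function` is used only so that a Python-side counter can be incremented.

```python
import tensorflow as tf

calls = {"n": 0}


def expensive(x):
    # Stand-in for opening and decoding a file; counts invocations.
    calls["n"] += 1
    return x


ds = tf.data.Dataset.range(4)
ds = ds.map(lambda x: tf.py_function(expensive, [x], tf.int64))
ds = ds.cache()  # in-memory cache; pass a file path to cache on disk

# Two full passes ("epochs") over the dataset.
for _ in range(2):
    for _ in ds:
        pass

print(calls["n"])
```

After the first complete pass the elements come from the cache, so the counter should reflect one pass only.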

With a naive synchronous pipeline, where a training step involves opening a file, reading data (a line or record) from it, and using the data for training, the execution components sit idle for much of the time: while the pipeline fetches data the model is idle, and while the model trains the pipeline is idle.

[open, read, train, epoch times - naive approach](https://www.tensorflow.org/guide/images/data_performance/naive.svg)

Prefetching overlaps the preprocessing and model execution of a training step. While the model is executing training step s, the input pipeline is reading the data for step s+1.

The number of elements to prefetch can be tuned manually or set to tf.data.AUTOTUNE, which prompts the tf.data runtime to tune the value dynamically at runtime.

[open, read, train, epoch times - prefetched](https://www.tensorflow.org/guide/images/data_performance/prefetched.svg)
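
A minimal sketch of batching followed by prefetching on a toy dataset. Prefetching changes when elements are produced, not what is produced, so the batches are the same as without it.

```python
import tensorflow as tf

ds = tf.data.Dataset.range(10)
ds = ds.batch(4)
# Prepare upcoming batches in the background while the consumer works.
ds = ds.prefetch(buffer_size=tf.data.AUTOTUNE)

batch_sizes = [int(batch.shape[0]) for batch in ds]
print(batch_sizes)
```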

In [11]:
batch_size = 16

def configure_for_performance(ds):
  ds = ds.cache()
  ds = ds.batch(batch_size)
  ds = ds.prefetch(buffer_size=tf.data.AUTOTUNE)
  return ds

train_ds = configure_for_performance(train_ds)
test_ds = configure_for_performance(test_ds)

In [18]:
model = tf.keras.Sequential()
# input_shape belongs on the first layer, in (height, width, channels) order
model.add(tf.keras.layers.Rescaling(1./255, input_shape=(image_height, image_width, 3)))
model.add(tf.keras.layers.Conv2D(32, 3, activation='relu'))
model.add(tf.keras.layers.MaxPooling2D())
model.add(tf.keras.layers.Conv2D(32, 3, activation='relu'))
model.add(tf.keras.layers.MaxPooling2D())
model.add(tf.keras.layers.Conv2D(32, 3, activation='relu'))
model.add(tf.keras.layers.MaxPooling2D())
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(64, activation='relu'))
model.add(tf.keras.layers.Dense(num_classes))

model.compile(
    optimizer='adam',
    # SparseCategoricalCrossentropy(from_logits=True) -> labels are scalar integers
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy']
)
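
To see what this loss expects, the sketch below feeds it integer labels and raw logits. The all-zero logits are an assumption chosen for the check: they correspond to a uniform prediction, for which the cross-entropy equals ln(num_classes).

```python
import math

import tensorflow as tf

num_classes = 37

# Labels are scalar integers (one per sample), not one-hot vectors.
labels = tf.constant([0, 5])
# Raw, unnormalized scores as produced by the final Dense layer.
logits = tf.zeros((2, num_classes))

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
loss = float(loss_fn(labels, logits))
print(loss, math.log(num_classes))
```

This value is also a useful baseline: a loss that stays near ln(37) during training suggests the model is predicting close to uniformly.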

In [19]:
from datetime import datetime
start = datetime.now()

model.fit(
    train_ds,
    validation_data=test_ds,
    epochs=10,
    verbose=1
)

print(datetime.now() - start)

Epoch 1/10
Epoch 2/10
Epoch 3/10

KeyboardInterrupt: 