This notebook shows how to split the training set into k partitions for cross-validation.

Before running this notebook, make sure the file `train_tfrecords0.record` is located in `hyperspectral-cnn-soil-estimation/dataset`. If the file is not present, either follow the instructions in the notebook `convert_dataset_to_TFRecords.ipynb` to create it, or run the following cell to download our ready-to-use dataset.

In [None]:
%cd
%cd hyperspectral-cnn-soil-estimation/dataset

# Challenge train set, already converted to the TFRecord file format
!gdown https://drive.google.com/uc?id=1wD3vKqKEFh6OfrfLNtOENF-lbe4auQDb

%cd

# Load full train dataset

Navigate to the working directory.

In [1]:
%cd
%cd hyperspectral-cnn-soil-estimation

/home/microsat
/home/microsat/hyperspectral-cnn-soil-estimation


Define dataset path.

In [2]:
train_set_path = 'dataset/train_tfrecords0.record'
output_path='dataset/train_cv_split_{}.record'

Import required libraries and load the dataset.

In [3]:
import os, logging
logging.disable(logging.WARNING)
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"
from dataset_processing import *
import tensorflow as tf

train_data=load_tf_records(train_set_path)

# Shuffle and split training dataset into k partitions

In [4]:
number_of_partitions = 5
num_images = len(list(train_data))

dataset = train_data.shuffle(num_images, seed=958).cache()

In [5]:
to_skip = 0
to_take = num_images//number_of_partitions

# Iterate over the whole dataset once so that the shuffled order is stored in the cache.
# Later skip/take calls then read the same order, so the resulting partitions do not overlap.
for _ in dataset:
    pass

for i in range(number_of_partitions):
    writer = tf.data.experimental.TFRecordWriter(output_path.format(i))
    print('Writing ', output_path.format(i))

    if i < number_of_partitions - 1:
        writer.write(dataset.skip(to_skip).take(to_take))
    else:
        # The last partition also receives the remainder of the integer division.
        writer.write(dataset.skip(to_skip))

    to_skip = to_skip + to_take
    
print()
print('Writing completed')

Writing  dataset/train_cv_split_0.record
Writing  dataset/train_cv_split_1.record
Writing  dataset/train_cv_split_2.record
Writing  dataset/train_cv_split_3.record
Writing  dataset/train_cv_split_4.record

Writing completed
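
The skip/take arithmetic used above can be checked without TensorFlow. The following is a minimal sketch, not part of the original notebook: `partition_bounds` is a hypothetical helper that mirrors the loop in plain Python, and the item count of 1732 is taken from the image count this notebook reports in its validation step.

```python
def partition_bounds(num_items, num_partitions):
    """Return (start, stop) index pairs mirroring the skip/take loop.

    Hypothetical helper for illustration only.
    """
    to_take = num_items // num_partitions
    to_skip = 0
    bounds = []
    for i in range(num_partitions):
        if i < num_partitions - 1:
            bounds.append((to_skip, to_skip + to_take))
        else:
            # The last partition takes everything that is left,
            # including the remainder of the integer division.
            bounds.append((to_skip, num_items))
        to_skip += to_take
    return bounds


bounds = partition_bounds(1732, 5)
sizes = [stop - start for start, stop in bounds]
print(sizes)  # [346, 346, 346, 346, 348]
```

Because each partition starts where the previous one stops, the index ranges are contiguous and non-overlapping, provided the underlying order is the same for every skip/take call.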


# Validate partitioning

The partitions are validated by comparing the filenames stored in each record.

In [6]:
def get_filename(example_proto):
    '''Function to get the filename of each record'''
    features=tf.io.parse_single_example(example_proto, tf_records_file_features_description_train())

    filename=features['image/filename']

    return filename

List the TFRecord partition files.

In [7]:
import os

folder_path = 'dataset'
file_list = []

for file in os.listdir(folder_path):
    if file.startswith("train_cv_split"):
        file_list.append(file)

file_list=sorted(file_list)

Check that all 1732 images are present across the partitions.

In [8]:
for i, file in enumerate(file_list):
    # Load each partition once and append it to the concatenated dataset
    partition = load_tf_records('dataset/' + file)
    if i == 0:
        check_partitions = partition
    else:
        check_partitions = check_partitions.concatenate(partition)
    print('images in partition ', i, ': ', len(list(partition)))

print('total number of images: ', len(list(check_partitions)))

images in partition  0 :  346
images in partition  1 :  346
images in partition  2 :  346
images in partition  3 :  346
images in partition  4 :  348
total number of images:  1732


Compare the list of filenames from the original tfrecord file and the list of filenames from the partitions.

In [9]:
check_partitions_filenames = check_partitions.map(get_filename)
check_full_filenames = dataset.map(get_filename)

tfrecords_partitions_filenames = []
tfrecords_original_filenames = []

for record in check_partitions_filenames:
  tfrecords_partitions_filenames.append(record.numpy().decode())

for record in check_full_filenames:
  tfrecords_original_filenames.append(record.numpy().decode())

if sorted(tfrecords_partitions_filenames) == sorted(tfrecords_original_filenames):
    print("The two lists have exactly the same elements.")
else:
    print("The two lists do not have exactly the same elements.")

The two lists have exactly the same elements.
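
Once the partitions are validated, they can be combined into cross-validation folds. The following is a minimal sketch of one possible way to do this, not the method used by the rest of this repository: `cv_folds` is a hypothetical helper that, for each fold, holds out one partition file for validation and lists the remaining files for training. It reuses the `output_path` template and the partition count defined earlier.

```python
def cv_folds(path_template, num_partitions):
    """Return a list of (train_files, val_file) pairs, one per fold.

    Hypothetical helper for illustration only.
    """
    folds = []
    for val_idx in range(num_partitions):
        val_file = path_template.format(val_idx)
        # Every partition except the held-out one is used for training.
        train_files = [
            path_template.format(j)
            for j in range(num_partitions)
            if j != val_idx
        ]
        folds.append((train_files, val_file))
    return folds


folds = cv_folds('dataset/train_cv_split_{}.record', 5)
for train_files, val_file in folds:
    print('validation:', val_file, '| training files:', len(train_files))
```

The training files of each fold could then be loaded one by one and joined with `concatenate`, in the same way the partitions are joined in the validation cell above.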
