# Exploring the Leafsnap dataset

In [6]:
import os
import random
from shutil import rmtree

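The Leafsnap dataset can be downloaded from http://leafsnap.com/dataset/. Download the .tar file and extract it so that the resulting `dataset` directory sits one level above the directory containing this notebook, since the code below looks for `../dataset`. Note that the dataset uses about 1.1 GB of disk space.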

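The next function splits the data into a training set, a validation set, and a test set. Each example is assigned at random: with probability 0.6 to the training set, 0.2 to the validation set, and 0.2 to the test set, so the final proportions are approximate rather than exact. Be sure to set the random seed (the `seed` argument) if you want the split to be reproducible. Note that the function moves the files into `../data` and then deletes `../dataset`, so keep the original .tar file if you may need to redo the split.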
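As a minimal sketch of how a seed makes the random assignment repeatable, the snippet below draws split labels using the same 0.6 and 0.8 thresholds that appear in the function in this notebook. The helper names `assign_split` and `draw_splits` are hypothetical and exist only for this illustration. The sketch uses a private `random.Random` instance instead of the global generator, which is an assumption made to keep the example self-contained.

```python
import random


def assign_split(rnd):
    """Map a uniform draw in [0, 1) to a split name.

    The 0.6 / 0.8 thresholds are taken from the notebook's split function.
    """
    if rnd < 0.6:
        return 'train'
    elif rnd < 0.8:
        return 'validate'
    return 'test'


def draw_splits(n, seed):
    """Hypothetical helper: draw n split labels from a seeded generator."""
    rng = random.Random(seed)  # private generator; global random state is untouched
    return [assign_split(rng.random()) for _ in range(n)]


first = draw_splits(10, seed=1)
second = draw_splits(10, seed=1)
print(first == second)  # the same seed yields the same sequence of labels
```

Because both calls start from the same seed, the two lists of labels are identical. Without a seed, each run would generally produce a different assignment.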

Each example consists of an image and a corresponding image mask. The image mask is a .png file in which every pixel is either pure white (#ffffff) or pure black (#000000). The white pixels indicate the region of the image containing the leaf. In a few cases the mask is entirely black because the segmentation algorithm was unable to find the leaf. The function below matches files by their base filename, so an image and its mask always end up in the same split.

In [7]:
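def make_train_test_split(seed=None):
    """Move the files in ../dataset into ../data/{train,validate,test}.

    Files that share a base filename (for example an image and its mask)
    are kept in the same split. Any existing ../data directory is deleted
    first, and ../dataset is removed once the files have been moved.
    """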
    if not os.path.isdir('../dataset'):
        print('Data directory not found. Did you download and extract the data?')
        return
    if os.path.isdir('../data'):
        rmtree('../data')
    os.mkdir('../data')
    # Seed the generator only when a seed is given, so the split is reproducible.
    if seed is not None:
        random.seed(seed)
    tt_dict = {}
    for dirpath, dirnames, filenames in os.walk('../dataset'):
        if len(dirnames) == 0:
            for tt in ('test', 'train', 'validate'):
                newdir = dirpath.replace('/dataset/', '/data/{}/'.format(tt))
                os.makedirs(newdir, exist_ok=True)
            for filename in filenames:
                source = "{}/{}".format(dirpath, filename)
                basename = filename.rsplit('.', 1)[0]
                if basename in tt_dict:
                    tt = tt_dict[basename]
                else:
                    rnd = random.random()
                    if rnd < 0.6:
                        tt = 'train'
                    elif rnd < 0.8:
                        tt = 'validate'
                    else:
                        tt = 'test'
                    tt_dict[basename] = tt
                destination = source.replace('/dataset/', '/data/{}/'.format(tt))
                os.rename(source, destination)
                print('Renamed {} to {}'.format(source, destination))
    rmtree('../dataset')

In [9]:
make_train_test_split(seed=1)
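Since the assignment is random, it can be worth checking how many files actually landed in each split. The following is a rough sketch, assuming the `../data/<split>/...` layout created by the function above. The helper name `count_split_files` is hypothetical, and it counts every file, so an image and its mask are counted separately.

```python
import os
from collections import Counter


def count_split_files(data_root='../data'):
    """Hypothetical helper: count files under each top-level split directory."""
    counts = Counter()
    if not os.path.isdir(data_root):
        return counts  # nothing to count if the split has not been created
    for split in sorted(os.listdir(data_root)):
        split_dir = os.path.join(data_root, split)
        if not os.path.isdir(split_dir):
            continue
        # Walk the whole subtree so nested species directories are included.
        for _dirpath, _dirnames, filenames in os.walk(split_dir):
            counts[split] += len(filenames)
    return counts


counts = count_split_files()
total = sum(counts.values())
if total:
    for split, n in sorted(counts.items()):
        print('{}: {} files ({:.1%})'.format(split, n, n / total))
```

The printed shares should be close to 60%, 20%, and 20% for `train`, `validate`, and `test`, though not exactly equal to them.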