# LeafSnap data subset preparation

The original dataset was downloaded from [kaggle.com](https://www.kaggle.com/xhlulu/leafsnap-dataset) because [leafsnap.com](leafsnap.com/dataset) is no longer available. It is stored on [SURF drive](https://surfdrive.surf.nl/files/index.php/s/MoCVal7gxS4aX51?path=%2Fdata%2FLeafSnap). It contains 30,866 (~31k) color images of varying sizes and covers all 185 tree species from the Northeastern United States. The leaf images come from two different sources:

- "Lab" images: high-quality images of pressed leaves from the Smithsonian collection.
- "Field" images: "typical" images taken with mobile devices (mostly iPhones) in outdoor environments.

For this demo, a subset of 20 species, with both lab and field images, has been selected. The lab images were cropped semi-manually using IrfanView to remove the rulers and color calibration patches. This results in a small dataset of 3283 images.

This notebook (based on a [student project notebook]) loads the cropped images, resizes them to 64x64, normalizes the data, and saves the result as compressed NumPy files (NPZ).
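
The last two steps named above, normalizing and saving as NPZ, can be sketched as follows. This is a minimal illustration on synthetic data, not the notebook's actual code: it assumes that normalization means scaling 8-bit pixel values to [0, 1], and the array keys `images` and `labels` as well as the file name `leafsnap_demo.npz` are hypothetical.

```python
import os
import tempfile

import numpy as np

# Synthetic stand-in for a stack of 64x64 RGB images (values 0..255).
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(5, 64, 64, 3), dtype=np.uint8)
labels = np.array([0, 0, 1, 1, 2])

# Assumption: "normalize" means scaling 8-bit pixel values to [0, 1].
images_norm = images.astype(np.float32) / 255.0

# Save both arrays into a single compressed NPZ archive.
# The file name and the keys are hypothetical.
npz_path = os.path.join(tempfile.mkdtemp(), "leafsnap_demo.npz")
np.savez_compressed(npz_path, images=images_norm, labels=labels)

# Load the archive back to confirm that both arrays were stored.
with np.load(npz_path) as data:
    restored_images = data["images"]
    restored_labels = data["labels"]

print(restored_images.shape, restored_images.dtype)
```

Storing images and labels in one archive keeps them aligned by index, so a later notebook can load both with a single `np.load` call.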

### Imports

In [18]:
import warnings
warnings.simplefilter('ignore')
import os
import PIL.Image  # "import PIL" alone does not load the Image submodule
import imageio
import pandas as pd
import numpy as np
import random
import math
import keras


### Read data frame with information about pictures

The dataset includes a table with information about the pictures, which we read into a data frame. The columns relevant for us are:

- `image_path`: path to the individual picture
- `species`: Latin name of the plant
- `source`: whether the picture was taken in the lab or in the field



In [19]:
# data paths
original_data_path = "/home/elena/eStep/XAI/Data/LeafSnap/leafsnap-dataset-20subset/"
#images_folder = os.path.join(original_data_path, "dataset")
dataset_info_file = os.path.join(original_data_path, "leafsnap-dataset-20subset-images.txt")

img_info = pd.read_csv(dataset_info_file, sep="\t")
img_info.head()

Unnamed: 0,file_id,image_path,species,source
0,55821,dataset/images/lab/Auto_cropped/acer_campestre...,Acer campestre,lab
1,55822,dataset/images/lab/Auto_cropped/acer_campestre...,Acer campestre,lab
2,55823,dataset/images/lab/Auto_cropped/acer_campestre...,Acer campestre,lab
3,55824,dataset/images/lab/Auto_cropped/acer_campestre...,Acer campestre,lab
4,55825,dataset/images/lab/Auto_cropped/acer_campestre...,Acer campestre,lab


Add new column for the filename only.

In [20]:
# new column holding the bare filename (last component of image_path)
img_info["filename"] = img_info["image_path"].astype(str).map(os.path.basename)
img_info.head()

Unnamed: 0,file_id,image_path,species,source,filename
0,55821,dataset/images/lab/Auto_cropped/acer_campestre...,Acer campestre,lab,ny1079-01-1.jpg
1,55822,dataset/images/lab/Auto_cropped/acer_campestre...,Acer campestre,lab,ny1079-01-2.jpg
2,55823,dataset/images/lab/Auto_cropped/acer_campestre...,Acer campestre,lab,ny1079-01-3.jpg
3,55824,dataset/images/lab/Auto_cropped/acer_campestre...,Acer campestre,lab,ny1079-01-4.jpg
4,55825,dataset/images/lab/Auto_cropped/acer_campestre...,Acer campestre,lab,ny1079-02-1.jpg


### Create numeric labels

Next, we want numeric labels instead of the Latin name of each plant, so we append another column holding them.

- In the data frame, all images of one species are listed consecutively (first the lab images, then the field images).
- Therefore, we simply loop over the data frame and increment the numeric label whenever we encounter a Latin name that differs from the previous one.
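
The run-based labelling described above can also be sketched without an explicit loop. The example below is an illustration on a toy frame with hypothetical rows, not the notebook's own implementation; it also shows `pd.factorize` as an order-independent alternative and a simple consistency check.

```python
import pandas as pd

# Toy frame with hypothetical rows; each species is listed consecutively.
toy = pd.DataFrame({
    "species": ["Acer campestre", "Acer campestre",
                "Zelkova serrata", "Zelkova serrata", "Zelkova serrata"],
    "source": ["lab", "field", "lab", "lab", "field"],
})

# A row starts a new run when its species differs from the previous row.
# The first row is compared against a missing value, so it counts as a change.
is_new_species = toy["species"] != toy["species"].shift()

# Cumulative count of changes, shifted so that labels start at 0.
toy["labels_integer"] = is_new_species.cumsum() - 1

# Alternative that does not depend on row order: pd.factorize numbers
# the species in order of first appearance.
toy["labels_factorized"] = pd.factorize(toy["species"])[0]

# Consistency check: the number of labels should match the number of species.
n_labels = toy["labels_integer"].nunique()
n_species = toy["species"].nunique()
print(toy)
print(n_labels, n_species)
```

If the rows of one species were not stored consecutively, the run-based count would produce more labels than species, which the comparison of `n_labels` and `n_species` would reveal.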



In [21]:
#new column (empty)
img_info["labels_integer"] = None
#index of new column
index_labels_integer = img_info.columns.get_loc("labels_integer")
#index of species column
index_species = img_info.columns.get_loc("species")
#to assign numeric labels starting with 0 for the first species
k = 0 
for i in range(len(img_info)):
    # increment the label whenever the species differs from the previous row
    if i > 0 and img_info.iloc[i-1, index_species] != img_info.iloc[i, index_species]:
        k += 1
    img_info.iloc[i, index_labels_integer] = k
img_info.tail()


Unnamed: 0,file_id,image_path,species,source,filename,labels_integer
3278,80541,dataset/images/field/zelkova_serrata/129920088...,Zelkova serrata,field,12992008819325.jpg,20
3279,80542,dataset/images/field/zelkova_serrata/129920088...,Zelkova serrata,field,12992008827162.jpg,20
3280,80543,dataset/images/field/zelkova_serrata/129920088...,Zelkova serrata,field,12992008846045.jpg,20
3281,80544,dataset/images/field/zelkova_serrata/129920088...,Zelkova serrata,field,12992008854728.jpg,20
3282,80545,dataset/images/field/zelkova_serrata/129920088...,Zelkova serrata,field,12992008865346.jpg,20


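Save the enhanced data frame. Note that `to_csv` writes comma-separated values by default, whereas the input file was tab-separated.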

In [22]:
dataset_info_file_enh = os.path.join(original_data_path, "leafsnap-dataset-20subset-images-enhanced.txt")
img_info.to_csv(dataset_info_file_enh, index=False)

### Resize pictures

Resizing the pictures to 64x64 is done by reading the filenames from the data frame, generating a resized version of each picture, and saving it to an output directory.
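
Resizing to a fixed 64x64 target stretches non-square pictures. The sketch below illustrates this on a synthetic image; it assumes Pillow is installed, and the image dimensions and color are made up for the example.

```python
from PIL import Image

# Synthetic non-square image: 200 px wide, 100 px high.
im = Image.new("RGB", (200, 100), color=(30, 120, 30))

# resize() takes (width, height) and stretches the image to fit,
# so the original 2:1 aspect ratio is not preserved.
resized = im.resize((64, 64), Image.LANCZOS)

print(im.size, "->", resized.size)
```

Because the target here is square, the order of width and height does not matter, but for a non-square target PIL expects `(width, height)`.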


In [None]:
def resizeImage(infile, infile_name_only, output_dir="", size=(64,64)):
    '''
    Resize an image to a requested size (not considering aspect ratio)
    Input:
      - infile: image to be resized (path relative to original_data_path)
      - infile_name_only: image to be resized (filename only)
      - output_dir: where resized images should be stored
      - size: output size (tuple of (width, height), as expected by PIL)
    '''
    infullfname = os.path.join(original_data_path, infile)
    outfullfname = os.path.join(output_dir, infile_name_only)
    # skip images that have already been resized
    if not os.path.isfile(outfullfname):
        try:
            im = PIL.Image.open(infullfname)
            # stretches to the requested size regardless of aspect ratio;
            # LANCZOS replaces the ANTIALIAS alias, which was removed in Pillow 10
            im = im.resize(size, PIL.Image.LANCZOS)
            im.save(outfullfname)
        except IOError:
            print("cannot reduce image for ", infullfname)

output_dir = os.path.join(original_data_path,"dataset","resized")
# create the output directory if it is missing; otherwise every save fails
os.makedirs(output_dir, exist_ok=True)
size = (64, 64)
filenames_dir = list(img_info["image_path"])
filenames = list(img_info["filename"])

for image_path, filename in zip(filenames_dir, filenames):
    resizeImage(image_path, filename, output_dir=output_dir, size=size)