# LeafSnap data exploration

The original dataset was downloaded from [kaggle.com](https://www.kaggle.com/xhlulu/leafsnap-dataset) because [leafsnap.com](leafsnap.com/dataset) is no longer available. It is stored on [SURF drive](https://surfdrive.surf.nl/files/index.php/s/MoCVal7gxS4aX51?path=%2Fdata%2FLeafSnap). It contains 30 866 (~31k) color images of varying sizes and covers all 185 tree species from the Northeastern United States. The leaf images come from two different sources:

    "Lab" images, consisting of high-quality images taken of pressed leaves, from the Smithsonian collection.
    "Field" images, consisting of "typical" images taken by mobile devices (iPhones mostly) in outdoor environments.

For the purpose of this demo, we want to select a subset of 30 species that includes both lab and field images. A [dataset of 20 classes](https://github.com/NLeSC/XAI/blob/master/Software/LeafSnapDemo/Data_preparation_20subset.ipynb) has already been selected, in which the lab images were cropped semi-manually using IrfanView to remove the ruler and color calibration parts of the image. This resulted in a small dataset of 3283 images.

This notebook explores the original dataset to find the 10 most populous classes that are not yet included in the 20-class dataset.

### Imports

In [73]:
import warnings
warnings.simplefilter('ignore')
import os
import PIL
import imageio
import pandas as pd
import numpy as np


### Read data frame with information about pictures

In the dataset, there is a data frame containing information about the pictures. The columns relevant for us are:

    image_path: path to the individual picture
    species: Latin name of the plant
    source: whether the picture was taken in the lab or in the field



In [74]:
# original dataset
data_path = "/home/elena/eStep/XAI/Data/LeafSnap/"
dataset_data_path = os.path.join(data_path, "leafsnap-dataset")
dataset_info_file = os.path.join(dataset_data_path, "leafsnap-dataset-images.txt")

img_info = pd.read_csv(dataset_info_file, sep="\t")
img_info.head()

Unnamed: 0,file_id,image_path,segmented_path,species,source
0,55497,dataset/images/lab/abies_concolor/ny1157-01-1.jpg,dataset/segmented/lab/abies_concolor/ny1157-01...,Abies concolor,lab
1,55498,dataset/images/lab/abies_concolor/ny1157-01-2.jpg,dataset/segmented/lab/abies_concolor/ny1157-01...,Abies concolor,lab
2,55499,dataset/images/lab/abies_concolor/ny1157-01-3.jpg,dataset/segmented/lab/abies_concolor/ny1157-01...,Abies concolor,lab
3,55500,dataset/images/lab/abies_concolor/ny1157-01-4.jpg,dataset/segmented/lab/abies_concolor/ny1157-01...,Abies concolor,lab
4,55501,dataset/images/lab/abies_concolor/ny1157-02-1.jpg,dataset/segmented/lab/abies_concolor/ny1157-02...,Abies concolor,lab
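
As a minimal sketch of how only the relevant columns could be loaded, the snippet below reads a small in-memory table with the same tab-separated layout. The two rows and their paths are made up for illustration, and `toy_info` is a hypothetical name; the real info file is not touched.

```python
import io
import pandas as pd

# Two made-up rows mimicking the layout of the info file (assumption:
# tab-separated, with a header row naming the columns shown above).
sample = (
    "file_id\timage_path\tsegmented_path\tspecies\tsource\n"
    "1\timages/lab/a.jpg\tsegmented/lab/a.png\tAbies concolor\tlab\n"
    "2\timages/field/b.jpg\tsegmented/field/b.png\tAcer rubrum\tfield\n"
)

# usecols restricts parsing to the columns of interest.
toy_info = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    usecols=["image_path", "species", "source"],
)
print(toy_info)
```

Restricting the columns at read time keeps later cells focused on the fields that are actually used.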


In [75]:
subset20_data_path = os.path.join(data_path, "leafsnap-dataset-20subset")
subset20_info_file = os.path.join(subset20_data_path, "leafsnap-dataset-20subset-images.txt")

img_info20 = pd.read_csv(subset20_info_file, sep="\t")
img_info20.head()

Unnamed: 0,file_id,image_path,species,source
0,55821,dataset/images/lab/Auto_cropped/acer_campestre...,Acer campestre,lab
1,55822,dataset/images/lab/Auto_cropped/acer_campestre...,Acer campestre,lab
2,55823,dataset/images/lab/Auto_cropped/acer_campestre...,Acer campestre,lab
3,55824,dataset/images/lab/Auto_cropped/acer_campestre...,Acer campestre,lab
4,55825,dataset/images/lab/Auto_cropped/acer_campestre...,Acer campestre,lab


### Get the top 15 most populous species from the original dataset

In [76]:
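# One entry per image; the value is the species name.
species = img_info["species"]
# value_counts() returns the number of images per species, sorted in descending order.
species_counts = species.value_counts()
# The first 15 rows are therefore the 15 most populous species.
top15_species_counts = species_counts.head(15)
print(top15_species_counts)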

Maclura pomifera            448
Ulmus rubra                 317
Prunus virginiana           303
Acer rubrum                 297
Broussonettia papyrifera    294
Prunus sargentii            288
Ptelea trifoliata           270
Ulmus pumila                265
Abies concolor              251
Asimina triloba             249
Diospyros virginiana        248
Quercus montana             247
Ilex opaca                  244
Liriodendron tulipifera     235
Acer negundo                229
Name: species, dtype: int64
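
To illustrate why `head(n)` on the result of `value_counts()` yields the most populous classes, here is a small sketch on a made-up series. The data and the names `toy_species`, `toy_counts` and `top2` are hypothetical and not part of the dataset.

```python
import pandas as pd

# Made-up labels: one entry per (imaginary) image.
toy_species = pd.Series(
    ["Acer rubrum", "Ulmus rubra", "Acer rubrum",
     "Ilex opaca", "Acer rubrum", "Ulmus rubra"],
    name="species",
)

# value_counts() counts occurrences and sorts them in descending order,
# so head(2) keeps the two most frequent labels.
toy_counts = toy_species.value_counts()
top2 = toy_counts.head(2)
print(top2)
```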


In [77]:
species = img_info["species"]
species.describe()

count                30866
unique                 185
top       Maclura pomifera
freq                   448
Name: species, dtype: object

In [78]:
top15_species_counts = species.value_counts().head(15).to_frame()
top15_species_counts.info()

<class 'pandas.core.frame.DataFrame'>
Index: 15 entries, Maclura pomifera to Acer negundo
Data columns (total 1 columns):
species    15 non-null int64
dtypes: int64(1)
memory usage: 240.0+ bytes


In [79]:
top15_species_counts

Unnamed: 0,species
Maclura pomifera,448
Ulmus rubra,317
Prunus virginiana,303
Acer rubrum,297
Broussonettia papyrifera,294
Prunus sargentii,288
Ptelea trifoliata,270
Ulmus pumila,265
Abies concolor,251
Asimina triloba,249


### Find the intersection of the 15 most populous species and the 20-species subset

In [80]:
species20 = img_info20["species"]
species20.describe()

count            3283
unique             20
top       Ulmus rubra
freq              317
Name: species, dtype: object

In [81]:
species20_counts = species20.value_counts().to_frame()
species20_counts.info()

<class 'pandas.core.frame.DataFrame'>
Index: 20 entries, Ulmus rubra to Quercus rubra
Data columns (total 1 columns):
species    20 non-null int64
dtypes: int64(1)
memory usage: 320.0+ bytes


In [82]:
species20_counts

Unnamed: 0,species
Ulmus rubra,317
Diospyros virginiana,248
Ulmus americana,215
Salix nigra,197
Platanus occidentalis,188
Zelkova serrata,183
Quercus alba,175
Tilia americana,159
Magnolia acuminata,148
Quercus bicolor,145


In [83]:
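# NOTE: without an `on=` argument, merge joins on the shared column "species",
# which holds the image counts here (the species names are in the index).
# Rows are therefore matched by equal count, not by species name.
common_species = pd.merge(top15_species_counts,species20_counts)
common_species.info()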

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 1 columns):
species    2 non-null int64
dtypes: int64(1)
memory usage: 32.0 bytes


In [84]:
print(common_species)

   species
0      317
1      248
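
The merge above matches rows on the `species` column, which in these frames holds the counts, so the two matches (317 and 248) are found by count value. They correspond to Ulmus rubra and Diospyros virginiana in both listings. As a hedged alternative, the sketch below intersects the species names directly via the index. It uses a few counts copied from the outputs above; `top_counts`, `subset_counts` and `common_names` are hypothetical names.

```python
import pandas as pd

# A few rows copied from the outputs above, indexed by species name.
top_counts = pd.Series({
    "Maclura pomifera": 448,
    "Ulmus rubra": 317,
    "Prunus virginiana": 303,
    "Diospyros virginiana": 248,
})
subset_counts = pd.Series({
    "Ulmus rubra": 317,
    "Diospyros virginiana": 248,
    "Ulmus americana": 215,
})

# Intersect the index labels (species names) instead of the count values.
common_names = top_counts.index.intersection(subset_counts.index)
print(sorted(common_names))
```

Comparing names keeps the result correct even if two different species happen to have the same number of images.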


### Remove the common species from the most populous and select the top 10

In [85]:
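# NOTE: drop_duplicates(keep=False) compares the count values, so every row whose
# count also appears in common_species is removed, then the first 10 rows are kept.
top10_unused_species_count = pd.concat([top15_species_counts, common_species]).drop_duplicates(keep=False).head(10)
print(top10_unused_species_count)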

                          species
Maclura pomifera              448
Prunus virginiana             303
Acer rubrum                   297
Broussonettia papyrifera      294
Prunus sargentii              288
Ptelea trifoliata             270
Ulmus pumila                  265
Abies concolor                251
Asimina triloba               249
Quercus montana               247
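
The `concat` / `drop_duplicates` step above removes rows by count value, which works here because no other top species has 317 or 248 images. A possible name-based variant is sketched below, assuming the counts are indexed by species name. The entry "Made-up species" is invented purely to show that a species sharing a count with a used one is not dropped; `counts`, `used_species` and `unused_top3` are hypothetical names.

```python
import pandas as pd

# Counts copied from the outputs above, plus one invented entry
# ("Made-up species") that shares its count with Ulmus rubra.
counts = pd.Series({
    "Maclura pomifera": 448,
    "Ulmus rubra": 317,
    "Made-up species": 317,  # not in the dataset, illustration only
    "Prunus virginiana": 303,
    "Acer rubrum": 297,
    "Diospyros virginiana": 248,
})

# Species already present in the 20-class subset.
used_species = ["Ulmus rubra", "Diospyros virginiana"]

# Keep only rows whose index label (species name) is not already used,
# then take the most populous of the remainder.
unused_top3 = counts[~counts.index.isin(used_species)].head(3)
print(unused_top3)
```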
