# Fruit-360 preprocessor
This notebook prepares the Fruit-360 dataset for upload to the Peltarion platform.

Two versions of the dataset will be prepared:

1) A reduced version with fewer images in the 'Apple Granny Smith' class

2) A complete version with all images included

**Note**: This notebook requires installation of Sidekick. To install the package within the notebook, run the following code:

import sys
!{sys.executable} -m pip install git+https://github.com/Peltarion/sidekick#egg=sidekick

For more information about Sidekick, see: https://github.com/Peltarion/sidekick

The raw dataset is available at: https://storage.googleapis.com/bucket-8732/Fruits/fruits.zip

Third party terms apply, see: https://github.com/Horea94/Fruit-Images-Dataset/blob/master/LICENSE

In [1]:
import functools
import os
from glob import glob

import pandas as pd
from PIL import Image
import sidekick
from tqdm import tqdm

## Setup

### Paths

In [20]:
# Raw dataset
input_path = './fruits-360/Training'
# Zip output
output_path = './data_complete.zip'
output_path_reduced = './data_reduced.zip'

### Progress bar for Pandas

In [12]:
tqdm.pandas()

### Get list of image paths

In [13]:
images_path = glob(input_path + '/*/*.jpg') + glob(input_path + '/*/*.png')
print("Images found: ", len(images_path))

Images found:  53177
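
The glob patterns above match only lowercase `.jpg` and `.png` files. As a minimal sketch of the same lookup that ignores extension case, the helper below walks the `<root>/<class>/<image>` layout with `pathlib`. The function name `find_images` and the demo file names are hypothetical and are not part of the notebook or of Sidekick.

```python
import tempfile
from pathlib import Path


def find_images(root, extensions=(".jpg", ".png")):
    """Return sorted paths of images located one folder below `root`."""
    wanted = {ext.lower() for ext in extensions}
    return sorted(
        str(path)
        for path in Path(root).glob("*/*")
        if path.is_file() and path.suffix.lower() in wanted
    )


# Demo on a throwaway folder tree that mimics <root>/<class name>/<image>.
with tempfile.TemporaryDirectory() as root:
    for name in ("Apple Braeburn/0_100.jpg", "Walnut/1_100.JPG", "Walnut/notes.txt"):
        target = Path(root, name)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch()
    demo_paths = find_images(root)

print("Images found:", len(demo_paths))
```

In the notebook itself, `find_images(input_path)` could stand in for the two `glob` calls if the dataset might contain files with uppercase extensions.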


## Create Dataframe
The `fruit_class` column is derived from the name of the subfolder that contains each image under `input_path`.

The `image` column contains the path to each image file in those subfolders.

In [14]:
df = pd.DataFrame({'image': images_path})
df['fruit_class'] = df['image'].progress_apply(lambda path: os.path.basename(os.path.dirname(path)))
df.head()

100%|██████████| 53177/53177 [00:00<00:00, 323553.78it/s]


| | image | fruit_class |
|---|---|---|
| 0 | /Users/joakim/tmp/fruits_wt/fruits-360/Trainin... | Tomato 4 |
| 1 | /Users/joakim/tmp/fruits_wt/fruits-360/Trainin... | Tomato 4 |
| 2 | /Users/joakim/tmp/fruits_wt/fruits-360/Trainin... | Tomato 4 |
| 3 | /Users/joakim/tmp/fruits_wt/fruits-360/Trainin... | Tomato 4 |
| 4 | /Users/joakim/tmp/fruits_wt/fruits-360/Trainin... | Tomato 4 |
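
To make the label derivation concrete, the sketch below applies the same `os.path` calls to a single path. The file name `0_100.jpg` is a hypothetical example, and `class_from_path` is an illustrative helper name that does not appear in the notebook.

```python
import os


def class_from_path(path):
    """Return the name of the folder that directly contains `path`."""
    return os.path.basename(os.path.dirname(path))


example = "./fruits-360/Training/Tomato 4/0_100.jpg"  # hypothetical file name
label = class_from_path(example)
print(label)
```

Because the label is simply the parent folder name, any class name containing spaces or digits (such as `Tomato 4`) is preserved as-is.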


### Check that all images have the same color mode, e.g., RGB

In [15]:
def get_mode(path):
    # Open the file only long enough to read its color mode (e.g. 'RGB')
    with Image.open(path) as im:
        return im.mode

df['image_mode'] = df['image'].progress_apply(get_mode)
print(df['image_mode'].value_counts())
df = df.drop(['image_mode'], axis=1)

100%|██████████| 53177/53177 [00:19<00:00, 2757.26it/s]


RGB    53177
Name: image_mode, dtype: int64


## View number of rows per class

In [16]:
pd.set_option('display.max_rows', 150)
df['fruit_class'].value_counts()

Grape Blue             984
Plum 3                 900
Strawberry Wedge       738
Cherry Rainier         738
Tomato 1               738
Melon Piel de Sapo     738
Cherry 2               738
Peach 2                738
Tomato 3               738
Walnut                 735
Apple Red Yellow 2     672
Tomato 2               672
Pepper Yellow          666
Pepper Red             666
Pear Red               666
Pineapple Mini         493
Apple Golden 1         492
Redcurrant             492
Apple Red 2            492
Cantaloupe 2           492
Apple Braeburn         492
Physalis               492
Pomegranate            492
Cherry Wax Yellow      492
Tomato Cherry Red      492
Physalis with Husk     492
Mulberry               492
Papaya                 492
Nectarine              492
Cherry Wax Red         492
Cherry 1               492
Apple Golden 2         492
Strawberry             492
Cantaloupe 1           492
Grape White 3          492
Rambutan               492
Peach Flat             492
...

## Create a reduced dataset with fewer "Apple Granny Smith" images

In [17]:
df_ags = df.query('fruit_class=="Apple Granny Smith"')
df_ags = df_ags.sample(frac=0.1, random_state=1)
df_reduced = df.query('fruit_class!="Apple Granny Smith"')
df_reduced = pd.concat([df_reduced, df_ags], sort=False)
df_reduced['fruit_class'].value_counts()

Grape Blue             984
Plum 3                 900
Tomato 1               738
Cherry Rainier         738
Melon Piel de Sapo     738
Tomato 3               738
Cherry 2               738
Peach 2                738
Strawberry Wedge       738
Walnut                 735
Tomato 2               672
Apple Red Yellow 2     672
Pepper Red             666
Pear Red               666
Pepper Yellow          666
Pineapple Mini         493
Physalis               492
Apple Red 1            492
Nectarine              492
Cherry Wax Red         492
Mulberry               492
Tomato Cherry Red      492
Papaya                 492
Pomegranate            492
Pear                   492
Cherry Wax Black       492
Strawberry             492
Redcurrant             492
Apple Golden 1         492
Grape Pink             492
Apple Red 2            492
Cherry 1               492
Cantaloupe 2           492
Apple Golden 2         492
Peach Flat             492
Apricot                492
Physalis with Husk     492
...

## Shuffle the rows

When you save a new version of a dataset on the platform, the rows in the dataset will be shuffled automatically. To ensure that samples from different classes are displayed in the Datasets preview, you can shuffle the rows before the dataset is uploaded to the platform. 

In [18]:
df = df.sample(frac=1.0, random_state=1)
df_reduced = df_reduced.sample(frac=1.0, random_state=1)
print('Complete dataset {}'.format(len(df)))
print('Reduced dataset {}'.format(len(df_reduced)))

Complete dataset 53177
Reduced dataset 52734
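
The two totals printed above also allow a quick sanity check of the reduction step. The sketch below works backwards from the difference between them to the size of the downsampled class. It assumes that sampling with `frac=0.1` keeps `round(0.1 * n)` of the `n` rows in the class; that rounding rule is an assumption about pandas behavior and is not stated in the notebook.

```python
complete_rows = 53177
reduced_rows = 52734
removed = complete_rows - reduced_rows  # rows dropped from the sampled class

# Assumption: sampling with frac=0.1 keeps round(0.1 * n) of the n rows.
candidates = [n for n in range(1, 2000) if n - round(0.1 * n) == removed]
print(removed, candidates)
```

Under that assumption, the only class size consistent with 443 removed rows is 492, which would leave 49 "Apple Granny Smith" images in the reduced dataset.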


## Create dataset bundles

In [19]:
# Available modes:
# - crop_and_resize
# - center_crop_or_pad
# - resize_image
image_processor = functools.partial(
    sidekick.process_image,
    mode='crop_and_resize',
    size=(100, 100),
    file_format='jpeg',
)

# Reduced dataset
sidekick.create_dataset(
    output_path_reduced,
    df_reduced,
    path_columns=['image'],
    preprocess={
        'image': image_processor
    }
)

# Complete dataset
sidekick.create_dataset(
    output_path,
    df,
    path_columns=['image'],
    preprocess={
        'image': image_processor
    }
)
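
The `image_processor` above relies on `functools.partial` to fix the processing settings once, so that the same callable can be passed to both `create_dataset` calls. The sketch below illustrates that pattern in isolation. `process_image_stub` is a hypothetical stand-in that only returns a description string; it does not reproduce the behavior or the exact signature of `sidekick.process_image`.

```python
import functools


def process_image_stub(image, mode, size, file_format):
    """Hypothetical stand-in for an image processor; returns a description only."""
    return "{} -> {} {}x{} {}".format(image, mode, size[0], size[1], file_format)


# Pre-bind the settings so only the image remains to be supplied later.
processor = functools.partial(
    process_image_stub, mode="crop_and_resize", size=(100, 100), file_format="jpeg"
)

description = processor("apple.jpg")
print(description)
```

Binding the settings in one place keeps the reduced and complete bundles consistent: changing the mode or size requires editing a single line.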