# The Lemon Dataset

In this notebook we will explore the [lemon dataset](https://github.com/softwaremill/lemon-dataset).
> Lemon dataset has been prepared to investigate the possibilities to tackle the issue of fruit quality control. It contains 2690 annotated images (1056 x 1056 pixels). Raw lemon images have been captured using the procedure described in the following blogpost and manually annotated using CVAT.

![](assets/lemons.png)

Here are some annotated lemons:

![](assets/annotated_lemons.png)

In [2]:
from pycocotools.coco import COCO
import skimage.io as io
import numpy as np
import pandas as pd
import os

import wandb

We can run shell commands directly in Jupyter by prepending `!` to them. This spawns a new shell process and runs the command.

In [None]:
!git clone -qq https://github.com/softwaremill/lemon-dataset.git
!unzip -q lemon-dataset/data/lemon-dataset.zip

We will define some global configuration parameters. `ENTITY` should be your W&B team name if you work in a team; set it to `None` if you work individually.

In [None]:
PROJECT_NAME = 'lemon-project'
ENTITY = 'wandb_course'
RAW_DATA_FOLDER = 'lemon-dataset/'
ANNOTATIONS_FILE = 'lemon-dataset/annotations/instances_default.json'

We can load the annotations with the `COCO` class from the `pycocotools` library. COCO is a standard annotation format for object detection and segmentation.

In [None]:
coco = COCO(ANNOTATIONS_FILE)
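
To make the format less abstract, here is a minimal sketch of what a COCO-style annotation file looks like once loaded as a Python dict. All values (`images/example.jpg`, `example_category`, the ids and coordinates) are made up for illustration and are not taken from the lemon dataset. The grouping at the end is only a rough picture of the per-image lookup that `COCO` offers, not its actual implementation.

```python
# A tiny, made-up COCO-style annotation structure (illustrative values only).
coco_example = {
    "images": [
        {"id": 1, "file_name": "images/example.jpg", "width": 1056, "height": 1056},
    ],
    "categories": [
        {"id": 1, "name": "example_category", "supercategory": ""},
    ],
    "annotations": [
        {
            "id": 10,
            "image_id": 1,        # links the annotation to an image
            "category_id": 1,     # links the annotation to a category
            "bbox": [100.0, 200.0, 50.0, 80.0],  # [x, y, width, height] in pixels
            "segmentation": [[100, 200, 150, 200, 150, 280, 100, 280]],  # polygon
            "area": 4000.0,
            "iscrowd": 0,
        },
    ],
}

# Group annotations by image id, a rough picture of a per-image lookup.
anns_by_image = {}
for ann in coco_example["annotations"]:
    anns_by_image.setdefault(ann["image_id"], []).append(ann)

print(anns_by_image[1][0]["bbox"])
```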

Get the categories (not 😸):

In [None]:
cats = coco.loadCats(coco.getCatIds())
catIds = coco.getCatIds()

Get the image ids and load the corresponding image records:

In [None]:
imgIds = coco.getImgIds()
imgs = coco.loadImgs(imgIds)

We define some helper functions to load the annotations and labels.

In [None]:
def get_anns(img):
    annIds = coco.getAnnIds(imgIds=img['id'], catIds=catIds, iscrowd=None)
    return coco.loadAnns(annIds)

def get_label(ann):
    return [cat['name'] for cat in cats if cat['id'] == ann['category_id']][0]
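
`get_label` scans the whole category list on every call. That is fine for a handful of categories, but if you prefer a direct lookup, one option is to build an id-to-name dictionary once. This is just a sketch: `cats_example`, `id_to_name` and `get_label_from_map` are hypothetical names, and the category names are placeholders rather than the dataset's real ones.

```python
# Placeholder categories with the same keys COCO category records use.
cats_example = [
    {"id": 1, "name": "category_a"},
    {"id": 2, "name": "category_b"},
]

# Build the lookup once instead of scanning the list for every annotation.
id_to_name = {cat["id"]: cat["name"] for cat in cats_example}


def get_label_from_map(ann, mapping=id_to_name):
    """Return the category name for an annotation dict."""
    return mapping[ann["category_id"]]


label = get_label_from_map({"category_id": 2})
print(label)
```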

We will log the images with the corresponding bounding boxes and masks to a `wandb.Table`, so we can visualize them and do further EDA in the W&B workspace.

In [None]:
def make_wandb_image(img):
    "Helper function to create an annotated wandb.Image"
    pth = os.path.join(RAW_DATA_FOLDER, img['file_name'])
    img_array = io.imread(pth)
    anns = get_anns(img)

    truth_box_data = [{'position': {'minX': ann['bbox'][0],
                                  'minY': ann['bbox'][1],
                                  'maxX': ann['bbox'][0]+ann['bbox'][2],
                                  'maxY': ann['bbox'][1]+ann['bbox'][3]},
                      'class_id': ann['category_id'],
                      'box_caption': get_label(ann),
                      'domain': 'pixel'} for ann in anns]

    masks = [coco.annToMask(ann)*ann['category_id'] for ann in anns]
    mask = np.stack(masks).max(axis=0)  # arbitrary way to select a label...
    return wandb.Image(
                    img_array,
                    classes=cats,
                    boxes={'ground_truth': {'box_data': truth_box_data}},
                    masks={'ground_truth': {'mask_data': mask}}
                    )
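
The box conversion inside `make_wandb_image` is worth isolating: COCO stores a box as `[x, y, width, height]`, while the `position` dict above uses corner coordinates. A minimal sketch of that step on its own, with a hypothetical helper name `coco_bbox_to_corners` and made-up numbers:

```python
def coco_bbox_to_corners(bbox):
    """Convert a COCO [x, y, width, height] box to min/max corner coordinates."""
    x, y, w, h = bbox
    return {"minX": x, "minY": y, "maxX": x + w, "maxY": y + h}


# Made-up box: top-left corner at (100, 200), 50 px wide and 80 px tall.
corners = coco_bbox_to_corners([100, 200, 50, 80])
print(corners)
```

Because `'domain': 'pixel'` is set, these values are interpreted as pixel coordinates rather than fractions of the image size.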

We are interested in mold, so let's create a function to filter for this category.

In [None]:
def is_mold(img):
    anns = get_anns(img)
    # 4 is id of mold category
    return 4 in [x['category_id'] for x in anns]
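
Hard-coding the id `4` works as long as the annotation file does not change. If you would rather filter by name, a sketch along these lines could be used. `has_category` is a hypothetical helper, and the category names below are placeholders, so check the real names in `cats` before relying on it.

```python
def has_category(anns, cats, name):
    """Return True if any annotation belongs to a category with the given name."""
    wanted_ids = {cat["id"] for cat in cats if cat["name"] == name}
    return any(ann["category_id"] in wanted_ids for ann in anns)


# Placeholder categories and annotations, not the dataset's real ones.
cats_example = [{"id": 3, "name": "other"}, {"id": 4, "name": "target"}]
anns_with = [{"category_id": 3}, {"category_id": 4}]
anns_without = [{"category_id": 3}]

print(has_category(anns_with, cats_example, "target"))
print(has_category(anns_without, cats_example, "target"))
```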

The filename contains several pieces of information about each image, separated by underscores.
> You will notice that file names are composed to form a specific identifier e.g.: 0037_G_I_120_A: 0037 (individual fruit instance), 120 (relative photo angle), A (photo position). Some of them are restricted to the original project and cannot be published.

We will log this information in separate columns

In [None]:
imgs[0]['file_name']

Let's recover the filename without the directory and extension.

In [None]:
fname = imgs[0]['file_name'].split('/')[1].split('.')[0]
fname

We then split it on underscores to get each individual feature.

In [None]:
fname.split('_')
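
The two steps above can also be sketched with `pathlib`, which does not depend on the path having exactly one directory level or a single dot. The identifier `0037_G_I_120_A` comes from the dataset description quoted earlier; the `images/` prefix and the `.jpg` extension are assumptions made for this example, and `parse_file_name` is a hypothetical helper.

```python
from pathlib import PurePosixPath


def parse_file_name(file_name):
    """Drop the directory and extension, then split the identifier on underscores."""
    stem = PurePosixPath(file_name).stem
    return stem.split("_")


# Assumed path layout; only the identifier itself is taken from the dataset docs.
parts = parse_file_name("images/0037_G_I_120_A.jpg")
print(parts)
```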

In [None]:
def make_row(img):
    "Build one table row: annotated image, filename fields, file name, mold flag"
    fname = img['file_name'].split('/')[1].split('.')[0]
    return [make_wandb_image(img), *fname.split('_'), img['file_name'], is_mold(img)]

In [None]:
make_row(imgs[0])

Let's log the dataset as a Table. It takes around 5 minutes, depending on your connection.

In [None]:
# imgs = imgs[0:5]  #uncomment to log a sample only

with wandb.init(project=PROJECT_NAME, entity=ENTITY, job_type="EDA") as run:    
    df = pd.DataFrame(data=[make_row(img) for img in imgs],
                      columns='imgs,ids,n1,n2,n3,n4,file_name,is_mold'.split(','))
    run.log({'table_coco_sample': wandb.Table(dataframe=df)})
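
The `DataFrame` constructor expects every row to have exactly eight values, which only holds if each filename splits into five underscore-separated parts. If you are unsure whether all files follow that pattern, a quick check before logging can help. This is a sketch under that assumption: `split_rows_by_width` is a hypothetical helper and the rows below are placeholders standing in for the output of `make_row`.

```python
COLUMNS = "imgs,ids,n1,n2,n3,n4,file_name,is_mold".split(",")


def split_rows_by_width(rows, n_columns=len(COLUMNS)):
    """Separate rows that match the expected column count from those that do not."""
    good = [row for row in rows if len(row) == n_columns]
    bad = [row for row in rows if len(row) != n_columns]
    return good, bad


# Placeholder rows; the first element stands in for a wandb.Image.
rows = [
    ["<image>", "0037", "G", "I", "120", "A", "images/0037_G_I_120_A.jpg", False],
    ["<image>", "0038", "G", "I", "120", "images/0038_G_I_120.jpg", True],  # one field short
]
good_rows, bad_rows = split_rows_by_width(rows)
print(len(good_rows), len(bad_rows))
```

Any rows that end up in `bad_rows` are worth inspecting by hand before deciding how to handle them.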