# Initial Data Preparation

## Objectives

* Fetch data from Kaggle and save as raw data
* Initial data preparation and data cleaning
* Split data into Train, Validation, Test sets

## Inputs

* COCO JSON file

## Outputs

* Generate Lemon Quality Dataset, split into Train, Validation, and Test sets



---

# Change working directory

Change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [12]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/lemon-qualitycontrol'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
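
Re-running the os.chdir() cell moves the working directory up one more level each time it runs. A minimal sketch of a guard against that is shown here. It assumes the notebook lives in a folder called jupyter_notebooks, which is a hypothetical name and not taken from this project, and move_to_project_root is an illustrative helper.

```python
import os


def move_to_project_root(notebook_dir_name: str = "jupyter_notebooks") -> str:
    """Move up one level only while the cwd is still the notebook folder.

    notebook_dir_name is a hypothetical folder name; adjust it to the project.
    """
    cwd = os.getcwd()
    if os.path.basename(cwd) == notebook_dir_name:
        os.chdir(os.path.dirname(cwd))
    return os.getcwd()


# Safe to run more than once: a second call leaves the directory unchanged
print(move_to_project_root())
```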

## Obtain and save data from Kaggle API

Install Kaggle

In [None]:
!pip install kaggle


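Point the KAGGLE_CONFIG_DIR environment variable at the current working directory so the Kaggle package can locate the kaggle.json authentication file, and set that file's permissions to 600 (read and write for the owner only) so other users cannot read the token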

In [13]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
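
The chmod 600 shell command can also be expressed in Python, which avoids an error when kaggle.json has not been uploaded yet. This is a sketch only: restrict_to_owner is a hypothetical helper, and the file is assumed to sit in the current working directory as in the cell above.

```python
import os
import stat


def restrict_to_owner(path: str) -> int:
    """Set mode 600 (owner read/write only) and return the resulting mode."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
    return stat.S_IMODE(os.stat(path).st_mode)


# Assumption: the token is stored next to the project root as kaggle.json
token_path = os.path.join(os.getcwd(), "kaggle.json")
if os.path.exists(token_path):
    print(oct(restrict_to_owner(token_path)))
else:
    print(f"{token_path} not found; add it before calling the Kaggle API")
```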

Set KaggleDataset to the dataset's path on Kaggle (the owner/dataset part of the [URL](https://www.kaggle.com/datasets/maciejadamiak/lemons-quality-control-dataset)) and DestinationFolder to the folder it will be downloaded into.
Then run the Kaggle command to download the dataset into the destination folder

In [14]:
KaggleDataset = "maciejadamiak/lemons-quality-control-dataset"
DestinationFolder = "inputs/lemon-quality-dataset-2"
! kaggle datasets download -d {KaggleDataset} -p {DestinationFolder}

Downloading lemons-quality-control-dataset.zip to inputs/lemon-quality-dataset-2
100%|██████████████████████████████████████| 83.0M/83.0M [00:02<00:00, 32.3MB/s]


Unzip downloaded file and subsequently delete the originally downloaded zipped file

In [15]:
import zipfile
with zipfile.ZipFile(DestinationFolder + '/lemons-quality-control-dataset.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(DestinationFolder + '/lemons-quality-control-dataset.zip')
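
If the download did not complete, the cell above fails part-way through. A hedged sketch of the same unzip-then-delete step with an explicit check is given here; extract_and_remove is a hypothetical helper name, not part of zipfile.

```python
import os
import zipfile


def extract_and_remove(zip_path: str, destination: str) -> list:
    """Extract zip_path into destination, then delete the archive.

    Returns the names of the archive members.
    """
    # is_zipfile is False for a missing file as well as for a corrupt one
    if not zipfile.is_zipfile(zip_path):
        raise FileNotFoundError(f"No valid zip archive at {zip_path}")
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        names = zip_ref.namelist()
        zip_ref.extractall(destination)
    os.remove(zip_path)
    return names


# Usage with the variables defined earlier in this notebook:
# extract_and_remove(
#     os.path.join(DestinationFolder, "lemons-quality-control-dataset.zip"),
#     DestinationFolder,
# )
```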

## Data Cleaning

Set input dataset paths

In [16]:
DataPath = 'inputs/lemon-quality-dataset-2/data/lemon-dataset/lemon-dataset'
ImagePath = f'{DataPath}/images'

Import COCO API and use it to read dataset

In [17]:
from pycocotools.coco import COCO
coco = COCO(f'{DataPath}/annotations/instances_default.json')
cats = coco.cats
print(cats)


loading annotations into memory...
Done (t=1.02s)
creating index...
index created!
{1: {'id': 1, 'supercategory': '', 'name': 'image_quality'}, 2: {'id': 2, 'supercategory': '', 'name': 'illness'}, 3: {'id': 3, 'supercategory': '', 'name': 'gangrene'}, 4: {'id': 4, 'supercategory': '', 'name': 'mould'}, 5: {'id': 5, 'supercategory': '', 'name': 'blemish'}, 6: {'id': 6, 'supercategory': '', 'name': 'dark_style_remains'}, 7: {'id': 7, 'supercategory': '', 'name': 'artifact'}, 8: {'id': 8, 'supercategory': '', 'name': 'condition'}, 9: {'id': 9, 'supercategory': '', 'name': 'pedicel'}}
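
Category ids are easier to read when they are looked up by name. A small sketch of that lookup follows. The cats dictionary is copied from the printed output above so the example runs without the dataset, and tag_ids is a hypothetical helper.

```python
# Copied from the printed coco.cats output so the example is self-contained
cats = {
    1: {'id': 1, 'supercategory': '', 'name': 'image_quality'},
    2: {'id': 2, 'supercategory': '', 'name': 'illness'},
    3: {'id': 3, 'supercategory': '', 'name': 'gangrene'},
    4: {'id': 4, 'supercategory': '', 'name': 'mould'},
    5: {'id': 5, 'supercategory': '', 'name': 'blemish'},
    6: {'id': 6, 'supercategory': '', 'name': 'dark_style_remains'},
    7: {'id': 7, 'supercategory': '', 'name': 'artifact'},
    8: {'id': 8, 'supercategory': '', 'name': 'condition'},
    9: {'id': 9, 'supercategory': '', 'name': 'pedicel'},
}

# Map each category name to its id
name_to_id = {cat['name']: cat['id'] for cat in cats.values()}


def tag_ids(names: list) -> list:
    """Translate category names into the ids used for filtering."""
    return [name_to_id[name] for name in names]


print(tag_ids(['illness', 'gangrene', 'mould']))  # [2, 3, 4]
```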


Separate the lemons into lists of unhealthy and healthy images. An image counts as unhealthy if it is annotated with any of the category ids passed in as tags; every other image counts as healthy

In [47]:
def sort_lemons(coco, tags: list) -> tuple:
    """Return (bad_lemons, good_lemons): images with any of the tags, and the rest."""
    all_lemon_ids = coco.getImgIds()
    bad_ids = []
    seen = set()
    for tag in tags:
        # Image ids that carry at least one annotation of this category
        for img_id in coco.getImgIds(catIds=[tag]):
            if img_id not in seen:  # keep each image once, in first-seen order
                seen.add(img_id)
                bad_ids.append(img_id)
    good_ids = [img_id for img_id in all_lemon_ids if img_id not in seen]
    bad_lemons = coco.loadImgs(ids=bad_ids)
    good_lemons = coco.loadImgs(ids=good_ids)
    return bad_lemons, good_lemons


In [48]:
my_sort = sort_lemons(coco=coco, tags=[2,3,4])
print(len(my_sort))

2
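
The value printed above is the length of the returned tuple, so it is always 2; the size of each group is more informative. The sketch below reproduces the splitting logic on toy data, without pycocotools, so both counts can be inspected. split_ids and the toy ids are illustrative assumptions, not part of the dataset.

```python
def split_ids(all_ids: list, ids_by_tag: dict, tags: list) -> tuple:
    """Split all_ids into (bad_ids, good_ids) based on the given tags."""
    bad = set()
    for tag in tags:
        bad.update(ids_by_tag.get(tag, []))
    bad_ids = [i for i in all_ids if i in bad]
    good_ids = [i for i in all_ids if i not in bad]
    return bad_ids, good_ids


# Toy data: image 3 carries two of the tags but must be counted once
toy_all_ids = [1, 2, 3, 4, 5]
toy_ids_by_tag = {2: [1, 3], 3: [3], 4: [5]}

bad_ids, good_ids = split_ids(toy_all_ids, toy_ids_by_tag, tags=[2, 3, 4])
print(len(bad_ids), len(good_ids))  # 3 2
```

With the real data, the equivalent check would be to print len(my_sort[0]) and len(my_sort[1]).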


Create one folder per label, then move each image file into the folder that matches its label

In [49]:
import shutil
def move_sorted_images(sort: tuple):
    # sort is (bad_lemons, good_lemons); the labels follow the same order
    labels = ['bad_quality', 'good_quality']
    for label, lemons in zip(labels, sort):
        # exist_ok lets the cell be re-run without failing on existing folders
        os.makedirs(name=f'{DataPath}/{label}', exist_ok=True)
        for lemon in lemons:
            # file_name is joined to DataPath, as in the original path construction
            image_file = f"{DataPath}/{lemon['file_name']}"
            shutil.move(image_file, f'{DataPath}/{label}')
    # Remove the images folder once it is empty
    os.rmdir(ImagePath)

    

In [51]:
move_sorted_images(sort=my_sort)
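
Moving files is hard to undo, so it can help to list the planned moves first. The following is a sketch of such a dry run. plan_moves is a hypothetical helper, and the sample records only mimic the file_name field of the COCO image entries.

```python
def plan_moves(data_path: str, sort: tuple,
               labels=('bad_quality', 'good_quality')) -> list:
    """Return (source, destination_folder) pairs without touching the disk."""
    moves = []
    # Pair each label with its group of image records
    for label, lemons in zip(labels, sort):
        for lemon in lemons:
            source = f"{data_path}/{lemon['file_name']}"
            moves.append((source, f"{data_path}/{label}"))
    return moves


# Sample records with made-up file names, shaped like COCO image entries
sample_sort = (
    [{'file_name': 'images/a.jpg'}],
    [{'file_name': 'images/b.jpg'}],
)
for source, destination in plan_moves('data', sample_sort):
    print(source, '->', destination)
```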


Create folders for lemons to be sorted into

---

## Data Cleaning

Remove documentation folder and README

In [None]:
import shutil
shutil.rmtree('inputs/lemon-quality-dataset-2/docs/')
os.remove('inputs/lemon-quality-dataset-2/README.MD')

In [None]:
# Count the files in the images folder (run before the images are moved out of it)
image_count = len(os.listdir('inputs/lemon-quality-dataset-2/data/lemon-dataset/lemon-dataset/images/'))
image_count
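
os.listdir() counts every entry in the folder, including any non-image files. A hedged sketch of a count restricted to image files is shown here; count_images is a hypothetical helper and the extension list is an assumption about the dataset.

```python
import os

# Assumed image extensions; adjust to the files actually in the dataset
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")


def count_images(folder: str) -> int:
    """Count the files in folder whose extension marks them as images."""
    return sum(
        1 for name in os.listdir(folder)
        if name.lower().endswith(IMAGE_EXTENSIONS)
    )


# Usage with the path defined earlier in this notebook:
# count_images(ImagePath)
```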