# Initial Data Preparation

## Objectives

* Fetch data from Kaggle and save as raw data
* Initial data preparation and data cleaning
* Split data into Train, Validation, Test sets

## Inputs

* COCO JSON file

## Outputs

* Generate Lemon Quality Dataset, split into Train, Validation, and Test sets



---

## Change working directory

Change the working directory from the notebook's folder to its parent folder (the project root)
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/lemon-qualitycontrol/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/lemon-qualitycontrol'
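
Each time the cells above are re-run, they move up one more level. A minimal sketch of a variant that is safe to re-run, assuming the notebooks live in a folder named `jupyter_notebooks` (as in the output above); `project_root` is a hypothetical helper, not part of the original notebook.

```python
import os
from pathlib import Path


def project_root(cwd, notebooks_dir="jupyter_notebooks"):
    """Return the parent of cwd if cwd is the notebooks folder, else cwd unchanged."""
    cwd = Path(cwd)
    return cwd.parent if cwd.name == notebooks_dir else cwd


# Safe to re-run: only moves up when currently inside the notebooks folder
os.chdir(project_root(os.getcwd()))
```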

## Obtain and save data from Kaggle API

Install Kaggle

In [6]:
!pip install kaggle




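Point the KAGGLE_CONFIG_DIR environment variable at the current working directory so the Kaggle package can locate the kaggle.json token there, then set the token's permissions to 600 (read and write for the owner only) so the credentials are not readable by other users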

In [7]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
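
The same step can be done without a shell command, which also gives a clearer error when the token is missing. This is a sketch under assumptions: `secure_kaggle_token` is a hypothetical helper, and it expects `kaggle.json` to sit directly inside the given folder on a POSIX system.

```python
import os
import stat
from pathlib import Path


def secure_kaggle_token(config_dir):
    """Point Kaggle at config_dir and make kaggle.json accessible to its owner only."""
    token = Path(config_dir) / "kaggle.json"
    if not token.is_file():
        raise FileNotFoundError(f"Expected an API token at {token}")
    os.environ["KAGGLE_CONFIG_DIR"] = str(config_dir)
    # 0o600: read/write for the owner, no access for group or others
    token.chmod(stat.S_IRUSR | stat.S_IWUSR)
    return token
```

In the notebook it would be called as `secure_kaggle_token(os.getcwd())`.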

Set KaggleDataset to the dataset's path on Kaggle (the part of the [URL](https://www.kaggle.com/datasets/maciejadamiak/lemons-quality-control-dataset) after /datasets/) and DestinationFolder to the folder it will be downloaded into, then run the Kaggle command to download the dataset

In [8]:
KaggleDataset = "maciejadamiak/lemons-quality-control-dataset"
DestinationFolder = "inputs/lemon-quality-dataset-2"
! kaggle datasets download -d {KaggleDataset} -p {DestinationFolder}

Downloading lemons-quality-control-dataset.zip to inputs/lemon-quality-dataset-2
100%|██████████████████████████████████████| 83.0M/83.0M [00:01<00:00, 80.8MB/s]


Unzip the downloaded file, then delete the zip archive

In [9]:
import zipfile
with zipfile.ZipFile(DestinationFolder + '/lemons-quality-control-dataset.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(DestinationFolder + '/lemons-quality-control-dataset.zip')
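
If the download is interrupted, the archive may be missing or corrupt. The sketch below wraps the same two steps in a hypothetical helper, `extract_and_remove`, that validates the archive first and deletes it only after extraction has succeeded.

```python
import zipfile
from pathlib import Path


def extract_and_remove(zip_path, dest_dir):
    """Extract zip_path into dest_dir, delete the archive, and return the member names."""
    zip_path = Path(zip_path)
    if not zipfile.is_zipfile(zip_path):
        raise ValueError(f"{zip_path} is not a valid zip archive")
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        names = zip_ref.namelist()
        zip_ref.extractall(dest_dir)
    # Only reached if extraction succeeded, so a failed unzip keeps the archive
    zip_path.unlink()
    return names
```

In the notebook it would be called as `extract_and_remove(f'{DestinationFolder}/lemons-quality-control-dataset.zip', DestinationFolder)`.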

## Data Cleaning

Set input dataset paths

In [4]:
DataPath = 'inputs/lemon-quality-dataset-2/data/lemon-dataset/lemon-dataset'
ImagePath = f'{DataPath}/images'

Import the COCO API, use it to load the annotation file, and print the annotation categories

In [5]:
from pycocotools.coco import COCO
coco = COCO(f'{DataPath}/annotations/instances_default.json')
cats = coco.cats
print(cats)


loading annotations into memory...
Done (t=1.01s)
creating index...
index created!
{1: {'id': 1, 'supercategory': '', 'name': 'image_quality'}, 2: {'id': 2, 'supercategory': '', 'name': 'illness'}, 3: {'id': 3, 'supercategory': '', 'name': 'gangrene'}, 4: {'id': 4, 'supercategory': '', 'name': 'mould'}, 5: {'id': 5, 'supercategory': '', 'name': 'blemish'}, 6: {'id': 6, 'supercategory': '', 'name': 'dark_style_remains'}, 7: {'id': 7, 'supercategory': '', 'name': 'artifact'}, 8: {'id': 8, 'supercategory': '', 'name': 'condition'}, 9: {'id': 9, 'supercategory': '', 'name': 'pedicel'}}
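
The category dictionary printed above is easier to work with as a name lookup, so that category IDs can be derived from names rather than typed in by hand. A small sketch: the `cats` literal is copied from the output above, while `category_names` and `ids_for` are hypothetical helpers.

```python
# Same structure as coco.cats, copied from the printed output
cats = {
    1: {'id': 1, 'supercategory': '', 'name': 'image_quality'},
    2: {'id': 2, 'supercategory': '', 'name': 'illness'},
    3: {'id': 3, 'supercategory': '', 'name': 'gangrene'},
    4: {'id': 4, 'supercategory': '', 'name': 'mould'},
    5: {'id': 5, 'supercategory': '', 'name': 'blemish'},
    6: {'id': 6, 'supercategory': '', 'name': 'dark_style_remains'},
    7: {'id': 7, 'supercategory': '', 'name': 'artifact'},
    8: {'id': 8, 'supercategory': '', 'name': 'condition'},
    9: {'id': 9, 'supercategory': '', 'name': 'pedicel'},
}


def category_names(cats):
    """Map each category ID to its name."""
    return {cat_id: cat['name'] for cat_id, cat in cats.items()}


def ids_for(cats, names):
    """Return the category IDs for the given names, in the order requested."""
    lookup = {cat['name']: cat_id for cat_id, cat in cats.items()}
    return [lookup[name] for name in names]


print(ids_for(cats, ['illness', 'gangrene', 'mould', 'blemish', 'dark_style_remains']))
# prints [2, 3, 4, 5, 6]
```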


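Split the image IDs into unhealthy lemons (any image annotated with category 2 to 6: illness, gangrene, mould, blemish, or dark_style_remains) and healthy lemons (all remaining images)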

In [None]:
all_lemon_ids = coco.getImgIds()
unhealthy_tags = [2, 3, 4, 5, 6]  # illness, gangrene, mould, blemish, dark_style_remains

# Collect the IDs of every image that carries at least one unhealthy tag
temp_list = []
for tag in unhealthy_tags:
    temp_list.extend(coco.getImgIds(catIds=[tag]))

# An image can carry several unhealthy tags, so de-duplicate (order preserved)
bad_lemons = list(dict.fromkeys(temp_list))
bad_lemon_set = set(bad_lemons)
good_lemons = [i for i in all_lemon_ids if i not in bad_lemon_set]

bad_quality = coco.loadImgs(ids=bad_lemons)
good_quality = coco.loadImgs(ids=good_lemons)
print(bad_quality)
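
Before moving any files, it can be worth confirming that the two ID lists do not overlap and together cover every image. A minimal sketch on plain lists; `check_partition` is a hypothetical helper, not part of the original notebook.

```python
def check_partition(all_ids, bad_ids, good_ids):
    """Raise ValueError unless bad_ids and good_ids split all_ids with no overlap."""
    bad, good = set(bad_ids), set(good_ids)
    if bad & good:
        raise ValueError(f"{len(bad & good)} IDs appear in both groups")
    if bad | good != set(all_ids):
        raise ValueError("the two groups do not cover every image ID")
    # Class sizes, useful for spotting an imbalanced dataset early
    return {'bad_quality': len(bad), 'good_quality': len(good)}
```

In the notebook it would be called as `check_partition(all_lemon_ids, bad_lemons, good_lemons)`.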


Create a folder for each label and move the image files into the folder that matches their label

In [45]:
import shutil

# Map each label to the image records that belong to it
labelled_images = {'bad_quality': bad_quality, 'good_quality': good_quality}
for label, images in labelled_images.items():
    label_dir = f'{DataPath}/{label}'
    # exist_ok=True lets the cell be re-run without raising FileExistsError
    os.makedirs(label_dir, exist_ok=True)
    for image in images:
        # file_name is resolved relative to DataPath
        shutil.move(f"{DataPath}/{image['file_name']}", label_dir)
# os.rmdir only succeeds once every file has been moved out of the folder
os.rmdir(ImagePath)


---

## Data Cleaning

Remove documentation folder and README

In [13]:
import shutil
shutil.rmtree('inputs/lemon-quality-dataset-2/docs/')
os.remove('inputs/lemon-quality-dataset-2/README.MD')

Count the files in the images folder

In [14]:
image_count = len(os.listdir('inputs/lemon-quality-dataset-2/data/lemon-dataset/lemon-dataset/images/'))
image_count

2690
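
os.listdir() counts every entry in the folder, including any subfolder or non-image file. As a cross-check, the files can be tallied by extension. A small sketch; `count_by_extension` is a hypothetical helper, not part of the original notebook.

```python
from collections import Counter
from pathlib import Path


def count_by_extension(folder):
    """Count the regular files in folder, grouped by lower-cased file extension."""
    return Counter(p.suffix.lower() for p in Path(folder).iterdir() if p.is_file())
```

In the notebook it would be called as `count_by_extension('inputs/lemon-quality-dataset-2/data/lemon-dataset/lemon-dataset/images/')`.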