# Initial Data Preparation

## Objectives

* Fetch data from Kaggle and save it as raw data in organised folders
* Perform initial data preparation and cleaning by removing non-image files
* Read the COCO annotations of the second dataset to split its images into good- and bad-quality lemons
* Combine the two datasets into a single dataset for analysis
* Split the data into train, validation, and test sets

## Inputs

* Kaggle JSON file - authentication token

## Outputs

* Generate Lemon Quality Dataset, split into Train, Validation, and Test sets



---

# Change working directory

Change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/lemon-qualitycontrol/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/lemon-qualitycontrol'

## Obtain and save data from Kaggle API

Install Kaggle

In [None]:
!pip install kaggle

Set the KAGGLE_CONFIG_DIR environment variable to the current working directory so that the Kaggle package can locate the kaggle.json authentication token, and set the file's permissions to 600 so that only its owner can read and write it

In [4]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
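
Where the shell command is unavailable, the same permission change can be sketched in pure Python with os.chmod. This is only an illustration: the helper name restrict_token_permissions is hypothetical, and it assumes a POSIX file system and that kaggle.json sits in the current working directory, as in the cell above.

```python
import os
import stat

def restrict_token_permissions(token_path: str = "kaggle.json") -> int:
    """Give the owner read/write access only (equivalent to chmod 600).

    Hypothetical helper; returns the resulting permission bits.
    """
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"{token_path} not found")
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)  # 0o600
    return stat.S_IMODE(os.stat(token_path).st_mode)
```

Raising FileNotFoundError early makes a missing token obvious before any Kaggle command is run.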

Set the KaggleDataset variable to the owner/dataset path taken from the dataset's Kaggle [URL](https://www.kaggle.com/datasets/yusufemir/lemon-quality-dataset) and create a DestinationFolder variable for the folder it will be downloaded into.
Run the Kaggle command to download the dataset into the destination folder. Repeat the process for the second Kaggle dataset at the following [URL](https://www.kaggle.com/datasets/maciejadamiak/lemons-quality-control-dataset)

In [5]:
KaggleDataset = "yusufemir/lemon-quality-dataset"
DestinationFolder = "inputs/lemon-quality-dataset"
! kaggle datasets download -d {KaggleDataset} -p {DestinationFolder}

KaggleDataset2 = "maciejadamiak/lemons-quality-control-dataset"
DestinationFolder2 = "inputs/lemon-quality-dataset-2"
! kaggle datasets download -d {KaggleDataset2} -p {DestinationFolder2}


Downloading lemon-quality-dataset.zip to inputs/lemon-quality-dataset

Unzip the downloaded files, then delete the zip archives

In [6]:
import zipfile
with zipfile.ZipFile(DestinationFolder + '/lemon-quality-dataset.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(DestinationFolder + '/lemon-quality-dataset.zip')

with zipfile.ZipFile(DestinationFolder2 + '/lemons-quality-control-dataset.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder2)

os.remove(DestinationFolder2 + '/lemons-quality-control-dataset.zip')
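
Since the same extract-then-delete steps are applied to both archives, they could be wrapped in a small function. The sketch below is one possible way to do this; extract_and_remove is a hypothetical name that does not appear in the original notebook.

```python
import os
import zipfile

def extract_and_remove(zip_path: str, destination: str) -> list:
    """Extract zip_path into destination, delete the archive, and
    return the names of the extracted members. Hypothetical helper."""
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        members = zip_ref.namelist()
        zip_ref.extractall(destination)
    # Only reached if extraction succeeded, so a corrupt archive is kept
    os.remove(zip_path)
    return members

# Example call, using the variables defined in the cells above:
# extract_and_remove(DestinationFolder + '/lemon-quality-dataset.zip', DestinationFolder)
```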

---

# Data Preparation

Set the input paths for the first and second lemon datasets

In [7]:
DataPath = "inputs/lemon-quality-dataset/lemon_dataset"
DataPath2 = 'inputs/lemon-quality-dataset-2/data/lemon-dataset/lemon-dataset'

Import the COCO API and use it to read the annotations of the second dataset

In [8]:
from pycocotools.coco import COCO
coco = COCO(f'{DataPath2}/annotations/instances_default.json')
cats = coco.cats
print(cats)


loading annotations into memory...
Done (t=0.89s)
creating index...
index created!
{1: {'id': 1, 'supercategory': '', 'name': 'image_quality'}, 2: {'id': 2, 'supercategory': '', 'name': 'illness'}, 3: {'id': 3, 'supercategory': '', 'name': 'gangrene'}, 4: {'id': 4, 'supercategory': '', 'name': 'mould'}, 5: {'id': 5, 'supercategory': '', 'name': 'blemish'}, 6: {'id': 6, 'supercategory': '', 'name': 'dark_style_remains'}, 7: {'id': 7, 'supercategory': '', 'name': 'artifact'}, 8: {'id': 8, 'supercategory': '', 'name': 'condition'}, 9: {'id': 9, 'supercategory': '', 'name': 'pedicel'}}


Separate the lemons into two lists, unhealthy and healthy, by passing in the category IDs (tags) used to filter the dataset. Here, illness (2), gangrene (3), and mould (4) are chosen as markers of unhealthy lemons: images annotated with any of these categories are separated from images annotated with none of them.

In [9]:
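def sort_lemons(coco, tags: list) -> tuple:
    """Split images into (bad_lemons, good_lemons) by annotated category IDs."""
    all_lemon_ids = coco.getImgIds()
    # Query one tag at a time: passing several catIds at once returns only
    # images that contain all of them, whereas we want images with any of them
    tagged_ids = []
    for tag in tags:
        tagged_ids.extend(coco.getImgIds(catIds=[tag]))

    # Remove duplicate IDs while preserving their order
    bad_ids = list(dict.fromkeys(tagged_ids))
    bad_id_set = set(bad_ids)
    good_ids = [i for i in all_lemon_ids if i not in bad_id_set]
    bad_lemons = coco.loadImgs(ids=bad_ids)
    good_lemons = coco.loadImgs(ids=good_ids)
    return bad_lemons, good_lemons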


In [10]:
my_sort = sort_lemons(coco=coco, tags=[2,3,4])
print(len(my_sort))

2
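
Printing len(my_sort) only confirms that the function returned a two-element tuple. The partition logic itself can be illustrated on toy data without pycocotools. In the sketch below, the dictionary tag_to_images and the function partition_ids are hypothetical stand-ins for the COCO index and sort_lemons, and the image IDs are made up.

```python
def partition_ids(all_ids: list, tag_to_images: dict, tags: list) -> tuple:
    """Return (bad_ids, good_ids): an image is bad if any tag marks it."""
    bad = set()
    for tag in tags:
        bad.update(tag_to_images.get(tag, []))
    bad_ids = [i for i in all_ids if i in bad]
    good_ids = [i for i in all_ids if i not in bad]
    return bad_ids, good_ids

# Made-up data: five images, categories 2 (illness), 4 (mould), 9 (pedicel)
all_ids = [10, 11, 12, 13, 14]
tag_to_images = {2: [10, 12], 4: [12, 13], 9: [11, 14]}

bad_ids, good_ids = partition_ids(all_ids, tag_to_images, tags=[2, 3, 4])
print(len(bad_ids), len(good_ids))  # 3 2
```

With the real data, printing len(my_sort[0]) and len(my_sort[1]) would report the number of bad and good lemons respectively.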


Move the sorted image files of the second dataset into the class folders of the first dataset

In [11]:
import shutil
def move_sorted_images(sort: tuple):
    # sort holds (bad_lemons, good_lemons); pair each list with its folder
    for lemons, folder in zip(sort, ('bad_quality', 'good_quality')):
        for lemon in lemons:
            image_file = f"{DataPath2}/{lemon['file_name']}"
            shutil.move(image_file, f'{DataPath}/{folder}')

    

In [12]:
my_move = move_sorted_images(sort=my_sort)
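
shutil.move raises shutil.Error when the destination folder already contains a file with the same name, so it can be worth checking for name clashes between the two datasets before moving anything. The following is a minimal sketch under that assumption; find_name_clashes is a hypothetical helper and is not part of the original notebook.

```python
import os

def find_name_clashes(file_names: list, destination_dir: str) -> list:
    """Return the entries of file_names whose base name already exists
    in destination_dir."""
    existing = set(os.listdir(destination_dir))
    return [name for name in file_names
            if os.path.basename(name) in existing]
```

Run before the move, a call such as find_name_clashes([i['file_name'] for i in my_sort[0]], f'{DataPath}/bad_quality') would list the bad-quality images that could not be moved as they are.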


## Data Cleaning

Remove the second lemon quality dataset folder, along with the empty background folder and the .git folder in the first dataset

In [13]:
shutil.rmtree('inputs/lemon-quality-dataset-2/')
shutil.rmtree('inputs/lemon-quality-dataset/lemon_dataset/empty_background/')
shutil.rmtree('inputs/lemon-quality-dataset/lemon_dataset/.git/')

Check for and remove non-image files

In [14]:
def remove_non_image_file(my_data_dir: str):
    image_extension = ('.png', '.jpg', '.jpeg')
    folders = os.listdir(my_data_dir)
    for folder in folders:
        files = os.listdir(my_data_dir + '/' + folder)
        image_count = 0
        non_image_count = 0
        for given_file in files:
            if not given_file.lower().endswith(image_extension):
                # Delete anything that is not a recognised image type
                file_location = my_data_dir + '/' + folder + '/' + given_file
                os.remove(file_location)
                non_image_count += 1
            else:
                image_count += 1
        print(f"Folder: {folder} - has image file", image_count)
        print(f"Folder: {folder} - has non-image file", non_image_count)
    
    

In [15]:
remove_non_image_file(my_data_dir=DataPath)

Folder: bad_quality - has image file 2901
Folder: bad_quality - has non-image file 0
Folder: good_quality - has image file 1865
Folder: good_quality - has non-image file 0
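
Because remove_non_image_file deletes files immediately, a dry run that only lists the candidates can be a useful first step. The sketch below assumes the same folder layout (one subfolder per label) and the same three extensions; list_non_image_files is a hypothetical name.

```python
import os

IMAGE_EXTENSIONS = ('.png', '.jpg', '.jpeg')

def list_non_image_files(my_data_dir: str) -> list:
    """Return paths of files that would be removed, without deleting them."""
    non_images = []
    for folder in sorted(os.listdir(my_data_dir)):
        folder_path = os.path.join(my_data_dir, folder)
        for given_file in sorted(os.listdir(folder_path)):
            if not given_file.lower().endswith(IMAGE_EXTENSIONS):
                non_images.append(os.path.join(folder_path, given_file))
    return non_images
```

An empty list means the deleting version would leave every folder untouched.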


---

## Perform train-validation-test split on data

Define the train-validation-test split function

In [None]:
import math
import random
import shutil

def split_train_validation_test_images(my_data_dir: str, train_set_ratio: float, validation_set_ratio: float, test_set_ratio: float):

    # Compare with a tolerance: exact float equality can reject valid ratios
    if not math.isclose(train_set_ratio + validation_set_ratio + test_set_ratio, 1.0):
        print("train_set_ratio + validation_set_ratio + test_set_ratio should sum to 1.0")
        return

    labels = os.listdir(my_data_dir)
    if 'test' in labels:
        # The split folders already exist, so there is nothing to do
        return

    for folder in ['train', 'validation', 'test']:
        for label in labels:
            os.makedirs(name=my_data_dir + '/' + folder + '/' + label)

    for label in labels:
        files = os.listdir(my_data_dir + '/' + label)
        random.shuffle(files)

        train_set_files_qty = int(len(files) * train_set_ratio)
        validation_set_files_qty = int(len(files) * validation_set_ratio)

        for count, file_name in enumerate(files, start=1):
            # The first files go to train, the next to validation, the rest to test
            if count <= train_set_files_qty:
                subset = 'train'
            elif count <= (train_set_files_qty + validation_set_files_qty):
                subset = 'validation'
            else:
                subset = 'test'
            shutil.move(my_data_dir + '/' + label + '/' + file_name,
                        my_data_dir + '/' + subset + '/' + label + '/' + file_name)

        # The label folder is now empty and can be removed
        os.rmdir(my_data_dir + '/' + label)
    

Apply function to combined lemon dataset

In [None]:
split_train_validation_test_images(my_data_dir = DataPath,
                            train_set_ratio = 0.7,
                            validation_set_ratio = 0.1,
                            test_set_ratio = 0.2)
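
Assuming the image counts printed by the cleaning step (2901 bad-quality and 1865 good-quality images), the size of each subset follows from the same integer arithmetic the function uses: the train and validation counts are truncated with int(), and the test set receives whatever remains. A small sketch, with split_sizes as a hypothetical helper:

```python
def split_sizes(n_files: int, train_ratio: float, validation_ratio: float) -> tuple:
    """Return (train, validation, test) counts for n_files."""
    train = int(n_files * train_ratio)
    validation = int(n_files * validation_ratio)
    test = n_files - train - validation
    return train, validation, test

print(split_sizes(2901, 0.7, 0.1))  # (2030, 290, 581)
print(split_sizes(1865, 0.7, 0.1))  # (1305, 186, 374)
```

Because of the truncation, the test set can end up slightly larger than its nominal 20% share.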