# Initial Data Preparation

## Objectives

* Fetch data from Kaggle and save as raw data in organised folders
* Initial data preparation and data cleaning: removing non-image files
* Read COCO annotations of second dataset to split into good and bad quality lemons
* Combine two datasets into a single one for analysis
* Perform image focal point isolation and background cropping tasks to uniformise dataset
* Split data into Train, Validation, Test sets

## Inputs

* Kaggle JSON file - authentication token

## Outputs

* Generate Lemon Quality Dataset, split into Train, Validation, and Test sets



---

# Change working directory

Change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/lemon-qualitycontrol/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/lemon-qualitycontrol'

## Obtain and save data from Kaggle API

Install Kaggle

In [4]:
!pip install kaggle


[notice] A new release of pip available: 22.3 -> 22.3.1
[notice] To update, run: pip install --upgrade pip


Set the Kaggle config directory environment variable to the current working directory so that the Kaggle package can locate the JSON file, and set the file's permissions to 600 (read and write for the owner only)

In [5]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
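
The same setup can be done from Python, which also lets the notebook fail early if the token is missing. This is a minimal sketch, assuming the token is a file named `kaggle.json` inside the config directory; the helper name `prepare_kaggle_token` is hypothetical and not part of the Kaggle package.

```python
import os
import stat


def prepare_kaggle_token(config_dir: str) -> str:
    """Point Kaggle at config_dir and restrict kaggle.json to its owner.

    Hypothetical helper for illustration, not part of the Kaggle package.
    """
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"Expected a Kaggle token at {token_path}")
    os.environ["KAGGLE_CONFIG_DIR"] = config_dir
    # Equivalent to `chmod 600`: read and write for the owner only
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return token_path
```

Calling `prepare_kaggle_token(os.getcwd())` would mirror the cell above.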

Set the KaggleDataset variable to the path of the [first dataset](https://www.kaggle.com/datasets/yusufemir/lemon-quality-dataset) on Kaggle and create a destination folder variable for it to be downloaded into.
Run the Kaggle command to download the dataset into the destination folder. Repeat the process for the [second Kaggle dataset](https://www.kaggle.com/datasets/maciejadamiak/lemons-quality-control-dataset).

In [10]:
KaggleDataset = "yusufemir/lemon-quality-dataset"
DestinationFolder = "inputs/lemon-quality-dataset"
! kaggle datasets download -d {KaggleDataset} -p {DestinationFolder}

KaggleDataset2 = "maciejadamiak/lemons-quality-control-dataset"
DestinationFolder2 = "inputs/lemon-quality-dataset-2"
! kaggle datasets download -d {KaggleDataset2} -p {DestinationFolder2}


Downloading lemon-quality-dataset.zip to inputs/lemon-quality-dataset
 99%|███████████████████████████████████████▌| 230M/233M [00:05<00:00, 51.6MB/s]
100%|████████████████████████████████████████| 233M/233M [00:05<00:00, 44.8MB/s]
Downloading lemons-quality-control-dataset.zip to inputs/lemon-quality-dataset-2
 99%|█████████████████████████████████████▌| 82.0M/83.0M [00:02<00:00, 38.9MB/s]
100%|██████████████████████████████████████| 83.0M/83.0M [00:03<00:00, 29.0MB/s]


Unzip the downloaded files, then delete the zip archives

In [11]:
import zipfile
with zipfile.ZipFile(DestinationFolder + '/lemon-quality-dataset.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(DestinationFolder + '/lemon-quality-dataset.zip')

with zipfile.ZipFile(DestinationFolder2 + '/lemons-quality-control-dataset.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder2)

os.remove(DestinationFolder2 + '/lemons-quality-control-dataset.zip')
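
Since both datasets go through the same unzip-then-delete steps, the pattern can be wrapped in one function. The sketch below is one possible way to do that; `extract_and_remove` is a hypothetical name introduced here.

```python
import os
import zipfile


def extract_and_remove(zip_path: str, destination: str) -> list:
    """Extract zip_path into destination, delete the archive, return member names."""
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        names = zip_ref.namelist()
        zip_ref.extractall(destination)
    # Only reached if extraction succeeded, so a failed unzip keeps the archive
    os.remove(zip_path)
    return names
```

For the first dataset, the call would be `extract_and_remove(DestinationFolder + '/lemon-quality-dataset.zip', DestinationFolder)`.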

---

# Data Preparation

Set input dataset paths for first and second lemon dataset

In [4]:
DataPath = "inputs/lemon-quality-dataset/lemon_dataset"
DataPath2 = 'inputs/lemon-quality-dataset-2/data/lemon-dataset/lemon-dataset'

Import COCO API and use it to read dataset

In [7]:
from pycocotools.coco import COCO
coco = COCO(f'{DataPath2}/annotations/instances_default.json')
cats = coco.cats
info = coco.info()
print(cats)
print(info)



loading annotations into memory...


FileNotFoundError: [Errno 2] No such file or directory: 'inputs/lemon-quality-dataset-2/data/lemon-dataset/lemon-dataset/annotations/instances_default.json'

Remove lemon images containing a prominent pedicel, then separate the remaining lemons into lists of healthy and unhealthy lemons according to the annotation tags passed to the function. Here, illness, gangrene, mould, and blemishes are chosen as markers for unhealthy lemons: images annotated with any of these traits are separated from those without them.

In [5]:
def sort_lemons(coco, tags: list) -> tuple:
    '''
    Separates annotated lemon images based on a list of annotations passed to the function,
    along with removing those images in which a pedicel is prominent. Returns a tuple of lists;
    bad quality lemons and good quality lemons.
    '''
    all_lemon_ids = coco.getImgIds()
    # Category 9 is used here to find images in which a pedicel is annotated
    pedicel_ids = set(coco.getImgIds(catIds=[9]))
    non_pedicel_ids = [i for i in all_lemon_ids if i not in pedicel_ids]
    bad_ids = []
    for tag in tags:
        # Collect each image carrying this tag once, skipping pedicel images
        for i in coco.getImgIds(catIds=[tag]):
            if i not in pedicel_ids and i not in bad_ids:
                bad_ids.append(i)

    # Every remaining non-pedicel image counts as good quality
    good_ids = [i for i in non_pedicel_ids if i not in bad_ids]
    bad_lemons = coco.loadImgs(ids=bad_ids)
    good_lemons = coco.loadImgs(ids=good_ids)
    print(len(bad_lemons))
    print(len(good_lemons))
    return bad_lemons, good_lemons
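
The partitioning logic can be checked without the dataset by running it on plain lists of image IDs. The sketch below reproduces the same partition with sets, although it orders the results by the full ID list rather than by tag; `partition_ids` and the sample IDs are made up for illustration.

```python
def partition_ids(all_ids, pedicel_ids, tagged_ids_per_tag):
    """Return (bad_ids, good_ids) for images without a pedicel.

    tagged_ids_per_tag holds one list of image IDs per unhealthy tag.
    """
    excluded = set(pedicel_ids)
    bad = set()
    for tagged_ids in tagged_ids_per_tag:
        bad.update(i for i in tagged_ids if i not in excluded)
    # Keep the original ordering of all_ids in both result lists
    bad_ids = [i for i in all_ids if i in bad]
    good_ids = [i for i in all_ids if i not in bad and i not in excluded]
    return bad_ids, good_ids


# Made-up IDs: image 3 has a pedicel, images 2 and 5 carry unhealthy tags
bad_ids, good_ids = partition_ids(
    all_ids=[1, 2, 3, 4, 5],
    pedicel_ids=[3],
    tagged_ids_per_tag=[[2, 3], [5, 2]],
)
print(bad_ids)   # [2, 5]
print(good_ids)  # [1, 4]
```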


In [6]:
my_sort = sort_lemons(coco=coco, tags=[2,3,4,5])
print('Lemons successfully sorted')

NameError: name 'coco' is not defined
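
The numeric tags passed to `sort_lemons` are easier to audit when they are looked up by name. In the COCO format, the annotation file holds a `categories` list whose entries carry `id` and `name` fields. The sketch below uses a made-up category list, because the real category names and IDs of this dataset are not shown here; `ids_for_names` is a hypothetical helper.

```python
def ids_for_names(categories: list, names: list) -> list:
    """Return the category IDs whose names appear in names, in the order given."""
    id_by_name = {cat["name"]: cat["id"] for cat in categories}
    missing = [n for n in names if n not in id_by_name]
    if missing:
        raise KeyError(f"Unknown category names: {missing}")
    return [id_by_name[n] for n in names]


# Made-up stand-in for the "categories" list of a COCO annotation file
example_categories = [
    {"id": 1, "name": "example_healthy"},
    {"id": 2, "name": "example_mould"},
    {"id": 3, "name": "example_blemish"},
]
print(ids_for_names(example_categories, ["example_mould", "example_blemish"]))  # [2, 3]
```

With a loaded annotation file, `coco.getCatIds(catNms=[...])` from pycocotools performs a similar lookup.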

Move sorted image files into folders with first lemon dataset

In [36]:
import shutil
def move_sorted_images(sort: tuple):
    '''
    Takes the tuple of lists returned by the sort_lemons function and moves each image into the
    folder of the first dataset that matches its class
    '''
    # sort[0] holds the bad quality lemons, sort[1] the good quality lemons
    for lemons, label in zip(sort, ('bad_quality', 'good_quality')):
        for i in lemons:
            image_file = f"{DataPath2}/{i['file_name']}"
            shutil.move(image_file, f'{DataPath}/{label}')

    

In [37]:
my_move = move_sorted_images(sort=my_sort)


## Data Cleaning

Remove the second lemon quality dataset folder, along with the empty background folder and the .git folder in the first dataset

In [38]:
shutil.rmtree('inputs/lemon-quality-dataset-2/')
shutil.rmtree(f'{DataPath}/empty_background/')
shutil.rmtree(f'{DataPath}/.git/')

Check for and remove non-image files

In [39]:
def remove_non_image_file(my_data_dir: str):
    '''
    Checks for and removes non-image files in the input folders by checking filename extensions match
    accepted image formats
    '''
    image_extension = ('.png', '.jpg', '.jpeg')
    folders = os.listdir(my_data_dir)
    for folder in folders:
        files = os.listdir(my_data_dir + '/' + folder)
        image_count = 0
        non_image_count = 0
        for given_file in files:
            if not given_file.lower().endswith(image_extension):
                # Delete any file whose extension is not an accepted image format
                file_location = my_data_dir + '/' + folder + '/' + given_file
                os.remove(file_location)
                non_image_count += 1
            else:
                image_count += 1
        print(f"Folder: {folder} - has image file", image_count)
        print(f"Folder: {folder} - has non-image file", non_image_count)
    
    

In [40]:
remove_non_image_file(my_data_dir=DataPath)

Folder: bad_quality - has image file 2362
Folder: bad_quality - has non-image file 0
Folder: good_quality - has image file 1159
Folder: good_quality - has non-image file 0
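
The check above trusts the filename extension. A complementary check is to look at the first bytes of each file, since PNG and JPEG files start with fixed signatures. The sketch below covers only those two formats, and `looks_like_image` is a hypothetical helper rather than part of the notebook.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
JPEG_SIGNATURE = b"\xff\xd8\xff"


def looks_like_image(path: str) -> bool:
    """Return True if the file starts with a PNG or JPEG signature."""
    with open(path, "rb") as f:
        header = f.read(len(PNG_SIGNATURE))
    return header.startswith(PNG_SIGNATURE) or header.startswith(JPEG_SIGNATURE)
```

A file named `photo.jpg` that actually contains text would pass the extension check but fail this one.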


---

## Remove Image Backgrounds

Install rembg package

In [41]:
!pip install rembg

Collecting rembg
  Downloading rembg-2.0.25-py3-none-any.whl (12 kB)
Collecting aiohttp==3.8.1
  Downloading aiohttp-3.8.1-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.3 MB)
Collecting filetype==1.1.0
  Downloading filetype-1.1.0-py2.py3-none-any.whl (17 kB)
Collecting click==8.1.3
  Downloading click-8.1.3-py3-none-any.whl (96 kB)
Collecting asyncer==0.0.1
  Downloading asyncer-0.0.1-py3-none-any.whl (8.1 kB)
Collecting gdown==4.5.1
  Downloading gdown-4.5.1.tar.gz (14 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting numpy==1.21.6
  Downloadi

Create directories for processed images

In [9]:
BadQualityPath = f"{DataPath}/bad_quality"
GoodQualityPath = f"{DataPath}/good_quality"
BadQualityProcessed = f"{DataPath}/bad_quality_processed"
GoodQualityProcessed = f"{DataPath}/good_quality_processed"
# exist_ok=True lets this cell be re-run without raising FileExistsError
os.makedirs(BadQualityProcessed, exist_ok=True)
os.makedirs(GoodQualityProcessed, exist_ok=True)

Remove image background to isolate lemon

In [44]:
from rembg import remove
from PIL import Image
import matplotlib.pyplot as plt



good_quality_files = os.listdir(GoodQualityPath)
for index, my_file in enumerate(good_quality_files):
    image=Image.open(f'{GoodQualityPath}/{my_file}')
    removed = remove(image)
    imageBox = removed.getbbox()
    cropped = removed.crop(imageBox)
    cropped.save(f'{GoodQualityProcessed}/{index}.png')

Downloading...
From: https://drive.google.com/uc?id=1tCU5MM1LhRgGou5OpmpjBQbSrYIUoYab
To: /home/gitpod/.u2net/u2net.onnx
100%|██████████| 176M/176M [00:02<00:00, 71.3MB/s] 


In [11]:
bad_quality_files = os.listdir(BadQualityPath)
for index, my_file in enumerate(bad_quality_files):
    image=Image.open(f'{BadQualityPath}/{my_file}')
    removed = remove(image)
    imageBox = removed.getbbox()
    cropped = removed.crop(imageBox)
    cropped.save(f'{BadQualityProcessed}/{index}.png')
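
After the background is removed, the crop relies on the bounding box of the non-zero region of the image. The pure-Python sketch below illustrates that idea on a small grid of values. It is an illustration of the concept, not Pillow's implementation, and `nonzero_bbox` is a hypothetical name. Like Pillow's `getbbox()`, it returns `(left, upper, right, lower)` with the right and lower edges exclusive, or `None` when nothing is left.

```python
def nonzero_bbox(grid):
    """Return (left, upper, right, lower) around the non-zero cells, or None.

    grid is a list of rows; right and lower are exclusive, as in Pillow.
    """
    rows = [y for y, row in enumerate(grid) if any(row)]
    cols = [x for row in grid for x, value in enumerate(row) if value]
    if not rows:
        return None
    return min(cols), min(rows), max(cols) + 1, max(rows) + 1


# 0 stands for removed background, 255 for a lemon pixel
alpha = [
    [0, 0, 0, 0, 0],
    [0, 0, 255, 0, 0],
    [0, 255, 255, 255, 0],
    [0, 0, 0, 0, 0],
]
print(nonzero_bbox(alpha))  # (1, 1, 4, 3)
```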

Reinstall the packages in requirements.txt to restore the versions changed by the rembg installation and avoid dependency conflicts

In [6]:
!pip install -r requirements.txt

Collecting numpy==1.19.2
  Downloading numpy-1.19.2-cp38-cp38-manylinux2010_x86_64.whl (14.5 MB)
Collecting click<8.0,>=7.0
  Downloading click-7.1.2-py2.py3-none-any.whl (82 kB)
Collecting typing-extensions~=3.7.4
  Downloading typing_extensions-3.7.4.3-py3-none-any.whl (22 kB)
Installing collected packages: typing-extensions, numpy, click
  Attempting uninstall: typing-extensions
    Found existing installation: typing_extensions 4.4.0
    Uninstalling typing_extensions-4.4.0:
      Successfully uninstalled typing_extensions-4.4.0
  Attempting uninstall: numpy
    Found existing installation: numpy 1.21.6
    Uninstalling numpy-1.21.6:
      Successfully uninstalled numpy-1.21.6
  Attempting uninstall: click
    Found existing i

Remove the folders of unprocessed images and rename the processed image folders to take their place

In [13]:
import shutil
shutil.rmtree(f'{DataPath}/bad_quality/')
shutil.rmtree(f'{DataPath}/good_quality/')

In [None]:
os.rename(f'{DataPath}/bad_quality_processed/', f'{DataPath}/bad_quality/')
os.rename(f'{DataPath}/good_quality_processed/', f'{DataPath}/good_quality/')

## Perform train-validation-test split on data

Define train-test-validation split function

In [None]:
import math
import random
import shutil

def split_train_validation_test_images(my_data_dir: str, train_set_ratio: float, validation_set_ratio: float, test_set_ratio: float):
  '''
  Splits images into train, test, and validation sets and creates new directories to sort them into.
  Ratios for train, validation and test set proportions can be chosen by user and passed into function.
  '''
  # Compare with a tolerance: a sum of floats can miss 1.0 by a rounding error
  if not math.isclose(train_set_ratio + validation_set_ratio + test_set_ratio, 1.0):
    print("train_set_ratio + validation_set_ratio + test_set_ratio should sum to 1.0")
    return

  labels = os.listdir(my_data_dir)
  if 'test' in labels:
    # The split folders already exist, so there is nothing to do
    return

  for folder in ['train', 'validation', 'test']:
    for label in labels:
      os.makedirs(name=my_data_dir + '/' + folder + '/' + label)

  for label in labels:
    files = os.listdir(my_data_dir + '/' + label)
    random.shuffle(files)

    train_set_files_qty = int(len(files) * train_set_ratio)
    validation_set_files_qty = int(len(files) * validation_set_ratio)

    for count, file_name in enumerate(files, start=1):
      if count <= train_set_files_qty:
        target = 'train'
      elif count <= (train_set_files_qty + validation_set_files_qty):
        target = 'validation'
      else:
        # Files left over after rounding down both counts go to the test set
        target = 'test'
      shutil.move(my_data_dir + '/' + label + '/' + file_name,
                  my_data_dir + '/' + target + '/' + label + '/' + file_name)

    # The label folder is empty once all of its files have been moved
    os.rmdir(my_data_dir + '/' + label)
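
To see how many files each set receives, the rounding used by the function can be reproduced on its own. The sketch below uses the folder sizes reported during data cleaning (2362 and 1159 images) and assumes those counts are unchanged by the background removal step; `split_counts` is a hypothetical helper.

```python
def split_counts(n_files: int, train_ratio: float, validation_ratio: float) -> tuple:
    """Return (train, validation, test) counts using the split function's rounding."""
    train = int(n_files * train_ratio)
    validation = int(n_files * validation_ratio)
    # The test set receives whatever is left after both counts are rounded down
    test = n_files - train - validation
    return train, validation, test


print(split_counts(2362, 0.7, 0.1))  # (1653, 236, 473)
print(split_counts(1159, 0.7, 0.1))  # (811, 115, 233)
```

Because the train and validation counts are rounded down, the test set can end up slightly larger than its nominal ratio.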
    

Apply function to combined lemon dataset

In [None]:
split_train_validation_test_images(my_data_dir = DataPath,
                            train_set_ratio = 0.7,
                            validation_set_ratio = 0.1,
                            test_set_ratio = 0.2)
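
Because the ratios are floats, their sum can differ from 1.0 by a rounding error, depending on the values and the order in which they are added. A tolerance-based check, sketched here with the standard library's `math.isclose`, accepts such sums; `ratios_sum_to_one` is a hypothetical helper.

```python
import math


def ratios_sum_to_one(*ratios: float) -> bool:
    """Return True if the ratios sum to 1.0 within floating point tolerance."""
    return math.isclose(sum(ratios), 1.0)


# Exact comparison is order-sensitive: this sum evaluates to 0.9999999999999999
print(0.7 + 0.2 + 0.1 == 1.0)            # False
print(ratios_sum_to_one(0.7, 0.2, 0.1))  # True
print(ratios_sum_to_one(0.5, 0.3, 0.1))  # False
```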