# Initial Data Preparation

## Objectives

* Fetch data from Kaggle and save as raw data
* Initial data preparation and data cleaning
* Split data into Train, Validation, Test sets

## Inputs

* Kaggle JSON file - authentication token

## Outputs

* Generate Lemon Quality Dataset, split into Train, Validation, and Test sets



---

# Change working directory

Change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/lemon-qualitycontrol/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/lemon-qualitycontrol'
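
Re-running the cell above moves the working directory up one more level each time. A minimal sketch of a guarded alternative is shown below, assuming the notebooks live in a folder named `jupyter_notebooks` (as in the output above); the helper name `move_to_project_root` is hypothetical and not part of the original notebook.

```python
import os
from pathlib import Path


def move_to_project_root(notebook_folder="jupyter_notebooks"):
    """Change to the parent directory only when currently inside the notebook folder.

    `notebook_folder` is an assumption taken from the path printed above.
    Returns the working directory after the (possible) change.
    """
    cwd = Path.cwd()
    if cwd.name == notebook_folder:
        # Only step up when we are still inside the notebooks folder
        os.chdir(cwd.parent)
    return Path.cwd()
```

Calling `move_to_project_root()` a second time leaves the working directory unchanged, so the cell can be re-run safely.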

## Obtain and save data from Kaggle API

Install Kaggle

In [4]:
!pip install kaggle

Collecting kaggle
  Downloading kaggle-1.5.12.tar.gz (58 kB)
  Preparing metadata (setup.py) ... done
Collecting tqdm
  Downloading tqdm-4.64.1-py2.py3-none-any.whl (78 kB)
Collecting python-slugify
  Downloading python_slugify-6.1.2-py2.py3-none-any.whl (9.4 kB)
Collecting text-unidecode>=1.3
  Downloading text_unidecode-1.3-py2.py3-none-any.whl (78 kB)

Set the Kaggle config directory environment variable to the current working directory so that the Kaggle package can locate the JSON file, and set the file's permissions to 600 (read and write for the owner only)

In [6]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
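
If `kaggle.json` is missing from the working directory, the `chmod` call fails and the later download step cannot authenticate. The sketch below performs the same two steps in pure Python with an explicit check first. The helper name `prepare_kaggle_token` is hypothetical, and it assumes the token file sits directly in `config_dir`.

```python
import os
import stat
from pathlib import Path


def prepare_kaggle_token(config_dir="."):
    """Point KAGGLE_CONFIG_DIR at config_dir and restrict kaggle.json to the owner.

    Hypothetical helper: assumes the token is stored as <config_dir>/kaggle.json.
    """
    config_path = Path(config_dir).resolve()
    token = config_path / "kaggle.json"
    if not token.is_file():
        raise FileNotFoundError(f"Expected the Kaggle token at {token}")
    os.environ["KAGGLE_CONFIG_DIR"] = str(config_path)
    # Equivalent to `chmod 600`: read and write for the owner only
    os.chmod(token, stat.S_IRUSR | stat.S_IWUSR)
    return token
```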

Set the KaggleDataset variable to the dataset's path on Kaggle (taken from its [URL](https://www.kaggle.com/datasets/yusufemir/lemon-quality-dataset)) and define the destination folder it will be downloaded into.
Then run the Kaggle command to download the dataset into the destination folder

In [7]:
KaggleDataset = "yusufemir/lemon-quality-dataset"
DestinationFolder = "inputs/lemon-quality-dataset"
! kaggle datasets download -d {KaggleDataset} -p {DestinationFolder}


Downloading lemon-quality-dataset.zip to inputs/lemon-quality-dataset

Unzip the downloaded file, then delete the zip archive

In [8]:
import zipfile
with zipfile.ZipFile(DestinationFolder + '/lemon-quality-dataset.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(DestinationFolder + '/lemon-quality-dataset.zip')
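
If the download was interrupted, the archive may be incomplete and extraction will fail partway through. One possible safeguard, sketched below, is to validate the file before extracting and delete it only after extraction succeeds. The helper name `extract_and_remove` is hypothetical; it uses only the standard-library `zipfile` and `os` modules.

```python
import os
import zipfile


def extract_and_remove(zip_path, destination):
    """Extract zip_path into destination, then delete the archive.

    Hypothetical helper; returns the list of member names that were extracted.
    """
    if not zipfile.is_zipfile(zip_path):
        raise ValueError(f"{zip_path} is not a valid zip archive")
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(destination)
        names = zip_ref.namelist()
    # Reached only if extraction completed without raising
    os.remove(zip_path)
    return names
```

With the variables defined earlier, this could be called as `extract_and_remove(os.path.join(DestinationFolder, "lemon-quality-dataset.zip"), DestinationFolder)`.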

---

# Data Preparation

## Data Cleaning

Remove the empty background folder and the .git folder

In [9]:
import shutil
shutil.rmtree('inputs/lemon-quality-dataset/lemon_dataset/empty_background/')
shutil.rmtree('inputs/lemon-quality-dataset/lemon_dataset/.git/')

Check for and remove non-image files

In [10]:
def remove_non_image_file(my_data_dir):
    """Delete files that are not images and report per-folder counts."""
    image_extension = ('.png', '.jpg', '.jpeg')
    folders = os.listdir(my_data_dir)
    for folder in folders:
        folder_path = os.path.join(my_data_dir, folder)
        image_count = 0
        non_image_count = 0
        for given_file in os.listdir(folder_path):
            if given_file.lower().endswith(image_extension):
                image_count += 1
            else:
                # Not an image: delete it and count the removal
                os.remove(os.path.join(folder_path, given_file))
                non_image_count += 1
        print(f"Folder: {folder} - has image file", image_count)
        print(f"Folder: {folder} - has non-image file", non_image_count)

In [11]:
remove_non_image_file(my_data_dir="inputs/lemon-quality-dataset/lemon_dataset")

Folder: bad_quality - has image file 951
Folder: bad_quality - has non-image file 0
Folder: good_quality - has image file 1125
Folder: good_quality - has non-image file 0
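
Because `remove_non_image_file` deletes files as soon as it finds them, a read-only pass beforehand can show what would be affected. The sketch below only counts file extensions per folder and deletes nothing. The helper name `count_extensions` is hypothetical, and it assumes the same one-folder-per-label layout used above.

```python
import os
from collections import Counter


def count_extensions(my_data_dir):
    """Return {folder: Counter of lowercase file extensions} without deleting anything.

    Hypothetical helper; entries of my_data_dir that are not folders are skipped.
    """
    summary = {}
    for folder in sorted(os.listdir(my_data_dir)):
        folder_path = os.path.join(my_data_dir, folder)
        if not os.path.isdir(folder_path):
            continue
        summary[folder] = Counter(
            os.path.splitext(name)[1].lower() for name in os.listdir(folder_path)
        )
    return summary
```

Any extension in the result other than `.png`, `.jpg`, or `.jpeg` marks files that the removal function would delete.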


---

## Perform train-validation-test split on data

Define the train-validation-test split function

In [None]:
import math
import random

def split_train_validation_test_images(my_data_dir, train_set_ratio, validation_set_ratio, test_set_ratio):
    # Compare with a tolerance: floating-point rounding can make valid ratios
    # sum to slightly more or less than 1.0
    if not math.isclose(train_set_ratio + validation_set_ratio + test_set_ratio, 1.0):
        print("train_set_ratio + validation_set_ratio + test_set_ratio should sum to 1.0")
        return

    labels = os.listdir(my_data_dir)
    if 'test' in labels:
        # The split folders already exist, so the data was split previously
        return

    # Create train/validation/test folders, each with one subfolder per class label
    for folder in ['train', 'validation', 'test']:
        for label in labels:
            os.makedirs(os.path.join(my_data_dir, folder, label))

    for label in labels:
        label_dir = os.path.join(my_data_dir, label)
        files = os.listdir(label_dir)
        random.shuffle(files)

        train_set_files_qty = int(len(files) * train_set_ratio)
        validation_set_files_qty = int(len(files) * validation_set_ratio)

        for count, file_name in enumerate(files, start=1):
            if count <= train_set_files_qty:
                subset = 'train'
            elif count <= train_set_files_qty + validation_set_files_qty:
                subset = 'validation'
            else:
                # Whatever remains goes to the test set
                subset = 'test'
            shutil.move(os.path.join(label_dir, file_name),
                        os.path.join(my_data_dir, subset, label, file_name))

        # The original class folder is empty after the move
        os.rmdir(label_dir)

Apply function to lemon dataset

In [None]:
split_train_validation_test_images(my_data_dir="inputs/lemon-quality-dataset/lemon_dataset",
                                   train_set_ratio=0.7,
                                   validation_set_ratio=0.1,
                                   test_set_ratio=0.2)
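
As a sanity check on the split, the expected number of images per set can be worked out from the counts reported earlier (951 `bad_quality`, 1125 `good_quality`). The sketch below mirrors the arithmetic of the split function: the train and validation sizes are truncated with `int()`, and the test set receives the remainder. The helper name `split_sizes` is hypothetical.

```python
def split_sizes(n_files, train_set_ratio, validation_set_ratio):
    """Return (train, validation, test) sizes using the same truncation as the split function."""
    train = int(n_files * train_set_ratio)
    validation = int(n_files * validation_set_ratio)
    # The test set receives every file not assigned to train or validation
    test = n_files - train - validation
    return train, validation, test


# Image counts taken from the output of remove_non_image_file above
for label, n_files in {"bad_quality": 951, "good_quality": 1125}.items():
    print(label, split_sizes(n_files, 0.7, 0.1))
# bad_quality (665, 95, 191)
# good_quality (787, 112, 226)
```

Because of the truncation, the test set ends up slightly larger than 20% of each class. Comparing these numbers with `len(os.listdir(...))` on each generated folder is one way to confirm the split ran as intended.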