# **Data Collection**

---

## Objectives

This notebook addresses Business Requirement 5:
## 5. Prediction Reporting
* Collect and curate a comprehensive dataset that includes labeled images of both pizza and not-pizza.
* Ensure the dataset is well-organized, with clear labels and sufficient diversity to represent real-world scenarios.

## Key Steps
1. Dataset Exploration:
* Explore the "Pizza or Not Pizza" dataset, understanding its structure, and visualizing sample images.
* Check for class imbalances and ensure a diverse representation of pizza variations.

2. Data Processing:
* Handle missing or corrupted data.

3. Labeling:
* Ensure clear labeling of images for supervised learning.
* Verify that each image is correctly labeled according to its content.

4. Dataset Splitting:
* Split the dataset into training, validation, and test sets to facilitate model training and evaluation.

5. Dataset Summary:
* Provide a summary of the dataset statistics, including the number of samples in each class.
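
Once the dataset has been downloaded, the class-imbalance check and the dataset summary from the steps above could be sketched roughly as follows. This is a minimal sketch, not part of the original notebook: the helper name `count_images_per_class` is hypothetical, and it assumes one sub-folder per class, as in the folder layout used later in this notebook.

```python
import os


def count_images_per_class(data_dir, extensions=(".png", ".jpg", ".jpeg")):
    """Return {class_folder_name: image_count} for each sub-folder of data_dir."""
    counts = {}
    for entry in sorted(os.listdir(data_dir)):
        class_dir = os.path.join(data_dir, entry)
        if not os.path.isdir(class_dir):
            continue  # skip stray files at the top level
        counts[entry] = sum(
            1 for name in os.listdir(class_dir)
            if name.lower().endswith(extensions)
        )
    return counts


# Assumed location of the extracted dataset; only runs if the folder exists
data_dir = "inputs/carlosrunner/pizza-not-pizza/pizza_not_pizza"
if os.path.isdir(data_dir):
    print(count_images_per_class(data_dir))
```

Roughly equal counts across the class folders would suggest that no rebalancing is needed.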

## Inputs

* Kaggle JSON file (authentication token).
* KaggleDatasetPath = "carlosrunner/pizza-not-pizza"

## Outputs

* Generate Dataset: inputs/carlosrunner/pizza-not-pizza

## Import packages

In [1]:
%pip install -r /workspaces/pizza-not-pizza/requirements.txt

Note: you may need to restart the kernel to use updated packages.


In [2]:
import numpy
import os

## Change working directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspaces/pizza-not-pizza/jupyter_notebooks'

We want to make the parent of the current directory the new current directory:
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory
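
As a small illustration of the two calls above, the sketch below shows what `os.path.dirname()` returns for the path printed earlier. It also shows a guarded variant that only moves up when the current folder is the notebooks folder, so re-running the cell does not climb a further level. The helper name `move_to_project_root` is hypothetical, and the folder name `jupyter_notebooks` is taken from the cell output above.

```python
import os

# os.path.dirname() strips the last path component
notebook_dir = "/workspaces/pizza-not-pizza/jupyter_notebooks"
project_root = os.path.dirname(notebook_dir)
print(project_root)  # /workspaces/pizza-not-pizza


def move_to_project_root(notebooks_folder="jupyter_notebooks"):
    """Change to the parent directory only if we are inside the notebooks folder."""
    cwd = os.getcwd()
    if os.path.basename(cwd) == notebooks_folder:
        os.chdir(os.path.dirname(cwd))
    return os.getcwd()
```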

In [4]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


In [5]:
current_dir = os.getcwd()
current_dir

'/workspaces/pizza-not-pizza'

## Install Kaggle

In [6]:
!pip install kaggle==1.5.12

Collecting kaggle==1.5.12
  Downloading kaggle-1.5.12.tar.gz (58 kB)
  Preparing metadata (setup.py) ... done
Collecting tqdm (from kaggle==1.5.12)
  Downloading tqdm-4.66.1-py3-none-any.whl.metadata (57 kB)
Collecting python-slugify (from kaggle==1.5.12)
  Downloading python_slugify-8.0.1-py2.py3-none-any.whl (9.7 kB)
Collecting text-unidecode>=1.3 (from python-slugify->kaggle==1.5.12)
  Downloading text_unidecode-1.3-py2.py3-none-any.whl (78 kB)
Downloading tqdm-4.66.1-py3-none-any.whl (78 kB)

### Set the Kaggle configuration directory to the current working directory and restrict the permissions of the Kaggle authentication JSON:

In [7]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
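
The same setup could be done in pure Python, with an explicit check that the token file is actually present before the download is attempted. This is a sketch under assumptions: the helper name `prepare_kaggle_token` is hypothetical, and it assumes a POSIX system where `os.chmod` with owner read/write bits behaves like `chmod 600`.

```python
import os
import stat


def prepare_kaggle_token(config_dir):
    """Check that kaggle.json exists, restrict its permissions, and set KAGGLE_CONFIG_DIR."""
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"kaggle.json not found in {config_dir}")
    # Owner read/write only, equivalent to `chmod 600` on POSIX systems
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    os.environ["KAGGLE_CONFIG_DIR"] = config_dir
    return token_path
```

Calling `prepare_kaggle_token(os.getcwd())` would mirror the cell above, and fails early with a clear message if the token has not been uploaded.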

## Set Kaggle Dataset and Download it:

In [8]:
KaggleDatasetPath = "carlosrunner/pizza-not-pizza"
DestinationFolder = "inputs/carlosrunner/pizza-not-pizza"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading pizza-not-pizza.zip to inputs/carlosrunner/pizza-not-pizza
 96%|█████████████████████████████████████▌ | 97.0M/101M [00:02<00:00, 53.0MB/s]
100%|████████████████████████████████████████| 101M/101M [00:02<00:00, 40.3MB/s]


In [9]:
import zipfile
with zipfile.ZipFile(DestinationFolder + '/pizza-not-pizza.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(DestinationFolder + '/pizza-not-pizza.zip')
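
If the download is interrupted, the archive may be missing or incomplete. A more defensive version of the extraction step might look like the sketch below, which validates the archive before extracting and deleting it. The function name `extract_and_remove` is hypothetical; `zipfile.is_zipfile` and `ZipFile.testzip` are standard-library calls.

```python
import os
import zipfile


def extract_and_remove(zip_path, destination):
    """Extract zip_path into destination, then delete the archive."""
    # is_zipfile returns False for a missing file as well as for a non-zip file
    if not zipfile.is_zipfile(zip_path):
        raise ValueError(f"{zip_path} is missing or is not a valid zip archive")
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        # testzip returns the name of the first corrupt member, or None
        bad_member = zip_ref.testzip()
        if bad_member is not None:
            raise ValueError(f"Corrupt member in archive: {bad_member}")
        zip_ref.extractall(destination)
    os.remove(zip_path)
```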

---

# Data Preparation

---

## Data Cleaning

### Check and remove non-image files

The Kaggle download includes a file called "food101_subset.py", which is unnecessary for this project. Pause here and delete that file from your repository's directory before continuing. With this extra file deleted, the remaining image data needs no imputation and can be split into Train, Validation, and Test sets as is:
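
As an alternative to deleting the file by hand, the removal could be scripted. The sketch below is an assumption-laden example: the helper name `remove_stray_file` is hypothetical, and the folder holding the file must be passed in, since its exact location inside the download is not stated here.

```python
import os


def remove_stray_file(data_dir, file_name="food101_subset.py"):
    """Delete file_name from data_dir if present; return True if a file was removed."""
    path = os.path.join(data_dir, file_name)
    if os.path.isfile(path):
        os.remove(path)
        return True
    return False
```

The function returns `False` rather than raising when the file is absent, so the cell can be re-run safely.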

In [10]:
def remove_non_image_files(my_data_dir):
    # Valid image file extensions (str.endswith accepts a tuple)
    image_extension = ('.png', '.jpg', '.jpeg')

    # Iterate through each entry in the specified directory
    for folder in os.listdir(my_data_dir):
        folder_path = os.path.join(my_data_dir, folder)
        # Skip anything that is not a class folder (e.g. a stray file)
        if not os.path.isdir(folder_path):
            continue

        non_image_count = 0
        image_count = 0

        # Iterate through each file in the folder
        for given_file in os.listdir(folder_path):
            if not given_file.lower().endswith(image_extension):
                # Remove the non-image file and count it
                os.remove(os.path.join(folder_path, given_file))
                non_image_count += 1
            else:
                image_count += 1

        # Print the counts of image and non-image files for the current folder
        print(f"Folder: {folder} - has image file", image_count)
        print(f"Folder: {folder} - has non-image file", non_image_count)

In [12]:
remove_non_image_files(my_data_dir='inputs/carlosrunner/pizza-not-pizza/pizza_not_pizza')

Folder: pizza - has image file 983
Folder: pizza - has non-image file 0
Folder: not_pizza - has image file 983
Folder: not_pizza - has non-image file 0
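
The check above relies on file extensions only, so a truncated or mislabeled file with a `.jpg` name would still pass. One lightweight way to flag such files, using only the standard library, is to compare the first bytes of each file against the well-known JPEG and PNG signatures. This is a sketch with hypothetical helper names (`has_image_signature`, `find_suspect_images`); it does not decode the images, so it will not catch every kind of corruption.

```python
import os

# Leading bytes ("magic numbers") of the two formats accepted above
IMAGE_SIGNATURES = (
    b"\xff\xd8\xff",        # JPEG
    b"\x89PNG\r\n\x1a\n",   # PNG
)


def has_image_signature(path):
    """Return True if the file starts with a JPEG or PNG signature."""
    with open(path, "rb") as fh:
        header = fh.read(8)
    return header.startswith(IMAGE_SIGNATURES)


def find_suspect_images(data_dir):
    """Return a sorted list of files under data_dir that lack a known image signature."""
    suspects = []
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            path = os.path.join(root, name)
            if not has_image_signature(path):
                suspects.append(path)
    return sorted(suspects)
```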


## Split into train, validation, and test sets

In [13]:
import os
import shutil
import random
import joblib


def split_train_validation_test_images(my_data_dir, train_set_ratio, validation_set_ratio, test_set_ratio):

    # Compare with a small tolerance; exact equality on a sum of floats is fragile
    if abs(train_set_ratio + validation_set_ratio + test_set_ratio - 1.0) > 1e-9:
        print("train_set_ratio + validation_set_ratio + test_set_ratio should sum to 1.0")
        return

    # gets classes labels
    labels = os.listdir(my_data_dir)  # it should get only the folder name
    if 'test' in labels:
        pass
    else:
        # create train, validation, and test folders, each with class-label sub-folders
        for folder in ['train', 'validation', 'test']:
            for label in labels:
                os.makedirs(name=my_data_dir + '/' + folder + '/' + label)

        for label in labels:

            files = os.listdir(my_data_dir + '/' + label)
            random.shuffle(files)

            train_set_files_qty = int(len(files) * train_set_ratio)
            validation_set_files_qty = int(len(files) * validation_set_ratio)

            count = 1
            for file_name in files:
                if count <= train_set_files_qty:
                    # move a given file to the train set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/train/' + label + '/' + file_name)

                elif count <= (train_set_files_qty + validation_set_files_qty):
                    # move a given file to the validation set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/validation/' + label + '/' + file_name)

                else:
                    # move given file to test set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/test/' + label + '/' + file_name)

                count += 1

            os.rmdir(my_data_dir + '/' + label)
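
Two details of the function above are worth illustrating. The train and validation counts are truncated with `int()`, so the test set receives whatever remains. The shuffle is also unseeded, so each run produces a different split. The sketch below reproduces the size arithmetic and shows one way to make the shuffle repeatable. The helper names `split_sizes` and `shuffled_copy` and the seed value are hypothetical, not part of the original notebook.

```python
import random


def split_sizes(n_files, train_ratio, validation_ratio):
    """Mirror the size arithmetic of the split function: test gets the remainder."""
    n_train = int(n_files * train_ratio)
    n_validation = int(n_files * validation_ratio)
    n_test = n_files - n_train - n_validation
    return n_train, n_validation, n_test


def shuffled_copy(names, seed=42):
    """Return a shuffled copy of names that is identical for a given seed."""
    rng = random.Random(seed)   # local generator; leaves the global random state alone
    shuffled = sorted(names)    # sort first so the result does not depend on listing order
    rng.shuffle(shuffled)
    return shuffled


print(split_sizes(983, 0.7, 0.1))
```

For the 983 images per class reported earlier, ratios of 0.7 and 0.1 give 688 training, 98 validation, and 197 test images per class.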

In [14]:
split_train_validation_test_images(my_data_dir="inputs/carlosrunner/pizza-not-pizza/pizza_not_pizza",
                                   train_set_ratio=0.7,
                                   validation_set_ratio=0.1,
                                   test_set_ratio=0.2
                                   )
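
The split cell prints nothing on success, so a quick count of the files in each new folder can confirm that it ran as intended. This is a minimal sketch with a hypothetical helper name (`summarize_split`); it assumes the `train`, `validation`, and `test` folder names created by the function above.

```python
import os


def summarize_split(data_dir, splits=("train", "validation", "test")):
    """Return {split: {label: file_count}} for the split folders found under data_dir."""
    summary = {}
    for split in splits:
        split_dir = os.path.join(data_dir, split)
        if not os.path.isdir(split_dir):
            continue  # split folder not created (yet)
        summary[split] = {
            label: len(os.listdir(os.path.join(split_dir, label)))
            for label in sorted(os.listdir(split_dir))
            if os.path.isdir(os.path.join(split_dir, label))
        }
    return summary


# Assumed dataset location; only runs if the folder exists
data_dir = "inputs/carlosrunner/pizza-not-pizza/pizza_not_pizza"
if os.path.isdir(data_dir):
    for split, counts in summarize_split(data_dir).items():
        print(split, counts)
```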

---