# **Data Collection**

## Objectives

* Fetch data from Kaggle and prepare it for further processing.

## Inputs

* Kaggle JSON file - authentication token.

## Outputs

* Generate Dataset: inputs/carlosrunner/pizza-not-pizza




---

## Import packages

In [1]:
import numpy
import os

## Change working directory

In [2]:
current_dir = os.getcwd()
current_dir

'/workspaces/data-analytics/jupyter_notebooks'

We want to make the parent of the current directory the new current directory:
* os.path.dirname() returns the parent directory
* os.chdir() sets the new current directory

In [3]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


In [4]:
current_dir = os.getcwd()
current_dir

'/workspaces/data-analytics'

## Install Kaggle

In [5]:
!pip install kaggle==1.5.12



### Change the Kaggle configuration directory to the current working directory and set the permissions of the Kaggle authentication JSON:

In [6]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
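
The `chmod 600` step restricts the token file to its owner. The same step can be sketched in plain Python as below. `secure_kaggle_token` is a hypothetical helper, not part of the Kaggle package, and the sketch assumes `kaggle.json` sits in the given directory on a POSIX file system.

```python
import os
import stat


def secure_kaggle_token(config_dir):
    """Point Kaggle at config_dir and restrict kaggle.json to its owner.

    Hypothetical helper for illustration; assumes a POSIX file system.
    """
    token_path = os.path.join(config_dir, 'kaggle.json')
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"No kaggle.json found in {config_dir}")
    os.environ['KAGGLE_CONFIG_DIR'] = config_dir
    # 0o600: read and write for the owner, no access for group or others
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return token_path
```

Calling `secure_kaggle_token(os.getcwd())` would fail early with a clear message if the token has not been uploaded yet, instead of failing later during the download.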

## Set Kaggle Dataset and Download it:

In [7]:
KaggleDatasetPath = "carlosrunner/pizza-not-pizza"
DestinationFolder = "inputs/carlosrunner/pizza-not-pizza"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

pizza-not-pizza.zip: Skipping, found more recently modified local copy (use --force to force download)


In [15]:
import zipfile
with zipfile.ZipFile(DestinationFolder + '/pizza-not-pizza.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(DestinationFolder + '/pizza-not-pizza.zip')
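
Because the cell above deletes the archive after extracting it, running the cell a second time raises `FileNotFoundError`. The sketch below guards against that. `extract_and_remove` is a hypothetical helper name, and the archive path in the usage note is the one assumed by the cell above.

```python
import os
import zipfile


def extract_and_remove(zip_path, destination):
    """Extract zip_path into destination, then delete the archive.

    Returns True if the archive was extracted, False if it was not found
    (for example because an earlier run already removed it).
    """
    if not os.path.isfile(zip_path):
        return False
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(destination)
    os.remove(zip_path)
    return True
```

With the variables defined earlier, the call would be `extract_and_remove(DestinationFolder + '/pizza-not-pizza.zip', DestinationFolder)`.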

---

# Data Preparation

---

## Data Cleaning

### Check and remove non-image files

In [16]:
def remove_non_image_file(my_data_dir):
    """Delete files without an image extension and report per-folder counts."""
    image_extension = ('.png', '.jpg', '.jpeg')
    folders = os.listdir(my_data_dir)
    for folder in folders:
        files = os.listdir(my_data_dir + '/' + folder)
        image_count = 0
        non_image_count = 0
        for given_file in files:
            if not given_file.lower().endswith(image_extension):
                file_location = my_data_dir + '/' + folder + '/' + given_file
                os.remove(file_location)  # remove non-image file
                non_image_count += 1
            else:
                image_count += 1
        print(f"Folder: {folder} - has image file", image_count)
        print(f"Folder: {folder} - has non-image file", non_image_count)

In [20]:
remove_non_image_file(my_data_dir='inputs/carlosrunner/pizza-not-pizza/pizza_not_pizza')

Folder: pizza - has image file 983
Folder: pizza - has non-image file 0
Folder: not_pizza - has image file 983
Folder: not_pizza - has non-image file 0


As the check above shows, both class folders contain only image files (983 each), so there is no explicitly missing or incorrectly formatted data. Therefore no imputation is needed, and the dataset can be split into train, validation and test sets as is.
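
The check above only looks at file extensions. As an optional extra, file contents can be spot-checked against the standard PNG and JPEG file signatures. This is a sketch rather than part of the original workflow: `find_mismatched_images` is a hypothetical helper, and it assumes the same layout of one sub-folder per class.

```python
import os

PNG_SIGNATURE = b'\x89PNG\r\n\x1a\n'  # first 8 bytes of a PNG file
JPEG_SIGNATURE = b'\xff\xd8\xff'       # first 3 bytes of a JPEG file


def find_mismatched_images(my_data_dir):
    """Return paths whose first bytes match neither the PNG nor the JPEG signature."""
    mismatched = []
    for folder in sorted(os.listdir(my_data_dir)):
        folder_path = os.path.join(my_data_dir, folder)
        if not os.path.isdir(folder_path):
            continue
        for file_name in sorted(os.listdir(folder_path)):
            file_path = os.path.join(folder_path, file_name)
            if not os.path.isfile(file_path):
                continue
            with open(file_path, 'rb') as f:
                header = f.read(8)
            if not header.startswith((PNG_SIGNATURE, JPEG_SIGNATURE)):
                mismatched.append(file_path)
    return mismatched
```

An empty list means every file at least starts like a PNG or JPEG. It does not prove that the whole image decodes correctly.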

## Split into train, validation and test sets

In [21]:
import os
import shutil
import random
import joblib


def split_train_validation_test_images(my_data_dir, train_set_ratio, validation_set_ratio, test_set_ratio):

    # compare with a tolerance: exact float equality can fail for valid ratios
    if abs(train_set_ratio + validation_set_ratio + test_set_ratio - 1.0) > 1e-9:
        print("train_set_ratio + validation_set_ratio + test_set_ratio should sum to 1.0")
        return

    # get the class labels (the data directory should contain only class folders)
    labels = os.listdir(my_data_dir)
    if 'test' in labels:
        # the split folders already exist, so do nothing
        pass
    else:
        # create train, validation and test folders, each with one sub-folder per class label
        for folder in ['train', 'validation', 'test']:
            for label in labels:
                os.makedirs(name=my_data_dir + '/' + folder + '/' + label)

        for label in labels:

            files = os.listdir(my_data_dir + '/' + label)
            random.shuffle(files)

            train_set_files_qty = int(len(files) * train_set_ratio)
            validation_set_files_qty = int(len(files) * validation_set_ratio)

            count = 1
            for file_name in files:
                if count <= train_set_files_qty:
                    # move a given file to the train set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/train/' + label + '/' + file_name)

                elif count <= (train_set_files_qty + validation_set_files_qty):
                    # move a given file to the validation set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/validation/' + label + '/' + file_name)

                else:
                    # move given file to test set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/test/' + label + '/' + file_name)

                count += 1

            os.rmdir(my_data_dir + '/' + label)
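
`random.shuffle` produces a different split on every run, and `os.listdir` returns names in arbitrary order. If a repeatable split is wanted, one option is to sort the names first and shuffle with a seeded generator. A minimal sketch follows. `shuffled_file_names` and the seed value are illustrative assumptions, not part of the function above.

```python
import os
import random


def shuffled_file_names(folder, seed=42):
    """Return the file names in folder in a repeatable pseudo-random order."""
    files = sorted(os.listdir(folder))  # fixed starting order
    random.Random(seed).shuffle(files)  # private generator, global random state untouched
    return files
```

Using a private `random.Random` instance keeps the seed from affecting any other code that relies on the global `random` module.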

In [23]:
split_train_validation_test_images(my_data_dir="inputs/carlosrunner/pizza-not-pizza/pizza_not_pizza",
                                   train_set_ratio=0.7,
                                   validation_set_ratio=0.1,
                                   test_set_ratio=0.2
                                   )
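
The function truncates the train and validation counts with `int()` and sends every remaining file to the test set. The sketch below mirrors that arithmetic for the 983 images per class reported earlier, and counts the files actually present after the split. Both helper names are hypothetical.

```python
import os


def expected_split_sizes(n_files, train_set_ratio, validation_set_ratio):
    """Mirror the truncation used by split_train_validation_test_images."""
    train = int(n_files * train_set_ratio)
    validation = int(n_files * validation_set_ratio)
    test = n_files - train - validation  # the remainder goes to the test set
    return train, validation, test


def count_split_files(my_data_dir):
    """Return {split: {label: file count}} for the folders created by the split."""
    counts = {}
    for split in ('train', 'validation', 'test'):
        split_dir = os.path.join(my_data_dir, split)
        counts[split] = {label: len(os.listdir(os.path.join(split_dir, label)))
                         for label in sorted(os.listdir(split_dir))}
    return counts


print(expected_split_sizes(983, 0.7, 0.1))  # (688, 98, 197)
```

Comparing `count_split_files("inputs/carlosrunner/pizza-not-pizza/pizza_not_pizza")` with these numbers is a quick way to confirm that the split ran as intended.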

---

# Push files to Repo


In [25]:
import os
try:
    os.makedirs(name='outputs/datasets/cleaned')
except Exception as e:
    print(e)
