# **Data Collection**

## Objectives

* Import packages
* Set the working directory
* Fetch the data from Kaggle
* Clean the data
* Split the data

## Inputs

Kaggle JSON file - authentication token

## Outputs

When the dataset is downloaded from Kaggle, it will be organized into the following structure:

* ── inputs
* 		└──vehicle_dataset
* 		      └──vehicle
* 					├── test
* 					│	├── non-vehicles
* 					│	└── vehicles
* 					├── train
* 					│	├── non-vehicles
* 					│	└── vehicles
* 					└── validation
* 							├── non-vehicles
* 							└── vehicles

---

# Import Packages

In [1]:
%load_ext pycodestyle_magic

In [2]:
%pycodestyle_on

## Change Working Directory

Change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [3]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\theph\\source\\repos\\Thephelpster\\CI_PP5_VD\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [4]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [5]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\theph\\source\\repos\\Thephelpster\\CI_PP5_VD'
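Re-running the directory change moves one level further up each time. A minimal sketch of an idempotent variant, assuming the notebooks live in a folder named jupyter_notebooks (as in the path printed earlier); the helper name ensure_project_root is hypothetical and not part of the notebook:

```python
import os
from pathlib import Path


def ensure_project_root(notebook_dir_name="jupyter_notebooks"):
    """Move to the parent folder only when cwd is the notebooks folder.

    The default folder name is an assumption taken from the path shown
    in the notebook output.
    """
    cwd = Path.cwd()
    if cwd.name == notebook_dir_name:
        os.chdir(cwd.parent)
    return Path.cwd()


print(ensure_project_root())
```

Calling it a second time leaves the working directory unchanged, because the current folder is no longer named jupyter_notebooks.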

## Kaggle

Install Kaggle

In [6]:
!pip install kaggle

Set the KAGGLE_CONFIG_DIR environment variable to the current working directory so that the Kaggle package can locate the JSON file, and set the file's permissions to 600 (owner read and write only)

In [7]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json

'chmod' is not recognized as an internal or external command,
operable program or batch file.
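The chmod shell command is not available in the Windows command prompt, which is why the cell above fails. A minimal sketch of a portable alternative using os.chmod from the standard library; the helper name restrict_token_permissions is hypothetical. Note that on Windows os.chmod can only toggle the read-only flag, so the call succeeds there without enforcing POSIX-style permissions:

```python
import os
import stat
from pathlib import Path


def restrict_token_permissions(token_path):
    """Give the owner read/write access only (the equivalent of chmod 600).

    Returns False when the file does not exist, True otherwise.
    """
    token_path = Path(token_path)
    if not token_path.is_file():
        return False  # nothing to do: the token has not been added yet
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return True


os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
print(restrict_token_permissions(Path(os.getcwd()) / 'kaggle.json'))
```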


Set the Kaggle dataset path and the destination folder, then download the dataset

In [8]:
KaggleDatasetPath = "brsdincer/vehicle-detection-image-set"
DestinationFolder = "inputs/vehicle-detection-image-set"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}



Traceback (most recent call last):
  File "C:\Users\theph\anaconda3\envs\myenv\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\theph\anaconda3\envs\myenv\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\theph\anaconda3\envs\myenv\Scripts\kaggle.exe\__main__.py", line 4, in <module>
  File "C:\Users\theph\anaconda3\envs\myenv\lib\site-packages\kaggle\__init__.py", line 23, in <module>
    api.authenticate()
  File "C:\Users\theph\anaconda3\envs\myenv\lib\site-packages\kaggle\api\kaggle_api_extended.py", line 403, in authenticate
    raise IOError('Could not find {}. Make sure it\'s located in'
OSError: Could not find kaggle.json. Make sure it's located in c:\Users\theph\source\repos\Thephelpster\CI_PP5_VD. Or use the environment method.
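The error above means the Kaggle package found no credentials. A minimal sketch of a check to run before the download, assuming the two options the error message mentions: a kaggle.json file in the config directory, or the environment method, which to the best of my knowledge uses the KAGGLE_USERNAME and KAGGLE_KEY variables. The helper name kaggle_credentials_source is hypothetical:

```python
import os
from pathlib import Path


def kaggle_credentials_source(config_dir=None, env=None):
    """Return where Kaggle credentials would come from, or None."""
    env = os.environ if env is None else env
    # Environment method: both variables must be set.
    if env.get('KAGGLE_USERNAME') and env.get('KAGGLE_KEY'):
        return 'environment'
    # File method: kaggle.json inside the config directory.
    if config_dir is None:
        config_dir = env.get('KAGGLE_CONFIG_DIR') or os.getcwd()
    if (Path(config_dir) / 'kaggle.json').is_file():
        return 'kaggle.json'
    return None


source = kaggle_credentials_source()
if source is None:
    print("No Kaggle credentials found; add kaggle.json before downloading")
else:
    print(f"Kaggle credentials found via: {source}")
```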


Unzip the downloaded archive, then delete the zip file

In [9]:
import zipfile
with zipfile.ZipFile(DestinationFolder + '/vehicle-detection-image-set.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(DestinationFolder + '/vehicle-detection-image-set.zip')

FileNotFoundError: [Errno 2] No such file or directory: 'inputs/vehicle-detection-image-set/vehicle-detection-image-set.zip'

2:80: E501 line too long (93 > 79 characters)
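The unzip cell raises FileNotFoundError when the download did not complete. A minimal sketch of a guarded version that reports a missing archive instead of raising; the helper name extract_and_remove is hypothetical, while the folder and file names are the ones used in the cell above:

```python
import zipfile
from pathlib import Path


def extract_and_remove(zip_path, destination):
    """Extract zip_path into destination, then delete the archive.

    Returns False, leaving everything untouched, if the archive is missing.
    """
    zip_path = Path(zip_path)
    if not zip_path.is_file():
        print(f"Archive not found: {zip_path}")
        return False
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(destination)
    zip_path.unlink()
    return True


destination = Path('inputs/vehicle-detection-image-set')
extract_and_remove(destination / 'vehicle-detection-image-set.zip',
                   destination)
```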


---

# Data Preparation

## Data Cleaning

Check and remove all non-image files

In [10]:
def remove_non_image_file(my_data_dir):
    image_extension = ('.png', '.jpg', '.jpeg')
    folders = os.listdir(my_data_dir)
    for folder in folders:
        files = os.listdir(my_data_dir + '/' + folder)
        image_count = 0      # image files that are kept
        non_image_count = 0  # non-image files that are removed
        for given_file in files:
            if not given_file.lower().endswith(image_extension):
                file_location = my_data_dir + '/' + folder + '/' + given_file
                os.remove(file_location)
                non_image_count += 1
            else:
                image_count += 1
        print(f"Folder: {folder} - has image file", image_count)
        print(f"Folder: {folder} - has non-image file", non_image_count)

In [None]:
remove_non_image_file(my_data_dir='inputs/vehicle-detection-image-set/data')

FileNotFoundError: [WinError 3] The system cannot find the path specified: 'inputs/vehicle-detection-image-set/data'
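Because the cleaning step deletes files, it can help to preview what would be removed first. A minimal sketch of a dry run that only lists non-image files, assuming the same one-folder-per-label layout and the same three extensions; the helper name find_non_image_files is hypothetical:

```python
from pathlib import Path

IMAGE_EXTENSIONS = ('.png', '.jpg', '.jpeg')


def find_non_image_files(my_data_dir):
    """List files in each label folder whose extension is not an image one.

    Nothing is deleted; a missing data folder yields an empty list.
    """
    data_dir = Path(my_data_dir)
    if not data_dir.is_dir():
        return []
    return sorted(
        path
        for folder in data_dir.iterdir() if folder.is_dir()
        for path in folder.iterdir()
        if path.is_file() and path.suffix.lower() not in IMAGE_EXTENSIONS
    )


for path in find_non_image_files('inputs/vehicle-detection-image-set/data'):
    print(path)
```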

## Split train, validation and test sets

In [15]:
import os
import shutil
import random
import joblib


def split_train_validation_test_images(my_data_dir, train_set_ratio, validation_set_ratio, test_set_ratio):

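    total_ratio = train_set_ratio + validation_set_ratio + test_set_ratio
    # Compare with a tolerance: float sums such as 0.7 + 0.2 + 0.1 are
    # not exactly 1.0
    if abs(total_ratio - 1.0) > 1e-9:
        print("train_set_ratio + validation_set_ratio + test_set_ratio "
              "should sum to 1.0")
        return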

    labels = os.listdir(my_data_dir)
    if 'test' in labels:
        pass
    else:
        for folder in ['train', 'validation', 'test']:
            for label in labels:
                os.makedirs(name=my_data_dir + '/' + folder + '/' + label)

        for label in labels:

            files = os.listdir(my_data_dir + '/' + label)
            random.shuffle(files)

            train_set_files_qty = int(len(files) * train_set_ratio)
            validation_set_files_qty = int(len(files) * validation_set_ratio)

            count = 1
            for file_name in files:
                if count <= train_set_files_qty:
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/train/' + label + '/' + file_name)

                elif count <= (train_set_files_qty + validation_set_files_qty):
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/validation/' + label + '/' + file_name)

                else:
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/test/' + label + '/' + file_name)

                count += 1

            os.rmdir(my_data_dir + '/' + label)

7:80: E501 line too long (107 > 79 characters)
10:80: E501 line too long (90 > 79 characters)
33:80: E501 line too long (82 > 79 characters)
37:80: E501 line too long (87 > 79 characters)
41:80: E501 line too long (81 > 79 characters)
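The ratios are floats, and whether their sum is exactly 1.0 depends on the values and the order in which they are added, so a ratio check is more robust with a tolerance. A minimal sketch using math.isclose from the standard library; the helper name ratios_sum_to_one is hypothetical and not part of the notebook:

```python
import math


def ratios_sum_to_one(*ratios):
    """Check that the ratios add up to 1.0, allowing for rounding error."""
    return math.isclose(sum(ratios), 1.0, abs_tol=1e-9)


print(0.7 + 0.2 + 0.1 == 1.0)            # False: floating-point rounding
print(ratios_sum_to_one(0.7, 0.2, 0.1))  # True
print(ratios_sum_to_one(0.7, 0.1, 0.3))  # False
```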


* The training set receives 70% of the data (ratio 0.70).
* The validation set receives 10% of the data (ratio 0.10).
* The test set receives the remaining 20% of the data (ratio 0.20).
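The split function truncates the train and validation counts with int() and sends every remaining file to the test set, so the test share can be slightly larger than its ratio when the file count does not divide evenly. A small worked sketch of that arithmetic; the helper name split_counts is hypothetical:

```python
def split_counts(n_files, train_set_ratio, validation_set_ratio):
    """Return (train, validation, test) counts for n_files images.

    Mirrors the int() truncation used in the split function: the train and
    validation counts are rounded down and the test set takes the rest.
    """
    train_qty = int(n_files * train_set_ratio)
    validation_qty = int(n_files * validation_set_ratio)
    test_qty = n_files - train_qty - validation_qty
    return train_qty, validation_qty, test_qty


print(split_counts(1000, 0.7, 0.1))  # (700, 100, 200)
print(split_counts(17, 0.7, 0.1))    # (11, 1, 5)
```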

In [16]:
split_train_validation_test_images(my_data_dir="inputs/vehicle-detection-image-set/data",
                                   train_set_ratio=0.7,
                                   validation_set_ratio=0.1,
                                   test_set_ratio=0.2
                                   )

1:80: E501 line too long (90 > 79 characters)
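To confirm that the split produced the expected train, validation and test folders, the files in each one can be counted. A minimal sketch, assuming the data folder path used in the call above; the helper name count_split_files is hypothetical:

```python
from pathlib import Path


def count_split_files(my_data_dir):
    """Return {(split, label): file count} for the split folders that exist."""
    counts = {}
    for split in ('train', 'validation', 'test'):
        split_dir = Path(my_data_dir) / split
        if not split_dir.is_dir():
            continue
        for label_dir in sorted(split_dir.iterdir()):
            if label_dir.is_dir():
                n_files = sum(1 for p in label_dir.iterdir() if p.is_file())
                counts[(split, label_dir.name)] = n_files
    return counts


data_dir = 'inputs/vehicle-detection-image-set/data'
for (split, label), n_files in count_split_files(data_dir).items():
    print(f"{split}/{label}: {n_files} files")
```

If the data folder is missing, as in the failed run recorded above, the function returns an empty dictionary and nothing is printed.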


---