# **Data Collection**

## Objectives

* Fetch data from Kaggle and download raw data.
* Check whether any non-image files were downloaded and remove them.
* Split the dataset into train, validation and test sets.

## Inputs

* Kaggle JSON authentication token. 

## Outputs

* Generate the dataset in inputs/cherry_leaves_dataset/cherry-leaves, split into train, validation and test folders.



---

In [1]:
import numpy as np

# Change working directory

* We assume the notebooks are stored in a subfolder, so when running a notebook in the editor you need to change the working directory.

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd()

In [2]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/project-5/jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [3]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [4]:
current_dir = os.getcwd()
current_dir

'/workspace/project-5'
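
Running the `os.chdir` cell a second time moves up one more level, past the project root. The following is a minimal sketch of a guard against that, assuming the notebooks live in a folder named `jupyter_notebooks` as in the output above. The helper name `move_to_project_root` is hypothetical and not part of the original notebook.

```python
import os


def move_to_project_root(notebooks_dirname="jupyter_notebooks"):
    """Move to the parent folder only while still inside the notebooks folder.

    `notebooks_dirname` is an assumption taken from the path printed above.
    Returns the working directory after the (possible) change.
    """
    current_dir = os.getcwd()
    if os.path.basename(current_dir) == notebooks_dirname:
        # Still inside the notebooks subfolder: step up to the project root.
        os.chdir(os.path.dirname(current_dir))
    # Otherwise leave the working directory untouched, so re-running is safe.
    return os.getcwd()
```

Calling `move_to_project_root()` repeatedly leaves the working directory at the project root after the first call.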

# Install Kaggle

Install the Kaggle package, which provides the command-line client used below to download the dataset.

In [5]:
%pip install kaggle

Collecting kaggle
  Downloading kaggle-1.5.16.tar.gz (83 kB)
  Preparing metadata (setup.py) ... done
Collecting tqdm
  Downloading tqdm-4.65.0-py3-none-any.whl (77 kB)
Collecting python-slugify
  Downloading python_slugify-8.0.1-py2.py3-none-any.whl (9.7 kB)
Collecting text-unidecode>=1.3
  Downloading text_unidecode-1.3-py2.py3-none-any.whl (78 kB)
Building wheels for collected packages: kaggle
  Building wheel for kaggle (setup.py) ... done
  Created wheel for kaggle: filename=kaggle-1.5.16-py3-none-any.whl size=110685 sha256=0399076122892856e44f0d0a4b41764387aef4d801ab92d5d756e27e11b17f05
  Stored 

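Set the Kaggle configuration directory to the current working directory, where kaggle.json is stored, and restrict the token file's permissions to its owner.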

---

In [6]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
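
If `kaggle.json` is missing from the working directory, the `chmod` command and the download that follows will fail. As a rough sketch, the token could be checked from Python before continuing. The helper name `check_kaggle_token` is hypothetical. The check assumes the token is a JSON object with `username` and `key` fields, which is the layout of the file Kaggle issues.

```python
import json
import os
import stat
from pathlib import Path


def check_kaggle_token(config_dir):
    """Return the path to kaggle.json after basic sanity checks."""
    token_path = Path(config_dir) / "kaggle.json"
    if not token_path.is_file():
        raise FileNotFoundError(f"Expected a Kaggle token at {token_path}")

    # Assumption: the token is a JSON object holding "username" and "key".
    with open(token_path) as token_file:
        token = json.load(token_file)
    missing = {"username", "key"} - set(token)
    if missing:
        raise ValueError(f"kaggle.json is missing fields: {sorted(missing)}")

    # Same intent as `chmod 600`: read and write for the owner only.
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return token_path
```

In this notebook it would be called as `check_kaggle_token(os.getcwd())`.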

Get data from Kaggle

In [7]:
KaggleDatasetPath = "codeinstitute/cherry-leaves"
DestinationFolder = "inputs/cherry_leaves_dataset"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading cherry-leaves.zip to inputs/cherry_leaves_dataset
 95%|███████████████████████████████████▉  | 52.0M/55.0M [00:02<00:00, 28.9MB/s]
100%|██████████████████████████████████████| 55.0M/55.0M [00:02<00:00, 25.8MB/s]


Unzip folder

In [8]:
import zipfile
with zipfile.ZipFile(DestinationFolder + '/cherry-leaves.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(DestinationFolder + '/cherry-leaves.zip')
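
Because the archive is deleted right after extraction, it may be worth confirming that the download is intact before unzipping. The sketch below is one way to do that with the standard `zipfile` module. The helper name `zip_is_intact` is hypothetical.

```python
import zipfile


def zip_is_intact(zip_path):
    """Return True if zip_path is a ZIP archive whose members pass their CRC checks."""
    if not zipfile.is_zipfile(zip_path):
        return False
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        # testzip() returns the name of the first corrupt member, or None.
        return zip_ref.testzip() is None
```

A call such as `zip_is_intact(DestinationFolder + '/cherry-leaves.zip')` would have to run before the extraction cell, since that cell removes the archive.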

# Data Cleaning

Check for non-image files and remove them.

In [9]:
def remove_non_image_files(my_data_dir):
    """Delete files without an image extension from each class folder and report counts."""
    image_extension = ('.png', '.jpg', '.jpeg')
    folders = os.listdir(my_data_dir)
    for folder in folders:
        folder_path = os.path.join(my_data_dir, folder)
        image_count = 0
        non_image_count = 0
        for given_file in os.listdir(folder_path):
            if not given_file.lower().endswith(image_extension):
                os.remove(os.path.join(folder_path, given_file))  # removes non-image file
                non_image_count += 1
            else:
                image_count += 1
        print(f"Folder: {folder} - has image file {image_count}")
        print(f"Folder: {folder} - has non-image file {non_image_count}")

remove_non_image_files(my_data_dir='inputs/cherry_leaves_dataset/cherry-leaves')
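
Since the function above deletes files permanently, a dry run that only lists the candidates can be a useful first step. The following is a minimal sketch using `pathlib`. The helper name `list_non_image_files` is hypothetical, and it matches on the file suffix rather than `str.endswith`, so dot-only names such as `.png` are treated as non-images here.

```python
from pathlib import Path

# Same extensions as in remove_non_image_files above.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}


def list_non_image_files(data_dir):
    """Return a sorted list of files under data_dir that lack an image extension."""
    return sorted(
        path
        for path in Path(data_dir).rglob("*")
        if path.is_file() and path.suffix.lower() not in IMAGE_EXTENSIONS
    )
```

Nothing is deleted; the returned paths can be reviewed before calling `remove_non_image_files`.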

## Split data into train, validation and test sets

In [10]:
import math
import shutil
import random
import joblib

def train_test_val_split(data_dir, train_set_ratio, val_set_ratio, test_set_ratio):

    # compare with a tolerance: a sum of float ratios can differ from 1.0 by rounding error
    if not math.isclose(train_set_ratio + val_set_ratio + test_set_ratio, 1.0):
        print("Sum of train_set_ratio, val_set_ratio and test_set_ratio should amount to 1.0")
        return
        
    # get labels/classes
    labels = os.listdir(data_dir)
    if 'test' in labels:
        pass
    else:
        # create train, validation and test folder with class labels as subfolder names
        for folder in ['train', 'validation', 'test']:
            for label in labels:
                os.makedirs(name=data_dir + '/' + folder + '/' + label)
                
        for label in labels:

            files = os.listdir(data_dir + '/' + label)
            random.shuffle(files)
            
            train_set_qty = int(len(files) * train_set_ratio)
            val_set_qty = int(len(files) * val_set_ratio)
            
            count = 1
            
            for file_name in files:
            
                if count <= train_set_qty:
                    # move given file to train set
                    shutil.move(data_dir + '/' + label + '/' + file_name,
                                data_dir + '/train/' + label + '/' + file_name)
                
                elif count <= (sum([train_set_qty, val_set_qty])):
                    # move given file to validation set 
                    shutil.move(data_dir + '/' + label + '/' + file_name,
                                data_dir + '/validation/' + label + '/' + file_name)
                else:
                    # move remaining files to test
                    shutil.move(data_dir + '/' + label + '/' + file_name,
                                data_dir + '/test/' + label + '/' + file_name)
                
                count += 1
                
            os.rmdir(data_dir + '/' + label)

    

In [11]:
train_test_val_split(data_dir="inputs/cherry_leaves_dataset/cherry-leaves", train_set_ratio=0.7,
                    val_set_ratio=0.1, test_set_ratio=0.2)
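
The split function shuffles with the module-level `random.shuffle` and no seed, so each fresh run assigns different files to each set. If a repeatable split is wanted, one option is a dedicated, seeded generator. This is only a sketch: the helper name `shuffled_copy` and the seed value are illustrative assumptions, not part of the original notebook.

```python
import random


def shuffled_copy(file_names, seed=42):
    """Return file_names in a shuffled order that is repeatable for a given seed."""
    # os.listdir() returns names in arbitrary order, so sort before shuffling.
    files = sorted(file_names)
    # A private Random instance avoids changing the global random state.
    rng = random.Random(seed)
    rng.shuffle(files)
    return files
```

Inside `train_test_val_split`, `files = shuffled_copy(os.listdir(...))` could take the place of the `random.shuffle(files)` call.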

In [12]:
# check that the number of desired files are in the correct directory

file_count, data_split, group = [], [], []
base_dir = "inputs/cherry_leaves_dataset/cherry-leaves"

for label in ['healthy', 'powdery_mildew']:
    for folder in ['train', 'test', 'validation']:
        file_path = os.path.join(base_dir, folder, label)
        files = os.listdir(file_path)
        num_files = len(files)
        file_count.append(num_files)
        data_split.append(folder)
        group.append(label)
        print(f'* {folder} - {label}: {num_files} images')
        

* train - healthy: 1472 images
* test - healthy: 422 images
* validation - healthy: 210 images
* train - powdery_mildew: 1472 images
* test - powdery_mildew: 422 images
* validation - powdery_mildew: 210 images
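
The counts above follow from how the split function sizes each set: the train and validation quantities are truncated with `int()`, and the test set receives whatever remains. Each label has 1472 + 422 + 210 = 2104 images. The short sketch below reproduces that arithmetic; the helper name `split_counts` is hypothetical.

```python
def split_counts(n_files, train_set_ratio, val_set_ratio):
    """Mirror the sizing logic of train_test_val_split for one label."""
    train_qty = int(n_files * train_set_ratio)  # truncates towards zero
    val_qty = int(n_files * val_set_ratio)
    test_qty = n_files - train_qty - val_qty    # remainder goes to test
    return train_qty, val_qty, test_qty


print(split_counts(2104, 0.7, 0.1))  # (1472, 210, 422)
```

Because of the truncation, the test set can end up slightly larger than its nominal ratio suggests: here 422 images rather than 2104 × 0.2 = 420.8.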


---


# Push files to Repo

* If you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
# import os
# try:
#     create here your folder
#     os.makedirs(name='')
# except Exception as e:
#     print(e)
