# **Data Collection**

## Objectives

Fetch data from Kaggle and save as raw data

## Inputs

- Kaggle JSON file - authentication token 

## Outputs

- Generate and split dataset: inputs/mildew_detection/cherry_leaves

## Additional Comments

- You will need to download a JSON file from Kaggle for authentication. If you already have one, you can skip these steps.



## Get your Kaggle JSON file

1. Log into Kaggle, click on your profile picture and navigate to the **"Account"** section.
2. Navigate to the **"API"** section.
3. Click **"Create New API Token"**. Kaggle will generate the token and download it for you.
4. Find the file in your downloads folder and make sure it is called **kaggle.json**.
5. Drag and drop this file into the base directory of your project.
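
Before moving on, it can help to confirm that the token file is in place and readable. The sketch below is a minimal check, assuming the file holds the usual `username` and `key` fields; the helper name `check_kaggle_token` is hypothetical and not part of the Kaggle package.

```python
import json
from pathlib import Path


def check_kaggle_token(path="kaggle.json"):
    """Return True if the file exists and holds non-empty 'username' and 'key' fields."""
    token_file = Path(path)
    if not token_file.is_file():
        return False
    try:
        token = json.loads(token_file.read_text())
    except json.JSONDecodeError:
        return False
    if not isinstance(token, dict):
        return False
    # Assumption: the token uses the 'username' and 'key' field names.
    return all(token.get(field) for field in ("username", "key"))
```

Calling `check_kaggle_token()` from the project's base directory should return `True` once the file has been copied over.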

---

# Change working directory

We need to change the working directory from its current folder to its parent folder
- We access the current directory with os.getcwd()

In [1]:
import os

current_dir = os.getcwd()
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")
current_dir = os.getcwd()
current_dir

You set a new current directory


'/workspaces/powdery_mildew_detection'
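
Note that running the cell above a second time moves up one more level, away from the project folder. One possible guard, sketched here under the assumption that the project root contains a recognisable marker file (the `README.md` default and the helper name `find_project_root` are illustrative, not part of the original notebook), is to walk upwards until that marker is found:

```python
import os


def find_project_root(start, marker="README.md"):
    """Walk upwards from `start` until a folder containing `marker` is found.

    Returns the folder path, or None if the filesystem root is reached first.
    """
    current = os.path.abspath(start)
    while True:
        if os.path.exists(os.path.join(current, marker)):
            return current
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root
            return None
        current = parent


# Possible usage inside the notebook:
# root = find_project_root(os.getcwd())
# if root is not None:
#     os.chdir(root)
```

Because the search stops at the first folder holding the marker, re-running the cell leaves the working directory unchanged.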

# Install Kaggle

In [2]:
# install kaggle package
!pip install kaggle

Collecting kaggle
  Downloading kaggle-1.5.16.tar.gz (83 kB)
  Preparing metadata (setup.py) ... done
Collecting tqdm (from kaggle)
  Downloading tqdm-4.66.1-py3-none-any.whl.metadata (57 kB)
Collecting python-slugify (from kaggle)
  Downloading python_slugify-8.0.1-py2.py3-none-any.whl (9.7 kB)
Collecting text-unidecode>=1.3 (from python-slugify->kaggle)
  Downloading text_unidecode-1.3-py2.py3-none-any.whl (78 kB)
Downloading tqdm-4.66.1-py3-none-any.whl (78 kB)
Building w

Run the cell below to point the Kaggle configuration directory at the current working directory and to restrict the permissions of the Kaggle authentication JSON file.

In [4]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
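
`chmod 600` makes the token readable and writable by its owner only. If the shell command is not available, the same permission can be set from Python. This is a minimal sketch using the standard library; the helper name `restrict_to_owner` is hypothetical, and on Windows `os.chmod` can only toggle the read-only flag, so the result differs there.

```python
import os
import stat


def restrict_to_owner(path):
    """Set permissions to 600 (owner read/write only), like `chmod 600`.

    Returns the resulting permission bits so they can be inspected.
    """
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
    return stat.S_IMODE(os.stat(path).st_mode)


# Possible usage: restrict_to_owner("kaggle.json")
```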

### Getting the dataset
The dataset used is found [here](https://www.kaggle.com/datasets/codeinstitute/cherry-leaves)

Download the dataset to the workspace

In [5]:
KaggleDatasetPath = "codeinstitute/cherry-leaves"
DestinationFolder = "inputs/mildew_detection"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading cherry-leaves.zip to inputs/mildew_detection
 89%|█████████████████████████████████▊    | 49.0M/55.0M [00:01<00:00, 31.4MB/s]
100%|██████████████████████████████████████| 55.0M/55.0M [00:01<00:00, 36.2MB/s]


Unzip the file, and delete the zip file

In [6]:
import zipfile
with zipfile.ZipFile(DestinationFolder + '/cherry-leaves.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(DestinationFolder + '/cherry-leaves.zip')
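
If the download was interrupted, the archive may be incomplete, and the cell above would fail partway through. A slightly more defensive variant is sketched below; the helper name `extract_and_remove` is hypothetical. It checks that the file is a zip archive first and deletes it only after extraction has finished.

```python
import zipfile
from pathlib import Path


def extract_and_remove(zip_path, destination):
    """Extract `zip_path` into `destination`, then delete the archive.

    Returns the list of member names found in the archive.
    """
    zip_path = Path(zip_path)
    if not zipfile.is_zipfile(zip_path):
        raise ValueError(f"{zip_path} is not a valid zip archive")
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        names = zip_ref.namelist()
        zip_ref.extractall(destination)
    zip_path.unlink()  # reached only if extraction did not raise
    return names


# Possible usage:
# extract_and_remove(DestinationFolder + '/cherry-leaves.zip', DestinationFolder)
```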

Rename the new folder

In [7]:
os.rename("inputs/mildew_detection/cherry-leaves",
            "inputs/mildew_detection/cherry_leaves")

---

# Data Preparation

## Remove non image files

In [8]:
def remove_non_image_file(my_data_dir):
    """Delete files without an image extension and report counts per folder."""
    image_extension = ('.png', '.jpg', '.jpeg')
    folders = os.listdir(my_data_dir)
    for folder in folders:
        files = os.listdir(my_data_dir + '/' + folder)
        image_count = 0
        non_image_count = 0
        for given_file in files:
            if not given_file.lower().endswith(image_extension):
                file_location = my_data_dir + '/' + folder + '/' + given_file
                os.remove(file_location)  # remove non-image file
                non_image_count += 1
            else:
                image_count += 1
        print(f"Folder: {folder} - has image file", image_count)
        print(f"Folder: {folder} - has non-image file", non_image_count)

In [9]:
remove_non_image_file(my_data_dir='inputs/mildew_detection/cherry_leaves')

Folder: healthy - has image file 2104
Folder: healthy - has non-image file 0
Folder: powdery_mildew - has image file 2104
Folder: powdery_mildew - has non-image file 0
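
Since the function above deletes files, it can be useful to inspect a folder before running it. The sketch below is a read-only inventory that counts files per extension in each class folder; the helper name `count_extensions` is hypothetical.

```python
import os
from collections import Counter


def count_extensions(my_data_dir):
    """Return {folder: Counter of lower-cased file extensions}; nothing is deleted."""
    summary = {}
    for folder in sorted(os.listdir(my_data_dir)):
        folder_path = os.path.join(my_data_dir, folder)
        if not os.path.isdir(folder_path):
            continue  # skip stray files at the top level
        summary[folder] = Counter(
            os.path.splitext(name)[1].lower() for name in os.listdir(folder_path)
        )
    return summary


# Possible usage:
# count_extensions('inputs/mildew_detection/cherry_leaves')
```

Any extension other than `.png`, `.jpg` or `.jpeg` in the result marks files that `remove_non_image_file` would delete.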


---

# Split into train, validation and test set

In [None]:
import math
import random
import shutil

def split_train_validation_test_images(my_data_dir, train_set_ratio, validation_set_ratio, test_set_ratio):

  # compare with a tolerance: an exact float comparison can fail for ratios that do sum to 1
  if not math.isclose(train_set_ratio + validation_set_ratio + test_set_ratio, 1.0):
    print("train_set_ratio + validation_set_ratio + test_set_ratio should sum to 1.0")
    return

  # get the class labels
  labels = os.listdir(my_data_dir) # it should get only the folder name
  if 'test' in labels:
    pass
  else: 
    # create train, validation and test folders with class label sub-folders
    for folder in ['train','validation','test']:
      for label in labels:
        os.makedirs(name=my_data_dir+ '/' + folder + '/' + label)

    for label in labels:

      files = os.listdir(my_data_dir + '/' + label)
      random.shuffle(files)

      train_set_files_qty = int(len(files) * train_set_ratio)
      validation_set_files_qty = int(len(files) * validation_set_ratio)

      count = 1
      for file_name in files:
        if count <= train_set_files_qty:
          # move given file to train set
          shutil.move(my_data_dir + '/' + label + '/' + file_name,
                      my_data_dir + '/train/' + label + '/' + file_name)
          

        elif count <= (train_set_files_qty + validation_set_files_qty ):
          # move given file to validation set
          shutil.move(my_data_dir + '/' + label + '/' + file_name,
                      my_data_dir + '/validation/' + label + '/' + file_name)

        else:
          # move given file to test set
          shutil.move(my_data_dir + '/' + label + '/' + file_name,
                  my_data_dir + '/test/' +label + '/'+ file_name)
          
        count += 1

      os.rmdir(my_data_dir + '/' + label)
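
The function above shuffles with Python's global random generator, so each run produces a different split. If a reproducible split is wanted, one option is to sort the file names and shuffle them with a seeded generator. This is a sketch, not part of the original function; the helper name `shuffled_copy` and the seed value are arbitrary choices.

```python
import random


def shuffled_copy(files, seed=42):
    """Return a shuffled copy of `files` that is identical for a given seed."""
    ordered = sorted(files)  # os.listdir order is arbitrary, so sort first
    random.Random(seed).shuffle(ordered)  # local generator, global state untouched
    return ordered
```

Replacing `random.shuffle(files)` with `files = shuffled_copy(files)` would then give the same train, validation and test membership on every run.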

Following a common convention, we will split the dataset into:
- 70% training set
- 10% validation set
- 20% test set

Note: the three ratios must add up to 100%.
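
To see what these ratios mean in practice, the counting logic of the split function can be worked through for a single class. The sketch below mirrors that logic: the train and validation counts are rounded down and the test set receives the remainder. It uses the 2104 images per class reported earlier, and the helper name `split_sizes` is hypothetical.

```python
def split_sizes(n_files, train_set_ratio, validation_set_ratio):
    """Mirror the counting logic of the split function for one class folder."""
    train_qty = int(n_files * train_set_ratio)
    validation_qty = int(n_files * validation_set_ratio)
    test_qty = n_files - train_qty - validation_qty  # test set takes the remainder
    return train_qty, validation_qty, test_qty


print(split_sizes(2104, 0.7, 0.1))  # (1472, 210, 422)
```

Because of the rounding, the test set ends up slightly larger than an exact 20% share (422 rather than 420.8 images per class).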

In [None]:
split_train_validation_test_images(my_data_dir="inputs/mildew_detection/cherry_leaves",
                                   train_set_ratio=0.7,
                                   validation_set_ratio=0.1,
                                   test_set_ratio=0.2)
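
After the split has run, a quick count of the files in each new folder confirms that the images ended up where expected. The helper below is a minimal sketch with a hypothetical name, `count_split_files`; it only reads the folder structure and changes nothing.

```python
import os


def count_split_files(my_data_dir, splits=("train", "validation", "test")):
    """Return {split: {label: number of files}} for the folders created by the split."""
    counts = {}
    for split in splits:
        split_dir = os.path.join(my_data_dir, split)
        if not os.path.isdir(split_dir):
            continue  # this split folder has not been created
        counts[split] = {}
        for label in sorted(os.listdir(split_dir)):
            label_dir = os.path.join(split_dir, label)
            if os.path.isdir(label_dir):
                counts[split][label] = len(os.listdir(label_dir))
    return counts


# Possible usage:
# count_split_files("inputs/mildew_detection/cherry_leaves")
```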

---

# Push files to Repo

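We have successfully downloaded, cleaned and split the dataset into three folders (train, validation and test).
Now we will push the files to the repo to save our progress. Before doing so, check that `kaggle.json` is listed in `.gitignore`, since `git add .` would otherwise stage your API token along with the data.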

In [None]:
!git add .

In [None]:
!git commit -m "Download, clean and split data into train, validation and test set"

In [None]:
!git push