# **(ADD HERE THE NOTEBOOK NAME)**

## Objectives

* Fetch the spheroid test images from Kaggle, remove non-image files, and split the images into train, validation and test sets

## Inputs

* A Kaggle API token (`kaggle.json`) in the project folder and the Kaggle dataset `andrgndel/spheroidtestset`

## Outputs

* Image folders `inputs/train`, `inputs/validation` and `inputs/test`, each with `alive` and `dead` subfolders, plus `inputs/test/unknown`

## Additional Comments

* In case you have any additional comments that don't fit in the previous bullets, please state them here. 



---

# Import packages

In [18]:
%pip install -r /workspace/SpheroidAnalytics/requirements.txt
%pip install -U ismember  # update if needed

Note: you may need to restart the kernel to use updated packages.
Collecting ismember
  Downloading ismember-1.0.4-py3-none-any.whl.metadata (3.4 kB)
Downloading ismember-1.0.4-py3-none-any.whl (7.7 kB)
Installing collected packages: ismember
Successfully installed ismember-1.0.4
Note: you may need to restart the kernel to use updated packages.


In [19]:
import numpy as np
import os
import shutil
from ismember import ismember

# Change working directory

* The notebooks are assumed to be stored in a subfolder, so the working directory has to be changed when the notebook is run in the editor.

The working directory is changed from the notebook's folder to the project folder.
* `os.getcwd()` returns the current directory

In [3]:
current_dir = os.getcwd()
current_dir

os.chdir('/workspace/SpheroidAnalytics')
print("You set a new current directory")
current_dir = os.getcwd()
current_dir

You set a new current directory


'/workspace/SpheroidAnalytics'

We want to make the parent of the current directory the new current directory
* `os.path.dirname()` gets the parent directory
* `os.chdir()` sets the new current directory
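
The two calls listed above can be combined so that the notebook moves to its parent folder without a hard-coded path. This is a minimal sketch, assuming the notebook is started from a subfolder directly under the project folder; the helper name `move_to_parent_dir` is hypothetical and not part of the original notebook.

```python
import os


def move_to_parent_dir():
    """Make the parent of the current working directory the new working directory."""
    parent_dir = os.path.dirname(os.getcwd())  # parent of the current folder
    os.chdir(parent_dir)                       # switch to it
    return os.getcwd()                         # report where we ended up
```

Calling `move_to_parent_dir()` once at the top of the notebook would have the same effect as the absolute `os.chdir(...)` call, provided the notebook folder sits exactly one level below the project folder.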

# Install Kaggle

In [4]:
# install kaggle package
%pip install kaggle==1.5.12

Note: you may need to restart the kernel to use updated packages.


Provide your Kaggle credentials.
* You need to download your Kaggle API token (`kaggle.json`) and add it to the main project folder first!
* Don't forget to add `kaggle.json` to `.gitignore`!

In [5]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 /workspace/SpheroidAnalytics/kaggle.json
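
The token setup can also be checked from Python before any download is attempted. The sketch below assumes the token file is named `kaggle.json` and lives in the folder passed in; the helper name `check_kaggle_token` is hypothetical, and the permission change mirrors the `chmod 600` call in the cell above (POSIX systems only).

```python
import os
import stat


def check_kaggle_token(config_dir):
    """Return the token path after restricting it to owner read/write (mode 600)."""
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"No kaggle.json found in {config_dir}")
    # owner read + owner write only, the same permissions as `chmod 600`
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return token_path
```

In this notebook it could be called as `check_kaggle_token(os.getcwd())` right after `KAGGLE_CONFIG_DIR` is set, so a missing token fails early with a clear message.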

### Set Kaggle Dataset and Download

In [6]:
KaggleDatasetPath = 'andrgndel/spheroidtestset'
DestinationFolder = 'inputs/RawImages'

In [11]:
# Download Data
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder} --force

Downloading spheroidtestset.zip to inputs/RawImages
 97%|██████████████████████████████████████▋ | 131M/135M [00:02<00:00, 52.4MB/s]
100%|████████████████████████████████████████| 135M/135M [00:02<00:00, 48.2MB/s]


### Unzip files

In [12]:
import zipfile
with zipfile.ZipFile(DestinationFolder + '/spheroidtestset.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(DestinationFolder + '/spheroidtestset.zip')
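
The extract-then-delete step above can be written without hard-coding the archive name. This is a sketch only, assuming every `.zip` file in the destination folder should be unpacked in place; the helper name `extract_and_remove_zips` is hypothetical.

```python
import os
import zipfile


def extract_and_remove_zips(folder):
    """Extract every .zip archive found in `folder` into it, then delete the archive."""
    extracted = []
    for name in sorted(os.listdir(folder)):
        if name.lower().endswith(".zip"):
            zip_path = os.path.join(folder, name)
            with zipfile.ZipFile(zip_path, "r") as zip_ref:
                zip_ref.extractall(folder)
            os.remove(zip_path)  # the archive is no longer needed
            extracted.append(name)
    return extracted
```

With the variables defined earlier, `extract_and_remove_zips(DestinationFolder)` would unpack `spheroidtestset.zip` and return the list of archives it handled.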

# Data Clean-up

## Check and remove non-image files with the function `remove_non_image_file`

In [13]:
def remove_non_image_file(my_data_dir):
    # extension for raw images only! (a tuple, so more extensions can be added)
    image_extension = ('.tif',)
    # sub-folders whose lower-case name ends with one of these remain after the function runs
    save_directory = ('rawimages',)
    for folder in os.listdir(my_data_dir):
        folder_path = os.path.join(my_data_dir, folder)
        if not os.path.isdir(folder_path):
            continue  # skip loose files stored directly in my_data_dir
        n_image = 0
        n_non_image = 0
        for given_file in os.listdir(folder_path):
            if given_file.lower().endswith(image_extension):
                n_image += 1
                continue
            file_location = os.path.join(folder_path, given_file)
            if os.path.isdir(file_location):
                # names are compared in lower case, so the protected names must be lower case too
                if not given_file.lower().endswith(save_directory):
                    shutil.rmtree(file_location)  # remove other folders
            else:
                os.remove(file_location)  # remove non-image file
            n_non_image += 1
        print(f"Folder: {folder} - has image file", n_image)
        print(f"Folder: {folder} - has non-image file", n_non_image)


In [14]:
remove_non_image_file(my_data_dir = 'inputs')

Folder: RawImages - has image file 32
Folder: RawImages - has non-image file 0


## Split Data into Labels
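Sample images are generated on an automated imaging platform. In this example, reference spheroids are located in column 23 ('alive') and column 24 ('dead'); images from all other wells are labelled 'unknown'. The column number is read from the two characters of the file name that precede the `.tif` extension.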
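
The labelling rule described above can be isolated in a small function and checked on its own. This is a sketch under assumptions: the column number is taken to be the last two characters of the file name before the extension, and the example file names in the usage note are made up. The names `COLUMN_LABELS` and `label_from_filename` are hypothetical.

```python
import os

# Mapping taken from the description above: column 23 -> alive, column 24 -> dead.
COLUMN_LABELS = {"23": "alive", "24": "dead"}


def label_from_filename(file_name):
    """Return 'alive', 'dead' or 'unknown' from the two characters before the extension."""
    stem, _extension = os.path.splitext(file_name)
    return COLUMN_LABELS.get(stem[-2:], "unknown")
```

For instance, a hypothetical file `plate1_B23.tif` would map to `alive` and `plate1_B05.tif` to `unknown`. Using `os.path.splitext` instead of fixed character positions keeps the rule working if an extension has a different length, such as `.tiff`.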

In [23]:
import math
import os
import random
import shutil

# define function
def split_train_validation_test_images(my_data_dir, train_set_ratio, validation_set_ratio, test_set_ratio):

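    # compare with a tolerance, because a floating-point sum can miss 1.0 by a rounding error
    if not math.isclose(train_set_ratio + validation_set_ratio + test_set_ratio, 1.0):
        print("train_set_ratio + validation_set_ratio + test_set_ratio should sum to 1.0")
        return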

    # list the raw image files; the class label is read from each file name below
    # (column 23 means class 1, 'alive'; column 24 means class 2, 'dead')
    fileList = os.listdir(my_data_dir + '/RawImages')
    n = len(fileList)
    ntrain = math.floor(n*train_set_ratio)
    nval = math.floor(n*validation_set_ratio)
    ntest = n - ntrain - nval
    
    # create train, validation and test folders with a sub-folder per class label
    for folder in ['train', 'validation', 'test']:
        for label in ['alive', 'dead']:
            os.makedirs(my_data_dir + '/' + folder + '/' + label, exist_ok=True)
    # images without a known label are only ever placed in the test set
    os.makedirs(my_data_dir + '/test/unknown', exist_ok=True)

    random.shuffle(fileList)
    lblList = []
    print(n)
    count = 1
    for file in fileList:
        # the two characters before the 4-character extension ('.tif') hold the column number
        fileLbl = file[-6:-4]
        if fileLbl == '23':
            lblList.append(1)
            lblFolder = 'alive'
        elif fileLbl == '24':
            lblList.append(2)
            lblFolder = 'dead'
        else:
            lblList.append(0)
            lblFolder = 'unknown'
        
        if lblFolder == 'unknown':
            # unlabelled images always go to the test set
            shutil.move(my_data_dir + '/RawImages' + '/' + file,
                        my_data_dir + '/test/' + lblFolder + '/' + file)
            # take the image out of the test quota; once that quota is used up,
            # shrink the validation quota and then the train quota
            ntest = ntest - 1
            if ntest < 0:
                ntest = 0
                nval = nval - 1

            if nval < 0:
                ntrain = ntrain - 1
        else:
            if count <= ntrain:
                # move a given file to the train set
                shutil.move(my_data_dir + '/RawImages' + '/' + file,
                            my_data_dir + '/train/' + lblFolder + '/' + file)
            elif count <= (ntrain + nval):
                # move a given file to the validation set
                shutil.move(my_data_dir + '/RawImages' + '/' + file,
                            my_data_dir + '/validation/' + lblFolder + '/' + file)
            else:
                # move a given file to the test set
                shutil.move(my_data_dir + '/RawImages' + '/' + file,
                            my_data_dir + '/test/' + lblFolder + '/' + file)
        count += 1
    # LX, LocY = ismember(lblList,[1])
    # print(LX)
    # print(lblList)

In [24]:
# run sample sorting
split_train_validation_test_images(my_data_dir="inputs",
                                   train_set_ratio=0.7,
                                   validation_set_ratio=0.1,
                                   test_set_ratio=0.2
                                   )

0
[]
[]
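
After the split has run, the result can be checked by counting the files in each split and label folder. This is a sketch that assumes the `train`, `validation` and `test` folder layout created by the function above; the helper name `count_images_per_split` is hypothetical.

```python
import os


def count_images_per_split(my_data_dir):
    """Return {(split, label): number_of_files} for every existing split/label folder."""
    counts = {}
    for split in ("train", "validation", "test"):
        split_dir = os.path.join(my_data_dir, split)
        if not os.path.isdir(split_dir):
            continue  # this split has not been created yet
        for label in sorted(os.listdir(split_dir)):
            label_dir = os.path.join(split_dir, label)
            if os.path.isdir(label_dir):
                counts[(split, label)] = len(os.listdir(label_dir))
    return counts
```

Running `count_images_per_split("inputs")` and comparing the totals with the number of raw images is a quick way to confirm that no file was lost, and that a second run of the split (when `inputs/RawImages` is already empty) did not change anything.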


---

# Push files to Repo

* If you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [8]:
import os
try:
    # create here your folder
    # os.makedirs(name='')
    pass  # placeholder so the try block is valid until a folder is created
except Exception as e:
    print(e)