# Capstone Project: Using CNNs to identify portraits of adults versus portraits of youth.
### Notebook 01: Data Organization

Recommendation: Create a separate environment for Keras.


Citations:
* [Getting Started with Keras](https://keras.io/getting_started/)
* [Image Data Loading](https://keras.io/api/data_loading/image/)
* [Image Augmentation Layers](https://keras.io/api/layers/preprocessing_layers/image_augmentation/)
* [How to split folder of images into test/train/validation sets with stratified sampling](https://stackoverflow.com/questions/53074712/how-to-split-folder-of-images-into-test-training-validation-sets-with-stratified)
* [tf.keras.preprocessing.image_dataset_from_directory Value Error: No images found](https://stackoverflow.com/questions/68449103/tf-keras-preprocessing-image-dataset-from-directory-value-error-no-images-found)
* [Access images after tf.keras.utils.image_dataset_from_directory](https://stackoverflow.com/questions/73672773/access-images-after-tf-keras-utils-image-dataset-from-directory)
* [TensorFlow 2.0 Tutorial 01: Basic Image Classification](https://lambdalabs.com/blog/tensorflow-2-0-tutorial-01-image-classification-basics)
* [`shutil` High-level file operations](https://docs.python.org/3/library/shutil.html)
* [Python | os.listdir() method](https://www.geeksforgeeks.org/python-os-listdir-method/)
* [Python | os.path.join() method](https://www.geeksforgeeks.org/python-os-path-join-method/)

In [14]:
# imports
import numpy as np
import os
import shutil
from sklearn.model_selection import train_test_split

#### First Goal: Put the image data into the desired directory structure

I will be using the Keras built-in function `image_dataset_from_directory`, which needs the directories to be structured as:

<code>
data
|
|___train
|      |___class_1
|      |___class_2
|
|___validation
|      |___class_1
|      |___class_2
|
|___test(optional)
       |___class_1
       |___class_2

    
</code>
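
As a rough sketch of how this layout is consumed, each subset folder can be passed to `tf.keras.utils.image_dataset_from_directory`, which infers one label per subfolder. The helper name `load_split`, the 224x224 image size, the batch size, and the relative `data` path are assumptions for illustration, not settings taken from this project.

```python
import os


def load_split(data_dir, subset, image_size=(224, 224), batch_size=32, seed=42):
    """Load one subset folder ('train' or 'validation') as a tf.data.Dataset."""
    # Imported inside the function so the helper can be defined
    # before TensorFlow (which is slow to import) is loaded.
    import tensorflow as tf

    return tf.keras.utils.image_dataset_from_directory(
        os.path.join(data_dir, subset),
        labels="inferred",            # one label per subfolder (class_1, class_2)
        label_mode="binary",          # two classes, so a single 0/1 label per image
        image_size=image_size,        # assumed size; images are resized on load
        batch_size=batch_size,
        shuffle=(subset == "train"),  # shuffle only the training data
        seed=seed,
    )


# Usage, assuming the folders shown above already contain images:
# train_ds = load_split("data", "train")
# val_ds = load_split("data", "validation")
# print(train_ds.class_names)
```

Because labels are inferred from the subfolder names, the `train` and `validation` folders need to use the same class folder names.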

I am starting with two image directories:
1. POR = Portraits of adults
    - adults_dir = "/Users/marta/Documents/POR/"
2. PLP = Portraits of youth
    - youth_dir = "/Users/marta/Documents/PLP/"
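
One thing to watch when reading these source folders: `os.listdir` returns every entry, including hidden files such as `.DS_Store` on macOS, and in arbitrary order. The sketch below is one possible way to keep only image files and sort them so a seeded split is repeatable. The helper name `list_image_files` and the extension list are assumptions, and it is not known whether the folders here contain any non-image files.

```python
import os

# Assumed extension list (the formats Keras documents as supported);
# adjust it to match the actual files.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".bmp", ".gif")


def list_image_files(directory, extensions=IMAGE_EXTENSIONS):
    """Return the sorted filenames in `directory` that look like image files."""
    return sorted(
        name
        for name in os.listdir(directory)
        if not name.startswith(".")                         # skip hidden files
        and os.path.isfile(os.path.join(directory, name))   # skip subfolders
        and name.lower().endswith(extensions)               # keep image extensions
    )


# Usage with the directories listed above:
# adults_filenames = list_image_files("/Users/marta/Documents/POR/")
# youth_filenames = list_image_files("/Users/marta/Documents/PLP/")
```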

In [33]:
# NOTE: These steps are HEAVILY guided by the sources I cited above.

# FIRST: directory paths
dest_dir = "/Users/marta/Documents/image_data/" #destination

adults_dir = "/Users/marta/Documents/POR/" #sources
youth_dir = "/Users/marta/Documents/PLP/" 

# SECOND: create the subset/label directories inside my destination directory
# Loop over subsets 'train' and 'validation'
for subset in ['train', 'validation']:
    
    # Loop over labels 'class_1' and 'class_2'
    for label in ['class_1', 'class_2']:
        
        # Create directories for each subset and label combination
        # The directories are created in the destination directory (dest_dir)
        # 'exist_ok=True' prevents an error if the directory already exists
        os.makedirs(os.path.join(dest_dir, subset, label), exist_ok=True)

# THIRD: train_test_split the data into training and validation sets
adults_filenames = os.listdir(adults_dir)
youth_filenames = os.listdir(youth_dir)
print(f'youth_dir = {len(youth_filenames)}')
print(f'adult_dir = {len(adults_filenames)}')


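# Each class is split separately, so both subsets keep the same class
# proportions without needing the stratify argument.
adults_train, adults_val = train_test_split(adults_filenames, test_size=0.2, random_state=42)
youth_train, youth_val = train_test_split(youth_filenames, test_size=0.2, random_state=42)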
# Assisted by Arishti Ovsepyan
print(f'total train length = {len(adults_train) + len(youth_train)}')
print(f'total validation length = {len(adults_val) + len(youth_val)}')
print(f'total of train + validation = {len(adults_train) + len(youth_train) + len(adults_val) + len(youth_val)}')
print('-'* 20)

# FINALLY, copy images to the appropriate directories
for filename in adults_train:

    # Set the source path by joining adults_dir and the current filename
    src = os.path.join(adults_dir, filename)

    # Set the destination path by joining dest_dir, "train", "class_1",
    # and the current filename
    dest = os.path.join(dest_dir, 'train', 'class_1', filename)

    # Copy the file from the source path to the destination path
    shutil.copy(src, dest)


# same as above, without comments
for filename in adults_val:
    src = os.path.join(adults_dir, filename)
    dest = os.path.join(dest_dir, "validation", "class_1", filename)
    shutil.copy(src, dest)

for filename in youth_train:
    src = os.path.join(youth_dir, filename)
    dest = os.path.join(dest_dir, "train", "class_2", filename)
    shutil.copy(src, dest)

for filename in youth_val:
    src = os.path.join(youth_dir, filename)
    dest = os.path.join(dest_dir, "validation", "class_2", filename)
    shutil.copy(src, dest)

print(f'adults_val = {len(adults_val)}')
print(f'youth_val = {len(youth_val)}')
print(f'total validation = {len(adults_val) + len(youth_val)}')
print('-' * 20)
print(f'adult_train = {len(adults_train)}')
print(f'youth_train = {len(youth_train)}')
print(f'total train = {len(adults_train) + len(youth_train)}')
print('-' * 20)

youth_dir = 5423
adult_dir = 9211
total train length = 11706
total validation length = 2928
total of train + validation = 14634
--------------------
adults_val = 1843
youth_val = 1085
total validation = 2928
--------------------
adult_train = 7368
youth_train = 4338
total train = 11706
--------------------
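
The counts above can be checked by hand. The sketch below assumes the validation count is `test_size * n` rounded up, with the remainder going to training, which is consistent with the numbers printed above. The helper name `split_sizes` is hypothetical.

```python
import math


def split_sizes(n_samples, test_size=0.2):
    """Return (n_train, n_val), rounding the validation count up."""
    n_val = math.ceil(test_size * n_samples)
    return n_samples - n_val, n_val


adults_train_n, adults_val_n = split_sizes(9211)
youth_train_n, youth_val_n = split_sizes(5423)

print(adults_train_n, adults_val_n)  # 7368 1843
print(youth_train_n, youth_val_n)    # 4338 1085

# Share of adult portraits in each subset
print(round(adults_train_n / (adults_train_n + youth_train_n), 3))  # 0.629
print(round(adults_val_n / (adults_val_n + youth_val_n), 3))        # 0.629
```

Adults make up about 63% of both subsets, so splitting each class on its own keeps the class balance the same in training and validation.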


Now the destination directory contains `train` and `validation` folders, each holding images from `class_1` (adults) and `class_2` (youth), split roughly 80/20.
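
The four copy loops share the same pattern, so they could be folded into one helper, and the result can be verified by counting files on disk. This is a minimal sketch, not the code that produced the output above. `copy_files` and `count_files` are hypothetical names.

```python
import os
import shutil


def copy_files(filenames, src_dir, dest_dir):
    """Copy each named file from src_dir into dest_dir, creating dest_dir if needed."""
    os.makedirs(dest_dir, exist_ok=True)
    for filename in filenames:
        shutil.copy(os.path.join(src_dir, filename), os.path.join(dest_dir, filename))


def count_files(root, subsets=("train", "validation"), labels=("class_1", "class_2")):
    """Return {(subset, label): number of entries} for each folder under root."""
    return {
        (subset, label): len(os.listdir(os.path.join(root, subset, label)))
        for subset in subsets
        for label in labels
    }


# Usage with the variables defined earlier in this notebook:
# copy_files(adults_train, adults_dir, os.path.join(dest_dir, "train", "class_1"))
# copy_files(adults_val, adults_dir, os.path.join(dest_dir, "validation", "class_1"))
# copy_files(youth_train, youth_dir, os.path.join(dest_dir, "train", "class_2"))
# copy_files(youth_val, youth_dir, os.path.join(dest_dir, "validation", "class_2"))
# print(count_files(dest_dir))
```

Counting on disk is a useful check because copying never removes files: if the cell is re-run with a different `random_state` or `test_size`, files from the earlier split stay in place and the folder counts will no longer match the list lengths.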